Software and Containers on Discovery for Lotterhos Lab
We currently use container environments to run software on Discovery. A container bundles all the dependencies you need to run a program, so in theory it makes getting started a lot easier!
First, read the research computing documentation: https://rc-docs.northeastern.edu/en/latest/containers/singularity.html
Part 0: Running an installed container
- You will need read access to /work/lotterhos/Containers/
cd /work/lotterhos/Containers
Go to the shared folder in the Lotterhos Lab Storage on Discovery where we keep the containers.
ls
Look at the list of containers. The containers will end with a .sif extension, which is a Singularity image.
To run a container from anywhere, type the following, replacing <NAME> with the container name:
singularity run /work/lotterhos/Containers/<NAME>.sif
For example: singularity run /work/lotterhos/Containers/lumpy_0.2.13.sif
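If you want to run a specific command inside the container rather than its default entrypoint, Singularity's exec subcommand does that (you may need to module load singularity first, as in Part 1):
singularity exec /work/lotterhos/Containers/lumpy_0.2.13.sif <command>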
Part 1: Find a container on DockerHub for a program you want to run, and create the disk image in the Lotterhos Lab Containers folder on Discovery
- You will need write access to /work/lotterhos/Containers/ on Discovery.
Go to DockerHub: https://hub.docker.com
Use the search to find the program you need a container for. For example, you could search for samtools. There are a lot of results, but this one looks promising: https://hub.docker.com/r/biocontainers/samtools - check the version of the program against the latest release. You may not be able to find out the version until you install it.
Once you have the path to the container, on Discovery type the following commands:
srun --constraint=ib -p lotterhos --pty /bin/bash
This takes you to a compute node on the lotterhos partition to run the code. The option --constraint=ib requests InfiniBand, which gives a faster network connection - you may need it to download larger containers.
module load singularity/3.10.3
Singularity is the program we will use to download the container from the web and run it.
cd /work/lotterhos/Containers
Go to the shared folder in the Lotterhos Lab Storage on Discovery where we keep the containers
export SINGULARITY_CACHEDIR=$(pwd)/cache
Set the cache directory. Singularity will cache files here when building the container.
export SINGULARITY_TMPDIR=$(pwd)/tmp
Set the tmp directory. Singularity will store temporary files here when building the container.
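If the cache and tmp subfolders don't exist yet, create them first (Part 3 below does the same thing in your own folder):
mkdir -p cache tmp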
singularity pull docker://dceoy/lumpy:latest
Pull the container from DockerHub. In this case, we pull the program lumpy from the user dceoy, with the tag latest. The .sif is the image (container) file that Singularity creates after the pull; here it will be named lumpy_latest.sif.
ls
When you list files, you should see lumpy_latest.sif in the file list.
In practice, it's not a good idea to save everything with the name latest, so it's good practice to rename the file with the version:
mv lumpy_latest.sif lumpy_0.2.13.sif
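As a side note, singularity pull also lets you name the output file and pin a specific tag in one step. A sketch - this assumes a 0.2.13 tag actually exists for this image on DockerHub, which you should check:
singularity pull lumpy_0.2.13.sif docker://dceoy/lumpy:0.2.13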
And to run it from anywhere type:
singularity run /work/lotterhos/Containers/lumpy_0.2.13.sif
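For real analyses, you will usually want to run the container inside a batch job rather than interactively. Here is a minimal sbatch sketch - the time limit and the lumpyexpress call are assumptions you would replace with your own resources and command:

#!/bin/bash
#SBATCH --partition=lotterhos
#SBATCH --time=04:00:00
#SBATCH --job-name=container_job

module load singularity/3.10.3
# Replace the line below with the command you want to run inside the container
singularity exec /work/lotterhos/Containers/lumpy_0.2.13.sif lumpyexpress --help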
Part 2: Build your own container and run the disk image on your laptop or in the Lotterhos Lab Containers folder on Discovery
- You will need write access to /work/lotterhos/Containers/ on Discovery.
Step 1: Create the Dockerfile in a text editor on your laptop.
Create a Dockerfile and test it in Podman Desktop.
The Dockerfile should just be named Dockerfile and nothing else - so make sure to have a good subdirectory organization and README files to avoid confusion.
The container basically builds an operating system from scratch - so the first several lines of the Dockerfile have to do exactly that. Serena Caplins currently recommends the following header - everything down to the # Install FastQC line, which you would replace with the installation instructions for the programs you want to include.
Here is the text you can use to create and test your own Dockerfile. Make sure to give it the name Dockerfile with no extension!
FROM ubuntu:22.04
ENV TERM linux
ARG DEBIAN_FRONTEND=noninteractive
# Install base compilers and packages
RUN apt-get update && \
    apt-get install -y build-essential \
    software-properties-common \
    xsltproc \
    autoconf \
    automake \
    autotools-dev \
    gfortran \
    cmake \
    protobuf-compiler \
    make \
    gcc \
    wget \
    git \
    libc-dev \
    python3-dev \
    python3-pip \
    csh \
    libbz2-dev \
    perl \
    docbook-xsl \
    docbook-xml \
    zlib1g-dev \
    libeigen3-dev \
    unzip \
    vim \
    nano \
    pkg-config \
    libfreetype6-dev \
    libpng-dev \
    python3-matplotlib \
    default-jre \
    bash \
    libboost-dev \
    liblzma-dev \
    libcurl4-openssl-dev \
    libssl-dev \
    gdebi-core
# Install FastQC
WORKDIR /opt
RUN wget https://www.bioinformatics.babraham.ac.uk/projects/fastqc/fastqc_v0.12.1.zip && \
unzip fastqc_v0.12.1.zip && \
chmod 777 /opt/FastQC/fastqc && rm /opt/fastqc_v0.12.1.zip
# Fix language warning in FastQC
RUN apt-get update && DEBIAN_FRONTEND=noninteractive apt-get install -y locales && \
sed -i -e 's/# en_US.UTF-8 UTF-8/en_US.UTF-8 UTF-8/' /etc/locale.gen && \
dpkg-reconfigure --frontend=noninteractive locales && \
update-locale LANG=en_US.UTF-8
ENV LANG en_US.UTF-8
ENV PATH /opt:/opt/bcftools-1.19:/opt/plink2:/opt/console:/opt/pcangsd:/opt/angsd:/opt/angsd/misc:/opt/MultiQC:/opt/miniconda/bin:/opt/FastQC:/opt/vcftools/src/cpp:/opt/migrate-newest/src:/opt/ngsRelate:${PATH}
COPY startup.sh /opt
RUN chmod 777 /opt/startup.sh
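Note that the COPY startup.sh /opt line requires a file named startup.sh in the same folder as the Dockerfile when you build. If you don't already have one, a minimal placeholder could look like this (the contents here are just an illustration, not the lab's actual script):

#!/bin/bash
# startup.sh - placeholder startup script; replace with your own setup commands
echo "Container ready."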
Notes
Some notes on how to take install instructions from a website and customize them for a container. For example, let's take the instructions from the lumpy-sv website:
git clone --recursive git@github.com:arq5x/lumpy-sv.git
cd lumpy-sv
make
cp bin/* /usr/local/bin/.
You would need to change this to:
RUN git clone --recursive https://github.com/arq5x/lumpy-sv.git /opt/lumpy-sv && \
cd /opt/lumpy-sv && make && cp bin/* /usr/local/bin/
Make sure to change the git@github.com: prefix to https:// because the former will require you to set up SSH key authentication.
When the instructions say to export PATH="$PATH:/opt/bin/htslib-1.9", you would convert this to:
ENV PATH="${PATH}:/opt/bin/htslib-1.9"
OR ENV PATH="/opt/bin/htslib-1.9:${PATH}"
Whether ${PATH} goes at the beginning or end only matters if the same program exists in more than one directory (earlier entries win); otherwise it is a matter of preference or convention. It does need to appear somewhere, though, to include the other paths already stored in ${PATH} - that way the ENV command appends to ${PATH} instead of overwriting it.
For setting channels in conda, I personally prefer not to set them in my .condarc and instead specify them each time I run conda install. This is mainly because I don't want any channel priority in case the same package is found in multiple channels, but it may just be a personal preference.
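In a Dockerfile, this approach means passing the channels on the install command itself rather than writing a .condarc. A sketch, assuming Miniconda is already installed in the image (the example PATH above includes /opt/miniconda/bin):

RUN conda install -y -c bioconda -c conda-forge samtools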
Step 2: Create the container in Podman Desktop on your laptop.
When you open Podman Desktop, make sure that you have containers.rc.northeastern.edu listed in your registries. To add it, just use the address and log in with your NU ID and password.
Then, go to the Container icon, click +Create, and choose Containerfile or Dockerfile.
If the program proceeds normally, you should be prompted to specify the path to the Dockerfile that you just created on your laptop (first entry in the figure below) and the Image Name. VERY IMPORTANT: The Image Name needs a path to our containers on research computing's Harbor site: containers.rc.northeastern.edu/lotterhos/<ImageName:tag>. For example: containers.rc.northeastern.edu/lotterhos/test:3.
The ImageName is a description, while the tag is the version of the container. They are separated by a colon : and the program expects this naming system. By keeping the ImageName the same and updating the tag every time you edit the container, the program can rebuild faster each time from the cache for ImageName. Make sure to click Intel and AMD x86_64, because that is what you'll need to run the container on Discovery!
When it is done, you will see at the bottom of the output: Successfully built <code>
Successfully tagged <ImageName:tag>
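If you prefer the command line to the Podman Desktop GUI, an equivalent build is roughly the following, run from the folder containing the Dockerfile and using the example image name above:

podman build --platform linux/amd64 -t containers.rc.northeastern.edu/lotterhos/test:3 .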
Troubleshooting (if you are unable to create a new container): If you are having trouble creating a new container, it may be because your Podman Machine isn't starting - it may need more memory or space than your computer can provide. In this case, go to Settings-->Resources-->Podman-->Create new and select about half of your CPUs and memory, and a small amount of disk size for your container. If you ask for too much, it will just wait forever for those resources to be allocated!
Step 3: Test the container in Podman Desktop on your laptop.
Next, go to the "Images" tab. The container that you just created is at the top. You will see the <ImageName> with the <tag> below. Here is an example for the test image with the tag 3 that I just created:
Click the “Play” button next to the container you want to test. It will create a container. You can leave all the fields blank. Click “Start Container”. It will open the command line. Try running commands for the programs you installed.
To test fastqc, which is the program in the example Dockerfile above, type fastqc.
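You can run the same kind of check from the command line. A sketch using the example image and tag above - fastqc --version is just one quick way to confirm the program is on the PATH:

podman run --rm -it containers.rc.northeastern.edu/lotterhos/test:3 fastqc --version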
Step 4: Repeat Steps 1-3 until you finalize the container
Repeat Steps 1-3 until you finalize the container - edit the Dockerfile, remake the image, and create a test container from the image in Podman Desktop. It is an iterative process.
Step 5: Push the image to containers.rc.northeastern.edu/lotterhos
From the Image tab, click the vertical ... and select Push Image. This will push it to our Harbor site.
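If you prefer the command line, the equivalent is to log in to the registry once and then push, using the example image name from Step 2:

podman login containers.rc.northeastern.edu
podman push containers.rc.northeastern.edu/lotterhos/test:3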
Then you would log into Discovery and type the following:
srun --pty bash
module load singularity/3.10.3
cd /work/lotterhos/Containers
export SINGULARITY_CACHEDIR=$(pwd)/cache
export SINGULARITY_TMPDIR=$(pwd)/tmp
singularity pull --docker-login docker://containers.rc.northeastern.edu/lotterhos/<ImageName:version>
ls
When you list files, you should see the <ImageName>.sif in the file list. In practice, it's not a good idea to save everything with the name latest, so if necessary rename it with the mv command and give it a version.
And to run it from anywhere type:
singularity run /work/lotterhos/Containers/<ImageName>.sif
Part 3: Create the disk image in your own Containers folder on Discovery
In your home, scratch, or work directory, create your own folder for Containers with subfolders for cache and tmp:
mkdir Containers
cd Containers
mkdir cache
mkdir tmp
Then, you basically follow all the steps above, but replace /work/lotterhos/Containers with the path to this folder.
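For example, if you made the Containers folder in your home directory, the pull steps from Part 1 become:

srun --pty bash
module load singularity/3.10.3
cd ~/Containers
export SINGULARITY_CACHEDIR=$(pwd)/cache
export SINGULARITY_TMPDIR=$(pwd)/tmp
singularity pull docker://dceoy/lumpy:latest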
**Outdated: Conda environments**
We use conda environments to load software on Discovery.
See this site for more information on how to load environments and update the yml files.
If you create a yml for your analysis, be sure to talk with Dr. L about sharing it with the group.
Shared Lotterhos Group Environments with Research Computing
lotterhos_R.yml
lotterhos_RADseq.yml
lotterhos_RNAseq.yml
lotterhos_aligners.yml
lotterhos_gatk.yml
lotterhos_methylation.yml
lotterhos_popgen.yml
lotterhos_utils.yml
lotterhos_variantCallers.yml
Use conda deactivate to change conda environments.
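For example, to build and enter one of the shared environments from its yml file - assuming the environment name inside the yml matches the file name, which you should verify:

conda env create -f lotterhos_popgen.yml
conda activate lotterhos_popgen
conda deactivate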
Information about outdated modules
When we first started using Discovery, we had modules set up for our research.
We have a software module on Discovery for the lab's research.
Some programs are loaded in modules, and others are loaded in conda environments. If you are working in a conda environment, it won't detect programs loaded in the module outside that environment.
One conda environment is lotterhos-py38, which runs Python 3.8. Note that many bioinformatics programs are not updated for this environment. SLiM 3.3.2 is in this environment.
Another conda environment is ddocent2, which runs Python 3.7 and all of the dDocent dependencies:
module show lotterhos/2020-08-24
module load lotterhos/2020-08-24
source activate ddocent2
conda list
Use the module avail command to show a list of the most currently available software on Discovery.
sratoolkit/2.10.8 was added as a separate module that was then added to our module, while ddocent-2.7.8 was added to the conda environment "lotterhos-py38" within the anaconda3/2020-03 module.
Old modules: lotterhos/2019-11-15, lotterhos/2020-03-18, lotterhos/2020-07-21
- In the future we hope to work on Docker containers for loading R packages and developing containers for specific projects. A container allows you to create the same environment that you were using when you analyzed the data.
Packrat environments for R
Packrat environments can be great for use on OOD or Discovery.