Parallelization with Array

Overview

A brief review of using the #SBATCH --array Slurm argument to restrict the number of jobs running in parallel on Discovery. Arrays are most effective when you have a large number of jobs that all require the same resources and you want to limit how much of a computing resource is used at any one time. This guide follows the same template format from the lotterhos wiki, which reads through and leverages a reference file containing the names of the files you wish to process. For example, this could contain a list of raw sequence files (e.g., fq.gz) on which you wish to perform trimming and QC.

Breaking down the --array argument

The --array argument can take a single integer (e.g., 1) or a range of integers (e.g., 1-32) that the Slurm script will iterate through. Following the range of integers, you can specify how many of these values to process simultaneously using the % operator. For example, if you wish to iterate through the range of integers in batches of 4 you could code: #SBATCH --array=1-32%4

Other options

# A job array with array tasks numbered 1, 2, 5, 19, 27.
#SBATCH --array=1,2,5,19,27

# A job array with array tasks numbered 1, 3, 5 and 7.
#SBATCH --array=1-7:2
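The same option can also be passed on the command line when submitting, which overrides any #SBATCH --array line inside the script (a quick sketch; myscript.sh is a placeholder name):

sbatch --array=1-32%4 myscript.sh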

Using the --array argument in your script

Including the above example (#SBATCH --array=1-32%4) in your Slurm script will effectively run your script 32 times, with 4 simultaneous instances (or jobs) of the script running on Discovery at any one time. However, running the same script with identical parameters (or files) may not be particularly useful. Instead, we probably want to run the same script but apply it to a different set of data or use different parameters. At the very least, we want to be able to create unique outputs for each iteration!

Conveniently, the --array argument creates an environment variable, SLURM_ARRAY_TASK_ID, which in the case above references the specific integer corresponding to the particular instance of the script. For example, the first iteration of your Slurm script will have SLURM_ARRAY_TASK_ID=1.
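As a minimal sketch of what Slurm sets for you, every task of the array below runs the same script but sees different values (SLURM_ARRAY_JOB_ID and SLURM_ARRAY_TASK_ID are both standard Slurm variables):

#!/bin/bash
#SBATCH --array=1-32%4

# Shared by all tasks in this array
echo "Array job ID: ${SLURM_ARRAY_JOB_ID}"
# Unique per task (1 through 32 here)
echo "Task ID: ${SLURM_ARRAY_TASK_ID}"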

This can be leveraged in multiple ways, but one option is using this index to extract specific lines from a reference file. The data on these lines could be anything, but storing file names is a particularly useful approach if you want to apply your script to multiple files.

Short illustration

VARNAME=$(sed "${SLURM_ARRAY_TASK_ID}q;d" path/to/reference/reference.txt)

file1=path/to/data/${VARNAME}   # note: no spaces around = in bash

echo "${file1}"   # double quotes (not single) so the variable expands

Example of reference file:

datafile1.csv
datafile2.csv
datafile3.csv

Example output, collected across the three tasks' .out files:

path/to/data/datafile1.csv
path/to/data/datafile2.csv
path/to/data/datafile3.csv
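Note that sed offers an equivalent way to pull out line N of a file; the example scripts later on this page use this form instead:

VARNAME=$(sed -n "${SLURM_ARRAY_TASK_ID}p" path/to/reference/reference.txt)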

Calculations

The lotterhos partition has 2 nodes, 36 cores per node, and 5 GB of memory per core. A job that uses only a single CPU should not take up more than 5 GB of memory.
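In other words: 2 nodes × 36 cores/node = 72 cores in total, and 36 cores × 5 GB/core = 180 GB of usable memory per node.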

Here is an example of an array that runs up to 70 single-core jobs at a time (the partition has 72 cores in total, but see the note below about leaving a couple free). In the remaining script, not shown, is code that reads the input parameters from a file with 1000 lines (the first line is the header, and each line after that holds parameters); hence, the array starts on line 2 and goes to line 1000. Unless the program is able to cross-talk among nodes (most do not have the architecture to do that), nodes should always be set to 1. This script will still run array tasks across both lotterhos nodes; nodes=1 just indicates that each individual task uses one node.

#SBATCH --partition=lotterhos
#SBATCH --mem=5GB
#SBATCH --nodes=1
#SBATCH --cpus-per-task=1
#SBATCH --array=2-1000%70

1000 is the upper limit that can be requested by the --array command. It is also good to keep the total number of jobs running at a time capped at 70, so that a couple of cores are left open for srun.
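If your parameter file ever has more lines than the array limit allows, one workaround is to add a fixed offset to the task ID and resubmit in chunks (a sketch; OFFSET and params.txt are placeholders, not Slurm settings):

OFFSET=1000   # process lines 1001-2000 on this submission
LINE=$((SLURM_ARRAY_TASK_ID + OFFSET))
PARAMS=$(sed -n "${LINE}p" path/to/params.txt)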

Below is an example of a job that calls a program that can use multiple CPUs. If 32 CPUs are requested, then 32 CPUs × 5 GB/CPU = 160 GB can also be requested for memory. As before, unless the program is able to cross-talk among nodes (most do not have the architecture to do that), nodes should always be set to 1. Since there are only 2 nodes, each with 36 CPUs, we can only run 2 of these jobs at a time and stay within the bounds of the architecture. Hence the --array=2-1000%2 argument.

#SBATCH --partition=lotterhos
#SBATCH --mem=160GB
#SBATCH --nodes=1
#SBATCH --cpus-per-task=32
#SBATCH --array=2-1000%2

Practical example

This script performs Bismark mapping on a list of samples, based on the file names in a reference text file, sample_files.txt. Based on the --array argument, I am performing the mapping on 32 samples, processing 2 samples at a time.

    #!/bin/bash

    #SBATCH --job-name=DNAMethylation_mapping
    #SBATCH --mem=100Gb
    #SBATCH --mail-user=downey-wall.a@northeastern.edu
    #SBATCH --mail-type=FAIL
    #SBATCH --partition=lotterhos
    #SBATCH --nodes=1
    #SBATCH --cpus-per-task=32
    #SBATCH --array=1-32%2
    #SBATCH --output=/work/lotterhos/2018OALarvae_DNAm/slurm_log/DNAm_mapping_%j.out
    #SBATCH --error=/work/lotterhos/2018OALarvae_DNAm/slurm_log/DNAm_mapping_%j.err

    ## Store name of individual sample
    VARNAME=`sed "${SLURM_ARRAY_TASK_ID}q;d" 2018OALarvae_DNAm/RAW/sample_files.txt`

    ## Code for running bismark (DNA methylation sequence mapper)

    ## Extract the basenames of the R1/R2 trimmed read files for this sample
    i=$(ls 2018OALarvae_DNAm/trimmed_files/${VARNAME}_R1*fq.gz | rev | cut -d'/' -f 1 | rev)
    j=$(ls 2018OALarvae_DNAm/trimmed_files/${VARNAME}_R2*fq.gz | rev | cut -d'/' -f 1 | rev)

    # Run Bismark
    bismark --non_directional \
    --parallel 32 \
    --genome oyster_references/haplo_tig_genome \
    --score_min L,0,-0.4 \
    --bowtie2 \
    -1 2018OALarvae_DNAm/trimmed_files/${i} \
    -2 2018OALarvae_DNAm/trimmed_files/${j} \
    -o 2018OALarvae_DNAm/mapped_files
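For reference, sample_files.txt would simply contain one sample prefix per line, matching the start of the trimmed file names (hypothetical example):

sample_A1
sample_A2
sample_B1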

Below are some examples of scripts showing a stepwise progression toward using an array.

Script 1: Running one job as you would on the cluster

In this script, I copied into a batch script the exact same lines of code that I would use to run these processes interactively on the Discovery cluster.

#!/bin/bash
#SBATCH --job-name=Pop1_16220DellyGridss
#SBATCH --mail-user=curtis.lei@northeastern.edu
#SBATCH --partition=lotterhos
#SBATCH --mem=25Gb
#SBATCH --nodes=1
#SBATCH --cpus-per-task=5
#SBATCH --output=/home/curtis.lei/2024Cod_Inversions/slurm_log/Pop1_16220DellyGridss.out
#SBATCH --error=/home/curtis.lei/2024Cod_Inversions/slurm_log/Pop1_16220DellyGridss.err

## Load the necessary modules
module load singularity/3.10.3
module load samtools/1.19.2
echo "module load complete"

## Sort and Index the BAM File in GRIDSS Cod Data folder
cd /home/curtis.lei/gridss/cod_data
cp /work/lotterhos/2020_CodGenomes/labeled_bam_Out/Pop1_16220.f.rg.bam /home/curtis.lei/gridss/cod_data
samtools sort -m 40G Pop1_16220.f.rg.bam > Pop1_16220.sorted.f.rg.bam
samtools index Pop1_16220.sorted.f.rg.bam
echo "sort and index complete"

## Run Delly
singularity run /home/curtis.lei/Containers/delly_v1.2.6.sif delly call -g /home/curtis.lei/gridss/cod_data/GCF_902167405.1_gadMor3.0_genomic.fna /home/curtis.lei/gridss/cod_data/Pop1_16220.sorted.f.rg.bam > /home/curtis.lei/delly/delly.Pop1_16220.vcf
echo "delly complete"

## Run GRIDSS on generated Delly VCF
singularity run /home/curtis.lei/Containers/gridss_2.13.2.sif gridss_extract_overlapping_fragments --targetvcf /home/curtis.lei/delly/delly.Pop1_16220.vcf Pop1_16220.sorted.f.rg.bam
singularity run /home/curtis.lei/Containers/gridss_2.13.2.sif gridss -r GCF_902167405.1_gadMor3.0_genomic.fna -o /home/curtis.lei/gridss/gridssDelly.Pop1_16220.vcf Pop1_16220.sorted.f.rg.bam.targeted.bam
echo "GRIDSS complete"
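Each of these scripts is submitted the same way (assuming the script is saved as, e.g., run_delly_gridss.sh):

sbatch run_delly_gridss.sh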

Script 2: Running one job using variables

In this script, I slightly modified Script 1. Instead of hard-coding the name of the individual's file, I made a variable to define the name of the individual as well as the name of the reference genome.

#!/bin/bash
#SBATCH --job-name=Pop1_16220_Variables
#SBATCH --mail-user=curtis.lei@northeastern.edu
#SBATCH --partition=lotterhos
#SBATCH --mem=25Gb
#SBATCH --nodes=1
#SBATCH --cpus-per-task=4
#SBATCH --output=/home/curtis.lei/2024Cod_Inversions/slurm_log/Pop1_16220_Variables.out
#SBATCH --error=/home/curtis.lei/2024Cod_Inversions/slurm_log/Pop1_16220_Variables.err

## Define Variables
REFERENCE="GCF_902167405.1_gadMor3.0_genomic"
ALPHA_2="Pop1_16220"

## Load the necessary modules
module load singularity/3.10.3
module load samtools/1.19.2
echo "module load complete"

## Sort and Index the BAM File in GRIDSS Cod Data folder
cd /home/curtis.lei/gridss/cod_data
samtools sort -m 40G /work/lotterhos/2020_CodGenomes/labeled_bam_Out/${ALPHA_2}.f.rg.bam > ${ALPHA_2}.sorted.f.rg.bam
samtools index ${ALPHA_2}.sorted.f.rg.bam
echo "sort and index complete"

## Run Delly
singularity run /home/curtis.lei/Containers/delly_v1.2.6.sif delly call -g /home/curtis.lei/gridss/cod_data/${REFERENCE}.fna /home/curtis.lei/gridss/cod_data/${ALPHA_2}.sorted.f.rg.bam > /home/curtis.lei/delly/delly.${ALPHA_2}.vcf
echo "delly complete"

## Run GRIDSS on generated Delly VCF
singularity run /home/curtis.lei/Containers/gridss_2.13.2.sif gridss_extract_overlapping_fragments --targetvcf /home/curtis.lei/delly/delly.${ALPHA_2}.vcf ${ALPHA_2}.sorted.f.rg.bam
singularity run /home/curtis.lei/Containers/gridss_2.13.2.sif gridss -r ${REFERENCE}.fna -o /home/curtis.lei/gridss/gridssDelly.${ALPHA_2}.vcf ${ALPHA_2}.sorted.f.rg.bam.targeted.bam
echo "GRIDSS complete"

Script 3: Running more than one job using a text file with the variable names

In this script, I have added a separate text file containing the names of more than one individual. Rather than defining a single variable, each array task uses SLURM_ARRAY_TASK_ID to read its individual's name from the corresponding line of that file. The %j in the output and error paths expands to each array task's unique job ID, so a separate .out and .err file is made for each job within the array.
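As an aside, a common Slurm convention for array jobs is to use %A (the array's master job ID) and %a (the task ID) in the log paths instead, which makes it obvious which task produced which file (a sketch based on the paths in this script):

#SBATCH --output=/home/curtis.lei/2024Cod_Inversions/slurm_log/Pop1_16220-24_%A_%a.out
#SBATCH --error=/home/curtis.lei/2024Cod_Inversions/slurm_log/Pop1_16220-24_%A_%a.err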

#!/bin/bash
#SBATCH --job-name=Pop1_16220-24
#SBATCH --mail-user=curtis.lei@northeastern.edu
#SBATCH --partition=lotterhos
#SBATCH --mem=25Gb
#SBATCH --nodes=1
#SBATCH --cpus-per-task=4
#SBATCH --array=1-2%2
#SBATCH --output=/home/curtis.lei/2024Cod_Inversions/slurm_log/Pop1_16220-24%j.out
#SBATCH --error=/home/curtis.lei/2024Cod_Inversions/slurm_log/Pop1_16220-24%j.err

## Define Variables
Cod_ID=`sed -n ${SLURM_ARRAY_TASK_ID}p /home/curtis.lei/2024Cod_Inversions/src/Cod_ID_Variables_2-3.txt`
echo ${Cod_ID}
REFERENCE="GCF_902167405.1_gadMor3.0_genomic"

## Load the necessary modules
module load singularity/3.10.3
module load samtools/1.19.2
echo "module load complete"

## Sort and Index the BAM File in GRIDSS Cod Data folder
cd /home/curtis.lei/gridss/cod_data
samtools sort -m 40G /work/lotterhos/2020_CodGenomes/labeled_bam_Out/${Cod_ID}.f.rg.bam > ${Cod_ID}.sorted.f.rg.bam
samtools index ${Cod_ID}.sorted.f.rg.bam
echo "sort and index complete"

## Run Delly
singularity run /home/curtis.lei/Containers/delly_v1.2.6.sif delly call -g /home/curtis.lei/gridss/cod_data/${REFERENCE}.fna /home/curtis.lei/gridss/cod_data/${Cod_ID}.sorted.f.rg.bam > /home/curtis.lei/delly/delly.${Cod_ID}.vcf
echo "delly complete"

## Run GRIDSS on generated Delly VCF
singularity run /home/curtis.lei/Containers/gridss_2.13.2.sif gridss_extract_overlapping_fragments --targetvcf /home/curtis.lei/delly/delly.${Cod_ID}.vcf ${Cod_ID}.sorted.f.rg.bam
singularity run /home/curtis.lei/Containers/gridss_2.13.2.sif gridss -r ${REFERENCE}.fna -o /home/curtis.lei/gridss/gridssDelly.${Cod_ID}.vcf ${Cod_ID}.sorted.f.rg.bam.targeted.bam
echo "GRIDSS complete"

Text file: Cod_ID_Variables_2-3.txt

Pop1_16220
Pop1_16224
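One convenient way to build such a list is to strip the suffix from the BAM file names (a sketch, assuming the .f.rg.bam files live in the labeled_bam_Out directory used above):

ls /work/lotterhos/2020_CodGenomes/labeled_bam_Out/*.f.rg.bam | xargs -n1 basename | sed 's/\.f\.rg\.bam$//' > Cod_ID_Variables_2-3.txt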