# Parallelization with Array

## Overview
A brief review of using the `#SBATCH --array` slurm argument to restrict the number of jobs running in parallel on Discovery. Arrays are most effective when you have a large number of jobs that all require the same resources and you want to limit the amount of computing resource used at any particular time. This guide follows the same template format from the lotterhos wiki, which reads through and leverages a reference file containing file names corresponding to the files you wish to process. For example, this could contain a list of raw sequence files (e.g., `fq.gz`) that you wish to trim and QC.
## Breaking down the `--array` argument
The `--array` argument can take a single integer, `1`, or a range of integers, `1-32`, that the slurm script will iterate through. Following the range of integers, you specify how many of these values you wish to process simultaneously using the `%` operator. For example, if you wish to iterate through the range of integers in batches of 4, you could code `#SBATCH --array=1-32%4`.
### Other options

```bash
# A job array with array tasks numbered 1, 2, 5, 19, 27.
#SBATCH --array=1,2,5,19,27

# A job array with array tasks numbered 1, 3, 5 and 7.
#SBATCH --array=1-7:2
```
## Using the `--array` argument in your script
Including the above example (`#SBATCH --array=1-32%4`) in your slurm script will effectively run your slurm script 32 times, with 4 simultaneous instances (or jobs) running at a time on Discovery. However, running the same script with identical parameters (or files) may not be particularly useful. Instead, we probably want to run the same script but apply it to a different set of data or use different parameters. At the very least, we want to be able to create unique outputs for each iteration!
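One easy way to get unique log files is slurm's filename patterns: `%A` expands to the array's master job ID and `%a` to the task index. A minimal sketch (the `logs/` path and job name are placeholders):

```bash
#SBATCH --array=1-32%4
# %A = overall array job ID, %a = this task's index, so each of the
# 32 tasks writes to its own pair of log files.
#SBATCH --output=logs/myjob_%A_%a.out
#SBATCH --error=logs/myjob_%A_%a.err
```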
Conveniently, the `--array` argument creates a unix environment variable, `SLURM_ARRAY_TASK_ID`, which references the specific integer that corresponds to the particular instance of the script. For example, the first iteration of your slurm script will have `SLURM_ARRAY_TASK_ID=1`.
This could be leveraged in multiple ways, but one option is to use this index to extract specific lines from a reference file. The data on these lines could be anything, but file names are a particularly useful choice if you want to apply your script to multiple files.
### Short illustration

```bash
# Pull the line of the reference file that matches this task's index
VARNAME=$(sed "${SLURM_ARRAY_TASK_ID}q;d" path/to/reference/reference.txt)
file1=path/to/data/${VARNAME}
echo "${file1}"
```
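The `"${SLURM_ARRAY_TASK_ID}q;d"` sed program prints the matching line and quits immediately; `sed -n "${SLURM_ARRAY_TASK_ID}p"`, used in Script 3 below, grabs the same line and is arguably more readable.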
Example of reference file:

```
datafile1.csv
datafile2.csv
datafile3.csv
```
Example output in the `.out` file(s):

```
path/to/data/datafile1.csv
path/to/data/datafile2.csv
path/to/data/datafile3.csv
```
## Calculations
The lotterhos partition has 2 nodes, 36 cores per node, and 5GB per core. A job that only uses a single cpu should not take up more than 5GB of memory.
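In other words, 2 nodes × 36 cores/node = 72 cores total across the partition, and 36 cores × 5GB/core = 180GB of usable memory per node.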
Here is an example of an array that runs up to 70 jobs at a time, each job running on one core. In the remaining script, not shown, is code that reads the input parameters from a file with 1000 lines (the first line is the header, and each line after that holds parameters). Hence, the array starts on line 2 and goes to line 1000. Unless the program is able to cross-talk among nodes (most do not have the architecture to do that), nodes should always be set to 1. However, this script will still run array tasks across both lotterhos nodes (the `nodes=1` just indicates that each task will use one node).
```bash
#SBATCH --partition=lotterhos
#SBATCH --mem=5GB
#SBATCH --nodes=1
#SBATCH --cpus-per-task=1
#SBATCH --array=2-1000%70
```
1000 is the upper limit that can be requested by the `--array` argument. It's also good to keep the total number of jobs running at a time capped at 70, so that a couple of cores are left open for `srun`.
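For completeness, here is a sketch of how the not-shown part of such a script might read its parameters. The file name `params.txt` and its column layout are invented for illustration:

```bash
# Line 1 of the parameter file is the header, which is why the
# array index starts at 2 rather than 1.
PARAMS=$(sed "${SLURM_ARRAY_TASK_ID}q;d" params.txt)

# Split the whitespace-delimited line into named variables.
read -r SEED MU POPSIZE <<< "${PARAMS}"
echo "Running replicate: seed=${SEED} mu=${MU} N=${POPSIZE}"
```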
Below is an example of a job that calls a program that can use multiple CPUs. If 32 CPUs are requested, then 32 CPUs × 5GB/CPU = 160GB can also be requested for memory. As above, unless the program is able to cross-talk among nodes, nodes should be set to 1. Since there are only 2 nodes, each with 36 CPUs, we can only run 2 such jobs at a time and stay within the bounds of the architecture. Hence the `--array=2-1000%2` argument:
```bash
#SBATCH --partition=lotterhos
#SBATCH --mem=160GB
#SBATCH --nodes=1
#SBATCH --cpus-per-task=32
#SBATCH --array=2-1000%2
```
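Rather than hard-coding the thread count in the script body, you can read it from `SLURM_CPUS_PER_TASK`, which slurm sets to match `--cpus-per-task`. A sketch, where `my_program` and `--threads` stand in for whatever tool and flag you actually use:

```bash
# Keep the program's thread count in sync with the allocation so the
# two cannot drift apart when the #SBATCH lines are edited.
my_program --threads "${SLURM_CPUS_PER_TASK}" ...
```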
## Practical example

This script performs bismark mapping (`bismark`) on a list of files based on the file names in a reference text file, `sample_files.txt`. Based on the `--array` argument, I am performing the mapping on 32 samples, processing 2 samples at a time:
```bash
#!/bin/bash
#SBATCH --job-name=DNAMethylation_mapping
#SBATCH --mem=100Gb
#SBATCH --mail-user=downey-wall.a@northeastern.edu
#SBATCH --mail-type=FAIL
#SBATCH --partition=lotterhos
#SBATCH --nodes=1
#SBATCH --cpus-per-task=32
#SBATCH --array=1-32%2
#SBATCH --output=/work/lotterhos/2018OALarvae_DNAm/slurm_log/DNAm_mapping_%j.out
#SBATCH --error=/work/lotterhos/2018OALarvae_DNAm/slurm_log/DNAm_mapping_%j.err

## Store name of individual sample
VARNAME=$(sed "${SLURM_ARRAY_TASK_ID}q;d" 2018OALarvae_DNAm/RAW/sample_files.txt)

## Code for running bismark (DNA methylation sequence mapper)
i=$(ls 2018OALarvae_DNAm/trimmed_files/${VARNAME}_R1*fq.gz | rev | cut -d'/' -f 1 | rev)
j=$(ls 2018OALarvae_DNAm/trimmed_files/${VARNAME}_R2*fq.gz | rev | cut -d'/' -f 1 | rev)

# Run Bismark
bismark --non_directional \
    --parallel 32 \
    --genome oyster_references/haplo_tig_genome \
    --score_min L,0,-0.4 \
    --bowtie2 \
    -1 2018OALarvae_DNAm/trimmed_files/${i} \
    -2 2018OALarvae_DNAm/trimmed_files/${j} \
    -o 2018OALarvae_DNAm/mapped_files
```
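After saving the script, submission and monitoring use standard slurm commands; a sketch (the script file name is hypothetical):

```bash
sbatch dna_methylation_mapping.sh  # submits all 32 tasks at once
squeue -u $USER                    # pending tasks appear as one collapsed array entry
sacct -j <jobid> --format=JobID,State,Elapsed  # per-task accounting once tasks finish
```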
Below are some examples of scripts written to show the progression toward using an array.
### Script 1: Running one job as you would in the cluster

In this script, I copied the exact same lines of code that I would use to run these processes in the Discovery cluster into a batch script.
```bash
#!/bin/bash
#SBATCH --job-name=Pop1_16220DellyGridss
#SBATCH --mail-user=curtis.lei@northeastern.edu
#SBATCH --partition=lotterhos
#SBATCH --mem=25Gb
#SBATCH --nodes=1
#SBATCH --cpus-per-task=5
#SBATCH --output=/home/curtis.lei/2024Cod_Inversions/slurm_log/Pop1_16220DellyGridss.out
#SBATCH --error=/home/curtis.lei/2024Cod_Inversions/slurm_log/Pop1_16220DellyGridss.err

## Load the necessary modules
module load singularity/3.10.3
module load samtools/1.19.2
echo "module load complete"

## Sort and Index the BAM File in GRIDSS Cod Data folder
cd /home/curtis.lei/gridss/cod_data
cp /work/lotterhos/2020_CodGenomes/labeled_bam_Out/Pop1_16220.f.rg.bam /home/curtis.lei/gridss/cod_data
samtools sort -m 40G Pop1_16220.f.rg.bam > Pop1_16220.sorted.f.rg.bam
samtools index Pop1_16220.sorted.f.rg.bam
echo "sort and index complete"

## Run Delly
singularity run /home/curtis.lei/Containers/delly_v1.2.6.sif delly call -g /home/curtis.lei/gridss/cod_data/GCF_902167405.1_gadMor3.0_genomic.fna /home/curtis.lei/gridss/cod_data/Pop1_16220.sorted.f.rg.bam > /home/curtis.lei/delly/delly.Pop1_16220.vcf
echo "delly complete"

## Run GRIDSS on generated Delly VCF
singularity run /home/curtis.lei/Containers/gridss_2.13.2.sif gridss_extract_overlapping_fragments --targetvcf /home/curtis.lei/delly/delly.Pop1_16220.vcf Pop1_16220.sorted.f.rg.bam
singularity run /home/curtis.lei/Containers/gridss_2.13.2.sif gridss -r GCF_902167405.1_gadMor3.0_genomic.fna -o /home/curtis.lei/gridss/gridssDelly.Pop1_16220.vcf Pop1_16220.sorted.f.rg.bam.targeted.bam
echo "GRIDSS complete"
```
### Script 2: Running one job using variables

In this script, I slightly modified Script 1. Instead of hard-coding the name of the individual's file, I made variables to define the name of the individual and the name of the reference genome.
```bash
#!/bin/bash
#SBATCH --job-name=Pop1_16220_Variables
#SBATCH --mail-user=curtis.lei@northeastern.edu
#SBATCH --partition=lotterhos
#SBATCH --mem=25Gb
#SBATCH --nodes=1
#SBATCH --cpus-per-task=4
#SBATCH --output=/home/curtis.lei/2024Cod_Inversions/slurm_log/Pop1_16220_Variables.out
#SBATCH --error=/home/curtis.lei/2024Cod_Inversions/slurm_log/Pop1_16220_Variables.err

## Define Variables
REFERENCE="GCF_902167405.1_gadMor3.0_genomic"
ALPHA_2="Pop1_16220"

## Load the necessary modules
module load singularity/3.10.3
module load samtools/1.19.2
echo "module load complete"

## Sort and Index the BAM File in GRIDSS Cod Data folder
cd /home/curtis.lei/gridss/cod_data
samtools sort -m 40G /work/lotterhos/2020_CodGenomes/labeled_bam_Out/${ALPHA_2}.f.rg.bam > ${ALPHA_2}.sorted.f.rg.bam
samtools index ${ALPHA_2}.sorted.f.rg.bam
echo "sort and index complete"

## Run Delly
singularity run /home/curtis.lei/Containers/delly_v1.2.6.sif delly call -g /home/curtis.lei/gridss/cod_data/${REFERENCE}.fna /home/curtis.lei/gridss/cod_data/${ALPHA_2}.sorted.f.rg.bam > /home/curtis.lei/delly/delly.${ALPHA_2}.vcf
echo "delly complete"

## Run GRIDSS on generated Delly VCF
singularity run /home/curtis.lei/Containers/gridss_2.13.2.sif gridss_extract_overlapping_fragments --targetvcf /home/curtis.lei/delly/delly.${ALPHA_2}.vcf ${ALPHA_2}.sorted.f.rg.bam
singularity run /home/curtis.lei/Containers/gridss_2.13.2.sif gridss -r ${REFERENCE}.fna -o /home/curtis.lei/gridss/gridssDelly.${ALPHA_2}.vcf ${ALPHA_2}.sorted.f.rg.bam.targeted.bam
echo "GRIDSS complete"
```
### Script 3: Running more than one job using a text file with the variable names

In this script, I have added a separate text file containing the names of more than one individual. Rather than hard-coding a single name, each array task pulls its own line from that file with `sed`. The `%j` in the output and error paths allows a separate log file to be made for each job within the array.
```bash
#!/bin/bash
#SBATCH --job-name=Pop1_16220-24
#SBATCH --mail-user=curtis.lei@northeastern.edu
#SBATCH --partition=lotterhos
#SBATCH --mem=25Gb
#SBATCH --nodes=1
#SBATCH --cpus-per-task=4
#SBATCH --array=1-2%2
#SBATCH --output=/home/curtis.lei/2024Cod_Inversions/slurm_log/Pop1_16220-24%j.out
#SBATCH --error=/home/curtis.lei/2024Cod_Inversions/slurm_log/Pop1_16220-24%j.err

## Define Variables
Cod_ID=$(sed -n ${SLURM_ARRAY_TASK_ID}p /home/curtis.lei/2024Cod_Inversions/src/Cod_ID_Variables_2-3.txt)
echo ${Cod_ID}
REFERENCE="GCF_902167405.1_gadMor3.0_genomic"

## Load the necessary modules
module load singularity/3.10.3
module load samtools/1.19.2
echo "module load complete"

## Sort and Index the BAM File in GRIDSS Cod Data folder
cd /home/curtis.lei/gridss/cod_data
samtools sort -m 40G /work/lotterhos/2020_CodGenomes/labeled_bam_Out/${Cod_ID}.f.rg.bam > ${Cod_ID}.sorted.f.rg.bam
samtools index ${Cod_ID}.sorted.f.rg.bam
echo "sort and index complete"

## Run Delly
singularity run /home/curtis.lei/Containers/delly_v1.2.6.sif delly call -g /home/curtis.lei/gridss/cod_data/${REFERENCE}.fna /home/curtis.lei/gridss/cod_data/${Cod_ID}.sorted.f.rg.bam > /home/curtis.lei/delly/delly.${Cod_ID}.vcf
echo "delly complete"

## Run GRIDSS on generated Delly VCF
singularity run /home/curtis.lei/Containers/gridss_2.13.2.sif gridss_extract_overlapping_fragments --targetvcf /home/curtis.lei/delly/delly.${Cod_ID}.vcf ${Cod_ID}.sorted.f.rg.bam
singularity run /home/curtis.lei/Containers/gridss_2.13.2.sif gridss -r ${REFERENCE}.fna -o /home/curtis.lei/gridss/gridssDelly.${Cod_ID}.vcf ${Cod_ID}.sorted.f.rg.bam.targeted.bam
echo "GRIDSS complete"
```
Text file: `Cod_ID_Variables_2-3.txt`

```
Pop1_16220
Pop1_16224
```
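If the list of individuals grows, the array range can be derived from the file rather than hard-coded, since an `--array` given on the `sbatch` command line takes precedence over the value in the script. A sketch (the submission script name is hypothetical):

```bash
# One array task per line of the ID file, at most 2 running at once.
N=$(wc -l < /home/curtis.lei/2024Cod_Inversions/src/Cod_ID_Variables_2-3.txt)
sbatch --array=1-${N}%2 run_delly_gridss.sh
```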