Halvade on a local cluster¶
Setup¶
A folder with the required binaries must be present on every node; obtaining the binaries is described here. Similarly, a folder with the required reference files must be available on every node; see here for an overview of the required reference files. Both folders must be accessible at the exact same path on every node. Copy them to every node with scp like this:
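```shell
# the folders might need to be created on the nodes first:
# do this for every node <nodename> in the cluster
ssh nodename "mkdir -p $halvaderef $halvadebin"

# again for every node <nodename> in the cluster;
# the destination is the parent folder, so a folder that already exists
# on the node is filled in place instead of receiving a nested copy
scp -r "$halvaderef" nodename:"$(dirname "$halvaderef")"
scp -r "$halvadebin" nodename:"$(dirname "$halvadebin")"
```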
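On a cluster with more than a few nodes, the per-node commands can be wrapped in a loop. The following is a minimal sketch and not part of Halvade: the node list file (one hostname per line), the helper names `run` and `copy_to_nodes`, and the `DRY_RUN` switch are all assumptions made for illustration. Each folder is copied into its parent directory on the node, so a folder that already exists there is filled in place.

```shell
#!/usr/bin/env bash
# Hypothetical helper, not shipped with Halvade.
# Assumes password-less SSH access to every node listed in the node file.
halvaderef=${halvaderef:-/halvade/ref}
halvadebin=${halvadebin:-/halvade/bin}
DRY_RUN=${DRY_RUN:-1}   # 1 = only print the commands, 0 = execute them

# Print the command in dry-run mode, otherwise execute it.
run() {
  if [ "$DRY_RUN" = 1 ]; then echo "$*"; else "$@"; fi
}

# copy_to_nodes NODEFILE: create and fill both folders on every listed node.
copy_to_nodes() {
  local nodefile=$1 node
  # Read the node list on file descriptor 3 so ssh cannot swallow it via stdin.
  while IFS= read -r node <&3 || [ -n "$node" ]; do
    [ -z "$node" ] && continue   # skip empty lines
    run ssh "$node" "mkdir -p $halvaderef $halvadebin"
    run scp -r "$halvaderef" "$node:$(dirname "$halvaderef")"
    run scp -r "$halvadebin" "$node:$(dirname "$halvadebin")"
  done 3< "$nodefile"
}
```

With the default `DRY_RUN=1`, calling `copy_to_nodes nodes.txt` only prints the commands so they can be reviewed; setting `DRY_RUN=0` before the call runs them.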
Halvade relies on several libraries, which need to be downloaded and placed in the directory from which you will run the halvade.sh script; the required files are described here.
The script to run Halvade can be found in the scripts folder of the git repository or can be downloaded here:
```shell
wget https://bitbucket.org/dries_decap/halvadeforspark/downloads/halvade.sh
```
Run¶
```shell
# SOMATIC
# FASTQ input, a folder with paired FASTQ files per read group
./halvade.sh somatic /halvade/ref/ /halvade/bin/ /halvade/input/tumor/ /halvade/input/normal/ /halvade/output/somatic.vcf
# BAM input, already aligned reads with read groups added
./halvade.sh somatic /halvade/ref/ /halvade/bin/ /halvade/input/tumor.bam /halvade/input/normal.bam /halvade/output/somatic.vcf

# GERMLINE
# FASTQ input, a folder with paired FASTQ files per read group
./halvade.sh germline /halvade/ref/ /halvade/bin/ /halvade/input/germline /halvade/output/germline.vcf
# BAM input, already aligned reads with read groups added
./halvade.sh germline /halvade/ref/ /halvade/bin/ /halvade/input/germline.bam /halvade/output/germline.vcf
```
Input¶
There are several valid inputs that Halvade accepts:
- a directory with paired `fastq|fq(.gz)?` files per read group or unaligned BAM files per read group. The files must have `_1.fastq(.gz)?` or `_1.fq(.gz)?` suffixes for the first file and `_2.fastq(.gz)?` or `_2.fq(.gz)?` for the second.
- a directory which has already been preprocessed, containing a folder per read group and `fastq|fq(.gz)?` files in those folders
- a single aligned BAM file containing all read groups of a sample, with read group information
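Because paired files are matched by the `_1`/`_2` suffixes described above, it can help to verify an input directory before starting a job. The following is a minimal sketch, not part of Halvade: the function name `check_fastq_pairs` is hypothetical, and it assumes that both files of a pair share the same extension.

```shell
#!/usr/bin/env bash
# Hypothetical pre-flight check, not shipped with Halvade.
# check_fastq_pairs DIR: report every *_1 / *_2 FASTQ file whose mate is missing.
# Assumption: both mates use the same extension (e.g. both .fastq.gz).
check_fastq_pairs() {
  local dir=$1 f name base n ext mate status=0
  local re='^(.*)_([12])\.(fastq|fq)(\.gz)?$'
  for f in "$dir"/*; do
    name=${f##*/}
    if [[ $name =~ $re ]]; then
      base=${BASH_REMATCH[1]}
      n=${BASH_REMATCH[2]}
      ext=${BASH_REMATCH[3]}${BASH_REMATCH[4]}
      # The mate of file 1 is file 2 and vice versa.
      mate="${base}_$((3 - n)).${ext}"
      if [ ! -e "$dir/$mate" ]; then
        echo "missing mate for $name: $mate"
        status=1
      fi
    fi
  done
  return "$status"
}
```

Calling `check_fastq_pairs /halvade/input/germline` prints one line per unpaired file and returns a non-zero status if any mate is missing. Files that do not match the suffix pattern, such as BAM files, are ignored.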
Other Options¶
The script supports these options:
- `--exome`: run the exome pipeline
- `--tmpdir <string>`: set the folder for temporary files
- `--germlineSM <string>`: germline sample name
- `--tumorSM <string>`: tumor sample name
- `--normalSM <string>`: normal sample name
- `--partitions <int>`: override the default number of partitions
To override automatically detected resources:
- `--memory <int>`: sets the available memory in GB
- `--cpus <int>`: sets the number of available CPUs
- `--nodes <int>`: sets the number of nodes in the cluster
- `--executor_memory <int>`: sets the memory in MB per executor
- `--executor_cpus <int>`: sets the number of CPUs per executor
Additional Halvade options can be set with the variable HALVADE_OPTS, while extra options for spark-submit can be set with EXTRA_SPARK_OPTIONS:
```shell
HALVADE_OPTS="--variant_caller both"
EXTRA_SPARK_OPTIONS="--master=spark://$ip:7077"
```
An overview of additional Halvade options can be found here.
Expert users can run the script with the --quiet option to show the command that will start a Halvade job, and then change any parameters as required.