Halvade on a local cluster

Setup

A folder with the required binaries must be present on every node; getting the binaries is described here. Similarly, a folder with the required reference files must be available on every node; see here for an overview of the required reference files. Both folders must be located at the exact same path on every node. Copy them to every node with scp, for example:

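# the parent folder might need to be created on the nodes first:
# do this for every node <nodename> in the cluster
ssh nodename "mkdir -p $(dirname $halvaderef) $(dirname $halvadebin)"

# again for every node <nodename> in the cluster
# (copy into the parent folder so each folder ends up at the same path, not nested inside itself)
scp -r $halvaderef nodename:$(dirname $halvaderef)/
scp -r $halvadebin nodename:$(dirname $halvadebin)/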
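
The copy steps above can be wrapped in a loop over all nodes. The following is a minimal sketch, not part of Halvade: it assumes a hypothetical file nodes.txt with one hostname per line, passwordless SSH to every node, and that halvaderef and halvadebin hold absolute paths (the defaults below are placeholders). The helper names run and distribute are illustrative. With DRY_RUN=1, the default here, it only prints the commands it would execute.

```shell
#!/usr/bin/env bash
# Sketch: distribute the reference and binary folders to every node.
# Assumptions (not from the Halvade docs): the node file lists one hostname
# per line, and $halvaderef / $halvadebin hold absolute paths.
halvaderef="${halvaderef:-/halvade/ref}"
halvadebin="${halvadebin:-/halvade/bin}"
DRY_RUN="${DRY_RUN:-1}"

run() {
  # Print the command in dry-run mode, execute it otherwise.
  if [ "$DRY_RUN" = "1" ]; then
    echo "$*"
  else
    "$@"
  fi
}

distribute() {
  local nodefile="$1" node dir parent
  # Read the node list on file descriptor 3 so ssh cannot consume it via stdin.
  while IFS= read -r node <&3; do
    [ -z "$node" ] && continue            # skip empty lines
    for dir in "$halvaderef" "$halvadebin"; do
      parent="$(dirname "$dir")"
      # create the parent folder remotely, then copy the folder into it
      run ssh "$node" "mkdir -p '$parent'"
      run scp -r "$dir" "$node:$parent/"
    done
  done 3< "$nodefile"
}
```

Calling `distribute nodes.txt` prints the commands for review; `DRY_RUN=0 distribute nodes.txt` would actually run them.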

Halvade relies on several libraries, which need to be downloaded and placed in the folder from which you will run the halvade.sh script; the required files are described here. The script to run Halvade can be found in the scripts folder of the git repository or can be downloaded here:

wget https://bitbucket.org/dries_decap/halvadeforspark/downloads/halvade.sh

Hadoop YARN and Apache Spark

Hadoop YARN is used as the resource manager to run Halvade on Spark because the overhead memory of each executor must be configurable, which is not possible in Spark standalone mode. Setting up Hadoop YARN is described here and setting up Spark is described here.

Run

# SOMATIC
# FASTQ input, a folder with paired FASTQ files per read group
./halvade.sh somatic /halvade/ref/ /halvade/bin/ /halvade/input/tumor/ /halvade/input/normal/ /halvade/output/somatic.vcf
# BAM input, already aligned reads with read groups added
./halvade.sh somatic /halvade/ref/ /halvade/bin/ /halvade/input/tumor.bam /halvade/input/normal.bam /halvade/output/somatic.vcf

# GERMLINE
# FASTQ input, a folder with paired FASTQ files per read group
./halvade.sh germline /halvade/ref/ /halvade/bin/ /halvade/input/germline /halvade/output/germline.vcf
# BAM input, already aligned reads with read groups added
./halvade.sh germline /halvade/ref/ /halvade/bin/ /halvade/input/germline.bam /halvade/output/germline.vcf

Input

Halvade accepts several kinds of input:

  • a directory with paired fastq|fq(.gz)? files per read group, or unaligned BAM files per read group. The FASTQ files must have a _1.fastq(.gz)? or _1.fq(.gz)? suffix for the first file and _2.fastq(.gz)? or _2.fq(.gz)? for the second.
  • a directory that has already been preprocessed, containing a folder per read group with fastq|fq(.gz)? files in those folders
  • a single aligned BAM file containing all read groups of a sample, with read group information
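
Since the FASTQ suffixes matter, it can help to check a read-group directory before starting a run. The following is a minimal sketch, not part of Halvade: check_pairs is a hypothetical helper that reports every first-in-pair file (_1.fastq, _1.fq, optionally gzipped) without a matching _2 file, and returns a non-zero status if any mate is missing.

```shell
#!/usr/bin/env bash
# Sketch: verify that every *_1.fastq / *_1.fq file (optionally .gz) in a
# directory has a matching *_2 file. check_pairs is a hypothetical helper.
check_pairs() {
  local dir="$1" first second missing=0
  for first in "$dir"/*_1.fastq "$dir"/*_1.fastq.gz "$dir"/*_1.fq "$dir"/*_1.fq.gz; do
    [ -e "$first" ] || continue           # the pattern matched no file
    # replace the last "_1." in the name with "_2." to get the mate's name
    second="${first%_1.*}_2.${first##*_1.}"
    if [ ! -e "$second" ]; then
      echo "missing mate for $first (expected $second)"
      missing=1
    fi
  done
  return "$missing"
}
```

For example, `check_pairs /halvade/input/germline` prints nothing and succeeds when all pairs are complete.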

Other Options

The script supports these options:

  • --exome: run the exome pipeline
  • --tmpdir <string>: set the folder for temporary files
  • --germlineSM <string>: germline sample name
  • --tumorSM <string>: tumor sample name
  • --normalSM <string>: normal sample name
  • --partitions <int>: override default number of partitions

To override automatically detected resources:

  • --memory <int>: sets the available memory in GB
  • --cpus <int>: sets the number of available CPUs
  • --nodes <int>: sets the number of nodes in the cluster
  • --executor_memory <int>: sets the memory in MB per executor
  • --executor_cpus <int>: sets the number of CPUs per executor

Additional Halvade options can be set with the variable HALVADE_OPTS, while extra options for spark-submit can be set with EXTRA_SPARK_OPTIONS:

HALVADE_OPTS="--variant_caller both"
EXTRA_SPARK_OPTIONS="--master=spark://$ip:7077"
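
As an illustration of how these variables might be combined with a run, here is a minimal sketch. It assumes that halvade.sh reads HALVADE_OPTS and EXTRA_SPARK_OPTIONS from its environment, which is why they are exported; this is an assumption and has not been checked against the script. The `--queue default` value is only an example of a spark-submit option for YARN, and the paths are the placeholder paths used in the Run section.

```shell
#!/usr/bin/env bash
# Sketch: pass extra options to halvade.sh through the environment.
# Assumption: halvade.sh picks up both variables from its environment,
# so they are exported here before the script is started.
export HALVADE_OPTS="--variant_caller both"
export EXTRA_SPARK_OPTIONS="--queue default"   # example spark-submit option for YARN

# Start a germline run only if the script is present in the current folder.
if [ -x ./halvade.sh ]; then
  ./halvade.sh germline /halvade/ref/ /halvade/bin/ /halvade/input/germline /halvade/output/germline.vcf
fi
```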

An overview of additional Halvade options can be found here.

Expert users can run the script with the --quiet option to show the command that will start a Halvade job, and then change any parameters as required.