Halvade on Amazon EMR

Amazon AWS Setup

First create an Amazon AWS account; more information can be found here.

Note

When using spot instances, make sure your spot instance limit is at least as high as the number of instances requested; see this page for more information.

AWS CLI

The Amazon AWS CLI is used in the Halvade script to start a cluster. A detailed installation guide can be found here; the steps for Linux are summarized below:

curl "https://awscli.amazonaws.com/awscli-exe-linux-x86_64.zip" -o "awscliv2.zip"
unzip awscliv2.zip
sudo ./aws/install

An S3 bucket is required to store the data for the pipeline and can be created as described here. This tutorial assumes you are using the s3://halvade/ bucket.
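The bucket can also be created from the command line. The following is a minimal sketch, assuming the AWS CLI has already been configured with credentials; the function name create_bucket is hypothetical. Bucket names are globally unique, so a name other than halvade will be needed in practice.

```shell
# Hypothetical helper (not part of Halvade): create a bucket and confirm
# that it is reachable. Assumes `aws configure` has already been run.
create_bucket() {
    bucket="$1"                          # bucket name without the s3:// prefix
    aws s3 mb "s3://${bucket}" || return 1
    # An empty listing with exit code 0 means the bucket exists.
    aws s3 ls "s3://${bucket}/"
}
```

Calling create_bucket halvade would create the s3://halvade/ bucket assumed in the rest of this tutorial.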

EC2 access key

In order for Amazon EMR to access the nodes in the cluster, an EC2 key pair needs to be created. This can be done in the AWS console like this or with the AWS CLI:

aws ec2 create-key-pair --key-name "$keyname"
# the value in $keyname needs to be given in the EC2_KEY_PAIR variable in your configuration file

Halvade Setup

Binaries

The binaries for BWA and SAMtools should be built on the same operating system as the one running on the EMR images.

Either use the prebuilt binaries for the default settings or build the binaries yourself on an Amazon AWS EC2 instance.

Prebuilt for default settings

Copy these binaries for the default instances and Ubuntu 18.04:

wget -O bin.tar.gz "https://www.dropbox.com/s/fya87fzhzdgvghr/amazon-bin.tar.gz?dl=0"
aws s3 cp bin.tar.gz s3://halvade/bin/

# s3://halvade/bin/ is used as $S3_BIN_DIR in your configuration file

Build the binaries on an AWS EC2 cluster

To do this, start an EC2 instance with access to S3; a micro instance with an Amazon AMI can be used to stay within the free tier. Connect to the instance, build the binaries there and copy the files to your S3 bucket. For the samtools binary, some features need to be disabled by running ./configure --without-curses --disable-lzma before make. More information on building the binaries can be found here.

The bwa, samtools, GATK4 and optionally Strelka2 binaries can be archived together in a single .tar.gz file:

tar -cvzf bin.tar.gz bwa samtools gatk-package-${GATK_V}-local.jar strelka-${STRELKA_V}.centos6_x86_64
aws s3 cp bin.tar.gz s3://halvade/bin/

# s3://halvade/bin/ is used as $S3_BIN_DIR in your configuration file

Note

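Do not forget to terminate the EC2 instance once the binaries have been uploaded, e.g. with aws ec2 terminate-instances --instance-ids $instance_id, where $instance_id is the ID of the build instance.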

Libraries

The amazon-bootstrap.sh script needs to be uploaded to S3, as it is used in a bootstrap step when creating the EMR cluster. This script downloads the required binaries, reference files and libraries from S3 to every node.

The libraries for Halvade, the Halvade jar and the halvade.sh run script also need to be available in this S3 folder. A prepackaged tar.gz file can be downloaded and uploaded to S3:

wget https://bitbucket.org/dries_decap/halvadeforspark/downloads/halvade-cloud-bundle.tar.gz
wget https://bitbucket.org/dries_decap/halvadeforspark/downloads/amazon-bootstrap.sh
aws s3 cp halvade-cloud-bundle.tar.gz s3://halvade/lib/
aws s3 cp amazon-bootstrap.sh s3://halvade/lib/

# s3://halvade/lib/ is used as $S3_LIB_DIR in your configuration file

Reference Files

The reference FASTA file with BWA indexes and the corresponding dbSNP file also need to be available on S3 so that Halvade can use them. An overview of all required reference files is given here. Once you have all files, they can be uploaded to S3. To reduce storage space they can be tarred and gzipped: *.tar.gz files are unpacked automatically, while plain .gz files are left as they are. This preserves the index of the dbsnp.vcf.gz file.

# assuming all files have a grch38 prefix:
tar -cvzf grch38.tar.gz grch38.fasta* grch38.dict # optionally the corresponding vcf file + index if not gzipped
aws s3 cp grch38.tar.gz s3://halvade/grch38/

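# if your dbsnp is gzipped and indexed with bgzip/tabix, upload the file and its index;
# aws s3 cp takes a single source, so the wildcard goes in --include:
aws s3 cp . s3://halvade/grch38/ --recursive --exclude "*" --include "grch38.vcf.gz*"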

# s3://halvade/grch38/ is used as $S3_REF_DIR in your configuration file

Configuration file

The halvade-amazon.sh script (which will be downloaded in a later step) reads the AWS configuration and the required S3 folders from an aws-config.sh file or from the file set by the ${AWS_CONFIG} variable. This file only needs to be created once. Its contents, with required and optional settings, should look like this:

# adjust accordingly
# REQUIRED USER INPUT
EC2_KEY_PAIR=aws-key # name of the key
S3_LIB_DIR=s3://halvade/lib/
S3_REF_DIR=s3://halvade/grch38/
S3_BIN_DIR=s3://halvade/bin/

# OPTIONAL IF SET IN AWS CONFIGURATION
# AWS_REGION=eu-west-1

# OPTIONAL USER INPUT/DEFAULT IS SHOWN BELOW
# CORE_INSTANCE_COUNT=1 # default number of worker nodes can be overridden by --nodes
# CORE_INSTANCE_TYPE=r5d.8xlarge
# CORE_BID_PRICE=0.9 # max bid price for spot instances (core nodes)
# MASTER_INSTANCE_TYPE=r5d.xlarge
# MASTER_BID_PRICE=0.25 # max bid price for spot instance (master node)
# LOG_URI=s3://halvade-logs/
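
Before launching a cluster it can help to check that the configuration file defines everything the script expects. The following is a minimal sketch and not part of Halvade: the function name check_halvade_config is hypothetical, and the rule that each S3 directory starts with s3:// and ends with a slash is an assumption based on the example values above.

```shell
# Hypothetical helper (not part of Halvade): sanity-check an aws-config.sh file.
# Usage: check_halvade_config [path/to/aws-config.sh]
# The body runs in a subshell (note the parentheses), so sourced variables
# do not leak into the calling shell.
check_halvade_config() (
    config="${1:-${AWS_CONFIG:-aws-config.sh}}"
    # Make bare file names relative to the current directory so that
    # the `.` command does not search PATH for them.
    case "$config" in
        */*) ;;
        *) config="./$config" ;;
    esac
    [ -f "$config" ] || { echo "config file not found: $config" >&2; exit 1; }
    unset EC2_KEY_PAIR S3_LIB_DIR S3_REF_DIR S3_BIN_DIR
    . "$config"
    status=0
    # The key pair name is required.
    if [ -z "${EC2_KEY_PAIR:-}" ]; then
        echo "EC2_KEY_PAIR is not set" >&2
        status=1
    fi
    # Assumption: S3 directories look like s3://bucket/folder/ as in the example.
    for name in S3_LIB_DIR S3_REF_DIR S3_BIN_DIR; do
        eval "value=\${$name:-}"
        case "$value" in
            s3://*/) ;;
            *) echo "$name should look like s3://bucket/folder/ (got '$value')" >&2
               status=1 ;;
        esac
    done
    exit "$status"
)
```

Running check_halvade_config aws-config.sh returns a non-zero exit code and prints one message per problem if a required variable is missing or malformed.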

Run script

The halvade-amazon.sh script, which creates an EMR cluster and runs Halvade, can be found in the scripts folder of the git repository or can be downloaded here:

wget https://bitbucket.org/dries_decap/halvadeforspark/downloads/halvade-amazon.sh

Note

This script runs asynchronously, so it will exit before the job has completed. The progress can be checked through the AWS CLI or the AWS Console online.
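
Because the script returns immediately, the cluster has to be polled separately. The following is a minimal sketch using standard AWS CLI commands; the function names are hypothetical, and the cluster ID is assumed to be known, for example from the output of aws emr list-clusters --active.

```shell
# Hypothetical helpers (not part of Halvade) for following an EMR cluster.

# Print the current state of a cluster, e.g. STARTING, RUNNING or TERMINATED.
emr_cluster_state() {
    aws emr describe-cluster --cluster-id "$1" \
        --query 'Cluster.Status.State' --output text
}

# Print the name and state of every step submitted to a cluster.
emr_step_states() {
    aws emr list-steps --cluster-id "$1" \
        --query 'Steps[].[Name,Status.State]' --output text
}
```

For example, emr_cluster_state j-XXXXXXXXXXXXX prints a single state, where j-XXXXXXXXXXXXX is a placeholder for the actual cluster ID.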

To run Halvade on Amazon EMR, the reference, binary and library folders together with the AWS configuration variables need to be provided in the aws-config.sh file. The input and output locations are given as arguments to the script.

# SOMATIC
# FASTQ input, a folder with paired FASTQ files per read group
./halvade-amazon.sh somatic s3://halvade-io/somatic-input/tumor/ s3://halvade-io/somatic-input/normal/ s3://halvade-io/somatic-output/
# BAM input, already aligned reads with read groups added
./halvade-amazon.sh somatic s3://halvade-io/somatic-input/tumor.bam s3://halvade-io/somatic-input/normal.bam s3://halvade-io/somatic-output/

# GERMLINE
# FASTQ input, a folder with paired FASTQ files per read group
./halvade-amazon.sh germline s3://halvade-io/germline-input/germline/ s3://halvade-io/germline-output/
# BAM input, already aligned reads with read groups added
./halvade-amazon.sh germline s3://halvade-io/germline-input/germline.bam s3://halvade-io/germline-output/

If required files are missing in the reference, binary or library directories, this will be detected during the bootstrapping. Once the task is completed, the output will be placed in the given output folder on S3, with a unique identifier based on the start time.
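
Once a run has finished, the results can be inspected and fetched with the AWS CLI. The following is a minimal sketch; the function name halvade_output is hypothetical, and the exact name of the per-run subfolder is not assumed here, so the output folder is listed first to find it.

```shell
# Hypothetical helper (not part of Halvade): list the contents of an S3 output
# folder, or download it to a local directory when a second argument is given.
halvade_output() {
    s3_dir="$1"           # e.g. s3://halvade-io/germline-output/
    local_dir="${2:-}"    # optional local target directory
    if [ -z "$local_dir" ]; then
        aws s3 ls "$s3_dir"
    else
        aws s3 sync "$s3_dir" "$local_dir"
    fi
}
```

Calling halvade_output s3://halvade-io/germline-output/ lists the available runs; adding a local directory as the second argument downloads the folder with aws s3 sync.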

Input

There are several valid inputs that Halvade accepts:

a directory with paired fastq|fq(.gz)? files per read group or unaligned BAM files per read group. The files must have _1.fastq(.gz)? or _1.fq(.gz)? suffixes for the first file and _2.fastq(.gz)? or _2.fq(.gz)? for the second.
a directory which has already been preprocessed, containing a folder per read group and fastq|fq(.gz)? files in those folders
a single aligned BAM file containing all read groups of a sample, with read group information
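
The naming rule for paired FASTQ files is easy to get wrong, so it can be worth checking a local directory before uploading it. The following is a minimal sketch and not part of Halvade: the function name check_fastq_pairs is hypothetical, and it assumes that the mate of a file has the same name with _1 replaced by _2 and the same extension.

```shell
# Hypothetical helper (not part of Halvade): verify that every first-mate FASTQ
# file in a directory has a second-mate file with the matching _2 suffix.
check_fastq_pairs() {
    dir="$1"
    found=0
    missing=0
    for first in "$dir"/*_1.fastq "$dir"/*_1.fastq.gz "$dir"/*_1.fq "$dir"/*_1.fq.gz; do
        [ -e "$first" ] || continue      # skip patterns that matched nothing
        found=$((found + 1))
        ext="${first##*_1}"              # .fastq, .fastq.gz, .fq or .fq.gz
        base="${first%_1$ext}"           # path without the _1 suffix and extension
        second="${base}_2$ext"
        if [ ! -e "$second" ]; then
            echo "missing mate for $first (expected $second)" >&2
            missing=$((missing + 1))
        fi
    done
    if [ "$found" -eq 0 ]; then
        echo "no *_1.fastq(.gz) or *_1.fq(.gz) files in $dir" >&2
        return 1
    fi
    [ "$missing" -eq 0 ]
}
```

The function returns a non-zero exit code if no first-mate files are found or if any mate is missing, so it can be chained before an upload, as in check_fastq_pairs ./tumor && aws s3 cp ./tumor s3://halvade-io/somatic-input/tumor/ --recursive.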

Other Options

The script supports these options:

--exome: run the exome pipeline
--germlineSM <string>: germline sample name
--tumorSM <string>: tumor sample name
--normalSM <string>: normal sample name
--partitions <int>: override default number of partitions
--nodes <int>: number of worker nodes
--keep_cluster: do not terminate the cluster after Halvade completes; this allows other samples to be run on this cluster afterwards
--cluster <string>: existing cluster id to run Halvade on (halvade bootstrap script should have been called on every node in this cluster)

To override automatically detected executor memory and CPU settings:

--executor_memory <int>: sets the memory in MB per executor
--executor_cpus <int>: sets the number of CPUs per executor

Additional Halvade options can be set with the variable HALVADE_OPTS:

HALVADE_OPTS="--variant_caller both"

An overview of additional Halvade options can be found here.