Halvade on Google Cloud¶

Google Cloud Setup¶

Create a Google Cloud account as described here.

Note

Make sure you have created a project with billing enabled on your Google Cloud account; see this page for more information.

gcloud¶

The gcloud CLI is required to run Halvade on Google Cloud. Create a project before installing it, because the initialization asks you to select one. Installation and initialization are described here and look like this:

curl -O https://dl.google.com/dl/cloudsdk/channels/rapid/downloads/google-cloud-sdk-361.0.0-linux-x86_64.tar.gz
tar -xvf google-cloud-sdk-361.0.0-linux-x86_64.tar.gz
./google-cloud-sdk/install.sh # adds the necessary paths to PATH variable

# a new terminal login is required before the tools are on your PATH; until then, call them by path: ./google-cloud-sdk/bin/gcloud
gcloud init # login to your google cloud account and select default project/region/zone

# set the default region for dataproc clusters
gcloud config set dataproc/region europe-west1

Storage Bucket¶

A Google Storage (GS) bucket is required so that the Google compute cluster can access the input data and store the output data. This can be created as follows (based on this):

BUCKET_NAME=halvade
gsutil mb gs://$BUCKET_NAME
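
Bucket names share a single global namespace, so a short name such as halvade may already be taken by another user. The sketch below checks a candidate name against a subset of the Cloud Storage naming rules before the bucket is created. The valid_bucket_name helper and the example name are hypothetical and not part of Halvade or gsutil.

```shell
# Hypothetical helper (not part of Halvade or gsutil): checks a subset of the
# Cloud Storage bucket naming rules.
valid_bucket_name() {
  # 3-63 characters; lowercase letters, digits, '-', '_' and '.' only;
  # the first and last character must be a letter or digit.
  printf '%s\n' "$1" | LC_ALL=C grep -Eq '^[a-z0-9][a-z0-9._-]{1,61}[a-z0-9]$'
}

BUCKET_NAME=my-lab-halvade   # assumed example name, pick your own
if valid_bucket_name "$BUCKET_NAME"; then
  # Only prints the command; run it yourself once the name looks right.
  echo "gsutil mb gs://$BUCKET_NAME"
else
  echo "invalid bucket name: $BUCKET_NAME" >&2
fi
```

If you pick a different bucket name, remember to use it in place of gs://halvade/ in the remaining commands on this page.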

Halvade Setup¶

Binaries¶

The binaries for BWA and SAMtools should be built on the same operating system that runs on the Google Cloud hardware. To do this, you can start a compute instance with a machine type such as c2-standard-4, build the binaries there, and then copy the files to your Google Storage bucket. Either use the prebuilt binaries for the default settings or build the binaries yourself on a Google Cloud cluster.

Prebuilt for default settings¶

Copy these binaries for the default instances and Ubuntu 18.04:

wget -O bin.tar.gz "https://www.dropbox.com/s/r8wvw7jh692lbt6/gcloud-bin.tar.gz?dl=0" # quoted so the shell does not treat '?' as a glob character
gsutil cp bin.tar.gz gs://halvade/bin/

# gs://halvade/bin/ is used as $GCLOUD_BIN_DIR in your configuration file

Build the binaries on a Google Cloud cluster¶

First, create a compute instance like this:

# The default image-version used by Halvade is '2.0-ubuntu18'
gcloud dataproc clusters create binary-build --single-node --worker-machine-type c2-standard-4 --image-version 2.0-ubuntu18

# connect to the node like this:
gcloud compute ssh "binary-build-m"  # might create a ssh key if none exists yet, which will be placed in your $HOME/.ssh/

Now build the necessary binaries as described here, then archive and copy the bwa, samtools, GATK4 and, optionally, Strelka2 binaries:

tar -cvzf bin.tar.gz bwa samtools gatk-package-${GATK_V}-local.jar strelka-${STRELKA_V}.centos6_x86_64
gsutil cp bin.tar.gz gs://halvade/bin/

# gs://halvade/bin/ is used as $GCLOUD_BIN_DIR in your configuration file

Note

Do not forget to delete the cluster afterwards with gcloud dataproc clusters delete binary-build.

Libraries¶

The gcloud-bootstrap.sh script needs to be uploaded to Google Cloud Storage (GS), as it is used in an initialization step when creating the dataproc cluster. This script downloads the required binaries, reference files and libraries from GS to every node.

The libraries for Halvade, the Halvade jar and the halvade.sh run script also need to be available in this GS folder. A prepackaged tar.gz file can be downloaded and uploaded to GS:

wget https://bitbucket.org/dries_decap/halvadeforspark/downloads/halvade-cloud-bundle.tar.gz
wget https://bitbucket.org/dries_decap/halvadeforspark/downloads/gcloud-bootstrap.sh
gsutil cp halvade-cloud-bundle.tar.gz gs://halvade/lib/
gsutil cp gcloud-bootstrap.sh gs://halvade/lib/

# gs://halvade/lib/ is used as $GCLOUD_LIB_DIR in your configuration file

Reference Files¶

The reference FASTA file with BWA indexes and the corresponding dbSNP file also need to be available on Google Storage so that Halvade can use them. An overview of all required reference files is given here. Once you have all the files, upload them to Google Storage. To reduce storage space, they can be tarred and gzipped: *.tar.gz files are decompressed automatically, while plain .gz files are not. This preserves the index when a gzipped dbSNP file is used (dbsnp.vcf.gz).

# assuming all files have a grch38 prefix:
tar -cvzf grch38.tar.gz grch38.fasta* grch38.dict # optionally add the corresponding dbsnp vcf file and its index here if the vcf is not gzipped
gsutil cp grch38.tar.gz gs://halvade/grch38/

# if your dbsnp is gzipped and indexed with bgzip/tabix also upload these:
gsutil cp grch38.vcf.gz* gs://halvade/grch38/

# gs://halvade/grch38/ is used as $GCLOUD_REF_DIR in your configuration file
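
Before uploading, it can be worth confirming that the archive really contains the files you expect, since a missing file would otherwise only surface during bootstrapping. The sketch below is one way to do that; archive_has is a hypothetical helper, and the file names passed to it are examples rather than the authoritative list of required reference files.

```shell
# Hypothetical helper (not part of Halvade): succeeds only if every given
# name appears as an entry in the gzipped tar archive.
archive_has() {
  archive=$1
  shift
  # List the archive once; fail early if it cannot be read.
  listing=$(tar -tzf "$archive") || return 1
  rc=0
  for name in "$@"; do
    # -F: fixed string, -x: whole-line match, -q: quiet
    if ! printf '%s\n' "$listing" | grep -Fxq -- "$name"; then
      echo "missing from $archive: $name" >&2
      rc=1
    fi
  done
  return "$rc"
}

# Example call (file names are assumptions):
#   archive_has grch38.tar.gz grch38.fasta grch38.dict && gsutil cp grch38.tar.gz gs://halvade/grch38/
```

The names must match the archive entries exactly, so an archive created from another directory (entries such as ./grch38.fasta or ref/grch38.fasta) needs the same prefix in the check.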

Configuration file¶

The halvade-gcloud.sh script (which will be downloaded later) reads the required GS folders from a gcloud-config.sh file or from a file set by the ${GCLOUD_CONFIG} variable. This file only needs to be created once. Its contents, with required and optional settings, should look like this:

# adjust accordingly
# REQUIRED USER INPUT
GCLOUD_LIB_DIR=gs://halvade/lib/
GCLOUD_REF_DIR=gs://halvade/grch38/
GCLOUD_BIN_DIR=gs://halvade/bin/

# OPTIONAL GOOGLE CLOUD CONFIGURATION IF DEFAULT NOT SET WITH GCLOUD CONFIG
# GCLOUD_REGION=europe-west1
# GCLOUD_ZONE=europe-west1-d
# GCLOUD_PROJECT=halvade-project

# OPTIONAL USER INPUT/DEFAULT IS SHOWN BELOW
# WORKER_INSTANCE_COUNT=1 # default number of worker nodes can be overridden by --nodes
# WORKER_INSTANCE_TYPE=n2-highmem-32 # or ultramem: m1-ultramem-40
# MASTER_INSTANCE_TYPE=c2-standard-4
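
A typo in the configuration file is cheaper to catch before a cluster is started. The sketch below sources the file in a subshell and checks that the three required variables are set and look like gs:// locations. check_halvade_config is a hypothetical helper, not part of Halvade, and the gs:// prefix test is an assumption about what a sensible value looks like rather than a rule enforced by halvade-gcloud.sh.

```shell
# Hypothetical pre-flight check (not part of Halvade).
check_halvade_config() {
  config=${1:-gcloud-config.sh}
  # "." searches PATH for names without a slash, so make the path explicit.
  case $config in
    */*) ;;
    *) config=./$config ;;
  esac
  [ -f "$config" ] || { echo "config file not found: $config" >&2; return 1; }
  (
    # Subshell: the sourced variables do not leak into the calling shell.
    # Unset first so values inherited from the environment cannot mask a gap.
    unset GCLOUD_LIB_DIR GCLOUD_REF_DIR GCLOUD_BIN_DIR
    . "$config"
    rc=0
    for var in GCLOUD_LIB_DIR GCLOUD_REF_DIR GCLOUD_BIN_DIR; do
      eval "value=\${$var:-}"
      case $value in
        gs://?*) ;;
        *) echo "$var is unset or not a gs:// location: '$value'" >&2; rc=1 ;;
      esac
    done
    exit "$rc"
  )
}

# Example call, honouring the GCLOUD_CONFIG variable mentioned above:
#   check_halvade_config "${GCLOUD_CONFIG:-gcloud-config.sh}"
```

The check only inspects the variable values; it does not contact Google Storage, so it cannot tell whether the folders actually exist.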

Run script¶

The halvade-gcloud.sh script, which creates a Dataproc cluster and runs Halvade, can be found in the scripts folder of the git repository or can be downloaded here:

wget https://bitbucket.org/dries_decap/halvadeforspark/downloads/halvade-gcloud.sh

Note

This script runs synchronously and only exits once the job has completed. Consider running it in a screen or tmux session.

Note

The first run will require you to create an SSH key. Leaving the passphrase empty lets the script connect automatically in subsequent runs.

To run Halvade on Google Cloud Dataproc, the reference, binary and library folders, together with the GCLOUD configuration variables, need to be provided in the gcloud-config.sh file. The input and output locations are given as arguments to the script.

# SOMATIC
# FASTQ input, a folder with paired FASTQ files per read group
./halvade-gcloud.sh somatic gs://halvade-io/somatic-input/tumor/ gs://halvade-io/somatic-input/normal/ gs://halvade-io/somatic-output/
# BAM input, already aligned reads with read groups added
./halvade-gcloud.sh somatic gs://halvade-io/somatic-input/tumor.bam gs://halvade-io/somatic-input/normal.bam gs://halvade-io/somatic-output/

# GERMLINE
# FASTQ input, a folder with paired FASTQ files per read group
./halvade-gcloud.sh germline gs://halvade-io/somatic-input/germline/ gs://halvade-io/germline-output/
# BAM input, already aligned reads with read groups added
./halvade-gcloud.sh germline gs://halvade-io/somatic-input/germline.bam gs://halvade-io/germline-output/

If required files are missing in the reference, binary or library directories, this will be detected during bootstrapping. Once the task is completed, the output will be placed in the given output folder on Google Storage, with a unique identifier based on the start time.

Input¶

There are several valid inputs that Halvade accepts:

a directory with paired fastq|fq(.gz)? files per read group or unaligned BAM files per read group. The files must have _1.fastq(.gz)? or _1.fq(.gz)? suffixes for the first file and _2.fastq(.gz)? or _2.fq(.gz)? for the second.
a directory which has already been preprocessed, containing a folder per read group and fastq|fq(.gz)? files in those folders
a single aligned BAM file containing all read groups of a sample, with read group information
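
For the paired FASTQ case, the naming convention above can be checked on a local directory before it is uploaded. The sketch below reports every first-in-pair file that has no matching second file. check_fastq_pairs is a hypothetical helper, not part of Halvade, and it assumes that both files of a pair share the same extension.

```shell
# Hypothetical helper (not part of Halvade): verifies that every *_1 file in a
# local directory has a *_2 mate with the same prefix and extension.
check_fastq_pairs() {
  dir=${1%/}
  missing=0
  for first in "$dir"/*_1.fastq "$dir"/*_1.fastq.gz "$dir"/*_1.fq "$dir"/*_1.fq.gz; do
    [ -e "$first" ] || continue      # skip glob patterns that matched nothing
    base=${first##*/}                # file name without the directory
    prefix=${base%_1.*}              # e.g. "rg1" from "rg1_1.fastq.gz"
    suffix=${base#"$prefix"_1}       # e.g. ".fastq.gz"
    second="$dir/${prefix}_2${suffix}"
    if [ ! -e "$second" ]; then
      echo "no mate for $first (expected $second)" >&2
      missing=1
    fi
  done
  return "$missing"
}

# Example call (directory name is an assumption):
#   check_fastq_pairs ./tumor && gsutil -m cp ./tumor/* gs://halvade-io/somatic-input/tumor/
```

The helper returns success for a directory without any *_1 files, so it does not replace a check that the input is actually present.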

Other Options¶

The script supports these options:

--exome: run the exome pipeline
--germlineSM <string>: germline samplename
--tumorSM <string>: tumor samplename
--normalSM <string>: normal samplename
--partitions <int>: override default number of partitions
--nodes <int>: number of worker nodes
--keep_cluster: do not terminate the cluster after Halvade completes, so that other samples can be run on it afterwards
--cluster <string>: existing cluster id to run Halvade on (halvade bootstrap script should have been called on every node in this cluster)

To override automatically detected executor memory and CPU settings:

--executor_memory <int>: sets the memory in MB per executor
--executor_cpus <int>: sets the number of CPUs per executor

Additional Halvade options can be set with the variable HALVADE_OPTS:

HALVADE_OPTS="--variant_caller both"

An overview of additional Halvade options can be found here.