Example datasets

The input data for these pipelines typically consists of either two FASTQ files containing paired-end reads or a BAM file containing already aligned reads.

Germline sample

The whole genome sequencing sample is the NA12878 dataset, which is commonly used in similar benchmarks and papers. It consists of 1.5 billion paired-end reads of 100 base pairs in length, which translates into 50x coverage. Execute the following commands to download and preprocess the data:

halvadedata=/halvade/input/germline
mkdir -p $halvadedata
cd $halvadedata

wget ftp://ftp.sra.ebi.ac.uk/vol1/fastq/ERR194/ERR194147/ERR194147_1.fastq.gz
wget ftp://ftp.sra.ebi.ac.uk/vol1/fastq/ERR194/ERR194147/ERR194147_2.fastq.gz
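The coverage figure quoted above can be sanity-checked with a quick calculation: depth is roughly the number of reads times the read length, divided by the genome size. The sketch below is only illustrative; the helper name `estimate_coverage` and the default genome size of about 3.1 Gbp for the human genome are assumptions, not values taken from the dataset.

```shell
# Estimate sequencing depth as (number of reads x read length) / genome size.
# The default genome size (~3.1 Gbp, human) is an assumption for illustration.
estimate_coverage() {
    local reads=$1 read_length=$2 genome_size=${3:-3100000000}
    # Integer arithmetic: the result is rounded down.
    echo $(( reads * read_length / genome_size ))
}

# 1.5 billion reads of 100 bp each
estimate_coverage 1500000000 100
```

With these inputs the function prints 48, which is in line with the 50x coverage stated for this dataset.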

Somatic sample

The somatic sample is HCC1395. The samples were sequenced with a HiSeq 2000 instrument producing 100 bp paired-end reads, with approximately 63x and 34x coverage for the tumor and normal sample, respectively. More information can be found here. Since these are unaligned BAM files, we use GATK 4 to convert the BAMs back to FASTQ. To get the GATK 4 binary, see this page.

halvadetumor=/halvade/input/tumor
halvadenormal=/halvade/input/normal
mkdir -p $halvadetumor $halvadenormal

cd $halvadetumor
wget http://genomedata.org/pmbio-workshop/fastqs/all/WGS_Tumor.tar
tar xvf WGS_Tumor.tar

cd $halvadenormal
wget http://genomedata.org/pmbio-workshop/fastqs/all/WGS_Norm.tar
tar xvf WGS_Norm.tar
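The conversion from unaligned BAM back to FASTQ mentioned above could look roughly like the following. This is a minimal sketch, not the exact commands used for this dataset: it assumes the archives unpack to `*.bam` files, that the `gatk` launcher is on the `PATH`, and that the `SamToFastq` tool bundled with GATK 4 is used. The function name `bam_to_fastq`, the `DRY_RUN` switch, and the output file naming are hypothetical.

```shell
# Convert every unaligned BAM in a directory to a pair of FASTQ files.
# Assumptions: files end in .bam and `gatk` (GATK 4) is on the PATH.
# Set DRY_RUN=1 to print the commands instead of running them.
bam_to_fastq() {
    local dir=$1 bam prefix
    for bam in "$dir"/*.bam; do
        [ -e "$bam" ] || continue      # no BAM files found: nothing to do
        prefix=${bam%.bam}             # strip the extension for the output names
        ${DRY_RUN:+echo} gatk SamToFastq \
            -I "$bam" \
            -F "${prefix}_1.fastq" \
            -F2 "${prefix}_2.fastq"
    done
}
```

For example, `DRY_RUN=1 bam_to_fastq "$halvadetumor"` lists the commands that would be run for the tumor sample without executing them.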

Exome sequencing data is also available from this sample and can be downloaded like this:

halvadetumor=/halvade/input/tumorwxs
halvadenormal=/halvade/input/normalwxs
mkdir -p $halvadetumor $halvadenormal

cd $halvadetumor
wget http://genomedata.org/pmbio-workshop/fastqs/all/Exome_Tumor.tar
tar xvf Exome_Tumor.tar

cd $halvadenormal
wget http://genomedata.org/pmbio-workshop/fastqs/all/Exome_Norm.tar
tar xvf Exome_Norm.tar
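Because the pipelines expect two FASTQ files per paired-end sample, it can be worth checking that both files of a pair contain the same number of reads before starting a run. The sketch below is one possible check, assuming the usual four-lines-per-record FASTQ layout; the helper names `count_reads` and `check_pairs` are hypothetical and not part of Halvade.

```shell
# Count the reads in a FASTQ file (plain or gzip-compressed),
# assuming four lines per record.
count_reads() {
    local lines
    case $1 in
        *.gz) lines=$(gzip -cd "$1" | wc -l) ;;
        *)    lines=$(wc -l < "$1") ;;
    esac
    echo $(( lines / 4 ))
}

# Succeed only if both files of a pair hold the same number of reads.
check_pairs() {
    local n1 n2
    n1=$(count_reads "$1")
    n2=$(count_reads "$2")
    if [ "$n1" -eq "$n2" ]; then
        echo "OK: $n1 reads in each file"
    else
        echo "MISMATCH: $n1 vs $n2 reads" >&2
        return 1
    fi
}
```

For the germline sample this would be called as `check_pairs ERR194147_1.fastq.gz ERR194147_2.fastq.gz`. Note that decompressing files of this size takes a while.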