Example datasets
The input data for these pipelines typically consist of either 2 FASTQ files for paired-end reads or a BAM file containing already aligned reads.
Germline sample
The whole-genome sequencing sample is the NA12878 dataset, which is commonly used in similar benchmarks and papers. It consists of 1.5 billion paired-end reads, each 100 base pairs long, which translates into 50x coverage. Execute the following commands to download and preprocess the data:
```bash
halvadedata=/halvade/input/germline
mkdir -p $halvadedata
cd $halvadedata
wget ftp://ftp.sra.ebi.ac.uk/vol1/fastq/ERR194/ERR194147/ERR194147_1.fastq.gz
wget ftp://ftp.sra.ebi.ac.uk/vol1/fastq/ERR194/ERR194147/ERR194147_2.fastq.gz
```
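The coverage figure quoted for this dataset can be sanity-checked with a little arithmetic: coverage is roughly (number of reads × read length) / genome size. The sketch below is only an illustration. The helper name `estimate_coverage` is hypothetical, and the genome size of about 3.1 billion base pairs is an assumption that is not stated in this document.

```sh
# Hypothetical helper: estimate sequencing coverage.
# Arguments: number of reads, read length (bp), genome size (bp).
estimate_coverage() {
    # LC_ALL=C keeps the decimal separator a dot regardless of locale.
    LC_ALL=C awk -v reads="$1" -v len="$2" -v genome="$3" \
        'BEGIN { printf "%.1f\n", reads * len / genome }'
}

# 1.5 billion reads of 100 bp; genome size of ~3.1 Gbp is an assumption.
estimate_coverage 1500000000 100 3100000000   # prints 48.4
```

Under that assumed genome size the result is about 48x, which is consistent with the rounded 50x figure.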
Somatic sample
The somatic sample is HCC1395. The tumor and normal samples were sequenced on a HiSeq 2000 instrument, producing 100 bp paired-end reads with approximately 63x and 34x coverage, respectively. More information can be found here. Because these are unaligned BAM files, we use GATK 4 to convert the BAMs back to FASTQ; to get the GATK 4 binary, see this page.
```bash
halvadetumor=/halvade/input/tumor
halvadenormal=/halvade/input/normal
mkdir -p $halvadetumor $halvadenormal

cd $halvadetumor
wget http://genomedata.org/pmbio-workshop/fastqs/all/WGS_Tumor.tar
tar xvf WGS_Tumor.tar

cd $halvadenormal
wget http://genomedata.org/pmbio-workshop/fastqs/all/WGS_Norm.tar
tar xvf WGS_Norm.tar
```
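The conversion from unaligned BAM to FASTQ mentioned in the text is not part of the download commands, so here is a minimal sketch of how it might look. It assumes the `gatk` launcher is on the `PATH` and uses the `SamToFastq` tool bundled with GATK 4, with its `-I`, `-F` and `-F2` arguments. The helper name `ubam_to_fastq`, the `DRY_RUN` switch, and the output naming scheme are hypothetical, and the names of the BAM files inside the archives are not given in this document.

```sh
# Hypothetical helper: convert one unaligned BAM into two paired FASTQ files.
# Assumes `gatk` (GATK 4) is on PATH. Set DRY_RUN=1 to print the command only.
ubam_to_fastq() {
    bam=$1
    prefix=${bam%.bam}   # strip the .bam suffix to build the output names
    if [ "${DRY_RUN:-0}" = "1" ]; then
        echo "gatk SamToFastq -I $bam -F ${prefix}_1.fastq -F2 ${prefix}_2.fastq"
    else
        gatk SamToFastq -I "$bam" -F "${prefix}_1.fastq" -F2 "${prefix}_2.fastq"
    fi
}

# Example (file name is a placeholder, not taken from the archive):
#   DRY_RUN=1 ubam_to_fastq sample.bam
```

Running with `DRY_RUN=1` first makes it easy to check the generated file names before starting a long conversion.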
Exome sequencing data is also available from this sample and can be downloaded like this:
```bash
halvadetumor=/halvade/input/tumorwxs
halvadenormal=/halvade/input/normalwxs
mkdir -p $halvadetumor $halvadenormal

cd $halvadetumor
wget http://genomedata.org/pmbio-workshop/fastqs/all/Exome_Tumor.tar
tar xvf Exome_Tumor.tar

cd $halvadenormal
wget http://genomedata.org/pmbio-workshop/fastqs/all/Exome_Norm.tar
tar xvf Exome_Norm.tar
```
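The whole-genome and exome downloads follow the same pattern and differ only in the assay prefix (`WGS` or `Exome`), the sample type (`Tumor` or `Norm`), and the target directory. One possible way to avoid repeating the commands is sketched below. The helper names `tarball_url` and `fetch_sample` are hypothetical, while the base URL and directory names are those used in this section.

```sh
# Base URL taken from the download commands in this section.
base_url=http://genomedata.org/pmbio-workshop/fastqs/all

# Hypothetical helper: build the tarball URL for an assay (WGS or Exome)
# and a sample type (Tumor or Norm).
tarball_url() {
    echo "${base_url}/${1}_${2}.tar"
}

# Hypothetical helper: download and unpack one tarball into a directory.
# Arguments: assay, sample type, destination directory.
fetch_sample() {
    assay=$1
    sample=$2
    dest=$3
    mkdir -p "$dest" &&
        (cd "$dest" &&
            wget "$(tarball_url "$assay" "$sample")" &&
            tar xvf "${assay}_${sample}.tar")
}

# Example calls (not executed here):
#   fetch_sample Exome Tumor /halvade/input/tumorwxs
#   fetch_sample Exome Norm  /halvade/input/normalwxs
```

The subshell around `cd` keeps the caller's working directory unchanged, and chaining with `&&` stops at the first failing step instead of unpacking a partial download.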