WGS VARIANT CALLS:
TBD
WES VARIANT CALLS:
A. Germ-line Variant Calling using DeepVariant+Sarek on Nextflow:
This involves transformation of WES fastq or cram files to variant call files in VCF format.
As of Jan 2022, the reference genome used is GRCh38.
The processing steps include the following:
Raw fastq files are uploaded to Synapse by the researcher in a folder with the name format experiment_name_rnaseq_fastq_date. No white space should be present in the filenames (all filenames should use _ in place of whitespace).
All experiment and sample related annotations need to be added on Synapse before processing can start. This is a required step so that a sample sheet can be generated to trigger the processing workflow.
The sample sheet should contain the following information in the following format (saved as a .txt file):
sample | subject | status | sex | file_1 | file_2 | lane | parentId |
Synapse specimenID | Synapse individualID | 1 or 0 (Tumor = 1, Normal = 0) | XX or XY | synID | synID | Lane information | Synapse folder information |
The files are pulled into the Nextflow workflow setup and processed using the following versions of software:
nf-core/sarek v2.7.1
Nextflow v21.10.5
BWA N/A
GATK v4.1.7.0
FreeBayes v1.3.2
samtools v1.9
Strelka v2.9.10
Manta v1.6.0
TIDDIT v2.7.1
AlleleCount v4.0.2
ASCAT v2.5.2
Control-FREEC v11.6
msisensor v0.5
SnpEff v4.3t
VEP v99.2
MultiQC v1.8
FastQC v0.11.9
bcftools v1.9
CNVkit v0.9.6
htslib v1.9
QualiMap v2.2.2-dev
Trim Galore v0.6.4_dev
vcftools v0.1.16
R v4.0.2
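The sample sheet described in the steps above could be assembled and checked with a short script before the workflow is triggered. The following is a minimal sketch, not the actual tooling used: it assumes the .txt file is tab-delimited, that paired fastq files are supplied for every sample, and that parentId holds a Synapse ID. The function names and all IDs in the example row are hypothetical.

```python
import csv
import io
import re

# Column order taken from the sample sheet description above.
COLUMNS = ["sample", "subject", "status", "sex", "file_1", "file_2", "lane", "parentId"]
SYN_ID = re.compile(r"^syn\d+$")  # Synapse IDs have the form syn followed by digits


def validate_row(row):
    """Return a list of problems found in one sample sheet row (empty if none)."""
    problems = []
    missing = [c for c in COLUMNS if not str(row.get(c, "")).strip()]
    if missing:
        problems.append("missing values: " + ", ".join(missing))
    if str(row.get("status")) not in {"0", "1"}:
        problems.append("status must be 1 (Tumor) or 0 (Normal)")
    if row.get("sex") not in {"XX", "XY"}:
        problems.append("sex must be XX or XY")
    # Assumption: file_1, file_2 and parentId are all Synapse IDs.
    for col in ("file_1", "file_2", "parentId"):
        if not SYN_ID.match(str(row.get(col, ""))):
            problems.append(f"{col} is not a Synapse ID")
    return problems


def write_sample_sheet(rows, handle):
    """Write validated rows as tab-delimited text (the delimiter is an assumption)."""
    for i, row in enumerate(rows, start=1):
        problems = validate_row(row)
        if problems:
            raise ValueError(f"row {i}: " + "; ".join(problems))
    writer = csv.DictWriter(handle, fieldnames=COLUMNS, delimiter="\t", lineterminator="\n")
    writer.writeheader()
    writer.writerows(rows)


# Hypothetical example row; every ID below is made up.
example = {
    "sample": "specimen_01", "subject": "patient_01", "status": 1, "sex": "XX",
    "file_1": "syn00000001", "file_2": "syn00000002", "lane": "L001",
    "parentId": "syn00000003",
}
buffer = io.StringIO()
write_sample_sheet([example], buffer)
print(buffer.getvalue())
```

Validating before writing means a malformed row stops the run early instead of failing inside the workflow.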
Estimated costs for germ-line variant calling (per 50 samples):
According to the DeepVariant docs, it costs about $1 per WES sample and $12 per WGS sample on Google Cloud using an n1-standard-16 machine (16 vCPUs, 60 GB of memory, $0.76/hour). If we infer the run time from the costs and price per hour, it should be roughly 1 to 2 hours per WES sample and 16 hours per WGS sample. For a batch of 50 samples, that is about $50 for WES or $600 for WGS.
More information here: WORKFLOWS-98
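The run-time inference and the per-batch figure follow directly from the quoted per-sample costs and hourly price. A small sketch of that arithmetic is given here; the function name is illustrative, and it assumes one n1-standard-16 machine per sample with cost scaling linearly in the number of samples.

```python
# Figures quoted above from the DeepVariant docs.
PRICE_PER_HOUR = 0.76                        # n1-standard-16, USD per hour
COST_PER_SAMPLE = {"WES": 1.0, "WGS": 12.0}  # USD per sample


def estimate(assay, n_samples):
    """Return (inferred hours per sample, total cost in USD) for one assay type."""
    cost = COST_PER_SAMPLE[assay]
    hours_per_sample = cost / PRICE_PER_HOUR
    return hours_per_sample, cost * n_samples


for assay in ("WES", "WGS"):
    hours, total = estimate(assay, 50)
    print(f"{assay}: ~{hours:.1f} h per sample, ~${total:.0f} per 50 samples")
# WES: ~1.3 h per sample, ~$50 per 50 samples
# WGS: ~15.8 h per sample, ~$600 per 50 samples
```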
B. Somatic Variant Calling using Strelka+Mutect2+Freebayes on Nextflow:
TBD
C. Variant Annotation:
Currently, germ-line variant calls in VCF format are processed manually using VEP and vcf2maf.
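Until this step is automated, a per-sample vcf2maf command line could be generated with a small helper. The sketch below only builds and prints the command; it does not run anything. All file paths, the sample ID, and the function name are hypothetical, and the text does not say whether VEP is run separately first or invoked through vcf2maf, so the options shown are an assumption about one possible setup.

```python
import shlex


def vcf2maf_command(input_vcf, output_maf, sample_id, ref_fasta, vep_path):
    """Build (but do not run) a vcf2maf.pl command for one VCF file."""
    return [
        "perl", "vcf2maf.pl",
        "--input-vcf", input_vcf,
        "--output-maf", output_maf,
        "--tumor-id", sample_id,    # assumption: the germ-line sample is passed as the sample of interest
        "--ref-fasta", ref_fasta,
        "--vep-path", vep_path,
        "--ncbi-build", "GRCh38",   # reference genome named in this document
    ]


# Hypothetical paths and sample ID, for illustration only.
cmd = vcf2maf_command(
    input_vcf="specimen_01.vcf",
    output_maf="specimen_01.maf",
    sample_id="specimen_01",
    ref_fasta="/refs/GRCh38.fa",
    vep_path="/opt/vep",
)
print(shlex.join(cmd))
```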
Estimated costs for germ-line variant annotation (per 50 samples) using VEP:
The compute cost should range from $50 to $2,500 depending on how many of the 50 samples are WGS and how many mutations they have.
RNA SEQUENCING DATA QUANTIFICATION:
Processing RNA-seq files involves transformation of raw data (fastq files) to transcript counts (quant.sf files).
The quantification software of choice is Salmon.
As of Jan 2022, the reference genome used is GRCh38.
Processing involves the following steps:
Raw fastq files are uploaded to Synapse by the researcher in a folder with the name format experiment_name_rnaseq_fastq_date. No white space should be present in the filenames (all filenames should use _ in place of whitespace).
All experiment and sample related annotations need to be added on Synapse before processing can start. This is a required step so that a sample sheet can be generated to trigger the processing workflow.
The sample sheet should contain the following information in the following format (saved as a .csv file):
sample | single_end | fastq_1 | fastq_2 | strandedness |
sample_id_1 | 0 if paired-end, 1 if single-end | synID | synID | reverse or forward |
sample_id_2 |
The files are pulled into the Nextflow workflow setup and processed using the following versions of software:
BEDTOOLS_GENOMECOV:
  bedtools: 2.30.0
CAT_FASTQ:
  cat: 8.3
CUSTOM_DUMPSOFTWAREVERSIONS:
  python: 3.9.5
  yaml: 5.4.1
DESEQ2_QC_STAR_SALMON:
  bioconductor-deseq2: 1.28.0
  r-base: 4.0.3
DUPRADAR:
  bioconductor-dupradar: 1.18.0
  r-base: 4.0.2
FASTQC:
  fastqc: 0.11.9
GET_CHROM_SIZES:
  samtools: 1.1
GTF_GENE_FILTER:
  python: 3.8.3
PICARD_MARKDUPLICATES:
  picard: 2.25.7
PRESEQ_LCEXTRAP:
  preseq: 3.1.1
QUALIMAP_RNASEQ:
  qualimap: 2.2.2-dev
RSEM_PREPAREREFERENCE_TRANSCRIPTS:
  rsem: 1.3.1
  star: 2.7.6a
RSEQC_BAMSTAT:
  rseqc: 3.0.1
RSEQC_INFEREXPERIMENT:
  rseqc: 3.0.1
RSEQC_INNERDISTANCE:
  rseqc: 3.0.1
RSEQC_JUNCTIONANNOTATION:
  rseqc: 3.0.1
RSEQC_JUNCTIONSATURATION:
  rseqc: 3.0.1
RSEQC_READDISTRIBUTION:
  rseqc: 3.0.1
RSEQC_READDUPLICATION:
  rseqc: 3.0.1
SALMON_QUANT:
  salmon: 1.5.2
SALMON_SE_GENE:
  bioconductor-summarizedexperiment: 1.20.0
  r-base: 4.0.3
SALMON_TX2GENE:
  python: 3.8.3
SALMON_TXIMPORT:
  bioconductor-tximeta: 1.8.0
  r-base: 4.0.3
SAMPLESHEET_CHECK:
  python: 3.8.3
SAMTOOLS_FLAGSTAT:
  samtools: 1.13
SAMTOOLS_IDXSTATS:
  samtools: 1.13
SAMTOOLS_INDEX:
  samtools: 1.13
SAMTOOLS_SORT:
  samtools: 1.13
SAMTOOLS_STATS:
  samtools: 1.13
STAR_ALIGN:
  star: 2.6.1d
STRINGTIE:
  stringtie: 2.1.7
TRIMGALORE:
  cutadapt: 3.4
  trimgalore: 0.6.7
UCSC_BEDCLIP:
  ucsc: 377
UCSC_BEDGRAPHTOBIGWIG:
  ucsc: 377
Workflow:
  Nextflow: 21.10.5
  nf-core/rnaseq: '3.4'
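The filename rule and the sample sheet layout from the steps above can be sketched in a few lines. This is an illustrative helper, not the script actually used to generate the sheet. It assumes that single_end follows its column name (1 for single-end data, 0 for paired-end data) and that only "reverse" and "forward" are accepted for strandedness; the function names and Synapse IDs are hypothetical.

```python
import csv
import io
import re

# Column order taken from the sample sheet description above.
COLUMNS = ["sample", "single_end", "fastq_1", "fastq_2", "strandedness"]


def normalize_name(name):
    """Replace each run of whitespace in a filename with a single underscore."""
    return re.sub(r"\s+", "_", name.strip())


def make_row(sample, fastq_1, fastq_2="", strandedness="reverse"):
    """Build one row; single_end is 1 only when no second fastq is given (assumption)."""
    if strandedness not in {"reverse", "forward"}:
        raise ValueError("strandedness must be 'reverse' or 'forward'")
    return {
        "sample": sample,
        "single_end": 0 if fastq_2 else 1,
        "fastq_1": fastq_1,
        "fastq_2": fastq_2,
        "strandedness": strandedness,
    }


# Hypothetical filename and Synapse IDs, for illustration only.
print(normalize_name("tumor sample 1_R1.fastq.gz"))  # tumor_sample_1_R1.fastq.gz

rows = [
    make_row("sample_id_1", "syn00000001", "syn00000002", "reverse"),
    make_row("sample_id_2", "syn00000003", strandedness="forward"),
]
buffer = io.StringIO()
writer = csv.DictWriter(buffer, fieldnames=COLUMNS, lineterminator="\n")
writer.writeheader()
writer.writerows(rows)
print(buffer.getvalue())
```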
Estimated costs for processing:
Estimated Cost per sample = $0.20 ($51 for 261 samples)
Estimated time per 100 samples = approx 6 hrs
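For planning a new batch, the figures above can be scaled to an arbitrary sample count. The sketch below assumes that both cost and run time scale linearly with the number of samples, which is a simplification since run time depends on how many jobs run in parallel; the function name is illustrative.

```python
COST_PER_SAMPLE = 51 / 261   # USD, from the 261-sample run quoted above
HOURS_PER_100_SAMPLES = 6    # approximate figure quoted above


def estimate_batch(n_samples):
    """Return (cost in USD, hours), assuming linear scaling with sample count."""
    cost = n_samples * COST_PER_SAMPLE
    hours = n_samples / 100 * HOURS_PER_100_SAMPLES
    return round(cost, 2), round(hours, 1)


print(estimate_batch(50))   # (9.77, 3.0)
```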