SRA Google meeting 11_3_11

Meeting objectives:

What data can we access?
How do we access it?
What formats is it in?
How are the data organized, including metadata and sample annotations, and how can we interact with them?
Is it possible to drill down to something akin to a schema description of how the data are organized and begin exploring the database?

Notes on available sequencing data

TCGA

Samples sequenced: 11/1/2011

These data are from the BAM Telemetry Report (see below).

There are 4883 sequencing files, but an individual may have multiple data types (exome, genome, miRNA, or RNAseq), and occasionally more than one sequence file per data type. In total, 1610 subjects have at least one type of data. The complete file, which also shows cancer type, is attached.

1610 Number of unique individuals with any type of Sequencing data
1338 Number of unique individuals with Exome data
99 Number of unique individuals with genome data
552 Number of unique individuals with miRNA data
306 Number of unique individuals with RNAseq data
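The per-type counts above can be reproduced from a file listing by collapsing files to unique subjects. The following is a minimal sketch, assuming each sequencing file can be reduced to a (subject ID, data type) pair; the function name, the record layout, and the example IDs are hypothetical and are not taken from the BAM Telemetry Report itself.

```python
from collections import defaultdict


def count_unique_subjects(records):
    """Count unique subjects per data type and overall.

    `records` is an iterable of (subject_id, data_type) pairs, one pair per
    sequencing file. This layout is an assumption made for illustration.
    """
    subjects_by_type = defaultdict(set)
    all_subjects = set()
    for subject_id, data_type in records:
        # A set ignores repeated files for the same subject and data type.
        subjects_by_type[data_type].add(subject_id)
        all_subjects.add(subject_id)
    counts = {dtype: len(ids) for dtype, ids in subjects_by_type.items()}
    counts["any"] = len(all_subjects)
    return counts


# Hypothetical records: five files covering three subjects.
example_records = [
    ("SUBJ-001", "exome"),
    ("SUBJ-001", "exome"),   # second exome file for the same subject
    ("SUBJ-001", "RNAseq"),
    ("SUBJ-002", "genome"),
    ("SUBJ-003", "exome"),
]
example_counts = count_unique_subjects(example_records)
```

Because one subject can appear under several data types, the per-type counts need not sum to the "any" count, which is why the 1610 total above is smaller than the sum of the four per-type numbers.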

TCGA Data Portal:

What sequencing data can you find where?

The TCGA Data Coordinating Center (DCC) organizes, stores, and provides access to metadata associated with the sequenced samples, including clinical and biospecimen information. The National Center for Biotechnology Information ([NCBI|http://www.ncbi.nlm.nih.gov/]) organizes, stores, and provides actual sequence data and associated genotype/phenotype annotations through its Sequence Read Archive ([SRA|http://www.ncbi.nlm.nih.gov/sra]) service and the Database of Genotypes and Phenotypes ([dbGaP|http://www.ncbi.nlm.nih.gov/gap]). The SRA houses raw sequence reads, while dbGaP houses finished alignments in the [BAM format|http://samtools.sourceforge.net/SAM1.pdf].

Data File Submissions

(Raw data are in the SRA; aligned files are in dbGaP)

The data-generating centers that use sequencing platforms generate the following sequence-derived data files:

NCBI Deposit Site

File Type	File Suffix	Data Level	Description	Deposit Site
Sequence Read Files	various or fastq		Sequence read data in their native platform formats (e.g. AB SOLiD, Illumina) or FASTQ format.	SRA
Binary-sequence Alignment Format (BAM) files	bam	1	A Binary Alignment/Map (BAM) file is the compressed binary version of the Sequence Alignment/Map (SAM), a compact and indexable representation of nucleotide sequence alignments.	dbGaP
Sequence Trace files	scf	1	Sequence trace files contain the raw data output from automated sequencing instruments.	Trace

DCC Deposit Site

File Type	File Suffix	Data Level	Description
Wiggle (WIG) format files	wig	2	The wiggle (WIG) format describes dense, continuous data such as sequence coverage, GC percent, and probability scores.
Mutation Annotation Format (MAF) files	maf	2 or 3	A Mutation Annotation Format (MAF) is a tab-delimited file containing somatic and/or germline mutation annotations. Depending on the type of mutation (somatic or germline) and the state of the mutation (validated or putative), the MAF file can be considered Level 2 or Level 3 data.
Variant Call Format (VCF) files	vcf	2 or 3	The Variant Call Format (VCF) is a standardized format for storing the most prevalent types of sequence variation.
Trace ID-to-sample relationship files	tr	1	(No description available; the linked wiki page does not exist.)
Verbose Coverage File	vcf	2	A verbose coverage file (VCF) provides sequence depth at a mutation locus described in a MAF file.
Quantification files	quantification.txt	3	A quantification file provides calculated values for a particular data type based on sequence data. The current data types and quantification formats are based on RNA sequencing results.
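As a quick reference for "what lives where", the two tables above can be folded into a lookup from file suffix to deposit site. This is a minimal sketch built only from the suffixes listed in the tables; the dictionary and function names are hypothetical, native-platform read formats ("various") are not covered, and the two file types that share the `vcf` suffix both map to the DCC.

```python
# Suffix -> deposit site, transcribed from the two tables above.
DEPOSIT_SITE_BY_SUFFIX = {
    "fastq": "SRA",
    "bam": "dbGaP",
    "scf": "Trace",
    "wig": "DCC",
    "maf": "DCC",
    "vcf": "DCC",  # Variant Call Format and Verbose Coverage File share this suffix
    "tr": "DCC",
    "quantification.txt": "DCC",
}


def deposit_site(filename):
    """Return the deposit site for a file name, or None if the suffix is unknown."""
    name = filename.lower()
    # Try longer suffixes first so multi-part suffixes such as
    # "quantification.txt" are matched as a whole.
    for suffix in sorted(DEPOSIT_SITE_BY_SUFFIX, key=len, reverse=True):
        if name.endswith("." + suffix):
            return DEPOSIT_SITE_BY_SUFFIX[suffix]
    return None
```

For example, `deposit_site("sample.bam")` returns `"dbGaP"`, while a file with an unlisted suffix returns `None` rather than a guess.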

Mapping Sequence-Based Data

(This is a bit confusing: only a small fraction of the data in the BAM Telemetry Report appears to be in dbGaP.)

BAM Telemetry Report: tabulates the total number of cases sequenced for different disease types, molecules, and centers, and lists the most recent incoming sequence data files.

BAM Telemetry Files on TCGA Sequence Data in dbGaP: this report identifies the TCGA sequence data that are available from the dbGaP database.

The State of Whole Genome/Exome/RNA sequencing: 9-2011

WGS.WES.RNAseq_September.2011v2.pptx

The NCBI SRA: An Evaluation

Some interesting comments from the "Tree of Life" blog: http://phylogenomics.blogspot.com/2011/02/though-i-generally-love-ncbi.html

3 general problems

1) Difficult to upload and download data

2) Limited search functionality

3) The data model (study/experiment/sample/run) is not intuitive
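To make the third point concrete, here is a rough sketch of the study/experiment/sample/run hierarchy as nested records. It assumes the usual SRA arrangement in which a study groups experiments, each experiment refers to one sample, and each experiment owns one or more runs; the class names, field names, and accession strings below are hypothetical placeholders and do not reflect the actual SRA schema.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Run:
    accession: str  # one unit of sequence data


@dataclass
class Sample:
    accession: str  # the biological material that was sequenced


@dataclass
class Experiment:
    accession: str
    sample: Sample  # assumed: each experiment refers to one sample
    runs: List[Run] = field(default_factory=list)


@dataclass
class Study:
    accession: str
    experiments: List[Experiment] = field(default_factory=list)


def runs_for_study(study):
    """Walk study -> experiments -> runs and collect the run accessions."""
    return [run.accession for exp in study.experiments for run in exp.runs]


# Placeholder accessions for illustration only.
example_study = Study(
    accession="STUDY-1",
    experiments=[
        Experiment("EXP-1", Sample("SAMPLE-1"), [Run("RUN-1"), Run("RUN-2")]),
        Experiment("EXP-2", Sample("SAMPLE-2"), [Run("RUN-3")]),
    ],
)
example_runs = runs_for_study(example_study)
```

Getting from a study to its actual data takes two hops (study to experiment, experiment to run), and samples sit off to the side of that path, which may be part of why the model feels unintuitive when browsing.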

DNAnexus SRA: An Evaluation

Same data model as NCBI, and therefore the same problem in evaluating the linearity between a study and its data

Slightly better search capabilities than NCBI