Meeting objectives:
- What data can we access?
- How do we access it?
- What formats is it in?
- How is the data organized, including metadata and sample annotations, and how can we interact with it?
- Is it possible to drill down to something akin to a schema description of how the data are organized and begin exploring the database?
Notes on available sequencing data
TCGA
Samples sequenced: 11/1/2011
These data come from the BAM Telemetry Report (see below).
There are 4883 sequencing files, but an individual may have multiple data types (exome, genome, miRNA, or RNAseq), and occasionally more than one sequence file per data type. There are 1610 subjects with at least one type of data. The attached complete file also shows cancer type.
1610 Number of unique individuals with any type of Sequencing data
1338 Number of unique individuals with Exome data
99 Number of unique individuals with genome data
552 Number of unique individuals with miRNA data
306 Number of unique individuals with RNAseq data
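Counts like those above can be reproduced from a per-file listing by collecting the distinct subjects seen for each data type. The following is a minimal sketch, assuming a tab-delimited export with hypothetical column names `subject_id` and `data_type`; the real telemetry report may use different headers, and the sample rows are made up.

```python
import csv
import io
from collections import defaultdict


def count_subjects_by_type(rows):
    """Return (total unique subjects, {data_type: unique subject count})."""
    subjects_by_type = defaultdict(set)
    all_subjects = set()
    for row in rows:
        subject = row["subject_id"]  # hypothetical column name
        subjects_by_type[row["data_type"]].add(subject)  # hypothetical column name
        all_subjects.add(subject)
    per_type = {t: len(s) for t, s in subjects_by_type.items()}
    return len(all_subjects), per_type


# Made-up rows: one subject can have several files and several data types.
SAMPLE = (
    "subject_id\tdata_type\tfile_name\n"
    "S1\texome\ta.bam\n"
    "S1\texome\tb.bam\n"
    "S1\tRNAseq\tc.bam\n"
    "S2\texome\td.bam\n"
    "S3\tmiRNA\te.bam\n"
)

reader = csv.DictReader(io.StringIO(SAMPLE), delimiter="\t")
total, per_type = count_subjects_by_type(reader)
print(total, per_type)
```

Using sets rather than counters means a subject with two exome files is still counted once, which is why the per-type counts need not add up to the number of files.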
TCGA Data Portal:
What sequencing data can you find where?
The TCGA Data Coordinating Center (DCC) organizes, stores, and provides access to metadata associated with the sequenced samples, including clinical and biospecimen information. The National Center for Biotechnology Information ([NCBI|http://www.ncbi.nlm.nih.gov/]) organizes, stores, and provides actual sequence data and associated genotype/phenotype annotations through its Sequence Read Archive ([SRA|http://www.ncbi.nlm.nih.gov/sra]) service and the Database of Genotypes and Phenotypes ([dbGaP|http://www.ncbi.nlm.nih.gov/gap]). The SRA houses raw sequence reads, while dbGaP houses finished alignments in the [BAM format|http://samtools.sourceforge.net/SAM1.pdf].
Data File Submissions
(Raw data are in the SRA; aligned files are in dbGaP)
The data-generating centers that use sequencing platforms generate the following sequence-derived data files:
NCBI Deposit Site
File Type | File Suffix | Data Level | Description | Deposit Site |
---|---|---|---|---|
Sequence Read Files | various or fastq | | Sequence read data in their native platform formats (e.g. AB SOLiD, Illumina) or FASTQ format. | SRA |
Binary-sequence Alignment Format (BAM) files | bam | 1 | A Binary Alignment/Map (BAM) file is the compressed binary version of the Sequence Alignment/Map (SAM), a compact and indexable representation of nucleotide sequence alignments. | dbGaP |
Sequence Trace files | scf | 1 | Sequence trace files contain the raw data output from automated sequencing instruments. | Trace |
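Because BAM is a compressed binary format, a quick sanity check on a downloaded file is to decompress the first few bytes and look for the BAM magic string. The sketch below relies on two facts from the SAM/BAM specification: BAM files use BGZF compression, which is readable as gzip, and the decompressed stream starts with `BAM\1`. The function name `looks_like_bam` is hypothetical, and the demo uses a plain gzip stream as a stand-in for a real BAM file.

```python
import gzip
import io

BAM_MAGIC = b"BAM\x01"  # first four bytes of a decompressed BAM stream


def looks_like_bam(stream):
    """Return True if the gzip-decompressed stream starts with the BAM magic."""
    try:
        with gzip.GzipFile(fileobj=stream) as gz:
            return gz.read(4) == BAM_MAGIC
    except (OSError, EOFError):
        # Not gzip-compressed at all, or truncated before four bytes.
        return False


# Stand-in data only: a gzip stream carrying the magic bytes, not a real BAM.
fake_bam = gzip.compress(BAM_MAGIC + b"\x00" * 8)
plain_sam = b"@HD\tVN:1.0\n"

print(looks_like_bam(io.BytesIO(fake_bam)))
print(looks_like_bam(io.BytesIO(plain_sam)))
```

For a file on disk, pass `open(path, "rb")` instead of the in-memory streams. This only checks the header; it says nothing about whether the alignments themselves are intact.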
DCC Deposit Site
File Type | File Suffix | Data Level | Description |
---|---|---|---|
Wiggle (WIG) format files | wig | 2 | The wiggle (WIG) format describes dense, continuous data such as sequence coverage, GC percent, and probability scores. |
Mutation Annotation Format (MAF) files | maf | 2 or 3 | A Mutation Annotation Format (MAF) is a tab-delimited file containing somatic and/or germline mutation annotations. Depending on the type of mutation (somatic or germline) and the state of the mutation (validated or putative), the MAF file can be considered Level 2 or Level 3 data. |
Variant Call Format (VCF) files | vcf | 2 or 3 | The Variant Call Format (VCF) is a standardized format for storing the most prevalent types of sequence variation. |
Trace ID-to-sample relationship files | tr | 1 | |
Verbose Coverage File | vcf | 2 | A verbose coverage file (VCF) provides sequence depth at a mutation locus described in a MAF file. |
Quantification files | quantification.txt | 3 | A quantification file provides calculated values for a particular data type based on sequence data. The current data types and quantification formats are based on RNA sequencing results. |
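Since MAF files are tab-delimited, they can be explored with a standard CSV reader. The following is a rough sketch that tallies records by mutation type and validation state, the two properties the table says determine the data level. It assumes the file has `Mutation_Status` and `Validation_Status` columns and that any leading comment lines start with `#`; the rows, gene names, and barcodes are made up, and the sketch does not try to assign a data level itself.

```python
import csv
import io
from collections import Counter


def tally_maf(handle):
    """Count MAF records by (Mutation_Status, Validation_Status)."""
    # Skip comment lines such as a leading version line (assumed to start with '#').
    data_lines = (line for line in handle if not line.startswith("#"))
    reader = csv.DictReader(data_lines, delimiter="\t")
    return Counter(
        (row["Mutation_Status"], row["Validation_Status"]) for row in reader
    )


# Made-up records with a reduced set of columns, for illustration only.
SAMPLE_MAF = (
    "#version 2.2\n"
    "Hugo_Symbol\tTumor_Sample_Barcode\tMutation_Status\tValidation_Status\n"
    "GENE1\tSAMPLE-01\tSomatic\tValid\n"
    "GENE2\tSAMPLE-01\tSomatic\tUnknown\n"
    "GENE3\tSAMPLE-02\tGermline\tValid\n"
    "GENE1\tSAMPLE-03\tSomatic\tValid\n"
)

counts = tally_maf(io.StringIO(SAMPLE_MAF))
for (mutation_status, validation_status), n in sorted(counts.items()):
    print(mutation_status, validation_status, n)
```

A tally like this is a quick way to see whether a given MAF holds mostly putative or validated calls before deciding how to use it.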
Mapping Sequence-Based Data
(This is a bit confusing: only a small fraction of the data in the BAM Telemetry Report appears to be in dbGaP.)
BAM Telemetry Report: The total number of cases sequenced for different disease types, molecules, and centers is tabulated, and the most recent incoming sequence data files are listed.
BAM Telemetry Files on TCGA Sequence Data in dbGaP: This report identifies TCGA sequence data that are available from the dbGaP database.
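One way to quantify the gap between the two reports is to compare their file identifiers as sets. This is only a sketch: it assumes both reports can be reduced to a list of comparable identifiers, which has not been verified here, and the identifiers and the function name `dbgap_coverage` are placeholders.

```python
def dbgap_coverage(report_ids, dbgap_ids):
    """Return (fraction of reported files found in dbGaP, sorted missing IDs)."""
    report = set(report_ids)
    available = report & set(dbgap_ids)
    missing = sorted(report - available)
    fraction = len(available) / len(report) if report else 0.0
    return fraction, missing


# Placeholder identifiers, not real file names.
report_ids = ["file-A", "file-B", "file-C", "file-D"]
dbgap_ids = ["file-B", "file-X"]

fraction, missing = dbgap_coverage(report_ids, dbgap_ids)
print(f"{fraction:.0%} of reported files found in dbGaP; missing: {missing}")
```

Identifiers present only in the dbGaP list (such as `file-X` here) are ignored, since the question is how much of the telemetry report is covered.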
The State of Whole Genome/Exome/RNA sequencing: 9-2011
WGS.WES.RNAseq_September.2011v2.pptx
The NCBI SRA: An Evaluation
Some interesting comments from the "tree of life" blog http://phylogenomics.blogspot.com/2011/02/though-i-generally-love-ncbi.html
3 general problems
1) Difficult to upload and download data
2) Limited search functionality
3) The data model (study/experiment/sample/run) is not intuitive
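To make the third point concrete, the study/experiment/sample/run model can be sketched as nested records: a study groups experiments, and each experiment points to one sample and owns one or more runs, which hold the actual reads. The classes below are an illustration of that hierarchy, not an NCBI API, and the accessions are placeholders.

```python
from dataclasses import dataclass, field
from typing import List, Tuple


@dataclass
class Run:
    accession: str  # the unit that holds the sequence reads


@dataclass
class Sample:
    accession: str  # the biological material that was sequenced


@dataclass
class Experiment:
    accession: str
    sample: Sample
    runs: List[Run] = field(default_factory=list)


@dataclass
class Study:
    accession: str
    experiments: List[Experiment] = field(default_factory=list)


def runs_for_study(study: Study) -> List[Tuple[str, str]]:
    """Flatten a study into (sample accession, run accession) pairs."""
    return [
        (exp.sample.accession, run.accession)
        for exp in study.experiments
        for run in exp.runs
    ]


# Placeholder accessions for illustration only.
study = Study(
    "STUDY-1",
    [
        Experiment("EXP-1", Sample("SAMPLE-1"), [Run("RUN-1"), Run("RUN-2")]),
        Experiment("EXP-2", Sample("SAMPLE-2"), [Run("RUN-3")]),
    ],
)
print(runs_for_study(study))
```

Getting from a study to its data takes two hops (study to experiment, experiment to run), with the sample attached sideways to the experiment, which is a plausible reason the model feels unintuitive when browsing.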
DNAnexus SRA: An Evaluation
Same data model as NCBI and therefore the same problem in evaluating the linearity between a study and its data
Slightly better search capabilities than NCBI