SRA Google meeting 11_3_11
THIS PAGE HAS MOVED
See https://sites.google.com/a/google.com/sage-google/projects/sra-data
Context
- Sage scientific expertise
- Sage expertise in data-intensive biological analysis and interest in initiating a large-scale sequencing analysis program.
- Google engineering expertise
- Google tools that could enable large-scale (tera- or petabyte) data query and analysis.
Once we are started we will think hard about the scientific and engineering challenges. I am confident that if the data is transparently queryable we can make a "go" decision and commit to pursuing the collaboration. However, data access is a major challenge given regulatory hurdles (e.g. dbGaP) and the difficulty of organizing the data. Therefore, I would like to focus this hour on practically determining our ability to interact with SRA data through Google tools, which will allow us to decide on next steps.
Meeting objectives:
- What data can we access?
- How do we handle dbGaP-type access restrictions if we want to do meta-analysis across all of SRA?
- How do we access data?
- What formats are data in?
- How is the data organized, including meta-data sample annotations, and how can we interact with it?
- Is it possible to drill down to something akin to a schema description of how the data are organized and begin exploring the database?
** B. Bot - a few notes - I think we should stay clear of the "can we get our hands on the data" questions, which many of the above seem to allude to. We may want to steer the questions (first) more towards what we can do with these data, and then drill down to the details listed above later. I also think it is important to go into this meeting thinking about the long term, as in "what can we do over time as these data continue to evolve and become even more rich" as opposed to "what can we do right now". There are plenty of other groups with access to the SRA data who I'm sure are thinking about the 'now'. I believe the long-term vision is where the true value of this possible collaboration lies. **
Notes on available sequencing data
TCGA
Samples sequenced: 11/1/2011
This data is from the BAM Telemetry Report (see below).
There are 4883 sequencing files, but an individual may have multiple data types (exome, genome, miRNA, or RNAseq), and occasionally more than one sequence file per data type. There are 1610 subjects that have at least one type of data. I have attached the complete file, which also shows cancer type.
1610 Number of unique individuals with any type of Sequencing data
1338 Number of unique individuals with Exome data
99 Number of unique individuals with genome data
552 Number of unique individuals with miRNA data
306 Number of unique individuals with RNAseq data
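Counts like the ones above can be derived from a per-file listing by collecting subject identifiers into sets, one set per data type. The following is a minimal sketch, assuming each row of the telemetry file carries a subject identifier and a data type. The keys `subject` and `data_type` and the example rows are invented for illustration and are not taken from the attached file.

```python
from collections import defaultdict


def count_unique_subjects(rows):
    """Return (total unique subjects, {data_type: unique subject count}).

    `rows` is an iterable of dicts with the hypothetical keys
    'subject' and 'data_type'; a subject may appear in many rows.
    """
    subjects_by_type = defaultdict(set)
    all_subjects = set()
    for row in rows:
        # Sets deduplicate subjects that have several files of one type.
        subjects_by_type[row["data_type"]].add(row["subject"])
        all_subjects.add(row["subject"])
    per_type = {dtype: len(subjects) for dtype, subjects in subjects_by_type.items()}
    return len(all_subjects), per_type


# Invented example rows (not real TCGA data).
example_rows = [
    {"subject": "S1", "data_type": "exome"},
    {"subject": "S1", "data_type": "exome"},   # second exome file, same subject
    {"subject": "S1", "data_type": "RNAseq"},
    {"subject": "S2", "data_type": "genome"},
]

total, per_type = count_unique_subjects(example_rows)
print(total, per_type)
```

Because one subject can contribute several data types, the per-type counts listed above (which sum to 2295) exceed the 1610 unique individuals.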
TCGA Data Portal:
What sequencing data can you find where?
The TCGA Data Coordinating Center (DCC) organizes, stores, and provides access to metadata associated with the sequenced samples, including clinical and biospecimen information. The National Center for Biotechnology Information ([NCBI|http://www.ncbi.nlm.nih.gov/]) organizes, stores, and provides actual sequence data and associated genotype/phenotype annotations through its Sequence Read Archive ([SRA|http://www.ncbi.nlm.nih.gov/sra]) service and the Database of Genotypes and Phenotypes ([dbGaP|http://www.ncbi.nlm.nih.gov/gap]). The SRA houses raw sequence reads, while dbGaP houses finished alignments in the [BAM format|http://samtools.sourceforge.net/SAM1.pdf].
Data File Submissions
(Raw data is in the SRA, aligned files are in dbGaP)
The data-generating centers that use sequencing platforms generate the following sequence-derived data files:
NCBI Deposit Site
File Type | File Suffix | Data Level | Description | Deposit Site |
---|---|---|---|---|
Sequence Read Files | various or fastq | | Sequence read data in their native platform formats (e.g. AB SOLiD, Illumina) or FASTQ format. | SRA |
Binary-sequence Alignment Format (BAM) files | bam | 1 | A Binary Alignment/Map (BAM) file is the compressed binary version of the Sequence Alignment/Map (SAM), a compact and indexable representation of nucleotide sequence alignments. | dbGaP |
Sequence Trace files | scf | 1 | Sequence trace files contain the raw data output from automated sequencing instruments. | Trace |
DCC Deposit Site
File Type | File Suffix | Data Level | Description |
---|---|---|---|
Wiggle (WIG) format files | wig | 2 | The wiggle (WIG) format describes dense, continuous data such as sequence coverage, GC percent, and probability scores. |
Mutation Annotation Format (MAF) files | maf | 2 or 3 | A Mutation Annotation Format (MAF) is a tab-delimited file containing somatic and/or germline mutation annotations. Depending on the type of mutation (somatic or germline) and the state of the mutation (validated or putative), the MAF file can be considered Level 2 or Level 3 data. |
Variant Call Format (VCF) files | vcf | 2 or 3 | The Variant Call Format (VCF) is a standardized format for storing the most prevalent types of sequence variation. |
Trace ID-to-sample relationship files | tr | 1 | |
Verbose Coverage File | vcf | 2 | A verbose coverage file (VCF) provides sequence depth at a mutation locus described in a MAF file. |
Quantification files | quantification.txt | 3 | A quantification file provides calculated values for a particular data type based on sequence data. The current data types and quantification formats are based on RNA sequencing results. |
Mapping Sequence-Based Data
(This is a bit confusing: only a small fraction of the data in the BAM Telemetry Report appears to be in dbGaP.)
BAM Telemetry Report: The total number of cases sequenced for different disease types, molecules, and centers is tabulated, and the most recent incoming sequence data files are listed.
BAM Telemetry Files on TCGA Sequence Data in dbGaP: This report identifies TCGA sequence data that is available from the dbGaP database.
The State of Whole Genome/Exome/RNA sequencing: 9-2011
WGS.WES.RNAseq_September.2011v2.pptx
The NCBI SRA: An Evaluation
Some interesting comments from the "tree of life" blog http://phylogenomics.blogspot.com/2011/02/though-i-generally-love-ncbi.html
3 general problems
1) Difficult to upload and download data
2) Limited search functionality
3) The data model (study/experiment/sample/run) is not intuitive
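For orientation, the study/experiment/sample/run model can be pictured as nested records. The sketch below is an illustrative assumption about how the four levels relate (a study groups experiments; each experiment describes one sample and owns one or more runs). The class and field names are hypothetical, not the NCBI schema, and the accessions are made up.

```python
from dataclasses import dataclass, field
from typing import List, Tuple


@dataclass
class Run:
    accession: str  # one batch of reads from the instrument


@dataclass
class Sample:
    accession: str  # the biological material that was sequenced


@dataclass
class Experiment:
    accession: str
    sample: Sample
    runs: List[Run] = field(default_factory=list)


@dataclass
class Study:
    accession: str
    experiments: List[Experiment] = field(default_factory=list)


def runs_for_study(study: Study) -> List[Tuple[str, str]]:
    """Flatten study -> experiment -> run into (sample, run) accession pairs."""
    return [
        (exp.sample.accession, run.accession)
        for exp in study.experiments
        for run in exp.runs
    ]


# Made-up accessions for illustration only.
study = Study("STUDY1", [
    Experiment("EXP1", Sample("SAMPLE1"), [Run("RUN1"), Run("RUN2")]),
    Experiment("EXP2", Sample("SAMPLE2"), [Run("RUN3")]),
])
pairs = runs_for_study(study)
print(pairs)
```

Even in this toy form, getting from a study to its reads means walking through two levels of indirection, which may be part of what the blog post finds unintuitive.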
DNAnexus SRA: An Evaluation
Same data model as NCBI and therefore the same problem in evaluating the linearity between a study and its data
Slightly better search capabilities than NCBI
Bioconductor tools
To do: Check out Bioconductor tools for SRA access: http://bioconductor.org/help/workflows/high-throughput-sequencing/
** B. Bot - The Bioconductor tools are nice for querying the SRA database for information re: studies, samples, etc. as Mette has done through the web UI ... but I'm not sure there is a lot of functionality there as far as accessing the raw data files. This entire project was put on hold when the future of the SRA was in jeopardy earlier this year. I am in contact with one of the co-authors of the package, though, if we need more specific information. **