
SRA Google meeting 11_3_11

THIS PAGE HAS MOVED

See https://sites.google.com/a/google.com/sage-google/projects/sra-data

Context

  1. Sage scientific expertise
    1. Sage expertise in data intensive biological analysis and interest in initiating large-scale sequencing analysis program.
  2. Google engineering expertise
    1. Google tools that could enable query and analysis of large-scale (tera- or petabyte) data.

We will think hard about the scientific and engineering challenges once we have started. I am confident that if the data are transparently queryable, we can make a "go" decision and commit to pursuing the collaboration. Data access, however, is a major challenge given regulatory hurdles (e.g. dbGaP) and the difficulty of organizing the data. I would therefore like to focus this hour on determining, in practical terms, our ability to interact with SRA data through Google tools, which will allow us to decide on next steps.

Meeting objectives:

  1. What data can we access?
  2. How do we handle dbGaP-type access restrictions if we want to do meta-analysis across all of SRA?
  3. How do we access data?
  4. What formats are data in?
  5. How are the data organized, including metadata and sample annotations, and how can we interact with them?
  6. Is it possible to drill down to something akin to a schema description of how the data are organized and begin exploring the database?

** B. Bot - a few notes - I think we should steer clear of the "can we get our hands on the data" questions, which many of the above seem to allude to. We may want to steer the questions (first) more towards what we can do with these data, and then drill down to the details listed above later. I also think it is important to go into this meeting thinking about the long term, as in "what can we do over time as these data continue to evolve and become even richer" as opposed to "what can we do right now". There are plenty of other groups with access to the SRA data who I'm sure are thinking about the 'now'. I believe the long-term vision is where the true value of this possible collaboration lies. **

Notes on available sequencing data

TCGA

Samples sequenced: 11/1/2011

These data are from the BAM Telemetry Report (see below).

There are 4883 sequencing files, but an individual may have multiple data types (exome, genome, miRNA, or RNAseq), and occasionally more than one sequence file per data type. There are 1610 subjects that have at least one type of data. I have attached the complete file, which also shows cancer type.

1610 Number of unique individuals with any type of Sequencing data
1338 Number of unique individuals with Exome data
99 Number of unique individuals with genome data
552 Number of unique individuals with miRNA data
306 Number of unique individuals with RNAseq data
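The tallies above distinguish files from individuals. As a minimal sketch of how such counts could be derived from a per-file listing, assuming each file record carries a subject identifier and a data type (the record layout and the IDs below are hypothetical, not the actual telemetry report format):

```python
from collections import defaultdict

# Hypothetical records: (subject_id, data_type), one per sequence file.
# The IDs are made up for illustration; they are not real TCGA identifiers.
files = [
    ("subject-A", "exome"),
    ("subject-A", "exome"),   # second exome file for the same individual
    ("subject-A", "RNAseq"),
    ("subject-B", "genome"),
    ("subject-C", "miRNA"),
    ("subject-C", "exome"),
]

def unique_individuals(records):
    """Return (individuals with any data, {data_type: individual count})."""
    by_type = defaultdict(set)
    for subject_id, data_type in records:
        by_type[data_type].add(subject_id)
    # Union of all per-type sets gives individuals with at least one data type.
    any_type = set().union(*by_type.values()) if by_type else set()
    return len(any_type), {t: len(s) for t, s in by_type.items()}

total, per_type = unique_individuals(files)
print(total, per_type)
```

Because one individual can appear under several data types, the per-type counts need not sum to the overall count; in the figures above, 1338 + 99 + 552 + 306 = 2295 against 1610 unique individuals.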

TCGA Data Portal:

What sequencing data can you find where?

The TCGA Data Coordinating Center (DCC) organizes, stores, and provides access to metadata associated with the sequenced samples, including clinical and biospecimen information. The National Center for Biotechnology Information ([NCBI|http://www.ncbi.nlm.nih.gov/]) organizes, stores, and provides actual sequence data and associated genotype/phenotype annotations through its Sequence Read Archive ([SRA|http://www.ncbi.nlm.nih.gov/sra]) service and the Database of Genotypes and Phenotypes ([dbGaP|http://www.ncbi.nlm.nih.gov/gap]). The SRA houses raw sequence reads, while dbGaP houses finished alignments in the [BAM format|http://samtools.sourceforge.net/SAM1.pdf].

Data File Submissions

(Raw data are in the SRA; aligned files are in dbGaP.)

The data-generating centers that use sequencing platforms generate the following sequence-derived data files:

NCBI Deposit Site

|| File Type || File Suffix || Data Level || Description || Deposit Site ||
| Sequence Read Files | various or fastq | | Sequence read data in their native platform formats (e.g. AB SOLiD, Illumina) or FASTQ format. | SRA |
| Binary-sequence Alignment Format (BAM) files | bam | 1 | A Binary Alignment/Map (BAM) file is the compressed binary version of the Sequence Alignment/Map (SAM), a compact and indexable representation of nucleotide sequence alignments. | dbGaP |
| Sequence Trace files | scf | 1 | Sequence trace files contain the raw data output from automated sequencing instruments. | Trace |

DCC Deposit Site

|| File Type || File Suffix || Data Level || Description ||
| Wiggle (WIG) format files | wig | 2 | The wiggle (WIG) format describes dense, continuous data such as sequence coverage, GC percent, and probability scores. |
| Mutation Annotation Format (MAF) files | maf | 2 or 3 | A Mutation Annotation Format (MAF) is a tab-delimited file containing somatic and/or germline mutation annotations. Depending on the type of mutation (somatic or germline) and the state of the mutation (validated or putative), the MAF file can be considered Level 2 or Level 3 data. |
| Variant Call Format (VCF) files | vcf | 2 or 3 | The Variant Call Format (VCF) is a standardized format for storing the most prevalent types of sequence variation. |
| Trace ID-to-sample relationship files | tr | 1 | (No description: the linked page "Trace ID-to Sample Relationship File" does not exist.) |
| Verbose Coverage File | vcf | 2 | A verbose coverage file (VCF) provides sequence depth at a mutation locus described in a MAF file. |
| Quantification files | quantification.txt | 3 | A quantification file provides calculated values for a particular data type based on sequence data. The current data types and quantification formats are based on RNA sequencing results. |
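The file types listed above can be treated as a small lookup table keyed by file suffix. The sketch below is illustrative only: the tuple layout and the `candidates` helper are hypothetical, "DCC" is taken from the section heading as the deposit site for the second group, and the "various" native read formats are left out because they have no single suffix.

```python
# (file type, suffix, data level, deposit site), transcribed from the listing above.
# A data level of None means the listing leaves the level blank.
FILE_TYPES = [
    ("Sequence Read Files", "fastq", None, "SRA"),
    ("BAM files", "bam", "1", "dbGaP"),
    ("Sequence Trace files", "scf", "1", "Trace"),
    ("WIG files", "wig", "2", "DCC"),
    ("MAF files", "maf", "2 or 3", "DCC"),
    ("Variant Call Format files", "vcf", "2 or 3", "DCC"),
    ("Trace ID-to-sample relationship files", "tr", "1", "DCC"),
    ("Verbose Coverage File", "vcf", "2", "DCC"),
    ("Quantification files", "quantification.txt", "3", "DCC"),
]

def candidates(filename):
    """Return every listed file type whose suffix matches the end of filename."""
    name = filename.lower()
    return [row for row in FILE_TYPES if name.endswith("." + row[1])]

# Hypothetical file names, used only to exercise the lookup.
for example in ("example.bam", "example.vcf"):
    print(example, [row[0] for row in candidates(example)])
```

The lookup returns a list rather than a single entry because the `vcf` suffix is used for both Variant Call Format files and Verbose Coverage Files, so the suffix alone does not identify the file type.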

Mapping Sequence-Based Data

(This is a bit confusing: only a small fraction of the data in the BAM Telemetry Report appears to be in dbGaP.)

BAM Telemetry Report: tabulates the total number of cases sequenced for different disease types, molecules, and centers, and lists the most recent incoming sequence data files.

BAM Telemetry Files on TCGA Sequence Data in dbGaP: this report identifies TCGA sequence data that are available from the dbGaP database.

The State of Whole Genome/Exome/RNA sequencing: 9-2011

WGS.WES.RNAseq_September.2011v2.pptx

The NCBI SRA: An Evaluation

Some interesting comments from the "tree of life" blog http://phylogenomics.blogspot.com/2011/02/though-i-generally-love-ncbi.html

3 general problems

1) Difficult to upload and download data

2) Limited search functionality

3) The data model (study/experiment/sample/run) is not intuitive
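To make the third point concrete, here is a minimal sketch of the study/experiment/sample/run hierarchy. It assumes the usual reading of the model, in which a study groups experiments, each experiment refers to one sample, and each experiment owns one or more runs that hold the actual reads. The class names, fields, and accession strings are hypothetical and are not the SRA schema.

```python
from dataclasses import dataclass, field
from typing import List

@dataclass
class Run:
    accession: str  # a run holds the actual sequence reads

@dataclass
class Sample:
    accession: str  # the biological material that was sequenced

@dataclass
class Experiment:
    accession: str
    sample: Sample  # assumed: one sample per experiment
    runs: List[Run] = field(default_factory=list)

@dataclass
class Study:
    accession: str
    experiments: List[Experiment] = field(default_factory=list)

def runs_for_study(study):
    """Collect (sample accession, run accession) pairs for every run in a study."""
    return [
        (exp.sample.accession, run.accession)
        for exp in study.experiments
        for run in exp.runs
    ]

# Placeholder accessions, not real SRA identifiers.
study = Study("STUDY-1", [
    Experiment("EXP-1", Sample("SAMPLE-1"), [Run("RUN-1"), Run("RUN-2")]),
    Experiment("EXP-2", Sample("SAMPLE-2"), [Run("RUN-3")]),
])
print(runs_for_study(study))
```

Even in this toy form, getting from a study to its data takes two hops (study to experiment, experiment to run), with the sample attached on the side, which may be part of why the model feels unintuitive.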

DNAnexus SRA: An Evaluation

Same data model as NCBI and therefore the same problem in tracing the relationship between a study and its data

Slightly better search capabilities than NCBI

Bioconductor tools

To do: Check out Bioconductor tools for SRA access: http://bioconductor.org/help/workflows/high-throughput-sequencing/

** B. Bot - The Bioconductor tools are nice for querying the SRA database for information about studies, samples, etc., as Mette has done through the web UI, but I'm not sure there is much functionality there for accessing the raw data files. This entire project was put on hold when the future of the SRA was in jeopardy earlier this year. I am in contact with one of the co-authors of the package, though, if we need more specific information. **