
SRA Google meeting 11_3_11


Meeting objectives:

  1. What data can we access?
  2. How do we access it?
  3. What formats is it in?
  4. How is data organized including meta-data sample annotations, and how we can interact with it.
  5. Is it possible to drill down to something akin to a schema description of how the data are organized and begin exploring the database?

Notes on available sequencing data

TCGA

Samples sequenced: 11/1/2011

This data comes from the BAM Telemetry Report (see below).

There are 4883 sequencing files, but an individual may have multiple data types (exome, genome, miRNA, or RNAseq), and occasionally more than one sequence file per data type. There are 1610 subjects that have at least one type of data. I have attached the complete file, which also shows the cancer type.

1610 Number of unique individuals with any type of Sequencing data
1338 Number of unique individuals with Exome data
99 Number of unique individuals with genome data
552 Number of unique individuals with miRNA data
306 Number of unique individuals with RNAseq data
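The per-type counts above are counts of distinct subjects, not of files, since one subject can contribute several files. A minimal sketch of that tally follows, assuming the telemetry file can be reduced to one (subject, data type) pair per sequencing file. The record layout and the toy subject IDs are hypothetical and are not taken from the attached file.

```python
from collections import defaultdict

def count_unique_subjects(records):
    """Count distinct subjects per data type and overall.

    records: iterable of (subject_id, data_type) pairs, one per sequencing file.
    """
    subjects_by_type = defaultdict(set)
    all_subjects = set()
    for subject_id, data_type in records:
        subjects_by_type[data_type].add(subject_id)
        all_subjects.add(subject_id)
    counts = {data_type: len(subjects) for data_type, subjects in subjects_by_type.items()}
    counts["any"] = len(all_subjects)
    return counts

# Hypothetical toy records: subject-A has two exome files and one RNAseq file.
records = [
    ("subject-A", "exome"),
    ("subject-A", "exome"),
    ("subject-A", "RNAseq"),
    ("subject-B", "exome"),
    ("subject-C", "miRNA"),
]
counts = count_unique_subjects(records)
print(counts)
```

Using a set per data type is what keeps a subject with several files of the same type from being counted more than once.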

TCGA Data Portal:

What sequencing data can you find where?

The TCGA Data Coordinating Center (DCC) organizes, stores, and provides access to metadata associated with the sequenced samples, including clinical and biospecimen information. The National Center for Biotechnology Information ([NCBI|http://www.ncbi.nlm.nih.gov/]) organizes, stores, and provides the actual sequence data and associated genotype/phenotype annotations through its Sequence Read Archive ([SRA|http://www.ncbi.nlm.nih.gov/sra]) service and the Database of Genotypes and Phenotypes ([dbGaP|http://www.ncbi.nlm.nih.gov/gap]). The SRA houses raw sequence reads, while dbGaP houses finished alignments in the [BAM format|http://samtools.sourceforge.net/SAM1.pdf].

Data File Submissions

 (Raw data is in the SRA; aligned files are in dbGaP.)

The data-generating centers that use sequencing platforms generate the following sequence-derived data files:

NCBI Deposit Site

||File Type||File Suffix||Data Level||Description||Deposit Site||
|Sequence Read Files|various or fastq| |Sequence read data in their native platform formats (e.g. AB SOLiD, Illumina) or FASTQ format.|SRA|
|Binary-sequence Alignment Format (BAM) files|bam|1|A Binary Alignment/Map (BAM) file is the compressed binary version of the Sequence Alignment/Map (SAM), a compact and indexable representation of nucleotide sequence alignments.|dbGaP|
|Sequence Trace files|scf|1|Sequence trace files contain the raw data output from automated sequencing instruments.|Trace|

DCC Deposit Site

||File Type||File Suffix||Data Level||Description||
|Wiggle (WIG) format files|wig|2|The wiggle (WIG) format describes dense, continuous data such as sequence coverage, GC percent, and probability scores.|
|Mutation Annotation Format (MAF) files|maf|2 or 3|A Mutation Annotation Format (MAF) is a tab-delimited file containing somatic and/or germline mutation annotations. Depending on the type of mutation (somatic or germline) and the state of the mutation (validated or putative), the MAF file can be considered Level 2 or Level 3 data.|
|Variant Call Format (VCF) files|vcf|2 or 3|The Variant Call Format (VCF) is a standardized format for storing the most prevalent types of sequence variation.|
|Trace ID-to-sample relationship files|tr|1|(No description available; the linked page does not exist.)|
|Verbose Coverage File|vcf|2|A verbose coverage file (VCF) provides sequence depth at a mutation locus described in a MAF file.|
|Quantification files|quantification.txt|3|A quantification file provides calculated values for a particular data type based on sequence data. The current data types and quantification formats are based on RNA sequencing results.|
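To make the two deposit-site listings above easier to query, here is a rough sketch of a suffix lookup. The entries are transcribed from those listings; the function name and the tuple layout are my own assumptions, and sequence read files in native platform formats ("various") are not covered because their suffixes are not listed. Note that the `vcf` suffix is ambiguous: it is used for both Variant Call Format files and Verbose Coverage Files.

```python
# Suffix -> list of (file type, data level, deposit site), transcribed from
# the NCBI and DCC listings. Data level is None where the listing leaves it blank.
FILE_TYPES = {
    "fastq": [("Sequence Read Files", None, "SRA")],
    "bam": [("Binary-sequence Alignment Format (BAM) files", "1", "dbGaP")],
    "scf": [("Sequence Trace files", "1", "Trace")],
    "wig": [("Wiggle (WIG) format files", "2", "DCC")],
    "maf": [("Mutation Annotation Format (MAF) files", "2 or 3", "DCC")],
    "vcf": [
        ("Variant Call Format (VCF) files", "2 or 3", "DCC"),
        ("Verbose Coverage File", "2", "DCC"),
    ],
    "tr": [("Trace ID-to-sample relationship files", "1", "DCC")],
    "quantification.txt": [("Quantification files", "3", "DCC")],
}

def lookup_file_type(filename):
    """Return every (file type, data level, deposit site) whose suffix matches."""
    name = filename.lower()
    matches = []
    for suffix, entries in FILE_TYPES.items():
        if name.endswith("." + suffix):
            matches.extend(entries)
    return matches

# Hypothetical file names, used only to exercise the lookup.
for example in ("example.bam", "example.vcf", "example.quantification.txt"):
    print(example, lookup_file_type(example))
```

Returning a list rather than a single entry keeps the `vcf` ambiguity visible instead of silently picking one meaning.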

Mapping Sequence-Based Data

(This is a bit confusing: only a small fraction of the data in the BAM Telemetry Report appears to be in dbGaP.)

BAM Telemetry Report: The total number of cases sequenced for different disease types, molecules, and centers is tabulated, and the most recent incoming sequence data files are listed.

BAM Telemetry Files on TCGA Sequence Data in dbGaP: This report identifies the TCGA sequence data that are available from the dbGaP database.

The State of Whole Genome/Exome/RNA sequencing: 9-2011

WGS.WES.RNAseq_September.2011v2.pptx

The NCBI SRA: An Evaluation

Some interesting comments from the "tree of life" blog http://phylogenomics.blogspot.com/2011/02/though-i-generally-love-ncbi.html

3 general problems

1) Difficult to upload and download data

2) Limited search functionality

3) The data model (study/experiment/sample/run) is not intuitive
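For reference, a minimal sketch of the study/experiment/sample/run model as I understand it: a study groups experiments, each experiment refers to a sample, and the runs holding the actual reads hang off the experiment. The class layout and the placeholder accessions below are assumptions for illustration, not the SRA schema itself.

```python
from dataclasses import dataclass, field
from typing import List

@dataclass
class Sample:
    accession: str

@dataclass
class Run:
    accession: str

@dataclass
class Experiment:
    accession: str
    sample: Sample
    runs: List[Run] = field(default_factory=list)

@dataclass
class Study:
    accession: str
    experiments: List[Experiment] = field(default_factory=list)

def runs_for_study(study):
    """Flatten a study into (sample accession, run accession) pairs."""
    return [
        (experiment.sample.accession, run.accession)
        for experiment in study.experiments
        for run in experiment.runs
    ]

# Placeholder accessions, hypothetical and not real SRA records.
sample = Sample("SAMPLE-1")
experiment = Experiment("EXPERIMENT-1", sample, [Run("RUN-1"), Run("RUN-2")])
study = Study("STUDY-1", [experiment])
pairs = runs_for_study(study)
print(pairs)
```

Getting from a study to its data files means walking two levels down (study to experiment to run) while picking up the sample on the side, which may be part of why the model feels unintuitive.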

DNAnexus SRA: An Evaluation

Same data model as NCBI, and therefore the same problem in evaluating the linearity between a study and its data

Slightly better search capabilities than NCBI
