...

For the purposes of this portal, we define key data as data that, when shared in a raw or semi-processed format, is of sufficient size or complexity OR can be combined with similar data such that it can be mined for additional knowledge beyond the primary research question.

For example, a single Western blot image is typically not key data, because it can answer only a handful of questions, typically all related to the protein that was assayed, and it is difficult to combine with many other Western blots to create a resource that can be mined. On the other hand, a collection of 5 whole slide images of patient tumor sections would likely be key data, because many questions could be asked of the data that were not examined in the study that generated the data.


As a rough rule of thumb, you might ask yourself: if I were not doing this experiment myself, would I still want access to the raw data, either to combine it with other data or to ask my own questions about it? Or would a figure in a publication suffice? If the former, it’s probably key data. If the latter, it’s probably optional.

...

  1. Dataset contains data generated using high-throughput methods that output raw data presented in a widely used systematic format, and has more than just one or two samples. See the table below for examples!

  2. Dataset is considered to be validation data for a new method that is being developed under the funded grant.

  3. Dataset is specifically deemed of interest by the investigator for some other reason, e.g. particularly unique or irreproducible data.

  4. Dataset is specifically deemed of interest by the funder for some other reason.
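
As an illustration only, the four criteria above could be recorded per dataset and checked programmatically. The sketch below is hypothetical: the `Dataset` class, its field names, and the reading of "more than just one or two samples" as `n_samples > 2` are assumptions made for this example, not part of the portal. It reports which criteria a dataset matches and leaves the final decision to the reader.

```python
from dataclasses import dataclass


@dataclass
class Dataset:
    """Hypothetical description of a dataset; all field names are illustrative."""
    high_throughput: bool = False          # generated with a high-throughput method
    systematic_raw_format: bool = False    # raw output in a widely used systematic format
    n_samples: int = 0
    validates_new_method: bool = False     # validation data for a method developed in the grant
    flagged_by_investigator: bool = False  # e.g. unique or irreproducible data
    flagged_by_funder: bool = False


def matched_criteria(ds: Dataset) -> list[int]:
    """Return the numbers (1-4) of the criteria that the dataset matches."""
    checks = {
        # Assumption: "more than just one or two samples" is read as n_samples > 2.
        1: ds.high_throughput and ds.systematic_raw_format and ds.n_samples > 2,
        2: ds.validates_new_method,
        3: ds.flagged_by_investigator,
        4: ds.flagged_by_funder,
    }
    return [number for number, ok in checks.items() if ok]


# Five whole slide images from a high-throughput scanner match criterion 1.
slides = Dataset(high_throughput=True, systematic_raw_format=True, n_samples=5)
print(matched_criteria(slides))
```

A single Western blot with no other flags would return an empty list under these assumptions, which is consistent with the example given earlier on this page.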

...

| Data type | Requirement | Levels (a) | Format | Notes |
|---|---|---|---|---|
| DNA | | | | |
| whole genome sequencing | required | raw OR semi-processed | raw: FASTQ, unaligned BAM, CRAM; semi-processed: aligned BAM | |
| whole exome sequencing | required | raw OR semi-processed | raw: FASTQ, unaligned BAM, CRAM; semi-processed: aligned BAM | |
| SNP microarray | required | raw AND processed | raw: CEL, IDAT, tsv (raw values per SNP); processed: tsv (genotypes per SNP) | |
| immunosequencing | required | raw OR semi-processed | vendor-dependent, e.g. ImmunoSEQ and 10x Genomics formats | |
| Sanger sequencing | optional | processed | | |
| RNA expression | | | | |
| RNA sequencing (bulk) | required | raw OR semi-processed AND processed | raw: FASTQ, unaligned BAM, CRAM; semi-processed: aligned BAM; processed: counts matrices or quantification files | quantification files: like the quant.sf files generated by Salmon-based RNA-seq workflows |
| RNA sequencing (single-cell) | required | raw AND processed | raw: FASTQ; processed: h5ad/HDF5 format following the cellxgene required format | FASTQ files should be created from BCL files with a program like cellranger mkfastq. More documentation on formatting h5ad files can be found here. The h5ad format is a type of HDF5 file. |
| gene expression microarray | required | raw AND processed | raw: CEL, IDAT, tsv (raw values per SNP, copy number, and loss of heterozygosity); processed: tsv (normalized values and purity/ploidy) | |
| qPCR | optional | processed | csv/tsv (according to template) | |
| methylation | | | | |
| ATAC sequencing | required | raw OR semi-processed | raw: FASTQ, unaligned BAM, CRAM; semi-processed: aligned BAM | |
| methylation array | required | raw OR semi-processed | raw: FASTQ, unaligned BAM, CRAM; semi-processed: aligned BAM | |
| bisulfite sequencing | required | raw OR semi-processed | raw: FASTQ, unaligned BAM, CRAM; semi-processed: aligned BAM | |
| protein | | | | |
| LC-MS | required | raw AND processed | raw: mzML; processed: protein intensities (csv/tsv) | https://www.psidev.info/mzML |
| western blot | optional | processed | densitometry output (csv/tsv) | |
| plate-based ELISA | optional | raw | plate reader output (csv/tsv) | |
| protein/peptide microarrays | required | processed | label-free quantification matrix (csv/tsv) | |
| metabolomics | | | | |
| LC-MS | required | raw AND processed | raw: mzML or vendor-dependent format; processed: metabolite intensities (csv/tsv) | |
| clinical | | | | |
| structured clinical data | required | processed | csv/tsv or XML with metadata for each variable | key primary and secondary endpoints only |
| EEG | required | raw | pending additional comments | |
| clinical/imaging | | | | |
| MRI or other radiological image | required | raw | DICOM, NIfTI, MINC | |
| imaging | | | | |
| immunohistochemistry | required | raw | OME-TIFF (preferred), or at least a Bio-Formats compatible file format | |
| immunofluorescence | required | raw | OME-TIFF (preferred), or at least a Bio-Formats compatible file format | |
| gross morphology photos (mice) | optional | raw | tiff, png, jpg | |
| in vitro drug screening | | | | |
| plate-based cell viability assay | required | processed | csv/tsv (according to template) | |
| other | | | | |
| flow cytometry | optional | raw | FCS with gating parameters | |
| in vivo tumor growth experiments | optional | raw OR processed | csv/tsv (according to template); raw: tumor dimensions or other raw measurements; processed: calculated tumor volume/size | |

(a) Level nomenclature can be cross-referenced with https://gdc.cancer.gov/resources-tcga-users/tcga-code-tables/data-levels , where 'raw' corresponds to Level 1 and 'semi-processed' most closely corresponds to Level 2.
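
The level requirements in the table can be thought of as groups of alternatives that must all be satisfied. The following minimal sketch encodes a few rows that way and reports which groups a submission is still missing. The names `AssayRule`, `RULES`, and `missing_levels` are hypothetical, and reading "raw OR semi-processed AND processed" as "(raw OR semi-processed) AND processed" is an assumption made for this example, not a statement of portal policy.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class AssayRule:
    requirement: str  # "required" or "optional"
    # Each inner tuple is a group of alternatives (OR); every group must be met (AND).
    level_groups: tuple


# A few rows transcribed from the table; not an exhaustive list.
RULES = {
    "whole genome sequencing": AssayRule("required", (("raw", "semi-processed"),)),
    "SNP microarray": AssayRule("required", (("raw",), ("processed",))),
    # Assumption: "raw OR semi-processed AND processed" means (raw OR semi-processed) AND processed.
    "RNA sequencing (bulk)": AssayRule("required", (("raw", "semi-processed"), ("processed",))),
    "western blot": AssayRule("optional", (("processed",),)),
}


def missing_levels(assay: str, provided_levels) -> list:
    """Return the level groups for `assay` that none of the provided levels satisfy."""
    provided = set(provided_levels)
    return [group for group in RULES[assay].level_groups if not provided.intersection(group)]


# A bulk RNA-seq submission with only FASTQ files still lacks the processed level.
print(missing_levels("RNA sequencing (bulk)", ["raw"]))
```

An empty result means every level group for that assay is covered. Whether a missing group blocks a submission would still depend on the assay being listed as required or optional.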

...