Experimental data is often shared in a variety of formats. Carefully choosing a data format is a great way to extend the impact of your research, by ensuring others can use it in the future.
On this page, we begin by defining the difference between raw data and results. Then, we provide a reference table for different types of data that you might be sharing, followed by a breakdown of the information that we require when uploading your data.
Raw data, processed data, and results
From a reusability perspective, data is more useful to future users than results. Both can be shared, but data matters more for reproducibility and reuse.
We consider data to be raw or partially processed information from a single sample, depending on the type of experiment.
Results are generally post-analysis information from an aggregate of samples or manuscript figures.
For example, if you are sharing RNA-seq information, raw data would be the fastq.gz files, processed data would be the aligned reads (.bam) or gene counts, and differential expression analyses and volcano plots would be considered results. This distinction is well defined for many types of data, but it may be less clear for assays we encounter less often. "Results" might also be acceptable for assays that do not lend themselves to re-analysis, such as western blotting. We can work with you to figure this out.
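As a rough illustration of the RNA-seq example above, the sketch below sorts file names into categories by suffix. The mapping and the function name `categorize` are illustrative assumptions, not part of any portal tooling, and real submissions may need finer rules (for instance, gene count tables and figures are not covered here).

```python
from pathlib import PurePath

# Suffix-to-category mapping based only on the RNA-seq example in the text.
# This mapping is an assumption made for illustration.
RNASEQ_CATEGORIES = {
    ".fastq.gz": "raw data",
    ".bam": "processed data",
}


def categorize(filename: str) -> str:
    """Return the category for a file name, or 'unclassified' if unknown."""
    name = PurePath(filename).name.lower()
    for suffix, category in RNASEQ_CATEGORIES.items():
        if name.endswith(suffix):
            return category
    return "unclassified"
```

Anything that comes back as "unclassified" (a volcano plot image, for example) would need a manual decision about whether it counts as data or results.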
What data are required, and how should you format them?
Please note: many common experimental data types are included in this requirements table, but you may be generating different or novel types of data that are not included here. Please don’t hesitate to reach out and ask us for a recommendation for your type of data if you do not see it mentioned here.
Data type | Requirement | Levels [a] | Format | Notes |
---|---|---|---|---|
DNA | ||||
whole genome sequencing | required | raw OR semi-processed | raw: FASTQ, unaligned BAM, CRAM; semi-processed: aligned BAM | |
whole exome sequencing | required | raw OR semi-processed | raw: FASTQ, unaligned BAM, CRAM; semi-processed: aligned BAM | |
SNP microarray | required | raw AND processed | raw: CEL, IDAT, tsv (raw values per SNP); processed: tsv (genotypes per SNP) | |
immunosequencing | required | raw OR semi-processed | vendor-dependent, e.g. ImmunoSEQ and 10XGenomics formats | |
Sanger sequencing | optional | processed | ||
RNA expression | ||||
RNA sequencing (bulk) | required | raw OR semi-processed | raw: FASTQ, unaligned BAM, CRAM; semi-processed: aligned BAM | |
RNA sequencing (single-cell) | required | raw AND processed | raw: FASTQ; processed: h5ad (HDF5-based) following the cellxgene required format | FASTQ files should be created from BCL files. More documentation on formatting h5ad files can be found here. h5ad is a type of HDF5 file. |
gene expression microarray | required | raw AND processed | raw: CEL, IDAT, tsv (raw values per SNP, copy number, and loss of heterozygosity); processed: tsv (normalized values and purity/ploidy) | |
qPCR | optional | processed | csv/tsv (according to template) | |
methylation | ||||
ATAC sequencing | required | raw OR semi-processed | raw: FASTQ, unaligned BAM, CRAM; semi-processed: aligned BAM | |
methylation array | required | raw OR semi-processed | raw: FASTQ, unaligned BAM, CRAM; semi-processed: aligned BAM | |
bisulfite sequencing | required | raw OR semi-processed | raw: FASTQ, unaligned BAM, CRAM; semi-processed: aligned BAM | |
protein | ||||
LC-MS | required | raw AND processed | raw: mzML; processed: protein intensities (csv/tsv) | |
western blot | optional | processed | densitometry output (csv/tsv) | |
plate-based ELISA | optional | raw | plate reader output (csv/tsv) | |
protein/peptide microarrays | required | processed | label-free quantification matrix (csv/tsv) | |
metabolomics | ||||
LC-MS | required | raw AND processed | raw: mzML or vendor-dependent format; processed: metabolite intensities (csv/tsv) | |
clinical | ||||
structured clinical data | required | processed | csv/tsv or XML with metadata for each variable | key primary and secondary endpoints only |
EEG | required | raw | pending additional comments | |
clinical/imaging | ||||
MRI or other radiological image | required | raw | dicom, nifti, minc | |
imaging | ||||
immunohistochemistry | required | raw | OME-TIFF (preferred); otherwise a Bio-Formats-compatible file format | |
immunofluorescence | required | raw | OME-TIFF (preferred); otherwise a Bio-Formats-compatible file format | |
gross morphology photos (mice) | optional | raw | tiff, png, jpg | |
in vitro drug screening | ||||
plate-based cell viability assay | required | processed | csv/tsv (according to template) | |
other | ||||
flow cytometry | optional | raw | FCS with gating parameters | |
in vivo tumor growth experiments | optional | raw OR processed | csv/tsv (according to template); raw: tumor dimensions or other raw measurements; processed: calculated tumor volume/size | |
[a] Level nomenclature can be cross-referenced with https://gdc.cancer.gov/resources-tcga-users/tcga-code-tables/data-levels, where 'raw' corresponds to Level 1 and 'semi-processed' most closely corresponds to Level 2.
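The Levels column combines data levels with AND or OR. One possible reading, sketched below, is that "raw AND processed" asks for both levels while "raw OR semi-processed" accepts either one. This interpretation and the helper name `levels_satisfied` are assumptions made for illustration; the table itself is the authority.

```python
def levels_satisfied(level_rule: str, provided: set[str]) -> bool:
    """Check a set of provided data levels against a rule from the table.

    Assumes 'A AND B' means both levels are needed and 'A OR B' means
    at least one is needed. A rule with neither keyword names one level.
    """
    if " AND " in level_rule:
        return all(part.strip() in provided for part in level_rule.split(" AND "))
    if " OR " in level_rule:
        return any(part.strip() in provided for part in level_rule.split(" OR "))
    return level_rule.strip() in provided
```

Under this reading, a single-cell RNA-seq submission containing only FASTQ files would not satisfy "raw AND processed", while a bulk RNA-seq submission containing only aligned BAM files would satisfy "raw OR semi-processed".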
Metadata requirements
To share your data on the NF Data Portal, we require annotations as defined in the assay-specific manifests.
Note: Only files whose resourceType annotation is set to experimentalData are visible in the portal.
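Before uploading, it can help to check which manifest rows would be visible under this rule. The sketch below assumes a CSV manifest with a resourceType column; the fileName column, the "report" value, and the function name `visible_rows` are hypothetical and used only for the example, so check the assay-specific manifest for the actual column names.

```python
import csv
import io


def visible_rows(manifest_csv: str) -> list[dict]:
    """Return manifest rows whose resourceType is experimentalData."""
    reader = csv.DictReader(io.StringIO(manifest_csv))
    return [row for row in reader if row.get("resourceType") == "experimentalData"]


# Hypothetical two-row manifest used only to demonstrate the filter.
example_manifest = """fileName,resourceType
sample1.fastq.gz,experimentalData
protocol.pdf,report
"""

visible_names = [row["fileName"] for row in visible_rows(example_manifest)]
```

Rows that are missing the resourceType annotation entirely are treated as not visible in this sketch.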