TCGA Curation Pipeline Design

Steps in the workflow

Identify new TCGA Datasets

  1. Poll http://tcga-data.nci.nih.gov/tcga/ for currently available datasets
  2. Compare this list to the TCGA datasets held in Synapse
    • for each new dataset, proceed with the workflow
    • for each updated dataset, proceed with the workflow and increment the dataset's version number in Synapse
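
The comparison in step 2 could be sketched roughly as follows. This is a minimal sketch, not a settled design: it assumes each dataset can be reduced to a name and a revision string, and the function name `classify_datasets` and the sample dataset names are hypothetical. How the two listings are actually fetched from TCGA and Synapse is left out.

```python
from typing import Dict, List, Tuple


def classify_datasets(
    available: Dict[str, str], held: Dict[str, str]
) -> Tuple[List[str], List[str]]:
    """Return (new, updated) dataset names, each sorted alphabetically.

    Both arguments map a dataset name to a revision string (an assumption;
    the real listings may identify datasets differently).
    """
    # New: listed by TCGA but not yet held in Synapse.
    new = sorted(name for name in available if name not in held)
    # Updated: held in Synapse, but the revision listed by TCGA differs.
    updated = sorted(
        name
        for name in available
        if name in held and available[name] != held[name]
    )
    return new, updated


if __name__ == "__main__":
    # Hypothetical sample data, for illustration only.
    available = {"coad_rnaseq": "1.2.0", "gbm_cnv": "1.0.0", "ov_mirna": "1.1.0"}
    held = {"coad_rnaseq": "1.1.0", "ov_mirna": "1.1.0"}
    new, updated = classify_datasets(available, held)
    print("new:", new)
    print("updated:", updated)
```

Datasets in the second list are the ones whose Synapse version number needs to be incremented. Datasets held in Synapse but no longer listed by TCGA are ignored here, since the workflow above does not say what to do with them.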

Download the dataset to S3

http://tcga-data.nci.nih.gov/tcgafiles/ftp_auth/distro_ftpusers/anonymous/tumor/

Create a new S3 bucket for raw, uncurated data and use it as the landing location for the download.
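
One way to keep the landing location predictable is to derive the S3 key from the download URL, so the raw bucket mirrors the TCGA directory layout. The sketch below assumes that convention. The bucket name `tcga-raw-uncurated`, the function name `s3_location`, and the example file path are all hypothetical, and the actual transfer to S3 is not shown.

```python
from typing import Tuple

TCGA_ROOT = (
    "http://tcga-data.nci.nih.gov/tcgafiles/ftp_auth/"
    "distro_ftpusers/anonymous/tumor/"
)
RAW_BUCKET = "tcga-raw-uncurated"  # hypothetical bucket name


def s3_location(
    url: str, root: str = TCGA_ROOT, bucket: str = RAW_BUCKET
) -> Tuple[str, str]:
    """Map a TCGA file URL to a (bucket, key) pair in the raw-data bucket."""
    if not url.startswith(root):
        raise ValueError("URL is not under the TCGA download root: " + url)
    relative = url[len(root):]
    if not relative or relative.endswith("/"):
        raise ValueError("URL does not point at a file: " + url)
    # Keep the path below the root so the bucket mirrors the TCGA layout.
    return bucket, "tumor/" + relative


if __name__ == "__main__":
    # Hypothetical file path, for illustration only.
    example = TCGA_ROOT + "gbm/example/archive.tar.gz"
    print(s3_location(example))
```

Keeping raw data in its own bucket, under keys that can be traced back to the source URL, makes it straightforward to re-run processing without downloading again.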

Process the dataset

TODO get details from Justin as to what this entails

Submit the processed dataset to Synapse

Set the permissions on the dataset so that Sage Scientists can see it but the public cannot. We cannot redistribute TCGA datasets via the Sage Commons.
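
A check along these lines could run just before submission to catch a misconfigured dataset. It is only a sketch: the access list is modelled as a plain dictionary from principal name to a set of access types, which is an assumption and does not reflect Synapse's actual API. The principal names `PUBLIC` and `Sage Scientists`, the `READ` access type, and the function name `permission_problems` are likewise hypothetical.

```python
from typing import Dict, List, Set

PUBLIC = "PUBLIC"                    # hypothetical principal for anonymous users
SAGE_SCIENTISTS = "Sage Scientists"  # hypothetical group name


def permission_problems(acl: Dict[str, Set[str]]) -> List[str]:
    """Return a list of policy violations; an empty list means the ACL is acceptable."""
    problems = []
    # TCGA data must not be visible to the public.
    if "READ" in acl.get(PUBLIC, set()):
        problems.append("public can read the dataset")
    # Sage Scientists must be able to see the dataset.
    if "READ" not in acl.get(SAGE_SCIENTISTS, set()):
        problems.append("Sage Scientists cannot read the dataset")
    return problems


if __name__ == "__main__":
    acl = {SAGE_SCIENTISTS: {"READ"}}
    print(permission_problems(acl))
```

Returning the list of problems, rather than a single boolean, lets the workflow log exactly which rule was violated before it refuses to submit.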

Notify all individuals following TCGA datasets of the new dataset

Technologies to Consider

  • AWS's Simple Workflow Service
  • Taverna