Steps in the workflow
Identify new TCGA Datasets
- Poll http://tcga-data.nci.nih.gov/tcga/ for currently available datasets
- Compare this list to the TCGA datasets held in Synapse
- For each new dataset, proceed with the workflow
- For each updated dataset, proceed with the workflow, and bump the dataset's version number in Synapse
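The comparison step above can be sketched as a simple dictionary comparison. This is only an illustration, not a settled design: it assumes each side can be reduced to a mapping from dataset name to some revision marker, and the function name `classify_datasets` and the sample names are hypothetical. How the TCGA listing and the Synapse listing are actually fetched is left open.

```python
def classify_datasets(available, held):
    """Split available datasets into new and updated ones.

    Both arguments are assumed (hypothetically) to map a dataset name
    to a revision marker such as a version string or modification date.
    `available` describes what TCGA currently offers, `held` what
    Synapse already holds.
    """
    # New: offered by TCGA but not present in Synapse at all.
    new = sorted(name for name in available if name not in held)
    # Updated: present on both sides, but the revision marker differs.
    updated = sorted(
        name for name in available
        if name in held and available[name] != held[name]
    )
    return new, updated


if __name__ == "__main__":
    # Made-up sample data, for illustration only.
    available = {"dataset_a": "2", "dataset_b": "1", "dataset_c": "1"}
    held = {"dataset_a": "1", "dataset_b": "1"}
    new, updated = classify_datasets(available, held)
    print("new:", new)
    print("updated:", updated)
```

Datasets in the `updated` list are the ones whose Synapse version number needs to be bumped.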
Download the dataset to S3
Source location: http://tcga-data.nci.nih.gov/tcgafiles/ftp_auth/distro_ftpusers/anonymous/tumor/
Create a new S3 bucket for raw, uncurated data and use it as the landing location for the download.
Verify the MD5 checksum of each downloaded file.
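One way to do the MD5 check is sketched below using Python's standard `hashlib`. It assumes the expected checksum is available as a hex string (for example, read from a checksum file that accompanies the download); the helper names `md5_of_file` and `verify_md5` are hypothetical.

```python
import hashlib


def md5_of_file(path, chunk_size=1024 * 1024):
    """Return the hex MD5 digest of a file, reading it in chunks."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        # Read fixed-size chunks so large archives do not need to fit in memory.
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def verify_md5(path, expected_hex):
    """Return True if the file's MD5 matches the expected hex digest."""
    # Normalise whitespace and case before comparing.
    return md5_of_file(path) == expected_hex.strip().lower()
```

A download that fails this check should be retried or flagged rather than passed on to processing.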
Process the dataset
TODO: get details from Justin on what this entails.
Submit the processed dataset to Synapse
Set the permissions on the dataset so that Sage scientists can see it but the public cannot. We cannot redistribute TCGA datasets via the Sage Commons.
Notify all individuals following TCGA datasets of the new dataset
Technologies to Consider
- Amazon Simple Workflow Service (SWF)
- Taverna