TCGA Curation Pipeline Design


Steps in the workflow

Identify new or updated TCGA Datasets

Details:

  • Input: n/a
  • Poll http://tcga-data.nci.nih.gov/tcga/ for currently available datasets and their last-modified dates (see the polling sketch after this list)
    • We'll always look for updates within a configurable number of days; any repeat processing will be aborted further down the pipeline
  • Compare this list to the TCGA datasets held in Synapse
    1. for each new dataset
      1. create the dataset metadata in Synapse
      2. crawl the download site to identify the layers
      3. proceed with the workflow for each layer
    2. for each existing dataset with updated layers
      1. crawl the download site (http://tcga-data.nci.nih.gov/tcgafiles/ftp_auth/distro_ftpusers/anonymous/tumor/) to identify the updated/new layers
      2. proceed with the workflow for the updated/new layer(s)
  • Output: (Synapse dataset id, TCGA layer download uri)
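
Below is a hedged sketch of the polling step in Python. The Apache-style directory listing, the date format in the regular expression, and the 7-day lookback constant are all assumptions to be checked against the live site, and the Synapse lookup is left as a stub.

```python
# Hedged sketch (assumptions: Apache-style listing HTML, date format, 7-day window).
import re
import urllib.request
from datetime import datetime, timedelta

TUMOR_ROOT = ("http://tcga-data.nci.nih.gov/tcgafiles/ftp_auth/"
              "distro_ftpusers/anonymous/tumor/")
LOOKBACK_DAYS = 7  # the "configurable number of days" from the step above


def list_entries(url):
    """Yield (name, last_modified) pairs scraped from a directory listing page."""
    html = urllib.request.urlopen(url).read().decode("utf-8", "replace")
    # Assumed row format: <a href="coad/">coad/</a>   03-May-2011 14:22   -
    pattern = r'<a href="([^"]+/)">[^<]*</a>\s+(\d{2}-\w{3}-\d{4} \d{2}:\d{2})'
    for name, stamp in re.findall(pattern, html):
        yield name, datetime.strptime(stamp, "%d-%b-%Y %H:%M")


def find_recent_datasets(lookback_days=LOOKBACK_DAYS):
    """Emit (dataset_name, download_uri) for datasets touched within the window."""
    cutoff = datetime.utcnow() - timedelta(days=lookback_days)
    for name, modified in list_entries(TUMOR_ROOT):
        if modified >= cutoff:
            # Here the real pipeline would look up or create the Synapse dataset
            # and hand (synapse_dataset_id, layer_download_uri) to the next step.
            yield name.rstrip("/"), TUMOR_ROOT + name


if __name__ == "__main__":
    for dataset, uri in find_recent_datasets():
        print(dataset, uri)
```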

Short Term Deliverables:

  • Focus on just the Colon adenocarcinoma (COAD) dataset for now
  • Create the Synapse dataset by hand

Download the dataset layers to S3

Details:

  • Input: (Synapse dataset id, TCGA layer download uri)
  • Just pull down clinical and array data; sequence data will come later
    • bcr/ has clinical info
    • cgcc/ has array platform data
  • Get the MD5 checksum first
    • check whether we already have that checksum stored; if so, skip to the end of this step
  • Download the data file
  • Compute its MD5 and compare it to the published checksum
  • Upload the file first and the checksum second to a subdirectory of the S3 bucket, under tcga/source (the area for source data from TCGA); see the sketch after this list
    • the S3 key is tcga/source concatenated with the path portion of the TCGA download URL
    • if a file of that name already exists in S3
  • Output: (Synapse dataset id, S3 layer md5 uri, S3 layer uri)
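
A rough sketch of the download/verify/upload step follows. The bucket name, local scratch path, and the assumption that the published checksum sits in a companion ".md5" file are placeholders; boto3 is used purely for illustration.

```python
# Hedged sketch: bucket name, local temp path, and the ".md5" companion-file
# convention are assumptions; boto3 is used for illustration only.
import hashlib
import urllib.request
from urllib.parse import urlparse

import boto3
from botocore.exceptions import ClientError

BUCKET = "sagebio-tcga"       # hypothetical bucket name
KEY_PREFIX = "tcga/source"    # prefix for source data pulled from TCGA

s3 = boto3.client("s3")


def s3_key_for(layer_uri):
    """tcga/source concatenated with the path portion of the TCGA download URL."""
    return KEY_PREFIX + urlparse(layer_uri).path


def already_uploaded(key):
    try:
        s3.head_object(Bucket=BUCKET, Key=key)
        return True
    except ClientError:
        return False


def download_and_upload(layer_uri, md5_uri):
    """Verify the layer against its published MD5 and push it to S3 (idempotently)."""
    published_md5 = urllib.request.urlopen(md5_uri).read().decode().split()[0]
    key = s3_key_for(layer_uri)
    if already_uploaded(key):                 # repeat work is skipped here
        return key
    data = urllib.request.urlopen(layer_uri).read()
    if hashlib.md5(data).hexdigest() != published_md5:
        raise ValueError("checksum mismatch for %s" % layer_uri)
    local = "/tmp/" + key.replace("/", "_")   # scratch copy for upload_file
    with open(local, "wb") as f:
        f.write(data)
    s3.upload_file(local, BUCKET, key)                      # file first
    s3.put_object(Bucket=BUCKET, Key=key + ".md5",          # checksum second
                  Body=published_md5.encode())
    return key
```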

Short Term Deliverables:

Create Synapse records for the source layers from TCGA

Details:

  • Input: (Synapse dataset id, S3 layer md5 uri, S3 layer uri)

Process the dataset layers

TODO: more details here. But generally we want to create matrices with patient_sample id as the column header and gene/probeId/etc. as the row header (see the sketch after the deliverables below).

Short Term Deliverables:

  • a matrix of id to probe for Level 2 expression data
  • a matrix of id to gene for Level 3 expression data
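
A minimal sketch of the matrix-building step using pandas. The long-format column names ("barcode", "probe_id", "value", "gene_symbol") are assumptions; real TCGA Level 2/3 file layouts vary by platform and need to be inspected first.

```python
# Illustrative only: column names below are assumptions; real TCGA Level 2/3
# file layouts vary by platform and must be inspected before reuse.
import pandas as pd


def build_matrix(long_file, row_key="probe_id", col_key="barcode", value="value"):
    """Pivot a long-format layer file into a (probe or gene) x sample matrix."""
    df = pd.read_csv(long_file, sep="\t")
    return df.pivot_table(index=row_key, columns=col_key, values=value)


# Level 2 deliverable: probes as rows, sample barcodes as columns.
# level2 = build_matrix("COAD_level2.tsv")
# Level 3 deliverable: genes as rows instead.
# level3 = build_matrix("COAD_level3.tsv", row_key="gene_symbol")
```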

Submit the processed dataset to Synapse

Be sure that the permissions on the dataset are such that the public cannot see it but Sage Scientists can. We cannot redistribute TCGA datasets via the Sage Commons.
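
One possible way to express this restriction is shown below with the present-day synapseclient Python package (which postdates this design). The principal IDs and dataset id are placeholders and must be swapped for the real public principal and the Sage Scientists team before use.

```python
# Placeholder-heavy sketch using today's synapseclient package (which postdates
# this design doc); the ids below must be replaced with real values.
import synapseclient

PUBLIC_PRINCIPAL_ID = 0        # placeholder: the Synapse "public" principal
SAGE_SCIENTISTS_TEAM_ID = 0    # placeholder: the Sage Scientists team/group

syn = synapseclient.Synapse()
syn.login()                    # credentials come from the Synapse config file

dataset_id = "syn0000000"      # placeholder: the TCGA dataset created earlier

# Drop any public access, then grant read/download to Sage scientists only.
syn.setPermissions(dataset_id, PUBLIC_PRINCIPAL_ID, accessType=[])
syn.setPermissions(dataset_id, SAGE_SCIENTISTS_TEAM_ID,
                   accessType=["READ", "DOWNLOAD"])
```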

Deliverables:

  • Once this step is complete, the layers are visible via the Web UI and in searches from the R client.
  • The layers can be downloaded via the Web UI and the R client.
  • We might also consider adding one more task to the workflow to pull a copy of the data from Synapse to Sage's local shared storage.

Notify via email all individuals following TCGA datasets of the new layer(s)

Technologies to Consider

Languages: A combination of Python, R, and Perl (legacy scripts only) should do the trick. This work seems amenable to scripting languages (as opposed to Java). It is a design goal to keep each task small and understandable.

Workflow

  • AWS's Simple Workflow Service
  • Taverna

Remember to ensure that each workflow step is idempotent.
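
One simple pattern for that is sketched below: key each step's work on a stable identifier (for example the layer's MD5) and record a completion marker, so re-running the workflow over the same inputs is a no-op. The marker directory is purely illustrative; S3 objects or Synapse annotations would serve equally well as the durable store.

```python
# Illustration only: the marker directory is a stand-in for whatever durable
# store the pipeline uses to remember completed work.
import os

MARKER_DIR = "/tmp/tcga-pipeline-markers"   # hypothetical marker location


def run_once(step_name, key, work):
    """Run work() only if (step_name, key) has not already completed."""
    os.makedirs(MARKER_DIR, exist_ok=True)
    marker = os.path.join(MARKER_DIR, "%s.%s" % (step_name, key))
    if os.path.exists(marker):
        return                      # already done; re-runs are a no-op
    work()
    open(marker, "w").close()       # record completion only after success


# e.g. run_once("download", layer_md5, lambda: download_and_upload(uri, md5_uri))
```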
