Steps in the workflow

Identify new or updated TCGA Datasets

Details:

  • Input: n/a
  • Poll http://tcga-data.nci.nih.gov/tcga/ for currently available datasets and their last modified date
    • We'll always look for updates within a configurable number of days; any repeat processing will be aborted further down the pipeline
  • Compare this list to the TCGA datasets held in Synapse
    1. for each new dataset
      1. create the dataset metadata in Synapse
      2. crawl the download site to identify the layers
      3. proceed with the workflow for each layer
    2. for each updated layer in an existing dataset
      1. crawl the [download site|http://tcga-data.nci.nih.gov/tcgafiles/ftp_auth/distro_ftpusers/anonymous/tumor/] to identify the updated/new layers
      2. proceed with the workflow step "Download the dataset to S3" for the updated/new layer(s)
  • Output: (Synapse dataset id, TCGA layer download uri)
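The comparison step above could be sketched roughly as follows. This is only an illustration: `classify_datasets`, its arguments, and the 7-day default are hypothetical names and values, and the polling of the TCGA site and the Synapse lookup are assumed to have already produced the two inputs.

```python
from datetime import datetime, timedelta


def classify_datasets(polled, known, now, lookback_days=7):
    """Split polled TCGA datasets into new ones and recently updated ones.

    polled: dict of dataset name -> last modified datetime (from the TCGA site)
    known:  collection of dataset names already held in Synapse
    lookback_days: the configurable window; assumed default, not from the spec
    """
    cutoff = now - timedelta(days=lookback_days)
    new, updated = [], []
    for name, modified in sorted(polled.items()):
        if name not in known:
            new.append(name)
        elif modified >= cutoff:
            # May repeat work already done; later steps are expected to abort it.
            updated.append(name)
    return new, updated
```

Datasets that fall inside the window are reported every time, which matches the note that repeat processing is aborted further down the pipeline rather than filtered here.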

Short Term Deliverables:

  • Focus on just the Colon adenocarcinoma (COAD) dataset for now
  • Create the Synapse dataset by hand

Download the dataset layers to S3

Details:

  • Input: (Synapse dataset id, TCGA layer download uri)
  • Just pull down clinical and array data; sequence data will come later
    • bcr/ has clinical info
    • cgcc/ has array platform data
  • Get the MD5 checksum first
    • see if we already have that checksum stored; if so, skip to the end of this step
  • Download the data file
  • Compute its MD5 and compare it to the checksum retrieved earlier
  • Upload the file first and the checksum second to a subdirectory in the S3 bucket tcga/source for source data from TCGA
    • the filename is tcga/source concatenated with the path portion of the TCGA download URL
    • if a file of that name already exists in S3
  • Output: (Synapse dataset id, S3 layer md5 uri, S3 layer uri)
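A minimal sketch of the checksum and naming logic in this step, using only the Python standard library. The function names are hypothetical, the published checksum is assumed to be available as a bare hex string, and the actual download and S3 upload are left out.

```python
import hashlib
from urllib.parse import urlparse

S3_PREFIX = "tcga/source"  # landing location named in the step above


def md5_hex(path, chunk_size=1 << 20):
    """Compute the MD5 of a local file without loading it all into memory."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def needs_download(published_md5, stored_md5s):
    """True unless we already hold a layer with this checksum."""
    return published_md5.strip().lower() not in stored_md5s


def verify_download(path, published_md5):
    """Compare the downloaded file's MD5 with the published checksum."""
    return md5_hex(path) == published_md5.strip().lower()


def s3_key_for(download_url, prefix=S3_PREFIX):
    """Prefix concatenated with the path portion of the TCGA download URL."""
    path = urlparse(download_url).path
    return prefix.rstrip("/") + "/" + path.lstrip("/")
```

Checking `needs_download` before fetching anything is what lets a repeated run skip to the end of the step.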

Short Term Deliverables:

  • Create a new S3 bucket for raw, uncurated data and use that as the landing location for the download

Create Synapse record for the source Layers from TCGA

Details:

  • Input: (Synapse dataset id, S3 layer md5 uri, S3 layer uri)

Process the dataset layers

TODO get more details from Justin as to what this entails. But generally we want to create matrices with patient_sample id as the column header and gene/probeId/etc. as the row header.

Short Term Deliverables:

  • a matrix of id to probe for Level 2 expression data
  • a matrix of id to gene for Level 3 expression data
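As an illustration of the matrix shape described in this step, here is a small sketch in plain Python. The input format (one `(patient_sample_id, feature_id, value)` tuple per measurement) and the name `build_matrix` are assumptions; the real processing details are still to be defined.

```python
def build_matrix(records):
    """Pivot measurements into rows of features and columns of samples.

    records: iterable of (patient_sample_id, feature_id, value) tuples,
             where feature_id is a gene, probe id, or similar.
    Returns a list of rows; the first row is the column header.
    """
    records = list(records)
    samples = sorted({sample for sample, _, _ in records})
    features = sorted({feature for _, feature, _ in records})
    values = {(feature, sample): value for sample, feature, value in records}

    header = ["feature"] + samples
    rows = [
        # Missing measurements are left as None rather than guessed.
        [feature] + [values.get((feature, sample)) for sample in samples]
        for feature in features
    ]
    return [header] + rows
```

The same function would cover both deliverables: probe ids as rows for Level 2 data and gene names as rows for Level 3 data.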

Submit the processed dataset to Synapse

Be sure that the permissions on the dataset are such that the public cannot see it but Sage Scientists can. We cannot redistribute TCGA datasets via the Sage Commons.

Deliverable:

  • Once this step is complete the layers are now visible via the Web UI and when you search with the R client.
  • The layer can be downloaded via the Web UI and the R client.
  • We might also consider adding one more task to the workflow to just pull a copy of the data from Synapse to Sage's local shared storage

Notify via email all individuals following TCGA datasets of the new layer(s)

Technologies to Consider

Languages: A combination of Python, R, and Perl (only legacy scripts) should do the trick. This seems amenable to scripting languages (as opposed to Java). It is a design goal to keep each task small and understandable.

Workflow

  • AWS's Simple Workflow Service
  • Taverna

Remember to ensure that each workflow step is idempotent.
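One way to picture idempotency for a step is a wrapper that records which inputs have already been handled. This is a sketch only: `run_once` and the in-memory `completed` dict are hypothetical, and a real workflow would need a persistent record (for example, the stored checksums).

```python
def run_once(step, key, completed):
    """Run step(key) unless a result for key is already recorded.

    completed: dict of key -> result; stands in for persistent storage.
    """
    if key in completed:
        return completed[key]  # repeat request: no work is redone
    result = step(key)
    completed[key] = result
    return result
```

Calling `run_once` twice with the same key runs the step a single time and returns the same result both times, so a retried or duplicated task leaves the system unchanged.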