...

Identify new or updated TCGA Datasets

Details:

  • Input: n/a
  • Poll http://tcga-data.nci.nih.gov/tcga/ for currently available datasets and their last modified date
    • We'll always look for updates within a configurable number of days
    • Any repeat processing should be safe to do because we will ensure this pipeline is idempotent
    • But to reduce costs, we should abort repeat work when it is identified further down the pipeline
  • Compare this list to the TCGA datasets held in Synapse
    1. for each new dataset
      1. create the dataset metadata in Synapse
        • Be sure that the permissions on the dataset are such that the public cannot see it but Sage Scientists can. We cannot redistribute TCGA datasets via the Sage Commons.
      2. crawl the download site to identify the layers
      3. proceed with the workflow for each layer
    2. for each updated layer in an existing dataset
      1. crawl the download site (http://tcga-data.nci.nih.gov/tcgafiles/ftp_auth/distro_ftpusers/anonymous/tumor/) to identify the updated/new layers
      2. proceed with the workflow for the updated/new layer(s)
  • Just get layer urls for clinical and array data, sequence data will come later
    • bcr/ has clinical info
    • cgcc/ array platforms
  • Output: one or more of (Synapse dataset id, TCGA layer download uri) or error
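
The comparison step above can be sketched in Python as follows. This is only a minimal illustration, not the pipeline's actual code: the function name `find_work`, the shape of its inputs (a mapping of dataset name to last modified date from the TCGA poll, and a set of dataset names already held in Synapse), and the default of 7 days are all assumptions. Datasets modified inside the lookback window are always reported, on the understanding that repeat work is safe and is aborted further down the pipeline.

```python
from datetime import datetime, timedelta


def find_work(tcga_listing, synapse_names, now, lookback_days=7):
    """Split polled TCGA datasets into new ones and recently updated ones.

    tcga_listing:  dict of dataset name -> last modified datetime (hypothetical shape)
    synapse_names: set of dataset names already held in Synapse (hypothetical shape)
    lookback_days: the configurable update window; 7 is an arbitrary default
    """
    cutoff = now - timedelta(days=lookback_days)
    new, updated = [], []
    for name, modified in sorted(tcga_listing.items()):
        if name not in synapse_names:
            # Not in Synapse yet: dataset metadata must be created first.
            new.append(name)
        elif modified >= cutoff:
            # Known dataset touched inside the window: re-crawl its layers.
            updated.append(name)
    return new, updated
```

Because the window check does not consult what Synapse already stores, the same dataset may be reported on several consecutive polls; that is acceptable only as long as the later steps stay idempotent.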

Short Term Deliverables:

  • Focus on just the Colon adenocarcinoma COAD dataset for now
  • Create the Synapse dataset by hand

Download the dataset layers to S3

Details:

  • Input: (Synapse dataset id, TCGA layer download uri)
  • Just pull down clinical and array data, sequence data will come later
    • bcr/ has clinical info
    • cgcc/ array platforms
  • Get the MD5 checksum Download the layer MD5 checksum from TCGA first
    • see if we already have a corresponding layer with that checksum stored in Synapse, if so, skip to the end of this step
  • Download the layer data file from TCGA
  • Compute its MD5 and compare it with the checksum downloaded from TCGA
  • Upload the layer data file first and the checksum second to a subdirectory in the S3 bucket tcga/source for source data from TCGA
    • the filename is tcga/source concatenated with the path portion of the TCGA download URL
    • if a file of that name already exists in S3
  • Output: (Synapse dataset id, S3 layer md5 uri, S3 layer uri) or error
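
A rough Python sketch of the checksum comparison and the S3 naming rule described above, using only the standard library. The helper names are made up for illustration, and the assumption that the TCGA checksum file uses the common md5sum layout ("<hex digest>  <filename>") has not been verified against the download site.

```python
import hashlib
from urllib.parse import urlparse

S3_PREFIX = "tcga/source"  # prefix for source data from TCGA, as named above


def md5_of_stream(handle, chunk_size=1 << 20):
    """Compute the MD5 hex digest of a binary file-like object in chunks."""
    digest = hashlib.md5()
    for chunk in iter(lambda: handle.read(chunk_size), b""):
        digest.update(chunk)
    return digest.hexdigest()


def parse_md5_text(text):
    """Extract the digest, assuming md5sum-style '<digest>  <filename>' text."""
    return text.split()[0].lower()


def checksum_matches(handle, md5_text):
    """True when the downloaded data matches the checksum published for it."""
    return md5_of_stream(handle) == parse_md5_text(md5_text)


def s3_key_for(download_url):
    """tcga/source concatenated with the path portion of the download URL."""
    return S3_PREFIX + urlparse(download_url).path
```

For a hypothetical download URL such as http://tcga-data.nci.nih.gov/tcgafiles/example/layer.tar.gz, `s3_key_for` would give tcga/source/tcgafiles/example/layer.tar.gz. Reading in chunks keeps memory use flat for large layer files.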

Short Term Deliverables:

Create Synapse record for the source Layers from TCGA

Details

  • Input: (Synapse dataset id, S3 layer md5 uri, S3 layer uri)
  • Search Synapse to see if this layer already exists
  • if the layer exists
    • if the md5sum matches
      • skip to the end of this task
    • else the md5 does not match
      • bump the version number of the layer and set the new S3 urls
  • else the layer does not exist
    • create new layer metadata in Synapse
  • Output: (Synapse dataset id, Synapse layer id) or error
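
The branching above is small enough to express as a pure function, which also makes it easy to test without a running Synapse instance. In this sketch the layer record is a plain dict with `md5` and `version` keys and versions are integers starting at 1; both are assumptions for illustration, not the Synapse data model.

```python
def plan_layer_update(existing_layer, new_md5):
    """Decide what to do with a source layer record.

    existing_layer: None if the search found nothing, otherwise a dict with
                    'md5' and 'version' keys (hypothetical record shape).
    Returns an (action, version) pair, where action is 'create', 'skip', or 'bump'.
    """
    if existing_layer is None:
        # The layer does not exist: create new layer metadata.
        return ("create", 1)
    if existing_layer["md5"] == new_md5:
        # Same content already recorded: nothing to do for this task.
        return ("skip", existing_layer["version"])
    # Content changed: bump the version; the caller sets the new S3 urls.
    return ("bump", existing_layer["version"] + 1)
```

Keeping the decision separate from the calls that read and write Synapse means a retried task reaches the same outcome, which supports the idempotency goal stated earlier.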

Process the TCGA dataset

...

layer

TODO more details here. But generally we want to create matrices with patient_sample id as the column header and gene/probeId/etc as the row header.

We will probably have different tasks here depending upon the type of the layer.

Details:

  • Input: (Synapse dataset id, Synapse layer id)
  • Future enhancement: co-locate this worker with the download worker to save time by skipping the download from S3
  • Download the layer and any metadata needed to make the matrix
    • the patient_sample ids are stored in a different layer I think
    • if that's right, we would return an error here if that layer is not available yet and retry this task later
  • Make the matrix
  • Upload it to S3
  • Output: (Synapse dataset id, Synapse layer id, S3 url of matrix) or error
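
As a sketch of the "make the matrix" step, the following Python builds a table with patient_sample ids as the column header and gene/probe ids as the row header, then serializes it as tab-delimited text. The input shape (a dict of patient_sample id to per-feature values), the "feature_id" corner label, and the "NA" placeholder for missing values are assumptions; the real layer formats will likely need type-specific parsing first.

```python
def build_matrix(values_by_sample, missing="NA"):
    """Build rows with samples as columns and features (gene/probe ids) as rows.

    values_by_sample: dict of patient_sample id -> dict of feature id -> value
                      (hypothetical input shape)
    """
    samples = sorted(values_by_sample)
    features = sorted({f for vals in values_by_sample.values() for f in vals})
    rows = [["feature_id"] + samples]  # header row: one column per sample
    for feature in features:
        row = [feature]
        for sample in samples:
            row.append(values_by_sample[sample].get(feature, missing))
        rows.append(row)
    return rows


def to_tsv(rows):
    """Serialize the rows as tab-delimited text, one line per row."""
    return "\n".join("\t".join(str(cell) for cell in row) for row in rows) + "\n"
```

Sorting samples and features makes the output deterministic, so reprocessing an unchanged layer yields a byte-identical matrix and therefore the same checksum.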

Short Term Deliverables:

  • a matrix of id to probe for Level 2 expression data
  • a matrix of id to gene for Level 3 expression data

Submit the processed dataset to Synapse

Be sure that the permissions on the dataset are such that the public cannot see it but Sage Scientists can. We cannot redistribute TCGA datasets via the Sage Commons.

...

Create Synapse record for Processed TCGA dataset layer

  • Input: (Synapse dataset id, Synapse layer id, S3 url of matrix) or error
  • create the new layer metadata for this matrix in Synapse
    • handle dups
    • handle new versions
  • Output: (dataset id, matrix layer id) or error

Deliverables:

  • Once this step is complete the layers are now visible via the Web UI and when you search with the R client.
  • The layer can be downloaded via the Web UI and the R client.
  • We might also consider adding one more task to the workflow to just pull a copy of the data from Synapse to Sage's local shared storage

Formulate notification message

  • Input: (dataset id, matrix layer id)
  • query Synapse for dataset and layer metadata to formulate a decent email
  • query Synapse for dataset followers
  • Output: one or more of (follower, email body) or error
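
A minimal sketch of how this task could turn the query results into its output pairs. The metadata fields used here (`name` on the dataset, `name` and `version` on the layer) and the wording of the message are placeholders, not fields confirmed to exist in Synapse.

```python
def formulate_messages(dataset, layer, followers):
    """Return one (follower, email body) pair per dataset follower.

    dataset, layer: dicts of metadata queried from Synapse (hypothetical fields)
    followers:      iterable of follower email addresses
    """
    body = (
        "A new layer is available for a dataset you follow.\n"
        "Dataset: {dataset}\n"
        "Layer: {layer} (version {version})\n"
    ).format(dataset=dataset["name"], layer=layer["name"], version=layer["version"])
    return [(follower, body) for follower in followers]
```

Building the body once and pairing it with each follower keeps this task free of side effects; the actual sending is left to the next task.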

Notify via email all individuals following TCGA datasets of the new layer(s)

  • Input: one or more of (follower, email body) or error
  • send the email
  • later on, if the volume is high, consolidate and dedup messages
  • Output: success or error
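
If consolidation is added later, one possible shape for it is sketched below: identical bodies for the same follower are dropped, and the remaining bodies are merged into a single message per follower. The function name and the blank-line separator are arbitrary choices for illustration.

```python
def consolidate(messages):
    """Merge (follower, email body) pairs into one message per follower.

    Exact duplicate bodies for a follower are dropped; distinct bodies are
    joined in first-seen order, separated by a blank line.
    """
    bodies_by_follower = {}
    for follower, body in messages:
        bodies = bodies_by_follower.setdefault(follower, [])
        if body not in bodies:
            bodies.append(body)
    return [(follower, "\n\n".join(bodies))
            for follower, bodies in bodies_by_follower.items()]
```

This relies on Python dicts preserving insertion order (guaranteed from Python 3.7), so followers come out in the order they were first seen.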

Technologies to Consider

Languages: A combination of Python, R, and Perl (only legacy scripts) should do the trick. This seems amenable to scripting languages (as opposed to Java). It is a design goal to keep each task small and understandable.

...