Page Comparison

...

Input: n/a
Poll http://tcga-data.nci.nih.gov/tcga/ for currently available datasets and their last modified date
- We'll always look for updates within a configurable number of days
- Any repeat processing should be safe to do because we will ensure this pipeline is idempotent
- But to reduce costs, we should abort repeat work when it is identified further down the pipeline
Compare this list to the TCGA datasets held in Synapse
1. for each new dataset
  1. create the dataset metadata in Synapse
    - Be sure that the permissions on the dataset are such that the public cannot see it but Sage Scientists can. We cannot redistribute TCGA datasets via the Sage Commons.
  2. crawl the download site to identify the layers
  3. proceed with the workflow for each layer
2. for each updated layer in an existing dataset
  1. crawl the download site to identify the updated/new layers
  2. proceed with the workflow for the updated/new layer(s)
raise an error if we find multiple Synapse datasets for the TCGA dataset
Just get layer URLs for clinical and array data; sequence data will come later
- bcr/ has clinical info
- cgcc/ has array platforms
Output: one or more of (Synapse dataset id, TCGA layer download URL) or error
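
The comparison step above could be sketched as follows. This is a minimal illustration, not the actual implementation: the `TcgaDataset` shape, the `plan_work` name, and the 7-day default window are all assumptions, and the real listing would come from crawling the TCGA site and querying Synapse.

```python
from dataclasses import dataclass
from datetime import date, timedelta


@dataclass(frozen=True)
class TcgaDataset:
    """One entry from the TCGA listing (hypothetical shape)."""
    name: str
    last_modified: date


def plan_work(tcga_listing, synapse_names, today, window_days=7):
    """Split the TCGA listing into new datasets and recently updated ones.

    synapse_names holds one name per Synapse dataset, so a repeated
    name means Synapse has more than one dataset for a TCGA dataset.
    """
    duplicates = {n for n in synapse_names if synapse_names.count(n) > 1}
    if duplicates:
        raise ValueError(f"multiple Synapse datasets for: {sorted(duplicates)}")

    cutoff = today - timedelta(days=window_days)  # configurable look-back
    known = set(synapse_names)
    new, updated = [], []
    for ds in tcga_listing:
        if ds.name not in known:
            new.append(ds.name)          # needs a Synapse dataset created
        elif ds.last_modified >= cutoff:
            updated.append(ds.name)      # crawl for updated/new layers
    return new, updated
```

Because the function only classifies work and has no side effects, running it repeatedly over the same listing is safe, which fits the idempotency goal stated above.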

Short Term Deliverables:

Focus on just the Colon adenocarcinoma COAD dataset for now
Create the Synapse dataset by hand

Create Synapse record for the source layer from TCGA

Details

Input: (Synapse dataset id, TCGA layer download URL)
Search Synapse to see if this layer already exists
- assumption: because of the way TCGA formulates its paths, the path portion of the TCGA layer URL contains enough information to determine the type (clinical vs. expression) and the make/model of the platform, which is sufficient metadata for the layer; we can always add more layer metadata in another step
- raise an error if we find multiple matching layers
formulate the layer metadata for this layer from the TCGA URL
if the layer exists in Synapse
- if all fields of the metadata match
  - skip to the end of this task
- else the metadata does not match
  - raise an error (our scheme for formulating layer metadata may have changed?)
else the layer does not exist
- create new layer in Synapse using the layer metadata
Output: (Synapse dataset id, Synapse layer id, TCGA layer download URL) or error
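
A sketch of the two ideas in this step: deriving layer metadata from the URL path, then deciding whether to skip, create, or raise an error. The path layout `.../tumor/<disease>/<bcr|cgcc>/<center>/<platform>/...` is inferred from the example URLs on this page and is an assumption, as are the function names and the metadata field names.

```python
from urllib.parse import urlparse

# Assumed mapping from TCGA center type to layer type. This is a
# simplification: not every cgcc/ array platform is expression data.
LAYER_TYPES = {"bcr": "clinical", "cgcc": "expression"}


def layer_metadata_from_url(url):
    """Derive layer metadata from the path of a TCGA layer URL."""
    parts = [p for p in urlparse(url).path.split("/") if p]
    i = parts.index("tumor")  # ValueError if the layout is not as assumed
    disease, center_type, center, platform = parts[i + 1:i + 5]
    return {
        "disease": disease,
        "type": LAYER_TYPES[center_type],
        "center": center,
        "platform": platform,
    }


def reconcile_layer(matching_layers, wanted):
    """Return ("skip", layer_id) or ("create", None); raise otherwise.

    matching_layers is the result of the Synapse search, given here as
    a list of dicts that each carry an "id" plus metadata fields.
    """
    if len(matching_layers) > 1:
        raise RuntimeError("multiple matching layers in Synapse")
    if not matching_layers:
        return ("create", None)
    existing = matching_layers[0]
    if all(existing.get(key) == value for key, value in wanted.items()):
        return ("skip", existing["id"])
    raise RuntimeError("layer metadata mismatch; did the scheme change?")
```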

Download the dataset layer from TCGA to S3

Details:

Input: (Synapse dataset id, Synapse layer id, TCGA layer download URL)
Download the layer MD5 checksum from TCGA first
Get the layer metadata from Synapse to see whether we already have an S3 location and MD5 stored for this layer id
if we have an MD5 from Synapse
- if the Synapse MD5 matches the TCGA MD5
  - skip to the end of this task because we already uploaded this file
- else the MD5 does not match
  - bump the version number in our client-side copy of the layer metadata
Download the layer data file from TCGA
Compute its MD5 and compare it to the MD5 we got from TCGA
- if they do not match, we had a download error; return an error (we'll try again later)
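
The checksum comparison could look like the sketch below. It assumes the TCGA checksum file uses the common `md5sum` layout (hex digest, optionally followed by a file name); that layout and the function names are assumptions, not something this page confirms.

```python
import hashlib


def md5_of_file(path, chunk_size=1 << 20):
    """Compute the MD5 of a file in chunks so large layers fit in memory."""
    digest = hashlib.md5()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def verify_download(path, tcga_md5_text):
    """Return the file's MD5 if it matches the TCGA checksum, else raise.

    tcga_md5_text is the content of the checksum file; only its first
    whitespace-separated token is used (assumed md5sum-style layout).
    """
    expected = tcga_md5_text.split()[0].lower()
    actual = md5_of_file(path)
    if actual != expected:
        # Treated as a download error; the caller can retry later.
        raise IOError(f"MD5 mismatch: expected {expected}, got {actual}")
    return actual
```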
Ask Synapse for a pre-signed URL to which to put the layer
- Arguments to this call would include
  - dataset id
  - layer type
  - path portion of the TCGA URL
  - id
- naming convention for TCGA layer S3 URLs
  - use a subdirectory in the S3 bucket tcga/source for source data from TCGA
  - perhaps the filename is tcga/source concatenated with the path portion of the TCGA download URL; TCGA has spent a long time thinking about their URL conventions, so we might consider reusing their scheme
  - TODO what if it already exists? Synapse layer version number
- NOTE: if something already exists in S3 at that location, it is because the workflow failed after the S3 upload step but before the update-metadata step. Perhaps be conservative at first and just raise an error; later on, if/when we feel more confident, we can just overwrite it
  - look at S3 API to see if there is a create-only flag (don't use the s3 version feature, we want the version to be explicit in the URL)
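
One possible shape for the naming convention, as a sketch only. Where the version goes in the key is undecided on this page; placing it as a `v<N>` segment right after the `tcga/source` prefix is an assumption made here purely for illustration, as is the function name.

```python
from urllib.parse import urlparse

SOURCE_PREFIX = "tcga/source"  # subdirectory for source data from TCGA


def s3_key_for_layer(tcga_url, version):
    """Build an S3 key that reuses the TCGA path and names the version.

    The version is explicit in the key (hypothetical "v<N>" segment)
    rather than relying on the S3 object versioning feature.
    """
    path = urlparse(tcga_url).path.strip("/")
    if not path:
        raise ValueError(f"no path in TCGA URL: {tcga_url!r}")
    return f"{SOURCE_PREFIX}/v{version}/{path}"
```

Keeping the TCGA path intact means a key can be traced back to its download URL by inspection, which is the main argument for reusing their scheme.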
Upload the layer to S3
Update the layer metadata in Synapse with the new S3 URL, the new MD5, and perhaps the bumped version number
Output: (Synapse dataset id, Synapse layer id) or error
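
The skip/upload/bump decision at the start of this step can be isolated as a small pure function. The metadata field names `md5` and `version`, and the starting version of 1, are assumptions for this sketch.

```python
def plan_upload(synapse_layer, tcga_md5):
    """Decide what to do with a layer given its Synapse metadata.

    Returns (action, version), where action is "skip" or "upload" and
    version is the version number the layer should carry afterwards.
    """
    stored_md5 = synapse_layer.get("md5")
    version = synapse_layer.get("version", 1)
    if stored_md5 is None:
        return ("upload", version)        # nothing uploaded yet
    if stored_md5.lower() == tcga_md5.lower():
        return ("skip", version)          # we already uploaded this file
    return ("upload", version + 1)        # content changed: bump version
```

The bumped version stays client-side until the upload succeeds; only the final metadata update writes it to Synapse.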

Short Term Deliverables:

Pull down levels 1, 2, and 3 of Agilent expression data from http://tcga-data.nci.nih.gov/tcgafiles/ftp_auth/distro_ftpusers/anonymous/tumor/coad/cgcc/unc.edu/agilentg4502a_07_3/transcriptome/; skip all other URLs

Create Synapse record for the source layers from TCGA

Details

Input: (Synapse dataset id, S3 layer md5 uri, S3 layer uri)
Search Synapse to see if this layer already exists
- assumption: because of the way TCGA formulates its paths, the path portion of the layer uri contains enough information to determine the type (clinical vs. expression) and the make/model of the platform, which is sufficient metadata for the layer; we can always add more layer metadata in another step
if the layer exists
- if the MD5 sum matches
  - skip to the end of this task
- else the MD5 does not match
  - bump the version number of the layer and set the new S3 URLs
else the layer does not exist
- create new layer metadata in Synapse
Output: (Synapse dataset id, Synapse layer id) or error

Process the TCGA dataset layer

...

Versions Compared

Old Version 7

New Version 8