
...

Input: n/a
Poll http://tcga-data.nci.nih.gov/tcga/ for currently available datasets and their last modified date
- We'll only look for updates made within a configurable number of days
- Repeat processing should be safe because we will ensure this pipeline is idempotent
- To reduce costs, though, we should abort repeat work as soon as it is identified further down the pipeline
Compare this list to the TCGA datasets held in Synapse
1. for each new dataset
  1. create the dataset metadata in Synapse
    - Set the permissions on the dataset so that the public cannot see it but Sage scientists can, because we cannot redistribute TCGA datasets via the Sage Commons.
  2. crawl the download site to identify the layers
  3. proceed with the workflow for each layer
2. for each updated layer in an existing dataset
  1. crawl the download site to identify the updated/new layers
  2. proceed with the workflow for the updated/new layer(s)
For now, just get layer URLs for clinical and array data; sequence data will come later
- bcr/ has clinical info
- cgcc/ has array platform data
Output: one or more of (Synapse dataset id, TCGA layer download uri) or error
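
The lookback window and the bcr/ vs. cgcc/ filter described above can be sketched as follows. This is a minimal sketch, not the pipeline's actual code: the function names, the (name, last_modified) tuple shape, and the example.org URLs are all assumptions made for illustration.

```python
from datetime import datetime, timedelta
from urllib.parse import urlparse


def recently_modified(datasets, now, lookback_days):
    """Keep the names of datasets modified within the lookback window.

    `datasets` is assumed to be an iterable of (name, last_modified) pairs.
    """
    cutoff = now - timedelta(days=lookback_days)
    return [name for name, modified in datasets if modified >= cutoff]


def layer_kind(layer_url):
    """Classify a layer URL by the directory names in its path.

    Returns "clinical" for bcr/, "array" for cgcc/, and None for anything
    else (for example sequence data, which is out of scope for now).
    """
    segments = urlparse(layer_url).path.split("/")
    if "bcr" in segments:
        return "clinical"
    if "cgcc" in segments:
        return "array"
    return None
```

Layers for which layer_kind returns None would simply be left out of the step's output.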

...

Download the dataset layers from TCGA to S3

Details:

Input: (Synapse dataset id, TCGA layer download uri)
Download the layer MD5 checksum from TCGA first
- see if we already have a corresponding layer with that checksum stored in Synapse; if so, skip to the end of this step
Download the layer data file from TCGA
Compute its MD5 and compare it with the checksum downloaded from TCGA
Ask Synapse for a pre-signed URL to which to upload the layer; arguments to this call would include
- dataset id
- layer type
- path portion of the TCGA URL
Upload the layer data file to a subdirectory of tcga/source, the S3 location for source data from TCGA
- the filename is tcga/source concatenated with the path portion of the TCGA download URL
Output: (Synapse dataset id, layer md5, S3 layer uri) or error
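
A rough sketch of the checksum comparison and the S3 filename rule from this step. The helper names are hypothetical, and the sketch assumes the published checksum file holds the hex digest as its first whitespace-separated token (the format produced by md5sum), which the notes above do not confirm.

```python
import hashlib
from urllib.parse import urlparse


def file_md5(path, chunk_size=1 << 20):
    """Compute the MD5 hex digest of a file, reading it in chunks."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def checksum_matches(path, published_md5):
    """Compare a local file against the text of a published checksum file.

    Assumption: the digest is the first whitespace-separated token.
    """
    tokens = published_md5.split()
    return bool(tokens) and file_md5(path) == tokens[0].lower()


def s3_key_for(layer_url, prefix="tcga/source"):
    """Build the S3 filename: the prefix plus the path portion of the URL."""
    return prefix + "/" + urlparse(layer_url).path.lstrip("/")
```

Reading in chunks keeps memory use flat for large layer files.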

...

Input: (Synapse dataset id, S3 layer md5 uri, S3 layer uri)
Search Synapse to see if this layer already exists
assumption: because of the way TCGA formulates its paths, the path portion of the layer uri contains enough information to determine the type (clinical vs. expression) and the make/model of the platform, which is sufficient metadata for the layer; we can always add more layer metadata at another step
if the layer exists
- if the md5 matches
  - skip to the end of this task
- else the md5 does not match
  - bump the version number of the layer and set the new S3 urls
else the layer does not exist
- create new layer metadata in Synapse
Output: (Synapse dataset id, Synapse layer id) or error
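
The branching above reduces to a small decision function. This is only a sketch: the dict with "md5" and "version" keys stands in for whatever layer record Synapse actually returns, and the action names and the starting version of 1 are assumptions.

```python
def plan_layer_update(existing, new_md5):
    """Decide what to do with a layer given the stored record, if any.

    `existing` is None when the layer is not in Synapse yet; otherwise it is
    assumed to be a dict with "md5" and "version" keys.
    Returns an (action, version) pair.
    """
    if existing is None:
        # The layer does not exist: create new layer metadata.
        return ("create", 1)
    if existing["md5"] == new_md5:
        # Same content as before: nothing to do, which keeps reruns idempotent.
        return ("skip", existing["version"])
    # Content changed: bump the version and point it at the new S3 urls.
    return ("new_version", existing["version"] + 1)
```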

...

Input: (Synapse dataset id, Synapse layer id)
Future enhancement: co-locate this worker with the download worker to save time by skipping the download from S3
Download the layer and any metadata needed to make the matrix
- I think the patient_sample ids are stored in a different layer
- if that's right, we would return an error here when that layer is not yet available, and retry this task later
Make the matrix
Ask Synapse for a pre-signed URL to which to upload the matrix layer; arguments to this call would include
- dataset id
- layer type (an analysis result)
- path portion of the source URL? (many source layers may be involved in a particular analysis result)
Upload it to S3
Output: (Synapse dataset id, Synapse layer id, S3 url of matrix) or error
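
One way to express the "return an error and retry this task later" behaviour for a missing dependency layer is sketched below. The exception name, the retry helper, and the fixed wait between attempts are assumptions for illustration; a workflow engine might provide its own retry mechanism instead.

```python
import time


class LayerNotReady(Exception):
    """Raised when a layer this task depends on is not available yet."""


def run_with_retry(task, attempts, wait_seconds, sleep=time.sleep):
    """Call `task` until it succeeds or the attempts run out.

    Only LayerNotReady triggers a retry; any other error propagates
    immediately. `sleep` is injectable so the wait can be skipped in tests.
    """
    for attempt in range(1, attempts + 1):
        try:
            return task()
        except LayerNotReady:
            if attempt == attempts:
                raise
            sleep(wait_seconds)
```

Here the task would raise LayerNotReady when the layer holding the patient_sample ids cannot be found, and build the matrix otherwise.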

...