...
- Input: n/a
- Poll http://tcga-data.nci.nih.gov/tcga/ for currently available datasets and their last modified date
- We'll always look for updates within a configurable number of days
- Any repeat processing should be safe to do because we will ensure this pipeline is idempotent
- But to reduce costs, we should abort repeat work when it is identified further down the pipeline
- Note that without integrity constraints on creation operations (e.g., no two datasets can have the same name, and no two layers in a dataset can have the same name), race conditions could produce duplicate datasets and layers if our timeouts for workflow operations are not set correctly.
- Compare this list to the TCGA datasets held in Synapse
- for each new dataset
- create the dataset metadata in Synapse
- raise an error upon duplicate dataset
- Be sure that the permissions on the dataset are such that the public cannot see it but Sage Scientists can. We cannot redistribute TCGA datasets via the Sage Commons.
- crawl the download site to identify the layers
- proceed with the workflow for each layer
- for each updated layer in an existing dataset
- crawl the download site to identify the updated/new layers
- proceed with the workflow for the updated/new layer(s)
- for each existing dataset
- raise an error if we find multiple Synapse datasets for the TCGA dataset
- Just get layer URLs for clinical and array data; sequence data will come later
- bcr/ has clinical info
- cgcc/ has array platform data
- Output: one or more of (Synapse dataset id, TCGA layer download URL) or error
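The poll-and-compare step above could be sketched roughly as follows. This is a minimal sketch assuming the TCGA index page has already been crawled into a list of (dataset name, last-modified date) pairs; the function name and argument shapes are hypothetical, not an existing API.

```python
from datetime import datetime, timedelta

def find_recent_datasets(tcga_listing, synapse_names, now, window_days=7):
    """Split TCGA datasets into new vs. updated, ignoring anything not
    modified within the configurable polling window.

    tcga_listing: list of (dataset_name, last_modified_datetime) from the crawl
    synapse_names: set of TCGA dataset names already held in Synapse
    """
    cutoff = now - timedelta(days=window_days)
    new, updated = [], []
    for name, last_modified in tcga_listing:
        if last_modified < cutoff:
            continue  # outside the configurable lookback window
        if name in synapse_names:
            updated.append(name)  # existing dataset: look for new/updated layers
        else:
            new.append(name)      # new dataset: create metadata, then crawl layers
    return new, updated
```

Because the pipeline is idempotent, re-listing a dataset that was already processed is safe; the later tasks detect and abort the repeat work.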
...
- Input: (Synapse dataset id, TCGA layer download URL)
- Search Synapse to see if this layer already exists
- assumption: due to the way TCGA formulates its paths, the path portion of the TCGA layer URL carries enough information to determine the type (clinical vs. expression) and the make/model of the platform, which is sufficient metadata for the layer; we can always add more layer metadata at another step
- raise an error if we find multiple matching layers
- formulate the layer metadata for this layer from the TCGA URL
- if the layer exists in Synapse
- if all fields of the metadata match
- skip to the end of this task
- else the metadata does not match
- raise an error (our scheme for formulating layer metadata may have changed?)
- else the layer does not exist
- create new layer in Synapse using the layer metadata
- raise an error upon duplicate layer
- Output: (Synapse dataset id, Synapse layer id, TCGA layer download URL) or error
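The assumption above (layer metadata derivable from the TCGA URL path) could be sketched like this. The path layout (`.../bcr/...` for clinical, `.../cgcc/<center>/<platform>/...` for array data) is our working assumption about TCGA's conventions, and the metadata field names here are illustrative only.

```python
from urllib.parse import urlparse

def layer_metadata_from_url(download_url):
    """Derive layer metadata from the path portion of a TCGA download URL.

    Assumes TCGA's path convention: bcr/ holds clinical info and
    cgcc/<center domain>/<platform>/ holds array data.
    """
    path = urlparse(download_url).path
    parts = [p for p in path.split("/") if p]
    if "bcr" in parts:
        layer_type = "clinical"
        platform = None  # clinical layers have no array platform
    elif "cgcc" in parts:
        layer_type = "expression"
        idx = parts.index("cgcc")
        # the segment after the center domain names the platform make/model
        platform = parts[idx + 2] if len(parts) > idx + 2 else None
    else:
        # our scheme for formulating layer metadata may have changed
        raise ValueError("unrecognized TCGA path: " + download_url)
    return {"type": layer_type, "platform": platform, "tcgaPath": path}
```

If the derived metadata ever disagrees with what is already stored in Synapse, the task raises an error rather than silently overwriting, per the steps above.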
...
- Input: (Synapse dataset id, Synapse layer id, TCGA layer download URL)
- Download the layer MD5 checksum from TCGA first
- Get the layer metadata from Synapse to see if we already have an S3 location and MD5 for this layer id
- if we have an MD5 from Synapse
- if the Synapse MD5 matches the TCGA MD5
- skip to the end of this task because we already uploaded this file
- else the MD5 does not match
- bump the version number in our client-side copy of the layer metadata
- Download the layer data file from TCGA
- Compute its MD5 and compare it to the MD5 we got from TCGA
- if they do not match, we had a download error, return an error (we'll try again later)
- Ask Synapse for a pre-signed URL to which to put the layer
- Arguments to this call would include
- dataset id
- layer id
- naming convention for TCGA layer S3 URLs
- use a subdirectory in the S3 bucket tcga/source for source data from TCGA
- perhaps the filename is tcga/source concatenated with the path portion of the TCGA download URL; TCGA has spent a long time thinking about their URL conventions, so we might consider reusing their scheme
- Synapse layer version number
- NOTE: if something already exists in S3 at that location, it is because the workflow failed after the S3 upload step but before the update-metadata step; be conservative at first and just raise an error, and later, if/when we feel more confident, we can just overwrite it
- look at S3 API to see if there is a create-only flag
- Upload the layer to S3
- Update the layer metadata in Synapse with the new S3 URL, the new MD5, and perhaps (if applicable) the bumped version number
- Output: (Synapse dataset id, Synapse layer id) or error
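The checksum verification and the proposed S3 naming convention above could be sketched as follows. The `tcga/source` prefix comes from the notes above; how the Synapse layer version number folds into the key is an open question, so the `v<N>` segment here is purely an assumption for illustration.

```python
import hashlib
from urllib.parse import urlparse

def s3_key_for(tcga_url, version):
    """Build an S3 key for a source layer: the tcga/source prefix plus the
    path portion of the TCGA download URL, reusing TCGA's own path scheme.
    The v<N> version segment is a sketch; the real convention is TBD.
    """
    path = urlparse(tcga_url).path.lstrip("/")
    return "tcga/source/v%d/%s" % (version, path)

def verify_download(file_bytes, expected_md5):
    """Return True when the MD5 of the downloaded bytes matches the checksum
    published by TCGA; a mismatch means a download error, so the task should
    return an error and the workflow will retry later."""
    return hashlib.md5(file_bytes).hexdigest() == expected_md5.strip().lower()
```

Comparing against the TCGA-published MD5 both before upload (download integrity) and against the Synapse-held MD5 (idempotency check) keeps repeat runs cheap: an unchanged layer short-circuits before any bytes move.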
...
- Input: (Synapse dataset id, Synapse source layer id)
- Search Synapse for analysis result layers matching this, uniquely identified by
- source layer(s) and their versions
- analysis method and code version
- if this result already exists
- skip to the end of this task
- TODO currently assuming we want a different layer when the source data changes, an alternative would be to do a new version of this layer
- Future enhancement: co-locate this worker with the download worker to save time by skipping the download from S3
- Download the layer and any metadata needed to make the matrix
- the patient_sample ids are stored in a different layer I think
- if that's right, we would return an error here if that layer is not available yet and retry this task later
- Make the matrix
- this could fail for a variety of reasons so do this part before creating any objects in Synapse for the new analysis result
- Formulate the layer metadata
- compute the MD5 for the matrix
- create the new layer in Synapse for this analysis result
- raise an error upon duplicate layer
- handle new versions
- Ask Synapse for a pre-signed URL to which to put the matrix layer
- Arguments to this call would include
- dataset id
- layer type (an analysis result)
- path portion of the source URL? (many source layers may be involved in a particular analysis result)
- Upload the matrix to S3
- Output: (Synapse dataset id, Synapse layer id, S3 url of matrix) or error
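The "make the matrix" step above could be sketched as a pure function, which keeps the failure-prone assembly work ahead of any object creation in Synapse, as the notes suggest. The input shape (per-sample probe values, with patient_sample ids pulled from their own layer) is an assumption; field names are illustrative.

```python
def build_expression_matrix(per_sample_values):
    """Assemble a probe-by-sample matrix from per-sample expression values.

    per_sample_values: dict mapping patient_sample id -> {probe_id: value}
    Returns (header_row, rows) suitable for writing as a tab-delimited file.
    Missing values are emitted as 'NA' so a gap in one source file is visible.
    """
    sample_ids = sorted(per_sample_values)
    probes = sorted({p for values in per_sample_values.values() for p in values})
    header = ["probe"] + sample_ids
    rows = []
    for probe in probes:
        rows.append([probe] +
                    [per_sample_values[s].get(probe, "NA") for s in sample_ids])
    return header, rows
```

The same shape covers both short-term deliverables: keyed by probe for Level 2 data, or by gene for Level 3 data.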
Short Term Deliverables:
- a matrix of id to probe for Level 2 expression data
- a matrix of id to gene for Level 3 expression data
Create Synapse record for processed TCGA dataset layer
- Input: (Synapse dataset id, Synapse layer id, S3 url of matrix)
- id
- TODO what should our naming convention be here?
- Upload the matrix to S3
- Update the layer metadata in Synapse with the new S3 URL, the new MD5, and (if applicable) the bumped version number
- Output: (Synapse dataset id, Synapse analysis result layer id) or error
Deliverables:
- Once this step is complete, the layers are visible to Sage Scientists via the Web UI and via search in the R client.
- The layer can be downloaded via the Web UI and the R client.
- We might also consider adding one more task to the workflow to just pull a copy of the data from Synapse to Sage's local shared storage
Formulate notification message
- Input: (Synapse dataset id, Synapse analysis result layer id)
- query Synapse for dataset and layer metadata to formulate a decent email
- query Synapse for dataset followers
- Output: one or more of (follower, email body) or error
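The notification step above maps naturally to one function once the two Synapse queries have run. This is a sketch; the function name, the metadata fields, and the email wording are all placeholders, not an existing API.

```python
def notification_messages(dataset_name, layer_name, followers):
    """Produce one (follower, email body) pair per dataset follower.

    dataset_name / layer_name would come from the Synapse metadata query;
    followers from the dataset-followers query.
    """
    body = ("A new analysis result layer '%s' is now available for dataset "
            "'%s' in Synapse." % (layer_name, dataset_name))
    return [(follower, body) for follower in followers]
```

Returning the (follower, body) pairs rather than sending mail directly keeps this task idempotent and easy to test; the actual delivery can sit behind it.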
...