...
- Input: n/a
- Poll http://tcga-data.nci.nih.gov/tcga/ for currently available datasets and their last modified date
- Only look for updates made within a configurable number of days
- Repeat processing is safe because this pipeline will be idempotent
- To reduce costs, later steps should still abort repeat work as soon as they identify it
- Compare this list to the TCGA datasets held in Synapse
- for each new dataset
- create the dataset metadata in Synapse
- Be sure that the permissions on the dataset are such that the public cannot see it but Sage Scientists can. We cannot redistribute TCGA datasets via the Sage Commons.
- crawl the download site to identify the layers
- proceed with the workflow for each layer
- for each updated layer in an existing dataset
- crawl the download site to identify the updated/new layers
- proceed with the workflow for the updated/new layer(s)
- For now, get layer URLs only for clinical and array data; sequence data will come later
- bcr/ has clinical info
- cgcc/ has array platform data
- Output: one or more of (Synapse dataset id, TCGA layer download uri) or error
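The selection logic above can be sketched as follows. This is a minimal sketch, not the actual crawler: `select_datasets` and `layer_type_from_path` are hypothetical names, the dictionary and set inputs are assumed shapes for the poll and Synapse query results, and treating new datasets as exempt from the day window is an assumption.

```python
from datetime import date, timedelta

def select_datasets(tcga, synapse, today, window_days):
    """Split TCGA datasets into new ones and recently updated ones.

    tcga:    {dataset name: last modified date} from the poll (assumed shape)
    synapse: set of dataset names already held in Synapse (assumed shape)
    """
    cutoff = today - timedelta(days=window_days)
    # Assumption: a dataset missing from Synapse is new regardless of its date.
    new = sorted(name for name in tcga if name not in synapse)
    updated = sorted(
        name for name, modified in tcga.items()
        if name in synapse and modified >= cutoff
    )
    return new, updated

def layer_type_from_path(path):
    """Classify a layer by its TCGA path; anything else is skipped for now."""
    parts = path.strip("/").split("/")
    if "bcr" in parts:
        return "clinical"
    if "cgcc" in parts:
        return "array"
    return None  # e.g. sequence data, which will come later
```

Because the function only reads its inputs, running it twice over the same poll results gives the same answer, which fits the idempotency goal.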
...
- Focus on just the Colon adenocarcinoma COAD dataset for now
- Create the Synapse dataset by hand
Download the dataset layers from TCGA to S3
Details:
- Input: (Synapse dataset id, TCGA layer download uri)
- Download the layer MD5 checksum from TCGA first
- Check whether a corresponding layer with that checksum is already stored in Synapse; if so, skip to the end of this step
- Download the layer data file from TCGA
- Compute its MD5 and compare it with the checksum downloaded from TCGA
- Ask Synapse for a pre-signed URL to which to put the layer; arguments to this call would include
- dataset id
- layer type
- path portion of the TCGA URL
- Upload the layer data file to a subdirectory of tcga/source in the S3 bucket, which holds source data from TCGA
- the file's S3 path is tcga/source concatenated with the path portion of the TCGA download URL
- Output: (Synapse dataset id, layer md5, S3 layer uri) or error
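The checksum comparison and S3 naming rule in this step can be sketched as follows. This is a minimal sketch under stated assumptions: the function names are hypothetical, the checksum file is assumed to use the common `md5sum` layout (digest first, then the file name), and the URL in the usage note is made up for illustration.

```python
import hashlib
from urllib.parse import urlparse

S3_PREFIX = "tcga/source"  # prefix named in the step above

def md5_of_stream(stream, chunk_size=1 << 20):
    """Hash a binary stream in chunks so large layers never sit fully in memory."""
    digest = hashlib.md5()
    for chunk in iter(lambda: stream.read(chunk_size), b""):
        digest.update(chunk)
    return digest.hexdigest()

def parse_md5_line(text):
    """Return the digest from an md5sum-style line (assumed file format)."""
    return text.split()[0].lower()

def s3_key_for(tcga_url):
    """Concatenate the S3 prefix with the path portion of the TCGA URL."""
    return S3_PREFIX + urlparse(tcga_url).path
```

For example, `s3_key_for("http://tcga-data.nci.nih.gov/some/path/layer.tar.gz")` yields `tcga/source/some/path/layer.tar.gz`. The upload itself would go through the pre-signed URL returned by Synapse, which is not shown here.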
...
- Input: (Synapse dataset id, S3 layer md5 uri, S3 layer uri)
- Search Synapse to see if this layer already exists
- assumption: because of the way TCGA formulates its paths, the path portion of the layer uri is enough to determine the type (clinical vs. expression) and the make/model of the platform, which gives sufficient metadata for the layer; we can always add more layer metadata at another step
- if the layer exists
- if the md5sum matches
- skip to the end of this task
- else the md5 does not match
- bump the version number of the layer and set the new S3 URLs
- else the layer does not exist
- create new layer metadata in Synapse
- Output: (Synapse dataset id, Synapse layer id) or error
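The branching above reduces to three outcomes: skip, bump, or create. A minimal sketch, assuming the Synapse search returns either nothing or a record with an MD5 and a version number; `plan_layer_update`, the dictionary shape, and starting versions at 1 are all hypothetical.

```python
def plan_layer_update(existing, new_md5):
    """Decide what to do with a layer and which version it should end up at.

    existing: None if Synapse has no such layer, else a dict with
              'md5' and 'version' keys (assumed shape).
    """
    if existing is None:
        return ("create", 1)  # assumption: new layers start at version 1
    if existing["md5"] == new_md5:
        return ("skip", existing["version"])  # nothing changed
    return ("bump", existing["version"] + 1)  # same layer, new content
```

Keeping the decision separate from the Synapse calls makes the step easy to rerun safely: the same inputs always produce the same action.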
...
- Input: (Synapse dataset id, Synapse layer id)
- Future enhancement: co-locate this worker with the download worker to save time by skipping the download from S3
- Download the layer and any metadata needed to make the matrix
- the patient_sample ids are, I think, stored in a different layer
- if that's right, return an error here when that layer is not yet available, and retry this task later
- Make the matrix
- Ask Synapse for a pre-signed URL to which to put the matrix layer; arguments to this call would include
- dataset id
- layer type (an analysis result)
- path portion of the source URL? (many source layers may be involved in a particular analysis result)
- Upload it to S3
- Output: (Synapse dataset id, Synapse layer id, S3 url of matrix) or error
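The "return an error and retry later" behaviour for a missing dependency layer can be sketched as follows. This is only an illustration: `LayerNotReady` and `retry_until_ready` are hypothetical names, and in practice the workflow system may handle retries itself.

```python
import time

class LayerNotReady(Exception):
    """Hypothetical error: a layer this task depends on is not in Synapse yet."""

def retry_until_ready(task, attempts=3, delay_seconds=0.0, sleep=time.sleep):
    """Run task(); on LayerNotReady wait and try again, up to `attempts` times."""
    for attempt in range(1, attempts + 1):
        try:
            return task()
        except LayerNotReady:
            if attempt == attempts:
                raise  # give up and surface the error as this step's output
            sleep(delay_seconds)
```

Only `LayerNotReady` triggers a retry; any other failure propagates immediately, so genuine errors are not hidden behind repeated attempts.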
...