...

  • Input: n/a
  • Poll http://tcga-data.nci.nih.gov/tcga/ for currently available datasets and their last modified date
    • We'll always look for updates within a configurable number of days
    • Any repeat processing should be safe to do because we will ensure this pipeline is idempotent
    • But to reduce costs, we should abort repeat work when it is identified further down the pipeline
  • Compare this list to the TCGA datasets held in Synapse
    1. for each new dataset
      1. create the dataset metadata in Synapse
        • Set the permissions on the dataset so that the public cannot see it but Sage Scientists can, because we cannot redistribute TCGA datasets via the Sage Commons.
      2. crawl the download site to identify the layers
      3. proceed with the workflow for each layer
    2. for each updated layer in an existing dataset
      1. crawl the download site to identify the updated/new layers
      2. proceed with the workflow for the updated/new layer(s)
  • Just get layer urls for clinical and array data; sequence data will come later
    • bcr/ has clinical info
    • cgcc/ array platforms
  • Output: one or more of (Synapse dataset id, TCGA layer download uri) or error
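
The polling and filtering steps above could be sketched as follows. This is a minimal illustration, not the actual crawler: the function names, the dict of last-modified dates, and the example URLs are all assumptions, and only the bcr/ and cgcc/ path rules come from the notes above.

```python
from datetime import datetime, timedelta
from urllib.parse import urlparse


def recently_modified(datasets, now, lookback_days):
    """Return names of datasets modified within the configurable lookback window.

    `datasets` is a hypothetical mapping of dataset name -> last modified datetime.
    """
    cutoff = now - timedelta(days=lookback_days)
    return [name for name, modified in datasets.items() if modified >= cutoff]


def layer_kind(layer_uri):
    """Classify a layer by its path: bcr/ holds clinical info, cgcc/ holds array data."""
    segments = urlparse(layer_uri).path.split("/")
    if "bcr" in segments:
        return "clinical"
    if "cgcc" in segments:
        return "array"
    return None  # anything else (e.g. sequence data) is out of scope for now
```

Layers for which `layer_kind` returns `None` would simply be skipped by the crawler until sequence data is supported.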

...

  • Focus on just the Colon adenocarcinoma COAD dataset for now
  • Create the Synapse dataset by hand

Download the dataset layers from TCGA to S3

Details:

  • Input: (Synapse dataset id, TCGA layer download uri)
  • Download the layer MD5 checksum from TCGA first
    • check whether a layer with that checksum is already stored in Synapse; if so, skip to the end of this step
  • Download the layer data file from TCGA
  • Compute its MD5 and compare it to the checksum downloaded from TCGA
  • Ask Synapse for a pre-signed URL to which to put the layer; arguments to this call would include:
    • dataset id
    • layer type
    • path portion of the TCGA URL
  • Upload the layer data file to a subdirectory in the S3 bucket tcga/source for source data from TCGA
    • the filename is tcga/source concatenated with the path portion of the TCGA download URL
  • Output: (Synapse dataset id, layer md5, S3 layer uri) or error
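
The checksum comparison and the S3 naming rule above could look roughly like this. It is a sketch under stated assumptions: the helper names are hypothetical, the example URL is made up, and the call to Synapse for a pre-signed URL is left out because its API is not described here.

```python
import hashlib
from urllib.parse import urlparse


def md5_hex(path, chunk_size=1 << 20):
    """Compute the MD5 of a file in chunks so large layers need not fit in memory."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def checksum_matches(path, expected_md5):
    """Compare the computed MD5 with the checksum published alongside the layer."""
    return md5_hex(path) == expected_md5.strip().lower()


def s3_key_for(tcga_uri, prefix="tcga/source"):
    """Build the S3 filename: the prefix concatenated with the path portion of the URL."""
    return prefix + urlparse(tcga_uri).path
```

For a hypothetical layer at `http://example.org/tumor/coad/bcr/clinical.tar.gz`, `s3_key_for` gives `tcga/source/tumor/coad/bcr/clinical.tar.gz`.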

...

  • Input: (Synapse dataset id, S3 layer md5 uri, S3 layer uri)
  • Search Synapse to see if this layer already exists
  • assumption: because of the way TCGA formulates its paths, the path portion of the layer uri contains enough information to determine the type (clinical vs. expression) and the make/model of the platform, which is sufficient metadata for the layer; we can always add more layer metadata at another step
  • if the layer exists
    • if the md5 matches
      • skip to the end of this task
    • else the md5 does not match
      • bump the version number of the layer and set the new S3 urls
  • else the layer does not exist
    • create new layer metadata in Synapse
  • Output: (Synapse dataset id, Synapse layer id) or error
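
The branching above is what keeps this step idempotent, and it could be sketched as a pure function. The layer record here is a hypothetical dict with `md5`, `version`, and `s3_uri` keys; the real Synapse layer schema may differ.

```python
def reconcile_layer(existing_layer, md5, s3_uri):
    """Decide what to do with a layer and return (action, layer record).

    `existing_layer` is None when the Synapse search found nothing.
    """
    if existing_layer is None:
        # The layer does not exist: create new layer metadata.
        return "create", {"md5": md5, "version": 1, "s3_uri": s3_uri}
    if existing_layer["md5"] == md5:
        # Same content already registered: repeat work is skipped.
        return "skip", existing_layer
    # Content changed: bump the version and point at the new S3 location.
    updated = dict(existing_layer, md5=md5, s3_uri=s3_uri,
                   version=existing_layer["version"] + 1)
    return "bump_version", updated
```

Running the function twice with the same inputs yields `skip` the second time, so reprocessing a layer has no further effect.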

...

  • Input: (Synapse dataset id, Synapse layer id)
  • Future enhancement: co-locate this worker with the download worker to save time by skipping the download from S3
  • Download the layer and any metadata needed to make the matrix
    • the patient_sample ids are stored in a different layer, I think
    • if that's right, we would return an error here if that layer is not available yet and retry this task later
  • Make the matrix
  • Ask Synapse for a pre-signed URL to which to put the matrix layer; arguments to this call would include:
    • dataset id
    • layer type (an analysis result)
    • path portion of the source URL? (many source layers may be involved in a particular analysis result)
  • Upload it to S3
  • Output: (Synapse dataset id, Synapse layer id, S3 url of matrix) or error
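
The "return an error and retry later" behaviour for a missing patient_sample layer could be expressed as below. This is only a sketch: `RetryLater`, `load_matrix_inputs`, and the in-memory `layers` mapping are hypothetical stand-ins for whatever the workflow engine and Synapse client actually provide, and the matrix construction itself is not shown.

```python
class RetryLater(Exception):
    """Signals that the task should be rescheduled rather than treated as failed."""


def load_matrix_inputs(layers, data_layer_id, sample_layer_id):
    """Fetch the data layer and the layer holding the patient_sample ids.

    `layers` is a hypothetical mapping of layer id -> downloaded content.
    """
    if sample_layer_id not in layers:
        # The dependency has not been processed yet, so ask to be retried.
        raise RetryLater("layer %s is not available yet" % sample_layer_id)
    return layers[data_layer_id], layers[sample_layer_id]
```

A worker would catch `RetryLater` and re-queue the task instead of reporting an error for the layer.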

...