...

  • Input: (Synapse dataset id, TCGA layer download uri)
  • Download the layer MD5 checksum from TCGA first
    • check whether a corresponding layer with that checksum is already stored in Synapse; if so, skip to the end of this step
  • Download the layer data file from TCGA
  • Compute its MD5 and compare it with the checksum downloaded from TCGA
  • Ask Synapse for a pre-signed URL to which to put the layer
    • arguments to this call would include
      • dataset id
      • layer type
      • path portion of the TCGA URL
    • naming convention for TCGA layer S3 URLs
      • use a subdirectory in the S3 bucket tcga/source for source data from TCGA
      • the filename is tcga/source concatenated with the path portion of the TCGA download URL -> TCGA has spent a long time thinking about their URL conventions, so we should probably reuse their scheme
      • TODO what if it already exists? Look at the S3 API to see if there is a create-only flag (don't use the S3 versioning feature; we want the version to be explicit in the URL)
  • Upload the layer to S3
  • Output: (Synapse dataset id, layer md5, S3 layer uri) or error
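The checksum comparison and the S3 naming convention in this step can be sketched in Python as follows. This is only a minimal sketch under stated assumptions: the function names are hypothetical, the checksum text from TCGA is assumed to start with the hex digest (optionally followed by a filename), and the Synapse pre-signed URL call and the upload itself are left out because their interfaces are not specified here.

```python
import hashlib
import posixpath
from urllib.parse import urlparse

# Subdirectory of the S3 bucket for source data from TCGA (from the convention above).
S3_PREFIX = "tcga/source"


def md5_hex(data_path, chunk_size=1 << 20):
    """Return the hex MD5 digest of a file, reading it in chunks."""
    digest = hashlib.md5()
    with open(data_path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def checksum_matches(data_path, expected_md5_text):
    """Compare the computed digest with the checksum text downloaded from TCGA.

    Assumption: the text begins with the hex digest, optionally followed
    by whitespace and a filename.
    """
    expected = expected_md5_text.split()[0].lower()
    return md5_hex(data_path) == expected


def s3_key_for(tcga_url, prefix=S3_PREFIX):
    """Concatenate the prefix with the path portion of the TCGA download URL."""
    path = urlparse(tcga_url).path.lstrip("/")
    return posixpath.join(prefix, path)
```

Because the key is derived purely from the TCGA URL, re-running the step for the same URL always targets the same S3 location, which is why the "what if it already exists?" TODO above matters.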

...

Create Synapse record for the source

...

layers from TCGA

Details

  • Input: (Synapse dataset id, S3 layer md5 uri, S3 layer uri)
  • Search Synapse to see if this layer already exists
  • assumption: because of the way TCGA formulates its paths, the path portion of the layer uri contains enough information to determine the type (clinical vs. expression) and the make/model of the platform, which is sufficient metadata for the layer; we can always add more layer metadata at another step
  • if the layer exists
    • if the md5sum matches
      • skip to the end of this task
    • else the md5 does not match
      • bump the version number of the layer and set the new S3 URLs
  • else the layer does not exist
    • create new layer metadata in Synapse
  • Output: (Synapse dataset id, Synapse layer id) or error
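The skip / bump version / create decision in this step can be written as a small pure function. This is a sketch only: the `Layer` record and `reconcile_layer` are hypothetical names introduced for illustration, and the Synapse search and write calls are assumed to happen outside this function.

```python
from dataclasses import dataclass, replace
from typing import Optional, Tuple


@dataclass(frozen=True)
class Layer:
    """Hypothetical stand-in for the layer metadata stored in Synapse."""
    layer_id: str
    md5: str
    s3_uri: str
    version: int = 1


def reconcile_layer(existing: Optional[Layer], md5: str, s3_uri: str,
                    new_layer_id: str) -> Tuple[str, Layer]:
    """Decide what to do with an incoming layer.

    Returns (action, layer) where action is "created", "skipped",
    or "versioned".
    """
    if existing is None:
        # The layer does not exist: create new layer metadata.
        return "created", Layer(new_layer_id, md5, s3_uri)
    if existing.md5 == md5:
        # Same content: nothing to do, skip to the end of the task.
        return "skipped", existing
    # Content changed: bump the version and point at the new S3 URI.
    bumped = replace(existing, md5=md5, s3_uri=s3_uri,
                     version=existing.version + 1)
    return "versioned", bumped
```

Keeping the decision separate from the Synapse calls makes each branch easy to test without a running service.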

...

  • a matrix of id to probe for Level 2 expression data
  • a matrix of id to gene for Level 3 expression data

Create Synapse record for

...

processed TCGA dataset layer

  • Input: (Synapse dataset id, Synapse layer id, S3 url of matrix) or error
  • create the new layer metadata for this matrix in Synapse
    • handle duplicates
    • handle new versions
  • Output: (dataset id, matrix layer id) or error

...