...
- Input: (Synapse dataset id, TCGA layer download uri)
- Download the layer MD5 checksum from TCGA first
- Check whether Synapse already stores a corresponding layer with that checksum; if so, skip to the end of this step
- Download the layer data file from TCGA
- Compute its MD5 and compare it with the checksum downloaded from TCGA
- Ask Synapse for a pre-signed URL to which to put the layer
  - Arguments to this call would include
    - dataset id
    - layer type
    - path portion of the TCGA URL
  - naming convention for TCGA layer S3 URLs
    - use a subdirectory in the S3 bucket, tcga/source, for source data from TCGA
    - the filename is tcga/source concatenated with the path portion of the TCGA download URL; TCGA has spent a long time thinking about its URL conventions, so we should probably reuse their scheme
    - TODO: what if it already exists? Look at the S3 API to see whether there is a create-only flag (don't use the S3 versioning feature; we want the version to be explicit in the URL)
- Upload the layer to S3
- Output: (Synapse dataset id, layer md5, S3 layer uri) or error
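The checksum and naming steps above could be sketched roughly as follows. This is a minimal illustration, not the actual workflow code: the function names are hypothetical, and it assumes the TCGA checksum file uses the common md5sum layout of a hex digest followed by a file name.

```python
import hashlib
from urllib.parse import urlparse

# Prefix taken from the naming convention in the notes above.
S3_SOURCE_PREFIX = "tcga/source"


def md5_hex(path, chunk_size=1 << 20):
    """Return the hex MD5 digest of a file, reading it in chunks."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def parse_md5_file(text):
    """Extract the digest from md5sum-style text (assumed format)."""
    return text.split()[0].lower()


def s3_key_for(tcga_url):
    """Concatenate tcga/source with the path portion of the TCGA URL."""
    path = urlparse(tcga_url).path
    return S3_SOURCE_PREFIX + "/" + path.lstrip("/")
```

Comparing `md5_hex(local_file)` with `parse_md5_file(checksum_text)` covers the "compute and compare" step, and `s3_key_for` gives the key that would be passed along when asking for the pre-signed URL.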
...
- Pull down levels 1, 2, and 3 of Agilent expression data from http://tcga-data.nci.nih.gov/tcgafiles/ftp_auth/distro_ftpusers/anonymous/tumor/coad/cgcc/unc.edu/agilentg4502a_07_3/transcriptome/, skip all other URLs
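One way to apply this filter is sketched below. It assumes that the data level appears in the archive file name as `.Level_N.`, which should be checked against the actual directory listing; the function name and the file names in the usage note are hypothetical.

```python
import re

# Directory named in the step above.
AGILENT_PREFIX = (
    "http://tcga-data.nci.nih.gov/tcgafiles/ftp_auth/distro_ftpusers/"
    "anonymous/tumor/coad/cgcc/unc.edu/agilentg4502a_07_3/transcriptome/"
)

# Assumption: archive names embed the level as ".Level_1.", ".Level_2.", ...
LEVEL_PATTERN = re.compile(r"\.Level_([123])\.")


def is_wanted(url):
    """Keep only level 1-3 archives under the Agilent directory."""
    return url.startswith(AGILENT_PREFIX) and LEVEL_PATTERN.search(url) is not None
```

For example, a URL under that directory ending in `unc.edu_COAD.AgilentG4502A_07_3.Level_3.1.0.0.tar.gz` would be kept, while one ending in `unc.edu_COAD.AgilentG4502A_07_3.mage-tab.1.0.0.tar.gz` would be skipped.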
Create Synapse record for the source
...
layers from TCGA
Details
- Input: (Synapse dataset id, S3 layer md5 uri, S3 layer uri)
- Search Synapse to see if this layer already exists
- assumption: because of the way TCGA formulates its paths, the path portion of the layer URI carries enough information to determine the type (clinical vs. expression) and the make/model of the platform, which is sufficient metadata for the layer; we can always add more layer metadata at another step
- if the layer exists
  - if the md5sum matches
    - skip to the end of this task
  - else the md5sum does not match
    - bump the version number of the layer and set the new S3 URLs
- else the layer does not exist
  - create new layer metadata in Synapse
- Output: (Synapse dataset id, Synapse layer id) or error
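The branching in this task can be sketched as a small pure function. The `Layer` record and `reconcile_layer` below are hypothetical in-memory stand-ins used for illustration, not the Synapse API; the search and the actual create/update calls are left out.

```python
from dataclasses import dataclass
from typing import Optional, Tuple


@dataclass
class Layer:
    """Hypothetical stand-in for the layer metadata held in Synapse."""
    layer_id: str
    md5: str
    version: int
    s3_uri: str


def reconcile_layer(existing: Optional[Layer], md5: str, s3_uri: str,
                    new_id: str) -> Tuple[Layer, str]:
    """Decide whether to create, keep, or version-bump a layer."""
    if existing is None:
        # The layer does not exist: create new layer metadata.
        return Layer(new_id, md5, 1, s3_uri), "created"
    if existing.md5 == md5:
        # Same checksum: nothing to do for this task.
        return existing, "unchanged"
    # Checksum differs: bump the version and point at the new S3 URI.
    bumped = Layer(existing.layer_id, md5, existing.version + 1, s3_uri)
    return bumped, "version_bumped"
```

Keeping the decision separate from the calls that talk to Synapse makes the three outcomes easy to test on their own.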
...
- a matrix of id to probe for Level 2 expression data
- a matrix of id to gene for Level 3 expression data
Create Synapse record for
...
processed TCGA dataset layer
- Input: (Synapse dataset id, Synapse layer id, S3 url of matrix) or error
- create the new layer metadata for this matrix in Synapse
- handle duplicates
- handle new versions
- Output: (dataset id, matrix layer id) or error
...