Summary of MetaGEO steps:
Step 0: Find studies that are to be processed and initiate workflow instances for these studies.
Step 1: Download and modify/parse the study's series_matrix files. (Q: Where should the file go: file system, Synapse, other? Is it temporary?)
Step 2: Parse metadata from the series_matrix file and create folder hierarchy. (Q: Why create folders *now*? What if folders already exist? Where should metadata go: file system, Synapse? Should steps 1 and 2 be combined?)
Step 3: Create 'raw data' folders and 'makefiles'. (Q: Can 'makefile' logic be split between later steps and the 'decider'? Can 'raw data' folder creation be done in Step 1 or 2 instead of here?)
Step 4: Download CEL files. (Q: Should there be one activity per CEL file? If not, the activity must know how to resume from a partial result and/or recover if the FTP transfer fails. A: All files can be downloaded as a single .tar.gz.)
Step 5: Unzip .tar.gz and discard non-CEL files.
Step 6: Extract scan timestamp and add to metadata.
Step 7: Reconcile CEL files with metadata. If there is a mismatch then halt the process.
Step 8: Add CEL file names to the metadata and transpose the metadata file.
** at this point we run parallel steps for each platform (="array pattern") in the study **
Step 9: Create Sweave file for processing the <study,platform> in R.
Step 10: Run Sweave file. Input: CEL file set and metadata file; Output: processed data (R object), diagnostics (many image files), and inference (R objects).
Step 11: Clean up temporary files. (Currently done by makefile.)
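Steps 5 and 7 could be sketched roughly as follows. This is only an illustration, not the existing makefile logic: the function names, the CelMismatchError exception, and the assumption that CEL members can be recognized by a case-insensitive ".cel" suffix are all hypothetical. The list of expected file names is assumed to come from the parsed metadata.

```python
import tarfile
from pathlib import Path


class CelMismatchError(Exception):
    """Hypothetical error: CEL files on disk do not match the metadata."""


def extract_cel_files(archive_path, dest_dir):
    """Step 5 sketch: unpack only the CEL members of a .tar.gz archive."""
    dest = Path(dest_dir)
    dest.mkdir(parents=True, exist_ok=True)
    extracted = []
    with tarfile.open(archive_path, "r:gz") as archive:
        for member in archive.getmembers():
            # Keep only the base name so members cannot escape dest_dir.
            name = Path(member.name).name
            # Assumption: CEL files are regular files ending in ".cel".
            if not member.isfile() or not name.lower().endswith(".cel"):
                continue  # non-CEL members are simply never written
            source = archive.extractfile(member)
            (dest / name).write_bytes(source.read())
            extracted.append(name)
    return sorted(extracted)


def reconcile(cel_names, metadata_names):
    """Step 7 sketch: halt (raise) if files and metadata disagree."""
    found, expected = set(cel_names), set(metadata_names)
    if found != expected:
        raise CelMismatchError(
            f"missing={sorted(expected - found)}, "
            f"unexpected={sorted(found - expected)}"
        )
```

Raising an exception on mismatch gives the workflow 'decider' a single, explicit signal on which to halt the process.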
Output:
- processed data
- diagnostics
- inference data
- *might* need to save files for manual rework, i.e. makefiles. (But this might be subsumed by the workflow framework.)
Temporary files, to be deleted:
- CEL files - can be deleted once the output is complete
- various files made by makefile
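The rule that temporary files are deleted only once the output is complete could look something like the sketch below. The function name, the flat working-directory layout, and the default "*.CEL" pattern are assumptions for illustration; the actual output file names are not specified here, so they are passed in by the caller.

```python
from pathlib import Path


def cleanup_temp_files(work_dir, required_outputs, temp_patterns=("*.CEL",)):
    """Delete temporary files, but only if every required output exists.

    Returns the sorted names of the files that were removed.
    """
    work = Path(work_dir)
    missing = [name for name in required_outputs if not (work / name).exists()]
    if missing:
        # Output is incomplete: keep everything so the step can be rerun.
        return []
    removed = []
    for pattern in temp_patterns:
        # Note: glob matching is case-sensitive on most Linux file systems.
        for path in work.glob(pattern):
            # Never delete a file that is itself a required output.
            if path.is_file() and path.name not in required_outputs:
                path.unlink()
                removed.append(path.name)
    return sorted(removed)
```

Returning an empty list when outputs are missing, rather than raising, lets the cleanup step be retried safely after a failed run.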