
Notes

Summary of MetaGEO steps:

Step 0:  Find studies that are to be processed and initiate workflow instances for these studies.
Step 1:  Download and modify/parse the study's series_matrix files. (Q:  Where should the file go:  file system, Synapse, other?  Is it temporary?)
Step 2:  Parse metadata from the series_matrix file and create folder hierarchy (Q:  Why create folders *now*?  What if folders already exist?  Where should metadata go:  file system, Synapse?  Should steps 1&2 be combined?)

Step 3:  Create 'raw data' folders and 'makefiles'.  (Q:  Can 'makefile' logic be split between later steps and the 'decider'? Can 'raw data' folder creation be done in Step 1 or 2 instead of here?)

Step 4: Download CEL files.  (Q: Should there be one activity per CEL file?  If not, then the activity should know how to resume from a partial result and/or how to recover if the FTP transfer fails. A: All CEL files can be downloaded as a single .tar.gz.)

Step 5: Unzip .tar.gz and discard non-CEL files.

Step 6: Extract scan timestamp and add to metadata. 

Step 7:  Reconcile CEL files with metadata.  If there is a mismatch then halt the process.

Step 8: Add the CEL file name to the metadata, then transpose the metadata file.
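
Steps 5 and 7 above could be sketched roughly as follows. This is only an illustration, not the existing implementation: the function names are hypothetical, and it assumes that each CEL file name starts with its sample accession (e.g. GSM1.CEL or GSM1_xyz.CEL) and that CEL files inside the archive may themselves be gzipped.

```python
import re
import tarfile
from pathlib import Path

# Assumption: CEL files may be stored plain or individually gzipped.
CEL_SUFFIXES = (".cel", ".cel.gz")


def extract_cel_files(archive_path, dest_dir):
    """Step 5 sketch: unpack a .tar.gz, keeping only CEL files.

    Returns the sorted list of extracted file names.
    """
    dest = Path(dest_dir)
    dest.mkdir(parents=True, exist_ok=True)
    extracted = []
    with tarfile.open(archive_path, mode="r:gz") as tar:
        for member in tar:
            # Use only the base name so archive paths cannot escape dest_dir.
            name = Path(member.name).name
            if not member.isfile() or not name.lower().endswith(CEL_SUFFIXES):
                continue  # discard non-CEL content
            (dest / name).write_bytes(tar.extractfile(member).read())
            extracted.append(name)
    return sorted(extracted)


def reconcile(cel_names, sample_ids):
    """Step 7 sketch: compare CEL files against the samples in the metadata.

    Returns (samples with no CEL file, CEL files with no matching sample).
    """
    expected = set(sample_ids)
    # Assumption: the sample accession is the part before the first '.' or '_'.
    by_sample = {
        re.split(r"[._]", name, maxsplit=1)[0].upper(): name for name in cel_names
    }
    missing = sorted(expected - set(by_sample))
    unexpected = sorted(n for sid, n in by_sample.items() if sid not in expected)
    return missing, unexpected


def reconcile_or_halt(cel_names, sample_ids):
    """Raise if there is any mismatch, so the workflow stops at Step 7."""
    missing, unexpected = reconcile(cel_names, sample_ids)
    if missing or unexpected:
        raise RuntimeError(
            f"CEL/metadata mismatch: missing={missing}, unexpected={unexpected}"
        )
```

Returning both lists from reconcile, rather than a single pass/fail flag, would let a halted workflow report exactly which samples need manual rework.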

** at this point we run parallel steps for each platform (="array pattern") in the study **

Step 9: Create Sweave file for processing the <study,platform> in R.

Step 10: Run Sweave file.  Input:  CEL file set and metadata file; Output: processed data (R object), diagnostics (many image files), and inference (R objects).

Step 11: Clean up temporary files.  (Currently done by makefile.)
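
The per-platform fan-out for Steps 9-11 might look something like the sketch below. It is a stand-in for whatever the workflow framework provides, not a proposal for the real mechanism: process_platform is a hypothetical placeholder for "create and run the Sweave file", and a thread pool is used only to show the shape of the parallelism.

```python
from collections import defaultdict
from concurrent.futures import ThreadPoolExecutor


def group_by_platform(samples):
    """Group (sample_id, platform_id) pairs into {platform_id: [sample_id, ...]}."""
    groups = defaultdict(list)
    for sample_id, platform_id in samples:
        groups[platform_id].append(sample_id)
    return dict(groups)


def process_platform(study_id, platform_id, sample_ids):
    """Hypothetical placeholder for Steps 9-11 on one <study, platform> pair."""
    return f"{study_id}/{platform_id}: {len(sample_ids)} samples"


def run_study(study_id, samples, max_workers=4):
    """Run one task per platform in parallel and collect results by platform."""
    groups = group_by_platform(samples)
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        futures = {
            platform_id: pool.submit(process_platform, study_id, platform_id, ids)
            for platform_id, ids in groups.items()
        }
        # result() re-raises any exception from the worker, so one failed
        # platform surfaces here instead of being silently dropped.
        return {platform_id: f.result() for platform_id, f in futures.items()}
```

For example, with made-up identifiers, run_study("GSE0", [("GSM1", "GPL96"), ("GSM2", "GPL97"), ("GSM3", "GPL96")]) would start two tasks, one per platform.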

Output:

- processed data

- diagnostics

- inference data

- *might* need to save files for manual rework, i.e. makefiles.  (But this might be subsumed by the workflow framework.)

Temporary files, to be deleted:

- CEL files - can be deleted once the output is complete

- various files made by makefile
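
One way to enforce "CEL files can be deleted once the output is complete" is to guard the deletion on the presence of the expected outputs. The sketch below assumes the outputs and CEL files sit in a single working directory and that the caller knows the expected output file names; both are assumptions for illustration, as is the function name.

```python
from pathlib import Path


def cleanup_cel_files(work_dir, required_outputs):
    """Delete CEL files in work_dir only if every required output exists.

    Returns the names of the files removed (empty if anything is missing).
    """
    work = Path(work_dir)
    if any(not (work / name).exists() for name in required_outputs):
        return []  # output incomplete: keep the CEL files for a rerun
    removed = []
    for path in sorted(work.iterdir()):
        if path.is_file() and path.name.lower().endswith((".cel", ".cel.gz")):
            path.unlink()
            removed.append(path.name)
    return removed
```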