Scenarios: File / Folder API

This is a high-level outline of the types of computational work we are aiming to support with Synapse, and how some typical operations map onto our API.  The goal is to understand the user's conceptual model of the system and the benefits he or she gains by using it, not to give an exhaustive summary of all functions needed to support this work.  See also Entities, Files, and Folders Oh My!

Ad hoc analysis - private & exploratory

Alice is a data analyst / computational scientist starting a new project.  She creates a folder in her local Linux home directory and populates it with a set of files (e.g. starting raw data) obtained from Bob, her biologist friend.  She starts some exploratory statistical analysis in an interactive R session.  Initially she doesn't know when, or even if, she will find anything of interest, so there's probably a period of time where Synapse is not involved at all.  After some time she arrives at some preliminary findings she wants to at least remember for her own future reference.  At this point she creates a new Synapse project and adds a local folder to it.  Command line interaction with Synapse might look something like:

  >syn create Project 'AlicesProject'
  >syn add . AlicesProject -recurse = true -location = local

At this point her Synapse project is populated with a mirror of her local filesystem folder, although all the files still live exclusively on her local file system.  Synapse has some metadata on the files and folders (e.g. SHA1, creation timestamp, the user who created them, maybe file size).  Now, in her interactive R session she has a plot and a data frame she'd also like to add to the project, as they are the start of interesting results.  At the R command line:

  synAdd('AlicesProject/localTopFolder', aRDataframe)
  synAdd('AlicesProject/localTopFolder', aRPlot)

These commands save the dataframe and plot to files.  Location defaults to Synapse S3 storage, so there are now two additional files in the Synapse project.  As these were pushed up to S3, Synapse generates previews for the plot and dataframe files.
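As a rough illustration of the metadata Synapse could hold for files that stay on Alice's disk, the sketch below walks a folder and records a SHA1, size, and modification time per file. It is only a sketch in Python, one of the client environments this page mentions; `describe_folder`, the record keys, and the `"local"` location value are hypothetical names chosen for this example, not part of any existing Synapse client.

```python
import hashlib
import os


def sha1_of_file(path, chunk_size=1 << 20):
    """Hash a file in chunks so large raw-data files are never read whole."""
    digest = hashlib.sha1()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def describe_folder(root, user):
    """Return {relative_path: metadata} for every file under root.

    Hypothetical record layout: only metadata is collected here; the
    file contents themselves are never copied anywhere.
    """
    records = {}
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            full = os.path.join(dirpath, name)
            rel = os.path.relpath(full, root).replace(os.sep, "/")
            info = os.stat(full)
            records[rel] = {
                "sha1": sha1_of_file(full),
                "size": info.st_size,
                "modified": info.st_mtime,
                "user": user,
                "location": "local",  # assumption: marks data left on local disk
            }
    return records
```

Calling `describe_folder('.', user='alice')` from the project folder would give the client something to register with the project without moving any data.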

She switches over to the web client:

  synOnWeb(aRPlot)

In the project she's able to make some notes on what she's done using the wiki tools, referencing the data frame and plot.  The new project is now in her list of her recent projects.

At the local OS command line:

  syn get . -recurse = true

This pulls down the two new files (the plot and the data frame) to her local folder.  The data frame could be stored as either a .csv or an R binary file.

Benefits:

  • Synapse as dashboard of all her projects, regardless of where the data is living or who the collaborators are (this is likely one of many projects she is switching among)
  • Wiki as notebook for future self.
  • Annotations to enhance ability to later find the data / project if it goes dormant for a while.
  • Ability to easily move projects between different work environments (PC, shared computational servers, cloud)

Ad hoc analysis - Collaborative

After some time she arrives at some preliminary findings she wants to share with her collaborator Bob (more of a biologist).  She adds Bob to the project and emails him a link to view the results.  Bob is able to review Alice's findings and comment on the wiki pages.  He's got some new data he wants to share with Alice, so he uploads it to the project from the web client.  Alice receives a notification (via configured email notifications, project activity history, etc.).  Alice is able to pull the files down to her local environment and continue working.

  syn get . -recurse = true

Later, Alice would like her analyst friend Carl at another institution to check her analysis (or would like a backup of her work, or access to it from another machine...).

  syn put . -recurse = true -location = SynapseStorage

This pushes the files up to Synapse's native S3 storage.  Carl can now move the project over to his own computer, or his Amazon account.  (Why not just sync files using Git, or Dropbox, or any number of other solutions?  Assume some of the files are large. e.g. raw genomics data.  In this case files always remain local, and if Carl wants to access them he will get an account on Alice's system.  Different folders of the project might be stored in different places.)

The project could evolve for some time in this fashion, mainly relying on the file-folder API, wiki, and collaboration features.  Extensions could be to have users manage multiple storage locations (e.g. their own S3 buckets), or to have clients that automatically sync content in the background.
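One way a client could decide what a `get` or `put` needs to transfer is to compare content hashes on each side. The following is a minimal sketch, assuming each side can be summarized as a mapping from relative path to SHA1; `plan_sync` and the three plan categories are hypothetical names for this example, and Python is used only because it is one of the client environments mentioned on this page.

```python
def plan_sync(local, remote):
    """Compare two {path: sha1} manifests and decide what to transfer.

    Files present on one side only are copied to the other side. Files
    present on both sides with different hashes are reported as
    conflicts, because hashes alone cannot say which copy is newer.
    """
    plan = {"download": [], "upload": [], "conflict": []}
    for path in sorted(set(local) | set(remote)):
        if path not in local:
            plan["download"].append(path)
        elif path not in remote:
            plan["upload"].append(path)
        elif local[path] != remote[path]:
            plan["conflict"].append(path)
    return plan
```

A background-syncing client could run this periodically and act on the `download` and `upload` lists automatically, leaving conflicts for the user to resolve.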

Benefits:

  • Authorization controls over project contents
  • Synchronize files among multiple environments with parallel concurrent use (different institutions' in-house systems, cloud offerings, etc.)
  • Shared online collaborative workspace to pull key findings together from multiple people and document project status.

Reproducible Ad hoc analysis

After some time, Alice has a result she believes is important and will eventually form part of a paper, and she wants to make sure Carl can see exactly what she did.  At this point she builds a set of R scripts which process the data through a series of steps.  She stores the scripts in a GitHub repository associated with the project.  She also uses a few bioinformatics tools installed on her local system, run from the Linux command line, as part of her process.  Now she re-runs the analysis, this time recording what she did using Synapse provenance features to link all the files, starting with raw data, through all intermediate results, and ending with a set of figures, vectors, and other output data.  All this can be pushed up to Synapse as before, but now there is a graphical representation of her process available in Synapse that Carl can use to review her work, including links to the code and tools she used.  (The command line client would need to push up the commands used to run tools at the Linux command line.)  If Carl and Alice are working on the same system, access to the code or commands to execute system programs should give Carl a pretty good idea of exactly what Alice did, and she can provide additional commentary in the wiki and/or edit the provenance records to provide more details (e.g. version info for some of the tools she used).

TODO: Outline of adding provenance calls from R / command line / python

An extension of this scenario in the case where both users are working in Amazon would include capturing the specifics of the environment used to run the analysis (AMI, size, etc) as additional parts of the provenance record.  These environment descriptions could be stored as Files pointing to publicly-accessible AMIs, allowing anyone to execute the work (in their own AWS account).  In fact, Alice may want to rerun the analysis on Amazon again before publication to ensure that her reviewer can step into her analysis, using her project as supplemental materials to her paper.
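To make the shape of such a provenance record concrete, here is a small sketch of one possible in-memory model: each step records its input files, output files, the code or command used, and optionally an environment description. The `Activity` class, its field names, and `lineage` are assumptions made for this illustration and do not describe an actual Synapse provenance API.

```python
from dataclasses import dataclass, field


@dataclass
class Activity:
    """One recorded step of an analysis (hypothetical record layout)."""
    name: str
    inputs: list
    outputs: list
    code: str = ""      # e.g. path of a script in the project's repository
    command: str = ""   # e.g. the shell command line that was run
    environment: dict = field(default_factory=dict)  # e.g. machine image, size


def lineage(target, activities):
    """Return the activities needed to produce target, upstream steps first."""
    produced_by = {out: act for act in activities for out in act.outputs}
    ordered, seen = [], set()

    def visit(file_name):
        act = produced_by.get(file_name)
        if act is None or id(act) in seen:
            return  # raw data, or a step that is already scheduled
        seen.add(id(act))
        for parent in act.inputs:
            visit(parent)
        ordered.append(act)

    visit(target)
    return ordered
```

Given the steps that lead to a figure, `lineage('figure.png', activities)` would list them in the order a reviewer such as Carl would need to re-run them, which is essentially the graph the web client would draw.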

Benefits:

  • Ease for additional people to step in and review / contribute to the work
  • Move project towards publicly publishable state.

Pipelined Analysis

It turns out that Alice's paper is a hit and now she has lots of biologists asking for help running similar analyses on different data sets. She converges on a particular structure to capture the results of various intermediate stages of her analysis, e.g. to help out a new collaborator (Diane) in a new project:

  f <- Folder(parent='DianesProject', name='Stage 1 Results')

Annotate to describe the results.  A couple integer values, e.g. certain key statistics of the result

  f$annotations$Pval <- 1  # or even more ideally just f$Pval <- 1
  f$annotations$Fval <- 4
  f$annotations$status <- 'valid'  # A text annotation

Above, I am assuming this is syntactic sugar for things like

  f <- setAnnotation(f, key='Pval', value='1', type='int')

Assume we have a handle to some text file

  f$annotations$result1 <- someFileHandle

Syntactic sugar for

  f <- addFile(f, someFileHandle, location='local')
  f <- setAnnotation(f, key='result1', value=referenceToFile, type='file')

Could do the same thing with other objects that get serialized to files

  f$annotations$result2 <- anotherRobject  # Save as serialized R binary?
  f$annotations$image <- plot  # Save as image
  f$annotations$vectorData <- c(3, 4, 5, 6)  # In principle could be very large, store as file?
  f$annotations$matrix <- rbind(c(2, 3, 4), c(3, 4, 5), c(4, 5, 6))  # In principle could be very large, store as file?

Push everything up to Synapse:

  synPut(f, recurse='true')

Behind the scenes the client must do this:

1. Create someFileHandle as a child File of f of type .txt.  Synapse generates preview of it

2. Create anotherRobject as a child File of f of type .Rbin  Preview?

3. Always, (or only if vector / matrix are large), create them as additional child File handles (.csv?)

4. Create another File handle to store the plot

5. Update all the file / folder entities in one call to reference all the new File Handles.  This could be done via generalization of the "bundle" or "package" API currently in use to optimize data transfer between the service and web tiers.  In the simplest case, users aren't exposed to a new object at all, but the use of options like recurse=true would use this object under the covers.
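The steps above can be sketched as a planning function that separates scalar annotations from values that have to become child files, and collects everything into one bundle for a single update call. This is an illustrative model only: `plan_put`, `LocalFile`, the bundle layout, the type names, and the `.csv` / `.bin` extension choices are all assumptions made for this example, written in Python as one of the client environments this page mentions.

```python
from dataclasses import dataclass


@dataclass
class LocalFile:
    """Stand-in for a handle to a file on disk (hypothetical)."""
    path: str


# Assumed mapping from client-side types to annotation type names.
SCALAR_TYPES = {bool: "boolean", int: "int", float: "double", str: "string"}


def plan_put(folder_path, annotations):
    """Build one bundle describing every change needed for the folder."""
    bundle = {"folder": folder_path, "annotations": {}, "files": []}
    for key, value in annotations.items():
        type_name = SCALAR_TYPES.get(type(value))
        if type_name is not None:
            # Plain values are stored directly as typed annotations.
            bundle["annotations"][key] = {"type": type_name, "value": value}
            continue
        if isinstance(value, LocalFile):
            child = value.path.rsplit("/", 1)[-1]  # keep the file's own name
        elif isinstance(value, (list, tuple)):
            child = key + ".csv"   # vectors / matrices written as text
        else:
            child = key + ".bin"   # any other object, serialized
        bundle["files"].append(child)
        # The annotation becomes a reference to the new child file.
        bundle["annotations"][key] = {
            "type": "file",
            "value": folder_path + "/" + child,
        }
    return bundle
```

With a plan like this, the client can create the child files first and then send the folder, its annotations, and the new file references in one request, which is the role the "bundle" object plays in step 5. This sketch always writes vectors to files; a size threshold could be added for the "only if large" variant of step 3.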

Another user must be able to do this to get back the same data:

  f <- synGet(path='DianesProject/Stage 1 Results', recurse='true')

Alice turns her set of scripts into a publicly-hosted R package.  This includes the development of R objects specific to her analysis that encapsulate some of the key steps / data structures that are handed off between different steps.  She also includes helper functions that store and retrieve the pieces of the object in Synapse as a set of folders, files, and annotations that follow a particular convention.  This set of objects together would be a Package: a higher-level API allowing interactions with a collection of files and folders in a single transaction.  Objects in scientific analytical environments can be mapped to Packages to make the Synapse back end feel more like an object store.  This approach also allows Alice to develop a widget for the Synapse UI that presents a visualization of this data in a way understandable by her collaborators.  This gives her and other analysts an object-centric view of the data structures relevant to this analysis in R, and the ability to easily load and save these objects to/from Synapse.  Other analysts can do the same thing in other environments (e.g. Python) by defining similar objects and helper functions.
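As a sketch of what such helper functions might look like in the Python case, the example below maps one analysis object onto a package made of annotations plus a file, and back again. The `StageResult` class, the `scores.csv` file name, and the dictionary used to stand in for a Package are hypothetical conventions invented for this illustration.

```python
from dataclasses import dataclass


@dataclass
class StageResult:
    """An analysis-specific object handed between pipeline steps."""
    p_value: float
    status: str
    scores: list


def to_package(result):
    """Store small fields as annotations and bulk data as a file."""
    return {
        "annotations": {"Pval": result.p_value, "status": result.status},
        "files": {"scores.csv": "\n".join(str(s) for s in result.scores)},
    }


def from_package(package):
    """Rebuild the object from a package that follows the same convention."""
    text = package["files"]["scores.csv"]
    scores = [float(line) for line in text.splitlines()]
    notes = package["annotations"]
    return StageResult(notes["Pval"], notes["status"], scores)
```

Because both helpers agree on the convention, `from_package(to_package(result))` gives back an equal object, which is the property that lets the back end feel like an object store to the analyst.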

Synapse now needs to help Alice run large numbers of analytical pipelines for various collaborators who want her to do it for them.  She has them contribute data to their own projects following particular conventions for the raw data, and then runs her pipeline publishing back the results including even auto generating a first draft of the wiki.  She then uses Synapse to communicate her results back to these collaborators.  The Evaluation API can be used to help manage large sets of similar requests.

If we have many of these sorts of objects, an extension to this use case is for Synapse to provide central storage and retrieval of these object definitions, and / or ways to autogenerate the objects and helper functions from existing Synapse data structures used as prototype instances.

Benefits:

  • Consistent structures and hardened pipelines evolve out of ad hoc work supporting common preprocessing
  • Structure work for large scale comparison of methods / data in public or private challenges