Reproducible research is a fundamental responsibility of scientists, but the best practices for achieving it are not established in computational biology. The Synapse provenance system is one of many solutions you can use to make your work reproducible by you and others.

Overview of Synapse Provenance

Provenance is a concept describing the origin of something. In Synapse, it describes the connections between the workflow steps used to create a particular file or set of results. Data analysis often involves multiple steps to go from a raw data file to a finished analysis. Synapse’s provenance tools allow users to keep track of each step involved in an analysis and share those steps with other users.

...

The model Synapse uses for provenance is based on the W3C provenance spec where items are derived from an activity which has components that were used and components that were executed. Think of the used items as input files and executed items as software or code. Both used and executed items can reside in Synapse or in URLs such as a link to a GitHub commit or a link to a specific version of a software tool.

...

Below is a Synapse visualization of provenance relationships that was created with the example in this guide using our programmatic and web clients. In this example, we have two scripts, one that generates random numbers and another that takes a list of numbers and computes their squares. The project’s workflow resembles the provenance relationships.
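As a rough illustration of the model described above, the two-script workflow can be sketched in plain Python. This is not the Synapse client API. The `Activity` class, the `generated_by` mapping, the `lineage` helper, and the activity names are hypothetical and exist only to show how "used" and "executed" items link outputs back to their origins. The file names are the ones used in this guide.

```python
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class Activity:
    """One analysis step: the inputs it used and the code it executed."""
    name: str  # hypothetical label, chosen for this sketch
    used: List[str] = field(default_factory=list)
    executed: List[str] = field(default_factory=list)


# Map each output file to the activity that generated it.
# Scripts and raw inputs have no entry because nothing in this sketch derives them.
generated_by: Dict[str, Activity] = {
    "random_numbers.txt": Activity(
        name="generate random numbers",
        executed=["generate_random_data.R"],
    ),
    "squares.txt": Activity(
        name="square the numbers",
        used=["random_numbers.txt"],
        executed=["square.R"],
    ),
}


def lineage(item: str) -> List[str]:
    """Return every upstream item that contributed to `item`, depth first."""
    activity = generated_by.get(item)
    if activity is None:
        return []
    upstream: List[str] = []
    for parent in activity.executed + activity.used:
        upstream.append(parent)
        upstream.extend(lineage(parent))
    return upstream


print(lineage("squares.txt"))
```

Walking the mapping from `squares.txt` reaches both scripts and the intermediate data file, which is the same chain the provenance visualization displays.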

...

Setting Provenance

...

when Uploading a File

Let’s begin with a script that generates a list of normally distributed random numbers and saves the output to a file. For example, you have an R script file called generate_random_data.R and you’ve saved the output to a data file called random_numbers.txt. We’ll begin by uploading the files to Synapse and then set their provenance.

Upload a File and Add Provenance

For this example, we’ll use a project that already exists (Wondrous Research Example : syn1901847). The code file is saved in Synapse with synID syn7205215, so we’ll upload the data file to this project, or in Synapse terminology, the project will be the parent of the new entities.

As the random_numbers.txt file was generated from the above script, we are going to specify this using provenance.

There are a couple of ways to set provenance information for a Synapse entity. The used and executed arguments specify resources used and code executed in the process of creating the entity. Code can be stored in Synapse (as we did in the previous step) or, better yet, linked by URL to a source code versioning system like GitHub or SVN. As an example, we’ll specify two somewhat contrived sources of provenance:

  1. Synapse entity by synID: syn7205215 (the code file)

  2. URL to a page describing normal distributions
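The two kinds of reference listed above can be told apart mechanically. The sketch below is plain Python, not a Synapse client call. The `reference_kind` helper is hypothetical, the pattern "syn followed by digits" is an assumption based on the IDs shown in this guide, and placing the code file under "executed" and the URL under "used" is likewise an assumption that follows the description of those two arguments.

```python
import re
from urllib.parse import urlparse

# Assumption: a synID is the literal prefix "syn" followed by digits.
SYN_ID = re.compile(r"^syn\d+$")


def reference_kind(reference: str) -> str:
    """Classify a provenance reference as a Synapse ID or a full URL."""
    if SYN_ID.match(reference):
        return "synapse"
    parsed = urlparse(reference)
    if parsed.scheme in ("http", "https") and parsed.netloc:
        return "url"
    raise ValueError(f"not a synID or full URL: {reference!r}")


# The two example sources from this guide, grouped by role.
provenance = {
    "executed": ["syn7205215"],
    "used": ["http://mathworld.wolfram.com/NormalDistribution.html"],
}

for role, references in provenance.items():
    for reference in references:
        print(role, reference_kind(reference), reference)
```

A URL without its scheme (for example `mathworld.wolfram.com/NormalDistribution.html`) is rejected by this helper, which is one reason to supply the full URL.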

Web

The web client does not support setting provenance when uploading a file. Instead, upload the file first, then navigate to the file in your project. Click on the File Tools dropdown in the upper right-hand corner and select Edit File Provenance. In the resulting pop-up, enter the relevant information. If you are entering an external URL as a reference, include the full URL path. In this example, you would enter http://mathworld.wolfram.com/NormalDistribution.html.

...

To update the provenance on a file, navigate to the file that you would like to update. Click on the File Tools dropdown in the upper right-hand corner and select Edit File Provenance. In the resulting pop-up, enter the relevant information.

...

Command Line

Code Block
# Add the data file to Synapse
synapse add squares.txt -parentId syn1901847 
# Set the provenance for newly created entity syn7209166 using synId
synapse set-provenance -id syn7209166 -executed syn7209078 -used syn7208917
# Set the provenance for newly created entity syn7209166 using local path
synapse set-provenance -id syn7209166 -executed ./square.R -used ./random_numbers.txt

...

Deleting Provenance

To delete a provenance relationship, you must be the person who created the entity.

...

Navigate to the entity you would like to delete provenance from (e.g. a file or folder). In this example, we are deleting provenance from a file. Select File Tools, then Edit File Provenance. In the list of Used and Executed, click the minus symbol (-) next to the URL or synID to delete each activity, then save your changes.

...

Viewing Provenance

Web

Navigate to a file to view its provenance. Clicking on the triple dots above an entity will expand it to show the file's full provenance.

...

Command Line

...

Reusing Provenance for Multiple Files

An activity is a Synapse object that helps to keep track of what objects were used in an analysis step, as well as what objects were generated. Thus, all relationships between Synapse objects and an activity are governed by dependencies. That is, an activity needs to know what it ‘used’, and outputs need to know what activity they were ‘generatedBy’. A couple of points for clarity:

  • An activity can ‘use’ many things (i.e. many inputs to an analysis)

  • Many outputs can be ‘generatedBy’ the same activity

If an activity isn’t first assigned to an entity and stored, a separate graph will be created for each file that the activity generated. The following example assigns the same activity to multiple files, resulting in one provenance graph:

...

Code Block
# Code used to generate the file will be syn123456
# Files used to generate the information
expr_file <- synGet("syn246810", download=F)
filter_file <- synGet("syn135791", download=F)

# Activity to assign to multiple files
act <- Activity(name="filtering",
                used=list(expr_file, filter_file),
                executed="syn123456")
# finalFile is assumed to be a Synapse File object created earlier (not shown)
finalFile <- synStore(finalFile, activity=act)

# Get the activity now associated with an entity
act <- synGetProvenance(finalFile)

# Now you can set this activity to as many files as you want (file1, file2, etc are Synapse Files)
finalList <- c(file1, file2, file3)
finalList <- lapply(finalList, function(x) synStore(x, activity=act))
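Why the activity is stored once and then fetched back before reuse can be shown with a toy model in plain Python. This is only a sketch of the behavior described above, not Synapse's implementation. The `store` function, the `StoredFile` class, and the integer IDs are hypothetical stand-ins, and the assumption is that an activity without an ID is saved as a brand-new activity each time it is stored.

```python
from dataclasses import dataclass, replace
from itertools import count
from typing import Optional, Tuple

_next_id = count(1)  # hypothetical ID generator for this sketch


@dataclass(frozen=True)
class Activity:
    name: str
    used: Tuple[str, ...] = ()
    executed: Tuple[str, ...] = ()
    id: Optional[int] = None  # None until the activity has been stored


@dataclass
class StoredFile:
    name: str
    activity: Activity


def store(name: str, activity: Activity) -> StoredFile:
    """Toy stand-in for storing a file: an activity with no id is saved as a new one."""
    if activity.id is None:
        activity = replace(activity, id=next(_next_id))
    return StoredFile(name, activity)


draft = Activity("filtering", used=("syn246810", "syn135791"), executed=("syn123456",))

# Reusing the unstored draft: each file gets its own activity, so three separate graphs.
separate = [store(name, draft) for name in ("file1", "file2", "file3")]

# Store once, fetch the saved activity back, then reuse it: one shared graph.
first = store("finalFile", draft)
saved = first.activity
shared = [store(name, saved) for name in ("file1", "file2", "file3")]

separate_ids = {f.activity.id for f in separate}
shared_ids = {f.activity.id for f in shared}
print(len(separate_ids), len(shared_ids))
```

In this model the first approach yields three distinct activity IDs and the second yields one, which mirrors the difference between separate provenance graphs and a single shared graph.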

Related Articles

Files and Versioning, Annotations and Queries

...