...
Provenance is a concept describing the origin of something; in Synapse it is used to describe the connections between workflow steps that derive a particular file of results. Data analysis often involves multiple steps to go from a raw data file to a finished analysis. Synapse’s provenance tools allow users to keep track of each step involved in an analysis and share those steps with other users.
The basic elements of Synapse provenance
The model Synapse uses for provenance is based on the W3C provenance spec, where items are derived from an activity which has components that were used and components that were executed. Think of the used items as input files and executed items as software or code. Both used and executed items can be either items that reside in Synapse or URLs, such as a link to a GitHub commit or a link to a specific version of a software tool.
You can use the command line, Python, and R clients to create and edit provenance relationships. In the web client, you can add provenance relationships once the file has been uploaded.
Below is a Synapse visualization of provenance relationships that was created with the example in this guide using our programmatic and web clients. In this example, we have two scripts, one that generates random numbers and another that takes a list of numbers and computes their squares. The project’s workflow resembles the provenance relationships.
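As a rough illustration of this model, the sketch below represents an activity as a plain Python object with `used` and `executed` lists and classifies each reference as a Synapse ID or a URL. The `ProvenanceActivity` class and the `classify_reference` helper are hypothetical names for this example and are not part of any Synapse client. It also assumes that a Synapse ID looks like `syn` followed by digits, as the IDs in this guide do.

```python
import re
from dataclasses import dataclass, field
from typing import List
from urllib.parse import urlparse

# Assumption: Synapse IDs look like "syn" followed by digits, as in this guide.
SYN_ID_PATTERN = re.compile(r"^syn\d+$")


def classify_reference(ref: str) -> str:
    """Return "synapse" for a synID, "url" for an http(s) link, else "unknown"."""
    if SYN_ID_PATTERN.match(ref):
        return "synapse"
    if urlparse(ref).scheme in ("http", "https"):
        return "url"
    return "unknown"


@dataclass
class ProvenanceActivity:
    """Hypothetical stand-in for an activity: inputs used and code executed."""
    name: str
    used: List[str] = field(default_factory=list)
    executed: List[str] = field(default_factory=list)


# Values taken from the example in this guide.
activity = ProvenanceActivity(
    name="Generate random numbers",
    used=["http://mathworld.wolfram.com/NormalDistribution.html"],
    executed=["syn7205215"],
)

for ref in activity.used + activity.executed:
    print(ref, "->", classify_reference(ref))
```

Keeping `used` and `executed` as separate lists mirrors the distinction drawn above between input files and code.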
...
Setting Provenance When Uploading a File
Let’s begin with a script that generates a list of normally distributed random numbers and saves the output to a file. For example, you have an R script file called generate_random_data.R and you’ve saved the output to a data file called random_numbers.txt. We’ll begin by uploading the files to Synapse and then set their provenance.
Upload a file and add provenance
For this example, we’ll use a Project that already exists (Wondrous Research Example: syn1901847). The code file is already saved in Synapse with synID syn7205215, so we’ll upload the data file to this Project, or in Synapse terminology, the project will be the parent of the new entities.
As the random_numbers.txt file was generated from the above script, we are going to specify this using provenance.
There are a couple of ways to set provenance information for a Synapse entity. The used and executed arguments specify resources used and code executed in the process of creating the entity. Code can be stored in Synapse (as we did in the previous step) or, better yet, linked by URL to a source code versioning system like GitHub or SVN. As an example, we’ll specify two somewhat contrived sources of provenance:
Synapse entity by synID: syn7205215 (the code file)
URL to a page describing normal distributions
Web
...
The web client does not support setting provenance when uploading a File. Instead, upload the File first, then navigate to the File in your Project. Click on the File Tools dropdown in the upper right-hand corner and select Edit File Provenance. In the resulting pop-up, enter the relevant information. If you are entering an external URL as a reference, include the full URL path. In this example, you would enter http://mathworld.wolfram.com/NormalDistribution.html.
Command Line
```bash
synapse add random_numbers.txt --parentId syn1901847 --executed syn7205215 --used http://mathworld.wolfram.com/NormalDistribution.html
```
...
```bash
synapse add random_numbers.txt --parentId syn1901847 --executed ./generate_random_data.R --used http://mathworld.wolfram.com/NormalDistribution.html
```
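When uploads are driven from a script, it can help to assemble this command programmatically. The sketch below only builds the command shown above and prints it; it does not run anything. `build_add_command` is a hypothetical helper, and the flags are the ones used in the examples above.

```python
import shlex
from typing import List, Optional


def build_add_command(path: str, parent_id: str,
                      executed: Optional[str] = None,
                      used: Optional[str] = None) -> List[str]:
    """Assemble the argument list for `synapse add` with provenance flags."""
    command = ["synapse", "add", path, "--parentId", parent_id]
    if executed is not None:
        command += ["--executed", executed]
    if used is not None:
        command += ["--used", used]
    return command


command = build_add_command(
    "random_numbers.txt",
    "syn1901847",
    executed="syn7205215",
    used="http://mathworld.wolfram.com/NormalDistribution.html",
)
# shlex.join quotes arguments where needed, so the string is safe to paste into a shell.
print(shlex.join(command))
```

To actually run the command, the list could be passed to `subprocess.run(command, check=True)`, assuming the Synapse command line client is installed and logged in.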
Python
```python
import synapseclient
from synapseclient import File

syn = synapseclient.login()

# Set provenance for data file generated by the script file
data_file = File(path="random_numbers.txt", parent="syn1901847")
data_file = syn.store(data_file, executed="syn7205215", used="http://mathworld.wolfram.com/NormalDistribution.html")
```
R
```r
# Set provenance for data file generated by the script file
data_file <- File(path="random_numbers.txt", parent="syn1901847")
data_file <- synStore(data_file, executed="syn7205215", used="http://mathworld.wolfram.com/NormalDistribution.html")
```
Once the data file is uploaded, Synapse will provide the synID assigned to that file. In this case, the data file’s synID is syn7208917.
Editing Provenance
To continue our example above, we’ll now add some new results from our initial data file. We’re going to take the results in random_numbers.txt and square them. The script to square the numbers will be square.R, and we’ll save the output to a data file, squares.txt. As with the previous example, the code file is already saved in Synapse, so we’ll upload the data file and set its provenance.
Web
To update the provenance on a file, navigate to the Files tab and click on the File that you would like to update. Click on the File Tools dropdown in the upper right-hand corner and select Edit File Provenance. In the resulting pop-up, enter the relevant information.
...
Command Line
```bash
# Add the data file to Synapse
synapse add squares.txt -parentId syn1901847

# Set the provenance for newly created entity syn7209166 using synIDs
synapse set-provenance -id syn7209166 -executed syn7209078 -used syn7208917

# Set the provenance for newly created entity syn7209166 using local paths
synapse set-provenance -id syn7209166 -executed ./square.R -used ./random_numbers.txt
```
Python
```python
from synapseclient import Activity, File

# Add the data file to Synapse (syn is a logged-in synapseclient.Synapse object)
squared_file = File(path="squares.txt", parent="syn1901847")
squared_file = syn.store(squared_file)

# Set provenance for newly created entity syn7209166.
# setProvenance returns the stored Activity, not the File, so keep it in its own variable.
activity = syn.setProvenance(squared_file, activity=Activity(used="syn7208917", executed="syn7209078"))

# Provenance can also be set using local variables instead of looking up synIDs
activity = syn.setProvenance(squared_file, activity=Activity(used=data_file, executed="syn7209078"))
```
R
```r
# Add the data file to Synapse
squared_file <- File(path="squares.txt", parent="syn1901847")
squared_file <- synStore(squared_file)

# Set provenance for newly created entity syn7209166
act <- Activity(name = "Squared numbers", used = "syn7208917", executed = "syn7209078")
squared_file <- synStore(squared_file, activity=act)

# Provenance can also be set using local variables instead of looking up synIDs
act <- Activity(name = "Squared numbers", used = data_file, executed = "syn7209078")
squared_file <- synStore(squared_file, activity=act)
```
Deleting Provenance
To delete a provenance relationship, you must be the person who created the entity.
Web
Navigate to the entity you would like to delete provenance from (e.g. a File or Folder). In this example, we are deleting provenance from a file. Select File Tools, then Edit File Provenance. In the list of Used and Executed, click the minus symbol button (-) next to the URL or synID to delete each activity, then Save your changes.
Command Line
Currently, deleting provenance is not supported in the command line client.
Python
```python
# Delete provenance on entity syn123 (the call does not return a value)
syn.deleteProvenance("syn123")
```
R
```r
# Delete provenance on entity syn123
synDeleteProvenance("syn123")
```
...
Viewing Provenance
Web
Navigate to a File to view its provenance. Clicking on the triple dots above an entity will expand it to show the File's full provenance.
...
Command Line
```bash
synapse get-provenance -id syn7209166
```
Python
```python
provenance = syn.getProvenance("syn7209166")
print(provenance)
```
R
```r
provenance <- synGetProvenance("syn7209166")
provenance
```
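Expanding an entity in the web view corresponds to following used and executed links upstream until no further activity is found. The sketch below mimics that traversal over a hand-written dictionary that mirrors this guide's example. It does not call Synapse; `PROVENANCE` and `upstream` are hypothetical names, and the records returned by the real clients have a richer structure.

```python
from typing import Dict, List

# Hand-written mirror of the example: entity -> the activity that generated it.
PROVENANCE: Dict[str, Dict[str, List[str]]] = {
    "syn7209166": {  # squares.txt
        "used": ["syn7208917"],
        "executed": ["syn7209078"],
    },
    "syn7208917": {  # random_numbers.txt
        "used": ["http://mathworld.wolfram.com/NormalDistribution.html"],
        "executed": ["syn7205215"],
    },
}


def upstream(entity_id: str) -> List[str]:
    """Return every reference reachable upstream of entity_id, without repeats."""
    seen: List[str] = []
    stack = [entity_id]
    while stack:
        current = stack.pop()
        activity = PROVENANCE.get(current)
        if activity is None:
            continue  # no recorded provenance: a script, raw input, or URL
        for ref in activity["used"] + activity["executed"]:
            if ref not in seen:
                seen.append(ref)
                stack.append(ref)
    return seen


for ref in upstream("syn7209166"):
    print(ref)
```

Starting from squares.txt, the walk reaches random_numbers.txt, both scripts, and the reference URL, which is the full chain built up in the earlier sections.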
Reusing
...
Provenance for Multiple Files
An Activity is a Synapse object that helps to keep track of what objects were used in an analysis step, as well as what objects were generated. Thus, all relationships between Synapse objects and an Activity are governed by dependencies. That is, an Activity needs to know what it ‘used’, and outputs need to know what Activity they were ‘generatedBy’. A couple of points for clarity:
An Activity can ‘use’ many things (i.e. many inputs to an analysis)
Many outputs can be ‘generatedBy’ the same Activity
If an Activity is passed to several files without first being stored with one of them, a separate provenance graph is created for each file that the Activity generated. To end up with a single graph, store the Activity with one entity, retrieve the stored Activity, and then assign it to the remaining files, as in the following example:
Web
Unfortunately, the web interface currently does not support assigning the same activity to multiple files. This action must be completed using either the R or the Python client.
Command Line
Unfortunately, the command line client currently does not support assigning the same activity to multiple files.
Python
```python
from synapseclient import Activity

# Code used to generate the file will be syn123456
# Files used to generate the information
expr_file = syn.get("syn246810", downloadFile=False)
filter_file = syn.get("syn135791", downloadFile=False)

# Activity to assign to multiple files (final_file is a Synapse File created beforehand)
act = Activity(name="filtering", used=[expr_file, filter_file], executed="syn123456")
final_file = syn.store(final_file, activity=act)

# Get the activity now associated with an entity
act = syn.getProvenance(final_file)

# Now you can set this activity to as many files as you want
# (file_1, file_2, etc. are Synapse Files)
file_list = [file_1, file_2, file_3]
file_list = [syn.store(f, activity=act) for f in file_list]
```
R
```r
# Code used to generate the file will be syn123456
# Files used to generate the information
expr_file <- synGet("syn246810", downloadFile=FALSE)
filter_file <- synGet("syn135791", downloadFile=FALSE)

# Activity to assign to multiple files (finalFile is a Synapse File created beforehand)
act <- Activity(name="filtering", used=list(expr_file, filter_file), executed="syn123456")
finalFile <- synStore(finalFile, activity=act)

# Get the activity now associated with an entity
act <- synGetProvenance(finalFile)

# Now you can set this activity to as many files as you want
# (file1, file2, etc. are Synapse Files)
finalList <- list(file1, file2, file3)
finalList <- lapply(finalList, function(x) synStore(x, activity=act))
```
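To see why the stored activity is retrieved before it is reused, the toy model below contrasts the two cases. It assumes, in line with the description above, that storing a file with an activity that has no ID yet creates a new activity each time, whereas an activity that already has an ID is reused. `ToyActivity` and `ToyStore` are hypothetical and only imitate that behaviour in memory; they are not the Synapse implementation.

```python
import itertools
from typing import Dict, Optional


class ToyActivity:
    """Minimal activity: a name plus an ID that is set only once it has been stored."""

    def __init__(self, name: str, activity_id: Optional[int] = None) -> None:
        self.name = name
        self.id = activity_id


class ToyStore:
    """In-memory imitation of storing files together with an activity."""

    def __init__(self) -> None:
        self._ids = itertools.count(1)
        self.generated_by: Dict[str, int] = {}  # file name -> activity ID

    def store(self, file_name: str, activity: ToyActivity) -> None:
        # Assumption: an activity without an ID is saved as a brand-new record.
        activity_id = activity.id if activity.id is not None else next(self._ids)
        self.generated_by[file_name] = activity_id

    def get_provenance(self, file_name: str) -> ToyActivity:
        return ToyActivity("stored", self.generated_by[file_name])

    def graph_count(self) -> int:
        # Files that share an activity ID belong to the same provenance graph.
        return len(set(self.generated_by.values()))


# Case 1: the same unsaved activity is passed to every file.
separate = ToyStore()
unsaved = ToyActivity("filtering")
for name in ["file_1", "file_2", "file_3"]:
    separate.store(name, unsaved)

# Case 2: store once, fetch the stored activity, then reuse it.
shared = ToyStore()
shared.store("file_1", ToyActivity("filtering"))
stored = shared.get_provenance("file_1")
for name in ["file_2", "file_3"]:
    shared.store(name, stored)

print("separate graphs:", separate.graph_count())
print("shared graphs:", shared.graph_count())
```

In the first case each file ends up with its own activity record, while in the second all three files point at the same one.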
...