This is the specification of the command set for Synapse clients. The goal is to align the command sets for clients in different languages to ease users' transitions between languages. Additionally, we define the organization of the file cache so that the various clients arrange local copies of files consistently.
Synapse File Management
To motivate the design of the client-side cache and file manipulation commands, we review file management in Synapse. Synapse tracks the shared location of the files it stores (e.g. the location in Amazon S3 or some other web-accessible location) and also each file's MD5 hash. While Synapse records a file's name, it does not know the location of the file when downloaded by any user. The client has a notion of a File object (defined below). This object has a slot for the ID of the File object in Synapse and also has a slot for the local file handle. When the client moves a file from the local machine to Synapse (via the "synStore" command, defined below), the file is uploaded from the local location to Synapse. When the client retrieves a File (via the "synGet" command, defined below), it may specify the local file handle to which the file is downloaded, or allow the client to place it in a default location. With this understanding, we can discuss how the clients cache files.
Cache Design Principles
The analytical clients provide for client-side caching of Synapse files, to avoid unnecessarily retrieving (large or many) files that are already available locally. When a file is uploaded or downloaded the client keeps track of the location along with information to determine if it is later changed. Specifically, the client maintains a "Cache Map" whose keys are Synapse FileHandle IDs and whose values are lists of local file locations. Each file location has (1) a path on the local file system, and (2) a 'last modified' time. The use of this map is as follows:
Case: synStore (defined below) is called to upload a new file to Synapse.
Action: An entry is made in the Cache Map.
Case: synStore is called for a File object whose file has already been uploaded to Synapse.
Action: The File object contains the file path to the local copy of the file and the FileHandle ID. The associated 'last modified' time in the Cache Map is compared to the current 'last modified' time for the file. If the timestamps are the same, no upload occurs. Otherwise the file is uploaded (generating a new FileHandle ID) and the Cache Map entry is updated with the new FileHandle ID and timestamp. (Note: The old entry is left in place, since some other in-memory File object may reference the same local file.)
Case: synGet is called for a File object which has not been downloaded.
Action: The File metadata are retrieved, including the FileHandle ID. Since there is no entry in the Cache Map, the file is downloaded and an entry is made in the Cache Map.
Case: synGet is called for a File object which has been downloaded previously with a different target location.
Action: The File metadata are retrieved, including the FileHandle ID. An entry is found in the Cache Map for the given FileHandle ID, but not for the given location. If any currently downloaded file in the Cache Map for the FileHandle ID has an unchanged 'last modified' timestamp, it is copied to the new location, else the file is downloaded from Synapse to the new location. Either way a new Cache Map entry is created for the new file. (Note: We do NOT make the new File object point to the cached file, since unexpected behavior would result when multiple File objects modify the same on-disk file.)
Case: synGet is called for a File object which has been downloaded previously with the same target location.
Action: The File metadata are retrieved, including the FileHandle ID. An entry is found in the Cache Map for the given FileHandle ID and location, and the 'last modified' timestamp is retrieved. If this timestamp matches the current 'last modified' timestamp for the file, no download occurs. If the local file is *missing* then the file is downloaded. Otherwise, the action depends on the "ifcollision" mode specified for synGet:
(1) ifcollision=overwriteLocal: The file is downloaded to the target location and the Cache Map entry is updated with a new timestamp;
(2) ifcollision=keepLocal: No download occurs. The File references the locally modified file at the given location;
(3) ifcollision=keepBoth: The file is downloaded to the target location, but given a modified local file name. A second entry for the FileHandle ID is made in the Cache Map.
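The synGet cases above can be sketched as a small decision function. This is an illustrative sketch only, not the clients' actual implementation: the names `plan_get` and `resolve_collision` are hypothetical, the Cache Map entries for one FileHandle ID are assumed to be held as a dictionary from local path to recorded modification time, and the mode names follow the dotted spelling used in the synGet signature ("keep.both", "keep.local", "overwrite.local"). The renaming pattern follows the file(1).txt convention used by this specification.

```python
import os

def resolve_collision(path):
    """Return path if it is free, else the first free 'name(n).ext' variant."""
    if not os.path.exists(path):
        return path
    stem, ext = os.path.splitext(path)
    n = 1
    while os.path.exists("%s(%d)%s" % (stem, n, ext)):
        n += 1
    return "%s(%d)%s" % (stem, n, ext)

def plan_get(entries, target_path, ifcollision="keep.both"):
    """Decide what synGet should do for one FileHandle ID.

    entries: dict mapping local path -> recorded 'last modified' time
    (hypothetical in-memory form of the Cache Map for this FileHandle ID).
    Returns (action, path) where action is "none", "copy" or "download".
    For "copy", path is the cached source to copy from.
    """
    def unchanged(path):
        # A cached copy is usable only if it still exists with the recorded time.
        return os.path.exists(path) and os.path.getmtime(path) == entries[path]

    if target_path not in entries:
        # Not yet downloaded to this location: reuse an unmodified copy if any.
        for path in entries:
            if unchanged(path):
                return ("copy", path)
        return ("download", target_path)

    if unchanged(target_path):
        return ("none", target_path)
    if not os.path.exists(target_path):
        return ("download", target_path)

    # The local file was modified after it was cached: apply the collision mode.
    if ifcollision == "overwrite.local":
        return ("download", target_path)
    if ifcollision == "keep.local":
        return ("none", target_path)
    if ifcollision == "keep.both":
        return ("download", resolve_collision(target_path))
    raise ValueError("unknown ifcollision mode: %r" % (ifcollision,))
```

Keeping the decision separate from the download itself makes each case easy to check without network access. In every "copy" or "download" outcome the caller would still need to add a Cache Map entry for the resulting file.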
When a file is downloaded, specifying the file location is optional. If it isn't specified the file is placed in a default 'cache folder'. The organization of the file cache is:
<cache root>/<file handle id>/<file name>
where
<cache root> is user configurable and defaults to ~/.synapseCache
<file handle id> is the file id part of the file handle
<file name> is the file name given by the file handle, or if there is a "collision" and ifcollision="keep.both" is selected, then the name is a modification created by the client to resolve the collision, e.g. file.txt may become file(1).txt.
Cache Map Design
Cache Entry
There is a file for each Synapse FileHandle ID that has been downloaded or uploaded. The file has the path:
<cache root> / <file handle id> / .cacheMap
The file contains the location and last-modified time stamp of each downloaded or uploaded file. The format is that of a JSON map whose keys are file paths and whose values are time stamps, e.g.
{ "/path/to/file.txt": "2013-04-02 16:33:10" }
TODO: file naming convention and folder organization. TODO: file format (JSON?)
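As a rough illustration of the per-FileHandle `.cacheMap` file described above, the following sketch reads and writes the JSON map and checks whether a recorded file is unchanged. The function names are hypothetical, and two details are assumptions not fixed by this specification: timestamps are written in UTC, and they are compared at one-second resolution using the "YYYY-MM-DD HH:MM:SS" format of the example entry.

```python
import json
import os
import time

TIME_FORMAT = "%Y-%m-%d %H:%M:%S"

def cache_map_path(cache_root, file_handle_id):
    """Build <cache root>/<file handle id>/.cacheMap."""
    return os.path.join(cache_root, str(file_handle_id), ".cacheMap")

def mtime_string(path):
    # Assumption: timestamps are stored in UTC with one-second resolution.
    return time.strftime(TIME_FORMAT, time.gmtime(os.path.getmtime(path)))

def read_cache_map(cache_root, file_handle_id):
    """Return the path -> timestamp map, or an empty map if none exists yet."""
    path = cache_map_path(cache_root, file_handle_id)
    if not os.path.exists(path):
        return {}
    with open(path) as f:
        return json.load(f)

def record_file(cache_root, file_handle_id, local_path):
    """Add or refresh the entry for local_path and rewrite the .cacheMap file."""
    entries = read_cache_map(cache_root, file_handle_id)
    entries[local_path] = mtime_string(local_path)
    map_path = cache_map_path(cache_root, file_handle_id)
    os.makedirs(os.path.dirname(map_path), exist_ok=True)
    with open(map_path, "w") as f:
        json.dump(entries, f)
    return entries

def is_unmodified(cache_root, file_handle_id, local_path):
    """True if local_path exists and still carries its recorded timestamp."""
    recorded = read_cache_map(cache_root, file_handle_id).get(local_path)
    return (recorded is not None
            and os.path.exists(local_path)
            and recorded == mtime_string(local_path))
```

A client would call `record_file` after each upload or download and `is_unmodified` before deciding whether a transfer can be skipped. Concurrent access by several client sessions is not handled in this sketch.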
File Usage Examples
In each example, we have a project in which the File will reside:
    project <- Project(name="myproject")
    # 'synStore' will either create the project or retrieve it if it already exists
    project <- synStore(project)
    pid <- propertyValue(project, "id")
Example 1: Create File entity wrapping local file, save to Synapse, retrieve, and save again
    file <- File(path="~/myproject/genotypedata.csv", name="genotypedata", parentId=pid)
    # 'synStore' will upload the file to Synapse
    # locally we record that the uploaded file is available at ~/myproject/genotypedata.csv
    file <- synStore(file)
    # we can get the ID of the file in Synapse
    fileId <- propertyValue(file, "id")

    # ----- Now assume a new session (perhaps a different user)
    # at first we have only the Synapse file ID
    fileId <- "synXXXXX"
    file <- synGet(fileId)
    # client recognizes that local copy exists, avoiding repeated download
    getFileLocation(file)
    # > "~/myproject/genotypedata.csv"
    # now change something, e.g. add an annotation...
    synAnnot(file, "data type") <- "genotype"
    # ... and save. the client determines that the file is unchanged so does not upload again
    file <- synStore(file)
    # we can also download to a specific location
    fileCopy <- synGet(fileId, downloadLocation="~/scratch/")
    getFileLocation(fileCopy)
    # > "~/scratch/genotypedata.csv"
    # we now have two copies on the local file system
Example 2: Link to File on web, then download
    # we use 'synapseStore=F' to indicate that we only wish to link
    file <- File(path="ftp://ftp.ncbi.nlm.nih.gov/geo/series/GSE1nnn/GSE1000/matrix/GSExxxx_RAW.tar", synapseStore=F, name="genotypedata", parentId=pid)
    # Synapse stores the metadata, but does not upload the file
    file <- synStore(file)
    # we can get the ID of the file in Synapse
    fileId <- propertyValue(file, "id")
    # synGet downloads the file to a default location
    file <- synGet(fileId)
    getFileLocation(file)
    # > "~/.synapseCache/GSExxxx_RAW.tar"
    # now change the metadata and save
    synAnnot(file, "data type") <- "gene expression"
    # synStore does not upload the file
    file <- synStore(file)
Example 3: Lose session after editing file
    codeFileId <- "synXXXXX"
    codeFile <- synGet(codeFileId, load=F)
    getFileLocation(codeFile)
    # > "~/.synapseCache/rScript.R"
    # the file is edited
    # the session is lost
    # a new session begins
    codeFileId <- "synXXXXX"
    codeFile <- synGet(codeFileId, load=F, ifcollision="keep.local")
    # The File object now refers to the edited file
    # synStore detects that the file is changed and uploads it
    codeFile <- synStore(codeFile)
Example 4: link to file on NFS
    # we use 'synapseStore=F' to indicate that we only wish to link
    file <- File(path="file:///corporatenfs/sharedproject/genotypedata.csv", synapseStore=F, name="genotypedata", parentId=pid)
    # Synapse stores the metadata, but does not upload the file
    file <- synStore(file)

    # Now assume a new session, perhaps by a different user
    fileId <- "synXXXXX"
    # we use 'downloadFile=F' to indicate that we do not need a new copy on our local disk
    file <- synGet(fileId, downloadFile=F)
    getFileLocation(file)
    # > "/corporatenfs/sharedproject/genotypedata.csv"
    # now change the metadata and save
    synAnnot(file, "data type") <- "SNP"
    # since the File was created with "synapseStore=F", synStore does not upload the file
    file <- synStore(file)
Command Set
We conceptually divide the client commands into three levels: (1) common functions, (2) advanced functions and (3) low-level Web API functions. The first collection of commands captures the majority of functionality of interest to users. The second collection rounds out the functionality with less frequently used functions. The third set comprises simple, low-level wrappers around the Synapse web service interface. By including this third set, users can access web services in advance of having specialized commands in the analytic clients.
Command | comments | R Syntax | Python Syntax | Command Line Syntax |
---|---|---|---|---|
1 – Common functions | ||||
Create a Synapse file handle in memory, specifying the path to the file in the local file system, the name in Synapse, and the Folder in Synapse. This step 'stages' a file to be sent to Synapse. Additional parameters (...) are interpreted as properties or annotations, in the manner of synSet(), defined below. If 'synapseStore' is TRUE then the file is uploaded to S3, else only the file location is saved. | The specified file doesn't move or get copied. | File(path, synapseStore=T, parentId, ...) example: File(path="/path/to/file", parentId="syn101") | File(path="/path/to/file", synapseStore=True, parentId="syn101", **kwargs) | NA |
Create a Synapse file handle in memory which will be a serialized version of an in-memory object. Additional parameters (...) are interpreted as properties or annotations, in the manner of synSet(), defined below. If 'synapseStore' is TRUE then the file is uploaded to S3, else only the file location is saved. | The object is not serialized at this time. (We are hoping people will like calling the object a File, even though it takes an in-memory object as a parameter.) | File(obj, synapseStore=T, parentId, ...) example: File(obj=dataObject, parentId="syn101") | Will not be implemented in Python. | NA |
Create a Synapse Record in memory, specifying the name and the Folder in Synapse. This step 'stages' a Record to be sent to Synapse. Additional parameters (...) are interpreted as properties or annotations, in the manner of synSet(), defined below. | Files aren't moved or copied. TODO: How do you specify file annotations (as distinct from Strings)? Shall we introduce in-memory wrappers around files and urls to help distinguish them? | Record(name=NULL, parentId="syn101", ...) example: | Record(name="foo", parentId="syn101", **kwargs) | |
Create a Folder or Project in memory. Name and parentId are optional. | | Folder(name=NULL, parentId=NULL, ...) / Project(name=NULL, ...) example: | Folder(name="foo", parentId="syn101", **kwargs) / Project(name="foo", **kwargs) | |
Set an entity's attribute (property or annotation) in memory. Client first checks properties, then goes to annotations; (setting to NULL deletes it in R, using DEL operator in python deletes it) | TODO: we want to include files and (for R) in memory objects | synAnnot(entity, name)<-value | entity.parentId="syn101" | synapse update id --parentId syn101 |
Gets an entity's attribute value (property or annotation) from the object already in memory. | | synAnnot(entity, name); returns NULL if undefined | entity.name; throws exception if value is undefined | |
Create or update an entity (File, Folder, etc.) in Synapse. May also specify (1) the list of entities 'used' to generate this one, (2) the list of entities 'executed' to generate this one, (3) the name of the generation activity, (4) the description of the generation activity, (5) whether a name collision in an attempted 'create' should become an 'update', (6) whether to 'force' a new version to be created, and (7) whether the data is restricted (which will put a download 'lock' on the data and contact the Synapse Access and Compliance team for review). | TODO: Give some examples. | synStore(entity, used, executed, activityName=NULL, activityDescription=NULL, createOrUpdate=T, forceVersion=T, isRestricted=F) | synapse.store(entity, used, executed, activityName=None, activityDescription=None, createOrUpdate=True, forceVersion=True, isRestricted=False) | synapse create --name NAME --parentid PARENTID --description DESCRIPTION --type TYPE --file PATH --update=T/F --forceVersion=T/F --annotations={foo=bar, bar=foo} |
Get an entity (file, folder, etc.) from the Synapse server, with its attributes (properties, annotations) and, optionally, with its associated file(s). if.collision is one of "keep.both", "keep.local", or "overwrite.local", telling the system what to do if a different file is found at the given local file location. | 'download' and 'load' are ignored for objects lacking Files. OK for download=F and load=T, this means don't cache (a valid choice if the File lives on a network share). If a downloadLocation is not provided a default, read-only cache location is used. If a downloadLocation IS provided, then the client must handle collisions with existing files. Note, 'downloadLocation' must be a folder, i.e. it cannot be used to rename files. | synGet(id, version, downloadFile=T, downloadLocation=NULL, ifcollision="keep.both", load=F) | synapse.get(id, version, downloadFile=True, downloadLocation=None, ifcollision="keep.both", load=False) | synapse get ID -v NUMBER |
Get the directly readable location of the file associated with a file object. | For downloaded files, this is the path on the local file system. For "linked" files (not uploaded into Synapse) that are not downloaded, this is the URL known to Synapse. For uploaded files which have not been retrieved, returns NULL. | getFileLocation(file) | TODO | TODO |
Trash an entity, and all of its children (move all Folders and Files within a Folder to the trash can). | | synTrash(id) / synTrash(entity) | synapse.trash(id) | synapse trash id |
Open the web browser to the page for this entity. | | onWeb(entityId) / onWeb(entity) | synapse.onweb(entityId) / synapse.onweb(entity) | synapse onweb id |
log-in | get API key and write to user's properties file | synapseLogin(<user>,<pw>) | synapse.login(<user>,<pw>, sessionToken=None) | synapse login -u USER -p PASSWORD |
log-out | delete API key from properties file | synapseLogout() | synapse.logout() | synapse logout |
2 –Advanced functions | ||||
Execute query | TODO: pagination, e.g. the function returns an iterator. Look at current implementation in R client. | synQuery(queryString) | synapse.query(queryString) | synapse query |
we talked about this, but is it needed? | ||||
we talked about this, but is it needed? | ||||
Retrieve the wiki for an entity | TODO: Is it a requirement that we retrieve attachments? If not, do we retrieve file handles? Is this the id of the wiki or the wiki? | synGetWiki(id, version) / synGetWiki(entity) | synapse.getWiki(id, version) synapse.getWiki(entity) | |
synStoreWiki() | synapse.storeWiki() | |||
synGetAnnotations() | synapse.getAnnotations(entity/entityId) | |||
synSetAnnotations() | synapse.setAnnotations(entity/entityId, annotations) | |||
synGetProperties() | NA | NA | ||
Access properties, throwing exception if property is not defined. | | synSetProperties() | NA | NA |
synGetAnnotation() | ||||
synSetAnnotation() | ||||
Access property, throwing exception if property is not defined. | | synGetProperty() | NA | NA |
Access property, throwing exception if property is not defined. Setting to NULL deletes. | | synSetProperty() | NA | NA |
Create an Activity (provenance object) in memory. | | Activity(name, description, used, executed) | Activity(name, description, used, executed) | NA |
Create or update the Activity in Synapse | | synStoreActivity(activity) | Activity.store() | NA |
Get the Activity which generated the given entity. | | synGetActivity(entity) / synGetActivity(entityId) | synapse.getActivity(entity/entityId) | NA |
Set the Activity which generated the given entity | | synSetActivity(entity)<-activity | synapse.setActivity(entity/entityId, activity) | NA |
Empty trash can | ||||
Restore from trash can | ||||
Run code, capturing output, code and provenance relationship. | | synapseExecute(executable, args, resultParentId, codeParentId, resultEntityProperties = NULL, resultEntityName=NULL, replChar=".") | synapse.execute(executable, args, resultParentId, codeParentId, resultEntityProperties = None, resultEntityName=None, replChar=".") | NA |
Create evaluation object | | Evaluation(name, description, status) | Evaluation(name, description, status) | NA |
Join evaluation | | addParticipant(evaluation, principalId) | evaluation.addParticipant(principalId) | NA |
Submit for evaluation | | submit(evaluation, entity) | evaluation.submit(entity) / synapse.submitToEvaluation(entity, evaluation) | synapse submitEvaluation |
3 – Web API Level functions | ||||
Execute GET request | See details below. | synRestGET(endpoint, uri) | NA? (already only a line) | |
Execute POST request | See details below. | synRestPOST(endpoint, uri, body) | NA? | |
Execute PUT request | See details below. | synRestPUT(endpoint, uri, body) | NA? | |
Execute DELETE request | See details below. | synRestDELETE(endpoint, uri) | NA? | |
Endpoints
At the time of this writing, there are three endpoints for web service calls in our production system:
https://auth-prod.prod.sagebase.org/auth/v1
https://repo-prod.prod.sagebase.org/repo/v1
https://file-prod.prod.sagebase.org/file/v1
These are used to call the web APIs linked below.
Web APIs
The URIs, request bodies and request methods are defined by the Synapse Web APIs. The URIs omit the endpoints given above, e.g. to retrieve entity metadata the endpoint would be "https://repo-prod.prod.sagebase.org/repo/v1" while the URI might be "/entity/syn123456". The web APIs define request and response bodies in terms of JSON objects. In the analytic clients these are expressed as named lists or nested named lists, e.g. in R the JSON object {"foo":"bar", "bas":"bah"} is passed in as list(foo="bar", bas="bah").
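To make the endpoint/URI split concrete, here is a minimal Python sketch of how a low-level wrapper might compose a request. The helper name `build_request` is hypothetical, the headers are an assumption, and authentication is deliberately left out; the sketch only builds the request object and does not send it. In Python the JSON body is naturally expressed as a dictionary, mirroring the named list used in R.

```python
import json
import urllib.request

REPO_ENDPOINT = "https://repo-prod.prod.sagebase.org/repo/v1"

def build_request(method, endpoint, uri, body=None):
    """Compose (but do not send) a request for endpoint + uri.

    body, if given, is a dict that is serialized as a JSON object.
    """
    data = None
    headers = {"Accept": "application/json"}  # assumed, not taken from the spec
    if body is not None:
        data = json.dumps(body).encode("utf-8")
        headers["Content-Type"] = "application/json"
    return urllib.request.Request(endpoint + uri, data=data,
                                  headers=headers, method=method)

# Retrieve entity metadata: the endpoint and URI are simply concatenated.
get_request = build_request("GET", REPO_ENDPOINT, "/entity/syn123456")

# The JSON object {"foo":"bar", "bas":"bah"} is passed in as a dict.
put_request = build_request("PUT", REPO_ENDPOINT, "/entity/syn123456",
                            {"foo": "bar", "bas": "bah"})
```

Sending the request (for example with `urllib.request.urlopen`) and attaching credentials would be layered on top of this by the client.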
The Web APIs are defined here:
Common Configuration File
This is a properties file in a standard place that is interpreted upon client initialization. The location should be private for a user.
The format will be that of an .ini file (http://en.wikipedia.org/wiki/INI_file). Although the format is somewhat 'dated', there is a Python parser available:
http://docs.python.org/2/library/configparser.html
and an R parsing algorithm has been suggested:
http://r.789695.n4.nabble.com/Read-Windows-like-INI-files-into-R-data-structure-td827353.html
Things to specify in the common config file:
- username, API key (as returned by Authentication and Authorization API#GetSecretKeyforHMACAuthentication)
- root cache location (should be private to the user)
- cache max size
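A configuration file carrying the items listed above could be read as in the following sketch, which uses Python 3's standard `configparser` module. The section names (`authentication`, `cache`), the key names, and the choice of bytes as the unit for the maximum cache size are all assumptions made for illustration; this specification does not fix them.

```python
import configparser

# Hypothetical file contents; section and key names are illustrative only.
SAMPLE = """
[authentication]
username = me@example.com
apikey = PLACEHOLDER

[cache]
location = ~/.synapseCache
maxsize = 10737418240
"""

def load_config(text):
    """Parse .ini text into a plain dict of client settings."""
    parser = configparser.ConfigParser()
    parser.read_string(text)
    return {
        "username": parser.get("authentication", "username"),
        "apikey": parser.get("authentication", "apikey"),
        # Fall back to the default cache root when none is configured.
        "cache_root": parser.get("cache", "location", fallback="~/.synapseCache"),
        # Assumed to be a size in bytes; None means no limit was configured.
        "cache_max_size": parser.getint("cache", "maxsize", fallback=None),
    }

config = load_config(SAMPLE)
```

In a real client the text would be read from the user's private configuration file (for example with `parser.read(path)`) instead of from an in-memory string.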
Appendix: Current implementation of the file cache in the R Client:
- files are cached (metadata used to be cached in entity.json)
- cache is mix of read/write
- each entity version has a location within the cache that is based on its URI (e.g. .synapseCache/proddata.sagebase.org/<entityId>/<locationId>/version/<version>)
- files.json specifies what resides within the archive
- <fileName> file which the R Client currently assumes to be a zip (this is immutable by convention until storeEntity is called) (TODO: What happens when it is not a zip archive?)
- <fileName>_unpacked directory within which all unzipped content lives
- this subdirectory is writable (by convention)
- re-stores file if not an archive (both as <fileName> and <fileName>_unpacked/<fileName>)