...


The current solution poses some challenges:

  • The process is relatively lengthy and requires non-trivial technical skills.
  • The data is limited to a 6-month window. The current workaround is to store incremental updates in an external source (a CSV file on S3).
  • The data cannot be easily integrated into the Synapse portal or other components (in some cases the files are manually annotated with the number of downloads).
  • The system has an all-or-nothing access policy: the data is (for good reason) accessible only to a specific subset of Synapse employees, so users of the Synapse platform cannot access this kind of data without asking a Synapse engineer.

An example of a usage report generated with the Synapse Usage Report, written by Kenny Daily, using the data warehouse:

Files and Downloads


Files in Synapse are referenced through an abstraction (FileHandle) that maintains the information about the link to the content of the file itself (e.g. an S3 bucket). A file handle is then referenced in many places (such as FileEntity and WikiPage, see FileHandleAssociateType) as a pointer to the actual file content. To download the content, the Synapse platform generates a pre-signed URL (according to the location where the file is stored) that can be used to download the file directly. Note that the platform has no way to guarantee that the client actually uses the pre-signed URL to download the file. Every pre-signed URL request in the codebase comes down to a single method, getURLForFileHandle.

...

The main aspect that we want to capture is that, in general, any relevant file download should include the triplet associateObjectType, associateObjectId and fileHandleId (see FileHandleAssociation), which allows us to understand the source of the file download in the context of a request.
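
As an illustration of why the triplet matters, the following minimal sketch groups download events by the object they were downloaded through. The class and function names, the snake_case field names, and the sample ids are assumptions made for this example; only the three triplet members come from the text above.

```python
from collections import Counter
from dataclasses import dataclass


@dataclass(frozen=True)
class FileHandleAssociation:
    """The triplet that identifies the source of a file download."""
    associate_object_type: str  # illustrative values, e.g. "FileEntity"
    associate_object_id: str
    file_handle_id: str


def downloads_per_object(events):
    """Count download events per (associate type, associate id) pair."""
    return Counter(
        (e.associate_object_type, e.associate_object_id) for e in events
    )


# Hypothetical events: the ids below are made up for illustration.
events = [
    FileHandleAssociation("FileEntity", "syn123", "1001"),
    FileHandleAssociation("FileEntity", "syn123", "1001"),
    FileHandleAssociation("WikiAttachment", "42", "1002"),
]
counts = downloads_per_object(events)
```

Without the association type and id, the two downloads of file handle 1001 could not be attributed to the entity through which they were requested.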

In the following we provide some basic download statistics computed over the last 30 days of data (as of 06/26/2019), as reported by the data warehouse:

"Direct" download statistics
Included APIs | Average Daily Downloads | Max Daily Downloads
  • /entity/{id}/file
  • /entity/{id}/version/{version}/file
  • /file/{id}
  • /filehandle/{id}/url
2325

  • /entity/{id}/file
  • /entity/{id}/version/{version}/file
67
"Batch and bulk" downloads
Association Type | Average Daily Downloads | Max Daily Downloads | Average Daily Users | Max Daily Users
All
33175
167002138227
FileEntity
12100
16426958100
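
For reference, an average and a maximum of daily downloads can be derived from raw download records roughly as follows. This is a sketch under assumptions: the function name and the sample dates are hypothetical, and counting days without downloads as zero is a choice made here, not necessarily how the figures above were computed.

```python
from collections import Counter
from datetime import date


def daily_download_stats(download_dates, window_days):
    """Return (average, max) downloads per day over `window_days` days.

    Days without any download count as zero toward the average.
    """
    per_day = Counter(download_dates)
    total = sum(per_day.values())
    return total / window_days, max(per_day.values(), default=0)


# Toy input, not real warehouse data: one entry per recorded download.
sample = [date(2019, 6, 1)] * 3 + [date(2019, 6, 2)] + [date(2019, 6, 4)] * 2
average, peak = daily_download_stats(sample, window_days=4)
```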

Proposed API Design



The current API design uses a polymorphic approach to access certain kinds of objects, in particular Projects, Folders and Files (see Entities, Files, and Folders Oh My!), exposing a single prefixed endpoint for CRUD operations (/entity). As a result, a GET request to the /entity/{id} endpoint might return a different response type depending on the entity (see the Entity Response Object). We would need a way to gather statistics about two particular objects (for now): projects and files.

...

POST /asynchronous/job
  • Request Body: DownloadStatisticsRequest or DownloadBucketsStatisticsRequest
  • Response Body: AsynchronousJobStatus. The responseBody property in the status will contain either the DownloadStatisticsResponse or the DownloadBucketsStatisticsResponse.
  • Description: Submits a job to gather the download statistics for an entity. The id returned by the request is used to get the job status.
  • Restrictions:
    • The entity specified in the request must be a project or a file entity (otherwise 400).
    • The user should have view permission on the entity specified in the body (otherwise 403).
    • The DownloadBucketsStatisticsRequest will be used to compute paginated results. The value of the "limit" parameter is restricted to a maximum of 100.

GET /asynchronous/job/{id}
  • Request Body: N/A
  • Response Body: AsynchronousJobStatus
  • Description: Returns the current status of the statistics job with the given id.
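
The submit-then-poll flow implied by these two endpoints could look roughly like the sketch below. It is not the platform's client: the HTTP calls are injected as plain functions, and the field names jobId and jobState, the PROCESSING and COMPLETE values, and the request fields in the usage example are assumptions made for illustration. Only the two paths, the responseBody property and the limit cap of 100 come from the proposal.

```python
import time


def run_statistics_job(post, get, request_body, poll_seconds=1.0, max_polls=60):
    """Submit a statistics job and poll until it leaves the PROCESSING state.

    `post(path, body)` and `get(path)` are caller-supplied functions that
    perform the HTTP calls and return the decoded JSON as a dict.
    """
    # The cap of 100 on "limit" comes from the listed restrictions.
    limit = request_body.get("limit")
    if limit is not None and limit > 100:
        raise ValueError("limit must not exceed 100")

    status = post("/asynchronous/job", request_body)
    job_id = status["jobId"]  # assumed field name for the returned id
    for _ in range(max_polls):
        status = get(f"/asynchronous/job/{job_id}")
        if status["jobState"] != "PROCESSING":  # assumed field and value
            return status
        time.sleep(poll_seconds)
    raise TimeoutError(f"job {job_id} still processing after {max_polls} polls")


# In-memory stand-in for the service, used only to exercise the loop.
_replies = iter([
    {"jobId": "7", "jobState": "PROCESSING"},
    {"jobId": "7", "jobState": "COMPLETE", "responseBody": {"buckets": []}},
])


def fake_post(path, body):
    return {"jobId": "7", "jobState": "PROCESSING"}


def fake_get(path):
    return next(_replies)


result = run_statistics_job(
    fake_post, fake_get, {"objectId": "syn123", "limit": 100}, poll_seconds=0
)
```

Injecting post and get keeps the sketch independent of any particular HTTP library, so the same loop can be exercised against a real client or a test double.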

...