...

The current solution poses some challenges:

The process is relatively lengthy and requires non-trivial technical skills.
The data is limited to a 6-month window; the current workaround is to store incremental updates on an external source (a CSV file on S3).
The data cannot be easily integrated into the Synapse portal or other components (in some cases the files are manually annotated with the number of downloads).
The system has an all-or-nothing policy for accessing the data: it is (for good reason) accessible only to a specific subset of Synapse employees, so users of the Synapse platform cannot access this kind of data without asking a Synapse engineer.

An example of a usage report generated from the data warehouse with the Synapse Usage Report, written by Kenny Daily:

Files and Downloads

Files in Synapse are referenced through an abstraction (FileHandle) that maintains the information about the link to the content of the file itself (e.g. an S3 bucket). A file handle is then referenced in many places (such as FileEntity and WikiPage, see FileHandleAssociateType) as a pointer to the actual file content. To download the content, the Synapse platform generates a pre-signed URL (according to the location where the file is stored) that can be used to download the file directly. Note that the platform has no way to guarantee that the client actually uses the pre-signed URL to download the file. Every pre-signed URL request in the codebase comes down to a single method, getURLForFileHandle.

...

The main aspect that we want to capture is that, in general, any relevant file download should include the triplet associateObjectType, associateObjectId and fileHandleId (see FileHandleAssociation), which allows us to understand the source of the file download in the context of a request.
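
As a minimal sketch of how this triplet could be used, the snippet below models a download event and counts downloads per associated object. Only the three triplet fields come from the text above; the Python class names, the `count_downloads` helper and the two enum members are assumptions for illustration and do not reflect the actual Synapse implementation.

```python
from dataclasses import dataclass
from enum import Enum


class AssociateType(Enum):
    # Illustrative subset only; the full list lives in FileHandleAssociateType.
    FILE_ENTITY = "FileEntity"
    WIKI_PAGE = "WikiPage"


@dataclass(frozen=True)
class FileHandleAssociation:
    """The triplet that identifies the source of a file download."""
    associate_object_type: AssociateType
    associate_object_id: str
    file_handle_id: str


def count_downloads(events):
    """Count download events per (associate type, associate object id) pair."""
    counts = {}
    for event in events:
        key = (event.associate_object_type, event.associate_object_id)
        counts[key] = counts.get(key, 0) + 1
    return counts
```

Grouping by the association rather than by the file handle alone keeps downloads of the same content through different objects (for example a file entity and a wiki page) separate.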

Proposed API Design

The current API design uses a polymorphic approach to access certain kinds of objects, in particular Projects, Folders and Files (see Entities, Files, and Folders Oh My!), exposing a single prefixed endpoint for CRUD operations (/entity). As a result, a GET request to the /entity/{id} endpoint might return a different response type depending on the entity (see the Entity Response Object). We need a way to gather statistics about two particular objects (for now): projects and files.

At the same time, statistics for different kinds of entities might return responses with different semantics: for example, the number of downloads for a project has a different meaning than the number of downloads for a file entity. Nevertheless, the responses will most likely be similar and can be grouped under a generic statistics object.

In general, computing statistics might be an expensive operation; moreover, in the future we might want to extend the API to include several different types of statistics. Depending on the use cases and the requirements, we can potentially pre-compute statistics aggregates so that we can serve them relatively quickly, but we cannot guarantee that all statistics will return within a reasonable response time.

The statistics for an entity might include more information than the number of downloads, and we should keep in mind that we might want to extend them to collect more data (such as the number of files, users, teams, etc.). In a first phase, however, we can start by including the download statistics only.

To this end, we propose an API that integrates into the current /asynchronous/job API (see Asynchronous Job API), so that a client can send a request to compute certain statistics and then poll the API until the final response is available. This offloads the computation to a background worker, keeping the API servers free of heavy computation. The two main objects that the statistics API will need to extend are AsynchronousRequestBody and AsynchronousResponseBody, which represent respectively the request for a background job and the response from the job (returned as part of the AsynchronousJobStatus).
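
The submit-and-poll flow can be sketched as follows. This is only an illustration under stated assumptions: `InMemoryJobService` is a hypothetical stand-in for the two /asynchronous/job endpoints (no HTTP is involved), and the `jobState` values, the dictionary shapes and the `entityId` field are assumed rather than taken from the actual API.

```python
import time


class InMemoryJobService:
    """Hypothetical stand-in for the /asynchronous/job endpoints (no HTTP)."""

    def __init__(self, polls_until_done=2):
        self._jobs = {}
        self._polls_until_done = polls_until_done

    def submit(self, request_body):
        # Mirrors POST /asynchronous/job: returns the id of the new job.
        job_id = str(len(self._jobs) + 1)
        self._jobs[job_id] = {"request": request_body, "polls": 0}
        return job_id

    def get_status(self, job_id):
        # Mirrors GET /asynchronous/job/{id}: returns the current job status.
        job = self._jobs[job_id]
        job["polls"] += 1
        if job["polls"] < self._polls_until_done:
            return {"jobState": "PROCESSING", "responseBody": None}
        # Placeholder result; a real worker would compute the statistics.
        body = {"entityId": job["request"]["entityId"]}
        return {"jobState": "COMPLETE", "responseBody": body}


def submit_and_wait(service, request_body, interval=0.0, max_polls=10):
    """Submit a job, then poll its status until it leaves PROCESSING."""
    job_id = service.submit(request_body)
    for _ in range(max_polls):
        status = service.get_status(job_id)
        if status["jobState"] != "PROCESSING":
            return status
        time.sleep(interval)
    raise TimeoutError(f"job {job_id} did not finish after {max_polls} polls")
```

Bounding the number of polls keeps a client from waiting forever on a job that never completes; a real client would likely also back off between polls.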

Note: In the actual implementation we might have a generic SynapseStatistics interface used as a marker for any of the statistics requests; for brevity, in the following we simply refer to EntityStatistics.

Request Objects

In the following we provide a list of object representations for requests of entity statistics:

...

Method	Endpoint	Request Body	Response Body	Description	Restrictions
POST	/asynchronous/job	DownloadStatisticsRequest or DownloadBucketsStatisticsRequest	AsynchronousJobStatus: the responseBody property of the status contains either the DownloadStatisticsResponse or the DownloadBucketsStatisticsResponse	Submits a job to gather the download statistics for an entity. The id returned by the request is used to get the job status.	The entity specified in the request must be a project or a file entity (otherwise 400). The user must have view permission on the entity specified in the body (otherwise 403). The DownloadBucketsStatisticsRequest computes paginated results; the value of the "limit" parameter is restricted to a maximum of 100.
GET	/asynchronous/job/{id}	N/A	AsynchronousJobStatus	Returns the current status of the statistics job with the given id.
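
The restrictions in the table could be checked along these lines before a job is accepted. This is a sketch only: the function and exception names are hypothetical, the order of the checks is an assumption, and rejecting an oversized "limit" with a 400 (rather than, say, clamping it) is also an assumption, since the table only states the maximum.

```python
MAX_LIMIT = 100
ALLOWED_ENTITY_TYPES = {"project", "file"}


class RequestRejected(Exception):
    """Carries the HTTP status code that the API would return."""

    def __init__(self, status_code, message):
        super().__init__(message)
        self.status_code = status_code


def validate_statistics_request(entity_type, user_can_view, limit=None):
    """Raise RequestRejected if the request violates a restriction."""
    if entity_type not in ALLOWED_ENTITY_TYPES:
        raise RequestRejected(400, "entity must be a project or a file entity")
    if not user_can_view:
        raise RequestRejected(403, "user lacks view permission on the entity")
    # Assumption: a limit above the maximum is rejected rather than clamped.
    if limit is not None and limit > MAX_LIMIT:
        raise RequestRejected(400, f"limit must not exceed {MAX_LIMIT}")
```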

...
