...
- The process is relatively lengthy and requires non-trivial technical skills.
- The data is limited to a 6-month window; the current workaround is to store incremental updates on an external source (a CSV file on S3).
- The data cannot be easily integrated into the Synapse portal or other components (in some cases the files are manually annotated with the number of downloads).
- Access to the data is all-or-nothing: for good reason, it is available only to a specific subset of Synapse employees, so users of the Synapse platform cannot access this kind of data without asking a Synapse engineer.
An example of a usage report generated from the data warehouse using the Synapse Usage Report, written by Kenny Daily:
Files and Downloads
Files in Synapse are referenced through an abstraction (FileHandle) that maintains the information about the link to the content of the file itself (e.g. an S3 bucket). A file handle is then referenced in many places (such as FileEntity and WikiPage, see FileHandleAssociateType) as a pointer to the actual file content. To download the content, the Synapse platform generates a pre-signed URL (according to the location where the file is stored) that can be used to download the file directly. Note that the platform has no way to guarantee that the client actually uses the pre-signed URL to download the file. Every pre-signed URL request in the codebase comes down to a single method, getURLForFileHandle.
...
There are also different ways, exposed through the API, to get pre-signed URLs for file handles. In the context of project and file downloads we are interested in specific operations: in particular, several API endpoints are used to request a pre-signed URL to download the content of entities of type FileEntity (for a complete list of endpoints that reference file handles see Batch FileHandle and File URL Services):
Endpoint | Description |
---|---|
/entity/{id}/file | Currently deprecated API; returns a 307 redirect to the actual pre-signed URL of the file for the given entity |
/fileHandle/batch | Returns a batch of pre-signed URLs for a set of file handles. The request (BatchFileRequest) requires, for each file handle, the id and the type of the object that references it (see FileHandleAssociation and FileHandleAssociateType) |
/file/{id} | Returns the pre-signed URL of the file handle with the given id. The request parameters must include the FileHandleAssociateType and the id of the object that references the file. Like the deprecated /entity/{id}/file, it returns a 307 redirect to the pre-signed URL |
/file/bulk/async/start | Part of an asynchronous API that starts a bulk download of a list of files. The API allows monitoring the background task and retrieving the result through a dedicated file handle id that points to a zip file containing the requested files |
The main aspect that we want to capture is that, in general, any relevant file download should include the triplet associateObjectType, associateObjectId and fileHandleId (see FileHandleAssociation), which allows us to understand the source of the file download in the context of a request.
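As a rough illustration of the triplet, the sketch below models a FileHandleAssociation and builds a request body for /fileHandle/batch. The three field names come from the text above; the `requestedFiles` and `includePreSignedURLs` keys, the helper name `build_batch_file_request` and the ids in the usage note are assumptions made for the example and should be checked against the BatchFileRequest schema.

```python
from dataclasses import dataclass, asdict


@dataclass(frozen=True)
class FileHandleAssociation:
    """The triplet that identifies the source of a file download."""
    fileHandleId: str
    associateObjectId: str
    associateObjectType: str  # a FileHandleAssociateType value, e.g. "FileEntity"


def build_batch_file_request(associations):
    """Build a JSON-serializable body for a batch pre-signed URL request.

    The key names used here are assumptions for this sketch, not a
    confirmed description of the BatchFileRequest object.
    """
    return {
        "requestedFiles": [asdict(a) for a in associations],
        "includePreSignedURLs": True,
    }
```

For example, `build_batch_file_request([FileHandleAssociation("123", "syn456", "FileEntity")])` yields a body with a single requested file that carries all three values, which is the information a download record needs to attribute the download to an entity.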
...
...
Common parent object for a job request to compute statistics of an entity. This object is an AsynchronousRequestBody that will be sent to the /asynchronous/job endpoint in order to ask for the computation of a certain statistic.
Property | Required | Type | Description |
---|---|---|---|
entityId | Yes | STRING | The (Synapse) id of the entity for which the statistics should be computed. For the time being we will restrict it to Project and File entities. |
version | No | STRING | The version of the entity (when of type FileEntity) for which the statistics should be computed. If omitted, the statistics are gathered for the file entity as a whole, including all versions. |
DownloadStatisticsRequest <extends> EntityStatisticsRequest
The actual job request for the download statistics of an entity. This request is used to get counts only.
Property | Required | Type | Default Value | Description |
---|---|---|---|---|
fromTimestamp | No | Date | | A timestamp used to limit the considered time frame from the given timestamp |
toTimestamp | No | Date | | A timestamp used to limit the considered time frame until the given timestamp |
DownloadBucketsStatisticsRequest <extends> DownloadStatisticsRequest
Extends the DownloadStatisticsRequest to compute the download counts (along with the unique user counts) bucketed into fixed intervals of time, i.e. grouped by a given time frame. This makes it possible to build a histogram and show the download trend over the specified time frame.
Property | Required | Type | Default Value | Description |
---|---|---|---|---|
aggregationType | Yes | STRING | 'daily' | Specifies the type of aggregation used to group the download counts. Allowed values are: 'daily', 'weekly', 'monthly', 'yearly'. Note: for this type of job a limit is imposed on the time frame allowed for each aggregationType. The number of buckets within the time frame, according to the aggregation type, must be less than or equal to 100. Considering a resolution of 1 day, the maximum time frame is therefore 100 days, 100 weeks (7 days * 100), 100 months (30 days * 100) or 100 years (365 days * 100). |
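
The bucket limit can be checked before a job is submitted. The following is a minimal sketch, assuming a 1-day resolution with fixed bucket lengths of 1, 7, 30 and 365 days; the function names and the choice to round partial days and partial buckets up are illustrative assumptions, not part of the proposed API.

```python
import math
from datetime import datetime, timedelta

MAX_BUCKETS = 100

# Assumed bucket lengths in days for each aggregationType.
BUCKET_DAYS = {"daily": 1, "weekly": 7, "monthly": 30, "yearly": 365}


def bucket_count(from_ts, to_ts, aggregation_type):
    """Return the number of buckets needed to cover the time frame."""
    if aggregation_type not in BUCKET_DAYS:
        raise ValueError(f"unsupported aggregationType: {aggregation_type!r}")
    if to_ts < from_ts:
        raise ValueError("toTimestamp precedes fromTimestamp")
    # Round partial days up, then round partial buckets up.
    days = math.ceil((to_ts - from_ts) / timedelta(days=1))
    return math.ceil(days / BUCKET_DAYS[aggregation_type])


def validate_time_frame(from_ts, to_ts, aggregation_type):
    """Raise ValueError if the time frame needs more than MAX_BUCKETS buckets."""
    count = bucket_count(from_ts, to_ts, aggregation_type)
    if count > MAX_BUCKETS:
        raise ValueError(
            f"{count} buckets requested, at most {MAX_BUCKETS} are allowed"
        )
    return count
```

Under these assumptions a 100-day window is accepted with 'daily' aggregation (100 buckets), while a 101-day window is rejected unless a coarser aggregation such as 'weekly' (15 buckets) is used.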
Response Objects
The following is a list of the object representations used in responses from the statistics API:
...
Common parent object for the response regarding a request for statistics on an entity. This object is an AsynchronousResponseBody that will be included as part of the AsynchronousJobStatus when requesting the status of the job.
Property | Type | Description |
---|---|---|
lastUpdateTimestamp | Date | A timestamp representing the freshness of the statistic (last update date time of the requested statistic) |
DownloadStatisticsResponse <extends> EntityStatisticsResponse
...
Represents the response for a job request for computing the bucketed download statistics for an entity (in response to a DownloadBucketsStatisticsRequest). Note that the response extends the DownloadStatisticsResponse so that it includes the total counts.
Property | Type | Description |
---|---|---|
downloadCountBuckets | ARRAY<StatisticCountBucket> | An array of StatisticCountBucket each containing the download count for one bucket, according to the aggregationType specified in the request |
userCountBuckets | ARRAY<StatisticCountBucket> | An array of StatisticCountBucket each containing the unique users count for one bucket, according to the aggregationType specified in the request |
StatisticCountBucket
The purpose of this object is to include information about the count of a certain metric within a specific time frame:
...
The API reuses the endpoints from the Asynchronous Job API (We could potentially add a dedicated /statistics endpoint just for clarity).
Method | Endpoint | Request Body | Response Body | Description | Restrictions |
---|---|---|---|---|---|
POST | /asynchronous/job | | AsynchronousJobStatus: the responseBody property in the status will contain either the | Submits a job to gather the download statistics for an entity. The id returned by the request is used to get the job status. | |
GET | /asynchronous/job/{id} | | AsynchronousJobStatus | Gets the current status of the statistics job with the given id. | |
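
The submit-then-poll flow implied by the two endpoints can be sketched as follows. To keep the example self-contained, the HTTP transport is passed in as two callables; the `jobId`, `jobState` and `errorMessage` field names, the state values and the function name `run_statistics_job` are assumptions made for illustration. Only the `responseBody` property and the two endpoint paths come from the description above.

```python
import time


def run_statistics_job(post, get, request_body,
                       poll_interval=1.0, max_polls=60, sleep=time.sleep):
    """Submit a statistics job and poll until it completes.

    post(path, body) and get(path) are caller-supplied callables that
    perform the HTTP request and return the parsed JSON response.
    """
    job = post("/asynchronous/job", request_body)
    job_id = job["jobId"]  # assumed name of the id field in the response
    for _ in range(max_polls):
        status = get(f"/asynchronous/job/{job_id}")
        state = status.get("jobState")  # assumed field and values
        if state == "COMPLETE":
            return status["responseBody"]
        if state == "FAILED":
            raise RuntimeError(status.get("errorMessage", "statistics job failed"))
        sleep(poll_interval)
    raise TimeoutError(f"job {job_id} did not complete after {max_polls} polls")
```

Injecting `post`, `get` and `sleep` keeps the sketch independent of any particular HTTP client and makes the polling loop easy to exercise with stubs.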
Web Client Integration
The web client should have a way to show the statistics for a project or a specific file. Some initial ideas:
...