
...


The current solution poses some challenges:

  • The process is relatively lengthy and requires non-trivial technical skills
  • The data is limited to a 6-month window; the current workaround is to store incremental updates on an external source (a CSV file on S3)
  • The data cannot be easily integrated into the Synapse portal or other components (in some cases the files are manually annotated with the number of downloads)
  • The system has an all-or-nothing policy for accessing the data: access is (for good reason) restricted to a specific subset of Synapse employees, so users of the Synapse platform cannot get this kind of data without asking a Synapse engineer

An example of a usage report generated from the data warehouse with the Synapse Usage Report (written by Kenny Daily):

Files and Downloads


Files in Synapse are referenced through an abstraction (FileHandle) that maintains the information about the link to the content of the file itself (e.g. an S3 bucket). A file handle is then referenced in many places (such as FileEntity and WikiPage, see FileHandleAssociateType) as a pointer to the actual file content. To download the content, the Synapse platform can generate a pre-signed URL (according to the location where the file is stored) that can be used to download the file directly. Note that the platform has no way to guarantee that the client actually uses the pre-signed URL to download the file. Every pre-signed URL request in the codebase comes down to a single method, getURLForFileHandle.

...

The API also exposes several ways to get pre-signed URLs for file handles. In the context of project and file downloads we are interested in specific operations: several API endpoints are used to request a pre-signed URL to download the content of entities of type FileEntity (for a complete list of endpoints that reference file handles see Batch FileHandle and File URL Services):


| Endpoint | Description |
|---|---|
| /entity/{id}/file, /entity/{id}/version/{versionNumber}/file | Currently deprecated API; returns a 307 redirect to the actual pre-signed URL of the file for the given entity |
| /fileHandle/batch | Returns a batch of pre-signed URLs for a set of file handles. The request (BatchFileRequest) requires, for each file handle id, the id and the type of the object that references the file handle (see FileHandleAssociation and FileHandleAssociateType) |
| /file/{id} | Returns the pre-signed URL of the file handle with the given id. The request parameters must contain the FileHandleAssociateType and the id of the object that references the file. Like the deprecated /entity/{id}/file, it returns a 307 redirect to the pre-signed URL |
| /file/bulk/async/start | Part of an asynchronous API to start a bulk download of a list of files. The API allows monitoring the background task and getting the result through a dedicated file handle id that points to a zip file containing the requested files |


The main aspect that we want to capture is that, in general, any relevant file download should include the triplet associateObjectType, associateObjectId and fileHandleId (see FileHandleAssociation), which allows us to understand the source of the file download in the context of a request.
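
As a rough illustration of why the triplet matters, the sketch below models a download record carrying the three fields and counts downloads per associated object. Only the three field names come from the FileHandleAssociation mentioned above; the helper name, the sample ids and the type values are hypothetical.

```python
from collections import Counter
from dataclasses import dataclass


@dataclass(frozen=True)
class FileHandleAssociation:
    """Triplet identifying the source of a file download (field names from the text)."""
    associateObjectType: str  # e.g. "FileEntity" or "WikiPage" (illustrative values)
    associateObjectId: str
    fileHandleId: str


def count_downloads_by_object(records):
    """Count download records per (associateObjectType, associateObjectId) pair."""
    return Counter((r.associateObjectType, r.associateObjectId) for r in records)


# Hypothetical download records: two file handles of the same file entity
# and one file handle downloaded through a wiki page.
records = [
    FileHandleAssociation("FileEntity", "syn123", "1001"),
    FileHandleAssociation("FileEntity", "syn123", "1002"),
    FileHandleAssociation("WikiPage", "456", "1001"),
]
counts = count_downloads_by_object(records)
```

Without the associate object type and id, the download of file handle 1001 through the wiki page could not be told apart from a download through a file entity.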

...


The current API design uses a polymorphic approach to access certain kinds of objects, in particular Projects, Folders and Files (see Entities, Files, and Folders Oh My!), exposing a single prefixed endpoint for CRUD operations (/entity). As a result, a GET request to the /entity/{id} endpoint might return a different response type depending on the entity (see the Entity Response Object). We need a way to gather statistics about two particular objects (for now): projects and files.

At the same time, statistics for different kinds of entities might return different kinds of responses (in terms of semantics): for example, the number of downloads for a project has a different meaning than the number of downloads for a file entity. Nevertheless, the responses will most likely be similar and can be grouped under a generic statistics object.

In general, computing statistics might be an expensive operation; moreover, in the future we might want to extend the API to include several different types of statistics. Depending on the use cases and the requirements, we can potentially pre-compute statistics aggregates so that we can serve them relatively quickly; nevertheless, we cannot guarantee that all the statistics will return within a reasonable response time.

The statistics for an entity might include more information than the number of downloads, and we should keep in mind that we might want to extend them to collect more data (such as the number of files, users, teams, etc.), but in a first phase we can start by including the download statistics only.

...

Common parent object for a job request to compute statistics of an entity. This object is an AsynchronousRequestBody that will be sent to the /asynchronous/job endpoint in order to ask for the computation of a certain statistic.

| Property | Required | Type | Description |
|---|---|---|---|
| entityId | Yes | STRING | The (Synapse) id of the entity for which the statistics should be computed. For the time being we will restrict it to Project and File entities. |
| version | No | STRING | The version of the entity (when of type FileEntity) for which the statistics should be computed. If omitted, statistics are gathered for the file entity as a whole, including all the versions. |

DownloadStatisticsRequest <extends> EntityStatisticsRequest

The actual job request for the download statistics of an entity; this will be used to get counts only.

| Property | Required | Type | Default Value | Description |
|---|---|---|---|---|
| fromTimestamp | No | Date | | A timestamp used to limit the considered time frame from the given timestamp |
| toTimestamp | No | Date | | A timestamp used to limit the considered time frame until the given timestamp |

DownloadBucketsStatisticsRequest <extends> DownloadStatisticsRequest

Extends the DownloadStatisticsRequest to compute the download counts (along with the unique users count) bucketed into fixed time intervals, i.e. grouped by a given time frame, which makes it possible to build a histogram and see the download trend over the specified time frame.

| Property | Required | Type | Default Value | Description |
|---|---|---|---|---|
| aggregationType | Yes | STRING | 'daily' | Specifies the type of aggregation in order to group the count of downloads, allowed values are: 'daily', 'weekly', 'monthly', 'yearly'. |
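
To make the inheritance chain of the three request objects concrete, here is a minimal sketch that models them as Python dataclasses and serializes a bucketed request. The property names and the 'daily' default come from the tables above; representing Date values as ISO-8601 strings, the `to_request_body` helper and the sample values are assumptions for illustration only, and any additional properties the job endpoint might require are not shown.

```python
import json
from dataclasses import asdict, dataclass
from typing import Optional


@dataclass
class EntityStatisticsRequest:
    entityId: str
    version: Optional[str] = None


@dataclass
class DownloadStatisticsRequest(EntityStatisticsRequest):
    # Dates are represented as ISO-8601 strings here (assumption).
    fromTimestamp: Optional[str] = None
    toTimestamp: Optional[str] = None


@dataclass
class DownloadBucketsStatisticsRequest(DownloadStatisticsRequest):
    aggregationType: str = "daily"


def to_request_body(request):
    """Serialize a request to a dict, dropping optional properties that were not set."""
    return {key: value for key, value in asdict(request).items() if value is not None}


# Hypothetical request: weekly download buckets for the first quarter of 2019.
request = DownloadBucketsStatisticsRequest(
    entityId="syn123",
    fromTimestamp="2019-01-01T00:00:00.000Z",
    toTimestamp="2019-03-31T00:00:00.000Z",
    aggregationType="weekly",
)
body = to_request_body(request)
print(json.dumps(body, indent=2))
```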

Note: For this type of job a limit is imposed on the time frame allowed for each aggregationType: the number of buckets within the time frame, according to the aggregation type, must be less than or equal to 100. Considering a resolution of 1 day, for each aggregation type the maximum is 100 days, 100 weeks (7 days * 100), 100 months (30 days * 100) or 100 years (365 days * 100).
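
The limit check can be sketched as follows, assuming the fixed bucket lengths used in the note (1, 7, 30 and 365 days) and rounding a partial day or partial bucket up. The function and constant names are hypothetical, and the actual service may count buckets differently (for example on calendar boundaries).

```python
import math
from datetime import datetime

MAX_BUCKETS = 100
# Bucket length in days for each aggregationType, as in the note above.
DAYS_PER_BUCKET = {"daily": 1, "weekly": 7, "monthly": 30, "yearly": 365}


def bucket_count(from_ts, to_ts, aggregation_type):
    """Number of buckets needed to cover the time frame at a resolution of one day."""
    if to_ts <= from_ts:
        raise ValueError("toTimestamp must be after fromTimestamp")
    days = math.ceil((to_ts - from_ts).total_seconds() / 86400)
    return math.ceil(days / DAYS_PER_BUCKET[aggregation_type])


def is_within_limit(from_ts, to_ts, aggregation_type):
    """True when the request would produce at most MAX_BUCKETS buckets."""
    return bucket_count(from_ts, to_ts, aggregation_type) <= MAX_BUCKETS


start = datetime(2019, 1, 1)
end = datetime(2019, 7, 1)  # 181 days after start
print(bucket_count(start, end, "daily"), is_within_limit(start, end, "daily"))
print(bucket_count(start, end, "weekly"), is_within_limit(start, end, "weekly"))
```

With these assumptions a six-month time frame is rejected for the 'daily' aggregation (181 buckets) but accepted for 'weekly' (26 buckets).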

Response Objects


The following is a list of object representations for the statistics API:

...

Common parent object for the response regarding a request for statistics on an entity. This object is an AsynchronousResponseBody that will be included as part of the AsynchronousJobStatus when requesting the status of the job.

| Property | Type | Description |
|---|---|---|
| lastUpdateTimestamp | Date | A timestamp representing the freshness of the statistic (last update date time of the requested statistic) |

DownloadStatisticsResponse <extends> EntityStatisticsResponse

...

Represents the response for a job request for computing the bucketed download statistics for an entity (in response to a DownloadBucketsStatisticsRequest). Note that the response extends the DownloadStatisticsResponse so that it includes the total counts.

| Property | Type | Description |
|---|---|---|
| downloadCountBuckets | ARRAY<StatisticCountBucket> | An array of StatisticCountBucket each containing the download count for one bucket, according to the aggregationType specified in the request |
| userCountBuckets | ARRAY<StatisticCountBucket> | An array of StatisticCountBucket each containing the unique users count for one bucket, according to the aggregationType specified in the request |
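
One possible way to derive the two arrays from raw download records is sketched below. It is an illustration only: the input shape (a day plus a user id per download), the (bucket start, count) output pairs, the Monday week start and the use of calendar months and years are all assumptions, not part of the design.

```python
from collections import defaultdict
from datetime import date, timedelta


def bucket_start(day, aggregation_type):
    """Return the first day of the bucket that contains `day`."""
    if aggregation_type == "daily":
        return day
    if aggregation_type == "weekly":
        return day - timedelta(days=day.weekday())  # weeks start on Monday (assumption)
    if aggregation_type == "monthly":
        return day.replace(day=1)
    if aggregation_type == "yearly":
        return day.replace(month=1, day=1)
    raise ValueError(f"Unsupported aggregationType: {aggregation_type}")


def bucket_downloads(downloads, aggregation_type):
    """Group (day, user_id) download records into per-bucket counts.

    Returns two lists of (bucket_start, count) pairs sorted by bucket start:
    total downloads per bucket and unique users per bucket.
    """
    totals = defaultdict(int)
    users = defaultdict(set)
    for day, user_id in downloads:
        start = bucket_start(day, aggregation_type)
        totals[start] += 1
        users[start].add(user_id)
    download_counts = sorted(totals.items())
    user_counts = sorted((start, len(ids)) for start, ids in users.items())
    return download_counts, user_counts


# Hypothetical records: three downloads in the week of 2019-03-04, one in the next week.
downloads = [
    (date(2019, 3, 4), "user-1"),
    (date(2019, 3, 5), "user-1"),
    (date(2019, 3, 6), "user-2"),
    (date(2019, 3, 12), "user-2"),
]
download_counts, user_counts = bucket_downloads(downloads, "weekly")
```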

StatisticsCountBucket

The purpose of this object is to include information about the count of a certain metric within a specific time frame:

...

The API reuses the endpoints from the Asynchronous Job API (We could potentially add a dedicated /statistics endpoint just for clarity).

| Method | Endpoint | Request Body | Response Body | Description | Restrictions |
|---|---|---|---|---|---|
| POST | /asynchronous/job | DownloadStatisticsRequest or DownloadBucketsStatisticsRequest | AsynchronousJobStatus: the responseBody property in the status will contain either the DownloadStatisticsResponse or the DownloadBucketsStatisticsResponse | Submits a job to gather the download statistics for an entity. The id returned by the request is used to get the job status. | The entity specified in the request must be a project or a file entity (otherwise 400); the user must have view permission on the entity specified in the body (otherwise 403); for a DownloadBucketsStatisticsRequest only, the number of buckets implied by the time frame in the request must be less than or equal to 100 (otherwise 400) |
| GET | /asynchronous/job/{id} | | AsynchronousJobStatus | Returns the current status of the statistics job with the given id. | |
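
From a client's point of view, the two endpoints combine into a submit-then-poll loop, sketched below. The HTTP layer is injected as two callables so that the sketch stays independent of any client library; the `jobId` and `jobState` field names and the state values are assumptions about the shape of the status object, while `responseBody` is the property named in the table above.

```python
import time


def run_statistics_job(post, get, request_body, poll_interval=1.0, max_polls=60):
    """Submit a statistics job and poll until it leaves the processing state.

    `post(path, body)` and `get(path)` are caller-supplied callables that perform
    the HTTP calls and return the decoded JSON response as a dict.
    """
    job_id = post("/asynchronous/job", request_body)["jobId"]  # field name is an assumption
    for _ in range(max_polls):
        status = get(f"/asynchronous/job/{job_id}")
        if status["jobState"] != "PROCESSING":  # state names are assumptions
            break
        time.sleep(poll_interval)
    else:
        raise TimeoutError(f"Job {job_id} did not finish after {max_polls} polls")
    if status["jobState"] != "COMPLETE":
        raise RuntimeError(f"Job {job_id} ended in state {status['jobState']}")
    return status["responseBody"]
```

Bounding the number of polls matters here because, as noted earlier, not every statistic is guaranteed to return within a reasonable response time.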


Web Client Integration


The web client should have a way to show the statistics for a project or a specific file. Some initial ideas:

...