...


The current solution poses some challenges:

  • The process is relatively lengthy and requires non-trivial technical skills.
  • The data is limited to a 6-month window. The current workaround is to store incremental updates in an external source (a CSV file on S3).
  • The data cannot be easily integrated into the Synapse portal or other components (in some cases the files are manually annotated with the number of downloads).
  • Access to the data is all or nothing: for good reason, it is available only to a specific subset of Synapse employees, so users of the Synapse platform cannot access this kind of data without asking a Synapse engineer.

An example of a usage report generated from the data warehouse using the Synapse Usage Report written by Kenny Daily:

Files and Downloads


Files in Synapse are referenced through an abstraction (FileHandle) that maintains the information about the link to the content of the file itself (e.g. an S3 bucket). A file handle is then referenced in many places (such as FileEntity and WikiPage, see FileHandleAssociateType) as a pointer to the actual file content. To download the content, the Synapse platform allows generating a pre-signed URL (according to the location where the file is stored) that can be used to download the file directly. Note that the platform has no way to guarantee that the client actually uses the pre-signed URL to download the file. Every pre-signed URL request in the codebase comes down to a single method, getURLForFileHandle.

...


The current API design uses a polymorphic approach to access certain kinds of objects, in particular Projects, Folders and Files (see Entities, Files, and Folders Oh My!), exposing a single prefixed endpoint for CRUD operations (/entity). As a result, a GET request to the /entity/{id} endpoint might return a different response type depending on the entity (see the Entity response object). We need a way to gather statistics about two particular kinds of objects (for now): projects and files.

At the same time, statistics for different kinds of entities might return responses with different semantics: for example, the number of downloads for a project has a different meaning than the number of downloads for a file entity. Nevertheless, the responses will most likely be similar and can be grouped under a generic statistics object.

In general, computing statistics might be an expensive operation, and in the future we might want to extend the API to include several different types of statistics. Depending on the use cases and requirements, we can potentially pre-compute statistics aggregates so that we can serve them relatively quickly; nevertheless, we cannot guarantee that all statistics will be returned within a reasonable response time.

In general, the statistics for an entity might include more information than the number of downloads, and we should keep in mind that we might want to extend them to collect more data (such as the number of files, users, teams, etc.). In a first phase, however, we can start by including the download statistics only.

To this end we propose an API that integrates into the current /asynchronous/job API (see Asynchronous Job API), so that a client can send a request to compute certain statistics and then poll the API for the final result. This offloads the computation to a background worker, keeping the API servers free of heavy computation. The two main objects that the statistics API will need to extend are AsynchronousRequestBody and AsynchronousResponseBody, which represent respectively the request for a background job and the response from the job (returned as part of the AsynchronousJobStatus).

Note: In the actual implementation we might have a generic SynapseStatistics interface used as a marker for any statistics request; for brevity, in the following we simply refer to EntityStatistics.
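
The submit-then-poll flow described above can be sketched as follows. This is a minimal illustration, not the Synapse client: the `start_job` and `get_status` callables, the `FakeJobBackend` class, the `entityId` request property and the `jobState` values (`PROCESSING`, `COMPLETE`, `FAILED`) are assumptions made for the example.

```python
import time
from typing import Any, Callable, Dict


def run_statistics_job(
    start_job: Callable[[Dict[str, Any]], str],
    get_status: Callable[[str], Dict[str, Any]],
    request_body: Dict[str, Any],
    poll_interval: float = 1.0,
    max_polls: int = 60,
) -> Dict[str, Any]:
    """Submit a statistics request, then poll until the job completes."""
    job_id = start_job(request_body)
    for _ in range(max_polls):
        status = get_status(job_id)
        state = status.get("jobState")  # assumed field name
        if state == "COMPLETE":
            return status["responseBody"]
        if state == "FAILED":
            raise RuntimeError(status.get("errorMessage", "statistics job failed"))
        time.sleep(poll_interval)
    raise TimeoutError(f"job {job_id} did not complete after {max_polls} polls")


class FakeJobBackend:
    """Hypothetical in-memory stand-in for the background worker."""

    def __init__(self, polls_until_done: int = 2) -> None:
        self._polls_until_done = polls_until_done
        self._remaining: Dict[str, int] = {}

    def start(self, body: Dict[str, Any]) -> str:
        job_id = str(len(self._remaining) + 1)
        self._remaining[job_id] = self._polls_until_done
        return job_id

    def status(self, job_id: str) -> Dict[str, Any]:
        if self._remaining[job_id] > 0:
            self._remaining[job_id] -= 1
            return {"jobState": "PROCESSING"}
        return {
            "jobState": "COMPLETE",
            "responseBody": {"totalCount": 0, "usersCount": 0},
        }


backend = FakeJobBackend(polls_until_done=2)
result = run_statistics_job(
    backend.start, backend.status, {"entityId": "syn123"}, poll_interval=0.0
)
```

The API servers only enqueue the job and report its status; the expensive work happens in whatever sits behind `start_job`, which is the point of reusing the asynchronous job API.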

Request Objects


In the following we provide a list of object representations for requests of entity statistics:

...

DownloadStatisticsRequest <extends> EntityStatisticsRequest

The actual job request, which extends the EntityStatisticsRequest to get the download statistics of an entity. It is used to get the download and user counts only.

| Property | Required | Type | Default Value | Description |
|---|---|---|---|---|
| fromTimestamp | No | Date | | A timestamp used to limit the considered time frame from the given timestamp |
| toTimestamp | No | Date | | A timestamp used to limit the considered time frame until the given timestamp |

...

Extends the DownloadStatisticsRequest in order to compute the download counts (along with the unique users counts) bucketed into fixed time intervals, i.e. grouped by a given time frame. This makes it possible to build a histogram and show the download trend over the specified time frame.

| Property | Required | Type | Default Value | Description |
|---|---|---|---|---|
| aggregationType | Yes | STRING | 'daily' | Specifies the type of aggregation used to group the download counts. Allowed values are 'daily', 'weekly', 'monthly' and 'yearly'. Note: for this type of job a limit is imposed on the time frame allowed for each aggregationType: the number of buckets within the time frame must be at most 100. Considering a resolution of 1 day, the maximum time frame is 100 days, 100 weeks (7 days * 100), 100 months (30 days * 100) or 100 years (365 days * 100). |
| limit | No | Integer | 30 | Used for pagination; limits the number of buckets to the given value. The maximum allowed value is 100. |
| offset | No | Integer | 0 | Used for pagination; specifies the offset index for the page of buckets. This value is a function of the given limit: the first page has an offset of 0, the second page has an offset equal to the limit (e.g. 30 by default). |
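
The bucket limit and the pagination offset can be illustrated with a short sketch. It assumes the day-based resolution mentioned in the note (a week as 7 days, a month as 30 days, a year as 365 days); the function names and the rounding up of partial buckets are hypothetical choices, not part of the proposal.

```python
import math
from datetime import datetime, timezone

# Bucket width in days for each aggregationType (day resolution, as in the note).
DAYS_PER_BUCKET = {"daily": 1, "weekly": 7, "monthly": 30, "yearly": 365}
MAX_BUCKETS = 100  # maximum number of buckets in the time frame
MAX_LIMIT = 100    # maximum page size


def bucket_count(from_ts: datetime, to_ts: datetime, aggregation_type: str = "daily") -> int:
    """Number of buckets needed to cover the time frame (partial buckets count)."""
    days = (to_ts - from_ts).total_seconds() / 86400
    return math.ceil(days / DAYS_PER_BUCKET[aggregation_type])


def validate_time_frame(from_ts: datetime, to_ts: datetime, aggregation_type: str = "daily") -> int:
    """Return the bucket count, or raise if it exceeds the allowed maximum."""
    count = bucket_count(from_ts, to_ts, aggregation_type)
    if count > MAX_BUCKETS:
        raise ValueError(f"{count} buckets requested, at most {MAX_BUCKETS} allowed")
    return count


def page_offset(page_index: int, limit: int = 30) -> int:
    """Offset of a zero-based page: page 0 -> 0, page 1 -> limit, and so on."""
    if not 1 <= limit <= MAX_LIMIT:
        raise ValueError(f"limit must be between 1 and {MAX_LIMIT}")
    return page_index * limit


start = datetime(2019, 1, 1, tzinfo=timezone.utc)
end = datetime(2019, 4, 11, tzinfo=timezone.utc)  # exactly 100 days after start
daily = validate_time_frame(start, end, "daily")
weekly = validate_time_frame(start, end, "weekly")
second_page = page_offset(1)
```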

Response Objects


In the following we provide a list of object representations for the statistics API:

<<abstract>> EntityStatisticsResponse <extends> AsynchronousResponseBody

Common parent object for the response regarding a request for statistics on an entity. This object is an AsynchronousResponseBody that will be included as part of the AsynchronousJobStatus when requesting the status of the job.

| Property | Type | Description |
|---|---|---|
| lastUpdateTimestamp | Date | A timestamp representing the freshness of the statistic (last update date time of the requested statistic) |

DownloadStatisticsResponse <extends> EntityStatisticsResponse

Represents the response for a job request for computing the download statistics for an entity (in response to a DownloadStatisticsRequest).

| Property | Type | Description |
|---|---|---|
| totalCount | INTEGER | The total number of downloads |
| usersCount | INTEGER | The total number of unique users that performed a download |

DownloadBucketsStatisticsResponse <extends> DownloadStatisticsResponse

Represents the response for a job request for computing the bucketed download statistics for an entity (in response to a DownloadBucketsStatisticsRequest). Note that the response extends the DownloadStatisticsResponse so that it includes the total counts.

| Property | Type | Description |
|---|---|---|
| downloadCountBuckets | ARRAY<StatisticsCountBucket> | An array of StatisticsCountBucket, each containing the download count for one bucket, according to the aggregationType specified in the request |
| userCountBuckets | ARRAY<StatisticsCountBucket> | An array of StatisticsCountBucket, each containing the unique users count for one bucket, according to the aggregationType specified in the request |
| limit | Integer | The limit value used in the initial request, included for pagination purposes |
| offset | Integer | The offset value used in the initial request, included for pagination purposes |

StatisticsCountBucket

The purpose of this object is to include information about the count of a certain metric within a specific time frame:

| Property | Type | Description |
|---|---|---|
| total | INTEGER | The count in the time frame |
| fromTimestamp | STRING | The starting timestamp of the time frame represented by the bucket |
| toTimestamp | STRING | The ending timestamp of the time frame represented by the bucket |
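
As an illustration of how such buckets could be produced for a 'daily' aggregation, the sketch below groups raw download events into StatisticsCountBucket objects. The event shape (a timestamp paired with a user id), the `daily_buckets` helper, the ISO 8601 string format and the half-open bucket boundaries are all assumptions made for the example.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta, timezone
from typing import Iterable, List, Set, Tuple


@dataclass
class StatisticsCountBucket:
    total: int
    fromTimestamp: str
    toTimestamp: str


def daily_buckets(
    events: Iterable[Tuple[datetime, str]], start: datetime, days: int
) -> Tuple[List[StatisticsCountBucket], List[StatisticsCountBucket]]:
    """Group (timestamp, user_id) download events into one bucket per day.

    Returns the download count buckets and the unique users count buckets.
    """
    downloads = [0] * days
    users: List[Set[str]] = [set() for _ in range(days)]
    for timestamp, user_id in events:
        index = (timestamp - start).days
        if 0 <= index < days:  # ignore events outside the time frame
            downloads[index] += 1
            users[index].add(user_id)

    def make_bucket(index: int, total: int) -> StatisticsCountBucket:
        bucket_start = start + timedelta(days=index)
        bucket_end = bucket_start + timedelta(days=1)  # assumed exclusive end
        return StatisticsCountBucket(total, bucket_start.isoformat(), bucket_end.isoformat())

    download_buckets = [make_bucket(i, n) for i, n in enumerate(downloads)]
    user_buckets = [make_bucket(i, len(u)) for i, u in enumerate(users)]
    return download_buckets, user_buckets


utc = timezone.utc
frame_start = datetime(2019, 1, 1, tzinfo=utc)
sample_events = [
    (datetime(2019, 1, 1, 10, 0, tzinfo=utc), "a"),
    (datetime(2019, 1, 1, 12, 0, tzinfo=utc), "a"),
    (datetime(2019, 1, 2, 9, 0, tzinfo=utc), "b"),
    (datetime(2019, 1, 5, 9, 0, tzinfo=utc), "c"),  # outside the 2-day frame
]
download_buckets, user_buckets = daily_buckets(sample_events, frame_start, days=2)
```

Note that the overall usersCount cannot be obtained by summing the user buckets, since the same user may appear in several buckets.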

Endpoints


The API reuses the endpoints from the Asynchronous Job API (We could potentially add a dedicated /statistics endpoint just for clarity).

| Method | Endpoint | Request Body | Response Body | Description | Restrictions |
|---|---|---|---|---|---|
| POST | /asynchronous/job | DownloadStatisticsRequest or DownloadBucketsStatisticsRequest | AsynchronousJobStatus: the responseBody property in the status will contain either the DownloadStatisticsResponse or the DownloadBucketsStatisticsResponse | Submits a job to gather the download statistics for an entity. The id returned by the request is used to get the job status. | The entity specified in the request must be a project or a file entity (if not, 400). The user should have view permission on the entity specified in the body (if not, 403). The DownloadBucketsStatisticsRequest will compute paginated results; the value of the "limit" parameter is restricted to a maximum of 100. |
| GET | /asynchronous/job/{id} | N/A | AsynchronousJobStatus | Gets the current status of the statistics job with the given id. | |
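
The restrictions on job submission could be checked along the following lines. This is only a sketch: the `rejection_status` helper, the entity type strings and the order of the checks are hypothetical, and returning 400 for a limit above 100 is an assumption, since no status code is given for that case.

```python
from typing import Optional

ALLOWED_ENTITY_TYPES = {"project", "file"}  # hypothetical type labels
MAX_LIMIT = 100


def rejection_status(
    entity_type: str, can_view: bool, limit: Optional[int] = None
) -> Optional[int]:
    """Return the HTTP status used to reject the request, or None if accepted."""
    if entity_type not in ALLOWED_ENTITY_TYPES:
        return 400  # the entity must be a project or a file entity
    if not can_view:
        return 403  # the user lacks view permission on the entity
    if limit is not None and limit > MAX_LIMIT:
        return 400  # assumption: an oversized page is treated as a bad request
    return None


folder_status = rejection_status("folder", can_view=True)
forbidden_status = rejection_status("file", can_view=False)
accepted_status = rejection_status("project", can_view=True, limit=100)
```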


Additionally, to be consistent with the current API design, we propose dedicated statistics endpoints that accept the above requests:


| Method | Endpoint | Request Body | Response Body | Description |
|---|---|---|---|---|
| POST | /statistics/entity/download/async/start | DownloadStatisticsRequest or DownloadBucketsStatisticsRequest | AsynchronousJobStatus | Submits the job to gather the download statistics for an entity. The id returned by the request is used to get the job status. |
| GET | /statistics/entity/download/async/get/{asyncToken} | N/A | AsynchronousJobStatus | Gets the current status of the statistics job with the given id. |

Web Client Integration


The web client should have a way to show the statistics for a project or a specific file. Some initial ideas:

...