Comment: Updates according to review meeting


Warning

WIP: Current design subject to changes


The epic SWC-4488 groups the issues submitted for a new feature request that revolves around exposing some basic statistics about projects from the Synapse API. The relevant tickets in the epic are listed below:

...

  1. Downloads Count
  2. Page Views
  3. Data Breaches/Audit Trail

Downloads Count

This is the main statistic that users are currently looking for. It gives project owners, funders and data contributors a way to monitor interest over time in the datasets published in a particular project, which in turn reflects interest in the project itself and is a measure of the value provided by its data. This kind of data relates specifically to the usage of the platform by Synapse users, since downloads are not available without authentication. It belongs to a generic category of statistics about the entities and metadata stored in the backend, and it is only a subset of the aggregate statistics that could be exposed (e.g. number of projects, users, teams etc.).

Page Views

This metric is also an indicator of interest, but it plays a different role and focuses on general user activity across the Synapse platform as a whole. While it might indicate the success of a specific project, it captures a different aspect that may span the different types of clients used to interface with the Synapse API, and it includes information about users that are not authenticated into Synapse. For this aspect there are tools already integrated (e.g. Google Analytics) that collect analytics on user interactions. Note however that this information is not currently available to Synapse users, nor is it set up to produce information about specific project pages, files, wikis etc.

Data Breaches/Audit Trail

Another aspect that came up and might seem related is identifying the when/what/why of potential data breaches (e.g. a dataset was released even though it was not supposed to be). This relates to the audit trail of user activity, which is used to identify potential offenders. While this information is crucial, it should not be exposed by the API, and a due process is in place to access this kind of data.

Project Statistics

With this brief introduction in mind this document focuses on the main driving use case, that is:

  • A funder and/or project creator would like to have a way to understand if the project is successful and if its data is used.

There are several metrics that can be used in order to determine the usage and success of a project, among which:

  • Project Access (e.g. page views)
  • Number of Downloads
  • Number of Uploads
  • User Discussions

...

  • Expose statistics about the number of downloads and the number of uploads
  • Expose statistics about the number of unique users that downloaded and/or uploaded
  • Statistics are collected on a per project basis (no file level statistics)
  • Statistics are limited to the last 12 months
  • Statistics are aggregated monthly (no total downloads or total users)
  • Statistics access is restricted through a new ACCESS_TYPE VIEW_STATISTICS, initially granted to project owners and administrators
  • No run-time filtering
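The monthly aggregation described in the requirements above can be sketched as follows. This is only an illustration under stated assumptions: the event shape (a timestamp and a user id) and the function name aggregate_monthly are hypothetical and are not part of the Synapse codebase.

```python
from collections import defaultdict
from datetime import datetime, timezone


def aggregate_monthly(events):
    """Group (timestamp, user_id) events by calendar month.

    Returns {(year, month): {"count": int, "usersCount": int}}, where
    "count" is the number of events and "usersCount" the number of
    distinct users in that month.
    """
    counts = defaultdict(int)
    users = defaultdict(set)
    for timestamp, user_id in events:
        key = (timestamp.year, timestamp.month)
        counts[key] += 1
        users[key].add(user_id)
    return {
        key: {"count": counts[key], "usersCount": len(users[key])}
        for key in counts
    }


# Hypothetical download events for one project: (timestamp, user id).
events = [
    (datetime(2019, 5, 3, tzinfo=timezone.utc), "user-a"),
    (datetime(2019, 5, 9, tzinfo=timezone.utc), "user-a"),
    (datetime(2019, 6, 1, tzinfo=timezone.utc), "user-b"),
]
monthly = aggregate_monthly(events)
```

Because unique users are counted per month, the monthly user counts cannot simply be summed to obtain a total number of unique users, which is consistent with the requirement of not exposing totals.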

Files, Downloads and Uploads

Files in Synapse are referenced through an abstraction (FileHandle) that maintains the information about the link to the content of the file itself (e.g. an S3 bucket). A file handle is then referenced in many places (such as FileEntity and WikiPage, see FileHandleAssociateType) as a pointer to the actual file content. To download the content, the Synapse platform can generate a pre-signed URL (according to the location where the file is stored) that can be used to download the file directly. Note that the platform has no way to guarantee that the client actually uses the pre-signed URL to download the file. Every pre-signed URL request in the codebase comes down to a single method, getURLForFileHandle.
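Since every pre-signed URL request goes through one method, that method is a natural place to record download events. The sketch below is a hypothetical Python stand-in, not the actual Java implementation: DownloadRecorder, the record fields and the placeholder URL are all assumptions made for illustration.

```python
from dataclasses import dataclass, field


@dataclass
class DownloadRecorder:
    """Collects one record per pre-signed URL request (hypothetical helper)."""

    records: list = field(default_factory=list)

    def record(self, user_id, file_handle_id, association_type):
        self.records.append((user_id, file_handle_id, association_type))


def get_url_for_file_handle(recorder, user_id, file_handle_id, association_type):
    """Stand-in for the single pre-signed URL method.

    The returned URL is a placeholder; a real implementation would ask
    the storage provider for a signed URL.
    """
    # Record the request before handing out the URL. This counts URL
    # requests, which only approximates actual downloads.
    recorder.record(user_id, file_handle_id, association_type)
    return f"https://example.invalid/presigned/{file_handle_id}"
```

A count collected this way is an upper bound on real downloads, because the platform cannot confirm that a generated URL was actually used.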

...

In general computing statistics might be an expensive operation; moreover, in the future we might want to extend the API to include several different types of statistics. For the current use case we can pre-compute monthly statistics aggregates for the projects so that we can serve them relatively quickly with a single lookup. We cannot guarantee that every kind of statistics exposed in the future can be pre-computed and returned within a reasonable response time; nevertheless, to ease client integration, we propose a dedicated endpoint that returns the statistics for a project within a single synchronous HTTP call.

If in the future we require more complex and/or expensive computation for certain statistics, we can extend the API to integrate with the /asynchronous/job API (see Asynchronous Job API) in order to offload the computation to a background worker, keeping the API servers free of heavy computation. Note that the computation of certain statistics might actually be relatively short, without the need for a background job (e.g. we might be able to cache the statistics for a whole project): using a worker might lead to a small delay in retrieving the statistics from the user's point of view, but that is probably fine for now. If needed, as an optimization, we might consider returning the result right away from the Asynchronous API without offloading to a background worker.

The two main objects that the statistics API will need to extend are: AsynchronousRequestBody and AsynchronousResponseBody that represent respectively the request for a background job and the response from the job (returned as part of the AsynchronousJobStatus).

Note: In the actual implementation we might have a generic SynapseStatistics interface used as a marker of any statistics request.

Request Object

ProjectStatisticsRequest

<extends> AsynchronousRequestBody

Object used to start the request for retrieving the statistics about a specific project. This object is an AsynchronousRequestBody that is sent to the /asynchronous/job endpoint in order to start the computation of the project statistics.

...

Defines which statistics are included in the response, similarly to the Entity Bundle Services. The supported values are as follows:

  • downloads = 0x1
  • uploads = 0x2

By default both are included with the mask 0x3 (0x1 + 0x2).

Code Block
languagejs
titleExample
{
	"projectId": "syn12345",
	"mask": 1
}
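The mask semantics described above can be illustrated with a short sketch. The flag values 0x1 and 0x2 and the default of 0x3 come from this document; the constant and function names are hypothetical.

```python
DOWNLOADS = 0x1
UPLOADS = 0x2
# Default mask: both statistics included (0x1 + 0x2 = 0x3).
DEFAULT_MASK = DOWNLOADS | UPLOADS


def included_statistics(mask=DEFAULT_MASK):
    """Return the names of the statistics selected by the given mask."""
    names = []
    if mask & DOWNLOADS:
        names.append("downloads")
    if mask & UPLOADS:
        names.append("uploads")
    return names
```

With the example request above ("mask": 1), only the downloads statistics would be selected.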

Response Object

ProjectStatisticsResponse

<extends> AsynchronousResponseBody

Represents the response for a job request for computing the project statistics (In response to a ProjectStatisticsRequest).


Endpoints

We propose to introduce a dedicated /statistics endpoint that will serve as main entry point to statistics requests. The project statistics are nested within this endpoint:

Endpoint | Method | Description | Response Type
/statistics/project/{projectSynId} | GET | Returns the statistics for the given project | ProjectStatistics

Restrictions:
  • The project with the given id must exist: NotFoundException (404)
  • The user (directly or through its groups) must have the VIEW_STATISTICS ACCESS_TYPE for the project with the given id: UnauthorizedException (403)
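The two restrictions above could be checked in order, as in the following sketch. It is written in Python purely for illustration: the exception classes mirror the names used in this document, while the ACL structure, the principals argument and the function name are assumptions.

```python
class NotFoundException(Exception):
    status = 404


class UnauthorizedException(Exception):
    status = 403


def check_statistics_access(project_id, principals, projects_acl):
    """Validate a statistics request.

    projects_acl: {project_id: {principal_id: set of access types}}
    principals: the user id plus the ids of the groups the user belongs to
    """
    if project_id not in projects_acl:
        raise NotFoundException(f"Project {project_id} does not exist")
    acl = projects_acl[project_id]
    # Access may be granted directly or through any of the user's groups.
    if not any("VIEW_STATISTICS" in acl.get(p, set()) for p in principals):
        raise UnauthorizedException("VIEW_STATISTICS access is required")
```

Checking existence first means a missing project yields a 404 regardless of the caller's permissions; whether that ordering is desirable is a design choice left open here.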

The endpoint accepts the following optional URL parameters:

Parameter Name | Type | Default Value | Description
downloads | Boolean | true | If set to false, excludes the downloads statistics from the response
uploads | Boolean | true | If set to false, excludes the uploads statistics from the response


Code Block
languagejs
titleExample
GET /statistics/project/syn123?downloads=true&uploads=false
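A client could build the request path for this endpoint as sketched below. The path and the parameter names come from this document; the helper name is hypothetical, and the sketch does not perform any HTTP call.

```python
from urllib.parse import urlencode


def project_statistics_path(project_id, downloads=True, uploads=True):
    """Build the path and query string for the project statistics endpoint."""
    query = urlencode({
        # Booleans are serialized as lowercase "true"/"false".
        "downloads": str(downloads).lower(),
        "uploads": str(uploads).lower(),
    })
    return f"/statistics/project/{project_id}?{query}"
```

For instance, project_statistics_path("syn123", uploads=False) reproduces the example request shown above.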

Response Objects

ProjectStatistics

Represents the response for the project statistics request:

Property | Type | Description
lastUpdatedOn | Date | Contains the last (approximate) update date for the project statistics; this value provides an approximation of the freshness of the statistics. This value might be null, in which case the statistics for the project are not currently available.
downloads | DownloadStatistics | Contains the download statistics for the project specified in the request; included only if the downloads parameter in the request is set to true.
uploads | UploadStatistics | Contains the upload statistics for the project specified in the request; included only if the uploads parameter in the request is set to true.

DownloadStatistics/UploadStatistics

Property | Type | Description
lastUpdatedOn | Date | Contains the last update date for the download/upload statistics; this value provides an approximation of the freshness of the statistics. This value might be null, in which case the download/upload statistics are not currently available.
monthly | ARRAY<StatisticsCountBucket> | An array containing the monthly download/upload count for the last 12 months; each bucket aggregates a month's worth of data. The number of buckets is limited to 12. Each bucket includes the unique users count for the month.

StatisticsCountBucket

The purpose of this object is to include information about the count of a certain metric within a specific time frame (in this case monthly):

Property | Type | Description
startDate | Date | The starting date of the time frame represented by the bucket
count | INTEGER | The download/upload count in the time frame
usersCount | INTEGER | The number of unique users that performed a download/upload in the time frame of the bucket
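The start dates of the monthly buckets can be derived as in the sketch below. It assumes that the current month counts as the first of the 12 buckets and that buckets start at midnight UTC on the first day of the month; neither assumption is confirmed by this document, and the function name is hypothetical.

```python
from datetime import datetime, timezone


def last_12_month_starts(now):
    """Return the start dates of the last 12 monthly buckets, newest first."""
    year, month = now.year, now.month
    starts = []
    for _ in range(12):
        starts.append(datetime(year, month, 1, tzinfo=timezone.utc))
        # Step back one month, rolling over to December of the previous year.
        month -= 1
        if month == 0:
            year, month = year - 1, 12
    return starts
```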


Code Block
languagejs
titleExample
{
	"lastUpdatedOn": "2019-06-26T01:01:00.000Z",
	"downloads": {
		"lastUpdatedOn": "2019-06-26T01:01:00.000Z",
		"monthly": [{
			"startDate": "2019-06-01T00:00:00.000Z",
			"count": 1230,
			"usersCount": 10
		},
		{
			"startDate": "2019-05-01T00:00:00.000Z",
			"count": 10000,
			"usersCount": 100
		}]
	},
	"uploads": {
		"lastUpdatedOn": "2019-06-26T01:01:00.000Z",
		"monthly": [{
			"startDate": "2019-06-01T00:00:00.000Z",
			"count": 51200,
			"usersCount": 200
		},
		{
			"startDate": "2019-05-01T00:00:00.000Z",
			"count": 10000,
			"usersCount": 100
		}]
	}
}
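Since the API does not return totals, a client that wants one can sum the monthly buckets itself. The sketch below parses a trimmed, hypothetical response of the shape described in this document; the helper name total_count is an assumption.

```python
import json

# Trimmed sample response; uploads were excluded from the request.
SAMPLE = """
{
    "lastUpdatedOn": "2019-06-26T01:01:00.000Z",
    "downloads": {
        "lastUpdatedOn": "2019-06-26T01:01:00.000Z",
        "monthly": [
            {"startDate": "2019-06-01T00:00:00.000Z", "count": 1230, "usersCount": 10},
            {"startDate": "2019-05-01T00:00:00.000Z", "count": 10000, "usersCount": 100}
        ]
    }
}
"""


def total_count(statistics, kind):
    """Sum the monthly counts for "downloads" or "uploads".

    Returns None when that section was not included in the response.
    """
    section = statistics.get(kind)
    if section is None:
        return None
    return sum(bucket["count"] for bucket in section["monthly"])


statistics = json.loads(SAMPLE)
downloads_total = total_count(statistics, "downloads")
uploads_total = total_count(statistics, "uploads")
```

The same approach does not work for usersCount: a user active in several months is counted once per bucket, so summing the buckets would overstate the number of unique users.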

Endpoints

The API reuses the endpoints from the Asynchronous Job API (We could potentially add a dedicated /statistics endpoint just for clarity).

...

ProjectStatisticsRequest

...

Failure Statuses

A ProjectStatisticsRequest job can fail for a variety of reasons. Aside from exceptional cases, the AsynchronousJobStatus might be set to failed in the following cases:

  • The project specified in the request does not exist (NotFoundException)
  • The user that requested the project statistics is not the owner or the administrator of the project (UnauthorizedException)

Additionally we propose to have dedicated statistics endpoints to be consistent with the current API design that accepts the above requests:

...

ProjectStatisticsRequest

 

...


}

Web Client Integration

The web client should have a way to show the statistics for a project or a specific file. Some initial ideas:

...