Skip to end of banner
Go to start of banner

Synapse Statistics

Skip to end of metadata
Go to start of metadata

You are viewing an old version of this page. View the current version.

Compare with Current View Page History

« Previous Version 12 Next »

WIP: Current design subject to changes


The epic  SWC-4488 - Getting issue details... STATUS  groups the issues submitted about a new feature request that revolves around exposing from the the synapse API some basic statistics about projects. In the following the list of relevant tickets in the epic:

key summary type created updated due assignee reporter priority status resolution
Loading...
Refresh

Introduction


It is clear that there is the need for the platform to provide some statistics about the usage of the data within synapse: discussions around various aspects and statistics to gather and expose have been going on for several years.

There are different statistics that the users are currently interested in and different levels of understanding of what should be exposed by the API. In particular we identified 3 main points that the users are focusing on and that are related but need clarifications:

  1. Downloads count
  2. Page views
  3. Data breaches

Downloads Count

This is the main statistic that the users are currently looking for, it provides a way for project owners, funders and data contributor to monitor the interest over time in the datasets published in a particular project, which then reflects on the interest on the project itself and it is a metric of the value provided by the data in the project. This kind of data is related specifically to the usage of the platform by synapse users, since without being authenticated the downloads are not available. This is part of a generic category of statistics that relates to the entities and metadata that is stored in the backend and it's only a subset of aggregate statistic that can be exposed (e.g. number of projects, users, teams etc).

Page Views

This metric is also an indicator to monitor the interest but it plays a different role and focuses on the general user activity over the synapse platform as a whole. While it might be an indicator for a specific project success it captures a different aspect that might span to different type of clients used to interface on the Synapse API and that include information about users that are not authenticated into synapse. For this particular aspect there are tools already integrated (E.g. google analytics) that collect analytics on the user interactions. Note however that this information is not currently available to the synapse users, nor setup in a way to produce information about specific projects pages, files, wikis etc.

Data Breaches

Another aspect that came out and might seem related is the identification of when/what/why of potential data breaches (e.g. a dataset was released even though it was not supposed to). This relates to the audit trail of users activity in order to identify potential offenders. While this information is crucial it should not be exposed by the API, and a due process is in place in order to access this kind of data.


With this brief introduction in mind this document focuses on the first kind of data that can be collected and exposed by the API. While the number of page views and in general data collected about user interaction with the portal (and other clients) is an important aspect to consider we treat it as a different issue all together that can be tackled and discussed separately and that might be included in a future release as part of the statistics exposed by the API.

For the first implementation we do not plan to include the page views or analysis of anonymous user interactions in the statistics, but instead we will focus on the (synapse) user downloads only.

In the following we collect the various uses cases that were highlighted and a brief description of the current situation as well as a proposed API design to integrate in the synapse platform to expose the statistics.

Download Statistics

For the first phase we focus on exposing statistics about file downloads, in particular about download counts and count of users that downloaded a file.

Use Cases


The main driving use cases for the feature is to be able to monitor the overall dataset usage of the project as well as the usage of the single dataset overtime. 

We provide a list of use cases that were discussed with (internal) users of the synapse platform and from documents with some relevant use cases (See  and ), we refer to the generic user actor and provide a list of main actors the use case is potentially intended for. Among the actors we list the Project Owner, a Funder, a generic (registered) Synapse User and a (Data) Contributor.

#LevelActorsDescription
1ProjectProject Owner, FunderA user would like to see an overview of the total number of files downloaded within a project
2ProjectProject Owner, FunderA user would like to see the trend over time of the total number of files downloaded within a project
3ProjectProject Owner, FunderA user would like to see the number of unique users that performed file downloads within a project
4ProjectProject Owner, FunderIn reference to use cases 1, 2 and 3 a user would like to have the option to see this information within a specific time range
5ProjectProject Owner, FunderIn reference to use cases 1, 2 and 3 a user would like to have the option to filter this information by a specific team
6ProjectProject OwnerA user would like to get the list of most downloaded files in the project along with their download/users counts.
7FileProject Owner, Synapse User, ContributorA user would like to see the total number of downloads for a specific file
8FileProject Owner, Synapse User, ContributorA user would like to see the trend over time of the total number of downloads for a specific file
9FileProject Owner, Synapse User, ContributorA user would like to see the number of unique users that downloaded a specific file
10FileProject Owner, Synapse User, ContributorIn reference to use cases 6, 7 and 8 a user would like to have the option to see this information within a specific time range
11FileProject Owner, Synapse User, ContributorIn reference to use cases 6, 7 and 8 a user would like to have the option to filter this information by a specific team
12FileProject OwnerIn the context of a challenge the user would like to see the trend of downloads of a specific file (dataset) among the teams participating in the challenge. In particular they would like to have the fraction of teams out of all the teams in the challenge that downloaded a file at least once in order to understand how many teams didn't download the dataset.


  • Use cases 5 and 11 are currently removed as it is not a request.
  • Use case 6 is a nice to have, we might want to skip this in the first implementation.
  • Use case 12 is a nice to have, not a priority.


Use cases for the count for folders, table, views etc are not included as they were not specifically mentioned by the users

Considerations


In general the idea is to collect data over the number of file downloads and the number of unique users that downloaded files within a project, providing some minimal filtering options. Additional filtering/data cleaning options that we might need to consider include: annotations, team exclusion (e.g. to reduce noise, I want to exclude from the statistics the users that are part of some administrative sage team ~ tricky as sage team users might actually work with the data). This might be part of a cleaning pipeline when we gather statistics and we might keep the filtering only at the level of time frame.

In the following we list some consideration about the use cases that we received:

  • Aside from the count of downloads there is also the need to aggregate this data in a way that captures the trend of file downloads over time. For example the user is interested to see the number of downloads within a project or for a specific file grouped daily, weekly, monthly, yearly. There was no specific request to have averages but we could think of including it.
  • Note that the information about the number of downloads might also appear in the file entities in a view, while no specific request was done for files (handles) linked in a table. In general the use cases focus on statistics about downloads over FileEntity.
  • We might consider an additional dedicated permission that could be used to allow access to statistics (e.g. limiting its access by default). On the other end the use case for this is not defined, it should not be a problem to use the (computed) READ permission on the entity for statistics that do not contain identifiable data and expose aggregations only. We can start simple and use the canRead permission computed for entity (See Entity Permissions) and if needed in the future add a new type of permission specific for statistics (e.g. VIEW_STATISTICS ACCESS_TYPE).
    Note that if a file is not accessible to the user its statistics should not be accessible, if a folder is not accessible to the user the statistics about the files in the folder should not be accessible (unless the files have local permissions that allow access to them).
  • Also note that the value of the statistics about a parent element (e.g. the project) should maintain consistency and include the count relative to the child objects no matter the permissions of the child objects (e.g. in other words if I change the permissions of a child entity, the aggregate information about the parent should not change).
  • File entities are versioned, we would need to have a way to gather statistics for a specific version of the entity as well as the total among all the versions.
  • File entities might have previews, we should avoid including in the count the downloads of the previews.
  • Note from John: Is the team filtering really needed? It appears so from the doc  , but the actual statistics reported for downloads do not seem to filter by team (See screenshot below)
  • Filtering the information might be contextual, for example filtering by a specific team might mean to filter by the current team members or depending on the team members at the time of the download: it seems that the users are oriented towards the first option, a team filter might be used to limit the statistics to a limited set of users. This poses an interesting question: should we limit the team filtering to teams with at least X users? (e.g. to avoid people misusing the statistics to identify specific user behavior). Also adding the option to filter the statistics by team adds a great deal of complexity in computing and storing the statistics.
  • Use Case 12 is from Andrew L. for the dream challenges, is not a strong use case but a nice to have.

Current Situation


Currently the statistics about the file downloads are collected in various ways mostly through the Data Warehouse, using /wiki/spaces/DW/pages/796819457 of the data collected to run ad-hoc queries for aggregations as well as through dedicated clients (See Synapse Usage Report).


The current solution poses some challenges:

  • The process is relatively lengthy and requires non trivial technical skills
  • The data is limited to a 6 months window, the current solution to this problem is to store incremental updates on an external source (cvs file on S3)
  • The data cannot be easily integrated in the synapse portal or other components (in some cases the files are manually annotated with the number of downloads)
  • The system has an all or nothing policy for accessing the data, that is (for good reason) only accessible to a specific subset of synapse employees, this does not allow the users of the synapse platform to access this kind of data without asking a synapse engineer

An example of usage report generated using the Synapse Usage Report written by Kenny Daily using the data warehouse:

Files and Downloads


Files in synapse are referenced through an abstraction (FileHandle) that maintain the information about the link to the content of the file itself (e.g. an S3 bucket). A file handle is then referenced in many places (such as FileEntity and WikiPage, see FileHandleAssociateType) as pointers to the actual file content. In order to actually download the content the synapse platform allows to generated a pre-signed url (according to the location where the file is stored) that can be used to directly download the file. Note that the platform has no way to guarantee that the pre-signed url is actually used by the client in order to download a file. Every single pre-signed url request in the codebase comes down to a single method getURLForFileHandle.

In the context of a project and project files downloads we are interested in particular to the entity of type FileEntity and the download request made for the associated file handles (which does not include the preview handle) specifically in the context of a particular project or file entity.

There are also different ways to actually get pre-signed urls for file handles that are exposed through the API. For the context of project and files download we are interested in specific operations, in particular there are several API endpoints that are used to request a pre-signed url to download the content of entities of type FileEntity (For a complete list of endpoints that reference file handles see Batch FileHandle and File URL Services):


EndpointDescription

/entity/{id}/file /entity/{id}/version/{versionNumber}/file

Currently deprecated API, will return a 307 with the redirect to the actual pre-signed url of the file for the given entity
/fileHandle/batch

Allows to get a batch of pre-signed urls for a set of file handles. The request (BatchFileRequest) specifically requires the id of the object that reference the file handle id and the type of the object that reference the file handle (See FileHandleAssociation and FileHandleAssociateType)

/file/{id}

Allows to get the pre-signed url of the file handle with the given id. The request must contain within the parameters the FileHandleAssociateType and the id of the object that reference the file. Similarly to the deprecated /entity/{id}/file will return a 307 with the redirect to the pre-signed url

/file/bulk/async/startThis is part of an asynchronous API to start a bulk download of a list of files. The api allows to monitor the background task and get the result through a dedicated file handle id that points to the zip file that can be used to download the bulk of files


The main aspect that we want to capture is that in general any relevant file download should include the triplet associateObjectType, associateObjectId and fileHandleId (See the FileHandleAssociation) that allows us to understand the source of the file download in the context of a request.

In the following we provide some basic download statistics computed for the last 30 days worth of data (as of 06/26/2019) as reported by the data warehouse:

"direct" downloads statistics
Included APIsAverage Daily DownloadsMax Daily Downloads
  • /entity/{id}/file
  • /entity/{id}/version/{version}/file
  • /file/{d}
  • /filehandle/{id}/url
2324
4030
  • /entity/{id}/file
  • /entity/{id}/version/{version}/file
67245
"Batch and bulk" Downloads
Association TypeAverage Daily DownloadsMax Daily DownloadsAverage daily usersMax daily users
All
33175
167002138227
FileEntity
12100
16426958100

Proposed API Design


The current API design uses a polymorphic approach to access certain kind of objects, in particular related to Projects, Folders and Files (See Entities, Files, and Folders Oh My!), exposing a single prefixed endpoint for CRUD operations (/entity). This has the effect that a GET request to the /entity/{id} endpoint for a specific entity might return a different response type (See the Entity Response Object). We would need a way to gather statistics about two particular objects (for now): projects and files.


At the same time statistics for different kind of entities might return a different kind of response (in terms of semantics): for example the number of downloads for a project have a different semantic than the number of downloads for a file entity. Nevertheless the response will most likely be similar and can be grouped under a generic statistics object.


In general computing statistics might be an expensive operation, moreover in the future we might want to extend the API to include several different types of statistics. According to the use cases and the requirements we can potentially pre-compute statistics aggregates so that we can serve them relatively quickly, nevertheless we cannot guarantee that all the statistics will return within a reasonable response time. 


In general the statistics for an entity might include more information aside from the number of downloads and we should keep in mind that we might want to extend it in order to collect more data (such as the number of files, users, teams etc) but we can start with including the download statistics only in a first phase.


To this end we propose an API that integrates into the current /asynchronous/job API (See Asynchronous Job API) so that a client can send a request to compute certain statistics and wait for the final response polling the API for the result. This allows to offload the computation to a background worker keeping the API servers free of heavy computation. The two main objects that the statistics API will need to extend are: AsynchronousRequestBody and AsynchronousResponseBody that represent respectively the request for a background job and the response from the job (returned as part of the AsynchronousJobStatus).


Note: In the actual implementation we might have a generic SynapseStatistics interface used as a marker of any statistics request, for brevity in the following we simply refer to the EntityStatistics objects.

Request Objects


In the following we provide a list of object representations for requests of entity statistics:

<<abstract>> EntityStatisticsRequest <extends> AsynchronousRequestBody

Common parent object for a job request to compute statistics of an entity. This object is an AsynchronousResponseBody that will be sent to the /asynchronous/job endpoint in order to ask for the computation of a certain statistic.

PropertyRequiredTypeDescription
entityIdYesSTRINGThe (synapse) id of the entity for which the statistics should be computed for. For the time being we will restrict it to Project and File entities.
versionNoSTRINGThe version of the entity (when of type FileEntity) for which the statistics should be computed for. If omitted gather statistics for the file entity as a whole that includes all the versions.

DownloadStatisticsRequest <extends> EntityStatisticsRequest

The actual job request that extends the EntityStatisticsRequest to get the download statistics of an entity, this will be used to get download and users counts only.

PropertyRequiredTypeDefault ValueDescription
fromTimestampNoDate
A timestamp used to limit the considered time frame from the given timestamp
toTimestampNoDate
A timestamp used to limit the considered time frame until the given timestamp

DownloadBucketsStatisticsRequest <extends> DownloadStatisticsRequest

Extends the DownloadStatisticsRequest in order to compute the download counts (along with the unique users count) bucketed into a fixed amount of time, e.g. grouping by a given time frame, which allows to build an histogram and have the download trend over the specified time frame.

PropertyRequiredTypeDefault ValueDescription
aggregationTypeYesSTRING'daily'Specifies the type of aggregation in order to group the count of downloads, allowed values are: 'daily', 'weekly', 'monthly', 'yearly'.
limitNoInteger30Used for pagination, limits the number of buckets to the given value. This value has a max allowed value of 100.
offsetNoInteger0Used for pagination, specifies the offset index for the page of buckets. This value is a function of the given limit: the first page has an offset of 0, the second page offset is the limit value itself (e.g. 30 by default).

Response Objects


In the following we provide a list of object representation for the statistics API:

<<abstract>> EntityStatisticsResponse <extends> AsynchronousResponseBody

Common parent object for the response regarding a request for statistics on an entity. This object is an AsynchronousResponseBody that will be included as part of the AsynchronousJobStatus when requesting the status of the job.

PropertyTypeDescription
lastUpdateTimestamp
DateA timestamp representing the freshness of the statistic (last update date time of the requested statistic)

DownloadStatisticsResponse <extends> EntityStatisticsResponse

Represents the response for a job request for computing the download statistics for an entity (In response to a DownloadStatisticsRequest).

PropertyTypeDescription
totalCountINTEGERThe total number of downloads
usersCountINTEGERThe total number of unique users that performed a download

DownloadBucketsStatisticsResponse <extends> DownloadStatisticsResponse

Represents the response for a job request for computing the bucketed download statistics for an entity (in response to a DownloadBucketsStatisticsRequest). Note that the response extends the DownloadStatisticsResponse so that it includes the total counts.

PropertyTypeDescription
downloadCountBucketsARRAY<StatisticCountBucket>An array of StatisticCountBucket each containing the download count for one bucket, according to the aggregationType specified in the request
userCountBucketsARRAY<StatisticCountBucket>An array of StatisticCountBucket each containing the unique users count for one bucket, according to the aggregationType specified in the request
limitIntegerWill include the limit value used in the initial request for pagination purposes
offsetIntegerWill include the offset value used in the initial request for pagination purposes

StatisticsCountBucket

The purpose of this object is to include information about the count of a certain metric within a specific time frame:

PropertyTypeDescription
totalINTEGERThe count in the time frame
fromTimestampSTRINGThe starting timestamp of the time frame represented by the bucket
toTimestampSTRINGThe ending timestamp of the time frame represented by the bucket

Endpoints


The API reuses the endpoints from the Asynchronous Job API (We could potentially add a dedicated /statistics endpoint just for clarity).

MethodEndpointRequest BodyResponse BodyDescriptionRestrictions
POST/asynchronous/job
  • DownloadStatisticsRequest
  • DownloadBucketsStatisticsRequest
AsynchronousJobStatus: the responseBody property in the status will contain either the 
  • DownloadStatisticsResponse
    or the 
  • DownloadBucketsStatisticsResponse
Allows to submit a job to gather the download statistics for an entity. The id returned by the request is used in order to get the job status.
  • The entity specified in the request must be a project or a file entity (if not 400)
  • The user should have view permission on the entity specified in the body (if not 403)
  • The DownloadBucketsStatisticsRequest will be used to compute paginated results. The value of the "limit" parameter is restricted to a maximum value of 100.
GET/asynchronous/job/{id}N/AAsynchronousJobStatusAllows to get the current status of the statistics job with the given id.


Additionally we propose to have dedicated statistics endpoints to be consistent with the current API design that accepts the above requests:


MethodEndpointRequest BodyResponse BodyDescription
POST/statistics/entity/download/async/start
  • DownloadStattisticsRequest 
  • DownloadBucketsStatisticsRequest
AsynchronousJobStatusAllows to submit the job to gather the download statistics for an entity. The id returned by the request is used in order to the the job status.
GET/statistics/entity/download/async/get/{asyncToken}N/AAsynchronousJobStatusAllows to get the current status of the statistics job with the given id

Web Client Integration


The web client should have a way to show the statistics for a project or a specific file, some initial ideas:

  • Have a separate tab for statistics in the web client
  • Have a dedicated menu item that leads to a statistics dashboard
  • Might want to have a dedicated wiki widget that reuses the API


  • No labels