...
Synapse project owners and administrators are currently interested in different statistics, and there are different levels of understanding of what the API should expose. In particular, we identified 3 main points of discussion that users focus on; they are related but need clarification:
- Downloads count
- Page views
- Data breaches
Downloads Count
This is the main statistic that users are currently looking for. It gives project owners, funders and data contributors a way to monitor interest over time in the datasets published in a particular project, which in turn reflects interest in the project itself and is a metric of the value provided by the data in the project. This kind of data relates specifically to the usage of the platform by Synapse users, since downloads are not available without authentication. It is part of a generic category of statistics that relates to the entities and metadata stored in the backend, and it is only a subset of the aggregate statistics that can be exposed (e.g. number of projects, users, teams, etc.).
...
This metric is also an indicator of interest, but it plays a different role and focuses on general user activity across the Synapse platform as a whole. While it might indicate the success of a specific project, it captures a different aspect that might span the different types of clients used to interface with the Synapse API and that includes information about users who are not authenticated in Synapse. For this aspect there are tools already integrated (e.g. Google Analytics) that collect analytics on user interactions. Note, however, that this information is not currently available to Synapse users, nor is it set up to produce information about specific project pages, files, wikis, etc.
Data Breaches
...
Another aspect that came up, and that might seem related, is identifying the when/what/why of potential data breaches (e.g. a dataset was released even though it was not supposed to be). This relates to the audit trail of user activity, which is used to identify potential offenders. While this information is crucial, it should not be exposed by the API; a due process is in place for accessing this kind of data.
Project Statistics
With this brief introduction in mind, this document focuses on the main driving use case, that is:
- A funder and/or project creator would like to have a way to understand if the project is successful and if its data is used.
There are several metrics that can be used to determine the usage and success of a project, among which:
- Project Access (e.g. page views)
- Number of Downloads
- Number of Uploads
- User Discussions
In the following we provide the use cases and a proposed first design to expose statistics about Synapse projects.
Use Case
The main driving use case for exposing statistics from the platform is formulated as follows:
...
first kind of data that can be collected and exposed by the API. While the number of page views, and in general the data collected about user interaction with the portal (and other clients), is an important aspect to consider, we treat it as a different issue altogether that can be tackled and discussed separately and that might be included in a future release as part of the statistics exposed by the API.
For the first implementation we do not plan to include page views or the analysis of anonymous user interactions in the statistics; instead we will focus on (Synapse) user downloads only.
In the following we collect the various use cases that were highlighted, a brief description of the current situation, and a proposed API design to integrate into the Synapse platform to expose the statistics.
Download Statistics
For the first phase we focus on exposing statistics about file downloads, in particular download counts and the number of users that downloaded a file.
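As an illustration of the two figures named above, the following is a minimal sketch that derives a download count and a distinct-user count per file from a list of download records. The `DownloadRecord` shape, the `download_stats` helper and the sample ids are hypothetical and are not part of the Synapse data model.

```python
from collections import defaultdict
from typing import Dict, Iterable, NamedTuple, Set


class DownloadRecord(NamedTuple):
    # Hypothetical record shape used only for this sketch.
    user_id: int
    file_id: str


def download_stats(records: Iterable[DownloadRecord]) -> Dict[str, Dict[str, int]]:
    """Return, per file, the total download count and the number of distinct users."""
    counts: Dict[str, int] = defaultdict(int)
    users: Dict[str, Set[int]] = defaultdict(set)
    for rec in records:
        counts[rec.file_id] += 1
        users[rec.file_id].add(rec.user_id)
    return {
        file_id: {"downloads": counts[file_id], "users": len(users[file_id])}
        for file_id in counts
    }


# Sample data (made up): user 1 downloads the same file twice.
records = [
    DownloadRecord(1, "syn101"),
    DownloadRecord(1, "syn101"),
    DownloadRecord(2, "syn101"),
    DownloadRecord(2, "syn202"),
]
stats = download_stats(records)
```

The two numbers can diverge: repeated downloads by the same user raise the download count but not the user count, which is why both are worth exposing.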
Use Cases
The main driving use case for the feature is to be able to monitor the overall dataset usage of the project as well as the usage of a single dataset over time.
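Monitoring usage over time could, for example, mean bucketing download events by calendar month, once for the whole project and once for a single file. The sketch below assumes monthly buckets in UTC and an event represented as a `(file_id, timestamp)` pair; both choices, and the `monthly_downloads` helper, are assumptions made for illustration.

```python
from collections import Counter
from datetime import datetime, timezone


def monthly_downloads(timestamps):
    """Count downloads per calendar month (UTC), keyed as 'YYYY-MM'."""
    return Counter(
        ts.astimezone(timezone.utc).strftime("%Y-%m") for ts in timestamps
    )


# Sample events (made up): (file id, timezone-aware download time).
events = [
    ("syn101", datetime(2019, 1, 5, tzinfo=timezone.utc)),
    ("syn101", datetime(2019, 1, 20, tzinfo=timezone.utc)),
    ("syn202", datetime(2019, 2, 1, tzinfo=timezone.utc)),
    ("syn101", datetime(2019, 2, 3, tzinfo=timezone.utc)),
]

# Usage of the whole project versus usage of a single file.
project_series = monthly_downloads(ts for _, ts in events)
file_series = monthly_downloads(ts for f, ts in events if f == "syn101")
```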
...
- The process is relatively lengthy and requires non-trivial technical skills
- The data is limited to a 6-month window; the current solution to this problem is to store incremental updates on an external source (CSV file on S3)
- The data cannot be easily integrated into the Synapse portal or other components (in some cases the files are manually annotated with the number of downloads)
- The system has an all-or-nothing policy for accessing the data: it is (for good reason) only accessible to a specific subset of Synapse employees, so users of the Synapse platform cannot access this kind of data without asking a Synapse engineer
An example of a usage report generated from the data warehouse with the Synapse Usage Report, written by Kenny Daily:
Files and Downloads
Files in Synapse are referenced through an abstraction (FileHandle) that maintains the information about the link to the content of the file itself (e.g. an S3 bucket). A file handle is then referenced in many places (such as FileEntity and WikiPage, see FileHandleAssociateType) as a pointer to the actual file content. To download the content, the Synapse platform generates a pre-signed URL (according to the location where the file is stored) that can be used to download the file directly. Note that the platform has no way to guarantee that the client actually uses the pre-signed URL to download the file. Every pre-signed URL request in the codebase comes down to a single method, getURLForFileHandle.
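Because every pre-signed URL request passes through one method, that method is a natural place to record an event. The sketch below shows the idea under stated assumptions: it is Python rather than the platform's own code, `fake_presign` is a stand-in for the real URL generation, and `UrlRequestLog` and `with_request_logging` are hypothetical names. It is not the actual getURLForFileHandle implementation.

```python
import time
from dataclasses import dataclass, field
from typing import Callable, List, Tuple


@dataclass
class UrlRequestLog:
    """In-memory stand-in for whatever store would hold the events (hypothetical)."""
    events: List[Tuple[int, str, str, float]] = field(default_factory=list)

    def record(self, user_id: int, file_handle_id: str, association_type: str) -> None:
        self.events.append((user_id, file_handle_id, association_type, time.time()))


def with_request_logging(generate_url: Callable[[str], str], log: UrlRequestLog):
    """Wrap a URL generator so each successful request is recorded."""
    def wrapper(user_id: int, file_handle_id: str, association_type: str) -> str:
        url = generate_url(file_handle_id)
        # Recorded only after the URL was generated; this counts URL requests,
        # not completed downloads, since the client may never use the URL.
        log.record(user_id, file_handle_id, association_type)
        return url
    return wrapper


def fake_presign(file_handle_id: str) -> str:
    # Placeholder for real pre-signing; returns a non-resolvable example URL.
    return f"https://example.invalid/files/{file_handle_id}?signature=placeholder"


log = UrlRequestLog()
get_url = with_request_logging(fake_presign, log)
url = get_url(42, "fh-1", "FileEntity")
```

Any count derived this way measures pre-signed URL requests, so it is an upper bound on actual downloads rather than an exact figure.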
...
Method | Endpoint | Request Body | Response Body | Description | Restrictions |
---|---|---|---|---|---|
POST | /asynchronous/job | | AsynchronousJobStatus: the responseBody property in the status will contain either the ... | Submits a job to gather the download statistics for an entity. The id returned by the request is used to get the job status. | |
GET | /asynchronous/job/{id} | N/A | AsynchronousJobStatus | Gets the current status of the statistics job with the given id. | |
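A client would use the two endpoints as a submit-then-poll pair. The sketch below runs against an in-memory stand-in rather than a real server. The field names `jobId` and `jobState`, the state values `PROCESSING` and `COMPLETE`, the `entityId` request field and the `FakeJobService` class are assumptions made for illustration; only the two paths and the `responseBody` property come from the table above.

```python
import itertools
import time


class FakeJobService:
    """In-memory stand-in for the two endpoints; a job completes on the second status check."""

    def __init__(self):
        self._ids = itertools.count(1)
        self._polls = {}

    def post(self, path, body):
        if path != "/asynchronous/job":
            raise ValueError(f"unexpected path: {path}")
        job_id = str(next(self._ids))
        self._polls[job_id] = 0
        return {"jobId": job_id, "jobState": "PROCESSING"}

    def get(self, path):
        job_id = path.rsplit("/", 1)[-1]
        self._polls[job_id] += 1
        if self._polls[job_id] < 2:
            return {"jobId": job_id, "jobState": "PROCESSING"}
        # Placeholder body; the real response content is not specified here.
        return {"jobId": job_id, "jobState": "COMPLETE", "responseBody": {"placeholder": True}}


def run_statistics_job(service, request_body, poll_interval=0.0, max_polls=10):
    """Submit a job, then poll its status until it leaves the PROCESSING state."""
    status = service.post("/asynchronous/job", request_body)
    job_id = status["jobId"]
    for _ in range(max_polls):
        status = service.get(f"/asynchronous/job/{job_id}")
        if status["jobState"] != "PROCESSING":
            return status
        time.sleep(poll_interval)
    raise TimeoutError(f"job {job_id} did not finish after {max_polls} polls")


result = run_statistics_job(FakeJobService(), {"entityId": "syn101"})
```

Bounding the number of polls keeps a stuck job from blocking the caller indefinitely; a real client would likely use a non-zero `poll_interval`.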
...