Table of Contents

...

  1. Downloads Count
  2. Page Views
  3. Data Breaches/Audit Trail

Downloads Count

This is the main statistic that users are currently asking for: it gives project owners, funders, and data contributors a way to monitor interest over time in the datasets published in a particular project, which in turn reflects interest in the project itself and serves as a metric of the value provided by its data. This statistic relates specifically to usage of the platform by Synapse users, since downloads are not available without authentication. It belongs to a general category of statistics about the entities and metadata stored in the backend, and is only a subset of the aggregate statistics that could be exposed (e.g. number of projects, users, teams, etc.).

Page Views

This metric is also an indicator of interest, but it plays a different role: it focuses on general user activity across the Synapse platform as a whole. While it may be an indicator of a specific project's success, it captures a different aspect that can span the various types of clients used to interface with the Synapse API, and it includes information about users that are not authenticated into Synapse. For this particular aspect there are tools already integrated (e.g. Google Analytics) that collect analytics on user interactions. Note, however, that this information is not currently available to Synapse users, nor is it set up to produce information about specific project pages, files, wikis, etc.

Data Breaches/Audit Trail

Another aspect that came up, and might seem related, is identifying the when/what/why of potential data breaches (e.g. a dataset was released even though it was not supposed to be). This relates to the audit trail of user activity used to identify potential offenders. While this information is crucial, it should not be exposed by the API; a due process is already in place for accessing this kind of data.

Project Statistics

With this brief introduction in mind, this document focuses on the main driving use case:

  • A funder and/or project creator would like to have a way to understand if the project is successful and if its data is used.

There are several metrics that can be used to determine the usage and success of a project, among which:

  • Project Access (e.g. page views)
  • Number of Downloads
  • Number of Uploads
  • User Discussions

...

Files, Downloads and Uploads

Files in Synapse are referenced through an abstraction (FileHandle) that maintains the link to the content of the file itself (e.g. an object in an S3 bucket). A file handle is then referenced in many places (such as FileEntity and WikiPage, see FileHandleAssociateType) as a pointer to the actual file content. In order to download the content, the Synapse platform allows clients to generate a pre-signed URL (according to the location where the file is stored) that can be used to download the file directly. Note that the platform has no way to guarantee that the pre-signed URL is actually used by the client to download the file. Every single pre-signed URL request in the codebase comes down to a single method, getURLForFileHandle, which makes it a natural hook for recording download events (see the sketch below).
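
As a minimal sketch of how this choke point could be instrumented: only the method name getURLForFileHandle comes from the codebase; the signature, parameter types, and collaborators below are hypothetical illustrations, not the actual Synapse code.

Code Block
languagejava
titleDownload Event Hook (sketch)
// Hypothetical sketch: every pre-signed URL request funnels through this
// single method, so a download event could be emitted here.
public String getURLForFileHandle(UserInfo user, FileHandleAssociation association) {
	// Resolve the file handle and generate the pre-signed URL as today
	FileHandle handle = fileHandleDao.get(association.getFileHandleId());
	String presignedUrl = storageLocationService.createPresignedUrl(handle);
	// Record the (presumed) download; note that the platform cannot verify
	// that the returned URL is actually used by the client
	statisticsEventCollector.onDownload(user.getId(), association);
	return presignedUrl;
}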

...

We propose to introduce a dedicated /statistics endpoint that will serve as the main entry point for statistics requests. The project statistics are nested within this endpoint:

Endpoint: /statistics/project/{projectSynId}
Method: GET
Description: Gets the statistics for the given project
Response Type: ProjectStatistics
Restrictions:
  • The project with the given id must exist: NotFoundException (404)
  • The user (directly or through one of their groups) must have the VIEW_STATISTICS ACCESS_TYPE on the project with the given id: UnauthorizedException (403)

The endpoint accepts the following optional URL parameters:

Parameter Name | Type | Default Value | Description
downloads | Boolean | true | If set to false, excludes the download statistics from the response
uploads | Boolean | true | If set to false, excludes the upload statistics from the response


Code Block
languagejs
titleExample
GET /statistics/project/syn123?downloads=true&uploads=false

...

Code Block
languagejs
titleExample
{
	"lastUpdatedOn": "2019-06-26T01:01:00.000Z",
	"downloads": {
		"lastUpdatedOn": "2019-06-26T01:01:00.000Z",
		"monthly": [{
			"startDate": "2019-06-01T00:00:00.000Z",
			"count": 1230,
			"usersCount": 10
		},
		{
			"startDate": "2019-05-01T00:00:00.000Z",
			"count": 10000,
			"usersCount": 100
		}]
	},
	"uploads": {
		"lastUpdatedOn": "2019-06-26T01:01:00.000Z",
		"monthly": [{
			"startDate": "2019-06-01T00:00:00.000Z",
			"count": 51200,
			"usersCount": 200
		},
		{
			"startDate": "2019-05-01T00:00:00.000Z",
			"count": 10000,
			"usersCount": 100
		}]
	}
}

Proposed Backend Architecture


In order to serve the statistics from the Synapse API we need a way to access them efficiently, without putting heavy load on the web instances of the API.

To this end we propose an architecture that leverages various AWS services to collect relevant events from the Synapse API and service calls, store them for long-term analytics, and create aggregates that can be queried efficiently from the Synapse services.

In the following we provide a high-level architecture of the components involved:

[High-level architecture diagram]

In particular, the following key components are integrated into the system:

  • AWS Kinesis Firehose: collects event records from the Synapse API, converts the records to a columnar format such as Apache Parquet, and stores the stream to a destination S3 bucket
  • AWS Glue: used to build the catalog of tables used both by Kinesis Firehose for the record conversion and by Athena to efficiently query the data stored in S3
  • AWS Athena: used to query the data produced by Kinesis Firehose; the data will be stored in the Apache Parquet format thanks to the Kinesis Firehose automatic conversion

The idea is to send the events we are interested in (e.g. file upload and file download) as JSON records to Kinesis Firehose, which converts them to the columnar Apache Parquet format (the table schema is created and managed in AWS Glue) and stores them in an S3 bucket.

For the first phase we collect download and upload statistics; for each type of event we will have a separate stream. An example of the JSON object sent to the Kinesis stream for a download event:


Code Block
titleDownload Record Example
{
    "timestamp": "1562626674712",
    "stack": "dev",
    "instance": 123,
    "projectId": 456,
    "userId": 5432,
    "associationType": "FileEntity",
    "associationId": 12312
    "fileId": 6789
}

The record is sent to the appropriate Kinesis stream (e.g. fileDownloadsStream or fileUploadStream), converted to Apache Parquet, and finally stored in S3 by Firehose.
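
As a sketch, sending such a record from the backend could look like the following, using the AWS SDK for Java v2 Firehose client; the delivery stream name and the pre-serialized JSON payload are assumptions taken from the example above:

Code Block
languagejava
titleSending a Download Record (sketch)
import software.amazon.awssdk.core.SdkBytes;
import software.amazon.awssdk.services.firehose.FirehoseClient;
import software.amazon.awssdk.services.firehose.model.PutRecordRequest;
import software.amazon.awssdk.services.firehose.model.Record;

public class DownloadRecordSender {

	private final FirehoseClient firehoseClient = FirehoseClient.create();

	// json is the serialized download record shown above
	public void sendDownloadRecord(String json) {
		PutRecordRequest request = PutRecordRequest.builder()
				// Hypothetical delivery stream name from the example above
				.deliveryStreamName("fileDownloadsStream")
				.record(Record.builder()
						.data(SdkBytes.fromUtf8String(json))
						.build())
				.build();
		firehoseClient.putRecord(request);
	}
}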

Once the data is in S3 it can be queried with AWS Athena using standard SQL. For the JSON schema above we can, for example, run a SQL query grouping by projectId (and filtering by timestamp).

Athena can be accessed directly in Java using the Amazon SDK. This allows us to implement Synapse background workers that periodically query the data through Athena in order to compute and store manageable aggregates that can be queried directly from the Synapse services.
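
A sketch of what such a worker query might look like with the AWS SDK for Java v2 Athena client; the table and database names, the output location, and the aggregate SQL itself are assumptions for illustration, not a definitive implementation:

Code Block
languagejava
titleMonthly Aggregate via Athena (sketch)
import software.amazon.awssdk.services.athena.AthenaClient;
import software.amazon.awssdk.services.athena.model.QueryExecutionContext;
import software.amazon.awssdk.services.athena.model.ResultConfiguration;
import software.amazon.awssdk.services.athena.model.StartQueryExecutionRequest;

public class MonthlyDownloadAggregator {

	// Hypothetical aggregate over one month of download records, counting
	// downloads and distinct downloading users per project
	private static final String MONTHLY_QUERY =
			"SELECT projectId, COUNT(*) AS download_count, "
			+ "COUNT(DISTINCT userId) AS download_users_count "
			+ "FROM file_downloads "
			+ "WHERE year = '2019' AND month = '06' "
			+ "GROUP BY projectId";

	private final AthenaClient athenaClient = AthenaClient.create();

	public String startMonthlyAggregation() {
		StartQueryExecutionRequest request = StartQueryExecutionRequest.builder()
				.queryString(MONTHLY_QUERY)
				// Hypothetical Glue database and result location
				.queryExecutionContext(QueryExecutionContext.builder()
						.database("synapse_statistics")
						.build())
				.resultConfiguration(ResultConfiguration.builder()
						.outputLocation("s3://dev.statistics.bucket/athena-results/")
						.build())
				.build();
		// Athena queries run asynchronously; the worker would poll for
		// completion (e.g. via GetQueryExecution) before reading the results
		return athenaClient.startQueryExecution(request).queryExecutionId();
	}
}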

In a first phase we only provide monthly aggregates per project to the client; to this end we can simply store the monthly aggregates in RDS using dedicated statistics tables. If we later need to store more fine-grained statistics (e.g. daily, or at the file level) we can move to a more scalable solution (e.g. DynamoDB seems a good fit).

In order to compute the monthly aggregates we want to make sure that the workers only query the current month (unless the last successful run happened before the current month). For each type of statistic we store the last successful run in a dedicated table, statistics_last_update, that simply stores the statistic type (e.g. statistics_project_monthly_download and statistics_project_monthly_upload) and its last update time.
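
Using the same notation as the statistics_project_monthly table below, a possible layout for this table (the column names are our assumption):

Code Block
titleLast Update Table (sketch)
statistics_last_update <statistic_type, last_updated_on>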

The workers will query this table to get the months for which the aggregates need to be computed (note that each worker run will need to scan all the data for each considered month, since we need estimates of the number of unique users that downloaded/uploaded files on a monthly basis).
The aggregate for each project is then stored in a dedicated table, statistics_project_monthly, with the following columns:


Code Block
titleProject Monthly Statistics Table
statistics_project_monthly <project_id, year, month, download_count, download_users_count, upload_count, upload_users_count>

Note that when the aggregate query is run from a worker, only the projects for which at least one download or upload was recorded will be stored in the table; if no downloads or uploads were performed on a particular project (i.e. no record is found) the API can simply return a 0 count.

Note also that we can implement one worker per type of statistics aggregate, e.g. one that queries the downloads and one that queries the uploads, with both storing their results in the same table for now.

Finally, note that since we query the data stored in S3 month by month, we can partition the data so that Athena scans only the needed records (see https://docs.aws.amazon.com/athena/latest/ug/partitions.html). We can initially create partitions day by day so that we are not forced to load a whole month of data for every query we run through Athena; this will produce slightly slower queries for the worker, as multiple partitions will need to be loaded, but it gives us more flexibility.
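
As an illustration, a daily Hive-style partitioning of the Firehose output might produce keys like the following (the bucket name and prefix are hypothetical); Athena partition pruning would then limit a monthly aggregate to the matching year/month prefixes:

Code Block
titlePartitioned S3 Layout (example)
s3://dev.statistics.records/fileDownloads/year=2019/month=06/day=25/records-0001.parquet
s3://dev.statistics.records/fileDownloads/year=2019/month=06/day=26/records-0001.parquet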

Web Client Integration

The web client should have a way to show the statistics for a project or a specific file. Some initial ideas:

...