Warning
WIP: Current design subject to changes

The epic SWC-4488 groups the issues submitted for a new feature request: exposing some basic statistics about projects through the Synapse API. The relevant tickets in the epic are:

SWC-4488, PLFM-4435, SWC-2329, WW-25, SWC-4588, WW-72


Introduction

The platform clearly needs to provide some statistics about the usage of data within Synapse: discussions about which aspects and statistics to gather and expose have been going on for several years.

Synapse project owners and administrators are currently interested in different statistics, and there are different levels of understanding of what should be exposed by the API. In particular, there are three main points of discussion that are related but need clarification:

Downloads Count
Page Views
Data Breaches/Audit Trail

Downloads Count

This is the main statistic that users are currently looking for. It gives project owners, funders and data contributors a way to monitor interest over time in the datasets published in a particular project, which in turn reflects interest in the project itself and is a metric of the value provided by the data in the project. This kind of data relates specifically to the usage of the platform by Synapse users, since downloads are not available without authentication. It is part of a generic category of statistics about the entities and metadata stored in the backend, and it is only a subset of the aggregate statistics that could be exposed (e.g. number of projects, users, teams etc.).

Page Views

This metric is also an indicator of interest, but it plays a different role and focuses on general user activity over the Synapse platform as a whole. While it might indicate the success of a specific project, it captures a different aspect that may span the different types of clients used to interface with the Synapse API, and it includes information about users that are not authenticated into Synapse. For this aspect there are tools already integrated (e.g. Google Analytics) that collect analytics on user interactions. Note however that this information is not currently available to Synapse users, nor is it set up in a way that produces information about specific project pages, files, wikis etc.

Data Breaches/Audit Trail

Another aspect that came up and might seem related is identifying the when/what/why of potential data breaches (e.g. a dataset was released even though it was not supposed to be). This relates to the audit trail of user activity used to identify potential offenders. While this information is crucial, it should not be exposed by the API, and a due process is in place to access this kind of data.

Project Statistics

With this brief introduction in mind this document focuses on the main driving use case, that is:

A funder and/or project creator would like to have a way to understand if the project is

...

successful and if its data is used.

There are several metrics that can be used in order to determine the usage and success of a project, among which:

Project Access (e.g. page views)
Number of Downloads
Number of Uploads
User Discussions

In the following we provide the use cases and a proposed first design to expose statistics about Synapse projects.

Use Case

The main driving use case for exposing statistics from the platform is formulated as follows:

The funder and/or project owner would like to see if the project is successful

Use Cases

The main driving use cases for the feature are to monitor the overall dataset usage of the project as well as the usage of a single dataset over time.

We provide a list of use cases that were discussed with (internal) users of the Synapse platform and drawn from documents with some relevant use cases (see https://docs.google.com/document/d/1t3OL3TnORZwpHOAgLuUt8NNUTLGjDuGj1UUrpggj54k/edit and https://docs.google.com/document/d/1H1ElqL4mWZ4jPI7sXiQMDT4m9fuCiH4OfFGznImNseY/edit#heading=h.e6vt20kwuui6). We refer to the generic user actor and provide a list of the main actors each use case is potentially intended for: the Project Owner, a Funder, a generic (registered) Synapse User and a (Data) Contributor.

...

Use cases 5 and 11 are currently removed as they were not actual requests.
Use case 6 is a nice to have, we might want to skip this in the first implementation.
Use case 12 is a nice to have, not a priority.

Note
Use cases for the count for folders, table, views etc are not included as they were not specifically mentioned by the users

Considerations

In general the idea is to collect data on the number of file downloads and the number of unique users that downloaded files within a project, providing some minimal filtering options. Additional filtering/data cleaning options that we might need to consider include annotations and team exclusion (e.g. to reduce noise, I want to exclude from the statistics the users that are part of some administrative Sage team; this is tricky as Sage team users might actually work with the data). This might be part of a cleaning pipeline when we gather statistics, and we might keep the filtering only at the level of the time frame.

In the following we list some considerations about the use cases that we received:

Aside from the count of downloads there is also the need to aggregate this data in a way that captures the trend of file downloads over time. For example the user is interested in seeing the number of downloads within a project or for a specific file grouped daily, weekly, monthly or yearly. There was no specific request for averages but we could think of including them.
Note that the information about the number of downloads might also appear for the file entities in a view, while no specific request was made for files (handles) linked in a table. In general the use cases focus on statistics about downloads of FileEntity.
We might consider an additional dedicated permission that could be used to allow access to statistics (e.g. limiting its access by default). On the other hand, the use case for this is not defined, and it should not be a problem to use the (computed) READ permission on the entity for statistics that do not contain identifiable data and expose aggregations only. We can start simple and use the canRead permission computed for the entity (See Entity Permissions) and, if needed in the future, add a new type of permission specific to statistics (e.g. a VIEW_STATISTICS ACCESS_TYPE).
Note that if a file is not accessible to the user, its statistics should not be accessible either; if a folder is not accessible to the user, the statistics about the files in the folder should not be accessible (unless the files have local permissions that allow access to them).
Also note that the value of the statistics about a parent element (e.g. the project) should remain consistent and include the counts for the child objects no matter the permissions of the child objects (in other words, if I change the permissions of a child entity, the aggregate information about the parent should not change).
File entities are versioned: we would need a way to gather statistics for a specific version of the entity as well as the total across all the versions.
File entities might have previews: we should avoid including the downloads of the previews in the count.
Note from John: Is the team filtering really needed? It appears so from the doc https://docs.google.com/document/d/1t3OL3TnORZwpHOAgLuUt8NNUTLGjDuGj1UUrpggj54k/edit, but the actual statistics reported for downloads do not seem to filter by team (See screenshot below)
Filtering the information might be contextual: for example, filtering by a specific team might mean filtering by the current team members or by the team members at the time of the download. It seems that the users are oriented towards the first option, where a team filter might be used to limit the statistics to a limited set of users. This poses an interesting question: should we limit the team filtering to teams with at least X users (e.g. to avoid people misusing the statistics to identify specific user behavior)? Also, adding the option to filter the statistics by team adds a great deal of complexity in computing and storing the statistics.
Use Case 12 is from Andrew L. for the dream challenges, is not a strong use case but a nice to have.
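
The time-based grouping discussed in the considerations above can be illustrated with a small sketch. It is not part of the proposed design: the function name, the input shape (pairs of epoch milliseconds and user id) and the output keys are assumptions made for illustration, and weekly grouping is left out for brevity.

```python
from collections import defaultdict
from datetime import datetime, timezone


def bucket_downloads(events, granularity="monthly"):
    """Group (timestamp_ms, user_id) download events into time buckets.

    Hypothetical helper: returns a dict keyed by the bucket start date (UTC),
    with the download count and the unique users count for each bucket.
    """
    # strftime patterns that truncate a timestamp to the start of its bucket
    formats = {"daily": "%Y-%m-%d", "monthly": "%Y-%m-01", "yearly": "%Y-01-01"}
    fmt = formats[granularity]
    counts = defaultdict(int)
    users = defaultdict(set)
    for timestamp_ms, user_id in events:
        moment = datetime.fromtimestamp(timestamp_ms / 1000, tz=timezone.utc)
        key = moment.strftime(fmt)
        counts[key] += 1
        users[key].add(user_id)
    return {
        key: {"count": counts[key], "usersCount": len(users[key])}
        for key in sorted(counts)
    }
```

Counting unique users per bucket requires keeping the set of user ids for that bucket, which is one reason why adding further filters (such as teams) makes storing pre-computed aggregates more complex.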

Current Situation

Currently the statistics about the file downloads are collected in various ways mostly through the DW, using /wiki/spaces/DW/pages/796819457 of the data collected to run ad-hoc queries for aggregations as well as through dedicated clients (See Synapse Usage Report).

The current solution poses some challenges:

The process is relatively lengthy and requires non-trivial technical skills
The data is limited to a 6-month window; the current workaround is to store incremental updates on an external source (CSV file on S3)
The data cannot be easily integrated in the Synapse portal or other components (in some cases the files are manually annotated with the number of downloads)
The system has an all-or-nothing policy for accessing the data, which is (for good reason) only accessible to a specific subset of Synapse employees. This does not allow users of the Synapse platform to access this kind of data without asking a Synapse engineer

An example of a usage report generated with the Synapse Usage Report, written by Kenny Daily, using the data warehouse:

[Image: example usage report]

Files and Downloads

Files in Synapse are referenced through an abstraction (FileHandle) that maintains the information about the link to the content of the file itself (e.g. an S3 bucket). A file handle is then referenced in many places (such as FileEntity and WikiPage, see FileHandleAssociateType) as a pointer to the actual file content. To actually download the content, the Synapse platform allows generating a pre-signed URL (according to the location where the file is stored) that can be used to download the file directly. Note that the platform has no way to guarantee that the pre-signed URL is actually used by the client to download a file. Every pre-signed URL request in the codebase comes down to a single method, getURLForFileHandle.

In the context of project and project file downloads we are interested in particular in entities of type FileEntity and the download requests made for the associated file handles (which do not include the preview handle), specifically in the context of a particular project or file entity.

There are also different ways, exposed through the API, to get pre-signed URLs for file handles. In the context of project and file downloads we are interested in specific operations: several API endpoints are used to request a pre-signed URL to download the content of entities of type FileEntity (for a complete list of endpoints that reference file handles see Batch FileHandle and File URL Services):

...

/entity/{id}/file /entity/{id}/version/{versionNumber}/file

...

Allows getting a batch of pre-signed URLs for a set of file handles. The request (BatchFileRequest) specifically requires the id of the object that references the file handle id and the type of that object (See FileHandleAssociation and FileHandleAssociateType)

...

Allows getting the pre-signed URL of the file handle with the given id. The request parameters must contain the FileHandleAssociateType and the id of the object that references the file. Similarly to the deprecated /entity/{id}/file, it will return a 307 with the redirect to the pre-signed URL

...

The main aspect that we want to capture is that any relevant file download should include the triplet associateObjectType, associateObjectId and fileHandleId (See the FileHandleAssociation), which allows us to understand the source of the file download in the context of a request.

In the following we provide some basic download statistics computed for the last 30 days worth of data (as of 06/26/2019) as reported by the data warehouse:

"direct" downloads statistics

...

/entity/{id}/file
/entity/{id}/version/{version}/file
/file/{id}
/filehandle/{id}/url

...

2324

...

/entity/{id}/file
/entity/{id}/version/{version}/file

...

"Batch and bulk" Downloads

...

33175

...

12100

...

Proposed API Design

The current API design uses a polymorphic approach to access certain kinds of objects, in particular Projects, Folders and Files (See Entities, Files, and Folders Oh My!), exposing a single prefixed endpoint for CRUD operations (/entity). As a result, a GET request to the /entity/{id} endpoint might return a different response type depending on the entity (See the Entity Response Object). We would need a way to gather statistics about two particular objects (for now): projects and files.

At the same time, statistics for different kinds of entities might return different kinds of responses (in terms of semantics): for example, the number of downloads for a project has a different meaning than the number of downloads for a file entity. Nevertheless the responses will most likely be similar and can be grouped under a generic statistics object.

In general, computing statistics might be an expensive operation; moreover, in the future we might want to extend the API to include several different types of statistics. Depending on the use cases and the requirements we can potentially pre-compute statistics aggregates so that we can serve them relatively quickly; nevertheless we cannot guarantee that all the statistics will return within a reasonable response time.

...

To this end we propose an API that integrates into the current /asynchronous/job API (See Asynchronous Job API) so that a client can send a request to compute certain statistics and wait for the final response by polling the API for the result. This offloads the computation to a background worker, keeping the API servers free of heavy computation. The two main objects that the statistics API will need to extend are AsynchronousRequestBody and AsynchronousResponseBody, which represent respectively the request for a background job and the response from the job (returned as part of the AsynchronousJobStatus).
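
The start-then-poll interaction described above might look roughly like the following client-side sketch. The two callables, the status dictionary keys and the state names ("COMPLETE", "FAILED") are assumptions for illustration; they stand in for whatever the Asynchronous Job API actually returns and are not taken from this design.

```python
import time


def wait_for_statistics(start_job, get_status, request,
                        poll_seconds=1.0, max_polls=30, sleep=time.sleep):
    """Start a background statistics job and poll until it finishes.

    start_job(request) -> job id and get_status(job_id) -> dict are
    hypothetical callables wrapping the HTTP calls to the job endpoints.
    """
    job_id = start_job(request)
    for _ in range(max_polls):
        status = get_status(job_id)
        if status["state"] == "COMPLETE":
            # The response body would be the statistics response object
            return status["responseBody"]
        if status["state"] == "FAILED":
            raise RuntimeError(status.get("errorMessage", "statistics job failed"))
        sleep(poll_seconds)
    raise TimeoutError(f"job {job_id} did not complete after {max_polls} polls")
```

Bounding the number of polls keeps a client from waiting forever on an expensive statistic, which is the main cost of this asynchronous approach compared to a single synchronous call.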

Note: In the actual implementation we might have a generic SynapseStatistics interface used as a marker for any statistics request; for brevity, in the following we simply refer to the EntityStatistics objects.

Request Objects

In the following we provide a list of object representations for requests of entity statistics:

<<abstract>> EntityStatisticsRequest <extends> AsynchronousRequestBody

Common parent object for a job request to compute statistics of an entity. This object is an AsynchronousRequestBody that will be sent to the /asynchronous/job endpoint to ask for the computation of a certain statistic.

...

DownloadStatisticsRequest <extends> EntityStatisticsRequest

The actual job request that extends the EntityStatisticsRequest to get the download statistics of an entity; this will be used to get download and user counts only.

...

DownloadBucketsStatisticsRequest <extends> DownloadStatisticsRequest

Extends the DownloadStatisticsRequest to compute the download counts (along with the unique users count) bucketed into fixed time intervals, e.g. grouping by a given time frame, which allows building a histogram and seeing the download trend over the specified time frame.

...

Response Objects

In the following we provide a list of object representation for the statistics API:

<<abstract>> EntityStatisticsResponse <extends> AsynchronousResponseBody

Common parent object for the response regarding a request for statistics on an entity. This object is an AsynchronousResponseBody that will be included as part of the AsynchronousJobStatus when requesting the status of the job.

...

lastUpdateTimestamp

...

DownloadStatisticsResponse <extends> EntityStatisticsResponse

Represents the response for a job request for computing the download statistics for an entity (In response to a DownloadStatisticsRequest).

...

DownloadBucketsStatisticsResponse <extends> DownloadStatisticsResponse

Represents the response for a job request for computing the bucketed download statistics for an entity (in response to a DownloadBucketsStatisticsRequest). Note that the response extends the DownloadStatisticsResponse so that it includes the total counts.

...

StatisticsCountBucket

The purpose of this object is to include information about the count of a certain metric within a specific time frame:

...

Endpoints

The API reuses the endpoints from the Asynchronous Job API (We could potentially add a dedicated /statistics endpoint just for clarity).

...

DownloadStatisticsRequest
DownloadBucketsStatisticsRequest

...

DownloadStatisticsResponse
or the
DownloadBucketsStatisticsResponse

...

The entity specified in the request must be a project or a file entity (if not 400)
The user should have view permission on the entity specified in the body (if not 403)
The DownloadBucketsStatisticsRequest will be used to compute paginated results. The value of the "limit" parameter is restricted to a maximum value of 100.

...

Additionally we propose to have dedicated statistics endpoints to be consistent with the current API design that accepts the above requests:

...

DownloadStatisticsRequest
DownloadBucketsStatisticsRequest

...

First Phase Proposal

In a first phase we want a minimal set of statistics exposed from the Synapse platform API that fulfills the use case. In particular we identified the following aspects, which allow an initial implementation that can be extended later if needed:

Expose statistics about the number of downloads and the number of uploads
Expose statistics about the number of unique users that downloaded and/or uploaded
Statistics are collected on a per project basis (no file level statistics)
Statistics limited to the past 12 months (without including the current month)
Statistics aggregated monthly (no total downloads or total users)
Statistics access is restricted through a new ACCESS_TYPE VIEW_STATISTICS, initially granted to project owners and administrators
No run-time filtering
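
As an illustration of the "past 12 months, excluding the current month" window listed above, the following sketch computes the start date of each month that would be covered. The function name and the newest-first ordering are assumptions for illustration.

```python
from datetime import date


def last_full_months(today, months=12):
    """Return the first day of each of the `months` full months before
    the month containing `today`, newest first (hypothetical helper)."""
    year, month = today.year, today.month
    starts = []
    for _ in range(months):
        # Step back one month, rolling over the year boundary when needed
        month -= 1
        if month == 0:
            year, month = year - 1, 12
        starts.append(date(year, month, 1))
    return starts
```

For a request made on 2019-06-26 the window would run from June 2018 to May 2019, leaving out June 2019 because that month is still in progress.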

Files, Downloads and Uploads

Files in Synapse are referenced through an abstraction (FileHandle) that maintains the information about the link to the content of the file itself (e.g. an S3 bucket). A file handle is then referenced in many places (such as FileEntity and WikiPage, see FileHandleAssociateType) as a pointer to the actual file content. To actually download the content, the Synapse platform allows generating a pre-signed URL (according to the location where the file is stored) that can be used to download the file directly. Note that the platform has no way to guarantee that the pre-signed URL is actually used by the client to download a file. Every pre-signed URL request in the codebase comes down to a single method, getURLForFileHandle.

In the context of a project we are interested in particular in downloads (and/or uploads) of entities of type FileEntity and TableEntity (See What is a "download" in Synpase?) and the download requests made for the associated file handles.

There are also different ways, exposed through the API, to get pre-signed URLs for file handles. In the context of project and file downloads we are interested in specific operations: several API endpoints are used to request a pre-signed URL to download the content of entities of type FileEntity (for a complete list of endpoints that reference file handles see Batch FileHandle and File URL Services):

Endpoint	Description
`/entity/{id}/file` `/entity/{id}/version/{versionNumber}/file`	Currently deprecated API; returns a 307 with the redirect to the actual pre-signed URL of the file for the given entity
`/fileHandle/batch`	Allows getting a batch of pre-signed URLs for a set of file handles. The request (BatchFileRequest) specifically requires the id of the object that references the file handle id and the type of that object (See FileHandleAssociation and FileHandleAssociateType)
`/file/{id}`	Allows getting the pre-signed URL of the file handle with the given id. The request parameters must contain the FileHandleAssociateType and the id of the object that references the file. Similarly to the deprecated /entity/{id}/file, it will return a 307 with the redirect to the pre-signed URL
`/file/bulk/async/start`	Part of an asynchronous API to start a bulk download of a list of files. The API allows monitoring the background task and getting the result through a dedicated file handle id that points to the zip file containing the requested files

The main aspect that we want to capture is that any relevant file download should include the triplet associateObjectType, associateObjectId and fileHandleId (See the FileHandleAssociation), which allows us to understand the source of the file download in the context of a request.

In the following we provide some basic download statistics computed for the last 30 days worth of data (as of 06/27/2019) as reported by the data warehouse:

"direct" downloads statistics

Included APIs	Average Daily Downloads	Max Daily Downloads
/entity/{id}/file /entity/{id}/version/{version}/file /file/{id} /filehandle/{id}/url	2324	4030
/entity/{id}/file /entity/{id}/version/{version}/file	67	245

"Batch and bulk" Downloads

Association Type	Average Daily Downloads	Max Daily Downloads	Average daily users	Max daily users
All	39998	383539	140	227
FileEntity	19963	367345	59	100
TableEntity	19680	145050	12	21

Proposed API Design

The current API design uses a polymorphic approach to access certain kinds of objects, in particular Projects, Folders and Files (See Entities, Files, and Folders Oh My!), exposing a single prefixed endpoint for CRUD operations (/entity). As a result, a GET request to the /entity/{id} endpoint might return a different response type depending on the entity (See the Entity Response Object). For a first implementation we would need an endpoint to expose statistics about a project.

In general, computing statistics might be an expensive operation; moreover, in the future we might want to extend the API to include several different types of statistics. For the current use case we can pre-compute monthly statistics aggregates for the projects so that we can serve them relatively quickly with a single lookup. We cannot guarantee that every kind of statistics exposed in the future can be pre-computed and returned within a reasonable response time; still, to ease client integration, we propose a dedicated endpoint that returns the statistics for a project within a single synchronous HTTP call.

If in the future we require more complex and/or expensive computation for certain statistics, we can extend the API to integrate with the /asynchronous/job API (See Asynchronous Job API) to offload the computation to a background worker.

Endpoints

We propose to introduce a dedicated /statistics endpoint that will serve as the main entry point for statistics requests. The project statistics are nested within this endpoint:

Endpoint	Method	Description	Response Type	Restrictions
/statistics/project/{projectSynId}	GET	Returns the statistics for the given project	ProjectStatistics	The project with the given id must exist, otherwise NotFoundException (404). The user (directly or through their groups) must have the VIEW_STATISTICS ACCESS_TYPE for the project with the given id, otherwise UnauthorizedException (403)

The endpoint accepts the following optional URL parameters:

Parameter Name	Type	Default Value	Description
downloads	Boolean	true	If set to false, the downloads statistics are excluded from the response
uploads	Boolean	true	If set to false, the uploads statistics are excluded from the response

Code Block

language	js
title	Example

GET /statistics/project/syn123?downloads=true&uploads=false
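
A client could assemble the request path shown above along the following lines. This is only a sketch: the helper name is hypothetical, and it assumes the booleans are sent as the lowercase strings true/false, as in the example request.

```python
from urllib.parse import urlencode


def project_statistics_path(project_id, downloads=True, uploads=True):
    """Build the path and query string for the proposed statistics endpoint.

    Hypothetical helper; both optional parameters default to true,
    matching the defaults listed in the parameter table.
    """
    query = urlencode({
        "downloads": str(downloads).lower(),
        "uploads": str(uploads).lower(),
    })
    return f"/statistics/project/{project_id}?{query}"
```

Calling `project_statistics_path("syn123", uploads=False)` yields the same path as the example request.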

Response Objects

ProjectStatistics

Represents the response for the project statistics request:

Property	Type	Description
lastUpdatedOn	Date	The last (approximate) update date for the project statistics; this value gives an indication of the freshness of the statistics. It might be null, in which case the statistics for the project are not currently available.
downloads	DownloadStatistics	The download statistics for the project specified in the request; included only if the downloads parameter in the request is set to true.
uploads	UploadStatistics	The upload statistics for the project specified in the request; included only if the uploads parameter in the request is set to true.

DownloadStatistics/UploadStatistics

Property	Type	Description
lastUpdatedOn	Date	The last update date for the download/upload statistics; this value gives an indication of the freshness of the statistics. It might be null, in which case the download/upload statistics are not currently available.
monthly	ARRAY<StatisticsCountBucket>	An array containing the monthly download/upload counts for the last 12 months; each bucket aggregates a month's worth of data and includes the unique users count for that month. The number of buckets is limited to 12.

StatisticsCountBucket

The purpose of this object is to include information about the count of a certain metric within a specific time frame (in this case monthly):

Property	Type	Description
startDate	Date	The starting date of the time frame represented by the bucket
count	INTEGER	The download/upload count in the time frame
usersCount	INTEGER	The number of unique users that performed a download/upload in the time frame of the bucket

Code Block

language	js
title	Example

{
	"lastUpdatedOn": "2019-06-26T01:01:00.000Z",
	"downloads": {
		"lastUpdatedOn": "2019-06-26T01:01:00.000Z",
		"monthly": [{
			"startDate": "2019-06-01T00:00:00.000Z",
			"count": 1230,
			"usersCount": 10
		},
		{
			"startDate": "2019-05-01T00:00:00.000Z",
			"count": 10000,
			"usersCount": 100
		}]
	},
	"uploads": {
		"lastUpdatedOn": "2019-06-26T01:01:00.000Z",
		"monthly": [{
			"startDate": "2019-06-01T00:00:00.000Z",
			"count": 51200,
			"usersCount": 200
		},
		{
			"startDate": "2019-05-01T00:00:00.000Z",
			"count": 10000,
			"usersCount": 100
		}]
	}
}
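
Since the first phase exposes monthly buckets only (no totals), a client that wants a total over the returned window would have to sum the buckets itself. A minimal sketch, assuming the response has been parsed into a dictionary shaped like the example above; the helper name is hypothetical.

```python
import json


def total_count(statistics, kind="downloads"):
    """Sum the monthly counts of one section of a ProjectStatistics response.

    Returns None when the section was excluded from the response
    (e.g. the request was sent with downloads=false or uploads=false).
    """
    section = statistics.get(kind)
    if section is None:
        return None
    return sum(bucket["count"] for bucket in section.get("monthly") or [])
```

Note that the unique users counts cannot be summed in the same way: a user active in two different months would be counted twice.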

Proposed Backend Architecture

In order to serve the statistics from the Synapse API we need a way to efficiently access the statistics without heavily loading the web instances of the API.

In the following we provide a high-level architecture of the components involved:

[Image: high-level architecture diagram]

In particular the following key components are integrated into the system:

AWS Kinesis Firehose: Collects event records from the Synapse API, converts the records into a columnar format such as Apache Parquet and stores the stream in a destination S3 bucket
AWS Glue: Used to build the catalog of tables used both by Kinesis Firehose for the record conversion and by Athena to efficiently query the data stored in S3
AWS Athena: Uses the Presto SQL engine and can be used to directly query the data produced by Kinesis Firehose; the data is stored in the Apache Parquet format thanks to the Kinesis Firehose automatic conversion, which reduces both storage and query runtime

Kinesis Firehose

The idea is to use Kinesis Firehose to send the events we are interested in (e.g. file upload and file download) as JSON records. The Kinesis stream funnels the records to Firehose, where they are converted to the columnar Apache Parquet format (the table schema is created and managed in AWS Glue) and stored in an S3 bucket.

For the first phase we collect statistics for downloads and uploads, with a separate stream for each type of event. An example of a JSON object sent to the Kinesis stream for a download event:

Code Block

title	Download Record Example

{
    "timestamp": "1562626674712",
    "stack": "dev",
    "instance": 123,
    "projectId": 456,
    "userId": 5432,
    "associationType": "FileEntity",
    "associationId": 12312,
    "fileHandleId": 6789
}

The JSON record is sent to the appropriate kinesis stream (e.g. fileDownloadsStream or fileUploadStream), converted to Apache Parquet and finally stored in S3 by firehose.
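
A sketch of how such a record could be assembled and handed to a delivery stream is shown below. It assumes a client object exposing the boto3 Firehose `put_record(DeliveryStreamName=..., Record={"Data": ...})` call and direct puts to the delivery stream; the helper names are hypothetical, and the real service would presumably do this from Java rather than Python.

```python
import json
import time


def build_download_record(stack, instance, project_id, user_id,
                          association_type, association_id, file_handle_id,
                          now_ms=None):
    """Build a download event with the same fields as the example record."""
    if now_ms is None:
        now_ms = int(time.time() * 1000)
    return {
        "timestamp": str(now_ms),  # kept as a string, mirroring the example
        "stack": stack,
        "instance": instance,
        "projectId": project_id,
        "userId": user_id,
        "associationType": association_type,
        "associationId": association_id,
        "fileHandleId": file_handle_id,
    }


def send_record(firehose_client, stream_name, record):
    """Serialize the record as one JSON line and put it on the stream."""
    data = (json.dumps(record) + "\n").encode("utf-8")
    firehose_client.put_record(DeliveryStreamName=stream_name, Record={"Data": data})
```

Passing the client in as a parameter keeps the record-building logic testable without AWS credentials.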

Athena

Once the data is in S3 it can be queried with AWS Athena using standard SQL. For the JSON schema example above we can run an SQL query that, for example, groups by projectId (filtering by timestamp) and counts the records as well as the distinct users (Note: Athena uses Presto as its query engine, which supports approximate aggregations such as approx_distinct).
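
The shape of such a query can be tried locally with SQLite standing in for Athena. This is an assumption-laden sketch: the table name fileDownloads and the numeric timestamp column are hypothetical, and COUNT(DISTINCT ...) is used because SQLite has no approx_distinct; on Athena the approximate aggregation could be swapped in.

```python
import sqlite3

# Count downloads and distinct users per project within a time range
QUERY = """
SELECT projectId, COUNT(*) AS downloads, COUNT(DISTINCT userId) AS users
FROM fileDownloads
WHERE timestamp >= ? AND timestamp < ?
GROUP BY projectId
ORDER BY projectId
"""


def project_counts(rows, start_ms, end_ms):
    """Run the aggregation over (timestamp, projectId, userId) rows
    using an in-memory SQLite database as a stand-in for Athena."""
    conn = sqlite3.connect(":memory:")
    try:
        conn.execute(
            "CREATE TABLE fileDownloads (timestamp INTEGER, projectId INTEGER, userId INTEGER)"
        )
        conn.executemany("INSERT INTO fileDownloads VALUES (?, ?, ?)", rows)
        return conn.execute(QUERY, (start_ms, end_ms)).fetchall()
    finally:
        conn.close()
```

Each returned tuple is (projectId, downloads, users), which maps directly onto the count and usersCount fields of a monthly bucket.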

Athena can be accessed using the Amazon SDK directly in Java. This allows us to implement synapse background workers that periodically query the data using Athena in order to compute and store manageable aggregates that can be queried directly from the synapse services.

Given that we will query the data stored in S3 month by month, we can partition the data directly so that Athena will scan only the needed records (See https://docs.aws.amazon.com/athena/latest/ug/partitions.html). We can define the partitions directly in the S3 key layout created by firehose and leverage Athena partitioning (through Hive, see https://aws.amazon.com/blogs/big-data/top-10-performance-tuning-tips-for-amazon-athena). For example, using s3://prod.log.sagebase.org/fileDownloads/year=!{timestamp:yyyy}/month=!{timestamp:MM}/day=!{timestamp:dd}/ as the prefix when defining the firehose stream.
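The following sketch shows how a record timestamp would map to a year/month/day partition under this kind of layout. The helper names are hypothetical and the sketch assumes the partition values are derived from the timestamp in UTC; in the real setup the prefix is expanded by firehose itself.

```python
from datetime import datetime, timezone


def partition_prefix(timestamp_ms, base="fileDownloads"):
    """Return the year/month/day key prefix for an epoch-millisecond timestamp (UTC assumed)."""
    ts = datetime.fromtimestamp(timestamp_ms / 1000, tz=timezone.utc)
    return f"{base}/year={ts:%Y}/month={ts:%m}/day={ts:%d}/"


def month_prefix(year, month, base="fileDownloads"):
    """Return the prefix covering a whole month, i.e. what a monthly query needs to scan."""
    return f"{base}/year={year:04d}/month={month:02d}/"
```

For the timestamp in the download record example (1562626674712), partition_prefix returns fileDownloads/year=2019/month=07/day=08/, so a query restricted to year 2019 and month 07 only touches the keys under month_prefix(2019, 7).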

Statistics Tables

In the first phase we only provide monthly aggregates per project to the client. To this end we can store the monthly aggregates in RDS using dedicated statistics tables. If we need to store more fine-grained statistics (e.g. daily, or at the file level) we can move later to a more scalable solution (e.g. DynamoDB seems a good fit). In the following we provide an initial guide to the tables that might be needed (not final, and most likely to change during the implementation).

In order to compute the monthly aggregates we want to make sure that the workers only query a month's worth of data at a time. In particular, since we only expose aggregates for past months, we do not need to gather the statistics for the current month, and once the statistics for a given month are stored we do not have to recompute them (unless we specifically want to). For each type of statistics we store the timestamp of the last successful run for a given year/month in a dedicated table statistics_monthly_status:

Image Added

The table tracks, for a given type of statistics and a specific month, the last time a worker finished successfully and its status. The type may be project_statistics_monthly_downloads or project_statistics_monthly_uploads; the status may be completed, in_progress or failed (note that we probably need more columns to store the failure reason, failure date etc). If the status is in_progress the worker might skip the turn.

A dedicated worker can periodically query this table in order to get the months for which the aggregates still need to be computed and submit a message to a dedicated queue that will be polled and processed by dedicated message-driven workers. (Note that each time a worker runs, the query will need to scan all the data for the considered month, as we need to get estimates of the number of unique users that downloaded/uploaded files on a monthly basis).

The monthly aggregates for each project are then stored in a dedicated table statistics_project_monthly:

Image Added

The table contains, for each project, year and month, the count of downloads and uploads, the number of unique users that downloaded and uploaded files, as well as the last time the statistic for that particular month was updated (separately for downloads and uploads). This table can be queried directly, in read-only mode, by the service layer in the synapse API.

Statistics Workers

We will initially need a few different workers that will aggregate the statistics periodically using the Athena SDK:

MonthlyProjectStatisticsWorker: Runs periodically to check the statistics_monthly_status table for the status of the past 12 months. If a month is missing or failed, it pushes a message to a dedicated queue that will be picked up by the following workers.
ProjectDownloadStatisticsAggregator: Processes the messages pushed by the MonthlyProjectStatisticsWorker to (re)compute the statistics for a given month, runs the SQL query on the S3 data for the downloads stream using Athena and updates the statistics_project_monthly table above.
ProjectUploadStatisticsAggregator: Similar to the ProjectDownloadStatisticsAggregator, performs the aggregation for uploads, updating the statistics_project_monthly table.

The workers will potentially need to split the query results into smaller batches, running multiple transactions to save the data.
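The check performed by the MonthlyProjectStatisticsWorker could look roughly like the sketch below. It assumes the relevant rows of statistics_monthly_status have already been loaded into a dictionary keyed by (year, month); the function names and that in-memory representation are hypothetical.

```python
from datetime import date


def past_months(today, count=12):
    """Return (year, month) pairs for the `count` months before the current one, newest first."""
    year, month = today.year, today.month
    months = []
    for _ in range(count):
        month -= 1
        if month == 0:
            # Roll over to December of the previous year
            year, month = year - 1, 12
        months.append((year, month))
    return months


def months_to_process(status_rows, today, count=12):
    """Return the past months that are missing from the status table or marked as failed.

    status_rows maps (year, month) to "completed", "in_progress" or "failed".
    Months that are in_progress are skipped, as another worker is handling them.
    """
    pending = []
    for year_month in past_months(today, count):
        status = status_rows.get(year_month)
        if status is None or status == "failed":
            pending.append(year_month)
    return pending
```

Each returned (year, month) pair would correspond to one message pushed to the queue for the aggregator workers. The current month is never included, in line with exposing only past months.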

This initial setup will allow us to serve the statistics for a given project, but there might be some issues:

Zero downloads/uploads problem: To avoid storing unnecessary data, we can skip the record for a project when a given month had no uploads or downloads, and at the application level simply return a 0 count if the record is not in the table for that month. This poses a problem since we cannot distinguish a zero count from a "not yet computed" case. We can work around this using the statistics_monthly_status table: if a project has no record for a given month but we know that the month was processed, we know that there were no downloads or uploads.
Last update time: The last update time for a given statistic (download and/or upload) is valid only if the project has had at least one download or one upload in the last month. We can use the global last_update for the given statistic from the statistics_monthly_status table instead, but it is still an approximation: updating the statistics table might take some time and potentially needs multiple transactions, so even though the statistics for a given project can be retrieved, the last_update might not reflect the fact that the worker is still in progress.
Delays in the stream: There will most likely be a slight delay between the time a record is sent to kinesis and the time it is stored in S3. We should take this delay into account and make sure we read ahead when we run the query for the month preceding the current one if we are within a given timeframe (e.g. at the beginning of the month we can aggregate the data loading a short window of data beyond the month being aggregated).
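The workaround for the zero downloads/uploads problem can be sketched as follows. The lookup structures stand in for the statistics_project_monthly and statistics_monthly_status tables, and the function name and dictionary layout are assumptions made for illustration.

```python
def monthly_download_count(project_id, year, month, project_rows, status_rows):
    """Resolve the download count for a project and month.

    project_rows maps (project_id, year, month) to the stored download count.
    status_rows maps (year, month) to the status of the downloads statistics run.
    Returns the stored count, 0 if the month was processed but no record exists,
    or None if the month has not been computed yet.
    """
    count = project_rows.get((project_id, year, month))
    if count is not None:
        return count
    if status_rows.get((year, month)) == "completed":
        # The month was processed and the project has no record: no downloads
        return 0
    # Missing, failed or in progress: the value is unknown
    return None
```

Returning None rather than 0 for unprocessed months lets the service layer report an "unknown" value to the client instead of a misleading zero.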

Additionally we might want to store the initial date when we started collecting statistics, this would allow us to truncate the months past that date so that we can report an "unknown" status to the client.

Web Client Integration

The web client should have a way to show the statistics for a project or a specific file. Some initial ideas:

...

Have a dedicated menu item that leads to a statistics dashboard
Might want to have a dedicated wiki widget that reuses the API (should be visible only to the project owner/administrator)
