...

  1. Downloads Count
  2. Page Views
  3. Data Breaches/Audit Trail

Downloads Count

This is the main statistic that users are currently looking for. It provides a way for project owners, funders and data contributors to monitor the interest over time in the datasets published in a particular project, which in turn reflects the interest in the project itself and is a metric of the value provided by the data in the project. This kind of data relates specifically to the usage of the platform by Synapse users, since downloads are not available without being authenticated. It is part of a generic category of statistics that relates to the entities and metadata stored in the backend, and it is only a subset of the aggregate statistics that can be exposed (e.g. number of projects, users, teams etc.).

Page Views

This metric is also an indicator of interest, but it plays a different role and focuses on the general user activity over the Synapse platform as a whole. While it might be an indicator of a specific project's success, it captures a different aspect that might span the different types of clients used to interface with the Synapse API, and that includes information about users that are not authenticated into Synapse. For this particular aspect there are tools already integrated (e.g. Google Analytics) that collect analytics on user interactions. Note however that this information is not currently available to Synapse users, nor is it set up in a way to produce information about specific project pages, files, wikis etc.

Data Breaches/Audit Trail

Another aspect that came up and might seem related is identifying the when/what/why of potential data breaches (e.g. a dataset was released even though it was not supposed to be). This relates to the audit trail of user activity, which is used to identify potential offenders. While this information is crucial it should not be exposed by the API, and a due process is in place in order to access this kind of data.

Project Statistics

With this brief introduction in mind this document focuses on the main driving use case, that is:

  • A funder and/or project creator would like to have a way to understand if the project is successful and if its data is used.

There are several metrics that can be used in order to determine the usage and success of a project, among which:

  • Project Access (e.g. page views)
  • Number of Downloads
  • Number of Uploads
  • User Discussions

...

  • Expose statistics about the number of downloads and the number of uploads
  • Expose statistics about the number of unique users that downloaded and/or uploaded
  • Statistics are collected on a per project basis (no file level statistics)
  • Statistics limited to the past 12 months (without including the current month)
  • Statistics aggregated monthly (no total downloads or total users)
  • Statistics access is restricted through a new ACCESS_TYPE VIEW_STATISTICS, initially granted to project owners and administrators
  • No run-time filtering

Files, Downloads and Uploads

Files in Synapse are referenced through an abstraction (FileHandle) that maintains the information about the link to the content of the file itself (e.g. an S3 bucket). A file handle is then referenced in many places (such as FileEntity and WikiPage, see FileHandleAssociateType) as a pointer to the actual file content. In order to actually download the content, the Synapse platform allows generating a pre-signed URL (according to the location where the file is stored) that can be used to directly download the file. Note that the platform has no way to guarantee that the pre-signed URL is actually used by the client to download a file. Every single pre-signed URL request in the codebase comes down to a single method, getURLForFileHandle.

In the context of a project we are interested in particular in downloads (and/or uploads) of entities of type FileEntity and TableEntity (see What is a "download" in Synapse?) and in the download requests made for the associated file handles.

...


Download Record Example

{
    "timestamp": "1562626674712",
    "stack": "dev",
    "instance": 123,
    "projectId": 456,
    "userId": 5432,
    "associationType": "FileEntity",
    "associationId": 12312,
    "fileHandleId": 6789
}


The JSON record is sent to the appropriate Kinesis stream (e.g. fileDownloadsStream or fileUploadStream), converted to Apache Parquet and finally stored in S3 by Firehose.
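As an illustration of the record above, the following is a minimal sketch of how such a record could be assembled and serialized before being handed to the stream. The function names build_download_record and to_stream_line are hypothetical and not part of the Synapse codebase; the timestamp is kept as a string only to mirror the example, and the actual call that puts the record on the stream is deliberately left out.

```python
import json


def build_download_record(timestamp_ms, stack, instance, project_id, user_id,
                          association_type, association_id, file_handle_id):
    """Build a dict with the same fields as the download record example.

    Hypothetical helper: the field names come from the example record,
    everything else is an assumption made for illustration.
    """
    return {
        "timestamp": str(timestamp_ms),  # epoch milliseconds, as a string like in the example
        "stack": stack,
        "instance": instance,
        "projectId": project_id,
        "userId": user_id,
        "associationType": association_type,
        "associationId": association_id,
        "fileHandleId": file_handle_id,
    }


def to_stream_line(record):
    """Serialize a record as a single newline-terminated JSON line."""
    return json.dumps(record, separators=(",", ":")) + "\n"


record = build_download_record(1562626674712, "dev", 123, 456, 5432,
                               "FileEntity", 12312, 6789)
line = to_stream_line(record)
```

Terminating each record with a newline keeps the records separable when several of them are buffered and delivered together.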

...

Given that we will query the data stored in S3 month by month, we can partition the data directly so that Athena will scan only the needed records (see https://docs.aws.amazon.com/athena/latest/ug/partitions.html). We can define the partitions directly in the S3 prefix used by Firehose and leverage Athena partitioning (through Hive, see https://aws.amazon.com/blogs/big-data/top-10-performance-tuning-tips-for-amazon-athena). For example, using s3://prod.log.sagebase.org/fileDownloads/year=!{timestamp:yyyy}/month=!{timestamp:MM}/day=!{timestamp:dd}/ when defining the Firehose stream.
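To make the partition layout concrete, here is a small sketch that computes the Hive-style key prefix a record would fall under, together with the partition predicate a monthly query could use. The helper names and the assumption that the partition columns are strings are ours. Note also that Firehose resolves the timestamp namespace from the record arrival time in UTC rather than from the timestamp field of the record, so the mapping below is only an approximation.

```python
from datetime import datetime, timezone


def partition_prefix(epoch_millis, base="fileDownloads"):
    """Return the year/month/day key prefix for a UTC epoch-millisecond timestamp."""
    ts = datetime.fromtimestamp(epoch_millis / 1000, tz=timezone.utc)
    return f"{base}/year={ts:%Y}/month={ts:%m}/day={ts:%d}/"


def month_predicate(year, month):
    """Return a WHERE fragment restricting a query to one monthly partition.

    Assumes the partition columns are declared as strings named year and month.
    """
    return f"year = '{year:04d}' AND month = '{month:02d}'"


prefix = partition_prefix(1562626674712)
predicate = month_predicate(2019, 7)
```

With this layout a query restricted by the predicate only reads the objects under the matching year and month prefixes.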

Statistics Tables

In a first phase we only provide monthly aggregates per project to the client; to this end we can store the monthly aggregates in RDS using dedicated statistics tables. If we need to store more fine-grained statistics (e.g. daily, or at the file level) we can move at a later moment to a more scalable solution (e.g. DynamoDB seems a good fit). In the following we provide an initial guide to the tables that might be needed (not final, and most likely to change during the implementation).

In order to compute the monthly aggregates we want to make sure that the workers only query a month's worth of data at a time. In particular, since we only expose the aggregates of the past months, we do not need to gather the statistics for the current month, and once the statistics for a given month are stored we do not have to recompute them (unless we specifically want to). For each type of statistics we store the timestamp of the last successful run for a given year/month in a dedicated table statistics_monthly_status:

[Image: schema of the statistics_monthly_status table]

The table tracks the last time a worker for the given type of statistics and for a specific month finished successfully, along with its status. The type may be project_statistics_monthly_downloads or project_statistics_monthly_uploads; the status may be completed, in_progress or failed (note that we probably need more columns to store the failure reason, failure date etc.). If the status is in_progress the worker might skip the turn.

A dedicated worker can periodically query this table in order to get the months for which the aggregates still need to be computed, and submit a message to a dedicated queue that will be processed by dedicated message-driven workers. (Note that each time a worker runs, the query will need to scan all the data for the considered month, as we need to get estimates of the number of unique users that downloaded/uploaded files on a monthly basis.)
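The selection of the months that still need to be processed can be sketched as follows. This is only an illustration: the statuses mapping stands in for the content of the statistics_monthly_status table for one type of statistics, and the function names are hypothetical. Months that are in_progress are skipped, months that are missing or failed are returned.

```python
from datetime import date
from typing import Dict, List, Tuple

Month = Tuple[int, int]  # (year, month)


def past_months(today: date, count: int = 12) -> List[Month]:
    """Return the `count` months preceding the current month, newest first."""
    months = []
    year, month = today.year, today.month
    for _ in range(count):
        month -= 1
        if month == 0:
            year, month = year - 1, 12
        months.append((year, month))
    return months


def months_to_process(today: date, statuses: Dict[Month, str],
                      count: int = 12) -> List[Month]:
    """Return the past months whose status is missing or failed."""
    return [m for m in past_months(today, count)
            if statuses.get(m) in (None, "failed")]


statuses = {(2019, 6): "completed", (2019, 5): "failed", (2019, 4): "in_progress"}
pending = months_to_process(date(2019, 7, 8), statuses)
```

Each returned month would then translate into one message on the dedicated queue.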

...

Statistics Workers

We will initially need a few different workers that will aggregate the statistics periodically using the Athena SDK:

  • MonthlyProjectStatisticsWorker: Runs periodically to check the statistics_monthly_status table for the status of the past 12 months; if a month is missing or failed it pushes a message to a dedicated queue that will be picked up by the following workers.
  • ProjectDownloadStatisticsAggregator: Processes the messages pushed by the MonthlyStatisticsWorker to (re)compute the statistics for a given month, runs the SQL query on the S3 data for the downloads stream using Athena and updates the statistics_project_monthly table above.
  • ProjectUploadStatisticsAggregator: Similar to the ProjectDownloadStatisticsAggregator, it performs the aggregation for uploads, updating the statistics_project_monthly table.
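To clarify what the aggregators compute for a given month, here is an in-memory sketch of the aggregation. In practice the work would be delegated to Athena with a query grouping by project (a count of the records plus a distinct count of the users); the function name and the output keys files_count and users_count are assumptions made for illustration, as is counting one download per record.

```python
from collections import defaultdict


def aggregate_by_project(records):
    """Count the records and the unique users per project for one month of records."""
    counts = defaultdict(int)
    users = defaultdict(set)
    for record in records:
        project_id = record["projectId"]
        counts[project_id] += 1
        users[project_id].add(record["userId"])
    return {
        project_id: {"files_count": counts[project_id],
                     "users_count": len(users[project_id])}
        for project_id in counts
    }


sample = [
    {"projectId": 456, "userId": 5432},
    {"projectId": 456, "userId": 5432},
    {"projectId": 456, "userId": 1},
    {"projectId": 789, "userId": 1},
]
monthly = aggregate_by_project(sample)
```

Projects without any record in the month do not appear in the result at all, which is the source of the zero downloads/uploads problem discussed further down.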

The workers will potentially need to split the query result into smaller batches, potentially running multiple transactions to save the data.
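A minimal sketch of such batching is shown below, assuming the query result can be consumed as an iterable of rows. The batches helper is hypothetical, and each yielded batch would be saved in its own transaction.

```python
from typing import Iterable, Iterator, List, TypeVar

T = TypeVar("T")


def batches(rows: Iterable[T], size: int) -> Iterator[List[T]]:
    """Yield consecutive lists of at most `size` rows without loading all rows at once."""
    if size <= 0:
        raise ValueError("size must be positive")
    batch: List[T] = []
    for row in rows:
        batch.append(row)
        if len(batch) == size:
            yield batch
            batch = []
    if batch:
        # Emit the final, possibly smaller, batch.
        yield batch


chunks = list(batches(range(5), 2))
```

Since each batch is committed separately, a failure halfway through leaves the month partially written, which is one reason to track an in_progress or failed status per month.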

This initial setup will allow us to serve the statistics for a given project, but there might be some potential issues:

  • Zero downloads/uploads problem: We can avoid storing unnecessary data for each project if for a given month there were no uploads or downloads; at the application level we can simply return a 0 count if the record is not in the table for a given month. This poses a problem since we cannot discriminate between a zero count and a "not yet computed" case. We can work around this by using the statistics_monthly_status table: if for a given project and month we have no record of downloads or uploads, but we know that the month was processed, we know that there were no downloads or uploads in that month.
  • Last update time: The last update time for a given statistic (download and/or upload) is valid only if the project has had at least one download or one upload in the last month. We can use the global last_update for the given statistic from the statistics_monthly_status table instead, but it is still an approximation: updating the statistics table might take some time, multiple transactions are potentially needed, and even though the statistics for a given project can be retrieved, the last_update might not reflect the fact that the worker is still in progress.
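The workaround for the zero downloads/uploads problem can be sketched as a lookup that consults the month status before defaulting to zero. The two dictionaries below stand in for the statistics_project_monthly and statistics_monthly_status tables, and the function name is hypothetical; None represents the "not yet computed" case.

```python
from typing import Dict, Optional, Tuple

Month = Tuple[int, int]  # (year, month)


def monthly_count(project_id: int, month: Month,
                  counts: Dict[Tuple[int, Month], int],
                  month_status: Dict[Month, str]) -> Optional[int]:
    """Return the stored count, 0 for a processed month without a record, else None."""
    if month_status.get(month) != "completed":
        # The month was never processed, is in progress, or failed.
        return None
    return counts.get((project_id, month), 0)


counts = {(456, (2019, 6)): 42}
month_status = {(2019, 6): "completed", (2019, 5): "failed"}

stored = monthly_count(456, (2019, 6), counts, month_status)
zero = monthly_count(789, (2019, 6), counts, month_status)
unknown = monthly_count(456, (2019, 5), counts, month_status)
```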

The issues above might not be considered critical, as statistics are in any case used as an indication and not taken strictly. We could potentially solve both problems by implementing a stricter synchronization of projects and statistics with a set of additional workers:

...

  • Delays in the stream: Most likely there is going to be a slight delay between the time a record is sent to Kinesis and the time it is stored in S3. We should take this delay into account and make sure we read ahead when we run the query for the month preceding the current one if we are within a given timeframe (e.g. at the beginning of the month we can aggregate the data by also loading a short window of the data past the month being aggregated).
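One possible reading of the read-ahead idea is sketched below: when aggregating a month, the daily partitions of that month are scanned together with the first few daily partitions of the following month, and the rows are then filtered on the timestamp of the record itself. The function name and the default window of one day are assumptions.

```python
import calendar
from datetime import date, timedelta
from typing import List


def partitions_to_scan(year: int, month: int, read_ahead_days: int = 1) -> List[date]:
    """Return the daily partitions of a month plus a read-ahead window after it."""
    first_day = date(year, month, 1)
    days_in_month = calendar.monthrange(year, month)[1]
    return [first_day + timedelta(days=offset)
            for offset in range(days_in_month + read_ahead_days)]


june = partitions_to_scan(2019, 6)
december = partitions_to_scan(2018, 12, read_ahead_days=2)
```

The extra partitions only widen what is scanned; records that actually belong to the following month still have to be excluded by their own timestamp.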

Additionally, we might want to store in the statistics_status table the initial date when we started collecting statistics; this would allow us to truncate the months preceding that date so that we can report an "unknown" status to the client.

...