Monthly Project Statistics

Introduction

Monthly project statistics include file uploads and downloads by various users for each project over the course of a month. At the start of each month, statistics from the previous month are calculated and made accessible via statistics APIs to admins and users with read permissions on a project.

Current System

In Synapse, we track all the file upload and download events, and calculate monthly file uploads and downloads for each project based on distinct users.
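As a minimal sketch of the calculation described above, the following Python counts the distinct users per project for one month. The record fields (project_id, user_id, timestamp) and the function name are assumptions made for illustration; they are not the actual Synapse schema or query.

```python
from collections import defaultdict
from datetime import datetime


def monthly_distinct_users(events, year, month):
    """Count distinct users per project among events in the given month.

    `events` is an iterable of dicts with hypothetical keys
    `project_id`, `user_id` and `timestamp` (a datetime).
    """
    users_by_project = defaultdict(set)
    for event in events:
        ts = event["timestamp"]
        if ts.year == year and ts.month == month:
            # A set keeps each user once, however many events they produced.
            users_by_project[event["project_id"]].add(event["user_id"])
    return {project: len(users) for project, users in users_by_project.items()}


# Illustrative download events (made-up identifiers and dates).
download_events = [
    {"project_id": "syn1", "user_id": 10, "timestamp": datetime(2024, 5, 2)},
    {"project_id": "syn1", "user_id": 10, "timestamp": datetime(2024, 5, 9)},
    {"project_id": "syn1", "user_id": 11, "timestamp": datetime(2024, 5, 20)},
    {"project_id": "syn2", "user_id": 10, "timestamp": datetime(2024, 6, 1)},
]
may_counts = monthly_distinct_users(download_events, 2024, 5)
```

The same function would be applied separately to upload events and download events, since the two are reported as separate statistics.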

Each file upload and download event is captured by FileEventRecordWorker and sent to a Kinesis stream, which stores the data in S3 in queryable Parquet format. The file events can be queried in Athena using the <Env><Stack>firehoselogs.fileuploadsrecords and <Stack>firehoselogs.filedownloadsrecords Glue tables. Since the data warehouse architecture changed, the same data is also available in warehouse.filedownloadrecords and warehouse.fileuploadrecords.
StatisticsMonthlyStatusWatcherWorker identifies unprocessed months for project statistics and sends a message to a queue requesting processing for the specified object type and month.
StatisticsMonthlyWorker retrieves the message from the queue and processes it. This worker executes the Athena queries for file upload and file download statistics and stores the results in the STATISTICS_MONTHLY_PROJECT_FILES table of the Synapse main database.
Synapse users who are admins or have read permissions on a project can access the statistics through https://rest-docs.synapse.org/rest/POST/statistics.html
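
The step that identifies unprocessed months can be sketched as follows: walk back over the most recent complete months and keep those with no stored result. The function name, the look-back window, and the set of processed months are hypothetical; the actual worker's logic may differ.

```python
from datetime import date


def unprocessed_months(today, processed, lookback=12):
    """Return (year, month) pairs for past months that still need processing.

    `processed` is a set of (year, month) pairs that already have results;
    `lookback` (an assumed parameter) is how many past months to check.
    The current, incomplete month is never included.
    """
    pending = []
    year, month = today.year, today.month
    for _ in range(lookback):
        # Step back one month, wrapping from January to December.
        month -= 1
        if month == 0:
            year, month = year - 1, 12
        if (year, month) not in processed:
            pending.append((year, month))
    return pending


# February 2024 and December 2023 already have results in this example.
pending = unprocessed_months(date(2024, 3, 15), {(2024, 2), (2023, 12)}, lookback=4)
```

Each pending month would then become one message on the processing queue.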

Recommended New Approach

The architecture of the Synapse data warehouse has been updated. Now, all raw data is stored in S3 in JSON format, while processed data is stored in queryable Parquet format. We are leveraging Snowflake for time travel queries, statistical analyses, and Synapse usage analysis.

Snowflake currently performs project statistics for public projects and stores the results in a table within the Synapse database, from which the UI retrieves and displays the data to users.

Monthly Project Statistics in Snowflake

To make the statistics calculation more flexible, we propose that monthly project statistics also be calculated in Snowflake. These statistics are already calculated today, but moving the calculation to Snowflake will make project statistics simpler to maintain and modify.

Given the complexity of statistical calculations, we recommend utilizing Snowflake to calculate project statistics based on user activity. Snowflake already handles statistics calculations for public projects effectively, making it an ideal platform for monthly project statistics as well. To ensure accurate and secure display of statistics to users, appropriate access checks are necessary. The Synapse team has prebuilt API endpoints to validate user access roles. Snowflake can leverage these API endpoints to validate user permissions directly.

API URL	Request Object	Response Object	Description
GET /entity/{id}/accessRequirement	None	AccessRequirement	Retrieve paginated list of ALL Access Requirements associated with an entity.
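
A rough sketch of how a caller might use the endpoint listed above. The fetch_json callable, the function names, and the stubbed response shape are assumptions for illustration; a real request would also need a base URL, authentication, and pagination handling, none of which are shown here.

```python
from urllib.parse import quote


def access_requirement_path(entity_id):
    """Build the request path for GET /entity/{id}/accessRequirement."""
    return f"/entity/{quote(str(entity_id), safe='')}/accessRequirement"


def get_access_requirements(entity_id, fetch_json):
    """Fetch the access requirements associated with an entity.

    `fetch_json` is a hypothetical callable that performs an authenticated
    GET on the given path and returns the decoded JSON body.
    """
    return fetch_json(access_requirement_path(entity_id))


def fake_fetch(path):
    # Stand-in for a real HTTP client; the response shape is illustrative only.
    return {"path": path, "results": []}


response = get_access_requirements("syn123", fake_fetch)
```

Passing the HTTP client in as a parameter keeps the sketch independent of any particular library and makes the call easy to stub in tests.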

Pros:

Ownership stays with a single source: Snowflake already calculates project statistics for public projects, so extending it to monthly project statistics reuses the existing system and minimizes maintenance effort.
Pre-existing endpoints for validating user permissions can be directly leveraged by Snowflake, simplifying integration and ensuring seamless access control.