Introduction
Monthly project statistics include file uploads and downloads by various users for each project over the course of a month. At the start of each month, statistics from the previous month are calculated and made accessible via statistics APIs to admins and users with read permissions on a project.
Current System
In Synapse we record all the file upload and download events and monthly file upload and download is calculated for each project by distinct users.
In Synapse, we track all the file upload and download events, and calculate monthly file uploads and downloads for each project based on distinct users.
Each file upload and download event is captured by FileEventRecordWorker and sent to kinesis stream, which stores the data in S3 in queryable Parquet format. The file event can be queried using <Env><Stack>firehoselogs.fileuploadsrecords and <Stack>firehoselogs.filedownloadsrecords Glue tables in Athena.
StatisticsMonthlyStatusWatcherWorker dentifies unprocessed months for project statistics and initiates a processing request by sending a message to a queue, allowing the processing to begin for the specified object type and month.
StatisticsMonthlyWorker retrieve the message from queue and processes it. This worker executes the Athena query for file upload and file download statistics and store the results in STATISTICS_MONTHLY_PROJECT_FILES table of the Synapse main database.
The Synapse users who are admins or have read permissions on a project can access the statistics with https://rest-docs.synapse.org/rest/POST/statistics.html
Why changes are required in current system?
The architecture of the Synapse data warehouse has been updated. Audit and snapshot data is now being sent to S3 in JSON format. AWS Glue ETL jobs process this data and store it in a queryable Parquet format. The fileuploadrecords and filedownloadrecords tables are now accessible in the warehouse database within Glue. As a result, file event data is duplicated, with one set residing in the older <Stack>firehoselogs Glue database and another set in the warehouse Glue database. We should use the newer database and eliminate the outdated Kinesis streams and Glue database.
Approaches
Approach | Pros | Cons |
---|---|---|
Maintain the current setup and simply modify the database and table names in the Athena query to access data from the most recent database. | Only minimal adjustments are needed since there are no changes to the architecture. | The Architecture used by this flow is Lengthy and need to maintain more code. |
Monthly statistics will be computed in Snowflake, with the aggregated results being stored in a table within the Synapse Tables database. In Synapse the statistics API will read the statistics from Synapse table database. | The process will be streamlined in Synapse since the statistics will already be calculated, allowing us to eliminate the workers that currently identify unprocessed months and process the statistics. | The Statistics API verifies a user's read permissions for a project if they are not an admin. Tables in the Synapse Tables database are public, so we cannot implement row-level filtering on them. Although we can apply restrictions within the Synapse Statistics API, other connected clients may still access the data since the tables are public. |
Monthly statistics will be calculated in Snowflake with the aggregated results being stored in S3 in csv/Json format. In Synapse the worker will read the aggregated results from s3 and store it in STATISTICS_MONTHLY_PROJECT_FILES table. | The process will be streamlined in Synapse since the statistics will already be calculated. There will be a one worker which will pick up the files from S3 and insert statistics data into STATISTICS_MONTHLY_PROJECT_FILES table. | This introduces a dependency on Snowflake. Any changes needed in the statistics query or if more data is required in the results, those modifications must first be made in Snowflake. If the statistics are unavailable for any reason, we need to coordinate with the relevant team |
Develop an AWS Step Function to calculate the previous month's statistics on the first day of the current month in Synapse-Stack-Builder and send a message to the queue for the Synapse worker to process. In Synapse the worker will read the aggregated results from s3 and store it in STATISTICS_MONTHLY_PROJECT_FILES table. | We will decouple the Athena query process within the Synapse-Stack-Builder project. There will be a one worker which will pick up the files from S3 and insert statistics data into STATISTICS_MONTHLY_PROJECT_FILES table. | Modifications to the statistics calculation will necessitate changes in two projects Synapse-Repository and Synapse-Stack-Builder depending upon the type of change. |
Create an ETL job that retrieves data from tables warehouse.fileuploadrecords and warehouse.filedownloadrecords to calculate statistics. Once the query is successfully executed, send a message to the queue for the Synapse worker to process. In Synapse the worker will read the aggregated results from s3 and store it in STATISTICS_MONTHLY_PROJECT_FILES table. | We will decouple the Athena query process within the Synapse-ETL-Jobs project. There will be a one worker which will pick up the files from S3 and insert statistics data into STATISTICS_MONTHLY_PROJECT_FILES table. | Modifications to the statistics calculation will necessitate changes in two projects Synapse-Repository and Synapse-Stack-Builder depending upon the type of change. |