Monthly Project Statistics
Introduction
Monthly project statistics include file uploads and downloads by various users for each project over the course of a month. At the start of each month, statistics from the previous month are calculated and made accessible via statistics APIs to admins and users with read permissions on a project.
Current System
Synapse records every file upload and download event, and calculates monthly file upload and download statistics for each project, counted by distinct users.
Each file upload and download event is captured by FileEventRecordWorker and sent to a Kinesis stream, which stores the data in S3 in a queryable Parquet format. File events can be queried in Athena through the <Env><Stack>firehoselogs.fileuploadsrecords and <Stack>firehoselogs.filedownloadsrecords Glue tables. Since the data warehouse architecture changed, the same data is also available in warehouse.filedownloadrecords and warehouse.fileuploadrecords.
StatisticsMonthlyStatusWatcherWorker identifies unprocessed months for project statistics and initiates a processing request by sending a message to a queue, allowing the processing to begin for the specified object type and month.
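The month-detection step can be sketched as follows. This is a minimal illustration, not the worker's actual code: the function names, the 'YYYY-MM' month format, the lookback window and the message shape are all assumptions made for the example.

```python
from datetime import date


def previous_months(today: date, count: int) -> list[str]:
    """Return the `count` calendar months before `today` as 'YYYY-MM', newest first."""
    months = []
    year, month = today.year, today.month
    for _ in range(count):
        month -= 1
        if month == 0:
            # Step back across a year boundary.
            year, month = year - 1, 12
        months.append(f"{year:04d}-{month:02d}")
    return months


def find_unprocessed_months(today: date, processed: set[str], lookback: int) -> list[str]:
    """Months inside the lookback window that have no stored statistics yet."""
    return [m for m in previous_months(today, lookback) if m not in processed]


def build_requests(object_type: str, months: list[str]) -> list[dict]:
    """One queue message per unprocessed month (hypothetical message shape)."""
    return [{"objectType": object_type, "month": m} for m in months]
```

The current month is never included, because its statistics are only calculated once the month has ended.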
StatisticsMonthlyWorker retrieves the message from the queue and processes it. This worker executes the Athena query for file upload and file download statistics and stores the results in the STATISTICS_MONTHLY_PROJECT_FILES table of the Synapse main database.
Synapse users who are admins or have read permission on a project can access the statistics through the API documented at https://rest-docs.synapse.org/rest/POST/statistics.html
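The aggregation itself runs as an Athena query in the real system. The plain Python sketch below only illustrates the kind of result it produces for one month: a file event count and a distinct user count per project. The event field names (project_id, user_id, timestamp) and the output keys are assumptions for the example, not the actual table columns.

```python
from collections import defaultdict


def monthly_project_stats(events: list[dict], month: str) -> dict[str, dict[str, int]]:
    """Count events and distinct users per project for one 'YYYY-MM' month.

    Each event is assumed to carry an ISO-8601 timestamp string, so a
    prefix match on 'YYYY-MM' selects the events of that month.
    """
    counts: dict[str, int] = defaultdict(int)
    users: dict[str, set] = defaultdict(set)
    for event in events:
        if event["timestamp"].startswith(month):
            counts[event["project_id"]] += 1
            users[event["project_id"]].add(event["user_id"])
    return {
        project: {"files_count": counts[project], "users_count": len(users[project])}
        for project in counts
    }
```

The same function would be applied once to upload events and once to download events, since the two are stored separately.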
Synapse data warehouse
The architecture of the Synapse data warehouse has been revised. All raw data is now stored in S3 in JSON format, while processed data is stored in a queryable Parquet format. The monthly project statistics should be calculated from the processed Parquet files.
Synapse Integration support for Third-Party Services
Synapse offers integration support for third-party services. The basic idea is that the third party calculates and manages the statistics and makes them available to users. A third-party service can follow these steps to achieve this:
1. Pre-calculate the project statistics and store them in S3, a database table or any other suitable storage.
2. The third-party app connects to Synapse via single sign-on (OAuth) to authenticate the user. Once the user is authenticated, the Synapse API returns an access token in the response.
3. Using the access token retrieved in Step 2, the third-party service calls the Synapse API https://rest-docs.synapse.org/rest/GET/entity/id/permissions.html to get the UserEntityPermissions.
4. Using the UserEntityPermissions response, the third-party app verifies the user's permission on the entity and, if the user is eligible, provides the monthly project statistics.
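The permission check in the last two steps could look roughly like the sketch below. It rests on several assumptions that should be verified against the Synapse REST documentation: the base URL of the repository service, the use of a Bearer token in the Authorization header, and a boolean canView field in UserEntityPermissions as the signal for read access. The stats_store dictionary stands in for whatever storage holds the pre-calculated statistics.

```python
import json
import urllib.request

# Assumption: base URL of the Synapse repository service.
SYNAPSE_REPO = "https://repo-prod.prod.sagebase.org/repo/v1"


def fetch_permissions(entity_id: str, access_token: str) -> dict:
    """Call GET /entity/{id}/permissions with the user's OAuth access token."""
    request = urllib.request.Request(
        f"{SYNAPSE_REPO}/entity/{entity_id}/permissions",
        headers={"Authorization": f"Bearer {access_token}"},
    )
    with urllib.request.urlopen(request) as response:
        return json.load(response)


def can_see_statistics(permissions: dict) -> bool:
    """Assumption: read access is indicated by a boolean `canView` field."""
    return bool(permissions.get("canView", False))


def statistics_for_user(project_id, access_token, stats_store, fetch=fetch_permissions):
    """Return the stored monthly statistics only if the user may read the project."""
    permissions = fetch(project_id, access_token)
    if not can_see_statistics(permissions):
        raise PermissionError(f"User has no read access to {project_id}")
    return stats_store.get(project_id, [])
```

Passing the fetch function as a parameter keeps the eligibility logic testable without a network call, and lets the third party swap in its own HTTP client.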
Why will a hybrid approach not work efficiently?
If the third party is only responsible for calculating the statistics while the Platform team is responsible for building the API that displays these statistics to users, a dependency between the two teams will be established. This arrangement necessitates meticulous implementation and communication for every change. Any release from one team could potentially disrupt the API if not properly coordinated with the other team.