...
Each file upload and download event is captured by FileEventRecordWorker and sent to a Kinesis stream, which stores the data in S3 in a queryable Parquet format. The file events can be queried in Athena using the <Env><Stack>firehoselogs.fileuploadsrecords and <Stack>firehoselogs.filedownloadsrecords Glue tables. Following the change to the data warehouse architecture, the same data is also available in warehouse.filedownloadrecords and warehouse.fileuploadrecords.
StatisticsMonthlyStatusWatcherWorker identifies unprocessed months for project statistics and initiates processing by sending a message to a queue, allowing processing to begin for the specified object type and month.
StatisticsMonthlyWorker retrieves the message from the queue and processes it. This worker executes the Athena queries for file upload and file download statistics and stores the results in the STATISTICS_MONTHLY_PROJECT_FILES table of the Synapse main database.
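The aggregation the Athena query performs can be sketched in pure Python. This is an illustrative sketch only: the field names (projectId, userId, eventDate, eventType) and the output row shape are assumptions, not the actual table schema.

```python
from collections import defaultdict
from datetime import date

def monthly_project_file_stats(events):
    """Aggregate raw file events into per-project monthly counts and
    distinct-user counts, mirroring what the Athena query computes.
    `events` is an iterable of dicts with hypothetical fields:
    projectId, userId, eventDate (datetime.date), eventType
    ('UPLOAD' or 'DOWNLOAD')."""
    counts = defaultdict(int)
    users = defaultdict(set)
    for e in events:
        key = (e["projectId"], e["eventDate"].year, e["eventDate"].month, e["eventType"])
        counts[key] += 1
        users[key].add(e["userId"])
    # Rows shaped roughly like the STATISTICS_MONTHLY_PROJECT_FILES table
    return [
        {"projectId": p, "year": y, "month": m, "eventType": t,
         "filesCount": counts[(p, y, m, t)],
         "usersCount": len(users[(p, y, m, t)])}
        for (p, y, m, t) in counts
    ]

events = [
    {"projectId": 1, "userId": "a", "eventDate": date(2024, 1, 5), "eventType": "DOWNLOAD"},
    {"projectId": 1, "userId": "b", "eventDate": date(2024, 1, 9), "eventType": "DOWNLOAD"},
    {"projectId": 1, "userId": "a", "eventDate": date(2024, 1, 12), "eventType": "DOWNLOAD"},
]
rows = monthly_project_file_stats(events)
# one row: project 1, Jan 2024, DOWNLOAD, filesCount 3, usersCount 2
```

In the real pipeline this grouping happens in the Athena SQL over the Parquet tables; the sketch only shows the grouping keys and the two metrics involved.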
Synapse users who are project admins or have read permission on a project can access the statistics via https://rest-docs.synapse.org/rest/POST/statistics.html
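A client would call that endpoint with a request body along these lines. The concreteType string and field names below follow the public REST docs for the project-files statistics request, but should be verified against the current API before use:

```python
import json

def build_statistics_request(project_id, downloads=True, uploads=True):
    """Build the POST /statistics request body for monthly project
    file statistics. Field names per the public REST docs; verify
    against the current API."""
    return {
        "concreteType": "org.sagebionetworks.repo.model.statistics.ProjectFilesStatisticsRequest",
        "objectId": project_id,
        "fileDownloads": downloads,
        "fileUploads": uploads,
    }

body = json.dumps(build_statistics_request("syn123"))
```

The actual HTTP call (with an authenticated session) is omitted; the response contains the monthly counts read from STATISTICS_MONTHLY_PROJECT_FILES.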
...
Synapse data warehouse
The architecture of the Synapse data warehouse has been revised. All raw data is now stored in S3 in JSON format, while processed data is stored in a queryable Parquet format. We are leveraging Snowflake for time travel queries, statistical analyses, and Synapse usage analysis.
Snowflake currently performs project statistics for public projects and stores the results in a table within the Synapse database, from which the UI retrieves and displays the data to users.
Monthly Project Statistics in Snowflake
To enhance flexibility in calculating statistics, we propose that monthly project statistics also be calculated in Snowflake. While these statistics are currently calculated within Synapse workers, moving them to Snowflake would simplify the maintenance and modification of project statistics.
Given the complexity of statistical calculations, we recommend utilizing Snowflake to calculate project statistics based on user activity. Snowflake already handles statistics calculations for public projects effectively, making it an ideal platform for monthly project statistics as well. To ensure accurate and secure display of statistics to users, appropriate access checks are necessary. The Synapse team has prebuilt API endpoints to validate user access roles. Snowflake can leverage these API endpoints to validate user permissions directly.
API Url | Request Object | Response Object | Description |
---|---|---|---|
 | None | AccessRequirement | Retrieve a paginated list of ALL Access Requirements associated with an entity. |
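Once the Access Requirements for an entity and the user's approvals have been fetched through those endpoints, the eligibility check itself is simple set logic. A minimal sketch, with the paginated fetching left out and plain id collections assumed as inputs:

```python
def has_met_all_requirements(requirement_ids, approved_ids):
    """True if the user holds an approval for every Access Requirement
    on the entity. Fetching the ids via the paginated endpoint in the
    table above is out of scope here."""
    return set(requirement_ids) <= set(approved_ids)

ok = has_met_all_requirements([1, 2], [1, 2, 3])   # True
blocked = has_met_all_requirements([1, 4], [1])    # False
```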
Pros:
...
Ownership stays with a single team, and since Snowflake already computes these statistics, the approach leverages the existing system and minimizes maintenance effort.
...
The monthly project statistics should be calculated from Processed Parquet files.
Synapse Integration support for Third-Party Services
Synapse offers integration support to third-party services. The basic idea is that calculating and managing statistics, and making them available to users, would be implemented by the third party. Third-party services need to follow the processes below:
Pre-calculate the project statistics and store them in S3, a database table, or any other suitable storage.
The third-party app connects to Synapse via single sign-on (OAuth) to authenticate the user. Once authenticated, the Synapse API provides an access token in the response.
Using the access token retrieved in Step 2, the third-party service calls the Synapse API https://rest-docs.synapse.org/rest/GET/entity/id/permissions.html to get UserEntityPermissions.
Using the UserEntityPermissions response, the third-party app can verify the user's permissions on the entity and provide monthly project statistics to the user, if eligible.
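The permission gate in the final step might look like the following. This assumes the UserEntityPermissions response includes a canView flag (consult the REST docs for the authoritative field list), since read access is what the design above requires for viewing statistics:

```python
def can_view_statistics(user_entity_permissions: dict) -> bool:
    # Read access (canView) is what the design requires for viewing
    # project statistics; field name assumed from the REST docs.
    return bool(user_entity_permissions.get("canView"))

def statistics_for(user_entity_permissions: dict, stats: dict):
    """Return the precomputed stats only to eligible users; None
    signals the app should withhold them."""
    if not can_view_statistics(user_entity_permissions):
        return None
    return stats

allowed = statistics_for({"canView": True}, {"filesCount": 10})
denied = statistics_for({"canView": False}, {"filesCount": 10})
```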
Why a hybrid approach will not work efficiently
If the third party is only responsible for calculating the statistics while the Platform team is responsible for building the API that displays them to users, a dependency between the two teams is created. This arrangement requires careful implementation and communication for every change; a release from one team could break the API if not properly coordinated with the other.