
Introduction


Monthly project statistics include file uploads and downloads by various users for each project over the course of a month. At the start of each month, statistics from the previous month are calculated and made accessible via statistics APIs to admins and users with read permissions on a project.

Current System

In Synapse, we track all file upload and download events and calculate monthly file uploads and downloads for each project by distinct users.

  1. Each file upload and download event is captured by FileEventRecordWorker and sent to a Kinesis stream, which stores the data in S3 in queryable Parquet format. These file events can be queried using the <Env><Stack>firehoselogs.fileuploadsrecords and <Stack>firehoselogs.filedownloadsrecords Glue tables in Athena. Since the data warehouse architecture has changed, the same data is also available in the warehouse.filedownloadrecords and warehouse.fileuploadrecords tables.

  2. StatisticsMonthlyStatusWatcherWorker identifies unprocessed months for project statistics and initiates a processing request by sending a message to a queue, allowing processing to begin for the specified object type and month.

  3. StatisticsMonthlyWorker retrieves the message from the queue and processes it. This worker executes the Athena query for file upload and file download statistics and stores the results in the STATISTICS_MONTHLY_PROJECT_FILES table of the Synapse main database (an illustrative sketch of such a query appears after this list).

  4. Synapse users who are admins or have read permissions on a project can access the statistics via https://rest-docs.synapse.org/rest/POST/statistics.html.

...
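
For illustration, the monthly aggregation in step 3 could be issued against the warehouse Glue tables roughly as follows. This is a minimal sketch using boto3 and Athena; the column names (project_id, user_id, record_date), the date range, and the S3 output location are assumptions and would need to match the actual schema and stack configuration.

# Sketch: run the monthly file-download aggregation in Athena (assumed schema).
import boto3

athena = boto3.client("athena")

QUERY = """
SELECT project_id,
       COUNT(*)                AS files_downloaded,
       COUNT(DISTINCT user_id) AS distinct_users
FROM warehouse.filedownloadrecords
WHERE record_date BETWEEN DATE '2024-05-01' AND DATE '2024-05-31'
GROUP BY project_id
"""

response = athena.start_query_execution(
    QueryString=QUERY,
    QueryExecutionContext={"Database": "warehouse"},
    ResultConfiguration={"OutputLocation": "s3://example-athena-results/statistics/"},
)
print("Started Athena query:", response["QueryExecutionId"])

An equivalent query over warehouse.fileuploadrecords would produce the upload counts.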

Synapse data warehouse

The architecture of the Synapse data warehouse has been updated. All raw data is now stored in S3 in JSON format, and AWS Glue ETL jobs process this data and store the processed data in queryable Parquet format. The fileuploadrecords and filedownloadrecords tables are now accessible in the warehouse database within Glue. As a result, file event data is duplicated, with one set residing in the older <Stack>firehoselogs Glue database and another set in the warehouse Glue database. We should use the newer database and eliminate the outdated Kinesis streams and Glue database.
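
Conceptually, each of those Glue ETL jobs reads the raw JSON events and rewrites them as Parquet, along the lines of the PySpark sketch below. The bucket paths and the record_date partition column are placeholders rather than the actual job configuration.

# Sketch: rewrite raw JSON file events as queryable Parquet (placeholder paths).
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("file-events-json-to-parquet").getOrCreate()

# Read the raw JSON file events delivered to S3.
raw = spark.read.json("s3://example-raw-bucket/filedownloadrecords/")

# Write them back as Parquet, partitioned by an assumed record_date column,
# which is what makes the warehouse.filedownloadrecords Glue table queryable.
(raw.write
    .mode("append")
    .partitionBy("record_date")
    .parquet("s3://example-warehouse-bucket/filedownloadrecords/"))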

Approaches

...

Approach

...

Pros

...

Cons

...

Maintain the current setup and simply modify the database and table names in the Athena query to access data from the most recent database.

...

Only minimal adjustments are needed since there are no changes to the architecture.

...

The architecture used by this flow is lengthy, and more code needs to be maintained.

...

Monthly statistics will be computed in Snowflake, with the aggregated results being stored in a table within the Synapse Tables database.

In Synapse, the statistics API will read the statistics from the Synapse Tables database.
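
As a sketch of how the pre-computed statistics could be read back out of such a table, the snippet below uses the synapseclient Python client; the table ID, column names, and month filter are hypothetical.

# Sketch: read pre-aggregated monthly statistics from a Synapse table (hypothetical syn ID).
import synapseclient

syn = synapseclient.Synapse()
syn.login(authToken="<personal-access-token>")  # placeholder credential

results = syn.tableQuery(
    "SELECT project_id, month, files_uploaded, files_downloaded, distinct_users "
    "FROM syn00000000 WHERE month = '2024-05'"
)
for row in results:
    print(row)

Note that the same query is available to any Synapse client, which is the access-control concern described under Cons.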

...

The process will be streamlined in Synapse since the statistics will already be calculated, allowing us to eliminate the workers that currently identify unprocessed months and process the statistics.

...

The Statistics API verifies a user's read permissions for a project if they are not an admin. Tables in the Synapse Tables database are public, so we cannot implement row-level filtering on them. Although we can apply restrictions within the Synapse Statistics API, other connected clients may still access the data since the tables are public.

...

Monthly statistics will be calculated in Snowflake, with the aggregated results stored in S3 in CSV/JSON format.

In Synapse, a worker will read the aggregated results from S3 and store them in the STATISTICS_MONTHLY_PROJECT_FILES table.
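
The S3-to-database step might look roughly like the sketch below. The real worker would be written in Java in Synapse-Repository; this Python snippet only illustrates the shape of the work, and the bucket, key, and column layout are assumptions.

# Sketch: parse an aggregated CSV written by Snowflake and build rows for
# STATISTICS_MONTHLY_PROJECT_FILES (bucket, key, and columns are assumptions).
import csv
import io

import boto3

s3 = boto3.client("s3")
obj = s3.get_object(Bucket="example-statistics-bucket",
                    Key="monthly/2024-05/project_file_statistics.csv")
content = obj["Body"].read().decode("utf-8")

rows = [
    {
        "PROJECT_ID": int(r["project_id"]),
        "MONTH": r["month"],                      # e.g. 2024-05-01
        "FILES_COUNT": int(r["files_count"]),
        "USERS_COUNT": int(r["distinct_users"]),
    }
    for r in csv.DictReader(io.StringIO(content))
]
# The rows would then be batch-inserted into STATISTICS_MONTHLY_PROJECT_FILES.
print(f"Parsed {len(rows)} statistics rows")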

...

The process will be streamlined in Synapse since the statistics will already be calculated.

There will be a single worker that picks up the files from S3 and inserts the statistics data into the STATISTICS_MONTHLY_PROJECT_FILES table.

...

This introduces a dependency on Snowflake. If the statistics query needs to change, or more data is required in the results, those modifications must first be made in Snowflake. If the statistics are unavailable for any reason, we need to coordinate with the relevant team.

...

Develop an AWS Step Function in Synapse-Stack-Builder that calculates the previous month's statistics on the first day of the current month and sends a message to the queue for the Synapse worker to process.

In Synapse, a worker will read the aggregated results from S3 and store them in the STATISTICS_MONTHLY_PROJECT_FILES table.
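
A rough sketch of what that state machine could look like is shown below as the definition document, built here as a Python dict. The query string, queue URL, result location, and output path are placeholders; the actual definition would be provisioned from Synapse-Stack-Builder and triggered by a monthly schedule.

# Sketch of a Step Functions definition (placeholders throughout): run the monthly
# Athena aggregation, then notify the Synapse worker queue.
import json

definition = {
    "StartAt": "RunMonthlyStatisticsQuery",
    "States": {
        "RunMonthlyStatisticsQuery": {
            "Type": "Task",
            "Resource": "arn:aws:states:::athena:startQueryExecution.sync",
            "Parameters": {
                "QueryString": "SELECT ...",  # monthly aggregation over the warehouse tables
                "QueryExecutionContext": {"Database": "warehouse"},
                "ResultConfiguration": {
                    "OutputLocation": "s3://example-statistics-bucket/monthly/"
                },
            },
            "Next": "NotifySynapseWorker",
        },
        "NotifySynapseWorker": {
            "Type": "Task",
            "Resource": "arn:aws:states:::sqs:sendMessage",
            "Parameters": {
                "QueueUrl": "https://sqs.us-east-1.amazonaws.com/123456789012/example-statistics-queue",
                "MessageBody.$": "$.QueryExecution.ResultConfiguration.OutputLocation",
            },
            "End": True,
        },
    },
}

print(json.dumps(definition, indent=2))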

...

We will decouple the Athena query process within the Synapse-Stack-Builder project.

There will be a single worker that picks up the files from S3 and inserts the statistics data into the STATISTICS_MONTHLY_PROJECT_FILES table.

...

Modifications to the statistics calculation will necessitate changes in two projects, Synapse-Repository and Synapse-Stack-Builder, depending on the type of change.

...

Create an ETL job that retrieves data from the warehouse.fileuploadrecords and warehouse.filedownloadrecords tables to calculate statistics. Once the query has executed successfully, send a message to the queue for the Synapse worker to process.

In Synapse, a worker will read the aggregated results from S3 and store them in the STATISTICS_MONTHLY_PROJECT_FILES table.
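
A minimal sketch of such a job follows, with assumed column names and placeholder resources: aggregate the download events per project for the previous month with Spark, write the result to S3, then tell the Synapse worker where to find it. The upload statistics would be produced the same way from warehouse.fileuploadrecords.

# Sketch of the aggregation ETL job (assumed columns, placeholder bucket and queue).
import boto3
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.appName("monthly-project-statistics-sketch").getOrCreate()

downloads = spark.read.parquet("s3://example-warehouse-bucket/filedownloadrecords/")

monthly = (downloads
           .where(F.col("record_date").between("2024-05-01", "2024-05-31"))
           .groupBy("project_id")
           .agg(F.count("*").alias("files_downloaded"),
                F.countDistinct("user_id").alias("distinct_users")))

output = "s3://example-statistics-bucket/monthly/2024-05/downloads/"
monthly.write.mode("overwrite").csv(output, header=True)

# Notify the Synapse worker that the aggregated results are ready.
boto3.client("sqs").send_message(
    QueueUrl="https://sqs.us-east-1.amazonaws.com/123456789012/example-statistics-queue",
    MessageBody=output,
)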

...

We will decouple the Athena query process within the Synapse-ETL-Jobs project.

There will be a single worker that picks up the files from S3 and inserts the statistics data into the STATISTICS_MONTHLY_PROJECT_FILES table.

...

Monthly project statistics should be calculated from the processed Parquet files.

Synapse Integration support for Third-Party Services

Synapse offers integration support for third-party services. The basic idea is that calculating and managing the statistics, and making them available to users, would be implemented by a third party. A third-party service can follow the process below to achieve this.

  1. Pre-calculate the project statistics and store them in S3, a database table, or any other suitable storage.

  2. The third-party app will connect to Synapse via single sign-on (OAuth) to authenticate the user. Once authenticated, the Synapse API will provide an access token in the response.

  3. Using the access token retrieved in step 2, the third-party service will call the Synapse API https://rest-docs.synapse.org/rest/GET/entity/id/permissions.html to get the UserEntityPermissions.

  4. Using the UserEntityPermissions response, the third-party app can verify the user's permissions on the entity and provide the monthly project statistics to the user, if eligible (a sketch of steps 3 and 4 follows this list).
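
The sketch below illustrates steps 3 and 4 with the Python requests library. The base URL matches the public Synapse REST API; the canView check and the fetch_statistics helper are assumptions about how a third-party service might wire this together.

# Sketch of steps 3-4: check the user's permissions on the project entity and,
# if they can view it, serve the pre-computed statistics from the third party's store.
import requests

SYNAPSE_REPO = "https://repo-prod.prod.sagebase.org/repo/v1"


def can_view_project(access_token: str, project_id: str) -> bool:
    resp = requests.get(
        f"{SYNAPSE_REPO}/entity/{project_id}/permissions",
        headers={"Authorization": f"Bearer {access_token}"},
        timeout=10,
    )
    resp.raise_for_status()
    # UserEntityPermissions is expected to carry a canView flag (assumption).
    return bool(resp.json().get("canView"))


def fetch_statistics(project_id: str) -> dict:
    # Placeholder for the third party's own storage lookup (S3, database, etc.).
    return {"projectId": project_id, "month": "2024-05", "fileDownloads": 0}


def get_monthly_statistics(access_token: str, project_id: str) -> dict:
    if not can_view_project(access_token, project_id):
        raise PermissionError("User may not view this project's statistics")
    return fetch_statistics(project_id)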

Why will a hybrid approach not work efficiently?

If the third party is only responsible for calculating the statistics while the Platform team is responsible for building the API that displays these statistics to users, a dependency between the two teams will be established. This arrangement necessitates meticulous implementation and communication for every change. Any release from one team could potentially disrupt the API if not properly coordinated with the other team.