...

  1. Each file upload and download event is captured by FileEventRecordWorker and sent to a Kinesis stream, which stores the data in S3 in queryable Parquet format. The file events can be queried in Athena through the <Env><Stack>firehoselogs.fileuploadsrecords and <Env><Stack>firehoselogs.filedownloadsrecords Glue tables (a query sketch follows this list). Since the data warehouse architecture changed, the same data is also available in warehouse.filedownloadrecords and warehouse.fileuploadrecords.

  2. StatisticsMonthlyStatusWatcherWorker identifies unprocessed months for project statistics and initiates processing by sending a message to a queue, allowing processing to begin for the specified object type and month.

  3. StatisticsMonthlyWorker retrieves the message from the queue and processes it. This worker executes the Athena query for file upload and file download statistics and stores the results in the STATISTICS_MONTHLY_PROJECT_FILES table of the Synapse main database.

  4. Synapse users who are admins or who have read permission on a project can access the statistics via https://rest-docs.synapse.org/rest/POST/statistics.html
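
For illustration, here is a minimal sketch of how the Athena query step could be issued with boto3. The Glue database name (the <Env><Stack>firehoselogs placeholder resolved to a concrete stack prefix), the partition and column names, and the S3 output location are all assumptions, not the worker's actual implementation:

    import boto3

    athena = boto3.client("athena", region_name="us-east-1")

    # Hypothetical resolution of the <Env><Stack>firehoselogs placeholder
    DATABASE = "devfirehoselogs"

    # Count downloads and distinct users per project for one month.
    # The column and partition names below are assumptions; check the
    # actual Glue table schema before running.
    QUERY = """
    SELECT projectid,
           COUNT(*) AS download_count,
           COUNT(DISTINCT userid) AS user_count
    FROM filedownloadsrecords
    WHERE year = '2024' AND month = '05'
    GROUP BY projectid
    """

    response = athena.start_query_execution(
        QueryString=QUERY,
        QueryExecutionContext={"Database": DATABASE},
        ResultConfiguration={
            # Hypothetical bucket for Athena query results
            "OutputLocation": "s3://example-athena-results/statistics/"
        },
    )
    print("Started query:", response["QueryExecutionId"])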

...

Recommended New Approach

The architecture of the Synapse data warehouse has been updated. All raw audit and snapshot data is now sent to S3 in JSON format. AWS Glue ETL jobs process this data and store it in queryable Parquet format. The fileuploadrecords and filedownloadrecords tables are now accessible in the warehouse database within Glue. As a result, file event data is duplicated: one copy resides in the older <Stack>firehoselogs Glue database and another in the warehouse Glue database. We should use the newer database and retire the outdated Kinesis streams and Glue database.
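
Under this recommendation, migrating a consumer is mostly a matter of swapping the fully qualified table names; a sketch (the stack prefix and partition filter are hypothetical):

    # Old (to be retired): the stack-prefixed Firehose Glue database,
    # shown here with a hypothetical stack prefix
    OLD_TABLE = "devfirehoselogs.filedownloadsrecords"

    # New: the warehouse Glue database populated by the Glue ETL jobs
    NEW_TABLE = "warehouse.filedownloadrecords"

    # The same statistics query, pointed at the new table
    query = f"SELECT COUNT(*) FROM {NEW_TABLE} WHERE year = '2024' AND month = '05'"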

Approaches

...

Approach

...

Pros

...

Cons

...

Maintain the current setup and simply modify the database and table names in the Athena query to access data from the most recent database.

...

Only minimal adjustments are needed since there are no changes to the architecture.

...

The architecture used by this flow is lengthy, and more code must be maintained.

...

, while processed data is stored in queryable Parquet format. We are leveraging Snowflake for time-travel queries, statistical analyses, and Synapse usage analysis.
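
As a sketch of the kind of time-travel query Snowflake enables (the connection parameters and the database/table names are hypothetical):

    import snowflake.connector

    # Hypothetical connection parameters
    conn = snowflake.connector.connect(
        account="example_account",
        user="example_user",
        password="example_password",
        warehouse="ANALYTICS_WH",
    )

    # Time travel: read the table as it existed 24 hours ago.
    # The table name is an assumption for illustration.
    cur = conn.cursor()
    cur.execute(
        "SELECT COUNT(*) FROM warehouse.filedownloadrecords AT(OFFSET => -60*60*24)"
    )
    print(cur.fetchone())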

Snowflake currently computes project statistics for public projects and stores the results in a table within the Synapse

...

In Synapse, the statistics API will read the statistics from the Synapse Tables database.

...

The process will be streamlined in Synapse since the statistics will already be calculated, allowing us to eliminate the workers that currently identify unprocessed months and process the statistics.

...

The Statistics API verifies a user's read permissions for a project if they are not an admin. Tables in the Synapse Tables database are public, so we cannot implement row-level filtering on them. Although we can apply restrictions within the Synapse Statistics API, other connected clients may still access the data since the tables are public.

...

Monthly statistics will be calculated in Snowflake, with the aggregated results stored in S3 in CSV/JSON format.

In Synapse, a worker will read the aggregated results from S3 and store them in the STATISTICS_MONTHLY_PROJECT_FILES table.
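
A minimal sketch of such a worker, assuming Snowflake unloads a CSV with project_id, files_downloaded, and users_count columns to a hypothetical bucket; the real worker would run inside the Synapse repository and write through its DAO layer:

    import csv
    import io

    import boto3

    s3 = boto3.client("s3")

    # Hypothetical bucket/key where Snowflake unloads the monthly aggregates
    obj = s3.get_object(
        Bucket="example-statistics-bucket",
        Key="monthly/2024-05/project_files.csv",
    )
    body = obj["Body"].read().decode("utf-8")

    rows = []
    for record in csv.DictReader(io.StringIO(body)):
        # Assumed CSV columns; the actual layout is defined by the unload job
        rows.append(
            (int(record["project_id"]),
             int(record["files_downloaded"]),
             int(record["users_count"]))
        )

    # rows would then be batch-inserted into STATISTICS_MONTHLY_PROJECT_FILES
    # through the Synapse DAO layer.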

...

The process will be streamlined in Synapse since the statistics will already be calculated.

A single worker will pick up the files from S3 and insert the statistics data into the STATISTICS_MONTHLY_PROJECT_FILES table.

...

This introduces a dependency on Snowflake. Any changes to the statistics query, or any additional data required in the results, must first be made in Snowflake. If the statistics are unavailable for any reason, we need to coordinate with the relevant team.

...

Develop an AWS Step Function in Synapse-Stack-Builder that calculates the previous month's statistics on the first day of the current month and sends a message to the queue for the Synapse worker to process.

In Synapse, a worker will read the aggregated results from S3 and store them in the STATISTICS_MONTHLY_PROJECT_FILES table.
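
A sketch of such a state machine in Amazon States Language, expressed here as a Python dict (all ARNs, names, and the query string are hypothetical); an EventBridge rule with a cron schedule for the first day of each month would trigger it:

    # Minimal Amazon States Language definition as a Python dict.
    STATE_MACHINE_DEFINITION = {
        "Comment": "Compute previous month's statistics, then notify Synapse",
        "StartAt": "RunAthenaQuery",
        "States": {
            "RunAthenaQuery": {
                "Type": "Task",
                # Step Functions' built-in Athena integration, run synchronously
                "Resource": "arn:aws:states:::athena:startQueryExecution.sync",
                "Parameters": {
                    "QueryString": "SELECT ...",  # the monthly statistics query
                    "ResultConfiguration": {
                        "OutputLocation": "s3://example-stats-bucket/monthly/"
                    },
                },
                "Next": "NotifySynapseWorker",
            },
            "NotifySynapseWorker": {
                "Type": "Task",
                "Resource": "arn:aws:states:::sqs:sendMessage",
                "Parameters": {
                    "QueueUrl": "https://sqs.us-east-1.amazonaws.com/"
                                "123456789012/example-statistics-queue",
                    # Pass the query execution id so the worker can find results
                    "MessageBody.$": "$.QueryExecution.QueryExecutionId",
                },
                "End": True,
            },
        },
    }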

...

We will decouple the Athena query process within the Synapse-Stack-Builder project.

A single worker will pick up the files from S3 and insert the statistics data into the STATISTICS_MONTHLY_PROJECT_FILES table.

...

Modifications to the statistics calculation will necessitate changes in two projects, Synapse-Repository and Synapse-Stack-Builder, depending on the type of change.

...

Create an ETL job that retrieves data from the warehouse.fileuploadrecords and warehouse.filedownloadrecords tables to calculate statistics. Once the query executes successfully, send a message to the queue for the Synapse worker to process.

In Synapse, a worker will read the aggregated results from S3 and store them in the STATISTICS_MONTHLY_PROJECT_FILES table.
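
A sketch of such a Glue job in PySpark; the database and table names follow this document, while the column names, bucket, and queue URL are assumptions:

    import boto3
    from awsglue.context import GlueContext
    from pyspark.context import SparkContext

    glue_context = GlueContext(SparkContext.getOrCreate())

    # Read the warehouse table from the Glue Data Catalog
    downloads = glue_context.create_dynamic_frame.from_catalog(
        database="warehouse", table_name="filedownloadrecords"
    ).toDF()

    # Aggregate per project for the previous month (column names assumed)
    stats = (
        downloads.where("year = '2024' AND month = '05'")
        .groupBy("project_id")
        .count()
    )

    # Write the aggregates to S3 (hypothetical bucket), then notify Synapse
    stats.write.mode("overwrite").csv("s3://example-stats-bucket/monthly/2024-05/")
    boto3.client("sqs").send_message(
        QueueUrl="https://sqs.us-east-1.amazonaws.com/123456789012/example-statistics-queue",
        MessageBody="2024-05",
    )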

...

We will decouple the Athena query process within the Synapse-ETL-Jobs project.

A single worker will pick up the files from S3 and insert the statistics data into the STATISTICS_MONTHLY_PROJECT_FILES table.

...

database, from which the UI retrieves and displays the data to users.

Monthly Project Statistics in Snowflake

To enhance flexibility in calculating statistics, we propose that monthly project statistics also be calculated in Snowflake. While these statistics are currently calculated in Synapse, using Snowflake will simplify and improve the maintenance and modification of project statistics.

Given the complexity of statistical calculations, we recommend utilizing Snowflake to calculate project statistics based on user activity. Snowflake already handles statistics calculations for public projects effectively, making it an ideal platform for monthly project statistics as well. To ensure accurate and secure display of statistics to users, appropriate access checks are necessary. The Synapse team has prebuilt API endpoints to validate user access roles. Snowflake can leverage these API endpoints to validate user permissions directly.

API Url: GET /entity/{id}/accessRequirement
Request Object: None
Response Object: AccessRequirement
Description: Retrieve a paginated list of all Access Requirements associated with an entity.
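
For illustration, a sketch of calling this endpoint against the public Synapse REST base URL (the entity ID and access token are hypothetical):

    import requests

    ENTITY_ID = "syn123456"          # hypothetical entity
    TOKEN = "example-access-token"   # hypothetical personal access token

    response = requests.get(
        f"https://repo-prod.prod.sagebase.org/repo/v1/entity/{ENTITY_ID}/accessRequirement",
        headers={"Authorization": f"Bearer {TOKEN}"},
    )
    response.raise_for_status()
    # Paginated list of AccessRequirement objects for the entity
    print(response.json())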

Pros:

  • Ownership stays in a single place: Snowflake already computes these statistics, so this approach leverages the existing system and minimizes maintenance effort.

  • Pre-existing endpoints for validating user permissions can be directly leveraged by Snowflake, simplifying integration and ensuring seamless access control.