Approach to process access records and make them queryable

Create a Glue table of raw data

Raw data is the data we receive directly from the source, in our case the Synapse repository project. Synapse sends access records to S3 through a Kinesis Firehose delivery stream.

...

Usage: We can query the original raw data at any time, for example when further processing fails or when some information is needed from the original data. The data will be available for querying from Athena.
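As an illustration of querying the raw table from Athena, here is a minimal sketch using boto3's Athena client. The database name, table name, and result location are hypothetical placeholders, not values from this design; only `start_query_execution` and its parameters are part of the real boto3 API.

```python
def build_athena_query_request(database, table, output_location, limit=10):
    """Build the keyword arguments for Athena's start_query_execution.

    All names passed in are placeholders; substitute the real Glue
    database, raw-data table, and S3 result location.
    """
    query = f'SELECT * FROM "{table}" LIMIT {int(limit)}'
    return {
        "QueryString": query,
        "QueryExecutionContext": {"Database": database},
        "ResultConfiguration": {"OutputLocation": output_location},
    }


def run_query(request):
    """Submit the query and return its execution id (needs AWS credentials)."""
    import boto3  # imported lazily so the builder works without boto3 installed

    athena = boto3.client("athena")
    return athena.start_query_execution(**request)["QueryExecutionId"]
```

Calling `run_query(build_athena_query_request("my_db", "raw_access_records", "s3://my-bucket/athena-results/"))` would start the query; the results then have to be fetched separately once the execution finishes.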

Create an ETL job

Once the raw data is available, we want to process it further with an ETL (extract, transform, load) job.

...

The data source will be the Glue table created from the raw data. The ETL job converts the records into a dynamic frame. We first apply a mapping to the columns, then transform the request URL value into a new value, and finally store the processed data back into S3.
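The per-record transformation step can be sketched in plain Python. The normalization rule used here (dropping the query string and replacing numeric or Synapse-style IDs with `#`) and the field names `request_url` and `normalized_request_url` are assumptions made for illustration; the actual rule and column names are not specified above.

```python
import re
from urllib.parse import urlsplit

# Assumed rule: a path segment that is a plain number or a Synapse-style
# ID such as "syn123" is treated as an identifier and replaced by "#".
_ID_SEGMENT = re.compile(r"^(syn)?\d+$", re.IGNORECASE)


def normalize_request_url(url: str) -> str:
    """Drop the query string and replace ID-like path segments with '#'."""
    path = urlsplit(url).path
    segments = ["#" if _ID_SEGMENT.match(s) else s for s in path.split("/")]
    return "/".join(segments)


def transform_record(record: dict) -> dict:
    """Return a copy of the record with a normalized URL field added.

    The field names are hypothetical placeholders.
    """
    out = dict(record)
    out["normalized_request_url"] = normalize_request_url(
        record.get("request_url", "")
    )
    return out
```

Inside a Glue job, a function like `transform_record` could be passed to the `Map` transform after `ApplyMapping` has been applied to the dynamic frame; locally it can be tested on plain dictionaries.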

Challenges

  1. Because we deploy a new stack every week, the ETL job should not reprocess old, already processed data.

  2. How will we maintain versioning of the Glue job script? That is, we need to define infrastructure that takes the script and pushes it into S3 (a deployment process for the script).

  3. How should the job be triggered: on demand, on a schedule, or in sequence with the crawler?

  4. What should the partitioning scheme of the processed records be?
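For the partitioning question, one common option is Hive-style date partitions, which Athena can use to prune the data it scans. The sketch below assumes each record carries an epoch-millisecond timestamp and uses a hypothetical base prefix; neither is confirmed by this design, and other schemes (for example, partitioning by stack release) may fit better.

```python
from datetime import datetime, timezone


def partition_prefix(timestamp_ms: int, base: str = "processed/access_records") -> str:
    """Build a Hive-style year/month/day S3 prefix from an epoch-ms timestamp.

    `base` is a hypothetical prefix; dates are computed in UTC.
    """
    ts = datetime.fromtimestamp(timestamp_ms / 1000, tz=timezone.utc)
    return f"{base}/year={ts:%Y}/month={ts:%m}/day={ts:%d}/"


print(partition_prefix(1700000000000))
# processed/access_records/year=2023/month=11/day=14/
```

With daily partitions, a query restricted to a date range only reads the matching prefixes instead of the whole dataset.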

Processed data destination

We can choose either S3 or a Glue table as the destination.

Integrate with Redash

We need to integrate the processed access records into the Redash dashboard that we use in our stack review meeting for audit purposes.