
Introduction

In phase one, we set up a Kinesis Firehose delivery stream that sends access records as JSON to S3. The delivery stream partitions the data based on when it was received. Now we want to process this data and make the processed data queryable. The data grows over time, so we need to structure this big data in such a way that it can be queried and processed in adequate time.
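
For reference, Firehose's default prefix lays objects out by arrival time as YYYY/MM/DD/HH (UTC), with object names derived from the stream name and delivery timestamp. The bucket and stream names below are hypothetical:

    s3://access-records-bucket/2023/05/12/09/access-records-stream-1-2023-05-12-09-15-30-<random-suffix>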

Create a Glue table of raw data

...

Column Name    Data Type

stack          string
instance       string
timestamp      bigint
payload        struct<sessionId:string, timestamp:bigint, userId:int, method:string, requestURL:string, userAgent:string, host:string, origin:string, via:string, threadId:int, elapseMS:int, success:boolean, stack:string, instance:string, date:string, vmId:string, returnObjectId:string, queryString:string, responseStatus:int, oauthClientId:string, basicAuthUsername:string, authenticationMethod:string, xforwardedFor:string>
year           string
month          string
day            string

...

Year, month, and day are the partition columns. The Kinesis Firehose delivery stream partitions the data in S3 based on when it was received. Note: we could instead partition by the record timestamp, but for that we would have to enable Firehose's dynamic partitioning.
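
As a sketch, the raw table could be created through Athena DDL submitted via boto3. The bucket, database, and table names here are assumptions, and the payload struct is abbreviated to a few fields (the OpenX JSON SerDe simply ignores JSON fields that are not declared):

    import boto3

    athena = boto3.client("athena")

    # Raw table over the Firehose output, partitioned the same way the
    # delivery stream lays the objects out (year/month/day of arrival).
    ddl = """
    CREATE EXTERNAL TABLE IF NOT EXISTS access_logs_raw (
      stack string,
      instance string,
      `timestamp` bigint,
      payload struct<sessionId:string, userId:int, method:string,
                     requestURL:string, responseStatus:int, success:boolean>
    )
    PARTITIONED BY (year string, month string, day string)
    ROW FORMAT SERDE 'org.openx.data.jsonserde.JsonSerDe'
    LOCATION 's3://access-records-bucket/raw/'
    """

    athena.start_query_execution(
        QueryString=ddl,
        QueryExecutionContext={"Database": "access_logs"},
        ResultConfiguration={"OutputLocation": "s3://access-records-bucket/athena-results/"},
    )

Because Firehose's default prefixes are not Hive-style (year=.../month=.../day=...), new partitions have to be registered with ALTER TABLE ... ADD PARTITION or a Glue crawler rather than MSCK REPAIR TABLE.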

Usage: the original raw data can be queried at any point in time, e.g. if a later processing step fails or some information is needed from the original records. The data will be available for querying through Athena.
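
For example, a query against the raw table can restrict the partition columns so that Athena only scans one day's worth of objects (names as in the sketch above):

    import boto3

    query = """
    SELECT payload.userId, payload.requestURL, payload.responseStatus
    FROM access_logs_raw
    WHERE year = '2023' AND month = '05' AND day = '12'
    LIMIT 100
    """

    boto3.client("athena").start_query_execution(
        QueryString=query,
        QueryExecutionContext={"Database": "access_logs"},
        ResultConfiguration={"OutputLocation": "s3://access-records-bucket/athena-results/"},
    )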

Create an ETL job

Once the raw data is available, we want to process it further with an ETL (extract, transform, load) job.
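
A minimal sketch of such a Glue job in PySpark, assuming the catalog names used above: it reads the raw table, flattens the nested payload struct, and writes the result as partitioned Parquet. Since payload already repeats stack and instance, the top-level copies are dropped here to avoid duplicate column names:

    import sys
    from awsglue.utils import getResolvedOptions
    from awsglue.context import GlueContext
    from awsglue.job import Job
    from pyspark.context import SparkContext

    args = getResolvedOptions(sys.argv, ["JOB_NAME"])
    glue_context = GlueContext(SparkContext())
    job = Job(glue_context)
    job.init(args["JOB_NAME"], args)

    # Read the raw table from the Glue Data Catalog.
    raw = glue_context.create_dynamic_frame.from_catalog(
        database="access_logs",
        table_name="access_logs_raw",
        transformation_ctx="raw",
    )

    # Flatten the nested payload struct into top-level columns.
    flat = raw.toDF().select("year", "month", "day", "payload.*")

    # Write the processed data as partitioned Parquet, queryable from Athena.
    flat.write.mode("append").partitionBy("year", "month", "day").parquet(
        "s3://access-records-bucket/processed/"
    )

    job.commit()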

...

  1. Since we deploy a new stack every week, the ETL job should not reprocess old, already-processed data (see the sketch after this list).

  2. How will we maintain versioning of the Glue job script, i.e. define infrastructure that takes the script and pushes it to S3 (the script deployment process)?

  3. How should the job be triggered: on demand, on a schedule, or in sequence with a crawler? And what should the partitioning scheme of the processed records be?

  4. How do we handle duplicate data? (Also addressed in the sketch below.)
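
On points 1 and 4, one plausible approach (an assumption, not a settled decision) is Glue job bookmarks plus an explicit deduplication step. This fragment continues the job sketch above, reusing its glue_context:

    # Bookmarks are enabled on the job itself with the job parameter
    # --job-bookmark-option job-bookmark-enable; the transformation_ctx on
    # the read is the key the bookmark state is tracked under, so reruns
    # skip input that was already processed.
    raw = glue_context.create_dynamic_frame.from_catalog(
        database="access_logs",
        table_name="access_logs_raw",
        transformation_ctx="raw_bookmarked",
    )

    flat = raw.toDF().select("year", "month", "day", "payload.*")

    # Drop duplicates on a record key; sessionId plus timestamp is an
    # assumption about what uniquely identifies an access record.
    deduped = flat.dropDuplicates(["sessionId", "timestamp"])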

Processed data destination

...