...


Introduction

In the new data warehouse architecture, raw data is processed by AWS Glue (ETL) jobs and stored in a destination S3 bucket. Creating or updating the ETL jobs requires a Jenkins job, which creates the stack and the required resources.

Create/Update the python script

Synapse-ETL-Jobs is the Synapse project where all of the Python scripts are maintained. To create a new job, or to update an existing job in a way that changes its Python script, first make the change in Synapse-ETL-Jobs and build. Every merge into the project creates a new tag version (e.g. v1.63.0). Always use the latest tag version.
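Since "always use the latest tag version" means comparing tags like v1.63.0 numerically rather than alphabetically, a small sketch of how such a tag can be picked (the helper below is illustrative and not part of the Synapse-ETL-Jobs project):

```python
# Hypothetical helper: pick the newest tag from a list of version tags
# shaped like "v1.63.0". Lexicographic sorting would wrongly rank
# "v1.9.0" above "v1.63.0", so we compare numeric components instead.

def latest_tag(tags):
    """Return the highest version tag, comparing numeric parts."""
    def key(tag):
        # "v1.63.0" -> (1, 63, 0)
        return tuple(int(part) for part in tag.lstrip("v").split("."))
    return max(tags, key=key)

print(latest_tag(["v1.9.0", "v1.63.0", "v1.10.2"]))  # v1.63.0
```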

Create cloud-formation

Synapse-Stack-Builder is the project where all of the CloudFormation templates are maintained. To create the ETL Glue jobs and Glue tables, we have added a template (etldatawarehouse-jobs-template.json.vpt) which creates the jobs and the required resources. Also update the Synapse-ETL-Jobs tag to the latest version in the datawarehouse-config.json file.

The CloudFormation stack for ETL jobs should be parameterized by database name, so that on the dev environment engineers can work on the same project without colliding with each other. Each stack name includes the database-name parameter, and the job name and all resources tied to the job contain the database name, e.g. the job trigger and the destination path (bucket/databasename/tablename).
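The naming scheme above can be sketched as a small function. The exact name patterns below are illustrative assumptions (only the bucket/databasename/tablename destination pattern comes from this page, not the actual templates):

```python
# Hypothetical sketch of deriving per-developer resource names from the
# database-name stack parameter, so two engineers' stacks never collide.

def resource_names(database_name: str, table_name: str) -> dict:
    """Derive stack, job, trigger, and destination names (illustrative)."""
    return {
        "stack_name": f"etl-jobs-{database_name}",
        "job_name": f"{database_name}_{table_name}_job",
        "trigger_name": f"{database_name}_{table_name}_trigger",
        # destination pattern from the text: bucket/databasename/tablename
        "destination": f"s3://dev.datawarehouse.sagebase.org/{database_name}/{table_name}",
    }

print(resource_names("johndoe", "processedaccess")["destination"])
# s3://dev.datawarehouse.sagebase.org/johndoe/processedaccess
```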

Jenkins job

The Jenkins job build-synapse-dev-etl-stack is the dev job. This job has three parameters.

...

Once the job succeeds, it creates a stack with the database name given in the parameter. It also creates a Glue job, a trigger for the job with the database name, and a new database with the given name. Once the job runs, either manually or every hour via the scheduler, the processed data is stored at dev.datawarehouse.sagebase.org/databasename/tablename. We can then query this data in Athena.
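Querying in Athena then amounts to selecting from the per-developer database and table the stack created. A minimal sketch (the database and table names are illustrative placeholders; in practice the statement could be submitted with boto3's Athena client, e.g. start_query_execution):

```python
# Hypothetical sketch: build a simple Athena SQL statement against the
# Glue table created by the stack. Names are illustrative examples.

def athena_query(database_name: str, table_name: str) -> str:
    """Build a SELECT over the per-developer database and table."""
    return f'SELECT * FROM "{database_name}"."{table_name}" LIMIT 10;'

print(athena_query("johndoe", "processedaccess"))
# SELECT * FROM "johndoe"."processedaccess" LIMIT 10;
```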