ETL Deployment Workflow

1 Introduction
2 Create/Update the python script
3 Create cloud-formation
4 Jenkins job

Introduction

In the datawarehouse architecture, the raw data is procssed by AWS Glue jobs (ETL) and stored in a destination s3 bucket. Creating or updating the ETL jobs requires Jenkins job which creates the stack and required resources.

Create/Update the python script

Synapse-ETL-Jobs is a Synapse project where all the python scripts are maintained. To create a new job or update an existing job requires the changes to be made to python script, first we need to make the changes in Synapse-ETL-Jobs and build. Every merge in the project creates a new tag (e.g. v1.63.0). Always use the latest tag version.

Create cloud-formation

Synapse-Stack-Builder is the project where all of the CloudFormation scripts are maintained. To create ETL Glue jobs and Glue tables, we have added the templates(datawarehouse-template.json.vpt) which creates the required resources. Also update the latest version of Synapse-ETL-Jobs tag in datawarehouse-config.json file.

The CloudFormation stack for ETL jobs should be parameterized by database name, so that on dev environment , engineers can work on same project without colliding with each other. Basically each stack will create a stack name with parameter name(database name) and then the job name and all the resources tied to the job will contain the database name name in it e.g job trigger, destination path like bucket/databasename/tablename.

Jenkins job

Jenkins job build-synapse-dev-etl-stack is a dev job. There are three parameter for this job. This job will create a stack with required resources like Glue jobs with schedulers, Glue database etc. Once the job is run manually or by scheduler, the processed, queryable data will be kept at dev.datawarehouse.sagebase.org/databasename/tablename. Use Athena to query this data.

DATABASE_NAME : On dev, use a stack instance name that is unique among all dev stacks in the account. For production, the database name should be “synapsedatawarehouse”.
GITHUB_REPOSITORY : The value for this parameter should be the repository owner's name. Default value is Sage-Bionetworks which is our main repository owner. To run the job on your own forked branch, use the owner name of your forked repository.
GITHUB_BRANCH : Default value is develop. To run on your forked repository branch use branch name (e.g PLFM-7967).