ETL Deployment Workflow

Introduction

Introduction : In datawarehouse architecture, the raw data is procssed by aws-glue jobs (etl) and stored in destination s3 bucket. To create or update the ETL jobs needs jenkins job which create the stack and required resources.

Create/Update the python script

Synapse-ETL-Jobs is Synapse project where all the python scripts are maintained. To create a new job or update an existing job which requires the processing of python script, first we need to do the changes in Synapse-ETL-Jobs and build. Every merge in the project creates a new version of tag (e.g. v1.63.0). Always use the latest tag version.

Create cloud-formation

Synapse-Stack-Builder is the project where all of the cloud-formation scripts are maintained. To create ETL glue job and Glue tables we have added the templates(datawarehouse-template.json.vpt) which creates the required resources. Also update the latest version of Synapse-ETL-Jobs tag in datawarehouse-config.json file.

The cloud-formation stack for ETL jobs should be parameterized by database name. So that on dev environment , engineer can work together on same project without coliding with each other. Basically the each stack will create a stack name with parameter name(database name) and then the job name and all the resources tied to the job will contain the database name name in it e.g job trigger, destination path like bucket/databasename/tablename.

Jenkins job

Jenkins job build-synapse-dev-etl-stack is dev job.There is three parameter of this job.

DATABASE_NAME : To build a stack run the job with unique database name as parameter. On dev we can use our instance name or full name which is unique in team hopefully also in big group. For production database name should be “synapsedatawarehouse” and always should use same database name.
GITHUB_REPOSITORY : Default value is Sage-Bionetworks which is our main repository. To run the job on your own forked branch use forked repository name.
GITHUB_BRANCH : Default value is develop which is our master branch of Sage-Bionetworks repository. To run on your forked repository branch use branch name PLFM-123 or develop for forked develop branch.

Once the job will be succeeded it should create a stack with database name given in parameter. Also it will create a glue job , a trigger for job with database name, a new database with the name given in database. once the job is run manually or every hour with scheduler, the processed data will be kept at dev.datawarehouse.sagebase.org/databasename/tablename. Now we can query this data in athena.