Master Scheduler

Lambda

The Master Scheduler will run on AWS Lambda, configured with a cron expression to execute every hour at 5 minutes past the hour (to avoid weird log rotation issues). We choose to run the scheduler hourly because data and metrics in Bridge change very slowly, so there's no point in updating an order of magnitude more frequently than hourly. For tasks that require a quick turnaround (on the order of minutes), we're better off creating ad-hoc requests (such as On-Demand Export or User Data Download) rather than trying to schedule something to run every minute.

Note that because of the way the Master Scheduler runs, we can still schedule jobs to run every 30 minutes (or similar). Their requests will just be sent in a batch when the Master Scheduler runs. As long as the workers key off of the times listed in the request rather than the time the request is received, this should be fine. (See below for details.)

Using https://github.com/travis-ci/dpl/issues/397, we can now use Travis to deploy to Lambda. This way, we can have a Lambda for each of dev, uat, and prod.

DynamoDB Tables

Previously, each scheduler had a single table, and all configs for all envs would go into that single table. In the Master Scheduler, we'll use the scheduler name as the DDB table prefix. This allows us to separate dev, uat, and prod.
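As a rough illustration of the prefixing, the table names could be derived from the scheduler name like this. The "-config" and "-status" suffixes and the scheduler names shown are assumptions made for this sketch, not names taken from the prototype.

```python
def table_names(scheduler_name: str) -> dict[str, str]:
    """Build the DynamoDB table names for one scheduler.

    The suffixes are hypothetical; only the idea of using the scheduler
    name as a prefix comes from the design above.
    """
    return {
        "config": f"{scheduler_name}-config",
        "status": f"{scheduler_name}-status",
    }


# Hypothetical scheduler names, one per environment.
dev_tables = table_names("BridgeMasterScheduler-dev")
prod_tables = table_names("BridgeMasterScheduler-prod")
```

Because the environment is part of the scheduler name, dev and prod never share a table.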

DynamoDB was chosen over RDS (MySQL) because DynamoDB is easier to use and maintain, and because we never need anything more complex than hash lookups and table scans over a relatively small table.

Config

The config table has an entry for each schedule the scheduler needs to run. The config table has the following fields:

scheduleId (hash key) - string ID used to uniquely identify a schedule. Examples: "test-schedule", "Bridge-EX-Scheduler-dev", "Bridge-FitBit-Scheduler-uat", "Bridge-Reporter-Scheduler-prod"

cronSchedule (string) - Cron expression indicating when the schedule should be run. Uses Quartz notation. The first clause is seconds, and the remaining clauses are your standard 5 cron clauses. (Note that one of day of month or day of week must be a question mark.) Examples: "0 15,45 * * * ?" (twice per hour, at minute 15 and minute 45), "0 0 * * * ?" (every hour on the hour), "0 0 3 * * ?" (every day at 3am). This cron expression assumes UTC, to avoid weird timezone and DST issues.

requestTemplate (string) - Template for the request to be sent to the configured SQS queue. This may contain template variables, which are substituted by the scheduler. See below for more information on template variables. This is historically JSON, but there's no reason it can't be something else.

sqsQueueUrl (string) - SQS queue URL to send requests to
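To make the field list concrete, here is a sketch of a config row as a plain dictionary, with a small sanity check for the rules described above. The validator function and the queue URL are hypothetical and not part of the prototype. The check assumes exactly six cron clauses, as described above, and does not fully parse the cron expression.

```python
REQUIRED_FIELDS = ("scheduleId", "cronSchedule", "requestTemplate", "sqsQueueUrl")


def validate_config_row(row: dict) -> list[str]:
    """Return a list of problems found in a config row (empty if none)."""
    problems = [f"missing field: {name}" for name in REQUIRED_FIELDS if not row.get(name)]
    cron = row.get("cronSchedule")
    if cron:
        clauses = cron.split()
        if len(clauses) != 6:
            problems.append(f"cronSchedule needs 6 clauses, found {len(clauses)}")
        else:
            # Clause order: seconds, minutes, hours, day of month, month, day of week.
            day_of_month, day_of_week = clauses[3], clauses[5]
            if (day_of_month == "?") == (day_of_week == "?"):
                problems.append("exactly one of day of month or day of week must be '?'")
    return problems


example_row = {
    "scheduleId": "test-schedule",
    "cronSchedule": "0 15,45 * * * ?",
    "requestTemplate": '{"endDateTime":"${startOfHour}"}',
    # Placeholder URL for illustration only.
    "sqsQueueUrl": "https://sqs.us-east-1.amazonaws.com/000000000000/example-queue",
}
```

Calling validate_config_row(example_row) returns an empty list. A row whose cron expression uses "*" for both day of month and day of week is reported as a problem.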

Status

The status table keeps track of the last time the scheduler ran, so that it can remember what needs to be scheduled. This table contains only a singleton row, which contains status information for the scheduler. The status table has the following fields:

hashKey (hash key) - Arbitrary string because DynamoDB needs a hash key. Always "BridgeMasterScheduler".

lastProcessedTime (long) - Epoch milliseconds representing the last time the Master Scheduler ran.

Scheduler

Prototype in https://github.com/DwayneJengSage/BridgeMasterScheduler

When the scheduler is woken up, it does the following:

  1. It checks the status table for the lastProcessedTime and uses that and the current time (snapshotted as endTime) to compute the window for tasks that it needs to schedule. (If the lastProcessedTime is not present, for example because this is the first time the scheduler runs, it falls back to a start time of one hour ago.)
  2. It scans the config table. For each row in the config table, it uses the cron expression to determine which times and how many times to run. It then creates requests for those schedules and times and sends them to the configured SQS queue.
  3. It updates the status table, using the endTime as the new lastProcessedTime.
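The three steps above can be sketched as follows. This is a simplified illustration, not the prototype's code. It makes these assumptions:

- The status row is a plain dict and `send` is a caller-supplied callback standing in for the SQS client.
- Template substitution is left out.
- The window is treated as (start, end], which is an assumption.
- The cron matcher handles only "*", "?", and comma-separated numbers, and only schedules that fire at second 0.

```python
from datetime import datetime, timedelta, timezone

EPOCH = datetime(1970, 1, 1, tzinfo=timezone.utc)


def _matches(clause: str, value: int) -> bool:
    """Match one cron clause; supports only '*', '?', and comma-separated numbers."""
    if clause in ("*", "?"):
        return True
    return value in {int(part) for part in clause.split(",")}


def fire_times(cron: str, start: datetime, end: datetime) -> list[datetime]:
    """Return the UTC minutes in (start, end] at which the cron expression fires."""
    second, minute, hour, day_of_month, month, day_of_week = cron.split()
    if second != "0":
        raise ValueError("this sketch only handles schedules that fire at second 0")
    times = []
    # First whole minute strictly after start.
    candidate = start.replace(second=0, microsecond=0) + timedelta(minutes=1)
    while candidate <= end:
        # Quartz numbers days 1 (Sunday) to 7 (Saturday); isoweekday() is 1 (Monday) to 7 (Sunday).
        quartz_dow = candidate.isoweekday() % 7 + 1
        if (_matches(minute, candidate.minute) and _matches(hour, candidate.hour)
                and _matches(day_of_month, candidate.day)
                and _matches(month, candidate.month)
                and _matches(day_of_week, quartz_dow)):
            times.append(candidate)
        candidate += timedelta(minutes=1)
    return times


def run_scheduler(config_rows, status, send, now):
    """One scheduler wake-up; `status` is a dict standing in for the singleton status row."""
    end_time = now  # snapshot once so the window has a fixed upper bound
    last_ms = status.get("lastProcessedTime")
    if last_ms is None:
        start_time = end_time - timedelta(hours=1)  # first run: look back one hour
    else:
        start_time = EPOCH + timedelta(milliseconds=last_ms)
    for row in config_rows:
        for process_time in fire_times(row["cronSchedule"], start_time, end_time):
            send(row["sqsQueueUrl"], row["scheduleId"], process_time)
    status["lastProcessedTime"] = (end_time - EPOCH) // timedelta(milliseconds=1)


# Usage: a twice-per-hour schedule picked up by a single hourly run.
sent = []
status = {}
rows = [{"scheduleId": "test-schedule", "cronSchedule": "0 15,45 * * * ?",
         "sqsQueueUrl": "https://example.invalid/queue"}]  # placeholder URL
run_scheduler(rows, status, lambda *args: sent.append(args),
              now=datetime(2018, 3, 22, 0, 5, tzinfo=timezone.utc))
```

In this run the window is 23:05 to 00:05 UTC, so two requests are sent, with process times of 23:15 and 23:45. This is why workers should key off of the time in the request rather than the time it arrives.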

Template Variables

The request template can contain template variables that look like "${startOfDay}" or "${yesterdaysDate}". These are automatically substituted by the scheduler, using the times generated by the cron schedule to calculate the template variable values. Dates and times are always generated assuming Seattle local timezone. Template variables include (example values assume a process time of 2018-03-21T17:09-07:00):

endOfPreviousDay - The last millisecond of the previous day. Example: 2018-03-20T23:59:59.999-07:00.

processTime - The process time of the scheduled event, as generated by the cron schedule. Example: 2018-03-21T17:09:00.000-07:00.

startOfDay - The start of the current day. Example: 2018-03-21T00:00:00.000-07:00.

startOfDayOneWeekAgo - The start of the day one week ago. Example: 2018-03-14T00:00:00.000-07:00.

startOfHour - The start of the current hour. Example: 2018-03-21T17:00:00.000-07:00.

startOfPreviousDay - The start of the previous day. Example: 2018-03-20T00:00:00.000-07:00.

todaysDate - Today's date (as a calendar date rather than as a timestamp). Example: 2018-03-21.

yesterdaysDate - Yesterday's date (as a calendar date rather than as a timestamp). Example: 2018-03-20.
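The variables listed above can be computed along these lines. This is a sketch rather than the prototype's implementation, and the function names are hypothetical. It assumes the caller has already converted the process time to a timezone-aware datetime in Seattle local time. Substitution uses Python's string.Template, which happens to use the same "${name}" syntax.

```python
from datetime import datetime, timedelta, timezone
from string import Template


def template_variables(process_time: datetime) -> dict[str, str]:
    """Compute template variable values from a timezone-aware local process time."""
    def iso(moment: datetime) -> str:
        return moment.isoformat(timespec="milliseconds")

    start_of_day = process_time.replace(hour=0, minute=0, second=0, microsecond=0)
    start_of_previous_day = start_of_day - timedelta(days=1)
    return {
        "endOfPreviousDay": iso(start_of_day - timedelta(milliseconds=1)),
        "processTime": iso(process_time),
        "startOfDay": iso(start_of_day),
        "startOfDayOneWeekAgo": iso(start_of_day - timedelta(days=7)),
        "startOfHour": iso(process_time.replace(minute=0, second=0, microsecond=0)),
        "startOfPreviousDay": iso(start_of_previous_day),
        "todaysDate": process_time.date().isoformat(),
        "yesterdaysDate": start_of_previous_day.date().isoformat(),
    }


def render_request(request_template: str, process_time: datetime) -> str:
    """Substitute ${name} placeholders; raises KeyError for an unknown variable."""
    return Template(request_template).substitute(template_variables(process_time))


# The example process time used above, with a fixed -07:00 offset for illustration.
process_time = datetime(2018, 3, 21, 17, 9, tzinfo=timezone(timedelta(hours=-7)))
rendered = render_request('{"date":"${yesterdaysDate}"}', process_time)
```

Here `rendered` is {"date":"2018-03-20"}. In practice the process time would carry a real timezone (for example zoneinfo.ZoneInfo("America/Los_Angeles")) rather than a fixed offset, so that the offset in the output follows DST.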

Examples

Bridge Exporter

{
  "endDateTime":"${startOfHour}",
  "useLastExportTime":true,
  "tag":"[scheduler=Bridge-EX-Scheduler-dwaynejeng,endDateTime=${startOfHour}]"
}

Bridge Reporter

{
  "service":"REPORTER",
  "body":{
    "scheduler":"Bridge-Reporter-Scheduler-dwaynejeng",
    "scheduleType":"DAILY",
    "startDateTime":"${startOfPreviousDay}",
    "endDateTime":"${endOfPreviousDay}"
  }
}

Bridge FitBit Worker

{
  "service":"FitBitWorker",
  "body":{
    "date":"${yesterdaysDate}"
  }
}