Speeding up final migration by removing counts

Every week when a Synapse production stack goes live, we immediately start migrating its content to the staging stack (where validation occurs before release). The staging stack starts empty, the initial migration brings it up to date (about 30hrs), then tables/search indices get built by workers while migration runs every day to keep both stacks relatively in sync. After validation is done, and before switching staging in production, a final migration occurs where both stacks are in READ_ONLY mode. The total downtime for switching the stack is 50-60 mins (including time to switch the NLBs and a couple of checks before going back online), during which our services are off-line.

As part of a data migration, we gather metadata (minId, maxId, count) about every type of object that needs to be migrated. That process takes 8-10 mins at the beginning and and the end of a migration process (i.e 15-20 mins out of the last migration which takes 45-50 mins).

The database query to gather metadata about a type is ‘SELECT min(ID), max(ID), count(ID) from <table>’, the columns involved are indexed but the ‘count(ID)’ requires a scan, which makes the query slow. The purpose of this research spike is to confirm that’s the case and propose faster solution.

Effect of count() on the query performance

To show how ‘count(ID)’ slows down the query, we execute it 3 times on the FILES table (large number of rows) with data not cached (worst case, need to read from disk) and cached.

query	cached	not_cached

query	cached	not_cached
select min(ID),max(ID),count(ID) from FILES	00:05.6 00:05.5 00:05.5	03:09 03:06 03:07
select min(ID), max(ID) from FILES	00:00.4 00:00.37 00:00.36	00:00.35 00:00.37 00:00.35

Note: The same experiment was repeated with other tables, with similar results.

Proposal

The counts are used by the Synapse-Migration-Utility to display the differences between the production and the staging stack at the beginning and at the end of migration. They provide a visual clue that something might have gone wrong.

The proposal is to remove the ‘count()’ from the type metadata query, and remove the display at the beginning and at the end as an experiment (i.e. no type change, just return null for the count). Then change the actual data structures (remove completely or make optional) and/or add another visual clue (diff on maxId?) when we’re satisfied with the experiment.