State of Tables in Beta

The Synapse tables feature has been in beta for a few months and has been growing in popularity. I thought it would be a good time to capture some of design decisions we made and document the current state of the feature.

Table Features

The tables feature allows Synapse users to store their tabular data in a structured "table". Each table has a schema that defines the columns' data types and names.

Column Types

Integer
String - Up to 1000 characters per cell.
Double
Date
Boolean
Entity - A link to any Synapse entity.
Link - Generic link to any URL.
File - A reference to a FileHandle and functions much like a BLOB.

Operation

Most table operation are "Asynchronous" meaning the client starts a job and is issued a JobId that is then used to fetch but the status and the result of the Job. The current supported operation are:

SQL-like Query - A SQL-like query language is supported to query against a single table (joins across tables are not supported). SQL includes pagination, sorting, grouping, basic aggregation functions, and query operations (equals, greater than, less than, not equals, null, not null, in, and like).
CSV/TSV upload - The upload operation can be used to both create and update rows of a table from a single CSV file. Currently, tables with 100M+ rows have been created with CSV uploads.
CSV/TSV download - SQL-like query results can be fetched as a CSV file that a client can download from Synapse.
Partial Row Set - A partial row-set allows a client to change a sub-set of the cells of a row. This feature is used by the Portal to create/update rows using the UI.
Row Set - A change set includes all cells of a row for both insert and update.
Fetch Rows by Version - Fetch specific versions of all rows specified in a RowReferenceSet.

Row Level Versions

Each operation that creates, updates or deletes a row in a table results in a new version of that row being appended to the table. This means the tables feature is an append-only database that maintains the state of all rows. While any version of a row can be fetched from its rowId and rowVersionNumber, only the current versions of all rows are reflected in the SQL-like queries.

Concurrency

The tables feature is intended to be used concurrently by many users at a time. Consistency is maintained by processing updates on a single table serially and blocking queries while the updates are being processed. This is enforced with a very simple two lock system:

Non-Exclusive Lock (Read Lock). - When a worker receives a query request, it first attempts to acquire a non-exclusive lock on the table. A non-exclusive lock can be issue immediately if there are zero to many non-exclusive locks already on the table. However, if an exclusive lock has been issued for that table then the non-exclusive lock acquisition fails immediately (non-blocking) and the worker returns the query request to the queue to be attempted later.
Exclusive Lock (Write Lock) - When a worker receives a create or update request to a table, it first attempts to acquire an exclusive lock on the table. If another exclusive lock is already issued to the table then the lock acquisition fails immediately and the update request is returned to the queue to be processed later. If there are any non-exclusive locks outstanding, then an exclusive precursor lock is placed on the table. This precursor lock will block the acquisition of all new locks on the table but will not interrupt the existing non-exclusive locks. At this point the worker is blocked while waiting for the non-exclusive locks to release (or time out). Once all outstanding non-exclusive locks are released (or time out), an exclusive lock is issued to the worker and update proceeds.

Updates are applied to a table in blocks of 25K rows. Each block will be issued a version number which also becomes the new version number of every row in that block. If all blocks are processed without failure then the metadata of each block is committed to the table's history in a single transaction making them a permanent part of the table's history. Any failure in any block results in a rollback of all blocks from that update. This makes each update to a table atomic regardless of the size of the update. Raw block data is stored in S3, while the metadata about the blocks is store in the main Synapse MySQL database.

Clustering

Worker Cluster

All table operations are executed by a cluster of workers. This includes, query, updates, uploads, and downloads. This worker cluster is maintained by Amazon's Elastic Beanstalk and is currently configured to scale up/down on CPU usage. We currently run with a minimum of four machines and have scaled up to eight machines under load. In addition to the number of machines, we also control the number of concurrent instance of each worker across the cluster using a counting semaphore system. We also control how many threads each worker type can acquire on each machine. Each worker is triggered by a timer. All worker communication is facilitated using Amazon's SQS. When a worker timer fires, it first attempts to grab one the semaphores for its worker types. If successful, it will pull for messages from its queue. SQS ensures the same message is only issued to one worker at a time and automatically re-issues messages if the worker fails to delete the message in a timely manner.

Database Cluster

While the raw "truth" data of a table is stored in S3, and tracked in a single MySQL database, a cluster of MySQL databases is used to support table queries. A secondary worker (see Worker Cluster above) listens for table change events and attempts to build an index for a table in one of the database instances in the query cluster. This worker will update a table's index while holding an exclusive lock on the table. This workers primary job is to determine the current version of each row and ensure it is reflected in the index. In the future we plan to "rebalanced" the query database cluster by redistributing table incidences based on the load of the of each database machine. Currently every table is indexed in the cluster but we plan to only index "active" tables in the future.