Document toolboxDocument toolbox

Table Version

The epic PLFM-4698 captures all of the use cases associated with table versions and/or history.  In this document we will summarize the key use cases, review how the tables feature currently works then present a design for the additional features.

Driving Use Case Summary

  1. Add support to allow a user to "freeze" the current state of a table including both the current row data and the current schema.  The freeze would be immutable such that subsequent changes to either the rows or schema will not change the previous freeze (PLFM-4890, PLFM-5335).
  2. The freeze would be reference-able such that reference to freeze can be added to historical records such as provenance (PLFM-2870, PLFM-4558).
  3. The freeze would be query-able such that a query run against the freeze will return results consistent with the state of the table at the time of the freeze (PLFM-4558).
  4. Add support to create a "freeze" of a view by creating a table containing the data from the view and then adding a freeze to the resulting table (PLFM-4247, PLFM-5335, PLFM-5433).
  5. Add support to revert a table to a previous state (PLFM-4607, PLFM-4639).
  6. Add support to add, update, or delete rows using primary keys defined by the user (PLFM-3815).
  7. Add support to remove all history from a table to speed up table build times (PLFM-4988).

Table Architecture Review

The Synapse Table feature is an append-only database.  For example, a user can start a new transaction that includes row addition, updates, and deleted.  To handle such a request, the Synapse table engine will calculate a "delta" that represents each row or column change to be appended to the table.  The engine will automatically batches the deltas into the blocks that are optimized for storage and retrieval.  Each block will contain as many deltas as will fit into memory.  This means the number of row changes in each block is a function of the "width" of the table.  Wide tables have fewer rows deltas per block than narrow tables.  Each block is assigned a version number.  This is the version number required to update a row.  Finally all blocks are grouped together with a transaction number that captures all of the changes from a single user's request. Each delta block is stored in AWS S3 and tracked in the main Synapse database.  The blocks of deltas are the "truth" of all Synapse tables. 


Table query is then supported by building an actual MySQL table, called the "table's index", in a separate MySQL database cluster.  Specifically, each index is built by applying blocks of deltas to the table in the order they were created.  Currently, we always apply all of the blocks of data when building a table's index.  Creating an index that represents a table version is as simple as building a index that includes all blocks from the beginning of the table's history up to, and including, the blocks marked in the version.


Feature Design

Table Version

TableEntities already implement VersionableEntity, but we are not currently capturing any transaction information in the table's version.  We propose extending the table's version to include a reference to a transaction ID.  The example in Figure 1 shows the current state of the table syn123.  We will use this example to demonstrate the proposal for how table version will function.

Figure 1

Figure 1.


In Figure 1. the blocks of row deltas are on the left (Red), the transactions are in the middle (Green), and the version are on the right (blue).  As depicted, syn123 has eight  blocks of row deltas (0-7), four transactions (82, 98, 101, & 105), and two table versions (syn123.1 & syn123.2).

As mentioned above, each time a users updates a table, all of the changes of each update are captured with a transaction identifier. In this example the first user update created three blocks of row deltas (0-2) which are identified as transaction 82.  The second update included two blocks of row deltas (3-4) identified by transaction 98, and so on.  Notice that first and last transactions (82 & 105 respectively), do not have a version a table version associated with it.  This is because table updates do not automatically create new table version.  Instead, the user must explicitly request that a new table version be created using one of the following two methods:

  1. POST /entity/{id}/table/transaction/async/start - We will add a new Boolean parameter to the TableUpdateTransactionRequest called 'newVersion'.   By default the 'newVersion' will be 'false'.  When 'newVersion' is set to 'true', if the transaction completes successfully, a new table version will be created that references the newly created transaction.
  2. PUT /entity/{id} - This method already has a parameter called 'newVersion' and is currently used as a way to create new version of FileEntites.  We will extend this method to do the same for TableEntites.  When 'newVersion' is set to 'true' (for a TableEntity), a new version will be created that referenced the last transaction of that table.

Table version References

Table versions can be referenced in the same way as file versions.  The following entity version nomenclature is already used throughout Synapse and for Entity DOIs:

syn<id>.<version_number>


It will also be possible to reference a table version in a table query using the same pattern:

select * from syn123.2 where foo = 'bar'

When a table version is referenced in a query, a table "index" will be built (if it does not already exist) using the transactions from the beginning of the table's history up to and including the transaction referenced by the version.  The query above references version two (syn123.2), so Synapse will run such a query against an index that has been built to include blocks 0 - 5.  Any changes applied to the table after version two will not be reflected in the query results.  So the query results for syn123.2 will not reflect the changes from transaction 105 (blocks 6 & 7).


Table Version History

The same API used to list the version history for FileEntities will be used to list a table's version history:  GET /entity/{id}/version.  However, unlike files that always have at least one version, a table's version history will be empty until a user actively creates the first version.  For tables, the List of VersionInfo objects will not include values for 'contentSize' or 'contentMd5', but all other fields will be provided.


View Freeze

Yet to be determined...


User Defined Primary Keys

Yet to be determined...