Data Grid Snapshots

This document describes the addition of a "snapshot" feature to the Synapse Grid services. It assumes the reader is familiar with the Synapse Grid implementation and its technical foundations. For more information, see Grid Implementation Using JSON Joy.

Ticket: PLFM-9307

Background

When a new data grid replica is created, it catches up by receiving, in sequence, every patch that has been created for the grid. A replica requests patch data by sending the hub a message containing its current clock data. The hub uses this clock data to identify the next patch to send to the client (if one exists).

As the grid document is updated and patches are created, the time needed to start a new replica and load the latest grid data grows linearly with the number of patches, because the following sequence occurs for each patch:

The client sends a message (via WebSocket) to the hub
The hub performs a database lookup to get the next patch
The hub downloads the patch data from S3
The hub sends the patch data to the client

This is inefficient because later patches may render earlier patches obsolete, and it does not scale to meet user performance expectations. Introducing “snapshots” gives a replica a way to get approximately up-to-date in one message.

In the future, we may also consider using snapshots as "restore points", which enable users to reset the grid session to a previous state.

Conceptual Design

For each grid session, Synapse will store 'snapshots'. A snapshot has the following properties:

The ID of the grid session to which it belongs
The snapshot's own ID (which must be unique at least within its grid session)
The date-time when the snapshot was created
The model data at that point in time
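
As an illustration, the properties above could map onto a record type like the one in the following TypeScript sketch. The field names (`gridSessionId`, `snapshotId`, `createdOn`, `modelFileKey`) and both helper functions are assumptions made for this example and are not taken from the Synapse schema. The sketch also assumes the model data is referenced by a storage key rather than stored inline.

```typescript
// Hypothetical shape of a snapshot record; all field names are illustrative.
interface GridSnapshot {
  gridSessionId: string; // the grid session to which the snapshot belongs
  snapshotId: number;    // unique at least within its grid session
  createdOn: string;     // ISO 8601 date-time when the snapshot was created
  modelFileKey: string;  // assumed pointer to the encoded model data
}

// Snapshot IDs only need to be unique within a session,
// so a globally unique key combines both IDs.
function snapshotKey(s: GridSnapshot): string {
  return `${s.gridSessionId}/${s.snapshotId}`;
}

// Returns the most recently created snapshot of a session,
// or undefined when the session has no snapshots.
function latestSnapshot(
  all: GridSnapshot[],
  gridSessionId: string,
): GridSnapshot | undefined {
  return all
    .filter((s) => s.gridSessionId === gridSessionId) // filter() copies, so the input is not reordered
    .sort((a, b) => Date.parse(b.createdOn) - Date.parse(a.createdOn))[0];
}
```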

Snapshots are automatically created by the system. A replica can use a snapshot to load a model more quickly than starting from patches. A user may also use a snapshot to roll back the grid to an earlier state.

The model data (the current state of a JSON CRDT model) can be serialized using the indexed binary encoding format described in the JSON CRDT specification. The json-joy library that the web client uses natively supports this encoding format. The encoded model can be saved to a file in S3, and a presigned URL for the file can be shared with clients to download and load the encoded model.

Replicas (including the internal replica) will be able to load the snapshot file to get close to an up-to-date state in one operation. Once a snapshot has been loaded, the replica can continue to receive patches to apply on top of the data loaded from a snapshot.

The system will always create a snapshot to initialize the grid session, and may periodically create snapshots as changes are made to the grid. The initial snapshot and any user-created snapshots will be preserved for the lifetime of the grid. Snapshots automatically created by the system will be pruned as they become stale.
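
The retention rule above can be sketched as a small selection function. This document does not define when a system-created snapshot becomes "stale", so the sketch assumes one possible policy: keep the `keepLatest` newest system snapshots and prune the rest. The `origin` field, the `SnapshotInfo` type, and the function name are hypothetical.

```typescript
// Hypothetical classification of how a snapshot came to exist.
type SnapshotOrigin = "initial" | "user" | "system";

interface SnapshotInfo {
  id: number;
  origin: SnapshotOrigin;
  createdOn: number; // epoch milliseconds
}

// Assumed staleness policy: a system snapshot is stale once it is no longer
// among the `keepLatest` newest system snapshots. Initial and user-created
// snapshots are never selected for pruning.
function selectSnapshotsToPrune(
  snapshots: SnapshotInfo[],
  keepLatest: number,
): SnapshotInfo[] {
  const systemNewestFirst = snapshots
    .filter((s) => s.origin === "system")
    .sort((a, b) => b.createdOn - a.createdOn);
  return systemNewestFirst.slice(Math.max(0, keepLatest));
}
```

A count-based policy is only one option; an age-based cutoff would work with the same function signature.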

Technical Design

Synapse can serialize the internal replica's data using the indexed encoding format. We choose the indexed encoding because its format aligns with our current database representation of the grid data, and its flat structure does not require us to recursively build the model tree.

When a new grid is initialized, Synapse will create a snapshot including the initial grid data, which all replicas can load from. Synapse may periodically create new snapshots to ensure new replicas can connect quickly regardless of the number of patches that have been applied over the lifetime of the session (PLFM-9465).

Communication

We must update our grid communication protocol to support usage of snapshots.

Loading a new replica from a snapshot

Today, a replica may request new patches with a synchronize-clock message. Using the clock provided by the replica, the server responds with a patch:

Request:
	[1, 25, "synchronize-clock", <current-replica-clock-arr> ]
Response:
	[1, 25, "patch", <patch-content> ]

We will update the body of the message to be a JSON object whose type property indicates the type of the response:

{"type":"patch","body":<patch-content>}

If <current-replica-clock-arr> is an empty array (that is, the replica has not loaded any data), the server will respond with a 'snapshot' message, where the body property is a presigned URL to the snapshot file.

Response:
	[1, 25, "snapshot", {"type":"snapshot","body":<s3-presigned-url>}]
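
The server-side choice between the two response types might look like the following TypeScript sketch. It is not the Synapse implementation: the `SyncSources` interface and its two lookups are hypothetical, patch contents are represented as a string only as a placeholder, and returning `undefined` when no patch exists is an assumption, since this document does not specify that case.

```typescript
// Placeholder for <current-replica-clock-arr>; the element type is not specified here.
type ReplicaClock = unknown[];

type SyncResponseBody =
  | { type: "snapshot"; body: string } // body: presigned URL of the snapshot file
  | { type: "patch"; body: string };   // body: patch contents (string is a placeholder)

// Hypothetical lookups the hub would need; neither name comes from the Synapse code base.
interface SyncSources {
  latestSnapshotUrl(): string;
  nextPatchAfter(clock: ReplicaClock): string | undefined;
}

// Chooses the response body for a synchronize-clock request.
function buildSyncResponseBody(
  clock: ReplicaClock,
  sources: SyncSources,
): SyncResponseBody | undefined {
  if (clock.length === 0) {
    // The replica has not loaded any data: point it at a snapshot.
    return { type: "snapshot", body: sources.latestSnapshotUrl() };
  }
  // The replica already has data: send the next patch, if one exists.
  const patch = sources.nextPatchAfter(clock);
  return patch === undefined ? undefined : { type: "patch", body: patch };
}
```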

Replicas that have data

If the synchronize-clock message is sent with clock data (i.e. the replica has already loaded data from patches or a snapshot), the client will receive patches to get up-to-date. Other than the change to the message format, this behavior is unchanged.
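
On the client, the new message body could be validated and dispatched on its type property, as in the TypeScript sketch below. The `ReplicaHandlers` interface is hypothetical and stands in for whatever the replica does to download and load a snapshot or to apply a patch. No json-joy calls are shown, and treating patch contents as a string is an assumption.

```typescript
type IncomingSyncBody =
  | { type: "snapshot"; body: string } // body: presigned URL of the snapshot file
  | { type: "patch"; body: string };   // body: patch contents (string is a placeholder)

// Hypothetical hooks into the replica; the names are illustrative only.
interface ReplicaHandlers {
  loadSnapshot(presignedUrl: string): void;
  applyPatch(patch: string): void;
}

// Runtime check that a decoded message body has the expected shape.
function isSyncBody(value: unknown): value is IncomingSyncBody {
  if (typeof value !== "object" || value === null) return false;
  const v = value as { type?: unknown; body?: unknown };
  return (v.type === "snapshot" || v.type === "patch") && typeof v.body === "string";
}

// Dispatches a message body to the replica; returns false for unrecognized bodies.
function handleSyncBody(value: unknown, replica: ReplicaHandlers): boolean {
  if (!isSyncBody(value)) return false;
  switch (value.type) {
    case "snapshot":
      replica.loadSnapshot(value.body);
      return true;
    case "patch":
      replica.applyPatch(value.body);
      return true;
  }
}
```

After a snapshot has been loaded, the replica would send synchronize-clock again with its now non-empty clock and receive patches as before.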