Design: Agent Assisted Template Curation

Design: Agent Assisted Template Curation

Note: This document is a work in progress…

Prerequisites

Introduction

This document outlines the user experience for curating tabular data (tables, views, CSV files) that are already governed by a defined data model (e.g., JSON Schema). This process focuses on the core curation workflow, excluding the initial data model setup and post-merge data handling.

The curation process is structured around three key stages:

  1. Cloning for Sandbox Environment:

    • The process begins by creating a "sandbox" clone of the original data, ensuring that modifications do not impact the source data.

  2. Iterative Sandbox Curation and Validation:

    • Users iteratively explore and modify the cloned data within the sandbox.

    • This stage includes rigorous data validation against the predefined data model to maintain data integrity.

    • Iterations will incorporate AI-assisted curation.

    • In the future we will incorporate multi-user collaborations.

  3. Merging Curated Data:

    • Once satisfied with the curated results, users merge the changes from the sandbox back into the original data source.

This design specifically addresses the user interactions during the core curation process, from cloning to merging, and does not encompass any upstream or downstream operations.

Overview.drawio(1).png
Figure 1 -Three Stages of Curation

 

Scale

The scale of the data to be curated significantly impacts our design choices. We'll use the maximum number of cells in a grid to define the scale we need to support. For example, a small-scale grid with ten columns and one hundred rows yields a scale of one thousand cells (1K, 10 x 100). A large-scale grid, with one hundred columns and one hundred thousand rows, reaches a scale of ten million cells (10M, 100 x 100,000).

With a small-scale grid (1K cells), we could likely pass all cells for each operation and achieve results within a reasonable time-frame (e.g., a few seconds). For instance, sending all 1K cells to a validation module should return 1K validation results quickly. However, repeatedly passing 1K cells to an AI agent would rapidly exhaust its context window—the limited memory the AI uses to process information. After a few iterations, and definitely after a hundred iterations, the agent's performance would degrade significantly.

A small-scale grid could potentially be implemented entirely within the UI, provided the total number of iterations remains low and multi-user collaboration is not required.

However, this project mandates support for large-scale grids (over 10M cells). The naive approach of sending all data cells for each operation is untenable. For example, a number datatype, typically requiring 8 bytes, would translate to approximately 80 MB for 10M cells. While download speeds are generally adequate, upload speeds are often constrained. A user with a 5 Mbps upload speed would require over two minutes to upload 80 MB of data. Expecting a user to wait over two minutes after updating a single cell is unacceptable.

To accommodate large-scale data grids, we require a modular approach. Ideally, updating a single cell should necessitate the exchange of minimal data: the cell's address, previous value, and new value. Similarly, data validation should only process the affected cells and potentially a small surrounding context, rather than the entire grid.

Most importantly, the AI agents need to be able to assists without loading the entire grid into their context window. This can be done by providing the agent with functions to explore the grid:

  • Query - The agent has a function to execute SQL queries against the grid. These queries need to be paginated with limit and offset. Examples of some of SQL aggregation functions that should be available to the agent:

    • avg()

    • count()/count(distinct)

    • group_concat()

    • max()

    • min()

    • std()/stddev()/stddev_pop()/stddev_samp()

    • sum()

    • var_pop()/var_sample()/variance()

  • Descriptive Statistics - More sophisticated statistical techniques, recommended by Gemini, that we might want to explore:

    • Data Profiling and Anomaly Detection:

      • Descriptive Statistics:

        • Calculating mean, median, standard deviation, quartiles, and histograms to understand data distribution within columns. (We would get some of this from the basic query functions above).

        • Identifying outliers and anomalies based on statistical thresholds (e.g., values exceeding 3 standard deviations from the mean).

      • Distribution Analysis:

        • Determining the probability distribution of data within columns (e.g., normal distribution, skewed distribution).

        • Detecting deviations from expected distributions.

      • Correlation Analysis:

        • Identifying relationships between columns using correlation coefficients (e.g., Pearson's r, Spearman's rho).

        • Highlighting unexpected or significant correlations.

    • Data Validation and Imputation:

      • Statistical Validation Rules:

        • Defining validation rules based on statistical properties (e.g., values within a specific range, consistent with a particular distribution).

        • Using statistical tests (e.g., chi-squared test) to assess data conformance to predefined rules.

      • Statistical Imputation:

        • Filling in missing values using statistical methods (e.g., mean imputation, median imputation, regression imputation).  

        • Predicting missing values based on relationships with other columns.

    • Data Summarization and Visualization:

      • Statistical Summaries:

        • Generating concise statistical summaries of large datasets (e.g., key metrics, percentiles).

        • Providing insights into data characteristics.

      • Statistical Visualizations:

        • Creating visualizations (e.g., scatter plots, histograms, box plots) to represent data distributions and relationships.

        • Enabling users to quickly identify patterns and anomalies.

      • Dimensionality Reduction:

        • Using techniques like Principal Component Analysis (PCA) to reduce the number of columns, while retaining data information, for visualization and analysis.

    • Predictive Modeling and Pattern Recognition:

      • Regression Analysis:

        • Predicting values based on relationships with other columns.

        • Identifying factors that influence data values.

      • Clustering Analysis:

        • Grouping similar data points together.

        • Identifying patterns and segments within the data.

      • Classification:

        • Categorizing data based on patterns within the data.

    • Adaptive Sampling:

      • Stratified Sampling:

        • If the data is very large, the agent could use stratified sampling to ensure all portions of the data are represented in the samples that are sent to the AI.

      • Adaptive sampling based on data complexity:

        • If the agent detects areas of high variation or complexity, it could sample those areas at a higher rate.

 

The agent will also need the means to manipulate the data grid in a scalable fashion. While the agent might directly change a few individual cells, they will also need a mechanism for making large scale changes by providing instructions. For example, the agent might need to execute a complex operation such as:

update sandbox_template set column_two = 'bar' where column_one is not null and column_three > 12;

This is just an illustration to show how an agent might change hundreds of thousands for cells with a single operation.

The agents will be provided with the same functions that clients use to make small scale changes to the grid. In addition, the agents will be able to use the following API functions to both explore and change the grid at scale:

API

Description

API

Description

POST /grid/session/{sessionId}/sql/query

Agent can run a SQL query against the grid using GROUP BY and aggregation functions. The exact SQL syntax is built into the API object model so a SQL parser will not be needed.

PUT /grid/session/{sessionId}/sql/update

Agents can execute a SQL UPDATE statement to make sweeping changes to the grid. Note: The provided SQL is not directly applied to the grid. Instead, the SQL is used to generate a set of CRDT operations that will be applied to an agent’s grid replica.

Sandbox Grid

In this section, we will define exactly what a sandbox grid is and, equally importantly, what it is not. A sandbox grid is a tool for exploring and modifying a “clone” of tabular data. It is created by providing the following information:

  • A column header that includes the name of each column.

  • One or more rows where each row has one cell for each column from the header.

  • JSON Schema that defines the validation rules for the sandbox grid.

Each cell in a row can be interdependent with other cells in the same row. For example, the JSON Schema may define validation rules that consider the values of other cells in the same row. The number of cells in a grid is exactly the number of rows times the number of columns (there are no cells outside of a row or column).

Note: For the MVP, rows will not be interdependent with each other (i.e., parentId are not enforced).

A sandbox grid is not a spread sheet:

  • No formulas for calculating cell values such as “=(a1+b3)/$e$12”.

  • No functions such as “=SUM(B5:B7)”.

  • No cells outside of a row or column.

 

In Figure 1. we show that a sandbox grid session starts as “clone” from a table/view/csv. This means the grid will be loaded with all of the data from its source. The starting state of the session will be as follow:

  • Row:

    • Each row from the source will be represented in the grid.

    • The row order in the grid will match the order of the rows in the source.

    • Each row has a hidden immutable identifier (row_id).

  • Columns:

    • Each column name from the source will be represented in the grid.

    • The column order in the grid will match the order from the columns in the source.

    • Each column has a hidden immutable identifier (column_id).

  • Cells:

    • All cells from the source, including empty cells will be represented in the grid.

    • Each cell has an immutable identifier that is a combination of its row_id and column_id.

 

Once a user’s client loads the grid session initial state, the user will be able to perform the following operations on the grid:

  • Change cell values.

  • Add rows at a specified position (after a selected row).

  • Remove rows.

  • Add columns at a specified position (after a selected column).

  • Create a new version of the grid.

  • Rollback the version of the grid.

  • Undo/redo operations.

  • Copy/pastes cells & rows.

  • Request their agent assistant to perform any of the above operations.

  • “Merge” the grid back to the original source, to complete and close the grid session.

Row & Column Order

Since we need to be able add/remove both rows and columns the order of both is mutable. We also plan to support user driven row sorting in the future. It is important to note that both row and column ordering are globally shared by all clients/agents. This means if one user changes the sorting of the grid, all other users will see the sorting results. This is similar to how it works in other collaborative grid environments such as Google Sheets.

The following are the row ordering rules:

  1. Each row has a starting position that matches the “cloned” source position.

  2. Sorting is a global operation that changes the ordering of rows for all users/agents. This is the same behavior as other collaborative grid systems like Google Sheets.

  3. Changing a cell’s value does not trigger re-sorting. Users would not expect rows to “move” after entering a new cell value.

  4. Adding/Pasting rows does not trigger re-sorting.

  5. Undo/Redo/Versioning can “revert” the order of the rows to a previous state.

 

Agent Assistance

The following diagram illustrates a typical user/agent chat session:

BasicChatSession.drawio(2).png
Basic Chat Session

The Synapse Agent Chat Services uses a similar chat flow:

  1. User starts by prompting the agent with a request.

  2. In order to respond to a request the agent utilizes one or more “functions” to:

    1. Gather data from the system to contextualize the request.

    2. Make changes to the system.

  3. Once finished, the agent will then provide a response (usually text) to the user.

The flow is then repeated as many times as the user desires.

 

The agent will have the following functions within the grid sandbox session:

  • Load the grid session’s JSON schema that defines the validation rules of the grid.

  • Run SQL queries against the grid’s cell values.

  • Run Data Manipulation Language (DML) to change the grid’s cell values.

  • Get the user’s current cell selection.

  • Get the validation state for any cell in the grid.

  • Get recent cell changes made by the user.

In the near future, we plan to support concurrent multi-user collaboration on the same sandbox grid. Each user will have a private agent session to assist with curation. However, all data modifications made by any user or their agent will be immediately visible to all collaborators. For instance, if user A's agent modifies cell 123, the 'changedBy' attribute will indicate 'user=A, agent_assist=true,' and this attribution will be visible to all other users. User chat exchanges with their agents will remain private.

Since the user and agent can change the same cell at the same time, conflicts are possible even without multiple users. When we add support for multiple users, each with their own agent, the chance for conflicts will only increase. This means we need at least basic conflict resolution in the MVP, with potentially more sophisticated conflict resolution in the future.

Collaborative Grid Session

Before we tackle conflict resolution we need to cover exactly what it means to “collaborate” on a grid. A simple way to think about modern collaborative software is to imagine that each collaborator has their own “copy” of the document. In our case, that would mean each collaborator has a copy of the grid session. The most basic grid session will have at least three collaborative copies of the grid:

  1. User’s copy - The user’s copy of the grid which resides in the user’s web browser.

  2. Cloud Persistence Copy - A server-side copy used to persist the state of the grid to the cloud. This copy is also the first copy as it is created by “cloning” the source data.

  3. User’s Agent copy - The user’s agent copy of the grid that will be used by the agent to assist the user.

When we want to support multiple users, each additional user will also have their own copy, plus a copy for each user’s agent. For now we will limit our discussion to the first three copies of the grid. Consider Figure 2, which shows three copies of a grid (a.k.a. replicas): A, B, C that communicate using peer-to-peer (P2P):

PeerToPeer.drawio.png
Figure 2. Peer-To-Peer (P2P) Replication with three replicas.

 

The bi-direction arrows between each replica represent the bi-directional communication between replicas that is used to ensure changes from one replica propagate to all other replicas. For example, if user makes a change to their copy, say C, that change would then need to be broadcast to both A, and B. The P2P approach works for some use cases, but is not practical when more than one replica is expected to be in a web browser or other client. A better option is to use server-side “hub” where all other replicas becomes spokes:

SpokeWheel.drawio.png
Figure 3. Hub and Spoke Replication between five replicas.

In Figure 3, replica A, is the “hub” that connects to all of the other replicas (spokes). Client-side replicas do not need to directly communicate with each other, instead, all replica communication goes through the hub.

So what data is exchanged between replicas? We know that sending an entire grid of 10M cells for each cell change will not scale. We need a solution that can send a minimal amount of data for each change. We will cover the details of replica data exchange in a later section.

Grid Document

In the previous sections we outlined the types of operations that we need to support within the grid. We also showed that there will be at least three replicas of the grid: user’s client, hub (persistence store) and the user’s agent (see: Figure 4). We have labeled each replica with a letter (A, B, C) that is the replica’s ID. Each new replica added to the collaboration will be issued a unique replica ID.

ThreeMainReplicas.drawio.png
Figure 4. Three replicas identified as A,B, & C.

A grid document is simply a data model that captures the state of a single replica. When a new grid session is started, the first step will be to create the “hub” replica. This hub replica will start empty and then be populated with the “cloned” data from the original table/view/csv. Once populated, the data from the hub replica will be saved to the cloud. This cloud document will then serve as the starting point for each new replica that joins the collaboration.

Simple Grid Document

In this section we will define exactly what is in the grid “document” that defines each replica. Let us start by imagining what a grid document would look like if we only had a single client, and we did not need worry about collaboration (see Figure 5).

SimpleGrid.drawio.png
Figure 5. Simple Grid

Our simple grid is composted of three members:

  • colunNames:

    • Captures the names of the columns of the grid (column names are immutable).

    • Defines the order of the grids columns.

    • Adding/Removing names from this list will add/remove columns from the grid.

  • rowIds:

    • Captures the immutable row_ids of the grid.

    • Defines the global order of the rows in the grid.

    • A global sort operation would reorder this list.

    • Adding/Removing row_ids from this list will add/removing rows from the grid.

  • cells:

    • Each cell is uniquely identify by its row_id and column name (row_id,column_name) so the keys of this map are the concatenation of the two.

    • Each value in this map contains the string value of a single cell.

    • Blank cells are not in the map.

    • Cells are not added to the map when new rows/columns are added, since the new cells are blank.

    • Removing rows/columns should trigger the removal of all associated cells.

This simple document has enough information to capture the full state of a non-collaborative grid. How can we extend this basic document model to support distributed collaboration? Before we can do that, we first need to define the user expectation for modern distributed collaboration.

Distributed Collaboration Expectations

Modern distributed collaboration software such as Figma, Slack, and Google Docs have set the bar for user expectations. The minimum expectations can be summarized as follows:

  • Users can make rapid changes even when “disconnected” from other collaborators.

    • Users should not need to wait for the acquisition of remote locks before making changes.

  • Changes made by remote collaborators (users/agents) are automatically detected and applied locally.

    • If disconnected, changes from remote users/agents are applied locally shortly after connection is re-established.

  • Concurrent conflicts are automatically resolved by the application.

It turns out that using traditional data structures like Lists, Maps, Set, and even simple strings to build collaborative applications is complex. This has lead researchers to develop a new sub-field of data structures called Conflict-Free Replicated Data Types (CRDT). Basically, CRDTs are specialized data structures that we can use as building blocks for creating modern collaborative software. In the next section we will show how we can substitute some of the key elements from our basic grid documents with CRDT counterparts to lay the foundation our collaborative grid.

CRDT Grid Document

In an earlier section we defined a simple non-collaborative grid document using two mutable Lists and one mutable Map. The value of each cell was simply a mutable string value in the cell Map. In order to transform our simple gird document into a collaborative document we will need to replace each mutable data structure with a suitable CRDT equivalent.

CollaborativeGrid.drawio.png
Figure 6. Collaborative Grid Document

The collaborative grid document (Figure 6) has all of the same members as our simple non-collaborative grid (Figure 5) with the following transforms:

  • List → OrderListCRDT

  • Map → SetCRDT

  • Mutable String → RegisterCRDT

The VersionVector is a fairly simple structure that is the key to replica synchronization which we will cover in a dedicated section.

 

Note: OrderListCRDT, SetCRDT, & RegisterCRDT are three categories of CRDTs. For each category, there are multiple implementation options. This is analogues to List as a category, with LinkedList & ArrayList as two possible implementations.

 

Since each column_name and row_id are immutable they can be represented with a basic string (hence, OrderListCRDT<String>). However, since the value within a cell can change, we need a CRDT that can correctly handle the changes. This is why cells are defined as SetCRDT<RegisterCRDT> and not SetCRDT<String>.

 

If you are looking for a place to start learning about CRDTs, I found the following blog posts useful: Bartosz Sypytkowski's CRDT blog posts.

 

Operation-based CRDT

There are two broad categories of CRDTs:

  • State-based - Replicas synchronize by sending a copy of an entire CRDT object to each other.

    • Each replica is expected to merge a passed CRDT with their local CRDT.

    • The network bandwidth required to synchronize replicas depends on the size of the exchanged CRDTs.

    • Replicas will become eventually consistent even if some messages are lost or sent out-of-order.

  • Operation-based - Replicas synchronize by exchanging events that describe changes made locally.

    • Each replica is expected to apply the change to their local copy of the CRDT.

    • The network bandwidth required to synchronize is independent of the size of the original CRDT object.

    • Replica must process all event messages in causal order.

In Figure 6, we defined our collaborative grid document. It is important to note that the grid document itself is not a CRDT, but rather is is composed of fields that are CRDTs. The largest CRDT in the grid document is the cells field which contains all of the cells for the entire grid. If one replica were to add/remove a cell from this set, a state-based approach would require that the entire set be sent to all other replicas. Sending a CRDT containing the data of 10M cells for each update will not scale. This means we need to take an operation-based approach, where adding/removing a cell in one replica only requires an event message describing the change to be sent to all other replicas.

While an operation-based approach only requires simple event exchanges between replicas, the event exchange protocol must be: Reliable Causal Broadcast (RCB). The following post outlines the RCB requirement and includes several suggestion on how to build such a protocol: Operation based CRDTs: protocol. Our suggested protocol is a variation on the author’s: Allowing replication of other peer's events.

Operation-based Synchronization Protocol

The operation-based synchronization protocol is as follows:

  • When a user/agent makes a change to a local replica, a uniquely identified event describing the change is sent to the hub.

  • When a replica receives an event from another replica it will do one of the following:

    • If the event is out-of-sequence, the event must be ignored, and synchronization with the hub should be initiated.

    • If the event is in-sequence, the change will be applied to the local replica.

  • The hub replica acts as a event broker where events from one replica are broadcast to all other replicas.

  • Disconnects will happen. This means both out-going and in-coming events will occasionally be lost.

  • Each replica can synchronize with the hub by sending its current state (Version Vector covered below) to the hub.

    • The hub will respond by sending the replica any missing events from other replicas.

    • The hub will request that the replica re-send any events that the hub is missing.

The above protocol ensures that all events are eventually process, in order, by all replicas in the collaboration.

Event Identifier

The protocol requires that all events_ids to the the following characteristic:

  • Each event_id is unique across all replicas

  • We can determine the replica that originated the event.

  • We can determine the order of the events from a single replica.

We can use the following information to enable each replica to issue its own event_ids meeting our criteria:

  • Each replica is issued a unique identifier. In Figure 4, we have three replicas identified by A, B, & C.

  • Each replica maintains a sequence number that is incremented for each change originating from it.

An event_id is simply the concatenation of these two values:

event_id = unique_replca_id + ':' + sequence_number

For example, if we started with an empty grid, each replica’s sequence number would be zero. If a user added a row to replica C, then replica C would increment its sequence by one and generate an event_id = C:1. A second change from same replica would have event_id = C:2 and so on. We will also use the same identifier scheme to help us track the exact state of any replica in the collaboration in the next section.

 

Version Vector

We need a system to defining the exact state of any replica in the collaboration. A simple solution to this complex problem is to use a Version Vector (VV). Each replica is expected to maintain a VV with an entry for each replica in the collaboration. This is captured in the grid document as version (see: Figure 6).

For example, in our collaboration with three replicas, A, B, C, the starting state of each replica would be:

replica

version (VV)

replica

version (VV)

A

[A:0, B:0, C:0]

B

[A:0, B:0, C:0]

C

[A:0, B:0, C:0]

Now if a user makes a change to replica C and the same event has not yet been applied to replica A or B, then the state of each replica would be:

replica

version (VV)

replica

version (VV)

A

[A:0, B:0, C:0]

B

[A:0, B:0, C:0]

C

[A:0, B:0, C:1]

The fact that the state of both A and B include C:0 tells us that these two replicas are missing a single change from replica C. Next, let us apply the change C:1 to replica A:

replica

version (VV)

replica

version (VV)

A

[A:0, B:0, C:1]

B

[A:0, B:0, C:0]

C

[A:0, B:0, C:1]

After C:1 is applied to replica A, replica A’s new state becomes [A:0, B:0, C:1] while B’s state remains unchanged. Next, we show what would happen a user changed replica B before the the event C:1 could be applied to B.

replica

version (VV)

replica

version (VV)

A

[A:0, B:0, C:1]

B

[A:0, B:1, C:0]

C

[A:0, B:0, C:1]

We now have a system for comparing the state of each replica within the collaboration.

The following table shows four examples of comparing the VV from replica A to the VV from replica B:

example

A.version

B.version

results

Description

example

A.version

B.version

results

Description

one

[A:2, B:9, C:5]

[A:2, B:9, C:5]

synchronized

The two replicas are synchronized

two

[A:2, B:9, C:5]

[A:2, B:11, C:5]

non-conflicting

A is missing 2 non-conflicting changes from B.

three

[A:3, B:11, C:5]

[A:2, B:11, C:5]

non-conflicting

B is missing 1 non-conflicting change from A.

four

[A:5 B:11, C:5]

[A:3, B:14, C:5]

potentially-conflicting

A is missing 3 changes from B and B is missing 2 changes from A. It is possible that the missing changes conflict.

  • In example “one”, the two VV match, so we know that both replica A and B are synchronized with each other.

  • In example “two” we can see that replica A is missing two events from replica B. Since replica A did not change, we know that the two changes from B are non-conflicting with replica A.

  • Example, “three” is similar to two, but with a changes from replica A instead of B.

  • With example “four”, both replica A and B have change that the other has not seen. The VV does not tell us exactly when these changes occurred but we should assume that the changes are potentially-conflicting with each other.

 

AsycApi Operation-based Synchronization Protocol

We will be using Web Sockets to support the bi-directional AsyncAPI. If you paste the following JSON into studio.asyncapi.com you can see a user friendly representation of the API:

{ "asyncapi": "3.0.0", "info": { "title": "Collaborative Grid Operations API", "version": "1.0.0", "description": "API for synchronizing collaborative grid state using operation-based CRDTs over WebSockets." }, "servers": { "ws": { "protocol": "wss", "host": "{baseUrl}:{port}", "variables": { "baseUrl": { "default": "localhost", "description": "The base URL of the WebSocket server." }, "port": { "default": "8080", "description": "The port of the WebSocket server." } } } }, "channels": { "grid": { "address": "/grid", "description": "Channel for exchanging CRDT operations for a specific grid session.", "bindings": { "ws": { "method": "POST", "query": { "type": "object", "properties": { "gridSessionId": { "type": "string", "description": "The unique identifier for the grid session" }, "replicaId": { "type": "string", "description": "The unique identifier of the subscribing replica." } } }, "headers": { "type": "object", "properties": { "Authorization": { "type": "string", "description": "Value must be 'Bearer <token>'. This is the same header required for all authetnicted HTTP requests." } } } } } } }, "operations": { "eventsSend": { "title": "Send Events", "description": "Send live events to all other replicas in the collaboration.", "channel": { "$ref": "#/channels/grid" }, "action": "send" }, "eventsReceive": { "title": "Receive Events", "description": "Receive live events from all other replicas in the collaboration. If an event is received out-of-order (event_id > version.replica_id.sequence + 1), it should be ignored, and sychronization should be started. Duplicate events should also be ignored.", "channel": { "$ref": "#/channels/grid" }, "action": "receive" }, "sendVersion": { "title": "Send Version", "description": "Send a replica's version to the hub to start the synchroniziation process. If the hub detects the sender is missing event, then then the missing events will be sent to be sent to: eventSend", "channel": { "$ref": "#/channels/grid" }, "action": "send" }, "receiveVersion": { "title": "Receive Version", "description": "Receive the current version of the hub. If the hub's vesion does not match the replica's version, all missing events that orginate for recipenant should be sent to: eventSend.", "channel": { "$ref": "#/channels/grid" }, "action": "receive" } }, "components": { "messages": { "GridOperation": { "summary": "Represents a CRDT operation to be applied to the grid state.", "payload": { "type": "object", "properties": { "replicaId": { "type": "string", "description": "The unique identifier of the replica that generated this operation." }, "operationType": { "type": "string", "description": "The type of CRDT operation (e.g., 'insertRow', 'updateCell', 'deleteColumn')." }, "data": { "type": "object", "description": "The data associated with the operation." } }, "required": [ "replicaId", "operationType", "data" ] } } }, "schemas": { "VersionVector": { "type": "object", "properties": { "vector": { "type": "array", "items": { "type": "string" } } } } }, "securitySchemes": { "bearerAuth": { "type": "http", "scheme": "bearer", "bearerFormat": "JWT" } } } }

Model Validation

One of the required parameters when starting a grid session is the data model that defines the data validation rules for the grid. The data model is expected to be a JSON Schema. As mentioned above, cells within the same row can be interdependent with each other. For validation, this means one cell value might restrict the valid values for one or more other cells in the same row. This type of cross row validation is typically defined with a JSON schema condition. Following our sample grid from above, consider the following JSON schema with cross row validation:

{ "type": "array", "items": { "type": "object", "properties": { "type": { "type": "string", "enum": [ "cat", "dog", "rat" ], "description": "The type of animal" }, "age": { "type": "integer", "description": "The age of the animal" } }, "required": [ "type", "age" ], "if": { "properties": { "type": { "const": "cat" } } }, "then": { "properties": { "age": { "maximum": 20 } } }, "if": { "properties": { "type": { "const": "dog" } } }, "then": { "properties": { "age": { "maximum": 15 } } }, "if": { "properties": { "type": { "const": "rat" } } }, "then": { "properties": { "age": { "maximum": 3 } } } } }

This schema shows that valid values for type are: cat, dog, & mouse. It also shows that age is an integer with a maximum value that is dependent on type. Both age and type are also required values.

 

Server-Side Validation

The server-side validation module is tasked with the following:

  • Listen to all cell level grid events

  • Broadcast validation change messes

Consider a row where type=rat, and age=2. If the user wants to correct the value of the row to have a type=cat, and age=12, they should be able to change age first, without being blocked. This means age will temporarily be invalid since type=rat has a max age of 3. Once the user changes the type to be cat, age will once again become valid.

Open API Specification

The definition of the REST APIs used to create and manage a grid session. Copy & paste the following JSON into: editor.swagger.io to see the generated documentation.

{ "openapi": "3.0.1", "info": { "title": "Synapse REST API", "version": "v1" }, "servers": [{"url": "https://repo-prod.prod.sagebase.org"}], "tags": [{ "name": "Grid Services", "description": "Services for create and managing data curation sandbox grid." }], "paths": { "/repo/v1/grid/session/async/start": {"post": { "tags": ["Grid Services"], "operationId": "post-/repo/v1/grid/session/async/start", "parameters": [], "requestBody": { "content": {"application/json": {"schema": {"$ref": "#/components/schemas/org.sagebionetworks.repo.model.grid.CreateGridRequest"}}}, "required": true }, "security": [{"bearerAuth": []}], "responses": {"201": { "description": "Auto-generated description", "content": {"application/json": {"schema": {"$ref": "#/components/schemas/org.sagebionetworks.repo.model.asynch.AsyncJobId"}}} }} }}, "/repo/v1/grid/session/async/get/{asyncToken}": {"get": { "tags": ["Grid Services"], "operationId": "get-/repo/v1/grid/session/async/get/{asyncToken}", "parameters": [{ "name": "asyncToken", "in": "path", "required": true, "schema": {"type": "string"} }], "security": [{"bearerAuth": []}], "responses": {"200": { "description": "Auto-generated description", "content": {"application/json": {"schema": {"$ref": "#/components/schemas/org.sagebionetworks.repo.model.grid.CreateGridResponse"}}} }} }}, "/repo/v1/grid/{sessionId}": {"get": { "tags": ["Grid Services"], "operationId": "get-/repo/v1/grid/{sessionId}", "parameters": [{ "name": "sessionId", "in": "path", "required": true, "schema": {"type": "string"} }], "security": [{"bearerAuth": []}], "responses": {"200": { "description": "Auto-generated description", "content": {"application/json": {"schema": {"$ref": "#/components/schemas/org.sagebionetworks.repo.model.grid.GridSession"}}} }} }}, "/repo/v1/grid/{sessionId}/sql/query": {"post": { "tags": ["Grid Services"], "operationId": "post-/repo/v1/grid/{sessionId}/sql/query", "parameters": [{ "name": "sessionId", "in": "path", "required": true, "description": "The grid session ID.", "schema": {"type": "string"} }], "requestBody": { "content": {"application/json": {"schema": {"$ref": "#/components/schemas/org.sagebionetworks.repo.model.grid.GridQueryRequest"}}}, "required": true }, "security": [{"bearerAuth": []}], "responses": {"200": { "description": "Auto-generated description", "content": {"application/json": {"schema": {"$ref": "#/components/schemas/org.sagebionetworks.repo.model.grid.GridQueryResponse"}}} }} }}, "/repo/v1/grid/{sessionId}/sql/update": {"put": { "tags": ["Grid Services"], "operationId": "put-/repo/v1/grid/{sessionId}/sql/update", "parameters": [{ "name": "sessionId", "in": "path", "required": true, "description": "The grid session ID.", "schema": {"type": "string"} }], "requestBody": { "content": {"application/json": {"schema": {"$ref": "#/components/schemas/org.sagebionetworks.repo.model.grid.UpdateGridRequest"}}}, "required": true }, "security": [{"bearerAuth": []}], "responses": {"200": { "description": "Auto-generated description", "content": {"application/json": {"schema": {"$ref": "#/components/schemas/org.sagebionetworks.repo.model.grid.UpdateGridResponse"}}} }} }}, "/repo/v1/grid/{sessionId}/schema": {"get": { "tags": ["Grid Services"], "operationId": "get-/repo/v1/grid/{sessionId}/schema", "parameters": [{ "name": "sessionId", "in": "path", "required": true, "schema": {"type": "string"} }], "security": [{"bearerAuth": []}], "responses": {"200": { "description": "The JSON Schema that defines the validation rules of this grid.", "content": {"application/json": {"schema": {"type": "string"}}} }} }}, "/repo/v1/grid/{sessionId}/replica": {"put": { "tags": ["Grid Services"], "operationId": "put-/repo/v1/grid/{sessionId}/replica", "parameters": [{ "name": "sessionId", "in": "path", "required": true, "schema": {"type": "string"} }], "requestBody": { "content": {"application/json": {"schema": {"$ref": "#/components/schemas/org.sagebionetworks.repo.model.grid.CreateReplicaRequest"}}}, "required": true }, "security": [{"bearerAuth": []}], "responses": {"200": { "description": "Auto-generated description", "content": {"application/json": {"schema": {"$ref": "#/components/schemas/org.sagebionetworks.repo.model.grid.CreateReplicaResponse"}}} }} }} }, "components": { "schemas": { "org.sagebionetworks.repo.model.grid.sql.Literal": { "type": "object", "properties": { "concreteType": { "type": "string", "enum": ["org.sagebionetworks.repo.model.grid.sql.Literal"] }, "value": {"type": "string"}, "dataType": {"type": "string"} }, "description": "Literal is a single value of a given type.", "required": ["concreteType"] }, "org.sagebionetworks.repo.model.grid.sql.Identifier": { "type": "object", "properties": { "concreteType": { "type": "string", "enum": ["org.sagebionetworks.repo.model.grid.sql.Identifier"] }, "value": {"type": "string"} }, "required": ["concreteType"] }, "org.sagebionetworks.repo.model.grid.CellId": { "type": "object", "properties": { "rowId": {"type": "string"}, "columnName": {"type": "string"} }, "description": "The immutable (stable) identifier for a cell." }, "org.sagebionetworks.repo.model.grid.CreateReplicaRequest": { "type": "object", "properties": {"gridSessionId": {"type": "string"}}, "description": "Request to create a new replica. A replica represents an 'in-memory' grid document identifed by a unique replicaId." }, "org.sagebionetworks.repo.model.grid.sql.SelectItem": { "type": "object", "properties": {"expression": { "$ref": "#/components/schemas/org.sagebionetworks.repo.model.grid.sql.SqlExpression", "description": "A generic SQL expression." }} }, "org.sagebionetworks.repo.model.grid.UpdateGridRequest": { "type": "object", "properties": { "gridSessionId": {"type": "string"}, "update": { "$ref": "#/components/schemas/org.sagebionetworks.repo.model.grid.sql.Update", "description": "SQL used to update cells of a grid." } }, "description": "Request to update the grid by executing a SQL update command." }, "org.sagebionetworks.repo.model.grid.sql.OrderByItem": { "type": "object", "properties": { "expression": { "$ref": "#/components/schemas/org.sagebionetworks.repo.model.grid.sql.SqlExpression", "description": "A generic SQL expression." }, "direction": {"type": "string"} } }, "org.sagebionetworks.repo.model.grid.CreateGridResponse": { "type": "object", "properties": { "concreteType": { "type": "string", "enum": ["org.sagebionetworks.repo.model.grid.CreateGridResponse"] }, "gridSession": { "$ref": "#/components/schemas/org.sagebionetworks.repo.model.grid.GridSession", "description": "Basic information about a grid session." } }, "description": "The response to creating a new grid sesion.", "required": ["concreteType"] }, "org.sagebionetworks.repo.model.grid.sql.Query": { "type": "object", "properties": { "select": { "$ref": "#/components/schemas/org.sagebionetworks.repo.model.grid.sql.SelectClause", "description": "Object model describing the SELECT statment of a SQL query." }, "where": { "type": "array", "items": { "$ref": "#/components/schemas/org.sagebionetworks.repo.model.grid.sql.SqlExpression", "description": "A generic SQL expression." }, "description": "Optional WHERE clause." }, "groupBy": { "type": "array", "items": { "$ref": "#/components/schemas/org.sagebionetworks.repo.model.grid.sql.SqlExpression", "description": "A generic SQL expression." }, "description": "Optional GROUP BY clause." }, "orderBy": { "type": "array", "items": {"$ref": "#/components/schemas/org.sagebionetworks.repo.model.grid.sql.OrderByItem"}, "description": "Optional ORDER BY clause." }, "limit": { "type": "integer", "format": "int32" }, "offset": { "type": "integer", "format": "int32" } }, "description": "Defines a SQL query to be run against the current grid." }, "org.sagebionetworks.repo.model.grid.sql.Row": { "type": "object", "properties": { "rowId": {"type": "string"}, "values": { "type": "array", "items": {"type": "string"}, "description": "Cell values for a single row of a query result." } }, "description": "" }, "org.sagebionetworks.repo.model.grid.sql.InExpression": { "type": "object", "properties": { "concreteType": { "type": "string", "enum": ["org.sagebionetworks.repo.model.grid.sql.InExpression"] }, "left": {"$ref": "#/components/schemas/org.sagebionetworks.repo.model.grid.sql.ColumnReference"}, "values": { "type": "array", "items": { "$ref": "#/components/schemas/org.sagebionetworks.repo.model.grid.sql.Literal", "description": "Literal is a single value of a given type." }, "description": "An array of literal values for the 'IN' operator." }, "not": {"type": "boolean"} }, "description": "In expression is used for a where condition using the 'in' opperator.", "required": ["concreteType"] }, "org.sagebionetworks.repo.model.grid.sql.ColumnReference": { "type": "object", "properties": { "concreteType": { "type": "string", "enum": ["org.sagebionetworks.repo.model.grid.sql.ColumnReference"] }, "column": {"$ref": "#/components/schemas/org.sagebionetworks.repo.model.grid.sql.Identifier"} }, "required": ["concreteType"] }, "org.sagebionetworks.repo.model.grid.GridSession": { "type": "object", "properties": { "sessionId": {"type": "string"}, "startedBy": { "type": "integer", "format": "int32" }, "startedOn": {"type": "string"}, "numberOfRows": { "type": "integer", "format": "int32" }, "columnNames": { "type": "array", "items": {"type": "string"}, "description": "The names of the coumns in this grid" }, "currentVesion": { "type": "integer", "format": "int32" }, "schema$id": {"type": "string"} }, "description": "Basic information about a grid session." }, "org.sagebionetworks.repo.model.grid.slq.OrderByDirection": { "type": "string", "enum": [ "ASC", "DESC" ] }, "org.sagebionetworks.repo.model.grid.sql.LiteralType": { "type": "string", "enum": [ "string", "integner", "float", "boolean" ] }, "org.sagebionetworks.repo.model.grid.sql.QueryResult": { "type": "object", "properties": { "columnNames": { "type": "array", "items": {"type": "string"}, "description": "The column names from the select." }, "rows": { "type": "array", "items": { "$ref": "#/components/schemas/org.sagebionetworks.repo.model.grid.sql.Row", "description": "" }, "description": "The column names from the select." } }, "description": "Tabular query results from a SQL query." }, "org.sagebionetworks.repo.model.asynch.AsynchronousResponseBody": { "type": "object", "properties": {"concreteType": {"type": "string"}}, "description": "The body of an Asynchronous job response.", "oneOf": [{"$ref": "#/components/schemas/org.sagebionetworks.repo.model.grid.CreateGridResponse"}], "required": ["concreteType"], "discriminator": {"propertyName": "concreteType"} }, "org.sagebionetworks.repo.model.grid.sql.Star": { "type": "object", "properties": {"concreteType": { "type": "string", "enum": ["org.sagebionetworks.repo.model.grid.sql.Star"] }}, "description": "Indicates an asterisk (*).", "required": ["concreteType"] }, "org.sagebionetworks.repo.model.grid.sql.BinaryOperator": { "type": "string", "enum": [ "EQUALS", "NOT_EQUALS", "GREATER_THAN", "LESS_THAN", "GREATER_THAN_OR_EQUALS", "LESS_THAN_OR_EQUALS", "AND", "OR" ] }, "org.sagebionetworks.repo.model.grid.GridQueryRequest": { "type": "object", "properties": { "gridSessionId": {"type": "string"}, "query": { "$ref": "#/components/schemas/org.sagebionetworks.repo.model.grid.sql.Query", "description": "Defines a SQL query to be run against the current grid." } }, "description": "Request to query the grid." }, "org.sagebionetworks.repo.model.grid.sql.FunctionCall": { "type": "object", "properties": { "concreteType": { "type": "string", "enum": ["org.sagebionetworks.repo.model.grid.sql.FunctionCall"] }, "name": {"type": "string"}, "arguments": { "type": "array", "items": { "$ref": "#/components/schemas/org.sagebionetworks.repo.model.grid.sql.SqlExpression", "description": "A generic SQL expression." }, "description": "An array of arguments passed to the function." }, "distinct": {"type": "boolean"} }, "required": ["concreteType"] }, "org.sagebionetworks.repo.model.grid.sql.SetValue": { "type": "object", "properties": { "columnName": {"$ref": "#/components/schemas/org.sagebionetworks.repo.model.grid.sql.ColumnReference"}, "value": { "$ref": "#/components/schemas/org.sagebionetworks.repo.model.grid.sql.Literal", "description": "Literal is a single value of a given type." } }, "description": "Set the value of a column." }, "org.sagebionetworks.repo.model.grid.CreateReplicaResponse": { "type": "object", "properties": { "gridSessionId": {"type": "string"}, "replicaId": {"type": "string"}, "createdBy": { "type": "integer", "format": "int32" }, "isAgentReplica": {"type": "boolean"}, "createdOn": {"type": "string"} }, "description": "Response to a create replica request." }, "org.sagebionetworks.repo.model.grid.CreateGridRequest": { "type": "object", "properties": { "concreteType": { "type": "string", "enum": ["org.sagebionetworks.repo.model.grid.CreateGridRequest"] }, "csvFileHandleId": { "type": "integer", "format": "int32" }, "schema$id": {"type": "string"} }, "description": "Start a job to create a new Grid session.", "required": ["concreteType"] }, "org.sagebionetworks.repo.model.grid.sql.BinaryExpression": { "type": "object", "properties": { "concreteType": { "type": "string", "enum": ["org.sagebionetworks.repo.model.grid.sql.BinaryExpression"] }, "left": { "$ref": "#/components/schemas/org.sagebionetworks.repo.model.grid.sql.SqlExpression", "description": "A generic SQL expression." }, "operator": {"type": "string"}, "right": { "$ref": "#/components/schemas/org.sagebionetworks.repo.model.grid.sql.SqlExpression", "description": "A generic SQL expression." } }, "required": ["concreteType"] }, "org.sagebionetworks.repo.model.grid.GridQueryResponse": { "type": "object", "properties": {"results": { "type": "array", "items": { "$ref": "#/components/schemas/org.sagebionetworks.repo.model.grid.sql.QueryResult", "description": "Tabular query results from a SQL query." }, "description": "" }}, "description": "" }, "org.sagebionetworks.repo.model.asynch.AsyncJobId": { "type": "object", "properties": {"token": {"type": "string"}}, "description": "Object used to track the status of an asynchronous job." }, "org.sagebionetworks.repo.model.asynch.AsynchronousRequestBody": { "type": "object", "properties": {"concreteType": {"type": "string"}}, "description": "The body of an Asynchronous job request.", "oneOf": [{"$ref": "#/components/schemas/org.sagebionetworks.repo.model.grid.CreateGridRequest"}], "required": ["concreteType"], "discriminator": {"propertyName": "concreteType"} }, "org.sagebionetworks.repo.model.grid.slq.SetQuantifier": { "type": "string", "enum": ["DISNTICT"] }, "org.sagebionetworks.repo.model.grid.sql.Update": { "type": "object", "properties": { "set": { "type": "array", "items": { "$ref": "#/components/schemas/org.sagebionetworks.repo.model.grid.sql.SetValue", "description": "Set the value of a column." }, "description": "Values to set." }, "where": { "type": "array", "items": { "$ref": "#/components/schemas/org.sagebionetworks.repo.model.grid.sql.SqlExpression", "description": "A generic SQL expression." }, "description": "Optional WHERE clause." }, "limit": { "type": "integer", "format": "int32" } }, "description": "SQL used to update cells of a grid." }, "org.sagebionetworks.repo.model.grid.UpdateGridResponse": { "type": "object", "properties": {"updatedRowCount": { "type": "integer", "format": "int32" }}, "description": "Response to an Update grid SQL request." }, "org.sagebionetworks.repo.model.grid.sql.SelectClause": { "type": "object", "properties": { "setQuantifier": {"type": "string"}, "selectList": { "type": "array", "items": {"$ref": "#/components/schemas/org.sagebionetworks.repo.model.grid.sql.SelectItem"}, "description": "One ore more select items." } }, "description": "Object model describing the SELECT statment of a SQL query." }, "org.sagebionetworks.repo.model.grid.sql.SqlExpression": { "type": "object", "properties": {"concreteType": {"type": "string"}}, "description": "A generic SQL expression.", "oneOf": [ {"$ref": "#/components/schemas/org.sagebionetworks.repo.model.grid.sql.Star"}, {"$ref": "#/components/schemas/org.sagebionetworks.repo.model.grid.sql.Literal"}, {"$ref": "#/components/schemas/org.sagebionetworks.repo.model.grid.sql.Identifier"}, {"$ref": "#/components/schemas/org.sagebionetworks.repo.model.grid.sql.BinaryExpression"}, {"$ref": "#/components/schemas/org.sagebionetworks.repo.model.grid.sql.InExpression"}, {"$ref": "#/components/schemas/org.sagebionetworks.repo.model.grid.sql.FunctionCall"}, {"$ref": "#/components/schemas/org.sagebionetworks.repo.model.grid.sql.ColumnReference"} ], "required": ["concreteType"], "discriminator": {"propertyName": "concreteType"} }, "org.sagebionetworks.repo.model.grid.sql.FunctionName": { "type": "string", "enum": [ "AVG", "COUNT", "COUNT_DISTINCT", "MAX", "MIN", "STD", "STDPOP", "STDDEV_SAMP", "SUM", "VAR_POP", "VAR_SAMP", "VARIANCE" ] } }, "securitySchemes": {"bearerAuth": { "type": "http", "scheme": "bearer" }} } }