Synapse Joins
Introduction
The Synapse API currently supports two types of tabular data: Tables & Views. While both tables and views are relations, the API only supports a limited sub-set of relational algebra, specifically unary operations such as selection. This document explores a possible mechanism for expanding beyond single table operation, to operations involving multiple tables.
Users
This document will cover two types of users:
Content Producers - This is the group of users that produce content in Synapse. It assumed that these users are comfortable with creating/maintaining both views and tables in Synapse. It is assumed that they have a basic understanding of both SQL and relational algebra.
Content Consumers - This is the group of users that consume content in Synapse. These users are comfortable with using UI tools such as faceted navigation of views and/or tables. They would not be expected to create or maintain tables/views. They would not be expected to know any SQL or relational algebra.
Normalization Problem
Database normalization is an important part of managing relational data as it both reduces redundancy and improves data integrity. We will use the following example to demonstrate the concepts. Assume we have the following view of files:
FilesWithPatients
File ID | name | patient id | patient age |
---|---|---|---|
syn123 | scan.csv | x | 23 |
syn456 | other.csv | y | 65 |
syn890 | more.csv | z | 85 |
syn333 | less.csv | y | 65 |
With such a view we can run a simple query to find all files about patients over the age of 60. However, notice that patient y is in assigned to two different files, so patients y’s age is duplicated. This makes it possible to accidentally set conflicting ages for patient y. What if syn456 has age=65, but syn333 has age 66, which is correct? In order to update the age for any patient we would need to find and replace all duplicate values.
A better solution to this problem is to normalize our data into two separate relations such as:
Files:
File ID | name | patient id |
---|---|---|
syn123 | scan.csv | x |
syn456 | other.csv | y |
syn890 | more.csv | z |
syn333 | less.csv | y |
Patients:
patient id | patient age |
---|---|
x | 23 |
y | 65 |
z | 85 |
Part of normalizing our Files and Patients tables is to guarantee that they do not contain duplicate rows. This is typically achieved by defining a primary key. All FileViews automatically define the syn<ID> of each file (or object) as the primary key. However, currently there is no way to set a primary key for a Synapse table. In order to guarantee that we cannot have duplicate patient rows, we would want to define the primary key of our Patients table as the patient id column.
Now that our data is in two separate tables: Files and Patients, we no longer have data duplication. If we need to update the patient y’s age, there is only one place where we need to make the change. However, this creates a new problem. How to find all files with patients over the age of 60? The answer is a relational join.
By joining the Files table with the Patients table on the patient id, we are able to recreate a “materialized” view that includes data from both tables. The following SQL shows a join that could be used to define a materialized view:
select * from Files join Patients on (Files.patientId = Patients.patientId)
This SQL statement would produce the following materialized view that is identical to our starting example.
Materialized:
File ID | name | patient id | patient age |
---|---|---|---|
syn123 | scan.csv | x | 23 |
syn456 | other.csv | y | 65 |
syn890 | more.csv | z | 85 |
syn333 | less.csv | y | 65 |
We can now run our query against the materialized view:
select * from materialized where patientAge > 60
Would produce:
File ID | name | patient id | patient age |
---|---|---|---|
syn456 | other.csv | y | 65 |
syn890 | more.csv | z | 85 |
syn333 | less.csv | y | 65 |
It is important to note that our materialized view cannot be directly modified. If we want to change the age of patient y, we would first need to update the patient table which would then rebuild the materialized view.
Direct View vs Materialized View
Since the release of views in Synapse the primary concern from our users has been query performance. Users expect sub-second query performance since these queries drive many UI interactions. Poor query performance can make a portal sluggish. On the other hand, data producers are tolerant of slow view creation times. They are also willing to accept eventual consistency if it yields faster query performance. This trade-off lends itself to caching, or materialization. All existing views in Synapse today are materialized.
User Requirements
From the data consumer’s perspective a materialized view would behave exactly like any other table or view in Synapse. For example, faceted navigation of materialized view would behave exactly like faceted navigation in a table.
The data consumer should not need to understand that the data in a materialized view was derived from one or more tables/views.
The data producer would be expected to construct materialized view on behalf of their data consumers.
The data producer would be expected to have a basic understanding of database normalization and the SQL join syntax.
When the source data of a materialized view changes (example a change occurs in the patients table), the materialized view should eventually get update automatically. Specifically, no users action should be required to force the materialized view to re-build upon source changes.
Materialized view sources can be either views, tables, or even other materialized views.
Materialized view are defined by SQL only. The SQL can include any supported SQL including joins, where clause, group by clause, and select functions.
All authorization checks on the source data must extends to materialized view.
Tables - The user must have the DOWNLOAD permission on the all tables referenced in the view to see any rows in the materialized view. The user must also meet all AccessRequirements on all tables referenced in the view to see any rows in the materialized view. All or none.
View - The user must have READ on each object from a view. Row level filtering.
Performance Requirements
A query against a materialized view is expected to return in under one second (on average). Queries against materialized views are expected to occur very frequently.
Technical Requirements
When a materialized view is first created, the source tables/views might not be available on the a given shard. This means we would need to build an index for all source table/views on a given shard before, building a materialized view.
When a change occurs in a source table/view, the change will first need to be applied to the source index on the destination shard before the change can propagate to the destination materialized view. The amount of time for this propagation is unbounded.
In order to meet sub-second query times the materialized view index must remain available for query even during unbounded change propagation.
Views queries currently have a equivalent transaction isolation level of Read Committed. While tables currently have a transaction isolation level of Serializable. Since a materialized views can have either a table or a view as a source, the highest possible materialized view isolation level would be Read Committed.
Joining two tables without specifying an on clause will produce a Cartesian product. The Cartesian product can be large for even small tables. In addition it is unlikely that our users will intentionally want to create the Cartesian product. In order to prevent a potentially expensive mistake we reject any SQL JOIN that does not include an on clause.
API Design
MaterializedViews
The API design is fairly simple, and only involves the creation of a new Entity type that extends the Table interface called MaterializedView:
Field Name | type | description |
---|---|---|
definingSQL | String | A Synapse SQL statement that fully defines this materialized view. |
Unlike other Views or Tables, a MaterializedView is fully defined by its single property: definingSQL. The schema of the MaterializedView will be derived directly from the SQL select statement. While the scope of the MaterializedView is derived from both the SQL from statements, and where statements.
Note: The definingSQL can contain one or more JOIN clauses. This is the only context where a JOIN clause will be valid. Adding a JOIN clause in any other context will result in an error.
Using our example above we can define a new MaterializedView using the following JSON:
{
"concreteType":"org.sagebionetworks.repo.model.table.MaterializedView"
"name":"My First MaterializedView",
"parentId":"syn111",
definingSQL:"SELECT * FROM syn222 F
JOIN syn444 P on (F.patientId = P.patientId) WHERE P.age > 60"
}
This example assumes syn111 is our File view, while syn444 is our Patients table.
We create our new MaterializedView by passing the above JSON to POST /entity. Let us assume this post return a new synID=999. At this point we can run the following query:
by starting a query using POST /entity/syn999/table/query/async/start which returns the following results:
File ID | name | patient id | patient age |
---|---|---|---|
syn456 | other.csv | y | 65 |
syn890 | more.csv | z | 85 |
syn333 | less.csv | y | 65 |
Primary Keys/Uniqueness
A uniqueness constraint on a single column would ensure that all values in that column are unique. A uniqueness constraint on multiple columns ensure that the combined values of those columns are unique. A single table can have more than one uniqueness constraint. A primary key is just a specific type of uniqueness constraint with the restriction that there can be only one primary key per table. Therefore we are proposing supporting the more generic uniqueness constraint.
The following object model defines a uniqueness constraint:
UniquenessConstraint
type | field | description |
---|---|---|
String | name | A user friendly name to show when a uniqueness constraint has been violated. |
List<Long> | columnIds | A list of one or more columnIDs that define the uniqueness constraint. |
We will extend the TableEntity object mode to include: List<UniquenessConstraint> uniquenessConstraints;
Does it make sense to add Uniqueness Constraints to views?