...
To help us understand all of the technical challenges for this project we built a toy example data model. This data model was designed to capture all of the technical challenges while remaining as simple and small as we could make it.
...
In a previous version of this document we considered exposing these perspectives to users by requiring UI engineers to write the SQL needed to generate them. If you expanded the query sections for these two perspectives you would notice that the SQL is non-trivial. Sorting and/or filtering on these perspectives (covered in a later section) adds yet another layer of complexity.
If you look at a complex table/view with multiple facets in any portal or the Synapse UI, you will see a set of fairly standard controls in the left-hand-side panel and in the column headers. Users are able to filter and sort the results by manipulating these controls. Typically, the UI code does not need to directly parse or generate SQL when these controls change. Instead, the UI can pass model objects that describe the filtering and sorting to the Synapse query service, which then manipulates the SQL on the caller’s behalf. This basic functionality works for all table/view features including:
...
So rather than pushing the complexity to the UI, it would be better if these perspectives behaved like any other Synapse table/view. Parsing and SQL manipulation should not be required. This means that the perspectives should behave as the simple tables they appear to be, rather than exposing the three layers of aggregation used to generate them. So, our new design attempts to hide most of the complexity without losing any of the functionality. We would also like the resulting feature to be generic enough to work for other (non-cohort-builder) use cases.
...
When you define a materialized view (MV) in Synapse, you do not directly define its schema. Instead, the schema is automatically determined by the select-statement of the MV’s defining SQL. For example, if an MV contains a join of two tables, the select statement will typically refer to the columns of each of the tables. So for MVs, we automatically inherit the ColumnModel of any selected column from its source table. While this simple assumption might work for most MV use cases, it does not work well for aggregation or any other type of “derived column”.
...
For example, the PART_COUNT column does not exist in the source MATERIAL table; instead, it is a derived aggregation. Count is a simple case that always returns an integer, so it should be safe to assume a column type of INTEGER. However, what about the other three aggregate columns: PART_STAGE, PART_AGE, and PART_ID? Each contains a different type of data.
...
The first five columns (11,22,33,44,55) and the last (88) are all existing column types supported by both the clients and server. This means the UI should be able to treat those columns exactly like any other column of the same type. However, PART_STAGE (66) and PART_AGE (77) include new type definitions. We will talk about these new types next.
Column
...
TypeJSON
Technically, the aggregate statistics for PART_STAGE are gathered in two separate columns using the following in the SQL select:
...
These statistics seem similar to aggregate-enumeration. Should they really be two separate types?
Filtering Before Aggregation
So far, we have only discussed applying filters to aggregated columns. Such filters must be applied after the aggregation results are calculated. However, the main use case requires some runtime filtering to occur before aggregation is applied.
For example, a user might start with the participants-with-aggregated-files perspective and narrow down their selection of participant IDs to: 2, 4, 5, & 7.
They will then want to apply their selection of participant IDs as a filter to the files-with-aggregated-participants perspective. This filter must change the aggregated participant data. This means the filter must be applied before aggregation occurs.
Here is what the results look like with this filter applied:
FILE_ID | FILE_NAME | FILE_TYPE | FILE_SIZE | PART_COUNT | PART_STAGE | PART_AGE | PART_ID |
---|---|---|---|---|---|---|---|
1 | f1 | raw | 100 | 4 | {"one": 2, "two": 2} | {"max": 40, "min": 10} | 2,4,5,7 |
2 | f2 | raw | 200 | 2 | {"one": 2, "two": 0} | {"max": 40, "min": 20} | 2,4 |
3 | f3 | raw | 300 | 2 | {"one": 0, "two": 2} | {"max": 30, "min": 10} | 5,7 |
4 | f3 | raw | 400 | 2 | {"one": 2, "two": 0} | {"max": 40, "min": 20} | 2,4 |
5 | f5 | proc | 100 | 2 | {"one": 0, "two": 2} | {"max": 30, "min": 10} | 5,7 |
6 | f6 | proc | 200 | 2 | {"one": 1, "two": 1} | {"max": 20, "min": 10} | 2,5 |
7 | f7 | proc | 300 | 2 | {"one": 1, "two": 1} | {"max": 40, "min": 30} | 4,7 |
8 | f8 | proc | 400 | 4 | {"one": 2, "two": 2} | {"max": 40, "min": 10} | 2,4,5,7 |
Notice, the previous unfiltered results for file:1 had a PART_COUNT=8, while the filtered results have a PART_COUNT=4.
If the aggregation results did not need to change at runtime, then we could simply use a materialized view as a solution to the entire problem. For example, we could pre-build an MV with millions of rows of “static” aggregate data. End users could query this “static” data at runtime without any problems.
On the other hand, if the user’s runtime selections can change the aggregation results, then materialization does not help. In fact, it would require that we rebuild millions of material rows with each click. This means we need a solution that will support filtering both before and after aggregation without materialization.
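To make the difference concrete, here is a minimal sketch using Python's built-in sqlite3 module. The MATERIAL rows below are invented for illustration and are not the toy data model from this document. It only aims to show that a filter on PART_ID applied before aggregation changes the counts themselves, whereas a filter applied after aggregation can only keep or drop whole aggregated rows.

```python
import sqlite3

# Hypothetical (FILE_ID, PART_ID) rows; NOT the document's toy data model.
ROWS = [(1, 1), (1, 2), (1, 3), (1, 4), (2, 2), (2, 6)]

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE MATERIAL (FILE_ID INTEGER, PART_ID INTEGER)")
conn.executemany("INSERT INTO MATERIAL VALUES (?, ?)", ROWS)

# No runtime filter: every participant contributes to the count.
unfiltered = conn.execute(
    "SELECT FILE_ID, COUNT(PART_ID) FROM MATERIAL "
    "GROUP BY FILE_ID ORDER BY FILE_ID"
).fetchall()

# Filter applied BEFORE aggregation: rows are removed first,
# so the aggregated counts change.
pre_filtered = conn.execute(
    "SELECT FILE_ID, COUNT(PART_ID) FROM MATERIAL "
    "WHERE PART_ID IN (2, 4) "
    "GROUP BY FILE_ID ORDER BY FILE_ID"
).fetchall()

# Filter applied AFTER aggregation: counts are unchanged, and
# aggregated rows that fail the filter are dropped as a whole.
post_filtered = conn.execute(
    "WITH AGG AS (SELECT FILE_ID, COUNT(PART_ID) AS PART_COUNT "
    "FROM MATERIAL GROUP BY FILE_ID) "
    "SELECT FILE_ID, PART_COUNT FROM AGG "
    "WHERE PART_COUNT > 2 ORDER BY FILE_ID"
).fetchall()

print(unfiltered)     # counts per file with no filter
print(pre_filtered)   # counts shrink because rows were removed first
print(post_filtered)  # only files whose unfiltered count passes
```

In this sketch the pre-aggregation filter plays the role of the participant selection (2, 4, 5, & 7) described above: it has to reach the rows before GROUP BY runs, which is exactly what a pre-built materialization cannot offer.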
Aggregate View
We are proposing adding a new entity type called AggregateView. Like a MaterializedView, an AggregateView would be configured by setting its defining SQL. However, that is where the similarities end. An AggregateView is not a materialization. This means that we do not create an actual table in the database for this view. Instead, an AggregateView provides a simple, table-like layer of abstraction over a complex aggregation query. To better understand this concept, let’s show how we would use an AggregateView to create the files-to-participant-perspective.
Creating an AggregateView
Let’s assume we have already created our desired ColumnModels according to the above schema, with the resulting column IDs: 11,22,33,44,55,66,77,88. We can now create a new AggregateView with the following definingSQL:
Code Block | ||
---|---|---|
| ||
SELECT
CAST(FILE_ID AS 11),
CAST(FILE_NAME AS 22),
CAST(FILE_TYPE AS 33),
CAST(FILE_SIZE AS 44),
CAST(COUNT(PART_ID) AS 55),
AGG_EXPAND(STAGE AS 66),
AGG_EXPAND(AGE AS 77),
CAST(GROUP_CONCAT(DISTINCT PART_ID) AS 88)
FROM MATERIAL
WHERE FILE_ID IS NOT NULL GROUP BY FILE_ID, FILE_NAME, FILE_TYPE, FILE_SIZE |
Next we will cover each line of this SQL.
The first thing to notice is that the SQL is an aggregation as it contains a GROUP BY (line:11). Specifically, we are grouping by all of the relevant columns of the FILE_VIEW.
Next, at line:2 we have CAST(FILE_ID AS 11), which defines the first column of the perspective. This line is simply casting the FILE_ID as column ID: 11. In other words, line 2 tells Synapse to treat the first column of the resulting table as type INTEGER with the name FILE_ID (see: ID 11). Note: Since FILE_ID is part of the group by, we are not required to apply an aggregation function to it. Any column that is not part of the group by must have some type of aggregation function.
Lines 3-5 are similar to line:2 where the rest of the file’s columns are cast to their respective column model IDs.
At line:6 we have our first aggregation function, COUNT(PART_ID). Since a single file ID could map to many participants, we need an aggregation function to define how to handle the “many”. In this case we simply want the count. We then cast the resulting count as column ID=55, which has a simple INTEGER type.
At line:7 we have a new function called AGG_EXPAND: AGG_EXPAND(STAGE AS 66). This function call is syntactic sugar that tells Synapse to do the following:
Add a column for each value of the enumeration with a case statement that will count the occurrences of each.
In the next layer of the CTE, recombine all of the expanded columns into a single column of JSON.
In the final layer of the CTE, cast the resulting JSON as column ID=66.
At line:8 we have AGG_EXPAND(AGE AS 77), which will do a similar expansion to the previous case. This expansion will create a column for each aggregate function defined in column ID=77.
Finally, at line:9 we have CAST(GROUP_CONCAT(DISTINCT PART_ID) AS 88). The group concat function will create a comma-separated list of all of the PART_IDs that match each file. The results are then cast to column ID=88, which is of type STRING_LIST. This means this column will behave similarly to other string list columns in Synapse.
Querying an AggregateView
Once we have defined our AggregateView we can run a query against it like any other table/view in Synapse. Let’s assume that we created the AggregateView using the defining SQL from the previous section. The resulting AggregateView is then assigned syn123. If we get the schema for syn123, we would see its column IDs are: 11,22,33,44,55,66,77,88.
To query this view we could simply send the following to the query service:
Code Block |
---|
select * from syn123 |
The results of this query would be exactly the same as the unfiltered files-with-aggregated-participants table shown above.
So how does that work exactly? Basically, when the user provides select * from syn123 at runtime, we run the following query on their behalf:
Code Block |
---|
WITH
AGG AS (
SELECT
FILE_ID,
FILE_NAME,
FILE_TYPE,
FILE_SIZE,
COUNT(PART_ID) AS PART_COUNT,
JSON_OBJECT('one', SUM(CASE STAGE WHEN 'one' THEN 1 ELSE 0 END),
'two', SUM(CASE STAGE WHEN 'two' THEN 1 ELSE 0 END)) as PART_STAGE,
JSON_OBJECT('min', MIN(AGE), 'max', MAX(AGE)) as PART_AGE,
GROUP_CONCAT(DISTINCT PART_ID) AS PART_IDS FROM MATERIAL
WHERE FILE_ID IS NOT NULL GROUP BY FILE_ID, FILE_NAME, FILE_TYPE, FILE_SIZE
)
SELECT * FROM AGG; |
The above SQL is actually a combination of syn123’s defining SQL and the runtime query (select * from syn123). Specifically, the inner query of the common table expression (CTE) (lines:3-14) is an expansion of the defining SQL, while the runtime query is transformed into the outer query of the CTE (line:15). In essence, the user is querying what appears to be a simple table.
A real runtime query transformation would be more complex, but the basic principles would still apply. For example, since our MATERIAL table includes files, the transformation process would include adding a row-level filter to hide rows where the user lacks the read permission. This type of query manipulation is already common for existing Synapse tables/views.
In the next section, we will show how runtime filtering and sorting would be applied using a few examples.
First, let’s assume that the user wants to see only rows where the PART_STAGE count for ‘one’ is greater than two:
Code Block |
---|
select * from syn123 where PART_STAGE.one > 2 |
For this query the first 14 lines of the above query would remain the same, while the last line (line:15) would become:
Code Block |
---|
select * from AGG where JSON_EXTRACT(PART_STAGE, '$.one') > 2; |
A sorting example would be similar. For example, to sort on PART_STAGE ‘two’ in ascending order:
Code Block |
---|
select * from syn123 order by PART_STAGE.two asc |
Again, we would only need to change the last line of the CTE to be:
Code Block |
---|
select * from AGG ORDER BY CAST(JSON_EXTRACT(PART_STAGE, '$.two') AS UNSIGNED) ASC |
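The translation of dotted references such as PART_STAGE.one into JSON_EXTRACT calls can be sketched as follows. This is a deliberately naive illustration: the function name translate_json_paths is hypothetical, it assumes the caller already knows which columns hold JSON, and it uses a regular expression where a real implementation would presumably work on the parsed query (a regex will also rewrite matches inside string literals). It handles only the path rewrite, not the CAST used in the sorting example.

```python
import re


def translate_json_paths(clause, json_columns):
    """Rewrite COLUMN.key references as JSON_EXTRACT(COLUMN, '$.key').

    Only columns listed in json_columns are rewritten; everything
    else in the clause is left untouched.
    """
    if not json_columns:
        return clause
    names = "|".join(re.escape(name) for name in json_columns)
    pattern = re.compile(rf"\b({names})\.(\w+)")
    return pattern.sub(r"JSON_EXTRACT(\1, '$.\2')", clause)


JSON_COLUMNS = ["PART_STAGE", "PART_AGE"]

where_sql = translate_json_paths("PART_STAGE.one > 2", JSON_COLUMNS)
order_sql = translate_json_paths("PART_STAGE.two asc", JSON_COLUMNS)
plain_sql = translate_json_paths("FILE_SIZE > 100", JSON_COLUMNS)
print(where_sql)
print(order_sql)
print(plain_sql)
```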
The above filters and sorts are applied to the aggregation results. We still need to cover the case where the user requests a filter before aggregation. We will use the same example filter, where the user pre-selected participant IDs 2, 4, 5, & 7:
Code Block | ||
---|---|---|
| ||
select * from syn123 where pre_agg(PART_ID in(2,4,5,7)) |
Here we have defined a new function called pre_agg(), which is syntactic sugar meaning: apply this filter before aggregation. So rather than applying the filter at the end of the CTE (line:15), it is added to the inner layer of the CTE (line:13).
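The placement of the two kinds of filters can be sketched as below. This is an illustration only: build_runtime_query and its parameters are hypothetical, and it assumes the runtime query has already been parsed so that predicates wrapped in pre_agg() are separated from the ordinary (post-aggregation) predicates.

```python
def build_runtime_query(select_list, source, base_where, group_by,
                        pre_agg_filters=(), post_agg_filters=()):
    """Assemble the CTE, routing each filter to the correct layer.

    pre_agg_filters are ANDed into the inner WHERE clause, so they
    run before GROUP BY. post_agg_filters are applied to the outer
    query, so they only see the aggregated rows.
    """
    inner_predicates = [base_where, *pre_agg_filters]
    inner_where = " AND ".join(f"({p})" for p in inner_predicates)
    query = (
        "WITH AGG AS (\n"
        f"  SELECT {select_list}\n"
        f"  FROM {source}\n"
        f"  WHERE {inner_where}\n"
        f"  GROUP BY {group_by}\n"
        ")\n"
        "SELECT * FROM AGG"
    )
    if post_agg_filters:
        outer_where = " AND ".join(f"({p})" for p in post_agg_filters)
        query += f" WHERE {outer_where}"
    return query


pre_query = build_runtime_query(
    "FILE_ID, COUNT(PART_ID) AS PART_COUNT",
    "MATERIAL",
    "FILE_ID IS NOT NULL",
    "FILE_ID",
    pre_agg_filters=["PART_ID IN (2,4,5,7)"],
)
post_query = build_runtime_query(
    "FILE_ID, COUNT(PART_ID) AS PART_COUNT",
    "MATERIAL",
    "FILE_ID IS NOT NULL",
    "FILE_ID",
    post_agg_filters=["PART_COUNT > 2"],
)
print(pre_query)
print(post_query)
```

The select list is shortened here for readability; the same routing would apply to the full expansion of the defining SQL.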
The key to this entire design is that there is always a one-to-one translation for anything in both the provided defining SQL and the runtime queries.
New Features
The implementation plan is to divide the work into two phases:
Phase One - Add support for CTE, CASE Statement, JSON columns/functions to the runtime query services. At the end of this phase a user should be able to run this type of query against any Synapse table/view:
Code Block |
---|
WITH
AGG AS (
SELECT
FILE_ID,
FILE_NAME,
FILE_TYPE,
FILE_SIZE,
COUNT(PART_ID) AS PART_COUNT,
JSON_OBJECT('one', SUM(CASE STAGE WHEN 'one' THEN 1 ELSE 0 END),
'two', SUM(CASE STAGE WHEN 'two' THEN 1 ELSE 0 END)) as PART_STAGE,
JSON_OBJECT('min', MIN(AGE), 'max', MAX(AGE)) as PART_AGE,
GROUP_CONCAT(DISTINCT PART_ID) AS PART_IDS FROM MATERIAL
WHERE FILE_ID IS NOT NULL GROUP BY FILE_ID, FILE_NAME, FILE_TYPE, FILE_SIZE
)
SELECT * FROM AGG; |
Phase Two - Add AggregateViews plus the new aggregate facet types as syntactic sugar that will be expanded to the full SQL from phase one at runtime.
To see the full list of features needed to make this design work see the epic:
Jira Legacy | ||||||
---|---|---|---|---|---|---|
|
...
|