Versions Compared

Key

  • This line was added.
  • This line was removed.
  • Formatting was changed.

...

Code Block
languagesql
 SELECT 
			CAST(FILE_ID, AS 11),
			CAST(FILE_NAME, AS 22),
			CAST(FILE_TYPE, AS 33),
            CAST(FILE_SIZE, AS 44),
			CAST(COUNT(PART_ID), AS 55),
			AGG_EXPAND(STAGE, AS 66),
			AGG_EXPAND(AGE, AS 77),
			CAST(GROUP_CONCAT(DISTINCT PART_ID), AS 88)
				FROM MATERIAL
				WHERE FILE_ID IS NOT NULL GROUP BY FILE_ID, FILE_NAME, FILE_TYPE, FILE_SIZE

...

Next, at line:2 we have CAST(FILE_ID , AS 11), which defines the first column of the perspective. This line is simply casting the FILE_ID as column ID: 11. In other words, line 2 tells Synapse to treat the first column of the resulting table as type INTEGER with the name FILE_ID (see: ID 11). Note: Since FILE_ID is part of the group by, we are not required apply an aggregation function to it. Any column that is not part of the group by must have some type of aggregation functions.

...

At line:7 we have a new function: AGG_EXPAND(STAGE , AS 66) called AGG_EXPAND. This function call is syntactic sugar that tells Synapse to do the following:

...

At line:8 we have AGG_EXPAND(AGE , AS 77) which will do a similar expansion to the previous case. This expansion will create a column for each aggregate function defined in column ID=77.

Finally, at line:9 we have CAST(GROUP_CONCAT(DISTINCT PART_ID) , AS 88). The group concat function will create a comma separated list of all of the PART_IDs that match each file. The results are then cast to column ID=88 which is of type STRING_LIST. This means this column will behave similar to other string list columns in Synapse.

...

So how does that work exactly? Basically, when the user provides select * from syn123 at runtime, we run the following query on their behalf:

Code Block
WITH
	F2PAGG AS (
		SELECT 
			FILE_ID,
			FILE_NAME,
			FILE_TYPE,
            FILE_SIZE,
			COUNT(PART_ID) AS PART_COUNT,
            JSON_OBJECT('one', 			SUM(CASE STAGE WHEN 'one' THEN 1 ELSE 0 END), AS STAGE_ONE_COUNT,

						'two', SUM(CASE STAGE WHEN 'two' THEN 1 ELSE 0 END)) ASas PART_STAGE_TWO_COUNT,
            MAXJSON_OBJECT('min', MIN(AGE) AS MAX_AGE,
            MIN(AGE) AS MIN, 'max', MAX(AGE)) as PART_AGE,
			GROUP_CONCAT(DISTINCT PART_ID) AS PART_IDS FROM MATERIAL
				WHERE FILE_ID IS NOT NULL GROUP BY FILE_ID, FILE_NAME, FILE_TYPE, FILE_SIZE
    	),
SELECT * FROM  AGG AS (
		SELECT FILE_ID, FILE_NAME, FILE_TYPE, FILE_SIZE, PART_COUNT,
			JSON_OBJECT('one', STAGE_ONE_COUNT, 'two', STAGE_TWO_COUNT) as PART_STAGE,
            JSON_OBJECT('min', MIN_AGE, 'max', MAX_AGE) as PART_AGE,
            PART_IDS
				FROM F2P
    )
SELECT * FROM AGG;

The above SQL is actually a combination of the syn123’s defining SQL and the runtime query (select * from syn123). Specifically, two inner queries of the common table expression (CTE) (lines:2-22) are an expansion of the defining SQL. While the runtime query is transformed into the outer query of the CTE (line:23). In essence, the user is querying what appears to be a simple table.

A real runtime query transformation would be more complex but basic principals would still apply. For example, since our MATERIAL table includes files, the transformation process would include adding a row-level-filter to hide rows where the user lacks the read permission. This type of query manipulation is already common for existing Synapse tables/views.

In the final, section we will show how runtime filtering and sorting would be applied using a few examples.

First, lets assume that the user wants to only see rows where PART_STAGE ‘one’ is greater than two:

Code Block
select * from syn123 where PART_STAGE.one > 2

For this query the first 22 lines of the above query would remain the same, while the last line (line:32) would become:

Code Block
select * from AGG where JSON_EXTRACT(PART_STAGE, '$.one') > 2;

A sorting example would be similar. For example to sort on PART_STAGE ‘two’ asc:

Code Block
select * from syn123 order by PART_STAGE.two asc

Again we would only need to change the last line the CTE to be:

Code Block
select * from AGG ORDER BY CAST(JSON_EXTRACT(PART_STAGE, '$.two') AS UNSIGNED) ACS

The above filters/sorting applied to the aggregation results. We still need to cover the case where the user is requesting a filter before aggregation. We will use the same example filter where the user per-selected participant IDs to: 2, 4, 5, & 7:

Code Block
languagesql
select * from syn123 where pre_agg(PART_ID in(2,4,5,7))

Here we have defined a new function called pre_agg() which is syntactic sugar that means apply this filter before aggregation. So rather than apply the filter at the end of the CTE (line:23) add it to the first layer of the CTE (line:14).

The key to this entire design is that there is always a one-to-one translation for anything in the both the provide defining SQL and runtime queries.

...

AGG;

The above SQL is actually a combination of the syn123’s defining SQL and the runtime query (select * from syn123). Specifically, the inner query of the common table expression (CTE) (lines:3-14) are an expansion of the defining SQL. While the runtime query is transformed into the outer query of the CTE (line:15). In essence, the user is querying what appears to be a simple table.

A real runtime query transformation would be more complex but basic principals would still apply. For example, since our MATERIAL table includes files, the transformation process would include adding a row-level-filter to hide rows where the user lacks the read permission. This type of query manipulation is already common for existing Synapse tables/views.

In the next section, we will show how runtime filtering and sorting would be applied using a few examples.

First, lets assume that the user wants to only see rows where PART_STAGE ‘one’ is greater than two:

Code Block
select * from syn123 where PART_STAGE.one > 2

For this query the first 14 lines of the above query would remain the same, while the last line (line:15) would become:

Code Block
select * from AGG where JSON_EXTRACT(PART_STAGE, '$.one') > 2;

A sorting example would be similar. For example to sort on PART_STAGE ‘two’ asc:

Code Block
select * from syn123 order by PART_STAGE.two asc

Again we would only need to change the last line the CTE to be:

Code Block
select * from AGG ORDER BY CAST(JSON_EXTRACT(PART_STAGE, '$.two') AS UNSIGNED) ACS

The above filters/sorting applied to the aggregation results. We still need to cover the case where the user is requesting a filter before aggregation. We will use the same example filter where the user per-selected participant IDs to: 2, 4, 5, & 7:

Code Block
languagesql
select * from syn123 where pre_agg(PART_ID in(2,4,5,7))

Here we have defined a new function called pre_agg() which is syntactic sugar that means apply this filter before aggregation. So rather than apply the filter at the end of the CTE (line:15) it is added to the inner layer of the CTE (line:13).

The key to this entire design is that there is always a one-to-one translation for anything in the both the provide defining SQL and runtime queries.

New Features

The implementation plan is to divide the work into two phases:

  • Phase One - Add support for CTE, CASE Statement, JSON columns/functions to the runtime query services. At the end of this phase a user should be able to run this type of query against any Synapse table/view:

    Code Block
    WITH
    	AGG AS (
    		SELECT 
    			FILE_ID,
    			FILE_NAME,
    			FILE_TYPE,
                FILE_SIZE,
    			COUNT(PART_ID) AS PART_COUNT,
                JSON_OBJECT('one', SUM(CASE STAGE WHEN 'one' THEN 1 ELSE 0 END), 'two', SUM(CASE STAGE WHEN 'two' THEN 1 ELSE 0 END)) as PART_STAGE,
                JSON_OBJECT('min', MIN(AGE), 'max', MAX(AGE)) as PART_AGE,
    			GROUP_CONCAT(DISTINCT PART_ID) AS PART_IDS FROM MATERIAL
    				WHERE FILE_ID IS NOT NULL GROUP BY FILE_ID, FILE_NAME, FILE_TYPE, FILE_SIZE
    	)
    SELECT * FROM AGG;
  • Phase Two: Add AggregateViews plus the new aggregate facet types as syntactic sugar that will be expanded to the full SQL from phase one at runtime.

To see the full list of features needed to make this design work see the epic:

Jira Legacy
serverSystem JIRA
serverIdba6fb084-9827-3160-8067-8ac7470f78b2
keyPLFM-7789

...