Versions Compared

Key

  • This line was added.
  • This line was removed.
  • Formatting was changed.

...

In a previous version of this document we considered exposing these perspectives to users by requiring UI engineers generate the SQL needed to generate them. If you expanded the query sections for these two perspectives you would notice that the SQL is non-trivial. To sort and/or filter on these perspectives (covered in a later section), adds yet another lay layer of complexity.

If you look at a complex table/view with multiple facets in any portal or the Synapse UI, you will see a set of fairly standard controls on the left-hand-side panel and in the columns headers. Users are able to filter and sort the results by manipulating theses controls. Typically, the UI code does not need to directly parse or generate SQL when these controls change. Instead, the UI can pass model objects that describe the filtering and sorting to the Synapse query service, which then manipulate the SQL on the caller’s behalf. This basic functionally works for all table/view features including:

...

So rather than pushing the complexity to the UI, it would be better if these perspectives behaves like any other Synapse table/view. Parsing and SQL manipulation should not be required. This means that the perspectives should behave as if they are as the simple tables , instead of three layers of aggregation used to generate themthey appear to be. So, our new design attempts to hide most of the complexity without losing any of the functionality. We would also like the resulting feature to be generic enough to work for other (non-cohort-builder) use cases.

...

When you define a materialized view (MV) in Synapse you do not directly define its schema. Instead, the schema is automatically determined by the select-statement of the MV’s defining SQL. For example, if a MV contains a join of two tables, the select statement will typically refer to the columns of each of the tables. So for MVs, we automatically inherit the ColumnModel of any selected column from its source table. While this simple assumption might work for most MV use cases it does not work well for aggregation or any other type of “derived column”.

...

So far, we have only discussed applying filters to aggregated columns. Such filter filters must be applied after the aggregation results are calculated. However, the main use case requires some runtime filtering to occur before aggregation is applied.

...

Expand
titleExpand to see the sql that generated these results. Line:15 shows the filtering in the first layer of SQL.
Code Block
WITH
	F2P AS (
		SELECT 
			FILE_ID,
			FILE_NAME,
			FILE_TYPE,
            FILE_SIZE,
			COUNT(PART_ID) AS PART_COUNT,
			SUM(CASE STAGE WHEN 'one' THEN 1 ELSE 0 END) AS STAGE_ONE_COUNT,
			SUM(CASE STAGE WHEN 'two' THEN 1 ELSE 0 END) AS STAGE_TWO_COUNT,
            MAX(AGE) AS MAX_AGE,
            MIN(AGE) AS MIN_AGE,
			GROUP_CONCAT(DISTINCT PART_ID) AS PART_IDS FROM MATERIAL
				WHERE FILE_ID IS NOT NULL 
				AND PART_ID IN(2, 4, 5, 7)
                GROUP BY FILE_ID, FILE_NAME, FILE_TYPE, FILE_SIZE
    ),
    AGG AS (
		SELECT FILE_ID, FILE_NAME, FILE_TYPE, FILE_SIZE, PART_COUNT,
			JSON_OBJECT('one', STAGE_ONE_COUNT, 'two', STAGE_TWO_COUNT) as PART_STAGE,
            JSON_OBJECT('min', MIN_AGE, 'max', MAX_AGE) as PART_AGE,
            PART_IDS
				FROM F2P
    )
SELECT * FROM AGG;

...

If the aggregation results did not need to change at runtime, then we could simply use a materialized view as a solution to the entire problem. For example, we could pre-build a MV with millions rows of “static” aggregate data. End users could query this “static” data at runtime without any problems.

On the other hand, if the user’s runtime selections can change the aggregation results, then pre-building static results will now work. We also cannot rebuild a million row MV each time a user makes a runtime selection.materialization does not help. In fact, it would require that we rebuild millions of material rows with each click. This means we need a solution that will perform the aggregation at runtime in response to user’s selectionssupport filtering both before and after aggregation without materialization.

Aggregate View

We are proposing adding a new entity Type called AggregateView. Like a MaterializedView, an AggregateView would be configured by setting its defining SQL. However, that is where the similarities end. An AggregateView is not a materialization. This means that we do not create an actual table in the database for this view. Instead, an AggregateView provides a simple, table like, layer of of abstraction over a complex aggregation query. To better understand this concept, lets show how we would use an AggregateView to create the files-to-participant-perspective.

...

Code Block
WITH
	F2P AS (
		SELECT 
			FILE_ID,
			FILE_NAME,
			FILE_TYPE,
            FILE_SIZE,
			COUNT(PART_ID) AS PART_COUNT,
			SUM(CASE STAGE WHEN 'one' THEN 1 ELSE 0 END) AS STAGE_ONE_COUNT,
			SUM(CASE STAGE WHEN 'two' THEN 1 ELSE 0 END) AS STAGE_TWO_COUNT,
            MAX(AGE) AS MAX_AGE,
            MIN(AGE) AS MIN_AGE,
			GROUP_CONCAT(DISTINCT PART_ID) AS PART_IDS FROM MATERIAL
				WHERE FILE_ID IS NOT NULL GROUP BY FILE_ID, FILE_NAME, FILE_TYPE, FILE_SIZE
    ),
    AGG AS (
		SELECT FILE_ID, FILE_NAME, FILE_TYPE, FILE_SIZE, PART_COUNT,
			JSON_OBJECT('one', STAGE_ONE_COUNT, 'two', STAGE_TWO_COUNT) as PART_STAGE,
            JSON_OBJECT('min', MIN_AGE, 'max', MAX_AGE) as PART_AGE,
            PART_IDS
				FROM F2P
    )
SELECT * FROM AGG;

The above SQL is actaully actually a combination of the syn123’s defining SQL and the runtime query (select * from syn123). Specifically, two inner queries of the common table expression (CTE) (lines:2-22) are an expansion of the defining SQL. While the runtime query is transformed into the outer query of the CTE (line:23). In essence, the user is querying what appears to be a simple table.

...

In the final, section we will show how runtime filtering an and sorting would be applied using a few examples.

...

Code Block
select * from AGG ORDER BY CAST(JSON_EXTRACT(PART_STAGE, '$.two') AS UNSIGNED) ACS

The key to this entire design is that there is always a one-to-one translation for anything in the both the provide defining SQL and runtime queries.

New Features

In order to make the above use cases work using the provide example queries we are going to need to add several new features to the Synapse query system.

...

New column types - …

...

New Facet types - ….

...

New Entity Type - AggregateView …

...

above filters/sorting applied to the aggregation results. We still need to cover the case where the user is requesting a filter before aggregation. We will use the same example filter where the user per-selected participant IDs to: 2, 4, 5, & 7:

Code Block
languagesql
select * from syn123 where pre_agg(PART_ID in(2,4,5,7))

Here we have defined a new function called pre_agg() which is syntactic sugar that means apply this filter before aggregation. So rather than apply the filter at the end of the CTE (line:23) add it to the first layer of the CTE (line:14).

The key to this entire design is that there is always a one-to-one translation for anything in the both the provide defining SQL and runtime queries.

New Features

To see the full list of features needed to make this design work see the epic:

Jira Legacy
serverSystem JIRA
serverIdba6fb084-9827-3160-8067-8ac7470f78b2
keyPLFM-

...

7789