Page Comparison

The purpose of this document is to provide technical details of how the proposed /wiki/spaces/PD/pages/2730721508 might be built.

Setup

To help us understand all of the technical challenges for this project we built a toy example data model. This data model was designed to capture all of the technical challenges while remaining as simple and small as we could make it.

...

FILE_VIEW- This represents a Synapse file view that contains all of the files within the scope of the project. Each row represents a single file. The metadata about each file is captured in the columns of this table. These files represent the entire pool of files that the end user can draw from.
PARTICIPANTS - This represents a Synapse table that contains one row per participant. The metadata about each participant is captured in the columns of this table. These participants represent the entire pool of candidate participants that the end user can draw from.
FILE_TO_PART - This is a Synapse table that maps files to participant. In the real data a single participant might be mapped to thousands of files, and vise vice versa.
MATERIAL - This table represents a materialized view that is the de-normalized join of the proceeding preceding three tables. It includes files that might not have any participants and vise vice versa. In the real world this table would be automatically index indexed on all relevant columns. Note: A worst case for this table would be where each file is associated with each participant thereby resulting in a Cartesian product.

...

Expand

title	Expand to see the full ddl...

Code Block

DROP DATABASE `COHORT`;
CREATE DATABASE `COHORT`;

USE `COHORT`;

CREATE TABLE IF NOT EXISTS `FILE_VIEW` (
	`ID` BIGINT NOT NULL,
    `NAME` varchar(256) CHARACTER SET utf8mb4 COLLATE utf8mb4_bin NOT NULL,
    `TYPE` ENUM ('raw','proc') NOT NULL,
	`SIZE` BIGINT NOT NULL,
    `BEN_ID` BIGINT NOT NULL,
    PRIMARY KEY (`ID`)
);

CREATE TABLE IF NOT EXISTS `PARTICIPANTS` (
	`ID` BIGINT NOT NULL,
    `NAME` varchar(256) CHARACTER SET utf8mb4 COLLATE utf8mb4_bin NOT NULL,
	`STAGE` ENUM('one','two'),
    `AGE` BIGINT NOT NULL,
    PRIMARY KEY (`ID`)
);

CREATE TABLE IF NOT EXISTS `FILE_TO_PART` (
	`FILE_ID` BIGINT NOT NULL,
	`PART_ID` BIGINT NOT NULL,
    PRIMARY KEY (`FILE_ID`, `PART_ID`)
);

INSERT INTO `FILE_VIEW` VALUES 
	(1,'f1','raw',100,111),
	(2,'f2','raw',200,111),
	(3,'f3','raw',300,111),
	(4,'f3','raw',400,111),
	(5,'f5','proc',100,111),
	(6,'f6','proc',200,111),
	(7,'f7','proc',300,222),
	(8,'f8','proc',400,222),    
	(9,'no participants','proc',100,111)
    ;

INSERT INTO `PARTICIPANTS` VALUES 
	(1,'P1','one',10),
    (2,'P2','one',20),
	(3,'P3','one',30),
    (4,'P4','one',40),
	(5,'P5','two',10),
    (6,'P6','two',20),
	(7,'P7','two',30),
    (8,'P4','two',40),
	(9,'no files','one', 18),
	(10,'few files','two', 30)
    ;
    
INSERT INTO `FILE_TO_PART` VALUES 
    # f1 to all
	(1,1),
	(1,2),
	(1,3),
	(1,4),
	(1,5),
    (1,6),
    (1,7),
    (1,8),
    # f2 to even
    (2,2),
	(2,4),
    (2,6),
    (2,8),
    # f3 to odd
	(3,1),
	(3,3),
	(3,5),
    (3,7),
    # f4 only stage one
    (4,1),
    (4,2),
    (4,3),
    (4,4),
    # f5 only stage two
    (5,5),
    (5,6),
    (5,7),
    (5,8),
    # f6 under20
	(6,1),
    (6,2),
    (6,5),
    (6,6),
	# f7 over 20
	(7,3),
    (7,4),
    (7,7),
    (7,8),
	# f8 to all
	(8,1),
	(8,2),
	(8,3),
	(8,4),
	(8,5),
    (8,6),
    (8,7),
    (8,8),
    # p10 only few files
    (2,10),
    (3,10),
	(5,10)
    ;

/* 
This table is the denormalized materialization of all of the data.
It includes all files including files with no particpants.  It also
incldues all participants, even those with no files.
*/
CREATE TABLE `MATERIAL`
# includes all files even those without participants
SELECT F.ID AS FILE_ID, F.NAME AS FILE_NAME, F.TYPE AS FILE_TYPE, F.BEN_ID AS FILE_BEN_ID, P.ID AS PART_ID, P.NAME AS PART_NAME, P.STAGE, P.AGE 
	FROM `FILE_VIEW` F LEFT JOIN `FILE_TO_PART` F2P ON (F.ID = F2P.FILE_ID) LEFT JOIN `PARTICIPANTS` P ON ( F2P.PART_ID =P.ID )
UNION
# Add all particpants that do not have any files
SELECT F.ID AS FILE_ID, F.NAME AS FILE_NAME, F.TYPE AS FILE_TYPE, F.BEN_ID AS FILE_BEN_ID, P.ID AS PART_ID, P.NAME AS PART_NAME, P.STAGE, P.AGE 
	FROM `FILE_VIEW` F JOIN `FILE_TO_PART` F2P ON (F.ID = F2P.FILE_ID) RIGHT JOIN `PARTICIPANTS` P ON ( F2P.PART_ID =P.ID ) WHERE F.ID IS NULL;

Full De-normalized Data

After running the above script we can start to query the data. For example here is the contents of the MATERIAL table:

...

Expand

title	Expand to see the table

FILE_ID	FILE_NAME	FILE_TYPE	FILE_SIZE	FILE_BEN_ID	PART_ID	PART_NAME	STAGE	AGE
1	f1	raw	100	111	1	P1	one	10
1	f1	raw	100	111	2	P2	one	20
1	f1	raw	100	111	3	P3	one	30
1	f1	raw	100	111	4	P4	one	40
1	f1	raw	100	111	5	P5	two	10
1	f1	raw	100	111	6	P6	two	20
1	f1	raw	100	111	7	P7	two	30
1	f1	raw	100	111	8	P4	two	40
2	f2	raw	200	111	2	P2	one	20
2	f2	raw	200	111	4	P4	one	40
2	f2	raw	200	111	6	P6	two	20
2	f2	raw	200	111	8	P4	two	40
2	f2	raw	200	111	10	few files	two	30
3	f3	raw	300	111	1	P1	one	10
3	f3	raw	300	111	3	P3	one	30
3	f3	raw	300	111	5	P5	two	10
3	f3	raw	300	111	7	P7	two	30
3	f3	raw	300	111	10	few files	two	30
4	f3	raw	400	111	1	P1	one	10
4	f3	raw	400	111	2	P2	one	20
4	f3	raw	400	111	3	P3	one	30
4	f3	raw	400	111	4	P4	one	40
5	f5	proc	100	111	5	P5	two	10
5	f5	proc	100	111	6	P6	two	20
5	f5	proc	100	111	7	P7	two	30
5	f5	proc	100	111	8	P4	two	40
5	f5	proc	100	111	10	few files	two	30
6	f6	proc	200	111	1	P1	one	10
6	f6	proc	200	111	2	P2	one	20
6	f6	proc	200	111	5	P5	two	10
6	f6	proc	200	111	6	P6	two	20
7	f7	proc	300	222	3	P3	one	30
7	f7	proc	300	222	4	P4	one	40
7	f7	proc	300	222	7	P7	two	30
7	f7	proc	300	222	8	P4	two	40
8	f8	proc	400	222	1	P1	one	10
8	f8	proc	400	222	2	P2	one	20
8	f8	proc	400	222	3	P3	one	30
8	f8	proc	400	222	4	P4	one	40
8	f8	proc	400	222	5	P5	two	10
8	f8	proc	400	222	6	P6	two	20
8	f8	proc	400	222	7	P7	two	30
8	f8	proc	400	222	8	P4	two	40
9	no participants	proc	100	111	NULL	NULL	NULL	NULL
NULL	NULL	NULL	NULL	NULL	9	no files	one	18

Two Perspectives

While the MATERIAL table contains all of the data, it is not super user-friendlywould be challenging to use in its raw form. To improve usability, the data is exposed as two separate perspectives of the data:

...

Expand

title	Expand to see the query that generated this result...

Code Block

language	sql

WITH
	P2F AS (
		SELECT 
			PART_ID,
			MAX(PART_NAME) AS PART_NAME,
			MAX(STAGE) AS STAGE,
			MAX(AGE) AS AGE,
			COUNT(FILE_ID) AS FILE_COUNT,
			SUM(CASE FILE_TYPE WHEN 'raw' THEN 1 ELSE 0 END) AS RAW_FILES,
			SUM(CASE FILE_TYPE WHEN 'proc' THEN 1 ELSE 0 END) AS PROC_FILES,
            MAX(FILE_SIZE) AS `MAX_SIZE`,
			MIN(FILE_SIZE) AS `MIN_SIZE`,
			GROUP_CONCAT(DISTINCT FILE_ID) AS FILE_IDS
				FROM MATERIAL WHERE PART_ID IS NOT NULL GROUP BY PART_ID ORDER BY PART_ID ASC
	)
SELECT * FROM P2F;

...

PART_ID

...

PART_NAME

...

STAGE

...

AGE

...

FILE_COUNT

...

RAW_FILES

...

PROC_FILES

...

FILE_IDS

...

1

...

P1

...

one

...

10

...

5

...

3

...

2

...

1,3,4,6,8

...

2

...

P2

...

one

...

20

...

5

...

3

...

2

...

1,2,4,6,8

...

3

...

P3

...

one

...

30

...

5

...

3

...

2

,
    AGG AS (
		SELECT 
			PART_ID, PART_NAME, STAGE, AGE, FILE_COUNT,
			JSON_OBJECT('raw', RAW_FILES, 'proc', PROC_FILES) as FILE_TYPE,
			JSON_OBJECT('max', `MAX_SIZE`, 'min', `MIN_SIZE`) as FILE_SIZE,
			FILE_IDS    
				FROM P2F
	)
SELECT * FROM AGG;

PART_ID	PART_NAME	STAGE	AGE	FILE_COUNT	FILE_TYPE	FILE_SIZE	FILE_ID
1	P1	one	10	5	{"raw": 3, "proc": 2}	{"max": 400, "min": 100}	1,3,4,

7

6,8

4

2

P4

P2

one

40

20

5

{"raw": 3

2

, "proc": 2}

{"max": 400, "min": 100}

1,2,4,

7

6,8

5

3

P5

P3

two

one

10

30

5

{"raw": 3, "proc": 2

3

}

{"max": 400, "min": 100}

1,3,

5

4,

6

7,8

6

4

P6

P4

two

one

20

40

5

{"raw": 3, "proc": 2

3

}

{"max": 400, "min": 100}

1,2,

5

4,

6

7,8

7

5

P7

P5

two

30

10

5

{"raw": 2

3

, "proc": 3}

{"max": 400, "min": 100}

1,3,5,

7

6,8

8

6

P4

P6

two

40

20

5

2

3

1,

{"raw": 2, "proc": 3}

{"max": 400, "min": 100}

1,2,5,

7

10

few files

6,8

9

no files

one

18

0

7

P7

two

30

5

{"raw": 2, "proc": 3

2

}

{"max": 400, "min": 100}

1

2

,3,5

The unfiltered files-with-aggregated-participants would look like:

Expand

title	Exapnd to see the query that generated this result...

Code Block

WITH
	F2P AS (
		SELECT 
			FILE_ID,
			MAX(FILE_NAME) AS FILE_NAME,
			MAX(FILE_TYPE) AS FILE_TYPE,
            MAX(FILE_SIZE) AS FILE_SIZE,
			COUNT(PART_ID) AS PART_COUNT,
			SUM(CASE STAGE WHEN 'one' THEN 1 ELSE 0 END) AS STAGE_ONE_COUNT,
			SUM(CASE STAGE WHEN 'two' THEN 1 ELSE 0 END) AS STAGE_TWO_COUNT,
			GROUP_CONCAT(DISTINCT PART_ID) AS PART_IDS FROM MATERIAL
				WHERE FILE_ID IS NOT NULL GROUP BY FILE_ID
    )
SELECT * FROM F2P;

...

FILE_ID

...

FILE_NAME

...

FILE_TYPE

...

FILE_SIZE

...

PART_COUNT

...

STAGE_ONE_COUNT

...

STAGE_TWO_COUNT

...

PART_IDS

...

1

...

f1

...

raw

...

100

...

8

...

4

...

4

...

1,2,3,4,5,6,7,8

...

2

...

f2

...

raw

...

200

...

5

...

2

...

3

...

2,4,6,8,10

...

3

...

f3

...

raw

...

300

...

5

...

2

...

3

...

1,3,5,7,10

...

4

...

f3

...

raw

...

400

...

4

...

4

...

0

...

1,2,3,4

...

5

...

f5

...

proc

...

100

...

5

...

0

...

5

...

5,6,7,8,10

...

6

...

f6

...

proc

...

200

...

4

...

2

...

2

...

1,2,5,6

...

7

...

f7

...

proc

...

300

...

4

...

2

...

2

...

3,4,7,8

...

8

...

f8

...

proc

...

400

...

8

...

4

...

4

...

1,2,3,4,5,6,7,8

...

9

...

no participants

...

proc

...

100

...

0

...

0

...

0

...

NULL

Perspective Filtering

Now that we have two perspectives that show all of the information in a user friendly manner, it will be natural for uses to want to filter on each perspective. In fact, filtering is the point of the entire cohort builder exercise.

It is fairly trivial to filter by participant metadata on the participants-with-aggregated-files perspective. The same is true when filtering on file metadata on the files-with-aggregated-participants perspective. However, it is more complicated to filter on the aggregated columns of each perspective. In fact, the query for each perspective was built using a common-table-expression (CTE), to support aggregation filtering.

Let’s look at an example to see how to use the CTE to filter by aggregation data. A user looking at participants-with-aggregated-files perspective decides that they only want to include participants with at least 3 processed files in their cohort. To show these results we must add a filter to the p2f alias as follows (see line 14):

Code Block

WITH
	P2F AS (
		SELECT 
			PART_ID,
			MAX(PART_NAME) AS PART_NAME,
			MAX(STAGE) AS STAGE,
			MAX(AGE) AS AGE,
			COUNT(FILE_ID) AS FILE_COUNT,
			SUM(CASE FILE_TYPE WHEN 'raw' THEN 1 ELSE 0 END) AS RAW_FILES,
			SUM(CASE FILE_TYPE WHEN 'proc' THEN 1 ELSE 0 END) AS PROC_FILES,
			GROUP_CONCAT(DISTINCT FILE_ID) AS FILE_IDS
				FROM MATERIAL WHERE PART_ID IS NOT NULL GROUP BY PART_ID ORDER BY PART_ID ASC
	)
SELECT * FROM P2F WHERE PROC_FILES >= 3;

...

PART_ID

...

PART_NAME

...

STAGE

...

AGE

...

FILE_COUNT

...

RAW_FILES

...

PROC_FILES

...

FILE_IDS

...

5

...

P5

...

two

...

10

...

5

...

2

...

3

...

1,3,5,6,8

...

6

...

P6

...

two

...

20

...

5

...

2

...

3

...

1,2,5,6,8

...

7

...

P7

...

two

...

30

...

5

...

2

...

3

...

1,3,5,7,8

...

8

...

P4

...

two

...

40

...

5

...

2

...

3

...

1,2,5,7,8

The result is four participants (5,6,7,8). Once the user has selected their starting participants, the next step is to take their results to the files-with-aggregated-participants perspective to finalize their selected files. This creates a new challenge. Since we filtered by an aggregation in one perspective that does not exist in the other, how to we transfer the user’s filter to the other perspective?

One scale limited solution to this problem is to capture the resulting participant from the first perspective and apply them as a filter to the second. Following the example above we would transfer the four participants (5,6,7,8) to the files-with-aggregated-participants as follows (see in clause at line 12):

Code Block

WITH
	F2P AS (
		SELECT 
			FILE_ID,
			MAX(FILE_NAME) AS FILE_NAME,
			MAX(FILE_TYPE) AS FILE_TYPE,
            MAX(FILE_SIZE) AS FILE_SIZE,
			COUNT(PART_ID) AS PART_COUNT,
			SUM(CASE STAGE WHEN 'one' THEN 1 ELSE 0 END) AS STAGE_ONE_COUNT,
			SUM(CASE STAGE WHEN 'two' THEN 1 ELSE 0 END) AS STAGE_TWO_COUNT,
			GROUP_CONCAT(DISTINCT PART_ID) AS PART_IDS FROM MATERIAL
				WHERE FILE_ID IS NOT NULL AND PART_ID IN (5,6,7,8) GROUP BY FILE_ID
    )
SELECT * FROM F2P;

...

FILE_ID

...

FILE_NAME

...

FILE_TYPE

...

FILE_SIZE

...

PART_COUNT

...

STAGE_ONE_COUNT

...

STAGE_TWO_COUNT

...

PART_IDS

...

1

...

f1

...

raw

...

100

...

4

...

0

...

4

...

5,6,7,8

...

2

...

f2

...

raw

...

200

...

2

...

0

...

2

...

6,8

...

3

...

f3

...

raw

...

300

...

2

...

0

...

2

...

5,7

...

5

...

f5

...

proc

...

100

...

4

...

0

...

4

...

5,6,7,8

...

6

...

f6

...

proc

...

200

...

2

...

0

...

2

...

5,6

...

7

...

f7

...

proc

...

300

...

2

...

0

...

2

...

7,8

...

8

...

f8

...

proc

...

400

...

4

...

0

...

4

...

5,6,7,8

Notice, this filter was applied to the inner query of the CTE. Since the outer F2P does not contain raw participant ids we cannot add the filter there.

Extending our existing user story, the user decides they only want to include files that include all four of their selected participants, so they add the filter (see: line 14):

Code Block

WITH
	F2P AS (
		SELECT 
			FILE_ID,
			MAX(FILE_NAME) AS FILE_NAME,
			MAX(FILE_TYPE) AS FILE_TYPE,
            MAX(FILE_SIZE) AS FILE_SIZE,
			COUNT(PART_ID) AS PART_COUNT,
			SUM(CASE STAGE WHEN 'one' THEN 1 ELSE 0 END) AS STAGE_ONE_COUNT,
			SUM(CASE STAGE WHEN 'two' THEN 1 ELSE 0 END) AS STAGE_TWO_COUNT,
			GROUP_CONCAT(DISTINCT PART_ID) AS PART_IDS FROM MATERIAL
				WHERE FILE_ID IS NOT NULL AND PART_ID IN (5,6,7,8) GROUP BY FILE_ID
    )
SELECT * FROM F2P WHERE PART_COUNT >= 4;

...

FILE_ID

...

FILE_NAME

...

FILE_TYPE

...

FILE_SIZE

...

PART_COUNT

...

STAGE_ONE_COUNT

...

STAGE_TWO_COUNT

...

PART_IDS

...

1

...

f1

...

raw

...

100

...

4

...

0

...

4

...

5,6,7,8

...

5

...

f5

...

proc

...

100

...

4

...

0

...

4

...

5,6,7,8

...

8

...

f8

...

proc

...

400

...

4

...

0

...

4

...

5,6,7,8

Their final result now includes three files (1,5,8) that each include their four selected participants (5,6,7,8).

Row-level Filtering

Anytime a runtime query result can include files, we need to ensure that the system automatically filters out files that the caller does not have permission to read. This is something we must consider if we want to extend runtime queries to support CTE.

In our example FILE_VIEW table we include a column called BEN_ID. It represents the ID of the Project/Folder that controls access to a file via its access control list. It is called a benefactor ID because it defines where a file inherits its permissions from. This column simulate how Synapse treats a real file view. You might also notice that this column was include in the materialized view (MATERIAL). This also simulates how Synapse treats materialized views that include data from file views.

The BEN_ID column is then used at query time to automatically filter out rows that the caller cannot see. This filtering is done in three steps.

In step 1 we extract the FROM and WHERE from the user provided query and use them in combination with a distinct SELECT clause. For this example the results would look like:

Code Block
SELECT DISTINCT FILE_BEN_ID FROM MATERIAL WHERE FILE_ID IS NOT NULL AND PART_ID IN (5,6,7,8);

This query will return two values: 111, and 222. This tells us the full set of possible benefactors that this query could return. These values are then used in step 2 which is an authorization check to determine the sub-set that they the user is allowed to see. For our example, lets say the caller is allowed to see 111 but not 222.

This information is then used in the final step (3), where we rewrite the user’s query to unconditionally limit to rows that the user is authorized to see:

Code Block

WITH
	F2P AS (
		SELECT 
			FILE_ID,
			MAX(FILE_NAME) AS FILE_NAME,
			MAX(FILE_TYPE) AS FILE_TYPE,
            MAX(FILE_SIZE) AS FILE_SIZE,
			COUNT(PART_ID) AS PART_COUNT,
			SUM(CASE STAGE WHEN 'one' THEN 1 ELSE 0 END) AS STAGE_ONE_COUNT,
			SUM(CASE STAGE WHEN 'two' THEN 1 ELSE 0 END) AS STAGE_TWO_COUNT,
			GROUP_CONCAT(DISTINCT PART_ID) AS PART_IDS FROM MATERIAL
				WHERE (FILE_ID IS NOT NULL AND PART_ID IN (5,6,7,8)) AND FILE_BEN_ID IN (111) GROUP BY FILE_ID
    )
SELECT * FROM F2P;

Note: At line 12, the user’s original where clause is in parenthesis and combined with:FILE_BEN_ID IN (111).

In summary, if we want to support CTE in runtime queries we will need to extend the authorization sub-system to extract and rebuild the inner query, while leaving outer query intact.

New Features

In order to make the above use cases work using the provide example queries we are going to need to add several new features to the Synapse query system.

...

CASE statements - We do not currently support CASE statements in the Synapse SQL. We would need to extend the SQL parser to correctly handle such cases.

...

CTE - We do not currently support common table expressions (CTE). Note: CTE can be more complex than the simple CTEs used in above examples. For example, it is possible for a CTE to include many tables, joins, unions, and sub-queries. Supporting complex CTE that includes joins, unions or sub-queries would potentially allow users to execute system debilitating queries. However, all of the examples above are simple an only involve a single table with aggregation. Such a simple CTE poses little to no risk. Therefore, we would like to add CTE support to runtime queries. This means a materialized view would not be needed to use a CTE.

...

,7,8
8	P4	two	40	5	{"raw": 2, "proc": 3}	{"max": 400, "min": 100}	1,2,5,7,8
9	no files	one	18	0	{"raw": 0, "proc": 0}	{"max": null, "min": null}
10	few files	two	30	3	{"raw": 2, "proc": 1}	{"max": 300, "min": 100}	2,3,5

The unfiltered files-with-aggregated-participants would look like:

Expand

title	Exapnd to see the query that generated this result...

Code Block

language	sql

WITH
	F2P AS (
		SELECT 
			FILE_ID,
			MAX(FILE_NAME) AS FILE_NAME,
			MAX(FILE_TYPE) AS FILE_TYPE,
            MAX(FILE_SIZE) AS FILE_SIZE,
			COUNT(PART_ID) AS PART_COUNT,
			SUM(CASE STAGE WHEN 'one' THEN 1 ELSE 0 END) AS STAGE_ONE_COUNT,
			SUM(CASE STAGE WHEN 'two' THEN 1 ELSE 0 END) AS STAGE_TWO_COUNT,
            MAX(AGE) AS MAX_AGE,
            MIN(AGE) AS MIN_AGE,
			GROUP_CONCAT(DISTINCT PART_ID) AS PART_IDS FROM MATERIAL
				WHERE FILE_ID IS NOT NULL GROUP BY FILE_ID
    ),
    AGG AS (
		SELECT FILE_ID, FILE_NAME, FILE_TYPE, FILE_SIZE, PART_COUNT,
			JSON_OBJECT('one', STAGE_ONE_COUNT, 'two', STAGE_TWO_COUNT) as PART_STAGE,
            JSON_OBJECT('min', MIN_AGE, 'max', MAX_AGE) as PART_AGE,
            PART_IDS
				FROM F2P
    )
SELECT * FROM AGG;

FILE_ID	FILE_NAME	FILE_TYPE	FILE_SIZE	PART_COUNT	PART_STATE	PART_AGE	PART_ID
1	f1	raw	100	8	{"one": 4, "two": 4}	{"max": 40, "min": 10}	1,2,3,4,5,6,7,8
2	f2	raw	200	5	{"one": 2, "two": 3}	{"max": 40, "min": 20}	2,4,6,8,10
3	f3	raw	300	5	{"one": 2, "two": 3}	{"max": 30, "min": 10}	1,3,5,7,10
4	f3	raw	400	4	{"one": 4, "two": 0}	{"max": 40, "min": 10}	1,2,3,4
5	f5	proc	100	5	{"one": 0, "two": 5}	{"max": 40, "min": 10}	5,6,7,8,10
6	f6	proc	200	4	{"one": 2, "two": 2}	{"max": 20, "min": 10}	1,2,5,6
7	f7	proc	300	4	{"one": 2, "two": 2}	{"max": 40, "min": 30}	3,4,7,8
8	f8	proc	400	8	{"one": 4, "two": 4}	{"max": 40, "min": 10}	1,2,3,4,5,6,7,8
9	no participants	proc	100	0	{"one": 0, "two": 0}	{"max": null, "min": null}

Expected User Experience

In a previous version of this document we considered exposing these perspectives to users by requiring UI engineers generate the SQL needed to generate them. If you expanded the query sections for these two perspectives you would notice that the SQL is non-trivial. To sort and/or filter on these perspectives (covered in a later section), adds yet another layer of complexity.

If you look at a complex table/view with multiple facets in any portal or the Synapse UI, you will see a set of fairly standard controls on the left-hand-side panel and in the columns headers. Users are able to filter and sort the results by manipulating theses controls. Typically, the UI code does not need to directly parse or generate SQL when these controls change. Instead, the UI can pass model objects that describe the filtering and sorting to the Synapse query service, which then manipulate the SQL on the caller’s behalf. This basic functionally works for all table/view features including:

File Views
Tables
Materialized Views
Submission Views (challenges)

So rather than pushing the complexity to the UI, it would be better if these perspectives behaves like any other Synapse table/view. Parsing and SQL manipulation should not be required. This means that the perspectives should behave as if they are the simple tables they appear to be. So, our new design attempts to hide most of the complexity without losing any of the functionality. We would also like the resulting feature to be generic enough to work for other (non-cohort-builder) use cases.

Perspective Details

If we want to treat these perspectives as simple tables, what is the minimum information needed to describe them? In this section we will attempt to answer this questions by focusing on the files-to-participant-perspective. Our assumption is that both perspectives are two variations of the same theme, so anything that works for one, should work for the other.

Schema

The UI utilizes the schema of a table/view to determine how it should be rendered and controlled. Specifically, the column types define how each cell should be rendered/sorted. While each column’s facet types determine which controls should be include in headers and the left-hand-side pane of the UI.

When you define a materialized view (MV) in Synapse you do not directly define its schema. Instead, the schema is automatically determined by the select-statement of the MV’s defining SQL. For example, if a MV contains a join of two tables, the select statement will typically refer to the columns of each of the tables. So for MVs, we automatically inherit the ColumnModel of any selected column from its source table. While this simple assumption might work for most MV use cases it does not work well for aggregation or any other type of “derived column”.

For example, the PART_COUNT column does not exist in the source MATERIAL table, instead it is a derived aggregation. Count is a simple case that always returns an integer so it should be safe to assume a column type of INTEGER. However, what about the other three aggregate columns: PART_STAGE, PART_AGE, and PART_ID? Each contains different types of data.

The following is an example of what the files-to-participant-perspective schema could be:

id	name	columnType	enumValues	facetType	agg_functions
11	FILE_ID	INTEGER
22	FILE_NAME	STRING
33	FILE_TYPE	STRING	[raw,proc]	enumeration
44	FILE_SIZE	INTEGER		range
55	PART_COUNT	INTEGER		range
66	PART_STAGE	JSON	[one,two]	aggregate-enumeration
77	PART_AGE	JSON		aggregate-range	min,max
88	PART_ID	INTEGER_LIST

The first five columns (11,22,33,44,55) and the last (88) are all existing column types supported by both the clients and server. This means the UI should be able to treat those columns exactly like any other column of the same type. However, PART_STAGE (76) and PART_AGE (77) include new types definitions. We will talk about these new types next.

Column TypeJSON

Technically, the aggregate statistics for the PART_STAGE are gathered in two separate columns using the following in the the SQL select:

Code Block

language	sql

			SUM(CASE STAGE WHEN 'one' THEN 1 ELSE 0 END) AS STAGE_ONE_COUNT,
			SUM(CASE STAGE WHEN 'two' THEN 1 ELSE 0 END) AS STAGE_TWO_COUNT,

In fact, we need a column for each distinct enumeration value. So an enumeration with ten possible values would require ten columns to generate the appropriate statistics. In the UI design for the cohort builder, this type of aggregate enumeration data is shown as a single complex column. This is a nice, compact, way to look at a lot of data. One way to represent such complexity is to combine all of the data for a single cell into JSON.

MySQL provides JSON specific functions that we can use to combine multiple column into a single JSON column. For example, the following was used to recombine the two columns for ‘one’ and ‘two’ into a single column of JSON:

Code Block
JSON_OBJECT('one', STAGE_ONE_COUNT, 'two', STAGE_TWO_COUNT) as PART_STAGE,

MySQL also provides JSON specific functions that can be used for both sorting and filtering on contents of JSON columns. For example, let’s assume the user wants to filter on the FILE_TYPE column as follows:

Code Block

language	sql

...where PART_STAGE.one > 2

This filter is syntactic sugar for:

Code Block

language	sql

...where JSON_EXTRACT(PART_STAGE, '$.one') > 2

We can also use a similar syntax for sorting:

Code Block

language	sql

... order by PART_STAGE.two asc

Which is syntactic sugar for:

Code Block

language	sql

... ORDER BY CAST(JSON_EXTRACT(PART_STAGE, '$.two') AS UNSIGNED) ACS

As we can see, having a new column type of JSON, enables us to solve complex problems for columns with complex data. This column type will likely prove useful for other non-cohort-builder use cases.

Aggregate Functions

The basic idea for this column type metadata was to provide a list of the aggregate functions that should be provided for an aggregate numeric column like PART_AGE. In our example schema column ID 77 has: min,max. This indicates that we should gather both the min and max for this column. Other options might include: sum, avg, count, count(distinct), std.

Is there a better place to capture this information? Should we just hard-code it to min and max?

New Facet Types

Before we describe the new facet types let’s quickly review our two existing facet types:

facetType	Applies To	Statistics	Filtering
enumeration	Columns with one or more enumValue.	Provides the total count of each occurrence of each enumeration value with the current query filter applied.	When an enumeration value is “selected' all rows with non-selected enumeration values are excluded from the result.
range	Numeric columns	Provides the min and max of a numeric column with the current query filter applied.	The user provides an upper and lower bounds. This adds a filter to the numeric column such that the values are between lower and upper (inclusive).

Facet Type: aggregate-enumeration

To define this facet type, let’s try to define what type of statistics and filtering controls we would want to provide for the PART_STAGE column.

Let’s start with the statistics that would be shown for the entire table on the left-hand-side in the UI:

value	min	max
one	0	4
two	0	4

Given these statistics a user might want to do a “range” filter such as:

Code Block
... where PART_STAGE.one between 2 and 3 AND PART_STAGE.two betweeen 1 and 2

You could image the controls for such a facet would have two range sliders; one for each stage enumeration value.

Facet Type: aggregate-range

To define this facet type, let’s try to define what type of statistics and filtering controls we would want to provide for the PART_AGE column.

Since each row contains a min and a max, the logical statistics for the entire table seem to be the min of the mins, max of the mins….

value	min	max
min	10	30
max	20	40

These statistics seem similar to aggregate-enumeration. Should they really be two separate types?

Filtering Before Aggregation

So far, we have only discussed applying filters to aggregated columns. Such filters must be applied after the aggregation results are calculated. However, the main use case requires some runtime filtering to occur before aggregation is applied.

For example, a user might start with the participants-with-aggregated-files perspective and narrow down their selection of participant IDs to: 2, 4, 5, & 7.

They then will want to apply their selection of participant IDs as a filter to files-with-aggregated-participants perspective. This filter must change the aggregated participant data. This means the filter must be applied before aggregation occurs.

Here is what the results look like with this filter applied:

Expand

title	Expand to see the sql that generated these results. Line:15 shows the filtering in the first layer of SQL.

Code Block

WITH
	F2P AS (
		SELECT 
			FILE_ID,
			FILE_NAME,
			FILE_TYPE,
            FILE_SIZE,
			COUNT(PART_ID) AS PART_COUNT,
			SUM(CASE STAGE WHEN 'one' THEN 1 ELSE 0 END) AS STAGE_ONE_COUNT,
			SUM(CASE STAGE WHEN 'two' THEN 1 ELSE 0 END) AS STAGE_TWO_COUNT,
            MAX(AGE) AS MAX_AGE,
            MIN(AGE) AS MIN_AGE,
			GROUP_CONCAT(DISTINCT PART_ID) AS PART_IDS FROM MATERIAL
				WHERE FILE_ID IS NOT NULL 
				AND PART_ID IN(2, 4, 5, 7)
                GROUP BY FILE_ID, FILE_NAME, FILE_TYPE, FILE_SIZE
    ),
    AGG AS (
		SELECT FILE_ID, FILE_NAME, FILE_TYPE, FILE_SIZE, PART_COUNT,
			JSON_OBJECT('one', STAGE_ONE_COUNT, 'two', STAGE_TWO_COUNT) as PART_STAGE,
            JSON_OBJECT('min', MIN_AGE, 'max', MAX_AGE) as PART_AGE,
            PART_IDS
				FROM F2P
    )
SELECT * FROM AGG;

FILE_ID	FILE_NAME	FILE_TYPE	FILE_SIZE	PART_COUNT	PART_STATE	PART_AGE	PART_ID
1	f1	raw	100	4	{"one": 2, "two": 2}	{"max": 40, "min": 10}	2,4,5,7
2	f2	raw	200	2	{"one": 2, "two": 0}	{"max": 40, "min": 20}	2,4
3	f3	raw	300	2	{"one": 0, "two": 2}	{"max": 30, "min": 10}	5,7
4	f3	raw	400	2	{"one": 2, "two": 0}	{"max": 40, "min": 20}	2,4
5	f5	proc	100	2	{"one": 0, "two": 2}	{"max": 30, "min": 10}	5,7
6	f6	proc	200	2	{"one": 1, "two": 1}	{"max": 20, "min": 10}	2,5
7	f7	proc	300	2	{"one": 1, "two": 1}	{"max": 40, "min": 30}	4,7
8	f8	proc	400	4	{"one": 2, "two": 2}	{"max": 40, "min": 10}	2,4,5,7

Notice, the previous unfiltered results for file:1 had a PART_COUNT=8, while the filtered results have a PART_COUNT=4.

If the aggregation results did not need to change at runtime, then we could simply use a materialized view as a solution to the entire problem. For example, we could pre-build a MV with millions rows of “static” aggregate data. End users could query this “static” data at runtime without any problems.

On the other hand, if the user’s runtime selections can change the aggregation results, then materialization does not help. In fact, it would require that we rebuild millions of material rows with each click. This means we need a solution that will support filtering both before and after aggregation without materialization.

Aggregate View

We are proposing adding a new entity Type called AggregateView. Like a MaterializedView, an AggregateView would be configured by setting its defining SQL. However, that is where the similarities end. An AggregateView is not a materialization. This means that we do not create an actual table in the database for this view. Instead, an AggregateView provides a simple, table like, layer of of abstraction over a complex aggregation query. To better understand this concept, lets show how we would use an AggregateView to create the files-to-participant-perspective.

Creating an AggregateView

Let’s assume we have already created our desired ColumnModels according to the above schema, with the resulting column IDs: 11,22,33,44,55,66,77,88. We can now create a new AggregateView with the following definingSQL:

Code Block

language	sql

 SELECT 
			CAST(FILE_ID AS 11),
			CAST(FILE_NAME AS 22),
			CAST(FILE_TYPE AS 33),
            CAST(FILE_SIZE AS 44),
			CAST(COUNT(PART_ID) AS 55),
			AGG_EXPAND(STAGE AS 66),
			AGG_EXPAND(AGE AS 77),
			CAST(GROUP_CONCAT(DISTINCT PART_ID) AS 88)
				FROM MATERIAL
				WHERE FILE_ID IS NOT NULL GROUP BY FILE_ID, FILE_NAME, FILE_TYPE, FILE_SIZE

Next we will cover each line of this SQL.

The first thing to notice is that the SQL is an aggregation as it contains a GROUP BY (line:11). Specifically, we are grouping by all of the relevant columns of the FILE_VIEW.

Next, at line:2 we have CAST(FILE_ID AS 11), which defines the first column of the perspective. This line is simply casting the FILE_ID as column ID: 11. In other words, line 2 tells Synapse to treat the first column of the resulting table as type INTEGER with the name FILE_ID (see: ID 11). Note: Since FILE_ID is part of the group by, we are not required apply an aggregation function to it. Any column that is not part of the group by must have some type of aggregation functions.

Lines 3-5 are similar to line:2 where the rest of the file’s columns are cast to their respective column model IDs.

At line:6 we have our first aggregation function COUNT(PART_ID). Since a single file ID could map to many participants, we need an aggregation function define how to handle the “many”. In this case we simply want the count. We then cast the resulting count as column ID=55, which has a simple INTEGER type.

At line:7 we have a new function: AGG_EXPAND(STAGE AS 66) called AGG_EXPAND. This function call is syntactic sugar that tells Synapse to do the following:

Add a column for each value of the enumeration with a case statement that will count the occurrences of each.
In the next layer of the CTE, recombine all of the expanded columns into a single column of JSON.
In the final layer of the CTE, cast the resulting JSON as column ID=66.

At line:8 we have AGG_EXPAND(AGE AS 77) which will do a similar expansion to the previous case. This expansion will create a column for each aggregate function defined in column ID=77.

Finally, at line:9 we have CAST(GROUP_CONCAT(DISTINCT PART_ID) AS 88). The group concat function will create a comma separated list of all of the PART_IDs that match each file. The results are then cast to column ID=88 which is of type STRING_LIST. This means this column will behave similar to other string list columns in Synapse.

Querying an AggregateView

Once we have defined our AggregateView we can run a query against it like any other table/view in Synapse. Let’s assume that we created the AggregateView using the defining SQL from the previous section. The resulting AggregateView is then assigned syn123. If we get the schema for syn123, we would see its column IDs are: 11,22,33,44,55,66,77,88.

To query this view we could simply send the following to the query service:

Code Block
select * from syn123

The results of this query would be exactly the same as the unfiltered files-with-aggregated-participants table shown above.

So how does that work exactly? Basically, when the user provides select * from syn123 at runtime, we run the following query on their behalf:

Code Block

WITH
	AGG AS (
		SELECT 
			FILE_ID,
			FILE_NAME,
			FILE_TYPE,
            FILE_SIZE,
			COUNT(PART_ID) AS PART_COUNT,
            JSON_OBJECT('one', SUM(CASE STAGE WHEN 'one' THEN 1 ELSE 0 END), 
						'two', SUM(CASE STAGE WHEN 'two' THEN 1 ELSE 0 END)) as PART_STAGE,
            JSON_OBJECT('min', MIN(AGE), 'max', MAX(AGE)) as PART_AGE,
			GROUP_CONCAT(DISTINCT PART_ID) AS PART_IDS FROM MATERIAL
				WHERE FILE_ID IS NOT NULL GROUP BY FILE_ID, FILE_NAME, FILE_TYPE, FILE_SIZE
	)
SELECT * FROM AGG;

The above SQL is actually a combination of the syn123’s defining SQL and the runtime query (select * from syn123). Specifically, the inner query of the common table expression (CTE) (lines:3-14) are an expansion of the defining SQL. While the runtime query is transformed into the outer query of the CTE (line:15). In essence, the user is querying what appears to be a simple table.

A real runtime query transformation would be more complex but basic principals would still apply. For example, since our MATERIAL table includes files, the transformation process would include adding a row-level-filter to hide rows where the user lacks the read permission. This type of query manipulation is already common for existing Synapse tables/views.

In the next section, we will show how runtime filtering and sorting would be applied using a few examples.

First, lets assume that the user wants to only see rows where PART_STAGE ‘one’ is greater than two:

Code Block
select * from syn123 where PART_STAGE.one > 2

For this query the first 14 lines of the above query would remain the same, while the last line (line:15) would become:

Code Block
select * from AGG where JSON_EXTRACT(PART_STAGE, '$.one') > 2;

A sorting example would be similar. For example to sort on PART_STAGE ‘two’ asc:

Code Block
select * from syn123 order by PART_STAGE.two asc

Again we would only need to change the last line the CTE to be:

Code Block
select * from AGG ORDER BY CAST(JSON_EXTRACT(PART_STAGE, '$.two') AS UNSIGNED) ACS

The above filters/sorting applied to the aggregation results. We still need to cover the case where the user is requesting a filter before aggregation. We will use the same example filter where the user per-selected participant IDs to: 2, 4, 5, & 7:

Code Block

language	sql

select * from syn123 where pre_agg(PART_ID in(2,4,5,7))

Here we have defined a new function called pre_agg() which is syntactic sugar that means apply this filter before aggregation. So rather than apply the filter at the end of the CTE (line:15) it is added to the inner layer of the CTE (line:13).

The key to this entire design is that there is always a one-to-one translation for anything in the both the provide defining SQL and runtime queries.

New Features

The implementation plan is to divide the work into two phases:

Phase One - Add support for CTE, CASE Statement, JSON columns/functions to the runtime query services. At the end of this phase a user should be able to run this type of query against any Synapse table/view:

Code Block

WITH
	AGG AS (
		SELECT 
			FILE_ID,
			FILE_NAME,
			FILE_TYPE,
            FILE_SIZE,
			COUNT(PART_ID) AS PART_COUNT,
            JSON_OBJECT('one', SUM(CASE STAGE WHEN 'one' THEN 1 ELSE 0 END), 'two', SUM(CASE STAGE WHEN 'two' THEN 1 ELSE 0 END)) as PART_STAGE,
            JSON_OBJECT('min', MIN(AGE), 'max', MAX(AGE)) as PART_AGE,
			GROUP_CONCAT(DISTINCT PART_ID) AS PART_IDS FROM MATERIAL
				WHERE FILE_ID IS NOT NULL GROUP BY FILE_ID, FILE_NAME, FILE_TYPE, FILE_SIZE
	)
SELECT * FROM AGG;

Phase Two: Add AggregateViews plus the new aggregate facet types as syntactic sugar that will be expanded to the full SQL from phase one at runtime.

To see the full list of features needed to make this design work see the epic:

Jira Legacy

server	System JIRA
serverId	ba6fb084-9827-3160-8067-8ac7470f78b2
key	PLFM-7789

Versions Compared

Old Version 2

New Version Current

Key

Setup

Full De-normalized Data

Two Perspectives

Perspective Filtering

Row-level Filtering

New Features

Expected User Experience

Perspective Details

Schema

Column TypeJSON

Aggregate Functions

New Facet Types

Facet Type: aggregate-enumeration

Facet Type: aggregate-range

Filtering Before Aggregation

Aggregate View

Creating an AggregateView

Querying an AggregateView

New Features