The purpose of this document is to provide technical details of how the proposed /wiki/spaces/PD/pages/2730721508 might be built.
Setup
To help us understand the technical challenges of this project, we built a toy example data model, designed to capture all of those challenges while remaining as simple and small as possible.
The data model is composed of three main building blocks, plus a materialized view derived from them:
FILE_VIEW - This represents a Synapse file view that contains all of the files within the scope of the project. Each row represents a single file. The metadata about each file is captured in the columns of this table. These files represent the entire pool of files that the end user can draw from.
PARTICIPANTS - This represents a Synapse table that contains one row per participant. The metadata about each participant is captured in the columns of this table. These participants represent the entire pool of candidate participants that the end user can draw from.
FILE_TO_PART - This is a Synapse table that maps files to participants. In the real data, a single participant might be mapped to thousands of files, and vice versa.
MATERIAL - This table represents a materialized view that is the de-normalized join of the preceding three tables. It includes files that might not have any participants, and vice versa. In the real world, this table would be automatically indexed on all relevant columns. Note: A worst case for this table would be one where each file is associated with every participant, resulting in a Cartesian product; for example, 1,000 files each mapped to all of 1,000 participants would produce 1,000,000 MATERIAL rows.
If you would like to try out the data model, here is the DDL:
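What follows is a minimal sketch of such a DDL, in MySQL syntax (the example queries below use GROUP_CONCAT). The table and column names are taken from the queries and results in this document; the column types are inferred from the sample data, and the INSERT statements that populate the sample rows are omitted.

-- Sketch only: types are inferred, and the INSERTs for the sample rows are omitted.
CREATE TABLE FILE_VIEW (
    FILE_ID   BIGINT PRIMARY KEY,
    FILE_NAME VARCHAR(256),
    FILE_TYPE VARCHAR(16),  -- 'raw' or 'proc'
    FILE_SIZE BIGINT,
    BEN_ID    BIGINT        -- benefactor ID (see Row-level Filtering below)
);

CREATE TABLE PARTICIPANTS (
    PART_ID   BIGINT PRIMARY KEY,
    PART_NAME VARCHAR(256),
    STAGE     VARCHAR(16),  -- 'one' or 'two'
    AGE       INT
);

CREATE TABLE FILE_TO_PART (
    FILE_ID BIGINT,
    PART_ID BIGINT
);

-- MySQL has no native materialized views, so a plain view stands in for the
-- Synapse materialized view here. The UNION of two LEFT JOINs emulates a full
-- outer join, so files without participants and participants without files
-- both appear in MATERIAL.
CREATE VIEW MATERIAL AS
SELECT F.FILE_ID, F.FILE_NAME, F.FILE_TYPE, F.FILE_SIZE, F.BEN_ID AS FILE_BEN_ID,
       P.PART_ID, P.PART_NAME, P.STAGE, P.AGE
  FROM FILE_VIEW F
  LEFT JOIN FILE_TO_PART M ON F.FILE_ID = M.FILE_ID
  LEFT JOIN PARTICIPANTS P ON M.PART_ID = P.PART_ID
UNION
SELECT F.FILE_ID, F.FILE_NAME, F.FILE_TYPE, F.FILE_SIZE, F.BEN_ID AS FILE_BEN_ID,
       P.PART_ID, P.PART_NAME, P.STAGE, P.AGE
  FROM PARTICIPANTS P
  LEFT JOIN FILE_TO_PART M ON P.PART_ID = M.PART_ID
  LEFT JOIN FILE_VIEW F ON M.FILE_ID = F.FILE_ID;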
Full De-normalized Data
After running the above script, we can start to query the data. For example, the contents of the MATERIAL table can be listed with:
select * from MATERIAL;
Two Perspectives
While the MATERIAL table contains all of the data, it is not very user friendly. To make it user friendly, we create two separate perspectives of the data:
Participants with aggregated files - In the results of this query, each row represents a single participant, with columns for the participant’s metadata. Additional columns contain file aggregation data for each participant.
Files with aggregated participants - In the results of this query, each row represents a single file, with columns for the file’s metadata. Additional columns contain participant aggregation data for each file.
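Each perspective is produced by an aggregation query over the MATERIAL table, wrapped in a CTE so that the aggregated columns can be filtered later (the filtered variants appear in the Perspective Filtering section below). Presumably, the unfiltered participants-with-aggregated-files query is just the P2F CTE with no outer filter:

WITH P2F AS (
    SELECT
        PART_ID,
        MAX(PART_NAME) AS PART_NAME,
        MAX(STAGE) AS STAGE,
        MAX(AGE) AS AGE,
        COUNT(FILE_ID) AS FILE_COUNT,
        SUM(CASE FILE_TYPE WHEN 'raw' THEN 1 ELSE 0 END) AS RAW_FILES,
        SUM(CASE FILE_TYPE WHEN 'proc' THEN 1 ELSE 0 END) AS PROC_FILES,
        GROUP_CONCAT(DISTINCT FILE_ID) AS FILE_IDS
    FROM MATERIAL
    WHERE PART_ID IS NOT NULL
    GROUP BY PART_ID
    ORDER BY PART_ID ASC
)
SELECT * FROM P2F;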
The unfiltered participants-with-aggregated-files results would look like:
PART_ID | PART_NAME | STAGE | AGE | FILE_COUNT | RAW_FILES | PROC_FILES | FILE_IDS |
---|---|---|---|---|---|---|---|
1 | P1 | one | 10 | 5 | 3 | 2 | 1,3,4,6,8 |
2 | P2 | one | 20 | 5 | 3 | 2 | 1,2,4,6,8 |
3 | P3 | one | 30 | 5 | 3 | 2 | 1,3,4,7,8 |
4 | P4 | one | 40 | 5 | 3 | 2 | 1,2,4,7,8 |
5 | P5 | two | 10 | 5 | 2 | 3 | 1,3,5,6,8 |
6 | P6 | two | 20 | 5 | 2 | 3 | 1,2,5,6,8 |
7 | P7 | two | 30 | 5 | 2 | 3 | 1,3,5,7,8 |
8 | P8 | two | 40 | 5 | 2 | 3 | 1,2,5,7,8 |
9 | no files | one | 18 | 0 | 0 | 0 | |
10 | few files | two | 30 | 3 | 2 | 1 | 2,3,5 |
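Similarly, the unfiltered files-with-aggregated-participants query would be the F2P CTE from the filtering examples below, with the participant restriction removed:

WITH F2P AS (
    SELECT
        FILE_ID,
        MAX(FILE_NAME) AS FILE_NAME,
        MAX(FILE_TYPE) AS FILE_TYPE,
        MAX(FILE_SIZE) AS FILE_SIZE,
        COUNT(PART_ID) AS PART_COUNT,
        SUM(CASE STAGE WHEN 'one' THEN 1 ELSE 0 END) AS STAGE_ONE_COUNT,
        SUM(CASE STAGE WHEN 'two' THEN 1 ELSE 0 END) AS STAGE_TWO_COUNT,
        GROUP_CONCAT(DISTINCT PART_ID) AS PART_IDS
    FROM MATERIAL
    WHERE FILE_ID IS NOT NULL
    GROUP BY FILE_ID
)
SELECT * FROM F2P;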
The unfiltered files-with-aggregated-participants results would look like:
FILE_ID | FILE_NAME | FILE_TYPE | FILE_SIZE | PART_COUNT | STAGE_ONE_COUNT | STAGE_TWO_COUNT | PART_IDS |
---|---|---|---|---|---|---|---|
1 | f1 | raw | 100 | 8 | 4 | 4 | 1,2,3,4,5,6,7,8 |
2 | f2 | raw | 200 | 5 | 2 | 3 | 2,4,6,8,10 |
3 | f3 | raw | 300 | 5 | 2 | 3 | 1,3,5,7,10 |
4 | f4 | raw | 400 | 4 | 4 | 0 | 1,2,3,4 |
5 | f5 | proc | 100 | 5 | 0 | 5 | 5,6,7,8,10 |
6 | f6 | proc | 200 | 4 | 2 | 2 | 1,2,5,6 |
7 | f7 | proc | 300 | 4 | 2 | 2 | 3,4,7,8 |
8 | f8 | proc | 400 | 8 | 4 | 4 | 1,2,3,4,5,6,7,8 |
9 | no participants | proc | 100 | 0 | 0 | 0 | NULL |
Perspective Filtering
Now that we have two perspectives that show all of the information in a user-friendly manner, it will be natural for users to want to filter on each perspective. In fact, filtering is the point of the entire cohort builder exercise.
It is fairly trivial to filter by participant metadata on the participants-with-aggregated-files perspective, and likewise to filter by file metadata on the files-with-aggregated-participants perspective. However, it is more complicated to filter on the aggregated columns of each perspective. In fact, the query for each perspective was built using a common table expression (CTE) specifically to support aggregation filtering.
Let’s look at an example to see how to use the CTE to filter by aggregation data. A user looking at the participants-with-aggregated-files perspective decides that they only want to include participants with at least 3 processed files in their cohort. To show these results, we must add a filter to the P2F alias as follows (see the WHERE clause on the final SELECT):
WITH P2F AS (
    SELECT
        PART_ID,
        MAX(PART_NAME) AS PART_NAME,
        MAX(STAGE) AS STAGE,
        MAX(AGE) AS AGE,
        COUNT(FILE_ID) AS FILE_COUNT,
        SUM(CASE FILE_TYPE WHEN 'raw' THEN 1 ELSE 0 END) AS RAW_FILES,
        SUM(CASE FILE_TYPE WHEN 'proc' THEN 1 ELSE 0 END) AS PROC_FILES,
        GROUP_CONCAT(DISTINCT FILE_ID) AS FILE_IDS
    FROM MATERIAL
    WHERE PART_ID IS NOT NULL
    GROUP BY PART_ID
    ORDER BY PART_ID ASC
)
SELECT * FROM P2F WHERE PROC_FILES >= 3;
PART_ID | PART_NAME | STAGE | AGE | FILE_COUNT | RAW_FILES | PROC_FILES | FILE_IDS |
---|---|---|---|---|---|---|---|
5 | P5 | two | 10 | 5 | 2 | 3 | 1,3,5,6,8 |
6 | P6 | two | 20 | 5 | 2 | 3 | 1,2,5,6,8 |
7 | P7 | two | 30 | 5 | 2 | 3 | 1,3,5,7,8 |
8 | P8 | two | 40 | 5 | 2 | 3 | 1,2,5,7,8 |
The result is four participants (5,6,7,8). Once the user has selected their starting participants, the next step is to take their results to the files-with-aggregated-participants perspective to finalize their selected files. This creates a new challenge: since we filtered by an aggregation in one perspective that does not exist in the other, how do we transfer the user’s filter to the other perspective?
One scale-limited solution to this problem is to capture the resulting participant IDs from the first perspective and apply them as a filter to the second. Following the example above, we would transfer the four participants (5,6,7,8) to the files-with-aggregated-participants perspective as follows (see the IN clause in the inner WHERE):
WITH F2P AS (
    SELECT
        FILE_ID,
        MAX(FILE_NAME) AS FILE_NAME,
        MAX(FILE_TYPE) AS FILE_TYPE,
        MAX(FILE_SIZE) AS FILE_SIZE,
        COUNT(PART_ID) AS PART_COUNT,
        SUM(CASE STAGE WHEN 'one' THEN 1 ELSE 0 END) AS STAGE_ONE_COUNT,
        SUM(CASE STAGE WHEN 'two' THEN 1 ELSE 0 END) AS STAGE_TWO_COUNT,
        GROUP_CONCAT(DISTINCT PART_ID) AS PART_IDS
    FROM MATERIAL
    WHERE FILE_ID IS NOT NULL
      AND PART_ID IN (5,6,7,8)
    GROUP BY FILE_ID
)
SELECT * FROM F2P;
FILE_ID | FILE_NAME | FILE_TYPE | FILE_SIZE | PART_COUNT | STAGE_ONE_COUNT | STAGE_TWO_COUNT | PART_IDS |
---|---|---|---|---|---|---|---|
1 | f1 | raw | 100 | 4 | 0 | 4 | 5,6,7,8 |
2 | f2 | raw | 200 | 2 | 0 | 2 | 6,8 |
3 | f3 | raw | 300 | 2 | 0 | 2 | 5,7 |
5 | f5 | proc | 100 | 4 | 0 | 4 | 5,6,7,8 |
6 | f6 | proc | 200 | 2 | 0 | 2 | 5,6 |
7 | f7 | proc | 300 | 2 | 0 | 2 | 7,8 |
8 | f8 | proc | 400 | 4 | 0 | 4 | 5,6,7,8 |
Notice that this filter was applied to the inner query of the CTE. Since the outer F2P does not contain raw participant IDs, we cannot add the filter there.
Extending our existing user story, the user decides they only want to include files that include all four of their selected participants, so they add the filter:
WITH F2P AS (
    SELECT
        FILE_ID,
        MAX(FILE_NAME) AS FILE_NAME,
        MAX(FILE_TYPE) AS FILE_TYPE,
        MAX(FILE_SIZE) AS FILE_SIZE,
        COUNT(PART_ID) AS PART_COUNT,
        SUM(CASE STAGE WHEN 'one' THEN 1 ELSE 0 END) AS STAGE_ONE_COUNT,
        SUM(CASE STAGE WHEN 'two' THEN 1 ELSE 0 END) AS STAGE_TWO_COUNT,
        GROUP_CONCAT(DISTINCT PART_ID) AS PART_IDS
    FROM MATERIAL
    WHERE FILE_ID IS NOT NULL
      AND PART_ID IN (5,6,7,8)
    GROUP BY FILE_ID
)
SELECT * FROM F2P WHERE PART_COUNT >= 4;
FILE_ID | FILE_NAME | FILE_TYPE | FILE_SIZE | PART_COUNT | STAGE_ONE_COUNT | STAGE_TWO_COUNT | PART_IDS |
---|---|---|---|---|---|---|---|
1 | f1 | raw | 100 | 4 | 0 | 4 | 5,6,7,8 |
5 | f5 | proc | 100 | 4 | 0 | 4 | 5,6,7,8 |
8 | f8 | proc | 400 | 4 | 0 | 4 | 5,6,7,8 |
Their final result now includes three files (1,5,8) that each include their four selected participants (5,6,7,8).
Row-level Filtering
Anytime a solution includes files, we need to ensure that the system automatically filters out files that the caller does not have permission to read.
Note: By design, the current query system does not filter out files that the user lacks permission to download, nor files with unmet access requirements.
In our example FILE_VIEW table, we include a column called BEN_ID. It represents the ID of the Project/Folder that controls access to a file via its access control list. It is called a benefactor ID because it defines where a file inherits its permissions from. This column simulates how Synapse treats a real file view. You might also notice that this column was included in the materialized view (MATERIAL). This also simulates how Synapse treats materialized views that include data from file views.
The BEN_ID column is then used at query time to automatically filter out rows that the caller cannot see, using the following three steps:
A new query is derived from the user-provided query to select the distinct benefactor IDs from the synID in the FROM clause. This tells the security filter what all of the possible BEN_IDs are for the given situation.
We run an authorization check to determine the sub-set of BEN_IDs on which the caller has read permission.
Finally, the user’s original query is modified to include an unconditional IN clause that limits the results to only benefactors the caller can see.
If we apply this concept to our example scenario above, we detect that the MATERIAL table has two distinct benefactor IDs by automatically converting the user’s input query as follows:
SELECT DISTINCT FILE_BEN_ID FROM MATERIAL WHERE FILE_ID IS NOT NULL AND PART_ID IN (5,6,7,8);
In this step we must discard the original CTE, SELECT list, and GROUP BY provided by the user’s query and replace the SELECT list with DISTINCT FILE_BEN_ID. The result of this query returns two rows: 111 and 222.
We then do an authorization check on the user and determine they can see 111 but not 222. We are then ready to re-write their input SQL as follows:
WITH F2P AS (
    SELECT
        FILE_ID,
        MAX(FILE_NAME) AS FILE_NAME,
        MAX(FILE_TYPE) AS FILE_TYPE,
        MAX(FILE_SIZE) AS FILE_SIZE,
        COUNT(PART_ID) AS PART_COUNT,
        SUM(CASE STAGE WHEN 'one' THEN 1 ELSE 0 END) AS STAGE_ONE_COUNT,
        SUM(CASE STAGE WHEN 'two' THEN 1 ELSE 0 END) AS STAGE_TWO_COUNT,
        GROUP_CONCAT(DISTINCT PART_ID) AS PART_IDS
    FROM MATERIAL
    WHERE (FILE_ID IS NOT NULL AND PART_ID IN (5,6,7,8))
      AND FILE_BEN_ID IN (111)
    GROUP BY FILE_ID
)
SELECT * FROM F2P;
Note: In the inner WHERE clause, the user’s original filter is unconditionally combined with AND FILE_BEN_ID IN (111). This ensures that the caller will never see file rows that they are unauthorized to see.
We added this section to show how adding a CTE query filter requires that we also plan for how to apply the correct runtime row-level filtering.
New Features
In order to make the above use cases work using the provided example queries, we are going to need to add several new features to the Synapse query system.
CASE statements - We do not currently support CASE statements in Synapse SQL. We would need to extend the SQL parser to correctly handle them (see the first example after this list).
CTE - We do not currently support common table expressions (CTEs). Note: CTEs can be more complex than the simple ones used in the examples above. For example, it is possible for a CTE to include many tables, joins, unions, and sub-queries. Supporting complex CTEs that include joins, unions, or sub-queries would potentially allow users to execute system-debilitating queries. However, all of the examples above are simple and only involve a single table with aggregation. Such a simple CTE poses little to no risk. Therefore, we would like to add support for simple CTEs to runtime queries (see the second example after this list). This means a materialized view would not be needed to use a CTE.
Optimize Materialized Views - Currently, when we rebuild a materialized view (MV), we lock users out of querying it during the rebuild. For small MVs that only rebuild occasionally, this is not a problem. However, this use case will require potentially large MVs (millions of rows). If it takes 30 minutes to rebuild the MV used by a cohort builder after a file changes, it is unlikely that we want users to be unable to use the cohort builder during the rebuild. On the other hand, users are unlikely to notice if the data in the cohort builder is slightly stale. Therefore, we propose adding an optimization to MVs that would allow the existing MV to remain queryable during the rebuild process. After the rebuild is finished, the stale MV could be atomically swapped with the newly built MV (see the sketch after this list).
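To illustrate the CASE statement feature: the examples in this document use the “simple” CASE form, though a parser extension would presumably need to handle the standard “searched” form as well. The LARGE_FILES column below is a hypothetical example, not one used elsewhere in this document:

SELECT
    SUM(CASE FILE_TYPE WHEN 'raw' THEN 1 ELSE 0 END) AS RAW_FILES,  -- simple CASE, as used above
    SUM(CASE WHEN FILE_SIZE > 100 THEN 1 ELSE 0 END) AS LARGE_FILES -- searched CASE (hypothetical)
FROM MATERIAL;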
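To make the proposed CTE restriction concrete, here is a sketch of what would and would not be allowed (the tables A and B in the second query are placeholders):

-- Allowed: a simple CTE over a single table with aggregation.
WITH X AS (
    SELECT PART_ID, COUNT(FILE_ID) AS FILE_COUNT
    FROM MATERIAL
    WHERE PART_ID IS NOT NULL
    GROUP BY PART_ID
)
SELECT * FROM X WHERE FILE_COUNT > 2;

-- Not allowed under this proposal: joins, unions, or sub-queries inside the CTE.
WITH X AS (
    SELECT * FROM A JOIN B ON A.ID = B.ID
)
SELECT * FROM X;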
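As a sketch of the proposed MV optimization, assuming the MV is backed by a MySQL table, the rebuild-and-swap might look like the following (MATERIAL_NEW and MATERIAL_OLD are hypothetical names):

-- Build the new version in a shadow table while MATERIAL remains queryable.
CREATE TABLE MATERIAL_NEW LIKE MATERIAL;
-- ... repopulate MATERIAL_NEW here by re-running the MV's defining query ...

-- Atomically swap the tables; readers see either the old MV or the new one,
-- never a partially rebuilt table. RENAME TABLE is atomic across both renames.
RENAME TABLE MATERIAL TO MATERIAL_OLD,
             MATERIAL_NEW TO MATERIAL;

-- Discard the stale copy.
DROP TABLE MATERIAL_OLD;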