Skip to end of banner
Go to start of banner

Cohort Builder - Technical Details

Skip to end of metadata
Go to start of metadata

You are viewing an old version of this page. View the current version.

Compare with Current View Page History

Version 1 Next »

The purpose of this document is to provide technical details of how the proposed /wiki/spaces/PD/pages/2730721508 might be built.

Setup

To help us understand all of the technical challenges for this project we built a toy example data model. This data model was designed to capture all of the technical challenges while remaining as simple and small as we could make it.

The data model is composed of three main building blocks:

  • FILE_VIEW- This represents a Synapse file view that contains all of the files within the scope of the project. Each row represents a single file. The metadata about each file is captured in the columns of this table. These files represent the entire pool of files that the end user can draw from.

  • PARTICIPANTS - This represents a Synapse table that contains one row per participant. The metadata about each participant is captured in the columns of this table. These participants represent the entire pool of candidate participants that the end user can draw from.

  • FILE_TO_PART - This is a Synapse table that maps files to participant. In the real data a single participant might be mapped to thousands of files, and vise versa.

  • MATERIAL - This table represents a materialized view that de-normalized join of the proceeding three tables. It includes files that might not have any participants and vise versa. In the real world this table would be automatically index on all relevant columns. Note: A worst case for this table would be where each file is associated with each participant thereby resulting in a Cartesian product.

If you would like to try out the data model here is the ddl:

 Expand to see the full ddl...
DROP DATABASE `COHORT`;
CREATE DATABASE `COHORT`;

USE `COHORT`;

CREATE TABLE IF NOT EXISTS `FILE_VIEW` (
	`ID` BIGINT NOT NULL,
    `NAME` varchar(256) CHARACTER SET utf8mb4 COLLATE utf8mb4_bin NOT NULL,
    `TYPE` ENUM ('raw','proc') NOT NULL,
	`SIZE` BIGINT NOT NULL,
    `BEN_ID` BIGINT NOT NULL,
    PRIMARY KEY (`ID`)
);

CREATE TABLE IF NOT EXISTS `PARTICIPANTS` (
	`ID` BIGINT NOT NULL,
    `NAME` varchar(256) CHARACTER SET utf8mb4 COLLATE utf8mb4_bin NOT NULL,
	`STAGE` ENUM('one','two'),
    `AGE` BIGINT NOT NULL,
    PRIMARY KEY (`ID`)
);

CREATE TABLE IF NOT EXISTS `FILE_TO_PART` (
	`FILE_ID` BIGINT NOT NULL,
	`PART_ID` BIGINT NOT NULL,
    PRIMARY KEY (`FILE_ID`, `PART_ID`)
);

INSERT INTO `FILE_VIEW` VALUES 
	(1,'f1','raw',100,111),
	(2,'f2','raw',200,111),
	(3,'f3','raw',300,111),
	(4,'f3','raw',400,111),
	(5,'f5','proc',100,111),
	(6,'f6','proc',200,111),
	(7,'f7','proc',300,222),
	(8,'f8','proc',400,222),    
	(9,'no participants','proc',100,111)
    ;

INSERT INTO `PARTICIPANTS` VALUES 
	(1,'P1','one',10),
    (2,'P2','one',20),
	(3,'P3','one',30),
    (4,'P4','one',40),
	(5,'P5','two',10),
    (6,'P6','two',20),
	(7,'P7','two',30),
    (8,'P4','two',40),
	(9,'no files','one', 18),
	(10,'few files','two', 30)
    ;
    
INSERT INTO `FILE_TO_PART` VALUES 
    # f1 to all
	(1,1),
	(1,2),
	(1,3),
	(1,4),
	(1,5),
    (1,6),
    (1,7),
    (1,8),
    # f2 to even
    (2,2),
	(2,4),
    (2,6),
    (2,8),
    # f3 to odd
	(3,1),
	(3,3),
	(3,5),
    (3,7),
    # f4 only stage one
    (4,1),
    (4,2),
    (4,3),
    (4,4),
    # f5 only stage two
    (5,5),
    (5,6),
    (5,7),
    (5,8),
    # f6 under20
	(6,1),
    (6,2),
    (6,5),
    (6,6),
	# f7 over 20
	(7,3),
    (7,4),
    (7,7),
    (7,8),
	# f8 to all
	(8,1),
	(8,2),
	(8,3),
	(8,4),
	(8,5),
    (8,6),
    (8,7),
    (8,8),
    # p10 only few files
    (2,10),
    (3,10),
	(5,10)
    ;

/* 
This table is the denormalized materialization of all of the data.
It includes all files including files with no particpants.  It also
incldues all participants, even those with no files.
*/
CREATE TABLE `MATERIAL`
# includes all files even those without participants
SELECT F.ID AS FILE_ID, F.NAME AS FILE_NAME, F.TYPE AS FILE_TYPE, F.BEN_ID AS FILE_BEN_ID, P.ID AS PART_ID, P.NAME AS PART_NAME, P.STAGE, P.AGE 
	FROM `FILE_VIEW` F LEFT JOIN `FILE_TO_PART` F2P ON (F.ID = F2P.FILE_ID) LEFT JOIN `PARTICIPANTS` P ON ( F2P.PART_ID =P.ID )
UNION
# Add all particpants that do not have any files
SELECT F.ID AS FILE_ID, F.NAME AS FILE_NAME, F.TYPE AS FILE_TYPE, F.BEN_ID AS FILE_BEN_ID, P.ID AS PART_ID, P.NAME AS PART_NAME, P.STAGE, P.AGE 
	FROM `FILE_VIEW` F JOIN `FILE_TO_PART` F2P ON (F.ID = F2P.FILE_ID) RIGHT JOIN `PARTICIPANTS` P ON ( F2P.PART_ID =P.ID ) WHERE F.ID IS NULL;

Full De-normalized Data

After running the above script we can start to query the data. For example here is the contents of the MATERIAL table:

select * from MATERIAL;
 Expand to see the table

FILE_ID

FILE_NAME

FILE_TYPE

FILE_SIZE

FILE_BEN_ID

PART_ID

PART_NAME

STAGE

AGE

1

f1

raw

100

111

1

P1

one

10

1

f1

raw

100

111

2

P2

one

20

1

f1

raw

100

111

3

P3

one

30

1

f1

raw

100

111

4

P4

one

40

1

f1

raw

100

111

5

P5

two

10

1

f1

raw

100

111

6

P6

two

20

1

f1

raw

100

111

7

P7

two

30

1

f1

raw

100

111

8

P4

two

40

2

f2

raw

200

111

2

P2

one

20

2

f2

raw

200

111

4

P4

one

40

2

f2

raw

200

111

6

P6

two

20

2

f2

raw

200

111

8

P4

two

40

2

f2

raw

200

111

10

few files

two

30

3

f3

raw

300

111

1

P1

one

10

3

f3

raw

300

111

3

P3

one

30

3

f3

raw

300

111

5

P5

two

10

3

f3

raw

300

111

7

P7

two

30

3

f3

raw

300

111

10

few files

two

30

4

f3

raw

400

111

1

P1

one

10

4

f3

raw

400

111

2

P2

one

20

4

f3

raw

400

111

3

P3

one

30

4

f3

raw

400

111

4

P4

one

40

5

f5

proc

100

111

5

P5

two

10

5

f5

proc

100

111

6

P6

two

20

5

f5

proc

100

111

7

P7

two

30

5

f5

proc

100

111

8

P4

two

40

5

f5

proc

100

111

10

few files

two

30

6

f6

proc

200

111

1

P1

one

10

6

f6

proc

200

111

2

P2

one

20

6

f6

proc

200

111

5

P5

two

10

6

f6

proc

200

111

6

P6

two

20

7

f7

proc

300

222

3

P3

one

30

7

f7

proc

300

222

4

P4

one

40

7

f7

proc

300

222

7

P7

two

30

7

f7

proc

300

222

8

P4

two

40

8

f8

proc

400

222

1

P1

one

10

8

f8

proc

400

222

2

P2

one

20

8

f8

proc

400

222

3

P3

one

30

8

f8

proc

400

222

4

P4

one

40

8

f8

proc

400

222

5

P5

two

10

8

f8

proc

400

222

6

P6

two

20

8

f8

proc

400

222

7

P7

two

30

8

f8

proc

400

222

8

P4

two

40

9

no participants

proc

100

111

NULL

NULL

NULL

NULL

NULL

NULL

NULL

NULL

NULL

9

no files

one

18

Two Perspectives

While the MATERIAL table contains all of the data, it is not super user friendly. To make it user friendly, we create two separate perspectives of the data:

  • Participants with aggregated files - The results of this query is such that each row represents a single participant, including columns for the participant’s metadata. This table also has additional columns that contain file aggregation data for each participant.

  • Files with aggregated participants- The results of this query is such that each row represents a single file, including columns for the file’s metadata. This table also has additional columns that contain participant aggregation data for each file.

The unfiltered participants-with-aggregated-files would look like:

 Expand to see the query that generated this result...
WITH
	P2F AS (
		SELECT 
			PART_ID,
			MAX(PART_NAME) AS PART_NAME,
			MAX(STAGE) AS STAGE,
			MAX(AGE) AS AGE,
			COUNT(FILE_ID) AS FILE_COUNT,
			SUM(CASE FILE_TYPE WHEN 'raw' THEN 1 ELSE 0 END) AS RAW_FILES,
			SUM(CASE FILE_TYPE WHEN 'proc' THEN 1 ELSE 0 END) AS PROC_FILES,
			GROUP_CONCAT(DISTINCT FILE_ID) AS FILE_IDS
				FROM MATERIAL WHERE PART_ID IS NOT NULL GROUP BY PART_ID ORDER BY PART_ID ASC
	)
SELECT * FROM P2F;

PART_ID

PART_NAME

STAGE

AGE

FILE_COUNT

RAW_FILES

PROC_FILES

FILE_IDS

1

P1

one

10

5

3

2

1,3,4,6,8

2

P2

one

20

5

3

2

1,2,4,6,8

3

P3

one

30

5

3

2

1,3,4,7,8

4

P4

one

40

5

3

2

1,2,4,7,8

5

P5

two

10

5

2

3

1,3,5,6,8

6

P6

two

20

5

2

3

1,2,5,6,8

7

P7

two

30

5

2

3

1,3,5,7,8

8

P4

two

40

5

2

3

1,2,5,7,8

9

no files

one

18

0

0

0

10

few files

two

30

3

2

1

2,3,5

The unfiltered files-with-aggregated-participants would look like:

 Exapnd to see the query that generated this result...
WITH
	F2P AS (
		SELECT 
			FILE_ID,
			MAX(FILE_NAME) AS FILE_NAME,
			MAX(FILE_TYPE) AS FILE_TYPE,
            MAX(FILE_SIZE) AS FILE_SIZE,
			COUNT(PART_ID) AS PART_COUNT,
			SUM(CASE STAGE WHEN 'one' THEN 1 ELSE 0 END) AS STAGE_ONE_COUNT,
			SUM(CASE STAGE WHEN 'two' THEN 1 ELSE 0 END) AS STAGE_TWO_COUNT,
			GROUP_CONCAT(DISTINCT PART_ID) AS PART_IDS FROM MATERIAL
				WHERE FILE_ID IS NOT NULL GROUP BY FILE_ID
    )
SELECT * FROM F2P;

FILE_ID

FILE_NAME

FILE_TYPE

FILE_SIZE

PART_COUNT

STAGE_ONE_COUNT

STAGE_TWO_COUNT

PART_IDS

1

f1

raw

100

8

4

4

1,2,3,4,5,6,7,8

2

f2

raw

200

5

2

3

2,4,6,8,10

3

f3

raw

300

5

2

3

1,3,5,7,10

4

f3

raw

400

4

4

0

1,2,3,4

5

f5

proc

100

5

0

5

5,6,7,8,10

6

f6

proc

200

4

2

2

1,2,5,6

7

f7

proc

300

4

2

2

3,4,7,8

8

f8

proc

400

8

4

4

1,2,3,4,5,6,7,8

9

no participants

proc

100

0

0

0

NULL

Perspective Filtering

Now that we have two perspectives that show all of the information in a user friendly manner, it will be natural for uses to want to filter on each perspective. In fact, filtering is the point of the entire cohort builder exercise.

It is fairly trivial to filter by participant metadata on the participants-with-aggregated-files perspective. The same is true when filtering on file metadata on the files-with-aggregated-participants perspective. However, it is more complicated to filter on the aggregated columns of each perspective. In fact, the query for each perspective was built using a common-table-expression (CTE), to support aggregation filtering.

Let’s look at an example to see how to use the CTE to filter by aggregation data. A user looking at participants-with-aggregated-files perspective decides that they only want to include participants with at least 3 processed files in their cohort. To show these results we must add a filter to the p2f alias as follows (see line 14):

WITH
	P2F AS (
		SELECT 
			PART_ID,
			MAX(PART_NAME) AS PART_NAME,
			MAX(STAGE) AS STAGE,
			MAX(AGE) AS AGE,
			COUNT(FILE_ID) AS FILE_COUNT,
			SUM(CASE FILE_TYPE WHEN 'raw' THEN 1 ELSE 0 END) AS RAW_FILES,
			SUM(CASE FILE_TYPE WHEN 'proc' THEN 1 ELSE 0 END) AS PROC_FILES,
			GROUP_CONCAT(DISTINCT FILE_ID) AS FILE_IDS
				FROM MATERIAL WHERE PART_ID IS NOT NULL GROUP BY PART_ID ORDER BY PART_ID ASC
	)
SELECT * FROM P2F WHERE PROC_FILES >= 3;

PART_ID

PART_NAME

STAGE

AGE

FILE_COUNT

RAW_FILES

PROC_FILES

FILE_IDS

5

P5

two

10

5

2

3

1,3,5,6,8

6

P6

two

20

5

2

3

1,2,5,6,8

7

P7

two

30

5

2

3

1,3,5,7,8

8

P4

two

40

5

2

3

1,2,5,7,8

The result is four participants (5,6,7,8). Once the user has selected their starting participants, the next step is to take their results to the files-with-aggregated-participants perspective to finalize their selected files. This creates a new challenge. Since we filtered by an aggregation in one perspective that does not exist in the other, how to we transfer the user’s filter to the other perspective?

One scale limited solution to this problem is to capture the resulting participant from the first perspective and apply them as a filter to the second. Following the example above we would transfer the four participants (5,6,7,8) to the files-with-aggregated-participants as follows (see in clause at line 12):

WITH
	F2P AS (
		SELECT 
			FILE_ID,
			MAX(FILE_NAME) AS FILE_NAME,
			MAX(FILE_TYPE) AS FILE_TYPE,
            MAX(FILE_SIZE) AS FILE_SIZE,
			COUNT(PART_ID) AS PART_COUNT,
			SUM(CASE STAGE WHEN 'one' THEN 1 ELSE 0 END) AS STAGE_ONE_COUNT,
			SUM(CASE STAGE WHEN 'two' THEN 1 ELSE 0 END) AS STAGE_TWO_COUNT,
			GROUP_CONCAT(DISTINCT PART_ID) AS PART_IDS FROM MATERIAL
				WHERE FILE_ID IS NOT NULL AND PART_ID IN (5,6,7,8) GROUP BY FILE_ID
    )
SELECT * FROM F2P;

FILE_ID

FILE_NAME

FILE_TYPE

FILE_SIZE

PART_COUNT

STAGE_ONE_COUNT

STAGE_TWO_COUNT

PART_IDS

1

f1

raw

100

4

0

4

5,6,7,8

2

f2

raw

200

2

0

2

6,8

3

f3

raw

300

2

0

2

5,7

5

f5

proc

100

4

0

4

5,6,7,8

6

f6

proc

200

2

0

2

5,6

7

f7

proc

300

2

0

2

7,8

8

f8

proc

400

4

0

4

5,6,7,8

Notice, this filter was applied to the inner query of the CTE. Since the outer F2P does not contain raw participant ids we cannot add the filter there.

Extending our existing user story, the user decides they only want to include files that include all four of their selected participants, so they add the filter:

WITH
	F2P AS (
		SELECT 
			FILE_ID,
			MAX(FILE_NAME) AS FILE_NAME,
			MAX(FILE_TYPE) AS FILE_TYPE,
            MAX(FILE_SIZE) AS FILE_SIZE,
			COUNT(PART_ID) AS PART_COUNT,
			SUM(CASE STAGE WHEN 'one' THEN 1 ELSE 0 END) AS STAGE_ONE_COUNT,
			SUM(CASE STAGE WHEN 'two' THEN 1 ELSE 0 END) AS STAGE_TWO_COUNT,
			GROUP_CONCAT(DISTINCT PART_ID) AS PART_IDS FROM MATERIAL
				WHERE FILE_ID IS NOT NULL AND PART_ID IN (5,6,7,8) GROUP BY FILE_ID
    )
SELECT * FROM F2P WHERE PART_COUNT >= 4;

FILE_ID

FILE_NAME

FILE_TYPE

FILE_SIZE

PART_COUNT

STAGE_ONE_COUNT

STAGE_TWO_COUNT

PART_IDS

1

f1

raw

100

4

0

4

5,6,7,8

5

f5

proc

100

4

0

4

5,6,7,8

8

f8

proc

400

4

0

4

5,6,7,8

Their final result now includes three files (1,5,8) that each include their four selected participants (5,6,7,8).

Row-level Filtering

Anytime a solution includes files, we need to ensure that the system automatically filters out files that the caller does not have permission to read.

Note: The current query system does not filter files that the user does not have permission to download, or unmet access requirements, by design.

In our example FILE_VIEW table we include a column called BEN_ID. It represents the ID of the Project/Folder that controls access to a file via its access control list. It is called a benefactor ID because it defines where a file inherits its permissions from. This column simulate how Synapse treats a real file view. You might also notice that this column was include in the materialized view (MATERIAL). This also simulates how Synapse treats materialized views that include data from file views.

The BEN_ID column is then used at query time to automatically filter out rows that the caller cannot see using the following two steps:

  1. A new query is derived from the user provide query to select the distinct benefactor ID from the synID in the from clause. This tells the security filter what all of the possible BEN_IDs are for the given situation.

  2. We run an authorization check to the the sub-set of BEN_IDs that the caller has the read permission on.

  3. Finally, the user’s original query is modified to include an unconditional IN clause that limits the results to only benefactors the caller can see.

If we apply this concept to the our example scenario above, we detect that the MATERIAL has two distinct benefactor ID by automatically converting the user’s input query as follows:

		SELECT 
			DISTINCT FILE_BEN_ID FROM MATERIAL
				WHERE FILE_ID IS NOT NULL AND PART_ID IN (5,6,7,8);

In this step we must discard the original CTE, select, and group by provided by the user’s query and replace the select with the DISTINCT FILE_BEN_ID. The result of this query returns two rows: 111 and 222.

We then do an authorization check on the user and determine they can see 111 but not 222. We are then ready to re-write their input SQL as follows:

WITH
	F2P AS (
		SELECT 
			FILE_ID,
			MAX(FILE_NAME) AS FILE_NAME,
			MAX(FILE_TYPE) AS FILE_TYPE,
            MAX(FILE_SIZE) AS FILE_SIZE,
			COUNT(PART_ID) AS PART_COUNT,
			SUM(CASE STAGE WHEN 'one' THEN 1 ELSE 0 END) AS STAGE_ONE_COUNT,
			SUM(CASE STAGE WHEN 'two' THEN 1 ELSE 0 END) AS STAGE_TWO_COUNT,
			GROUP_CONCAT(DISTINCT PART_ID) AS PART_IDS FROM MATERIAL
				WHERE (FILE_ID IS NOT NULL AND PART_ID IN (5,6,7,8)) AND FILE_BEN_ID IN (111) GROUP BY FILE_ID
    )
SELECT * FROM F2P;

Note: At line 12, the user’s original where clause is unconditionally combined withAND FILE_BEN_ID IN (111). This ensures that the caller will never see file rows that they are unauthorized to see.

We added this section to show how the adding a CTE query filter requires that we also plan for how to apply the correct runtime row-level filtering.

New Features

In order to make the above use cases work using the provide example queries we are going to need to add several new features to the Synapse query system.

  1. CASE statements - We do not currently support CASE statements in the Synapse SQL. We would need to extend the SQL parser to correctly handle such cases.

  2. CTE - We do not currently support common table expressions (CTE). Note: CTE can be more complex than the simple CTEs used in above examples. For example, it is possible for a CTE to include many tables, joins, unions, and sub-queries. Supporting complex CTE that includes joins, unions or sub-queries would potentially allow users to execute system debilitating queries. However, all of the examples above are simple an only involve a single table with aggregation. Such a simple CTE poses little to no risk. Therefore, we would like to add CTE support to runtime queries. This means a materialized view would not be needed to use a CTE.

  3. Optimize Materialized Views - Currently, when we rebuild a materialized view (MV), we lock users from querying it while we rebuild it. For small MVs that only rebuild occasionally, this is not a problem. However, this use case will require potentially large MVs (millions of rows). If it takes 30 mins to rebuild the MV used for a cohort builder, after a file updates, it is unlikely that we want users to be unable to use the cohort builder during a rebuild. On the other hand users are unlikely to notices if the data in the cohort builder is slightly stale. Therefore, we propose adding an optimization to MVs that would allow the existing MV to remain query-able during the rebuild process. After the rebuild is finished, the stale MV could be atomically, swapped with the newly build MV.

  • No labels