Cohort Builder - Technical Details -V2

Cohort Builder - Technical Details -V2

The purpose of this document is to provide technical details of how the proposed https://sagebionetworks.jira.com/wiki/spaces/PD/pages/2730721508 might be built.

Setup

To help us understand all of the technical challenges for this project we built a toy example data model. This data model was designed to capture all of the technical challenges while remaining as simple and small as we could make it.

 

The data model is composed of three main building blocks:

  • FILE_VIEW- This represents a Synapse file view that contains all of the files within the scope of the project. Each row represents a single file. The metadata about each file is captured in the columns of this table. These files represent the entire pool of files that the end user can draw from.

  • PARTICIPANTS - This represents a Synapse table that contains one row per participant. The metadata about each participant is captured in the columns of this table. These participants represent the entire pool of candidate participants that the end user can draw from.

  • FILE_TO_PART - This is a Synapse table that maps files to participant. In the real data a single participant might be mapped to thousands of files, and vice versa.

  • MATERIAL - This table represents a materialized view that is the de-normalized join of the preceding three tables. It includes files that might not have any participants and vice versa. In the real world this table would be automatically indexed on all relevant columns. Note: A worst case for this table would be where each file is associated with each participant thereby resulting in a Cartesian product.

 

If you would like to try out the data model here is the ddl:

DROP DATABASE `COHORT`; CREATE DATABASE `COHORT`; USE `COHORT`; CREATE TABLE IF NOT EXISTS `FILE_VIEW` ( `ID` BIGINT NOT NULL, `NAME` varchar(256) CHARACTER SET utf8mb4 COLLATE utf8mb4_bin NOT NULL, `TYPE` ENUM ('raw','proc') NOT NULL, `SIZE` BIGINT NOT NULL, `BEN_ID` BIGINT NOT NULL, PRIMARY KEY (`ID`) ); CREATE TABLE IF NOT EXISTS `PARTICIPANTS` ( `ID` BIGINT NOT NULL, `NAME` varchar(256) CHARACTER SET utf8mb4 COLLATE utf8mb4_bin NOT NULL, `STAGE` ENUM('one','two'), `AGE` BIGINT NOT NULL, PRIMARY KEY (`ID`) ); CREATE TABLE IF NOT EXISTS `FILE_TO_PART` ( `FILE_ID` BIGINT NOT NULL, `PART_ID` BIGINT NOT NULL, PRIMARY KEY (`FILE_ID`, `PART_ID`) ); INSERT INTO `FILE_VIEW` VALUES (1,'f1','raw',100,111), (2,'f2','raw',200,111), (3,'f3','raw',300,111), (4,'f3','raw',400,111), (5,'f5','proc',100,111), (6,'f6','proc',200,111), (7,'f7','proc',300,222), (8,'f8','proc',400,222), (9,'no participants','proc',100,111) ; INSERT INTO `PARTICIPANTS` VALUES (1,'P1','one',10), (2,'P2','one',20), (3,'P3','one',30), (4,'P4','one',40), (5,'P5','two',10), (6,'P6','two',20), (7,'P7','two',30), (8,'P4','two',40), (9,'no files','one', 18), (10,'few files','two', 30) ; INSERT INTO `FILE_TO_PART` VALUES # f1 to all (1,1), (1,2), (1,3), (1,4), (1,5), (1,6), (1,7), (1,8), # f2 to even (2,2), (2,4), (2,6), (2,8), # f3 to odd (3,1), (3,3), (3,5), (3,7), # f4 only stage one (4,1), (4,2), (4,3), (4,4), # f5 only stage two (5,5), (5,6), (5,7), (5,8), # f6 under20 (6,1), (6,2), (6,5), (6,6), # f7 over 20 (7,3), (7,4), (7,7), (7,8), # f8 to all (8,1), (8,2), (8,3), (8,4), (8,5), (8,6), (8,7), (8,8), # p10 only few files (2,10), (3,10), (5,10) ; /* This table is the denormalized materialization of all of the data. It includes all files including files with no particpants. It also incldues all participants, even those with no files. */ CREATE TABLE `MATERIAL` # includes all files even those without participants SELECT F.ID AS FILE_ID, F.NAME AS FILE_NAME, F.TYPE AS FILE_TYPE, F.BEN_ID AS FILE_BEN_ID, P.ID AS PART_ID, P.NAME AS PART_NAME, P.STAGE, P.AGE FROM `FILE_VIEW` F LEFT JOIN `FILE_TO_PART` F2P ON (F.ID = F2P.FILE_ID) LEFT JOIN `PARTICIPANTS` P ON ( F2P.PART_ID =P.ID ) UNION # Add all particpants that do not have any files SELECT F.ID AS FILE_ID, F.NAME AS FILE_NAME, F.TYPE AS FILE_TYPE, F.BEN_ID AS FILE_BEN_ID, P.ID AS PART_ID, P.NAME AS PART_NAME, P.STAGE, P.AGE FROM `FILE_VIEW` F JOIN `FILE_TO_PART` F2P ON (F.ID = F2P.FILE_ID) RIGHT JOIN `PARTICIPANTS` P ON ( F2P.PART_ID =P.ID ) WHERE F.ID IS NULL;

Full De-normalized Data

After running the above script we can start to query the data. For example here is the contents of the MATERIAL table:

select * from MATERIAL;

FILE_ID

FILE_NAME

FILE_TYPE

FILE_SIZE

FILE_BEN_ID

PART_ID

PART_NAME

STAGE

AGE

FILE_ID

FILE_NAME

FILE_TYPE

FILE_SIZE

FILE_BEN_ID

PART_ID

PART_NAME

STAGE

AGE

1

f1

raw

100

111

1

P1

one

10

1

f1

raw

100

111

2

P2

one

20

1

f1

raw

100

111

3

P3

one

30

1

f1

raw

100

111

4

P4

one

40

1

f1

raw

100

111

5

P5

two

10

1

f1

raw

100

111

6

P6

two

20

1

f1

raw

100

111

7

P7

two

30

1

f1

raw

100

111

8

P4

two

40

2

f2

raw

200

111

2

P2

one

20

2

f2

raw

200

111

4

P4

one

40

2

f2

raw

200

111

6

P6

two

20

2

f2

raw

200

111

8

P4

two

40

2

f2

raw

200

111

10

few files

two

30

3

f3

raw

300

111

1

P1

one

10

3

f3

raw

300

111

3

P3

one

30

3

f3

raw

300

111

5

P5

two

10

3

f3

raw

300

111

7

P7

two

30

3

f3

raw

300

111

10

few files

two

30

4

f3

raw

400

111

1

P1

one

10

4

f3

raw

400

111

2

P2

one

20

4

f3

raw

400

111

3

P3

one

30

4

f3

raw

400

111

4

P4

one

40

5

f5

proc

100

111

5

P5

two

10

5

f5

proc

100

111

6

P6

two

20

5

f5

proc

100

111

7

P7

two

30

5

f5

proc

100

111

8

P4

two

40

5

f5

proc

100

111

10

few files

two

30

6

f6

proc

200

111

1

P1

one

10

6

f6

proc

200

111

2

P2

one

20

6

f6

proc

200

111

5

P5

two

10

6

f6

proc

200

111

6

P6

two

20

7

f7

proc

300

222

3

P3

one

30

7

f7

proc

300

222

4

P4

one

40

7

f7

proc

300

222

7

P7

two

30

7

f7

proc

300

222

8

P4

two

40

8

f8

proc

400

222

1

P1

one

10

8

f8

proc

400

222

2

P2

one

20

8

f8

proc

400

222

3

P3

one

30

8

f8

proc

400

222

4

P4

one

40

8

f8

proc

400

222

5

P5

two

10

8

f8

proc

400

222

6

P6

two

20

8

f8

proc

400

222

7

P7

two

30

8

f8

proc

400

222

8

P4

two

40

9

no participants

proc

100

111

NULL

NULL

NULL

NULL

NULL

NULL

NULL

NULL

NULL

9

no files

one

18

Two Perspectives

While the MATERIAL table contains all of the data, it would be challenging to use in its raw form. To improve usability, the data is exposed as two separate perspectives of the data:

  • Participants-with-aggregated-files - The results of this query is such that each row represents a single participant, including columns for the participant’s metadata. The results also include additional columns that contain file aggregation data for each participant.

  • Files-with-aggregated-participants - The results of this query is such that each row represents a single file, including columns for the file’s metadata. The results also include additional columns that contain participant aggregation data for each file.

 

The unfiltered participants-with-aggregated-files would look like:

WITH P2F AS ( SELECT PART_ID, MAX(PART_NAME) AS PART_NAME, MAX(STAGE) AS STAGE, MAX(AGE) AS AGE, COUNT(FILE_ID) AS FILE_COUNT, SUM(CASE FILE_TYPE WHEN 'raw' THEN 1 ELSE 0 END) AS RAW_FILES, SUM(CASE FILE_TYPE WHEN 'proc' THEN 1 ELSE 0 END) AS PROC_FILES, MAX(FILE_SIZE) AS `MAX_SIZE`, MIN(FILE_SIZE) AS `MIN_SIZE`, GROUP_CONCAT(DISTINCT FILE_ID) AS FILE_IDS FROM MATERIAL WHERE PART_ID IS NOT NULL GROUP BY PART_ID ORDER BY PART_ID ASC ), AGG AS ( SELECT PART_ID, PART_NAME, STAGE, AGE, FILE_COUNT, JSON_OBJECT('raw', RAW_FILES, 'proc', PROC_FILES) as FILE_TYPE, JSON_OBJECT('max', `MAX_SIZE`, 'min', `MIN_SIZE`) as FILE_SIZE, FILE_IDS FROM P2F ) SELECT * FROM AGG;

PART_ID

PART_NAME

STAGE

AGE

FILE_COUNT

FILE_TYPE

FILE_SIZE

FILE_ID

PART_ID

PART_NAME

STAGE

AGE

FILE_COUNT

FILE_TYPE

FILE_SIZE

FILE_ID

1

P1

one

10

5

{"raw": 3, "proc": 2}

{"max": 400, "min": 100}

1,3,4,6,8

2

P2

one

20

5

{"raw": 3, "proc": 2}

{"max": 400, "min": 100}

1,2,4,6,8

3

P3

one

30

5

{"raw": 3, "proc": 2}

{"max": 400, "min": 100}

1,3,4,7,8

4

P4

one

40

5

{"raw": 3, "proc": 2}

{"max": 400, "min": 100}

1,2,4,7,8

5