Cohort Builder - Technical Details -V2

The purpose of this document is to provide technical details of how the proposed https://sagebionetworks.jira.com/wiki/spaces/PD/pages/2730721508 might be built.

Setup

To help us understand all of the technical challenges for this project we built a toy example data model. This data model was designed to capture all of the technical challenges while remaining as simple and small as we could make it.

The data model is composed of three main building blocks:

FILE_VIEW- This represents a Synapse file view that contains all of the files within the scope of the project. Each row represents a single file. The metadata about each file is captured in the columns of this table. These files represent the entire pool of files that the end user can draw from.
PARTICIPANTS - This represents a Synapse table that contains one row per participant. The metadata about each participant is captured in the columns of this table. These participants represent the entire pool of candidate participants that the end user can draw from.
FILE_TO_PART - This is a Synapse table that maps files to participant. In the real data a single participant might be mapped to thousands of files, and vice versa.
MATERIAL - This table represents a materialized view that is the de-normalized join of the preceding three tables. It includes files that might not have any participants and vice versa. In the real world this table would be automatically indexed on all relevant columns. Note: A worst case for this table would be where each file is associated with each participant thereby resulting in a Cartesian product.

If you would like to try out the data model here is the ddl:

DROP DATABASE `COHORT`;
CREATE DATABASE `COHORT`;

USE `COHORT`;

CREATE TABLE IF NOT EXISTS `FILE_VIEW` (
	`ID` BIGINT NOT NULL,
    `NAME` varchar(256) CHARACTER SET utf8mb4 COLLATE utf8mb4_bin NOT NULL,
    `TYPE` ENUM ('raw','proc') NOT NULL,
	`SIZE` BIGINT NOT NULL,
    `BEN_ID` BIGINT NOT NULL,
    PRIMARY KEY (`ID`)
);

CREATE TABLE IF NOT EXISTS `PARTICIPANTS` (
	`ID` BIGINT NOT NULL,
    `NAME` varchar(256) CHARACTER SET utf8mb4 COLLATE utf8mb4_bin NOT NULL,
	`STAGE` ENUM('one','two'),
    `AGE` BIGINT NOT NULL,
    PRIMARY KEY (`ID`)
);

CREATE TABLE IF NOT EXISTS `FILE_TO_PART` (
	`FILE_ID` BIGINT NOT NULL,
	`PART_ID` BIGINT NOT NULL,
    PRIMARY KEY (`FILE_ID`, `PART_ID`)
);

INSERT INTO `FILE_VIEW` VALUES 
	(1,'f1','raw',100,111),
	(2,'f2','raw',200,111),
	(3,'f3','raw',300,111),
	(4,'f3','raw',400,111),
	(5,'f5','proc',100,111),
	(6,'f6','proc',200,111),
	(7,'f7','proc',300,222),
	(8,'f8','proc',400,222),    
	(9,'no participants','proc',100,111)
    ;

INSERT INTO `PARTICIPANTS` VALUES 
	(1,'P1','one',10),
    (2,'P2','one',20),
	(3,'P3','one',30),
    (4,'P4','one',40),
	(5,'P5','two',10),
    (6,'P6','two',20),
	(7,'P7','two',30),
    (8,'P4','two',40),
	(9,'no files','one', 18),
	(10,'few files','two', 30)
    ;
    
INSERT INTO `FILE_TO_PART` VALUES 
    # f1 to all
	(1,1),
	(1,2),
	(1,3),
	(1,4),
	(1,5),
    (1,6),
    (1,7),
    (1,8),
    # f2 to even
    (2,2),
	(2,4),
    (2,6),
    (2,8),
    # f3 to odd
	(3,1),
	(3,3),
	(3,5),
    (3,7),
    # f4 only stage one
    (4,1),
    (4,2),
    (4,3),
    (4,4),
    # f5 only stage two
    (5,5),
    (5,6),
    (5,7),
    (5,8),
    # f6 under20
	(6,1),
    (6,2),
    (6,5),
    (6,6),
	# f7 over 20
	(7,3),
    (7,4),
    (7,7),
    (7,8),
	# f8 to all
	(8,1),
	(8,2),
	(8,3),
	(8,4),
	(8,5),
    (8,6),
    (8,7),
    (8,8),
    # p10 only few files
    (2,10),
    (3,10),
	(5,10)
    ;

/* 
This table is the denormalized materialization of all of the data.
It includes all files including files with no particpants.  It also
incldues all participants, even those with no files.
*/
CREATE TABLE `MATERIAL`
# includes all files even those without participants
SELECT F.ID AS FILE_ID, F.NAME AS FILE_NAME, F.TYPE AS FILE_TYPE, F.BEN_ID AS FILE_BEN_ID, P.ID AS PART_ID, P.NAME AS PART_NAME, P.STAGE, P.AGE 
	FROM `FILE_VIEW` F LEFT JOIN `FILE_TO_PART` F2P ON (F.ID = F2P.FILE_ID) LEFT JOIN `PARTICIPANTS` P ON ( F2P.PART_ID =P.ID )
UNION
# Add all particpants that do not have any files
SELECT F.ID AS FILE_ID, F.NAME AS FILE_NAME, F.TYPE AS FILE_TYPE, F.BEN_ID AS FILE_BEN_ID, P.ID AS PART_ID, P.NAME AS PART_NAME, P.STAGE, P.AGE 
	FROM `FILE_VIEW` F JOIN `FILE_TO_PART` F2P ON (F.ID = F2P.FILE_ID) RIGHT JOIN `PARTICIPANTS` P ON ( F2P.PART_ID =P.ID ) WHERE F.ID IS NULL;

Full De-normalized Data

After running the above script we can start to query the data. For example here is the contents of the MATERIAL table:

select * from MATERIAL;

FILE_ID	FILE_NAME	FILE_TYPE	FILE_SIZE	FILE_BEN_ID	PART_ID	PART_NAME	STAGE	AGE

FILE_ID	FILE_NAME	FILE_TYPE	FILE_SIZE	FILE_BEN_ID	PART_ID	PART_NAME	STAGE	AGE
1	f1	raw	100	111	1	P1	one	10
1	f1	raw	100	111	2	P2	one	20
1	f1	raw	100	111	3	P3	one	30
1	f1	raw	100	111	4	P4	one	40
1	f1	raw	100	111	5	P5	two	10
1	f1	raw	100	111	6	P6	two	20
1	f1	raw	100	111	7	P7	two	30
1	f1	raw	100	111	8	P4	two	40
2	f2	raw	200	111	2	P2	one	20
2	f2	raw	200	111	4	P4	one	40
2	f2	raw	200	111	6	P6	two	20
2	f2	raw	200	111	8	P4	two	40
2	f2	raw	200	111	10	few files	two	30
3	f3	raw	300	111	1	P1	one	10
3	f3	raw	300	111	3	P3	one	30
3	f3	raw	300	111	5	P5	two	10
3	f3	raw	300	111	7	P7	two	30
3	f3	raw	300	111	10	few files	two	30
4	f3	raw	400	111	1	P1	one	10
4	f3	raw	400	111	2	P2	one	20
4	f3	raw	400	111	3	P3	one	30
4	f3	raw	400	111	4	P4	one	40
5	f5	proc	100	111	5	P5	two	10
5	f5	proc	100	111	6	P6	two	20
5	f5	proc	100	111	7	P7	two	30
5	f5	proc	100	111	8	P4	two	40
5	f5	proc	100	111	10	few files	two	30
6	f6	proc	200	111	1	P1	one	10
6	f6	proc	200	111	2	P2	one	20
6	f6	proc	200	111	5	P5	two	10
6	f6	proc	200	111	6	P6	two	20
7	f7	proc	300	222	3	P3	one	30
7	f7	proc	300	222	4	P4	one	40
7	f7	proc	300	222	7	P7	two	30
7	f7	proc	300	222	8	P4	two	40
8	f8	proc	400	222	1	P1	one	10
8	f8	proc	400	222	2	P2	one	20
8	f8	proc	400	222	3	P3	one	30
8	f8	proc	400	222	4	P4	one	40
8	f8	proc	400	222	5	P5	two	10
8	f8	proc	400	222	6	P6	two	20
8	f8	proc	400	222	7	P7	two	30
8	f8	proc	400	222	8	P4	two	40
9	no participants	proc	100	111	NULL	NULL	NULL	NULL
NULL	NULL	NULL	NULL	NULL	9	no files	one	18

Two Perspectives

While the MATERIAL table contains all of the data, it would be challenging to use in its raw form. To improve usability, the data is exposed as two separate perspectives of the data:

Participants-with-aggregated-files - The results of this query is such that each row represents a single participant, including columns for the participant’s metadata. The results also include additional columns that contain file aggregation data for each participant.
Files-with-aggregated-participants - The results of this query is such that each row represents a single file, including columns for the file’s metadata. The results also include additional columns that contain participant aggregation data for each file.

The unfiltered participants-with-aggregated-files would look like:

WITH
	P2F AS (
		SELECT 
			PART_ID,
			MAX(PART_NAME) AS PART_NAME,
			MAX(STAGE) AS STAGE,
			MAX(AGE) AS AGE,
			COUNT(FILE_ID) AS FILE_COUNT,
			SUM(CASE FILE_TYPE WHEN 'raw' THEN 1 ELSE 0 END) AS RAW_FILES,
			SUM(CASE FILE_TYPE WHEN 'proc' THEN 1 ELSE 0 END) AS PROC_FILES,
            MAX(FILE_SIZE) AS `MAX_SIZE`,
			MIN(FILE_SIZE) AS `MIN_SIZE`,
			GROUP_CONCAT(DISTINCT FILE_ID) AS FILE_IDS
				FROM MATERIAL WHERE PART_ID IS NOT NULL GROUP BY PART_ID ORDER BY PART_ID ASC
	),
    AGG AS (
		SELECT 
			PART_ID, PART_NAME, STAGE, AGE, FILE_COUNT,
			JSON_OBJECT('raw', RAW_FILES, 'proc', PROC_FILES) as FILE_TYPE,
			JSON_OBJECT('max', `MAX_SIZE`, 'min', `MIN_SIZE`) as FILE_SIZE,
			FILE_IDS    
				FROM P2F
	)
SELECT * FROM AGG;

PART_ID	PART_NAME	STAGE	AGE	FILE_COUNT	FILE_TYPE	FILE_SIZE	FILE_ID

PART_ID	PART_NAME	STAGE	AGE	FILE_COUNT	FILE_TYPE	FILE_SIZE	FILE_ID
1	P1	one	10	5	{"raw": 3, "proc": 2}	{"max": 400, "min": 100}	1,3,4,6,8
2	P2	one	20	5	{"raw": 3, "proc": 2}	{"max": 400, "min": 100}	1,2,4,6,8
3	P3	one	30	5	{"raw": 3, "proc": 2}	{"max": 400, "min": 100}	1,3,4,7,8
4	P4	one	40	5	{"raw": 3, "proc": 2}	{"max": 400, "min": 100}	1,2,4,7,8
5