Atlassian uses cookies to improve your browsing experience, perform analytics and research, and conduct advertising. Accept all cookies to indicate that you agree to our use of cookies on your device. Atlassian cookies and tracking notice, (opens new window)
To help us understand all of the technical challenges for this project we built a toy example data model. This data model was designed to capture all of the technical challenges while remaining as simple and small as we could make it.
The data model is composed of three main building blocks:
FILE_VIEW- This represents a Synapse file view that contains all of the files within the scope of the project. Each row represents a single file. The metadata about each file is captured in the columns of this table. These files represent the entire pool of files that the end user can draw from.
PARTICIPANTS - This represents a Synapse table that contains one row per participant. The metadata about each participant is captured in the columns of this table. These participants represent the entire pool of candidate participants that the end user can draw from.
FILE_TO_PART - This is a Synapse table that maps files to participant. In the real data a single participant might be mapped to thousands of files, and vice versa.
MATERIAL - This table represents a materialized view that is the de-normalized join of the preceding three tables. It includes files that might not have any participants and vice versa. In the real world this table would be automatically indexed on all relevant columns. Note: A worst case for this table would be where each file is associated with each participant thereby resulting in a Cartesian product.
If you would like to try out the data model here is the ddl:
DROP DATABASE `COHORT`;
CREATE DATABASE `COHORT`;
USE `COHORT`;
CREATE TABLE IF NOT EXISTS `FILE_VIEW` (
`ID` BIGINT NOT NULL,
`NAME` varchar(256) CHARACTER SET utf8mb4 COLLATE utf8mb4_bin NOT NULL,
`TYPE` ENUM ('raw','proc') NOT NULL,
`SIZE` BIGINT NOT NULL,
`BEN_ID` BIGINT NOT NULL,
PRIMARY KEY (`ID`)
);
CREATE TABLE IF NOT EXISTS `PARTICIPANTS` (
`ID` BIGINT NOT NULL,
`NAME` varchar(256) CHARACTER SET utf8mb4 COLLATE utf8mb4_bin NOT NULL,
`STAGE` ENUM('one','two'),
`AGE` BIGINT NOT NULL,
PRIMARY KEY (`ID`)
);
CREATE TABLE IF NOT EXISTS `FILE_TO_PART` (
`FILE_ID` BIGINT NOT NULL,
`PART_ID` BIGINT NOT NULL,
PRIMARY KEY (`FILE_ID`, `PART_ID`)
);
INSERT INTO `FILE_VIEW` VALUES
(1,'f1','raw',100,111),
(2,'f2','raw',200,111),
(3,'f3','raw',300,111),
(4,'f3','raw',400,111),
(5,'f5','proc',100,111),
(6,'f6','proc',200,111),
(7,'f7','proc',300,222),
(8,'f8','proc',400,222),
(9,'no participants','proc',100,111)
;
INSERT INTO `PARTICIPANTS` VALUES
(1,'P1','one',10),
(2,'P2','one',20),
(3,'P3','one',30),
(4,'P4','one',40),
(5,'P5','two',10),
(6,'P6','two',20),
(7,'P7','two',30),
(8,'P4','two',40),
(9,'no files','one', 18),
(10,'few files','two', 30)
;
INSERT INTO `FILE_TO_PART` VALUES
# f1 to all
(1,1),
(1,2),
(1,3),
(1,4),
(1,5),
(1,6),
(1,7),
(1,8),
# f2 to even
(2,2),
(2,4),
(2,6),
(2,8),
# f3 to odd
(3,1),
(3,3),
(3,5),
(3,7),
# f4 only stage one
(4,1),
(4,2),
(4,3),
(4,4),
# f5 only stage two
(5,5),
(5,6),
(5,7),
(5,8),
# f6 under20
(6,1),
(6,2),
(6,5),
(6,6),
# f7 over 20
(7,3),
(7,4),
(7,7),
(7,8),
# f8 to all
(8,1),
(8,2),
(8,3),
(8,4),
(8,5),
(8,6),
(8,7),
(8,8),
# p10 only few files
(2,10),
(3,10),
(5,10)
;
/*
This table is the denormalized materialization of all of the data.
It includes all files including files with no particpants. It also
incldues all participants, even those with no files.
*/
CREATE TABLE `MATERIAL`
# includes all files even those without participants
SELECT F.ID AS FILE_ID, F.NAME AS FILE_NAME, F.TYPE AS FILE_TYPE, F.BEN_ID AS FILE_BEN_ID, P.ID AS PART_ID, P.NAME AS PART_NAME, P.STAGE, P.AGE
FROM `FILE_VIEW` F LEFT JOIN `FILE_TO_PART` F2P ON (F.ID = F2P.FILE_ID) LEFT JOIN `PARTICIPANTS` P ON ( F2P.PART_ID =P.ID )
UNION
# Add all particpants that do not have any files
SELECT F.ID AS FILE_ID, F.NAME AS FILE_NAME, F.TYPE AS FILE_TYPE, F.BEN_ID AS FILE_BEN_ID, P.ID AS PART_ID, P.NAME AS PART_NAME, P.STAGE, P.AGE
FROM `FILE_VIEW` F JOIN `FILE_TO_PART` F2P ON (F.ID = F2P.FILE_ID) RIGHT JOIN `PARTICIPANTS` P ON ( F2P.PART_ID =P.ID ) WHERE F.ID IS NULL;
Full De-normalized Data
After running the above script we can start to query the data. For example here is the contents of the MATERIAL table:
select * from MATERIAL;
FILE_ID
FILE_NAME
FILE_TYPE
FILE_SIZE
FILE_BEN_ID
PART_ID
PART_NAME
STAGE
AGE
FILE_ID
FILE_NAME
FILE_TYPE
FILE_SIZE
FILE_BEN_ID
PART_ID
PART_NAME
STAGE
AGE
1
f1
raw
100
111
1
P1
one
10
1
f1
raw
100
111
2
P2
one
20
1
f1
raw
100
111
3
P3
one
30
1
f1
raw
100
111
4
P4
one
40
1
f1
raw
100
111
5
P5
two
10
1
f1
raw
100
111
6
P6
two
20
1
f1
raw
100
111
7
P7
two
30
1
f1
raw
100
111
8
P4
two
40
2
f2
raw
200
111
2
P2
one
20
2
f2
raw
200
111
4
P4
one
40
2
f2
raw
200
111
6
P6
two
20
2
f2
raw
200
111
8
P4
two
40
2
f2
raw
200
111
10
few files
two
30
3
f3
raw
300
111
1
P1
one
10
3
f3
raw
300
111
3
P3
one
30
3
f3
raw
300
111
5
P5
two
10
3
f3
raw
300
111
7
P7
two
30
3
f3
raw
300
111
10
few files
two
30
4
f3
raw
400
111
1
P1
one
10
4
f3
raw
400
111
2
P2
one
20
4
f3
raw
400
111
3
P3
one
30
4
f3
raw
400
111
4
P4
one
40
5
f5
proc
100
111
5
P5
two
10
5
f5
proc
100
111
6
P6
two
20
5
f5
proc
100
111
7
P7
two
30
5
f5
proc
100
111
8
P4
two
40
5
f5
proc
100
111
10
few files
two
30
6
f6
proc
200
111
1
P1
one
10
6
f6
proc
200
111
2
P2
one
20
6
f6
proc
200
111
5
P5
two
10
6
f6
proc
200
111
6
P6
two
20
7
f7
proc
300
222
3
P3
one
30
7
f7
proc
300
222
4
P4
one
40
7
f7
proc
300
222
7
P7
two
30
7
f7
proc
300
222
8
P4
two
40
8
f8
proc
400
222
1
P1
one
10
8
f8
proc
400
222
2
P2
one
20
8
f8
proc
400
222
3
P3
one
30
8
f8
proc
400
222
4
P4
one
40
8
f8
proc
400
222
5
P5
two
10
8
f8
proc
400
222
6
P6
two
20
8
f8
proc
400
222
7
P7
two
30
8
f8
proc
400
222
8
P4
two
40
9
no participants
proc
100
111
NULL
NULL
NULL
NULL
NULL
NULL
NULL
NULL
NULL
9
no files
one
18
Two Perspectives
While the MATERIAL table contains all of the data, it would be challenging to use in its raw form. To improve usability, the data is exposed as two separate perspectives of the data:
Participants-with-aggregated-files - The results of this query is such that each row represents a single participant, including columns for the participant’s metadata. The results also include additional columns that contain file aggregation data for each participant.
Files-with-aggregated-participants - The results of this query is such that each row represents a single file, including columns for the file’s metadata. The results also include additional columns that contain participant aggregation data for each file.
The unfiltered participants-with-aggregated-files would look like:
WITH
P2F AS (
SELECT
PART_ID,
MAX(PART_NAME) AS PART_NAME,
MAX(STAGE) AS STAGE,
MAX(AGE) AS AGE,
COUNT(FILE_ID) AS FILE_COUNT,
SUM(CASE FILE_TYPE WHEN 'raw' THEN 1 ELSE 0 END) AS RAW_FILES,
SUM(CASE FILE_TYPE WHEN 'proc' THEN 1 ELSE 0 END) AS PROC_FILES,
MAX(FILE_SIZE) AS `MAX_SIZE`,
MIN(FILE_SIZE) AS `MIN_SIZE`,
GROUP_CONCAT(DISTINCT FILE_ID) AS FILE_IDS
FROM MATERIAL WHERE PART_ID IS NOT NULL GROUP BY PART_ID ORDER BY PART_ID ASC
),
AGG AS (
SELECT
PART_ID, PART_NAME, STAGE, AGE, FILE_COUNT,
JSON_OBJECT('raw', RAW_FILES, 'proc', PROC_FILES) as FILE_TYPE,
JSON_OBJECT('max', `MAX_SIZE`, 'min', `MIN_SIZE`) as FILE_SIZE,
FILE_IDS
FROM P2F
)
SELECT * FROM AGG;