Atlassian uses cookies to improve your browsing experience, perform analytics and research, and conduct advertising. Accept all cookies to indicate that you agree to our use of cookies on your device. Atlassian cookies and tracking notice, (opens new window)
To help us understand all of the technical challenges for this project we built a toy example data model. This data model was designed to capture all of the technical challenges while remaining as simple and small as we could make it.
The data model is composed of three main building blocks:
FILE_VIEW- This represents a Synapse file view that contains all of the files within the scope of the project. Each row represents a single file. The metadata about each file is captured in the columns of this table. These files represent the entire pool of files that the end user can draw from.
PARTICIPANTS - This represents a Synapse table that contains one row per participant. The metadata about each participant is captured in the columns of this table. These participants represent the entire pool of candidate participants that the end user can draw from.
FILE_TO_PART - This is a Synapse table that maps files to participant. In the real data a single participant might be mapped to thousands of files, and vice versa.
MATERIAL - This table represents a materialized view that is the de-normalized join of the preceding three tables. It includes files that might not have any participants and vice versa. In the real world this table would be automatically indexed on all relevant columns. Note: A worst case for this table would be where each file is associated with each participant thereby resulting in a Cartesian product.
If you would like to try out the data model here is the ddl:
DROP DATABASE `COHORT`;
CREATE DATABASE `COHORT`;
USE `COHORT`;
CREATE TABLE IF NOT EXISTS `FILE_VIEW` (
`ID` BIGINT NOT NULL,
`NAME` varchar(256) CHARACTER SET utf8mb4 COLLATE utf8mb4_bin NOT NULL,
`TYPE` ENUM ('raw','proc') NOT NULL,
`SIZE` BIGINT NOT NULL,
`BEN_ID` BIGINT NOT NULL,
PRIMARY KEY (`ID`)
);
CREATE TABLE IF NOT EXISTS `PARTICIPANTS` (
`ID` BIGINT NOT NULL,
`NAME` varchar(256) CHARACTER SET utf8mb4 COLLATE utf8mb4_bin NOT NULL,
`STAGE` ENUM('one','two'),
`AGE` BIGINT NOT NULL,
PRIMARY KEY (`ID`)
);
CREATE TABLE IF NOT EXISTS `FILE_TO_PART` (
`FILE_ID` BIGINT NOT NULL,
`PART_ID` BIGINT NOT NULL,
PRIMARY KEY (`FILE_ID`, `PART_ID`)
);
INSERT INTO `FILE_VIEW` VALUES
(1,'f1','raw',100,111),
(2,'f2','raw',200,111),
(3,'f3','raw',300,111),
(4,'f3','raw',400,111),
(5,'f5','proc',100,111),
(6,'f6','proc',200,111),
(7,'f7','proc',300,222),
(8,'f8','proc',400,222),
(9,'no participants','proc',100,111)
;
INSERT INTO `PARTICIPANTS` VALUES
(1,'P1','one',10),
(2,'P2','one',20),
(3,'P3','one',30),
(4,'P4','one',40),
(5,'P5','two',10),
(6,'P6','two',20),
(7,'P7','two',30),
(8,'P4','two',40),
(9,'no files','one', 18),
(10,'few files','two', 30)
;
INSERT INTO `FILE_TO_PART` VALUES
# f1 to all
(1,1),
(1,2),
(1,3),
(1,4),
(1,5),
(1,6),
(1,7),
(1,8),
# f2 to even
(2,2),
(2,4),
(2,6),
(2,8),
# f3 to odd
(3,1),
(3,3),
(3,5),
(3,7),
# f4 only stage one
(4,1),
(4,2),
(4,3),
(4,4),
# f5 only stage two
(5,5),
(5,6),
(5,7),
(5,8),
# f6 under20
(6,1),
(6,2),
(6,5),
(6,6),
# f7 over 20
(7,3),
(7,4),
(7,7),
(7,8),
# f8 to all
(8,1),
(8,2),
(8,3),
(8,4),
(8,5),
(8,6),
(8,7),
(8,8),
# p10 only few files
(2,10),
(3,10),
(5,10)
;
/*
This table is the denormalized materialization of all of the data.
It includes all files including files with no particpants. It also
incldues all participants, even those with no files.
*/
CREATE TABLE `MATERIAL`
# includes all files even those without participants
SELECT F.ID AS FILE_ID, F.NAME AS FILE_NAME, F.TYPE AS FILE_TYPE, F.BEN_ID AS FILE_BEN_ID, P.ID AS PART_ID, P.NAME AS PART_NAME, P.STAGE, P.AGE
FROM `FILE_VIEW` F LEFT JOIN `FILE_TO_PART` F2P ON (F.ID = F2P.FILE_ID) LEFT JOIN `PARTICIPANTS` P ON ( F2P.PART_ID =P.ID )
UNION
# Add all particpants that do not have any files
SELECT F.ID AS FILE_ID, F.NAME AS FILE_NAME, F.TYPE AS FILE_TYPE, F.BEN_ID AS FILE_BEN_ID, P.ID AS PART_ID, P.NAME AS PART_NAME, P.STAGE, P.AGE
FROM `FILE_VIEW` F JOIN `FILE_TO_PART` F2P ON (F.ID = F2P.FILE_ID) RIGHT JOIN `PARTICIPANTS` P ON ( F2P.PART_ID =P.ID ) WHERE F.ID IS NULL;
Full De-normalized Data
After running the above script we can start to query the data. For example here is the contents of the MATERIAL table: