The purpose of this document is to provide technical details of how the proposed Cohort Builder (/wiki/spaces/PD/pages/2730721508) might be built.
Setup
To help us understand all of the technical challenges for this project we built a toy example data model. This data model was designed to capture all of the technical challenges while remaining as simple and small as we could make it.
The data model is composed of three main building blocks:
FILE_VIEW - This represents a Synapse file view that contains all of the files within the scope of the project. Each row represents a single file. The metadata about each file is captured in the columns of this table. These files represent the entire pool of files that the end user can draw from.
PARTICIPANTS - This represents a Synapse table that contains one row per participant. The metadata about each participant is captured in the columns of this table. These participants represent the entire pool of candidate participants that the end user can draw from.
FILE_TO_PART - This is a Synapse table that maps files to participants. In the real data a single participant might be mapped to thousands of files, and vice versa.
MATERIAL - This table represents a materialized view that is the de-normalized join of the preceding three tables. It includes files that might not have any participants and vice versa. In the real world this table would be automatically indexed on all relevant columns. Note: A worst case for this table would be where each file is associated with each participant, thereby resulting in a Cartesian product.
If you would like to try out the data model, the schema looks roughly like the following DDL sketch (reconstructed from the queries and results below using MySQL-style syntax; the sample-data INSERT statements are omitted, but the row values can be read off the two perspective tables):
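-- Sketch only: reconstructed from the queries and results in this
-- document. The original script's exact types, constraints, and
-- sample INSERTs may differ.
CREATE TABLE FILE_VIEW (
  FILE_ID   INT PRIMARY KEY,
  FILE_NAME VARCHAR(50),
  FILE_TYPE VARCHAR(10),  -- 'raw' or 'proc'
  FILE_SIZE INT,
  BEN_ID    INT           -- benefactor ID (see Row-level Filtering)
);

CREATE TABLE PARTICIPANTS (
  PART_ID   INT PRIMARY KEY,
  PART_NAME VARCHAR(50),
  STAGE     VARCHAR(10),  -- 'one' or 'two'
  AGE       INT
);

CREATE TABLE FILE_TO_PART (
  FILE_ID INT,
  PART_ID INT
);

-- MATERIAL stands in for the materialized view: the de-normalized full
-- outer join of the three tables above (emulated with a UNION of two
-- LEFT JOINs, since MySQL has no FULL OUTER JOIN), so files without
-- participants and participants without files both appear.
CREATE TABLE MATERIAL AS
SELECT F.FILE_ID, F.FILE_NAME, F.FILE_TYPE, F.FILE_SIZE,
       F.BEN_ID AS FILE_BEN_ID,
       P.PART_ID, P.PART_NAME, P.STAGE, P.AGE
  FROM FILE_VIEW F
  LEFT JOIN FILE_TO_PART M ON F.FILE_ID = M.FILE_ID
  LEFT JOIN PARTICIPANTS P ON M.PART_ID = P.PART_ID
UNION
SELECT F.FILE_ID, F.FILE_NAME, F.FILE_TYPE, F.FILE_SIZE,
       F.BEN_ID,
       P.PART_ID, P.PART_NAME, P.STAGE, P.AGE
  FROM PARTICIPANTS P
  LEFT JOIN FILE_TO_PART M ON P.PART_ID = M.PART_ID
  LEFT JOIN FILE_VIEW F ON M.FILE_ID = F.FILE_ID;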
Full De-normalized Data
After running the above script we can start to query the data. For example, the full contents of the MATERIAL table can be listed with:
SELECT * FROM MATERIAL;
Two Perspectives
While the MATERIAL table contains all of the data, it is not very user-friendly. To improve usability, the data is exposed as two separate perspectives:
Participants-with-aggregated-files - Each row of this query’s results represents a single participant, including columns for the participant’s metadata. The results also include additional columns that contain file aggregation data for each participant.
Files-with-aggregated-participants - Each row of this query’s results represents a single file, including columns for the file’s metadata. The results also include additional columns that contain participant aggregation data for each file.
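Both perspectives are produced with a simple CTE that aggregates over the MATERIAL table (the same queries reappear in the filtering examples below). For reference, the unfiltered participants-with-aggregated-files query is:

WITH P2F AS (
  SELECT
    PART_ID,
    MAX(PART_NAME) AS PART_NAME,
    MAX(STAGE) AS STAGE,
    MAX(AGE) AS AGE,
    COUNT(FILE_ID) AS FILE_COUNT,
    SUM(CASE FILE_TYPE WHEN 'raw' THEN 1 ELSE 0 END) AS RAW_FILES,
    SUM(CASE FILE_TYPE WHEN 'proc' THEN 1 ELSE 0 END) AS PROC_FILES,
    GROUP_CONCAT(DISTINCT FILE_ID) AS FILE_IDS
  FROM MATERIAL
  WHERE PART_ID IS NOT NULL
  GROUP BY PART_ID
  ORDER BY PART_ID ASC
)
SELECT * FROM P2F;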
The unfiltered participants-with-aggregated-files results would look like:
PART_ID | PART_NAME | STAGE | AGE | FILE_COUNT | RAW_FILES | PROC_FILES | FILE_IDS |
---|---|---|---|---|---|---|---|
1 | P1 | one | 10 | 5 | 3 | 2 | 1,3,4,6,8 |
2 | P2 | one | 20 | 5 | 3 | 2 | 1,2,4,6,8 |
3 | P3 | one | 30 | 5 | 3 | 2 | 1,3,4,7,8 |
4 | P4 | one | 40 | 5 | 3 | 2 | 1,2,4,7,8 |
5 | P5 | two | 10 | 5 | 2 | 3 | 1,3,5,6,8 |
6 | P6 | two | 20 | 5 | 2 | 3 | 1,2,5,6,8 |
7 | P7 | two | 30 | 5 | 2 | 3 | 1,3,5,7,8 |
8 | P4 | two | 40 | 5 | 2 | 3 | 1,2,5,7,8 |
9 | no files | one | 18 | 0 | 0 | 0 | |
10 | few files | two | 30 | 3 | 2 | 1 | 2,3,5 |
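The corresponding unfiltered files-with-aggregated-participants query differs only in the grouping key and the aggregated columns:

WITH F2P AS (
  SELECT
    FILE_ID,
    MAX(FILE_NAME) AS FILE_NAME,
    MAX(FILE_TYPE) AS FILE_TYPE,
    MAX(FILE_SIZE) AS FILE_SIZE,
    COUNT(PART_ID) AS PART_COUNT,
    SUM(CASE STAGE WHEN 'one' THEN 1 ELSE 0 END) AS STAGE_ONE_COUNT,
    SUM(CASE STAGE WHEN 'two' THEN 1 ELSE 0 END) AS STAGE_TWO_COUNT,
    GROUP_CONCAT(DISTINCT PART_ID) AS PART_IDS
  FROM MATERIAL
  WHERE FILE_ID IS NOT NULL
  GROUP BY FILE_ID
)
SELECT * FROM F2P;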
The unfiltered files-with-aggregated-participants results would look like:
FILE_ID | FILE_NAME | FILE_TYPE | FILE_SIZE | PART_COUNT | STAGE_ONE_COUNT | STAGE_TWO_COUNT | PART_IDS |
---|---|---|---|---|---|---|---|
1 | f1 | raw | 100 | 8 | 4 | 4 | 1,2,3,4,5,6,7,8 |
2 | f2 | raw | 200 | 5 | 2 | 3 | 2,4,6,8,10 |
3 | f3 | raw | 300 | 5 | 2 | 3 | 1,3,5,7,10 |
4 | f3 | raw | 400 | 4 | 4 | 0 | 1,2,3,4 |
5 | f5 | proc | 100 | 5 | 0 | 5 | 5,6,7,8,10 |
6 | f6 | proc | 200 | 4 | 2 | 2 | 1,2,5,6 |
7 | f7 | proc | 300 | 4 | 2 | 2 | 3,4,7,8 |
8 | f8 | proc | 400 | 8 | 4 | 4 | 1,2,3,4,5,6,7,8 |
9 | no participants | proc | 100 | 0 | 0 | 0 | NULL |
Perspective Filtering
Now that we have two perspectives that show all of the information in a user-friendly manner, it will be natural for users to want to filter on each perspective. In fact, filtering is the point of the entire cohort builder exercise.
It is fairly trivial to filter by participant metadata on the participants-with-aggregated-files perspective. The same is true when filtering on file metadata on the files-with-aggregated-participants perspective. However, it is more complicated to filter on the aggregated columns of each perspective. In fact, the query for each perspective was built using a common table expression (CTE) to support aggregation filtering.
Let’s look at an example to see how to use the CTE to filter by aggregation data. A user looking at the participants-with-aggregated-files perspective decides that they only want to include participants with at least 3 processed files in their cohort. To show these results we must add a filter to the P2F alias as follows (see the WHERE PROC_FILES >= 3 clause on the final SELECT):
WITH P2F AS (
  SELECT
    PART_ID,
    MAX(PART_NAME) AS PART_NAME,
    MAX(STAGE) AS STAGE,
    MAX(AGE) AS AGE,
    COUNT(FILE_ID) AS FILE_COUNT,
    SUM(CASE FILE_TYPE WHEN 'raw' THEN 1 ELSE 0 END) AS RAW_FILES,
    SUM(CASE FILE_TYPE WHEN 'proc' THEN 1 ELSE 0 END) AS PROC_FILES,
    GROUP_CONCAT(DISTINCT FILE_ID) AS FILE_IDS
  FROM MATERIAL
  WHERE PART_ID IS NOT NULL
  GROUP BY PART_ID
  ORDER BY PART_ID ASC
)
SELECT * FROM P2F WHERE PROC_FILES >= 3;
PART_ID | PART_NAME | STAGE | AGE | FILE_COUNT | RAW_FILES | PROC_FILES | FILE_IDS |
---|---|---|---|---|---|---|---|
5 | P5 | two | 10 | 5 | 2 | 3 | 1,3,5,6,8 |
6 | P6 | two | 20 | 5 | 2 | 3 | 1,2,5,6,8 |
7 | P7 | two | 30 | 5 | 2 | 3 | 1,3,5,7,8 |
8 | P4 | two | 40 | 5 | 2 | 3 | 1,2,5,7,8 |
The result is four participants (5,6,7,8). Once the user has selected their starting participants, the next step is to take their results to the files-with-aggregated-participants perspective to finalize their selected files. This creates a new challenge: since we filtered by an aggregation in one perspective that does not exist in the other, how do we transfer the user’s filter to the other perspective?
One scale-limited solution to this problem is to capture the resulting participant IDs from the first perspective and apply them as a filter to the second. Following the example above, we would transfer the four participants (5,6,7,8) to the files-with-aggregated-participants perspective as follows (see the IN clause in the inner WHERE):
WITH F2P AS (
  SELECT
    FILE_ID,
    MAX(FILE_NAME) AS FILE_NAME,
    MAX(FILE_TYPE) AS FILE_TYPE,
    MAX(FILE_SIZE) AS FILE_SIZE,
    COUNT(PART_ID) AS PART_COUNT,
    SUM(CASE STAGE WHEN 'one' THEN 1 ELSE 0 END) AS STAGE_ONE_COUNT,
    SUM(CASE STAGE WHEN 'two' THEN 1 ELSE 0 END) AS STAGE_TWO_COUNT,
    GROUP_CONCAT(DISTINCT PART_ID) AS PART_IDS
  FROM MATERIAL
  WHERE FILE_ID IS NOT NULL
    AND PART_ID IN (5,6,7,8)
  GROUP BY FILE_ID
)
SELECT * FROM F2P;
FILE_ID | FILE_NAME | FILE_TYPE | FILE_SIZE | PART_COUNT | STAGE_ONE_COUNT | STAGE_TWO_COUNT | PART_IDS |
---|---|---|---|---|---|---|---|
1 | f1 | raw | 100 | 4 | 0 | 4 | 5,6,7,8 |
2 | f2 | raw | 200 | 2 | 0 | 2 | 6,8 |
3 | f3 | raw | 300 | 2 | 0 | 2 | 5,7 |
5 | f5 | proc | 100 | 4 | 0 | 4 | 5,6,7,8 |
6 | f6 | proc | 200 | 2 | 0 | 2 | 5,6 |
7 | f7 | proc | 300 | 2 | 0 | 2 | 7,8 |
8 | f8 | proc | 400 | 4 | 0 | 4 | 5,6,7,8 |
Notice that this filter was applied to the inner query of the CTE. Since the outer F2P does not contain raw participant IDs, we cannot add the filter there.
Extending our existing user story, the user decides they only want to include files that include all four of their selected participants, so they add the filter (see the WHERE PART_COUNT >= 4 clause on the final SELECT):
WITH F2P AS (
  SELECT
    FILE_ID,
    MAX(FILE_NAME) AS FILE_NAME,
    MAX(FILE_TYPE) AS FILE_TYPE,
    MAX(FILE_SIZE) AS FILE_SIZE,
    COUNT(PART_ID) AS PART_COUNT,
    SUM(CASE STAGE WHEN 'one' THEN 1 ELSE 0 END) AS STAGE_ONE_COUNT,
    SUM(CASE STAGE WHEN 'two' THEN 1 ELSE 0 END) AS STAGE_TWO_COUNT,
    GROUP_CONCAT(DISTINCT PART_ID) AS PART_IDS
  FROM MATERIAL
  WHERE FILE_ID IS NOT NULL
    AND PART_ID IN (5,6,7,8)
  GROUP BY FILE_ID
)
SELECT * FROM F2P WHERE PART_COUNT >= 4;
FILE_ID | FILE_NAME | FILE_TYPE | FILE_SIZE | PART_COUNT | STAGE_ONE_COUNT | STAGE_TWO_COUNT | PART_IDS |
---|---|---|---|---|---|---|---|
1 | f1 | raw | 100 | 4 | 0 | 4 | 5,6,7,8 |
5 | f5 | proc | 100 | 4 | 0 | 4 | 5,6,7,8 |
8 | f8 | proc | 400 | 4 | 0 | 4 | 5,6,7,8 |
Their final result now includes three files (1,5,8), each of which includes all four of their selected participants (5,6,7,8).
Row-level Filtering
Anytime a runtime query result can include files, we need to ensure that the system automatically filters out files that the caller does not have permission to read. This is something we must consider if we want to extend runtime queries to support CTEs.
In our example FILE_VIEW table we include a column called BEN_ID. It represents the ID of the Project/Folder that controls access to a file via its access control list. It is called a benefactor ID because it defines where a file inherits its permissions from. This column simulates how Synapse treats a real file view. You might also notice that this column was included in the materialized view (MATERIAL). This also simulates how Synapse treats materialized views that include data from file views.
The BEN_ID column is then used at query time to automatically filter out rows that the caller cannot see. This filtering is done in three steps.
In step 1 we extract the FROM and WHERE clauses from the user-provided query and use them in combination with a SELECT DISTINCT clause. For this example the result would look like:
SELECT DISTINCT FILE_BEN_ID FROM MATERIAL WHERE FILE_ID IS NOT NULL AND PART_ID IN (5,6,7,8);
This query will return two values: 111 and 222. This tells us the full set of possible benefactors that this query could return. These values are then used in step 2, which is an authorization check to determine the subset that the user is allowed to see. For our example, let’s say the caller is allowed to see 111 but not 222.
This information is then used in the final step (3), where we rewrite the user’s query to unconditionally limit results to rows that the user is authorized to see:
WITH F2P AS (
  SELECT
    FILE_ID,
    MAX(FILE_NAME) AS FILE_NAME,
    MAX(FILE_TYPE) AS FILE_TYPE,
    MAX(FILE_SIZE) AS FILE_SIZE,
    COUNT(PART_ID) AS PART_COUNT,
    SUM(CASE STAGE WHEN 'one' THEN 1 ELSE 0 END) AS STAGE_ONE_COUNT,
    SUM(CASE STAGE WHEN 'two' THEN 1 ELSE 0 END) AS STAGE_TWO_COUNT,
    GROUP_CONCAT(DISTINCT PART_ID) AS PART_IDS
  FROM MATERIAL
  WHERE (FILE_ID IS NOT NULL AND PART_ID IN (5,6,7,8))
    AND FILE_BEN_ID IN (111)
  GROUP BY FILE_ID
)
SELECT * FROM F2P;
Note: In the inner WHERE clause of the rewritten query, the user’s original predicate is wrapped in parentheses and combined with FILE_BEN_ID IN (111).
In summary, if we want to support CTEs in runtime queries we will need to extend the authorization sub-system to extract and rebuild the inner query, while leaving the outer query intact.
New Features
In order to make the above use cases work using the provided example queries, we are going to need to add several new features to the Synapse query system.
CASE statements - We do not currently support CASE statements in Synapse SQL. We would need to extend the SQL parser to correctly handle them.
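Note that the examples above use only the "simple" CASE form (CASE <expr> WHEN <value> ...); the parser would presumably also need to handle the "searched" form. A hypothetical example over the same schema (the derived column name is illustrative):

SELECT
  -- searched CASE: an arbitrary boolean condition per branch
  SUM(CASE WHEN FILE_TYPE = 'proc' AND FILE_SIZE > 200 THEN 1 ELSE 0 END) AS LARGE_PROC_FILES
FROM MATERIAL;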
CTE - We do not currently support common table expressions (CTEs). Note: CTEs can be more complex than the simple ones used in the above examples. For example, it is possible for a CTE to include many tables, joins, unions, and sub-queries. Supporting complex CTEs that include joins, unions, or sub-queries would potentially allow users to execute system-debilitating queries. However, all of the examples above are simple and only involve a single table with aggregation. Such a simple CTE poses little to no risk. Therefore, we would like to add simple CTE support to runtime queries. This means a materialized view would not be needed to use a CTE.
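To illustrate the distinction, here is a hypothetical example of the kind of complex CTE that would remain out of scope, since it joins and unions multiple tables inside the CTE body:

WITH RISKY AS (
  -- multiple tables, a join chain, and a union inside one CTE body
  SELECT P.PART_ID, F.FILE_ID
    FROM PARTICIPANTS P
    JOIN FILE_TO_PART M ON P.PART_ID = M.PART_ID
    JOIN FILE_VIEW F ON M.FILE_ID = F.FILE_ID
  UNION
  SELECT PART_ID, FILE_ID FROM FILE_TO_PART
)
SELECT COUNT(*) FROM RISKY;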
Optimize Materialized Views - Currently, when we rebuild a materialized view (MV), we block users from querying it for the duration of the rebuild. For small MVs that only rebuild occasionally, this is not a problem. However, this use case will require potentially large MVs (millions of rows). If it takes 30 minutes to rebuild the MV used for a cohort builder after a file updates, it is unlikely that we want users to be unable to use the cohort builder during the rebuild. On the other hand, users are unlikely to notice if the data in the cohort builder is slightly stale. Therefore, we propose adding an optimization to MVs that would allow the existing MV to remain query-able during the rebuild process. After the rebuild is finished, the stale MV could be atomically swapped with the newly built MV.
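In MySQL terms, the swap could be done with a multi-table RENAME TABLE, which is atomic, so readers see either the old MV or the new one and never a partially built table. A minimal sketch, assuming the MV is backed by ordinary tables (all names illustrative):

-- Rebuild into a shadow table while the existing MV stays query-able.
CREATE TABLE MATERIAL_NEW LIKE MATERIAL;
-- Re-run the MV's defining query into the shadow table. "SELECT * FROM
-- MATERIAL" is only a stand-in for that defining query here.
INSERT INTO MATERIAL_NEW SELECT * FROM MATERIAL;
-- Atomically swap the stale MV for the freshly built one.
RENAME TABLE MATERIAL TO MATERIAL_OLD, MATERIAL_NEW TO MATERIAL;
DROP TABLE MATERIAL_OLD;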