The purpose of this document is to provide technical details of how the proposed /wiki/spaces/PD/pages/2730721508 might be built.
Setup
To help us understand the technical challenges of this project, we built a toy example data model, designed to capture all of those challenges while remaining as simple and small as possible.
The data model is composed of three main building blocks, plus a materialized view derived from them:
FILE_VIEW - This represents a Synapse file view that contains all of the files within the scope of the project. Each row represents a single file. The metadata about each file is captured in the columns of this table. These files represent the entire pool of files that the end user can draw from.
PARTICIPANTS - This represents a Synapse table that contains one row per participant. The metadata about each participant is captured in the columns of this table. These participants represent the entire pool of candidate participants that the end user can draw from.
FILE_TO_PART - This is a Synapse table that maps files to participants. In the real data, a single participant might be mapped to thousands of files, and vice versa.
MATERIAL - This table represents a materialized view that is the de-normalized join of the preceding three tables. It includes files that might not have any participants, and vice versa. In the real world, this table would be automatically indexed on all relevant columns. Note: A worst case for this table would be one where each file is associated with every participant, resulting in a Cartesian product; for example, 1,000 files each mapped to all of 1,000 participants would produce 1,000,000 MATERIAL rows.
If you would like to try out the data model, here is the DDL:
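What follows is a minimal sketch of such a DDL, in MySQL syntax (the example queries below use GROUP_CONCAT). The table and column names are taken from the queries and results in this document; the column types are inferred from the sample data, and the INSERT statements that populate the sample rows are omitted.

-- Sketch only: types are inferred, and the INSERTs for the sample rows are omitted.
CREATE TABLE FILE_VIEW (
    FILE_ID   BIGINT PRIMARY KEY,
    FILE_NAME VARCHAR(256),
    FILE_TYPE VARCHAR(16),  -- 'raw' or 'proc'
    FILE_SIZE BIGINT,
    BEN_ID    BIGINT        -- benefactor ID (see Row-level Filtering below)
);

CREATE TABLE PARTICIPANTS (
    PART_ID   BIGINT PRIMARY KEY,
    PART_NAME VARCHAR(256),
    STAGE     VARCHAR(16),  -- 'one' or 'two'
    AGE       INT
);

CREATE TABLE FILE_TO_PART (
    FILE_ID BIGINT,
    PART_ID BIGINT
);

-- MySQL has no native materialized views, so a plain view stands in for the
-- Synapse materialized view here. The UNION of two LEFT JOINs emulates a full
-- outer join, so files without participants and participants without files
-- both appear in MATERIAL.
CREATE VIEW MATERIAL AS
SELECT F.FILE_ID, F.FILE_NAME, F.FILE_TYPE, F.FILE_SIZE, F.BEN_ID AS FILE_BEN_ID,
       P.PART_ID, P.PART_NAME, P.STAGE, P.AGE
  FROM FILE_VIEW F
  LEFT JOIN FILE_TO_PART M ON F.FILE_ID = M.FILE_ID
  LEFT JOIN PARTICIPANTS P ON M.PART_ID = P.PART_ID
UNION
SELECT F.FILE_ID, F.FILE_NAME, F.FILE_TYPE, F.FILE_SIZE, F.BEN_ID AS FILE_BEN_ID,
       P.PART_ID, P.PART_NAME, P.STAGE, P.AGE
  FROM PARTICIPANTS P
  LEFT JOIN FILE_TO_PART M ON P.PART_ID = M.PART_ID
  LEFT JOIN FILE_VIEW F ON M.FILE_ID = F.FILE_ID;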
Full De-normalized Data
After running the above script, we can start to query the data. For example, the contents of the MATERIAL table can be listed with:
select * from MATERIAL;
Two Perspectives
While the MATERIAL table contains all of the data, it is not very user friendly. To make it user friendly, we create two separate perspectives of the data:
Participants with aggregated files - In the results of this query, each row represents a single participant, with columns for the participant’s metadata. Additional columns contain file aggregation data for each participant.
Files with aggregated participants - In the results of this query, each row represents a single file, with columns for the file’s metadata. Additional columns contain participant aggregation data for each file.
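Each perspective is produced by an aggregation query over the MATERIAL table, wrapped in a CTE so that the aggregated columns can be filtered later (the filtered variants appear in the Perspective Filtering section below). Presumably, the unfiltered participants-with-aggregated-files query is just the P2F CTE with no outer filter:

WITH P2F AS (
    SELECT
        PART_ID,
        MAX(PART_NAME) AS PART_NAME,
        MAX(STAGE) AS STAGE,
        MAX(AGE) AS AGE,
        COUNT(FILE_ID) AS FILE_COUNT,
        SUM(CASE FILE_TYPE WHEN 'raw' THEN 1 ELSE 0 END) AS RAW_FILES,
        SUM(CASE FILE_TYPE WHEN 'proc' THEN 1 ELSE 0 END) AS PROC_FILES,
        GROUP_CONCAT(DISTINCT FILE_ID) AS FILE_IDS
    FROM MATERIAL
    WHERE PART_ID IS NOT NULL
    GROUP BY PART_ID
    ORDER BY PART_ID ASC
)
SELECT * FROM P2F;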
The unfiltered participants-with-aggregated-files results would look like:
PART_ID | PART_NAME | STAGE | AGE | FILE_COUNT | RAW_FILES | PROC_FILES | FILE_IDS |
---|---|---|---|---|---|---|---|
1 | P1 | one | 10 | 5 | 3 | 2 | 1,3,4,6,8 |
2 | P2 | one | 20 | 5 | 3 | 2 | 1,2,4,6,8 |
3 | P3 | one | 30 | 5 | 3 | 2 | 1,3,4,7,8 |
4 | P4 | one | 40 | 5 | 3 | 2 | 1,2,4,7,8 |
5 | P5 | two | 10 | 5 | 2 | 3 | 1,3,5,6,8 |
6 | P6 | two | 20 | 5 | 2 | 3 | 1,2,5,6,8 |
7 | P7 | two | 30 | 5 | 2 | 3 | 1,3,5,7,8 |
8 | P8 | two | 40 | 5 | 2 | 3 | 1,2,5,7,8 |
9 | no files | one | 18 | 0 | 0 | 0 | |
10 | few files | two | 30 | 3 | 2 | 1 | 2,3,5 |
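Similarly, the unfiltered files-with-aggregated-participants query would be the F2P CTE from the filtering examples below, with the participant restriction removed:

WITH F2P AS (
    SELECT
        FILE_ID,
        MAX(FILE_NAME) AS FILE_NAME,
        MAX(FILE_TYPE) AS FILE_TYPE,
        MAX(FILE_SIZE) AS FILE_SIZE,
        COUNT(PART_ID) AS PART_COUNT,
        SUM(CASE STAGE WHEN 'one' THEN 1 ELSE 0 END) AS STAGE_ONE_COUNT,
        SUM(CASE STAGE WHEN 'two' THEN 1 ELSE 0 END) AS STAGE_TWO_COUNT,
        GROUP_CONCAT(DISTINCT PART_ID) AS PART_IDS
    FROM MATERIAL
    WHERE FILE_ID IS NOT NULL
    GROUP BY FILE_ID
)
SELECT * FROM F2P;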
The unfiltered files-with-aggregated-participants results would look like:
FILE_ID | FILE_NAME | FILE_TYPE | FILE_SIZE | PART_COUNT | STAGE_ONE_COUNT | STAGE_TWO_COUNT | PART_IDS |
---|---|---|---|---|---|---|---|
1 | f1 | raw | 100 | 8 | 4 | 4 | 1,2,3,4,5,6,7,8 |
2 | f2 | raw | 200 | 5 | 2 | 3 | 2,4,6,8,10 |
3 | f3 | raw | 300 | 5 | 2 | 3 | 1,3,5,7,10 |
4 | f4 | raw | 400 | 4 | 4 | 0 | 1,2,3,4 |
5 | f5 | proc | 100 | 5 | 0 | 5 | 5,6,7,8,10 |
6 | f6 | proc | 200 | 4 | 2 | 2 | 1,2,5,6 |
7 | f7 | proc | 300 | 4 | 2 | 2 | 3,4,7,8 |
8 | f8 | proc | 400 | 8 | 4 | 4 | 1,2,3,4,5,6,7,8 |
9 | no participants | proc | 100 | 0 | 0 | 0 | NULL |
Perspective Filtering
Now that we have two perspectives that show all of the information in a user-friendly manner, it will be natural for users to want to filter on each perspective. In fact, filtering is the point of the entire cohort builder exercise.
It is fairly trivial to filter by participant metadata on the participants-with-aggregated-files perspective, and likewise to filter by file metadata on the files-with-aggregated-participants perspective. However, it is more complicated to filter on the aggregated columns of each perspective. In fact, the query for each perspective was built using a common table expression (CTE) specifically to support aggregation filtering.
Let’s look at an example to see how to use the CTE to filter by aggregation data. A user looking at the participants-with-aggregated-files perspective decides that they only want to include participants with at least 3 processed files in their cohort. To show these results, we must add a filter to the P2F alias as follows (see the WHERE clause on the final SELECT):
WITH P2F AS (
    SELECT
        PART_ID,
        MAX(PART_NAME) AS PART_NAME,
        MAX(STAGE) AS STAGE,
        MAX(AGE) AS AGE,
        COUNT(FILE_ID) AS FILE_COUNT,
        SUM(CASE FILE_TYPE WHEN 'raw' THEN 1 ELSE 0 END) AS RAW_FILES,
        SUM(CASE FILE_TYPE WHEN 'proc' THEN 1 ELSE 0 END) AS PROC_FILES,
        GROUP_CONCAT(DISTINCT FILE_ID) AS FILE_IDS
    FROM MATERIAL
    WHERE PART_ID IS NOT NULL
    GROUP BY PART_ID
    ORDER BY PART_ID ASC
)
SELECT * FROM P2F WHERE PROC_FILES >= 3;
PART_ID | PART_NAME | STAGE | AGE | FILE_COUNT | RAW_FILES | PROC_FILES | FILE_IDS |
---|---|---|---|---|---|---|---|
5 | P5 | two | 10 | 5 | 2 | 3 | 1,3,5,6,8 |
6 | P6 | two | 20 | 5 | 2 | 3 | 1,2,5,6,8 |
7 | P7 | two | 30 | 5 | 2 | 3 | 1,3,5,7,8 |
8 | P8 | two | 40 | 5 | 2 | 3 | 1,2,5,7,8 |
The result is four participants (5,6,7,8). Once the user has selected their starting participants, the next step is to take their results to the files-with-aggregated-participants perspective to finalize their selected files. This creates a new challenge: since we filtered by an aggregation in one perspective that does not exist in the other, how do we transfer the user’s filter to the other perspective?
One scale-limited solution to this problem is to capture the resulting participant IDs from the first perspective and apply them as a filter to the second. Following the example above, we would transfer the four participants (5,6,7,8) to the files-with-aggregated-participants perspective as follows (see the IN clause in the inner WHERE):
WITH F2P AS (
    SELECT
        FILE_ID,
        MAX(FILE_NAME) AS FILE_NAME,
        MAX(FILE_TYPE) AS FILE_TYPE,
        MAX(FILE_SIZE) AS FILE_SIZE,
        COUNT(PART_ID) AS PART_COUNT,
        SUM(CASE STAGE WHEN 'one' THEN 1 ELSE 0 END) AS STAGE_ONE_COUNT,
        SUM(CASE STAGE WHEN 'two' THEN 1 ELSE 0 END) AS STAGE_TWO_COUNT,
        GROUP_CONCAT(DISTINCT PART_ID) AS PART_IDS
    FROM MATERIAL
    WHERE FILE_ID IS NOT NULL
      AND PART_ID IN (5,6,7,8)
    GROUP BY FILE_ID
)
SELECT * FROM F2P;
FILE_ID | FILE_NAME | FILE_TYPE | FILE_SIZE | PART_COUNT | STAGE_ONE_COUNT | STAGE_TWO_COUNT | PART_IDS |
---|---|---|---|---|---|---|---|
1 | f1 | raw | 100 | 4 | 0 | 4 | 5,6,7,8 |
2 | f2 | raw | 200 | 2 | 0 | 2 | 6,8 |
3 | f3 | raw | 300 | 2 | 0 | 2 | 5,7 |
5 | f5 | proc | 100 | 4 | 0 | 4 | 5,6,7,8 |
6 | f6 | proc | 200 | 2 | 0 | 2 | 5,6 |
7 | f7 | proc | 300 | 2 | 0 | 2 | 7,8 |
8 | f8 | proc | 400 | 4 | 0 | 4 | 5,6,7,8 |
Notice that this filter was applied to the inner query of the CTE. Since the outer F2P does not contain raw participant IDs, we cannot add the filter there.
Extending our existing user story, the user decides they only want to include files that include all four of their selected participants, so they add the filter:
WITH F2P AS (
    SELECT
        FILE_ID,
        MAX(FILE_NAME) AS FILE_NAME,
        MAX(FILE_TYPE) AS FILE_TYPE,
        MAX(FILE_SIZE) AS FILE_SIZE,
        COUNT(PART_ID) AS PART_COUNT,
        SUM(CASE STAGE WHEN 'one' THEN 1 ELSE 0 END) AS STAGE_ONE_COUNT,
        SUM(CASE STAGE WHEN 'two' THEN 1 ELSE 0 END) AS STAGE_TWO_COUNT,
        GROUP_CONCAT(DISTINCT PART_ID) AS PART_IDS
    FROM MATERIAL
    WHERE FILE_ID IS NOT NULL
      AND PART_ID IN (5,6,7,8)
    GROUP BY FILE_ID
)
SELECT * FROM F2P WHERE PART_COUNT >= 4;
FILE_ID | FILE_NAME | FILE_TYPE | FILE_SIZE | PART_COUNT | STAGE_ONE_COUNT | STAGE_TWO_COUNT | PART_IDS |
---|---|---|---|---|---|---|---|
1 | f1 | raw | 100 | 4 | 0 | 4 | 5,6,7,8 |
5 | f5 | proc | 100 | 4 | 0 | 4 | 5,6,7,8 |
8 | f8 | proc | 400 | 4 | 0 | 4 | 5,6,7,8 |
Their final result now includes three files (1,5,8) that each include their four selected participants (5,6,7,8).
Row-level Filtering
Anytime a solution includes files, we need to ensure that the system automatically filters out files that the caller does not have permission to read.
Note: By design, the current query system does not filter out files that the user lacks permission to download, nor files with unmet access requirements.
In our example FILE_VIEW table, we include a column called BEN_ID. It represents the ID of the Project/Folder that controls access to a file via its access control list. It is called a benefactor ID because it defines where a file inherits its permissions from. This column simulates how Synapse treats a real file view. You might also notice that this column was included in the materialized view (MATERIAL). This also simulates how Synapse treats materialized views that include data from file views.
The BEN_ID column is then used at query time to automatically filter out rows that the caller cannot see, using the following three steps:
A new query is derived from the user-provided query to select the distinct benefactor IDs from the synID in the FROM clause. This tells the security filter what all of the possible BEN_IDs are for the given situation.
We run an authorization check to determine the sub-set of BEN_IDs on which the caller has read permission.
Finally, the user’s original query is modified to include an unconditional IN clause that limits the results to only benefactors the caller can see.
If we apply this concept to our example scenario above, we detect that the MATERIAL table has two distinct benefactor IDs by automatically converting the user’s input query as follows:
SELECT DISTINCT FILE_BEN_ID FROM MATERIAL WHERE FILE_ID IS NOT NULL AND PART_ID IN (5,6,7,8);
In this step we must discard the original CTE, SELECT list, and GROUP BY provided by the user’s query and replace the SELECT list with DISTINCT FILE_BEN_ID. The result of this query returns two rows: 111 and 222.
We then do an authorization check on the user and determine they can see 111 but not 222. We are then ready to re-write their input SQL as follows:
WITH F2P AS (
    SELECT
        FILE_ID,
        MAX(FILE_NAME) AS FILE_NAME,
        MAX(FILE_TYPE) AS FILE_TYPE,
        MAX(FILE_SIZE) AS FILE_SIZE,
        COUNT(PART_ID) AS PART_COUNT,
        SUM(CASE STAGE WHEN 'one' THEN 1 ELSE 0 END) AS STAGE_ONE_COUNT,
        SUM(CASE STAGE WHEN 'two' THEN 1 ELSE 0 END) AS STAGE_TWO_COUNT,
        GROUP_CONCAT(DISTINCT PART_ID) AS PART_IDS
    FROM MATERIAL
    WHERE (FILE_ID IS NOT NULL AND PART_ID IN (5,6,7,8))
      AND FILE_BEN_ID IN (111)
    GROUP BY FILE_ID
)
SELECT * FROM F2P;
Note: In the inner WHERE clause, the user’s original filter is unconditionally combined with AND FILE_BEN_ID IN (111). This ensures that the caller will never see file rows that they are unauthorized to see.
We added this section to show how adding a CTE query filter requires that we also plan for how to apply the correct runtime row-level filtering.
New Features
In order to make the above use cases work using the provided example queries, we are going to need to add several new features to the Synapse query system.
CASE statements - We do not currently support CASE statements in Synapse SQL. We would need to extend the SQL parser to correctly handle them (see the first example after this list).
CTE - We do not currently support common table expressions (CTEs). Note: CTEs can be more complex than the simple ones used in the examples above. For example, it is possible for a CTE to include many tables, joins, unions, and sub-queries. Supporting complex CTEs that include joins, unions, or sub-queries would potentially allow users to execute system-debilitating queries. However, all of the examples above are simple and only involve a single table with aggregation. Such a simple CTE poses little to no risk. Therefore, we would like to add support for simple CTEs to runtime queries (see the second example after this list). This means a materialized view would not be needed to use a CTE.
Optimize Materialized Views - Currently, when we rebuild a materialized view (MV), we lock users out of querying it during the rebuild. For small MVs that only rebuild occasionally, this is not a problem. However, this use case will require potentially large MVs (millions of rows). If it takes 30 minutes to rebuild the MV used by a cohort builder after a file changes, it is unlikely that we want users to be unable to use the cohort builder during the rebuild. On the other hand, users are unlikely to notice if the data in the cohort builder is slightly stale. Therefore, we propose adding an optimization to MVs that would allow the existing MV to remain queryable during the rebuild process. After the rebuild is finished, the stale MV could be atomically swapped with the newly built MV (see the sketch after this list).
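To illustrate the CASE statement feature: the examples in this document use the “simple” CASE form, though a parser extension would presumably need to handle the standard “searched” form as well. The LARGE_FILES column below is a hypothetical example, not one used elsewhere in this document:

SELECT
    SUM(CASE FILE_TYPE WHEN 'raw' THEN 1 ELSE 0 END) AS RAW_FILES,  -- simple CASE, as used above
    SUM(CASE WHEN FILE_SIZE > 100 THEN 1 ELSE 0 END) AS LARGE_FILES -- searched CASE (hypothetical)
FROM MATERIAL;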
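To make the proposed CTE restriction concrete, here is a sketch of what would and would not be allowed (the tables A and B in the second query are placeholders):

-- Allowed: a simple CTE over a single table with aggregation.
WITH X AS (
    SELECT PART_ID, COUNT(FILE_ID) AS FILE_COUNT
    FROM MATERIAL
    WHERE PART_ID IS NOT NULL
    GROUP BY PART_ID
)
SELECT * FROM X WHERE FILE_COUNT > 2;

-- Not allowed under this proposal: joins, unions, or sub-queries inside the CTE.
WITH X AS (
    SELECT * FROM A JOIN B ON A.ID = B.ID
)
SELECT * FROM X;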
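As a sketch of the proposed MV optimization, assuming the MV is backed by a MySQL table, the rebuild-and-swap might look like the following (MATERIAL_NEW and MATERIAL_OLD are hypothetical names):

-- Build the new version in a shadow table while MATERIAL remains queryable.
CREATE TABLE MATERIAL_NEW LIKE MATERIAL;
-- ... repopulate MATERIAL_NEW here by re-running the MV's defining query ...

-- Atomically swap the tables; readers see either the old MV or the new one,
-- never a partially rebuilt table. RENAME TABLE is atomic across both renames.
RENAME TABLE MATERIAL TO MATERIAL_OLD,
             MATERIAL_NEW TO MATERIAL;

-- Discard the stale copy.
DROP TABLE MATERIAL_OLD;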