Facet Support for JSON Column Type
Introduction
One of the main features for the new https://sagebionetworks.jira.com/wiki/spaces/PD/pages/2730721508 is the ability to combine multiple columns into a single column. The following example will be used to illustrate this use case. One example of the Cohort Builder will be for a longevity study that involves participants over the age of 89. Since it would be theoretically possible to use age to re-identify a participant over the age 89, we need to be careful with how we expose participant age data. One way to solve the re-identification problem is provide an age range for each participant, instead of their actual age. This means the age data in the participant table would have two parts: a lower bounds and an upper bounds. Technically, the lower and upper bounds are two separate columns, but it is convenient for the user to think of them as two part of a single age column.
We recently added support for a new JSON column type that allows us to represent multiple columns in a single column as a JSON string. We also added an entire suite of JSON SQL functions that can be used to filter table involving JSON columns. However, most users do not want to read or write SQL. Instead, they would like UI controls that help them navigate the data. Synapse support this type of navigation using faceted column definitions.
Facets
Let us start with the definition of a Synapse Facet: Any ColumnModel that has been assigned a valid FacetType is considered a faceted column. Each faceted columns will receive special treatment in the Synapse UI or custom Portal. Specifically, each faceted columns will have its own UI control in the left-hand-side panel. Each facet’s UI control will show vital statistics about that column, and provide a mechanism to apply a filter on that column without the need to read or write SQL. The Columns assigned FacetType determines what statistics are shown and what type of control is used.
Synapse currently supports two FacetTypes:
enumeration - Identifies a column where each value is one of a possible selection of values. For example, patient table might have a column named ‘eyeColor’ with six possible values: amber, blue, brown, gray, green, or hazel. The UI control will include a check-box for each color and a count of the number of rows matching that color.
range - Identifies a numeric column where each value is a number. For example, a treatment table might have a column named ‘score’ that contains a numeric score of test given to a participant. The UI control will be a slider with two control points. Lower control point shows the lowest score (
min(score)
) , while the upper control point shows the highest score (max(score)
).
The check-boxes and sliders provide a user experience that is similar the product navigation of most on-line shopping sites.
There are two special features for Faceted columns in the Synapse query API POST /entity/{id}/table/query/async/start request object: QueryBundleRequest:
partMask contains 0x20 - When true, the API will include statistics for each Faceted column in the resulting QueryResultBundle (see: QueryResultBundle.facets ARRAY<FacetColumnResult>)
query.selectedFacets - Allows for one or more “facet selection” filter to be applied to the query. The type of the selectedFacets is determined by the column’s facetType. Specifically, selections for enumerations will be of type: FacetColumnValuesRequest while selections for ranges will be of type: FacetColumnRangeRequest
In the next section we we describe the type of data that can be in JSON column which will aid in the discussion of how to provide faceted navigation of such data.
ColumnType: JSON
JSON data can be composed of two basic types types:
JSONObject - Composed of one or more key-value-pairs.
JSONArray - Composed of an array of comma separated values.
The values of either the JSONObjects or JSONArrays can either be a primitives (strings, integers, doubles, Boolean, NULL, …) or hierarchical (values that are either JSONObjects or JSONArrays). We do not currently have any use cases for hierarchical values. Therefore, this design will be limited to primitive values only.
Note: We might want to consider hierarchical values in the future with an integration of JSON Schemas to define the hierarchy.
We will setup a example data that contains both types of JSON data:
id | json_array | json_object |
---|---|---|
1 | [1, 2, 3, 4, 5, 6] | {"a": "one", "b": true, "c": 111} |
2 | [2, 4, 6] | {"a": "two", "b": false, "c": 113} |
3 | [1, 3, 5] | {"a": "three", "c": 114} |
4 | NULL | {"a": "three", "b": true, "c": 115} |
5 | [2, 3, 4, 5] | {"a": "two", "b": true, "c": 116} |
6 | [1, 2, 3] | {"a": "two", "b": false, "c": 117} |
7 | [4, 5, 6] | NULL |
The json_array column is an example of JSONArray where each value is a primitive (integer). The json_object column is an example of a JSONObjects. You can think of the json_object column as a column that contains three sub-columns: “a” of type string, “b” of type Boolean, and “c” of type integer. Note: All of the values of both JSON columns are primitives (integers, strings, & Booleans). JSONArrays are contained within square brackets [ ]
while the JSONObjects are contained within curly brackets { }
.
JSONArray
We already support columns of types list (STRING_LIST, INTEGER_LIST, BOOLEAN_LIST…) in Synapse. In fact, each list columns is actually implemented as a MySQL JSON column (basically a JSONArray). However, our current list implementation also requires a separate “index” table that is used for both filtering and for the unnest
function. Maintaining a separate “index“ table dramatically increases the complexity of the Synapse list feature.
One of the main use cases for the new cohort builder was the need to support faceted navigation of aggregation results. We designed the VirtualTable feature to meet this use case. The core of a VirtualTable is its defining SQL, which allows the results of each aggregation function to be cast as a faceted ColumModel. We have been able to add support for most ColumnTypes in VirtualTable, but we could not find a way to cleanly support lists. The problem is, the list implementation depends on the separate “index“ table which is incompatible with CTEs.
Do we really needed a separate “index” table to make lists work? To answer this question, we did some prototyping and captured the results in the Appendix. The results suggest that if we use a MySQL muti-value-index, we can actually get better query performance without using a separate “index“ table.
To better understand the need for JSONArrays/Tables in the cohort builder use case, let’s use a simple example. This example shows how individual values can be aggregated into JSONArrays:
DROP TABLE IF EXISTS material;
CREATE TABLE material (
id BIGINT NOT NULL,
a_string CHAR(50) NOT NULL
);
INSERT INTO material VALUES
(1, 'a'),
(1, 'b'),
(1, 'c'),
(2, 'b'),
(2, 'c'),
(3, 'c');
select id, JSON_ARRAYAGG(a_string) as alist from material group by id;
id | alist |
---|---|
1 | ["a", "b", "c"] |
2 | ["b", "c"] |
3 | ["c"] |
Note: The aList
column is composed of non-list data that is aggregated into JSONArrays. The cohort builder use case has this type of aggregation, and it also requires that the aggregated columns be treated as an enumeration facet.
So, how would we create a VirtualTable using the the SQL from line:15 such that the the resulting aList
column is faceted?
One way to do this would be to cast the results to the following ColumnModel:
{
"id":"222"
"name":"aList",
"columnType":"JSON",
"facetType":"enumeration"
}
select id, cast(JSON_ARRAYAGG(a_string) as 222) from material group by id;
While this approach might work for VirtualTables, it has a few problems:
Since ColumnModel 222 is not a list, other list features like
has
,has_like
, andunnest
will not work. Instead, users would be required to use the more complex JSON_OVERLAP, JSON_SEARCH, and JSON_TABLE.ColumnModel 222 does not have enough information to create a multi-value index. We would need the type (CHAR) and the size. This would be more of an issue for MaterializedView than a VirtualTable.
A better alternatives might be to cast the results to a list column, such as:
This would enable all existing list operations for this column, plus facet support. The list column (111) also contains enough information to correctly add a multi-value index for the MaterializedView use case. Therefore, our suggestion is to recommend users cast to list columns instead of JSON columns for this type of use case. We might even want to disallow setting facetType on columns of type JSON.
Note: This assumes we are able to remove the “index” table used for lists. This work is tracked in: PLFM-7968.
JSONObect
The main cohort builder use case includes the need to combine columns into a single column. The above json_object column is an example of this type of column combination. It contains three columns: a, b, & c. Therefore, we propose to represent this column by extending ColumnModel to include a new recursive property name “jsonSubColumns”.
The following is an example of how we would use the recursive subColumns to fully define our example json_object column model:
This allows us to declare that the json_object column contains two faceted column (a & c) and one non-faceted column (b). This would require that we extends the ColumnModel definition to capture the new path and sub-columns: PLFM-7972.
Filtering
In order to filter our example json_object column we will need the jsonPath to identify which sub-column to filter on. The path always starts with ‘$' which represents the “root” of the JSONObject. So to filter on sub-column ‘a’ we would use a path of '$.a’. The following example shows how to find all rows where “a“ = “three“:
This query returns two rows with ids: 3 & 4.
Facet Selection
Since the json_object column contains two faceted sub-columns, it is possible to apply a facet selection to both columns at the same time. For example, if the user selects the “a.two” checkbox while at the same time sets the “c” slider to be between 115 and 116, the following shows how we would represent their selection in the query bundle:
This would then translate to:
Returning one row with the id of 4.
Note: We will need to extend the FacetColumnRequest to include 'jsonApath':
We will also need to extends the facet selection translator to generate the above SQL from the facet requests. This work is tracked in PLFM-7973.
Facet Statistics
Since the json_object column contains two faceted sub-columns, we will need to generate statistics for both sub-columns when the user’s query request includes partsMask=0x20. Again we can use the jsonPath to identify the stats for each sub-column:
Note: We will need to extend the FacetColumnResults to include the jsonPath:
We will also need to extend the facet statistics system to generate the stats for faceted JSON sub-column. This work is tracked in: PLFM-7974.
Appendix - Performance Considerations
Filter with JSON_OVERLAPS without Index
Each Synapse list column is implemented in MySQL as a JSON column. With a JSON column that contains JSONArray data, MySQL provides a special function called JSON_OVERLAPS()
to filter rows based on the contents of the JSONArray. The following real life example shows how this function can be used to filter rows that have ‘ROSMAP' as a value anywhere in the row’s JSONArray:
Note: The above example does not have an index on the JSON column (_C123848_
).
Filter with JSON_OVERLAPS with Index
MySQL now supports a multi-value index for JSONArrays. In order to test the use of this new index we ran to the following DDL to create a copy of the data in a new table with a multi-value index:
The following SQL was use to test the multi-value index:
Filter with Indexed Sub-query
Currently, for columns of type list, we do not use the JSON_OVERLAPS
function to filter rows. Instead we create a new "index" table that maps each list value to its row. When a user includes Synapse has()
function in the where clause of a table with a list column, we use the corresponding index to execute the query. The following is an example of how the same ‘ROSMAP’ filter would be applied using the index table:
First the rows_id that match the values are found using a sub-query, that then feeds into an in
clause against the main table. The main idea behind this implementation is that we can apply a database index to the value column of the index table, which should eliminate a full table scan. Does adding an extra index table buy us better performance and therefore justify the added complexity?
Which is Faster?
In order to determine if the Index Sub-query is truly superior to the JSON_OVERLAPS function, we ran a query to find some real table in Synapse that include list columns, using the following query:
Most of the tables we inspected had less than 1k rows so they were ignored. However, we found one table T11346063
with 212K rows. We then ran the following test queries against this fairly large table:
Query Strategy | SQL | query time (MS) | Query Plan | Query Cost |
---|---|---|---|---|
JSON_OVERPLAPS without index | SELECT count(*) FROM prod462.T11346063 WHERE JSON_OVERLAPS(C190401, '["HBTRC"]'); | 547 |
| 23K |
JSON_OVERPLAPS with multi-value index | SELECT COUNT(*) FROM prod463.ARRAY_EXAMPLE WHERE JSON_OVERLAPS(stringArray, '["HBTRC"]'); | 94 |
| 702 |
Index Sub-query | SELECT count(*) FROM prod462.T11346063 WHERE ROW_ID IN ( SELECT ROW_ID_REF_C190401_ FROM prod462.T11346063_INDEX_C190401_ WHERE _C190401__UNNEST IN ('HBTRC') ); | 93 |
| 881 |
The above results show that using the JSON_OVERLAPS without a multi-value index is the slowest strategy. While both the JSON_OVERLAPS with index and Index Sub-query are almost equivalent. This suggest that the added complexity of the indexed sub-query is not warranted.