Multiple Value Annotations
Use cases
Full list of concrete use cases can be found in:
Summary of desired functionality :
- Count of distinct values within multi-value annotations of Entities for faceting
- Filtering/searching for Entities having specific value(s) within a multi-value annotation
- Filtering on no values (already supported:
WHERE col1 IS NULL
).
- Filtering on no values (already supported:
JSON API object
Our current API Object actually already have fields mapping from a single key to multiple values. However, most of our codebase currently ignores anything past the first value. The current Annotations object could be used, but because it is a hand-written object, there is no auto-generated documentation for the annotations object, which makes our REST API confusing for newer users.
Annotations V2
We could take this as an opportunity to redesign Annotations and introduce a V2, which fixes the awkwardness of having to define each type be in its own fields. Instead we have an object which looks like:
{ "id": "syn123" "etag": "4cf6f2db-3d30-4759-903a-d9fd0a76472c" "annotations":{ "my string key":{ "type":"STRING", "value": "my value goes here" }, "my int key":{ "type":"INTEGER", "value": "42" }, "my string list key":{ "type":"STRING_LIST", "value": ["value", "other value"] } } }
Multiple annotations in Entity Views
ColumnTypes
New ColumnTypes
will be introduced to handle multiple values
STRING_LIST INTEGER_LIST BOOLEAN_LIST DATE_LIST
In the first iteration, these will only be supported for Entity Views.
Currently, the feature works on tables as well. However, we are unable to support swapping to/from any of the aforementioned LIST ColumnTypes
because we MySQL can't be relied upon to perform data type transformations when using LISTs.
Existing ColumnType
s such as STRING
will still only evaluate the first value in an annotation list
Multi-Value Format
Multiple values are represented as JSON arrays. For Example:
["value1", "value2", "value3"] [6,7,8] [true,false,true]
Value Limits
Number of keys: 100 (check current numbers on existing annotations)
The maximum number of values for each annotation is 100.
For String annotations, there can be at most 500 characters For example:
["value1","value2","value3"]
would have a size of 28 characters
For a STRING_LIST column, maxSize restricts the size of single string values in that list, not the total size of the list.
For example, if maxSize=5, ["asdf", "qwerty] would not be allowed because "qwerty" has 6 characters.
Querying Multiple Values
To work with multiple value annotations:
Selecting for annotation containing one value:
SELECT * FROM syn123 WHERE annotation1 HAS 'val1';
Selecting for annotations containing any of multiple values (OR relationship):
SELECT * FROM syn123 WHERE annotation1 HAS 'val3' or annotation1 HAS 'val1';
Selecting for annotations containing ALL of the listed values (AND relationship):
SELECT * FROM syn123 WHERE annotation1 HAS 'val3' AND annotation1 HAS 'val1';
Finding annotations with no values set:
SELECT * FROM syn123 WHERE annotation1 IS NULL;
Finding number of values in an annotation:
SELECT LENGTH(annotation1) FROM syn123;
Facets
Facet counts for multi-value annotations will count each value in the array
Table:
synapseId | annotation1 |
---|---|
syn123 | ["val1", "val2", "val3"] |
syn456 | ["val2", "val4"] |
syn789 | NULL |
syn135 | NULL |
syn357 | ["val1", "val3", "val5", "val7"] |
Facet Counts:
annotation1 | facet count |
---|---|
val1 | 2 |
val2 | 2 |
val3 | 2 |
val4 | 1 |
val5 | 1 |
val7 | 1 |
NULL | 2 |
Implementation Details and notes about query performance
Usually, design documents would not include implementation details. However, in this case, choosing one implementation may cause us to lose the ability to implement certain features, so it is important to briefly mention the tradeoffs.
- Currently proposed Approach: The currently proposed implementation, which is to use MySQL8's JSON column type means that it is impossible to index values in the array. Hopefully, the size of the Entity Views are small enough that the lack of an index will not be a large performance hit.
- Another approach would be to use a separate table for each column containing multiple values. However, this causes the results to return one row per annotation value. The result set expands further when there are multiple multi-value columns. Performing a group-by to group the multiple values into a single column would also restrict the possible SQL functions to only those that are usable in aggregates.
- The final proposal is to use a document-store database such as MongoDB, which is optimized for semi-structured data. However, this causes us to lose many SQL features that are currently available. Following this approach would be costly in implementation time and may need to be implemented as a separate feature if not all of the features of a relational database can be provided.