Document toolboxDocument toolbox

Multiple Value Annotations

Use cases

Full list of concrete use cases can be found in:

Summary of desired functionality :

  1. Count of distinct values within multi-value annotations of Entities for faceting
  2. Filtering/searching for Entities having specific value(s) within a multi-value annotation
    1. Filtering on no values (already supported: WHERE col1 IS NULL).

JSON API object

Our current API Object actually already have fields mapping from a single key to multiple values. However, most of our codebase currently ignores anything past the first value. The current Annotations object could be used, but because it is a hand-written object, there is no auto-generated documentation for the annotations object, which makes our REST API confusing for newer users.

Annotations V2

We could take this as an opportunity to redesign Annotations and introduce a V2, which fixes the awkwardness of having to define each type be in its own fields. Instead we have an object which looks like:

{
	"id": "syn123"
	"etag": "4cf6f2db-3d30-4759-903a-d9fd0a76472c"
	"annotations":{
		"my string key":{
			"type":"STRING",
			"value": "my value goes here"
		},
		"my int key":{
			"type":"INTEGER",
			"value": "42"
		},
		"my string list key":{
			"type":"STRING_LIST",
			"value": ["value", "other value"]
		}

	}

}

Multiple annotations in Entity Views

ColumnTypes

New ColumnTypes will be introduced to handle multiple values

STRING_LIST
INTEGER_LIST
BOOLEAN_LIST
DATE_LIST

In the first iteration, these will only be supported for Entity Views.

Currently, the feature works on tables as well. However, we are unable to support swapping to/from any of the aforementioned LIST ColumnTypes because we MySQL can't be relied upon to perform data type transformations when using LISTs.

Existing ColumnTypes such as STRING will still only evaluate the first value in an annotation list 

Multi-Value Format

Multiple values are represented as JSON arrays. For Example:

["value1", "value2", "value3"]
[6,7,8]
[true,false,true]

Value Limits

Number of keys: 100 (check current numbers on existing annotations)


The maximum number of values for each annotation is 100.


For String annotations, there can be at most 500 characters For example:

["value1","value2","value3"] would have a size of 28 characters


For a STRING_LIST column, maxSize restricts the size of single string values in that list, not the total size of the list.

For example, if maxSize=5, ["asdf", "qwerty] would not be allowed because "qwerty" has 6 characters.


Querying Multiple Values

To work with multiple value annotations:


Selecting for annotation containing one value:

SELECT * FROM syn123 WHERE annotation1 HAS 'val1';

Selecting for annotations containing any of multiple values (OR relationship):

SELECT * FROM syn123 WHERE annotation1 HAS 'val3' or annotation1 HAS 'val1';

Selecting for annotations containing ALL of the listed values (AND relationship):

SELECT * FROM syn123 WHERE annotation1 HAS 'val3' AND annotation1 HAS 'val1';


Finding annotations with no values set:

SELECT * FROM syn123 WHERE annotation1 IS NULL;

Finding number of values in an annotation:

SELECT LENGTH(annotation1) FROM syn123;

Facets

Facet counts for multi-value annotations will count each value in the array

Table:

synapseIdannotation1
syn123["val1", "val2", "val3"]
syn456["val2", "val4"]
syn789NULL
syn135NULL
syn357["val1", "val3", "val5", "val7"]

Facet Counts:

annotation1facet count
val12
val22
val32
val41
val51
val71
NULL2

Implementation Details and notes about query performance

Usually, design documents would not include implementation details. However, in this case, choosing one implementation may cause us to lose the ability to implement certain features, so it is important to briefly mention the tradeoffs.

  1. Currently proposed Approach: The currently proposed implementation, which is to use MySQL8's JSON column type means that it is impossible to index values in the array. Hopefully, the size of the Entity Views are small enough that the lack of an index will not be a large performance hit.
  2. Another approach would be to use a separate table for each column containing multiple values. However, this causes the results to return one row per annotation value. The result set expands further when there are multiple multi-value columns. Performing a group-by to group the multiple values into a single column would also restrict the possible SQL functions to only those that are usable in aggregates.
  3. The final proposal is to use a document-store database such as MongoDB, which is optimized for semi-structured data. However, this causes us to lose many SQL features that are currently available. Following this approach would be costly in implementation time and may need to be implemented as a separate feature if not all of the features of a relational database can be provided.