Full Text Search Integration
References:
https://sagebionetworks.jira.com/browse/PLFM-6876
https://dev.mysql.com/doc/refman/8.0/en/fulltext-search.html
To support the NF research tools use case we decided to implement an integration of full text search for Synapse tables using the native MySQL full text search capabilities.
Tables Background
Each synapse table is stored in a MySQL table
Each column defined for a synapse table (A list of ColumnModel) is represented as a mapped MySQL column according to the datatype
A Synapse table has a maximum width of 64K bytes and each column counts towards this limit according to the data type (See https://rest-docs.synapse.org/rest/org/sagebionetworks/repo/model/table/ColumnType.html )
Synapse tables are eventually consistent, operations on synapse table are first submitted and saved in a permanent history log and a worker takes care of “building” a table up to the last change asynchronously
Indexes are created automatically for each column up to a limit of 60 indexes (with some optimizations in which columns take priority). MySQL has a limit of 64 indexes for a table. One is reserved for the primary key, another 2 are reserved in views
We do not create multi-column indexes. Note that MySQL limits the maximum number of column in a single index to 16
Operations supported on a Synapse table include: adding, deleting and updating rows and updating the schema of the table. Additionally it is possible to upload a CSV that translate into adding/updating rows.
Proposed Implementation
Option 1
Add a boolean flag (e.g. searchIndex) to the ColumnModel that specifies if the column should be included in the search index. This flag defaults to false and can be set to true only for the following datatypes: STRING, LINK, LARGETEXT, STRING_LIST
Make sure that a schema where a ColumnModel#searchIndex is true can be assigned only to tables (e.g. do not support views for now)
When a search column is in the schema add a default “virtual column” (a column that the user does not see) to a Synapse table named SEARCH_CONTENT of (MySQL) type LONGTEXT and an index of type FULLTEXT. The side effect is that 12 bytes are taken off the maximum width available (the actual data is stored separately in MySQL so it just requires the space of the pointer) and reduce the number of available indexes to 59.
Add a new function to the SQL language that allows to search on the indexed data, for example:
SELECT * FROM syn123 WHERE TEXT_SEARCH("input")
The TEXT_SEARCH function will translate directly into its MySQL equivalent on the SEARCH_CONTENT column:SELECT * FROM T123 WHERE MATCH(SEARCH_CONTENT) AGAINST('input')
.
Note that we can also allow the function to be used in the select so that the rank value is returned similarly to what MySQL does:SELECT TEXT_SEARCH("input") FROM syn123 WHERE TEXT_SEARCH("input")
We can support easily all the MySQL functionalities (https://dev.mysql.com/doc/refman/8.0/en/fulltext-search.html ):MATCH (SEARCH_CONTENT) AGAINST (input [search_modifier])
search_modifier: { IN NATURAL LANGUAGE MODE | IN NATURAL LANGUAGE MODE WITH QUERY EXPANSION | IN BOOLEAN MODE | WITH QUERY EXPANSION }
When the worker that builds a table appends/update a row we also populate the SEARCH_CONTENT column if necessary from the selected columns.
Deleting a row should remove it from the index automatically
Since we build the search content from concatenated values when the schema changes we might need to re-index all the current values. In particular this might be needed when a column with searchIndex set to true is dropped or when a column is modified to one where the searchIndex value is altered (adding new columns should not affect the existing data and would be handled by subsequent operations). Since the transaction in this case is anyway asynchronous and happens in a worker we can do the re-indexing during this transaction directly. This might take several minutes depending on the size of the table (we need to update each existing row).
Option 2
An alternative and much simpler solution is to simply add a new datatype to the column model such as FULLTEXT for which we create a FULLTEXT index automatically and users can then put whatever value they want and query it specifying the column similar to what MySQL does. The drawback is that similarly to the LARGTEXT type in the column model we would have to limit the size of data (524,288 characters or 2 MB of UTF-8 4 byte chars) and count it towards the width of the table (2133 bytes).
Option 3
Similar to the previous one but instead of having a dedicated datatype, have a flag in the column model for existing STRING/TEXT types such as isFullText that would imply adding the FULLTEXT index on each of the columns and enabling the MATCH AGAINST function.
Notes
Option 1:
Having a column in the table itself simplifies a lot the queries since they can be translated directly (e.g. now complicated JOIN or IN clauses).
We could treat it as an eventual consistent operation if we want to speed up the transactions but I would not start with such an implementation for now considering that the amount of expected data will be limited. If we have to support very big tables (>100K rows) we should consider using a dedicated separate search index.
Note that since the virtual SEARCH_CONTENT column is not part of the column model of a synapse table, users cannot select or update its value.
Option 2 and 3:
This might the most flexible and simple solution (e.g. we do not have to do special work in the backend), users could decide to apply some pre-processing to the data that is indexed (e.g. stemming).
This options allow the users to “tweak” the search results, e.g. adding synonyms, removing unwanted tokens etc
Since the column is part of the model it is returned in the results for a
select *
.This would potentially work automatically even with views
Design Review 09/13/2021:
Option 2 and 3 excluded, they put too much burden on the end user and we need an automatic solution
Consider a boolean at the table level rather than in the column model, this makes it more automatic for the user, in the backend we select the columns to index ourselves (with the possibility to extend it in the future so the user can customize which columns to index).
Add a specific parameter to the search query model on top of supporting a new SQL function to make it easier for the clients.