Table of Contents | ||||
---|---|---|---|---|
|
Introduction
The purpose of this document is to provide options to potentially enable full text search in Synapse Tables. The use case comes from the NF resources portal currently under development where data about NF resources would be stored in Synapse tables along with natural language description of various kind of resources.
...
Feature | CloudSearch | AWS ElasticSearch | MySQL FTS | Notes | |||||||
---|---|---|---|---|---|---|---|---|---|---|---|
Schema Type | Fixed, partial support for schema-less with dynamic fields (capture all) | Fixed or Schema-less (Dynamic mapping) | Fixed, index needs to specify all the columns included in the search. The query needs to specify all the columns in the index. | A schema-less approach allows to index data whose structure is unknown, this might not be needed for table as by design we know the structure of the data. | |||||||
Stemming | Yes | Yes | No - Require pre-processing from the application side. Boolean query mode allows wildcard expressions. | When indexing the tokens can be usually reduced to a word stem before indexing, this allows more flexibility when matching against similar terms (e.g. search for database might match documents containing databases) | |||||||
Fuzzy search | Yes | Yes | No - Can potentially be implemented using soundex in a pre-processing step but it’s very fragile | Fuzzy search can be useful in some cases for minor misspellings | |||||||
Field boosting | Yes (at query time, but this might not be relevant due to the limitation on the amount of fields in a domain) | Yes (both query and schema time, this would work in a one index per table scenario) | No - Not sure what a work around would look like. | This is useful when specific columns are more relevant than others (e.g. a match in the title might be more meaningful than a match in a description). | Multiple indexes | No | On the other end when there are too many fields or unknown fields to index this might not be relevant as a we might choose to concatenate all the fields into a single one. | ||||
Multiple indexes | No | Yes | Yes - Each synapse table has its own DB table) | In the synapse tables context it is relevant to have the possibility to create an index per table given that each table might have a different schema and it might be more meaningful to reflect the schema in the search index. | |||||||
Auto-complete | Yes (Suggester API) | Yes (through suggesters, various options) | No | This is a feature that provides suggestions, useful for auto-complete (e.g. while you type) | |||||||
Did-you-mean | Partial? - Maybe the suggester can be used or fuzzy search | Yes (through suggesters) | No | This is a feature that provides potential suggestions after the search (e.g. misspellings) | |||||||
Highlighting | Yes | Yes | No | ||||||||
Facets | Yes | Yes | Partial - This is already supported for Synapse tables as a custom implementation | This might not be relevant as Synapse table already implement faceting. | Arrays | Yes | Yes | No - (But we wouldn’t be able to use it for tables due to the limitation on the number of fields in a domain and given the sheer amount of columns in tables) | Yes (But it might not be a good idea to use it, given that we would have to re-implement the whole faceting on top of elastic search) | Partial - This is already supported for Synapse tables as a custom implementation | This might not be relevant as Synapse table already implement faceting. |
Arrays | Yes | Yes | No - Not natively but could be probably worked around | This might be needed for multi-value columns | |||||||
Custom Synonyms | Yes (Index time) | Yes (Index or query time) | No - This could be implemented at the application level, or simply adding synonyms into the data. | This feature can be useful to complement stemming or fuzzy search. Expanding the index/query with similar term might yield better results. | |||||||
Custom Stop words | Yes (global) | Yes (Index) | Yes (global) | ||||||||
Dedicated Java Client | Yes | No, currently re-use the client provided by Elastic that broke on purpose the compatibility with non-elastic distributions for newer version. There are plans for releasing forks that will maintain compatibility. | Yes, JDBC | ||||||||
Maintenance and scalability | Managed, auto-scale | Managed, tuning suggestions | Managed RDS | ||||||||
Synapse Tables Integration Effort | High | High | Medium | ||||||||
Additional Costs | Yes, per cluster per instance type/hour. Plus amount of data in batches sent to index. | Yes per instance type/hour. Plus size of data. | No | Elasticsearch might Recommended from AWS are at least 3 master nodes plus data nodes. | No | Elasticsearch might turn out to be cheaper than CloudSearch since the instances are priced lowered and we do not pay for sending batches to index . Setting but setting up the cluster with the right sizing can be complex with Elasticsearch and to without knowing the amount of data and type of tables beforehand. To ensure availability it can be more become extremely expensive (e.g. dedicated master nodes, multiple availability zones and replicas). |
...
MySQL FTS is a (very) limited product that has the advantage of being potentially integrated with the current query engine. For example faceting might work out of the box without additional work. Since it would be part of the table indexes there is potentially little development overhead in adding it to tables including when changing the schema. There is some performance overhead added due to text processing during the transactions. The main concern is the limited options and flexibility, for example if we have a multiple columns to search we either create a separate index for each column and run an OR and somehow re-compute the ranking or all the columns need to be added to the index and all of them need to be specified in the search query (e.g. WHERE MATCH (title,body) AGAINST ('database' IN NATURAL LANGUAGE MODE)
, note that both title and body needs to be specified for an FTS index on <title, body>). Of course we can pre-process the data and come up with ways to have meaningful indexing but it’s additional development time (but we might not need to implement everything right away).
Using an external search index such as Elasticsearch adds additional complexity in maintaining the index in sync, especially regarding schema changes (where the whole index might need to be rebuilt). We would need infrastructure to either periodically get the changed rows (this would mean adding a timestamp to each row, and potentially a deleted flag) or to send data and changes in a streaming fashion in order (integrations such as kinesis seem to support additions but no deletions). The biggest integration pain would be faceting and filtering and potentially supporting the existing API. While both AWS Elasticsearch and CloudSearch support faceting, in order to use it we would have to replicate everything that is already done for the table query engine and make it somehow compatible with the current search results (The input search API could be different and not integrated with the SQL queries).
From the design it seems clear that they would want an integration with faceting and filtering, this might prove to be a huge task (basically it would could replace the current query engine). An idea might be to index only relevant columns in elastic search, at query time we first run a search and limit to a small subset of results (e.g. 1000, this makes sense when searching for relevant documents) and apply the normal filters and faceting in the MySQL database on the subset of ids (maintaining the order).
...
Pro | Con | |
---|---|---|
MySQL Full Text Search |
|
|
AWS Elasticsearch |
|
|
CloudSearch |
|
|
Implementation Options
Given the above and the search experiment (See Sample Data) it’s clear that an elastic search cluster would be the most flexible and powerful option especially if setup with a one table per index configuration. However an integration with the Synapse backend would become a sizable project that might not be worth and a simpler implementation using a MySQL full text index might be ok for now. Also considering that in its default configuration elasticsearch is not much different than MySQL, the advantages come in when the index and queries can be tuned according to the use case.
In the following we list some potential implementations for a potential integration:
MySQL Full Text Index:
A table can be configured to support full text search, the user can select the columns to index
Given the limits on indexes with MySQL a special column (potentially in a special table) contains the concatenated text of the selected columns
Provide a special construct in the SQL language
In some cases re-indexing of the whole data will be needed (e.g. if we drop/add a column since we concatenate the text)
Can be extended in the future with some more advanced features (e.g. we can setup stemming etc) as needed, we could explore Apache Lucene (Elasticsearch is based on it) or OpenNLP
Synapse Elasticsearch Cluster with one index-per-table configuration
Setup an elasticsearch cluster with the Synapse stack
A table can be configured to be indexed with a one index per table configuration
Provide the users with a special search API, or integrate with the SQL language
Limit the search results to a maximum value (e.g. 1000) and integrate the results with the faceted results
In some cases a full re-indexing might be necessary but with the one index per table configuration we might be able to run modify-by-query operations instead
We can extend it to allow the users to setup the index configuration as they see fit (e.g. custom analyzers, stop words, suggesters etc)
We can extend it to setup a default query “template” and let the users customize it per table
Synapse Elasticsearch Cluster with single index configuration (can be turned into multiple indexes for scalability)
Similar as 2. but we construct a special document with the concatenated values instead and the table id as a field
Full re-indexing might be necessary for some operations (add/drop column)
No customization for the index or query
We need to be careful not to leak information (e.g. suggestion APIs etc might leak data from different tables)
External Elasticsearch Cluster with one index-per-table configuration
Similar to 2. but the user specifies the cluster endpoint (e.g. similar to what we do for custom storage locations) and we verify ownership. The cluster setup would need to provide access to the Synapse account for a specific index.
Synapse handles the index synchronization, there might be some issues with the fact that tables are actually rebuilt every stack (e.g. the users might not be happy with us rebuilding indexes every week)