
Table Search Options

 

Introduction

The purpose of this document is to present options to potentially enable full text search in Synapse Tables. The use case comes from the NF resources portal currently under development, where data about NF resources would be stored in Synapse tables along with natural language descriptions of various kinds of resources.

Overview of the Project:

User Stories:

Internal Meeting Logs:

Initial Design Wireframes: https://www.figma.com/file/aSr8SUVMX0EETgF3Qxvyrw/NF-Resource-Database?node-id=343%3A18199

Slack Channel: https://app.slack.com/client/T092YT0LV/C01RKMZQBKL

 

TL;DR: Elasticsearch is the preferred option over CloudSearch. A MySQL full text index is a (very) limited option but easier to integrate.

 

From the project overview we extract the following (note that it is not part of the deliverables nor of the acceptance criteria):

Implementation of a search index on the data tables that supports search features like relevancy ranking, root word extraction, multiple keyword search;

It highlights the need for a full text search feature that performs some text processing for indexing and provides results sorted by some form of relevance ranking.

From the user stories:

search on a single term that allows for searching / matching across any field within any resource type (Plasmid, Antibody, etc.) without having to specify which type of resource I'm looking for

 

search on a single term as above, but have accurate results if there are minor misspellings

 

search using autocomplete

The Synapse tables feature already supports searching, filtering and faceting results using an SQL-like language, and a result similar to the above can be achieved using the LIKE (and HAS_LIKE for multi-value columns) operator. The main difference is that Synapse tables do not perform any kind of text processing before indexing, nor do they include any form of relevance ranking. For example, searching for a phrase using a LIKE expression won't apply any tokenization or stemming to the input, potentially limiting the relevant results.
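For reference, a minimal sketch of what the current LIKE-based approach looks like through the Synapse Python client (the table ID and column names below are made up for illustration):

```python
import synapseclient

# Minimal sketch of a LIKE-based search against a Synapse table.
# The table ID (syn00000000) and column names are hypothetical.
syn = synapseclient.Synapse()
syn.login()  # assumes locally cached credentials

# The pattern is matched as-is: no tokenization, stemming or relevance ranking.
results = syn.tableQuery(
    "SELECT * FROM syn00000000 "
    "WHERE \"description\" LIKE '%plasmid%' OR \"resourceName\" LIKE '%plasmid%'"
)
df = results.asDataFrame()
```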

For this use case we assume that there will be real and meaningful natural language descriptions in the table data; when tabular data is mostly key/value pairs with common metadata, the filtering provided by a relational database is more effective than a full text search index, given that there might not be enough context to compute a relevance score for documents. Note that while it's relatively easy to build a search index, tuning the index is at the core of any meaningful search result, and users might want to provide additional information to tweak the results for a particular table (e.g. relative relevance of columns, synonyms, etc.).

Available Options

To enable such a feature we have several options. Generally, a search index is maintained as a separate and eventually consistent component that complements the source of truth; this is already the case for Synapse tables: the data is stored in S3, but the tables are eventually built in a dedicated relational DB against which we run the user queries.

We consider technologies that are readily available to us and that live within the AWS ecosystem, so that they can be integrated with the Synapse backend and maintained more easily.

  1. CloudSearch: A managed AWS service based on Solr that we already use in Synapse to index entity metadata and wiki page content. Note that, while still supported, the last meaningful update (dynamic fields) dates back to 2014, and the product receives only sporadic updates, mainly to support new instance types. It is not even included in the AWS pricing calculator (which redirects to the Elasticsearch offering).

  2. AWS Elasticsearch: The managed AWS offering of the popular Elasticsearch product from Elastic NV, based on Lucene. AWS maintains an open source fork (https://aws.amazon.com/blogs/opensource/stepping-up-for-a-truly-open-source-elasticsearch) after Elastic changed the license from Apache to their own, and will rename the service OpenSearch.

  3. MySQL Full Text Search: Since Synapse tables are built in a MySQL database we also consider the native full text search index capabilities offered by MySQL.

  4. Kendra: A new offering from AWS that supports indexing and searching unstructured data using ML. We exclude this offering because the technology is designed around natural language queries (e.g. what is…?, who is…?, where is…?) and because of its cost (a minimum of about $800/month for a developer edition).

Comparison

We compare some of the relevant features of the various options (excluding Kendra):

Schema Type
  • CloudSearch: Fixed; partial support for schema-less via dynamic fields (capture all).
  • AWS ElasticSearch: Fixed or schema-less (dynamic mapping).
  • MySQL FTS: Fixed; the index needs to specify all the columns included in the search and the query needs to specify all the columns in the index.
  • Notes: A schema-less approach allows indexing data whose structure is unknown; this might not be needed for tables as by design we know the structure of the data.

Stemming
  • CloudSearch: Yes
  • AWS ElasticSearch: Yes
  • MySQL FTS: No; requires pre-processing on the application side. Boolean query mode allows wildcard expressions.
  • Notes: Tokens can usually be reduced to a word stem before indexing, which allows more flexibility when matching similar terms (e.g. a search for database might match documents containing databases).

Fuzzy search
  • CloudSearch: Yes
  • AWS ElasticSearch: Yes
  • MySQL FTS: No; could potentially be implemented using soundex in a pre-processing step, but that is very fragile.
  • Notes: Fuzzy search can be useful in some cases for minor misspellings.

Field boosting
  • CloudSearch: Yes (at query time, but this might not be relevant due to the limitation on the number of fields in a domain).
  • AWS ElasticSearch: Yes (at both query and schema time; this would work in a one-index-per-table scenario).
  • MySQL FTS: No; not sure what a workaround would look like.
  • Notes: This is useful when specific columns are more relevant than others (e.g. a match in the title might be more meaningful than a match in a description). On the other hand, when there are too many fields or unknown fields to index this might not be relevant, as we might choose to concatenate all the fields into a single one.

Multiple indexes
  • CloudSearch: No
  • AWS ElasticSearch: Yes
  • MySQL FTS: Yes; each Synapse table has its own DB table.
  • Notes: In the Synapse tables context it is relevant to be able to create an index per table, given that each table might have a different schema and it might be more meaningful to reflect the schema in the search index.

Auto-complete
  • CloudSearch: Yes (Suggester API)
  • AWS ElasticSearch: Yes (through suggesters, various options)
  • MySQL FTS: No
  • Notes: This is a feature that provides suggestions, useful for auto-complete (e.g. while you type).

Did-you-mean
  • CloudSearch: Partial? Maybe the suggester or fuzzy search can be used.
  • AWS ElasticSearch: Yes (through suggesters)
  • MySQL FTS: No
  • Notes: This is a feature that provides potential suggestions after the search (e.g. for misspellings).

Highlighting
  • CloudSearch: Yes
  • AWS ElasticSearch: Yes
  • MySQL FTS: No

Facets
  • CloudSearch: Yes (but we wouldn't be able to use it for tables due to the limitation on the number of fields in a domain and the sheer number of columns in tables).
  • AWS ElasticSearch: Yes (but it might not be a good idea to use it, given that we would have to re-implement the whole faceting on top of Elasticsearch).
  • MySQL FTS: Partial; this is already supported for Synapse tables as a custom implementation.
  • Notes: This might not be relevant as Synapse tables already implement faceting.

Arrays
  • CloudSearch: Yes
  • AWS ElasticSearch: Yes
  • MySQL FTS: No; not natively, but it could probably be worked around.
  • Notes: This might be needed for multi-value columns.

Custom Synonyms
  • CloudSearch: Yes (index time)
  • AWS ElasticSearch: Yes (index or query time)
  • MySQL FTS: No; this could be implemented at the application level, or simply by adding synonyms into the data.
  • Notes: This feature can be useful to complement stemming or fuzzy search; expanding the index/query with similar terms might yield better results.

Custom Stop words
  • CloudSearch: Yes (global)
  • AWS ElasticSearch: Yes (index)
  • MySQL FTS: Yes (global)

Dedicated Java Client
  • CloudSearch: Yes
  • AWS ElasticSearch: No; we currently re-use the client provided by Elastic, which deliberately broke compatibility with non-Elastic distributions in newer versions. There are plans to release forks that will maintain compatibility.
  • MySQL FTS: Yes (JDBC)

Maintenance and scalability
  • CloudSearch: Managed, auto-scales
  • AWS ElasticSearch: Managed, tuning suggestions
  • MySQL FTS: Managed RDS

Synapse Tables Integration Effort
  • CloudSearch: High
  • AWS ElasticSearch: High
  • MySQL FTS: Medium

Additional Costs
  • CloudSearch: Yes, per cluster per instance type/hour, plus the amount of data in batches sent to the index.
  • AWS ElasticSearch: Yes, per instance type/hour, plus the size of the data. AWS recommends at least 3 master nodes plus data nodes.
  • MySQL FTS: No
  • Notes: Elasticsearch might turn out to be cheaper than CloudSearch since the instances are priced lower and we do not pay for sending batches to the index, but setting up the cluster with the right sizing can be complex without knowing the amount of data and the type of tables beforehand. To ensure availability it can become extremely expensive (e.g. dedicated master nodes, multiple availability zones and replicas).

Integration Notes

Clearly the most flexible product is the AWS Elasticsearch offering, mainly because, while CloudSearch is a good search index product, it does not support multiple indexes per cluster (we would need one domain per table). We could somehow make use of dynamic fields to capture all the fields of a given type, or try to index a table row as a single document field, but we would sacrifice other features (such as field boosting).

Given that AWS Elasticsearch supports multiple indexes per cluster, we could maintain one index per table; this would eliminate the need for dynamic mapping since we know the schema beforehand. Additionally, while CloudSearch and Elasticsearch are very similar products, the latter is open source (at least in the AWS offering) and offers many more features and much more flexibility, so it wouldn't make much sense to go with the former; I would exclude CloudSearch altogether, also given that it seems to be in maintenance mode.
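As a rough illustration of the one-index-per-table idea, here is a sketch against the Elasticsearch REST API; the endpoint, index name and column names are hypothetical, and the request signing required by an AWS domain is omitted:

```python
import requests

# Hypothetical AWS Elasticsearch endpoint and one index per Synapse table.
ES_ENDPOINT = "https://search-synapse-tables.us-east-1.es.amazonaws.com"
INDEX = "table_syn00000000"

# Explicit mapping derived from the known table schema: no dynamic mapping needed.
requests.put(f"{ES_ENDPOINT}/{INDEX}", json={
    "mappings": {
        "properties": {
            "resourceName": {"type": "text"},
            "description": {"type": "text"},
            "resourceType": {"type": "keyword"},
        }
    }
})

# Single-term search across the indexed columns, boosting the name column and
# tolerating minor misspellings through fuzziness.
response = requests.post(f"{ES_ENDPOINT}/{INDEX}/_search", json={
    "query": {
        "multi_match": {
            "query": "plasmd",  # intentionally misspelled
            "fields": ["resourceName^3", "description"],
            "fuzziness": "AUTO",
        }
    }
})
print(response.json()["hits"]["hits"])
```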

MySQL FTS is a (very) limited product that has the advantage of potentially being integrated with the current query engine; for example, faceting might work out of the box without additional work. Since it would be part of the table indexes there is potentially little development overhead in adding it to tables, including when changing the schema. There is some performance overhead added due to the text processing during transactions. The main concern is the limited options and flexibility: for example, if we have multiple columns to search we either create a separate index for each column, run an OR and somehow re-compute the ranking, or all the columns need to be added to a single index and all of them need to be specified in the search query (e.g. WHERE MATCH (title, body) AGAINST ('database' IN NATURAL LANGUAGE MODE); note that both title and body need to be specified for an FTS index on <title, body>). Of course we can pre-process the data and come up with ways to have meaningful indexing, but that is additional development time (though we might not need to implement everything right away).
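To make the MySQL constraint concrete, a sketch of the index and queries involved (the table and column names are illustrative, and the connection parameters are placeholders):

```python
import mysql.connector

# Placeholder connection parameters; the table (T123) and columns are illustrative.
conn = mysql.connector.connect(host="localhost", user="synapse", password="...", database="tables")
cursor = conn.cursor()

# A FULLTEXT index over <title, body>.
cursor.execute("ALTER TABLE T123 ADD FULLTEXT INDEX ft_title_body (title, body)")

# OK: the MATCH() column list is exactly the column list of the index.
cursor.execute(
    "SELECT ROW_ID, MATCH(title, body) AGAINST ('database' IN NATURAL LANGUAGE MODE) AS score "
    "FROM T123 WHERE MATCH(title, body) AGAINST ('database' IN NATURAL LANGUAGE MODE)"
)

# Not OK: MATCH(title) alone cannot use ft_title_body; searching a single column
# would require its own dedicated FULLTEXT index.
```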

Using an external search index such as Elasticsearch adds complexity in keeping the index in sync, especially regarding schema changes (where the whole index might need to be rebuilt). We would need infrastructure to either periodically fetch the changed rows (this would mean adding a timestamp to each row, and potentially a deleted flag) or to send data and changes, in order, in a streaming fashion (integrations such as Kinesis seem to support additions but not deletions). The biggest integration pain would be faceting and filtering, and potentially supporting the existing API. While both AWS Elasticsearch and CloudSearch support faceting, in order to use it we would have to replicate everything that is already done for the table query engine and somehow make it compatible with the current search results (the input search API could be different and not integrated with the SQL queries).
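A sketch of the polling flavour of that synchronization, assuming each row carries a hypothetical UPDATED_ON timestamp and IS_DELETED flag (endpoint, index, table and column names are also hypothetical):

```python
import mysql.connector
import requests

ES_ENDPOINT = "https://search-synapse-tables.us-east-1.es.amazonaws.com"  # hypothetical

conn = mysql.connector.connect(host="localhost", user="synapse", password="...", database="tables")
cursor = conn.cursor()

# Fetch rows changed since the last synchronization (UPDATED_ON and IS_DELETED
# are hypothetical bookkeeping columns that would have to be added to each row).
last_sync = "2021-05-01 00:00:00"
cursor.execute(
    "SELECT ROW_ID, resourceName, description, IS_DELETED "
    "FROM T123 WHERE UPDATED_ON > %s",
    (last_sync,),
)

for row_id, resource_name, description, is_deleted in cursor.fetchall():
    doc_url = f"{ES_ENDPOINT}/table_syn00000000/_doc/{row_id}"
    if is_deleted:
        requests.delete(doc_url)
    else:
        requests.put(doc_url, json={"resourceName": resource_name, "description": description})
```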

From the design it seems clear that they would want an integration with faceting and filtering; this might prove to be a huge task (basically it could replace the current query engine). An idea might be to index only the relevant columns in Elasticsearch: at query time we first run a search and limit it to a small subset of results (e.g. 1000, which makes sense when searching for relevant documents), and then apply the normal filters and faceting in the MySQL database on that subset of ids (maintaining the order).
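A minimal sketch of that hybrid idea, assuming one index per table with the row id used as the document id (endpoint, index, table and column names are hypothetical):

```python
import mysql.connector
import requests

ES_ENDPOINT = "https://search-synapse-tables.us-east-1.es.amazonaws.com"  # hypothetical

# 1) Full text search in Elasticsearch, capped to a small subset of relevant rows.
hits = requests.post(f"{ES_ENDPOINT}/table_syn00000000/_search", json={
    "size": 1000,
    "_source": False,  # we only need the document ids (the row ids)
    "query": {
        "multi_match": {"query": "antibody", "fields": ["resourceName^3", "description"]}
    },
}).json()["hits"]["hits"]
row_ids = [int(hit["_id"]) for hit in hits]

# 2) Apply the usual filtering (and faceting) in MySQL on that subset of ids,
#    preserving the relevance order returned by the search.
conn = mysql.connector.connect(host="localhost", user="synapse", password="...", database="tables")
cursor = conn.cursor()
placeholders = ",".join(["%s"] * len(row_ids))
cursor.execute(
    f"SELECT * FROM T123 WHERE ROW_ID IN ({placeholders}) AND resourceType = 'Antibody' "
    f"ORDER BY FIELD(ROW_ID, {placeholders})",
    row_ids + row_ids,
)
rows = cursor.fetchall()
```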

Pros and Cons

 

MySQL Full Text Search

Pros:
  • Relatively easier to integrate than the other options
  • Can be integrated with the current query language
  • Can be integrated with the current facet implementation
  • Does not add additional costs

Cons:
  • Very limited capabilities (e.g. no stemming, which might be a big deal)
  • Very hard to customize
  • Might add significant overhead for tables that change frequently
  • Not clear how we would handle multi-value columns
  • No field boosting; this can be hard to implement efficiently or effectively (e.g. we would need to create separate indexes for each column and compute the rank ourselves)
  • All-or-nothing index (i.e. all columns must be included in both the index and the query); there is a hard limit of 16 columns per index and, if we create a single index per column, a limit of 64 indexes. The most efficient option would probably be to have a single column with the concatenation of all the values.
  • While easier to integrate, supporting some of the features (such as stemming) would require additional pre-processing

AWS Elasticsearch

Pros:
  • Open source implementation and active community
  • Additional companion tools and integrations, such as Logstash or Kinesis Firehose
  • High level of customization and tuning
  • Supports multiple indexes per cluster
  • Native support for nested objects and arrays
  • Can be integrated with the Synapse infrastructure or deployed as a separate service independent of Synapse (e.g. a service catalog offering?) that users could customize to their needs
  • Can be deployed as a non-managed solution (e.g. Docker, EC2, etc.), and there are various providers (e.g. Elastic) if we ever have problems with the AWS offering
  • Supports search templates (https://opendistro.github.io/for-elasticsearch-docs/docs/elasticsearch/search-template/). This seems like a very interesting feature; we could let users customize their own search templates per table (see the sketch after this list).

Cons:
  • Can be complicated to set up properly
  • Compared to MySQL it requires a substantial effort to keep the index in sync and to handle schema changes, which might lead to expensive re-indexing
  • Using the AWS offering might be risky: there have been past reports of limitations (e.g. ). The offering might be more mature at the moment, but with the Elastic license changes and the upcoming rename to "OpenSearch" there are some unknowns. The open source version is a fork that is already behind the Elastic offering. Additionally, AWS and all the other providers were relying on the clients offered by Elastic, but Elastic recently broke compatibility on purpose by including a license check in their clients; while older clients still work, the AWS team is currently working on forks to maintain. There are also ongoing legal battles with Elastic, and the future of the Elasticsearch offerings might be uncertain.
  • If integrated with the Synapse infrastructure, having a cluster rebuilt every release might prove very effective (e.g. no issues updating) but might also show limitations in the long run (e.g. depending on the amount of data we ingest, rebuilding indexes might take too long)
  • Can be hard, if not impossible, to integrate with facets and filters in tables. Might lead to a complete replication (and maintenance) of existing features on top of another engine, which could turn into a very long project.
  • There might be substantial additional costs (e.g. ~$500-2000/month) depending on how the cluster and indexes are configured. If we have a big cluster with one index per table and several tables, this can add up to several thousand dollars per month.
  • While multiple indexes per cluster are supported, indexes are tied to the number of shards in the cluster (by default 1 shard and 1 replica per index); the more shards, the more nodes and costs, so it might be more cost effective to have a single index with reduced search capabilities.

CloudSearch

Pros:
  • Already used in the Synapse backend
  • Easy to set up, auto-scales

Cons:
  • The one-index-per-domain structure makes it a bit inflexible (there is a 200 field limit per domain, and when using dynamic fields they suggest staying below 1000 fields for performance reasons). While we can work around it (e.g. pre-process the rows to be indexed), we would probably end up with issues in the relevance of results. Having one index per table is a much more effective strategy given that we know the schema beforehand, and it leaves much more room for customization based on the table content.
  • Same integration pain points with existing features as Elasticsearch
  • Since CloudSearch does not support multiple indexes we would be limited to creating documents that do not have explicit fields (e.g. tables have a lot of fields, and a single table can go over the 200 field limit). This means that the faceting option of CloudSearch is not an actual option.
  • Additional costs
  • Seems to receive no meaningful updates, and its future support is a bit of an unknown
  • While easy to set up, any request to update the configuration or schema, or to re-index, is extremely slow and almost unusable: a simple update of a field on an index with 1 document can easily take more than 30 minutes.
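Referring back to the search templates listed among the Elasticsearch pros above, a sketch of what a per-table template could look like (the endpoint, template id, index and field names are hypothetical):

```python
import requests

ES_ENDPOINT = "https://search-synapse-tables.us-east-1.es.amazonaws.com"  # hypothetical

# Store a reusable mustache search template; a table owner could customize which
# fields are searched and how they are boosted without touching the backend code.
requests.post(f"{ES_ENDPOINT}/_scripts/table_syn00000000_search", json={
    "script": {
        "lang": "mustache",
        "source": {
            "size": 100,
            "query": {
                "multi_match": {
                    "query": "{{search_term}}",
                    "fields": ["resourceName^3", "description"],
                }
            },
        },
    }
})

# Run a search through the stored template.
response = requests.post(f"{ES_ENDPOINT}/table_syn00000000/_search/template", json={
    "id": "table_syn00000000_search",
    "params": {"search_term": "plasmid"},
})
```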

Implementation Options

Given the above and the search experiment (see ), it's clear that an Elasticsearch cluster would be the most flexible and powerful option, especially if set up with a one-index-per-table configuration. However, an integration with the Synapse backend would become a sizable project that might not be worth the effort, and a simpler implementation using a MySQL full text index might be ok for now, also considering that in its default configuration Elasticsearch is not much different from MySQL; the advantages come in when the index and queries can be tuned according to the use case.

In the following we list some options for a potential integration:

  1. MySQL Full Text Index:

    1. A table can be configured to support full text search, the user can select the columns to index

    2. Given the limits on indexes in MySQL, a special column (potentially in a special table) contains the concatenated text of the selected columns (a sketch of this is included at the end of this document)

    3. Provide a special construct in the SQL language

    4. In some cases re-indexing of all the data will be needed (e.g. if we drop/add a column, since we concatenate the text)

    5. Can be extended in the future with more advanced features as needed (e.g. we could set up stemming); we could explore Apache Lucene (which Elasticsearch is based on) or OpenNLP

  2. Synapse Elasticsearch Cluster with one index-per-table configuration

    1. Set up an Elasticsearch cluster with the Synapse stack

    2. A table can be configured to be indexed with a one index per table configuration

    3. Provide the users with a special search API, or integrate with the SQL language

    4. Limit the search results to a maximum value (e.g. 1000) and integrate the results with the faceted results

    5. In some cases a full re-indexing might be necessary, but with the one-index-per-table configuration we might be able to run update-by-query operations instead

    6. We can extend it to allow the users to setup the index configuration as they see fit (e.g. custom analyzers, stop words, suggesters etc)

    7. We can extend it to setup a default query “template” and let the users customize it per table

  3. Synapse Elasticsearch Cluster with single index configuration (can be turned into multiple indexes for scalability)

    1. Similar to 2., but we construct a special document with the concatenated values instead, with the table id as a field

    2. Full re-indexing might be necessary for some operations (add/drop column)

    3. No customization for the index or query

    4. We need to be careful not to leak information (e.g. suggestion APIs etc might leak data from different tables)

  4. External Elasticsearch Cluster with one index-per-table configuration

    1. Similar to 2., but the user specifies the cluster endpoint (e.g. similar to what we do for custom storage locations) and we verify ownership. The cluster setup would need to grant the Synapse account access to a specific index.

    2. Synapse handles the index synchronization; there might be some issues with the fact that tables are actually rebuilt with every stack (e.g. users might not be happy with us rebuilding their indexes every week)
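Finally, a minimal sketch of what option 1 could look like at the MySQL level (table and column names are hypothetical; whether the search column is maintained as a stored generated column, as below, or populated by the table worker is an open design choice):

```python
import mysql.connector

# Placeholder connection parameters; T123 and its columns are hypothetical.
conn = mysql.connector.connect(host="localhost", user="synapse", password="...", database="tables")
cursor = conn.cursor()

# A single extra column holding the concatenation of the user-selected columns...
cursor.execute(
    "ALTER TABLE T123 ADD COLUMN SEARCH_TEXT TEXT "
    "GENERATED ALWAYS AS (CONCAT_WS(' ', resourceName, description)) STORED"
)
# ...with a FULLTEXT index on it.
cursor.execute("ALTER TABLE T123 ADD FULLTEXT INDEX ft_search_text (SEARCH_TEXT)")

# A special construct in the table SQL language could then translate to:
cursor.execute(
    "SELECT ROW_ID FROM T123 "
    "WHERE MATCH(SEARCH_TEXT) AGAINST ('plasmid database' IN NATURAL LANGUAGE MODE)"
)
```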