Versions Compared

Key

  • This line was added.
  • This line was removed.
  • Formatting was changed.

...

For this use case we assume that there will be real and meaningful natural language descriptions in the table data (when tabular data stores is mostly key/value pair pairs with common metadata the filtering provided by a relation database is more meaningful than a full text search index).

...

To enable such a feature we have different options: generally a search index is maintained as a separate and eventually consistent component that complements the source of truth, this is the case even for Synapse tables: the data is stored in S3, but they are eventually built in a dedicated relational DB against which we run user queries.

We consider three (TODO, check Kendra) main various technologies that are easily available to us and that live within the AWS ecosystem.

  1. CloudSearch: This is a managed service by AWS based on Solr that we use in Synapse to index entity metadata and wiki pages content. Note that while still supported the last meaningful update is from 2014 with dynamic fields and the product receive sporadic updates mainly for supporting new instance types. The product is not included in the AWS pricing calculator (redirects to elasticsearch offering).

  2. Elasticsearch: This the managed offering from AWS of the popular Elasticsearch product from Elastic NV based on Lucene. AWS will maintain an open source fork: https://aws.amazon.com/blogs/opensource/stepping-up-for-a-truly-open-source-elasticsearch (elastic changed the license from apache to their own).

  3. MySQL Full Text Search: Since Synapse tables are built in a MySQL database we also consider the native full text search index capabilities offered by MySQL.

  4. Kendra (question): New offering from AWS that supports unstructured data indexing and searching using ML. We exclude this offering because of the type of technology catered toward natural language queries (e.g. what is..?, who is..? where is…? etc) and its costs (min $800/month for a developer edition).

Comparison

We compare some of the relevant features of the various options (excluding Kendra):

Feature

CloudSearch

ElasticSearch

MySQL FTS

Notes

Schema Type

Fixed, partial support for schema-less with dynamic fields (capture all)

Fixed or Schema-less (Dynamic mapping)

Fixed, index needs to specify all the columns included in the search. The query needs to specify all the columns in the index.

A schema-less approach allows to index data whose structure is unknown, this might not be needed for table as by design we know the structure of the data.

Stemming

Yes

Yes

No - Require pre-processing from the application side

When indexing the tokens can be usually reduced to a word stem before indexing, this allows more flexibility when matching against similar terms (e.g. search for database might match documents containing databases)

Fuzzy search

Yes

Yes

No - Can potentially be implemented using soundex in a pre-processing step but it’s very fragile

Fuzzy search can be useful in some cases for minor misspellings

Field boosting

Yes (at query time)

Yes (both query and schema time)

No - Not sure what a work around would look like.

This is useful when specific columns are more relevant than others (e.g. a match in the title might be more meaningful than a match in a description).

Multiple indexes

No

Yes

Yes - Each synapse table has its own DB table)

In the synapse tables context it is relevant to have the possibility to create an index per table given that each table might have a different schema.

Auto-complete

Yes (Suggester API)

Yes (through suggesters, various options)

No

This is a feature that provides suggestions, useful for auto-complete (e.g. while you type)

Did-you-mean

Partial? - Maybe the suggester can be used or fuzzy search

Yes (through suggesters)

No

This is a feature that provides potential suggestions after the search (e.g. misspellings)

Highlighting

Yes

Yes

No

Facets

Yes

Yes

Partial - This is already supported for Synapse tables as a custom implementation

Arrays

Yes

Yes

No - Not natively but could be probably worked around

This might be needed for multi-value columns

Custom Synonyms

Yes (Index time)

Yes (Index or query time)

No

This feature can be useful to complement stemming or fuzzy search. Expanding the index/query with similar term might yield better results.

Custom Stop words

Yes (global)

Yes (Index)

Yes (global)

Maintenance and scalability

Managed, auto-scale

Managed, tuning suggestions

Managed RDS

Synapse Tables Integration Effort

High

High

Medium

Additional Costs

Yes, per cluster per instance type/hour. Plus amount of data in batches sent to index.

Yes per instance type/hour. Plus size of data. Might

No

Elasticsearch might turn out to be cheaper than CloudSearch since the instances are priced lowered and we do not pay for sending batches to index. NoSetting up the cluster with the right sizing can be complex with Elasticsearch and to ensure availability it can be more expensive (e.g. dedicated master nodes, multiple availability zones and replicas).

Integration Notes

Clearly the most flexible product is Elasticsearch, mainly because while CloudSearch is a good search index product it does not support multiple indexes per cluster (e.g. one domain per table). We could somehow make use of dynamic fields to catch all the fields (without field boosting) of a given type or try and index a table row as a single document field but field boosting wouldn’t work (does not support nested documents).

...

From the design it seems clear that they would want an integration with faceting and filtering, this might prove to be a huge task (basically it would replace the current query engine). An idea might be to index only relevant columns in elastic search, at query time we first run a search and limit to a small subset of results (e.g. 1000, this makes sense when searching for relevant documents) and apply the normal filters and faceting in the MySQL database on the subset of ids (maintaining the order).

Pros and Cons

Pro

Con

MySQL Full Text Search

  • Relatively easier to implement than other options

  • Can be integrated with the current query language

  • Can be integrated with current facet implementation

  • Does not add additional costs

  • Very limited capabilities (e.g. no stemming might be a big deal)

  • Basically no way to customize it

  • Might add significant overhead for adding data

  • Not clear how we would handle multi-value columns

  • No field boosting, this can be hard to implement efficiently

  • All or nothing index (e.g. all columns must be included in index and query)

Elasticsearch

  • Open source implementation and very active community

  • Additional companion tools and integrations such as logstash or kinesis firehose

  • High level of customization and tuning

  • Supports multiple indexes per cluster

  • Native support for nested objects and arrays

  • Can be integrated with the Synapse infrastructure or deployed as a separate service independent of Synapse (e.g. service catalog offering?) and user could customize it to their needs.

  • Can be complicated to setup properly

  • Compared to MySQL it requires a substantial effort to sync the index and handling schema changes

  • Using the AWS offering might be risky, there have been past reports of limitations (e.g. https://spun.io/2019/10/10/aws-elasticsearch-a-fundamentally-flawed-offering/ ). The offering might be more mature at the moment but with the Elastic license changes and the soon to become “open search” there are some unknowns.

  • If integrated with the synapse infrastructure having a cluster being built every release might prove very effective (e.g. no issues updating) but might also show limitations in the long run (e.g. depending on the amount of data we ingest rebuilding indexes might take too long).

  • Can be hard to integrate with facets and filters in tables. Might lead to a complete replication (and maintenance) of existing features.

  • There might be substantial additional costs (e.g. ~200-1000/month)

CloudSearch

  • Already used in the Synapse backend

  • Easy to setup

  • It’s basically a no-go due to the one cluster per index structure (there is a 200 fields limit per domain and using dynamic fields they suggest to stay below 1000 for performance reasons)

  • Same integration pain points with existing features as Elasticsearch

  • Additional costs

  • Seems to receive no meaningful updates and its support is a bit of an unknown