| Feature | CloudSearch | AWS ElasticSearch | MySQL FTS | Notes |
|---|---|---|---|---|
| Schema Type | Fixed, partial support for schema-less with dynamic fields (capture all) | Fixed or schema-less (dynamic mapping) | Fixed: the index needs to specify all the columns included in the search, and the query needs to specify all the columns in the index. | A schema-less approach allows indexing data whose structure is unknown. This might not be needed for tables since by design we know the structure of the data. |
| Stemming | Yes | Yes | No - Requires pre-processing on the application side | At indexing time the tokens can usually be reduced to a word stem, which allows more flexibility when matching similar terms (e.g. a search for "database" might match documents containing "databases"). |
| Fuzzy search | Yes | Yes | No - Can potentially be implemented using soundex in a pre-processing step but it’s very fragile | Fuzzy search can be useful in some cases for minor misspellings. |
| Field boosting | Yes (at query time) | Yes (both query and schema time) | No - Not sure what a workaround would look like | This is useful when specific columns are more relevant than others (e.g. a match in the title might be more meaningful than a match in a description). |
| Multiple indexes | No | Yes | Yes (each Synapse table has its own DB table) | In the Synapse tables context it is relevant to be able to create an index per table, given that each table might have a different schema and it might be more meaningful to reflect the schema in the search index. |
| Auto-complete | Yes (Suggester API) | Yes (through suggesters, various options) | No | This is a feature that provides suggestions, useful for auto-complete (e.g. while you type). |
| Did-you-mean | Partial? - Maybe the suggester or fuzzy search can be used | Yes (through suggesters) | No | This is a feature that provides potential suggestions after the search (e.g. misspellings). |
| Highlighting | Yes | Yes | No | |
| Facets | Yes (but we wouldn’t be able to use it for tables due to the limitation on the number of fields in a domain and given the sheer amount of columns in tables) | Yes (but it might not be a good idea to use it, given that we would have to re-implement the whole faceting on top of Elasticsearch) | Partial - This is already supported for Synapse tables as a custom implementation | This might not be relevant as Synapse tables already implement faceting. |
| Arrays | Yes | Yes | No - Not natively, but it could probably be worked around | This might be needed for multi-value columns. |
| Custom Synonyms | Yes (index time) | Yes (index or query time) | No | This feature can be useful to complement stemming or fuzzy search. Expanding the index/query with similar terms might yield better results. |
| Custom Stop words | Yes (global) | Yes (index) | Yes (global) | |
| Dedicated Java Client | Yes | No. Currently it re-uses the client provided by Elastic, which deliberately broke compatibility with non-Elastic distributions in newer versions. There are plans to release forks that will maintain compatibility. | Yes, JDBC | |
| Maintenance and scalability | Managed, auto-scale | Managed, tuning suggestions | Managed RDS | |
| Synapse Tables Integration Effort | High | High | Medium | |
| Additional Costs | Yes, per cluster per instance type/hour, plus the amount of data in batches sent to index. | Yes, per instance type/hour, plus size of data. | No | Elasticsearch might turn out to be cheaper than CloudSearch since the instances are priced lower and we do not pay for sending batches to index. Sizing the cluster correctly can be complex with Elasticsearch, and ensuring availability can make it more expensive (e.g. dedicated master nodes, multiple availability zones and replicas). |
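The application-side pre-processing that MySQL FTS would need for stemming could look roughly like the sketch below. It is only an illustration: the suffix list and the function names are made up for this example, and the stemmer is deliberately naive (a real setup would likely rely on a proper Porter/Snowball style stemmer). The assumption is that the same function is applied both to the text stored for indexing and to the user query.

```python
import re

# Hypothetical and deliberately naive suffix list, for illustration only.
_SUFFIXES = ("ing", "ed", "s")


def naive_stem(token: str) -> str:
    """Strip at most one common English suffix, keeping at least 3 characters."""
    for suffix in _SUFFIXES:
        if token.endswith(suffix) and len(token) - len(suffix) >= 3:
            return token[: -len(suffix)]
    return token


def preprocess(text: str) -> str:
    """Lower-case, tokenize on alphanumeric runs and stem each token."""
    tokens = re.findall(r"[a-z0-9]+", text.lower())
    return " ".join(naive_stem(token) for token in tokens)


if __name__ == "__main__":
    # Indexed value and query end up with the same stem.
    print(preprocess("Databases"))
    print(preprocess("database"))
```

With this kind of pre-processing a search for "database" and a cell containing "databases" are reduced to the same token before they reach the full text index, at the price of keeping a second, pre-processed copy of the searchable text.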

Pro

Con

MySQL Full Text Search

Pro:

  • Relatively easier to integrate than other options

  • Can be integrated with the current query language

  • Can be integrated with current facet implementation

  • Does not add additional costs

Con:

  • Very limited capabilities (e.g. no stemming might be a big deal)

  • Very hard to customize

  • Might add significant overhead for tables that change frequently

  • Not clear how we would handle multi-value columns

  • No field boosting. This can be hard to implement efficiently or effectively (e.g. we would need to create separate indexes for each column and compute the rank ourselves)

  • All or nothing index (e.g. all columns must be included in index and query). There is a hard limit of 16 columns in an index. If we create a single index per column we have a limit of 64 indexes.

  • While easier to integrate, in order to support some of the features (such as stemming) we would need additional pre-processing
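To make the field boosting workaround for MySQL more concrete, here is a minimal sketch of "compute the rank ourselves". It assumes we already have one relevance value per column for each row (for example what a separate MATCH ... AGAINST against each per-column index would return) and simply combines them with per-column weights. The function names, the boost values and the row data are hypothetical.

```python
from typing import Dict, List, Tuple

# Limit quoted in the list above: one index per column caps out at 64 indexes.
MAX_INDEXES_PER_TABLE = 64


def can_index_each_column(num_searchable_columns: int) -> bool:
    """True if a one-index-per-column layout fits within the index limit."""
    return num_searchable_columns <= MAX_INDEXES_PER_TABLE


def boosted_score(
    column_scores: Dict[str, float],
    boosts: Dict[str, float],
    default_boost: float = 1.0,
) -> float:
    """Weighted sum of per-column relevance values."""
    return sum(
        score * boosts.get(column, default_boost)
        for column, score in column_scores.items()
    )


def rank_rows(
    rows: Dict[int, Dict[str, float]], boosts: Dict[str, float]
) -> List[Tuple[int, float]]:
    """Return (row_id, score) pairs sorted by descending boosted score."""
    scored = [(row_id, boosted_score(scores, boosts)) for row_id, scores in rows.items()]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)


if __name__ == "__main__":
    per_column = {
        1: {"title": 0.5, "description": 2.0},
        2: {"title": 1.5, "description": 0.2},
    }
    # A match in the title counts three times as much as one in the description.
    print(rank_rows(per_column, {"title": 3.0}))
```

Even in this simplified form the cost is visible: every searchable column needs its own index and its own relevance lookup, and the final ordering has to happen outside of the full text engine.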

AWS Elasticsearch

Pro:

  • Open source implementation and active community

  • Additional companion tools and integrations such as logstash or kinesis firehose

  • High level of customization and tuning

  • Supports multiple indexes per cluster

  • Native support for nested objects and arrays

  • Can be integrated with the Synapse infrastructure or deployed as a separate service independent of Synapse (e.g. service catalog offering?) that users could customize to their needs.

  • Can be deployed as a non-managed solution (e.g. docker, EC2 etc) and there are various providers (e.g. Elastic) if we ever have problems with the AWS offering

  • Supports search templates (https://opendistro.github.io/for-elasticsearch-docs/docs/elasticsearch/search-template/). This seems like a very interesting feature, in fact we could let the user customize their own search templates per table.

Con:

  • Can be complicated to set up properly

  • Compared to MySQL it requires substantial effort to keep the index in sync and to handle schema changes, which in particular might lead to expensive re-indexing

  • Using the AWS offering might be risky: there have been past reports of limitations (e.g. https://spun.io/2019/10/10/aws-elasticsearch-a-fundamentally-flawed-offering/). The offering might be more mature at the moment, but with the Elastic license changes and the offering soon to become “OpenSearch” there are some unknowns. The open source version is a fork that is already behind the Elastic offering. Additionally, AWS and all the other providers were relying on the clients offered by Elastic, but recently Elastic deliberately broke compatibility by including a license check in their clients. While older clients still work, the AWS team is currently working on forks to maintain compatibility. There are also legal battles initiated by Elastic, and the future of Elasticsearch offerings might be uncertain.

  • If integrated with the Synapse infrastructure, building a cluster for every release might prove very effective (e.g. no issues updating) but might also show limitations in the long run (e.g. depending on the amount of data we ingest, rebuilding indexes might take too long).

  • Can be hard, if not impossible, to integrate with facets and filters in tables. Might lead to a complete replication (and maintenance) of existing features on top of another engine, which could turn into a very long project.

  • There might be substantial additional costs (e.g. ~500-2000/month) depending on how the cluster and indexes are configured. If we have a big cluster where we have one index per table and several tables this can add up to several thousand per month.

  • While multiple indexes per cluster are supported, each index adds shards to the cluster (default 1 shard and 1 replica). The more shards, the more nodes and costs, so it might be more cost effective to have a single index with reduced search capabilities.
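The relation between indexes, shards and nodes mentioned in the last point can be worked out with simple arithmetic: every index contributes its primary shards plus one copy of each per replica. The sketch below uses the defaults quoted above (1 shard and 1 replica); the shards-per-node budget is a hypothetical sizing parameter that would have to come from our own capacity planning, not a value taken from AWS.

```python
import math


def total_shards(num_indexes: int, primary_shards: int = 1, replicas: int = 1) -> int:
    """Shards in the cluster: each primary shard is stored once plus once per replica."""
    return num_indexes * primary_shards * (1 + replicas)


def nodes_needed(shards: int, max_shards_per_node: int) -> int:
    """Nodes required for a given (hypothetical) per-node shard budget."""
    return math.ceil(shards / max_shards_per_node)


if __name__ == "__main__":
    # One index per table with 500 tables versus a single shared index.
    per_table = total_shards(500)
    shared = total_shards(1)
    print(per_table, nodes_needed(per_table, max_shards_per_node=300))
    print(shared, nodes_needed(shared, max_shards_per_node=300))
```

The node count grows linearly with the number of tables in the index-per-table layout, which is where the cost difference with a single shared index comes from.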

CloudSearch

Pro:

  • Already used in the Synapse backend

  • Easy to set up and auto-scale

Con:

  • The one cluster per index structure makes it a bit inflexible (there is a 200 fields limit per domain and when using dynamic fields they suggest staying below 1000 for performance reasons). While we can work around it (e.g. pre-process the rows to be indexed) we would probably end up with issues in the relevance of results. Having one index per table is a much more effective strategy given that we know the schema beforehand and we have much more room for customization based on the table content.

  • Same integration pain points with existing features as Elasticsearch

  • Since CloudSearch does not support multiple indexes we would be limited to creating documents that do not have explicit fields (e.g. tables have a lot of fields and a single table can go over the 200 fields limit). This means that the faceting option of CloudSearch is not an actual option.

  • Additional costs

  • Seems to receive no meaningful updates and its support is a bit of an unknown

  • While easy to set up, any request to update the configuration or schema, or to re-index, is extremely slow and almost unusable: a simple update of a field on an index with 1 document can easily take more than 30 minutes.
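The "pre-process the rows to be indexed" workaround for the CloudSearch field limit could look like the sketch below: every table row is collapsed into a document with a small fixed set of fields, regardless of how many columns the table has. The field names (`table_id`, `content`) and the id format are assumptions made for this example; the outer shape follows the add operation of a CloudSearch document batch.

```python
from typing import Any, Dict


def row_to_document(table_id: str, row_id: int, row: Dict[str, Any]) -> Dict[str, Any]:
    """Collapse all column values of a row into a single searchable text field."""
    parts = []
    for value in row.values():
        if value is None:
            continue  # skip empty cells
        if isinstance(value, (list, tuple)):
            parts.extend(str(item) for item in value)  # multi-value column
        else:
            parts.append(str(value))
    return {
        "type": "add",
        "id": f"{table_id}_{row_id}",
        # Hypothetical fixed schema: the column names are lost on purpose.
        "fields": {"table_id": table_id, "content": " ".join(parts)},
    }


if __name__ == "__main__":
    row = {"name": "BRCA1", "tissue": ["breast", "ovary"], "count": 3, "notes": None}
    print(row_to_document("syn123", 7, row))
```

The document stays well below the field limit, but since the column names are gone there is nothing left to boost or facet on, which is the relevance and faceting limitation described above.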
