MySQL Full Text Search

Pro

  • Easier to integrate than the other options

  • Can be integrated with the current query language

  • Can be integrated with the current facet implementation

  • Adds no extra cost

Con

  • Very limited capabilities (e.g. no stemming, which might be a big deal)

  • Very hard to customize

  • Might add significant overhead for tables that change frequently

  • Not clear how we would handle multi-value columns

  • No field boosting; implementing it efficiently or effectively would be hard (e.g. we would need to create a separate index for each column and compute the rank ourselves)

  • All-or-nothing index (e.g. all columns must be included in both the index and the query)

  • Although it is easier to integrate, supporting some features (such as stemming) would require additional pre-processing
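To make the field boosting point concrete, here is a minimal sketch of what "compute the rank ourselves" could look like. It assumes each searchable column has its own FULLTEXT index and builds a query that sums one weighted MATCH ... AGAINST score per column. The table name, column names, weights and the function name are hypothetical, not part of the current implementation.

```python
def build_boosted_query(table, text, boosts):
    """Build a ranked full-text query as a weighted sum of per-column scores.

    Assumption: every column in `boosts` has its own FULLTEXT index.
    `boosts` maps a column name to its weight (the "boost").
    Returns the SQL string and the list of bound parameters.
    """
    if not boosts:
        raise ValueError("at least one column is required")
    # Identifiers cannot be bound as parameters, so only accept simple names.
    for name in [table, *boosts]:
        if not name.isidentifier():
            raise ValueError(f"unsafe identifier: {name!r}")
    score_terms = [
        f"{float(weight)} * MATCH (`{column}`) AGAINST (%s IN NATURAL LANGUAGE MODE)"
        for column, weight in boosts.items()
    ]
    sql = (
        f"SELECT *, ({' + '.join(score_terms)}) AS score "
        f"FROM `{table}` HAVING score > 0 ORDER BY score DESC"
    )
    # The search text is bound once per MATCH expression.
    params = [text] * len(score_terms)
    return sql, params


# Hypothetical table and columns: a match in "title" counts twice as much.
sql, params = build_boosted_query("T123", "cancer", {"title": 2.0, "description": 1.0})
```

Every additional boosted column adds one more index to maintain and one more MATCH expression to evaluate, which is the overhead the bullet above is concerned about.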

AWS Elasticsearch

Pro

  • Open source implementation and active community

  • Additional companion tools and integrations such as Logstash or Kinesis Firehose

  • High level of customization and tuning

  • Supports multiple indexes per cluster

  • Native support for nested objects and arrays

  • Can be integrated with the Synapse infrastructure or deployed as a separate service independent of Synapse (e.g. service catalog offering?), in which case users could customize it to their needs

  • Can be deployed as a non-managed solution (e.g. Docker, EC2, etc.), and there are various providers (e.g. Elastic) if we ever have problems with the AWS offering

Con

  • Can be complicated to set up properly

  • Compared to MySQL, it requires substantial effort to keep the index in sync and to handle schema changes

  • Using the AWS offering might be risky: there have been past reports of limitations (e.g. https://spun.io/2019/10/10/aws-elasticsearch-a-fundamentally-flawed-offering/ ). The offering might be more mature now, but with the Elastic license changes and the service soon to become “OpenSearch” there are some unknowns. The open source version is a fork that is already behind the Elastic offering.

  • If integrated with the Synapse infrastructure, building a new cluster on every release might prove very effective (e.g. no issues updating) but might also show limitations in the long run (e.g. depending on the amount of data we ingest, rebuilding the indexes might take too long)

  • Can be hard to integrate with facets and filters in tables. Might lead to a complete replication (and maintenance) of existing features.

  • There might be substantial additional costs (e.g. ~200-1000/month)
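As an illustration of the customization and multi-value points above, the sketch below builds the JSON bodies for a per-table index and a boosted search. The `mappings`/`properties` layout and the `multi_match` query with `field^boost` syntax are standard Elasticsearch; the column type names, the type mapping and the function names are assumptions made for the example, and no client library or cluster is involved.

```python
# Assumed mapping from table column types to Elasticsearch field types.
# Elasticsearch has no dedicated array type: any field may hold several
# values, so a list column maps to the same type as its single-value form.
TYPE_MAP = {
    "STRING": "text",
    "STRING_LIST": "text",
    "INTEGER": "long",
    "DOUBLE": "double",
    "BOOLEAN": "boolean",
}


def build_index_body(schema):
    """Build the index creation body for one table from its column types."""
    properties = {
        name: {"type": TYPE_MAP[column_type]}
        for name, column_type in schema.items()
    }
    return {"mappings": {"properties": properties}}


def build_search_body(text, boosts):
    """Build a multi_match query where each field carries its own boost."""
    fields = [f"{name}^{boost}" for name, boost in boosts.items()]
    return {"query": {"multi_match": {"query": text, "fields": fields}}}


# Hypothetical table schema and boosts.
index_body = build_index_body({"title": "STRING", "tags": "STRING_LIST"})
search_body = build_search_body("cancer", {"title": 2, "tags": 1})
```

A schema change on the table would require regenerating the mapping and, for most type changes, re-indexing the data, which is where the sync effort listed above comes from.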

CloudSearch

Pro

  • Already used in the Synapse backend

  • Easy to set up and auto-scale

Con

  • The one-cluster-per-index structure makes it a bit inflexible (there is a limit of 200 fields per domain, and with dynamic fields they suggest staying below 1000 for performance reasons). While we can work around it (e.g. pre-process the rows to be indexed), we would probably end up with issues in the relevance of results. Having one index per table is a much more effective strategy, given that we know the schema beforehand and have much more room for customization based on the table content.

  • Same integration pain points with existing features as Elasticsearch

  • Additional costs

  • Seems to receive no meaningful updates and its support is a bit of an unknown

  • While easy to set up, any request to update the configuration or the schema, or to re-index, is extremely slow and almost unusable: a simple update of a field on an index with 1 document can easily take more than 30 minutes
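The field limit and the pre-processing workaround mentioned above can be sketched as follows. This is only one possible workaround, assumed for illustration: all column values of a row are collapsed into a single shared text field so that every table fits in one domain. The field names `table_id` and `content` and both function names are hypothetical.

```python
MAX_FIELDS_PER_DOMAIN = 200  # per-domain field limit quoted above


def fits_in_domain(table_schemas):
    """Check whether indexing every column as its own field stays under the limit.

    `table_schemas` maps a table id to the list of its column names.
    """
    distinct_fields = {col for cols in table_schemas.values() for col in cols}
    return len(distinct_fields) <= MAX_FIELDS_PER_DOMAIN


def flatten_row(table_id, row_id, row):
    """Collapse a row into one document with a single searchable text field."""
    values = []
    for value in row.values():
        if value is None:
            continue
        if isinstance(value, (list, tuple)):
            # Multi-value columns contribute each of their values.
            values.extend(str(v) for v in value)
        else:
            values.append(str(value))
    return {
        "id": f"{table_id}_{row_id}",
        "fields": {"table_id": table_id, "content": " ".join(values)},
    }


doc = flatten_row("T1", 7, {"title": "a", "tags": ["x", "y"], "note": None})
```

Once the columns are merged into `content`, a match in a title can no longer be weighted differently from a match in any other column, which is the relevance problem described above.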