Content Comparison

...

MySQL limits the number of secondary indexes per table to 64 and the number of columns in a single index to 16. We added a special column that contains the concatenated values of all the columns and created a FULLTEXT index on that column. A FULLTEXT index was also added for each individual column (so that ORed queries can be run against each column), but in that case a separate score is computed for each column and the optimizer might choose not to use the indexes (we did not have enough data to see how this would behave).
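As a rough illustration of the setup described above, the following sketch generates the `ALTER TABLE` statements for the combined column and for each individual column, and shows one way the concatenated value could be computed. The table name `SEARCH_TEST` and the column `CONTENT_TEXT` come from the queries on this page; the index naming scheme (`ft_<column>`), the space separator and the example column names are assumptions, not the statements that were actually run.

```python
def fulltext_index_ddl(table, columns, combined_column="CONTENT_TEXT"):
    """Return one ALTER TABLE statement per FULLTEXT index:
    first the combined column, then each individual column.
    The ft_<column> index names are a hypothetical naming scheme."""
    statements = []
    for col in [combined_column, *columns]:
        statements.append(
            f"ALTER TABLE `{table}` ADD FULLTEXT INDEX `ft_{col.lower()}` (`{col}`)"
        )
    return statements


def combined_text(row, columns):
    """Concatenate the non-null values of the given columns, separated by spaces
    (the separator is an assumption)."""
    return " ".join(str(row[c]) for c in columns if row.get(c) is not None)


if __name__ == "__main__":
    # TITLE and DIAGNOSIS are made-up column names used only for illustration.
    for stmt in fulltext_index_ddl("SEARCH_TEST", ["TITLE", "DIAGNOSIS"]):
        print(stmt + ";")
```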

The RDS instance with the data can be reached (through the VPN) at:

Code Block
host: dev-marco-db.cdusmwdhqvso.us-east-1.rds.amazonaws.com
db: devmarco
user: devmarcouser
password: platform
table: SEARCH_TEST

Once connected, full-text queries can be executed, such as:

Code Block

SELECT MATCH(CONTENT_TEXT) AGAINST('tumor') AS SCORE, S.* FROM SEARCH_TEST S WHERE MATCH(CONTENT_TEXT) AGAINST('tumor');
SELECT MATCH(CONTENT_TEXT) AGAINST('peripher* tumor' IN BOOLEAN MODE) AS SCORE, S.* FROM SEARCH_TEST S WHERE MATCH(CONTENT_TEXT) AGAINST('peripher* tumor' IN BOOLEAN MODE);

See https://dev.mysql.com/doc/refman/8.0/en/fulltext-search.html for documentation on the syntax.
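The per-column FULLTEXT indexes allow a query that ORs one `MATCH ... AGAINST` per column instead of using `CONTENT_TEXT`. The sketch below builds such a query as a parameterized string. Summing the per-column scores into a single `SCORE` is an assumption made for illustration (each column produces its own score and this page does not say how they should be combined), and the `%s` placeholder style assumes a driver such as PyMySQL or MySQL Connector/Python.

```python
def ored_fulltext_query(table, columns):
    """Build a parameterized query that ORs one MATCH ... AGAINST per column.
    The per-column scores are summed into SCORE, which is an assumption."""
    matches = [f"MATCH(`{c}`) AGAINST(%s)" for c in columns]
    score = " + ".join(matches)
    where = " OR ".join(matches)
    return (
        f"SELECT {score} AS SCORE, S.* FROM `{table}` S "
        f"WHERE {where} ORDER BY SCORE DESC"
    )


def query_params(term, columns):
    """The search term is bound once per MATCH in the SELECT list
    and once per MATCH in the WHERE clause."""
    return [term] * (2 * len(columns))


if __name__ == "__main__":
    # TITLE and DIAGNOSIS are made-up column names used only for illustration.
    cols = ["TITLE", "DIAGNOSIS"]
    print(ored_fulltext_query("SEARCH_TEST", cols))
    print(query_params("tumor", cols))
```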

Elasticsearch Setup

We set up an AWS Elasticsearch cluster in a VPC with a single data node (t3.small.elasticsearch instance) and no dedicated master node. The setup was initially done using fine-grained access control with an IAM user to perform the import. Later the authentication was switched to the internal user management with a dedicated user, so that queries can be run from the command line for testing.

...

Code Block

Endpoint: https://vpc-tables-search-test-es7bt4peajix4wokysfxfldqoy.us-east-1.es.amazonaws.com
Kibana Console: https://vpc-tables-search-test-es7bt4peajix4wokysfxfldqoy.us-east-1.es.amazonaws.com/_plugin/kibana/
user: devmarco
password: Platform?es2021
indexes: syn26050977_index_default, syn26050977_index_eng

Queries can be run using curl, e.g.:

Code Block

curl -XGET -u 'devmarco:Platform?es2021' 'https://vpc-tables-search-test-es7bt4peajix4wokysfxfldqoy.us-east-1.es.amazonaws.com/syn26050977_index_default/_search?q=tumor&pretty=true'
curl -XPOST -d '{"size": 3, "query": {"query_string": {"query":"tumor"}}}' -H 'Content-Type: application/json' -u 'devmarco:Platform?es2021' 'https://vpc-tables-search-test-es7bt4peajix4wokysfxfldqoy.us-east-1.es.amazonaws.com/syn26050977_index_default/_search?pretty=true'

See https://opendistro.github.io/for-elasticsearch-docs/docs/elasticsearch/full-text/ for documentation on the syntax.
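For scripted testing, the POST request from the curl example can also be built in Python using only the standard library. This is a minimal sketch: the function name `build_search_request` is hypothetical, and the endpoint, index and credentials are passed in as arguments (use the values listed above). It only constructs the request; sending it requires being on the VPN.

```python
import base64
import json
import urllib.request


def build_search_request(endpoint, index, user, password, query, size=3):
    """Build (but do not send) a query_string search request with HTTP basic auth,
    mirroring the curl -XPOST example."""
    body = json.dumps(
        {"size": size, "query": {"query_string": {"query": query}}}
    ).encode("utf-8")
    token = base64.b64encode(f"{user}:{password}".encode("utf-8")).decode("ascii")
    return urllib.request.Request(
        url=f"{endpoint}/{index}/_search?pretty=true",
        data=body,
        headers={
            "Content-Type": "application/json",
            "Authorization": f"Basic {token}",
        },
        method="POST",
    )
```

The request can then be sent with `urllib.request.urlopen(request)` and the response body parsed with `json.loads`.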

Only the STRING, STRING_LIST and LARGETEXT columns were imported into the indexes. No static mapping was defined beforehand; we let Elasticsearch map the fields dynamically (every field that was not null was automatically added as a text field with a keyword sub-field). Multi-value columns were set in the document as arrays.
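The conversion from a table row to an indexed document can be sketched as follows. This is not the import code that was used: the function name `row_to_document` and the shape of the inputs (a dict of values plus a dict of column types) are assumptions. It only illustrates the rules stated above: keep STRING, STRING_LIST and LARGETEXT columns, skip null values so they are never mapped, and emit multi-value columns as arrays.

```python
INDEXED_TYPES = {"STRING", "STRING_LIST", "LARGETEXT"}


def row_to_document(row, column_types):
    """Convert a row (column name -> value) into a document for indexing.

    Columns of other types and null values are skipped; STRING_LIST values
    are emitted as lists so they are serialized as JSON arrays."""
    doc = {}
    for name, value in row.items():
        if column_types.get(name) not in INDEXED_TYPES:
            continue
        if value is None:
            continue
        doc[name] = list(value) if column_types[name] == "STRING_LIST" else value
    return doc
```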

The syn26050977_index_default index is “as-is” from just submitting the documents. The syn26050977_index_eng index was instead configured to use the pre-configured English analyzer, which includes an English stemmer, as its default analyzer (see https://www.elastic.co/guide/en/elasticsearch/reference/7.x/analysis-lang-analyzer.html#english-analyzer; note that this is the Elastic.co documentation, as I could not find it in the AWS, Open Distro or OpenSearch docs).
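In Elasticsearch 7.x, an index-wide default analyzer is declared by defining an analyzer named `default` in the index settings. The sketch below builds such a settings body. It is an assumption about what the configuration of syn26050977_index_eng may have looked like, not a copy of the settings actually applied.

```python
import json


def english_default_settings():
    """Index settings that make the built-in 'english' analyzer the default
    analyzer for all text fields that do not specify their own."""
    return {
        "settings": {
            "analysis": {
                "analyzer": {
                    "default": {"type": "english"}
                }
            }
        }
    }


if __name__ == "__main__":
    # This body would be sent with PUT /<index name> when creating the index.
    print(json.dumps(english_default_settings(), indent=2))
```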

Version	Old Version 5	New Version 6
Changes made by	Marco Marasca	Marco Marasca
Saved on	Aug 19, 2021	Aug 20, 2021
