OpenSearch Feature "Did you mean"
Introduction
The "Did You Mean" feature significantly enhances the user experience by mitigating the frustrating "No Results Found" scenario caused by typos and misspellings. When a user submits a query that fails to return any documents, the system automatically checks for the most likely correction and presents it as a suggestion. This functionality is achieved using OpenSearch's Suggester, which is separate from the core search logic.
Suggesters
The OpenSearch Suggesters are a powerful search feature used to provide relevant suggestions and corrections for terms submitted by users.
Term Suggester: The term suggester suggests corrected spellings for individual words. It computes suggestions using edit distance, which is the number of single-character insertions, deletions, or substitutions needed to turn one term into another. For example, to change the word “cat” to “hats”, you need to substitute “h” for “c” and insert an “s”, so the edit distance in this case is 2.
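To make the edit distance definition concrete, here is a minimal sketch in plain Java. It implements the textbook Levenshtein distance exactly as described above (insertions, deletions, substitutions). The class and method names are illustrative, and this is not OpenSearch's internal implementation.

```java
final class EditDistance {

    /** Minimum number of single-character insertions, deletions, or substitutions. */
    static int distance(String source, String target) {
        int[][] cost = new int[source.length() + 1][target.length() + 1];
        // Turning a prefix into the empty string takes one deletion per character.
        for (int i = 0; i <= source.length(); i++) {
            cost[i][0] = i;
        }
        // Building a prefix from the empty string takes one insertion per character.
        for (int j = 0; j <= target.length(); j++) {
            cost[0][j] = j;
        }
        for (int i = 1; i <= source.length(); i++) {
            for (int j = 1; j <= target.length(); j++) {
                int substitution = source.charAt(i - 1) == target.charAt(j - 1) ? 0 : 1;
                int deleteOrInsert = Math.min(cost[i - 1][j] + 1, cost[i][j - 1] + 1);
                cost[i][j] = Math.min(deleteOrInsert, cost[i - 1][j - 1] + substitution);
            }
        }
        return cost[source.length()][target.length()];
    }

    public static void main(String[] args) {
        // "c" -> "h" (substitution), then append "s" (insertion)
        System.out.println(distance("cat", "hats"));
    }
}
```

With the example from the text, distance("cat", "hats") evaluates to 2.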
To use the term suggester, we don’t need any special field mappings for our index. By default, string field types are mapped as text.
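Term suggester options. The following are a few of the options that change the default behavior:
Options | Description |
|---|---|
size | The maximum number of suggestions to return for each token in the input text. |
sort | Specifies how suggestions should be sorted in the response. Valid values are: score (default; sort by similarity score, then document frequency, then the term itself) and frequency (sort by document frequency, then similarity score, then the term itself). |
suggest_mode | The suggest mode specifies the terms for which suggestions should be included in the response. Valid values are: missing (default; suggest only for input terms that are not in the index), popular (suggest only terms that occur in more documents than the input term), and always (suggest for every input term). |
max_edits | The maximum edit distance for suggestions. Valid values are in the [1, 2] range. Default is 2. |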
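The effect of these options can be illustrated with a toy, in-memory suggester. This is only a sketch under stated assumptions: the class ToyTermSuggester, its method names, and the use of raw edit distance in place of OpenSearch's similarity score are all hypothetical. It mimics suggest_mode = missing, max_edits, sort = score, and size over a small map of terms to document frequencies.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;
import java.util.Map;

final class ToyTermSuggester {

    /** Two-row Levenshtein distance (insertions, deletions, substitutions). */
    static int distance(String a, String b) {
        int[] previous = new int[b.length() + 1];
        for (int j = 0; j <= b.length(); j++) {
            previous[j] = j;
        }
        for (int i = 1; i <= a.length(); i++) {
            int[] current = new int[b.length() + 1];
            current[0] = i;
            for (int j = 1; j <= b.length(); j++) {
                int substitution = a.charAt(i - 1) == b.charAt(j - 1) ? 0 : 1;
                current[j] = Math.min(Math.min(previous[j] + 1, current[j - 1] + 1),
                        previous[j - 1] + substitution);
            }
            previous = current;
        }
        return previous[b.length()];
    }

    static List<String> suggest(String token, Map<String, Integer> termFrequencies,
                                int size, int maxEdits) {
        // suggest_mode = missing: a token that is already indexed gets no suggestions.
        if (termFrequencies.containsKey(token)) {
            return List.of();
        }
        // max_edits: keep only indexed terms within the allowed edit distance.
        List<String> candidates = new ArrayList<>();
        for (String term : termFrequencies.keySet()) {
            if (distance(token, term) <= maxEdits) {
                candidates.add(term);
            }
        }
        // sort = score: closest terms first (edit distance stands in for the score),
        // then higher document frequency, then the term itself.
        candidates.sort(Comparator
                .comparingInt((String term) -> distance(token, term))
                .thenComparing(Comparator.comparingInt(
                        (String term) -> termFrequencies.get(term)).reversed())
                .thenComparing(Comparator.naturalOrder()));
        // size: cap the number of suggestions returned for the token.
        return List.copyOf(candidates.subList(0, Math.min(size, candidates.size())));
    }
}
```

For example, with the terms biobank (frequency 5), biobark (1), biobatk (1), and cancer (10), suggest("biobak", terms, 2, 2) returns biobank and biobark, while suggest("cancer", terms, 2, 2) returns nothing because the term already exists in the index.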
Phrase Suggester : The phrase suggester is similar to the term suggester, except it uses n-gram language models to suggest whole phrases instead of individual words.
To set up a phrase suggester, create a custom analyzer called trigram that uses a shingle filter and lowercases tokens. This filter is similar to the edge_ngram filter, but it applies to words instead of letters. Then configure the field from which you’ll be sourcing suggestions with the custom analyzer you created.
We need a custom analyzer because the default analyzer does not store the information required to evaluate the context and probability of word sequences. The phrase suggester's goal is to correct misspellings in a sequence (e.g., correcting "quick box down" to the more probable "quick brown fox"). To do this, it needs a language model that assesses the likelihood of one word following another.
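To show what the shingle filter contributes, the following sketch produces word-level shingles of 2 to 3 words from a piece of text, keeping the single words as well. It is a plain Java approximation for illustration only: the class name ShingleSketch is hypothetical, whitespace splitting stands in for the standard tokenizer, and the output order is not guaranteed to match the real filter.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Locale;

final class ShingleSketch {

    /** Lowercases the text and emits each word plus every shingle of minSize..maxSize words. */
    static List<String> shingles(String text, int minSize, int maxSize) {
        String[] words = text.toLowerCase(Locale.ROOT).trim().split("\\s+");
        List<String> result = new ArrayList<>();
        for (int start = 0; start < words.length; start++) {
            // Keep the single word (unigram) at this position.
            result.add(words[start]);
            // Add each multi-word shingle that starts at this position and fits in the text.
            for (int size = minSize; size <= maxSize && start + size <= words.length; size++) {
                result.add(String.join(" ", Arrays.copyOfRange(words, start, start + size)));
            }
        }
        return result;
    }

    public static void main(String[] args) {
        System.out.println(shingles("Quick brown fox", 2, 3));
    }
}
```

Indexing word sequences such as "quick brown" and "brown fox" is what lets the phrase suggester compare how often candidate sequences occur, which a single-word analyzer cannot provide.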
Adding the custom analyzer to the index increases the indexing time for documents.
The following index settings are required for the phrase suggester:
{
  "settings": {
    "index": {
      "analysis": {
        "analyzer": {
          "trigram": {
            "type": "custom",
            "tokenizer": "standard",
            "filter": [
              "lowercase",
              "shingle"
            ]
          }
        },
        "filter": {
          "shingle": {
            "type": "shingle",
            "min_shingle_size": 2,
            "max_shingle_size": 3
          }
        }
      }
    }
  }
}
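The equivalent Java code is:
settings(s -> s
    .analysis(a -> a
        // Shingle filter that emits 2- and 3-word shingles
        .filter("shingle", f -> f.definition(d -> d
            .shingle(sh -> sh.minShingleSize("2").maxShingleSize("3"))))
        // Custom "trigram" analyzer: standard tokenizer, then lowercase and shingle filters
        .analyzer("trigram", an -> an
            .custom(c -> c
                .tokenizer("standard")
                .filter(List.of("lowercase", "shingle"))))
    )
)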
Change the text field mappings to use the custom analyzer:
// name: text field with a "trigram" sub-field (queried as name.trigram)
properties(SearchConstants.FIELD_NAME, p -> p.text(text -> text
    .fields("trigram", f -> f
        .text(txt -> txt.analyzer("trigram")))))
// description: keeps the "english" analyzer and adds the same "trigram" sub-field
.properties(SearchConstants.FIELD_DESCRIPTION, p -> p.text(text -> text
    .analyzer("english")
    .fields("trigram", f -> f
        .text(txt -> txt.analyzer("trigram")))))
Phrase suggester options:
Options | Description |
|---|---|
size | The number of candidate suggestions to generate for each query term. Specifying a higher value can result in terms with higher edit distances being returned. Default is 5. |
separator | The separator for the terms in the bigram field. Defaults to the space character. |
collate | Used to prune suggestions for which there are no matching documents in the index. |
collate.query | Specifies a query against which suggestions are checked, so that suggestions with no matching documents can be pruned. |
collate.prune | Specifies whether to return all suggestions. If prune is set to false, only those suggestions that have matching documents are returned. If prune is set to true, all suggestions are returned; each suggestion has an additional collate_match field that is true if the suggestion has matching documents and is false otherwise. Default is false. |
smoothing | Smoothing model to balance the weight of the shingles that exist in the index frequently with the weight of the shingles that exist in the index infrequently. |
Should we use only Term suggester or both Term and Phrase suggester?
Strategies
Approach 1: Send suggestion if empty response for search in one call
In this approach, we use the existing search API and add a suggest block to the OpenSearch request. The OpenSearch response then contains both the main query results and the suggestions. If the SearchResult has no hits, we extract the corrected term from the suggest block and send it to the frontend.
Pros:
The suggestions will be a part of the query response, thus improving the user experience.
Cons:
Our dataset already contains misspelled versions of terms. A misspelled query can therefore still match documents, so it never returns zero results and no suggestion is shown.
Running suggesters together with the search query increases the latency of the search query.
Examples of misspelled terms found in the index:
Term | Misspelled versions |
|---|---|
cancer | cance, caner, cacer and cncer |
alzheimer | alzheime, alzhimer and azheimer |
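The Approach 1 decision described above can be sketched as follows. This is a hedged illustration rather than the actual service code: the class DidYouMean, the nested Option type, and the rule of picking the highest-scoring option are assumptions made for the example.

```java
import java.util.List;
import java.util.Optional;

final class DidYouMean {

    /** One candidate correction together with the score reported by the suggester. */
    static final class Option {
        final String text;
        final double score;

        Option(String text, double score) {
            this.text = text;
            this.score = score;
        }
    }

    /** Returns a correction only when the search itself found nothing. */
    static Optional<String> suggestionFor(long found, List<Option> options) {
        if (found > 0) {
            // The query matched documents, so Approach 1 sends no suggestion.
            return Optional.empty();
        }
        Option best = null;
        for (Option option : options) {
            // Strictly greater keeps the first option when scores are tied.
            if (best == null || option.score > best.score) {
                best = option;
            }
        }
        return best == null ? Optional.empty() : Optional.of(best.text);
    }
}
```

The first branch is also where the drawback listed above shows up: if a misspelled term such as "cance" is itself indexed, the search returns hits and the user never sees "cancer" as a suggestion.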
Approach 2: Always Execute Suggestion along with Search in one call
This approach is similar to approach 1, but in this case, the suggestions will always be included in the search query response. The frontend can decide whether to show the suggestions or not.
Existing API: the Search API is already used to show results to the user. To add suggestions to the response, the current response object must be changed.
Pros:
The suggestions will always be available in the query response.
Cons:
Increases the latency of all search queries. The latency will grow further when we add other features to the main search, for example fuzzy search.
Response Object changes
SearchResult.json
{
  "description": "JSON schema for the result of a search.",
  "properties": {
    "found": {
      "type": "integer",
      "description": "The total number of hits found for this search."
    },
    "start": {
      "type": "integer",
      "description": "The zero-based number of the first hit returned in this page of search results."
    },
    "matchExpression": {
      "type": "string",
      "description": "DEPRECATED: The search match expression parsed from the search request parameters. This is useful for debugging purposes."
    },
    "hits": {
      "type": "array",
      "description": "The hits in this page of search results",
      "uniqueItems": false,
      "items": {
        "$ref": "org.sagebionetworks.repo.model.search.Hit"
      }
    },
    "facets": {
      "type": "array",
      "description": "The facets found in all results of this search.",
      "uniqueItems": false,
      "items": {
        "$ref": "org.sagebionetworks.repo.model.search.Facet"
      }
    },
    // newly added field
    "Suggestions": {
      "type": "array",
      "description": "The suggestions for the searched terms.",
      "uniqueItems": true,
      "items": {
        "$ref": "org.sagebionetworks.repo.model.search.Suggestion"
      }
    }
  }
}
Suggestion.json
{
  "description": "JSON schema for a suggestion.",
  "properties": {
    "key": {
      "type": "string",
      "description": "The search term"
    },
    "values": {
      "type": "array",
      "description": "The suggested values for search term.",
      "uniqueItems": true,
      "items": {
        "type": "string"
      }
    }
  }
}
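Because the Suggestion schema requires unique values, the options returned by several suggesters for the same term need to be merged without duplicates. A minimal sketch of that step is shown here. The class SuggestionMerger and its nested Suggestion type are hypothetical stand-ins for the generated model class, and preserving first-seen order is an assumption.

```java
import java.util.ArrayList;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Set;

final class SuggestionMerger {

    /** Simplified stand-in for the Suggestion schema: a search term and its unique suggestions. */
    static final class Suggestion {
        final String key;
        final List<String> values;

        Suggestion(String key, List<String> values) {
            this.key = key;
            this.values = values;
        }
    }

    /** Merges the option lists of several suggesters, dropping duplicates and keeping first-seen order. */
    static Suggestion merge(String searchTerm, List<List<String>> optionLists) {
        Set<String> unique = new LinkedHashSet<>();
        for (List<String> options : optionLists) {
            unique.addAll(options);
        }
        return new Suggestion(searchTerm, new ArrayList<>(unique));
    }
}
```

For the term "biobak", merging one suggester's options (biobank, biobark, biobatk) with another's (biobank, biobark, biobask, biobatk) yields the values biobank, biobark, biobatk, biobask.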
On average, adding suggestions increases the search query time by about 200 ms. In the sample measurement below, the difference is 128 ms (606 ms versus 734 ms).
Query | Result | Time |
|---|---|---|
GET _search | { …….. | 606 ms |
GET _search with suggesters (request below) | Response with suggest block (below) | 734 ms |
Request with suggesters:
GET _search
{
  "query": { "simple_query_string": { "query": "Uk biobak" } },
  "suggest": {
    "text": "biobak",
    "spell_check1": { "term": { "field": "name", "suggest_mode": "missing" } },
    "spell_check2": { "term": { "field": "description", "suggest_mode": "missing" } },
    "phrase-check": { "text": "Uk biobak", "phrase": { "field": "name.trigram" } },
    "phrase-check1": { "text": "Uk biobak", "phrase": { "field": "description.trigram" } }
  }
}
Response:
{
  "took": 39,
  "timed_out": false,
  "_shards": { "total": 0, "successful": 0, "skipped": 0, "failed": 0 },
  "hits": {….. },
  "suggest": {
    "phrase-check": [
      {
        "text": "Uk biobak", "offset": 0, "length": 9,
        "options": [
          { "text": "uk biobank", "score": 0.41494516 },
          { "text": "uk biobatk", "score": 0.41494516 },
          { "text": "uk biobark", "score": 0.3803136 }
        ]
      }
    ],
    "phrase-check1": [
      {
        "text": "Uk biobak", "offset": 0, "length": 9,
        "options": [
          { "text": "uk biobank", "score": 0.2575014 },
          { "text": "uk biobatk", "score": 0.2575014 },
          { "text": "uk biobark", "score": 0.221984 },
          { "text": "uk biobask", "score": 0.221984 }
        ]
      }
    ],
    "spell_check1": [
      {
        "text": "biobak", "offset": 0, "length": 6,
        "options": [
          { "text": "biobank", "score": 0.8333333, "freq": 1 },
          { "text": "biobark", "score": 0.8333333, "freq": 1 },
          { "text": "biobatk", "score": 0.8333333, "freq": 1 }
        ]
      }
    ],
    "spell_check2": [
      {
        "text": "biobak", "offset": 0, "length": 6,
        "options": [
          { "text": "biobank", "score": 0.8333333, "freq": 1 },
          { "text": "biobark", "score": 0.8333333, "freq": 1 },
          { "text": "biobask", "score": 0.8333333, "freq": 1 },
          { "text": "biobatk", "score": 0.8333333, "freq": 1 }
        ]
      }
    ]
  }
}
Approach 3: Separate API for suggestions (Recommended)
In this approach, we will create a new API endpoint for suggestions that can be invoked independently of the search query. The endpoint will be a POST API. I am choosing POST because the term and phrase suggesters have filter options that we might want to expose as user input later. The first implementation will use the default behavior and filter options.
New API endpoint
https://repo-prod.prod.sagebase.org/repo/v1/suggestion
Request Body
SuggestionQuery.json
{
  "description": "JSON schema for a suggestion query.",
  "properties": {
    "searchTerm": {
      "type": "array",
      "description": "The search terms for which suggestions are requested.",
      "uniqueItems": false,
      "items": {
        "type": "string"
      }
    }
  }
}
Response Body
SuggestionResult.json
{
  "description": "JSON schema for the suggestions for the search terms.",
  "properties": {
    "Suggestions": {
      "type": "array",
      "description": "The suggestions for the searched terms.",
      "uniqueItems": true,
      "items": {
        "$ref": "org.sagebionetworks.repo.model.search.Suggestion"
      }
    }
  }
}
Suggestion.json
{
  "description": "JSON schema for a suggestion.",
  "properties": {
    "key": {
      "type": "string",
      "description": "The search term"
    },
    "values": {
      "type": "array",
      "description": "The suggested values for search term.",
      "uniqueItems": true,
      "items": {
        "type": "string"
      }
    }
  }
}
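As an illustration of how a client might call the proposed endpoint, the sketch below builds (but does not send) a POST request whose body follows SuggestionQuery.json. It uses only the JDK's java.net.http API. The class name SuggestionRequestSketch, the Content-Type header, and the minimal string escaping are assumptions made for the example, and a real client would normally use a JSON library.

```java
import java.net.URI;
import java.net.http.HttpRequest;
import java.util.List;
import java.util.stream.Collectors;

final class SuggestionRequestSketch {

    /** Serializes the search terms into a SuggestionQuery body: {"searchTerm":[...]}. */
    static String toJsonBody(List<String> searchTerms) {
        return searchTerms.stream()
                // Minimal escaping for the sketch: backslashes and double quotes only.
                .map(term -> "\"" + term.replace("\\", "\\\\").replace("\"", "\\\"") + "\"")
                .collect(Collectors.joining(",", "{\"searchTerm\":[", "]}"));
    }

    /** Builds the POST request for the suggestion endpoint without sending it. */
    static HttpRequest build(String endpoint, List<String> searchTerms) {
        return HttpRequest.newBuilder(URI.create(endpoint))
                .header("Content-Type", "application/json") // assumed content type
                .POST(HttpRequest.BodyPublishers.ofString(toJsonBody(searchTerms)))
                .build();
    }

    public static void main(String[] args) {
        HttpRequest request = build(
                "https://repo-prod.prod.sagebase.org/repo/v1/suggestion",
                List.of("Uk biobak"));
        System.out.println(request.method() + " " + request.uri());
        System.out.println(toJsonBody(List.of("Uk biobak")));
    }
}
```

Because the request is independent of the search call, the frontend can issue it only when needed, for example after a search returns no hits.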
Pros:
The frontend decides whether to invoke the suggestions API endpoint.
No extra latency is added to the search query.
Cons:
Additional network traffic, plus the operational overhead of maintaining another API endpoint.