OpenSearch Feature "Did you mean"

1 Introduction
2 Suggesters
3 Strategies

Introduction

The "Did You Mean" feature significantly enhances the user experience by mitigating the frustrating "No Results Found" scenario caused by typos and misspellings. When a user submits a query that fails to return any documents, the system automatically checks for the most likely correction and presents it as a suggestion. This functionality is achieved using OpenSearch's Suggester, which is separate from the core search logic.

Suggesters

The OpenSearch Suggesters are a powerful search feature used to provide relevant suggestions and corrections for terms submitted by users.

Term Suggester : The term suggester suggests corrected spellings for individual words. It uses edit distance to compute suggestions. The edit distance is the number of single-character insertions, deletions, or substitutions needed to turn one term into another. For example, to change the word “cat” to “hats”, you need to substitute “h” for “c” and insert an “s”, so the edit distance in this case is 2.
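As a rough illustration of the edit distance described above, the following Java sketch computes the plain Levenshtein distance (insertions, deletions, substitutions). It only illustrates the definition and is not how OpenSearch computes suggestions internally; the class and method names are hypothetical.

```java
// Illustrative only: plain Levenshtein distance as defined in the text,
// not OpenSearch's internal implementation. Names are hypothetical.
class EditDistance {

    /** Number of single-character insertions, deletions, or substitutions to turn a into b. */
    static int distance(String a, String b) {
        int[] prev = new int[b.length() + 1];
        int[] curr = new int[b.length() + 1];
        for (int j = 0; j <= b.length(); j++) {
            prev[j] = j; // turning "" into the first j characters of b takes j insertions
        }
        for (int i = 1; i <= a.length(); i++) {
            curr[0] = i; // turning the first i characters of a into "" takes i deletions
            for (int j = 1; j <= b.length(); j++) {
                int cost = a.charAt(i - 1) == b.charAt(j - 1) ? 0 : 1;
                int insertOrDelete = Math.min(curr[j - 1] + 1, prev[j] + 1);
                curr[j] = Math.min(insertOrDelete, prev[j - 1] + cost);
            }
            int[] tmp = prev; // reuse the two rows instead of a full matrix
            prev = curr;
            curr = tmp;
        }
        return prev[b.length()];
    }

    public static void main(String[] args) {
        System.out.println(distance("cat", "hats"));       // prints 2
        System.out.println(distance("biobak", "biobank")); // prints 1
    }
}
```

With the default max_edits of 2, a candidate such as “biobank” (distance 1 from “biobak”) is within range.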

To use the term suggester, we don’t need any special field mappings for our index. By default, string field types are mapped as text.

Term suggester options. The following options can be used to change the default behavior.

Options	Description
size	The maximum number of suggestions to return for each token in the input text.
sort	Specifies how suggestions should be sorted in the response. Valid values are: - score: Sort by similarity score, then document frequency, and then the term itself. - frequency: Sort by document frequency, then similarity score, and then the term itself.
suggest_mode	The suggest mode specifies the terms for which suggestions should be included in the response. Valid values are: - missing: Return suggestions only for input terms that have zero occurrences in the specified field of the index. This check is field specific: if a term appears in other fields but not in the targeted field, it is still considered missing. Note that this mode does not consider term frequency across the entire index—only the specified field. - popular: Return suggestions only if they occur in the documents more frequently than in the original input text. - always: Always return suggestions for each term in the input text. Default is missing.
max_edits	The maximum edit distance for suggestions. Valid values are in the [1, 2] range. Default is 2.

Phrase Suggester : The phrase suggester is similar to the term suggester, except it uses n-gram language models to suggest whole phrases instead of individual words.

To set up a phrase suggester, create a custom analyzer called trigram that uses a shingle filter and lowercases tokens. This filter is similar to the edge_ngram filter, but it applies to words instead of letters. Then configure the field from which you’ll be sourcing suggestions with the custom analyzer you created.
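To make the shingle filter more concrete, here is a small Java sketch that builds word-level shingles of 2 to 3 tokens from lowercased text. It is only an approximation of what the filter emits: it splits on whitespace instead of using the standard tokenizer, and it leaves out the single-word tokens that, as far as we know, the real filter also outputs by default. ShingleSketch and its method are hypothetical names.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Locale;

// Illustrative only: approximates the output of a lowercase + shingle filter chain.
class ShingleSketch {

    /** Returns all word n-grams of minSize..maxSize tokens, in order of their start position. */
    static List<String> shingles(String text, int minSize, int maxSize) {
        // Assumption: whitespace splitting stands in for the standard tokenizer.
        String[] tokens = text.toLowerCase(Locale.ROOT).trim().split("\\s+");
        List<String> out = new ArrayList<>();
        for (int start = 0; start < tokens.length; start++) {
            for (int size = minSize; size <= maxSize && start + size <= tokens.length; size++) {
                out.add(String.join(" ", Arrays.copyOfRange(tokens, start, start + size)));
            }
        }
        return out;
    }

    public static void main(String[] args) {
        // prints [quick brown, quick brown fox, brown fox, brown fox jumps, fox jumps]
        System.out.println(shingles("Quick brown fox jumps", 2, 3));
    }
}
```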

We need a custom analyzer because the default analyzer does not store the necessary information to evaluate the context and probability of word sequences. Phrase Suggester's goal is to correct misspellings in a sequence (e.g., correcting "quick box down" to the more probable "quick brown fox"). To do this, it needs a Language Model that assesses the likelihood of a word following another word.

Adding the custom analyzer to the index will increase the indexing time for documents.

The following index setting changes are required for the phrase suggester.

"settings": {
    "index": {
      "analysis": {
        "analyzer": {
          "trigram": {
            "type": "custom",
            "tokenizer": "standard",
            "filter": [
              "lowercase",
              "shingle"
            ]
          }
        },
        "filter": {
          "shingle": {
            "type": "shingle",
            "min_shingle_size": 2,
            "max_shingle_size": 3
          }
        }
      }
    }
}

The equivalent Java code is:
    settings(s -> s
            .analysis(a ->a
                    .filter("shingle", f -> f.definition( d -> d
                            .shingle(sh -> sh.minShingleSize("2").maxShingleSize("3"))))
                    .analyzer("trigram", an ->an
                            .custom(c ->c
                                    .tokenizer("standard")
                                    .filter(List.of("lowercase", "shingle"))))
            )
    )

Change the text field mappings to use the custom analyzer:

    .properties(SearchConstants.FIELD_NAME, p -> p.text(text -> text
            .fields("trigram", f -> f
                    .text(txt -> txt.analyzer("trigram")))))
    .properties(SearchConstants.FIELD_DESCRIPTION, p -> p.text(text -> text
            .analyzer("english")
            .fields("trigram", f -> f
                    .text(txt -> txt.analyzer("trigram")))))

Phrase Suggester Options

Options	Description
size	The number of candidate suggestions to generate for each query term. Specifying a higher value can result in terms with higher edit distances being returned. Default is 5.
separator	The separator for the terms in the bigram field. Defaults to the space character.
collate	Used to prune suggestions for which there are no matching documents in the index.
collate.query	Specifies a query against which suggestions are checked, so that suggestions with no matching documents in the index can be pruned.
collate.prune	Specifies whether to return all suggestions. If prune is set to false, only those suggestions that have matching documents are returned. If prune is set to true, all suggestions are returned; each suggestion has an additional collate_match field that is true if the suggestion has matching documents and is false otherwise. Default is false.
smoothing	Smoothing model to balance the weight of the shingles that exist in the index frequently with the weight of the shingles that exist in the index infrequently.

Open question: should we use only the term suggester, or both the term and phrase suggesters?

Strategies

Approach 1: Send suggestion if empty response for search in one call

In this approach, we use our existing search API. The OpenSearch request is modified by adding a new suggest block, so the OpenSearch response contains the main query results as well as the suggestions. If the SearchResult has no hits, we extract the corrected term from the suggest block and send it to the frontend.
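The decision described above can be sketched in Java as follows. This is a minimal sketch under stated assumptions: DidYouMean, Option, and suggestionFor are hypothetical names, and choosing the highest-scoring option as "the corrected term" is an assumption, since the selection rule is not fixed in this document.

```java
import java.util.Comparator;
import java.util.List;
import java.util.Optional;

// Hypothetical helper: returns a suggestion only when the search found nothing.
class DidYouMean {

    /** One candidate correction, mirroring the "text" and "score" fields of a suggester option. */
    static final class Option {
        final String text;
        final double score;

        Option(String text, double score) {
            this.text = text;
            this.score = score;
        }
    }

    /**
     * @param found   total number of hits of the main query
     * @param options candidate corrections returned by the suggest block
     */
    static Optional<String> suggestionFor(long found, List<Option> options) {
        if (found > 0) {
            return Optional.empty(); // results exist, so no correction is sent
        }
        // Assumption: the highest-scoring option is treated as the corrected term.
        return options.stream()
                .max(Comparator.comparingDouble((Option o) -> o.score))
                .map(o -> o.text);
    }

    public static void main(String[] args) {
        List<Option> options = List.of(
                new Option("uk biobark", 0.38),
                new Option("uk biobank", 0.41));
        System.out.println(suggestionFor(0, options)); // prints Optional[uk biobank]
        System.out.println(suggestionFor(3, options)); // prints Optional.empty
    }
}
```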

Pros:

The suggestions will be a part of the query response, thus improving the user experience.

Cons:

Our dataset already contains misspelled versions of terms. A misspelled query may therefore still match documents and never return zero results, in which case no suggestion is sent.
Executing suggestions with a search query will increase the latency of the search query.

Misspelled examples from the index:

Term	Misspelled versions
cancer	cance, caner, cacer and cncer
alzheimer	alzheime, alzhimer and azheimer

Approach 2: Always Execute Suggestion along with Search in one call

This approach is similar to approach 1, but in this case, the suggestions will always be included in the search query response. The frontend can decide whether to show the suggestions or not.

Existing API : The search API is already available to show results to the user. To add suggestions to the response, the current response object must be changed.

Pros:

The suggestions will always be available in the query response.

Cons:

Increases the latency of all search queries. The latency will increase further when we add features to the main search, for example fuzzy search.

Response Object changes

SearchResult.json

{
	"description": "JSON schema for the result of a search.",
	"properties": {
		"found": {
			"type": "integer",
			"description": "The total number of hits found for this search."
		},
		"start": {
			"type": "integer",
			"description": "The zero-based number of the first hit returned in this page of search results."
		},
		"matchExpression": {
			"type": "string",
			"description": "DEPRECATED: The search match expression parsed from the search request parameters.  This is useful for debugging purposes."
		},
		"hits": {
			"type": "array",
			"description": "The hits in this page of search results",
			"uniqueItems": false,
			"items": {
				"$ref": "org.sagebionetworks.repo.model.search.Hit"
			}
		},
		"facets": {
			"type": "array",
			"description": "The facets found in all results of this search.",
			"uniqueItems": false,
			"items": {
				"$ref": "org.sagebionetworks.repo.model.search.Facet"
			}
		},
		// newly added field
		"Suggestions": {
			"type": "array",
			"description": "The suggestions for the searched terms.",
			"uniqueItems": true,
			"items": {
				"$ref": "org.sagebionetworks.repo.model.search.Suggestion"
			}
		}
	}
}

Suggestion.json

{
    "description":"JSON schema for a suggestion.",
    "properties":{
        "key":{
        	"type":"string",
        	"description":"The search term"
        },
        "values":{
        	"type":"array",
        	"description":"The suggested values for search term.",
            "uniqueItems":true,
            "items":{
                "type":"string"
            }                    	
        	
        }
    }
}

On average, adding suggestions increases the search query time by ~200ms.

Measured query, result, and time

Query (search only):

GET _search
{
    "query": {
        "simple_query_string": {
            "query": "Biobatk"
        }
    }
}

Result:

{
    "took": 18,
    "timed_out": false,
    "_shards": {
        "total": 0,
        "successful": 0,
        "skipped": 0,
        "failed": 0
    },
    "hits": {
        "total": {
            "value": 1,
            "relation": "eq"
        },
        "max_score": 1.3205979,
        "hits": [
            ……..
        ]
    }
}

Time: 606ms

Query (search with suggestions):

GET _search
{
    "query": {
        "simple_query_string": {
            "query": "Uk biobak"
        }
    },
    "suggest": {
        "text": "biobak",
        "spell_check1": {
            "term": {
                "field": "name",
                "suggest_mode": "missing"
            }
        },
        "spell_check2": {
            "term": {
                "field": "description",
                "suggest_mode": "missing"
            }
        },
        "phrase-check": {
            "text": "Uk biobak",
            "phrase": {
                "field": "name.trigram"
            }
        },
        "phrase-check1": {
            "text": "Uk biobak",
            "phrase": {
                "field": "description.trigram"
            }
        }
    }
}

Result:

{
    "took": 39,
    "timed_out": false,
    "_shards": {
        "total": 0,
        "successful": 0,
        "skipped": 0,
        "failed": 0
    },
    "hits": {…..
    },
    "suggest": {
        "phrase-check": [
            {
                "text": "Uk biobak",
                "offset": 0,
                "length": 9,
                "options": [
                    { "text": "uk biobank", "score": 0.41494516 },
                    { "text": "uk biobatk", "score": 0.41494516 },
                    { "text": "uk biobark", "score": 0.3803136 }
                ]
            }
        ],
        "phrase-check1": [
            {
                "text": "Uk biobak",
                "offset": 0,
                "length": 9,
                "options": [
                    { "text": "uk biobank", "score": 0.2575014 },
                    { "text": "uk biobatk", "score": 0.2575014 },
                    { "text": "uk biobark", "score": 0.221984 },
                    { "text": "uk biobask", "score": 0.221984 }
                ]
            }
        ],
        "spell_check1": [
            {
                "text": "biobak",
                "offset": 0,
                "length": 6,
                "options": [
                    { "text": "biobank", "score": 0.8333333, "freq": 1 },
                    { "text": "biobark", "score": 0.8333333, "freq": 1 },
                    { "text": "biobatk", "score": 0.8333333, "freq": 1 }
                ]
            }
        ],
        "spell_check2": [
            {
                "text": "biobak",
                "offset": 0,
                "length": 6,
                "options": [
                    { "text": "biobank", "score": 0.8333333, "freq": 1 },
                    { "text": "biobark", "score": 0.8333333, "freq": 1 },
                    { "text": "biobask", "score": 0.8333333, "freq": 1 },
                    { "text": "biobatk", "score": 0.8333333, "freq": 1 }
                ]
            }
        ]
    }
}

Time: 734 ms
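The suggest response returns overlapping options from several suggesters (for example, one term suggester per field), while Suggestion.json requires the values for a key to be unique. One possible way to collapse the options is sketched below. SuggestionMerger and uniqueValues are hypothetical names, and keeping the best score per text and ordering by score is an assumption, not a decided design.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical helper: merges options from several suggesters into unique values.
class SuggestionMerger {

    /**
     * @param options (text, score) pairs collected from all suggesters for one search term
     * @return unique texts, highest score first; ties are ordered alphabetically
     */
    static List<String> uniqueValues(List<Map.Entry<String, Double>> options) {
        Map<String, Double> best = new HashMap<>();
        for (Map.Entry<String, Double> option : options) {
            best.merge(option.getKey(), option.getValue(), Double::max); // keep the best score per text
        }
        List<String> values = new ArrayList<>(best.keySet());
        values.sort((x, y) -> {
            int byScore = Double.compare(best.get(y), best.get(x)); // higher score first
            return byScore != 0 ? byScore : x.compareTo(y);
        });
        return values;
    }

    public static void main(String[] args) {
        List<Map.Entry<String, Double>> options = List.of(
                Map.entry("biobank", 0.8333333), Map.entry("biobark", 0.8333333),
                Map.entry("biobatk", 0.8333333), Map.entry("biobank", 0.8333333),
                Map.entry("biobask", 0.8333333));
        System.out.println(uniqueValues(options)); // prints [biobank, biobark, biobask, biobatk]
    }
}
```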

Approach 3: Separate API for suggestions (Recommended)

In this approach, we create a new API endpoint for suggestions that can be invoked independently of the search query. It will be a POST API. I am choosing POST because the term and phrase suggesters have filter options that we might want to accept as user input later. The first implementation will use the default behavior and filters.

New API endpoint

https://repo-prod.prod.sagebase.org/repo/v1/suggestion

Request Body

SuggestionQuery.json

{
    "description":"JSON schema for a suggestion query.",
    "properties":{
        "searchTerm":{
            "type":"array",
            "description":"The search terms for which suggestions are requested.",
            "uniqueItems":false,
            "items":{
                "type":"string"
            }
        }
    }
}
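As a sketch of how a client might call the new endpoint, the following Java code builds a POST request whose body follows SuggestionQuery.json. The class and method names are hypothetical, the Content-Type header is an assumption, and the hand-written JSON escaping only covers quotes and backslashes; a real client would use its existing JSON serialization.

```java
import java.net.URI;
import java.net.http.HttpRequest;
import java.util.List;
import java.util.stream.Collectors;

// Hypothetical client-side sketch for the proposed suggestion endpoint.
class SuggestionRequestSketch {

    static final URI ENDPOINT = URI.create("https://repo-prod.prod.sagebase.org/repo/v1/suggestion");

    /** Serializes the search terms as {"searchTerm":[...]}; escapes only quotes and backslashes. */
    static String toJsonBody(List<String> searchTerms) {
        return searchTerms.stream()
                .map(term -> "\"" + term.replace("\\", "\\\\").replace("\"", "\\\"") + "\"")
                .collect(Collectors.joining(",", "{\"searchTerm\":[", "]}"));
    }

    /** Builds (but does not send) the POST request. */
    static HttpRequest build(List<String> searchTerms) {
        return HttpRequest.newBuilder(ENDPOINT)
                .header("Content-Type", "application/json") // assumption: the API accepts JSON
                .POST(HttpRequest.BodyPublishers.ofString(toJsonBody(searchTerms)))
                .build();
    }

    public static void main(String[] args) {
        System.out.println(toJsonBody(List.of("Uk biobak"))); // prints {"searchTerm":["Uk biobak"]}
        System.out.println(build(List.of("Uk biobak")).method()); // prints POST
    }
}
```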

Response Body

SuggestionResult.json

{
    "description": "JSON schema for the suggestions for the search terms.",
    "properties": {
        "Suggestions": {
            "type": "array",
            "description": "The suggestions for the searched terms.",
            "uniqueItems": true,
            "items": {
                "$ref": "org.sagebionetworks.repo.model.search.Suggestion"
            }
        }
    }
}

Suggestion.json

{
    "description":"JSON schema for a suggestion.",
    "properties":{
        "key":{
        	"type":"string",
        	"description":"The search term"
        },
        "values":{
        	"type":"array",
        	"description":"The suggested values for search term.",
            "uniqueItems":true,
            "items":{
                "type":"string"
            }                    	
        	
        }
    }
}

Pros:

The frontend decides whether to invoke the suggestions API endpoint.
No extra latency added to the search query.

Cons:

Additional network bandwidth, plus the operational overhead of maintaining an additional API endpoint.