CloudSearch to OpenSearch Migration

Jira: PLFM-8728

Background

Synapse offers a search feature for users, accessible through a dedicated Search API (SearchQuery | Synapse REST API). Initially, search was implemented using AWS CloudSearch. However, due to its limitations and the fact that it is no longer available for new deployments, it is being phased out. While AWS has not officially announced an end-of-support date for existing users, it recommends migrating to Amazon OpenSearch Service. More information is available here: Transition from Amazon CloudSearch to Amazon OpenSearch Service | Amazon Web Services

CloudSearch limitations solved by OpenSearch

 

| Category | CloudSearch limitation | OpenSearch solution |
| --- | --- | --- |
| Query Language | Limited query flexibility | Full Query DSL (JSON-based; supports bool, range, fuzziness, slop, intervals, etc.) |
| Custom Ranking | Minimal relevance tuning (only via expr) | Function score queries, script scoring, and field boosting for advanced tuning |
| Multi-field search | No native multi-field search | multi_match queries search across multiple fields simultaneously |
| Field Types | Limited field types (no boolean, nested, etc.) | Wide support: text, keyword, boolean, geo_point, nested, etc. |
| Monitoring | No detailed logging or query trace | Built-in slow query logs, profiling, and monitoring via CloudWatch and APIs |
| Aggregation/Facets | Limited aggregation capabilities (facets only) | Full aggregations framework: terms, range, date_range, etc. |
| Security | IAM-based security only | Fine-grained access control (roles, field-level, and document-level security) |
| Data ingestion | Limited ingest and update options (batch document uploads only) | Bulk API, ingestion pipelines, Logstash, and real-time indexing |
| Testing | No testing tools or dev utilities | OpenSearch Dashboards with Dev Tools, query profiling, and real-time testing |
| Scaling & Performance Tuning | Scaling is automatic but not tunable | Control over shards, replicas, and index-level tuning, or a serverless option |
| Integration | Limited integration ecosystem | Integrates with OpenSearch Dashboards (Kibana), Beats, Logstash, Grafana, etc. |
| Autocomplete | Simple suggesters | Completion suggesters, edge n-grams, and full control |

OpenSearch Introduction

OpenSearch is a distributed search and analytics engine. After adding data to OpenSearch, we can perform full-text searches on it with all of the features we might expect: search by field, search multiple indexes, boost fields, rank results by score, sort results by field, and aggregate results. For more information, see the OpenSearch documentation.

 

OpenSearch Terminology

Document: A document is a unit that stores information (text or structured data). In OpenSearch, documents are stored in JSON format.

Index: An index is a collection of documents.

Node: A server that stores data and handles search and indexing operations.

Cluster: An OpenSearch cluster is a collection of nodes.

Shard: A shard is a horizontal partition of data within an index. Each shard holds a subset of the index’s documents, enabling OpenSearch to scale data and queries across multiple nodes.

Primary and replica shards: OpenSearch automatically creates one replica shard for each primary shard by default. For example, if an index is split into 10 primary shards, OpenSearch will create 10 corresponding replica shards to enhance fault tolerance and availability (a brief index-settings sketch follows this terminology section).

Inverted index: OpenSearch uses an inverted index to perform fast full-text searches. This data structure maps each term (word) to the list of documents that contain it.

For example, consider two documents:

  • Document 1: “Beauty is in the eye of the beholder”

  • Document 2: “Beauty and the beast”

| Word | Documents |
| --- | --- |
| beauty | 1, 2 |
| is | 1 |
| and | 2 |

Relevance: When a search query is executed, OpenSearch matches the query terms against the indexed documents and assigns a relevance score to each result. This score indicates how closely a document matches the query criteria.
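As a concrete reference for the shard and replica terminology above, the following Dev Tools sketch creates an index with explicit primary and replica shard counts. The index name and values are illustrative and were not used in the evaluation; this applies to provisioned OpenSearch domains, since OpenSearch Serverless manages shards automatically.

PUT sample-index
{
  "settings": {
    "index": {
      "number_of_shards": 10,
      "number_of_replicas": 1
    }
  }
}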

OpenSearch deployment options

 

OpenSearch Service domain: Amazon OpenSearch Service provides a managed environment to deploy and operate OpenSearch clusters. It gives you full control over configuration, including instance types, storage, and network settings, and it supports fine-tuned performance optimization, availability zones, VPC access, and security configurations. This option requires in-depth knowledge of cluster management and maintenance and is better aligned with a long-term, always-on deployment.

OpenSearch Serverless: Amazon OpenSearch Serverless is an on-demand, serverless option for Amazon OpenSearch Service that eliminates the operational complexity of provisioning, configuring, and tuning OpenSearch clusters. With OpenSearch Serverless, we can search and analyze large volumes of data without managing the underlying infrastructure. An OpenSearch Serverless collection is a group of OpenSearch indexes that work together to support a specific workload or use case. Collections simplify operations compared to self-managed OpenSearch clusters, which require manual provisioning. For more information, see What is Amazon OpenSearch Serverless? - Amazon OpenSearch Service

Types of collections in OpenSearch Serverless

 

Search: Full-text search over natural-language text, where documents are indexed with analyzers (tokenizers, stemmers, etc.) to support ranking, relevance, and partial matching. The main use case is finding relevant documents based on user-entered keywords, when ranking, matching accuracy, and highlighting matter and the data is text-heavy.

 

Vector Search: Semantic or ML-based search that matches documents by comparing vector embeddings (numeric representations of meaning) instead of keywords. It is ideal for semantic similarity, natural-language queries, or ML-driven matching. The main use cases are retrieving content based on meaning (not just keywords), using embeddings from models such as BERT or OpenAI, and building AI-powered features like chatbots or recommendations.

 

Time Series: Time series search focuses on analyzing machine-generated, timestamped data such as logs, metrics, and events. The goal is often operational insight, security monitoring, or business performance tracking.

Feasibility Evaluation

 

To assess whether the features currently supported by CloudSearch can be replicated in OpenSearch, I explored the OpenSearch Serverless offering. OpenSearch Serverless supports three collection types: Search, Time Series, and Vector. Since our requirement focuses on full-text search, I created a collection of type "Search", which is specifically designed for full-text search use cases. The following steps outline the approach taken during this evaluation:

 

Data Ingestion

 

A couple of documents were ingested into an index within the selected collection. The document format remains consistent with how documents were previously stored in CloudSearch. For more information, see this.

@Test
public void testDocumentUploadToOpenSearch() throws IOException {
    // create a document
    Document doc = createDocument();

    SdkHttpClient httpClient = ApacheHttpClient.builder().build();
    OpenSearchClient client = new OpenSearchClient(
        new AwsSdk2Transport(
            httpClient,
            "tlenbt......amazonaws.com",             // serverless collection endpoint
            "aoss",                                  // signing service name
            Region.US_EAST_1,                        // signing service region
            AwsSdk2TransportOptions.builder().build()
        )
    );

    String index = "sample-index";

    // index the document
    IndexRequest<Document> request = new IndexRequest.Builder<Document>()
        .index(index)
        .id(doc.getId())
        .document(doc)
        .build();
    client.index(request);

    // query the index to verify the upload
    SearchRequest searchRequest = SearchRequest.of(r -> r
        .index(index)
        .query(q -> q.matchAll(m -> m))
        .from(2)
        .size(10)
    );
    SearchResponse<Map> response = client.search(searchRequest, Map.class);
    for (Hit<Map> hit : response.hits().hits()) {
        System.out.println("Doc ID: " + hit.id());
        System.out.println("Source: " + hit.source());
    }
}

public Document createDocument() {
    org.sagebionetworks.repo.model.Node node = new Node();
    node.setName("this is second folder");
    node.setDescription("second has some data");
    node.setId("syn1");
    node.setParentId("1234");
    node.setETag("0");
    node.setNodeType(EntityType.folder);
    Long nonexistantPrincipalId = 42L;
    node.setCreatedByPrincipalId(nonexistantPrincipalId);
    node.setCreatedOn(new Date());
    node.setModifiedByPrincipalId(nonexistantPrincipalId);
    node.setModifiedOn(new Date());
    node.setVersionLabel("versionLabel");

    Annotations additionalAnnos = new Annotations();
    AnnotationsV2TestUtils.putAnnotations(additionalAnnos, "organ", "kidney", AnnotationsValueType.STRING);
    AnnotationsV2TestUtils.putAnnotations(additionalAnnos, "longKey", "10", AnnotationsValueType.LONG);
    AnnotationsV2TestUtils.putAnnotations(additionalAnnos, "tissue", "eye lid", AnnotationsValueType.STRING);
    AnnotationsV2TestUtils.putAnnotations(additionalAnnos, "consortium", "C O N S O R T I U M", AnnotationsValueType.STRING);
    AnnotationsV2TestUtils.putAnnotations(additionalAnnos, "diagnosis", "2", AnnotationsValueType.LONG);
    String dateValue = Long.toString(System.currentTimeMillis());
    AnnotationsV2TestUtils.putAnnotations(additionalAnnos, "dateKey", dateValue, AnnotationsValueType.TIMESTAMP_MS);

    Set<ACCESS_TYPE> rwAccessType = new HashSet<ACCESS_TYPE>();
    rwAccessType.add(ACCESS_TYPE.READ);
    rwAccessType.add(ACCESS_TYPE.UPDATE);
    ResourceAccess rwResourceAccess = new ResourceAccess();
    rwResourceAccess.setPrincipalId(123L); // readWriteTest@sagebase.org
    rwResourceAccess.setAccessType(rwAccessType);

    Set<ACCESS_TYPE> roAccessType = new HashSet<ACCESS_TYPE>();
    roAccessType.add(ACCESS_TYPE.READ);
    ResourceAccess roResourceAccess = new ResourceAccess();
    roResourceAccess.setPrincipalId(456L); // readOnlyTest@sagebase.org
    roResourceAccess.setAccessType(roAccessType);

    Set<ResourceAccess> resourceAccesses = new HashSet<ResourceAccess>();
    resourceAccesses.add(rwResourceAccess);
    resourceAccesses.add(roResourceAccess);
    AccessControlList acl = new AccessControlList();
    acl.setResourceAccess(resourceAccesses);

    String wikiPageText = "title\nmarkdown\nwiki is useful to find out content";
    return searchDocumentDriver.formulateSearchDocument(node, additionalAnnos, acl, wikiPageText);
}

 

Search

 

To test search capabilities, I used OpenSearch Dashboards to execute queries with various filters. This approach provided a quick and efficient way to evaluate whether the necessary filters could be applied, ensuring compatibility with the current search functionality.

The entries below map each current search feature to an equivalent OpenSearch query, notes on its behavior, and the observed result.

queryTerm

GET sample-index/_search
{
  "query": {
    "simple_query_string": {
      "fields": ["fields.name", "fields.description"],
      "query": "second folder replica"
    }
  }
}

 

The default operator is OR, meaning the query matches documents containing any of the query terms (e.g., "second" or "folder" or "replica"). We can refine the search by specifying the fields in which the query terms should be matched.

 

We can control the minimum number of terms that a document must match to be returned in the results by specifying the minimum_should_match parameter, as in the sketch below.
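For example, the following sketch (the value is illustrative and was not part of the recorded evaluation queries) requires at least two of the three terms to match:

GET sample-index/_search
{
  "query": {
    "simple_query_string": {
      "fields": ["fields.name", "fields.description"],
      "query": "second folder replica",
      "minimum_should_match": 2
    }
  }
}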

 

If no query term is provided, return all documents.

GET sample-index/_search
{
  "query": {
    "match_all": {}
  }
}

 

OpenSearch Full-text queries

 

{ "took": 25, "timed_out": false, "_shards": { "total": 0, "successful": 0, "skipped": 0, "failed": 0 }, "hits": { "total": { "value": 3, "relation": "eq" }, "max_score": 1.1130829, "hits": [ { "_index": "sample-index", "_id": "syn2", "_score": 1.1130829, "_source": { "type": "add", "id": "syn2", "fields": { "name": "this is second folder", "description": """title markdown wiki is useful to find out content""", "parent_id": "1234", "node_type": "folder", "etag": "0", "created_on": 1747084178, "modified_on": 1747084178, "created_by": "42", "modified_by": "42", "acl": [ "456", "123" ], "update_acl": [ "123" ], "diagnosis": "2", "tissue": "eye lid", "consortium": "C O N S O R T I U M", "organ": "kidney" } } }, { "_index": "sample-index", "_id": "syn3", "_score": 0.9186288, "_source": { "type": "add", "id": "syn3", "fields": { "name": "replica", "description": """title markdown wiki is useful to find out content""", "parent_id": "4567", "node_type": "folder", "etag": "0", "created_on": 1747181404, "modified_on": 1747181404, "created_by": "45", "modified_by": "45", "acl": [ "456", "123" ], "update_acl": [ "123" ], "diagnosis": "2", "tissue": "eye lid", "consortium": "C O N S O R T I U M", "organ": "any" } } }, { "_index": "sample-index", "_id": "syn1", "_score": 0.2876821, "_source": { "type": "add", "id": "syn1", "fields": { "name": "this is first folder", "description": """title markdown wiki is useful to find out content""", "parent_id": "1234", "node_type": "folder", "etag": "0", "created_on": 1747084018, "modified_on": 1747084018, "created_by": "42", "modified_by": "42", "acl": [ "456", "123" ], "update_acl": [ "123" ], "diagnosis": "1", "tissue": "ear lobe", "consortium": "C O N S O R T I U M", "organ": "ORGAN" } } } ] } }

booleanQuery

GET sample-index/_search
{
  "query": {
    "bool": {
      "must": [
        {
          "simple_query_string": {
            "fields": ["fields.name", "fields.description"],
            "query": "second folder replica"
          }
        },
        { "match": { "fields.created_by": "42" } }
      ],
      "must_not": [
        { "match": { "fields.created_by": "45" } }
      ]
    }
  }
}

must: Logical AND operator. The results must match all queries in this clause.

must_not: Logical NOT operator. All matches are excluded from the results. If must_not has multiple clauses, only documents that do not match any of those clauses are returned. For example, "must_not": [{clause_A}, {clause_B}] is equivalent to NOT (A OR B).
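Although not exercised in this evaluation, the same bool query also accepts a non-scoring filter clause. A sketch of how the ACL restriction shown later could be combined with the user's query without affecting relevance scoring (an assumption, not something verified against the evaluation index):

GET sample-index/_search
{
  "query": {
    "bool": {
      "must": [
        {
          "simple_query_string": {
            "fields": ["fields.name", "fields.description"],
            "query": "second folder replica"
          }
        }
      ],
      "filter": [
        { "terms": { "fields.acl": ["123"] } }
      ]
    }
  }
}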

 

OpenSearch Boolean query

{ "hits": { "total": { "value": 2, "relation": "eq" }, "max_score": 1.8062301, "hits": [ { "_index": "sample-index", "_id": "syn2", "_score": 1.8062301, "_source": { "type": "add", "id": "syn2", "fields": { "name": "this is second folder", "description": """title markdown wiki is useful to find out content""", "parent_id": "1234", "node_type": "folder", "etag": "0", "created_on": 1747084178, "modified_on": 1747084178, "created_by": "42", "modified_by": "42", "acl": [ "456", "123" ], "update_acl": [ "123" ], "diagnosis": "2", "tissue": "eye lid", "consortium": "C O N S O R T I U M", "organ": "kidney" } } }, { "_index": "sample-index", "_id": "syn1", "_score": 0.5753642, "_source": { "type": "add", "id": "syn1", "fields": { "name": "this is first folder", "description": """title markdown wiki is useful to find out content""", "parent_id": "1234", "node_type": "folder", "etag": "0", "created_on": 1747084018, "modified_on": 1747084018, "created_by": "42", "modified_by": "42", "acl": [ "456", "123" ], "update_acl": [ "123" ], "diagnosis": "1", "tissue": "ear lobe", "consortium": "C O N S O R T I U M", "organ": "ORGAN" } } } ] } }

rangeQuery

GET sample-index/_search
{
  "query": {
    "range": {
      "fields.modified_on": {
        "gte": 1747084178,
        "lte": 1747084185
      }
    }
  }
}

 

OpenSearch Range query

{ "took": 42, "timed_out": false, "_shards": { "total": 0, "successful": 0, "skipped": 0, "failed": 0 }, "hits": { "total": { "value": 1, "relation": "eq" }, "max_score": 1, "hits": [ { "_index": "sample-index", "_id": "syn2", "_score": 1, "_source": { "type": "add", "id": "syn2", "fields": { "name": "this is second folder", "description": """title markdown wiki is useful to find out content""", "parent_id": "1234", "node_type": "folder", "etag": "0", "created_on": 1747084178, "modified_on": 1747084178, "created_by": "42", "modified_by": "42", "acl": [ "456", "123" ], "update_acl": [ "123" ], "diagnosis": "2", "tissue": "eye lid", "consortium": "C O N S O R T I U M", "organ": "kidney" } } } ] } }

facetOptions

GET sample-index/_search
{
  "query": { "match_all": {} },
  "aggs": {
    "organ_facet": {
      "terms": {
        "field": "fields.organ.keyword",
        "size": 10,
        "order": { "_count": "desc" }
      }
    }
  }
}

 

By default, OpenSearch doesn’t support aggregations on a text field. Because text fields are tokenized, an aggregation on a text field has to reverse the tokenization process back to the original string and then formulate an aggregation based on that. This kind of operation consumes significant memory and degrades cluster performance.

While we can enable aggregations on text fields by setting the fielddata parameter to true in the mapping, the aggregations are still based on the tokenized words and not on the raw text.

We recommend keeping a raw version of the text field as a keyword field that we can aggregate on.
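A sketch of such a mapping (the index and field names follow the examples in this document; this exact mapping was not applied during the evaluation), where the raw value is kept in a keyword sub-field that aggregations can target:

PUT sample-index/_mapping
{
  "properties": {
    "fields": {
      "properties": {
        "organ": {
          "type": "text",
          "fields": {
            "keyword": { "type": "keyword" }
          }
        }
      }
    }
  }
}

With this mapping, aggregations and sorting use fields.organ.keyword while full-text queries continue to use fields.organ.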

 

A text field that is analyzed cannot be used to sort documents, because the inverted index only contains the individual tokenized terms and not the entire string.

To bypass this limitation, you can use a raw version of the text field mapped as a keyword type. In the following example, field_name.keyword is not analyzed, so it holds a copy of the full original value for sorting purposes.
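A minimal sort sketch along those lines (the field choice is illustrative; sorting was not part of the recorded evaluation queries):

GET sample-index/_search
{
  "query": { "match_all": {} },
  "sort": [
    { "fields.name.keyword": { "order": "asc" } }
  ]
}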

OpenSearch sort option

{ "took": 68, "timed_out": false, "_shards": { "total": 0, "successful": 0, "skipped": 0, "failed": 0 }, "hits": { "total": { "value": 3, "relation": "eq" }, "max_score": 1, "hits": [ { "_index": "sample-index", "_id": "syn2", "_score": 1, "_source": { "type": "add", "id": "syn2", "fields": { "name": "this is second folder", "description": """title markdown wiki is useful to find out content""", "parent_id": "1234", "node_type": "folder", "etag": "0", "created_on": 1747084178, "modified_on": 1747084178, "created_by": "42", "modified_by": "42", "acl": [ "456", "123" ], "update_acl": [ "123" ], "diagnosis": "2", "tissue": "eye lid", "consortium": "C O N S O R T I U M", "organ": "kidney" } } }, { "_index": "sample-index", "_id": "syn3", "_score": 1, "_source": { "type": "add", "id": "syn3", "fields": { "name": "replica", "description": """title markdown wiki is useful to find out content""", "parent_id": "4567", "node_type": "folder", "etag": "0", "created_on": 1747181404, "modified_on": 1747181404, "created_by": "45", "modified_by": "45", "acl": [ "456", "123" ], "update_acl": [ "123" ], "diagnosis": "2", "tissue": "eye lid", "consortium": "C O N S O R T I U M", "organ": "any" } } }, { "_index": "sample-index", "_id": "syn1", "_score": 1, "_source": { "type": "add", "id": "syn1", "fields": { "name": "this is first folder", "description": """title markdown wiki is useful to find out content""", "parent_id": "1234", "node_type": "folder", "etag": "0", "created_on": 1747084018, "modified_on": 1747084018, "created_by": "42", "modified_by": "42", "acl": [ "456", "123" ], "update_acl": [ "123" ], "diagnosis": "1", "tissue": "ear lobe", "consortium": "C O N S O R T I U M", "organ": "ORGAN" } } } ] }, "aggregations": { "organ_facet": { "doc_count_error_upper_bound": 0, "sum_other_doc_count": 0, "buckets": [ { "key": "ORGAN", "doc_count": 1 }, { "key": "any", "doc_count": 1 }, { "key": "kidney", "doc_count": 1 } ] } } }

returnFields: Specifies the document fields to include in the response

GET sample-index/_search
{
  "_source": ["id", "fields.name", "fields.description"],
  "query": { "match_all": {} }
}
{ "took": 27, "timed_out": false, "_shards": { "total": 0, "successful": 0, "skipped": 0, "failed": 0 }, "hits": { "total": { "value": 3, "relation": "eq" }, "max_score": 1, "hits": [ { "_index": "sample-index", "_id": "syn2", "_score": 1, "_source": { "id": "syn2", "fields": { "name": "this is second folder", "description": """title markdown wiki is useful to find out content""" } } }, { "_index": "sample-index", "_id": "syn3", "_score": 1, "_source": { "id": "syn3", "fields": { "name": "replica", "description": """title markdown wiki is useful to find out content""" } } }, { "_index": "sample-index", "_id": "syn1", "_score": 1, "_source": { "id": "syn1", "fields": { "name": "this is first folder", "description": """title markdown wiki is useful to find out content""" } } } ] } }

size: The maximum number of search hits to return.

GET sample-index/_search
{
  "query": { "match_all": {} },
  "size": 2
}
{ "took": 18, "timed_out": false, "_shards": { "total": 0, "successful": 0, "skipped": 0, "failed": 0 }, "hits": { "total": { "value": 3, "relation": "eq" }, "max_score": 1, "hits": [ { "_index": "sample-index", "_id": "syn2", "_score": 1, "_source": { "type": "add", "id": "syn2", "fields": { "name": "this is second folder", "description": """title markdown wiki is useful to find out content""", "parent_id": "1234", "node_type": "folder", "etag": "0", "created_on": 1747084178, "modified_on": 1747084178, "created_by": "42", "modified_by": "42", "acl": [ "456", "123" ], "update_acl": [ "123" ], "diagnosis": "2", "tissue": "eye lid", "consortium": "C O N S O R T I U M", "organ": "kidney" } } }, { "_index": "sample-index", "_id": "syn3", "_score": 1, "_source": { "type": "add", "id": "syn3", "fields": { "name": "replica", "description": """title markdown wiki is useful to find out content""", "parent_id": "4567", "node_type": "folder", "etag": "0", "created_on": 1747181404, "modified_on": 1747181404, "created_by": "45", "modified_by": "45", "acl": [ "456", "123" ], "update_acl": [ "123" ], "diagnosis": "2", "tissue": "eye lid", "consortium": "C O N S O R T I U M", "organ": "any" } } } ] } }

start: The zero-based number of the first hit returned in this page of search results.

GET sample-index/_search
{
  "query": { "match_all": {} },
  "size": 10,
  "from": 2
}
{ "took": 27, "timed_out": false, "_shards": { "total": 0, "successful": 0, "skipped": 0, "failed": 0 }, "hits": { "total": { "value": 3, "relation": "eq" }, "max_score": 1, "hits": [ { "_index": "sample-index", "_id": "syn1", "_score": 1, "_source": { "type": "add", "id": "syn1", "fields": { "name": "this is first folder", "description": """title markdown wiki is useful to find out content""", "parent_id": "1234", "node_type": "folder", "etag": "0", "created_on": 1747084018, "modified_on": 1747084018, "created_by": "42", "modified_by": "42", "acl": [ "456", "123" ], "update_acl": [ "123" ], "diagnosis": "1", "tissue": "ear lobe", "consortium": "C O N S O R T I U M", "organ": "ORGAN" } } } ] } }

ACL filter

GET sample-index/_search
{
  "query": {
    "terms": { "fields.acl": ["123"] }
  }
}

 

OpenSearch Filter

{ "took": 27, "timed_out": false, "_shards": { "total": 0, "successful": 0, "skipped": 0, "failed": 0 }, "hits": { "total": { "value": 3, "relation": "eq" }, "max_score": 1, "hits": [ { "_index": "sample-index", "_id": "syn2", "_score": 1, "_source": { "type": "add", "id": "syn2", "fields": { "name": "this is second folder", "description": """title markdown wiki is useful to find out content""", "parent_id": "1234", "node_type": "folder", "etag": "0", "created_on": 1747084178, "modified_on": 1747084178, "created_by": "42", "modified_by": "42", "acl": [ "456", "123" ], "update_acl": [ "123" ], "diagnosis": "2", "tissue": "eye lid", "consortium": "C O N S O R T I U M", "organ": "kidney" } } }, { "_index": "sample-index", "_id": "syn3", "_score": 1, "_source": { "type": "add", "id": "syn3", "fields": { "name": "replica", "description": """title markdown wiki is useful to find out content""", "parent_id": "4567", "node_type": "folder", "etag": "0", "created_on": 1747181404, "modified_on": 1747181404, "created_by": "45", "modified_by": "45", "acl": [ "456", "123" ], "update_acl": [ "123" ], "diagnosis": "2", "tissue": "eye lid", "consortium": "C O N S O R T I U M", "organ": "any" } } }, { "_index": "sample-index", "_id": "syn1", "_score": 1, "_source": { "type": "add", "id": "syn1", "fields": { "name": "this is first folder", "description": """title markdown wiki is useful to find out content""", "parent_id": "1234", "node_type": "folder", "etag": "0", "created_on": 1747084018, "modified_on": 1747084018, "created_by": "42", "modified_by": "42", "acl": [ "456", "123" ], "update_acl": [ "123" ], "diagnosis": "1", "tissue": "ear lobe", "consortium": "C O N S O R T I U M", "organ": "ORGAN" } } } ] } }

 

OpenSearch Response Object

 

The current API returns a SearchResults object. We can map the OpenSearch response to this existing structure with some important considerations:

  1. In OpenSearch, each aggregation is identified by a unique name and must be defined using one of the supported aggregation types. The aggregation name in the response effectively serves as the facet name. To maintain compatibility with the existing code, it's recommended to use the field name as the aggregation name. This allows us to derive the facet type from the field directly. Alternatively, we can introduce a new enum that maps aggregation names to their corresponding facet types.

// existing code to get FacetType
IndexFieldToSynapseFacetType.getSynapseFacetType(SynapseToCloudSearchField.cloudSearchFieldFor(facetName).getType())

     

  2. Unlike CloudSearch, OpenSearch responses do not include the from (pagination offset) value. This must be manually included in the searchResults object during response construction.

// OpenSearch request
SearchRequest request = SearchRequest.of(r -> r
    .index(index)
    .query(q -> q.matchAll(m -> m))
    .aggregations("organ", a -> a
        .terms(t -> t
            .field("fields.organ.keyword")
            .size(10)
            .order(List.of(Map.of("_count", SortOrder.Desc))) // order buckets by descending count
        )
    )
    .from(0)
    .size(10)
);
SearchResponse<Map> response = client.search(request, Map.class);

// prepare the hit list with the found documents
List<org.sagebionetworks.repo.model.search.Hit> hitList = new ArrayList<>();
for (Hit<Map> hit : response.hits().hits()) {
    Map<String, Object> fields = (Map<String, Object>) hit.source().get("fields");
    org.sagebionetworks.repo.model.search.Hit synapseHit = new org.sagebionetworks.repo.model.search.Hit();
    synapseHit.setId(hit.id());
    synapseHit.setDescription((String) fields.get("description"));
    synapseHit.setCreated_by((String) fields.get("created_by"));
    // add all the remaining fields
    hitList.add(synapseHit);
}

// get facets from the OpenSearch response
Map<String, Aggregate> aggs = response.aggregations();
List<Facet> facetList = new ArrayList<>();
for (Map.Entry<String, Aggregate> entry : aggs.entrySet()) {
    String facetName = entry.getKey();
    Aggregate aggregate = entry.getValue();
    if (aggregate.isSterms()) {
        StringTermsAggregate termsAgg = aggregate.sterms();
        Facet facet = new Facet();
        facet.setName(facetName);
        FacetTypeNames facetType = IndexFieldToSynapseFacetType.getSynapseFacetType(
            SynapseToCloudSearchField.cloudSearchFieldFor(facetName).getType());
        facet.setType(facetType);
        List<FacetConstraint> constraints = new ArrayList<>();
        for (StringTermsBucket bucket : termsAgg.buckets().array()) {
            FacetConstraint constraint = new FacetConstraint();
            constraint.setValue(bucket.key());
            constraint.setCount(bucket.docCount());
            constraints.add(constraint);
        }
        facet.setConstraints(constraints);
        facetList.add(facet);
    }
}

// build the SearchResults response in the existing format
SearchResults results = new SearchResults();
results.setFacets(facetList);
results.setFound(response.hits().total().value());
results.setStart(Long.valueOf(request.from())); // OpenSearch does not echo "from", so take it from the request
results.setHits(hitList);
System.out.println(" hit list is : " + hitList);

 

The existing search API endpoint can be reused for OpenSearch integration. All current features can be supported, and the response can be returned in the same format as before.

Cost and Data Size Estimation


Size: We currently have approximately 26 million documents stored in CloudSearch. Based on observed data (every 10 documents consume approximately 3–4 KB, and 692 documents occupy about 2 MB), we estimate the average size of one document at 4 KB for approximation purposes.
Using this estimate:

26 million × 4 KB = ~99 GB total data volume.

OpenSearch Serverless Size Limitations

  • Up to 1 TiB of data per index for search and vector search collections

  • Up to 100 TiB of hot data per index for time series collections

This means our existing 99 GB of data can comfortably fit within the supported limits of a single index in OpenSearch Serverless.

 

Cost: The data size is ~100 GB, and the cost for this volume has already been estimated in Sage Portals OpenSearch Integration.

 

Currently, user annotations are not included in the search documents. However, to plan for potential future support, we have estimated the storage impact of including annotations. These annotations are stored as JSON in the Node_Revision table within our SQL database.

The following query was used to calculate the average size of an annotation:

SELECT AVG(CHAR_LENGTH(USER_ANNOTATIONS)) AS avg_chars,
       AVG(OCTET_LENGTH(USER_ANNOTATIONS)) AS avg_bytes
FROM NODE_REVISION
WHERE USER_ANNOTATIONS IS NOT NULL;

-- Result
-- avg_chars = 686.5832
-- avg_bytes = 686.6047

At roughly 687 bytes per annotation, 26 million documents would add approximately 26,000,000 × 687 B ≈ 17–18 GB of data. The total data size, including annotations, is expected to remain well within the OpenSearch Serverless limit of 1 TiB per index.

Proposal

 

Based on the analysis, migrating from Amazon CloudSearch to OpenSearch is technically feasible. The current search API can be retained without modification by introducing a feature flag. This approach allows us to run CloudSearch and OpenSearch in parallel, enabling comparison of search results and user experience with minimal disruption.

Additionally, operating both systems temporarily will provide a more accurate estimate of the cost associated with indexing and managing approximately 26 million documents in OpenSearch.

To minimize operational overhead, it is recommended to adopt OpenSearch Serverless, which eliminates the need to provision and maintain infrastructure, allowing the team to focus on feature delivery and performance tuning.