CloudSearch to OpenSearch Migration

Jira: PLFM-8728

Background

Synapse offers a search feature for users, accessible through a dedicated Search API (SearchQuery | Synapse REST API). Initially, search was implemented using AWS CloudSearch. However, due to its limitations and the fact that it is no longer available for new deployments, it is being phased out. While AWS has not officially announced an end-of-support date for existing users, it recommends migrating to Amazon OpenSearch Service. More information is available here: Transition from Amazon CloudSearch to Amazon OpenSearch Service | Amazon Web Services

CloudSearch limitations solved by OpenSearch

 

| Category | CloudSearch limitation | OpenSearch solution |
| --- | --- | --- |
| Query Language | Limited query flexibility | Full Query DSL (JSON-based; supports bool, range, fuzziness, slop, intervals, etc.) |
| Custom Ranking | Minimal relevance tuning (only via expr) | Function score queries, script scoring, and field boosting for advanced tuning |
| Multi-field search | No native multi-field search | multi_match queries search across multiple fields simultaneously |
| Field Types | Limited field types (no boolean, nested, etc.) | Wide support: text, keyword, boolean, geo_point, nested, etc. |
| Monitoring | No detailed logging or query trace | Built-in slow query logs, profiling, and monitoring via CloudWatch and APIs |
| Aggregation/Facets | Limited aggregation capabilities (facets only) | Full aggregations framework: terms, range, date_range, etc. |
| Security | IAM-based security only | Fine-grained access control (roles, field-level, and document-level security) |
| Data ingestion | Limited ingest and update options (batch document uploads only) | Bulk API, ingestion pipelines, Logstash, and real-time indexing |
| Testing | No testing tools or dev utilities | OpenSearch Dashboards with Dev Tools, query profiling, and real-time testing |
| Scaling & Performance Tuning | Scaling is automatic but not tunable | Control over shards, replicas, and index-level tuning, or a serverless option |
| Integration | Limited integration ecosystem | Integrates with OpenSearch Dashboards (Kibana), Beats, Logstash, Grafana, etc. |
| Autocomplete | Simple suggesters | Completion suggesters, edge n-grams, and full control |

OpenSearch Introduction

OpenSearch is a distributed search and analytics engine. After adding data to OpenSearch, we can perform full-text searches on it with all of the features we might expect: search by field, search multiple indexes, boost fields, rank results by score, sort results by field, and aggregate results. For more information, see the OpenSearch documentation.

 

OpenSearch Terminology

Document: A document is a unit that stores information (text or structured data). In OpenSearch, documents are stored in JSON format.

Index: An index is a collection of documents.

Node: A server that stores data and handles search and indexing operations.

Cluster: An OpenSearch cluster is a collection of nodes.

Shard: A shard is a horizontal partition of data within an index. Each shard holds a subset of the index’s documents, enabling OpenSearch to scale data and queries across multiple nodes.

Primary and replica shards: OpenSearch automatically creates one replica shard for each primary shard by default. For example, if an index is split into 10 primary shards, OpenSearch will create 10 corresponding replica shards to enhance fault tolerance and availability (a brief index-settings sketch follows this terminology section).

Inverted index: OpenSearch uses an inverted index to perform fast full-text searches. This data structure maps each term (word) to the list of documents that contain it.

For example, consider two documents:

  • Document 1: “Beauty is in the eye of the beholder”

  • Document 2: “Beauty and the beast”

| Word | Documents |
| --- | --- |
| beauty | 1, 2 |
| is | 1 |
| and | 2 |

Relevance: When a search query is executed, OpenSearch matches the query terms against the indexed documents and assigns a relevance score to each result. This score indicates how closely a document matches the query criteria.
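As a concrete reference for the shard and replica terminology above, the following Dev Tools sketch creates an index with explicit primary and replica shard counts. The index name and values are illustrative and were not used in the evaluation; this applies to provisioned OpenSearch domains, since OpenSearch Serverless manages shards automatically.

PUT sample-index
{
  "settings": {
    "index": {
      "number_of_shards": 10,
      "number_of_replicas": 1
    }
  }
}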

OpenSearch deployment options

 

OpenSearch Service domain: Amazon OpenSearch Service provides a managed environment to deploy and operate OpenSearch clusters. It gives you full control over configuration, including instance types, storage, and network settings, and it supports fine-tuned performance optimization, availability zones, VPC access, and security configurations. This option requires in-depth knowledge of cluster management and maintenance and is better aligned with a long-term, always-on deployment.

OpenSearch Serverless: Amazon OpenSearch Serverless is an on-demand, serverless option for Amazon OpenSearch Service that eliminates the operational complexity of provisioning, configuring, and tuning OpenSearch clusters. With OpenSearch Serverless, we can search and analyze large volumes of data without managing the underlying infrastructure. An OpenSearch Serverless collection is a group of OpenSearch indexes that work together to support a specific workload or use case. Collections simplify operations compared to self-managed OpenSearch clusters, which require manual provisioning. For more information, see What is Amazon OpenSearch Serverless? - Amazon OpenSearch Service

Types of collections in OpenSearch Serverless

 

Search: Full-text search over natural-language text, where documents are indexed with analyzers (tokenizers, stemmers, etc.) to support ranking, relevance, and partial matching. The main use case is finding relevant documents based on user-entered keywords, when ranking, matching accuracy, and highlighting matter and the data is text-heavy.

 

Vector Search: Semantic or ML-based search that matches documents by comparing vector embeddings (numeric representations of meaning) instead of keywords. It is ideal for semantic similarity, natural-language queries, or ML-driven matching. The main use cases are retrieving content based on meaning (not just keywords), using embeddings from models such as BERT or OpenAI, and building AI-powered features like chatbots or recommendations.

 

Time Series: Time series search focuses on analyzing machine-generated, timestamped data such as logs, metrics, and events. The goal is often operational insight, security monitoring, or business performance tracking.

Feasibility Evaluation

 

To assess whether the features currently supported by CloudSearch can be replicated in OpenSearch, I explored the OpenSearch Serverless offering. OpenSearch Serverless supports three collection types: Search, Time Series, and Vector. Since our requirement focuses on full-text search, I created a collection of type "Search", which is specifically designed for full-text search use cases. The following steps outline the approach taken during this evaluation:

 

Data Ingestion

 

A couple of documents were ingested into an index within the selected collection. The document format remains consistent with how documents were previously stored in CloudSearch. For more information, see this.

@Test
public void testDocumentUploadToOpenSearch() throws IOException {
    // create a document
    Document doc = createDocument();

    SdkHttpClient httpClient = ApacheHttpClient.builder().build();
    OpenSearchClient client = new OpenSearchClient(
        new AwsSdk2Transport(
            httpClient,
            "tlenbt......amazonaws.com",             // serverless collection endpoint
            "aoss",                                  // signing service name
            Region.US_EAST_1,                        // signing service region
            AwsSdk2TransportOptions.builder().build()
        )
    );

    String index = "sample-index";

    // index the document
    IndexRequest<Document> request = new IndexRequest.Builder<Document>()
        .index(index)
        .id(doc.getId())
        .document(doc)
        .build();
    client.index(request);

    // query the index to verify the upload
    SearchRequest searchRequest = SearchRequest.of(r -> r
        .index(index)
        .query(q -> q.matchAll(m -> m))
        .from(2)
        .size(10)
    );
    SearchResponse<Map> response = client.search(searchRequest, Map.class);
    for (Hit<Map> hit : response.hits().hits()) {
        System.out.println("Doc ID: " + hit.id());
        System.out.println("Source: " + hit.source());
    }
}

public Document createDocument() {
    org.sagebionetworks.repo.model.Node node = new Node();
    node.setName("this is second folder");
    node.setDescription("second has some data");
    node.setId("syn1");
    node.setParentId("1234");
    node.setETag("0");
    node.setNodeType(EntityType.folder);
    Long nonexistantPrincipalId = 42L;
    node.setCreatedByPrincipalId(nonexistantPrincipalId);
    node.setCreatedOn(new Date());
    node.setModifiedByPrincipalId(nonexistantPrincipalId);
    node.setModifiedOn(new Date());
    node.setVersionLabel("versionLabel");

    Annotations additionalAnnos = new Annotations();
    AnnotationsV2TestUtils.putAnnotations(additionalAnnos, "organ", "kidney", AnnotationsValueType.STRING);
    AnnotationsV2TestUtils.putAnnotations(additionalAnnos, "longKey", "10", AnnotationsValueType.LONG);
    AnnotationsV2TestUtils.putAnnotations(additionalAnnos, "tissue", "eye lid", AnnotationsValueType.STRING);
    AnnotationsV2TestUtils.putAnnotations(additionalAnnos, "consortium", "C O N S O R T I U M", AnnotationsValueType.STRING);
    AnnotationsV2TestUtils.putAnnotations(additionalAnnos, "diagnosis", "2", AnnotationsValueType.LONG);
    String dateValue = Long.toString(System.currentTimeMillis());
    AnnotationsV2TestUtils.putAnnotations(additionalAnnos, "dateKey", dateValue, AnnotationsValueType.TIMESTAMP_MS);

    Set<ACCESS_TYPE> rwAccessType = new HashSet<ACCESS_TYPE>();
    rwAccessType.add(ACCESS_TYPE.READ);
    rwAccessType.add(ACCESS_TYPE.UPDATE);
    ResourceAccess rwResourceAccess = new ResourceAccess();
    rwResourceAccess.setPrincipalId(123L); // readWriteTest@sagebase.org
    rwResourceAccess.setAccessType(rwAccessType);

    Set<ACCESS_TYPE> roAccessType = new HashSet<ACCESS_TYPE>();
    roAccessType.add(ACCESS_TYPE.READ);
    ResourceAccess roResourceAccess = new ResourceAccess();
    roResourceAccess.setPrincipalId(456L); // readOnlyTest@sagebase.org
    roResourceAccess.setAccessType(roAccessType);

    Set<ResourceAccess> resourceAccesses = new HashSet<ResourceAccess>();
    resourceAccesses.add(rwResourceAccess);
    resourceAccesses.add(roResourceAccess);
    AccessControlList acl = new AccessControlList();
    acl.setResourceAccess(resourceAccesses);

    String wikiPageText = "title\nmarkdown\nwiki is useful to find out content";
    return searchDocumentDriver.formulateSearchDocument(node, additionalAnnos, acl, wikiPageText);
}

 

Search

 

To test search capabilities, I used OpenSearch Dashboards to execute queries with various filters. This approach provided a quick and efficient way to evaluate whether the necessary filters could be applied, ensuring compatibility with the current search functionality.

The entries below map each current search feature to an equivalent OpenSearch query, notes on its behavior, and the observed result.

queryTerm

GET sample-index/_search
{
  "query": {
    "simple_query_string": {
      "fields": ["fields.name", "fields.description"],
      "query": "second folder replica"
    }
  }
}

 

The default operator is OR, meaning the query matches documents containing any of the query terms (e.g., "second" or "folder" or "replica"). We can refine the search by specifying the fields in which the query terms should be matched.

 

We can control the minimum number of terms that a document must match to be returned in the results by specifying the minimum_should_match parameter, as in the sketch below.
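For example, the following sketch (the value is illustrative and was not part of the recorded evaluation queries) requires at least two of the three terms to match:

GET sample-index/_search
{
  "query": {
    "simple_query_string": {
      "fields": ["fields.name", "fields.description"],
      "query": "second folder replica",
      "minimum_should_match": 2
    }
  }
}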

 

If no query term is provided, return all documents.

GET sample-index/_search
{
  "query": {
    "match_all": {}
  }
}

 

OpenSearch Full-text queries

 

{ "took": 25, "timed_out": false, "_shards": { "total": 0, "successful": 0, "skipped": 0, "failed": 0 }, "hits": { "total": { "value": 3, "relation": "eq" }, "max_score": 1.1130829, "hits": [ { "_index": "sample-index", "_id": "syn2", "_score": 1.1130829, "_source": { "type": "add", "id": "syn2", "fields": { "name": "this is second folder", "description": """title markdown wiki is useful to find out content""", "parent_id": "1234", "node_type": "folder", "etag": "0", "created_on": 1747084178, "modified_on": 1747084178, "created_by": "42", "modified_by": "42", "acl": [ "456", "123" ], "update_acl": [ "123" ], "diagnosis": "2", "tissue": "eye lid", "consortium": "C O N S O R T I U M", "organ": "kidney" } } }, { "_index": "sample-index", "_id": "syn3", "_score": 0.9186288, "_source": { "type": "add", "id": "syn3", "fields": { "name": "replica", "description": """title markdown wiki is useful to find out content""", "parent_id": "4567", "node_type": "folder", "etag": "0", "created_on": 1747181404, "modified_on": 1747181404, "created_by": "45", "modified_by": "45", "acl": [ "456", "123" ], "update_acl": [ "123" ], "diagnosis": "2", "tissue": "eye lid", "consortium": "C O N S O R T I U M", "organ": "any" } } }, { "_index": "sample-index", "_id": "syn1", "_score": 0.2876821, "_source": { "type": "add", "id": "syn1", "fields": { "name": "this is first folder", "description": """title markdown wiki is useful to find out content""", "parent_id": "1234", "node_type": "folder", "etag": "0", "created_on": 1747084018, "modified_on": 1747084018, "created_by": "42", "modified_by": "42", "acl": [ "456", "123" ], "update_acl": [ "123" ], "diagnosis": "1", "tissue": "ear lobe", "consortium": "C O N S O R T I U M", "organ": "ORGAN" } } } ] } }

booleanQuery

GET sample-index/_search
{
  "query": {
    "bool": {
      "must": [
        {
          "simple_query_string": {
            "fields": ["fields.name", "fields.description"],
            "query": "second folder replica"
          }
        },
        { "match": { "fields.created_by": "42" } }
      ],
      "must_not": [
        { "match": { "fields.created_by": "45" } }
      ]
    }
  }
}

must: Logical AND operator. The results must match all queries in this clause.

must_not: Logical NOT operator. All matches are excluded from the results. If must_not has multiple clauses, only documents that do not match any of those clauses are returned. For example, "must_not": [{clause_A}, {clause_B}] is equivalent to NOT (A OR B).
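Although not exercised in this evaluation, the same bool query also accepts a non-scoring filter clause. A sketch of how the ACL restriction shown later could be combined with the user's query without affecting relevance scoring (an assumption, not something verified against the evaluation index):

GET sample-index/_search
{
  "query": {
    "bool": {
      "must": [
        {
          "simple_query_string": {
            "fields": ["fields.name", "fields.description"],
            "query": "second folder replica"
          }
        }
      ],
      "filter": [
        { "terms": { "fields.acl": ["123"] } }
      ]
    }
  }
}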

 

OpenSearch Boolean query

{ "hits": { "total": { "value": 2, "relation": "eq" }, "max_score": 1.8062301, "hits": [ { "_index": "sample-index", "_id": "syn2", "_score": 1.8062301, "_source": { "type": "add", "id": "syn2", "fields": { "name": "this is second folder", "description": """title markdown wiki is useful to find out content""", "parent_id": "1234", "node_type": "folder", "etag": "0", "created_on": 1747084178, "modified_on": 1747084178, "created_by": "42", "modified_by": "42", "acl": [ "456", "123" ], "update_acl": [ "123" ], "diagnosis": "2", "tissue": "eye lid", "consortium": "C O N S O R T I U M", "organ": "kidney" } } }, { "_index": "sample-index", "_id": "syn1", "_score": 0.5753642, "_source": { "type": "add", "id": "syn1", "fields": { "name": "this is first folder", "description": """title markdown wiki is useful to find out content""", "parent_id": "1234", "node_type": "folder", "etag": "0", "created_on": 1747084018, "modified_on": 1747084018, "created_by": "42", "modified_by": "42", "acl": [ "456", "123" ], "update_acl": [ "123" ], "diagnosis": "1", "tissue": "ear lobe", "consortium": "C O N S O R T I U M", "organ": "ORGAN" } } } ] } }

rangeQuery

GET sample-index/_search
{
  "query": {
    "range": {
      "fields.modified_on": {
        "gte": 1747084178,
        "lte": 1747084185
      }
    }
  }
}

 

OpenSearch Range query

{ "took": 42, "timed_out": false, "_shards": { "total": 0, "successful": 0, "skipped": 0, "failed": 0 }, "hits": { "total": { "value": 1, "relation": "eq" }, "max_score": 1, "hits": [ { "_index": "sample-index", "_id": "syn2", "_score": 1, "_source": { "type": "add", "id": "syn2", "fields": { "name": "this is second folder", "description": """title markdown wiki is useful to find out content""", "parent_id": "1234", "node_type": "folder", "etag": "0", "created_on": 1747084178, "modified_on": 1747084178, "created_by": "42", "modified_by": "42", "acl": [ "456", "123" ], "update_acl": [ "123" ], "diagnosis": "2", "tissue": "eye lid", "consortium": "C O N S O R T I U M", "organ": "kidney" } } } ] } }

facetOptions

GET sample-index/_search
{
  "query": { "match_all": {} },
  "aggs": {
    "organ_facet": {
      "terms": {
        "field": "fields.organ.keyword",
        "size": 10,
        "order": { "_count": "desc" }
      }
    }
  }
}

 

By default, OpenSearch doesn’t support aggregations on a text field. Because text fields are tokenized, an aggregation on a text field has to reverse the tokenization process back to the original string and then formulate an aggregation based on that. This kind of operation consumes significant memory and degrades cluster performance.

While we can enable aggregations on text fields by setting the fielddata parameter to true in the mapping, the aggregations are still based on the tokenized words and not on the raw text.

We recommend keeping a raw version of the text field as a keyword field that we can aggregate on.
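A sketch of such a mapping (the index and field names follow the examples in this document; this exact mapping was not applied during the evaluation), where the raw value is kept in a keyword sub-field that aggregations can target:

PUT sample-index/_mapping
{
  "properties": {
    "fields": {
      "properties": {
        "organ": {
          "type": "text",
          "fields": {
            "keyword": { "type": "keyword" }
          }
        }
      }
    }
  }
}

With this mapping, aggregations and sorting use fields.organ.keyword while full-text queries continue to use fields.organ.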

 

A text field that is analyzed cannot be used to sort documents, because the inverted index only contains the individual tokenized terms and not the entire string.

To bypass this limitation, you can use a raw version of the text field mapped as a keyword type. In the following example, field_name.keyword is not analyzed, so it holds a copy of the full original value for sorting purposes.
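A minimal sort sketch along those lines (the field choice is illustrative; sorting was not part of the recorded evaluation queries):

GET sample-index/_search
{
  "query": { "match_all": {} },
  "sort": [
    { "fields.name.keyword": { "order": "asc" } }
  ]
}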

OpenSearch sort option

{ "took": 68, "timed_out": false, "_shards": { "total": 0, "successful": 0, "skipped": 0, "failed": 0 }, "hits": { "total": { "value": 3, "relation": "eq" }, "max_score": 1, "hits": [ { "_index": "sample-index", "_id": "syn2", "_score": 1, "_source": { "type": "add", "id": "syn2", "fields": { "name": "this is second folder", "description": """title markdown wiki is useful to find out content""", "parent_id": "1234", "node_type": "folder", "etag": "0", "created_on": 1747084178, "modified_on": 1747084178, "created_by": "42", "modified_by": "42", "acl": [ "456", "123" ], "update_acl": [ "123" ], "diagnosis": "2", "tissue": "eye lid", "consortium": "C O N S O R T I U M", "organ": "kidney" } } }, { "_index": "sample-index", "_id": "syn3", "_score": 1, "_source": { "type": "add", "id": "syn3", "fields": { "name": "replica", "description": """title markdown wiki is useful to find out content""", "parent_id": "4567", "node_type": "folder", "etag": "0", "created_on": 1747181404, "modified_on": 1747181404, "created_by": "45", "modified_by": "45", "acl": [ "456", "123" ], "update_acl": [ "123" ], "diagnosis": "2", "tissue": "eye lid", "consortium": "C O N S O R T I U M", "organ": "any" } } }, { "_index": "sample-index", "_id": "syn1", "_score": 1, "_source": { "type": "add", "id": "syn1", "fields": { "name": "this is first folder", "description": """title markdown wiki is useful to find out content""", "parent_id": "1234", "node_type": "folder", "etag": "0", "created_on": 1747084018, "modified_on": 1747084018, "created_by": "42", "modified_by": "42", "acl": [ "456", "123" ], "update_acl": [ "123" ], "diagnosis": "1", "tissue": "ear lobe", "consortium": "C O N S O R T I U M", "organ": "ORGAN" } } } ] }, "aggregations": { "organ_facet": { "doc_count_error_upper_bound": 0, "sum_other_doc_count": 0, "buckets": [ { "key": "ORGAN", "doc_count": 1 }, { "key": "any", "doc_count": 1 }, { "key": "kidney", "doc_count": 1 } ] } } }

returnFields: Specifies the document fields to include in the response

GET sample-index/_search
{
  "_source": ["id", "fields.name", "fields.description"],
  "query": { "match_all": {} }
}
{ "took": 27, "timed_out": false, "_shards": { "total": 0, "successful": 0, "skipped": 0, "failed": 0 }, "hits": { "total": { "value": 3, "relation": "eq" }, "max_score": 1, "hits": [ { "_index": "sample-index", "_id": "syn2", "_score": 1, "_source": { "id": "syn2", "fields": { "name": "this is second folder", "description": """title markdown wiki is useful to find out content""" } } }, { "_index": "sample-index", "_id": "syn3", "_score": 1, "_source": { "id": "syn3", "fields": { "name": "replica", "description": """title markdown wiki is useful to find out content""" } } }, { "_index": "sample-index", "_id": "syn1", "_score": 1, "_source": { "id": "syn1", "fields": { "name": "this is first folder", "description": """title markdown wiki is useful to find out content""" } } } ] } }

size: The maximum number of search hits to return.

GET sample-index/_search
{
  "query": { "match_all": {} },
  "size": 2
}
{ "took": 18, "timed_out": false, "_shards": { "total": 0, "successful": 0, "skipped": 0, "failed": 0 }, "hits": { "total": { "value": 3, "relation": "eq" }, "max_score": 1, "hits": [ { "_index": "sample-index", "_id": "syn2", "_score": 1, "_source": { "type": "add", "id": "syn2", "fields": { "name": "this is second folder", "description": """title markdown wiki is useful to find out content""", "parent_id": "1234", "node_type": "folder", "etag": "0", "created_on": 1747084178, "modified_on": 1747084178, "created_by": "42", "modified_by": "42", "acl": [ "456", "123" ], "update_acl": [ "123" ], "diagnosis": "2", "tissue": "eye lid", "consortium": "C O N S O R T I U M", "organ": "kidney" } } }, { "_index": "sample-index", "_id": "syn3", "_score": 1, "_source": { "type": "add", "id": "syn3", "fields": { "name": "replica", "description": """title markdown wiki is useful to find out content""", "parent_id": "4567", "node_type": "folder", "etag": "0", "created_on": 1747181404, "modified_on": 1747181404, "created_by": "45", "modified_by": "45", "acl": [ "456", "123" ], "update_acl": [ "123" ], "diagnosis": "2", "tissue": "eye lid", "consortium": "C O N S O R T I U M", "organ": "any" } } } ] } }

start: The zero-based number of the first hit returned in this page of search results.

GET sample-index/_search
{
  "query": { "match_all": {} },
  "size": 10,
  "from": 2
}
{ "took": 27, "timed_out": false, "_shards": { "total": 0, "successful": 0, "skipped": 0, "failed": 0 }, "hits": { "total": { "value": 3, "relation": "eq" }, "max_score": 1, "hits": [ { "_index": "sample-index", "_id": "syn1", "_score": 1, "_source": { "type": "add", "id": "syn1", "fields": { "name": "this is first folder", "description": """title markdown wiki is useful to find out content""", "parent_id": "1234", "node_type": "folder", "etag": "0", "created_on": 1747084018, "modified_on": 1747084018, "created_by": "42", "modified_by": "42", "acl": [ "456", "123" ], "update_acl": [ "123" ], "diagnosis": "1", "tissue": "ear lobe", "consortium": "C O N S O R T I U M", "organ": "ORGAN" } } } ] } }

ACL filter

GET sample-index/_search
{
  "query": {
    "terms": { "fields.acl": ["123"] }
  }
}

 

OpenSearch Filter

{ "took": 27, "timed_out": false, "_shards": { "total": 0, "successful": 0, "skipped": 0, "failed": 0 }, "hits": { "total": { "value": 3, "relation": "eq" }, "max_score": 1, "hits": [ { "_index": "sample-index", "_id": "syn2", "_score": 1, "_source": { "type": "add", "id": "syn2", "fields": { "name": "this is second folder", "description": """title markdown wiki is useful to find out content""", "parent_id": "1234", "node_type": "folder", "etag": "0", "created_on": 1747084178, "modified_on": 1747084178, "created_by": "42", "modified_by": "42", "acl": [ "456", "123" ], "update_acl": [ "123" ], "diagnosis": "2", "tissue": "eye lid", "consortium": "C O N S O R T I U M", "organ": "kidney" } } }, { "_index": "sample-index", "_id": "syn3", "_score": 1, "_source": { "type": "add", "id": "syn3", "fields": { "name": "replica", "description": """title markdown wiki is useful to find out content""", "parent_id": "4567", "node_type": "folder", "etag": "0", "created_on": 1747181404, "modified_on": 1747181404, "created_by": "45", "modified_by": "45", "acl": [ "456", "123" ], "update_acl": [ "123" ], "diagnosis": "2", "tissue": "eye lid", "consortium": "C O N S O R T I U M", "organ": "any" } } }, { "_index": "sample-index", "_id": "syn1", "_score": 1, "_source": { "type": "add", "id": "syn1", "fields": { "name": "this is first folder", "description": """title markdown wiki is useful to find out content""", "parent_id": "1234", "node_type": "folder", "etag": "0", "created_on": 1747084018, "modified_on": 1747084018, "created_by": "42", "modified_by": "42", "acl": [ "456", "123" ], "update_acl": [ "123" ], "diagnosis": "1", "tissue": "ear lobe", "consortium": "C O N S O R T I U M", "organ": "ORGAN" } } } ] } }

 

OpenSearch Response Object

 

The current API returns a SearchResults object. We can map the OpenSearch response to this existing structure with some important considerations:

  1. In OpenSearch, each aggregation is identified by a unique name and must be defined using one of the supported aggregation types. The aggregation name in the response effectively serves as the facet name. To maintain compatibility with the existing code, it's recommended to use the field name as the aggregation name. This allows us to derive the facet type from the field directly. Alternatively, we can introduce a new enum that maps aggregation names to their corresponding facet types.

// existing code to get FacetType
IndexFieldToSynapseFacetType.getSynapseFacetType(SynapseToCloudSearchField.cloudSearchFieldFor(facetName).getType())

     

  2. Unlike CloudSearch, OpenSearch responses do not include the from (pagination offset) value. This must be manually included in the searchResults object during response construction.

// OpenSearch request
SearchRequest request = SearchRequest.of(r -> r
    .index(index)
    .query(q -> q.matchAll(m -> m))
    .aggregations("organ", a -> a
        .terms(t -> t
            .field("fields.organ.keyword")
            .size(10)
            .order(List.of(Map.of("_count", SortOrder.Desc))) // order buckets by descending count
        )
    )
    .from(0)
    .size(10)
);
SearchResponse<Map> response = client.search(request, Map.class);

// prepare the hit list with the found documents
List<org.sagebionetworks.repo.model.search.Hit> hitList = new ArrayList<>();
for (Hit<Map> hit : response.hits().hits()) {
    Map<String, Object> fields = (Map<String, Object>) hit.source().get("fields");
    org.sagebionetworks.repo.model.search.Hit synapseHit = new org.sagebionetworks.repo.model.search.Hit();
    synapseHit.setId(hit.id());
    synapseHit.setDescription((String) fields.get("description"));
    synapseHit.setCreated_by((String) fields.get("created_by"));
    // add all the remaining fields
    hitList.add(synapseHit);
}

// get facets from the OpenSearch response
Map<String, Aggregate> aggs = response.aggregations();
List<Facet> facetList = new ArrayList<>();
for (Map.Entry<String, Aggregate> entry : aggs.entrySet()) {
    String facetName = entry.getKey();
    Aggregate aggregate = entry.getValue();
    if (aggregate.isSterms()) {
        StringTermsAggregate termsAgg = aggregate.sterms();
        Facet facet = new Facet();
        facet.setName(facetName);
        FacetTypeNames facetType = IndexFieldToSynapseFacetType.getSynapseFacetType(
            SynapseToCloudSearchField.cloudSearchFieldFor(facetName).getType());
        facet.setType(facetType);
        List<FacetConstraint> constraints = new ArrayList<>();
        for (StringTermsBucket bucket : termsAgg.buckets().array()) {
            FacetConstraint constraint = new FacetConstraint();
            constraint.setValue(bucket.key());
            constraint.setCount(bucket.docCount());
            constraints.add(constraint);
        }
        facet.setConstraints(constraints);
        facetList.add(facet);
    }
}

// build the SearchResults response in the existing format
SearchResults results = new SearchResults();
results.setFacets(facetList);
results.setFound(response.hits().total().value());
results.setStart(Long.valueOf(request.from())); // OpenSearch does not echo "from", so take it from the request
results.setHits(hitList);
System.out.println(" hit list is : " + hitList);

 

The existing search API endpoint can be reused for OpenSearch integration. All current features can be supported, and the response can be returned in the same format as before.

Cost and Data Size Estimation


Size: We currently have approximately 26 million documents stored in CloudSearch. Based on observed data (every 10 documents consume approximately 3–4 KB, and 692 documents occupy about 2 MB), we estimate the average size of one document at 4 KB for approximation purposes.
Using this estimate:

26 million × 4 KB = ~99 GB total data volume.

OpenSearch Serverless Size Limitations

  • Up to 1 TiB of data per index for search and vector search collections

  • Up to 100 TiB of hot data per index for time series collections

This means our existing 99 GB of data can comfortably fit within the supported limits of a single index in OpenSearch Serverless.

 

Cost: The data size is ~100 GB, and the cost for this volume has already been estimated in Sage Portals OpenSearch Integration.

 

Currently, user annotations are not included in the search documents. However, to plan for potential future support, we have estimated the storage impact of including annotations. These annotations are stored as JSON in the Node_Revision table within our SQL database.

The following query was used to calculate the average size of an annotation:

SELECT AVG(CHAR_LENGTH(USER_ANNOTATIONS)) AS avg_chars,
       AVG(OCTET_LENGTH(USER_ANNOTATIONS)) AS avg_bytes
FROM NODE_REVISION
WHERE USER_ANNOTATIONS IS NOT NULL;

-- Result
-- avg_chars = 686.5832
-- avg_bytes = 686.6047

At roughly 687 bytes per annotation, 26 million documents would add approximately 26,000,000 × 687 B ≈ 17–18 GB of data. The total data size, including annotations, is expected to remain well within the OpenSearch Serverless limit of 1 TiB per index.

Proposal

 

Based on the analysis, migrating from Amazon CloudSearch to OpenSearch is technically feasible. The current search API can be retained without modification by introducing a feature flag. This approach allows us to run CloudSearch and OpenSearch in parallel, enabling comparison of search results and user experience with minimal disruption.

Additionally, operating both systems temporarily will provide a more accurate estimate of the cost associated with indexing and managing approximately 26 million documents in OpenSearch.

To minimize operational overhead, it is recommended to adopt OpenSearch Serverless, which eliminates the need to provision and maintain infrastructure, allowing the team to focus on feature delivery and performance tuning.