CloudSearch to OpenSearch Migration
Background
Synapse offers a search feature for users, accessible through a dedicated Search API (SearchQuery | Synapse REST API). Search was initially implemented using AWS CloudSearch. However, because of its limitations and because it is no longer available for new deployments, CloudSearch is being phased out. While AWS has not officially announced an end-of-support date for existing users, it recommends migrating to Amazon OpenSearch Service. More information is available in Transition from Amazon CloudSearch to Amazon OpenSearch Service | Amazon Web Services.
CloudSearch limitations addressed by OpenSearch
Category | CloudSearch limitation | OpenSearch Solution |
---|---|---|
Query Language | Limited query flexibility | Full OpenSearch Query DSL (Elasticsearch-derived, JSON-based, supports bool, range, fuzziness, slop, intervals etc.) |
Custom Ranking | Minimal relevance tuning (Only via expr) | Function score queries, script scoring, boosting fields for advanced tuning |
Multi-field search | No native multi-field search | Use multi-match to search across multiple fields simultaneously |
Field Types | Limited field types (no boolean, nested, etc.) | Wide support: text, keyword, boolean, geo_point, nested, etc. |
Monitoring | No detailed logging or query trace | Built-in slow query logs, profiling, and monitoring via CloudWatch + APIs |
Aggregation/Facets | Limited aggregation capabilities (facets only) | Aggregations framework: terms, range, date_range, etc. |
Security | Only IAM-based security | Fine-grained access control (roles, field-level, document-level security) |
Data ingestion | Limited ingest and update options | Supports bulk API, ingest pipelines, Logstash, real-time indexing |
Testing | No testing tools or dev utilities | OpenSearch Dashboards with Dev Tools, query profiling, real-time testing |
Scaling & Performance Tuning | Scaling is automatic, but not tunable | Control over shards, replicas, index-level tuning, or serverless |
Integration | Limited integration ecosystem | Integrates with OpenSearch Dashboards (the Kibana fork), Beats, Logstash, Grafana, etc. |
Autocomplete | Simple suggesters | Completion + edge n-gram + full control |
OpenSearch Introduction
OpenSearch is a distributed search and analytics engine. After adding data to OpenSearch, we can perform full-text searches on it with all of the features we might expect: search by field, search multiple indexes, boost fields, rank results by score, sort results by field, and aggregate results. For more information, see the OpenSearch documentation.
OpenSearch Terminology
Document: A document is a unit that stores information (text or structured data). In OpenSearch, documents are stored in JSON format.
Index: An index is a collection of documents.
Node: A node is a server that stores data and handles search and indexing operations.
Cluster: An OpenSearch cluster is a collection of nodes.
Shard: A shard is a horizontal partition of data within an index. Each shard holds a subset of the index’s documents, enabling OpenSearch to scale data and queries across multiple nodes.
Primary and replica shards: OpenSearch automatically creates one replica shard for each primary shard by default. For example, if an index is split into 10 primary shards, OpenSearch will create 10 corresponding replica shards to enhance fault tolerance and availability.
Inverted index: OpenSearch uses an inverted index to perform fast full-text searches. This data structure maps each term (word) to the list of documents that contain it.
For example, consider two documents:
Document 1: “Beauty is in the eye of the beholder”
Document 2: “Beauty and the beast”
word | Document |
---|---|
beauty | 1,2 |
is | 1 |
and | 2 |
Relevance: When a search query is executed, OpenSearch matches the query terms against the indexed documents and assigns a relevance score to each result. This score indicates how closely a document matches the query criteria.
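The inverted index from the "Beauty" example above can be sketched in a few lines of plain Java. This is only an illustration of the data structure, not how OpenSearch implements it: the class name `InvertedIndexSketch` and the naive "lowercase and split on non-alphanumerics" analysis are assumptions made for the example.

```java
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.TreeSet;

public class InvertedIndexSketch {

    // Maps each lowercased term to the sorted set of ids of the documents that contain it.
    public static Map<String, TreeSet<Integer>> build(List<String> documents) {
        Map<String, TreeSet<Integer>> index = new LinkedHashMap<>();
        for (int i = 0; i < documents.size(); i++) {
            int docId = i + 1; // document ids start at 1, as in the example above
            // Naive analysis (assumption): lowercase, then split on anything that is not a letter or digit.
            for (String term : documents.get(i).toLowerCase().split("[^\\p{L}\\p{N}]+")) {
                if (!term.isEmpty()) {
                    index.computeIfAbsent(term, t -> new TreeSet<>()).add(docId);
                }
            }
        }
        return index;
    }

    public static void main(String[] args) {
        Map<String, TreeSet<Integer>> index = build(List.of(
                "Beauty is in the eye of the beholder",
                "Beauty and the beast"));
        System.out.println("beauty -> " + index.get("beauty")); // beauty -> [1, 2]
        System.out.println("is -> " + index.get("is"));         // is -> [1]
        System.out.println("and -> " + index.get("and"));       // and -> [2]
    }
}
```

A term lookup is then a single map access, which is why full-text matching stays fast as the number of documents grows.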
OpenSearch deployment options
OpenSearch Service domain: Amazon OpenSearch Service provides a managed environment to deploy and operate OpenSearch clusters. It gives you full control over configuration, including instance types, storage, and network settings. It supports fine-tuned performance optimization, availability zones, VPC access, and security configurations. This option may require in-depth knowledge of cluster management and maintenance, and is better suited to a long-term, always-on deployment.
OpenSearch Serverless: Amazon OpenSearch Serverless is an on-demand, serverless option for Amazon OpenSearch Service that eliminates the operational complexity of provisioning, configuring, and tuning OpenSearch clusters. With OpenSearch Serverless, we can search and analyze large volumes of data without managing the underlying infrastructure. An OpenSearch Serverless collection is a group of OpenSearch indexes that work together to support a specific workload or use case. Collections simplify operations compared to self-managed OpenSearch clusters, which require manual provisioning. For more information, see What is Amazon OpenSearch Serverless? - Amazon OpenSearch Service.
Types of collection in OpenSearch Serverless
Search: Full-text search over natural-language text, where documents are indexed with analyzers (tokenizers, stemmers, etc.) to support ranking, relevance, and partial matching. It suits workloads that find relevant documents from user-entered keywords, that depend on ranking, matching accuracy, and highlighting, and whose data is text-heavy.
Vector Search: Semantic, ML-based search that retrieves documents by comparing vector embeddings (numeric representations of meaning) instead of keywords. It is ideal for semantic similarity, natural-language, or ML-driven matching. The main use cases are retrieving content based on meaning (not just keywords), using embeddings from models such as BERT or OpenAI models, and building AI-powered features such as chatbots or recommendations.
Time Series: Time series search focuses on analyzing machine-generated, timestamped data such as logs, metrics, and events. The goal is often operational insight, security monitoring, or business performance tracking.
Feasibility Evaluation
To assess whether the features currently supported by CloudSearch can be replicated in OpenSearch, I explored the OpenSearch Serverless offering. As part of this evaluation, I created a collection of type "Search", which is specifically designed for full-text search use cases. OpenSearch Serverless supports three collection types: Search, Time Series, and Vector. Since our requirement focuses on full-text search, the “Search” type was selected. The following steps outline the approach taken during this evaluation:
Data Ingestion
A couple of documents were ingested into an index within the selected collection. The document format is the same as the one previously stored in CloudSearch. For more information, see this.
@Test
public void testDocumentUploadToOpenSearch() throws IOException {
    // create a document
    Document doc = createDocument();

    // OpenSearchClient and AwsSdk2Transport come from the opensearch-java client;
    // the transport signs each request for the "aoss" (OpenSearch Serverless) service
    SdkHttpClient httpClient = ApacheHttpClient.builder().build();
    OpenSearchClient client = new OpenSearchClient(
            new AwsSdk2Transport(
                    httpClient,
                    "tlenbt......amazonaws.com", // serverless collection endpoint
                    "aoss", // signing service name
                    Region.US_EAST_1, // signing service region
                    AwsSdk2TransportOptions.builder().build()
            )
    );

    // index the document under its Synapse id
    String index = "sample-index";
    IndexRequest<Document> request = new IndexRequest.Builder<Document>()
            .index(index)
            .id(doc.getId())
            .document(doc)
            .build();
    client.index(request);

    // read documents back with a match_all query, skipping the first two hits
    SearchRequest request1 = SearchRequest.of(r -> r
            .index(index)
            .query(q -> q.matchAll(m -> m))
            .from(2)
            .size(10)
    );
    SearchResponse<Map> response = client.search(request1, Map.class);
    for (Hit<Map> hit : response.hits().hits()) {
        System.out.println("Doc ID: " + hit.id());
        System.out.println("Source: " + hit.source());
    }
}
public Document createDocument() {
    // node metadata
    org.sagebionetworks.repo.model.Node node = new Node();
    node.setName("this is second folder");
    node.setDescription("second has some data");
    node.setId("syn1");
    node.setParentId("1234");
    node.setETag("0");
    node.setNodeType(EntityType.folder);
    Long nonexistentPrincipalId = 42L;
    node.setCreatedByPrincipalId(nonexistentPrincipalId);
    node.setCreatedOn(new Date());
    node.setModifiedByPrincipalId(nonexistentPrincipalId);
    node.setModifiedOn(new Date());
    node.setVersionLabel("versionLabel");

    // annotations
    Annotations additionalAnnos = new Annotations();
    AnnotationsV2TestUtils.putAnnotations(additionalAnnos, "organ", "kidney", AnnotationsValueType.STRING);
    AnnotationsV2TestUtils.putAnnotations(additionalAnnos, "longKey", "10", AnnotationsValueType.LONG);
    AnnotationsV2TestUtils.putAnnotations(additionalAnnos, "tissue", "eye lid", AnnotationsValueType.STRING);
    AnnotationsV2TestUtils.putAnnotations(additionalAnnos, "consortium", "C O N S O R T I U M", AnnotationsValueType.STRING);
    AnnotationsV2TestUtils.putAnnotations(additionalAnnos, "diagnosis", "2", AnnotationsValueType.LONG);
    String dateValue = Long.toString(System.currentTimeMillis());
    AnnotationsV2TestUtils.putAnnotations(additionalAnnos, "dateKey", dateValue, AnnotationsValueType.TIMESTAMP_MS);

    // ACL: one read/write principal and one read-only principal
    Set<ACCESS_TYPE> rwAccessType = new HashSet<ACCESS_TYPE>();
    rwAccessType.add(ACCESS_TYPE.READ);
    rwAccessType.add(ACCESS_TYPE.UPDATE);
    ResourceAccess rwResourceAccess = new ResourceAccess();
    rwResourceAccess.setPrincipalId(123L); // readWriteTest@sagebase.org
    rwResourceAccess.setAccessType(rwAccessType);
    Set<ACCESS_TYPE> roAccessType = new HashSet<ACCESS_TYPE>();
    roAccessType.add(ACCESS_TYPE.READ);
    ResourceAccess roResourceAccess = new ResourceAccess();
    roResourceAccess.setPrincipalId(456L); // readOnlyTest@sagebase.org
    roResourceAccess.setAccessType(roAccessType);
    Set<ResourceAccess> resourceAccesses = new HashSet<ResourceAccess>();
    resourceAccesses.add(rwResourceAccess);
    resourceAccesses.add(roResourceAccess);
    AccessControlList acl = new AccessControlList();
    acl.setResourceAccess(resourceAccesses);

    // wiki text is indexed together with the node
    String wikiPageText = "title\nmarkdown\nwiki is useful to find out content";
    return searchDocumentDriver.formulateSearchDocument(node, additionalAnnos, acl, wikiPageText);
}
Search
To test the search capabilities, I used OpenSearch Dashboards to execute queries with various filters. This approach provided a quick way to check whether the necessary filters can be applied and whether they are compatible with the current search functionality.
Current Feature | OpenSearch Query | Result |
---|---|---|
GET sample-index/_search
{
"query": {
"simple_query_string": {
"fields": [ "fields.name" ,"fields.description"],
"query": "second folder replica"
}
}
}
The default operator is OR, meaning the query matches documents containing any of the query terms (e.g., "second", "folder", or "replica"). The search can be narrowed by specifying the fields in which the query terms should be matched.
The minimum number of terms that a document must match to be returned in the results can be controlled by specifying the minimum_should_match parameter.
If no query term is provided, all documents are returned:
GET sample-index/_search
| {
"took": 25,
"timed_out": false,
"_shards": {
"total": 0,
"successful": 0,
"skipped": 0,
"failed": 0
},
"hits": {
"total": {
"value": 3,
"relation": "eq"
},
"max_score": 1.1130829,
"hits": [
{
"_index": "sample-index",
"_id": "syn2",
"_score": 1.1130829,
"_source": {
"type": "add",
"id": "syn2",
"fields": {
"name": "this is second folder",
"description": """title
markdown
wiki is useful to find out content""",
"parent_id": "1234",
"node_type": "folder",
"etag": "0",
"created_on": 1747084178,
"modified_on": 1747084178,
"created_by": "42",
"modified_by": "42",
"acl": [
"456",
"123"
],
"update_acl": [
"123"
],
"diagnosis": "2",
"tissue": "eye lid",
"consortium": "C O N S O R T I U M",
"organ": "kidney"
}
}
},
{
"_index": "sample-index",
"_id": "syn3",
"_score": 0.9186288,
"_source": {
"type": "add",
"id": "syn3",
"fields": {
"name": "replica",
"description": """title
markdown
wiki is useful to find out content""",
"parent_id": "4567",
"node_type": "folder",
"etag": "0",
"created_on": 1747181404,
"modified_on": 1747181404,
"created_by": "45",
"modified_by": "45",
"acl": [
"456",
"123"
],
"update_acl": [
"123"
],
"diagnosis": "2",
"tissue": "eye lid",
"consortium": "C O N S O R T I U M",
"organ": "any"
}
}
},
{
"_index": "sample-index",
"_id": "syn1",
"_score": 0.2876821,
"_source": {
"type": "add",
"id": "syn1",
"fields": {
"name": "this is first folder",
"description": """title
markdown
wiki is useful to find out content""",
"parent_id": "1234",
"node_type": "folder",
"etag": "0",
"created_on": 1747084018,
"modified_on": 1747084018,
"created_by": "42",
"modified_by": "42",
"acl": [
"456",
"123"
],
"update_acl": [
"123"
],
"diagnosis": "1",
"tissue": "ear lobe",
"consortium": "C O N S O R T I U M",
"organ": "ORGAN"
}
}
}
]
}
} | |
GET sample-index/_search
{
"query": {
"bool": {
"must": [
{
"simple_query_string": {
"fields": ["fields.name", "fields.description"],
"query": "second folder replica"
}
},
{
"match": {
"fields.created_by": "42"
}
}
],
"must_not": [
{
"match": {
"fields.created_by": "45"
}
}
]
}
}
}
must: Logical AND operator. The results must match all queries in this clause.
must_not: Logical NOT operator. All matches are excluded from the results.
|
{
"hits": {
"total": {
"value": 2,
"relation": "eq"
},
"max_score": 1.8062301,
"hits": [
{
"_index": "sample-index",
"_id": "syn2",
"_score": 1.8062301,
"_source": {
"type": "add",
"id": "syn2",
"fields": {
"name": "this is second folder",
"description": """title
markdown
wiki is useful to find out content""",
"parent_id": "1234",
"node_type": "folder",
"etag": "0",
"created_on": 1747084178,
"modified_on": 1747084178,
"created_by": "42",
"modified_by": "42",
"acl": [
"456",
"123"
],
"update_acl": [
"123"
],
"diagnosis": "2",
"tissue": "eye lid",
"consortium": "C O N S O R T I U M",
"organ": "kidney"
}
}
},
{
"_index": "sample-index",
"_id": "syn1",
"_score": 0.5753642,
"_source": {
"type": "add",
"id": "syn1",
"fields": {
"name": "this is first folder",
"description": """title
markdown
wiki is useful to find out content""",
"parent_id": "1234",
"node_type": "folder",
"etag": "0",
"created_on": 1747084018,
"modified_on": 1747084018,
"created_by": "42",
"modified_by": "42",
"acl": [
"456",
"123"
],
"update_acl": [
"123"
],
"diagnosis": "1",
"tissue": "ear lobe",
"consortium": "C O N S O R T I U M",
"organ": "ORGAN"
}
}
}
]
}
} | |
rangeQuery | GET sample-index/_search
{
"query": {
"range": {
"fields.modified_on": {
"gte": 1747084178,
"lte": 1747084185
}
}
}
}
| {
"took": 42,
"timed_out": false,
"_shards": {
"total": 0,
"successful": 0,
"skipped": 0,
"failed": 0
},
"hits": {
"total": {
"value": 1,
"relation": "eq"
},
"max_score": 1,
"hits": [
{
"_index": "sample-index",
"_id": "syn2",
"_score": 1,
"_source": {
"type": "add",
"id": "syn2",
"fields": {
"name": "this is second folder",
"description": """title
markdown
wiki is useful to find out content""",
"parent_id": "1234",
"node_type": "folder",
"etag": "0",
"created_on": 1747084178,
"modified_on": 1747084178,
"created_by": "42",
"modified_by": "42",
"acl": [
"456",
"123"
],
"update_acl": [
"123"
],
"diagnosis": "2",
"tissue": "eye lid",
"consortium": "C O N S O R T I U M",
"organ": "kidney"
}
}
}
]
}
} |
GET sample-index/_search
{
"query": {
"match_all": {}
},
"aggs": {
"organ_facet": {
"terms": {
"field": "fields.organ.keyword",
"size": 10,
"order": {
"_count": "desc"
}
}
}
}
}
By default, OpenSearch doesn’t support aggregations on a text field. Because text fields are tokenized, an aggregation on a text field has to reverse the tokenization process back to its original string and then formulate an aggregation based on that. This kind of operation consumes significant memory and degrades cluster performance. While aggregations on text fields can be enabled by setting the fielddata parameter to true in the mapping, the recommended approach is to keep a raw version of the text field as a keyword field to aggregate on.
A text field that is analyzed cannot be used to sort documents, because the inverted index only contains the individual tokenized terms and not the entire string. To bypass this limitation, use a raw version of the text field mapped as a keyword type; in the query above, fields.organ.keyword is that raw copy. | {
"took": 68,
"timed_out": false,
"_shards": {
"total": 0,
"successful": 0,
"skipped": 0,
"failed": 0
},
"hits": {
"total": {
"value": 3,
"relation": "eq"
},
"max_score": 1,
"hits": [
{
"_index": "sample-index",
"_id": "syn2",
"_score": 1,
"_source": {
"type": "add",
"id": "syn2",
"fields": {
"name": "this is second folder",
"description": """title
markdown
wiki is useful to find out content""",
"parent_id": "1234",
"node_type": "folder",
"etag": "0",
"created_on": 1747084178,
"modified_on": 1747084178,
"created_by": "42",
"modified_by": "42",
"acl": [
"456",
"123"
],
"update_acl": [
"123"
],
"diagnosis": "2",
"tissue": "eye lid",
"consortium": "C O N S O R T I U M",
"organ": "kidney"
}
}
},
{
"_index": "sample-index",
"_id": "syn3",
"_score": 1,
"_source": {
"type": "add",
"id": "syn3",
"fields": {
"name": "replica",
"description": """title
markdown
wiki is useful to find out content""",
"parent_id": "4567",
"node_type": "folder",
"etag": "0",
"created_on": 1747181404,
"modified_on": 1747181404,
"created_by": "45",
"modified_by": "45",
"acl": [
"456",
"123"
],
"update_acl": [
"123"
],
"diagnosis": "2",
"tissue": "eye lid",
"consortium": "C O N S O R T I U M",
"organ": "any"
}
}
},
{
"_index": "sample-index",
"_id": "syn1",
"_score": 1,
"_source": {
"type": "add",
"id": "syn1",
"fields": {
"name": "this is first folder",
"description": """title
markdown
wiki is useful to find out content""",
"parent_id": "1234",
"node_type": "folder",
"etag": "0",
"created_on": 1747084018,
"modified_on": 1747084018,
"created_by": "42",
"modified_by": "42",
"acl": [
"456",
"123"
],
"update_acl": [
"123"
],
"diagnosis": "1",
"tissue": "ear lobe",
"consortium": "C O N S O R T I U M",
"organ": "ORGAN"
}
}
}
]
},
"aggregations": {
"organ_facet": {
"doc_count_error_upper_bound": 0,
"sum_other_doc_count": 0,
"buckets": [
{
"key": "ORGAN",
"doc_count": 1
},
{
"key": "any",
"doc_count": 1
},
{
"key": "kidney",
"doc_count": 1
}
]
}
}
} | |
returnFields: Specifies the document fields to include in the response | GET sample-index/_search
{
"_source": ["id", "fields.name", "fields.description"],
"query": {
"match_all": {}
}
} | {
"took": 27,
"timed_out": false,
"_shards": {
"total": 0,
"successful": 0,
"skipped": 0,
"failed": 0
},
"hits": {
"total": {
"value": 3,
"relation": "eq"
},
"max_score": 1,
"hits": [
{
"_index": "sample-index",
"_id": "syn2",
"_score": 1,
"_source": {
"id": "syn2",
"fields": {
"name": "this is second folder",
"description": """title
markdown
wiki is useful to find out content"""
}
}
},
{
"_index": "sample-index",
"_id": "syn3",
"_score": 1,
"_source": {
"id": "syn3",
"fields": {
"name": "replica",
"description": """title
markdown
wiki is useful to find out content"""
}
}
},
{
"_index": "sample-index",
"_id": "syn1",
"_score": 1,
"_source": {
"id": "syn1",
"fields": {
"name": "this is first folder",
"description": """title
markdown
wiki is useful to find out content"""
}
}
}
]
}
} |
size: The maximum number of search hits to return. | GET sample-index/_search
{
"query": {
"match_all": {}
},
"size": 2
} | {
"took": 18,
"timed_out": false,
"_shards": {
"total": 0,
"successful": 0,
"skipped": 0,
"failed": 0
},
"hits": {
"total": {
"value": 3,
"relation": "eq"
},
"max_score": 1,
"hits": [
{
"_index": "sample-index",
"_id": "syn2",
"_score": 1,
"_source": {
"type": "add",
"id": "syn2",
"fields": {
"name": "this is second folder",
"description": """title
markdown
wiki is useful to find out content""",
"parent_id": "1234",
"node_type": "folder",
"etag": "0",
"created_on": 1747084178,
"modified_on": 1747084178,
"created_by": "42",
"modified_by": "42",
"acl": [
"456",
"123"
],
"update_acl": [
"123"
],
"diagnosis": "2",
"tissue": "eye lid",
"consortium": "C O N S O R T I U M",
"organ": "kidney"
}
}
},
{
"_index": "sample-index",
"_id": "syn3",
"_score": 1,
"_source": {
"type": "add",
"id": "syn3",
"fields": {
"name": "replica",
"description": """title
markdown
wiki is useful to find out content""",
"parent_id": "4567",
"node_type": "folder",
"etag": "0",
"created_on": 1747181404,
"modified_on": 1747181404,
"created_by": "45",
"modified_by": "45",
"acl": [
"456",
"123"
],
"update_acl": [
"123"
],
"diagnosis": "2",
"tissue": "eye lid",
"consortium": "C O N S O R T I U M",
"organ": "any"
}
}
}
]
}
} |
start: The zero-based number of the first hit returned in this page of search results. | GET sample-index/_search
{
"query": {
"match_all": {}
},
"size": 10,
"from" :2
} | {
"took": 27,
"timed_out": false,
"_shards": {
"total": 0,
"successful": 0,
"skipped": 0,
"failed": 0
},
"hits": {
"total": {
"value": 3,
"relation": "eq"
},
"max_score": 1,
"hits": [
{
"_index": "sample-index",
"_id": "syn1",
"_score": 1,
"_source": {
"type": "add",
"id": "syn1",
"fields": {
"name": "this is first folder",
"description": """title
markdown
wiki is useful to find out content""",
"parent_id": "1234",
"node_type": "folder",
"etag": "0",
"created_on": 1747084018,
"modified_on": 1747084018,
"created_by": "42",
"modified_by": "42",
"acl": [
"456",
"123"
],
"update_acl": [
"123"
],
"diagnosis": "1",
"tissue": "ear lobe",
"consortium": "C O N S O R T I U M",
"organ": "ORGAN"
}
}
}
]
}
} |
ACL filter | GET sample-index/_search
{
"query": {
"terms": {
"fields.acl": ["123"]
}
}
}
| {
"took": 27,
"timed_out": false,
"_shards": {
"total": 0,
"successful": 0,
"skipped": 0,
"failed": 0
},
"hits": {
"total": {
"value": 3,
"relation": "eq"
},
"max_score": 1,
"hits": [
{
"_index": "sample-index",
"_id": "syn2",
"_score": 1,
"_source": {
"type": "add",
"id": "syn2",
"fields": {
"name": "this is second folder",
"description": """title
markdown
wiki is useful to find out content""",
"parent_id": "1234",
"node_type": "folder",
"etag": "0",
"created_on": 1747084178,
"modified_on": 1747084178,
"created_by": "42",
"modified_by": "42",
"acl": [
"456",
"123"
],
"update_acl": [
"123"
],
"diagnosis": "2",
"tissue": "eye lid",
"consortium": "C O N S O R T I U M",
"organ": "kidney"
}
}
},
{
"_index": "sample-index",
"_id": "syn3",
"_score": 1,
"_source": {
"type": "add",
"id": "syn3",
"fields": {
"name": "replica",
"description": """title
markdown
wiki is useful to find out content""",
"parent_id": "4567",
"node_type": "folder",
"etag": "0",
"created_on": 1747181404,
"modified_on": 1747181404,
"created_by": "45",
"modified_by": "45",
"acl": [
"456",
"123"
],
"update_acl": [
"123"
],
"diagnosis": "2",
"tissue": "eye lid",
"consortium": "C O N S O R T I U M",
"organ": "any"
}
}
},
{
"_index": "sample-index",
"_id": "syn1",
"_score": 1,
"_source": {
"type": "add",
"id": "syn1",
"fields": {
"name": "this is first folder",
"description": """title
markdown
wiki is useful to find out content""",
"parent_id": "1234",
"node_type": "folder",
"etag": "0",
"created_on": 1747084018,
"modified_on": 1747084018,
"created_by": "42",
"modified_by": "42",
"acl": [
"456",
"123"
],
"update_acl": [
"123"
],
"diagnosis": "1",
"tissue": "ear lobe",
"consortium": "C O N S O R T I U M",
"organ": "ORGAN"
}
}
}
]
}
} |
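The rows above exercise each feature separately. As a rough sketch of how they might be combined into one request body (query term, ACL filter, returned fields, and pagination), the plain-Java helper below assembles the JSON by hand. The class `SearchBodySketch` and its method names are hypothetical, the field names are taken from the sample documents, and a real implementation would more likely use the opensearch-java request builders shown in the ingestion test.

```java
import java.util.List;
import java.util.stream.Collectors;

public class SearchBodySketch {

    // Escapes backslashes and double quotes so a value can be embedded in a JSON string.
    // Control characters are not handled; this is enough for the illustration only.
    static String quote(String value) {
        return "\"" + value.replace("\\", "\\\\").replace("\"", "\\\"") + "\"";
    }

    // Renders a list of strings as a JSON array of string literals.
    static String jsonArray(List<String> values) {
        return values.stream().map(SearchBodySketch::quote).collect(Collectors.joining(",", "[", "]"));
    }

    // Combines the text query (must) with an ACL terms filter (filter) in one bool query.
    public static String build(String queryTerm, List<String> principalIds,
                               List<String> returnFields, int start, int size) {
        String textQuery = "{\"simple_query_string\":{\"fields\":[\"fields.name\",\"fields.description\"],"
                + "\"query\":" + quote(queryTerm) + "}}";
        String aclFilter = "{\"terms\":{\"fields.acl\":" + jsonArray(principalIds) + "}}";
        return "{\"_source\":" + jsonArray(returnFields)
                + ",\"from\":" + start   // CloudSearch "start"
                + ",\"size\":" + size    // CloudSearch "size"
                + ",\"query\":{\"bool\":{\"must\":[" + textQuery + "],\"filter\":[" + aclFilter + "]}}}";
    }

    public static void main(String[] args) {
        System.out.println(build("second folder", List.of("123"), List.of("id", "fields.name"), 0, 10));
    }
}
```

Placing the ACL check in the filter clause keeps it out of relevance scoring, so the principal list restricts the result set without changing the ranking.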
OpenSearch Response Object
The current API returns a SearchResults object. We can map the OpenSearch response to this existing structure with some important considerations:
In OpenSearch, each aggregation is identified by a unique name and must be defined using one of the supported aggregation types. The aggregation name in the response effectively serves as the facet name. To maintain compatibility with the existing code, it's recommended to use the field name as the aggregation name. This allows us to derive the facet type from the field directly. Alternatively, we can introduce a new enum that maps aggregation names to their corresponding facet types.
// existing code to get FacetType
IndexFieldToSynapseFacetType.getSynapseFacetType(SynapseToCloudSearchField.cloudSearchFieldFor(facetName).getType())
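The alternative mentioned above, an enum that maps aggregation names to facet types, could look roughly like the following. Everything here is hypothetical: `FacetAggregation`, the stand-in `FacetType` enum, and the specific field-to-type assignments are assumptions for illustration, not the existing Synapse types.

```java
import java.util.Arrays;
import java.util.Optional;

public class AggregationFacetSketch {

    /** Hypothetical stand-in for the facet types exposed by the existing API. */
    public enum FacetType { LITERAL, DATE, CONTINUOUS }

    /** Hypothetical enum: one constant per aggregation name used in search requests. */
    public enum FacetAggregation {
        ORGAN("organ", "fields.organ.keyword", FacetType.LITERAL),
        NODE_TYPE("node_type", "fields.node_type.keyword", FacetType.LITERAL),
        MODIFIED_ON("modified_on", "fields.modified_on", FacetType.DATE);

        private final String aggregationName;
        private final String indexField;
        private final FacetType facetType;

        FacetAggregation(String aggregationName, String indexField, FacetType facetType) {
            this.aggregationName = aggregationName;
            this.indexField = indexField;
            this.facetType = facetType;
        }

        public String aggregationName() { return aggregationName; }
        public String indexField() { return indexField; }
        public FacetType facetType() { return facetType; }

        // Looks up the constant for an aggregation name found in a response.
        public static Optional<FacetAggregation> fromAggregationName(String name) {
            return Arrays.stream(values())
                    .filter(a -> a.aggregationName.equals(name))
                    .findFirst();
        }
    }

    public static void main(String[] args) {
        System.out.println(FacetAggregation.fromAggregationName("organ").map(FacetAggregation::facetType));
    }
}
```

Keeping the index field next to the aggregation name means the same constant can drive both the request (which field to aggregate on) and the response mapping (which facet type to report).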
Unlike CloudSearch, OpenSearch responses do not include the from (pagination offset) value. It must be copied from the request into the SearchResults object when the response is constructed.
// OpenSearch request
SearchRequest request = SearchRequest.of(r -> r
        .index(index)
        .query(q -> q.matchAll(m -> m))
        // the aggregation is named after the field so the facet type can be derived from it
        .aggregations("organ", a -> a
                .terms(t -> t
                        .field("fields.organ.keyword")
                        .size(10)
                        .order(List.of(Map.of("_count", SortOrder.Desc))) // most frequent values first
                ))
        .from(0)
        .size(10)
);
SearchResponse<Map> response = client.search(request, Map.class);

// prepare the hit list with the documents found
List<org.sagebionetworks.repo.model.search.Hit> hitList = new ArrayList<>();
for (Hit<Map> hit : response.hits().hits()) {
    Map<String, Object> fields = (Map<String, Object>) hit.source().get("fields");
    org.sagebionetworks.repo.model.search.Hit synapseHit = new org.sagebionetworks.repo.model.search.Hit();
    synapseHit.setId(hit.id());
    synapseHit.setDescription((String) fields.get("description"));
    synapseHit.setCreated_by((String) fields.get("created_by"));
    // add all the fields
    hitList.add(synapseHit);
}

// build the facets from the aggregations in the OpenSearch response
Map<String, Aggregate> aggs = response.aggregations();
List<Facet> facetList = new ArrayList<>();
for (Map.Entry<String, Aggregate> entry : aggs.entrySet()) {
    String facetName = entry.getKey();
    Aggregate aggregate = entry.getValue();
    if (aggregate.isSterms()) {
        StringTermsAggregate termsAgg = aggregate.sterms();
        Facet facet = new Facet();
        facet.setName(facetName);
        FacetTypeNames facetType = IndexFieldToSynapseFacetType.getSynapseFacetType(SynapseToCloudSearchField.cloudSearchFieldFor(facetName).getType());
        facet.setType(facetType);
        List<FacetConstraint> constraints = new ArrayList<>();
        for (StringTermsBucket bucket : termsAgg.buckets().array()) {
            FacetConstraint constraint = new FacetConstraint();
            constraint.setValue(bucket.key());
            constraint.setCount(bucket.docCount());
            constraints.add(constraint);
        }
        facet.setConstraints(constraints);
        facetList.add(facet);
    }
}

// assemble the results once, after all aggregations have been processed
SearchResults results = new SearchResults();
results.setFacets(facetList);
results.setFound(response.hits().total().value()); // total number of matching documents
results.setStart(request.from().longValue()); // the offset is not echoed in the response, so copy it from the request
results.setHits(hitList);
System.out.print(" hit list is : " + hitList);
The existing search API endpoint can be reused for OpenSearch integration. All current features can be supported, and the response can be returned in the same format as before.
Cost and Data Size Estimation
Size: We currently have approximately 26 million documents stored in CloudSearch. Based on observed data, where every 10 documents consume approximately 3–4 KB and 692 documents occupy about 2 MB, we estimate the average size of one document to be 4 KB for approximation purposes.
Using this estimate:
26 million × 4 KB = 104 million KB ≈ 99 GB (at 1 GB = 1024² KB) total data volume.
OpenSearch Serverless Size Limitations
Up to 1 TiB of data per index for search and vector search collections
Up to 100 TiB of hot data per index for time series collections
This means our existing 99 GB of data can comfortably fit within the supported limits of a single index in OpenSearch Serverless.
Cost: The data size is ~100 GB, and the cost for this volume has already been estimated in Sage Portals OpenSearch Integration.
Currently, user annotations are not included in the search documents. However, to plan for potential future support, we have estimated the storage impact of including annotations. These annotations are stored as JSON in the Node_Revision table within our SQL database.
The following query was used to calculate the average size of an annotation:
SELECT AVG(CHAR_LENGTH(USER_ANNOTATIONS)) AS avg_chars,
AVG(OCTET_LENGTH(USER_ANNOTATIONS)) AS avg_bytes
FROM NODE_REVISION
WHERE USER_ANNOTATIONS IS NOT NULL;
//Result
avg_chars = 686.5832
avg_bytes = 686.6047
For 26 million documents, this amounts to roughly 17 GB of annotation data (26 million × ~687 bytes). The total data size, including annotations, is expected to remain well within the OpenSearch Serverless limit of 1 TiB per index.
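The estimates in this section can be reproduced with a few lines of arithmetic. The sketch below assumes binary units (1 KB = 1024 bytes), which is what yields the ~99 figure for the documents; under the same convention the annotations come to a little under 17 GiB. The class name `IndexSizeEstimate` is hypothetical, and the inputs are the figures quoted above.

```java
public class IndexSizeEstimate {

    static final long DOCUMENT_COUNT = 26_000_000L;       // documents currently in CloudSearch
    static final long AVG_DOCUMENT_BYTES = 4L * 1024;     // 4 KB per document (estimate above)
    static final double AVG_ANNOTATION_BYTES = 686.6047;  // avg_bytes from the SQL query above
    static final long INDEX_LIMIT_BYTES = 1L << 40;       // 1 TiB per index (search collections)

    public static double documentBytes() {
        return (double) DOCUMENT_COUNT * AVG_DOCUMENT_BYTES;
    }

    public static double annotationBytes() {
        return DOCUMENT_COUNT * AVG_ANNOTATION_BYTES;
    }

    public static double toGiB(double bytes) {
        return bytes / (1L << 30);
    }

    // True when documents plus annotations stay below the single-index limit.
    public static boolean fitsInSingleIndex() {
        return documentBytes() + annotationBytes() < INDEX_LIMIT_BYTES;
    }

    public static void main(String[] args) {
        double total = documentBytes() + annotationBytes();
        System.out.printf("documents:   %.1f GiB%n", toGiB(documentBytes()));
        System.out.printf("annotations: %.1f GiB%n", toGiB(annotationBytes()));
        System.out.printf("total:       %.1f GiB (%.1f%% of the 1 TiB limit)%n",
                toGiB(total), 100.0 * total / INDEX_LIMIT_BYTES);
        System.out.println("fits in one index: " + fitsInSingleIndex());
    }
}
```

Note that this only covers raw document size; the on-disk index size in OpenSearch depends on mappings and analyzers and would need to be measured during the parallel run.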
Proposal
Based on the analysis, migrating from Amazon CloudSearch to OpenSearch is technically feasible. The current search API can be retained without modification by introducing a feature flag. This approach allows us to run CloudSearch and OpenSearch in parallel, enabling comparison of search results and user experience with minimal disruption.
Additionally, operating both systems temporarily will provide a more accurate estimate of the cost associated with indexing and managing approximately 26 million documents in OpenSearch.
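One way the feature flag could be wired is sketched below: a thin router sits behind the existing search API and delegates each request to either engine. The `SearchBackend` interface, the `FlaggedSearch` class, and the way the flag is supplied are all assumptions for illustration; the actual DAO interfaces and flag mechanism in Synapse may differ.

```java
import java.util.List;
import java.util.function.BooleanSupplier;

public class SearchRoutingSketch {

    /** Hypothetical abstraction over a search engine; returns matching document ids. */
    public interface SearchBackend {
        List<String> search(String queryTerm);
    }

    /** Delegates to OpenSearch when the flag is on, otherwise to CloudSearch. */
    public static final class FlaggedSearch implements SearchBackend {
        private final SearchBackend cloudSearch;
        private final SearchBackend openSearch;
        private final BooleanSupplier openSearchEnabled;

        public FlaggedSearch(SearchBackend cloudSearch, SearchBackend openSearch,
                             BooleanSupplier openSearchEnabled) {
            this.cloudSearch = cloudSearch;
            this.openSearch = openSearch;
            this.openSearchEnabled = openSearchEnabled;
        }

        @Override
        public List<String> search(String queryTerm) {
            // The flag is read on every call, so it can be flipped without a restart.
            SearchBackend target = openSearchEnabled.getAsBoolean() ? openSearch : cloudSearch;
            return target.search(queryTerm);
        }
    }

    public static void main(String[] args) {
        // Stub backends that only tag the query with the engine that served it.
        SearchBackend cloudSearch = q -> List.of("cloudsearch:" + q);
        SearchBackend openSearch = q -> List.of("opensearch:" + q);
        boolean[] flag = {false};
        SearchBackend search = new FlaggedSearch(cloudSearch, openSearch, () -> flag[0]);
        System.out.println(search.search("folder")); // [cloudsearch:folder]
        flag[0] = true;
        System.out.println(search.search("folder")); // [opensearch:folder]
    }
}
```

Because callers only see the common interface, the same router could later be extended to query both engines and log differences while results are being compared.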
To minimize operational overhead, it is recommended to adopt OpenSearch Serverless, which eliminates the need to provision and maintain infrastructure, allowing the team to focus on feature delivery and performance tuning.