Architecture Design for OpenSearch Integration

Architecture Design for OpenSearch Integration

Author: @Bryan Fauble

Overview

Portal Search provides full-text search over Synapse Tables, EntityViews, Datasets, DatasetCollections, MaterializedViews, and SubmissionViews (VirtualTables excluded). Each search index is a Synapse Entity (SearchIndex) with its own OpenSearch Serverless (AOSS) index, state, and build lifecycle.

Design Principles

Simple composition. SearchIndex → SearchConfiguration → {SynonymSets[], ColumnAnalyzerOverrides[]}.

4 standalone + 1 entity. SynonymSet, ColumnAnalyzerOverride, TextAnalyzer, and SearchConfiguration are organization-scoped resources under /search/*. TextAnalyzer is currently read-only (GET and LIST); system analyzers (IDs 1-6) are bootstrapped on startup. SearchIndex is a Synapse Entity (VersionableEntity + HasDefiningSql) managed via /repo/v1/entity/*; only search query operations use /search/*.

Split authorization model. Configuration resources use Organization ACL (publicly readable, mutations require org permissions). SearchIndex uses standard entity ACLs via parentId (Project or Folder). Search queries require READ on both the SearchIndex entity and the source table. A pilot gate restricts SearchIndex creation to Portal Managers.

Build-once semantics. AOSS indexes are point-in-time snapshots built on entity creation. UPDATE messages only trigger a build if no AOSS index exists yet (migration backfill). To force a rebuild, delete and recreate the SearchIndex entity. Entity metadata changes (name, definingSQL, searchConfigurationId) are persisted and take effect on the next build during stack migration.

Schema from SQL. Only the columns selected by definingSQL are mapped in the AOSS index. Column schema is derived at build time via QueryTranslator.getSchemaOfSelect() inside the RowHandlerProvider callback. SELECT studyName, summary FROM syn123 maps only those two columns; SELECT * FROM syn123 maps all.

Anonymous-only indexing. The lifecycle worker streams data as the anonymous user via TableQueryManager.runQueryAsStream(). Row-level filtering via addRowLevelFilter() ensures only publicly visible rows enter the AOSS index.

Two-database architecture. Transactional DB (migrated) stores configuration and entity data; Indexing DB (not migrated) stores runtime state. On migration, status rows start empty and the worker rebuilds indexes organically.

Single polling point. Search queries run as async jobs. If the index is still building (CREATING), the query worker throws RecoverableMessageException for automatic retry. Clients only poll the async result endpoint.

System Context

image-20260303-233203.png
graph TB subgraph "Data Portals" Portals["NF Portal / AD Portal / ELITE"] end subgraph "Synapse REST API" EntityAPI["/repo/v1/entity/*<br/>(SearchIndex CRUD)"] ConfigAPI["/search/synonym/set/*<br/>/search/column/analyzer/override/*<br/>/search/text/analyzer/*<br/>/search/configuration/*"] QueryAPI["/search/query/async/*<br/>/search/autocomplete"] end subgraph "Validation" MV[SearchIndexMetadataProvider<br/>pilot gate + definingSQL validation] end subgraph "Manager Layer" M1[SynonymSetManager] M2[ColumnAnalyzerOverrideManager] M3[TextAnalyzerManager] M4[SearchConfigurationManager] M5[SearchIndexQueryManager] end subgraph "Bootstrap" Boot[TextAnalyzerBootstrapper<br/>bootstraps 6 system analyzers on startup] end subgraph "Storage" TDB[(MySQL<br/>Transactional DB<br/>NODE + NODE_REVISION<br/>+ 4 config tables)] IDB[(MySQL<br/>Indexing DB<br/>SEARCH_INDEX_STATUS)] AOSS[(OpenSearch AOSS<br/>`stack`-`instance`-searchindex)] end subgraph "Event Processing" NodeDAO["NodeDAO"] TxMsg["TransactionalMessenger"] SNS["SNS: ENTITY"] SQS["SQS: SEARCH_INDEX_LIFECYCLE"] W1["SearchIndexLifecycleWorker"] end Portals --> EntityAPI & ConfigAPI & QueryAPI EntityAPI --> MV --> TDB ConfigAPI --> M1 & M2 & M3 & M4 --> TDB QueryAPI --> M5 --> AOSS Boot --> TDB NodeDAO --> TxMsg --> SNS --> SQS --> W1 W1 --> TDB & IDB & AOSS style Portals fill:#4a9eff,color:#fff style EntityAPI fill:#50c878,color:#fff style ConfigAPI fill:#50c878,color:#fff style QueryAPI fill:#50c878,color:#fff style TDB fill:#ff8c00,color:#fff style IDB fill:#ff8c00,color:#fff style AOSS fill:#9b59b6,color:#fff style W1 fill:#e74c3c,color:#fff style Boot fill:#f39c12,color:#fff

SearchIndex CRUD goes through the standard entity controller; SearchIndexMetadataProvider validates on create/update. Configuration resources have dedicated controllers. Search queries go through async job infrastructure. The lifecycle worker subscribes to ENTITY change messages via a dedicated SQS queue and filters for searchindex node types internally. TextAnalyzerBootstrapper runs on startup to ensure 6 system analyzers exist (IDs 1-6).

Security & Authorization

Control Plane: Organization ACL (Configuration Resources)

All configuration resources are publicly readable (GET/list). Mutations require Organization ACL permissions:

AuthorizationUtils.disallowAnonymous(user); aclDao.canAccess(user, organizationId, ObjectType.ORGANIZATION, ACCESS_TYPE.CREATE) .checkAuthorizationOrElseThrow();

Control Plane: Pilot Gate (SearchIndex Entity)

Enforced in SearchIndexMetadataProvider.validateEntity() on both CREATE and UPDATE:

if (!AuthorizationUtils.isPortalManagerOrAdmin(user)) { throw new UnauthorizedException("Only Portal Managers can create or update SearchIndex entities."); } searchIndexValidator.validateDefiningSQL(entity.getDefiningSQL());

Control Plane: Shared Resource Deletion Protection

SynonymSets and ColumnAnalyzerOverrides cannot be deleted while referenced by any SearchConfiguration. Checked at the manager layer:

List<SearchConfiguration> refs = searchConfigurationDao.findBySynonymSetId(synonymSetId); if (!refs.isEmpty()) { throw new IllegalArgumentException("Cannot delete synonym set because it is referenced by search configurations: " + ids); }

findBySynonymSetId() uses MySQL's JSON_CONTAINS() function to query the JSON array column.

Data Plane: Search & Autocomplete

Two READ checks are performed by SearchIndexQueryManagerImpl:

  1. READ on the SearchIndex entity

  2. READ on the source table entity (parsed from definingSQL)

Data is indexed as the anonymous user (principal 273950). addRowLevelFilter() in TableQueryManagerImpl applies benefactor ACL filtering, ensuring only publicly visible rows enter the AOSS index.

Control Plane: Optimistic Concurrency

All three mutable configuration resource DAOs enforce OCC via SELECT ... FOR UPDATE:

String currentEtag = getCurrentEtag(id); // SELECT ETAG ... FOR UPDATE if (!currentEtag.equals(request.getEtag())) { throw new ConflictingUpdateException("Resource was updated since last fetched."); }

Version Prevention

SearchIndex implements VersionableEntity (required for NODE_REVISION.DEFINING_SQL mapping) but versioning is disabled in EntityManagerImpl.updateEntity():

if (updated instanceof SearchIndex) { newVersion = false; }

Configuration & Resolution

Defining SQL

definingSQL scopes which columns and rows are indexed. Validated by SearchIndexValidator:

List<IdAndVersion> sourceTableIds = TableModelUtils.getSourceTableIds(definingSQL); if (sourceTableIds.size() != 1) { throw new IllegalArgumentException("definingSQL must reference exactly one entity."); }

Only the columns in the SELECT clause are mapped in the AOSS index — derived at build time via QueryTranslator.getSchemaOfSelect() inside the RowHandlerProvider callback (after SQL parsing, before row execution).

Configuration Resolution

SearchConfigurationResolver resolves the effective configuration at build time:

image-20260303-233820.png

 

flowchart TD Start([resolve]) --> HasConfig{searchConfigurationId<br/>set on entity?} HasConfig -->|Yes| UseExplicit[Load that SearchConfiguration] HasConfig -->|No| CheckPS["projectSettingsManager.getProjectSettingForNode()<br/>→ SearchConfigurationListSetting"] CheckPS --> HasPS{Found?} HasPS -->|Yes| UsePS[Load that SearchConfiguration] HasPS -->|No| UseDefaults([Platform defaults:<br/>no synonyms, no overrides, column type defaults]) UseExplicit --> Load UsePS --> Load Load["Load SynonymSets via synonymSetDao.get()<br/>Load ColumnAnalyzerOverrides via columnAnalyzerOverrideDao.get()<br/>Load TextAnalyzers via textAnalyzerDao.get()"] style Start fill:#4a9eff,color:#fff style UseDefaults fill:#50c878,color:#fff

Effective analyzer per column (priority order):

  1. Per-column override indexAnalyzerId from ColumnAnalyzerOverrideEntry

  2. defaultAnalyzerId from SearchConfiguration

  3. System default from ColumnTypeToOpenSearchMapping.getDefaultAnalyzerId(columnType)

The lifecycle worker and query manager both collect all required TextAnalyzer IDs (from overrides, config default, and column type defaults) and load them from TextAnalyzerDao before building or querying an index. Paired search analyzers (pairedSearchAnalyzerId) are also loaded transitively.

Project Settings Integration

{ "concreteType": "org.sagebionetworks.repo.model.project.SearchConfigurationListSetting", "settingsType": "search", "searchConfigurationId": "301" }

Registered in ProjectSettingsManagerImpl.TYPE_MAP as SearchConfigurationListSetting.class → ProjectSettingsType.search.

Database Model

Schema Separation

Database

Contents

Migrated?

Database

Contents

Migrated?

Transactional DB

NODE/NODE_REVISION (SearchIndex entity), TEXT_ANALYZER, SYNONYM_SET, COLUMN_ANALYZER_OVERRIDE, SEARCH_CONFIGURATION

Yes

Indexing DB

SEARCH_INDEX_STATUS

No (starts empty)

On migration: config tables are copied (including TEXT_ANALYZER with system analyzers), SearchIndex entities migrate as NODE/NODE_REVISION, status rows start empty, AOSS starts empty. The TextAnalyzerBootstrapper ensures system analyzers exist on startup. The lifecycle worker discovers missing status rows from replayed ENTITY change messages and builds indexes organically.

Entity Relationship Diagram

image-20260303-233525.png

 

erDiagram NODE { BIGINT ID PK ENUM NODE_TYPE "includes searchindex" } NODE_REVISION { BIGINT OWNER_NODE_ID PK BIGINT NUMBER PK VARCHAR DEFINING_SQL VARCHAR SEARCH_CONFIGURATION_ID } TEXT_ANALYZER { BIGINT ID PK VARCHAR NAME JSON SETTINGS BOOLEAN IS_SYSTEM } SYNONYM_SET { BIGINT ID PK } COLUMN_ANALYZER_OVERRIDE { BIGINT ID PK } SEARCH_CONFIGURATION { BIGINT ID PK BIGINT DEFAULT_ANALYZER_ID FK } SEARCH_INDEX_STATUS { BIGINT SEARCH_INDEX_ID PK VARCHAR STATE TIMESTAMP CHANGED_ON VARCHAR ERROR_MESSAGE } NODE ||--o{ NODE_REVISION : "has versions" SEARCH_CONFIGURATION ||--o{ SYNONYM_SET : "references via JSON array" SEARCH_CONFIGURATION ||--o{ COLUMN_ANALYZER_OVERRIDE : "references via JSON array" SEARCH_CONFIGURATION }o--o| TEXT_ANALYZER : "DEFAULT_ANALYZER_ID FK" NODE_REVISION }o--o| SEARCH_CONFIGURATION : "searchConfigurationId" NODE ||--|| SEARCH_INDEX_STATUS : "1:1 across databases"

There is no dedicated SEARCH_INDEX table. SearchIndex is stored in NODE (type searchindex) + NODE_REVISION. definingSQL maps to NODE_REVISION.DEFINING_SQL (shared with MaterializedView/VirtualTable). searchConfigurationId maps to NODE_REVISION.SEARCH_CONFIGURATION_ID (new column).

DDL: Transactional DB (Migrated)

-- Database-stored text analyzer configurations -- System analyzers (IDs 1-999) bootstrapped on startup; user-defined start at 1000+ CREATE TABLE IF NOT EXISTS `TEXT_ANALYZER` ( `ID` BIGINT NOT NULL, `ETAG` char(36) NOT NULL, `NAME` VARCHAR(256) NOT NULL, `DESCRIPTION` VARCHAR(1000) DEFAULT NULL, `ORGANIZATION_ID` BIGINT DEFAULT NULL, `SETTINGS` JSON NOT NULL, `IS_SYSTEM` BOOLEAN NOT NULL DEFAULT FALSE, `CREATED_ON` TIMESTAMP(3) NOT NULL, `MODIFIED_ON` TIMESTAMP(3) NOT NULL, PRIMARY KEY (`ID`), UNIQUE KEY (`ORGANIZATION_ID`, `NAME`), CONSTRAINT `TA_ORG_FK` FOREIGN KEY (`ORGANIZATION_ID`) REFERENCES `ORGANIZATION` (`ID`) ON DELETE RESTRICT ) -- Synonym rules stored as JSON CREATE TABLE IF NOT EXISTS `SYNONYM_SET` ( `ID` BIGINT NOT NULL, `ETAG` char(36) NOT NULL, `ORGANIZATION_ID` BIGINT NOT NULL, `NAME` VARCHAR(256) NOT NULL, `DESCRIPTION` VARCHAR(1000) DEFAULT NULL, `RULES` JSON NOT NULL, `CREATED_BY` BIGINT NOT NULL, `CREATED_ON` TIMESTAMP(3) NOT NULL, `MODIFIED_BY` BIGINT NOT NULL, `MODIFIED_ON` TIMESTAMP(3) NOT NULL, PRIMARY KEY (`ID`), UNIQUE KEY (`ORGANIZATION_ID`, `NAME`), CONSTRAINT `SYNSET_ORG_FK` FOREIGN KEY (`ORGANIZATION_ID`) REFERENCES `ORGANIZATION` (`ID`) ON DELETE RESTRICT ) -- Per-column analyzer overrides stored as JSON CREATE TABLE IF NOT EXISTS `COLUMN_ANALYZER_OVERRIDE` ( `ID` BIGINT NOT NULL, `ETAG` char(36) NOT NULL, `ORGANIZATION_ID` BIGINT NOT NULL, `NAME` VARCHAR(256) NOT NULL, `DESCRIPTION` VARCHAR(1000) DEFAULT NULL, `OVERRIDES` JSON NOT NULL, `CREATED_BY` BIGINT NOT NULL, `CREATED_ON` TIMESTAMP(3) NOT NULL, `MODIFIED_BY` BIGINT NOT NULL, `MODIFIED_ON` TIMESTAMP(3) NOT NULL, PRIMARY KEY (`ID`), UNIQUE KEY (`ORGANIZATION_ID`, `NAME`), CONSTRAINT `CAO_ORG_FK` FOREIGN KEY (`ORGANIZATION_ID`) REFERENCES `ORGANIZATION` (`ID`) ON DELETE RESTRICT ) -- References stored as JSON arrays; DEFAULT_ANALYZER_ID references TEXT_ANALYZER CREATE TABLE IF NOT EXISTS `SEARCH_CONFIGURATION` ( `ID` BIGINT NOT NULL, `ETAG` char(36) NOT NULL, `ORGANIZATION_ID` BIGINT NOT NULL, `NAME` VARCHAR(256) NOT NULL, `DESCRIPTION` VARCHAR(1000) DEFAULT NULL, `SYNONYM_SET_IDS` JSON DEFAULT NULL, `COLUMN_ANALYZER_OVERRIDE_IDS` JSON DEFAULT NULL, `DEFAULT_ANALYZER_ID` BIGINT DEFAULT NULL, `CREATED_BY` BIGINT NOT NULL, `CREATED_ON` TIMESTAMP(3) NOT NULL, `MODIFIED_BY` BIGINT NOT NULL, `MODIFIED_ON` TIMESTAMP(3) NOT NULL, PRIMARY KEY (`ID`), UNIQUE KEY (`ORGANIZATION_ID`, `NAME`), CONSTRAINT `SC_ORG_FK` FOREIGN KEY (`ORGANIZATION_ID`) REFERENCES `ORGANIZATION` (`ID`) ON DELETE RESTRICT, CONSTRAINT `SC_DEFAULT_ANALYZER_FK` FOREIGN KEY (`DEFAULT_ANALYZER_ID`) REFERENCES `TEXT_ANALYZER` (`ID`) ON DELETE RESTRICT )

DDL: Indexing DB (Not Migrated)

CREATE TABLE IF NOT EXISTS SEARCH_INDEX_STATUS ( SEARCH_INDEX_ID BIGINT NOT NULL, STATE VARCHAR(50) NOT NULL DEFAULT 'CREATING', LAST_BUILD_ON TIMESTAMP(3) NULL, ERROR_MESSAGE VARCHAR(3000) DEFAULT NULL, CHANGED_ON TIMESTAMP(3) NOT NULL, PRIMARY KEY (SEARCH_INDEX_ID), INDEX (STATE) )

Uses INSERT ... ON DUPLICATE KEY UPDATE for upsert. Absence of a row is the implicit trigger for the lifecycle worker to build the index.

JSON Column Examples

SYNONYM_SET.RULES:

[ {"ruleType": "EQUIVALENT", "terms": ["NF1", "neurofibromatosis 1"]}, {"ruleType": "EXPLICIT", "terms": ["OPG", "optic pathway glioma"]} ]

COLUMN_ANALYZER_OVERRIDE.OVERRIDES:

[ {"columnName": "diseaseFocus", "indexAnalyzerId": "4", "searchAnalyzerId": "4"}, {"columnName": "doi", "indexAnalyzerId": "3", "searchAnalyzerId": "3"} ]

TEXT_ANALYZER.SETTINGS (example: SCIENTIFIC, ID 1):

{ "tokenizer": "standard", "tokenFilters": { "english_stop": "{\"type\":\"stop\",\"stopwords\":\"_english_\"}", "english_stemmer": "{\"type\":\"stemmer\",\"language\":\"english\"}" }, "filterOrder": ["lowercase", "english_stop", "english_stemmer"], "synonymAware": true }

Serialized via JSONObjectAdapterImpl + writeToJSONObject()/initializeFromJSONObject(). Simple string arrays (SYNONYM_SET_IDS, COLUMN_ANALYZER_OVERRIDE_IDS) use JSONArray directly. Cross-resource references are queried via MySQL's JSON_CONTAINS().

Worker Pipeline & Index Lifecycle

Lifecycle State Machine

image-20260303-233508.png
stateDiagram-v2 [*] --> CREATING : POST /repo/v1/entity (SearchIndex) CREATING --> ACTIVE : Build succeeds CREATING --> FAILED : Build error or exceeds 500K rows ACTIVE --> DELETING : DELETE /repo/v1/entity/{id} DELETING --> [*] : AOSS index deleted + status row removed

Change Message Flow

All entity lifecycle events publish ENTITY change messages via TransactionalMessenger. No dedicated SEARCH_INDEX ObjectType or SNS topic. The lifecycle worker subscribes to the SEARCH_INDEX_LIFECYCLE SQS queue (subscribed to the existing ENTITY SNS topic) and filters internally by node type:

image-20260303-233441.png

 

sequenceDiagram participant NodeDAO participant TxMsg as TransactionalMessenger participant SNS as SNS: ENTITY participant SQS as SQS: SEARCH_INDEX_LIFECYCLE participant Worker as SearchIndexLifecycleWorker NodeDAO->>TxMsg: sendMessageAfterCommit(id, ENTITY, CREATE) TxMsg->>SNS: afterCommit() SNS->>SQS: Deliver SQS->>Worker: Poll message Worker->>Worker: nodeDao.getNodeTypeById(id) Note over Worker: Skip if nodeType != searchindex

Worker Configuration

Property

Lifecycle Worker

Query Worker

Property

Lifecycle Worker

Query Worker

Class

SearchIndexLifecycleWorker

SearchQueryWorker

Interface

BatchChangeMessageDrivenRunner

AsyncJobRunner<SearchQuery, SearchResults>

SQS Queue

SEARCH_INDEX_LIFECYCLE

SEARCH_QUERY

Semaphore Lock Key

searchIndexLifecycleWorker

searchQueryWorker

Max Lock Count

2

4

Lock & Visibility Timeout

300s

120s

Max Threads / Machine

2

3

Repeat Interval

2053ms

2039ms

Start Delay

523ms

317ms

The Build Process

image-20260303-233415.png

 

flowchart TD Start([ENTITY ChangeMessage]) --> Filter{nodeType ==<br/>searchindex?} Filter -->|No| Skip([Skip]) Filter -->|Yes| Route{changeType?} Route -->|CREATE| Build Route -->|UPDATE| CheckStatus{Status row<br/>exists?} Route -->|DELETE| Delete CheckStatus -->|Yes| NoOp([No-op: already built]) CheckStatus -->|No| Build Build["Set status CREATING<br/>Load SearchIndex entity as admin"] --> Resolve["SearchConfigurationResolver.resolve()<br/>Load SynonymSets + ColumnAnalyzerOverrides"] Resolve --> LoadAnalyzers["collectAndLoadAnalyzers()<br/>Gather all required TextAnalyzer IDs<br/>(from overrides, config default, column type defaults)<br/>Load from TextAnalyzerDao + paired search analyzers"] LoadAnalyzers --> PreFlight["Pre-flight COUNT query<br/>querySinglePage(anonymousUser, countOnly)"] PreFlight --> RowLimit{"> 500K rows?"} RowLimit -->|Yes| Fail RowLimit -->|No| Stream["runQueryAsStream(anonymousUser, definingSQL)"] Stream --> Callback["RowHandlerProvider callback:<br/>1. getSchemaOfSelect() → selected columns only<br/>2. searchIndexOssClient.createIndex()<br/> with mappings + analyzers + synonyms"] Callback --> Index["SearchIndexRowHandler:<br/>batch 1000 rows → bulkIndex()<br/>guard: 500K row safety net"] Index --> Active[Set status ACTIVE] Delete["Set status DELETING"] --> DeleteAOSS["searchIndexOssClient.deleteIndex()<br/>(idempotent: catches index_not_found)"] DeleteAOSS --> RemoveStatus[Delete status row] Stream -->|Exception| Fail["Set status FAILED<br/>(error message truncated to 3000 chars)"] Callback -->|Exception| Fail Index -->|Exception| Fail style Start fill:#4a9eff,color:#fff style Skip fill:#ccc,color:#333 style NoOp fill:#ccc,color:#333 style Active fill:#50c878,color:#fff style Fail fill:#e74c3c,color:#fff

Data Loading

The lifecycle worker uses TableQueryManager.runQueryAsStream() with the anonymous user (BOOTSTRAP_PRINCIPAL.ANONYMOUS_USER, principal 273950). addRowLevelFilter() automatically applies benefactor ACL checks, ensuring only publicly accessible rows are indexed.

The RowHandlerProvider callback fires after SQL parsing but before row execution. Inside this callback:

  1. getSchemaOfSelect() returns only the columns selected by definingSQL

  2. collectAndLoadAnalyzers() gathers all required TextAnalyzer IDs and loads them from the DAO (including paired search analyzers)

  3. searchIndexOssClient.createIndex() creates the AOSS index with mappings for those columns, using the loaded TextAnalyzer settings to build custom analyzer definitions

  4. Returns SearchIndexRowHandler which streams rows in batches of 1000

SearchIndexRowHandler converts each Row to a Map<String, Object>:

  • _row_id = row.getRowId() (also used as the OS document _id)

  • _row_version = row.getVersionNumber()

  • Column values mapped by name from SelectColumn list; null values skipped

Pre-flight Count Check

Before creating the AOSS index or streaming data, the worker runs a count-only query via querySinglePage() with runQuery=false, runCount=true. If the count exceeds 500,000, the build fails immediately with FAILED status — no orphaned partial AOSS index is created. The row-level guard in SearchIndexRowHandler.nextRow() remains as a safety net for the TOCTOU window.

Edge Cases

Entity not found (NotFoundException): nodeDao.getNodeTypeById() throws → worker checks statusDao.exists() → if true, cleans up AOSS index and status row. If false, no-op.

Duplicate/concurrent CREATE messages: Idempotent. First message builds. Subsequent messages: status row exists → UPDATE handler sees it → no-op.

DELETE during build: DELETE handler deletes AOSS index + status row. CREATE handler may fail → caught → sets FAILED on a now-deleted status row (harmless).

AOSS index not found on delete: SearchIndexOssClientImpl.deleteIndex() catches index_not_found_exception at INFO level. No-op.

Build failure: Status set to FAILED with truncated error message (max 3000 chars). FAILED is terminal. To rebuild: delete entity, create new one.

OpenSearch Implementation

Index Naming

Resource

Pattern

Example

Resource

Pattern

Example

AOSS Collection

{stack}-{instance}-searchindex

prod-123-synsearch

AOSS Index

search-index-{entityId}

search-index-syn42

AOSS Client Bean

@Bean searchIndexOssClient

@Qualifier("searchIndexOssClient")

System Fields

Field

Type

Description

Field

Type

Description

_row_id

long

Row ID from the table index. Also used as OS document _id.

_row_version

long

Row version from the table index.

ColumnType to OpenSearch Mapping

Synapse ColumnType

OS Primary Type

Sub-Fields

ignoreAbove

Synapse ColumnType

OS Primary Type

Sub-Fields

ignoreAbove

STRING, STRING_LIST

text

.keyword

1000

MEDIUMTEXT

text

.keyword

2000

LARGETEXT

text

.keyword

8192

LINK (KEYWORD analyzer, ID 4)

keyword

.searchable (text, synapse_analyzer_1)

1000

LINK (other analyzer)

text

.keyword

1000

INTEGER, DATE, *_LIST

long

FILEHANDLEID, SUBMISSIONID, EVALUATIONID

long

ENTITYID, USERID, *_LIST

keyword

256

DOUBLE

double

BOOLEAN, BOOLEAN_LIST

boolean

JSON

object (dynamic: true)

Dual-field strategy:

  • KEYWORD analyzer columns: Primary is keyword; .searchable sub-field is text with synapse_analyzer_1 (SCIENTIFIC) for full-text search.

  • All other text columns: Primary is text with configured analyzer; .keyword sub-field for exact-match filtering/faceting.

  • AUTOCOMPLETE analyzer (ID 5) uses asymmetric analysis: synapse_analyzer_5 at index time (edge n-grams), synapse_analyzer_6 (AUTOCOMPLETE_SEARCH) at search time via pairedSearchAnalyzerId.

Analyzer Pipeline

Analyzers are database-stored TextAnalyzer rows. All custom analyzers are registered in OpenSearch as synapse_analyzer_{id}. The SearchIndexOssClientImpl.createIndex() method dynamically builds analyzer definitions from TextAnalyzerSettings JSON.

System Analyzers (bootstrapped by TextAnalyzerBootstrapper)

ID

Name

OS Analyzer Name

Tokenizer

Token Filters

Synonym-Aware?

Paired Search ID

ID

Name

OS Analyzer Name

Tokenizer

Token Filters

Synonym-Aware?

Paired Search ID

1

SCIENTIFIC

synapse_analyzer_1

standard

lowercase, english_stop, english_stemmer, [synapse_synonyms]

Yes

2

STANDARD

synapse_analyzer_2

standard

lowercase, [synapse_synonyms]

Yes

3

IDENTIFIER

synapse_analyzer_3

whitespace

lowercase, [synapse_synonyms]

Yes

4

KEYWORD

synapse_analyzer_4

keyword

(none)

No

5

AUTOCOMPLETE

synapse_analyzer_5

standard

lowercase, edge_ngram (min=2, max=20)

No

6

6

AUTOCOMPLETE_SEARCH

synapse_analyzer_6

standard

lowercase, [synapse_synonyms]

Yes

Filters in [brackets] are only included when synonym rules are present. Any analyzer can specify a pairedSearchAnalyzerId to use a different analyzer at search time. User-defined analyzers (IDs 1000+) can also use this mechanism.

Token filter definitions are stored as JSON strings in TextAnalyzerSettings.tokenFilters (e.g., {"type":"stop","stopwords":"_english_"}) and parsed at index creation time by registerTokenFilter().

Synonyms are baked into the AOSS index at build time. They are NOT loaded at query time — only column analyzer overrides and TextAnalyzer objects are loaded at query time for field routing.

Synonym Injection

Resolved synonyms are converted to Solr-format inline strings for the synapse_synonyms token filter: