Author: @Bryan Fauble

Overview

Portal Search provides full-text search over Synapse Tables, EntityViews, Datasets, DatasetCollections, MaterializedViews, and SubmissionViews (VirtualTables excluded). Each search index is a Synapse Entity (SearchIndex) with its own OpenSearch Serverless (AOSS) index, state, and build lifecycle.

Design Principles

Simple composition. SearchIndex → SearchConfiguration → {SynonymSets[], ColumnAnalyzerOverrides[]}.

4 standalone + 1 entity. SynonymSet, ColumnAnalyzerOverride, TextAnalyzer, and SearchConfiguration are organization-scoped resources under /search/*. TextAnalyzer is currently read-only (GET and LIST); system analyzers (IDs 1-6) are bootstrapped on startup. SearchIndex is a Synapse Entity (VersionableEntity + HasDefiningSql) managed via /repo/v1/entity/*; only search query operations use /search/*.

Split authorization model. Configuration resources use Organization ACL (publicly readable, mutations require org permissions). SearchIndex uses standard entity ACLs via parentId (Project or Folder). Search queries require READ on both the SearchIndex entity and the source table. A pilot gate restricts SearchIndex creation to Portal Managers.

Build-once semantics. AOSS indexes are point-in-time snapshots built on entity creation. UPDATE messages only trigger a build if no AOSS index exists yet (migration backfill). To force a rebuild, delete and recreate the SearchIndex entity. Entity metadata changes (name, definingSQL, searchConfigurationId) are persisted and take effect on the next build during stack migration.

Schema from SQL. Only the columns selected by definingSQL are mapped in the AOSS index. Column schema is derived at build time via QueryTranslator.getSchemaOfSelect() inside the RowHandlerProvider callback. SELECT studyName, summary FROM syn123 maps only those two columns; SELECT * FROM syn123 maps all.

Anonymous-only indexing. The lifecycle worker streams data as the anonymous user via TableQueryManager.runQueryAsStream(). Row-level filtering via addRowLevelFilter() ensures only publicly visible rows enter the AOSS index.

Two-database architecture. Transactional DB (migrated) stores configuration and entity data; Indexing DB (not migrated) stores runtime state. On migration, status rows start empty and the worker rebuilds indexes organically.

Single polling point. Search queries run as async jobs. If the index is still building (CREATING), the query worker throws RecoverableMessageException for automatic retry. Clients only poll the async result endpoint.

System Context

graph TB
    subgraph "Data Portals"
        Portals["NF Portal / AD Portal / ELITE"]
    end

    subgraph "Synapse REST API"
        EntityAPI["/repo/v1/entity/*<br/>(SearchIndex CRUD)"]
        ConfigAPI["/search/synonym/set/*<br/>/search/column/analyzer/override/*<br/>/search/text/analyzer/*<br/>/search/configuration/*"]
        QueryAPI["/search/query/async/*<br/>/search/autocomplete"]
    end

    subgraph "Validation"
        MV[SearchIndexMetadataProvider<br/>pilot gate + definingSQL validation]
    end

    subgraph "Manager Layer"
        M1[SynonymSetManager]
        M2[ColumnAnalyzerOverrideManager]
        M3[TextAnalyzerManager]
        M4[SearchConfigurationManager]
        M5[SearchIndexQueryManager]
    end

    subgraph "Bootstrap"
        Boot[TextAnalyzerBootstrapper<br/>bootstraps 6 system analyzers on startup]
    end

    subgraph "Storage"
        TDB[(MySQL<br/>Transactional DB<br/>NODE + NODE_REVISION<br/>+ 4 config tables)]
        IDB[(MySQL<br/>Indexing DB<br/>SEARCH_INDEX_STATUS)]
        AOSS[(OpenSearch AOSS<br/>`stack`-`instance`-searchindex)]
    end

    subgraph "Event Processing"
        NodeDAO["NodeDAO"]
        TxMsg["TransactionalMessenger"]
        SNS["SNS: ENTITY"]
        SQS["SQS: SEARCH_INDEX_LIFECYCLE"]
        W1["SearchIndexLifecycleWorker"]
    end

    Portals --> EntityAPI & ConfigAPI & QueryAPI
    EntityAPI --> MV --> TDB
    ConfigAPI --> M1 & M2 & M3 & M4 --> TDB
    QueryAPI --> M5 --> AOSS
    Boot --> TDB

    NodeDAO --> TxMsg --> SNS --> SQS --> W1
    W1 --> TDB & IDB & AOSS

    style Portals fill:#4a9eff,color:#fff
    style EntityAPI fill:#50c878,color:#fff
    style ConfigAPI fill:#50c878,color:#fff
    style QueryAPI fill:#50c878,color:#fff
    style TDB fill:#ff8c00,color:#fff
    style IDB fill:#ff8c00,color:#fff
    style AOSS fill:#9b59b6,color:#fff
    style W1 fill:#e74c3c,color:#fff
    style Boot fill:#f39c12,color:#fff

SearchIndex CRUD goes through the standard entity controller; SearchIndexMetadataProvider validates on create/update. Configuration resources have dedicated controllers. Search queries go through async job infrastructure. The lifecycle worker subscribes to ENTITY change messages via a dedicated SQS queue and filters for searchindex node types internally. TextAnalyzerBootstrapper runs on startup to ensure 6 system analyzers exist (IDs 1-6).

Security & Authorization

Control Plane: Organization ACL (Configuration Resources)

All configuration resources are publicly readable (GET/list). Mutations require Organization ACL permissions:

AuthorizationUtils.disallowAnonymous(user);
aclDao.canAccess(user, organizationId, ObjectType.ORGANIZATION, ACCESS_TYPE.CREATE)
    .checkAuthorizationOrElseThrow();

Control Plane: Pilot Gate (SearchIndex Entity)

Enforced in SearchIndexMetadataProvider.validateEntity() on both CREATE and UPDATE:

if (!AuthorizationUtils.isPortalManagerOrAdmin(user)) {
    throw new UnauthorizedException("Only Portal Managers can create or update SearchIndex entities.");
}
searchIndexValidator.validateDefiningSQL(entity.getDefiningSQL());

Control Plane: Shared Resource Deletion Protection

SynonymSets and ColumnAnalyzerOverrides cannot be deleted while referenced by any SearchConfiguration. Checked at the manager layer:

List<SearchConfiguration> refs = searchConfigurationDao.findBySynonymSetId(synonymSetId);
if (!refs.isEmpty()) {
    throw new IllegalArgumentException("Cannot delete synonym set because it is referenced by search configurations: " + ids);
}

findBySynonymSetId() uses MySQL's JSON_CONTAINS() function to query the JSON array column.

Data Plane: Search & Autocomplete

Two READ checks are performed by SearchIndexQueryManagerImpl:

READ on the SearchIndex entity
READ on the source table entity (parsed from definingSQL)

Data is indexed as the anonymous user (principal 273950). addRowLevelFilter() in TableQueryManagerImpl applies benefactor ACL filtering, ensuring only publicly visible rows enter the AOSS index.

Control Plane: Optimistic Concurrency

All three mutable configuration resource DAOs enforce OCC via SELECT ... FOR UPDATE:

String currentEtag = getCurrentEtag(id); // SELECT ETAG ... FOR UPDATE
if (!currentEtag.equals(request.getEtag())) {
    throw new ConflictingUpdateException("Resource was updated since last fetched.");
}

Version Prevention

SearchIndex implements VersionableEntity (required for NODE_REVISION.DEFINING_SQL mapping) but versioning is disabled in EntityManagerImpl.updateEntity():

if (updated instanceof SearchIndex) {
    newVersion = false;
}

Configuration & Resolution

Defining SQL

definingSQL scopes which columns and rows are indexed. Validated by SearchIndexValidator:

List<IdAndVersion> sourceTableIds = TableModelUtils.getSourceTableIds(definingSQL);
if (sourceTableIds.size() != 1) {
    throw new IllegalArgumentException("definingSQL must reference exactly one entity.");
}

Only the columns in the SELECT clause are mapped in the AOSS index — derived at build time via QueryTranslator.getSchemaOfSelect() inside the RowHandlerProvider callback (after SQL parsing, before row execution).

Configuration Resolution

SearchConfigurationResolver resolves the effective configuration at build time:

flowchart TD
    Start([resolve]) --> HasConfig{searchConfigurationId<br/>set on entity?}
    HasConfig -->|Yes| UseExplicit[Load that SearchConfiguration]
    HasConfig -->|No| CheckPS["projectSettingsManager.getProjectSettingForNode()<br/>→ SearchConfigurationListSetting"]
    CheckPS --> HasPS{Found?}
    HasPS -->|Yes| UsePS[Load that SearchConfiguration]
    HasPS -->|No| UseDefaults([Platform defaults:<br/>no synonyms, no overrides, column type defaults])
    UseExplicit --> Load
    UsePS --> Load
    Load["Load SynonymSets via synonymSetDao.get()<br/>Load ColumnAnalyzerOverrides via columnAnalyzerOverrideDao.get()<br/>Load TextAnalyzers via textAnalyzerDao.get()"]

    style Start fill:#4a9eff,color:#fff
    style UseDefaults fill:#50c878,color:#fff

Effective analyzer per column (priority order):

Per-column override indexAnalyzerId from ColumnAnalyzerOverrideEntry
defaultAnalyzerId from SearchConfiguration
System default from ColumnTypeToOpenSearchMapping.getDefaultAnalyzerId(columnType)

The lifecycle worker and query manager both collect all required TextAnalyzer IDs (from overrides, config default, and column type defaults) and load them from TextAnalyzerDao before building or querying an index. Paired search analyzers (pairedSearchAnalyzerId) are also loaded transitively.

Project Settings Integration

{
  "concreteType": "org.sagebionetworks.repo.model.project.SearchConfigurationListSetting",
  "settingsType": "search",
  "searchConfigurationId": "301"
}

Registered in ProjectSettingsManagerImpl.TYPE_MAP as SearchConfigurationListSetting.class → ProjectSettingsType.search.

Database Model

Schema Separation

Database	Contents	Migrated?

Database	Contents	Migrated?
Transactional DB	NODE/NODE_REVISION (SearchIndex entity), TEXT_ANALYZER, SYNONYM_SET, COLUMN_ANALYZER_OVERRIDE, SEARCH_CONFIGURATION	Yes
Indexing DB	SEARCH_INDEX_STATUS	No (starts empty)

On migration: config tables are copied (including TEXT_ANALYZER with system analyzers), SearchIndex entities migrate as NODE/NODE_REVISION, status rows start empty, AOSS starts empty. The TextAnalyzerBootstrapper ensures system analyzers exist on startup. The lifecycle worker discovers missing status rows from replayed ENTITY change messages and builds indexes organically.

Entity Relationship Diagram

erDiagram
    NODE {
        BIGINT ID PK
        ENUM NODE_TYPE "includes searchindex"
    }
    NODE_REVISION {
        BIGINT OWNER_NODE_ID PK
        BIGINT NUMBER PK
        VARCHAR DEFINING_SQL
        VARCHAR SEARCH_CONFIGURATION_ID
    }
    TEXT_ANALYZER {
        BIGINT ID PK
        VARCHAR NAME
        JSON SETTINGS
        BOOLEAN IS_SYSTEM
    }
    SYNONYM_SET { BIGINT ID PK }
    COLUMN_ANALYZER_OVERRIDE { BIGINT ID PK }
    SEARCH_CONFIGURATION {
        BIGINT ID PK
        BIGINT DEFAULT_ANALYZER_ID FK
    }
    SEARCH_INDEX_STATUS {
        BIGINT SEARCH_INDEX_ID PK
        VARCHAR STATE
        TIMESTAMP CHANGED_ON
        VARCHAR ERROR_MESSAGE
    }

    NODE ||--o{ NODE_REVISION : "has versions"
    SEARCH_CONFIGURATION ||--o{ SYNONYM_SET : "references via JSON array"
    SEARCH_CONFIGURATION ||--o{ COLUMN_ANALYZER_OVERRIDE : "references via JSON array"
    SEARCH_CONFIGURATION }o--o| TEXT_ANALYZER : "DEFAULT_ANALYZER_ID FK"
    NODE_REVISION }o--o| SEARCH_CONFIGURATION : "searchConfigurationId"
    NODE ||--|| SEARCH_INDEX_STATUS : "1:1 across databases"

There is no dedicated SEARCH_INDEX table. SearchIndex is stored in NODE (type searchindex) + NODE_REVISION. definingSQL maps to NODE_REVISION.DEFINING_SQL (shared with MaterializedView/VirtualTable). searchConfigurationId maps to NODE_REVISION.SEARCH_CONFIGURATION_ID (new column).

DDL: Transactional DB (Migrated)

-- Database-stored text analyzer configurations
-- System analyzers (IDs 1-999) bootstrapped on startup; user-defined start at 1000+
CREATE TABLE IF NOT EXISTS `TEXT_ANALYZER` (
  `ID` BIGINT NOT NULL, `ETAG` char(36) NOT NULL,
  `NAME` VARCHAR(256) NOT NULL, `DESCRIPTION` VARCHAR(1000) DEFAULT NULL,
  `ORGANIZATION_ID` BIGINT DEFAULT NULL, `SETTINGS` JSON NOT NULL,
  `IS_SYSTEM` BOOLEAN NOT NULL DEFAULT FALSE,
  `CREATED_ON` TIMESTAMP(3) NOT NULL, `MODIFIED_ON` TIMESTAMP(3) NOT NULL,
  PRIMARY KEY (`ID`), UNIQUE KEY (`ORGANIZATION_ID`, `NAME`),
  CONSTRAINT `TA_ORG_FK` FOREIGN KEY (`ORGANIZATION_ID`) REFERENCES `ORGANIZATION` (`ID`) ON DELETE RESTRICT
)

-- Synonym rules stored as JSON
CREATE TABLE IF NOT EXISTS `SYNONYM_SET` (
  `ID` BIGINT NOT NULL, `ETAG` char(36) NOT NULL,
  `ORGANIZATION_ID` BIGINT NOT NULL, `NAME` VARCHAR(256) NOT NULL,
  `DESCRIPTION` VARCHAR(1000) DEFAULT NULL, `RULES` JSON NOT NULL,
  `CREATED_BY` BIGINT NOT NULL, `CREATED_ON` TIMESTAMP(3) NOT NULL,
  `MODIFIED_BY` BIGINT NOT NULL, `MODIFIED_ON` TIMESTAMP(3) NOT NULL,
  PRIMARY KEY (`ID`), UNIQUE KEY (`ORGANIZATION_ID`, `NAME`),
  CONSTRAINT `SYNSET_ORG_FK` FOREIGN KEY (`ORGANIZATION_ID`) REFERENCES `ORGANIZATION` (`ID`) ON DELETE RESTRICT
)

-- Per-column analyzer overrides stored as JSON
CREATE TABLE IF NOT EXISTS `COLUMN_ANALYZER_OVERRIDE` (
  `ID` BIGINT NOT NULL, `ETAG` char(36) NOT NULL,
  `ORGANIZATION_ID` BIGINT NOT NULL, `NAME` VARCHAR(256) NOT NULL,
  `DESCRIPTION` VARCHAR(1000) DEFAULT NULL, `OVERRIDES` JSON NOT NULL,
  `CREATED_BY` BIGINT NOT NULL, `CREATED_ON` TIMESTAMP(3) NOT NULL,
  `MODIFIED_BY` BIGINT NOT NULL, `MODIFIED_ON` TIMESTAMP(3) NOT NULL,
  PRIMARY KEY (`ID`), UNIQUE KEY (`ORGANIZATION_ID`, `NAME`),
  CONSTRAINT `CAO_ORG_FK` FOREIGN KEY (`ORGANIZATION_ID`) REFERENCES `ORGANIZATION` (`ID`) ON DELETE RESTRICT
)

-- References stored as JSON arrays; DEFAULT_ANALYZER_ID references TEXT_ANALYZER
CREATE TABLE IF NOT EXISTS `SEARCH_CONFIGURATION` (
  `ID` BIGINT NOT NULL, `ETAG` char(36) NOT NULL,
  `ORGANIZATION_ID` BIGINT NOT NULL, `NAME` VARCHAR(256) NOT NULL,
  `DESCRIPTION` VARCHAR(1000) DEFAULT NULL,
  `SYNONYM_SET_IDS` JSON DEFAULT NULL,
  `COLUMN_ANALYZER_OVERRIDE_IDS` JSON DEFAULT NULL,
  `DEFAULT_ANALYZER_ID` BIGINT DEFAULT NULL,
  `CREATED_BY` BIGINT NOT NULL, `CREATED_ON` TIMESTAMP(3) NOT NULL,
  `MODIFIED_BY` BIGINT NOT NULL, `MODIFIED_ON` TIMESTAMP(3) NOT NULL,
  PRIMARY KEY (`ID`), UNIQUE KEY (`ORGANIZATION_ID`, `NAME`),
  CONSTRAINT `SC_ORG_FK` FOREIGN KEY (`ORGANIZATION_ID`) REFERENCES `ORGANIZATION` (`ID`) ON DELETE RESTRICT,
  CONSTRAINT `SC_DEFAULT_ANALYZER_FK` FOREIGN KEY (`DEFAULT_ANALYZER_ID`) REFERENCES `TEXT_ANALYZER` (`ID`) ON DELETE RESTRICT
)

DDL: Indexing DB (Not Migrated)

CREATE TABLE IF NOT EXISTS SEARCH_INDEX_STATUS (
  SEARCH_INDEX_ID BIGINT NOT NULL,
  STATE           VARCHAR(50) NOT NULL DEFAULT 'CREATING',
  LAST_BUILD_ON   TIMESTAMP(3) NULL,
  ERROR_MESSAGE   VARCHAR(3000) DEFAULT NULL,
  CHANGED_ON      TIMESTAMP(3) NOT NULL,
  PRIMARY KEY (SEARCH_INDEX_ID), INDEX (STATE)
)

Uses INSERT ... ON DUPLICATE KEY UPDATE for upsert. Absence of a row is the implicit trigger for the lifecycle worker to build the index.

JSON Column Examples

SYNONYM_SET.RULES:

[
  {"ruleType": "EQUIVALENT", "terms": ["NF1", "neurofibromatosis 1"]},
  {"ruleType": "EXPLICIT", "terms": ["OPG", "optic pathway glioma"]}
]

COLUMN_ANALYZER_OVERRIDE.OVERRIDES:

[
  {"columnName": "diseaseFocus", "indexAnalyzerId": "4", "searchAnalyzerId": "4"},
  {"columnName": "doi", "indexAnalyzerId": "3", "searchAnalyzerId": "3"}
]

TEXT_ANALYZER.SETTINGS (example: SCIENTIFIC, ID 1):

{
  "tokenizer": "standard",
  "tokenFilters": {
    "english_stop": "{\"type\":\"stop\",\"stopwords\":\"_english_\"}",
    "english_stemmer": "{\"type\":\"stemmer\",\"language\":\"english\"}"
  },
  "filterOrder": ["lowercase", "english_stop", "english_stemmer"],
  "synonymAware": true
}

Serialized via JSONObjectAdapterImpl + writeToJSONObject()/initializeFromJSONObject(). Simple string arrays (SYNONYM_SET_IDS, COLUMN_ANALYZER_OVERRIDE_IDS) use JSONArray directly. Cross-resource references are queried via MySQL's JSON_CONTAINS().

Worker Pipeline & Index Lifecycle

Lifecycle State Machine

stateDiagram-v2
    [*] --> CREATING : POST /repo/v1/entity (SearchIndex)
    CREATING --> ACTIVE : Build succeeds
    CREATING --> FAILED : Build error or exceeds 500K rows
    ACTIVE --> DELETING : DELETE /repo/v1/entity/{id}
    DELETING --> [*] : AOSS index deleted + status row removed

Change Message Flow

All entity lifecycle events publish ENTITY change messages via TransactionalMessenger. No dedicated SEARCH_INDEX ObjectType or SNS topic. The lifecycle worker subscribes to the SEARCH_INDEX_LIFECYCLE SQS queue (subscribed to the existing ENTITY SNS topic) and filters internally by node type:

sequenceDiagram
    participant NodeDAO
    participant TxMsg as TransactionalMessenger
    participant SNS as SNS: ENTITY
    participant SQS as SQS: SEARCH_INDEX_LIFECYCLE
    participant Worker as SearchIndexLifecycleWorker

    NodeDAO->>TxMsg: sendMessageAfterCommit(id, ENTITY, CREATE)
    TxMsg->>SNS: afterCommit()
    SNS->>SQS: Deliver
    SQS->>Worker: Poll message
    Worker->>Worker: nodeDao.getNodeTypeById(id)
    Note over Worker: Skip if nodeType != searchindex

Worker Configuration

Property	Lifecycle Worker	Query Worker

Property	Lifecycle Worker	Query Worker
Class	SearchIndexLifecycleWorker	SearchQueryWorker
Interface	BatchChangeMessageDrivenRunner	AsyncJobRunner<SearchQuery, SearchResults>
SQS Queue	SEARCH_INDEX_LIFECYCLE	SEARCH_QUERY
Semaphore Lock Key	searchIndexLifecycleWorker	searchQueryWorker
Max Lock Count	2	4
Lock & Visibility Timeout	300s	120s
Max Threads / Machine	2	3
Repeat Interval	2053ms	2039ms
Start Delay	523ms	317ms

The Build Process

flowchart TD
    Start([ENTITY ChangeMessage]) --> Filter{nodeType ==<br/>searchindex?}
    Filter -->|No| Skip([Skip])
    Filter -->|Yes| Route{changeType?}

    Route -->|CREATE| Build
    Route -->|UPDATE| CheckStatus{Status row<br/>exists?}
    Route -->|DELETE| Delete
    CheckStatus -->|Yes| NoOp([No-op: already built])
    CheckStatus -->|No| Build

    Build["Set status CREATING<br/>Load SearchIndex entity as admin"] --> Resolve["SearchConfigurationResolver.resolve()<br/>Load SynonymSets + ColumnAnalyzerOverrides"]
    Resolve --> LoadAnalyzers["collectAndLoadAnalyzers()<br/>Gather all required TextAnalyzer IDs<br/>(from overrides, config default, column type defaults)<br/>Load from TextAnalyzerDao + paired search analyzers"]
    LoadAnalyzers --> PreFlight["Pre-flight COUNT query<br/>querySinglePage(anonymousUser, countOnly)"]
    PreFlight --> RowLimit{"> 500K rows?"}
    RowLimit -->|Yes| Fail
    RowLimit -->|No| Stream["runQueryAsStream(anonymousUser, definingSQL)"]
    Stream --> Callback["RowHandlerProvider callback:<br/>1. getSchemaOfSelect() → selected columns only<br/>2. searchIndexOssClient.createIndex()<br/>   with mappings + analyzers + synonyms"]
    Callback --> Index["SearchIndexRowHandler:<br/>batch 1000 rows → bulkIndex()<br/>guard: 500K row safety net"]
    Index --> Active[Set status ACTIVE]

    Delete["Set status DELETING"] --> DeleteAOSS["searchIndexOssClient.deleteIndex()<br/>(idempotent: catches index_not_found)"]
    DeleteAOSS --> RemoveStatus[Delete status row]

    Stream -->|Exception| Fail["Set status FAILED<br/>(error message truncated to 3000 chars)"]
    Callback -->|Exception| Fail
    Index -->|Exception| Fail

    style Start fill:#4a9eff,color:#fff
    style Skip fill:#ccc,color:#333
    style NoOp fill:#ccc,color:#333
    style Active fill:#50c878,color:#fff
    style Fail fill:#e74c3c,color:#fff

Data Loading

The lifecycle worker uses TableQueryManager.runQueryAsStream() with the anonymous user (BOOTSTRAP_PRINCIPAL.ANONYMOUS_USER, principal 273950). addRowLevelFilter() automatically applies benefactor ACL checks, ensuring only publicly accessible rows are indexed.

The RowHandlerProvider callback fires after SQL parsing but before row execution. Inside this callback:

getSchemaOfSelect() returns only the columns selected by definingSQL
collectAndLoadAnalyzers() gathers all required TextAnalyzer IDs and loads them from the DAO (including paired search analyzers)
searchIndexOssClient.createIndex() creates the AOSS index with mappings for those columns, using the loaded TextAnalyzer settings to build custom analyzer definitions
Returns SearchIndexRowHandler which streams rows in batches of 1000

SearchIndexRowHandler converts each Row to a Map<String, Object>:

_row_id = row.getRowId() (also used as the OS document _id)
_row_version = row.getVersionNumber()
Column values mapped by name from SelectColumn list; null values skipped

Pre-flight Count Check

Before creating the AOSS index or streaming data, the worker runs a count-only query via querySinglePage() with runQuery=false, runCount=true. If the count exceeds 500,000, the build fails immediately with FAILED status — no orphaned partial AOSS index is created. The row-level guard in SearchIndexRowHandler.nextRow() remains as a safety net for the TOCTOU window.

Edge Cases

Entity not found (NotFoundException): nodeDao.getNodeTypeById() throws → worker checks statusDao.exists() → if true, cleans up AOSS index and status row. If false, no-op.

Duplicate/concurrent CREATE messages: Idempotent. First message builds. Subsequent messages: status row exists → UPDATE handler sees it → no-op.

DELETE during build: DELETE handler deletes AOSS index + status row. CREATE handler may fail → caught → sets FAILED on a now-deleted status row (harmless).

AOSS index not found on delete: SearchIndexOssClientImpl.deleteIndex() catches index_not_found_exception at INFO level. No-op.

Build failure: Status set to FAILED with truncated error message (max 3000 chars). FAILED is terminal. To rebuild: delete entity, create new one.

OpenSearch Implementation

Index Naming

Resource	Pattern	Example

Resource	Pattern	Example
AOSS Collection	{stack}-{instance}-searchindex	prod-123-synsearch
AOSS Index	search-index-{entityId}	search-index-syn42
AOSS Client Bean	`@Bean searchIndexOssClient`	`@Qualifier("searchIndexOssClient")`

System Fields

Field	Type	Description

Field	Type	Description
`_row_id`	long	Row ID from the table index. Also used as OS document `_id`.
`_row_version`	long	Row version from the table index.

ColumnType to OpenSearch Mapping

Synapse ColumnType	OS Primary Type	Sub-Fields	ignoreAbove

Synapse ColumnType	OS Primary Type	Sub-Fields	ignoreAbove
STRING, STRING_LIST	text	.keyword	1000
MEDIUMTEXT	text	.keyword	2000
LARGETEXT	text	.keyword	8192
LINK (KEYWORD analyzer, ID 4)	keyword	.searchable (text, synapse_analyzer_1)	1000
LINK (other analyzer)	text	.keyword	1000
INTEGER, DATE, *_LIST	long	—	—
FILEHANDLEID, SUBMISSIONID, EVALUATIONID	long	—	—
ENTITYID, USERID, *_LIST	keyword	—	256
DOUBLE	double	—	—
BOOLEAN, BOOLEAN_LIST	boolean	—	—
JSON	object (dynamic: true)	—	—

Dual-field strategy:

KEYWORD analyzer columns: Primary is keyword; .searchable sub-field is text with synapse_analyzer_1 (SCIENTIFIC) for full-text search.
All other text columns: Primary is text with configured analyzer; .keyword sub-field for exact-match filtering/faceting.
AUTOCOMPLETE analyzer (ID 5) uses asymmetric analysis: synapse_analyzer_5 at index time (edge n-grams), synapse_analyzer_6 (AUTOCOMPLETE_SEARCH) at search time via pairedSearchAnalyzerId.

Analyzer Pipeline

Analyzers are database-stored TextAnalyzer rows. All custom analyzers are registered in OpenSearch as synapse_analyzer_{id}. The SearchIndexOssClientImpl.createIndex() method dynamically builds analyzer definitions from TextAnalyzerSettings JSON.

System Analyzers (bootstrapped by TextAnalyzerBootstrapper)

ID	Name	OS Analyzer Name	Tokenizer	Token Filters	Synonym-Aware?	Paired Search ID

ID	Name	OS Analyzer Name	Tokenizer	Token Filters	Synonym-Aware?	Paired Search ID
1	SCIENTIFIC	synapse_analyzer_1	standard	lowercase, english_stop, english_stemmer, [synapse_synonyms]	Yes	—
2	STANDARD	synapse_analyzer_2	standard	lowercase, [synapse_synonyms]	Yes	—
3	IDENTIFIER	synapse_analyzer_3	whitespace	lowercase, [synapse_synonyms]	Yes	—
4	KEYWORD	synapse_analyzer_4	keyword	(none)	No	—
5	AUTOCOMPLETE	synapse_analyzer_5	standard	lowercase, edge_ngram (min=2, max=20)	No	6
6	AUTOCOMPLETE_SEARCH	synapse_analyzer_6	standard	lowercase, [synapse_synonyms]	Yes	—

Filters in [brackets] are only included when synonym rules are present. Any analyzer can specify a pairedSearchAnalyzerId to use a different analyzer at search time. User-defined analyzers (IDs 1000+) can also use this mechanism.

Token filter definitions are stored as JSON strings in TextAnalyzerSettings.tokenFilters (e.g., {"type":"stop","stopwords":"_english_"}) and parsed at index creation time by registerTokenFilter().

Synonyms are baked into the AOSS index at build time. They are NOT loaded at query time — only column analyzer overrides and TextAnalyzer objects are loaded at query time for field routing.

Synonym Injection

Resolved synonyms are converted to Solr-format inline strings for the synapse_synonyms token filter:

Architecture Design for OpenSearch Integration