Architecture Design for OpenSearch Integration
Author: @Bryan Fauble
Overview
Portal Search provides full-text search over Synapse Tables, EntityViews, Datasets, DatasetCollections, MaterializedViews, and SubmissionViews (VirtualTables excluded). Each search index is a Synapse Entity (SearchIndex) with its own OpenSearch Serverless (AOSS) index, state, and build lifecycle.
Design Principles
Simple composition. SearchIndex → SearchConfiguration → {SynonymSets[], ColumnAnalyzerOverrides[]}.
4 standalone + 1 entity. SynonymSet, ColumnAnalyzerOverride, TextAnalyzer, and SearchConfiguration are organization-scoped resources under /search/*. TextAnalyzer is currently read-only (GET and LIST); system analyzers (IDs 1-6) are bootstrapped on startup. SearchIndex is a Synapse Entity (VersionableEntity + HasDefiningSql) managed via /repo/v1/entity/*; only search query operations use /search/*.
Split authorization model. Configuration resources use Organization ACL (publicly readable, mutations require org permissions). SearchIndex uses standard entity ACLs via parentId (Project or Folder). Search queries require READ on both the SearchIndex entity and the source table. A pilot gate restricts SearchIndex creation to Portal Managers.
Build-once semantics. AOSS indexes are point-in-time snapshots built on entity creation. UPDATE messages only trigger a build if no AOSS index exists yet (migration backfill). To force a rebuild, delete and recreate the SearchIndex entity. Entity metadata changes (name, definingSQL, searchConfigurationId) are persisted and take effect on the next build during stack migration.
Schema from SQL. Only the columns selected by definingSQL are mapped in the AOSS index. Column schema is derived at build time via QueryTranslator.getSchemaOfSelect() inside the RowHandlerProvider callback. SELECT studyName, summary FROM syn123 maps only those two columns; SELECT * FROM syn123 maps all.
Anonymous-only indexing. The lifecycle worker streams data as the anonymous user via TableQueryManager.runQueryAsStream(). Row-level filtering via addRowLevelFilter() ensures only publicly visible rows enter the AOSS index.
Two-database architecture. Transactional DB (migrated) stores configuration and entity data; Indexing DB (not migrated) stores runtime state. On migration, status rows start empty and the worker rebuilds indexes organically.
Single polling point. Search queries run as async jobs. If the index is still building (CREATING), the query worker throws RecoverableMessageException for automatic retry. Clients only poll the async result endpoint.
System Context
SearchIndex CRUD goes through the standard entity controller; SearchIndexMetadataProvider validates on create/update. Configuration resources have dedicated controllers. Search queries go through async job infrastructure. The lifecycle worker subscribes to ENTITY change messages via a dedicated SQS queue and filters for searchindex node types internally. TextAnalyzerBootstrapper runs on startup to ensure 6 system analyzers exist (IDs 1-6).
Security & Authorization
Control Plane: Organization ACL (Configuration Resources)
All configuration resources are publicly readable (GET/list). Mutations require Organization ACL permissions:
AuthorizationUtils.disallowAnonymous(user);
aclDao.canAccess(user, organizationId, ObjectType.ORGANIZATION, ACCESS_TYPE.CREATE)
.checkAuthorizationOrElseThrow();Control Plane: Pilot Gate (SearchIndex Entity)
Enforced in SearchIndexMetadataProvider.validateEntity() on both CREATE and UPDATE:
if (!AuthorizationUtils.isPortalManagerOrAdmin(user)) {
throw new UnauthorizedException("Only Portal Managers can create or update SearchIndex entities.");
}
searchIndexValidator.validateDefiningSQL(entity.getDefiningSQL());Control Plane: Shared Resource Deletion Protection
SynonymSets and ColumnAnalyzerOverrides cannot be deleted while referenced by any SearchConfiguration. Checked at the manager layer:
List<SearchConfiguration> refs = searchConfigurationDao.findBySynonymSetId(synonymSetId);
if (!refs.isEmpty()) {
throw new IllegalArgumentException("Cannot delete synonym set because it is referenced by search configurations: " + ids);
}findBySynonymSetId() uses MySQL's JSON_CONTAINS() function to query the JSON array column.
Data Plane: Search & Autocomplete
Two READ checks are performed by SearchIndexQueryManagerImpl:
READ on the SearchIndex entity
READ on the source table entity (parsed from
definingSQL)
Data is indexed as the anonymous user (principal 273950). addRowLevelFilter() in TableQueryManagerImpl applies benefactor ACL filtering, ensuring only publicly visible rows enter the AOSS index.
Control Plane: Optimistic Concurrency
All three mutable configuration resource DAOs enforce OCC via SELECT ... FOR UPDATE:
String currentEtag = getCurrentEtag(id); // SELECT ETAG ... FOR UPDATE
if (!currentEtag.equals(request.getEtag())) {
throw new ConflictingUpdateException("Resource was updated since last fetched.");
}Version Prevention
SearchIndex implements VersionableEntity (required for NODE_REVISION.DEFINING_SQL mapping) but versioning is disabled in EntityManagerImpl.updateEntity():
if (updated instanceof SearchIndex) {
newVersion = false;
}Configuration & Resolution
Defining SQL
definingSQL scopes which columns and rows are indexed. Validated by SearchIndexValidator:
List<IdAndVersion> sourceTableIds = TableModelUtils.getSourceTableIds(definingSQL);
if (sourceTableIds.size() != 1) {
throw new IllegalArgumentException("definingSQL must reference exactly one entity.");
}Only the columns in the SELECT clause are mapped in the AOSS index — derived at build time via QueryTranslator.getSchemaOfSelect() inside the RowHandlerProvider callback (after SQL parsing, before row execution).
Configuration Resolution
SearchConfigurationResolver resolves the effective configuration at build time:
Effective analyzer per column (priority order):
Per-column override
indexAnalyzerIdfrom ColumnAnalyzerOverrideEntrydefaultAnalyzerIdfrom SearchConfigurationSystem default from
ColumnTypeToOpenSearchMapping.getDefaultAnalyzerId(columnType)
The lifecycle worker and query manager both collect all required TextAnalyzer IDs (from overrides, config default, and column type defaults) and load them from TextAnalyzerDao before building or querying an index. Paired search analyzers (pairedSearchAnalyzerId) are also loaded transitively.
Project Settings Integration
{
"concreteType": "org.sagebionetworks.repo.model.project.SearchConfigurationListSetting",
"settingsType": "search",
"searchConfigurationId": "301"
}Registered in ProjectSettingsManagerImpl.TYPE_MAP as SearchConfigurationListSetting.class → ProjectSettingsType.search.
Database Model
Schema Separation
Database | Contents | Migrated? |
|---|---|---|
Transactional DB | NODE/NODE_REVISION (SearchIndex entity), TEXT_ANALYZER, SYNONYM_SET, COLUMN_ANALYZER_OVERRIDE, SEARCH_CONFIGURATION | Yes |
Indexing DB | SEARCH_INDEX_STATUS | No (starts empty) |
On migration: config tables are copied (including TEXT_ANALYZER with system analyzers), SearchIndex entities migrate as NODE/NODE_REVISION, status rows start empty, AOSS starts empty. The TextAnalyzerBootstrapper ensures system analyzers exist on startup. The lifecycle worker discovers missing status rows from replayed ENTITY change messages and builds indexes organically.
Entity Relationship Diagram
There is no dedicated SEARCH_INDEX table. SearchIndex is stored in NODE (type searchindex) + NODE_REVISION. definingSQL maps to NODE_REVISION.DEFINING_SQL (shared with MaterializedView/VirtualTable). searchConfigurationId maps to NODE_REVISION.SEARCH_CONFIGURATION_ID (new column).
DDL: Transactional DB (Migrated)
-- Database-stored text analyzer configurations
-- System analyzers (IDs 1-999) bootstrapped on startup; user-defined start at 1000+
CREATE TABLE IF NOT EXISTS `TEXT_ANALYZER` (
`ID` BIGINT NOT NULL, `ETAG` char(36) NOT NULL,
`NAME` VARCHAR(256) NOT NULL, `DESCRIPTION` VARCHAR(1000) DEFAULT NULL,
`ORGANIZATION_ID` BIGINT DEFAULT NULL, `SETTINGS` JSON NOT NULL,
`IS_SYSTEM` BOOLEAN NOT NULL DEFAULT FALSE,
`CREATED_ON` TIMESTAMP(3) NOT NULL, `MODIFIED_ON` TIMESTAMP(3) NOT NULL,
PRIMARY KEY (`ID`), UNIQUE KEY (`ORGANIZATION_ID`, `NAME`),
CONSTRAINT `TA_ORG_FK` FOREIGN KEY (`ORGANIZATION_ID`) REFERENCES `ORGANIZATION` (`ID`) ON DELETE RESTRICT
)
-- Synonym rules stored as JSON
CREATE TABLE IF NOT EXISTS `SYNONYM_SET` (
`ID` BIGINT NOT NULL, `ETAG` char(36) NOT NULL,
`ORGANIZATION_ID` BIGINT NOT NULL, `NAME` VARCHAR(256) NOT NULL,
`DESCRIPTION` VARCHAR(1000) DEFAULT NULL, `RULES` JSON NOT NULL,
`CREATED_BY` BIGINT NOT NULL, `CREATED_ON` TIMESTAMP(3) NOT NULL,
`MODIFIED_BY` BIGINT NOT NULL, `MODIFIED_ON` TIMESTAMP(3) NOT NULL,
PRIMARY KEY (`ID`), UNIQUE KEY (`ORGANIZATION_ID`, `NAME`),
CONSTRAINT `SYNSET_ORG_FK` FOREIGN KEY (`ORGANIZATION_ID`) REFERENCES `ORGANIZATION` (`ID`) ON DELETE RESTRICT
)
-- Per-column analyzer overrides stored as JSON
CREATE TABLE IF NOT EXISTS `COLUMN_ANALYZER_OVERRIDE` (
`ID` BIGINT NOT NULL, `ETAG` char(36) NOT NULL,
`ORGANIZATION_ID` BIGINT NOT NULL, `NAME` VARCHAR(256) NOT NULL,
`DESCRIPTION` VARCHAR(1000) DEFAULT NULL, `OVERRIDES` JSON NOT NULL,
`CREATED_BY` BIGINT NOT NULL, `CREATED_ON` TIMESTAMP(3) NOT NULL,
`MODIFIED_BY` BIGINT NOT NULL, `MODIFIED_ON` TIMESTAMP(3) NOT NULL,
PRIMARY KEY (`ID`), UNIQUE KEY (`ORGANIZATION_ID`, `NAME`),
CONSTRAINT `CAO_ORG_FK` FOREIGN KEY (`ORGANIZATION_ID`) REFERENCES `ORGANIZATION` (`ID`) ON DELETE RESTRICT
)
-- References stored as JSON arrays; DEFAULT_ANALYZER_ID references TEXT_ANALYZER
CREATE TABLE IF NOT EXISTS `SEARCH_CONFIGURATION` (
`ID` BIGINT NOT NULL, `ETAG` char(36) NOT NULL,
`ORGANIZATION_ID` BIGINT NOT NULL, `NAME` VARCHAR(256) NOT NULL,
`DESCRIPTION` VARCHAR(1000) DEFAULT NULL,
`SYNONYM_SET_IDS` JSON DEFAULT NULL,
`COLUMN_ANALYZER_OVERRIDE_IDS` JSON DEFAULT NULL,
`DEFAULT_ANALYZER_ID` BIGINT DEFAULT NULL,
`CREATED_BY` BIGINT NOT NULL, `CREATED_ON` TIMESTAMP(3) NOT NULL,
`MODIFIED_BY` BIGINT NOT NULL, `MODIFIED_ON` TIMESTAMP(3) NOT NULL,
PRIMARY KEY (`ID`), UNIQUE KEY (`ORGANIZATION_ID`, `NAME`),
CONSTRAINT `SC_ORG_FK` FOREIGN KEY (`ORGANIZATION_ID`) REFERENCES `ORGANIZATION` (`ID`) ON DELETE RESTRICT,
CONSTRAINT `SC_DEFAULT_ANALYZER_FK` FOREIGN KEY (`DEFAULT_ANALYZER_ID`) REFERENCES `TEXT_ANALYZER` (`ID`) ON DELETE RESTRICT
)DDL: Indexing DB (Not Migrated)
CREATE TABLE IF NOT EXISTS SEARCH_INDEX_STATUS (
SEARCH_INDEX_ID BIGINT NOT NULL,
STATE VARCHAR(50) NOT NULL DEFAULT 'CREATING',
LAST_BUILD_ON TIMESTAMP(3) NULL,
ERROR_MESSAGE VARCHAR(3000) DEFAULT NULL,
CHANGED_ON TIMESTAMP(3) NOT NULL,
PRIMARY KEY (SEARCH_INDEX_ID), INDEX (STATE)
)Uses INSERT ... ON DUPLICATE KEY UPDATE for upsert. Absence of a row is the implicit trigger for the lifecycle worker to build the index.
JSON Column Examples
SYNONYM_SET.RULES:
[
{"ruleType": "EQUIVALENT", "terms": ["NF1", "neurofibromatosis 1"]},
{"ruleType": "EXPLICIT", "terms": ["OPG", "optic pathway glioma"]}
]COLUMN_ANALYZER_OVERRIDE.OVERRIDES:
[
{"columnName": "diseaseFocus", "indexAnalyzerId": "4", "searchAnalyzerId": "4"},
{"columnName": "doi", "indexAnalyzerId": "3", "searchAnalyzerId": "3"}
]TEXT_ANALYZER.SETTINGS (example: SCIENTIFIC, ID 1):
{
"tokenizer": "standard",
"tokenFilters": {
"english_stop": "{\"type\":\"stop\",\"stopwords\":\"_english_\"}",
"english_stemmer": "{\"type\":\"stemmer\",\"language\":\"english\"}"
},
"filterOrder": ["lowercase", "english_stop", "english_stemmer"],
"synonymAware": true
}Serialized via JSONObjectAdapterImpl + writeToJSONObject()/initializeFromJSONObject(). Simple string arrays (SYNONYM_SET_IDS, COLUMN_ANALYZER_OVERRIDE_IDS) use JSONArray directly. Cross-resource references are queried via MySQL's JSON_CONTAINS().
Worker Pipeline & Index Lifecycle
Lifecycle State Machine
Change Message Flow
All entity lifecycle events publish ENTITY change messages via TransactionalMessenger. No dedicated SEARCH_INDEX ObjectType or SNS topic. The lifecycle worker subscribes to the SEARCH_INDEX_LIFECYCLE SQS queue (subscribed to the existing ENTITY SNS topic) and filters internally by node type:
Worker Configuration
Property | Lifecycle Worker | Query Worker |
|---|---|---|
Class | SearchIndexLifecycleWorker | SearchQueryWorker |
Interface | BatchChangeMessageDrivenRunner | AsyncJobRunner<SearchQuery, SearchResults> |
SQS Queue | SEARCH_INDEX_LIFECYCLE | SEARCH_QUERY |
Semaphore Lock Key | searchIndexLifecycleWorker | searchQueryWorker |
Max Lock Count | 2 | 4 |
Lock & Visibility Timeout | 300s | 120s |
Max Threads / Machine | 2 | 3 |
Repeat Interval | 2053ms | 2039ms |
Start Delay | 523ms | 317ms |
The Build Process
Data Loading
The lifecycle worker uses TableQueryManager.runQueryAsStream() with the anonymous user (BOOTSTRAP_PRINCIPAL.ANONYMOUS_USER, principal 273950). addRowLevelFilter() automatically applies benefactor ACL checks, ensuring only publicly accessible rows are indexed.
The RowHandlerProvider callback fires after SQL parsing but before row execution. Inside this callback:
getSchemaOfSelect()returns only the columns selected bydefiningSQLcollectAndLoadAnalyzers()gathers all required TextAnalyzer IDs and loads them from the DAO (including paired search analyzers)searchIndexOssClient.createIndex()creates the AOSS index with mappings for those columns, using the loaded TextAnalyzer settings to build custom analyzer definitionsReturns
SearchIndexRowHandlerwhich streams rows in batches of 1000
SearchIndexRowHandler converts each Row to a Map<String, Object>:
_row_id=row.getRowId()(also used as the OS document_id)_row_version=row.getVersionNumber()Column values mapped by name from
SelectColumnlist; null values skipped
Pre-flight Count Check
Before creating the AOSS index or streaming data, the worker runs a count-only query via querySinglePage() with runQuery=false, runCount=true. If the count exceeds 500,000, the build fails immediately with FAILED status — no orphaned partial AOSS index is created. The row-level guard in SearchIndexRowHandler.nextRow() remains as a safety net for the TOCTOU window.
Edge Cases
Entity not found (NotFoundException): nodeDao.getNodeTypeById() throws → worker checks statusDao.exists() → if true, cleans up AOSS index and status row. If false, no-op.
Duplicate/concurrent CREATE messages: Idempotent. First message builds. Subsequent messages: status row exists → UPDATE handler sees it → no-op.
DELETE during build: DELETE handler deletes AOSS index + status row. CREATE handler may fail → caught → sets FAILED on a now-deleted status row (harmless).
AOSS index not found on delete: SearchIndexOssClientImpl.deleteIndex() catches index_not_found_exception at INFO level. No-op.
Build failure: Status set to FAILED with truncated error message (max 3000 chars). FAILED is terminal. To rebuild: delete entity, create new one.
OpenSearch Implementation
Index Naming
Resource | Pattern | Example |
|---|---|---|
AOSS Collection | {stack}-{instance}-searchindex | prod-123-synsearch |
AOSS Index | search-index-{entityId} | search-index-syn42 |
AOSS Client Bean |
|
|
System Fields
Field | Type | Description |
|---|---|---|
| long | Row ID from the table index. Also used as OS document |
| long | Row version from the table index. |
ColumnType to OpenSearch Mapping
Synapse ColumnType | OS Primary Type | Sub-Fields | ignoreAbove |
|---|---|---|---|
STRING, STRING_LIST | text | .keyword | 1000 |
MEDIUMTEXT | text | .keyword | 2000 |
LARGETEXT | text | .keyword | 8192 |
LINK (KEYWORD analyzer, ID 4) | keyword | .searchable (text, synapse_analyzer_1) | 1000 |
LINK (other analyzer) | text | .keyword | 1000 |
INTEGER, DATE, *_LIST | long | — | — |
FILEHANDLEID, SUBMISSIONID, EVALUATIONID | long | — | — |
ENTITYID, USERID, *_LIST | keyword | — | 256 |
DOUBLE | double | — | — |
BOOLEAN, BOOLEAN_LIST | boolean | — | — |
JSON | object (dynamic: true) | — | — |
Dual-field strategy:
KEYWORD analyzer columns: Primary is
keyword;.searchablesub-field istextwithsynapse_analyzer_1(SCIENTIFIC) for full-text search.All other text columns: Primary is
textwith configured analyzer;.keywordsub-field for exact-match filtering/faceting.AUTOCOMPLETE analyzer (ID 5) uses asymmetric analysis:
synapse_analyzer_5at index time (edge n-grams),synapse_analyzer_6(AUTOCOMPLETE_SEARCH) at search time viapairedSearchAnalyzerId.
Analyzer Pipeline
Analyzers are database-stored TextAnalyzer rows. All custom analyzers are registered in OpenSearch as synapse_analyzer_{id}. The SearchIndexOssClientImpl.createIndex() method dynamically builds analyzer definitions from TextAnalyzerSettings JSON.
System Analyzers (bootstrapped by TextAnalyzerBootstrapper)
ID | Name | OS Analyzer Name | Tokenizer | Token Filters | Synonym-Aware? | Paired Search ID |
|---|---|---|---|---|---|---|
1 | SCIENTIFIC | synapse_analyzer_1 | standard | lowercase, english_stop, english_stemmer, [synapse_synonyms] | Yes | — |
2 | STANDARD | synapse_analyzer_2 | standard | lowercase, [synapse_synonyms] | Yes | — |
3 | IDENTIFIER | synapse_analyzer_3 | whitespace | lowercase, [synapse_synonyms] | Yes | — |
4 | KEYWORD | synapse_analyzer_4 | keyword | (none) | No | — |
5 | AUTOCOMPLETE | synapse_analyzer_5 | standard | lowercase, edge_ngram (min=2, max=20) | No | 6 |
6 | AUTOCOMPLETE_SEARCH | synapse_analyzer_6 | standard | lowercase, [synapse_synonyms] | Yes | — |
Filters in [brackets] are only included when synonym rules are present. Any analyzer can specify a pairedSearchAnalyzerId to use a different analyzer at search time. User-defined analyzers (IDs 1000+) can also use this mechanism.
Token filter definitions are stored as JSON strings in TextAnalyzerSettings.tokenFilters (e.g., {"type":"stop","stopwords":"_english_"}) and parsed at index creation time by registerTokenFilter().
Synonyms are baked into the AOSS index at build time. They are NOT loaded at query time — only column analyzer overrides and TextAnalyzer objects are loaded at query time for field routing.
Synonym Injection
Resolved synonyms are converted to Solr-format inline strings for the synapse_synonyms token filter: