Technical Design Document: Cohort Builder (2.0)
This document describes the backend architecture for Cohort Builder 2.0. It builds on the Cohort Builder - Technical Details - (1.0) design that is already in production. The scope is backend only — UI/UX design is covered separately.
JIRA:
Status: DRAFT
Target: Design complete by Jun 30, 2026 | Implementation target end of Q3 2026
1. Problem Statement and Goals
1.1 What Broke in 1.0
Cohort Builder 1.0 was designed for a one-to-many relationship (~100 files per individual). In practice, certain studies (e.g., LLFM with 5,000+ participants) exhibit a many-to-many relationship where every individual maps to every file. This creates a cartesian explosion in the MATERIAL (MaterializedView) table that breaks Synapse row limits and degrades query performance.
Additionally, 1.0 assumed all users had full access to the PARTICIPANTS table. In 2.0, we must support users who do not have access to individual-level participant data but are permitted to see aggregate counts under governance constraints.
1.2 New Requirements for 2.0
Requirement | Problem | Source |
|---|---|---|
Many:many resolution | File-to-participant mapping tables explode for dense studies | ARQ-6 |
Aggregate-only access | Unapproved users must be able to query for aggregate counts without seeing individual data | ARQ-2, PRIV-4 |
Sub-query cohort handoff | In 1.0, IDs are passed between views (scalability issue + privacy issue for unapproved users) | FILS-4 |
Count suppression | Counts between 1-19 must never be exposed — query must be rejected | PRIV-4, meeting decision |
Complex query logic | Current filter API only supports AND between filters; need nested groups with OR/NOT | FILS-7 |
Column governance | Unrestricted GROUP BY could enable re-identification | PRIV-3, NFR-2 |
Query audit | No existing audit trail for queries — must detect/deter misuse | NFR-2, meeting decision |
Cross-portal extensibility | ELITE and ADKP have different facets and privacy rules | ARQ-3 |
1.3 Scope
In scope: Backend architecture for aggregate-only access, sub-query cohort handoff, privacy enforcement, enhanced filtering, governance, and audit.
Out of scope: Cohort saving/sharing (v3.0 / Q4), cross-portal cohort querying (v3.0+), tiered access UI (deferred/rejected), UI/UX design.
1.4 Goals
Support many:many studies by mapping individuals to Datasets instead of individual files
Enable authenticated users without AR approval to see only aggregate counts from restricted tables, under ACT-defined constraints
Provide a sub-query mechanism so cohort definitions transfer between views without exposing IDs
Reject queries where cohort count falls below the ACT-configured threshold
Support nested filter groups with AND, OR, and NOT
Mandatory query audit logging with user notification
2. Architecture Overview
2.1 Two User Tiers
The 2.0 architecture supports two distinct user experiences based on their access level to the PARTICIPANTS source table:
User Tier | Access to PARTICIPANTS | Experience |
|---|---|---|
Approved (met AR) | Full access | 1.0 experience: see participant rows, facets, full query capabilities |
Unapproved (authenticated, not met AR) | Aggregate count only | 2.0 experience: see only total cohort count (>= threshold); use sub-query to browse matching files |
2.2 User Journeys
Approved User (unchanged from 1.0)
1. User queries PARTICIPANTS VirtualTable → sees rows + facets
2. Applies filters to define cohort
3. Clicks "View Files" → UI captures cohort IDs (or sub-query)
4. Queries FILES VirtualTable with cohort filter applied
5. Browses files → adds to download cart
Unapproved User (new in 2.0)
1. User queries PARTICIPANTS VirtualTable → sees ONLY total count (N = 847)
2. Applies filters using known filter values (from schema metadata)
3. Count updates with each filter: "N = 203"
4. If count drops below threshold: "Cannot proceed — cohort too small"
5. If count >= threshold: user clicks "View Files"
6. UI submits sub-query filter to FILES VirtualTable:
WHERE participantId IN (SELECT individualId FROM syn_participants_vt WHERE [user's filters])
7. Backend validates: sub-query count >= threshold? Source columns not in outer SELECT?
8. Query executes → user sees file rows → adds to download cart
2.3 Query Pipeline with New Steps
User Query: "SELECT * FROM syn_files_vt WHERE participantId IN (SELECT ...)"
│
▼
[1] Parse query (extended grammar supports constrained sub-query) ← ENHANCED
│
▼
[2] VirtualTableIndexDescription.preprocessQuery() ← EXISTING
│ Wraps defining SQL as CTE
▼
[3] TableQueryManagerImpl.queryPreflight() ← ENHANCED
│ a. Check outer table access (user has READ on FILES VT)
│ b. Check sub-query source access:
│ - Does user have full access? → allow (1.0 path)
│ - No? → Is source DataType == AGGREGATE_DATA?
│ → YES: run count check (must be >= threshold)
│ → NO: throw UnauthorizedException
│ c. Validate: outer query does NOT select from restricted source
▼
[4] CombinedQuery: apply additionalFilters + selectedFacets ← ENHANCED (nested groups)
│
▼
[5] QueryTranslator: translate to index SQL ← EXISTING
│
▼
[6] Execute query ← EXISTING
│
▼
[7] Audit logger: record query details to Kinesis ← NEW
│
▼
[8] Return QueryResultBundle ← EXISTING
3. Data Model — Individual-to-Dataset Mapping
3.1 Problem
The 1.0 MATERIAL table maps files to participants via a FILE_TO_PART table. For a study with N participants and M files where every file maps to every participant, the MATERIAL table grows to N × M rows. For LLFM (5,000 participants × thousands of files) this exceeds practical limits.
3.2 Solution: Map to Datasets
For many:many studies, curators map individuals to Datasets (existing Synapse entity type) rather than individual files. A Dataset is a curated collection of file references (max 30K items). This reduces the mapping cardinality dramatically — e.g., 5,000 participants × 10 datasets instead of 5,000 × 50,000 files.
Key Decision: The Dataset entity already exists in Synapse and already integrates with the download cart. No new entity types are needed.
3.3 Source Table Layout (Curator-Managed)
Table | Description | Key Columns |
|---|---|---|
Participants | One row per participant, all metadata | individualId, sex, age, diagnosis, study, ... |
Individual-to-File mapping | For one:many studies (unchanged from 1.0) | individualId, fileEntityId |
Individual-to-Dataset mapping | For many:many studies (NEW) | individualId, datasetId |
3.4 MATERIAL Table (MaterializedView)
The MATERIAL table's defining SQL is updated to support both mapping types:
SELECT
p.individualId, p.sex, p.age, p.diagnosis, p.study,
COALESCE(fm.fileEntityId, dm.datasetId) AS dataReference,
CASE
WHEN fm.fileEntityId IS NOT NULL THEN 'FILE'
ELSE 'DATASET'
END AS referenceType
FROM syn_participants p
LEFT JOIN syn_file_mapping fm ON p.individualId = fm.individualId
LEFT JOIN syn_dataset_mapping dm ON p.individualId = dm.individualId
3.5 Impact on Download Cart
The download list already supports adding Datasets via AddToDownloadListRequest. No backend changes needed — the frontend handles the hybrid file+dataset presentation.
4. Aggregate Access Exception — DataType Model
4.1 Problem
In 1.0, the query preflight calls validateTableReadAccess() which recursively checks access on the VirtualTable's dependencies. If a user hasn't met the AR on the source PARTICIPANTS table, an UnauthorizedException is thrown — no access at all.
In 2.0, ACT needs to say: "This table is restricted, but authenticated users may see aggregate counts from it." We need a mechanism for ACT to create this exception without interfering with the existing AR model.
4.2 Solution: Extend DataType
The existing changeEntityDataType() service is already used by ACT to classify data. Currently supports SENSITIVE_DATA (default) and OPEN_DATA. We extend this with a new value:
DataType | Meaning | Who Can Set |
|---|---|---|
| Default. Full AR enforcement. No exceptions. | Any user with UPDATE |
| Safe for public. READ = automatic DOWNLOAD. | ACT only |
| NEW. Authenticated users may access aggregate counts only, subject to threshold. Full row access still requires AR approval. | ACT only |
4.3 What AGGREGATE_DATA Means at Runtime
When a VirtualTable's source table has DataType = AGGREGATE_DATA:
Approved users (met the AR): Full access. Unchanged from 1.0.
Unapproved but authenticated users:
May query the VirtualTable for COUNT only — no row data, no individual columns
The count is only returned if it meets the threshold (>= 20)
May use a sub-query referencing this table in a filter on another table (the cohort handoff), subject to the count threshold check
May NOT see any columns from the restricted source in query results
Anonymous users: No access. Must be authenticated.
4.4 Threshold Configuration
The count threshold is stored alongside the DataType classification. Extend the DATA_TYPE table (or add a linked table) with:
Column | Type | Description |
|---|---|---|
| BIGINT | Minimum count required. Default: 20. Only applies when DATA_TYPE = AGGREGATE_DATA. |
ACT sets this when classifying the table. The threshold is loaded during preflight and used for count validation.
4.5 Modified Preflight Logic
// Pseudocode for enhanced validateTableReadAccess()
void validateTableReadAccess(UserInfo user, IndexDescription index) {
String entityId = index.getIdAndVersion().getId().toString();
// Check READ permission (ACL-based)
authorizationManager.canAccess(user, entityId, ENTITY, READ).checkAuthorizationOrElseThrow();
// Check DOWNLOAD permission (required for table content access)
AuthorizationStatus downloadStatus = authorizationManager.canAccess(user, entityId, ENTITY, DOWNLOAD);
if (!downloadStatus.isAuthorized()) {
// User does NOT have full access. Check for aggregate exception.
DataType dataType = dataTypeDao.getDataType(entityId);
if (DataType.AGGREGATE_DATA.equals(dataType) && !user.isAnonymous()) {
// Mark this dependency as "aggregate-only" in the query context.
// Actual enforcement (count check, column restriction) happens later.
queryContext.markAggregateOnly(entityId, getThreshold(entityId));
return; // Allow query to proceed with constraints
}
// No exception applies — block access
downloadStatus.checkAuthorizationOrElseThrow();
}
// Recurse into dependencies
for (IndexDescription dependency : index.getDependencies()) {
validateTableReadAccess(user, dependency);
}
}
4.6 Enforcement During Query Execution
When a table is marked as "aggregate-only" in the query context, the following constraints are enforced:
Direct query: Only
SELECT COUNT(*)is permitted. Any other SELECT clause is rejected.Count result: If count < threshold, return an error indicating the cohort is too small (do NOT return the actual count).
Sub-query usage: The table may appear as the source of an IN sub-query in a filter on another table (see Section 5). The sub-query count is validated before the outer query executes.
Column restriction: No columns from the restricted source may appear in the outer query's SELECT clause.
5. Sub-Query Cohort Handoff
5.1 Problem
In 1.0, the UI captures participant IDs from the PARTICIPANTS view and passes them as an IN clause to the FILES view. This has two problems:
Scalability: Thousands of IDs in a URL/request body
Privacy: Unapproved users must never see individual participant IDs
5.2 Solution: Sub-Query Filter
Instead of passing IDs, the UI passes the cohort definition as a sub-query. The backend resolves the IDs internally without exposing them to the client.
New Schema: ColumnSubQueryFilter
{
"description": "A filter that restricts rows to those where a column's value appears in the results of a sub-query against another table. The backend executes the sub-query internally — the caller never sees the resulting IDs.",
"implements": [
{"$ref": "org.sagebionetworks.repo.model.table.QueryFilter"}
],
"properties": {
"concreteType": {
"type": "string"
},
"columnName": {
"type": "string",
"description": "Column on the queried table to match against sub-query results (LHS of IN)."
},
"subQuery": {
"$ref": "org.sagebionetworks.repo.model.table.SubQueryFilter",
"description": "Defines the sub-query whose results form the RHS of the IN clause."
}
}
}
New Schema: SubQueryFilter
{
"description": "Defines a constrained sub-query: SELECT single_column FROM table [WHERE conditions].",
"properties": {
"tableId": {
"type": "string",
"description": "The syn ID of the table/VirtualTable to sub-query."
},
"selectColumn": {
"type": "string",
"description": "The single column whose distinct values form the IN list."
},
"additionalFilters": {
"type": "array",
"items": {"$ref": "org.sagebionetworks.repo.model.table.QueryFilter"},
"description": "Filters defining the cohort (same filter tree the user built)."
}
}
}
5.3 Example: Cohort to Files Handoff
User defined cohort "age_bin = 65+ AND sex = female" on PARTICIPANTS VT (syn456). Now querying FILES VT (syn789):
Object model (what the UI submits):
{
"query": {
"sql": "SELECT * FROM syn789",
"additionalFilters": [
{
"concreteType": "...ColumnSubQueryFilter",
"columnName": "participantId",
"isDefiningCondition": true,
"subQuery": {
"tableId": "syn456",
"selectColumn": "individualId",
"additionalFilters": [
{"concreteType": "...ColumnSingleValueQueryFilter",
"columnName": "age_bin", "operator": "EQUAL", "values": ["65+"]},
{"concreteType": "...ColumnSingleValueQueryFilter",
"columnName": "sex", "operator": "EQUAL", "values": ["female"]}
]
}
}
]
}
}
Equivalent SQL (for clients that write SQL directly):
SELECT * FROM syn789
WHERE participantId IN (SELECT individualId FROM syn456 WHERE age_bin = '65+' AND sex = 'female')
5.4 Backend Validation (Pre-flight)
When a ColumnSubQueryFilter (or an IN sub-query in SQL) is encountered during preflight:
Resolve sub-query source — load the VirtualTable referenced by
tableIdCheck user access to the sub-query source:
User has full access (met AR)? → No threshold check needed. Execute normally.
User lacks full access but source is
AGGREGATE_DATA? → Proceed to count check.Neither? →
UnauthorizedException
Count check: Execute
SELECT COUNT(DISTINCT selectColumn) FROM tableId WHERE [filters]Count >= threshold? → Proceed.
Count < threshold? → Reject with error: "Cohort size is below the minimum threshold. Adjust your filters to include more participants."
Column restriction: Verify the outer query's SELECT does not include any columns that originate from the restricted sub-query source.
Execute the full query with the sub-query embedded in the translated SQL.
5.5 Constrained Sub-Query Grammar (JavaCC Extension)
The SQL parser (table-query-parser.jj) will be extended to support sub-queries, but only in the constrained form:
subQuery := SELECT single_column FROM table_reference [WHERE search_condition]
Allowed:
Single column in SELECT (no expressions, no
*)Single FROM table reference (no JOINs, no UNION)
Optional WHERE with full boolean expression grammar
Disallowed (parser rejects at parse time):
Multiple SELECT columns
GROUP BY
ORDER BY
LIMIT / OFFSET
Nested sub-queries (sub-query within a sub-query)
Aggregate functions (COUNT, SUM, etc.) in SELECT
JOIN, UNION
The sub-query may appear only on the right-hand side of an IN predicate:
column_name IN ( subQuery )
5.6 Translation
In SQLTranslatorUtils, the sub-query is translated by:
Resolving
tableIdto its VirtualTable defining SQLWrapping the VirtualTable as a CTE (same mechanism as the outer query)
Applying row-level filters (benefactor) to the sub-query source
Applying the user's WHERE filters
Producing:
column IN (WITH syn456 AS (...) SELECT individualId FROM syn456 WHERE ...)
6. Enhanced Query Filtering — Nested Filter Groups
6.1 Current State
The existing additionalFilters mechanism supports leaf predicates joined with AND only. The SQL parser supports full SQL-92 boolean expressions (nested AND/OR/NOT with parentheses), but the programmatic API cannot express this.
6.2 Proposed: Recursive Filter Tree (Mirrors SQL-92 Grammar)
A FilterGroup represents a parenthesized group of conditions. It maps directly to the parser's AST:
API Model | SQL-92 AST Node | SQL Output |
|---|---|---|
|
|
|
|
|
|
|
|
|
Leaf filter (e.g. |
|
|
Nested |
|
|
Schema: FilterGroup
{
"description": "A group of filter conditions combined with a boolean operator. Groups can be nested to form arbitrary boolean expression trees.",
"implements": [{"$ref": "org.sagebionetworks.repo.model.table.QueryFilter"}],
"properties": {
"operator": {
"name": "BooleanOperator",
"type": "string",
"enum": [
{"name": "AND", "description": "All children must match."},
{"name": "OR", "description": "At least one child must match."}
]
},
"not": {
"type": "boolean",
"description": "When true, negates this entire group. Default: false."
},
"children": {
"type": "array",
"items": {"$ref": "org.sagebionetworks.repo.model.table.QueryFilter"},
"description": "Child filters — leaf predicates or nested FilterGroups."
}
}
}
6.3 Example: Complex Query
SQL: WHERE ((diagnosis LIKE '%Alzheimer%' AND age > 65) OR (diagnosis LIKE '%dementia%')) AND NOT (study = 'excluded') AND sex = 'female'
Maps to a FilterGroup tree: root AND group with three children — an OR group (containing an AND group and a leaf), a negated AND group, and a leaf.
6.4 Additional Operators on Leaf Filters
Extend ColumnSingleValueFilterOperator:
Operator | SQL | Status |
|---|---|---|
LIKE, EQUAL, IN | Existing | Existing |
GREATER_THAN |
| NEW |
LESS_THAN |
| NEW |
GREATER_THAN_OR_EQUAL |
| NEW |
LESS_THAN_OR_EQUAL |
| NEW |
NOT_EQUAL |
| NEW |
IS_NULL |
| NEW |
IS_NOT_NULL |
| NEW |
BETWEEN |
| NEW |
6.5 Depth/Size Limits
Maximum nesting depth: 5 levels
Maximum total leaf predicates: 50
Maximum children per group: 25
7. Column Governance / Whitelisting
7.1 Threat Model
Unrestricted GROUP BY or filter combinations could enable re-identification even with count thresholds.
7.2 Defense Layers
VirtualTable defining SQL controls exposure: Only columns curators include are queryable.
Facet types control filtering: Only columns with non-null
facetTypeare filterable.Aggregation is fixed: GROUP BY is set at design time in the defining SQL — users cannot alter it.
restrictToStructuredQueries: Disable raw SQL for cohort builder VirtualTables (force structured filter API only).
Binning: Curators bin quasi-identifiers (age → age ranges, 90+ censoring) in the defining SQL.
7.3 Enforcement for Aggregate-Only Access
For unapproved users querying an AGGREGATE_DATA source:
Direct queries: Only
SELECT COUNT(*)is allowed — no other columnsSub-query: Only the
selectColumn(join key) is retrieved internally — user never sees itThe outer query (FILES) is against a table the user does have access to — no restriction on those columns
8. Query Audit Infrastructure
8.1 Requirement
All queries against cohort builder VirtualTables must be logged. Users warned their activity is monitored.
8.2 Architecture
Follow existing AsyncAccessRecorder → Kinesis Firehose → S3 pattern.
8.3 Audit Record
Field | Type | Description |
|---|---|---|
userId | Long | Synapse user ID |
timestamp | Long | Epoch millis |
virtualTableId | String | syn ID of the queried VirtualTable |
subQueryTableId | String | syn ID of the sub-query source (if applicable) |
additionalFilters | String (JSON) | Serialized filter tree |
resultCount | Long (nullable) | Total count returned, null if below threshold |
accessTier | String | FULL or AGGREGATE_ONLY |
responseTimeMs | Long | Query execution time |
8.4 Gating
Audit all queries where the source table has DataType = AGGREGATE_DATA. For OPEN_DATA or SENSITIVE_DATA with full access, standard access logging applies (existing infrastructure).
9. Security Considerations
9.1 Layered Defense Model
Layer | Mechanism | Protects Against |
|---|---|---|
1. Data model |