Technical Design Document: Cohort Builder (2.0)

This document describes the backend architecture for Cohort Builder 2.0. It builds on the Cohort Builder - Technical Details - (1.0) design that is already in production. The scope is backend only — UI/UX design is covered separately.

JIRA:

PLFM-9721 - Getting issue details... STATUS

Status: DRAFT

Target: Design complete by Jun 30, 2026 | Implementation target end of Q3 2026

1. Problem Statement and Goals

1.1 What Broke in 1.0

Cohort Builder 1.0 was designed for a one-to-many relationship (~100 files per individual). In practice, certain studies (e.g., LLFM with 5,000+ participants) exhibit a many-to-many relationship where every individual maps to every file. This creates a cartesian explosion in the MATERIAL (MaterializedView) table that breaks Synapse row limits and degrades query performance.

Additionally, 1.0 assumed all users had full access to the PARTICIPANTS table. In 2.0, we must support users who do not have access to individual-level participant data but are permitted to see aggregate counts under governance constraints.

1.2 New Requirements for 2.0

Requirement	Problem	Source

Requirement	Problem	Source
Many:many resolution	File-to-participant mapping tables explode for dense studies	ARQ-6
Aggregate-only access	Unapproved users must be able to query for aggregate counts without seeing individual data	ARQ-2, PRIV-4
Sub-query cohort handoff	In 1.0, IDs are passed between views (scalability issue + privacy issue for unapproved users)	FILS-4
Count suppression	Counts between 1-19 must never be exposed — query must be rejected	PRIV-4, meeting decision
Complex query logic	Current filter API only supports AND between filters; need nested groups with OR/NOT	FILS-7
Column governance	Unrestricted GROUP BY could enable re-identification	PRIV-3, NFR-2
Query audit	No existing audit trail for queries — must detect/deter misuse	NFR-2, meeting decision
Cross-portal extensibility	ELITE and ADKP have different facets and privacy rules	ARQ-3

1.3 Scope

In scope: Backend architecture for aggregate-only access, sub-query cohort handoff, privacy enforcement, enhanced filtering, governance, and audit.

Out of scope: Cohort saving/sharing (v3.0 / Q4), cross-portal cohort querying (v3.0+), tiered access UI (deferred/rejected), UI/UX design.

1.4 Goals

Support many:many studies by mapping individuals to Datasets instead of individual files
Enable authenticated users without AR approval to see only aggregate counts from restricted tables, under ACT-defined constraints
Provide a sub-query mechanism so cohort definitions transfer between views without exposing IDs
Reject queries where cohort count falls below the ACT-configured threshold
Support nested filter groups with AND, OR, and NOT
Mandatory query audit logging with user notification

2. Architecture Overview

2.1 Two User Tiers

The 2.0 architecture supports two distinct user experiences based on their access level to the PARTICIPANTS source table:

User Tier	Access to PARTICIPANTS	Experience

User Tier	Access to PARTICIPANTS	Experience
Approved (met AR)	Full access	1.0 experience: see participant rows, facets, full query capabilities
Unapproved (authenticated, not met AR)	Aggregate count only	2.0 experience: see only total cohort count (>= threshold); use sub-query to browse matching files

2.2 User Journeys

Approved User (unchanged from 1.0)

1. User queries PARTICIPANTS VirtualTable → sees rows + facets
2. Applies filters to define cohort
3. Clicks "View Files" → UI captures cohort IDs (or sub-query)
4. Queries FILES VirtualTable with cohort filter applied
5. Browses files → adds to download cart

Unapproved User (new in 2.0)

1. User queries PARTICIPANTS VirtualTable → sees ONLY total count (N = 847)
2. Applies filters using known filter values (from schema metadata)
3. Count updates with each filter: "N = 203"
4. If count drops below threshold: "Cannot proceed — cohort too small"
5. If count >= threshold: user clicks "View Files"
6. UI submits sub-query filter to FILES VirtualTable:
     WHERE participantId IN (SELECT individualId FROM syn_participants_vt WHERE [user's filters])
7. Backend validates: sub-query count >= threshold? Source columns not in outer SELECT? 
8. Query executes → user sees file rows → adds to download cart

2.3 Query Pipeline with New Steps

User Query: "SELECT * FROM syn_files_vt WHERE participantId IN (SELECT ...)"
     │
     ▼
[1] Parse query (extended grammar supports constrained sub-query)     ← ENHANCED
     │
     ▼
[2] VirtualTableIndexDescription.preprocessQuery()                     ← EXISTING
     │  Wraps defining SQL as CTE
     ▼
[3] TableQueryManagerImpl.queryPreflight()                              ← ENHANCED
     │  a. Check outer table access (user has READ on FILES VT)
     │  b. Check sub-query source access:
     │     - Does user have full access? → allow (1.0 path)
     │     - No? → Is source DataType == AGGREGATE_DATA? 
     │       → YES: run count check (must be >= threshold)
     │       → NO: throw UnauthorizedException
     │  c. Validate: outer query does NOT select from restricted source
     ▼
[4] CombinedQuery: apply additionalFilters + selectedFacets            ← ENHANCED (nested groups)
     │
     ▼
[5] QueryTranslator: translate to index SQL                            ← EXISTING
     │
     ▼
[6] Execute query                                                       ← EXISTING
     │
     ▼
[7] Audit logger: record query details to Kinesis                      ← NEW
     │
     ▼
[8] Return QueryResultBundle                                           ← EXISTING

3. Data Model — Individual-to-Dataset Mapping

3.1 Problem

The 1.0 MATERIAL table maps files to participants via a FILE_TO_PART table. For a study with N participants and M files where every file maps to every participant, the MATERIAL table grows to N × M rows. For LLFM (5,000 participants × thousands of files) this exceeds practical limits.

3.2 Solution: Map to Datasets

For many:many studies, curators map individuals to Datasets (existing Synapse entity type) rather than individual files. A Dataset is a curated collection of file references (max 30K items). This reduces the mapping cardinality dramatically — e.g., 5,000 participants × 10 datasets instead of 5,000 × 50,000 files.

Key Decision: The Dataset entity already exists in Synapse and already integrates with the download cart. No new entity types are needed.

3.3 Source Table Layout (Curator-Managed)

Table	Description	Key Columns

Table	Description	Key Columns
Participants	One row per participant, all metadata	individualId, sex, age, diagnosis, study, ...
Individual-to-File mapping	For one:many studies (unchanged from 1.0)	individualId, fileEntityId
Individual-to-Dataset mapping	For many:many studies (NEW)	individualId, datasetId

3.4 MATERIAL Table (MaterializedView)

The MATERIAL table's defining SQL is updated to support both mapping types:

SELECT
    p.individualId, p.sex, p.age, p.diagnosis, p.study,
    COALESCE(fm.fileEntityId, dm.datasetId) AS dataReference,
    CASE
        WHEN fm.fileEntityId IS NOT NULL THEN 'FILE'
        ELSE 'DATASET'
    END AS referenceType
FROM syn_participants p
LEFT JOIN syn_file_mapping fm ON p.individualId = fm.individualId
LEFT JOIN syn_dataset_mapping dm ON p.individualId = dm.individualId

3.5 Impact on Download Cart

The download list already supports adding Datasets via AddToDownloadListRequest. No backend changes needed — the frontend handles the hybrid file+dataset presentation.

4. Aggregate Access Exception — DataType Model

4.1 Problem

In 1.0, the query preflight calls validateTableReadAccess() which recursively checks access on the VirtualTable's dependencies. If a user hasn't met the AR on the source PARTICIPANTS table, an UnauthorizedException is thrown — no access at all.

In 2.0, ACT needs to say: "This table is restricted, but authenticated users may see aggregate counts from it." We need a mechanism for ACT to create this exception without interfering with the existing AR model.

4.2 Solution: Extend DataType

The existing changeEntityDataType() service is already used by ACT to classify data. Currently supports SENSITIVE_DATA (default) and OPEN_DATA. We extend this with a new value:

DataType	Meaning	Who Can Set

DataType	Meaning	Who Can Set
`SENSITIVE_DATA`	Default. Full AR enforcement. No exceptions.	Any user with UPDATE
`OPEN_DATA`	Safe for public. READ = automatic DOWNLOAD.	ACT only
`AGGREGATE_DATA`	NEW. Authenticated users may access aggregate counts only, subject to threshold. Full row access still requires AR approval.	ACT only

4.3 What AGGREGATE_DATA Means at Runtime

When a VirtualTable's source table has DataType = AGGREGATE_DATA:

Approved users (met the AR): Full access. Unchanged from 1.0.
Unapproved but authenticated users:
- May query the VirtualTable for COUNT only — no row data, no individual columns
- The count is only returned if it meets the threshold (>= 20)
- May use a sub-query referencing this table in a filter on another table (the cohort handoff), subject to the count threshold check
- May NOT see any columns from the restricted source in query results
Anonymous users: No access. Must be authenticated.

4.4 Threshold Configuration

The count threshold is stored alongside the DataType classification. Extend the DATA_TYPE table (or add a linked table) with:

Column	Type	Description

Column	Type	Description
`SUPPRESSION_THRESHOLD`	BIGINT	Minimum count required. Default: 20. Only applies when DATA_TYPE = AGGREGATE_DATA.

ACT sets this when classifying the table. The threshold is loaded during preflight and used for count validation.

4.5 Modified Preflight Logic

// Pseudocode for enhanced validateTableReadAccess()
void validateTableReadAccess(UserInfo user, IndexDescription index) {
    String entityId = index.getIdAndVersion().getId().toString();
    
    // Check READ permission (ACL-based)
    authorizationManager.canAccess(user, entityId, ENTITY, READ).checkAuthorizationOrElseThrow();
    
    // Check DOWNLOAD permission (required for table content access)
    AuthorizationStatus downloadStatus = authorizationManager.canAccess(user, entityId, ENTITY, DOWNLOAD);
    
    if (!downloadStatus.isAuthorized()) {
        // User does NOT have full access. Check for aggregate exception.
        DataType dataType = dataTypeDao.getDataType(entityId);
        if (DataType.AGGREGATE_DATA.equals(dataType) && !user.isAnonymous()) {
            // Mark this dependency as "aggregate-only" in the query context.
            // Actual enforcement (count check, column restriction) happens later.
            queryContext.markAggregateOnly(entityId, getThreshold(entityId));
            return; // Allow query to proceed with constraints
        }
        // No exception applies — block access
        downloadStatus.checkAuthorizationOrElseThrow();
    }
    
    // Recurse into dependencies
    for (IndexDescription dependency : index.getDependencies()) {
        validateTableReadAccess(user, dependency);
    }
}

4.6 Enforcement During Query Execution

When a table is marked as "aggregate-only" in the query context, the following constraints are enforced:

Direct query: Only SELECT COUNT(*) is permitted. Any other SELECT clause is rejected.
Count result: If count < threshold, return an error indicating the cohort is too small (do NOT return the actual count).
Sub-query usage: The table may appear as the source of an IN sub-query in a filter on another table (see Section 5). The sub-query count is validated before the outer query executes.
Column restriction: No columns from the restricted source may appear in the outer query's SELECT clause.

5. Sub-Query Cohort Handoff

5.1 Problem

In 1.0, the UI captures participant IDs from the PARTICIPANTS view and passes them as an IN clause to the FILES view. This has two problems:

Scalability: Thousands of IDs in a URL/request body
Privacy: Unapproved users must never see individual participant IDs

5.2 Solution: Sub-Query Filter

Instead of passing IDs, the UI passes the cohort definition as a sub-query. The backend resolves the IDs internally without exposing them to the client.

New Schema: ColumnSubQueryFilter

{
    "description": "A filter that restricts rows to those where a column's value appears in the results of a sub-query against another table. The backend executes the sub-query internally — the caller never sees the resulting IDs.",
    "implements": [
        {"$ref": "org.sagebionetworks.repo.model.table.QueryFilter"}
    ],
    "properties": {
        "concreteType": {
            "type": "string"
        },
        "columnName": {
            "type": "string",
            "description": "Column on the queried table to match against sub-query results (LHS of IN)."
        },
        "subQuery": {
            "$ref": "org.sagebionetworks.repo.model.table.SubQueryFilter",
            "description": "Defines the sub-query whose results form the RHS of the IN clause."
        }
    }
}

New Schema: SubQueryFilter

{
    "description": "Defines a constrained sub-query: SELECT single_column FROM table [WHERE conditions].",
    "properties": {
        "tableId": {
            "type": "string",
            "description": "The syn ID of the table/VirtualTable to sub-query."
        },
        "selectColumn": {
            "type": "string",
            "description": "The single column whose distinct values form the IN list."
        },
        "additionalFilters": {
            "type": "array",
            "items": {"$ref": "org.sagebionetworks.repo.model.table.QueryFilter"},
            "description": "Filters defining the cohort (same filter tree the user built)."
        }
    }
}

5.3 Example: Cohort to Files Handoff

User defined cohort "age_bin = 65+ AND sex = female" on PARTICIPANTS VT (syn456). Now querying FILES VT (syn789):

Object model (what the UI submits):

{
    "query": {
        "sql": "SELECT * FROM syn789",
        "additionalFilters": [
            {
                "concreteType": "...ColumnSubQueryFilter",
                "columnName": "participantId",
                "isDefiningCondition": true,
                "subQuery": {
                    "tableId": "syn456",
                    "selectColumn": "individualId",
                    "additionalFilters": [
                        {"concreteType": "...ColumnSingleValueQueryFilter",
                         "columnName": "age_bin", "operator": "EQUAL", "values": ["65+"]},
                        {"concreteType": "...ColumnSingleValueQueryFilter",
                         "columnName": "sex", "operator": "EQUAL", "values": ["female"]}
                    ]
                }
            }
        ]
    }
}

Equivalent SQL (for clients that write SQL directly):

SELECT * FROM syn789
WHERE participantId IN (SELECT individualId FROM syn456 WHERE age_bin = '65+' AND sex = 'female')

5.4 Backend Validation (Pre-flight)

When a ColumnSubQueryFilter (or an IN sub-query in SQL) is encountered during preflight:

Resolve sub-query source — load the VirtualTable referenced by tableId
Check user access to the sub-query source:
- User has full access (met AR)? → No threshold check needed. Execute normally.
- User lacks full access but source is AGGREGATE_DATA? → Proceed to count check.
- Neither? → UnauthorizedException
Count check: Execute SELECT COUNT(DISTINCT selectColumn) FROM tableId WHERE [filters]
- Count >= threshold? → Proceed.
- Count < threshold? → Reject with error: "Cohort size is below the minimum threshold. Adjust your filters to include more participants."
Column restriction: Verify the outer query's SELECT does not include any columns that originate from the restricted sub-query source.
Execute the full query with the sub-query embedded in the translated SQL.

5.5 Constrained Sub-Query Grammar (JavaCC Extension)

The SQL parser (table-query-parser.jj) will be extended to support sub-queries, but only in the constrained form:

subQuery := SELECT single_column FROM table_reference [WHERE search_condition]

Allowed:

Single column in SELECT (no expressions, no *)
Single FROM table reference (no JOINs, no UNION)
Optional WHERE with full boolean expression grammar

Disallowed (parser rejects at parse time):

Multiple SELECT columns
GROUP BY
ORDER BY
LIMIT / OFFSET
Nested sub-queries (sub-query within a sub-query)
Aggregate functions (COUNT, SUM, etc.) in SELECT
JOIN, UNION

The sub-query may appear only on the right-hand side of an IN predicate:

column_name IN ( subQuery )

5.6 Translation

In SQLTranslatorUtils, the sub-query is translated by:

Resolving tableId to its VirtualTable defining SQL
Wrapping the VirtualTable as a CTE (same mechanism as the outer query)
Applying row-level filters (benefactor) to the sub-query source
Applying the user's WHERE filters
Producing: column IN (WITH syn456 AS (...) SELECT individualId FROM syn456 WHERE ...)

6. Enhanced Query Filtering — Nested Filter Groups

6.1 Current State

The existing additionalFilters mechanism supports leaf predicates joined with AND only. The SQL parser supports full SQL-92 boolean expressions (nested AND/OR/NOT with parentheses), but the programmatic API cannot express this.

6.2 Proposed: Recursive Filter Tree (Mirrors SQL-92 Grammar)

A FilterGroup represents a parenthesized group of conditions. It maps directly to the parser's AST:

API Model	SQL-92 AST Node	SQL Output

API Model	SQL-92 AST Node	SQL Output
`FilterGroup(operator=OR, children=[...])`	`SearchCondition`	`(child1 OR child2 OR ...)`
`FilterGroup(operator=AND, children=[...])`	`BooleanTerm`	`(child1 AND child2 AND ...)`
`FilterGroup(not=true, ...)`	`BooleanFactor(not=true)`	`NOT (...)`
Leaf filter (e.g. `ColumnSingleValueQueryFilter`)	`Predicate`	`column op value`
Nested `FilterGroup` inside a parent	`BooleanPrimary(SearchCondition)`	`( nested_expression )`

Schema: FilterGroup

{
    "description": "A group of filter conditions combined with a boolean operator. Groups can be nested to form arbitrary boolean expression trees.",
    "implements": [{"$ref": "org.sagebionetworks.repo.model.table.QueryFilter"}],
    "properties": {
        "operator": {
            "name": "BooleanOperator",
            "type": "string",
            "enum": [
                {"name": "AND", "description": "All children must match."},
                {"name": "OR", "description": "At least one child must match."}
            ]
        },
        "not": {
            "type": "boolean",
            "description": "When true, negates this entire group. Default: false."
        },
        "children": {
            "type": "array",
            "items": {"$ref": "org.sagebionetworks.repo.model.table.QueryFilter"},
            "description": "Child filters — leaf predicates or nested FilterGroups."
        }
    }
}

6.3 Example: Complex Query

SQL: WHERE ((diagnosis LIKE '%Alzheimer%' AND age > 65) OR (diagnosis LIKE '%dementia%')) AND NOT (study = 'excluded') AND sex = 'female'

Maps to a FilterGroup tree: root AND group with three children — an OR group (containing an AND group and a leaf), a negated AND group, and a leaf.

6.4 Additional Operators on Leaf Filters

Extend ColumnSingleValueFilterOperator:

Operator	SQL	Status

Operator	SQL	Status
LIKE, EQUAL, IN	Existing	Existing
GREATER_THAN	`>`	NEW
LESS_THAN	`<`	NEW
GREATER_THAN_OR_EQUAL	`>=`	NEW
LESS_THAN_OR_EQUAL	`<=`	NEW
NOT_EQUAL	`<>`	NEW
IS_NULL	`IS NULL`	NEW
IS_NOT_NULL	`IS NOT NULL`	NEW
BETWEEN	`BETWEEN v1 AND v2`	NEW

6.5 Depth/Size Limits

Maximum nesting depth: 5 levels
Maximum total leaf predicates: 50
Maximum children per group: 25

7. Column Governance / Whitelisting

7.1 Threat Model

Unrestricted GROUP BY or filter combinations could enable re-identification even with count thresholds.

7.2 Defense Layers

VirtualTable defining SQL controls exposure: Only columns curators include are queryable.
Facet types control filtering: Only columns with non-null facetType are filterable.
Aggregation is fixed: GROUP BY is set at design time in the defining SQL — users cannot alter it.
restrictToStructuredQueries: Disable raw SQL for cohort builder VirtualTables (force structured filter API only).
Binning: Curators bin quasi-identifiers (age → age ranges, 90+ censoring) in the defining SQL.

7.3 Enforcement for Aggregate-Only Access

For unapproved users querying an AGGREGATE_DATA source:

Direct queries: Only SELECT COUNT(*) is allowed — no other columns
Sub-query: Only the selectColumn (join key) is retrieved internally — user never sees it
The outer query (FILES) is against a table the user does have access to — no restriction on those columns

8. Query Audit Infrastructure

8.1 Requirement

All queries against cohort builder VirtualTables must be logged. Users warned their activity is monitored.

8.2 Architecture

Follow existing AsyncAccessRecorder → Kinesis Firehose → S3 pattern.

8.3 Audit Record

Field	Type	Description

Field	Type	Description
userId	Long	Synapse user ID
timestamp	Long	Epoch millis
virtualTableId	String	syn ID of the queried VirtualTable
subQueryTableId	String	syn ID of the sub-query source (if applicable)
additionalFilters	String (JSON)	Serialized filter tree
resultCount	Long (nullable)	Total count returned, null if below threshold
accessTier	String	FULL or AGGREGATE_ONLY
responseTimeMs	Long	Query execution time

8.4 Gating

Audit all queries where the source table has DataType = AGGREGATE_DATA. For OPEN_DATA or SENSITIVE_DATA with full access, standard access logging applies (existing infrastructure).

9. Security Considerations

9.1 Layered Defense Model

Layer	Mechanism	Protects Against

Layer	Mechanism	Protects Against
1. Data model