Technical Design Document: Cohort Builder (2.0)

Technical Design Document: Cohort Builder (2.0)

This document describes the backend architecture for Cohort Builder 2.0. It builds on the Cohort Builder - Technical Details - (1.0) design that is already in production. The scope is backend only — UI/UX design is covered separately.

JIRA:

PLFM-9721 - Getting issue details... STATUS

Status: DRAFT

Target: Design complete by Jun 30, 2026 | Implementation target end of Q3 2026

1. Problem Statement and Goals

1.1 What Broke in 1.0

Cohort Builder 1.0 was designed for a one-to-many relationship (~100 files per individual). In practice, certain studies (e.g., LLFM with 5,000+ participants) exhibit a many-to-many relationship where every individual maps to every file. This creates a cartesian explosion in the MATERIAL (MaterializedView) table that breaks Synapse row limits and degrades query performance.

Additionally, 1.0 assumed all users had full access to the PARTICIPANTS table. In 2.0, we must support users who do not have access to individual-level participant data but are permitted to see aggregate counts under governance constraints.

1.2 New Requirements for 2.0

Requirement

Problem

Source

Requirement

Problem

Source

Many:many resolution

File-to-participant mapping tables explode for dense studies

ARQ-6

Aggregate-only access

Unapproved users must be able to query for aggregate counts without seeing individual data

ARQ-2, PRIV-4

Sub-query cohort handoff

In 1.0, IDs are passed between views (scalability issue + privacy issue for unapproved users)

FILS-4

Count suppression

Counts between 1-19 must never be exposed — query must be rejected

PRIV-4, meeting decision

Complex query logic

Current filter API only supports AND between filters; need nested groups with OR/NOT

FILS-7

Column governance

Unrestricted GROUP BY could enable re-identification

PRIV-3, NFR-2

Query audit

No existing audit trail for queries — must detect/deter misuse

NFR-2, meeting decision

Cross-portal extensibility

ELITE and ADKP have different facets and privacy rules

ARQ-3

1.3 Scope

In scope: Backend architecture for aggregate-only access, sub-query cohort handoff, privacy enforcement, enhanced filtering, governance, and audit.

Out of scope: Cohort saving/sharing (v3.0 / Q4), cross-portal cohort querying (v3.0+), tiered access UI (deferred/rejected), UI/UX design.

1.4 Goals

  1. Support many:many studies by mapping individuals to Datasets instead of individual files

  2. Enable authenticated users without AR approval to see only aggregate counts from restricted tables, under ACT-defined constraints

  3. Provide a sub-query mechanism so cohort definitions transfer between views without exposing IDs

  4. Reject queries where cohort count falls below the ACT-configured threshold

  5. Support nested filter groups with AND, OR, and NOT

  6. Mandatory query audit logging with user notification

2. Architecture Overview

2.1 Two User Tiers

The 2.0 architecture supports two distinct user experiences based on their access level to the PARTICIPANTS source table:

User Tier

Access to PARTICIPANTS

Experience

User Tier

Access to PARTICIPANTS

Experience

Approved (met AR)

Full access

1.0 experience: see participant rows, facets, full query capabilities

Unapproved (authenticated, not met AR)

Aggregate count only

2.0 experience: see only total cohort count (>= threshold); use sub-query to browse matching files

2.2 User Journeys

Approved User (unchanged from 1.0)

1. User queries PARTICIPANTS VirtualTable → sees rows + facets 2. Applies filters to define cohort 3. Clicks "View Files" → UI captures cohort IDs (or sub-query) 4. Queries FILES VirtualTable with cohort filter applied 5. Browses files → adds to download cart

Unapproved User (new in 2.0)

1. User queries PARTICIPANTS VirtualTable → sees ONLY total count (N = 847) 2. Applies filters using known filter values (from schema metadata) 3. Count updates with each filter: "N = 203" 4. If count drops below threshold: "Cannot proceed — cohort too small" 5. If count >= threshold: user clicks "View Files" 6. UI submits sub-query filter to FILES VirtualTable: WHERE participantId IN (SELECT individualId FROM syn_participants_vt WHERE [user's filters]) 7. Backend validates: sub-query count >= threshold? Source columns not in outer SELECT? 8. Query executes → user sees file rows → adds to download cart

2.3 Query Pipeline with New Steps

User Query: "SELECT * FROM syn_files_vt WHERE participantId IN (SELECT ...)" [1] Parse query (extended grammar supports constrained sub-query) ← ENHANCED [2] VirtualTableIndexDescription.preprocessQuery() ← EXISTING │ Wraps defining SQL as CTE [3] TableQueryManagerImpl.queryPreflight() ← ENHANCED │ a. Check outer table access (user has READ on FILES VT) │ b. Check sub-query source access: │ - Does user have full access? → allow (1.0 path) │ - No? → Is source DataType == AGGREGATE_DATA? │ → YES: run count check (must be >= threshold) │ → NO: throw UnauthorizedException │ c. Validate: outer query does NOT select from restricted source [4] CombinedQuery: apply additionalFilters + selectedFacets ← ENHANCED (nested groups) [5] QueryTranslator: translate to index SQL ← EXISTING [6] Execute query ← EXISTING [7] Audit logger: record query details to Kinesis ← NEW [8] Return QueryResultBundle ← EXISTING

3. Data Model — Individual-to-Dataset Mapping

3.1 Problem

The 1.0 MATERIAL table maps files to participants via a FILE_TO_PART table. For a study with N participants and M files where every file maps to every participant, the MATERIAL table grows to N × M rows. For LLFM (5,000 participants × thousands of files) this exceeds practical limits.

3.2 Solution: Map to Datasets

For many:many studies, curators map individuals to Datasets (existing Synapse entity type) rather than individual files. A Dataset is a curated collection of file references (max 30K items). This reduces the mapping cardinality dramatically — e.g., 5,000 participants × 10 datasets instead of 5,000 × 50,000 files.

Key Decision: The Dataset entity already exists in Synapse and already integrates with the download cart. No new entity types are needed.

3.3 Source Table Layout (Curator-Managed)

Table

Description

Key Columns

Table

Description

Key Columns

Participants

One row per participant, all metadata

individualId, sex, age, diagnosis, study, ...

Individual-to-File mapping

For one:many studies (unchanged from 1.0)

individualId, fileEntityId

Individual-to-Dataset mapping

For many:many studies (NEW)

individualId, datasetId

3.4 MATERIAL Table (MaterializedView)

The MATERIAL table's defining SQL is updated to support both mapping types:

SELECT p.individualId, p.sex, p.age, p.diagnosis, p.study, COALESCE(fm.fileEntityId, dm.datasetId) AS dataReference, CASE WHEN fm.fileEntityId IS NOT NULL THEN 'FILE' ELSE 'DATASET' END AS referenceType FROM syn_participants p LEFT JOIN syn_file_mapping fm ON p.individualId = fm.individualId LEFT JOIN syn_dataset_mapping dm ON p.individualId = dm.individualId

3.5 Impact on Download Cart

The download list already supports adding Datasets via AddToDownloadListRequest. No backend changes needed — the frontend handles the hybrid file+dataset presentation.

4. Aggregate Access Exception — DataType Model

4.1 Problem

In 1.0, the query preflight calls validateTableReadAccess() which recursively checks access on the VirtualTable's dependencies. If a user hasn't met the AR on the source PARTICIPANTS table, an UnauthorizedException is thrown — no access at all.

In 2.0, ACT needs to say: "This table is restricted, but authenticated users may see aggregate counts from it." We need a mechanism for ACT to create this exception without interfering with the existing AR model.

4.2 Solution: Extend DataType

The existing changeEntityDataType() service is already used by ACT to classify data. Currently supports SENSITIVE_DATA (default) and OPEN_DATA. We extend this with a new value:

DataType

Meaning

Who Can Set

DataType

Meaning

Who Can Set

SENSITIVE_DATA

Default. Full AR enforcement. No exceptions.

Any user with UPDATE

OPEN_DATA

Safe for public. READ = automatic DOWNLOAD.

ACT only

AGGREGATE_DATA

NEW. Authenticated users may access aggregate counts only, subject to threshold. Full row access still requires AR approval.

ACT only

4.3 What AGGREGATE_DATA Means at Runtime

When a VirtualTable's source table has DataType = AGGREGATE_DATA:

  1. Approved users (met the AR): Full access. Unchanged from 1.0.

  2. Unapproved but authenticated users:

    • May query the VirtualTable for COUNT only — no row data, no individual columns

    • The count is only returned if it meets the threshold (>= 20)

    • May use a sub-query referencing this table in a filter on another table (the cohort handoff), subject to the count threshold check

    • May NOT see any columns from the restricted source in query results

  3. Anonymous users: No access. Must be authenticated.

4.4 Threshold Configuration

The count threshold is stored alongside the DataType classification. Extend the DATA_TYPE table (or add a linked table) with:

Column

Type

Description

Column

Type

Description

SUPPRESSION_THRESHOLD

BIGINT

Minimum count required. Default: 20. Only applies when DATA_TYPE = AGGREGATE_DATA.

ACT sets this when classifying the table. The threshold is loaded during preflight and used for count validation.

4.5 Modified Preflight Logic

// Pseudocode for enhanced validateTableReadAccess() void validateTableReadAccess(UserInfo user, IndexDescription index) { String entityId = index.getIdAndVersion().getId().toString(); // Check READ permission (ACL-based) authorizationManager.canAccess(user, entityId, ENTITY, READ).checkAuthorizationOrElseThrow(); // Check DOWNLOAD permission (required for table content access) AuthorizationStatus downloadStatus = authorizationManager.canAccess(user, entityId, ENTITY, DOWNLOAD); if (!downloadStatus.isAuthorized()) { // User does NOT have full access. Check for aggregate exception. DataType dataType = dataTypeDao.getDataType(entityId); if (DataType.AGGREGATE_DATA.equals(dataType) && !user.isAnonymous()) { // Mark this dependency as "aggregate-only" in the query context. // Actual enforcement (count check, column restriction) happens later. queryContext.markAggregateOnly(entityId, getThreshold(entityId)); return; // Allow query to proceed with constraints } // No exception applies — block access downloadStatus.checkAuthorizationOrElseThrow(); } // Recurse into dependencies for (IndexDescription dependency : index.getDependencies()) { validateTableReadAccess(user, dependency); } }

4.6 Enforcement During Query Execution

When a table is marked as "aggregate-only" in the query context, the following constraints are enforced:

  1. Direct query: Only SELECT COUNT(*) is permitted. Any other SELECT clause is rejected.

  2. Count result: If count < threshold, return an error indicating the cohort is too small (do NOT return the actual count).

  3. Sub-query usage: The table may appear as the source of an IN sub-query in a filter on another table (see Section 5). The sub-query count is validated before the outer query executes.

  4. Column restriction: No columns from the restricted source may appear in the outer query's SELECT clause.

5. Sub-Query Cohort Handoff

5.1 Problem

In 1.0, the UI captures participant IDs from the PARTICIPANTS view and passes them as an IN clause to the FILES view. This has two problems:

  • Scalability: Thousands of IDs in a URL/request body

  • Privacy: Unapproved users must never see individual participant IDs

5.2 Solution: Sub-Query Filter

Instead of passing IDs, the UI passes the cohort definition as a sub-query. The backend resolves the IDs internally without exposing them to the client.

New Schema: ColumnSubQueryFilter

{ "description": "A filter that restricts rows to those where a column's value appears in the results of a sub-query against another table. The backend executes the sub-query internally — the caller never sees the resulting IDs.", "implements": [ {"$ref": "org.sagebionetworks.repo.model.table.QueryFilter"} ], "properties": { "concreteType": { "type": "string" }, "columnName": { "type": "string", "description": "Column on the queried table to match against sub-query results (LHS of IN)." }, "subQuery": { "$ref": "org.sagebionetworks.repo.model.table.SubQueryFilter", "description": "Defines the sub-query whose results form the RHS of the IN clause." } } }

New Schema: SubQueryFilter

{ "description": "Defines a constrained sub-query: SELECT single_column FROM table [WHERE conditions].", "properties": { "tableId": { "type": "string", "description": "The syn ID of the table/VirtualTable to sub-query." }, "selectColumn": { "type": "string", "description": "The single column whose distinct values form the IN list." }, "additionalFilters": { "type": "array", "items": {"$ref": "org.sagebionetworks.repo.model.table.QueryFilter"}, "description": "Filters defining the cohort (same filter tree the user built)." } } }

5.3 Example: Cohort to Files Handoff

User defined cohort "age_bin = 65+ AND sex = female" on PARTICIPANTS VT (syn456). Now querying FILES VT (syn789):

Object model (what the UI submits):

{ "query": { "sql": "SELECT * FROM syn789", "additionalFilters": [ { "concreteType": "...ColumnSubQueryFilter", "columnName": "participantId", "isDefiningCondition": true, "subQuery": { "tableId": "syn456", "selectColumn": "individualId", "additionalFilters": [ {"concreteType": "...ColumnSingleValueQueryFilter", "columnName": "age_bin", "operator": "EQUAL", "values": ["65+"]}, {"concreteType": "...ColumnSingleValueQueryFilter", "columnName": "sex", "operator": "EQUAL", "values": ["female"]} ] } } ] } }

Equivalent SQL (for clients that write SQL directly):

SELECT * FROM syn789 WHERE participantId IN (SELECT individualId FROM syn456 WHERE age_bin = '65+' AND sex = 'female')

5.4 Backend Validation (Pre-flight)

When a ColumnSubQueryFilter (or an IN sub-query in SQL) is encountered during preflight:

  1. Resolve sub-query source — load the VirtualTable referenced by tableId

  2. Check user access to the sub-query source:

    • User has full access (met AR)? → No threshold check needed. Execute normally.

    • User lacks full access but source is AGGREGATE_DATA? → Proceed to count check.

    • Neither? → UnauthorizedException

  3. Count check: Execute SELECT COUNT(DISTINCT selectColumn) FROM tableId WHERE [filters]

    • Count >= threshold? → Proceed.

    • Count < threshold? → Reject with error: "Cohort size is below the minimum threshold. Adjust your filters to include more participants."

  4. Column restriction: Verify the outer query's SELECT does not include any columns that originate from the restricted sub-query source.

  5. Execute the full query with the sub-query embedded in the translated SQL.

5.5 Constrained Sub-Query Grammar (JavaCC Extension)

The SQL parser (table-query-parser.jj) will be extended to support sub-queries, but only in the constrained form:

subQuery := SELECT single_column FROM table_reference [WHERE search_condition]

Allowed:

  • Single column in SELECT (no expressions, no *)

  • Single FROM table reference (no JOINs, no UNION)

  • Optional WHERE with full boolean expression grammar

Disallowed (parser rejects at parse time):

  • Multiple SELECT columns

  • GROUP BY

  • ORDER BY

  • LIMIT / OFFSET

  • Nested sub-queries (sub-query within a sub-query)

  • Aggregate functions (COUNT, SUM, etc.) in SELECT

  • JOIN, UNION

The sub-query may appear only on the right-hand side of an IN predicate:

column_name IN ( subQuery )

5.6 Translation

In SQLTranslatorUtils, the sub-query is translated by:

  1. Resolving tableId to its VirtualTable defining SQL

  2. Wrapping the VirtualTable as a CTE (same mechanism as the outer query)

  3. Applying row-level filters (benefactor) to the sub-query source

  4. Applying the user's WHERE filters

  5. Producing: column IN (WITH syn456 AS (...) SELECT individualId FROM syn456 WHERE ...)

6. Enhanced Query Filtering — Nested Filter Groups

6.1 Current State

The existing additionalFilters mechanism supports leaf predicates joined with AND only. The SQL parser supports full SQL-92 boolean expressions (nested AND/OR/NOT with parentheses), but the programmatic API cannot express this.

6.2 Proposed: Recursive Filter Tree (Mirrors SQL-92 Grammar)

A FilterGroup represents a parenthesized group of conditions. It maps directly to the parser's AST:

API Model

SQL-92 AST Node

SQL Output

API Model

SQL-92 AST Node

SQL Output

FilterGroup(operator=OR, children=[...])

SearchCondition

(child1 OR child2 OR ...)

FilterGroup(operator=AND, children=[...])

BooleanTerm

(child1 AND child2 AND ...)

FilterGroup(not=true, ...)

BooleanFactor(not=true)

NOT (...)

Leaf filter (e.g. ColumnSingleValueQueryFilter)

Predicate

column op value

Nested FilterGroup inside a parent

BooleanPrimary(SearchCondition)

( nested_expression )

Schema: FilterGroup

{ "description": "A group of filter conditions combined with a boolean operator. Groups can be nested to form arbitrary boolean expression trees.", "implements": [{"$ref": "org.sagebionetworks.repo.model.table.QueryFilter"}], "properties": { "operator": { "name": "BooleanOperator", "type": "string", "enum": [ {"name": "AND", "description": "All children must match."}, {"name": "OR", "description": "At least one child must match."} ] }, "not": { "type": "boolean", "description": "When true, negates this entire group. Default: false." }, "children": { "type": "array", "items": {"$ref": "org.sagebionetworks.repo.model.table.QueryFilter"}, "description": "Child filters — leaf predicates or nested FilterGroups." } } }

6.3 Example: Complex Query

SQL: WHERE ((diagnosis LIKE '%Alzheimer%' AND age > 65) OR (diagnosis LIKE '%dementia%')) AND NOT (study = 'excluded') AND sex = 'female'

Maps to a FilterGroup tree: root AND group with three children — an OR group (containing an AND group and a leaf), a negated AND group, and a leaf.

6.4 Additional Operators on Leaf Filters

Extend ColumnSingleValueFilterOperator:

Operator

SQL

Status

Operator

SQL

Status

LIKE, EQUAL, IN

Existing

Existing

GREATER_THAN

>

NEW

LESS_THAN

<

NEW

GREATER_THAN_OR_EQUAL

>=

NEW

LESS_THAN_OR_EQUAL

<=

NEW

NOT_EQUAL

<>

NEW

IS_NULL

IS NULL

NEW

IS_NOT_NULL

IS NOT NULL

NEW

BETWEEN

BETWEEN v1 AND v2

NEW

6.5 Depth/Size Limits

  • Maximum nesting depth: 5 levels

  • Maximum total leaf predicates: 50

  • Maximum children per group: 25

7. Column Governance / Whitelisting

7.1 Threat Model

Unrestricted GROUP BY or filter combinations could enable re-identification even with count thresholds.

7.2 Defense Layers

  1. VirtualTable defining SQL controls exposure: Only columns curators include are queryable.

  2. Facet types control filtering: Only columns with non-null facetType are filterable.

  3. Aggregation is fixed: GROUP BY is set at design time in the defining SQL — users cannot alter it.

  4. restrictToStructuredQueries: Disable raw SQL for cohort builder VirtualTables (force structured filter API only).

  5. Binning: Curators bin quasi-identifiers (age → age ranges, 90+ censoring) in the defining SQL.

7.3 Enforcement for Aggregate-Only Access

For unapproved users querying an AGGREGATE_DATA source:

  • Direct queries: Only SELECT COUNT(*) is allowed — no other columns

  • Sub-query: Only the selectColumn (join key) is retrieved internally — user never sees it

  • The outer query (FILES) is against a table the user does have access to — no restriction on those columns

8. Query Audit Infrastructure

8.1 Requirement

All queries against cohort builder VirtualTables must be logged. Users warned their activity is monitored.

8.2 Architecture

Follow existing AsyncAccessRecorder → Kinesis Firehose → S3 pattern.

8.3 Audit Record

Field

Type

Description

Field

Type

Description

userId

Long

Synapse user ID

timestamp

Long

Epoch millis

virtualTableId

String

syn ID of the queried VirtualTable

subQueryTableId

String

syn ID of the sub-query source (if applicable)

additionalFilters

String (JSON)

Serialized filter tree

resultCount

Long (nullable)

Total count returned, null if below threshold

accessTier

String

FULL or AGGREGATE_ONLY

responseTimeMs

Long

Query execution time

8.4 Gating

Audit all queries where the source table has DataType = AGGREGATE_DATA. For OPEN_DATA or SENSITIVE_DATA with full access, standard access logging applies (existing infrastructure).

9. Security Considerations

9.1 Layered Defense Model

Layer

Mechanism

Protects Against

Layer

Mechanism

Protects Against

1. Data model