Curation Task v2 API Design

The Tasks v1 API design can be found here: Design for Records-Based Metadata Curation.

Problem Statement:

The current implementation of Curation Tasks in Synapse served as an effective MVP for organizing file-based and record-based metadata. However, as the system has scaled to support larger consortia and collaborative teams, three critical areas of friction have emerged that the V2 API intends to resolve:

1. Fragile Session Resolution and Data Integrity

Currently, the responsibility for finding or creating a metadata grid session is delegated entirely to the UI layer. Because tasks are not formally linked to specific session IDs, the UI must "guess" which session to open based on the task definition. This architectural gap leads to:

  • Duplicate Sessions: Multiple users (or even a single user in multiple tabs) can accidentally spawn parallel sessions for the same task.

  • Version Confusion: Users may be directed to the "latest" session by default, which might not be the session currently under review or the one that was recently completed.

  • Work on Stale Data: Curators can inadvertently land in a session that should be closed or read-only, risking data loss or conflicting updates.

2. Visibility and Scaling Issues (The "Curator's Burden")

As users become responsible for metadata curation across dozens of projects, the current project-centric view has become a bottleneck.

  • Lack of Global Task Access: There is currently no unified way for a user to see all tasks assigned to them or their teams across the entire Synapse platform.

  • Cluttered Workspaces: Without a formal status field (e.g., Not Started, In-Progress, Completed), all tasks—including those that have already been finalized—remain visible in the project tab, creating significant visual noise.

  • Ineffective Filtering: Current filtering is limited, making it difficult for users to isolate high-priority tasks or tasks specifically assigned to them within high-volume projects.

3. Ambiguous Collaboration and Verification Workflows

While the system supports Teams, the lack of a formal task lifecycle creates ambiguity regarding ownership and completion.

  • Concurrent Editing Risks: Without a linked session, team members cannot reliably know if a collaborator is currently working on a shared task.

  • Undefined Completion: There is no distinct state to signal that a task has been completed and verified by a data manager.

Objectives for V2

To address these issues, the V2 API will introduce a task status tracker with client-orchestrated session linking, optimistic concurrency control, and comprehensive cross-project filtering for assignees (Users and Teams).

V2 API

This V2 design addresses the original MVP's technical debt through three primary mechanisms:

A. Client-Orchestrated Session Linking (Solving Problem 1)

In V1, the UI "guessed" the session ID, leading to parallel, duplicate sessions.

  • V2 Solution: The task system now tracks a TaskStatus that includes an activeSessionId via polymorphic TaskExecutionDetails. When an assignee starts a task, the client first creates a grid session (using the existing async CreateGridRequest job), then sets the task status to IN_PROGRESS and links the session ID in a single synchronous PUT call guarded by etag-based optimistic concurrency. If two assignees start simultaneously, the second caller's etag check fails with 409 Conflict, and the client re-fetches the current status via GET /curation/task/{taskId}/status to obtain the fresh etag and the linked session.

B. Global Visibility & Filtering (Solving Problem 2)

V1 required users to manually check every project for tasks, creating a "Curator's Burden."

  • V2 Solution: The ListCurationTaskRequest now supports an optional projectId. If omitted, the API aggregates tasks across all projects where the caller has READ access. Combined with stateFilter (e.g., hide COMPLETED tasks), assigneeIds filtering, and the assignedToMe flag (which automatically includes all teams the caller belongs to), assignees get a unified cross-project task view.

C. Formalized Task Lifecycle (Solving Problem 3)

V1 lacked a way to track task progress and signal completion.

  • V2 Solution: V2 introduces a formal TaskState enum (NOT_STARTED, IN_PROGRESS, COMPLETED, CANCELED) and a separate TaskStatus object that tracks state, execution details, and an etag for optimistic concurrency. Managers can filter for in-progress tasks, and the TaskBundle in list responses provides both the task definition and its current status at a glance.

All changes are additive to the existing API, so V2 introduces no breaking changes.

New Task State

We propose extending the curation tasks to support a basic state machine with the following possible states:

| State | Definition | Session Status |
|---|---|---|
| NOT_STARTED | Task is created but no work has begun. | No active session linked. |
| IN_PROGRESS | An assignee has started the task. | Session created by client and linked via TaskStatus. All assignees join this specific session. |
| COMPLETED | Data Manager has verified the results. | The linked session remains unchanged. |
| CANCELED | Data Manager no longer needs this task. A "soft" delete that removes tasks from most views while keeping them for historical purposes. | The existing grid session link is maintained. |

For more details on the possible states see: TaskState (enum).

New Task APIs

| response | endpoint | request | description |
|---|---|---|---|
| TaskStatus | GET /curation/task/{taskId}/status | none | Get the current status of a task. Useful for fetching a fresh etag after a 409 conflict. Requires READ access on the task's project. |
| TaskStatus | PUT /curation/task/{taskId}/status | TaskStatus | Update the state of a task. Requires the current etag for optimistic concurrency control. Returns the updated TaskStatus with a new etag. |

Authorization for GET: READ access on the task's project.

Authorization for PUT: A user can update a task's status if:

  • The user has UPDATE access on the task's project (task managers), OR

  • The user is the assignee of the task (either directly or via team membership).

Client Workflow: Starting a Task

When an assignee clicks "Start" on a task that is in the NOT_STARTED state:

  1. Client creates a grid session using the existing async job (POST /grid/session/async/start with CreateGridRequest) populated from the task's definition.

  2. Client updates the task status to IN_PROGRESS and links the newly created session via PUT /curation/task/{taskId}/status with:

    • state: IN_PROGRESS

    • executionDetails: GridExecutionDetails with activeSessionId set to the new session ID

    • etag: the current task etag

Race condition handling: If two assignees click "Start" simultaneously:

  • Both create separate grid sessions (step 1 succeeds for both).

  • The first caller's PUT succeeds and links their session.

  • The second caller's PUT fails with 409 Conflict (stale etag).

  • The client calls GET /curation/task/{taskId}/status to fetch the current status with a fresh etag, sees it is now IN_PROGRESS with a linked session, and redirects the second user to that session.

  • The orphaned grid session created by the losing racer can be cleaned up by the client.
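The start workflow and its race handling can be sketched as follows. This is a minimal simulation under stated assumptions, not the actual client: FakeTaskService, create_grid_session, and start_task are hypothetical names, the service is an in-memory stand-in for the GET and PUT status endpoints, and the async grid job is reduced to a function that returns a session ID.

```python
import uuid
from dataclasses import dataclass
from typing import Optional, Tuple


class ConflictError(Exception):
    """Stands in for an HTTP 409 Conflict response (stale etag)."""


@dataclass
class TaskStatus:
    task_id: int
    state: str = "NOT_STARTED"
    active_session_id: Optional[str] = None
    etag: str = "etag-0"


class FakeTaskService:
    """Hypothetical in-memory stand-in for GET/PUT /curation/task/{taskId}/status."""

    def __init__(self, task_id: int) -> None:
        self._status = TaskStatus(task_id=task_id)

    def get_status(self) -> TaskStatus:
        s = self._status
        return TaskStatus(s.task_id, s.state, s.active_session_id, s.etag)

    def put_status(self, update: TaskStatus) -> TaskStatus:
        # Optimistic concurrency: reject the write if the caller's etag is stale.
        if update.etag != self._status.etag:
            raise ConflictError("stale etag")
        self._status = TaskStatus(
            update.task_id, update.state, update.active_session_id, str(uuid.uuid4())
        )
        return self.get_status()


def create_grid_session() -> str:
    """Placeholder for the async CreateGridRequest job; returns a new session ID."""
    return f"session-{uuid.uuid4()}"


def start_task(service: FakeTaskService, known: TaskStatus) -> Tuple[str, Optional[str]]:
    """Return (session to open, orphaned session to clean up or None)."""
    new_session = create_grid_session()  # step 1: always succeeds
    try:
        updated = service.put_status(  # step 2: link the session under the known etag
            TaskStatus(known.task_id, "IN_PROGRESS", new_session, known.etag)
        )
        return updated.active_session_id, None
    except ConflictError:
        # Lost the race: re-fetch, join the winner's session, report our orphan.
        current = service.get_status()
        if current.state != "IN_PROGRESS" or current.active_session_id is None:
            raise RuntimeError(f"unexpected state after conflict: {current.state}")
        return current.active_session_id, new_session
```

In this sketch the losing caller gets back both the winner's session ID and its own orphaned session ID, so the cleanup decision stays with the client, as described above.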

State Transition & Authorization Matrix:

| Current State | Target State | Authorized User | Side Effects |
|---|---|---|---|
| NOT_STARTED | IN_PROGRESS | Assignee | None (client creates and links session) |
| IN_PROGRESS | COMPLETED | Manager Only | None |
| ANY | CANCELED | Manager Only | None |
| ANY | NOT_STARTED | Manager Only | None |
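The matrix above can be encoded as a small lookup, shown here as a sketch that reads the table literally. The Role enum and is_transition_allowed are hypothetical names; treating ANY as matching every current state (including the target itself), and treating the "Assignee" row as excluding non-assignee managers, are assumptions that the table does not settle.

```python
from enum import Enum
from typing import List, Optional, Set, Tuple


class TaskState(Enum):
    NOT_STARTED = "NOT_STARTED"
    IN_PROGRESS = "IN_PROGRESS"
    COMPLETED = "COMPLETED"
    CANCELED = "CANCELED"


class Role(Enum):
    ASSIGNEE = "ASSIGNEE"  # assigned directly or via team membership
    MANAGER = "MANAGER"    # has UPDATE access on the task's project


ANY = None  # wildcard for the "ANY" rows of the matrix

# (current state, target state, roles allowed to perform the transition)
RULES: List[Tuple[Optional[TaskState], TaskState, Set[Role]]] = [
    (TaskState.NOT_STARTED, TaskState.IN_PROGRESS, {Role.ASSIGNEE}),
    (TaskState.IN_PROGRESS, TaskState.COMPLETED, {Role.MANAGER}),
    (ANY, TaskState.CANCELED, {Role.MANAGER}),
    (ANY, TaskState.NOT_STARTED, {Role.MANAGER}),
]


def is_transition_allowed(current: TaskState, target: TaskState, role: Role) -> bool:
    """True if some row of the matrix permits this transition for this role."""
    return any(
        (src is ANY or src is current) and dst is target and role in roles
        for src, dst, roles in RULES
    )
```

Keeping the rules as data makes it easy to compare the code against the table row by row when the matrix changes.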

Database Design

The task status columns live on the same CURATION_TASK table and share the single ETAG column with the task definition. Any mutation — whether a task property update or a status transition — bumps the same etag. This ensures that migration (which detects row changes via CRC32(CONCAT(ID, '@', ETAG))) correctly picks up all changes:

`STATE` ENUM('NOT_STARTED','IN_PROGRESS','COMPLETED','CANCELED') NOT NULL DEFAULT 'NOT_STARTED',
`EXECUTION_DETAILS` JSON DEFAULT NULL,
`STATE_UPDATED_BY` BIGINT DEFAULT NULL,
`STATE_UPDATED_ON` TIMESTAMP(3) NULL DEFAULT NULL,
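To illustrate why the status shares the task's ETAG column, the sketch below mimics the migration checksum CRC32(CONCAT(ID, '@', ETAG)) in Python. The function name migration_checksum is hypothetical, and zlib.crc32 is used as a stand-in for the database function; only the "changes when the etag changes" behavior matters here, not exact value agreement with the database.

```python
import zlib


def migration_checksum(row_id: int, etag: str) -> int:
    """Approximate CRC32(CONCAT(ID, '@', ETAG)) for a single row."""
    return zlib.crc32(f"{row_id}@{etag}".encode("utf-8"))


# A status transition bumps the shared etag, so the row checksum changes
# and migration sees the row as modified.
before = migration_checksum(101, "etag-before-transition")
after = migration_checksum(101, "etag-after-transition")
row_changed = before != after

# If the transition only bumped a separate, hypothetical STATE_ETAG column,
# ID and ETAG would be unchanged and the checksum would be identical.
unchanged = migration_checksum(101, "etag-before-transition") == before
```

Because the checksum depends only on ID and ETAG, any column that can change without bumping ETAG is invisible to this style of change detection.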

Key design decisions:

  • Single ETAG column for optimistic concurrency on both task definition updates and status transitions. This avoids the problem where a separate STATE_ETAG would not be detected by the migration system's etag-based change detection, potentially causing data loss during blue-green deployments.

  • MySQL ENUM for the STATE column provides type safety at the database level.

  • JSON column for EXECUTION_DETAILS enables polymorphic execution details per task type (e.g., GridExecutionDetails with activeSessionId, UploadExecutionDetails with fileCount).

  • No FK to GRID_SESSION — the session link is stored inside the JSON execution details, keeping the schema flexible for future task types.

  • DDL defaults (STATE = 'NOT_STARTED') backfill existing tasks automatically. A custom MigratableTableTranslation in the DBO handles migration from pre-v2 stacks.

New Model Objects

TaskState (enum)

{
  "description": "The state of a curation task in its lifecycle.",
  "type": "string",
  "name": "TaskState",
  "enum": [
    { "name": "NOT_STARTED", "description": "The task has been created and assigned but work has not yet started." },
    { "name": "IN_PROGRESS", "description": "The assignee has actively started the task." },
    { "name": "COMPLETED", "description": "The task has been completed and verified." },
    { "name": "CANCELED", "description": "The task has been canceled and is no longer needed." }
  ]
}

TaskStatus (object)

{
  "description": "Tracks the dynamic lifecycle and progress of a CurationTask.",
  "properties": {
    "taskId": { "type": "integer", "description": "The unique identifier of the associated curation task." },
    "state": { "$ref": "org.sagebionetworks.repo.model.curation.TaskState", "description": "The current state of the task in its lifecycle." },
    "executionDetails": { "$ref": "org.sagebionetworks.repo.model.curation.TaskExecutionDetails", "description": "Task-type-specific execution details. Null if no execution details are available." },
    "lastUpdatedBy": { "type": "string", "description": "The principal ID of the user who last updated the status." },
    "lastUpdatedOn": { "type": "string", "format": "date-time", "description": "Timestamp of when the status was last updated." },
    "etag": { "type": "string", "description": "Optimistic concurrency control token for the task. Shared with the task definition: any mutation (task update or status transition) bumps this etag.", "transient": true }
  }
}

TaskExecutionDetails (interface)

{
  "description": "An interface for task-specific execution details. The concrete type determines which task-type-specific properties are available.",
  "type": "interface",
  "properties": {
    "concreteType": { "type": "string", "description": "Indicates which implementation of TaskExecutionDetails this object represents." }
  }
}

GridExecutionDetails

{
  "description": "Execution details for a metadata curation task involving a collaborative grid session.",
  "implements": [
    { "$ref": "org.sagebionetworks.repo.model.curation.TaskExecutionDetails" }
  ],
  "properties": {
    "activeSessionId": { "type": "string", "description": "The unique identifier of the active CRDT grid session linked to this task." }
  }
}

UploadExecutionDetails

{
  "description": "Execution details for a file upload task.",
  "implements": [
    { "$ref": "org.sagebionetworks.repo.model.curation.TaskExecutionDetails" }
  ],
  "properties": {
    "fileCount": { "type": "integer", "description": "The current number of files successfully uploaded for this task." },
    "totalBytesUploaded": { "type": "integer", "description": "The sum of the size of all files uploaded for this task." }
  }
}
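A client reading the EXECUTION_DETAILS JSON needs to pick the concrete type from concreteType. The sketch below shows one way to do that dispatch. It assumes that concreteType carries the fully qualified schema name (inferred from the $ref values above, not confirmed by this design), and parse_execution_details is a hypothetical helper.

```python
import json
from dataclasses import dataclass
from typing import Optional, Union

# Assumption: concreteType holds the fully qualified name used in the $ref values.
PKG = "org.sagebionetworks.repo.model.curation."


@dataclass
class GridExecutionDetails:
    activeSessionId: Optional[str] = None


@dataclass
class UploadExecutionDetails:
    fileCount: Optional[int] = None
    totalBytesUploaded: Optional[int] = None


ExecutionDetails = Union[GridExecutionDetails, UploadExecutionDetails]

_CONCRETE_TYPES = {
    PKG + "GridExecutionDetails": GridExecutionDetails,
    PKG + "UploadExecutionDetails": UploadExecutionDetails,
}


def parse_execution_details(raw: Optional[str]) -> Optional[ExecutionDetails]:
    """Turn the stored JSON into a typed object; None when no details are stored."""
    if raw is None:
        return None
    data = json.loads(raw)
    concrete_type = data.pop("concreteType", None)
    cls = _CONCRETE_TYPES.get(concrete_type)
    if cls is None:
        raise ValueError(f"unknown concreteType: {concrete_type!r}")
    return cls(**data)
```

Adding a future task type in this sketch means adding one dataclass and one entry in the lookup table, which mirrors the flexibility the JSON column is meant to provide.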

TaskBundle

{
  "description": "A bundle containing a CurationTask and its associated TaskStatus.",
  "properties": {
    "task": { "$ref": "org.sagebionetworks.repo.model.curation.CurationTask", "description": "The configuration and metadata of the task." },
    "status": { "$ref": "org.sagebionetworks.repo.model.curation.TaskStatus", "description": "The dynamic lifecycle state, including execution details and concurrency etag." }
  }
}

Extended Model Objects

These are existing model objects that have been extended.

ListCurationTaskRequest

{
  "description": "Request for a single page of CurationTasks with optional filtering.",
  "properties": {
    "projectId": { "type": "string", "description": "Optional. The synId of the project. If omitted, results are aggregated across all projects where the caller has READ access." },
    "assigneeIds": { "type": "array", "items": { "type": "string" }, "description": "Optional. Filter tasks assigned to specific users or teams. Cannot be combined with assignedToMe." },
    "assignedToMe": { "type": "boolean", "description": "Optional. When true, filter to tasks assigned to the caller or any team the caller belongs to. Cannot be combined with assigneeIds." },
    "stateFilter": { "type": "array", "items": { "$ref": "org.sagebionetworks.repo.model.curation.TaskState" }, "description": "Optional. Filter tasks by their current state." },
    "nextPageToken": { "type": "string", "description": "Forward the returned 'nextPageToken' to get the next page of results." }
  }
}
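As an illustration of how a client might assemble this request, the sketch below builds the JSON body and enforces the "cannot be combined" rule before sending. build_list_request is a hypothetical helper, and treating the conflict as applying only when assignedToMe is true is an assumption; the server's exact validation is not specified here.

```python
from typing import Any, Dict, Iterable, Optional

VALID_STATES = {"NOT_STARTED", "IN_PROGRESS", "COMPLETED", "CANCELED"}


def build_list_request(
    project_id: Optional[str] = None,
    assignee_ids: Optional[Iterable[str]] = None,
    assigned_to_me: Optional[bool] = None,
    state_filter: Optional[Iterable[str]] = None,
    next_page_token: Optional[str] = None,
) -> Dict[str, Any]:
    """Build a ListCurationTaskRequest body, omitting every unset optional field."""
    assignees = list(assignee_ids) if assignee_ids is not None else None
    states = list(state_filter) if state_filter is not None else None

    if assignees and assigned_to_me:
        raise ValueError("assigneeIds cannot be combined with assignedToMe")
    unknown = set(states or []) - VALID_STATES
    if unknown:
        raise ValueError(f"unknown task states: {sorted(unknown)}")

    body = {
        "projectId": project_id,  # omitted means: aggregate across readable projects
        "assigneeIds": assignees,
        "assignedToMe": assigned_to_me,
        "stateFilter": states,
        "nextPageToken": next_page_token,
    }
    return {key: value for key, value in body.items() if value is not None}


# "My open tasks across all projects": no projectId, hide COMPLETED and CANCELED.
my_open_tasks = build_list_request(
    assigned_to_me=True, state_filter=["NOT_STARTED", "IN_PROGRESS"]
)
```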

ListCurationTaskResponse

{
  "description": "A single page of CurationTasks.",
  "properties": {
    "page": { "type": "array", "items": { "$ref": "org.sagebionetworks.repo.model.curation.CurationTask" }, "description": "A list of task definitions only. Use 'bundlePage' for task status info." },
    "bundlePage": { "type": "array", "items": { "$ref": "org.sagebionetworks.repo.model.curation.TaskBundle" }, "description": "A list of task bundles containing both the definition and the current status." },
    "nextPageToken": { "type": "string", "description": "Forward this token to get the next page of results." }
  }
}
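Paging through the response can be sketched as a generator that forwards nextPageToken until it is absent. The list endpoint itself is not named in this section, so fetch_page is a hypothetical callable that takes a request body and returns a ListCurationTaskResponse as a dict; iter_task_bundles is likewise an illustrative name.

```python
from typing import Any, Callable, Dict, Iterator


def iter_task_bundles(
    fetch_page: Callable[[Dict[str, Any]], Dict[str, Any]],
    request: Dict[str, Any],
) -> Iterator[Dict[str, Any]]:
    """Yield every TaskBundle across all pages for the given list request."""
    req = dict(request)  # copy so the caller's request is not mutated
    while True:
        response = fetch_page(req)
        # bundlePage carries both the task definition and its current status.
        yield from response.get("bundlePage") or []
        token = response.get("nextPageToken")
        if not token:
            return
        req["nextPageToken"] = token
```

Reading from bundlePage rather than page gives the caller the state and etag alongside each task, which avoids a separate status call per task when rendering a task list.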