Problem Statement & Objectives

Context

The Synapse Workers application is a core, high-throughput platform component currently deployed as a monolithic Java .war file on Apache Tomcat instances via AWS Elastic Beanstalk. To align with organizational infrastructure standards, reduce operational overhead, and achieve granular scalability, there is an immediate business push to migrate this workload to AWS ECS Fargate.

Observed Failure Mode

Migration attempts to ECS Fargate produced severe runtime instability, which has been tackled in two stages: Stage 1 is resolved and Stage 2 remains open.

Stage 1: Memory Pressure & Task Churn (Resolved)

Initial deployment (prod-587, 2 vCPU × 8 containers, Java 11/Spring 5) configured the JVM to use 85% of task memory. Workers hit the 8GB memory ceiling, causing OOM kills, ECS task replacements, HTTP 499 health check failures, and cascading task churn. Connection spikes correlated with replacement events (new containers opening fresh pools while old containers drain).

Resolution: Reduced JVM heap to 50% of task memory. Task churn resolved completely.

Stage 2: Supervisor Thread Starvation (Unsolved — This Document's Focus)

After resolving memory pressure, change messages were replayed to trigger a full secondary index rebuild (tables, OpenSearch, object snapshots). With stable containers (no churn), the following was observed:

| Metric | EC2 (Elastic Beanstalk) | ECS Fargate |
|---|---|---|
| "Error on progressMade" events (failed SQS/semaphore renewals) | 64 | 7,835 |
| Infrastructure | 8 instances × 2 vCPU | 8 containers × 2 vCPU |
| Total cluster vCPU | 16 | 16 |
| Application version | Identical | Identical |

Stack traces confirm: supervisor threads failed to call changeMessageVisibility before receipt expiry:

AmazonSQSException: Value [receipt] for parameter ReceiptHandle is invalid.
Reason: Message does not exist or is not available for visibility timeout change.
(Service: AmazonSQS; Status Code: 400; Error Code: InvalidParameterValue)

A second test with 6 vCPU × 6 containers reduced renewal failures from thousands to hundreds. More CPU let supervisors run more often, but it did not eliminate the problem: the failure count stayed well above the EC2 baseline of 64, despite more than double the total vCPU (36 versus 16).

The unsolved problem: same application, same total vCPU (16), same workload, yet 64 supervisor failures on EC2 versus 7,835 on ECS Fargate. The only variable that changed is the execution environment, and with it how threads get scheduled.

This document analyzes the root cause of this behavior and outlines a Java 21 Virtual Threads solution.

Current Worker Framework Architecture

The framework acts as an application-level distributed scheduler. It orchestrates 120+ unique worker types across a cluster of nodes using two primary tiers of concurrency control: Global Cluster Limits (Database Semaphores) and Node-Level Limits (maxThreadsPerMachine).

Core Architectural Lifecycle

For every worker type, a dedicated Quartz scheduler trigger acts as the Supervisor Thread. The lifecycle proceeds as follows:

sequenceDiagram
    autonumber
    participant QZ as Quartz Trigger (Supervisor)
    participant DB as DB Global Semaphore
    participant SQS as AWS SQS Queue
    participant TP as Dedicated Thread Pool

    rect rgb(230, 245, 255)
    note right of QZ: 1. Global Concurrency Check
    QZ->+DB: attemptToGetLock(type, semaphoreMaxLockCount)
    alt Lock Denied
        DB-->QZ: Return False
        note right of QZ: Exit loop, wait for next Quartz trigger
    else Lock Granted
        DB-->>-QZ: Return True
    end
    end

    rect rgb(240, 255, 240)
    note right of QZ: 2. Main Supervisor Loop (while true)
    loop Active Execution
        alt Worker Type is SQS-Driven
            QZ->+SQS: Standard HTTP GET (Batch Size = maxThreadsPerMachine)
            SQS-->>-QZ: Return Messages
        end
        loop For Each Message / Timer Event
            QZ->>TP: Submit Worker Callable (Bounded by maxThreadsPerMachine)
        end

        rect rgb(255, 240, 240)
        note right of QZ: 3. Health & Lease Monitoring
        loop Every lockTimeout/3 seconds
            note right of QZ: If running time > 2/3 of Timeout:
            QZ->>SQS: Renew Message Visibility
            QZ->>DB: Renew Semaphore Lock Leasetime
            note right of QZ: If Worker finishes:
            QZ->>SQS: Delete Message
        end
        end
    end
    end
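The timing rule in step 3 can be written down as a small policy object. This is a minimal sketch, not the framework's code: the class and method names are hypothetical, the 30-second timeout is an assumed value, and "running time" is read here as time elapsed since the lease was last renewed.

```java
import java.time.Duration;

/** Sketch of the lease-renewal timing rule from the diagram (hypothetical names). */
final class LeaseRenewalPolicy {
    private final Duration lockTimeout;

    LeaseRenewalPolicy(Duration lockTimeout) {
        this.lockTimeout = lockTimeout;
    }

    /** The supervisor checks its workers every lockTimeout / 3. */
    Duration checkInterval() {
        return lockTimeout.dividedBy(3);
    }

    /** Renew once the elapsed lease time exceeds 2/3 of the timeout. */
    boolean shouldRenew(Duration elapsed) {
        return elapsed.compareTo(renewalThreshold()) > 0;
    }

    /** Upper bound on the time left to finish the renewal call before expiry. */
    Duration renewalSlack() {
        return lockTimeout.minus(renewalThreshold());
    }

    private Duration renewalThreshold() {
        return lockTimeout.multipliedBy(2).dividedBy(3);
    }

    public static void main(String[] args) {
        // 30 s is an assumed timeout, chosen only to make the arithmetic easy to follow.
        LeaseRenewalPolicy policy = new LeaseRenewalPolicy(Duration.ofSeconds(30));
        System.out.println("check every " + policy.checkInterval().toSeconds() + " s");
        System.out.println("renew at 21 s? " + policy.shouldRenew(Duration.ofSeconds(21)));
        System.out.println("slack " + policy.renewalSlack().toSeconds() + " s");
    }
}
```

Under this reading the supervisor has at most one third of the timeout to get scheduled and complete the renewal call, so any scheduling delay longer than that forfeits the lease.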

I/O Profile

The worker application is heavily I/O-bound: workers spend the vast majority of their lifecycles communicating off-box.

Infrastructure Execution Models: EC2 vs. ECS Fargate

The failure on Fargate is an infrastructure mechanics problem. The underlying operating system handles multi-threading fundamentally differently in these two environments — even at identical vCPU allocations.

The EC2 Environment: Native Hardware Slicing

On our Elastic Beanstalk cluster, we utilize dedicated EC2 instances with 2 vCPUs each (8 instances, 16 total vCPU).

The ECS Fargate Environment: The cgroup Bandwidth Tax

On ECS Fargate, applications run inside Firecracker MicroVMs constrained by Linux Control Groups (cgroups). Even at 2 vCPU (our test configuration), the scheduling model is fundamentally different:

[ Fargate Task Container — 2 vCPU ]
├── 120 Supervisor Threads ──┐
└── 120 I/O-Intensive Workers ──┼─> Single cgroup ──> [ cpu.cfs_quota_us = 200ms/100ms ]
                                                       │
                                                       ├── Context-switch overhead for 240+ threads
                                                       ├── I/O wait → threads stay "runnable"
                                                       └── Quota consumed by scheduling overhead
                                                           → Supervisors delayed past lease timeout

Why 6 vCPU helped but didn't fix it: More vCPU increases the quota (600ms per 100ms), reducing the probability of supervisors being delayed past the renewal deadline — but does not eliminate it. The fundamental problem is that platform thread scheduling under cgroup enforcement treats all threads equally, regardless of their criticality or CPU-time needs.
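The quota figures above follow from simple arithmetic. The sketch below assumes the default 100 ms CFS period and uses hypothetical names. The "even share" figure is a deliberate worst-case simplification (all 240 threads runnable in the same period), not a description of how the kernel actually divides time.

```java
/** Back-of-the-envelope CFS quota arithmetic (assumes the default 100 ms period). */
final class CfsQuotaMath {
    static final long PERIOD_MICROS = 100_000; // default cpu.cfs_period_us

    /** CPU time the whole cgroup may consume per period: vCPU count x period. */
    static long quotaMicros(int vcpus) {
        return vcpus * PERIOD_MICROS;
    }

    /** Share per thread if every thread were runnable within the same period. */
    static long evenShareMicros(int vcpus, int threads) {
        return quotaMicros(vcpus) / threads;
    }

    public static void main(String[] args) {
        int threads = 240; // 120 supervisors + 120 workers, as in the diagram above
        for (int vcpus : new int[] {2, 6}) {
            System.out.printf("%d vCPU: quota=%d us per period, even share=%d us per thread%n",
                    vcpus, quotaMicros(vcpus), evenShareMicros(vcpus, threads));
        }
    }
}
```

Tripling the vCPU count triples the per-period budget but leaves the thread count, and therefore the contention pattern, unchanged.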

The Solution: Java 21 Virtual Threads

By leveraging our recent upgrade to Java 21 and Spring 6.2, we can fix this scheduling problem entirely inside the Java Virtual Machine using Project Loom Virtual Threads.

How Virtual Threads Bypass Throttling

With Virtual Threads, the 240+ supervisor and worker threads are no longer mapped 1:1 to heavy Linux OS threads. Instead, they become lightweight Virtual Threads (vthreads) managed in memory by the JVM. The JVM mounts these vthreads onto a small pool of Platform Carrier Threads whose size matches the vCPU count (e.g., 2 carrier threads for a 2 vCPU Fargate task).
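A small experiment illustrates the mounting behaviour described above. It is a sketch with hypothetical names, using Thread.sleep as a stand-in for off-box I/O. It requires Java 21.

```java
import java.time.Duration;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.atomic.AtomicInteger;

/** Runs many blocking tasks on virtual threads (sleep stands in for I/O). */
final class VirtualThreadBlockingDemo {

    /** Returns how many of the tasks completed on a virtual thread. */
    static int runBlockingTasks(int tasks, Duration blockFor) {
        AtomicInteger ranOnVirtual = new AtomicInteger();
        // close() at the end of the try block waits for all submitted tasks.
        try (ExecutorService executor = Executors.newVirtualThreadPerTaskExecutor()) {
            for (int i = 0; i < tasks; i++) {
                executor.submit(() -> {
                    try {
                        Thread.sleep(blockFor); // parks the virtual thread and frees its carrier
                    } catch (InterruptedException e) {
                        Thread.currentThread().interrupt();
                        return;
                    }
                    if (Thread.currentThread().isVirtual()) {
                        ranOnVirtual.incrementAndGet();
                    }
                });
            }
        }
        return ranOnVirtual.get();
    }

    public static void main(String[] args) {
        long start = System.nanoTime();
        int done = runBlockingTasks(240, Duration.ofMillis(200));
        long elapsedMs = (System.nanoTime() - start) / 1_000_000;
        System.out.println(done + " tasks finished in about " + elapsedMs + " ms with "
                + Runtime.getRuntime().availableProcessors() + " available processors");
    }
}
```

Because a sleeping virtual thread releases its carrier, the 240 tasks overlap instead of queuing behind the carrier count. The elapsed time printed will vary by machine.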

Exploiting I/O-Intensive Workloads

Virtual threads are designed specifically for I/O-bound architectures:

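Because scheduling is handled cooperatively inside the JVM rather than by the Linux kernel, OS-level context switching among these threads drops sharply. More of the cgroup quota is spent executing code, which lowers the risk of Fargate throttling the container. Supervisors should then get timely execution for lease renewals, provided no virtual thread pins or monopolizes a carrier: the JVM does not time-slice virtual threads, so a CPU-bound task holds its carrier until it blocks or finishes.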

Why Spring Boot's spring.threads.virtual.enabled=true Does NOT Apply

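Spring Boot's virtual thread configuration switches only the executors that Spring Boot itself auto-configures, such as the TaskExecutor and TaskScheduler beans behind @Async and @Scheduled. Our workers application does not get its threads from those beans: the implementation table below shows it creating them directly through Executors.newCachedThreadPool() and Quartz's SimpleThreadPool.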

Therefore, we must manually configure Virtual Threads for each thread-creation point. A Spring Boot migration is a worthwhile modernization initiative but is orthogonal to the Fargate VT fix and is not a prerequisite.

Concrete Implementation Changes

| Component | File | Current | Change |
|---|---|---|---|
| Worker job executor | ConcurrentManagerImpl.java:57 | Executors.newCachedThreadPool() | Executors.newVirtualThreadPerTaskExecutor() |
| Quartz scheduler (workers) | main-scheduler-spb.xml | Quartz SimpleThreadPool (~121 platform threads) | SchedulerFactoryBean.setTaskExecutor() with VT-enabled SimpleAsyncTaskExecutor |
| Quartz scheduler (repo) | repository-scheduler.xml | Quartz SimpleThreadPool (10 platform threads) | Same as above |
| Manager executor bean | ManagerConfiguration.java:346 | Executors.newCachedThreadPool() | Executors.newVirtualThreadPerTaskExecutor() |
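Executors.newVirtualThreadPerTaskExecutor() is unbounded, as newCachedThreadPool() is, so the node-level maxThreadsPerMachine cap has to keep being enforced by the code that submits work. If any call site needs an explicit guard, a Semaphore is one option. The sketch below uses hypothetical names and a sleep in place of real worker I/O; it is not the framework's implementation.

```java
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Semaphore;
import java.util.concurrent.atomic.AtomicInteger;

/** Sketch: enforce a per-node cap on top of an unbounded virtual-thread executor. */
final class BoundedVirtualWorkers {

    /** Runs the given number of simulated jobs and returns the peak number running at once. */
    static int peakConcurrency(int jobs, int maxThreadsPerMachine) {
        Semaphore permits = new Semaphore(maxThreadsPerMachine);
        AtomicInteger running = new AtomicInteger();
        AtomicInteger peak = new AtomicInteger();
        try (ExecutorService executor = Executors.newVirtualThreadPerTaskExecutor()) {
            for (int i = 0; i < jobs; i++) {
                executor.submit(() -> {
                    try {
                        permits.acquire(); // a waiting virtual thread parks here without pinning
                    } catch (InterruptedException e) {
                        Thread.currentThread().interrupt();
                        return;
                    }
                    try {
                        peak.accumulateAndGet(running.incrementAndGet(), Math::max);
                        Thread.sleep(10); // stand-in for the worker's I/O
                    } catch (InterruptedException e) {
                        Thread.currentThread().interrupt();
                    } finally {
                        running.decrementAndGet();
                        permits.release();
                    }
                });
            }
        }
        return peak.get();
    }

    public static void main(String[] args) {
        System.out.println("peak concurrency: " + peakConcurrency(40, 4));
    }
}
```

The permit is acquired inside the task rather than before submit, so the submitting supervisor never blocks on the cap.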

Quartz VT integration (Spring 6.1+ native approach via SchedulerFactoryBean.setTaskExecutor()):

<bean id="virtualThreadTaskExecutor"
      class="org.springframework.core.task.SimpleAsyncTaskExecutor">
    <property name="virtualThreads" value="true"/>
</bean>

<bean id="mainScheduler"
      class="org.springframework.scheduling.quartz.SchedulerFactoryBean">
    <property name="taskExecutor" ref="virtualThreadTaskExecutor"/>
    <property name="triggers" ref="workerTriggersList"/>
</bean>

This overrides Quartz's internal SimpleThreadPool with Spring's Virtual Thread-enabled executor. All 120+ supervisor threads become Virtual Threads mounted on only 2 carrier threads (matching the 2 vCPU allocation).

Architectural Alternative: Spring Cloud AWS SqsListener

An alternative to our bespoke Quartz framework is migrating to Spring Cloud AWS 3.x messaging using the @SqsListener annotation.

Benefits of the Standard

The Risk: Long-Running Batch Operations

Our current supervisor loop is highly resilient because it is decoupled from worker execution; it monitors and renews leases independently of what the worker is doing.

Under @SqsListener, if a worker executes a long-running database batch update, it blocks the thread. While a Visibility object can be injected for manual timeout extension, relying on developers to place renewal triggers inside complex business logic is fragile.

Additionally, our ConcurrentWorkerStack supports FIFO queue ordering (receiving one message at a time per group to preserve sequence), which would be lost with @SqsListener.

Therefore, a complete rewrite to @SqsListener is not recommended as an immediate fix.

Architectural Alternative: Functional Worker Tiering

Instead of running all 120 worker types on a single monolithic Fargate container, we can divide the application using Spring Profiles into functional "Tiers" (e.g., profile-tier-high, profile-tier-bulk, profile-tier-scheduled).

We would deploy 4 to 6 separate Fargate Services, each hosting only a subset (~20-30) of the worker types.

Do We Still Need Global Database Semaphores?

Yes, we absolutely still need the global database semaphores. Dividing workers into tiers optimizes the internal thread density of an individual container, but does nothing to coordinate concurrency across the cluster. For example, if the SearchIndexWorker is restricted to running on only 4 nodes globally (semaphoreMaxLockCount=4), and we scale our "Bulk Tier" Fargate task out to 12 containers, the database semaphore remains our only mechanism to guarantee that only 4 of those 12 containers are actively indexing at any moment.
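The SearchIndexWorker example can be modelled in a few lines. This is an in-memory stand-in with hypothetical names: the real semaphore lives in the shared database and also tracks lease expiry, which is omitted here.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.concurrent.locks.ReentrantLock;

/** In-memory stand-in for the global database semaphore (lease expiry omitted). */
final class GlobalSemaphoreModel {
    private final Map<String, Integer> held = new HashMap<>();
    private final ReentrantLock guard = new ReentrantLock();

    /** Mirrors attemptToGetLock(type, semaphoreMaxLockCount) from the lifecycle diagram. */
    boolean attemptToGetLock(String workerType, int semaphoreMaxLockCount) {
        guard.lock();
        try {
            int current = held.getOrDefault(workerType, 0);
            if (current >= semaphoreMaxLockCount) {
                return false; // cluster-wide limit reached
            }
            held.put(workerType, current + 1);
            return true;
        } finally {
            guard.unlock();
        }
    }

    /** Counts how many containers obtain the lock when each one tries once. */
    static int grantedLocks(int containers, int semaphoreMaxLockCount) {
        GlobalSemaphoreModel semaphore = new GlobalSemaphoreModel();
        int granted = 0;
        for (int c = 0; c < containers; c++) {
            if (semaphore.attemptToGetLock("SearchIndexWorker", semaphoreMaxLockCount)) {
                granted++;
            }
        }
        return granted;
    }

    public static void main(String[] args) {
        // prints "4 of 12 containers hold the lock"
        System.out.println(grantedLocks(12, 4) + " of 12 containers hold the lock");
    }
}
```

However many containers the tier scales to, the count of active holders stays at semaphoreMaxLockCount, which no per-container thread limit can guarantee.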

Strategic Recommendations & Phased Rollout Plan

Phase 1: Virtual Thread Adoption & Connection Pool Migration

Success Criteria

Validation Plan

Deploy VT-enabled workers to a Fargate test stack (2 vCPU × 8 containers). Replay change messages to trigger full secondary index rebuild. Compare renewal failures, connection metrics, and processing throughput to EC2 baseline.

Failure Criteria & Rollback

Phase 1 is considered unsuccessful if load testing shows recurring jdk.VirtualThreadPinned events above threshold, continued SQS visibility-timeout or DB semaphore lease forfeitures, sustained HikariCP connection wait times above 500ms, or renewed database CPU spikes. If any of these occur, the Fargate deployment stays blocked while the specific issue is investigated. The application can continue running on EC2 since VT changes are safe on multi-vCPU EC2 instances.

Phase 2: Functional Tiering & Isolation (Mid-Term)

Risk Mitigation & Virtual Thread Guardrails

While Virtual Threads solve the cgroup scheduling problem, they introduce two runtime risks specific to the JVM layer: Carrier Thread Pinning and Database Connection Pool Starvation.

Carrier Thread Pinning Mitigation

Virtual Threads yield the carrier thread cooperatively at JVM-managed blocking points (e.g., LockSupport.park()). However, if a thread blocks inside a synchronized block or method, the JVM cannot unmount it — the virtual thread becomes pinned to its carrier thread. On a 2 vCPU Fargate task with only 2 carrier threads, pinning events block other virtual threads from making progress.

Detection

# Add to JVM startup flags — logs a full stack trace for every pinning event
-Djdk.tracePinnedThreads=full

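Alternatively, monitor the JFR event jdk.VirtualThreadPinned during load testing. JFR records this event only above its threshold (20 ms by default), so lower the threshold when looking for short pins. Treat any pinning event that spans I/O (typically a millisecond or more) as a blocking bug; brief pins around in-memory bookkeeping are tolerable, as noted for the SDK v1 client in the checklist below.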
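For load tests it can be convenient to count pinning events in-process instead of parsing logs. The following is a sketch using the JDK's JFR streaming API (RecordingStream, with stop() available from JDK 20). The class and helper names are hypothetical, and the number of events observed depends on the JDK version and timing.

```java
import java.time.Duration;
import java.util.concurrent.atomic.AtomicLong;
import jdk.jfr.consumer.RecordingStream;

/** Sketch: count jdk.VirtualThreadPinned events emitted while a workload runs. */
final class PinnedEventCounter {

    static long countPinnedEvents(Runnable workload) {
        AtomicLong pinned = new AtomicLong();
        try (RecordingStream stream = new RecordingStream()) {
            stream.enable("jdk.VirtualThreadPinned")
                  .withThreshold(Duration.ZERO) // default threshold would hide short pins
                  .withStackTrace();
            stream.onEvent("jdk.VirtualThreadPinned", event -> {
                pinned.incrementAndGet();
                System.out.println("pinned for " + event.getDuration().toNanos() / 1_000 + " us");
            });
            stream.startAsync();
            workload.run();
            stream.stop(); // flushes outstanding events to the handlers before returning
        }
        return pinned.get();
    }

    /** Deliberately blocks inside synchronized on a virtual thread (pins on Java 21). */
    static void pinOnce() {
        Object monitor = new Object();
        Thread vt = Thread.ofVirtual().start(() -> {
            synchronized (monitor) {
                try {
                    Thread.sleep(50);
                } catch (InterruptedException e) {
                    Thread.currentThread().interrupt();
                }
            }
        });
        try {
            vt.join();
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
        }
    }

    public static void main(String[] args) {
        System.out.println("pinned events: " + countPinnedEvents(PinnedEventCounter::pinOnce));
    }
}
```

A count like this could feed the Phase 1 failure criterion directly, with the acceptable threshold chosen by the team.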

Compliance Checklist — Dependencies

| Component | Details | Action |
|---|---|---|
| MySQL Connector/J | Current version: 8.4.0. Versions ≥ 8.2.0 have removed synchronized from NativeSession and NativeProtocol. | Already compliant. No action needed. |
| Connection Pool (DBCP2 2.9.0, CRITICAL) | Apache Commons DBCP2 delegates to Commons Pool 2, which uses synchronized extensively in GenericObjectPool.borrowObject() and returnObject(). Every connection checkout/checkin will pin the carrier thread. | Migrate to HikariCP ≥ 5.1.0, which uses java.util.concurrent locks. Blocking prerequisite. |
| AWS SDK v1 (SQS Client) | The worker hot path uses AmazonSQSClient (SDK v1, version 1.12.768). Internally uses Apache HttpClient 4.x with synchronized in PoolingHttpClientConnectionManager. | Validate pinning duration with -Djdk.tracePinnedThreads=full. If sub-millisecond (likely, since the synchronized section is brief pool bookkeeping), acceptable for Phase 1. Plan SDK v2 migration for later. |

DBCP2 → HikariCP Migration (All 5 Pools)

The codebase has 5 independent BasicDataSource instances that ALL must be migrated:

| Pool | Location | Special Configuration |
|---|---|---|
| dataSourcePool | DatabaseInfrastructureConfiguration.java | Primary; currently MaxTotal=-1 (unbounded) |
| migrationDataSourcePool | DatabaseInfrastructureConfiguration.java | rewriteBatchedStatements=true in JDBC URL |
| idGeneratorDataSourcePool | IdGeneratorConfig.java | Separate database for ID generation |
| tableDatabaseConnectionPool | TableClusterConfig.java | Index database operations |
| gridDatabaseConnectionPool | GridDatabaseConfig.java | CRDT grid; rewriteBatchedStatements=true |

<!-- REMOVE (all 5 pools) -->
<dependency>
    <groupId>org.apache.commons</groupId>
    <artifactId>commons-dbcp2</artifactId>
    <version>2.9.0</version>
</dependency>

<!-- REPLACE WITH -->
<dependency>
    <groupId>com.zaxxer</groupId>
    <artifactId>HikariCP</artifactId>
    <version>5.1.0</version>
</dependency>

Synchronized Block Audit — Codebase Findings

The principle: any synchronized block guarding code that could perform I/O MUST be converted to ReentrantLock, because ReentrantLock.lock() is a Virtual Thread yield point (VT unmounts from carrier), while synchronized pins the carrier.

Critical (I/O under lock):

| Location | Issue | Fix |
|---|---|---|
| SynapseS3ClientImpl:68 | Collections.synchronizedMap for bucket location cache. On cache miss, calls headBucket() (AWS API call) while holding the monitor. | Replace with a ReentrantLock per bucket key. Do not call headBucket() inside ConcurrentHashMap.computeIfAbsent(): the mapping function runs while the map holds an internal monitor, so it would still pin. |
| ConcurrentProgressCallback:25,36,50 | Synchronized methods. progressMade() iterates listeners that make SQS/semaphore renewal calls, which is I/O under lock. | Convert to ReentrantLock. |
| SynchronizedProgressCallback:46,66,82 | Same pattern as above. | Convert to ReentrantLock. |
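For the bucket location cache, a per-key ReentrantLock lets the slow lookup run without any monitor held. The sketch below is one possible shape, not the planned implementation: the class name, the demo helper, and the "us-east-1" value are hypothetical, and the loader must not return null.

```java
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.atomic.AtomicInteger;
import java.util.concurrent.locks.ReentrantLock;
import java.util.function.Function;

/** Sketch: load each key at most once, with no monitor held during the slow call. */
final class PerKeyLockCache<K, V> {
    private final ConcurrentHashMap<K, V> values = new ConcurrentHashMap<>();
    private final ConcurrentHashMap<K, ReentrantLock> locks = new ConcurrentHashMap<>();

    V get(K key, Function<K, V> loader) {
        V cached = values.get(key);
        if (cached != null) {
            return cached;
        }
        // Creating the lock is in-memory work only, so computeIfAbsent is acceptable here.
        ReentrantLock lock = locks.computeIfAbsent(key, k -> new ReentrantLock());
        lock.lock(); // a waiting virtual thread parks here instead of pinning its carrier
        try {
            cached = values.get(key); // re-check: another caller may have loaded it
            if (cached == null) {
                cached = loader.apply(key); // slow call (for example headBucket) goes here
                values.put(key, cached);
            }
            return cached;
        } finally {
            lock.unlock();
        }
    }

    /** Demo helper: many virtual threads request one key; returns the loader call count. */
    static int loadsForSingleKey(int callers) {
        PerKeyLockCache<String, String> cache = new PerKeyLockCache<>();
        AtomicInteger loads = new AtomicInteger();
        try (ExecutorService executor = Executors.newVirtualThreadPerTaskExecutor()) {
            for (int i = 0; i < callers; i++) {
                executor.submit(() -> {
                    cache.get("my-bucket", bucket -> {
                        loads.incrementAndGet();
                        return "us-east-1"; // hypothetical bucket region
                    });
                });
            }
        }
        return loads.get();
    }

    public static void main(String[] args) {
        System.out.println("loader calls: " + loadsForSingleKey(50)); // prints "loader calls: 1"
    }
}
```

Callers for different keys do not block each other, and concurrent callers for the same key wait on the lock while only the first performs the lookup.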

Moderate (high contention, no direct I/O):

| Location | Issue | Fix |
|---|---|---|
| AccessInterceptor:40 | Collections.synchronizedMap for per-request access records. High contention on every HTTP request. | Replace with ConcurrentHashMap. |

Low risk (in-memory microsecond operations — acceptable pinning):

Already correct (no action needed):

Example refactoring pattern:

// BEFORE: Pins carrier thread during I/O
public synchronized byte[] fetchFromS3(String key) {
    return s3Client.getObjectAsBytes(req -> req.bucket(bucket).key(key))
                   .asByteArray();
}

// AFTER: Safely yields carrier thread during I/O
private final ReentrantLock lock = new ReentrantLock();
public byte[] fetchFromS3(String key) {
    lock.lock();
    try {
        return s3Client.getObjectAsBytes(req -> req.bucket(bucket).key(key))
                       .asByteArray();
    } finally {
        lock.unlock();
    }
}

Going-forward rule: New code must use ReentrantLock or java.util.concurrent primitives for any block that might contain I/O.

Database Connection Pool Starvation

Virtual Threads make it trivially easy to have hundreds of concurrent tasks unblock simultaneously, but the database connection pool is still finite. With 240+ virtual threads potentially all requesting a connection at the same moment, the pool must act as a bounded throttle.
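The effect of a finite pool under a burst of virtual threads can be explored with a small model. This is a sketch only: a fair Semaphore stands in for the connection pool, tryAcquire with a timeout stands in for the pool's checkout timeout, and all names and numbers are hypothetical.

```java
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Semaphore;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.atomic.AtomicInteger;

/** Models a fixed-size pool with a checkout timeout; a Semaphore stands in for the pool. */
final class PoolStarvationModel {

    /** Returns {served, timedOut} for the given number of concurrent callers. */
    static int[] run(int callers, int poolSize, long holdMillis, long checkoutTimeoutMillis) {
        Semaphore pool = new Semaphore(poolSize, true);
        AtomicInteger served = new AtomicInteger();
        AtomicInteger timedOut = new AtomicInteger();
        try (ExecutorService executor = Executors.newVirtualThreadPerTaskExecutor()) {
            for (int i = 0; i < callers; i++) {
                executor.submit(() -> {
                    try {
                        if (!pool.tryAcquire(checkoutTimeoutMillis, TimeUnit.MILLISECONDS)) {
                            timedOut.incrementAndGet(); // gave up waiting for a connection
                            return;
                        }
                        try {
                            Thread.sleep(holdMillis); // stand-in for the query
                            served.incrementAndGet();
                        } finally {
                            pool.release();
                        }
                    } catch (InterruptedException e) {
                        Thread.currentThread().interrupt();
                    }
                });
            }
        }
        return new int[] {served.get(), timedOut.get()};
    }

    public static void main(String[] args) {
        int[] result = run(240, 10, 50, 500);
        System.out.println("served=" + result[0] + " timedOut=" + result[1]);
    }
}
```

Varying poolSize and the timeout against the expected query duration gives a rough feel for when callers begin to time out, which is the behaviour the connection wait-time criterion in Phase 1 is meant to catch.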


Key takeaway: The DBCP2 → HikariCP migration is the most critical prerequisite. Without it, every database call pins the carrier thread, which on a 2 vCPU Fargate task (2 carrier threads) effectively serializes all work and recreates the exact starvation condition we're trying to fix — just at the JVM level instead of the cgroup level.