Status:

Problem Statement & Objectives

Context

The Synapse Workers application is a core, high-throughput platform component currently deployed as a monolithic Java .war file on Apache Tomcat instances via AWS Elastic Beanstalk. To align with organizational infrastructure standards, reduce operational overhead, and achieve granular scalability, there is an immediate business push to migrate this workload to AWS ECS Fargate.

Observed Failure Mode

Initial migration attempts to ECS Fargate MicroVMs resulted in severe runtime instability:

This document analyzes the root causes of this behavior and outlines a modern, Java 21-powered path forward.

Current Worker Framework Architecture

The framework acts as an application-level distributed scheduler. It orchestrates 120+ unique worker types across a cluster of nodes using two primary tiers of concurrency control: Global Cluster Limits (Database Semaphores) and Node-Level Limits (maxThreadsPerMachine).

Core Architectural Lifecycle

For every worker type, a dedicated Quartz scheduler trigger acts as the Supervisor Thread. The lifecycle proceeds as follows:


sequenceDiagram
    autonumber
    participant QZ as Quartz Trigger (Supervisor)
    participant DB as DB Global Semaphore
    participant SQS as AWS SQS Queue
    participant TP as Dedicated Thread Pool
    rect rgb(230, 245, 255)
        note right of QZ: 1. Global Concurrency Check
        QZ->+DB: attemptToGetLock(type, semaphoreMaxLockCount)
        alt Lock Denied
            DB-->QZ: Return False
            note right of QZ: Exit loop, wait for next Quartz trigger
        else Lock Granted
            DB-->>-QZ: Return True
        end
    end
    rect rgb(240, 255, 240)
        note right of QZ: 2. Main Supervisor Loop (while true)
        loop Active Execution
            alt Worker Type is SQS-Driven
                QZ->+SQS: Standard HTTP GET (Batch Size = maxThreadsPerMachine)
                SQS-->>-QZ: Return Messages
            end
            loop For Each Message / Timer Event
                QZ->>TP: Submit Worker Callable (Bounded by maxThreadsPerMachine)
            end
            rect rgb(255, 240, 240)
                note right of QZ: 3. Health & Lease Monitoring (Sleeps 1s)
                loop Every 1 Second
                    note right of QZ: If running time > 2/3 of Timeout:
                    QZ->>SQS: Renew Message Visibility
                    QZ->>DB: Renew Semaphore Lock Leasetime
                    note right of QZ: If Worker finishes:
                    QZ->>SQS: Delete Message
                end
            end
        end
    end

I/O Profile

The worker application is heavily I/O-bound. Workers spend the vast majority of their lifecycles communicating off-box:

Infrastructure Execution Models: EC2 vs. ECS Fargate

The failure on Fargate is entirely an infrastructure mechanics problem. The underlying operating system handles multi-threading fundamentally differently in these two environments.

The EC2 Environment: Native Hardware Slicing

On our current Elastic Beanstalk cluster, we utilize dedicated EC2 instances running bare Linux operating systems with 2 dedicated vCPUs.

The ECS Fargate Environment: The cgroup Capping Tax

On AWS ECS Fargate, applications do not run on raw virtual hardware; they run inside isolated Firecracker MicroVMs tightly constrained by Linux Control Groups (cgroups).

[ Fargate Task Container ]
├── 120 Supervisor Threads ──┐
└── 120 I/O-Intensive Workers ──┼─> Assigned to single cgroup ──> [ cpu.cfs_quota_us Exceeded ] ──> KVM FREEZES ENTIRE VM
                                                                                                    (Supervisors Cannot Run)
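The quota arithmetic behind this diagram can be sketched as follows. This is a back-of-the-envelope model, not a measurement of Fargate: it assumes the Linux default CFS period of 100 ms, assumes the task's vCPU allocation maps directly onto cpu.cfs_quota_us, and uses a hypothetical busyCores parameter for the number of cores our threads keep busy at once.

```java
/** Back-of-the-envelope CFS bandwidth arithmetic; a model, not a Fargate measurement. */
final class CfsQuotaSketch {
    /** Linux default value of cpu.cfs_period_us. */
    static final long DEFAULT_PERIOD_MICROS = 100_000;

    /** CPU time the cgroup may consume per period for a given vCPU allocation (assumed mapping). */
    static long quotaMicros(double vcpus, long periodMicros) {
        return Math.round(vcpus * periodMicros);
    }

    /**
     * Wall-clock time per period during which every thread in the cgroup is throttled,
     * assuming busyCores cores (hypothetical parameter) are kept continuously busy.
     */
    static long throttledMicros(long quotaMicros, long periodMicros, int busyCores) {
        long wallMicrosUntilQuotaSpent = quotaMicros / busyCores;
        return Math.max(0, periodMicros - wallMicrosUntilQuotaSpent);
    }

    public static void main(String[] args) {
        long quota = quotaMicros(0.5, DEFAULT_PERIOD_MICROS);
        long throttled = throttledMicros(quota, DEFAULT_PERIOD_MICROS, 2);
        System.out.printf("quota=%dus, throttled=%dus per %dus period%n",
                quota, throttled, DEFAULT_PERIOD_MICROS);
    }
}
```

Under these assumptions a 0.5 vCPU task gets 50 ms of CPU per 100 ms period; if runnable threads keep two cores busy, the budget is gone after 25 ms and nothing in the cgroup, supervisors included, runs for the remaining 75 ms.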

The Solution: Java 21 Virtual Threads

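By taking advantage of our recent upgrade to Java 21 and Spring 6, we can address this scheduling problem inside the Java Virtual Machine using Project Loom Virtual Threads. The spring.threads.virtual.enabled=true property is a Spring Boot setting (available from Spring Boot 3.2) and only switches Boot's auto-configured executors to virtual threads; thread pools we create ourselves, such as the dedicated per-worker pools, have to be moved to virtual-thread executors explicitly.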

How Virtual Threads Bypass Throttling

When Virtual Threads are enabled, the 240+ supervisor and worker threads are no longer mapped 1:1 to heavy Linux OS threads. Instead, they become lightweight Virtual Threads (vthreads) managed in memory by the JVM. The JVM mounts these vthreads onto a small pool of platform carrier threads whose default size equals Runtime.availableProcessors(), which the JVM derives from the container's CPU limit. We expect 1 carrier thread for a 0.25 or 0.5 vCPU Fargate task, but this should be confirmed on a running task. The pool size can be overridden with -Djdk.virtualThreadScheduler.parallelism.
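A minimal sketch of this model, assuming Java 21: 240 simulated workers each block for a while (Thread.sleep stands in for off-box I/O), yet they all run on whatever small carrier pool the JVM creates. The class and method names are illustrative and not part of our framework.

```java
import java.time.Duration;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.atomic.AtomicInteger;

final class VirtualThreadSketch {
    /** Runs `tasks` simulated I/O-bound workers on virtual threads; returns how many finished. */
    static int runBlockingTasks(int tasks, Duration simulatedIo) {
        AtomicInteger completed = new AtomicInteger();
        // try-with-resources calls close(), which waits for all submitted tasks to finish.
        try (ExecutorService executor = Executors.newVirtualThreadPerTaskExecutor()) {
            for (int i = 0; i < tasks; i++) {
                executor.submit(() -> {
                    try {
                        Thread.sleep(simulatedIo); // parks the virtual thread and frees its carrier
                        completed.incrementAndGet();
                    } catch (InterruptedException e) {
                        Thread.currentThread().interrupt();
                    }
                });
            }
        }
        return completed.get();
    }

    public static void main(String[] args) {
        // The default carrier pool size follows this value.
        System.out.println("availableProcessors = " + Runtime.getRuntime().availableProcessors());
        System.out.println("completed = " + runBlockingTasks(240, Duration.ofMillis(200)));
    }
}
```

Running this inside a Fargate task is also a cheap way to check what availableProcessors() actually reports for a given vCPU setting.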

Exploiting I/O-Intensive Workloads

Virtual threads are designed specifically for I/O-bound architectures.

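Because virtual threads are scheduled cooperatively inside the JVM rather than preemptively by the Linux kernel, the number of OS threads competing for the CPU shrinks to the size of the carrier pool, and kernel-level context switching falls accordingly. More of the cgroup quota is then spent executing application code rather than managing threads, which should keep the task below the point at which Fargate throttles the container.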

Architectural Alternative: Spring Cloud AWS SqsListener

An alternative to our bespoke Quartz framework is migrating to the industry-standard Spring Cloud AWS 3.x messaging system using the @SqsListener annotation.

Benefits of the Standard

The Risk: Long-Running Batch Operations

Our current supervisor loop is highly resilient because it is decoupled from worker execution; it monitors and renews leases on a 1-second interval regardless of what the worker is doing.

Under standard @SqsListener behavior, if a worker executes a large, long-running database batch update, it blocks the thread. You can inject a Visibility object to manually extend timeouts, but every renewal call has to be placed by hand inside the worker:

@SqsListener("queue-name")
public void process(String payload, Visibility visibility) {
    // If this batch update takes 2 minutes, we must remember to call
    // visibility.changeTo(seconds) (the Spring Cloud AWS 3.x method for
    // extending the timeout) inside our batch loops.
}

As observed in our past iterations, relying on individual developers to manually place lease renewal triggers inside complex business logic is fragile and prone to oversight. Therefore, a complete rewrite to @SqsListener is not recommended as an immediate fix.
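For contrast, the decoupled renewal that our supervisor loop provides can be sketched as a wrapper that renews the lease on a timer while the worker runs. This is a simplified illustration, not our framework code: LeaseRenewer is a hypothetical stand-in for "renew SQS visibility and semaphore lease", and the interval is arbitrary.

```java
import java.util.concurrent.Callable;
import java.util.concurrent.Executors;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.ScheduledFuture;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.atomic.AtomicInteger;

final class LeaseHeartbeatSketch {
    /** Hypothetical stand-in for extending SQS visibility and the semaphore lease. */
    interface LeaseRenewer {
        void renew();
    }

    /** Runs `work` while a separate scheduler thread renews the lease at a fixed interval. */
    static <T> T runWithHeartbeat(Callable<T> work, LeaseRenewer renewer, long intervalMillis)
            throws Exception {
        ScheduledExecutorService scheduler = Executors.newSingleThreadScheduledExecutor();
        ScheduledFuture<?> heartbeat = scheduler.scheduleAtFixedRate(
                renewer::renew, intervalMillis, intervalMillis, TimeUnit.MILLISECONDS);
        try {
            return work.call(); // business logic never touches the lease
        } finally {
            heartbeat.cancel(false); // stop renewing once the worker is done or has failed
            scheduler.shutdownNow();
        }
    }

    public static void main(String[] args) throws Exception {
        AtomicInteger renewals = new AtomicInteger();
        String result = runWithHeartbeat(() -> {
            Thread.sleep(500); // stand-in for a long batch update
            return "done";
        }, renewals::incrementAndGet, 100);
        System.out.println(result + ", renewals = " + renewals.get());
    }
}
```

The point of the design is that the worker body contains no renewal calls at all, so there is nothing for a developer to forget.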

Architectural Alternative: Functional Worker Tiering

Instead of running all 120 worker types on a single monolithic Fargate container, we can divide the application using Spring Profiles into functional "Tiers" (e.g., profile-tier-high, profile-tier-bulk, profile-tier-scheduled).

We would deploy 4 to 6 separate Fargate Services, each hosting only a subset (~20-30) of the worker types.

Do We Still Need Global Database Semaphores?

Yes, we absolutely still need the global database semaphores. Dividing the workers into tiers optimizes the internal thread density of an individual container, but it does nothing to coordinate concurrency across the cluster. For example, if the SearchIndexWorker is restricted to running on only 4 nodes globally (semaphoreMaxLockCount=4), and we scale our "Bulk Tier" Fargate task out to 12 containers to handle a massive backlog, the database semaphore remains our only mechanism to guarantee that only 4 of those 12 containers are actively indexing data at any one moment.
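The SearchIndexWorker scenario can be simulated with a small sketch. The map below is an in-memory stand-in for the semaphore table, so it is not thread-safe; in the real system the database transaction provides the atomicity. Only attemptToGetLock and semaphoreMaxLockCount come from our framework; every other name is illustrative.

```java
import java.util.HashMap;
import java.util.Map;

final class GlobalSemaphoreSketch {
    // In-memory stand-in for the semaphore table; the real store is the shared database.
    private final Map<String, Integer> heldLocks = new HashMap<>();

    /** Grants a lock only while fewer than maxLockCount holders exist for this worker type. */
    boolean attemptToGetLock(String type, int maxLockCount) {
        int held = heldLocks.getOrDefault(type, 0);
        if (held >= maxLockCount) {
            return false;
        }
        heldLocks.put(type, held + 1);
        return true;
    }

    /** Releases one holder for this worker type, if any. */
    void releaseLock(String type) {
        heldLocks.computeIfPresent(type, (key, held) -> held > 1 ? held - 1 : null);
    }

    /** Simulates `containers` nodes each asking for the lock once; returns how many were granted. */
    static int countGranted(int containers, String type, int maxLockCount) {
        GlobalSemaphoreSketch store = new GlobalSemaphoreSketch();
        int granted = 0;
        for (int i = 0; i < containers; i++) {
            if (store.attemptToGetLock(type, maxLockCount)) {
                granted++;
            }
        }
        return granted;
    }

    public static void main(String[] args) {
        // 12 Bulk Tier containers, semaphoreMaxLockCount = 4
        System.out.println("granted = " + countGranted(12, "SearchIndexWorker", 4));
    }
}
```

However many containers the tier scales out to, the number of granted locks stays capped at semaphoreMaxLockCount, which is exactly the guarantee tiering alone cannot give us.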

Strategic Recommendations & Phased Rollout Plan

To minimize engineering friction and de-risk the infrastructure migration, we recommend a strict, two-phase rollout strategy:

Phase 1: Runtime Optimization (Immediate)

Phase 2: Functional Tiering & Isolation (Mid-Term)

Risk Mitigation & Virtual Thread Guardrails

While Virtual Threads solve the Linux cgroup context-switching tax, they introduce two runtime risks specific to the JVM layer: Carrier Thread Pinning and Database Connection Pool Starvation. This section outlines our mitigation and validation strategies.

Carrier Thread Pinning Mitigation

Virtual Threads yield the carrier thread cooperatively at JVM-managed blocking points (e.g., LockSupport.park()). On Java 21, however, if a virtual thread blocks inside a synchronized block or method, the JVM cannot unmount it: the virtual thread becomes pinned to its carrier thread for the duration of that block, negating the benefit of Virtual Threads for that call. On a 0.5 vCPU Fargate task with only 1 carrier thread, a single pinning event blocks all other virtual threads from making progress until it ends.

Detection First

Before refactoring, enable pinning diagnostics to identify actual violations under load:

# Add to JVM startup flags — logs a full stack trace for every pinning event
-Djdk.tracePinnedThreads=full

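Alternatively, monitor the JFR event jdk.VirtualThreadPinned during load testing. The event is enabled by default with a 20 ms threshold, so the threshold must be lowered in the JFR settings to catch shorter pins. Any pinning event lasting longer than a few microseconds should be treated as a blocking bug.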
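To confirm that the diagnostics are wired up before pointing them at the real application, a deliberately pinning virtual thread can be used as a smoke test. This is a minimal sketch with illustrative names; it blocks inside a synchronized block, which pins the carrier on Java 21 (newer JDKs that include JEP 491 no longer pin here, so the trace would stay silent).

```java
final class PinningReproSketch {
    private static final Object MONITOR = new Object();

    /** Blocks inside a synchronized block on a virtual thread; returns true once it has run. */
    static boolean blockWhilePinned(long millis) throws InterruptedException {
        boolean[] ran = new boolean[1];
        Thread virtualThread = Thread.ofVirtual().name("pin-repro").start(() -> {
            synchronized (MONITOR) {
                try {
                    Thread.sleep(millis); // blocking while holding a monitor pins the carrier on Java 21
                    ran[0] = true;
                } catch (InterruptedException e) {
                    Thread.currentThread().interrupt();
                }
            }
        });
        virtualThread.join(); // join makes the write to ran[0] visible to this thread
        return ran[0];
    }

    public static void main(String[] args) throws InterruptedException {
        // Run with: java -Djdk.tracePinnedThreads=full PinningReproSketch.java
        System.out.println("ran = " + blockWhilePinned(50));
    }
}
```

If the flag is active on a Java 21 runtime, a stack trace pointing at the synchronized block should appear; if nothing is printed, the flag is not reaching the JVM.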

Compliance Checklist

| # | Component | Details | Action |
|---|-----------|---------|--------|
| 1 | JDBC Driver Audit (MySQL 8.4) | MySQL Connector/J must be >= 8.2.0 (or >= 9.0.0 on the newer major line). Earlier 8.x versions pin on every query execution via synchronized in NativeSession and NativeProtocol. | Verify the exact Connector/J version in our dependency tree. If below 8.2.0, upgrade is mandatory before Phase 1 deployment. |
| 2 | Connection Pool Audit (Commons DBCP2 2.9.0, CRITICAL) | Apache Commons DBCP2 delegates to Commons Pool 2, which uses synchronized extensively in GenericObjectPool.borrowObject() and returnObject(). This means every connection checkout and checkin will pin the carrier thread. DBCP2 2.9.0 predates Virtual Thread awareness and has received no VT-compatibility patches. | Migrate to HikariCP >= 5.1.0, which replaced its internal synchronized blocks with java.util.concurrent locks specifically for Virtual Thread compatibility. This is a blocking prerequisite for Phase 1. |

     <!-- REMOVE -->
     <dependency>
         <groupId>org.apache.commons</groupId>
         <artifactId>commons-dbcp2</artifactId>
         <version>2.9.0</version>
     </dependency>

     <!-- REPLACE WITH -->
     <dependency>
         <groupId>com.zaxxer</groupId>
         <artifactId>HikariCP</artifactId>
         <version>5.1.0</version>
     </dependency>

| # | Component | Details | Action |
|---|-----------|---------|--------|
| 1 | AWS SDK Audit | AWS SDK v2's async HTTP client (Netty-based) is cooperative with Virtual Threads. However, any code using the synchronous SDK client (common for S3 getObject, putObject) routes through Apache HttpClient or URL Connection, which can pin. | Audit all synchronous AWS SDK client usages. Where feasible, migrate to the async client. Where not feasible, validate with -Djdk.tracePinnedThreads=full that pinning duration is acceptable (sub-millisecond). |
| 2 | Internal Code Refactoring | Any internal frameworks, caching layers, or custom utilities using synchronized around network or disk I/O must be refactored to use ReentrantLock. | Refactor internal frameworks, caching layers, or custom utilities using synchronized around network or disk I/O to use ReentrantLock. |

// BEFORE: Pins carrier thread during I/O
public synchronized byte[] fetchFromS3(String key) {
    return s3Client.getObjectAsBytes(req -> req.bucket(bucket).key(key))
                   .asByteArray();
}

// AFTER: Safely yields carrier thread during I/O
private final ReentrantLock lock = new ReentrantLock();
public byte[] fetchFromS3(String key) {
    lock.lock();
    try {
        return s3Client.getObjectAsBytes(req -> req.bucket(bucket).key(key))
                       .asByteArray();
    } finally {
        lock.unlock();
    }
}

Database Connection Pool Starvation

Virtual Threads make it trivially easy to have hundreds of concurrent tasks unblock simultaneously, but the database connection pool is still finite. With 240+ virtual threads potentially all requesting a connection at the same moment, we must ensure the pool acts as a bounded throttle.
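The bounded-throttle idea can be illustrated with a plain java.util.concurrent.Semaphore. This is a sketch of the principle only: the permit count stands in for the pool's maximum size, Thread.sleep stands in for a query, and all names are illustrative. In practice the connection pool itself blocks callers at checkout once it is exhausted.

```java
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Semaphore;
import java.util.concurrent.atomic.AtomicInteger;

final class BoundedDbAccessSketch {
    /** Returns the highest number of tasks that were inside the guarded section at once. */
    static int maxConcurrentWithPermits(int tasks, int permits) {
        Semaphore dbPermits = new Semaphore(permits); // stand-in for the pool's maximum size
        AtomicInteger inFlight = new AtomicInteger();
        AtomicInteger maxInFlight = new AtomicInteger();
        // try-with-resources calls close(), which waits for all submitted tasks to finish.
        try (ExecutorService executor = Executors.newVirtualThreadPerTaskExecutor()) {
            for (int i = 0; i < tasks; i++) {
                executor.submit(() -> {
                    try {
                        dbPermits.acquire(); // waiting virtual threads park without pinning
                        try {
                            int now = inFlight.incrementAndGet();
                            maxInFlight.accumulateAndGet(now, Math::max);
                            Thread.sleep(5); // stand-in for a database query
                        } finally {
                            inFlight.decrementAndGet();
                            dbPermits.release();
                        }
                    } catch (InterruptedException e) {
                        Thread.currentThread().interrupt();
                    }
                });
            }
        }
        return maxInFlight.get();
    }

    public static void main(String[] args) {
        System.out.println("max in flight = " + maxConcurrentWithPermits(240, 10));
    }
}
```

With 240 virtual threads and 10 permits, the observed concurrency never exceeds 10; the remaining threads wait cheaply instead of piling onto the database.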


Key takeaway: The DBCP2 → HikariCP migration is the most critical prerequisite uncovered here. Without it, every single database call pins the carrier thread, which on a single-carrier-thread Fargate task (0.5 vCPU) effectively serializes all work and recreates the exact starvation condition we're trying to fix — just at the JVM level instead of the cgroup level.