Problem Statement & Objectives
Context
The Synapse Workers application is a core, high-throughput platform component currently deployed as a monolithic Java .war file on Apache Tomcat instances via AWS Elastic Beanstalk. To align with organizational infrastructure standards, reduce operational overhead, and achieve granular scalability, there is an immediate business push to migrate this workload to AWS ECS Fargate.
Observed Failure Mode
Migration attempts to ECS Fargate resulted in severe runtime instability that was resolved in stages:
Lease Forfeiture: Supervisor threads regularly failed to renew SQS message visibility timeouts and database-backed global semaphore locks.
Cascading Failures: Forfeited locks led to "split-brain" duplicate job execution, causing massive transaction rollbacks and pushing the primary database CPU close to 100% utilization.
Message Backlogs: SQS messages were delayed, redelivered unnecessarily, and stuck in loops.
This document analyzes the root causes of this behavior and outlines a modern, Java 21-powered path forward.
Current Worker Framework Architecture
The framework acts as an application-level distributed scheduler. It orchestrates 120+ unique worker types across a cluster of nodes using two primary tiers of concurrency control: Global Cluster Limits (Database Semaphores) and Node-Level Limits (maxThreadsPerMachine).
Core Architectural Lifecycle
For every worker type, a dedicated Quartz scheduler trigger acts as the Supervisor Thread. The lifecycle proceeds as follows:
| Code Block |
|---|
sequenceDiagram
    autonumber
    participant QZ as Quartz Trigger (Supervisor)
    participant DB as DB Global Semaphore
    participant SQS as AWS SQS Queue
    participant TP as Dedicated Thread Pool
    rect rgb(230, 245, 255)
        note right of QZ: 1. Global Concurrency Check
        QZ->+DB: attemptToGetLock(type, semaphoreMaxLockCount)
        alt Lock Denied
            DB-->QZ: Return False
            note right of QZ: Exit loop, wait for next Quartz trigger
        else Lock Granted
            DB-->>-QZ: Return True
        end
    end
    rect rgb(240, 255, 240)
        note right of QZ: 2. Main Supervisor Loop (while true)
        loop Active Execution
            alt Worker Type is SQS-Driven
                QZ->+SQS: Standard HTTP GET (Batch Size = maxThreadsPerMachine)
                SQS-->>-QZ: Return Messages
            end
            loop For Each Message / Timer Event
                QZ->>TP: Submit Worker Callable (Bounded by maxThreadsPerMachine)
            end
            rect rgb(255, 240, 240)
                note right of QZ: 3. Health & Lease Monitoring (Sleeps 1s)
                loop Every 1 Second
                    note right of QZ: If running time > 2/3 of Timeout:
                    QZ->>SQS: Renew Message Visibility
                    QZ->>DB: Renew Semaphore Lock Leasetime
                    note right of QZ: If Worker finishes:
                    QZ->>SQS: Delete Message
                end
            end
        end
    end |
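The renewal rule in step 3 of the diagram (renew once the running time exceeds 2/3 of the timeout) can be sketched as a small pure function. This is an illustration only: the class name, method name, and the 30-second timeout are assumptions, not taken from the worker framework.

```java
import java.time.Duration;

/** Sketch of the "renew once 2/3 of the timeout has elapsed" rule from the diagram. */
final class LeaseRenewalRule {

    /** Returns true once more than two thirds of the lease timeout has elapsed. */
    static boolean shouldRenew(Duration elapsed, Duration timeout) {
        // elapsed > (2/3) * timeout  is the same as  3 * elapsed > 2 * timeout,
        // which avoids floating-point arithmetic.
        return 3 * elapsed.toMillis() > 2 * timeout.toMillis();
    }

    public static void main(String[] args) {
        Duration timeout = Duration.ofSeconds(30); // hypothetical visibility timeout
        for (long seconds : new long[] {10, 20, 21, 29}) {
            boolean renew = shouldRenew(Duration.ofSeconds(seconds), timeout);
            System.out.println(seconds + "s elapsed, renew: " + renew);
        }
    }
}
```

With this rule the supervisor has at most the final third of the timeout in which to run and complete the renewal call, which is why a delayed supervisor thread translates directly into a forfeited lease.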
I/O Profile
The worker application is heavily I/O-bound. Workers spend the vast majority of their lifecycles communicating off-box:
Performing heavy relational database queries and massive batch updates via JDBC.
Making blocking AWS API requests (SQS, S3, API Gateway Websockets).
Reading and writing large, transient datasets to temporary disk files.
Infrastructure Execution Models: EC2 vs. ECS Fargate
The failure on Fargate is entirely an infrastructure mechanics problem. The underlying operating system handles multi-threading fundamentally differently in these two environments.
The EC2 Environment: Native Hardware Slicing
On our current Elastic Beanstalk cluster, we utilize dedicated EC2 instances running bare Linux operating systems with 2 dedicated vCPUs.
The Linux CFS Scheduler: The operating system utilizes the Completely Fair Scheduler (CFS). The OS manages our 240+ OS-level threads (120 supervisors + 120 active worker slots) holistically.
Supervisor Protection: When supervisor threads put themselves to sleep via Thread.sleep(1000) during their monitoring phase, they surrender their CPU time. When they wake up, the Linux kernel sees that they have consumed little CPU recently and schedules them ahead of the busier worker threads, momentarily pre-empting those workers to give the supervisor the brief slice of execution it needs. As a result, leases are renewed on time in all but rare cases.
The ECS Fargate Environment: The cgroup Capping Tax
On AWS ECS Fargate, applications do not run on raw virtual hardware; they run inside isolated Firecracker MicroVMs tightly constrained by Linux Control Groups (cgroups).
Stage 1: Memory Pressure & Task Churn (Resolved)
Initial deployment (prod-587, 2 vCPU × 8 containers, Java 11/Spring 5) configured the JVM to use 85% of task memory. Workers hit the 8GB memory ceiling, causing OOM kills, ECS task replacements, HTTP 499 health check failures, and cascading task churn. Connection spikes correlated with replacement events (new containers opening fresh pools while old containers drain).
Resolution: Reduced JVM heap to 50% of task memory. Task churn resolved completely.
Stage 2: Supervisor Thread Starvation (Unsolved — This Document's Focus)
After resolving memory pressure, change messages were replayed to trigger a full secondary index rebuild (tables, OpenSearch, object snapshots). With stable containers (no churn), the following was observed:
Metric | EC2 (Elastic Beanstalk) | ECS Fargate |
|---|---|---|
"Error on progressMade" events (failed SQS/semaphore renewals) | 64 | 7,835 |
Infrastructure | 8 instances × 2 vCPU | 8 containers × 2 vCPU |
Total cluster vCPU | 16 | 16 |
Application version | Identical | Identical |
Stack traces confirm: supervisor threads failed to call changeMessageVisibility before receipt expiry:
| Code Block |
|---|
AmazonSQSException: Value [receipt] for parameter ReceiptHandle is invalid.
Reason: Message does not exist or is not available for visibility timeout change.
(Service: AmazonSQS; Status Code: 400; Error Code: InvalidParameterValue) |
A second test with 6 vCPU × 6 containers reduced renewal failures from thousands to hundreds — more CPU helped supervisors run more often — but did not eliminate the problem. It remained orders of magnitude worse than EC2.
The unsolved problem: Same application, same total vCPU (16), same workload. Near-zero supervisor failures under EC2, thousands under ECS Fargate. The only variable is how the runtime schedules threads.
This document analyzes the root cause of this behavior and outlines a Java 21 Virtual Threads solution.
The Greedy Worker Trap: Because our workers are highly active with I/O (waiting on network sockets, writing to files), they keep hundreds of OS-level threads registered as "active" or "waiting to execute" in the cgroup.
The Context-Switching Tax: Managing 240+ heavy OS-level threads on a tiny Fargate allocation (e.g., 0.25 or 0.5 vCPU) forces the kernel to waste massive amounts of its assigned quota just performing CPU context-switches.
CFS Bandwidth Throttling: Fargate enforces hard container limits via cpu.cfs_quota_us. If our task is allocated 0.25 vCPUs, it is allowed exactly 25ms of CPU execution time every 100ms. Because of the sheer thread volume and rapid HTTP short-polling, the application can burn through those 25ms early in the window.
The Freeze: Once the quota is exhausted, the kernel throttles every thread in the container for the rest of the period (up to 75ms). While the container is throttled, our supervisor threads get no execution time at all. By the time the container runs again, the database semaphore or SQS lease window may have lapsed, causing the cluster to drop the lock and trigger duplicate processing.
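The quota arithmetic above can be worked through in a few lines. The sketch below assumes the standard 100ms CFS period stated in the text; the class name, method names, and the `busyCores` parameter (how many cores are burning CPU at once) are illustrative assumptions, not measured values from our Fargate tasks.

```java
/** Worked arithmetic for CFS bandwidth control, using the 0.25 vCPU figures from the text. */
final class CfsQuotaMath {

    /** CPU time in microseconds the cgroup may consume per period for a given vCPU allocation. */
    static long quotaMicros(double vcpu, long periodMicros) {
        return Math.round(vcpu * periodMicros);
    }

    /**
     * Wall-clock microseconds per period during which the cgroup is throttled,
     * assuming `busyCores` cores consume CPU continuously from the start of the period.
     */
    static long throttledMicros(long quotaMicros, long periodMicros, int busyCores) {
        long wallClockToExhaust = quotaMicros / busyCores;
        return Math.max(0, periodMicros - wallClockToExhaust);
    }

    public static void main(String[] args) {
        long period = 100_000;                    // 100ms period
        long quota = quotaMicros(0.25, period);   // 25ms of CPU time per period
        System.out.println("quota (us): " + quota);
        System.out.println("throttled with 1 busy core (us): " + throttledMicros(quota, period, 1));
        System.out.println("throttled with 2 busy cores (us): " + throttledMicros(quota, period, 2));
    }
}
```

With one continuously busy core the task is throttled for 75ms of every 100ms, matching the figure in the text. If two cores are busy at once, the same 25ms of quota is gone after 12.5ms of wall-clock time and the throttled stretch grows to 87.5ms.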
The Solution: Java 21 Virtual Threads
By taking advantage of our recent upgrade to Java 21 and Spring 6, we can fix this scheduling problem entirely inside the Java Virtual Machine using Project Loom Virtual Threads (spring.threads.virtual.enabled=true).
How Virtual Threads Bypass Throttling
When Virtual Threads are enabled, the 240+ supervisor and worker threads are no longer mapped 1:1 to heavy Linux OS threads. Instead, they become lightweight Virtual Threads (vthreads) managed in memory by the JVM. The JVM mounts these vthreads onto a tiny pool of Platform Carrier Threads whose size exactly matches the allocated vCPU count (e.g., exactly 1 carrier thread for a 0.25 or 0.5 vCPU Fargate task).
Exploiting I/O-Intensive Workloads
Virtual threads are designed specifically for I/O-bound architectures.
Every time a worker thread blocks on a JDBC database call or an SQS request, the JVM unmounts that virtual thread from its carrier thread, provided the call is not made while the thread is pinned (see Risk Mitigation & Virtual Thread Guardrails). File I/O is the exception: in Java 21, a blocking file operation holds the carrier, and the scheduler may compensate by temporarily adding a carrier thread.
The freed carrier thread immediately picks up another virtual thread, such as a sleeping supervisor loop that just woke up.
Because ready virtual threads are scheduled inside the JVM rather than by the Linux kernel, the kernel sees only a handful of carrier threads and OS-level context switching drops sharply. More of the cgroup quota is spent executing code rather than managing threads, which reduces the likelihood of Fargate throttling the container.
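A minimal sketch of the behavior described above, assuming Java 21: 240 tasks each block for a short time, and every one of them runs on a virtual thread without needing 240 platform threads. `Thread.sleep` stands in for a blocking socket call; the class and method names are hypothetical.

```java
import java.time.Duration;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.atomic.AtomicInteger;

final class VirtualThreadBlockingDemo {

    /** Runs `tasks` blocking tasks and returns how many of them executed on a virtual thread. */
    static int runBlockingTasks(int tasks, Duration blockFor) {
        AtomicInteger ranOnVirtual = new AtomicInteger();
        // try-with-resources: close() waits for all submitted tasks to finish.
        try (ExecutorService executor = Executors.newVirtualThreadPerTaskExecutor()) {
            for (int i = 0; i < tasks; i++) {
                executor.submit(() -> {
                    try {
                        // Stand-in for a blocking network call; the virtual thread
                        // unmounts from its carrier while it waits.
                        Thread.sleep(blockFor);
                    } catch (InterruptedException e) {
                        Thread.currentThread().interrupt();
                        return;
                    }
                    if (Thread.currentThread().isVirtual()) {
                        ranOnVirtual.incrementAndGet();
                    }
                });
            }
        }
        return ranOnVirtual.get();
    }

    public static void main(String[] args) {
        int completed = runBlockingTasks(240, Duration.ofMillis(50));
        System.out.println("tasks completed on virtual threads: " + completed);
    }
}
```

The sketch does not measure throttling or lease timeliness; it only shows that the thread-per-task programming model is preserved while the number of OS threads stays small.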
Architectural Alternative: Spring Cloud AWS SqsListener
An alternative to our bespoke Quartz framework is migrating to the industry-standard Spring Cloud AWS 3.x messaging system using the @SqsListener annotation.
Benefits of the Standard
Connection Efficiency: It natively utilizes the AWS Java SDK v2 Async Client powered by a non-blocking Netty event loop, eliminating the network connection pool pressures that forced us into short-polling.
Declarative Retries: Built-in support for backoffs, dead-letter queues (DLQs), and framework-managed message deletions.
The Risk: Long-Running Batch Operations
Our current supervisor loop is highly resilient because it is decoupled from worker execution; it monitors and renews leases on a 1-second interval regardless of what the worker is doing.
Under standard @SqsListener behavior, if a worker executes a large, long-running database batch update, it blocks the listener thread. A Visibility object can be injected to extend timeouts manually:
| Code Block | ||||||
|---|---|---|---|---|---|---|
| ||||||
@SqsListener("queue-name")
public void process(String payload, Visibility visibility) {
    // If this batch update takes 2 minutes, we must remember to call
    // visibility.changeTo(seconds) inside our batch loops.
} |
As observed in our past iterations, relying on individual developers to manually place lease renewal triggers inside complex business logic is fragile and prone to oversight. Therefore, a complete rewrite to @SqsListener is not recommended as an immediate fix.
Architectural Alternative: Functional Worker Tiering
Instead of running all 120 worker types on a single monolithic Fargate container, we can divide the application using Spring Profiles into functional "Tiers" (e.g., profile-tier-high, profile-tier-bulk, profile-tier-scheduled).
We would deploy 4 to 6 separate Fargate Services, each hosting only a subset (~20-30) of the worker types.
Do We Still Need Global Database Semaphores?
Yes, we absolutely still need the global database semaphores. Dividing the workers into tiers optimizes the internal thread density of an individual container, but it does nothing to coordinate concurrency across the cluster. For example, if the SearchIndexWorker is restricted to running on only 4 nodes globally (semaphoreMaxLockCount=4), and we scale our "Bulk Tier" Fargate task out to 12 containers to handle a massive backlog, the database semaphore remains our only mechanism to guarantee that only 4 of those 12 containers are actively indexing data at any one moment.
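To make the SearchIndexWorker example concrete, here is an in-memory stand-in for a lease-based counting semaphore. It is not the database-backed implementation: the class, the token format, and the method signatures are assumptions chosen for illustration, and time is passed in explicitly so the behavior is deterministic.

```java
import java.util.HashMap;
import java.util.Map;

/** In-memory stand-in for a lease-based counting semaphore; the real one is database-backed. */
final class LeasedCountingSemaphore {
    private final int maxLockCount;
    private final Map<String, Long> leaseExpiryByToken = new HashMap<>();
    private long nextToken = 0;

    LeasedCountingSemaphore(int maxLockCount) {
        this.maxLockCount = maxLockCount;
    }

    /** Returns a lock token if a slot is free at time nowMs, otherwise null. */
    synchronized String attemptToGetLock(long nowMs, long leaseMs) {
        // Expired leases are forfeited; a slow holder therefore loses its slot.
        leaseExpiryByToken.values().removeIf(expiry -> expiry <= nowMs);
        if (leaseExpiryByToken.size() >= maxLockCount) {
            return null;
        }
        String token = "token-" + (nextToken++);
        leaseExpiryByToken.put(token, nowMs + leaseMs);
        return token;
    }

    /** Extends the lease; returns false if the lease had already expired. */
    synchronized boolean renew(String token, long nowMs, long leaseMs) {
        Long expiry = leaseExpiryByToken.get(token);
        if (expiry == null || expiry <= nowMs) {
            leaseExpiryByToken.remove(token);
            return false;
        }
        leaseExpiryByToken.put(token, nowMs + leaseMs);
        return true;
    }

    public static void main(String[] args) {
        LeasedCountingSemaphore semaphore = new LeasedCountingSemaphore(4);
        int granted = 0;
        for (int container = 0; container < 12; container++) {
            if (semaphore.attemptToGetLock(0, 1_000) != null) {
                granted++;
            }
        }
        System.out.println("containers granted a lock: " + granted);
    }
}
```

Twelve containers competing for a limit of 4 yields exactly 4 grants. The `renew` method also shows the failure mode this document is about: a holder that renews after its expiry gets `false` back, and another container may already have taken the slot.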
Strategic Recommendations & Phased Rollout Plan
To minimize engineering friction and de-risk the infrastructure migration, we recommend a strict, two-phase rollout strategy:
Phase 1: Runtime Optimization (Immediate)
Action: Retain the current monolithic architecture and bespoke Quartz supervisor loops, but enable Java 21 Virtual Threads via application configuration:
Once on Spring Boot:
| Code Block |
|---|
spring.threads.virtual.enabled=true |
If we are not on Spring Boot (our current implementation), we will need to manually configure every thread we create to be virtual and remove all thread pools. Note: Virtual threads do not need to be pooled because they are cheap to create.
Justification: This addresses the root cause of the Fargate failure (cgroup throttling via OS thread thrashing) with minimal application logic changes and no architectural redesign. It preserves our highly reliable lease-renewal loop while dropping the platform thread footprint down to 1 or 2 carrier threads. Phase 1 includes the required Virtual Thread guardrails: JDBC driver verification, DBCP2-to-HikariCP migration, synchronous AWS SDK auditing, and remediation of any internal synchronized blocks around I/O.
Success Criteria: Successful execution of the monolith on a 0.5 vCPU Fargate task under load without triggering lease forfeitures or DB spikes.
Failure Criteria & Rollback: Phase 1 is considered unsuccessful if load testing or production canary deployment shows recurring jdk.VirtualThreadPinned events above the approved threshold, continued SQS visibility-timeout or DB semaphore lease forfeitures, sustained HikariCP connection wait times above 500ms, or renewed database CPU spikes attributable to duplicate processing. If any of these occur, roll back the Fargate deployment to the current EC2-backed Elastic Beanstalk runtime while the blocking issue is remediated, or temporarily increase Fargate vCPU allocation as a short-term containment measure.
Phase 2: Functional Tiering & Isolation (Mid-Term)
Action: Segment the 120 workers into 4 distinct Spring Profiles based on priority and resource profiles (e.g., Fast/UI-driven, Bulk/Heavy, Scheduled Timers). Deploy these as independent ECS Fargate services.
Justification: This provides absolute protection against a single rogue, CPU-bound worker crashing the entire worker ecosystem, and enables independent auto-scaling per tier to optimize AWS hosting costs.
Risk Mitigation & Virtual Thread Guardrails
While Virtual Threads solve the Linux cgroup context-switching tax, they introduce two runtime risks specific to the JVM layer: Carrier Thread Pinning and Database Connection Pool Starvation. This section outlines our mitigation and validation strategies.
Carrier Thread Pinning Mitigation
Virtual Threads yield the carrier thread cooperatively at JVM-managed blocking points (e.g., LockSupport.park()). However, if a thread blocks inside a synchronized block or method, the JVM cannot unmount it — the virtual thread becomes pinned to its carrier thread for the duration of that block, negating the benefit of Virtual Threads for that call. On a 0.5 vCPU Fargate task with only 1 carrier thread, a single pinning event blocks all other virtual threads from making progress.
Detection First
Before refactoring, enable pinning diagnostics to identify actual violations under load:
| Code Block | ||||
|---|---|---|---|---|
| ||||
# Add to JVM startup flags — logs a full stack trace for every pinning event
-Djdk.tracePinnedThreads=full |
Alternatively, monitor the JFR event jdk.VirtualThreadPinned during load testing. Any pinning event lasting longer than a few microseconds should be treated as a blocking bug.
Compliance Checklist
Component | Details | Action |
|---|---|---|
JDBC Driver Audit (MySQL 8.4) | MySQL Connector/J must be >= 8.2.0 (or >= 9.0.0 on the newer major line). Earlier 8.x versions pin on every query execution via synchronized in NativeSession and NativeProtocol. | Verify the exact Connector/J version in our dependency tree. If below 8.2.0, upgrade is mandatory before Phase 1 deployment. |
Connection Pool Audit (Commons DBCP2 2.9.0, CRITICAL) | Apache Commons DBCP2 delegates to Commons Pool 2, which uses synchronized extensively in GenericObjectPool.borrowObject() and returnObject(). This means every connection checkout and checkin will pin the carrier thread. DBCP2 2.9.0 predates Virtual Thread awareness and has received no VT-compatibility patches. | Migrate to HikariCP >= 5.1.0, which replaced its internal synchronized blocks with java.util.concurrent locks specifically for Virtual Thread compatibility. This is a blocking prerequisite for Phase 1. |
| Code Block | ||||
|---|---|---|---|---|
| ||||
<!-- REMOVE -->
<dependency>
    <groupId>org.apache.commons</groupId>
    <artifactId>commons-dbcp2</artifactId>
    <version>2.9.0</version>
</dependency>
<!-- REPLACE WITH -->
<dependency>
    <groupId>com.zaxxer</groupId>
    <artifactId>HikariCP</artifactId>
    <version>5.1.0</version>
</dependency> |
Component | Details | Action |
|---|---|---|
AWS SDK Audit | AWS SDK v2's async HTTP client (Netty-based) is cooperative with Virtual Threads. However, any code using the synchronous SDK client (common for S3 getObject, putObject) routes through Apache HttpClient or URL Connection, which can pin. | Audit all synchronous AWS SDK client usages. Where feasible, migrate to the async client. Where not feasible, validate with -Djdk.tracePinnedThreads=full that pinning duration is acceptable (sub-millisecond). |
Internal Code Refactoring | Any internal frameworks, caching layers, or custom utilities using synchronized around network or disk I/O must be refactored to use ReentrantLock. | Refactor internal frameworks, caching layers, or custom utilities using synchronized around network or disk I/O to use ReentrantLock. |
| Code Block | ||||
|---|---|---|---|---|
| ||||
// BEFORE: Pins carrier thread during I/O
public synchronized byte[] fetchFromS3(String key) { |
                SQS-->>-QZ: Return Messages
            end
            loop For Each Message / Timer Event
                QZ->>TP: Submit Worker Callable (Bounded by maxThreadsPerMachine)
            end
            rect rgb(255, 240, 240)
                note right of QZ: 3. Health & Lease Monitoring
                loop Every lockTimeout/3 seconds
                    note right of QZ: If running time > 2/3 of Timeout:
                    QZ->>SQS: Renew Message Visibility
                    QZ->>DB: Renew Semaphore Lock Leasetime
                    note right of QZ: If Worker finishes:
                    QZ->>SQS: Delete Message
                end
            end
        end
    end |
I/O Profile
The worker application is heavily I/O-bound. Workers spend the vast majority of their lifecycles communicating off-box:
Performing heavy relational database queries and massive batch updates via JDBC.
Making blocking AWS API requests (SQS, S3, API Gateway WebSockets).
Reading and writing large, transient datasets to temporary disk files.
Infrastructure Execution Models: EC2 vs. ECS Fargate
The failure on Fargate is an infrastructure mechanics problem. The underlying operating system handles multi-threading fundamentally differently in these two environments — even at identical vCPU allocations.
The EC2 Environment: Native Hardware Slicing
On our Elastic Beanstalk cluster, we utilize dedicated EC2 instances with 2 vCPUs each (8 instances, 16 total vCPU).
The Linux CFS Scheduler: The OS utilizes the Completely Fair Scheduler (CFS) with access to dedicated physical CPU cores. The kernel manages our ~240 OS-level threads (120 supervisors + 120 active worker slots) holistically with full preemption authority.
Supervisor Protection: When supervisor threads sleep during their monitoring phase (via Thread.sleep() in the exponential backoff loop), they surrender CPU time and accumulate "vruntime credit." When they wake, the kernel schedules them ahead of threads that have been running, pre-empting worker threads if necessary. The supervisor gets the brief slice of execution it needs to renew leases. As a result, renewals are almost always timely (64 failures in the EC2 test run).
The ECS Fargate Environment: The cgroup Bandwidth Tax
On ECS Fargate, applications run inside Firecracker MicroVMs constrained by Linux Control Groups (cgroups). Even at 2 vCPU (our test configuration), the scheduling model is fundamentally different:
| Code Block |
|---|
[ Fargate Task Container — 2 vCPU ]
├── 120 Supervisor Threads ──┐
└── 120 I/O-Intensive Workers ──┼─> Single cgroup ──> [ cpu.cfs_quota_us = 200ms/100ms ]
│
├── Context-switch overhead for 240+ threads
├── I/O wait → threads stay "runnable"
└── Quota consumed by scheduling overhead
→ Supervisors delayed past lease timeout |
The Greedy Worker Trap: Workers performing I/O (waiting on network sockets, writing files) keep hundreds of OS-level threads registered as "runnable" or "waiting to execute" in the cgroup. Unlike EC2 where the kernel has dedicated cores, Fargate's cgroup must time-slice ALL threads through the quota window.
The Context-Switching Tax: Managing 240+ heavy OS-level threads on a Fargate allocation forces the kernel to waste significant portions of its quota performing CPU context-switches rather than executing useful code.
CFS Bandwidth Throttling: Fargate enforces hard limits via cpu.cfs_quota_us. At 2 vCPU, the container gets 200ms of CPU time every 100ms period. With 240+ active threads and rapid I/O polling, the quota is consumed quickly.
The Starvation: Unlike EC2 where the kernel's CFS "vruntime credit" mechanism guarantees sleeping threads get priority when they wake, Fargate's cgroup enforcement does not distinguish between "supervisor needs 1μs to renew a lease" and "worker thread returning from a JDBC call." Supervisors are delayed until the next quota window, and by then the lease has expired.
Why 6 vCPU helped but didn't fix it: More vCPU increases the quota (600ms per 100ms), reducing the probability of supervisors being delayed past the renewal deadline — but does not eliminate it. The fundamental problem is that platform thread scheduling under cgroup enforcement treats all threads equally, regardless of their criticality or CPU-time needs.
The Solution: Java 21 Virtual Threads
By leveraging our recent upgrade to Java 21 and Spring 6.2, we can fix this scheduling problem entirely inside the Java Virtual Machine using Project Loom Virtual Threads.
How Virtual Threads Bypass Throttling
With Virtual Threads, the 240+ supervisor and worker threads are no longer mapped 1:1 to heavy Linux OS threads. Instead, they become lightweight Virtual Threads (vthreads) managed in memory by the JVM. The JVM mounts these vthreads onto a small pool of Platform Carrier Threads whose size matches the vCPU count (e.g., 2 carrier threads for a 2 vCPU Fargate task).
Exploiting I/O-Intensive Workloads
Virtual threads are designed specifically for I/O-bound architectures:
Every time a worker thread blocks on a JDBC call or SQS request, the JVM automatically unmounts that virtual thread from the carrier. Blocking file writes are the exception in Java 21: they hold the carrier, and the scheduler may temporarily add a carrier thread to compensate.
The freed carrier immediately picks up another virtual thread, such as a supervisor loop that just woke from sleep.
When a supervisor calls Thread.sleep(waitTimeMs), the VT unmounts from the carrier, so no platform thread is consumed while it sleeps. When it wakes, the JVM scheduler queues it for the next free carrier; the OS scheduler is involved only in running the carrier threads themselves.
Because virtual threads are scheduled inside the JVM rather than by the Linux kernel, the kernel sees only the carrier threads and OS-level context switching drops sharply. More of the cgroup quota is spent executing code, which reduces the likelihood of Fargate throttling the container and gives supervisors timely execution for lease renewals.
Why Spring Boot's spring.threads.virtual.enabled=true Does NOT Apply
Spring Boot's virtual thread configuration only affects Spring-managed TaskExecutor beans used by @Async and @Scheduled annotations. Our workers application:
Has zero @Async annotations
Uses Quartz's internal thread pool (not Spring's task scheduling)
Uses a custom Executors.newCachedThreadPool() in ConcurrentManagerImpl
Therefore, we must manually configure Virtual Threads for each thread-creation point. A Spring Boot migration is a worthwhile modernization initiative but is orthogonal to the Fargate VT fix and is not a prerequisite.
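One way the cached thread pool could be replaced is sketched below, assuming Java 21. Because virtual threads are not pooled, the node-level limit (maxThreadsPerMachine) has to be enforced separately; here a `Semaphore` does that. `BoundedVirtualExecutor` and `maxObservedConcurrency` are hypothetical names for illustration and do not reflect the actual code in ConcurrentManagerImpl.

```java
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.concurrent.Semaphore;
import java.util.concurrent.atomic.AtomicInteger;

/** Virtual-thread-per-task executor whose concurrency is capped by a semaphore. */
final class BoundedVirtualExecutor implements AutoCloseable {
    private final ExecutorService delegate = Executors.newVirtualThreadPerTaskExecutor();
    private final Semaphore permits;

    BoundedVirtualExecutor(int maxThreadsPerMachine) {
        this.permits = new Semaphore(maxThreadsPerMachine);
    }

    <T> Future<T> submit(Callable<T> worker) {
        Callable<T> guarded = () -> {
            permits.acquire(); // parks the virtual thread until a slot is free
            try {
                return worker.call();
            } finally {
                permits.release();
            }
        };
        return delegate.submit(guarded);
    }

    @Override
    public void close() {
        delegate.close(); // waits for submitted tasks to finish
    }

    /** Demo helper: runs `tasks` short tasks and reports the highest concurrency observed. */
    static int maxObservedConcurrency(int limit, int tasks) {
        AtomicInteger running = new AtomicInteger();
        AtomicInteger maxSeen = new AtomicInteger();
        try (BoundedVirtualExecutor executor = new BoundedVirtualExecutor(limit)) {
            for (int i = 0; i < tasks; i++) {
                Callable<Integer> worker = () -> {
                    int now = running.incrementAndGet();
                    maxSeen.accumulateAndGet(now, Math::max);
                    Thread.sleep(10); // stand-in for blocking worker I/O
                    return running.decrementAndGet();
                };
                executor.submit(worker);
            }
        }
        return maxSeen.get();
    }

    public static void main(String[] args) {
        System.out.println("max concurrent workers: " + maxObservedConcurrency(3, 20));
    }
}
```

Every submitted task gets its own virtual thread immediately, but at most `maxThreadsPerMachine` of them run worker code at any moment; the rest are parked cheaply on the semaphore instead of occupying platform threads.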
Concrete Implementation Changes
Component | File | Current | Change |
|---|---|---|---|
Worker job executor |
|
|
|
Quartz scheduler (workers) |
| Quartz |
|
Quartz scheduler (repo) |
| Quartz | Same as above |
Manager executor bean |
|
|
|
Quartz VT integration (Spring 6.1+ native approach via SchedulerFactoryBean.setTaskExecutor()):
| Code Block | ||
|---|---|---|
| ||
<bean id="virtualThreadTaskExecutor"
class="org.springframework.core.task.SimpleAsyncTaskExecutor">
<property name="virtualThreads" value="true"/>
</bean>
<bean id="mainScheduler"
class="org.springframework.scheduling.quartz.SchedulerFactoryBean">
<property name="taskExecutor" ref="virtualThreadTaskExecutor"/>
<property name="triggers" ref="workerTriggersList"/>
</bean> |
This overrides Quartz's internal SimpleThreadPool with Spring's Virtual Thread-enabled executor. All 120+ supervisor threads become Virtual Threads mounted on only 2 carrier threads (matching the 2 vCPU allocation).
Architectural Alternative: Spring Cloud AWS SqsListener
An alternative to our bespoke Quartz framework is migrating to Spring Cloud AWS 3.x messaging using the @SqsListener annotation.
Benefits of the Standard
Connection Efficiency: Natively utilizes the AWS Java SDK v2 Async Client powered by a non-blocking Netty event loop.
Declarative Retries: Built-in support for backoffs, dead-letter queues (DLQs), and framework-managed message deletions.
The Risk: Long-Running Batch Operations
Our current supervisor loop is highly resilient because it is decoupled from worker execution; it monitors and renews leases independently of what the worker is doing.
Under @SqsListener, if a worker executes a long-running database batch update, it blocks the thread. While a Visibility object can be injected for manual timeout extension, relying on developers to place renewal triggers inside complex business logic is fragile.
Additionally, our ConcurrentWorkerStack supports FIFO queue ordering (receiving one message at a time per group to preserve sequence), which would be lost with @SqsListener.
Therefore, a complete rewrite to @SqsListener is not recommended as an immediate fix.
Architectural Alternative: Functional Worker Tiering
Instead of running all 120 worker types on a single monolithic Fargate container, we can divide the application using Spring Profiles into functional "Tiers" (e.g., profile-tier-high, profile-tier-bulk, profile-tier-scheduled).
We would deploy 4 to 6 separate Fargate Services, each hosting only a subset (~20-30) of the worker types.
Do We Still Need Global Database Semaphores?
Yes, we absolutely still need the global database semaphores. Dividing workers into tiers optimizes the internal thread density of an individual container, but does nothing to coordinate concurrency across the cluster. For example, if the SearchIndexWorker is restricted to running on only 4 nodes globally (semaphoreMaxLockCount=4), and we scale our "Bulk Tier" Fargate task out to 12 containers, the database semaphore remains our only mechanism to guarantee that only 4 of those 12 containers are actively indexing at any moment.
Strategic Recommendations & Phased Rollout Plan
Phase 1: Virtual Thread Adoption & Connection Pool Migration
Action: Retain the current monolithic architecture and Quartz supervisor loops. Manually enable Virtual Threads at each thread-creation point (see implementation table above). Migrate all connection pools from DBCP2 to HikariCP. Convert synchronized blocks to ReentrantLock where they guard I/O.
Justification: This addresses the root cause of the Fargate failure (supervisor thread starvation due to cgroup scheduling) with minimal architectural redesign. It preserves our reliable lease-renewal loop while dropping the platform thread footprint to 2 carrier threads on a 2 vCPU task.
Scope: Both the workers WAR and repository WAR, since they share the same connection pools and synchronized code paths.
Success Criteria
Renewal failure rate ≤ EC2 baseline (~64 per full index rebuild cycle)
No jdk.VirtualThreadPinned events lasting > 1ms
HikariCP connection wait times (HikariPool.Wait) < 500ms
No database CPU spikes from duplicate processing
Stable task count (no OOM-driven or health-check-driven replacements)
Validation Plan
Deploy VT-enabled workers to a Fargate test stack (2 vCPU × 8 containers). Replay change messages to trigger full secondary index rebuild. Compare renewal failures, connection metrics, and processing throughput to EC2 baseline.
Failure Criteria & Rollback
Phase 1 is considered unsuccessful if load testing shows recurring jdk.VirtualThreadPinned events above threshold, continued SQS visibility-timeout or DB semaphore lease forfeitures, sustained HikariCP connection wait times above 500ms, or renewed database CPU spikes. If any of these occur, the Fargate deployment stays blocked while the specific issue is investigated. The application can continue running on EC2 since VT changes are safe on multi-vCPU EC2 instances.
Phase 2: Functional Tiering & Isolation (Mid-Term)
Action: Segment the 120 workers into 4 distinct Spring Profiles based on priority and resource profiles (e.g., Fast/UI-driven, Bulk/Heavy, Scheduled Timers). Deploy these as independent ECS Fargate services.
Justification: Provides absolute protection against a single rogue CPU-bound worker crashing the entire ecosystem, and enables independent auto-scaling per tier to optimize hosting costs.
Risk Mitigation & Virtual Thread Guardrails
While Virtual Threads solve the cgroup scheduling problem, they introduce two runtime risks specific to the JVM layer: Carrier Thread Pinning and Database Connection Pool Starvation.
Carrier Thread Pinning Mitigation
Virtual Threads yield the carrier thread cooperatively at JVM-managed blocking points (e.g., LockSupport.park()). However, if a thread blocks inside a synchronized block or method, the JVM cannot unmount it — the virtual thread becomes pinned to its carrier thread. On a 2 vCPU Fargate task with only 2 carrier threads, pinning events block other virtual threads from making progress.
Detection
| Code Block |
|---|
# Add to JVM startup flags — logs a full stack trace for every pinning event
-Djdk.tracePinnedThreads=full |
Alternatively, monitor the JFR event jdk.VirtualThreadPinned during load testing. Any pinning event lasting longer than a few microseconds should be treated as a blocking bug.
Compliance Checklist — Dependencies
Component | Details | Action |
|---|---|---|
MySQL Connector/J | Current version: 8.4.0. Versions ≥ 8.2.0 have removed | ✅ Already compliant. No action needed. |
Connection Pool (DBCP2 2.9.0 — CRITICAL) | Apache Commons DBCP2 delegates to Commons Pool 2, which uses | Migrate to HikariCP ≥ 5.1.0, which uses |
AWS SDK v1 (SQS Client) | The worker hot path uses | Validate pinning duration with |
DBCP2 → HikariCP Migration (All 5 Pools)
The codebase has 5 independent BasicDataSource instances that ALL must be migrated:
Pool | Location | Special Configuration |
|---|---|---|
|
| Primary; currently |
|
|
|
|
| Separate database for ID generation |
|
| Index database operations |
|
| CRDT grid; |
| Code Block | ||
|---|---|---|
| ||
<!-- REMOVE (all 5 pools) -->
<dependency>
    <groupId>org.apache.commons</groupId>
    <artifactId>commons-dbcp2</artifactId>
    <version>2.9.0</version>
</dependency>
<!-- REPLACE WITH -->
<dependency>
    <groupId>com.zaxxer</groupId>
    <artifactId>HikariCP</artifactId>
    <version>5.1.0</version>
</dependency> |
Synchronized Block Audit — Codebase Findings
The principle: any synchronized block guarding code that could perform I/O MUST be converted to ReentrantLock, because ReentrantLock.lock() is a Virtual Thread yield point (VT unmounts from carrier), while synchronized pins the carrier.
Critical (I/O under lock):
Location | Issue | Fix |
|---|---|---|
|
| Replace with |
| Synchronized methods. | Convert to |
| Same pattern as above. | Convert to |
Moderate (high contention, no direct I/O):
Location | Issue | Fix |
|---|---|---|
|
| Replace with |
Low risk (in-memory microsecond operations — acceptable pinning):
JobTrackerImpl (lines 46, 58, 78): HashMap operations only
MemoryCountingSemaphoreImpl (lines 39, 71, 96, 120): test/dev in-memory semaphore
Already correct (no action needed):
CountingSemaphoreImpl: Uses Spring @Transactional, no synchronized blocks
WebhookMetricsCollector, ThrottleRulesCache, ProjectStorageLimitsManager: Already use ConcurrentHashMap
Example refactoring pattern:
| Code Block | ||
|---|---|---|
| ||
import java.util.concurrent.locks.ReentrantLock;

// BEFORE: Pins carrier thread during I/O
public synchronized byte[] fetchFromS3(String key) {
    return s3Client.getObjectAsBytes(req -> req.bucket(bucket).key(key))
                   .asByteArray();
}

// AFTER: Safely yields carrier thread during I/O
private final ReentrantLock lock = new ReentrantLock();

public byte[] fetchFromS3(String key) {
    lock.lock();
    try {
        return s3Client.getObjectAsBytes(req -> req.bucket(bucket).key(key))
                       .asByteArray();
    } finally {
        // Always release in finally so an exception cannot leak the lock
        lock.unlock();
    }
} |
Going-forward rule: New code must use ReentrantLock or java.util.concurrent primitives for any block that might contain I/O.
Database Connection Pool Starvation
Virtual Threads make it trivially easy to have hundreds of concurrent tasks unblock simultaneously, but the database connection pool is still finite. With 240+ virtual threads potentially all requesting a connection at the same moment, the pool must act as a bounded throttle.
Critical change: The current main pool uses MaxTotal = -1 (unbounded, per PLFM-8344). With Virtual Threads, this becomes dangerous: hundreds of VTs can unblock simultaneously and exhaust MySQL's connection limit. HikariCP's bounded maximumPoolSize becomes the required throttle.
HikariCP maximumPoolSize should be set conservatively (e.g., 20–30 connections) relative to what the MySQL instance can sustain.
The existing DB Global Semaphore bounds active workers per type across the cluster. Combined with a properly sized HikariCP pool, this provides two layers of back-pressure.
Validation: Under load test, confirm that connection wait times (HikariPool.Wait metric) remain below 500ms and that no SQLTransientConnectionException (pool timeout) is thrown.
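The "bounded throttle" idea can be illustrated without a real pool. The sketch below is not HikariCP: it is a plain `Semaphore` standing in for a pool with a fixed maximum size and a checkout timeout, and the class and method names are hypothetical. It shows the back-pressure behavior we want to validate: callers beyond the limit wait, and give up once the wait budget is spent.

```java
import java.util.concurrent.Semaphore;
import java.util.concurrent.TimeUnit;

/** Stand-in for a bounded connection pool: a fixed number of slots and a checkout timeout. */
final class BoundedPoolGate {
    private final Semaphore slots;

    BoundedPoolGate(int maximumPoolSize) {
        // Fair mode hands slots to waiters in arrival order.
        this.slots = new Semaphore(maximumPoolSize, true);
    }

    /** Returns true if a slot was obtained within maxWaitMs, false on timeout or interrupt. */
    boolean tryCheckout(long maxWaitMs) {
        try {
            return slots.tryAcquire(maxWaitMs, TimeUnit.MILLISECONDS);
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            return false;
        }
    }

    /** Returns a slot to the gate. */
    void checkin() {
        slots.release();
    }

    public static void main(String[] args) {
        BoundedPoolGate gate = new BoundedPoolGate(2);
        System.out.println("first checkout: " + gate.tryCheckout(500));
        System.out.println("second checkout: " + gate.tryCheckout(500));
        System.out.println("third checkout (pool exhausted): " + gate.tryCheckout(10));
        gate.checkin();
        System.out.println("checkout after checkin: " + gate.tryCheckout(500));
    }
}
```

In the real system the timeout path corresponds to the pool-timeout exception named in the validation step, so a load test that never hits it indicates the pool size and the semaphore limits are in balance.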
Key takeaway: The DBCP2 → HikariCP migration is the most critical prerequisite. Without it, every database call pins the carrier thread, which on a 2 vCPU Fargate task (2 carrier threads) effectively serializes all work and recreates the exact starvation condition we're trying to fix, just at the JVM level instead of the cgroup level.