A production-grade distributed task queue built with Java 17 / Spring Boot 3 + Redis, featuring ML-assisted dynamic scheduling, at-least-once delivery semantics, idempotent submissions, exponential backoff + DLQ handling, and full observability with Prometheus/Grafana. The platform is designed to stay available under partial failures by falling back from ML-based ordering to FIFO queue consumption.
This project demonstrates a resilient asynchronous processing architecture for backend workloads with mixed job durations. It combines deterministic queue safety guarantees (Lua atomic operations, processing tracking, retries, DLQ) with an optional ML layer that predicts execution duration and improves scheduling order without becoming a single point of failure.
```mermaid
flowchart LR
    P1[REST Producer] --> API[Job API - Spring Boot 3]
    P2[Cron Producer] --> API
    API --> QL[Redis Lists: tq:queue:high/normal/low]
    API --> IDEM[Redis String Keys: idempotency SET NX]
    QL <--> MLR[PriorityReorderer]
    MLR <--> ML["ML Scheduler - FastAPI<br/>GBT duration model"]
    MLR --> ZS[Redis Sorted Sets: tq:sorted:*]
    ZS --> W[Worker Pool - ThreadPoolExecutor]
    QL --> W
    W --> PH[Redis Hash: tq:processing]
    W --> DLQ[Redis List: tq:dlq]
    W --> JS[Redis String: tq:job:*]
    API --> MET[Micrometer Metrics]
    W --> MET
    MET --> PROM[Prometheus]
    PROM --> GRAF[Grafana]
    subgraph K8s
        HPA[Horizontal Pod Autoscaler]
    end
    PROM --> HPA
```
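For orientation, the job document that moves through this pipeline can be modeled as a small record. This is a hypothetical sketch: only `type`, `payload`, `priority`, and `idempotencyKey` appear in the submit example later in this README, and the remaining fields are assumptions.

```java
import java.time.Instant;

// Hypothetical job model; fields beyond the four in the submit example
// (type, payload, priority, idempotencyKey) are assumptions.
public record Job(
        String id,              // assigned by the Job API
        String type,            // e.g. "EMAIL"
        String payload,         // JSON string, stored as-is in Redis
        String priority,        // HIGH | NORMAL | LOW -> queue lane
        String idempotencyKey,  // backs the tq:idem:{key} SET NX lock
        int attempts,           // drives exponential backoff / DLQ cutoff
        Instant submittedAt) {
}
```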
- At-least-once delivery: jobs are moved into a processing hash during dequeue and must be acknowledged after completion.
- Idempotent submission: Redis `SET NX` keys reject duplicate job submissions.
- Atomic queue operations: enqueue/dequeue transitions run in Lua scripts to avoid race conditions (see the sketch after this list).
- Retry safety: exponential backoff with jitter prevents retry storms and synchronized contention.
- Dead-letter isolation: exhausted jobs are preserved in DLQ for replay/inspection.
- ML graceful degradation: queue processing continues in FIFO mode if ML service is unavailable.
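A minimal sketch of the dequeue transition, assuming Spring Data Redis. It is simplified to the FIFO list lane (the real script also checks the sorted set first) and assumes the job JSON carries an `id` field. Popping the job and recording it in `tq:processing` happen in one atomic script, which is why a worker crash cannot lose an in-flight job.

```java
import java.util.List;
import org.springframework.data.redis.core.StringRedisTemplate;
import org.springframework.data.redis.core.script.DefaultRedisScript;

public class AtomicDequeue {

    // Pop a job from the FIFO lane and mark it in-flight in one atomic step.
    private static final String DEQUEUE_LUA = """
            local job = redis.call('LPOP', KEYS[1])
            if job then
              local id = cjson.decode(job)['id']   -- assumes jobJson has an 'id'
              redis.call('HSET', KEYS[2], id, job)
            end
            return job
            """;

    private final StringRedisTemplate redis;
    private final DefaultRedisScript<String> script =
            new DefaultRedisScript<>(DEQUEUE_LUA, String.class);

    public AtomicDequeue(StringRedisTemplate redis) {
        this.redis = redis;
    }

    /** Returns the job JSON, or null if the lane is empty. */
    public String dequeue(String priority) {
        return redis.execute(script,
                List.of("tq:queue:" + priority, "tq:processing"));
    }
}
```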
| Key pattern | Type | Purpose |
|---|---|---|
| `tq:queue:high` | List | Primary FIFO queue lane (high priority) |
| `tq:queue:normal` | List | Primary FIFO queue lane (normal priority) |
| `tq:queue:low` | List | Primary FIFO queue lane (low priority) |
| `tq:sorted:high` | Sorted Set | ML-reordered lane (high priority; lower score dequeued first) |
| `tq:sorted:normal` | Sorted Set | ML-reordered lane (normal priority; lower score dequeued first) |
| `tq:sorted:low` | Sorted Set | ML-reordered lane (low priority; lower score dequeued first) |
| `tq:processing` | Hash | In-flight jobs (jobId -> jobJson) |
| `tq:dlq` | List | Dead-letter jobs |
| `tq:idem:{idempotencyKey}` | String | Deduplication lock via SET NX EX |
| `tq:job:{jobId}` | String | Job status snapshot for API lookups |
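A minimal sketch of the deduplication lock behind `tq:idem:{idempotencyKey}`, assuming Spring Data Redis; the 24-hour TTL is an assumed value, not the project's actual configuration:

```java
import java.time.Duration;
import org.springframework.data.redis.core.StringRedisTemplate;

public class IdempotencyGuard {

    private final StringRedisTemplate redis;

    public IdempotencyGuard(StringRedisTemplate redis) {
        this.redis = redis;
    }

    /** Returns true if this key is new; false means a duplicate submission. */
    public boolean tryAcquire(String idempotencyKey, String jobId) {
        // SET key value NX EX <ttl> -- only the first submitter succeeds
        Boolean ok = redis.opsForValue().setIfAbsent(
                "tq:idem:" + idempotencyKey, jobId, Duration.ofHours(24));
        return Boolean.TRUE.equals(ok);
    }
}
```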
- A FastAPI service predicts job duration using a GradientBoostingRegressor.
- `PriorityReorderer` periodically reads list queues and populates sorted sets scored by predicted duration.
- The dequeue Lua script pops from the sorted set first; if it is empty, it falls back to list FIFO.
- If ML API calls fail, scheduler methods return safe defaults and reordering is skipped (see the client sketch below).
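A sketch of the graceful-degradation pattern on the Java side. The ML service address comes from the quick start below; the `/predict` path and `predicted_duration_ms` response field are assumptions for illustration, not the service's confirmed API.

```java
import java.util.Map;
import org.springframework.web.client.RestClientException;
import org.springframework.web.client.RestTemplate;

public class MlSchedulerClient {

    // Sentinel meaning "no prediction"; callers skip sorted-set population.
    public static final double NO_PREDICTION = -1.0;

    private final RestTemplate http = new RestTemplate();
    private final String baseUrl = "http://localhost:8081"; // ML service from the quick start

    /** Predicted duration in ms, or NO_PREDICTION when the ML service is unreachable. */
    public double predictDurationMs(Map<String, Object> features) {
        try {
            // Endpoint path and response field are assumptions for this sketch.
            Map<?, ?> body = http.postForObject(baseUrl + "/predict", features, Map.class);
            Object value = (body == null) ? null : body.get("predicted_duration_ms");
            return (value instanceof Number n) ? n.doubleValue() : NO_PREDICTION;
        } catch (RestClientException e) {
            return NO_PREDICTION; // degrade to FIFO ordering
        }
    }
}
```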
The queue-core service emits key metrics via Micrometer:
- `taskqueue.jobs.submitted`
- `taskqueue.jobs.completed`
- `taskqueue.jobs.failed`
- `taskqueue.jobs.dead`
- `taskqueue.jobs.duplicate`
- `taskqueue.jobs.retried`
- `taskqueue.queue.depth` (gauge per priority)
- `taskqueue.processing.active` (gauge)
- `taskqueue.job.execution.duration` (timer)
- `taskqueue.job.queue.wait.time` (timer)

Prometheus scrapes `/actuator/prometheus`; Grafana dashboards are provisioned in `infra/grafana`.
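A sketch of how a worker might record two of these meters with Micrometer's standard API; the `priority` tag is an assumption, not a documented label of this project:

```java
import io.micrometer.core.instrument.MeterRegistry;
import io.micrometer.core.instrument.Timer;

public class JobMetrics {

    private final MeterRegistry registry;

    public JobMetrics(MeterRegistry registry) {
        this.registry = registry;
    }

    /** Times the job body and bumps the completion counter. */
    public void runTimed(String priority, Runnable jobBody) {
        Timer timer = Timer.builder("taskqueue.job.execution.duration")
                .tag("priority", priority)   // tag name is an assumption
                .register(registry);
        timer.record(jobBody);
        registry.counter("taskqueue.jobs.completed", "priority", priority)
                .increment();
    }
}
```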
| Layer | Technology | Purpose |
|---|---|---|
| API | Java 17, Spring Boot 3 | REST endpoints, validation, orchestration |
| Queue | Redis 7 (Lists, Sorted Sets, Hashes, Strings) + Lua | Atomic enqueue/dequeue + durability patterns |
| Workers | Java ThreadPoolExecutor | Concurrent execution and lifecycle handling |
| Retry/DLQ | Exponential backoff + jitter | Fault tolerance under transient failures |
| ML Scheduler | Python FastAPI + scikit-learn | Duration prediction and queue reordering |
| Metrics | Micrometer + Prometheus | Runtime counters/timers/gauges |
| Dashboard | Grafana | Operational visualization |
| Infra | Docker Compose, Kubernetes, HPA | Local and scalable deployment |
| Testing | JUnit 5, Mockito, AssertJ | 46 unit tests |
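The Retry/DLQ row above boils down to a delay that doubles per attempt plus a random offset. A minimal sketch using the "full jitter" variant; the base delay and cap are assumed values, and the project's actual policy may differ:

```java
import java.time.Duration;
import java.util.concurrent.ThreadLocalRandom;

public final class Backoff {

    // Assumed parameters; the real values live in queue-core configuration.
    private static final long BASE_DELAY_MS = 1_000;
    private static final long MAX_DELAY_MS = 60_000;

    /** Full jitter: random delay in [0, min(cap, base * 2^attempt)]. */
    public static Duration nextDelay(int attempt) {
        long exp = BASE_DELAY_MS << Math.min(attempt, 16); // 2^attempt, shift capped to avoid overflow
        long cap = Math.min(MAX_DELAY_MS, exp);
        return Duration.ofMillis(ThreadLocalRandom.current().nextLong(cap + 1));
    }
}
```

Randomizing over the whole window, rather than adding a small offset to a fixed delay, is what spreads out retries from many workers and prevents the synchronized contention the guarantees list mentions.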
- Java 17
- Maven 3.9+
- Docker Desktop
- Python 3.11+ (optional for direct ML dev loop)
- k6 (`brew install k6`) for benchmarking
Start the full stack:

```bash
cd infra
docker-compose up --build -d
docker-compose ps
```

Run the unit tests:

```bash
cd queue-core
mvn test
```

Submit a sample job:

```bash
curl -X POST http://localhost:8080/api/v1/jobs \
  -H "Content-Type: application/json" \
  -d '{
    "type": "EMAIL",
    "payload": "{\"to\":\"test@example.com\"}",
    "priority": "HIGH",
    "idempotencyKey": "sample-001"
  }'
```

Check the service endpoints:

```bash
open http://localhost:8080/actuator/health
open http://localhost:8080/actuator/prometheus
open http://localhost:9090   # Prometheus
open http://localhost:3000   # Grafana
open http://localhost:8081/health   # ML scheduler
```

Run the benchmarks:

```bash
cd benchmark/k6
./run_all.sh
```

Or run individually:

```bash
k6 run load_test.js
k6 run chaos_test.js
k6 run ml_benchmark.js
```

The benchmark result template lives in `benchmark/results/README.md`.
| Metric | Value | Source |
|---|---|---|
| Throughput (end-to-end jobs) | 9.28 jobs/s (~557 jobs/min) | load_test_summary.json (jobs_submitted.rate) |
| Submit success rate | 100% | load_test_summary.json (submit_success_rate.rate) |
| Submission latency P95 | 11 ms | load_test_summary.json (submit_duration_ms.p(95)) |
| Submission latency P99 | 21 ms | load_test_summary.json (submit_duration_ms.p(99)) |
| Job completion latency P95 | 2037 ms | load_test_summary.json (job_completion_latency_ms.p(95)) |
| Job completion latency P99 | 2062.94 ms | load_test_summary.json (job_completion_latency_ms.p(99)) |
| Chaos failure rate | 0% | chaos_test_summary.json (http_req_failed.rate) |
| Chaos terminal completion rate | 100% | chaos_test_summary.json (job_terminal_rate.rate) |
| ML vs FIFO avg completion delta | -5.44% (ML faster) | ml_benchmark_summary.json (avg_latency_delta_percent) |
| ML vs FIFO P95 completion delta | -0.58% (ML faster) | ml_benchmark_summary.json (p95_latency_delta_percent) |
Screenshots can be generated after running:

```bash
make run
```

- Open Grafana at http://localhost:3000
- Import/view the Task Queue dashboard
- Save exported images into `docs/images/`

Suggested filenames:

- `docs/images/grafana-dashboard.png`
- `docs/images/queue-depth.png`
- `docs/images/latency-graph.png`
- Job payloads are stored as JSON strings; large payload compression/chunking is not yet implemented.
- The current queue model targets single-Redis deployment patterns; cluster sharding strategy can be expanded.
- Exactly-once semantics are not provided (system guarantees at-least-once with idempotency at submission).
- ML model is trained on synthetic data; production usage should train on real historical execution traces.
- End-to-end integration tests with Redis/Testcontainers and CI performance smoke tests can be expanded.
```
Distributed-Task-Queue-Engine/
├── queue-core/
├── ml-scheduler/
├── infra/
├── benchmark/
├── docs/
└── .github/workflows/
```
MIT