A production-grade distributed task queue built with Java 17 / Spring Boot 3 + Redis, featuring ML-assisted dynamic scheduling, at-least-once delivery semantics, idempotent submissions, exponential backoff + DLQ handling, and full observability with Prometheus/Grafana. The platform is designed to stay available under partial failures by falling back from ML-based ordering to FIFO queue consumption.
This project demonstrates a resilient asynchronous processing architecture for backend workloads with mixed job durations. It combines deterministic queue safety guarantees (Lua atomic operations, processing tracking, retries, DLQ) with an optional ML layer that predicts execution duration and improves scheduling order without becoming a single point of failure.
```mermaid
flowchart LR
    P1[REST Producer] --> API[Job API - Spring Boot 3]
    P2[Cron Producer] --> API
    API --> QL[Redis Lists: tq:queue:high/normal/low]
    API --> IDEM[Redis String Keys: idempotency SET NX]
    QL <--> MLR[PriorityReorderer]
    MLR <--> ML["ML Scheduler - FastAPI<br/>GBT duration model"]
    MLR --> ZS[Redis Sorted Sets: tq:sorted:*]
    ZS --> W[Worker Pool - ThreadPoolExecutor]
    QL --> W
    W --> PH[Redis Hash: tq:processing]
    W --> DLQ[Redis List: tq:dlq]
    W --> JS[Redis String: tq:job:*]
    API --> MET[Micrometer Metrics]
    W --> MET
    MET --> PROM[Prometheus]
    PROM --> GRAF[Grafana]
    subgraph K8s
        HPA[Horizontal Pod Autoscaler]
    end
    PROM --> HPA
```
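For orientation, the job document that moves through this pipeline can be modeled as a small record. This is a hypothetical sketch: only `type`, `payload`, `priority`, and `idempotencyKey` appear in the submit example later in this README, and the remaining fields are assumptions.

```java
import java.time.Instant;

// Hypothetical job model; fields beyond the four in the submit example
// (type, payload, priority, idempotencyKey) are assumptions.
public record Job(
        String id,              // assigned by the Job API
        String type,            // e.g. "EMAIL"
        String payload,         // JSON string, stored as-is in Redis
        String priority,        // HIGH | NORMAL | LOW -> queue lane
        String idempotencyKey,  // backs the tq:idem:{key} SET NX lock
        int attempts,           // drives exponential backoff / DLQ cutoff
        Instant submittedAt) {
}
```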
- At-least-once delivery: jobs are moved into a processing hash during dequeue and must be acknowledged after completion.
- Idempotent submission: Redis `SET NX` keys reject duplicate job submissions.
- Atomic queue operations: enqueue/dequeue transitions run in Lua scripts to avoid race conditions (see the sketch after this list).
- Retry safety: exponential backoff with jitter prevents retry storms and synchronized contention.
- Dead-letter isolation: exhausted jobs are preserved in DLQ for replay/inspection.
- ML graceful degradation: queue processing continues in FIFO mode if ML service is unavailable.
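A minimal sketch of the dequeue transition, assuming Spring Data Redis. It is simplified to the FIFO list lane (the real script also checks the sorted set first) and assumes the job JSON carries an `id` field. Popping the job and recording it in `tq:processing` happen in one atomic script, which is why a worker crash cannot lose an in-flight job.

```java
import java.util.List;
import org.springframework.data.redis.core.StringRedisTemplate;
import org.springframework.data.redis.core.script.DefaultRedisScript;

public class AtomicDequeue {

    // Pop a job from the FIFO lane and mark it in-flight in one atomic step.
    private static final String DEQUEUE_LUA = """
            local job = redis.call('LPOP', KEYS[1])
            if job then
              local id = cjson.decode(job)['id']   -- assumes jobJson has an 'id'
              redis.call('HSET', KEYS[2], id, job)
            end
            return job
            """;

    private final StringRedisTemplate redis;
    private final DefaultRedisScript<String> script =
            new DefaultRedisScript<>(DEQUEUE_LUA, String.class);

    public AtomicDequeue(StringRedisTemplate redis) {
        this.redis = redis;
    }

    /** Returns the job JSON, or null if the lane is empty. */
    public String dequeue(String priority) {
        return redis.execute(script,
                List.of("tq:queue:" + priority, "tq:processing"));
    }
}
```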
| Key pattern | Type | Purpose |
|---|---|---|
| `tq:queue:high` | List | Primary FIFO queue lane (high priority) |
| `tq:queue:normal` | List | Primary FIFO queue lane (normal priority) |
| `tq:queue:low` | List | Primary FIFO queue lane (low priority) |
| `tq:sorted:high` | Sorted Set | ML-reordered lane (high priority; lower score dequeued first) |
| `tq:sorted:normal` | Sorted Set | ML-reordered lane (normal priority; lower score dequeued first) |
| `tq:sorted:low` | Sorted Set | ML-reordered lane (low priority; lower score dequeued first) |
| `tq:processing` | Hash | In-flight jobs (jobId -> jobJson) |
| `tq:dlq` | List | Dead-letter jobs |
| `tq:idem:{idempotencyKey}` | String | Deduplication lock via SET NX EX |
| `tq:job:{jobId}` | String | Job status snapshot for API lookups |
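A minimal sketch of the deduplication lock behind `tq:idem:{idempotencyKey}`, assuming Spring Data Redis; the 24-hour TTL is an assumed value, not the project's actual configuration:

```java
import java.time.Duration;
import org.springframework.data.redis.core.StringRedisTemplate;

public class IdempotencyGuard {

    private final StringRedisTemplate redis;

    public IdempotencyGuard(StringRedisTemplate redis) {
        this.redis = redis;
    }

    /** Returns true if this key is new; false means a duplicate submission. */
    public boolean tryAcquire(String idempotencyKey, String jobId) {
        // SET key value NX EX <ttl> -- only the first submitter succeeds
        Boolean ok = redis.opsForValue().setIfAbsent(
                "tq:idem:" + idempotencyKey, jobId, Duration.ofHours(24));
        return Boolean.TRUE.equals(ok);
    }
}
```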
- A FastAPI service predicts job duration using a GradientBoostingRegressor.
- `PriorityReorderer` periodically reads list queues and populates sorted sets scored by predicted duration.
- The dequeue Lua script pops from the sorted set first; if it is empty, it falls back to list FIFO.
- If ML API calls fail, scheduler methods return safe defaults and reordering is skipped (see the client sketch below).
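A sketch of the graceful-degradation pattern on the Java side. The ML service address comes from the quick start below; the `/predict` path and `predicted_duration_ms` response field are assumptions for illustration, not the service's confirmed API.

```java
import java.util.Map;
import org.springframework.web.client.RestClientException;
import org.springframework.web.client.RestTemplate;

public class MlSchedulerClient {

    // Sentinel meaning "no prediction"; callers skip sorted-set population.
    public static final double NO_PREDICTION = -1.0;

    private final RestTemplate http = new RestTemplate();
    private final String baseUrl = "http://localhost:8081"; // ML service from the quick start

    /** Predicted duration in ms, or NO_PREDICTION when the ML service is unreachable. */
    public double predictDurationMs(Map<String, Object> features) {
        try {
            // Endpoint path and response field are assumptions for this sketch.
            Map<?, ?> body = http.postForObject(baseUrl + "/predict", features, Map.class);
            Object value = (body == null) ? null : body.get("predicted_duration_ms");
            return (value instanceof Number n) ? n.doubleValue() : NO_PREDICTION;
        } catch (RestClientException e) {
            return NO_PREDICTION; // degrade to FIFO ordering
        }
    }
}
```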
The queue-core service emits key metrics via Micrometer:
- `taskqueue.jobs.submitted`
- `taskqueue.jobs.completed`
- `taskqueue.jobs.failed`
- `taskqueue.jobs.dead`
- `taskqueue.jobs.duplicate`
- `taskqueue.jobs.retried`
- `taskqueue.queue.depth` (gauge per priority)
- `taskqueue.processing.active` (gauge)
- `taskqueue.job.execution.duration` (timer)
- `taskqueue.job.queue.wait.time` (timer)

Prometheus scrapes `/actuator/prometheus`; Grafana dashboards are provisioned in `infra/grafana`.
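A sketch of how a worker might record two of these meters with Micrometer's standard API; the `priority` tag is an assumption, not a documented label of this project:

```java
import io.micrometer.core.instrument.MeterRegistry;
import io.micrometer.core.instrument.Timer;

public class JobMetrics {

    private final MeterRegistry registry;

    public JobMetrics(MeterRegistry registry) {
        this.registry = registry;
    }

    /** Times the job body and bumps the completion counter. */
    public void runTimed(String priority, Runnable jobBody) {
        Timer timer = Timer.builder("taskqueue.job.execution.duration")
                .tag("priority", priority)   // tag name is an assumption
                .register(registry);
        timer.record(jobBody);
        registry.counter("taskqueue.jobs.completed", "priority", priority)
                .increment();
    }
}
```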
| Layer | Technology | Purpose |
|---|---|---|
| API | Java 17, Spring Boot 3 | REST endpoints, validation, orchestration |
| Queue | Redis 7 (Lists, Sorted Sets, Hashes, Strings) + Lua | Atomic enqueue/dequeue + durability patterns |
| Workers | Java ThreadPoolExecutor | Concurrent execution and lifecycle handling |
| Retry/DLQ | Exponential backoff + jitter | Fault tolerance under transient failures |
| ML Scheduler | Python FastAPI + scikit-learn | Duration prediction and queue reordering |
| Metrics | Micrometer + Prometheus | Runtime counters/timers/gauges |
| Dashboard | Grafana | Operational visualization |
| Infra | Docker Compose, Kubernetes, HPA | Local and scalable deployment |
| Testing | JUnit 5, Mockito, AssertJ | 46 unit tests |
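The Retry/DLQ row above boils down to a delay that doubles per attempt plus a random offset. A minimal sketch using the "full jitter" variant; the base delay and cap are assumed values, and the project's actual policy may differ:

```java
import java.time.Duration;
import java.util.concurrent.ThreadLocalRandom;

public final class Backoff {

    // Assumed parameters; the real values live in queue-core configuration.
    private static final long BASE_DELAY_MS = 1_000;
    private static final long MAX_DELAY_MS = 60_000;

    /** Full jitter: random delay in [0, min(cap, base * 2^attempt)]. */
    public static Duration nextDelay(int attempt) {
        long exp = BASE_DELAY_MS << Math.min(attempt, 16); // 2^attempt, shift capped to avoid overflow
        long cap = Math.min(MAX_DELAY_MS, exp);
        return Duration.ofMillis(ThreadLocalRandom.current().nextLong(cap + 1));
    }
}
```

Randomizing over the whole window, rather than adding a small offset to a fixed delay, is what spreads out retries from many workers and prevents the synchronized contention the guarantees list mentions.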
- Java 17
- Maven 3.9+
- Docker Desktop
- Python 3.11+ (optional for direct ML dev loop)
- k6 (`brew install k6`) for benchmarking
Start the full stack:

```bash
cd infra
docker-compose up --build -d
docker-compose ps
```

Run the unit tests:

```bash
cd queue-core
mvn test
```

Submit a sample job:

```bash
curl -X POST http://localhost:8080/api/v1/jobs \
  -H "Content-Type: application/json" \
  -d '{
    "type": "EMAIL",
    "payload": "{\"to\":\"test@example.com\"}",
    "priority": "HIGH",
    "idempotencyKey": "sample-001"
  }'
```

Check the service endpoints:

```bash
open http://localhost:8080/actuator/health
open http://localhost:8080/actuator/prometheus
open http://localhost:9090   # Prometheus
open http://localhost:3000   # Grafana
open http://localhost:8081/health   # ML scheduler
```

Run the benchmarks:

```bash
cd benchmark/k6
./run_all.sh
```

Or run individually:

```bash
k6 run load_test.js
k6 run chaos_test.js
k6 run ml_benchmark.js
```

The benchmark result template lives in `benchmark/results/README.md`.
| Metric | Value | Source |
|---|---|---|
| Throughput (end-to-end jobs) | 9.28 jobs/s (~557 jobs/min) | load_test_summary.json (jobs_submitted.rate) |
| Submit success rate | 100% | load_test_summary.json (submit_success_rate.rate) |
| Submission latency P95 | 11 ms | load_test_summary.json (submit_duration_ms.p(95)) |
| Submission latency P99 | 21 ms | load_test_summary.json (submit_duration_ms.p(99)) |
| Job completion latency P95 | 2037 ms | load_test_summary.json (job_completion_latency_ms.p(95)) |
| Job completion latency P99 | 2062.94 ms | load_test_summary.json (job_completion_latency_ms.p(99)) |
| Chaos failure rate | 0% | chaos_test_summary.json (http_req_failed.rate) |
| Chaos terminal completion rate | 100% | chaos_test_summary.json (job_terminal_rate.rate) |
| ML vs FIFO avg completion delta | -5.44% (ML faster) | ml_benchmark_summary.json (avg_latency_delta_percent) |
| ML vs FIFO P95 completion delta | -0.58% (ML faster) | ml_benchmark_summary.json (p95_latency_delta_percent) |
Screenshots can be generated after running:

```bash
make run
```

- Open Grafana at http://localhost:3000
- Import/view the Task Queue dashboard
- Save exported images into `docs/images/`

Suggested filenames:

- `docs/images/grafana-dashboard.png`
- `docs/images/queue-depth.png`
- `docs/images/latency-graph.png`
- Job payloads are stored as JSON strings; large payload compression/chunking is not yet implemented.
- The current queue model targets single-Redis deployment patterns; cluster sharding strategy can be expanded.
- Exactly-once semantics are not provided (system guarantees at-least-once with idempotency at submission).
- ML model is trained on synthetic data; production usage should train on real historical execution traces.
- End-to-end integration tests with Redis/Testcontainers and CI performance smoke tests can be expanded.
```
Distributed-Task-Queue-Engine/
├── queue-core/
├── ml-scheduler/
├── infra/
├── benchmark/
├── docs/
└── .github/workflows/
```
MIT