
# Intelligent Distributed Task Queue Engine

A production-grade distributed task queue built with Java 17 / Spring Boot 3 + Redis, featuring ML-assisted dynamic scheduling, at-least-once delivery semantics, idempotent submissions, exponential backoff + DLQ handling, and full observability with Prometheus/Grafana. The platform is designed to stay available under partial failures by falling back from ML-based ordering to FIFO queue consumption.

## Project overview

This project demonstrates a resilient asynchronous processing architecture for backend workloads with mixed job durations. It combines deterministic queue safety guarantees (Lua atomic operations, processing tracking, retries, DLQ) with an optional ML layer that predicts execution duration and improves scheduling order without becoming a single point of failure.

## Architecture

```mermaid
flowchart LR
    P1[REST Producer] --> API[Job API - Spring Boot 3]
    P2[Cron Producer] --> API

    API --> QL[Redis Lists: tq:queue:high/normal/low]
    API --> IDEM[Redis String Keys: idempotency SET NX]

    QL <--> MLR[PriorityReorderer]
    MLR <--> ML["ML Scheduler - FastAPI<br/>GBT duration model"]
    MLR --> ZS[Redis Sorted Sets: tq:sorted:*]

    ZS --> W[Worker Pool - ThreadPoolExecutor]
    QL --> W
    W --> PH[Redis Hash: tq:processing]
    W --> DLQ[Redis List: tq:dlq]
    W --> JS[Redis String: tq:job:*]

    API --> MET[Micrometer Metrics]
    W --> MET
    MET --> PROM[Prometheus]
    PROM --> GRAF[Grafana]

    subgraph K8s
      HPA[Horizontal Pod Autoscaler]
    end

    PROM --> HPA
```

## Reliability guarantees

- **At-least-once delivery:** jobs are moved into a processing hash during dequeue and must be acknowledged after completion.
- **Idempotent submission:** Redis SET NX keys reject duplicate job submissions.
- **Atomic queue operations:** enqueue/dequeue transitions run in Lua scripts to avoid race conditions.
- **Retry safety:** exponential backoff with jitter prevents retry storms and synchronized contention (see the sketch after this list).
- **Dead-letter isolation:** exhausted jobs are preserved in the DLQ for replay and inspection.
- **ML graceful degradation:** queue processing continues in FIFO mode if the ML service is unavailable.
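
The backoff calculation is the part most prone to subtle bugs, so here is a minimal sketch of the idea, assuming hypothetical names and constants (illustrative only, not the repository's actual code): the delay doubles per attempt, is capped, and gets full random jitter so retries from many workers do not synchronize.

```java
import java.time.Duration;
import java.util.concurrent.ThreadLocalRandom;

// Minimal sketch of exponential backoff with full jitter.
// Class name, base delay, and cap are assumptions for illustration.
final class RetryBackoff {
    private static final Duration BASE = Duration.ofSeconds(1);
    private static final Duration MAX = Duration.ofMinutes(5);

    // Delay before retry number `attempt` (1-based).
    static Duration delayFor(int attempt) {
        // Exponential growth: base * 2^(attempt-1), with the shift clamped
        // so it cannot overflow and waits stay bounded by MAX.
        long exponentialMs = Math.min(
                MAX.toMillis(),
                BASE.toMillis() * (1L << Math.min(attempt - 1, 20)));
        // Full jitter: pick uniformly in [0, exponentialMs] so concurrent
        // retries spread out instead of hitting Redis in lockstep.
        return Duration.ofMillis(ThreadLocalRandom.current().nextLong(exponentialMs + 1));
    }
}
```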

## Redis data model

| Key pattern | Type | Purpose |
| --- | --- | --- |
| `tq:queue:high` | List | Primary FIFO queue lane (high priority) |
| `tq:queue:normal` | List | Primary FIFO queue lane (normal priority) |
| `tq:queue:low` | List | Primary FIFO queue lane (low priority) |
| `tq:sorted:high` | Sorted Set | ML-reordered lane (high priority; lower score dequeued first) |
| `tq:sorted:normal` | Sorted Set | ML-reordered lane (normal priority; lower score dequeued first) |
| `tq:sorted:low` | Sorted Set | ML-reordered lane (low priority; lower score dequeued first) |
| `tq:processing` | Hash | In-flight jobs (jobId -> jobJson) |
| `tq:dlq` | List | Dead-letter jobs |
| `tq:idem:{idempotencyKey}` | String | Deduplication lock via SET NX EX |
| `tq:job:{jobId}` | String | Job status snapshot for API lookups |
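
To make the deduplication lock concrete, here is a hedged sketch using Spring Data Redis (the class name and TTL are assumptions, not taken from the repository): `setIfAbsent` with a timeout issues `SET key value NX EX`, so the first submission claims the key and duplicates are rejected until it expires.

```java
import java.time.Duration;
import org.springframework.data.redis.core.StringRedisTemplate;

// Illustrative sketch only; the class name and 24h TTL are assumptions.
class IdempotencyGuard {
    private final StringRedisTemplate redis;

    IdempotencyGuard(StringRedisTemplate redis) {
        this.redis = redis;
    }

    // Returns true if this idempotency key is seen for the first time.
    boolean tryAcquire(String idempotencyKey, String jobId) {
        // setIfAbsent(key, value, ttl) maps to SET key value NX EX <ttl>,
        // so exactly one concurrent submitter can claim the key.
        Boolean first = redis.opsForValue().setIfAbsent(
                "tq:idem:" + idempotencyKey, jobId, Duration.ofHours(24));
        return Boolean.TRUE.equals(first);
    }
}
```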

## ML scheduler behavior and FIFO fallback

- A FastAPI service predicts job duration using a GradientBoostingRegressor.
- PriorityReorderer periodically reads the list queues and populates the sorted sets, scored by predicted duration.
- The dequeue Lua script pops from the sorted set first; if it is empty, it falls back to FIFO consumption from the list (one plausible shape of that script is sketched after this list).
- If ML API calls fail, scheduler methods return safe defaults and reordering is skipped.
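
To make the fallback concrete, here is one plausible shape of such a dequeue script, executed from Java via Spring Data Redis. This is an illustrative sketch, not the repository's actual script; the key names follow the data model table above, and the script assumes the job JSON carries a `jobId` field.

```java
import java.util.List;
import org.springframework.data.redis.core.StringRedisTemplate;
import org.springframework.data.redis.core.script.DefaultRedisScript;

// Illustrative dequeue-with-fallback sketch; not the repository's actual script.
class DequeueExample {
    // Pop the lowest-scored (shortest predicted) job from the ML-reordered
    // sorted set; if it is empty, fall back to plain FIFO LPOP. Either way,
    // record the job in the processing hash in the same atomic step so the
    // at-least-once bookkeeping cannot be lost between two round trips.
    private static final String DEQUEUE_LUA = """
            local popped = redis.call('ZPOPMIN', KEYS[1])
            local jobJson = popped[1]
            if not jobJson then
              jobJson = redis.call('LPOP', KEYS[2])
            end
            if jobJson then
              local job = cjson.decode(jobJson)
              redis.call('HSET', KEYS[3], job.jobId, jobJson)
              return jobJson
            end
            return nil
            """;

    static String dequeueHigh(StringRedisTemplate redis) {
        var script = new DefaultRedisScript<>(DEQUEUE_LUA, String.class);
        return redis.execute(script,
                List.of("tq:sorted:high", "tq:queue:high", "tq:processing"));
    }
}
```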

## Observability metrics

The queue-core service emits key metrics via Micrometer:

- `taskqueue.jobs.submitted`
- `taskqueue.jobs.completed`
- `taskqueue.jobs.failed`
- `taskqueue.jobs.dead`
- `taskqueue.jobs.duplicate`
- `taskqueue.jobs.retried`
- `taskqueue.queue.depth` (gauge per priority)
- `taskqueue.processing.active` (gauge)
- `taskqueue.job.execution.duration` (timer)
- `taskqueue.job.queue.wait.time` (timer)

Prometheus scrapes `/actuator/prometheus`; Grafana dashboards are provisioned in `infra/grafana`.
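
As a hedged sketch of how a counter and timer like these are typically registered with Micrometer (the metric names match the list above; the wrapper class itself is an assumption, not the repository's code):

```java
import io.micrometer.core.instrument.Counter;
import io.micrometer.core.instrument.MeterRegistry;
import io.micrometer.core.instrument.Timer;

// Illustrative metric wiring; the wrapper class is an assumption.
class QueueMetrics {
    private final Counter submitted;
    private final Timer executionDuration;

    QueueMetrics(MeterRegistry registry) {
        // Counter: monotonically increasing count of accepted submissions.
        submitted = Counter.builder("taskqueue.jobs.submitted").register(registry);
        // Timer: per-job execution latency, publishing the percentiles
        // that feed the p95/p99 dashboard panels.
        executionDuration = Timer.builder("taskqueue.job.execution.duration")
                .publishPercentiles(0.95, 0.99)
                .register(registry);
    }

    void onSubmitted() {
        submitted.increment();
    }

    void onExecuted(Runnable work) {
        executionDuration.record(work);
    }
}
```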

## Tech stack

| Layer | Technology | Purpose |
| --- | --- | --- |
| API | Java 17, Spring Boot 3 | REST endpoints, validation, orchestration |
| Queue | Redis 7 (Lists, Sorted Sets, Hashes, Strings) + Lua | Atomic enqueue/dequeue + durability patterns |
| Workers | Java ThreadPoolExecutor | Concurrent execution and lifecycle handling |
| Retry/DLQ | Exponential backoff + jitter | Fault tolerance under transient failures |
| ML scheduler | Python FastAPI + scikit-learn | Duration prediction and queue reordering |
| Metrics | Micrometer + Prometheus | Runtime counters/timers/gauges |
| Dashboard | Grafana | Operational visualization |
| Infra | Docker Compose, Kubernetes, HPA | Local and scalable deployment |
| Testing | JUnit 5, Mockito, AssertJ | 46 unit tests |

## Local run instructions

### Prerequisites

- Java 17
- Maven 3.9+
- Docker Desktop
- Python 3.11+ (optional, for the direct ML dev loop)
- k6 (`brew install k6`) for benchmarking

### Start full stack

```bash
cd infra
docker-compose up --build -d
docker-compose ps
```

### Run tests

```bash
cd queue-core
mvn test
```

### Submit a sample job

```bash
curl -X POST http://localhost:8080/api/v1/jobs \
  -H "Content-Type: application/json" \
  -d '{
    "type": "EMAIL",
    "payload": "{\"to\":\"test@example.com\"}",
    "priority": "HIGH",
    "idempotencyKey": "sample-001"
  }'
```

### Inspect observability

```bash
open http://localhost:8080/actuator/health      # queue-core health
open http://localhost:8080/actuator/prometheus  # Micrometer metrics endpoint
open http://localhost:9090                      # Prometheus
open http://localhost:3000                      # Grafana
open http://localhost:8081/health               # ML scheduler health
```

## Benchmark instructions

```bash
cd benchmark/k6
./run_all.sh
```

Or run individually:

```bash
k6 run load_test.js     # throughput/latency baseline
k6 run chaos_test.js    # failure injection / terminal-state check
k6 run ml_benchmark.js  # ML-reordered vs FIFO comparison
```

The benchmark result template lives in `benchmark/results/README.md`.

## Benchmark results (captured 2026-04-23)

| Metric | Value | Source |
| --- | --- | --- |
| Throughput (end-to-end jobs) | 9.28 jobs/s (~557 jobs/min) | `load_test_summary.json` (`jobs_submitted.rate`) |
| Submit success rate | 100% | `load_test_summary.json` (`submit_success_rate.rate`) |
| Submission latency P95 | 11 ms | `load_test_summary.json` (`submit_duration_ms.p(95)`) |
| Submission latency P99 | 21 ms | `load_test_summary.json` (`submit_duration_ms.p(99)`) |
| Job completion latency P95 | 2037 ms | `load_test_summary.json` (`job_completion_latency_ms.p(95)`) |
| Job completion latency P99 | 2062.94 ms | `load_test_summary.json` (`job_completion_latency_ms.p(99)`) |
| Chaos failure rate | 0% | `chaos_test_summary.json` (`http_req_failed.rate`) |
| Chaos terminal completion rate | 100% | `chaos_test_summary.json` (`job_terminal_rate.rate`) |
| ML vs FIFO avg completion delta | -5.44% (ML faster) | `ml_benchmark_summary.json` (`avg_latency_delta_percent`) |
| ML vs FIFO P95 completion delta | -0.58% (ML faster) | `ml_benchmark_summary.json` (`p95_latency_delta_percent`) |

## Screenshots

Screenshots can be generated as follows:

1. `make run`
2. Open Grafana at http://localhost:3000
3. Import or view the Task Queue dashboard
4. Save exported images into `docs/images/`

Suggested filenames:

- `docs/images/grafana-dashboard.png`
- `docs/images/queue-depth.png`
- `docs/images/latency-graph.png`

## Known limitations and future improvements

- Job payloads are stored as JSON strings; compression/chunking for large payloads is not yet implemented.
- The current queue model targets single-Redis deployments; a cluster sharding strategy could be added.
- Exactly-once semantics are not provided; the system guarantees at-least-once delivery with idempotency at submission.
- The ML model is trained on synthetic data; production usage should retrain on real historical execution traces.
- End-to-end integration tests with Redis/Testcontainers and CI performance smoke tests could be expanded.

## Repository structure

```text
Distributed-Task-Queue-Engine/
├── queue-core/
├── ml-scheduler/
├── infra/
├── benchmark/
├── docs/
└── .github/workflows/
```

## License

MIT
