====================================================================================
█████╗ ██████╗ ██╗ ██████╗ ██████╗ ██████╗ ████████╗███████╗██╗ ██╗
██╔══██╗██╔══██╗██║ ██╔════╝██╔═══██╗██╔══██╗╚══██╔══╝██╔════╝╚██╗██╔╝
███████║██████╔╝██║ ██║ ██║ ██║██████╔╝ ██║ █████╗ ╚███╔╝
██╔══██║██╔═══╝ ██║ ██║ ██║ ██║██╔══██╗ ██║ ██╔══╝ ██╔██╗
██║ ██║██║ ██║ ╚██████╗╚██████╔╝██║ ██║ ██║ ███████╗██╔╝ ██╗
╚═╝ ╚═╝╚═╝ ╚═╝ ╚═════╝ ╚═════╝ ╚═╝ ╚═╝ ╚═╝ ╚══════╝╚═╝ ╚═╝
====================================================================================
Predict API Failures Before They Happen
- Overview
- Architecture
- Features
- System Components
- Data Flow
- Installation
- Configuration
- Usage
- Monitoring
- Troubleshooting
- Dependencies
- License
ApiCortex is an enterprise-grade SaaS platform that predicts API failures before they occur using machine learning analytics on real production traffic. The platform ensures API contract compliance and provides proactive failure detection through advanced anomaly detection algorithms.
- Predictive Analytics: ML-powered failure prediction with 95%+ accuracy
- Real-time Monitoring: Sub-second telemetry processing via Kafka streaming
- Contract Validation: OpenAPI specification enforcement and drift detection
- Multi-tenant Architecture: Organization-based isolation with RBAC
- Time-series Analytics: Historical querying with TimescaleDB
- Developer Dashboard: Interactive Next.js UI with live metrics
For the initial MVP launch, we have adopted a hybrid-cloud strategy utilizing high-performance managed services to deliver a full-featured experience.
| Component | Provider | Role |
|---|---|---|
| Frontend | Vercel | Dashboard & Edge Proxy |
| Backend | HuggingFace | Unified Docker Orchestration |
| Metadata | NeonDB | Serverless PostgreSQL |
| Metrics | TigerData | Managed TimescaleDB |
| Streaming | Aiven | Cloud Managed Kafka |
| Caching | Upstash | Serverless Redis |
Note
To maximize efficiency and minimize cross-service latency on free-tier resources, the core backend services (Ingest, Control Plane, and ML Service) are orchestrated within a unified Docker container on HuggingFace Spaces. This architecture leverages a multi-stage build that pulls pre-compiled binaries and virtual environments from internal mirrors to generate a high-density, production-ready image, with a custom entrypoint script handling concurrent process management and environment isolation for the Go and Python runtimes.
┌─────────────────────────────────────────────────────────────────────────┐
│ APICORTEX PLATFORM │
├─────────────────────────────────────────────────────────────────────────┤
│ │
│ │
│ ┌──────────────┐ │
│ │ API Testing │ │
│ │ Engine (Rust)│ │
│ └──────▲───────┘ │
│ │ │
│ ┌──────────────┐ ┌──────▼───────┐ ┌──────────────┐ │
│ │ Frontend │ │ Control │ │ Ingest │ │
│ │ (Next.js) │◄──►│ Plane │◄──►│ Service │ │
│ │ │ │ (FastAPI) │ │ (Go) │ │
│ └──────────────┘ └──────────────┘ └──────────────┘ │
│ │ │ │ │
│ ▼ ▼ ▼ │
│ ┌──────────────────────────────────────────────────────────┐ │
│ │ Apache Kafka │ │
│ │ (telemetry.raw, alerts) │ │
│ └──────────────────────────────────────────────────────────┘ │
│ │ │ │ │
│ ▼ ▼ ▼ │
│ ┌──────────────┐ ┌──────────────┐ ┌──────────────┐ │
│ │ ML │ │ PostgreSQL │ │ TimescaleDB │ │
│ │ Service │ │ (NeonDB) │ │ (Metrics) │ │
│ │ (Python) │ │ (Metadata) │ │ │ │
│ └──────────────┘ └──────────────┘ └──────────────┘ │
│ │
│ │
└─────────────────────────────────────────────────────────────────────────┘
graph TB
subgraph "Presentation Layer"
A[Next.js Dashboard]
B[REST API Clients]
end
subgraph "Control Plane"
C[FastAPI Server]
D[Auth Service]
E[API Management]
F[Contract Validator]
end
subgraph "Data Plane"
G[Go Ingest Service]
H[Kafka Producer]
I[Rate Limiter]
end
subgraph "ML Plane"
J[Python ML Service]
K[Feature Engineering]
L[XGBoost Predictor]
M[Anomaly Detector]
end
subgraph "Execution Plane"
Q[Rust Testing Engine]
R[SSRF Shield]
S[External APIs]
end
subgraph "Storage"
N[(PostgreSQL)]
O[(TimescaleDB)]
P[Kafka Topics]
end
A --> C
B --> C
C --> D
C --> E
C --> F
C <--> Q
Q --> R
R --> S
G --> H
H --> P
J --> P
J --> K
K --> L
L --> M
C --> N
G --> O
J --> O
| Feature | Description | Status |
|---|---|---|
| Real-time Telemetry | Collect API metrics with <10ms latency | ✔ Active |
| ML Failure Prediction | XGBoost-based anomaly detection | ✔ Active |
| Contract Validation | OpenAPI 3.0 specification enforcement | ✔ Active |
| Multi-tenant RBAC | Organization-based access control | ✔ Active |
| Time-series Analytics | Historical data querying | ✔ Active |
| Alerting System | Webhook-based notifications | ✔ Active |
| Developer Dashboard | Interactive UI with live metrics | ✔ Active |
| API Testing | High-performance Rust execution engine | ✔ Active |
- Throughput: 10,000+ events/second
- Latency: <50ms p99 for telemetry ingestion
- Accuracy: 95%+ failure prediction accuracy
- Retention: Configurable (default 30 days)
- Dependencies: Comprehensive list in DEPENDENCY.md
- Scalability: Horizontal scaling with Kafka partitions
Location: ingest-service/
Responsible for high-throughput telemetry collection and streaming.
┌─────────────────────────────────────┐
│ Ingest Service Architecture │
├─────────────────────────────────────┤
│ HTTP API → Validation → Buffer │
│ ↓ │
│ Kafka Producer → Batching │
│ ↓ │
│ TimescaleDB Writer │
└─────────────────────────────────────┘
Key Files:
cmd/server/main.go- Application entry pointinternal/api/handler.go- HTTP request handlersinternal/kafka/producer.go- Kafka producerinternal/buffer/batcher.go- Event batching
Location: control-plane/
Handles authentication, API metadata, and contract management.
┌─────────────────────────────────────┐
│ Control Plane Architecture │
├─────────────────────────────────────┤
│ OAuth2 → JWT → RBAC │
│ ↓ │
│ API Management → OpenAPI Parser │
│ ↓ │
│ PostgreSQL → TimescaleDB │
└─────────────────────────────────────┘
Key Files:
app/main.py- FastAPI applicationapp/routers/auth.py- Authentication endpointsapp/routers/apis.py- API managementapp/services/contract_service.py- Contract validation
Location: ml-service/
Processes telemetry streams and generates failure predictions.
┌─────────────────────────────────────┐
│ ML Service Architecture │
├─────────────────────────────────────┤
│ Kafka Consumer → Feature Extract │
│ ↓ │
│ XGBoost Model → SHAP Analysis │
│ ↓ │
│ Prediction Storage → Alerting │
└─────────────────────────────────────┘
Key Files:
app/main.py- ML worker entryworkers/inference_worker.py- Inference pipelineapp/features/feature_engineering.py- Feature extractionapp/inference/predictor.py- Model prediction
Location: frontend/
Developer dashboard for monitoring and management.
┌─────────────────────────────────────┐
│ Frontend Architecture │
├─────────────────────────────────────┤
│ Dashboard → API Testing │
│ ↓ │
│ Telemetry Charts → Predictions │
│ ↓ │
│ Contract Validation UI │
└─────────────────────────────────────┘
### 5. Execution Engine (Rust)
**Location**: `api-testing/`
High-performance, secure engine optimized for executing REST, GraphQL, and WebSocket tests with microsecond precision.
```text
┌─────────────────────────────────────┐
│ Testing Engine Architecture │
├─────────────────────────────────────┤
│ Request → Resolver → SSRF Shield │
│ ↓ │
│ Network Execution (Tokio + Reqwest)│
│ ↓ │
│ Diagnostics Snapshot (DNS/TLS/TCP) │
└─────────────────────────────────────┘
Key Files:
src/main.rs- Axum server entrysrc/executor.rs- Core execution & security logicsrc/protocols/- WebSocket & HTTP handlerssrc/models.rs- Result & Snapshot schemas
sequenceDiagram
participant Client as API Client
participant Ingest as Ingest Service
participant Kafka as Apache Kafka
participant ML as ML Service
participant DB as TimescaleDB
participant UI as Dashboard
Client->>Ingest: POST /v1/telemetry
Ingest->>Ingest: Validate & Buffer
Ingest->>Kafka: Publish telemetry.raw
Ingest->>DB: Store telemetry
Ingest-->>Client: 200 OK
ML->>Kafka: Consume telemetry.raw
ML->>ML: Feature Engineering
ML->>ML: XGBoost Prediction
ML->>DB: Store prediction
ML->>Kafka: Publish alerts
UI->>DB: Query metrics
UI->>UI: Display charts
flowchart TD
A[Telemetry Event] --> B{Kafka Consumer}
B --> C[Feature Extraction]
C --> D[1m Window Stats]
C --> E[5m Window Stats]
C --> F[15m Window Stats]
D --> G[Feature Vector]
E --> G
F --> G
G --> H{XGBoost Model}
H --> I[Risk Score]
I --> J{Threshold Check}
J -->|Score > 0.8| K[Generate Alert]
J -->|Score < 0.8| L[Store Prediction]
K --> M[Kafka Alerts Topic]
L --> N[TimescaleDB]
- Go: 1.26 or later
- Python: 3.11 or later
- Node.js: 22 or later
- PostgreSQL: 16+ or NeonDB
- TimescaleDB: Latest version
- Apache Kafka: 3.0 or later
# 1. Clone repository
git clone https://github.com/0xarchit/apicortex.git
cd apicortex
# 2. Set up environment variables
cp .env.example .env
# Edit .env with your credentials
# 3. Start infrastructure (Docker)
docker-compose up -d
# 4. Build and run services
# Ingest Service
cd ingest-service && go run cmd/server/main.go
# Control Plane
cd control-plane && uvicorn app.main:app --reload
# ML Service
cd ml-service && python app/main.py
# API Testing Engine (Rust)
cd api-testing && cargo run
# Frontend
cd frontend && npm run dev| Variable | Service | Description | Default |
|---|---|---|---|
DATABASE |
Control Plane | PostgreSQL connection string | - |
TIMESCALE_DATABASE |
All | TimescaleDB connection string | - |
KAFKA_SERVICE_URI |
Ingest, ML | Kafka broker URI | - |
ACTIVE_POLLING_ENABLED |
Ingest | Enable active polling | true |
BATCH_SIZE |
Ingest | Kafka batch size | 500 |
MODEL_PATH |
ML | Path to XGBoost model | model/xgboost.pkl |
ALERT_THRESHOLD |
ML | Alert threshold (0-1) | 0.8 |
API_TESTING_URL |
Control Plane | Internal URL for Rust engine | http://api-testing:9090 (Docker) or http://localhost:9090 (local) |
Ingest Service (ingest-service/.env):
PORT=8080
KAFKA_SERVICE_URI=kafka:9092
BATCH_SIZE=500
FLUSH_INTERVAL_SECONDS=2
ACTIVE_POLLING_ENABLED=trueControl Plane (control-plane/.env):
DATABASE=postgresql://user:pass@host:5432/db
JWT_SECRET_KEY=your-secret-key
OAUTH_GITHUB_CLIENT_ID=your-client-idML Service (ml-service/.env):
KAFKA_TOPIC_RAW=telemetry.raw
MODEL_PATH=model/xgboost_failure_prediction.pkl
ALERT_THRESHOLD=0.8
ENABLE_SHAP=true- Open browser:
http://localhost:3000 - Sign in with OAuth (Google/GitHub)
- Navigate to Dashboard
| Endpoint | Method | Description |
|---|---|---|
/auth/login |
POST | User authentication |
/apis |
GET | List APIs |
/apis/{id}/endpoints |
GET | Get API endpoints |
/telemetry |
POST | Submit telemetry |
/predictions |
GET | Get predictions |
/dashboard/metrics |
GET | Dashboard metrics |
/testing/execute |
POST | Execute API test |
┌─────────────────────────────────────┐
│ Monitoring Stack │
├─────────────────────────────────────┤
│ Prometheus → Grafana │
│ ↓ │
│ Custom Metrics: │
│ - telemetry_events_total │
│ - prediction_latency_seconds │
│ - kafka_consumer_lag │
│ - http_request_duration_seconds │
└─────────────────────────────────────┘
| Service | Endpoint | Port |
|---|---|---|
| Ingest | /health |
8080 |
| API Testing | /health |
9090 |
| Control Plane | /health |
8000 |
| Frontend | / |
3000 |
All services use structured logging:
- Ingest: Zerolog (JSON format)
- Control Plane: Python logging (JSON)
- ML Service: Python logging (JSON)
Log format:
{
"timestamp": "2026-04-07T12:00:00Z",
"level": "INFO",
"service": "ingest-service",
"message": "Telemetry batch published",
"batch_size": 500,
"duration_ms": 45
}Symptom: Service exits immediately on startup
Solution:
# Check environment variables
printenv | grep APICORTEX
# Verify database connectivity
psql $DATABASE -c "SELECT 1"
# Check Kafka connection
kafka-consumer-groups --bootstrap-server $KAFKA_URI --listSymptom: Memory usage > 2GB
Solution:
# Reduce batch size in ingest-service
BATCH_SIZE=100
# Limit buffer capacity
MAX_BUFFER_CAPACITY=10000Symptom: Consumer lag > 10000 messages
Solution:
# Increase consumer parallelism
# Add more ML worker instances
# Check network connectivityEnable debug logging:
DEBUG=true
LOG_LEVEL=debug| Parameter | Recommended | Description |
|---|---|---|
BATCH_SIZE |
500-1000 | Events per batch |
FLUSH_INTERVAL |
2s | Batch flush interval |
PUBLISH_WORKER_COUNT |
4 | Parallel publishers |
| Parameter | Recommended | Description |
|---|---|---|
KAFKA_POLL_TIMEOUT |
1.0s | Poll timeout |
ENABLE_SHAP |
false | Disable for performance |
KAFKA_MAX_POLL_INTERVAL |
300s | Max poll interval |
-- Optimize TimescaleDB
SELECT add_retention_policy('api_telemetry', INTERVAL '30 days');
SELECT add_compression_policy('api_telemetry', INTERVAL '7 days');
-- Create indexes
CREATE INDEX CONCURRENTLY ON api_telemetry (org_id, time DESC);
CREATE INDEX CONCURRENTLY ON api_telemetry (api_id, time DESC);sequenceDiagram
participant User
participant Frontend
participant ControlPlane
participant OAuth
participant DB
User->>Frontend: Click "Login"
Frontend->>ControlPlane: Initiate OAuth
ControlPlane->>OAuth: Redirect
User->>OAuth: Authenticate
OAuth->>ControlPlane: OAuth Callback
ControlPlane->>DB: Create/Update User
ControlPlane->>Frontend: JWT Token
Frontend->>User: Dashboard Access
- Keys are hashed with pepper before storage
- Keys are rotated every 90 days
- Audit logging for all key operations
- Fork the repository
- Create feature branch
- Submit pull request
- Pass CI/CD pipeline
# Install dependencies
go mod download
pip install -r requirements.txt
npm install
# Run tests
go test ./...
pytest
npm testFor a complete breakdown of all libraries, frameworks, and tools used across our Rust, Go, Python, and Next.js services, please refer to the DEPENDENCY.md file.
Check the LICENSE
- Email: mail@0xarchit.is-a.dev
- Discussions: https://github.com/0xarchit/ApiCortex/discussions
- Issues: https://github.com/0xarchit/ApiCortex/issues
Developer team