Sonomos Traffic Classifier

A tiny MLP binary classifier (~11K parameters) for detecting AI provider network traffic from TLS/HTTPS metadata. Designed for sub-100μs warm inference via tract (pure Rust, no C++ deps).

Architecture

Input (61 features) → Linear(96) → BatchNorm → ReLU → Dropout(0.1)
                     → Linear(48) → BatchNorm → ReLU → Dropout(0.1)
                     ├→ Linear(1)  → logit      → sigmoid → P(AI traffic)
                     └→ Linear(1)  → sigmoid    → confidence score [0, 1]
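
The ~11K parameter figure can be checked from the layer widths in the diagram. The tally below is a back-of-the-envelope sketch, not output from the training code. It assumes every Linear layer has a bias and that each BatchNorm contributes two learnable values per feature (scale and shift), with running statistics left out of the count.

```python
def linear_params(n_in: int, n_out: int) -> int:
    """Weights plus one bias per output unit (bias assumed present)."""
    return n_in * n_out + n_out


def batchnorm_params(n_features: int) -> int:
    """Learnable scale and shift only; running mean/variance are excluded."""
    return 2 * n_features


# Shared trunk: 61 -> 96 -> 48, each Linear followed by BatchNorm
trunk = (
    linear_params(61, 96) + batchnorm_params(96)
    + linear_params(96, 48) + batchnorm_params(48)
)

# Two single-unit heads on top of the 48-dim trunk output
heads = 2 * linear_params(48, 1)

total = trunk + heads
print(total)  # 10994
```

At 4 bytes per FP32 value this comes to roughly 44 KB of weights before any file-format overhead.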

ONNX output: (1, 2) = [logit, confidence]

The confidence head tells the pipeline how much to trust the prediction. Low confidence (< 0.4) signals "I don't know" — the pipeline falls back to conservative behavior instead of acting on an uncertain classification.
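
Because the first output is a raw logit, a consumer has to apply the sigmoid itself before thresholding. A minimal sketch of how the two outputs might be combined follows. The function name, the returned labels, and the 0.5 decision threshold are assumptions made for illustration; only the 0.4 confidence cutoff comes from the text above.

```python
import math

LOW_CONFIDENCE = 0.4  # cutoff quoted above; below this the model says "I don't know"


def sigmoid(x: float) -> float:
    """Numerically stable logistic function."""
    if x >= 0:
        return 1.0 / (1.0 + math.exp(-x))
    z = math.exp(x)
    return z / (1.0 + z)


def interpret_output(logit: float, confidence: float, threshold: float = 0.5) -> dict:
    """Turn the (logit, confidence) pair into a probability and a decision label.

    `threshold` is a hypothetical decision threshold, not a documented value.
    """
    probability = sigmoid(logit)
    if confidence < LOW_CONFIDENCE:
        decision = "uncertain"  # caller should fall back to conservative behavior
    elif probability >= threshold:
        decision = "ai"
    else:
        decision = "not_ai"
    return {"probability": probability, "decision": decision}


print(interpret_output(2.0, 0.9)["decision"])  # ai
print(interpret_output(2.0, 0.3)["decision"])  # uncertain
```

Checking confidence before probability keeps a confident-looking logit from overriding a low confidence score.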

This classifier is Stage 3 of the Sonomos Desktop three-stage traffic scanning pipeline:

Stage 1 — Deterministic rules (sub-μs): domain allowlist + user overrides + cache
Stage 2 — Heuristic scoring (~μs): JA4 fingerprint, SNI pattern, DNS/IP correlation
Stage 3 — This ML classifier (~10–70ms cold, <100μs warm): ONNX model via tract
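
One way to read the stage list is as a cascade in which each stage runs only when the cheaper ones before it cannot decide. The sketch below illustrates that reading. The short-circuit behavior, the stage functions, the flow dictionary keys, and every threshold are hypothetical stand-ins, not the Sonomos Desktop implementation.

```python
from typing import Callable, Optional, Sequence, Tuple

# A stage returns True/False when it can decide, or None to defer to the next stage.
Stage = Callable[[dict], Optional[bool]]


def run_cascade(flow: dict, stages: Sequence[Stage], default: bool = False) -> Tuple[bool, int]:
    """Return (decision, 1-based number of the deciding stage), or (default, 0)."""
    for number, stage in enumerate(stages, start=1):
        verdict = stage(flow)
        if verdict is not None:
            return verdict, number
    return default, 0


# Hypothetical stand-ins for the three stages (illustrative data and thresholds).
KNOWN_DOMAINS = {"api.openai.com": True, "example.org": False}


def rules_stage(flow: dict) -> Optional[bool]:
    # Exact-match lookup; unknown domains defer to the next stage.
    return KNOWN_DOMAINS.get(flow.get("sni"))


def heuristic_stage(flow: dict) -> Optional[bool]:
    # Decide only when the heuristic score is far from the middle.
    score = flow.get("heuristic_score", 0.5)
    if score >= 0.9:
        return True
    if score <= 0.1:
        return False
    return None


def ml_stage(flow: dict) -> Optional[bool]:
    # The last stage always decides.
    return flow.get("model_probability", 0.0) >= 0.5


STAGES = [rules_stage, heuristic_stage, ml_stage]
print(run_cascade({"sni": "api.openai.com"}, STAGES))  # (True, 1)
print(run_cascade({"sni": "unknown.test", "model_probability": 0.8}, STAGES))  # (True, 3)
```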

Feature Vector (61 dimensions)

Group	Features	Dims
Flow statistics	pkt size mean/std/min/max/p25/p50/p75, IAT mean/std/min/max/p50, duration, pkt count (up/down), bytes/sec	16
Directional stats	upstream pkt size mean/std/p50, downstream pkt size mean/std/p50, byte ratio (up/total), pkt count ratio (up/total)	8
First-N packet sizes	first 8 packet sizes (upstream interleaved with downstream)	8
TLS metadata	version, cipher count, ext count, ALPN, has_grpc, has_h2, cert_chain_len, has_sni, has_sct, has_status_request, tls_13_only, post_handshake_auth	12
JA4 components	version_ord, cipher_count, ext_count, alpn_ord, sorted_cipher_hash(2d)	6
SNI n-gram hash	character 2/3-gram hashing into 11-dim feature vector	11
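
The SNI n-gram row uses the hashing trick: each character 2-gram and 3-gram is hashed into one of 11 buckets. The sketch below shows the general idea only. The choice of CRC32 as the hash, the lowercasing, and the normalization to relative frequencies are assumptions; the project's extractor may differ on all three.

```python
import zlib
from typing import List

SNI_DIMS = 11  # width of the SNI group in the feature table


def sni_ngram_features(sni: str, dims: int = SNI_DIMS) -> List[float]:
    """Hash character 2-grams and 3-grams of a hostname into `dims` buckets."""
    name = sni.strip().lower()
    buckets = [0.0] * dims
    total = 0
    for n in (2, 3):
        for i in range(len(name) - n + 1):
            gram = name[i:i + n]
            # CRC32 is deterministic across runs, unlike Python's built-in hash().
            buckets[zlib.crc32(gram.encode("utf-8")) % dims] += 1.0
            total += 1
    if total == 0:
        return buckets  # empty or single-character input: all zeros
    # Normalize so hostname length does not dominate the feature scale.
    return [count / total for count in buckets]


features = sni_ngram_features("api.openai.com")
print(len(features))  # 11
```

A fixed-width hashed vector lets the model see hostnames it never encountered in training, at the cost of occasional bucket collisions.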

Quick Start

# Install deps
pip install torch scikit-learn onnx onnxruntime numpy pandas

# Generate synthetic training data (for testing the pipeline)
python scripts/generate_synthetic_data.py --output data/synthetic_train.csv --samples 10000

# Train
python scripts/train.py --data data/synthetic_train.csv --output models/traffic_classifier.onnx

# Validate ONNX export
python scripts/validate_onnx.py --model models/traffic_classifier.onnx

# Run tests
python -m pytest tests/ -v

Real Data Collection

Extract features from real packet captures using cicflowmeter:

# Quote the requirement so the shell does not treat ">=" as an output redirect
pip install "cicflowmeter>=0.5.0"

# Single pcap with label (1=AI traffic, 0=normal)
python scripts/extract_with_cicflowmeter.py \
    --pcap data/captures/openai_traffic.pcap \
    --label 1 \
    --sni api.openai.com \
    --output data/openai_flows.csv

# Batch: directory of pcaps with per-file labels
python scripts/extract_with_cicflowmeter.py \
    --pcap-dir data/captures/ \
    --label-file data/captures/labels.json \
    --output data/real_train.csv

# Train on real data
python scripts/train.py --data data/real_train.csv --output models/traffic_classifier.onnx

Rust Integration (tract + huginn-net-tls)

# Cargo.toml
[dependencies]
huginn-net-tls = "1.5"
tract-onnx = "0.22"
anyhow = "1"

use crate::classifier::{TrafficClassifier, FlowStats};

// Load once at daemon startup
let classifier = TrafficClassifier::load("traffic_classifier.onnx", Some(0.5))?;

// On each intercepted flow: pass flow stats + raw ClientHello bytes
let (probability, is_ai, sni) = classifier.classify_flow(&flow_stats, &client_hello_bytes)?;

if is_ai {
    // AI traffic detected — apply Cloak interception
}

The classify_flow() method handles the full pipeline internally:

1. Passes ClientHello bytes through huginn-net-tls for JA4/TLS extraction
2. Builds the 61-dim feature vector (flow stats + TLS + JA4 + SNI hash)
3. Runs tract ONNX inference
4. Returns (probability, is_ai, sni_domain)
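
The vector-building step amounts to concatenating the feature groups in a fixed order and checking the total width. A language-neutral sketch in Python follows. The group names and the assumption that the vector follows the order of the feature table are illustrative; the Rust code is the authority on the actual layout.

```python
from typing import Dict, List

# (group name, width) pairs; order assumed to match the feature table
FEATURE_GROUPS = [
    ("flow_stats", 16),
    ("directional", 8),
    ("first_n_sizes", 8),
    ("tls_metadata", 12),
    ("ja4", 6),
    ("sni_hash", 11),
]
FEATURE_DIM = sum(width for _, width in FEATURE_GROUPS)  # 61


def assemble_features(groups: Dict[str, List[float]]) -> List[float]:
    """Concatenate the groups in order, rejecting any group of the wrong width."""
    vector: List[float] = []
    for name, width in FEATURE_GROUPS:
        values = groups[name]
        if len(values) != width:
            raise ValueError(f"{name}: expected {width} values, got {len(values)}")
        vector.extend(float(v) for v in values)
    return vector


zeros = {name: [0.0] * width for name, width in FEATURE_GROUPS}
print(len(assemble_features(zeros)))  # 61
```

Validating widths per group makes a layout mismatch between training and inference fail loudly instead of silently shifting features.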

XGBoost Distillation (Optional)

To improve accuracy, train an XGBoost teacher first, then pass it to train.py with --teacher so the MLP is distilled from it:

python scripts/train_xgboost_teacher.py --data data/real_train.csv --output models/xgb_teacher.json
python scripts/train.py --data data/real_train.csv --teacher models/xgb_teacher.json --output models/traffic_classifier.onnx
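
The loss that train.py uses when --teacher is supplied is not shown here. A common formulation of binary distillation, given purely as an illustration, blends cross-entropy against the hard label with cross-entropy against the teacher's predicted probability. The weight alpha and both function names are hypothetical.

```python
import math


def bce(p: float, target: float, eps: float = 1e-7) -> float:
    """Binary cross-entropy for one sample; `target` may be a soft label in [0, 1]."""
    p = min(max(p, eps), 1.0 - eps)  # clamp to keep log() finite
    return -(target * math.log(p) + (1.0 - target) * math.log(1.0 - p))


def distillation_loss(student_p: float, hard_label: float, teacher_p: float,
                      alpha: float = 0.5) -> float:
    """alpha weights the ground-truth term; (1 - alpha) weights the teacher term."""
    hard_term = bce(student_p, hard_label)
    soft_term = bce(student_p, teacher_p)
    return alpha * hard_term + (1.0 - alpha) * soft_term


# A student that agrees with a confident, correct teacher gets a low loss.
low = distillation_loss(student_p=0.9, hard_label=1.0, teacher_p=0.9)
high = distillation_loss(student_p=0.2, hard_label=1.0, teacher_p=0.9)
print(low < high)  # True
```

The soft term carries the teacher's graded certainty about each flow, which a 0/1 label alone cannot express.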

Model Metrics

Target metrics (on real data):

AUC-PR > 0.95
F1 > 0.92
Precision@90%Recall > 0.90
Confidence: mean > 0.7 on correct predictions, mean < 0.5 on incorrect predictions
Inference: <100μs (tract, x86_64)
Model size: ~50KB (FP32 ONNX)
Output: (1, 2) = [logit, confidence]
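
Of the targets above, Precision@90%Recall is the least standard. A dependency-free sketch of one reasonable definition is the best precision over all score thresholds whose recall is at least 0.90. The function name and the tie handling are assumptions; the project's evaluation script may compute it differently.

```python
from typing import Sequence


def precision_at_recall(labels: Sequence[int], scores: Sequence[float],
                        min_recall: float = 0.90) -> float:
    """Best precision among thresholds that reach at least `min_recall`."""
    positives = sum(labels)
    if positives == 0:
        raise ValueError("need at least one positive label")
    # Sweep thresholds from the highest score downward.
    ranked = sorted(zip(scores, labels), key=lambda pair: pair[0], reverse=True)
    best = 0.0
    true_pos = 0
    for rank, (score, label) in enumerate(ranked, start=1):
        true_pos += label
        # A threshold cannot split tied scores, so evaluate only at the end of a tie run.
        if rank < len(ranked) and ranked[rank][0] == score:
            continue
        recall = true_pos / positives
        if recall >= min_recall:
            best = max(best, true_pos / rank)
    return best


labels = [1, 1, 0, 1, 0]
scores = [0.9, 0.8, 0.7, 0.6, 0.1]
print(precision_at_recall(labels, scores))  # 0.75
```

In the example, reaching 90% recall requires accepting all three positives, which also admits one negative, hence 3/4.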

License

Proprietary — Sonomos, Inc.
