Skip to content

Th3-Watcher/Amber

Repository files navigation

Amber — Self-Healing ML Pipelines

CI License: MIT Rust 1.70+

Immutable ledger infrastructure for ML training pipelines.
Automated anomaly recovery. Score-gated rollback. Integrity verification at the kernel level.

The ProblemHow Amber Solves ItQuick StartSelf-Healing PipelineComparisonArchitectureMCP ServerComplianceRoadmapMIT License


The Problem

ML training pipelines have no integrity guarantees. Model checkpoints — the most valuable artifacts in any training run — are stored as plain files with no protection against corruption, unauthorized modification, or silent degradation.

Existing experiment trackers (W&B, MLflow, DVC) log metadata about checkpoints. They do not enforce immutability. They do not detect anomalous weight changes. They do not automatically roll back when model quality regresses. They do not verify that the bytes on disk still match what was originally stored.

This creates a class of failures that are invisible until they cause damage:

  • Silent corruption — bit rot, GPU memory errors during saves, filesystem-level corruption
  • Unauthorized modification — automated agents, scripts, or tools modifying checkpoints without safeguards
  • Quality regression — a training step produces worse weights but overwrites the previous checkpoint with no rollback mechanism
  • Provenance loss — no cryptographic guarantee that a checkpoint was produced by a specific training run with specific configuration

In one documented case, an AI assistant given write access to a training pipeline deleted critical pathways from a verified 590M-parameter checkpoint, then fabricated the existence of a backup. The working model was permanently lost. No existing tool would have prevented this — the modification was a valid file write on an unprotected file.

Amber was built to make this class of failure impossible.


How Amber Solves It

Amber is an immutable ledger for model checkpoints. Every version of every watched file is content-addressed, cryptographically hashed, compressed, and locked at the Linux kernel level. The system continuously monitors for anomalies, enforces quality gates, and can autonomously recover from failures — creating self-healing ML pipelines that protect their own integrity.

Immutable Ledger

Every checkpoint stored in Amber is:

  • Content-addressed — named by its SHA-256 hash, automatically deduplicated
  • Compressed — zstd (level 3 for objects, level 9 for archives)
  • Kernel-locked — platform-native immutability set immediately after write. No user-space process can modify or delete stored objects without passphrase-authenticated unlock
    • Linux: FS_IMMUTABLE_FL via ioctl (chattr +i)
    • macOS: UF_IMMUTABLE via chflags (chflags uchg)
    • Windows: read-only attribute + NTFS ACL deny write/delete
  • Chain-linked — each version references its parent hash, forming a tamper-evident chain

Anomaly Detection

Checkpoint write detected (inotify)
  -> Compare to previous version size
  -> If (new / previous) < threshold:
       FLAG anomaly
       Execute on_anomaly hooks (alerts, diagnostics)
       Preserve BOTH versions immutably
       Auto-sync anomalous version to safety mirrors

Write-storm detection automatically identifies active training runs and throttles snapshots to prevent disk thrash while still capturing progress at configurable intervals.

Score-Gated Rollback

# Enforce: checkpoints must score >= 3/5 on ALU tests
# Auto-restore the last passing version if a new checkpoint fails
amber gate ./checkpoints --score-key ALU --min-score "3/5" --auto-rollback

Post-snapshot hooks can run your benchmark suite after every checkpoint write, tag the results as structured metadata, and the gate system automatically evaluates whether the new version meets quality thresholds. If it doesn't, Amber restores the last version that passed — no human intervention required.

Integrity Verification

$ amber verify
Integrity check: 847/847 passed
All objects verified OK.

Every stored object is re-read from disk, decompressed, re-hashed, and compared against its content-addressed key. Delta-stored versions are fully reconstructed and verified against the original content hash. This catches silent bit rot, filesystem corruption, and any modification that bypassed the immutability layer.


Self-Healing Pipeline

Amber's features combine into an automated self-healing loop:

Training produces checkpoint
  |
  +-- Amber stores immutably (SHA-256, zstd, chattr +i)
  |
  +-- Post-snapshot hooks run benchmark suite
  |     |
  |     +-- Results tagged as metadata: ALU=5/5, loss=0.031
  |
  +-- Gate evaluates: score >= threshold?
  |     |
  |     +-- YES: checkpoint accepted, training continues
  |     +-- NO:  auto-rollback to last passing version
  |              anomaly flagged, hooks fire alerts
  |
  +-- Anomaly detector watches for file shrink
  |     |
  |     +-- Normal: proceed
  |     +-- Shrink detected: flag, alert, preserve both versions
  |
  +-- Mirror sync: anomalous versions -> USB/remote
  +-- Remote push: rsync to backup server
  +-- Integrity verification: periodic re-hash of all objects

The pipeline heals itself. Bad checkpoints get rolled back. Corrupted files get flagged. Verified working versions are immutable. The training run continues from the last known-good state.


Comparison

W&B MLflow DVC Git LFS Amber
Version tracking Cloud DB Git-based Git-based Local daemon
Content-addressed No No Yes No SHA-256
Kernel immutability No No No No Linux + macOS + Windows
Anomaly detection No No No No Size shrink + write storm
Score-gated rollback No No No No Automated
Pre/post hooks No Limited Limited Git hooks Full pipeline integration
Integrity verification Metadata No Hash check Hash check Full re-hash from disk
Self-healing recovery No No No No Autonomous rollback
Delta compression No No No No bsdiff + zstd
Offline-first No Optional Yes No Yes
External dependencies Cloud account DB + server Git + remote Git + remote None

Quick Start

Build

git clone https://github.com/Th3-Watcher/amber.git
cd amber
cargo build --release
cp target/release/amber target/release/amberd ~/.local/bin/

Initialize

amber init          # Set unlock passphrase
amberd &            # Start daemon (or install the systemd service)

Watch & Protect

# Watch a checkpoint directory
amber watch ~/training/checkpoints

# View status
amber status

# View version history
amber log ~/training/checkpoints/model.pt
amber log ~/training/checkpoints/model.pt --since 2h

# Tag with training scores
amber tag model.pt --version a1b2c3d4 --key ALU --value "5/5"
amber tag model.pt --version a1b2c3d4 --key loss --value "0.031"

# Set up automated quality gate
amber gate ./checkpoints --score-key ALU --min-score "3/5" --auto-rollback

# Diff two versions
amber diff model.pt a1b2c3d4 e5f6a7b8

# Restore a specific version
amber restore model.pt --version a1b2c3d4

# Search across all stored versions
amber search "def forward" --path ~/training/

# Verify integrity of entire store
amber verify

# Terminal UI dashboard
amber tui

Hook Integration

[hooks]
# Run benchmark after every checkpoint, tag result
post_snapshot = [
    "python3 benchmark.py --checkpoint $AMBER_FILE --out /tmp/score.txt && amber tag $AMBER_FILE --version $AMBER_VERSION --key score --value $(cat /tmp/score.txt)"
]

# Alert on anomalous checkpoint changes
on_anomaly = [
    "echo '[AMBER] Anomaly: $AMBER_FILE shrunk to ${AMBER_SHRINK_RATIO}x' >> /var/log/training-alerts.log"
]

Environment variables available to hooks: $AMBER_FILE, $AMBER_VERSION, $AMBER_HASH, $AMBER_SIZE, $AMBER_ANOMALY, $AMBER_SHRINK_RATIO, $AMBER_PREV_SIZE.

Remote Backup

# Configure rsync remote (auto-push after every snapshot)
amber remote set user@server:/backup/amber --method rsync --auto

# USB mirror (anomalous versions auto-sync when drive connected)
amber mirror add /media/usb --mode flagged --auto --bundle

Architecture

amber-core/        Core library — 17 modules, ~3500 lines of Rust
  storage.rs        Content-addressed immutable object store
  lock.rs           FS_IMMUTABLE_FL kernel ioctls + Argon2 passphrase
  hash.rs           SHA-256 hashing + object-key sharding
  engine.rs         Write-storm detection + anomaly flagging
  gate.rs           Score parsing + automated rollback
  hooks.rs          Pre/post-snapshot hook execution
  delta.rs          bsdiff binary delta compression
  manifest.rs       Append-only bincode version manifests
  snapshot.rs       VersionEntry + CheckpointMeta + HookResult
  session.rs        Gap-based session grouping
  archive.rs        Session collapsing into tar.zst bundles
  remote.rs         rsync remote backup
  mirror.rs         USB mirror sync (flagged/watched/all)
  search.rs         Full-text search across version history
  git.rs            Auto-capture git commit labels
  config.rs         TOML configuration
  ipc.rs            Binary serde daemon protocol

amber-daemon/      Async file-watching daemon (tokio + inotify + Unix socket IPC)
amber-cli/         CLI + TUI interface (clap + ratatui)

Immutable Ledger Layout

~/.amber/store/
  objects/<xx>/<hash>       Content blobs — zstd compressed, FS_IMMUTABLE_FL locked
  deltas/<xx>/<hash>        Binary delta patches — zstd compressed, locked
  manifests/<uuid>.bin      Append-only version chains per watched path
  archives/<uuid>.tar.zst   Collapsed historical sessions

Every object is named by its SHA-256 hash. Duplicates are impossible. Modifications are impossible without kernel-level unlock. The manifest forms a hash-linked chain — any tampering breaks the chain and is detectable by amber verify.

Event Flow

File modification (inotify)
  |
  +-- SmartEngine: training mode? -> throttle to interval snapshots
  +-- SmartEngine: anomaly? -> flag + alert hooks
  |
  +-- Pre-snapshot hooks -> abort if any fail
  |
  +-- SHA-256 hash -> dedup check
  +-- Store: full copy or bsdiff delta (by size threshold)
  +-- Kernel lock: FS_IMMUTABLE_FL via ioctl
  |
  +-- Post-snapshot hooks -> benchmark, tag metadata
  +-- Gate evaluation -> rollback if score below threshold
  |
  +-- Git commit capture
  +-- Session tracking
  +-- Mirror sync (connected + auto)
  +-- Remote push (configured + auto)

Configuration

Full reference: amber.toml.example

[smart_engine]
write_storm_threshold = 5          # writes/sec triggers training mode
anomaly_shrink_ratio = 0.5         # flag if checkpoint shrinks below 50%

[hooks]
post_snapshot = ["python3 /path/to/benchmark.py --checkpoint $AMBER_FILE"]
on_anomaly = ["curl -X POST https://hooks.slack.com/... -d '{\"text\":\"Anomaly: $AMBER_FILE\"}'"]

[gate]
enabled = true
score_key = "ALU"
min_score = "3/5"
auto_rollback = true

[remote]
method = "rsync"
destination = "user@server:/backup/amber"
auto_push = true

Regulatory Compliance

The EU AI Act (2026) mandates audit trails and version control for high-risk AI systems. Amber provides:

  • Immutable audit chain — every checkpoint stored with SHA-256 hash, UTC timestamp, and parent link
  • Tamper evidence — kernel-level immutability flags; any bypass is detectable
  • Integrity verificationamber verify re-hashes every stored object on demand
  • Provenance metadata — training script hash, config hash, GPU ID, training duration
  • Retention guarantees — archive system preserves first, last, anomalous, and labelled versions

Requirements

  • LinuxFS_IMMUTABLE_FL via ioctl (ext4, xfs, btrfs)
  • macOSUF_IMMUTABLE via chflags (APFS, HFS+)
  • Windows — read-only attribute + NTFS ACL deny write/delete
  • Falls back gracefully on unsupported filesystems — versioning works, immutability is skipped
  • Rust 1.70+ for building from source
  • rsync (optional) for remote backup

Tests

cargo test    # 42 tests — hashing, deltas, storage, manifests, sessions, engine, gating, e2e

MCP Server — AI Agent Integration

Amber includes an MCP (Model Context Protocol) server that gives AI assistants direct access to checkpoint protection and context management. The tool built because an AI destroyed model weights — now integrated into AI assistants so they protect checkpoints instead of destroying them.

Setup

pip install "mcp[cli]"

Add to your Claude Code settings (.claude/settings.json):

{
  "mcpServers": {
    "amber": {
      "command": "python3",
      "args": ["/path/to/Amber/amber-mcp/amber_mcp.py"]
    }
  }
}

File Protection Tools

Tool Description
amber_status Show watched paths and protection status
amber_log View version history before modifying files
amber_snapshot Force a snapshot before making risky changes
amber_verify Verify integrity of stored checkpoints
amber_restore Restore a file to a previous version
amber_tag Tag versions with scores, phase, metadata
amber_diff Compare two versions
amber_search Search content across version history

Context Ledger Tools

Immutable, content-addressed snapshots of AI working context — every context state is hashed, chain-linked, and verifiable.

Tool Description
context_save Snapshot current working state (files, decisions, findings, constraints)
context_restore Restore a previous context state with integrity verification
context_log View context version history
context_diff See what changed between context snapshots
context_branch Branch context for exploring different approaches
context_verify Verify integrity of all stored context snapshots

Contributing

Issues, bug reports, and pull requests are welcome. If you're using Amber in your training pipeline, I'd like to hear about it — open an issue or start a discussion.

If you find a bug, please include:

  • Your filesystem type (df -T)
  • Rust version (rustc --version)
  • Steps to reproduce

Support

If Amber is useful to your work, the best way to support the project is to star the repo and share it with others working on ML infrastructure.

License

MIT

Releases

No releases published

Packages

 
 
 

Contributors