A field report from 283 experiments in autonomous ML optimization
Anthropic's recent engineering post on effective harnesses for long-running agents identifies the core challenge precisely: each new agent session begins with no memory of what came before. Their solution — a claude-progress.txt file and git history — works well for software engineering tasks where progress is linear and additive.
But autonomous experimentation is a different problem. In ML optimization, the agent isn't building toward a single goal. It's exploring a search space. And in a search space, what you've already tried is just as important as what worked.
A flat progress file can tell the next agent "we tried gradient boosting." It can't answer:
- Which of the 47 feature engineering approaches worked on which data distributions?
- What's the failure mode of target encoding on this specific dataset?
- Which hyperparameter regions have already been exhausted?
When the search space is large, a text file becomes noise. The agent drowns in history rather than learning from it.
Persistent Agent is an autonomous ML experimentation system built on a different premise: persistent structured memory, queryable by the agent itself.
Instead of a progress file, every experiment writes a structured document to MongoDB:
```json
{
  "experiment_id": "exp_0283",
  "hypothesis": "Log-transform skewed features before gradient boosting",
  "cv_score": 0.12891,
  "lb_score": 0.12634,
  "features_used": ["GrLivArea_log", "LotArea_log"],
  "model": "XGBRegressor",
  "failed": false,
  "failure_reason": null,
  "parent_experiment": "exp_0231",
  "created_at": "2026-03-15T09:23:11Z"
}
```

When a new agent session starts, it doesn't read a flat file. It queries:
```javascript
// What has already been tried?
db.experiments.find({ competition: "house-prices" }).sort({ cv_score: 1 }).limit(20)

// What's the best approach so far?
db.experiments.find({ lb_score: { $exists: true } }).sort({ lb_score: 1 }).limit(1)

// What approaches failed and why?
db.experiments.find({ failed: true }).project({ hypothesis: 1, failure_reason: 1 })
```

The agent proposes the next experiment with full awareness of everything that came before. Not as a text summary — as structured, queryable data.
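The session-start logic those queries support can be sketched in plain Python over dicts shaped like the schema above (pure Python so the logic is visible and testable; in the running system these would be MongoDB queries, and `session_context` is an illustrative name, not from the codebase):

```python
def session_context(experiments, top_k=20):
    """Gather what a fresh agent session needs before proposing anything."""
    ok = [e for e in experiments if not e["failed"]]
    # House Prices is scored on RMSE, so lower is better: sort ascending.
    top_by_cv = sorted(ok, key=lambda e: e["cv_score"])[:top_k]
    with_lb = [e for e in ok if e.get("lb_score") is not None]
    best_lb = min(with_lb, key=lambda e: e["lb_score"], default=None)
    failures = [(e["hypothesis"], e["failure_reason"])
                for e in experiments if e["failed"]]
    return {"top_by_cv": top_by_cv, "best_lb": best_lb, "failures": failures}
```

The point is that all three questions are answered from the same structured records, not from whatever a previous session chose to summarize.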
Running Persistent Agent autonomously on the Kaggle House Prices competition produced some observations that flat-file harnesses can't easily surface:
Failure modes are structured. Target encoding without proper cross-validation leaks systematically. Once that failure is stored in MongoDB, no future agent session proposes target encoding without cross-validation again. A progress file would bury this in prose.
The search space has topology. Some experiments are parents of others. MongoDB preserves this lineage. The agent can query "what experiments branched from exp_0089 and what happened to them" — essential for understanding why a promising direction dead-ended.
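That lineage query amounts to a graph traversal over the `parent_experiment` field. A pure-Python sketch of the traversal (a live deployment could express the same thing with MongoDB's `$graphLookup` aggregation stage; the function name here is illustrative):

```python
def descendants(experiments, root_id):
    """All experiments that branched, directly or transitively, from root_id."""
    by_parent = {}
    for e in experiments:
        by_parent.setdefault(e.get("parent_experiment"), []).append(e)
    found, stack = [], [root_id]
    while stack:
        parent = stack.pop()
        for child in by_parent.get(parent, []):
            found.append(child)
            stack.append(child["experiment_id"])  # follow the branch further
    return found
```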
CV/LB divergence is detectable. By storing both cross-validation scores and leaderboard scores, Persistent Agent can detect when a model overfits to the validation set. This pattern — invisible in a text log — becomes a queryable signal.
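One way that signal could be computed from the stored score pairs (the 0.01 gap threshold is an arbitrary illustration, not a value from the system):

```python
def overfit_suspects(experiments, gap_threshold=0.01):
    """Experiments whose leaderboard RMSE is notably worse than their CV RMSE."""
    suspects = []
    for e in experiments:
        if e["failed"] or e.get("lb_score") is None:
            continue
        gap = e["lb_score"] - e["cv_score"]  # positive gap: worse on the leaderboard
        if gap > gap_threshold:
            suspects.append((e["experiment_id"], round(gap, 5)))
    return suspects
```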
Plateau detection requires history. After 283 experiments, the AnalyzePlateau module queries the last N experiments and detects when marginal improvement has stalled. This drives the decision to explore vs. exploit — a decision that requires structured history, not a summary.
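The explore-vs-exploit decision can be sketched as a check over the CV score history. Window size and minimum-gain threshold below are illustrative; the actual AnalyzePlateau parameters aren't shown in this post:

```python
def plateaued(cv_scores, window=20, min_gain=0.001):
    """True when the last `window` experiments failed to beat the prior best
    by at least `min_gain` (scores are RMSE, so lower is better)."""
    if len(cv_scores) <= window:
        return False  # not enough history to judge
    best_before = min(cv_scores[:-window])
    best_recent = min(cv_scores[-window:])
    return best_before - best_recent < min_gain
```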
```
┌───────────────────────────────────────────┐
│             Persistent Agent              │
│                                           │
│  Sidekiq Worker                           │
│        │                                  │
│        ▼                                  │
│  Claude Code (Proposer) ◄────────────┐    │
│        │  reads experiment history   │    │
│        │  from MongoDB via tool      │    │
│        ▼                             │    │
│  DSL Experiment Spec                 │    │
│        │                             │    │
│        ▼                             │    │
│  Python Runner                       │    │
│        │  executes ML pipeline       │    │
│        ▼                             │    │
│  MongoDB (Persistent Memory) ────────┘    │
│    stores result, scores,                 │
│    features, failure reasons              │
└───────────────────────────────────────────┘
```
The key design decision: Claude Code has a MongoDB query tool. It's not summarized for the agent — the agent queries it directly. This means the agent's awareness of history is limited only by what it thinks to ask, not by what a previous agent thought to write down.
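What such a tool could look like declared through Anthropic's tool-use API. The tool name and parameter set below are assumptions for illustration; only the pattern (the agent issues find-style queries directly) comes from the post:

```python
# Hypothetical tool definition passed in the `tools` parameter of a
# Messages API call; the harness dispatches tool_use blocks to MongoDB.
query_tool = {
    "name": "query_experiments",
    "description": (
        "Run a read-only find against the experiments collection. "
        "Returns matching documents as JSON."
    ),
    "input_schema": {
        "type": "object",
        "properties": {
            "filter": {"type": "object", "description": "MongoDB query filter"},
            "sort": {"type": "object", "description": 'e.g. {"cv_score": 1}'},
            "limit": {"type": "integer", "default": 20},
        },
        "required": ["filter"],
    },
}
```

Keeping the tool read-only matters here: the proposer explores history freely, but only the runner writes results back.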
| | Anthropic's Harness | Persistent Agent / MongoDB |
|---|---|---|
| Memory format | Flat text file | Structured documents |
| Queryable | No | Yes |
| Scales with history | Degrades as the file grows | Indexed queries stay fast |
| Failure analysis | Prose | Structured queries |
| Search space topology | Not captured | Parent/child lineage |
| Best for | Linear build tasks | Iterative search tasks |
Neither approach is universally better. For building a web app toward a known goal, a progress file is sufficient and simpler. For open-ended search across a large experiment space, structured memory becomes essential.
Anthropic's post notes that it's unclear whether a single general-purpose agent or specialized agents perform better across contexts. Persistent Agent uses a single proposer (Claude Code) but the MongoDB schema implicitly creates specialization — the proposer behaves differently when querying feature engineering history vs. model selection history.
A natural extension: specialized query agents that pre-process history into focused context before the proposer runs. A "what feature engineering has been tried" agent that summarizes the relevant subset, rather than exposing raw MongoDB queries to the proposer.
```shell
git clone https://github.com/georgeu2000/persistent-agent
cd persistent-agent
```