Skip to content

AutoX-AI-Labs/AutoR

Folders and files

NameName
Last commit message
Last commit date

Latest commit

Β 

History

97 Commits
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 

Repository files navigation

AutoR: A Human-Centered Research OS

AI handles execution. Humans own the direction.

A terminal-first research harness that turns long, messy research work into reproducible, artifact-backed runs.

Python 3.10+ Intake plus 8 stages Terminal-first Human approval required Agent harness Reproducible research runs GitHub stars

Overview Β· Showcase Β· Quick Start Β· How It Works Β· Run Layout Β· Architecture Β· Roadmap

AutoR example figure


AutoR is not a chat demo, not a generic agent framework, and not a markdown-only research toy.

It is a structured research harness over a coding agent execution layer: AI handles execution, humans own the direction, and every run becomes an inspectable research artifact on disk.

Overview

Most autoresearch systems optimize for autonomy.

AutoR takes a different position: research is too important to hand over as a blind end-to-end loop. The goal is not to remove humans from research. The goal is to give them a stronger execution system.

At a Glance

Dimension AutoR
Execution model A coding agent as the execution layer, AutoR as the research control loop
Control model Human approval required after every stage
Research unit A reproducible run under runs/<run_id>/
Workflow shape Optional intake plus a fixed 8-stage pipeline
Quality bar Artifact-backed outputs, not markdown-only summaries
Recovery Resume, redo-stage, rollback-stage, stage-local continuation

Highlights

Layer Highlight What AutoR actually does
Big idea Human-centered research execution AutoR is not an autonomous scientist. AI handles execution; humans retain approval and direction at every stage boundary.
Big idea Research loop over agent loop The system manages stage progression, validation, repair, recovery, and human checkpoints above the lower-level agent execution loop.
Big idea Every run is a reproducible research artifact Each run leaves behind prompts, logs, approved summaries, code, data, figures, writing sources, and packaged outputs under runs/<run_id>/.
Big idea Verifiable outputs, not paper-shaped theater The workflow is judged by inspectable artifacts and human approval, not by whether a generated document merely looks polished.
Useful feature Structured literature organization Survey notes, bibliographies, related-work tables, and reading artifacts stay under workspace/literature/ instead of disappearing into chat history.
Useful feature Automated experiment manifests Machine-readable experiment and result files make runs inspectable, comparable, and reusable downstream.
Useful feature Citation verification and writing checks Writing expects citation verification, build logs, and self-review artifacts before Stage 07 is considered complete.
Useful feature Artifact indexing across stages artifact_index.json and related manifests help later stages find data, results, and figures without guessing from filenames.
Useful feature Resume, redo, and rollback controls Long research runs can continue in place, retry a stage, or roll downstream state back without starting over.
Useful feature Venue-aware packaging AutoR can package manuscript sources, PDFs, review materials, and release-ready artifacts instead of stopping at markdown summaries.

In practice, that means AutoR is useful not only because of the high-level framing, but also because it handles real research chores: literature organization, experiment manifests, citation verification, artifact indexing, manuscript packaging, and recoverable long-running workflows.

What AutoR Guarantees

  • Human approval is required before the workflow advances.
  • Approved summaries become the only cross-stage memory.
  • Every run is isolated, resumable, and auditable.
  • Later stages must produce real artifacts, not only prose.
  • A coding agent is the execution layer; AutoR is the research control loop above it.

Why AutoR?

Many systems aim to generate research outputs that look ready.

AutoR takes a harder path:

  • it requires real experiments
  • it enforces artifact validation
  • it keeps humans in control

So the question is not:

Does it look ready?

It is:

Can you verify every part of it?

🌟 Showcase

AutoR already has a full example run used throughout the repository: runs/20260330_101222.

Example Run Snapshot

What the run produced What it demonstrates
example_paper.pdf A compiled manuscript artifact within a broader research package
Executable research code The run is not just a writing pipeline
Machine-readable datasets and result files Claims are backed by inspectable experiment outputs
Real figures used in the research package The run produces publication-style visuals, not placeholders
Review and dissemination materials The workflow continues past writing into release readiness

Highlighted outcomes from that run:

  • AGSNv2 reached 36.21 Β± 1.08 on Actor.
  • The system produced a full research package with real figures, writing sources, and auditable artifacts.
  • The final run preserved the full human-in-the-loop approval trail.

Terminal Experience

AutoR is designed for terminal-first execution, but the interaction layer is not limited to raw logs and plain prompts. The current UI supports banner-style startup, colored stage panels, parsed Claude event streams, wrapped markdown summaries, and a menu-driven approval loop suitable for demos and recordings.

AutoR terminal UI

Example Figures

Accuracy Comparison
Example accuracy figure
Ablation + Actor Results
Example ablation figure
Two-Layer Narrative Figure
Two-layer narrative figure

Research Output Gallery

The manuscript pages below are only the visible surface of larger AutoR runs. To keep the showcase compact and comparable, this gallery uses a consistent 4 Γ— 2 layout: four artifact-backed research outputs, two representative pages from each, and a short note on what each run is demonstrating.

Output 1
A complete end-to-end AutoR run. The pair below shows the opening manuscript page and a later evidence-heavy page where algorithm, tables, and quantitative results appear together.
Output 1 page 1
Page 1
Output 1 evidence page
Evidence Page
Output 2
Do More Experts Help? A parameter-matched MoE-LoRA study. The selected pages show the framing page and a chart-heavy evidence page.
Output 2 page 1
Page 1
Output 2 evidence page
Evidence Page
Output 3
Attention Sink Onset in Tiny Transformers A controlled factorial study. The chosen pages show the opening page and a later structured overview page with visual decomposition.
Output 3 page 1
Page 1
Output 3 overview page
Overview Page
Output 4
HSOD: Harmonic Spectral Operator Decomposition A stability-focused time-series study. The pair below shows the framing page and a later page with dense training-dynamics plots.
Output 4 page 1
Page 1
Output 4 analysis page
Analysis Page

Human-in-the-Loop in Practice

The example run is interesting not because the AI was left alone, but because the human intervened at critical moments:

  • Stage 02 narrowed the project to a single core claim.
  • Stage 04 pushed the system to download real datasets and run actual pre-checks.
  • Stage 05 forced experimentation to continue until real benchmark results were obtained.
  • Stage 06 redirected the story away from leaderboard-only framing toward mechanism-driven analysis.

That is the intended shape of AutoR: AI handles execution load; humans steer the research when direction actually matters.

πŸš€ Quick Start

Prerequisites

  • Python 3.10+
  • Claude CLI available on PATH for real runs
  • Local TeX tools are helpful for Stage 07, but not required for smoke tests
  • For --research-diagram (Gemini-generated method illustration inserted into the LaTeX paper):
    • pip install google-genai (the google.genai SDK is not a default dependency; if it is missing the diagram step prints Diagram generation failed: No module named 'google' and the rest of the run continues unaffected)
    • A Gemini API key exposed via GOOGLE_API_KEY or GEMINI_API_KEY, or a local configs/diagram_config.yaml (see configs/diagram_config.template.yaml)

Common Commands

Goal Command
Start a new run python main.py
Start with an explicit goal python main.py --goal "Your research goal here"
Start with preloaded resources python main.py --goal "Your research goal here" --resources paper.pdf refs.bib data.csv
Run a local smoke test without Claude python main.py --fake-operator --goal "Smoke test"
Choose a Claude model python main.py --model sonnet or python main.py --model opus
Choose a writing venue profile python main.py --venue neurips_2025 or python main.py --venue nature or python main.py --venue jmlr
Resume the latest run python main.py --resume-run latest
Redo a stage inside the same run python main.py --resume-run 20260329_210252 --redo-stage 03
Roll back to a stage inside the same run python main.py --resume-run 20260329_210252 --rollback-stage 03

If --venue is omitted, AutoR defaults to neurips_2025.

Valid stage identifiers include 03, 3, and 03_study_design.

βš™οΈ How It Works

AutoR uses an optional intake step followed by a fixed 8-stage pipeline:

  1. 00_intake (optional)

  2. 01_literature_survey

  3. 02_hypothesis_generation

  4. 03_study_design

  5. 04_implementation

  6. 05_experimentation

  7. 06_analysis

  8. 07_writing

  9. 08_dissemination

flowchart TD
    A[Start or resume run] --> G0{Skip intake?}
    G0 -- Yes --> S1[01 Literature Survey]
    G0 -- No --> I0[00 Intake]
    I0 --> H0{Human approval}
    H0 -- Refine --> I0
    H0 -- Approve --> S1[01 Literature Survey]
    H0 -- Abort --> X[Abort]

    S1 --> H1{Human approval}
    H1 -- Refine --> S1
    H1 -- Approve --> S2[02 Hypothesis Generation]
    H1 -- Abort --> X[Abort]

    S2 --> H2{Human approval}
    H2 -- Refine --> S2
    H2 -- Approve --> S3[03 Study Design]
    H2 -- Abort --> X

    S3 --> H3{Human approval}
    H3 -- Refine --> S3
    H3 -- Approve --> S4[04 Implementation]
    H3 -- Abort --> X

    S4 --> H4{Human approval}
    H4 -- Refine --> S4
    H4 -- Approve --> S5[05 Experimentation]
    H4 -- Abort --> X

    S5 --> H5{Human approval}
    H5 -- Refine --> S5
    H5 -- Approve --> S6[06 Analysis]
    H5 -- Abort --> X

    S6 --> H6{Human approval}
    H6 -- Refine --> S6
    H6 -- Approve --> S7[07 Writing]
    H6 -- Abort --> X

    S7 --> H7{Human approval}
    H7 -- Refine --> S7
    H7 -- Approve --> S8[08 Dissemination]
    H7 -- Abort --> X

    S8 --> H8{Human approval}
    H8 -- Refine --> S8
    H8 -- Approve --> Z[Run complete]
    H8 -- Abort --> X
Loading

Stage Attempt Loop

flowchart TD
    A[Build prompt from template + goal + memory + optional feedback] --> B[Start or resume stage session]
    B --> C[Claude writes draft stage summary]
    C --> D[Validate markdown and required artifacts]
    D --> E{Valid?}
    E -- No --> F[Repair, normalize, or rerun current stage]
    F --> A
    E -- Yes --> G[Present validated draft for human review]
    G --> H{Human choice}
    H -- 1 or 2 or 3 --> I[Continue current stage conversation with AI refinement]
    I --> A
    H -- 4 --> J[Continue current stage conversation with custom feedback]
    J --> A
    H -- 5 --> K[Promote approved summary and append to memory.md]
    K --> L[Continue to next stage]
    H -- 6 --> X[Abort]
Loading

Approval semantics

  • 1 / 2 / 3: continue the same stage conversation using one of the AI's refinement suggestions
  • 4: continue the same stage conversation with custom user feedback
  • 5: approve and continue to the next stage
  • 6: abort the run

The stage loop is controlled by AutoR, not by Claude.

βœ… Validation Bar

AutoR does not consider a run successful just because it generated a plausible markdown summary.

Stage Required non-toy output
Stage 03+ Machine-readable data under workspace/data/
Stage 05+ Machine-readable results under workspace/results/
Stage 06+ Real figure files under workspace/figures/
Stage 07+ Manuscript sources plus a compiled PDF
Stage 08+ Review and readiness assets under workspace/reviews/

Required stage summary shape:

# Stage X: <name>

## Objective
## Previously Approved Stage Summaries
## What I Did
## Key Results
## Files Produced
## Suggestions for Refinement
## Your Options

Additional rules:

  • exactly 3 numbered refinement suggestions
  • the fixed 6 user options
  • no [In progress], [Pending], [TODO], [TBD], or similar placeholders
  • concrete file paths in Files Produced

If a run only leaves behind markdown notes, it has not met AutoR's quality bar.

πŸ“‚ Run Layout

Every run lives entirely inside its own directory.

runs/<run_id>/
β”œβ”€β”€ user_input.txt
β”œβ”€β”€ memory.md
β”œβ”€β”€ run_config.json
β”œβ”€β”€ run_manifest.json
β”œβ”€β”€ artifact_index.json
β”œβ”€β”€ intake_context.json
β”œβ”€β”€ logs.txt
β”œβ”€β”€ logs_raw.jsonl
β”œβ”€β”€ prompt_cache/
β”œβ”€β”€ operator_state/
β”œβ”€β”€ handoff/
β”œβ”€β”€ stages/
└── workspace/
    β”œβ”€β”€ literature/
    β”œβ”€β”€ code/
    β”œβ”€β”€ data/
    β”œβ”€β”€ results/
    β”œβ”€β”€ writing/
    β”œβ”€β”€ figures/
    β”œβ”€β”€ artifacts/
    β”œβ”€β”€ notes/
    └── reviews/

Workspace Directory Semantics

  • literature/: reading notes, survey tables, benchmark notes
  • code/: runnable code, scripts, configs, implementations
  • data/: machine-readable data and manifests
  • results/: machine-readable experiment outputs
  • writing/: LaTeX sources, sections, bibliography, tables
  • figures/: real plots and paper figures
  • artifacts/: compiled PDFs and packaged deliverables
  • notes/: temporary or supporting research notes
  • reviews/: readiness, critique, and dissemination materials

🧠 Execution Model

For each stage attempt, AutoR assembles a prompt from:

  1. the stage template from src/prompts/
  2. the required stage summary contract
  3. execution-discipline constraints
  4. user_input.txt
  5. approved memory.md
  6. intake_context.json, artifact_index.json, and, when available, experiment_manifest.json
  7. optional refinement feedback
  8. for continuation attempts, the current draft/final stage files and workspace context

The assembled prompt is written to runs/<run_id>/prompt_cache/, per-stage session IDs are stored in runs/<run_id>/operator_state/, and Claude is invoked in live streaming mode.

Exact Claude CLI pattern

First attempt for a stage:

claude --model <model> \
  --permission-mode bypassPermissions \
  --dangerously-skip-permissions \
  --session-id <stage_session_id> \
  -p @runs/<run_id>/prompt_cache/<stage>_attempt_<nn>.prompt.md \
  --output-format stream-json \
  --verbose

Continuation attempt for the same stage:

claude --model <model> \
  --permission-mode bypassPermissions \
  --dangerously-skip-permissions \
  --resume <stage_session_id> \
  -p @runs/<run_id>/prompt_cache/<stage>_attempt_<nn>.prompt.md \
  --output-format stream-json \
  --verbose

Important behavior:

  • refinement attempts reuse the same stage conversation whenever possible
  • streamed Claude output is shown live in the terminal
  • raw stream-json output is captured in logs_raw.jsonl
  • if resume fails, AutoR can fall back to a fresh session
  • if stage markdown is incomplete, AutoR can repair or normalize it locally

πŸ—οΈ Architecture

The main code lives in:

flowchart LR
    A[main.py] --> B[src/manager.py]
    B --> C[src/operator.py]
    B --> D[src/intake.py]
    B --> E[src/manifest.py]
    B --> F[src/artifact_index.py]
    B --> G[src/experiment_manifest.py]
    B --> H[src/utils.py]
    B --> I[src/writing_manifest.py]
    B --> J[src/platform/foundry.py]
    B --> K[src/prompts/*]
    C --> H
Loading

File boundaries:

  • main.py: CLI entry point. Starts a new run, resumes an existing run, collects resources, and exposes redo/rollback controls.
  • src/manager.py: Owns intake plus the 8-stage loop, approval flow, repair flow, resume/redo/rollback logic, and stage-level continuation policy.
  • src/operator.py: Invokes Claude CLI, streams output live, persists stage session state, resumes the same stage conversation for refinement, and falls back to a fresh session if resume fails.
  • src/intake.py: Resource ingestion, intake context persistence, and prompt formatting for preloaded materials.
  • src/manifest.py: Lightweight run lifecycle state, stage status tracking, and rollback/stale invalidation.
  • src/artifact_index.py: Run-wide artifact indexing over data, results, and figures.
  • src/experiment_manifest.py: Standardized experiment bundle summary used by later stages.
  • src/utils.py: Stage metadata, prompt assembly, run paths, markdown validation, artifact validation, and handoff helpers.
  • src/prompts/: Per-stage prompt templates.

πŸ—‚οΈ Run State

Each run contains user_input.txt, memory.md, run_manifest.json, artifact_index.json, prompt_cache/, operator_state/, stages/, workspace/, logs.txt, and logs_raw.jsonl. The substantive research payload lives in workspace/.

flowchart TD
    A[workspace/] --> B[literature/]
    A --> C[code/]
    A --> D[data/]
    A --> E[results/]
    A --> F[writing/]
    A --> G[figures/]
    A --> H[artifacts/]
    A --> I[notes/]
    A --> J[reviews/]
Loading

Workspace directories:

  • literature/: papers, benchmark notes, survey tables, reading artifacts.
  • code/: runnable pipeline code, scripts, configs, and method implementations.
  • data/: machine-readable datasets, manifests, processed splits, caches, and loaders.
  • results/: machine-readable metrics, predictions, ablations, tables, and evaluation outputs. AutoR also standardizes results/experiment_manifest.json as a machine-readable summary over result, code, and note artifacts for downstream analysis.
  • writing/: manuscript sources, LaTeX, section drafts, tables, and bibliography.
  • figures/: plots, diagrams, charts, and paper figures.
  • artifacts/: compiled PDFs and packaged deliverables.
  • notes/: temporary notes and setup material.
  • reviews/: critique notes, threat-to-validity notes, and readiness reviews.

Other run state:

  • memory.md: approved cross-stage memory only.
  • run_manifest.json: machine-readable run and stage lifecycle state.
  • artifact_index.json: machine-readable index over workspace/data, workspace/results, and workspace/figures.
  • prompt_cache/: exact prompts used for stage attempts and repairs.
  • operator_state/: per-stage Claude session IDs.
  • stages/: draft and promoted stage summaries.
  • logs.txt and logs_raw.jsonl: workflow logs and raw Claude stream output.

βœ… Validation

AutoR validates both the stage markdown and the stage artifacts.

Required stage markdown shape:

# Stage X: <name>

## Objective
## Previously Approved Stage Summaries
## What I Did
## Key Results
## Files Produced
## Suggestions for Refinement
## Your Options

Additional markdown requirements:

  • Exactly 3 numbered refinement suggestions.
  • The fixed 6 user options.
  • No unfinished placeholders such as [In progress], [Pending], [TODO], or [TBD].
  • Concrete file paths in Files Produced.

Artifact requirements by stage:

  • Stage 03+: machine-readable data under workspace/data/
  • Stage 05+: machine-readable results under workspace/results/
  • Stage 05+: workspace/results/experiment_manifest.json must exist and remain structurally valid
  • Stage 06+: figure files under workspace/figures/
  • Stage 07+: venue-aware conference or journal-style LaTeX sources plus a compiled PDF under workspace/writing/ or workspace/artifacts/
  • Stage 08+: review and readiness artifacts under workspace/reviews/

A run with only markdown notes does not pass validation.

πŸ“Œ Scope

Included in the current mainline

  • optional intake stage and resource ingestion
  • fixed 8-stage workflow
  • mandatory human approval after every stage
  • Claude Code as the execution layer
  • stage-local continuation within the same Claude session
  • prompt caching via @file
  • live streaming terminal output
  • repair passes and local fallback normalization
  • run manifest, rollback, and stale tracking
  • artifact index and experiment manifest
  • stage handoff context
  • manuscript/release package generation after approval
  • artifact-aware validation
  • resume, --redo-stage, and --rollback-stage
  • lightweight venue profiles for Stage 07 writing

Intentionally out of scope

  • generic multi-agent orchestration
  • database-backed runtime state
  • concurrent stage execution
  • heavyweight platform abstractions
  • dashboard-first productization

πŸ›£οΈ Roadmap

The most valuable next steps are the ones that make AutoR more like a real research workflow, not more like a demo framework.

Next step Why it matters
Deeper cross-stage rollback and invalidation Make downstream stale-state handling stronger and more explicit after earlier-stage changes.
Stronger machine-readable run state Extend the current run manifest into a better source of truth for stage status, stale dependencies, and artifact pointers.
Continuation handoff compression Make long stage refinement more stable without bloating context.
Stronger automated tests Cover repair flow, resume fallback, artifact validation, and approval-loop correctness more deeply.
Richer artifact indexing Extend metadata around data/, results/, figures/, and writing/ without turning AutoR into a heavy platform.
Frontend run browser Add a lightweight UI for browsing runs, stages, logs, and artifacts directly from the run directory.

Implemented milestone:

  • Stage-local continuation sessions. Keep one Claude conversation per stage, reuse it for 1/2/3/4 refinement, and fall back to a fresh session only when resume fails. This is now implemented in the operator and manager flow.
  • Artifact-level validation for non-toy outputs. Enforce machine-readable data, result files, figures, LaTeX sources, PDF output, and review artifacts at the right stages. This is now part of the workflow validation path.
Expanded roadmap notes
  • Cross-stage rollback and invalidation. When a later stage reveals that an earlier design decision is wrong, the workflow should be able to jump back to an earlier stage and mark downstream stages as stale. This is the biggest current control-flow gap.
  • Machine-readable run manifest. Add a single source of truth such as run_manifest.json to track stage status, approval state, stale dependencies, session IDs, and key artifact pointers. This should make both automation and future UI work much cleaner.
  • Continuation handoff compression. Add a short machine-generated stage handoff file that summarizes what is already correct, what is missing, and which files matter most. This should reduce context growth and make continuation more stable over long runs.
  • Result schema and artifact indexing. Standardize workspace/data/, workspace/results/, and workspace/figures/ around explicit schemas and generate an artifact index automatically. The workflow now writes artifact_index.json, carries basic inferred or declared schema metadata, and feeds the index into later-stage prompt context and the writing manifest.
  • Writing pipeline hardening. Turn Stage 07 into a reliable manuscript production pipeline with stable conference and journal-style writing structures, bibliography handling, table and figure inclusion, and reproducible PDF compilation. The goal is a submission-grade research package, not just writing notes.
  • Review and dissemination package. Expand Stage 08 so it produces readiness checklists, threats-to-validity notes, artifact manifests, release notes, and external-facing research bundles. The final stage should feel like packaging a verifiable research release, not just wrapping up text.
  • Frontend run dashboard. Build a lightweight UI that can browse runs, stage status, summaries, logs, artifacts, and validation failures. It should read from the run directory and manifest rather than introducing a database first.
  • README and open-source assets. Keep refining the README and add assets/ images such as workflow diagrams, UI screenshots, and artifact examples. This is important for open-source clarity, onboarding, and project presentation.

🌍 Community

Join the project community channels:

Discord WeChat WhatsApp
Discord QR WeChat QR WhatsApp QR

⭐ Star History

Star History Chart

About

AI handles execution, humans own the direction, and every run becomes an inspectable research artifact on disk.

Topics

Resources

Stars

Watchers

Forks

Releases

No releases published

Packages

 
 
 

Contributors

Languages