English | 简体中文
Skill Forge (技能锻造器) is a signed-approval skill pipeline for self-improving LLM agents. Detect capability gaps from logs, scaffold by profile, validate, and require an HMAC-bound approval token before any install or update mutates state.
Current version: v1.4.0 "Evolve First".
Most "self-improving agent" tooling lets the agent silently mutate its own skill set. That works in demos and breaks in production: a single bad skill silently overwrites a good one, no audit trail, no way to roll back, no way to tell whether the new skill is actually better than the one it replaced.
Skill Forge separates detect → generate → validate → evaluate → install → evolve into a signed pipeline:
- Auto-detection from `.learnings`, error logs, and feature requests so capability gaps surface before they become recurring failures.
- Profile-aware scaffolding for academic / product / integration / script / workflow, with bundled references and per-profile quality gates.
- Quantitative validator scores frontmatter, sections, triggers, references, and profile-specific behavior — milestone grade requires real structure, not vibes.
- Signed approval gate: every install or replace requires an HMAC-SHA256 token bound to the skill name, target path, install method, source path, and a SHA-256 hash of the source directory. Tamper with the candidate after approval → blocked. Direct `--apply` calls are blocked by design.
- Replay non-regression check: when an installed copy already exists, the candidate is scored against the same redacted feedback cases as the baseline; the install is blocked when the candidate regresses.
- Evolve-first routing (opt-in via `--prefer-evolve`): before forging a brand-new skill, ask whether an installed skill should evolve instead — same profile, sufficient trigger / topic / name overlap. Prevents skill-set sprawl ("12 near-duplicate citation skills") while the signed gate + replay regression check still guard the actual mutation.
- Audit trail in a `flock`-protected, atomically written manifest with full version history and rollback.
- Privacy-first feedback: the redactor strips emails, JWTs, PEM private keys, Telegram bot tokens, common cloud / Git / Slack token prefixes, IPv4/v6 addresses, and separator-bearing phone numbers before anything touches disk (see the sketch after this list).
- Deterministic release benchmark with validation, hidden-eval, line-count, and runtime budgets so quality gates don't drift between releases.
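To make the redaction step concrete, here is a minimal sketch. The patterns and placeholder names are illustrative assumptions, not the shipped rule set (which also covers PEM private keys, cloud / Git / Slack token prefixes, and IPv6):

```python
import re

# Illustrative patterns only; the real redactor covers more token families.
PATTERNS = [
    (re.compile(r"[\w.+-]+@[\w-]+\.[\w.-]+"), "<email>"),                      # emails
    (re.compile(r"\b\d{6,10}:[A-Za-z0-9_-]{30,}\b"), "<telegram-bot-token>"),  # bot tokens
    (re.compile(r"\beyJ[\w-]+\.[\w-]+\.[\w-]+\b"), "<jwt>"),                   # JWTs
    (re.compile(r"\b(?:\d{1,3}\.){3}\d{1,3}\b"), "<ipv4>"),                    # IPv4
    (re.compile(r"\+?\d{1,3}[-. ]\d{2,4}[-. ]\d{3,4}(?:[-. ]\d{2,4})?"), "<phone>"),
]

def redact(text: str) -> str:
    """Replace sensitive substrings before feedback ever touches disk."""
    for pattern, placeholder in PATTERNS:
        text = pattern.sub(placeholder, text)
    return text
```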
After installing, run the safe plan-only smoke test:

```bash
python3 skill-forge/scripts/skill_forge.py demo --json
```

Expected:

- detects the bundled `citation-helper` example
- classifies it as `academic`
- generates a candidate skill at `./generated-forge-demo/citation-helper`
- validates it as `grade=milestone`
- prints an install plan and exits with `install_status=planned`
- does not mutate any installed skills
To install with real Telegram approval:

```bash
export TELEGRAM_BOT_TOKEN="..."
export TELEGRAM_CHAT_ID="..."
python3 skill-forge/scripts/skill_forge.py forge \
  --source skill-forge/examples/sample-feature-requests.md \
  --source skill-forge/examples/sample-errors.md \
  --output ./generated \
  --install telegram \
  --agent-name StudyAgent \
  --json
```

The pipeline sends a Telegram approval message bound to the candidate's content hash, waits for your reply (yes / 同意安装), then verifies the signed token before any file mutation.
From ClawHub:

```bash
clawhub install skills-forge
clawhub install somnia   # optional nightly-review companion
```

The ClawHub package slug is `skills-forge` (the `skill-forge` slug is taken by another publisher); the skill metadata name remains `skill-forge`. ClawHub may show a safety warning because this skill includes Telegram approval calls and local install/rollback scripts — review the source first.
For local development, symlink this repository into your OpenClaw workspace:

```bash
mkdir -p ~/.openclaw/workspace/skills
ln -sfn "$PWD/skill-forge" ~/.openclaw/workspace/skills/skill-forge
ln -sfn "$PWD/../somnia" ~/.openclaw/workspace/skills/somnia   # if you cloned somnia next to this
```

Configure the Telegram approval channel:

```bash
cp .env.example ~/.openclaw/skill-forge.env
# Edit ~/.openclaw/skill-forge.env and set TELEGRAM_BOT_TOKEN and TELEGRAM_CHAT_ID.
```
Verify readiness:

```bash
python3 skill-forge/scripts/skill_forge.py doctor \
  --env-file ~/.openclaw/skill-forge.env --json
```

How the signed install gate fits together:

```
┌─────────────────────┐
│ skill candidate dir │ ─┐
└─────────────────────┘  │ sha256(dir contents) ─┐
           │                                     │
┌──────────▼─────────────────┐                   │
│ telegram_approval.py       │                   │
│ - sends Telegram message   │                   │
│ - waits for reply          │                   │
│ - HMAC-signs token over:   │ ──────────────────┤
│     skill, target,         │                   │
│     method, source_dir,    │                   │
│     source_hash, mode,     │                   │
│     issued_at, expires_at  │                   │
└──────────────┬─────────────┘                   │
               │ approval_token                  │
┌──────────────▼─────────────┐                   │
│ propose_skill_install.py   │                   │
│   --apply --approval-token │                   │
│ - verify HMAC              │                   │
│ - verify expiry            │                   │
│ - verify every claim       │ ◄─────────────────┘
│   (recompute source_hash)  │
└──────────────┬─────────────┘
               │ all checks pass
               ▼
    atomic + flock-protected
         manifest write
```
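Because `--apply` recomputes `source_hash`, any edit to the candidate after approval invalidates the token. A deterministic directory hash can be built along these lines (a sketch assuming sorted relative paths plus file bytes; not the shipped implementation):

```python
import hashlib
from pathlib import Path

def hash_skill_dir(root: str) -> str:
    """SHA-256 over a directory: sorted relative paths, then file bytes."""
    digest = hashlib.sha256()
    base = Path(root)
    for path in sorted(p for p in base.rglob("*") if p.is_file()):
        rel = path.relative_to(base).as_posix()
        digest.update(rel.encode("utf-8"))  # bind each file's content to its path
        digest.update(b"\x00")              # separators keep boundaries unambiguous
        digest.update(path.read_bytes())
        digest.update(b"\x00")
    return digest.hexdigest()
```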
There is no `--approved-by-telegram` boolean. The token is the gate. The HMAC secret comes from `SKILL_FORGE_APPROVAL_SECRET` if set, otherwise it is derived deterministically from the bot token (so any environment that can reach the bot can verify tokens issued there).

A `mode=dry-run` token (issued via `--telegram-dry-run approve`) is refused by `--apply` unless `--allow-dry-run-install` is passed; the audit field then records `approved_by=dry-run`, not `telegram`. Pipelines surface this as `install_status=dry-run-blocked`.
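In pseudocode terms, the gate boils down to the sketch below. The token layout, derivation salt, and helper names are assumptions for illustration; only the behavior (explicit secret wins, bot-token-derived fallback, constant-time MAC check, expiry check, claim-by-claim verification) is taken from the protocol above:

```python
import hashlib
import hmac
import json
import os
import time

def approval_secret() -> bytes:
    """SKILL_FORGE_APPROVAL_SECRET wins; otherwise derive from the bot token."""
    explicit = os.environ.get("SKILL_FORGE_APPROVAL_SECRET")
    if explicit:
        return explicit.encode("utf-8")
    bot = os.environ["TELEGRAM_BOT_TOKEN"]
    return hashlib.sha256(b"skill-forge:" + bot.encode("utf-8")).digest()

def verify_token(token: dict, expected: dict) -> bool:
    """Recompute the MAC over canonical claims, then re-check every claim."""
    claims = token["claims"]
    mac = hmac.new(approval_secret(),
                   json.dumps(claims, sort_keys=True).encode("utf-8"),
                   hashlib.sha256).hexdigest()
    if not hmac.compare_digest(mac, token["sig"]):
        return False  # forged or tampered token
    if time.time() > claims["expires_at"]:
        return False  # approval expired
    # Every claim must match what --apply is about to do,
    # including a freshly recomputed source_hash of the candidate dir.
    return all(claims.get(key) == value for key, value in expected.items())
```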
How evolve-first routing decides:

```
             ┌────────────────────────────┐
detect ────→ │   decide_skill_action.py   │
opportunity  │ for each installed skill:  │
             │ - profile match? (gate)    │
             │ - trigger overlap          │
             │ - topic overlap            │
             │ - name overlap             │
             │ - exact-name short-        │
             │   circuit → fitness 1.0    │
             └─────────────┬──────────────┘
                           │
        ┌──────────────────┼──────────────────┐
        ▼                  ▼                  ▼
fitness ≥ threshold   0.4–threshold    fitness < threshold
AND clear winner      OR top-2 within  OR no candidates
        │             ambiguous_margin      │
        ▼                  ▼                  ▼
route to evolve       ambiguous (forge   forge new skill
+ write synthetic     today, surfaces in (default scaffold flow)
forge-routing         payload for review)
feedback entry
```
Pass `--prefer-evolve` to `forge` (or call `decide` directly) to enable routing. The flag defaults to off in v1.x for backward compatibility.
The router prevents skill sprawl when a request is really just a new trigger surface for something already installed. If it routes wrong, the replay non-regression gate downstream catches it — the candidate is scored against the same baseline cases and the install is blocked when fitness regresses. Worst case is a wasted scaffold cycle; the manifest never sees a bad install.
When evolve runs against a skill that's already installed, the candidate and the installed baseline are scored against the same redacted replay cases. The install is blocked when:

```
candidate_score - baseline_score < --min-replay-improvement   # default 0
```

Replay scoring is offline, keyword/coverage based — fast (milliseconds), deterministic, no API key. It catches the most common regression: someone deleted the section that addressed the user's original complaint. Pass `--min-replay-improvement -100` to disable the regression block while still emitting coverage scores.
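The scoring itself is simple enough to sketch. The exact keyword extraction and weighting here are assumptions; the point is that it is pure string work, so it stays deterministic and fast:

```python
import re

def coverage_score(case_keywords: list, skill_text: str) -> float:
    """Fraction of a replay case's keywords found in the skill body (0.0 to 1.0)."""
    words = set(re.findall(r"[a-z0-9_-]+", skill_text.lower()))
    hits = sum(1 for kw in case_keywords if kw.lower() in words)
    return hits / max(len(case_keywords), 1)

def install_blocked(candidate_text: str, baseline_text: str,
                    cases: list, min_improvement: float = 0.0) -> bool:
    """The gate: candidate_score - baseline_score < --min-replay-improvement."""
    candidate = sum(coverage_score(c, candidate_text) for c in cases)
    baseline = sum(coverage_score(c, baseline_text) for c in cases)
    return candidate - baseline < min_improvement  # default 0: any regression blocks
```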
| Capability | Skill Forge | LangChain Hub | AutoGen skills | OpenAI Apps SDK | Manual prompt files |
|---|---|---|---|---|---|
| Auto-detect capability gaps from agent logs | ✅ | ❌ | ❌ | ❌ | ❌ |
| Quantitative quality validator (score + grade) | ✅ | ❌ | ❌ | platform review | ❌ |
| Profile-specific scaffolding (5 profiles) | ✅ | ❌ | ❌ | ❌ | ❌ |
| Install requires signed approval token | ✅ HMAC + source-hash | ❌ | ❌ | platform review | ❌ |
| Candidate tamper invalidates approval | ✅ | ❌ | ❌ | n/a | ❌ |
| Non-regression check vs installed baseline | ✅ | ❌ | ❌ | ❌ | ❌ |
| Audit trail with rollback | ✅ atomic + flock | ❌ | ❌ | platform-side | ❌ |
| Redacted feedback storage | ✅ | ❌ | ❌ | platform-side | ❌ |
| Deterministic release benchmark | ✅ | ❌ | ❌ | ❌ | ❌ |
| Runtime / hosting required | none, local Python 3.9+ | hub fetch | local Python | OpenAI-hosted | none |
| License | MIT | MIT | MIT | proprietary | n/a |
Skill Forge is complementary to LangChain Hub / AutoGen / your own internal registry. Those are great for sharing prompts and modules. Skill Forge is what you put in front of any of them when you want reviewed, signed, regression-checked, audit-trailed installs of agent capabilities.
`scripts/skill_forge.py` is the single product entry point. Every subcommand accepts `--json`.
| Command | Purpose | Mutation risk |
|---|---|---|
| `doctor`[^1] | Check local readiness: package, Python, OpenClaw path, Telegram config, approval secret, ClawHub, CI, git state | none |
| `demo` | Run the bundled `citation-helper` example in plan-only mode | none |
| `forge` | Detect, scaffold, validate, evaluate, and optionally request install approval. Pass `--prefer-evolve` to route through `decide` first | install only with signed approval |
| `decide` | Given an opportunity, decide whether to evolve an installed skill or forge a new one. Returns `evolve` / `forge` / `ambiguous` plus per-candidate fitness | none |
| `evolve` | Propose a reviewed update from feedback + replay regression check | install only with signed approval |
| `install` | Print or apply an install proposal for a candidate skill | apply requires signed approval |
| `uninstall` | Print or apply uninstall / rollback | apply requires signed approval |
| `feedback` | Record redacted feedback for later evolution | writes feedback JSONL |
| `replay` | Score a skill directory against replay cases | none |
| `release-check` | Run release gates and secret scan | none |
| `version` | Print product version | none |
From the source repository, the most-used gates are also wrapped in `make`:

```bash
make doctor          # readiness check
make demo            # plan-only smoke test
make benchmark       # deterministic profile benchmarks
make package-check   # full release checks + secret scan
```

For low-level scripts (direct invocation of detect, generate_skill_scaffold, validate, propose_skill_install, telegram_approval, propose_skill_update, record_skill_feedback, evolve_skill_pipeline, replay/*, nightly review, scheduling), see `docs/manual-workflow.md`.
Install modes (`--install` for `forge` / `evolve`):

| Mode | Behavior |
|---|---|
| `plan` (default) | Print the install plan only. No mutation. |
| `telegram` | Send a Telegram approval request, then verify the signed token before any state change. |
| `ask` / `auto` | Blocked compatibility aliases. Local approvals cannot mutate skills. |
Evaluation modes (`--eval`):

| Mode | Behavior |
|---|---|
| `hidden` (default) | Run smoke checks; hide internal scenarios from user-facing output. |
| `details` | Developer mode; expose internal check details. |
| `off` | Skip evaluation. |
Replay modes (`--replay`):

| Mode | Behavior |
|---|---|
| `hidden` (default for `evolve` and nightly review) | Run replay checks; hide case details. |
| `off` | Skip replay. |

`--min-replay-improvement <N>` controls the regression gate (default `0` = no regression). Pass `-100` to disable.
Telegram approval reads these environment variables (with OpenClaw aliases):

| Purpose | Default env | Aliases |
|---|---|---|
| Bot token | `TELEGRAM_BOT_TOKEN` | `OPENCLAW_TELEGRAM_BOT_TOKEN` |
| Chat ID | `TELEGRAM_CHAT_ID` | `OPENCLAW_TELEGRAM_CHAT_ID` |
| HMAC secret | `SKILL_FORGE_APPROVAL_SECRET` (optional; falls back to bot-token-derived) | — |
Discovery order — first present value wins:

- Explicit `--env-file <path>`
- `~/.openclaw/telegram.env`
- `~/.openclaw/skill-forge.env`
- `~/.openclaw/.env`
- `~/.openclaw/workspace/telegram.env`
- `~/.openclaw/workspace/.env`
- Process environment variables
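Resolution is easy to picture as a loop over candidate files (a sketch; the real telegram_config module may parse .env files differently):

```python
import os
from pathlib import Path
from typing import Optional

ENV_FILES = [
    "~/.openclaw/telegram.env",
    "~/.openclaw/skill-forge.env",
    "~/.openclaw/.env",
    "~/.openclaw/workspace/telegram.env",
    "~/.openclaw/workspace/.env",
]

def discover(key: str, env_file: Optional[str] = None) -> Optional[str]:
    """First present value wins: --env-file, then fixed paths, then the process env."""
    for candidate in ([env_file] if env_file else []) + ENV_FILES:
        path = Path(candidate).expanduser()
        if not path.is_file():
            continue
        for line in path.read_text().splitlines():
            line = line.strip()
            if line.startswith(key + "="):
                return line.split("=", 1)[1].strip().strip('"')
    return os.environ.get(key)
```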
Approval reply words (case-insensitive, single-word, replies preferred over substring matches):
| Decision | Accepted words |
|---|---|
| Approve | y, yes, approve, approved, 同意, 同意安装, 安装, 确认 |
| Reject | n, no, reject, rejected, 拒绝, 拒绝安装, 取消, 不同意 |
For non-reply messages, the request ID printed in the approval message must appear in the body.
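A matcher with these rules can be as small as the sketch below. The Telegram update plumbing is omitted and the function shape is an assumption; the shipped matcher prefers direct replies and applies stricter single-word checks:

```python
from typing import Optional

APPROVE = {"y", "yes", "approve", "approved", "同意", "同意安装", "安装", "确认"}
REJECT = {"n", "no", "reject", "rejected", "拒绝", "拒绝安装", "取消", "不同意"}

def classify_reply(text: str, is_reply: bool, request_id: str) -> Optional[str]:
    """Case-insensitive word match; non-replies must also carry the request ID."""
    if not is_reply and request_id not in text:
        return None  # message cannot be bound to this approval request
    words = set(text.lower().split())
    if words & APPROVE:
        return "approve"
    if words & REJECT:
        return "reject"
    return None
```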
See docs/manual-workflow.md#config-discovery-order for nightly-review, env-file, and rollback details.
```
skill-forge/
├── README.md
├── SECURITY.md
├── RELEASE.md
├── LICENSE
├── Makefile
├── .env.example
├── assets/
│   └── skill-forge-logo.png
├── docs/
│   ├── manual-workflow.md
│   └── launch/                      # promotional / launch assets
├── tests/
│   └── run_release_checks.py
└── skill-forge/                     # the actual published skill
    ├── VERSION
    ├── SKILL.md
    ├── agents/openai.yaml
    ├── examples/
    ├── references/
    └── scripts/
        ├── skill_forge.py           # Forge Console
        ├── forge_pipeline.py
        ├── detect_skill_opportunities.py
        ├── generate_skill_scaffold.py
        ├── validate_skill_candidate.py
        ├── evaluate_skill_candidate.py
        ├── lib/                     # approval, install_flow, runtime, telegram_config, policy
        ├── install/                 # propose_skill_install + telegram_approval
        ├── evolve/                  # propose_skill_update, record_skill_feedback, evolve_skill_pipeline
        ├── replay/                  # collect / run / compare / redact / report
        ├── security/scan_secrets.py
        ├── somnia/                  # nightly review runtime
        └── templates/profile_packs.py

../somnia/                           # optional companion skill, separate folder/repo
├── SKILL.md
├── agents/openai.yaml
├── references/
└── scripts/                         # thin shims that delegate to skill-forge/.../somnia
```
Backward-compatibility wrappers exist at `skill-forge/scripts/*.py` for every script under `install/`, `evolve/`, `replay/`, and `somnia/`.
Run the full release suite locally before publishing:

```bash
make package-check
# or:
python3 tests/run_release_checks.py
python3 skill-forge/scripts/security/scan_secrets.py --json
```

Two GitHub Actions workflows guard `main` and pull requests:
- Release Checks — runs the full 26+ assertion suite (compile, secret scan, validation, forge plan, console doctor/demo, install gate, signed-token positive path, tamper rejection, dry-run blocking, missing-source rejection, uninstall token gate, replay, somnia review).
- Benchmarks — deterministic profile benchmarks with score, line-count, and runtime budgets; uploads JSON artifacts.
See SECURITY.md for the full signed-gate protocol, dry-run rejection semantics, redactor coverage, and lock guarantees. See RELEASE.md for the release checklist.
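The write pattern behind those lock guarantees is standard; here is a minimal sketch assuming a POSIX `fcntl.flock` sidecar lock and a write-temp-then-`os.replace` swap (not the shipped code):

```python
import fcntl
import json
import os
import tempfile

def write_manifest(path: str, manifest: dict) -> None:
    """Atomic, flock-protected write: readers never observe a partial manifest."""
    with open(path + ".lock", "w") as lock:
        fcntl.flock(lock, fcntl.LOCK_EX)   # one writer at a time
        fd, tmp = tempfile.mkstemp(dir=os.path.dirname(path) or ".")
        try:
            with os.fdopen(fd, "w") as f:
                json.dump(manifest, f, indent=2)
                f.flush()
                os.fsync(f.fileno())       # durable before the rename
            os.replace(tmp, path)          # atomic swap on POSIX
        finally:
            if os.path.exists(tmp):
                os.remove(tmp)             # clean up only if the swap failed
```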
Contributions are welcome via GitHub Issues and Discussions (announce-style "Skill Showcase" category for sharing skills you've forged). Please run make package-check before opening a PR.
Somnia is a thin packaging layer that delegates to Skill Forge's nightly-review runtime via `runpy.run_path`. Use it when an agent should route maintenance work to a dedicated entry point (scheduled health reports, replay regression checks, sleep-hour upgrade proposals). Somnia never installs or replaces skills directly — any mutation goes through Skill Forge's signed-gate path.
Current Somnia version: v0.6.0 "Quality Aligned".
v1.4.0 "Evolve First"— opt-in evolve-or-forge routing via newdecidecommand andforge --prefer-evolveflag. Prevents skill sprawl by routing opportunities to existing installed skills when profile + trigger / topic / name overlap is high enough. Exact-name match short-circuits to fitness 1.0; profile mismatch is a hard gate. Synthetic feedback entries taggedsource=forge-routingkeep the audit timeline honest.v1.3.0 "Benchmark Pipeline"— deterministic benchmark cases for all 5 profiles; validation/eval/runtime budgets;make benchmark; dedicated GitHub Actions workflow with JSON artifacts.v1.2.0 "Quality Refinement"— scaffold descriptions stay anchored to user triggers; workflow section is now milestone-grade required; numbered-step bonus only counts inside the workflow section; case-insensitive extraneous-doc detection;version/license/authorwhitelisted in frontmatter.v1.1.0 "Skill Quality Engine"— leanSKILL.mdbodies, progressive resource loading, execution-mode guidance, trigger-rich frontmatter, and validation that rejects context-bloating project docs.v1.0.x "Forge Console"— unified product CLI covering doctor, demo, forge, evolve, install, uninstall, feedback, replay, release-check, version.v0.5.0 "Signed Gate"— HMAC-signed approval tokens bound to skill identity and source content hash; replay non-regression gate; atomic +flock-locked manifest writes; expanded redactor.
For the full history (v0.1.0 → present), see RELEASE.md and the GitHub Releases page.
MIT. See LICENSE.
[^1]: `doctor` is the standard CLI verb for "diagnose local readiness" — it follows the convention established by `brew doctor`, `flutter doctor`, `npm doctor`, and `gem doctor`, so any developer who's used those tools knows exactly what to expect.
