
feat(tools): Claude Delegation Tools #2100

Open

VascoSch92 wants to merge 20 commits into main from vasco/rewrite-api-to-match-cc

Conversation

VascoSch92 commented Feb 17, 2026

Summary

First version of the Claude Delegation Tools

As discussed on Slack, I have created a specific profile for Claude.

Reusing a significant portion of the existing Delegation Tool code was challenging, but I have integrated as much as possible. For instance, I reused the logic from registration.py and visualizer.py. However, the remaining components were difficult to adapt, as the philosophies of the standard Delegation Tool and the Claude Delegation Tool are quite different.

Architecture

The Claude Delegation Tool acts as a messenger between the main agent and the manager (the DelegationManager class). The manager is responsible for the entire sub-agent lifecycle, including spawning and management.

Communication with the manager is handled via three tools: Task, TaskOutput, and TaskStop. Since these three tools only function correctly when used together, I have kept them out of the public API, exposing only ClaudeDelegationTool.
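For orientation, here is a rough usage sketch. The module path comes from this PR, the Agent/Conversation calls follow the SDK's usual shape, and the ClaudeDelegationTool construction form is an assumption rather than the final API:

# Hypothetical sketch only: exact constructor signatures and the registration form
# are assumptions; the point is that the main agent gets Task/TaskOutput/TaskStop
# through a single ClaudeDelegationTool registration.
from openhands.sdk import LLM, Agent, Conversation
from openhands.tools.claude import ClaudeDelegationTool  # public entry point per this PR

llm = LLM(model="anthropic/claude-sonnet-4", api_key="...")  # placeholder model/key

agent = Agent(
    llm=llm,
    tools=[ClaudeDelegationTool],  # assumed registration form; the three sub-tools stay internal
)

conversation = Conversation(agent=agent, workspace=".")
conversation.send_message("Spawn a sub-agent to audit the test suite and summarize gaps.")
conversation.run()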

Good to Know

  • The DelegationManager is instantiated only once per tool registration. The three sub-tools share this single instance to prevent concurrency issues.
  • The tool signature respects the original Claude Code format.
  • It is possible to run agents in the background.

TODO

  • Refine tool descriptions
  • Expand test coverage

Checklist

  • If the PR is changing/adding functionality, are there tests to reflect this?
  • If there is an example, have you run the example to make sure that it works?
  • If there are instructions on how to run the code, have you followed the instructions and made sure that it works?
  • If the feature is significant enough to require documentation, is there a PR open on the OpenHands/docs repository with the same branch name?
  • Is the GitHub CI passing?

Agent Server images for this PR

GHCR package: https://github.com/OpenHands/agent-sdk/pkgs/container/agent-server

Variants & Base Images

Variant | Architectures | Base Image | Docs / Tags
java | amd64, arm64 | eclipse-temurin:17-jdk | Link
python | amd64, arm64 | nikolaik/python-nodejs:python3.12-nodejs22 | Link
golang | amd64, arm64 | golang:1.21-bookworm | Link

Pull (multi-arch manifest)

# Each variant is a multi-arch manifest supporting both amd64 and arm64
docker pull ghcr.io/openhands/agent-server:2f95bdc-python

Run

docker run -it --rm \
  -p 8000:8000 \
  --name agent-server-2f95bdc-python \
  ghcr.io/openhands/agent-server:2f95bdc-python

All tags pushed for this build

ghcr.io/openhands/agent-server:2f95bdc-golang-amd64
ghcr.io/openhands/agent-server:2f95bdc-golang_tag_1.21-bookworm-amd64
ghcr.io/openhands/agent-server:2f95bdc-golang-arm64
ghcr.io/openhands/agent-server:2f95bdc-golang_tag_1.21-bookworm-arm64
ghcr.io/openhands/agent-server:2f95bdc-java-amd64
ghcr.io/openhands/agent-server:2f95bdc-eclipse-temurin_tag_17-jdk-amd64
ghcr.io/openhands/agent-server:2f95bdc-java-arm64
ghcr.io/openhands/agent-server:2f95bdc-eclipse-temurin_tag_17-jdk-arm64
ghcr.io/openhands/agent-server:2f95bdc-python-amd64
ghcr.io/openhands/agent-server:2f95bdc-nikolaik_s_python-nodejs_tag_python3.12-nodejs22-amd64
ghcr.io/openhands/agent-server:2f95bdc-python-arm64
ghcr.io/openhands/agent-server:2f95bdc-nikolaik_s_python-nodejs_tag_python3.12-nodejs22-arm64
ghcr.io/openhands/agent-server:2f95bdc-golang
ghcr.io/openhands/agent-server:2f95bdc-java
ghcr.io/openhands/agent-server:2f95bdc-python

About Multi-Architecture Support

  • Each variant tag (e.g., 2f95bdc-python) is a multi-arch manifest supporting both amd64 and arm64
  • Docker automatically pulls the correct architecture for your platform
  • Individual architecture tags (e.g., 2f95bdc-python-amd64) are also available if needed

@VascoSch92 added the enhancement (New feature or request) label Feb 17, 2026

github-actions bot commented Feb 17, 2026

Coverage

Coverage Report

File | Stmts | Miss | Cover | Missing
openhands-tools/openhands/tools
   __init__.py | 14 | 2 | 85% | 32–33
openhands-tools/openhands/tools/task
   definition.py | 103 | 51 | 50% | 77, 79–82, 84, 89, 91–96, 113, 156, 158–160, 162, 167, 169–174, 191, 229, 231–233, 235–238, 254, 309–310, 321–323, 325, 334–337, 339–340, 346–347, 349
   impl.py | 286 | 214 | 25% | 99–103, 105, 109–112, 116–120, 124–126, 131–135, 152–159, 163–167, 171–172, 176–177, 181–182, 186, 190, 197–200, 203–205, 234–235, 237–238, 241–242, 247–248, 250, 255, 257, 261–262, 266–270, 272–274, 279–283, 288–291, 295–296, 302–304, 313–315, 321, 335–336, 338–340, 343, 351–352, 354–355, 360–361, 364, 367–368, 370–371, 375, 377–380, 382, 384–389, 391–396, 398–400, 405, 407–409, 411–418, 420, 438–443, 447, 450–451, 454, 459–462, 476–478, 480, 482, 489, 494–496, 499, 501, 505–506, 508–513, 515–517, 521–522, 524–526, 537–539, 547–548, 553–554, 559–560, 566–567, 573, 576, 587, 594–595, 605, 609–611, 619, 630, 637–638, 644, 646–647, 662, 669, 671–672, 679
TOTAL | 18977 | 9724 | 48% |

@VascoSch92 changed the title from feat(sdk): Cluade Delegation Tools (CC flavour) to feat(tools): Cluade Delegation Tools on Feb 17, 2026

@VascoSch92 changed the title from feat(tools): Cluade Delegation Tools to feat(tools): Claude Delegation Tools on Feb 18, 2026
@VascoSch92 marked this pull request as ready for review February 18, 2026 10:29

enyst commented Feb 18, 2026

@OpenHands Do a /codereview-roasted on this PR. Understand first the goal and investigate deeply the implementation and consequences, so you can raise important issues if any; otherwise do not exaggerate.

Post your review directly as a comment in the PR. Note that it will be rendered as markdown.

openhands-ai bot commented Feb 18, 2026

I'm on it! enyst can track my progress at all-hands.dev

enyst commented Feb 18, 2026

Taste Rating: 🔴 Needs improvement

You’re building a Claude Code–compatible Task/TaskOutput/TaskStop façade over OpenHands sub-agents. That’s a real use case, and the overall split (definitions vs shared manager vs executors) is sane.

But right now two headline features you advertise (“resume” and “stop”) don’t actually work the way the tool contract claims. That’s the kind of thing that looks fine in a PR diff and then burns you in production.


[CRITICAL ISSUES] (must fix)

1) Resume is structurally broken (it can’t work as written)

Where: openhands-tools/openhands/tools/claude/impl.py

You’re trying to resume by loading base_state.json from:

  • task_dir = self._tmp_dir / str(self._task_id_to_uuid[task_id]) (_get_task_directory, ~L189-192)
  • then task_dir / "base_state.json" (~L291-296)

But LocalConversation persists under persistence_dir/<conversation_id.hex>/... (see BaseConversation.get_persistence_dir() and LocalConversation.__init__()), i.e. it uses uuid.hex, not str(uuid).

Worse: when creating a new sub-conversation you never pass conversation_id at all:

  • _generate_conversation() passes conversation_dir=... (~L346-357)
  • LocalConversation.__init__() ignores unknown kwargs via **_, so conversation_dir is a no-op.

Net result: the conversation state is persisted to a directory you don’t track, while your resume code looks in a different directory name that doesn’t exist. This isn’t a “small bug”, it’s a data-structure mismatch.

Fix direction:

  • Always pass conversation_id=self._task_id_to_uuid[task_id] when creating the task conversation.
  • Use uuid.hex consistently for persistence paths (or better: use BaseConversation.get_persistence_dir() instead of hand-rolling paths).
  • Then you shouldn’t need to manually JSON-parse base_state.json at all; ConversationState.create() already supports create-or-resume.
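A minimal sketch of that fix direction, assuming nothing beyond what the review already states (helper names here are illustrative, and the constructor keyword is the one the review proposes, not verified against the SDK):

# Hedged sketch, not the PR's implementation: keep one UUID per task, pass it as
# conversation_id when building the sub-conversation, and use uuid.hex for the
# on-disk lookup so the resume path matches what LocalConversation persists.
import uuid
from pathlib import Path


def new_task_conversation_id() -> uuid.UUID:
    # Store this on the task itself and pass it as conversation_id= when
    # constructing the sub-conversation (instead of the ignored conversation_dir kwarg).
    return uuid.uuid4()


def resume_state_path(persistence_dir: Path, conversation_id: uuid.UUID) -> Path:
    # LocalConversation persists under <persistence_dir>/<uuid.hex>/..., so the
    # resume lookup must use .hex, never str(uuid).
    return persistence_dir / conversation_id.hex / "base_state.json"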

2) task_stop doesn’t stop anything (it just lies about it)

Where: DelegationManager.stop_task() (~L467-489) and _run_task() background thread (~L385-413)

stop_task() only flips TaskState.status to STOPPED and evicts the task. The background thread keeps running conversation.run() until it naturally finishes.

That has two nasty consequences:

  • Users think they “terminated” a runaway task, but it keeps consuming time, tools, and LLM budget.
  • Your max-concurrency gate counts only status == RUNNING (~L241-244). After “stop”, status becomes STOPPED, so the still-running thread no longer counts. Congratulations, you’ve built a bypass for max_tasks.

If you can’t actually cancel conversation.run() (fair), then don’t claim you did.

Fix direction:

  • Either implement cooperative cancellation (whatever the conversation/agent supports), or rename/redefine semantics to “mark as stopped and ignore results” and update descriptions accordingly.
  • Don’t let STOPPED-but-still-running threads bypass the concurrency limit.
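A sketch of the cooperative-cancellation option, assuming the worker loop can check a flag between steps (whether conversation.run() exposes such a boundary is exactly the open question):

# Hedged sketch: signal a stop request and let the worker honor it at the next
# step boundary; the task keeps counting against max_tasks until the thread exits.
import threading
from typing import Callable


def run_task(stop_requested: threading.Event, run_one_step: Callable[[], bool]) -> None:
    # run_one_step stands in for whatever stepping primitive the conversation offers;
    # it returns True when the task has finished on its own.
    while not stop_requested.is_set():
        if run_one_step():
            break


def stop_task(stop_requested: threading.Event) -> None:
    # Only marks the request; don't report the task as terminated until the
    # worker thread has observed the flag and exited.
    stop_requested.set()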

3) Tool descriptions are actively misleading, which matters for LLM tools

Where: openhands-tools/openhands/tools/claude/templates/*.j2

Examples:

  • task_tool_description.j2 talks about subprocesses, output_file paths, and resuming “with full context preserved”. None of that is what the implementation does.
  • task_output_tool_description.j2 references a /tasks command and “remote sessions”. Not implemented.
  • task_stop_tool_description.j2 says “terminate a long-running task”. It doesn’t.

For LLM-driven systems, the description is part of correctness. If the model follows the docs, it will do the wrong thing.

4) You claim the 3 sub-tools are “not public”, but you register them globally anyway

Where: openhands-tools/openhands/tools/claude/definition.py (~L359-363)

You only export ClaudeDelegationToolSet from openhands.tools.claude, but you still register_tool() the individual task, task_output, task_stop. That makes them addressable by name in the tool registry, and task is a dangerously generic global name.

If these are intentionally internal, don’t register them globally.


[IMPROVEMENT OPPORTUNITIES] (should fix)

  • Docstrings mention a non-existent API: both openhands.tools.claude.__init__ and definition.py advertise CLAUDE_DELEGATION_TOOLS, but it doesn’t exist.
  • model parameter is accepted but ignored (start_task(... model=..., # noqa: ARG002) and _generate_worker_agent() TODO). Either implement it or remove/clearly document that it’s ignored.
  • State bookkeeping is needlessly split: _active_tasks + _inactive_tasks + _task_id_to_uuid is three structures to represent one thing. At minimum, store the UUID on TaskState so you don’t need the side map.
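On the last point, a sketch of what folding the UUID into the state object could look like (field names are illustrative, not the PR's actual attributes):

# Illustrative only: carrying the conversation UUID on the task state removes the
# need for a separate _task_id_to_uuid side map.
import uuid
from dataclasses import dataclass, field


@dataclass
class TaskState:
    task_id: str
    conversation_id: uuid.UUID = field(default_factory=uuid.uuid4)
    status: str = "running"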

[TESTING GAPS]

  • There’s no test that would catch the resume-path mismatch (because nothing actually tries to resume a real persisted conversation).
  • TestTaskState.test_thread_safety is logically flaky: it asserts status in ("completed", "error"), but the implementation uses "succeeded" / "empty_success" / "error". It only “passes” when the last write happens to be set_error().

VERDICT

Needs rework before merge. The advertised behavior around resume/stop is not trustworthy yet.

KEY INSIGHT

Your persistence/resume path is inconsistent at the data-structure level (UUID string vs UUID hex + ignored ctor arg), so resume isn’t “buggy” — it’s impossible right now.

@@ -0,0 +1,33 @@
Launch a new agent to handle complex, multi-step tasks autonomously.

The Task tool launches specialized agents (subprocesses) that autonomously handle complex tasks. Each agent type has specific capabilities and tools available to it.

Collaborator

Curious, are the prompts from CC or manually written or LLM written? The GPT-5.2 "roasted" review complained about the "subprocesses", for example

Image

Contributor Author

I asked CC to provide the exact tool descriptions, as I don't think they can be found elsewhere.

I was also a bit confused because, as you mentioned, the names are a little misleading.

Moreover, since this implementation uses a slightly different logic behind the scenes, I’ve updated them to be more precise and consistent.

@simonrosenberg
Collaborator

Can you add an example (perhaps similar to the current agent_delegation.py, but showcasing the other features too) to make sure that this delegation toolkit works?

@simonrosenberg
Collaborator

simonrosenberg commented Feb 19, 2026

I am bit afraid that this PR introduces lots of new features at once (background sub-agent execution, relies on locks, etc...) and might be very tricky to test and debug...
Thoughts of having a first version that does support parallel agent execution but not background execution?

In which case we wouldn't be introducing any new delegation features (we already support parallel sub-agent calls) but it would still be a pretty big change to use the claude code apis + system prompts + evaluate on small swebench and compare with claude code to make sure it works

@VascoSch92 @enyst @xingyaoww wdyt?

"""Integration tests for DelegationManager with real LocalConversation.

These tests use real LocalConversation instances (with real persistence)
and only mock the LLM calls via ``litellm_completion``. They verify that
Collaborator

enyst commented Feb 19, 2026

Totally a nit:

  • we have cross-package tests for conversation restore here
  • iirc, we've done exactly what you describe here, "integration-like" just without LLM
  • they're named "cross" as in cross-package, as in, almost integration
  • we have other tests named "integration" which are also special workflows (integration-test label), so maybe it would be nice to not convince the LLM that this is "integration" 😅

Maybe we could move this to cross/, to live alongside the others?

@VascoSch92
Contributor Author

I am bit afraid that this PR introduces lots of new features at once (background sub-agent execution, relies on locks, etc...) and might be very tricky to test and debug... Thoughts of having a first version that does support parallel agent execution but not background execution?

In which case we wouldn't be introducing any new delegation features (we already support parallel sub-agent calls) but it would still be a pretty big change to use the claude code apis + system prompts + evaluate on small swebench and compare with claude code to make sure it works

@VascoSch92 @enyst @xingyaoww wdyt?

In hindsight, I think I might have over-engineered this by adding the background feature (I got a little too excited!). The thing is, it feels so natural to use, and it's actually difficult to match the CC API without it.

That being said, I have no problem providing a first version without background execution to make the review process easier.


# Create sub-agent LLM (disable streaming)
llm_updates: dict = {"stream": False}
sub_agent_llm = parent_llm.model_copy(update=llm_updates)
Collaborator

enyst commented Feb 19, 2026

I guess this copies the metrics and continues on its copy... and not sure what happens with usage_id... 🤔

I think... maybe if we can, we could continue the work on profiles / state / registry so that we offer an understandable little framework for all these LLM uses that we need throughout the codebase, WDYT?

We could maybe go with this one temporarily (I'll post a top-level review why), I just really hope we don't end up doing too many little fixes at every LLM spot, that way lies one of my worst nightmares...
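For illustration only, the kind of thing that could make the accounting explicit later (usage_id is the field mentioned above; whether model_copy shares or resets metrics is exactly the open question):

# Hedged sketch against the snippet above, not a concrete proposal: give the
# sub-agent copy its own usage_id so its token usage is attributable separately.
# parent_llm and task_id come from the surrounding implementation.
llm_updates: dict = {"stream": False, "usage_id": f"task-{task_id}"}
sub_agent_llm = parent_llm.model_copy(update=llm_updates)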

enyst commented Feb 19, 2026

I am bit afraid that this PR introduces lots of new features at once (background sub-agent execution, relies on locks, etc...) and might be very tricky to test and debug... Thoughts of having a first version that does support parallel agent execution but not background execution?

In which case we wouldn't be introducing any new delegation features (we already support parallel sub-agent calls) but it would still be a pretty big change to use the claude code apis + system prompts + evaluate on small swebench and compare with claude code to make sure it works

@VascoSch92 @enyst @xingyaoww wdyt?

You know better the delegation feature and risks; I just want to note a thought: I look at this picture:

image

It's a lovely picture, IMHO it says that this PR is not risky:

  • code fully localized
  • uses tool preset; not default

=> it doesn't risk breaking other things.
=> it's like under feature flag, even though we have no feature flag.

So maybe we can try to get it in, and submit it to evals / tests / etc, before we make it default?

@simonrosenberg
Collaborator

I am bit afraid that this PR introduces lots of new features at once (background sub-agent execution, relies on locks, etc...) and might be very tricky to test and debug... Thoughts of having a first version that does support parallel agent execution but not background execution?
In which case we wouldn't be introducing any new delegation features (we already support parallel sub-agent calls) but it would still be a pretty big change to use the claude code apis + system prompts + evaluate on small swebench and compare with claude code to make sure it works
@VascoSch92 @enyst @xingyaoww wdyt?

In hindsight, I think I might have over-engineered this by adding the background feature (I got a little too excited!). The thing is, it feels so natural to use, and it's actually difficult to match the CC API without it.

That being said, I have no problem providing a first version without background execution to make the review process easier.

It's just that background execution seems to be a complicated feature, e.g. with locks there are many ways that things can go wrong... So... yes I would feel much more comfortable with a first solution that only supports blocking parallel delegation...
How does it make it harder to match CC API? Because lots of flags would now be useless?

I am bit afraid that this PR introduces lots of new features at once (background sub-agent execution, relies on locks, etc...) and might be very tricky to test and debug... Thoughts of having a first version that does support parallel agent execution but not background execution?
In which case we wouldn't be introducing any new delegation features (we already support parallel sub-agent calls) but it would still be a pretty big change to use the claude code apis + system prompts + evaluate on small swebench and compare with claude code to make sure it works
@VascoSch92 @enyst @xingyaoww wdyt?

You know better the delegation feature and risks; I just want to note a thought: I look at this picture:

image

It's a lovely picture, IMHO it says that this PR is not risky:
  • code fully localized
  • uses tool preset; not default

=> it doesn't risk breaking other things. => it's like under feature flag, even though we have no feature flag.

So maybe we can try to get it in, and submit it to evals / tests / etc, before we make it default?

As long as we have strong testing for background execution I'm OK. But evals/swebench doesn't require any background execution so ...

@simonrosenberg
Collaborator

As long as we have strong testing for background execution I'm OK. But evals/swebench doesn't require any background execution so ...

Also, we won't benefit from background subagent execution until it's integrated in the GUI/CLI and that will probably also take lots of time. So there's no real rush for this PR, right?

@VascoSch92
Contributor Author

VascoSch92 commented Feb 19, 2026

No problem. I will open another PR ;)

I believe it's just two signature changes from the CC API.


Labels

enhancement New feature or request
