
feat(tools): Claude Delegation Tools #2100

Open

VascoSch92 wants to merge 20 commits into main from vasco/rewrite-api-to-match-cc

Conversation

VascoSch92 commented Feb 17, 2026

Summary

First version of the Claude Delegation Tools

As discussed on Slack, I have created a specific profile for Claude.

Reusing a significant portion of the existing Delegation Tool code was challenging, but I have integrated as much as possible. For instance, I reused the logic from registration.py and visualizer.py. However, the remaining components were difficult to adapt, as the philosophies of the standard Delegation Tool and the Claude Delegation Tool are quite different.

Architecture

The Claude Delegation Tool acts as a messenger between the main agent and the manager (the DelegationManager class). The manager is responsible for the entire sub-agent lifecycle, including spawning and management.

Communication with the manager is handled via three tools: Task, TaskOutput, and TaskStop. Since these three tools only function correctly when used together, I have kept them out of the public API, exposing only ClaudeDelegationTool.
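For orientation, here is a rough usage sketch. The module path comes from this PR, the Agent/Conversation calls follow the SDK's usual shape, and the ClaudeDelegationTool construction form is an assumption rather than the final API:

# Hypothetical sketch only: exact constructor signatures and the registration form
# are assumptions; the point is that the main agent gets Task/TaskOutput/TaskStop
# through a single ClaudeDelegationTool registration.
from openhands.sdk import LLM, Agent, Conversation
from openhands.tools.claude import ClaudeDelegationTool  # public entry point per this PR

llm = LLM(model="anthropic/claude-sonnet-4", api_key="...")  # placeholder model/key

agent = Agent(
    llm=llm,
    tools=[ClaudeDelegationTool],  # assumed registration form; the three sub-tools stay internal
)

conversation = Conversation(agent=agent, workspace=".")
conversation.send_message("Spawn a sub-agent to audit the test suite and summarize gaps.")
conversation.run()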

Good to Know

  • The DelegationManager is instantiated only once per tool registration. The three sub-tools share this single instance to prevent concurrency issues.
  • The tool signature respects the original Claude Code format.
  • It is possible to run agents in the background.

TODO

  • Refine tool descriptions
  • Expand test coverage

Checklist

  • If the PR is changing/adding functionality, are there tests to reflect this?
  • If there is an example, have you run the example to make sure that it works?
  • If there are instructions on how to run the code, have you followed the instructions and made sure that it works?
  • If the feature is significant enough to require documentation, is there a PR open on the OpenHands/docs repository with the same branch name?
  • Is the GitHub CI passing?

Agent Server images for this PR

GHCR package: https://github.com/OpenHands/agent-sdk/pkgs/container/agent-server

Variants & Base Images

Variant | Architectures | Base Image | Docs / Tags
java | amd64, arm64 | eclipse-temurin:17-jdk | Link
python | amd64, arm64 | nikolaik/python-nodejs:python3.12-nodejs22 | Link
golang | amd64, arm64 | golang:1.21-bookworm | Link

Pull (multi-arch manifest)

# Each variant is a multi-arch manifest supporting both amd64 and arm64
docker pull ghcr.io/openhands/agent-server:2f95bdc-python

Run

docker run -it --rm \
  -p 8000:8000 \
  --name agent-server-2f95bdc-python \
  ghcr.io/openhands/agent-server:2f95bdc-python

All tags pushed for this build

ghcr.io/openhands/agent-server:2f95bdc-golang-amd64
ghcr.io/openhands/agent-server:2f95bdc-golang_tag_1.21-bookworm-amd64
ghcr.io/openhands/agent-server:2f95bdc-golang-arm64
ghcr.io/openhands/agent-server:2f95bdc-golang_tag_1.21-bookworm-arm64
ghcr.io/openhands/agent-server:2f95bdc-java-amd64
ghcr.io/openhands/agent-server:2f95bdc-eclipse-temurin_tag_17-jdk-amd64
ghcr.io/openhands/agent-server:2f95bdc-java-arm64
ghcr.io/openhands/agent-server:2f95bdc-eclipse-temurin_tag_17-jdk-arm64
ghcr.io/openhands/agent-server:2f95bdc-python-amd64
ghcr.io/openhands/agent-server:2f95bdc-nikolaik_s_python-nodejs_tag_python3.12-nodejs22-amd64
ghcr.io/openhands/agent-server:2f95bdc-python-arm64
ghcr.io/openhands/agent-server:2f95bdc-nikolaik_s_python-nodejs_tag_python3.12-nodejs22-arm64
ghcr.io/openhands/agent-server:2f95bdc-golang
ghcr.io/openhands/agent-server:2f95bdc-java
ghcr.io/openhands/agent-server:2f95bdc-python

About Multi-Architecture Support

  • Each variant tag (e.g., 2f95bdc-python) is a multi-arch manifest supporting both amd64 and arm64
  • Docker automatically pulls the correct architecture for your platform
  • Individual architecture tags (e.g., 2f95bdc-python-amd64) are also available if needed

@VascoSch92 added the enhancement (New feature or request) label Feb 17, 2026

github-actions bot commented Feb 17, 2026

Coverage

Coverage Report

File | Stmts | Miss | Cover | Missing
openhands-tools/openhands/tools
   __init__.py | 14 | 2 | 85% | 32–33
openhands-tools/openhands/tools/task
   definition.py | 103 | 51 | 50% | 77, 79–82, 84, 89, 91–96, 113, 156, 158–160, 162, 167, 169–174, 191, 229, 231–233, 235–238, 254, 309–310, 321–323, 325, 334–337, 339–340, 346–347, 349
   impl.py | 286 | 214 | 25% | 99–103, 105, 109–112, 116–120, 124–126, 131–135, 152–159, 163–167, 171–172, 176–177, 181–182, 186, 190, 197–200, 203–205, 234–235, 237–238, 241–242, 247–248, 250, 255, 257, 261–262, 266–270, 272–274, 279–283, 288–291, 295–296, 302–304, 313–315, 321, 335–336, 338–340, 343, 351–352, 354–355, 360–361, 364, 367–368, 370–371, 375, 377–380, 382, 384–389, 391–396, 398–400, 405, 407–409, 411–418, 420, 438–443, 447, 450–451, 454, 459–462, 476–478, 480, 482, 489, 494–496, 499, 501, 505–506, 508–513, 515–517, 521–522, 524–526, 537–539, 547–548, 553–554, 559–560, 566–567, 573, 576, 587, 594–595, 605, 609–611, 619, 630, 637–638, 644, 646–647, 662, 669, 671–672, 679
TOTAL | 18977 | 9724 | 48% |

@VascoSch92 changed the title from feat(sdk): Cluade Delegation Tools (CC flavour) to feat(tools): Cluade Delegation Tools on Feb 17, 2026

@VascoSch92 changed the title from feat(tools): Cluade Delegation Tools to feat(tools): Claude Delegation Tools on Feb 18, 2026
@VascoSch92 marked this pull request as ready for review February 18, 2026 10:29

enyst commented Feb 18, 2026

@OpenHands Do a /codereview-roasted on this PR. Understand first the goal and investigate deeply the implementation and consequences, so you can raise important issues if any; otherwise do not exaggerate.

Post your review directly as a comment in the PR. Note that it will be rendered as markdown.

openhands-ai bot commented Feb 18, 2026

I'm on it! enyst can track my progress at all-hands.dev

enyst commented Feb 18, 2026

Taste Rating: 🔴 Needs improvement

You’re building a Claude Code–compatible Task/TaskOutput/TaskStop façade over OpenHands sub-agents. That’s a real use case, and the overall split (definitions vs shared manager vs executors) is sane.

But right now two headline features you advertise (“resume” and “stop”) don’t actually work the way the tool contract claims. That’s the kind of thing that looks fine in a PR diff and then burns you in production.


[CRITICAL ISSUES] (must fix)

1) Resume is structurally broken (it can’t work as written)

Where: openhands-tools/openhands/tools/claude/impl.py

You’re trying to resume by loading base_state.json from:

  • task_dir = self._tmp_dir / str(self._task_id_to_uuid[task_id]) (_get_task_directory, ~L189-192)
  • then task_dir / "base_state.json" (~L291-296)

But LocalConversation persists under persistence_dir/<conversation_id.hex>/... (see BaseConversation.get_persistence_dir() and LocalConversation.__init__()), i.e. it uses uuid.hex, not str(uuid).

Worse: when creating a new sub-conversation you never pass conversation_id at all:

  • _generate_conversation() passes conversation_dir=... (~L346-357)
  • LocalConversation.__init__() ignores unknown kwargs via **_, so conversation_dir is a no-op.

Net result: the conversation state is persisted to a directory you don’t track, while your resume code looks in a different directory name that doesn’t exist. This isn’t a “small bug”, it’s a data-structure mismatch.

Fix direction:

  • Always pass conversation_id=self._task_id_to_uuid[task_id] when creating the task conversation.
  • Use uuid.hex consistently for persistence paths (or better: use BaseConversation.get_persistence_dir() instead of hand-rolling paths).
  • Then you shouldn’t need to manually JSON-parse base_state.json at all; ConversationState.create() already supports create-or-resume.
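A minimal sketch of that fix direction, assuming nothing beyond what the review already states (helper names here are illustrative, and the constructor keyword is the one the review proposes, not verified against the SDK):

# Hedged sketch, not the PR's implementation: keep one UUID per task, pass it as
# conversation_id when building the sub-conversation, and use uuid.hex for the
# on-disk lookup so the resume path matches what LocalConversation persists.
import uuid
from pathlib import Path


def new_task_conversation_id() -> uuid.UUID:
    # Store this on the task itself and pass it as conversation_id= when
    # constructing the sub-conversation (instead of the ignored conversation_dir kwarg).
    return uuid.uuid4()


def resume_state_path(persistence_dir: Path, conversation_id: uuid.UUID) -> Path:
    # LocalConversation persists under <persistence_dir>/<uuid.hex>/..., so the
    # resume lookup must use .hex, never str(uuid).
    return persistence_dir / conversation_id.hex / "base_state.json"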

2) task_stop doesn’t stop anything (it just lies about it)

Where: DelegationManager.stop_task() (~L467-489) and _run_task() background thread (~L385-413)

stop_task() only flips TaskState.status to STOPPED and evicts the task. The background thread keeps running conversation.run() until it naturally finishes.

That has two nasty consequences:

  • Users think they “terminated” a runaway task, but it keeps consuming time, tools, and LLM budget.
  • Your max-concurrency gate counts only status == RUNNING (~L241-244). After “stop”, status becomes STOPPED, so the still-running thread no longer counts. Congratulations, you’ve built a bypass for max_tasks.

If you can’t actually cancel conversation.run() (fair), then don’t claim you did.

Fix direction:

  • Either implement cooperative cancellation (whatever the conversation/agent supports), or rename/redefine semantics to “mark as stopped and ignore results” and update descriptions accordingly.
  • Don’t let STOPPED-but-still-running threads bypass the concurrency limit.
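A sketch of the cooperative-cancellation option, assuming the worker loop can check a flag between steps (whether conversation.run() exposes such a boundary is exactly the open question):

# Hedged sketch: signal a stop request and let the worker honor it at the next
# step boundary; the task keeps counting against max_tasks until the thread exits.
import threading
from typing import Callable


def run_task(stop_requested: threading.Event, run_one_step: Callable[[], bool]) -> None:
    # run_one_step stands in for whatever stepping primitive the conversation offers;
    # it returns True when the task has finished on its own.
    while not stop_requested.is_set():
        if run_one_step():
            break


def stop_task(stop_requested: threading.Event) -> None:
    # Only marks the request; don't report the task as terminated until the
    # worker thread has observed the flag and exited.
    stop_requested.set()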

3) Tool descriptions are actively misleading, which matters for LLM tools

Where: openhands-tools/openhands/tools/claude/templates/*.j2

Examples:

  • task_tool_description.j2 talks about subprocesses, output_file paths, and resuming “with full context preserved”. None of that is what the implementation does.
  • task_output_tool_description.j2 references a /tasks command and “remote sessions”. Not implemented.
  • task_stop_tool_description.j2 says “terminate a long-running task”. It doesn’t.

For LLM-driven systems, the description is part of correctness. If the model follows the docs, it will do the wrong thing.

4) You claim the 3 sub-tools are “not public”, but you register them globally anyway

Where: openhands-tools/openhands/tools/claude/definition.py (~L359-363)

You only export ClaudeDelegationToolSet from openhands.tools.claude, but you still register_tool() the individual task, task_output, task_stop. That makes them addressable by name in the tool registry, and task is a dangerously generic global name.

If these are intentionally internal, don’t register them globally.


[IMPROVEMENT OPPORTUNITIES] (should fix)

  • Docstrings mention a non-existent API: both openhands.tools.claude.__init__ and definition.py advertise CLAUDE_DELEGATION_TOOLS, but it doesn’t exist.
  • model parameter is accepted but ignored (start_task(... model=..., # noqa: ARG002) and _generate_worker_agent() TODO). Either implement it or remove/clearly document that it’s ignored.
  • State bookkeeping is needlessly split: _active_tasks + _inactive_tasks + _task_id_to_uuid is three structures to represent one thing. At minimum, store the UUID on TaskState so you don’t need the side map.
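On the last point, a sketch of what folding the UUID into the state object could look like (field names are illustrative, not the PR's actual attributes):

# Illustrative only: carrying the conversation UUID on the task state removes the
# need for a separate _task_id_to_uuid side map.
import uuid
from dataclasses import dataclass, field


@dataclass
class TaskState:
    task_id: str
    conversation_id: uuid.UUID = field(default_factory=uuid.uuid4)
    status: str = "running"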

[TESTING GAPS]

  • There’s no test that would catch the resume-path mismatch (because nothing actually tries to resume a real persisted conversation).
  • TestTaskState.test_thread_safety is logically flaky: it asserts status in ("completed", "error"), but the implementation uses "succeeded" / "empty_success" / "error". It only “passes” when the last write happens to be set_error().

VERDICT

Needs rework before merge. The advertised behavior around resume/stop is not trustworthy yet.

KEY INSIGHT

Your persistence/resume path is inconsistent at the data-structure level (UUID string vs UUID hex + ignored ctor arg), so resume isn’t “buggy” — it’s impossible right now.

@@ -0,0 +1,33 @@
Launch a new agent to handle complex, multi-step tasks autonomously.

The Task tool launches specialized agents (subprocesses) that autonomously handle complex tasks. Each agent type has specific capabilities and tools available to it.

Collaborator

Curious, are the prompts from CC or manually written or LLM written? The GPT-5.2 "roasted" review complained about the "subprocesses", for example

Image

Contributor Author

I asked CC to provide the exact tool descriptions, as I don't think they can be found elsewhere.

I was also a bit confused because, as you mentioned, the names are a little misleading.

Moreover, since this implementation uses a slightly different logic behind the scenes, I’ve updated them to be more precise and consistent.

@simonrosenberg
Collaborator

Can you add an example (perhaps similar to the current agent_delegation.py, but showcasing the other features too) to make sure that this delegation toolkit works?

@simonrosenberg
Collaborator

simonrosenberg commented Feb 19, 2026

I am bit afraid that this PR introduces lots of new features at once (background sub-agent execution, relies on locks, etc...) and might be very tricky to test and debug...
Thoughts of having a first version that does support parallel agent execution but not background execution?

In which case we wouldn't be introducing any new delegation features (we already support parallel sub-agent calls) but it would still be a pretty big change to use the claude code apis + system prompts + evaluate on small swebench and compare with claude code to make sure it works

@VascoSch92 @enyst @xingyaoww wdyt?

"""Integration tests for DelegationManager with real LocalConversation.

These tests use real LocalConversation instances (with real persistence)
and only mock the LLM calls via ``litellm_completion``. They verify that
Collaborator

enyst commented Feb 19, 2026

Totally a nit:

  • we have cross-package tests for conversation restore here
  • iirc, we've done exactly what you describe here, "integration-like" just without LLM
  • they're named "cross" as in cross-package, as in, almost integration
  • we have other tests named "integration" which are also special workflows (integration-test label), so maybe it would be nice to not convince the LLM that this is "integration" 😅

Maybe we could move this to cross/, to live alongside the others?

@VascoSch92
Contributor Author

I am bit afraid that this PR introduces lots of new features at once (background sub-agent execution, relies on locks, etc...) and might be very tricky to test and debug... Thoughts of having a first version that does support parallel agent execution but not background execution?

In which case we wouldn't be introducing any new delegation features (we already support parallel sub-agent calls) but it would still be a pretty big change to use the claude code apis + system prompts + evaluate on small swebench and compare with claude code to make sure it works

@VascoSch92 @enyst @xingyaoww wdyt?

In hindsight, I think I might have over-engineered this by adding the background feature (I got a little too excited!). The thing is, it feels so natural to use, and it's actually difficult to match the CC API without it.

That being said, I have no problem providing a first version without background execution to make the review process easier.


# Create sub-agent LLM (disable streaming)
llm_updates: dict = {"stream": False}
sub_agent_llm = parent_llm.model_copy(update=llm_updates)
Collaborator

enyst commented Feb 19, 2026

I guess this copies the metrics and continues on its copy... and not sure what happens with usage_id... 🤔

I think... maybe if we can, we could continue the work on profiles / state / registry so that we offer an understandable little framework for all these LLM uses that we need throughout the codebase, WDYT?

We could maybe go with this one temporarily (I'll post a top-level review why), I just really hope we don't end up doing too many little fixes at every LLM spot, that way lies one of my worst nightmares...
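For illustration only, the kind of thing that could make the accounting explicit later (usage_id is the field mentioned above; whether model_copy shares or resets metrics is exactly the open question):

# Hedged sketch against the snippet above, not a concrete proposal: give the
# sub-agent copy its own usage_id so its token usage is attributable separately.
# parent_llm and task_id come from the surrounding implementation.
llm_updates: dict = {"stream": False, "usage_id": f"task-{task_id}"}
sub_agent_llm = parent_llm.model_copy(update=llm_updates)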

enyst commented Feb 19, 2026

I am bit afraid that this PR introduces lots of new features at once (background sub-agent execution, relies on locks, etc...) and might be very tricky to test and debug... Thoughts of having a first version that does support parallel agent execution but not background execution?

In which case we wouldn't be introducing any new delegation features (we already support parallel sub-agent calls) but it would still be a pretty big change to use the claude code apis + system prompts + evaluate on small swebench and compare with claude code to make sure it works

@VascoSch92 @enyst @xingyaoww wdyt?

You know better the delegation feature and risks; I just want to note a thought: I look at this picture:

image

It's a lovely picture, IMHO it says that this PR is not risky:

  • code fully localized
  • uses tool preset; not default

=> it doesn't risk breaking other things.
=> it's like under feature flag, even though we have no feature flag.

So maybe we can try to get it in, and submit it to evals / tests / etc, before we make it default?

@simonrosenberg
Collaborator

I am bit afraid that this PR introduces lots of new features at once (background sub-agent execution, relies on locks, etc...) and might be very tricky to test and debug... Thoughts of having a first version that does support parallel agent execution but not background execution?
In which case we wouldn't be introducing any new delegation features (we already support parallel sub-agent calls) but it would still be a pretty big change to use the claude code apis + system prompts + evaluate on small swebench and compare with claude code to make sure it works
@VascoSch92 @enyst @xingyaoww wdyt?

In hindsight, I think I might have over-engineered this by adding the background feature (I got a little too excited!). The thing is, it feels so natural to use, and it's actually difficult to match the CC API without it.

That being said, I have no problem providing a first version without background execution to make the review process easier.

It's just that background execution seems to be a complicated feature, e.g. with locks there are many ways that things can go wrong... So... yes I would feel much more comfortable with a first solution that only supports blocking parallel delegation...
How does it make it harder to match CC API? Because lots of flags would now be useless?

I am bit afraid that this PR introduces lots of new features at once (background sub-agent execution, relies on locks, etc...) and might be very tricky to test and debug... Thoughts of having a first version that does support parallel agent execution but not background execution?
In which case we wouldn't be introducing any new delegation features (we already support parallel sub-agent calls) but it would still be a pretty big change to use the claude code apis + system prompts + evaluate on small swebench and compare with claude code to make sure it works
@VascoSch92 @enyst @xingyaoww wdyt?

You know better the delegation feature and risks; I just want to note a thought: I look at this picture:

image

It's a lovely picture, IMHO it says that this PR is not risky:
  • code fully localized
  • uses tool preset; not default

=> it doesn't risk breaking other things. => it's like under feature flag, even though we have no feature flag.

So maybe we can try to get it in, and submit it to evals / tests / etc, before we make it default?

As long as we have strong testing for background execution I'm OK. But evals/swebench doesn't require any background execution so ...

@simonrosenberg
Collaborator

As long as we have strong testing for background execution I'm OK. But evals/swebench doesn't require any background execution so ...

Also, we won't benefit from background subagent execution until it's integrated in the GUI/CLI and that will probably also take lots of time. So there's no real rush for this PR, right?

@VascoSch92
Contributor Author

VascoSch92 commented Feb 19, 2026

No problem. I will open another PR ;)

I believe it's just two signature changes from the CC API.


Labels

enhancement New feature or request
