Conversation
Co-authored-by: OpenHands Bot <contact@all-hands.dev>
Coverage Report •
|
|||||||||||||||||||||||||||||||||||
openhands-tools/openhands/tools/claude/templates/task_tool_description.j2
Outdated
Show resolved
Hide resolved
This comment was marked as outdated.
This comment was marked as outdated.
|
@OpenHands Do a /codereview-roasted on this PR. Understand first the goal and investigate deeply the implementation and consequences, so you can raise important issues if any; otherwise do not exaggerate. Post your review directly as a comment in the PR. Note that it will be rendered as markdown. |
|
I'm on it! enyst can track my progress at all-hands.dev |
Taste Rating: 🔴 Needs improvementYou’re building a Claude Code–compatible Task/TaskOutput/TaskStop façade over OpenHands sub-agents. That’s a real use case, and the overall split (definitions vs shared manager vs executors) is sane. But right now two headline features you advertise (“resume” and “stop”) don’t actually work the way the tool contract claims. That’s the kind of thing that looks fine in a PR diff and then burns you in production. [CRITICAL ISSUES] (must fix)1) Resume is structurally broken (it can’t work as written)Where: You’re trying to resume by loading
But Worse: when creating a new sub-conversation you never pass
Net result: the conversation state is persisted to a directory you don’t track, while your resume code looks in a different directory name that doesn’t exist. This isn’t a “small bug”, it’s a data-structure mismatch. Fix direction:
2)
|
This comment was marked as duplicate.
This comment was marked as duplicate.
This comment was marked as outdated.
This comment was marked as outdated.
| @@ -0,0 +1,33 @@ | |||
| Launch a new agent to handle complex, multi-step tasks autonomously. | |||
|
|
|||
| The Task tool launches specialized agents (subprocesses) that autonomously handle complex tasks. Each agent type has specific capabilities and tools available to it. | |||
There was a problem hiding this comment.
I asked CC to provide the exact tool descriptions, as I don't think they can be found elsewhere.
I was also a bit confused because, as you mentioned, the names are a little misleading.
Moreover, since this implementation uses a slightly different logic behind the scenes, I’ve updated them to be more precise and consistent.
openhands-tools/openhands/tools/claude/templates/task_stop_tool_description.j2
Outdated
Show resolved
Hide resolved
openhands-tools/openhands/tools/claude/templates/task_stop_tool_description.j2
Outdated
Show resolved
Hide resolved
openhands-tools/openhands/tools/claude/templates/task_output_tool_description.j2
Outdated
Show resolved
Hide resolved
openhands-tools/openhands/tools/claude/templates/task_output_tool_description.j2
Outdated
Show resolved
Hide resolved
|
can you add an example (perhaps similar to the current agent_delegation.py but showcasing the other features too) to make sure that this delegation toolkit works? |
openhands-tools/openhands/tools/task/templates/task_output_tool_description.j2
Show resolved
Hide resolved
|
I am bit afraid that this PR introduces lots of new features at once (background sub-agent execution, relies on locks, etc...) and might be very tricky to test and debug... In which case we wouldn't be introducing any new delegation features (we already support parallel sub-agent calls) but it would still be a pretty big change to use the claude code apis + system prompts + evaluate on small swebench and compare with claude code to make sure it works @VascoSch92 @enyst @xingyaoww wdyt? |
openhands-tools/openhands/tools/claude/templates/task_tool_description.j2
Outdated
Show resolved
Hide resolved
| """Integration tests for DelegationManager with real LocalConversation. | ||
|
|
||
| These tests use real LocalConversation instances (with real persistence) | ||
| and only mock the LLM calls via ``litellm_completion``. They verify that |
There was a problem hiding this comment.
Totally a nit:
- we have cross-package tests for conversation restore here
- iirc, we've done exactly what you describe here, "integration-like" just without LLM
- they're named "cross" as in cross-package, as in, almost integration
- we have other tests named "integration" which are also special workflows (
integration-testlabel), so maybe it would be nice to not convince the LLM that this is "integration" 😅
Maybe we could move this to cross/, to live alongside the others?
…cription.j2 Co-authored-by: Engel Nyst <engel.nyst@gmail.com>
In hindsight, I think I might have over-engineered this by adding the background feature (I got a little too excited!). The thing is, it feels so natural to use, and it's actually difficult to match the CC API without it. That being said, I have no problem providing a first version without background execution to make the review process easier. |
|
|
||
| # Create sub-agent LLM (disable streaming) | ||
| llm_updates: dict = {"stream": False} | ||
| sub_agent_llm = parent_llm.model_copy(update=llm_updates) |
There was a problem hiding this comment.
I guess this copies the metrics and continues on its copy... and not sure what happens with usage_id... 🤔
I think... maybe if we can, we could continue the work on profiles / state / registry so that we offer an understandable little framework for all these LLM uses that we need throughout the codebase, WDYT?
We could maybe go with this one temporarily (I'll post a top-level review why), I just really hope we don't end up doing too many little fixes at every LLM spot, that way lies one of my worst nightmares...
You know better the delegation feature and risks; I just want to note a thought: I look at this picture:
It's a lovely picture, IMHO it says that this PR is not risky:
=> it doesn't risk breaking other things. So maybe we can try to get it in, and submit it to evals / tests / etc, before we make it default? |
It's just that background execution seems to be a complicated feature, e.g. with locks there are many ways that things can go wrong... So... yes I would feel much more confortable with a first solution that only supports blocking parallel delegation...
As long as we have strong testing for background execution I'm OK. But evals/swebench doesn't require any background execution so ... |
Also, we won't benefit from background subagent execution until it's integrated in the GUI/CLI and that will probably also take lots of time. So there's no real rush for this PR, right? |
Co-authored-by: openhands <openhands@all-hands.dev>
|
No problem. I will open another PR ;) I believe just two signature changes from the cc api. |



Summary
First version of the Claude Delegation Tools
As discussed on Slack, I have created a specific profile for Claude.
It was challenging to reuse a significant portion of the existing Delegation Tool code, I have integrated as much as possible. For instance, I reused the logic from
registration.pyandvisualizer.py. However, the remaining components were difficult to adapt as the philosophies of the standard Delegation Tool and the Claude Delegation Tool are pretty different.Architecture
The Claude Delegation Tool acts as a messenger between the main agent and the manager (the
DelegationManagerclass). The manager is responsible for the entire sub-agent lifecycle, including spawning and management.Communication with the manager is handled via three tools:
Task,TaskOutput, andTaskStop. Since these three tools only function correctly when used together, I have kept them out of the public API, exposing onlyClaudeDelegationTool.Good to Know
DelegationManageris instantiated only once per tool registration. The three sub-tools share this single instance to prevent concurrency issues.TODO
Checklist
Agent Server images for this PR
• GHCR package: https://github.com/OpenHands/agent-sdk/pkgs/container/agent-server
Variants & Base Images
eclipse-temurin:17-jdknikolaik/python-nodejs:python3.12-nodejs22golang:1.21-bookwormPull (multi-arch manifest)
# Each variant is a multi-arch manifest supporting both amd64 and arm64 docker pull ghcr.io/openhands/agent-server:2f95bdc-pythonRun
All tags pushed for this build
About Multi-Architecture Support
2f95bdc-python) is a multi-arch manifest supporting both amd64 and arm642f95bdc-python-amd64) are also available if needed