
Feature Request: Agent Arena - Multi-Model Competitive Execution #1814

@tanzhenxin

Description


What would you like to be added?

Agent Arena is a competitive execution feature that allows users to dispatch multiple AI models simultaneously to execute the same task. Users can observe how different models perform on identical tasks, compare their solutions, and select the best result to apply to their main workspace.

Why is this needed?

Current Pain Points

  1. Model selection difficulty: Users configure multiple model providers but are unsure which model is best for specific task types
  2. Lack of horizontal comparison: Unable to intuitively compare performance differences between different models on the same task
  3. Single point of failure: Relying on only one model may lead to suboptimal solutions for specific problem types

Expected Value

  1. Model benchmarking: Evaluate different models' capabilities in actual work scenarios
  2. Best solution selection: Pick the optimal implementation from multiple solutions
  3. Learning and insights: Observe different models' reasoning styles and problem-solving approaches
  4. Improved reliability: Reduce error risks from single models through multi-model validation

Core Requirements

1. User Entry Point

Provide a slash command interface for users to launch Agent Arena.

  • `/arena --models model1,model2 "task description"` to start a new session with the specified models
  • Support selecting models from configured providers
  • Allow users to specify the task description for all competing Agents
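The command shape above could be parsed with a small helper. This is an illustrative sketch only; `ArenaRequest` and `parseArenaCommand` are hypothetical names, not existing qwen-code APIs, and the 2–5 model bound mirrors the max-5 limit proposed below.

```typescript
// Hypothetical parser for the proposed /arena slash command.
interface ArenaRequest {
  models: string[]; // competing models, from --models
  task: string;     // shared task description
}

function parseArenaCommand(input: string): ArenaRequest {
  // Expected shape: /arena --models model1,model2 "task description"
  const match = input.match(/^\/arena\s+--models\s+(\S+)\s+"([^"]+)"\s*$/);
  if (!match) {
    throw new Error('Usage: /arena --models model1,model2 "task description"');
  }
  const models = match[1].split(',').map((m) => m.trim()).filter(Boolean);
  if (models.length < 2 || models.length > 5) {
    throw new Error('Specify between 2 and 5 models');
  }
  return { models, task: match[2] };
}
```

A stricter implementation would validate each model name against the configured providers before launching any agents.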

2. Multi-Agent Parallel Execution

Run multiple independent Agents simultaneously, each using a different model configuration.

  • Support launching N Agents simultaneously (N specified by user or auto-determined by configured models, max 5)
  • Each Agent is a complete Main Agent-level instance (not a restricted Subagent)
  • Agents are completely independent with no shared state
  • Support individual Agent early completion or failure without affecting other Agents
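The independence requirements above map naturally onto `Promise.allSettled`, which lets each agent finish or fail without affecting its competitors. A minimal sketch, assuming a hypothetical `runAgent` callback that drives one model's session:

```typescript
// Illustrative sketch of independent parallel agent runs.
// AgentOutcome and runAgent are hypothetical names for this proposal.
interface AgentOutcome {
  model: string;
  status: 'success' | 'failure';
  output: string;
}

async function runArena(
  models: string[],
  runAgent: (model: string) => Promise<string>,
): Promise<AgentOutcome[]> {
  // allSettled (unlike Promise.all) never short-circuits: one agent's
  // failure or early completion leaves the others running untouched.
  const settled = await Promise.allSettled(models.map((m) => runAgent(m)));
  return settled.map((result, i): AgentOutcome => ({
    model: models[i],
    status: result.status === 'fulfilled' ? 'success' : 'failure',
    output: result.status === 'fulfilled' ? result.value : String(result.reason),
  }));
}
```

Because the agents share no state, each `runAgent` call would internally construct its own model client, conversation history, and worktree-scoped toolset.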

3. Environment Isolation (Git Worktree)

Each Agent must run in a completely isolated environment to prevent interference.

  • Use `git worktree` to create independent working directories for each competing model
  • Worktrees should be created in a unified management location (e.g., `~/.qwen/arena/<session-id>/<model-name>/`)
  • Support automatic cleanup of worktrees after Agent Arena completion, or retention for user inspection
  • All file operations (read, write, edit) by each Agent are restricted to their worktree scope
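The isolation scheme above amounts to one `git worktree add` per model on a dedicated branch, plus a matching cleanup command. A sketch under those assumptions; `buildWorktreePlan` and the `arena/<session>/<model>` branch naming are hypothetical, while the `~/.qwen/arena/...` layout comes from the proposal:

```typescript
import * as path from 'node:path';
import * as os from 'node:os';

// Hypothetical helper describing one agent's isolated environment.
interface WorktreePlan {
  dir: string;          // isolated working directory for this agent
  createArgs: string[]; // arguments for `git worktree add`
  removeArgs: string[]; // arguments for cleanup after the session
}

function buildWorktreePlan(sessionId: string, model: string, baseBranch: string): WorktreePlan {
  const dir = path.join(os.homedir(), '.qwen', 'arena', sessionId, model);
  return {
    dir,
    // Each agent gets its own branch, so competing edits never collide
    // and the winning branch can later be merged into the main workspace.
    createArgs: ['worktree', 'add', '-b', `arena/${sessionId}/${model}`, dir, baseBranch],
    removeArgs: ['worktree', 'remove', '--force', dir],
  };
}
```

Restricting each agent's file tools to `plan.dir` then enforces the last bullet: an agent physically cannot touch another agent's worktree or the main checkout.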

4. TUI Multi-Agent Display

Provide flexible ways to visualize and interact with multiple running Agents.

  • Display progress indicators for all running Agents (status, elapsed time, etc.)
  • Support two display modes:
    • In-process Mode: Within a single terminal window, allow users to switch between different Agents to view their execution progress and interact with them
    • Split-pane Mode: Display each Agent's execution in separate terminal windows/panes (e.g., tmux panes) for side-by-side comparison
  • Allow users to interact with any Agent (send input, interrupt, etc.) regardless of display mode
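For the in-process mode, the per-agent progress indicator could be as simple as one formatted line per agent, with a marker showing which agent currently receives user input. All names here (`AgentView`, `formatStatusLine`) are illustrative:

```typescript
// Hypothetical one-line progress indicator for the in-process display mode.
type AgentStatus = 'running' | 'done' | 'failed';

interface AgentView {
  model: string;
  status: AgentStatus;
  elapsedMs: number;
}

function formatStatusLine(view: AgentView, focused: boolean): string {
  const icon = { running: '…', done: '✓', failed: '✗' }[view.status];
  const seconds = (view.elapsedMs / 1000).toFixed(1);
  // The '>' marker flags the agent that currently receives keyboard input.
  return `${focused ? '>' : ' '} ${icon} ${view.model.padEnd(16)} ${seconds}s`;
}
```

The split-pane mode would skip this rendering entirely and instead delegate layout to the multiplexer (e.g., one tmux pane per agent process).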

5. Result Comparison and Selection

After Agent Arena completes, allow users to compare outcomes and select preferred solutions.

  • Outcome Summary: See a brief summary of each Agent's result (success/failure, key output)
  • Execution Metrics: View execution statistics for each Agent (completion status, elapsed time, etc.)
  • Solution Selection: Choose one Agent's solution to apply/merge into the main workspace
  • Workspace Management: Choose to preserve worktrees for further inspection or clean them up after selection
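Because each agent works on its own branch (as sketched in the worktree section), "apply the winner" can reduce to ordinary git commands against the main workspace. A hedged sketch; `applySelectionArgs` and the squash-merge strategy are one possible design, not a settled decision:

```typescript
// Hypothetical helper: git argument lists that apply the selected
// agent's branch to the main workspace as a single reviewable change.
function applySelectionArgs(sessionId: string, model: string): string[][] {
  const branch = `arena/${sessionId}/${model}`;
  return [
    // --squash collapses the winning agent's commits into staged changes,
    // keeping the arena's internal history out of the main branch.
    ['merge', '--squash', branch],
    ['commit', '-m', `arena: apply solution from ${model}`],
  ];
}
```

A cherry-pick or patch-based (`git diff | git apply`) strategy would work equally well; squash-merge is shown because it preserves the solution as one unit regardless of how many commits the agent made.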

Additional context

Acceptance Criteria

  • User can launch Agent Arena with multiple models using `/arena --models model1,model2 "task description"`
  • Each Agent runs in isolated Git worktree with no interference
  • Main UI shows progress indicators for all Agents
  • User can switch between Agents to view their execution progress and interact with them
  • Agent Arena session can be cleaned up or preserved after completion
  • Results from different Agents can be compared and selected

Key Differences vs. Related Features

|  | Agent Arena | Agent Team | Agent Swarm |
| --- | --- | --- | --- |
| Goal | Competitive: find the best solution to the same task | Collaborative: tackle different aspects together | Parallel processing: dynamically spawn workers for batch tasks |
| Entry Point | `/arena` slash command with explicit `--models` | Natural language request describing task and team | Tool call during execution (`spawn_swarm_worker`) |
| Agent Creation | Pre-configured models compete | Teammates dynamically created with roles | Workers created on the fly, no pre-definition |
| Relationship | Agents compete; user selects winner | Agents collaborate; lead synthesizes results | Workers execute independently; parent aggregates |
| Communication | No agent-to-agent communication | Direct peer-to-peer messaging between teammates | One-way: results aggregated by parent |
| Coordination | Parallel execution with final selection | Self-coordination via shared task list | Parent manages spawning and result collection |
| Context | Completely isolated (separate worktrees) | Independent sessions with shared task list | Lightweight ephemeral context per worker |
| Lifecycle | Session-based with comparison phase | Persistent team with ongoing collaboration | Ephemeral: spawn → execute → return → cleanup |
| Output | One selected solution applied to workspace | Synthesized results from multiple perspectives | Aggregated results from parallel batch processing |
| Best for | Benchmarking, choosing between model approaches | Research, complex collaboration, cross-layer work | Batch operations, data processing, map-reduce tasks |
