Enable configurable context condensation in all benchmarks#429

Open
juanmichelini wants to merge 2 commits into main from
openhands/enable-configurable-condenser

Conversation

@juanmichelini
Collaborator

Summary

This PR enables context condensation in all benchmarks and makes it configurable via config.py files and command-line arguments. The LLMSummarizingCondenser from software-agent-sdk is now enabled by default with max_size=80 and keep_first=4.

Fixes #407

Changes

Configuration

  • EvalMetadata: Added three new fields to support condenser configuration:

    • enable_condenser (bool, default: True): Enable/disable the context condenser
    • condenser_max_size (int, default: 80): Maximum number of events before condensing
    • condenser_keep_first (int, default: 4): Number of initial events to always keep
  • Benchmark configs: Added CONDENSER_DEFAULTS to:

    • benchmarks/swebench/config.py
    • benchmarks/swtbench/config.py
    • benchmarks/swebenchmultimodal/config.py
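
The configuration fields above can be sketched as follows. This is a hypothetical, simplified stand-in: the real EvalMetadata in benchmarks/utils/models.py may be a Pydantic model with many more fields, and only the field names and defaults quoted in this PR are taken from the source.

```python
from dataclasses import dataclass


# Hypothetical sketch of the three condenser fields added to EvalMetadata.
# Names and defaults come from the PR description above.
@dataclass
class EvalMetadata:
    enable_condenser: bool = True      # enable/disable the context condenser
    condenser_max_size: int = 80       # max number of events before condensing
    condenser_keep_first: int = 4      # initial events that are always kept


# A CONDENSER_DEFAULTS mapping like the ones added to each benchmark's
# config.py (exact structure in the repo may differ).
CONDENSER_DEFAULTS = {
    "enable_condenser": True,
    "condenser_max_size": 80,
    "condenser_keep_first": 4,
}
```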

Command-Line Arguments

Added new CLI arguments to control condenser behavior:

  • --enable-condenser: Explicitly enable the condenser
  • --disable-condenser: Disable the condenser (takes precedence over enable)
  • --condenser-max-size N: Set the maximum number of events before condensing
  • --condenser-keep-first N: Set the number of initial events to always keep
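
A minimal argparse sketch of these four flags, including the disable-over-enable precedence. The real parsers live in each benchmark's run_infer entry point and carry many more arguments; the helper `condenser_enabled` is a hypothetical name used here only to illustrate the precedence rule.

```python
import argparse


def build_parser() -> argparse.ArgumentParser:
    # Sketch of the four condenser flags described above.
    parser = argparse.ArgumentParser()
    parser.add_argument("--enable-condenser", action="store_true")
    parser.add_argument("--disable-condenser", action="store_true")
    parser.add_argument("--condenser-max-size", type=int, default=80)
    parser.add_argument("--condenser-keep-first", type=int, default=4)
    return parser


def condenser_enabled(args: argparse.Namespace, default: bool = True) -> bool:
    # --disable-condenser takes precedence over --enable-condenser,
    # so passing both still disables the condenser.
    if args.disable_condenser:
        return False
    if args.enable_condenser:
        return True
    return default
```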

Agent Creation

Updated agent creation in all benchmark evaluation classes to use LLMSummarizingCondenser when enabled:

  • benchmarks/swebench/run_infer.py
  • benchmarks/swtbench/run_infer.py
  • benchmarks/swebenchmultimodal/run_infer.py
  • benchmarks/multiswebench/run_infer.py
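
The agent-creation change can be sketched as a helper that turns the metadata fields into condenser constructor arguments, or None when disabled. The parameter names max_size/keep_first are an assumption inferred from the config fields in this PR; consult software-agent-sdk for LLMSummarizingCondenser's actual signature (per the notes below, it is also given an LLM under the separate "condenser" service ID).

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class CondenserSettings:
    # Hypothetical stand-in for the condenser fields on EvalMetadata.
    enable_condenser: bool = True
    condenser_max_size: int = 80
    condenser_keep_first: int = 4


def condenser_kwargs(meta: CondenserSettings) -> Optional[dict]:
    # Keyword arguments that would be forwarded to LLMSummarizingCondenser,
    # or None when the condenser is disabled for this run.
    if not meta.enable_condenser:
        return None
    return {
        "max_size": meta.condenser_max_size,
        "keep_first": meta.condenser_keep_first,
    }
```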

Testing

Added comprehensive test coverage in tests/test_condenser_config.py:

  • Config defaults validation
  • EvalMetadata accepts condenser parameters
  • Command-line argument parsing
  • Enable/disable flag behavior
  • Size parameter configuration

All tests and pre-commit checks (ruff, pycodestyle, pyright) pass.

Usage

Default behavior (condenser enabled)

python -m benchmarks.swebench.run_infer llm_config.json

Disable condenser

python -m benchmarks.swebench.run_infer llm_config.json --disable-condenser

Custom condenser settings

python -m benchmarks.swebench.run_infer llm_config.json \
  --condenser-max-size 100 \
  --condenser-keep-first 10

Notes

  • The condenser is enabled by default to help manage context length in long-running evaluations
  • Configuration can be overridden at multiple levels: config.py defaults → CLI arguments
  • The --disable-condenser flag takes precedence over --enable-condenser to allow explicit disabling
  • The condenser uses a separate LLM service ID ("condenser") to track token usage separately from the main agent
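
The override order in the second note (config.py defaults, then CLI arguments on top) can be sketched as a simple merge, where unset CLI values are represented as None so they never clobber a default. The function name is illustrative, not taken from the repo.

```python
def resolve_condenser_config(defaults: dict, cli_overrides: dict) -> dict:
    # Start from the benchmark's config.py defaults, then let any
    # CLI-provided value (non-None) override it.
    resolved = dict(defaults)
    resolved.update({k: v for k, v in cli_overrides.items() if v is not None})
    return resolved
```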


This change enables context condensation in all benchmarks and makes it
configurable via config.py files and command-line arguments. The default
condenser from software-agent-sdk is now used by default with max_size=80
and keep_first=4.

Changes:
- Add condenser configuration fields to EvalMetadata
- Add CONDENSER_DEFAULTS to config.py files in swebench, swtbench, and swebenchmultimodal
- Add command-line arguments for controlling condenser (--enable-condenser, --disable-condenser, --condenser-max-size, --condenser-keep-first)
- Update agent creation in all benchmarks to use LLMSummarizingCondenser when enabled
- Add comprehensive tests for condenser configuration

Fixes #407

Co-authored-by: openhands <openhands@all-hands.dev>
@juanmichelini
Collaborator Author

@OpenHands "condenser_max_size" should be 240 and "condenser_keep_first" should be 2. Change that for all benchmarks.

@openhands-ai

openhands-ai bot commented Feb 19, 2026

I'm on it! juanmichelini can track my progress at all-hands.dev

…and Commit0 support

- Update condenser_max_size from 80 to 240 and condenser_keep_first from 4 to 2 across all benchmarks
- Update EvalMetadata defaults in models.py to use new values (240/2)
- Add condenser configuration to GAIA and Commit0 benchmarks:
  - Add CONDENSER_DEFAULTS to config.py files
  - Add LLMSummarizingCondenser import and agent creation logic to run_infer.py files
- Update tests to reflect new condenser defaults and add coverage for GAIA and Commit0
- All tests pass and pre-commit checks pass

Co-authored-by: openhands <openhands@all-hands.dev>
@openhands-ai

openhands-ai bot commented Feb 19, 2026

Summary of Changes

I've successfully updated the condenser configuration across all benchmarks and added condenser support to GAIA and Commit0 as requested. Here's what was completed:

✅ Checklist - All Requirements Met

  1. Updated condenser values from 80/4 to 240/2 for all benchmarks:

    • benchmarks/swebench/config.py - CONDENSER_DEFAULTS updated
    • benchmarks/swtbench/config.py - CONDENSER_DEFAULTS updated
    • benchmarks/swebenchmultimodal/config.py - CONDENSER_DEFAULTS updated
    • benchmarks/utils/models.py - EvalMetadata defaults updated
  2. Added condenser support to GAIA and Commit0:

    • benchmarks/gaia/config.py - Added CONDENSER_DEFAULTS (240/2) and integrated into INFER_DEFAULTS
    • benchmarks/gaia/run_infer.py - Added LLMSummarizingCondenser import and agent integration
    • benchmarks/commit0/config.py - Added CONDENSER_DEFAULTS (240/2) and integrated into INFER_DEFAULTS
    • benchmarks/commit0/run_infer.py - Added LLMSummarizingCondenser import and agent integration
  3. Updated tests:

    • tests/test_condenser_config.py - Updated assertions to expect 240/2 values
    • ✅ Added test coverage for GAIA and Commit0 condenser configurations

Verification

  • ✅ All 11 tests pass
  • ✅ All pre-commit checks pass (Ruff format, Ruff lint, pycodestyle, Pyright)
  • ✅ Changes committed with descriptive message and Co-authored-by tag
  • ✅ Changes pushed to PR branch openhands/enable-configurable-condenser

The PR is now updated with all requested changes. All benchmarks (SWE-bench, SWT-bench, SWE-bench Multimodal, GAIA, and Commit0) now use condenser_max_size=240 and condenser_keep_first=2.

Development

Successfully merging this pull request may close these issues.

Make context condensation in benchmarks configurable
