A multi-turn tool-calling inference and evaluation framework for large language models.
- Multi-turn Tool Calling: Supports iterative tool calling with automatic result integration
- Flexible Model Adapters: Unified interface for multiple LLM providers (OpenAI, Anthropic, Google, etc.)
- Tool Integration: Extensible tool system with dummy implementations for demonstration
- Parallel Processing: Efficient batch inference with configurable thread pools
- Resume Support: Interrupted inference tasks can be resumed
- Performance Optimized: Parallel tool execution and efficient image processing
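Conceptually, the multi-turn loop described above works roughly like the sketch below. This is illustrative only: the `model.chat` adapter method and the `tools` mapping are assumed names, not the framework's actual API; the real pipeline lives in `code/inference.py`.

```python
# Minimal sketch of a multi-turn tool-calling loop (illustrative, not the repo's API).
def run_multi_turn(model, tools, messages, max_rounds=4):
    """model.chat and the tools dict (name -> callable) are assumed interfaces."""
    for _ in range(max_rounds):
        reply = model.chat(messages, tools=list(tools))   # ask the model, offering tools
        messages.append(reply)
        if not reply.get("tool_calls"):                   # no tool call: final answer
            return reply.get("content"), messages
        for call in reply["tool_calls"]:                  # execute each requested tool
            result = tools[call["name"]](**call["arguments"])
            messages.append({"role": "tool",
                             "name": call["name"],
                             "content": result})          # feed the observation back
    # Round budget exhausted: request a final answer without offering tools.
    final = model.chat(messages)
    messages.append(final)
    return final.get("content"), messages
```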
```
RealXBench_open/
├── code/ # Main code directory
│ ├── inference.py # Multi-turn inference pipeline
│ ├── evaluate.py # Evaluation and scoring
│ ├── models/ # Model adapters
│ │ ├── __init__.py # Model exports
│ │ ├── base.py # Base model interface and utilities
│ │ ├── claude_sonnet_4_20250514.py
│ │ ├── gpt4_1.py
│ │ ├── o3.py
│ │ ├── gemini_2_5_pro_preview_0506.py
│ │ └── qwen2_5_vl_7b_instruct.py
│ └── tools/ # Tool implementations
│ ├── __init__.py
│ ├── text_search.py # Text search tool (dummy)
│ └── image_search.py # Image search tool (dummy)
├── data/ # Data directory
│ ├── data_en.json # Input dataset
│ └── images/ # Image assets
├── output/ # Output directory (created automatically)
│ ├── response/ # Inference results
│ ├── judge/ # Evaluation results
│ └── scores.xlsx # Summary scores
└── run.sh # Main execution script
```
Important: The provided tools (text_search and image_search) are dummy implementations for testing and demonstration purposes. They do not perform real searches but return mock data based on input parameters.
- `text_search`:
  - Returns dummy search results with configurable size
  - Accepts `query`, `size`, and `snippet` parameters
  - Returns structured data matching the real search API format
- `image_search`:
  - Returns dummy image search results
  - Accepts `query`, `size`, `images`, `row`, and `image_field` parameters
  - Returns structured data with image metadata
Note: To use real search functionality, replace the dummy implementations in code/tools/ with actual API integrations. The interface remains compatible.
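For instance, a real `text_search` could keep the dummy tool's call signature while delegating to an actual backend. The sketch below assumes a hypothetical HTTP endpoint and response fields; only the parameter names mirror the dummy tool.

```python
# Sketch of a drop-in replacement for code/tools/text_search.py.
# The endpoint URL and response fields are placeholders for whatever search
# backend you integrate; only the function signature mirrors the dummy tool.
import requests

def text_search(query: str, size: int = 5, snippet: bool = True):
    resp = requests.get(
        "https://example.com/search",        # replace with your search API
        params={"q": query, "num": size},
        timeout=10,
    )
    resp.raise_for_status()
    hits = resp.json().get("results", [])[:size]
    # Return a list of hits shaped like the dummy tool's structured output.
    return [
        {"title": h.get("title", ""),
         "url": h.get("url", ""),
         "snippet": h.get("snippet", "") if snippet else ""}
        for h in hits
    ]
```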
- Set environment variables (optional, defaults provided):

```bash
export INFER_MODEL="o3"                    # Model name
export TOOLS="text_search,image_search"    # Comma-separated tool names
export THREADS=8                           # Number of parallel threads
export MAX_ROUNDS=4                        # Maximum tool-calling rounds
```
- Run the complete pipeline:

```bash
bash run.sh
```
The script will:
- Run inference on the dataset with specified model and tools
- Automatically evaluate the results
- Generate summary scores
Edit `run.sh` or set environment variables:

```bash
export INFER_MODEL="gpt4_1"
export IN_PATH="./data/data_en.json"
export OUT_RESP="./output/response"
export OUT_JUDGE="./output/judge"
export SCORES="./output/scores.xlsx"
export TOOLS="text_search"
export TEXT_FIELD="query"
export IMG_FIELD="image"
export MAX_ROUNDS=4
export THREADS=8

bash run.sh
```

Inference only:
```bash
python code/inference.py \
    --infer_model o3 \
    --in_path data/data_en.json \
    --out_dir output/response \
    --tools "text_search,image_search" \
    --max_threads 8 \
    --max_rounds 4
```

Evaluation only:
```bash
python code/evaluate.py \
    --infer_model o3__text_search+image_search \
    --in_dir output/response \
    --out_dir output/judge \
    --score_path output/scores.xlsx
```
Inference results:

- Location: `output/response/`
- Format:
  - `{model_tag}_infer.json`: Complete results in JSON format
  - `{model_tag}_infer.jsonl`: Incremental results in JSONL format (one line per record)
- Content: Each record contains (see the illustrative record after this list):
  - Original input fields
  - `response`: Model's final answer
  - `__messages__`: Complete conversation history
  - `__obs__`: Tool observation results (if `--save_obs` is used)
  - `__tools_compact__`: Tool usage summary
  - `__tool_stats__`: Tool call statistics
  - `__rounds__`: Number of conversation rounds
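For orientation, a single record might look roughly like this; the field values are invented for illustration.

```python
# Illustrative shape of one inference record (values are made up).
record = {
    # Original input fields (question, image paths, ...) are passed through unchanged.
    "response": "The final answer is ...",
    "__messages__": [
        {"role": "user", "content": "..."},
        {"role": "assistant", "content": "..."},
    ],
    "__tools_compact__": ["text_search(query='...')"],
    "__tool_stats__": {"text_search": 2, "image_search": 1},
    "__rounds__": 3,
    # "__obs__" with raw tool observations appears only when --save_obs is set.
}
```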
Evaluation results:

- Location: `output/judge/`
- Format: `{model_tag}_eval.json`
- Content: Each record includes (a quick aggregation sketch follows this list):
  - Original data and inference response
  - `correct`: Binary correctness score (0 or 1)
  - `judge_response`: Raw judge response
  - `judge_parsing`: Parsing status
  - `_empty_response`: Flag for empty responses
  - `_has_error`: Flag for errors
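As a quick sanity check, the per-record `correct` field can be aggregated straight from an eval file. The sketch assumes the file is a JSON array of records, and the model tag in the path is just an example.

```python
# Sketch: recompute accuracy from an evaluation file (path and model tag are examples).
import json

with open("output/judge/o3__text_search+image_search_eval.json", encoding="utf-8") as f:
    records = json.load(f)   # assumed to be a JSON array of per-sample records

scores = [r.get("correct", 0) for r in records]
print(f"avg_score = {sum(scores) / len(scores):.4f} over n = {len(scores)} samples")
```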
Summary scores:

- Location: `output/scores.xlsx`
- Format: Excel file with a sheet named `score`
- Content: Aggregated metrics including (a loading snippet follows this list):
  - `avg_score`: Average correctness score
  - `n`: Total number of samples
  - `n_empty_response`: Count of empty responses
  - `n_error`: Count of errors
  - `n_parse_fail`: Count of parsing failures
  - `tool_{name}_count`: Tool usage counts
  - `time`: Timestamp
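If you want to post-process the metrics, the sheet can be loaded with pandas (requires `pandas` and `openpyxl`); the column names below follow the fields listed above.

```python
# Load the aggregated scores sheet for further analysis.
import pandas as pd

scores = pd.read_excel("output/scores.xlsx", sheet_name="score")
print(scores[["avg_score", "n", "n_empty_response", "n_error"]])
```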