Visual-Agent/RealXBench


RealXBench

A multi-turn tool-calling inference and evaluation framework for large language models.

Features

Core Capabilities

  • Multi-turn Tool Calling: Supports iterative tool calling with automatic result integration
  • Flexible Model Adapters: Unified interface for multiple LLM providers (OpenAI, Anthropic, Google, etc.)
  • Tool Integration: Extensible tool system with dummy implementations for demonstration
  • Parallel Processing: Efficient batch inference with configurable thread pools
  • Resume Support: Interrupted inference tasks can be resumed
  • Performance Optimized: Parallel tool execution and efficient image processing
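
The multi-turn loop described above can be sketched as follows. This is a minimal illustration, not the framework's actual API: the function names (`run_multi_turn`, `call_model`, `execute_tool`) and message shapes are hypothetical.

```python
# Illustrative sketch of a multi-turn tool-calling loop with automatic
# result integration; names and message shapes are hypothetical.

def run_multi_turn(call_model, execute_tool, messages, max_rounds=4):
    """Call the model repeatedly, feeding tool results back into the
    conversation, until it stops requesting tools or max_rounds is hit."""
    for _ in range(max_rounds):
        reply = call_model(messages)  # e.g. {"content": ..., "tool_calls": [...]}
        messages.append({"role": "assistant", **reply})
        tool_calls = reply.get("tool_calls") or []
        if not tool_calls:            # no more tool requests: final answer
            return reply["content"], messages
        for call in tool_calls:       # execute each requested tool and
            result = execute_tool(call["name"], call["arguments"])
            messages.append({"role": "tool", "name": call["name"],
                             "content": result})
    return messages[-1].get("content", ""), messages
```

The `max_rounds` cap mirrors the `MAX_ROUNDS` setting described under "How to Run".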

Directory Structure

RealXBench_open/
├── code/                      # Main code directory
│   ├── inference.py          # Multi-turn inference pipeline
│   ├── evaluate.py           # Evaluation and scoring
│   ├── models/                # Model adapters
│   │   ├── __init__.py       # Model exports
│   │   ├── base.py           # Base model interface and utilities
│   │   ├── claude_sonnet_4_20250514.py
│   │   ├── gpt4_1.py
│   │   ├── o3.py
│   │   ├── gemini_2_5_pro_preview_0506.py
│   │   └── qwen2_5_vl_7b_instruct.py
│   └── tools/                 # Tool implementations
│       ├── __init__.py
│       ├── text_search.py    # Text search tool (dummy)
│       └── image_search.py   # Image search tool (dummy)
├── data/                      # Data directory
│   ├── data_en.json          # Input dataset
│   └── images/               # Image assets
├── output/                    # Output directory (created automatically)
│   ├── response/             # Inference results
│   ├── judge/                # Evaluation results
│   └── scores.xlsx           # Summary scores
└── run.sh                     # Main execution script
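
The per-provider adapters under code/models/ share a base interface defined in base.py. A minimal sketch of what such an adapter hierarchy might look like (the actual class and method names in the repository may differ):

```python
# Hypothetical sketch of a unified model-adapter interface; the real
# code/models/base.py may use different names and signatures.
from abc import ABC, abstractmethod


class BaseModel(ABC):
    """Common interface so the inference pipeline can drive any provider."""

    def __init__(self, model_name, max_tokens=4096):
        self.model_name = model_name
        self.max_tokens = max_tokens

    @abstractmethod
    def chat(self, messages, tools=None):
        """Send a conversation (plus optional tool schemas) to the
        provider and return a normalized reply dict."""


class EchoModel(BaseModel):
    """Trivial adapter used here only to demonstrate the interface."""

    def chat(self, messages, tools=None):
        return {"content": messages[-1]["content"], "tool_calls": []}
```

Each provider file (e.g. gpt4_1.py, o3.py) would subclass the base and translate the normalized messages into that provider's request format.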

Tool Implementation

Important: The provided tools (text_search and image_search) are dummy implementations for testing and demonstration purposes. They do not perform real searches but return mock data based on input parameters.

Text Search Tool

  • Returns dummy search results with configurable size
  • Accepts query, size, snippet parameters
  • Returns structured data matching real search API format

Image Search Tool

  • Returns dummy image search results
  • Accepts query, size, images, row, image_field parameters
  • Returns structured data with image metadata

Note: To use real search functionality, replace the dummy implementations in code/tools/ with actual API integrations. The interface remains compatible.
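
A dummy text search in the same spirit could look like this. The parameter names follow the list above; the function body and response fields are illustrative, not the exact contents of code/tools/text_search.py.

```python
# Illustrative dummy text_search matching the parameters described above;
# the actual implementation in code/tools/text_search.py may differ.
def text_search(query, size=5, snippet=True):
    """Return mock results shaped like a real search API response."""
    results = [
        {
            "title": f"Result {i + 1} for '{query}'",
            "url": f"https://example.com/{i + 1}",  # placeholder URL
            "snippet": f"Dummy snippet about {query}" if snippet else "",
        }
        for i in range(size)
    ]
    return {"query": query, "total": size, "results": results}
```

To swap in a real engine, keep the signature and return shape and replace the body with an actual API call, as the note above suggests.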

How to Run

Quick Start

  1. Set environment variables (optional, defaults provided):

    export INFER_MODEL="o3"              # Model name
    export TOOLS="text_search,image_search"  # Comma-separated tool names
    export THREADS=8                     # Number of parallel threads
    export MAX_ROUNDS=4                  # Maximum tool-calling rounds
  2. Run the complete pipeline:

    bash run.sh

The script will:

  1. Run inference on the dataset with specified model and tools
  2. Automatically evaluate the results
  3. Generate summary scores

Custom Configuration

Edit run.sh or set environment variables:

export INFER_MODEL="gpt4_1"
export IN_PATH="./data/data_en.json"
export OUT_RESP="./output/response"
export OUT_JUDGE="./output/judge"
export SCORES="./output/scores.xlsx"
export TOOLS="text_search"
export TEXT_FIELD="query"
export IMG_FIELD="image"
export MAX_ROUNDS=4
export THREADS=8

bash run.sh

Manual Execution

Inference only:

python code/inference.py \
    --infer_model o3 \
    --in_path data/data_en.json \
    --out_dir output/response \
    --tools "text_search,image_search" \
    --max_threads 8 \
    --max_rounds 4

Evaluation only:

python code/evaluate.py \
    --infer_model o3__text_search+image_search \
    --in_dir output/response \
    --out_dir output/judge \
    --score_path output/scores.xlsx

Output Files

Inference Results

  • Location: output/response/

  • Format:

    • {model_tag}_infer.json: Complete results in JSON format
    • {model_tag}_infer.jsonl: Incremental results in JSONL format (one line per record)
  • Content: Each record contains:

    • Original input fields
    • response: Model's final answer
    • __messages__: Complete conversation history
    • __obs__: Tool observation results (if --save_obs is used)
    • __tools_compact__: Tool usage summary
    • __tool_stats__: Tool call statistics
    • __rounds__: Number of conversation rounds
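
Because the JSONL file holds one record per line, downstream scripts can consume it incrementally. A sketch, assuming the field names listed above (the file path and helper names here are illustrative):

```python
# Sketch: read incremental inference results from the JSONL output and
# compute a small summary; helper names are illustrative.
import json


def load_results(path):
    """Yield one result record per JSONL line, skipping blank lines."""
    with open(path, encoding="utf-8") as f:
        for line in f:
            line = line.strip()
            if line:
                yield json.loads(line)


def summarize(records):
    """Count records and total conversation rounds across them."""
    n = rounds = 0
    for rec in records:
        n += 1
        rounds += rec.get("__rounds__", 0)
    return {"n": n, "total_rounds": rounds}
```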

Evaluation Results

  • Location: output/judge/
  • Format: {model_tag}_eval.json
  • Content: Each record includes:
    • Original data and inference response
    • correct: Binary correctness score (0 or 1)
    • judge_response: Raw judge response
    • judge_parsing: Parsing status
    • _empty_response: Flag for empty responses
    • _has_error: Flag for errors

Summary Scores

  • Location: output/scores.xlsx
  • Format: Excel file with sheet named score
  • Content: Aggregated metrics including:
    • avg_score: Average correctness score
    • n: Total number of samples
    • n_empty_response: Count of empty responses
    • n_error: Count of errors
    • n_parse_fail: Count of parsing failures
    • tool_{name}_count: Tool usage counts
    • time: Timestamp
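
The aggregation behind these metrics can be approximated from the evaluation records described earlier. This sketch assumes the field names from the evaluation output; the exact logic in code/evaluate.py (e.g. how a parsing failure is flagged) may differ.

```python
# Sketch of how the summary metrics above could be aggregated from
# evaluation records; evaluate.py's exact logic may differ.
from datetime import datetime


def aggregate(records):
    """Fold per-record evaluation fields into summary metrics."""
    n = len(records)
    return {
        "avg_score": (sum(r.get("correct", 0) for r in records) / n) if n else 0.0,
        "n": n,
        "n_empty_response": sum(1 for r in records if r.get("_empty_response")),
        "n_error": sum(1 for r in records if r.get("_has_error")),
        # assumes judge_parsing == "ok" marks a successful parse
        "n_parse_fail": sum(1 for r in records if r.get("judge_parsing") != "ok"),
        "time": datetime.now().isoformat(timespec="seconds"),
    }
```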
