A multi-turn tool-calling inference and evaluation framework for large language models.
- Multi-turn Tool Calling: Supports iterative tool calling with automatic result integration
- Flexible Model Adapters: Unified interface for multiple LLM providers (OpenAI, Anthropic, Google, etc.)
- Tool Integration: Extensible tool system with dummy implementations for demonstration
- Parallel Processing: Efficient batch inference with configurable thread pools
- Resume Support: Interrupted inference tasks can be resumed
- Performance Optimized: Parallel tool execution and efficient image processing
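Conceptually, the multi-turn loop described above works roughly like the sketch below. This is illustrative only: the `model.chat` adapter method and the `tools` mapping are assumed names, not the framework's actual API; the real pipeline lives in `code/inference.py`.

```python
# Minimal sketch of a multi-turn tool-calling loop (illustrative, not the repo's API).
def run_multi_turn(model, tools, messages, max_rounds=4):
    """model.chat and the tools dict (name -> callable) are assumed interfaces."""
    for _ in range(max_rounds):
        reply = model.chat(messages, tools=list(tools))   # ask the model, offering tools
        messages.append(reply)
        if not reply.get("tool_calls"):                   # no tool call: final answer
            return reply.get("content"), messages
        for call in reply["tool_calls"]:                  # execute each requested tool
            result = tools[call["name"]](**call["arguments"])
            messages.append({"role": "tool",
                             "name": call["name"],
                             "content": result})          # feed the observation back
    # Round budget exhausted: request a final answer without offering tools.
    final = model.chat(messages)
    messages.append(final)
    return final.get("content"), messages
```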
```
RealXBench_open/
├── code/ # Main code directory
│ ├── inference.py # Multi-turn inference pipeline
│ ├── evaluate.py # Evaluation and scoring
│ ├── models/ # Model adapters
│ │ ├── __init__.py # Model exports
│ │ ├── base.py # Base model interface and utilities
│ │ ├── claude_sonnet_4_20250514.py
│ │ ├── gpt4_1.py
│ │ ├── o3.py
│ │ ├── gemini_2_5_pro_preview_0506.py
│ │ └── qwen2_5_vl_7b_instruct.py
│ └── tools/ # Tool implementations
│ ├── __init__.py
│ ├── text_search.py # Text search tool (dummy)
│ └── image_search.py # Image search tool (dummy)
├── data/ # Data directory
│ ├── data_en.json # Input dataset
│ └── images/ # Image assets
├── output/ # Output directory (created automatically)
│ ├── response/ # Inference results
│ ├── judge/ # Evaluation results
│ └── scores.xlsx # Summary scores
└── run.sh # Main execution script
```
Important: The provided tools (text_search and image_search) are dummy implementations for testing and demonstration purposes. They do not perform real searches but return mock data based on input parameters.
- `text_search`:
  - Returns dummy search results with configurable size
  - Accepts `query`, `size`, and `snippet` parameters
  - Returns structured data matching the real search API format
- `image_search`:
  - Returns dummy image search results
  - Accepts `query`, `size`, `images`, `row`, and `image_field` parameters
  - Returns structured data with image metadata
Note: To use real search functionality, replace the dummy implementations in code/tools/ with actual API integrations. The interface remains compatible.
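For instance, a real `text_search` could keep the dummy tool's call signature while delegating to an actual backend. The sketch below assumes a hypothetical HTTP endpoint and response fields; only the parameter names mirror the dummy tool.

```python
# Sketch of a drop-in replacement for code/tools/text_search.py.
# The endpoint URL and response fields are placeholders for whatever search
# backend you integrate; only the function signature mirrors the dummy tool.
import requests

def text_search(query: str, size: int = 5, snippet: bool = True):
    resp = requests.get(
        "https://example.com/search",        # replace with your search API
        params={"q": query, "num": size},
        timeout=10,
    )
    resp.raise_for_status()
    hits = resp.json().get("results", [])[:size]
    # Return a list of hits shaped like the dummy tool's structured output.
    return [
        {"title": h.get("title", ""),
         "url": h.get("url", ""),
         "snippet": h.get("snippet", "") if snippet else ""}
        for h in hits
    ]
```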
- Set environment variables (optional, defaults provided):

```bash
export INFER_MODEL="o3"                    # Model name
export TOOLS="text_search,image_search"    # Comma-separated tool names
export THREADS=8                           # Number of parallel threads
export MAX_ROUNDS=4                        # Maximum tool-calling rounds
```
- Run the complete pipeline:

```bash
bash run.sh
```
The script will:
- Run inference on the dataset with specified model and tools
- Automatically evaluate the results
- Generate summary scores
Edit `run.sh` or set environment variables:

```bash
export INFER_MODEL="gpt4_1"
export IN_PATH="./data/data_en.json"
export OUT_RESP="./output/response"
export OUT_JUDGE="./output/judge"
export SCORES="./output/scores.xlsx"
export TOOLS="text_search"
export TEXT_FIELD="query"
export IMG_FIELD="image"
export MAX_ROUNDS=4
export THREADS=8

bash run.sh
```

Inference only:
```bash
python code/inference.py \
    --infer_model o3 \
    --in_path data/data_en.json \
    --out_dir output/response \
    --tools "text_search,image_search" \
    --max_threads 8 \
    --max_rounds 4
```

Evaluation only:
```bash
python code/evaluate.py \
    --infer_model o3__text_search+image_search \
    --in_dir output/response \
    --out_dir output/judge \
    --score_path output/scores.xlsx
```
Inference results:

- Location: `output/response/`
- Format:
  - `{model_tag}_infer.json`: Complete results in JSON format
  - `{model_tag}_infer.jsonl`: Incremental results in JSONL format (one line per record)
- Content: Each record contains (see the illustrative record after this list):
  - Original input fields
  - `response`: Model's final answer
  - `__messages__`: Complete conversation history
  - `__obs__`: Tool observation results (if `--save_obs` is used)
  - `__tools_compact__`: Tool usage summary
  - `__tool_stats__`: Tool call statistics
  - `__rounds__`: Number of conversation rounds
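For orientation, a single record might look roughly like this; the field values are invented for illustration.

```python
# Illustrative shape of one inference record (values are made up).
record = {
    # Original input fields (question, image paths, ...) are passed through unchanged.
    "response": "The final answer is ...",
    "__messages__": [
        {"role": "user", "content": "..."},
        {"role": "assistant", "content": "..."},
    ],
    "__tools_compact__": ["text_search(query='...')"],
    "__tool_stats__": {"text_search": 2, "image_search": 1},
    "__rounds__": 3,
    # "__obs__" with raw tool observations appears only when --save_obs is set.
}
```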
Evaluation results:

- Location: `output/judge/`
- Format: `{model_tag}_eval.json`
- Content: Each record includes (a quick aggregation sketch follows this list):
  - Original data and inference response
  - `correct`: Binary correctness score (0 or 1)
  - `judge_response`: Raw judge response
  - `judge_parsing`: Parsing status
  - `_empty_response`: Flag for empty responses
  - `_has_error`: Flag for errors
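As a quick sanity check, the per-record `correct` field can be aggregated straight from an eval file. The sketch assumes the file is a JSON array of records, and the model tag in the path is just an example.

```python
# Sketch: recompute accuracy from an evaluation file (path and model tag are examples).
import json

with open("output/judge/o3__text_search+image_search_eval.json", encoding="utf-8") as f:
    records = json.load(f)   # assumed to be a JSON array of per-sample records

scores = [r.get("correct", 0) for r in records]
print(f"avg_score = {sum(scores) / len(scores):.4f} over n = {len(scores)} samples")
```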
Summary scores:

- Location: `output/scores.xlsx`
- Format: Excel file with a sheet named `score`
- Content: Aggregated metrics including (a loading snippet follows this list):
  - `avg_score`: Average correctness score
  - `n`: Total number of samples
  - `n_empty_response`: Count of empty responses
  - `n_error`: Count of errors
  - `n_parse_fail`: Count of parsing failures
  - `tool_{name}_count`: Tool usage counts
  - `time`: Timestamp
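If you want to post-process the metrics, the sheet can be loaded with pandas (requires `pandas` and `openpyxl`); the column names below follow the fields listed above.

```python
# Load the aggregated scores sheet for further analysis.
import pandas as pd

scores = pd.read_excel("output/scores.xlsx", sheet_name="score")
print(scores[["avg_score", "n", "n_empty_response", "n_error"]])
```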