Wikikig Benchmark is a sophisticated evaluation framework designed to test the reasoning, navigation, and long-context capabilities of Large Language Models (LLMs). By challenging models to play the "Wikipedia Game," it provides deep insights into how AI agents plan, backtrack, and handle hallucinations in complex information spaces.
The goal is simple but the execution is complex: Navigate from a starting Wikipedia page to a target page using only internal links.
- Reasoning & Planning: Models must understand the semantic relationship between the current page and the target.
- Hallucination Tracking: We anonymize links (e.g., `[CONCEPT_01: Science]`) to ensure the model relies on the provided context rather than pre-trained knowledge, making it easy to detect when a model "invents" a path.
- Backtracking & Recovery: Tests the model's ability to recognize a dead end and return to a previous state.
- Structured Output: Evaluates the model's reliability in following strict navigation protocols.
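The repository's exact navigation protocol isn't reproduced here, but a minimal sketch of what a structured move could look like follows. The `Move` fields (`concept_id`, `intuition`, `backtrack`) and the `parse_move` helper are illustrative assumptions, not the project's actual schema:

```python
from dataclasses import dataclass
import json

@dataclass
class Move:
    concept_id: str   # e.g. "CONCEPT_07"; must come from the offered link list
    intuition: str    # the model's one-line rationale for the pick
    backtrack: bool   # True when the model wants to return to the prior page

def parse_move(raw: str) -> Move:
    """Parse a model response into a Move.

    Raises on malformed JSON or missing fields, which a benchmark like this
    would count against structured-parsing success.
    """
    data = json.loads(raw)
    return Move(
        concept_id=data["concept_id"],
        intuition=data["intuition"],
        backtrack=bool(data.get("backtrack", False)),
    )
```

A failed `parse_move` call is itself a useful signal: it separates "the model picked a bad link" from "the model couldn't follow the protocol at all."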
- Multi-Model Benchmarking: Run the same challenge across multiple models (GPT-4, Claude... via NanoGPT) sequentially to compare performance.
- Real-Time Visualization: Watch the model's "thought process" live with an interactive D3.js graph showing the path taken.
- Deep Metrics:
- Path Efficiency: Number of clicks vs. optimal path.
- Hallucination Rate: Frequency of invalid link selections.
- Structured Parsing Success: Reliability of the model's JSON/structured responses.
- Intuition Logging: Captures the model's "gut feeling" for every move.
- Archive Explorer: Every run is automatically saved. Review past benchmarks, analyze paths, and export data for further research.
- Docker-First: Get up and running in minutes with a fully containerized stack.
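The two headline metrics above reduce to simple ratios. As a sketch (function names and the exact normalization are assumptions, not the repo's implementation):

```python
def path_efficiency(clicks: int, optimal: int) -> float:
    """Ratio of the shortest known path length to the clicks actually used.

    1.0 means the model played optimally; lower values mean wandering.
    """
    return optimal / clicks if clicks else 0.0

def hallucination_rate(invalid_picks: int, total_picks: int) -> float:
    """Fraction of moves that referenced a concept not present on the page."""
    return invalid_picks / total_picks if total_picks else 0.0
```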
- Backend: Python 3.12, FastAPI, LangChain (for structured output), Uvicorn.
- Frontend: React 18, TypeScript, Vite, Tailwind CSS, D3.js (for graph visualization), Lucide Icons.
- Communication: WebSockets for real-time event streaming from the orchestrator to the UI.
- Infrastructure: Docker & Docker Compose.
- Docker and Docker Compose installed.
- A NanoGPT API Key.
```bash
# Clone the repository
git clone https://github.com/MaloLM/WikikigBenchmark.git
cd WikikigBenchmark

# Setup environment
cp .env.example .env
# Edit .env and add your API key

# Start the stack
docker-compose up --build
```

- Frontend: http://localhost:3000
- API Docs: http://localhost:8000/docs
For detailed instructions, see the Quick Start Guide.
- Anonymization: The system fetches a Wikipedia page and replaces all links with unique identifiers (e.g., `CONCEPT_42`).
- Prompting: The LLM receives the page content and the list of concepts. It must provide its next move along with its "intuition."
- Orchestration: The backend validates the move, fetches the next page, and streams the update to the frontend via WebSockets.
- Analysis: Upon completion (or failure), the system calculates final metrics and archives the entire session.
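The anonymization step above can be sketched in a few lines. This is a simplified illustration assuming plain-text page content and a known link list; the helper names and token format (`CONCEPT_NN`) follow the examples in this README, not the project's source:

```python
import re

def anonymize_links(text: str, links: list[str]) -> tuple[str, dict[str, str]]:
    """Replace each Wikipedia link title with an opaque CONCEPT_NN token.

    Returns the rewritten text plus a token-to-title mapping, so a pick
    outside the offered vocabulary is provably a hallucination.
    """
    mapping: dict[str, str] = {}
    for i, title in enumerate(links, start=1):
        token = f"CONCEPT_{i:02d}"
        mapping[token] = title
        # Rewrite exact occurrences of the link title in the page text.
        text = re.sub(re.escape(title), f"[{token}: {title}]", text)
    return text, mapping

def is_hallucination(choice: str, mapping: dict[str, str]) -> bool:
    """A move hallucinates if the chosen token was never offered."""
    return choice not in mapping
```

Because the model only ever sees tokens, any answer that names a page directly (or a token outside the mapping) can be flagged mechanically, with no semantic judgment required.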
Contributions are welcome, whether you want to add new metrics, improve the graph visualization, or support more LLM providers.
- Fork the Project
- Create your Feature Branch (`git checkout -b feature/AmazingFeature`)
- Commit your Changes (`git commit -m 'Add some AmazingFeature'`)
- Push to the Branch (`git push origin feature/AmazingFeature`)
- Open a Pull Request
Distributed under the MIT License. See LICENSE for more information.
Built with ❤️ for the Research Community