# EsoLang-Bench: Evaluating Large Language Models via Esoteric Programming Languages
Paper: arxiv.org/abs/2603.09678 · Website: esolang-bench.vercel.app · Dataset: huggingface.co/datasets/Lossfunk/Esolang-Bench
EsoLang-Bench is a contamination-resistant benchmark that evaluates frontier LLMs on code generation in five Turing-complete esoteric programming languages: Brainfuck, Befunge-98, Whitespace, Unlambda, and Shakespeare. These languages have 340× to over 60,000× fewer public GitHub repositories than Python (verified May 2026 via topic-tag counts) and negligible commercial deployment value, which together make training-corpus contamination highly unlikely.
The same 80 problems expressed in Python or JavaScript reach 100% on top frontier models, while peak esoteric accuracy is only 11.2% (GPT-5.4 xhigh, self-scaffolding, Befunge-98): an 89-point collapse on identical algorithmic content. Few-shot prompting adds only 0.8 pp over zero-shot (Wilcoxon p = 0.505, n.s.).
Basic (interpreters only):

```bash
pip install -e .
```

Benchmark (includes OpenRouter API client):

```bash
pip install -e ".[benchmark]"
```

Development (includes test dependencies):

```bash
pip install -e ".[benchmark,dev]"
```

The benchmark dataset (80 problems spanning 4 difficulty tiers, each evaluated independently in 5 esoteric languages = 400 problem-language combinations per prompting strategy) is available on Hugging Face. Each problem ships with 6 input-output test cases used by the evaluation harness to score model output; the test cases never enter the model's prompt context, but they are released publicly with the dataset for transparency and reproducibility.
```python
from datasets import load_dataset

# All 80 problems (single config, single split)
ds = load_dataset("Lossfunk/Esolang-Bench")["test"]

# Filter by difficulty tier
easy = ds.filter(lambda r: r["difficulty"] == "easy")
medium = ds.filter(lambda r: r["difficulty"] == "medium")
hard = ds.filter(lambda r: r["difficulty"] == "hard")
xhard = ds.filter(lambda r: r["difficulty"] == "extra_hard")

# Each row: id, difficulty, title, description, test_cases (list of 6 {input, output} dicts)
print(ds[0])
```

A complete Croissant 1.1 metadata file with all seven NeurIPS-mandated Responsible AI fields ships alongside the dataset on the Hugging Face Hub.
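To see how scoring works end to end, here is a minimal sketch that grades a candidate program against a row's withheld test cases using the package's interpreter API (documented below). `score_program` is a hypothetical helper, not part of the package, and the official harness's pass criteria may differ (e.g., output normalization).

```python
from datasets import load_dataset
from esolang_bench import get_interpreter

def score_program(code: str, row: dict, language: str = "brainfuck") -> float:
    """Hypothetical helper: fraction of a row's 6 test cases the program passes.

    Assumes exact-match stdout comparison; the real harness may normalize
    output or weight errors differently.
    """
    interp = get_interpreter(language)
    passed = 0
    for case in row["test_cases"]:
        result = interp.run(code, stdin=case["input"])
        if result.error_type == "ok" and result.stdout == case["output"]:
            passed += 1
    return passed / len(row["test_cases"])

ds = load_dataset("Lossfunk/Esolang-Bench")["test"]
print(score_program(",.", ds[0]))  # ',.' is a Brainfuck echo-one-char program (illustrative only)
```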
```bash
# Brainfuck: print '$' (ASCII 36)
esolang-interpret -l brainfuck -c '++++++[>++++++<-]>.'

# Befunge-98: Hello World
esolang-interpret -l befunge98 -c '"!dlroW ,olleH">:#,_@'

# From file
esolang-interpret -l whitespace -f program.ws

# With stdin
echo "5" | esolang-interpret -l brainfuck -c ',.'
```

```python
from esolang_bench import get_interpreter

interp = get_interpreter("brainfuck")
result = interp.run("++++++[>++++++<-]>.", stdin="")
print(result.stdout)      # "$"
print(result.error_type)  # "ok"
```
```bash
export OPENROUTER_API_KEY=sk-or-...

# Run a single evaluation
esolang-run --model gpt-5.2 --language brainfuck --regime self_scaffolding

# Filter by difficulty
esolang-run --model gpt-5.2 --language brainfuck --regime zero_shot --difficulty easy

# Limit problems for quick testing
ESOLANG_MAX_PROBLEMS=5 esolang-run -m gpt-5.2 -l brainfuck -r zero_shot
```

EsoLang-Bench evaluates models under five prompting regimes, plus a mainstream-language (Python/JavaScript) baseline:
| Regime | LLM Calls/Iter | Description |
|---|---|---|
| `zero_shot` | 1 (single) | Direct code generation with language docs |
| `few_shot` | 1 (single) | Zero-shot + 3 in-context learning examples |
| `self_scaffolding` | 1 | Direct interpreter feedback; model self-diagnoses (best non-agentic) |
| `textual_self_scaffolding` | 2 | Coder + critic pair; critic provides NL debugging guidance |
| `react` | 3 | Planner + coder + critic pipeline (ReAct-style) |
All iterative regimes (`self_scaffolding`, `textual_self_scaffolding`, `react`) run up to 5 attempts per problem (configurable via environment variables); a sketch of the core loop follows.
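For concreteness, here is a minimal sketch of the self-scaffolding loop, assuming a `generate(prompt) -> str` LLM wrapper supplied by the caller; the prompt wording and feedback format are assumptions, not the harness's real templates (see `prompt_templates.py`). Consistent with the dataset policy above, expected outputs never enter the prompt; only the interpreter's status and the program's own stdout are fed back.

```python
from esolang_bench import get_interpreter

MAX_ATTEMPTS = 5  # mirrors the MAX_ATTEMPTS_SELF_SCAFFOLDING default

def self_scaffold(generate, problem: dict, language: str = "brainfuck"):
    """Hypothetical sketch of a generate -> run -> self-diagnose loop."""
    interp = get_interpreter(language)
    prompt = f"Write a {language} program for this task:\n{problem['description']}"
    for _ in range(MAX_ATTEMPTS):
        code = generate(prompt)
        results = [interp.run(code, stdin=c["input"]) for c in problem["test_cases"]]
        if all(r.error_type == "ok" and r.stdout == c["output"]
               for r, c in zip(results, problem["test_cases"])):
            return code  # all 6 hidden test cases pass
        # Feed raw interpreter feedback back; no expected outputs are revealed
        r = results[0]
        prompt += (f"\n\nYour previous program:\n{code}\n"
                   f"Interpreter status: {r.error_type}; stdout: {r.stdout!r}. "
                   f"The output is not yet correct. Revise the program.")
    return None  # unsolved within the attempt budget
```

The `textual_self_scaffolding` and `react` regimes replace the raw-feedback step with a critic call (and, for `react`, an additional planner call), which is where their higher per-iteration LLM-call counts come from.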
Problems are organized into 4 difficulty tiers:
| Level | Flag | Description |
|---|---|---|
| Easy | `--difficulty easy` | Basic I/O, simple loops |
| Medium | `--difficulty medium` | String manipulation, conditionals |
| Hard | `--difficulty hard` | Complex algorithms, nested structures |
| Extra Hard | `--difficulty extra_hard` | Advanced data structures, multi-step reasoning |
Use `--difficulty all` (the default) to run all problems.
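To produce per-tier results in one sweep, the documented flag can be looped over; the model and language below are placeholders.

```bash
# Sweep all four tiers separately (flag values from the table above)
for tier in easy medium hard extra_hard; do
  esolang-run --model gpt-5.2 --language brainfuck --regime zero_shot --difficulty "$tier"
done
```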
| Variable | Default | Description |
|---|---|---|
| `OPENROUTER_API_KEY` | (required) | OpenRouter API key |
| `ESOLANG_MAX_PROBLEMS` | unlimited | Limit number of problems per run |
| `ESOLANG_RESULTS_DIR` | `./results` | Output directory for result JSONL files |
| `MAX_ATTEMPTS_SELF_SCAFFOLDING` | `5` | Max iterations for self-scaffolding |
| `MAX_ATTEMPTS_TEXTUAL_SELF_SCAFFOLDING` | `5` | Max iterations for textual self-scaffolding |
| `MAX_ATTEMPTS_REACT` | `5` | Max iterations for ReAct pipeline |
| `MAX_TOKENS_{REGIME}` | `8192` | Max tokens for a regime (e.g., `MAX_TOKENS_ZERO_SHOT`) |
| `MAX_TOKENS_{MODEL}_{REGIME}` | -- | Per-model token override |
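As an example of how these variables compose, here is a budget-capped smoke test; the values are illustrative, and `MAX_TOKENS_SELF_SCAFFOLDING` is assumed to follow the documented `MAX_TOKENS_{REGIME}` pattern.

```bash
# Sketch: 5-problem smoke test with tightened attempt and token budgets
export OPENROUTER_API_KEY=sk-or-...
export ESOLANG_RESULTS_DIR=./results/smoke   # JSONL output lands here
export ESOLANG_MAX_PROBLEMS=5
export MAX_ATTEMPTS_SELF_SCAFFOLDING=2
export MAX_TOKENS_SELF_SCAFFOLDING=4096
esolang-run -m gpt-5.2 -l brainfuck -r self_scaffolding
```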
| Language | Paradigm | GitHub topic repos (May 2026) | Peak esoteric accuracy |
|---|---|---|---|
| Brainfuck | 8-command memory tape | 2,028 | 6.2% |
| Befunge-98 | 2D stack-based grid | 86 | 11.2% |
| Whitespace | Stack-based, invisible syntax | 125 | 0% |
| Unlambda | Combinatory logic (s, k, i) | 25 | 1.2% |
| Shakespeare | Theatrical-play dialogue | 11 | 2.5% |
For reference: Python is tagged on 684,596 repositories and JavaScript on 648,084. The gap relative to Python ranges from ~340× for Brainfuck to over 60,000× for Shakespeare.
Best per-model accuracy across all five prompting strategies (self-scaffolding is the dominant strategy for every model):
| Model | Best Strategy | Best per-language peak | Overall (across 5 langs) |
|---|---|---|---|
| GPT-5.4 xhigh | Self-Scaffolding | 11.2% (Befunge-98) | ~3.8% |
| O4-mini-high | Self-Scaffolding | 10.0% (Befunge-98) | ~3.4% |
| Gemini 3.1 Pro | Self-Scaffolding | 7.5% (Befunge-98) | ~2.6% |
| Qwen3-235B | Self-Scaffolding | 2.5% (Brainfuck) | ~1.0% |
| Kimi K2.5 | Self-Scaffolding | 1.2% (Shakespeare) | ~0.2% |
Mainstream-language baselines (Python, JavaScript) reach 100% across all four difficulty tiers on the same 80 problems.
```
esolang_bench/
    interpreters/           # Pure-Python interpreters for 5 esolangs
    benchmarking/           # LLM evaluation harness
        config.py           # Models, regimes, difficulty levels, token limits
        runner_utils.py     # All 5 regime runners + CLI entry point
        prompt_templates.py # Prompt builders for each regime
        dataset_loader.py   # Problem loading with difficulty filtering
        metrics.py          # Accuracy and attempt tracking
    data/                   # 80 problems x 4 difficulty tiers
    docs/                   # Language reference documentation
    icl_examples/           # Few-shot examples per language
    tests/                  # Interpreter test suite
```
```bash
pip install -e ".[dev]"
pytest tests/ -v
```

```bibtex
@inproceedings{sharma2026esolangbench,
  title        = {{EsoLang-Bench}: Evaluating Large Language Models via Esoteric Programming Languages},
  author       = {Sharma, Aman and Chopra, Paras},
  booktitle    = {NeurIPS 2026 Track on Evaluations and Datasets},
  year         = {2026},
  organization = {Lossfunk}
}
```