Add Gapminder life expectancy analysis tasks by ScuttleBot · Pull Request #320 · pinchbench/skill

ScuttleBot · 2026-04-14T13:57:07Z

Adds 3 new data-analysis tasks using the Gapminder life expectancy CSV dataset (142 countries, 1952-2007):

task_csv_life_exp_ranking - Rank countries by life expectancy in 2007, compute continent averages, and compare 1952 vs 2007 rankings
task_csv_life_exp_outliers - Detect statistical outliers (z-score/IQR), within-continent outliers, and temporal anomalies (life expectancy drops), and connect them to real-world events (HIV/AIDS, genocide, wars)
task_csv_life_exp_change - Analyze change over time: global and continent trends, biggest improvers (Oman, +38 years), decliners (Zimbabwe, -5 years), and convergence/divergence

All tasks use assets/csvs/gapminder_life_expectancy.csv and include hybrid grading (automated Python checks + LLM judge rubrics).
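
For readers unfamiliar with the two detectors named in the outlier task, here is a minimal standard-library sketch of z-score and IQR outlier detection. The function names, the |z| > 2 cutoff, and the 1.5 × IQR fences are conventional choices assumed for illustration, not values taken from the task files, and the sample values are made up.

```python
import statistics

def zscore_outliers(values, threshold=2.0):
    """Return indices whose |z-score| exceeds threshold (population std dev)."""
    mean = statistics.fmean(values)
    sd = statistics.pstdev(values)
    if sd == 0:
        return []  # all values identical, so nothing can be an outlier
    return [i for i, v in enumerate(values) if abs((v - mean) / sd) > threshold]

def iqr_outliers(values, k=1.5):
    """Return indices outside the fences [Q1 - k*IQR, Q3 + k*IQR]."""
    q1, _, q3 = statistics.quantiles(values, n=4, method="inclusive")
    iqr = q3 - q1
    low, high = q1 - k * iqr, q3 + k * iqr
    return [i for i, v in enumerate(values) if v < low or v > high]

# Made-up life expectancies; the last value is deliberately extreme.
sample = [70.1, 71.3, 72.0, 72.8, 73.5, 74.2, 75.0, 39.6]
print(zscore_outliers(sample), iqr_outliers(sample))
```

The two methods can disagree on real data: a single extreme value inflates the standard deviation and can hide itself from a strict z-score cutoff, while the IQR fences are less sensitive to that effect.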

Closes #216, Closes #217, Closes #218

kilo-code-bot · 2026-04-14T13:57:52Z

Code Review Summary

Status: No Issues Found | Recommendation: Merge

Solid set of benchmark tasks. The grading functions are well-implemented with good fallback alternatives for output filenames, and the expected values are internally consistent across Expected Behavior, Grading Criteria, and Additional Notes sections. The automated checks use appropriately lenient thresholds (e.g. requiring 4+ of 6 outlier countries rather than all 6), which should make scoring robust without being too forgiving.
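
The two grading patterns praised in this review (fallback output filenames and "at least N of M" thresholds) can be sketched as follows. This is an illustration under stated assumptions, not the code from the task files: the helper names, the candidate filenames, the `outliers_identified` criterion, and the placeholder country names are all hypothetical. Only `report_created` is a criterion name that appears in this thread.

```python
import tempfile
from pathlib import Path

def find_report(workspace, candidates):
    """Return the first candidate file that exists under workspace, or None."""
    for name in candidates:
        path = Path(workspace) / name
        if path.is_file():
            return path
    return None

def mentions_enough(text, expected, minimum):
    """Lenient check: pass when at least `minimum` expected names appear."""
    lowered = text.lower()
    hits = sum(1 for name in expected if name.lower() in lowered)
    return hits >= minimum

def grade(workspace, candidates, expected, minimum=4):
    """Return per-criterion scores in [0, 1] (criterion names are assumed)."""
    report = find_report(workspace, candidates)
    if report is None:
        return {"report_created": 0.0, "outliers_identified": 0.0}
    text = report.read_text(encoding="utf-8", errors="replace")
    found = mentions_enough(text, expected, minimum)
    return {"report_created": 1.0, "outliers_identified": 1.0 if found else 0.0}

# Usage with placeholder names; the real expected list lives in the task file.
expected = ["Alpha", "Beta", "Gamma", "Delta", "Epsilon", "Zeta"]
with tempfile.TemporaryDirectory() as ws:
    empty_result = grade(ws, ["outlier_report.md", "outliers_report.md"], expected)
    report_path = Path(ws) / "outliers_report.md"
    report_path.write_text("Flagged: Alpha, Beta, Gamma, Delta", encoding="utf-8")
    result = grade(ws, ["outlier_report.md", "outliers_report.md"], expected)
```

Plain substring matching is a deliberate simplification here; a real check would need word boundaries so that, for example, "Niger" does not match inside "Nigeria".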

Files Reviewed (4 files)

tasks/manifest.yaml
tasks/task_csv_life_exp_ranking.md
tasks/task_csv_life_exp_outliers.md
tasks/task_csv_life_exp_change.md

Reviewed by claude-4.6-sonnet-20260217 · 90,439 tokens

ScuttleBot · 2026-04-15T13:21:04Z

🧪 PR Test Started

Instance: 66.42.90.87 (Vultr vc2-2c-4gb, ATL)
Branch: tasks/csv-life-expectancy

Tasks Under Test

task_csv_life_exp_ranking — Life Expectancy Country Ranking
task_csv_life_exp_outliers — Life Expectancy Outlier Detection
task_csv_life_exp_change — Life Expectancy Change Over Time

Models

| Model | Provider |
| --- | --- |
| `openrouter/anthropic/claude-opus-4.6` | Anthropic |
| `openrouter/openai/gpt-5.4` | OpenAI |
| `openrouter/google/gemini-3.1-pro-preview` | Google |

ETA: ~20-30 minutes (3 models running in parallel)
Started: 2026-04-15 13:21 UTC

Automated test by ScuttleBot 🦀

ScuttleBot · 2026-04-15T14:04:02Z

🧪 PR Test Results — Life Expectancy Tasks

Instance: 66.42.90.87 (Vultr vc2-2c-4gb, ATL)
Branch: tasks/csv-life-expectancy
Completed: 2026-04-15 14:04 UTC

📊 Overall Scores

| Model | Ranking | Outliers | Change | Overall | Tokens |
| --- | --- | --- | --- | --- | --- |
| `claude-opus-4.6` | ✅ 100% | ⚠️ 96% | ⚠️ 96% | 🟢 97.1% | 258K |
| `gpt-5.4` | ⚠️ 99% | ⚠️ 89% | ⚠️ 84% | 🟢 90.6% | 243K |
| `gemini-3.1-pro-preview` | ⚠️ 30% | ⚠️ 86% | ⚠️ 82% | 🔴 66.3% | 189K |

📋 Task Breakdown

task_csv_life_exp_ranking

| Model | Score | Automated | Judge | Time | Notes |
| --- | --- | --- | --- | --- | --- |
| Claude Opus 4.6 | 100% | All pass | Excellent | 71s | Top/bottom rankings exact (Japan 82.603, Swaziland 39.613). Continent averages correct. |
| GPT-5.4 | 99% | All pass | Near-perfect | 92s | All values match expected. Comprehensive analysis. |
| Gemini 3.1 Pro | 30% | 0% (0/9) | 76% | 45s | Automated checks failed (0/9) despite correct methodology. Syntax error on first attempt. Judge gave credit for correct values visible in transcript. |

task_csv_life_exp_outliers

| Model | Score | Automated | Judge | Time | Notes |
| --- | --- | --- | --- | --- | --- |
| Claude Opus 4.6 | 96% | 8.5/9 | Strong | 140s | Both IQR and z-score methods. All anomaly types covered. Top 5 drops correct. |
| GPT-5.4 | 89% | 8.5/9 | 81% | 123s | Same strong methodology. Context for HIV/AIDS, genocide correctly linked. |
| Gemini 3.1 Pro | 86% | 9/9 | 65% | 130s | Perfect automated score! IQR method correct. Judge flagged missing z-score approach and incomplete contextual analysis. |

task_csv_life_exp_change

| Model | Score | Automated | Judge | Time | Notes |
| --- | --- | --- | --- | --- | --- |
| Claude Opus 4.6 | 96% | 9/9 | High | 80s | Comprehensive analysis with correct values throughout. |
| GPT-5.4 | 84% | 9/9 | Mixed | 82s | Global averages exact (49.06→67.01). Oman +38.062 as top improver. Some convergence analysis gaps. |
| Gemini 3.1 Pro | 82% | 9/9 | Limited | 82s | All numerical calculations accurate, but analysis lacks depth: it lists numbers without explaining causes. |
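
The change-over-time figures checked above (per-country deltas, top improver, convergence) can be reproduced with a short standard-library sketch. The long-format layout and the column names `country`, `year`, and `life_exp` are assumptions, since this thread does not show the header of `gapminder_life_expectancy.csv`. The sample rows are quoted from memory of the public Gapminder extract and should be verified against the CSV; the thread itself confirms only Japan's 2007 value (82.603) and Oman's delta (+38.062).

```python
import csv
import io
import statistics
from collections import defaultdict

# Illustrative rows in an assumed long format; column names are hypothetical.
SAMPLE = """country,year,life_exp
Oman,1952,37.578
Oman,2007,75.640
Zimbabwe,1952,48.451
Zimbabwe,2007,43.487
Japan,1952,63.030
Japan,2007,82.603
"""

def load(text):
    """Parse CSV text into {country: {year: life_exp}}."""
    table = defaultdict(dict)
    for row in csv.DictReader(io.StringIO(text)):
        table[row["country"]][int(row["year"])] = float(row["life_exp"])
    return table

def changes(table, start=1952, end=2007):
    """Return (country, delta) pairs sorted from biggest gain to biggest loss."""
    deltas = [
        (country, round(by_year[end] - by_year[start], 3))
        for country, by_year in table.items()
        if start in by_year and end in by_year
    ]
    return sorted(deltas, key=lambda pair: pair[1], reverse=True)

def spread(table, year):
    """Population std dev across countries; a falling value suggests convergence."""
    return statistics.pstdev([v[year] for v in table.values() if year in v])

table = load(SAMPLE)
ranked = changes(table)
print(ranked[0], ranked[-1])
```

Comparing `spread(table, 1952)` with `spread(table, 2007)` on the full dataset is one simple convergence test; three sample rows say nothing about the global trend.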

🔍 Observations

Tasks are well-designed: all 3 tasks test meaningful data-analysis capabilities with a good mix of automated checks and LLM judge evaluation.
Hybrid grading works well: the automated checks catch correctness and the judge catches quality. Gemini's ranking run shows the split in action, since the automated checks failed while the judge recognized correct work in the transcript.
Gemini ranking anomaly: Gemini scored 0% (0/9) on automated checks for the ranking task while the judge gave 76%. Possible causes: the output file was not written to the expected location, or its format did not match what the checks expect. The automated grading criteria for report_created are worth investigating.
All models handle the CSV well: each model parsed the CSV, computed statistics, and generated analysis, so the dataset and task structure work as intended.
Token efficiency: Gemini used the fewest tokens (189K) but scored lowest. Claude (258K) and GPT (243K) were comparable in token usage, with Claude ahead on quality.
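
This thread does not state how the automated and judge components are combined into a task score. As a hedged illustration, a weighted average with 60% on automated checks and 40% on the judge (an assumption, not a documented setting) reproduces the three reported task scores for which both components are given as numbers:

```python
def hybrid_score(automated, judge, automated_weight=0.6):
    """Weighted average of two scores in [0, 1]; the 0.6 weight is an assumption."""
    if not 0.0 <= automated_weight <= 1.0:
        raise ValueError("automated_weight must be in [0, 1]")
    return automated_weight * automated + (1.0 - automated_weight) * judge

# (automated fraction, judge fraction, reported task score in percent),
# all taken from the task breakdown tables in this thread.
reported = {
    "Gemini ranking": (0 / 9, 0.76, 30),
    "GPT-5.4 outliers": (8.5 / 9, 0.81, 89),
    "Gemini outliers": (9 / 9, 0.65, 86),
}

for name, (automated, judge, reported_pct) in reported.items():
    pct = round(100 * hybrid_score(automated, judge))
    print(f"{name}: computed {pct}%, reported {reported_pct}%")
```

If the weighting really is close to this, a 0/9 automated result caps a task at about 40% regardless of judge score, which would explain why a single missing or misnamed output file cost Gemini so much on the ranking task.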

⚙️ Infrastructure Note

Initial runs failed on a gateway configuration issue: gateway.mode=local was not set on the snapshot. After the setting was added, all models ran successfully. The PinchBench snapshot should include gateway.mode=local in its config so that future test runs avoid this.

Tested by ScuttleBot 🦀 | 3 models × 3 tasks = 9 evaluations

Add life expectancy CSV analysis tasks

85cde3d

Merge branch 'main' into tasks/csv-life-expectancy

417c4bd

olearycrew merged commit ea94eee into main Apr 15, 2026
1 check passed

ScuttleBot mentioned this pull request Apr 16, 2026

feat: implement new_session:true support for multi-turn task session … #330

Open
