
Add Gapminder life expectancy analysis tasks #320

Merged
olearycrew merged 2 commits into main from tasks/csv-life-expectancy on Apr 15, 2026

@ScuttleBot

Adds 3 new data-analysis tasks using the Gapminder life expectancy CSV dataset (142 countries, 1952-2007):

  1. task_csv_life_exp_ranking - Rank countries by life expectancy in 2007, compute continent averages, compare 1952 vs 2007 rankings
  2. task_csv_life_exp_outliers - Statistical outlier detection (z-score/IQR), within-continent outliers, temporal anomalies (life expectancy drops), connect to real-world events (HIV/AIDS, genocide, wars)
  3. task_csv_life_exp_change - Analyze change over time: global/continent trends, biggest improvers (Oman +38 years), decliners (Zimbabwe -5 years), convergence/divergence analysis

All tasks use assets/csvs/gapminder_life_expectancy.csv and include hybrid grading (automated Python checks + LLM judge rubrics).
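
For readers unfamiliar with the two outlier methods named in task 2, here is a minimal sketch of z-score and IQR detection using only the standard library. The thresholds (2.0 and 1.5) and the sample values are illustrative assumptions; they are not taken from the task files.

```python
import statistics

def zscore_outliers(values, threshold=2.0):
    """Return indices of values whose absolute z-score exceeds the threshold."""
    mean = statistics.fmean(values)
    stdev = statistics.pstdev(values)  # population standard deviation
    if stdev == 0:
        return []
    return [i for i, v in enumerate(values) if abs(v - mean) / stdev > threshold]

def iqr_outliers(values, k=1.5):
    """Return indices of values outside [Q1 - k*IQR, Q3 + k*IQR]."""
    q1, _, q3 = statistics.quantiles(values, n=4, method="inclusive")
    iqr = q3 - q1
    low, high = q1 - k * iqr, q3 + k * iqr
    return [i for i, v in enumerate(values) if v < low or v > high]

# Synthetic life expectancy values (not from the dataset); the last one is low.
sample = [70, 71, 72, 73, 74, 75, 40]
z_hits = zscore_outliers(sample)
iqr_hits = iqr_outliers(sample)
```

The two methods can disagree on real data, because a single extreme value inflates the standard deviation but barely moves the quartiles.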

Closes #216, Closes #217, Closes #218

@kilo-code-bot
Contributor

kilo-code-bot bot commented Apr 14, 2026

Code Review Summary

Status: No Issues Found | Recommendation: Merge

Solid set of benchmark tasks. The grading functions are well-implemented with good fallback alternatives for output filenames, and the expected values are internally consistent across Expected Behavior, Grading Criteria, and Additional Notes sections. The automated checks use appropriately lenient thresholds (e.g. requiring 4+ of 6 outlier countries rather than all 6), which should make scoring robust without being too forgiving.
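
The lenient threshold described above (4 or more of 6 expected countries) could be checked roughly as follows. This is a sketch only: the function name is hypothetical and the country names are placeholders, since the six expected outliers are not listed in this thread.

```python
def mentions_enough(report_text, expected_names, minimum):
    """Pass when at least `minimum` of the expected names appear in the report."""
    text = report_text.lower()
    found = [name for name in expected_names if name.lower() in text]
    return len(found) >= minimum, found

# Placeholder names standing in for the six expected outlier countries.
EXPECTED = ["Alphaland", "Betaland", "Gammaland", "Deltaland", "Epsilonia", "Zetaland"]

report = "Outliers detected: Alphaland, Betaland, Gammaland and Deltaland."
passed, found = mentions_enough(report, EXPECTED, minimum=4)
```

A check like this tolerates a report that misses one or two countries while still failing one that names only a few.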

Files Reviewed (4 files)
  • tasks/manifest.yaml
  • tasks/task_csv_life_exp_ranking.md
  • tasks/task_csv_life_exp_outliers.md
  • tasks/task_csv_life_exp_change.md

Reviewed by claude-4.6-sonnet-20260217 · 90,439 tokens

@ScuttleBot
Author

🧪 PR Test Started

Instance: 66.42.90.87 (Vultr vc2-2c-4gb, ATL)
Branch: tasks/csv-life-expectancy

Tasks Under Test

  • task_csv_life_exp_ranking — Life Expectancy Country Ranking
  • task_csv_life_exp_outliers — Life Expectancy Outlier Detection
  • task_csv_life_exp_change — Life Expectancy Change Over Time

Models

| Model | Provider |
|---|---|
| openrouter/anthropic/claude-opus-4.6 | Anthropic |
| openrouter/openai/gpt-5.4 | OpenAI |
| openrouter/google/gemini-3.1-pro-preview | Google |

ETA: ~20-30 minutes (3 models running in parallel)
Started: 2026-04-15 13:21 UTC


Automated test by ScuttleBot 🦀

@ScuttleBot
Author

🧪 PR Test Results — Life Expectancy Tasks

Instance: 66.42.90.87 (Vultr vc2-2c-4gb, ATL)
Branch: tasks/csv-life-expectancy
Completed: 2026-04-15 14:04 UTC


📊 Overall Scores

| Model | Ranking | Outliers | Change | Overall | Tokens |
|---|---|---|---|---|---|
| claude-opus-4.6 | ✅ 100% | ⚠️ 96% | ⚠️ 96% | 🟢 97.1% | 258K |
| gpt-5.4 | ⚠️ 99% | ⚠️ 89% | ⚠️ 84% | 🟢 90.6% | 243K |
| gemini-3.1-pro-preview | ⚠️ 30% | ⚠️ 86% | ⚠️ 82% | 🔴 66.3% | 189K |

📋 Task Breakdown

task_csv_life_exp_ranking

| Model | Score | Automated | Judge | Time | Notes |
|---|---|---|---|---|---|
| Claude Opus 4.6 | 100% | All pass | Excellent | 71s | Top/bottom rankings exact (Japan 82.603, Swaziland 39.613). Continent averages correct. |
| GPT-5.4 | 99% | All pass | Near-perfect | 92s | All values match expected. Comprehensive analysis. |
| Gemini 3.1 Pro | 30% | 0% | 76% (judge) | 45s | Automated checks failed (0/9) despite correct methodology. Syntax error on first attempt. Judge gave credit for correct values visible in transcript. |

task_csv_life_exp_outliers

| Model | Score | Automated | Judge | Time | Notes |
|---|---|---|---|---|---|
| Claude Opus 4.6 | 96% | 8.5/9 | Strong | 140s | Both IQR and z-score methods. All anomaly types covered. Top 5 drops correct. |
| GPT-5.4 | 89% | 8.5/9 | 81% | 123s | Same strong methodology. Context for HIV/AIDS, genocide correctly linked. |
| Gemini 3.1 Pro | 86% | 9/9 | 65% | 130s | Perfect automated score! IQR method correct. Judge flagged missing z-score approach and incomplete contextual analysis. |

task_csv_life_exp_change

| Model | Score | Automated | Judge | Time | Notes |
|---|---|---|---|---|---|
| Claude Opus 4.6 | 96% | 9/9 | High | 80s | Comprehensive analysis with correct values throughout. |
| GPT-5.4 | 84% | 9/9 | Mixed | 82s | Global averages exact (49.06→67.01). Oman +38.062 as top improver. Some convergence analysis gaps. |
| Gemini 3.1 Pro | 82% | 9/9 | Limited | 82s | All numerical calculations accurate, but analysis lacks depth: it lists numbers without explaining causes. |
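
As an illustration of the improver/decliner calculation this task checks, the per-country change between 1952 and 2007 might be computed along these lines. The column names (`country`, `year`, `life_expectancy`) are assumptions about the CSV layout, and the sample rows are synthetic rather than taken from the dataset.

```python
import csv
import io

# Synthetic rows with hypothetical country names and assumed column names.
SAMPLE = """country,year,life_expectancy
Exampleland,1952,40.0
Exampleland,2007,70.5
Samplestan,1952,50.0
Samplestan,2007,45.0
"""

def change_by_country(csv_file, start=1952, end=2007):
    """Return (country, end - start) pairs sorted from biggest gain to biggest loss."""
    values = {}
    for row in csv.DictReader(csv_file):
        by_year = values.setdefault(row["country"], {})
        by_year[int(row["year"])] = float(row["life_expectancy"])
    changes = {
        country: by_year[end] - by_year[start]
        for country, by_year in values.items()
        if start in by_year and end in by_year  # skip countries missing either year
    }
    return sorted(changes.items(), key=lambda item: item[1], reverse=True)

ranked = change_by_country(io.StringIO(SAMPLE))
top_improver, top_decliner = ranked[0], ranked[-1]
```

Against the real file, `io.StringIO(SAMPLE)` would be replaced by an open handle on assets/csvs/gapminder_life_expectancy.csv, with the column names adjusted to match its header.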

🔍 Observations

  1. Tasks are well-designed — all 3 tasks successfully test meaningful data analysis capabilities with a good mix of automated checks and LLM judge evaluation.
  2. Hybrid grading works well — the automated/judge split catches both correctness (automated) and quality (judge). Gemini's ranking task is an interesting case where automated checks failed but the judge recognized correct work in the transcript.
  3. Gemini ranking anomaly — Gemini scored 0% on automated checks for the ranking task despite the judge giving 76%. Possible issue: the output file may not have been written to the expected location, or there was a formatting mismatch. Worth investigating the automated grading criteria for report_created.
  4. All models handle the CSV well — the dataset and task structure work as intended. Models correctly parse the CSV, compute statistics, and generate analysis.
  5. Token efficiency — Gemini used the fewest tokens (189K) but scored lowest. Claude and GPT were comparable in token usage (~250K) with Claude edging ahead on quality.
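
To make the hypothesis in observation 3 concrete, a file-existence check with fallback filenames could look like the sketch below. It is not the actual `report_created` implementation from the task files; the filenames are hypothetical and the demo runs in a throwaway directory.

```python
import tempfile
from pathlib import Path

def report_created(workspace, candidates):
    """Return the first candidate file that exists and is non-empty, else None."""
    for name in candidates:
        path = Path(workspace) / name
        if path.is_file() and path.stat().st_size > 0:
            return path
    return None

# Demo: the report is written under the second of two accepted (hypothetical) names.
with tempfile.TemporaryDirectory() as workspace:
    (Path(workspace) / "ranking_report.md").write_text("# Report\n")
    found = report_created(workspace, ["report.md", "ranking_report.md"])
    missing = report_created(workspace, ["analysis.txt"])
    found_name = found.name if found else None
```

If a model writes its report under a name or directory outside the accepted list, a check of this shape returns `None`, and any checks that depend on the file contents may then fail as well, which would be consistent with a 0/9 automated score alongside a reasonable judge score.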

⚙️ Infrastructure Note

Initial runs hit a gateway configuration issue (gateway.mode=local not set on snapshot). After fixing, all models ran successfully. The PinchBench snapshot should be updated to include gateway.mode=local in the config to avoid this in future test runs.


Tested by ScuttleBot 🦀 | 3 models × 3 tasks = 9 evaluations

@olearycrew olearycrew merged commit ea94eee into main Apr 15, 2026
1 check passed