Add Gapminder life expectancy analysis tasks #320
Code Review Summary
Status: No Issues Found | Recommendation: Merge

Solid set of benchmark tasks. The grading functions are well implemented, with good fallback alternatives for output filenames, and the expected values are internally consistent across the Expected Behavior, Grading Criteria, and Additional Notes sections. The automated checks use appropriately lenient thresholds (e.g. requiring 4+ of 6 outlier countries rather than all 6), which should make scoring robust without being too forgiving.

Files Reviewed (4 files)
Reviewed by claude-4.6-sonnet-20260217 · 90,439 tokens
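
For readers unfamiliar with the pattern the review describes, here is a minimal sketch of a check that accepts fallback output filenames and passes on a lenient threshold. It is not the PR's actual grading code: the function names, the candidate filenames, and the placeholder country names are all hypothetical, and only the "4+ of 6" threshold comes from the review.

```python
from pathlib import Path
from typing import Iterable, Optional

# Hypothetical filenames; the PR's real fallback list is not shown on this page.
CANDIDATE_FILENAMES = ["outliers.md", "outliers.txt", "report.md"]


def find_output(workspace: Path,
                candidates: Iterable[str] = CANDIDATE_FILENAMES) -> Optional[Path]:
    """Return the first candidate file that exists in the workspace, else None."""
    for name in candidates:
        path = workspace / name
        if path.is_file():
            return path
    return None


def check_outliers(text: str, expected: Iterable[str], min_hits: int = 4) -> bool:
    """Pass when at least `min_hits` expected names appear in `text`.

    Matching is a case-insensitive substring test, so a report that misses
    one or two names can still pass.
    """
    lowered = text.lower()
    hits = sum(1 for name in expected if name.lower() in lowered)
    return hits >= min_hits


# Placeholder names standing in for the six expected outlier countries.
EXPECTED = ["Country A", "Country B", "Country C",
            "Country D", "Country E", "Country F"]
```

With these placeholders, a report naming four of the six expected entries passes and one naming only three fails. Trying several filenames in order keeps a correct answer from scoring zero just because it was saved under an unexpected name.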
🧪 PR Test Started
Instance:
Tasks Under Test
Models
ETA: ~20-30 minutes (3 models running in parallel)
Automated test by ScuttleBot 🦀
🧪 PR Test Results: Life Expectancy Tasks
Instance:
📊 Overall Scores
📋 Task Breakdown
task_csv_life_exp_ranking
task_csv_life_exp_outliers
task_csv_life_exp_change
🔍 Observations
⚙️ Infrastructure Note
Initial runs hit a gateway configuration issue (
Tested by ScuttleBot 🦀 | 3 models × 3 tasks = 9 evaluations
Adds 3 new data-analysis tasks using the Gapminder life expectancy CSV dataset (142 countries, 1952-2007):
All tasks use assets/csvs/gapminder_life_expectancy.csv and include hybrid grading (automated Python checks + LLM judge rubrics).

Closes #216, Closes #217, Closes #218