Add global temperature analysis tasks #319
Code Review Summary
Status: No Issues Found | Recommendation: Merge
Solid set of data-analysis task definitions. The grading scripts use regex-based pattern matching appropriately, expected values are well-documented, and the hybrid grading weights (60% automated / 40% LLM judge) make sense for this type of open-ended analysis task.
Files Reviewed (4 files)
Reviewed by claude-4.6-sonnet-20260217 · 147,738 tokens
🧪 PR Test Started
Instance:
Models (running in parallel):
ETA: ~15-20 minutes (3 models × 3 tasks, 180s timeout each)
Automated PR test by ScuttleBot 🦀
🦀 PR Test Results: Global Temperature Tasks
Instance:
Overall Scores
Task Breakdown
| Model | Score | Time | Tokens | Notes |
|---|---|---|---|---|
| Gemini 3.1 Pro | 90.7% | 64s | 91,747 | yoy_changes partial (0.5) |
| Claude Opus 4.6 | 99.6% | 81s | 83,089 | Near-perfect |
| GPT-5.4 | 99.0% | 68s | 92,625 | Near-perfect |
task_csv_temp_trend
| Model | Score | Time | Tokens | Notes |
|---|---|---|---|---|
| Gemini 3.1 Pro | 96.0% | 71s | 79,430 | All automated checks passed |
| Claude Opus 4.6 | 97.6% | 90s | 101,337 | All automated checks passed |
| GPT-5.4 | 98.0% | 53s | 55,705 | All automated checks passed, fastest |
task_csv_temp_decades
| Model | Score | Time | Tokens | Notes |
|---|---|---|---|---|
| Gemini 3.1 Pro | 100.0% | 79s | 112,383 | Perfect score |
| Claude Opus 4.6 | 95.6% | 124s | 108,781 | Slowest on this task |
| GPT-5.4 | 95.6% | 58s | 59,981 | Most efficient (fewest tokens) |
Observations
- All 3 tasks work well — every model scored 90%+ on all tasks, confirming the tasks are solvable and the grading is reasonable
- Task quality looks solid — the hybrid grading (automated + LLM judge) produces nuanced scores that differentiate between partial and full solutions
- Gemini partial on `yoy_changes` in anomalies task: the year-over-year changes check scored 0.5 (partial credit), suggesting either that the automated grader needs a slightly broader acceptance range or that Gemini's formatting differed from what was expected
- Token efficiency varies widely: GPT-5.4 used roughly 30-47% fewer tokens than Gemini and Claude on the trend and decades tasks while scoring comparably
- All models completed well within the 180s timeout — longest execution was Claude at 124s on decades
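To make the `yoy_changes` observation concrete, here is a minimal sketch of how a regex-based check with an acceptance range could award partial credit. It is not the grading script from this PR: the function name `check_values`, the `tolerance` parameter, and the sample numbers are all hypothetical and chosen only for illustration.

```python
import re

# Matches signed integers and decimals such as "0.29", "+0.29", or "-0.2".
# Note: a hyphenated range like "1998-2016" would yield -2016 with this pattern.
NUMBER = re.compile(r"[-+]?\d+(?:\.\d+)?")


def check_values(text: str, expected: list[float], tolerance: float = 0.01) -> float:
    """Return the fraction of expected values found in text within tolerance."""
    if not expected:
        return 0.0
    found = [float(match) for match in NUMBER.findall(text)]
    hits = sum(
        1
        for target in expected
        if any(abs(value - target) <= tolerance for value in found)
    )
    return hits / len(expected)


# Hypothetical model answer and expected values (not taken from the dataset).
answer = "Largest rise: +0.29 C; largest drop: -0.2 C"
strict = check_values(answer, [0.29, -0.31], tolerance=0.01)
lenient = check_values(answer, [0.29, -0.31], tolerance=0.15)
```

With the strict tolerance only one of the two expected values matches, so the check returns 0.5; widening the range to 0.15 accepts both. Under these assumptions, a 0.5 score can come from a narrow acceptance range just as easily as from a wrong answer, which is the ambiguity the observation points at.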
Verdict
✅ Tasks are ready to merge. All three produce meaningful, differentiating results across models with no failures or edge cases.
Tested by ScuttleBot 🦀
Adds 3 new data-analysis tasks using the `global_temperature.csv` dataset:
All tasks use hybrid grading (60% automated, 40% LLM judge) with 180s timeout.
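As a rough illustration of the 60/40 split, the sketch below combines an automated score and an LLM-judge score into a single value. The function name `hybrid_score` and its signature are assumptions for this example, and the framework's actual combination logic may differ.

```python
def hybrid_score(automated: float, judge: float, automated_weight: float = 0.6) -> float:
    """Combine two scores in [0, 1] into one weighted score.

    automated_weight is the share given to the automated checks; the
    remainder (0.4 by default) goes to the LLM judge.
    """
    if not (0.0 <= automated <= 1.0 and 0.0 <= judge <= 1.0):
        raise ValueError("scores must lie in [0, 1]")
    if not 0.0 <= automated_weight <= 1.0:
        raise ValueError("automated_weight must lie in [0, 1]")
    return automated_weight * automated + (1.0 - automated_weight) * judge


# Illustrative inputs: all automated checks pass, the judge awards 0.9.
example = hybrid_score(automated=1.0, judge=0.9)
```

With these illustrative inputs the combined score is 0.6 × 1.0 + 0.4 × 0.9 = 0.96, which shows how a run can pass every automated check and still land below 100%.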
Closes #219, Closes #221, Closes #222