Add global temperature analysis tasks #319

Merged
olearycrew merged 2 commits into main from tasks/csv-temperature
Apr 15, 2026
Conversation

@ScuttleBot

Adds 3 new data-analysis tasks using the global_temperature.csv dataset:

  1. task_csv_temp_anomalies - Detect temperature anomalies (extreme months, z-score outliers, warmest/coldest years, year-over-year changes)
  2. task_csv_temp_trend - Analyze temperature trends (linear regression, pre/post-1950 acceleration, milestone crossings)
  3. task_csv_temp_decades - Compare decades (decade averages, transitions, variability, GISTEMP vs gcag source comparison)
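
As a rough illustration of two of the techniques named in the task list (z-score outliers and a linear trend), here is a minimal Python sketch. The function names `zscore_outliers` and `linear_trend`, the threshold of 2.0, and the sample values are all hypothetical; none of them come from the task files or from global_temperature.csv.

```python
from statistics import mean, pstdev


def zscore_outliers(values, threshold=2.0):
    """Return (index, z) pairs whose absolute z-score exceeds the threshold."""
    mu = mean(values)
    sigma = pstdev(values)  # population standard deviation
    if sigma == 0:
        return []  # all values identical, so nothing can be an outlier
    scored = [(i, (v - mu) / sigma) for i, v in enumerate(values)]
    return [(i, z) for i, z in scored if abs(z) > threshold]


def linear_trend(xs, ys):
    """Ordinary least-squares fit of y = slope * x + intercept."""
    x_bar, y_bar = mean(xs), mean(ys)
    sxx = sum((x - x_bar) ** 2 for x in xs)
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return slope, y_bar - slope * x_bar


# Made-up anomaly values (degrees C), for illustration only.
years = [2000, 2001, 2002, 2003, 2004, 2005]
anomalies = [0.10, 0.12, 0.11, 0.13, 0.90, 0.12]

outliers = zscore_outliers(anomalies)          # flags only the 0.90 entry
slope, intercept = linear_trend(years, anomalies)
```

The sketch uses the population standard deviation; a task that expects the sample standard deviation would swap in `statistics.stdev`, which shifts the z-scores slightly.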

All tasks use hybrid grading (60% automated, 40% LLM judge) with 180s timeout.
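
One plausible reading of the 60/40 split is a linear weighted sum of the two component scores. The sketch below assumes that reading; the function name `hybrid_score` and the sample inputs are hypothetical, and the harness may combine the scores differently.

```python
def hybrid_score(automated, judge, automated_weight=0.6):
    """Combine two scores in [0, 1] into a single weighted score."""
    if not (0.0 <= automated <= 1.0 and 0.0 <= judge <= 1.0):
        raise ValueError("scores must be in [0, 1]")
    return automated_weight * automated + (1.0 - automated_weight) * judge


# Hypothetical case: every automated check passes, the judge awards 0.9.
score = hybrid_score(1.0, 0.9)
```

Under this assumption, a run that passes every automated check can still land below 100% if the judge withholds points.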

Closes #219, Closes #221, Closes #222

@kilo-code-bot
Contributor

kilo-code-bot bot commented Apr 14, 2026

Code Review Summary

Status: No Issues Found | Recommendation: Merge

Solid set of data-analysis task definitions. The grading scripts use regex-based pattern matching appropriately, expected values are well-documented, and the hybrid grading weights (60% automated / 40% LLM judge) make sense for this type of open-ended analysis task.
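
For readers unfamiliar with regex-based grading, a check of this kind might look roughly like the following Python sketch. The helper names `extract_number` and `check_value`, the partial-credit rule, and the sample report text are assumptions made for illustration; the actual grading scripts in the task files are not reproduced here.

```python
import re


def extract_number(text, label):
    """Return the first number that follows `label` in the text, or None."""
    pattern = re.escape(label) + r"[^0-9+\-]*([+-]?\d+(?:\.\d+)?)"
    match = re.search(pattern, text, flags=re.IGNORECASE)
    return float(match.group(1)) if match else None


def check_value(text, label, expected, tolerance):
    """Score 1.0 within tolerance, 0.5 within twice the tolerance, else 0.0."""
    found = extract_number(text, label)
    if found is None:
        return 0.0  # the label or its number is missing from the report
    error = abs(found - expected)
    if error <= tolerance:
        return 1.0
    if error <= 2 * tolerance:
        return 0.5
    return 0.0


# Made-up report text and expected values, for illustration only.
report = "Linear slope: 0.0075 per year."
full = check_value(report, "slope", expected=0.0080, tolerance=0.001)
partial = check_value(report, "slope", expected=0.0090, tolerance=0.001)
missing = check_value(report, "median", expected=1.0, tolerance=0.1)
```

A tolerance band with a partial-credit tier is one way an automated check can return values such as 0.5 instead of a strict pass or fail.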

Files Reviewed (4 files)
  • tasks/manifest.yaml
  • tasks/task_csv_temp_anomalies.md
  • tasks/task_csv_temp_trend.md
  • tasks/task_csv_temp_decades.md



Reviewed by claude-4.6-sonnet-20260217 · 147,738 tokens

@ScuttleBot
Author

🧪 PR Test Started

Instance: 155.138.235.245 (Vultr vc2-2c-4gb, ATL)
Branch: tasks/csv-temperature
Tasks:

  • task_csv_temp_anomalies
  • task_csv_temp_trend
  • task_csv_temp_decades

Models (running in parallel):

| # | Model |
|---|-------|
| 1 | openrouter/anthropic/claude-opus-4.6 |
| 2 | openrouter/openai/gpt-5.4 |
| 3 | openrouter/google/gemini-3.1-pro-preview |

ETA: ~15-20 minutes (3 models × 3 tasks, 180s timeout each)

Automated PR test by ScuttleBot 🦀

@ScuttleBot
Author

🦀 PR Test Results — Global Temperature Tasks

Instance: 155.138.235.245 (Vultr vc2-2c-4gb, ATL)
Branch: tasks/csv-temperature
Grading: Hybrid (60% automated + 40% LLM judge)
Duration: ~28 min total (run sequentially; an agent concurrency issue prevented parallel execution)

Overall Scores

| Model | Anomalies | Trend | Decades | Overall |
|-------|-----------|-------|---------|---------|
| openrouter/google/gemini-3.1-pro-preview | 90.7% | 96.0% | 100.0% | 95.6% |
| openrouter/anthropic/claude-opus-4.6 | 99.6% | 97.6% | 95.6% | 97.6% |
| openrouter/openai/gpt-5.4 | 99.0% | 98.0% | 95.6% | 97.5% |
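
The Overall column is consistent with an unweighted mean of the three per-task scores. That is an inference from the numbers, not something the test harness states; the short Python check below recomputes it from the reported values under that assumption.

```python
# Per-task scores (Anomalies, Trend, Decades) as reported in the table.
task_scores = {
    "openrouter/google/gemini-3.1-pro-preview": [90.7, 96.0, 100.0],
    "openrouter/anthropic/claude-opus-4.6": [99.6, 97.6, 95.6],
    "openrouter/openai/gpt-5.4": [99.0, 98.0, 95.6],
}

# Assumed aggregation: plain average, rounded to one decimal place.
overall = {
    model: round(sum(scores) / len(scores), 1)
    for model, scores in task_scores.items()
}

for model, value in overall.items():
    print(f"{model}: {value}%")
```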

Task Breakdown

task_csv_temp_anomalies

| Model | Score | Time | Tokens | Notes |
|-------|-------|------|--------|-------|
| Gemini 3.1 Pro | 90.7% | 64s | 91,747 | yoy_changes partial (0.5) |
| Claude Opus 4.6 | 99.6% | 81s | 83,089 | Near-perfect |
| GPT-5.4 | 99.0% | 68s | 92,625 | Near-perfect |

task_csv_temp_trend

| Model | Score | Time | Tokens | Notes |
|-------|-------|------|--------|-------|
| Gemini 3.1 Pro | 96.0% | 71s | 79,430 | All automated checks passed |
| Claude Opus 4.6 | 97.6% | 90s | 101,337 | All automated checks passed |
| GPT-5.4 | 98.0% | 53s | 55,705 | All automated checks passed, fastest |

task_csv_temp_decades

| Model | Score | Time | Tokens | Notes |
|-------|-------|------|--------|-------|
| Gemini 3.1 Pro | 100.0% | 79s | 112,383 | Perfect score |
| Claude Opus 4.6 | 95.6% | 124s | 108,781 | Slowest on this task |
| GPT-5.4 | 95.6% | 58s | 59,981 | Most efficient (fewest tokens) |

Observations

  1. All 3 tasks work well — every model scored 90%+ on all tasks, confirming the tasks are solvable and the grading is reasonable
  2. Task quality looks solid — the hybrid grading (automated + LLM judge) produces nuanced scores that differentiate between partial and full solutions
  3. Gemini received partial credit on yoy_changes in the anomalies task: the year-over-year changes check scored 0.5, which suggests either that the automated grader needs a slightly broader acceptance range or that Gemini's output format differed from what the grader expects
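  4. Token efficiency varies widely: GPT-5.4 used roughly 1.4x to 1.9x fewer tokens than Gemini and Claude on the trend and decades tasks (55,705 vs. 79,430 and 101,337 on trend; 59,981 vs. 112,383 and 108,781 on decades) while scoring comparably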
  5. All models completed well within the 180s timeout — longest execution was Claude at 124s on decades
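
The token counts in the task breakdown can be compared directly. The Python snippet below is a small sketch that computes how many times more tokens Gemini and Claude used than GPT-5.4 on the trend and decades tasks; the short model keys are shorthand chosen here, and the counts are copied from the tables.

```python
# Token counts as reported in the task breakdown tables.
tokens = {
    "trend": {"gemini": 79_430, "claude": 101_337, "gpt": 55_705},
    "decades": {"gemini": 112_383, "claude": 108_781, "gpt": 59_981},
}

# Ratio of each other model's token usage to GPT-5.4's, per task.
ratios = {
    task: {
        model: round(count / counts["gpt"], 2)
        for model, count in counts.items()
        if model != "gpt"
    }
    for task, counts in tokens.items()
}

for task, by_model in ratios.items():
    for model, ratio in by_model.items():
        print(f"{task}: {model} used {ratio}x the tokens of gpt")
```

Token counts from a single run per model are a noisy signal, so these ratios are best read as indicative, not as a stable efficiency ranking.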

Verdict

Tasks are ready to merge. All three produce meaningful, differentiating results across models with no failures or edge cases.

Tested by ScuttleBot 🦀

@olearycrew olearycrew merged commit 68a91c9 into main Apr 15, 2026
1 check passed