Add global temperature analysis tasks by ScuttleBot · Pull Request #319 · pinchbench/skill

ScuttleBot · 2026-04-14T13:54:06Z

Adds 3 new data-analysis tasks using the global_temperature.csv dataset:

- `task_csv_temp_anomalies`: Detect temperature anomalies (extreme months, z-score outliers, warmest/coldest years, year-over-year changes)
- `task_csv_temp_trend`: Analyze temperature trends (linear regression, pre/post-1950 acceleration, milestone crossings)
- `task_csv_temp_decades`: Compare decades (decade averages, transitions, variability, GISTEMP vs gcag source comparison)
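
For readers unfamiliar with the analyses named in the task list, here is a minimal Python sketch of three of them: z-score outliers, a least-squares linear trend, and decade averages. It is not taken from the task files. The `SAMPLE` values are made up for illustration, the function names and the 2.0 z-score threshold are assumptions, and the real column layout of global_temperature.csv is not shown in this PR.

```python
from collections import defaultdict
from statistics import mean, pstdev

# Hypothetical (year, anomaly in °C) pairs; NOT values from global_temperature.csv.
SAMPLE = [(1990, 0.40), (1991, 0.42), (1992, 0.38), (1993, 0.41),
          (1994, 0.39), (2000, 0.43), (2001, 0.40), (2002, 1.20)]


def zscore_outliers(rows, threshold=2.0):
    """Return rows lying more than `threshold` population standard
    deviations from the mean (threshold is an assumed default)."""
    values = [v for _, v in rows]
    mu, sigma = mean(values), pstdev(values)
    if sigma == 0:
        return []
    return [(y, v) for y, v in rows if abs(v - mu) / sigma > threshold]


def linear_trend(rows):
    """Ordinary least-squares fit; returns (slope per year, intercept)."""
    xs = [y for y, _ in rows]
    ys = [v for _, v in rows]
    x_bar, y_bar = mean(xs), mean(ys)
    sxx = sum((x - x_bar) ** 2 for x in xs)
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return slope, y_bar - slope * x_bar


def decade_averages(rows):
    """Mean value per decade, keyed by the decade's first year."""
    buckets = defaultdict(list)
    for year, value in rows:
        buckets[year // 10 * 10].append(value)
    return {decade: mean(vals) for decade, vals in sorted(buckets.items())}
```

With the made-up sample, `zscore_outliers(SAMPLE)` flags only the 2002 entry, and `decade_averages(SAMPLE)` returns one mean for the 1990s and one for the 2000s.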

All tasks use hybrid grading (60% automated, 40% LLM judge) with 180s timeout.
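
As a rough illustration of what a 60/40 hybrid grade means, the sketch below blends an automated score and an LLM-judge score as a weighted average. This assumes both scores are normalized to [0, 1] and combined linearly; the PR does not show the actual grading code, and `hybrid_score` is a hypothetical name.

```python
def hybrid_score(automated: float, judge: float,
                 automated_weight: float = 0.6) -> float:
    """Blend two scores in [0, 1] as a weighted average.

    Assumption: the 60% / 40% split is a simple linear combination.
    """
    for name, value in (("automated", automated), ("judge", judge)):
        if not 0.0 <= value <= 1.0:
            raise ValueError(f"{name} score must be in [0, 1], got {value}")
    return automated_weight * automated + (1.0 - automated_weight) * judge
```

Under this assumption, a run that passes every automated check (1.0) and receives 0.9 from the judge would score about 0.96.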

Closes #219, Closes #221, Closes #222

kilo-code-bot · 2026-04-14T13:55:12Z

Code Review Summary

Status: No Issues Found | Recommendation: Merge

Solid set of data-analysis task definitions. The grading scripts use regex-based pattern matching appropriately, expected values are well-documented, and the hybrid grading weights (60% automated / 40% LLM judge) make sense for this type of open-ended analysis task.
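
To make "regex-based pattern matching" concrete, here is a hypothetical sketch of how such a check could pull a number out of a model's free-text answer and compare it with an expected value. The function name, the tolerances, and the 1.0 / 0.5 / 0.0 scoring tiers are all assumptions for illustration; the actual grading scripts in this PR are not reproduced here.

```python
import re


def check_reported_value(text, label, expected, tol=0.01, loose_tol=0.05):
    """Hypothetical check: score the first number that follows `label`.

    Returns 1.0 if it is within `tol` of `expected`, 0.5 if within
    `loose_tol`, and 0.0 if it is further off or no number is found.
    """
    pattern = re.compile(
        re.escape(label) + r"[^0-9+\-]*([+\-]?\d+(?:\.\d+)?)",
        re.IGNORECASE,
    )
    match = pattern.search(text)
    if match is None:
        return 0.0
    error = abs(float(match.group(1)) - expected)
    if error <= tol:
        return 1.0
    if error <= loose_tol:
        return 0.5
    return 0.0
```

A tiered tolerance like this is one way a grader could award partial credit when an answer is close but not exact.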

Files Reviewed (4 files)

tasks/manifest.yaml
tasks/task_csv_temp_anomalies.md
tasks/task_csv_temp_trend.md
tasks/task_csv_temp_decades.md


Reviewed by claude-4.6-sonnet-20260217 · 147,738 tokens

ScuttleBot · 2026-04-15T13:19:15Z

🧪 PR Test Started

Instance: 155.138.235.245 (Vultr vc2-2c-4gb, ATL)
Branch: tasks/csv-temperature
Tasks:

task_csv_temp_anomalies
task_csv_temp_trend
task_csv_temp_decades

Models (running in parallel):

| # | Model |
|---|-------|
| 1 | `openrouter/anthropic/claude-opus-4.6` |
| 2 | `openrouter/openai/gpt-5.4` |
| 3 | `openrouter/google/gemini-3.1-pro-preview` |

ETA: ~15-20 minutes (3 models × 3 tasks, 180s timeout each)

Automated PR test by ScuttleBot 🦀

ScuttleBot · 2026-04-15T14:00:49Z

🦀 PR Test Results — Global Temperature Tasks

Instance: 155.138.235.245 (Vultr vc2-2c-4gb, ATL)
Branch: tasks/csv-temperature
Grading: Hybrid (60% automated + 40% LLM judge)
Duration: ~28 min total (models ran sequentially; an agent concurrency issue prevented the planned parallel run)

Overall Scores

| Model | Anomalies | Trend | Decades | Overall |
|-------|-----------|-------|---------|---------|
| `openrouter/google/gemini-3.1-pro-preview` | 90.7% | 96.0% | 100.0% | 95.6% |
| `openrouter/anthropic/claude-opus-4.6` | 99.6% | 97.6% | 95.6% | 97.6% |
| `openrouter/openai/gpt-5.4` | 99.0% | 98.0% | 95.6% | 97.5% |
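
The Overall column is consistent with an unweighted mean of the three task scores, rounded to one decimal place. The short check below reproduces it from the table; treating Overall as a simple mean is an inference from the numbers, not something the report states.

```python
# Per-task scores (Anomalies, Trend, Decades) copied from the table above.
scores = {
    "gemini-3.1-pro-preview": (90.7, 96.0, 100.0),
    "claude-opus-4.6": (99.6, 97.6, 95.6),
    "gpt-5.4": (99.0, 98.0, 95.6),
}


def overall(task_scores):
    """Unweighted mean of the task scores, rounded to one decimal."""
    return round(sum(task_scores) / len(task_scores), 1)


overall_by_model = {model: overall(s) for model, s in scores.items()}
```

`overall_by_model` comes out to 95.6, 97.6, and 97.5 for Gemini, Claude, and GPT-5.4 respectively, matching the reported values.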

Task Breakdown

`task_csv_temp_anomalies`

| Model | Score | Time | Tokens | Notes |
|-------|-------|------|--------|-------|
| Gemini 3.1 Pro | 90.7% | 64s | 91,747 | `yoy_changes` partial (0.5) |
| Claude Opus 4.6 | 99.6% | 81s | 83,089 | Near-perfect |
| GPT-5.4 | 99.0% | 68s | 92,625 | Near-perfect |

`task_csv_temp_trend`

| Model | Score | Time | Tokens | Notes |
|-------|-------|------|--------|-------|
| Gemini 3.1 Pro | 96.0% | 71s | 79,430 | All automated checks passed |
| Claude Opus 4.6 | 97.6% | 90s | 101,337 | All automated checks passed |
| GPT-5.4 | 98.0% | 53s | 55,705 | All automated checks passed, fastest |

`task_csv_temp_decades`

| Model | Score | Time | Tokens | Notes |
|-------|-------|------|--------|-------|
| Gemini 3.1 Pro | 100.0% | 79s | 112,383 | Perfect score |
| Claude Opus 4.6 | 95.6% | 124s | 108,781 | Slowest on this task |
| GPT-5.4 | 95.6% | 58s | 59,981 | Most efficient (fewest tokens) |

Observations

All 3 tasks work well — every model scored 90%+ on all tasks, confirming the tasks are solvable and the grading is reasonable
Task quality looks solid — the hybrid grading (automated + LLM judge) produces nuanced scores that differentiate between partial and full solutions
Gemini partial on yoy_changes in anomalies task — the year-over-year changes check scored 0.5 (partial credit), suggesting the automated grader may need a slightly broader acceptance range, or Gemini's formatting differed from expected
Token efficiency varies widely: GPT-5.4 used roughly 30-47% fewer tokens than Gemini and Claude on the trend and decades tasks (for example, 59,981 vs. 112,383 and 108,781 on decades) while scoring comparably
All models completed well within the 180s timeout — longest execution was Claude at 124s on decades
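
The token-efficiency observation can be checked directly against the per-task tables. The sketch below computes how many times more tokens Gemini and Claude used than GPT-5.4 on the trend and decades tasks; the dictionary keys and the `ratio_vs_gpt` helper are illustrative names, while the token counts are copied from the tables above.

```python
# Token counts copied from the Task Breakdown tables.
tokens = {
    "trend": {"gemini": 79_430, "claude": 101_337, "gpt": 55_705},
    "decades": {"gemini": 112_383, "claude": 108_781, "gpt": 59_981},
}


def ratio_vs_gpt(task, model):
    """How many times GPT-5.4's token count the given model used."""
    return tokens[task][model] / tokens[task]["gpt"]


for task in tokens:
    for model in ("gemini", "claude"):
        print(f"{task}: {model} used {ratio_vs_gpt(task, model):.2f}x "
              f"the tokens of GPT-5.4")
```

The ratios range from about 1.43x (Gemini on trend) to about 1.87x (Gemini on decades), so the gap is substantial but somewhat below a full 2x.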

Verdict

✅ Tasks are ready to merge. All three produce meaningful, differentiating results across models with no failures or edge cases.

Tested by ScuttleBot 🦀

Add temperature CSV analysis tasks

0935a9d

Merge branch 'main' into tasks/csv-temperature

639edcb

olearycrew merged commit 68a91c9 into main Apr 15, 2026
1 check passed

ScuttleBot mentioned this pull request Apr 16, 2026

feat: implement new_session:true support for multi-turn task session … #330

Open
