Add Idaho weather stations analysis tasks #322
Code Review Summary
Status: No Issues Found | Recommendation: Merge
Solid additions: the task definitions are well-structured with clear prompts, accurate expected values, and thorough hybrid grading that combines automated regex checks with LLM judge rubrics. The DMS coordinate parsing challenge in the filter task is a nice test of real-world data handling.
Files Reviewed (4 files)
Reviewed by claude-4.6-sonnet-20260217 · 91,779 tokens
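For readers unfamiliar with the DMS (degrees, minutes, seconds) challenge mentioned in the review, here is a minimal sketch of converting such a coordinate to decimal degrees. The input shape (for example `43°36'54"N` or `116 12 09 W`) and the helper name `dms_to_decimal` are assumptions for illustration; the actual format in the CSV and the task's own parser are not shown in this thread.

```python
import re

# Assumed input shape: degrees, minutes, optional seconds, hemisphere letter.
# The dataset's real coordinate format may differ from this.
_DMS = re.compile(
    r"""
    ^\s*(?P<deg>\d{1,3})\s*[°\s]\s*
    (?P<min>\d{1,2})\s*['′\s]\s*
    (?:(?P<sec>\d{1,2}(?:\.\d+)?)\s*["″]?\s*)?
    (?P<hem>[NSEW])\s*$
    """,
    re.VERBOSE | re.IGNORECASE,
)


def dms_to_decimal(text: str) -> float:
    """Convert a DMS string to signed decimal degrees (hypothetical helper)."""
    match = _DMS.match(text)
    if match is None:
        raise ValueError(f"unrecognized DMS coordinate: {text!r}")

    degrees = int(match.group("deg"))
    minutes = int(match.group("min"))
    seconds = float(match.group("sec") or 0.0)
    if minutes >= 60 or seconds >= 60:
        raise ValueError(f"minutes/seconds out of range: {text!r}")

    value = degrees + minutes / 60 + seconds / 3600
    # South and west hemispheres are negative in decimal degrees.
    return -value if match.group("hem").upper() in "SW" else value
```

Raising `ValueError` on malformed input, rather than returning a default, keeps bad rows visible when filtering stations by location.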
🧪 Test Started
Instance:
Models (running in parallel):
ETA: ~30-40 minutes (3 tasks × 3 models, parallel execution)
Instance booting now. Will post results when complete.
🧪 Test Results: PR #322 (Idaho Weather Stations Tasks)
Instance:
Overall Scores
Per-Task Breakdown
task_csv_stations_by_elevation
task_csv_stations_coverage
task_csv_stations_filter
Key Observations
Tasks are working well:
Task design feedback:
Verdict: Tasks look solid. The main issue is intermittent LLM judge failures that penalize models unfairly. Consider adding retry logic or a fallback judge model.
Instance Cleanup
Instance
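The retry and fallback suggestion in the verdict could look roughly like the sketch below. Everything here is hypothetical: the `Judge` callable type, the `judge_with_fallback` name, and the retry counts are illustrative and not taken from this repository's grading code.

```python
import time
from typing import Callable, Optional, Sequence

# Hypothetical judge interface: takes a model response, returns a score.
Judge = Callable[[str], float]


def judge_with_fallback(
    response: str,
    judges: Sequence[Judge],
    retries: int = 2,
    delay: float = 0.0,
) -> Optional[float]:
    """Try each judge in order, retrying each one before moving to the next.

    Returns None when every judge fails, so the caller can exclude the
    judge component instead of scoring the model as zero.
    """
    for judge in judges:
        for attempt in range(retries + 1):
            try:
                return judge(response)
            except Exception:
                # Back off exponentially between attempts on the same judge.
                if attempt < retries and delay > 0:
                    time.sleep(delay * (2 ** attempt))
    return None
```

Returning `None` instead of `0.0` on total failure is the design choice that addresses the unfair penalty: a judge outage becomes distinguishable from a genuinely poor answer.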
Adds 3 new data-analysis tasks based on the Idaho weather stations CSV dataset:
All tasks use assets/csvs/idaho_weather_stations.csv (213 stations, 10 columns). Each includes hybrid grading with automated Python checks and LLM judge rubrics.
Closes #220, Closes #223, Closes #235
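To illustrate what hybrid grading of this kind can mean in practice, here is a minimal sketch that blends an automated regex check with an LLM judge score. The function names, the specific check (looking for the station count of 213 stated above), and the 50/50 weighting are all assumptions for illustration, not the grading logic in this PR.

```python
import re
from typing import Optional


def automated_check(response: str) -> float:
    """Hypothetical check: does the answer report the 213-station total?"""
    return 1.0 if re.search(r"\b213\b", response) else 0.0


def hybrid_score(
    auto_score: float,
    judge_score: Optional[float],
    auto_weight: float = 0.5,  # assumed weighting, not from the PR
) -> float:
    """Blend automated and judge scores, both expected in [0, 1].

    If the judge produced no score, fall back to the automated score
    alone rather than treating the missing judge result as zero.
    """
    if judge_score is None:
        return auto_score
    return auto_weight * auto_score + (1 - auto_weight) * judge_score
```

For example, `hybrid_score(automated_check("The file lists 213 stations."), 0.5)` blends a passing automated check with a middling judge score.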