Add Qwen3.6 Plus evaluation results to leaderboard #1

Merged
augchan42 merged 1 commit into main from add-qwen3.6-plus-eval on Apr 8, 2026

@augchan42 (Contributor)

Summary

  • Adds full Qwen3.6 Plus evaluation (39/39 calls) via lab-01 pipeline with structured prompts across all 13 scenarios
  • F2=70%, HG recall=67%, HG precision=91%, bias=sleepy — ranks 10th of 14 models
  • The run took multiple sessions because of the free endpoint's daily rate limits; the last call used the paid endpoint after qwen/qwen3.6-plus:free was removed from OpenRouter
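
For readers checking the headline numbers, here is a minimal sketch of how an F2 score relates to precision and recall. It assumes F2 here means the standard F-beta score with beta = 2, computed from the HG precision and HG recall figures; the PR does not state the scoring formula, so treat this as an illustration and not as the pipeline's actual scoring code. The function name `f_beta` is hypothetical.

```python
def f_beta(precision: float, recall: float, beta: float = 2.0) -> float:
    """Standard F-beta score; beta > 1 weights recall more heavily than precision."""
    if precision == 0.0 and recall == 0.0:
        return 0.0  # avoid division by zero when both inputs are zero
    b2 = beta * beta
    return (1 + b2) * precision * recall / (b2 * precision + recall)


# Rounded figures reported in this PR, assumed to be the inputs to F2.
hg_precision = 0.91
hg_recall = 0.67

f2 = f_beta(hg_precision, hg_recall)
print(f"F2 = {f2:.3f}")
```

Under that assumption the result lands close to the reported 70%; a small gap is expected because the inputs are already rounded to whole percentages.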

Changes

  • results/reference/qwen-qwen3.6-plus/ — raw eval results
  • results/reference/leaderboard.json — rescored all models
  • shared/leaderboard.json + README.md — published to public leaderboard
  • labs/publish-leaderboard.py — adds MODEL_MAP entry for qwen-qwen3.6-plus
  • labs/lab-01-risk-fingerprinting.py — adds --retry <path> flag to resume partial runs, skipping prior successes and retrying only failed/missing evals (important for free models with daily rate limits)
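
The resume behavior described for --retry can be sketched roughly as follows. This is a hedged illustration, not the code in labs/lab-01-risk-fingerprinting.py: the results file layout (a JSON list of records with `scenario`, `run`, and `status` keys) and the function name `pending_evals` are assumptions made for the example.

```python
import json
import tempfile
from pathlib import Path


def pending_evals(planned, retry_path):
    """Return the planned (scenario, run) pairs that still need a call.

    Hypothetical schema: the retry file is a JSON list of records, each with
    "scenario", "run", and "status" keys; only status == "ok" counts as done.
    """
    path = Path(retry_path)
    if not path.exists():
        return list(planned)  # nothing to resume from, so run everything
    prior = json.loads(path.read_text())
    done = {(r["scenario"], r["run"]) for r in prior if r.get("status") == "ok"}
    # Failed evals and evals missing from the file are both retried.
    return [p for p in planned if p not in done]


# Small demonstration with made-up scenario names.
planned = [("scenario-a", 1), ("scenario-a", 2), ("scenario-b", 1)]
prior = [
    {"scenario": "scenario-a", "run": 1, "status": "ok"},
    {"scenario": "scenario-a", "run": 2, "status": "error"},
]

with tempfile.TemporaryDirectory() as tmp:
    partial = Path(tmp) / "partial.json"
    partial.write_text(json.dumps(prior))
    todo = pending_evals(planned, partial)

print(todo)  # [('scenario-a', 2), ('scenario-b', 1)]
```

Keying on the (scenario, run) pair keeps a resumed session from spending rate-limited calls on evals that already succeeded.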

Test plan

  • Verify leaderboard renders correctly on ara-eval-site
  • Confirm --retry flag works on a partial results file

🤖 Generated with Claude Code

39/39 calls successful (38 free + 1 paid endpoint after free endpoint was removed).
F2=70%, HG recall=67%, HG precision=91%, bias=sleepy — ranks 10th of 14 models.

Also adds --retry flag to lab-01 for resuming partial runs across daily rate limits.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
@augchan42 merged commit e1d3574 into main on Apr 8, 2026
2 checks passed