From 40ff3b6cd6654153af223c07ec1aef6189c0ac44 Mon Sep 17 00:00:00 2001 From: "copilot-swe-agent[bot]" <198982749+Copilot@users.noreply.github.com> Date: Mon, 6 Apr 2026 15:27:40 +0000 Subject: [PATCH 1/3] Initial plan From 4f0788f97c5e5a7b48f091e1df3f34f846179920 Mon Sep 17 00:00:00 2001 From: "copilot-swe-agent[bot]" <198982749+Copilot@users.noreply.github.com> Date: Mon, 6 Apr 2026 15:41:20 +0000 Subject: [PATCH 2/3] fix: reduce token usage for daily-syntax-error-quality and dead-code-remover workflows MIME-Version: 1.0 Content-Type: text/plain; charset=UTF-8 Content-Transfer-Encoding: 8bit - daily-syntax-error-quality: reduce test cases from 3→2, add Token Budget Guidelines section, compact scoring format, early-stop noop when quality is good, remove verbose implementation guide from issue template - dead-code-remover: reduce batch from 10→5 functions, add Token Budget Guidelines section, truncate grep/test output, immediate noop when no unprocessed functions found - Recompile both lock files - Update scratchpad/token-budget-guidelines.md with optimizations and alert thresholds for both workflows Closes #" Agent-Logs-Url: https://github.com/github/gh-aw/sessions/a8890002-99df-44b5-b50b-93c2902a4951 Co-authored-by: pelikhan <4175913+pelikhan@users.noreply.github.com> --- .../workflows/daily-syntax-error-quality.md | 339 +++--------------- .github/workflows/dead-code-remover.md | 21 +- scratchpad/token-budget-guidelines.md | 81 +++++ 3 files changed, 137 insertions(+), 304 deletions(-) diff --git a/.github/workflows/daily-syntax-error-quality.md b/.github/workflows/daily-syntax-error-quality.md index 077258623da..621ba6c9574 100644 --- a/.github/workflows/daily-syntax-error-quality.md +++ b/.github/workflows/daily-syntax-error-quality.md @@ -57,12 +57,24 @@ You are the Daily Syntax Error Quality Check Agent - a developer experience spec ## Mission Test the quality of compiler error messages by: -1. Selecting 3 existing agentic workflows -2. 
Introducing 3 different types of syntax errors (one per workflow) +1. Selecting 2 existing agentic workflows +2. Introducing 2 different types of syntax errors (one per workflow) 3. Running the compiler and capturing error output 4. Evaluating error message quality across multiple dimensions 5. Creating an issue with suggestions if improvements are needed +## Token Budget Guidelines + +**Target**: Complete the full analysis in ≤ 20 turns. + +- Test **2 workflows** (not 3) — one simple, one complex. +- One error category per workflow (Category A for workflow 1, Category B for workflow 2). +- **If the average score across both test cases is ≥ 70 and no individual score is < 55**: call `noop` immediately with a one-line summary — do **not** generate a full structured report. +- When scores require an issue: use the compact format below — skip verbose per-dimension narratives and the lengthy implementation guide section. +- Do **not** re-read files already loaded into context. +- One `gh aw compile` call per test case — do not retry after an expected failure. +- Avoid printing full file contents; use `head -n 30` to confirm error locations. + ## Current Context - **Repository**: ${{ github.repository }} @@ -71,7 +83,7 @@ Test the quality of compiler error messages by: ## Phase 1: Select Test Workflows -Select 3 diverse workflows for testing (avoid daily-* and test workflows): +Select 2 diverse workflows for testing (avoid daily-* and test workflows): ```bash # Find candidate workflows @@ -79,20 +91,18 @@ find .github/workflows -name '*.md' -type f ! -name 'daily-*.md' ! -name '*-test ``` **Selection Criteria**: -- Choose workflows with different complexity levels (simple, medium, complex) +- Choose workflows with different complexity levels (simple, complex) - Prefer workflows with different structures (different engines, tools, safe-outputs) -- Ensure variety in frontmatter configuration **Example selections**: 1. Simple workflow (< 100 lines, minimal config) -2. 
Medium workflow (100-300 lines, moderate config) -3. Complex workflow (> 300 lines, many tools/features) +2. Complex workflow (> 300 lines, many tools/features) ## Phase 2: Generate Syntax Errors -For each selected workflow, create exactly **3 test cases** with different error types: +For each selected workflow, create exactly **1 test case** with a different error type: -### Test Case Categories (Select One Per Workflow) +### Test Case Categories (One Per Workflow) #### Category A: Frontmatter Syntax Errors Examples: @@ -159,7 +169,6 @@ For each workflow: 2. **Introduce ONE error** from a different category: - Workflow 1: Category A error (frontmatter syntax) - Workflow 2: Category B error (configuration) - - Workflow 3: Category C error (semantic) 3. **Document the error** for later evaluation: ```json @@ -316,42 +325,14 @@ For each error output, score across these dimensions: ## Phase 5: Generate Evaluation Report -Create a detailed evaluation for each test case: +For each test case, record a **compact** one-line summary: -```json -{ - "test_id": "test-1", - "workflow": "selected-workflow.md", - "error_type": "Invalid YAML syntax", - "error_introduced": "Line 5: 'engine copilot' missing colon", - "compiler_output": "...(full error output)...", - "scores": { - "clarity": 22, - "actionability": 18, - "context": 16, - "examples": 12, - "consistency": 14 - }, - "total_score": 82, - "rating": "Good", - "strengths": [ - "Error location is clearly shown (file:line:column)", - "Message clearly states 'invalid YAML syntax'", - "Provides actionable hint about missing colon" - ], - "weaknesses": [ - "No example of correct YAML syntax provided", - "Could show the problematic line with ^ pointer", - "Doesn't mention YAML specification for reference" - ], - "improvement_suggestions": [ - "Add visual pointer (^) to exact error location in source", - "Include example of correct syntax: 'engine: copilot'", - "Reference YAML specification or workflow documentation" - ] -} +``` 
+test-1 | | | clarity:/25 actionability:/25 context:/20 examples:/15 consistency:/15 | total:/100 | ``` +Collect key strengths (1–2 bullets) and improvement suggestions (1–2 bullets) per test. Do **not** reproduce the full compiler output in your report — reference file:line only. + ## Phase 6: Create Issue with Suggestions **Only create an issue if**: @@ -361,263 +342,24 @@ Create a detailed evaluation for each test case: ### Issue Structure -**Note**: The template below demonstrates the complete structure and formatting for the issue report. +Use this **compact** template (do not add extra sections): ```markdown ### 📊 Error Message Quality Analysis -**Analysis Date**: Use current date in YYYY-MM-DD format -**Test Cases**: 3 -**Average Score**: XX/100 -**Status**: [✅ Good | ⚠️ Needs Improvement | ❌ Critical Issues] - ---- - -### Executive Summary - -[2-3 sentences summarizing the findings and overall quality assessment] - -**Key Findings**: -- **Strengths**: [List 2-3 strengths observed across test cases] -- **Weaknesses**: [List 2-3 common weaknesses] -- **Critical Issues**: [List any critical issues that severely impact DX] - ---- - -### Test Case Results - -
-Test Case 1: Invalid YAML Syntax - Score: 82/100 ✅ - -#### Test Configuration - -**Workflow**: `selected-workflow.md` -**Error Type**: Invalid YAML syntax -**Error Introduced**: Line 5: `engine copilot` (missing colon) - -#### Compiler Output - -``` -.github/workflows/selected-workflow.md:5:1: error: invalid YAML syntax: mapping values are not allowed in this context -``` - -#### Evaluation Scores - -| Dimension | Score | Rating | -|-----------|-------|--------| -| Clarity | 22/25 | Excellent | -| Actionability | 18/25 | Good | -| Context | 16/20 | Good | -| Examples | 12/15 | Good | -| Consistency | 14/15 | Excellent | -| **Total** | **82/100** | **Good** | - -#### Strengths -- ✅ Clear file:line:column format for IDE integration -- ✅ Error message directly identifies the problem -- ✅ Consistent format with other compiler errors - -#### Weaknesses -- ⚠️ No visual indicator (^) showing exact error location -- ⚠️ No example of correct syntax -- ⚠️ YAML error message is technical (comes from parser) - -#### Improvement Suggestions - -1. **Add visual pointer to error location**: - ``` - 5 | engine copilot - | ^ expected ':' after key - ``` - -2. **Include corrected syntax example**: - ``` - Correct usage: - engine: copilot - ``` - -3. **Simplify technical YAML error messages**: - - Current: "mapping values are not allowed in this context" - - Better: "Missing colon (:) after 'engine' key" - -
- -
-Test Case 2: Invalid Engine Name - Score: 68/100 ⚠️ - -[Similar detailed analysis...] - -
- -
-Test Case 3: Conflicting Configuration - Score: 74/100 ✅ +**Date**: YYYY-MM-DD | **Tests**: 2 | **Average Score**: XX/100 | **Status**: [✅ Good | ⚠️ Needs Improvement | ❌ Critical] -[Similar detailed analysis...] +**Summary**: [1–2 sentences on overall findings] -
- ---- - -### Overall Statistics - -| Metric | Value | -|--------|-------| -| Tests Run | 3 | -| Average Score | 74.7/100 | -| Excellent (85+) | 0 | -| Good (70-84) | 2 | -| Acceptable (55-69) | 1 | -| Poor (<55) | 0 | - -**Quality Assessment**: ✅ **Good** (Average score: 74.7/100, above threshold of 70. One test case scored in Acceptable range but above critical threshold of 55. No issue creation required.) - -**Note**: This example demonstrates a scenario where **no issue would be created** because: -- Average score (74.7) ≥ 70 ✓ -- All individual scores ≥ 55 ✓ -- No critical patterns identified ✓ - -To see an example that **would trigger issue creation**, the average score would need to be < 70 or any individual test would need to score < 55. - ---- - -### Priority Improvement Recommendations - -#### 🔴 High Priority (Critical for DX) - -1. **Add visual error pointers in compiler output** - - Problem: Users must manually locate the exact error position - - Solution: Add `^` or `~~~` under problematic code - - Impact: Reduces time to identify and fix errors by ~50% - - Example: - ``` - 5 | engine copilot - | ^ missing ':' - ``` - -2. **Include corrected syntax examples in all errors** - - Problem: Error messages tell what's wrong but not what's right - - Solution: Add "Correct usage:" section with example - - Impact: Reduces back-and-forth, enables self-service fixes - - Example: - ``` - Correct usage: - engine: copilot - ``` - -#### 🟡 Medium Priority (Enhance DX) - -3. **Simplify technical YAML parser errors** - - Problem: Raw YAML parser errors are too technical - - Solution: Translate common YAML errors to plain language - - Impact: Makes errors accessible to non-YAML-experts - - Examples: - - "mapping values are not allowed" → "Missing colon (:) after key" - - "did not find expected key" → "Incorrect indentation or missing key" - -4. 
**Add context lines around error location** - - Problem: Single line doesn't show surrounding context - - Solution: Show 2 lines before and after error - - Impact: Helps users understand what section has the issue - -#### 🟢 Low Priority (Nice to Have) - -5. **Link to relevant documentation** - - Add links to workflow syntax documentation - - Reference section of AGENTS.md for common patterns - - Link to examples in .github/workflows/ - -6. **Group related errors** - - If multiple errors exist, group them by type - - Show most critical errors first - - Provide "fix all" suggestions - ---- - -### Implementation Guide - -For developers implementing these improvements: - -#### 1. Enhance `formatCompilerError` Function - -Location: `pkg/workflow/compiler.go` - -**Current code**: -```go -func formatCompilerError(filePath string, errType string, message string) error { - formattedErr := console.FormatError(console.CompilerError{ - Position: console.ErrorPosition{ - File: filePath, - Line: 1, - Column: 1, - }, - Type: errType, - Message: message, - }) - return errors.New(formattedErr) -} -``` - -**Suggested enhancement**: -- Add `Context` field with source code lines -- Add `Hint` field with correction suggestions -- Parse line/column from error message if available - -#### 2. Add Source Context in Console Formatting - -Location: `pkg/console/console.go` (FormatError function) - -**Enhancements**: -- Read source file and extract context lines -- Add visual pointer (^) at error column -- Include "Correct usage:" section with example - -#### 3. Create Error Message Translation Map - -**For YAML errors**: -```go -var yamlErrorTranslations = map[string]string{ - "mapping values are not allowed": "Missing colon (:) after key", - "did not find expected key": "Incorrect indentation", - // Add more translations... -} -``` - -#### 4. 
Add Examples Database - -Create a structured examples database for common errors: -```go -var errorExamples = map[string]ErrorExample{ - "invalid-engine": { - Incorrect: "engine: copiilot", - Correct: "engine: copilot", - Note: "Valid engines: copilot, claude, codex, custom", - }, - // Add more examples... -} -``` - ---- - -### Success Metrics - -Track these metrics to measure improvement: - -1. **Error Resolution Time**: Time from error to fix (target: <2 min) -2. **Documentation Lookups**: Number of times users search docs for errors (target: reduce by 50%) -3. **User Feedback**: Survey responses on error helpfulness (target: 4+/5) -4. **Repeat Errors**: Frequency of same errors being made (target: reduce by 30%) - ---- - -### Related Issues - -- [Link to related DX issues] -- [Link to error message improvement PRs] - ---- +| Test | Workflow | Error Type | Score | Rating | +|------|----------|------------|-------|--------| +| 1 | `workflow.md` | Category A | XX/100 | Good | +| 2 | `workflow.md` | Category B | XX/100 | Acceptable | -*Generated by Daily Syntax Error Quality Check workflow* -*Next check: Runs daily (see workflow schedule)* +**Weaknesses** (top 3 only): +1. [specific issue + suggested fix] +2. [specific issue + suggested fix] +3. 
[specific issue + suggested fix] ``` ## Important Guidelines @@ -685,12 +427,11 @@ Error: invalid engine ## Success Criteria A successful analysis run: -- ✅ Tests 3 different workflows with diverse complexity -- ✅ Introduces 3 different error types (one per category) -- ✅ Captures complete compiler output for each test -- ✅ Provides detailed quality scores across all dimensions -- ✅ Generates specific, actionable improvement suggestions -- ✅ Creates issue only when quality is below threshold +- ✅ Tests 2 different workflows with diverse complexity +- ✅ Introduces 2 different error types (categories A and B) +- ✅ Captures compiler output for each test +- ✅ Provides quality scores across all dimensions +- ✅ Creates issue only when quality is below threshold (average < 70 or any score < 55) - ✅ Cleans up temporary test files --- diff --git a/.github/workflows/dead-code-remover.md b/.github/workflows/dead-code-remover.md index 871c47015ba..9f27fa22735 100644 --- a/.github/workflows/dead-code-remover.md +++ b/.github/workflows/dead-code-remover.md @@ -43,7 +43,18 @@ You are the Dead Code Removal Agent — a Go code maintenance expert that identi ## Mission -Run the `deadcode` static analyzer, select a batch of up to 10 unreachable functions, apply safety checks, delete them (and their exclusive tests), verify the build, and open a pull request. +Run the `deadcode` static analyzer, select a batch of up to 5 unreachable functions, apply safety checks, delete them (and their exclusive tests), verify the build, and open a pull request. + +## Token Budget Guidelines + +**Target**: Complete the full workflow in ≤ 30 turns. + +- **If deadcode finds 0 unprocessed functions**: call `noop` immediately — skip all remaining phases. +- Select **up to 5 functions** per run (not 10) — keeps PRs small and turns bounded. +- Safety check grep: limit output with `grep -m 5` to avoid large result dumps. 
+- Build/test output: pipe through `tail -20` to capture only the relevant tail; do not print full output. +- PR body: use only the provided template structure — no extra analysis paragraphs. +- Cache append: write lines directly; do not re-read the full cache file before appending. ## Context @@ -74,7 +85,7 @@ Build a set of `"file:FuncName"` keys to skip — this ensures each function is ## Phase 3: Select a Batch -From the unprocessed dead functions, select **up to 10** to remove this run. Prioritise: +From the unprocessed dead functions, select **up to 5** to remove this run. Prioritise: 1. Functions where `grep` confirms callers exist only in `*_test.go` files 2. Fully standalone functions with no callers at all @@ -92,7 +103,7 @@ For every function in the batch, run all of the following checks before deleting ### 4.1 Caller grep ```bash -grep -rn "FunctionName" --include="*.go" . +grep -rn -m 5 "FunctionName" --include="*.go" . ``` - Callers **only in `*_test.go` files** → function is dead. Proceed with deletion AND mark its exclusive test functions for removal. @@ -157,7 +168,7 @@ make fmt Run targeted package tests for every package you modified: ```bash -go test ./pkg/... 2>&1 +go test ./pkg/... 2>&1 | tail -20 echo "Test exit code: $?" ``` @@ -223,7 +234,7 @@ After successfully calling `create_pull_request`, append one line per removed fu 2. **Never delete** `containsInNonCommentLines`, `indexInNonCommentLines`, or `extractJobSection` — they are shared test infrastructure. 3. **Check WASM** before deleting anything from `pkg/workflow/` or `pkg/console/`. 4. **Check `console_wasm.go`** before deleting anything from `pkg/console/`. -5. **Max 10 functions per run** — keeps PRs small and reviewable. +5. **Max 5 functions per run** — keeps PRs small and reviewable. 6. **Build must pass** before creating a PR. 
## Important diff --git a/scratchpad/token-budget-guidelines.md b/scratchpad/token-budget-guidelines.md index 4649f8f4e27..aa8d819e4f9 100644 --- a/scratchpad/token-budget-guidelines.md +++ b/scratchpad/token-budget-guidelines.md @@ -268,6 +268,79 @@ Explicit instructions in workflow prompts to reduce token consumption: - Issue-creation cap prevents excessive GitHub API calls - Cache-based skip logic avoids redundant re-scanning +### Daily Syntax Error Quality Check + +**Engine**: Copilot - max-turns not available + +**Previous Configuration:** +- No token budget controls +- 20-minute timeout +- Tests 3 workflows with 3 different error categories +- Verbose JSON evaluation reports per test case +- Lengthy issue template with implementation guide (~250 lines) +- Regenerated full Go code examples every run + +**Optimized Configuration:** +- `timeout-minutes: 20` (unchanged) +- Added `## Token Budget Guidelines` section in prompt: + - Reduce test cases from 3 to 2 workflows + - Use compact one-line scoring format (no verbose JSON) + - Early stop (noop) when average score ≥ 70 and no individual score < 55 + - Removed lengthy implementation guide from issue template + - Limit output to 20-turn target + +**Expected Impact:** +- **Token Reduction**: 60-75% (from ~9.35M to ~2.3M-3.7M per run) +- **Quality**: Maintained — compiler error quality still evaluated across all 5 dimensions +- **Runtime**: Reduced from ~129 turns to ≤ 20 turns + +**Budget Target:** +- **Target tokens/run**: 500K–1M +- **Alert threshold**: >2M tokens +- **Cost estimate**: $8.75-17.50 per run + +**Optimization Strategy:** +- Fewer test cases → fewer compile calls and eval rounds +- Compact scoring format → less output to generate and process +- Early-stop for healthy quality → avoids generating verbose reports when not needed +- Removed boilerplate implementation guide from issue template → less re-generation + +### Dead Code Remover + +**Engine**: Copilot - max-turns not available + +**Previous 
Configuration:** +- No token budget controls +- 30-minute timeout +- Batch of up to 10 functions per run +- Unlimited grep output for caller checks +- Full test output captured per run + +**Optimized Configuration:** +- `timeout-minutes: 30` (unchanged) +- Added `## Token Budget Guidelines` section in prompt: + - Reduce batch from 10 to 5 functions per run + - Limit grep output to 5 matches (`grep -m 5`) per safety check + - Limit test output to last 20 lines (`tail -20`) + - Immediate noop if deadcode finds 0 unprocessed functions + - No re-read of full cache before appending + +**Expected Impact:** +- **Token Reduction**: 45-55% (from ~9.1M to ~4M-5M per run) +- **Quality**: Maintained — safety checks unchanged, build verification still required +- **Runtime**: Reduced processing turns + +**Budget Target:** +- **Target tokens/run**: 1M–2M +- **Alert threshold**: >4M tokens +- **Cost estimate**: $17.50-35.00 per run + +**Optimization Strategy:** +- Half the batch size → half the safety checks, edits, and build iterations +- Truncated grep output → avoids large result dumps increasing context +- Truncated test output → avoids multi-MB test logs inflating token count +- Early-stop condition → skips all phases when nothing new to process + ## Alert Thresholds (Updated) | Workflow | Target Tokens | Alert Threshold | Critical Threshold | @@ -277,6 +350,8 @@ Explicit instructions in workflow prompts to reduce token consumption: | Issue Monster | 50K-150K | >300K | >500K | | CI Optimization Coach | 300K-600K | >1M | >1.5M | | Step Name Alignment | 300K-500K | >800K | >1.2M | +| Daily Syntax Error Quality Check | 500K-1M | >2M | >4M | +| Dead Code Remover | 1M-2M | >4M | >7M | ## Optimization Strategies @@ -411,6 +486,12 @@ When adding token budgets to a workflow: ## Revision History +- **2026-04-06**: Added token budget guardrails for two high-cost workflows + - Daily Syntax Error Quality Check: reduced to 2 test cases, compact scoring, early-stop noop (target 
500K-1M/run) + - Dead Code Remover: reduced batch to 5 functions, truncated grep/test output, early-stop noop (target 1M-2M/run) + - Updated alert thresholds table to cover all seven tracked workflows + - See [DeepReport #24882](https://github.com/github/gh-aw/discussions/24882) + - **2026-03-23**: Added token budget guardrails for top-cost workflows - Issue Monster: added prompt-level efficiency & early-stop guidelines (target 50K-150K/run) - CI Optimization Coach: added scope cap and early-exit path (target 300K-600K/run) From 69ec3526f29606ed8cf9e95dd1f9ad8b036f72d4 Mon Sep 17 00:00:00 2001 From: "copilot-swe-agent[bot]" <198982749+Copilot@users.noreply.github.com> Date: Mon, 6 Apr 2026 15:44:02 +0000 Subject: [PATCH 3/3] fix: address review feedback - fix token math, clarify early-stop conditions, conservative turn target Agent-Logs-Url: https://github.com/github/gh-aw/sessions/a8890002-99df-44b5-b50b-93c2902a4951 Co-authored-by: pelikhan <4175913+pelikhan@users.noreply.github.com> --- .../workflows/daily-syntax-error-quality.md | 6 ++--- .github/workflows/dead-code-remover.md | 2 +- scratchpad/token-budget-guidelines.md | 26 +++++++++---------- 3 files changed, 17 insertions(+), 17 deletions(-) diff --git a/.github/workflows/daily-syntax-error-quality.md b/.github/workflows/daily-syntax-error-quality.md index 621ba6c9574..4531c6cc97e 100644 --- a/.github/workflows/daily-syntax-error-quality.md +++ b/.github/workflows/daily-syntax-error-quality.md @@ -65,12 +65,12 @@ Test the quality of compiler error messages by: ## Token Budget Guidelines -**Target**: Complete the full analysis in ≤ 20 turns. +**Target**: Complete the full analysis in ≤ 40 turns. - Test **2 workflows** (not 3) — one simple, one complex. - One error category per workflow (Category A for workflow 1, Category B for workflow 2). 
-- **If the average score across both test cases is ≥ 70 and no individual score is < 55**: call `noop` immediately with a one-line summary — do **not** generate a full structured report. -- When scores require an issue: use the compact format below — skip verbose per-dimension narratives and the lengthy implementation guide section. +- **If the average score across both test cases is ≥ 70 and no individual score is < 55**: skip Phase 6 entirely, call `noop` with a one-line summary — do **not** generate the issue or structured report. +- When scores require an issue: use the compact format in Phase 6 — skip verbose per-dimension narratives. - Do **not** re-read files already loaded into context. - One `gh aw compile` call per test case — do not retry after an expected failure. - Avoid printing full file contents; use `head -n 30` to confirm error locations. diff --git a/.github/workflows/dead-code-remover.md b/.github/workflows/dead-code-remover.md index 9f27fa22735..3aa74164bf3 100644 --- a/.github/workflows/dead-code-remover.md +++ b/.github/workflows/dead-code-remover.md @@ -49,7 +49,7 @@ Run the `deadcode` static analyzer, select a batch of up to 5 unreachable functi **Target**: Complete the full workflow in ≤ 30 turns. -- **If deadcode finds 0 unprocessed functions**: call `noop` immediately — skip all remaining phases. +- **After Phase 2: if deadcode finds 0 unprocessed functions**, call `noop` immediately — skip Phases 3–9. - Select **up to 5 functions** per run (not 10) — keeps PRs small and turns bounded. - Safety check grep: limit output with `grep -m 5` to avoid large result dumps. - Build/test output: pipe through `tail -20` to capture only the relevant tail; do not print full output. 
diff --git a/scratchpad/token-budget-guidelines.md b/scratchpad/token-budget-guidelines.md index aa8d819e4f9..c92d957e34e 100644 --- a/scratchpad/token-budget-guidelines.md +++ b/scratchpad/token-budget-guidelines.md @@ -285,24 +285,24 @@ Explicit instructions in workflow prompts to reduce token consumption: - Added `## Token Budget Guidelines` section in prompt: - Reduce test cases from 3 to 2 workflows - Use compact one-line scoring format (no verbose JSON) - - Early stop (noop) when average score ≥ 70 and no individual score < 55 + - Early stop (noop) when average score ≥ 70 and no individual score < 55 — skips all report generation - Removed lengthy implementation guide from issue template - Limit output to 20-turn target **Expected Impact:** -- **Token Reduction**: 60-75% (from ~9.35M to ~2.3M-3.7M per run) +- **Token Reduction**: 85-95% for healthy runs (early-stop fires); 50-70% when an issue is created - **Quality**: Maintained — compiler error quality still evaluated across all 5 dimensions -- **Runtime**: Reduced from ~129 turns to ≤ 20 turns +- **Runtime**: Reduced from ~129 turns to ≤ 20 turns (most runs end early) **Budget Target:** -- **Target tokens/run**: 500K–1M +- **Target tokens/run**: 300K–600K (typical, early-stop); up to 2M (issue-creating run) - **Alert threshold**: >2M tokens -- **Cost estimate**: $8.75-17.50 per run +- **Cost estimate**: $5.25-10.50 per run (typical) **Optimization Strategy:** - Fewer test cases → fewer compile calls and eval rounds - Compact scoring format → less output to generate and process -- Early-stop for healthy quality → avoids generating verbose reports when not needed +- Early-stop for healthy quality → avoids generating verbose reports when not needed (covers ~most runs given 0 errors in prior run) - Removed boilerplate implementation guide from issue template → less re-generation ### Dead Code Remover @@ -322,7 +322,7 @@ Explicit instructions in workflow prompts to reduce token consumption: - Reduce batch from 
10 to 5 functions per run - Limit grep output to 5 matches (`grep -m 5`) per safety check - Limit test output to last 20 lines (`tail -20`) - - Immediate noop if deadcode finds 0 unprocessed functions + - Immediate noop (after Phase 2) if deadcode finds 0 unprocessed functions - No re-read of full cache before appending **Expected Impact:** @@ -331,15 +331,15 @@ Explicit instructions in workflow prompts to reduce token consumption: - **Runtime**: Reduced processing turns **Budget Target:** -- **Target tokens/run**: 1M–2M -- **Alert threshold**: >4M tokens -- **Cost estimate**: $17.50-35.00 per run +- **Target tokens/run**: 4M–5M +- **Alert threshold**: >7M tokens +- **Cost estimate**: $70-87.50 per run **Optimization Strategy:** - Half the batch size → half the safety checks, edits, and build iterations - Truncated grep output → avoids large result dumps increasing context - Truncated test output → avoids multi-MB test logs inflating token count -- Early-stop condition → skips all phases when nothing new to process +- Early-stop condition (Phase 2) → skips all phases when nothing new to process ## Alert Thresholds (Updated) @@ -350,8 +350,8 @@ Explicit instructions in workflow prompts to reduce token consumption: | Issue Monster | 50K-150K | >300K | >500K | | CI Optimization Coach | 300K-600K | >1M | >1.5M | | Step Name Alignment | 300K-500K | >800K | >1.2M | -| Daily Syntax Error Quality Check | 500K-1M | >2M | >4M | -| Dead Code Remover | 1M-2M | >4M | >7M | +| Daily Syntax Error Quality Check | 300K-2M | >2M | >4M | +| Dead Code Remover | 4M-5M | >7M | >9M | ## Optimization Strategies
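As a rough illustration of how the early-stop rule and the updated alert thresholds in this patch series could be checked mechanically, here is a minimal Python sketch. It is not part of the patch or of gh-aw: the names `ALERT_THRESHOLDS`, `classify_run`, `estimate_cost_usd`, and `should_noop` are hypothetical. The threshold values are copied from the "Alert Thresholds (Updated)" rows visible in the diff, and the $17.50 per million tokens rate is only inferred from the cost estimates quoted above (for example, 500K tokens listed as $8.75), so treat it as an assumption.

```python
"""Sketch only: hypothetical helpers, not code from the gh-aw repository."""

# (alert, critical) token thresholds, copied from the table rows shown in the diff.
ALERT_THRESHOLDS = {
    "Issue Monster": (300_000, 500_000),
    "CI Optimization Coach": (1_000_000, 1_500_000),
    "Step Name Alignment": (800_000, 1_200_000),
    "Daily Syntax Error Quality Check": (2_000_000, 4_000_000),
    "Dead Code Remover": (7_000_000, 9_000_000),
}

# Assumption: rate implied by the patch's own figures (500K tokens -> $8.75).
IMPLIED_USD_PER_MILLION_TOKENS = 17.50


def classify_run(workflow: str, tokens: int) -> str:
    """Return 'ok', 'alert', or 'critical' for one run's token count.

    The table uses strict '>' bounds, so a run exactly at a threshold
    stays in the lower band.
    """
    alert, critical = ALERT_THRESHOLDS[workflow]
    if tokens > critical:
        return "critical"
    if tokens > alert:
        return "alert"
    return "ok"


def estimate_cost_usd(tokens: int) -> float:
    """Estimate run cost from a token count at the implied rate."""
    return tokens / 1_000_000 * IMPLIED_USD_PER_MILLION_TOKENS


def should_noop(scores: list[int]) -> bool:
    """Early-stop rule from the syntax-error workflow prompt:
    average score >= 70 and no individual score < 55."""
    if not scores:
        raise ValueError("at least one test-case score is required")
    average = sum(scores) / len(scores)
    return average >= 70 and min(scores) >= 55


if __name__ == "__main__":
    print(classify_run("Dead Code Remover", 4_500_000))
    print(round(estimate_cost_usd(4_500_000), 2))
    print(should_noop([82, 68]))
```

Keeping the bounds strict mirrors the `>` notation in the table; if the project's monitoring treats thresholds as inclusive, the comparisons would need to change accordingly.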