
Add behavioral evals for tracker #20069

Merged
anj-s merged 30 commits into main from anj/tracker-evals on Mar 10, 2026

Conversation

@anj-s
Contributor

@anj-s anj-s commented Feb 23, 2026

Summary

This PR introduces behavioral evaluations for the Task Tracker in evals/tracker.eval.ts. These tests ensure the model correctly utilizes tracker tools (tracker_create_task, tracker_update_task) in both explicit and implicit scenarios when ApprovalMode.YOLO is enabled.

Details

  • Explicit Management Eval: Validates that the model can follow instructions to create a task, perform a fix, and then close the task in the tracker.
  • Implicit Organization Eval: Verifies that the model autonomously deduces when to use tracker tools to organize a complex implementation plan, even when not explicitly prompted to use the tracker.
  • Safety Verification: Ensures the model respects "plan-only" prompts by confirming no code modifications are made during the planning phase.
  • Test Setup: Properly configures the evaluation rig by injecting experimental.taskTracker = true into the model settings.
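
The checks described in the list above can be sketched as plain assertions over a recorded tool-call log. This is a minimal illustration, not the contents of evals/tracker.eval.ts: the `ToolCall` shape, the helper names, and the `write_file`/`replace` tool names used to detect file edits are assumptions made for the example. Only `tracker_create_task`, `tracker_update_task`, and the `experimental.taskTracker` setting come from the description.

```typescript
// Hypothetical shape of one recorded tool call; the real eval rig may differ.
interface ToolCall {
  name: string;
  args: Record<string, unknown>;
}

// Settings override described in the PR: enable the experimental task tracker.
const settings = { experimental: { taskTracker: true } };

// Assumed names of tools that modify files (illustrative only).
const EDIT_TOOLS = new Set(['write_file', 'replace']);

// Explicit management: a task is created, then an edit is made,
// then the task is updated, in that order.
function followsExplicitFlow(calls: ToolCall[]): boolean {
  const names = calls.map((c) => c.name);
  const created = names.indexOf('tracker_create_task');
  const edited = names.findIndex((n, i) => i > created && EDIT_TOOLS.has(n));
  const updated = names.findIndex(
    (n, i) => i > edited && n === 'tracker_update_task',
  );
  return created !== -1 && edited !== -1 && updated !== -1;
}

// Implicit organization plus the plan-only safety check:
// tracker tasks are created and no file-editing tool is called.
function isPlanOnly(calls: ToolCall[]): boolean {
  const usedTracker = calls.some((c) => c.name === 'tracker_create_task');
  const editedFiles = calls.some((c) => EDIT_TOOLS.has(c.name));
  return usedTracker && !editedFiles;
}
```

Checking the order of calls, rather than only their presence, is what distinguishes the explicit scenario (create, fix, close) from a run that merely touches the tracker at some point.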

Related Issues

Fixes #19965

How to Validate

Run the evaluation tests locally:

npm run test:all_evals -- evals/tracker.eval.ts

Pre-Merge Checklist

  • Updated relevant documentation and README (if needed)
  • Added/updated tests (if needed)
  • Noted breaking changes (if any)
  • Validated on required platforms/methods:
    • MacOS
      • npm run

anj-s added 24 commits February 18, 2026 12:03
- Use cryptographically secure ID generation with node:crypto
- Implement runtime validation for JSON parsing using Zod
- Optimize circular dependency validation to avoid N+1 file reads
@gemini-code-assist
Contributor

Summary of Changes

Hello @anj-s, I'm Gemini Code Assist [1]! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed!

This pull request introduces a comprehensive suite of behavioral evaluations for the tracker tool. These evaluations are designed to verify that the tool responds correctly to explicit user commands and implicitly understands when to engage its functionality, such as task initialization, creation, listing, visualization, and status updates, ensuring robust and intelligent interaction with the model.

Highlights

  • New Behavioral Evaluations: Added a new suite of behavioral evaluation tests specifically for the tracker tool, ensuring its functionality is robustly tested.
  • Explicit Tracker Usage Tests: Included tests that verify the tracker tool responds correctly to explicit user commands, such as initializing the tracker, creating tasks, listing/visualizing tasks, and updating task statuses.
  • Implicit Tracker Usage Tests: Implemented evaluations to confirm the tracker tool can implicitly understand user intent and proactively engage its functionalities, like creating tasks for feature plans or initializing for new projects, without direct instructions.
Changelog
  • evals/tracker.eval.ts
    • Added behavioral tests for the tracker tool.
    • Included explicit tests for tracker_init, tracker_create_task, tracker_list_tasks, tracker_visualize, and tracker_update_task.
    • Added implicit tests for tracker_create_task and tracker_init based on user intent.
Activity
  • No human activity has been recorded on this pull request yet.
Using Gemini Code Assist

The full guide for Gemini Code Assist can be found on our documentation page. Here are some quick tips.

Invoking Gemini

You can request assistance from Gemini at any point by creating a comment using either /gemini <command> or @gemini-code-assist <command>. Below is a summary of the supported commands on the current page.

Feature | Command | Description
Code Review | /gemini review | Performs a code review for the current pull request in its current state.
Pull Request Summary | /gemini summary | Provides a summary of the current pull request in its current state.
Comment | @gemini-code-assist | Responds in comments when explicitly tagged, both in pull request comments and review comments.
Help | /gemini help | Displays a list of available commands.

Customization

To customize the Gemini Code Assist for GitHub experience, repository maintainers can create a configuration file and/or provide a custom code review style guide (such as PEP-8 for Python) by adding files to a .gemini/ folder in the base of the repository. Detailed instructions can be found here.

Limitations & Feedback

Gemini Code Assist may make mistakes. Please leave feedback on any instances where its feedback is incorrect or counterproductive. You can react with 👍 and 👎 on @gemini-code-assist comments. If you're interested in giving your feedback about your experience with Gemini Code Assist for GitHub and other Google products, sign up here.

You can also get AI-powered code generation, chat, as well as code reviews directly in the IDE at no cost with the Gemini Code Assist IDE Extension.

Footnotes

  1. Review the Privacy Notices, Generative AI Prohibited Use Policy, Terms of Service, and learn how to configure Gemini Code Assist in GitHub here. Gemini can make mistakes, so double check it and use code with caution.

@gemini-cli
Contributor

gemini-cli bot commented Feb 23, 2026

Hi @anj-s, thank you so much for your contribution to Gemini CLI! We really appreciate the time and effort you've put into this.

We're making some updates to our contribution process to improve how we track and review changes. Please take a moment to review our recent discussion post: Improving Our Contribution Process & Introducing New Guidelines.

Key Update: Starting January 26, 2026, the Gemini CLI project will require all pull requests to be associated with an existing issue. Any pull requests not linked to an issue by that date will be automatically closed.

Thank you for your understanding and for being a part of our community!

@gemini-cli
Contributor

gemini-cli bot commented Feb 23, 2026

Hi there! Thank you for your contribution to Gemini CLI.

To improve our contribution process and better track changes, we now require all pull requests to be associated with an existing issue, as announced in our recent discussion and as detailed in our CONTRIBUTING.md.

This pull request is being closed because it is not currently linked to an issue. Once you have updated the description of this PR to link an issue (e.g., by adding Fixes #123 or Related to #123), it will be automatically reopened.

How to link an issue:
Add a keyword followed by the issue number (e.g., Fixes #123) in the description of your pull request. For more details on supported keywords and how linking works, please refer to the GitHub Documentation on linking pull requests to issues.

Thank you for your understanding and for being a part of our community!
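
The linking rule described above can be approximated with a regular expression. The sketch below is not the bot's actual implementation: `findLinkedIssues` is a hypothetical helper, and the keyword list is an assumption that combines GitHub's documented closing keywords (close, fix, resolve and their variants) with the `Related to` form mentioned in the comment.

```typescript
// Hypothetical helper: returns the issue numbers a PR description links to.
function findLinkedIssues(description: string): number[] {
  // close/closes/closed, fix/fixes/fixed, resolve/resolves/resolved, "related to",
  // followed by whitespace and "#<number>". Case-insensitive, all matches.
  const pattern =
    /\b(?:close[sd]?|fix(?:e[sd])?|resolve[sd]?|related to)\s+#(\d+)/gi;
  const issues: number[] = [];
  let match: RegExpExecArray | null;
  while ((match = pattern.exec(description)) !== null) {
    issues.push(Number(match[1]));
  }
  return issues;
}
```

Under these assumptions, a description containing `Fixes #19965` yields `[19965]`, while a description with no keyword yields an empty array, which is the case in which the pull request would be closed.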

@gemini-cli gemini-cli bot closed this Feb 23, 2026
@anj-s anj-s reopened this Feb 23, 2026
@anj-s anj-s closed this Feb 23, 2026
Base automatically changed from u/anj/task-tracker-phase-3 to main March 6, 2026 00:29
@github-actions

github-actions bot commented Mar 9, 2026

Size Change: -4 B (0%)

Total Size: 26.2 MB

ℹ️ View Unchanged
Filename | Size | Change
./bundle/gemini.js | 25.7 MB | -4 B (0%)
./bundle/node_modules/@google/gemini-cli-devtools/dist/client/main.js | 221 kB | 0 B
./bundle/node_modules/@google/gemini-cli-devtools/dist/src/_client-assets.js | 227 kB | 0 B
./bundle/node_modules/@google/gemini-cli-devtools/dist/src/index.js | 11.5 kB | 0 B
./bundle/node_modules/@google/gemini-cli-devtools/dist/src/types.js | 132 B | 0 B
./bundle/sandbox-macos-permissive-open.sb | 890 B | 0 B
./bundle/sandbox-macos-permissive-proxied.sb | 1.31 kB | 0 B
./bundle/sandbox-macos-restrictive-open.sb | 3.36 kB | 0 B
./bundle/sandbox-macos-restrictive-proxied.sb | 3.56 kB | 0 B
./bundle/sandbox-macos-strict-open.sb | 4.82 kB | 0 B
./bundle/sandbox-macos-strict-proxied.sb | 5.02 kB | 0 B

compressed-size-action

@anj-s anj-s requested a review from gundermanc March 9, 2026 19:18
@anj-s anj-s marked this pull request as ready for review March 9, 2026 19:18
Contributor

@gemini-code-assist gemini-code-assist bot left a comment


Code Review

This pull request introduces valuable behavioral evaluation tests for the task tracker feature, covering both explicit and implicit tool usage scenarios. The tests are well-structured and the prompts are clear. The identified logical issue in an assertion was valid, and a suggestion has been provided to correct it.

@gundermanc
Member

FYI: Ran these through the nightly run: https://github.com/google-gemini/gemini-cli/actions/runs/22870896466

Looks like they pass 66-100% of the time, depending on model, though it fails at 0% for some models.

Not necessarily blocking.

@anj-s
Contributor Author

anj-s commented Mar 9, 2026

> FYI: Ran these through the nightly run: https://github.com/google-gemini/gemini-cli/actions/runs/22870896466
>
> Looks like they pass 66-100% of the time, depending on model, though it fails at 0% for some models.
>
> Not necessarily blocking.

Got it! What does no numbers in the table imply?

@anj-s anj-s enabled auto-merge March 10, 2026 18:28
@gundermanc
Member

> Got it! What does no numbers in the table imply?

The report lists the results from the last ~7 runs. The rightmost column is the current run. The columns left of that are from previous runs. No numbers means the test did not run in that run, in this case, because it didn't exist.
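
The reading of the report described above can be illustrated with a small sketch. The data shape, the `formatRow` helper, the test name, and the sample values are hypothetical and are not the nightly report's actual code; a run in which the test did not exist is modeled as `null` and rendered as an empty cell.

```typescript
// Pass rate (0-100) per run, oldest first; null means the test did not run.
type RunHistory = Array<number | null>;

// Render one report row: test name, then one cell per run, rightmost = current run.
function formatRow(testName: string, history: RunHistory): string {
  const cells = history.map((rate) => (rate === null ? '' : `${rate}%`));
  return [testName, ...cells].join(' | ');
}

// A test added in the current run: the six earlier runs have no numbers,
// and only the rightmost column carries a pass rate.
const row = formatRow('tracker eval (hypothetical)', [
  null, null, null, null, null, null, 66,
]);
```

Modeling a missing run as `null` rather than `0` keeps "did not run" distinct from "ran and always failed", which is the distinction the empty cells in the report convey.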

@anj-s anj-s added this pull request to the merge queue Mar 10, 2026
Merged via the queue into main with commit 2dd0376 Mar 10, 2026
27 checks passed
@anj-s anj-s deleted the anj/tracker-evals branch March 10, 2026 19:05

Labels

area/platform Issues related to Build infra, Release mgmt, Testing, Eval infra, Capacity, Quota mgmt

Development

Successfully merging this pull request may close these issues.

Implement behavioral evals for tracker (implicit and explicit prompting)

2 participants