Document model env and improve diff handling #3

Merged
arthrod merged 3 commits into main from codex/limit-token-changes-to-20k-and-add-model-selection-bsavnq on May 19, 2025

Conversation

@arthrod (Owner) commented May 19, 2025

Summary

  • document the AUTO_COMMIT_MODEL environment variable in the README
  • add placeholder Cargo.lock and stop ignoring it
  • clarify tokenizer docs and expose DEFAULT_MODEL
  • handle git diff output errors and use staged diff
  • limit diff tokens via a constant

Testing

  • cargo test --offline (fails: no matching package named async-openai found)


coderabbitai Bot commented May 19, 2025

Warning

Rate limit exceeded

@arthrod has exceeded the limit for the number of commits or files that can be reviewed per hour. Please wait 8 minutes and 21 seconds before requesting another review.

⌛ How to resolve this issue?

After the wait time has elapsed, a review can be triggered using the @coderabbitai review command as a PR comment. Alternatively, push new commits to this PR.

We recommend that you space out your commits to avoid hitting the rate limit.

🚦 How do rate limits work?

CodeRabbit enforces hourly rate limits for each developer per organization.

Our paid plans have higher rate limits than the trial, open-source and free plans. In all cases, we re-allow further reviews after a brief timeout.

Please see our FAQ for further information.

📥 Commits

Reviewing files that changed from the base of the PR and between 55d0b7c and 701ed4f.

📒 Files selected for processing (2)
  • src/lib.rs (3 hunks)
  • src/main.rs (5 hunks)

Summary by CodeRabbit

  • New Features

    • Added support for selecting the AI model via the AUTO_COMMIT_MODEL environment variable, with documentation and a default model provided.
    • Implemented automatic truncation of commit prompts to a maximum number of tokens for improved reliability.
  • Bug Fixes

    • Improved error handling for Git command execution and output decoding, preventing unexpected crashes.
  • Documentation

    • Updated README to include instructions for configuring the AI model with an environment variable.

Walkthrough

The updates introduce a library crate configuration, add a new library file with utility functions for token truncation and model selection, and enhance error handling in the main executable. The .gitignore is changed to track Cargo.lock, and the README documents a new environment variable for selecting the AI model.

Changes

  • .gitignore: Removed the rule to ignore Cargo.lock, so it will now be tracked by Git.
  • Cargo.toml: Added a [lib] section specifying the library crate name and source path.
  • README.md: Documented the AUTO_COMMIT_MODEL environment variable and its usage, including the default model fallback.
  • src/lib.rs: New file; implements truncate_to_n_tokens, DEFAULT_MODEL, and get_model_from_env, and includes unit tests.
  • src/main.rs: Improved error handling for Git commands, added token truncation, dynamic model selection, and safer output parsing.

Sequence Diagram(s)

sequenceDiagram
    participant User
    participant MainRS
    participant LibRS
    participant Git
    participant Env

    User->>MainRS: Run auto-commit
    MainRS->>Git: git diff --name-only --staged
    Git-->>MainRS: List of staged files
    MainRS->>Git: git diff --staged
    Git-->>MainRS: Staged diff
    MainRS->>LibRS: truncate_to_n_tokens(combined_diff, MAX_DIFF_TOKENS)
    LibRS-->>MainRS: Truncated diff string
    MainRS->>LibRS: get_model_from_env()
    LibRS->>Env: Read AUTO_COMMIT_MODEL
    Env-->>LibRS: Model name or default
    LibRS-->>MainRS: Model name
    MainRS->>User: Generate commit message using selected model and truncated diff

Poem

A hop, a skip, a token snip—
Now models switch with just one tip!
No more panic, errors tamed,
Cargo.lock is now proclaimed.
From README lines to code anew,
The rabbit cheers this clever crew!
🐇✨


🪧 Tips

Chat

There are 3 ways to chat with CodeRabbit:

  • Review comments: Directly reply to a review comment made by CodeRabbit. Example:
    • I pushed a fix in commit <commit_id>, please review it.
    • Explain this complex logic.
    • Open a follow-up GitHub issue for this discussion.
  • Files and specific lines of code (under the "Files changed" tab): Tag @coderabbitai in a new review comment at the desired location with your query. Examples:
    • @coderabbitai explain this code block.
    • @coderabbitai modularize this function.
  • PR comments: Tag @coderabbitai in a new PR comment to ask questions about the PR branch. For the best results, please provide a very specific query, as very limited context is provided in this mode. Examples:
    • @coderabbitai gather interesting stats about this repository and render them as a table. Additionally, render a pie chart showing the language distribution in the codebase.
    • @coderabbitai read src/utils.ts and explain its main purpose.
    • @coderabbitai read the files in the src/scheduler package and generate a class diagram using mermaid and a README in the markdown format.
    • @coderabbitai help me debug CodeRabbit configuration file.

Support

Need help? Create a ticket on our support page for assistance with any issues or questions.

Note: Be mindful of the bot's finite context window. It's strongly recommended to break down tasks such as reading entire modules into smaller chunks. For a focused discussion, use review comments to chat about specific files and their changes, instead of using the PR comments.

CodeRabbit Commands (Invoked using PR comments)

  • @coderabbitai pause to pause the reviews on a PR.
  • @coderabbitai resume to resume the paused reviews.
  • @coderabbitai review to trigger an incremental review. This is useful when automatic reviews are disabled for the repository.
  • @coderabbitai full review to do a full review from scratch and review all the files again.
  • @coderabbitai summary to regenerate the summary of the PR.
  • @coderabbitai generate sequence diagram to generate a sequence diagram of the changes in this PR.
  • @coderabbitai resolve to resolve all the CodeRabbit review comments.
  • @coderabbitai configuration to show the current CodeRabbit configuration for the repository.
  • @coderabbitai help to get help.

Other keywords and placeholders

  • Add @coderabbitai ignore anywhere in the PR description to prevent this PR from being reviewed.
  • Add @coderabbitai summary or crs to generate the high-level summary at a specific location in the PR description.
  • Add @coderabbitai or cr anywhere in the PR title to generate the title automatically.

CodeRabbit Configuration File (.coderabbit.yaml)

  • You can programmatically configure CodeRabbit by adding a .coderabbit.yaml file to the root of your repository.
  • Please see the configuration documentation for more information.
  • If your editor has YAML language server enabled, you can add the path at the top of this file to enable auto-completion and validation: # yaml-language-server: $schema=https://coderabbit.ai/integrations/schema.v2.json

Documentation and Community

  • Visit our Documentation for detailed information on how to use CodeRabbit.
  • Join our Discord Community to get help, request features, and share feedback.
  • Follow us on X/Twitter for updates and announcements.


@gemini-code-assist gemini-code-assist Bot left a comment


Hello @arthrod, I'm Gemini Code Assist[1]! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed!

Summary of Changes

Hello! Gemini here, providing a summary of this pull request to help everyone get up to speed quickly.

This PR focuses on improving the auto-commit tool in a few key areas. Firstly, it introduces the ability to configure the OpenAI model used for generating commit messages via the AUTO_COMMIT_MODEL environment variable, defaulting to gpt-4.1-nano if not set. This configuration is also documented in the README.

Secondly, it significantly enhances the handling of the Git diff input provided to the model. The tool now specifically uses the staged diff (git diff --staged), adds more robust error handling for Git command execution and UTF-8 conversion of output, and limits the size of the combined changed files list and diff output to a maximum number of 'tokens' (implemented as whitespace-delimited words) to prevent exceeding model context limits.
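For illustration, the staged-diff handling described above might look roughly like the sketch below. This is a rough reconstruction based on the review discussion further down, not the exact PR source; the staged_diff helper name and the eprintln! logging are stand-ins, as the PR itself uses an error! macro.

    use std::process::Command;
    use std::str;

    // Run `git diff --staged`; log and return Err if the command cannot be spawned,
    // and fall back to an empty string if the output is not valid UTF-8.
    fn staged_diff() -> Result<String, ()> {
        let output = Command::new("git")
            .arg("diff")
            .arg("--staged")
            .output()
            .map_err(|e| eprintln!("Couldn't find diff: {}", e))?
            .stdout;
        Ok(match str::from_utf8(&output) {
            Ok(v) => v.to_string(),
            Err(e) => {
                eprintln!("Diff output was not valid UTF-8: {}", e);
                String::new()
            }
        })
    }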

Finally, minor project structure changes include removing Cargo.lock from .gitignore and adding a placeholder Cargo.lock, and defining the core logic as a library in Cargo.toml.

Highlights

  • Model Configuration: The OpenAI model used for generating commit messages can now be specified using the AUTO_COMMIT_MODEL environment variable, providing flexibility for users.
  • Improved Git Diff Handling: The tool now uses the staged diff (git diff --staged), includes better error handling for Git command failures and output encoding issues, and fetches the list of changed files separately.
  • Diff Size Limiting: A new function truncate_to_n_tokens is introduced and used to limit the size of the combined changed files list and diff output sent to the model, preventing potential context window issues.
  • Documentation Update: The README has been updated to include instructions on how to use the new AUTO_COMMIT_MODEL environment variable.
  • Project Structure: The project now explicitly defines its core logic as a library in Cargo.toml, and Cargo.lock is no longer ignored, with a placeholder added.

Changelog

  • .gitignore
    • Removed Cargo.lock from the ignored files list.
  • Cargo.lock
    • Added a placeholder Cargo.lock file with a comment explaining its purpose.
  • Cargo.toml
    • Added a [lib] section to define the crate as a library named auto_commit.
  • README.md
    • Added documentation for the AUTO_COMMIT_MODEL environment variable, explaining how to set it and its default value.
  • src/lib.rs
    • Added truncate_to_n_tokens function to limit string length by whitespace-delimited words.
    • Defined DEFAULT_MODEL constant for the default OpenAI model name.
    • Added get_model_from_env function to fetch the model name from the environment variable or use the default.
    • Included unit tests for the new truncate_to_n_tokens and get_model_from_env functions.
  • src/main.rs
    • Imported get_model_from_env and truncate_to_n_tokens from the auto_commit library.
    • Added MAX_DIFF_TOKENS constant to define the maximum size of the diff input.
    • Improved error handling for git diff --staged and git rev-parse commands using map_err.
    • Added error handling for UTF-8 conversion of Git command output.
    • Changed the primary diff command from git diff HEAD to git diff --staged.
    • Added a new command git diff --name-only --staged to get the list of changed files.
    • Combined the list of changed files and the diff output into a single string.
    • Applied truncate_to_n_tokens to the combined diff input before sending it to the model.
    • Updated the OpenAI API call to use the model name fetched by get_model_from_env() instead of a hardcoded value.
Using Gemini Code Assist

The full guide for Gemini Code Assist can be found on our documentation page, here are some quick tips.

Invoking Gemini

You can request assistance from Gemini at any point in your pull request by creating an issue comment (i.e., a comment on the pull request page) using either /gemini <command> or @gemini-code-assist <command>. Below is a summary of the supported commands.

  • Code Review (/gemini review): Performs a code review for the current pull request in its current state.
  • Pull Request Summary (/gemini summary): Provides a summary of the current pull request in its current state.
  • Comment (@gemini-code-assist): Responds in comments when explicitly tagged, both in issue comments and review comments.
  • Help (/gemini help): Displays a list of available commands.

Customization

To customize the Gemini Code Assist for GitHub experience, repository maintainers can create a configuration file and/or provide a custom code review style guide (such as PEP-8 for Python) by creating and adding files to a .gemini/ folder in the base of the repository. Detailed instructions can be found here.

Limitations & Feedback

Gemini Code Assist is currently in preview and may make mistakes. Please leave feedback on any instances where its suggestions are incorrect or counterproductive. You can react with 👍 and 👎 on @gemini-code-assist comments to provide feedback.

You can also get AI-powered code generation, chat, and code reviews directly in the IDE at no cost with the Gemini Code Assist IDE Extension.


Diffs show the change,
Staged files in a new range,
AI reads the code.

Footnotes

  1. Review the Privacy Notices, Generative AI Prohibited Use Policy, Terms of Service, and learn how to configure Gemini Code Assist in GitHub here. Gemini can make mistakes, so double check it and use code with caution.


@gemini-code-assist gemini-code-assist Bot left a comment


Code Review

Overall, this pull request introduces several valuable improvements, including better error handling for git commands, documenting the model environment variable, and setting up the project as a library. The addition of tests for the new library functions is also great. The changes align well with the stated goals of the pull request. I've reviewed the code based on common Rust practices, as no specific style guide was provided.

Summary of Findings

  • Truncation logic vs. Tokenization: The truncate_to_n_tokens function truncates by whitespace-delimited words, which may not align with the model's actual tokenization and could lead to unexpected behavior regarding token limits. Consider renaming or using a proper tokenization library.
  • Error Handling for git rev-parse: The unwrap_or("") in the git rev-parse error handling might mask UTF-8 errors and incorrectly proceed. Using a match statement to log the error and exit would be safer.
  • Redundant Git Commands: The git diff --staged command is executed multiple times. Running it once and reusing the output would improve efficiency.
  • Test Coverage: Tests were added for the new library functions, which is good. (Note: Per review settings, no direct comment was made on the test code itself).
  • Documentation: Documentation for the AUTO_COMMIT_MODEL environment variable was added to the README. (Note: Per review settings, no direct comment was made on the README changes).
  • Code Structure: Moving functions to src/lib.rs and adding the [lib] section in Cargo.toml improves modularity. (Note: Per review settings, no direct comment was made on these changes).
  • Constant Usage: The MAX_DIFF_TOKENS constant was introduced, which is good practice. (Note: Per review settings, no direct comment was made on this change).
  • Placeholder Cargo.lock: A placeholder Cargo.lock was added and the .gitignore updated accordingly. (Note: Per review settings, no direct comment was made on these changes).

Merge Readiness

This pull request introduces good improvements, particularly in error handling and modularity. However, there is a high-severity issue regarding the redundant execution of a git command that should be addressed before merging. Additionally, consider the medium-severity issues related to the truncation logic and the git rev-parse error handling for further robustness and clarity. I am unable to approve this pull request; please have other reviewers review and approve this code before merging.

Comment thread src/main.rs
Comment on lines +120 to +153
let files_output = Command::new("git")
    .arg("diff")
    .arg("--name-only")
    .arg("--staged")
    .output()
    .map_err(|e| {
        error!("Couldn't get changed files: {}", e);
        ()
    })?
    .stdout;
let files_changed = match str::from_utf8(&files_output) {
    Ok(v) => v,
    Err(e) => {
        error!("Changed files output was not valid UTF-8: {}", e);
        ""
    }
};

let diff_output = Command::new("git")
    .arg("diff")
-   .arg("HEAD")
    .arg("--staged")
    .output()
-   .expect("Couldn't find diff.")
    .map_err(|e| {
        error!("Couldn't find diff: {}", e);
        ()
    })?
    .stdout;
-let output = str::from_utf8(&output).unwrap();
let diff_output = match str::from_utf8(&diff_output) {
    Ok(v) => v,
    Err(e) => {
        error!("Diff output was not valid UTF-8: {}", e);
        ""
    }
};


Severity: high

You are executing git diff --staged twice here (lines 138-153) after already executing it once earlier (lines 81-97). Running the same command multiple times is inefficient and could potentially lead to inconsistencies if the staged files change between the calls (though unlikely in this context).

Could you refactor this to run git diff --staged only once and reuse the result?

    let files_output = Command::new("git")
        .arg("diff")
        .arg("--name-only")
        .arg("--staged")
        .output()
        .map_err(|e| {
            error!("Couldn't get changed files: {}", e);
            ()
        })?
        .stdout;
    let files_changed = match str::from_utf8(&files_output) {
        Ok(v) => v,
        Err(e) => {
            error!("Changed files output was not valid UTF-8: {}", e);
            ""
        }
    };

    // Reuse the git_staged_cmd from earlier
    let diff_output = git_staged_cmd;

    let combined = format!("Changed files:\n{}\n\nDiff:\n{}", files_changed, diff_output);

Comment thread src/lib.rs Outdated
Comment on lines +1 to +3
/// Truncate a string to the first `limit` whitespace-delimited words.
pub fn truncate_to_n_tokens(text: &str, limit: usize) -> String {
    text.split_whitespace().take(limit).collect::<Vec<_>>().join(" ")


Severity: medium

The function truncate_to_n_tokens currently truncates based on whitespace-delimited words. While this might be a reasonable heuristic, it doesn't accurately reflect the tokenization used by large language models, which is typically based on subword units (like BPE). This could lead to the actual token count being significantly different from the limit provided, potentially exceeding the model's context window or truncating too aggressively.

Would it be more accurate to rename this function to reflect that it truncates by words (e.g., truncate_to_n_words) or consider using a proper tokenization library (like tiktoken-rs) if the intent is to strictly limit based on model tokens?
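If strict token-based limiting were desired, a minimal sketch using the tiktoken-rs crate might look like the following. The crate choice, the cl100k_base encoding, and the exact API calls are assumptions for illustration; none of this is part of the PR.

    use tiktoken_rs::cl100k_base;

    // Assumed tiktoken-rs API: encode the text, keep the first `limit` tokens,
    // and decode back to a string; fall back to word-based truncation on error.
    pub fn truncate_to_n_model_tokens(text: &str, limit: usize) -> String {
        let bpe = cl100k_base().expect("failed to load BPE encoding");
        let tokens = bpe.encode_with_special_tokens(text);
        if tokens.len() <= limit {
            return text.to_string();
        }
        bpe.decode(tokens[..limit].to_vec()).unwrap_or_else(|_| {
            text.split_whitespace().take(limit).collect::<Vec<_>>().join(" ")
        })
    }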


@coderabbitai coderabbitai Bot left a comment


Actionable comments posted: 1

🔭 Outside diff range comments (1)
README.md (1)

5-5: 🛠️ Refactor suggestion

Update model reference in introduction

Line 5 mentions GPT-3.5, but the default model is now gpt-4.1-nano. This inconsistency could confuse users.

-A CLI tool that generates commit messages from your staged changes, built in Rust and using [OpenAI's GPT-3.5](https://platform.openai.com/overview).
+A CLI tool that generates commit messages from your staged changes, built in Rust and using [OpenAI's models](https://platform.openai.com/overview).
🧹 Nitpick comments (3)
src/main.rs (3)

67-67: Consider documenting the MAX_DIFF_TOKENS constant

Adding a comment explaining why 20,000 tokens was chosen as the limit would improve maintainability.

-const MAX_DIFF_TOKENS: usize = 20_000;
+/// Maximum number of tokens to include in the diff sent to the AI model.
+/// This helps prevent exceeding the model's context window limit.
+const MAX_DIFF_TOKENS: usize = 20_000;

107-116: Great improvement to error handling, but consider refinement

The error handling is much better, but line 113 still uses unwrap_or("") which could be improved to match the pattern used elsewhere.

-    if str::from_utf8(&is_repo).unwrap_or("").trim() != "true" {
+    let is_repo_str = match str::from_utf8(&is_repo) {
+        Ok(v) => v,
+        Err(e) => {
+            error!("Repository check output was not valid UTF-8: {}", e);
+            ""
+        }
+    };
+    if is_repo_str.trim() != "true" {

155-156: Consider using a more descriptive variable name.

The variable name combined is a bit generic. A more descriptive name would improve code readability.

-let combined = format!("Changed files:\n{}\n\nDiff:\n{}", files_changed, diff_output);
-let output = truncate_to_n_tokens(&combined, MAX_DIFF_TOKENS);
+let git_content = format!("Changed files:\n{}\n\nDiff:\n{}", files_changed, diff_output);
+let truncated_git_content = truncate_to_n_tokens(&git_content, MAX_DIFF_TOKENS);
📜 Review details

Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 66e0c86 and 55d0b7c.

⛔ Files ignored due to path filters (1)
  • Cargo.lock is excluded by !**/*.lock
📒 Files selected for processing (5)
  • .gitignore (0 hunks)
  • Cargo.toml (1 hunks)
  • README.md (1 hunks)
  • src/lib.rs (1 hunks)
  • src/main.rs (5 hunks)
💤 Files with no reviewable changes (1)
  • .gitignore
🧰 Additional context used
🧬 Code Graph Analysis (1)
src/main.rs (1)
src/lib.rs (2)
  • get_model_from_env (10-12)
  • truncate_to_n_tokens (2-4)
🔇 Additional comments (8)
Cargo.toml (1)

27-29: LGTM! Good library configuration.

The addition of the library configuration is well-structured and follows Rust conventions. This modularization will help with code organization and reuse.

src/lib.rs (2)

6-12: LGTM! Good environment variable handling.

The function and constant are well-documented and properly implemented. Using unwrap_or_else with a default is a good pattern for handling environment variables.
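For reference, a plausible reconstruction of the constant and function under discussion (src/lib.rs lines 6-12); the actual source may differ slightly:

    /// Default model used when AUTO_COMMIT_MODEL is not set (per the PR description).
    pub const DEFAULT_MODEL: &str = "gpt-4.1-nano";

    /// Read the model name from the AUTO_COMMIT_MODEL environment variable,
    /// falling back to DEFAULT_MODEL when it is unset or invalid.
    pub fn get_model_from_env() -> String {
        std::env::var("AUTO_COMMIT_MODEL").unwrap_or_else(|_| DEFAULT_MODEL.to_string())
    }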


14-36: LGTM! Well-structured tests.

The tests are comprehensive and properly verify both functions. Good practice using environment variable cleanup in the tests to prevent interference between test cases.
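Illustrative only, a test module in this spirit might look like the sketch below; the actual tests at lines 14-36 may differ.

    #[cfg(test)]
    mod tests {
        use super::*;

        #[test]
        fn truncates_to_word_limit() {
            assert_eq!(truncate_to_n_tokens("a b c d", 2), "a b");
        }

        #[test]
        fn model_env_var_overrides_default() {
            std::env::remove_var("AUTO_COMMIT_MODEL");
            assert_eq!(get_model_from_env(), DEFAULT_MODEL);

            std::env::set_var("AUTO_COMMIT_MODEL", "some-other-model");
            assert_eq!(get_model_from_env(), "some-other-model");

            // Clean up so later tests are not affected by the modified environment.
            std::env::remove_var("AUTO_COMMIT_MODEL");
        }
    }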

src/main.rs (4)

24-24: LGTM! Proper imports from the new library module.

Correctly imports the new utility functions from the library.


85-97: LGTM! Improved error handling.

The change from expect to proper error handling with logging is a significant improvement. This prevents the application from crashing on unexpected errors.


120-154: LGTM! Good separation of concerns.

Fetching the file list separately from the diff content improves code clarity. The error handling is also well-implemented.


254-254: LGTM! Dynamic model selection.

Using the environment variable for model selection improves configurability.

README.md (1)

31-35: LGTM! Clear documentation of the new environment variable.

The documentation clearly explains the purpose of the AUTO_COMMIT_MODEL environment variable and its default value.

Comment thread src/lib.rs Outdated
Comment on lines +1 to +4
/// Truncate a string to the first `limit` whitespace-delimited words.
pub fn truncate_to_n_tokens(text: &str, limit: usize) -> String {
    text.split_whitespace().take(limit).collect::<Vec<_>>().join(" ")
}


🛠️ Refactor suggestion

Function name doesn't accurately reflect implementation

The function is named truncate_to_n_tokens but it's actually truncating based on whitespace-delimited words, not tokens in the NLP sense. OpenAI's tokenization is more complex and based on subword units, not just whitespace.

Either rename the function to better reflect its behavior:

-/// Truncate a string to the first `limit` whitespace-delimited words.
-pub fn truncate_to_n_tokens(text: &str, limit: usize) -> String {
+/// Truncate a string to the first `limit` whitespace-delimited words.
+pub fn truncate_to_n_words(text: &str, limit: usize) -> String {
    text.split_whitespace().take(limit).collect::<Vec<_>>().join(" ")
}

Or implement actual token-based truncation using a proper tokenizer if needed.

🤖 Prompt for AI Agents
In src/lib.rs lines 1 to 4, the function name truncate_to_n_tokens is misleading
because it truncates based on whitespace-delimited words, not true NLP tokens.
To fix this, either rename the function to something like truncate_to_n_words to
accurately reflect its behavior, or implement actual token-based truncation
using a proper tokenizer library that handles subword units as OpenAI
tokenization does.

arthrod and others added 2 commits May 19, 2025 12:19
Co-authored-by: gemini-code-assist[bot] <176961590+gemini-code-assist[bot]@users.noreply.github.com>
@arthrod arthrod merged commit 3a24a90 into main May 19, 2025
1 check failed