Fix silent async failures of checkpointing (#13331)#13486
Fix silent async failures of checkpointing (#13331)#13486vishvananda wants to merge 2 commits intogoogle-gemini:mainfrom
Conversation
Summary of ChangesHello @vishvananda, I'm Gemini Code Assist1! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed! This pull request addresses critical issues with asynchronous checkpointing, which previously led to silent failures and potential repository corruption. By making checkpointing synchronous and implementing comprehensive error handling, the system now provides immediate feedback on snapshot creation status. This ensures greater reliability and transparency when using tools that modify files, safeguarding the user's work by creating robust, restorable snapshots with detailed context. Highlights
Using Gemini Code AssistThe full guide for Gemini Code Assist can be found on our documentation page, here are some quick tips. Invoking Gemini You can request assistance from Gemini at any point by creating a comment using either
Customization To customize Gemini Code Assist for GitHub experience, repository maintainers can create a configuration file and/or provide a custom code review style guide (such as PEP-8 for Python) by creating and adding files to a Limitations & Feedback Gemini Code Assist may make mistakes. Please leave feedback on any instances where its feedback is incorrect or counter productive. You can react with 👍 and 👎 on @gemini-code-assist comments. If you're interested in giving your feedback about your experience with Gemini Code Assist for Github and other Google products, sign up here. You can also get AI-powered code generation, chat, as well as code reviews directly in the IDE at no cost with the Gemini Code Assist IDE Extension. Footnotes
|
There was a problem hiding this comment.
Code Review
This pull request makes checkpointing synchronous and improves error handling, which is a great enhancement. The core logic changes in useGeminiStream.ts are sound, moving the snapshotting to happen before tool calls are scheduled. However, I've found a critical type-safety issue in the new error handling logic that could lead to runtime crashes, an inconsistency in how different snapshot failures are handled, and a gap in the test coverage for the new error path. My review includes detailed suggestions to address these points to make the implementation more robust and reliable.
| const erroredToolCalls: TrackedToolCall[] = toolCallRequests.map( | ||
| (request) => ({ | ||
| request, | ||
| status: 'error', | ||
| response: { | ||
| callId: request.callId, | ||
| resultDisplay: '', | ||
| error: new Error( | ||
| 'Tool call aborted due to a critical, non-recoverable file snapshot failure.', | ||
| ), | ||
| errorType: ToolErrorType.UNKNOWN, | ||
| responseParts: [ | ||
| { | ||
| functionResponse: { | ||
| name: request.name, | ||
| response: { | ||
| error: | ||
| 'Tool call aborted due to a critical, non-recoverable file snapshot failure.', | ||
| }, | ||
| }, | ||
| }, | ||
| ], | ||
| }, | ||
| responseSubmittedToGemini: false, | ||
| }), | ||
| ); |
There was a problem hiding this comment.
The erroredToolCalls objects created here do not conform to the TrackedToolCall type (specifically TrackedCompletedToolCall for the 'error' status). They are missing required properties like tool, invocation, startTime, and endTime. This violates type safety and is likely to cause runtime errors if handleCompletedTools or any function it calls is modified to access these missing properties in the future. To make the code more robust and type-safe, you should construct a complete TrackedCompletedToolCall object for each failed tool call.
const toolRegistry = config.getToolRegistry();
const erroredToolCalls: TrackedToolCall[] = toolCallRequests.map(
(request): TrackedCompletedToolCall => {
const tool = toolRegistry.getTool(request.name);
if (!tool) {
// This should not happen if validation is correct, but we should be robust.
throw new Error(
`Snapshot failed: Tool "${request.name}" not found in registry.`,
);
}
const invocation = tool.build(request.args);
const error = new Error(
'Tool call aborted due to a critical, non-recoverable file snapshot failure.',
);
return {
request,
status: 'error',
response: {
callId: request.callId,
resultDisplay: error.message,
error,
errorType: ToolErrorType.UNKNOWN,
responseParts: [
{
functionResponse: {
name: request.name,
response: { error: error.message },
},
},
],
},
responseSubmittedToGemini: false,
tool,
invocation,
startTime: Date.now(),
endTime: Date.now(),
};
},
);| if (!commitHash) { | ||
| addItem( | ||
| { | ||
| type: MessageType.ERROR, | ||
| text: 'Failed to get a commit hash for the file snapshot. Aborting operation.', | ||
| }, | ||
| Date.now(), | ||
| ); | ||
| setSnapshottingMessage(null); | ||
| cancelOngoingRequest(); | ||
| return StreamProcessingStatus.Completed; | ||
| } |
There was a problem hiding this comment.
The error handling when createFileSnapshot returns no commitHash is inconsistent with the catch block above. This code calls cancelOngoingRequest(), which aborts the turn without informing the model about the failure. This can lead to a confusing experience where the conversation just stops. To provide a better user experience and allow the model to react to the failure, this case should also create erroredToolCalls and send them back to the model, similar to how the try...catch block handles snapshot errors.
if (!commitHash) {
const errorMessage =
'Failed to get a commit hash for the file snapshot. Aborting operation.';
addItem(
{
type: MessageType.ERROR,
text: errorMessage,
},
Date.now(),
);
// Inform the model that the tool calls failed.
const toolRegistry = config.getToolRegistry();
const erroredToolCalls: TrackedToolCall[] = toolCallRequests.map(
(request): TrackedCompletedToolCall => {
const tool = toolRegistry.getTool(request.name);
if (!tool) {
throw new Error(
`Snapshot failed: Tool "${request.name}" not found in registry.`,
);
}
const invocation = tool.build(request.args);
const error = new Error(errorMessage);
return {
request,
status: 'error',
response: {
callId: request.callId,
resultDisplay: error.message,
error,
errorType: ToolErrorType.UNKNOWN,
responseParts: [
{
functionResponse: {
name: request.name,
response: { error: error.message },
},
},
],
},
responseSubmittedToGemini: false,
tool,
invocation,
startTime: Date.now(),
endTime: Date.now(),
};
},
);
void handleCompletedToolsRef.current!(erroredToolCalls);
setSnapshottingMessage(null);
return StreamProcessingStatus.Completed;
}| it('should show an error if file snapshotting fails', async () => { | ||
| mockCreateFileSnapshot.mockRejectedValue(new Error('Snapshot failed')); | ||
| const geminiClient = new MockedGeminiClientClass(mockConfig); | ||
| const { result } = renderTestHook([], geminiClient); | ||
|
|
||
| await act(async () => { | ||
| const stream = | ||
| (async function* (): AsyncGenerator<ServerGeminiStreamEvent> { | ||
| yield { | ||
| type: ServerGeminiEventType.ToolCallRequest, | ||
| value: { | ||
| callId: 'test-call-id', | ||
| name: 'replace', | ||
| args: { file_path: 'test.txt' }, | ||
| isClientInitiated: false, | ||
| prompt_id: 'test-prompt-id', | ||
| }, | ||
| }; | ||
| yield { | ||
| type: ServerGeminiEventType.Error, | ||
| value: { error: { message: 'Snapshot failed' } }, | ||
| }; | ||
| })(); | ||
| await result.current.processGeminiStreamEvents( | ||
| stream, | ||
| Date.now(), | ||
| new AbortController().signal, | ||
| ); | ||
|
|
||
| expect(mockCreateFileSnapshot).toHaveBeenCalled(); | ||
| expect(mockAddItem).toHaveBeenCalledWith( | ||
| expect.objectContaining({ | ||
| type: MessageType.ERROR, | ||
| text: expect.stringContaining( | ||
| 'Failed to create file snapshot: Snapshot failed. Aborting operation.', | ||
| ), | ||
| }), | ||
| expect.any(Number), | ||
| ); | ||
| }); | ||
| }); |
There was a problem hiding this comment.
This test correctly verifies that an error message is displayed when file snapshotting fails. However, it's missing a crucial assertion to verify that the model is informed about the tool call failures. The implementation in useGeminiStream.ts calls handleCompletedTools to send the error status back to the model. Please add an assertion to ensure handleCompletedTools (or a mock it depends on) is invoked with the appropriate error information. This will make the test more robust and prevent regressions in this critical error recovery flow.
|
I added the fix for sending the failed snapshot to the model and included new tests for it. I'm not sure about the critical one which seems to include some hallucinated code, invocation, startTime and endTime are not valid values in the return. |
|
I went ahead and added the tool into the response to the model as requested. I'm still confused why it is asking for invocation, startTime, and endTime, which don't seem to exist AFAICT. |
6a1ba73 to
9fcd148
Compare
9fcd148 to
5dd8039
Compare
|
squashed and rebased to fix merge conflicts |
|
/gemini review |
There was a problem hiding this comment.
Code Review
This pull request refactors the checkpointing logic to be synchronous, which is a great improvement to prevent silent failures and race conditions. The changes look solid, and the addition of tests for the new snapshotting logic is appreciated. I've identified one area with significant code duplication and a potential bug in the error handling logic. My review includes a suggestion to refactor this duplicated code for better maintainability and to fix the inconsistent error handling.
| let commitHash: string | undefined; | ||
| try { | ||
| if (!gitService) { | ||
| throw new Error( | ||
| 'Checkpointing is enabled but Git service is not available. Ensure Git is installed.', | ||
| ); | ||
| } | ||
| commitHash = await gitService.createFileSnapshot( | ||
| `Snapshot for ${restorableToolCalls | ||
| .map((t) => t.name) | ||
| .join(', ')}`, | ||
| ); | ||
| } catch (error) { | ||
| const errorMessage = `Failed to create file snapshot: ${getErrorMessage( | ||
| error, | ||
| )}`; | ||
| addItem( | ||
| { | ||
| type: MessageType.ERROR, | ||
| text: `${errorMessage}. Aborting operation.`, | ||
| }, | ||
| Date.now(), | ||
| ); | ||
|
|
||
| // Inform the model that the tool calls failed due to a non-recoverable error. | ||
| const toolRegistry = config.getToolRegistry(); | ||
| const erroredToolCalls: TrackedToolCall[] = toolCallRequests.map( | ||
| (request): TrackedCompletedToolCall => { | ||
| const tool = toolRegistry.getTool(request.name); | ||
| const error = new Error( | ||
| 'Tool call aborted due to a critical, non-recoverable file snapshot failure.', | ||
| ); | ||
|
|
||
| return { | ||
| request, | ||
| status: 'error', | ||
| response: { | ||
| callId: request.callId, | ||
| resultDisplay: error.message, | ||
| error, | ||
| errorType: ToolErrorType.UNKNOWN, | ||
| responseParts: [ | ||
| { | ||
| functionResponse: { | ||
| name: request.name, | ||
| response: { error: error.message }, | ||
| }, | ||
| }, | ||
| ], | ||
| }, | ||
| responseSubmittedToGemini: false, | ||
| tool, | ||
| }; | ||
| }, | ||
| ); | ||
|
|
||
| void handleCompletedToolsRef.current!(erroredToolCalls); | ||
| setSnapshottingMessage(null); | ||
| return StreamProcessingStatus.Completed; // Stop processing | ||
| } | ||
| if (!commitHash) { | ||
| const errorMessage = | ||
| 'Failed to get a commit hash for the file snapshot. Aborting operation.'; | ||
| addItem( | ||
| { | ||
| type: MessageType.ERROR, | ||
| text: errorMessage, | ||
| }, | ||
| Date.now(), | ||
| ); | ||
|
|
||
| // Inform the model that the tool calls failed. | ||
| const toolRegistry = config.getToolRegistry(); | ||
| const erroredToolCalls: TrackedToolCall[] = toolCallRequests.map( | ||
| (request): TrackedCompletedToolCall => { | ||
| const tool = toolRegistry.getTool(request.name); | ||
| const error = new Error(errorMessage); | ||
| return { | ||
| request, | ||
| status: 'error', | ||
| response: { | ||
| callId: request.callId, | ||
| resultDisplay: error.message, | ||
| error, | ||
| errorType: ToolErrorType.UNKNOWN, | ||
| responseParts: [ | ||
| { | ||
| functionResponse: { | ||
| name: request.name, | ||
| response: { error: error.message }, | ||
| }, | ||
| }, | ||
| ], | ||
| }, | ||
| responseSubmittedToGemini: false, | ||
| tool, | ||
| }; | ||
| }, | ||
| ); | ||
| void handleCompletedToolsRef.current!(erroredToolCalls); | ||
| setSnapshottingMessage(null); | ||
| cancelOngoingRequest(); | ||
| return StreamProcessingStatus.Completed; | ||
| } |
There was a problem hiding this comment.
There's significant code duplication in the error handling for snapshot creation within the catch block (lines 910-957) and the if (!commitHash) block (lines 958-1001). Both blocks build a list of erroredToolCalls and notify the model. This logic can be extracted into a helper function to improve readability and maintainability.
Additionally, there's an inconsistency in how failures are handled. The if (!commitHash) block calls cancelOngoingRequest(), while the catch block does not. Calling cancelOngoingRequest() will set turnCancelledRef.current to true, which will likely prevent handleCompletedTools from submitting the tool failure responses to Gemini. This seems unintentional, as informing Gemini of the failure allows it to react gracefully. Both failure paths should probably handle cancellation consistently.
I suggest refactoring this into a helper function and removing the cancelOngoingRequest() call to ensure consistent error reporting to the model.
const handleSnapshotFailure = (
uiErrorMessage: string,
modelErrorMessage: string,
) => {
addItem(
{
type: MessageType.ERROR,
text: `${uiErrorMessage}. Aborting operation.`,
},
Date.now(),
);
const toolRegistry = config.getToolRegistry();
const error = new Error(modelErrorMessage);
const erroredToolCalls: TrackedToolCall[] = toolCallRequests.map(
(request): TrackedCompletedToolCall => {
const tool = toolRegistry.getTool(request.name);
return {
request,
status: 'error',
response: {
callId: request.callId,
resultDisplay: error.message,
error,
errorType: ToolErrorType.UNKNOWN,
responseParts: [
{
functionResponse: {
name: request.name,
response: { error: error.message },
},
},
],
},
responseSubmittedToGemini: false,
tool,
};
},
);
void handleCompletedToolsRef.current!(erroredToolCalls);
setSnapshottingMessage(null);
};
let commitHash: string | undefined;
try {
if (!gitService) {
throw new Error(
'Checkpointing is enabled but Git service is not available. Ensure Git is installed.',
);
}
commitHash = await gitService.createFileSnapshot(
`Snapshot for ${restorableToolCalls
.map((t) => t.name)
.join(', ')}`,
);
} catch (error) {
const errorMessage = `Failed to create file snapshot: ${getErrorMessage(
error,
)}`;
handleSnapshotFailure(
errorMessage,
'Tool call aborted due to a critical, non-recoverable file snapshot failure.',
);
return StreamProcessingStatus.Completed; // Stop processing
}
if (!commitHash) {
const errorMessage =
'Failed to get a commit hash for the file snapshot';
handleSnapshotFailure(errorMessage, errorMessage);
return StreamProcessingStatus.Completed;
}|
/gemini review |
There was a problem hiding this comment.
Code Review
This pull request makes checkpointing synchronous and fail-fast, which is a significant improvement for reliability. The core logic change in useGeminiStream.ts correctly moves snapshotting into the stream processing flow, blocking tool execution on failure. The new tests for these failure scenarios are also a valuable addition. I have identified one critical issue regarding type safety when creating synthetic error responses for tool calls, which could lead to runtime errors and should be addressed.
|
I think gemini is hallucinating in its feedback. Here is an analysis from gemini-cli: Here's my verification of your feedback:
Conclusion: |
|
Hi @vishvananda, thank you so much for your contribution to Gemini CLI! We really appreciate the time and effort you've put into this. We're making some updates to our contribution process to improve how we track and review changes. Please take a moment to review our recent discussion post: Improving Our Contribution Process & Introducing New Guidelines. Key Update: Starting January 26, 2026, the Gemini CLI project will require all pull requests to be associated with an existing issue. Any pull requests not linked to an issue by that date will be automatically closed. Thank you for your understanding and for being a part of our community! |
|
@bdmorgan thank you for the comment. This PR is already associated with an existing issue: #13331. The PR is out of date at the moment. I haven't made any effort to update it yet because I haven't heard any feedback from the team on whether this is an acceptable approach, or if the team wants to handle it differently. |
|
Thank you for submission to the Gemini CLI project. At this time, we are closing this pull request in order to allow us to better triage and support more recent pull requests against the latest code changes. If you feel like this pull request is a critical contribution to the Gemini CLI project, please associate the pull request with an existing GitHub issue (instructions here: https://docs.github.com/en/issues/tracking-your-work-with-issues/using-issues/linking-a-pull-request-to-an-issue) before reopening. After Monday January 26 2026, any pull requests submitted by contributors without an associated issue will be automatically closed (more information here: #16706). If you do choose to reopen and submit this pull request, please ensure you rebase your changes onto the current main branch before resubmitting. This will help avoid merge conflicts and ensure your contribution is compatible with the latest codebase. |
Summary
If the git commit for checkpointing fails, it leaves the shadow repository in a broken state. Even when it doesn't fail, it can take some time to complete on a large repository, causing future checkpoint commands to silently fail. This commit updates the checkpointing to happen synchronously and fail there is something wrong with the git commit.
Details
I'm not a typescript expert, so I had AI help with the code and tests.
Related Issues
Partial for #13331
How to Validate
You should see a message that it is checkpointing before file altering commands run.
Pre-Merge Checklist