feat: Add voice input with pluggable backend (Gemini zero-install + Whisper)#18499
feat: Add voice input with pluggable backend (Gemini zero-install + Whisper)#18499fayerman-source wants to merge 78 commits intogoogle-gemini:mainfrom
Conversation
Summary of ChangesHello @fayerman-source, I'm Gemini Code Assist1! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed! This pull request significantly enhances the Gemini CLI by integrating a native voice input capability. This feature aims to streamline user interaction by enabling natural speech-to-text conversion, making the CLI more accessible and efficient. The implementation focuses on a robust, event-driven design to maintain performance and provides clear feedback to the user throughout the voice input process. Highlights
Changelog
Activity
Using Gemini Code AssistThe full guide for Gemini Code Assist can be found on our documentation page, here are some quick tips. Invoking Gemini You can request assistance from Gemini at any point by creating a comment using either
Customization To customize Gemini Code Assist for GitHub experience, repository maintainers can create a configuration file and/or provide a custom code review style guide (such as PEP-8 for Python) by creating and adding files to a Limitations & Feedback Gemini Code Assist may make mistakes. Please leave feedback on any instances where its feedback is incorrect or counter productive. You can react with 👍 and 👎 on @gemini-code-assist comments. If you're interested in giving your feedback about your experience with Gemini Code Assist for Github and other Google products, sign up here. You can also get AI-powered code generation, chat, as well as code reviews directly in the IDE at no cost with the Gemini Code Assist IDE Extension. Footnotes
|
There was a problem hiding this comment.
Code Review
This pull request introduces a fantastic new feature for native voice input using Whisper. The implementation is well-structured, leveraging an event-based architecture to prevent UI re-render loops, and includes comprehensive documentation and tests. However, it introduces critical Command Injection vulnerabilities by using child_process.exec with user-controlled settings and unvalidated file paths, specifically in the handling of the user-configurable whisperPath setting. This poses a high security risk, potentially leading to Remote Code Execution (RCE), and must be addressed before merging.
| const runWhisper = async (binary: string, args: string) => { | ||
| // If it looks like an absolute path, verify existence directly | ||
| // This avoids 'which' issues with PATH | ||
| if (binary.startsWith('/') || binary.startsWith('.')) { | ||
| // Check for common configuration error: path with incorrect quotes | ||
| if (binary.includes("'") || binary.includes('"')) { | ||
| const sanitized = binary.replace(/['"]/g, ''); | ||
| if (sanitized !== binary) { | ||
| try { | ||
| await access(sanitized); | ||
| if (!sanitizedPathLoggedRef.current) { | ||
| debugLogger.log( | ||
| 'useVoiceInput: found sanitized path, using it instead', | ||
| sanitized, | ||
| ); | ||
| sanitizedPathLoggedRef.current = true; | ||
| } | ||
| binary = sanitized; | ||
| } catch { | ||
| // Sanitized path also doesn't exist, proceed with original to let it fail or be logged | ||
| } | ||
| } | ||
| } | ||
|
|
||
| try { | ||
| await access(binary); | ||
| } catch { | ||
| // If access fails, we'll fall through to try executing it (or it might fail there) | ||
| // but checking 'which' on an absolute path is redundant/incorrect. | ||
| debugLogger.log( | ||
| 'useVoiceInput: explicit path access check failed, but will try execution', | ||
| binary, | ||
| ); | ||
| } | ||
| } else { | ||
| // For command names, try to find them first | ||
| try { | ||
| await execAsync(`which ${binary}`); | ||
| } catch { | ||
| // If which fails, we might still try running it if we suspect it's in the PATH but not found by `sh -c which` | ||
| // (e.g. some obscure shell setup). But usually `which` failure is authoritative. | ||
| // However, for user convenience, let's allow proceeding if it's the configured path. | ||
| if (binary === config?.whisperPath) { | ||
| debugLogger.log( | ||
| 'useVoiceInput: `which` failed for configured path, but proceeding anyway', | ||
| ); | ||
| } else { | ||
| throw new Error(`Command not found: ${binary}`); | ||
| } | ||
| } | ||
| } | ||
|
|
||
| await execAsync(`${binary} ${args}`); | ||
| }; |
There was a problem hiding this comment.
The runWhisper function uses exec (via execAsync) to execute a command constructed from the user-controlled whisperPath setting. This creates a critical Command Injection vulnerability. The binary variable, derived from the user-configurable voice.whisperPath setting, is directly interpolated into a shell command. An attacker could provide a malicious path like "/path/to/whisper; malicious_command" in the settings to achieve Remote Code Execution (RCE) on the user's machine. The which ${binary} check on line 296 is also vulnerable to injection.
Remediation: To mitigate this, replace child_process.exec (or execAsync) with child_process.spawn and pass arguments as an array. This prevents shell interpretation of user-controlled input. If the whisperPath setting needs to support arguments, parse the string into an executable and an array of arguments safely.
There was a problem hiding this comment.
Fixed in ce4b23c and 1df335c. All exec() calls have been replaced with execFile() using argument arrays — no shell is spawned, so no shell interpretation of user input. Additionally, validateBinaryPath() rejects any path containing shell metacharacters. The which check also uses execFile('which', [cmd]) now.
| await execAsync( | ||
| `whisper-faster "${audioFile}" --model tiny --output_format txt --output_dir "${tempDirRef.current}"`, | ||
| ); | ||
| // Read the transcript file |
There was a problem hiding this comment.
The fallback logic for whisper-faster uses exec with concatenated arguments, including file paths. If the temporary directory path or the filename contains shell metacharacters, it could lead to command injection.
Remediation: Use child_process.spawn with an arguments array to safely execute external commands.
There was a problem hiding this comment.
Fixed in the same commits. whisper-faster is now invoked via execFileAsync('whisper-faster', [audioFile, '--model', 'tiny', ...]) — arguments passed as an array, no string concatenation.
| await execAsync('which whisper'); | ||
| await execAsync( | ||
| `whisper "${audioFile}" --model tiny --output_format txt --output_dir "${tempDirRef.current}"`, | ||
| ); |
There was a problem hiding this comment.
Same fix — whisper fallback now uses execFileAsync('whisper', [audioFile, '--model', 'tiny', ...]) with an argument array.
🔒 Security Fix AppliedFixed the command injection vulnerability identified by @gemini-code-assist: Changes:
Before (Vulnerable):After (Secure):The fix follows Node.js security best practices (2024) and prevents RCE even with malicious settings. |
92776cf to
1df335c
Compare
|
/gemini review |
There was a problem hiding this comment.
Code Review
This pull request introduces a significant and well-architected voice input feature. The use of an event-based system for transcript delivery is a great choice to prevent performance issues, and the implementation shows good consideration for security and cross-platform support. I've identified two high-severity issues in the core useVoiceInput hook concerning platform portability and a potential race condition that should be addressed to ensure the feature's robustness. Overall, this is a strong contribution.
6bd08ba to
c6c7329
Compare
🚀 Voice Input Enhancements & FixesI've updated this PR with several improvements to stability, configuration, and developer experience: 🛠 Changes
✅ Validation
These changes make the feature more 'production-ready' by aligning with the CLI's existing command patterns. |
|
/gemini review |
There was a problem hiding this comment.
Code Review
This pull request introduces a native voice input feature using Whisper, which is a great addition to the CLI's accessibility and usability. The implementation is well-structured, using an event-based architecture to prevent performance issues and including comprehensive tests for the new functionality. The code demonstrates good security practices by validating user-provided paths and using execFile to prevent command injection.
I've identified two high-severity issues in the useVoiceInput hook that should be addressed: a resource leak due to temporary directories not being cleaned up, and a portability issue with the command existence check that will likely cause the feature to fail on Windows. My detailed comments provide suggestions for fixing these issues.
| void unlink(audioFileRef.current).catch(() => {}); | ||
| } | ||
| if (tempDirRef.current) { | ||
| // Clean up temp dir |
There was a problem hiding this comment.
The temporary directory created by mkdtemp for storing audio recordings is not being cleaned up. The useEffect cleanup hook has a // Clean up temp dir comment but no implementation. This will lead to an accumulation of gemini-voice-* directories in the system's temporary folder, causing a resource leak.
While this cleanup hook should be fixed for robustness on unmount, the primary cleanup should happen at the end of the stopRecording function, after the audio and transcript files have been processed.
Please add logic to remove the temporary directory in both places. You'll need to import rm from node:fs/promises.
| // Clean up temp dir | |
| void rm(tempDirRef.current, { recursive: true, force: true }); |
References
- Use the
node:prefix when importing built-in Node.js modules for consistency across the codebase. - Use asynchronous file system operations (e.g.,
fs.promises.readFile) instead of synchronous ones (e.g.,fs.readFileSync) to avoid blocking the event loop.
| const commandExists = (cmd: string): Promise<boolean> => | ||
| new Promise((resolve) => { | ||
| execFile('which', [cmd], (error) => { | ||
| resolve(!error); | ||
| }); | ||
| }); |
There was a problem hiding this comment.
The commandExists helper function uses the which command to check for the existence of sox and arecord. The which command is specific to Unix-like systems and is not available on Windows by default. This will cause the voice input feature to fail on Windows, as it won't be able to find the necessary recording tools.
To ensure cross-platform compatibility, please use a method that works on all supported operating systems. A simple solution is to use where on Windows and which on other platforms by checking process.platform.
| const commandExists = (cmd: string): Promise<boolean> => | |
| new Promise((resolve) => { | |
| execFile('which', [cmd], (error) => { | |
| resolve(!error); | |
| }); | |
| }); | |
| const commandExists = (cmd: string): Promise<boolean> => | |
| new Promise((resolve) => { | |
| const checkCmd = process.platform === 'win32' ? 'where' : 'which'; | |
| execFile(checkCmd, [cmd], (error) => { | |
| resolve(!error); | |
| }); | |
| }); |
|
🛠 Feedback Addressed I've implemented the high-severity fixes identified in the latest review:
All |
|
Thank you for the detailed update, @fayerman-source! I appreciate you addressing the feedback from the previous review. The implemented changes, especially the security fix for command injection by using These updates look great and contribute significantly to making the voice input feature production-ready. Excellent work! |
|
@jacob314 @scidomino Checking in on this. Is there anything else we need before merging? |
| this.recordingProcess.stdout?.on('data', (chunk: Buffer) => { | ||
| if (!this.isStopping) { | ||
| this.audioChunks.push(chunk); | ||
| } | ||
| }); |
There was a problem hiding this comment.
I think the final audio chunk may be getting dropped because of this guard. Here's why — in the stop() method, this.isStopping is set to true at line 163 synchronously, before SIGTERM is sent at line 177 and before the event loop is yielded at line 182 (await closePromise). When sox/arecord receives SIGTERM, it flushes its remaining internal buffer to stdout — but by the time that final data event reaches the listener, isStopping is already true, so the chunk is silently discarded.
Suggested fix
Remove the !this.isStopping guard from the data listener at line 96:
this.recordingProcess.stdout?.on('data', (chunk: Buffer) => {
this.audioChunks.push(chunk);
});This is safe because:
- In
stop(),await closePromiseatline 182waits for the'close'event. Node.js guarantees all'data'events fire before'close', so by the time we reachBuffer.concat(this.audioChunks)atline 184the array already has every chunk including the final flush. - In
cancel(),this.audioChunks = []atline 148wipes the array anyway, so any stray chunks afterSIGTERMare harmless.
Let me know what you think.
|
The PR currently has a bunch of files that should not be included in it. Please take a look. I'll review once the diff is back down to the appropriate size rather than 10K+ lines due to files that should not be included. |
|
@jacob314 Good deal. Got the accidental artifacts removed, so the diff should be back down now. The most recent comment from @mrpmohiburrahman about the final audio chunk being dropped was on point too. Love the community spirit. Got that addressed and added a focused regression test. Thanks a bunch, all. |
# Conflicts: # packages/cli/src/ui/components/InputPrompt.tsx
# Conflicts: # packages/cli/src/ui/components/InputPrompt.tsx
79d3c44 to
c8ffca7
Compare
# Conflicts: # packages/cli/src/nonInteractiveCli.ts # packages/cli/src/ui/AppContainer.tsx # packages/cli/src/ui/components/InputPrompt.tsx # packages/cli/src/ui/components/__snapshots__/InputPrompt.test.tsx.snap # packages/core/src/index.ts
Summary
This PR adds native voice input to Gemini CLI with a pluggable backend architecture:
geminias the default zero-install transcription backendwhispersupport for users who want a local binary workflowThis addresses the request for a zero-install option while keeping a local backend available for users who prefer it.
Update: Gemini Backend Added (addresses @jacob314's feedback)
Per @jacob314's comment requesting a zero-install option using the Gemini API, this PR adds a
GeminiRestBackendthat:soxorarecordgenerateContentusing the user's existing Gemini authFeatures
Space Spaceon an empty prompt to start or stop recordingvoice.provider: "gemini"(default) or"whisper"(local)●while recording, spinner while transcribingvoice.silenceThresholddrops silent recordings before transcription/voice,/voice enable,/voice disable, `/voice provider [geminiSettings
{ "voice": { "enabled": true, // default: false "provider": "gemini", // default: "gemini", or "whisper" "silenceThreshold": 80, // default: 80, 0 disables silence filtering "whisperPath": "/usr/bin/whisper" // optional, only used for whisper } }Architecture
Key Design Decisions
coreEventsto avoid React re-render cascadesconfig.getContentGenerator()for OAuth or API key flowsisTogglingRefto prevent overlapping togglesvoice.silenceThresholdbefore transcription work is done●while recording and a spinner while transcribingAudio Pipeline
Gemini Backend
soxorarecordaudio/wavtogenerateContentWhisper Backend
soxorarecordvoice.silenceThresholdCommands
/voice/voice enable/voice disable/voice provider [gemini|whisper]/voice sensitivity <0-1000>/voice set-path <path>/voice helpValidation
Manual Validation
Space Spacestart/stop behaviorEsccancels active recordinggeminibackendwhisperbackendAutomated Validation
npm run build -w @google/gemini-cli-corenpm run typechecknpm test -w @google/gemini-cli -- src/nonInteractiveCli.test.tsnpm test -w @google/gemini-cli-core -- src/services/voice/LocalWhisperBackend.test.tsnpm test -w @google/gemini-cli -- --coverage.enabled=false src/nonInteractiveCli.test.ts src/ui/components/InputPrompt.test.tsxnpm test -w @google/gemini-cli -- src/ui/hooks/useVoiceInput.test.ts src/ui/hooks/useVoiceInput.replication.test.tsx src/ui/hooks/useVoiceInput.stress.test.ts src/ui/hooks/useVoiceInput.log-volume.test.tsnpm run buildFollow-up Work
/settingsTUI integration for voice configurationCloses #18067