-
Notifications
You must be signed in to change notification settings - Fork 286
Handle writes to non-blocking pty instead of wrapping via Node's tty api #831
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
src/unixTerminal.ts
Outdated
| // handling in net.Socket/tty.*Stream which can result in data loss with | ||
| // ~1-4kb of data. | ||
| // Context: https://github.com/microsoft/vscode/issues/283056 | ||
| fs.write(this.fd, data, (err, written) => { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I suggest the original approach of one stream because:
fs.writemay not write all the data passed to it because the underlying system call doesn't guarantee all bytes are written. To usefs.writehere, it would need to do something likedata = data.slice(Buffer.byteLength(data) - written)and write again until data.length is emptyfs.writeenqueues it to a threadpool (or io_uring) which means multiple write() calls could happen at the same time and in an interleaving order. To use fs.write here, this function would need to enqueue each write call and drain one after another
I think the question is why did it stop writing the final chunk? If I was debugging this, I would try to reproduce it and then run either strace node <file> or perf trace node <file> and compare the logs in the success scenario versus the failure scenario. Is it missing a write at the end? Did the write happen after the process already exited?
Since this native module is handling process spawning, I would also closely compare the libuv implementation of process spawning with this one.
My suspicion is that this is an edgecase in node:stream where when one end of the socket is shutdown, it caused the other end to stop sending data (sort of like allowHalfOpen)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Didn't actually see this until just now after pushing. It's hinted to in the code, but note also there's a hang issue when writing on macOS specifically that can occur that is closely related to all this stuff (microsoft/vscode#246204).
I have a better implementation now that's pretty close to what you were suggesting. It's using fs.write directly and handles partial writes and EAGAIN. Here are some details I found which made it particularly confusing to work through:
fs.WriteStreamreturnsERR_SYSTEM_ERRORwhich obfuscated the actual underlyingEAGAINso I couldn't handle it,fs.writewas needed for this.- The
drainevent never seems to happen fortty.WriteStream, it's implemented inWritablebut nottty.WriteStream. I couldn't find why though, this is what an LLM said:- TTYs don’t apply meaningful backpressure
- Writes are synchronous or minimally buffered
- The stream buffer never fills
- Therefore, drain is never needed
tty.WriteStreamhas a habit of locking up the process completely when callingwritewith too much data. @lhecker investigated the hang earlier today and found it would always get stuck inuv__writeRemove ugly throttling mechanism when pasting to the terminal vscode#283056 (comment). Decided not to investigate further and usefs.writeas it appears simpler and seems to accomplish the goals.
|
Tests are failing as it currently breaks encoding handling |
This reverts commit 6bb38cc. errno.EAGAIN is coming through as 35 (positive) on macOS, not -35.
deepak1556
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thank you @Tyriar 👏
Muhtasham
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM
|
Thanks for working on this, my friend! First time I see Microsoft responding to a community request. You are doing the good part here! |
|
Thanks for this, and sorry that you were bullied for it. |
|
@Tyriar is there a simple way to repro this issue just using node-pty on linux and macos? |
|
@billywhizz it's more involved that is ideal as this codebase was always a bit scrappy since originally getting forked. I was testing this via the xterm.js demo with the below steps: Clone and build node-pty: git clone https://github.com/microsoft/node-pty
cd node-pty
npm ci # deps and native build
npm run watch # ts buildClone and build xterm.js: git clone https://github.com/xtermjs/xterm.js
cd xterm.js
npm ci
# ctrl+shift+b in VS Code to launch these, or:
npm run tsc-watch # build ts
npm run esbuild-watch # bundle ts/js
npm run esbuild-demo # build demo/server
npm run start # run serverSymlink node-pty into xterm.js deps: ln -s <path_to>/node-pty <path_to>/xterm.js/node_modules/node-ptyStart server in xterm.js (kill and restart for any changes): npm run start |
Includes key change microsoft/node-pty#831 and follow up microsoft/node-pty#832 Fixes #246204 Fixes #283056
|
|
||
| // Requeue any partial writes | ||
| if (written < data.length) { | ||
| this._writeQueue.unshift(data.slice(written)); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I just realized that this slice call is incorrect. fs.write returns the amount of bytes written, not code points, doesn't it?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
yes, it does. this should probably be using TextEncoder if data is a string and then always operating on buffers?
Disclaimer: A good deal of this is brand new to me, forgive me if I misuse terminology 🙏
First some background, there's 2 parts to this. First the throttling mechanism that is mentioned in microsoft/vscode#283056 in order to fix a data corruption bug microsoft/vscode#38137. The bug was that pasting (or sending via the extension API) "large amounts" of data (1-4kb, but it depends on the OS - macOS seems to be 1kb, so 1024 ascii chars in UTF-8) into the terminal would interleave and/or drop data, in other words pasting was simply broken before if you wanted to paste over 1024 characters. The solution, while crude, did fix that issue at the cost of much slower pasting as it was being throttled on the JS side.
The second part of this is that a more recent issue started happening where similar repro steps on macOS only would hang the process and therefore take down the pty host in VS Code (microsoft/vscode#246204). This is commonly seen when using Copilot as it sometimes suggests large Python scripts to run. This was an existing bug that looks like it had the same root cause. It likely started happening as a result of either Node.js upgrading in VS Code or a refactor that moved us over to using
tty.ReadStreaminstead of the original monkey patch hack that was used in pty.js before forking this project to make sure node treated our socket specially as a TTY. This is code I admittedly never understood and was basically too scared to touch.For the hang issue first, in
mainas a result of thatttyrefactor we were usingtty.ReadStreamto write to. Despite setting the FD explicitly to non-blocking here:node-pty/src/unix/pty.cc
Lines 375 to 377 in 826032c
Node.js would attempt to read it as blocking here:
https://github.com/libuv/libuv/blob/ada5131840a88b4ff6df8d047116899e4192e934/src/unix/tty.c#L177-L192
This caused event loop starvation in libuv here:
https://github.com/libuv/libuv/blob/ada5131840a88b4ff6df8d047116899e4192e934/src/unix/stream.c#L879-L881
When experimenting with
tty.WriteStream, Node.js does mark the fd correctly for blocking: https://github.com/nodejs/node/blob/4f24aff94ad9160bceaf9dcc9cf5235a65f01029/lib/tty.js#L106-L111However this ended up causing libuv again to hang in the sync path, here this time:
https://github.com/libuv/libuv/blob/ada5131840a88b4ff6df8d047116899e4192e934/src/unix/stream.c#L1420-L1435
Trying to workaround this by switching to
fs.WriteStreamand splitting the data into chunks to ensure that we never attempted to write more than the kernel buffer size, I found that would end up throwingERR_SYSTEM_ERRORin that case which was masking the underlyingEAGAINand therefore not letting us handle it.Moving to the raw
fs.write(fd, ...)allowed avoiding this problem where we could handleEAGAINdirectly. In addition to this, partial writes now needed to be handled since we're no longer using Node's stream abstractions.As for the interleaving/dropped data issue, this happened as a result of filling the kernel buffer. The natural suggestion, is after an
EAGAINhappens to listen fordrain, but it turns out that Node.js streams do not emitdrainspecifically for tty sockets and it's not clear to me why exactly. That leaves us in the unfortunate position of needing to use an arbitrary timeout still which is whysetTimeoutof 5 is used when handlingEAGAIN.For each non-
EAGAINiteration in the write queue processing method I attempted to separate them viasetImmediate, but found this would result in the overlapped data which I assume is due to the same reason Node.js tries to mark it as blocking as in this comment:Which points at nodejs/node#1771 (comment):
Which I interpret as non-blocking (which we want to keep to avoid slowing the processing for other ptys) is only possible by giving it some time to breathe since the I/O is async and queued to write via the libuv threadpool. In other words, this happens when there are multiple distinct chunks of data being written in quick succession, which isn't actually a problem in practice since writes (eg. via paste) are always separated by some natural delay which is more than enough time to process the write in the threadpool. This would only be a potential performance problem if we were calling write many times at once. So the
setTimeout5 is also used in the non-EAGAINcase when more data needs to be processed.Thanks @lhecker and @deepak1556 for helping out a bunch with the investigation ❤️
Part of microsoft/vscode#283056
Part of microsoft/vscode#246204
Tests below show node-pty running in xterm.js (which does not have the throttling mentioned in microsoft/vscode#283056),
echo "is typed then 400 lines of text is pasted, the quote is closed and run. This test makes it obvious when there's any interleaving or unordered data by scrolling through the result of the echo command, without needing to worry about wrapping problems this version of bash can have in the input section.Pasted string:
Details
Before (xterm.js server proc hangs due to microsoft/vscode#246204):
After (tested on 1674fdf):