Skip to content

Conversation

@jackyzha0
Copy link
Member

@jackyzha0 jackyzha0 commented Oct 24, 2024

we noticed a few cases where if a program exited really quickly but dumped a lot of output (e.g. env in the replit shell) we could sometimes miss output

we used to poll the controller side fd for POLLIN but that actually isn't fully correct.

the tty pipes look something like:

controlling program <> controller fd <> user fd <> user program
|---user space------||--------kernel space------||-user space-|

we can sometimes enter cases where the controller side thinks it has no more data to give to the controlling program (nodejs when using @replit/ruspty) but the kernel has decided to block on passing some user fd data to the controller fd (on linux for example, the pipe in the user fd -> controller fd direction has about 4kb of capacity)

for example, if node isnt processing data events quickly enough, the controller-side queue can fill up and the user fd write will block until it has space again, we could rarely enter a race if nodejs decides to read an entire 4kb block completely emptying the controller fd queue and then the POLLIN logic could return with no more data left (even though the user side still has a pending write!)

this would drop data :(

a few changes:

  • rust-side
    • poll TIOCINQ and TIOCOUTQ on controller and user instead of just POLLIN on controller
  • wrapper level
    • trigger 'end' event on the read stream on EIO instead of just calling the cb (that way, other things with handles to the socket can also handle end appropriately)
    • exit condition is now
      • fd closed && fd fully read && program actually exited
  • test level
    • bump repeats from 50 -> 500
    • rewrite it to a more standard async format so errors during the test actually are associated with that test
    • added a test for fast output (>4kb output and fast exit)

@jackyzha0 jackyzha0 changed the title bump repeats, add debug proper canonical mode flags, bump repeats, add debug Oct 24, 2024
@jackyzha0 jackyzha0 changed the title bump repeats, add debug poll for TIOCINQ and TIOCOUTQ instead of POLLIN, bump repeats Oct 28, 2024
@jackyzha0 jackyzha0 requested review from lhchavez October 28, 2024 17:24
Copy link

@blast-hardcheese blast-hardcheese left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

v. nice

@jackyzha0 jackyzha0 merged commit b81def4 into main Oct 28, 2024
@jackyzha0 jackyzha0 deleted the jackyzha0/test-robustness branch October 28, 2024 20:51
@lhchavez
Copy link
Contributor

image

jackyzha0 added a commit that referenced this pull request Sep 29, 2025
turns out #51 reduced the race rate
but did not eliminate it completely. we still see the data race bug very
rarely and it is exacerbated on high-numbers of repeats in the tests
and/or high nodejs event loop utilization

my old theory was:
- the user fd is O_NONBLOCK clear so any writes there should block until
its made its way into the 4kb intermediary buffer
- as a result, when the .wait returns, we know that the program output
has at least made it fully into the the user fd input buffer
- because we poll until `controller_inq == 0 && controller_outq == 0 &&
user_inq == 0 && user_outq == 0`, in theory we should never call the js
side exit code until theres truly nothing left in the pipe (i.e. the
readstream on the nodejs side has it in its own buffer)

HOWEVER, with some logging i saw a few cases where in fact
`controller_inq == 0 && controller_outq == 0 && user_inq == 0 &&
user_outq == 0`, the program had exited, yet the nodejs side still
missed some data

what i think is _actually_ happening is that
1. one of libuv's io threads reads the data from the controller side and
queues a data event on the stream
2. polling exits as now theres no more data in the queues we immediately
`drop(user_fd);` which [sets TTY_OTHER_CLOSED
synchronously](https://github.com/torvalds/linux/blob/4ff71af020ae59ae2d83b174646fc2ad9fcd4dc4/drivers/tty/pty.c#L66)
and we get an error event queued on the stream
3. if the nodejs event loop happens to read error before data, we emit
'end' and mark the data as read even though technically nodejs hasnt
processed the data event yet so we drop data :(

how we fix it:
1. axe poll_pty_fds_until_read
2. make Pty struct own user_fd and expose a method for the js side to
drop this fd when its done
3. when child.wait finishes, write a synthetic 'EOF' that is actually a
custom OSC terminal control sequence (`\x1B]7878\x1B\\`, 7878 is RUST on
the phonepad :)) to the user fd (a cursory search shows no results, it
seems _very_ unlikely for this sequence to appear randomly)
4. on the nodejs wrapper side, create a transform stream that parses out
the synthetic EOF and emits it as a custom event when it happens
5. when the nodejs side hits this EOF, we know we are actually at the
end of the data stream and can safely drop user_fd

node-pty had the [same
problem](microsoft/node-pty#72) and did the
:grug: brain thing and [added a wait
250ms](microsoft/vscode@9464b54
) so im calling it slightly more ok
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

4 participants