Switch to UTF-8 input on Windows #68

@alexrp

Get rid of this whole workaround:

// The Windows console host is eventually going to support UTF-8 input via the ReadFile function. Sadly, this
// does not work today; non-ASCII characters just turn into NULs. This means that we have to use the
// ReadConsoleW function for interactive input and ReadFile for redirected input. This complicates the
// interactive case considerably since ReadConsoleW operates in terms of UTF-16 code units while the API we
// offer operates in terms of raw bytes.
//
// To solve this problem, we read one or two UTF-16 code units to form a complete code point. We then encode
// that into UTF-8 in a separate buffer. Finally, we copy as many bytes as possible/requested from the UTF-8
// buffer to the caller-provided buffer.
using (_semaphore.Enter(cancellationToken))
{
    if (_buffered.IsEmpty)
    {
        var units = (stackalloc char[2]);
        var chars = 0;
        fixed (char* p = &MemoryMarshal.GetReference(units))
        {
            bool ret;
            var read = 0u;
            while ((ret = ReadConsoleW(Handle, p, 1, out read, null)) &&
                   Marshal.GetLastPInvokeError() == (int)WIN32_ERROR.ERROR_OPERATION_ABORTED &&
                   read == 0)
            {
                // Retry in case we get interrupted by a signal.
            }
            if (!ret)
                WindowsTerminalUtility.ThrowIfUnexpected($"Could not read from {Name}");
            if (read == 0)
                return 0;
            // There is a bug where ReadConsoleW will not process Ctrl-Z properly even though ReadFile will. The
            // good news is that we can fairly easily emulate what the console host should be doing by just
            // pretending that there is no more data to be read.
            if (!Terminal.IsRawMode && *p == '\x1a')
                return 0;
            chars++;
            // If we got a high surrogate, we expect to instantly see a low surrogate following it. In really
            // bizarre situations (e.g. broken WriteConsoleInput calls), this might not be the case though; in
            // such a case, we will just let UTF8Encoding encode the lone high surrogate into a replacement
            // character (U+FFFD).
            //
            // It is not really clear whether this is the right thing to do. A case could easily be made for
            // passing the lone surrogate through unmodified or simply discarding it...
            if (char.IsHighSurrogate(*p))
            {
                while ((ret = ReadConsoleW(Handle, p + 1, 1, out read, null)) &&
                       Marshal.GetLastPInvokeError() == (int)WIN32_ERROR.ERROR_OPERATION_ABORTED &&
                       read == 0)
                {
                    // Retry in case we get interrupted by a signal.
                }
                if (read != 0)
                    chars++;
                else if (!ret)
                    WindowsTerminalUtility.ThrowIfUnexpected($"Could not read from {Name}");
            }
            // Encode the UTF-16 code unit(s) into UTF-8 and grab a slice of the buffer corresponding to just
            // the portion used.
            _buffered = _buffer.AsMemory(..Cathode.Terminal.Encoding.GetBytes(units[..chars], _buffer));
        }
    }
    // Now that we have some UTF-8 text buffered up, we can copy it over to the buffer provided by the caller
    // and adjust our UTF-8 buffer accordingly. Be careful not to overrun either buffer.
    var copied = Math.Min(_buffered.Length, buffer.Length);
    _buffered.Span[..copied].CopyTo(buffer[..copied]);
    _buffered = _buffered[copied..];
    return copied;
}

There should be no functional change; this is just cleanup, with maybe some slight performance gains at best.
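
To check the "no functional change" part, it might help to pin down what bytes the current workaround produces for a given run of UTF-16 code units, so that a UTF-8 `ReadFile` path can be compared against it later. Below is a rough sketch of that, not code from the repository: `Utf16ToUtf8Sketch` and `Encode` are hypothetical names, the input is an in-memory span rather than `ReadConsoleW`, and it assumes `Cathode.Terminal.Encoding` behaves like a BOM-less `UTF8Encoding` with the default replacement fallback. Ctrl-Z handling and partial copies into the caller's buffer are left out.

```csharp
using System;
using System.IO;
using System.Text;

// Hypothetical helper, not part of Cathode. It mirrors the workaround's pairing
// and encoding steps over an in-memory span instead of calling ReadConsoleW.
public static class Utf16ToUtf8Sketch
{
    // Assumption: a BOM-less UTF8Encoding with the default replacement fallback,
    // standing in for Cathode.Terminal.Encoding.
    private static readonly UTF8Encoding Utf8 = new UTF8Encoding(false);

    public static byte[] Encode(ReadOnlySpan<char> units)
    {
        using var output = new MemoryStream();

        // Two code units encode to at most 6 bytes (U+FFFD plus a 3-byte character).
        Span<byte> scratch = stackalloc byte[8];

        var i = 0;
        while (i < units.Length)
        {
            // Take one code unit, plus whatever follows it if it is a high surrogate.
            var count = (char.IsHighSurrogate(units[i]) && i + 1 < units.Length) ? 2 : 1;

            // A lone high surrogate is turned into U+FFFD by the replacement fallback.
            var written = Utf8.GetBytes(units.Slice(i, count), scratch);
            output.Write(scratch.Slice(0, written));
            i += count;
        }

        return output.ToArray();
    }

    public static void Main()
    {
        Console.WriteLine(Convert.ToHexString(Encode("\u00E9")));       // C3A9
        Console.WriteLine(Convert.ToHexString(Encode("\uD83D\uDE00"))); // F09F9880
        Console.WriteLine(Convert.ToHexString(Encode("\uD83D")));       // EFBFBD
        Console.WriteLine(Convert.ToHexString(Encode("\uD83DA")));      // EFBFBD41
    }
}
```

The last two cases are the ones the comment in the workaround is unsure about. If the console host's UTF-8 path passes lone surrogates through differently, that would be a visible behavior change rather than pure cleanup.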

Labels: area: drivers (issues related to the terminal drivers), os: windows (issues that are specific to Windows 10, 11, etc.)
