jsinspector: Support UTF-8 responses to CDP's IO.read #45426

robhogan · 2024-07-14T00:19:12Z

Summary:
The initial implementation of Network.loadNetworkResource and the accompanying IO.read (D54202854) base64-encodes all data as if it is binary. This is the more general case, and we'll continue to base64-encode non-text resources.

In the common case of text resources (particularly JS and JSON), it'd be preferable to do as Chrome does and send UTF-8 over the wire directly. This has a few performance benefits:

- Less CPU and RAM footprint on device (UTF-8 truncation is constant-time, fast, and in-place), and similarly less decoding for the frontend.
- 25% less data per chunk (base64 encodes 3 bytes as 4 characters), which implies up to 25% fewer network round trips for large resources.

It also has the benefit of being human-readable in the CDP protocol inspector.
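
As a rough illustration of the 25% figure, the sketch below compares the size of a raw chunk with its base64 encoding. The helper name `wire_sizes` and the 3000-byte stand-in payload are made up for this example, and it ignores JSON string escaping, which adds some overhead to the text form in practice.

```python
import base64


def wire_sizes(raw: bytes) -> tuple:
    """Return (bytes sent as UTF-8 text, characters sent as base64)."""
    return len(raw), len(base64.b64encode(raw))


raw = b"x" * 3000  # stand-in for one chunk of a JS bundle
text_len, b64_len = wire_sizes(raw)
saving = 1 - text_len / b64_len
print(text_len, b64_len, f"{saving:.0%}")  # prints "3000 4000 25%"
```

Put the other way round, for a fixed per-message size budget each text chunk can carry a third more payload than a base64 chunk.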

Determining whether data is text

We use exactly Chromium's heuristic for this (code pointers in comments), which is based only on the Content-Type header and assumes any text MIME type is UTF-8.
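
A minimal sketch of what a Content-Type-only check might look like. The prefix list here is an assumption for illustration only; the actual list is whatever the Chromium code referenced in the PR's comments uses, and `is_text_content_type` is a hypothetical name.

```python
from typing import Optional

# Hypothetical prefix list, NOT copied from Chromium or from this PR.
TEXT_MIME_PREFIXES = ("text/", "application/json", "application/javascript")


def is_text_content_type(content_type: Optional[str]) -> bool:
    """Decide text vs binary from the Content-Type header value alone."""
    if not content_type:
        return False  # no header: fall back to base64
    # Drop parameters such as "; charset=utf-8" and compare case-insensitively.
    mime = content_type.split(";", 1)[0].strip().lower()
    return mime.startswith(TEXT_MIME_PREFIXES)


print(is_text_content_type("text/plain; charset=utf-8"))
print(is_text_content_type("image/png"))
```

Note that nothing here looks at the body, which is why a mislabelled binary response can still reach the UTF-8 path.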

UTF-8 truncation

The slight implementation complexity here is that IO.read requests may specify a maximum number of bytes, and so we must slice a raw buffer up into valid UTF-8 sequences. This turns out to be fairly simple and cheap:

1. Naively truncate the buffer and inspect the last byte.
2. If the last byte has its topmost bit set to 0, it is ASCII (a single-byte code point) and we're done.
3. Otherwise, look back at most 3 bytes to find the first byte of the code point (topmost bits 11), counting the number of continuation bytes ("continuationBytes", topmost bits 10) at the end of our buffer. If we don't find a first byte within 3 bytes, the string isn't UTF-8, so throw.
4. Read the code point length, which is encoded in the first byte.
5. Resize to remove the last code point fragment, unless the code point terminates exactly at the end of our buffer.
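
The steps above can be sketched as follows. This is an illustrative Python version, not the implementation in this PR; the name `truncate_utf8` and the choice of `ValueError` are assumptions. Like the steps, it only inspects the final code point, so it does not validate the rest of the buffer.

```python
def truncate_utf8(buf: bytes, max_bytes: int) -> bytes:
    """Longest prefix of buf, at most max_bytes long, that ends on a code point boundary."""
    if max_bytes < 0:
        raise ValueError("max_bytes must be non-negative")
    chunk = buf[:max_bytes]  # step 1: naive truncation
    if not chunk or chunk[-1] < 0x80:
        return chunk  # step 2: empty, or ends in a single-byte (ASCII) code point

    # Step 3: count trailing continuation bytes (10xxxxxx), at most 3.
    continuation = 0
    while continuation < len(chunk) and (chunk[-1 - continuation] & 0xC0) == 0x80:
        continuation += 1
        if continuation > 3:
            raise ValueError("invalid UTF-8: more than 3 continuation bytes")
    if continuation == len(chunk):
        raise ValueError("invalid UTF-8: no first byte found")

    # Step 4: the first byte encodes the total length of the code point.
    lead = chunk[-1 - continuation]
    if lead & 0xE0 == 0xC0:
        length = 2
    elif lead & 0xF0 == 0xE0:
        length = 3
    elif lead & 0xF8 == 0xF0:
        length = 4
    else:
        raise ValueError("invalid UTF-8: bad first byte")

    # Step 5: keep a complete code point, drop an incomplete one.
    have = continuation + 1
    if have == length:
        return chunk
    if have > length:
        raise ValueError("invalid UTF-8: too many continuation bytes")
    return chunk[: len(chunk) - have]


data = "héllo€".encode("utf-8")  # 1 + 2 + 1 + 1 + 1 + 3 = 9 bytes
print(truncate_utf8(data, 2))  # b'h' (the 2-byte 'é' does not fit)
print(truncate_utf8(data, 8).decode("utf-8"))  # héllo (the 3-byte '€' does not fit)
```

One consequence of this sketch is that a limit smaller than the first code point yields an empty chunk; how the real implementation treats that case is not stated in the description.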

Edge cases + divergence from Chrome

Chrome's behaviour here in at least one case is questionable and we intentionally differ:

- If a response has the header "content-type: text/plain" but content such as 0x80 (not valid UTF-8), Chrome will respond to an IO.read with { "data": "", "base64Encoded": false, "eof": false }, i.e. an empty string, but will move its internal pointer such that the next or some subsequent IO.read will have "eof": true. To the client, this is indistinguishable from a successfully received resource, when in fact it is effectively corrupted.
- Instead, we respond with a CDP error to the IO.read. We do not immediately cancel the request or discard data, since not all IO.read errors are necessarily fatal. I've verified that CDT sends IO.close after an error, so we'll clean up that way (this isn't strictly guaranteed by any spec, but nor is IO.close after a resource is successfully consumed).
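
From the client's side, the behaviour described above might be consumed along these lines. This is a sketch only: `send` is a hypothetical transport that returns the CDP `result` object or raises `CdpError` when the response carries an `error`, and neither name comes from this PR or from CDT. The `IO.read` and `IO.close` parameter and result fields are the ones defined by CDP.

```python
import base64
from typing import Callable


class CdpError(Exception):
    """Raised by the hypothetical transport when a CDP response contains an error."""


def read_stream(send: Callable[[str, dict], dict], handle: str, size: int = 1024) -> bytes:
    """Drain a stream with IO.read, always sending IO.close afterwards."""
    out = bytearray()
    try:
        while True:
            result = send("IO.read", {"handle": handle, "size": size})
            data = result["data"]
            if result.get("base64Encoded"):
                out += base64.b64decode(data)  # binary resource
            else:
                out += data.encode("utf-8")  # text resource, sent as UTF-8
            if result["eof"]:
                return bytes(out)
    finally:
        # Close after success and after an error alike, so the backend can clean up.
        send("IO.close", {"handle": handle})
```

With an error response instead of an empty `data` string, a loop like this fails loudly on invalid UTF-8 rather than returning a silently truncated resource.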

Changelog: [General] Debugger: Support text responses to CDP IO.read requests

Differential Revision: D58323790

Differential Revision: D59693730

facebook-github-bot · 2024-07-14T00:21:43Z

This pull request was exported from Phabricator. Differential Revision: D58323790

Summary:
Pull Request resolved: facebook#45426

The initial implementation of `Network.loadNetworkResource` and the accompanying `IO.read` (D54202854) base64-encodes all data as if it is binary. This is the more general case, and we'll continue to base64-encode non-text resources.

In the common case of text resources (particularly JS and JSON), it'd be preferable to do as Chrome does and send UTF-8 over the wire directly. This has a few performance benefits:

- Less CPU and RAM footprint on device (UTF-8 truncation is constant-time, fast, and in-place), and similarly less decoding for the frontend.
- 25% less data per chunk (base64 encodes 3 bytes as 4 characters), which implies up to 25% fewer network round trips for large resources.

It also has the benefit of being human-readable in the CDP protocol inspector.

## Determining whether data is text

We use exactly Chromium's heuristic for this (code pointers in comments), which is based only on the `Content-Type` header and assumes any text MIME type is UTF-8.

## UTF-8 truncation

The slight implementation complexity here is that `IO.read` requests may specify a maximum number of bytes, and so we must slice a raw buffer up into valid UTF-8 sequences. This turns out to be fairly simple and cheap:

1. Naively truncate the buffer and inspect the last byte.
2. If the last byte has its topmost bit set to 0, it is ASCII (a single-byte code point) and we're done.
3. Otherwise, look back at most 3 bytes to find the first byte of the code point (topmost bits 11), counting the number of continuation bytes ("continuationBytes", topmost bits 10) at the end of our buffer. If we don't find a first byte within 3 bytes, the string isn't UTF-8, so throw.
4. Read the code point length, which is encoded in the first byte.
5. Resize to remove the last code point fragment, unless the code point terminates exactly at the end of our buffer.

## Edge cases + divergence from Chrome

Chrome's behaviour here in at least one case is questionable and we intentionally differ:

- If a response has the header "content-type: text/plain" but content such as `0x80` (not valid UTF-8), Chrome will respond to an `IO.read` with `{ "data": "", "base64Encoded": false, "eof": false }`, i.e. an empty string, but will move its internal pointer such that the next or some subsequent `IO.read` will have `"eof": true`. To the client, this is indistinguishable from a successfully received resource, when in fact it is effectively corrupted.
- Instead, we respond with a CDP error to the `IO.read`. We do not immediately cancel the request or discard data, since not all `IO.read` errors are necessarily fatal. I've verified that CDT sends `IO.close` after an error, so we'll clean up that way (this isn't strictly guaranteed by any spec, but nor is `IO.close` after a resource is successfully consumed).

Changelog: [General] Debugger: Support text responses to CDP `IO.read` requests

Differential Revision: D58323790

facebook-github-bot · 2024-07-15T08:42:47Z

This pull request was exported from Phabricator. Differential Revision: D58323790


facebook-github-bot · 2024-07-16T19:40:48Z

This pull request has been merged in c085180.

github-actions · 2024-07-16T19:40:52Z

This pull request was successfully merged by @robhogan in c085180

When will my fix make it into a release? | How to file a pick request?

Fix some warnings

54e34d7

Differential Revision: D59693730

facebook-github-bot added the CLA Signed and p: Facebook labels Jul 14, 2024

facebook-github-bot added the fb-exported label Jul 14, 2024

robhogan force-pushed the export-D58323790 branch from f1d25ed to 58c7eb8 Compare July 15, 2024 08:42

facebook-github-bot closed this in c085180 Jul 16, 2024

facebook-github-bot added the Merged This PR has been merged. label Jul 16, 2024

This was referenced Aug 15, 2024

Integrate RN Nightly 7/19 microsoft/react-native-windows#13554

Merged

Unfork NetworkIOAgent files and Utf8.h microsoft/react-native-windows#13587

Open
