-
Notifications
You must be signed in to change notification settings - Fork 25.1k
jsinspector: Support UTF-8 responses to CDP's IO.read #45426
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Closed
Closed
Conversation
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Differential Revision: D59693730
Contributor
|
This pull request was exported from Phabricator. Differential Revision: D58323790 |
Summary: Pull Request resolved: facebook#45426 The initial implementation of `Network.loadNetworkResource` and the accompanying `IO.read` (D54202854) base64-encodes all data as if it is binary. This is the more general case, and we'll continue to base64-encode non-text resources. In the common case of text resources (particularly JS and JSON), it'd be preferable to do as Chrome does and send UTF-8 over the wire directly. This has a few performance benefits: - Less CPU and RAM footprint on device (UTF-8 truncation is constant-time, fast, and in-place), similarly less decoding for the frontend. - 25% less data per chunk (base64 encodes 3 bytes as 4 characters), implies up to 25% fewer network round trips for large resources. It also has the benefit of being human-readable in the CDP protocol inspector. ## Determining whether data is text We use exactly Chromium's heuristic for this (code pointers in comments), which is based only on the `Content-Type` header, and assuming any text mime type is UTF-8. ## UTF-8 truncation The slight implementation complexity here is that `IO.read` requests may specify a maximum number of bytes, and so we must slice a raw buffer up into valid UTF-8 sequences. This turns out to be fairly simple and cheap: 1. Naively truncate the buffer, inspect the last byte 2. If the last byte has topmost bit =0, it's ASCII (single byte) and we're done. 3. Otherwise, look back at most 3 bytes to find the first byte of the code point (topmost bits 11), counting the number of "continuationBytes" at the end of our buffer. If we don't find one within 3 bytes then the string isn't UTF-8 - throw. 4. Read the code point length, which is encoded into the first byte. 5. Resize to remove the last code point fragment, unless it terminates correctly exactly at the end of our buffer. ## Edge cases + divergence from Chrome Chrome's behaviour here in at least one case is questionable and we intentionally differ: - If a response has header "content-type: text/plain" but content eg`0x80` (not valid UTF-8), Chrome will respond to an `IO.read` with `{ "data": "", "base64Encoded": false, "eof": false }`, ie an empty string, but will move its internal pointer such that the next or some subsequent `IO.read` will have `"eof": true`. To the client, this is indistinguishable from a successfully received resource, when in fact it is effectively corrupted. - Instead, we respond with a CDP error to the `IO.read`. We do not immediately cancel the request or discard data, since not all `IO.read` errors are necessarily fatal. I've verified that CDT sends `IO.close` after an error, so we'll clean up that way (this isn't strictly guaranteed by any spec, but nor is `IO.close` after a resource is successfully consumed). Changelog: [General] Debugger: Support text responses to CDP `IO.read` requests Differential Revision: D58323790
Contributor
|
This pull request was exported from Phabricator. Differential Revision: D58323790 |
f1d25ed to
58c7eb8
Compare
facebook-github-bot
pushed a commit
that referenced
this pull request
Jul 15, 2024
Summary:
The initial implementation of `Network.loadNetworkResource` and the accompanying `IO.read` (D54202854) base64-encodes all data as if it is binary. This is the more general case, and we'll continue to base64-encode non-text resources.
In the common case of text resources (particularly JS and JSON), it'd be preferable to do as Chrome does and send UTF-8 over the wire directly. This has a few performance benefits:
- Less CPU and RAM footprint on device (UTF-8 truncation is constant-time, fast, and in-place), similarly less decoding for the frontend.
- 25% less data per chunk (base64 encodes 3 bytes as 4 characters), implies up to 25% fewer network round trips for large resources.
It also has the benefit of being human-readable in the CDP protocol inspector.
## Determining whether data is text
We use exactly Chromium's heuristic for this (code pointers in comments), which is based only on the `Content-Type` header, and assuming any text mime type is UTF-8.
## UTF-8 truncation
The slight implementation complexity here is that `IO.read` requests may specify a maximum number of bytes, and so we must slice a raw buffer up into valid UTF-8 sequences. This turns out to be fairly simple and cheap:
1. Naively truncate the buffer, inspect the last byte
2. If the last byte has topmost bit =0, it's ASCII (single byte) and we're done.
3. Otherwise, look back at most 3 bytes to find the first byte of the code point (topmost bits 11), counting the number of "continuationBytes" at the end of our buffer. If we don't find one within 3 bytes then the string isn't UTF-8 - throw.
4. Read the code point length, which is encoded into the first byte.
5. Resize to remove the last code point fragment, unless it terminates correctly exactly at the end of our buffer.
## Edge cases + divergence from Chrome
Chrome's behaviour here in at least one case is questionable and we intentionally differ:
- If a response has header "content-type: text/plain" but content eg`0x80` (not valid UTF-8), Chrome will respond to an `IO.read` with `{ "data": "", "base64Encoded": false, "eof": false }`, ie an empty string, but will move its internal pointer such that the next or some subsequent `IO.read` will have `"eof": true`. To the client, this is indistinguishable from a successfully received resource, when in fact it is effectively corrupted.
- Instead, we respond with a CDP error to the `IO.read`. We do not immediately cancel the request or discard data, since not all `IO.read` errors are necessarily fatal. I've verified that CDT sends `IO.close` after an error, so we'll clean up that way (this isn't strictly guaranteed by any spec, but nor is `IO.close` after a resource is successfully consumed).
Changelog: [General] Debugger: Support text responses to CDP `IO.read` requests
Differential Revision: D58323790
Contributor
|
This pull request has been merged in c085180. |
|
This pull request was successfully merged by @robhogan in c085180 When will my fix make it into a release? | How to file a pick request? |
This was referenced Aug 15, 2024
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Labels
CLA Signed
This label is managed by the Facebook bot. Authors need to sign the CLA before a PR can be reviewed.
fb-exported
Merged
This PR has been merged.
p: Facebook
Partner: Facebook
Partner
Add this suggestion to a batch that can be applied as a single commit.
This suggestion is invalid because no changes were made to the code.
Suggestions cannot be applied while the pull request is closed.
Suggestions cannot be applied while viewing a subset of changes.
Only one suggestion per line can be applied in a batch.
Add this suggestion to a batch that can be applied as a single commit.
Applying suggestions on deleted lines is not supported.
You must change the existing code in this line in order to create a valid suggestion.
Outdated suggestions cannot be applied.
This suggestion has been applied or marked resolved.
Suggestions cannot be applied from pending reviews.
Suggestions cannot be applied on multi-line comments.
Suggestions cannot be applied while the pull request is queued to merge.
Suggestion cannot be applied right now. Please check back later.
Summary:
The initial implementation of
Network.loadNetworkResourceand the accompanyingIO.read(D54202854) base64-encodes all data as if it is binary. This is the more general case, and we'll continue to base64-encode non-text resources.In the common case of text resources (particularly JS and JSON), it'd be preferable to do as Chrome does and send UTF-8 over the wire directly. This has a few performance benefits:
It also has the benefit of being human-readable in the CDP protocol inspector.
Determining whether data is text
We use exactly Chromium's heuristic for this (code pointers in comments), which is based only on the
Content-Typeheader, and assuming any text mime type is UTF-8.UTF-8 truncation
The slight implementation complexity here is that
IO.readrequests may specify a maximum number of bytes, and so we must slice a raw buffer up into valid UTF-8 sequences. This turns out to be fairly simple and cheap:Edge cases + divergence from Chrome
Chrome's behaviour here in at least one case is questionable and we intentionally differ:
0x80(not valid UTF-8), Chrome will respond to anIO.readwith{ "data": "", "base64Encoded": false, "eof": false }, ie an empty string, but will move its internal pointer such that the next or some subsequentIO.readwill have"eof": true. To the client, this is indistinguishable from a successfully received resource, when in fact it is effectively corrupted.IO.read. We do not immediately cancel the request or discard data, since not allIO.readerrors are necessarily fatal. I've verified that CDT sendsIO.closeafter an error, so we'll clean up that way (this isn't strictly guaranteed by any spec, but nor isIO.closeafter a resource is successfully consumed).Changelog: [General] Debugger: Support text responses to CDP
IO.readrequestsDifferential Revision: D58323790