Use section codes instead of section names#740
Conversation
(rebasing onto 0xC instead of master) This PR proposes uses section codes for known sections, which is more compact and easier to check in a decoder. It allows for user-defined sections that have string names to be encoded in the same manner as before. The scheme of using negative numbers proposed here also has the advantage of allowing a single decoder to accept the old (0xB) format and the new (0xC) format for the time being.
|
As I suggested long ago if short codes were reserved for use by the standard then it would be possible to test their length and content in a single 32 bit integer comparison which seems slightly simply than using the sign of the length to differentiate between codes and strings. This would address the use case of being 'easier to check in a decoder', and if the encoded size is a real show stopper then use single character strings for the high frequency section names. Perhaps one issue with differentiating ignored sections using a binary 'sign' bit is that this offers no clean transition from a de-facto section name to become a standard section name, and if decoders need to still check for strings in future then the design is just more complex. Also decoders in general will be looking at the section name strings. |
|
Friendly ping for any more thoughts on this PR. |
|
Is the suggestion to use short strings acceptable and would it address the performance and complexity issues raised? |
|
lgtm |
|
@rossberg-chromium Wouldn't short strings address this without adding more special cases. |
|
@titzer From the other PR: since Name is already being defined with the special "you can ignore it" / "validation error does not cause the module to fail to validate" semantics that presumably we want for all the other user-defined sections, can we make names be defined as a user-defined section and move the description of the special validation rules to be the case for all user-defined sections? |
|
Seriously... what's wrong with the short names idea? It's simpler for consumers that read-modify-write a module, in that the new proposal will need an extra field to store the negative number. My thinking is that because a feature test must be an entire module (ill-advised, imo), eliminating (even short) section names could significantly shorten the many "test" modules one might need. |
|
I think the term 'user defined section' is an inappropriate distinction too. The web community will still want to experiment with extensions, might want them to be ignored by runtimes that do not recognize them, and to be a shared standard and calling these 'user defined' is not appropriate. Perhaps the example that @lukewagner mentions is helpful. Perhaps the real distinction is a flag noting if the section can be ignored for execution purposes. If codes are added as an option to names then unrecognised codes need to be ignored too and then might as well just remove the names and use codes, which might then need to be managed, but it was deemed simpler to manage a name space so back to names. |
|
One drawback to short strings (e.g. less than 3 bytes so that a single word-sized check suffices) is that they don't really allow a decoder to just do a simple "load word and compare" operation. There's a couple subtleties here. For example, the module could end early, so the decoder has to first decode the (varint) length (also with bounds checks) and then check the length doesn't go past the end of the module. After that, even with short strings, the actual bytes may not be aligned in memory, so the decoder may still have to do a byte-by-byte comparison on some architectures (e.g. ARM and MIPS). Sections are few, so the efficiency argument isn't that pressing. It's just that doing a single byte comparison without all the extra checks is a lot simpler and more robust. |
|
As for for graduating a user-defined section to a spec-blessed section, there would have to be a point where the string or the opcode is allocated by standardization and then becomes supported by all engines. The only issue is for binaries created before this point which still use a user-defined section string as opposed to a short string or an opcode. Short strings don't help here, since presumably standardization only allocates short strings, while experiments should not be using short strings. |
|
@titzer Instead of thinking of user-defined sections as pre-standardization (such that they graduate into a spec-blessed integer code), I'd like us to consider saying instead that these are the just sections that can be completely ignored by an engine without having semantic effect (as is the case with the Names section). I like this because it means that new (semantic) features that involve a new section should go through the normal spec process before shipping (so they'd start with a new section opcode, just behind a flag at first, etc). |
|
The PR still has the ID as a varint, so there is no change in complexity compared with reading the length. So there is one less check that the length is within the module, hardly justifying the change. Loading a few unaligned bytes is a trivial matter, also does not justify a change, and this does not force a byte-by-byte comparison rather if the length of bytes can fit in a 32 bit integer or 64 bit integer then it can just pack them in and then dispatch on this integer code. |
|
If we really want to split hairs on this one, this proposal would basically amount to: int8_t val = read_i8(); So there is only a single byte read and indirect branch in the normal case, and a compliant engine never has to implement a byte-by-byte comparison for string sections, since by definition they are not required to be parsed. |
|
@lukewagner PTAL. I made the names section into a user-string specified section. Not sure if I like that, since V8 will definitely want to (try to) recognize the names section. |
BinaryEncoding.md
Outdated
| The names section does not change execution semantics, and thus is not allocated | ||
| a section opcode. | ||
| A validation error in this section does not cause validation for the whole module to fail and is | ||
| instead treated as if the section was absent. The expectation is that, when a |
There was a problem hiding this comment.
Could you move this sentence up above, below the para that introduces "user define sections", changing "this section" to "a user defined section"? That way it's clear that this is true for all user-defined sections, not just the names section.
|
lgtm with the one nit; much appreciated (we're also decoding and using the names section, fwiw) |
|
From the v8 code in module-decoder.cc the short names could address the trivial performance use case as follows: I see no rationale for this change. There is no use case for 'user-defined' sections, app developers are not expected to be bundling in arbitrary payloads in this manner and there is no way to even access them. The use case is web standards, even if experimental or de-facto. |
|
In the V8 native prototype (before 0xA), sections were section codes, and after discussion with others we all decided on section strings to reduce the probability that user extensions would collide either with each other or with experiments. But now there really is a fixed set of sections that need to be supported, so it is easier to just give them codes. Having "blessed" sections have codes and "experimental" or "user" sections have names makes the distinction explicit, and has the benefit of making engines slightly more efficient, as shown above. Standardization will assign further fixed codes as part of its normal process. |
|
On Wed, Aug 24, 2016 at 2:31 AM, JSStats notifications@github.com wrote:
BTW, there's a bunch of other complexity in the V8 implementation that is
|
|
Would there be consensus to move to short strings as a trial, and then re-evaluate the situation? So use fixed four byte strings for all the core section names. V8 could clean out some of the string complexity, simplify unit tests etc. Would not be much different than having a 32 bit section id code, less room for collisions. Or how about 1 byte strings taking it to the limit? |
|
While implementing this, I thought better of using the negative number trick and instead introduced a sentinel 0 byte for user-defined (string) section names. That avoids a little wart where one reads a byte from the stream, checks its value, and if it's not negative, "backs up" the stream 1 byte to use the normal varint32-parsing routine. Thoughts? |
|
On Wed, Aug 24, 2016 at 5:43 PM, KarlSchimpf notifications@github.com
|
|
Guess it's workable. Naming of the 'unknown' sections is deferred to a de-facto convention. |
|
@titzer By removing the designated "here's where the eventual variety of unknown sections distinguish themselves" field, it seems likely we'll eventually end up with more ad hoc conventions that end up causing conflicts or ambiguities. If it makes it any cleaner to describe, could we perhaps have a separate "unknown section header" that goes inside the payload of unknown sections and that includes the string? |
|
Yes, why not at least specify where an optional name goes in the 'unknown section header'. The length might be zero if there is none. |
|
I guess I should have kept my mouth shut before. I like having section names as an official part of the spec and separate from the payload. |
BinaryEncoding.md
Outdated
| for all sections. The encoding of sections is structured as follows: | ||
| The module preamble is followed by a sequence of sections. | ||
| Each section is identified by a 1-byte *section code* that encodes either a known section or a user-defined section. | ||
| Known sections have non-zero ids, while unkown sections simply have a zero id and are ignored by the WebAssembly implementation. |
There was a problem hiding this comment.
Nit: typo unkown -> unknown
|
Just floating an older idea here, feel free to shoot it down if it doesn't apply :-) Did you consider adapting the proposal by @mbebenita for this purpose? (#561) That is, just using LEB128 encoded ASCII for section codes while using short codes for official sections? The argument against that proposal was uniform treatment of strings, but now with the whole point being that engines don't have to treat section codes as strings anymore, that doesn't seem to be too much of a concern here. This could mean using the 128 "singleton" bytes as one byte codes for "official" sections ("ASCII strings of one character", but without paying any mind to the "value" of that character, i. e. just numbering sections from 0 to 128) and longer LEB values, that is "ASCII strings of more than one character", for non-standard sections. This seems to share the advantage of engines only having to recognize one byte section codes while custom sections are just those that start with “continuation bytes”. This also means that all possible (ASCII-) strings are available as non standard section codes, as long as they are at least two characters long. From an encoder perspective they're all just LEB values, then. There's also having one big uniform name-space with low risk of collision between non-standard sections while maintaining a distinction between official and non-standard by means of the length of the section code. Of course implementations wouldn't need to know anything about ASCII in any case, the string interpretation of custom section codes would be more like mnemonics for longer LEB values. Optionally, the spec may also want to reserve two character section codes (that is two byte LEB values) for a distant future where there might be more than 128 official sections :-) If that day ever comes, engines could still get away with reading and comparing just two bytes. |
|
Is there consensus on this one? |
|
Considering that https://github.com/WebAssembly/webassembly.github.io/pull/14 has already landed I think the train has left the station for 0xC. I think this should be moved to 0xD. |
|
@jfbastien The binary in https://github.com/WebAssembly/webassembly.github.io/pull/14 does include the change here ("Section opcodes"). |
Ah OK, it does seem like that merge was a tad optimistic then :-) We should merge this PR then. |
|
Ok, it looks like the remaining difference between this PR and what we discussed earlier is saying all user-defined sections have the string field. |
|
Back by popular demand: string identifiers for unknown sections. |
|
Thanks, lgtm |
|
I thought we wanted to make the string part of the payload, so that user sections still have regular structure? |
|
Oh right, I forgot. So that effectively just switches the order of the id string and payload length in the current PR. |
This is the same change as https://github.com/WebAssembly/webassembly.github.io/pull/14. Note that it includes the changes described in WebAssembly/design#740 however it doesn't have any user-defined or names sections, so it's unaffected by the details of how those are handled.
|
Also: one consequence of the current wording which I think is desirable and perhaps should be more-explicitly called out is that unknown-to-the-engine non-zero section codes will now cause a validation error; only user-defined sections will be silently ignored if unknown. |
|
On Wed, Sep 14, 2016 at 8:33 PM, rossberg-chromium <notifications@github.com
|
|
Sure, but it was a regular case then, now it's become irregular, and thus a minus for easy processing -- especially for consumers that just want to skip all but a few selected sections (like tools). |
|
Righty-o. PTAL. On Thu, Sep 15, 2016 at 10:30 AM, rossberg-chromium <
|
|
lgtm |
1 similar comment
|
lgtm |
Merge pull request #1460 from Krovatkin:known_sections Per WebAssembly/design#740 this adds a few bits to support opcodes for known sections fyi @commonlisp @arunetm @cellulle @MikeHolman homeless PR looks for a nice review and branch to be adopted into <!-- Reviewable:start --> --- This change is [<img src="https://reviewable.io/review_button.svg" height="34" align="absmiddle" alt="Reviewable"/>](https://reviewable.io/reviews/microsoft/chakracore/1460) <!-- Reviewable:end -->
(rebasing onto 0xC instead of master)
This PR proposes uses section codes for known sections, which is more compact and easier to check in a decoder.
It allows for user-defined sections that have string names to be encoded in the same manner as before.
The scheme of using negative numbers proposed here also has the advantage of allowing a single decoder to accept the old (0xB) format and the new (0xC) format for the time being.