Skip to content

Use section codes instead of section names#740

Merged
titzer merged 9 commits intobinary_0xcfrom
binary_0xc_section_codes3
Sep 15, 2016
Merged

Use section codes instead of section names#740
titzer merged 9 commits intobinary_0xcfrom
binary_0xc_section_codes3

Conversation

@titzer
Copy link

@titzer titzer commented Aug 1, 2016

(rebasing onto 0xC instead of master)

This PR proposes uses section codes for known sections, which is more compact and easier to check in a decoder.
It allows for user-defined sections that have string names to be encoded in the same manner as before.
The scheme of using negative numbers proposed here also has the advantage of allowing a single decoder to accept the old (0xB) format and the new (0xC) format for the time being.

(rebasing onto 0xC instead of master)

This PR proposes uses section codes for known sections, which is more compact and easier to check in a decoder.
It allows for user-defined sections that have string names to be encoded in the same manner as before.
The scheme of using negative numbers proposed here also has the advantage of allowing a single decoder to accept the old (0xB) format and the new (0xC) format for the time being.
@ghost
Copy link

ghost commented Aug 2, 2016

As I suggested long ago if short codes were reserved for use by the standard then it would be possible to test their length and content in a single 32 bit integer comparison which seems slightly simply than using the sign of the length to differentiate between codes and strings. This would address the use case of being 'easier to check in a decoder', and if the encoded size is a real show stopper then use single character strings for the high frequency section names.

Perhaps one issue with differentiating ignored sections using a binary 'sign' bit is that this offers no clean transition from a de-facto section name to become a standard section name, and if decoders need to still check for strings in future then the design is just more complex. Also decoders in general will be looking at the section name strings.

@titzer
Copy link
Author

titzer commented Aug 16, 2016

Friendly ping for any more thoughts on this PR.

@ghost
Copy link

ghost commented Aug 16, 2016

Is the suggestion to use short strings acceptable and would it address the performance and complexity issues raised?

@rossberg
Copy link
Member

lgtm

@ghost
Copy link

ghost commented Aug 16, 2016

@rossberg-chromium Wouldn't short strings address this without adding more special cases.

@lukewagner
Copy link
Member

@titzer From the other PR: since Name is already being defined with the special "you can ignore it" / "validation error does not cause the module to fail to validate" semantics that presumably we want for all the other user-defined sections, can we make names be defined as a user-defined section and move the description of the special validation rules to be the case for all user-defined sections?

@qwertie
Copy link

qwertie commented Aug 17, 2016

Seriously... what's wrong with the short names idea? It's simpler for consumers that read-modify-write a module, in that the new proposal will need an extra field to store the negative number. My thinking is that because a feature test must be an entire module (ill-advised, imo), eliminating (even short) section names could significantly shorten the many "test" modules one might need.

@ghost
Copy link

ghost commented Aug 17, 2016

I think the term 'user defined section' is an inappropriate distinction too. The web community will still want to experiment with extensions, might want them to be ignored by runtimes that do not recognize them, and to be a shared standard and calling these 'user defined' is not appropriate. Perhaps the example that @lukewagner mentions is helpful. Perhaps the real distinction is a flag noting if the section can be ignored for execution purposes.

If codes are added as an option to names then unrecognised codes need to be ignored too and then might as well just remove the names and use codes, which might then need to be managed, but it was deemed simpler to manage a name space so back to names.

@titzer
Copy link
Author

titzer commented Aug 17, 2016

One drawback to short strings (e.g. less than 3 bytes so that a single word-sized check suffices) is that they don't really allow a decoder to just do a simple "load word and compare" operation. There's a couple subtleties here. For example, the module could end early, so the decoder has to first decode the (varint) length (also with bounds checks) and then check the length doesn't go past the end of the module. After that, even with short strings, the actual bytes may not be aligned in memory, so the decoder may still have to do a byte-by-byte comparison on some architectures (e.g. ARM and MIPS).

Sections are few, so the efficiency argument isn't that pressing. It's just that doing a single byte comparison without all the extra checks is a lot simpler and more robust.

@titzer
Copy link
Author

titzer commented Aug 17, 2016

As for for graduating a user-defined section to a spec-blessed section, there would have to be a point where the string or the opcode is allocated by standardization and then becomes supported by all engines.

The only issue is for binaries created before this point which still use a user-defined section string as opposed to a short string or an opcode. Short strings don't help here, since presumably standardization only allocates short strings, while experiments should not be using short strings.

@lukewagner
Copy link
Member

@titzer Instead of thinking of user-defined sections as pre-standardization (such that they graduate into a spec-blessed integer code), I'd like us to consider saying instead that these are the just sections that can be completely ignored by an engine without having semantic effect (as is the case with the Names section). I like this because it means that new (semantic) features that involve a new section should go through the normal spec process before shipping (so they'd start with a new section opcode, just behind a flag at first, etc).

@ghost
Copy link

ghost commented Aug 18, 2016

The PR still has the ID as a varint, so there is no change in complexity compared with reading the length. So there is one less check that the length is within the module, hardly justifying the change. Loading a few unaligned bytes is a trivial matter, also does not justify a change, and this does not force a byte-by-byte comparison rather if the length of bytes can fit in a 32 bit integer or 64 bit integer then it can just pack them in and then dispatch on this integer code.

@titzer
Copy link
Author

titzer commented Aug 22, 2016

If we really want to split hairs on this one, this proposal would basically amount to:

int8_t val = read_i8();
switch (val) {
case -1: do_section_x();
case -2: do_section_y();
case -3: do_section_z();
...
default: {
if (val >= 0) {
rewind(1);
unsigned str_length = read_leb128();
skip(str_length);
unsigned length = read_leb128();
skip(length);
}
}
}

So there is only a single byte read and indirect branch in the normal case, and a compliant engine never has to implement a byte-by-byte comparison for string sections, since by definition they are not required to be parsed.

@titzer
Copy link
Author

titzer commented Aug 23, 2016

@lukewagner PTAL. I made the names section into a user-string specified section. Not sure if I like that, since V8 will definitely want to (try to) recognize the names section.

The names section does not change execution semantics, and thus is not allocated
a section opcode.
A validation error in this section does not cause validation for the whole module to fail and is
instead treated as if the section was absent. The expectation is that, when a
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Could you move this sentence up above, below the para that introduces "user define sections", changing "this section" to "a user defined section"? That way it's clear that this is true for all user-defined sections, not just the names section.

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done

@lukewagner
Copy link
Member

lukewagner commented Aug 23, 2016

lgtm with the one nit; much appreciated

(we're also decoding and using the names section, fwiw)

@ghost
Copy link

ghost commented Aug 24, 2016

From the v8 code in module-decoder.cc the short names could address the trivial performance use case as follows:

<<< existing code, that will still be needed >>>
      // Read the section name.
      int string_leb_length = 0;
      uint32_t string_length =
          consume_u32v(&string_leb_length, "section name length");
      const byte* section_name_start = pc_;
      consume_bytes(string_length);
      if (failed()) {
        TRACE("Section name of length %u couldn't be read\n", string_length);
        break;
      }

<<< fast path for short strings >>>
      if (string_length == 4) {
    switch read_u32(section_name_start){
      case ...:
      case ...:
      default: // Skip and ignore section.
      }
      } else {
<<< optional slow path if handling larger strings >>>

I see no rationale for this change.

There is no use case for 'user-defined' sections, app developers are not expected to be bundling in arbitrary payloads in this manner and there is no way to even access them. The use case is web standards, even if experimental or de-facto.

@titzer
Copy link
Author

titzer commented Aug 24, 2016

In the V8 native prototype (before 0xA), sections were section codes, and after discussion with others we all decided on section strings to reduce the probability that user extensions would collide either with each other or with experiments. But now there really is a fixed set of sections that need to be supported, so it is easier to just give them codes. Having "blessed" sections have codes and "experimental" or "user" sections have names makes the distinction explicit, and has the benefit of making engines slightly more efficient, as shown above. Standardization will assign further fixed codes as part of its normal process.

@titzer
Copy link
Author

titzer commented Aug 24, 2016

On Wed, Aug 24, 2016 at 2:31 AM, JSStats notifications@github.com wrote:

From the v8 code in module-decoder.cc the short names could address the
trivial performance use case as follows:

<<< existing code, that will still be needed >>>
// Read the section name.
int string_leb_length = 0;
uint32_t string_length =
consume_u32v(&string_leb_length, "section name length");
const byte* section_name_start = pc_;
consume_bytes(string_length);
if (failed()) {
TRACE("Section name of length %u couldn't be read\n", string_length);
break;
}

<<< fast path for short strings >>>
if (string_length == 4) {
switch read_u32(section_name_start){

This just hides the bounds-checking and unaligned-read inefficiencies
inside the read_u32 code.

  case ...:
  case ...:
  default: // Skip and ignore section.
  }

This switch won't be dense, so it won't get turned into a jump table by
the C++ compiler.

BTW, there's a bunch of other complexity in the V8 implementation that is
the result of strings, if you look at the src/wasm/module-header.h definition of
all those strings and all the unittests. It's vastly simpler with
single-byte section codes. I'm not sure making the strings shorter fixes
any of that.

  } else {

<<< optional slow path if handling larger strings >>>

I see no rationale for this change.

There is no use case for 'user-defined' sections, app developers are not
expected to be bundling in arbitrary payloads in this manner and there is
no way to even access them. The use case is web standards, even if
experimental or de-facto.

Actually, it's clear that they will, because debugging/source-mapping
sections are not going to be standardized for the engine, and will vary
across languages.


You are receiving this because you were mentioned.
Reply to this email directly, view it on GitHub
#740 (comment),
or mute the thread
https://github.com/notifications/unsubscribe-auth/ALnq1MhrJLjvaotRD2f41WsHk0umQS5gks5qi5DUgaJpZM4JZ_El
.

@ghost
Copy link

ghost commented Aug 24, 2016

Would there be consensus to move to short strings as a trial, and then re-evaluate the situation? So use fixed four byte strings for all the core section names. V8 could clean out some of the string complexity, simplify unit tests etc. Would not be much different than having a 32 bit section id code, less room for collisions. Or how about 1 byte strings taking it to the limit?

@titzer
Copy link
Author

titzer commented Aug 24, 2016

While implementing this, I thought better of using the negative number trick and instead introduced a sentinel 0 byte for user-defined (string) section names. That avoids a little wart where one reads a byte from the stream, checks its value, and if it's not negative, "backs up" the stream 1 byte to use the normal varint32-parsing routine.

Thoughts?

@titzer
Copy link
Author

titzer commented Aug 25, 2016

On Wed, Aug 24, 2016 at 5:43 PM, KarlSchimpf notifications@github.com
wrote:

I for one, do not like what is being proposed here. It adds a BIG wart for
decompression. The rules are very complex and make it much harder to
implement a "general" decompression tool.

Since there are so few sections (when considering the size of a WASM file),
I'm surprised how much time is spent saving a few dozen bytes.

I also don't like that some argue that we should presume all "sections" are
already known. This makes integrating "decompression" algorithms more
problematic, in that the decompression algorithms have no logical place to
be (optionally) added.

I'm not sure I understand this comment, now that I've removed references to
strings for identifying unknown sections. The identification of sections is
just a single byte, as opposed to a string before. How does that make
decompression more complicated?

Karl Schimpf

Karl Schimpf

On Wed, Aug 24, 2016 at 8:29 AM, titzer notifications@github.com wrote:

Also, technically a negative number could be extended with more all-1's
upper bits, which would be another case to handle :-(


You are receiving this because you are subscribed to this thread.
Reply to this email directly, view it on GitHub
#740 (comment),
or mute the thread
<https://github.com/notifications/unsubscribe-auth/
ALRbV9DPXf1PCy7j4jNXQBCAY12QEVRmks5qjGNegaJpZM4JZ_El>

.


You are receiving this because you were mentioned.
Reply to this email directly, view it on GitHub
#740 (comment),
or mute the thread
https://github.com/notifications/unsubscribe-auth/ALnq1ICT_xeYd4NJPLFwfTrkTNRJ_l1Oks5qjGargaJpZM4JZ_El
.

@ghost
Copy link

ghost commented Aug 25, 2016

Guess it's workable. Naming of the 'unknown' sections is deferred to a de-facto convention.

@lukewagner
Copy link
Member

@titzer By removing the designated "here's where the eventual variety of unknown sections distinguish themselves" field, it seems likely we'll eventually end up with more ad hoc conventions that end up causing conflicts or ambiguities. If it makes it any cleaner to describe, could we perhaps have a separate "unknown section header" that goes inside the payload of unknown sections and that includes the string?

@ghost
Copy link

ghost commented Aug 25, 2016

Yes, why not at least specify where an optional name goes in the 'unknown section header'. The length might be zero if there is none.

@qwertie
Copy link

qwertie commented Aug 26, 2016

I guess I should have kept my mouth shut before. I like having section names as an official part of the spec and separate from the payload.

for all sections. The encoding of sections is structured as follows:
The module preamble is followed by a sequence of sections.
Each section is identified by a 1-byte *section code* that encodes either a known section or a user-defined section.
Known sections have non-zero ids, while unkown sections simply have a zero id and are ignored by the WebAssembly implementation.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nit: typo unkown -> unknown

@hemobo
Copy link

hemobo commented Aug 28, 2016

Just floating an older idea here, feel free to shoot it down if it doesn't apply :-)

Did you consider adapting the proposal by @mbebenita for this purpose? (#561)

That is, just using LEB128 encoded ASCII for section codes while using short codes for official sections?

The argument against that proposal was uniform treatment of strings, but now with the whole point being that engines don't have to treat section codes as strings anymore, that doesn't seem to be too much of a concern here.

This could mean using the 128 "singleton" bytes as one byte codes for "official" sections ("ASCII strings of one character", but without paying any mind to the "value" of that character, i. e. just numbering sections from 0 to 128) and longer LEB values, that is "ASCII strings of more than one character", for non-standard sections.

This seems to share the advantage of engines only having to recognize one byte section codes while custom sections are just those that start with “continuation bytes”.
There would be no reading of string sizes in any case.

This also means that all possible (ASCII-) strings are available as non standard section codes, as long as they are at least two characters long.

From an encoder perspective they're all just LEB values, then. There's also having one big uniform name-space with low risk of collision between non-standard sections while maintaining a distinction between official and non-standard by means of the length of the section code.

Of course implementations wouldn't need to know anything about ASCII in any case, the string interpretation of custom section codes would be more like mnemonics for longer LEB values.

Optionally, the spec may also want to reserve two character section codes (that is two byte LEB values) for a distant future where there might be more than 128 official sections :-)
Non-standard codes would then simply be defined to start at three bytes.

If that day ever comes, engines could still get away with reading and comparing just two bytes.

@naturaltransformation
Copy link
Contributor

Is there consensus on this one?

@jfbastien
Copy link
Member

Considering that https://github.com/WebAssembly/webassembly.github.io/pull/14 has already landed I think the train has left the station for 0xC. I think this should be moved to 0xD.

@sunfishcode
Copy link
Member

@jfbastien The binary in https://github.com/WebAssembly/webassembly.github.io/pull/14 does include the change here ("Section opcodes").

@jfbastien
Copy link
Member

@jfbastien The binary in WebAssembly/webassembly.github.io#14 does include the change here ("Section opcodes").

Ah OK, it does seem like that merge was a tad optimistic then :-)

We should merge this PR then.

@lukewagner
Copy link
Member

Ok, it looks like the remaining difference between this PR and what we discussed earlier is saying all user-defined sections have the string field.

@titzer
Copy link
Author

titzer commented Sep 14, 2016

Back by popular demand: string identifiers for unknown sections.

@lukewagner
Copy link
Member

Thanks, lgtm

@rossberg
Copy link
Member

rossberg commented Sep 14, 2016

I thought we wanted to make the string part of the payload, so that user sections still have regular structure?

@lukewagner
Copy link
Member

Oh right, I forgot. So that effectively just switches the order of the id string and payload length in the current PR.

sunfishcode added a commit to WebAssembly/website that referenced this pull request Sep 14, 2016
This is the same change as https://github.com/WebAssembly/webassembly.github.io/pull/14.

Note that it includes the changes described in WebAssembly/design#740
however it doesn't have any user-defined or names sections, so it's
unaffected by the details of how those are handled.
@lukewagner
Copy link
Member

Also: one consequence of the current wording which I think is desirable and perhaps should be more-explicitly called out is that unknown-to-the-engine non-zero section codes will now cause a validation error; only user-defined sections will be silently ignored if unknown.

@titzer
Copy link
Author

titzer commented Sep 15, 2016

On Wed, Sep 14, 2016 at 8:33 PM, rossberg-chromium <notifications@github.com

wrote:

I thought we wanted to make the string part of the payload, so that user
sections still have normal structure?

That's the way that string section names were originally implemented, but
that was actually more complex to check for (in the case where you actually
want to parse the name, you have to check if the string length and string
are within the payload size).


You are receiving this because you were mentioned.
Reply to this email directly, view it on GitHub
#740 (comment),
or mute the thread
https://github.com/notifications/unsubscribe-auth/ALnq1P90JBg8-Qzd4D0rJacuE-2JDKHLks5qqD3ngaJpZM4JZ_El
.

@rossberg
Copy link
Member

Sure, but it was a regular case then, now it's become irregular, and thus a minus for easy processing -- especially for consumers that just want to skip all but a few selected sections (like tools).

@titzer
Copy link
Author

titzer commented Sep 15, 2016

Righty-o. PTAL.

On Thu, Sep 15, 2016 at 10:30 AM, rossberg-chromium <
notifications@github.com> wrote:

Sure, but it was a regular case then, now it's become irregular, and thus
a minus for easy processing -- especially for consumers that just want to
skip all but a few selected sections (like tools).


You are receiving this because you were mentioned.
Reply to this email directly, view it on GitHub
#740 (comment),
or mute the thread
https://github.com/notifications/unsubscribe-auth/ALnq1K5Zc-hYZRhoAflii1cY0uGM21Akks5qqQIKgaJpZM4JZ_El
.

@rossberg
Copy link
Member

lgtm

1 similar comment
@lukewagner
Copy link
Member

lgtm

@titzer titzer merged commit d8690c5 into binary_0xc Sep 15, 2016
@titzer titzer deleted the binary_0xc_section_codes3 branch September 15, 2016 14:39
chakrabot pushed a commit to chakra-core/ChakraCore that referenced this pull request Sep 28, 2016
Merge pull request #1460 from Krovatkin:known_sections

Per WebAssembly/design#740
this adds a few bits to support opcodes for known sections
fyi @commonlisp  @arunetm

@cellulle @MikeHolman homeless PR looks for a nice review and branch to be adopted into

<!-- Reviewable:start -->
---
This change is [<img src="https://reviewable.io/review_button.svg" height="34" align="absmiddle" alt="Reviewable"/>](https://reviewable.io/reviews/microsoft/chakracore/1460)
<!-- Reviewable:end -->
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Projects

None yet

Development

Successfully merging this pull request may close these issues.

10 participants