Use section codes instead of section names by titzer · Pull Request #740 · WebAssembly/design

titzer · 2016-08-01T20:53:24Z

(rebasing onto 0xC instead of master)

This PR proposes uses section codes for known sections, which is more compact and easier to check in a decoder.
It allows for user-defined sections that have string names to be encoded in the same manner as before.
The scheme of using negative numbers proposed here also has the advantage of allowing a single decoder to accept the old (0xB) format and the new (0xC) format for the time being.

(rebasing onto 0xC instead of master) This PR proposes uses section codes for known sections, which is more compact and easier to check in a decoder. It allows for user-defined sections that have string names to be encoded in the same manner as before. The scheme of using negative numbers proposed here also has the advantage of allowing a single decoder to accept the old (0xB) format and the new (0xC) format for the time being.

ghost · 2016-08-02T00:06:45Z

As I suggested long ago if short codes were reserved for use by the standard then it would be possible to test their length and content in a single 32 bit integer comparison which seems slightly simply than using the sign of the length to differentiate between codes and strings. This would address the use case of being 'easier to check in a decoder', and if the encoded size is a real show stopper then use single character strings for the high frequency section names.

Perhaps one issue with differentiating ignored sections using a binary 'sign' bit is that this offers no clean transition from a de-facto section name to become a standard section name, and if decoders need to still check for strings in future then the design is just more complex. Also decoders in general will be looking at the section name strings.

titzer · 2016-08-16T07:29:34Z

Friendly ping for any more thoughts on this PR.

ghost · 2016-08-16T10:17:31Z

Is the suggestion to use short strings acceptable and would it address the performance and complexity issues raised?

rossberg · 2016-08-16T10:58:27Z

lgtm

ghost · 2016-08-16T12:14:32Z

@rossberg-chromium Wouldn't short strings address this without adding more special cases.

lukewagner · 2016-08-16T14:16:11Z

@titzer From the other PR: since Name is already being defined with the special "you can ignore it" / "validation error does not cause the module to fail to validate" semantics that presumably we want for all the other user-defined sections, can we make names be defined as a user-defined section and move the description of the special validation rules to be the case for all user-defined sections?

qwertie · 2016-08-17T01:05:02Z

Seriously... what's wrong with the short names idea? It's simpler for consumers that read-modify-write a module, in that the new proposal will need an extra field to store the negative number. My thinking is that because a feature test must be an entire module (ill-advised, imo), eliminating (even short) section names could significantly shorten the many "test" modules one might need.

ghost · 2016-08-17T02:52:02Z

I think the term 'user defined section' is an inappropriate distinction too. The web community will still want to experiment with extensions, might want them to be ignored by runtimes that do not recognize them, and to be a shared standard and calling these 'user defined' is not appropriate. Perhaps the example that @lukewagner mentions is helpful. Perhaps the real distinction is a flag noting if the section can be ignored for execution purposes.

If codes are added as an option to names then unrecognised codes need to be ignored too and then might as well just remove the names and use codes, which might then need to be managed, but it was deemed simpler to manage a name space so back to names.

titzer · 2016-08-17T09:59:23Z

One drawback to short strings (e.g. less than 3 bytes so that a single word-sized check suffices) is that they don't really allow a decoder to just do a simple "load word and compare" operation. There's a couple subtleties here. For example, the module could end early, so the decoder has to first decode the (varint) length (also with bounds checks) and then check the length doesn't go past the end of the module. After that, even with short strings, the actual bytes may not be aligned in memory, so the decoder may still have to do a byte-by-byte comparison on some architectures (e.g. ARM and MIPS).

Sections are few, so the efficiency argument isn't that pressing. It's just that doing a single byte comparison without all the extra checks is a lot simpler and more robust.

titzer · 2016-08-17T10:01:41Z

As for for graduating a user-defined section to a spec-blessed section, there would have to be a point where the string or the opcode is allocated by standardization and then becomes supported by all engines.

The only issue is for binaries created before this point which still use a user-defined section string as opposed to a short string or an opcode. Short strings don't help here, since presumably standardization only allocates short strings, while experiments should not be using short strings.

lukewagner · 2016-08-17T15:57:08Z

@titzer Instead of thinking of user-defined sections as pre-standardization (such that they graduate into a spec-blessed integer code), I'd like us to consider saying instead that these are the just sections that can be completely ignored by an engine without having semantic effect (as is the case with the Names section). I like this because it means that new (semantic) features that involve a new section should go through the normal spec process before shipping (so they'd start with a new section opcode, just behind a flag at first, etc).

ghost · 2016-08-18T13:32:28Z

The PR still has the ID as a varint, so there is no change in complexity compared with reading the length. So there is one less check that the length is within the module, hardly justifying the change. Loading a few unaligned bytes is a trivial matter, also does not justify a change, and this does not force a byte-by-byte comparison rather if the length of bytes can fit in a 32 bit integer or 64 bit integer then it can just pack them in and then dispatch on this integer code.

titzer · 2016-08-22T12:42:45Z

If we really want to split hairs on this one, this proposal would basically amount to:

int8_t val = read_i8();
switch (val) {
case -1: do_section_x();
case -2: do_section_y();
case -3: do_section_z();
...
default: {
if (val >= 0) {
rewind(1);
unsigned str_length = read_leb128();
skip(str_length);
unsigned length = read_leb128();
skip(length);
}
}
}

So there is only a single byte read and indirect branch in the normal case, and a compliant engine never has to implement a byte-by-byte comparison for string sections, since by definition they are not required to be parsed.

titzer · 2016-08-23T14:47:39Z

@lukewagner PTAL. I made the names section into a user-string specified section. Not sure if I like that, since V8 will definitely want to (try to) recognize the names section.

lukewagner · 2016-08-23T14:53:23Z

BinaryEncoding.md

+The names section does not change execution semantics, and thus is not allocated
+a section opcode.
+A validation error in this section does not cause validation for the whole module to fail and is
 instead treated as if the section was absent. The expectation is that, when a


Could you move this sentence up above, below the para that introduces "user define sections", changing "this section" to "a user defined section"? That way it's clear that this is true for all user-defined sections, not just the names section.

lukewagner · 2016-08-23T14:53:42Z

lgtm with the one nit; much appreciated

(we're also decoding and using the names section, fwiw)

ghost · 2016-08-24T00:30:52Z

From the v8 code in module-decoder.cc the short names could address the trivial performance use case as follows:

<<< existing code, that will still be needed >>>
      // Read the section name.
      int string_leb_length = 0;
      uint32_t string_length =
          consume_u32v(&string_leb_length, "section name length");
      const byte* section_name_start = pc_;
      consume_bytes(string_length);
      if (failed()) {
        TRACE("Section name of length %u couldn't be read\n", string_length);
        break;
      }

<<< fast path for short strings >>>
      if (string_length == 4) {
    switch read_u32(section_name_start){
      case ...:
      case ...:
      default: // Skip and ignore section.
      }
      } else {
<<< optional slow path if handling larger strings >>>

I see no rationale for this change.

There is no use case for 'user-defined' sections, app developers are not expected to be bundling in arbitrary payloads in this manner and there is no way to even access them. The use case is web standards, even if experimental or de-facto.

titzer · 2016-08-24T09:08:31Z

In the V8 native prototype (before 0xA), sections were section codes, and after discussion with others we all decided on section strings to reduce the probability that user extensions would collide either with each other or with experiments. But now there really is a fixed set of sections that need to be supported, so it is easier to just give them codes. Having "blessed" sections have codes and "experimental" or "user" sections have names makes the distinction explicit, and has the benefit of making engines slightly more efficient, as shown above. Standardization will assign further fixed codes as part of its normal process.

titzer · 2016-08-24T09:12:12Z

On Wed, Aug 24, 2016 at 2:31 AM, JSStats notifications@github.com wrote:

From the v8 code in module-decoder.cc the short names could address the
trivial performance use case as follows:

<<< existing code, that will still be needed >>>
// Read the section name.
int string_leb_length = 0;
uint32_t string_length =
consume_u32v(&string_leb_length, "section name length");
const byte* section_name_start = pc_;
consume_bytes(string_length);
if (failed()) {
TRACE("Section name of length %u couldn't be read\n", string_length);
break;
}

<<< fast path for short strings >>>
if (string_length == 4) {
switch read_u32(section_name_start){

This just hides the bounds-checking and unaligned-read inefficiencies
inside the read_u32 code.
  case ...:
  case ...:
  default: // Skip and ignore section.
  }
This switch won't be dense, so it won't get turned into a jump table by
the C++ compiler.

BTW, there's a bunch of other complexity in the V8 implementation that is
the result of strings, if you look at the src/wasm/module-header.h definition of
all those strings and all the unittests. It's vastly simpler with
single-byte section codes. I'm not sure making the strings shorter fixes
any of that.

  } else {
<<< optional slow path if handling larger strings >>>

I see no rationale for this change.

There is no use case for 'user-defined' sections, app developers are not
expected to be bundling in arbitrary payloads in this manner and there is
no way to even access them. The use case is web standards, even if
experimental or de-facto.

Actually, it's clear that they will, because debugging/source-mapping
sections are not going to be standardized for the engine, and will vary
across languages.

—
You are receiving this because you were mentioned.
Reply to this email directly, view it on GitHub
#740 (comment),
or mute the thread
https://github.com/notifications/unsubscribe-auth/ALnq1MhrJLjvaotRD2f41WsHk0umQS5gks5qi5DUgaJpZM4JZ_El
.

ghost · 2016-08-24T10:01:12Z

Would there be consensus to move to short strings as a trial, and then re-evaluate the situation? So use fixed four byte strings for all the core section names. V8 could clean out some of the string complexity, simplify unit tests etc. Would not be much different than having a 32 bit section id code, less room for collisions. Or how about 1 byte strings taking it to the limit?

titzer · 2016-08-24T15:01:02Z

While implementing this, I thought better of using the negative number trick and instead introduced a sentinel 0 byte for user-defined (string) section names. That avoids a little wart where one reads a byte from the stream, checks its value, and if it's not negative, "backs up" the stream 1 byte to use the normal varint32-parsing routine.

Thoughts?

titzer · 2016-08-25T10:49:43Z

On Wed, Aug 24, 2016 at 5:43 PM, KarlSchimpf notifications@github.com
wrote:

I for one, do not like what is being proposed here. It adds a BIG wart for
decompression. The rules are very complex and make it much harder to
implement a "general" decompression tool.

Since there are so few sections (when considering the size of a WASM file),
I'm surprised how much time is spent saving a few dozen bytes.

I also don't like that some argue that we should presume all "sections" are
already known. This makes integrating "decompression" algorithms more
problematic, in that the decompression algorithms have no logical place to
be (optionally) added.

I'm not sure I understand this comment, now that I've removed references to
strings for identifying unknown sections. The identification of sections is
just a single byte, as opposed to a string before. How does that make
decompression more complicated?

Karl Schimpf

Karl Schimpf

On Wed, Aug 24, 2016 at 8:29 AM, titzer notifications@github.com wrote:

Also, technically a negative number could be extended with more all-1's
upper bits, which would be another case to handle :-(

—
You are receiving this because you are subscribed to this thread.
Reply to this email directly, view it on GitHub
#740 (comment),
or mute the thread
<https://github.com/notifications/unsubscribe-auth/
ALRbV9DPXf1PCy7j4jNXQBCAY12QEVRmks5qjGNegaJpZM4JZ_El>

.

—
You are receiving this because you were mentioned.
Reply to this email directly, view it on GitHub
#740 (comment),
or mute the thread
https://github.com/notifications/unsubscribe-auth/ALnq1ICT_xeYd4NJPLFwfTrkTNRJ_l1Oks5qjGargaJpZM4JZ_El
.

ghost · 2016-08-25T12:15:07Z

Guess it's workable. Naming of the 'unknown' sections is deferred to a de-facto convention.

lukewagner · 2016-08-25T15:09:38Z

@titzer By removing the designated "here's where the eventual variety of unknown sections distinguish themselves" field, it seems likely we'll eventually end up with more ad hoc conventions that end up causing conflicts or ambiguities. If it makes it any cleaner to describe, could we perhaps have a separate "unknown section header" that goes inside the payload of unknown sections and that includes the string?

ghost · 2016-08-25T23:07:10Z

Yes, why not at least specify where an optional name goes in the 'unknown section header'. The length might be zero if there is none.

qwertie · 2016-08-26T08:44:10Z

I guess I should have kept my mouth shut before. I like having section names as an official part of the spec and separate from the payload.

naturaltransformation · 2016-08-26T23:38:29Z

BinaryEncoding.md

-for all sections. The encoding of sections is structured as follows:
+The module preamble is followed by a sequence of sections.
+Each section is identified by a 1-byte *section code* that encodes either a known section or a user-defined section.
+Known sections have non-zero ids, while unkown sections simply have a zero id and are ignored by the WebAssembly implementation.


Nit: typo unkown -> unknown

hemobo · 2016-08-28T18:01:29Z

Just floating an older idea here, feel free to shoot it down if it doesn't apply :-)

Did you consider adapting the proposal by @mbebenita for this purpose? (#561)

That is, just using LEB128 encoded ASCII for section codes while using short codes for official sections?

The argument against that proposal was uniform treatment of strings, but now with the whole point being that engines don't have to treat section codes as strings anymore, that doesn't seem to be too much of a concern here.

This could mean using the 128 "singleton" bytes as one byte codes for "official" sections ("ASCII strings of one character", but without paying any mind to the "value" of that character, i. e. just numbering sections from 0 to 128) and longer LEB values, that is "ASCII strings of more than one character", for non-standard sections.

This seems to share the advantage of engines only having to recognize one byte section codes while custom sections are just those that start with “continuation bytes”.
There would be no reading of string sizes in any case.

This also means that all possible (ASCII-) strings are available as non standard section codes, as long as they are at least two characters long.

From an encoder perspective they're all just LEB values, then. There's also having one big uniform name-space with low risk of collision between non-standard sections while maintaining a distinction between official and non-standard by means of the length of the section code.

Of course implementations wouldn't need to know anything about ASCII in any case, the string interpretation of custom section codes would be more like mnemonics for longer LEB values.

Optionally, the spec may also want to reserve two character section codes (that is two byte LEB values) for a distant future where there might be more than 128 official sections :-)
Non-standard codes would then simply be defined to start at three bytes.

If that day ever comes, engines could still get away with reading and comparing just two bytes.

naturaltransformation · 2016-09-14T16:55:22Z

Is there consensus on this one?

jfbastien · 2016-09-14T17:12:45Z

Considering that https://github.com/WebAssembly/webassembly.github.io/pull/14 has already landed I think the train has left the station for 0xC. I think this should be moved to 0xD.

sunfishcode · 2016-09-14T17:26:46Z

@jfbastien The binary in https://github.com/WebAssembly/webassembly.github.io/pull/14 does include the change here ("Section opcodes").

jfbastien · 2016-09-14T17:28:26Z

@jfbastien The binary in WebAssembly/webassembly.github.io#14 does include the change here ("Section opcodes").

Ah OK, it does seem like that merge was a tad optimistic then :-)

We should merge this PR then.

lukewagner · 2016-09-14T17:53:36Z

Ok, it looks like the remaining difference between this PR and what we discussed earlier is saying all user-defined sections have the string field.

titzer · 2016-09-14T18:14:40Z

Back by popular demand: string identifiers for unknown sections.

lukewagner · 2016-09-14T18:27:49Z

Thanks, lgtm

rossberg · 2016-09-14T18:32:50Z

I thought we wanted to make the string part of the payload, so that user sections still have regular structure?

lukewagner · 2016-09-14T18:44:37Z

Oh right, I forgot. So that effectively just switches the order of the id string and payload length in the current PR.

This is the same change as https://github.com/WebAssembly/webassembly.github.io/pull/14. Note that it includes the changes described in WebAssembly/design#740 however it doesn't have any user-defined or names sections, so it's unaffected by the details of how those are handled.

lukewagner · 2016-09-14T19:03:30Z

Also: one consequence of the current wording which I think is desirable and perhaps should be more-explicitly called out is that unknown-to-the-engine non-zero section codes will now cause a validation error; only user-defined sections will be silently ignored if unknown.

titzer · 2016-09-15T07:51:26Z

On Wed, Sep 14, 2016 at 8:33 PM, rossberg-chromium <notifications@github.com

wrote:

I thought we wanted to make the string part of the payload, so that user
sections still have normal structure?

That's the way that string section names were originally implemented, but
that was actually more complex to check for (in the case where you actually
want to parse the name, you have to check if the string length and string
are within the payload size).

—
You are receiving this because you were mentioned.
Reply to this email directly, view it on GitHub
#740 (comment),
or mute the thread
https://github.com/notifications/unsubscribe-auth/ALnq1P90JBg8-Qzd4D0rJacuE-2JDKHLks5qqD3ngaJpZM4JZ_El
.

rossberg · 2016-09-15T08:29:42Z

Sure, but it was a regular case then, now it's become irregular, and thus a minus for easy processing -- especially for consumers that just want to skip all but a few selected sections (like tools).

titzer · 2016-09-15T12:58:44Z

Righty-o. PTAL.

On Thu, Sep 15, 2016 at 10:30 AM, rossberg-chromium <
notifications@github.com> wrote:

Sure, but it was a regular case then, now it's become irregular, and thus
a minus for easy processing -- especially for consumers that just want to
skip all but a few selected sections (like tools).

—
You are receiving this because you were mentioned.
Reply to this email directly, view it on GitHub
#740 (comment),
or mute the thread
https://github.com/notifications/unsubscribe-auth/ALnq1K5Zc-hYZRhoAflii1cY0uGM21Akks5qqQIKgaJpZM4JZ_El
.

rossberg · 2016-09-15T13:03:39Z

lgtm

lukewagner · 2016-09-15T14:24:00Z

lgtm

@commonlisp

Merge pull request #1460 from Krovatkin:known_sections Per WebAssembly/design#740 this adds a few bits to support opcodes for known sections fyi @commonlisp @arunetm @cellulle @MikeHolman homeless PR looks for a nice review and branch to be adopted into  --- This change is [<img src="https://reviewable.io/review_button.svg" height="34" align="absmiddle" alt="Reviewable"/>](https://reviewable.io/reviews/microsoft/chakracore/1460)

titzer mentioned this pull request Aug 1, 2016

Use section codes instead of section names #734

Closed

sunfishcode added the binary format label Aug 1, 2016

sunfishcode added this to the MVP milestone Aug 1, 2016

Krovatkin mentioned this pull request Aug 17, 2016

opcodes for known sections chakra-core/ChakraCore#1460

Merged

Make name section a user-string section.

bfe20ee

lukewagner reviewed Aug 23, 2016
View reviewed changes

titzer added 2 commits August 23, 2016 17:09

Update BinaryEncoding.md

69dbea2

Update BinaryEncoding.md

ddaaf09

Use positive section code byte

9d54e9c

naturaltransformation reviewed Aug 26, 2016
View reviewed changes

Add string back

bc6a51b

sunfishcode mentioned this pull request Sep 14, 2016

Candidate for 0xc AngryBots WebAssembly/website#16

Merged

Make string part of the payload

d8690c5

titzer merged commit d8690c5 into binary_0xc Sep 15, 2016

titzer deleted the binary_0xc_section_codes3 branch September 15, 2016 14:39

Conversation

titzer commented Aug 1, 2016

Uh oh!

ghost commented Aug 2, 2016

Uh oh!

titzer commented Aug 16, 2016

Uh oh!

ghost commented Aug 16, 2016

Uh oh!

rossberg commented Aug 16, 2016

Uh oh!

ghost commented Aug 16, 2016

Uh oh!

lukewagner commented Aug 16, 2016

Uh oh!

qwertie commented Aug 17, 2016

Uh oh!

ghost commented Aug 17, 2016

Uh oh!

titzer commented Aug 17, 2016

Uh oh!

titzer commented Aug 17, 2016

Uh oh!

lukewagner commented Aug 17, 2016

Uh oh!

ghost commented Aug 18, 2016

Uh oh!

titzer commented Aug 22, 2016

Uh oh!

titzer commented Aug 23, 2016

Uh oh!

lukewagner Aug 23, 2016

Choose a reason for hiding this comment

Uh oh!

titzer Aug 23, 2016

Choose a reason for hiding this comment

Uh oh!

lukewagner commented Aug 23, 2016 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

ghost commented Aug 24, 2016

Uh oh!

titzer commented Aug 24, 2016

Uh oh!

titzer commented Aug 24, 2016 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

ghost commented Aug 24, 2016

Uh oh!

titzer commented Aug 24, 2016

Uh oh!

titzer commented Aug 25, 2016

Uh oh!

ghost commented Aug 25, 2016

Uh oh!

lukewagner commented Aug 25, 2016

Uh oh!

ghost commented Aug 25, 2016

Uh oh!

qwertie commented Aug 26, 2016

Uh oh!

naturaltransformation Aug 26, 2016

Choose a reason for hiding this comment

Uh oh!

hemobo commented Aug 28, 2016

Uh oh!

naturaltransformation commented Sep 14, 2016

Uh oh!

jfbastien commented Sep 14, 2016

Uh oh!

sunfishcode commented Sep 14, 2016

Uh oh!

jfbastien commented Sep 14, 2016

Uh oh!

lukewagner commented Sep 14, 2016

Uh oh!

titzer commented Sep 14, 2016

Uh oh!

lukewagner commented Sep 14, 2016

Uh oh!

lukewagner commented Aug 23, 2016 •

edited

Loading

titzer commented Aug 24, 2016 •

edited

Loading

rossberg commented Sep 14, 2016 •

edited

Loading