Rewrite BinaryEncoding.md to accurately represent the current v8-native decoder #520
| Original file line number | Diff line number | Diff line change |
|---|---|---|
|
|
@@ -3,120 +3,136 @@ | |
| This document describes the [portable](Portability.md) binary encoding of the | ||
| [Abstract Syntax Tree](AstSemantics.md) nodes. | ||
|
|
||
| The binary encoding is designed to allow fast startup, which includes reducing | ||
| download size and allowing for quick decoding. For more information, see the | ||
| [rationale document](Rationale.md#why-a-binary-encoding). | ||
|
|
||
| Reducing download size is achieved through three layers: | ||
|
|
||
| * The **raw** binary encoding itself, natively decoded by the browser, and to | ||
| be standardized in the [MVP](MVP.md). | ||
| * **Specific** compression of the binary encoding, of a kind that it is unreasonable to | ||
| expect a generic compression algorithm like gzip to achieve. | ||
| * This is not meant to be standardized, at least not initially, as it can be | ||
| done with a downloaded decompressor that runs as web content on the client, | ||
| and in particular can be implemented in a [polyfill](Polyfill.md). | ||
| * **Generic** compression, such as gzip, already supported in browsers. Other | ||
| compression algorithms being considered and which might be standardized | ||
| include: LZMA, [LZHAM](https://github.com/richgel999/lzham_codec), | ||
| [Brotli](https://datatracker.ietf.org/doc/draft-alakuijala-brotli/). | ||
|
|
||
| ## Variable-length integers | ||
| * [Polyfill prototype](https://github.com/WebAssembly/polyfill-prototype-1) shows significant size savings before (31%) and after (7%) compression. | ||
| * [LEB128](https://en.wikipedia.org/wiki/LEB128) except limited to uint32_t payloads. | ||
|
|
||
| ## Global structure | ||
|
|
||
| * A module contains (in this order): | ||
| - A header, containing: | ||
| + The [magic number](https://en.wikipedia.org/wiki/Magic_number_%28programming%29) | ||
| + Other data TBD | ||
| - A table (sorted by offset) containing, for each section: | ||
| + A string literal section type name | ||
| + 64-bit offset within the module | ||
| - A sequence of sections | ||
| * A section contains: | ||
| - A header followed by | ||
| - The section contents (specific to the section type) | ||
| * A `definitions` section contains (in this order): | ||
| - The generic section header | ||
| - A table (sorted by offset) containing, for each type which has operators: | ||
| + A standardized string literal [type name](AstSemantics.md#expression-types). | ||
| The index of a type name in this table is referred to as a type ID | ||
| + 64-bit offset of its operator table within the section | ||
| - A sequence of operator tables | ||
| - An operator table contains: | ||
| + A sequence of standardized string literal [operator names](AstSemantics.md), | ||
| where order determines operator index | ||
| * A `code` section contains (in this order): | ||
| - The generic section header | ||
| - A table (sorted by offset) containing, for each function: | ||
| + Signature | ||
| + Function attributes, valid attributes TBD (could include hot/cold, optimization level, noreturn, read/write/pure, ...) | ||
| + 64-bit offset within the section | ||
| - A sequence of functions | ||
| - A function contains: | ||
| + A table containing, for each type ID that has [locals](AstSemantics.md#local-variables): | ||
| * Type ID | ||
| * Count of locals | ||
| + The serialized AST | ||
| * A `data` section contains (in this order): | ||
| - The generic section header | ||
| - A sequence of byte ranges within the binary and corresponding addresses in the linear memory | ||
|
|
||
|
|
||
| All strings are encoded as null-terminated UTF8. Data segments represent | ||
| initialized data that is loaded directly from the binary into the linear memory | ||
| when the program starts (see [modules](Modules.md#linear-memory-section)). | ||
|
|
||
| ## Serialized AST | ||
|
|
||
| * Use a preorder encoding of the AST | ||
| * Efficient single-pass validation+compilation and polyfill | ||
| * The data of a node (if there is any) is written immediately after the operator and before child nodes | ||
| * The operator statically determines what follows, so no generic metadata is necessary. | ||
| The binary encoding is a general representation of syntax trees and module | ||
| information that enables small files, fast decoding, and reduced memory usage. | ||
| See the [rationale document](Rationale.md#why-a-binary-encoding) for more detail. | ||
|
|
||
| The encoding is split into three layers: | ||
|
|
||
| * **Layer 0** is a simple pre-order encoding of the AST and related data structures. | ||
| The encoding is dense and trivial to interact with, making it suitable for | ||
| scenarios like JIT, instrumentation tools, and debugging. | ||
| * **Layer 1** provides structural compression on top of layer 0, exploiting | ||
| specific knowledge about the nature of the syntax tree and its nodes. | ||
| The structural compression introduces more efficient encoding of values, | ||
| rearranges values within the module, and prunes structurally identical | ||
| tree nodes. | ||
| * **Layer 2** applies generic compression techniques, already available | ||
| in browsers and other tooling. Algorithms as simple as gzip can deliver | ||
| good results, but more sophisticated algorithms like | ||
| [LZHAM](https://github.com/richgel999/lzham_codec) and | ||
| [Brotli](https://datatracker.ietf.org/doc/draft-alakuijala-brotli/) are able | ||
| to deliver dramatically smaller files. | ||
|
|
||
| Most importantly, the layering approach allows development and standardization to | ||
| occur incrementally, even though production-quality implementations will need to | ||
| implement all of the layers. | ||
|
|
||
| # Primitives and key terminology | ||
|
|
||
| ### uint8 | ||
| A single-byte unsigned integer. | ||
|
|
||
| ### uint16 | ||
| A two-byte little endian unsigned integer. | ||
|
|
||
| ### uint32 | ||
| A four-byte little endian unsigned integer. | ||
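The fixed-width primitives above can be read with Python's standard `struct` module. This is a minimal sketch; the helper names `read_uint8`, `read_uint16` and `read_uint32` are illustrative and do not come from the prototype.

```python
import struct


def read_uint8(buf, pos):
    """Return (value, next_pos) for a single-byte unsigned integer."""
    return buf[pos], pos + 1


def read_uint16(buf, pos):
    """Return (value, next_pos) for a two-byte little-endian unsigned integer."""
    (value,) = struct.unpack_from("<H", buf, pos)
    return value, pos + 2


def read_uint32(buf, pos):
    """Return (value, next_pos) for a four-byte little-endian unsigned integer."""
    (value,) = struct.unpack_from("<I", buf, pos)
    return value, pos + 4


# One uint8, one uint16 and one uint32, stored back to back.
data = bytes([0x2A, 0x34, 0x12, 0x78, 0x56, 0x34, 0x12])
v8, pos = read_uint8(data, 0)
v16, pos = read_uint16(data, pos)
v32, pos = read_uint32(data, pos)
```

Because the least significant byte comes first, the bytes `34 12` decode to `0x1234`.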
|
|
||
| ### varuint32 | ||
|
Probably worthwhile to go ahead and define the other integer types and to be explicit that they are encoded in little-endian. |
||
| A [LEB128](https://en.wikipedia.org/wiki/LEB128) variable-length integer, limited to uint32 payloads. Provides considerable size reduction. | ||
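A sketch of unsigned LEB128 restricted to 32-bit payloads may make the size reduction concrete: values below 128 take one byte, and no value takes more than five. The function names are illustrative, and rejecting out-of-range input is an assumption about how a strict decoder would behave.

```python
def encode_varuint32(value):
    """Encode an integer in [0, 2**32) as unsigned LEB128 (1 to 5 bytes)."""
    if not 0 <= value <= 0xFFFFFFFF:
        raise ValueError("value does not fit in uint32")
    out = bytearray()
    while True:
        byte = value & 0x7F          # low 7 bits of the remaining value
        value >>= 7
        if value:
            out.append(byte | 0x80)  # continuation bit: more bytes follow
        else:
            out.append(byte)
            return bytes(out)


def decode_varuint32(buf, pos=0):
    """Decode a varuint32 starting at pos; return (value, next_pos)."""
    result = 0
    for shift in range(0, 35, 7):    # at most five 7-bit groups
        byte = buf[pos]
        pos += 1
        result |= (byte & 0x7F) << shift
        if byte < 0x80:              # continuation bit clear: last byte
            if result > 0xFFFFFFFF:
                raise ValueError("varuint32 out of range")
            return result, pos
    raise ValueError("varuint32 longer than 5 bytes")


encoded = encode_varuint32(624485)
decoded, next_pos = decode_varuint32(encoded)
```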
|
|
||
| ### Pre-order encoding | ||
| Refers to an approach for encoding syntax trees, where each node begins with an identifier, followed by any arguments or child nodes. | ||
| Pre-order trees can be decoded iteratively or recursively. Alternative approaches include post-order trees and table representations. | ||
|
|
||
| * Examples | ||
| * Given a simple AST node: `struct I32Add { AstNode *left, *right; }` | ||
| * First write the operator of `I32Add` (1 byte) | ||
| * Then recursively write the left and right nodes. | ||
|
|
||
| * Given a call AST node: `struct Call { uint32_t callee; vector<AstNode*> args; }` | ||
| * First write the operator of `Call` (1 byte) | ||
| * Then write the (variable-length) integer `Call::callee` (1-5 bytes) | ||
| * Then recursively write each arg node (arity is determined by looking up `callee` in table of signatures) | ||
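The two examples above can be sketched as a small encoder and decoder. The opcode values and the `GetLocal` leaf node are hypothetical and exist only to give the tree something to bottom out in; `arity_of` stands in for the table of signatures.

```python
from dataclasses import dataclass

# Hypothetical opcode values, chosen only for this illustration.
OP_I32_ADD, OP_CALL, OP_GET_LOCAL = 0x01, 0x02, 0x03


@dataclass
class GetLocal:          # hypothetical leaf node with one immediate
    index: int


@dataclass
class I32Add:
    left: object
    right: object


@dataclass
class Call:
    callee: int
    args: list


def write_varuint32(out, value):
    while True:
        byte = value & 0x7F
        value >>= 7
        if value:
            out.append(byte | 0x80)
        else:
            out.append(byte)
            return


def read_varuint32(buf, pos):
    result = 0
    for shift in range(0, 35, 7):
        byte = buf[pos]
        pos += 1
        result |= (byte & 0x7F) << shift
        if byte < 0x80:
            return result, pos
    raise ValueError("varuint32 longer than 5 bytes")


def encode(node, out):
    """Pre-order: operator byte, then immediates, then children."""
    if isinstance(node, I32Add):
        out.append(OP_I32_ADD)
        encode(node.left, out)
        encode(node.right, out)
    elif isinstance(node, Call):
        out.append(OP_CALL)
        write_varuint32(out, node.callee)
        for arg in node.args:
            encode(arg, out)
    elif isinstance(node, GetLocal):
        out.append(OP_GET_LOCAL)
        write_varuint32(out, node.index)
    else:
        raise TypeError("unknown node type")


def decode(buf, pos, arity_of):
    """Return (node, next_pos); arity_of maps callee index to argument count."""
    op = buf[pos]
    pos += 1
    if op == OP_I32_ADD:
        left, pos = decode(buf, pos, arity_of)
        right, pos = decode(buf, pos, arity_of)
        return I32Add(left, right), pos
    if op == OP_CALL:
        callee, pos = read_varuint32(buf, pos)
        args = []
        for _ in range(arity_of[callee]):
            arg, pos = decode(buf, pos, arity_of)
            args.append(arg)
        return Call(callee, args), pos
    if op == OP_GET_LOCAL:
        index, pos = read_varuint32(buf, pos)
        return GetLocal(index), pos
    raise ValueError("unknown operator")


tree = I32Add(GetLocal(0), Call(1, [GetLocal(2)]))
encoded = bytearray()
encode(tree, encoded)
decoded, end = decode(bytes(encoded), 0, {1: 1})
```

The decoder never needs a length prefix or an end marker for a node: the operator byte, plus the signature lookup for calls, determines exactly how many children follow.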
|
|
||
| ## Backwards Compatibility | ||
|
|
||
| As explained above, for size- and decode-efficiency, the binary format will serialize AST nodes, | ||
| their contents and children using dense integer indices and without any kind of embedded metadata | ||
| or tagging. This raises the question of how to reconcile the efficient encoding with the | ||
| backwards-compatibility goals. | ||
|
|
||
| Specifically, we'd like to avoid the following situation: a future version of WebAssembly has | ||
| features F1 and F2. Vendor V1 implements F1, assigning the next logical operator indices to F1's | ||
| new operators, while vendor V2 implements F2, assigning the same indices to F2's new operators. | ||
| A single binary that uses either F1 or F2 then has ambiguous semantics. This type of | ||
| non-linear feature addition is commonplace in JavaScript and Web APIs and is guarded against by | ||
| having unique names for unique features (and associated [conventions](https://hsivonen.fi/vendor-prefixes/)). | ||
|
|
||
| The current proposal is to maintain both the efficiency of indices in the [serialized AST](BinaryEncoding.md#serialized-ast) and the established | ||
| conflict-avoidance practices surrounding string names: | ||
| * The WebAssembly spec doesn't define any global index spaces | ||
| * So, as a general rule, no magic numbers in the spec (other than the literal [magic number](https://en.wikipedia.org/wiki/Magic_number_%28programming%29)). | ||
| * Instead, a module defines its *own* local index spaces of operators by providing tables *of names*. | ||
| * So what the spec *would* define is a set of names and their associated semantics. | ||
| * To avoid (over time) large index-space declaration sections that are largely the same | ||
| between modules, finalized versions of standards would define named baseline index spaces | ||
| that modules could optionally use as a starting point to further refine. | ||
| * For example, to use all of [the MVP](MVP.md) plus | ||
| [SIMD](PostMVP.md#fixed-width-simd) the declaration could be "base" | ||
| followed by the list of SIMD operators used. | ||
| * This feature would also be most useful for people handwriting the [text format](TextFormat.md). | ||
| * However, such a version declaration does not establish a global "version" for the module | ||
| or affect anything outside of the initialization of the index spaces; decoders would | ||
| remain versionless and simply add cases for new *names* (as with current JavaScript parsers). | ||
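The proposal above can be illustrated with a toy decoder that resolves each module's declared operator names into a module-local index space. The operator names and the `build_index_space` helper are hypothetical; the document does not fix the actual names.

```python
import operator

# Hypothetical operator names and semantics, for illustration only.
KNOWN_OPERATORS = {
    "i32.add": operator.add,
    "i32.sub": operator.sub,
    "i32.mul": operator.mul,
}


def build_index_space(declared_names):
    """Map a module's declared operator names to handlers, in declaration order."""
    table = []
    for name in declared_names:
        handler = KNOWN_OPERATORS.get(name)
        if handler is None:
            # An unimplemented feature is detected by name, never by index.
            raise ValueError("unknown operator name: " + name)
        table.append(handler)
    return table


# Two modules may assign index 0 to different operators without conflict.
module_a = build_index_space(["i32.add", "i32.mul"])
module_b = build_index_space(["i32.sub", "i32.add"])
```

Since indices only have meaning relative to a module's own table, two vendors adding different features cannot collide on an index.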
|
|
||
| ## Proposals | ||
|
|
||
| The native prototype built for [V8](https://github.com/WebAssembly/v8-native-prototype) | ||
| implements a binary format that embodies most, but not all, of the ideas in this document. | ||
| It is described in detail in a [public design doc](https://docs.google.com/a/google.com/document/d/1761v1AfhFM5kE8NArF_PyXcl-iVh0Dx3InOrmcyIoiI/pub) and a [copy of the original](https://docs.google.com/document/d/1-G11CnMA0My20KI9D7dBR6ZCPOBCRD0oCH6SHCPFGx0/edit?usp=sharing). | ||
| ### Strings | ||
| Strings referenced by the module (i.e. function names) are encoded as null-terminated [UTF8](http://unicode.org/faq/utf_bom.html#UTF8). | ||
|
|
||
| # Module structure | ||
|
|
||
| The following documents the current prototype format. This format is based on and supersedes the v8-native prototype format, originally in a [public design doc](https://docs.google.com/document/d/1-G11CnMA0My20KI9D7dBR6ZCPOBCRD0oCH6SHCPFGx0/edit?usp=sharing). | ||
|
|
||
| ## High-level structure | ||
| A module is a stream of sections. Each section consists of: | ||
| - ```uint8```: A [section type identifier](https://github.com/v8/v8/blob/master/src/wasm/wasm-module.h#L26) for the section | ||
| - The section body (defined below by section type) | ||
|
|
||
| ### Memory section | ||
| A module may only contain one memory section. | ||
|
|
||
| * ```uint8```: The minimum size of the module heap in bytes, as a power of two | ||
| * ```uint8```: The maximum size of the module heap in bytes, as a power of two | ||
| * ```uint8```: ```1``` if the module's memory is externally visible | ||
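A sketch of decoding the three bytes of the memory section body follows. It assumes that "as a power of two" means each size byte stores the exponent, so a byte value of 16 denotes 2^16 bytes; that reading, the min/max check, and the function name are assumptions rather than confirmed prototype behaviour.

```python
def decode_memory_section(buf, pos):
    """Decode a memory section body; return (info, next_pos)."""
    min_log2, max_log2, visible = buf[pos], buf[pos + 1], buf[pos + 2]
    if min_log2 > max_log2:
        # Assumed sanity check; not stated in the format description.
        raise ValueError("minimum heap size exceeds maximum")
    info = {
        "min_bytes": 1 << min_log2,   # assumption: byte holds log2(size)
        "max_bytes": 1 << max_log2,
        "externally_visible": visible == 1,
    }
    return info, pos + 3


memory_info, end = decode_memory_section(bytes([16, 20, 1]), 0)
```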
|
|
||
| ### Signatures section | ||
|
Please mention the encoding of the signatures section type identifier. |
||
| A module may only contain one signatures section. | ||
|
|
||
| * [```varuint32```](#varuint32): The number of function signatures in the section | ||
| * For each function signature: | ||
| - ```uint8```: The number of parameters | ||
| - ```uint8```: The function return type, as a [LocalType](https://github.com/v8/v8/blob/master/src/wasm/wasm-opcodes.h#L16) | ||
|
Please also mention the encoding of local types. |
||
| - For each parameter: | ||
| + ```uint8```: The parameter type, as a LocalType | ||
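The signatures layout above can be decoded as sketched below. LocalType values are kept as raw integers, since their numeric encoding is not given here; the function names are illustrative.

```python
def read_varuint32(buf, pos):
    """Decode an LEB128 varuint32; return (value, next_pos)."""
    result = 0
    for shift in range(0, 35, 7):
        byte = buf[pos]
        pos += 1
        result |= (byte & 0x7F) << shift
        if byte < 0x80:
            return result, pos
    raise ValueError("varuint32 longer than 5 bytes")


def decode_signatures(buf, pos):
    """Return ([(return_type, [param_type, ...]), ...], next_pos)."""
    count, pos = read_varuint32(buf, pos)
    signatures = []
    for _ in range(count):
        param_count = buf[pos]        # uint8: number of parameters
        return_type = buf[pos + 1]    # uint8: raw LocalType value
        pos += 2
        params = list(buf[pos:pos + param_count])
        if len(params) != param_count:
            raise ValueError("truncated signature")
        pos += param_count
        signatures.append((return_type, params))
    return signatures, pos


# Two signatures: one with two parameters, one with none.
body = bytes([2, 2, 1, 1, 1, 0, 0])
signatures, end = decode_signatures(body, 0)
```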
|
|
||
| ### Functions section | ||
| This section must be preceded by a [Signatures](#signatures-section) section. A module may only contain one functions section. | ||
|
|
||
| * ```varuint32```: The number of functions in the section | ||
| * For each function: | ||
| - ```uint8```: The [function declaration bits](https://github.com/v8/v8/blob/master/src/wasm/wasm-module.h#L39) | ||
|
Please mention the encoding of the declaration bits. |
||
| - ```uint16```: The function signature (as an index into the Signatures section) | ||
| - If the ```kDeclFunctionName``` bit is set: | ||
| + ```uint32```: The offset of the function name in the file. | ||
| - If the ```kDeclFunctionImport``` bit is set, **the function entry ends here** | ||
|
Reverse the condition here, if the kDeclFunctionImport bit is not set, then a function body follows.
Author: But that's not true, because the locals block is also conditional.
So we need to decide if the local flag is either ignored when kDeclFunctionImport is set or validated to be zero? @titzer |
||
| - If the ```kDeclFunctionLocals``` bit is set: | ||
| + ```uint16```: The number of i32 locals | ||
| + ```uint16```: The number of i64 locals | ||
| + ```uint16```: The number of f32 locals | ||
| + ```uint16```: The number of f64 locals | ||
| - ```uint16```: The size of the function body, in bytes | ||
| - The function body | ||
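A sketch of a decoder for the function entries as listed above. The numeric values of the declaration bits are assumptions (the text only links to the V8 header), and the sketch simply stops at an import entry without checking the locals bit, which is the open question raised in review.

```python
import struct

# Assumed bit values; the authoritative ones live in the linked V8 header.
K_DECL_FUNCTION_NAME = 0x01
K_DECL_FUNCTION_IMPORT = 0x02
K_DECL_FUNCTION_LOCALS = 0x04


def read_varuint32(buf, pos):
    result = 0
    for shift in range(0, 35, 7):
        byte = buf[pos]
        pos += 1
        result |= (byte & 0x7F) << shift
        if byte < 0x80:
            return result, pos
    raise ValueError("varuint32 longer than 5 bytes")


def decode_functions(buf, pos):
    """Return (list of function dicts, next_pos)."""
    count, pos = read_varuint32(buf, pos)
    functions = []
    for _ in range(count):
        decl_bits = buf[pos]
        (sig_index,) = struct.unpack_from("<H", buf, pos + 1)
        pos += 3
        func = {"decl_bits": decl_bits, "sig_index": sig_index,
                "name_offset": None, "locals": None, "body": None}
        if decl_bits & K_DECL_FUNCTION_NAME:
            (func["name_offset"],) = struct.unpack_from("<I", buf, pos)
            pos += 4
        if decl_bits & K_DECL_FUNCTION_IMPORT:
            functions.append(func)    # the entry ends here for imports
            continue
        if decl_bits & K_DECL_FUNCTION_LOCALS:
            # Counts of i32, i64, f32 and f64 locals, in that order.
            func["locals"] = struct.unpack_from("<4H", buf, pos)
            pos += 8
        (body_size,) = struct.unpack_from("<H", buf, pos)
        pos += 2
        body = bytes(buf[pos:pos + body_size])
        if len(body) != body_size:
            raise ValueError("truncated function body")
        func["body"] = body
        pos += body_size
        functions.append(func)
    return functions, pos


# One named function with locals and a 3-byte body, then one import.
section = (
    bytes([2])
    + struct.pack("<BHI4HH", 0x05, 0, 0x40, 1, 0, 0, 2, 3) + b"\xAA\xBB\xCC"
    + struct.pack("<BH", 0x02, 1)
)
functions, end = decode_functions(section, 0)
```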
|
|
||
| ### Globals section | ||
|
This section is only used for asm.js translation internally in V8, and as such was described in the V8 document. Maybe we should just mention this byte is reserved for internal use.
Author: That's not possible, because the length header is # of globals instead of # of bytes. If we properly laid out sections with length-in-bytes headers we could reserve section types for internal use.
The point is that this section will never occur in wasm, so it will never be necessary to even skip it - it just will not exist in the wasm binary encoding. Until perhaps thread local variable support is added. How is the consensus to add a length header to all sections?
Author: If we later reintroduce the concept of globals (for TLS, for example), this section could end up being adopted. So 'never' is a little strong here. Unless it's actually disabled in every implementation, it's always possible a wasm module will ship out in production on the web that contains this section. So we can't ignore it as long as it's in the code. |
||
| A module may only contain one globals section. This section is currently for V8 internal use. | ||
|
|
||
| * ```varuint32```: The number of global variable declarations in the section. | ||
| * For each global variable: | ||
| - ```uint32```: The offset of the global variable name in the file. | ||
| - ```uint8```: The type of the global, as a [MemType](https://github.com/v8/v8/blob/master/src/wasm/wasm-opcodes.h#L25) | ||
| - ```uint8```: ```1``` if the global is exported | ||
|
|
||
| ### Data Segments section | ||
| A module may only contain one data segments section. | ||
|
|
||
| * ```varuint32```: The number of data segments in the section. | ||
| * For each data segment: | ||
| - ```uint32```: The base address of the data segment in memory. | ||
| - ```uint32```: The offset of the data segment's data in the file. | ||
| - ```uint32```: The size of the data segment (in bytes) | ||
| - ```uint8```: ```1``` if the segment's data should be automatically loaded into memory at module load time. | ||
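A sketch of decoding the data segment table and copying auto-loaded segments into linear memory. The bounds checks and helper names are assumptions. For simplicity the toy module below starts directly with the section body, with no section type byte in front.

```python
import struct


def read_varuint32(buf, pos):
    result = 0
    for shift in range(0, 35, 7):
        byte = buf[pos]
        pos += 1
        result |= (byte & 0x7F) << shift
        if byte < 0x80:
            return result, pos
    raise ValueError("varuint32 longer than 5 bytes")


def decode_data_segments(buf, pos):
    """Return (list of segment dicts, next_pos)."""
    count, pos = read_varuint32(buf, pos)
    segments = []
    for _ in range(count):
        base, offset, size, init = struct.unpack_from("<IIIB", buf, pos)
        pos += 13                     # three uint32 fields plus one uint8
        segments.append({"base": base, "offset": offset,
                         "size": size, "init": init == 1})
    return segments, pos


def load_segments(module_bytes, segments, memory):
    """Copy every auto-loaded segment from the file into linear memory."""
    for seg in segments:
        if not seg["init"]:
            continue
        data = module_bytes[seg["offset"]:seg["offset"] + seg["size"]]
        if len(data) != seg["size"]:
            raise ValueError("segment data lies outside the file")
        if seg["base"] + seg["size"] > len(memory):
            raise ValueError("segment does not fit in linear memory")
        memory[seg["base"]:seg["base"] + seg["size"]] = data


# The table is 14 bytes long, so the payload starts at file offset 14.
table = bytes([1]) + struct.pack("<IIIB", 8, 14, 5, 1)
module_bytes = table + b"hello"
segments, end = decode_data_segments(module_bytes, 0)
memory = bytearray(16)
load_segments(module_bytes, segments, memory)
```

Offsets are relative to the start of the file, so the segment bytes can sit after the End section, where the decoder does not parse them.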
|
|
||
| ### Function Table section | ||
| This section must be preceded by a [Functions](#functions-section) section. | ||
|
|
||
| * ```varuint32```: The number of function table entries in the section | ||
| * For each function table entry: | ||
| - ```uint16```: The index of the function (in the [Functions](#functions-section) section's index space) | ||
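A sketch of decoding the function table. Validating each entry against the number of declared functions is an assumption about what a decoder would want to check; `function_count` would come from the Functions section decoded earlier.

```python
import struct


def read_varuint32(buf, pos):
    result = 0
    for shift in range(0, 35, 7):
        byte = buf[pos]
        pos += 1
        result |= (byte & 0x7F) << shift
        if byte < 0x80:
            return result, pos
    raise ValueError("varuint32 longer than 5 bytes")


def decode_function_table(buf, pos, function_count):
    """Return (list of function indices, next_pos)."""
    count, pos = read_varuint32(buf, pos)
    table = list(struct.unpack_from("<%dH" % count, buf, pos))
    pos += 2 * count
    for index in table:
        if index >= function_count:
            raise ValueError("function table entry out of range")
    return table, pos


body = bytes([3]) + struct.pack("<3H", 2, 0, 1)
table, end = decode_function_table(body, 0, function_count=3)
```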
|
|
||
| ### WLL section | ||
|
|
||
| * ```varuint32```: The size of the section body, in bytes | ||
| * The section body (contents currently undefined) | ||
|
|
||
| ### End section | ||
| This indicates the end of the module's sections. Additional data can follow this section marker (for example, to store function names or data segment bodies) but it is not parsed by the decoder. | ||
this text seems to have been removed, was that intentional?
(I mean this line and the 2 below it, specifically, not the whole removed block in the diff)
Yes, I don't believe there was ever a consensus to not standardize Layer 1/Layer 2. The opposite is true: IIRC all the discussions have involved eventually standardizing them. User-space is a stopgap.
Well, the previous text says "this is not meant to be standardized, at least not initially" - which I thought was the consensus, and hence was written here? How about just adding a note that the Specific compression is not initially going to be standardized, but the option is open to do so later?
If that's the actual decision I can revise it to say that. But we never arrived at that decision and I would object to it, so maybe we should start a thread to discuss it (issue in the design repo, maybe?)
Do you want me to kick off that thread?
Luke's prototype (our only real-world size test other than mine) basically had a bunch of layer 1 elements baked into it in order to get a good size reduction when combined with gzip. Layer 0 + gzip won't be a compelling improvement unless we pull a bunch of layer 1 stuff into layer 0 (which, to be fair, people seem to want to do anyway).
We can get more real-world data now. I tested on zlib built with emscripten, and ran `asm2wasm`, then `wasm-as` which emits the current binary format. I see a 33% size reduction before gzip, and 16% size reduction after gzip.
And in addition to the 16% smaller download, we will have massively faster parsing. Overall that seems like a very compelling MVP to me.
Is 16% post-gzip reduction instead of the ~45% from luke's prototype compelling enough to justify the amount of work necessary to implement this across the web ecosystem? Consumers of wasm will need to ship the polyfill and deal with that for a period of time as well.
In comparison, compressing bananabread's raw asm.js with lzham delivers a ~25% size reduction on top of gzip's. Deploying a new compression codec is much easier compared to shipping wasm everywhere. People are going to ask questions like: "Brotli is going into every web browser already, so why not just compress my asm.js with that?"
For reference, layer 0 as defined previously is larger than v8-native. We can just redefine things so that v8-native is the only thing we spec, and ship that. That has obvious downsides and I'm honestly tired of having to explain them over and over.
On Thu, Jan 21, 2016 at 8:25 PM, Alon Zakai notifications@github.com
wrote:
@titzer: I ran @lukewagner's prototype on the same code (zlib) now. Before gzip it is 64% smaller, and after it is 25% smaller. So it is significantly better before gzip, but less so after gzip.
Possibly the main factor here is that @lukewagner's prototype had more tricks in it than the v8 binary format. Makes sense that would matter more before gzip.
Yes, binaryen's binary support uses I8Const.
@kg: sorry if I'm asking something you've already explained in detail elsewhere. Is there a link to the previous discussions or summaries of them?
Overall, I don't think we have a problem of viability here. With the apples-to-apples comparison in the first part of this comment, the current v8 binary format is 15% smaller vs @lukewagner's which is 25% smaller. That's a significant difference, but I think both are excellent numbers for an MVP. We certainly intend to do better later, but there's no need to rush, I think.
Furthermore, we may do better at the MVP stage, with layer 1 stuff in userspace. We might also modify layer 0 in the choice of opcodes or ordering or such that improves things, even without layer 1.
More importantly, smaller downloads are a crucial piece here, but not the only one. At the MVP stage, we will also have
Even with an initially more moderate download size improvement, I think the wasm MVP definitely justifies itself.
I think we all agree that layer 1 is important to work on, and will matter a lot. The only question I'm raising here is, previously we wrote that we would leave layer 1 standardization to post-MVP, and I don't see a reason to change that?