diff --git a/PushPop.md b/PushPop.md new file mode 100644 index 00000000..aa2e3489 --- /dev/null +++ b/PushPop.md @@ -0,0 +1,112 @@ +# Text Format Idea: Explicit Push and Pop + +Push and pop are an idea for visually splitting up expression trees. Push +and pop connect subtrees to their parents, allowing them to be written +separately in the text syntax, but still be part of the same conceptual tree +in the wasm semantics, and in the wasm binary format. + +Here's the proposed text syntax for the `Q_rsqrt` example from TextFormat.md, +but with `push` and `pop`: + +``` + function $Q_rsqrt ($0:f32) : (f32) { + var $1:f32 + $1 = f32.reinterpret/i32 (1597463007 - ((i32.reinterpret/f32 $0) >> 1)) + push:0 $0 = $0 * 0x1p-1 + $1 = $1 * (0x1.8p0 - $1 * pop:0 * $1) + $1 * (0x1.8p0 - $1 * $0 * $1) + } +``` + +Note that the original version has a `set_local` buried in the middle of a +tree, making it easy for a human to miss. Humans wouldn't write code that +way, but in wasm, compilers are *incentivised* to write it that way, because +it reduces code size. It's going to happen a lot, and the push/pop mechanism +gives us a way to make this more readable in many cases. + + +## Discussion + +In a normal programming language, the preferred way to split up a large +expression tree would be to simply assign some subtrees to their own local +variables. Of course compilers can optimize them away as needed, so there's +no reason not to do this. + +However in wasm, introducing locals increases code size, so +compilers producing wasm aren't going to do that. There will be a lot of code +in the wild with very large monolithic trees, because compilers will be writing +code that way to minimize code size. And, binary->text translation can't +introduce local variables, because that would make binary->text->binary lossy. + +The solution proposed here: `push` and `pop`. 
`push` pushes subtrees onto a +conceptual stack, and `pop` pops them and conceptually connects them to the +tree at that point. It's important to realize that this is purely a +text-format device. These constructs just exist to build trees. In the abstract +wasm semantics and in the binary format, the trees just exist in monolithic +form. + +Now there's a question: how should a binary->text translator decide where to +split up trees? It turns out, we can let binary->text translators choose what +they think is best in their situation: + + - Split trees at `set_local` operators. This is what the examples here do, + and it strikes a balance, delivering readability while still keeping the code + fairly concise. + - Split trees at nodes with "side effects" (call, `store`, etc.). This can + additionally aid in debugging, as one can clearly see where the side effects + occur and step through them. + - Split trees at *all* points. This essentially puts every instruction on its + own line, which may sometimes be useful for single-step debugging scenarios, + or for compiler writers. + - Don't split trees at all. Maximum bushiness. + +Each of these strategies maps back to the same binary format. A single text +format can support a wide variety of use cases, because binary->text +translators can split up trees to fit the need at hand. + + +## Details + +Expressions containing multiple pops perform their pops right-to-left. This is +surprising at first, but it makes sense when you look at wasm's evaluation order. +For example: + +``` + push:0 call $foo() + push:1 call $bar() + call $qux(pop:0, pop:1) +``` + +Clearly, this syntax should evaluate the call to `$foo` before the call to +`$bar`. And in the wasm semantics, the call to `$qux` evaluates its operands in +the order they appear. Both of these principles are completely intuitive. Put +together as they are here, they imply that the first pop corresponds to the +first push, which effectively means that the pops happen right-to-left. 
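The right-to-left pop rule described above can be sketched in a few lines of Python. This is only an illustration, not part of the proposal: the tuple-based tree representation, the `POP` placeholder, and the names `resolve` and `reconnect` are all hypothetical, chosen to show how a text->binary translator might reconnect pushed subtrees to their parents.

```python
# Hypothetical mini-representation (an assumption for this sketch):
# an expression is either the placeholder POP or a tuple
# (operator, operand, operand, ...).
POP = "pop"

def resolve(expr, stack):
    """Replace POP placeholders in expr with previously pushed subtrees.

    Operands are visited right-to-left, so the rightmost pop takes the
    most recently pushed subtree.
    """
    if expr == POP:
        return stack.pop()
    if isinstance(expr, tuple):
        op, *operands = expr
        resolved = [resolve(operand, stack) for operand in reversed(operands)]
        return (op, *reversed(resolved))
    return expr

def reconnect(statements):
    """Turn ("push", expr) / ("expr", expr) statements into whole trees."""
    stack, trees = [], []
    for kind, expr in statements:
        tree = resolve(expr, stack)
        if kind == "push":
            stack.append(tree)
        else:
            trees.append(tree)
    if stack:
        raise ValueError("pushed expression was never popped")
    return trees

# The $foo / $bar / $qux example from the text.
statements = [
    ("push", ("call $foo",)),
    ("push", ("call $bar",)),
    ("expr", ("call $qux", POP, POP)),
]
trees = reconnect(statements)
```

With this reading, the first operand of `$qux` ends up being the call to `$foo` and the second the call to `$bar`, which matches the evaluation order argued for above.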
+ +The `:0` and `:1` are stack-depth indicators, which can be useful in pairing +up pushes with their corresponding pops. + +Some additional rules governing push and pop are: + + - Pushed expressions must be popped within the same block as the push. + - Stack-depth indicators start at 0 at the beginning of each block. + - Sequences of trees tied together with push and pop must be contiguous. + Arbitrary blocks can be placed in the middle of trees, but their return value + has to be consumed by some node in the tree. + +These rules reflect how the current wasm binary format works. If there are +changes to wasm, these rules would change accordingly. + + +## Answers to anticipated questions + +Q: How about replacing push/pop with something more flexible? + +A: Push/pop as described here are meant to be a direct reflection of WebAssembly + itself. For example, it would be convenient to replace `push` with + something that would allow a value to be used multiple times. However, + push/pop are representing expression tree edges in WebAssembly, which + can only have a single definition and a single use. The way to use a value + multiple times in WebAssembly is to use `set_local` and `get_local`. + + diff --git a/TextFormat.md b/TextFormat.md index 0aacd163..4a926fe6 100644 --- a/TextFormat.md +++ b/TextFormat.md @@ -1,30 +1,25 @@ # Text Format The purpose of this text format is to support: + * View Source on a WebAssembly module, thus fitting into the Web (where every source can be viewed) in a natural way. * Presentation in browser development tools when source maps aren't present (which is necessarily the case with [the Minimum Viable Product (MVP)](MVP.md)). -* Writing WebAssembly code directly for reasons including pedagogical, - experimental, debugging, optimization, and testing of the spec itself. +* Working with WebAssembly code directly for reasons including pedagogical, + experimental, debugging, profiling, optimization, and testing of the spec + itself. 
The text format is equivalent and isomorphic to the [binary format](BinaryEncoding.md). -The text format will be standardized, but only for tooling purposes: -* Compilers will support this format for `.S` and inline assembly. -* Debuggers and profilers will present binary code using this textual format. -* Browsers will not parse the textual format on regular web content in order to - implement WebAssembly semantics. - -Given that the code representation is actually an -[Abstract Syntax Tree](AstSemantics.md), the syntax would contain nested -statements and expressions (instead of the linear list of instructions most -assembly languages have). +The text format will be standardized, but only for tooling purposes; browsers +will not parse the textual format on regular web content in order to implement +WebAssembly semantics. -There is no requirement to use JavaScript syntax; this format is not intended to -be evaluated or translated directly into JavaScript. There may also be +The text format does not use JavaScript syntax; it is not intended to +be evaluated or translated directly into JavaScript. There are also substantive reasons to use notation that is different than JavaScript (for -example, WebAssembly has a 32-bit integer type, and it should be represented +example, WebAssembly has a 32-bit integer type, and it is represented in the text format, since that is the natural thing to do for WebAssembly, regardless of JavaScript not having such a type). On the other hand, when there are no substantive reasons and the options are basically @@ -41,39 +36,462 @@ represented as hexadecimal floating-point as specified by the C99 standard, whic IEEE-754-2008 section 5.12.3 also specifies. The textual format may be improved to also support more human-readable representations, but never at the cost of accurate representation. -# Official Text Format +# ~~Official~~*Experimental* Text Format + +## This is an experiment! 
+ +This document is a sketch of a possible Text Format proposal for WebAssembly to +use for the "View Source" functionality in browsers. WebAssembly looks enough +like a programming language that it tends to activate our programmer intuitions +about syntax, but it differs from normal programming languages in numerous +respects, so we don't fully trust our intuitions. + +So, we're sketching something up, and building a trial implementation of it in +Firefox. This way, we can try it out on real code in a real browser setting, and +see if it actually "works" in practice. Maybe we'll like it and propose it to +the official WebAssembly project. Maybe it'll need changes. Or maybe it'll +totally flop and we'll drop it and pursue something completely different! + +Comments, questions, suggestions, and reactions are welcome on +[this repo's issue tracker](https://github.com/sunfishcode/design/issues) for +the moment. As the experiment progresses, we may shift to other discussion +forums, but for now we're keeping it simple. + + +## Philosophy: + + - Use JS-style sensibilities when there aren't reasons otherwise. + - It's a compiler target, not a programming language, but readability still counts. + +## High-level summary: + + - Curly braces for function bodies, blocks, etc., `/* */`-style and `//`-style + comments, and whitespace is not significant. + (TODO: Should `/* */`-style comments nest properly?) + + - `get_local` looks like a simple reference; `set_local` looks like an + assignment. Constants use a simple literal syntax. This makes wasm's most + frequent opcodes very concise. + + - Infix syntax for arithmetic, with simple overloading. Explicit grouping via + parentheses. Concise and familiar with JS and others. (TODO: Use C/JS-style + operator precedence, or fix + [an old mistake](http://www.lysator.liu.se/c/dmr-on-or.html)?) + + - Prefix syntax with operands in parentheses for most other operators (e.g. + `i32.rotl($0, 8)`). 
For less frequent opcodes, prefer just presenting operator + names, so that they're easy to identify. + + - TypeScript-style `name : type` declarations. + + - Parentheses around call arguments, e.g. `$functionname(arg, arg, arg)`, + and `if` conditions, e.g. `if ($condition) { $then() } else { $else() }`, + because they're familiar to many people and not too intrusive. + + - Allow highly complex trees to be syntactically split up into readable parts. + + - Put labels "where they go". + + - The text format will be compatible with the [LES](http://loyc.net/les) text + format. It _is not_ compatible with the current LES specification, but LES + is in beta and can still be tweaked to wasm's needs. Based on the wasm text + format, a third version of LES (LESv3) will be drafted before the end of 2016. + Meanwhile, the wasm text format will be syntactically constrained in such a + way that it will be an appropriate basis for LESv3. For the MVP, parsers of + the wasm text format will be able to choose whether to use a custom parser + dedicated to wasm or a generic LES parser. + + - TODO: should semicolons be required at the end of each expression + in a block? If newlines are the primary separator, then LES will cease to + be a superset of JSON (since JSON ignores newlines), but there are benefits + on the flip side (such as eliminating the need for semicolons!). In this + document it is assumed that a newline **does** mark the end of an + expression if the newline does not appear directly inside parentheses (as + inside parentheses, expressions are always terminated by commas or by a + closing parenthesis). In any case it would be useful to _allow_ semicolons, + so that one can write multiple expressions on a single line. + +## Examples: + +### Basics + +``` + function $@fac-opt($a:i64) : i64 { + $x:i64 + $x = 1 + br_if end ? $a < 2 + loop $loop { + $x = $x * $a + $a = $a + -1 + br_if loop ? 
$a > 1 + } + :end + $x + } +``` + +(hand-translated from [fac.wast](https://github.com/WebAssembly/spec/blob/master/ml-proto/test/fac.wast)) + +The `$` sigil on function and variable names cleanly ensures that they never +collide with wasm keywords, present or future. The `@` sign on `fac-opt` allows +certain special characters to appear in identifiers, such as `-` which would +otherwise be treated as a subtraction operator. + +The function return type can have parentheses (`: (i64)`) for symmetry with the +parameter types, since we anticipate adding multiple return values to wasm in the +future, but they are not required. + +The curly braces around the function body are not a `block` node; they are part +of the function syntax, reflecting how function bodies in wasm are block-like. + +The last expression of the function body here acts as its return value. This +works in all block-like constructs (`block`, function body, `if`, etc.) + +`>` means *signed* greater-than. Unsigned operators will have a `|` before the last character of the operator, so `|>` is *unsigned* greater-than. + +`br_if` uses a question mark to announce the condition operand. `select` does +also. (TODO: Is this too cute? Also, should the order be reversed as in +`br_if $a < 2 ? end`?) + +### Linear memory addresses + +``` + function $test_redundant_load() : (i32) { + i32.load [8,+0] + f32.store [5,+0] = -0x0p0 + i32.load [8,+0] + } +``` + +(hand-translated from [memory_redundancy.wast](https://github.com/WebAssembly/spec/blob/master/ml-proto/test/memory_redundancy.wast)) + +Addresses are printed as `[base,+offset]`. It could be shortened to `[base]` when +there is no offset; I made the offset explicit above just to illustrate the syntax. +There can also be an optional `align …` for non-natural alignments, e.g. +`i32.load [8,+0, align 2]`. 
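As a rough illustration of the address syntax just described, here is a small Python sketch of how a binary->text printer might render a memory operand. The function name `format_address` and its parameters are hypothetical, and the sketch always prints the offset explicitly (the `[base]` shorthand mentioned above is left out).

```python
def format_address(base, offset=0, align=None):
    """Render a memory operand as `[base,+offset]`, optionally with `align`.

    `base` is the already-printed base expression; `align` is only given
    for non-natural alignments.
    """
    text = f"[{base},+{offset}"
    if align is not None:
        text += f", align {align}"
    return text + "]"

plain = format_address(8)
aligned = format_address(8, 0, align=2)
```

Since whitespace is not significant in this format, the exact spacing inside the brackets is a printer choice rather than part of the syntax.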
+ +### A slightly larger example: + +Here's some C code: + +``` + float Q_rsqrt(float number) + { + long i; + float x2, y; + const float threehalfs = 1.5F; + + x2 = number * 0.5F; + y = number; + i = *(long *) &y; + i = 0x5f3759df - (i >> 1); + y = *(float *) &i; + y = y * (threehalfs - (x2 * y * y)); + y = y * (threehalfs - (x2 * y * y)); -WebAssembly currently doesn't have a final, official, text format. As detailed above the -main purpose of the text format will be for human consumption, feedback from humans on -readability will therefore factor into standardizing a text format. + return y; + } +``` -There are, however, prototype syntaxes which are used to bring up WebAssembly: it's easier -to develop using a text format than it is with a binary format, even if the ultimate -WebAssembly format will be binary. Most of these prototypes use [s-expressions][] because they -can easily represent expression trees and [ASTs](AstSemantics.md) (as opposed to CFGs) -and don't have much of a syntax to speak of (avoiding syntax bikeshed discussions). +Here's the corresponding LLVM wasm backend output + binaryen + slight tweaks: - [s-expressions]: https://en.wikipedia.org/wiki/S-expression +``` + (func $Q_rsqrt (param $0 f32) (result f32) + (local $1 f32) + (set_local $1 + (f32.reinterpret/i32 + (i32.sub + (i32.const 1597463007) + (i32.shr_s + (i32.reinterpret/f32 + (get_local $0)) + (i32.const 1))))) + (set_local $1 + (f32.mul + (get_local $1) + (f32.sub + (f32.const 1.5) + (f32.mul + (get_local $1) + (f32.mul + (get_local $1) + (set_local $0 + (f32.mul + (get_local $0) + (f32.const 0.5)))))))) + (f32.mul + (get_local $1) + (f32.sub + (f32.const 1.5) + (f32.mul + (get_local $1) + (f32.mul + (get_local $0) + (get_local $1))))) + ) +``` -Here are some of these prototypes. Keep in mind that these *aren't* official, and the final -official format may look entirely different: +And here's the proposed text syntax: -* [Prototype specification][] consumes an s-expression syntax. 
-* [WAVM backend][] consumes compatible s-expressions. -* [sexpr-wasm prototype][] consumes compatible s-expressions, and works closely with the [V8 prototype][]. -* [LLVM backend][] (the `CHECK:` parts of these tests) emits compatible s-expressions. -* [ilwasm][] emits compatible s-expressions. -* [wassembler][] consumes a different syntax, and works closely with the [V8 prototype][]. -* [binaryen][] can consume compatible s-expressions. +``` + function $Q_rsqrt($0:f32) : (f32) { + $1:f32 + $1 = f32.reinterpret'i32(1597463007 - (i32.reinterpret'f32($0) >> 1)) + $1 = $1 * (0x1.8p0 - $1 * ($0 = $0 * 0x1p-1) * $1) + $1 * (0x1.8p0 - $1 * $0 * $1) + } +``` - [prototype specification]: https://github.com/WebAssembly/spec/tree/master/ml-proto/test - [LLVM backend]: https://github.com/llvm-mirror/llvm/tree/master/test/CodeGen/WebAssembly - [WAVM backend]: https://github.com/AndrewScheidecker/WAVM/tree/master/Test - [wassembler]: https://github.com/ncbray/wassembler/tree/master/demos - [V8 prototype]: https://github.com/WebAssembly/v8-native-prototype - [ilwasm]: https://github.com/WebAssembly/ilwasm - [sexpr-wasm prototype]: https://github.com/WebAssembly/sexpr-wasm-prototype - [binaryen]: https://github.com/WebAssembly/binaryen +This shows off the compactness of infix operators with overloading. In the +s-expression syntax, these expressions are quite awkward to read, and this +isn't even a very big example. But the text syntax here is very short. 
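The three versions of `Q_rsqrt` above spell the same constants differently: the C source uses `0x5f3759df`, `1.5F` and `0.5F`, while the proposed text syntax uses `1597463007`, `0x1.8p0` and `0x1p-1`. The following Python sketch checks that these are the same values, using the standard `float.fromhex` and `struct` facilities. The names `MAGIC_DECIMAL`, `MAGIC_HEX` and `f32_bits` are made up for this illustration.

```python
import struct

# Constants as they appear in the different versions of Q_rsqrt.
MAGIC_DECIMAL = 1597463007                 # text format and s-expressions
MAGIC_HEX = 0x5f3759df                     # C source
THREE_HALVES = float.fromhex("0x1.8p0")    # C99-style hexadecimal float
ONE_HALF = float.fromhex("0x1p-1")

def f32_bits(value):
    """Return the IEEE-754 binary32 bit pattern of value as an integer."""
    return struct.unpack("<I", struct.pack("<f", value))[0]

same_magic = MAGIC_DECIMAL == MAGIC_HEX
same_floats = (THREE_HALVES, ONE_HALF) == (1.5, 0.5)
```

Both constants are exactly representable as 32-bit floats, so the hexadecimal form loses nothing relative to the decimal literals in the C source.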
+ +### Labels + +Excerpt from labels.wast: + +``` + (func $loop3 (result i32) + (local $i i32) + (set_local $i (i32.const 0)) + (loop $exit $cont + (set_local $i (i32.add (get_local $i) (i32.const 1))) + (if (i32.eq (get_local $i) (i32.const 5)) + (br $exit (get_local $i)) + ) + (get_local $i) + ) + ) +``` + +Corresponding proposed text syntax: + +``` + function $loop3 () : (i32) { + $i:i32 + $i = 0 + loop $cont { + $i = $i + 1 + if ($i == 5) { + br exit => $i + } + :exit + } + } +``` + +Note that the curly braces are part of the `if`, rather than introducing a +block. This reflects how `if` essentially provides `block`-like capabilities +in the wasm binary format. + +Due to syntactic requirements of LES, the colon `:` appears before the label +name (`:exit`) rather than afterward. + +### Nested blocks + +Label definitions that do not appear at the end of the enclosing block, such as +the `:exit` above, introduce additional blocks nested within the nearest `{`, +without requiring their own `{`. This allows the deep nesting of `br_table` to +be printed in a relatively flat manner: + +``` + { + br_table [red, orange, yellow, green, default] : $index + :red + // ... + :orange + // ... + :yellow + // ... + :green + // ... + :default + } +``` + +representing the following in nested form: + +``` + (block $default + (block $green + (block $yellow + (block $orange + (block $red + (br_table [$red, $orange, $yellow, $green] $default (get_local $index)) + ) + // ... + ) + // ... + ) + // ... + ) + // ... + ) +``` + +`br_table`s can have large numbers of labels, so this feature allows us to +avoid very deep nesting in many cases. + +Note that when a label appears just before the closing `}`, it doesn't introduce +a new block; it just provides a name for the enclosing block's label. 
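One way to read the nesting rule above is that each label which is not last in its `{ }` wraps everything written so far into a new block, while a final label names the enclosing block. The Python sketch below follows that reading. The `("label", name)` / `("stmt", text)` item encoding and the function name `nest` are assumptions made for this example, not part of the proposal.

```python
def nest(items):
    """Rebuild nested blocks from the flat contents of one `{ }`.

    items is a list of ("stmt", text) and ("label", name) entries in
    source order. Returns a ("block", name_or_None, body) tuple.
    """
    body, name = [], None
    for index, (kind, value) in enumerate(items):
        if kind == "stmt":
            body.append(value)
        elif index == len(items) - 1:
            # A label just before the closing brace names the enclosing block.
            name = value
        else:
            # Any other label wraps everything seen so far in a new block.
            body = [("block", value, body)]
    return ("block", name, body)

flat = [
    ("stmt", "br_table [red, orange, default] : $index"),
    ("label", "red"),
    ("stmt", "// red case"),
    ("label", "orange"),
    ("stmt", "// orange case"),
    ("label", "default"),
]
nested = nest(flat)
```

For the shortened `br_table` example here, the innermost block is `red` and the outermost is `default`, mirroring the nested s-expression form shown above.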
+ +## Operators with special syntax + +As mentioned earlier, basic arithmetic operators use an infix notation, some +operators require explicit parentheses, and some operators with boolean +conditions use `?`. The following is a table of special syntax: + +## Control flow operators ([described here](https://github.com/WebAssembly/design/blob/master/AstSemantics.md)) + +| Name | Syntax | Examples +| ---------- | -------------------------- | -------- +| `block` | :*label* | `{ br a; :a }` +| `loop` | `loop` *label* `{` … `}` | `loop a { br a }` +| `if` | `if (`*expr*`)` `{` *expr** `}` | `if ($x) { $f($x) }` +| `if_else` | `if (`*expr*`)` `{` *expr** `} else {` *expr** `}` | `if (0) { 1 } else { 2 }` +| `select` | `select` *expr* `:` *expr* `?` *expr* | `select 1 : 2 ? $x < $y` +| `br` | `br` *label* [=> $result] | `br a`, `br a => $x` +| `br_if` | `br_if` *label* `(if` *expr*`)` [`=>` *expr*] | `br_if a (if $x < $y) => 0` +| `br_table` | `br_table [` *case-label* `,` … `,` *default-label* `] :` *expr* | `br_table [a, b, c] : $x` + +(TODO: as above, are the `?`s too cute?) 
+ +## Basic operators ([described here](https://github.com/WebAssembly/design/blob/master/AstSemantics.md#constants)) + +| Name | Syntax | Example +| ----------- | ----------- | ---- | +| `i32.const` | see example | `234`, `0xfff7` +| `i64.const` | see example | `234L`, `0xfff7L` +| `f64.const` | see example | `0.1p2`, `@inf`, `@nan'0x789` +| `f32.const` | see example | `0.1p2f`, `@inf_f`, `@nan'0x789` +| `get_local` | *name* (including the `$`) | `$x` +| `set_local` | *name* `=` *expr* | `$x = 1` +| `call` | *name* `(`*expr* `,` … `)` | `$min(0, 2)` +| `call_import` | `$` *name* `(`*expr* `,` … `)` | `$$max(0, 2)` +| `call_indirect` | *expr* `::` *signature-name* [`[` *expr* `]`] `(`*expr* `,` … `)` | `$func::$signature(0, 2)` + +## Memory-related operators ([described here](https://github.com/WebAssembly/design/blob/master/AstSemantics.md#linear-memory-accesses)) + +| Name | Syntax | Example +| ---- | ---- | ---- | +| *memory-immediate* | `[` *base-expression* `,` *offset* `]` | `[$base, 4]` +| `i32.load8_s` | `i32.load8_s [` *base-expression* `, +` *offset-immediate* `]` | `i32.load8_s [$base, +4]` +| `i32.load8_s` | `i32.load8_s [` *base-expression* `, +` *offset-immediate* `, align ` *align* `]` | `i32.load8_s [$base, +4, align 2]` +| `i32.store8` | `i32.store8 [` *base-expression* `, +` *offset-immediate* `]`, *expr* | `i32.store8 [$base, +4], $value` +| `i32.store8` | `i32.store8 [` *base-expression* `, +` *offset-immediate* `, align ` *align* `]` `=` *expr* | `i32.store8 [$base, +4, align 2] = $value` + +The other forms of `load` and `store` are similar. 
+ +## Simple operators ([described here](AstSemantics#32-bit-integer-operators)) + +| Name | Syntax | +| ---- | ---- | +| `i32.add` | … `+` … +| `i32.sub` | … `-` … +| `i32.mul` | … `*` … +| `i32.div_s` | … `/` … +| `i32.div_u` | … `|/` … +| `i32.rem_s` | … `%` … +| `i32.rem_u` | … `|%` … +| `i32.and` | … `&` … +| `i32.or` | … `|` … +| `i32.xor` | … `^` … +| `i32.shl` | … `<<` … +| `i32.shr_s` | … `>>` … +| `i32.shr_u` | … `>|>` … +| `i32.eq` | … `==` … +| `i32.ne` | … `!=` … +| `i32.lt_s` | … `<` … +| `i32.le_s` | … `<=` … +| `i32.lt_u` | … `|<` … +| `i32.le_u` | … `<|=` … +| `i32.gt_s` | … `>` … +| `i32.ge_s` | … `>=` … +| `i32.gt_u` | … `|>` … +| `i32.ge_u` | … `>|=` … +| `i32.eqz` | `!` … +| `i64.add` | … `+` … +| `i64.sub` | … `-` … +| `i64.mul` | … `*` … +| `i64.div_s` | … `/` … +| `i64.div_u` | … `|/` … +| `i64.rem_s` | … `%` … +| `i64.rem_u` | … `|%` … +| `i64.and` | … `&` … +| `i64.or` | … `\|` … +| `i64.xor` | … `^` … +| `i64.shl` | … `<<` … +| `i64.shr_s` | … `>>` … +| `i64.shr_u` | … `>|>` … +| `i64.eq` | … `==` … +| `i64.ne` | … `!=` … +| `i64.lt_s` | … `<` … +| `i64.le_s` | … `<=` … +| `i64.lt_u` | … `|<` … +| `i64.le_u` | … `<|=` … +| `i64.gt_s` | … `>` … +| `i64.ge_s` | … `>=` … +| `i64.gt_u` | … `|>` … +| `i64.ge_u` | … `>|=` … +| `i64.eqz` | `!` … +| `f32.add` | … `+` … +| `f32.sub` | … `-` … +| `f32.mul` | … `*` … +| `f32.div` | … `/` … +| `f32.neg` | `-` … +| `f32.eq` | … `==` … +| `f32.ne` | … `!=` … +| `f32.lt` | … `<` … +| `f32.le` | … `<=` … +| `f32.gt` | … `>` … +| `f32.ge` | … `>=` … +| `f64.add` | … `+` … +| `f64.sub` | … `-` … +| `f64.mul` | … `*` … +| `f64.div` | … `/` … +| `f64.neg` | `-` … +| `f64.eq` | … `==` … +| `f64.ne` | … `!=` … +| `f64.lt` | … `<` … +| `f64.le` | … `<=` … +| `f64.gt` | … `>` … +| `f64.ge` | … `>=` … + +All other operators use their actual name in a prefix notation, such as +`f32.sqrt …`. + +## Answers to anticipated questions + + +Q: JS avoids sigils, and uses context-sensitive keywords to avoid trouble. 
+ Can wasm do this? + +A: Sigils are more of a burden when writing code than reading code, and wasm + will mostly be written by compilers. And it's my subjective opinion that + it's better to give ourselves maximum flexibility to add new keywords in + the future without having to be tricky. + + +Q: Why not let `br` be spelled `break` or `continue` when targeting block and + loop, respectively? + +A: The `br_table` construct has multiple labels, and there may be a mix of + forward and backward branches, so it isn't always possible to categorize + branches as forward or backward. Also, `br`, `br_if`, and `br_table` are + what we have in the spec, so using their actual names avoids needing + to special-case them. + + +Q: Why is, for example, the unsigned shift operator called `>|>` rather than + the more logical `|>>`, or even `|>>|`? + +A: None of the "unsigned" operators are built into LES. The precedence of + non-built-in operators is derived in a predictable way from the built-in + operators, so that for example `>|>` has the same precedence as `>>`, + whereas `|>>` has the same precedence as `>`, and `|>>|` has the same + precedence as `||`. Placing the vertical bar in the middle allows the + operator to keep the same precedence as the built-in operator. + # Debug symbol integration @@ -83,3 +501,8 @@ therefore synthesize new names. However, as part of the [tooling](Tooling.md) story, a lightweight, optional "debug symbol" global section may be defined which associates names with each indexed entity and, when present, these names will be used in the text format projected from a binary WebAssembly module. + +Since LES allows "attribute" expressions to be attached to any expression, +these could be used someday to represent additional debug information, +comments, or other "side-channel" information that may be stored in the +binary format in the future.