From 23684ecdf367e8739ac215fac6358e14273c8c7b Mon Sep 17 00:00:00 2001 From: Lyndon White Date: Thu, 12 Sep 2019 19:33:08 +0100 Subject: [PATCH 01/51] start writing intro --- docs/src/index.md | 97 +++++++++++++++++++++++++++++++++++++++++++++++ 1 file changed, 97 insertions(+) diff --git a/docs/src/index.md b/docs/src/index.md index ddfd8d49e..981766bf8 100644 --- a/docs/src/index.md +++ b/docs/src/index.md @@ -7,3 +7,100 @@ DocTestSetup = :(using ChainRulesCore, ChainRules) [ChainRules.jl](https://github.com/JuliaDiff/ChainRules.jl) provides a variety of common utilities that can be used by downstream automatic differentiation (AD) tools to define and execute forward-, reverse-, and mixed-mode primitives. This package is a work-in-progress, as is the documentation. Contributions welcome! + +## TODO Include the following: +* rrule: +* frule: +* Pullback: takes a Wobble in the output space, and tells you how much Wiggle you need to make in the input space to get that. +* Pushforward: takes a Wibble in the input space, +* and tells you how much Wobble you get in the output space. +* Total derivative +* Gradient +* Seed +* Partial +* Permutation +* Sensitivity +* Thunk +* Differential +* Self-derivative, Internal derivative: + + + +Note: The following terminology is for ChainRules purposes. +It should align with uses in general. +Be warned that differential geometers might make sad-faces when they realize ChainRule’s pullback / pushforwards are only for the very boring euclidean spaces. + + + + +### `rrule` and `frule` +ChainRules is all about providing a rich set of rules for doing differentiation. +When a person does introductory calculus, they learn that the derivative (with respect to `x`) +of `a*x` is `a`, and the derivative of `sin(x)` is `cos(x)`, etc. +And they learn how to combine simple rules, via the chain rule, to differentiate complicated functions. 
+ChainRules.jl basically a progamatic repository of that knowledge, with the generalizations to higher dimensions. + +Autodiff (AD) tools roughly work by reducting a program down to simple parts that they know the rules for, +and then combining those rules. +Knowing rules for more complicated functions speeds up the autodiff process as it doesn't have to break things down as much. + + + + + +________________ + + + + +On writing good rrule / frules + + +* Use thunks appropriately: + * If work is only required for 1 of the returned differentials it should be wrapped in a `@thunk` (potentially using a begin-end block) + * If there are multiple return values, almost always their should be computation wrapped in a `@thunk`s + + + * Don’t wrap variables in thunks, wrap the computations that fill those variables in thunks: Eg: +Write: +``` +∂A = @thunk(foo(x)) +return ∂A +``` + Not: +``` +∂A = foo(x) +return @thunk(∂A) +``` +In the bad example `foo(x)` gets computed eagerly, and all that the thunk is doing is wrapping the already calculated result in a function that returns it. + + + + +* Style: used named local functions for the pushforward/pullback: +Rather than: +``` +function frule(::typeof(foo), x) + return (foo(x), (_, ẋ)->bar(ẋ)) +end +``` + + +write: + + +``` +function frule(::typeof(foo), x) + Y = foo(x) + function foo_pushforward(_, ẋ) + return bar(ẋ) + end + return Y, foo_pushforward +end +``` + + +While this is more verbose, +it ensures that if an error is thrown during the pullback/pushforward +the gensymed name of the local function will include the name you gave it. +Which makes it a lot simpler to debug from the stacktrace. 
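To make the two style points above concrete, here is a runnable toy version in plain Julia. It is only a sketch: `my_rrule` and the hand-rolled zero-argument closure are illustrative stand-ins, not the package's actual `rrule`/`@thunk` API.

```julia
# Toy stand-in for `rrule(sin, x)`, illustrating the guidance above:
# the derivative work sits inside a zero-argument closure (a hand-rolled
# thunk), and the pullback is a *named* local function, so a stacktrace
# will mention `sin_pullback` rather than an anonymous function.
function my_rrule(::typeof(sin), x)
    y = sin(x)
    function sin_pullback(ȳ)
        ∂x = () -> ȳ * cos(x)  # deferred: cos(x) is not computed yet
        return ∂x
    end
    return y, sin_pullback
end

y, pb = my_rrule(sin, 1.2)
∂x_thunk = pb(1.0)  # still no derivative work done
∂x = ∂x_thunk()     # forcing the thunk computes cos(1.2)
```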
From dc81c198bb87301f3530529484983601b5054456 Mon Sep 17 00:00:00 2001 From: Lyndon White Date: Fri, 13 Sep 2019 20:04:51 +0100 Subject: [PATCH 02/51] First full draft --- docs/src/index.md | 345 ++++++++++++++++++++++++++++++++++++++++------ 1 file changed, 300 insertions(+), 45 deletions(-) diff --git a/docs/src/index.md b/docs/src/index.md index 981766bf8..b2fff064a 100644 --- a/docs/src/index.md +++ b/docs/src/index.md @@ -6,34 +6,7 @@ DocTestSetup = :(using ChainRulesCore, ChainRules) [ChainRules.jl](https://github.com/JuliaDiff/ChainRules.jl) provides a variety of common utilities that can be used by downstream automatic differentiation (AD) tools to define and execute forward-, reverse-, and mixed-mode primitives. -This package is a work-in-progress, as is the documentation. Contributions welcome! - -## TODO Include the following: -* rrule: -* frule: -* Pullback: takes a Wobble in the output space, and tells you how much Wiggle you need to make in the input space to get that. -* Pushforward: takes a Wibble in the input space, -* and tells you how much Wobble you get in the output space. -* Total derivative -* Gradient -* Seed -* Partial -* Permutation -* Sensitivity -* Thunk -* Differential -* Self-derivative, Internal derivative: - - - -Note: The following terminology is for ChainRules purposes. -It should align with uses in general. -Be warned that differential geometers might make sad-faces when they realize ChainRule’s pullback / pushforwards are only for the very boring euclidean spaces. - - - - -### `rrule` and `frule` +### Introduction: ChainRules is all about providing a rich set of rules for doing differentiation. When a person does introductory calculus, they learn that the derivative (with respect to `x`) of `a*x` is `a`, and the derivative of `sin(x)` is `cos(x)`, etc. @@ -44,52 +17,321 @@ Autodiff (AD) tools roughly work by reducting a program down to simple parts tha and then combining those rules. 
Knowing rules for more complicated functions speeds up the autodiff process as it doesn't have to break things down as much.
+** ChainRules is an AD independent collection of rules to use in an differentiation system **
+
+### `rrule` and `frule`
+!!! Terminology "`rrule` and `frule`"
+    `rrule` and `frule` are ChainRules.jl specific terms.
+    And their exact functioning is somewhat ChainRules specific,
+    though other tools may do something similar.
+    The core notion is sometimes called
+    _Custom AD primitives_, _custom adjoints_, _custom sensitivities_.
+The rules are encoded as `rrules` and `frules`,
+for use in forward-mode and reverse-mode differentiation respectively.
+
+the `rrule` for some function `foo`, taking positional arguments `args` and keyword arguments kwargs is written:
+```julia
+function rrule(::typeof(foo), args; kwargs...)
+    ...
+    return y, pullback
+end
+```
+where `y` must be equal to `foo(args; kwargs...)`,
+and _pullback_ is a function to propagate the derivative information backwards at that point (more later).
+Often but not always it is calculated directly.
+the exeception is we can calculate it indirect to make
+the `pullback` faster. (more on _pullback_ later)
+
+Similarly, the `frule` is written:
+```julia
+function frule(::typeof(foo), args; kwargs...)
+    ...
+    return y, pushforward
+end
+```
+again `y = foo(args; kwargs...)`,
+and _pushforward_ is a function to propagate the derivative information forwards at that point (more later).
+Almost always the _pushforward_/_pullback_ will be declared locally within the `frule`/`rrule`, and will be a _closure_ over some of the other arguments.
+
+### The propagators: pushforward and pullback
+
+!!! Terminology "Pushforward and Pullback"
+
+    _Pushforward_ and _Pullback_ are fancy words that the autodiff community recently stole from Differential Geometry.
+    They are broadly in agreement with the use of these terms in differential geometry.
But any geometer will tell you these are the super-boring flat cases. Some will also frown at you.
+    Other terms that may be used include for _pullback_ the **backpropagator**, and by analogy for _pushforward_ the **forwardpropagator**, thus these are the _propagators_.
+    These are also good names because effectively they propagate wibbles and wobbles through them, via the chain rule.
+    (the term **backpropagator** may originate with ["Lambda The Ultimate Backpropagator"](http://www-bcl.cs.may.ie/~barak/papers/toplas-reverse.pdf) by Pearlmutter and Siskind, 2008)
+
+#### Core Important Idea:
+ - The **Pushforward** takes a wiggle in the _input space_, and tells you what wobble you would create in the output space, by passing it through the function.
+ - The **Pullback** takes a wobble in the _output space_, and tells you what wiggle you would need to make in the _input_ space to achieve it.
+
+#### The anatomy of pushforward and pullback
+
+For our function `foo(args...; kwargs...) = Y`:
+
+The pushforward is a function:
+```julia
+function pushforward(Δself, Δargs...)
+    ...
+    return ∂Y
+end
+```
+Note that there is one `Δargs...` per `arg` to the orginal function, and they are similar in type/structure to the ccorresponding inputs.
+Plus the `Δself` (don't worry we will be back to explain this soon).
+The `∂Y` will be similar in type/structure to the original function's output `Y`.
+In particular if that function returned a tuple then `∂Y` will be a tuple of same size.
+
+The input to the pushforward is often called the _perturbation_.
+If the function is `y=f(x)` often the pushforward will be written `ẏ=pushforward(ẋ)`.
+
+
+The pullback is a function
+```julia
+function pullback(ΔY)
+    ...
+    return ∂self, ∂args...
+end
+```
+
+Note that the pullback returns one `∂arg` per original `arg` to the function, plus one for the s
+
+The input to the pullback is often called the _seed_.
+If the function is `y=f(x)` often the pullback will be written `x̄=pullback(ȳ)`.
+
+
+!!! Terminology:
+    Sometimes _pertubation_, _seed_, _sensitivity_ will be used interchangeably, depending on task/subfield (_sensitivity_ analysis and perturbation analysis are apparently very big on just calling everying _sensitivity_ or _pertubation_ respectively.)
+    At the end of the day they are all _wibbles_ or _wobbles_.
+
+### self derivative `Δself`, `∂self` etc.
+
+!!! Terminology
+    To my knowledge there is no standard termanology for this.
+    Other good names might be `Δinternal`/`∂internal`
+
+From the mathematical perspective,
+one may have been wondering what all this `Δself`, `∂self` is.
+After all a function with two inputs,
+say `f(a, b)`, only has two partial derivatives:
+``\dfrac{∂f}{∂a}``, ``\dfrac{∂f}{∂b}``.
+Why then does the _pushforward_ take in this extra `Δself`,
+and why does the _pullback_ return this extra `∂self`?
+
+The thing is, in Julia,
+the function `f` may itself have internal values.
+For example a closure has the fields it closes over; and a callable object (i.e. a functor) like a `Flux.Dense` has the fields of that object.
+
+**Thus every function is treated as having the extra implicit argument `self`,
+which captures those fields.**
+So all _pushforwards_ take in an extra argument,
+which, unless they are for things with fields, they ignore (thus it is common to write `function pushforward(_, Δargs...)` in those cases),
+and every _pullback_ returns an extra `∂self`,,
+which is, for things without fields, the constant `NO_FIELDS` which indicates there are no fields within the function itself.
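To see why the `self` slot is needed, here is a hand-written pullback for a closure in plain Julia. This is only an illustrative sketch (the names `make_scaler` and `pullback_at`, and the named tuple standing in for a structured `∂self`, are not the package API): the captured field `a` gets its own derivative, reported alongside the derivative for the argument `x`.

```julia
# `scale(x) = a * x` captures `a`, so its pullback must report a derivative
# for that internal field (the ∂self slot) as well as for the argument `x`.
function make_scaler(a)
    scale(x) = a * x
    # the pullback at input `x` maps a seed ȳ to (∂self, ∂x):
    #   ∂self = (a = x * ȳ,)  since ∂(a*x)/∂a = x
    #   ∂x    = a * ȳ         since ∂(a*x)/∂x = a
    pullback_at(x) = ȳ -> ((a = x * ȳ,), a * ȳ)
    return scale, pullback_at
end

scale, pullback_at = make_scaler(3.0)
∂self, ∂x = pullback_at(2.0)(1.0)
∂self.a  # 2.0
∂x       # 3.0
```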
+
+
+#### Pushforward / Pullback summary
+- **Pushforward:**
+  - returned by `frule`
+  - takes input space wibbles, gives output space wobbles
+  - 1 argument per orignal function argument + 1 for the function itself
+  - 1 return per orignal function return
+- **Pullback**
+  - returned by `rrule`
+  - takes output space wobbles, gives input space wibbles
+  - 1 argument per original function return
+  - 1 return per orignal function argument + 1 for the function itself
+
+#### Pushforward/Pullback and Total Derivative/Gradient
+
+The most trivial use of the frule+pushforward is to calculate the [Total Derivative](https://en.wikipedia.org/wiki/Total_derivative):
+```julia
+y, pushforward = frule(f, a, b, c)
+ẏ = pushforward(1, 1, 1, 1)  # for appropriate `1`-like perturbation.
+```
+Then we have that
+`ẏ` is the _total derivative_ of
+`f` at `(a, b, c)`,
+written mathematically as ``df_{(a,b,c)}``.
+
+
+Similarly:
+The most trivial use of the rrule+pullback is to calculate the [Gradient](https://en.wikipedia.org/wiki/Gradient):
+```julia
+y, pullback = frule(f, a, b, c)
+∇f = pushforward(1)  # for appropriate `1`-like seed.
+s̄, ā, b̄, c̄ = ∇f
+```
+Then we have that
+`∇f` is the _gradient_ of
+`f` at `(a, b, c)`.
+And we thus have the partial derivatives:
+s̄, ā, b̄, c̄
+(including the self-partial derivative,
+s̄).
+Written mathematically as ``\dfrac{∂f}{∂a}``, ``\dfrac{∂f}{∂b}``, ``\dfrac{∂f}{∂c}``.
+
+
+### Differentials
+
+The values that come back from pullbacks,
+or pushforwards
+are not always the same type as the input/outputs of the original function.
+They are differentials,
+derivative-like equivalents.
+A differential might be a regular type,
+like a Number, or a Matrix,
+or it might be one of the `AbstractDifferencial` subtypes.
+
+Differentials support a number of operations.
+Most importantly:
+`+` and `*`, which let them act as mathematical objects.
+And `extern`, which converts them into a conventional type.
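As a sketch of those operations, here is what a minimal Thunk-like differential could look like in plain Julia. This is illustrative only, built from nothing beyond the behaviour described above; the real types (`Thunk`, `One`, `Zero`) and the real `extern` live in ChainRulesCore, and `MyThunk`/`my_extern` are hypothetical names.

```julia
# A minimal Thunk-like deferred differential. `my_extern` forces the deferred
# computation; `+` and `*` are the operations that let differentials behave
# like mathematical objects.
struct MyThunk{F}
    f::F
end
my_extern(t::MyThunk) = t.f()
my_extern(x) = x  # regular types (Number, Matrix, ...) are their own external form

Base.:+(a::MyThunk, b) = my_extern(a) + my_extern(b)
Base.:*(a::MyThunk, b) = my_extern(a) * my_extern(b)

t = MyThunk(() -> 2.0 + 3.0)  # nothing is computed at construction
my_extern(t)                  # forces the computation: 5.0
t * 4.0                       # 20.0
```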
+
+The most important AbstractDifferentials when getting started are the ones about avoiding work:
+
+ - `Thunk`: this is a deferred computation. A thunk is a [word for a zero argument closure](https://en.wikipedia.org/wiki/Thunk). A computation wrapped in a `@thunk` doesn't get evaluated until `extern` is called on the `Thunk`. More on thunks later.
+ - `One`, `Zero`: These are special representations of `1` and `0`. They do great things around avoiding expanding `Thunks` in multiplication and (for `Zero`) addition.
+
+
+
+#### Others: don't worry about them right now
+ - Wirtinger: it is complex. The docs need to be better. [Read the links in this issue](https://github.com/JuliaDiff/ChainRulesCore.jl/issues/40).
+ - Casted: it implements broadcasting mechanics. See [#10](https://github.com/JuliaDiff/ChainRulesCore.jl/issues/10)
+ - InplaceableThunk: it is like a Thunk but it can do `store!` and `accumulate!` in place.
+
+
+ -------------------------------
+## Example of using ChainRules directly.
+
+While ChainRules is largely intended as a backend for autodiff systems, it can be used directly.
+(In fact this can be very useful if you can constrain the code you need to differentiate to only use things that have rules defined for them.
+This was once how all neural network code worked.)
+
+Using ChainRules directly also helps you get a feel for it.
+
+
+```julia
+using ChainRules
+
+function foo(x)
+    a = sin(x)
+    b = 2a
+    c = asin(b)
+    return c
+end;
+
+###
+# Find dfoo/dx via rrules
+
+# First the forward pass, accumulating rules
+x=3;
+a, a_pb = rrule(sin, x);
+b, b_pb = rrule(*, 2, a);
+c, c_pb = rrule(asin, b)
+
+# Then the backward pass calculating gradients
+c̄ = 1;
+_, b̄ = c_pb(extern(c̄));
+_, _, ā = b_pb(extern(b̄));
+_, x̄ = a_pb(extern(ā));
+extern(x̄)
+# -2.0638950738662625
+
+###
+# Find dfoo/dx via frules
+
+# Unlike with rrule, we can interleave evaluation and derivative evaluation
+x=3;
+ẋ=1;
+nofields = NamedTuple();
+
+a, a_pf = frule(sin, x);
+ȧ = a_pf(nofields, extern(ẋ));
+
+b, b_pf = frule(*, 2, a);
+ḃ = b_pf(nofields, 0, extern(ȧ));
+
+c, c_pf = frule(asin, b);
+ċ = c_pf(nofields, extern(ḃ));
+extern(ċ)
+# -2.0638950738662625
+
+###
+# Find dfoo/dx via finite-difference
+using FiniteDifferences
+central_fdm(5,1)(foo, x)
+# -2.0638950738670734
+
+###
+# Via ForwardDiff.jl
+using ForwardDiff
+ForwardDiff.derivative(foo, x)
+# -2.0638950738662625
+
+###
+# Via Zygote
+using Zygote
+Zygote.gradient(foo, x)
+# (-2.0638950738662625,)
+```
+
+
+ -------------------------------
+
+
+
+## On writing good rrule / frules
+
+### Return Zero or One
+rather tan `0` or `1`
+or even rather than `zeros(n)`, `ones(m,n)`
+
+### Use thunks appropriately:
+
+If work is only required for 1 of the returned differentials it should be wrapped in a `@thunk` (potentially using a begin-end block)
+
+If there are multiple return values, almost always there should be computation wrapped in `@thunk`s
+
+Don’t wrap variables in thunks, wrap the computations that fill those variables in thunks: Eg:
+Write:
+```julia
+∂A = @thunk(foo(x))
+return ∂A
+```
+Not:
+```julia
+∂A = foo(x)
+return @thunk(∂A)
+```
+In the bad example `foo(x)` gets computed eagerly, and all that the thunk is doing is wrapping the already calculated result in a function that returns it.
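The difference is observable. A small sketch in plain Julia, with zero-argument closures standing in for `@thunk` and a call counter standing in for "expensive work", shows that the good pattern defers the work while the bad pattern has already done it:

```julia
# Counting calls shows the good pattern defers work and the bad pattern
# computes eagerly. Plain zero-argument closures stand in for `@thunk`.
const CALLS = Ref(0)
expensive(x) = (CALLS[] += 1; 2x)

good() = () -> expensive(10)            # wraps the computation
bad()  = (y = expensive(10); () -> y)   # wraps an already-computed result

CALLS[] = 0
g = good()
@assert CALLS[] == 0  # deferred: nothing has run yet
@assert g() == 20
@assert CALLS[] == 1

CALLS[] = 0
b = bad()
@assert CALLS[] == 1  # eager: the work happened before anyone asked
@assert b() == 20
```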
+### Becareful of using Adjoing when you mean Transpose
+Remember for complex numbers `a'` (i.e. `adjoint(a)`) takes the complex conjugate. Instead you probably want `transpose(a)`.
+While there are arguments that for reverse-mode
+taking the adjoint is correct, it is not currently the behaviour of ChainRules to do so.
+Feel free to open an issue and fight about it.
+All differentials support `conj` efficiently, which makes it easy to change in post.
-* Style: used named local functions for the pushforward/pullback:
+### Style
+
+Use named local functions for the pushforward/pullback:
 Rather than:
-```
+```julia
 function frule(::typeof(foo), x)
     return (foo(x), (_, ẋ)->bar(ẋ))
 end
 ```
-
-
-write:
-
-
-```
+Write:
+```julia
 function frule(::typeof(foo), x)
     Y = foo(x)
     function foo_pushforward(_, ẋ)
         return bar(ẋ)
     end
     return Y, foo_pushforward
 end
 ```
@@ -104,3 +346,16 @@
 While this is more verbose,
 it ensures that if an error is thrown during the pullback/pushforward
 the gensymed name of the local function will include the name you gave it.
 Which makes it a lot simpler to debug from the stacktrace.
+
+### Write tests
+There are fairly decent tools for writing tests based on [FiniteDifferences.jl](https://github.com/JuliaDiff/FiniteDifferences.jl).
+They are in [`tests/test_utils.jl`](https://github.com/JuliaDiff/ChainRules.jl/blob/master/test/test_util.jl).
+Take a look at existing tests and you should see how to do stuff.
+
+!!! important
+    Don't write equations in tests.
+    Use finite differencing.
+    If you write the same equations in the tests that you used to write your code, the tests cannot catch mistakes in those equations. We've had several bugs from people misreading/misunderstanding equations, and then using them for both tests and code. And then we have good coverage that is worthless.
+
+### CAS systems are your friends.
+E.g. it is very easy to check gradients or derivatives with [WolframAlpha](https://www.wolframalpha.com/input/?i=gradient+atan2%28x%2Cy%29).
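The finite-differencing advice can be sketched without the test utilities, in plain Julia: write the rule's derivative once, then check it against a central difference rather than against the same equation twice. The function here is the `foo` from the worked example above (`asin(2sin(x))`); the step size and tolerance are illustrative choices.

```julia
# Check a hand-derived derivative against a central finite difference,
# instead of re-deriving (and possibly re-botching) the same equation in the test.
f(x) = asin(2sin(x))                     # same function as the worked example
df(x) = 2cos(x) / sqrt(1 - (2sin(x))^2)  # hand-derived rule under test

central_diff(f, x; h=1e-6) = (f(x + h) - f(x - h)) / (2h)

x = 0.3
@assert isapprox(df(x), central_diff(f, x); atol=1e-6)
```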
From 8ad7de85a62a8b08443e36ff7ac4d72c8938f577 Mon Sep 17 00:00:00 2001 From: Lyndon White Date: Fri, 13 Sep 2019 21:39:14 +0100 Subject: [PATCH 03/51] Update docs/src/index.md Co-Authored-By: Matt Brzezinski --- docs/src/index.md | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/docs/src/index.md b/docs/src/index.md index b2fff064a..834ed5eed 100644 --- a/docs/src/index.md +++ b/docs/src/index.md @@ -32,7 +32,7 @@ Knowing rules for more complicated functions speeds up the autodiff process as i The rules are encoded as `rrules` and `frules`, for use in forward-mode and reverse-mode differentiation respectively. -the `rrule` for some function `foo`, taking positional arguments `args` and keyword arguments kwargs is written: +The `rrule` for some function `foo`, takes the positional argument `args` and keyword argument `kwargs` is written: ```julia function rrule(::typeof(foo), args; kwargs...) ... From e620f2e0d0277f4ffcf7b9fcdc0ce874dc347c4b Mon Sep 17 00:00:00 2001 From: Lyndon White Date: Fri, 13 Sep 2019 21:39:36 +0100 Subject: [PATCH 04/51] Update docs/src/index.md Co-Authored-By: Matt Brzezinski --- docs/src/index.md | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/docs/src/index.md b/docs/src/index.md index 834ed5eed..b618e1f38 100644 --- a/docs/src/index.md +++ b/docs/src/index.md @@ -30,7 +30,7 @@ Knowing rules for more complicated functions speeds up the autodiff process as i _Custom AD primitives_, _custom adjoints_, _custom sensitivities_. The rules are encoded as `rrules` and `frules`, -for use in forward-mode and reverse-mode differentiation respectively. +for use in reverse-mode and forward-mode differentiation respectively. 
The `rrule` for some function `foo`, takes the positional argument `args` and keyword argument `kwargs` is written: ```julia From 50291d61c2465cb0041934fca0e77323a60a9550 Mon Sep 17 00:00:00 2001 From: Lyndon White Date: Fri, 13 Sep 2019 21:39:46 +0100 Subject: [PATCH 05/51] Update docs/src/index.md Co-Authored-By: Matt Brzezinski --- docs/src/index.md | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/docs/src/index.md b/docs/src/index.md index b618e1f38..62a332b58 100644 --- a/docs/src/index.md +++ b/docs/src/index.md @@ -13,7 +13,7 @@ of `a*x` is `a`, and the derivative of `sin(x)` is `cos(x)`, etc. And they learn how to combine simple rules, via the chain rule, to differentiate complicated functions. ChainRules.jl basically a progamatic repository of that knowledge, with the generalizations to higher dimensions. -Autodiff (AD) tools roughly work by reducting a program down to simple parts that they know the rules for, +Autodiff (AD) tools roughly work by reducing a problem down to simple parts that they know the rules for, and then combining those rules. Knowing rules for more complicated functions speeds up the autodiff process as it doesn't have to break things down as much. From bf72e1422e60bdb0d2bacba63a131715a75afeb7 Mon Sep 17 00:00:00 2001 From: Lyndon White Date: Fri, 13 Sep 2019 21:49:21 +0100 Subject: [PATCH 06/51] Apply suggestions from code review Co-Authored-By: Matt Brzezinski --- docs/src/index.md | 17 +++++++---------- 1 file changed, 7 insertions(+), 10 deletions(-) diff --git a/docs/src/index.md b/docs/src/index.md index 62a332b58..7ca9fc277 100644 --- a/docs/src/index.md +++ b/docs/src/index.md @@ -42,7 +42,7 @@ end where `y` must be equal to `foo(args; kwargs...)`, and _pullback_ is a function to propagate the derivative information backwards at that point (more later). Often but not always it is calculated directly. 
-the exeception is we can calculate it indirect to make +The exception is that we can calculate it indirectly to make the `pullback` faster. (more on _pullback_ later) Similarly, the `frule` is written: @@ -83,7 +83,7 @@ function pushforward(Δself, Δargs...) return ∂Y end ``` -Note that there is one `Δargs...` per `arg` to the orginal function, and they are similar in type/structure to the ccorresponding inputs. +**Note:** that there is one `Δargs...` per `arg` to the original function, and they are similar in type/structure to the corresponding inputs. Plus the `Δself` (don't worry we will be back to explain this soon). The `∂Y` will be similar in type/structure to the original function's output `Y`. In particular if that function returned a tuple then `∂Y` will be a tuple of same size. @@ -100,7 +100,7 @@ function pullback(ΔY) end ``` -Note that the pullback returns one `∂arg` per original `arg` to the function, plus one for the s +**Note:** that the pullback returns one `∂arg` per original `arg` to the function, plus one for the fields of the function itself (again will get to that below). The input to the pullback is often called the _seed_. If the function is `y=f(x)` often the pullback will be written `ȳ=pullback(x̄)`. @@ -110,10 +110,10 @@ If the function is `y=f(x)` often the pullback will be written `ȳ=pullback(x̄ Sometimes _pertubation_, _seed_, _sensitivity_ will be used interchangeably, depending on task/subfield (_sensitivity_ analysis and perturbation analysis are apparently very big on just calling everying _sensitivity_ or _pertubation_ respectively.) At the end of the day they are all _wibbles_ or _wobbles_. -### self derivative `Δself`, `∂self` etc. +### Self derivative `Δself`, `∂self` etc. !!! Terminology - To my knowledge there is no standard termanology for this. + To my knowledge there is no standard terminology for this. 
Other good names might be `Δinternal`/`∂internal` From the mathematical perspective, @@ -132,7 +132,7 @@ For example a closure has the fields it closes over; and a callable object (i.e. which captures those fields.** So all _pushforward_ take in a extra argument, which unless they are for things with fields, they ignore. (thus common to write `function pushforward(_, Δargs...)` in those cases), -and every _pullback_ return an extra `∂self`,, +and every _pullback_ return an extra `∂self`, which is, for things without fields, the constant `NO_FIELDS` which indicates there is no fields within the function itself. @@ -177,7 +177,6 @@ s̄, ā, b̄, c̄. s̄). Written mathematically as ``\dfrac{∂f}{∂a}``, ``\dfrac{∂f}{∂b}``, ``\dfrac{∂f}{∂c}``. - ### Differentials The values that come back from pullbacks, @@ -187,7 +186,7 @@ They are differentials, differency-equivalents. A differential might be such a regular type, like a Number, or a Matrix, -or it might be one of the `AbstractDifferencial` subtypes. +or it might be one of the `AbstractDifferential` subtypes. Differentials support a number of operations. Most importantly: @@ -199,8 +198,6 @@ The most important AbstractDifferentials when getting started are the ones about - `Thunk`: this is a deferred computation. A thunk is a [word for a zero argument closure](https://en.wikipedia.org/wiki/Thunk). An computation wrapped in a `@thunk` doesn't get evaluated until `extern` is called on the `Thunk`. More on thunks later. - `One`, `Zero`: There are special representions of `1` and `0`. They do great things around avoiding expanding `Thunks` in multiplication and (for `Zero`) addition. - - #### Others: don't worry about them right now - Wirtinger: it is complex. The docs need to be better. [Read the links in this issue](https://github.com/JuliaDiff/ChainRulesCore.jl/issues/40). - Casted: it implements broadcasting mechanics. 
See [#10](https://github.com/JuliaDiff/ChainRulesCore.jl/issues/10) From 9dfdf5ae90ef70fac9dc30525aaaaf1abffc36a2 Mon Sep 17 00:00:00 2001 From: Lyndon White Date: Fri, 13 Sep 2019 22:26:19 +0100 Subject: [PATCH 07/51] add FAQ --- docs/src/index.md | 16 ++++++++++++++++ 1 file changed, 16 insertions(+) diff --git a/docs/src/index.md b/docs/src/index.md index 7ca9fc277..d472994a1 100644 --- a/docs/src/index.md +++ b/docs/src/index.md @@ -356,3 +356,19 @@ Take a look at existing test and you should see how to do stuff. ### CAS systems are your friends. E.g. it is very easy to check gradients or deriviatives with [WolframAlpha](https://www.wolframalpha.com/input/?i=gradient+atan2%28x%2Cy%29). + +------------------------------------------ + +### FAQ: + +### What is up with the different symbols? + + - `Δx` is the input to a propagator, (i.e a _seed_ for a _pullback_; or a _perturbation_ for a _pushforward_) + - `∂x` is the output of a propagator + - `dx` could be anything, including a pullback. It really should show up outside of tests. + - `ẋ` is a derivative moving forward. + - `x̄` is a dderivative moving backward. + + - `Ω` is often used as the return value of the function having the rule found for. Especially, (but not eexlusively.) for scalar functions. + - `ΔΩ` is thus a seed for the pullback. + - `∂Ω` is thus the output of a pushforward From 6bc00a3ddc96575f37e1127f1c81cfa5924c111b Mon Sep 17 00:00:00 2001 From: Lyndon White Date: Fri, 13 Sep 2019 23:41:10 +0100 Subject: [PATCH 08/51] Update docs/src/index.md Co-Authored-By: simeonschaub --- docs/src/index.md | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/docs/src/index.md b/docs/src/index.md index d472994a1..2c4c3106b 100644 --- a/docs/src/index.md +++ b/docs/src/index.md @@ -365,7 +365,7 @@ E.g. 
it is very easy to check gradients or derivatives with [WolframAlpha](https://www.wolframalpha.com/input/?i=gradient+atan2%28x%2Cy%29).
+
+------------------------------------------
+
+### FAQ:
+
+### What is up with the different symbols?
+
+ - `Δx` is the input to a propagator, (i.e. a _seed_ for a _pullback_; or a _perturbation_ for a _pushforward_)
+ - `∂x` is the output of a propagator
+ - `dx` could be anything, including a pullback. It really should show up outside of tests.
+ - `ẋ` is a derivative moving forward.
+ - `x̄` is a derivative moving backward.
+
+ - `Ω` is often used as the return value of the function that the rule is being written for. Especially (but not exclusively) for scalar functions.
+ - `ΔΩ` is thus a seed for the pullback.
+ - `∂Ω` is thus the output of a pushforward

From 6bc00a3ddc96575f37e1127f1c81cfa5924c111b Mon Sep 17 00:00:00 2001
From: Lyndon White
Date: Fri, 13 Sep 2019 23:41:10 +0100
Subject: [PATCH 08/51] Update docs/src/index.md

Co-Authored-By: simeonschaub
---
 docs/src/index.md | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/docs/src/index.md b/docs/src/index.md
index d472994a1..2c4c3106b 100644
--- a/docs/src/index.md
+++ b/docs/src/index.md
@@ -365,7 +365,7 @@
 - `Δx` is the input to a propagator, (i.e. a _seed_ for a _pullback_; or a _perturbation_ for a _pushforward_)
 - `∂x` is the output of a propagator
- - `dx` could be anything, including a pullback. It really should show up outside of tests.
+ - `dx` could be anything, including a pullback. It really should not show up outside of tests.
 - `ẋ` is a derivative moving forward.
 - `x̄` is a derivative moving backward.

From 28c7eb005009d7ed222a5792f342b344657d163d Mon Sep 17 00:00:00 2001
From: Lyndon White
Date: Fri, 13 Sep 2019 23:42:45 +0100
Subject: [PATCH 09/51] Update docs/src/index.md

Co-Authored-By: simeonschaub
---
 docs/src/index.md | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/docs/src/index.md b/docs/src/index.md
index 2c4c3106b..2f09cc285 100644
--- a/docs/src/index.md
+++ b/docs/src/index.md
@@ -309,7 +309,7 @@
 return @thunk(∂A)
 ```
 In the bad example `foo(x)` gets computed eagerly, and all that the thunk is doing is wrapping the already calculated result in a function that returns it.
-### Becareful of using Adjoing when you mean Transpose
+### Be careful with using Adjoint when you mean Transpose
 Remember for complex numbers `a'` (i.e. `adjoint(a)`) takes the complex conjugate. Instead you probably want `transpose(a)`.
From ba889eae0a7cfe357d563d9e62b99ab7ab20df01 Mon Sep 17 00:00:00 2001 From: Lyndon White Date: Fri, 13 Sep 2019 23:43:37 +0100 Subject: [PATCH 10/51] Update docs/src/index.md Co-Authored-By: simeonschaub --- docs/src/index.md | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/docs/src/index.md b/docs/src/index.md index 2f09cc285..4b9f6a1db 100644 --- a/docs/src/index.md +++ b/docs/src/index.md @@ -292,7 +292,7 @@ or even rather than `zeros(n)`, `ones(m,n)` ### Use thunks appropriately: -If work is only required for 1 of the returned differentials it should be wrapped in a `@thunk` (potentially using a begin-end block) +If work is only required for one of the returned differentials it should be wrapped in a `@thunk` (potentially using a begin-end block) If there are multiple return values, almost always their should be computation wrapped in a `@thunk`s From 56841458a138f531c0a7622bf9163696d51655c6 Mon Sep 17 00:00:00 2001 From: Lyndon White Date: Fri, 13 Sep 2019 23:44:23 +0100 Subject: [PATCH 11/51] Update docs/src/index.md Co-Authored-By: simeonschaub --- docs/src/index.md | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/docs/src/index.md b/docs/src/index.md index 4b9f6a1db..6b61f507e 100644 --- a/docs/src/index.md +++ b/docs/src/index.md @@ -287,7 +287,7 @@ Zygote.gradient(foo, x) ## On writing good rrule / frules ### Return Zero or One -rather tan `0` or `1` +rather than `0` or `1` or even rather than `zeros(n)`, `ones(m,n)` ### Use thunks appropriately: From 4d88a5f40faafea4d783a71662c38b5867caf80f Mon Sep 17 00:00:00 2001 From: Lyndon White Date: Sat, 14 Sep 2019 11:12:44 +0100 Subject: [PATCH 12/51] Apply suggestions from code review Co-Authored-By: Kristoffer Carlsson --- docs/src/index.md | 54 +++++++++++++++++++++++------------------------ 1 file changed, 27 insertions(+), 27 deletions(-) diff --git a/docs/src/index.md b/docs/src/index.md index 6b61f507e..fd384cc81 100644 --- a/docs/src/index.md +++ b/docs/src/index.md @@ 
-11,13 +11,13 @@ ChainRules is all about providing a rich set of rules for doing differentiation. When a person does introductory calculus, they learn that the derivative (with respect to `x`) of `a*x` is `a`, and the derivative of `sin(x)` is `cos(x)`, etc. And they learn how to combine simple rules, via the chain rule, to differentiate complicated functions. -ChainRules.jl basically a progamatic repository of that knowledge, with the generalizations to higher dimensions. +ChainRules.jl is a programmatic repository of that knowledge, with the generalizations to higher dimensions. Autodiff (AD) tools roughly work by reducing a problem down to simple parts that they know the rules for, and then combining those rules. Knowing rules for more complicated functions speeds up the autodiff process as it doesn't have to break things down as much. -** ChainRules is an AD independent collection of rules to use in an differentiation system ** +** ChainRules is an AD independent collection of rules to use in a differentiation system ** ### `rrule` and `frule` @@ -32,7 +32,7 @@ Knowing rules for more complicated functions speeds up the autodiff process as i The rules are encoded as `rrules` and `frules`, for use in reverse-mode and forward-mode differentiation respectively. -The `rrule` for some function `foo`, takes the positional argument `args` and keyword argument `kwargs` is written: +The `rrule` for some function `foo`, which takes the positional argument `args` and keyword argument `kwargs`, is written: ```julia function rrule(::typeof(foo), args; kwargs...) ... @@ -83,8 +83,8 @@ function pushforward(Δself, Δargs...) return ∂Y end ``` -**Note:** that there is one `Δargs...` per `arg` to the original function, and they are similar in type/structure to the corresponding inputs. -Plus the `Δself` (don't worry we will be back to explain this soon). 
+**Note:** that there is one `Δarg` per `arg` to the original function, and they (the `Δargs`) are similar in type/structure to the corresponding inputs (`args`). +as well as the `Δself` (the presence of `Δself` will be explained soon). The `∂Y` will be similar in type/structure to the original function's output `Y`. In particular if that function returned a tuple then `∂Y` will be a tuple of same size. @@ -107,8 +107,8 @@ If the function is `y=f(x)` often the pullback will be written `ȳ=pullback(x̄ !!! Terminology: - Sometimes _pertubation_, _seed_, _sensitivity_ will be used interchangeably, depending on task/subfield (_sensitivity_ analysis and perturbation analysis are apparently very big on just calling everying _sensitivity_ or _pertubation_ respectively.) - At the end of the day they are all _wibbles_ or _wobbles_. + Sometimes _perturbation_, _seed_, and _sensitivity_ will be used interchangeably, depending on task/subfield (_sensitivity_ analysis and perturbation theory are apparently very big on just calling everything _sensitivity_ or _perturbation_ respectively.) + At the end of the day, they are all _wibbles_ or _wobbles_. ### Self derivative `Δself`, `∂self` etc. 
@@ -140,13 +140,13 @@ which is, for things without fields, the constant `NO_FIELDS` which indicates th - **Pushforward:** - returned by `ffrule` - takes input space wibbles, gives output space wobbles - - 1 argument per orignal function argument + 1 for the function itself - - 1 return per orignal function return + - 1 argument per original function argument + 1 for the function itself + - 1 return per original function return - **Pullback** - return by `rrule` - takes output space wobbles, gives input space wibbles - 1 argument per original function return - - 1 return per orignal function argument + 1 for the function itself + - 1 return per original function argument + 1 for the function itself #### Pushforward/Pullback and Total Derivative/Gradient @@ -164,8 +164,8 @@ written mathematically as ``df_{(a,b,c)}`` Similarly: The most trivial use of the rrule+pullback is to calculate the [Gradient](https://en.wikipedia.org/wiki/Gradient): ```julia -y, pullback = frule(f, a, b, c) -∇f = pushforward(1) # for appropriate `1`-like seed. +y, pullback = rrule(f, a, b, c) +∇f = pullback(1) # for appropriate `1`-like seed. s̄, ā, b̄, c̄ = ∇f ``` Then we have that @@ -185,7 +185,7 @@ are not always the same type as the input/outputs of the original function. They are differentials, differency-equivalents. A differential might be such a regular type, -like a Number, or a Matrix, +like a `Number`, or a `Matrix`, or it might be one of the `AbstractDifferential` subtypes. Differentials support a number of operations. @@ -195,8 +195,8 @@ And `extern` which converts them into a conventional type. The most important AbstractDifferentials when getting started are the ones about avoiding work: - - `Thunk`: this is a deferred computation. A thunk is a [word for a zero argument closure](https://en.wikipedia.org/wiki/Thunk). An computation wrapped in a `@thunk` doesn't get evaluated until `extern` is called on the `Thunk`. More on thunks later. 
- - `One`, `Zero`: There are special representions of `1` and `0`. They do great things around avoiding expanding `Thunks` in multiplication and (for `Zero`) addition. + - `Thunk`: this is a deferred computation. A thunk is a [word for a zero argument closure](https://en.wikipedia.org/wiki/Thunk). A computation wrapped in a `@thunk` doesn't get evaluated until `extern` is called on the `Thunk`. More on thunks later. + - `One`, `Zero`: There are special representations of `1` and `0`. They do great things around avoiding expanding `Thunks` in multiplication and (for `Zero`) addition. #### Others: don't worry about them right now - Wirtinger: it is complex. The docs need to be better. [Read the links in this issue](https://github.com/JuliaDiff/ChainRulesCore.jl/issues/40). @@ -208,10 +208,10 @@ The most important AbstractDifferentials when getting started are the ones about ## Example of using ChainRules directly. While ChainRules is largely intended as a backend for Autodiff systems it can be used directly. -(Infact this can be very useful if you can constraint the code you need to differnetiate to only use thing that have rules defined for. +(In fact, this can be very useful if you can constraint the code you need to differentiate to only use things that have rules defined for. This was once how all neural network code worked.) -Using ChainRules directly also helped get a feel for it. +Using ChainRules directly also helps get a feel for it. ```julia @@ -288,15 +288,15 @@ Zygote.gradient(foo, x) ### Return Zero or One rather than `0` or `1` -or even rather than `zeros(n)`, `ones(m,n)` +or even rather than `zeros(n)`, or the identity matrix. ### Use thunks appropriately: -If work is only required for one of the returned differentials it should be wrapped in a `@thunk` (potentially using a begin-end block) +If work is only required for one of the returned differentials it should be wrapped in a `@thunk` (potentially using a `begin`-`end` block). 
-If there are multiple return values, almost always their should be computation wrapped in a `@thunk`s +If there are multiple return values, their computation should almost always be wrapped in a `@thunk`. -Don’t wrap variables in thunks, wrap the computations that fill those variables in thunks: Eg: +Don’t wrap variables in thunks, wrap the computations that fill those variables in thunks, e.g. write: Write: ```julia ∂A = @thunk(foo(x)) @@ -311,10 +311,10 @@ In the bad example `foo(x)` gets computed eagerly, and all that the thunk is doi ### Be careful with using Adjoint when you mean Transpose -Rember for complex numbers `a'` (i.e. `adjoint(a)`) takes the complex conjugate. Instead you probably want `transpose(a)`. +Remember for complex numbers `a'` (i.e. `adjoint(a)`) takes the complex conjugate. Instead, you probably want `transpose(a)`. While there are arguments that for reverse-mode -taking the adjoint is correct, it is not currently the behavour of ChainRules to do so. +taking the adjoint is correct, it is not currently the behavior of ChainRules to do so. Feel free to open an issue and fight about it. All differentials support `conj` efficiently, which makes it easy to change in post. @@ -327,7 +327,7 @@ function frule(::typeof(foo), x) return (foo(x), (_, ẋ)->bar(ẋ)) end ``` -Whichrite: +write: ```julia function frule(::typeof(foo), x) Y = foo(x) @@ -345,12 +345,12 @@ the gensymed name of the local function will include the name you gave it. Which makes it a lot simpler to debug from the stacktrace. ### Write tests -There are faily decent tools for writing tests based on [FiniteDifferences.jl](https://github.com/JuliaDiff/FiniteDifferences.jl) +There are fairly decent tools for writing tests based on [FiniteDifferences.jl](https://github.com/JuliaDiff/FiniteDifferences.jl). They are in [`tests/test_utils.jl`](https://github.com/JuliaDiff/ChainRules.jl/blob/master/test/test_util.jl) Take a look at existing test and you should see how to do stuff. !!! 
important
-    Don't write equations in tests.
+    Don't use analytical derivations for derivatives in the tests. Use finite differencing.
    If you are writing equations in the tests, then you are using the same equations as you used to write your code. That is not OK.
    We've had several bugs from people misreading/misunderstanding equations, and then using them for both tests and code.
    And then we have good coverage that is worthless.
@@ -369,6 +369,6 @@ E.g. it is very easy to check gradients or derivatives with [WolframAlpha](http
 - `ẋ` is a derivative moving forward.
 - `x̄` is a derivative moving backward.
- - `Ω` is often used as the return value of the function having the rule found for. Especially, (but not eexlusively.) for scalar functions.
+ - `Ω` is often used as the return value of the function the rule is being defined for. Especially (but not exclusively) for scalar functions.
 - `ΔΩ` is thus a seed for the pullback.
 - `∂Ω` is thus the output of a pushforward
From ab10b11f20692ed2634d1e6c4f5f198e34d7578b Mon Sep 17 00:00:00 2001
From: Lyndon White
Date: Sat, 14 Sep 2019 16:29:41 +0100
Subject: [PATCH 13/51] Update docs/src/index.md

Co-Authored-By: Kristoffer Carlsson
---
 docs/src/index.md | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/docs/src/index.md b/docs/src/index.md
index fd384cc81..93251d7d9 100644
--- a/docs/src/index.md
+++ b/docs/src/index.md
@@ -130,7 +130,7 @@
 For example a closure has the fields it closes over; and a callable object (i.e. a functor) like a `Flux.Dense` has the fields of that object.

 **Thus every function is treated as having the extra implicit argument `self`, which captures those fields.**

-So all _pushforward_ take in a extra argument,
+So all _pushforward_ take in an extra argument,
 which unless they are for things with fields, they ignore.
(thus common to write `function pushforward(_, Δargs...)` in those cases), and every _pullback_ return an extra `∂self`, which is, for things without fields, the constant `NO_FIELDS` which indicates there is no fields within the function itself. From 563a054074c4facf2f880752e2b8177be43f46c5 Mon Sep 17 00:00:00 2001 From: Lyndon White Date: Mon, 16 Sep 2019 14:24:27 +0100 Subject: [PATCH 14/51] Make terminology blocks render. Also style the TOC to match --- docs/make.jl | 2 +- docs/src/assets/chainrules.css | 40 ++++++++++++++++++++++++++++++++++ docs/src/index.md | 9 ++++---- 3 files changed, 45 insertions(+), 6 deletions(-) create mode 100644 docs/src/assets/chainrules.css diff --git a/docs/make.jl b/docs/make.jl index 82ce86151..34b664ed1 100644 --- a/docs/make.jl +++ b/docs/make.jl @@ -4,7 +4,7 @@ using Documenter makedocs( modules=[ChainRules, ChainRulesCore], - format=Documenter.HTML(prettyurls=false), + format=Documenter.HTML(prettyurls=false, assets = ["assets/chainrules.css"]), sitename="ChainRules", authors="Jarrett Revels and other contributors", pages=[ diff --git a/docs/src/assets/chainrules.css b/docs/src/assets/chainrules.css new file mode 100644 index 000000000..22769cf64 --- /dev/null +++ b/docs/src/assets/chainrules.css @@ -0,0 +1,40 @@ +/* TOC */ +nav.toc { + background-color: #FFEC8B; + box-shadow: inset -14px 0px 5px -12px rgb(210,210,210); +} + +nav.toc ul.internal { + background-color: #FFFEDD; + box-shadow: inset -14px 0px 5px -12px rgb(210,210,210); + list-style: none; +} + +nav.toc ul.internal a:hover { + background-color: #FFEC8B; + color: black; +} + +nav.toc ul a:hover { + color: #fcfcfc; + background-color: #B8860B; +} + +nav.toc li.current > .toctext { + background-color: #B8860B; +} + +/* Terminology Block */ + +div.admonition.terminology div.admonition-title:before { + content: "Terminology: "; + font-family: 'Lato', 'Helvetica Neue', Arial, sans-serif; + font-weight: bold; +} +div.admonition.terminology div.admonition-title { + 
background-color: #FFEC8B; +} + +div.admonition.terminology div.admonition-text { + background-color: #FFFEDD; +} diff --git a/docs/src/index.md b/docs/src/index.md index 93251d7d9..fcbe4ba5c 100644 --- a/docs/src/index.md +++ b/docs/src/index.md @@ -21,8 +21,7 @@ Knowing rules for more complicated functions speeds up the autodiff process as i ### `rrule` and `frule` -!!! Terminology "`rrule` and `frule`" - +!!! terminology "`rrule` and `frule`" `rrule` and `frule` are ChainRules.jl specific terms. And there exact functioning is kind of ChainRule specific, though other tools may do similar. @@ -59,7 +58,7 @@ Almost always the _pushforward_/_pullback_ will be declared locally with-in the ### The propagators: pushforward and pullback -!!! Terminology "Pushforward and Pullback" +!!! terminology "Pushforward and Pullback" _Pushforward_ and _Pullback_ are fancy words that the autodiff community recently stole from Differential Geometry. The are broadly in agreement with the use of these terms in differential geometry. But any geometer will tell you these are the super-boring flat cases. Some will also frown at you. @@ -106,13 +105,13 @@ The input to the pullback is often called the _seed_. If the function is `y=f(x)` often the pullback will be written `ȳ=pullback(x̄)`. -!!! Terminology: +!!! terminology Sometimes _perturbation_, _seed_, and _sensitivity_ will be used interchangeably, depending on task/subfield (_sensitivity_ analysis and perturbation theory are apparently very big on just calling everything _sensitivity_ or _perturbation_ respectively.) At the end of the day, they are all _wibbles_ or _wobbles_. ### Self derivative `Δself`, `∂self` etc. -!!! Terminology +!!! terminology To my knowledge there is no standard terminology for this. 
Other good names might be `Δinternal`/`∂internal`

From 33d9878d1946cd3018bf526e58a51deda48ccef6 Mon Sep 17 00:00:00 2001
From: Lyndon White
Date: Mon, 16 Sep 2019 16:50:40 +0100
Subject: [PATCH 15/51] move explanation of returning value to FAQ

---
 docs/src/index.md | 18 +++++++++++++-----
 1 file changed, 13 insertions(+), 5 deletions(-)

diff --git a/docs/src/index.md b/docs/src/index.md
index fcbe4ba5c..8d78ad4db 100644
--- a/docs/src/index.md
+++ b/docs/src/index.md
@@ -40,9 +40,6 @@
 end
 ```
 where `y` must be equal to `foo(args; kwargs...)`, and _pullback_ is a function to propagate the derivative information backwards at that point (more later).
-Often but not always it is calculated directly.
-The exception is that we can calculate it indirectly to make
-the `pullback` faster. (more on _pullback_ later)

 Similarly, the `frule` is written:
 ```julia
@@ -358,7 +355,7 @@ E.g. it is very easy to check gradients or derivatives with [WolframAlpha](http

------------------------------------------

-### FAQ:
+## FAQ:

 ### What is up with the different symbols?
@@ -370,4 +367,15 @@ E.g. it is very easy to check gradients or derivatives with [WolframAlpha](http
 - `Ω` is often used as the return value of the function having the rule found for. Especially, (but not exclusively.) for scalar functions.
 - `ΔΩ` is thus a seed for the pullback.
- - `∂Ω` is thus the output of a pushforward
+ - `∂Ω` is thus the output of a pushforward.
+
+### Why do `frule` and `rrule` return the function evaluation?
+You might wonder why `frule(f, x)` returns `f(x)` and the pushforward for `f` at `x`,
+and similarly why `rrule` returns `f(x)` and the pullback for `f` at `x`.
+Why not just return the pushforward/pullback,
+and let the user call `f(x)` to get the answer separately?
+
+There are two reasons the rules also compute `f(x)`.
+1. For some rules the output value is used in the definition of its propagator. For example `tan`.
+2.
For some rules an alternative way of calculating `f(x)` can give the same answer, +but also define intermediate values that can be used in the calculations within the propagator. From 7a99cc6916e579cd2e54c64d33d3009d4575556d Mon Sep 17 00:00:00 2001 From: Nick Robinson Date: Tue, 17 Sep 2019 14:02:25 +0100 Subject: [PATCH 16/51] Reformat docs homepage --- docs/src/index.md | 337 +++++++++++++++++++++------------------------- 1 file changed, 154 insertions(+), 183 deletions(-) diff --git a/docs/src/index.md b/docs/src/index.md index 8d78ad4db..d950e419b 100644 --- a/docs/src/index.md +++ b/docs/src/index.md @@ -4,42 +4,39 @@ DocTestSetup = :(using ChainRulesCore, ChainRules) # ChainRules -[ChainRules.jl](https://github.com/JuliaDiff/ChainRules.jl) provides a variety of common utilities that can be used by downstream automatic differentiation (AD) tools to define and execute forward-, reverse-, and mixed-mode primitives. +[ChainRules](https://github.com/JuliaDiff/ChainRules.jl) provides a variety of common utilities that can be used by downstream [automatic differentiation (AD)](https://en.wikipedia.org/wiki/Automatic_differentiation) tools to define and execute forward-, reverse-, and mixed-mode primitives. -### Introduction: -ChainRules is all about providing a rich set of rules for doing differentiation. -When a person does introductory calculus, they learn that the derivative (with respect to `x`) -of `a*x` is `a`, and the derivative of `sin(x)` is `cos(x)`, etc. -And they learn how to combine simple rules, via the chain rule, to differentiate complicated functions. -ChainRules.jl is a programmatic repository of that knowledge, with the generalizations to higher dimensions. +### Introduction -Autodiff (AD) tools roughly work by reducing a problem down to simple parts that they know the rules for, -and then combining those rules. +ChainRules is all about providing a rich set of rules for differentiation. 
+When a person learns introductory calculus, they learn that the derivative (with respect to `x`) of `a*x` is `a`, and the derivative of `sin(x)` is `cos(x)`, etc.
+And they learn how to combine simple rules, via [the chain rule](https://en.wikipedia.org/wiki/Chain_rule), to differentiate complicated functions.
+ChainRules is a programmatic repository of that knowledge, with the generalizations to higher dimensions.
+
+[Autodiff (AD)](https://en.wikipedia.org/wiki/Automatic_differentiation) tools roughly work by reducing a problem down to simple parts that they know the rules for, and then combining those rules.
 Knowing rules for more complicated functions speeds up the autodiff process as it doesn't have to break things down as much.

-** ChainRules is an AD independent collection of rules to use in a differentiation system **
+**ChainRules is an AD-independent collection of rules to use in a differentiation system.**

 ### `rrule` and `frule`

 !!! terminology "`rrule` and `frule`"
+    `rrule` and `frule` are ChainRules specific terms.
+    Their exact functioning is fairly ChainRules specific, though other tools may do something similar.
+    The core notion is sometimes called _custom AD primitives_, _custom adjoints_, _custom sensitivities_.

-The rules are encoded as `rrules` and `frules`,
-for use in reverse-mode and forward-mode differentiation respectively.
+The rules are encoded as `rrules` and `frules`, for use in reverse-mode and forward-mode differentiation respectively.

 The `rrule` for some function `foo`, which takes the positional argument `args` and keyword argument `kwargs`, is written:
+
 ```julia
 function rrule(::typeof(foo), args; kwargs...)
     ...
return y, pullback
 end
 ```
-where `y` must be equal to `foo(args; kwargs...)`,
-and _pullback_ is a function to propagate the derivative information backwards at that point (more later).
+where `y` must be equal to `foo(args; kwargs...)`, and `pullback` is a function to propagate the derivative information backwards at that point (more later).

 Similarly, the `frule` is written:
 ```julia
@@ -48,47 +45,51 @@ function frule(::typeof(foo), args; kwargs...)
     return y, pushforward
 end
 ```
-again `y=foo(args, kwargs...)`,
-and _pushforward_ is a function to propagate the derivative information forwards at that point (more later).
+again `y = foo(args; kwargs...)`, and `pushforward` is a function to propagate the derivative information forwards at that point (more later).

-Almost always the _pushforward_/_pullback_ will be declared locally with-in the `ffrule`/`rrule`, and will be a _closure_ over some of the other arguments.
+Almost always the _pushforward_/_pullback_ will be declared locally within the `frule`/`rrule`, and will be a _closure_ over some of the other arguments.

 ### The propagators: pushforward and pullback

-!!! terminology "Pushforward and Pullback"
+!!! terminology "pushforward and pullback"

-    _Pushforward_ and _Pullback_ are fancy words that the autodiff community recently stole from Differential Geometry.
-    The are broadly in agreement with the use of these terms in differential geometry. But any geometer will tell you these are the super-boring flat cases. Some will also frown at you.
+    _Pushforward_ and _pullback_ are fancy words that the autodiff community adopted from Differential Geometry.
+    They are broadly in agreement with the use of [pullback](https://en.wikipedia.org/wiki/Pullback_(differential_geometry)) and [pushforward](https://en.wikipedia.org/wiki/Pushforward_(differential)) in differential geometry.
+    But any geometer will tell you these are the super-boring flat cases. Some will also frown at you.
Other terms that may be used include for _pullback_ the **backpropagator**, and by analogy for _pushforward_ the **forwardpropagator**, thus these are the _propagators_.
-    These are also good names because effectively they propagate wibbles and wobbles through them, via the chainrule.
+    These are also good names because effectively they propagate wiggles and wobbles through them, via the chain rule.
    (the term **backpropagator** may originate with ["Lambda The Ultimate Backpropagator"](http://www-bcl.cs.may.ie/~barak/papers/toplas-reverse.pdf) by Pearlmutter and Siskind, 2008)

+#### Core Idea

-#### Core Important Idea:
-  - The **Pushforward** takes a wiggle in the _input space_, and tells what wobble you would create in the output space, by passing it through the function.
-  - The **Pullback** takes a wobble in the _output space_, and tells you what wiggle you would need to make in the _input_ space to achieve it.
+  - The **pushforward** takes a wiggle in the _input space_, and tells what wobble you would create in the output space, by passing it through the function.
+  - The **pullback** takes a wobble in the _output space_, and tells you what wiggle you would need to make in the _input_ space to achieve it.

 #### The anatomy of pushforward and pullback

 For our function `foo(args...; kwargs) = Y`:

 The pushforward is a function:
+
 ```julia
 function pushforward(Δself, Δargs...)
     ...
     return ∂Y
 end
 ```
-**Note:** that there is one `Δarg` per `arg` to the original function, and they (the `Δargs`) are similar in type/structure to the corresponding inputs (`args`).
-as well as the `Δself` (the presence of `Δself` will be explained soon).
-The `∂Y` will be similar in type/structure to the original function's output `Y`.
-In particular if that function returned a tuple then `∂Y` will be a tuple of same size.

 The input to the pushforward is often called the _perturbation_.
-If the function is `y=f(x)` often the pushforward will be written `ẏ=pushforward(ẋ)`.
+If the function is `y = f(x)` often the pushforward will be written `ẏ = pushforward(ḟ, ẋ)`. + +!!! note + + There is one `Δarg` per `arg` to the original function. + The `Δargs` are similar in type/structure to the corresponding inputs `args` (`Δself` is explained below). + The `∂Y` are similar in type/structure to the original function's output `Y`. + In particular if that function returned a tuple then `∂Y` will be a tuple of same size. +The pullback is a function: -The pullback is a function ```julia function pullback(ΔY) ... @@ -96,120 +97,101 @@ function pullback(ΔY) end ``` -**Note:** that the pullback returns one `∂arg` per original `arg` to the function, plus one for the fields of the function itself (again will get to that below). - The input to the pullback is often called the _seed_. -If the function is `y=f(x)` often the pullback will be written `ȳ=pullback(x̄)`. +If the function is `y = f(x)` often the pullback will be written `x̄ = pullback(ȳ)`. + +!!! note + The pullback returns one `∂arg` per `arg` to the original function, plus one for the fields of the function itself (explained below). !!! terminology - Sometimes _perturbation_, _seed_, and _sensitivity_ will be used interchangeably, depending on task/subfield (_sensitivity_ analysis and perturbation theory are apparently very big on just calling everything _sensitivity_ or _perturbation_ respectively.) - At the end of the day, they are all _wibbles_ or _wobbles_. + Sometimes _perturbation_, _seed_, and _sensitivity_ will be used interchangeably, depending on task/subfield (sensitivity analysis and perturbation theory are apparently very big on just calling everything _sensitivity_ or _perturbation_ respectively.) + At the end of the day, they are all _wiggles_ or _wobbles_. ### Self derivative `Δself`, `∂self` etc. !!! terminology To my knowledge there is no standard terminology for this. - Other good names might be `Δinternal`/`∂internal` + Other good names might be `Δinternal`/`∂internal`. 
-From the mathematical perspective,
-one may have been wondering what all this `Δself`, `∂self` is.
-After all a function with two inputs:
-say `f(a, b)` only has two partial derivatives,
-``\dfrac{∂f}{∂a}``, ``\dfrac{∂f}{∂b}``,
-why then does the _pushforward_ take in this extra `Δself`,
-and why does the _pullback_ return this extra `∂self` ?
+From the mathematical perspective, one may have been wondering what all this `Δself`, `∂self` is.
+After all a function with two inputs, say `f(a, b)`, only has two partial derivatives:
+``\dfrac{∂f}{∂a}``, ``\dfrac{∂f}{∂b}``.
+Why then does a `pushforward` take in this extra `Δself`, and why does a `pullback` return this extra `∂self`?

-The thing is in julia
-the function `f` may itself have internal values.
-For example a closure has the fields it closes over; and a callable object (i.e. a functor) like a `Flux.Dense` has the fields of that object.
-
-**Thus every function is treated as having the extra implicit argument `self`,
-which captures those fields.**
-So all _pushforward_ take in an extra argument,
-which unless they are for things with fields, they ignore. (thus common to write `function pushforward(_, Δargs...)` in those cases),
-and every _pullback_ return an extra `∂self`,
-which is, for things without fields, the constant `NO_FIELDS` which indicates there is no fields within the function itself.
+The reason is that in Julia the function `f` may itself have internal fields.
+For example a closure has the fields it closes over; a callable object (i.e. a functor) like a `Flux.Dense` has the fields of that object.
+**Thus every function is treated as having the extra implicit argument `self`, which captures those fields.**
+So every `pushforward` takes in an extra argument, which is ignored unless the original function had fields.
+It is common to write `function foo_pushforward(_, Δargs...)` in the case when `foo` does not have fields.
+Similarly every `pullback` returns an extra `∂self`, which for things without fields is the constant `NO_FIELDS`, indicating there are no fields within the function itself.

 #### Pushforward / Pullback summary

+
 - **Pushforward:**
    - returned by `frule`
    - takes input space wiggles, gives output space wobbles
    - 1 argument per original function argument + 1 for the function itself
    - 1 return per original function return
+
 - **Pullback**
    - returned by `rrule`
    - takes output space wobbles, gives input space wiggles
    - 1 argument per original function return
    - 1 return per original function argument + 1 for the function itself

 #### Pushforward/Pullback and Total Derivative/Gradient

-The most trivial use of the frule+pushforward is to calculate the [Total Derivative](https://en.wikipedia.org/wiki/Total_derivative):
+The most trivial use of `frule` and the returned `pushforward` is to calculate the [Total Derivative](https://en.wikipedia.org/wiki/Total_derivative):
+
 ```julia
-y, pushforward = frule(f, a, b, c)
-ẏ = pushforward(1, 1, 1, 1) # for appropriate `1`-like perturbation.
+y, f_pushforward = frule(f, a, b, c)
+ẏ = f_pushforward(1, 1, 1, 1) # for appropriate `1`-like perturbation.
 ```
-Then we have that
-`ẏ` is the _total derivative_ of
-`f` at `(a, b, c)`:
-written mathematically as ``df_{(a,b,c)}``
+Then we have that `ẏ` is the _total derivative_ of `f` at `(a, b, c)`, written mathematically as ``df_{(a,b,c)}``.
+
+Similarly, the most trivial use of `rrule` and the returned `pullback` is to calculate the [Gradient](https://en.wikipedia.org/wiki/Gradient):

 ```julia
-y, pullback = rrule(f, a, b, c)
-∇f = pullback(1) # for appropriate `1`-like seed.
+y, f_pullback = rrule(f, a, b, c)
+∇f = f_pullback(1) # for appropriate `1`-like seed.
 s̄, ā, b̄, c̄ = ∇f
 ```
-Then we have that
-`∇f` is the _gradient_ of
-`f` at `(a, b, c)`.
-And we thus have the partial derivative:
-s̄, ā, b̄, c̄.
-(Including the and the self-partial derivative,
-s̄).
-Written mathematically as ``\dfrac{∂f}{∂a}``, ``\dfrac{∂f}{∂b}``, ``\dfrac{∂f}{∂c}``.
+Then we have that `∇f` is the _gradient_ of `f` at `(a, b, c)`.
+And we thus have the partial derivatives ``ā = \dfrac{∂f}{∂a}``, ``b̄ = \dfrac{∂f}{∂b}``, ``c̄ = \dfrac{∂f}{∂c}``, as well as the self-partial derivative ``s̄``.

 ### Differentials

-The values that come back from pullbacks,
-or pushforwards
-are not always the same type as the input/outputs of the original function.
-They are differentials,
-differency-equivalents.
-A differential might be such a regular type,
-like a `Number`, or a `Matrix`,
-or it might be one of the `AbstractDifferential` subtypes.
+The values that come back from pullbacks or pushforwards are not always the same type as the inputs/outputs of the original function.
+They are differentials: derivative-like equivalents of those values.
+A differential might be a regular type, like a `Number` or a `Matrix`, or it might be one of the `AbstractDifferential` subtypes.

 Differentials support a number of operations.
-Most importantly:
-`+` and `*` which lets them act as mathematically objects.
-And `extern` which converts them into a conventional type.
+Most importantly: `+` and `*`, which let them act as mathematical objects.
+And `extern`, which converts `AbstractDifferential` types into a conventional non-ChainRules type.

-The most important AbstractDifferentials when getting started are the ones about avoiding work:
+The most important `AbstractDifferential`s when getting started are the ones about avoiding work:
  - `Thunk`: this is a deferred computation. A thunk is a [word for a zero argument closure](https://en.wikipedia.org/wiki/Thunk).
A computation wrapped in a `@thunk` doesn't get evaluated until `extern` is called on the `Thunk`. More on thunks later.
  - `One`, `Zero`: These are special representations of `1` and `0`. They do great things around avoiding expanding `Thunks` in multiplication and (for `Zero`) addition.

-#### Others: don't worry about them right now
-  - Wirtinger: it is complex. The docs need to be better. [Read the links in this issue](https://github.com/JuliaDiff/ChainRulesCore.jl/issues/40).
-  - Casted: it implements broadcasting mechanics. See [#10](https://github.com/JuliaDiff/ChainRulesCore.jl/issues/10)
-  - InplacableThunk: it is like a Thunk but it can do `store!` and `accumulate!` inplace.
-
+#### Other `AbstractDifferential`s: don't worry about them right now
+  - `Wirtinger`: it is complex. The docs need to be better. [Read the links in this issue](https://github.com/JuliaDiff/ChainRulesCore.jl/issues/40).
+  - `Casted`: it implements broadcasting mechanics. See [#10](https://github.com/JuliaDiff/ChainRulesCore.jl/issues/10)
+  - `InplaceableThunk`: it is like a Thunk but it can do `store!` and `accumulate!` in-place.

-------------------------------
+
 ## Example of using ChainRules directly.

-While ChainRules is largely intended as a backend for Autodiff systems it can be used directly.
-(In fact, this can be very useful if you can constraint the code you need to differentiate to only use things that have rules defined for.
-This was once how all neural network code worked.)
+While ChainRules is largely intended as a backend for autodiff systems, it can be used directly.
+In fact, this can be very useful if you can constrain the code you need to differentiate to only use things that have rules defined for them.
+This was once how all neural network code worked.

 Using ChainRules directly also helps get a feel for it.
-
```julia
using ChainRules

@@ -218,164 +200,153 @@ function foo(x)
    b = 2a
    c = asin(b)
    return c
-end;
+end

-###
-# Find dfoo/dx via rrules
+#### Find dfoo/dx via rrules

# First the forward pass, accumulating rules
-x=3;
-a, a_pb = rrule(sin, x);
-b, b_pb = rrule(*, 2, a);
-c, c_pb = rrule(asin, b)
+x = 3;
+a, a_pullback = rrule(sin, x);
+b, b_pullback = rrule(*, 2, a);
+c, c_pullback = rrule(asin, b)

# Then the backward pass calculating gradients
-c̄ = 1;
-_, b̄ = c_pb(extern(c̄));
-_, _, ā = b_pb(extern(b̄));
-_, x̄ = a_pb(extern(ā));
+c̄ = 1; # ∂c/∂c
+_, b̄ = c_pullback(extern(c̄)); # ∂c/∂b
+_, _, ā = b_pullback(extern(b̄)); # ∂c/∂a
+_, x̄ = a_pullback(extern(ā)); # ∂c/∂x = ∂f/∂x
extern(x̄)
# -2.0638950738662625

-###
-# Find dfoo/dx via frules
+#### Find dfoo/dx via frules

-# Unlike rrule can interleave evaluation and derivative evaluation
-x=3;
-ẋ=1;
-nofields = NamedTuple();
+# Unlike with rrule, we can interleave evaluation and derivative evaluation
+x = 3;
+ẋ = 1; # ∂x/∂x
+nofields = NamedTuple(); # ∂self/∂self

-a, a_pf = frule(sin, x);
-ȧ = a_pf(nofields, extern(ẋ));
+a, a_pushforward = frule(sin, x);
+ȧ = a_pushforward(nofields, extern(ẋ)); # ∂a/∂x

-b, b_pf = frule(*, 2, a);
-ḃ = b_pf(nofields, 0, extern(ȧ));
+b, b_pushforward = frule(*, 2, a);
+ḃ = b_pushforward(nofields, 0, extern(ȧ)); # ∂b/∂x = ∂b/∂a⋅∂a/∂x

-c, c_pf = frule(asin, b);
-ċ = c_pf(nofields, extern(ḃ));
+c, c_pushforward = frule(asin, b);
+ċ = c_pushforward(nofields, extern(ḃ)); # ∂c/∂x = ∂c/∂b⋅∂b/∂x = ∂f/∂x
extern(ċ)
# -2.0638950738662625

-###
-# Find dfoo/dx via finite-difference
+#### Find dfoo/dx via finite-differences
+
using FiniteDifferences
-central_fdm(5,1)(foo, x)
+central_fdm(5, 1)(foo, x)
# -2.0638950738670734

-###
-# Via ForwardDiff.jl
+#### Find dfoo/dx via ForwardDiff.jl
using ForwardDiff
ForwardDiff.derivative(foo, x)
# -2.0638950738662625

-###
-# Via Zygote
+#### Find dfoo/dx via Zygote.jl
using Zygote
Zygote.gradient(foo, x)
#
(-2.0638950738662625,) ``` - ------------------------------- - - -## On writing good rrule / frules +## On writing good `rrule` / `frule` methods ### Return Zero or One -rather than `0` or `1` -or even rather than `zeros(n)`, or the identity matrix. -### Use thunks appropriately: +Rather than `0` or `1` or even rather than `zeros(n)`, `ones(n)`, or the identity matrix `I`. -If work is only required for one of the returned differentials it should be wrapped in a `@thunk` (potentially using a `begin`-`end` block). +### Use `Thunk`s appropriately: + +If work is only required for one of the returned differentials, then it should be wrapped in a `@thunk` (potentially using a `begin`-`end` block). If there are multiple return values, their computation should almost always be wrapped in a `@thunk`. -Don’t wrap variables in thunks, wrap the computations that fill those variables in thunks, e.g. write: -Write: +Do _not_ wrap _variables_ in a `@thunk`, wrap the _computations_ that fill those variables in `@thunk`: + ```julia +# good: ∂A = @thunk(foo(x)) return ∂A -``` -Not: -```julia + +# bad: ∂A = foo(x) return @thunk(∂A) ``` In the bad example `foo(x)` gets computed eagerly, and all that the thunk is doing is wrapping the already calculated result in a function that returns it. -### Be careful with using Adjoint when you mean Transpose +### Be careful with using `adjoint` when you mean `transpose` + +Remember for complex numbers `a'` (i.e. `adjoint(a)`) takes the complex conjugate. +Instead you probably want `transpose(a)`. -Remember for complex numbers `a'` (i.e. `adjoint(a)`) takes the complex conjugate. Instead, you probably want `transpose(a)`. +While there are arguments that for reverse-mode taking the adjoint is correct, it is not currently the behavior of ChainRules to do so. +Feel free to open an issue to discuss it. -While there are arguments that for reverse-mode -taking the adjoint is correct, it is not currently the behavior of ChainRules to do so. 
-Feel free to open an issue and fight about it.
-All differentials support `conj` efficiently, which makes it easy to change in post.

-### Style
+### Code Style

-Used named local functions for the pushforward/pullback:
+Use named local functions for the `pushforward`/`pullback`:

-Rather than:
```julia
+# good:
function frule(::typeof(foo), x)
-    return (foo(x), (_, ẋ)->bar(ẋ))
+    Y = foo(x)
+    function foo_pushforward(_, ẋ)
+        return bar(ẋ)
+    end
+    return Y, foo_pushforward
end
-```
-write:
-```julia
+
+# bad:
function frule(::typeof(foo), x)
-    Y = foo(x)
-    function foo_pushforward(_, ẋ)
-        return bar(ẋ)
-    end
-    return Y, foo_pushforward
+    return foo(x), (_, ẋ) -> bar(ẋ)
end
```
-
-While this is more verbose,
-it ensures that if an error is thrown during the pullback/pushforward
-the gensymed name of the local function will include the name you gave it.
-Which makes it a lot simpler to debug from the stacktrace.
+While this is more verbose, it ensures that if an error is thrown during the `pullback`/`pushforward` the [`gensym`](https://docs.julialang.org/en/v1/base/base/#Base.gensym) name of the local function will include the name you gave it.
+This makes it a lot simpler to debug from the stacktrace.

### Write tests
+
There are fairly decent tools for writing tests based on [FiniteDifferences.jl](https://github.com/JuliaDiff/FiniteDifferences.jl).
-They are in [`tests/test_utils.jl`](https://github.com/JuliaDiff/ChainRules.jl/blob/master/test/test_util.jl)
+They are in [`test/test_util.jl`](https://github.com/JuliaDiff/ChainRules.jl/blob/master/test/test_util.jl).
Take a look at the existing tests and you should see how to do things.

-!!! important
-    Don't use analytical derivations for derivatives in the tests?
-    Use finite differencing.
-    If you are writing equations in the tests, then you use those same equations as use are using to write your code. Then that is not Ok.
We've had several bugs from people misreading/misunderstanding equations, and then using them for both tests and code. And then we have good coverage that is worthless.
+!!! warning
+    Use finite differencing to test derivatives.
+    Don't use analytical derivations for derivatives in the tests!
+    Since the rules are analytic expressions, re-writing those same expressions in the tests
+    cannot be an effective way to test, and will give misleading test coverage.

### CAS systems are your friends.
-E.g. it is very easy to check gradients or deriviatives with [WolframAlpha](https://www.wolframalpha.com/input/?i=gradient+atan2%28x%2Cy%29).
+
+It is very easy to check gradients or derivatives with a computer algebra system (CAS) like [WolframAlpha](https://www.wolframalpha.com/input/?i=gradient+atan2%28x%2Cy%29).

------------------------------------------

-## FAQ:
+## FAQ

### What is up with the different symbols?

- `Δx` is the input to a propagator (i.e. a _seed_ for a _pullback_; or a _perturbation_ for a _pushforward_)
- `∂x` is the output of a propagator
- `dx` could be anything, including a pullback. It really should not show up outside of tests.
- - `ẋ` is a derivative moving forward.
- - `x̄` is a dderivative moving backward.
-
+ - `v̇` is a derivative of the input moving forward: ``v̇ = \frac{∂v}{∂x}`` for input ``x``, intermediate value ``v``.
+ - `v̄` is a derivative of the output moving backward: ``v̄ = \frac{∂y}{∂v}`` for output ``y``, intermediate value ``v``.
- `Ω` is often used as the return value of the function having the rule found for. Especially, (but not exclusively.) for scalar functions.
- `ΔΩ` is thus a seed for the pullback.
- `∂Ω` is thus the output of a pushforward.

### Why does `frule` and `rrule` return the function evaluation?
-You might wonder why `frule(f, x)` returns `f(x)` and the pushforward for `f` at `x`,
-and similarly for `rrule` returing `f(x)` and the pullback for `f` at `x`.
-Why not just return the pushforward/pullback,
-and let the user call `f(x)` to get the answer seperately?
+
+You might wonder why `frule(f, x)` returns `f(x)` and the pushforward for `f` at `x`, and similarly for `rrule` returning `f(x)` and the pullback for `f` at `x`.
+Why not just return the pushforward/pullback, and let the user call `f(x)` to get the answer separately?

Their are two reasons the rules also create the `f(x)`.
1. For some rules the output value is used in the definition of its propagator. For example `tan`.
-2. For some rules an alternative way of calculating `f(x)` can give the same answer,
-but also define intermediate values that can be used in the calculations within the propagator.
+2. For some rules an alternative way of calculating `f(x)` can give the same answer while also generating intermediate values that can be used in the calculations within the propagator.

From f3f9e50abd7e3b87d480a67949cac71667caf2de Mon Sep 17 00:00:00 2001
From: Nick Robinson
Date: Tue, 17 Sep 2019 14:04:06 +0100
Subject: [PATCH 17/51] Make sidebar blue

---
 docs/src/assets/chainrules.css | 68 ++++++++++++++++++++++++++--------
 1 file changed, 53 insertions(+), 15 deletions(-)

diff --git a/docs/src/assets/chainrules.css b/docs/src/assets/chainrules.css
index 22769cf64..80c44ddad 100644
--- a/docs/src/assets/chainrules.css
+++ b/docs/src/assets/chainrules.css
@@ -1,34 +1,58 @@
-/* TOC */
-nav.toc {
-    background-color: #FFEC8B;
-    box-shadow: inset -14px 0px 5px -12px rgb(210,210,210);
+/* Links */
+
+a {
+    color: #4595D1;
}
-nav.toc ul.internal {
-    background-color: #FFFEDD;
-    box-shadow: inset -14px 0px 5px -12px rgb(210,210,210);
-    list-style: none;
+a:hover, a:focus {
+    color: #194E82;
}
+/* Navigation */
+
+nav.toc ul a:hover,
nav.toc ul.internal a:hover {
-    background-color: #FFEC8B;
-    color: black;
+    color: #FFFFFF;
+    background-color: #4595D1;
}
-nav.toc ul a:hover {
-    color: #fcfcfc;
-    background-color: #B8860B;
+nav.toc ul .toctext {
+    color: #FFFFFF;
+}
+
+nav.toc { + box-shadow: none; + color: #FFFFFF; + background-color: #194E82; } nav.toc li.current > .toctext { - background-color: #B8860B; + color: #FFFFFF; + background-color: #4595D1; + border-top-width: 0px; + border-bottom-width: 0px; +} + +nav.toc ul.internal a { + color: #194E82; + background-color: #FFFFFF; +} + +/* Text */ + +article#docs a.nav-anchor { + color: #194E82; +} + +article#docs blockquote { + font-style: italic; } /* Terminology Block */ div.admonition.terminology div.admonition-title:before { content: "Terminology: "; - font-family: 'Lato', 'Helvetica Neue', Arial, sans-serif; + font-family: "Liberation Mono", "Consolas", "DejaVu Sans Mono", "Ubuntu Mono", "Courier New", "andale mono", "lucida console", monospace; font-weight: bold; } div.admonition.terminology div.admonition-title { @@ -38,3 +62,17 @@ div.admonition.terminology div.admonition-title { div.admonition.terminology div.admonition-text { background-color: #FFFEDD; } + +/* Code */ + +code .hljs-meta { + color: #4595D1; +} + +code .hljs-keyword { + color: #194E82; +} + +pre, code { + font-family: "Liberation Mono", "Consolas", "DejaVu Sans Mono", "Ubuntu Mono", "Courier New", "andale mono", "lucida console", monospace; +} From 9040c284b349bb44a52d67e8da8c08856b2475b4 Mon Sep 17 00:00:00 2001 From: Lyndon White Date: Tue, 17 Sep 2019 17:23:15 +0100 Subject: [PATCH 18/51] Update docs/src/index.md Co-Authored-By: Nick Robinson --- docs/src/index.md | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/docs/src/index.md b/docs/src/index.md index d950e419b..56cdff475 100644 --- a/docs/src/index.md +++ b/docs/src/index.md @@ -23,7 +23,7 @@ For an !!! terminology "`rrule` and `frule`" `rrule` and `frule` are ChainRules specific terms. - There exact functioning is fairly ChainRules specific, though other tools may do similar. + Their exact functioning is fairly ChainRules specific, though other tools have similar functions. 
The core notion is sometimes called _custom AD primitives_, _custom adjoints_, _custom sensitivities_. The rules are encoded as `rrules` and `frules`, for use in reverse-mode and forward-mode differentiation respectively. From afc2f26b633db5bda7196f212ef4752ca385dad2 Mon Sep 17 00:00:00 2001 From: Nick Robinson Date: Fri, 20 Sep 2019 10:30:53 +0100 Subject: [PATCH 19/51] Update docs/src/index.md Co-Authored-By: Fredrik Bagge Carlson --- docs/src/index.md | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/docs/src/index.md b/docs/src/index.md index 56cdff475..b22917fe8 100644 --- a/docs/src/index.md +++ b/docs/src/index.md @@ -78,7 +78,7 @@ function pushforward(Δself, Δargs...) end ``` -The input to the pushforward is often called the _pertubation_. +The input to the pushforward is often called the _perturbation_. If the function is `y = f(x)` often the pushforward will be written `ẏ = pushforward(ḟ, ẋ)`. !!! note From 612ef7637d1e5d4dd70781a411f6b419b08033fe Mon Sep 17 00:00:00 2001 From: Lyndon White Date: Fri, 20 Sep 2019 13:06:15 +0100 Subject: [PATCH 20/51] show all ENV during docs building --- docs/make.jl | 2 ++ 1 file changed, 2 insertions(+) diff --git a/docs/make.jl b/docs/make.jl index 34b664ed1..e1ab0a454 100644 --- a/docs/make.jl +++ b/docs/make.jl @@ -2,6 +2,8 @@ using ChainRules using ChainRulesCore using Documenter +@show ENV + makedocs( modules=[ChainRules, ChainRulesCore], format=Documenter.HTML(prettyurls=false, assets = ["assets/chainrules.css"]), From 41269d4e1870f5be53452d813ded62dbae548a9c Mon Sep 17 00:00:00 2001 From: Lyndon White Date: Fri, 20 Sep 2019 20:51:48 +0100 Subject: [PATCH 21/51] Update docs/src/index.md --- docs/src/index.md | 1 + 1 file changed, 1 insertion(+) diff --git a/docs/src/index.md b/docs/src/index.md index b22917fe8..db630b99d 100644 --- a/docs/src/index.md +++ b/docs/src/index.md @@ -80,6 +80,7 @@ end The input to the pushforward is often called the _perturbation_. 
If the function is `y = f(x)` often the pushforward will be written `ẏ = pushforward(ḟ, ẋ)`. +(`ẏ` is commonly used to represent the permutation for `y`) !!! note From 2fc9f90bc6cac100f0a1422ab736375479f1f043 Mon Sep 17 00:00:00 2001 From: Lyndon White Date: Sun, 22 Sep 2019 13:25:46 +0100 Subject: [PATCH 22/51] Update docs/src/assets/chainrules.css --- docs/src/assets/chainrules.css | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/docs/src/assets/chainrules.css b/docs/src/assets/chainrules.css index 80c44ddad..a4c2e2590 100644 --- a/docs/src/assets/chainrules.css +++ b/docs/src/assets/chainrules.css @@ -74,5 +74,5 @@ code .hljs-keyword { } pre, code { - font-family: "Liberation Mono", "Consolas", "DejaVu Sans Mono", "Ubuntu Mono", "Courier New", "andale mono", "lucida console", monospace; + font-family: "Liberation Mono", "Consolas", "DejaVu Sans Mono", "Ubuntu Mono", "andale mono", "lucida console", monospace; } From 7083fb3df967f669eb89e6ecfc9ac7efe889821e Mon Sep 17 00:00:00 2001 From: Lyndon White Date: Sun, 22 Sep 2019 14:08:32 +0100 Subject: [PATCH 23/51] Update docs/src/index.md Co-Authored-By: Seth Axen --- docs/src/index.md | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/docs/src/index.md b/docs/src/index.md index db630b99d..a51ceeae5 100644 --- a/docs/src/index.md +++ b/docs/src/index.md @@ -268,7 +268,7 @@ If work is only required for one of the returned differentials, then it should b If there are multiple return values, their computation should almost always be wrapped in a `@thunk`. 
-Do _not_ wrap _variables_ in a `@thunk`, wrap the _computations_ that fill those variables in `@thunk`: +Do _not_ wrap _variables_ in a `@thunk`; wrap the _computations_ that fill those variables in `@thunk`: ```julia # good: From 33828b75f76c89d3b715fe4757cf46cdae785e33 Mon Sep 17 00:00:00 2001 From: Lyndon White Date: Mon, 23 Sep 2019 10:14:11 +0100 Subject: [PATCH 24/51] Update docs/src/index.md Co-Authored-By: Mateusz Baran --- docs/src/index.md | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/docs/src/index.md b/docs/src/index.md index a51ceeae5..9a2efe251 100644 --- a/docs/src/index.md +++ b/docs/src/index.md @@ -124,7 +124,7 @@ The reason is that in Julia the function `f` may itself have internal fields. For example a closure has the fields it closes over; a callable object (i.e. a functor) like a `Flux.Dense` has the fields of that object. **Thus every function is treated as having the extra implicit argument `self`, which captures those fields.** -So every `pushforward` takes in an extra argument, which is ignored unless the original function had fields. +So every `pushforward` takes in an extra argument, which is ignored unless the original function has fields. In is common to write `function foo_pushforward(_, Δargs...)` in the case when `foo` does not have fields. Similarly every `pullback` return an extra `∂self`, which for things without fields is the constant `NO_FIELDS`, indicating there are no fields within the function itself. 
From 9f85a259b0ac8183b73f84e380f946b025af2442 Mon Sep 17 00:00:00 2001 From: Lyndon White Date: Mon, 23 Sep 2019 10:14:43 +0100 Subject: [PATCH 25/51] Update docs/src/index.md Co-Authored-By: Glenn Moynihan --- docs/src/index.md | 1 - 1 file changed, 1 deletion(-) diff --git a/docs/src/index.md b/docs/src/index.md index 9a2efe251..b841b2872 100644 --- a/docs/src/index.md +++ b/docs/src/index.md @@ -15,7 +15,6 @@ ChainRules is a programmatic repository of that knowledge, with the generalizati [Autodiff (AD)](https://en.wikipedia.org/wiki/Automatic_differentiation) tools roughly work by reducing a problem down to simple parts that they know the rules for, and then combining those rules. Knowing rules for more complicated functions speeds up the autodiff process as it doesn't have to break things down as much. -For an **ChainRules is an AD-independent collection of rules to use in a differentiation system.** From 70b8b0c32a9baedd28715495b75e5d8d2908abf9 Mon Sep 17 00:00:00 2001 From: Lyndon White Date: Mon, 23 Sep 2019 10:15:08 +0100 Subject: [PATCH 26/51] Update docs/src/index.md Co-Authored-By: Glenn Moynihan --- docs/src/index.md | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/docs/src/index.md b/docs/src/index.md index b841b2872..35854c111 100644 --- a/docs/src/index.md +++ b/docs/src/index.md @@ -124,7 +124,7 @@ For example a closure has the fields it closes over; a callable object (i.e. a f **Thus every function is treated as having the extra implicit argument `self`, which captures those fields.** So every `pushforward` takes in an extra argument, which is ignored unless the original function has fields. -In is common to write `function foo_pushforward(_, Δargs...)` in the case when `foo` does not have fields. +It is common to write `function foo_pushforward(_, Δargs...)` in the case when `foo` does not have fields. 
Similarly every `pullback` return an extra `∂self`, which for things without fields is the constant `NO_FIELDS`, indicating there are no fields within the function itself. #### Pushforward / Pullback summary From 025ae1abd364ffa4bfad2f8b408fb663a99c11c1 Mon Sep 17 00:00:00 2001 From: Lyndon White Date: Mon, 23 Sep 2019 12:38:29 +0100 Subject: [PATCH 27/51] inherit font for admonition title:before --- docs/src/assets/chainrules.css | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/docs/src/assets/chainrules.css b/docs/src/assets/chainrules.css index a4c2e2590..bb24cbe72 100644 --- a/docs/src/assets/chainrules.css +++ b/docs/src/assets/chainrules.css @@ -52,7 +52,7 @@ article#docs blockquote { div.admonition.terminology div.admonition-title:before { content: "Terminology: "; - font-family: "Liberation Mono", "Consolas", "DejaVu Sans Mono", "Ubuntu Mono", "Courier New", "andale mono", "lucida console", monospace; + font-family: inherit font-weight: bold; } div.admonition.terminology div.admonition-title { From 45898cb8fd82abcb99fdf3553f7e2eeb7e5957a7 Mon Sep 17 00:00:00 2001 From: Lyndon White Date: Mon, 23 Sep 2019 12:51:12 +0100 Subject: [PATCH 28/51] fix CSS tabs --- docs/src/assets/chainrules.css | 48 +++++++++++++++++----------------- 1 file changed, 24 insertions(+), 24 deletions(-) diff --git a/docs/src/assets/chainrules.css b/docs/src/assets/chainrules.css index bb24cbe72..8dd04b0f9 100644 --- a/docs/src/assets/chainrules.css +++ b/docs/src/assets/chainrules.css @@ -1,78 +1,78 @@ /* Links */ a { - color: #4595D1; + color: #4595D1; } a:hover, a:focus { - color: #194E82; + color: #194E82; } /* Navigation */ nav.toc ul a:hover, nav.toc ul.internal a:hover { - color: #FFFFFF; - background-color: #4595D1; + color: #FFFFFF; + background-color: #4595D1; } nav.toc ul .toctext { - color: #FFFFFF; + color: #FFFFFF; } nav.toc { - box-shadow: none; - color: #FFFFFF; - background-color: #194E82; + box-shadow: none; + color: #FFFFFF; + background-color: 
#194E82; } nav.toc li.current > .toctext { - color: #FFFFFF; - background-color: #4595D1; - border-top-width: 0px; - border-bottom-width: 0px; + color: #FFFFFF; + background-color: #4595D1; + border-top-width: 0px; + border-bottom-width: 0px; } nav.toc ul.internal a { - color: #194E82; - background-color: #FFFFFF; + color: #194E82; + background-color: #FFFFFF; } /* Text */ article#docs a.nav-anchor { - color: #194E82; + color: #194E82; } article#docs blockquote { - font-style: italic; + font-style: italic; } /* Terminology Block */ div.admonition.terminology div.admonition-title:before { - content: "Terminology: "; - font-family: inherit - font-weight: bold; + content: "Terminology: "; + font-family: inherit + font-weight: bold; } div.admonition.terminology div.admonition-title { - background-color: #FFEC8B; + background-color: #FFEC8B; } div.admonition.terminology div.admonition-text { - background-color: #FFFEDD; + background-color: #FFFEDD; } /* Code */ code .hljs-meta { - color: #4595D1; + color: #4595D1; } code .hljs-keyword { - color: #194E82; + color: #194E82; } pre, code { - font-family: "Liberation Mono", "Consolas", "DejaVu Sans Mono", "Ubuntu Mono", "andale mono", "lucida console", monospace; + font-family: "Liberation Mono", "Consolas", "DejaVu Sans Mono", "Ubuntu Mono", "andale mono", "lucida console", monospace; } From 418e20ab450faca165c676570b79cddcdcb75706 Mon Sep 17 00:00:00 2001 From: Lyndon White Date: Mon, 23 Sep 2019 14:18:51 +0100 Subject: [PATCH 29/51] Update docs/src/index.md Co-Authored-By: Seth Axen --- docs/src/index.md | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/docs/src/index.md b/docs/src/index.md index 35854c111..3642dbf40 100644 --- a/docs/src/index.md +++ b/docs/src/index.md @@ -25,7 +25,7 @@ Knowing rules for more complicated functions speeds up the autodiff process as i Their exact functioning is fairly ChainRules specific, though other tools have similar functions. 
The core notion is sometimes called _custom AD primitives_, _custom adjoints_, _custom sensitivities_.

-The rules are encoded as `rrules` and `frules`, for use in reverse-mode and forward-mode differentiation respectively.
+The rules are encoded as `rrule`s and `frule`s, for use in reverse-mode and forward-mode differentiation respectively.

The `rrule` for some function `foo`, which takes the positional argument `args` and keyword argument `kwargs`, is written:

From ad6ddc838e3c335fa1c08ab030170b4432877d65 Mon Sep 17 00:00:00 2001
From: Lyndon White
Date: Mon, 23 Sep 2019 14:22:23 +0100
Subject: [PATCH 30/51] reorder

---
 docs/src/index.md | 32 ++++++++++++++++++--------------
 1 file changed, 18 insertions(+), 14 deletions(-)

diff --git a/docs/src/index.md b/docs/src/index.md
index 3642dbf40..8d0b999ef 100644
--- a/docs/src/index.md
+++ b/docs/src/index.md
@@ -18,33 +18,37 @@ Knowing rules for more complicated functions speeds up the autodiff process as i

 **ChainRules is an AD-independent collection of rules to use in a differentiation system.**

-### `rrule` and `frule`
+### `frule` and `rrule`

-!!! terminology "`rrule` and `frule`"
-    `rrule` and `frule` are ChainRules specific terms.
+!!! terminology "`frule` and `rrule`"
+    `frule` and `rrule` are ChainRules specific terms.
     Their exact functioning is fairly ChainRules specific, though other tools have similar functions.
-    The core notion is sometimes called _custom AD primitives_, _custom adjoints_, _custom sensitivities_.
+    The core notion is sometimes called _custom AD primitives_, _custom adjoints_, _custom gradients_, _custom sensitivities_.
+    (Potentially incorrectly, terminology is often abused.)

-The rules are encoded as `rrules` and `frules`, for use in reverse-mode and forward-mode differentiation respectively.
-
-The `rrule` for some function `foo`, which takes the positional argument `args` and keyword argument `kwargs`, is written:
+The rules are encoded as `frule`s and `rrule`s, for use in forward-mode and reverse-mode differentiation respectively.
+The `frule` is written:

```julia
-function rrule(::typeof(foo), args; kwargs...)
+function frule(::typeof(foo), args; kwargs...)
    ...
-    return y, pullback
+    return y, pushforward
end
```

-where `y` must be equal to `foo(args; kwargs...)`, and `pullback` is a function to propagate the derivative information backwards at that point (more later).
+where `y = foo(args; kwargs...)`, and `pushforward` is a function to propagate the derivative information forwards at that point (more later).
+
+
+The `rrule` for some function `foo`, which takes the positional argument `args` and keyword argument `kwargs`, is written:

-Similarly, the `frule` is written:
```julia
-function frule(::typeof(foo), args; kwargs...)
+function rrule(::typeof(foo), args; kwargs...)
    ...
-    return y, pushforward
+    return y, pullback
end
```
-again `y = foo(args; kwargs...)`, and `pushforward` is a function to propagate the derivative information forwards at that point (more later).
+again `y` must be equal to `foo(args; kwargs...)`, and `pullback` is a function to propagate the derivative information backwards at that point (more later).
+
Almost always the _pushforward_/_pullback_ will be declared locally within the `ffrule`/`rrule`, and will be a _closure_ over some of the other arguments.
From b1f2d1db4a0a063dd0d1a471eb623108c0a2d911 Mon Sep 17 00:00:00 2001 From: Lyndon White Date: Mon, 23 Sep 2019 14:22:58 +0100 Subject: [PATCH 31/51] Update docs/src/index.md Co-Authored-By: Seth Axen --- docs/src/index.md | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/docs/src/index.md b/docs/src/index.md index 8d0b999ef..c0fff8352 100644 --- a/docs/src/index.md +++ b/docs/src/index.md @@ -60,7 +60,7 @@ Almost always the _pushforward_/_pullback_ will be declared locally within the ` The are broadly in agreement with the use of [pullback](https://en.wikipedia.org/wiki/Pullback_(differential_geometry)) and [pushforward](https://en.wikipedia.org/wiki/Pushforward_(differential)) in differential geometry. But any geometer will tell you these are the super-boring flat cases. Some will also frown at you. Other terms that may be used include for _pullback_ the **backpropagator**, and by analogy for _pushforward_ the **forwardpropagator**, thus these are the _propagators_. - These are also good names because effectively they propagate wiggles and wobbles through them, via the chainrule. + These are also good names because effectively they propagate wiggles and wobbles through them, via the chain rule. 
(the term **backpropagator** may originate with ["Lambda The Ultimate Backpropagator"](http://www-bcl.cs.may.ie/~barak/papers/toplas-reverse.pdf) by Pearlmutter and Siskind, 2008)

#### Core Idea

From fb2889202f6e389d3d61f7cbfc5ea83402f132d3 Mon Sep 17 00:00:00 2001
From: Lyndon White
Date: Mon, 23 Sep 2019 16:18:33 +0100
Subject: [PATCH 32/51] Apply suggestions from code review

Co-Authored-By: Seth Axen
Co-Authored-By: Mateusz Baran
---
 docs/src/index.md | 25 ++++++++++++++-----------
 1 file changed, 14 insertions(+), 11 deletions(-)

diff --git a/docs/src/index.md b/docs/src/index.md
index c0fff8352..d6b66ff51 100644
--- a/docs/src/index.md
+++ b/docs/src/index.md
@@ -66,7 +66,7 @@ Almost always the _pushforward_/_pullback_ will be declared locally within the `
 #### Core Idea

 - The **pushforward** takes a wiggle in the _input space_, and tells what wobble you would create in the output space, by passing it through the function.
- - The **pullback** takes a wobble in the _output space_, and tells you what wiggle you would need to make in the _input_ space to achieve it.
+ - The **pullback** takes a wobble in the _output space_, and tells you what wiggle you would need to make in the _input space_ to achieve it.

 #### The anatomy of pushforward and pullback

@@ -83,7 +83,7 @@ end

 The input to the pushforward is often called the _perturbation_.
 If the function is `y = f(x)` often the pushforward will be written `ẏ = pushforward(ḟ, ẋ)`.
-(`ẏ` is commonly used to represent the permutation for `y`)
+(`ẏ` is commonly used to represent the perturbation for `y`)

 !!! note

@@ -119,7 +119,7 @@ If the function is `y = f(x)` often the pullback will be written `x̄ = pullback
     Other good names might be `Δinternal`/`∂internal`.

 From the mathematical perspective, one may have been wondering what all this `Δself`, `∂self` is.
-After all a function with two inputs, say `f(a, b)`, only has two partial derivatives: +After all, a function with two inputs, say `f(a, b)`, only has two partial derivatives: ``\dfrac{∂f}{∂a}``, ``\dfrac{∂f}{∂b}``. Why then does a `pushforward` take in this extra `Δself`, and why does a `pullback` return this extra `∂self`? @@ -129,7 +129,7 @@ For example a closure has the fields it closes over; a callable object (i.e. a f **Thus every function is treated as having the extra implicit argument `self`, which captures those fields.** So every `pushforward` takes in an extra argument, which is ignored unless the original function has fields. It is common to write `function foo_pushforward(_, Δargs...)` in the case when `foo` does not have fields. -Similarly every `pullback` return an extra `∂self`, which for things without fields is the constant `NO_FIELDS`, indicating there are no fields within the function itself. +Similarly every `pullback` returns an extra `∂self`, which for things without fields is the constant `NO_FIELDS`, indicating there are no fields within the function itself. #### Pushforward / Pullback summary @@ -161,10 +161,10 @@ Similarly, the most trivial use of `rrule` and returned `pullback` is to calcula ```julia y, f_pullback = rrule(f, a, b, c) ∇f = f_pullback(1) # for appropriate `1`-like seed. -s̄, ā, b̄, c̄ = ∇f +s̄elf, ā, b̄, c̄ = ∇f ``` Then we have that `∇f` is the _gradient_ of `f` at `(a, b, c)`. -And we thus have the partial derivatives ``f̄ = \dfrac{∂f}{∂f}``, ``ā` = \dfrac{∂f}{∂a}``, ``b̄ = \dfrac{∂f}{∂b}``, ``c̄ = \dfrac{∂f}{∂c}``, including the and the self-partial derivative, ``f̄``. +And we thus have the partial derivatives ``s̄elf, = \dfrac{∂f}{∂s̄elf}``, ``ā` = \dfrac{∂f}{∂a}``, ``b̄ = \dfrac{∂f}{∂b}``, ``c̄ = \dfrac{∂f}{∂c}``, including the and the self-partial derivative, ``s̄elf,``. ### Differentials @@ -173,7 +173,7 @@ They are differentials; differency-equivalents. 
A differential might be a regular type, like a `Number` or a `Matrix`, or it might be one of the `AbstractDifferential` subtypes.

 Differentials support a number of operations.
-Most importantly: `+` and `*` which lets them act as mathematically objects.
+Most importantly: `+` and `*`, which let them act as mathematical objects.
 And `extern` which converts `AbstractDifferential` types into a conventional non-ChainRules type.
@@ -261,9 +261,12 @@ Zygote.gradient(foo, x)

 ## On writing good `rrule` / `frule` methods

-### Return Zero or One
+### Use `Zero()` or `One()` as return value

-Rather than `0` or `1` or even rather than `zeros(n)`, `ones(n)`, or the identity matrix `I`.
+The `Zero()` and `One()` differential objects exist as an alternative to directly returning
+`0` or `zeros(n)`, and `1` or `I`.
+They allow more efficient computation when chaining pullbacks/pushforwards, by avoiding unnecessary work.
+They should be used where possible.

 ### Use `Thunk`s appropriately:

@@ -342,7 +345,7 @@ It is very easy to check gradients or derivatives with a computer algebra system
 - `dx` could be anything, including a pullback. It really should not show up outside of tests.
 - `v̇` is a derivative of the input moving forward: ``v̇ = \frac{∂v}{∂x}`` for input ``x``, intermediate value ``v``.
 - `v̄` is a derivative of the output moving backward: ``v̄ = \frac{∂y}{∂v}`` for output ``y``, intermediate value ``v``.
-- `Ω` is often used as the return value of the function having the rule found for. Especially, (but not exclusively.) for scalar functions.
+- `Ω` is often used as the return value of the function. Especially, but not exclusively, for scalar functions.
 - `ΔΩ` is thus a seed for the pullback.
 - `∂Ω` is thus the output of a pushforward.
@@ -351,6 +354,6 @@ It is very easy to check gradients or derivatives with a computer algebra system You might wonder why `frule(f, x)` returns `f(x)` and the pushforward for `f` at `x`, and similarly for `rrule` returing `f(x)` and the pullback for `f` at `x`. Why not just return the pushforward/pullback, and let the user call `f(x)` to get the answer seperately? -Their are two reasons the rules also create the `f(x)`. +There are two reasons the rules also calculate the `f(x)`. 1. For some rules the output value is used in the definition of its propagator. For example `tan`. 2. For some rules an alternative way of calculating `f(x)` can give the same answer while also generating intermediate values that can be used in the calculations within the propagator. From 3849bb43a8ecb0dc32b66c982c5c114846b6e89d Mon Sep 17 00:00:00 2001 From: Lyndon White Date: Mon, 23 Sep 2019 16:22:53 +0100 Subject: [PATCH 33/51] fix ffrule repetition --- docs/src/index.md | 4 ++-- 1 file changed, 2 insertions(+), 2 deletions(-) diff --git a/docs/src/index.md b/docs/src/index.md index d6b66ff51..75f8ae017 100644 --- a/docs/src/index.md +++ b/docs/src/index.md @@ -50,7 +50,7 @@ end again `y` must be equal to `foo(args; kwargs...)`, and `pullback` is a function to propagate the derivative information backwards at that point (more later). -Almost always the _pushforward_/_pullback_ will be declared locally within the `ffrule`/`rrule`, and will be a _closure_ over some of the other arguments. +Almost always the _pushforward_/_pullback_ will be declared locally within the `frule`/`rrule`, and will be a _closure_ over some of the other arguments. 
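The `tan` case mentioned in reason 1 above can be sketched end to end. `toy_rrule` is a hypothetical, self-contained stand-in (not the real ChainRules method), and for brevity its pullback returns only the `x` partial; the point is that the primal output `y` is captured by the closure and reused, since `∂tan(x)/∂x = 1 + tan(x)^2`:

```julia
# Hypothetical sketch: the rule computes tan(x) once, and the pullback
# reuses that primal output `y` instead of recomputing it.
function toy_rrule(::typeof(tan), x)
    y = tan(x)
    tan_pullback(ȳ) = ȳ * (1 + y^2)   # derivative of tan written in terms of y
    return y, tan_pullback
end

y, pb = toy_rrule(tan, 0.5)
∂x = pb(1.0)   # seeding with 1 recovers the derivative, sec(0.5)^2
```

This is exactly why returning the pullback alone would be wasteful: the propagator would have to recompute `tan(x)` itself.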
### The propagators: pushforward and pullback

@@ -134,7 +134,7 @@ Similarly every `pullback` returns an extra `∂self`, which for things without
 #### Pushforward / Pullback summary
 
 - **Pushforward:**
-   - returned by `ffrule`
+   - returned by `frule`
    - takes input space wiggles, gives output space wobbles
    - 1 argument per original function argument + 1 for the function itself
    - 1 return per original function return

From a954d99d2a1a0ac9e325e1d1eb8bd2a75bb671b4 Mon Sep 17 00:00:00 2001
From: Lyndon White
Date: Mon, 23 Sep 2019 16:32:28 +0100
Subject: [PATCH 34/51] improve

---
 docs/src/index.md | 19 ++++++++++++-------
 1 file changed, 12 insertions(+), 7 deletions(-)

diff --git a/docs/src/index.md b/docs/src/index.md
index 75f8ae017..a98ba7ada 100644
--- a/docs/src/index.md
+++ b/docs/src/index.md
@@ -290,10 +290,7 @@ In the bad example `foo(x)` gets computed eagerly, and all that the thunk is doi
 ### Be careful with using `adjoint` when you mean `transpose`
 
 Remember for complex numbers `a'` (i.e. `adjoint(a)`) takes the complex conjugate.
-Instead you probably want `transpose(a)`.
-
-While there are arguments that for reverse-mode taking the adjoint is correct, it is not currently the behavior of ChainRules to do so.
-Feel free to open an issue to discuss it.
+Instead you probably want `transpose(a)`, unless you've already restricted `a` to be an `AbstractMatrix{<:Real}`.
 
 ### Code Style
 
@@ -308,11 +305,19 @@ function frule(::typeof(foo), x)
     end
     return Y, foo_pushforward
end
+#== output
+julia> frule(foo, 2)
+(4, var"#foo_pushforward#11"())
+==#
 
 # bad:
 function frule(::typeof(foo), x)
     return foo(x), (_, ẋ) -> bar(ẋ)
 end
+#== output:
+julia> frule(foo, 2)
+(4, var"##9#10"())
+==#
 ```
While this is more verbose, it ensures that if an error is thrown during the `pullback`/`pushforward` the [`gensym`](https://docs.julialang.org/en/v1/base/base/#Base.gensym) name of the local function will include the name you gave it.
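Putting the naming advice above together, a complete `frule`-shaped toy method for a simple `foo` might look like the following. This is a hedged, self-contained sketch with hypothetical names (`toy_frule`, and `nothing` standing in for the ignored `Δself`), not the real ChainRules definition:

```julia
foo(x) = x^2

# Toy frule-shaped function: returns the primal result and a *named*
# pushforward, per the style recommendation above.
function toy_frule(::typeof(foo), x)
    y = foo(x)
    function foo_pushforward(Δself, ẋ)
        return 2x * ẋ          # ẏ = (∂y/∂x) ⋅ ẋ
    end
    return y, foo_pushforward
end

y, pf = toy_frule(foo, 3.0)
ẏ = pf(nothing, 1.0)           # seed ẋ = 1 recovers ∂y/∂x at x = 3
```

Because `foo_pushforward` is a named local function, any error raised inside it carries that name in the stacktrace, which is the debugging benefit the section above describes.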
@@ -326,9 +331,9 @@ It is very easy to check gradients or derivatives with a computer algebra system
 
 Take a look at existing test and you should see how to do stuff.
 
 !!! warning
     Use finite differencing to test derivatives.
-    Don't use analytical derivations for derivatives in the tests!
-    Since the rules are analytic expressions, re-writing those same expressions in the tests
-    can not be an effective way to test, and will give misleading test coverage.
+    Don't use analytical derivations for derivatives in the tests.
+    Those are what you use to define the rules, and so cannot be confidently used in the tests.
+    If you misread/misunderstood them, then your tests/implementation will have the same mistake.
 
 ### CAS systems are your friends.

From d6337795ac102709be8a2800f4335baa3c1f9d8d Mon Sep 17 00:00:00 2001
From: Lyndon White
Date: Mon, 23 Sep 2019 18:09:57 +0100
Subject: [PATCH 35/51] add about gradient wrt kwargs

---
 docs/src/index.md | 9 +++++++++
 1 file changed, 9 insertions(+)

diff --git a/docs/src/index.md b/docs/src/index.md
index a98ba7ada..987309502 100644
--- a/docs/src/index.md
+++ b/docs/src/index.md
@@ -362,3 +362,12 @@ Why not just return the pushforward/pullback, and let the user call `f(x)` to ge
 There are two reasons the rules also calculate the `f(x)`.
 1. For some rules the output value is used in the definition of its propagator. For example `tan`.
 2. For some rules an alternative way of calculating `f(x)` can give the same answer while also generating intermediate values that can be used in the calculations within the propagator.
+
+### Where are the gradients for keyword arguments?
+_pullbacks_ do not return a gradient for keyword arguments;
+similarly _pushforwards_ do not accept a perturbation for keyword arguments.
+This is because in practice functions are very rarely differentiable with respect to keyword arguments.
+As a rule keyword arguments tend to control side-effects, like logging verbosity,
+or to change which operation is performed, e.g. `dims=3`, and thus are not differentiable.
+To the best of our knowledge no Julia AD system, with support for the definition of custom primitives, supports differentiating with respect to keyword arguments.
+At some point in the future ChainRules may support these. Maybe.

From 9046e907ed4f3cc56d2e70a266b7064ae5e23826 Mon Sep 17 00:00:00 2001
From: Lyndon White
Date: Mon, 23 Sep 2019 18:17:30 +0100
Subject: [PATCH 36/51] explain what differentials are

---
 docs/src/index.md | 5 +++--
 1 file changed, 3 insertions(+), 2 deletions(-)

diff --git a/docs/src/index.md b/docs/src/index.md
index 987309502..9874d3c54 100644
--- a/docs/src/index.md
+++ b/docs/src/index.md
@@ -169,8 +169,9 @@ And we thus have the partial derivatives ``s̄elf, = \dfrac{∂f}{∂s̄elf}``,
 ### Differentials
 
 The values that come back from pullbacks or pushforwards are not always the same type as the input/outputs of the original function.
-They are differentials; differency-equivalents.
+They are differentials, which correspond roughly to something able to represent the difference between two values of the original types.
+A differential might be a regular type, like a `Number` or a `Matrix`, matching the original type;
+or it might be one of the `AbstractDifferential` subtypes.
 
 Differentials support a number of operations.
 Most importantly: `+` and `*`, which let them act as mathematical objects.
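The `+` and `*` support just described is what lets special differentials short-circuit work. A hypothetical `ToyZero` (an illustration only, not the real `Zero()` type) shows the idea in a few lines:

```julia
# Hypothetical stand-in for a `Zero()`-like differential: addition is the
# identity and multiplication absorbs, so no array arithmetic is performed.
struct ToyZero end
Base.:+(::ToyZero, x) = x
Base.:+(x, ::ToyZero) = x
Base.:*(::ToyZero, x) = ToyZero()
Base.:*(x, ::ToyZero) = ToyZero()

acc = ToyZero() + [1.0, 2.0]   # accumulating a zero differential is free
∂ = ToyZero() * ones(3, 3)     # no matrix product happens; result stays ToyZero
```

This is the sense in which returning a `Zero()`-style object, rather than `zeros(n)`, lets chained pullbacks/pushforwards skip work entirely.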
From d869621d86e2f464c734d08c344d5adfbce6d585 Mon Sep 17 00:00:00 2001 From: Lyndon White Date: Mon, 23 Sep 2019 19:12:36 +0100 Subject: [PATCH 37/51] improve --- docs/src/index.md | 26 +++++++++++++++++++++----- 1 file changed, 21 insertions(+), 5 deletions(-) diff --git a/docs/src/index.md b/docs/src/index.md index 9874d3c54..df59dcde4 100644 --- a/docs/src/index.md +++ b/docs/src/index.md @@ -65,6 +65,20 @@ Almost always the _pushforward_/_pullback_ will be declared locally within the ` #### Core Idea + + +---- +##### TODO: Incorperate this: + +###### wesselb 9 days ago Member +Are these ideas consistent with what pushforward and pullback do? I'm not familiar with ChainRules and its internals, but I anticipated pushforward and pullback to do the following: Consider a computation x -> u -> f(u) = v -> y. Then pushforward for f turns du/dx into dv/dx, whereas pullback turns dy/dv into dy/du. So pushforward pushes a "sensitivity with respect to the input through the function", whereas pullback pulls a "sensitivity with respect to the output back through the function". Perhaps that's what the below convey, not sure... maybe I'm just rambling. + +###### @jekbradbury +Yeah, I think the below is accurate for the pushforward but misleading for the pullback. The pullback doesn’t take an output wobble and produce an input wiggle (that would be left-multiplying by the inverse of the Jacobian); it takes an output sensitivity (“how much does the loss function wobble when you wiggle the output”) and produces an input sensitivity (“how much does the loss function wobble when you wiggle the input”). This corresponds to left-multiplying by the adjoint of the Jacobian—an important distinction! + +If the output is the scalar loss and you call the pullback on the scalar 1, then it will produce the gradient of the input (also a vector in the cotangent space, aka a wobble-wiggle ratio). 
+---------- + - The **pushforward** takes a wiggle in the _input space_, and tells what wobble you would create in the output space, by passing it through the function. - The **pullback** takes a wobble in the _output space_, and tells you what wiggle you would need to make in the _input space_ to achieve it. @@ -82,7 +96,7 @@ end ``` The input to the pushforward is often called the _perturbation_. -If the function is `y = f(x)` often the pushforward will be written `ẏ = pushforward(ḟ, ẋ)`. +If the function is `y = f(x)` often the pushforward will be written `ẏ = pushforward(ṡelf, ẋ)`. (`ẏ` is commonly used to represent the pertubation for `y`) !!! note @@ -102,7 +116,7 @@ end ``` The input to the pullback is often called the _seed_. -If the function is `y = f(x)` often the pullback will be written `x̄ = pullback(ȳ)`. +If the function is `y = f(x)` often the pullback will be written `s̄elf, x̄ = pullback(ȳ)`. !!! note @@ -112,12 +126,14 @@ If the function is `y = f(x)` often the pullback will be written `x̄ = pullback Sometimes _perturbation_, _seed_, and _sensitivity_ will be used interchangeably, depending on task/subfield (sensitivity analysis and perturbation theory are apparently very big on just calling everything _sensitivity_ or _perturbation_ respectively.) At the end of the day, they are all _wiggles_ or _wobbles_. -### Self derivative `Δself`, `∂self` etc. +### Self derivative `Δself`, `∂self`, `s̄elf`, `ṡelf` etc. -!!! terminology - To my knowledge there is no standard terminology for this. +!!! terminology `Δself`, `∂self`, `s̄elf`, `ṡelf` + It is the derivatives with respect to the internal fields of the function. + To the best of our knowledge there is no standard terminology for this. Other good names might be `Δinternal`/`∂internal`. + From the mathematical perspective, one may have been wondering what all this `Δself`, `∂self` is. 
After all, a function with two inputs, say `f(a, b)`, only has two partial derivatives: ``\dfrac{∂f}{∂a}``, ``\dfrac{∂f}{∂b}``. From f836dc6da5444facd342489b29020988a52c3415 Mon Sep 17 00:00:00 2001 From: Lyndon White Date: Mon, 23 Sep 2019 19:14:26 +0100 Subject: [PATCH 38/51] wip --- docs/src/index.md | 10 +++++++++- 1 file changed, 9 insertions(+), 1 deletion(-) diff --git a/docs/src/index.md b/docs/src/index.md index df59dcde4..70b3dacf0 100644 --- a/docs/src/index.md +++ b/docs/src/index.md @@ -77,6 +77,14 @@ Are these ideas consistent with what pushforward and pullback do? I'm not famili Yeah, I think the below is accurate for the pushforward but misleading for the pullback. The pullback doesn’t take an output wobble and produce an input wiggle (that would be left-multiplying by the inverse of the Jacobian); it takes an output sensitivity (“how much does the loss function wobble when you wiggle the output”) and produces an input sensitivity (“how much does the loss function wobble when you wiggle the input”). This corresponds to left-multiplying by the adjoint of the Jacobian—an important distinction! If the output is the scalar loss and you call the pullback on the scalar 1, then it will produce the gradient of the input (also a vector in the cotangent space, aka a wobble-wiggle ratio). + + + +This is still misleading for the pullback. Reposting a comment that got lost: +The pullback doesn’t take an output wobble and produce an input wiggle (that would be left-multiplying by the inverse of the Jacobian); it takes an output sensitivity (“how much does the loss function wobble when you wiggle the output”) and produces an input sensitivity (“how much does the loss function wobble when you wiggle the input”). This corresponds to left-multiplying by the adjoint of the Jacobian—an important distinction! 
+ +If the output is the scalar loss and you call the pullback on the scalar 1, then it will produce the gradient of the input (also a vector in the cotangent space, aka a wobble-wiggle ratio). + ---------- - The **pushforward** takes a wiggle in the _input space_, and tells what wobble you would create in the output space, by passing it through the function. @@ -135,7 +143,7 @@ If the function is `y = f(x)` often the pullback will be written `s̄elf, x̄ = From the mathematical perspective, one may have been wondering what all this `Δself`, `∂self` is. -After all, a function with two inputs, say `f(a, b)`, only has two partial derivatives: +Given that a function with two inputs, say `f(a, b)`, only has two partial derivatives: ``\dfrac{∂f}{∂a}``, ``\dfrac{∂f}{∂b}``. Why then does a `pushforward` take in this extra `Δself`, and why does a `pullback` return this extra `∂self`? From 344a88895d62c7b942150cb31b9fcc1c4484efa7 Mon Sep 17 00:00:00 2001 From: Lyndon White Date: Mon, 23 Sep 2019 19:15:47 +0100 Subject: [PATCH 39/51] fix citation Co-Authored-By: Seth Axen --- docs/src/index.md | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/docs/src/index.md b/docs/src/index.md index 70b3dacf0..0b76a1638 100644 --- a/docs/src/index.md +++ b/docs/src/index.md @@ -61,7 +61,7 @@ Almost always the _pushforward_/_pullback_ will be declared locally within the ` But any geometer will tell you these are the super-boring flat cases. Some will also frown at you. Other terms that may be used include for _pullback_ the **backpropagator**, and by analogy for _pushforward_ the **forwardpropagator**, thus these are the _propagators_. These are also good names because effectively they propagate wiggles and wobbles through them, via the chain rule. 
- (the term **backpropagator** may originate with ["Lambda The Ultimate Backpropagator"](http://www-bcl.cs.may.ie/~barak/papers/toplas-reverse.pdf) by Bearlmutter and Siskind, 2008) + (the term **backpropagator** may originate with ["Lambda The Ultimate Backpropagator"](http://www-bcl.cs.may.ie/~barak/papers/toplas-reverse.pdf) by Pearlmutter and Siskind, 2008) #### Core Idea From a579f3a86e70129679b4fe02822aca192dc03cff Mon Sep 17 00:00:00 2001 From: Lyndon White Date: Mon, 23 Sep 2019 20:19:06 +0100 Subject: [PATCH 40/51] Update docs/src/assets/chainrules.css Co-Authored-By: simeonschaub --- docs/src/assets/chainrules.css | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/docs/src/assets/chainrules.css b/docs/src/assets/chainrules.css index 8dd04b0f9..f4b0a164f 100644 --- a/docs/src/assets/chainrules.css +++ b/docs/src/assets/chainrules.css @@ -52,7 +52,7 @@ article#docs blockquote { div.admonition.terminology div.admonition-title:before { content: "Terminology: "; - font-family: inherit + font-family: inherit; font-weight: bold; } div.admonition.terminology div.admonition-title { From f7eed2841701c16d99b4bd825933ad14fc1f8b4d Mon Sep 17 00:00:00 2001 From: Lyndon White Date: Tue, 24 Sep 2019 09:33:55 +0100 Subject: [PATCH 41/51] use latex for overbar Co-Authored-By: simeonschaub --- docs/src/index.md | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/docs/src/index.md b/docs/src/index.md index 0b76a1638..daff411a1 100644 --- a/docs/src/index.md +++ b/docs/src/index.md @@ -188,7 +188,7 @@ y, f_pullback = rrule(f, a, b, c) s̄elf, ā, b̄, c̄ = ∇f ``` Then we have that `∇f` is the _gradient_ of `f` at `(a, b, c)`. -And we thus have the partial derivatives ``s̄elf, = \dfrac{∂f}{∂s̄elf}``, ``ā` = \dfrac{∂f}{∂a}``, ``b̄ = \dfrac{∂f}{∂b}``, ``c̄ = \dfrac{∂f}{∂c}``, including the and the self-partial derivative, ``s̄elf,``. 
+And we thus have the partial derivatives ``\overline{\mathrm{self}} = \dfrac{∂f}{∂\mathrm{self}}``, ``\overline{a} = \dfrac{∂f}{∂a}``, ``\overline{b} = \dfrac{∂f}{∂b}``, ``\overline{c} = \dfrac{∂f}{∂c}``, including the self-partial derivative, ``\overline{\mathrm{self}}``.
 
 ### Differentials
 

From 01bafdd08e188771a41dffe163f5d5aed25a76de Mon Sep 17 00:00:00 2001
From: Nick Robinson
Date: Tue, 24 Sep 2019 10:22:57 +0100
Subject: [PATCH 42/51] Update docs/src/index.md

Co-Authored-By: simeonschaub
---
 docs/src/index.md | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/docs/src/index.md b/docs/src/index.md
index daff411a1..29e4fa055 100644
--- a/docs/src/index.md
+++ b/docs/src/index.md
@@ -271,7 +271,7 @@ using FiniteDifferences
 central_fdm(5, 1)(foo, x)
 # -2.0638950738670734
 
-#### Find dfoo/dx via finite-differences ForwardDiff.jl
+#### Find dfoo/dx via ForwardDiff.jl
 using ForwardDiff
 ForwardDiff.derivative(foo, x)
 # -2.0638950738662625

From 21b1998449cece69a2d976623b789811ebd37b89 Mon Sep 17 00:00:00 2001
From: Nick Robinson
Date: Tue, 24 Sep 2019 10:23:11 +0100
Subject: [PATCH 43/51] Update docs/src/index.md

Co-Authored-By: simeonschaub
---
 docs/src/index.md | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/docs/src/index.md b/docs/src/index.md
index 29e4fa055..a868edc31 100644
--- a/docs/src/index.md
+++ b/docs/src/index.md
@@ -276,7 +276,7 @@ using ForwardDiff
 ForwardDiff.derivative(foo, x)
 # -2.0638950738662625
 
-#### Find dfoo/dx via finite-differences Zygote.jl
+#### Find dfoo/dx via Zygote.jl
 using Zygote
 Zygote.gradient(foo, x)
 # (-2.0638950738662625,)

From f5ca25e202e151101718ffc5cc473afd362d53c0 Mon Sep 17 00:00:00 2001
From: Lyndon White
Date: Tue, 24 Sep 2019 18:09:18 +0100
Subject: [PATCH 44/51] Work on definition of pushforward and pullback

---
 docs/src/index.md | 37 +++++++++++++++++++++++++++++++++----
 1 file changed, 33 insertions(+), 4 deletions(-)

diff --git a/docs/src/index.md b/docs/src/index.md
index
a868edc31..7fff5c3ba 100644 --- a/docs/src/index.md +++ b/docs/src/index.md @@ -67,7 +67,8 @@ Almost always the _pushforward_/_pullback_ will be declared locally within the ` ----- + + - The **pushforward** takes a wiggle in the _input space_, and tells what wobble you would create in the output space, by passing it through the function. - - The **pullback** takes a wobble in the _output space_, and tells you what wiggle you would need to make in the _input space_ to achieve it. + - The **pullback** takes wobblyness information with respect to the function's output, and tells the equivalent wobblyness with repect to the functions input. + +Definitions: + - wobblyness: a sensitivity + - wobble: a differential in the output space + - wiggle: a differential in the input space + +#### Math + +If I have some functions: ``g(a)``, `h(b)` and ``f(x)=g(h(x))``, +∂ +and I know the pullback of ``g``, ``at h(x)`` written: ``\mathrm{pullback}_{g(a)|a=h(x)}``, + +and I know the deriviative of h with respect to its input ``b`` at ``g(x)``, written: +``\left.\dfrac{\text{∂h}{\text{∂b}\right|_{b=g(x)}`` + +Then I can use the pullback to find: ``\dfrac{\text{∂f}{\text{∂x}`` + +``\dfrac{\text{∂f}{\text{∂x}=\mathrm{\mathrm{pullback}_{g(a)|a=h(x)}}\left(\left.\dfrac{\text{∂h}{\text{∂b}\right|_{b=g(x)}\right)`` + +— + +If I know the deriviative of g with respect to its input a at x, written: ``\left.\dfrac{\text{∂g}{\text{∂a}\right|_{a=x}`` + +and I know the pushforward of ``h`` at ``g(x)`` written: ``\mathrm{pushforward}_{h(b)|b=g(x)}`` + +then I can use the pushforward to find ``\dfrac{\text{∂f}{\text{∂x}`` + +``\dfrac{\text{∂f}{\text{∂x}=\mathrm{pushforward}_{h(b)|b=g(x)}\left(\left.\dfrac{\text{∂g}{\text{∂a}\right|_{a=x}\right)`` + #### The anatomy of pushforward and pullback From 66196a5820cdf86bdc952d0ae3be131a98f19262 Mon Sep 17 00:00:00 2001 From: Lyndon White Date: Tue, 24 Sep 2019 18:37:25 +0100 Subject: [PATCH 45/51] fixmath --- docs/src/index.md | 12 ++++++------ 1 file 
changed, 6 insertions(+), 6 deletions(-) diff --git a/docs/src/index.md b/docs/src/index.md index 7fff5c3ba..db840bc74 100644 --- a/docs/src/index.md +++ b/docs/src/index.md @@ -102,21 +102,21 @@ If I have some functions: ``g(a)``, `h(b)` and ``f(x)=g(h(x))``, and I know the pullback of ``g``, ``at h(x)`` written: ``\mathrm{pullback}_{g(a)|a=h(x)}``, and I know the deriviative of h with respect to its input ``b`` at ``g(x)``, written: -``\left.\dfrac{\text{∂h}{\text{∂b}\right|_{b=g(x)}`` +``\left.\dfrac{∂h}{∂b}\right|_{b=g(x)}`` -Then I can use the pullback to find: ``\dfrac{\text{∂f}{\text{∂x}`` +Then I can use the pullback to find: ``\dfrac{∂f}{∂x}`` -``\dfrac{\text{∂f}{\text{∂x}=\mathrm{\mathrm{pullback}_{g(a)|a=h(x)}}\left(\left.\dfrac{\text{∂h}{\text{∂b}\right|_{b=g(x)}\right)`` +``\dfrac{∂f}{∂x}=\mathrm{\mathrm{pullback}_{g(a)|a=h(x)}}\left(\left.\dfrac{∂h}{∂b}\right|_{b=g(x)}\right)`` — -If I know the deriviative of g with respect to its input a at x, written: ``\left.\dfrac{\text{∂g}{\text{∂a}\right|_{a=x}`` +If I know the deriviative of g with respect to its input a at x, written: ``\left.\dfrac{∂g}{∂a}\right|_{a=x}`` and I know the pushforward of ``h`` at ``g(x)`` written: ``\mathrm{pushforward}_{h(b)|b=g(x)}`` -then I can use the pushforward to find ``\dfrac{\text{∂f}{\text{∂x}`` +then I can use the pushforward to find ``\dfrac{∂f}{∂x}`` -``\dfrac{\text{∂f}{\text{∂x}=\mathrm{pushforward}_{h(b)|b=g(x)}\left(\left.\dfrac{\text{∂g}{\text{∂a}\right|_{a=x}\right)`` +``\dfrac{∂f}{∂x}=\mathrm{pushforward}_{h(b)|b=g(x)}\left(\left.\dfrac{∂g}{∂a}\right|_{a=x}\right)`` #### The anatomy of pushforward and pullback From 6db68c66a19adca6ec76501a5356abca76298037 Mon Sep 17 00:00:00 2001 From: Lyndon White Date: Tue, 24 Sep 2019 18:42:18 +0100 Subject: [PATCH 46/51] brackets --- docs/src/index.md | 4 ++-- 1 file changed, 2 insertions(+), 2 deletions(-) diff --git a/docs/src/index.md b/docs/src/index.md index db840bc74..11d63f5f8 100644 --- a/docs/src/index.md +++ 
b/docs/src/index.md @@ -97,9 +97,9 @@ Definitions: #### Math -If I have some functions: ``g(a)``, `h(b)` and ``f(x)=g(h(x))``, +If I have some functions: ``g(a)``, ``h(b)`` and ``f(x)=g(h(x))``, ∂ -and I know the pullback of ``g``, ``at h(x)`` written: ``\mathrm{pullback}_{g(a)|a=h(x)}``, +and I know the pullback of ``g``, at ``h(x)`` written: ``\mathrm{pullback}_{g(a)|a=h(x)}``, and I know the deriviative of h with respect to its input ``b`` at ``g(x)``, written: ``\left.\dfrac{∂h}{∂b}\right|_{b=g(x)}`` From 784e70431d5fdb655c519d53d6714c451cca1721 Mon Sep 17 00:00:00 2001 From: Lyndon White Date: Tue, 24 Sep 2019 18:48:59 +0100 Subject: [PATCH 47/51] Hide comment --- docs/src/index.md | 11 ++++++----- 1 file changed, 6 insertions(+), 5 deletions(-) diff --git a/docs/src/index.md b/docs/src/index.md index 11d63f5f8..f054e1b31 100644 --- a/docs/src/index.md +++ b/docs/src/index.md @@ -66,10 +66,9 @@ Almost always the _pushforward_/_pullback_ will be declared locally within the ` #### Core Idea - - - +-------- --> +``` + - The **pushforward** takes a wiggle in the _input space_, and tells what wobble you would create in the output space, by passing it through the function. - The **pullback** takes wobblyness information with respect to the function's output, and tells the equivalent wobblyness with repect to the functions input. From 1ceb3e879fd2ec92f0766e345ed1de59602b10f3 Mon Sep 17 00:00:00 2001 From: Lyndon White Date: Tue, 24 Sep 2019 23:02:09 +0100 Subject: [PATCH 48/51] Describe things a few more times --- docs/src/index.md | 55 +++++++++++++++++++++++------------------------ 1 file changed, 27 insertions(+), 28 deletions(-) diff --git a/docs/src/index.md b/docs/src/index.md index f054e1b31..50c50c3de 100644 --- a/docs/src/index.md +++ b/docs/src/index.md @@ -18,17 +18,21 @@ Knowing rules for more complicated functions speeds up the autodiff process as i **ChainRules is an AD-independent collection of rules to use in a differentiation system.** +!!! 
terminology The whole field is a mess. + It isn't just ChainRules, it is everyone. + Internally ChainRules tries to be consistent. + Help with that is always welcomed. + ### `frule` and `rrule` !!! terminology "`frule` and `rrule`" `frule` and `rrule` are ChainRules specific terms. Their exact functioning is fairly ChainRules specific, though other tools have similar functions. The core notion is sometimes called _custom AD primitives_, _custom adjoints_, _custom_gradients_, _custom sensitivities_. - (Potentially incorrectly, terminology is often abused.) The rules are encoded as `frule`s and `rrule`s, for use in forward-mode and reverse-mode differentiation respectively. -Similarly, the `frule` is written: +The `frule` is written: ```julia function frule(::typeof(foo), args; kwargs...) ... @@ -59,47 +63,42 @@ Almost always the _pushforward_/_pullback_ will be declared locally within the ` _Pushforward_ and _pullback_ are fancy words that the autodiff community adopted from Differential Geometry. The are broadly in agreement with the use of [pullback](https://en.wikipedia.org/wiki/Pullback_(differential_geometry)) and [pushforward](https://en.wikipedia.org/wiki/Pushforward_(differential)) in differential geometry. But any geometer will tell you these are the super-boring flat cases. Some will also frown at you. + They are also sometimes described in terms of the jacobian: + The _pushforward_ is _jacobian vector product_ (`jvp`), and _pullback_ is _jacobian transpose vector product_ (`j'vp`). Other terms that may be used include for _pullback_ the **backpropagator**, and by analogy for _pushforward_ the **forwardpropagator**, thus these are the _propagators_. These are also good names because effectively they propagate wiggles and wobbles through them, via the chain rule. 
(the term **backpropagator** may originate with ["Lambda The Ultimate Backpropagator"](http://www-bcl.cs.may.ie/~barak/papers/toplas-reverse.pdf) by Pearlmutter and Siskind, 2008)
 
 #### Core Idea
 
+##### Less formally
 
-```@raw html
- 
+##### Lighter Math
+For a chain of expressions:
+```
+a = f(x)
+b = g(a)
+c = h(b)
+```
+The pullback of `g`, which incorporates the knowledge of `∂b/∂a`,
+applies the chain rule to go from `∂c/∂b` to `∂c/∂a`.
- - The **pushforward** takes a wiggle in the _input space_, and tells what wobble you would create in the output space, by passing it through the function.
- - The **pullback** takes wobblyness information with respect to the function's output, and tells the equivalent wobblyness with repect to the functions input.
-
-Definitions:
- - wobblyness: a sensitivity
- - wobble: a differential in the output space
- - wiggle: a differential in the input space
-
-#### Math
+The pushforward of `g`, which also incorporates the knowledge of `∂b/∂a`,
+applies the chain rule to go from `∂a/∂x` to `∂b/∂x`.

+#### Heavier Math
 If I have some functions: ``g(a)``, ``h(b)`` and ``f(x)=g(h(x))``,
-∂
 and I know the pullback of ``g``, at ``h(x)`` written: ``\mathrm{pullback}_{g(a)|a=h(x)}``,
 
 and I know the deriviative of h with respect to its input ``b`` at ``g(x)``, written:

From 09456040ba0a61f071a549204d0646b6eac4a8 Mon Sep 17 00:00:00 2001
From: Lyndon White
Date: Tue, 24 Sep 2019 23:19:48 +0100
Subject: [PATCH 49/51] add more on dotted and barred forms

---
 docs/src/index.md | 13 +++++++++++--
 1 file changed, 11 insertions(+), 2 deletions(-)

diff --git a/docs/src/index.md b/docs/src/index.md
index 50c50c3de..6b6cdf626 100644
--- a/docs/src/index.md
+++ b/docs/src/index.md
@@ -399,17 +399,26 @@ It is very easy to check gradients or derivatives with a computer algebra system
 ### What is up with the different symbols?
 
+#### `Δx`, `∂x`, `dx`
+ChainRules uses these perhaps atypically.
+As a notation that is the same across propagators, regardless of direction. (In contrast, see `ẋ` and `x̄` below.)
 
  - `Δx` is the input to a propagator, (i.e a _seed_ for a _pullback_; or a _perturbation_ for a _pushforward_)
  - `∂x` is the output of a propagator
- - `dx` could be anything, including a pullback. It really should not show up outside of tests.
+ - `dx` could be anything, including a pullback/pushforward. It really should not show up outside of tests.
+
+
+#### ``\dot{y} = \dfrac{∂y}{∂x} = \bar{x}``
  - `v̇` is a derivative of the input moving forward: ``v̇ = \frac{∂v}{∂x}`` for input ``x``, intermediate value ``v``.
  - `v̄` is a derivative of the output moving backward: ``v̄ = \frac{∂y}{∂v}`` for output ``y``, intermediate value ``v``.
+
+#### Others
  - `Ω` is often used as the return value of the function. Especially, but not exclusively, for scalar functions.
  - `ΔΩ` is thus a seed for the pullback.
  - `∂Ω` is thus the output of a pushforward.
 
-### Why does `frule` and `rrule` return the function evaluation? 
+### Why do `frule` and `rrule` return the function evaluation?
 
 You might wonder why `frule(f, x)` returns `f(x)` and the pushforward for `f` at `x`, and similarly for `rrule` returing `f(x)` and the pullback for `f` at `x`.
 Why not just return the pushforward/pullback, and let the user call `f(x)` to get the answer seperately?

From d01a7dc7d6df5c5e29df070981ba34899471bcb9 Mon Sep 17 00:00:00 2001
From: Lyndon White
Date: Wed, 25 Sep 2019 00:02:07 +0100
Subject: [PATCH 50/51] fix note

---
 docs/src/index.md | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/docs/src/index.md b/docs/src/index.md
index 6b6cdf626..139e9879a 100644
--- a/docs/src/index.md
+++ b/docs/src/index.md
@@ -18,7 +18,7 @@ Knowing rules for more complicated functions speeds up the autodiff process as i
 **ChainRules is an AD-independent collection of rules to use in a differentiation system.**
 
-!!! terminology The whole field is a mess.
+!!!
note "The whole field is a mess for terminology"
     It isn't just ChainRules, it is everyone.
     Internally ChainRules tries to be consistent.
     Help with that is always welcomed.

From bce469d03e4f49459e07e629d724ec487d85fb20 Mon Sep 17 00:00:00 2001
From: Lyndon White
Date: Wed, 25 Sep 2019 00:05:04 +0100
Subject: [PATCH 51/51] not more

---
 docs/src/index.md | 4 +++-
 1 file changed, 3 insertions(+), 1 deletion(-)

diff --git a/docs/src/index.md b/docs/src/index.md
index 139e9879a..31136f905 100644
--- a/docs/src/index.md
+++ b/docs/src/index.md
@@ -160,7 +160,9 @@ If the function is `y = f(x)` often the pullback will be written `s̄elf, x̄ =
 The pullback returns one `∂arg` per `arg` to the original function, plus one for the fields of the function itself (explained below).
 
 !!! terminology
-    Sometimes _perturbation_, _seed_, and _sensitivity_ will be used interchangeably, depending on task/subfield (sensitivity analysis and perturbation theory are apparently very big on just calling everything _sensitivity_ or _perturbation_ respectively.)
+    Sometimes _perturbation_, _seed_, and even _sensitivity_ will be used interchangeably.
+    They are not generally synonymous, and ChainRules shouldn't mix them up.
+    One must be careful when reading the literature.
     At the end of the day, they are all _wiggles_ or _wobbles_.
 
 ### Self derivative `Δself`, `∂self`, `s̄elf`, `ṡelf` etc.
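The self-partial that these patches keep returning to can be sketched end to end with a callable struct. `Scale` and `toy_rrule` below are hypothetical illustrations (the real rules use `NO_FIELDS` for field-free functions rather than a named tuple), but they show where `∂self` comes from when the function genuinely has fields:

```julia
# A callable struct: the "function" itself has a field `a`, so its toy
# pullback returns a ∂self for that field alongside the argument partial.
struct Scale
    a::Float64
end
(s::Scale)(x) = s.a * x

function toy_rrule(s::Scale, x)
    y = s(x)
    function scale_pullback(ȳ)
        ∂self = (a = ȳ * x,)   # derivative with respect to the field `a`
        ∂x = ȳ * s.a           # derivative with respect to the argument `x`
        return ∂self, ∂x
    end
    return y, scale_pullback
end

y, pb = toy_rrule(Scale(3.0), 2.0)
∂self, ∂x = pb(1.0)   # seed ȳ = 1: ∂self.a is ∂y/∂a, ∂x is ∂y/∂x
```

For a plain function with no fields, `∂self` would carry no information — which is exactly why the rules use a dedicated zero-like constant there.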