Add mean to std.numeric by JackStouffer · Pull Request #3892 · dlang/phobos

JackStouffer · 2015-12-31T14:38:26Z

Now that #2991 has been closed, I am reintroducing this function. I believe std.numeric to be a better fit than std.algorithm.iteration, so I moved it.

See #3679 for previous discussion

Also fixes issue 14034

dlang-bot · 2015-12-31T14:38:34Z

Thanks for your pull request, @JackStouffer!

Bugzilla references

Auto-close	Bugzilla	Description
✓	14034	std.algorithm.mean

9il · 2015-12-31T22:46:48Z

#2991 has been closed because it will be a part of the future std.las (linear algebra subroutines). std.las will have mean (Package Change) with different summation algorithms (API Change) and optimizations (Implementation Change). I will write a std.las schedule, plan, and first docs, so anyone will be able to participate in this project.

CyberShadow · 2017-07-07T17:29:08Z

@JackStouffer Since std.las never materialized, what do you think of reopening this?

Average is ~~the only~~ one of the very few LINQ features on https://github.com/wilzbach/linq that have no D equivalent.

JackStouffer · 2017-07-07T20:34:03Z

@CyberShadow heh, blast from the past.

Sure, if people are interested, I'll resurrect this.

CyberShadow · 2017-07-07T20:42:01Z

There is a private: in std.numeric on line 3332, you will need to override it or add the code above it.

JackStouffer · 2017-07-07T20:45:51Z

@CyberShadow Fixed

CyberShadow · 2017-07-07T20:48:21Z

	int[] arr = null;
	writeln(mean(arr, 5));

This prints 0. Is that correct?

What does the seed do, anyway? I think it could use a bit more elaboration.

JackStouffer · 2017-07-07T20:50:16Z

Yeah, the seed is mainly there to support user defined types. It doesn't really make sense when you're just using plain number types.

I'll add a note in the docs.

CyberShadow · 2017-07-07T20:51:06Z

OK.

Shouldn't mean of an empty list return nan instead of 0 though?

JackStouffer · 2017-07-07T20:55:38Z

@CyberShadow It will if you don't explicitly set the seed. Or, are you saying that there should be some static if special casing in the second overload?

CyberShadow · 2017-07-07T20:59:52Z

Maybe I just don't understand the docs.

> Finds the mean (colloquially known as the average) of a range. If r does not provide the length member, then this function will do element by element summation, rather than the more accurate methods provided by $(REF sum, std,algorithm,iteration).

So far so good.

> An optional parameter seed may be passed to initially populate the summation.

Populate in what way? What effect does it have, exactly?

> This function will return real.nan if the range is empty and if the element type of r is a built-in numerical type.

The above conditions are both satisfied for my example above. So it should return real.nan instead of 0, no?

> Or, if the seed value is defined and not a built-in numerical type, then ElementType!R.init will be returned.

A seed is given but it is a numerical type, so this shouldn't apply to my example above.

Also, is the seed value used anywhere, or just its type?

CyberShadow · 2017-07-07T21:02:29Z

> Or, are you saying that there should be some static if special casing in the second overload?

Yep, I think nonsensical instantiations should be rejected outright to avoid confusion.

JackStouffer · 2017-07-07T21:10:20Z

> Also, is the seed value used anywhere, or just its type?

The value is used in both the call to sum and in the reduce call.

> Yep, I think nonsensical instantiations should be rejected outright to avoid confusion.

Ok, changed the template constraints on the second overload and changed the docs.

CyberShadow · 2017-07-07T21:26:48Z

std/numeric.d

+/// ditto
+auto mean(R, T)(R r, T seed)
+    if (isInputRange!R &&
+        (is(T == real) || !isNumeric!(T)) &&


Why allow real seeds?

to make the first overload compile

plus it also correctly returns nan if you do

int[] a = []; mean(a, real(0));

Why do you restrict the seed type at all?
Isn't assert((cast(int[]) null).mean(0) == 0) a valid use case?

> Isn't assert((cast(int[]) null).mean(0) == 0) a valid use case?

No. Taking the mean of an array and not giving back a FP value makes no sense from a statistical perspective.

This is why this function is in std.numeric and not std.algorithm, because it's opinionated by design.
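The objection to an integer result can be seen in a small sketch. The two helper names below are hypothetical and are not part of the PR; they only contrast integer division with floating-point division on the same input, and both assume a non-empty array.

```d
import std.algorithm.iteration : sum;

/// Hypothetical helper: integer-only mean, the division truncates toward zero.
/// Assumes `values` is non-empty.
int truncatedMean(const int[] values)
{
    return values.sum / cast(int) values.length;
}

/// Hypothetical helper: floating-point mean, the fractional part survives.
/// Assumes `values` is non-empty.
double floatingMean(const int[] values)
{
    return cast(double) values.sum / values.length;
}

void main()
{
    import std.stdio : writeln;

    writeln(truncatedMean([1, 2])); // integer arithmetic
    writeln(floatingMean([1, 2]));  // floating-point arithmetic
}
```

For the input `[1, 2]` the integer version loses the `.5`, which is the statistical argument made above for always returning a floating-point value.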

wilzbach

Please mind my feedback about FP precision.

Moreover,

If it's just mean, I would search in std.algorithm.iteration for this (after all, that's where sum is). What's your reasoning for std.numeric?
There's quite a bunch of statistical properties that can be computed online (i.e. as an OutputRange): at least mean, stdev, variance, skewness, kurtosis, min, and max. I wonder whether it would make sense to add a StatReport instead. It would still be very easy to use arr.stat.mean (of course the convenience overload can still exist) and we could even make more expensive computations opt-in (e.g. arr.stat!(Stat.skewness, Stat.kurtosis)).
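To make the "computed online as an OutputRange" idea concrete, here is a minimal sketch. `RunningStats` is a hypothetical type invented for illustration; it is not the `StatReport` mentioned above, and it covers only mean and sample variance using Welford-style updates.

```d
import std.range.primitives : isOutputRange;

/// Hypothetical accumulator (not the StatReport type from the discussion).
struct RunningStats
{
    private size_t n;       // number of observations seen so far
    private double m = 0;   // running mean
    private double m2 = 0;  // running sum of squared deviations from the mean

    /// Consume one observation; having `put` makes this an output range of double.
    void put(double x)
    {
        ++n;
        immutable delta = x - m;
        m += delta / n;
        m2 += delta * (x - m);
    }

    /// Mean of everything seen so far, NaN if nothing was seen.
    double mean() const { return n ? m : double.nan; }

    /// Sample variance, NaN with fewer than two observations.
    double variance() const { return n > 1 ? m2 / (n - 1) : double.nan; }
}

static assert(isOutputRange!(RunningStats, double));

void main()
{
    import std.stdio : writeln;

    RunningStats stats;
    foreach (x; [2.0, 4.0, 4.0, 6.0])
        stats.put(x);
    writeln(stats.mean, " ", stats.variance);
}
```

Because each statistic is updated per element, the input never has to be stored, which is what allows the same accumulator to serve several statistics in one pass.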

wilzbach · 2017-07-09T21:13:05Z

std/numeric.d

+
+    if (r.empty)
+    {
+        return Unqual!(ElementType!R).init;


Even if documented, this is quite counter-intuitive. After all - the purpose of a seed is to be used for these edge cases ;-)
What's your motivation for this?
Here's how I would expect mean to work:

(cast(int[]) null).mean; // nan
(cast(int[]) null).mean(0); // 0

The problem, as mentioned above, is that mean should return a FP result whenever possible.

wilzbach · 2017-07-09T22:09:27Z

std/numeric.d

+        auto pair = reduce!((a, b) => tuple(a[0] + 1, a[1] + b))
+            (tuple(size_t(0), seed), r);
+
+        return pair[1] / pair[0];


Please let's use a more accurate algorithm here. Or, to quote Joni Salonen:

Do not use this method to compute variance ever. This is one of those cases where a mathematically simple approach turns out to give wrong results for being numerically unstable. In simple cases the algorithm will seem to work fine, but eventually you will find a dataset that exposes the problem with the algorithm.

To put this in code: the naive approach vs. an online computation after Welford. I used StatReport here; it's an OutputRange for mean + variance, but Summary of dstats is nice as well.

auto arr = [10e19 + 4, 7, 10e9 + 13, 10e15 + 16];
writefln("Naive: %f", arr.mean);
writefln("Online: %f", arr.stat.mean);

yields:

Naive: 25002500002500001792.000000
Online: 25002500002500000008.000000

For reference, it should be 25002500002500000010.
Or if you prefer to read a paper (the simple trick is equation 4 in chapter 1):

This method is more precise, but not as efficient due to division in the loop. However, I argue that most users won't care and correctness is more important than performance for a standard library.
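The per-element update behind the online mean can be derived directly from the definition of the running mean. The notation here is an assumption made for illustration: $x_k$ is the $k$-th element and $m_k$ is the mean of the first $k$ elements, with $m_0 = 0$.

```latex
\begin{align}
m_k &= \frac{1}{k}\sum_{i=1}^{k} x_i
     = \frac{(k-1)\,m_{k-1} + x_k}{k} \\
    &= m_{k-1} + \frac{x_k - m_{k-1}}{k}
\end{align}
```

The last form keeps the accumulator at the magnitude of the mean rather than the magnitude of the full sum, at the cost of one division per element.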

wilzbach · 2017-07-09T22:11:32Z

std/numeric.d

+    assert(bigint_arr.mean(BigInt(0)) == BigInt("3"));
+    assert(bigint_arr2.mean(BigInt(0)) == BigInt("3"));
+}
+


A test for floating point accuracy would be nice.

wilzbach · 2017-07-09T22:29:27Z

std/numeric.d

+        return real.nan;
+    }
+
+    return mean(r, real(0.0));


Are you aware that real FP (in this summation) is quite a bit slower than double?

/**
Compile with:
> ldc -release -O3 -mcpu=native -boundscheck=off
*/
import std.algorithm, std.range;

private void doNotOptimizeAway(T)(auto ref T t)
{
    import core.thread : getpid;
    import std.stdio : writeln;
    if(getpid() == 1) {
        writeln(*cast(char*)&t);
    }
}

/// ditto
auto mean(R, T)(R r, T seed)
{
    import std.algorithm.iteration : sum, reduce;

    static if (hasLength!R)
    {
        if (r.length == 1)
            return r.front;

        return sum(r, seed) / r.length;
    }
    else
    {
        import std.typecons : tuple;

        auto pair = reduce!((a, b) => tuple(a[0] + 1, a[1] + b))
            (tuple(size_t(0), seed), r);

        return pair[1] / pair[0];
    }
}

void main()
{
    import std.datetime;
    import std.stdio;
    import std.array;
    import std.conv;
    import std.random;

    auto arr = iota(100_000).array;
    auto bench = benchmark!(
        { doNotOptimizeAway({ return arr.mean(real(0)); }());},
        { doNotOptimizeAway({ return arr.mean(double(0)); }());},
        { doNotOptimizeAway({ return arr.filter!(a => a >= 0).mean(real(0)); }());},
        { doNotOptimizeAway({ return arr.filter!(a => a >= 0).mean(double(0)); }());},
    )(50_000);
    string[] names = ["mean (real)", "mean (double)", "mean (real, InputRange)", "mean (double, InputRange)"];
    foreach(j,r;bench)
        writefln("%-15s = %s", names[j], r.to!Duration);
}

> ldc -release -O3 -mcpu=native -boundscheck=off test.d && ./test
mean (real) = 5 secs, 185 ms, 845 μs, and 3 hnsecs
mean (double) = 3 secs, 869 ms, 445 μs, and 9 hnsecs
mean (real, InputRange) = 6 secs, 109 ms, 83 μs, and 8 hnsecs
mean (double, InputRange) = 5 secs, 940 ms, 972 μs, and 1 hnsec

I assume that if you want statistical values, accuracy will be more valuable than speed.

> I assume that if you want statistical values, accuracy will be more valuable than speed.

FWIW with the test provided below double yields exactly the same inaccuracy (25002500002500001792.000000).
The point was that always picking real is a severe performance penalty with almost negligible accuracy improvements, so the user should at least have the choice.
Also there's another, very valid point about allowing double: it's consistent on all platforms. I purposefully ignored real on my Tinflex random project last year because the algorithm was very sensitive to small changes (more details about FP behavior with dmd or as a handy list of FP issues to keep in mind).
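Part of the reason double gives the same digits whichever algorithm is used is the spacing of representable values at this magnitude. The sketch below is not from the thread; `spacingAbove` is a hypothetical helper that measures the gap between a double and its next representable neighbour using `std.math.nextUp`.

```d
import std.math : nextUp;

/// Hypothetical helper: gap between `x` and the next representable double above it.
double spacingAbove(double x)
{
    return nextUp(x) - x;
}

void main()
{
    import std.stdio : writefln;

    // Magnitude of the largest input and of the expected mean in the test case.
    writefln("spacing near 1e20:   %s", spacingAbove(1e20));
    writefln("spacing near 2.5e19: %s", spacingAbove(2.5e19));
}
```

Near 2.5e19 adjacent doubles are 4096 apart. The reported double result 25002500002500001792 differs from the exact mean 25002500002500000010 by 1782, which is less than half that gap, so no double can represent the mean more closely.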

wilzbach · 2017-07-09T22:32:40Z

std/numeric.d

+        BigInt("1"), BigInt("2"), BigInt("3"), BigInt("6")
+    ]);
+    assert(bigint_arr.mean(BigInt(0)) == BigInt("3"));
+    assert(bigint_arr2.mean(BigInt(0)) == BigInt("3"));


Could we test mean on an empty array of a user-defined type (e.g. BigInt)? Thanks!

edit: has been fixed.

wilzbach · 2017-07-09T22:38:20Z

std/numeric.d

+    }
+
+    static if (hasLength!R)
+    {


As mentioned below, it's more precise to do division on every step, but of course a bit slower.

wilzbach · 2017-07-09T22:40:46Z

std/numeric.d

+            return r.front;
+        }
+
+        return sum(r, seed) / r.length;


> @JackStouffer Since std.las never materialized

FYI there's mir.math.sum: https://github.com/libmir/mir-algorithm/blob/master/source/mir/math/sum.d
It's well tested and highly optimized.

If anyone is willing to port that to Phobos, I'd be willing to use it :P

JackStouffer · 2017-07-09T23:16:42Z

@wilzbach I tried your algorithm and I'm still getting the same result for the test case. What do I need to change?

auto mean(R, T)(R r, T seed)
{
    T meanRes = seed;
    size_t i = 0;

    foreach (e; r) {
        immutable delta = (e - meanRes);
        meanRes += delta / ++i;
    }

    return meanRes;
}

auto arr = [10e19 + 4, 7, 10e9 + 13, 10e15 + 16];
writefln("Online: %f", arr.mean(real(0.0))); // 25002500002500001792.000000

wilzbach · 2017-07-09T23:23:59Z

> @wilzbach I tried your algorithm and I'm still getting the same result for the test case. What do I need to change?

Looks like your seed is double. Try this:

void main()
{
    import std.stdio;
    auto arr = [10e19 + 4, 7, 10e9 + 13, 10e15 + 16];
    writefln("Online: %f", arr.mean(double(0)));
    writefln("Online: %f", arr.mean(real(0)));
}

Run it online

JackStouffer · 2017-07-10T15:28:18Z

Gah! Seems like this needs to be really re-worked.

JackStouffer · 2017-07-10T18:44:54Z

@wilzbach I think I addressed most of the issues.

wilzbach · 2017-07-10T20:49:54Z

Behavior for empty input

I thought a bit and returning NaN is definitely reasonable and what literally all other statistical languages do:

Language	Command	Results
Python	`np.mean(list())`	`nan`
Julia	`mean(Array{Float64, 1}())` or `mean(Array{Int64, 1}())`	`NaN`
R	`mean(double())` or `mean(int())`	`NaN`
Matlab / Octave	`mean([])`	`NaN`
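The same convention is easy to express in D. The function below is a hypothetical helper written for illustration, not the PR's code; it assumes plain `double` input and returns NaN for an empty slice.

```d
import std.math : isNaN;

/// Hypothetical helper (not the PR's code): mean of doubles, NaN when empty.
double meanOrNaN(const(double)[] values)
{
    if (values.length == 0)
        return double.nan;

    double total = 0;
    foreach (v; values)
        total += v;
    return total / values.length;
}

void main()
{
    // NaN never compares equal to anything, so test for it with isNaN.
    assert(meanOrNaN(null).isNaN);
    assert(meanOrNaN([1.0, 2.0, 3.0]) == 2.0);
}
```

Callers have to check the result with `isNaN` rather than `==`, which is the same contract the languages in the table impose.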

wilzbach

Looks a lot better now. Thanks for addressing my concerns!
I found some typos + had an idea for a good test, but I think the only blocker left is the approval of the name addition.

Lastly I am still not really happy about the different behavior between an empty array of type long and a user-subtype of long. I am aware that the user will be warned by a compile error as the seed parameter is missing, but it's still not a nice behavior.
AFAICT the second overload solely exists for arbitrary-precision types like BigInt and there isn't much precedent for math* or numeric caring about BigInt. gcd is the only one I know of.

wilzbach · 2017-07-10T20:04:50Z

std/numeric.d

+is used. The default result type is `real`, which is more accurate,
+but far slower than `double` on many architectures.
+
+For user defined types, element by element summation is used. Additionally


Nit: user-defined is the common spelling here

wilzbach · 2017-07-10T20:05:34Z

std/numeric.d

+The first overload of this function will return `T.init` if the range
+is empty. However, the second overload will return `seed` on empty ranges.
+
+This function is $(BIGOH r.length).


big O notations are adjectives, not nouns.

wilzbach · 2017-07-10T21:04:33Z

std/numeric.d

+    The mean of `r` when `r` is non-empty.
+*/
+T mean(T = real, R)(R r)
+    if (isInputRange!R &&


Nit: The if constraint should be on the same indent level as the declaration.
See also: it's part of the DStyle and will be automated soon.

wilzbach · 2017-07-10T21:09:56Z

std/numeric.d

+    BigInt[] bigint_arr3 = [];
+    assert(bigint_arr3.mean(BigInt(0)) == BigInt(0));
+}
+


One last idea: testing the behavior with subtypes:

struct MyFancyDouble { double v; alias v this; }

(they currently use the second overload, however, intuitively as a user I would expect the first one).

> (they currently use the second overload, however, intuitively as a user I would expect the first one).

Yeah, well that's the problem with alias this and templates. alias this works great with function overloads, but with templates you have to either explicitly cast or explicitly define the template types a la the hacks in std.file
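The difference between overloads and templates described here can be reproduced in a few lines. `half` and `acceptedAsNumeric` are hypothetical names used only for this illustration.

```d
import std.traits : isNumeric;

struct MyFancyDouble
{
    double v;
    alias v this;
}

/// Plain function: `alias this` lets a MyFancyDouble convert to double.
double half(double x)
{
    return x / 2;
}

/// Template: T is deduced as the struct itself, so a numeric constraint sees a struct.
bool acceptedAsNumeric(T)(T)
{
    return isNumeric!T;
}

void main()
{
    auto x = MyFancyDouble(3.0);

    assert(half(x) == 1.5);        // implicit conversion through alias this
    assert(!acceptedAsNumeric(x)); // isNumeric!MyFancyDouble is false

    double d = x;                  // converting first selects the numeric path
    assert(acceptedAsNumeric(d));
}
```

A template constrained on `isNumeric!(ElementType!R)` therefore routes a range of such structs to the non-numeric overload unless the caller converts the elements first.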

wilzbach · 2017-07-10T21:20:21Z

std/numeric.d

+    // inaccurate for integer division, which the user defined
+    // types might be representing
+    auto pair = reduce!((a, b) => tuple(a[0] + 1, a[1] + b))
+        (tuple(size_t(0), seed), r);


Hmm AFAICT this only makes sense for BigInt (or similar types with arbitrary precision), for all other user-defined or subtyped types using real and the first overload makes more sense...

Also there's no need to save and compute the length if hasLength is available. Another lambda comes with a cost.

> Hmm AFAICT this only makes sense for BigInt (or similar types with arbitrary precision), for all other user-defined or subtyped types using real and the first overload makes more sense...

The problem is there's no way to know if the type doesn't use alias this. And if it does use alias this, then they can cast to real in order to get the other overload.

> Also there's no need to save and compute the length if hasLength is available. Another lambda comes with a cost.

ok

> The problem is there's no way to know if the type doesn't use alias this. And if it does use alias this, then they can cast to real in order to get the other overload.

There is, and that was my entire point:

import std.bigint;
import std.stdio;

void main(string[] args)
{
    struct MyFancyDouble
    {
        double v;
        alias v this;
    }

    pragma(msg, typeof(MyFancyDouble.init / size_t(0)));
    pragma(msg, typeof(BigInt.init / size_t(0)));
}

-> https://is.gd/rRbSaP

Or in other words: if the division by size_t yields a FloatingPoint type -> algorithm from overload 1 should be used.
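The rule stated here can be written as a compile-time trait. `dividesToFloatingPoint` is a hypothetical name chosen for this sketch; the PR itself does not define it.

```d
import std.bigint : BigInt;
import std.traits : isFloatingPoint;

/// Hypothetical trait: true when dividing a T by a size_t gives a floating-point value.
enum bool dividesToFloatingPoint(T) =
    isFloatingPoint!(typeof(T.init / size_t(1)));

struct MyFancyDouble
{
    double v;
    alias v this;
}

static assert( dividesToFloatingPoint!MyFancyDouble); // double / size_t is double
static assert( dividesToFloatingPoint!double);
static assert(!dividesToFloatingPoint!BigInt);        // BigInt / size_t is BigInt
static assert(!dividesToFloatingPoint!int);           // int / size_t stays integral

void main() {}
```

A constraint built on such a trait would send subtypes of floating-point numbers to the first algorithm and leave arbitrary-precision integers to the seeded one, without inspecting `alias this` directly.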

JackStouffer · 2017-07-10T21:59:15Z

> AFAICT the second overload solely exists for arbitrary-precision types like BigInt and there aren't many precedent of math* or numeric caring about BigInt. gcd is the only one I know of.

The seed parameter on sum exists for the same purpose.

wilzbach · 2017-07-10T22:13:49Z

> The seed parameter on sum exists for the same purpose.

Yet another reason for putting it into std.algorithm.iteration ;-)
Also the std.algorithm.iteration.sum of an empty array is always zero, no matter whether the element type is int, double or BigInt.
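The claim about `sum` on empty input can be checked directly. This snippet only uses `std.algorithm.iteration.sum` and `std.bigint.BigInt`; the explicit seed in the last case mirrors how the discussion treats user-defined types.

```d
import std.algorithm.iteration : sum;
import std.bigint : BigInt;

void main()
{
    int[] ints;
    double[] doubles;
    BigInt[] bigs;

    assert(ints.sum == 0);
    assert(doubles.sum == 0);
    assert(bigs.sum(BigInt(0)) == 0); // explicit seed for the user-defined type
}
```

Zero is the natural identity for addition, whereas a mean has no such identity, which is why the empty case is contentious for `mean` but not for `sum`.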

JackStouffer · 2017-11-20T16:33:37Z

Ping @wilzbach @CyberShadow

Moved it back to std.algorithm as that makes the most sense. Let's get a final decision on this and get it out of the queue.

CyberShadow · 2017-11-20T18:01:16Z

> Ping @wilzbach @CyberShadow

Thanks!

I think the API is looking much better than the initial iteration. Still, math stuff is really not my area of expertise (I was out on a limb reviewing this in the first place), so deferring this to std.algorithm code owners.

andralex · 2017-11-21T22:34:40Z

std/algorithm/iteration.d

+Returns:
+    The mean of `r` when `r` is non-empty.
+*/
+T mean(T = real, R)(R r)


We should move away from the use of real as a default and treat it as a special interest type. Please use double here.
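For readers who want to see the resulting API in use, here is a short sketch. It assumes a Phobos release that contains `std.algorithm.iteration.mean` with the `double` default requested in this review; the overload behaviour shown follows the documentation quoted in the thread.

```d
import std.algorithm.iteration : mean;
import std.bigint : BigInt;
import std.math : isNaN;

void main()
{
    // Numeric elements: the result type defaults to double.
    auto m = [1, 2, 3].mean;
    static assert(is(typeof(m) == double));
    assert(m == 2.0);

    // The result type can still be requested explicitly.
    static assert(is(typeof(mean!real([1, 2, 3])) == real));

    // Empty numeric input yields T.init, which is NaN for double.
    int[] nothing;
    assert(nothing.mean.isNaN);

    // Non-numeric elements such as BigInt take the seeded overload.
    assert([BigInt(1), BigInt(2), BigInt(6)].mean(BigInt(0)) == 3);
}
```

Making the type a template parameter keeps `real` available for those who want it while giving every platform the same default behaviour.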

andralex · 2017-11-21T22:34:58Z

std/algorithm/iteration.d

+if (isInputRange!R &&
+    isNumeric!(ElementType!R) &&
+    !isInfinite!R &&
+    is(Unqual!(T) == T))


Why is the unqual needed?

superfluous parens, use Unqual!T

andralex · 2017-11-21T22:37:36Z

std/algorithm/iteration.d

+
+    // Knuth & Welford mean calculation
+    // division per element is slower, but more accurate
+    foreach (e; r)


Avoid foreach, use explicit loops

Huh? We're moving away from foreach?

it's a convenience engine and may copy stuff etc. Just use a loop and use r.front only once inside the function.

andralex · 2017-11-21T22:37:53Z

std/algorithm/iteration.d

+    // division per element is slower, but more accurate
+    foreach (e; r)
+    {
+        immutable T delta = (e - meanRes);


spurious parens

andralex · 2017-11-21T22:38:37Z

std/algorithm/iteration.d

+    foreach (e; r)
+    {
+        immutable T delta = (e - meanRes);
+        meanRes += delta / ++i;


better use meanRes += delta / i++; initializing i to 1
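Taken together, the loop-related review comments (explicit loop, a single read of `r.front`, no extra parentheses, and a counter starting at 1) suggest a shape like the following. This is a sketch under those assumptions with a hypothetical name, `loopMean`; it is not the code that was merged.

```d
import std.range.primitives : empty, front, isInputRange, popFront;

/// Sketch only (hypothetical name, not the merged Phobos code).
T loopMean(T = double, R)(R r)
if (isInputRange!R)
{
    if (r.empty)
        return T.init;

    T meanRes = 0;
    size_t i = 1;

    // Explicit loop: r.front is read exactly once per element.
    for (; !r.empty; r.popFront())
    {
        immutable T delta = r.front - meanRes;
        meanRes += delta / i++;
    }
    return meanRes;
}

void main()
{
    import std.algorithm.iteration : filter;

    assert(loopMean([1, 2, 3]) == 2.0);
    // Works for ranges without a length as well.
    assert(loopMean([1, 2, 3, 4].filter!(a => a % 2 == 0)) == 3.0);
}
```

Reading `r.front` once matters for ranges whose `front` is computed lazily, since each access may repeat work or copy the element.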

andralex · 2017-11-21T22:39:59Z

std/algorithm/iteration.d

+auto mean(R, T)(R r, T seed)
+if (isInputRange!R &&
+    !isNumeric!(ElementType!R) &&
+    is(typeof(r.front + seed)) &&


the type of + should be numeric

andralex · 2017-11-21T22:40:10Z

std/algorithm/iteration.d

+if (isInputRange!R &&
+    !isNumeric!(ElementType!R) &&
+    is(typeof(r.front + seed)) &&
+    is(typeof(r.front / size_t(1))) &&


should be numeric

andralex · 2017-11-21T22:40:34Z

std/algorithm/iteration.d

+
+        if (len > 0)
+            return sum(r, seed) / len;
+        else


superfluous

andralex · 2017-11-21T22:41:05Z

std/algorithm/iteration.d

+    // types might be representing
+    static if (hasLength!R)
+    {
+        immutable len = r.length;


that's overdoing it - just use r.length

JackStouffer · 2017-11-21T23:20:53Z

> the type of + should be numeric

In the BigInt case this isn't true, because the type of the addition is BigInt and isNumeric!(BigInt) == false
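The point about BigInt can be verified at compile time. The checks below use only `std.bigint.BigInt` and `std.traits.isNumeric`; the last assertion restates the "does it compile" style of constraint from the diff.

```d
import std.bigint : BigInt;
import std.traits : isNumeric;

// BigInt is a struct, so the trait for built-in numbers rejects it...
static assert(!isNumeric!BigInt);

// ...and adding two BigInts yields another BigInt, not a built-in number.
static assert(is(typeof(BigInt(1) + BigInt(2)) == BigInt));

// Checking only that the expressions compile is what still admits BigInt.
static assert(is(typeof(BigInt(1) + BigInt(0))) &&
              is(typeof(BigInt(1) / size_t(1))));

void main() {}
```

Requiring `isNumeric` on the result of `+` or `/` would therefore exclude the very type the seeded overload was written for.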

Addressed points

JackStouffer · 2017-11-22T04:19:39Z

Thanks everyone for getting this through!

JackStouffer force-pushed the mean branch 2 times, most recently from 441931e to 0d1765c Compare December 31, 2015 15:12

JackStouffer closed this Jan 4, 2016

JackStouffer deleted the mean branch May 22, 2017 17:23

JackStouffer restored the mean branch July 7, 2017 20:34

JackStouffer reopened this Jul 7, 2017

dlang-bot added the Severity:Enhancement label Jul 7, 2017

JackStouffer force-pushed the mean branch from 0d1765c to ccb0cf2 Compare July 7, 2017 20:36

JackStouffer force-pushed the mean branch from ccb0cf2 to ff8491a Compare July 7, 2017 20:45

JackStouffer force-pushed the mean branch from ff8491a to fbbe351 Compare July 7, 2017 20:54

JackStouffer force-pushed the mean branch from fbbe351 to 5078282 Compare July 7, 2017 20:56

JackStouffer force-pushed the mean branch from 5078282 to a207158 Compare July 7, 2017 21:09

CyberShadow reviewed Jul 7, 2017

wilzbach previously requested changes Jul 9, 2017

JackStouffer force-pushed the mean branch 2 times, most recently from 95f96bd to 55b44d3 Compare July 10, 2017 19:10

wilzbach reviewed Jul 10, 2017

wilzbach added the @andralex label Jul 10, 2017

JackStouffer force-pushed the mean branch 2 times, most recently from c7e18f9 to 718017e Compare July 10, 2017 22:12

JackStouffer force-pushed the mean branch from 4e885fe to 5c2acaf Compare November 20, 2017 16:32

JackStouffer requested review from PetarKirov and andralex as code owners November 20, 2017 16:32

JackStouffer force-pushed the mean branch from 56a355f to dbeec01 Compare November 20, 2017 18:29

andralex approved these changes Nov 21, 2017

Fix issue 14034: Add mean to Phobos

b5572e8

JackStouffer force-pushed the mean branch from d804189 to b5572e8 Compare November 21, 2017 23:23

JackStouffer added Merge:auto-merge and removed @andralex labels Nov 22, 2017

dlang-bot merged commit eb3c390 into dlang:master Nov 22, 2017

JackStouffer deleted the mean branch November 22, 2017 04:19
