Skip to content

Comments

Add mean to std.numeric#3892

Merged
dlang-bot merged 1 commit intodlang:masterfrom
JackStouffer:mean
Nov 22, 2017
Merged

Add mean to std.numeric#3892
dlang-bot merged 1 commit intodlang:masterfrom
JackStouffer:mean

Conversation

@JackStouffer
Copy link
Contributor

Now that #2991 has been closed, I am reintroducing this function. I believe std.numeric to be a better fit than std.algorithm.iteration, so I moved it.

See #3679 for previous discussion

Also fixes issue 14034

@dlang-bot
Copy link
Contributor

dlang-bot commented Dec 31, 2015

Thanks for your pull request, @JackStouffer!

Bugzilla references

Auto-close Bugzilla Description
14034 std.algorithm.mean

@JackStouffer JackStouffer force-pushed the mean branch 2 times, most recently from 441931e to 0d1765c Compare December 31, 2015 15:12
@9il
Copy link
Member

9il commented Dec 31, 2015

#2991 has been closed because it will be a part of the future std.las (linear algebra subroutines). std.las will have mean (Package Change) with different summation algorithms (API Change) and optimizations (Implementation Change). I will write std.las schedule, plan and first docs, so any one will be able to participate in this project.

@JackStouffer JackStouffer deleted the mean branch May 22, 2017 17:23
@CyberShadow
Copy link
Member

CyberShadow commented Jul 7, 2017

@JackStouffer Since std.las never materialized, what do you think of reopening this?

Average is the only one of the very few LINQ features on https://github.com/wilzbach/linq that have no D equivalent.

@JackStouffer
Copy link
Contributor Author

@CyberShadow heh, blast from the past.

Sure, if people are interested, I'll resurrect this.

@CyberShadow
Copy link
Member

There is a private: in std.numeric on line 3332, you will need to override it or add the code above it.

@JackStouffer
Copy link
Contributor Author

@CyberShadow Fixed

@CyberShadow
Copy link
Member

CyberShadow commented Jul 7, 2017

	int[] arr = null;
	writeln(mean(arr, 5));

This prints 0. Is that correct?

What does the seed do, anyway? I think it could use a bit more elaboration.

@JackStouffer
Copy link
Contributor Author

Yeah, the seed is mainly there to support user defined types. It doesn't really make sense when you're just using plain number types.

I'll add a note in the docs.

@CyberShadow
Copy link
Member

OK.

Shouldn't mean of an empty list return nan instead of 0 though?

@JackStouffer
Copy link
Contributor Author

@CyberShadow It will if you don't explicitly set the seed. Or, are you saying that there should be some static if special casing in the second overload?

@CyberShadow
Copy link
Member

Maybe I just don't understand the docs.

Finds the mean (colloquially known as the average) of a range. If r does not provide the length member, then this function will do element by element summation, rather than the more accurate methods provided by $(REF sum, std,algorithm,iteration).

So far so good.

An optional parameter seed may be passed to initially populate the summation.

Populate in what way? What effect does it have, exactly?

This function will return real.nan if the range is empty and if the element type of r is a built-in numerical type.

The above conditions are both satisfied for my example above. So it should return real.nan instead of 0, no?

Or, if the seed value is defined and not a built-in numerical type, then ElementType!R.init will be returned.

A seed is given but it is a numerical type, so this shouldn't apply to my example above.

Also, is the seed value used anywhere, or just its type?

@CyberShadow
Copy link
Member

Or, are you saying that there should be some static if special casing in the second overload?

Yep, I think nonsensical instantiations should be rejected outright to avoid confusion.

@JackStouffer
Copy link
Contributor Author

Also, is the seed value used anywhere, or just its type?

The value is used in both the call to sum and in the reduce call.

Yep, I think nonsensical instantiations should be rejected outright to avoid confusion.

Ok, changed the template constraints on the second overload and changed the docs.

std/numeric.d Outdated
/// ditto
auto mean(R, T)(R r, T seed)
if (isInputRange!R &&
(is(T == real) || !isNumeric!(T)) &&
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why allow real seeds?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

to make the first overload compile

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

plus it also correctly returns nan if you do

int[] a = [];
mean(a, real(0));

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why you restrict the seed type at all?
Isn't assert((cast(int[]) null).mean(0) == 0) a valid use case?

Copy link
Contributor Author

@JackStouffer JackStouffer Jul 9, 2017

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Isn't assert((cast(int[]) null).mean(0) == 0) a valid use case?

No. Taking the mean of an array and not giving back a FP value makes no sense from a statistical perspective.

This is why this function is in std.numeric and not std.algorithm, because it's opinionated by design.

Copy link
Contributor

@wilzbach wilzbach left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Please mind my feedback about FP precision.

Moreover,

  • if it's just mean I would search in std.algorithm.iteration for this (after all that's where sum is). What's your reasoning for std.numerical?
  • there's quite a bunch of statistical properties that can be computed online (i.e. as an OutputRange):
    at least mean, stdev, variance, skewness, kurtosis, min, and max. I wonder whether it would make sense to add a StatReport instead. It would still be very easy to use arr.stat.mean (of course the convenience overload can still exist) and we could even make more expensive computations opt-in (e.g. arr.stat!(Stat.skewness, Stat.kurtosis))

std/numeric.d Outdated

if (r.empty)
{
return Unqual!(ElementType!R).init;
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Even if documented, this is quite counter-intuitive. After all - the purpose of a seed is to be used for these edge cases ;-)
What's your motivation for this?
Here's how I would expect mean to work:

(cast(int[]) null).mean; // nan
(cast(int[]) null).mean(0); // 0

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The problem as mentioned above is that mean should return a FP result when ever possible.

std/numeric.d Outdated
auto pair = reduce!((a, b) => tuple(a[0] + 1, a[1] + b))
(tuple(size_t(0), seed), r);

return pair[1] / pair[0];
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Please let's use a more efficient algorithm here. Or to quote Jonis alonen

Do not use this method to compute variance ever. This is one of those cases where a mathematically simple approach turns out to give wrong results for being numerically unstable. In simple cases the algorithm will seem to work fine, but eventually you will find a dataset that exposes the problem with the algorithm.

To put this in code. The naive approach vs. an online.combination after Welford. I used StatReport here, it's a OutputRange for mean + variance, butSummary of dstats is nice as well.

auto arr = [10e19 + 4, 7, 10e9 + 13, 10e15 + 16];
writefln("Naive: %f", arr.mean);
writefln("Online: %f", arr.stat.mean);

yields:

Naive: 25002500002500001792.000000
Online: 25002500002500000008.000000

For reference, it should be 25002500002500000010.
Or if you prefer to read a paper (the simple trick is equation 4 in chapter 1):

image

This method is more precise, but not as efficient due to division in the loop. However, I argue that most users won't care and correctness is more important than performance for a standard library.

std/numeric.d Outdated
assert(bigint_arr.mean(BigInt(0)) == BigInt("3"));
assert(bigint_arr2.mean(BigInt(0)) == BigInt("3"));
}

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

A test for floating point accuracy would be nice.

std/numeric.d Outdated
return real.nan;
}

return mean(r, real(0.0));
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Are you aware that real FP (in this summation) is quite slower than double?

/**
Compile with:
> ldc -release -O3 -mcpu=native -boundscheck=off
*/
import std.algorithm, std.range;

private void doNotOptimizeAway(T)(auto ref T t)
{
    import core.thread : getpid;
    import std.stdio : writeln;
    if(getpid() == 1) {
        writeln(*cast(char*)&t);
    }
}

/// ditto
auto mean(R, T)(R r, T seed)
{
    import std.algorithm.iteration : sum, reduce;
    static if (hasLength!R)
    {
        if (r.length == 1)
            return r.front;

        return sum(r, seed) / r.length;
    }
    else
    {
        import std.typecons : tuple;

        auto pair = reduce!((a, b) => tuple(a[0] + 1, a[1] + b))
            (tuple(size_t(0), seed), r);

        return pair[1] / pair[0];
    }
}

void main() {
    import std.datetime;
    import std.stdio;
    import std.array;
    import std.conv;
    import std.random;
    auto arr = iota(100_000).array;
    auto bench = benchmark!(
        { doNotOptimizeAway({
            return arr.mean(real(0));
        }());},
        { doNotOptimizeAway({
            return arr.mean(double(0));
        }());},
        { doNotOptimizeAway({
            return arr.filter!(a => a >= 0).mean(real(0));
        }());},
        { doNotOptimizeAway({
            return arr.filter!(a => a >= 0).mean(double(0));
        }());},
    )(50_000);
    string[] names = ["mean (real)", "mean (double)", "mean (real, InputRange)", "mean (double, InputRange)"];

    foreach(j,r;bench)
        writefln("%-15s = %s", names[j], r.to!Duration);
}
> ldc -release -O3 -mcpu=native -boundscheck=off test.d && ./test
mean (real)     = 5 secs, 185 ms, 845 μs, and 3 hnsecs
mean (double)   = 3 secs, 869 ms, 445 μs, and 9 hnsecs
mean (real, InputRange) = 6 secs, 109 ms, 83 μs, and 8 hnsecs
mean (double, InputRange) = 5 secs, 940 ms, 972 μs, and 1 hnsec

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I assume that if you want statistical values, accuracy will be more valuable than speed.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I assume that if you want statistical values, accuracy will be more valuable than speed.

FWIW with the test provided below double yields exactly the same inaccuracy (25002500002500001792.000000).
The point was that always picking real is a severe performance penalty with almost no neglectable accuracy improvements, so the user should at least have the choice.
Also there's another, very valid point about allowing double: it's consistent on all platforms. I purposefully ignored real on my Tinflex random project last year because the algorithm was very sensitive to small change (more details about FP behavior with dmd or as a handy list of FP issues to keep in mind).

std/numeric.d Outdated
BigInt("1"), BigInt("2"), BigInt("3"), BigInt("6")
]);
assert(bigint_arr.mean(BigInt(0)) == BigInt("3"));
assert(bigint_arr2.mean(BigInt(0)) == BigInt("3"));
Copy link
Contributor

@wilzbach wilzbach Jul 9, 2017

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Could we test mean on an empty array of a user-defined typed (e.g. BigInt). Thanks!

edit: has been fixed.

std/numeric.d Outdated
}

static if (hasLength!R)
{
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

As mentioned below, it's more precise to do division on every step, but of course a bit slower.

std/numeric.d Outdated
return r.front;
}

return sum(r, seed) / r.length;
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@JackStouffer Since std.las never materialized

FYI there's mir.math.sum: https://github.com/libmir/mir-algorithm/blob/master/source/mir/math/sum.d
It's well tested and highly optimized.

Copy link
Contributor Author

@JackStouffer JackStouffer Jul 9, 2017

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If anyone is willing to port that to Phobos, I'd be willing to use it :P

@JackStouffer
Copy link
Contributor Author

JackStouffer commented Jul 9, 2017

@wilzbach I tried your algorithm and I'm still getting the same result for the test case. What do I need to change?

auto mean(R, T)(R r, T seed)
{
    T meanRes = seed;
    size_t i = 0;

    foreach (e; r) {
        immutable delta = (e - meanRes);
        meanRes += delta / ++i;
    }

    return meanRes;
}

auto arr = [10e19 + 4, 7, 10e9 + 13, 10e15 + 16];
writefln("Online: %f", arr.mean(real(0.0))); // 25002500002500001792.000000

@wilzbach
Copy link
Contributor

wilzbach commented Jul 9, 2017

@wilzbach I tried your algorithm and I'm still getting the same result for the test case. What do I need to change?

Looks like your seed is double. Try this:

void main()
{
    import std.stdio;
    auto arr = [10e19 + 4, 7, 10e9 + 13, 10e15 + 16];
    writefln("Online: %f", arr.mean(double(0)));
    writefln("Online: %f", arr.mean(real(0)));
}

Run it online

@JackStouffer
Copy link
Contributor Author

Gah! Seems like this needs to be really re-worked.

@JackStouffer
Copy link
Contributor Author

@wilzbach I think addressed most of the issues.

@JackStouffer JackStouffer force-pushed the mean branch 2 times, most recently from 95f96bd to 55b44d3 Compare July 10, 2017 19:10
@wilzbach
Copy link
Contributor

Behavior for empty input

I thought a bit and returning NaN is definitely reasonable and what literally all other statistical languages do:

Language Command Results
Python np.mean(list()) nan
Julia mean(Array{Float64, 1}()) or mean(Array{Int64, 1}()) NaN
R mean(double()) or mean(int()) NaN
Matlab / Octave mean([]) NaN

Copy link
Contributor

@wilzbach wilzbach left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks a lot better now. Thanks for addressing my concerns!
I found some typos + had an idea for a good tests, but I think the only blocker left is the approval of the name addition.

Lastly I am still not really happy about the different behavior between an empty array of type long and a user-subtype of long. I am aware that the user will be warned by a compile error as the seed parameter is missing, but it's still not a nice behavior.
AFAICT the second overload solely exists for arbitrary-precision types like BigInt and there aren't many precedent of math* or numeric caring about BigInt. gcd is the only one I know of.

std/numeric.d Outdated
is used. The default result type is `real`, which is more accurate,
but far slower than `double` on many architectures.

For user defined types, element by element summation is used. Additionally
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nit: user-defined is the common spelling here

std/numeric.d Outdated
The first overload of this function will return `T.init` if the range
is empty. However, the second overload will return `seed` on empty ranges.

This function is $(BIGOH r.length).
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

is in

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

big O notations are adjectives, not nouns.

std/numeric.d Outdated
The mean of `r` when `r` is non-empty.
*/
T mean(T = real, R)(R r)
if (isInputRange!R &&
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nit: The if constraint should be on the same indent level as the declaration.
See also: it's part of the DStyle and will be automated soon.

std/numeric.d Outdated
BigInt[] bigint_arr3 = [];
assert(bigint_arr3.mean(BigInt(0)) == BigInt(0));
}

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

One last idea: testing the behavior with subtypes:

struct MyFancyDouble
{
   double v;
   alias v this;
}

(they currently use the second overload, however, intuitively as a user I would expect the first one).

Copy link
Contributor Author

@JackStouffer JackStouffer Jul 10, 2017

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

(they currently use the second overload, however, intuitively as a user I would expect the first one).

Yeah, well that's the problem with alias this and templates. alias this works great with function overloads, but with templates you have to either explicitly cast or explicitly define the template types a la the hacks in std.file

std/numeric.d Outdated
// inaccurate for integer division, which the user defined
// types might be representing
auto pair = reduce!((a, b) => tuple(a[0] + 1, a[1] + b))
(tuple(size_t(0), seed), r);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hmm AFAICT this only makes sense for BigInt (or similar types with arbitrary precision), for all other user-defined or subtyped types using real and the first overload makes more sense...

Also there's no need to save and compute the length is hasLength is available. Another lambda comes with a cost.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hmm AFAICT this only makes sense for BigInt (or similar types with arbitrary precision), for all other user-defined or subtyped types using real and the first overload makes more sense...

The problem is there's no way to know if the type doesn't use alias this. And if it does use alias this, then they can cast to real in order to get the other overload.

Also there's no need to save and compute the length is hasLength is available. Another lambda comes with a cost.

ok

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The problem is there's no way to know if the type doesn't use alias this. And if it does use alias this, then they can cast to real in order to get the other overload.

There's and that was my entire point:

import std.bigint;
import std.stdio;

void main(string[] args)
{
    struct MyFancyDouble
    {
        double v;
        alias v this;
    }
    pragma(msg, typeof(MyFancyDouble.init / size_t(0)));
    pragma(msg, typeof(BigInt.init / size_t(0)));
}

-> https://is.gd/rRbSaP

Or in other words: if the division by size_t yields a FloatingPoint type -> algorithm from overload 1 should be used.

@JackStouffer
Copy link
Contributor Author

AFAICT the second overload solely exists for arbitrary-precision types like BigInt and there aren't many precedent of math* or numeric caring about BigInt. gcd is the only one I know of.

The seed parameter on sum exists for the same purpose.

@JackStouffer JackStouffer force-pushed the mean branch 2 times, most recently from c7e18f9 to 718017e Compare July 10, 2017 22:12
@wilzbach
Copy link
Contributor

The seed parameter on sum exists for the same purpose.

Yet another reason for putting it into std.algorithm.iteration ;-)
Also the std.algorithm.iteration.sum of an empty array is always zero, no matter whether the element type is int, double or BigInt.

@JackStouffer
Copy link
Contributor Author

Ping @wilzbach @CyberShadow

Moved it back to std.algorithm as that makes the most sense. Let's get a final decision on this and get it out of the queue.

@CyberShadow
Copy link
Member

Ping @wilzbach @CyberShadow

Thanks!

I think the API is looking much better than the initial iteration. Still, math stuff is really not my area of expertise (I was off on a limb reviewing this in the first place), so deferring this to std.algorithm code owners.

Returns:
The mean of `r` when `r` is non-empty.
*/
T mean(T = real, R)(R r)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We should move away from the use of real as a default and treat it as a special interest type. Please use double here.

if (isInputRange!R &&
isNumeric!(ElementType!R) &&
!isInfinite!R &&
is(Unqual!(T) == T))
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why is the unqual needed?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

superfluous parens, use Unqual!T


// Knuth & Welford mean calculation
// division per element is slower, but more accurate
foreach (e; r)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Avoid foreach, use explicit loops

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Huh? We're moving away from foreach?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

it's a convenience engine and may copy stuff etc. Just use a loop and use r.front only once inside the function.

// division per element is slower, but more accurate
foreach (e; r)
{
immutable T delta = (e - meanRes);
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

spurious parens

foreach (e; r)
{
immutable T delta = (e - meanRes);
meanRes += delta / ++i;
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

better use meanRes += delta / i++; initializing i to 1

auto mean(R, T)(R r, T seed)
if (isInputRange!R &&
!isNumeric!(ElementType!R) &&
is(typeof(r.front + seed)) &&
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

the type of + should be numeric

if (isInputRange!R &&
!isNumeric!(ElementType!R) &&
is(typeof(r.front + seed)) &&
is(typeof(r.front / size_t(1))) &&
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

should be numeric


if (len > 0)
return sum(r, seed) / len;
else
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

superfluous

// types might be representing
static if (hasLength!R)
{
immutable len = r.length;
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

that's overdoing it - just use r.length

@JackStouffer
Copy link
Contributor Author

the type of + should be numeric

In the BigInt case this isn't true, because the type of the addition is BigInt and isNumeric!(BigInt) == false

@JackStouffer
Copy link
Contributor Author

Thanks everyone for getting this through!

@JackStouffer JackStouffer deleted the mean branch November 22, 2017 04:19
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Projects

None yet

Development

Successfully merging this pull request may close these issues.

6 participants