Skip to content

Comments

Add Mean and Median to std.algorithm#3592

Closed
JackStouffer wants to merge 2 commits intodlang:masterfrom
JackStouffer:stats
Closed

Add Mean and Median to std.algorithm#3592
JackStouffer wants to merge 2 commits intodlang:masterfrom
JackStouffer:stats

Conversation

@JackStouffer
Copy link
Contributor

This covers issue 14034.

I couldn't find a way to make a median function for input ranges that didn't use the GC, so I just left it to random access ranges.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This function does not find the mean lazily.

@JackStouffer
Copy link
Contributor Author

Fixed mentioned issues and rebased.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

There's no check in the sig constraints for the range elements to be numeric types, yet the function freely applies + and 0.0 to the elements.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I.e., the sig constraints should at least check that range elements are convertible to floating point (which precision, though?), otherwise + and 0.0 will cause a compile error.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Would

isImplicitlyConvertible!(ElementType!R, float) || isImplicitlyConvertible!(ElementType!R, double)

work for that?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

That's a lot of typing (and reading). What about just straight-up:

is(ElementType!R : double)

?

For simplicity, I'd just stick with double to minimize precision errors. But if you really wanted to, you could parametrize the floating type and use isFloatingPoint!F in the sig constraints to enforce it, then the user can specify the type to use for computing the average.

@quickfur
Copy link
Member

BTW, it is possible to implement median for (sorted) forward ranges without allocating. If length is defined, you just drop the first r.length/2 elements of the range and compute the median.

If length is not defined, save the forward range at the beginning, then iterate over the original range and the saved range at half the speed, i.e., for every 2 elements popped from the original range, pop 1 element from the saved range. When the original range is exhausted, the saved range will be either on the median element, or on the element right before the median, and can therefore be computed easily.

For bidirectional ranges, you can just call popFront and popBack until the range is down to just 1 or 2 elements. (If you save the values of the last front and back elements, you can just pop both ends until empty then return their average.)

Note, though, that for anything other than a random access range with length, the complexity of median will be O(n). This should be, of course, documented clearly. Also, in all cases median should take a SortedRange, so that the onus is on the caller to sort his data (or ensure it arrives in sorted order) before feeding it to median.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hate to be nitpicking on this again, but Phobos style guide requires opening braces to be on a new line.

@JackStouffer
Copy link
Contributor Author

@quickfur Addressed all concerns and rebased. Please see my line note though.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

As it states in the comment, I don't get this.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Because popFront does not return anything. Don't you mean r.front instead?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, your right.

@burner
Copy link
Member

burner commented Aug 31, 2015

as the median is only a fancy word for the 0.5 quantil why not add a generic quantil functions like:

auto quantil(R)(R r, double q) { // q <= 0.0 && q <= 1.0

?

@JackStouffer
Copy link
Contributor Author

And then add a convenience function for the median? Yeah that sounds reasonable.

@burner
Copy link
Member

burner commented Aug 31, 2015

yes

auto median(R)(R r) { return quantil(r, 0.5); }

@JackStouffer
Copy link
Contributor Author

Changed code to match @burner's requests.

I'm not happy with the cast's everywhere, but I don't think there is another way to do it

Fixed

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Are you sure you should be using casts here? What about using std.math.round or std.math.trunc?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Fixed

@JackStouffer
Copy link
Contributor Author

@John-Colvin can I get your opinion on this, seeing as you're the unofficial D statistics guy.

@JackStouffer
Copy link
Contributor Author

Ok, now that's three people who have given this PR an ok. Ready to merge.

@JackStouffer
Copy link
Contributor Author

Ping @DmitryOlshansky @JakobOvrum @quickfur @schveiguy

It has been 11 days since this has been ready to merge (6 if you count very minor doc updates). As I said before I have another PR waiting on this, it would be great to get this merged soon.

@JakobOvrum
Copy link
Contributor

Overall it LGTM, it's just that I'm not really qualified to spot issues in the quantile algorithm.

@JackStouffer
Copy link
Contributor Author

@JakobOvrum Yes, but burner and John-Colvin, who seem confident in their knowledge of these algorithms, gave it an ok.

@John-Colvin
Copy link
Contributor

@JackStouffer That's not really a fair representation of what I've said. To be clear: I think mean is ok. I haven't reviewed quantile but will do so now.

@JackStouffer
Copy link
Contributor Author

@John-Colvin Sorry, I took your silence as acceptance.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I really don't like that assert, it is the same as assert(!e.empty)

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

it is the same as assert(!e.empty)

Does that mean I should move it to the body? Or that you don't like its existence?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

no return NaN

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

no return NaN

Are you sure this function should do that? Because having a q that's less than zero or greater than one makes no sense what so ever, and I feel that accepting it anyway without error is wrong.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

and create a function

bool isValidQuantil(real q);

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What? Why? This is the type of thing that contracts are made for. Why not use them?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

because the compiler can remove them, without you knowing.
and this way it is more composable

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

because the compiler can remove them, without you knowing.

And that's fine, because they'll be in release mode which must be explicitly activated and they should have run it with contracts on at least once while developing. This seems to be perfectly acceptable for a lot of the functions in other modules, especially std.range which makes heavy use of contracts.

@burner
Copy link
Member

burner commented Sep 20, 2015

@JackStouffer on a meta note.

I think you pushing to hard. D is a "open source/no one is paid to work" project. Asking people to do work for you drives people away, and they do less. This comes from personal experience.

What makes me really itchy though is:

Ok, now that's three people who have given this PR an ok. Ready to merge.
when ctrl-f "LGTM" only today gives me two people actually saving LGTM. when D has as signed-off-by then it is LGTM.

This functionally will get in eventually, but not as fast as you like

@JackStouffer
Copy link
Contributor Author

Sorry, you're right.

I was frustrated that people had given an ok but it was just sitting here. I'm happy to fix things if people tell me they're broken, I just didn't want to see this die on the vine.

@burner
Copy link
Member

burner commented Sep 20, 2015

it will not die. just use more of the number one computer science skill:
frustration tolerance

@John-Colvin
Copy link
Contributor

Quantiles from data aren't really a trivial, done-and-dusted thing. See https://stat.ethz.ch/R-manual/R-devel/library/stats/html/quantile.html for a serious quantile function and the wikipedia article for an example of a simple one.

A simple quantile, like is shown in the wikipedia example can be done something like this: auto idx = size_t(round(r.length * q)); r.topN(idx); return i[idx]; (although obviously you wouldn't want to do that sort for every call to quantile, I just wanted to illustrate how easy it can be).

Overall, I'm not sure something like quantile belongs in std.algorithm, especially if it is going to do any interpolation (i.e. anything other than "choose the nearest index"). It's too much messy statistics decisions, not enough generic manipulation of arbitrary data. If your not doing any interpolation though, it's no longer really much of an algorithm, it's just "read the value at the index closest to q*length, assuming sortedness".

Could I suggest that if you're interested in doing a more comprehensive treatment of quantiles, you could make a pull request to https://github.com/DlangScience/dstats with it?

mean is a safer proposition, but it'd be great to hear from @andralex about whether he sees this as within the remit of std.algorithm or whether it should go elsewhere (e.g. std.math?). The thing is, it's not very generic, seeing as it only works (by definition), with types that emulate real numbers.

@JackStouffer
Copy link
Contributor Author

Quantiles from data aren't really a trivial, done-and-dusted thing. See https://stat.ethz.ch/R-manual/R-devel/library/stats/html/quantile.html for a serious quantile function and the wikipedia article for an example of a simple one.

Thanks for the link.

The thing is, it's not very generic, seeing as it only works (by definition), with types that emulate real numbers.

I'll make the functions more generic by allowing any user defined type that defines + and / and test with BigInt. But that begs the question of what should it return. As stated before, this makes no sense

assert([BigInt("1"), BigInt("2"), BigInt("3"), BigInt("4")].quantile(.5) == BigInt("2"));

but it'd be great to hear from @andralex about whether he sees this as within the remit of std.algorithm or whether it should go elsewhere (e.g. std.math?).

I don't think std.math is a good fit, as there are no other range based functions in there. I think the only other module that fits is std.numeric, but that seems to be an attempt at a strait one to one implementation of the STL.

Could I suggest that if you're interested in doing a more comprehensive treatment of quantiles, you could make a pull request to https://github.com/DlangScience/dstats with it?

I keep quantile in there for now, until more people in this PR tell me it's no good for Phobos. If that's true, I'll open a PR on that repo.

@JackStouffer
Copy link
Contributor Author

After doing more research, I have concluded that the quantile is not nearly complex enough to be useful and requires a much more through look. So, I removed it from this PR, I added the mean overloads back in, and I added support for user defined types to both functions that don't implicitly convert to real.

I'll open a WIP PR on dstats and if it's good it can always be added back into phobos later.

@quickfur
Copy link
Member

I think mean is ready to go (has been ready for a while now). Perhaps the thing to do here is to split this PR into one for mean and one for median/quantile/etc.. The first one can probably be merged already, it's the second one that's been holding this up.

Generally, it's a good idea to split things up into multiple PRs when there are various pieces that don't directly depend on each other, otherwise perfectly good changes will get held up unnecessarily just because of another, mostly-unrelated piece.

@JackStouffer
Copy link
Contributor Author

@quickfur ok, but before I do that there was one thing that I have been thinking about with this PR.

I also want to add various other statistical methods to Phobos, such as standard deviation, which fit in std.algorithm, and things such as rolling mean, which don't fit into std.algorithm. I was thinking that something like a rolling mean would best fit in std.numeric, but, would it be best if all of the statistical methods were in the same module, or is it ok to have them separate?

@John-Colvin
Copy link
Contributor

@JackStouffer this is relevant to discussions about rolling means etc. #2991

@JackStouffer
Copy link
Contributor Author

@John-Colvin looks interesting. I'll wait on the rolling mean/median stuff until a decision is reached on that PR. In the mean (ha) time, mean and median need to have a definite home.

@JackStouffer
Copy link
Contributor Author

#3679
#3680

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

8 participants