Skip to content
Merged
Show file tree
Hide file tree
Changes from all commits
Commits
File filter

Filter by extension

Filter by extension

Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
5 changes: 3 additions & 2 deletions doc/Makefile
Original file line number Diff line number Diff line change
Expand Up @@ -26,15 +26,16 @@ ARTWORK_DIR=$(DOC_SOURCE_DIR)/artwork
# xy/zz is in variable PACKAGE_xy_zz. This allows automation in iterating
# packages and their modules.
MIR_PACKAGES = mir $(addprefix mir/,\
blas combinatorics ndslice \
model/lda \
blas combinatorics model/lda ndslice \
random \
sparse sparse/blas)

PACKAGE_mir = math sum
PACKAGE_mir_blas = package dot gemm gemv
PACKAGE_mir_combinatorics = package
PACKAGE_mir_model_lda = hoffman
PACKAGE_mir_ndslice = package iteration selection slice
PACKAGE_mir_random = discrete
PACKAGE_mir_sparse = package
PACKAGE_mir_sparse_blas = package axpy dot gemm gemv

Expand Down
123 changes: 123 additions & 0 deletions source/mir/random/discrete.d
Original file line number Diff line number Diff line change
@@ -0,0 +1,123 @@
/**
Discrete distributions.

License: $(LINK2 http://boost.org/LICENSE_1_0.txt, Boost License 1.0).

Authors: Ilya Yaroshenko, Sebastian Wilzbach
*/

module mir.random.discrete;

import std.traits : isNumeric;

/**
Setup a _discrete distribution sampler.
Given an array of cumulative density points `cdPoints`,
a _discrete distribution is defined.
The cumulative density points can be calculated from probabilities or counts
using `std.algorithm.iteration.cumulativeFold`. Hence `cdPoints` is assumed to be sorted.

Params:
cdPoints = cumulative density points

Returns:
A $(LREF Discrete) sampler that can be called to sample from the distribution.
*/
Discrete!T discrete(T)(const(T)[] cdPoints)
{
return Discrete!T(cdPoints);
}

///
unittest
{
// 10%, 20%, 20%, 40%, 10%
auto cdPoints = [0.1, 0.3, 0.5, 0.9, 1];
auto ds = discrete(cdPoints);

// sample from the discrete distribution
auto obs = new uint[cdPoints.length];
foreach (i; 0..1000)
obs[ds()]++;
}

/**
_Discrete distribution sampler that draws random values from a _discrete
distribution given an array of the respective cumulative density points.
`cdPoints` is an array of the cummulative sum over the probabilities
or counts of a _discrete distribution without a starting zero.

Complexity: O(log n) where n is the number of `cdPoints`.
*/
struct Discrete(T)
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Did you take at all a look at hap.random's DiscreteDistribution when creating this?

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

no

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Any reason why not? There is no licensing issue, and I'm happy for that code to be re-used here.

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I just did not know about it

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Well, if you want to take a look: https://github.com/WebDrake/hap/blob/e25fc15a17e4e439333740d2a9894fc7d3c94012/source/hap/random/distribution.d#L232

Note there are things that won't be useful here (it's a class, it's a forward range...) but it should be straightforward to take most of the good bits and use them with the functor-based approach here.

if (isNumeric!T)
{
import std.range : SortedRange;

private const(T)[] _cdPoints;
private SortedRange!(const(T)[], "a <= b") r;
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't understand the need for the 2 entries here. _cdPoints is barely used and you have all the data you need in r.

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@wilzbach fix this please. SortedRange should have something like release

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

BTW, two style notes:

  • identifiers with a leading underscore _likeThis are IIRC reserved for the either the language or the runtime;
  • try to always give a meaningful name to all internal variables, so the person reading the code can understand their purpose. r is a meaningless name and the parameter isn't even commented -- but if you give it a meaningful name, the lack of comment documentation may not matter.

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nothing else, except keywords and version identifiers is reserved by the langauge (furthermore we have a module system, to prevent name conflicts with symbols from druntime). I'm not a fan of _likeThis names either, but they're used in many places in phobos for private members. Probably part of the reason was that there were bugs in the module system, though they should all be fixed by now.

Otherwise +1 for meaningful names.

Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@wilzbach fix this please. SortedRange should have something like release

release uses move, so that won't work? We can just remove cdPoints from the public interface?

https://github.com/dlang/phobos/blob/master/std/range/package.d#L8023

Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

identifiers with a leading underscore _likeThis are IIRC reserved for the either the language or the runtime;

Isn't that a common style to denote that's a private variables. In this case the leading underscore is used to distinguish between the private variable and public getter.

try to always give a meaningful name to all internal variables, so the person reading the code can understand their purpose. r is a meaningless name and the parameter isn't even commented -- but if you give it a meaningful name, the lack of comment documentation may not matter.

Fair point!

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nothing else, except keywords and version identifiers is reserved by the langauge (furthermore we have a module system, to prevent name conflicts with symbols from druntime)

I believe it's still considered preferable to use likeThis_ for library code ... ?

I recognize that leading-underscore names are widely used in phobos, but it's still my understanding that this is technically a style violation.

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ah, I see the current style guide has: Also, names do not begin with an underscore ‘_’ unless they are private. Ignore that remark, then.

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

release uses move, so that won't work? We can just remove cdPoints from the public interface?

https://github.com/dlang/phobos/blob/master/std/range/package.d#L8023

OK, then use assumeSorted instead. And make Phobos PR to replace debug with debug(std_range). Remember also about to methods that should be static/alias

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

OK, then use assumeSorted instead.

This feels like using an inferior implementation for the sake of a public method that isn't strictly necessary for what Discrete does.

If the cdPoints method is really, really wanted, is it problematic to just return const(SortedRange) ... ? That should serve all the debug use-cases.


/**
The cumulative density points `cdPoints` are assumed to be sorted and given
without a starting zero. They can be calculated with
`std.algorithm.iteration.cumulativeFold` from probabilities or counts.

Params:
cdPoints = cumulative density points
*/
this(const(T)[] cdPoints)
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Any reason not to give the constructor the relative proportions for the different discrete element selection probabilities (as is done with dice) and let the constructor do the work of putting together the sorted range of cumulative values?

I assume one motivation might be to not have the constructor do any memory allocation, but you can get round that by having a second constructor that accepts a SortedRangeas input (not being too specific about its type; it probably suffices that it's an input range, but is almost certainly safer that it be a forward range).

Copy link
Copy Markdown
Member

@9il 9il Jun 26, 2016

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

meaty code here is one line. This looks like bad style to have big API for simple things.

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

meaty code here is one line. This looks like bad style to have big API for simple things.

I'm not sure I follow what you mean, can you clarify?

Copy link
Copy Markdown
Member

@9il 9il Jun 26, 2016

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I mean that having huge API for such simple thing like this distribution is too much efforts. API review needs time from a user / reviewer.

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Right, but absent any other implementation I would suggest that the Discrete API match that for dice (its function counterpart), which expects to be given individual proportions, not cumulative ones.

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We are going to implement dramatically better FP generators. Description is too large and will be published in article. But all this will be after Tinflex.

Copy link
Copy Markdown
Member

@9il 9il Jun 26, 2016

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Not the point -- designing the API this way it will complicating the use of Discrete as a drop-in replacement for existing phobos dice (and also throw on the end user the responsibility of summing proportions correctly).

  1. dice support is not the goal.
  2. This is one/two UFCS calls

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We are going to implement dramatically better FP generators. Description is too large and will be published in article.

Cool, looking forward to seeing that work.

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

dice support is not the goal.

But it's nice to give people something they can use as an easy replacement.

This is one/two UFCS calls

If it's one/two UFCS calls, why not provide it here via an alternative constructor?

(Actually I doubt it's necessarily quite so straightforward when it comes to details of doing sums of integral or floating-point numbers correctly, but that's a longer discussion...)

Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If it's one/two UFCS calls, why not provide it here via an alternative constructor?
(Actually I doubt it's necessarily quite so straightforward when it comes to details of doing sums of integral or floating-point numbers correctly, but that's a longer discussion...)

I agree that such an interface would be nice - even in Tinflex we have relative counts and need to sum over them.

{
_cdPoints = cdPoints;
r = typeof(r)(_cdPoints);
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This just wraps _cdPoints, right -- it doesn't copy it?

Note that the constness of the constructor input provides poor guarantees. You can't be certain that someone will not subsequently modify the points and mess up the assumptions of the sorted property (it's only the SortedRange constructor that checks for sortedness).

}

/// cumulative density points
const(T)[] cdPoints() const @property
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is there any need to define this method at all?

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

just more control for user

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hmmm, if the user needs to care, presumably they know what they provided as input ... ? I would suggest better to not provide this method unless someone has an explicit use-case.

I also don't think it's a good idea to constrain the choice of cumulative density points to const(T)[].

Copy link
Copy Markdown
Member

@9il 9il Jun 26, 2016

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

A user may accept this struct as function argument. Then he want to print intenal stuff for debug without changing the API.

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I would accept it wrapped in a debug block, it's having it as part of the regular public API that I'm not happy with.

Copy link
Copy Markdown
Member

@9il 9il Jun 26, 2016

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

ok
@wilzbach, lets remove this. anyway we are using assumeSorted now

{
return _cdPoints;
}

/// samples a value from the discrete distribution
size_t opCall() const
{
import std.random : rndGen;
return opCall(rndGen);
}

/// samples a value from the discrete distribution using a custom random generator
size_t opCall(RNG)(ref RNG gen) const
{
import std.random : uniform;
T v = uniform!("[)", T, T)(0, _cdPoints[$-1], gen);
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Note that the .back property of the SortedRange instance would work just as well as _cdPoints[$-1].

Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Note that the .back property of the SortedRange instance would work

... if it would be const ?

dlang/phobos#4478

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hmmm, to me this feels like working around the lack of guarantees from upstream. Wouldn't it be safer to hold off declaring opCall const until upstream is fixed?

return (cast(SortedRange!(const(T)[], "a <= b")) r).lowerBound(v).length;
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can't you just use assumeSorted here ... ?

Copy link
Copy Markdown
Member

@9il 9il Jun 26, 2016

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

SortedRange constructor always check that range is sorted in debug mode. This will affect complexity

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

... actually, isn't r already a sorted range? Why the cast at all?

Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why the cast at all?

because the internal variable r has type const SortedRange!(const(T)[], "a <= b") - I didn't know another way to get rid of the const.

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I would repeat my suggestion that you look at hap.random's DiscreteDistribution -- I did not need any casts under the hood:
https://github.com/WebDrake/hap/blob/e25fc15a17e4e439333740d2a9894fc7d3c94012/source/hap/random/distribution.d#L323-L329

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

opCall is const

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Right, but that would mean you can't use lowerBound with a const SortedRange here, right? Surely that should be fixed in SortedRange, rather than covering up the non-const-ness with a cast?

Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Surely that should be fixed in SortedRange, rather than covering up the non-const-ness with a cast?

Yes I will try to fix Phobos, but that won't be released before 2.073 :/

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes I will try to fix Phobos, but that won't be released before 2.073 :/

You have not time to fix Phobos templates, please consider on related issues

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm not sure that it's appropriate to prioritize putting const on opCall above having clean and easy-to-follow code. Things like the cast and the two different internal variables for the cumulative distribution points create unnecessary complication in the code, and the cast in any case risks breaking the const guarantee (certainly if e.g. at some point you generalize the type which can be provided as input to the constructor).

This is a case where the fix needs to be upstream, and where the guarantee wanted here can easily be added after that fix is complete.

}
}

unittest
{
import std.random : Random;
auto rndGen = Random(42);
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I would advise not shadowing the name of the standard random generator instance defined in std.random, even though you don't import it. Just call this gen.


// 10%, 20%, 20%, 40%, 10%
auto cdPoints = [0.1, 0.3, 0.5, 0.9, 1];
auto ds = discrete(cdPoints);

auto obs = new uint[cdPoints.length];
foreach (i; 0..1000)
obs[ds(rndGen)]++;
}

unittest
{
import std.random : Random;
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Random alias may change

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Oh, this dose not matter for this tests, nevermind

auto rndGen = Random(42);

// 1, 2, 1
auto cdPoints = [1, 3, 4];
auto ds = discrete(cdPoints);
assert(ds.cdPoints == cdPoints);

auto obs = new uint[cdPoints.length];
foreach (i; 0..1000)
obs[ds(rndGen)]++;
}