Make Uniform<f32/f64> and gen_range<f32/f64> honor limits strictly. Resolves #476, #477
Conversation
|
I guess travis answered the last paragraph... |
|
Cool. Seems like that takes care of the no_std problems. |
|
I understand your motivation, but I am not sure I agree with the method. It is not impossible to make a half-open range exact, i.e. to guarantee That could be faster for many samples, but slower for few or only one. Can you measure one thing: what do the range benchmarks do with the extra loop? |
|
This feels premature to me since (a) it is difficult in general to ensure strict limits with FP, thus often better to design algorithms to accommodate (small) deviations, and (b) not all users care about strict compliance anyway. That said if there is very little (performance/maintenance) overhead then this may be reasonable, but given all the extra code I doubt that is the case. |
|
Given the low performance cost here I think it's worth providing the guarantee that the generated values are within the requested range. I've implemented @pitdicker suggestion to adjust I'll pull numbers to see if this is visible in our benchmarks. |
|
So I ran some performance numbers. At first Then I added Also confused why some builds seem unable to find the |
|
Sorry, I've been a bit slow this weekend. Will have a better look tomorrow. We had trouble with confusing benchmarks in combination with The best thing to do then is to try again with After: So the difference in both benchmarks is 0.412 ns.
I find it strange also, but would not worry about it. If you rebase on master you'll see all the |
|
Yeah, using Before: After: So a 32% slowdown for But for the more performance-optimized API of creating a Definitely more slowdown than desired for |
|
As a test, I tried adding branch-predictor hints to the Which is a 23% slowdown for |
```rust
offset: offset,
assert!(low.is_finite() && high.is_finite(), "Uniform::new called with non-finite boundaries");

let mut scale = high - low;
```
Can overflow to infinity even if both inputs are finite. Skip the above check (you already know neither value is a NaN) and just assert `scale < INFINITY`.
Also means you can simplify your Float trait.
I intentionally didn't want to panic when `high` and `low` are both finite numbers, even if `high - low` results in a non-finite number. Mainly because it felt like a lot to ask for users to know when that difference might overflow into infinity. But also a little because depending on the difference being expressible as an `$ty` is technically an implementation detail; another implementation of this function might not.
Either way, the code below deals fine with `scale` becoming infinite here.
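The overflow case under discussion is easy to reproduce. A minimal standalone check (not part of the PR) showing two finite `f32` bounds whose difference rounds to infinity:

```rust
fn main() {
    let low: f32 = f32::MIN;  // most negative finite f32
    let high: f32 = f32::MAX; // largest finite f32
    assert!(low.is_finite() && high.is_finite());

    // The true difference is 2 * f32::MAX, which exceeds the finite range,
    // so the subtraction rounds to positive infinity.
    let scale = high - low;
    assert!(scale.is_infinite());
}
```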
src/distributions/uniform.rs
Outdated
```rust
let mut scale = high - low;

let max_rand = (::core::$uty::MAX >> $bits_to_discard)
    .into_float_with_exponent(0) - 1.0;
```
I believe this is 1-ε/2 (i.e. the largest representable number < 1), but the name makes sense.
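For the curious, the value can be computed in isolation (mirroring the macro for `f32`, where `$bits_to_discard` is 9; the exact relation to ε depends on which ε convention one uses — with Rust's `f32::EPSILON = 2^-23` this works out to `1.0 - f32::EPSILON`):

```rust
fn main() {
    // Mirror the macro's computation for f32: 23 fraction bits survive,
    // spliced with the biased exponent for the [1, 2) binade.
    let bits_to_discard = 9; // 32 bits - 23 fraction bits
    let fraction = u32::MAX >> bits_to_discard;
    let exponent_zero = 127u32 << 23; // biased exponent for 2^0
    let value1_2 = f32::from_bits(fraction | exponent_zero); // largest f32 below 2.0
    let max_rand = value1_2 - 1.0;
    assert!(max_rand < 1.0);
    assert_eq!(max_rand, 1.0 - f32::EPSILON); // with Rust's f32::EPSILON = 2^-23
}
```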
src/distributions/uniform.rs
Outdated
```rust
    scale = <$ty>::from_bits(scale.to_bits() - 1);
}

debug_assert!(scale >= 0.0);
```
If `high` is the next representable number above `low` then `scale` is effectively zero — although larger values than zero may still give the same rounding. Probably though we should accept zero.
Note that your `while` loop would take a long time to get to zero, since `high - low` may be much larger than the minimal representable value, though it likely doesn't matter since it should encounter a value of `scale` small enough that `low + scale` rounds to `low` much sooner.
Yeah, I'm banking on `scale * max_rand + low > high` only happening in situations with unfortunate rounding, so we should only need to make small adjustments to `scale` to get it within the acceptable range. Especially since `max_rand` is already < 1.
src/distributions/uniform.rs
Outdated
```rust
    .into_float_with_exponent(0) - 1.0;

while scale * max_rand + low >= high {
    scale = <$ty>::from_bits(scale.to_bits() - 1);
```
Is this always correct when the fraction part is zero, and does it handle sub-normals correctly? I think so. In any case this feels unsafe to me — better to have a single `next_smallest_value()` method in your `Float` trait.
Yeah, this will work as long as we don't cross the line from positive to negative numbers, including when it results in reducing the exponent, and in the sub-normal range.
IEEE numbers were designed specifically such that for non-NaN numbers you can check which of two numbers is bigger by comparing their binary representations as if they were integers. You have to filter out the sign bit first, but since we're dealing with positive numbers that won't be a problem.
It even works for infinity, which I am relying on here.
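The bit-ordering property described above can be spot-checked in isolation. None of this is PR code; it just illustrates that decrementing the bit pattern of a positive float steps to the next smaller value, across exponent boundaries, through the sub-normal range, and down from infinity:

```rust
fn main() {
    // Decrementing the bits of 2.0 gives the largest f32 below 2.0.
    let below_two = f32::from_bits(2.0f32.to_bits() - 1);
    assert!(below_two < 2.0 && below_two > 1.9999998);

    // Crossing an exponent boundary: just below 1.0.
    let below_one = f32::from_bits(1.0f32.to_bits() - 1);
    assert!(below_one < 1.0);

    // Stepping down from the smallest normal lands in the sub-normals.
    let below_min_normal = f32::from_bits(f32::MIN_POSITIVE.to_bits() - 1);
    assert!(below_min_normal > 0.0 && below_min_normal < f32::MIN_POSITIVE);

    // Stepping down from infinity gives the largest finite value.
    assert_eq!(f32::from_bits(f32::INFINITY.to_bits() - 1), f32::MAX);
}
```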
src/distributions/uniform.rs
Outdated
```rust
// as it could be, we'll still generate good values.
// But give it one try to increase our chances that
// Uniform::new_inclusive is actually inclusive of the high value.
let new_scale = <$ty>::from_bits(scale.to_bits() + 1);
```
I don't think there's sufficient justification for this code — it's extra complexity to slightly improve the range in a case where it already meets all required constraints.
I don't have a strong opinion. But without this code it seems like there is very little practical difference between new and new_inclusive. I.e. it's very unlikely that the high value can actually be generated. It's arguably still very unlikely, but at least this code gives it a try.
But again, I'm not super opinionated on this. I'm not sure what the use case is for inclusive ranges for floats, but obviously the API requires it, so we need to stick something in this function.
Are there many cases where this code will do something useful?
I'm not sure how to measure usefulness given that I don't know of use cases for inclusive ranges for floats.
Given that there was a comment in the code which pointed out that new and new_inclusive generated the same values, do we have any use cases in mind where that is a problem?
Another approach here could be to implement `new_inclusive` as:

```rust
let max_rand = (::core::$uty::MAX >> $bits_to_discard)
    .into_float_with_exponent(0) - 1.0;
let mut scale = (high - low) / max_rand;
// Adjust scale lower until the max value we can generate is `high`
while scale * max_rand + low > high {
    scale = <$ty>::from_bits(scale.to_bits() - 1);
}
debug_assert!(scale >= 0.0);
UniformFloat { low, scale }
```

I don't know if the `/ max_rand` actually makes a difference. But from a mathematical point of view it at least means that we increase the chance of being able to generate `high`.
I think `1.0 / max_rand > 1.0`, so it probably does do something. Can the compiler optimise to multiplication, or would it be better to write `(high - low) * (1.0 / max_rand)`?
The implementation is simpler at least — unless the current version has a clear advantage this looks preferable.
Let's use this approach rather than the one I'm using right now.
Is the loss of precision writing `(high - low) * (1.0 / max_rand)` really worth making the ctor faster? I.e. is ctor performance really a priority?
I don't know. Possibly not, but I think new_inclusive may get used for single-usage objects quite a bit. But anyway, we're not really losing precision since the while loop then reduces scale.
```rust
// Get a value in the range [0, 1) in order to avoid
// overflowing into infinity when multiplying with scale
let value0_1 = value1_2 - 1.0;
```
This loses a bit of precision everywhere just to handle scales close to what would round to infinity. I'm not sure it's worth it; we could instead assert that scale * 2 is finite in the constructor.
It shouldn't lose any precision. Subtracting 1.0 should not result in any rounding or loss of data. In fact, it looks to me like it should gain a tiny amount of precision since we're directly adding low rather than the calculated (and thus possibly slightly rounded) offset.
But it does lose performance.
Either way, not having this here is certainly possible, but I think it'll require a decent amount more code in the constructor. We'll have to worry about not just adjusting the scale, but also adjusting offset such that rounding can't result in values lower than low.
So it's a tradeoff of slower and more code complexity in the constructor, vs. saving around 8% in the sampling code.
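The exactness claim can be verified directly: by Sterbenz's lemma, `x - y` is computed exactly whenever `y/2 <= x <= 2y`, which holds for any `value1_2` in [1, 2) minus 1.0. A standalone check (the names mirror the discussion, not the actual PR code):

```rust
fn main() {
    // For any f64 in [1.0, 2.0), subtracting 1.0 is exact (Sterbenz's lemma),
    // so value1_2 - 1.0 loses no bits.
    for mantissa in [0u64, 1, 0x000F_FFFF_FFFF_FFFF, 0x0008_1234_5678_9ABC] {
        // Biased exponent 1023 selects the [1, 2) binade for f64.
        let value1_2 = f64::from_bits((1023u64 << 52) | mantissa);
        let value0_1 = value1_2 - 1.0;
        // Adding 1.0 back recovers the original exactly.
        assert_eq!(value0_1 + 1.0, value1_2);
    }
}
```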
Sorry, you're right about the precision.
I don't know what's the right answer regarding performance.
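To make the trade-off concrete, here is a standalone illustration (not PR code) of the failure mode the [0, 1) formulation avoids: a perfectly legal range whose precomputed `offset = low - scale` overflows, while `value0_1 * scale + low` stays finite and in range:

```rust
fn main() {
    let low = -f32::MAX / 2.0;
    let high = f32::MAX / 2.0;
    let scale = high - low;          // == f32::MAX, still finite
    assert!(scale.is_finite());

    // The offset-based formulation already overflows at construction time...
    let offset = low - scale;        // -1.5 * f32::MAX rounds to -infinity
    assert!(offset.is_infinite());

    // ...and value1_2 * scale overflows for raw samples close to 2.0:
    let value1_2 = 1.75f32;          // a plausible raw sample in [1, 2)
    assert!((value1_2 * scale).is_infinite());

    // The [0, 1) formulation stays finite and within the requested range.
    let value0_1 = value1_2 - 1.0;   // exactly 0.75
    let sample = value0_1 * scale + low;
    assert!(sample.is_finite() && sample >= low && sample < high);
}
```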
```rust
// Get a value in the range [0, 1) in order to avoid
// overflowing into infinity when multiplying with scale
let value0_1 = value1_2 - 1.0;
```
Again, you're dropping a bit of precision everywhere. So why not do `value1_2 * scale + offset` as before, then at the end of the loop check whether `res` is finite and panic if not?
I might be missing something, but I see this only costing performance, not any precision. But do let me know if I'm wrong?
In this case it should really cost very little performance though. While we need an additional subtraction here, we are also saving a subtraction since we're not calculating offset. Though for constant inputs, offset and scale can be computed at compile time. But arguably someone caring about performance and using constant ranges should create a Uniform.
Additionally, using the offset approach means that we have to check the resulting value both against high and against low, so the resulting code is slower. Here are the perf numbers I get:

Using `value1_2 * scale + offset`:

```
test gen_range_f32 ... bench: 4,835 ns/iter (+/- 857) = 827 MB/s
test gen_range_f64 ... bench: 6,896 ns/iter (+/- 641) = 1160 MB/s
```

Using `value0_1 * scale + low`:

```
test gen_range_f32 ... bench: 4,335 ns/iter (+/- 251) = 922 MB/s
test gen_range_f64 ... bench: 5,712 ns/iter (+/- 916) = 1400 MB/s
```
Force-pushed a4c6b65 to 8828cad
|
|
Here's an alternative approach for implementing However it seems hangs for really large values of I'm sure it can be fixed, but ultimately I don't think the complexity and precision loss is worth it. And the algorithm here has effectively unbounded performance cost, which is also a concern. So I think we should stick with the approach in the PR.

```rust
fn new(low: Self::X, high: Self::X) -> Self {
    assert!(low < high, "Uniform::new called with `low >= high`");
    assert!(low.is_finite() && high.is_finite(), "Uniform::new called with non-finite boundaries");
    fn adjust_down(val: $ty) -> $ty {
        if val > 0.0 {
            <$ty>::from_bits(val.to_bits() - 1)
        } else if val < 0.0 {
            <$ty>::from_bits(val.to_bits() + 1)
        } else {
            -<$ty>::from_bits(1)
        }
    }
    let max_rand = (::core::$uty::MAX >> $bits_to_discard)
        .into_float_with_exponent(0);
    let min_rand = (0 as $uty >> $bits_to_discard)
        .into_float_with_exponent(0);
    let mut low_adjusted = low;
    let mut high_adjusted = high;
    loop {
        let scale = high_adjusted - low_adjusted;
        let offset = low_adjusted - scale;
        if !(scale * max_rand + offset < high) {
            high_adjusted = adjust_down(high_adjusted);
        } else if !(scale * min_rand + offset >= low) {
            low_adjusted = -adjust_down(-low_adjusted);
        } else {
            return UniformFloat { scale, offset };
        }
    }
}
```
src/distributions/uniform.rs
Outdated
```rust
scale: scale,
offset: offset,
assert!(low <= high, "Uniform::new_inclusive called with `low > high`");
assert!(low.is_finite() && high.is_finite(), "Uniform::new_inclusive called with non-finite boundaries");
```
You can reduce the bounds checks a bit: `(NEG_INFINITY < low) & (low <= high) & (high < INFINITY)`. Or maybe `&&`, but in theory that can short-circuit, which isn't useful here.
Oh, that's a great idea! But I don't understand why we wouldn't want to short-circuit?
Because it might prevent the processor testing multiple things in parallel, and because the common case is that all these tests pass. I don't know; it might be optimisable to the same thing anyway.
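A sketch of the suggested combined check as a hypothetical helper (`bounds_ok` is an illustrative name, not from the PR). Note that `&` on `bool`s evaluates both operands, unlike `&&`, and that NaN fails every ordered comparison, so it is rejected for free:

```rust
// Hypothetical helper mirroring the suggested combined bounds check.
fn bounds_ok(low: f64, high: f64) -> bool {
    (f64::NEG_INFINITY < low) & (low <= high) & (high < f64::INFINITY)
}

fn main() {
    assert!(bounds_ok(0.0, 1.0));
    assert!(!bounds_ok(1.0, 0.0));               // low > high
    assert!(!bounds_ok(f64::NEG_INFINITY, 0.0)); // infinite low
    assert!(!bounds_ok(0.0, f64::INFINITY));     // infinite high
    assert!(!bounds_ok(f64::NAN, 1.0));          // NaN fails all comparisons
    assert!(!bounds_ok(0.0, f64::NAN));
}
```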
So it turns out that the old approach is actually slightly faster:

Using `&`:

```
test gen_range_f32 ... bench: 4,459 ns/iter (+/- 169) = 897 MB/s
test gen_range_f64 ... bench: 6,027 ns/iter (+/- 334) = 1327 MB/s
```

Using `&&`:

```
test gen_range_f32 ... bench: 4,464 ns/iter (+/- 163) = 896 MB/s
test gen_range_f64 ... bench: 6,038 ns/iter (+/- 592) = 1324 MB/s
```

Using the old `is_finite` checks:

```
test gen_range_f32 ... bench: 4,382 ns/iter (+/- 135) = 912 MB/s
test gen_range_f64 ... bench: 5,665 ns/iter (+/- 682) = 1412 MB/s
```
Technically none of these differences are statistically significant, however it definitely looks like the last is faster for f64. And it does provide better error messages.
My best guess is that the old approach ends up being implementable using bit logic, whereas the `NEG_INF < low` checks have to be done using slower floating-point instructions.
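For illustration, finiteness really can be tested with integer operations alone, which may explain the codegen difference. This `is_finite_bits` helper is hypothetical, not PR code:

```rust
// A f32 is non-finite (inf or NaN) exactly when its 8 exponent bits are all
// ones, so finiteness reduces to one integer mask-and-compare.
fn is_finite_bits(x: f32) -> bool {
    (x.to_bits() & 0x7F80_0000) != 0x7F80_0000
}

fn main() {
    assert!(is_finite_bits(0.0));
    assert!(is_finite_bits(f32::MAX));
    assert!(is_finite_bits(-1.5));
    assert!(!is_finite_bits(f32::INFINITY));
    assert!(!is_finite_bits(f32::NEG_INFINITY));
    assert!(!is_finite_bits(f32::NAN));
}
```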
Force-pushed 3b1ab6e to c1cbf32
Force-pushed c1cbf32 to d2d4fcd
Force-pushed 28b3fbf to 42f79af
|
I've rebased this on top of the newly landed SIMD code, which was definitely non-trivial. The good news is that it's enforcing the limits for both primitives and SIMD types. I also found a way to make I'd really like to get this landed though. Especially since the weighted sampling has landed and currently risks panicking randomly if you use f32/f64 weights. |
Force-pushed 47f2998 to e487112
|
So I ran some performance numbers, and got quite strange results. Basically if I remove the

```rust
assert!(low.all_finite() && high.all_finite(),
        "Uniform::sample_single called with non-finite boundaries");
```

then I see no performance degradation for However, if I add back the above code, then I see a 50% slowdown. Even though the above code never runs during the benchmark. In the new implementation the code above is only run if I suspect that the problem is related to when the optimizer decides to inline some critical function, but sprinkling `#[inline(always)]` everywhere doesn't seem to make a difference as long as the above code is there. But I suspect that this problem is one of the realities of using micro-benchmarks. I.e. I don't think a real-world usage of In theory we could break out some of the code into a separate non-inlined function. But since it'd lead to significant code duplication, I don't think it's worth it. Especially since we don't know how a future version of rustc will optimize these things. |
|
Oh, and I should say that I got these numbers using |
Force-pushed e487112 to 9527481
|
Perhaps you could simplify your test?

```rust
assert!(scale.all_finite(), "Uniform::sample_single called with non-finite boundaries");
```
|
Unfortunately |
|
I should also mention that the |
|
But do we really care about "nearly infinite" ranges? If someone really is treading so close to the limits of |
|
I tried just checking I'm also a bit hesitant to change the API to work around optimizer quirks which could change with so many parameters, like future versions of rustc, what the calling code does, and which rng is used. FWIW, I just tried using |
Force-pushed 9527481 to be10dba
|
Okay, I think it's time I stopped opposing this based on performance overhead; it's barely above the noise threshold (notice how unrelated numbers change here): Is this ready? Please rebase, and please separate the new benchmarks into a separate commit (so it's easier to do before/after comparisons). |
|
Only question I had remaining is if we want to merge the |
Force-pushed be10dba to 4f8601b
Force-pushed 4f8601b to ccb3c3f
dhardy
left a comment
I don't care much about renaming (these are only internal, right?) but it's probably better to keep the two traits separate if there's no reason to merge.
src/distributions/utils.rs
Outdated
```rust
#[inline(always)]
fn extract(self, index: usize) -> Self { assert_eq!(index, 0); self }
#[inline(always)]
fn replace(self, index: usize, new_value: Self) -> Self { assert_eq!(index, 0); new_value }
```
I think it would be more appropriate to use debug_assert here?
I was trying to replicate the behavior of these functions on real SIMD types, and hoping that the compiler would optimize the checks away.
Ultimately doesn't really matter right now. These functions are only used in tests.
Happy to go either way.
I did change `decrease_masked`, though, to not cast the bool flag to an int. Instead I made it part of the contract that at least one "lane" must be set and added `debug_assert`s to ensure that. That way we can always subtract 1. This might have gotten optimized away, but we might as well be safe.
Force-pushed ccb3c3f to 6d2b7e5
Yeah, they're only used internally. It's slightly less code to merge them, so I added a commit for that. But I kept the commit separate so feel free to take it or not. |
Sneaky! But it saves more code than I realised due to all the import rules. |
Sorry, intended to provide the option, not to sneak it in. |
This adds a small amount of overhead to Uniform<f32/f64> and gen_range<f32/f64>. For Uniform<f32/f64> it adds one subtraction and one branch in the common case. For gen_range<f32/f64> it adds just one branch in the common case. I see no way around this while also being sure to stay in the requested range.
This does use `f64::from_bits`/`to_bits`/`is_infinite`/`is_finite`, which might only be available through std? Though it looks like `core::num` also makes them available here? I can't tell, given that the functions are marked as stable, but the trait is marked unstable.
I'm not sure what the acceptable build requirements are for no_std builds? And is there any way to run `cargo test`, or even just `cargo build`, in no_std mode?