/work/obj-fuzz/dist/include/mozilla/FastBernoulliTrial.h
/* -*- Mode: C++; tab-width: 8; indent-tabs-mode: nil; c-basic-offset: 2 -*- */
/* vim: set ts=8 sts=2 et sw=2 tw=80: */
/* This Source Code Form is subject to the terms of the Mozilla Public
 * License, v. 2.0. If a copy of the MPL was not distributed with this
 * file, You can obtain one at http://mozilla.org/MPL/2.0/. */

#ifndef mozilla_FastBernoulliTrial_h
#define mozilla_FastBernoulliTrial_h

#include "mozilla/Assertions.h"
#include "mozilla/XorShift128PlusRNG.h"

#include <cmath>
#include <stdint.h>

namespace mozilla {

/**
 * class FastBernoulliTrial: Efficient sampling with uniform probability
 *
 * When gathering statistics about a program's behavior, we may be observing
 * events that occur very frequently (e.g., function calls or memory
 * allocations) and we may be gathering information that is somewhat expensive
 * to produce (e.g., call stacks). Sampling all the events could have a
 * significant impact on the program's performance.
 *
 * Why not just sample every N'th event? This technique is called "systematic
 * sampling"; it's simple and efficient, and it's fine if we imagine a
 * patternless stream of events. But what if we're sampling allocations, and the
 * program happens to have a loop where each iteration does exactly N
 * allocations? You would end up sampling the same allocation every time through
 * the loop; the entire rest of the loop becomes invisible to your measurements!
 * More generally, if each iteration does M allocations, and M and N have any
 * common divisor at all, most allocation sites will never be sampled. If
 * they're both even, say, the odd-numbered allocations disappear from your
 * results.
 *
 * Ideally, we'd like each event to have some probability P of being sampled,
 * independent of its neighbors and of its position in the sequence. This is
 * called "Bernoulli sampling", and it doesn't suffer from any of the problems
 * mentioned above.
 *
 * One disadvantage of Bernoulli sampling is that you can't be sure exactly how
 * many samples you'll get: technically, it's possible that you might sample
 * none of them, or all of them. But if the number of events N is large, these
 * aren't likely outcomes; you can generally expect somewhere around P * N
 * events to be sampled.
 *
 * The other disadvantage of Bernoulli sampling is that you have to generate a
 * random number for every event, which can be slow.
 *
 * [significant pause]
 *
 * BUT NOT WITH THIS CLASS! FastBernoulliTrial lets you do true Bernoulli
 * sampling, while generating a fresh random number only when we do decide to
 * sample an event, not on every trial. When it decides not to sample, a call to
 * |FastBernoulliTrial::trial| is nothing but decrementing a counter and
 * comparing it to zero. So the lower your sampling probability is, the less
 * overhead FastBernoulliTrial imposes.
 *
 * Probabilities of 0 and 1 are handled efficiently. (In neither case need we
 * ever generate a random number at all.)
 *
 * The essential API:
 *
 * - FastBernoulliTrial(double P)
 *   Construct an instance that selects events with probability P.
 *
 * - FastBernoulliTrial::trial()
 *   Return true with probability P. Call this each time an event occurs, to
 *   decide whether to sample it or not.
 *
 * - FastBernoulliTrial::trial(size_t n)
 *   Equivalent to calling trial() |n| times, and returning true if any of
 *   those calls do. However, like trial, this runs in fast constant time.
 *
 * What is this good for? In some applications, some events are "bigger" than
 * others. For example, large allocations are more significant than small
 * allocations. Perhaps we'd like to imagine that we're drawing allocations
 * from a stream of bytes, and performing a separate Bernoulli trial on every
 * byte from the stream. We can accomplish this by calling |t.trial(S)| for
 * the number of bytes S, and sampling the event if that returns true.
 *
 * Of course, this style of sampling needs to be paired with analysis and
 * presentation that makes the size of the event apparent, lest trials with
 * large values for |n| appear to be indistinguishable from those with small
 * values for |n|.
 */
class FastBernoulliTrial {
  /*
   * This comment should just read, "Generate skip counts with a geometric
   * distribution", and leave everyone to go look that up and see why it's the
   * right thing to do, if they don't know already.
   *
   * BUT IF YOU'RE CURIOUS, COMMENTS ARE FREE...
   *
   * Instead of generating a fresh random number for every trial, we can
   * randomly generate a count of how many times we should return false before
   * the next time we return true. We call this a "skip count". Once we've
   * returned true, we generate a fresh skip count, and begin counting down
   * again.
   *
   * Here's an awesome fact: by exercising a little care in the way we generate
   * skip counts, we can produce results indistinguishable from those we would
   * get "rolling the dice" afresh for every trial.
   *
   * In short, skip counts in Bernoulli trials of probability P obey a geometric
   * distribution. If a random variable X is uniformly distributed from [0..1),
   * then std::floor(std::log(X) / std::log(1-P)) has the appropriate geometric
   * distribution for the skip counts.
   *
   * Why that formula?
   *
   * Suppose we're to return |true| with some probability P, say, 0.3. Spread
   * all possible futures along a line segment of length 1. In portion P of
   * those cases, we'll return true on the next call to |trial|; the skip count
   * is 0. For the remaining portion 1-P of cases, the skip count is 1 or more.
   *
   *    skip:         0                1 or more
   *   |------------------^-----------------------------------------|
   * portion:        0.3                   0.7
   *                  P                    1-P
   *
   * But the "1 or more" section of the line is subdivided the same way: *within
   * that section*, in portion P the second call to |trial()| returns true, and
   * in portion 1-P it returns false a second time; the skip count is two or
   * more. So we return true on the second call in proportion 0.7 * 0.3, and
   * skip at least the first two in proportion 0.7 * 0.7.
   *
   *    skip:         0             1            2 or more
   *   |------------------^------------^----------------------------|
   * portion:        0.3        0.7 * 0.3        0.7 * 0.7
   *                  P          (1-P)*P          (1-P)^2
   *
   * We can continue to subdivide:
   *
   * skip >= 0: |------------------------------------------------- (1-P)^0 --|
   * skip >= 1: |           ------------------------------- (1-P)^1 --|
   * skip >= 2: |                      ------------------ (1-P)^2 --|
   * skip >= 3: |                          ^  ---------- (1-P)^3 --|
   * skip >= 4: |                          .       --- (1-P)^4 --|
   *                                      .
   *                                      ^ X, see below
   *
   * In other words, the likelihood of the next n calls to |trial| returning
   * false is (1-P)^n. The longer a run we require, the more the likelihood
   * drops. Further calls may return false too, but this is the probability
   * we'll skip at least n.
   *
   * This is interesting, because we can pick a point along this line segment
   * and see which skip count's range it falls within; the point X above, for
   * example, is within the ">= 2" range, but not within the ">= 3" range, so it
   * designates a skip count of 2. So if we pick points on the line at random
   * and use the skip counts they fall under, that will be indistinguishable
   * from generating a fresh random number between 0 and 1 for each trial and
   * comparing it to P.
   *
   * So to find the skip count for a point X, we must ask: To what whole power
   * must we raise 1-P such that we include X, but the next power would exclude
   * it? This is exactly std::floor(std::log(X) / std::log(1-P)).
   *
   * Our algorithm is then, simply: When constructed, compute an initial skip
   * count. Return false from |trial| that many times, and then compute a new
   * skip count.
   *
   * For a call to |trial(n)|, if the skip count is greater than n, return false
   * and subtract n from the skip count. If the skip count is n or less, return
   * true and compute a new skip count. Since each trial is independent, it
   * doesn't matter by how much n overshoots the skip count; we can actually
   * compute a new skip count at *any* time without affecting the distribution.
   * This is really beautiful.
   */
public:
  /**
   * Construct a fast Bernoulli trial generator. Calls to |trial()| return true
   * with probability |aProbability|. Use |aState0| and |aState1| to seed the
   * random number generator; they must not both be zero.
   */
  FastBernoulliTrial(double aProbability, uint64_t aState0, uint64_t aState1)
    : mProbability(0)
    , mInvLogNotProbability(0)
    , mGenerator(aState0, aState1)
    , mSkipCount(0)
  {
    setProbability(aProbability);
  }

  /**
   * Return true with probability |mProbability|. Call this each time an event
   * occurs, to decide whether to sample it or not. The lower |mProbability| is,
   * the faster this function runs.
   */
  bool trial() {
    if (mSkipCount) {
      mSkipCount--;
      return false;
    }

    return chooseSkipCount();
  }

  /**
   * Equivalent to calling trial() |n| times, and returning true if any of
   * those calls do. However, like trial, this runs in fast constant time.
   *
   * What is this good for? In some applications, some events are "bigger" than
   * others. For example, large allocations are more significant than small
   * allocations. Perhaps we'd like to imagine that we're drawing allocations
   * from a stream of bytes, and performing a separate Bernoulli trial on every
   * byte from the stream. We can accomplish this by calling |t.trial(S)| for
   * the number of bytes S, and sampling the event if that returns true.
   *
   * Of course, this style of sampling needs to be paired with analysis and
   * presentation that makes the "size" of the event apparent, lest trials with
   * large values for |n| appear to be indistinguishable from those with small
   * values for |n|, despite being potentially much more likely to be sampled.
   */
  bool trial(size_t aCount) {
    if (mSkipCount > aCount) {
      mSkipCount -= aCount;
      return false;
    }

    return chooseSkipCount();
  }

  void setRandomState(uint64_t aState0, uint64_t aState1) {
    mGenerator.setState(aState0, aState1);
  }

  void setProbability(double aProbability) {
    MOZ_ASSERT(0 <= aProbability && aProbability <= 1);
    mProbability = aProbability;
    if (0 < mProbability && mProbability < 1) {
      /*
       * Let's look carefully at how this calculation plays out in floating-
       * point arithmetic. We'll assume IEEE, but the final C++ code we arrive
       * at would still be fine if our numbers were mathematically perfect. So,
       * while we've considered IEEE's edge cases, we haven't done anything that
       * should be actively bad when using other representations.
       *
       * (In the below, read comparisons as exact mathematical comparisons: when
       * we say something "equals 1", that means it's exactly equal to 1. We
       * treat approximation using intervals with open boundaries: saying a
       * value is in (0,1) doesn't specify how close to 0 or 1 the value gets.
       * When we use closed boundaries like [2**-53, 1], we're careful to ensure
       * the boundary values are actually representable.)
       *
       * - After the comparison above, we know mProbability is in (0,1).
       *
       * - The gaps below 1 are 2**-53, so that interval is (0, 1-2**-53].
       *
       * - Because the floating-point gaps near 1 are wider than those near
       *   zero, there are many small positive doubles ε such that 1-ε rounds to
       *   exactly 1. However, 2**-53 can be represented exactly. So
       *   1-mProbability is in [2**-53, 1].
       *
       * - log(1 - mProbability) is thus in (-37, 0].
       *
       *   That range includes zero, but when we use mInvLogNotProbability, it
       *   would be helpful if we could trust that it's negative. So when
       *   log(1 - mProbability) is 0, we'll just set mProbability to 0, so that
       *   mInvLogNotProbability is not used in chooseSkipCount.
       *
       * - How much of the range of mProbability does this cause us to ignore?
       *   The only value for which log returns 0 is exactly 1; the slope of log
       *   at 1 is 1, so for small ε such that 1 - ε != 1, log(1 - ε) is -ε,
       *   never 0. The gaps near one are larger than the gaps near zero, so if
       *   1 - ε wasn't 1, then -ε is representable. So if log(1 - mProbability)
       *   isn't 0, then 1 - mProbability isn't 1, which means that mProbability
       *   is at least 2**-53, as discussed earlier. This is a sampling
       *   likelihood of roughly one in nine quadrillion, which is unlikely to
       *   be distinguishable from zero in practice.
       *
       *   So by forbidding zero, we've tightened our range to (-37, -2**-53].
       *
       * - Finally, 1 / log(1 - mProbability) is in [-2**53, -1/37). This all
       *   falls readily within the range of an IEEE double.
       *
       * ALL THAT HAVING BEEN SAID: here are the five lines of actual code:
       */
      double logNotProbability = std::log(1 - mProbability);
      if (logNotProbability == 0.0)
        mProbability = 0.0;
      else
        mInvLogNotProbability = 1 / logNotProbability;
    }

    chooseSkipCount();
  }

private:
  /* The likelihood that any given call to |trial| should return true. */
  double mProbability;

  /*
   * The value of 1/std::log(1 - mProbability), cached for repeated use.
   *
   * If mProbability is exactly 0 or exactly 1, we don't use this value.
   * Otherwise, we guarantee this value is in the range [-2**53, -1/37), i.e.
   * definitely negative, as required by chooseSkipCount. See setProbability for
   * the details.
   */
  double mInvLogNotProbability;

  /* Our random number generator. */
  non_crypto::XorShift128PlusRNG mGenerator;

  /*
   * The number of times |trial| should return false before next returning
   * true.
   */
  size_t mSkipCount;

  /*
   * Choose the next skip count. This also returns the value that |trial| should
   * return, since we have to check for the extreme values for mProbability
   * anyway, and |trial| should never return true at all when mProbability is 0.
   */
  bool chooseSkipCount() {
    /*
     * If the probability is 1.0, every call to |trial| returns true. Make sure
     * mSkipCount is 0.
     */
    if (mProbability == 1.0) {
      mSkipCount = 0;
      return true;
    }

    /*
     * If the probability is zero, |trial| never returns true. Don't bother us
     * for a while.
     */
    if (mProbability == 0.0) {
      mSkipCount = SIZE_MAX;
      return false;
    }

    /*
     * What sorts of values can this call to std::floor produce?
     *
     * Since mGenerator.nextDouble returns a value in [0, 1-2**-53], std::log
     * returns a value in the range [-infinity, -2**-53], all negative. Since
     * mInvLogNotProbability is negative (see its comments), the product is
     * positive and possibly infinite. std::floor returns +infinity unchanged.
     * So the result will always be positive.
     *
     * Converting a double to an integer that is out of range for that integer
     * is undefined behavior, so we must clamp our result to SIZE_MAX, to ensure
     * we get an acceptable value for mSkipCount.
     *
     * The clamp is written carefully. Note that if we had said:
     *
     *   if (skipCount > SIZE_MAX)
     *     skipCount = SIZE_MAX;
     *
     * that leads to undefined behavior on 64-bit machines: SIZE_MAX coerced to
     * double is 2^64, not 2^64-1, so this doesn't actually set skipCount to a
     * value that can be safely assigned to mSkipCount.
     *
     * Jakob Olesen cleverly suggested flipping the sense of the comparison: if
     * we require that skipCount < SIZE_MAX, then because of the gaps (2048)
     * between doubles at that magnitude, the highest double less than 2^64 is
     * 2^64 - 2048, which is fine to store in a size_t.
     *
     * (On 32-bit machines, all size_t values can be represented exactly in
     * double, so all is well.)
     */
    double skipCount = std::floor(std::log(mGenerator.nextDouble())
                                  * mInvLogNotProbability);
    if (skipCount < SIZE_MAX)
      mSkipCount = skipCount;
    else
      mSkipCount = SIZE_MAX;

    return true;
  }
};

} /* namespace mozilla */

#endif /* mozilla_FastBernoulliTrial_h */