Coverage Report

Created: 2018-09-25 14:53

/work/obj-fuzz/dist/include/mozilla/FastBernoulliTrial.h
Count | Source
      | /* -*- Mode: C++; tab-width: 8; indent-tabs-mode: nil; c-basic-offset: 2 -*- */
      | /* vim: set ts=8 sts=2 et sw=2 tw=80: */
      | /* This Source Code Form is subject to the terms of the Mozilla Public
      |  * License, v. 2.0. If a copy of the MPL was not distributed with this
      |  * file, You can obtain one at http://mozilla.org/MPL/2.0/. */
      | 
      | #ifndef mozilla_FastBernoulliTrial_h
      | #define mozilla_FastBernoulliTrial_h
      | 
      | #include "mozilla/Assertions.h"
      | #include "mozilla/XorShift128PlusRNG.h"
      | 
      | #include <cmath>
      | #include <stdint.h>
      | 
      | namespace mozilla {
      | 
      | /**
      |  * class FastBernoulliTrial: Efficient sampling with uniform probability
      |  *
      |  * When gathering statistics about a program's behavior, we may be observing
      |  * events that occur very frequently (e.g., function calls or memory
      |  * allocations) and we may be gathering information that is somewhat expensive
      |  * to produce (e.g., call stacks). Sampling all the events could have a
      |  * significant impact on the program's performance.
      |  *
      |  * Why not just sample every N'th event? This technique is called "systematic
      |  * sampling"; it's simple and efficient, and it's fine if we imagine a
      |  * patternless stream of events. But what if we're sampling allocations, and the
      |  * program happens to have a loop where each iteration does exactly N
      |  * allocations? You would end up sampling the same allocation every time through
      |  * the loop; the entire rest of the loop becomes invisible to your measurements!
      |  * More generally, if each iteration does M allocations, and M and N have any
      |  * common divisor at all, most allocation sites will never be sampled. If
      |  * they're both even, say, the odd-numbered allocations disappear from your
      |  * results.
      |  *
      |  * Ideally, we'd like each event to have some probability P of being sampled,
      |  * independent of its neighbors and of its position in the sequence. This is
      |  * called "Bernoulli sampling", and it doesn't suffer from any of the problems
      |  * mentioned above.
      |  *
      |  * One disadvantage of Bernoulli sampling is that you can't be sure exactly how
      |  * many samples you'll get: technically, it's possible that you might sample
      |  * none of them, or all of them. But if the number of events N is large, these
      |  * aren't likely outcomes; you can generally expect somewhere around P * N
      |  * events to be sampled.
      |  *
      |  * The other disadvantage of Bernoulli sampling is that you have to generate a
      |  * random number for every event, which can be slow.
      |  *
      |  * [significant pause]
      |  *
      |  * BUT NOT WITH THIS CLASS! FastBernoulliTrial lets you do true Bernoulli
      |  * sampling, while generating a fresh random number only when we do decide to
      |  * sample an event, not on every trial. When it decides not to sample, a call to
      |  * |FastBernoulliTrial::trial| is nothing but decrementing a counter and
      |  * comparing it to zero. So the lower your sampling probability is, the less
      |  * overhead FastBernoulliTrial imposes.
      |  *
      |  * Probabilities of 0 and 1 are handled efficiently. (In neither case need we
      |  * ever generate a random number at all.)
      |  *
      |  * The essential API:
      |  *
      |  * - FastBernoulliTrial(double P)
      |  *   Construct an instance that selects events with probability P.
      |  *
      |  * - FastBernoulliTrial::trial()
      |  *   Return true with probability P. Call this each time an event occurs, to
      |  *   decide whether to sample it or not.
      |  *
      |  * - FastBernoulliTrial::trial(size_t n)
      |  *   Equivalent to calling trial() |n| times, and returning true if any of those
      |  *   calls do. However, like trial, this runs in fast constant time.
      |  *
      |  *   What is this good for? In some applications, some events are "bigger" than
      |  *   others. For example, large allocations are more significant than small
      |  *   allocations. Perhaps we'd like to imagine that we're drawing allocations
      |  *   from a stream of bytes, and performing a separate Bernoulli trial on every
      |  *   byte from the stream. We can accomplish this by calling |t.trial(S)| for
      |  *   the number of bytes S, and sampling the event if that returns true.
      |  *
      |  *   Of course, this style of sampling needs to be paired with analysis and
      |  *   presentation that makes the size of the event apparent, lest trials with
      |  *   large values for |n| appear to be indistinguishable from those with small
      |  *   values for |n|.
      |  */
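The aliasing hazard of systematic sampling described in the comment above is easy to demonstrate numerically. The following sketch is not part of the header (the function name `sampledSites` is ours): it selects every Nth event in a loop performing M events per iteration, and records which of the M sites are ever selected.

```cpp
#include <cstddef>
#include <set>

// Systematic sampling: select every Nth event. Each loop iteration performs
// M events (sites 0..M-1). Returns the set of sites that were ever selected.
std::set<size_t> sampledSites(size_t M, size_t N, size_t iterations) {
  std::set<size_t> sites;
  size_t event = 0;
  for (size_t i = 0; i < iterations; i++) {
    for (size_t site = 0; site < M; site++) {
      if (++event % N == 0) {  // every Nth event gets sampled
        sites.insert(site);
      }
    }
  }
  return sites;
}
```

With M = 6 and N = 4 (common divisor 2), only sites 1, 3, and 5 are ever selected; the even-numbered sites stay invisible, exactly as the comment warns. With M = 7 and N = 4 (coprime), all seven sites are eventually covered.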
      | class FastBernoulliTrial {
      |   /*
      |    * This comment should just read, "Generate skip counts with a geometric
      |    * distribution", and leave everyone to go look that up and see why it's the
      |    * right thing to do, if they don't know already.
      |    *
      |    * BUT IF YOU'RE CURIOUS, COMMENTS ARE FREE...
      |    *
      |    * Instead of generating a fresh random number for every trial, we can
      |    * randomly generate a count of how many times we should return false before
      |    * the next time we return true. We call this a "skip count". Once we've
      |    * returned true, we generate a fresh skip count, and begin counting down
      |    * again.
      |    *
      |    * Here's an awesome fact: by exercising a little care in the way we generate
      |    * skip counts, we can produce results indistinguishable from those we would
      |    * get "rolling the dice" afresh for every trial.
      |    *
      |    * In short, skip counts in Bernoulli trials of probability P obey a geometric
      |    * distribution. If a random variable X is uniformly distributed from [0..1),
      |    * then std::floor(std::log(X) / std::log(1-P)) has the appropriate geometric
      |    * distribution for the skip counts.
      |    *
      |    * Why that formula?
      |    *
      |    * Suppose we're to return |true| with some probability P, say, 0.3. Spread
      |    * all possible futures along a line segment of length 1. In portion P of
      |    * those cases, we'll return true on the next call to |trial|; the skip count
      |    * is 0. For the remaining portion 1-P of cases, the skip count is 1 or more.
      |    *
      |    * skip:                0                         1 or more
      |    *             |------------------^-----------------------------------------|
      |    * portion:            0.3                            0.7
      |    *                      P                             1-P
      |    *
      |    * But the "1 or more" section of the line is subdivided the same way: *within
      |    * that section*, in portion P the second call to |trial()| returns true, and in
      |    * portion 1-P it returns false a second time; the skip count is two or more.
      |    * So we return true on the second call in proportion 0.7 * 0.3, and skip at
      |    * least the first two in proportion 0.7 * 0.7.
      |    *
      |    * skip:                0                1              2 or more
      |    *             |------------------^------------^----------------------------|
      |    * portion:            0.3           0.7 * 0.3          0.7 * 0.7
      |    *                      P             (1-P)*P            (1-P)^2
      |    *
      |    * We can continue to subdivide:
      |    *
      |    * skip >= 0:  |------------------------------------------------- (1-P)^0 --|
      |    * skip >= 1:  |                  ------------------------------- (1-P)^1 --|
      |    * skip >= 2:  |                               ------------------ (1-P)^2 --|
      |    * skip >= 3:  |                                 ^     ---------- (1-P)^3 --|
      |    * skip >= 4:  |                                 .            --- (1-P)^4 --|
      |    *                                               .
      |    *                                               ^X, see below
      |    *
      |    * In other words, the likelihood of the next n calls to |trial| returning
      |    * false is (1-P)^n. The longer a run we require, the more the likelihood
      |    * drops. Further calls may return false too, but this is the probability
      |    * we'll skip at least n.
      |    *
      |    * This is interesting, because we can pick a point along this line segment
      |    * and see which skip count's range it falls within; the point X above, for
      |    * example, is within the ">= 2" range, but not within the ">= 3" range, so it
      |    * designates a skip count of 2. So if we pick points on the line at random
      |    * and use the skip counts they fall under, that will be indistinguishable
      |    * from generating a fresh random number between 0 and 1 for each trial and
      |    * comparing it to P.
      |    *
      |    * So to find the skip count for a point X, we must ask: To what whole power
      |    * must we raise 1-P such that we include X, but the next power would exclude
      |    * it? This is exactly std::floor(std::log(X) / std::log(1-P)).
      |    *
      |    * Our algorithm is then, simply: When constructed, compute an initial skip
      |    * count. Return false from |trial| that many times, and then compute a new skip
      |    * count.
      |    *
      |    * For a call to |trial(n)|, if the skip count is greater than n, return false
      |    * and subtract n from the skip count. If the skip count is less than n,
      |    * return true and compute a new skip count. Since each trial is independent,
      |    * it doesn't matter by how much n overshoots the skip count; we can actually
      |    * compute a new skip count at *any* time without affecting the distribution.
      |    * This is really beautiful.
      |    */
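The skip-count formula derived in the comment above can be checked empirically. A standalone sketch, substituting std::mt19937_64 for the header's XorShift128PlusRNG (our substitution, purely to stay self-contained): draw skip counts as floor(log(X) / log(1-P)) and confirm the sample mean approaches the geometric distribution's expected skip of (1-P)/P.

```cpp
#include <cmath>
#include <cstddef>
#include <cstdint>
#include <random>

// One skip count with the geometric distribution derived above:
// floor(log(X) / log(1-P)) for X uniform in [0, 1).
size_t chooseSkip(double P, std::mt19937_64& rng) {
  std::uniform_real_distribution<double> uniform(0.0, 1.0);
  double X = uniform(rng);
  if (X == 0.0) {
    return SIZE_MAX;  // log(0) is -infinity; treat as "skip practically forever"
  }
  return static_cast<size_t>(std::floor(std::log(X) / std::log(1.0 - P)));
}

// Sample mean of many skip counts; for Bernoulli probability P the expected
// skip count is (1-P)/P (e.g. 7/3 for P = 0.3).
double meanSkip(double P, size_t draws) {
  std::mt19937_64 rng(42);  // fixed seed, only for reproducibility of the sketch
  double total = 0.0;
  for (size_t i = 0; i < draws; i++) {
    total += static_cast<double>(chooseSkip(P, rng));
  }
  return total / static_cast<double>(draws);
}
```

Over a few hundred thousand draws the mean lands well within a tenth of (1-P)/P, which is what "indistinguishable from rolling the dice afresh" predicts.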
      |  public:
      |   /**
      |    * Construct a fast Bernoulli trial generator. Calls to |trial()| return true
      |    * with probability |aProbability|. Use |aState0| and |aState1| to seed the
      |    * random number generator; they must not both be zero.
      |    */
      |   FastBernoulliTrial(double aProbability, uint64_t aState0, uint64_t aState1)
      |    : mProbability(0)
      |    , mInvLogNotProbability(0)
      |    , mGenerator(aState0, aState1)
      |    , mSkipCount(0)
    9 |   {
    9 |     setProbability(aProbability);
    9 |   }
      | 
      |   /**
      |    * Return true with probability |mProbability|. Call this each time an event
      |    * occurs, to decide whether to sample it or not. The lower |mProbability| is,
      |    * the faster this function runs.
      |    */
    0 |   bool trial() {
    0 |     if (mSkipCount) {
    0 |       mSkipCount--;
    0 |       return false;
    0 |     }
    0 | 
    0 |     return chooseSkipCount();
    0 |   }
      | 
      |   /**
      |    * Equivalent to calling trial() |n| times, and returning true if any of those
      |    * calls do. However, like trial, this runs in fast constant time.
      |    *
      |    * What is this good for? In some applications, some events are "bigger" than
      |    * others. For example, large allocations are more significant than small
      |    * allocations. Perhaps we'd like to imagine that we're drawing allocations
      |    * from a stream of bytes, and performing a separate Bernoulli trial on every
      |    * byte from the stream. We can accomplish this by calling |t.trial(S)| for
      |    * the number of bytes S, and sampling the event if that returns true.
      |    *
      |    * Of course, this style of sampling needs to be paired with analysis and
      |    * presentation that makes the "size" of the event apparent, lest trials with
      |    * large values for |n| appear to be indistinguishable from those with small
      |    * values for |n|, despite being potentially much more likely to be sampled.
      |    */
    0 |   bool trial(size_t aCount) {
    0 |     if (mSkipCount > aCount) {
    0 |       mSkipCount -= aCount;
    0 |       return false;
    0 |     }
    0 | 
    0 |     return chooseSkipCount();
    0 |   }
      | 
    0 |   void setRandomState(uint64_t aState0, uint64_t aState1) {
    0 |     mGenerator.setState(aState0, aState1);
    0 |   }
      | 
    9 |   void setProbability(double aProbability) {
    9 |     MOZ_ASSERT(0 <= aProbability && aProbability <= 1);
    9 |     mProbability = aProbability;
    9 |     if (0 < mProbability && mProbability < 1) {
    0 |       /*
    0 |        * Let's look carefully at how this calculation plays out in floating-
    0 |        * point arithmetic. We'll assume IEEE, but the final C++ code we arrive
    0 |        * at would still be fine if our numbers were mathematically perfect. So,
    0 |        * while we've considered IEEE's edge cases, we haven't done anything that
    0 |        * should be actively bad when using other representations.
    0 |        *
    0 |        * (In the below, read comparisons as exact mathematical comparisons: when
    0 |        * we say something "equals 1", that means it's exactly equal to 1. We
    0 |        * treat approximation using intervals with open boundaries: saying a
    0 |        * value is in (0,1) doesn't specify how close to 0 or 1 the value gets.
    0 |        * When we use closed boundaries like [2**-53, 1], we're careful to ensure
    0 |        * the boundary values are actually representable.)
    0 |        *
    0 |        * - After the comparison above, we know mProbability is in (0,1).
    0 |        *
    0 |        * - The gaps below 1 are 2**-53, so that interval is (0, 1-2**-53].
    0 |        *
    0 |        * - Because the floating-point gaps near 1 are wider than those near
    0 |        *   zero, there are many small positive doubles ε such that 1-ε rounds to
    0 |        *   exactly 1. However, 2**-53 can be represented exactly. So
    0 |        *   1-mProbability is in [2**-53, 1].
    0 |        *
    0 |        * - log(1 - mProbability) is thus in (-37, 0].
    0 |        *
    0 |        *   That range includes zero, but when we use mInvLogNotProbability, it
    0 |        *   would be helpful if we could trust that it's negative. So when log(1
    0 |        *   - mProbability) is 0, we'll just set mProbability to 0, so that
    0 |        *   mInvLogNotProbability is not used in chooseSkipCount.
    0 |        *
    0 |        * - How much of the range of mProbability does this cause us to ignore?
    0 |        *   The only value for which log returns 0 is exactly 1; the slope of log
    0 |        *   at 1 is 1, so for small ε such that 1 - ε != 1, log(1 - ε) is -ε,
    0 |        *   never 0. The gaps near one are larger than the gaps near zero, so if
    0 |        *   1 - ε wasn't 1, then -ε is representable. So if log(1 - mProbability)
    0 |        *   isn't 0, then 1 - mProbability isn't 1, which means that mProbability
    0 |        *   is at least 2**-53, as discussed earlier. This is a sampling
    0 |        *   likelihood of roughly one in ten trillion, which is unlikely to be
    0 |        *   distinguishable from zero in practice.
    0 |        *
    0 |        *   So by forbidding zero, we've tightened our range to (-37, -2**-53].
    0 |        *
    0 |        * - Finally, 1 / log(1 - mProbability) is in [-2**53, -1/37). This all
    0 |        *   falls readily within the range of an IEEE double.
    0 |        *
    0 |        * ALL THAT HAVING BEEN SAID: here are the five lines of actual code:
    0 |        */
    0 |       double logNotProbability = std::log(1 - mProbability);
    0 |       if (logNotProbability == 0.0)
    0 |         mProbability = 0.0;
    0 |       else
    0 |         mInvLogNotProbability = 1 / logNotProbability;
    0 |     }
    9 | 
    9 |     chooseSkipCount();
    9 |   }
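The 2**-53 cutoff that the analysis in setProbability arrives at can be observed directly. A small sketch (the helper `treatedAsZero` is ours, not part of the class) reproducing the check the method performs:

```cpp
#include <cmath>

// For P below about 2**-53 (roughly 1.1e-16), the subtraction 1 - P rounds
// back to exactly 1.0, so std::log(1 - P) is exactly 0 -- the condition under
// which setProbability falls back to treating the probability as 0.
bool treatedAsZero(double P) {
  return std::log(1.0 - P) == 0.0;
}
```

So a requested probability of 1e-17 is silently impossible to honor, while 1e-15 still yields a usable negative logarithm.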
      | 
      |  private:
      |   /* The likelihood that any given call to |trial| should return true. */
      |   double mProbability;
      | 
      |   /*
      |    * The value of 1/std::log(1 - mProbability), cached for repeated use.
      |    *
      |    * If mProbability is exactly 0 or exactly 1, we don't use this value.
      |    * Otherwise, we guarantee this value is in the range [-2**53, -1/37), i.e.
      |    * definitely negative, as required by chooseSkipCount. See setProbability for
      |    * the details.
      |    */
      |   double mInvLogNotProbability;
      | 
      |   /* Our random number generator. */
      |   non_crypto::XorShift128PlusRNG mGenerator;
      | 
      |   /* The number of times |trial| should return false before next returning true. */
      |   size_t mSkipCount;
      | 
      |   /*
      |    * Choose the next skip count. This also returns the value that |trial| should
      |    * return, since we have to check for the extreme values for mProbability
      |    * anyway, and |trial| should never return true at all when mProbability is 0.
      |    */
    9 |   bool chooseSkipCount() {
    9 |     /*
    9 |      * If the probability is 1.0, every call to |trial| returns true. Make sure
    9 |      * mSkipCount is 0.
    9 |      */
    9 |     if (mProbability == 1.0) {
    9 |       mSkipCount = 0;
    9 |       return true;
    9 |     }
    0 | 
    0 |     /*
    0 |      * If the probability is zero, |trial| never returns true. Don't bother us
    0 |      * for a while.
    0 |      */
    0 |     if (mProbability == 0.0) {
    0 |       mSkipCount = SIZE_MAX;
    0 |       return false;
    0 |     }
    0 | 
    0 |     /*
    0 |      * What sorts of values can this call to std::floor produce?
    0 |      *
    0 |      * Since mGenerator.nextDouble returns a value in [0, 1-2**-53], std::log
    0 |      * returns a value in the range [-infinity, -2**-53], all negative. Since
    0 |      * mInvLogNotProbability is negative (see its comments), the product is
    0 |      * positive and possibly infinite. std::floor returns +infinity unchanged.
    0 |      * So the result will always be positive.
    0 |      *
    0 |      * Converting a double to an integer that is out of range for that integer
    0 |      * is undefined behavior, so we must clamp our result to SIZE_MAX, to ensure
    0 |      * we get an acceptable value for mSkipCount.
    0 |      *
    0 |      * The clamp is written carefully. Note that if we had said:
    0 |      *
    0 |      *    if (skipCount > SIZE_MAX)
    0 |      *       skipCount = SIZE_MAX;
    0 |      *
    0 |      * that leads to undefined behavior on 64-bit machines: SIZE_MAX coerced to
    0 |      * double is 2^64, not 2^64-1, so this doesn't actually set skipCount to a
    0 |      * value that can be safely assigned to mSkipCount.
    0 |      *
    0 |      * Jakob Olesen cleverly suggested flipping the sense of the comparison: if
    0 |      * we require that skipCount < SIZE_MAX, then because of the gaps (2048)
    0 |      * between doubles at that magnitude, the highest double less than 2^64 is
    0 |      * 2^64 - 2048, which is fine to store in a size_t.
    0 |      *
    0 |      * (On 32-bit machines, all size_t values can be represented exactly in
    0 |      * double, so all is well.)
    0 |      */
    0 |     double skipCount = std::floor(std::log(mGenerator.nextDouble())
    0 |                                   * mInvLogNotProbability);
    0 |     if (skipCount < SIZE_MAX)
    0 |       mSkipCount = skipCount;
    0 |     else
    0 |       mSkipCount = SIZE_MAX;
    0 | 
    0 |     return true;
    0 |   }
      | };
      | 
      | }  /* namespace mozilla */
      | 
      | #endif /* mozilla_FastBernoulliTrial_h */
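The carefully-written clamp in chooseSkipCount can be exercised in isolation. A sketch (the helper name `clampToSizeT` is ours) of the same flipped comparison, assuming a 64-bit size_t:

```cpp
#include <cmath>
#include <cstdint>

// Clamp a non-negative double into size_t without undefined behavior.
// As in chooseSkipCount, the test is `d < SIZE_MAX`: SIZE_MAX converted to
// double is 2**64 (on 64-bit targets), and the largest double below that is
// 2**64 - 2048, which fits in size_t -- so the conversion in the true branch
// is always in range.
size_t clampToSizeT(double d) {
  if (d < SIZE_MAX) {
    return static_cast<size_t>(d);
  }
  return SIZE_MAX;
}
```

Infinity (the product in chooseSkipCount can be infinite when nextDouble returns 0) and any finite double at or above 2**64 both land safely on SIZE_MAX.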