Coverage Report

Created: 2025-07-23 07:16

/src/regex/regex-automata/src/nfa/thompson/range_trie.rs
Line
Count
Source (jump to first uncovered line)
1
/*
2
I've called the primary data structure in this module a "range trie." As far
3
as I can tell, there is no prior art on a data structure like this, however,
4
it's likely someone somewhere has built something like it. Searching for
5
"range trie" turns up the paper "Range Tries for Scalable Address Lookup,"
6
but it does not appear relevant.
7
8
The range trie is just like a trie in that it is a special case of a
9
deterministic finite state machine. It has states and each state has a set
10
of transitions to other states. It is acyclic, and, like a normal trie,
11
it makes no attempt to reuse common suffixes among its elements. The key
12
difference between a normal trie and a range trie below is that a range trie
13
operates on *contiguous sequences* of bytes instead of singleton bytes.
14
One could say say that our alphabet is ranges of bytes instead of bytes
15
themselves, except a key part of range trie construction is splitting ranges
16
apart to ensure there is at most one transition that can be taken for any
17
byte in a given state.
18
19
I've tried to explain the details of how the range trie works below, so
20
for now, we are left with trying to understand what problem we're trying to
21
solve. Which is itself fairly involved!
22
23
At the highest level, here's what we want to do. We want to convert a
24
sequence of Unicode codepoints into a finite state machine whose transitions
25
are over *bytes* and *not* Unicode codepoints. We want this because it makes
26
said finite state machines much smaller and much faster to execute. As a
27
simple example, consider a byte oriented automaton for all Unicode scalar
28
values (0x00 through 0x10FFFF, not including surrogate codepoints):
29
30
    [00-7F]
31
    [C2-DF][80-BF]
32
    [E0-E0][A0-BF][80-BF]
33
    [E1-EC][80-BF][80-BF]
34
    [ED-ED][80-9F][80-BF]
35
    [EE-EF][80-BF][80-BF]
36
    [F0-F0][90-BF][80-BF][80-BF]
37
    [F1-F3][80-BF][80-BF][80-BF]
38
    [F4-F4][80-8F][80-BF][80-BF]
39
40
(These byte ranges are generated via the regex-syntax::utf8 module, which
41
was based on Russ Cox's code in RE2, which was in turn based on Ken
42
Thompson's implementation of the same idea in his Plan9 implementation of
43
grep.)
44
45
It should be fairly straight-forward to see how one could compile this into
46
a DFA. The sequences are sorted and non-overlapping. Essentially, you could
47
build a trie from this fairly easy. The problem comes when your initial
48
range (in this case, 0x00-0x10FFFF) isn't so nice. For example, the class
49
represented by '\w' contains only a tenth of the codepoints that
50
0x00-0x10FFFF contains, but if we were to write out the byte based ranges
51
as we did above, the list would stretch to 892 entries! This turns into
52
quite a large NFA with a few thousand states. Turning this beast into a DFA
53
takes quite a bit of time. We are thus left with trying to trim down the
54
number of states we produce as early as possible.
55
56
One approach (used by RE2 and still by the regex crate, at time of writing)
57
is to try to find common suffixes while building NFA states for the above
58
and reuse them. This is very cheap to do and one can control precisely how
59
much extra memory you want to use for the cache.
60
61
Another approach, however, is to reuse an algorithm for constructing a
62
*minimal* DFA from a sorted sequence of inputs. I don't want to go into
63
the full details here, but I explain it in more depth in my blog post on
64
FSTs[1]. Note that the algorithm was not invented by me, but was published
65
in paper by Daciuk et al. in 2000 called "Incremental Construction of
66
MinimalAcyclic Finite-State Automata." Like the suffix cache approach above,
67
it is also possible to control the amount of extra memory one uses, although
68
this usually comes with the cost of sacrificing true minimality. (But it's
69
typically close enough with a reasonably sized cache of states.)
70
71
The catch is that Daciuk's algorithm only works if you add your keys in
72
lexicographic ascending order. In our case, since we're dealing with ranges,
73
we also need the additional requirement that ranges are either equivalent
74
or do not overlap at all. For example, if one were given the following byte
75
ranges:
76
77
    [BC-BF][80-BF]
78
    [BC-BF][90-BF]
79
80
Then Daciuk's algorithm would not work, since there is nothing to handle the
81
fact that the ranges overlap. They would need to be split apart. Thankfully,
82
Thompson's algorithm for producing byte ranges for Unicode codepoint ranges
83
meets both of our requirements. (A proof for this eludes me, but it appears
84
true.)
85
86
... however, we would also like to be able to compile UTF-8 automata in
87
reverse. We want this because in order to find the starting location of a
88
match using a DFA, we need to run a second DFA---a reversed version of the
89
forward DFA---backwards to discover the match location. Unfortunately, if
90
we reverse our byte sequences for 0x00-0x10FFFF, we get sequences that are
91
can overlap, even if they are sorted:
92
93
    [00-7F]
94
    [80-BF][80-9F][ED-ED]
95
    [80-BF][80-BF][80-8F][F4-F4]
96
    [80-BF][80-BF][80-BF][F1-F3]
97
    [80-BF][80-BF][90-BF][F0-F0]
98
    [80-BF][80-BF][E1-EC]
99
    [80-BF][80-BF][EE-EF]
100
    [80-BF][A0-BF][E0-E0]
101
    [80-BF][C2-DF]
102
103
For example, '[80-BF][80-BF][EE-EF]' and '[80-BF][A0-BF][E0-E0]' have
104
overlapping ranges between '[80-BF]' and '[A0-BF]'. Thus, there is no
105
simple way to apply Daciuk's algorithm.
106
107
And thus, the range trie was born. The range trie's only purpose is to take
108
sequences of byte ranges like the ones above, collect them into a trie and then
109
spit them out in a sorted fashion with no overlapping ranges. For example,
110
0x00-0x10FFFF gets translated to:
111
112
    [0-7F]
113
    [80-BF][80-9F][80-8F][F1-F3]
114
    [80-BF][80-9F][80-8F][F4]
115
    [80-BF][80-9F][90-BF][F0]
116
    [80-BF][80-9F][90-BF][F1-F3]
117
    [80-BF][80-9F][E1-EC]
118
    [80-BF][80-9F][ED]
119
    [80-BF][80-9F][EE-EF]
120
    [80-BF][A0-BF][80-8F][F1-F3]
121
    [80-BF][A0-BF][80-8F][F4]
122
    [80-BF][A0-BF][90-BF][F0]
123
    [80-BF][A0-BF][90-BF][F1-F3]
124
    [80-BF][A0-BF][E0]
125
    [80-BF][A0-BF][E1-EC]
126
    [80-BF][A0-BF][EE-EF]
127
    [80-BF][C2-DF]
128
129
We've thus satisfied our requirements for running Daciuk's algorithm. All
130
sequences of ranges are sorted, and any corresponding ranges are either
131
exactly equivalent or non-overlapping.
132
133
In effect, a range trie is building a DFA from a sequence of arbitrary byte
134
ranges. But it uses an algorithm custom tailored to its input, so it is not as
135
costly as traditional DFA construction. While it is still quite a bit more
136
costly than the forward case (which only needs Daciuk's algorithm), it winds
137
up saving a substantial amount of time if one is doing a full DFA powerset
138
construction later by virtue of producing a much much smaller NFA.
139
140
[1] - https://blog.burntsushi.net/transducers/
141
[2] - https://www.mitpressjournals.org/doi/pdfplus/10.1162/089120100561601
142
*/
143
144
use core::{cell::RefCell, fmt, mem, ops::RangeInclusive};
145
146
use alloc::{format, string::String, vec, vec::Vec};
147
148
use regex_syntax::utf8::Utf8Range;
149
150
use crate::util::primitives::StateID;
151
152
/// There is only one final state in this trie. Every sequence of byte ranges
153
/// added shares the same final state.
154
const FINAL: StateID = StateID::ZERO;
155
156
/// The root state of the trie.
157
const ROOT: StateID = StateID::new_unchecked(1);
158
159
/// A range trie represents an ordered set of sequences of bytes.
160
///
161
/// A range trie accepts as input a sequence of byte ranges and merges
162
/// them into the existing set such that the trie can produce a sorted
163
/// non-overlapping sequence of byte ranges. The sequence emitted corresponds
164
/// precisely to the sequence of bytes matched by the given keys, although the
165
/// byte ranges themselves may be split at different boundaries.
166
///
167
/// The order complexity of this data structure seems difficult to analyze.
168
/// If the size of a byte is held as a constant, then insertion is clearly
169
/// O(n) where n is the number of byte ranges in the input key. However, if
170
/// k=256 is our alphabet size, then insertion could be O(k^2 * n). In
171
/// particular it seems possible for pathological inputs to cause insertion
172
/// to do a lot of work. However, for what we use this data structure for,
173
/// there should be no pathological inputs since the ultimate source is always
174
/// a sorted set of Unicode scalar value ranges.
175
///
176
/// Internally, this trie is setup like a finite state machine. Note though
177
/// that it is acyclic.
178
#[derive(Clone)]
179
pub struct RangeTrie {
180
    /// The states in this trie. The first is always the shared final state.
181
    /// The second is always the root state. Otherwise, there is no
182
    /// particular order.
183
    states: Vec<State>,
184
    /// A free-list of states. When a range trie is cleared, all of its states
185
    /// are added to this list. Creating a new state reuses states from this
186
    /// list before allocating a new one.
187
    free: Vec<State>,
188
    /// A stack for traversing this trie to yield sequences of byte ranges in
189
    /// lexicographic order.
190
    iter_stack: RefCell<Vec<NextIter>>,
191
    /// A buffer that stores the current sequence during iteration.
192
    iter_ranges: RefCell<Vec<Utf8Range>>,
193
    /// A stack used for traversing the trie in order to (deeply) duplicate
194
    /// a state. States are recursively duplicated when ranges are split.
195
    dupe_stack: Vec<NextDupe>,
196
    /// A stack used for traversing the trie during insertion of a new
197
    /// sequence of byte ranges.
198
    insert_stack: Vec<NextInsert>,
199
}
200
201
/// A single state in this trie.
202
#[derive(Clone)]
203
struct State {
204
    /// A sorted sequence of non-overlapping transitions to other states. Each
205
    /// transition corresponds to a single range of bytes.
206
    transitions: Vec<Transition>,
207
}
208
209
/// A transition is a single range of bytes. If a particular byte is in this
210
/// range, then the corresponding machine may transition to the state pointed
211
/// to by `next_id`.
212
#[derive(Clone)]
213
struct Transition {
214
    /// The byte range.
215
    range: Utf8Range,
216
    /// The next state to transition to.
217
    next_id: StateID,
218
}
219
220
impl RangeTrie {
221
    /// Create a new empty range trie.
222
504k
    pub fn new() -> RangeTrie {
223
504k
        let mut trie = RangeTrie {
224
504k
            states: vec![],
225
504k
            free: vec![],
226
504k
            iter_stack: RefCell::new(vec![]),
227
504k
            iter_ranges: RefCell::new(vec![]),
228
504k
            dupe_stack: vec![],
229
504k
            insert_stack: vec![],
230
504k
        };
231
504k
        trie.clear();
232
504k
        trie
233
504k
    }
234
235
    /// Clear this range trie such that it is empty. Clearing a range trie
236
    /// and reusing it can beneficial because this may reuse allocations.
237
504k
    pub fn clear(&mut self) {
238
504k
        self.free.extend(self.states.drain(..));
239
504k
        self.add_empty(); // final
240
504k
        self.add_empty(); // root
241
504k
    }
242
243
    /// Iterate over all of the sequences of byte ranges in this trie, and
244
    /// call the provided function for each sequence. Iteration occurs in
245
    /// lexicographic order.
246
0
    pub fn iter<E, F: FnMut(&[Utf8Range]) -> Result<(), E>>(
247
0
        &self,
248
0
        mut f: F,
249
0
    ) -> Result<(), E> {
250
0
        let mut stack = self.iter_stack.borrow_mut();
251
0
        stack.clear();
252
0
        let mut ranges = self.iter_ranges.borrow_mut();
253
0
        ranges.clear();
254
0
255
0
        // We do iteration in a way that permits us to use a single buffer
256
0
        // for our keys. We iterate in a depth first fashion, while being
257
0
        // careful to expand our frontier as we move deeper in the trie.
258
0
        stack.push(NextIter { state_id: ROOT, tidx: 0 });
259
0
        while let Some(NextIter { mut state_id, mut tidx }) = stack.pop() {
260
            // This could be implemented more simply without an inner loop
261
            // here, but at the cost of more stack pushes.
262
            loop {
263
0
                let state = self.state(state_id);
264
0
                // If we've visited all transitions in this state, then pop
265
0
                // back to the parent state.
266
0
                if tidx >= state.transitions.len() {
267
0
                    ranges.pop();
268
0
                    break;
269
0
                }
270
0
271
0
                let t = &state.transitions[tidx];
272
0
                ranges.push(t.range);
273
0
                if t.next_id == FINAL {
274
0
                    f(&ranges)?;
275
0
                    ranges.pop();
276
0
                    tidx += 1;
277
0
                } else {
278
0
                    // Expand our frontier. Once we come back to this state
279
0
                    // via the stack, start in on the next transition.
280
0
                    stack.push(NextIter { state_id, tidx: tidx + 1 });
281
0
                    // Otherwise, move to the first transition of the next
282
0
                    // state.
283
0
                    state_id = t.next_id;
284
0
                    tidx = 0;
285
0
                }
286
            }
287
        }
288
0
        Ok(())
289
0
    }
290
291
    /// Inserts a new sequence of ranges into this trie.
292
    ///
293
    /// The sequence given must be non-empty and must not have a length
294
    /// exceeding 4.
295
0
    pub fn insert(&mut self, ranges: &[Utf8Range]) {
296
0
        assert!(!ranges.is_empty());
297
0
        assert!(ranges.len() <= 4);
298
299
0
        let mut stack = mem::replace(&mut self.insert_stack, vec![]);
300
0
        stack.clear();
301
0
302
0
        stack.push(NextInsert::new(ROOT, ranges));
303
0
        while let Some(next) = stack.pop() {
304
0
            let (state_id, ranges) = (next.state_id(), next.ranges());
305
0
            assert!(!ranges.is_empty());
306
307
0
            let (mut new, rest) = (ranges[0], &ranges[1..]);
308
0
309
0
            // i corresponds to the position of the existing transition on
310
0
            // which we are operating. Typically, the result is to remove the
311
0
            // transition and replace it with two or more new transitions
312
0
            // corresponding to the partitions generated by splitting the
313
0
            // 'new' with the ith transition's range.
314
0
            let mut i = self.state(state_id).find(new);
315
0
316
0
            // In this case, there is no overlap *and* the new range is greater
317
0
            // than all existing ranges. So we can just add it to the end.
318
0
            if i == self.state(state_id).transitions.len() {
319
0
                let next_id = NextInsert::push(self, &mut stack, rest);
320
0
                self.add_transition(state_id, new, next_id);
321
0
                continue;
322
0
            }
323
324
            // The need for this loop is a bit subtle, buf basically, after
325
            // we've handled the partitions from our initial split, it's
326
            // possible that there will be a partition leftover that overlaps
327
            // with a subsequent transition. If so, then we have to repeat
328
            // the split process again with the leftovers and that subsequent
329
            // transition.
330
            'OUTER: loop {
331
0
                let old = self.state(state_id).transitions[i].clone();
332
0
                let split = match Split::new(old.range, new) {
333
0
                    Some(split) => split,
334
                    None => {
335
0
                        let next_id = NextInsert::push(self, &mut stack, rest);
336
0
                        self.add_transition_at(i, state_id, new, next_id);
337
0
                        continue;
338
                    }
339
                };
340
0
                let splits = split.as_slice();
341
0
                // If we only have one partition, then the ranges must be
342
0
                // equivalent. There's nothing to do here for this state, so
343
0
                // just move on to the next one.
344
0
                if splits.len() == 1 {
345
                    // ... but only if we have anything left to do.
346
0
                    if !rest.is_empty() {
347
0
                        stack.push(NextInsert::new(old.next_id, rest));
348
0
                    }
349
0
                    break;
350
0
                }
351
0
                // At this point, we know that 'split' is non-empty and there
352
0
                // must be some overlap AND that the two ranges are not
353
0
                // equivalent. Therefore, the existing range MUST be removed
354
0
                // and split up somehow. Instead of actually doing the removal
355
0
                // and then a subsequent insertion---with all the memory
356
0
                // shuffling that entails---we simply overwrite the transition
357
0
                // at position `i` for the first new transition we want to
358
0
                // insert. After that, we're forced to do expensive inserts.
359
0
                let mut first = true;
360
0
                let mut add_trans =
361
0
                    |trie: &mut RangeTrie, pos, from, range, to| {
362
0
                        if first {
363
0
                            trie.set_transition_at(pos, from, range, to);
364
0
                            first = false;
365
0
                        } else {
366
0
                            trie.add_transition_at(pos, from, range, to);
367
0
                        }
368
0
                    };
369
0
                for (j, &srange) in splits.iter().enumerate() {
370
0
                    match srange {
371
0
                        SplitRange::Old(r) => {
372
0
                            // Deep clone the state pointed to by the ith
373
0
                            // transition. This is always necessary since 'old'
374
0
                            // is always coupled with at least a 'both'
375
0
                            // partition. We don't want any new changes made
376
0
                            // via the 'both' partition to impact the part of
377
0
                            // the transition that doesn't overlap with the
378
0
                            // new range.
379
0
                            let dup_id = self.duplicate(old.next_id);
380
0
                            add_trans(self, i, state_id, r, dup_id);
381
0
                        }
382
0
                        SplitRange::New(r) => {
383
0
                            // This is a bit subtle, but if this happens to be
384
0
                            // the last partition in our split, it is possible
385
0
                            // that this overlaps with a subsequent transition.
386
0
                            // If it does, then we must repeat the whole
387
0
                            // splitting process over again with `r` and the
388
0
                            // subsequent transition.
389
0
                            {
390
0
                                let trans = &self.state(state_id).transitions;
391
0
                                if j + 1 == splits.len()
392
0
                                    && i < trans.len()
393
0
                                    && intersects(r, trans[i].range)
394
                                {
395
0
                                    new = r;
396
0
                                    continue 'OUTER;
397
0
                                }
398
0
                            }
399
0
400
0
                            // ... otherwise, setup exploration for a new
401
0
                            // empty state and add a brand new transition for
402
0
                            // this new range.
403
0
                            let next_id =
404
0
                                NextInsert::push(self, &mut stack, rest);
405
0
                            add_trans(self, i, state_id, r, next_id);
406
                        }
407
0
                        SplitRange::Both(r) => {
408
0
                            // Continue adding the remaining ranges on this
409
0
                            // path and update the transition with the new
410
0
                            // range.
411
0
                            if !rest.is_empty() {
412
0
                                stack.push(NextInsert::new(old.next_id, rest));
413
0
                            }
414
0
                            add_trans(self, i, state_id, r, old.next_id);
415
                        }
416
                    }
417
0
                    i += 1;
418
                }
419
                // If we've reached this point, then we know that there are
420
                // no subsequent transitions with any overlap. Therefore, we
421
                // can stop processing this range and move on to the next one.
422
0
                break;
423
            }
424
        }
425
0
        self.insert_stack = stack;
426
0
    }
427
428
1.00M
    pub fn add_empty(&mut self) -> StateID {
429
1.00M
        let id = match StateID::try_from(self.states.len()) {
430
1.00M
            Ok(id) => id,
431
            Err(_) => {
432
                // This generally should not happen since a range trie is
433
                // only ever used to compile a single sequence of Unicode
434
                // scalar values. If we ever got to this point, we would, at
435
                // *minimum*, be using 96GB in just the range trie alone.
436
0
                panic!("too many sequences added to range trie");
437
            }
438
        };
439
        // If we have some free states available, then use them to avoid
440
        // more allocations.
441
1.00M
        if let Some(mut state) = self.free.pop() {
442
0
            state.clear();
443
0
            self.states.push(state);
444
1.00M
        } else {
445
1.00M
            self.states.push(State { transitions: vec![] });
446
1.00M
        }
447
1.00M
        id
448
1.00M
    }
449
450
    /// Performs a deep clone of the given state and returns the duplicate's
451
    /// state ID.
452
    ///
453
    /// A "deep clone" in this context means that the state given along with
454
    /// recursively all states that it points to are copied. Once complete,
455
    /// the given state ID and the returned state ID share nothing.
456
    ///
457
    /// This is useful during range trie insertion when a new range overlaps
458
    /// with an existing range that is bigger than the new one. The part
459
    /// of the existing range that does *not* overlap with the new one is
460
    /// duplicated so that adding the new range to the overlap doesn't disturb
461
    /// the non-overlapping portion.
462
    ///
463
    /// There's one exception: if old_id is the final state, then it is not
464
    /// duplicated and the same final state is returned. This is because all
465
    /// final states in this trie are equivalent.
466
0
    fn duplicate(&mut self, old_id: StateID) -> StateID {
467
0
        if old_id == FINAL {
468
0
            return FINAL;
469
0
        }
470
0
471
0
        let mut stack = mem::replace(&mut self.dupe_stack, vec![]);
472
0
        stack.clear();
473
0
474
0
        let new_id = self.add_empty();
475
0
        // old_id is the state we're cloning and new_id is the ID of the
476
0
        // duplicated state for old_id.
477
0
        stack.push(NextDupe { old_id, new_id });
478
0
        while let Some(NextDupe { old_id, new_id }) = stack.pop() {
479
0
            for i in 0..self.state(old_id).transitions.len() {
480
0
                let t = self.state(old_id).transitions[i].clone();
481
0
                if t.next_id == FINAL {
482
                    // All final states are the same, so there's no need to
483
                    // duplicate it.
484
0
                    self.add_transition(new_id, t.range, FINAL);
485
0
                    continue;
486
0
                }
487
0
488
0
                let new_child_id = self.add_empty();
489
0
                self.add_transition(new_id, t.range, new_child_id);
490
0
                stack.push(NextDupe {
491
0
                    old_id: t.next_id,
492
0
                    new_id: new_child_id,
493
0
                });
494
            }
495
        }
496
0
        self.dupe_stack = stack;
497
0
        new_id
498
0
    }
499
500
    /// Adds the given transition to the given state.
501
    ///
502
    /// Callers must ensure that all previous transitions in this state
503
    /// are lexicographically smaller than the given range.
504
0
    fn add_transition(
505
0
        &mut self,
506
0
        from_id: StateID,
507
0
        range: Utf8Range,
508
0
        next_id: StateID,
509
0
    ) {
510
0
        self.state_mut(from_id)
511
0
            .transitions
512
0
            .push(Transition { range, next_id });
513
0
    }
514
515
    /// Like `add_transition`, except this inserts the transition just before
516
    /// the ith transition.
517
0
    fn add_transition_at(
518
0
        &mut self,
519
0
        i: usize,
520
0
        from_id: StateID,
521
0
        range: Utf8Range,
522
0
        next_id: StateID,
523
0
    ) {
524
0
        self.state_mut(from_id)
525
0
            .transitions
526
0
            .insert(i, Transition { range, next_id });
527
0
    }
528
529
    /// Overwrites the transition at position i with the given transition.
530
0
    fn set_transition_at(
531
0
        &mut self,
532
0
        i: usize,
533
0
        from_id: StateID,
534
0
        range: Utf8Range,
535
0
        next_id: StateID,
536
0
    ) {
537
0
        self.state_mut(from_id).transitions[i] = Transition { range, next_id };
538
0
    }
539
540
    /// Return an immutable borrow for the state with the given ID.
541
0
    fn state(&self, id: StateID) -> &State {
542
0
        &self.states[id]
543
0
    }
544
545
    /// Return a mutable borrow for the state with the given ID.
546
0
    fn state_mut(&mut self, id: StateID) -> &mut State {
547
0
        &mut self.states[id]
548
0
    }
549
}
550
551
impl State {
552
    /// Find the position at which the given range should be inserted in this
553
    /// state.
554
    ///
555
    /// The position returned is always in the inclusive range
556
    /// [0, transitions.len()]. If 'transitions.len()' is returned, then the
557
    /// given range overlaps with no other range in this state *and* is greater
558
    /// than all of them.
559
    ///
560
    /// For all other possible positions, the given range either overlaps
561
    /// with the transition at that position or is otherwise less than it
562
    /// with no overlap (and is greater than the previous transition). In the
563
    /// former case, careful attention must be paid to inserting this range
564
    /// as a new transition. In the latter case, the range can be inserted as
565
    /// a new transition at the given position without disrupting any other
566
    /// transitions.
567
0
    fn find(&self, range: Utf8Range) -> usize {
568
        /// Returns the position `i` at which `pred(xs[i])` first returns true
569
        /// such that for all `j >= i`, `pred(xs[j]) == true`. If `pred` never
570
        /// returns true, then `xs.len()` is returned.
571
        ///
572
        /// We roll our own binary search because it doesn't seem like the
573
        /// standard library's binary search can be used here. Namely, if
574
        /// there is an overlapping range, then we want to find the first such
575
        /// occurrence, but there may be many. Or at least, it's not quite
576
        /// clear to me how to do it.
577
0
        fn binary_search<T, F>(xs: &[T], mut pred: F) -> usize
578
0
        where
579
0
            F: FnMut(&T) -> bool,
580
0
        {
581
0
            let (mut left, mut right) = (0, xs.len());
582
0
            while left < right {
583
                // Overflow is impossible because xs.len() <= 256.
584
0
                let mid = (left + right) / 2;
585
0
                if pred(&xs[mid]) {
586
0
                    right = mid;
587
0
                } else {
588
0
                    left = mid + 1;
589
0
                }
590
            }
591
0
            left
592
0
        }
593
594
        // Benchmarks suggest that binary search is just a bit faster than
595
        // straight linear search. Specifically when using the debug tool:
596
        //
597
        //   hyperfine "regex-cli debug thompson -qr --captures none '\w{90} ecurB'"
598
0
        binary_search(&self.transitions, |t| range.start <= t.range.end)
599
0
    }
600
601
    /// Clear this state such that it has zero transitions.
602
0
    fn clear(&mut self) {
603
0
        self.transitions.clear();
604
0
    }
605
}
606
607
/// The next state to process during duplication.
608
#[derive(Clone, Debug)]
609
struct NextDupe {
610
    /// The state we want to duplicate.
611
    old_id: StateID,
612
    /// The ID of the new state that is a duplicate of old_id.
613
    new_id: StateID,
614
}
615
616
/// The next state (and its corresponding transition) that we want to visit
617
/// during iteration in lexicographic order.
618
#[derive(Clone, Debug)]
619
struct NextIter {
620
    state_id: StateID,
621
    tidx: usize,
622
}
623
624
/// The next state to process during insertion and any remaining ranges that we
625
/// want to add for a particular sequence of ranges. The first such instance
626
/// is always the root state along with all ranges given.
627
#[derive(Clone, Debug)]
628
struct NextInsert {
629
    /// The next state to begin inserting ranges. This state should be the
630
    /// state at which `ranges[0]` should be inserted.
631
    state_id: StateID,
632
    /// The ranges to insert. We used a fixed-size array here to avoid an
633
    /// allocation.
634
    ranges: [Utf8Range; 4],
635
    /// The number of valid ranges in the above array.
636
    len: u8,
637
}
638
639
impl NextInsert {
640
    /// Create the next item to visit. The given state ID should correspond
641
    /// to the state at which the first range in the given slice should be
642
    /// inserted. The slice given must not be empty and it must be no longer
643
    /// than 4.
644
0
    fn new(state_id: StateID, ranges: &[Utf8Range]) -> NextInsert {
645
0
        let len = ranges.len();
646
0
        assert!(len > 0);
647
0
        assert!(len <= 4);
648
649
0
        let mut tmp = [Utf8Range { start: 0, end: 0 }; 4];
650
0
        tmp[..len].copy_from_slice(ranges);
651
0
        NextInsert { state_id, ranges: tmp, len: u8::try_from(len).unwrap() }
652
0
    }
653
654
    /// Push a new empty state to visit along with any remaining ranges that
655
    /// still need to be inserted. The ID of the new empty state is returned.
656
    ///
657
    /// If ranges is empty, then no new state is created and FINAL is returned.
658
0
    fn push(
659
0
        trie: &mut RangeTrie,
660
0
        stack: &mut Vec<NextInsert>,
661
0
        ranges: &[Utf8Range],
662
0
    ) -> StateID {
663
0
        if ranges.is_empty() {
664
0
            FINAL
665
        } else {
666
0
            let next_id = trie.add_empty();
667
0
            stack.push(NextInsert::new(next_id, ranges));
668
0
            next_id
669
        }
670
0
    }
671
672
    /// Return the ID of the state to visit.
673
0
    fn state_id(&self) -> StateID {
674
0
        self.state_id
675
0
    }
676
677
    /// Return the remaining ranges to insert.
678
0
    fn ranges(&self) -> &[Utf8Range] {
679
0
        &self.ranges[..usize::try_from(self.len).unwrap()]
680
0
    }
681
}
682
683
/// Split represents a partitioning of two ranges into one or more ranges. This
684
/// is the secret sauce that makes a range trie work, as it's what tells us
685
/// how to deal with two overlapping but unequal ranges during insertion.
686
///
687
/// Essentially, either two ranges overlap or they don't. If they don't, then
688
/// handling insertion is easy: just insert the new range into its
689
/// lexicographically correct position. Since it does not overlap with anything
690
/// else, no other transitions are impacted by the new range.
691
///
692
/// If they do overlap though, there are generally three possible cases to
693
/// handle:
694
///
695
/// 1. The part where the two ranges actually overlap. i.e., The intersection.
696
/// 2. The part of the existing range that is not in the new range.
697
/// 3. The part of the new range that is not in the old range.
698
///
699
/// (1) is guaranteed to always occur since all overlapping ranges have a
700
/// non-empty intersection. If the two ranges are not equivalent, then at
701
/// least one of (2) or (3) is guaranteed to occur as well. In some cases,
702
/// e.g., `[0-4]` and `[4-9]`, all three cases will occur.
703
///
704
/// This `Split` type is responsible for providing (1), (2) and (3) for any
705
/// possible pair of byte ranges.
706
///
707
/// As for insertion, for the overlap in (1), the remaining ranges to insert
708
/// should be added by following the corresponding transition. However, this
709
/// should only be done for the overlapping parts of the range. If there was
710
/// a part of the existing range that was not in the new range, then that
711
/// existing part must be split off from the transition and duplicated. The
712
/// remaining parts of the overlap can then be added to using the new ranges
713
/// without disturbing the existing range.
714
///
715
/// Handling the case for the part of a new range that is not in an existing
716
/// range is seemingly easy. Just treat it as if it were a non-overlapping
717
/// range. The problem here is that if this new non-overlapping range occurs
718
/// after both (1) and (2), then it's possible that it can overlap with the
719
/// next transition in the current state. If it does, then the whole process
720
/// must be repeated!
721
///
722
/// # Details of the 3 cases
723
///
724
/// The following details the various cases that are implemented in code
725
/// below. It's plausible that the number of cases is not actually minimal,
726
/// but it's important for this code to remain at least somewhat readable.
727
///
728
/// Given [a,b] and [x,y], where a <= b, x <= y, b < 256 and y < 256, we define
729
/// the follow distinct relationships where at least one must apply. The order
730
/// of these matters, since multiple can match. The first to match applies.
731
///
732
///   1. b < x <=> [a,b] < [x,y]
733
///   2. y < a <=> [x,y] < [a,b]
734
///
735
/// In the case of (1) and (2), these are the only cases where there is no
736
/// overlap. Or otherwise, the intersection of [a,b] and [x,y] is empty. In
737
/// order to compute the intersection, one can do [max(a,x), min(b,y)]. The
738
/// intersection in all of the following cases is non-empty.
739
///
740
///    3. a = x && b = y <=> [a,b] == [x,y]
741
///    4. a = x && b < y <=> [x,y] right-extends [a,b]
742
///    5. b = y && a > x <=> [x,y] left-extends [a,b]
743
///    6. x = a && y < b <=> [a,b] right-extends [x,y]
744
///    7. y = b && x > a <=> [a,b] left-extends [x,y]
745
///    8. a > x && b < y <=> [x,y] covers [a,b]
746
///    9. x > a && y < b <=> [a,b] covers [x,y]
747
///   10. b = x && a < y <=> [a,b] is left-adjacent to [x,y]
748
///   11. y = a && x < b <=> [x,y] is left-adjacent to [a,b]
749
///   12. b > x && b < y <=> [a,b] left-overlaps [x,y]
750
///   13. y > a && y < b <=> [x,y] left-overlaps [a,b]
751
///
752
/// In cases 3-13, we can form rules that partition the ranges into a
753
/// non-overlapping ordered sequence of ranges:
754
///
755
///    3. [a,b]
756
///    4. [a,b], [b+1,y]
757
///    5. [x,a-1], [a,b]
758
///    6. [x,y], [y+1,b]
759
///    7. [a,x-1], [x,y]
760
///    8. [x,a-1], [a,b], [b+1,y]
761
///    9. [a,x-1], [x,y], [y+1,b]
762
///   10. [a,b-1], [b,b], [b+1,y]
763
///   11. [x,y-1], [y,y], [y+1,b]
764
///   12. [a,x-1], [x,b], [b+1,y]
765
///   13. [x,a-1], [a,y], [y+1,b]
766
///
767
/// In the code below, we go a step further and identify each of the above
768
/// outputs as belonging either to the overlap of the two ranges or to one
769
/// of [a,b] or [x,y] exclusively.
770
#[derive(Clone, Debug, Eq, PartialEq)]
771
struct Split {
772
    partitions: [SplitRange; 3],
773
    len: usize,
774
}
775
776
/// A tagged range indicating how it was derived from a pair of ranges.
777
#[derive(Clone, Copy, Debug, Eq, PartialEq)]
778
enum SplitRange {
779
    Old(Utf8Range),
780
    New(Utf8Range),
781
    Both(Utf8Range),
782
}
783
784
impl Split {
785
    /// Create a partitioning of the given ranges.
786
    ///
787
    /// If the given ranges have an empty intersection, then None is returned.
788
0
    fn new(o: Utf8Range, n: Utf8Range) -> Option<Split> {
789
0
        let range = |r: RangeInclusive<u8>| Utf8Range {
790
0
            start: *r.start(),
791
0
            end: *r.end(),
792
0
        };
793
0
        let old = |r| SplitRange::Old(range(r));
794
0
        let new = |r| SplitRange::New(range(r));
795
0
        let both = |r| SplitRange::Both(range(r));
796
797
        // Use same names as the comment above to make it easier to compare.
798
0
        let (a, b, x, y) = (o.start, o.end, n.start, n.end);
799
0
800
0
        if b < x || y < a {
801
            // case 1, case 2
802
0
            None
803
0
        } else if a == x && b == y {
804
            // case 3
805
0
            Some(Split::parts1(both(a..=b)))
806
0
        } else if a == x && b < y {
807
            // case 4
808
0
            Some(Split::parts2(both(a..=b), new(b + 1..=y)))
809
0
        } else if b == y && a > x {
810
            // case 5
811
0
            Some(Split::parts2(new(x..=a - 1), both(a..=b)))
812
0
        } else if x == a && y < b {
813
            // case 6
814
0
            Some(Split::parts2(both(x..=y), old(y + 1..=b)))
815
0
        } else if y == b && x > a {
816
            // case 7
817
0
            Some(Split::parts2(old(a..=x - 1), both(x..=y)))
818
0
        } else if a > x && b < y {
819
            // case 8
820
0
            Some(Split::parts3(new(x..=a - 1), both(a..=b), new(b + 1..=y)))
821
0
        } else if x > a && y < b {
822
            // case 9
823
0
            Some(Split::parts3(old(a..=x - 1), both(x..=y), old(y + 1..=b)))
824
0
        } else if b == x && a < y {
825
            // case 10
826
0
            Some(Split::parts3(old(a..=b - 1), both(b..=b), new(b + 1..=y)))
827
0
        } else if y == a && x < b {
828
            // case 11
829
0
            Some(Split::parts3(new(x..=y - 1), both(y..=y), old(y + 1..=b)))
830
0
        } else if b > x && b < y {
831
            // case 12
832
0
            Some(Split::parts3(old(a..=x - 1), both(x..=b), new(b + 1..=y)))
833
0
        } else if y > a && y < b {
834
            // case 13
835
0
            Some(Split::parts3(new(x..=a - 1), both(a..=y), old(y + 1..=b)))
836
        } else {
837
0
            unreachable!()
838
        }
839
0
    }
840
841
    /// Create a new split with a single partition. This only occurs when two
842
    /// ranges are equivalent.
843
0
    fn parts1(r1: SplitRange) -> Split {
844
0
        // This value doesn't matter since it is never accessed.
845
0
        let nada = SplitRange::Old(Utf8Range { start: 0, end: 0 });
846
0
        Split { partitions: [r1, nada, nada], len: 1 }
847
0
    }
848
849
    /// Create a new split with two partitions.
850
0
    fn parts2(r1: SplitRange, r2: SplitRange) -> Split {
851
0
        // This value doesn't matter since it is never accessed.
852
0
        let nada = SplitRange::Old(Utf8Range { start: 0, end: 0 });
853
0
        Split { partitions: [r1, r2, nada], len: 2 }
854
0
    }
855
856
    /// Create a new split with three partitions.
857
0
    fn parts3(r1: SplitRange, r2: SplitRange, r3: SplitRange) -> Split {
858
0
        Split { partitions: [r1, r2, r3], len: 3 }
859
0
    }
860
861
    /// Return the partitions in this split as a slice.
862
0
    fn as_slice(&self) -> &[SplitRange] {
863
0
        &self.partitions[..self.len]
864
0
    }
865
}
866
867
impl fmt::Debug for RangeTrie {
868
0
    fn fmt(&self, f: &mut fmt::Formatter<'_>) -> fmt::Result {
869
0
        writeln!(f, "")?;
870
0
        for (i, state) in self.states.iter().enumerate() {
871
0
            let status = if i == FINAL.as_usize() { '*' } else { ' ' };
872
0
            writeln!(f, "{}{:06}: {:?}", status, i, state)?;
873
        }
874
0
        Ok(())
875
0
    }
876
}
877
878
impl fmt::Debug for State {
879
0
    fn fmt(&self, f: &mut fmt::Formatter<'_>) -> fmt::Result {
880
0
        let rs = self
881
0
            .transitions
882
0
            .iter()
883
0
            .map(|t| format!("{:?}", t))
884
0
            .collect::<Vec<String>>()
885
0
            .join(", ");
886
0
        write!(f, "{}", rs)
887
0
    }
888
}
889
890
impl fmt::Debug for Transition {
891
0
    fn fmt(&self, f: &mut fmt::Formatter<'_>) -> fmt::Result {
892
0
        if self.range.start == self.range.end {
893
0
            write!(
894
0
                f,
895
0
                "{:02X} => {:02X}",
896
0
                self.range.start,
897
0
                self.next_id.as_usize(),
898
0
            )
899
        } else {
900
0
            write!(
901
0
                f,
902
0
                "{:02X}-{:02X} => {:02X}",
903
0
                self.range.start,
904
0
                self.range.end,
905
0
                self.next_id.as_usize(),
906
0
            )
907
        }
908
0
    }
909
}
910
911
/// Returns true if and only if the given ranges intersect.
912
0
fn intersects(r1: Utf8Range, r2: Utf8Range) -> bool {
913
0
    !(r1.end < r2.start || r2.end < r1.start)
914
0
}
915
916
#[cfg(test)]
917
mod tests {
918
    use super::*;
919
920
    fn r(range: RangeInclusive<u8>) -> Utf8Range {
921
        Utf8Range { start: *range.start(), end: *range.end() }
922
    }
923
924
    fn split_maybe(
925
        old: RangeInclusive<u8>,
926
        new: RangeInclusive<u8>,
927
    ) -> Option<Split> {
928
        Split::new(r(old), r(new))
929
    }
930
931
    fn split(
932
        old: RangeInclusive<u8>,
933
        new: RangeInclusive<u8>,
934
    ) -> Vec<SplitRange> {
935
        split_maybe(old, new).unwrap().as_slice().to_vec()
936
    }
937
938
    #[test]
939
    fn no_splits() {
940
        // case 1
941
        assert_eq!(None, split_maybe(0..=1, 2..=3));
942
        // case 2
943
        assert_eq!(None, split_maybe(2..=3, 0..=1));
944
    }
945
946
    #[test]
947
    fn splits() {
948
        let range = |r: RangeInclusive<u8>| Utf8Range {
949
            start: *r.start(),
950
            end: *r.end(),
951
        };
952
        let old = |r| SplitRange::Old(range(r));
953
        let new = |r| SplitRange::New(range(r));
954
        let both = |r| SplitRange::Both(range(r));
955
956
        // case 3
957
        assert_eq!(split(0..=0, 0..=0), vec![both(0..=0)]);
958
        assert_eq!(split(9..=9, 9..=9), vec![both(9..=9)]);
959
960
        // case 4
961
        assert_eq!(split(0..=5, 0..=6), vec![both(0..=5), new(6..=6)]);
962
        assert_eq!(split(0..=5, 0..=8), vec![both(0..=5), new(6..=8)]);
963
        assert_eq!(split(5..=5, 5..=8), vec![both(5..=5), new(6..=8)]);
964
965
        // case 5
966
        assert_eq!(split(1..=5, 0..=5), vec![new(0..=0), both(1..=5)]);
967
        assert_eq!(split(3..=5, 0..=5), vec![new(0..=2), both(3..=5)]);
968
        assert_eq!(split(5..=5, 0..=5), vec![new(0..=4), both(5..=5)]);
969
970
        // case 6
971
        assert_eq!(split(0..=6, 0..=5), vec![both(0..=5), old(6..=6)]);
972
        assert_eq!(split(0..=8, 0..=5), vec![both(0..=5), old(6..=8)]);
973
        assert_eq!(split(5..=8, 5..=5), vec![both(5..=5), old(6..=8)]);
974
975
        // case 7
976
        assert_eq!(split(0..=5, 1..=5), vec![old(0..=0), both(1..=5)]);
977
        assert_eq!(split(0..=5, 3..=5), vec![old(0..=2), both(3..=5)]);
978
        assert_eq!(split(0..=5, 5..=5), vec![old(0..=4), both(5..=5)]);
979
980
        // case 8
981
        assert_eq!(
982
            split(3..=6, 2..=7),
983
            vec![new(2..=2), both(3..=6), new(7..=7)],
984
        );
985
        assert_eq!(
986
            split(3..=6, 1..=8),
987
            vec![new(1..=2), both(3..=6), new(7..=8)],
988
        );
989
990
        // case 9
991
        assert_eq!(
992
            split(2..=7, 3..=6),
993
            vec![old(2..=2), both(3..=6), old(7..=7)],
994
        );
995
        assert_eq!(
996
            split(1..=8, 3..=6),
997
            vec![old(1..=2), both(3..=6), old(7..=8)],
998
        );
999
1000
        // case 10
1001
        assert_eq!(
1002
            split(3..=6, 6..=7),
1003
            vec![old(3..=5), both(6..=6), new(7..=7)],
1004
        );
1005
        assert_eq!(
1006
            split(3..=6, 6..=8),
1007
            vec![old(3..=5), both(6..=6), new(7..=8)],
1008
        );
1009
        assert_eq!(
1010
            split(5..=6, 6..=7),
1011
            vec![old(5..=5), both(6..=6), new(7..=7)],
1012
        );
1013
1014
        // case 11
1015
        assert_eq!(
1016
            split(6..=7, 3..=6),
1017
            vec![new(3..=5), both(6..=6), old(7..=7)],
1018
        );
1019
        assert_eq!(
1020
            split(6..=8, 3..=6),
1021
            vec![new(3..=5), both(6..=6), old(7..=8)],
1022
        );
1023
        assert_eq!(
1024
            split(6..=7, 5..=6),
1025
            vec![new(5..=5), both(6..=6), old(7..=7)],
1026
        );
1027
1028
        // case 12
1029
        assert_eq!(
1030
            split(3..=7, 5..=9),
1031
            vec![old(3..=4), both(5..=7), new(8..=9)],
1032
        );
1033
        assert_eq!(
1034
            split(3..=5, 4..=6),
1035
            vec![old(3..=3), both(4..=5), new(6..=6)],
1036
        );
1037
1038
        // case 13
1039
        assert_eq!(
1040
            split(5..=9, 3..=7),
1041
            vec![new(3..=4), both(5..=7), old(8..=9)],
1042
        );
1043
        assert_eq!(
1044
            split(4..=6, 3..=5),
1045
            vec![new(3..=3), both(4..=5), old(6..=6)],
1046
        );
1047
    }
1048
1049
    // Arguably there should be more tests here, but in practice, this data
1050
    // structure is well covered by the huge number of regex tests.
1051
}