Coverage Report

Created: 2025-07-23 07:16

/src/regex/regex-automata/src/util/determinize/state.rs
Line
Count
Source (jump to first uncovered line)
1
/*!
2
This module defines a DFA state representation and builders for constructing
3
DFA states.
4
5
This representation is specifically for use in implementations of NFA-to-DFA
6
conversion via powerset construction. (Also called "determinization" in this
7
crate.)
8
9
The term "DFA state" is somewhat overloaded in this crate. In some cases, it
10
refers to the set of transitions over an alphabet for a particular state. In
11
other cases, it refers to a set of NFA states. The former is really about the
12
final representation of a state in a DFA's transition table, where as the
13
latter---what this module is focused on---is closer to an intermediate form
14
that is used to help eventually build the transition table.
15
16
This module exports four types. All four types represent the same idea: an
17
ordered set of NFA states. This ordered set represents the epsilon closure of a
18
particular NFA state, where the "epsilon closure" is the set of NFA states that
19
can be transitioned to without consuming any input. i.e., Follow all of the NFA
20
state's epsilon transitions. In addition, this implementation of DFA states
21
cares about two other things: the ordered set of pattern IDs corresponding
22
to the patterns that match if the state is a match state, and the set of
23
look-behind assertions that were true when the state was created.
24
25
The first, `State`, is a frozen representation of a state that cannot be
26
modified. It may be cheaply cloned without copying the state itself and can be
27
accessed safely from multiple threads simultaneously. This type is useful for
28
when one knows that the DFA state being constructed is distinct from any other
29
previously constructed states. Namely, powerset construction, in practice,
30
requires one to keep a cache of previously created DFA states. Otherwise,
31
the number of DFA states created in memory balloons to an impractically
32
large number. For this reason, equivalent states should endeavor to have an
33
equivalent byte-level representation. (In general, "equivalency" here means,
34
"equivalent assertions, pattern IDs and NFA state IDs." We do not require that
35
full DFA minimization be implemented here. This form of equivalency is only
36
surface deep and is more-or-less a practical necessity.)
37
38
The other three types represent different phases in the construction of a
39
DFA state. Internally, these three types (and `State`) all use the same
40
byte-oriented representation. That means one can use any of the builder types
41
to check whether the state it represents already exists or not. If it does,
42
then there is no need to freeze it into a `State` (which requires an alloc and
43
a copy). Here are the three types described succinctly:
44
45
* `StateBuilderEmpty` represents a state with no pattern IDs, no assertions
46
and no NFA states. Creating a `StateBuilderEmpty` performs no allocs. A
47
`StateBuilderEmpty` can only be used to query its underlying memory capacity,
48
or to convert into a builder for recording pattern IDs and/or assertions.
49
50
* `StateBuilderMatches` represents a state with zero or more pattern IDs, zero
51
or more satisfied assertions and zero NFA state IDs. A `StateBuilderMatches`
52
can only be used for adding pattern IDs and recording assertions.
53
54
* `StateBuilderNFA` represents a state with zero or more pattern IDs, zero or
55
more satisfied assertions and zero or more NFA state IDs. A `StateBuilderNFA`
56
can only be used for adding NFA state IDs and recording some assertions.
57
58
The expected flow here is to use the above builders to construct a candidate
59
DFA state to check if it already exists. If it does, then there's no need to
60
freeze it into a `State`. If it doesn't exist, then `StateBuilderNFA::to_state`
61
can be called to freeze the builder into an immutable `State`. In either
62
case, `clear` should be called on the builder to turn it back into a
63
`StateBuilderEmpty` that reuses the underlying memory.
64
65
The main purpose for splitting the builder into these distinct types is to
66
make it impossible to do things like adding a pattern ID after adding an NFA
67
state ID. Namely, this makes it simpler to use a space-and-time efficient
68
binary representation for the state. (The format is documented on the `Repr`
69
type below.) If we just used one type for everything, it would be possible for
70
callers to use an incorrect interleaving of calls and thus result in a corrupt
71
representation. I chose to use more type machinery to make this impossible to
72
do because 1) determinization is itself pretty complex and it wouldn't be too
73
hard to foul this up and 2) there isn't too much machinery involved and it's
74
well contained.
75
76
As an optimization, sometimes states won't have certain things set. For
77
example, if the underlying NFA has no word boundary assertions, then there is
78
no reason to set a state's look-behind assertion as to whether it was generated
79
from a word byte or not. Similarly, if a state has no NFA states corresponding
80
to look-around assertions, then there is no reason to set `look_have` to a
81
non-empty set. Finally, callers usually omit unconditional epsilon transitions
82
when adding NFA state IDs since they aren't discriminatory.
83
84
Finally, the binary representation used by these states is, thankfully, not
85
serialized anywhere. So any kind of change can be made with reckless abandon,
86
as long as everything in this module agrees.
87
*/
88
89
use core::mem;
90
91
use alloc::{sync::Arc, vec::Vec};
92
93
use crate::util::{
94
    int::{I32, U32},
95
    look::LookSet,
96
    primitives::{PatternID, StateID},
97
    wire::{self, Endian},
98
};
99
100
/// A DFA state that, at its core, is represented by an ordered set of NFA
101
/// states.
102
///
103
/// This type is intended to be used only in NFA-to-DFA conversion via powerset
104
/// construction.
105
///
106
/// It may be cheaply cloned and accessed safely from multiple threads
107
/// simultaneously.
108
#[derive(Clone, Eq, Hash, PartialEq, PartialOrd, Ord)]
109
pub(crate) struct State(Arc<[u8]>);
110
111
/// This Borrow impl permits us to lookup any state in a map by its byte
112
/// representation. This is particularly convenient when one has a StateBuilder
113
/// and we want to see if a correspondingly equivalent state already exists. If
114
/// one does exist, then we can reuse the allocation required by StateBuilder
115
/// without having to convert it into a State first.
116
impl core::borrow::Borrow<[u8]> for State {
117
14.7M
    fn borrow(&self) -> &[u8] {
118
14.7M
        &*self.0
119
14.7M
    }
120
}
121
122
impl core::fmt::Debug for State {
123
0
    fn fmt(&self, f: &mut core::fmt::Formatter) -> core::fmt::Result {
124
0
        f.debug_tuple("State").field(&self.repr()).finish()
125
0
    }
126
}
127
128
/// For docs on these routines, see the internal Repr and ReprVec types below.
129
impl State {
130
253k
    pub(crate) fn dead() -> State {
131
253k
        StateBuilderEmpty::new().into_matches().into_nfa().to_state()
132
253k
    }
133
134
1.26M
    pub(crate) fn is_match(&self) -> bool {
135
1.26M
        self.repr().is_match()
136
1.26M
    }
137
138
8.13M
    pub(crate) fn is_from_word(&self) -> bool {
139
8.13M
        self.repr().is_from_word()
140
8.13M
    }
141
142
2.89M
    pub(crate) fn is_half_crlf(&self) -> bool {
143
2.89M
        self.repr().is_half_crlf()
144
2.89M
    }
145
146
5.68M
    pub(crate) fn look_have(&self) -> LookSet {
147
5.68M
        self.repr().look_have()
148
5.68M
    }
149
150
19.0M
    pub(crate) fn look_need(&self) -> LookSet {
151
19.0M
        self.repr().look_need()
152
19.0M
    }
153
154
0
    pub(crate) fn match_len(&self) -> usize {
155
0
        self.repr().match_len()
156
0
    }
157
158
0
    pub(crate) fn match_pattern(&self, index: usize) -> PatternID {
159
0
        self.repr().match_pattern(index)
160
0
    }
161
162
936k
    pub(crate) fn match_pattern_ids(&self) -> Option<Vec<PatternID>> {
163
936k
        self.repr().match_pattern_ids()
164
936k
    }
165
166
    #[cfg(all(test, not(miri)))]
167
    pub(crate) fn iter_match_pattern_ids<F: FnMut(PatternID)>(&self, f: F) {
168
        self.repr().iter_match_pattern_ids(f)
169
    }
170
171
16.2M
    pub(crate) fn iter_nfa_state_ids<F: FnMut(StateID)>(&self, f: F) {
172
16.2M
        self.repr().iter_nfa_state_ids(f)
173
16.2M
    }
174
175
3.43M
    pub(crate) fn memory_usage(&self) -> usize {
176
3.43M
        self.0.len()
177
3.43M
    }
178
179
54.2M
    fn repr(&self) -> Repr<'_> {
180
54.2M
        Repr(&*self.0)
181
54.2M
    }
182
}
183
184
/// A state builder that represents an empty state.
185
///
186
/// This is a useful "initial condition" for state construction. It has no
187
/// NFA state IDs, no assertions set and no pattern IDs. No allocations are
188
/// made when new() is called. Its main use is for being converted into a
189
/// builder that can capture assertions and pattern IDs.
190
#[derive(Clone, Debug)]
191
pub(crate) struct StateBuilderEmpty(Vec<u8>);
192
193
/// For docs on these routines, see the internal Repr and ReprVec types below.
194
impl StateBuilderEmpty {
195
16.9M
    pub(crate) fn new() -> StateBuilderEmpty {
196
16.9M
        StateBuilderEmpty(alloc::vec![])
197
16.9M
    }
198
199
16.8M
    pub(crate) fn into_matches(mut self) -> StateBuilderMatches {
200
16.8M
        self.0.extend_from_slice(&[0, 0, 0, 0, 0, 0, 0, 0, 0]);
201
16.8M
        StateBuilderMatches(self.0)
202
16.8M
    }
203
204
16.6M
    fn clear(&mut self) {
205
16.6M
        self.0.clear();
206
16.6M
    }
207
208
3.32M
    pub(crate) fn capacity(&self) -> usize {
209
3.32M
        self.0.capacity()
210
3.32M
    }
211
}
212
213
/// A state builder that collects assertions and pattern IDs.
214
///
215
/// When collecting pattern IDs is finished, this can be converted into a
216
/// builder that collects NFA state IDs.
217
#[derive(Clone)]
218
pub(crate) struct StateBuilderMatches(Vec<u8>);
219
220
impl core::fmt::Debug for StateBuilderMatches {
221
0
    fn fmt(&self, f: &mut core::fmt::Formatter) -> core::fmt::Result {
222
0
        f.debug_tuple("StateBuilderMatches").field(&self.repr()).finish()
223
0
    }
224
}
225
226
/// For docs on these routines, see the internal Repr and ReprVec types below.
227
impl StateBuilderMatches {
228
16.8M
    pub(crate) fn into_nfa(mut self) -> StateBuilderNFA {
229
16.8M
        self.repr_vec().close_match_pattern_ids();
230
16.8M
        StateBuilderNFA { repr: self.0, prev_nfa_state_id: StateID::ZERO }
231
16.8M
    }
232
233
526k
    pub(crate) fn set_is_from_word(&mut self) {
234
526k
        self.repr_vec().set_is_from_word()
235
526k
    }
236
237
16.2k
    pub(crate) fn set_is_half_crlf(&mut self) {
238
16.2k
        self.repr_vec().set_is_half_crlf()
239
16.2k
    }
240
241
776M
    pub(crate) fn look_have(&self) -> LookSet {
242
776M
        LookSet::read_repr(&self.0[1..])
243
776M
    }
244
245
2.80M
    pub(crate) fn set_look_have(
246
2.80M
        &mut self,
247
2.80M
        set: impl FnMut(LookSet) -> LookSet,
248
2.80M
    ) {
249
2.80M
        self.repr_vec().set_look_have(set)
250
2.80M
    }
<regex_automata::util::determinize::state::StateBuilderMatches>::set_look_have::<regex_automata::util::determinize::set_lookbehind_from_start::{closure#0}>
Line
Count
Source
245
77.4k
    pub(crate) fn set_look_have(
246
77.4k
        &mut self,
247
77.4k
        set: impl FnMut(LookSet) -> LookSet,
248
77.4k
    ) {
249
77.4k
        self.repr_vec().set_look_have(set)
250
77.4k
    }
<regex_automata::util::determinize::state::StateBuilderMatches>::set_look_have::<regex_automata::util::determinize::set_lookbehind_from_start::{closure#2}>
Line
Count
Source
245
10.3k
    pub(crate) fn set_look_have(
246
10.3k
        &mut self,
247
10.3k
        set: impl FnMut(LookSet) -> LookSet,
248
10.3k
    ) {
249
10.3k
        self.repr_vec().set_look_have(set)
250
10.3k
    }
<regex_automata::util::determinize::state::StateBuilderMatches>::set_look_have::<regex_automata::util::determinize::set_lookbehind_from_start::{closure#3}>
Line
Count
Source
245
24.0k
    pub(crate) fn set_look_have(
246
24.0k
        &mut self,
247
24.0k
        set: impl FnMut(LookSet) -> LookSet,
248
24.0k
    ) {
249
24.0k
        self.repr_vec().set_look_have(set)
250
24.0k
    }
<regex_automata::util::determinize::state::StateBuilderMatches>::set_look_have::<regex_automata::util::determinize::set_lookbehind_from_start::{closure#4}>
Line
Count
Source
245
3.15k
    pub(crate) fn set_look_have(
246
3.15k
        &mut self,
247
3.15k
        set: impl FnMut(LookSet) -> LookSet,
248
3.15k
    ) {
249
3.15k
        self.repr_vec().set_look_have(set)
250
3.15k
    }
<regex_automata::util::determinize::state::StateBuilderMatches>::set_look_have::<regex_automata::util::determinize::set_lookbehind_from_start::{closure#5}>
Line
Count
Source
245
5.12k
    pub(crate) fn set_look_have(
246
5.12k
        &mut self,
247
5.12k
        set: impl FnMut(LookSet) -> LookSet,
248
5.12k
    ) {
249
5.12k
        self.repr_vec().set_look_have(set)
250
5.12k
    }
<regex_automata::util::determinize::state::StateBuilderMatches>::set_look_have::<regex_automata::util::determinize::set_lookbehind_from_start::{closure#6}>
Line
Count
Source
245
8.27k
    pub(crate) fn set_look_have(
246
8.27k
        &mut self,
247
8.27k
        set: impl FnMut(LookSet) -> LookSet,
248
8.27k
    ) {
249
8.27k
        self.repr_vec().set_look_have(set)
250
8.27k
    }
<regex_automata::util::determinize::state::StateBuilderMatches>::set_look_have::<regex_automata::util::determinize::set_lookbehind_from_start::{closure#7}>
Line
Count
Source
245
15.3k
    pub(crate) fn set_look_have(
246
15.3k
        &mut self,
247
15.3k
        set: impl FnMut(LookSet) -> LookSet,
248
15.3k
    ) {
249
15.3k
        self.repr_vec().set_look_have(set)
250
15.3k
    }
<regex_automata::util::determinize::state::StateBuilderMatches>::set_look_have::<regex_automata::util::determinize::set_lookbehind_from_start::{closure#8}>
Line
Count
Source
245
1.57k
    pub(crate) fn set_look_have(
246
1.57k
        &mut self,
247
1.57k
        set: impl FnMut(LookSet) -> LookSet,
248
1.57k
    ) {
249
1.57k
        self.repr_vec().set_look_have(set)
250
1.57k
    }
Unexecuted instantiation: <regex_automata::util::determinize::state::StateBuilderMatches>::set_look_have::<regex_automata::util::determinize::set_lookbehind_from_start::{closure#9}>
<regex_automata::util::determinize::state::StateBuilderMatches>::set_look_have::<regex_automata::util::determinize::set_lookbehind_from_start::{closure#10}>
Line
Count
Source
245
15.2k
    pub(crate) fn set_look_have(
246
15.2k
        &mut self,
247
15.2k
        set: impl FnMut(LookSet) -> LookSet,
248
15.2k
    ) {
249
15.2k
        self.repr_vec().set_look_have(set)
250
15.2k
    }
<regex_automata::util::determinize::state::StateBuilderMatches>::set_look_have::<regex_automata::util::determinize::set_lookbehind_from_start::{closure#11}>
Line
Count
Source
245
8.21k
    pub(crate) fn set_look_have(
246
8.21k
        &mut self,
247
8.21k
        set: impl FnMut(LookSet) -> LookSet,
248
8.21k
    ) {
249
8.21k
        self.repr_vec().set_look_have(set)
250
8.21k
    }
<regex_automata::util::determinize::state::StateBuilderMatches>::set_look_have::<regex_automata::util::determinize::set_lookbehind_from_start::{closure#1}>
Line
Count
Source
245
25.7k
    pub(crate) fn set_look_have(
246
25.7k
        &mut self,
247
25.7k
        set: impl FnMut(LookSet) -> LookSet,
248
25.7k
    ) {
249
25.7k
        self.repr_vec().set_look_have(set)
250
25.7k
    }
<regex_automata::util::determinize::state::StateBuilderMatches>::set_look_have::<regex_automata::util::determinize::set_lookbehind_from_start::{closure#12}>
Line
Count
Source
245
15.2k
    pub(crate) fn set_look_have(
246
15.2k
        &mut self,
247
15.2k
        set: impl FnMut(LookSet) -> LookSet,
248
15.2k
    ) {
249
15.2k
        self.repr_vec().set_look_have(set)
250
15.2k
    }
<regex_automata::util::determinize::state::StateBuilderMatches>::set_look_have::<regex_automata::util::determinize::next::{closure#2}>
Line
Count
Source
245
31.2k
    pub(crate) fn set_look_have(
246
31.2k
        &mut self,
247
31.2k
        set: impl FnMut(LookSet) -> LookSet,
248
31.2k
    ) {
249
31.2k
        self.repr_vec().set_look_have(set)
250
31.2k
    }
<regex_automata::util::determinize::state::StateBuilderMatches>::set_look_have::<regex_automata::util::determinize::next::{closure#3}>
Line
Count
Source
245
2.49M
    pub(crate) fn set_look_have(
246
2.49M
        &mut self,
247
2.49M
        set: impl FnMut(LookSet) -> LookSet,
248
2.49M
    ) {
249
2.49M
        self.repr_vec().set_look_have(set)
250
2.49M
    }
<regex_automata::util::determinize::state::StateBuilderMatches>::set_look_have::<regex_automata::util::determinize::next::{closure#1}>
Line
Count
Source
245
63.0k
    pub(crate) fn set_look_have(
246
63.0k
        &mut self,
247
63.0k
        set: impl FnMut(LookSet) -> LookSet,
248
63.0k
    ) {
249
63.0k
        self.repr_vec().set_look_have(set)
250
63.0k
    }
251
252
2.42M
    pub(crate) fn add_match_pattern_id(&mut self, pid: PatternID) {
253
2.42M
        self.repr_vec().add_match_pattern_id(pid)
254
2.42M
    }
255
256
0
    fn repr(&self) -> Repr<'_> {
257
0
        Repr(&self.0)
258
0
    }
259
260
22.6M
    fn repr_vec(&mut self) -> ReprVec<'_> {
261
22.6M
        ReprVec(&mut self.0)
262
22.6M
    }
263
}
264
265
/// A state builder that collects some assertions and NFA state IDs.
266
///
267
/// When collecting NFA state IDs is finished, this can be used to build a
268
/// `State` if necessary.
269
///
270
/// When dont with building a state (regardless of whether it got kept or not),
271
/// it's usually a good idea to call `clear` to get an empty builder back so
272
/// that it can be reused to build the next state.
273
#[derive(Clone)]
274
pub(crate) struct StateBuilderNFA {
275
    repr: Vec<u8>,
276
    prev_nfa_state_id: StateID,
277
}
278
279
impl core::fmt::Debug for StateBuilderNFA {
280
0
    fn fmt(&self, f: &mut core::fmt::Formatter) -> core::fmt::Result {
281
0
        f.debug_tuple("StateBuilderNFA").field(&self.repr()).finish()
282
0
    }
283
}
284
285
/// For docs on these routines, see the internal Repr and ReprVec types below.
286
impl StateBuilderNFA {
287
2.26M
    pub(crate) fn to_state(&self) -> State {
288
2.26M
        State(Arc::from(&*self.repr))
289
2.26M
    }
290
291
16.6M
    pub(crate) fn clear(self) -> StateBuilderEmpty {
292
16.6M
        let mut builder = StateBuilderEmpty(self.repr);
293
16.6M
        builder.clear();
294
16.6M
        builder
295
16.6M
    }
296
297
16.6M
    pub(crate) fn look_need(&self) -> LookSet {
298
16.6M
        self.repr().look_need()
299
16.6M
    }
300
301
14.9M
    pub(crate) fn set_look_have(
302
14.9M
        &mut self,
303
14.9M
        set: impl FnMut(LookSet) -> LookSet,
304
14.9M
    ) {
305
14.9M
        self.repr_vec().set_look_have(set)
306
14.9M
    }
307
308
413M
    pub(crate) fn set_look_need(
309
413M
        &mut self,
310
413M
        set: impl FnMut(LookSet) -> LookSet,
311
413M
    ) {
312
413M
        self.repr_vec().set_look_need(set)
313
413M
    }
314
315
2.12G
    pub(crate) fn add_nfa_state_id(&mut self, sid: StateID) {
316
2.12G
        ReprVec(&mut self.repr)
317
2.12G
            .add_nfa_state_id(&mut self.prev_nfa_state_id, sid)
318
2.12G
    }
319
320
17.8M
    pub(crate) fn as_bytes(&self) -> &[u8] {
321
17.8M
        &self.repr
322
17.8M
    }
323
324
16.6M
    fn repr(&self) -> Repr<'_> {
325
16.6M
        Repr(&self.repr)
326
16.6M
    }
327
328
428M
    fn repr_vec(&mut self) -> ReprVec<'_> {
329
428M
        ReprVec(&mut self.repr)
330
428M
    }
331
}
332
333
/// Repr is a read-only view into the representation of a DFA state.
334
///
335
/// Primarily, a Repr is how we achieve DRY: we implement decoding the format
336
/// in one place, and then use a Repr to implement the various methods on the
337
/// public state types.
338
///
339
/// The format is as follows:
340
///
341
/// The first three bytes correspond to bitsets.
342
///
343
/// Byte 0 is a bitset corresponding to miscellaneous flags associated with the
344
/// state. Bit 0 is set to 1 if the state is a match state. Bit 1 is set to 1
345
/// if the state has pattern IDs explicitly written to it. (This is a flag that
346
/// is not meant to be set by determinization, but rather, is used as part of
347
/// an internal space-saving optimization.) Bit 2 is set to 1 if the state was
348
/// generated by a transition over a "word" byte. (Callers may not always set
349
/// this. For example, if the NFA has no word boundary assertion, then needing
350
/// to track whether a state came from a word byte or not is superfluous and
351
/// wasteful.) Bit 3 is set to 1 if the state was generated by a transition
352
/// from a `\r` (forward search) or a `\n` (reverse search) when CRLF mode is
353
/// enabled.
354
///
355
/// Bytes 1..5 correspond to the look-behind assertions that were satisfied
356
/// by the transition that created this state. (Look-ahead assertions are not
357
/// tracked as part of states. Instead, these are applied by re-computing the
358
/// epsilon closure of a state when computing the transition function. See
359
/// `next` in the parent module.)
360
///
361
/// Bytes 5..9 correspond to the set of look-around assertions (including both
362
/// look-behind and look-ahead) that appear somewhere in this state's set of
363
/// NFA state IDs. This is used to determine whether this state's epsilon
364
/// closure should be re-computed when computing the transition function.
365
/// Namely, look-around assertions are "just" conditional epsilon transitions,
366
/// so if there are new assertions available when computing the transition
367
/// function, we should only re-compute the epsilon closure if those new
368
/// assertions are relevant to this particular state.
369
///
370
/// Bytes 9..13 correspond to a 32-bit native-endian encoded integer
371
/// corresponding to the number of patterns encoded in this state. If the state
372
/// is not a match state (byte 0 bit 0 is 0) or if it's only pattern ID is
373
/// PatternID::ZERO, then no integer is encoded at this position. Instead, byte
374
/// offset 3 is the position at which the first NFA state ID is encoded.
375
///
376
/// For a match state with at least one non-ZERO pattern ID, the next bytes
377
/// correspond to a sequence of 32-bit native endian encoded integers that
378
/// represent each pattern ID, in order, that this match state represents.
379
///
380
/// After the pattern IDs (if any), NFA state IDs are delta encoded as
381
/// varints.[1] The first NFA state ID is encoded as itself, and each
382
/// subsequent NFA state ID is encoded as the difference between itself and the
383
/// previous NFA state ID.
384
///
385
/// [1] - https://developers.google.com/protocol-buffers/docs/encoding#varints
386
struct Repr<'a>(&'a [u8]);
387
388
impl<'a> Repr<'a> {
389
    /// Returns true if and only if this is a match state.
390
    ///
391
    /// If callers have added pattern IDs to this state, then callers MUST set
392
    /// this state as a match state explicitly. However, as a special case,
393
    /// states that are marked as match states but with no pattern IDs, then
394
    /// the state is treated as if it had a single pattern ID equivalent to
395
    /// PatternID::ZERO.
396
2.31M
    fn is_match(&self) -> bool {
397
2.31M
        self.0[0] & (1 << 0) > 0
398
2.31M
    }
399
400
    /// Returns true if and only if this state has had at least one pattern
401
    /// ID added to it.
402
    ///
403
    /// This is an internal-only flag that permits the representation to save
404
    /// space in the common case of an NFA with one pattern in it. In that
405
    /// case, a match state can only ever have exactly one pattern ID:
406
    /// PatternID::ZERO. So there's no need to represent it.
407
35.6M
    fn has_pattern_ids(&self) -> bool {
408
35.6M
        self.0[0] & (1 << 1) > 0
409
35.6M
    }
410
411
    /// Returns true if and only if this state is marked as having been created
412
    /// from a transition over a word byte. This is useful for checking whether
413
    /// a word boundary assertion is true or not, which requires look-behind
414
    /// (whether the current state came from a word byte or not) and look-ahead
415
    /// (whether the transition byte is a word byte or not).
416
    ///
417
    /// Since states with this set are distinct from states that don't have
418
    /// this set (even if they are otherwise equivalent), callers should not
419
    /// set this assertion unless the underlying NFA has at least one word
420
    /// boundary assertion somewhere. Otherwise, a superfluous number of states
421
    /// may be created.
422
8.13M
    fn is_from_word(&self) -> bool {
423
8.13M
        self.0[0] & (1 << 2) > 0
424
8.13M
    }
425
426
    /// Returns true if and only if this state is marked as being inside of a
427
    /// CRLF terminator. In the forward direction, this means the state was
428
    /// created after seeing a `\r`. In the reverse direction, this means the
429
    /// state was created after seeing a `\n`.
430
2.89M
    fn is_half_crlf(&self) -> bool {
431
2.89M
        self.0[0] & (1 << 3) > 0
432
2.89M
    }
433
434
    /// The set of look-behind assertions that were true in the transition that
435
    /// created this state.
436
    ///
437
    /// Generally, this should be empty if 'look_need' is empty, since there is
438
    /// no reason to track which look-behind assertions are true if the state
439
    /// has no conditional epsilon transitions.
440
    ///
441
    /// Satisfied look-ahead assertions are not tracked in states. Instead,
442
    /// these are re-computed on demand via epsilon closure when computing the
443
    /// transition function.
444
23.4M
    fn look_have(&self) -> LookSet {
445
23.4M
        LookSet::read_repr(&self.0[1..])
446
23.4M
    }
447
448
    /// The set of look-around (both behind and ahead) assertions that appear
449
    /// at least once in this state's set of NFA states.
450
    ///
451
    /// This is used to determine whether the epsilon closure needs to be
452
    /// re-computed when computing the transition function. Namely, if the
453
    /// state has no conditional epsilon transitions, then there is no need
454
    /// to re-compute the epsilon closure.
455
449M
    fn look_need(&self) -> LookSet {
456
449M
        LookSet::read_repr(&self.0[5..])
457
449M
    }
458
459
    /// Returns the total number of match pattern IDs in this state.
460
    ///
461
    /// If this state is not a match state, then this always returns 0.
462
0
    fn match_len(&self) -> usize {
463
0
        if !self.is_match() {
464
0
            return 0;
465
0
        } else if !self.has_pattern_ids() {
466
0
            1
467
        } else {
468
0
            self.encoded_pattern_len()
469
        }
470
0
    }
471
472
    /// Returns the pattern ID for this match state at the given index.
473
    ///
474
    /// If the given index is greater than or equal to `match_len()` for this
475
    /// state, then this could panic or return incorrect results.
476
0
    fn match_pattern(&self, index: usize) -> PatternID {
477
0
        if !self.has_pattern_ids() {
478
0
            PatternID::ZERO
479
        } else {
480
0
            let offset = 13 + index * PatternID::SIZE;
481
0
            // This is OK since we only ever serialize valid PatternIDs to
482
0
            // states.
483
0
            wire::read_pattern_id_unchecked(&self.0[offset..]).0
484
        }
485
0
    }
486
487
    /// Returns a copy of all match pattern IDs in this state. If this state
488
    /// is not a match state, then this returns None.
489
936k
    fn match_pattern_ids(&self) -> Option<Vec<PatternID>> {
490
936k
        if !self.is_match() {
491
821k
            return None;
492
115k
        }
493
115k
        let mut pids = alloc::vec![];
494
115k
        self.iter_match_pattern_ids(|pid| pids.push(pid));
495
115k
        Some(pids)
496
936k
    }
497
498
    /// Calls the given function on every pattern ID in this state.
499
115k
    fn iter_match_pattern_ids<F: FnMut(PatternID)>(&self, mut f: F) {
500
115k
        if !self.is_match() {
501
0
            return;
502
115k
        }
503
115k
        // As an optimization for a very common case, when this is a match
504
115k
        // state for an NFA with only one pattern, we don't actually write the
505
115k
        // pattern ID to the state representation. Instead, we know it must
506
115k
        // be there since it is the only possible choice.
507
115k
        if !self.has_pattern_ids() {
508
115k
            f(PatternID::ZERO);
509
115k
            return;
510
0
        }
511
0
        let mut pids = &self.0[13..self.pattern_offset_end()];
512
0
        while !pids.is_empty() {
513
0
            let pid = wire::read_u32(pids);
514
0
            pids = &pids[PatternID::SIZE..];
515
0
            // This is OK since we only ever serialize valid PatternIDs to
516
0
            // states. And since pattern IDs can never exceed a usize, the
517
0
            // unwrap is OK.
518
0
            f(PatternID::new_unchecked(usize::try_from(pid).unwrap()));
519
0
        }
520
115k
    }
521
522
    /// Calls the given function on every NFA state ID in this state.
523
16.2M
    fn iter_nfa_state_ids<F: FnMut(StateID)>(&self, mut f: F) {
524
16.2M
        let mut sids = &self.0[self.pattern_offset_end()..];
525
16.2M
        let mut prev = 0i32;
526
2.16G
        while !sids.is_empty() {
527
2.14G
            let (delta, nr) = read_vari32(sids);
528
2.14G
            sids = &sids[nr..];
529
2.14G
            let sid = prev + delta;
530
2.14G
            prev = sid;
531
2.14G
            // This is OK since we only ever serialize valid StateIDs to
532
2.14G
            // states. And since state IDs can never exceed an isize, they must
533
2.14G
            // always be able to fit into a usize, and thus cast is OK.
534
2.14G
            f(StateID::new_unchecked(sid.as_usize()))
535
        }
536
16.2M
    }
<regex_automata::util::determinize::state::Repr>::iter_nfa_state_ids::<regex_automata::util::determinize::next::{closure#0}>
Line
Count
Source
523
16.2M
    fn iter_nfa_state_ids<F: FnMut(StateID)>(&self, mut f: F) {
524
16.2M
        let mut sids = &self.0[self.pattern_offset_end()..];
525
16.2M
        let mut prev = 0i32;
526
2.16G
        while !sids.is_empty() {
527
2.14G
            let (delta, nr) = read_vari32(sids);
528
2.14G
            sids = &sids[nr..];
529
2.14G
            let sid = prev + delta;
530
2.14G
            prev = sid;
531
2.14G
            // This is OK since we only ever serialize valid StateIDs to
532
2.14G
            // states. And since state IDs can never exceed an isize, they must
533
2.14G
            // always be able to fit into a usize, and thus cast is OK.
534
2.14G
            f(StateID::new_unchecked(sid.as_usize()))
535
        }
536
16.2M
    }
Unexecuted instantiation: <regex_automata::util::determinize::state::Repr>::iter_nfa_state_ids::<<regex_automata::util::determinize::state::Repr as core::fmt::Debug>::fmt::{closure#0}>
537
538
    /// Returns the offset into this state's representation where the pattern
539
    /// IDs end and the NFA state IDs begin.
540
16.2M
    fn pattern_offset_end(&self) -> usize {
541
16.2M
        let encoded = self.encoded_pattern_len();
542
16.2M
        if encoded == 0 {
543
16.2M
            return 9;
544
0
        }
545
0
        // This arithmetic is OK since we were able to address this many bytes
546
0
        // when writing to the state, thus, it must fit into a usize.
547
0
        encoded.checked_mul(4).unwrap().checked_add(13).unwrap()
548
16.2M
    }
549
550
    /// Returns the total number of *encoded* pattern IDs in this state.
551
    ///
552
    /// This may return 0 even when this is a match state, since the pattern
553
    /// ID `PatternID::ZERO` is not encoded when it's the only pattern ID in
554
    /// the match state (the overwhelming common case).
555
16.2M
    fn encoded_pattern_len(&self) -> usize {
556
16.2M
        if !self.has_pattern_ids() {
557
16.2M
            return 0;
558
0
        }
559
0
        // This unwrap is OK since the total number of patterns is always
560
0
        // guaranteed to fit into a usize.
561
0
        usize::try_from(wire::read_u32(&self.0[9..13])).unwrap()
562
16.2M
    }
563
}
564
565
impl<'a> core::fmt::Debug for Repr<'a> {
566
0
    fn fmt(&self, f: &mut core::fmt::Formatter) -> core::fmt::Result {
567
0
        let mut nfa_ids = alloc::vec![];
568
0
        self.iter_nfa_state_ids(|sid| nfa_ids.push(sid));
569
0
        f.debug_struct("Repr")
570
0
            .field("is_match", &self.is_match())
571
0
            .field("is_from_word", &self.is_from_word())
572
0
            .field("is_half_crlf", &self.is_half_crlf())
573
0
            .field("look_have", &self.look_have())
574
0
            .field("look_need", &self.look_need())
575
0
            .field("match_pattern_ids", &self.match_pattern_ids())
576
0
            .field("nfa_state_ids", &nfa_ids)
577
0
            .finish()
578
0
    }
579
}
580
581
/// ReprVec is a write-only view into the representation of a DFA state.
582
///
583
/// See Repr for more details on the purpose of this type and also the format.
584
///
585
/// Note that not all possible combinations of methods may be called. This is
586
/// precisely what the various StateBuilder types encapsulate: they only
587
/// permit valid combinations via Rust's linear typing.
588
struct ReprVec<'a>(&'a mut Vec<u8>);
589
590
impl<'a> ReprVec<'a> {
591
    /// Set this state as a match state.
592
    ///
593
    /// This should not be exposed explicitly outside of this module. It is
594
    /// set automatically when a pattern ID is added.
595
2.42M
    fn set_is_match(&mut self) {
596
2.42M
        self.0[0] |= 1 << 0;
597
2.42M
    }
598
599
    /// Set that this state has pattern IDs explicitly written to it.
600
    ///
601
    /// This should not be exposed explicitly outside of this module. This is
602
    /// used internally as a space saving optimization. Namely, if the state
603
    /// is a match state but does not have any pattern IDs written to it,
604
    /// then it is automatically inferred to have a pattern ID of ZERO.
605
0
    fn set_has_pattern_ids(&mut self) {
606
0
        self.0[0] |= 1 << 1;
607
0
    }
608
609
    /// Set this state as being built from a transition over a word byte.
610
    ///
611
    /// Setting this is only necessary when one needs to deal with word
612
    /// boundary assertions. Therefore, if the underlying NFA has no word
613
    /// boundary assertions, callers should not set this.
614
526k
    fn set_is_from_word(&mut self) {
615
526k
        self.0[0] |= 1 << 2;
616
526k
    }
617
618
    /// Set this state as having seen half of a CRLF terminator.
619
    ///
620
    /// In the forward direction, this should be set when a `\r` has been seen.
621
    /// In the reverse direction, this should be set when a `\n` has been seen.
622
16.2k
    fn set_is_half_crlf(&mut self) {
623
16.2k
        self.0[0] |= 1 << 3;
624
16.2k
    }
625
626
    /// The set of look-behind assertions that were true in the transition that
627
    /// created this state.
628
17.7M
    fn look_have(&self) -> LookSet {
629
17.7M
        self.repr().look_have()
630
17.7M
    }
631
632
    /// The set of look-around (both behind and ahead) assertions that appear
633
    /// at least once in this state's set of NFA states.
634
413M
    fn look_need(&self) -> LookSet {
635
413M
        self.repr().look_need()
636
413M
    }
637
638
    /// Mutate the set of look-behind assertions that were true in the
639
    /// transition that created this state.
640
17.7M
    fn set_look_have(&mut self, mut set: impl FnMut(LookSet) -> LookSet) {
641
17.7M
        set(self.look_have()).write_repr(&mut self.0[1..]);
642
17.7M
    }
<regex_automata::util::determinize::state::ReprVec>::set_look_have::<regex_automata::util::determinize::add_nfa_states::{closure#1}>
Line
Count
Source
640
14.9M
    fn set_look_have(&mut self, mut set: impl FnMut(LookSet) -> LookSet) {
641
14.9M
        set(self.look_have()).write_repr(&mut self.0[1..]);
642
14.9M
    }
<regex_automata::util::determinize::state::ReprVec>::set_look_have::<regex_automata::util::determinize::set_lookbehind_from_start::{closure#0}>
Line
Count
Source
640
77.4k
    fn set_look_have(&mut self, mut set: impl FnMut(LookSet) -> LookSet) {
641
77.4k
        set(self.look_have()).write_repr(&mut self.0[1..]);
642
77.4k
    }
<regex_automata::util::determinize::state::ReprVec>::set_look_have::<regex_automata::util::determinize::set_lookbehind_from_start::{closure#2}>
Line
Count
Source
640
10.3k
    fn set_look_have(&mut self, mut set: impl FnMut(LookSet) -> LookSet) {
641
10.3k
        set(self.look_have()).write_repr(&mut self.0[1..]);
642
10.3k
    }
<regex_automata::util::determinize::state::ReprVec>::set_look_have::<regex_automata::util::determinize::set_lookbehind_from_start::{closure#3}>
Line
Count
Source
640
24.0k
    fn set_look_have(&mut self, mut set: impl FnMut(LookSet) -> LookSet) {
641
24.0k
        set(self.look_have()).write_repr(&mut self.0[1..]);
642
24.0k
    }
<regex_automata::util::determinize::state::ReprVec>::set_look_have::<regex_automata::util::determinize::set_lookbehind_from_start::{closure#4}>
Line
Count
Source
640
3.15k
    fn set_look_have(&mut self, mut set: impl FnMut(LookSet) -> LookSet) {
641
3.15k
        set(self.look_have()).write_repr(&mut self.0[1..]);
642
3.15k
    }
<regex_automata::util::determinize::state::ReprVec>::set_look_have::<regex_automata::util::determinize::set_lookbehind_from_start::{closure#5}>
Line
Count
Source
640
5.12k
    fn set_look_have(&mut self, mut set: impl FnMut(LookSet) -> LookSet) {
641
5.12k
        set(self.look_have()).write_repr(&mut self.0[1..]);
642
5.12k
    }
<regex_automata::util::determinize::state::ReprVec>::set_look_have::<regex_automata::util::determinize::set_lookbehind_from_start::{closure#6}>
Line
Count
Source
640
8.27k
    fn set_look_have(&mut self, mut set: impl FnMut(LookSet) -> LookSet) {
641
8.27k
        set(self.look_have()).write_repr(&mut self.0[1..]);
642
8.27k
    }
<regex_automata::util::determinize::state::ReprVec>::set_look_have::<regex_automata::util::determinize::set_lookbehind_from_start::{closure#7}>
Line
Count
Source
640
15.3k
    fn set_look_have(&mut self, mut set: impl FnMut(LookSet) -> LookSet) {
641
15.3k
        set(self.look_have()).write_repr(&mut self.0[1..]);
642
15.3k
    }
<regex_automata::util::determinize::state::ReprVec>::set_look_have::<regex_automata::util::determinize::set_lookbehind_from_start::{closure#8}>
Line
Count
Source
640
1.57k
    fn set_look_have(&mut self, mut set: impl FnMut(LookSet) -> LookSet) {
641
1.57k
        set(self.look_have()).write_repr(&mut self.0[1..]);
642
1.57k
    }
Unexecuted instantiation: <regex_automata::util::determinize::state::ReprVec>::set_look_have::<regex_automata::util::determinize::set_lookbehind_from_start::{closure#9}>
<regex_automata::util::determinize::state::ReprVec>::set_look_have::<regex_automata::util::determinize::set_lookbehind_from_start::{closure#10}>
Line
Count
Source
640
15.2k
    fn set_look_have(&mut self, mut set: impl FnMut(LookSet) -> LookSet) {
641
15.2k
        set(self.look_have()).write_repr(&mut self.0[1..]);
642
15.2k
    }
<regex_automata::util::determinize::state::ReprVec>::set_look_have::<regex_automata::util::determinize::set_lookbehind_from_start::{closure#11}>
Line
Count
Source
640
8.21k
    fn set_look_have(&mut self, mut set: impl FnMut(LookSet) -> LookSet) {
641
8.21k
        set(self.look_have()).write_repr(&mut self.0[1..]);
642
8.21k
    }
<regex_automata::util::determinize::state::ReprVec>::set_look_have::<regex_automata::util::determinize::set_lookbehind_from_start::{closure#1}>
Line
Count
Source
640
25.7k
    fn set_look_have(&mut self, mut set: impl FnMut(LookSet) -> LookSet) {
641
25.7k
        set(self.look_have()).write_repr(&mut self.0[1..]);
642
25.7k
    }
<regex_automata::util::determinize::state::ReprVec>::set_look_have::<regex_automata::util::determinize::set_lookbehind_from_start::{closure#12}>
Line
Count
Source
640
15.2k
    fn set_look_have(&mut self, mut set: impl FnMut(LookSet) -> LookSet) {
641
15.2k
        set(self.look_have()).write_repr(&mut self.0[1..]);
642
15.2k
    }
<regex_automata::util::determinize::state::ReprVec>::set_look_have::<regex_automata::util::determinize::next::{closure#2}>
Line
Count
Source
640
31.2k
    fn set_look_have(&mut self, mut set: impl FnMut(LookSet) -> LookSet) {
641
31.2k
        set(self.look_have()).write_repr(&mut self.0[1..]);
642
31.2k
    }
<regex_automata::util::determinize::state::ReprVec>::set_look_have::<regex_automata::util::determinize::next::{closure#3}>
Line
Count
Source
640
2.49M
    fn set_look_have(&mut self, mut set: impl FnMut(LookSet) -> LookSet) {
641
2.49M
        set(self.look_have()).write_repr(&mut self.0[1..]);
642
2.49M
    }
<regex_automata::util::determinize::state::ReprVec>::set_look_have::<regex_automata::util::determinize::next::{closure#1}>
Line
Count
Source
640
63.0k
    fn set_look_have(&mut self, mut set: impl FnMut(LookSet) -> LookSet) {
641
63.0k
        set(self.look_have()).write_repr(&mut self.0[1..]);
642
63.0k
    }
643
644
    /// Mutate the set of look-around (both behind and ahead) assertions that
645
    /// appear at least once in this state's set of NFA states.
646
413M
    fn set_look_need(&mut self, mut set: impl FnMut(LookSet) -> LookSet) {
647
413M
        set(self.look_need()).write_repr(&mut self.0[5..]);
648
413M
    }
649
650
    /// Add a pattern ID to this state. All match states must have at least
651
    /// one pattern ID associated with it.
652
    ///
653
    /// Callers must never add duplicative pattern IDs.
654
    ///
655
    /// The order in which patterns are added must correspond to the order
656
    /// in which patterns are reported as matches.
657
2.42M
    fn add_match_pattern_id(&mut self, pid: PatternID) {
658
2.42M
        // As a (somewhat small) space saving optimization, in the case where
659
2.42M
        // a matching state has exactly one pattern ID, PatternID::ZERO, we do
660
2.42M
        // not write either the pattern ID or the number of patterns encoded.
661
2.42M
        // Instead, all we do is set the 'is_match' bit on this state. Overall,
662
2.42M
        // this saves 8 bytes per match state for the overwhelming majority of
663
2.42M
        // match states.
664
2.42M
        //
665
2.42M
        // In order to know whether pattern IDs need to be explicitly read or
666
2.42M
        // not, we use another internal-only bit, 'has_pattern_ids', to
667
2.42M
        // indicate whether they have been explicitly written or not.
668
2.42M
        if !self.repr().has_pattern_ids() {
669
2.42M
            if pid == PatternID::ZERO {
670
2.42M
                self.set_is_match();
671
2.42M
                return;
672
0
            }
673
0
            // Make room for 'close_match_pattern_ids' to write the total
674
0
            // number of pattern IDs written.
675
0
            self.0.extend(core::iter::repeat(0).take(PatternID::SIZE));
676
0
            self.set_has_pattern_ids();
677
0
            // If this was already a match state, then the only way that's
678
0
            // possible when the state doesn't have pattern IDs is if
679
0
            // PatternID::ZERO was added by the caller previously. In this
680
0
            // case, we are now adding a non-ZERO pattern ID after it, in
681
0
            // which case, we want to make sure to represent ZERO explicitly
682
0
            // now.
683
0
            if self.repr().is_match() {
684
0
                write_u32(self.0, 0)
685
0
            } else {
686
0
                // Otherwise, just make sure the 'is_match' bit is set.
687
0
                self.set_is_match();
688
0
            }
689
0
        }
690
0
        write_u32(self.0, pid.as_u32());
691
2.42M
    }
692
693
    /// Indicate that no more pattern IDs will be added to this state.
694
    ///
695
    /// Once this is called, callers must not call it or 'add_match_pattern_id'
696
    /// again.
697
    ///
698
    /// This should not be exposed explicitly outside of this module. It
699
    /// should be called only when converting a StateBuilderMatches into a
700
    /// StateBuilderNFA.
701
16.8M
    fn close_match_pattern_ids(&mut self) {
702
16.8M
        // If we never wrote any pattern IDs, then there's nothing to do here.
703
16.8M
        if !self.repr().has_pattern_ids() {
704
16.8M
            return;
705
0
        }
706
0
        let patsize = PatternID::SIZE;
707
0
        let pattern_bytes = self.0.len() - 13;
708
0
        // Every pattern ID uses 4 bytes, so number of bytes should be
709
0
        // divisible by 4.
710
0
        assert_eq!(pattern_bytes % patsize, 0);
711
        // This unwrap is OK since we are guaranteed that the maximum number
712
        // of possible patterns fits into a u32.
713
0
        let count32 = u32::try_from(pattern_bytes / patsize).unwrap();
714
0
        wire::NE::write_u32(count32, &mut self.0[9..13]);
715
16.8M
    }
716
717
    /// Add an NFA state ID to this state. The order in which NFA states are
718
    /// added matters. It is the caller's responsibility to ensure that
719
    /// duplicate NFA state IDs are not added.
720
2.12G
    fn add_nfa_state_id(&mut self, prev: &mut StateID, sid: StateID) {
721
2.12G
        let delta = sid.as_i32() - prev.as_i32();
722
2.12G
        write_vari32(self.0, delta);
723
2.12G
        *prev = sid;
724
2.12G
    }
725
726
    /// Return a read-only view of this state's representation.
727
450M
    fn repr(&self) -> Repr<'_> {
728
450M
        Repr(self.0.as_slice())
729
450M
    }
730
}
731
732
/// Write a signed 32-bit integer using zig-zag encoding.
733
///
734
/// https://developers.google.com/protocol-buffers/docs/encoding#varints
735
2.12G
fn write_vari32(data: &mut Vec<u8>, n: i32) {
736
2.12G
    let mut un = n.to_bits() << 1;
737
2.12G
    if n < 0 {
738
625M
        un = !un;
739
1.49G
    }
740
2.12G
    write_varu32(data, un)
741
2.12G
}
742
743
/// Read a signed 32-bit integer using zig-zag encoding. Also, return the
744
/// number of bytes read.
745
///
746
/// https://developers.google.com/protocol-buffers/docs/encoding#varints
747
2.14G
fn read_vari32(data: &[u8]) -> (i32, usize) {
748
2.14G
    let (un, i) = read_varu32(data);
749
2.14G
    let mut n = i32::from_bits(un >> 1);
750
2.14G
    if un & 1 != 0 {
751
631M
        n = !n;
752
1.51G
    }
753
2.14G
    (n, i)
754
2.14G
}
755
756
/// Write an unsigned 32-bit integer as a varint. In essence, `n` is written
757
/// as a sequence of bytes where all bytes except for the last one have the
758
/// most significant bit set. The least significant 7 bits correspond to the
759
/// actual bits of `n`. So in the worst case, a varint uses 5 bytes, but in
760
/// very common cases, it uses fewer than 4.
761
///
762
/// https://developers.google.com/protocol-buffers/docs/encoding#varints
763
2.12G
fn write_varu32(data: &mut Vec<u8>, mut n: u32) {
764
2.27G
    while n >= 0b1000_0000 {
765
156M
        data.push(n.low_u8() | 0b1000_0000);
766
156M
        n >>= 7;
767
156M
    }
768
2.12G
    data.push(n.low_u8());
769
2.12G
}
770
771
/// Read an unsigned 32-bit varint. Also, return the number of bytes read.
772
///
773
/// https://developers.google.com/protocol-buffers/docs/encoding#varints
774
2.14G
fn read_varu32(data: &[u8]) -> (u32, usize) {
775
2.14G
    // N.B. We can assume correctness here since we know that all varuints are
776
2.14G
    // written with write_varu32. Hence, the 'as' uses and unchecked arithmetic
777
2.14G
    // is all okay.
778
2.14G
    let mut n: u32 = 0;
779
2.14G
    let mut shift: u32 = 0;
780
2.30G
    for (i, &b) in data.iter().enumerate() {
781
2.30G
        if b < 0b1000_0000 {
782
2.14G
            return (n | (u32::from(b) << shift), i + 1);
783
157M
        }
784
157M
        n |= (u32::from(b) & 0b0111_1111) << shift;
785
157M
        shift += 7;
786
    }
787
0
    (0, 0)
788
2.14G
}
789
790
/// Push a native-endian encoded `n` on to `dst`.
791
0
fn write_u32(dst: &mut Vec<u8>, n: u32) {
792
    use crate::util::wire::NE;
793
794
0
    let start = dst.len();
795
0
    dst.extend(core::iter::repeat(0).take(mem::size_of::<u32>()));
796
0
    NE::write_u32(n, &mut dst[start..]);
797
0
}
798
799
#[cfg(test)]
800
mod tests {
801
    use alloc::vec;
802
803
    use quickcheck::quickcheck;
804
805
    use super::*;
806
807
    #[cfg(not(miri))]
808
    quickcheck! {
809
        fn prop_state_read_write_nfa_state_ids(sids: Vec<StateID>) -> bool {
810
            // Builders states do not permit duplicate IDs.
811
            let sids = dedup_state_ids(sids);
812
813
            let mut b = StateBuilderEmpty::new().into_matches().into_nfa();
814
            for &sid in &sids {
815
                b.add_nfa_state_id(sid);
816
            }
817
            let s = b.to_state();
818
            let mut got = vec![];
819
            s.iter_nfa_state_ids(|sid| got.push(sid));
820
            got == sids
821
        }
822
823
        fn prop_state_read_write_pattern_ids(pids: Vec<PatternID>) -> bool {
824
            // Builders states do not permit duplicate IDs.
825
            let pids = dedup_pattern_ids(pids);
826
827
            let mut b = StateBuilderEmpty::new().into_matches();
828
            for &pid in &pids {
829
                b.add_match_pattern_id(pid);
830
            }
831
            let s = b.into_nfa().to_state();
832
            let mut got = vec![];
833
            s.iter_match_pattern_ids(|pid| got.push(pid));
834
            got == pids
835
        }
836
837
        fn prop_state_read_write_nfa_state_and_pattern_ids(
838
            sids: Vec<StateID>,
839
            pids: Vec<PatternID>
840
        ) -> bool {
841
            // Builders states do not permit duplicate IDs.
842
            let sids = dedup_state_ids(sids);
843
            let pids = dedup_pattern_ids(pids);
844
845
            let mut b = StateBuilderEmpty::new().into_matches();
846
            for &pid in &pids {
847
                b.add_match_pattern_id(pid);
848
            }
849
850
            let mut b = b.into_nfa();
851
            for &sid in &sids {
852
                b.add_nfa_state_id(sid);
853
            }
854
855
            let s = b.to_state();
856
            let mut got_pids = vec![];
857
            s.iter_match_pattern_ids(|pid| got_pids.push(pid));
858
            let mut got_sids = vec![];
859
            s.iter_nfa_state_ids(|sid| got_sids.push(sid));
860
            got_pids == pids && got_sids == sids
861
        }
862
    }
863
864
    quickcheck! {
865
        fn prop_read_write_varu32(n: u32) -> bool {
866
            let mut buf = vec![];
867
            write_varu32(&mut buf, n);
868
            let (got, nread) = read_varu32(&buf);
869
            nread == buf.len() && got == n
870
        }
871
872
        fn prop_read_write_vari32(n: i32) -> bool {
873
            let mut buf = vec![];
874
            write_vari32(&mut buf, n);
875
            let (got, nread) = read_vari32(&buf);
876
            nread == buf.len() && got == n
877
        }
878
    }
879
880
    #[cfg(not(miri))]
881
    fn dedup_state_ids(sids: Vec<StateID>) -> Vec<StateID> {
882
        let mut set = alloc::collections::BTreeSet::new();
883
        let mut deduped = vec![];
884
        for sid in sids {
885
            if set.contains(&sid) {
886
                continue;
887
            }
888
            set.insert(sid);
889
            deduped.push(sid);
890
        }
891
        deduped
892
    }
893
894
    #[cfg(not(miri))]
895
    fn dedup_pattern_ids(pids: Vec<PatternID>) -> Vec<PatternID> {
896
        let mut set = alloc::collections::BTreeSet::new();
897
        let mut deduped = vec![];
898
        for pid in pids {
899
            if set.contains(&pid) {
900
                continue;
901
            }
902
            set.insert(pid);
903
            deduped.push(pid);
904
        }
905
        deduped
906
    }
907
}