/src/regex/regex-automata/src/util/determinize/state.rs
Line | Count | Source (jump to first uncovered line) |
1 | | /*! |
2 | | This module defines a DFA state representation and builders for constructing |
3 | | DFA states. |
4 | | |
5 | | This representation is specifically for use in implementations of NFA-to-DFA |
6 | | conversion via powerset construction. (Also called "determinization" in this |
7 | | crate.) |
8 | | |
9 | | The term "DFA state" is somewhat overloaded in this crate. In some cases, it |
10 | | refers to the set of transitions over an alphabet for a particular state. In |
11 | | other cases, it refers to a set of NFA states. The former is really about the |
12 | | final representation of a state in a DFA's transition table, where as the |
13 | | latter---what this module is focused on---is closer to an intermediate form |
14 | | that is used to help eventually build the transition table. |
15 | | |
16 | | This module exports four types. All four types represent the same idea: an |
17 | | ordered set of NFA states. This ordered set represents the epsilon closure of a |
18 | | particular NFA state, where the "epsilon closure" is the set of NFA states that |
19 | | can be transitioned to without consuming any input. i.e., Follow all of the NFA |
20 | | state's epsilon transitions. In addition, this implementation of DFA states |
21 | | cares about two other things: the ordered set of pattern IDs corresponding |
22 | | to the patterns that match if the state is a match state, and the set of |
23 | | look-behind assertions that were true when the state was created. |
24 | | |
25 | | The first, `State`, is a frozen representation of a state that cannot be |
26 | | modified. It may be cheaply cloned without copying the state itself and can be |
27 | | accessed safely from multiple threads simultaneously. This type is useful for |
28 | | when one knows that the DFA state being constructed is distinct from any other |
29 | | previously constructed states. Namely, powerset construction, in practice, |
30 | | requires one to keep a cache of previously created DFA states. Otherwise, |
31 | | the number of DFA states created in memory balloons to an impractically |
32 | | large number. For this reason, equivalent states should endeavor to have an |
33 | | equivalent byte-level representation. (In general, "equivalency" here means, |
34 | | "equivalent assertions, pattern IDs and NFA state IDs." We do not require that |
35 | | full DFA minimization be implemented here. This form of equivalency is only |
36 | | surface deep and is more-or-less a practical necessity.) |
37 | | |
38 | | The other three types represent different phases in the construction of a |
39 | | DFA state. Internally, these three types (and `State`) all use the same |
40 | | byte-oriented representation. That means one can use any of the builder types |
41 | | to check whether the state it represents already exists or not. If it does, |
42 | | then there is no need to freeze it into a `State` (which requires an alloc and |
43 | | a copy). Here are the three types described succinctly: |
44 | | |
45 | | * `StateBuilderEmpty` represents a state with no pattern IDs, no assertions |
46 | | and no NFA states. Creating a `StateBuilderEmpty` performs no allocs. A |
47 | | `StateBuilderEmpty` can only be used to query its underlying memory capacity, |
48 | | or to convert into a builder for recording pattern IDs and/or assertions. |
49 | | |
50 | | * `StateBuilderMatches` represents a state with zero or more pattern IDs, zero |
51 | | or more satisfied assertions and zero NFA state IDs. A `StateBuilderMatches` |
52 | | can only be used for adding pattern IDs and recording assertions. |
53 | | |
54 | | * `StateBuilderNFA` represents a state with zero or more pattern IDs, zero or |
55 | | more satisfied assertions and zero or more NFA state IDs. A `StateBuilderNFA` |
56 | | can only be used for adding NFA state IDs and recording some assertions. |
57 | | |
58 | | The expected flow here is to use the above builders to construct a candidate |
59 | | DFA state to check if it already exists. If it does, then there's no need to |
60 | | freeze it into a `State`. If it doesn't exist, then `StateBuilderNFA::to_state` |
61 | | can be called to freeze the builder into an immutable `State`. In either |
62 | | case, `clear` should be called on the builder to turn it back into a |
63 | | `StateBuilderEmpty` that reuses the underlying memory. |
64 | | |
65 | | The main purpose for splitting the builder into these distinct types is to |
66 | | make it impossible to do things like adding a pattern ID after adding an NFA |
67 | | state ID. Namely, this makes it simpler to use a space-and-time efficient |
68 | | binary representation for the state. (The format is documented on the `Repr` |
69 | | type below.) If we just used one type for everything, it would be possible for |
70 | | callers to use an incorrect interleaving of calls and thus result in a corrupt |
71 | | representation. I chose to use more type machinery to make this impossible to |
72 | | do because 1) determinization is itself pretty complex and it wouldn't be too |
73 | | hard to foul this up and 2) there isn't too much machinery involved and it's |
74 | | well contained. |
75 | | |
76 | | As an optimization, sometimes states won't have certain things set. For |
77 | | example, if the underlying NFA has no word boundary assertions, then there is |
78 | | no reason to set a state's look-behind assertion as to whether it was generated |
79 | | from a word byte or not. Similarly, if a state has no NFA states corresponding |
80 | | to look-around assertions, then there is no reason to set `look_have` to a |
81 | | non-empty set. Finally, callers usually omit unconditional epsilon transitions |
82 | | when adding NFA state IDs since they aren't discriminatory. |
83 | | |
84 | | Finally, the binary representation used by these states is, thankfully, not |
85 | | serialized anywhere. So any kind of change can be made with reckless abandon, |
86 | | as long as everything in this module agrees. |
87 | | */ |
88 | | |
89 | | use core::mem; |
90 | | |
91 | | use alloc::{sync::Arc, vec::Vec}; |
92 | | |
93 | | use crate::util::{ |
94 | | int::{I32, U32}, |
95 | | look::LookSet, |
96 | | primitives::{PatternID, StateID}, |
97 | | wire::{self, Endian}, |
98 | | }; |
99 | | |
100 | | /// A DFA state that, at its core, is represented by an ordered set of NFA |
101 | | /// states. |
102 | | /// |
103 | | /// This type is intended to be used only in NFA-to-DFA conversion via powerset |
104 | | /// construction. |
105 | | /// |
106 | | /// It may be cheaply cloned and accessed safely from multiple threads |
107 | | /// simultaneously. |
108 | | #[derive(Clone, Eq, Hash, PartialEq, PartialOrd, Ord)] |
109 | | pub(crate) struct State(Arc<[u8]>); |
110 | | |
111 | | /// This Borrow impl permits us to lookup any state in a map by its byte |
112 | | /// representation. This is particularly convenient when one has a StateBuilder |
113 | | /// and we want to see if a correspondingly equivalent state already exists. If |
114 | | /// one does exist, then we can reuse the allocation required by StateBuilder |
115 | | /// without having to convert it into a State first. |
116 | | impl core::borrow::Borrow<[u8]> for State { |
117 | 14.7M | fn borrow(&self) -> &[u8] { |
118 | 14.7M | &*self.0 |
119 | 14.7M | } |
120 | | } |
121 | | |
122 | | impl core::fmt::Debug for State { |
123 | 0 | fn fmt(&self, f: &mut core::fmt::Formatter) -> core::fmt::Result { |
124 | 0 | f.debug_tuple("State").field(&self.repr()).finish() |
125 | 0 | } |
126 | | } |
127 | | |
128 | | /// For docs on these routines, see the internal Repr and ReprVec types below. |
129 | | impl State { |
130 | 253k | pub(crate) fn dead() -> State { |
131 | 253k | StateBuilderEmpty::new().into_matches().into_nfa().to_state() |
132 | 253k | } |
133 | | |
134 | 1.26M | pub(crate) fn is_match(&self) -> bool { |
135 | 1.26M | self.repr().is_match() |
136 | 1.26M | } |
137 | | |
138 | 8.13M | pub(crate) fn is_from_word(&self) -> bool { |
139 | 8.13M | self.repr().is_from_word() |
140 | 8.13M | } |
141 | | |
142 | 2.89M | pub(crate) fn is_half_crlf(&self) -> bool { |
143 | 2.89M | self.repr().is_half_crlf() |
144 | 2.89M | } |
145 | | |
146 | 5.68M | pub(crate) fn look_have(&self) -> LookSet { |
147 | 5.68M | self.repr().look_have() |
148 | 5.68M | } |
149 | | |
150 | 19.0M | pub(crate) fn look_need(&self) -> LookSet { |
151 | 19.0M | self.repr().look_need() |
152 | 19.0M | } |
153 | | |
154 | 0 | pub(crate) fn match_len(&self) -> usize { |
155 | 0 | self.repr().match_len() |
156 | 0 | } |
157 | | |
158 | 0 | pub(crate) fn match_pattern(&self, index: usize) -> PatternID { |
159 | 0 | self.repr().match_pattern(index) |
160 | 0 | } |
161 | | |
162 | 936k | pub(crate) fn match_pattern_ids(&self) -> Option<Vec<PatternID>> { |
163 | 936k | self.repr().match_pattern_ids() |
164 | 936k | } |
165 | | |
166 | | #[cfg(all(test, not(miri)))] |
167 | | pub(crate) fn iter_match_pattern_ids<F: FnMut(PatternID)>(&self, f: F) { |
168 | | self.repr().iter_match_pattern_ids(f) |
169 | | } |
170 | | |
171 | 16.2M | pub(crate) fn iter_nfa_state_ids<F: FnMut(StateID)>(&self, f: F) { |
172 | 16.2M | self.repr().iter_nfa_state_ids(f) |
173 | 16.2M | } |
174 | | |
175 | 3.43M | pub(crate) fn memory_usage(&self) -> usize { |
176 | 3.43M | self.0.len() |
177 | 3.43M | } |
178 | | |
179 | 54.2M | fn repr(&self) -> Repr<'_> { |
180 | 54.2M | Repr(&*self.0) |
181 | 54.2M | } |
182 | | } |
183 | | |
184 | | /// A state builder that represents an empty state. |
185 | | /// |
186 | | /// This is a useful "initial condition" for state construction. It has no |
187 | | /// NFA state IDs, no assertions set and no pattern IDs. No allocations are |
188 | | /// made when new() is called. Its main use is for being converted into a |
189 | | /// builder that can capture assertions and pattern IDs. |
190 | | #[derive(Clone, Debug)] |
191 | | pub(crate) struct StateBuilderEmpty(Vec<u8>); |
192 | | |
193 | | /// For docs on these routines, see the internal Repr and ReprVec types below. |
194 | | impl StateBuilderEmpty { |
195 | 16.9M | pub(crate) fn new() -> StateBuilderEmpty { |
196 | 16.9M | StateBuilderEmpty(alloc::vec![]) |
197 | 16.9M | } |
198 | | |
199 | 16.8M | pub(crate) fn into_matches(mut self) -> StateBuilderMatches { |
200 | 16.8M | self.0.extend_from_slice(&[0, 0, 0, 0, 0, 0, 0, 0, 0]); |
201 | 16.8M | StateBuilderMatches(self.0) |
202 | 16.8M | } |
203 | | |
204 | 16.6M | fn clear(&mut self) { |
205 | 16.6M | self.0.clear(); |
206 | 16.6M | } |
207 | | |
208 | 3.32M | pub(crate) fn capacity(&self) -> usize { |
209 | 3.32M | self.0.capacity() |
210 | 3.32M | } |
211 | | } |
212 | | |
213 | | /// A state builder that collects assertions and pattern IDs. |
214 | | /// |
215 | | /// When collecting pattern IDs is finished, this can be converted into a |
216 | | /// builder that collects NFA state IDs. |
217 | | #[derive(Clone)] |
218 | | pub(crate) struct StateBuilderMatches(Vec<u8>); |
219 | | |
220 | | impl core::fmt::Debug for StateBuilderMatches { |
221 | 0 | fn fmt(&self, f: &mut core::fmt::Formatter) -> core::fmt::Result { |
222 | 0 | f.debug_tuple("StateBuilderMatches").field(&self.repr()).finish() |
223 | 0 | } |
224 | | } |
225 | | |
226 | | /// For docs on these routines, see the internal Repr and ReprVec types below. |
227 | | impl StateBuilderMatches { |
228 | 16.8M | pub(crate) fn into_nfa(mut self) -> StateBuilderNFA { |
229 | 16.8M | self.repr_vec().close_match_pattern_ids(); |
230 | 16.8M | StateBuilderNFA { repr: self.0, prev_nfa_state_id: StateID::ZERO } |
231 | 16.8M | } |
232 | | |
233 | 526k | pub(crate) fn set_is_from_word(&mut self) { |
234 | 526k | self.repr_vec().set_is_from_word() |
235 | 526k | } |
236 | | |
237 | 16.2k | pub(crate) fn set_is_half_crlf(&mut self) { |
238 | 16.2k | self.repr_vec().set_is_half_crlf() |
239 | 16.2k | } |
240 | | |
241 | 776M | pub(crate) fn look_have(&self) -> LookSet { |
242 | 776M | LookSet::read_repr(&self.0[1..]) |
243 | 776M | } |
244 | | |
245 | 2.80M | pub(crate) fn set_look_have( |
246 | 2.80M | &mut self, |
247 | 2.80M | set: impl FnMut(LookSet) -> LookSet, |
248 | 2.80M | ) { |
249 | 2.80M | self.repr_vec().set_look_have(set) |
250 | 2.80M | } <regex_automata::util::determinize::state::StateBuilderMatches>::set_look_have::<regex_automata::util::determinize::set_lookbehind_from_start::{closure#0}> Line | Count | Source | 245 | 77.4k | pub(crate) fn set_look_have( | 246 | 77.4k | &mut self, | 247 | 77.4k | set: impl FnMut(LookSet) -> LookSet, | 248 | 77.4k | ) { | 249 | 77.4k | self.repr_vec().set_look_have(set) | 250 | 77.4k | } |
<regex_automata::util::determinize::state::StateBuilderMatches>::set_look_have::<regex_automata::util::determinize::set_lookbehind_from_start::{closure#2}> Line | Count | Source | 245 | 10.3k | pub(crate) fn set_look_have( | 246 | 10.3k | &mut self, | 247 | 10.3k | set: impl FnMut(LookSet) -> LookSet, | 248 | 10.3k | ) { | 249 | 10.3k | self.repr_vec().set_look_have(set) | 250 | 10.3k | } |
<regex_automata::util::determinize::state::StateBuilderMatches>::set_look_have::<regex_automata::util::determinize::set_lookbehind_from_start::{closure#3}> Line | Count | Source | 245 | 24.0k | pub(crate) fn set_look_have( | 246 | 24.0k | &mut self, | 247 | 24.0k | set: impl FnMut(LookSet) -> LookSet, | 248 | 24.0k | ) { | 249 | 24.0k | self.repr_vec().set_look_have(set) | 250 | 24.0k | } |
<regex_automata::util::determinize::state::StateBuilderMatches>::set_look_have::<regex_automata::util::determinize::set_lookbehind_from_start::{closure#4}> Line | Count | Source | 245 | 3.15k | pub(crate) fn set_look_have( | 246 | 3.15k | &mut self, | 247 | 3.15k | set: impl FnMut(LookSet) -> LookSet, | 248 | 3.15k | ) { | 249 | 3.15k | self.repr_vec().set_look_have(set) | 250 | 3.15k | } |
<regex_automata::util::determinize::state::StateBuilderMatches>::set_look_have::<regex_automata::util::determinize::set_lookbehind_from_start::{closure#5}> Line | Count | Source | 245 | 5.12k | pub(crate) fn set_look_have( | 246 | 5.12k | &mut self, | 247 | 5.12k | set: impl FnMut(LookSet) -> LookSet, | 248 | 5.12k | ) { | 249 | 5.12k | self.repr_vec().set_look_have(set) | 250 | 5.12k | } |
<regex_automata::util::determinize::state::StateBuilderMatches>::set_look_have::<regex_automata::util::determinize::set_lookbehind_from_start::{closure#6}> Line | Count | Source | 245 | 8.27k | pub(crate) fn set_look_have( | 246 | 8.27k | &mut self, | 247 | 8.27k | set: impl FnMut(LookSet) -> LookSet, | 248 | 8.27k | ) { | 249 | 8.27k | self.repr_vec().set_look_have(set) | 250 | 8.27k | } |
<regex_automata::util::determinize::state::StateBuilderMatches>::set_look_have::<regex_automata::util::determinize::set_lookbehind_from_start::{closure#7}> Line | Count | Source | 245 | 15.3k | pub(crate) fn set_look_have( | 246 | 15.3k | &mut self, | 247 | 15.3k | set: impl FnMut(LookSet) -> LookSet, | 248 | 15.3k | ) { | 249 | 15.3k | self.repr_vec().set_look_have(set) | 250 | 15.3k | } |
<regex_automata::util::determinize::state::StateBuilderMatches>::set_look_have::<regex_automata::util::determinize::set_lookbehind_from_start::{closure#8}> Line | Count | Source | 245 | 1.57k | pub(crate) fn set_look_have( | 246 | 1.57k | &mut self, | 247 | 1.57k | set: impl FnMut(LookSet) -> LookSet, | 248 | 1.57k | ) { | 249 | 1.57k | self.repr_vec().set_look_have(set) | 250 | 1.57k | } |
Unexecuted instantiation: <regex_automata::util::determinize::state::StateBuilderMatches>::set_look_have::<regex_automata::util::determinize::set_lookbehind_from_start::{closure#9}> <regex_automata::util::determinize::state::StateBuilderMatches>::set_look_have::<regex_automata::util::determinize::set_lookbehind_from_start::{closure#10}> Line | Count | Source | 245 | 15.2k | pub(crate) fn set_look_have( | 246 | 15.2k | &mut self, | 247 | 15.2k | set: impl FnMut(LookSet) -> LookSet, | 248 | 15.2k | ) { | 249 | 15.2k | self.repr_vec().set_look_have(set) | 250 | 15.2k | } |
<regex_automata::util::determinize::state::StateBuilderMatches>::set_look_have::<regex_automata::util::determinize::set_lookbehind_from_start::{closure#11}> Line | Count | Source | 245 | 8.21k | pub(crate) fn set_look_have( | 246 | 8.21k | &mut self, | 247 | 8.21k | set: impl FnMut(LookSet) -> LookSet, | 248 | 8.21k | ) { | 249 | 8.21k | self.repr_vec().set_look_have(set) | 250 | 8.21k | } |
<regex_automata::util::determinize::state::StateBuilderMatches>::set_look_have::<regex_automata::util::determinize::set_lookbehind_from_start::{closure#1}> Line | Count | Source | 245 | 25.7k | pub(crate) fn set_look_have( | 246 | 25.7k | &mut self, | 247 | 25.7k | set: impl FnMut(LookSet) -> LookSet, | 248 | 25.7k | ) { | 249 | 25.7k | self.repr_vec().set_look_have(set) | 250 | 25.7k | } |
<regex_automata::util::determinize::state::StateBuilderMatches>::set_look_have::<regex_automata::util::determinize::set_lookbehind_from_start::{closure#12}> Line | Count | Source | 245 | 15.2k | pub(crate) fn set_look_have( | 246 | 15.2k | &mut self, | 247 | 15.2k | set: impl FnMut(LookSet) -> LookSet, | 248 | 15.2k | ) { | 249 | 15.2k | self.repr_vec().set_look_have(set) | 250 | 15.2k | } |
<regex_automata::util::determinize::state::StateBuilderMatches>::set_look_have::<regex_automata::util::determinize::next::{closure#2}> Line | Count | Source | 245 | 31.2k | pub(crate) fn set_look_have( | 246 | 31.2k | &mut self, | 247 | 31.2k | set: impl FnMut(LookSet) -> LookSet, | 248 | 31.2k | ) { | 249 | 31.2k | self.repr_vec().set_look_have(set) | 250 | 31.2k | } |
<regex_automata::util::determinize::state::StateBuilderMatches>::set_look_have::<regex_automata::util::determinize::next::{closure#3}> Line | Count | Source | 245 | 2.49M | pub(crate) fn set_look_have( | 246 | 2.49M | &mut self, | 247 | 2.49M | set: impl FnMut(LookSet) -> LookSet, | 248 | 2.49M | ) { | 249 | 2.49M | self.repr_vec().set_look_have(set) | 250 | 2.49M | } |
<regex_automata::util::determinize::state::StateBuilderMatches>::set_look_have::<regex_automata::util::determinize::next::{closure#1}> Line | Count | Source | 245 | 63.0k | pub(crate) fn set_look_have( | 246 | 63.0k | &mut self, | 247 | 63.0k | set: impl FnMut(LookSet) -> LookSet, | 248 | 63.0k | ) { | 249 | 63.0k | self.repr_vec().set_look_have(set) | 250 | 63.0k | } |
|
251 | | |
252 | 2.42M | pub(crate) fn add_match_pattern_id(&mut self, pid: PatternID) { |
253 | 2.42M | self.repr_vec().add_match_pattern_id(pid) |
254 | 2.42M | } |
255 | | |
256 | 0 | fn repr(&self) -> Repr<'_> { |
257 | 0 | Repr(&self.0) |
258 | 0 | } |
259 | | |
260 | 22.6M | fn repr_vec(&mut self) -> ReprVec<'_> { |
261 | 22.6M | ReprVec(&mut self.0) |
262 | 22.6M | } |
263 | | } |
264 | | |
265 | | /// A state builder that collects some assertions and NFA state IDs. |
266 | | /// |
267 | | /// When collecting NFA state IDs is finished, this can be used to build a |
268 | | /// `State` if necessary. |
269 | | /// |
270 | | /// When dont with building a state (regardless of whether it got kept or not), |
271 | | /// it's usually a good idea to call `clear` to get an empty builder back so |
272 | | /// that it can be reused to build the next state. |
273 | | #[derive(Clone)] |
274 | | pub(crate) struct StateBuilderNFA { |
275 | | repr: Vec<u8>, |
276 | | prev_nfa_state_id: StateID, |
277 | | } |
278 | | |
279 | | impl core::fmt::Debug for StateBuilderNFA { |
280 | 0 | fn fmt(&self, f: &mut core::fmt::Formatter) -> core::fmt::Result { |
281 | 0 | f.debug_tuple("StateBuilderNFA").field(&self.repr()).finish() |
282 | 0 | } |
283 | | } |
284 | | |
285 | | /// For docs on these routines, see the internal Repr and ReprVec types below. |
286 | | impl StateBuilderNFA { |
287 | 2.26M | pub(crate) fn to_state(&self) -> State { |
288 | 2.26M | State(Arc::from(&*self.repr)) |
289 | 2.26M | } |
290 | | |
291 | 16.6M | pub(crate) fn clear(self) -> StateBuilderEmpty { |
292 | 16.6M | let mut builder = StateBuilderEmpty(self.repr); |
293 | 16.6M | builder.clear(); |
294 | 16.6M | builder |
295 | 16.6M | } |
296 | | |
297 | 16.6M | pub(crate) fn look_need(&self) -> LookSet { |
298 | 16.6M | self.repr().look_need() |
299 | 16.6M | } |
300 | | |
301 | 14.9M | pub(crate) fn set_look_have( |
302 | 14.9M | &mut self, |
303 | 14.9M | set: impl FnMut(LookSet) -> LookSet, |
304 | 14.9M | ) { |
305 | 14.9M | self.repr_vec().set_look_have(set) |
306 | 14.9M | } |
307 | | |
308 | 413M | pub(crate) fn set_look_need( |
309 | 413M | &mut self, |
310 | 413M | set: impl FnMut(LookSet) -> LookSet, |
311 | 413M | ) { |
312 | 413M | self.repr_vec().set_look_need(set) |
313 | 413M | } |
314 | | |
315 | 2.12G | pub(crate) fn add_nfa_state_id(&mut self, sid: StateID) { |
316 | 2.12G | ReprVec(&mut self.repr) |
317 | 2.12G | .add_nfa_state_id(&mut self.prev_nfa_state_id, sid) |
318 | 2.12G | } |
319 | | |
320 | 17.8M | pub(crate) fn as_bytes(&self) -> &[u8] { |
321 | 17.8M | &self.repr |
322 | 17.8M | } |
323 | | |
324 | 16.6M | fn repr(&self) -> Repr<'_> { |
325 | 16.6M | Repr(&self.repr) |
326 | 16.6M | } |
327 | | |
328 | 428M | fn repr_vec(&mut self) -> ReprVec<'_> { |
329 | 428M | ReprVec(&mut self.repr) |
330 | 428M | } |
331 | | } |
332 | | |
333 | | /// Repr is a read-only view into the representation of a DFA state. |
334 | | /// |
335 | | /// Primarily, a Repr is how we achieve DRY: we implement decoding the format |
336 | | /// in one place, and then use a Repr to implement the various methods on the |
337 | | /// public state types. |
338 | | /// |
339 | | /// The format is as follows: |
340 | | /// |
341 | | /// The first three bytes correspond to bitsets. |
342 | | /// |
343 | | /// Byte 0 is a bitset corresponding to miscellaneous flags associated with the |
344 | | /// state. Bit 0 is set to 1 if the state is a match state. Bit 1 is set to 1 |
345 | | /// if the state has pattern IDs explicitly written to it. (This is a flag that |
346 | | /// is not meant to be set by determinization, but rather, is used as part of |
347 | | /// an internal space-saving optimization.) Bit 2 is set to 1 if the state was |
348 | | /// generated by a transition over a "word" byte. (Callers may not always set |
349 | | /// this. For example, if the NFA has no word boundary assertion, then needing |
350 | | /// to track whether a state came from a word byte or not is superfluous and |
351 | | /// wasteful.) Bit 3 is set to 1 if the state was generated by a transition |
352 | | /// from a `\r` (forward search) or a `\n` (reverse search) when CRLF mode is |
353 | | /// enabled. |
354 | | /// |
355 | | /// Bytes 1..5 correspond to the look-behind assertions that were satisfied |
356 | | /// by the transition that created this state. (Look-ahead assertions are not |
357 | | /// tracked as part of states. Instead, these are applied by re-computing the |
358 | | /// epsilon closure of a state when computing the transition function. See |
359 | | /// `next` in the parent module.) |
360 | | /// |
361 | | /// Bytes 5..9 correspond to the set of look-around assertions (including both |
362 | | /// look-behind and look-ahead) that appear somewhere in this state's set of |
363 | | /// NFA state IDs. This is used to determine whether this state's epsilon |
364 | | /// closure should be re-computed when computing the transition function. |
365 | | /// Namely, look-around assertions are "just" conditional epsilon transitions, |
366 | | /// so if there are new assertions available when computing the transition |
367 | | /// function, we should only re-compute the epsilon closure if those new |
368 | | /// assertions are relevant to this particular state. |
369 | | /// |
370 | | /// Bytes 9..13 correspond to a 32-bit native-endian encoded integer |
371 | | /// corresponding to the number of patterns encoded in this state. If the state |
372 | | /// is not a match state (byte 0 bit 0 is 0) or if it's only pattern ID is |
373 | | /// PatternID::ZERO, then no integer is encoded at this position. Instead, byte |
374 | | /// offset 3 is the position at which the first NFA state ID is encoded. |
375 | | /// |
376 | | /// For a match state with at least one non-ZERO pattern ID, the next bytes |
377 | | /// correspond to a sequence of 32-bit native endian encoded integers that |
378 | | /// represent each pattern ID, in order, that this match state represents. |
379 | | /// |
380 | | /// After the pattern IDs (if any), NFA state IDs are delta encoded as |
381 | | /// varints.[1] The first NFA state ID is encoded as itself, and each |
382 | | /// subsequent NFA state ID is encoded as the difference between itself and the |
383 | | /// previous NFA state ID. |
384 | | /// |
385 | | /// [1] - https://developers.google.com/protocol-buffers/docs/encoding#varints |
386 | | struct Repr<'a>(&'a [u8]); |
387 | | |
388 | | impl<'a> Repr<'a> { |
389 | | /// Returns true if and only if this is a match state. |
390 | | /// |
391 | | /// If callers have added pattern IDs to this state, then callers MUST set |
392 | | /// this state as a match state explicitly. However, as a special case, |
393 | | /// states that are marked as match states but with no pattern IDs, then |
394 | | /// the state is treated as if it had a single pattern ID equivalent to |
395 | | /// PatternID::ZERO. |
396 | 2.31M | fn is_match(&self) -> bool { |
397 | 2.31M | self.0[0] & (1 << 0) > 0 |
398 | 2.31M | } |
399 | | |
400 | | /// Returns true if and only if this state has had at least one pattern |
401 | | /// ID added to it. |
402 | | /// |
403 | | /// This is an internal-only flag that permits the representation to save |
404 | | /// space in the common case of an NFA with one pattern in it. In that |
405 | | /// case, a match state can only ever have exactly one pattern ID: |
406 | | /// PatternID::ZERO. So there's no need to represent it. |
407 | 35.6M | fn has_pattern_ids(&self) -> bool { |
408 | 35.6M | self.0[0] & (1 << 1) > 0 |
409 | 35.6M | } |
410 | | |
411 | | /// Returns true if and only if this state is marked as having been created |
412 | | /// from a transition over a word byte. This is useful for checking whether |
413 | | /// a word boundary assertion is true or not, which requires look-behind |
414 | | /// (whether the current state came from a word byte or not) and look-ahead |
415 | | /// (whether the transition byte is a word byte or not). |
416 | | /// |
417 | | /// Since states with this set are distinct from states that don't have |
418 | | /// this set (even if they are otherwise equivalent), callers should not |
419 | | /// set this assertion unless the underlying NFA has at least one word |
420 | | /// boundary assertion somewhere. Otherwise, a superfluous number of states |
421 | | /// may be created. |
422 | 8.13M | fn is_from_word(&self) -> bool { |
423 | 8.13M | self.0[0] & (1 << 2) > 0 |
424 | 8.13M | } |
425 | | |
426 | | /// Returns true if and only if this state is marked as being inside of a |
427 | | /// CRLF terminator. In the forward direction, this means the state was |
428 | | /// created after seeing a `\r`. In the reverse direction, this means the |
429 | | /// state was created after seeing a `\n`. |
430 | 2.89M | fn is_half_crlf(&self) -> bool { |
431 | 2.89M | self.0[0] & (1 << 3) > 0 |
432 | 2.89M | } |
433 | | |
434 | | /// The set of look-behind assertions that were true in the transition that |
435 | | /// created this state. |
436 | | /// |
437 | | /// Generally, this should be empty if 'look_need' is empty, since there is |
438 | | /// no reason to track which look-behind assertions are true if the state |
439 | | /// has no conditional epsilon transitions. |
440 | | /// |
441 | | /// Satisfied look-ahead assertions are not tracked in states. Instead, |
442 | | /// these are re-computed on demand via epsilon closure when computing the |
443 | | /// transition function. |
444 | 23.4M | fn look_have(&self) -> LookSet { |
445 | 23.4M | LookSet::read_repr(&self.0[1..]) |
446 | 23.4M | } |
447 | | |
448 | | /// The set of look-around (both behind and ahead) assertions that appear |
449 | | /// at least once in this state's set of NFA states. |
450 | | /// |
451 | | /// This is used to determine whether the epsilon closure needs to be |
452 | | /// re-computed when computing the transition function. Namely, if the |
453 | | /// state has no conditional epsilon transitions, then there is no need |
454 | | /// to re-compute the epsilon closure. |
455 | 449M | fn look_need(&self) -> LookSet { |
456 | 449M | LookSet::read_repr(&self.0[5..]) |
457 | 449M | } |
458 | | |
459 | | /// Returns the total number of match pattern IDs in this state. |
460 | | /// |
461 | | /// If this state is not a match state, then this always returns 0. |
462 | 0 | fn match_len(&self) -> usize { |
463 | 0 | if !self.is_match() { |
464 | 0 | return 0; |
465 | 0 | } else if !self.has_pattern_ids() { |
466 | 0 | 1 |
467 | | } else { |
468 | 0 | self.encoded_pattern_len() |
469 | | } |
470 | 0 | } |
471 | | |
472 | | /// Returns the pattern ID for this match state at the given index. |
473 | | /// |
474 | | /// If the given index is greater than or equal to `match_len()` for this |
475 | | /// state, then this could panic or return incorrect results. |
476 | 0 | fn match_pattern(&self, index: usize) -> PatternID { |
477 | 0 | if !self.has_pattern_ids() { |
478 | 0 | PatternID::ZERO |
479 | | } else { |
480 | 0 | let offset = 13 + index * PatternID::SIZE; |
481 | 0 | // This is OK since we only ever serialize valid PatternIDs to |
482 | 0 | // states. |
483 | 0 | wire::read_pattern_id_unchecked(&self.0[offset..]).0 |
484 | | } |
485 | 0 | } |
486 | | |
487 | | /// Returns a copy of all match pattern IDs in this state. If this state |
488 | | /// is not a match state, then this returns None. |
489 | 936k | fn match_pattern_ids(&self) -> Option<Vec<PatternID>> { |
490 | 936k | if !self.is_match() { |
491 | 821k | return None; |
492 | 115k | } |
493 | 115k | let mut pids = alloc::vec![]; |
494 | 115k | self.iter_match_pattern_ids(|pid| pids.push(pid)); |
495 | 115k | Some(pids) |
496 | 936k | } |
497 | | |
498 | | /// Calls the given function on every pattern ID in this state. |
499 | 115k | fn iter_match_pattern_ids<F: FnMut(PatternID)>(&self, mut f: F) { |
500 | 115k | if !self.is_match() { |
501 | 0 | return; |
502 | 115k | } |
503 | 115k | // As an optimization for a very common case, when this is a match |
504 | 115k | // state for an NFA with only one pattern, we don't actually write the |
505 | 115k | // pattern ID to the state representation. Instead, we know it must |
506 | 115k | // be there since it is the only possible choice. |
507 | 115k | if !self.has_pattern_ids() { |
508 | 115k | f(PatternID::ZERO); |
509 | 115k | return; |
510 | 0 | } |
511 | 0 | let mut pids = &self.0[13..self.pattern_offset_end()]; |
512 | 0 | while !pids.is_empty() { |
513 | 0 | let pid = wire::read_u32(pids); |
514 | 0 | pids = &pids[PatternID::SIZE..]; |
515 | 0 | // This is OK since we only ever serialize valid PatternIDs to |
516 | 0 | // states. And since pattern IDs can never exceed a usize, the |
517 | 0 | // unwrap is OK. |
518 | 0 | f(PatternID::new_unchecked(usize::try_from(pid).unwrap())); |
519 | 0 | } |
520 | 115k | } |
521 | | |
522 | | /// Calls the given function on every NFA state ID in this state. |
523 | 16.2M | fn iter_nfa_state_ids<F: FnMut(StateID)>(&self, mut f: F) { |
524 | 16.2M | let mut sids = &self.0[self.pattern_offset_end()..]; |
525 | 16.2M | let mut prev = 0i32; |
526 | 2.16G | while !sids.is_empty() { |
527 | 2.14G | let (delta, nr) = read_vari32(sids); |
528 | 2.14G | sids = &sids[nr..]; |
529 | 2.14G | let sid = prev + delta; |
530 | 2.14G | prev = sid; |
531 | 2.14G | // This is OK since we only ever serialize valid StateIDs to |
532 | 2.14G | // states. And since state IDs can never exceed an isize, they must |
533 | 2.14G | // always be able to fit into a usize, and thus cast is OK. |
534 | 2.14G | f(StateID::new_unchecked(sid.as_usize())) |
535 | | } |
536 | 16.2M | } <regex_automata::util::determinize::state::Repr>::iter_nfa_state_ids::<regex_automata::util::determinize::next::{closure#0}> Line | Count | Source | 523 | 16.2M | fn iter_nfa_state_ids<F: FnMut(StateID)>(&self, mut f: F) { | 524 | 16.2M | let mut sids = &self.0[self.pattern_offset_end()..]; | 525 | 16.2M | let mut prev = 0i32; | 526 | 2.16G | while !sids.is_empty() { | 527 | 2.14G | let (delta, nr) = read_vari32(sids); | 528 | 2.14G | sids = &sids[nr..]; | 529 | 2.14G | let sid = prev + delta; | 530 | 2.14G | prev = sid; | 531 | 2.14G | // This is OK since we only ever serialize valid StateIDs to | 532 | 2.14G | // states. And since state IDs can never exceed an isize, they must | 533 | 2.14G | // always be able to fit into a usize, and thus cast is OK. | 534 | 2.14G | f(StateID::new_unchecked(sid.as_usize())) | 535 | | } | 536 | 16.2M | } |
Unexecuted instantiation: <regex_automata::util::determinize::state::Repr>::iter_nfa_state_ids::<<regex_automata::util::determinize::state::Repr as core::fmt::Debug>::fmt::{closure#0}> |
537 | | |
538 | | /// Returns the offset into this state's representation where the pattern |
539 | | /// IDs end and the NFA state IDs begin. |
540 | 16.2M | fn pattern_offset_end(&self) -> usize { |
541 | 16.2M | let encoded = self.encoded_pattern_len(); |
542 | 16.2M | if encoded == 0 { |
543 | 16.2M | return 9; |
544 | 0 | } |
545 | 0 | // This arithmetic is OK since we were able to address this many bytes |
546 | 0 | // when writing to the state, thus, it must fit into a usize. |
547 | 0 | encoded.checked_mul(4).unwrap().checked_add(13).unwrap() |
548 | 16.2M | } |
549 | | |
550 | | /// Returns the total number of *encoded* pattern IDs in this state. |
551 | | /// |
552 | | /// This may return 0 even when this is a match state, since the pattern |
553 | | /// ID `PatternID::ZERO` is not encoded when it's the only pattern ID in |
554 | | /// the match state (the overwhelming common case). |
555 | 16.2M | fn encoded_pattern_len(&self) -> usize { |
556 | 16.2M | if !self.has_pattern_ids() { |
557 | 16.2M | return 0; |
558 | 0 | } |
559 | 0 | // This unwrap is OK since the total number of patterns is always |
560 | 0 | // guaranteed to fit into a usize. |
561 | 0 | usize::try_from(wire::read_u32(&self.0[9..13])).unwrap() |
562 | 16.2M | } |
563 | | } |
564 | | |
565 | | impl<'a> core::fmt::Debug for Repr<'a> { |
566 | 0 | fn fmt(&self, f: &mut core::fmt::Formatter) -> core::fmt::Result { |
567 | 0 | let mut nfa_ids = alloc::vec![]; |
568 | 0 | self.iter_nfa_state_ids(|sid| nfa_ids.push(sid)); |
569 | 0 | f.debug_struct("Repr") |
570 | 0 | .field("is_match", &self.is_match()) |
571 | 0 | .field("is_from_word", &self.is_from_word()) |
572 | 0 | .field("is_half_crlf", &self.is_half_crlf()) |
573 | 0 | .field("look_have", &self.look_have()) |
574 | 0 | .field("look_need", &self.look_need()) |
575 | 0 | .field("match_pattern_ids", &self.match_pattern_ids()) |
576 | 0 | .field("nfa_state_ids", &nfa_ids) |
577 | 0 | .finish() |
578 | 0 | } |
579 | | } |
580 | | |
581 | | /// ReprVec is a write-only view into the representation of a DFA state. |
582 | | /// |
583 | | /// See Repr for more details on the purpose of this type and also the format. |
584 | | /// |
585 | | /// Note that not all possible combinations of methods may be called. This is |
586 | | /// precisely what the various StateBuilder types encapsulate: they only |
587 | | /// permit valid combinations via Rust's linear typing. |
588 | | struct ReprVec<'a>(&'a mut Vec<u8>); |
589 | | |
590 | | impl<'a> ReprVec<'a> { |
591 | | /// Set this state as a match state. |
592 | | /// |
593 | | /// This should not be exposed explicitly outside of this module. It is |
594 | | /// set automatically when a pattern ID is added. |
595 | 2.42M | fn set_is_match(&mut self) { |
596 | 2.42M | self.0[0] |= 1 << 0; |
597 | 2.42M | } |
598 | | |
599 | | /// Set that this state has pattern IDs explicitly written to it. |
600 | | /// |
601 | | /// This should not be exposed explicitly outside of this module. This is |
602 | | /// used internally as a space saving optimization. Namely, if the state |
603 | | /// is a match state but does not have any pattern IDs written to it, |
604 | | /// then it is automatically inferred to have a pattern ID of ZERO. |
605 | 0 | fn set_has_pattern_ids(&mut self) { |
606 | 0 | self.0[0] |= 1 << 1; |
607 | 0 | } |
608 | | |
609 | | /// Set this state as being built from a transition over a word byte. |
610 | | /// |
611 | | /// Setting this is only necessary when one needs to deal with word |
612 | | /// boundary assertions. Therefore, if the underlying NFA has no word |
613 | | /// boundary assertions, callers should not set this. |
614 | 526k | fn set_is_from_word(&mut self) { |
615 | 526k | self.0[0] |= 1 << 2; |
616 | 526k | } |
617 | | |
618 | | /// Set this state as having seen half of a CRLF terminator. |
619 | | /// |
620 | | /// In the forward direction, this should be set when a `\r` has been seen. |
621 | | /// In the reverse direction, this should be set when a `\n` has been seen. |
622 | 16.2k | fn set_is_half_crlf(&mut self) { |
623 | 16.2k | self.0[0] |= 1 << 3; |
624 | 16.2k | } |
625 | | |
626 | | /// The set of look-behind assertions that were true in the transition that |
627 | | /// created this state. |
628 | 17.7M | fn look_have(&self) -> LookSet { |
629 | 17.7M | self.repr().look_have() |
630 | 17.7M | } |
631 | | |
632 | | /// The set of look-around (both behind and ahead) assertions that appear |
633 | | /// at least once in this state's set of NFA states. |
634 | 413M | fn look_need(&self) -> LookSet { |
635 | 413M | self.repr().look_need() |
636 | 413M | } |
637 | | |
638 | | /// Mutate the set of look-behind assertions that were true in the |
639 | | /// transition that created this state. |
640 | 17.7M | fn set_look_have(&mut self, mut set: impl FnMut(LookSet) -> LookSet) { |
641 | 17.7M | set(self.look_have()).write_repr(&mut self.0[1..]); |
642 | 17.7M | } <regex_automata::util::determinize::state::ReprVec>::set_look_have::<regex_automata::util::determinize::add_nfa_states::{closure#1}> Line | Count | Source | 640 | 14.9M | fn set_look_have(&mut self, mut set: impl FnMut(LookSet) -> LookSet) { | 641 | 14.9M | set(self.look_have()).write_repr(&mut self.0[1..]); | 642 | 14.9M | } |
<regex_automata::util::determinize::state::ReprVec>::set_look_have::<regex_automata::util::determinize::set_lookbehind_from_start::{closure#0}> Line | Count | Source | 640 | 77.4k | fn set_look_have(&mut self, mut set: impl FnMut(LookSet) -> LookSet) { | 641 | 77.4k | set(self.look_have()).write_repr(&mut self.0[1..]); | 642 | 77.4k | } |
<regex_automata::util::determinize::state::ReprVec>::set_look_have::<regex_automata::util::determinize::set_lookbehind_from_start::{closure#2}> Line | Count | Source | 640 | 10.3k | fn set_look_have(&mut self, mut set: impl FnMut(LookSet) -> LookSet) { | 641 | 10.3k | set(self.look_have()).write_repr(&mut self.0[1..]); | 642 | 10.3k | } |
<regex_automata::util::determinize::state::ReprVec>::set_look_have::<regex_automata::util::determinize::set_lookbehind_from_start::{closure#3}> Line | Count | Source | 640 | 24.0k | fn set_look_have(&mut self, mut set: impl FnMut(LookSet) -> LookSet) { | 641 | 24.0k | set(self.look_have()).write_repr(&mut self.0[1..]); | 642 | 24.0k | } |
<regex_automata::util::determinize::state::ReprVec>::set_look_have::<regex_automata::util::determinize::set_lookbehind_from_start::{closure#4}> Line | Count | Source | 640 | 3.15k | fn set_look_have(&mut self, mut set: impl FnMut(LookSet) -> LookSet) { | 641 | 3.15k | set(self.look_have()).write_repr(&mut self.0[1..]); | 642 | 3.15k | } |
<regex_automata::util::determinize::state::ReprVec>::set_look_have::<regex_automata::util::determinize::set_lookbehind_from_start::{closure#5}> Line | Count | Source | 640 | 5.12k | fn set_look_have(&mut self, mut set: impl FnMut(LookSet) -> LookSet) { | 641 | 5.12k | set(self.look_have()).write_repr(&mut self.0[1..]); | 642 | 5.12k | } |
<regex_automata::util::determinize::state::ReprVec>::set_look_have::<regex_automata::util::determinize::set_lookbehind_from_start::{closure#6}> Line | Count | Source | 640 | 8.27k | fn set_look_have(&mut self, mut set: impl FnMut(LookSet) -> LookSet) { | 641 | 8.27k | set(self.look_have()).write_repr(&mut self.0[1..]); | 642 | 8.27k | } |
<regex_automata::util::determinize::state::ReprVec>::set_look_have::<regex_automata::util::determinize::set_lookbehind_from_start::{closure#7}> Line | Count | Source | 640 | 15.3k | fn set_look_have(&mut self, mut set: impl FnMut(LookSet) -> LookSet) { | 641 | 15.3k | set(self.look_have()).write_repr(&mut self.0[1..]); | 642 | 15.3k | } |
<regex_automata::util::determinize::state::ReprVec>::set_look_have::<regex_automata::util::determinize::set_lookbehind_from_start::{closure#8}> Line | Count | Source | 640 | 1.57k | fn set_look_have(&mut self, mut set: impl FnMut(LookSet) -> LookSet) { | 641 | 1.57k | set(self.look_have()).write_repr(&mut self.0[1..]); | 642 | 1.57k | } |
Unexecuted instantiation: <regex_automata::util::determinize::state::ReprVec>::set_look_have::<regex_automata::util::determinize::set_lookbehind_from_start::{closure#9}> <regex_automata::util::determinize::state::ReprVec>::set_look_have::<regex_automata::util::determinize::set_lookbehind_from_start::{closure#10}> Line | Count | Source | 640 | 15.2k | fn set_look_have(&mut self, mut set: impl FnMut(LookSet) -> LookSet) { | 641 | 15.2k | set(self.look_have()).write_repr(&mut self.0[1..]); | 642 | 15.2k | } |
<regex_automata::util::determinize::state::ReprVec>::set_look_have::<regex_automata::util::determinize::set_lookbehind_from_start::{closure#11}> Line | Count | Source | 640 | 8.21k | fn set_look_have(&mut self, mut set: impl FnMut(LookSet) -> LookSet) { | 641 | 8.21k | set(self.look_have()).write_repr(&mut self.0[1..]); | 642 | 8.21k | } |
<regex_automata::util::determinize::state::ReprVec>::set_look_have::<regex_automata::util::determinize::set_lookbehind_from_start::{closure#1}> Line | Count | Source | 640 | 25.7k | fn set_look_have(&mut self, mut set: impl FnMut(LookSet) -> LookSet) { | 641 | 25.7k | set(self.look_have()).write_repr(&mut self.0[1..]); | 642 | 25.7k | } |
<regex_automata::util::determinize::state::ReprVec>::set_look_have::<regex_automata::util::determinize::set_lookbehind_from_start::{closure#12}> Line | Count | Source | 640 | 15.2k | fn set_look_have(&mut self, mut set: impl FnMut(LookSet) -> LookSet) { | 641 | 15.2k | set(self.look_have()).write_repr(&mut self.0[1..]); | 642 | 15.2k | } |
<regex_automata::util::determinize::state::ReprVec>::set_look_have::<regex_automata::util::determinize::next::{closure#2}> Line | Count | Source | 640 | 31.2k | fn set_look_have(&mut self, mut set: impl FnMut(LookSet) -> LookSet) { | 641 | 31.2k | set(self.look_have()).write_repr(&mut self.0[1..]); | 642 | 31.2k | } |
<regex_automata::util::determinize::state::ReprVec>::set_look_have::<regex_automata::util::determinize::next::{closure#3}> Line | Count | Source | 640 | 2.49M | fn set_look_have(&mut self, mut set: impl FnMut(LookSet) -> LookSet) { | 641 | 2.49M | set(self.look_have()).write_repr(&mut self.0[1..]); | 642 | 2.49M | } |
<regex_automata::util::determinize::state::ReprVec>::set_look_have::<regex_automata::util::determinize::next::{closure#1}> Line | Count | Source | 640 | 63.0k | fn set_look_have(&mut self, mut set: impl FnMut(LookSet) -> LookSet) { | 641 | 63.0k | set(self.look_have()).write_repr(&mut self.0[1..]); | 642 | 63.0k | } |
|
643 | | |
644 | | /// Mutate the set of look-around (both behind and ahead) assertions that |
645 | | /// appear at least once in this state's set of NFA states. |
646 | 413M | fn set_look_need(&mut self, mut set: impl FnMut(LookSet) -> LookSet) { |
647 | 413M | set(self.look_need()).write_repr(&mut self.0[5..]); |
648 | 413M | } |
649 | | |
650 | | /// Add a pattern ID to this state. All match states must have at least |
651 | | /// one pattern ID associated with it. |
652 | | /// |
653 | | /// Callers must never add duplicative pattern IDs. |
654 | | /// |
655 | | /// The order in which patterns are added must correspond to the order |
656 | | /// in which patterns are reported as matches. |
657 | 2.42M | fn add_match_pattern_id(&mut self, pid: PatternID) { |
658 | 2.42M | // As a (somewhat small) space saving optimization, in the case where |
659 | 2.42M | // a matching state has exactly one pattern ID, PatternID::ZERO, we do |
660 | 2.42M | // not write either the pattern ID or the number of patterns encoded. |
661 | 2.42M | // Instead, all we do is set the 'is_match' bit on this state. Overall, |
662 | 2.42M | // this saves 8 bytes per match state for the overwhelming majority of |
663 | 2.42M | // match states. |
664 | 2.42M | // |
665 | 2.42M | // In order to know whether pattern IDs need to be explicitly read or |
666 | 2.42M | // not, we use another internal-only bit, 'has_pattern_ids', to |
667 | 2.42M | // indicate whether they have been explicitly written or not. |
668 | 2.42M | if !self.repr().has_pattern_ids() { |
669 | 2.42M | if pid == PatternID::ZERO { |
670 | 2.42M | self.set_is_match(); |
671 | 2.42M | return; |
672 | 0 | } |
673 | 0 | // Make room for 'close_match_pattern_ids' to write the total |
674 | 0 | // number of pattern IDs written. |
675 | 0 | self.0.extend(core::iter::repeat(0).take(PatternID::SIZE)); |
676 | 0 | self.set_has_pattern_ids(); |
677 | 0 | // If this was already a match state, then the only way that's |
678 | 0 | // possible when the state doesn't have pattern IDs is if |
679 | 0 | // PatternID::ZERO was added by the caller previously. In this |
680 | 0 | // case, we are now adding a non-ZERO pattern ID after it, in |
681 | 0 | // which case, we want to make sure to represent ZERO explicitly |
682 | 0 | // now. |
683 | 0 | if self.repr().is_match() { |
684 | 0 | write_u32(self.0, 0) |
685 | 0 | } else { |
686 | 0 | // Otherwise, just make sure the 'is_match' bit is set. |
687 | 0 | self.set_is_match(); |
688 | 0 | } |
689 | 0 | } |
690 | 0 | write_u32(self.0, pid.as_u32()); |
691 | 2.42M | } |
692 | | |
693 | | /// Indicate that no more pattern IDs will be added to this state. |
694 | | /// |
695 | | /// Once this is called, callers must not call it or 'add_match_pattern_id' |
696 | | /// again. |
697 | | /// |
698 | | /// This should not be exposed explicitly outside of this module. It |
699 | | /// should be called only when converting a StateBuilderMatches into a |
700 | | /// StateBuilderNFA. |
701 | 16.8M | fn close_match_pattern_ids(&mut self) { |
702 | 16.8M | // If we never wrote any pattern IDs, then there's nothing to do here. |
703 | 16.8M | if !self.repr().has_pattern_ids() { |
704 | 16.8M | return; |
705 | 0 | } |
706 | 0 | let patsize = PatternID::SIZE; |
707 | 0 | let pattern_bytes = self.0.len() - 13; |
708 | 0 | // Every pattern ID uses 4 bytes, so number of bytes should be |
709 | 0 | // divisible by 4. |
710 | 0 | assert_eq!(pattern_bytes % patsize, 0); |
711 | | // This unwrap is OK since we are guaranteed that the maximum number |
712 | | // of possible patterns fits into a u32. |
713 | 0 | let count32 = u32::try_from(pattern_bytes / patsize).unwrap(); |
714 | 0 | wire::NE::write_u32(count32, &mut self.0[9..13]); |
715 | 16.8M | } |
716 | | |
717 | | /// Add an NFA state ID to this state. The order in which NFA states are |
718 | | /// added matters. It is the caller's responsibility to ensure that |
719 | | /// duplicate NFA state IDs are not added. |
720 | 2.12G | fn add_nfa_state_id(&mut self, prev: &mut StateID, sid: StateID) { |
721 | 2.12G | let delta = sid.as_i32() - prev.as_i32(); |
722 | 2.12G | write_vari32(self.0, delta); |
723 | 2.12G | *prev = sid; |
724 | 2.12G | } |
725 | | |
726 | | /// Return a read-only view of this state's representation. |
727 | 450M | fn repr(&self) -> Repr<'_> { |
728 | 450M | Repr(self.0.as_slice()) |
729 | 450M | } |
730 | | } |
731 | | |
732 | | /// Write a signed 32-bit integer using zig-zag encoding. |
733 | | /// |
734 | | /// https://developers.google.com/protocol-buffers/docs/encoding#varints |
735 | 2.12G | fn write_vari32(data: &mut Vec<u8>, n: i32) { |
736 | 2.12G | let mut un = n.to_bits() << 1; |
737 | 2.12G | if n < 0 { |
738 | 625M | un = !un; |
739 | 1.49G | } |
740 | 2.12G | write_varu32(data, un) |
741 | 2.12G | } |
742 | | |
743 | | /// Read a signed 32-bit integer using zig-zag encoding. Also, return the |
744 | | /// number of bytes read. |
745 | | /// |
746 | | /// https://developers.google.com/protocol-buffers/docs/encoding#varints |
747 | 2.14G | fn read_vari32(data: &[u8]) -> (i32, usize) { |
748 | 2.14G | let (un, i) = read_varu32(data); |
749 | 2.14G | let mut n = i32::from_bits(un >> 1); |
750 | 2.14G | if un & 1 != 0 { |
751 | 631M | n = !n; |
752 | 1.51G | } |
753 | 2.14G | (n, i) |
754 | 2.14G | } |
755 | | |
756 | | /// Write an unsigned 32-bit integer as a varint. In essence, `n` is written |
757 | | /// as a sequence of bytes where all bytes except for the last one have the |
758 | | /// most significant bit set. The least significant 7 bits correspond to the |
759 | | /// actual bits of `n`. So in the worst case, a varint uses 5 bytes, but in |
760 | | /// very common cases, it uses fewer than 4. |
761 | | /// |
762 | | /// https://developers.google.com/protocol-buffers/docs/encoding#varints |
763 | 2.12G | fn write_varu32(data: &mut Vec<u8>, mut n: u32) { |
764 | 2.27G | while n >= 0b1000_0000 { |
765 | 156M | data.push(n.low_u8() | 0b1000_0000); |
766 | 156M | n >>= 7; |
767 | 156M | } |
768 | 2.12G | data.push(n.low_u8()); |
769 | 2.12G | } |
770 | | |
771 | | /// Read an unsigned 32-bit varint. Also, return the number of bytes read. |
772 | | /// |
773 | | /// https://developers.google.com/protocol-buffers/docs/encoding#varints |
774 | 2.14G | fn read_varu32(data: &[u8]) -> (u32, usize) { |
775 | 2.14G | // N.B. We can assume correctness here since we know that all varuints are |
776 | 2.14G | // written with write_varu32. Hence, the 'as' uses and unchecked arithmetic |
777 | 2.14G | // is all okay. |
778 | 2.14G | let mut n: u32 = 0; |
779 | 2.14G | let mut shift: u32 = 0; |
780 | 2.30G | for (i, &b) in data.iter().enumerate() { |
781 | 2.30G | if b < 0b1000_0000 { |
782 | 2.14G | return (n | (u32::from(b) << shift), i + 1); |
783 | 157M | } |
784 | 157M | n |= (u32::from(b) & 0b0111_1111) << shift; |
785 | 157M | shift += 7; |
786 | | } |
787 | 0 | (0, 0) |
788 | 2.14G | } |
789 | | |
790 | | /// Push a native-endian encoded `n` on to `dst`. |
791 | 0 | fn write_u32(dst: &mut Vec<u8>, n: u32) { |
792 | | use crate::util::wire::NE; |
793 | | |
794 | 0 | let start = dst.len(); |
795 | 0 | dst.extend(core::iter::repeat(0).take(mem::size_of::<u32>())); |
796 | 0 | NE::write_u32(n, &mut dst[start..]); |
797 | 0 | } |
798 | | |
799 | | #[cfg(test)] |
800 | | mod tests { |
801 | | use alloc::vec; |
802 | | |
803 | | use quickcheck::quickcheck; |
804 | | |
805 | | use super::*; |
806 | | |
807 | | #[cfg(not(miri))] |
808 | | quickcheck! { |
809 | | fn prop_state_read_write_nfa_state_ids(sids: Vec<StateID>) -> bool { |
810 | | // Builders states do not permit duplicate IDs. |
811 | | let sids = dedup_state_ids(sids); |
812 | | |
813 | | let mut b = StateBuilderEmpty::new().into_matches().into_nfa(); |
814 | | for &sid in &sids { |
815 | | b.add_nfa_state_id(sid); |
816 | | } |
817 | | let s = b.to_state(); |
818 | | let mut got = vec![]; |
819 | | s.iter_nfa_state_ids(|sid| got.push(sid)); |
820 | | got == sids |
821 | | } |
822 | | |
823 | | fn prop_state_read_write_pattern_ids(pids: Vec<PatternID>) -> bool { |
824 | | // Builders states do not permit duplicate IDs. |
825 | | let pids = dedup_pattern_ids(pids); |
826 | | |
827 | | let mut b = StateBuilderEmpty::new().into_matches(); |
828 | | for &pid in &pids { |
829 | | b.add_match_pattern_id(pid); |
830 | | } |
831 | | let s = b.into_nfa().to_state(); |
832 | | let mut got = vec![]; |
833 | | s.iter_match_pattern_ids(|pid| got.push(pid)); |
834 | | got == pids |
835 | | } |
836 | | |
837 | | fn prop_state_read_write_nfa_state_and_pattern_ids( |
838 | | sids: Vec<StateID>, |
839 | | pids: Vec<PatternID> |
840 | | ) -> bool { |
841 | | // Builders states do not permit duplicate IDs. |
842 | | let sids = dedup_state_ids(sids); |
843 | | let pids = dedup_pattern_ids(pids); |
844 | | |
845 | | let mut b = StateBuilderEmpty::new().into_matches(); |
846 | | for &pid in &pids { |
847 | | b.add_match_pattern_id(pid); |
848 | | } |
849 | | |
850 | | let mut b = b.into_nfa(); |
851 | | for &sid in &sids { |
852 | | b.add_nfa_state_id(sid); |
853 | | } |
854 | | |
855 | | let s = b.to_state(); |
856 | | let mut got_pids = vec![]; |
857 | | s.iter_match_pattern_ids(|pid| got_pids.push(pid)); |
858 | | let mut got_sids = vec![]; |
859 | | s.iter_nfa_state_ids(|sid| got_sids.push(sid)); |
860 | | got_pids == pids && got_sids == sids |
861 | | } |
862 | | } |
863 | | |
864 | | quickcheck! { |
865 | | fn prop_read_write_varu32(n: u32) -> bool { |
866 | | let mut buf = vec![]; |
867 | | write_varu32(&mut buf, n); |
868 | | let (got, nread) = read_varu32(&buf); |
869 | | nread == buf.len() && got == n |
870 | | } |
871 | | |
872 | | fn prop_read_write_vari32(n: i32) -> bool { |
873 | | let mut buf = vec![]; |
874 | | write_vari32(&mut buf, n); |
875 | | let (got, nread) = read_vari32(&buf); |
876 | | nread == buf.len() && got == n |
877 | | } |
878 | | } |
879 | | |
880 | | #[cfg(not(miri))] |
881 | | fn dedup_state_ids(sids: Vec<StateID>) -> Vec<StateID> { |
882 | | let mut set = alloc::collections::BTreeSet::new(); |
883 | | let mut deduped = vec![]; |
884 | | for sid in sids { |
885 | | if set.contains(&sid) { |
886 | | continue; |
887 | | } |
888 | | set.insert(sid); |
889 | | deduped.push(sid); |
890 | | } |
891 | | deduped |
892 | | } |
893 | | |
894 | | #[cfg(not(miri))] |
895 | | fn dedup_pattern_ids(pids: Vec<PatternID>) -> Vec<PatternID> { |
896 | | let mut set = alloc::collections::BTreeSet::new(); |
897 | | let mut deduped = vec![]; |
898 | | for pid in pids { |
899 | | if set.contains(&pid) { |
900 | | continue; |
901 | | } |
902 | | set.insert(pid); |
903 | | deduped.push(pid); |
904 | | } |
905 | | deduped |
906 | | } |
907 | | } |