Coverage Report

Created: 2024-12-17 06:15

/rust/registry/src/index.crates.io-6f17d22bba15001f/regex-automata-0.4.9/src/util/alphabet.rs
Line
Count
Source (jump to first uncovered line)
1
/*!
2
This module provides APIs for dealing with the alphabets of finite state
3
machines.
4
5
There are two principal types in this module, [`ByteClasses`] and [`Unit`].
6
The former defines the alphabet of a finite state machine while the latter
7
represents an element of that alphabet.
8
9
To a first approximation, the alphabet of all automata in this crate is just
10
a `u8`. Namely, every distinct byte value. All 256 of them. In practice, this
11
can be quite wasteful when building a transition table for a DFA, since it
12
requires storing a state identifier for each element in the alphabet. Instead,
13
we collapse the alphabet of an automaton down into equivalence classes, where
14
every byte in the same equivalence class never discriminates between a match or
15
a non-match from any other byte in the same class. For example, in the regex
16
`[a-z]+`, then you could consider it having an alphabet consisting of two
17
equivalence classes: `a-z` and everything else. In terms of the transitions on
18
an automaton, it doesn't actually require representing every distinct byte.
19
Just the equivalence classes.
20
21
The downside of equivalence classes is that, of course, searching a haystack
22
deals with individual byte values. Those byte values need to be mapped to
23
their corresponding equivalence class. This is what `ByteClasses` does. In
24
practice, doing this for every state transition has negligible impact on modern
25
CPUs. Moreover, it helps make more efficient use of the CPU cache by (possibly
26
considerably) shrinking the size of the transition table.
27
28
One last hiccup concerns `Unit`. Namely, because of look-around and how the
29
DFAs in this crate work, we need to add a sentinel value to our alphabet
30
of equivalence classes that represents the "end" of a search. We call that
31
sentinel [`Unit::eoi`] or "end of input." Thus, a `Unit` is either an
32
equivalence class corresponding to a set of bytes, or it is a special "end of
33
input" sentinel.
34
35
In general, you should not expect to need either of these types unless you're
36
doing lower level shenanigans with DFAs, or even building your own DFAs.
37
(Although, you don't have to use these types to build your own DFAs of course.)
38
For example, if you're walking a DFA's state graph, it's probably useful to
39
make use of [`ByteClasses`] to visit each element in the DFA's alphabet instead
40
of just visiting every distinct `u8` value. The latter isn't necessarily wrong,
41
but it could be potentially very wasteful.
42
*/
43
use crate::util::{
44
    escape::DebugByte,
45
    wire::{self, DeserializeError, SerializeError},
46
};
47
48
/// Unit represents a single unit of haystack for DFA based regex engines.
49
///
50
/// It is not expected for consumers of this crate to need to use this type
51
/// unless they are implementing their own DFA. And even then, it's not
52
/// required: implementors may use other techniques to handle haystack units.
53
///
54
/// Typically, a single unit of haystack for a DFA would be a single byte.
55
/// However, for the DFAs in this crate, matches are delayed by a single byte
56
/// in order to handle look-ahead assertions (`\b`, `$` and `\z`). Thus, once
57
/// we have consumed the haystack, we must run the DFA through one additional
58
/// transition using a unit that indicates the haystack has ended.
59
///
60
/// There is no way to represent a sentinel with a `u8` since all possible
61
/// values *may* be valid haystack units to a DFA, therefore this type
62
/// explicitly adds room for a sentinel value.
63
///
64
/// The sentinel EOI value is always its own equivalence class and is
65
/// ultimately represented by adding 1 to the maximum equivalence class value.
66
/// So for example, the regex `^[a-z]+$` might be split into the following
67
/// equivalence classes:
68
///
69
/// ```text
70
/// 0 => [\x00-`]
71
/// 1 => [a-z]
72
/// 2 => [{-\xFF]
73
/// 3 => [EOI]
74
/// ```
75
///
76
/// Where EOI is the special sentinel value that is always in its own
77
/// singleton equivalence class.
78
#[derive(Clone, Copy, Eq, PartialEq, PartialOrd, Ord)]
79
pub struct Unit(UnitKind);
80
81
#[derive(Clone, Copy, Eq, PartialEq, PartialOrd, Ord)]
82
enum UnitKind {
83
    /// Represents a byte value, or more typically, an equivalence class
84
    /// represented as a byte value.
85
    U8(u8),
86
    /// Represents the "end of input" sentinel. We regretably use a `u16`
87
    /// here since the maximum sentinel value is `256`. Thankfully, we don't
88
    /// actually store a `Unit` anywhere, so this extra space shouldn't be too
89
    /// bad.
90
    EOI(u16),
91
}
92
93
impl Unit {
94
    /// Create a new haystack unit from a byte value.
95
    ///
96
    /// All possible byte values are legal. However, when creating a haystack
97
    /// unit for a specific DFA, one should be careful to only construct units
98
    /// that are in that DFA's alphabet. Namely, one way to compact a DFA's
99
    /// in-memory representation is to collapse its transitions to a set of
100
    /// equivalence classes into a set of all possible byte values. If a DFA
101
    /// uses equivalence classes instead of byte values, then the byte given
102
    /// here should be the equivalence class.
103
985
    pub fn u8(byte: u8) -> Unit {
104
985
        Unit(UnitKind::U8(byte))
105
985
    }
Unexecuted instantiation: <regex_automata::util::alphabet::Unit>::u8
<regex_automata::util::alphabet::Unit>::u8
Line
Count
Source
103
985
    pub fn u8(byte: u8) -> Unit {
104
985
        Unit(UnitKind::U8(byte))
105
985
    }
106
107
    /// Create a new "end of input" haystack unit.
108
    ///
109
    /// The value given is the sentinel value used by this unit to represent
110
    /// the "end of input." The value should be the total number of equivalence
111
    /// classes in the corresponding alphabet. Its maximum value is `256`,
112
    /// which occurs when every byte is its own equivalence class.
113
    ///
114
    /// # Panics
115
    ///
116
    /// This panics when `num_byte_equiv_classes` is greater than `256`.
117
16.3k
    pub fn eoi(num_byte_equiv_classes: usize) -> Unit {
118
16.3k
        assert!(
119
16.3k
            num_byte_equiv_classes <= 256,
120
0
            "max number of byte-based equivalent classes is 256, but got {}",
121
            num_byte_equiv_classes,
122
        );
123
16.3k
        Unit(UnitKind::EOI(u16::try_from(num_byte_equiv_classes).unwrap()))
124
16.3k
    }
Unexecuted instantiation: <regex_automata::util::alphabet::Unit>::eoi
<regex_automata::util::alphabet::Unit>::eoi
Line
Count
Source
117
16.3k
    pub fn eoi(num_byte_equiv_classes: usize) -> Unit {
118
16.3k
        assert!(
119
16.3k
            num_byte_equiv_classes <= 256,
120
0
            "max number of byte-based equivalent classes is 256, but got {}",
121
            num_byte_equiv_classes,
122
        );
123
16.3k
        Unit(UnitKind::EOI(u16::try_from(num_byte_equiv_classes).unwrap()))
124
16.3k
    }
125
126
    /// If this unit is not an "end of input" sentinel, then returns its
127
    /// underlying byte value. Otherwise return `None`.
128
168
    pub fn as_u8(self) -> Option<u8> {
129
168
        match self.0 {
130
158
            UnitKind::U8(b) => Some(b),
131
10
            UnitKind::EOI(_) => None,
132
        }
133
168
    }
Unexecuted instantiation: <regex_automata::util::alphabet::Unit>::as_u8
<regex_automata::util::alphabet::Unit>::as_u8
Line
Count
Source
128
168
    pub fn as_u8(self) -> Option<u8> {
129
168
        match self.0 {
130
158
            UnitKind::U8(b) => Some(b),
131
10
            UnitKind::EOI(_) => None,
132
        }
133
168
    }
134
135
    /// If this unit is an "end of input" sentinel, then return the underlying
136
    /// sentinel value that was given to [`Unit::eoi`]. Otherwise return
137
    /// `None`.
138
0
    pub fn as_eoi(self) -> Option<u16> {
139
0
        match self.0 {
140
0
            UnitKind::U8(_) => None,
141
0
            UnitKind::EOI(sentinel) => Some(sentinel),
142
        }
143
0
    }
Unexecuted instantiation: <regex_automata::util::alphabet::Unit>::as_eoi
Unexecuted instantiation: <regex_automata::util::alphabet::Unit>::as_eoi
144
145
    /// Return this unit as a `usize`, regardless of whether it is a byte value
146
    /// or an "end of input" sentinel. In the latter case, the underlying
147
    /// sentinel value given to [`Unit::eoi`] is returned.
148
16.3k
    pub fn as_usize(self) -> usize {
149
16.3k
        match self.0 {
150
0
            UnitKind::U8(b) => usize::from(b),
151
16.3k
            UnitKind::EOI(eoi) => usize::from(eoi),
152
        }
153
16.3k
    }
Unexecuted instantiation: <regex_automata::util::alphabet::Unit>::as_usize
<regex_automata::util::alphabet::Unit>::as_usize
Line
Count
Source
148
16.3k
    pub fn as_usize(self) -> usize {
149
16.3k
        match self.0 {
150
0
            UnitKind::U8(b) => usize::from(b),
151
16.3k
            UnitKind::EOI(eoi) => usize::from(eoi),
152
        }
153
16.3k
    }
154
155
    /// Returns true if and only of this unit is a byte value equivalent to the
156
    /// byte given. This always returns false when this is an "end of input"
157
    /// sentinel.
158
16
    pub fn is_byte(self, byte: u8) -> bool {
159
16
        self.as_u8().map_or(false, |b| b == byte)
Unexecuted instantiation: <regex_automata::util::alphabet::Unit>::is_byte::{closure#0}
<regex_automata::util::alphabet::Unit>::is_byte::{closure#0}
Line
Count
Source
159
14
        self.as_u8().map_or(false, |b| b == byte)
160
16
    }
Unexecuted instantiation: <regex_automata::util::alphabet::Unit>::is_byte
<regex_automata::util::alphabet::Unit>::is_byte
Line
Count
Source
158
16
    pub fn is_byte(self, byte: u8) -> bool {
159
16
        self.as_u8().map_or(false, |b| b == byte)
160
16
    }
161
162
    /// Returns true when this unit represents an "end of input" sentinel.
163
0
    pub fn is_eoi(self) -> bool {
164
0
        self.as_eoi().is_some()
165
0
    }
Unexecuted instantiation: <regex_automata::util::alphabet::Unit>::is_eoi
Unexecuted instantiation: <regex_automata::util::alphabet::Unit>::is_eoi
166
167
    /// Returns true when this unit corresponds to an ASCII word byte.
168
    ///
169
    /// This always returns false when this unit represents an "end of input"
170
    /// sentinel.
171
48
    pub fn is_word_byte(self) -> bool {
172
48
        self.as_u8().map_or(false, crate::util::utf8::is_word_byte)
173
48
    }
Unexecuted instantiation: <regex_automata::util::alphabet::Unit>::is_word_byte
<regex_automata::util::alphabet::Unit>::is_word_byte
Line
Count
Source
171
48
    pub fn is_word_byte(self) -> bool {
172
48
        self.as_u8().map_or(false, crate::util::utf8::is_word_byte)
173
48
    }
174
}
175
176
impl core::fmt::Debug for Unit {
177
0
    fn fmt(&self, f: &mut core::fmt::Formatter) -> core::fmt::Result {
178
0
        match self.0 {
179
0
            UnitKind::U8(b) => write!(f, "{:?}", DebugByte(b)),
180
0
            UnitKind::EOI(_) => write!(f, "EOI"),
181
        }
182
0
    }
Unexecuted instantiation: <regex_automata::util::alphabet::Unit as core::fmt::Debug>::fmt
Unexecuted instantiation: <regex_automata::util::alphabet::Unit as core::fmt::Debug>::fmt
183
}
184
185
/// A representation of byte oriented equivalence classes.
186
///
187
/// This is used in a DFA to reduce the size of the transition table. This can
188
/// have a particularly large impact not only on the total size of a dense DFA,
189
/// but also on compile times.
190
///
191
/// The essential idea here is that the alphabet of a DFA is shrunk from the
192
/// usual 256 distinct byte values down to a set of equivalence classes. The
193
/// guarantee you get is that any byte belonging to the same equivalence class
194
/// can be treated as if it were any other byte in the same class, and the
195
/// result of a search wouldn't change.
196
///
197
/// # Example
198
///
199
/// This example shows how to get byte classes from an
200
/// [`NFA`](crate::nfa::thompson::NFA) and ask for the class of various bytes.
201
///
202
/// ```
203
/// use regex_automata::nfa::thompson::NFA;
204
///
205
/// let nfa = NFA::new("[a-z]+")?;
206
/// let classes = nfa.byte_classes();
207
/// // 'a' and 'z' are in the same class for this regex.
208
/// assert_eq!(classes.get(b'a'), classes.get(b'z'));
209
/// // But 'a' and 'A' are not.
210
/// assert_ne!(classes.get(b'a'), classes.get(b'A'));
211
///
212
/// # Ok::<(), Box<dyn std::error::Error>>(())
213
/// ```
214
#[derive(Clone, Copy)]
215
pub struct ByteClasses([u8; 256]);
216
217
impl ByteClasses {
218
    /// Creates a new set of equivalence classes where all bytes are mapped to
219
    /// the same class.
220
    #[inline]
221
18
    pub fn empty() -> ByteClasses {
222
18
        ByteClasses([0; 256])
223
18
    }
<regex_automata::util::alphabet::ByteClasses>::empty
Line
Count
Source
221
12
    pub fn empty() -> ByteClasses {
222
12
        ByteClasses([0; 256])
223
12
    }
<regex_automata::util::alphabet::ByteClasses>::empty
Line
Count
Source
221
6
    pub fn empty() -> ByteClasses {
222
6
        ByteClasses([0; 256])
223
6
    }
224
225
    /// Creates a new set of equivalence classes where each byte belongs to
226
    /// its own equivalence class.
227
    #[inline]
228
8
    pub fn singletons() -> ByteClasses {
229
8
        let mut classes = ByteClasses::empty();
230
2.05k
        for b in 0..=255 {
231
2.04k
            classes.set(b, b);
232
2.04k
        }
233
8
        classes
234
8
    }
<regex_automata::util::alphabet::ByteClasses>::singletons
Line
Count
Source
228
6
    pub fn singletons() -> ByteClasses {
229
6
        let mut classes = ByteClasses::empty();
230
1.54k
        for b in 0..=255 {
231
1.53k
            classes.set(b, b);
232
1.53k
        }
233
6
        classes
234
6
    }
<regex_automata::util::alphabet::ByteClasses>::singletons
Line
Count
Source
228
2
    pub fn singletons() -> ByteClasses {
229
2
        let mut classes = ByteClasses::empty();
230
514
        for b in 0..=255 {
231
512
            classes.set(b, b);
232
512
        }
233
2
        classes
234
2
    }
235
236
    /// Deserializes a byte class map from the given slice. If the slice is of
237
    /// insufficient length or otherwise contains an impossible mapping, then
238
    /// an error is returned. Upon success, the number of bytes read along with
239
    /// the map are returned. The number of bytes read is always a multiple of
240
    /// 8.
241
0
    pub(crate) fn from_bytes(
242
0
        slice: &[u8],
243
0
    ) -> Result<(ByteClasses, usize), DeserializeError> {
244
0
        wire::check_slice_len(slice, 256, "byte class map")?;
245
0
        let mut classes = ByteClasses::empty();
246
0
        for (b, &class) in slice[..256].iter().enumerate() {
247
0
            classes.set(u8::try_from(b).unwrap(), class);
248
0
        }
249
        // We specifically don't use 'classes.iter()' here because that
250
        // iterator depends on 'classes.alphabet_len()' being correct. But that
251
        // is precisely the thing we're trying to verify below!
252
0
        for &b in classes.0.iter() {
253
0
            if usize::from(b) >= classes.alphabet_len() {
254
0
                return Err(DeserializeError::generic(
255
0
                    "found equivalence class greater than alphabet len",
256
0
                ));
257
0
            }
258
        }
259
0
        Ok((classes, 256))
260
0
    }
Unexecuted instantiation: <regex_automata::util::alphabet::ByteClasses>::from_bytes
Unexecuted instantiation: <regex_automata::util::alphabet::ByteClasses>::from_bytes
261
262
    /// Writes this byte class map to the given byte buffer. if the given
263
    /// buffer is too small, then an error is returned. Upon success, the total
264
    /// number of bytes written is returned. The number of bytes written is
265
    /// guaranteed to be a multiple of 8.
266
0
    pub(crate) fn write_to(
267
0
        &self,
268
0
        mut dst: &mut [u8],
269
0
    ) -> Result<usize, SerializeError> {
270
0
        let nwrite = self.write_to_len();
271
0
        if dst.len() < nwrite {
272
0
            return Err(SerializeError::buffer_too_small("byte class map"));
273
0
        }
274
0
        for b in 0..=255 {
275
0
            dst[0] = self.get(b);
276
0
            dst = &mut dst[1..];
277
0
        }
278
0
        Ok(nwrite)
279
0
    }
Unexecuted instantiation: <regex_automata::util::alphabet::ByteClasses>::write_to
Unexecuted instantiation: <regex_automata::util::alphabet::ByteClasses>::write_to
280
281
    /// Returns the total number of bytes written by `write_to`.
282
0
    pub(crate) fn write_to_len(&self) -> usize {
283
0
        256
284
0
    }
Unexecuted instantiation: <regex_automata::util::alphabet::ByteClasses>::write_to_len
Unexecuted instantiation: <regex_automata::util::alphabet::ByteClasses>::write_to_len
285
286
    /// Set the equivalence class for the given byte.
287
    #[inline]
288
4.60k
    pub fn set(&mut self, byte: u8, class: u8) {
289
4.60k
        self.0[usize::from(byte)] = class;
290
4.60k
    }
<regex_automata::util::alphabet::ByteClasses>::set
Line
Count
Source
288
3.07k
    pub fn set(&mut self, byte: u8, class: u8) {
289
3.07k
        self.0[usize::from(byte)] = class;
290
3.07k
    }
<regex_automata::util::alphabet::ByteClasses>::set
Line
Count
Source
288
1.53k
    pub fn set(&mut self, byte: u8, class: u8) {
289
1.53k
        self.0[usize::from(byte)] = class;
290
1.53k
    }
291
292
    /// Get the equivalence class for the given byte.
293
    #[inline]
294
150k
    pub fn get(&self, byte: u8) -> u8 {
295
150k
        self.0[usize::from(byte)]
296
150k
    }
Unexecuted instantiation: <regex_automata::util::alphabet::ByteClasses>::get
<regex_automata::util::alphabet::ByteClasses>::get
Line
Count
Source
294
150k
    pub fn get(&self, byte: u8) -> u8 {
295
150k
        self.0[usize::from(byte)]
296
150k
    }
297
298
    /// Get the equivalence class for the given haystack unit and return the
299
    /// class as a `usize`.
300
    #[inline]
301
978
    pub fn get_by_unit(&self, unit: Unit) -> usize {
302
978
        match unit.0 {
303
970
            UnitKind::U8(b) => usize::from(self.get(b)),
304
8
            UnitKind::EOI(b) => usize::from(b),
305
        }
306
978
    }
Unexecuted instantiation: <regex_automata::util::alphabet::ByteClasses>::get_by_unit
<regex_automata::util::alphabet::ByteClasses>::get_by_unit
Line
Count
Source
301
978
    pub fn get_by_unit(&self, unit: Unit) -> usize {
302
978
        match unit.0 {
303
970
            UnitKind::U8(b) => usize::from(self.get(b)),
304
8
            UnitKind::EOI(b) => usize::from(b),
305
        }
306
978
    }
307
308
    /// Create a unit that represents the "end of input" sentinel based on the
309
    /// number of equivalence classes.
310
    #[inline]
311
16.3k
    pub fn eoi(&self) -> Unit {
312
16.3k
        // The alphabet length already includes the EOI sentinel, hence why
313
16.3k
        // we subtract 1.
314
16.3k
        Unit::eoi(self.alphabet_len().checked_sub(1).unwrap())
315
16.3k
    }
Unexecuted instantiation: <regex_automata::util::alphabet::ByteClasses>::eoi
<regex_automata::util::alphabet::ByteClasses>::eoi
Line
Count
Source
311
16.3k
    pub fn eoi(&self) -> Unit {
312
16.3k
        // The alphabet length already includes the EOI sentinel, hence why
313
16.3k
        // we subtract 1.
314
16.3k
        Unit::eoi(self.alphabet_len().checked_sub(1).unwrap())
315
16.3k
    }
316
317
    /// Return the total number of elements in the alphabet represented by
318
    /// these equivalence classes. Equivalently, this returns the total number
319
    /// of equivalence classes.
320
    #[inline]
321
16.4k
    pub fn alphabet_len(&self) -> usize {
322
16.4k
        // Add one since the number of equivalence classes is one bigger than
323
16.4k
        // the last one. But add another to account for the final EOI class
324
16.4k
        // that isn't explicitly represented.
325
16.4k
        usize::from(self.0[255]) + 1 + 1
326
16.4k
    }
Unexecuted instantiation: <regex_automata::util::alphabet::ByteClasses>::alphabet_len
<regex_automata::util::alphabet::ByteClasses>::alphabet_len
Line
Count
Source
321
16.4k
    pub fn alphabet_len(&self) -> usize {
322
16.4k
        // Add one since the number of equivalence classes is one bigger than
323
16.4k
        // the last one. But add another to account for the final EOI class
324
16.4k
        // that isn't explicitly represented.
325
16.4k
        usize::from(self.0[255]) + 1 + 1
326
16.4k
    }
327
328
    /// Returns the stride, as a base-2 exponent, required for these
329
    /// equivalence classes.
330
    ///
331
    /// The stride is always the smallest power of 2 that is greater than or
332
    /// equal to the alphabet length, and the `stride2` returned here is the
333
    /// exponent applied to `2` to get the smallest power. This is done so that
334
    /// converting between premultiplied state IDs and indices can be done with
335
    /// shifts alone, which is much faster than integer division.
336
    #[inline]
337
7
    pub fn stride2(&self) -> usize {
338
7
        let zeros = self.alphabet_len().next_power_of_two().trailing_zeros();
339
7
        usize::try_from(zeros).unwrap()
340
7
    }
Unexecuted instantiation: <regex_automata::util::alphabet::ByteClasses>::stride2
<regex_automata::util::alphabet::ByteClasses>::stride2
Line
Count
Source
337
7
    pub fn stride2(&self) -> usize {
338
7
        let zeros = self.alphabet_len().next_power_of_two().trailing_zeros();
339
7
        usize::try_from(zeros).unwrap()
340
7
    }
341
342
    /// Returns true if and only if every byte in this class maps to its own
343
    /// equivalence class. Equivalently, there are 257 equivalence classes
344
    /// and each class contains either exactly one byte or corresponds to the
345
    /// singleton class containing the "end of input" sentinel.
346
    #[inline]
347
0
    pub fn is_singleton(&self) -> bool {
348
0
        self.alphabet_len() == 257
349
0
    }
Unexecuted instantiation: <regex_automata::util::alphabet::ByteClasses>::is_singleton
Unexecuted instantiation: <regex_automata::util::alphabet::ByteClasses>::is_singleton
350
351
    /// Returns an iterator over all equivalence classes in this set.
352
    #[inline]
353
0
    pub fn iter(&self) -> ByteClassIter<'_> {
354
0
        ByteClassIter { classes: self, i: 0 }
355
0
    }
Unexecuted instantiation: <regex_automata::util::alphabet::ByteClasses>::iter
Unexecuted instantiation: <regex_automata::util::alphabet::ByteClasses>::iter
356
357
    /// Returns an iterator over a sequence of representative bytes from each
358
    /// equivalence class within the range of bytes given.
359
    ///
360
    /// When the given range is unbounded on both sides, the iterator yields
361
    /// exactly N items, where N is equivalent to the number of equivalence
362
    /// classes. Each item is an arbitrary byte drawn from each equivalence
363
    /// class.
364
    ///
365
    /// This is useful when one is determinizing an NFA and the NFA's alphabet
366
    /// hasn't been converted to equivalence classes. Picking an arbitrary byte
367
    /// from each equivalence class then permits a full exploration of the NFA
368
    /// instead of using every possible byte value and thus potentially saves
369
    /// quite a lot of redundant work.
370
    ///
371
    /// # Example
372
    ///
373
    /// This shows an example of what a complete sequence of representatives
374
    /// might look like from a real example.
375
    ///
376
    /// ```
377
    /// use regex_automata::{nfa::thompson::NFA, util::alphabet::Unit};
378
    ///
379
    /// let nfa = NFA::new("[a-z]+")?;
380
    /// let classes = nfa.byte_classes();
381
    /// let reps: Vec<Unit> = classes.representatives(..).collect();
382
    /// // Note that the specific byte values yielded are not guaranteed!
383
    /// let expected = vec![
384
    ///     Unit::u8(b'\x00'),
385
    ///     Unit::u8(b'a'),
386
    ///     Unit::u8(b'{'),
387
    ///     Unit::eoi(3),
388
    /// ];
389
    /// assert_eq!(expected, reps);
390
    ///
391
    /// # Ok::<(), Box<dyn std::error::Error>>(())
392
    /// ```
393
    ///
394
    /// Note though, that you can ask for an arbitrary range of bytes, and only
395
    /// representatives for that range will be returned:
396
    ///
397
    /// ```
398
    /// use regex_automata::{nfa::thompson::NFA, util::alphabet::Unit};
399
    ///
400
    /// let nfa = NFA::new("[a-z]+")?;
401
    /// let classes = nfa.byte_classes();
402
    /// let reps: Vec<Unit> = classes.representatives(b'A'..=b'z').collect();
403
    /// // Note that the specific byte values yielded are not guaranteed!
404
    /// let expected = vec![
405
    ///     Unit::u8(b'A'),
406
    ///     Unit::u8(b'a'),
407
    /// ];
408
    /// assert_eq!(expected, reps);
409
    ///
410
    /// # Ok::<(), Box<dyn std::error::Error>>(())
411
    /// ```
412
21
    pub fn representatives<R: core::ops::RangeBounds<u8>>(
413
21
        &self,
414
21
        range: R,
415
21
    ) -> ByteClassRepresentatives<'_> {
416
        use core::ops::Bound;
417
418
21
        let cur_byte = match range.start_bound() {
419
15
            Bound::Included(&i) => usize::from(i),
420
0
            Bound::Excluded(&i) => usize::from(i).checked_add(1).unwrap(),
421
6
            Bound::Unbounded => 0,
422
        };
423
21
        let end_byte = match range.end_bound() {
424
15
            Bound::Included(&i) => {
425
15
                Some(usize::from(i).checked_add(1).unwrap())
426
            }
427
0
            Bound::Excluded(&i) => Some(usize::from(i)),
428
6
            Bound::Unbounded => None,
429
        };
430
21
        assert_ne!(
431
            cur_byte,
432
            usize::MAX,
433
0
            "start range must be less than usize::MAX",
434
        );
435
21
        ByteClassRepresentatives {
436
21
            classes: self,
437
21
            cur_byte,
438
21
            end_byte,
439
21
            last_class: None,
440
21
        }
441
21
    }
Unexecuted instantiation: <regex_automata::util::alphabet::ByteClasses>::representatives::<_>
<regex_automata::util::alphabet::ByteClasses>::representatives::<core::ops::range::RangeInclusive<u8>>
Line
Count
Source
412
15
    pub fn representatives<R: core::ops::RangeBounds<u8>>(
413
15
        &self,
414
15
        range: R,
415
15
    ) -> ByteClassRepresentatives<'_> {
416
        use core::ops::Bound;
417
418
15
        let cur_byte = match range.start_bound() {
419
15
            Bound::Included(&i) => usize::from(i),
420
0
            Bound::Excluded(&i) => usize::from(i).checked_add(1).unwrap(),
421
0
            Bound::Unbounded => 0,
422
        };
423
15
        let end_byte = match range.end_bound() {
424
15
            Bound::Included(&i) => {
425
15
                Some(usize::from(i).checked_add(1).unwrap())
426
            }
427
0
            Bound::Excluded(&i) => Some(usize::from(i)),
428
0
            Bound::Unbounded => None,
429
        };
430
15
        assert_ne!(
431
            cur_byte,
432
            usize::MAX,
433
0
            "start range must be less than usize::MAX",
434
        );
435
15
        ByteClassRepresentatives {
436
15
            classes: self,
437
15
            cur_byte,
438
15
            end_byte,
439
15
            last_class: None,
440
15
        }
441
15
    }
<regex_automata::util::alphabet::ByteClasses>::representatives::<core::ops::range::RangeFull>
Line
Count
Source
412
6
    pub fn representatives<R: core::ops::RangeBounds<u8>>(
413
6
        &self,
414
6
        range: R,
415
6
    ) -> ByteClassRepresentatives<'_> {
416
        use core::ops::Bound;
417
418
6
        let cur_byte = match range.start_bound() {
419
0
            Bound::Included(&i) => usize::from(i),
420
0
            Bound::Excluded(&i) => usize::from(i).checked_add(1).unwrap(),
421
6
            Bound::Unbounded => 0,
422
        };
423
6
        let end_byte = match range.end_bound() {
424
0
            Bound::Included(&i) => {
425
0
                Some(usize::from(i).checked_add(1).unwrap())
426
            }
427
0
            Bound::Excluded(&i) => Some(usize::from(i)),
428
6
            Bound::Unbounded => None,
429
        };
430
6
        assert_ne!(
431
            cur_byte,
432
            usize::MAX,
433
0
            "start range must be less than usize::MAX",
434
        );
435
6
        ByteClassRepresentatives {
436
6
            classes: self,
437
6
            cur_byte,
438
6
            end_byte,
439
6
            last_class: None,
440
6
        }
441
6
    }
442
443
    /// Returns an iterator of the bytes in the given equivalence class.
444
    ///
445
    /// This is useful when one needs to know the actual bytes that belong to
446
    /// an equivalence class. For example, conceptually speaking, accelerating
447
    /// a DFA state occurs when a state only has a few outgoing transitions.
448
    /// But in reality, what is required is that there are only a small
449
    /// number of distinct bytes that can lead to an outgoing transition. The
450
    /// difference is that any one transition can correspond to an equivalence
451
    /// class which may contains many bytes. Therefore, DFA state acceleration
452
    /// considers the actual elements in each equivalence class of each
453
    /// outgoing transition.
454
    ///
455
    /// # Example
456
    ///
457
    /// This shows an example of how to get all of the elements in an
458
    /// equivalence class.
459
    ///
460
    /// ```
461
    /// use regex_automata::{nfa::thompson::NFA, util::alphabet::Unit};
462
    ///
463
    /// let nfa = NFA::new("[a-z]+")?;
464
    /// let classes = nfa.byte_classes();
465
    /// let elements: Vec<Unit> = classes.elements(Unit::u8(1)).collect();
466
    /// let expected: Vec<Unit> = (b'a'..=b'z').map(Unit::u8).collect();
467
    /// assert_eq!(expected, elements);
468
    ///
469
    /// # Ok::<(), Box<dyn std::error::Error>>(())
470
    /// ```
471
    #[inline]
472
0
    pub fn elements(&self, class: Unit) -> ByteClassElements {
473
0
        ByteClassElements { classes: self, class, byte: 0 }
474
0
    }
Unexecuted instantiation: <regex_automata::util::alphabet::ByteClasses>::elements
Unexecuted instantiation: <regex_automata::util::alphabet::ByteClasses>::elements
475
476
    /// Returns an iterator of byte ranges in the given equivalence class.
477
    ///
478
    /// That is, a sequence of contiguous ranges are returned. Typically, every
479
    /// class maps to a single contiguous range.
480
0
    fn element_ranges(&self, class: Unit) -> ByteClassElementRanges {
481
0
        ByteClassElementRanges { elements: self.elements(class), range: None }
482
0
    }
Unexecuted instantiation: <regex_automata::util::alphabet::ByteClasses>::element_ranges
Unexecuted instantiation: <regex_automata::util::alphabet::ByteClasses>::element_ranges
483
}
484
485
impl Default for ByteClasses {
486
8
    fn default() -> ByteClasses {
487
8
        ByteClasses::singletons()
488
8
    }
<regex_automata::util::alphabet::ByteClasses as core::default::Default>::default
Line
Count
Source
486
6
    fn default() -> ByteClasses {
487
6
        ByteClasses::singletons()
488
6
    }
<regex_automata::util::alphabet::ByteClasses as core::default::Default>::default
Line
Count
Source
486
2
    fn default() -> ByteClasses {
487
2
        ByteClasses::singletons()
488
2
    }
489
}
490
491
impl core::fmt::Debug for ByteClasses {
492
0
    fn fmt(&self, f: &mut core::fmt::Formatter<'_>) -> core::fmt::Result {
493
0
        if self.is_singleton() {
494
0
            write!(f, "ByteClasses({{singletons}})")
495
        } else {
496
0
            write!(f, "ByteClasses(")?;
497
0
            for (i, class) in self.iter().enumerate() {
498
0
                if i > 0 {
499
0
                    write!(f, ", ")?;
500
0
                }
501
0
                write!(f, "{:?} => [", class.as_usize())?;
502
0
                for (start, end) in self.element_ranges(class) {
503
0
                    if start == end {
504
0
                        write!(f, "{:?}", start)?;
505
                    } else {
506
0
                        write!(f, "{:?}-{:?}", start, end)?;
507
                    }
508
                }
509
0
                write!(f, "]")?;
510
            }
511
0
            write!(f, ")")
512
        }
513
0
    }
Unexecuted instantiation: <regex_automata::util::alphabet::ByteClasses as core::fmt::Debug>::fmt
Unexecuted instantiation: <regex_automata::util::alphabet::ByteClasses as core::fmt::Debug>::fmt
514
}
515
516
/// An iterator over each equivalence class.
517
///
518
/// The last element in this iterator always corresponds to [`Unit::eoi`].
519
///
520
/// This is created by the [`ByteClasses::iter`] method.
521
///
522
/// The lifetime `'a` refers to the lifetime of the byte classes that this
523
/// iterator was created from.
524
#[derive(Debug)]
525
pub struct ByteClassIter<'a> {
526
    classes: &'a ByteClasses,
527
    i: usize,
528
}
529
530
impl<'a> Iterator for ByteClassIter<'a> {
531
    type Item = Unit;
532
533
0
    fn next(&mut self) -> Option<Unit> {
534
0
        if self.i + 1 == self.classes.alphabet_len() {
535
0
            self.i += 1;
536
0
            Some(self.classes.eoi())
537
0
        } else if self.i < self.classes.alphabet_len() {
538
0
            let class = u8::try_from(self.i).unwrap();
539
0
            self.i += 1;
540
0
            Some(Unit::u8(class))
541
        } else {
542
0
            None
543
        }
544
0
    }
Unexecuted instantiation: <regex_automata::util::alphabet::ByteClassIter as core::iter::traits::iterator::Iterator>::next
Unexecuted instantiation: <regex_automata::util::alphabet::ByteClassIter as core::iter::traits::iterator::Iterator>::next
545
}
546
547
/// An iterator over representative bytes from each equivalence class.
548
///
549
/// This is created by the [`ByteClasses::representatives`] method.
550
///
551
/// The lifetime `'a` refers to the lifetime of the byte classes that this
552
/// iterator was created from.
553
#[derive(Debug)]
554
pub struct ByteClassRepresentatives<'a> {
555
    classes: &'a ByteClasses,
556
    cur_byte: usize,
557
    end_byte: Option<usize>,
558
    last_class: Option<u8>,
559
}
560
561
impl<'a> Iterator for ByteClassRepresentatives<'a> {
562
    type Item = Unit;
563
564
995
    fn next(&mut self) -> Option<Unit> {
565
1.58k
        while self.cur_byte < self.end_byte.unwrap_or(256) {
566
1.55k
            let byte = u8::try_from(self.cur_byte).unwrap();
567
1.55k
            let class = self.classes.get(byte);
568
1.55k
            self.cur_byte += 1;
569
1.55k
570
1.55k
            if self.last_class != Some(class) {
571
969
                self.last_class = Some(class);
572
969
                return Some(Unit::u8(byte));
573
587
            }
574
        }
575
26
        if self.cur_byte != usize::MAX && self.end_byte.is_none() {
576
            // Using usize::MAX as a sentinel is OK because we ban usize::MAX
577
            // from appearing as a start bound in iterator construction. But
578
            // why do it this way? Well, we want to return the EOI class
579
            // whenever the end of the given range is unbounded because EOI
580
            // isn't really a "byte" per se, so the only way it should be
581
            // excluded is if there is a bounded end to the range. Therefore,
582
            // when the end is unbounded, we just need to know whether we've
583
            // reported EOI or not. When we do, we set cur_byte to a value it
584
            // can never otherwise be.
585
6
            self.cur_byte = usize::MAX;
586
6
            return Some(self.classes.eoi());
587
20
        }
588
20
        None
589
995
    }
Unexecuted instantiation: <regex_automata::util::alphabet::ByteClassRepresentatives as core::iter::traits::iterator::Iterator>::next
<regex_automata::util::alphabet::ByteClassRepresentatives as core::iter::traits::iterator::Iterator>::next
Line
Count
Source
564
995
    fn next(&mut self) -> Option<Unit> {
565
1.58k
        while self.cur_byte < self.end_byte.unwrap_or(256) {
566
1.55k
            let byte = u8::try_from(self.cur_byte).unwrap();
567
1.55k
            let class = self.classes.get(byte);
568
1.55k
            self.cur_byte += 1;
569
1.55k
570
1.55k
            if self.last_class != Some(class) {
571
969
                self.last_class = Some(class);
572
969
                return Some(Unit::u8(byte));
573
587
            }
574
        }
575
26
        if self.cur_byte != usize::MAX && self.end_byte.is_none() {
576
            // Using usize::MAX as a sentinel is OK because we ban usize::MAX
577
            // from appearing as a start bound in iterator construction. But
578
            // why do it this way? Well, we want to return the EOI class
579
            // whenever the end of the given range is unbounded because EOI
580
            // isn't really a "byte" per se, so the only way it should be
581
            // excluded is if there is a bounded end to the range. Therefore,
582
            // when the end is unbounded, we just need to know whether we've
583
            // reported EOI or not. When we do, we set cur_byte to a value it
584
            // can never otherwise be.
585
6
            self.cur_byte = usize::MAX;
586
6
            return Some(self.classes.eoi());
587
20
        }
588
20
        None
589
995
    }
590
}
591
592
/// An iterator over all elements in an equivalence class.
593
///
594
/// This is created by the [`ByteClasses::elements`] method.
595
///
596
/// The lifetime `'a` refers to the lifetime of the byte classes that this
597
/// iterator was created from.
598
#[derive(Debug)]
599
pub struct ByteClassElements<'a> {
600
    classes: &'a ByteClasses,
601
    class: Unit,
602
    byte: usize,
603
}
604
605
impl<'a> Iterator for ByteClassElements<'a> {
606
    type Item = Unit;
607
608
0
    fn next(&mut self) -> Option<Unit> {
609
0
        while self.byte < 256 {
610
0
            let byte = u8::try_from(self.byte).unwrap();
611
0
            self.byte += 1;
612
0
            if self.class.is_byte(self.classes.get(byte)) {
613
0
                return Some(Unit::u8(byte));
614
0
            }
615
        }
616
0
        if self.byte < 257 {
617
0
            self.byte += 1;
618
0
            if self.class.is_eoi() {
619
0
                return Some(Unit::eoi(256));
620
0
            }
621
0
        }
622
0
        None
623
0
    }
Unexecuted instantiation: <regex_automata::util::alphabet::ByteClassElements as core::iter::traits::iterator::Iterator>::next
Unexecuted instantiation: <regex_automata::util::alphabet::ByteClassElements as core::iter::traits::iterator::Iterator>::next
624
}
625
626
/// An iterator over all elements in an equivalence class expressed as a
627
/// sequence of contiguous ranges.
628
#[derive(Debug)]
629
struct ByteClassElementRanges<'a> {
630
    elements: ByteClassElements<'a>,
631
    range: Option<(Unit, Unit)>,
632
}
633
634
impl<'a> Iterator for ByteClassElementRanges<'a> {
635
    type Item = (Unit, Unit);
636
637
0
    fn next(&mut self) -> Option<(Unit, Unit)> {
638
        loop {
639
0
            let element = match self.elements.next() {
640
0
                None => return self.range.take(),
641
0
                Some(element) => element,
642
0
            };
643
0
            match self.range.take() {
644
0
                None => {
645
0
                    self.range = Some((element, element));
646
0
                }
647
0
                Some((start, end)) => {
648
0
                    if end.as_usize() + 1 != element.as_usize()
649
0
                        || element.is_eoi()
650
                    {
651
0
                        self.range = Some((element, element));
652
0
                        return Some((start, end));
653
0
                    }
654
0
                    self.range = Some((start, element));
655
                }
656
            }
657
        }
658
0
    }
Unexecuted instantiation: <regex_automata::util::alphabet::ByteClassElementRanges as core::iter::traits::iterator::Iterator>::next
Unexecuted instantiation: <regex_automata::util::alphabet::ByteClassElementRanges as core::iter::traits::iterator::Iterator>::next
659
}
660
661
/// A partitioning of bytes into equivalence classes.
662
///
663
/// A byte class set keeps track of an *approximation* of equivalence classes
664
/// of bytes during NFA construction. That is, every byte in an equivalence
665
/// class cannot discriminate between a match and a non-match.
666
///
667
/// For example, in the regex `[ab]+`, the bytes `a` and `b` would be in the
668
/// same equivalence class because it never matters whether an `a` or a `b` is
669
/// seen, and no combination of `a`s and `b`s in the text can discriminate a
670
/// match.
671
///
672
/// Note though that this does not compute the minimal set of equivalence
673
/// classes. For example, in the regex `[ac]+`, both `a` and `c` are in the
674
/// same equivalence class for the same reason that `a` and `b` are in the
675
/// same equivalence class in the aforementioned regex. However, in this
676
/// implementation, `a` and `c` are put into distinct equivalence classes. The
677
/// reason for this is implementation complexity. In the future, we should
678
/// endeavor to compute the minimal equivalence classes since they can have a
679
/// rather large impact on the size of the DFA. (Doing this will likely require
680
/// rethinking how equivalence classes are computed, including changing the
681
/// representation here, which is only able to group contiguous bytes into the
682
/// same equivalence class.)
683
#[cfg(feature = "alloc")]
684
#[derive(Clone, Debug)]
685
pub(crate) struct ByteClassSet(ByteSet);
686
687
#[cfg(feature = "alloc")]
688
impl Default for ByteClassSet {
689
8
    fn default() -> ByteClassSet {
690
8
        ByteClassSet::empty()
691
8
    }
<regex_automata::util::alphabet::ByteClassSet as core::default::Default>::default
Line
Count
Source
689
6
    fn default() -> ByteClassSet {
690
6
        ByteClassSet::empty()
691
6
    }
<regex_automata::util::alphabet::ByteClassSet as core::default::Default>::default
Line
Count
Source
689
2
    fn default() -> ByteClassSet {
690
2
        ByteClassSet::empty()
691
2
    }
692
}
693
694
#[cfg(feature = "alloc")]
695
impl ByteClassSet {
696
    /// Create a new set of byte classes where all bytes are part of the same
697
    /// equivalence class.
698
8
    pub(crate) fn empty() -> Self {
699
8
        ByteClassSet(ByteSet::empty())
700
8
    }
<regex_automata::util::alphabet::ByteClassSet>::empty
Line
Count
Source
698
6
    pub(crate) fn empty() -> Self {
699
6
        ByteClassSet(ByteSet::empty())
700
6
    }
<regex_automata::util::alphabet::ByteClassSet>::empty
Line
Count
Source
698
2
    pub(crate) fn empty() -> Self {
699
2
        ByteClassSet(ByteSet::empty())
700
2
    }
701
702
    /// Indicate the range of byte given (inclusive) can discriminate a
703
    /// match between it and all other bytes outside of the range.
704
22.3k
    pub(crate) fn set_range(&mut self, start: u8, end: u8) {
705
22.3k
        debug_assert!(start <= end);
706
22.3k
        if start > 0 {
707
22.3k
            self.0.add(start - 1);
708
22.3k
        }
709
22.3k
        self.0.add(end);
710
22.3k
    }
<regex_automata::util::alphabet::ByteClassSet>::set_range
Line
Count
Source
704
16.5k
    pub(crate) fn set_range(&mut self, start: u8, end: u8) {
705
16.5k
        debug_assert!(start <= end);
706
16.5k
        if start > 0 {
707
16.5k
            self.0.add(start - 1);
708
16.5k
        }
709
16.5k
        self.0.add(end);
710
16.5k
    }
<regex_automata::util::alphabet::ByteClassSet>::set_range
Line
Count
Source
704
5.80k
    pub(crate) fn set_range(&mut self, start: u8, end: u8) {
705
5.80k
        debug_assert!(start <= end);
706
5.80k
        if start > 0 {
707
5.79k
            self.0.add(start - 1);
708
5.79k
        }
709
5.80k
        self.0.add(end);
710
5.80k
    }
711
712
    /// Add the contiguous ranges in the set given to this byte class set.
713
0
    pub(crate) fn add_set(&mut self, set: &ByteSet) {
714
0
        for (start, end) in set.iter_ranges() {
715
0
            self.set_range(start, end);
716
0
        }
717
0
    }
Unexecuted instantiation: <regex_automata::util::alphabet::ByteClassSet>::add_set
Unexecuted instantiation: <regex_automata::util::alphabet::ByteClassSet>::add_set
718
719
    /// Convert this boolean set to a map that maps all byte values to their
720
    /// corresponding equivalence class. The last mapping indicates the largest
721
    /// equivalence class identifier (which is never bigger than 255).
722
10
    pub(crate) fn byte_classes(&self) -> ByteClasses {
723
10
        let mut classes = ByteClasses::empty();
724
10
        let mut class = 0u8;
725
10
        let mut b = 0u8;
726
        loop {
727
2.56k
            classes.set(b, class);
728
2.56k
            if b == 255 {
729
10
                break;
730
2.55k
            }
731
2.55k
            if self.0.contains(b) {
732
1.58k
                class = class.checked_add(1).unwrap();
733
1.58k
            }
734
2.55k
            b = b.checked_add(1).unwrap();
735
        }
736
10
        classes
737
10
    }
<regex_automata::util::alphabet::ByteClassSet>::byte_classes
Line
Count
Source
722
6
    pub(crate) fn byte_classes(&self) -> ByteClasses {
723
6
        let mut classes = ByteClasses::empty();
724
6
        let mut class = 0u8;
725
6
        let mut b = 0u8;
726
        loop {
727
1.53k
            classes.set(b, class);
728
1.53k
            if b == 255 {
729
6
                break;
730
1.53k
            }
731
1.53k
            if self.0.contains(b) {
732
948
                class = class.checked_add(1).unwrap();
733
948
            }
734
1.53k
            b = b.checked_add(1).unwrap();
735
        }
736
6
        classes
737
6
    }
<regex_automata::util::alphabet::ByteClassSet>::byte_classes
Line
Count
Source
722
4
    pub(crate) fn byte_classes(&self) -> ByteClasses {
723
4
        let mut classes = ByteClasses::empty();
724
4
        let mut class = 0u8;
725
4
        let mut b = 0u8;
726
        loop {
727
1.02k
            classes.set(b, class);
728
1.02k
            if b == 255 {
729
4
                break;
730
1.02k
            }
731
1.02k
            if self.0.contains(b) {
732
632
                class = class.checked_add(1).unwrap();
733
632
            }
734
1.02k
            b = b.checked_add(1).unwrap();
735
        }
736
4
        classes
737
4
    }
738
}
739
740
/// A simple set of bytes that is reasonably cheap to copy and allocation free.
741
#[derive(Clone, Copy, Debug, Default, Eq, PartialEq)]
742
pub(crate) struct ByteSet {
743
    bits: BitSet,
744
}
745
746
/// The representation of a byte set. Split out so that we can define a
747
/// convenient Debug impl for it while keeping "ByteSet" in the output.
748
#[derive(Clone, Copy, Default, Eq, PartialEq)]
749
struct BitSet([u128; 2]);
750
751
impl ByteSet {
752
    /// Create an empty set of bytes.
753
10
    pub(crate) fn empty() -> ByteSet {
754
10
        ByteSet { bits: BitSet([0; 2]) }
755
10
    }
<regex_automata::util::alphabet::ByteSet>::empty
Line
Count
Source
753
6
    pub(crate) fn empty() -> ByteSet {
754
6
        ByteSet { bits: BitSet([0; 2]) }
755
6
    }
<regex_automata::util::alphabet::ByteSet>::empty
Line
Count
Source
753
4
    pub(crate) fn empty() -> ByteSet {
754
4
        ByteSet { bits: BitSet([0; 2]) }
755
4
    }
756
757
    /// Add a byte to this set.
758
    ///
759
    /// If the given byte already belongs to this set, then this is a no-op.
760
44.6k
    pub(crate) fn add(&mut self, byte: u8) {
761
44.6k
        let bucket = byte / 128;
762
44.6k
        let bit = byte % 128;
763
44.6k
        self.bits.0[usize::from(bucket)] |= 1 << bit;
764
44.6k
    }
<regex_automata::util::alphabet::ByteSet>::add
Line
Count
Source
760
33.0k
    pub(crate) fn add(&mut self, byte: u8) {
761
33.0k
        let bucket = byte / 128;
762
33.0k
        let bit = byte % 128;
763
33.0k
        self.bits.0[usize::from(bucket)] |= 1 << bit;
764
33.0k
    }
<regex_automata::util::alphabet::ByteSet>::add
Line
Count
Source
760
11.6k
    pub(crate) fn add(&mut self, byte: u8) {
761
11.6k
        let bucket = byte / 128;
762
11.6k
        let bit = byte % 128;
763
11.6k
        self.bits.0[usize::from(bucket)] |= 1 << bit;
764
11.6k
    }
765
766
    /// Remove a byte from this set.
767
    ///
768
    /// If the given byte is not in this set, then this is a no-op.
769
0
    pub(crate) fn remove(&mut self, byte: u8) {
770
0
        let bucket = byte / 128;
771
0
        let bit = byte % 128;
772
0
        self.bits.0[usize::from(bucket)] &= !(1 << bit);
773
0
    }
Unexecuted instantiation: <regex_automata::util::alphabet::ByteSet>::remove
Unexecuted instantiation: <regex_automata::util::alphabet::ByteSet>::remove
774
775
    /// Return true if and only if the given byte is in this set.
776
2.55k
    pub(crate) fn contains(&self, byte: u8) -> bool {
777
2.55k
        let bucket = byte / 128;
778
2.55k
        let bit = byte % 128;
779
2.55k
        self.bits.0[usize::from(bucket)] & (1 << bit) > 0
780
2.55k
    }
<regex_automata::util::alphabet::ByteSet>::contains
Line
Count
Source
776
1.53k
    pub(crate) fn contains(&self, byte: u8) -> bool {
777
1.53k
        let bucket = byte / 128;
778
1.53k
        let bit = byte % 128;
779
1.53k
        self.bits.0[usize::from(bucket)] & (1 << bit) > 0
780
1.53k
    }
<regex_automata::util::alphabet::ByteSet>::contains
Line
Count
Source
776
1.02k
    pub(crate) fn contains(&self, byte: u8) -> bool {
777
1.02k
        let bucket = byte / 128;
778
1.02k
        let bit = byte % 128;
779
1.02k
        self.bits.0[usize::from(bucket)] & (1 << bit) > 0
780
1.02k
    }
781
782
    /// Return true if and only if the given inclusive range of bytes is in
783
    /// this set.
784
0
    pub(crate) fn contains_range(&self, start: u8, end: u8) -> bool {
785
0
        (start..=end).all(|b| self.contains(b))
Unexecuted instantiation: <regex_automata::util::alphabet::ByteSet>::contains_range::{closure#0}
Unexecuted instantiation: <regex_automata::util::alphabet::ByteSet>::contains_range::{closure#0}
786
0
    }
Unexecuted instantiation: <regex_automata::util::alphabet::ByteSet>::contains_range
Unexecuted instantiation: <regex_automata::util::alphabet::ByteSet>::contains_range
787
788
    /// Returns an iterator over all bytes in this set.
789
0
    pub(crate) fn iter(&self) -> ByteSetIter {
790
0
        ByteSetIter { set: self, b: 0 }
791
0
    }
Unexecuted instantiation: <regex_automata::util::alphabet::ByteSet>::iter
Unexecuted instantiation: <regex_automata::util::alphabet::ByteSet>::iter
792
793
    /// Returns an iterator over all contiguous ranges of bytes in this set.
794
0
    pub(crate) fn iter_ranges(&self) -> ByteSetRangeIter {
795
0
        ByteSetRangeIter { set: self, b: 0 }
796
0
    }
Unexecuted instantiation: <regex_automata::util::alphabet::ByteSet>::iter_ranges
Unexecuted instantiation: <regex_automata::util::alphabet::ByteSet>::iter_ranges
797
798
    /// Return true if and only if this set is empty.
799
    #[cfg_attr(feature = "perf-inline", inline(always))]
800
19
    pub(crate) fn is_empty(&self) -> bool {
801
19
        self.bits.0 == [0, 0]
802
19
    }
Unexecuted instantiation: <regex_automata::util::alphabet::ByteSet>::is_empty
<regex_automata::util::alphabet::ByteSet>::is_empty
Line
Count
Source
800
19
    pub(crate) fn is_empty(&self) -> bool {
801
19
        self.bits.0 == [0, 0]
802
19
    }
803
804
    /// Deserializes a byte set from the given slice. If the slice is of
805
    /// incorrect length or is otherwise malformed, then an error is returned.
806
    /// Upon success, the number of bytes read along with the set are returned.
807
    /// The number of bytes read is always a multiple of 8.
808
0
    pub(crate) fn from_bytes(
809
0
        slice: &[u8],
810
0
    ) -> Result<(ByteSet, usize), DeserializeError> {
811
        use core::mem::size_of;
812
813
0
        wire::check_slice_len(slice, 2 * size_of::<u128>(), "byte set")?;
814
0
        let mut nread = 0;
815
0
        let (low, nr) = wire::try_read_u128(slice, "byte set low bucket")?;
816
0
        nread += nr;
817
0
        let (high, nr) = wire::try_read_u128(slice, "byte set high bucket")?;
818
0
        nread += nr;
819
0
        Ok((ByteSet { bits: BitSet([low, high]) }, nread))
820
0
    }
Unexecuted instantiation: <regex_automata::util::alphabet::ByteSet>::from_bytes
Unexecuted instantiation: <regex_automata::util::alphabet::ByteSet>::from_bytes
821
822
    /// Writes this byte set to the given byte buffer. If the given buffer is
823
    /// too small, then an error is returned. Upon success, the total number of
824
    /// bytes written is returned. The number of bytes written is guaranteed to
825
    /// be a multiple of 8.
826
0
    pub(crate) fn write_to<E: crate::util::wire::Endian>(
827
0
        &self,
828
0
        dst: &mut [u8],
829
0
    ) -> Result<usize, SerializeError> {
830
        use core::mem::size_of;
831
832
0
        let nwrite = self.write_to_len();
833
0
        if dst.len() < nwrite {
834
0
            return Err(SerializeError::buffer_too_small("byte set"));
835
0
        }
836
0
        let mut nw = 0;
837
0
        E::write_u128(self.bits.0[0], &mut dst[nw..]);
838
0
        nw += size_of::<u128>();
839
0
        E::write_u128(self.bits.0[1], &mut dst[nw..]);
840
0
        nw += size_of::<u128>();
841
0
        assert_eq!(nwrite, nw, "expected to write certain number of bytes",);
842
0
        assert_eq!(
843
0
            nw % 8,
844
            0,
845
0
            "expected to write multiple of 8 bytes for byte set",
846
        );
847
0
        Ok(nw)
848
0
    }
Unexecuted instantiation: <regex_automata::util::alphabet::ByteSet>::write_to::<_>
Unexecuted instantiation: <regex_automata::util::alphabet::ByteSet>::write_to::<_>
849
850
    /// Returns the total number of bytes written by `write_to`.
851
0
    pub(crate) fn write_to_len(&self) -> usize {
852
0
        2 * core::mem::size_of::<u128>()
853
0
    }
Unexecuted instantiation: <regex_automata::util::alphabet::ByteSet>::write_to_len
Unexecuted instantiation: <regex_automata::util::alphabet::ByteSet>::write_to_len
854
}
855
856
impl core::fmt::Debug for BitSet {
857
0
    fn fmt(&self, f: &mut core::fmt::Formatter) -> core::fmt::Result {
858
0
        let mut fmtd = f.debug_set();
859
0
        for b in 0u8..=255 {
860
0
            if (ByteSet { bits: *self }).contains(b) {
861
0
                fmtd.entry(&b);
862
0
            }
863
        }
864
0
        fmtd.finish()
865
0
    }
Unexecuted instantiation: <regex_automata::util::alphabet::BitSet as core::fmt::Debug>::fmt
Unexecuted instantiation: <regex_automata::util::alphabet::BitSet as core::fmt::Debug>::fmt
866
}
867
868
#[derive(Debug)]
869
pub(crate) struct ByteSetIter<'a> {
870
    set: &'a ByteSet,
871
    b: usize,
872
}
873
874
impl<'a> Iterator for ByteSetIter<'a> {
875
    type Item = u8;
876
877
0
    fn next(&mut self) -> Option<u8> {
878
0
        while self.b <= 255 {
879
0
            let b = u8::try_from(self.b).unwrap();
880
0
            self.b += 1;
881
0
            if self.set.contains(b) {
882
0
                return Some(b);
883
0
            }
884
        }
885
0
        None
886
0
    }
Unexecuted instantiation: <regex_automata::util::alphabet::ByteSetIter as core::iter::traits::iterator::Iterator>::next
Unexecuted instantiation: <regex_automata::util::alphabet::ByteSetIter as core::iter::traits::iterator::Iterator>::next
887
}
888
889
#[derive(Debug)]
890
pub(crate) struct ByteSetRangeIter<'a> {
891
    set: &'a ByteSet,
892
    b: usize,
893
}
894
895
impl<'a> Iterator for ByteSetRangeIter<'a> {
896
    type Item = (u8, u8);
897
898
0
    fn next(&mut self) -> Option<(u8, u8)> {
899
0
        let asu8 = |n: usize| u8::try_from(n).unwrap();
Unexecuted instantiation: <regex_automata::util::alphabet::ByteSetRangeIter as core::iter::traits::iterator::Iterator>::next::{closure#0}
Unexecuted instantiation: <regex_automata::util::alphabet::ByteSetRangeIter as core::iter::traits::iterator::Iterator>::next::{closure#0}
900
0
        while self.b <= 255 {
901
0
            let start = asu8(self.b);
902
0
            self.b += 1;
903
0
            if !self.set.contains(start) {
904
0
                continue;
905
0
            }
906
0
907
0
            let mut end = start;
908
0
            while self.b <= 255 && self.set.contains(asu8(self.b)) {
909
0
                end = asu8(self.b);
910
0
                self.b += 1;
911
0
            }
912
0
            return Some((start, end));
913
        }
914
0
        None
915
0
    }
Unexecuted instantiation: <regex_automata::util::alphabet::ByteSetRangeIter as core::iter::traits::iterator::Iterator>::next
Unexecuted instantiation: <regex_automata::util::alphabet::ByteSetRangeIter as core::iter::traits::iterator::Iterator>::next
916
}
917
918
#[cfg(all(test, feature = "alloc"))]
919
mod tests {
920
    use alloc::{vec, vec::Vec};
921
922
    use super::*;
923
924
    #[test]
925
    fn byte_classes() {
926
        let mut set = ByteClassSet::empty();
927
        set.set_range(b'a', b'z');
928
929
        let classes = set.byte_classes();
930
        assert_eq!(classes.get(0), 0);
931
        assert_eq!(classes.get(1), 0);
932
        assert_eq!(classes.get(2), 0);
933
        assert_eq!(classes.get(b'a' - 1), 0);
934
        assert_eq!(classes.get(b'a'), 1);
935
        assert_eq!(classes.get(b'm'), 1);
936
        assert_eq!(classes.get(b'z'), 1);
937
        assert_eq!(classes.get(b'z' + 1), 2);
938
        assert_eq!(classes.get(254), 2);
939
        assert_eq!(classes.get(255), 2);
940
941
        let mut set = ByteClassSet::empty();
942
        set.set_range(0, 2);
943
        set.set_range(4, 6);
944
        let classes = set.byte_classes();
945
        assert_eq!(classes.get(0), 0);
946
        assert_eq!(classes.get(1), 0);
947
        assert_eq!(classes.get(2), 0);
948
        assert_eq!(classes.get(3), 1);
949
        assert_eq!(classes.get(4), 2);
950
        assert_eq!(classes.get(5), 2);
951
        assert_eq!(classes.get(6), 2);
952
        assert_eq!(classes.get(7), 3);
953
        assert_eq!(classes.get(255), 3);
954
    }
955
956
    #[test]
957
    fn full_byte_classes() {
958
        let mut set = ByteClassSet::empty();
959
        for b in 0u8..=255 {
960
            set.set_range(b, b);
961
        }
962
        assert_eq!(set.byte_classes().alphabet_len(), 257);
963
    }
964
965
    #[test]
966
    fn elements_typical() {
967
        let mut set = ByteClassSet::empty();
968
        set.set_range(b'b', b'd');
969
        set.set_range(b'g', b'm');
970
        set.set_range(b'z', b'z');
971
        let classes = set.byte_classes();
972
        // class 0: \x00-a
973
        // class 1: b-d
974
        // class 2: e-f
975
        // class 3: g-m
976
        // class 4: n-y
977
        // class 5: z-z
978
        // class 6: \x7B-\xFF
979
        // class 7: EOI
980
        assert_eq!(classes.alphabet_len(), 8);
981
982
        let elements = classes.elements(Unit::u8(0)).collect::<Vec<_>>();
983
        assert_eq!(elements.len(), 98);
984
        assert_eq!(elements[0], Unit::u8(b'\x00'));
985
        assert_eq!(elements[97], Unit::u8(b'a'));
986
987
        let elements = classes.elements(Unit::u8(1)).collect::<Vec<_>>();
988
        assert_eq!(
989
            elements,
990
            vec![Unit::u8(b'b'), Unit::u8(b'c'), Unit::u8(b'd')],
991
        );
992
993
        let elements = classes.elements(Unit::u8(2)).collect::<Vec<_>>();
994
        assert_eq!(elements, vec![Unit::u8(b'e'), Unit::u8(b'f')],);
995
996
        let elements = classes.elements(Unit::u8(3)).collect::<Vec<_>>();
997
        assert_eq!(
998
            elements,
999
            vec![
1000
                Unit::u8(b'g'),
1001
                Unit::u8(b'h'),
1002
                Unit::u8(b'i'),
1003
                Unit::u8(b'j'),
1004
                Unit::u8(b'k'),
1005
                Unit::u8(b'l'),
1006
                Unit::u8(b'm'),
1007
            ],
1008
        );
1009
1010
        let elements = classes.elements(Unit::u8(4)).collect::<Vec<_>>();
1011
        assert_eq!(elements.len(), 12);
1012
        assert_eq!(elements[0], Unit::u8(b'n'));
1013
        assert_eq!(elements[11], Unit::u8(b'y'));
1014
1015
        let elements = classes.elements(Unit::u8(5)).collect::<Vec<_>>();
1016
        assert_eq!(elements, vec![Unit::u8(b'z')]);
1017
1018
        let elements = classes.elements(Unit::u8(6)).collect::<Vec<_>>();
1019
        assert_eq!(elements.len(), 133);
1020
        assert_eq!(elements[0], Unit::u8(b'\x7B'));
1021
        assert_eq!(elements[132], Unit::u8(b'\xFF'));
1022
1023
        let elements = classes.elements(Unit::eoi(7)).collect::<Vec<_>>();
1024
        assert_eq!(elements, vec![Unit::eoi(256)]);
1025
    }
1026
1027
    #[test]
1028
    fn elements_singletons() {
1029
        let classes = ByteClasses::singletons();
1030
        assert_eq!(classes.alphabet_len(), 257);
1031
1032
        let elements = classes.elements(Unit::u8(b'a')).collect::<Vec<_>>();
1033
        assert_eq!(elements, vec![Unit::u8(b'a')]);
1034
1035
        let elements = classes.elements(Unit::eoi(5)).collect::<Vec<_>>();
1036
        assert_eq!(elements, vec![Unit::eoi(256)]);
1037
    }
1038
1039
    #[test]
1040
    fn elements_empty() {
1041
        let classes = ByteClasses::empty();
1042
        assert_eq!(classes.alphabet_len(), 2);
1043
1044
        let elements = classes.elements(Unit::u8(0)).collect::<Vec<_>>();
1045
        assert_eq!(elements.len(), 256);
1046
        assert_eq!(elements[0], Unit::u8(b'\x00'));
1047
        assert_eq!(elements[255], Unit::u8(b'\xFF'));
1048
1049
        let elements = classes.elements(Unit::eoi(1)).collect::<Vec<_>>();
1050
        assert_eq!(elements, vec![Unit::eoi(256)]);
1051
    }
1052
1053
    #[test]
1054
    fn representatives() {
1055
        let mut set = ByteClassSet::empty();
1056
        set.set_range(b'b', b'd');
1057
        set.set_range(b'g', b'm');
1058
        set.set_range(b'z', b'z');
1059
        let classes = set.byte_classes();
1060
1061
        let got: Vec<Unit> = classes.representatives(..).collect();
1062
        let expected = vec![
1063
            Unit::u8(b'\x00'),
1064
            Unit::u8(b'b'),
1065
            Unit::u8(b'e'),
1066
            Unit::u8(b'g'),
1067
            Unit::u8(b'n'),
1068
            Unit::u8(b'z'),
1069
            Unit::u8(b'\x7B'),
1070
            Unit::eoi(7),
1071
        ];
1072
        assert_eq!(expected, got);
1073
1074
        let got: Vec<Unit> = classes.representatives(..0).collect();
1075
        assert!(got.is_empty());
1076
        let got: Vec<Unit> = classes.representatives(1..1).collect();
1077
        assert!(got.is_empty());
1078
        let got: Vec<Unit> = classes.representatives(255..255).collect();
1079
        assert!(got.is_empty());
1080
1081
        // A weird case that is the only guaranteed to way to get an iterator
1082
        // of just the EOI class by excluding all possible byte values.
1083
        let got: Vec<Unit> = classes
1084
            .representatives((
1085
                core::ops::Bound::Excluded(255),
1086
                core::ops::Bound::Unbounded,
1087
            ))
1088
            .collect();
1089
        let expected = vec![Unit::eoi(7)];
1090
        assert_eq!(expected, got);
1091
1092
        let got: Vec<Unit> = classes.representatives(..=255).collect();
1093
        let expected = vec![
1094
            Unit::u8(b'\x00'),
1095
            Unit::u8(b'b'),
1096
            Unit::u8(b'e'),
1097
            Unit::u8(b'g'),
1098
            Unit::u8(b'n'),
1099
            Unit::u8(b'z'),
1100
            Unit::u8(b'\x7B'),
1101
        ];
1102
        assert_eq!(expected, got);
1103
1104
        let got: Vec<Unit> = classes.representatives(b'b'..=b'd').collect();
1105
        let expected = vec![Unit::u8(b'b')];
1106
        assert_eq!(expected, got);
1107
1108
        let got: Vec<Unit> = classes.representatives(b'a'..=b'd').collect();
1109
        let expected = vec![Unit::u8(b'a'), Unit::u8(b'b')];
1110
        assert_eq!(expected, got);
1111
1112
        let got: Vec<Unit> = classes.representatives(b'b'..=b'e').collect();
1113
        let expected = vec![Unit::u8(b'b'), Unit::u8(b'e')];
1114
        assert_eq!(expected, got);
1115
1116
        let got: Vec<Unit> = classes.representatives(b'A'..=b'Z').collect();
1117
        let expected = vec![Unit::u8(b'A')];
1118
        assert_eq!(expected, got);
1119
1120
        let got: Vec<Unit> = classes.representatives(b'A'..=b'z').collect();
1121
        let expected = vec![
1122
            Unit::u8(b'A'),
1123
            Unit::u8(b'b'),
1124
            Unit::u8(b'e'),
1125
            Unit::u8(b'g'),
1126
            Unit::u8(b'n'),
1127
            Unit::u8(b'z'),
1128
        ];
1129
        assert_eq!(expected, got);
1130
1131
        let got: Vec<Unit> = classes.representatives(b'z'..).collect();
1132
        let expected = vec![Unit::u8(b'z'), Unit::u8(b'\x7B'), Unit::eoi(7)];
1133
        assert_eq!(expected, got);
1134
1135
        let got: Vec<Unit> = classes.representatives(b'z'..=0xFF).collect();
1136
        let expected = vec![Unit::u8(b'z'), Unit::u8(b'\x7B')];
1137
        assert_eq!(expected, got);
1138
    }
1139
}