/rust/registry/src/index.crates.io-6f17d22bba15001f/regex-syntax-0.8.5/src/lib.rs
Line | Count | Source |
1 | | /*! |
2 | | This crate provides a robust regular expression parser. |
3 | | |
4 | | This crate defines two primary types: |
5 | | |
6 | | * [`Ast`](ast::Ast) is the abstract syntax of a regular expression. |
7 | | An abstract syntax corresponds to a *structured representation* of the |
8 | | concrete syntax of a regular expression, where the concrete syntax is the |
9 | | pattern string itself (e.g., `foo(bar)+`). Given some abstract syntax, it |
10 | | can be converted back to the original concrete syntax (modulo some details, |
11 | | like whitespace). To a first approximation, the abstract syntax is complex |
12 | | and difficult to analyze. |
13 | | * [`Hir`](hir::Hir) is the high-level intermediate representation |
14 | | ("HIR" or "high-level IR" for short) of a regular expression. It corresponds to |
15 | | an intermediate state of a regular expression that sits between the abstract |
16 | | syntax and the low level compiled opcodes that are eventually responsible for |
17 | | executing a regular expression search. Given some high-level IR, it is not |
18 | | possible to produce the original concrete syntax (although it is possible to |
19 | | produce an equivalent concrete syntax, it will likely scarcely resemble |
20 | | the original pattern). To a first approximation, the high-level IR is simple |
21 | | and easy to analyze. |
22 | | |
23 | | These two types come with conversion routines: |
24 | | |
25 | | * An [`ast::parse::Parser`] converts concrete syntax (a `&str`) to an |
26 | | [`Ast`](ast::Ast). |
27 | | * A [`hir::translate::Translator`] converts an [`Ast`](ast::Ast) to a |
28 | | [`Hir`](hir::Hir). |
29 | | |
30 | | As a convenience, the above two conversion routines are combined into one via |
31 | | the top-level [`Parser`] type. This `Parser` will first convert your pattern to |
32 | | an `Ast` and then convert the `Ast` to an `Hir`. It's also exposed as the |
33 | | top-level [`parse`] free function. |
34 | | |
35 | | |
36 | | # Example |
37 | | |
38 | | This example shows how to parse a pattern string into its HIR: |
39 | | |
40 | | ``` |
41 | | use regex_syntax::{hir::Hir, parse}; |
42 | | |
43 | | let hir = parse("a|b")?; |
44 | | assert_eq!(hir, Hir::alternation(vec![ |
45 | | Hir::literal("a".as_bytes()), |
46 | | Hir::literal("b".as_bytes()), |
47 | | ])); |
48 | | # Ok::<(), Box<dyn std::error::Error>>(()) |
49 | | ``` |
50 | | |
51 | | |
52 | | # Concrete syntax supported |
53 | | |
54 | | The concrete syntax is documented as part of the public API of the |
55 | | [`regex` crate](https://docs.rs/regex/%2A/regex/#syntax). |
56 | | |
57 | | |
58 | | # Input safety |
59 | | |
60 | | A key feature of this library is that it is safe to use with end user facing |
61 | | input. This plays a significant role in the internal implementation. In |
62 | | particular: |
63 | | |
64 | | 1. Parsers provide a `nest_limit` option that permits callers to control how |
65 | | deeply nested a regular expression is allowed to be. This makes it possible |
66 | | to do case analysis over an `Ast` or an `Hir` using recursion without |
67 | | worrying about stack overflow. |
68 | | 2. Since relying on a particular stack size is brittle, this crate goes to |
69 | | great lengths to ensure that all interactions with both the `Ast` and the |
70 | | `Hir` do not use recursion. Namely, they use constant stack space and heap |
71 | | space proportional to the size of the original pattern string (in bytes). |
72 | | This includes the types' corresponding destructors. (One exception to this |
73 | | is literal extraction, but this will eventually get fixed.) |
74 | | |
75 | | |
76 | | # Error reporting |
77 | | |
78 | | The `Display` implementations on all `Error` types exposed in this library |
79 | | provide nice human readable errors that are suitable for showing to end users |
80 | | in a monospace font. |
81 | | |
82 | | |
83 | | # Literal extraction |
84 | | |
85 | | This crate provides limited support for [literal extraction from `Hir` |
86 | | values](hir::literal). Be warned that literal extraction uses recursion, and |
87 | | therefore, stack size proportional to the size of the `Hir`. |
88 | | |
89 | | The purpose of literal extraction is to speed up searches. That is, if you |
90 | | know a regular expression must match a prefix or suffix literal, then it is |
91 | | often quicker to search for instances of that literal, and then confirm or deny |
92 | | the match using the full regular expression engine. These optimizations are |
93 | | done automatically in the `regex` crate. |
94 | | |
95 | | |
96 | | # Crate features |
97 | | |
98 | | An important feature provided by this crate is its Unicode support. This |
99 | | includes things like case folding, boolean properties, general categories, |
100 | | scripts and Unicode-aware support for the Perl classes `\w`, `\s` and `\d`. |
101 | | However, a downside of this support is that it requires bundling several |
102 | | Unicode data tables that are substantial in size. |
103 | | |
104 | | A fair number of use cases do not require full Unicode support. For this |
105 | | reason, this crate exposes a number of features to control which Unicode |
106 | | data is available. |
107 | | |
108 | | If a regular expression attempts to use a Unicode feature that is not available |
109 | | because the corresponding crate feature was disabled, then translating that |
110 | | regular expression to an `Hir` will return an error. (It is still possible to |
111 | | construct an `Ast` for such a regular expression, since Unicode data is not |
112 | | used until translation to an `Hir`.) Stated differently, enabling or disabling |
113 | | any of the features below can only add or subtract from the total set of valid |
114 | | regular expressions. Enabling or disabling a feature will never modify the |
115 | | match semantics of a regular expression. |
116 | | |
117 | | The following features are available: |
118 | | |
119 | | * **std** - |
120 | | Enables support for the standard library. This feature is enabled by default. |
121 | | When disabled, only `core` and `alloc` are used. Otherwise, enabling `std` |
122 | | generally just enables `std::error::Error` trait impls for the various error |
123 | | types. |
124 | | * **unicode** - |
125 | | Enables all Unicode features. This feature is enabled by default, and will |
126 | | always cover all Unicode features, even if more are added in the future. |
127 | | * **unicode-age** - |
128 | | Provide the data for the |
129 | | [Unicode `Age` property](https://www.unicode.org/reports/tr44/tr44-24.html#Character_Age). |
130 | | This makes it possible to use classes like `\p{Age:6.0}` to refer to all |
131 | | codepoints first introduced in Unicode 6.0. |
132 | | * **unicode-bool** - |
133 | | Provide the data for numerous Unicode boolean properties. The full list |
134 | | is not included here, but contains properties like `Alphabetic`, `Emoji`, |
135 | | `Lowercase`, `Math`, `Uppercase` and `White_Space`. |
136 | | * **unicode-case** - |
137 | | Provide the data for case insensitive matching using |
138 | | [Unicode's "simple loose matches" specification](https://www.unicode.org/reports/tr18/#Simple_Loose_Matches). |
139 | | * **unicode-gencat** - |
140 | | Provide the data for |
141 | | [Unicode general categories](https://www.unicode.org/reports/tr44/tr44-24.html#General_Category_Values). |
142 | | This includes, but is not limited to, `Decimal_Number`, `Letter`, |
143 | | `Math_Symbol`, `Number` and `Punctuation`. |
144 | | * **unicode-perl** - |
145 | | Provide the data for supporting the Unicode-aware Perl character classes, |
146 | | corresponding to `\w`, `\s` and `\d`. This is also necessary for using |
147 | | Unicode-aware word boundary assertions. Note that if this feature is |
148 | | disabled, the `\s` and `\d` character classes are still available if the |
149 | | `unicode-bool` and `unicode-gencat` features are enabled, respectively. |
150 | | * **unicode-script** - |
151 | | Provide the data for |
152 | | [Unicode scripts and script extensions](https://www.unicode.org/reports/tr24/). |
153 | | This includes, but is not limited to, `Arabic`, `Cyrillic`, `Hebrew`, |
154 | | `Latin` and `Thai`. |
155 | | * **unicode-segment** - |
156 | | Provide the data for the properties used to implement the |
157 | | [Unicode text segmentation algorithms](https://www.unicode.org/reports/tr29/). |
158 | | This enables using classes like `\p{gcb=Extend}`, `\p{wb=Katakana}` and |
159 | | `\p{sb=ATerm}`. |
160 | | * **arbitrary** - |
161 | | Enabling this feature introduces a public dependency on the |
162 | | [`arbitrary`](https://crates.io/crates/arbitrary) |
163 | | crate. Namely, it implements the `Arbitrary` trait from that crate for the |
164 | | [`Ast`](crate::ast::Ast) type. This feature is disabled by default. |
165 | | */ |
166 | | |
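The "Input safety" section of the module docs above says that interactions with the `Ast` and the `Hir` use constant stack space and heap space proportional to the pattern size. A minimal sketch of that technique follows, assuming a toy `Node` tree rather than the crate's real types: the traversal keeps its pending work in a heap-allocated `Vec` instead of recursing. The names `Node`, `nesting_depth` and `within_nest_limit` are hypothetical and are not part of `regex-syntax`.

```rust
/// Toy syntax tree, used only for illustration (not the crate's `Ast`).
enum Node {
    Literal(char),
    Concat(Vec<Node>),
    Repeat(Box<Node>),
}

/// Computes the maximum nesting depth without recursion.
///
/// Pending nodes live in a heap-allocated `Vec`, so the call stack stays
/// constant no matter how deeply the tree is nested.
fn nesting_depth(root: &Node) -> usize {
    let mut max_depth = 0;
    let mut stack: Vec<(&Node, usize)> = vec![(root, 0)];
    while let Some((node, depth)) = stack.pop() {
        max_depth = max_depth.max(depth);
        match node {
            Node::Literal(_) => {}
            Node::Concat(children) => {
                for child in children {
                    stack.push((child, depth + 1));
                }
            }
            Node::Repeat(child) => stack.push((&**child, depth + 1)),
        }
    }
    max_depth
}

/// Hypothetical check in the spirit of a `nest_limit` option.
fn within_nest_limit(root: &Node, limit: usize) -> bool {
    nesting_depth(root) <= limit
}

fn main() {
    // Roughly the shape of `(a(b)*)*`.
    let tree = Node::Repeat(Box::new(Node::Concat(vec![
        Node::Literal('a'),
        Node::Repeat(Box::new(Node::Literal('b'))),
    ])));
    println!("depth = {}", nesting_depth(&tree));
    println!("within limit 2: {}", within_nest_limit(&tree, 2));
}
```

Note that this toy type still relies on the compiler-generated recursive `Drop`; the module docs state that the crate's own types avoid recursion in their destructors as well.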
167 | | #![no_std] |
168 | | #![forbid(unsafe_code)] |
169 | | #![deny(missing_docs, rustdoc::broken_intra_doc_links)] |
170 | | #![warn(missing_debug_implementations)] |
171 | | #![cfg_attr(docsrs, feature(doc_auto_cfg))] |
172 | | |
173 | | #[cfg(any(test, feature = "std"))] |
174 | | extern crate std; |
175 | | |
176 | | extern crate alloc; |
177 | | |
178 | | pub use crate::{ |
179 | | error::Error, |
180 | | parser::{parse, Parser, ParserBuilder}, |
181 | | unicode::UnicodeWordError, |
182 | | }; |
183 | | |
184 | | use alloc::string::String; |
185 | | |
186 | | pub mod ast; |
187 | | mod debug; |
188 | | mod either; |
189 | | mod error; |
190 | | pub mod hir; |
191 | | mod parser; |
192 | | mod rank; |
193 | | mod unicode; |
194 | | mod unicode_tables; |
195 | | pub mod utf8; |
196 | | |
197 | | /// Escapes all regular expression meta characters in `text`. |
198 | | /// |
199 | | /// The string returned may be safely used as a literal in a regular |
200 | | /// expression. |
201 | 200k | pub fn escape(text: &str) -> String { |
202 | 200k | let mut quoted = String::new(); |
203 | 200k | escape_into(text, &mut quoted); |
204 | 200k | quoted |
205 | 200k | } |
206 | | |
207 | | /// Escapes all meta characters in `text` and writes the result into `buf`. |
208 | | /// |
209 | | /// This will append escape characters into the given buffer. The characters |
210 | | /// that are appended are safe to use as a literal in a regular expression. |
211 | 200k | pub fn escape_into(text: &str, buf: &mut String) { |
212 | 200k | buf.reserve(text.len()); |
213 | 5.25M | for c in text.chars() { |
214 | 5.25M | if is_meta_character(c) { |
215 | 1.00M | buf.push('\\'); |
216 | 4.25M | } |
217 | 5.25M | buf.push(c); |
218 | | } |
219 | 200k | } |
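Because `escape_into` appends to a caller-supplied buffer, several literals can be escaped into one `String` without intermediate allocations. The sketch below illustrates this with standalone copies of `is_meta_character` and `escape_into` (so it builds without the crate); `literal_alternation` is a hypothetical helper introduced only for this example.

```rust
/// Standalone copy of the meta-character set shown in the listing.
fn is_meta_character(c: char) -> bool {
    matches!(
        c,
        '\\' | '.' | '+' | '*' | '?' | '(' | ')' | '|' | '[' | ']' | '{'
            | '}' | '^' | '$' | '#' | '&' | '-' | '~'
    )
}

/// Appends `text` to `buf`, prefixing each meta character with a backslash.
fn escape_into(text: &str, buf: &mut String) {
    buf.reserve(text.len());
    for c in text.chars() {
        if is_meta_character(c) {
            buf.push('\\');
        }
        buf.push(c);
    }
}

/// Hypothetical helper: builds `lit1|lit2|...` with every literal escaped,
/// reusing a single output buffer.
fn literal_alternation(literals: &[&str]) -> String {
    let mut pattern = String::new();
    for (i, lit) in literals.iter().enumerate() {
        if i > 0 {
            // This `|` is intentionally unescaped: it is the alternation operator.
            pattern.push('|');
        }
        escape_into(lit, &mut pattern);
    }
    pattern
}

fn main() {
    let pattern = literal_alternation(&["a.b", "1+1", "x|y"]);
    println!("{}", pattern);
}
```

With the real crate, the same loop could presumably call `regex_syntax::escape_into` directly.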
220 | | |
221 | | /// Returns true if the given character has significance in a regex. |
222 | | /// |
223 | | /// Generally speaking, these are the only characters which _must_ be escaped |
224 | | /// in order to match their literal meaning. For example, to match a literal |
225 | | /// `|`, one could write `\|`. Sometimes escaping isn't always necessary. For |
226 | | /// example, `-` is treated as a meta character because of its significance |
227 | | /// for writing ranges inside of character classes, but the regex `-` will |
228 | | /// match a literal `-` because `-` has no special meaning outside of character |
229 | | /// classes. |
230 | | /// |
231 | | /// In order to determine whether a character may be escaped at all, the |
232 | | /// [`is_escapeable_character`] routine should be used. The difference between |
233 | | /// `is_meta_character` and `is_escapeable_character` is that the latter will |
234 | | /// return true for some characters that are _not_ meta characters. For |
235 | | /// example, `%` and `\%` both match a literal `%` in all contexts. In other |
236 | | /// words, `is_escapeable_character` includes "superfluous" escapes. |
237 | | /// |
238 | | /// Note that the set of characters for which this function returns `true` or |
239 | | /// `false` is fixed and won't change in a semver compatible release. (In this |
240 | | /// case, "semver compatible release" actually refers to the `regex` crate |
241 | | /// itself, since reducing or expanding the set of meta characters would be a |
242 | | /// breaking change for not just `regex-syntax` but also `regex` itself.) |
243 | | /// |
244 | | /// # Example |
245 | | /// |
246 | | /// ``` |
247 | | /// use regex_syntax::is_meta_character; |
248 | | /// |
249 | | /// assert!(is_meta_character('?')); |
250 | | /// assert!(is_meta_character('-')); |
251 | | /// assert!(is_meta_character('&')); |
252 | | /// assert!(is_meta_character('#')); |
253 | | /// |
254 | | /// assert!(!is_meta_character('%')); |
255 | | /// assert!(!is_meta_character('/')); |
256 | | /// assert!(!is_meta_character('!')); |
257 | | /// assert!(!is_meta_character('"')); |
258 | | /// assert!(!is_meta_character('e')); |
259 | | /// ``` |
260 | 5.88M | pub fn is_meta_character(c: char) -> bool { |
261 | 5.88M | match c { |
262 | | '\\' | '.' | '+' | '*' | '?' | '(' | ')' | '|' | '[' | ']' | '{' |
263 | 1.45M | | '}' | '^' | '$' | '#' | '&' | '-' | '~' => true, |
264 | 4.42M | _ => false, |
265 | | } |
266 | 5.88M | } |
267 | | |
268 | | /// Returns true if the given character can be escaped in a regex. |
269 | | /// |
270 | | /// This returns true in all cases that `is_meta_character` returns true, but |
271 | | /// also returns true in some cases where `is_meta_character` returns false. |
272 | | /// For example, `%` is not a meta character, but it is escapeable. That is, |
273 | | /// `%` and `\%` both match a literal `%` in all contexts. |
274 | | /// |
275 | | /// The purpose of this routine is to provide knowledge about what characters |
276 | | /// may be escaped. Namely, most regex engines permit "superfluous" escapes |
277 | | /// where characters without any special significance may be escaped even |
278 | | /// though there is no actual _need_ to do so. |
279 | | /// |
280 | | /// This will return false for some characters. For example, `e` is not |
281 | | /// escapeable. Therefore, `\e` will either result in a parse error (which is |
282 | | /// true today), or it could backwards compatibly evolve into a new construct |
283 | | /// with its own meaning. Indeed, that is the purpose of banning _some_ |
284 | | /// superfluous escapes: it provides a way to evolve the syntax in a compatible |
285 | | /// manner. |
286 | | /// |
287 | | /// # Example |
288 | | /// |
289 | | /// ``` |
290 | | /// use regex_syntax::is_escapeable_character; |
291 | | /// |
292 | | /// assert!(is_escapeable_character('?')); |
293 | | /// assert!(is_escapeable_character('-')); |
294 | | /// assert!(is_escapeable_character('&')); |
295 | | /// assert!(is_escapeable_character('#')); |
296 | | /// assert!(is_escapeable_character('%')); |
297 | | /// assert!(is_escapeable_character('/')); |
298 | | /// assert!(is_escapeable_character('!')); |
299 | | /// assert!(is_escapeable_character('"')); |
300 | | /// |
301 | | /// assert!(!is_escapeable_character('e')); |
302 | | /// ``` |
303 | 88.0k | pub fn is_escapeable_character(c: char) -> bool { |
304 | 88.0k | // Certainly escapeable if it's a meta character. |
305 | 88.0k | if is_meta_character(c) { |
306 | 0 | return true; |
307 | 88.0k | } |
308 | 88.0k | // Any character that isn't ASCII is definitely not escapeable. There's |
309 | 88.0k | // no real need to allow things like \☃ right? |
310 | 88.0k | if !c.is_ascii() { |
311 | 0 | return false; |
312 | 88.0k | } |
313 | 88.0k | // Otherwise, we basically say that everything is escapeable unless it's a |
314 | 88.0k | // letter or digit. Things like \3 are either octal (when enabled) or an |
315 | 88.0k | // error, and we should keep it that way. Otherwise, letters are reserved |
316 | 88.0k | // for adding new syntax in a backwards compatible way. |
317 | 88.0k | match c { |
318 | 88.0k | '0'..='9' | 'A'..='Z' | 'a'..='z' => false, |
319 | | // While not currently supported, we keep these as not escapeable to |
320 | | // give us some flexibility with respect to supporting the \< and |
321 | | // \> word boundary assertions in the future. By rejecting them as |
322 | | // escapeable, \< and \> will result in a parse error. Thus, we can |
323 | | // turn them into something else in the future without it being a |
324 | | // backwards incompatible change. |
325 | | // |
326 | | // OK, now we support \< and \>, and we need to retain them as *not* |
327 | | // escapeable here since the escape sequence is significant. |
328 | 0 | '<' | '>' => false, |
329 | 0 | _ => true, |
330 | | } |
331 | 88.0k | } |
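The relationship between `is_meta_character` and `is_escapeable_character` can be summarized as three classes: characters that must be escaped to match literally, characters that may be escaped superfluously, and characters whose escapes are rejected or reserved. The following sketch uses standalone copies of the two predicates from the listing; the `EscapeClass` enum and the `classify` function are hypothetical names added for illustration.

```rust
/// Standalone copy of the meta-character set shown in the listing.
fn is_meta_character(c: char) -> bool {
    matches!(
        c,
        '\\' | '.' | '+' | '*' | '?' | '(' | ')' | '|' | '[' | ']' | '{'
            | '}' | '^' | '$' | '#' | '&' | '-' | '~'
    )
}

/// Standalone copy of the escapeable check shown in the listing.
fn is_escapeable_character(c: char) -> bool {
    if is_meta_character(c) {
        return true;
    }
    if !c.is_ascii() {
        return false;
    }
    // Letters, digits, `<` and `>` are not escapeable.
    !matches!(c, '0'..='9' | 'A'..='Z' | 'a'..='z' | '<' | '>')
}

/// Hypothetical classification used only in this example.
#[derive(Debug, PartialEq)]
enum EscapeClass {
    /// Has special meaning; escape it to match it literally.
    Meta,
    /// No special meaning, but `\c` is accepted and matches `c`.
    Superfluous,
    /// `\c` is not a plain escape (an error or a distinct construct).
    NotEscapeable,
}

fn classify(c: char) -> EscapeClass {
    if is_meta_character(c) {
        EscapeClass::Meta
    } else if is_escapeable_character(c) {
        EscapeClass::Superfluous
    } else {
        EscapeClass::NotEscapeable
    }
}

fn main() {
    for c in ['?', '%', 'e', '<'] {
        println!("{:?} -> {:?}", c, classify(c));
    }
}
```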
332 | | |
333 | | /// Returns true if and only if the given character is a Unicode word |
334 | | /// character. |
335 | | /// |
336 | | /// A Unicode word character is defined by |
337 | | /// [UTS#18 Annex C](https://unicode.org/reports/tr18/#Compatibility_Properties). |
338 | | /// In particular, a character |
339 | | /// is considered a word character if it is in either of the `Alphabetic` or |
340 | | /// `Join_Control` properties, or is in one of the `Decimal_Number`, `Mark` |
341 | | /// or `Connector_Punctuation` general categories. |
342 | | /// |
343 | | /// # Panics |
344 | | /// |
345 | | /// If the `unicode-perl` feature is not enabled, then this function |
346 | | /// panics. For this reason, it is recommended that callers use |
347 | | /// [`try_is_word_character`] instead. |
348 | 0 | pub fn is_word_character(c: char) -> bool { |
349 | 0 | try_is_word_character(c).expect("unicode-perl feature must be enabled") |
350 | 0 | } |
351 | | |
352 | | /// Returns true if and only if the given character is a Unicode word |
353 | | /// character. |
354 | | /// |
355 | | /// A Unicode word character is defined by |
356 | | /// [UTS#18 Annex C](https://unicode.org/reports/tr18/#Compatibility_Properties). |
357 | | /// In particular, a character |
358 | | /// is considered a word character if it is in either of the `Alphabetic` or |
359 | | /// `Join_Control` properties, or is in one of the `Decimal_Number`, `Mark` |
360 | | /// or `Connector_Punctuation` general categories. |
361 | | /// |
362 | | /// # Errors |
363 | | /// |
364 | | /// If the `unicode-perl` feature is not enabled, then this function always |
365 | | /// returns an error. |
366 | 63.7M | pub fn try_is_word_character( |
367 | 63.7M | c: char, |
368 | 63.7M | ) -> core::result::Result<bool, UnicodeWordError> { |
369 | 63.7M | unicode::is_word_character(c) |
370 | 63.7M | } |
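The docs above recommend the fallible `try_is_word_character` over the panicking `is_word_character`. One way a caller might handle the error case is sketched below, under several assumptions: `WordTablesMissing` is a hypothetical stand-in for the crate's error type, the `tables_available` flag models whether the `unicode-perl` feature is enabled, and the Unicode check uses `char::is_alphanumeric` as a rough approximation rather than the UTS#18 definition. Falling back to ASCII is a policy choice of this example, not something the crate does.

```rust
/// Hypothetical stand-in for the crate's error type.
#[derive(Debug, PartialEq)]
struct WordTablesMissing;

/// Hypothetical stand-in for `try_is_word_character`.
///
/// `tables_available` models whether the Unicode word tables were compiled
/// in. The Unicode test here is only an approximation built on `std`.
fn try_is_word_character(
    c: char,
    tables_available: bool,
) -> Result<bool, WordTablesMissing> {
    if !tables_available {
        return Err(WordTablesMissing);
    }
    Ok(c.is_alphanumeric() || c == '_')
}

/// Standalone copy of the ASCII word-byte check shown in the listing.
fn is_word_byte(b: u8) -> bool {
    matches!(b, b'_' | b'0'..=b'9' | b'a'..=b'z' | b'A'..=b'Z')
}

/// Caller-side policy (an assumption of this sketch): use the Unicode answer
/// when it is available, otherwise fall back to the ASCII definition instead
/// of panicking.
fn is_word_or_ascii_fallback(c: char, tables_available: bool) -> bool {
    match try_is_word_character(c, tables_available) {
        Ok(is_word) => is_word,
        Err(WordTablesMissing) => c.is_ascii() && is_word_byte(c as u8),
    }
}

fn main() {
    println!("{}", is_word_or_ascii_fallback('β', true));
    println!("{}", is_word_or_ascii_fallback('β', false));
}
```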
371 | | |
372 | | /// Returns true if and only if the given character is an ASCII word character. |
373 | | /// |
374 | | /// An ASCII word character is defined by the following character class: |
375 | | /// `[_0-9a-zA-Z]`. |
376 | 62.2M | pub fn is_word_byte(c: u8) -> bool { |
377 | 62.2M | match c { |
378 | 32.2M | b'_' | b'0'..=b'9' | b'a'..=b'z' | b'A'..=b'Z' => true, |
379 | 40.9M | _ => false, |
380 | | } |
381 | 62.2M | } |
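A typical use of an ASCII word-byte predicate is deciding whether a position in a haystack sits on a word boundary, meaning exactly one of its two neighboring bytes is a word byte. The sketch below uses a standalone copy of `is_word_byte`; `is_ascii_word_boundary` is a hypothetical helper written for this example and is not claimed to be how the crate or any regex engine implements `\b`.

```rust
/// Standalone copy of the ASCII word-byte check shown in the listing.
fn is_word_byte(b: u8) -> bool {
    matches!(b, b'_' | b'0'..=b'9' | b'a'..=b'z' | b'A'..=b'Z')
}

/// Hypothetical helper: true if `at` lies between a word byte and a
/// non-word byte (or the start/end of the haystack).
///
/// `at` may range from 0 to `haystack.len()` inclusive.
fn is_ascii_word_boundary(haystack: &[u8], at: usize) -> bool {
    let word_before = at > 0 && is_word_byte(haystack[at - 1]);
    let word_after = at < haystack.len() && is_word_byte(haystack[at]);
    word_before != word_after
}

fn main() {
    let haystack = b"ab cd";
    let boundaries: Vec<usize> = (0..=haystack.len())
        .filter(|&at| is_ascii_word_boundary(haystack, at))
        .collect();
    println!("{:?}", boundaries);
}
```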
382 | | |
383 | | #[cfg(test)] |
384 | | mod tests { |
385 | | use alloc::string::ToString; |
386 | | |
387 | | use super::*; |
388 | | |
389 | | #[test] |
390 | | fn escape_meta() { |
391 | | assert_eq!( |
392 | | escape(r"\.+*?()|[]{}^$#&-~"), |
393 | | r"\\\.\+\*\?\(\)\|\[\]\{\}\^\$\#\&\-\~".to_string() |
394 | | ); |
395 | | } |
396 | | |
397 | | #[test] |
398 | | fn word_byte() { |
399 | | assert!(is_word_byte(b'a')); |
400 | | assert!(!is_word_byte(b'-')); |
401 | | } |
402 | | |
403 | | #[test] |
404 | | #[cfg(feature = "unicode-perl")] |
405 | | fn word_char() { |
406 | | assert!(is_word_character('a'), "ASCII"); |
407 | | assert!(is_word_character('à'), "Latin-1"); |
408 | | assert!(is_word_character('β'), "Greek"); |
409 | | assert!(is_word_character('\u{11011}'), "Brahmi (Unicode 6.0)"); |
410 | | assert!(is_word_character('\u{11611}'), "Modi (Unicode 7.0)"); |
411 | | assert!(is_word_character('\u{11711}'), "Ahom (Unicode 8.0)"); |
412 | | assert!(is_word_character('\u{17828}'), "Tangut (Unicode 9.0)"); |
413 | | assert!(is_word_character('\u{1B1B1}'), "Nushu (Unicode 10.0)"); |
414 | | assert!(is_word_character('\u{16E40}'), "Medefaidrin (Unicode 11.0)"); |
415 | | assert!(!is_word_character('-')); |
416 | | assert!(!is_word_character('☃')); |
417 | | } |
418 | | |
419 | | #[test] |
420 | | #[should_panic] |
421 | | #[cfg(not(feature = "unicode-perl"))] |
422 | | fn word_char_disabled_panic() { |
423 | | assert!(is_word_character('a')); |
424 | | } |
425 | | |
426 | | #[test] |
427 | | #[cfg(not(feature = "unicode-perl"))] |
428 | | fn word_char_disabled_error() { |
429 | | assert!(try_is_word_character('a').is_err()); |
430 | | } |
431 | | } |