class Pattern
A class used to represent "PERL 5"-ish regular expressions
clearPatternCache () | Don't use
compile (const std::string & pattern, const unsigned long mode = 0) | Call this function to compile a regular expression into a Pattern object.
compileAndKeep (const std::string & pattern, const unsigned long mode = 0) | Don't use this function.
createMatcher (const std::string & str) | Creates a matcher object using the specified string and this pattern.
findAll (const std::string & pattern, const std::string & str, const unsigned long mode = 0) | Finds all the instances of the specified pattern within the string.
findNthMatch (const std::string & pattern, const std::string & str, const int matchNum, const unsigned long mode = 0) | Searches through a string for the nth match of the given pattern in the string.
getFlags () const | Returns the flags used during compilation of this pattern
getPattern () const | Returns the regular expression this pattern represents
matches (const std::string & pattern, const std::string & str, const unsigned long mode = 0) | Determines if an entire string matches the specified pattern
registerPattern (const std::string & name, const std::string & pattern, const unsigned long mode = 0) | Registers a pattern under a specific name for use in later compilations.
replace (const std::string & pattern, const std::string & replace, const std::string & str, const unsigned long mode = 0) | Searches through replace and replaces all substrings matched by pattern with str.
split (const std::string & pattern, const std::string & str, const bool keepEmptys = 0, const unsigned long limit = 0, const unsigned long mode = 0) | Splits the specified string over occurrences of the specified pattern.
unregisterPatterns () | Clears the pattern registry
~Pattern () | Deletes all NFA nodes allocated during compilation
compiledPatterns | Holds all the compiled patterns for quick access.
curInd | Used during compilation to keep track of the current index into pattern
error | Flag used during compilation.
flags | The flags specified when this was compiled
groupCount | The number of capture groups this contains
head | The front node of the NFA
matcher | Used when methods like split are called.
nodes | Holds all the NFA nodes used.
nonCapGroupCount | The number of non-capture groups this contains
pattern | The actual regular expression we represent
registeredPatterns | Holds all of the registered patterns as strings.
classCreateRange (char low, char hi) const | Creates a new "class" representing the range from low thru hi.
classIntersect (std::string s1, std::string s2) const | Calculates the intersection of two strings.
classNegate (std::string s1) const | Calculates the negation of a string.
classUnion (std::string s1, std::string s2) const | Calculates the union of two strings.
getInt (int start, int end) | Extracts a decimal number from the substring of member-variable pattern
parse (const bool inParen = 0, const bool inOr = 0, NFANode** end = NULL) | Parses pattern.
parseBackref () | Returns a new node representing the back reference being parsed
parseBehind (const bool pos, NFANode** end) | Parses a lookbehind expression.
parseClass () | Parses the current class being examined in pattern.
parseEscape (bool & inv, bool & quo) | Parses the escape sequence currently being examined.
parseHex () | Returns a string containing the hex character being parsed
parseOctal () | Returns a string containing the octal character being parsed
parsePosix () | Parses the current POSIX class being examined in pattern.
parseQuote () | Parses the current expression and tacks on nodes until a \E is found.
parseRegisteredPattern (NFANode** end) | Parses a supposed registered pattern currently under compilation.
quantify (NFANode* newNode) | Tries to quantify the last parsed expression.
quantifyCurly (int & sNum, int & eNum) | Parses a {n,m} string out of the member-variable pattern
quantifyGroup (NFANode* start, NFANode* stop, const int gn) | Tries to quantify the currently parsed group.
raiseError () | Raises an error during compilation.
registerNode (NFANode* node) | Convenience function for registering a node in nodes.
This Pattern class is very similar in functionality to Java's java.util.regex.Pattern class. The Pattern class represents an immutable regular expression object. Rather than having a single object hold both the regular expression and the matching state, the two are split apart: the Matcher class represents the matching object.
The Pattern class works primarily off of "compiled" patterns. A typical instantiation of a regular expression looks like:
Pattern * p = Pattern::compile("a*b");
Matcher * m = p->createMatcher("aaaaaab");
if (m->matches()) ...
However, if you do not need to use a pattern more than once, it is often fine to use Pattern's static methods instead. An example looks like this:
if (Pattern::matches("a*b", "aaaab")) { ... }
This class does not currently support Unicode. The Unicode update for this class is coming soon.
This class is partially immutable. It is completely safe to call createMatcher concurrently in different threads, but the other functions (e.g. split) should not be called concurrently on the same Pattern.
Construct        Matches

Characters
x                The character x
\\               The character \
\0nn             The character with octal ASCII value nn
\0nnn            The character with octal ASCII value nnn
\xhh             The character with hexadecimal ASCII value hh
\t               A tab character
\r               A carriage return character
\n               A new-line character

Character Classes
[abc]            Either a, b, or c
[^abc]           Any character but a, b, or c
[a-zA-Z]         Any character ranging from a thru z, or A thru Z
[^a-zA-Z]        Any character except those ranging from a thru z, or A thru Z
[a\-z]           Either a, -, or z
[a-z[A-Z]]       Same as [a-zA-Z]
[a-z&&[g-i]]     Any character in the intersection of a-z and g-i
[a-z&&[^g-i]]    Any character in a-z and not in g-i

Predefined character classes
.                Any character. Multiline matching must be compiled into the pattern for . to match a \r or a \n. Even if multiline matching is enabled, . will not match a \r\n, only a \r or a \n.
\d               [0-9]
\D               [^\d]
\s               [ \t\r\n\x0B]
\S               [^\s]
\w               [a-zA-Z0-9_]
\W               [^\w]

POSIX character classes
\p{Lower}        [a-z]
\p{Upper}        [A-Z]
\p{ASCII}        [\x00-\x7F]
\p{Alpha}        [a-zA-Z]
\p{Digit}        [0-9]
\p{Alnum}        [\w&&[^_]]
\p{Punct}        [!"#$%&'()*+,-./:;<=>?@[\]^_`{|}~]
\p{XDigit}       [a-fA-F0-9]

Boundary Matches
^                The beginning of a line. Also matches the beginning of input.
$                The end of a line. Also matches the end of input.
\b               A word boundary
\B               A non word boundary
\A               The beginning of input
\G               The end of the previous match. Ensures that a "next" match will only happen if it begins with the character immediately following the end of the "current" match.
\Z               The end of input. Will also match if there is a single trailing \r\n, a single trailing \r, or a single trailing \n.
\z               The end of input

Greedy Quantifiers
x?               x, either zero times or one time
x*               x, zero or more times
x+               x, one or more times
x{n}             x, exactly n times
x{n,}            x, at least n times
x{,m}            x, at most m times
x{n,m}           x, at least n times and at most m times

Possessive Quantifiers
x?+              x, either zero times or one time
x*+              x, zero or more times
x++              x, one or more times
x{n}+            x, exactly n times
x{n,}+           x, at least n times
x{,m}+           x, at most m times
x{n,m}+          x, at least n times and at most m times

Reluctant Quantifiers
x??              x, either zero times or one time
x*?              x, zero or more times
x+?              x, one or more times
x{n}?            x, exactly n times
x{n,}?           x, at least n times
x{,m}?           x, at most m times
x{n,m}?          x, at least n times and at most m times

Operators
xy               x then y
x|y              x or y
(x)              x as a capturing group

Quoting
\Q               Nothing, but treat every character (including \s) literally until a matching \E
\E               Nothing, but ends its matching \Q

Special Constructs
(?:x)            x, but not as a capturing group
(?=x)            x, via positive lookahead. This means that the expression will match only if it is trailed by x. It will not "eat" any of the characters matched by x.
(?!x)            x, via negative lookahead. This means that the expression will match only if it is not trailed by x. It will not "eat" any of the characters matched by x.
(?<=x)           x, via positive lookbehind. x cannot contain any quantifiers.
(?<!x)           x, via negative lookbehind. x cannot contain any quantifiers.
(?>x)            x{1}+

Registered Expression Matching
{x}              The registered pattern x
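The difference between the greedy and reluctant quantifier families in the table can be seen in a small sketch. It uses std::regex from the C++ standard library as a stand-in, not this Pattern class, and std::regex (ECMAScript grammar) has no possessive quantifiers, so only two of the three families are shown. The helper names firstMatch and quantifierDemo are hypothetical.

```cpp
#include <cassert>
#include <regex>
#include <string>

// Returns the text of the first match of `pattern` in `input`,
// or an empty string when nothing matches.
// ASSUMPTION: std::regex stands in for the Pattern class here.
std::string firstMatch(const std::string& pattern, const std::string& input) {
    std::regex re(pattern);
    std::smatch m;
    return std::regex_search(input, m, re) ? m.str(0) : std::string();
}

// Greedy quantifiers take as much as they can; reluctant ones as little.
void quantifierDemo() {
    assert(firstMatch("a.*b", "aXbXb") == "aXbXb");  // greedy x*
    assert(firstMatch("a.*?b", "aXbXb") == "aXb");   // reluctant x*?
}
```

A possessive quantifier would behave like the greedy one but never give characters back, so a pattern such as a.*+b could not match at all on this input.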
Begin Text Extracted And Modified From java.util.regex.Pattern documentation
Backslashes, escapes, and quoting
The backslash character ('\') serves to introduce escaped constructs, as defined in the table above, as well as to quote characters that otherwise would be interpreted as unescaped constructs. Thus the expression \\ matches a single backslash and \{ matches a left brace.
It is an error to use a backslash prior to any alphabetic character that does not denote an escaped construct; these are reserved for future extensions to the regular-expression language. A backslash may be used prior to a non-alphabetic character regardless of whether that character is part of an unescaped construct.
It is necessary to double backslashes in string literals that represent regular expressions to protect them from interpretation by a compiler. The string literal "\b", for example, matches a single backspace character when interpreted as a regular expression, while "\\b" matches a word boundary. The string literal "\(hello\)" is illegal and leads to a compile-time error; in order to match the string (hello) the string literal "\\(hello\\)" must be used.
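The effect of doubling can be checked directly on C++ string literals. The sketch below only inspects the characters the compiler produces; it does not call this class. The variable and function names are illustrative, and the raw string literal form assumes C++11 or later.

```cpp
#include <cassert>
#include <string>

// "\b" is a one-character string holding a backspace (0x08).
const std::string backspace = "\b";
// "\\b" is two characters: a backslash followed by 'b'.
const std::string wordBoundary = "\\b";
// A raw string literal passes backslashes through untouched.
const std::string wordBoundaryRaw = R"(\b)";
const std::string literalParens = R"(\(hello\))";

// Verifies what each literal actually contains.
void checkLiterals() {
    assert(backspace.size() == 1 && backspace[0] == '\x08');
    assert(wordBoundary.size() == 2 && wordBoundary[0] == '\\');
    assert(wordBoundary == wordBoundaryRaw);
    assert(literalParens == "\\(hello\\)");
}
```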
Character Classes
Character classes may appear within other character classes, and may be composed by the union operator (implicit) and the intersection operator (&&). The union operator denotes a class that contains every character that is in at least one of its operand classes. The intersection operator denotes a class that contains every character that is in both of its operand classes.
The precedence of character-class operators is as follows, from highest to lowest:
1 Literal escape \x
2 Range a-z
3 Grouping [...]
4 Intersection [a-z&&[aeiou]]
5 Union [a-e][i-u]
Note that a different set of metacharacters is in effect inside a character class than outside a character class. For instance, the regular expression . loses its special meaning inside a character class, while the expression - becomes a range-forming metacharacter.
Groups and capturing
Capturing groups are numbered by counting their opening parentheses from left to right. In the expression ((A)(B(C))), for example, there are four such groups:
1 ((A)(B(C)))
2 (A)
3 (B(C))
4 (C)
Group zero always stands for the entire expression.
Capturing groups are so named because, during a match, each subsequence of the input sequence that matches such a group is saved. The captured subsequence may be used later in the expression, via a back reference, and may also be retrieved from the matcher once the match operation is complete.
The captured input associated with a group is always the subsequence that the group most recently matched. If a group is evaluated a second time because of quantification then its previously-captured value, if any, will be retained if the second evaluation fails. Matching the string "aba" against the expression (a(b)?)+, for example, leaves group two set to "b". All captured input is discarded at the beginning of each match.
Groups beginning with (? are pure, non-capturing groups that do not capture text and do not count towards the group total.
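The numbering rule (count opening parentheses from left to right, skipping groups that begin with (?) can be sketched as a small scanner. This is an illustration of the rule only, not this class's parser. The function name capturingGroups is hypothetical, and the sketch assumes balanced parentheses and ignores character classes and \Q...\E quoting.

```cpp
#include <cassert>
#include <string>
#include <vector>

// Lists capturing groups in the order of their opening parentheses.
// Element 0 of the result is group 1, element 1 is group 2, and so on.
// ASSUMPTIONS: a backslash escapes exactly one character, any group that
// starts with "(?" is non-capturing, and parentheses are balanced.
std::vector<std::string> capturingGroups(const std::string& pattern) {
    std::vector<std::string> groups;
    for (std::size_t i = 0; i < pattern.size(); ++i) {
        if (pattern[i] == '\\') { ++i; continue; }  // skip escaped character
        if (pattern[i] != '(') continue;
        if (i + 1 < pattern.size() && pattern[i + 1] == '?') continue;
        // Walk forward to the parenthesis that closes this group.
        int depth = 0;
        std::size_t j = i;
        for (; j < pattern.size(); ++j) {
            if (pattern[j] == '\\') { ++j; continue; }
            if (pattern[j] == '(') ++depth;
            else if (pattern[j] == ')' && --depth == 0) break;
        }
        assert(j < pattern.size() && "unbalanced parentheses");
        groups.push_back(pattern.substr(i, j - i + 1));
    }
    return groups;
}
```

Calling capturingGroups("((A)(B(C)))") yields the four groups listed above, in the same order.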
Unicode support
Coming Soon.
Comparison to Perl 5
The Pattern engine performs traditional NFA-based matching with ordered alternation as occurs in Perl 5.
Perl constructs not supported by this class:
The conditional constructs (?{X}) and (?(condition)X|Y),
The embedded code constructs (?{code}) and (??{code}),
The embedded comment syntax (?#comment),
The preprocessing operations \l, \u, \L, and \U, and
Embedded flags.
Constructs supported by this class but not by Perl:
Possessive quantifiers, which greedily match as much as they can and do not back off, even when doing so would allow the overall match to succeed.
Character-class union and intersection as described above.
Notable differences from Perl:
In Perl, \1 through \9 are always interpreted as back references; a backslash-escaped number greater than 9 is treated as a back reference if at least that many subexpressions exist, otherwise it is interpreted, if possible, as an octal escape. In this class octal escapes must always begin with a zero. In this class, \1 through \9 are always interpreted as back references, and a larger number is accepted as a back reference if at least that many subexpressions exist at that point in the regular expression, otherwise the parser will drop digits until the number is smaller or equal to the existing number of groups or it is one digit.
Perl uses the g flag to request a match that resumes where the last match left off. This functionality is provided implicitly by the Matcher class: repeated invocations of the find method will resume where the last match left off, unless the matcher is reset.
Perl is forgiving about malformed matching constructs, as in the expression *a, as well as dangling brackets, as in the expression abc], and treats them as literals. This class is strict and will not compile a pattern when dangling characters are encountered.
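The digit-dropping rule for multi-digit back references in the list above can be sketched as follows. This is one reading of the prose, not code taken from this class. The function name resolveBackref is hypothetical, and the sketch assumes the digit string is short enough for std::stoi and does not start with 0 (which would be an octal escape).

```cpp
#include <cassert>
#include <string>
#include <utility>

// Given the digits that follow a backslash and the number of capture
// groups opened so far, returns {group number, digits consumed}.
// ASSUMPTION: an interpretation of the documented rule, for illustration.
std::pair<int, std::size_t> resolveBackref(const std::string& digits,
                                           int groupsSoFar) {
    assert(!digits.empty() && digits[0] != '0');
    std::size_t len = digits.size();
    int value = std::stoi(digits);
    // Drop trailing digits while the number exceeds the group count
    // and more than one digit remains.
    while (len > 1 && value > groupsSoFar) {
        --len;
        value /= 10;
    }
    return {value, len};
}
```

With five groups defined, the text \12 resolves to group 1 followed by a literal 2; with twelve groups it resolves to group 12.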
For a more precise description of the behavior of regular expression constructs, please see Mastering Regular Expressions, 2nd Edition, Jeffrey E. F. Friedl, O'Reilly and Associates, 2002.
End Text Extracted And Modified From java.util.regex.Pattern documentation
error
-
Flag used during compilation. Once the pattern is successfully
compiled, error
is no longer used.
int groupCount
-
The number of capture groups this contains
int nonCapGroupCount
-
The number of non-capture groups this contains
unsigned long flags
-
The flags specified when this was compiled
void raiseError()
-
Raises an error during compilation. Compilation will cease at that point
and compile will return
NULL
.
NFANode* registerNode(NFANode* node)
-
Convenience function for registering a node in
nodes
.
- Parameters:
- node - The node to register
- Returns:
- The registered node
std::string classUnion(std::string s1, std::string s2) const
-
Calculates the union of two strings. This function will first sort the
strings and then use a simple selection algorithm to find the union.
- Parameters:
- s1 - The first "class" to union
s2 - The second "class" to union
- Returns:
- A new string containing all unique characters. Each character
must have appeared in one or both of
s1
and
s2
.
std::string classIntersect(std::string s1, std::string s2) const
-
Calculates the intersection of two strings. This function will first sort
the strings and then use a simple selection algorithm to find the
intersection.
- Parameters:
- s1 - The first "class" to intersect
s2 - The second "class" to intersect
- Returns:
- A new string containing all unique characters. Each character
must have appeared in both
s1
and s2
.
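The sort-then-select approach described for classUnion and classIntersect can be sketched with standard algorithms. This is not the class's own code: std::set_union and std::set_intersection are swapped in for the "simple selection algorithm", and the names normalized, classUnionSketch and classIntersectSketch are hypothetical.

```cpp
#include <algorithm>
#include <cassert>
#include <iterator>
#include <string>

// Sorts a "class" string and removes duplicate characters.
std::string normalized(std::string s) {
    std::sort(s.begin(), s.end());
    s.erase(std::unique(s.begin(), s.end()), s.end());
    return s;
}

// Every character that appears in s1, in s2, or in both, once each.
std::string classUnionSketch(std::string s1, std::string s2) {
    s1 = normalized(s1);
    s2 = normalized(s2);
    std::string out;
    std::set_union(s1.begin(), s1.end(), s2.begin(), s2.end(),
                   std::back_inserter(out));
    assert(std::is_sorted(out.begin(), out.end()));
    return out;
}

// Every character that appears in both s1 and s2, once each.
std::string classIntersectSketch(std::string s1, std::string s2) {
    s1 = normalized(s1);
    s2 = normalized(s2);
    std::string out;
    std::set_intersection(s1.begin(), s1.end(), s2.begin(), s2.end(),
                          std::back_inserter(out));
    assert(std::is_sorted(out.begin(), out.end()));
    return out;
}
```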
std::string classNegate(std::string s1) const
-
Calculates the negation of a string. The negation is the set of all
characters between
\x00
and \xFF
not
contained in s1
.
- Parameters:
- s1 - The "class" to be negated.
- Returns:
- A new string containing all unique characters between
\x00
and \xFF
that are not contained in s1
.
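Negation over the \x00 to \xFF range can be sketched with a 256-entry presence table. This is an illustration under the stated range, not the class's implementation; the name classNegateSketch is hypothetical.

```cpp
#include <array>
#include <cassert>
#include <string>

// Returns every byte value from \x00 through \xFF that is absent from s1.
std::string classNegateSketch(const std::string& s1) {
    std::array<bool, 256> present{};  // value-initialized: all false
    for (unsigned char c : s1) present[c] = true;

    std::string out;
    std::size_t excluded = 0;
    for (int c = 0x00; c <= 0xFF; ++c) {
        if (present[static_cast<std::size_t>(c)]) ++excluded;
        else out.push_back(static_cast<char>(c));
    }
    // Every byte value is either excluded or emitted exactly once.
    assert(out.size() + excluded == 256);
    return out;
}
```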
std::string classCreateRange(char low, char hi) const
-
Creates a new "class" representing the range from
low
thru
hi
. This function will wrap if low
>
hi
. This is a feature, not a bug. Sometimes it is useful
to be able to say [\x70-\x10] instead of [\x70-\x7F\x00-\x10].
- Parameters:
- low - The beginning character
hi - The ending character
- Returns:
- A new string containing all the characters from low thru hi.
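The wrapping behavior can be sketched as below. The wrap point is an assumption: the example in the text equates [\x70-\x10] with [\x70-\x7F\x00-\x10], so this sketch wraps from \x7F back to \x00 and accepts 7-bit values only. The real function may wrap over a wider range. The names kMaxChar and classCreateRangeSketch are hypothetical.

```cpp
#include <cassert>
#include <string>

// ASSUMPTION: the top of the range is \x7F, inferred from the example
// in the text; the real implementation may differ.
const int kMaxChar = 0x7F;

// Builds the characters from low thru hi, wrapping past kMaxChar to \x00
// when low > hi.
std::string classCreateRangeSketch(char low, char hi) {
    const int lo = static_cast<unsigned char>(low);
    const int top = static_cast<unsigned char>(hi);
    assert(lo <= kMaxChar && top <= kMaxChar);  // 7-bit input only
    std::string out;
    int c = lo;
    while (true) {
        out.push_back(static_cast<char>(c));
        if (c == top) break;
        c = (c == kMaxChar) ? 0 : c + 1;  // wrap around at the top
    }
    return out;
}
```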
int getInt(int start, int end)
-
Extracts a decimal number from the substring of member-variable
pattern starting at start
and
ending at end
.
- Parameters:
- start - The starting index in
pattern
end - The last index in pattern
- Returns:
- The decimal number in
pattern
bool quantifyCurly(int & sNum, int & eNum)
-
Parses a
{n,m}
string out of the member-variable
pattern and stores the result in sNum
and eNum
.
- Parameters:
- sNum - Output parameter. The minimum number of matches required
by the curly quantifier is stored here.
eNum - Output parameter. The maximum number of matches allowed
by the curly quantifier is stored here.
- Returns:
- Success/Failure. Fails when the curly does not have the proper
syntax
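A minimal sketch of parsing the four curly forms from the syntax table ({n}, {n,}, {,m} and {n,m}) follows. It is a free function over a string rather than the member function documented here, and the name parseCurlySketch is hypothetical. The limits 0 and 0x7FFFFFFF are the documented MIN_QMATCH and MAX_QMATCH values. Rejecting {}, {,} and n > m is an assumption about what "proper syntax" means, and integer overflow is not checked.

```cpp
#include <cassert>
#include <cctype>
#include <string>

const int MIN_QMATCH = 0;           // documented lower bound
const int MAX_QMATCH = 0x7FFFFFFF;  // documented upper bound

// Parses a curly quantifier starting at text[pos]. On success stores the
// bounds in sNum/eNum, moves pos past the closing brace and returns true.
// On failure nothing is modified.
bool parseCurlySketch(const std::string& text, std::size_t& pos,
                      int& sNum, int& eNum) {
    assert(pos <= text.size());
    std::size_t i = pos;
    if (i >= text.size() || text[i] != '{') return false;
    ++i;
    // Reads a run of decimal digits; `found` says whether any were present.
    auto readInt = [&](bool& found) {
        int value = 0;
        found = false;
        while (i < text.size() &&
               std::isdigit(static_cast<unsigned char>(text[i]))) {
            value = value * 10 + (text[i] - '0');
            ++i;
            found = true;
        }
        return value;
    };
    bool hasMin = false, hasMax = false;
    int lo = readInt(hasMin);
    int hi = lo;  // the {n} form means exactly n
    const bool comma = (i < text.size() && text[i] == ',');
    if (comma) {
        ++i;
        hi = readInt(hasMax);
        if (!hasMin) lo = MIN_QMATCH;  // {,m}
        if (!hasMax) hi = MAX_QMATCH;  // {n,}
    }
    if (i >= text.size() || text[i] != '}') return false;
    if (!hasMin && !hasMax) return false;  // assumption: reject {} and {,}
    if (lo > hi) return false;             // assumption: reject n > m
    sNum = lo;
    eNum = hi;
    pos = i + 1;
    return true;
}
```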
NFANode* quantifyGroup(NFANode* start, NFANode* stop, const int gn)
-
Tries to quantify the currently parsed group. If the group being parsed
is indeed quantified in the member-variable
pattern, then the NFA is modified accordingly.
- Parameters:
- start - The starting node of the current group being parsed
stop - The ending node of the current group being parsed
gn - The group number of the current group being parsed
- Returns:
- The node representing the starting node of the group. If the
group becomes quantified, then this node is not necessarily
a GroupHead node.
NFANode* quantify(NFANode* newNode)
-
Tries to quantify the last parsed expression. If the character was indeed
quantified, then the NFA is modified accordingly.
- Parameters:
- newNode - The recently created expression node
- Returns:
- The node representing the last parsed expression. If the
expression was quantified,
return value != newNode
std::string parseClass()
-
Parses the current class being examined in
pattern
.
- Returns:
- A string of unique characters contained in the current class being
parsed
std::string parsePosix()
-
Parses the current POSIX class being examined in
pattern
.
- Returns:
- A string of unique characters representing the POSIX class being
parsed
std::string parseOctal()
-
Returns a string containing the octal character being parsed
- Returns:
- The string containing the octal value being parsed
std::string parseHex()
-
Returns a string containing the hex character being parsed
- Returns:
- The string containing the hex value being parsed
NFANode* parseBackref()
-
Returns a new node representing the back reference being parsed
- Returns:
- The new node representing the back reference being parsed
std::string parseEscape(bool & inv, bool & quo)
-
Parses the escape sequence currently being examined. Determines if the
escape sequence is a class, a single character, or the beginning of a
quotation sequence.
- Parameters:
- inv - Output parameter. Whether or not to invert the returned class
quo - Output parameter. Whether or not this sequence starts a
quotation.
- Returns:
- The characters represented by the class
NFANode* parseRegisteredPattern(NFANode** end)
-
Parses a supposed registered pattern currently under compilation. If the
sequence of characters does point to a registered pattern, then the
registered pattern is appended to
*end. The registered pattern
is parsed with the current compilation flags.
- Parameters:
- end - The ending node of the thus-far compiled pattern
- Returns:
- The new end node of the current pattern
NFANode* parseBehind(const bool pos, NFANode** end)
-
Parses a lookbehind expression. Appends the necessary nodes to
*end
.
- Parameters:
- pos - Positive or negative look behind
end - The ending node of the current pattern
- Returns:
- The new end node of the current pattern
NFANode* parseQuote()
-
Parses the current expression and tacks on nodes until a \E is found.
- Returns:
- The end of the current pattern
NFANode* parse(const bool inParen = 0, const bool inOr = 0, NFANode** end = NULL)
-
Parses
pattern
. This function is called
recursively when an or (|
) or a group is encountered.
- Parameters:
- inParen - Are we currently parsing inside a group
inOr - Are we currently parsing one side of an or (|
)
end - The end of the current expression
- Returns:
- The starting node of the NFA constructed from this parse
static const unsigned long CASE_INSENSITIVE
- We should match regardless of case
static const unsigned long LITERAL
- We are implicitly quoted
static const unsigned long DOT_MATCHES_ALL
- We should treat a
.
as [\x00-\x7F]
static const unsigned long MULTILINE_MATCHING
^
and $
should anchor to the beginning and
ending of lines, not all input
static const unsigned long UNIX_LINE_MODE
- When enabled, only instances of
\n are recognized as
line terminators
static const int MIN_QMATCH
- The absolute minimum number of matches a quantifier can match (0)
static const int MAX_QMATCH
- The absolute maximum number of matches a quantifier can match (0x7FFFFFFF)
static Pattern* compile(const std::string & pattern, const unsigned long mode = 0)
-
Call this function to compile a regular expression into a
Pattern
object. Special values can be assigned to
mode
when certain non-standard behaviors are expected from
the Pattern
object.
- Parameters:
- pattern - The regular expression to compile
mode - A bitwise or of flags signalling what special behaviors are
wanted from this Pattern
object
- Returns:
- If successful,
compile
returns a Pattern
pointer. Upon failure, compile
returns
NULL
static Pattern* compileAndKeep(const std::string & pattern, const unsigned long mode = 0)
-
Don't use this function. This function will compile a pattern, and cache
the result. This will eventually be used as an optimization when people
just want to call static methods using the same pattern over and over
instead of first compiling the pattern and then using the compiled
instance for matching.
- Parameters:
- pattern - The regular expression to compile
mode - A bitwise or of flags signalling what special behaviors are
wanted from this Pattern
object
- Returns:
- If successful,
compileAndKeep
returns a
Pattern
pointer. Upon failure, compileAndKeep
returns NULL
.
static std::string replace(const std::string & pattern, const std::string & replace, const std::string & str, const unsigned long mode = 0)
-
Searches through
replace
and replaces all substrings matched
by pattern
with str
. str
may
contain backreferences (e.g. \1
) to capture groups. A typical
invocation looks like:
Pattern::replace("(a+)b(c+)", "abcccbbabcbabc", "\\2b\\1");
which would replace abcccbbabcbabc
with
cccbabbcbabcba
.
- Parameters:
- pattern - The regular expression
replace - The string in which to perform replacements
str - The replacement text
mode - The special mode requested of the Pattern
during the replacement process
- Returns:
- The text with the replacement string substituted where necessary
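How backreferences in the replacement text get expanded for a single match can be sketched as below. This covers only the substitution step, given groups that some matcher has already captured; it is not the class's code. The name expandReplacement is hypothetical, and the handling of multi-digit references, missing groups and other escapes is assumed.

```cpp
#include <cassert>
#include <cctype>
#include <string>
#include <vector>

// Expands \N references in `replacement`. groups[0] is the whole match,
// groups[1] the first capture group, and so on.
// ASSUMPTIONS: references are a single digit, a reference to a missing
// group expands to nothing, and a backslash before any other character
// yields that character literally.
std::string expandReplacement(const std::string& replacement,
                              const std::vector<std::string>& groups) {
    assert(!groups.empty());  // group zero must always be present
    std::string out;
    for (std::size_t i = 0; i < replacement.size(); ++i) {
        const char c = replacement[i];
        if (c == '\\' && i + 1 < replacement.size()) {
            const char next = replacement[++i];
            if (std::isdigit(static_cast<unsigned char>(next))) {
                const std::size_t n = static_cast<std::size_t>(next - '0');
                if (n < groups.size()) out += groups[n];
            } else {
                out.push_back(next);
            }
        } else {
            out.push_back(c);
        }
    }
    return out;
}
```

For the invocation shown above, the first match abccc has groups a and ccc, so \2b\1 expands to cccba; each later match abc expands to cba, which gives cccbabbcbabcba overall.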
static std::vector<std::string> split(const std::string & pattern, const std::string & str, const bool keepEmptys = 0, const unsigned long limit = 0, const unsigned long mode = 0)
-
Splits the specified string over occurrences of the specified pattern.
Empty strings can be optionally ignored. The number of strings returned is
configurable. A typical invocation looks like:
std::string str(strSize, '\0');
FILE * fp = fopen(fileName, "r");
fread((char*)str.data(), strSize, 1, fp);
fclose(fp);
std::vector<std::string> lines = Pattern::split("[\r\n]+", str, true);
- Parameters:
- pattern - The regular expression
str - The string to split
keepEmptys - Whether or not to keep empty strings
limit - The maximum number of splits to make
mode - The special mode requested of the Pattern
during the split process
- Returns:
- All substrings of
str
split across pattern
.
static std::vector<std::string> findAll(const std::string & pattern, const std::string & str, const unsigned long mode = 0)
-
Finds all the instances of the specified pattern within the string. You
should be careful to only pass patterns with a minimum length of one. For
example, the pattern
a*
can be matched by an empty string, so
instead you should pass a+
since at least one character must
be matched. A typical invocation of findAll
looks like:
std::vector<std::string> numbers = Pattern::findAll("\\d+", string);
- Parameters:
- pattern - The pattern for which to search
str - The string to search
mode - The special mode requested of the Pattern
during the find process
- Returns:
- All instances of
pattern
in str
static bool matches(const std::string & pattern, const std::string & str, const unsigned long mode = 0)
-
Determines if an entire string matches the specified pattern
- Parameters:
- pattern - The pattern to match
str - The string to match
mode - The special mode requested of the Pattern
during the matching process
- Returns:
- True if
str
is recognized by pattern
static bool registerPattern(const std::string & name, const std::string & pattern, const unsigned long mode = 0)
-
Registers a pattern under a specific name for use in later compilations.
A typical invocation and later use looks like:
Pattern::registerPattern("ip", "(?:\\d{1,3}\\.){3}\\d{1,3}");
Pattern * p1 = Pattern::compile("{ip}:\\d+");
Pattern * p2 = Pattern::compile("Connection from ({ip}) on port \\d+");
Multiple calls to registerPattern
with the same
name
will result in the pattern getting overwritten.
- Parameters:
- name - The name to give to the pattern
pattern - The pattern to register
mode - Any special flags to use when compiling pattern
- Returns:
- Success/Failure. Fails only if
pattern
has invalid
syntax
static void unregisterPatterns()
-
Clears the pattern registry
static void clearPatternCache()
-
Don't use
static std::pair<std::string, int> findNthMatch(const std::string & pattern, const std::string & str, const int matchNum, const unsigned long mode = 0)
-
Searches through a string for the
nth
match of the
given pattern in the string. Match indices start at zero, not one.
A typical invocation looks like this:
std::pair<std::string, int> match = Pattern::findNthMatch("\\d{1,3}", "192.168.1.101:22", 1);
printf("%s %i\n", match.first.c_str(), match.second);
Output: 168 4
- Parameters:
- pattern - The pattern for which to search
str - The string to search
matchNum - Which match to find
mode - Any special flags to use during the matching process
- Returns:
- A string and an integer. The string is the string matched. The
integer is the starting location of the matched string in
str
. You can check for success/failure by making sure
that the integer returned is greater than or equal to zero.
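The zero-based indexing and the success check can be tried out with a stand-in built on std::regex. This is not this class's implementation, and std::regex syntax differs from the syntax documented here in places. The name findNthMatchSketch is hypothetical, and returning -1 on failure is an assumption consistent with the "greater than or equal to zero" check.

```cpp
#include <cassert>
#include <iterator>
#include <regex>
#include <string>
#include <utility>

// Returns the matchNum-th (zero-based) match and its starting offset,
// or {"", -1} when there are not enough matches.
// ASSUMPTION: std::regex stands in for the Pattern class here.
std::pair<std::string, int> findNthMatchSketch(const std::string& pattern,
                                               const std::string& str,
                                               int matchNum) {
    assert(matchNum >= 0);  // a negative index is treated as a caller error
    const std::regex re(pattern);
    int index = 0;
    for (std::sregex_iterator it(str.begin(), str.end(), re), end;
         it != end; ++it, ++index) {
        if (index == matchNum) {
            return {it->str(), static_cast<int>(it->position())};
        }
    }
    return {std::string(), -1};
}
```

With the arguments from the invocation above, the sketch returns the pair ("168", 4), matching the documented output.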
~Pattern()
-
Deletes all NFA nodes allocated during compilation
unsigned long getFlags() const
-
Returns the flags used during compilation of this pattern
- Returns:
- The flags used during compilation of this pattern
std::string getPattern() const
-
Returns the regular expression this pattern represents
- Returns:
- The regular expression this pattern represents
Matcher* createMatcher(const std::string & str)
-
Creates a matcher object using the specified string and this pattern.
- Parameters:
- str - The string to match against
- Returns:
- A new matcher object using this pattern and the specified
string