Module xmlre
Support for regular expressions conformant to the XML Schema
specification.
For the most part, XML regular expressions are similar to the POSIX
ones, and can be handled by the Python re module. The exceptions are
the multi-character escapes (e.g., \w), the category escapes (e.g.,
\p{N} or \p{IPAExtensions}), and the character set subtraction
capability. This module supports those by scanning the regular
expression and replacing the category escapes with equivalent charset
expressions. It further detects the subtraction syntax and modifies the
charset expression to remove the unwanted code points.
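As a rough illustration of the subtraction step, the sketch below treats a character class as a plain Python set of code points, removes the unwanted ones, and renders the result as a charset expression that the re module accepts. The helper name to_charset is hypothetical and is not part of this module, which uses pyxb.utils.unicode.CodePointSet rather than a built-in set.

```python
import re

def to_charset(code_points):
    """Render a non-empty set of code points as a character class such as [a-pr-z]."""
    runs = []
    for cp in sorted(code_points):
        if runs and cp == runs[-1][1] + 1:
            runs[-1][1] = cp        # extend the current run of consecutive code points
        else:
            runs.append([cp, cp])   # start a new run
    parts = []
    for lo, hi in runs:
        if lo == hi:
            parts.append(re.escape(chr(lo)))
        else:
            parts.append(re.escape(chr(lo)) + "-" + re.escape(chr(hi)))
    return "[" + "".join(parts) + "]"

# The XML pattern [a-z-[q]] means "lowercase ASCII letters except q".
base = set(range(ord("a"), ord("z") + 1))
unwanted = {ord("q")}
charset = to_charset(base - unwanted)
print(charset)
```

The resulting expression can be passed directly to re.compile, since it no longer contains the XML-only subtraction syntax.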
The basic technique is to step through the characters of the regular
expression, entering a recursive-descent parser when one of the
translated constructs is encountered.
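The scanning technique can be sketched as a loop that copies ordinary characters through and hands control to a handler when a trigger character is seen. Everything in this example is an assumption made for illustration: translate, handle_escape, and the DEMO_PROPERTIES table are hypothetical names, and AsciiDigit is a made-up property rather than a real Unicode category.

```python
import re

# Matches \p{Name}; the complement form \P{Name} is not handled in this sketch.
PROPERTY_ESCAPE = re.compile(r"\\p\{([-A-Za-z0-9]+)\}")

# Made-up property table for the demo. The real module derives code point
# sets from Unicode data instead of a hand-written table.
DEMO_PROPERTIES = {"AsciiDigit": "[0-9]"}

def handle_escape(text, position):
    """Return (replacement, next_position) for the escape at text[position]."""
    m = PROPERTY_ESCAPE.match(text, position)
    if m is None:
        # Ordinary escape: copy the backslash and the escaped character.
        return text[position:position + 2], position + 2
    return DEMO_PROPERTIES[m.group(1)], m.end()

def translate(pattern, handlers):
    """Step through pattern, delegating to a handler on each trigger character."""
    pieces = []
    position = 0
    while position < len(pattern):
        handler = handlers.get(pattern[position])
        if handler is None:
            pieces.append(pattern[position])
            position += 1
        else:
            replacement, position = handler(pattern, position)
            pieces.append(replacement)
    return "".join(pieces)

result = translate(r"a\p{AsciiDigit}+\.b", {"\\": handle_escape})
print(result)
```

Each handler returns both the replacement text and the offset at which scanning should resume, which is the same shape as the (cps, p) pairs returned by the parsing functions documented below.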
There is a nice set of XML regular expressions at http://www.xmlschemareference.com/examples/Ch14/regexpDemo.xsd,
with a sample document at
http://www.xmlschemareference.com/examples/Ch14/regexpDemo.xml
|
_InitializeAllEsc()
Set the values in _AllEsc without introducing k and v into the module.
|
XMLToPython(pattern)
Convert the given pattern to the format required for Python regular
expressions.
|
_log = logging.getLogger(__name__)
_AllEsc = {}
_CharClassEsc_re = re.compile(r'\\(?:(?P<cgProp>[pP]\{(?P<char...
__package__ = 'pyxb.utils'
|
Parse a charClassEsc term.
This is one of:
- SingleCharEsc, an escaped single character such as \n
- MultiCharEsc, an escape code that can match a range of characters,
  e.g. \s to match certain whitespace characters
- catEsc, the \p{...} Unicode property escapes including categories and
  blocks
- complEsc, the \P{...} inverted Unicode property escapes
If the parsing fails, throws a RegularExpressionError.
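A minimal sketch of how the four kinds of escape could be told apart, using a regular expression of the same shape as the _CharClassEsc_re value listed in this document. The function classify_escape is hypothetical, ValueError stands in for RegularExpressionError, and the sketch returns a label and a name instead of a CodePointSet. It also does not validate which characters are legal in a SingleCharEsc.

```python
import re

# Same shape as the module's _CharClassEsc_re value shown in this document.
CHAR_CLASS_ESC = re.compile(
    r'\\(?:(?P<cgProp>[pP]\{(?P<charProp>[-A-Za-z0-9]+)\})|(?P<cgClass>[^pP]))'
)

# Letters that XML Schema defines as multi-character escapes.
MULTI_CHAR_LETTERS = "sSiIcCdDwW"

def classify_escape(text, position):
    """Return (kind, name, next_position) for the escape at text[position]."""
    m = CHAR_CLASS_ESC.match(text, position)
    if m is None:
        # Stand-in for the module's RegularExpressionError.
        raise ValueError("no character class escape at offset %d" % position)
    if m.group("cgProp") is not None:
        # \p{...} is a catEsc; \P{...} is its complement, a complEsc.
        kind = "catEsc" if m.group("cgProp").startswith("p") else "complEsc"
        return kind, m.group("charProp"), m.end()
    ch = m.group("cgClass")
    kind = "MultiCharEsc" if ch in MULTI_CHAR_LETTERS else "SingleCharEsc"
    return kind, ch, m.end()

print(classify_escape(r"\p{N}", 0))
print(classify_escape(r"\s", 0))
```

The third element of the result plays the role of p in the return value described here: the offset immediately following the escape sequence.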
- Returns:
- A pair (cps, p) where cps is a pyxb.utils.unicode.CodePointSet
  containing the code points associated with the character class, and p
  is the text offset immediately following the escape sequence.
- Raises:
|
Parse a posCharGroup term.
- Returns:
- A tuple (cps, fs, p) where:
  - cps is a pyxb.utils.unicode.CodePointSet containing the code points
    associated with the group;
  - fs is a bool that is True if the next character is the - in a
    charClassSub and False if the group is not part of a charClassSub;
  - p is the text offset immediately following the closing brace.
- Raises:
|
Parse a charClassExpr.
These are XML regular expression classes such as [abc], [a-c], [^abc],
or [a-z-[q]].
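The recursive structure of such expressions can be sketched as follows. This is a simplified stand-in and not the module's implementation: parse_char_class is a hypothetical name, a built-in set replaces CodePointSet, escapes inside the class are not handled, negation is computed against ASCII only, and ValueError stands in for RegularExpressionError.

```python
# Toy universe for negation; the real module works over all of Unicode.
ASCII = set(range(128))

def parse_char_class(text, position):
    """Parse a class such as [a-z-[q]] that starts at text[position].

    Returns (code_points, next_position), where next_position is the offset
    immediately after the closing ']'.
    """
    if text[position:position + 1] != "[":
        raise ValueError("expected '[' at offset %d" % position)
    position += 1
    negate = text.startswith("^", position)
    if negate:
        position += 1
    members = set()
    removed = set()
    while position < len(text) and text[position] != "]":
        if text.startswith("-[", position):
            # Subtraction: recurse into the nested class after the '-'.
            removed, position = parse_char_class(text, position + 1)
            break
        first = ord(text[position])
        is_range = (text.startswith("-", position + 1)
                    and position + 2 < len(text)
                    and text[position + 2] not in "[]")
        if is_range:
            members.update(range(first, ord(text[position + 2]) + 1))
            position += 3
        else:
            members.add(first)
            position += 1
    if not text.startswith("]", position):
        raise ValueError("unterminated character class")
    if negate:
        # Negation applies to the group before the subtraction is taken.
        members = ASCII - members
    return members - removed, position + 1

cps, p = parse_char_class("[a-z-[q]]", 0)
print(sorted(chr(c) for c in cps), p)
```

The nested call for the subtracted class is where the recursive descent happens: the inner parse consumes its own closing bracket and reports where the outer parse should resume.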
- Parameters:
text - The complete text of the regular expression being translated.
The character at position must be the [ starting a character class.
position - The offset of the start of the character group.
- Returns:
- A pair (cps, p) where cps is a pyxb.utils.unicode.CodePointSet
  containing the code points associated with the character class, and p
  is the text offset immediately following the closing bracket.
- Raises:
|
Attempt to match a character class expression.
- Parameters:
text - The complete text of the regular expression being translated
position - The offset of the start of the potential expression.
- Returns:
- None if position does not begin a character class expression;
  otherwise a pair (cps, p) where cps is a
  pyxb.utils.unicode.CodePointSet containing the code points associated
  with the character class, and p is the text offset immediately
  following the closing bracket.
|
Convert the given pattern to the format required for Python regular
expressions.
- Parameters:
- Returns:
- A Unicode string specifying a Python regular expression that matches
  the same language as pattern.
|
_CharClassEsc_re
- Value:
re.compile(r'\\(?:(?P<cgProp>[pP]\{(?P<charProp>[-A-Za-z0-9]+)\})|(?P<cgClass>[^pP]))')