Module xmlre
Support for regular expressions conformant to the XML Schema
specification.
For the most part, XML regular expressions are similar to the POSIX
ones, and can be handled by the Python re module. The exceptions are
multi-character escapes (e.g., \w), category and block escapes
(e.g., \p{N} or \p{IsIPAExtensions}), and the character set subtraction
capability. This module supports those by scanning the regular
expression and replacing the escapes with equivalent charset
expressions. It further detects the subtraction syntax and modifies the
charset expression to remove the unwanted code points.
The basic technique is to step through the characters of the regular
expression, entering a recursive-descent parser when one of the
translated constructs is encountered.
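The scanning technique described above can be sketched roughly as follows. This is a toy illustration, not the module's implementation: the names parse_class and toy_translate are hypothetical, only literal characters, simple ranges, and subtraction are handled, and plain Python sets stand in for code point sets.

```python
import re

def parse_class(text, position):
    """Parse a '[...]' class starting at position.

    Returns (chars, p): the set of matched characters and the offset just
    past the closing bracket. Hypothetical helper for illustration only.
    """
    if position >= len(text) or text[position] != '[':
        raise ValueError('expected "[" at offset %d' % position)
    position += 1
    chars = set()
    while position < len(text) and text[position] != ']':
        if text.startswith('-[', position):
            # Subtraction syntax: recurse into the nested class and remove
            # its members from what has been collected so far.
            removed, position = parse_class(text, position + 1)
            chars -= removed
            continue
        start = text[position]
        if (position + 2 < len(text) and text[position + 1] == '-'
                and text[position + 2] not in '[]'):
            # A range such as a-z.
            end = text[position + 2]
            chars.update(chr(c) for c in range(ord(start), ord(end) + 1))
            position += 3
        else:
            chars.add(start)
            position += 1
    if position >= len(text):
        raise ValueError('missing closing bracket')
    return chars, position + 1

def toy_translate(pattern):
    """Step through pattern, descending into the parser at each '['."""
    out = []
    position = 0
    while position < len(pattern):
        if pattern[position] == '[':
            chars, position = parse_class(pattern, position)
            out.append('[' + ''.join(re.escape(c) for c in sorted(chars)) + ']')
        else:
            out.append(pattern[position])
            position += 1
    return ''.join(out)

print(toy_translate('x[a-e-[bd]]+'))  # x[ace]+
```

Enumerating every member of the class is only practical for tiny sets; a real translator would emit ranges instead.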
There is a nice set of XML regular expressions at http://www.xmlschemareference.com/examples/Ch14/regexpDemo.xsd,
with a sample document at
http://www.xmlschemareference.com/examples/Ch14/regexpDemo.xml
XMLToPython(pattern)
Convert the given pattern to the format required for Python regular
expressions.
_NotXMLChar_set = frozenset(['-', '[', ']'])
The set of characters that cannot appear within a character class
expression unescaped.
__package__ = 'pyxb.utils'
_MatchCharPropBraced(text, position)
Match a character property or multi-character escape identifier, which will be
enclosed in braces.
- Parameters:
text - The complete text of the regular expression being translated
position - The offset of the opening brace of the character property
- Returns:
- A pair (cps, p) where cps is a unicode.CodePointSet containing the
code points associated with the property, and p is the text offset
immediately following the closing brace.
- Raises:
RegularExpressionError - if opening or closing braces are missing, or if the text between
them cannot be recognized as a property or block identifier.
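The (cps, p) return convention in the entry above might look like this in miniature. Everything here is an assumption made for illustration: match_braced_name and _TOY_PROPERTIES are hypothetical names, the error class is a local stand-in, the single table entry covers ASCII digits only, and a plain frozenset replaces unicode.CodePointSet.

```python
class RegularExpressionError(ValueError):
    """Local stand-in for the error type named in the documentation."""

# Hypothetical one-entry table; a real table would cover every Unicode
# category and block, not just ASCII digits.
_TOY_PROPERTIES = {
    'Nd': frozenset('0123456789'),
}

def match_braced_name(text, position):
    """Return (cps, p) for the braced identifier starting at position."""
    if position >= len(text) or text[position] != '{':
        raise RegularExpressionError('missing opening brace at offset %d' % position)
    end = text.find('}', position)
    if end < 0:
        raise RegularExpressionError('missing closing brace')
    name = text[position + 1:end]
    if name not in _TOY_PROPERTIES:
        raise RegularExpressionError('unrecognized property %r' % name)
    # p is the offset immediately following the closing brace.
    return _TOY_PROPERTIES[name], end + 1

cps, p = match_braced_name(r'\p{Nd}+', 2)
print(sorted(cps), p)
```

Returning the next offset alongside the result lets the caller resume scanning without re-examining the text already consumed.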
|
_MaybeMatchCharClassEsc(text, position, include_sce=True)
Attempt to match a character class escape expression.
- Parameters:
text - The complete text of the regular expression being translated
position - The offset of the backslash that would begin the potential
character class escape
include_sce - Optional directive to include single-character escapes in
addition to character class escapes. Default is True.
- Returns:
None if position does not begin a
character class escape; otherwise a pair (cps, p) as
in _MatchCharPropBraced.
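A "maybe" matcher of this kind signals "not applicable" with None rather than an exception, so the caller can fall back to treating the text as something else. The sketch below assumes hypothetical names (maybe_match_escape and the two tables) and tiny ASCII-only tables; it is not the module's code.

```python
# Hypothetical stand-in tables, far smaller than a real implementation needs.
_TOY_CLASS_ESCAPES = {'d': frozenset('0123456789')}
_TOY_SINGLE_ESCAPES = {'n': '\n', 't': '\t', '.': '.'}

def maybe_match_escape(text, position, include_sce=True):
    """Return (cps, p) if an escape begins at position, otherwise None."""
    if position + 1 >= len(text) or text[position] != '\\':
        return None
    following = text[position + 1]
    if following in _TOY_CLASS_ESCAPES:
        return _TOY_CLASS_ESCAPES[following], position + 2
    if include_sce and following in _TOY_SINGLE_ESCAPES:
        # A single-character escape denotes a set with exactly one member.
        return frozenset(_TOY_SINGLE_ESCAPES[following]), position + 2
    return None

print(maybe_match_escape('abc', 0))          # None
print(maybe_match_escape(r'\d+', 0)[1])      # 2
```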
|
Return the single character represented at the given position.
- Parameters:
text - The complete text of the regular expression being translated
position - The offset of the character to return. If this is a backslash,
additional text is consumed in order to identify the single-character escape that begins at the
position.
- Returns:
- A pair (c, p) where c is the Unicode character specified at the
position, and p is the text offset immediately following that
character or escape sequence.
- Raises:
RegularExpressionError - if the position has no character, has a character in _NotXMLChar_set, or
begins an escape sequence that is not resolvable as a
single-character escape.
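The behaviour described in this entry could be sketched as below. The function name char_or_escape and the escape table are hypothetical, ValueError stands in for RegularExpressionError, and the table lists the single-character escapes defined by XML Schema (\n, \r, \t, and the escaped metacharacters).

```python
# Characters that may not appear unescaped inside a character class.
_NOT_XML_CHAR = frozenset(['-', '[', ']'])

# Hypothetical table: escape letter -> the character it denotes.
_SINGLE_CHAR_ESCAPES = {'n': '\n', 'r': '\r', 't': '\t'}
_SINGLE_CHAR_ESCAPES.update((c, c) for c in '\\|.-^?*+{}()[]')

def char_or_escape(text, position):
    """Return (c, p): the character at position and the offset after it."""
    if position >= len(text):
        raise ValueError('no character at offset %d' % position)
    c = text[position]
    if c in _NOT_XML_CHAR:
        raise ValueError('%r must be escaped at offset %d' % (c, position))
    if c != '\\':
        return c, position + 1
    # A backslash consumes one more character to identify the escape.
    if position + 1 >= len(text) or text[position + 1] not in _SINGLE_CHAR_ESCAPES:
        raise ValueError('unresolvable escape at offset %d' % position)
    return _SINGLE_CHAR_ESCAPES[text[position + 1]], position + 2

print(char_or_escape('a\\-b', 0))  # ('a', 1)
print(char_or_escape('a\\-b', 1))  # ('-', 3)
```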
|
Match a positive character group that begins at the given
position.
- Parameters:
text - The complete text of the regular expression being translated
position - The offset of the start of the positive character group.
- Returns:
- A pair (cps, p) as in _MatchCharPropBraced.
- Raises:
|
Match a character group at the given position.
- Parameters:
text - The complete text of the regular expression being translated
position - The offset of the start of the character group.
- Returns:
- A pair (cps, p) as in _MatchCharPropBraced.
- Raises:
|
Match a character class expression at the given position.
- Parameters:
text - The complete text of the regular expression being translated
position - The offset of the start of the character group.
- Returns:
- A pair (cps, p) as in _MatchCharPropBraced.
- Raises:
|
Attempt to match a character class expression.
- Parameters:
text - The complete text of the regular expression being translated
position - The offset of the start of the potential expression.
- Returns:
None if position does not begin a
character class expression; otherwise a pair (cps,
p) as in _MatchCharPropBraced.
|
Convert the given pattern to the format required for Python regular
expressions.
- Parameters:
- Returns:
- A Unicode string specifying a Python regular expression that
matches the same language as pattern.
|