Module xmlre

Support for regular expressions conformant to the XML Schema specification.

For the most part, XML regular expressions are similar to the POSIX ones and can be handled by the Python re module. The exceptions are the multi-character escapes (e.g., \w), the category escapes (e.g., \p{N} or \p{IPAExtensions}), and character set subtraction. This module supports those by scanning the regular expression and replacing the category escapes with equivalent charset expressions. It also detects the subtraction syntax and modifies the charset expression to remove the unwanted code points.
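As a rough illustration of what an "equivalent charset expression" can look like, the sketch below renders a set of code points as a Python character class. It is not the module's implementation: the names ranges, esc and as_charset are hypothetical, and a plain Python set stands in for pyxb.utils.unicode.CodePointSet. Subtraction is modeled as ordinary set difference applied before rendering.

```python
import re

def ranges(code_points):
    """Collapse code points into sorted [low, high] runs of consecutive values."""
    runs = []
    for cp in sorted(set(code_points)):
        if runs and cp == runs[-1][1] + 1:
            runs[-1][1] = cp          # extend the current run
        else:
            runs.append([cp, cp])     # start a new run
    return runs

def esc(cp):
    """Render one code point as a \\uXXXX or \\UXXXXXXXX escape understood by re."""
    return '\\u%04X' % cp if cp <= 0xFFFF else '\\U%08X' % cp

def as_charset(code_points):
    """Build a Python character class that matches exactly the given code points."""
    runs = ranges(code_points)
    if not runs:
        raise ValueError('an empty set cannot be written as a character class')
    parts = [esc(lo) if lo == hi else esc(lo) + '-' + esc(hi) for lo, hi in runs]
    return '[' + ''.join(parts) + ']'

# Hypothetical stand-in for [a-z-[q]]: subtract, then render.
lower_without_q = set(range(ord('a'), ord('z') + 1)) - {ord('q')}
charset = as_charset(lower_without_q)
```

Escaping every endpoint numerically avoids having to special-case characters such as ] or ^ inside the class, at the cost of a less readable pattern.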

The basic technique is to step through the characters of the regular expression, entering a recursive-descent parser when one of the translated constructs is encountered.
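The stepping technique can be sketched in a few lines. This is a toy under stated assumptions, not the module's code: maybe_translate and translate are hypothetical names, and the replacement table covers only \i and \c with ASCII-only approximations (the real escapes cover far more of Unicode). Escapes inside [...] are not handled here.

```python
import re

# Assumption: ASCII-only stand-ins for the XML name-character escapes.
_TOY_ESCAPES = {
    'i': '[A-Za-z_:]',        # approximate "initial name character"
    'c': '[-.0-9A-Za-z_:]',   # approximate "name character"
}

def maybe_translate(text, position):
    """Return (replacement, next_position) if a translated construct starts
    at position, otherwise None."""
    if text[position] == '\\' and position + 1 < len(text):
        replacement = _TOY_ESCAPES.get(text[position + 1])
        if replacement is not None:
            return replacement, position + 2
    return None

def translate(pattern):
    """Step through pattern, copying text and substituting known escapes."""
    out = []
    position = 0
    while position < len(pattern):
        hit = maybe_translate(pattern, position)
        if hit is not None:
            replacement, position = hit
            out.append(replacement)
        elif pattern[position] == '\\' and position + 1 < len(pattern):
            # Copy any other escape as a unit so its second character
            # is not rescanned as the start of something else.
            out.append(pattern[position:position + 2])
            position += 2
        else:
            out.append(pattern[position])
            position += 1
    return ''.join(out)

python_pattern = translate(r'\i\c*')
```

Having the helper return the next offset along with its result lets the outer loop stay ignorant of how long each translated construct is.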

There is a nice set of XML regular expressions at http://www.xmlschemareference.com/examples/Ch14/regexpDemo.xsd, with a sample document at http://www.xmlschemareference.com/examples/Ch14/regexpDemo.xml

Classes

RegularExpressionError
Raised when a regular expression cannot be processed.

Functions

_InitializeAllEsc()
Set the values in _AllEsc without introducing k and v into the module.

_MatchCharClassEsc(text, position)
Parse a charClassEsc term.

_MatchPosCharGroup(text, position)
Parse a posCharGroup term.

_MatchCharClassExpr(text, position)
Parse a charClassExpr.

MaybeMatchCharacterClass(text, position)
Attempt to match a character class expression.

XMLToPython(pattern)
Convert the given pattern to the format required for Python regular expressions.

Variables

_log = <logging.Logger object>

_AllEsc = {u'.': <pyxb.utils.unicode.CodePointSet object>, u'\...

_CharClassEsc_re = re.compile(r'\\(?:(?P<cgProp>[pP]\{(?P<char...

__package__ = 'pyxb.utils'

Function Details

_MatchCharClassEsc(text, position)

Parse a charClassEsc term.

This is one of:

SingleCharEsc, an escaped single character such as \n
MultiCharEsc, an escape code that can match a range of characters, e.g. \s to match certain whitespace characters
catEsc, the \p{...} Unicode property escapes including categories and blocks
complEsc, the \P{...} inverted Unicode property escapes

If parsing fails, raises a RegularExpressionError.

Returns:

A pair (cps, p) where cps is a pyxb.utils.unicode.CodePointSet containing the code points associated with the character class, and p is the text offset immediately following the escape sequence.

Raises:

RegularExpressionError - if the expression is syntactically invalid.
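The four kinds of charClassEsc listed above can be told apart with the pattern shown for _CharClassEsc_re. The sketch below is illustrative only: classify_escape is a hypothetical helper, ValueError stands in for RegularExpressionError, and the function returns the kind and the escape's name rather than a CodePointSet. The single-character and multi-character escape letters are taken from the XML Schema regular expression grammar.

```python
import re

# Same pattern as the documented _CharClassEsc_re value.
CHAR_CLASS_ESC = re.compile(
    r'\\(?:(?P<cgProp>[pP]\{(?P<charProp>[-A-Za-z0-9]+)\})|(?P<cgClass>[^pP]))')

# Characters allowed after the backslash, per the XML Schema grammar.
SINGLE_CHAR_ESC = set('nrt\\|.?*+(){}-[]^')
MULTI_CHAR_ESC = set('sSiIcCdDwW')

def classify_escape(text, position):
    """Return (kind, name, p), where p is the offset just past the escape."""
    match = CHAR_CLASS_ESC.match(text, position)
    if match is None:
        raise ValueError('no character class escape at offset %d' % position)
    if match.group('cgProp') is not None:
        # Lower-case p is a category escape, upper-case P its complement.
        kind = 'catEsc' if match.group('cgProp')[0] == 'p' else 'complEsc'
        return kind, match.group('charProp'), match.end()
    ch = match.group('cgClass')
    if ch in SINGLE_CHAR_ESC:
        return 'SingleCharEsc', ch, match.end()
    if ch in MULTI_CHAR_ESC:
        return 'MultiCharEsc', ch, match.end()
    raise ValueError('unrecognized escape \\%s at offset %d' % (ch, position))

kind, name, p = classify_escape(r'a\p{Lu}b', 1)
```

Returning the offset alongside the result mirrors the (cps, p) convention described above, so a caller can resume scanning at p.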

_MatchPosCharGroup(text, position)

Parse a posCharGroup term.

Returns:

A tuple (cps, fs, p) where:

cps is a pyxb.utils.unicode.CodePointSet containing the code points associated with the group;
fs is a bool that is True if the next character is the - in a charClassSub and False if the group is not part of a charClassSub;
p is the text offset immediately following the closing brace.

Raises:

RegularExpressionError - if the expression is syntactically invalid.

_MatchCharClassExpr(text, position)

Parse a charClassExpr.

These are XML regular expression classes such as [abc], [a-c], [^abc], or [a-z-[q]].

Parameters:

text - The complete text of the regular expression being translated. The character at position must be the [ starting a character class.
position - The offset of the start of the character group.

Returns:

A pair (cps, p) where cps is a pyxb.utils.unicode.CodePointSet containing the code points associated with the character class, and p is the text offset immediately following the closing bracket.

Raises:

RegularExpressionError - if the expression is syntactically invalid.
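To make the four example forms concrete, here is a small recursive-descent sketch that evaluates them to sets of code points. It makes several simplifying assumptions that the real function does not: parse_char_class_expr is a hypothetical name, a Python set replaces CodePointSet, negation is taken relative to ASCII (0 to 127) only, escapes inside the class are not supported, and ValueError stands in for RegularExpressionError.

```python
# Assumption: negation is computed against ASCII only, to keep sets small.
UNIVERSE = frozenset(range(128))

def parse_char_class_expr(text, position):
    """Parse [...] starting at position; return (code_points, p) where p is
    the offset just past the closing ]."""
    if text[position:position + 1] != '[':
        raise ValueError("expected '[' at offset %d" % position)
    position += 1
    negated = text[position:position + 1] == '^'
    if negated:
        position += 1
    group = set()
    subtrahend = set()
    while True:
        ch = text[position:position + 1]
        if ch == '':
            raise ValueError("missing closing ']'")
        if ch == ']':
            position += 1
            break
        if ch == '-' and text[position + 1:position + 2] == '[':
            # Subtraction: recurse on the nested class, then expect our own ].
            subtrahend, position = parse_char_class_expr(text, position + 1)
            if text[position:position + 1] != ']':
                raise ValueError("expected ']' after subtracted class")
            position += 1
            break
        end = text[position + 2:position + 3]
        if text[position + 1:position + 2] == '-' and end not in ('', '[', ']'):
            low, high = ord(ch), ord(end)
            if low > high:
                raise ValueError('range out of order at offset %d' % position)
            group.update(range(low, high + 1))
            position += 3
        else:
            group.add(ord(ch))
            position += 1
    if negated:
        group = set(UNIVERSE) - group
    # Negation applies to the group first; subtraction is applied last.
    return group - subtrahend, position

cps, p = parse_char_class_expr('[a-z-[q]]', 0)
```

The recursion happens only at the -[ marker, which is why a nested class can itself contain a further subtraction.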

MaybeMatchCharacterClass(text, position)

Attempt to match a character class expression.

Parameters:

text - The complete text of the regular expression being translated.
position - The offset of the start of the potential expression.

Returns:

None if position does not begin a character class expression; otherwise a pair (cps, p) where cps is a pyxb.utils.unicode.CodePointSet containing the code points associated with the property, and p is the text offset immediately following the closing brace.

XMLToPython(pattern)

Convert the given pattern to the format required for Python regular expressions.

Parameters:

pattern - A Unicode string defining a pattern consistent with XML regular expressions.

Returns:

A Unicode string specifying a Python regular expression that matches the same language as pattern.

Variables Details

_AllEsc

Value:

{u'.': <pyxb.utils.unicode.CodePointSet object>,
 u'\(': <pyxb.utils.unicode.CodePointSet object>,
 u'\)': <pyxb.utils.unicode.CodePointSet object>,
 u'\*': <pyxb.utils.unicode.CodePointSet object>,
 u'\+': <pyxb.utils.unicode.CodePointSet object>,
 u'\-': <pyxb.utils.unicode.CodePointSet object>,
 u'\.': <pyxb.utils.unicode.CodePointSet object>,
 u'\?': <pyxb.utils.unicode.CodePointSet object>,
...

_CharClassEsc_re

Value:

re.compile(r'\\(?:(?P<cgProp>[pP]\{(?P<charProp>[-A-Za-z0-9]+)\})|(?P<cgClass>[^pP]))')