Module xmlre

Support for regular expressions conformant to the XML Schema specification.

For the most part, XML regular expressions are similar to the POSIX ones, and can be handled by the Python re module. The exceptions are multi-character escapes (e.g., \w), category escapes (e.g., \p{N} or \p{IsIPAExtensions}), and the character set subtraction capability. This module supports those by scanning the regular expression and replacing the escapes with equivalent charset expressions. It also detects the subtraction syntax and modifies the charset expression to remove the unwanted code points.
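As an illustration of the subtraction step, the following minimal sketch computes the code points of the XML pattern [a-z-[aeiou]] with plain Python sets and emits an equivalent character class that re accepts. The helper charset_from_codepoints is hypothetical and is not part of this module, which works with unicode.CodePointSet instead.

```python
import re


def charset_from_codepoints(codepoints):
    """Build a Python character class from an iterable of code points.

    Hypothetical helper for illustration only; not part of pyxb.
    """
    ordered = sorted(set(codepoints))
    if not ordered:
        raise ValueError("empty character set")
    # Collapse runs of consecutive code points into (first, last) ranges.
    ranges = []
    start = prev = ordered[0]
    for cp in ordered[1:]:
        if cp != prev + 1:
            ranges.append((start, prev))
            start = cp
        prev = cp
    ranges.append((start, prev))
    parts = []
    for first, last in ranges:
        if first == last:
            parts.append(re.escape(chr(first)))
        else:
            parts.append(re.escape(chr(first)) + "-" + re.escape(chr(last)))
    return "[" + "".join(parts) + "]"


# XML Schema pattern [a-z-[aeiou]]: lowercase ASCII letters minus the vowels.
base = set(range(ord("a"), ord("z") + 1))
removed = {ord(c) for c in "aeiou"}
python_class = charset_from_codepoints(base - removed)
print(python_class)
```

The subtraction itself is ordinary set difference; only the final step has to express the surviving code points in syntax that the re module understands.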

The basic technique is to step through the characters of the regular expression, entering a recursive-descent parser when one of the translated constructs is encountered.
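The stepping technique can be sketched as a loop that copies ordinary characters and hands control to a matcher whenever a translated construct might begin. The names translate and match_escape below are hypothetical, and the toy matcher maps \d to ASCII digits only, which is a simplification of the Unicode digit category used by XML Schema.

```python
def translate(pattern, maybe_match):
    """Copy pattern to the output, delegating to maybe_match at each offset.

    maybe_match(text, position) is a hypothetical callback that returns None
    when nothing special starts at position, or a pair (replacement, p) where
    p is the offset just past the construct that was consumed.
    """
    pieces = []
    position = 0
    while position < len(pattern):
        matched = maybe_match(pattern, position)
        if matched is None:
            # Ordinary character: copy it and advance by one.
            pieces.append(pattern[position])
            position += 1
        else:
            # Translated construct: emit the replacement and jump past it.
            replacement, position = matched
            pieces.append(replacement)
    return "".join(pieces)


def match_escape(text, position):
    """Toy matcher (assumption): rewrites only the \\d escape."""
    if text[position] != "\\":
        return None
    if text.startswith("\\d", position):
        return "[0-9]", position + 2
    # Copy any other two-character escape verbatim so that an escaped
    # backslash followed by 'd' is not misread as \d.
    return text[position:position + 2], position + 2


translated = translate("\\d+-\\d+", match_escape)
print(translated)
```

Returning the next offset together with the result is the same convention as the (cps, p) pairs documented for the functions in this module.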

There is a nice set of XML regular expressions at http://www.xmlschemareference.com/examples/Ch14/regexpDemo.xsd, with a sample document at http://www.xmlschemareference.com/examples/Ch14/regexpDemo.xml

Classes

RegularExpressionError
Raised when a regular expression cannot be processed.

Functions

_MatchCharPropBraced(text, position)
Match a character property or multi-character escape identifier, which will be enclosed in braces.

_MaybeMatchCharClassEsc(text, position, include_sce=True)
Attempt to match a character class escape expression.

_CharOrSCE(text, position)
Return the single character represented at the given position.

_MatchPosCharGroup(text, position)
Match a positive character group that begins at the given position.

_MatchCharGroup(text, position)
Match a character group at the given position.

_MatchCharClassExpr(text, position)
Match a character class expression at the given position.

MaybeMatchCharacterClass(text, position)
Attempt to match a character class expression.

XMLToPython(pattern)
Convert the given pattern to the format required for Python regular expressions.

Variables

_NotXMLChar_set = frozenset(['-', '[', ']'])
The set of characters that cannot appear within a character class expression unescaped.

__package__ = 'pyxb.utils'

Function Details

_MatchCharPropBraced(text, position)

Match a character property or multi-character escape identifier, which will be enclosed in braces.

Parameters:

text - The complete text of the regular expression being translated
position - The offset of the opening brace of the character property

Returns:

A pair (cps, p) where cps is a unicode.CodePointSet containing the code points associated with the property, and p is the text offset immediately following the closing brace.

Raises:

RegularExpressionError - if opening or closing braces are missing, or if the text between them cannot be recognized as a property or block identifier.
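A minimal sketch of this contract follows, assuming a tiny hand-written lookup table in place of the real Unicode property and block data. PROPERTY_TABLE, PatternError, and match_braced_property are hypothetical names used only for illustration, and plain frozensets stand in for unicode.CodePointSet.

```python
# Hypothetical stand-in for the Unicode tables; ASCII subsets only.
PROPERTY_TABLE = {
    "Nd": frozenset(range(ord("0"), ord("9") + 1)),
    "Lu": frozenset(range(ord("A"), ord("Z") + 1)),
}


class PatternError(ValueError):
    """Hypothetical stand-in for RegularExpressionError."""


def match_braced_property(text, position):
    """Return (codepoints, p) for the braced identifier starting at position."""
    if position >= len(text) or text[position] != "{":
        raise PatternError("expected '{' at offset %d" % position)
    close = text.find("}", position + 1)
    if close < 0:
        raise PatternError("missing '}' after offset %d" % position)
    name = text[position + 1:close]
    if name not in PROPERTY_TABLE:
        raise PatternError("unrecognized property %r" % name)
    # p is the offset immediately following the closing brace.
    return PROPERTY_TABLE[name], close + 1


text = "\\p{Nd}+"
codepoints, p = match_braced_property(text, 2)  # offset 2 is the '{'
print(sorted(chr(cp) for cp in codepoints), p)
```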

_MaybeMatchCharClassEsc(text, position, include_sce=True)

Attempt to match a character class escape expression.

Parameters:

text - The complete text of the regular expression being translated
position - The offset of the backslash that would begin the potential character class escape
include_sce - Optional directive to include single-character escapes in addition to character class escapes. Default is True.

Returns:

None if position does not begin a character class escape; otherwise a pair (cps, p) as in _MatchCharPropBraced.

_CharOrSCE(text, position)

Return the single character represented at the given position.

Parameters:

text - The complete text of the regular expression being translated
position - The offset of the character to return. If this is a backslash, additional text is consumed in order to identify the single-character escape that begins at the position.

Returns:

A pair (c, p) where c is the Unicode character specified at the position, and p is the text offset immediately following the character or escape sequence.

Raises:

RegularExpressionError - if the position has no character, if the character at the position is in _NotXMLChar_set, or if the position begins an escape sequence that is not resolvable as a single-character escape.
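The behavior described above can be sketched as follows. The escape table lists the single-character escapes defined by the XML Schema regular expression grammar; the names SINGLE_CHAR_ESCAPES, NOT_ALLOWED_UNESCAPED, and char_or_sce are hypothetical, and ValueError stands in for RegularExpressionError.

```python
# Escapes that denote control characters.
SINGLE_CHAR_ESCAPES = {"n": "\n", "r": "\r", "t": "\t"}
# Escapes that denote the escaped character itself.
for ch in "\\|.-^?*+{}()[]":
    SINGLE_CHAR_ESCAPES[ch] = ch

# Mirrors _NotXMLChar_set: not allowed unescaped inside a character class.
NOT_ALLOWED_UNESCAPED = frozenset("-[]")


def char_or_sce(text, position):
    """Return (c, p): the character at position and the offset following it."""
    if position >= len(text):
        raise ValueError("no character at offset %d" % position)
    ch = text[position]
    if ch in NOT_ALLOWED_UNESCAPED:
        raise ValueError("%r must be escaped at offset %d" % (ch, position))
    if ch != "\\":
        return ch, position + 1
    # A backslash consumes one more character to identify the escape.
    escaped = text[position + 1:position + 2]
    if escaped not in SINGLE_CHAR_ESCAPES:
        raise ValueError("not a single-character escape at offset %d" % position)
    return SINGLE_CHAR_ESCAPES[escaped], position + 2


print(char_or_sce("a-z", 0))
print(char_or_sce("\\-", 0))
```

Multi-character escapes such as \d are rejected here because they denote a set of characters rather than a single one.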

_MatchPosCharGroup(text, position)

Match a positive character group that begins at the given position.

Parameters:

text - The complete text of the regular expression being translated
position - The offset of the start of the positive character group.

Returns:

A pair (cps, p) as in _MatchCharPropBraced.

Raises:

RegularExpressionError - if the expression is syntactically invalid.

_MatchCharGroup(text, position)

Match a character group at the given position.

Parameters:

text - The complete text of the regular expression being translated
position - The offset of the start of the character group.

Returns:

A pair (cps, p) as in _MatchCharPropBraced.

Raises:

RegularExpressionError - if the expression is syntactically invalid.

_MatchCharClassExpr(text, position)

Match a character class expression at the given position.

Parameters:

text - The complete text of the regular expression being translated
position - The offset of the start of the character class expression.

Returns:

A pair (cps, p) as in _MatchCharPropBraced.

Raises:

RegularExpressionError - if the expression is syntactically invalid.

MaybeMatchCharacterClass(text, position)

Attempt to match a character class expression.

Parameters:

text - The complete text of the regular expression being translated
position - The offset of the start of the potential expression.

Returns:

None if position does not begin a character class expression; otherwise a pair (cps, p) as in _MatchCharPropBraced.

XMLToPython(pattern)

Convert the given pattern to the format required for Python regular expressions.

Parameters:

pattern - A Unicode string defining a pattern consistent with XML regular expressions.

Returns:

A Unicode string specifying a Python regular expression that matches the same language as pattern.
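XML Schema patterns implicitly match the entire value, so a translated expression is most safely applied with whole-string matching. The sketch below assumes a translated string of the kind XMLToPython might return for [a-z-[aeiou]]+; the literal is written by hand for illustration and is not actual output of this module, and matches_whole_value is a hypothetical helper.

```python
import re


def matches_whole_value(python_pattern, value):
    """Return True if python_pattern matches all of value.

    re.fullmatch gives the same answer whether or not the translated
    pattern already carries its own ^ and $ anchors.
    """
    return re.fullmatch(python_pattern, value) is not None


# Hand-written stand-in for a translated pattern (assumption).
translated = "[b-df-hj-np-tv-z]+"

print(matches_whole_value(translated, "rhythm"))
print(matches_whole_value(translated, "rhythm!"))
```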