
Module xmlre

Support for regular expressions conformant to the XML Schema specification.

For the most part, XML regular expressions are similar to the POSIX ones and can be handled by the Python re module. The exceptions are the multi-character escapes (e.g., \w), the category and block escapes (e.g., \p{N} or \p{IsIPAExtensions}), and character class subtraction. This module supports those by scanning the regular expression and replacing the escapes with equivalent charset expressions. It also detects the subtraction syntax and modifies the charset expression to remove the unwanted code points.
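
The subtraction step can be pictured as ordinary set arithmetic on code points. The sketch below is not the module's implementation: charset_from_codepoints and the sample sets are hypothetical names used only for illustration. It computes the XML class [a-z-[aeiou]] as a set difference and writes the result out as a character class that Python's re module accepts.

```python
import re


def charset_from_codepoints(codepoints):
    """Render a set of code points as a Python character class such as [b-df-h].

    Hypothetical helper for illustration; not part of pyxb.utils.xmlre.
    """
    ordered = sorted(codepoints)
    if not ordered:
        raise ValueError("an empty set has no character class")

    # Collapse consecutive code points into inclusive (first, last) ranges.
    ranges = []
    start = prev = ordered[0]
    for cp in ordered[1:]:
        if cp != prev + 1:
            ranges.append((start, prev))
            start = cp
        prev = cp
    ranges.append((start, prev))

    # Escape each endpoint so characters like '-' or ']' stay literal.
    parts = []
    for first, last in ranges:
        if first == last:
            parts.append(re.escape(chr(first)))
        else:
            parts.append(re.escape(chr(first)) + "-" + re.escape(chr(last)))
    return "[" + "".join(parts) + "]"


# XML Schema allows [a-z-[aeiou]]; Python's re has no subtraction syntax,
# so the unwanted code points are removed before the class is written out.
lowercase = set(range(ord("a"), ord("z") + 1))
vowels = {ord(c) for c in "aeiou"}
consonants = charset_from_codepoints(lowercase - vowels)
```

Working on sets of code points rather than on pattern text keeps the subtraction independent of how either operand was originally spelled.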

The basic technique is to step through the characters of the regular expression, entering a recursive-descent parser when one of the translated constructs is encountered.
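
The scan-and-dispatch loop described above can be sketched as follows. This is a simplified illustration under stated assumptions, not the module's code: demo_translate, demo_maybe_match, and the one-entry table are hypothetical, and the "sub-parser" here recognizes only the block escape \p{IsBasicLatin} (U+0000 through U+007F) rather than performing a full recursive descent.

```python
import re

# Toy table with a single entry. A real translator needs the full Unicode
# category and block tables; this one exists only to drive the loop below.
_DEMO_ESCAPES = {r"\p{IsBasicLatin}": r"[\x00-\x7f]"}


def demo_maybe_match(text, position):
    """Return (replacement, next_position), or None if nothing is recognized."""
    for escape, replacement in _DEMO_ESCAPES.items():
        if text.startswith(escape, position):
            return replacement, position + len(escape)
    return None


def demo_translate(pattern):
    """Copy ordinary characters and hand recognized constructs to the sub-parser."""
    out = []
    position = 0
    while position < len(pattern):
        matched = demo_maybe_match(pattern, position)
        if matched is not None:
            replacement, position = matched
            out.append(replacement)
        elif pattern[position] == "\\" and position + 1 < len(pattern):
            # Copy any other escape as a unit so an escaped backslash
            # followed by "p{...}" is not misread as a property escape.
            out.append(pattern[position:position + 2])
            position += 2
        else:
            out.append(pattern[position])
            position += 1
    return "".join(out)


translated = demo_translate(r"\p{IsBasicLatin}+x")
compiled = re.compile(translated)
```

The sub-parser returns the offset at which scanning should resume, so the outer loop never needs to know how much text a construct consumed.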

There is a nice set of XML regular expressions at http://www.xmlschemareference.com/examples/Ch14/regexpDemo.xsd, with a sample document at http://www.xmlschemareference.com/examples/Ch14/regexpDemo.xml

Classes
  RegularExpressionError
Raised when a regular expression cannot be processed.
Functions

_MatchCharPropBraced(text, position)
Match a character property or multi-character escape identifier, which will be enclosed in braces.

_MaybeMatchCharClassEsc(text, position, include_sce=True)
Attempt to match a character class escape expression.

_CharOrSCE(text, position)
Return the single character represented at the given position.

_MatchPosCharGroup(text, position)
Match a positive character group that begins at the given position.

_MatchCharGroup(text, position)
Match a character group at the given position.

_MatchCharClassExpr(text, position)
Match a character class expression at the given position.

MaybeMatchCharacterClass(text, position)
Attempt to match a character class expression.

XMLToPython(pattern)
Convert the given pattern to the format required for Python regular expressions.
Variables
  _NotXMLChar_set = frozenset(['-', '[', ']'])
The set of characters that cannot appear within a character class expression unescaped.
  __package__ = 'pyxb.utils'
Function Details

_MatchCharPropBraced(text, position)

Match a character property or multi-character escape identifier, which will be enclosed in braces.

Parameters:
  • text - The complete text of the regular expression being translated
  • position - The offset of the opening brace of the character property
Returns:
A pair (cps, p) where cps is a unicode.CodePointSet containing the code points associated with the property, and p is the text offset immediately following the closing brace.
Raises:
  • RegularExpressionError - if opening or closing braces are missing, or if the text between them cannot be recognized as a property or block identifier.
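
The contract above (return the matched set together with the offset just past the closing brace, and raise when the braces or the name are wrong) can be sketched as below. Everything here is an assumption made for illustration: demo_match_braced, DemoExpressionError, and the one-entry property table are hypothetical stand-ins, and a plain frozenset replaces the module's unicode.CodePointSet.

```python
class DemoExpressionError(ValueError):
    """Stand-in for RegularExpressionError in this sketch."""


# Hypothetical one-entry table; the block IsBasicLatin spans U+0000 to U+007F.
_DEMO_PROPERTIES = {"IsBasicLatin": frozenset(range(0x00, 0x80))}


def demo_match_braced(text, position):
    """Return (code_points, p), where p is the offset just past the closing brace."""
    if position >= len(text) or text[position] != "{":
        raise DemoExpressionError("expected '{' at offset %d" % position)
    closing = text.find("}", position)
    if closing < 0:
        raise DemoExpressionError("missing '}' after offset %d" % position)
    name = text[position + 1:closing]
    try:
        code_points = _DEMO_PROPERTIES[name]
    except KeyError:
        raise DemoExpressionError("unrecognized property %r" % name) from None
    return code_points, closing + 1


text = r"\p{IsBasicLatin}+"
# Offset 2 is the opening brace, immediately after the two characters "\p".
code_points, next_position = demo_match_braced(text, 2)
```

After the call, text[next_position] is the "+" quantifier, which is where an enclosing scanner would resume.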

_MaybeMatchCharClassEsc(text, position, include_sce=True)

Attempt to match a character class escape expression.

Parameters:
  • text - The complete text of the regular expression being translated
  • position - The offset of the backslash that would begin the potential character class escape
  • include_sce - Optional directive to include single-character escapes in addition to character class escapes. Default is True.
Returns:
None if position does not begin a character class escape; otherwise a pair (cps, p) as in _MatchCharPropBraced.

_CharOrSCE(text, position)

Return the single character represented at the given position.

Parameters:
  • text - The complete text of the regular expression being translated
  • position - The offset of the character to return. If this is a backslash, additional text is consumed in order to identify the single-character escape that begins at the position.
Returns:
A pair (c, p) where c is the Unicode character specified at the position, and p is the text offset immediately following the character or escape sequence that was consumed.
Raises:
  • RegularExpressionError - if the position has no character, has a character in _NotXMLChar_set, or begins an escape sequence that is not resolvable as a single-character escape.

_MatchPosCharGroup(text, position)

Match a positive character group that begins at the given position.

Parameters:
  • text - The complete text of the regular expression being translated
  • position - The offset of the start of the positive character group.
Returns:
A pair (cps, p) as in _MatchCharPropBraced.
Raises:

_MatchCharGroup(text, position)

Match a character group at the given position.

Parameters:
  • text - The complete text of the regular expression being translated
  • position - The offset of the start of the character group.
Returns:
A pair (cps, p) as in _MatchCharPropBraced.
Raises:

_MatchCharClassExpr(text, position)

Match a character class expression at the given position.

Parameters:
  • text - The complete text of the regular expression being translated
  • position - The offset of the start of the character class expression.
Returns:
A pair (cps, p) as in _MatchCharPropBraced.
Raises:

MaybeMatchCharacterClass(text, position)

Attempt to match a character class expression.

Parameters:
  • text - The complete text of the regular expression being translated
  • position - The offset of the start of the potential expression.
Returns:
None if position does not begin a character class expression; otherwise a pair (cps, p) as in _MatchCharPropBraced.

XMLToPython(pattern)

Convert the given pattern to the format required for Python regular expressions.

Parameters:
  • pattern - The XML Schema regular expression to convert
Returns:
A Unicode string specifying a Python regular expression that matches the same language as pattern.
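
A typical use is to compile the converted string with the re module. The helper below is a hypothetical wrapper, not part of PyXB. It takes the converter as an argument so that it can be run without PyXB installed; in practice one would pass pyxb.utils.xmlre.XMLToPython. It uses re.fullmatch on the assumption that the value must match the pattern as a whole, as XML Schema patterns require.

```python
import re


def xml_pattern_matches(convert, pattern, value):
    """Return True if value matches the XML pattern after conversion.

    convert is any callable with the XMLToPython(pattern) contract: it takes
    an XML Schema pattern and returns a Python regular expression string.
    """
    python_pattern = convert(pattern)
    # XML Schema patterns match the entire value, so require a full match.
    return re.fullmatch(python_pattern, value) is not None


def identity(pattern):
    """Stand-in converter for patterns that need no translation."""
    return pattern


# With PyXB installed, replace identity with pyxb.utils.xmlre.XMLToPython.
accepted = xml_pattern_matches(identity, "[a-c]+", "abc")
rejected = xml_pattern_matches(identity, "[a-c]+", "abcd")
```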