Python Tip: Regex-based tokenizer

Here’s a handy way to define a lexical analyzer in Python:

import re

SCANNER = re.compile(r'''
  (\s+) |                      # whitespace
  (//)[^\n]* |                 # comments
  0[xX]([0-9A-Fa-f]+) |        # hexadecimal integer literals
  (\d+) |                      # integer literals
  (<<|>>) |                    # multi-char punctuation
  ([][(){}<>=,;:*+/-]) |       # punctuation ('-' last, so it is literal)
  ([A-Za-z_][A-Za-z0-9_]*) |   # identifiers
  """(.*?)""" |                # multi-line string literal
  "((?:[^"\n\\]|\\.)*)" |      # regular string literal
  (.)                          # an error!
''', re.DOTALL | re.VERBOSE)

Combine this with a finditer() call on your source string like this:

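for match in SCANNER.finditer(data):
    space, comment, hexint, integer, mpunct, \
    punct, word, mstringlit, stringlit, badchar = match.groups()
    # Compare against None rather than testing truthiness:
    # an empty string literal ("") captures '', which is falsy.
    if space is not None: ...
    if comment is not None: ...
    # ...
    if badchar is not None:
        # ValueError is a stand-in here; raise your own exception type.
        raise ValueError(f'unexpected character {badchar!r} at offset {match.start()}')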

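With this approach you can easily walk through all the tokens in your input string. Each match sets exactly one of the groups; the rest are None. The captures in each alternative decide what data you get back: the hexadecimal alternative, for example, captures only the digits and leaves out the 0x prefix. If badchar is ever set, there’s an unrecognized character in your input.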
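
As a rough illustration of walking the tokens, here is a minimal sketch that wraps the loop in a generator. It uses a cut-down scanner so the example stays short, and it relies on match.lastindex to tell which alternative matched, which only works because every alternative has exactly one capturing group. The names KINDS and tokenize, and the choice of ValueError, are assumptions made for this example and are not part of the snippet above.

```python
import re

# Reduced version of the scanner: one capturing group per alternative.
SCANNER = re.compile(r'''
  (\s+) |                      # 1: whitespace
  0[xX]([0-9A-Fa-f]+) |        # 2: hexadecimal integer literals
  (\d+) |                      # 3: integer literals
  ([A-Za-z_][A-Za-z0-9_]*) |   # 4: identifiers
  ([=+;]) |                    # 5: punctuation
  (.)                          # 6: an error!
''', re.DOTALL | re.VERBOSE)

# Hypothetical labels, listed in the same order as the groups above.
KINDS = ('space', 'hexint', 'integer', 'word', 'punct', 'badchar')


def tokenize(data):
    """Yield (kind, text) pairs for data, skipping whitespace."""
    for match in SCANNER.finditer(data):
        # lastindex is the number of the one group that took part in the match.
        kind = KINDS[match.lastindex - 1]
        text = match.group(match.lastindex)
        if kind == 'badchar':
            raise ValueError(
                f'unexpected character {text!r} at offset {match.start()}')
        if kind != 'space':
            yield kind, text


tokens = list(tokenize('x = 0x1F + 42;'))
# tokens == [('word', 'x'), ('punct', '='), ('hexint', '1F'),
#            ('punct', '+'), ('integer', '42'), ('punct', ';')]
```

Note that the hexint token carries '1F' rather than '0x1F', because the capture group sits after the 0x prefix.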

This is probably not the most efficient way to tokenize an input string, but it is short to type and relatively easy to maintain. The only caveats are:

  • You must explicitly match and allow whitespace.
  • You must list the alternatives in the regex so that the most specific cases are caught first.
  • You must include the badchar alternative or the regex matcher will silently skip past errors in the input string!
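
The last two caveats are easy to see with a few tiny patterns. This is only a sketch with made-up pattern names (wrong_order, right_order, no_catch_all, with_catch_all), not part of the scanner above.

```python
import re

# Ordering: with '<' listed before '<<', the longer operator is never matched.
wrong_order = re.compile(r'(<)|(<<)')
right_order = re.compile(r'(<<)|(<)')

wrong = [m.group() for m in wrong_order.finditer('<<')]
right = [m.group() for m in right_order.finditer('<<')]
# wrong == ['<', '<']
# right == ['<<']

# Catch-all: without a final (.) alternative, finditer() simply moves on
# past any character that no alternative matches.
no_catch_all = re.compile(r'(\s+)|(\d+)')
with_catch_all = re.compile(r'(\s+)|(\d+)|(.)', re.DOTALL)

skipped = [m.group() for m in no_catch_all.finditer('12 @ 34')]
# skipped == ['12', ' ', ' ', '34']   (the '@' is silently dropped)

caught = [m.group(3) for m in with_catch_all.finditer('12 @ 34')
          if m.group(3) is not None]
# caught == ['@']
```

The same reasoning explains why the hexadecimal alternative has to come before the plain integer one: otherwise the leading 0 of 0x1F would be consumed as an integer on its own.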

You can probably adapt this approach to other programming languages and regex toolkits as well. Leave your thoughts in the comments. Have fun!
