Wednesday, July 08, 2009

Regular Expression Matching and Postfix notation

As the compiler scans the postfix expression, it maintains a stack of computed NFA fragments. Literals push new NFA fragments onto the stack, while operators pop fragments off the stack and then push a new fragment. For example, after compiling the abb in abb.+.a., the stack contains NFA fragments for a, b, and b. The compilation of the . that follows pops the two b NFA fragment from the stack and pushes an NFA fragment for the concatenation bb.. Each NFA fragment is defined by its start state and its outgoing arrows:
The snippet doesn't make much sense unless you read the article, but this part, I thought was rather neat. Usually, when I wrote my crappy, one-off parsers, I just used regexes to pull out the tokens that I needed. Never thought too much about how it was implemented. But what's detailed here makes sense. Regexes are just state machines where you track whether the string you're matching against lets you traverse all the way through the state machine. And to do that, it pushes each fragment of the regex onto a stack until it reaches an operator, which then pops it off and works on it. While I've usually left post-fix notation is ass-backwards from a user perspective, I can see the elegance of the implementation. I suspect Forth and Factor are similar in this regard.

Posted via web from The Web and all that Jazz