RE2's source code is interesting: it seems to switch between simulating an NFA and a DFA depending on which is faster.
However, to my understanding, don't you first need to build an NFA and then use subset construction to convert it to a DFA?
With RE2 in particular, the DFA is almost always faster. The most common case where the NFA is used is when the DFA can't handle the request. Typically, that occurs when you need the offsets of submatches. But in some circumstances, if the regex would cause the DFA to grow exponentially, then RE2 will dynamically fall back to the NFA, since the NFA winds up being faster at that point.
Keep in mind that RE2 doesn't really have a true DFA. It has a hybrid NFA/DFA, or a "lazy DFA." Namely, the DFA is computed at match time through incremental subset construction. The DFA states are cached in memory, so most bytes are processed as if by a bog-standard table-based DFA.
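To make the "lazy DFA" idea concrete, here's a toy sketch in Python. The names and the NFA encoding are mine, not RE2's: the point is just that a DFA transition is computed by subset construction only on a cache miss, so most bytes hit a plain table lookup.

```python
# A minimal sketch of a "lazy DFA": subset construction performed at match
# time, one transition per step, with DFA states cached as discovered.
# The NFA is a toy: a dict mapping (state, byte) -> set of next states.

def lazy_dfa_match(nfa_trans, start_states, accept_states, haystack):
    # Cache: frozenset of NFA states -> {byte: frozenset of NFA states}.
    cache = {}
    current = frozenset(start_states)
    for byte in haystack:
        row = cache.setdefault(current, {})
        if byte not in row:
            # Cache miss: compute this DFA transition via subset construction.
            nxt = set()
            for s in current:
                nxt |= nfa_trans.get((s, byte), set())
            row[byte] = frozenset(nxt)
        current = row[byte]  # cache hit path: a single table lookup
        if not current:
            return False  # dead state: no NFA state is live
    return bool(current & accept_states)

# Toy NFA for the regex "ab*c": 0 --a--> 1, 1 --b--> 1, 1 --c--> 2 (accept).
nfa = {(0, "a"): {1}, (1, "b"): {1}, (1, "c"): {2}}
print(lazy_dfa_match(nfa, {0}, {2}, "abbbc"))  # True
print(lazy_dfa_match(nfa, {0}, {2}, "abd"))    # False
```

RE2's real lazy DFA additionally bounds the cache and evicts or flushes states when it fills up, which is what makes the NFA fallback relevant.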
Not easily. In the special case of one-byte look-around (which works for the ASCII interpretations of \b, ^ and $), the DFA can be massaged to make it work by doing two things. The first is to embed some look-around state into each DFA state, when appropriate. For example, in the case of \b, a DFA state is marked as "coming from a word byte" if its incoming transition is on a word byte, and is unmarked otherwise. Then, when computing the epsilon closure during subset construction, you use that look-around state to decide whether an epsilon transition may be taken. E.g., \b is an epsilon transition since it doesn't consume any input, so you only take it if the state you're in and the byte currently under inspection satisfy the \b criteria.
The second aspect is that you have to delay the match by one byte, in order to account for \b and $ at the end of the input. This requires adding a special "end of text" sentinel to the DFA's alphabet, but is otherwise pretty easy to handle.
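Putting those two pieces together, here's a rough Python sketch of ASCII \b handling. The NFA encoding, the names, and the full-match semantics are all illustrative, not RE2's internals: one bit of "came from a word byte" state gates the \b epsilon transitions, and an end-of-text sentinel delays the match by one byte so a trailing \b can still fire.

```python
# Sketch: `trans` maps (state, byte) -> next state; `wb_eps` maps a state
# to the state reachable over a \b epsilon transition.

EOT = None  # "end of text" sentinel added to the alphabet

def is_word(byte):
    return byte is not EOT and (byte.isalnum() or byte == "_")

def epsilon_closure(states, prev_is_word, next_byte, wb_eps):
    # A \b epsilon transition may be taken only when word-ness flips
    # between the byte we came from and the byte under inspection.
    stack, seen = list(states), set(states)
    while stack:
        s = stack.pop()
        if s in wb_eps and prev_is_word != is_word(next_byte):
            t = wb_eps[s]
            if t not in seen:
                seen.add(t)
                stack.append(t)
    return frozenset(seen)

def match_wb(trans, wb_eps, start, accept, haystack):
    prev_is_word = False  # the position before the haystack is non-word
    current = frozenset({start})
    # Delay the match by one byte: feeding EOT lets a trailing \b fire.
    for byte in list(haystack) + [EOT]:
        current = epsilon_closure(current, prev_is_word, byte, wb_eps)
        if byte is EOT:
            return bool(current & accept)
        current = frozenset(
            trans[(s, byte)] for s in current if (s, byte) in trans)
        prev_is_word = is_word(byte)

# Hand-built NFA for r"\bcat\b": 0 -\b-> 1 -c-> 2 -a-> 3 -t-> 4 -\b-> 5
trans = {(1, "c"): 2, (2, "a"): 3, (3, "t"): 4}
wb_eps = {0: 1, 4: 5}
print(match_wb(trans, wb_eps, 0, {5}, "cat"))   # True
print(match_wb(trans, wb_eps, 0, {5}, "cats"))  # False
```

In a real implementation the "came from a word byte" bit lives in the DFA state itself, effectively doubling the number of possible states, which is why this only stays tolerable for one byte of look-around.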
Look-around can actually be supported in the general case using finite automata, although I believe it requires something more powerful than a bare DFA. This paper has the details on that, although it is a difficult read: https://re2c.org/2017_trofimovich_tagged_deterministic_finit...
Of course, these techniques can greatly balloon the size of the DFA. The special case of one-byte look-around is generally tolerable, but anything more than that can get pretty bad.
A further pain point is that in Rust's regex engine, \b is Unicode-aware by default, and the DFA doesn't support that. The tagged DFA paper makes it clear that such things are possible, but it's complex and it's not clear how effective it would be, especially since a Unicode-aware \w is quite large. So Rust's regex crate has a special optimization where it treats a Unicode-aware \b as an ASCII \b if and only if all of its input is ASCII. (Of course, it doesn't scan the input ahead of time to determine this. Instead, it constructs the DFA to quit if it observes any non-ASCII byte. Then it falls back to the slower PikeVM or bounded backtracking engines.)
Apologies for the late reply. :-( I wish HN sent email notifications for replies.