Where "Best" = "Most performant". Of course, there's other ways to judge a regex - readability, specificity, robustness...
In most places where I've used regexes, efficiency was the least of my concerns, though for a company whose product is focused on parsing massive amounts of logs, focusing on performance does make sense.
How would the regex below compare in this case? That is, modifying the "bad" regex to use the lazy-consumption trick on (almost) all of the `.` patterns.
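(For anyone unfamiliar, the "lazy-consumption trick" just swaps greedy `.*` for lazy `.*?`. A minimal sketch in Python, using a made-up log line rather than the article's actual regexes:)

```python
import re

line = '127.0.0.1 - - [05/Oct/2023:12:34:56] "GET /index.html" 200 1234'

# Greedy: .* consumes to the end of the line, then backtracks from
# the right until the rest of the pattern can match.
greedy = re.search(r'"(.*)" (\d+)', line)

# Lazy: .*? consumes as little as possible, extending one character
# at a time from the left until the rest of the pattern matches.
lazy = re.search(r'"(.*?)" (\d+)', line)

print(greedy.group(1), lazy.group(1))  # same captures, found in different orders
```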
If you are searching a very large file for very few occurrences of the expected match, then this optimization is not so bad.
If you are running line-by-line through a very large log file to extract just those two pieces of information per line, then throw away the first N characters in each line (where N is the hopefully-constant length of your timestamps plus that space char) and start the regex engine at the beginning of the expected match. Then it doesn't have to waste any time passing over those chars.
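In Python that might look roughly like this (the prefix width of 20 is a made-up timestamp length, and the field pattern is a stand-in):

```python
import re

# Hypothetical fixed-width prefix: a 19-char timestamp plus one space,
# e.g. "2023-10-05 12:34:56 ".
PREFIX_WIDTH = 20
FIELDS = re.compile(r"(\w+) (\d+)")  # stand-in for the two fields of interest

def extract(line):
    # Start matching at the known offset instead of letting the engine
    # scan (and potentially backtrack) across the timestamp.
    m = FIELDS.match(line, PREFIX_WIDTH)
    return m.groups() if m else None

print(extract("2023-10-05 12:34:56 GET 200"))  # ('GET', '200')
```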
Even if the exact details above aren't quite right, the principle is (and is well-known): avoid premature optimization! (And the corollary: measure it. Profile your code, don't guess; you're probably wrong.)
There's a higher-order solution. Read the AWK book, do the exercises, then ignore blog posts about regexps. Make an exception when it's Russ Cox demonstrating how often this wheel has been reinvented in square form.
"Awk and grep use the Thompson NFA algorithm which is in fact significantly faster in almost every way but supports a more limited set of features."
AFAIK the only feature of regexes that requires backtracking is back-references. As long as your regex doesn't use them, why doesn't PCRE switch to the more efficient algorithm, and fall back to the backtracking algorithm only when you actually need the feature that requires it?
Backtracking is required in a lot of cases. Consider matching the pattern /^(AA|AB)*$/ against the string "AAAAAAAAB". Before it can come up with the answer (it doesn't match) the engine has to backtrack all the way from right to left.
That can be compiled to a state machine where the decision to switch states is based on the current character only, and when all input is consumed you are either in an accepting or a rejecting state. In your case I think it only needs 2 states: state 0 moves to state 1 when it sees an A, and state 1 moves to state 0 when it sees either A or B.
For your string it'll be in state 0 when it sees a B and thus rejects it.
Three states if you want to process the whole input. If your model is "reject when you fail to find an appropriate transition" rather than "reject if, after processing the string, you're in a reject state", then you don't need the failure trap and you can do it in two states.
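A minimal sketch of the two-state version in Python, with the transition table spelled out by hand:

```python
# DFA for ^(AA|AB)*$ using the "reject on missing transition" model.
# State 0: an even number of characters consumed (accepting).
# State 1: seen the first character of a pair (which must be an A).
TRANSITIONS = {
    (0, "A"): 1,               # first char of a pair must be A
    (1, "A"): 0, (1, "B"): 0,  # second char may be A or B
}

def matches(s):
    state = 0
    for ch in s:
        state = TRANSITIONS.get((state, ch))
        if state is None:  # no transition: reject immediately
            return False
    return state == 0  # accept only on a complete pair boundary

print(matches("AAAAAAAAB"))  # False -- in state 0 when the B arrives
print(matches("AAABAAAB"))   # True, in one pass, with no backtracking
```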
Backtracking is definitely not required, nor helpful.
Not true. Finite automata (NFAs, DFAs) can match that pattern (and any pattern that doesn't involve backreferences or lookaround) in O(n) time where n is the size of the input string. DFA implementations are worst-case O(n ^ 2) in the number of states of the regular expression, but this is far better in most cases than the exponential worst-case time in terms of the input string offered by backtracking implementations.
> DFA implementations are worst-case O(n ^ 2) in the number of states of the regular expression
What? Why is that? If the NFA has n states, then the DFA in principle might need one state for every possible set of states the NFA might be in, of which there are 2^n. Where does n^2 come from?
My mistake, I meant O(2 ^ n) indeed. Regardless, having exponential time complexity in terms of the regular expression (which is usually controlled by the programmer) where the processing is done at compile-time is much better than having exponential complexity in terms of the input string (which often comes from an untrusted source) where the processing is done at run-time.
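To see the input-side blowup concretely, here's the textbook pathological pattern run against Python's backtracking `re` engine; each extra character roughly doubles the running time:

```python
import re
import time

# (a+)+b is a classic catastrophic-backtracking pattern: against n a's
# with no trailing b, the engine tries exponentially many ways to split
# the a's between the inner and outer loops before giving up.
pattern = re.compile(r"(a+)+b")

for n in range(16, 24):
    start = time.perf_counter()
    pattern.match("a" * n + "c")  # can never match
    print(n, f"{time.perf_counter() - start:.3f}s")
```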
Surely the simplest optimisation to add would be anchors? Instead of starting the 'good' regex with "[12]", start it with "^[12]".
As it stands, the 'good' regex will look through the whole line for any occurrence of a 1 or 2, then start applying the rest of the regex from there. Putting an anchor in means it only has to check the first character for that 1 or 2.
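A sketch of the difference in Python (a simplified stand-in for the real timestamp pattern):

```python
import re

line = "some line with a 2 in it but no leading timestamp"

# Unanchored: the engine restarts the pattern at every position in
# the line, attempting a match wherever it finds a 1 or 2.
unanchored = re.compile(r"[12]\d{3}-\d{2}-\d{2}")

# Anchored: the pattern can only succeed at position 0, so a line
# that doesn't start with 1 or 2 is rejected almost immediately.
anchored = re.compile(r"^[12]\d{3}-\d{2}-\d{2}")

print(unanchored.search(line))  # None, after trying every 1 or 2 in the line
print(anchored.search(line))    # None, after checking just the start
```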
Honestly, if you must engage in backseat modding, please at least have the courtesy to explain why you think someone should do what you say. Otherwise people are likely either to just have a negative reaction to you or, if they do what you say, to not understand why. :)
I'm not sure what "backseat modding" means. We all help moderate the community; it's why we have vote buttons and flag buttons, both of which are worth using in this case.
I didn't decide on the word by way of a rational chain of logic, but merely by having hindbrain pattern matching trigger on your behavior, based on what I've seen called that on other forums and what I've seen you doing on here from time to time. If I had to put it in words, it would probably be something like "telling other people how to use HN without contributing significantly to the discussion at hand", since that is typically the job of actual mods, not of users.
In any case, it's not that I mind you doing it, but if it's done without any explanation it irks me for the aforementioned reasons.
(Note: My brain classifies even my response to your first post here as backseat modding.)
I'm doing a lot of things at once. I'm fixing a buggy register allocator, and I'm not that smart, so I can do real work for about 15 minutes before my brain overheats. I bounce to the HN front page: there's a story about regexes, and a recent trend of stories about how there are alternatives to standard regex libraries. I think, maybe this will be one of those stories. Click. Nope, it isn't. I wonder, did anyone write the comment where they point out that rather than tediously optimizing regexes, it's sometimes more profitable just to hand-code a quick lexer? Nope, but, look at this: a creepy comment about "young girls" and regexes. Nope. Flag. Downvote. The comment isn't grey! Someone voted this creepy thing up! Write a quick comment. Back to debugging the world's worst compiler IR.
That's how you end up with an imperfect complaint comment. Oh well! Thanks for motivating me to improve it.
See, the same thing just happened, just now; I vented a little more brain steam, and now can hopefully spend another solid 15 minutes trying to fix this bug. :)
> Downvote. The comment isn't grey! Someone voted this creepy thing up!
I have a comment at -1 points right now that shows up (to me) as black. It used to be grey. I don't know if authors now see all of their own comments in full black or if comments revert to black after a certain period of time or what, but I'm not prepared to guarantee that a comment in full black has more than 0 points.
HN doesn't grey comments out for their authors anymore; it would just rub salt in the downvote wounds. And, more often than not, a comment you notice as grey right now will be black again in just a few minutes, so why bother writing about it?
But HN definitely does still show negative comments from other people as grey.
I'd rather word this as "more specific is better". Like, say, for a U.S. phone number (minus area code for simplicity), a regex that spells out the digits and the dash is better than one built from `.` wildcards, because it's more specific. "Longer is better" is only useful for helping identify which regex is better, not for helping write better regexes.
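A quick demonstration in Python (the patterns here are mine, just to illustrate the point):

```python
import re

# Specific: exactly three digits, a dash, four digits.
specific = re.compile(r"^\d{3}-\d{4}$")
# Loose: any three characters, a dash, any four characters.
loose = re.compile(r"^.{3}-.{4}$")

for s in ["555-0123", "abc-defg", "55 -01 3"]:
    print(s, bool(specific.match(s)), bool(loose.match(s)))
# 555-0123 True True
# abc-defg False True
# 55 -01 3 False True
```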