Where "Best" = "Most performant". Of course, there's other ways to judge a regex - readability, specificity, robustness...
In most places where I've used regexes, efficiency was the least of my concerns, though for a company whose product is focused on parsing massive amounts of logs, focusing on performance does make sense.
How would the regex below compare in this case? That is, modifying the "bad" regex to use the lazy-consumption trick on (almost) all of the `.` patterns.
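(For anyone unfamiliar, the "lazy-consumption trick" just swaps greedy `.*` for lazy `.*?`. A minimal sketch in Python, using a made-up log line rather than the article's actual regexes:)

```python
import re

line = '127.0.0.1 - - [05/Oct/2023:12:34:56] "GET /index.html" 200 1234'

# Greedy: .* consumes to the end of the line, then backtracks from
# the right until the rest of the pattern can match.
greedy = re.search(r'"(.*)" (\d+)', line)

# Lazy: .*? consumes as little as possible, extending one character
# at a time from the left until the rest of the pattern matches.
lazy = re.search(r'"(.*?)" (\d+)', line)

print(greedy.group(1), lazy.group(1))  # same captures, found in different orders
```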
If you are searching a very large file for very few occurrences of the expected match, then this optimization is not so bad.
If you are running line-by-line through a very large log file to extract just those two pieces of information per line, then throw away the first N characters in each line (where N is the hopefully-constant length of your timestamps plus that space char) and start the regex engine at the beginning of the expected match. Then it doesn't have to waste any time passing over those chars.
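In Python that might look roughly like this (the prefix width of 20 is a made-up timestamp length, and the field pattern is a stand-in):

```python
import re

# Hypothetical fixed-width prefix: a 19-char timestamp plus one space,
# e.g. "2023-10-05 12:34:56 ".
PREFIX_WIDTH = 20
FIELDS = re.compile(r"(\w+) (\d+)")  # stand-in for the two fields of interest

def extract(line):
    # Start matching at the known offset instead of letting the engine
    # scan (and potentially backtrack) across the timestamp.
    m = FIELDS.match(line, PREFIX_WIDTH)
    return m.groups() if m else None

print(extract("2023-10-05 12:34:56 GET 200"))  # ('GET', '200')
```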
Even if the exact details above aren't quite right, the principle is (and is well-known): avoid premature optimization! (And the corollary: measure it. Profile your code, don't guess; you're probably wrong.)
There's a higher-order solution. Read the AWK book, do the exercises, then ignore blog posts about regexps. Make an exception when it's Russ Cox demonstrating how often this wheel has been reinvented in square form.
"Awk and grep use the Thompson NFA algorithm which is in fact significantly faster in almost every way but supports a more limited set of features."
AFAIK the only feature of regexes that requires backtracking is back-references. As long as your regex doesn't use them, why doesn't PCRE switch to the more efficient algorithm, and fall back to the backtracking algorithm only when you actually need the feature that requires it?
Backtracking is required in a lot of cases. Consider matching the pattern /^(AA|AB)*$/ against the string "AAAAAAAAB". Before it can come up with the answer (it doesn't match) the engine has to backtrack all the way from right to left.
That can be compiled to a state machine where the decision to switch states is based on the current character only, and when all input is consumed you are either in an accepting or a rejecting state. In your case I think it only needs 2 states: state 0 moves to state 1 when it sees an A, and state 1 moves to state 0 when it sees either A or B.
For your string it'll be in state 0 when it sees a B and thus rejects it.
Three states if you want to process the whole input. If your model is "reject when you fail to find an appropriate transition" rather than "reject if, after processing the string, you're in a reject state", then you don't need the failure trap and you can do it in two states.
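A minimal sketch of the two-state version in Python, with the transition table spelled out by hand:

```python
# DFA for ^(AA|AB)*$ using the "reject on missing transition" model.
# State 0: an even number of characters consumed (accepting).
# State 1: seen the first character of a pair (which must be an A).
TRANSITIONS = {
    (0, "A"): 1,               # first char of a pair must be A
    (1, "A"): 0, (1, "B"): 0,  # second char may be A or B
}

def matches(s):
    state = 0
    for ch in s:
        state = TRANSITIONS.get((state, ch))
        if state is None:  # no transition: reject immediately
            return False
    return state == 0  # accept only on a complete pair boundary

print(matches("AAAAAAAAB"))  # False -- in state 0 when the B arrives
print(matches("AAABAAAB"))   # True, in one pass, with no backtracking
```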
Backtracking is definitely not required, nor helpful.
Not true. Finite automata (NFAs, DFAs) can match that pattern (and any pattern that doesn't involve backreferences or lookaround) in O(n) time where n is the size of the input string. DFA implementations are worst-case O(n ^ 2) in the number of states of the regular expression, but this is far better in most cases than the exponential worst-case time in terms of the input string offered by backtracking implementations.
> DFA implementations are worst-case O(n ^ 2) in the number of states of the regular expression
What? Why is that? If the NFA has n states, then the DFA in principle might need one state for every possible set of states the NFA might be in, of which there are 2^n. Where does n^2 come from?
My mistake, I meant O(2 ^ n) indeed. Regardless, having exponential time complexity in terms of the regular expression (which is usually controlled by the programmer) where the processing is done at compile-time is much better than having exponential complexity in terms of the input string (which often comes from an untrusted source) where the processing is done at run-time.
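To see the input-side blowup concretely, here's the textbook pathological pattern run against Python's backtracking `re` engine; each extra character roughly doubles the running time:

```python
import re
import time

# (a+)+b is a classic catastrophic-backtracking pattern: against n a's
# with no trailing b, the engine tries exponentially many ways to split
# the a's between the inner and outer loops before giving up.
pattern = re.compile(r"(a+)+b")

for n in range(16, 24):
    start = time.perf_counter()
    pattern.match("a" * n + "c")  # can never match
    print(n, f"{time.perf_counter() - start:.3f}s")
```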
Surely the simplest optimisation to add would be anchors? Instead of starting the 'good' regex with "[12]", start it with "^[12]".
As it stands, the 'good' regex will look through the whole line for any occurrence of a 1 or 2, then start applying the rest of the regex from there. Putting an anchor in means it only has to check the first character for that 1 or 2.
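A sketch of the difference in Python (a simplified stand-in for the real timestamp pattern):

```python
import re

line = "some line with a 2 in it but no leading timestamp"

# Unanchored: the engine restarts the pattern at every position in
# the line, attempting a match wherever it finds a 1 or 2.
unanchored = re.compile(r"[12]\d{3}-\d{2}-\d{2}")

# Anchored: the pattern can only succeed at position 0, so a line
# that doesn't start with 1 or 2 is rejected almost immediately.
anchored = re.compile(r"^[12]\d{3}-\d{2}-\d{2}")

print(unanchored.search(line))  # None, after trying every 1 or 2 in the line
print(anchored.search(line))    # None, after checking just the start
```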
Honestly, if you must engage in backseat modding, please at least have the courtesy to explain why you think someone should do what you say. Otherwise people are likely either to just have a negative reaction to you or, if they do what you say, to not understand why. :)
I'm not sure what "backseat modding" means. We all help moderate the community; it's why we have vote buttons and flag buttons, both of which are worth using in this case.
I didn't decide on the word by way of a rational chain of logic, but merely by having hindbrain pattern matching trigger on your behavior, based on what I've seen called that on other forums and what I've seen you doing on here from time to time. If I had to put it in words, it would probably be something like "telling other people how to use HN without contributing significantly to the discussion at hand", since that is typically the job of actual mods, not of users.
In any case, it's not that I mind you doing it, but if it's done without any explanation it irks me for the aforementioned reasons.
(Note: My brain classifies even my response to your first post here as backseat modding.)
I'm doing a lot of things at once. I'm fixing a buggy register allocator, and I'm not that smart, so I can do real work for about 15 minutes before my brain overheats. I bounce to the HN front page: there's a story about regexes, and a recent trend of stories about how there are alternatives to standard regex libraries. I think, maybe this will be one of those stories. Click. Nope, it isn't. I wonder, did anyone write the comment where they point out that rather than tediously optimizing regexes, it's sometimes more profitable just to hand-code a quick lexer? Nope, but, look at this: a creepy comment about "young girls" and regexes. Nope. Flag. Downvote. The comment isn't grey! Someone voted this creepy thing up! Write a quick comment. Back to debugging the world's worst compiler IR.
That's how you end up with an imperfect complaint comment. Oh well! Thanks for motivating me to improve it.
See, the same thing just happened, just now; I vented a little more brain steam, and now can hopefully spend another solid 15 minutes trying to fix this bug. :)
> Downvote. The comment isn't grey! Someone voted this creepy thing up!
I have a comment at -1 points right now that shows up (to me) as black. It used to be grey. I don't know if authors now see all of their own comments in full black or if comments revert to black after a certain period of time or what, but I'm not prepared to guarantee that a comment in full black has more than 0 points.
HN doesn't grey comments out for their authors anymore; it would just rub salt in the downvote wounds. And, more often than not, a comment you notice as grey right now will be black again in just a few minutes, so why bother writing about it?
But HN definitely does still show negative comments from other people as grey.
I'd rather word this as "more specific is better". Like, say, for a U.S. phone number (minus area code for simplicity), a regex that spells out the digits and the dash is better than one built from `.` wildcards, because it's more specific. "Longer is better" is only useful for helping identify which regex is better, not for helping write better regexes.
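A quick demonstration in Python (the patterns here are mine, just to illustrate the point):

```python
import re

# Specific: exactly three digits, a dash, four digits.
specific = re.compile(r"^\d{3}-\d{4}$")
# Loose: any three characters, a dash, any four characters.
loose = re.compile(r"^.{3}-.{4}$")

for s in ["555-0123", "abc-defg", "55 -01 3"]:
    print(s, bool(specific.match(s)), bool(loose.match(s)))
# 555-0123 True True
# abc-defg False True
# 55 -01 3 False True
```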