Hacker Newsnew | past | comments | ask | show | jobs | submitlogin
Extreme regex foo: what you need to know to become a regular expression pro (immike.net)
30 points by Anon84 on June 27, 2008 | hide | past | favorite | 12 comments


you should be able to construct basic regular expressions to match things like email addresses

Actually, writing a correct (per RFC) regex to recognize email addresses is far from simple.


You can say that again. Here's a Perl Regex to validate according to RFC 822: http://ex-parrot.com/~pdw/Mail-RFC822-Address.html. It's well over a page long.

The problem is that email validation is really better suited to being done by a push-down automata (rather than a finite state machine, which is what true regexes are). For better or for worse, perl regular expressions aren't strictly finite state machines though - so you can kind of extend them beyond what is sane.

P.S. Here's a ridiculously cool regex that can be used to determine if a number is prime: /^1?$|^(11+?)\1+$/

If you've got a few hours (a few seconds for cperciva ;-)) to waste, try and figure it out :-D


You can watch that prime-testing regex in action. Here it is matching 49, showing it is not-prime:

http://regex.powertoy.org/?pat=/%5E1%3F%24%7C%5E%2811+%3F%29...

And here failing against 47, showing it is prime:

http://regex.powertoy.org/?pat=/%5E1%3F%24%7C%5E%2811+%3F%29...


I'm not sure how this ( /^1?$|^(11+?)\1+$/ ) would match 3, 5, 7, etc... I'm no regex expert, maybe I'm missing something?


well, you're right - there are some more details. it needs a bit of wrapper code around it.

It actually tests if n is composite, when presented with a pattern of n ones. E.g., '111' doesn't match so it is prime, and '1111' does match so it's composite.


Wow, this is really amazing (to me at least because I had not seen or thought of this before).

[ Warning: Spoiler! :) ]

It's in base 1 (tally system).

The first part up to the "|" matches the case where number 1 is the whole string, covering the fact of 1 not being considered prime. It also matches no string, I guess to mean 0.

Next part matches a captured string consisting of 2 or more ones, and the captured string must be repeated at least one time. That is, 2+ repetitions of chunks of 2+ ones.

This much is obvious, but it wasn't until I wrote it out that I realized what it is doing:

multiples of 2

11+11 or 11+11+11 or 11+11+11+11 or ... = 4, 6, 8, 10, 12, ...

multiples of 3

111+111 or 111+111+111 or 111+111+111+111 or ... = 6, 9, 12, ...

multiples of 4

1111+1111 or 1111+1111+1111 or ... = 8, 12, 16, 20, ...

multiples of 5

11111+11111 or 11111+11111+11111 or ... = 10, 15, 20, 25, ...

multiples of n

which leaves only primes remaining.

Side note: In Perl, executing it this way:

$nstr = 1 x $ARGV[0]; print ($nstr =~ m/^1?$|^(11+?)\1+$/ ? "composite\n" : "prime\n");

The largest prime I can find without a seg fault is 37397

And the largest composite I can find without a seg fault is 37399


Hint: don't express the number in base 10


Indeed: technically, it's impossible. Email addresses, per RFC 822, are not a regular language -- though it's doable in Perl because Perl "regular expressions" are unapologetically more powerful than real regular expressions.

RFC 822, section 3.4.3, defines comments in email addresses to be surrounded by matching parentheses, and which nest -- a classic example of a language which is not regular.

Chalk this up to "stupid technicalities I learned one day when I decided to write a regex to match RFC 822 perfectly". It turned out to be a fun but useless exercise, because most email programs don't accept half the crap that's in RFC 822 anyway.


He means "regex fu": "fu" as in "kung fu", not "foo bar". In other words, the construction is "[arbitrary thing] fu". Since hackers often use "foo" for an arbitrary thing, perhaps this is better rendered as "foo fu". In Ruby we might write it as follows:

  class String
    def foo_fu?
      foo = ".*"
      self =~ /#{foo} fu/
    end
  end


His regex for adding commas to numbers is a little obtuse. I would just write:

   scalar reverse join ',', split /\d{3}\K/, reverse $number;
The reason why people hate regexes are because people use them as a general purpose programming language instead of a concise way of matching strings. Note that the regex I use just says to "split the string every third digit" and the rest of the logic is actual code.


If you have any interest in regex, visit

txt2re.com

It is extremely helpful.


and really, i think the chap meant "Fu" anyways...




Guidelines | FAQ | Lists | API | Security | Legal | Apply to YC | Contact

Search: