I'm guessing the following is either near-impossible or pure-impossible, but: Is...

gamegoblin · on April 5, 2014

Here is the problem with that:

Consider the string abcdefgh

Guess what!? I have the perfect regex to match your string.

  "abcdefgh"

So given a string literal, there is always a regex to match that literal. Namely, the literal itself.

Really, what you want is a tool that, given several examples, will generate a regex that matches all of them.

So you'd give it:

  aaaaabaa
  aabaaa
  aba
  abaaaaa

And it'd generate "a+ba+"

The problem with that is, given a corpus with a set of tokens { T0, T1, T2 ... }, I can give you a regex that will match the corpus!

  "[T0 T1 T2 ... ]*"

or even

  ".*"

So it will match everything in your corpus! But unfortunately, it will match a whole lot you don't want, too.

So ideally you want a regex that matches everything in your corpus, but nothing outside the language you are trying to describe. This requires both positive and negative learning examples. The problem is that for most applications, you'd need a lot of negative examples.

Source: Working on this exact problem for graduate research

nmrm · on April 5, 2014

T0 | T1 | T2 | ... would match exactly the correct thing with all positive examples, and (T0 | T1 | T2) & !(CE1 | CE2 | CE3) would match exactly the correct thing with positive and negative examples.

But that's pretty stupid, because you don't generalize beyond your examples.

What's your approach?

<em>edit: removed random conjecture</em>

gamegoblin · on April 5, 2014

You have to have some sort of heuristic that determines what a "good" regex is, since there are undoubtedly multiple regexes that describe a corpus.

A simple heuristic is the smallest regex.

So in your example, given the training examples:

  aba
  abaa
  aaaaba

and the counter examples:

  abba
  ba
  ab

It's clear to a human I probably want to match "a+ba+". That's clearly much smaller than ("aba" | "abaa" | "aaaaba") & !("abba" | "ba" | "ab"), so it would be a "better" regex.

weavie · on April 5, 2014

Sounds like you want to be able to specify some kind of pattern to define accepted and rejected matches. A regex would be ideal for this. oh wait....

Dewie · on April 5, 2014

Since you're a researcher I must be missing something. But since regexps are closed under union, what is the problem with taking the union of all of them? I'm imagining that it would be conceptually simple to hook up all of the non deterministic state machines such that you get a non deterministic state machine which is the union of all of them. Then convert it to a deterministic state machine. You might get state explosion, but at least you would have found some machine to recognize the language. Is state minimization simple (complexity wise)? Is it even possible to find a decently small DSM in the general case (not necessarily the most minimal machine)?

gamegoblin · on April 5, 2014

My reply to nmrm might answer your question.

Finding some regular expression that matches all of the positive examples and does not match all of the negative examples is trivial. Finding a good regular expression that does that is not.

State minimization does not mitigate this problem. As an aside, state minimization is a polynomial algorithm.

Given the positive examples:

  aba
  abaa
  aaaaba

and the negative examples:

  abba
  ba
  ab

we could make a regex that does something like ("aba" | "abaa" | "aaaaba") & !("abba" | "ba" | "ab"), but unfortunately, running a state minimization algorithm on this regex does not give you "a+ba+" because the two regex are not equivalent (they do not accept the same language).

So you can find plenty of regex that will match your examples and not match your counterexamples, but you cannot easily minimize them to what you do want.

staticshock · on April 5, 2014

"aaa" is a valid regex that matches the string "aaa". If you have special characters in your source string, many libraries have a regex for escaping them. So, generating a regex to match your exact string is trivial. Even matching a group of strings is trivial via (aaa|bbb|etc), though it gets long.

Given that, what I think you're really asking is, "how do I automatically generate a regex of optimal conciseness given a set of inputs I'd like to match, and maybe a bunch of other inputs I want to avoid matching?"

This looks like it iteratively does what you want: http://regex.inginf.units.it/ (Note that when I went there, it said "6 slots available", presumably because everything runs server-side. If a bunch of people pile in there, you probably won't actually be able to test it due to limited resources on their part.)

mc_hammer · on April 6, 2014

this is very cool ty for the post.

fancy_betta · on April 5, 2014

Probably not what you're looking for, but check out http://nbviewer.ipython.org/url/norvig.com/ipython/xkcd1313....

chomp · on April 5, 2014

Regexp::Assemble kind of gets you there, you can feed it strings and it'll spit out a regex.