
Btw, does anyone know a library/program which can reverse engineer a regex from multiple source strings? E.g.

  14:51 [info] 51 some message
  … more of 51 lines …
  15:22 [error] 24 error!
  … more of 24 lines …

  ^(\d\d:\d\d) \[(info|error)\] (\d+) (.+)$
Or maybe not a regex, but a structured pattern.


One probably wants to provide a set of matching and a set of non-matching strings. Then the software would output a regex and some edge-case matching strings and non-matching strings.

This could be built using set operations on deterministic finite automata (DFAs). Every regex is equivalent to a DFA, so you can construct an automaton for each positive and each negative example input. Then calculate the union of all positive examples and the union of all negative examples, and finally calculate the difference between the two unions. Convert the resulting automaton back to a regex.

https://scanftree.com/automata/dfa-union-property
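
A minimal sketch of the union/difference idea above, assuming each example is turned into a DFA that accepts exactly that one string. The `DFA` class and the helper names (`literal`, `product`, `from_examples`) are made up for illustration, and the final step of converting the automaton back to a regex is left out.

```python
from dataclasses import dataclass
from functools import reduce
from typing import Callable, Hashable, Iterable


@dataclass(frozen=True)
class DFA:
    start: Hashable
    step: Callable[[Hashable, str], Hashable]  # total transition function
    accepting: Callable[[Hashable], bool]

    def accepts(self, text: str) -> bool:
        state = self.start
        for ch in text:
            state = self.step(state, ch)
        return self.accepting(state)


# Automaton for the empty language, used as the neutral element for union.
EMPTY = DFA(0, lambda state, ch: 0, lambda state: False)


def literal(word: str) -> DFA:
    """DFA accepting exactly `word`; state = chars matched, -1 = dead state."""
    def step(state, ch):
        if 0 <= state < len(word) and word[state] == ch:
            return state + 1
        return -1
    return DFA(0, step, lambda state: state == len(word))


def product(a: DFA, b: DFA, combine: Callable[[bool, bool], bool]) -> DFA:
    """Product construction: run both automata in lockstep."""
    return DFA(
        (a.start, b.start),
        lambda s, ch: (a.step(s[0], ch), b.step(s[1], ch)),
        lambda s: combine(a.accepting(s[0]), b.accepting(s[1])),
    )


def union(a: DFA, b: DFA) -> DFA:
    return product(a, b, lambda x, y: x or y)


def difference(a: DFA, b: DFA) -> DFA:
    return product(a, b, lambda x, y: x and not y)


def from_examples(positive: Iterable[str], negative: Iterable[str]) -> DFA:
    pos = reduce(union, map(literal, positive), EMPTY)
    neg = reduce(union, map(literal, negative), EMPTY)
    return difference(pos, neg)


dfa = from_examples(
    ["14:51 [info] 51 some message", "15:22 [error] 24 error!"],
    ["15:22 [error] 24 error!"],
)
```

Built from literal examples like this, the automaton accepts only the positive strings that are not also negative examples, so some generalization step would still be needed to get a useful pattern.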


I was thinking of something that could categorize parts of these strings into a “language”, so there are no non-matching strings. It’s hard to specify in a formal way, but by looking at these strings you may see that e.g. […] is a static syntactic element, that a number follows it, and that a time precedes it. This would be nice to have for browsing logs (which these strings are obviously a part of): instead of scrolling through thousands of rows, you could see all of the patterns that occur among them at once, and then dig down into a pattern to inspect what happened and when, to improve the “health” of a complex system. Of course, if you know all of the patterns in advance, it’s easy to filter by each. But lots of software/APIs do not document their output in such detail.
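
One rough way to sketch this kind of pattern overview: replace each whitespace-separated token with a coarse class and count the resulting templates. The token classes (`<time>`, `<num>`, `<text>`) and the function names are assumptions chosen to fit the two example lines, not an established scheme.

```python
import re
from collections import Counter

# Hypothetical token classes; the first matching class wins.
TOKEN_CLASSES = [
    ("<time>", re.compile(r"\d\d:\d\d")),
    ("<num>", re.compile(r"\d+")),
]
# Bracketed tags such as [info] are treated as static syntax and kept as-is.
BRACKETED = re.compile(r"\[\w+\]")


def classify(token: str) -> str:
    if BRACKETED.fullmatch(token):
        return token
    for label, pattern in TOKEN_CLASSES:
        if pattern.fullmatch(token):
            return label
    return "<text>"


def template(line: str) -> str:
    """Map a log line to its template, collapsing runs of free text."""
    out = []
    for token in line.split():
        label = classify(token)
        if label == "<text>" and out and out[-1] == "<text>":
            continue
        out.append(label)
    return " ".join(out)


def summarize(lines) -> Counter:
    """Count how often each template occurs among the given lines."""
    return Counter(template(line) for line in lines)


counts = summarize([
    "14:51 [info] 51 some message",
    "15:22 [error] 24 error!",
    "15:23 [error] 7 another error",
])
```

Here `counts` groups the three lines under two templates, one per bracketed tag, which is the kind of summary one could then drill into.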


Technically .* is a valid regex for those strings, so the issue here is not only to reverse engineer them, but to do so in a way that's meaningful for the person who has to use it after.

It shouldn't be hard to start with .* and recursively split it in two parts that still match the input strings, but I believe you will end up with matching but useless regexes.
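
One possible reading of that splitting idea, as a sketch: anchor on the longest substring common to all inputs, then recurse on what is left and right of it, falling back to `.*` when nothing is shared. The choice of anchor and the function names are assumptions; other splitting rules would give different results.

```python
import re


def longest_common_substring(strings):
    """Longest substring present in every input ('' if there is none)."""
    shortest = min(strings, key=len)
    for size in range(len(shortest), 0, -1):
        for start in range(len(shortest) - size + 1):
            candidate = shortest[start:start + size]
            if all(candidate in s for s in strings):
                return candidate
    return ""


def infer(strings):
    """Recursively split '.*' around shared literals."""
    if not any(strings):
        return ""  # every remaining part is empty
    anchor = longest_common_substring(strings)
    if not anchor:
        return ".*"  # nothing in common: give up on this part
    # Split each string at the first occurrence of the anchor.
    parts = [s.partition(anchor) for s in strings]
    lefts = [p[0] for p in parts]
    rights = [p[2] for p in parts]
    return infer(lefts) + re.escape(anchor) + infer(rights)


examples = ["14:51 [info] 51 some message", "15:22 [error] 24 error!"]
pattern = infer(examples)
```

The resulting pattern still matches every input, but it tends to latch onto incidental shared characters rather than meaningful fields, which is the concern raised above.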



The closest thing I know of to this is https://github.com/devongovett/regexgen (or my Ruby port https://github.com/amake/regexgen-ruby).

    % bundle exec bin/regexgen '14:51 [info] 51 some message' '15:22 [error] 24 error!'
    (?-mix:1(?:4:51\ \[info\]\ 51\ some\ message|5:22\ \[error\]\ 24\ error!))
With enough inputs it should end up with something somewhat reasonable for the leading part, but it will never be smart enough to understand that the error message is "arbitrary" and should be matched with e.g. `(.+)`.
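
To illustrate why the output above looks the way it does, here is a simplified sketch of a trie-based generator. It is not regexgen's actual implementation: it only shares common prefixes and does no minimization, and `build_trie` and `to_regex` are hypothetical names.

```python
import re


def build_trie(words):
    """Nested-dict trie; the empty-string key marks the end of a word."""
    root = {}
    for word in words:
        node = root
        for ch in word:
            node = node.setdefault(ch, {})
        node[""] = {}
    return root


def to_regex(node):
    """Turn a trie node into a regex matching all suffixes below it."""
    optional = "" in node  # a word may end at this node
    alternatives = [
        re.escape(ch) + to_regex(child)
        for ch, child in sorted(node.items())
        if ch != ""
    ]
    if not alternatives:
        return ""
    if len(alternatives) == 1 and not optional:
        return alternatives[0]
    body = "(?:" + "|".join(alternatives) + ")"
    return body + "?" if optional else body


lines = ["14:51 [info] 51 some message", "15:22 [error] 24 error!"]
log_pattern = to_regex(build_trie(lines))
word_pattern = to_regex(build_trie(["foo", "foobar", "baz"]))
```

As with the output above, the shared leading `1` is factored out and everything else stays literal, so the pattern matches only the exact inputs it was built from.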


JS has String.prototype.replaceAll, which can take a regex with multiple capture groups and output them as separate params to a callback function. This can be used to create a functional DSL which generates the regexes and callbacks.


I know this comment isn't helpful on its own, but yes this exists. I've seen it before. I just have no idea what it was called or how to find it again.

EDIT: Ah no, sorry. Was thinking of the other way around[0].

0: https://www.npmjs.com/package/regex-to-strings


RegexBuddy has some limited ability to do this, and the author of that program has a whole separate program called RegexMagic that I believe specializes in exactly this.




