Btw, does anyone know a library/program which can reverse engineer a regex from ...

manx · on July 9, 2022

One probably wants to provide a set of matching and a set of non-matching strings. Then the software would output a regex and some edge-case matching strings and non-matching strings.

This could be built using set operations on deterministic finite automata (DFAs). Every regex is equivalent to a DFA. You can construct an automaton for every positive and every negative example input, then calculate the union of all positive examples and the union of all negative examples, and finally calculate the difference between the two unions. Convert the resulting automaton back to a regex.

https://scanftree.com/automata/dfa-union-property
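A rough sketch of the union/difference idea, assuming the simplest possible setup: each example becomes a DFA that accepts exactly that one string, and the automata are combined with the standard product construction. All names here (`string_dfa`, `combine`, `accepts`) are made up for illustration, and the final step of converting the automaton back to a regex is left out. Note that with finite example sets the result accepts exactly the positives that are not also negatives, so any generalization would have to come from somewhere else.

```python
from functools import reduce

def string_dfa(s):
    """DFA accepting exactly the string s. States are 0..len(s);
    a missing transition means the input is rejected."""
    delta = {(i, ch): i + 1 for i, ch in enumerate(s)}
    return {"start": 0, "accept": {len(s)}, "delta": delta}

def step(dfa, state, ch):
    # None plays the role of the dead (rejecting) state.
    if state is None:
        return None
    return dfa["delta"].get((state, ch))

def combine(a, b, keep):
    """Product construction. keep(in_a, in_b) decides which
    product states are accepting (union, difference, ...)."""
    alphabet = {ch for _, ch in a["delta"]} | {ch for _, ch in b["delta"]}
    start = (a["start"], b["start"])
    delta, accept = {}, set()
    todo, seen = [start], {start}
    while todo:
        state = todo.pop()
        p, q = state
        if keep(p in a["accept"], q in b["accept"]):
            accept.add(state)
        for ch in alphabet:
            nxt = (step(a, p, ch), step(b, q, ch))
            if nxt == (None, None):
                continue  # both sides dead: nothing can be accepted from here
            delta[(state, ch)] = nxt
            if nxt not in seen:
                seen.add(nxt)
                todo.append(nxt)
    return {"start": start, "accept": accept, "delta": delta}

def union(a, b):
    return combine(a, b, lambda x, y: x or y)

def difference(a, b):
    return combine(a, b, lambda x, y: x and not y)

def accepts(dfa, s):
    state = dfa["start"]
    for ch in s:
        state = step(dfa, state, ch)
    return state in dfa["accept"]

# Made-up example inputs.
positives = ["cat", "car", "dog"]
negatives = ["car", "cow"]

pos = reduce(union, map(string_dfa, positives))
neg = reduce(union, map(string_dfa, negatives))
result = difference(pos, neg)
```

Here `accepts(result, "cat")` and `accepts(result, "dog")` hold, while "car" and "cow" are rejected.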

wruza · on July 9, 2022

I was thinking of something that could categorize parts of these strings into a “language”, so there are no non-matching strings. It’s hard to specify in a formal way, but by looking at these strings you may see that e.g. […] is a static syntactic element, and a number follows it, and a time precedes it. This would be nice to have for browsing logs (which these strings are obviously a part of): instead of scrolling through thousands of rows, you would see all of the patterns that occur among them at once, and then dig down into a pattern to inspect what happened and when, to improve the “health” of a complex system. Of course, if you know all of the patterns in advance, it’s easy to filter by each. But lots of software and APIs do not document their output in such detail.
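One crude way to approximate this, as a sketch only: mask the obviously variable parts of each line (times, numbers) with placeholders and count how often each resulting template occurs. The mask patterns, the `<TIME>`/`<NUM>` placeholders, and the sample lines are all assumptions for illustration; a real tool would have to discover the variable parts rather than hard-code them.

```python
import re
from collections import Counter

# Hypothetical masks, applied in order: times first, then bare numbers.
MASKS = [
    (re.compile(r"\b\d{1,2}:\d{2}(?::\d{2})?\b"), "<TIME>"),
    (re.compile(r"\b\d+\b"), "<NUM>"),
]

def template(line):
    """Replace variable-looking parts of a log line with placeholders."""
    for pattern, placeholder in MASKS:
        line = pattern.sub(placeholder, line)
    return line

def summarize(lines):
    """Count how many lines fall under each template."""
    return Counter(template(line) for line in lines)

# Made-up sample lines.
lines = [
    "14:51 [info] 51 some message",
    "15:22 [info] 24 some message",
    "15:30 [error] 7 error!",
]
patterns = summarize(lines)
```

With these lines, `patterns` has two entries: `<TIME> [info] <NUM> some message` (seen twice) and `<TIME> [error] <NUM> error!` (seen once), which is the kind of overview you could then drill into.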

ailef · on July 9, 2022

Technically .* is a valid regex for those strings, so the issue here is not only to reverse engineer them, but to do so in a way that's meaningful for the person who has to use it after.

It shouldn't be hard to start with .* and recursively split it into two parts that still match the input strings, but I believe you will end up with matching but useless regexes.

Banana699 · on July 9, 2022

This is a special case of the general problem of program synthesis[1][2][3][4], where the search space of possible programs is the set of all regex strings and the seed driving the synthesis is a set of input-output examples.

There's research [5][6] as well as practical tools [7][8][9].

[1] https://en.wikipedia.org/wiki/Program_synthesis

[2] https://www.microsoft.com/en-us/research/project/program-syn...

[3] https://dl.acm.org/doi/10.1145/1836089.1836091

[4] https://royalsocietypublishing.org/doi/10.1098/rsta.2015.040...

[5] https://cs.stanford.edu/~minalee/pdf/gpce2016-alpharegex.pdf

[6] https://www.researchgate.net/publication/261794574_Automatic...

[7] https://regex-generator.olafneumann.org/

[8] http://regex.inginf.units.it/extract/

[9] https://stackoverflow.com/questions/6219790/need-a-regex-too...

amake · on July 10, 2022

The closest thing I know of to this is https://github.com/devongovett/regexgen (or my Ruby port https://github.com/amake/regexgen-ruby).

    % bundle exec bin/regexgen '14:51 [info] 51 some message' '15:22 [error] 24 error!'
    (?-mix:1(?:4:51\ \[info\]\ 51\ some\ message|5:22\ \[error\]\ 24\ error!))

With enough inputs it should end up with something somewhat reasonable for the leading part, but it will never be smart enough to understand that the error message is "arbitrary" and should be matched with e.g. `(.+)`.
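To show why the output looks like that, here is a simplified sketch of the general trie-to-regex idea: shared prefixes are factored out and everything that differs becomes an alternation. This is not the actual regexgen implementation, and `build_trie` and `trie_to_regex` are made-up names.

```python
import re

def build_trie(words):
    """Nested dicts keyed by character; the "" key marks end of a word."""
    root = {}
    for word in words:
        node = root
        for ch in word:
            node = node.setdefault(ch, {})
        node[""] = {}
    return root

def trie_to_regex(node):
    """Emit a regex matching exactly the words stored below this node."""
    ends_here = "" in node
    branches = [
        re.escape(ch) + trie_to_regex(child)
        for ch, child in sorted(node.items())
        if ch != ""
    ]
    if not branches:
        return ""
    if len(branches) == 1 and not ends_here:
        return branches[0]
    body = "(?:" + "|".join(branches) + ")"
    # A word may end here while longer words continue: make the rest optional.
    return body + "?" if ends_here else body

words = ["14:51 [info] 51 some message", "15:22 [error] 24 error!"]
pattern = trie_to_regex(build_trie(words))
```

The resulting `pattern` factors out the shared leading `1` and alternates over the rest, and it matches the two inputs and nothing else, which is exactly the limitation: a new timestamp or message will not match.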

eurasiantiger · on July 9, 2022

JS has String.prototype.replaceAll, which can take a regex with multiple capture groups and output them as separate params to a callback function. This can be used to create a functional DSL which generates the regexes and callbacks.
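A sketch of what such a DSL might look like, written in Python with `re.sub` and a replacement function standing in for `replaceAll` with a callback. The helpers `lit`, `group`, `seq`, and `rule` are invented for this example, and named groups are passed to the handler as keyword arguments rather than as positional params.

```python
import re

def lit(text):
    """A literal piece of the pattern, escaped."""
    return re.escape(text)

def group(name, pattern):
    """A named capture group; the name becomes a handler argument."""
    return f"(?P<{name}>{pattern})"

def seq(*parts):
    """Concatenate pattern pieces."""
    return "".join(parts)

def rule(pattern, handler):
    """Pair a generated regex with a callback that receives its groups."""
    compiled = re.compile(pattern)

    def apply(text):
        return compiled.sub(lambda m: handler(**m.groupdict()), text)

    return apply

# Hypothetical rule: rewrite "HH:MM [level]" prefixes.
rewrite = rule(
    seq(group("time", r"\d{2}:\d{2}"), lit(" ["), group("level", r"\w+"), lit("]")),
    lambda time, level: f"{level.upper()} at {time}",
)

output = rewrite("14:51 [info] 51 some message")
```

Here `output` is `INFO at 14:51 51 some message`.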

junon · on July 9, 2022

I know this comment isn't helpful on its own, but yes this exists. I've seen it before. I just have no idea what it was called or how to find it again.

EDIT: Ah no, sorry. Was thinking of the other way around[0].

0: https://www.npmjs.com/package/regex-to-strings

nerdponx · on July 9, 2022

RegexBuddy has some limited ability to do this, and the author of that program has a whole separate program called RegexMagic that I believe specializes in exactly this.

tomerv · on July 9, 2022

http://regex.inginf.units.it/extract/