Will be interesting to see if there is some way to train a decompilation module ...

userbinator · on March 17, 2024

If this gets really good, maybe we can dream of having a fully de-obfuscated and open source life. All the layers of binary blobs in a PC can finally be decoded. All the drivers can be open. Why not do the OS as well!

Decompilers already exist and are really good. If an LLM can do the same as these existing decompilers, you can bet the lawyers will consider it an equivalent process. The main problem is legal/political, not technical.

acureau · on March 18, 2024

I don't know if I'd call the output of modern decompilers "really good", not for native code anyway. They're just a little better than raw disassembly. Even state-of-the-art decompilers struggle to reconstruct control flow, distinguish data from code, and identify the presence of a variable, let alone its type, and they fundamentally lack context. If an LLM could be used even just to reliably reconstruct symbol information, it would be game-changing.

coddle-hark · on March 17, 2024

I wrote my bachelor thesis on something tangential — basically, some researchers found that it was possible in some very specific circumstances to train a classifier to do author attribution (i.e. figure out who wrote the program) based just on the compiled binaries they produced. I don’t think the technique has been used for anything actually useful, but it’s cool to see that individual coding style survives the compilation process, so much so that you can tell one person’s compiled programs apart from another’s.
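
A toy sketch of the general idea, for illustration only: it is not the method from the research mentioned above, and the byte n-gram features, the cosine-similarity matching, and the sample byte strings are all assumptions made up for this example. It builds a per-author profile of byte bigrams from known binaries and attributes an unknown blob to the closest profile.

```python
from collections import Counter
from math import sqrt


def ngram_profile(code: bytes, n: int = 2) -> Counter:
    """Count overlapping byte n-grams in a blob of machine code."""
    return Counter(code[i:i + n] for i in range(len(code) - n + 1))


def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two n-gram count vectors."""
    dot = sum(count * b[gram] for gram, count in a.items())
    norm_a = sqrt(sum(c * c for c in a.values()))
    norm_b = sqrt(sum(c * c for c in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def train(samples: dict[str, list[bytes]]) -> dict[str, Counter]:
    """Merge each author's known blobs into one aggregate profile."""
    profiles = {}
    for author, blobs in samples.items():
        total = Counter()
        for blob in blobs:
            total.update(ngram_profile(blob))
        profiles[author] = total
    return profiles


def attribute(profiles: dict[str, Counter], blob: bytes) -> str:
    """Return the author whose profile is most similar to the blob."""
    target = ngram_profile(blob)
    return max(profiles, key=lambda author: cosine(profiles[author], target))


# Made-up byte strings standing in for compiled code; not real data.
samples = {
    "alice": [b"\x55\x48\x89\xe5\x5d\xc3" * 3],
    "bob": [b"\x31\xc0\x31\xdb\xc3" * 3],
}
profiles = train(samples)
```

Calling `attribute(profiles, some_bytes)` then returns the best-matching author name. A real attempt would need far richer features and many samples per author; this only shows the shape of the pipeline.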

astrange · on March 18, 2024

Do you mean the whole binary or just the text segment/instructions?

Because I think this gets a lot easier if you can look at the symbol table, strings, and codesigning certificate.

ZitchDog · on March 17, 2024

I doubt the code would be identifiable. It wouldn’t be the actual code written, but it would be very similar. But I assume many elements of code style would be lost, and any semblance of code style would be more or less hallucinated.

K0IN · on March 17, 2024

If it can generate tests from the decompiled code, we could reimplement it in our own code style. Might be cool to have a bunch of LLMs working together with feedback loops.