
I'd be interested to see lossy text compression. That would also give much more scope for AI methods. It turns out that Stable Diffusion recently gave excellent image compression results:

https://pub.towardsai.net/stable-diffusion-based-image-compr...

Funny thing is, on a per-pixel basis, Stable Diffusion and JPEG are just as good. Stable Diffusion looks much better, though. The reason is that when Stable Diffusion lacks the information, it just invents what it expects to see there. If you zoom in on the background, entire apartment buildings are invented, moved, or have disappeared.

So, one wonders what GPT-3 could do for text compression. On a per-character basis, I would not expect any miracles. On a "generally says the same kind of thing" basis, I'd expect something special.



A lossy text compressor can be converted into a lossless compressor quite easily by encoding the remaining differences between the lossy reproduction and the original text. The more accurate your lossy compressor is, the less additional information you need to encode the differences. You can get even better results if your lossy compressor is probabilistic, and can compute approximate probabilities for different text continuations. GPT-3 is AFAIK probabilistic and should be applicable... maybe someone has already tried?
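The lossy-to-lossless construction above can be sketched in a few lines. This is a toy illustration, not anyone's actual system: the vowel-dropping `lossy_compress` is a hypothetical stand-in for a real lossy model (e.g. an LLM-based reconstructor), and only the round-trip logic is the point. The residual stores literal text only where the lossy guess is wrong, so a more accurate guess means a smaller residual.

```python
# Sketch: lossy reconstruction + residual edits = lossless codec.
import difflib

def lossy_compress(text):
    # Toy "lossy" codec: drop all vowels (stand-in for a real model).
    return "".join(c for c in text if c.lower() not in "aeiou")

def encode_residual(lossy, original):
    # Opcodes that edit the lossy guess back into the original.
    # "equal" spans cost only indices; everything else stores literal text.
    sm = difflib.SequenceMatcher(a=lossy, b=original, autojunk=False)
    return [(tag, i1, i2, None if tag == "equal" else original[j1:j2])
            for tag, i1, i2, j1, j2 in sm.get_opcodes()]

def decode(lossy, residual):
    out = []
    for tag, i1, i2, literal in residual:
        out.append(lossy[i1:i2] if tag == "equal" else literal)
    return "".join(out)

original = "The quick brown fox jumps over the lazy dog."
lossy = lossy_compress(original)
residual = encode_residual(lossy, original)
assert decode(lossy, residual) == original  # exact round trip
```

A probabilistic model does better still: instead of storing literal diffs, you feed the model's next-token probabilities into an arithmetic coder, so tokens the model predicts well cost almost nothing.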


But if you're not careful, the delta between the lossy reconstruction and the original could be bigger than the savings from the compression.

In my armchair speculation, for severe lossy text compression the delta/diff could easily exceed the savings, making a strong compression algorithm based on something like GPT-3 impractical for lossless use.


Instead of storing the differences, use a spell checker. The better written the text, with proper grammar, the easier it is for a spell checker to get the reconstruction right. Yes, it's patented.


I hope that lossy text compression never becomes popular; it should ideally remain a technological curiosity. Think of all the damage that could be caused by incompetence combined with use of lossy text compression in the wrong places.


Human brains have been performing lossy text compression for thousands of years. I think we’ll be ok.


For classical compression benchmarking, you need to include the size of the decoder; GPT can already recite most lyrics and speeches given a single-line prompt.

The state of the art in neural language models was evaluated ~5 years ago, and it was found that standard LSTMs do very well on text compression when properly architected and parametrized.

The main reason for excluding lossy text compression in these tests is that there is no clear path around requiring a panel of human judges, so evaluation becomes subjective instead of objective.

Perhaps a different route would be to task an AI to compress Wikipedia into a (graph) knowledge base, and then test these AIs on correctly answering "multiple choice"-questions. But then intelligence becomes a proxy for measuring compression, instead of here, where compression is chosen as a proxy for measuring intelligence.




