
I'd be interested to see lossy text compression. That would also give much more scope for AI methods. It turns out that Stable Diffusion recently gave excellent image compression results:

https://pub.towardsai.net/stable-diffusion-based-image-compr...

Funny thing is, on a per-pixel basis, Stable Diffusion and JPEG are just as good. Stable Diffusion looks much better, though. The reason is that when Stable Diffusion lacks the information, it just invents what it expects to see there. If you zoom in on the background, entire apartment buildings are invented, moved, or have disappeared.

So, one wonders what GPT-3 could do for text compression. On a per-character basis, I would not expect any miracles. On a "generally says the same kind of thing" basis, I'd expect something special.



A lossy text compressor can be converted into a lossless compressor quite easily by encoding the remaining differences between the lossy reproduction and the original text. The more accurate your lossy compressor is, the less additional information you need to encode the differences. You can get even better results if your lossy compressor is probabilistic, and can compute approximate probabilities for different text continuations. GPT-3 is AFAIK probabilistic and should be applicable... maybe someone has already tried?
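The lossy-to-lossless construction above can be sketched in a few lines. This is a toy illustration, not anyone's actual system: the vowel-dropping `lossy_compress` is a hypothetical stand-in for a real lossy model (e.g. an LLM-based reconstructor), and only the round-trip logic is the point. The residual stores literal text only where the lossy guess is wrong, so a more accurate guess means a smaller residual.

```python
# Sketch: lossy reconstruction + residual edits = lossless codec.
import difflib

def lossy_compress(text):
    # Toy "lossy" codec: drop all vowels (stand-in for a real model).
    return "".join(c for c in text if c.lower() not in "aeiou")

def encode_residual(lossy, original):
    # Opcodes that edit the lossy guess back into the original.
    # "equal" spans cost only indices; everything else stores literal text.
    sm = difflib.SequenceMatcher(a=lossy, b=original, autojunk=False)
    return [(tag, i1, i2, None if tag == "equal" else original[j1:j2])
            for tag, i1, i2, j1, j2 in sm.get_opcodes()]

def decode(lossy, residual):
    out = []
    for tag, i1, i2, literal in residual:
        out.append(lossy[i1:i2] if tag == "equal" else literal)
    return "".join(out)

original = "The quick brown fox jumps over the lazy dog."
lossy = lossy_compress(original)
residual = encode_residual(lossy, original)
assert decode(lossy, residual) == original  # exact round trip
```

A probabilistic model does better still: instead of storing literal diffs, you feed the model's next-token probabilities into an arithmetic coder, so tokens the model predicts well cost almost nothing.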


But if you're not careful, the delta between the lossy reconstruction and the original could be bigger than the savings from the compression.

In my armchair speculation, for severe lossy text compression the delta/diff could easily exceed the savings, making a strong compression algorithm based on something like GPT-3 impractical for lossless use.


Instead of storing the differences, use a spell checker. The better written the text, with proper grammar, the easier it is for a spell checker to get the reconstruction right. Yes, it's patented.


I hope that lossy text compression never becomes popular; it should ideally remain a technological curiosity. Think of all the damage that could be caused by incompetence combined with use of lossy text compression in the wrong places.


Human brains have been performing lossy text compression for thousands of years. I think we’ll be ok.


For classical compression benchmarking, you need to include the size of the decoder; GPT can already recite most lyrics and speeches given a single-line prompt.

The state of the art in neural language models was evaluated ~5 years ago, and it was found that standard LSTMs do very well on text compression when properly architected and parametrized.

The main reason for excluding lossy text compression in these tests is that there is no clear path around requiring a panel of human judges, so evaluation becomes subjective instead of objective.

Perhaps a different route would be to task an AI to compress Wikipedia into a (graph) knowledge base, and then test these AIs on correctly answering "multiple choice"-questions. But then intelligence becomes a proxy for measuring compression, instead of here, where compression is chosen as a proxy for measuring intelligence.




