When to use Dictionary Compression (fastcompression.blogspot.com)
64 points by ot on Feb 17, 2018 | hide | past | favorite | 16 comments


For real-time dictionary compression with an unbounded dictionary (LZ77, using a hash table for O(1) pattern search), you can try:

git clone https://github.com/faragon/libsrt

cd libsrt

make

./enc -ezh <hugefile >hugefile.z

./enc -dz <hugefile.z >hugefile.again

For a 1 MB dictionary (faster, because of a higher cache hit ratio and a much smaller buffer):

./enc -ez <hugefile >hugefile.z

./enc -dz <hugefile.z >hugefile.again

For the source code, check the compression (senc_lz_aux()) and decompression (sdec_lz()) functions; they are much easier to understand than the Zstandard code because they have fewer features:

https://github.com/faragon/libsrt/blob/master/src/saux/senc....


The article does not say what dictionary compression is or how it works, or have I missed it? That would be interesting, since dictionary compression seems to be the topic of the article.


Usually, when you compress, it's with, say, gzip, rather than something custom for the data. The compression dictionary is kinda implicitly built and encoded along with the data.

If you have, like, some standard data (handwave), you can build a dictionary of frequently-used terms and refer to it when you compress and decompress, and you can get better compression for being able to point at that pre-built dictionary.

I might be totally off base here, as I think database normalization would remove much of the redundancy that a dictionary would target.


Database normalization has little to do with this. It's about structurally removing redundancy from your schema. Like not having the same column and data twice in two tables. It's not going to do anything if you have two rows with common data such as two people with the same first name. In the context of compression, dictionaries also don't generally correspond to a set of words or records per se but rather arbitrary (sub)strings or blobs.


I think they were trying to say if you had a normalized database where instead of repeated varchar fields, you had a foreign key — that would act like a dictionary.

    CREATE TABLE customer(
    id SERIAL,
    name VARCHAR(1024),
    status VARCHAR(1024)
    )
Vs

    CREATE TABLE customer(
    id SERIAL,
    name VARCHAR(1024),
    status INT REFERENCES status(id)
    )

    CREATE TABLE status(
    id SERIAL,
    name VARCHAR(1024)
    )
 
It wasn’t very clear from the parent above, but I think this is where they were trying to go. The normalized version is smaller because you’ve swapped a repeated varchar ‘status’ for an int — so that’s effectively a dictionary. It’s not entirely correct, but it’s not that far off.


That makes sense, thanks for pointing out my misreading of the parent comment.


https://en.m.wikipedia.org/wiki/Dictionary_coder (and an arbitrary large number of other explanations that you can find with a search)

In a nutshell, both the encoding and decoding side hold a dictionary that maps long sequences of data to still-unique short ones and vice versa (i.e. a bijective mapping). The encoder uses it to compress; the decoder reverses it.

There are obviously many interesting practical questions such as how the dictionary is synchronized, how the mappings are determined, how to keep it a bijective function, etc.


"Dictionary compression" can be any scheme in which coded symbols in the compressed data are references to some body of accompanying data where their replacements are found.


I know when not to: when the decompression needs to be done in an interrupt routine. I just did a "run or delta" system for G-code. Decoding in a microcontroller interrupt routine, I could not quite get the dictionary-based routines to run fast enough or in little enough RAM.


Was this your first embedded project? Never EVER do actual work in an interrupt routine: no compression, no floating point (even if it's native and not emulated), no computation, no lookups. Just pass the marker along and exit.


> Never EVER do actual work in interrupt routine

Never? Sometimes you have to in order to adhere to timing requirements.

Say we're talking about a USB device. You'd better respond to EP0 control requests without a data stage in under 50 ms; the USB spec demands that.

Even more so when there's physical safety involved. You might really want to stop that stepper motor long before the main loop notices your marker...

Then there are those situations where you need to do the processing in the IRQ or lose (some of) the data due to system constraints.

Just never say never.


Not the OP, and I have never done embedded programming, but you got me hooked with the phrase "pass the marker along", and my search engine skills sadly failed me. Do you mean the preferred way is to do the actual work in the main routine, and the interrupt routine only sets a marker that the work should be triggered?


Ideally, an ISR should just collect whatever information is necessary for the main thread to actually handle the event. Compare this to raising and handling an event in e.g. JavaScript: the event payload is handled by the main thread; the ISR would just be responsible for creating that event payload object. (This is a rough comparison, to be honest.)

The problem with ISRs is that they put the processor in a very fragile state. Different chips handle e.g. nested interrupts (an interrupt that occurs while another interrupt is being handled) differently, but no matter how they do it, you won't like it. Sometimes it turns off certain types of memory access protection, or other important (hardware) peripherals end up ignored (because they themselves operate using those same interrupts under the hood); it's all madness.

Hence the interrupt just puts together some event data, sets a flag, and the main event loop handles it when the time comes.


Completely OT, but this site is horrible on mobile. Trying to zoom in changes articles. Feels like any interaction changes pages.


Ironic how Blogger pages have become much worse for mobile in the face of Google promoting AMP.


Certainly it feels like Google wants Blogger to die already. Same with Google Groups.



