
Trust me. If Linux really eats your RAM to the point of reaching an OOM state, you will know.

(This was of course because of having too many apps relative to my RAM. Not because of disk caching.)

The OOM behavior is not pleasant for a desktop system.



And this (the OOM behaviour instead of the paging behaviour on Linux) is something that can (and should) be criticised. Every time I encountered a situation where I was running out of memory (usually due to some out-of-control process), the system would become completely unusable. All interactivity was gone, so it was impossible to kill the out-of-control process (typically a misconfigured program I had started). If the OOM killer did take action, it would almost never kill the process that was gobbling up memory like crazy, but instead one of the other apps necessary to intervene (e.g. the terminal or the WM). It always seemed incredibly stupid to me.

I remember some time back there was discussion about improving the OOM killer, but I don't know what came out of it.


This may or may not preserve your desktop and other important applications in an OOM situation. https://github.com/hakavlad/prelockd

I've heard of good results with it, and the set of applications locked in memory is configurable.


It's interesting how much machinery we have in Linux now for making OOM decisions (even userland daemons), yet on every modern distribution it still ends up killing your desktop environment instead of the fricking C++ compiler jobs that caused the problem in the first place.


    Out of memory: kill process 12345
    Killed process 12345 (sshd)
is the funniest and ugliest message to see on the iLO/VM console.
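
A mitigation I've used since is a drop-in that tells the kernel's OOM killer to never pick sshd. A sketch from memory (the unit is ssh.service on Debian-likes, sshd.service elsewhere):

    mkdir -p /etc/systemd/system/sshd.service.d
    cat > /etc/systemd/system/sshd.service.d/oom.conf <<EOF
    [Service]
    # -1000 maps to oom_score_adj -1000: never a victim
    OOMScoreAdjust=-1000
    EOF
    systemctl daemon-reload && systemctl restart sshd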


Really broken and stupid is how I would describe it. Typically it just hangs hard with the disk at 100%, and if you're really patient you might be able to get a shell and kill some things over the course of the next 10 minutes.


It has been fixed by the MGLRU patchset since kernel 6.1. Do:

    cat > /etc/tmpfiles.d/mglru-min-ttl.conf <<EOF
    w-      /sys/kernel/mm/lru_gen/enabled          -       -       -       -       y
    w-      /sys/kernel/mm/lru_gen/min_ttl_ms       -       -       -       -       1000
    EOF
and reboot. I've been struggling with this issue, as many others have, for years. Now I can run two VMs with 8 GB of physical RAM and 5+ GB swapped, and it's barely noticeable.

More information, although a bit outdated (pre-MGLRU): https://notes.valdikss.org.ru/linux-for-old-pc-from-2007/en/... (Linux issue: poor performance under RAM shortage conditions)
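
After the reboot you can sanity-check that it took effect (file names per the kernel's multi-gen LRU docs; the enabled value is a bitmask):

    cat /sys/kernel/mm/lru_gen/enabled      # e.g. 0x0007
    cat /sys/kernel/mm/lru_gen/min_ttl_ms   # should print 1000

You can also write to those files directly to try it without rebooting.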


This is why I stopped having swap on my desktop. I prefer a clean death to prolonged agony.


Having no swap was no panacea, because all the memory-mapped code sections of your running programs effectively count as "available" clean pages that can be evicted when memory is tight, and they'll cause thrashing just as much as swap would. The solution is to OOM-kill processes before that happens.


Hmm, I personally haven't experienced any thrashing after disabling swap. Instead of the desktop freezing or lagging heavily until I'm somehow able to kill some apps to free memory, after struggling for 10 minutes to open a terminal, it now instantly crashes back to the login screen when running out of memory.


How does the NT kernel handle OOM situations compared to Linux? I know it feels a lot smoother, almost a non-problem (it will slow down for a few seconds and get back to normal), but I wonder what goes on behind the scenes and why (if?) Linux has a different approach.


I don't know the full answer, but on Windows the problem is less significant because of the core memory management decisions that were made.

In Linux you get a ton of copy-on-write memory: every fork() (the most basic way of multiprocessing) creates a new process that shares all of its memory with its parent. Only when something is written does the child process actually get "its" memory pages.

To put that into perspective, imagine you have only one process in your system, and it has a big 4 GB buffer of rw memory allocated. So far so good. Then you fork() three times; your overall system memory usage is still roughly 4 GB. And now all four processes (the parent and 3 children) overwrite that 4 GB buffer with random values. Only at this point does your system RAM usage spike to 16 GB.

This means that the thing that actually OOMs may be just "buffer[i] = 1". It's very hard to recover from this gracefully, because it is an exceptional situation, and exceptional situations may require more allocations, which are by then impossible. Now compare that to Windows, where most memory allocations happen at predictable moments, like when malloc() is called, and failures can be safely handled at that point.

So, in the ideal situation, Windows running out of memory will just stop giving new memory to processes and every malloc will fail. On Linux that's not an option, since every write to a memory location can suddenly cause an allocation due to copy-on-write.
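
For what it's worth, you can push Linux toward the Windows model with strict overcommit accounting. A sketch, with the big caveat that plenty of Linux software assumes overcommit and will fail to start under mode 2:

    # commit limit becomes swap + 80% of RAM, so allocations
    # fail up front at malloc()/fork() instead of a later
    # page fault waking the OOM killer
    sysctl -w vm.overcommit_memory=2
    sysctl -w vm.overcommit_ratio=80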


Which can lead to dozens of unrelated applications dying on Windows when they assume infallible allocators, while Linux keeps going (sluggishly) until it has to kill just the biggest one.


I've worked on memory constrained Windows VMs. The problem shows up as the application you're on dying, because guess what, you're trying to allocate memory that isn't there.

The rest of the system is still usable.

It's fine.

For the longest time I also ran with no swap on Windows (and just an excessive amount of memory). I'd notice I'd run out of memory when a particularly hungry application like Affinity Photo died while I had a zillion browser tabs open, but again, the system stayed perfectly responsive and fine.

The Windows behavior seems much closer to deterministic and much more sane than the OOM killer of Linux.


I've had important background processes die on Windows when the offender didn't die and the OOM situation persisted for some time, presumably because the offender was using fallible allocations while the other processes weren't.


The Linux swapper used to be very aggressive on file cache, evicting it even when you'd need all of those libraries again within the next second. That is the main reason for the slowdowns.

Fortunately we now have the MGLRU patchset, which "freezes" the active file cache for a desired number of milliseconds, and is in general a much smarter algorithm.


This may be applicable for desktops, but not for servers.

In a low-memory situation, the admin wants to ssh into the server and fix the problem that led to memory exhaustion in the first place. Whoops: MGLRU freezes only the active file cache, which includes the memory hog but not sshd, bash, PAM, and the other files that are normally unused when nobody is logged in yet become essential during an admin intervention. So, de facto, the admin still cannot log in, and the server is effectively inaccessible. The only difference is that the production application is still responding, which is not so helpful for restarting it.
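
On cgroup v2 you can at least give the intervention path a reclaim guarantee, e.g. with systemd's MemoryMin=. A sketch (the 64M figure is a guess to adapt; unit name varies by distro):

    mkdir -p /etc/systemd/system/sshd.service.d
    cat > /etc/systemd/system/sshd.service.d/memory.conf <<EOF
    [Service]
    # pages under the guarantee are exempt from reclaim,
    # so sshd's code doesn't get evicted with everything else
    MemoryMin=64M
    EOF
    systemctl daemon-reload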


The main problem with Linux OOM behaviour is exactly what counts as "available" memory. In essence, when the system is really low on memory, it will evict all the pages that are "available", which includes all the pages that are clean and can be reloaded from disc, which of course includes all the memory-mapped code segments of all your running software. Because of that, the system really runs at a crawl, since every little bit of progress involves loading in a page of code before running it.

Recent versions are a lot better, but certainly ten years ago, on systems with a very large amount of memory, this could cause the system to become basically completely unresponsive. The solution was to get the OOM killer to start taking action a lot earlier, so that it never reached the point of being so low on memory that it would thrash like that. There is a program called earlyoom that helped with that.
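
It's packaged in most distros these days. The knobs as I remember them (double-check earlyoom --help):

    # kill the largest process once available RAM < 10% and
    # free swap < 5%, rather than waiting for the kernel
    earlyoom -m 10 -s 5 --avoid '(^|/)(Xorg|sshd)$' --prefer '(^|/)(cc1plus|ld)$'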


Over the last few years there has been ongoing work to improve this, including improved pressure detection, the multi-generational LRU, swapping of large/huge pages, and a bunch of other things. Some aren't enabled by default; some need userspace daemons to make use of them.

So the out-of-the-box experience of some random distro is not necessarily the best you can get, especially on older kernels.


> there has been ongoing work to improve this. Including improved pressure detection,

Are you referring to the /proc/pressure interface?

https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/lin...
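
For reference, each file reports running averages plus cumulative stall time, one line for "some tasks stalled" and one for "all non-idle tasks stalled":

    $ cat /proc/pressure/memory
    some avg10=0.00 avg60=0.00 avg300=0.00 total=0
    full avg10=0.00 avg60=0.00 avg300=0.00 total=0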


Yes, the pressure information is used by userspace OOM killers, but AFAIK it's also used internally to detect whether progress has been made on reclaiming memory, and it's supposed to be better than the previous progress heuristic.
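
systemd-oomd is one such userspace killer. Fedora, for instance, ships PSI-based defaults along these lines (details from memory; check the drop-ins your distro installs under /usr/lib/systemd):

    # kill the worst cgroup under a user slice when memory
    # pressure stays high, instead of letting it thrash
    cat > /etc/systemd/system/user@.service.d/10-oomd.conf <<EOF
    [Service]
    ManagedOOMMemoryPressure=kill
    ManagedOOMMemoryPressureLimit=50%
    EOF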


Same with disk!

I'm running Arch with i3wm. I didn't get ANY notification or helpful error message when I ran out! Instead, somehow, ghcup installed a corrupted version of cabal that would segfault every time it was invoked. That was my only hint at first. I eventually ran df -h and discovered what was going on, but man...


I've never encountered this with "many apps" starting to OOM, but many times with one process OOMing. That one will simply crash and everything else continues to run unharmed.


What distribution are you using?

IME, if a process grows out of control, Linux won't notice until the whole system is thrashing, at which point it's too late and it tries killing random things like browser tabs way before the offending process.

In rare cases Linux might recover, but only because I hammered C-c in the right place 30 minutes ago. In most cases a hard reboot is required (I left it overnight once, fans spinning at max, hoping the kernel would eventually accept my plea for help, but she had other priorities).


I guess OOM is more problematic on low-memory systems or when you have more than a nominal amount of swap.

If you have enough memory that the desktop environment, browsers, and other processes that should keep running only use a small fraction of it, the OOM killer can pick a reasonable target reliably. A process that tries to allocate too much memory gets killed, and everything is robust and deterministic. I sometimes trigger OOM several times in an hour, for example when trying to find reasonable computational parameters for something.
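
When I expect a run might blow up, I start it in its own scope so the kill lands on exactly that job (the 24G cap and ./simulate are placeholders):

    systemd-run --user --scope -p MemoryMax=24G -p MemorySwapMax=0 ./simulate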


On the contrary, it's worse on systems with lots of memory, because those are the systems that are trying to do more.

About 8 years ago I got a work machine with 384 GB of RAM, and I installed earlyoom on it to start killing processes a whole load earlier; otherwise the system would just become completely unresponsive for hours if one of my students/colleagues accidentally ran something that made it run out of RAM.


Linux demonstrably got better, because I've got a 200 GB multi-user machine where OOM kills (on stock Debian 12) are largely uneventful.


How much memory do you think is reasonable? I've had it happen to me with 16 GB and even 32 GB, whereas I never have this issue on Windows (unless for some reason I'm on a 2 GB RAM system for God knows why). I wish people would stop defending pathological behavior that's broken for standard desktop use. What's wrong with wanting things to improve?


Nobody was defending anything. I just said that I don't remember having any issues with the Linux OOM killer, and guessed at a potential reason.

I haven't really used any Windows version later than 2000 for anything except gaming, so I don't know how things work there these days. I mostly use macOS and Linux, and I've had far more trouble with pathological memory management behavior in macOS. Basically, macOS lets individual processes allocate and use far more memory than is physically available. When I'm running something with unpredictable memory requirements, I have to babysit the computer and kill the process manually if necessary, or the system may become slow and poorly responsive.


Ubuntu.

I'm guessing you are referring to "swapping", though?

If it's just one user process, it'll be killed by the OOM killer¹. That application will just be gone: poof. And for the rest you'll probably not notice anything, not even a hiccup in your Bluetooth headphones.

If it's many services, or services that are exempt from that killer, your system might start swapping. Which, indeed, leads to the case you describe.

¹https://unix.stackexchange.com/questions/153585/how-does-the...
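
If an app just vanishes and you want to confirm it was the OOM killer, the kernel log will say so:

    journalctl -k | grep -i 'out of memory'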



