Hacker News | eklitzke's comments

Pretty much all of the history of HN front pages, posts, and comments is surely in the Gemini training corpus. Therefore it seems totally plausible that Gemini would understand HN inside jokes or sentiment beyond what's literally on the front page shown in the prompt, especially since the prompt specifically stated that this is the front page of HN.


I think it's important to point out the distinction between what POSIX mandates and what actual libc implementations, notably glibc, do. Nearly all non-reentrant POSIX functions are only non-reentrant if you are using a 1980s computer that for some reason has threads but doesn't have thread-local storage. Functions like strerror are implemented using TLS in glibc nowadays, so while it is technically true that you need to use the _r versions if you want to be portable to computers nobody has used in 30 years, in practice you usually don't need to worry about this, especially on Linux, since the results are stored in static thread-local memory rather than static global memory.
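
For illustration, the portable pattern next to the "modern glibc" shortcut looks roughly like this (just a sketch; it assumes the XSI strerror_r, i.e. _GNU_SOURCE not defined -- the GNU variant returns char* instead):

    #include <errno.h>
    #include <stdio.h>
    #include <string.h>

    static void report(int err)
    {
        /* Portable: every thread passes its own buffer (XSI variant returns 0 on success). */
        char buf[256];
        if (strerror_r(err, buf, sizeof(buf)) == 0)
            fprintf(stderr, "error: %s\n", buf);

        /* glibc in practice: strerror() writes into thread-local storage,
           so this is also fine on Linux even though POSIX doesn't promise it. */
        fprintf(stderr, "error: %s\n", strerror(err));
    }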

As for the string.h stuff, while it is all terrible it's at least well documented that everything is broken unless you use wchar_t, and nobody uses wchar_t because it's the worst possible localization solution. No one is seriously trying to do real localization in C (and if they were they'd be using libicu).


strerror, at least on glibc, was only made thread safe back in 2020[1], which is really not that long ago in the grand scheme of things. It was WONTFIXed when it was initially reported back in 2005(!). There have only been 10 glibc releases since then and the 2.32 branch is still actively maintained.

There is probably a wide breadth of software in active use that still isn't on a glibc version with that fix.

But yeah, agreed that trying to do localization with the built-in functions is fraught with traps and pitfalls. Part of the problem, though, is less about localization and more that you can have bugs inflicted on you if you're not careful to just overwrite the locale with the C locale (and to do this everywhere you can).
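
Concretely, "pin the C locale" looks something like this (a sketch; newlocale/uselocale are POSIX.1-2008, so any remotely modern libc has them):

    #include <locale.h>
    #include <stdio.h>
    #include <stdlib.h>

    int main(void)
    {
        /* Process-wide: force the C locale so number formatting, message
           strings, etc. don't change underneath you. */
        setlocale(LC_ALL, "C");

        /* Or scoped to the current thread, without touching global state: */
        locale_t c = newlocale(LC_ALL_MASK, "C", (locale_t)0);
        locale_t old = uselocale(c);
        double d = strtod("3.14", NULL);  /* '.' is guaranteed to be the decimal point here */
        uselocale(old);
        freelocale(c);

        printf("%f\n", d);
        return 0;
    }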

[1]: https://sourceware.org/bugzilla/show_activity.cgi?id=1890 (see specifically the target milestone, the 2023 date seems to be overly pessimistic)


NUMA has a huge amount of overhead (e.g. in terms of inter-core latency), and multi-socket NUMA servers cost a lot more than single-socket boards. If you look at the servers at Google or Facebook they will have some NUMA servers for certain workloads that actually need them, but most servers will be single socket because they're cheaper and applications literally run faster on them. It's a win-win if you can fit your workload on a single-socket server, so there is a lot of motivation to make applications work in a non-NUMA way if at all possible.


A few reasons, I think.

The first is that getaddrinfo is specified by POSIX, and POSIX evolves very conservatively and at a glacial pace.

The second reason is that specifying a timeout would break symmetry with a lot of other functions in Unix/C, both system calls and libc calls. For example, you can't specify a timeout when opening a file, reading from a file, or closing a file, which are all potentially blocking operations. There are ways to do these things in a non-blocking manner with timeouts using aio or io_uring, but those are already relatively complicated APIs for what are otherwise simple system calls, and getaddrinfo is much more complicated than a system call.

The last reason is that if you use the sockets APIs directly it's not that hard to write a non-blocking DNS resolver (c-ares is one example). The thing is though that if you write your own resolver you have to consider how to do caching, it won't work with NSS on Linux, etc. You can implement these things (systemd-resolved does it, and works with NSS) but they are a lot of work to do properly.
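
For illustration, the quick-and-dirty workaround people actually reach for is to push getaddrinfo onto a helper thread and wait on it with a deadline. A rough sketch (the "443" service is just an example, and the lookup struct is deliberately leaked on timeout because the detached worker still owns it; c-ares or systemd-resolved are the grown-up options):

    #include <errno.h>
    #include <netdb.h>
    #include <pthread.h>
    #include <stdlib.h>
    #include <string.h>
    #include <sys/socket.h>
    #include <time.h>

    struct lookup {
        char *host;
        struct addrinfo *result;
        int err, done;
        pthread_mutex_t mu;
        pthread_cond_t cv;
    };

    static void *worker(void *arg)
    {
        struct lookup *l = arg;
        struct addrinfo hints = { .ai_socktype = SOCK_STREAM };
        int err = getaddrinfo(l->host, "443", &hints, &l->result);
        pthread_mutex_lock(&l->mu);
        l->err = err;
        l->done = 1;
        pthread_cond_signal(&l->cv);
        pthread_mutex_unlock(&l->mu);
        return NULL;
    }

    /* Returns an addrinfo list, or NULL on error or timeout. */
    struct addrinfo *resolve_with_timeout(const char *host, int seconds)
    {
        struct lookup *l = calloc(1, sizeof(*l));
        l->host = strdup(host);
        pthread_mutex_init(&l->mu, NULL);
        pthread_cond_init(&l->cv, NULL);

        pthread_t tid;
        pthread_create(&tid, NULL, worker, l);
        pthread_detach(tid);

        struct timespec deadline;
        clock_gettime(CLOCK_REALTIME, &deadline);
        deadline.tv_sec += seconds;

        pthread_mutex_lock(&l->mu);
        while (!l->done &&
               pthread_cond_timedwait(&l->cv, &l->mu, &deadline) != ETIMEDOUT)
            ;
        struct addrinfo *res = (l->done && l->err == 0) ? l->result : NULL;
        pthread_mutex_unlock(&l->mu);
        return res;  /* l is never freed: on timeout the worker still owns it */
    }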


> For example, you can't specify a timeout when opening a file, reading from a file, or closing a file, which are all potentially blocking operations.

No they're not. Not really, unless you consider disk access and interacting with the page cache/inode cache inside the kernel to be blocking. But if you do that, you should probably also consider scheduling and really any CPU instruction to be blocking. (If the system is too loaded, anything can be slow).

To be fair, network requests can be considered non-blocking in a similar way, but they depend on other systems that you generally can't control or inspect. In practice you'll see network timeouts. Note that you (at least normally -- there might be tricky exceptions) won't see EINTR from read() on a filesystem file. But you can see EINTR for network sockets. The difference is that, in Unix terminology, disks are not considered "slow devices".


I'd consider "blocking" anything that, given the same inputs, state, and CPU frequency, may take a variable amount of time. That means pretty much every system call, entering the system scheduler, doing something that leads to a page fault, etc. Pretty much only pure math in total functions, and calls to functions that are already paged in, are acceptable.


Yeah... sudden paging in has also been noted as a source of latency in network-oriented software. But that's the problem with the current state of our APIs and their implementations: ideally, you'd have as many independent threads of execution as you want/need, and every time one of them initiates some "blocking" operation, it is quickly and efficiently scheduled out and another ready-to-run thread is switched in. Native threads don't give you that context-switching efficiency, and user-space threads can accidentally cause the underlying native thread to block even on something like "read a non-local variable".


In a data center, networks can have lower latency than disk (even SSD). Now the real place this all falls on its head is page faults. There are definitely places where you need to have granular control over what can and cannot stall a thread from making progress.


> No they're not. Not really, unless you consider disk access and interacting with the page cache/inode cache inside the kernel to be blocking.

The important point is that the kernel takes locks during all those operations, and will wait an unbounded amount of time if those locks are contended.

So really and truly, yes, any synchronous syscall can schedule out for an arbitrary amount of time, no matter what you do.

It's sort of semantic. The word "block" isn't a synonym for "sleep", it has a specific meaning in POSIX. In that meaning, you're correct, they never "block". But in the generic way most people use the term "block", they absolutely do.


> disks are not considered "slow devices".

And neither are the tapes. But the pipes, apparently, are.

Well, unfortunately, disk^H^H^H^H large persistent storage I/O is actually slow, or people wouldn't have been writing thread pools to make it look asynchronous, or sometimes even process pools to convert disk I/O to pipe I/O, for the last two decades.


There is a misunderstanding. "Slow device" in the POSIX sense is about unpredictability, not maximum possible bandwidth. Reading from a spinning disk might be comparatively slow in the bandwidth sense, but it's actually quite deterministic how much data you can shovel to or from it.

A pipe on the other hand might easily stall for an hour. The kernel generally can't know how long it will have to wait for more data. That's why pipe reads (as well as writes) are interruptible.

The absolute bandwidth number of a hard disk doesn't matter --- in principle you can overload any system such that it fails to schedule and complete all processes in time. Putting aside possible system overload, the "slow device" terminology makes a lot of sense.
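
That interruptibility is also why the classic read loop on pipes and sockets has to handle EINTR, whereas a read from a regular file normally never returns it (sketch, assuming no SA_RESTART):

    #include <errno.h>
    #include <unistd.h>

    /* Read up to len bytes from a pipe or socket, retrying when a signal
       interrupts the wait. Returns bytes read, 0 on EOF, -1 on real error. */
    ssize_t read_retry(int fd, void *buf, size_t len)
    {
        for (;;) {
            ssize_t n = read(fd, buf, len);
            if (n >= 0)
                return n;
            if (errno != EINTR)
                return -1;
            /* interrupted before any data arrived ("slow device"): retry */
        }
    }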


Seeking a tape also takes an unpredictable amount of time, and so does seeking a disk, for that matter (IIRC, historically it was actually quite difficult for Unix systems to saturate a disk's throughput with random reads).


According to ChatGPT, a tape device is actually considered a "slow device", even though I'm not sure it's that unpredictable. Maybe for most common use cases it is.

I was under the impression that you can generally budget about 10ms for a disk seek? Again, it depends on the file system abstractions built on top, and then the cache and the current system load -- how many seeks will be required?


>> disks are not considered "slow devices".

> And neither are the tapes. But the pipes, apparently, are.

The "slow vs fast" language is really unfortunate, I realize it's traditional but it's unnecessarily confusing.

A better way to conceptualize it IMHO is bounded vs unbounded: a file or a tape contains a fixed amount of data known a priori, a socket or a pipe does not.


I agree. If you actually know what you're doing you can use perf and/or ftrace to get highly detailed processor metrics over short periods of time, and you can see the effects of things like CPU stalls from cache misses, CPU stalls from memory accesses, scheduler effects, and many other things. But most of these metrics are not very actionable anyway (the vast majority of people are not going to know what to do with their IPC, cache hit, or branch prediction numbers).

What most people care about is some combination of latency and utilization. As a very rough rule of thumb, for many workloads you can get up to about 80% CPU utilization before you start seeing serious impacts on workload latency. Beyond that you can increase utilization but you start seeing your workload latency suffer from all of the effects you mentioned.

To know how much latency is impacted by utilization you need to measure your specific workload. Also, how much you care about latency depends on what you're doing. In many cases people care much more about throughput than latency, so if that's the top metric then optimize for that. If you care about application latency as well as throughput then you need to measure both of those and decide what tradeoffs are acceptable.


Writing drivers is easy; getting vendors to write *correct* drivers is difficult. At work right now we are working with a Chinese OEM on a custom wifi board whose chipset, firmware, and drivers are supplied by the vendor. It's actually not a new wifi chipset; they've used it in other products for years without issues. Under conditions that are difficult to reproduce, the chipset sometimes gets "stuck" and basically stops responding or doing any wifi things. This appears to be a firmware problem because unloading and reloading the kernel module doesn't fix the issue. We've supplied loads of pcap dumps, but they're kind of useless to the vendor because (a) pcap can only capture what the kernel sees, not what the wifi chipset sees, (b) it's infeasible for the wifi chipset to log all of its internal state, and (c) even if all this were possible, trying to debug the driver just by looking at gigabytes of low-level protocol dumps would be impossible.

Realistically for the OEM to debug the issue they're going to need a way to reliably repro which we don't have for them, so we're kind of stuck.

This type of problem generalizes to the development of drivers and firmware for many complex pieces of modern hardware.


> custom Wifi board
Why didn't you use something more mainstream? Cost?


Probably some weird design spec or size requirement


From what I could tell from the article and the linked videos the innovation here is that it essentially lets you serve the shuttlecock while it's facing the wrong direction. Normally even if the shuttlecock has spin when it crosses the court it will move with the cork side forward, at least by the time it crosses the net. Hence I don't think this technique would be applicable to other sports that use a ball.


NFS can be super fast. In a past life I had to work a lot with a large distributed system of NetApp Filers (hundreds of filers located around the globe), and they have a lot of fancy logic for doing locale-aware caching and clustering.

That said, all of the open source NFS implementations are missing this stuff, so you'd have to implement it yourself, which would be a lot of work. NetApp Filers are crazy expensive and really annoying to administer. I'm not really surprised that the cloud NFS solutions are all expensive and slow, because truly *needing* NFS is a very niche thing (like, do you really need `flock(2)` to work in a distributed way?).
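
To be concrete about the flock(2) example: the API itself is trivial, and the hard part is making this advisory lock actually mean something across NFS clients (sketch; the path is made up, and whether this works over NFS at all depends on the server and mount options):

    #include <fcntl.h>
    #include <stdio.h>
    #include <sys/file.h>
    #include <unistd.h>

    int main(void)
    {
        int fd = open("/mnt/nfs/shared.lock", O_CREAT | O_RDWR, 0644);
        if (fd < 0) { perror("open"); return 1; }

        /* Advisory exclusive lock: cheap locally, a distributed-systems
           problem once multiple NFS clients are involved. */
        if (flock(fd, LOCK_EX) != 0) { perror("flock"); return 1; }

        /* ... critical section against other cooperating processes ... */

        flock(fd, LOCK_UN);
        close(fd);
        return 0;
    }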


> NFS can be super fast

Modern day NFS also has RDMA transports available with some vendors. Plus perhaps have it over IB for extra speed.


Yeah if you were really trying to make things fast you'd have the compute and NFS server in the same rack connected this way. But you aren't going to get this from any cloud providers.

For read-only data (the original post is about serving model weight files) you can also use iSCSI. This is how packages/binaries are served to nearly all Borg hosts at Google (most Borg hosts don't have any local disk whatsoever; when they need to run a given binary they mount the software image using iSCSI and then, I believe, mlock nearly all of the ELF sections).


This vastly oversimplifies the problem: the difference between IPv4 and IPv6 is not just the format of the address. The protocols have different features, which is why the sockaddr_in and sockaddr_in6 types don't just differ in the address field. Plus, the vast majority of network programs use higher-level abstractions; for example, even in C or C++ a lot of people would be using a networking library like libevent or asio to handle these details (especially if you want code that easily works with TLS).
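
To make that concrete, the two structs carry protocol-level fields that have no IPv4 counterpart, which is why generic code ends up branching on the family (or just going through getaddrinfo) rather than swapping the address format (sketch):

    #include <arpa/inet.h>
    #include <netinet/in.h>
    #include <sys/socket.h>

    /* sockaddr_in:  family, port, 4-byte address (plus padding).
       sockaddr_in6: family, port, 16-byte address, *and* sin6_flowinfo
       (flow label) and sin6_scope_id (link-local scope / interface),
       which simply don't exist in IPv4. Generic code therefore keeps a
       sockaddr_storage and branches on the family: */
    int get_port(const struct sockaddr_storage *ss)
    {
        if (ss->ss_family == AF_INET)
            return ntohs(((const struct sockaddr_in *)ss)->sin_port);
        if (ss->ss_family == AF_INET6)
            return ntohs(((const struct sockaddr_in6 *)ss)->sin6_port);
        return -1;
    }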


I don't understand why they say that reference counting is "slow". Slow compared to what? Atomic increments/decrements of integers are one of the fastest operations you can do on modern x86 and ARM hardware, and except in pathological cases will pretty much always be faster than the pointer chasing done in traditional mark-and-sweep VMs.

This isn't to say reference counting is without problems (there are plenty of them, inability to collect cyclical references being the most well known), but I don't normally think of it as a slow technique, particularly on modern CPUs.
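
For concreteness, the per-object cost being debated is just this retain/release pair (a C11 sketch, not any particular runtime's implementation; the expensive case is many threads hammering the same counter's cache line):

    #include <stdatomic.h>
    #include <stdlib.h>

    struct object {
        atomic_size_t refs;  /* initialized to 1 by whatever allocates the object */
        /* ... payload ... */
    };

    static inline void retain(struct object *o)
    {
        /* a single atomic RMW: cheap uncontended, a shared cache line when not */
        atomic_fetch_add_explicit(&o->refs, 1, memory_order_relaxed);
    }

    static inline void release(struct object *o)
    {
        /* release on the decrement plus an acquire fence before freeing,
           so the last owner observes all prior writes to the object */
        if (atomic_fetch_sub_explicit(&o->refs, 1, memory_order_release) == 1) {
            atomic_thread_fence(memory_order_acquire);
            free(o);
        }
    }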


Atomic reference counting per se is fairly slow compared to other simple operations [1]. But the biggest issue with reference counting is that it doesn't scale well in multithreaded programs: even pure readers have to write to shared memory locations. Also, acquiring a new reference from a shared atomic pointer is complex and needs something like hazard pointers or a lock.

[1] An atomic inc on x86 is typically ~30 clock cycles, doesn't really pipeline well, and will at the very least stall other load operations.

