I think this is why some techniques for rolling deploys set up a health check on each node that the deploy process can toggle. That way, an otherwise healthy node reports its status as "down" while still accepting and finishing in-flight requests, and gets taken out of the HAProxy rotation without a configuration reload. The new code is rolled out on that node, the toggle is flipped back to "healthy", and HAProxy brings the node back into rotation.
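A minimal sketch of the HAProxy side of that (the backend name, server addresses, and /health path are made up): the deploy process only has to make /health return a non-2xx status on the node it wants drained.

    # haproxy.cfg (illustrative): servers are drained via their health check,
    # not by editing this file
    backend app
        option httpchk GET /health
        # fall/rise: how many consecutive checks flip a server's state; a DOWN
        # server stops receiving new requests but in-flight ones finish
        server web1 10.0.0.11:8080 check inter 2s fall 2 rise 2
        server web2 10.0.0.12:8080 check inter 2s fall 2 rise 2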
This is what I've done in the past, not sure why we need to "hack TCP" to get zero downtime.
In addition, I do rolling deploys where I create an entirely new VM and add it to the rotation before removing the old one. That way I don't have to recycle a node that might have some state on it.
The article is about restarting HAProxy without any downtime. HAProxy restarts are needed when adding new service instances or adjusting configuration options. This is a different and much harder problem than gracefully restarting load balanced service instances.
As other posters have mentioned, the solution in this link is actually suboptimal. I showed how the iptables approach adds 1-3s of latency to connections established during the restart, but you can avoid that by getting a little more hardcore.
It should, and it already lets you make some changes over a socket, but supporting complete config updates with the current HAProxy architecture is a lot of work and isn't there yet.
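For example, assuming a "stats socket /var/run/haproxy.sock level admin" line in the config and made-up backend/server names, you can already drain, re-enable, or reweight a server at runtime without any reload:

    echo "disable server app/web1" | socat stdio UNIX-CONNECT:/var/run/haproxy.sock
    echo "enable server app/web1"  | socat stdio UNIX-CONNECT:/var/run/haproxy.sock
    echo "set weight app/web1 50"  | socat stdio UNIX-CONNECT:/var/run/haproxy.sock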
The challenge is that whether HAProxy's own graceful-restart support works depends on which OS and kernel version you're running and which socket features they support, so making sure that's all kosher takes some effort. I took the approach that it's simpler to do something that takes a bit longer and not have to worry about any of that.
The downtime "gap" between step 1 and step 2 is probably way less than 500ms, right? If being down for 500ms is unacceptable, you should probably have multiple HAProxy instances running and just do a rolling deploy. Seems like a bit of over-optimization.
Exactly, with these "high uptime" requirements, isn't it usual to have 2+ HAProxy instances running on different servers bound to the same IP? At least that's how I do it. When you update the config, first update the backup HAProxy, then update the main one. Any requests that take place during the main proxy restart would go to the backup.
Could you please describe how I can bind 2 different servers to the same IP with automatic request balancing (so that when one server is down, the other serves all requests)? I understand how to do it with another server running HAProxy, but your solution seems to work without one.
To have 2 servers with the same IP, we use VRRP (Virtual Router Redundancy Protocol) via keepalived. HAProxy is set up on 2 separate servers/instances, and through VRRP/keepalived they share an IP address (which HAProxy binds to). The servers also have their own unique IP address(es) on top of the shared one, so the shared address doesn't really "belong" to any machine.
If one server goes down, VRRP gives the IP to the other server and that HAProxy takes over.
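A minimal keepalived.conf sketch for the MASTER node (the interface, virtual_router_id, and virtual IP are placeholders; the BACKUP node gets state BACKUP and a lower priority):

    vrrp_script chk_haproxy {
        script "killall -0 haproxy"   # exits 0 as long as an haproxy process is running
        interval 2
    }

    vrrp_instance VI_1 {
        state MASTER
        interface eth0
        virtual_router_id 51
        priority 101                  # the backup node uses e.g. 100
        virtual_ipaddress {
            192.0.2.10/24             # the shared "floating" IP that HAProxy binds to
        }
        track_script {
            chk_haproxy
        }
    }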
If you have a few HAProxy nodes with DNS load balancing, wouldn't the client automatically move on to another HAProxy? It's a tiny lag, but it shouldn't be an error.
That might have been the case if you could expect clients to be well behaved, but you can't. In my experience, hardly any network clients handle that kind of thing properly.
No, once the client has resolved the IP it will either connect to that proxy or fail that particular connection attempt. But you can use the DNS to drain a proxy before restarting it.
I've been a little confused by the claims of how complex the SYN delay solution is. Is it actually all that more complex?
The original blog post wraps a restart command in two iptables invocations (and relies on a hacky sleep interval that may or may not be long enough). The SYN-delay method wraps a restart command in two tc invocations. The concepts are more or less identical in complexity: one tells the kernel "drop SYNs now please" and the other says "delay SYNs for a bit please".
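For reference, the iptables variant described above looks roughly like this (port and paths are illustrative; -sf tells the new HAProxy process to let the old one finish its connections and exit):

    # drop incoming SYNs so clients retransmit instead of getting RSTs
    iptables -I INPUT -p tcp --dport 80 --syn -j DROP
    sleep 1   # the hacky part: hope one second is long enough
    haproxy -f /etc/haproxy/haproxy.cfg -p /var/run/haproxy.pid \
            -sf $(cat /var/run/haproxy.pid)
    iptables -D INPUT -p tcp --dport 80 --syn -j DROP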
All the complexity in the qdisc solution is in the one-time setup of the queuing disciplines. I think the biggest drawback of delaying SYNs isn't complexity, but that getting it to work with external load balancers is trickier than with internal load balancers. Honestly, you're right that if an org doesn't restart HAProxy very often, it doesn't make much sense to invest in solving this problem. If it were me, I'd just make sure I was on the latest Linux kernel so that the window during which HAProxy can cause RSTs is as small as possible, and not bother with either the iptables or tc solution.
Does nginx have the same problem of a "small time window" of downtime? My guess was that reloads are handled more gracefully by having a single process bound to the HTTP/S ports and child processes that talk to the upstream servers, so there's no "small time window" while reloading.
No, nginx should not have this problem, as it uses fd passing to gracefully hand off connections.
HAProxy only has this issue on Linux because Linux's SO_REUSEPORT implementation unfortunately introduces a race condition between accept and close. While I haven't personally tested it, HAProxy on one of the BSDs should not have this small window of downtime.
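Not HAProxy's actual code, but a minimal Python illustration of the SO_REUSEPORT mechanism in question: several processes can each run this and listen on the same port, and the kernel spreads incoming connections across their listening sockets. The race described above shows up when one of those processes closes its listener while connections are still queued on it.

    import os
    import socket

    # Add another listener on a port that may already have listeners,
    # relying on SO_REUSEPORT (Linux >= 3.9).
    sock = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    sock.setsockopt(socket.SOL_SOCKET, socket.SO_REUSEPORT, 1)
    sock.bind(("0.0.0.0", 8080))
    sock.listen(128)

    while True:
        conn, addr = sock.accept()
        conn.sendall(b"handled by pid %d\n" % os.getpid())
        conn.close()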
Without being dogmatic about it, I feel similarly about fixing things by inserting sleeps.
At a high level, you often see programmers sprinkle sleeps into their code to "fix" race conditions or deadlocks. That doesn't really fix the problem; it just moves it around, and it's usually done because they don't know how to reason about the underlying problem and fix it properly.
You need to sleep long enough that whatever you're waiting for will definitely have finished. Most of the time you have no exact guarantee of that, so you have to pick some N that is relatively large. Inevitably, no matter what N you pick, sooner or later, the thing you're waiting for will take N + 1 and things break. To make it worse, the N + 1 situation often happens because you're getting an unusually large amount of traffic or because something else in the system is already in a failure state. So the breakage tends to come at the worst possible time and exacerbate things.
Meanwhile, if you sleep for N ms somewhere, one thing you can guarantee is that whatever you're doing will take at least N ms. There's no way to make it faster, even if it may have been unnecessary to wait that long. Often not a big deal, but the more developers sprinkle sleeps into their code, the more often you run into bizarre performance bottlenecks as a result.
Network timeouts and similar are kind of a fact of life, so there's no perfect solution. But if you find yourself trying to solve a problem by sleeping for some arbitrary period of time, a little alarm should go off in your head telling you that there's probably a better solution.
Sometimes sleep is the only viable solution when dealing with an external system. I have such a problem with one of my programs that has to access a hardware device. The problem is that the load the hardware can handle varies. To get maximum throughput I have to pound the hardware as fast as possible, and when it fails, sleep for a random but increasing time. If anyone knows of a more elegant solution, please post.
Yeah, that's why I'm not dogmatic about it. With network timeouts, external systems that you can't control, and low level hardware access, sometimes it's the best you can do. The better solution would be for the hardware/system you are interacting with to publish an event or otherwise signal when it is or isn't able to handle more load. If it wasn't designed with back pressure in mind though, you do the best you can, and in your case, exponential backoff is probably it.
Adding jitter to avoid dogpiling is another case where sleeping is perfectly reasonable.
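For what it's worth, a minimal sketch of that kind of exponential backoff with jitter (poll_device and the constants here are made-up names, not anything from the thread):

    import random
    import time

    MAX_SLEEP = 5.0  # seconds; cap the delay so a long outage doesn't stall forever

    def call_with_backoff(poll_device):
        delay = 0.01
        while True:
            try:
                return poll_device()  # hammer the device as fast as possible...
            except IOError:
                # ...and back off, with jitter, only when it pushes back
                time.sleep(random.uniform(0, delay))
                delay = min(delay * 2, MAX_SLEEP)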
What I get wary of is the common pattern of: make a call to some external service, sleep for some amount of time (to "let it finish"), then continue under the assumption that it has completed.
I think we are in massive agreement here. Sleep can be used as crutch inappropriately, but when you have a broken leg a crutch is exactly what you need.
Not only is this from 2014 and not labeled as such, but the shortened title is linkbait. No system gets 100% uptime. This article is about eliminating one small source of downtime for HAProxy systems. Nice, but not exactly a revolution.
Why does HAProxy need a reload on config change? Can't it load the new config, diff it against the running one, and apply only the difference? Or are there alternative approaches, like a web frontend for HAProxy that makes piecemeal config changes without restarting?