The big thing that articles like this completely miss is that we are no longer in the brief HTTP/1.0 era (1996), when every request meant a new TCP connection (and therefore possibly a new DNS query).
In the HTTP/1.1 (1997) or HTTP/2 era, the TCP connection is made once and then stays open (Connection: Keep-Alive) for multiple requests. This greatly reduces the number of DNS lookups per HTTP request.
If the web server is configured for a sufficiently long Keep-Alive idle period, then this period is far more relevant than a short DNS TTL.
If the server dies or disconnects in the middle of a Keep-Alive, the client/browser will open a new connection, and at this point, a short DNS TTL can make sense.
(I have not investigated how this works with QUIC HTTP/3 over UDP: how often does the client/browser do a DNS lookup? But my suspicion is that it also does a DNS query only on the initial connection and then sends UDP packets to the same resolved IP address for the life of that connection, and so it behaves exactly like the TCP Keep-Alive case.)
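For what it's worth, here's a minimal Python sketch of that behavior (the host and paths are just placeholders): one connection object, so one DNS lookup and one TCP/TLS handshake, reused across several HTTP/1.1 requests.

```python
import http.client

# One HTTPSConnection == one TCP/TLS connection. The DNS lookup and handshake
# happen on the first request; later requests reuse the same socket as long
# as the server keeps the connection open.
conn = http.client.HTTPSConnection("example.com", timeout=10)
for path in ("/", "/about", "/contact"):
    conn.request("GET", path)
    resp = conn.getresponse()
    resp.read()  # drain the body so the connection can be reused
    print(path, resp.status)
conn.close()
```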
> patched an Encrypted DNS Server to store the original TTL of a response, defined as the minimum TTL of its records, for each incoming query
The article seems to be based on capturing live DNS data from a real network. While it may be true that persistent connections reduce the number of DNS lookups, it certainly seems like the article is already accounting for that, unless their network is only using HTTP/1.0 for some reason.
I agree that a low TTL could help during an outage if you actually wanted to move your workload somewhere else, and I didn't see that mentioned in the article. But I've never actually seen it done in practice; setting the TTL extremely low for some sort of extreme DR scenario smells like an anti-pattern to me.
Consider the counterpoint: a high TTL can prevent your service from going down if the DNS server crashes or loses connectivity, since resolvers can keep serving the cached records.
It's very local here. I'm in the suburbs of Philadelphia, in one of the highest-income counties in the state, two blocks from a major hospital and one block from a suburban downtown. Despite that, I've experienced one or two 4-6 hour power outages per year over the past few years. (Mostly correlated with weather.) One outage in June 2025 was 50 hours long!
Many larger homes in this area have whole-house generators (powered by utility natural gas) with automatic transfer switches. During the 50-hour outage, we "abandoned ship" and stayed with someone who also had an outage, but had a whole-house generator.
Other areas just 5-10 miles away are like what you describe: maybe one outage in the past 10 years.
> If something goes wrong, like the pipeline triggering certbot goes wrong, I won't have time to fix this. So I'd be at a two day renewal with a 4 day "debugging" window.
I think a pattern like that is reasonable for a 6-day cert:
- renew every 2 days, and have a "4 day debugging window"
- renew every 1 day, and have a "5 day debugging window"
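A quick arithmetic check of those two options (just a sketch; the numbers are the ones from this thread): the debugging window is roughly the cert lifetime minus the renewal interval, since the first failed renewal happens one interval after the last successful one.

```python
# Debugging window = cert lifetime minus renewal interval: if renewals start
# failing right after a successful renewal, that's how long you have to fix
# things before the cert expires.
CERT_LIFETIME_DAYS = 6

for renew_every_days in (2, 1):
    debug_window = CERT_LIFETIME_DAYS - renew_every_days
    print(f"renew every {renew_every_days}d -> ~{debug_window}d debugging window")
```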
100%, I've run into this too. I wrote some minimal scripts in Bash, Python, Ruby, Node.js (JavaScript), Go, and PowerShell to send a request and alert if the expiration is less than 14 days from now: https://heyoncall.com/blog/barebone-scripts-to-check-ssl-cer... Anyone who's operating a TLS-secured website (which is... basically anyone with a website) should have at least that level of automated sanity check. We're talking about ~10 lines of Python!
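A minimal sketch in that spirit (not the exact script from the link; the hostname and 14-day threshold are just placeholders): connect, read the peer certificate's notAfter, and alert if it's close.

```python
import socket
import ssl
import time

HOST = "example.com"   # placeholder host
THRESHOLD_DAYS = 14    # alert if the cert expires within two weeks

# Open a TLS connection and grab the validated peer certificate.
ctx = ssl.create_default_context()
with socket.create_connection((HOST, 443), timeout=10) as sock:
    with ctx.wrap_socket(sock, server_hostname=HOST) as tls:
        not_after = tls.getpeercert()["notAfter"]  # e.g. 'Jun  1 12:00:00 2026 GMT'

days_left = (ssl.cert_time_to_seconds(not_after) - time.time()) / 86400
if days_left < THRESHOLD_DAYS:
    print(f"ALERT: {HOST} certificate expires in {days_left:.1f} days")  # hook your alerting here
else:
    print(f"OK: {HOST} certificate expires in {days_left:.1f} days")
```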
Lots of people are speculating that the price spike is AI-related. But it might be more mundane:
I'd bet that a good chunk of the apparently sudden demand spike could be last month's Microsoft Windows 10 end-of-support finally happening, pushing companies and individuals to replace many years' worth of older laptops and desktops all at once.
I worked in enterprise laptop repair two decades ago — I like your theory (and there's definitely meat there) but my experience was that if a system's OEM configuration wasn't enough to run modern software, we'd replace the entire system (to avoid bottlenecks elsewhere in the architecture).
I have no idea about the number of people this has actually affected, but this is exactly my situation. I need a new workstation with a bunch of RAM to replace my Win10 machine, so I don't really have any viable options other than paying the going rate.
There's a tradeoff, and the assumption here (which I think is solid) is that there's more benefit from avoiding a supply chain attack by blindly (by default) using a dependency cooldown than from avoiding a zero-day by blindly (by default) staying on the bleeding edge of new releases.
It's comparing the likelihood of an update introducing a new vulnerability to the likelihood of it fixing a vulnerability.
While the article frames this problem in terms of deliberate, intentional supply chain attacks, I'm sure the majority of bugs and vulnerabilities were never supply chain attacks: they were just ordinary bugs introduced unintentionally in the normal course of software development.
On the unintentional bug/vulnerability side, I think there's a similar argument to be made. Maybe even SemVer can help as a heuristic: a patch version increment is likely safer (less likely to introduce new bugs/regressions/vulnerabilities) than a minor version increment, so a patch version increment could have a shorter cooldown.
If I'm currently running version 2.3.4, and there's a new release 2.4.0, then (unless there's a feature or bugfix I need ASAP), I'm probably better off waiting N days, or until 2.4.1 comes out and fixes the new bugs introduced by 2.4.0!
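A sketch of that heuristic (the cooldown values and helper names here are hypothetical, just to illustrate the idea): shorter cooldowns for patch bumps, longer ones for minor and major bumps.

```python
from datetime import datetime, timedelta, timezone

# Hypothetical policy: longer cooldowns for bigger version bumps.
COOLDOWN_DAYS = {"major": 14, "minor": 7, "patch": 3}

def bump_type(current: str, candidate: str) -> str:
    """Classify a version bump, assuming plain MAJOR.MINOR.PATCH strings."""
    cur = [int(x) for x in current.split(".")]
    new = [int(x) for x in candidate.split(".")]
    if new[0] > cur[0]:
        return "major"
    if new[1] > cur[1]:
        return "minor"
    return "patch"

def ok_to_update(current: str, candidate: str, released_at: datetime) -> bool:
    """True once the release has aged past the cooldown for its bump type."""
    age = datetime.now(timezone.utc) - released_at
    return age >= timedelta(days=COOLDOWN_DAYS[bump_type(current, candidate)])

# 2.3.4 -> 2.4.0 released 5 days ago: still inside the 7-day "minor" cooldown.
print(ok_to_update("2.3.4", "2.4.0", datetime.now(timezone.utc) - timedelta(days=5)))
```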
Yep, that's definitely the assumption. However, I think it's also worth noting that zero-days, once disclosed, do typically receive advisories. Those advisories then (at least in Dependabot) bypass any cooldown controls, since the thinking is that a known vulnerability is more important to remediate than the open-ended risk of a compromised update.
> I'm sure the majority of bugs and vulnerabilities were never supply chain attacks: they were just ordinary bugs introduced unintentionally in the normal course of software development.
Yes, absolutely! The overwhelming majority of vulnerabilities stem from normal accidental bug introduction -- what makes these kinds of dependency compromises uniquely interesting is how immediately dangerous they are versus, say, a DoS somewhere in my network stack (where I'm not even sure it affects me).
Of course. They can simply wait to exploit their vulnerability. If it is well hidden, it probably won't be noticed for a while, so they can wait until it is running on the majority of their target systems before exploiting it.
From their point of view it is a trade-off between volume of vulnerable targets, management impatience and even the time value of money. Time to market probably wins a lot of arguments that it shouldn't, but that is good news for real people.
You should also factor in that a zero-day often isn't exposed enough to be exploitable if you are using the onion model, where other layers would need to be penetrated as well. That is in contrast to a supply chain compromise, which is designed to actively make outbound connections through any means possible.
Thank you. I was scanning this thread for anyone pointing this out.
The cooldown security scheme looks like a kind of inverse "security by obscurity": nobody has spotted a backdoor yet, therefore we assume it's secure. This scheme stands and falls with the assumed timelines. Once that assumption tumbles, picking a cooldown period becomes guesswork. (Or just another compliance box ticked.)
On the other hand, the assumption may well be sound; maybe ~90% of future backdoors can be mitigated by it. But who can tell? It looks like survivorship bias, because we are making decisions based only on the cases we actually found.
I'd estimate the vast majority of CVEs in third-party source are not directly or indirectly exploitable. The CVSS scoring system assumes the worst-case scenario for how the module is deployed. We still have no good way to automate adjusting the score, or even just to figure out which findings are false positives.
The big problem is the Red Queen's Race nature of development in rapidly-evolving software ecosystems, where everyone has to keep pushing versions forward to deal with their dependencies' changes, as well as any actual software developments of their own. Combine that with the poor design decisions found in those same ecosystems, where everyone assumes anything can be fixed in the next release, and you have a recipe for disaster.
Could always just use a status page that updates itself. For my side project Total Real Returns [1], if you scroll down and look at the page footer, I have a live status/uptime widget [2] (just an <img> tag, no JS) which links to an externally-hosted status page [3]. Obviously not critical for a side project, but kind of neat, and was fun to build. :)
This is unrelated to the Cloudflare incident, but thanks a lot for making that page. I keep checking it from time to time, and it's basically the main data source for my long-term investing.
Looks great! Would you have a recommendation for intro materials to help me learn the basics of electronics using CircuitLab? I have a working understanding of signal processing, but building an actual circuit without electrocuting myself or setting my Raspberry Pi on fire, or selecting the right set of components for even the simplest DIY project based on spec sheets, is a mystery to me.
A favorite of mine and one of the most common ways to generate a fairly high DC voltage. The full-wave version pairs well with a center-tapped secondary on a resonant transformer.
For fun, playing with Meshtastic https://meshtastic.org/ and contributing to the open source firmware and apps. They have something cool but need lots of help. I've patched 3 memory leaks and had a few other PRs merged already.
For work, https://heyoncall.com/: the best tool for on-call alerting, website monitoring, and cron job monitoring, especially for small teams and solo founders.
I guess they both fall under the category of "how do you build reliable systems out of unreliable distributed components" :)