You can get free uptime monitoring from Google Cloud. The limit is 100 uptime checks per monitoring scope, which, if I understand correctly, can mean either a project or an organization depending on how you configure it: https://cloud.google.com/monitoring/uptime-checks. The checks are run from six locations around the world, so you can also catch network issues, even if you likely can't do much about those when you're running a tiny service. My uptime checks show the probes come from: usa-{virginia,oregon,iowa}, eur-belgium, apac-singapore, sa-brazil-sao_paulo
Another neat monitoring thing I rely on is https://healthchecks.io. Anything that needs to run periodically checks in with the API at the start and end of execution, so you can be sure your jobs are running as they should, on time, and without errors. Its free tier allows 20 checks.
It works really well for cron jobs - while a single call is enough, you can also hit the /start endpoint at the beginning and the regular success endpoint when finished, and get extra insights such as the runtime of your jobs.
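As a minimal sketch of that pattern (the ping URL UUID is a placeholder, and run_job stands in for your actual work):

```python
import urllib.request

# Placeholder: the per-check ping URL you get from healthchecks.io
PING_URL = "https://hc-ping.com/your-check-uuid"

def ping(path=""):
    try:
        urllib.request.urlopen(PING_URL + path, timeout=10)
    except OSError:
        pass  # monitoring must never crash the job itself

ping("/start")     # signals the job started, enabling runtime measurement
try:
    run_job()      # stand-in for the actual periodic work
except Exception:
    ping("/fail")  # explicit failure signal -> immediate alert
    raise
else:
    ping()         # plain ping URL = success signal
```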
It would be nice if it had slightly more complex alerting rules available - for example, a "this service should run successfully at least once every X hours, but is fine to fail multiple times otherwise" type alert.
We wanted to use it for monitoring some periodic downloads (like downloading partners' reports), where the expectation is that a call will often time out, fail, or have no data to download. That's technically a "failure", but only a problem if it goes on for more than a day. Since healthchecks.io doesn't really support this, we ended up writing our own "stale data" monitoring and alerting logic inside the downloader, and just use healthchecks.io to monitor that the script isn't crashing.
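The gist of that logic, as a hypothetical sketch (the file path and send_alert are stand-ins, not our actual code): alert on the age of the last successful download rather than on individual failures:

```python
import time
from pathlib import Path

MAX_AGE = 24 * 3600  # failures are fine unless they go on for more than a day

def report_is_stale(path: Path) -> bool:
    # A timed-out/failed/empty download never touches the file, so the
    # file's mtime is effectively the time of the last successful download.
    return not path.exists() or time.time() - path.stat().st_mtime > MAX_AGE

if report_is_stale(Path("/data/partner_report.csv")):  # placeholder path
    send_alert("partner report is stale")  # stand-in for our alerting hook
```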
> "this service should run successfully at least once every X hours, but is fine to fail multiple times otherwise"
This should work if you set the period to "X hours", and send success signals only, no failure signals. In that case, as long as the gap between the success signals is below X hours, all is well. When there's been no success signal for more than X hours, Healthchecks sends out alerts.
I'm guessing you probably also want to log failures using the /fail endpoint. The problem is that when Healthchecks receives a failure event, it sends out alerts immediately.
One potential feature I'm considering is a new "/log" endpoint. When a client pings this endpoint, Healthchecks would treat it as neither a success nor a failure, and just log the received data. You could then use this endpoint in place of /fail. Just logging the failure would not cause any immediate alerts. But the information would be there for inspection, when X hours passes with no success signals and you eventually do get alerted. How does that sound?
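To illustrate, client code could look roughly like this (/log is the proposed, not-yet-existing endpoint; fetch_partner_report and TransientError are hypothetical stand-ins):

```python
import urllib.request

PING_URL = "https://hc-ping.com/your-check-uuid"  # placeholder UUID

def ping(path=""):
    urllib.request.urlopen(PING_URL + path, timeout=10)

try:
    fetch_partner_report()  # hypothetical flaky periodic job
except TransientError:      # hypothetical "expected" failure type
    ping("/log")   # proposed endpoint: recorded, but no immediate alert
except Exception:
    ping("/fail")  # hard failure -> alert right away
    raise
else:
    ping()         # success resets the X-hour countdown
```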
Thinking about it, this does solve the issue as I described it. I do like being able to distinguish the states:
- started, but never finished (no error reported)
- started, and finished with error reported ("crash") (need immediate alert)
- finished (without crashing), but not 100% successful (data not fetched)
- finished successfully
As you mention, it makes sense to have the alerts be:
- no successful completion (regardless of number of attempts) within X time
- explicit error occurred
I think your /log approach has the added advantage of still allowing an explicit error alert regardless of duration - a critical "alert NOW!" state.
The only (weak) argument I see against this approach (and it's an argument for making this a configuration option) is that the reason I started using HealthChecks.io in the first place is that it's incredibly simple to set up for a cron job. Moving this logic to the client means slightly more complicated error-handling logic to call the right endpoint for each type of failure.
The counter-argument is that by the time you move from calling just "/success" to calling multiple endpoints, you're already in that position of more complicated client-side logic. If you want the simple "just run at least once every X hours" approach, all you need to do is never call "fail" and set the grace period appropriately.
For our use case, the logic for when to alert (or not) got much more complicated than described, so moving the rules into our own code still made sense, but I think there are other instances where we'd benefit from your proposal.
I wish New Relic supported plain old ICMP ping. That would be nice. You used to be able to implement it using their Scripted API thing (which is just sandboxed Node), but at some point they broke raw socket support, which broke every ping npm package in existence. I think you can still make it work if you run a private minion, but that's more effort than I want to spend.
My particular favourite is how GraphQL servers respond with "200 OK" and send the errors in a key called "errors". It makes regular health checks almost useless.
I ended up writing my own service[0] to detect problems with GraphQL responses, before expanding it to cover websites and web apps too.
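The core check is small - something like this sketch (the endpoint URL is a placeholder), treating a 200 with an "errors" key as unhealthy:

```python
import json
import urllib.request

GRAPHQL_URL = "https://example.com/graphql"  # placeholder endpoint

def graphql_is_healthy() -> bool:
    req = urllib.request.Request(
        GRAPHQL_URL,
        data=json.dumps({"query": "{ __typename }"}).encode(),  # cheapest valid query
        headers={"Content-Type": "application/json"},
    )
    try:
        with urllib.request.urlopen(req, timeout=10) as resp:
            body = json.load(resp)
    except OSError:
        return False  # transport-level failure
    # Per the GraphQL spec, failures arrive as a 200 with an "errors" list
    return "errors" not in body and body.get("data") is not None
```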
I honestly hate that so much, it's a relief to read someone saying the same.
I sort of almost made myself feel a bit better about it by thinking: no, it's not REST, we have reached the GraphQL server successfully and got a... "successful" response from it; it's sort of a "Layer 8" on top of HTTP. The problem is that none of the bloody tooling is "Layer 8", so you end up in browser dev tools with all these 200 responses and no idea which ones are errorful, if any.
I mean, I agree. Given the nature of the protocol, it makes sense that a half-successful response of independent queries would still return a 200 on the network protocol.
I actually think I agree with your former self: how do you tell the difference between a server error and an application error? How do you tell the difference between "record not found" and "there is no GraphQL endpoint here at all"? Or between "you are not allowed to access GraphQL" and "you are not allowed to access the server"?
Especially because error responses from your web server layer usually look really different from errors from your backends.
While using HTTP status codes could work for GraphQL payloads that contain only one operation, this approach would not work for those that contain multiple[0].
> GraphQL looks like it's been implemented by someone who thought 200 and 404 were the only possible codes.
Maybe. Or maybe they decided that a 2xx status would be interpreted as "success" by a non-trivial set of libraries and/or systems. Either way, take it up with the standards committee :-).
AWS S3 does the opposite when you query objects that don't exist: if you don't have the s3:ListBucket permission on the bucket, you'll get a 403 error, so you can't differentiate between the object not existing and you not having access to it.
I think either approach is valid as long as you're consistent. You can make a case for either 404 or 403 when you don't have enough permissions. In GitHub's case you can argue that it's a 404 because the resource does indeed not exist through your auth context. In AWS' case you can argue that a 403 makes sense because you don't have permission to know the answer to your query.
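If you want to see this behavior from code, a quick boto3 probe (bucket and key names are placeholders) shows how the answer depends on your permissions:

```python
import boto3
from botocore.exceptions import ClientError

s3 = boto3.client("s3")
try:
    s3.head_object(Bucket="my-bucket", Key="does-not-exist")  # placeholders
except ClientError as e:
    # With s3:ListBucket on the bucket this prints "404"; without it, "403" -
    # so a 403 alone can't tell you whether the object exists.
    print(e.response["Error"]["Code"])
```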
I like this apparent shift back to "small is okay" where not every service has to be an overengineered allegedly hyper-scalable distributed mess of five nines uptime with enterprise logging, alerting and monitoring.
Those things are nice when you have a bazillion users and downtime means hordes of unhappy users and dollars flushing away at insane rates, but for the vast majority of hobby projects and even mid-stage startups, what is described in this article is plenty good enough.
I've thought about posting an AskHN about simple infrastructure for some time but I'm not sure how to word it to attract as many responses as possible.
Currently got the cheapest VPS that I could find (in my case from Time4VPS; others might prefer Hetzner, or Scaleway Stardust instances), set up Uptime Kuma on it (https://github.com/louislam/uptime-kuma), and now have checks every 5 minutes against 30+ URLs (could easily do every minute, but don't need that sort of resolution yet).
It's integrated with Mattermost currently, seems to work pretty well. Could also set it up on another VPS, for example on Hetzner (which also has excellent pricing), could also integrate another alerting method such as sending e-mails, or anything else that's supported out of the box: https://github.com/louislam/uptime-kuma/issues/284
Oh, also Zabbix for the servers themselves. Honestly, if things are as simple to set up as they are nowadays and you have about 50 EUR per year per node (1 is usually enough, 2 is better from a redundancy standpoint since it becomes feasible to monitor the monitoring, and others might go for 3 nodes for important things), you don't even need to look at cloud services or complex systems out there.
Of course, if someone knows of some affordable options for cloud services, feel free to share!
I briefly checked the prices for a few, and most of them are a little more expensive than just getting a VPS, setting up sshd to only use key-based auth, throwing Let's Encrypt in front of the web UI (or adding additional auth, or making it accessible only through a VPN, whatever you want), adding fail2ban and unattended upgrades, and doing whatever other basic configuration you probably have automated anyway.
The good news is that if you prefer cloud services and would rather have that piece of your setup be someone else's problem, they're not even an order of magnitude off in most cases - though I've yet to see how Uptime Kuma in particular scales once I get to 100 endpoints. It seems at a certain scale it's a bit cheaper to run your own monitoring, but at that point you might still find it easier to just pay a vendor.
At the end of the day, there's lots of great options out there, both cloud based and self-hosted, whichever is your personal preference.
I guess I'd personally also mention Contabo as an affordable host in general (though their web UI is antiquated), especially their storage nodes: https://contabo.com/en/storage-vps/
For the most part, though, use whichever host you've been with for a few years (though feel free to experiment with whatever new platforms catch your eye), but ideally still keep local backups of everything (as long as you don't have to deal with regulations that make that impossible) so you can migrate elsewhere.
Rightly or wrongly, if I see an Oracle deal that sounds too good to be true, I'm going to assume someone at Oracle has a plan to trap me into a costly arrangement.
Or they’re just desperate to get even a single person to sign up. It’s a testament to their reputation that they can hand over tons of resources for free, and have everyone still go ‘nah, that’s Oracle, I’ll find a different provider’.
I was gonna post the same thing. Though I've barely messed around with mine, at first blush it seems like their weird firewall doesn't work perfectly...
Anyone still remember Steve Yegge's platforms rant? One particular point from it has stuck with me, because it's so obviously correct and so obviously difficult to implement at small scale: "Monitoring and QA are the same thing". This is probably my internal OCD not being able to cope with sub-100% solutions, but every time I see a healthcheck endpoint doing basically a ping-pong response, or maybe checking the database connection, I can't help but think about what it doesn't do - which is basically everything up to an integration test's "works correctly". It's fascinating but at the same time horrible to know how much of "works fine so far" in our industry is circumstantial, good-will optimistic judgement rather than knowledge. "If builders built buildings the way programmers wrote programs, then the first woodpecker that came along would destroy civilization" indeed.
I installed Uptime Kuma (https://github.com/louislam/uptime-kuma) on my dokku paas to monitor my dokku apps. It works great. It is great for pure HTTP services, but it can be used against things like RTMP servers because it also permits configuration of a health check with TCP pings. It gives me an email when things are down, and supports retry, heartbeat intervals, and can validate a string in the HTML retrieved. I love it.
I considered this option, but then realized that with both sides (the APIs/services and the uptime checker) on the same server, any problem impacting the server itself will also take the monitoring offline.
If I have to do one thing to monitor a simple website I'm probably going to use something that takes a screenshot periodically and checks it for changes. There are open source solutions but I just prefer to pay a bit for a managed service to do it.
I think it covers quite a lot of things - the servers are up, DNS is OK, assets are OK. It can also be a safety net in case other, more sophisticated monitoring fails to detect an unusual state.
This doesn't work well for websites with too much JavaScript, ads, or widgets.
Selenium is the first one that comes to mind. I think most of the big browser automation toolkits should have the ability to do screenshots in some capacity.
Yes, you're definitely right - I've recently started using SeleniumBase for Python, which includes a few extra niceties. I was more asking about the part where you compare between releases (or test suite runs) to see if anything changed. I suppose you could just compare hashes, but I could also imagine a feature that highlights changes on the screenshot.
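For reference, a bare-bones version of that comparison (URL and file paths are placeholders) - Selenium for the capture, Pillow for the pixel diff; getbbox() even gives you the changed region to highlight:

```python
from pathlib import Path
from PIL import Image, ImageChops
from selenium import webdriver

BASELINE, CURRENT = Path("baseline.png"), Path("current.png")  # placeholders

driver = webdriver.Chrome()
driver.get("https://example.com")  # placeholder URL
driver.save_screenshot(str(CURRENT))
driver.quit()

if BASELINE.exists():
    diff = ImageChops.difference(Image.open(BASELINE), Image.open(CURRENT))
    # getbbox() is None when the screenshots are pixel-identical; otherwise
    # it's the bounding box of the changed region (useful for highlighting).
    print("changed region:", diff.getbbox())
else:
    CURRENT.rename(BASELINE)  # first run just establishes the baseline
```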
If you have a popular service, then one of the best approaches is to have your users notify you when something is down or is broken. This pattern follows the famous quote: “Given enough eyeballs all bugs are shallow.” I have employed this approach to great success and haven’t had a need for any monitoring services.
Another approach that has been working great for me: https://www.webalert.me. This app runs on your phone, and you can configure it to check once an hour whether any content on a page has changed.
Since everyone is posting their favorite free-tier monitoring products - does anyone have a recommendation for a cloud product that will allow us to create a group of ping monitors and alert only if all monitors in the group are down for N minutes?
I’ll double check when they’re online, but I’m pretty sure BitPing can do stuff like this.
They farm the actual checks out to real users' devices across a whole stack of geographies; you can customise regions, rates, etc.
With icinga2 (or any nagios successor) you could write a custom check command that pings both IPs and returns an error status only if both are down.
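As a rough sketch (placeholder IPs; standard Nagios plugin exit codes, 0 = OK and 2 = CRITICAL), the check command could look like this - the "for N minutes" part would then come from icinga2's retry interval and soft/hard state handling:

```python
#!/usr/bin/env python3
import subprocess
import sys

HOSTS = ["192.0.2.10", "192.0.2.11"]  # placeholder IPs of the group

def is_up(host: str) -> bool:
    # One ICMP echo with a 2-second timeout (Linux iputils ping flags)
    return subprocess.run(
        ["ping", "-c", "1", "-W", "2", host],
        stdout=subprocess.DEVNULL, stderr=subprocess.DEVNULL,
    ).returncode == 0

up = sum(is_up(h) for h in HOSTS)
if up:
    print(f"OK - {up}/{len(HOSTS)} hosts reachable")
    sys.exit(0)   # at least one host up -> no alert
print("CRITICAL - all hosts in group are down")
sys.exit(2)
```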
Have to say, this is exactly what Kubernetes was designed to solve. But the focus was on microservices and containers, and things also got out of hand.
> Have to say, this is exactly what Kubernetes was designed to solve
Kubernetes probes are quite different in my opinion.
Your Kubernetes liveness probe checks whether things are working inside your cluster, which is great as a high-frequency checkup that can modify the state of your pod based on the result.
But Uptime Robot is an end-to-end test. It tests a real connection over the internet to your domain, which exercises external DNS, traffic flowing through any reverse proxies, your SSL certificate, etc.
The two complement each other for different use cases.
Fortunately, providing an email address when registering a new Let's Encrypt cert means LE will warn you if something goes wrong, your cert doesn't get renewed, and it's about to expire.
I really wish managed Kubernetes offerings remained "free" for small use, and would only expose "empty" nodes ready for full utilization by end-user containers.
The reality, however, is that every managed node (like on GKE) uses quite a lot of CPU and memory out of the box, for which the user pays. On top of that there are cluster fees, just for having the cluster around. This makes it completely unfriendly to hobbyist projects, unless one is ready to pay dozens of dollars just to have Kubernetes (prior to deploying any apps to it).
(And sure, there are free tiers here and there, but they never solve this problem completely, at least not on any of the big cloud providers.)
Compare that to managed "serverless" offerings (even ones pseudo-compatible with the K8s API, like Cloud Run), which eliminate the management fees but impose a tax in latency. Oh well.
One reason this is not feasible is that K8s is not designed for secure multitenancy, so for every tenant, you'll need to spin up an entire K8s control plane, which includes a database and several services - this is what's driving the cluster fees. Keep in mind that customers also expect managed K8s to be highly available, so this cost is also going into things like replicating data, setting up load balancers, etc...
Compare this to a serverless offering that is multitenant by design: the control plane is shared, making the overhead cost of an extra user basically zero, which is why they don't charge you a fee like this.
IMO if you're a hobbyist interested in K8s, your best way to go is to install K3s, a lightweight, API-compatible K8s alternative that runs on a single node. It's pretty nice if you don't care about fault tolerance or high availability.
I'm not so sure about the economics of what you describe. It could very well be that small customers consume so little that their resource requirements could be subsumed entirely by larger users. It doesn't make much sense that both large and small customers pay the same cluster fee, for example - it would be much fairer to charge more the more you use, approaching near zero for the smallest users.
At the end of the day, all resources are run by the cloud provider on KVM instances sharing the same physical machines anyway, so it's up to them how much to charge. The fact that small and large customers pay for the same amount of resources allocated to them just means those resources are not allocated in the most efficient manner - something a cloud provider could fix.
We should also not discount the net positive effect of attracting more hobbyists and startups to your platform. That's how AWS and GCP started, for example, though now they mostly focus on enterprise business, so smaller customers mean less to them (although arguably less so for AWS). While they don't contribute as much to revenue, hobbyists and startups are essentially free advertising that keeps your platform "relevant" (especially the burgeoning startups that could grow to bring more revenue in the future!). The moment they leave, the platform just becomes another IBM that's bound to die, for better or worse.
On top of that, the contrast with serverless control planes breaks down, because one could always run the control plane on the same shared pool of resources in gVisor or Firecracker, just like with serverless.