Monitoring tiny web services (jvns.ca)
218 points by mfrw on July 9, 2022 | hide | past | favorite | 76 comments


You can get free uptime monitoring from Google Cloud. The limit is 100 uptime checks per monitoring scope, which, IIUC, may mean either a project or an organization depending on how you configure it. https://cloud.google.com/monitoring/uptime-checks. The checks are run from 6 locations around the world, so you can also catch network issues, which you likely cannot do much about when you're running a tiny service. My uptime checks show the probes come from: usa-{virginia,oregon,iowa}, eur-belgium, apac-singapore, sa-brazil-sao_paulo

Another neat monitoring thing I rely on is https://healthchecks.io. Anything that needs to run periodically checks in with the API at the start and the end of execution so you can be sure they are running as they should, on time, and without errors. Its free tier allows 20 checks.


healthchecks.io is a great service (and apparently can be self-hosted - https://github.com/healthchecks/healthchecks) that I use for both personal projects and at work.

It works really well for cron jobs - while it works with a single call, you can also hit a /start endpoint at the beginning and the success endpoint at the end, and get extra insights such as runtime for your jobs.

It would be nice if it had slightly more complex alerting rules available - for example, a "this service should run successfully at least once every X hours, but is fine to fail multiple times otherwise" type alert.

We wanted to use it for monitoring some periodic downloads (like downloading partners' reports), where the expectation is that a call will often time out, fail, or have no data to download. That is technically a "failure", but only a problem if it goes on for more than a day. Since healthchecks.io doesn't really support this, we ended up writing our own "stale data" monitoring logic and alerting inside the downloader, and just use healthchecks.io to monitor that the script isn't crashing.
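For what it's worth, the core of that kind of "stale data" rule fits in a few lines of Python (the one-day threshold and the names here are just illustrative):

```python
from datetime import datetime, timedelta
from typing import Optional

STALE_AFTER = timedelta(days=1)  # illustrative threshold

def stale_data_alert(last_success: Optional[datetime], now: datetime) -> bool:
    """Individual download attempts may time out, fail, or find no data;
    only alert once there has been no successful fetch for over a day."""
    if last_success is None:
        return True
    return now - last_success > STALE_AFTER
```

The point is that failures of individual attempts never feed the alert; only the age of the last success does.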


> "this service should run successfully at least once every X hours, but is fine to fail multiple times otherwise"

This should work if you set the period to "X hours", and send success signals only, no failure signals. In that case, as long as the gap between the success signals is below X hours, all is well. When there's been no success signal for more than X hours, Healthchecks sends out alerts.

I'm guessing you probably also want to log failures using the /fail endpoint. And, the problem is, when Healthchecks receives a failure event, it sends out alerts immediately.
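Concretely, the success-only pattern comes down to which URL the client pings; a minimal Python sketch (the hc-ping.com check UUID is a placeholder, and skipping the /fail ping is the workaround being discussed, not a Healthchecks feature):

```python
BASE = "https://hc-ping.com/your-check-uuid"  # placeholder check UUID

def ping_url(job_succeeded, log_failures=False):
    """Return the Healthchecks URL to ping for this run, or None."""
    if job_succeeded:
        return BASE            # a success ping resets the X-hour timer
    if log_failures:
        return BASE + "/fail"  # /fail alerts immediately -- only use it
                               # when intermittent failures are NOT expected
    return None                # stay silent; an alert fires only after
                               # X hours without any success ping
```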

One potential feature I'm considering is a new "/log" endpoint. When a client pings this endpoint, Healthchecks would treat it as neither a success nor a failure, and just log the received data. You could then use this endpoint in place of /fail. Just logging the failure would not cause any immediate alerts. But the information would be there for inspection, when X hours passes with no success signals and you eventually do get alerted. How does that sound?


Thank you for the response!

I saw you make that suggestion on this issue - https://github.com/healthchecks/healthchecks/issues/525#issu...

----

Thinking about it, this does solve the issue as I described it. I do like being able to distinguish the states:

  - started, but never finished (no error reported)
  - started, and finished with error reported ("crash") (need immediate alert)
  - finished (without crashing), but not 100% successful (data not fetched)
  - finished successfully

As you mention, it makes sense to have the alerts be:

  - no successful completion (regardless of number of attempts) within X time
  - explicit error occurred

I think your /log approach does have the advantage of allowing for still having an explicit error alert regardless of duration - a critical error "alert NOW!" state.

The only (weak) argument against this approach that I see (and this is an argument for making it a configuration option) is that the reason I started using HealthChecks.io is that it's incredibly simple to set up for a cron job. Moving this logic to the client means slightly more complicated error-handling logic to call the right endpoint for each type of failure.

The counter-argument is that by the time you move from calling just "/success" to calling multiple endpoints, you're already in that position of more complicated client-side logic. If you want the simple "just run at least once every X hours" approach, then all you need to do is never call "fail" and set the grace period appropriately.

For our use-case, our logic for when to alert (or not) got much more complicated than described, so moving the rules into our own code still made sense, but I think there are other instances where we'd benefit from your proposal.


Healthchecks is a great service!

Not sure if you tried it too but https://cronitor.io/ supports more complex alerting rules like the one you describe.

As a bonus, you can also create uptime checks and status pages under the same roof.

Full-disclosure: I work at Cronitor, happy to help if you have any questions :)


What is the interval for the checks?

It's written that it's 100 per metrics scope, but I don't know what that really means.(2)

Also, there seems to be no status monitor page?

2- https://cloud.google.com/monitoring/uptime-checks


Metrics scope is the logical grouping of assets you are monitoring. Explained in detail here https://cloud.google.com/monitoring/settings along with a video.

Web console allows check intervals of 1, 5, 10, or 15 minutes.


New Relic also offers a similar uptime monitoring with a generous free tier via their Synthetics service.

https://newrelic.com/platform/synthetics


I wish New Relic would support plain old ICMP ping. That would be nice. You used to be able to implement it using their Scripted API thing (which is just sandboxed Node), but at some point they broke raw socket support, which broke every ping package on npm. I think you can still make it work if you run a private minion, but that's more effort than I want to spend.


> but at some point they broke raw socket support

Sockets are a transport layer feature e.g. TCP or UDP. ICMP works at the network layer and has no notion of sockets.


The BSD socket API on many systems implements a raw socket type[1], so you can use the socket APIs to talk raw IP.

Some systems (Linux, Darwin) also implement a special icmp socket type which can allow unprivileged ping.

[1]: https://man7.org/linux/man-pages/man7/ip.7.html
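For illustration, the packet itself is plain userspace code; the socket type only matters when sending it. A Python sketch of an ICMP Echo Request (sendable via a raw socket, or via the unprivileged SOCK_DGRAM/IPPROTO_ICMP socket mentioned above):

```python
import struct

def icmp_checksum(data: bytes) -> int:
    """RFC 1071 Internet checksum over the ICMP message."""
    if len(data) % 2:
        data += b"\x00"
    total = sum(struct.unpack(f"!{len(data) // 2}H", data))
    total = (total >> 16) + (total & 0xFFFF)  # fold carries back in
    total += total >> 16
    return ~total & 0xFFFF

def echo_request(ident: int, seq: int, payload: bytes = b"ping") -> bytes:
    """Build an ICMP Echo Request (type 8, code 0) with a valid checksum."""
    header = struct.pack("!BBHHH", 8, 0, 0, ident, seq)
    csum = icmp_checksum(header + payload)
    return struct.pack("!BBHHH", 8, 0, csum, ident, seq) + payload
```

A handy self-check: the Internet checksum of a well-formed packet (with the checksum field filled in) folds to zero.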


Regardless of what layer ICMP works at, its traffic is still transmitted using datagrams, and the call to do so is socket(), yes?

Are you just trying to point out the misnomer?


Continuing the tooling thread: the free tier of https://www.uptimetoolbox.com/ is quite good.


My particular favourite is how GraphQL servers respond with "200 OK" and the errors will be sent in a key called "errors". Makes regular healthchecks almost useless.
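A health check that treats those "200 but errors" responses as failures is straightforward to sketch in Python (the top-level "errors" key is the one the GraphQL spec defines for responses):

```python
import json

def graphql_healthy(status_code: int, body: str) -> bool:
    """False for non-200 responses, unparseable bodies, or any response
    carrying a top-level "errors" key -- even when the HTTP status is 200."""
    if status_code != 200:
        return False
    try:
        payload = json.loads(body)
    except ValueError:
        return False
    return isinstance(payload, dict) and not payload.get("errors")
```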

I ended up writing my own service[0] to detect problems with graphql responses, before expanding it to cover websites and web apps too.

[0]: https://onlineornot.com


I honestly hate that so much, it's a relief to read someone saying the same.

I sort of almost made myself feel a bit better about it by thinking 'no, it's not REST, we have reached the graphql server successfully and got a .. "successful" response from it, it's sort of a "Layer 8" on top of HTTP'. The problem is that none of the bloody tooling is 'Layer 8', so you end up in browser dev tools with all these 200 responses and no idea which ones are errorful. If any.


I mean, I agree. Given the nature of the protocol, it makes sense that a half-successful response of independent queries would still return a 200 on the network protocol.

That doesn’t mean it’s not bloody annoying.


I actually think I agree with your former self, how do you tell the difference between a server and an application error? How do you tell the difference between "record not found" and "there is no GraphQL endpoint here at all"? Or "you are not allowed to access GraphQL" and "you are not allowed to access the server."

Especially because error responses from your web server layer are usually really different than errors from your backends.


> How do you tell the difference between "record not found" and "there is no GraphQL endpoint here at all"?

The standard 503 and 404 codes?


503 doesn’t map well to me meaning “you performed a successful GraphQL query but the result is that the thing you tried to query is missing.”


Yes that would be 404.

Similar to how there's no difference between "there's no /pets/" or "there is no /pet/15", they are both 404 for /pets/15

400 is also a good code for "your query doesn't make sense". Anything but 200 really.


While using HTTP status codes could work for GraphQL payloads which have only one operation in them, this approach would not work for those which have multiple[0].

0 - http://spec.graphql.org/October2021/#sec-Executing-Requests


207 has been used for that for decades. GraphQL looks like it's been implemented by someone who thought 200 and 404 were the only possible codes.


> 207 has been used for that for decades.

That's a WebDAV status code.

> GraphQL looks like it's been implemented by someone who thought 200 and 404 were the only possible codes.

Maybe. Or maybe they decided that a 2xx status would be interpreted as "success" by a non-trivial set of libraries and/or systems. Either way, take it up with the standards committee :-).


Google's uptime monitoring also allows writing JSONPath checks, so one can monitor HTTP 200 JSON responses semantically.


Github answers 404 instead of a 403 when you try to access a private repository while not being logged in.

I assume the rationale is to not leak information about what's private. But still, it's weird.


AWS S3 does the opposite when querying objects that don't exist. If you don't have the s3:ListBucket permission on the bucket, you'll get a 403 error (so you can't differentiate between the object not existing and you not having access to it).

I think either approach is valid as long as you're consistent. You can make a case for either 404 or 403 when you don't have enough permissions. In GitHub's case you can argue that it's a 404 because the resource does indeed not exist through your auth context. In AWS' case you can argue that a 403 makes sense because you don't have permission to know the answer to your query.
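The two conventions can be sketched side by side as tiny policy functions (a simplified model for illustration, not either vendor's actual logic):

```python
def github_style(exists: bool, authorized: bool) -> int:
    """Hide existence: 404 for both "missing" and "private, no access"."""
    if exists and authorized:
        return 200
    return 404

def s3_style(exists: bool, can_list: bool) -> int:
    """Hide absence: 403 whenever you lack list permission on the bucket,
    regardless of whether the object exists."""
    if not can_list:
        return 403
    return 200 if exists else 404
```

Both are consistent; they just differ in which fact (existence vs. absence) they refuse to reveal.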


That's exactly correct and has been generally the best practices guidance for decades now.


I like this apparent shift back to "small is okay" where not every service has to be an overengineered allegedly hyper-scalable distributed mess of five nines uptime with enterprise logging, alerting and monitoring.

Those things are nice when you have a bazillion users and downtime means hordes of unhappy users and dollars flushing away at insane rates, but for the vast majority of hobby projects and even mid stage startups, what is described in this article is plenty good enough.


I've thought about posting an AskHN about simple infrastructure for some time but I'm not sure how to word it to attract as many responses as possible.


Does it involve SQLite? If so, include that in the title ;-)


Yes, can confirm.


Currently I've got the cheapest VPS that I could (in my case from Time4VPS; others might prefer Hetzner, or Scaleway Stardust instances), set up Uptime Kuma on it (https://github.com/louislam/uptime-kuma), and now have checks every 5 minutes against 30+ URLs (could easily do every minute, but don't need that sort of resolution yet).

It's integrated with Mattermost currently, seems to work pretty well. Could also set it up on another VPS, for example on Hetzner (which also has excellent pricing), could also integrate another alerting method such as sending e-mails, or anything else that's supported out of the box: https://github.com/louislam/uptime-kuma/issues/284

Oh, also Zabbix for the servers themselves. Honestly, when things are as simple to set up as they are nowadays and you have about 50 EUR per year per node (1 is usually enough, 2 is better from a redundancy standpoint, since then it becomes feasible to monitor the monitoring; others might go for 3 nodes for important things), you don't even need to look for cloud services or complex systems out there.

Of course, if someone knows of some affordable options for cloud services, feel free to share!

I briefly checked the prices for a few and most of them are a little bit more expensive than just getting a VPS, setting up sshd to only use key based auth, throwing Let's Encrypt in front of the web UI (or maybe additional auth, or making it accessible only through VPN, whatever you want), adding fail2ban and unattended updates, and doing some other basic configuration that you probably have automated anyways.

The good news is that if you prefer cloud services and would rather have that piece of your setup be someone else's problem, they're not even an order of magnitude off in most cases - though I've yet to see how Uptime Kuma in particular scales once I get to 100 endpoints. It seems like at a certain scale it's a bit cheaper to run your own monitoring, but at that point you might still find it easier to just pay a vendor.

At the end of the day, there's lots of great options out there, both cloud based and self-hosted, whichever is your personal preference.


You can get a cheaper VPS through RamNode - $15/year at the moment.


That's pretty cool!

I guess I'd personally also mention Contabo as an affordable host in general (though their web UI is antiquated), especially their storage nodes: https://contabo.com/en/storage-vps/

For the most part, though, use whichever host you've been with for a few years (though feel free to experiment with whatever new platforms catch your eye), but ideally still have local backups for everything (as long as you don't have to deal with regulations that'd make it not possible) so you can migrate elsewhere.


If we're going to divert into cheap VPS providers, I'll just link LowEndTalk or LowEndBox for a huge list of cheap providers.


I can only find a $3/mo instance there... Did you mean $15/month?


You can get a free 4-vCPU, 24 GB RAM, 200 GB storage VPS with the Oracle Cloud free tier.


Rightly or wrongly, if I see an Oracle deal that sounds too good to be true, I'm going to assume someone at Oracle has a plan to trap me into a costly arrangement.


Or they’re just desperate to get even a single person to sign up. It’s a testament to their reputation that they can hand over tons of resources for free, and have everyone still go ‘nah, that’s Oracle, I’ll find a different provider’.


The catch is that you're getting an ARM server with the above offer.


You can also make two single-core x86 machines for free.


I was gonna post the same thing. Though I've barely messed around with mine; at first blush it seems like their weird firewall doesn't work perfectly...



Anyone still remember Steve Yegge's platforms rant? One particular point from it has stuck with me, because it's so obviously correct and so obviously difficult to implement at small scale: "Monitoring and QA are the same thing". This is probably my internal OCD not being able to cope with sub-100% solutions, but every time I see a healthcheck endpoint doing basically a ping-pong response or maybe checking the database connection, I can't help but think about what it doesn't do, and that's basically everything up to an integration test's "works correctly". It's fascinating but at the same time horrible to know how much of "works fine so far" in our industry is circumstantial and good-will optimistic judgement, but not knowledge. "If builders built buildings the way programmers wrote programs, then the first woodpecker that came along would destroy civilization" indeed.


I have been using cronitor[0] for a few months now and I have been really satisfied with them so far!

[0]: https://cronitor.io


$2/monitor?!


I’m currently on the “hacker” tier, no business use case for me at the moment.


I installed Uptime Kuma (https://github.com/louislam/uptime-kuma) on my dokku paas to monitor my dokku apps. It works great. It is great for pure HTTP services, but it can be used against things like RTMP servers because it also permits configuration of a health check with TCP pings. It gives me an email when things are down, and supports retry, heartbeat intervals, and can validate a string in the HTML retrieved. I love it.


I considered this option but then realized that both sides (the API/services and the uptime checker) would be on the same server, so any problem impacting the server itself would take the monitoring offline too.


I think two would cover it:

* an uptime checker, in a container, running against however many services or sites

* uptime robot or whatever, against the local uptime checker


If I have to do one thing to monitor a simple website I'm probably going to use something that takes a screenshot periodically and checks it for changes. There are open source solutions but I just prefer to pay a bit for a managed service to do it.

I think it covers quite a lot of things - the servers are up, DNS is OK, assets are OK. It can also be a safety net in case other, more sophisticated monitoring fails to detect an unusual state.

This doesn't work well for websites with too much JavaScript, ads, or widgets.


What are the OSS solutions for this?



Selenium is the first one that comes to mind. I think most of the big browser automation toolkits should have the ability to do screenshots in some capacity.


Yes, you're definitely right - I've recently started using SeleniumBase for Python, which includes a few extra niceties. I was more asking about the part where you compare between releases (or test suite runs) to see if anything changed. I suppose you can just compare hashes, but I could imagine also having some feature that highlights changes on the screenshot.

SeleniumBase has a feature that looks at markup and which stores screenshots for visual comparison to a set baseline: https://seleniumbase.io/examples/visual_testing/ReadMe/

Found this blog describing use of SeleniumBase for that and OpenCV for screenshot comparison as another layer: https://blog.streamlit.io/testing-streamlit-apps-using-selen...
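The hash-comparison baseline mentioned above is one-liner territory; a Python sketch (byte-level only, which is exactly why it falls over on ads and other nondeterministic rendering, and why perceptual diffing with something like OpenCV is the better layer):

```python
import hashlib

def screenshots_differ(old_png: bytes, new_png: bytes) -> bool:
    """Byte-exact change detection via SHA-256 digests; any pixel-level
    nondeterminism (ads, timestamps, animations) counts as a change."""
    return hashlib.sha256(old_png).digest() != hashlib.sha256(new_png).digest()
```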


You want to update your tools to Puppeteer or Playwright.


I like monitored cgi shell script endpoints a lot, e.g. the one behind https://updown.io/44q5 is https://codeberg.org/mro/internet-radio-recorder/src/branch/...


Health checks should include a read from the database(s).
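A sketch of what that adds over a bare ping-pong endpoint, using SQLite so the example is self-contained (the SELECT 1 idea carries over to any database driver):

```python
import sqlite3

def healthcheck(db_path: str) -> bool:
    """Return True only if we can open the database and complete a read."""
    try:
        conn = sqlite3.connect(db_path, timeout=2)
        try:
            conn.execute("SELECT 1").fetchone()
        finally:
            conn.close()
        return True
    except sqlite3.Error:
        return False
```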


If you have a popular service, then one of the best approaches is to have your users notify you when something is down or is broken. This pattern follows the famous quote: “Given enough eyeballs all bugs are shallow.” I have employed this approach to great success and haven’t had a need for any monitoring services.


If users see the problem, it is too late. You will be seen as unable to keep the service up, and the service will be seen as flaky.

Also, the Holy Grail of monitoring is being able to remediate the problem automatically - this is pretty hard when users are reporting it.


Another approach that has been working great for me: https://www.webalert.me. This app runs on your phone, you can configure it to check once an hour if any content on a page changes.


Since everyone is posting their favorite free-tier monitoring products - does anyone have a recommendation for a cloud product that will allow us to create a group of ping monitors and alert only if all monitors in the group are down for N minutes?


I’ll double check when they’re online, but I’m pretty sure BitPing can do stuff like this. They farm the actual checks out to actual users' devices across a whole stack of geographies, and you can customise regions, rates, etc.

https://www.bitping.com/

Full disclosure: my friends business


You could hack that together with huginn pretty easily

https://github.com/huginn/huginn


> [...] recommend a cloud product

Hacker mentality never left this site since inception :)


I am curious about the use case. What group of servers do you want to monitor?


We have dual internet connections coming into a satellite office and we only want to be alerted if both are down.


With icinga2 (or any nagios successor) you could write a custom check command that does a ping check on both IPs (and return an error status only if both are down).
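The logic of such a check command is tiny; a Python sketch using Nagios-style exit codes (the WARNING state for losing one link is an extra assumption on top of the original "alert only if both are down" requirement):

```python
def dual_uplink_status(primary_up: bool, secondary_up: bool):
    """Nagios/Icinga exit semantics: 0=OK, 1=WARNING, 2=CRITICAL.
    Alert (CRITICAL) only when BOTH links are down; a single dead
    link just means redundancy is lost."""
    if not primary_up and not secondary_up:
        return 2, "CRITICAL: both uplinks down"
    if not (primary_up and secondary_up):
        return 1, "WARNING: one uplink down, redundancy lost"
    return 0, "OK: both uplinks up"
```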


Seems like a trivial service to make. Reddit has RSS, Twitter has webhooks, there are so many ways to monitor.

Hopefully if the service is worth reading, your customers will go and politely inform you. It is nice to have your work acknowledged.


Have to say, this is exactly what Kubernetes was designed to solve. But the focus was on microservices and containers, and things also got out of hand.


> have to say, this is exactly what kubernetes was designed to solve

Kubernetes probes are much different in my opinion.

Your Kubernetes liveness check will check if things are working inside of your cluster which is great for a high frequency checkup to potentially modify the state of your pod based on the result.

But Uptime Robot is an end to end test. It tests a real connection over the internet to your domain which exercises external DNS, traffic flowing through any reverse proxies, your SSL certificate, etc..

Both complement each other for different use cases.


Be careful, uptimerobot actually ignores TLS errors in its free plan. You will get no notification if your certificate is expired or outright invalid.


Yep, that's a good call.

Fortunately, providing an email address when registering a new Let's Encrypt cert means LE will warn you if something happens where your cert doesn't get renewed and is about to expire.
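For belt and braces you can also watch expiry directly; a Python sketch (the handshake half obviously needs network access; the date arithmetic is the reusable core):

```python
import socket
import ssl
from datetime import datetime, timezone

def days_left(not_after: str, now: datetime) -> float:
    """Days until expiry, given the notAfter string from getpeercert(),
    e.g. "Mar 30 12:00:00 2030 GMT"."""
    expires = datetime.strptime(not_after, "%b %d %H:%M:%S %Y %Z")
    return (expires.replace(tzinfo=timezone.utc) - now).total_seconds() / 86400

def cert_days_left(host: str, port: int = 443) -> float:
    """Do a real TLS handshake and report how many days the cert has left."""
    ctx = ssl.create_default_context()
    with socket.create_connection((host, port), timeout=10) as sock:
        with ctx.wrap_socket(sock, server_hostname=host) as tls:
            not_after = tls.getpeercert()["notAfter"]
    return days_left(not_after, datetime.now(timezone.utc))
```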


I really wish managed Kubernetes offerings remained "free" for small use, and would only expose "empty" nodes ready for full utilization by end-user containers.

The reality however is that every managed node (like on GKE) uses quite a lot of CPU and memory out of the box, for which the user pays. On top of that there are cluster fees, just for having one around. This makes it completely unfriendly to hobbyist projects, unless one is ready to pay dozens of dollars just to have Kubernetes (prior to deploying any apps to it).

(And sure, there are free tiers here and there, but they never solve this problem completely, at least not on any of the big cloud providers)

Compare that to managed "serverless" offerings (even pseudo-compatible with K8s API like Cloud Run), which eliminate the management fees, but impose a tax with latency. Oh well.


One reason this is not feasible is that K8s is not designed for secure multitenancy, so for every tenant, you'll need to spin up an entire K8s control plane, which includes a database and several services - this is what's driving the cluster fees. Keep in mind that customers also expect managed K8s to be highly available, so this cost is also going into things like replicating data, setting up load balancers, etc...

Compare this to a serverless offering that is multitenant by design: the control plane is shared, making the overhead cost of an extra user basically zero, which is why they don't charge you a fee like this.

IMO if you're a hobbyist interested in K8s, your best way to go is to install K3s, which is a lightweight, API compatible K8s alternative that runs on a single node. It's pretty nice if you don't care about fault tolerance or High Availability.

https://k3s.io/


I'm not so sure about the economics of what you describe. I think it could very well be that small customers consume so little "bandwidth" that their resource requirements could be subsumed entirely by larger users'. It doesn't make much sense that both large and small customers have to pay the same cluster fee, for example - it would be much fairer to charge more the more you use, approaching near zero the less you use it.

At the end of the day, all resources are run by the cloud provider on KVMs sharing the same physical machines anyways, so it's up to them how much to charge. The fact that both small and large customers get to pay for the same amount of resources allocated for them, only means these resources are not allocated in the most efficient manner. So a cloud provider could fix this.

We should also not discount the net positive effect of attracting more hobbyists and startups to your platform. That's how AWS and GCP started, for example, but now they're just focusing on more enterprise business, so smaller customers mean less to them (although AWS arguably less so). But we shouldn't forget that while they don't contribute as much to the revenue, they're essentially a free advertising resource that keeps your platform "relevant" (especially the burgeoning startups that could grow to bring more revenue in the future!). The moment they leave, the platform just becomes another IBM that's bound to die, for better or worse.

On top of that, the anti-analogy with serverless for control plane breaks down, because one could always run it on the same shared pool of resources in gVisor or Firecracker, just like with serverless.



