
You can get free uptime monitoring from Google Cloud. The limit is 100 uptime checks per monitoring scope, which, if I understand correctly, can mean either a project or an organization depending on how you configure it. https://cloud.google.com/monitoring/uptime-checks. The checks are run from 6 locations around the world, so you can also catch network issues, which you likely cannot do much about when you're running a tiny service. My uptime checks show the probes come from: usa-{virginia,oregon,iowa}, eur-belgium, apac-singapore, sa-brazil-sao_paulo

Another neat monitoring thing I rely on is https://healthchecks.io. Anything that needs to run periodically checks in with the API at the start and end of execution, so you can be sure jobs are running as they should, on time, and without errors. Its free tier allows 20 checks.



healthchecks.io is a great service (and apparently can be self-hosted - https://github.com/healthchecks/healthchecks) that I use for both personal projects and at work.

It works really well for cron jobs - while it works with a single call, you can also hit the /start endpoint and then the regular success endpoint when finished, and get extra insights such as runtime for your jobs.
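A minimal sketch of that start/finish pattern in Python (the check UUID is a placeholder; hc-ping.com is Healthchecks' documented ping host, and the injectable `ping` parameter is just to keep the wrapper easy to test):

```python
import urllib.request

# Placeholder ping URL; substitute your check's real UUID from Healthchecks.
PING_URL = "https://hc-ping.com/your-check-uuid"

def _http_ping(url):
    urllib.request.urlopen(url, timeout=10)

def run_with_pings(job, ping=_http_ping):
    """Signal /start before the job, then success or /fail afterwards."""
    ping(PING_URL + "/start")      # runtime measurement begins here
    try:
        result = job()
    except Exception:
        ping(PING_URL + "/fail")   # explicit failure triggers an alert
        raise
    ping(PING_URL)                 # plain ping marks success
    return result
```

In a real cron job you would typically just call the URLs directly (or with curl) around the command.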

It would be nice if it had slightly more complex alerting rules available - for example, a "this service should run successfully at least once every X hours, but is fine to fail multiple times otherwise" type alert.

We wanted to use it for monitoring some periodic downloads (like downloading partners' reports), and the expectation is that the call will often time out or fail or have no data to download, which is technically a "failure", but only a problem if it goes on for more than a day. Since healthchecks.io doesn't really support this, we ended up writing our own "stale data" monitoring and alerting logic inside the downloader, and just use healthchecks.io to monitor that the script isn't crashing.


> "this service should run successfully at least once every X hours, but is fine to fail multiple times otherwise"

This should work if you set the period to "X hours", and send success signals only, no failure signals. In that case, as long as the gap between the success signals is below X hours, all is well. When there's been no success signal for more than X hours, Healthchecks sends out alerts.

I'm guessing you probably also want to log failures using the /fail endpoint. The problem is that when Healthchecks receives a failure event, it sends out alerts immediately.

One potential feature I'm considering is a new "/log" endpoint. When a client pings this endpoint, Healthchecks would treat it as neither a success nor a failure, and just log the received data. You could then use this endpoint in place of /fail. Just logging the failure would not cause any immediate alerts, but the information would be there for inspection when X hours pass with no success signals and you eventually do get alerted. How does that sound?
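For concreteness, a hypothetical client-side sketch of that idea. Note the /log path is the endpoint proposed in this comment, not something Healthchecks currently documents, and the UUID is a placeholder:

```python
import urllib.request

PING_URL = "https://hc-ping.com/your-check-uuid"  # placeholder UUID

def _http_ping(url):
    urllib.request.urlopen(url, timeout=10)

def attempt_download(fetch, ping=_http_ping):
    """One attempt of a flaky periodic download."""
    try:
        data = fetch()
    except Exception:
        # Hypothetical: record the failure without an immediate alert.
        ping(PING_URL + "/log")
        return None
    ping(PING_URL)  # success resets the X-hour timer
    return data
```

Alerting would then happen only when no success ping arrives within the configured period, with the logged failures available for inspection afterwards.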


Thank you for the response!

I saw you make that suggestion on this issue - https://github.com/healthchecks/healthchecks/issues/525#issu...

----

Thinking about it, this does solve the issue as I described it. I do like being able to distinguish the states:

  - started, but never finished (no error reported)
  - started, and finished with an error reported ("crash") (needs an immediate alert)
  - finished (without crashing), but not 100% successful (data not fetched)
  - finished successfully

As you mention, it makes sense to have the alerts be:

  - no successful completion (regardless of the number of attempts) within X time
  - an explicit error occurred

I think your /log approach has the advantage of still allowing an explicit error alert regardless of duration - a critical "alert NOW!" state.

The only (weak) argument against this approach that I see (and it's really an argument for making this a configuration option) is that the reason I started using HealthChecks.io is that it's incredibly simple to set up for a cron job. Moving this logic to the client means slightly more complicated error handling to call the right endpoint for each type of failure.

The counter-argument is that by the time you move from calling just "/success" to calling multiple endpoints, you're already in that position of more complicated client-side logic. If you want the simple "just run at least once every X hours" approach, all you need to do is never call "/fail" and set the grace period appropriately.

For our use case, our logic for when to alert (or not) got much more complicated than described, so moving the rules into our own code still made sense, but I think there are other instances where we'd benefit from your proposal.


Healthchecks is a great service!

Not sure if you've tried it too, but https://cronitor.io/ supports more complex alerting rules like the one you describe.

As a bonus, you can also create uptime checks and status pages under the same roof.

Full-disclosure: I work at Cronitor, happy to help if you have any questions :)


What is the interval for the checks?

It's written that it's 100 per metrics scope, but I don't really know what that means. (2)

Also, there seems to be no status monitor page?

2- https://cloud.google.com/monitoring/uptime-checks


A metrics scope is the logical grouping of assets you are monitoring. It's explained in detail, along with a video, here: https://cloud.google.com/monitoring/settings

The web console allows check intervals of 1, 5, 10, or 15 minutes.


New Relic also offers similar uptime monitoring with a generous free tier via their Synthetics service.

https://newrelic.com/platform/synthetics


I wish New Relic supported plain old ICMP ping. That would be nice. You used to be able to implement it using their scripted API (which is just sandboxed Node), but at some point they broke raw socket support, which broke every ping npm package in existence. I think you can still make it work if you run a private minion, but that's more effort than I want to spend.


> but at some point they broke raw socket support

Sockets are a transport-layer feature, e.g. TCP or UDP. ICMP works at the network layer and has no notion of sockets.


The BSD socket API on many systems implements a raw socket type[1], so you can use the socket APIs to talk raw IP.

Some systems (Linux, Darwin) also implement a special ICMP socket type which allows unprivileged ping.

[1]: https://man7.org/linux/man-pages/man7/ip.7.html
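To make the point concrete: an ICMP echo request is just a handful of bytes you assemble yourself before handing them to a raw (or, on Linux/Darwin, unprivileged ICMP) socket with sendto(). A sketch of building one in Python, using the RFC 1071 Internet checksum:

```python
import struct

def icmp_checksum(data: bytes) -> int:
    # RFC 1071: one's-complement sum of 16-bit words, folded and inverted.
    if len(data) % 2:
        data += b"\x00"
    total = sum(struct.unpack("!%dH" % (len(data) // 2), data))
    total = (total >> 16) + (total & 0xFFFF)
    total += total >> 16
    return ~total & 0xFFFF

def echo_request(ident: int, seq: int, payload: bytes = b"ping") -> bytes:
    # Type 8 (echo request), code 0; checksum covers the whole ICMP message
    # and is computed with the checksum field zeroed first.
    header = struct.pack("!BBHHH", 8, 0, 0, ident, seq)
    csum = icmp_checksum(header + payload)
    return struct.pack("!BBHHH", 8, 0, csum, ident, seq) + payload
```

A handy property for sanity-checking: recomputing the checksum over the finished packet (checksum field included) yields 0.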


Regardless of what layer ICMP works at, its traffic is still transmitted using datagrams, and the call to do so is socket(), yes?

Are you just trying to point out the misnomer?


Continuing the tooling thread: the free tier of https://www.uptimetoolbox.com/ is quite good.



