You can get free uptime monitoring from Google Cloud. The limit is 100 uptime checks per monitoring scope, which, as I understand it, can be either a project or an organization depending on how you configure it: https://cloud.google.com/monitoring/uptime-checks. The checks are run from 6 locations around the world, so you can also catch network issues - though when you're running a tiny service, there's usually not much you can do about those. My uptime checks show the probes come from: usa-{virginia,oregon,iowa}, eur-belgium, apac-singapore, sa-brazil-sao_paulo
Another neat monitoring thing I rely on is https://healthchecks.io. Anything that needs to run periodically checks in with the API at the start and end of execution, so you can be sure your jobs are running as they should, on time, and without errors. Its free tier allows 20 checks.
It works really well for cron jobs - a single ping is enough, but you can also hit the /start endpoint at the beginning and the success endpoint at the end to get extra insights such as runtime for your jobs.
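The /start-then-success pattern wraps a cron job like this (a minimal sketch; the UUID in the URL is a placeholder for your check's actual UUID from the Healthchecks dashboard):

```python
import urllib.request

BASE = "https://hc-ping.com/your-uuid-here"  # placeholder; use your check's UUID

def ping_url(suffix=""):
    # "/start" marks the beginning, the bare URL marks success, "/fail" an error
    return BASE + suffix

def ping(suffix=""):
    try:
        urllib.request.urlopen(ping_url(suffix), timeout=5)
    except OSError:
        pass  # monitoring must never break the job itself

def my_cron_job():
    pass  # placeholder for the actual work

ping("/start")     # Healthchecks starts timing here
try:
    my_cron_job()
except Exception:
    ping("/fail")  # explicit failure: alerts fire immediately
    raise
ping()             # success: Healthchecks records the elapsed runtime
```

Swallowing ping errors is deliberate: a monitoring outage shouldn't take the job down with it.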
It would be nice if it had slightly more complex alerting rules available - for example, a "this service should run successfully at least once every X hours, but is fine to fail multiple times otherwise" type alert.
We wanted to use it to monitor some periodic downloads (like downloading partners' reports), where the expectation is that a call will often time out, fail, or have no data to download. Technically that's a "failure", but it only matters if it goes on for more than a day. Since healthchecks.io doesn't really support this, we ended up writing our own "stale data" monitoring and alerting logic inside the downloader, and just use healthchecks.io to monitor that the script isn't crashing.
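The core of that kind of "stale data" rule is tiny (a hedged sketch - the names and the one-day window are illustrative, not our actual downloader code):

```python
from datetime import datetime, timedelta

def is_stale(last_success: datetime, now: datetime,
             max_age: timedelta = timedelta(days=1)) -> bool:
    """Individual download attempts may fail freely; only alert when no
    attempt has succeeded within max_age (here, one day)."""
    return now - last_success > max_age
```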
> "this service should run successfully at least once every X hours, but is fine to fail multiple times otherwise"
This should work if you set the period to "X hours", and send success signals only, no failure signals. In that case, as long as the gap between the success signals is below X hours, all is well. When there's been no success signal for more than X hours, Healthchecks sends out alerts.
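In client code, the success-only pattern described above amounts to: ping on success, stay silent on failure (a sketch; the UUID is a placeholder):

```python
import urllib.request

CHECK_URL = "https://hc-ping.com/your-uuid-here"  # placeholder UUID

def attempt(job) -> bool:
    """Run one attempt. Ping Healthchecks only on success; failed
    attempts send nothing, so an alert fires only once the check's
    period (X hours) elapses with no success signal at all."""
    try:
        job()
    except Exception:
        return False  # deliberately silent: no /fail ping
    try:
        urllib.request.urlopen(CHECK_URL, timeout=5)
    except OSError:
        pass  # don't let a monitoring hiccup fail the job
    return True
```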
I'm guessing you probably also want to log failures using the /fail endpoint. The problem is, when Healthchecks receives a failure event, it sends out alerts immediately.
One potential feature I'm considering is a new "/log" endpoint. When a client pings this endpoint, Healthchecks would treat it as neither a success nor a failure, and just log the received data. You could then use this endpoint in place of /fail. Just logging the failure would not cause any immediate alerts. But the information would be there for inspection, when X hours passes with no success signals and you eventually do get alerted. How does that sound?
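To make the proposal concrete, here is how a client might pick between the three signals - assuming the proposed /log endpoint shipped with the same URL scheme as the existing /start and /fail suffixes (it hasn't yet; this is a sketch of the idea, with a placeholder UUID):

```python
CHECK = "https://hc-ping.com/your-uuid-here"  # placeholder UUID

SUFFIXES = {
    "success": "",        # resets the period timer
    "crash": "/fail",     # alerts immediately
    "soft-fail": "/log",  # proposed: recorded for inspection, no alert
}

def report_url(outcome: str) -> str:
    return CHECK + SUFFIXES[outcome]

# e.g. a timed-out partner download would be a soft failure:
# urllib.request.urlopen(report_url("soft-fail"), timeout=5)
```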
Thinking about it, this does solve the issue as I described it. I do like being able to distinguish the states:
- started, but never finished (no error reported)
- started, and finished with error reported ("crash") (need immediate alert)
- finished (without crashing), but not 100% successful (data not fetched)
- finished successfully
As you mention, it makes sense to have the alerts be:
- no successful completion (regardless of number of attempts) within X time
- explicit error occurred
I think your /log approach has the added advantage of still allowing an explicit error alert regardless of duration - a critical "alert NOW!" state.
The only (weak) argument I see against this approach (and it's really an argument for making this a configuration option) is that the reason I started using HealthChecks.io in the first place is that it's incredibly simple to set up for a cron job. Moving this logic to the client means slightly more complicated error handling to call the right endpoint for each type of failure.
The counter-argument is that by the time you move from calling just the success endpoint to calling multiple endpoints, you're already in the position of having more complicated client-side logic. If you want the simple "just run at least once every X hours" approach, all you need to do is never call /fail and set the grace period appropriately.
For our use-case, our rules for when to alert (and when not to) grew much more complicated than described, so moving them into our own code still made sense - but I think there are other instances where we'd benefit from your proposal.
I wish New Relic supported plain old ICMP ping. That would be nice. You used to be able to implement it using their Scripted API thing (which is just sandboxed Node), but at some point they broke raw socket support, which broke every ping package on npm. I think you can still make it work if you run a private minion, but that's more effort than I want to spend.