Ask HN: As a developer, am I wrong to think monitoring alerts are mostly noise?
22 points by yansoki 70 days ago | 40 comments

I'm a solo developer working on a new tool, and I need a reality check from the ops and infrastructure experts here.
My background is in software development, not SRE. From my perspective, the monitoring alerts that bubble up from our infrastructure have always felt like a massive distraction. I'll get a page for "High CPU" on a service, spend an hour digging through logs and dashboards, only to find out it was just a temporary traffic spike and not a real issue. It feels like a huge waste of developer time.
My hypothesis is that the tools we use are too focused on static thresholds (e.g., "CPU > 80%") and lack the context to tell us what's actually an anomaly. I've been exploring a different approach based on peer-group comparisons (e.g., is api-server-5 behaving differently from its peers api-server-1 through 4?).
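To make that concrete, here's roughly the kind of comparison I have in mind, as a minimal Python sketch (the host names and CPU numbers are made up, and a real system would obviously look at more than a single CPU reading):

    # Peer-group anomaly scoring: flag a host whose metric deviates strongly
    # from the median of its peers (robust z-score via median absolute deviation).
    import statistics

    def peer_anomaly(metrics: dict[str, float], host: str, threshold: float = 3.0) -> bool:
        """Return True if `host` is more than `threshold` robust z-scores away
        from the median of its peer group."""
        peers = [v for h, v in metrics.items() if h != host]
        median = statistics.median(peers)
        mad = statistics.median(abs(v - median) for v in peers) or 1e-9
        robust_z = abs(metrics[host] - median) / (1.4826 * mad)
        return robust_z > threshold

    # Hypothetical example: api-server-5 runs hot while its peers sit around 40% CPU.
    cpu = {"api-server-1": 41.0, "api-server-2": 39.5, "api-server-3": 42.3,
           "api-server-4": 40.1, "api-server-5": 93.7}
    print(peer_anomaly(cpu, "api-server-5"))  # True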
But I'm coming at this from a dev perspective and I'm very aware that I might be missing the bigger picture. I'd love to learn from the people who live and breathe this stuff.
How much developer time is lost at your company to investigating "false positive" infrastructure alerts?
Do you think the current tools (Datadog, Prometheus, etc.) create a significant burden for dev teams?
Is the idea of "peer-group context" a sensible direction, or are there better ways to solve this that I'm not seeing?
I haven't built much yet because I want to make sure I'm solving a real problem first. Any brutal feedback or insights would be incredibly valuable.

The problem you're running into is that you're starting to see the monitors as useless, which ultimately leads to ignoring them; then, when there is a real problem, you won't know about it.
What you should be doing is tuning the monitors to make them useful. If your app sees occasional spikes that last 10 minutes, and the monitor checks every 5 minutes, set it to only create an alert after 3 consecutive failures. That gives you some tolerance for short spikes, but still alerts you to a prolonged issue that needs to be addressed before it causes real performance problems.
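For example, here's a minimal Python sketch of that debouncing logic (the 80% threshold, 5-minute check interval, and three-failure rule are assumptions to match the example above, not tied to any particular monitoring tool):

    # "Only alert after N consecutive failures": tolerate short spikes,
    # page on sustained breaches. Checks are assumed to run every 5 minutes.
    class DebouncedAlert:
        def __init__(self, threshold: float, required_failures: int = 3):
            self.threshold = threshold
            self.required_failures = required_failures
            self.consecutive = 0

        def observe(self, value: float) -> bool:
            """Record one check; return True only when the value has been over
            the threshold for `required_failures` checks in a row."""
            if value > self.threshold:
                self.consecutive += 1
            else:
                self.consecutive = 0  # one healthy check resets the streak
            return self.consecutive >= self.required_failures

    # A two-check spike (85, 90) never pages; the later sustained breach does.
    alert = DebouncedAlert(threshold=80.0, required_failures=3)
    for cpu_pct in [85, 90, 70, 85, 88, 91]:
        if alert.observe(cpu_pct):
            print("page: sustained high CPU")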
For alerts that fire often and need a repeatable action taken, that's where EDA (Event-Driven Automation) comes in. Write some code that fixes what needs to be fixed, and have it run automatically whenever the alert comes in. You then only need to step in when the EDA code can't resolve the issue: fix it once in code instead of every time you get an alert.
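A rough sketch of what that could look like, assuming a hypothetical alert payload and remediation functions (a real setup would hang this off a webhook from your alerting tool):

    # Event-driven remediation: map alert types to fix functions and only
    # escalate to a human when automation can't resolve the alert.
    def restart_service(alert: dict) -> bool:
        # Placeholder: in practice, shell out to systemd, Kubernetes, etc.
        # and report whether the restart actually succeeded.
        return True

    REMEDIATIONS = {"service_down": restart_service}

    def handle_alert(alert: dict) -> None:
        fix = REMEDIATIONS.get(alert["type"])
        if fix and fix(alert):
            print(f"auto-remediated {alert['type']} on {alert['host']}")
        else:
            print(f"escalating {alert['type']} on {alert['host']} to a human")

    handle_alert({"type": "service_down", "host": "api-server-5"})  # auto-fixed
    handle_alert({"type": "disk_full", "host": "api-server-2"})     # escalated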