For me the worst thing about being on-call is not the actual work outside business hours (it’s usually not much), but the potential work: if something happens I need to get to my laptop within X minutes (it varies from company to company, but it’s usually within 10). This means: I cannot go for a run, I cannot go to the movies, I cannot go for a dinner with family, I cannot even go shopping (the shopping mall is further than a 10 min trip). Basically, all I can do is stay at home and be available. It sucks, and the money is not worth it.
10 min doesn't seem tenable over 24/7. Most likely you need to run errands and so on. In my team, our alerts aren't that critical; I just acknowledge on the phone and make sure I can get to the computer within 30 min. I take the laptop with me if needed.
No, for me, the real pain with oncall is that there are a lot of systems in my team. I understand maybe 30% of them well. I'm clueless about another 30%. The rest is somewhere in between. I can try to fix issues myself (takes a long time, and issues can add up) or triage (but that means bothering someone else). There are also things that nobody understands, and if they break, it can mean extra days of work for you with extra stress, because you don't know how hard they will be to fix.
Then, some people in the team ship code without adequate testing because of pressure to ship, which often adds work for the oncall. So there's all this extra tension with colleagues, which can be hard to deal with for an introvert.
Overall, it "kind of" works for us, but I agree with the conclusion, it sucks. It's really the worst part of my job. I went into software engineering because I like coding. Not because I liked monitor unreliable systems. And I think unreliability is encouraged by management to some extent. That keeps people at work.
This is exactly why I gave up a position as a full stack / devops engineer in favor of going back to low level drivers - there were too many unknowns, and far too many unknown unknowns often paired with expectations of prompt (and cheap) solutions to complicated issues.
Technically it was interesting and challenging, but in terms of stress just not worth it. You could pay me twice my current salary and I still would not go back to it. Now I try to place myself as far away from paying customers as technically possible.
> ...far too many unknown unknowns often paired with expectations of prompt (and cheap) solutions to complicated issues.
That describes pretty much all of my "full-stack" experience.
What sort of job/background do you have where you are writing low level drivers? I'd love to get into that side of things but I don't know where to start.
How'd you manage the transition (back?) to low-level? I would love to do chip work (or really anything systems-y) but all my experience is fullstack/webdev. Every time I apply I get bounced for insufficient domain experience.
I started off my career in low level stuff and transitioned upwards to web. I’ve always been all over the place in terms of tech, so it wasn’t a particularly big step either way. I’ve usually got something low level-ish going on at home. Emulator development, robotics, …
Oncall is becoming popular even for low level. My last few roles have all required it for reasons I've been unable to figure out beyond "all developers need on-call and you're a developer". In my case, a fix often requires hardware access and my commute is longer than the start-work SLA.
When I'm in charge of an on-call rotation I always try to make it very clear that this is not the expectation.
In my preferred model of on-call, you have a primary, then after 5min an escalation to secondary, then after 5min an escalation to something drastic (sometimes "everyone", sometimes a manager).
The expectation is that most of the time you should be able to respond within 5 minutes, but if you can't then that's what the secondary role is for - to catch you. This means it's perfectly acceptable to go for a run, go to a movie, etc.
You relax the responsibility on the individual and let a sensible amount of redundancy solve the problem instead. Everyone is less stressed, and sure you get the occasional 5min delay in response but I'm willing to bet that the overall MTTR is lower since people are well rested and happier to be on call to begin with.
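For anyone who wants to see the shape of that escalation chain concretely, here's a minimal sketch in Python. The role names, 5-minute waits, and the final "drastic" step are illustrative assumptions taken from the description above, not any particular paging product's API.

```python
from dataclasses import dataclass

@dataclass
class EscalationStep:
    target: str          # who gets paged at this step (hypothetical role name)
    wait_minutes: int    # how long to wait before escalating further

# Primary first, secondary after 5 minutes, then the "drastic" step.
POLICY = [
    EscalationStep(target="primary", wait_minutes=5),
    EscalationStep(target="secondary", wait_minutes=5),
    EscalationStep(target="everyone-or-manager", wait_minutes=0),
]

def page(alert: str, acked_by: set[str]) -> str | None:
    """Walk the chain until a step's target would acknowledge the page."""
    elapsed = 0
    for step in POLICY:
        print(f"t+{elapsed}min: paging {step.target} for {alert}")
        if step.target in acked_by:
            return step.target
        elapsed += step.wait_minutes
    return None  # nobody acked: the incident needs manual follow-up

# Example: the primary is out for a run; the secondary catches it 5 minutes later.
print(page("disk-full on db-1", acked_by={"secondary"}))
```

The point of the structure is exactly what the comment describes: an individual missing one page costs a few minutes, not an unresponded incident.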
We have a primary/backup setup and I would be pretty pissed if my primary just started going out for movies or a date night during their shift tbh. My job as a backup is to be there for unexpected events, ie they did not wake up or had an accident. Not be on call effectively 2 weeks in a row just because the primary doesn't take it seriously.
Yeah, going for a run or a dinner where you might be able to ack but not actually at keys for 10-20 minutes is one thing. Going to a movie or date where you might not even ack and won't be at keys for hours? Not cool at all.
I don’t see how this changes the problem where there is an expected guarantee of a rapid response except that now two people are expected to be available and would now need to directly coordinate in order to ensure one person’s going for a swim doesn’t interfere with the other’s WoW raid.
I guess to me that seems worse because that’d effectively double the amount of off-hours accountability per teammate. Not only do you need to be first on call for your primary hours, severely restricting the quality of your “free time”, but now you ALSO have to be secondary on call for that irresponsible coworker who goes afk without properly communicating for 2 hours, dipping twice into your actual free time.
Out of 168 hours in a week, there are maybe up to 8 where I want to do something that interferes with being oncall. There's no real downside to being oncall for the other 160 hours. But I would get a lot of disutility from losing my freedom during those 8.
This is pretty much how it should be done. If the business demands more, they should have a properly manned 24x7 NOC.
You also need *ownership*. There is nothing worse than having to support somebody else's work and not being allowed (whether via time or other restrictions) to do things "right" so that you're not always paged for fixable problems. Everywhere I've worked where the techs had ownership (which varied from OPS people being allowed to override the backlog to fix issues, to developers being given enough free rein to fix technical debt), oncall was barely an issue. At my current gig I often forget I'm even on call at all, and the main issues that do crop up are usually external.
Almost all the reliability issues I encounter are due to constraints ordered by people who don't have to deal with on-call.
Things like running in AWS but having to use a custom K8s install so they aren't dependent on AWS.
Or using self-managed Kafka so that you aren't dependent on proprietary tech.
It all sucks because they are always less reliable and generate their own errors and noise for on-calls.
If they had to deal with phone calls every time there's a firewall issue that had absolutely nothing to do with the application, they would soon change their tune.
So it takes 10 min until you've gone to the drastic solution? With this time-frame it would be risky to go to the bathroom, never mind go to a movie. Also, even the backup sounds like a primary in this scenario.
Sure, but the assumption here is that primary and backup (edit: probably, ie. they're not coordinating this) aren't going to the bathroom at the same time. It's also based on the idea that alerts are extremely rare to begin with. If you're expecting at least one page every rotation, that's way, way too often. Step one is to get alerts under control, step two is a sane on-call rotation.
We want to ack within five minutes, and be at a laptop within 30. So long as I'm within mobile signal when the page goes off, it doesn't really matter what I'm doing — an ack is a button press on a push notification. And I can stay within 30 minutes of my laptop and an Internet connection by carrying said laptop and my phone (with "unlimited" data).
If the primary (paid) on-call doesn't catch the notification, the secondary (unpaid) will be paged. And so on, down a couple more steps, to a senior manager. There's no expectation that anyone other than the primary would actually be available to ack the alert.
Having the primary/secondary rotation is arguably worse. In that model, from the perspective of any one participant, now they're on-call for two weeks each time around instead of one.
Pay is all about power. Nurses individually have little power (as they are replaceable), which is why unionising is good for them as the union gives them collective power.
Software engineers are an interesting case; some have a great deal of domain-specific knowledge, giving them leverage over their employers. Many less so, and so a union could help. AI might change this equation too.
> Nurses individually have little power (as they are replaceable)
Replaceable with what, exactly? The local ER is now having to close in the evening because they can't find sufficient nursing staff to keep it operating.
At least locally the experience is that hospital admin has gone delulu by thinking they can replace hiring unionized nursing staff with much more expensive travel nurses
When I was at a small IT consulting shop ~15 years ago, this is roughly how it worked. We'd get paid 24x7 for a week on-call at minimum wage + 1.5x normal wage for any hours we had to log in.
Fun fact: In the US the concept of time-and-a-half (also minimum wage, and not hiring child labor) was created by the Fair Labor Standards Act. Most tech employees are classified as "exempt" -- the FLSA and its protections don't apply to them.
I don't believe it's physically possible for anyone to be available at 10 minutes notice for 168 hours straight. And if it's possible, it would be deeply unhealthy. But if they did achieve this, then yes they should be paid some very large amount of money. But that doesn't just happen -- pay isn't fair, it's about power. So a union can help to negotiate this, or ideally, better working conditions.
> Also most tech companies don't have unions...
In Europe, there are plenty of unions that cover tech people. I'm a member of Prospect (https://prospect.org.uk/).
Usually (always?) regular working hours are compensated as usual. Then there’s rate for standby periods and another rate for each (half) hour when you get pinged and start doing actual work.
If I had a union it would demand a bunch of unqualified people join my team (and get paid the same as me), and it would forbid me from doing certain things because, say, moving the computer or plugging in a cable is IT's job, whereas I'm SE. No thanks.
While you will find some extreme examples that could go that far, unions don't generally do that. Organisations that fight unions however do like to bring up that example, so... you've been had with anti union propaganda.
So my coworker who was a UAW member who told me stories about sleeping on the roof, and being reprimanded for moving a desk to retrieve a pen...was trying to dupe me?
So I should ignore evidence directly from the source of one of the largest unions in the US, because it doesn't support your view? I should only accept evidence from your trusted sources? Ok.
Edit: or my UPS friend who told me how the union box loaders would falsely claim alcoholism or drug addiction before being fired so they could abuse the union "protection" that was given to them? Is he trying to dupe me too?
Think about your own employment experience. Was the work environment always static, or did your employer ever introduce change that wasn't popular? Were you still singing the corporate anthem afterwards?
As far as I can tell, unions only show up after decades of management malfeasance. They're kind of a natural reaction. The line "the only thing worse than a union is no union" is probably a hundred years old.
Aren't hackers supposed to be a curious bunch? Is that really the only way you can imagine unions working? Can you not see the imbalance of power between a single individual and the corporation that employs them? Unions are fundamentally about balancing that power dynamic.
i'm a big union advocate, but i worry that the traditional messaging that unions use doesn't work for tech employees.
things like more pay/better hours/safer working conditions are appealing to people working low-paid, dangerous jobs but don't really click with most tech employees because those aren't the things they hate about their work.
to win over tech employees unions should talk about more ambitious things like codetermination (i.e., getting workers on the board), 4-day work weeks, remote work policies, employee sabbaticals, etc
My wife is part of a union and there’s none of that bullshit. However when her employer wanted to reduce costs across the board the union negotiated a shorter working week for everyone instead of a pay rise next year. They voted on it and accepted it with an overwhelming majority.
The union offering "shrinkflation" as the way for the business to cut costs is an interesting framing. Your wife's union associates must hang out with grocery store executives.
It just dawned on me how this argument runs perfectly parallel to religion if you point out intolerance, misogyny, or violence. It's always "those other people" that do the bad thing, and everyone only reaffirms their own system of worship. You could almost do a 1-1 find/replace of keywords and have the same argument.
Any particular reason you can't handle incidents while out and about?
I know it varies by situation. When I've been on call I've been able to mostly go about my life. I just had to keep my laptop close, stay in cell signal, and accept I would sometimes have interruptions (typically brief). We fought to keep them infrequent enough that they didn't ruin our lives.
I do long(ish) distance running as a hobby - it's not feasible to take a laptop out on a two hour run.
If I want to go meet a friend for a drink or food, I have to lug around a backpack and keep an eye on it to make sure it's not stolen. If I want to have a beer or wine, I can't, because I may need to work at any point.
Favourite band is performing? I suppose you could take a backpack and the laptop to the venue, but again there's a chance it's pinched, and they'll make you check it at the cloakroom for the performance.
> If I want to go meet a friend for a drink or food, I have to lug around a backpack and keep an eye on it to make sure it's not stolen. If I want to have a beer or wine, I can't, because I may need to work at any point.
If this is a stated requirement from your employer, talk to a lawyer. This is a common litmus test for whether you need to be paid while on call, even if you aren't actively working. Depending on the jurisdiction you may be entitled to pay (or trigger a relaxation of your company's policies).
I use my pocket computer if something comes up. It's not nearly as pleasant to use, granted, but way more pleasant than carrying a laptop everywhere. But I also wouldn't hesitate to have a beer if the desire arose. Perhaps I'm just not as committed to my work as you.
Not the GP, but I was in a similar situation. It was a requirement to be able to get to the office if the situation required the lab to diagnose the problem.
In California, as a non-exempt employee (basically not a manager if you're at a big company), you'd have to be paid for that on-call time with those requirements. The key term is "restricted" and the 10 minute expectation is quite a severe restriction.
If you're on-call, you're working - it doesn't matter if there's an active incident or not. Unless you're a contractor (in which case, you're unlikely to be on-call) the company you work for pays for your time, not delivery of specific work-items. On-call pay should reflect this.
If home is fine, then usually all you need is an Internet connection and a laptop.
> I cannot go for a run, I cannot go to the movies, I cannot go for a dinner with family, I cannot even go shopping (shopping mall is further than a 10 min. trip)
Sounds more like setting expectations and explaining the situation than a "cannot" (maybe except movies).
You can explain to your family for example that you're on-call and may need to leave urgently. I mean e.g. police do that. It's not that uncommon.
You can go for runs that stay within a 10-minute trip back home. The route is up to you. You can start by acknowledging on the phone, as someone else commented, which would grant you maybe another few minutes.
There are lots of options. It's on you to work around it. On-call isn't perfect, sure.
I once had a job with a lot of 2-4am wakeup outage calls. The timing was perfect, such that you generally couldn't fall back asleep.
An aggravatingly large percent of them could be resolved by voice over the phone by walking the offshore support team through the same 2-3 runbook items.
"Did you look at the log... I see, OK are you looking at it now? Does it say X? Did you do Y? Good now? Great, goodnight."
"Did you try restarting it.. ok then try that now. Is it good now that you restarted it? Great, I'm going back to sleep"
Ironically we'd have less of these outage calls when the offshore person went on holiday because they'd send one of the competent NY support staff over for 2 weeks. Slept like a log every time.
I agree. The potential work is worse than the actual work. I used to think it wasn't so bad, but then I was so relieved when I left the on-call rotation that I must have been suppressing my feelings about it.
>Most of the software I wrote requires mostly no interference or fix-ups, unless of course the requirements have changed
Bug-free software is great, but changing requirements are precisely the reason on-call needs to exist.
Even if you have a whole team of engineers who write bug-free software like this guy, you'll still have failures. Because the world is constantly slipping out from under your assumptions.
Customers never stop changing their usage patterns. They add load at different rates, come up with unexpected requests of all shapes and sizes, and invent new use cases that fly in the face of the original project requirements.
Even if you have created a software system with no bugs that perfectly meets both the functional and non-functional requirements of the project, changes in the state of the world vis a vis customer behavior will come along and change what counts as a bug. If your system has a blanket 60-second database query timeout, and everything's working fine, then there's no bug. But as soon as a new API usage pattern causes certain queries to run on average 10 times longer than before, now you have connection starvation and an urgent bug to fix.
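To put rough numbers on that (all of these figures are hypothetical, not from any real system), the arithmetic is just Little's law: average concurrent connections ≈ arrival rate × average query duration, so a 10x slowdown in query time needs roughly 10x the connections even though nothing in the code changed.

```python
# Back-of-the-envelope illustration (all numbers hypothetical): a query
# slowdown exhausts a fixed connection pool even though the code is unchanged.
POOL_SIZE = 100            # blanket limit on concurrent DB connections
ARRIVALS_PER_SEC = 50      # steady request rate hitting the database

def connections_needed(avg_query_seconds: float) -> float:
    # Little's law: average concurrency = arrival rate x average duration
    return ARRIVALS_PER_SEC * avg_query_seconds

print(connections_needed(0.5))   # 25.0  -> comfortably inside the pool
print(connections_needed(5.0))   # 250.0 -> 10x slower queries starve the pool
```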
I'm not saying that "timely maintenance and improvement" and "a culture of perpetual ownership" won't have positive effects on reliability. But it's unrealistic that any amount of responsible, careful software development will fully eliminate the occurrence of sudden and unexpected failures. Human on-call, as uncomfortable as it is, will remain a requirement as long as reliability is taken seriously.
FWIW my perspective is that of someone that runs an on-call/incident management platform (Rootly).
If you allow yourself to sleep then it's not entirely true. I totally agree that it's lurking and you think twice before starting an activity, but what's the worst that can happen?
Run within 10 minutes of your house, take your laptop to the movies and to your friends and family. I do it all the time, and yes, sometimes I need to isolate myself or find a place to work, but I still enjoy the rest of the day.
The problem is that on-call is an essential and critical part of a managerial role, but toxic to those in a developer role.

Managers must be on-call to ensure the appropriate people and resources are brought to bear on unexpected problems that threaten the business.

Developers must NOT be on-call, to ensure appropriate attention is spent designing, developing and maintaining the code that makes the business possible.

The rise of software-as-a-service led to companies promoting "devops" engineering, which conflates these roles and unfortunately helps unscrupulous executives unfairly squeeze more work from employees.

The core idea of devops, that managers/operators and developers should understand and be capable of performing each other's role, isn't a bad one. Those who understand how the business works at all levels can do more to make it successful. It goes hand-in-hand with continuous delivery.

The best engineers alternate between these roles on a predictable schedule. When in the managerial role they need to observe, react, delegate and escalate problems as appropriate. When in the development role they need to deliver features that create recurring value for the business. But businesses should not expect engineers to play both roles at the same time!

This form of "on-call" is a toxic moral hazard. It's a sign of instability. It's a signal of executive grift looking for a quick pop. "On-call" robs developers of the attention they need to develop features and increases the risk that schedules will slip.

It doesn't need to be this way. If a business needs software development, it should hire or train engineers with that experience. Likewise if it needs managers or operators to deliver software as a service.

As an operator or manager I look forward to working a shift, but as a developer I will never again accept an on-call rotation.
All companies I have worked for wanted you to answer within minutes, but you had half an hour to actually connect and try fixing the issue, so you could totally go out, and if you were farther than a 30-minute drive/ride/walk away you would just keep a laptop in a vehicle or backpack.
I used to do 3-hour rides on my bicycle and go to dinner or social events with a GPD Pocket 2 in a small bag.
I was on-call for over a decade, usually in roles where there was no compensation for working out of hours other than maybe TOIL. We're not talking FAANG gigs here - like £20-50k in the UK stuff. It's amazing how much having to carry an extra phone or making sure your laptop is in your car impacts your day-to-day life. Any social thing you're at could be interrupted at zero notice. Heck, I've taken calls in supermarkets and concert venues.
One place I worked had a 1 in 2 rotation. Every other week on call or weeks back to back if your colleague was on holiday. There was no front-line service screening calls which meant you could be woken several times in one night. All for £30 pcm towards broadband costs.
Most places are more sane than that example but suffer from the same core problem. Follow-the-sun support is incredibly expensive compared to putting your existing staff on call. Here in the UK, so long as your equivalent hourly rate doesn't drop below the national minimum wage and you're opted out of the working time directive (a lot of employers slip an opt-out form into your paperwork, implying it's normal to sign it), it's legal.
Unfortunately I'm yet to find anywhere where on-call operational teams have the clout to get code-induced issues high up the priority list, outside of cases where they've had to drag developers out of bed at 2am. In my experience that also plays out with getting anything infrastructure-based into tech debt budgets. Why focus on fixing problems you don't directly suffer from when you can spend the time on a refactor, integrating a cool new library or spaffing out one more feature in the sprint?
This. And alerts are often just a fluke anyway. Sure many of us can step up and pull an all-nighter if it saves some company and you make some minor sacrifice to do something heroic. Being woken up several times at night for no reason but some metric that is a bit off is pretty soul crushing, and then come the knock-on effects in professional and personal life that you're always tired and demotivated during the day.
> This means: I cannot go for a run, I cannot go to the movies, I cannot go for a dinner with family, I cannot even go shopping (shopping mall is further than a 10 min. trip).
I live in rural Texas. The same things apply here, and more: I'm lucky to have good internet (which enables working remotely) but half my home doesn't get cell coverage so being responsive to a text message or phone call means not even going around my own home (for example, no cell signal in the kitchen means I can't cook while on-call); and with large tracts of land, I can't go out to do land maintenance (good luck hearing a phone ring or feeling it vibrate from a call when you're operating heavy machinery, assuming you even have cell signal there); all services are 15 minutes or more away: groceries, doctor, contractors, government, etc etc.
It's important to stress how much being on-call ruins my ability to use the time effectively for my own purposes (Texas Guidebook for Employers [0]; 29 CFR 785.16 [2] and 785.17 [3]). I tried telling this to a previous employer when they started wanting me to be on-call (3+ years after the start of my employment), and they indicated that those laws only apply to hourly employees, and that being salary + exempt means I do not qualify for additional pay and that on-call falls under "and other duties as assigned" in the employment contract. So the employer effectively started getting 60 hours of work for 40 hours of pay. Oof.
I also absolutely refuse to mix my personal devices with work; just at a minimum, I refuse to make my personal device available to legal discovery related to any legal issues with the employer. So if the employer wanted me to have cell phone availability, then I demanded that the employer provide that cell phone. That was a fun conversation that ended with some relaxed requirements (eg, I don't have to have cell phone availability if I'm responsive at my work desk already) which further reinforced the fact that I couldn't use the time for my own purposes.
Thankfully multiple years in this industry at (what was) fair compensation allows me to be picky for new employment contracts. And lesson learned: I'll be a lot more careful about contract language from now on, and specifically look for (or negotiate) carve-outs around being on-call and work/personal device separation. I recognize that having 10+ years of experience makes me able to handle that, but newcomers to the industry won't yet have that buffer and it sucks for them to not have that safety net for negotiation leverage.
A lot of this disagreement comes from businesses demanding rapid response while insisting on not taking on new hardware/payment obligations. To contrast: take the fireman who's waiting for an alarm (29 CFR 785.15 [1]): they are often idle and can often go out for groceries, but they're easily reachable. Ever seen a firetruck in front of a grocery store and the firemen just inside shopping for groceries? Then seen them come running out, turn on the lights & siren and drive off? I have. It's an interesting event, and it sucks for the grocery store that has to put those groceries (for ~15 people) back on the shelves and in the refrigerators. Nonetheless, those firemen are paid to do so and have special equipment (eg radios or cell phones) to be able to receive those messages, and the firemen generally don't pay for that equipment themselves (the community does, either through taxes or donations). I see analogies about on-call software engineers being called to put out (virtual) fires as very apt in this case.
From my experience working on SaaS, and improving ops at large organizations, I've seen that "on-call culture" often exists inversely proportional to incentive alignment.
When engineers bear the full financial consequences of 3AM pages, they're more likely to make systems more resilient by adding graceful failure modes. When incident response becomes an organizational checkbox divorced from financial outcomes and planning, you get perpetual firefighting.
The most successful teams I've seen treat on-call like a leading indicator - every incident represents unpriced technical debt that should be systematically eliminated. Each alert becomes an investment opportunity rather than a burden to be rotated.
Big companies aren't missing the resources to fix this; they just don't have the aligned incentive structures that make fixing it rational for individuals involved.
The most rational thing to do as an individual on a bad rotation: quit or transfer.
This assumes that the engineers in question get to choose how to allot their time, and are _allowed_ to spend time to add graceful failure modes. I cannot tell you how many stories I have heard of, and companies I have directly worked at, where this power is not granted to engineers, and they are instead directed to "stop working on technical debt, we'll make time to come back to that later". Of course, time is never found later, and the 3am pages continue because the people who DO choose how time is allocated are not the ones waking up at 3am to fix problems.
Definitely an issue but I think there's a little room for push back. Work done outside normal working hours is automatically the highest priority, by definition. It's helpful to remind people of that.
If it's important enough to deserve a page, it's top priority work. The reverse is also true (if a page isn't top priority, disable the paging alert and stick it on a dashboard or periodic checklist)
IMO it's when the incident response and readiness practice imposes a direct backpressure on feature delivery that you get the issues actually fixed and a resilient system.
if it's just the engineer while product and management see no real cost then people burn out and leave.
> The most successful teams I've seen treat on-call like a leading indicator - every incident represents unpriced technical debt that should be systematically eliminated. Each alert becomes an investment opportunity rather than a burden to be rotated.
> When engineers bear the full financial consequences of 3AM pages, they're more likely to make systems more resilient by adding graceful failure modes.
Making engineers handle 3 AM issues caused by their code is one thing, but making them bear the financial consequences is another. That’s how you create a blame-game culture where everyone is afraid to deploy at the end of the day or touch anything they don’t fully understand.
"Financial consequences" probably mean "the success of the startup, so your options won't be worth less than the toilet paper", rather than "you'll pay for the downtime out of your salary".
At a lot of companies engineers are involved in picking the work. It's silly to hire competent problem solvers and treat them as unskilled workers needing micro-management.
Besides, if you set the on-call system up so people get free time the following day to compensate for waking up at night, the manager can't pretend there's no cost.
Bad management will fail on both of these of course, but there's no saving that beyond finding a better company.
This assumes that the engineers who wrote the code that caused the 3 AM pages will still be around to suffer the consequences of the 3 AM pages. A lot of the time this is not true, especially in an environment which fosters moving around internally every now and then. Happens in at least one of the FAANGs.
Minimizing 3am pages is good for engineers but it is not necessarily the best investment for the company. Beyond a certain scale it is probably not a good investment to try to get rid of all pages.
As a current SRE, and having worked in a small startup, this doesn't echo my experience at all. What the author describes is possibly what we would call "on duty" work, the grunt/maintenance work that comes with big software systems. It's not fun, and most companies/teams have friction getting this sort of work done. It's also, however, not how my SRE role is defined by any stretch. Our on-call work is much more about support during exceptional, somewhat rare circumstances.
Yeah, I was similarly confused. My experience has been that on-call is a roster of who will keep their laptop with them on the weekend. Incidents involve little to no customer comms, because if anything more than a sentence once an hour is necessary then there's someone else on call to handle it.
Support work is a necessary nuisance, but it's also not what on-call is meant to be.
Sounds like normal support work; not sure why that's affecting morale.
It's normal in any kind of system to also cover issues.
Yes, at some point you might want to have a customer support / customer success layer to at least triage them, but that makes more sense as you get bigger, not when you are small.
I actually like having discussions with customers on support days. Yes, sometimes they're annoying, but it's direct feedback from people trying to use your software.
My on-call experience required that I be able to respond within ten minutes of the call, with 24/7/365 coverage. But if I couldn't get the issue resolved remotely, it meant that I'd have to be in the office lab to recreate and reproduce the problem. It effectively restricted my personal movements to within commuting distance of my office, and that included all my vacation time as well.
That was the better part of a year of my life, continuous. Constantly weighing every decision, every meal, every movement, every action at all times against the risk of impacting my ability to respond to a customer call. Maintaining an extended period of alertness for a threat that very rarely materializes is frustrating in many ways that I'd very much like to forget.
I didn't burn out from it, but it was a major factor in my decision to resign from that company. Obviously there are people out there who can handle this lifestyle, but I couldn't. And frankly I'm quite content to never try again.
I've worked at places where I was the only person on-call (except during vacations - that's horseshit).
It was fine, but as an SRE/DevOps person I had the authority to override the developer backlogs to get issues fixed and enforce monitoring, architectural, and quality standards. Technical debt rarely built up and we could take one step back before taking two forward. The product people hated me at first, but the result was usually a platform where problems only ever occurred during deployments (and I was mostly lucky enough to be at places where we could do those during business hours) or when external factors intervened (cloud resource issues, etc).
At my current company, I can count on one hand how many times I've been paged out of business hours in the last 4 years (we recently re-did oncall because it was recognized that it was a risk that I'd become irreplaceable, which is the one point I'd tried to hammer on them about for a while).
The worst on-call experience I ever had was working at a managed services company. That was the hell of supporting customers that refused to properly allocate resources or invest in their tech stacks. Never again.
I don't understand why on-call is normal. It's a huge mental burden on employees, which is an especially important issue in modern times of widespread mental health issues, just so that some shitty mobile app can be available 24/7/365. If your business is important enough to have on-call, then you should have dedicated employees covering night shifts and nothing else, effectively limiting it to someone's office hours, effectively removing on-call. I think that there should be laws against on-call.
It's similar to using Electron to develop software: create quickly, offload the inefficiencies to the users' computers, so the developer can be comfy and the development can be cheap.
When you have on-call like this, you offload your expenses onto your employees' lives and mental health, and it's cheaper on paper, in the short term.
Then your company sinks as people start to leave, and the managers ask "Why?".
> If your business is important enough to have on-call, then you should have dedicated employees covering night shifts and nothing else, effectively limiting it to someone's office hours, effectively removing on-call.
Easy to say when you envision it as someone's office hours and not your office hours. If my employer gave me a choice between a normal shift plus oncall or a night shift with no oncall, I'd pick the on-call in a heartbeat.
You should be paid extra for all that time and indeed there are countries with legal frameworks which require an employer to pay the employee if their personal life outside of working hours is in any way constrained. If you are a contractor, always put limits for this kind of extra services.
Yeah, I worked in a couple companies that had the same requirements.
It also meant having to wake up on command in case the phone beeped, being unable to drink in my free time, being limited in which activities I could go to.
It was hell.
I know people are gonna hit back with "you're doing it wrong", but in this case it's the company doing it wrong, but nobody on HN will go there and tell them.
The only profession where there is a legitimate case for the "cannot drink in free time" requirement is Emergency medicine physicians. And even among physicians, they are some of the highest paid specialities. For most other cases it is simply the company trying to extract as much juice as possible from the existing working staff.
That's absolutely insane. I've argued with a CEO about on-call burdens being too heavy before, and that was with a roster of three people who were only available 7am to 7pm.
What kind of slave driver expects literal 24/7/365 availability? Does that not breach labour laws where you work?
The on-call rotation had escalation, and instead of going to the manager it went to someone else in the team. Since there were only two backend engineers in my team, I was either always 1st or 2nd (clown emoji).
Unfortunately this is a huge hole in German labour law.
I quit there pretty fast, after a couple months. It was a tourism company, so Corona treated them very well.
The other company where this happened was a Content Marketing company that was wiped out by ChatGPT. I didn't do on-call there but the other team did.
> It effectively restricted my personal movements to within commuting distance of my office, and that included all my vacation time as well.
So were you the only one ever on call? That's rough. I've been in two man (every other week) rotations and that was annoying, but went to being solo on call for about a month. That was a shitty month, despite not having many calls. It's just the constant thought and consideration like you mention.
> Filing for a new feature implementation would require a thorough documentation, rightfully so, followed by a political campaign to convince the political party of principal engineers and managers to accept the new feature. These stakeholders carry incentives and principals of their own - that do not necessarily always align with the true engineering spirit of solving the problem, nor with satisfying the customer.
> The same friction applied to fixing a bug or a flawed process. I would reproduce the bug: spin up the entire environment, the appropriate binary artifact and the reproduced state of application, create the test cases and pin-point the exact problem for the stakeholders as well as present the possible solutions. I would get sent to a dozen of meetings, bouncing my ideas back and forth until receiving the dire verdict - rejection to fix this bug altogether.
This has been my experience for the entire duration of my tenure at the current place I am employed at. I've stopped doing this, because my backlog is filled with "best effort" features and when I attempt to slot some of these into a quiet sprint, management says no.
Many features I request are based not on assumption but on data I gathered with analytics for the customer-facing website. One example is to change a page layout and add better search functionality for this page and its data, because I noticed a +70% no-results rate in the search analytics. I suggested a change but marketing shot it down, saying it's low priority for them, while they frequently say that the website could perform better at generating leads. I might be wrong, but to me it feels like utter short-sightedness and goes against the strategic goal of the company.
Not to distract from the article, but I'm pretty sure that photo of the guard tower is from Manzanar, one of the concentration camps in California that interned Japanese Americans during WWII. Probably not in great taste to use that photo to represent the idea of "guard duty" in software.
People that have been interned in this or one of the other camps, or their descendants. It’s one generation ago, people born and raised in camps are still alive. George Takeo for example was born in one of the segregation camps.
It’s a low stakes change.
Nobody assumes harmful intentions from the author - I would not have recognized the picture either. But now that it’s been pointed out that it’s from a site where people were held illegally against their will, the reaction is a tell-tale. Knowing this, and insisting on keeping the images is now willfully associating with harmful behavior.
Apart from that, I cannot associate with the picture either: as an on-call engineer for some widely used infrastructure, I am not a guard on duty keeping people in a camp. I am an emergency responder. I fix things when they go haywire or an accident happens. A firefighter, paramedic, or civil emergency responder would IMO be a much better metaphor for what I do.
>Nobody assumes harmful intentions from the author
Then what's the issue or purpose? Other than satisfying or inflating one's own moral self-image?
That kind of social dynamic appears to be what this kind of thing is about most of the time, as opposed to preventing mental harm or stopping such issues from (re)occurring.
>Knowing this, and insisting on keeping the images is now willfully associating with harmful behavior.
Do you not associate with harmful behavior in far, far more direct ways? Perhaps something that's considered ridiculous for most people to avoid, like paying taxes that get spent on bombs or the like?
If I want to make an article showing a prison and I pick one from Google, does it matter if that prison happened to be used in the past to house civil rights activists?
Is the implication that that somehow normalises, advocates for, or otherwise inches towards oppressing civil rights movements in any meaningful way?
To not use a thing that symbolizes the indiscriminate incarceration of innocent people as some clipart for an article about oncall. It's in bad taste, and also the wrong metaphor, and it sends the wrong message, even if not done on purpose.
> Other than satisfying or inflating one's own moral self-image?
Morality has a subjective component and often works by seeing through other people's eyes: How would I feel if this happened to me? How will other people see me if I did this? There's no universal consensus for this, of course, thankfully.
So when people disagree about morality, I've found that it can be hard for the side that doesn't see a problem to understand what the other side is making such a fuss about. I also think there can be a performative aspect to moralizing. But I don't think it's fair or warranted to jump to the conclusion that just because one doesn't see the issue, that the other person must be posturing - because that's how you can make sense of their behavior from within your own point of view.
People see things that rub them wrong, they speak up, as did you. How would you know you're not just inflating your self-image by putting down their comment like this? If you aren't, maybe they aren't, either.
So you're essentially saying: "I don't believe people should try to be nice to each other even in places where it essentially comes for free." I get it, it's impossible to do the right thing all of the time. It's a chore. But very often, it comes essentially for free. Help carry the baby pram down the stairs if the elevator is broken. Tell people their shoelace is untied. Change a picture that symbolizes hurt to people in places where the point is not talking about these places. Call people by their chosen name. It comes for almost free. It makes other people's lives better. What's the argument against it?
I think people should be nice to each other even in places where it doesn't essentially come for free. Offer a helping hand. Take part in a cleanup event. Give to the needy. And if possible, I'd say do it without broadcasting your moral status, or don't do it just to broadcast.
If you truly wanted to do some microscopic good you could've just used this picture to quickly bring awareness to what happened and could happen again rather than try to relegate the story to the view of those interested in history for the benefit of....
(1) is the important point, (2) is irrelevant. On the important point, you seem to have assumed several things:
a) that it is harmful
b) that anyone to whom the assumed harm would occur would see it
c) that they would know what the camp looked like
(a) I see no evidence for this, in fact, research suggests the opposite[1]
> The strange paradox about triggers and PTSD – and this is true for all anxiety-related disorders – is that avoiding triggers makes the disorder worse, not better (Jones et al., 2020). Being exposed to small instances of one’s triggers, in a safe environment such as therapy where one can be helped to process the situation, is a way to gradually become less reactive to those triggers (APA, 2013). Many people have learned to reduce their reactivity to psychological triggers through this process, called exposure therapy.
and (b) really is a stretch, are we to believe there are people in the tech industry who are on-call and were interned in WW2 and will read this article and will have some kind of meltdown because of it? Why would their descendants react badly to it? Unless they have a pre-existing condition, that's untenable, and if they have a pre-existing condition then they should seek help for it.
And to (c), how would they even know what the camp looked like if they're so triggered by the thought of it? They have an aversion to it.
No, none of that makes sense, and cloaking it in "compassion" or trying to handwave scrutiny away because it would be a "low stakes" behavioural change won't hide it.
> Apart from that, I cannot associate with the picture either
We are intelligent people, the link is really simple to make. There being "better" choices does not make the picture a non-sequitur.
I'm sure I'm not alone in wishing this West coast American style of pop psychology being misapplied to real life would die a death.
By the way, George Takei (I believe that's who you meant) is 87, and he's not really in tech.
More important than that to me is the use of rational, sound, valid reasoning, and the avoidance of petty and facile straw man arguments.
But okay, I really want there to be a picture of a concentration camp guard tower, because the best way to deal with childishness and unreasonableness is, unsurprisingly, childishness and unreasonableness.
I have no idea what the article was trying to convey since on-call was poorly defined.
I also couldn’t take it seriously when the article opened with this.
> Startups cannot afford engineers to baby sit software, big tech does.
Say what? As an ops person, I’ve seen multiple startups where devs are drowning in operational issues because the software was written just enough for a feature MVP, shipped with poor testing, then constantly poked at with sticks to keep it working, in hopes they could get enough revenue to not flame out.
Fair point about the importance of storytelling, but I wouldn't say "engineering rarely matters" in the tech industry. Compelling vision can mask those problems for a long while, but at some point the wheels will fall off if engineers can't deliver.
Been a few years since I worked at Google as an SRE, but I did not find "There’s no incentive in big tech to write software with no bugs" to be particularly true. Perhaps because I was an SRE for a few of the older, intrinsic, core products. A lot of the pages we got were for things outside our control (e.g. some fiber optic cables got broken (we found out later), so we've got to drain some cluster because nothing's getting through there, or something's getting overloaded because it's the Super Bowl today, so throw more machines/memory at it, or other weird external things.) I don't remember a lot of pages due to outright bugs in the code... though it's been enough years now that I might have forgotten.
Were the systems designed to scale based on load and handle transient failures?
It seems like lack of automated remediation would be a bug unless it's an "accepted business risk" i.e. cheaper to throw people at to manually fix than build a software solution.
You can’t plan for all failure modes. Weird shit happens and it needs human intervention to figure out what went wrong. Sometimes someone needs to assess what path forward is the lowest (financial) harm and weigh the options. No computer should make that call.
The industry myth of "devs need to be on-call just in case prod crashes at 3am" needs to die.
First, a system failure addressed by someone awoken "at 3am" assumes said person can transition from sleep to peak analytic mode in moments. This is obviously not the case.
Second, a system failure addressed by someone awoken "at 3am" assumes said person possesses omniscient awareness of all aspects of a non-trivial system. This is obviously not the case.
Third, and finally, a system failure addressed by someone "at 3am" assumes said person can rectify the problem in real time and without any coworker collaboration nor stakeholder input.
The last is most important:
If a resolution requires a system change, what organization would allow one person at 3am to make it and push to prod?
The way I was taught to be on call by a guy I worked with that was also an SRE at a large software company was to "patch the hole in the tire and get it to the service center".
It wasn't about fixing the problem or fully understanding it, but instead making sure the system can run for the time being, get some sleep, and have a more complete triage in the morning. I've found this to work pretty well the past few years.
While I do agree with your sentiment, I'd say that the perspective I've learned about being on call is a big different than the one you've experienced (which may be the more common one, I'm not sure).
So this may end up being a cultural difference between on-call situations.
> The way I was taught to be on call by a guy I worked with that was also an SRE at a large software company was to "patch the hole in the tire and get it to the service center".
That is a great philosophy to have when supporting a prod system off-hours, and one to which I fully subscribe.
> While I do agree with your sentiment, I'd say that the perspective I've learned about being on call is a big[sic] different than the one you've experienced (which may be the more common one, I'm not sure).
My underlying thesis is not about being on-call, but about expecting a developer to perform their non-support duties in addition to being on-call. The worst-case scenario of this is when the on-call developer is also tier-one support.
If an organization wants to have developers perform SRE duties, presumably due to deeper understanding of a system, fine. Assign them to support and suspend development responsibilities during same.
In my experience we patch the hole in the tire and then get directed to patch a hole in another tire. Manager says work on the highest priority tire so now there are 2 partially broken tires but the problem is technically fixed. Repeat for 10 years and now it’s time to rebuild it “correctly”, but actually we just have two cars to patch since nobody knows what awful hacks we need to port over for “expected” behaviour
If the code is important enough that it needs to be fixed if it breaks at 03:00, someone needs to be on-call
If that someone isn't the dev, things breaking at 03:00 is someone else's problem from the PoV of the dev, and it will likely keep breaking.
I've tried at several companies to get dev-teams to prioritise things causing a lot of work for the ops-team, and nothing has worked as well as disbanding the ops team and putting the devs on-call.
Pain from a problem needs to live where the problem can be fixed.
> I've tried at several companies to get dev-teams to prioritise things causing a lot of work for the ops-team, and nothing has worked as well as disbanding the ops team and putting the devs on-call.
This is a leadership problem, not a development problem. If stakeholders prioritize resolving production issues and development teams do not address the objectives set forth, then there are bigger problems afoot.
> Pain from a problem needs to live where the problem can be fixed.
Punitive management policies erode morale and result in retention only of those who have no better options. See "Price's Law"[0].
It's only a leadership problem if you split the responsibility for the thing so you have A: a team that builds and maintains it and B: someone else who has to wake up at night and fix it.
If A and B are the same people, very little leadership is needed as they are innately motivated, and have direct experience.
> Punitive management policies erode morale and result in retention only of those who have no better options. See "Price's Law"[0].
It's not punitive management. Someone will have to work when the thing breaks. It can be the people who can fix the problem, or it can be someone else.
> It's only a leadership problem if you split the responsibility for the thing so you have A: a team that builds and maintains it and B: someone else who has to wake up at night and fix it.
A couple points come to mind here.
First, there is no reason for any employee to have to "wake up at night and fix it" if the organization staffs SRE's with a "follow the sun" model. Three complementary time zones are ideal, but two can suffice if one is first-shift and the other second shift.
Second and more importantly, splitting responsibility between a "dev team" and an "ops team" is problematic when the two teams do not communicate, coordinate, and assist each other in improving the system. Hence this situation qualifying as a leadership problem.
> If A and B are the same people, very little leadership is needed as they are innately motivated, and have direct experience.
A and B are different jobs, with different responsibilities, with different skills, and different success objectives. Combining the two is a recipe for burnout and/or failure.
>> Punitive management policies erode morale and result in retention only of those who have no better options. See "Price's Law"[0].
> It's not punitive management. Someone will have to work when the thing breaks.
The statement I quoted was:
Pain from a problem needs to live where the problem can be fixed.
This is a punitive management policy in and of itself and unquestionably so when combined with the assertion:
If A and B are the same people, very little leadership is needed as they are innately motivated ...
Two additional things to contemplate are categorizing team members as "they" and identifying "very little leadership" as a goal to achieve.
> If that someone isn't the dev, things breaking at 03:00 is someone else's problem from the PoV of the dev, and it will likely keep breaking.
If that's the case, then the person responding at 03:00 needs to have a serious conversation with the dev, or with whoever's assigning/prioritizing tickets to the dev.
Great points. I think over the 8 years of my SRE experience, I've probably caused a few outages after being paged at 3-4am, and/or prolonged them. I've fallen asleep during one response (fortunately the main outage had been fixed, but we went way beyond our internal SLO for backups, as I had fallen asleep before running them).
That said, the author neglected to mention timezones or following the sun. A 12/12 hour shift or 8/8/8 (NA/EU, NA/EU/EMEA) addresses the sleep deprivation problem pretty well, and is pretty easy to staff in a large enough org
> That said, the author neglected to mention timezones or following the sun. A 12/12 hour shift or 8/8/8 (NA/EU, NA/EU/EMEA) addresses the sleep deprivation problem pretty well
I completely agree.
The "get a call at 3am" scenario, to me, is shorthand for an organization which intentionally under-staffs in order to save money. If a system has genuine 24x7 support needs, be it SLA's or inferior construction, it is incumbent upon the organization to staff accordingly.
Still and all, identifying a production issue is one thing; expecting near real-time resolution from personnel likely unfamiliar with the intricacies of every aspect of a non-trivial production system is another.
It always amazes me how these places have "24/7 support needs," and then all of a sudden a bug comes in that's "not important" even though it has customer uptime impacts.
> If a resolution requires a system change, what organization would allow one person at 3am to make it and push to prod?
Well yeah, if you have on-call, I guess you have to trust the on-call person to make that decision. If they get called at 3am, the problem is obviously so bad that almost anything they do can only make it better, right? If it's some overlooked edge case that crashes the whole application and can be fixed with a one-line change, it's probably OK to do it at 3am. If it's more complicated, maybe wait until 7am and discuss it with others too. And, in any case, a thorough post mortem (during office hours) and a check that the 3am fix is really a proper fix for the issue is in order...
Devs need to be on call at three AM so that they suffer when their software fails. This is how you align developer motivation with operational motivation.
The organization needs to incur financial penalties when the on-call staff have to respond at three AM. This aligns the organization's motivations with the operational motivations.
When managers see an on-call-incident line-item going up then they're more willing to tell product to take a hike.
> Devs need to be on call at three AM so that they suffer when their software fails. This is how you align developer motivation with operational motivation.
As I mentioned in a peer response:
Punitive management policies erode morale and result in retention only of those whom have no better options. See "Price's Law"[0].
It's interesting that you see this as being a punitive policy. Our job is producing quality systems. If our software is so bug-ridden that caring for it is a punishment, then we've failed at producing quality software.
> If a resolution requires a system change, what organization would allow one person at 3am to make it and push to prod?
The only advantage of having a micromanaging manager: if they want to review all PRs and system changes, they also must be on call every single night of the year.
Well, during on-call I think there's usually a chain of wake-ups. The on-call guy tries to fix the issue, finds out he's not familiar with it, so he creates a channel and wakes up the next guy on the right team, and so on. The first guy usually just stays awake to "acknowledge" that someone is taking care of the issue, writes the wrap-up, and closes the channel.
I'll add to this that some companies (my current one) still have a hero culture. People who are forced to work all weekend to fix a problem are lauded as the ones who went the extra mile. An ego boost and kudos, for some reason, are reward enough. Companies don't have to substantially reward these employees, nor does the company have to invest in paying down technical debt, infrastructure, or automation/monitoring. It's incentivisation gone horribly wrong.
I'm pretty confident the idea of on-call would go away if the company had to pay for your time. But being salaried, they assume you work for them 24/7.
It seems that companies often end up with On-Call after they've reached some critical size but decide to do it in the cheapest possible way: let's just make our engineers do it!
One big problem is that they tend not to make any attempt to ensure that the on-call rotations are equitable across teams. You might need to be on call one week out of 4 or one week out of 20. At review time, this matters. Since you're typically not getting paid for the on call work or having it factor into your performance review, you're actually incentivized to snooze alerts and kick the can down the road until after your rotation or to join another team with lower rates of on call incidents.
Ironically, if you're spending 25% of your time on call and actually being diligent about the role, you're more likely to be adept with the typical hodgepodge collection of tools you need to diagnose and address problems, and will be better able to handle incidents than if you only ever use those tools once every 5 months.
I was at the tender age of 21, and a mere contractor at my first job. I recall being paid hourly, but never filling a time card.
I had just been promoted to a systems admin role after demonstrating prowess with coding and system configuration, and also because the actual qualified admin had quit. One of my projects involved a rewrite of server code for efficiency as our /etc/passwd ballooned to thousands of users, with too many logging in simultaneously.
So on Thanksgiving Day, we had just gotten in the door with Mom after shopping when my supervisor rang the home phone; his first words were "There's a hacker on [production shell server]". I told him a few commands to try, and we did not seem to suffer downtime or need our backups.
That may have been the most consequential job I had for a decade. There were thousands of customers depending on us giving "four nines" of available service, including the corporate customers, but we had no formal on-call agreement, not even a way for me to access systems from outside the office, and I was still a novice after a few college courses and a lot of unstructured hacking.
My [salaried] father, on the other hand, often brought a pager [which never seemed to beep] home, and one day we were all sitting around the living room when an explosion rattled our windows. We were still a bit rattled ourselves when the phone rang, and it was his cue to go down to the site and respond as a health & safety officer.
So my mother said that our household policy would henceforth be: "if something explodes out there, nobody answers the phone!"
I'm unsure how this got flipped turned upside-down, but this was Mom asserting boundaries to Dad's work-life balance.
The central controversy of on-call duty is that it leaves the household and family short-staffed, contributing to burnout and "workaholism". We were accustomed to Dad being at work M-F, 9-5, and being available and supporting us at all other times.
While a boiler explosion is certainly an exceptional circumstance, it can be the "camel's nose in the tent," and eventually Dad is potentially driving up there every Saturday night that a lab worker spills 1 gram of mercury. Then he's not eating dinner with us, resting, mowing the lawn, or helping Mom shop or run errands.
I'm trying to recall any times Dad responded to a family emergency by leaving work, but it was nearly never necessary, since preparedness and response plans defined his career.
Here's my answer to on-call requests: no. And if you ask me to clarify my statement, I will answer: no. Because no is good enough, and it's what I negotiated when I took the job.
I don't work for employers who can't manage their staff properly. If you're on-call at a company with any sort of scale, you aren't being managed properly. One of the things I tell hiring managers is that I don't mind working later if we're reaching the end of a project and a late afternoon or two might get the product pushed out the door sooner, or if I've gotten into the zone and am making great progress. But these are my choices to make. There will never, ever be a time when I will subject myself to being tied to a phone for the purposes of waking up in the middle of the night and messing with servers or software. My time is MY time. I only ever have reciprocal relationships with employers and refuse to upset that balance.
IANAL, but my understanding is that in Europe on-call can fall under working time, depending on the exact nature of the requirements. In fact, I'm very interested whether anyone has firsthand experience arguing developer on-call falls under such category. E.g. being 8 minutes away from a work machine and network access is not too dissimilar to a firefighter being 8 minutes away from the fire station.
Question to tech folks in general: when you are "on-call" are you provided with differentials in pay regardless of whether an incident actually occurs?
I have seen it vary across companies:
- Some companies don’t pay any differential
- Some companies only pay out _per incident_
- One company I worked with paid out per shift of on-call rotation
This is quite scummy unless your role's primary duties are support and nothing else. I would count Site Reliability Engineer as support being a primary duty, while I would not count Software Engineer as support being a primary duty.
> Some companies only pay out _per incident_
That's a bad incentive. It incentivizes more incidents to occur.
> One company I worked with paid out per shift of on-call rotation
Sounds decent on paper but everyone has their breaking point!
The author's description of how on-call works at "big tech" seems to be a complaint about a specific tech company, and the author is hastily generalizing to the rest of the industry.
His experience has nothing in common with my on-call experience working at a different "big tech" company.
I am on-call for a week every ~8 weeks. I get 500 € and two additional days of vacation for it. If I do get a call, the additional hours go to my time account.
Fair deal, I'd say. Only problem is that most of the applications are from vendors, so getting problems fixed is an ordeal.
I work for a big tech and your solution has already been built. The on-call software uses an LLM to summarize the tickets, and a vector db to find troubleshooting guides and similar tickets.
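For a rough idea of what the "find similar tickets / troubleshooting guides" half looks like, here's a toy sketch in Python. The embed() function and the guide corpus below are stand-ins; the real thing calls an embedding model and a proper vector database rather than an in-memory dict.

    import numpy as np

    # Stand-in embedding: hash tokens into a fixed-size vector and normalize.
    # A real system would call an embedding model instead.
    def embed(text, dim=256):
        vec = np.zeros(dim)
        for token in text.lower().split():
            vec[hash(token) % dim] += 1.0
        norm = np.linalg.norm(vec)
        return vec / norm if norm else vec

    # Placeholder troubleshooting guides; a vector db would hold these.
    guides = {
        "disk-full-runbook": "node disk usage above 90 percent, prune logs or expand the volume",
        "lb-5xx-runbook": "load balancer returning 5xx, check backend health and recent deploys",
    }

    # Rank guides by cosine similarity to the incoming ticket text.
    def most_similar(ticket_text, top_k=1):
        q = embed(ticket_text)
        scored = [(name, float(q @ embed(body))) for name, body in guides.items()]
        return sorted(scored, key=lambda s: s[1], reverse=True)[:top_k]

    print(most_similar("pager: disk usage above 90 percent on prod node"))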
Just hire an SRE. If you want to piss off your team and incur more costs than the annual salary of one decent SRE... ask your team to do on-call rotations.
In this day and age, if you're a properly capitalized startup you should have the budget for one decent SRE hire. They oftentimes double as a technical writer and are great at preventing knowledge silos in teams / members of a small team, since they have to sort of know everything to be an effective SRE.
Best process improvement I ever made for my teams.
No offense, but this is wildly underselling the goals of oncall SRE. LLMs are extremely crappy at causal analysis, or even just at mitigation techniques for services that haven't been widely discussed on stackoverflow.com (i.e. your service).
> Creating an on-call process to manually inspect errors in test suites is more valuable than improving the project to be more reliable, as you can directly measure the amount of tests that failed on a weekly basis. It is measurable and presentable to the upper management.
You can also measure requests that failed on a weekly basis, and I do. In fact, I added a dashboard panel to do exactly that today for a service (10 years old!!) on a new team I just reorg'd into. I did this because I was annoyed to discover the first (internal) customer outage report of the day could have been repaired by the east coast half of the team hours before the west coast QA team logged in for the day, but they were unaware anything was wrong. This is a trivial promQL query to implement, and yet it didn't exist until today.
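To give a feel for how little the panel amounts to, here's a sketch; the Prometheus URL, metric name, and labels are placeholders for whatever your service actually exports.

    import requests

    # Hypothetical endpoint and metric; substitute your own Prometheus and
    # whatever request counter your service exposes.
    PROM_URL = "http://prometheus.example.internal:9090/api/v1/query"
    QUERY = 'sum(increase(http_requests_total{job="my-service", code=~"5.."}[7d]))'

    resp = requests.get(PROM_URL, params={"query": QUERY}, timeout=10)
    resp.raise_for_status()
    for series in resp.json()["data"]["result"]:
        # "value" is [unix_timestamp, value_as_string]
        print("failed requests over the last 7 days:", series["value"][1])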
The problem isn't visibility but risk -- what if you make reliability fixes but the data gets worse? This is not hypothetical; a YouTube engineer documented a similar tale[1]. You can also imagine all kinds of fixes that sound good on paper but can produce paradoxical outcomes (i.e. adding retries causes a metastable failure state[2]). And heck, what if you make no changes, and the numbers decline all on their own? Are you going to scuttle this quarter's project work (and promotion fodder!) just to bring this KPI back to normal? Of course, all numbers, even the test suite pass rate, come with risks of missing targets, so the incentives are to commit to reporting as few of them as possible.
> tools automate the mundane tasks of an on-call engineer: searching for issues related to a customer report, tracking related software (or hardware) crashes, verifying if the current issue that arose during an on-call is a regression or a known bug and so on.
I have a coworker trying to use LLMs for ticket triage, but there's a huge GIGO risk here. Very few people correctly fill in ticket metadata, and even among the more diligent set there will be disagreement. Try an experiment: pick 10 random tickets, and route copies to two of your most diligent ticket workers. Then see how closely their metadata agrees. Is it P1 or P3? Is the bug reported against the puppet repo or the LB repo? Is a config change feature work, bug fix, or testing? Do they dupe known issues, and if so, to the same ticket, or do they just close it as an NTBF known issue? If these two can't agree on the basics, then your fine-tuning is essentially just additional entropy. Worse, you can't even really measure quality without this messy dataset, and the correct answers should change over time as the software and network architecture evolves.
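If you want to put a number on that experiment, inter-rater agreement (e.g. Cohen's kappa) is the usual way to score it. A toy sketch, with made-up priority labels standing in for a real ticket export:

    from sklearn.metrics import cohen_kappa_score

    # Hypothetical labels: the same 10 tickets as triaged by two people.
    triager_a = ["P1", "P3", "P2", "P3", "P1", "P2", "P3", "P2", "P1", "P3"]
    triager_b = ["P2", "P3", "P2", "P2", "P1", "P3", "P3", "P2", "P2", "P3"]

    # 1.0 means perfect agreement, ~0 means agreement no better than chance.
    print("Cohen's kappa:", cohen_kappa_score(triager_a, triager_b))

Run the same thing per field (priority, repo, dupe target) and you get a rough ceiling on what any model trained on that metadata could plausibly learn.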
Typical HN blogspam. A blog post that makes vague but incredibly certain claims, generalizing about an entire industry, with no evidence outside a single person's limited experience. Yet there he is at the end of his post, asking people to hire him for consulting. But it's perfect for HN, because it gives people in the comments something to complain about, driving up engagement.
I will admit, though, the SaaS tech industry as a whole has a problem. The cargo cult of tech has convinced itself that SaaS is special. That it's never done. That it must be never finished, never stable, always changing, always breaking. Like it's some immutable law of nature.
This of course is a convenient lie. Before every business was an online service, software worked like every other product in the world. You built it to do a thing, you tested it thoroughly to ensure it didn't have defects, and then you coordinated one large release. No constant changes or late Friday deploys. No schema changes destroying columns. There was no on-call, because there was no 24/7 service. There were floppy disks and CDROMs, and people running software on their own computers, and a very small number of 24/7 enterprise systems shared by lots of users (mostly run by ISPs, and one or two large tech companies and industry bodies).
But you can deliver online software like it's desktop software. You can ensure your software is bug-free before you burn your master disk. But then you don't get to skip all those time-consuming tests. Then you don't get to ship your features faster than your competitor. Then you don't get to do A/B testing and micro-tweaks and experiments and feature flags and blue/green deploys. We want to do all the things, at any time, with no consequences. So on-call is still a thing, because people want to have their cake and eat it too.
The actual infrastructure on the backend shared by millions of users? Not actually hard to maintain. It's the same shit as 20 years ago, but much, much easier. Make sure the disks don't fill up. Auto-scale the VMs and containers. Design your apps to not exceed network bandwidth. There's really not much else to break. The only thing that breaks a system is changing it, or bugs from not enough testing. So do the testing, and the bare minimum of autoscaling in the cloud, and nothing should ever break in production.
But constantly mutating SaaS, without due diligence, is too addictive. Nobody's going to abandon the freedom of doing whatever the fuck they want, just to keep from waking people up in the middle of the night. The business people sure as fuck aren't going to abandon their "competitive advantage". And devs who don't care if their software works or not don't want to abandon untested deploys from their laptops. So on-call is here to stay.
A moderate on-call ritual is a necessary evil. I’ve worked at places that tried to get rid of it with all kinds of automation and playbooks, only to revert back to PagerDuty a few months later.
That said, my last workplace completely burned me out with a terrible on-call policy and an absurdly short recovery period. Not to mention, upper management tried to gaslight everyone into thinking on-call was just part of normal work and didn’t warrant additional compensation.
I get why it’s necessary, but I’m lucky enough to be in a position where I can flat-out decline any job that mandates unpaid on-call.
I don't think on-call is a necessary evil; I think it's the result of managers and leaders not caring about having multiple failsafes, instead opting to foist that problem onto engineers via unpaid labor.
You can have a system with enough redundancy, the ability to roll back, and deployment scheduling such that any sort of on-call incident is highly rare and low impact. But that requires spending time and money on solving those problems, which is money that could be spent on developing more features faster. A buggy and broken product doesn't matter as long as you can shovel out more, faster.
I've been on both types of teams and I'm at the point in my career that if I'm going to be on-call I'm expecting to be compensated appropriately with actual on-call hours worked. And part of the problem is that even on a team where you're informed of on-call duties and rotation, if the team gets cut or people leave then you're on the hook for working longer hours for less pay. It's inherently exploitative.
We get time off in lieu for hours worked, and paid (at a fixed rate across the company) for hours where we need to be available to be paged. You don't want to only get paid if you're paged, especially if your services are normally quite reliable.
Fail-safes are one thing, but they don't always kick in when we need them. My team is only paged one or two times a month, and when we are it's probably for something we've not seen before because whatever it was that broke last time has been fixed.
My company has operated a 24/7 shift rota for 100 years. American “On call” is not normal.
Some companies might have exceptional on call - the C suite will get called back from holidays or whatever should things really hit the fan. That’s once every few years.
> tried to gaslight everyone into thinking on-call was just part of normal work and didn’t warrant additional compensation.
It is part of normal work for roles where it's a thing, and I'd think it could be either baked in or on top, as long as it's known at salary negotiation time which one applies.
> Not to mention, upper management tried to gaslight everyone into thinking on-call was just part of normal work and didn’t warrant additional compensation.
Every position I've held (save my current one), that is most definitely the norm. If you have a busy night managers are cool with you coming in late the next day (or potentially not at all), but it's very unusual to be paid for on call in my experience.
Fortunately every job I've held had the situation where the manager was fine with me/the team taking make-up time the following day if you were interrupted during offhours. Sure extra comp would be nice, but if on-call has to be done, this isn't the worst way.
> upper management tried to gaslight everyone into thinking on-call was just part of normal work and didn’t warrant additional compensation.
If it's made explicitly clear in the job interview process that on call is expected in a role and if it's enforced evenly across the organization, this can absolutely be normal and ethical—you're on a specific salary and your salary includes on call expectations. If you knew about that going in then you weighed it against the offered salary and decided it was worth it. It's not "unpaid on-call" in that case, it's paid on-call.
Obviously it can also be done poorly in ways that are harmful and dishonest. Not communicating it clearly during the job interview and salary negotiation, applying it inconsistently, or changing the frequency or the difficulty of the rotation after you have started are all real problems.
For me personally, there's no possible way that a company can correctly factor on-call costs into my salary, because the on-call costs are based on what I am doing at the time.
You'd have to pay me a biblically large multiplier on my base salary to get me to pick work over my kids or wife. My kids aren't going to remember that I made a few thousand bucks by ducking out halfway through an important-to-them event. My wife sure as shit ain't going to forgive me _or you_ if you call me into work in the middle of some adult time. And the cost for you to make sure that I'm not participating in such activities and am available to work on very short notice is... well, it's the exact same number as if you had interrupted such activities.
If you really need the coverage in the middle of the night, hire more engineers whose normal 8-ish hour shifts are during the night? Can't afford to? Perhaps you're not so good at business then. Better yet, stop building a fucking house of cards that can topple your company if it wobbles a little bit. If your systems going down for a few hours can take down your company, outside of some exceptional circumstances, that's a failure on your part.
>> upper management tried to gaslight everyone into thinking on-call was just part of normal work and didn’t warrant additional compensation.
> If it's made explicitly clear in the job interview process that on call is expected in a role and if it's enforced evenly across the organization, this can absolutely be normal and ethical—you're on a specific salary and your salary includes on call expectations.
Where this logic breaks down is when on-call expectations are in addition to what a salaried position compensates - a 40-ish hour work week.
Expecting an employee to be paid industry rate for services rendered and then expecting periodic 24-168 hour near real-time availability without compensation is, by definition, "unpaid on-call".
And the frequency with which you have to take one of the 168-hour shifts is inversely proportional to how many colleagues you have in the rotation. if somebody leaves, the amount of unpaid work you have to do increases.
So if on-call isn’t explicitly compensated, an employee quitting essentially gives the rest of the team more hours at a lower hourly rate.
I used to be on-call too. I still have muscle memory of the buzz that’d make me leap off my couch like my house was on fire. Except it wasn’t. Just prod, doing prod things.
We all know it. You trade your peace of mind, sleep, and sometimes personal freedom (hello, “don’t go beyond cell signal” life). And I agree with many of you—it shouldn’t be this way.
We’ve seen teams turn their on-call rotations into something way more humane. Not perfect (nothing in ops is, let’s be real), but sustainable.
At Zenduty, we work with a lot of SREs, DevOps folks, and incident commanders who’ve said, “We can’t keep doing this.” So they started rethinking on-call. One of our partners cut their alert volume by 50%. Half! That’s not a magic trick. Just better processes, better tooling, and some good old-fashioned empathy for the humans behind the screens.
And if your team’s not giving you breathing room after an all-nighter? You deserve better. I’ll just say it.
I saw some of you mention how this stuff becomes political or gets deprioritized in favor of shiny new features. Ugh. I feel that in my bones. The best teams we’ve worked with treat incidents as opportunities to make things better. Every alert is basically your system saying “Help me, I’m tired.” And when you listen and fix it? Less 3 AM wake-ups. Win-win.
Anyway, I didn’t mean to ramble, but this topic is close to home. If you’re still stuck in on-call hell, or just want to vent about it. Always happy to swap war stories or chat about how folks are making on-call suck less.