pkumar00007's comments

Did your incident response team look at the last few changes that were executed? If they had, they could have just rolled back the change. Even just looking at the changes executed in the vicinity of the start of the outage could have pointed to the problem.

Didn't the services that were crashing due to OOM raise any alerts?

This is shitty on so many levels.


If you knew the expected number of features, any input file with more than 100 should have been discarded as bad input, and you fall back to the last good feature file received. This would have protected your service, even though you'd miss the newly populated features. I believe these features are not updated that frequently. Even if they were, you would just have biased your system towards availability over 'correctness'.
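A minimal sketch of the kind of guard described above, assuming a hypothetical JSON feature file; EXPECTED_MAX_FEATURES, load_feature_file(), and refresh_features() are illustrative names, not anything from the actual system:

    import json
    import logging

    EXPECTED_MAX_FEATURES = 100      # known upper bound on feature count (assumed)
    last_good_features = []          # last validated feature set we served

    def load_feature_file(path):
        """Parse the feature file; the file format here is an assumption."""
        with open(path) as f:
            return json.load(f)

    def refresh_features(path):
        """Swap in a new feature set only if it passes the sanity check."""
        global last_good_features
        try:
            features = load_feature_file(path)
        except (OSError, json.JSONDecodeError) as exc:
            logging.error("feature file unreadable, keeping last good set: %s", exc)
            return last_good_features

        if len(features) > EXPECTED_MAX_FEATURES:
            # Bad input: discard it and keep serving the last good set,
            # trading freshness for availability.
            logging.error("feature file has %d entries (> %d), keeping last good set",
                          len(features), EXPECTED_MAX_FEATURES)
            return last_good_features

        last_good_features = features
        return last_good_features

The point of the guard is that a malformed or oversized input file degrades the service gracefully instead of crashing it.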

What were the teams doing between 11:00 and 13:00 hrs? There is no explanation of what investigations were going on, or why they were unable to figure out the root cause.

