
So 10 years and a few employers ago, we had a case of a few "haunted" server chassis. Hard drives would fail on these chassis very frequently, and when a fresh drive was swapped in, it would take many days to rebuild the RAID, if it ever rebuilt at all.

Putting the RAID set in a new machine, it would rebuild fine. But in the original machine, we swapped out the RAID controller, CPUs, even the whole motherboard, and the RAID sets still would not rebuild.

Long story short, each of these "haunted" servers had a bad fan that was causing a lot of vibration within the chassis - enough physical vibration happening that the hard drives were essentially rendered inoperable.

The moral of the story is to make sure you have good vibration damping on your fans, and to use sensor monitoring to alert you when the fans are going bad. (Even this is not perfect, since sometimes the fan gets off-kilter but is still happily spinning at 10K RPM. The first thing we did if we got an alert for a disk failure was to replace the fans and attempt a RAID rebuild before touching the "bad" disk.)
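Something like this is the kind of check I mean -- a minimal Python sketch assuming a Linux box that exposes fan tachometers through hwmon in sysfs; the RPM floor and the alert action are illustrative, not what we actually ran:

    #!/usr/bin/env python3
    """Rough fan-RPM check via Linux hwmon sysfs (threshold and paths illustrative)."""
    import glob

    MIN_RPM = 8000  # hypothetical floor for fans that should be spinning near 10K RPM

    def failing_fans():
        bad = []
        for path in glob.glob("/sys/class/hwmon/hwmon*/fan*_input"):
            try:
                rpm = int(open(path).read().strip())
            except (OSError, ValueError):
                continue  # sensor missing or unreadable; skip it
            if rpm < MIN_RPM:
                bad.append((path, rpm))
        return bad

    if __name__ == "__main__":
        for path, rpm in failing_fans():
            # In practice this would feed your monitoring/pager, not just print.
            print(f"ALERT: {path} reads {rpm} RPM (below {MIN_RPM})")

As noted above, RPM alone won't catch an off-kilter fan that still spins at full speed, so this is a first-pass filter at best.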


This wasn't a Sun E450 was it? We had one (of a "matched" pair) that was "haunted" as well. Drives died, Sun replaced drives. Drives died again. Sun replaced SCSI controller and drives. Drives died again. Sun replaced motherboard, SCSI controller, memory, and drives. Drives died again, and we made the (at the time) scary move to Pentium III app servers, which were inexpensive enough to triple up compared to SPARC, but even better, drives didn't die.

We swapped out the E450s for 440s for Oracle when we moved to InterNAP, and all seemed to be well.

Hearing your story, I wouldn't be surprised if we had just enough/wrong vibration in the case to make it go Tacoma Narrows on us.


These haunted servers were actually Supermicro barebones chassis.

It has been a (long) while since I have seen the inside of an E450, but IIRC there were a bunch of fans in trays in there. So it is certainly possible that the vibration did bad things. I still carry one of the E450-era keys on my keychain as a memento.


In the "updates" section, he shows that there is a new system to attach some phosphorescent plastic between the spokes next to the rim. It isn't clear if these are included with the standard backer price, but it does allow the system to be used on bikes with brakes. Hopefully he updates the front page to make this more apparent, since it was my first question also.


Thanks, I hadn't spotted that - it really should be on the front page. So it seems that when he first put this Kickstarter up he had either genuinely overlooked the obvious (despite having no brakes himself) or hoped everyone else would overlook it, until potential backers pointed out that it wasn't going to fly.

Now he has a bigger problem - his new system appears to work by wedging into the spokes where they converge at the rim, so he will need to supply the correct radius of strip for every size rim on the market.


I was thinking of something related to this earlier today. One thing I'd really like from the emergency call screen is to allow me to tag several of my contacts as emergency contacts. That way if my phone is stolen, or I am hurt in an accident, the list of proper people to notify is very apparent.


Yeah, the old ICE contact.


The best way I've seen for dealing with cache expiry, which the article does not talk about, is to use version numbers on assets. We found this to be especially important with JavaScript, CSS, etc. -- if all of that stuff doesn't expire at the same time, it can hose the layout of your site.

Also, there may be many layers of caching between you and the user: not only HTTP caching in the browser, but also any CDNs (Akamai, etc.) and sometimes even caching reverse proxies inside corporations.

At my previous job, we handled the versioning with deployment-time rewriting of the asset URLs included in the base page to include the version number (as tagged by the build software with branch name + build number).
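The mechanism was roughly this -- not the actual tooling, just a minimal Python sketch of the idea, with the /static/ prefix and the branch+build format made up for illustration:

    import re

    def version_assets(html: str, build_id: str) -> str:
        """Rewrite static asset references so each deploy gets fresh URLs.

        Turns e.g. /static/app.css into /static/app.css?v=release-2.1-487 so
        browsers, CDNs, and corporate proxies treat every build as a new object.
        """
        pattern = re.compile(r'((?:href|src)=")(/static/[^"?]+)(")')
        return pattern.sub(
            lambda m: f"{m.group(1)}{m.group(2)}?v={build_id}{m.group(3)}", html)

    # Example: build software supplies branch name + build number.
    page = '<link rel="stylesheet" href="/static/app.css"><script src="/static/app.js"></script>'
    print(version_assets(page, "release-2.1-487"))

The nice property is that a new build changes every asset URL at once, so the JS and CSS can't expire out of step with each other.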

That said, enabling browser-side caching was a huge win for page speed on the site.


Is there a good way (i.e., in "Doctor Speak") to phrase something like "No Code unless a probable positive outcome with intervention"?


Also think for a minute about what "you should be ok" means in this context. Sure, if you are developing something in your spare time that kinda-mostly-doesn't-compete with what your employer is doing, and they find out, maybe they can't legally go after you.

But that doesn't mean that they're obligated to keep sending you a paycheck, either. California is an at-will employment state, and violating your employment contract tends to remove the "will" to employ you.


I've always thought the "drive to the datacenter" argument was BS. If you're writing your app for the cloud, you have to deal with instances spuriously going away, degrading, etc. It is no different in the datacenter. If you're driving to the datacenter in the middle of the night to replace a disk or a fan, you're doing it just as wrong as if getting evicted from an EC2 instance causes you to scramble on-call resources.

In my experience, the highest operational cost with running services is managing the application itself - deployment, scaling, and troubleshooting. None of that goes away with the cloud.


I have to agree. I put our stuff in a colo 2 years ago and never looked back. Pretty much all servers come with some kind of remote console interface (IPMI), and that's not terminal redirection; that's actually a totally self-contained microprocessor and Ethernet port that you can run on a separate subnet to control your server even if it's off. I've updated the BIOS and reinstalled OSes, all via IPMI, which is part of the motherboard. Add to that power strips that you can also control remotely and you're all set. Our servers are in the Bay Area, I'm in Canada. I have NEVER had to drive/fly to fix anything. Never even had to use remote hands for anything. Sure, some drives died, but standby drives are in place.
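For the curious, driving the BMC looks roughly like this -- a small Python sketch shelling out to ipmitool from a management box on that separate subnet; the address and credentials are obviously made up:

    import subprocess

    # ipmitool talks to the BMC over its own NIC, so this works even when the
    # host OS is wedged or the machine is powered off. Host/creds hypothetical.
    BMC = ["ipmitool", "-I", "lanplus", "-H", "10.0.42.17", "-U", "admin", "-P", "secret"]

    def power_status() -> str:
        out = subprocess.run(BMC + ["chassis", "power", "status"],
                             capture_output=True, text=True, check=True)
        return out.stdout.strip()

    def power_cycle() -> None:
        # Hard power cycle -- the remote equivalent of walking over and holding the button.
        subprocess.run(BMC + ["chassis", "power", "cycle"], check=True)

    print(power_status())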

The costs are dirt cheap these days. You can get a full rack, power, and a gigabit feed for about $800 in many colos in Texas. We opted for Equinix in San Jose, which is all fancy with work areas, meeting rooms, etc. when you are there, but the funny part is, we're never there!

I do like virtualization for maintenance/flexibility, so we have a few servers acting as hosts and run our own private cloud where we get to decide what runs where. In other cases, database servers run on bare metal with SSD drives. Best of both worlds.

It's so cheap that you can get a second colo in a different part of the country to house a second copy of your backups, and some redundant systems, just in case something really bad happens.

Oh yeah, and don't get me started on storage. We store about 100TB of data. How much is that on S3 per month? $12,000/month! A fancy enterprise storage system pays for itself every couple of months of S3 fees.
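Back-of-the-envelope version of that math, assuming the roughly $0.12/GB-month standard rate the figure appears to be based on (S3 pricing is tiered and has dropped since, so treat this as illustrative):

    # 100 TB on S3 at ~$0.12/GB-month (illustrative flat rate, decimal TB)
    tb_stored = 100
    rate_per_gb_month = 0.12
    monthly_cost = tb_stored * 1000 * rate_per_gb_month
    print(f"~${monthly_cost:,.0f}/month")  # ~$12,000/month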


> I have NEVER had to drive/fly to fix anything. Never even had to use remote hands for anything. Sure some drives died, but standby drives are in place.

Consider yourself lucky. We thought the same thing, but when a RAID controller died on us recently we really didn't know what hit us. It didn't just stop working, it started by hanging the server every now and then, then after a day slowly corrupting drives, then after a day or two it stopped completely.


I'm a bit conservative when it comes to hardware like RAID controllers. My choice was 3ware. They are by no means the fastest; in fact the performance sucks compared to others. I went to a company that builds storage systems but will build any kind you want, not locked into any controller, and I trusted them when they recommended 3ware as the brand that, in their experience, gets returned/fails the least. Of course everything fails, so it's just a matter of time. We have triple-redundant storage for file backup: an active set, a 5-minute-behind backup that is ready to be swapped in with one click, and long-term storage. If something goes wrong with the active set or it slows down, we just flip a switch and all our app servers use the new system, which is at most 5 minutes behind. The old system gets shot in the head and can be diagnosed offline. Shoot first, ask questions later.


This is totally anecdotal, but I've personally had far more problems with bad RAID controllers than with dying hard drives.


You are not necessarily doing it wrong; you may simply not have enough resources ($$$) to buy enough hardware for complete redundancy.

When you get evicted from an EC2 instance you just switch to a new one; the cost is constant. When your piece of hardware at the datacenter goes down, unless you had the resources for a spare one, you are hosed.


Access to live sports is the only thing keeping me from cutting the cord. Everything else we watch is readily available from other sources 'a la carte' which would be much less expensive than buying a giant package of cable channels I never watch.


About 10 years ago, during a minor earthquake in the California Bay Area, I happened to be on the phone with my girlfriend at the time, who was in Mountain View; I was in San Jose. The conversation went something like:

GF: "Oh! There's an earthquake!"

Me: "What, no there isn-- Oh wow, there's an earthquake!"

(few seconds of shaking)

GF: "Okay, it's over"

Me: "No it isn't, I stil feel-- Oh yeah, it's over!"

I'd estimate the delay to have been ~2-3 seconds over ~20 miles -- but I don't remember where the epicenter was, or how deep the quake was.


I am reminded of this post:

http://gadgetopia.com/post/6819

I think your thoughts of being a talentless nobody have more to do with gaining experience than with having access to more information and seeing more products. You have crossed the "humility threshold," where "what you think you know" < "what you actually know".

