My friend Ken wrote a response to Joel Spolsky’s post, “Five Ways.” But then he discovered the blog did not allow comments.
Ergo, I am posting his comment for him on my blog, because he has some pretty interesting stuff to say. You can contact Ken on Twitter at @kendonoghue.
A server is just one part of an infrastructure, but an important one that can head off many of the unfortunate events you describe. You covered a lot of territory, so bear with me.
Assume that your business-critical application (or several virtual apps) is running on a true fault-tolerant server. Everything is redundant, down to the power cords. It’s the equivalent of two physical x86 servers running as a single logical server. Any standard Red Hat Linux or Windows app (one license for the OS and application) runs without you having to think about every conceivable problem that might occur and write failover scripts for it. A single application runs on “both sides” of the server.
Transient error! The application continues to run as the server rides through it.
Component failure! The server identifies the problem and attempts to restart the offending part. If it can’t, the server takes that “side” offline while the application continues to run and your sys admin continues to sleep.
The server calls into the customer service center and gets a human who diagnoses the issue down to the component level. Ninety-five percent of the time, the problem can be fixed remotely. When it can’t, a replacement part ships out for next-day delivery. The application is still running and the sys admin is still sleeping, unless the server is configured to issue alerts or the service center is instructed to call. Regardless, she or he doesn’t have to do anything. The replacement part arrives the next day and is hot-swapped for the failed component. And we’re still running: no downtime, no data loss, no failover.
The system automatically resyncs and “both sides” are again running in lockstep.
Admittedly, even at six nines of availability (about 32 seconds of downtime per year), downtime remains a remote possibility. If that happens, you restart the application on one half of the server and retain the crash state on the other half, so that the root cause can be determined and the failure is not repeated.
Who can afford this? It’s not commodity server pricing, but at $15K to $60K, these servers are pretty affordable, especially for business-critical applications. Industry-standard fault-tolerant servers like this have been around for a decade. No slight intended, but the free “Fault Tolerance for Dummies” covers this and more in entertaining detail.
I’m with Stratus Technologies, and I approve this message. Hope it is helpful.
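A quick sanity check on the six-nines figure quoted above, as a minimal Python sketch. The function name and the choice of a 365.25-day year are assumptions made for illustration; nothing here comes from Stratus. It simply multiplies the seconds in a year by the fraction of time a system with N nines of availability is allowed to be down.

```python
# Rough downtime budget for "N nines" of availability.
# Assumption: a year is 365.25 days; the helper name is hypothetical.
SECONDS_PER_YEAR = 365.25 * 24 * 60 * 60


def downtime_seconds_per_year(nines: int) -> float:
    """Return the allowed downtime per year, in seconds, for N nines."""
    unavailability = 10 ** -nines  # e.g. six nines -> 0.000001
    return SECONDS_PER_YEAR * unavailability


if __name__ == "__main__":
    for n in (3, 4, 5, 6):
        print(f"{n} nines: {downtime_seconds_per_year(n):.1f} seconds/year")
```

Under these assumptions, six nines works out to roughly 31.6 seconds a year, which matches the rounded 32 seconds mentioned in the comment. Five nines, by comparison, allows a little over five minutes.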