If you've been following my writing of late, you'll know that I've hopped on the Survivability bandwagon with both feet (see my blog post "Defensibility and Recoverability" and the slides from my recent full-day course "Total Enterprise Assurance"). Key to this notion of Survivability is being able to operate under degraded conditions. After all, it's not if you're going to be attacked and compromised, but rather when. As such, your organization is already operating under less-than-ideal circumstances. Acknowledge this, accept this, and plan for continuing to operate in such a degraded fashion.
This notion of operating under degraded conditions does not always resonate immediately with people. After all, how can an enterprise function when key systems are under attack or outright broken? By way of explanation, allow me to provide a human analogy. The human body is, in fact, usually very good at operating under degraded conditions. However, this is not always the case, which can have some very bad results. Allow me to explain further.
Illness
Humans are breeding grounds for disease; walking petri dishes of viruses, bacteria, and who knows what else. And yet we continue to exist. Some diseases have had a very strong impact on the human race, nearly to the degree of complete devastation. Many pundits fear the same thing happening again, and soon (hence all the frenzy over Avian Flu and Swine Flu in the past few years, as well as the attention on AIDS in the 80s and early 90s). Regardless, humans continue to survive, and continue to function reasonably well. In fact, without certain "infections" we wouldn't be able to survive at all (e.g. the role of bacteria in digestion). Whether they should or not, humans continue to work and operate heavy equipment (aka "cars") while under the influence of illness.
Similarly, enterprises are rife with infections. They are constantly fighting off viruses, worms, trojans, and various other forms of malicious code (or malware). How well does your organization operate under these degraded conditions? In many cases these days the preferred response to an infected Windows workstation is a wipe and reinstall. How much productivity is lost as a result? More importantly, is it possible to plan for such events so that the impact is lessened? That is, how do you build in tolerance for operating under degraded conditions in the face of infections?
For humans, we build in sick leave and work-from-home policies that allow people to continue operating despite degraded conditions, and in an effort to limit the effect of infection on other personnel (see my post "Presenteeism and Inadequate Vacation Affect Security" for my thoughts on shortcomings in these current approaches). Progressive organizations are also increasingly promoting healthier living through various "wellness" programs, such as by offering flu shots, healthier cafeteria selections, exercise support, support for reducing stress, and so on.
Organizations need to take similar approaches to protecting information resources. They need policies (and standards) that promote the adoption of "healthier" behavior by all personnel. Healthy computing initiatives such as avoiding spam and phishing schemes, safe handling of data, responsible disclosure of sensitive information, and safe web browsing techniques are just a quick sampling of the many opportunities for improvement. Similarly, sysadmins need to adopt and own proactive measures to protect systems and data against attack and compromise, as well as have pre-planned processes in place for dealing with the inevitability of attacks and compromises. You will be operating under degraded conditions at some point in time, if not at all times: plan for it and find ways to optimize performance despite it.
Injury
In many ways, injuries are worse than infections, because their impact, on average, can be far more severe. At the same time, humans seem able to cope with most injuries in a far better way. Why is that? It's unclear, but my guess is that it has to do with the will to survive. Broken arm? No worries, I have another. Broken leg? Splint it, crutch it, move along. Of course, not all injuries are survivable, but we'll assume for the moment that the ones we're talking about here are survivable (death being covered below).
In the enterprise, break-fix is an all-too-common situation. Hardware seems to exist to fail (or so it seems most days). Honestly, there's very little in the organization that isn't subject to breaking - rules included! This is not, unfortunately, particularly useful in terms of fault tolerance. Broken systems don't work, which means an outage, which means lost productivity, revenue, or worse. Is there a way to operate under degraded conditions that involve a break?
There are a couple of ready analogues to injuries in the computing world, though I'm sure you can think of more. Load-balanced, or otherwise fault-tolerant, systems are one solution. RAID for storage is another. Cloud computing resources and virtualization are another. And so on. The name of the game in each of these is that a single component or system fails, but the failure is absorbed by other components or systems running in parallel. If spares aren't "hot" to take over, then there may be "warm" or "cold" spares at the ready to recover. This is all well-covered territory.
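For readers who think better in code, here is a minimal Python sketch of that idea: a request is handed to "hot" components first, and a failure is absorbed by falling back to "warm" and then "cold" spares. Everything in it (the Component class, the serve function, the handler names) is hypothetical and invented purely for illustration; it is not how any particular load balancer or RAID controller works.

```python
from dataclasses import dataclass
from typing import Callable, List


class ComponentFailure(Exception):
    """Raised by a component that cannot serve a request."""


@dataclass
class Component:
    # Hypothetical model of one system in a redundant pool.
    name: str
    handler: Callable[[str], str]
    state: str = "hot"  # one of "hot", "warm", "cold"


def serve(request: str, pool: List[Component]) -> str:
    """Try hot components first, then warm spares, then cold spares."""
    priority = {"hot": 0, "warm": 1, "cold": 2}
    for component in sorted(pool, key=lambda c: priority[c.state]):
        try:
            return component.handler(request)
        except ComponentFailure:
            # The failure is absorbed; move on to the next component.
            continue
    # No redundancy left: this is a real outage.
    raise ComponentFailure("all components failed")


def broken(request: str) -> str:
    raise ComponentFailure("primary is down")


def healthy(request: str) -> str:
    return f"handled {request}"


pool = [
    Component("primary", broken),
    Component("standby", healthy, state="warm"),
]
print(serve("GET /payroll", pool))
```

The point of the sketch is the last line of serve: degraded operation only works while there is still something left in the pool. Once every spare has failed, you are no longer degraded, you are down.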
However, let's talk about preventative measures. Human bodies are reasonably frail when considering things like collisions. As such, we've developed several mechanisms to help minimize or prevent injuries stemming from collisions. Pads, helmets, and mouthguards are common in contact sports. Seat belts and roll cages are common in racing vehicles, as are crumple zones and air bags in private vehicles. All of these various preventative measures exist to lessen the damage from an impact; that is, they're designed to minimize injuries and allow continued operation despite potentially degraded conditions.
Similarly, enterprises can deploy preventative measures to help minimize the damage of various attacks. Examples include backups, data encryption, logging & monitoring, firewalls, intrusion detection & prevention, anti-virus, compartmentalization, strong authentication, incident response planning, and so on. These tools exist not necessarily to stop bad things from happening, but to help minimize the damage (think about that for a minute - it's actually a shift in the thinking - do you accept the statement?). Bad things are going to happen, whether it be a drive failure, a backup tape falling off the back of a truck, or an active intrusion into the network. Deployed tools will help with minimizing the damage, but there will be damage (merely having to redirect resources to respond to the situation represents a degree of financial damage). At some point in time, humans will suffer injuries of varying degrees. The same is true for the enterprise. Planning, forethought, and preventative measures can go a long way toward helping minimize the damage while allowing continued operation under degraded conditions.
Distraction
One of the more interesting examples of common threats to humans is distraction. First we saw laws requiring hands-free headsets for mobile phones (for better or worse). Now we're seeing the emergence of laws prohibiting texting while driving. Humans were already a distractible race, but advances in communication technology have only made that worse. How many more people will walk in front of oncoming traffic while texting, or crash their cars as they text and drive, before we realize that focus is not a bad thing, and that multi-tasking is in fact a myth?
We have similar problems in information assurance/security. Enterprises are frequently distracted by what is commonly called "shiny object syndrome." That is, the people making decisions about IT and security are often attracted to what is new and interesting, rather than to what is perhaps challenging and mundane and absolutely necessary. This scenario plays out every week in organizations around the world as they opt to buy cutting-edge technologies that may or may not add value, instead of focusing on the fundamentals of protecting the corporation (see my post "Hotel Showerheads" for my thoughts on failing at the fundamentals, and my older post "Cut Through the Noise, Focus, Find Success" on the need for focus).
A key failure in the IT industry overall is a shared failure with the business. What is the core purpose, or mission, of the enterprise? What is important to the business? What's the most important data? What operations must continue to support the function of the business? Where is important data stored? How is it protected? How does it traverse the network (ingress AND egress)? These are all key questions that we often seem too busy to ask. If I've seen it once, I've seen it a hundred times. Important projects are derailed, either because of ongoing break-fix problems, or because the attention of key executives cannot be captured for long enough to get important decisions made.
Focus is important. Multi-tasking is a myth. Distractions are plentiful. And yet we continue to function despite it all. Distractions represent a common and major form of degradation in operations. Ironically, it's one that can, and should, be easily remedied. Remove distractions, establish focus (and discipline), and see what happens. Oh, and don't go blaming social media for this problem. Distractions existed long before Facebook and Twitter (solitaire, anyone?). It's not the distractions that are at fault here, but rather the culture that tolerates excessive distraction. Culture is set at the top of the organization; the example of executives is almost always followed.
Death
Perhaps the most critical state of degradation, against which all bets are usually hedged, is death. In the human sense, death is rather final, and really represents the termination of operations. Is there an analogue for the enterprise? In many ways, I think so. I remember several years ago discussing backup sites with a customer. Their corporate HQ was located in the middle of a tornado zone, which meant that HQ could be wiped out any Spring or Summer in a matter of minutes (and, in fact, this nearly happened this year - a tornado missed by well less than a mile, destroying a nearby park). During the out-brief, I asked the VP of IT directly "how long would it take to recover from a direct hit to the building, assuming all services were wiped out?" The answer was a sobering "2 weeks to restore data from backups to systems we'd have to acquire on the fly." Suddenly he went pale as he realized what he'd just admitted. Factory workers were paid on a weekly basis, and all payroll processing was now done through central systems at HQ. Given the strength of the union, there was no way they could continue factory operations for 2 weeks without paying employees (many of whom lived paycheck to paycheck). A direct hit would result in major harm to the company, up to and possibly including death. The financial repercussions for all involved would be quite severe.
While few of us like thinking about death, it is in fact a vital part of our planning exercises. Here in the human space we talk about wills and custody. In the enterprise space, we talk about business continuity planning and disaster recovery, and one of the key questions that should be included in those planning discussions is "what would mean death to the enterprise?" It is essential to face mortality, whether it be of the body or the business, and understand the potential impact, the frailty, the strengths, and, more importantly, how to best manage it. Luckily, the enterprise is better positioned to try and avoid death, unlike humans.
(Note: This entry is cross-posted from Truth to Power Association.)