Single Point of Enterprise Failure

Many large enterprises have deployed single sign-on (SSO). As many of the early adopters found out the hard way, SSO creates a single point of enterprise failure. If SSO goes down, so too do all applications that are protected by SSO.

Many SSO implementers (whom I call the “technoids”) think they have addressed this by having hot failovers. In other words, if an SSO server or group of servers goes down, the SSO system automatically fails over to servers at another site. While this is a good idea, it doesn’t prevent a catastrophe from happening, which is the subject of this blog.

To understand the complexities of web-based single sign-on, you first need to follow the flow of electrons. The electrons flow from the browser, through the network, to the application and its servers, where the signal is intercepted and then redirected, usually through load balancers, to the identity and access management subnets and their internal firewalls, to access servers and on to LDAP directories and/or databases, then back to the client and onwards to the application and its servers. That’s a large portion of the IT infrastructure network.
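To make the point about the length of that path concrete, here is a minimal sketch of how availability compounds along a chain. It assumes every hop must be up for a login to succeed and that hops fail independently; the component names and the 99.99% figures are illustrative assumptions, not measurements from any real deployment.

```python
from math import prod

# Hypothetical per-hop availabilities along the SSO path (illustrative only).
SSO_PATH = {
    "network": 0.9999,
    "load_balancer": 0.9999,
    "internal_firewall": 0.9999,
    "access_server": 0.9999,
    "ldap_directory": 0.9999,
    "database": 0.9999,
}


def serial_availability(components):
    """Availability of a chain in which every component must be up.

    Assumes independent failures, so the individual availabilities multiply.
    """
    return prod(components.values())


if __name__ == "__main__":
    overall = serial_availability(SSO_PATH)
    print(f"Overall availability of the path: {overall:.4%}")
```

With these assumed numbers, six hops at 99.99% each give roughly 99.94% for the path as a whole, which is lower than any single hop. Redundancy at each hop changes the arithmetic, but the chain is still only as available as the product of its stages.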

Most SSO systems are required to operate at 99.999% availability or even higher. This means that:
* All pieces of the route the electrons flow through need to be continuously monitored, both independently and as a system, by using authentication scripts
* The monitoring needs to interact with some logic that determines the level of the incident
* The incident management system must then be started automatically and escalate differently depending on the criticality of the developing incident
* A central command console needs to be implemented that instantly displays all the monitoring and incident management events
* A central security ops team needs to make instant decisions and be able to take down, restart and isolate certain servers and/or networks and/or databases and/or directories

All of which needs to happen in seconds and minutes. The system can never go down.
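To see why the response has to be measured in seconds and minutes, it helps to turn an availability target into a yearly downtime budget. The sketch below is plain arithmetic; the function name is hypothetical and the year is taken as 365.25 days.

```python
MINUTES_PER_YEAR = 365.25 * 24 * 60  # 525,960 minutes


def downtime_budget_minutes(availability_percent):
    """Minutes of downtime per year allowed by an availability target."""
    return MINUTES_PER_YEAR * (1 - availability_percent / 100)


if __name__ == "__main__":
    for target in (99.9, 99.99, 99.999, 99.9999):
        minutes = downtime_budget_minutes(target)
        print(f"{target}% -> {minutes:.2f} minutes per year")
```

Five nines works out to about 5.26 minutes of downtime per year, and six nines to about 32 seconds. A single incident that takes an hour to diagnose uses up more than a decade of a five nines budget.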

In my extensive experience, most enterprises wander into this without comprehending the enterprise risk, thinking that simply having hot failover servers means all will be well. When the systems go down, whether due to the failure of one part of the access management infrastructure noted above or due to the lack of a fast, well thought out, coordinated response, all hell breaks loose. I have seen the CEO on the phone every 30 minutes for several hours demanding to know when their enterprise will become digitally unstuck from SSO failure.

If you are embarking on a large SSO and/or provisioning project, then pay heed to what I have written above. In large enterprises, some of the hidden logs lurking beneath the SSO/provisioning waters that your consultants don’t normally tell you about are:

1. Implementation of a wide monitoring campaign for all parts of your network and infrastructure.
2. Long timelines to integrate the monitoring software with logic that will differentiate events. For example, if one access server goes down, the event may trigger an email to the on-duty security ops team requiring them to fix it over the next 12 hours. However, if two access servers go down (or one is quickly followed by another), then the logic must rapidly escalate this up the ladder, have security ops people logging on within a few minutes even in the middle of the night, etc.

Integrating the monitoring with the incident management and IT ticketing systems takes lots of time. I have found that all of this usually takes several months to prepare for.

3. IT reorgs – When you’re operating a five or six nines availability system, you don’t have time to call up Jane or John in networks, then call over to database support, etc., to determine what to do about a problem. Very frequently, I have reorganized the IT support infrastructure so that there is one security ops team that is well cross-trained on networks, load balancers, firewalls, access servers, database servers, directory servers, etc. They have only seconds and minutes to make critical decisions.
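The event-differentiating logic from item 2 can be sketched as a small state machine. This is only an illustration under stated assumptions: the class and method names are hypothetical, the 10-minute window standing in for "quickly followed by another" is an arbitrary choice, and a real deployment would wire the returned action into its own monitoring, paging and ticketing tools.

```python
from dataclasses import dataclass, field
from enum import Enum


class Action(Enum):
    NONE = "no action"
    EMAIL_ON_DUTY = "email on-duty security ops (fix within 12 hours)"
    PAGE_IMMEDIATELY = "page security ops now"


@dataclass
class EscalationPolicy:
    # Hypothetical window for "one quickly followed by another".
    window_seconds: float = 600.0
    down: set = field(default_factory=set)            # servers currently down
    last_failure: dict = field(default_factory=dict)  # server -> time of last failure

    def record_failure(self, server, now):
        """Register a failed access server and return the action to take."""
        self.down.add(server)
        self.last_failure[server] = now
        return self.decide(now)

    def record_recovery(self, server, now):
        """Register a recovered access server and return the action to take."""
        self.down.discard(server)
        return self.decide(now)

    def decide(self, now):
        # Servers whose most recent failure falls inside the window.
        recent = [s for s, t in self.last_failure.items()
                  if now - t <= self.window_seconds]
        # Two servers down at once, or failures on two servers in quick
        # succession while something is still down: escalate immediately.
        if len(self.down) >= 2 or (self.down and len(recent) >= 2):
            return Action.PAGE_IMMEDIATELY
        if self.down:
            return Action.EMAIL_ON_DUTY
        return Action.NONE


if __name__ == "__main__":
    policy = EscalationPolicy()
    print(policy.record_failure("access-1", now=0).value)
    print(policy.record_failure("access-2", now=30).value)
```

Timestamps are passed in as plain seconds rather than read from the clock, which keeps the logic easy to test against recorded incidents before it is ever connected to a pager.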

Enterprises can learn from others’ past mistakes and avoid having the CEO on the phone every 30 minutes demanding to have their enterprise back online.

