In recent times we have seen high-profile IT outages involving the UK Border Agency, RBS, BlackBerry, O2 and others. These outages have undoubtedly damaged credibility and caused direct financial loss.
If large organisations such as these, with large, professional, well-funded IT departments, can get things wrong, then what is the root cause that everyone can learn from?
My view is that we have allowed complacency to creep in due to the increasing reliability of technology and, consequently, of the systems and services which run on that technology. I think we have forgotten that bad things can, and do, happen, and that we need a risk-based approach to managing technology.
IT Service Continuity is the answer. We must merge availability management with the traditional disaster recovery viewpoint. We must go beyond simply having a plan for the major incident and focus more effort on asking ourselves the “what if?” question in relation to the end-to-end service provision. What could go wrong? Are there any single points of failure? How do we protect ourselves against human error and omission? What about a deliberate malicious act? How do we control changes? If we do make a mistake, how do we back it out and get back to a good state? How do we ensure service alerts are timely and relevant? Whatever the application system, do we have inbuilt checks on data integrity and data exceptions, with associated control points which allow us to say “stop!” and invoke human intervention to prevent the situation getting worse?
If we simply ignore IT Service Continuity Management then the next major downtime could be in your organisation…
So what can you do to improve your current IT Service Continuity capability? Here are some ideas.
1. Implement a Major Incident Response Procedure
You must respond efficiently, effectively and professionally to any major incident situation which may arise. This involves assigning a Major Incident Team, or equivalent, with a mix of senior decision makers and subject matter experts capable of assessing the situation and agreeing what to do to resolve it.
The Major Incident Team needs to be supported by an escalation process which ensures that all adverse situations, actual or developing, are made known to them quickly. There must be a robust call out process which allows all interested parties to be made aware of the situation and to be mobilised if appropriate.
2. Develop an IT Service Continuity Plan
You should have an IT Service Continuity Plan which documents how to manage the adverse situation from the IT department's viewpoint and how to engage with interested parties, including senior management and business users as well as all affected IT staff. The Plan should also document the agreed coping strategies to adopt when a major incident results in the loss of IT systems and services, collectively or individually. Having pre-agreed coping strategies saves time, prevents unnecessarily lengthy debate and helps avoid wrong decisions being taken in a real major incident situation.
3. Run Scenario Exercises
Tabletop exercises, where those responsible for responding to an adverse situation are gathered together and are asked to respond to a specific event scenario, are an excellent way of taking people through the thought process required to rehearse incident roles and responsibilities, validate response plans and recovery strategies and generally raise awareness of what to do if disaster strikes.
For IT Service Continuity, a good focus is on asking the “what if?” question in relation to your critical IT systems and services, for example:
- What if the system or service went down?
  - how would we respond?
  - what would be the business impact?
  - what would be the customer impact?
  - what would be the lead time to fix or replace the system or service?
  - what coping strategy would we adopt?
- What if data was corrupted or wrong? How would we know?
  - are there system alerts inbuilt and, if so, are these appropriate and timely?
  - are there restart / control points inbuilt?
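To make the last two questions concrete, here is a minimal sketch of what an inbuilt control point might look like for a batch of records. Everything in it is an assumption made for illustration: the `amount_pence` field, the `check_batch` function and the `ControlPointHalt` exception are hypothetical names, and the rule that an exception is a missing or negative amount is just one possible check. The idea is simply that the batch is reconciled against independently supplied control totals and processing stops for human review if anything does not add up.

```python
class ControlPointHalt(Exception):
    """Raised when a batch fails its integrity checks and needs human review."""


def check_batch(records, expected_count, expected_total_pence, max_exceptions=0):
    """Reconcile a batch against its control totals before it is processed.

    Returns the reconciled total, or raises ControlPointHalt (the "stop!" signal).
    """
    problems = []

    # Check 1: did we receive the number of records the sender said it sent?
    if len(records) != expected_count:
        problems.append(f"record count {len(records)} != expected {expected_count}")

    # Check 2: separate out data exceptions (assumed rule: missing or negative amount).
    good, bad = [], []
    for record in records:
        amount = record.get("amount_pence")
        if isinstance(amount, int) and amount >= 0:
            good.append(record)
        else:
            bad.append(record)
    if len(bad) > max_exceptions:
        problems.append(f"{len(bad)} data exceptions (allowed: {max_exceptions})")

    # Check 3: does the value of the valid records match the control total?
    total = sum(record["amount_pence"] for record in good)
    if total != expected_total_pence:
        problems.append(f"total {total} != expected {expected_total_pence}")

    if problems:
        raise ControlPointHalt("; ".join(problems))
    return total
```

In a tabletop exercise, a useful follow-up question is who receives the alert when a halt like this is raised, and how quickly they can act on it.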
4. Conduct an Information Security Risk Assessment
An information security risk assessment will highlight the threats to your IT systems and services and will explore the effectiveness of the current controls to maintain confidentiality, integrity and availability of information in all its forms. The risk assessment must cover physical threats as well as virtual ones and will thus include business continuity related risks.
5. Determine Your Recovery Capability
For each IT system or service, you need to determine the recovery time and the data recovery point that would apply should you need to resurrect it off site following a major incident.
If you haven’t formally tested the recovery and so cannot be fairly confident about the capability, draw a timeline and map out the activities required to resurrect the affected IT systems, services and data. When doing this, be realistic. Make sure you take into account the initial lag between the incident being declared and the response actually being invoked, notably out of hours; the lead times for procuring any replacement kit; the dependencies between recovery activities; relative priorities for recovery; the limited availability of technical experts to actually perform the recovery; and the need for regular rest breaks. Be careful not to be over-optimistic about your actual recovery capability, as it is the business that will suffer in a real incident if the estimate turns out to be wrong.
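The timeline exercise can be sketched in a few lines of code. The example below is illustrative only: the activity names, durations and the 25% rest allowance are invented assumptions, not measured figures, and you would substitute your own. It works out the earliest time each activity can finish, given its dependencies, and takes the latest of these as the overall recovery time.

```python
from graphlib import TopologicalSorter

# Hypothetical recovery activities: name -> (duration in hours, prerequisites).
# All names and figures are illustrative assumptions.
ACTIVITIES = {
    "invoke_response": (2.0, []),                    # lag before the team is mobilised
    "procure_hardware": (24.0, ["invoke_response"]),  # lead time for replacement kit
    "retrieve_backups": (4.0, ["invoke_response"]),
    "rebuild_servers": (8.0, ["procure_hardware"]),
    "restore_data": (6.0, ["rebuild_servers", "retrieve_backups"]),
    "verify_and_handover": (3.0, ["restore_data"]),
}


def earliest_finish_times(activities, rest_allowance=0.0):
    """Earliest finish time for each activity, in hours after the incident is declared.

    rest_allowance inflates every duration by a fraction (0.25 = 25%) as a crude,
    uniform allowance for rest breaks and handovers.
    """
    graph = {name: set(deps) for name, (_, deps) in activities.items()}
    finish = {}
    # static_order() yields each activity only after all of its prerequisites.
    for name in TopologicalSorter(graph).static_order():
        duration, deps = activities[name]
        start = max((finish[dep] for dep in deps), default=0.0)
        finish[name] = start + duration * (1.0 + rest_allowance)
    return finish


def recovery_time(activities, rest_allowance=0.0):
    """Overall recovery time: the moment the last activity finishes."""
    return max(earliest_finish_times(activities, rest_allowance).values())


print(recovery_time(ACTIVITIES))        # 43.0
print(recovery_time(ACTIVITIES, 0.25))  # 53.75
```

Note that this sketch assumes enough people are available to run independent activities in parallel. If the same technical expert is needed for several activities, the real figure will be longer, which is exactly the kind of over-optimism to watch out for.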
6. Make Sure Users Understand What Your Recovery Capability Is
Use the output from Step 5 to prepare a recovery capability statement which clearly shows the recovery time for each IT system or service and the data recovery point should you need to revert to the most recent available backup. Discuss this statement with your business users to identify any gaps between your current capability and the business needs.
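As a rough illustration, the comparison between current capability and business need can be tabulated and checked mechanically. The sketch below assumes a hypothetical `ServiceRecovery` record per service; the service names and hour figures are made up for the example, and the business-need figures would come from your discussions with users.

```python
from dataclasses import dataclass


@dataclass
class ServiceRecovery:
    name: str
    achievable_recovery_time_hours: float   # from the Step 5 timeline
    achievable_recovery_point_hours: float  # age of the most recent available backup
    required_recovery_time_hours: float     # what the business says it can tolerate
    required_recovery_point_hours: float


def capability_gaps(services):
    """List (service, measure, shortfall in hours) wherever capability falls short of need."""
    gaps = []
    for s in services:
        time_gap = s.achievable_recovery_time_hours - s.required_recovery_time_hours
        if time_gap > 0:
            gaps.append((s.name, "recovery time", time_gap))
        point_gap = s.achievable_recovery_point_hours - s.required_recovery_point_hours
        if point_gap > 0:
            gaps.append((s.name, "data recovery point", point_gap))
    return gaps


# Illustrative figures only.
SERVICES = [
    ServiceRecovery("payments", 43.0, 24.0, 8.0, 1.0),
    ServiceRecovery("intranet", 48.0, 24.0, 72.0, 24.0),
]

for name, measure, shortfall in capability_gaps(SERVICES):
    print(f"{name}: {measure} exceeds business need by {shortfall:g} hours")
```

With these figures only the payments service shows gaps, which are the items to carry forward into the discussion with senior management.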
7. Inform Senior Management of Any Risk Exposures
Where risk exposures are identified during the risk assessment in Step 4, identify ways of reducing them and report both the exposures and the proposed solutions to senior management for their information and consideration.
Where there are gaps identified in the recovery capability review in Step 6, explore solutions to fill those gaps and present these solutions to senior management for their consideration. Remember that the business should also explore manual workarounds to use while IT systems and services are down rather than simply procuring technical solutions.
This is your opportunity to discharge your duty to senior management to make them aware of any risk exposures and to dispel any possible misconceptions they may have in this regard. Remember to ask for the relevant priority and resource to be assigned to implementing any recommended improvements.
8. Implement Agreed Improvements
Where senior management decides in Step 7 to accept any of the risk exposures identified, this must be formally documented. Otherwise, using the mandate you have hopefully now received from senior management, assign the agreed priority and necessary resource to driving through implementation of the improvements.
Taken together, the steps above can greatly reduce your exposure to serious threats and allow you to better manage any major incident which does occur.