Incident Management Training for IT Operations

Point in Time Exercise

Step 1: Read the details about the part of the incident to which you are assigned:

02:30:00 The software release team and the operations team discuss that the release is now impacting service and should become an incident. Nothing formal is decided and no Incident Commander is identified. The release manager (Pete) is still functioning as the lead on the bridge.  

02:31:00 An SME is asking to check the analytics. An unidentified person states that there is a problem on Server 24, and that this may be an e-mail problem. The person also says there is a need to call the e-mail application team. “On second thought,” he says, “someone should definitely call them.” The SME reports they are not able to log in and check the dashboard. An unknown voice asks, “Is something else failing?” One of the other SMEs was able to log in.

02:33:45 Neal from customer support is reporting that the dashboard is looking good. 

02:34:01 Pete reports that the dashboard is displaying time-delayed analytics. “Is it possible it hasn’t been updated yet?” He asks Jan to check on it. Jan asks, “We are checking on the e-mail issue, is that right?”

02:35:00 Pete asks Jan, “How do the analytics look?” Jan reports that all look good except Server 24. There is a lot of uncontrolled group discussion on database issues. The discussion centers on node 1 being shut down. There is also general discussion on which database team should be contacted.

02:40:47 One of the SMEs who dropped from the call earlier rejoins. Two SMEs from different database teams join and only provide their names, not their functions or teams. Pete does not know them, and asks if they see any problems. They ask for the reason they were called, and want to know what is going on. There are several minutes of discussion on the situation. Pete is still the leader of the call but has not assumed the position of Incident Commander. The SMEs announce to the bridge that their analytics are showing some problems. They would like to move some partitions, and there is some discussion on this action. Other SMEs offer some other suggestions.

02:44:27 Customer service is reporting more customer tickets, and they are starting to pile up. Pete asks for more specifics. The tickets seem to be related to Server 24 and Server 28. One of the SMEs asks, “Which servers are the customers on? Does anyone know?” This will have to be investigated, but Pete does not assign it to a specific function or person.

02:46:23 There is a lot of background noise from one of the bridge participants. It sounds like someone is working from home and there are children’s voices in the background. One of the Network SMEs finally addresses the noisy SME on the phone. “Whoever is working from home,” he says, “put your phone on mute now, please. We can hear your kids.” The discussion is disjointed.

02:48:22 One of the SMEs suggests that there should be a script revision, and Pete says someone should go in and change host names, but it is not directed at anyone. The possibility that the host names may not have been correct in the release is brought up. There are a lot of frustrated comments made about another failed software release.

02:53:00 A number of people have dropped from the bridge because the software release team wanted an off-line discussion. No one announced this meeting, and the release team, including Pete, just disappeared from the bridge. Currently there are only two SMEs on the bridge. They feel the bridge may have been lost and are confused about what is going on and what to do. They decide to continue their discussion on the host name issue and hope that the others will dial back in.

02:54:23 One SME comes back on the bridge and says that others will come back in a couple of minutes.

03:00:21 Customer service reports that one of their big customers, Acme Chemical Sales, is reporting that multiple customers are not able to log in and place orders. There is some discussion as to whether this is related to this incident or whether another bridge should be opened. The group decides to open a separate bridge.

03:03:26 Customer service suggests to Pete that this is a big problem and that he should make the appropriate notifications for a P1 (Priority 1) incident, which requires Executive and customer notifications at varied intervals. Pete ignores the discussion.

03:05:41 An unidentified SME asks, “Have we let the customers know we are resolving the issue?” Pete says it is not resolved and there are still problems. Customer Service says we have an incident and we need to send the proper notifications.

03:12:54 The customer notifications have still not gone out. There is some discussion about whether this is a performance degradation or a service disruption, but the customers are reporting they can’t log in. Pete determines that he will keep it a performance degradation.

03:14:54 An SME reports that the analytics do not look too bad. The only person that hears and comments on this is one of the database engineers.

03:19:02 An e-mail SME joins and there is a discussion about e-mail issues and one of the servers misbehaving. Pete is not sure what the problem is. Pete is looking for Bill, one of the e-mail SMEs.

03:29:11 Bill rejoins and says, “There are some problems, and we need to get the senior Executive on the call.”

03:39:25 One of the SMEs suggests disabling incoming e-mail. In the current state, the e-mail is being lost. There is some discussion, then someone says, “Disable it, then we can discuss it.” Another unidentified bridge participant observes that the problem is only for Server 24. Joe said he needs to check Server 28 to make sure there are no problems there.

04:09:14 E-mail test failed for Server 24 and analytics as well.

04:19:20 Pete reports that the executive has been contacted. Pete reports there are some issues in the Main Street Datacenter (MSDC), which started about 06:20. Pete reports there is another bridge opened by the network group and they are talking about the Server 28 issue. An SME wants to move to the other bridge to look at the database issues. There is confusion as to who is going to handle database issues between the SMEs.

04:21:43 The Executive, Paul, joins and asks for an update. Pete says, “After the software release, there are a couple of customers that are not able to log in.” Paul chimes in, “This may be a metadata issue.” One of the SMEs confirms. Additional discussion occurs about the script. Paul asks, “How long are we going to investigate this issue?” The discussion continues, but no one answers the question. Paul asks the question a second and third time. One of the SMEs offers a very long-winded discussion on the investigation and, after several minutes, the SME answers that it will be about 30 minutes. Paul directs the group to “not do the investigation and to take quick action to restore service. This will cause a service disruption, but if successful would offer a quick solution.” Many of the SMEs want to do more investigation for 30 more minutes.

04:29:22 Another Executive joins, and someone provides an overview. One of the SMEs wants to rerun the script. Someone asks how long that might take. An unidentified person thinks it will take 30 minutes. The Executive says that if there was a problem with the script, rerunning it won’t solve the problem, and that the group needs to confirm that the script was correct.

04:46:22 Discussion on the two approaches to take to resolve the issue.

04:58:41 Pete asks both Executives if they want the group to focus on resolving the issue or continue investigating. No answer from either one.

05:08:05 An SME says, “It appears that the script did not run correctly and that we should run it again.” Some discussion on whether there is any harm in running the script again other than time. Discussion on the process to run the script.

05:15:52 Customer Service says there are seven cases now and that the customers are unable to log in.

05:44:23 The Executive asks for an update. Pete responds that the group is still looking at what caused it. “We are looking at action items to perform next,” he says.

05:49:45 There is further general discussion on the path forward, but with no consensus. After several minutes, the Executive says, “Let’s move forward.”

06:10:43 Discussion on how long the script will run and the need to monitor the time.

06:24:06 One of the SMEs asks about the e-mail issues one customer has, and another SME says she knows nothing about that and she needs more information. An SME provides some background info on the e-mail issue. The SME says, “Let’s work on the e-mail issue.” There is some discussion, but Pete is silent in this discussion.

06:29:41 Pete reports that the analytics look good and are all green.

06:31:05 Customer Service reports they have checked with a customer and they are able to log in. They are checking with more customers.

06:37:25 Paul asks who is running this bridge, Pete or another SME? Pete offers that he is the leader.

07:02:26 An SME reports that new e-mail is coming in, but the old e-mail in the queue is backed up. The SME asks for some help in flushing the queue. Pete asks another SME to take care of it. The Executive wants to make sure everything is documented for the root cause analysis.

07:13:12 Customer Service says another customer is reporting they are able to log in and do not appear to have any functionality issues. Acme Chemical Sales is only reporting one issue now. SMEs are still investigating. Problem in the chat window.

07:17:00 Pete says, “Looks like e-mail is working again.” No objections from the group and the callers begin to drop off the bridge.

07:47:14 There is discussion on root cause between Paul and Pete.

07:50:00 Call is terminated.
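
When preparing the report in Step 2, it can help to know how long after the first sign of impact each event occurred. The following is a minimal Python sketch of that arithmetic, assuming all timestamps fall on the same day. The event labels and the choice of which events to include are illustrative shorthand, not part of the exercise.

```python
from datetime import datetime, timedelta

FMT = "%H:%M:%S"


def elapsed(start: str, end: str) -> timedelta:
    """Return the time between two same-day HH:MM:SS bridge timestamps."""
    return datetime.strptime(end, FMT) - datetime.strptime(start, FMT)


# Timestamps copied from the timeline; the labels are hypothetical shorthand.
events = {
    "impact_first_discussed": "02:30:00",
    "p1_notification_suggested": "03:03:26",
    "notifications_still_not_sent": "03:12:54",
    "executive_joins": "04:21:43",
    "analytics_all_green": "06:29:41",
    "call_terminated": "07:50:00",
}

start = events["impact_first_discussed"]
for label, stamp in events.items():
    # Print each event as an offset from the first discussion of impact.
    print(f"{label}: +{elapsed(start, stamp)}")
```

Offsets like these make it easier to state in a briefing how long the bridge ran without a declared Incident Commander or customer notifications.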

Step 2: Prepare a CAN report or an IMS briefing from the information provided.

CAN Report

Do not submit this form until leaving the breakout room. If chosen as the LNO to represent your group, submit AFTER you deliver your briefing in the main room.

IMS Briefing

Do not submit this form until leaving the breakout room. If chosen as the LNO to represent your group, submit AFTER you deliver your briefing in the main room.

CAN Report Example

Hello, I’m ________, your Liaison Officer for this incident.

Conditions:
At 09:53 PST, after a server host was returned to the pool, all of the hosts became unresponsive, resulting in a loss of access to email and websites.

Actions:
1. Server team accessing the server hosts directly to look for possible causes.
2. Server team spinning up user directory server in the cloud to restore access to email.

Needs:
1. Submit an urgent ticket with VMware for their assistance.
2. Retrieve logs from servers, ready to submit to VMware when they respond.

This is the end of my report.
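
The three-part structure of the example above can also be captured as a small data structure, which makes it harder to leave out a section under time pressure. The following is a rough Python sketch only; the `CanReport` class and its `render` method are hypothetical names introduced for illustration, not part of any IMS tooling.

```python
from dataclasses import dataclass, field


@dataclass
class CanReport:
    """Conditions, Actions, Needs: the three parts of a CAN report."""

    liaison: str
    conditions: str
    actions: list[str] = field(default_factory=list)
    needs: list[str] = field(default_factory=list)

    def render(self) -> str:
        """Return the report as plain text in the order it is delivered."""
        lines = [
            f"Hello, I'm {self.liaison}, your Liaison Officer for this incident.",
            "",
            "Conditions:",
            self.conditions,
            "",
            "Actions:",
        ]
        lines += [f"{n}. {text}" for n, text in enumerate(self.actions, start=1)]
        lines += ["", "Needs:"]
        lines += [f"{n}. {text}" for n, text in enumerate(self.needs, start=1)]
        lines += ["", "This is the end of my report."]
        return "\n".join(lines)


# Content taken from the CAN Report Example; the name is left blank as in the template.
report = CanReport(
    liaison="________",
    conditions=(
        "At 09:53 PST, after a server host was returned to the pool, all of the "
        "hosts became unresponsive, resulting in a loss of access to email and websites."
    ),
    actions=[
        "Server team accessing the server hosts directly to look for possible causes.",
        "Server team spinning up user directory server in the cloud to restore access to email.",
    ],
    needs=[
        "Submit an urgent ticket with VMware for their assistance.",
        "Retrieve logs from servers, ready to submit to VMware when they respond.",
    ],
)
print(report.render())
```

Keeping Conditions as a single statement and Actions and Needs as short numbered lists mirrors the example and keeps the spoken report brief.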

IMS Briefing Example

Introduction
<<Self introduction by name and function>>

<<Insert briefing number>>

I have two issues to update you on and my briefing is expected to last <<XXX>> minutes.

The first issue is regarding the Sev 1 incident currently in progress with our Core Production Release. The incident has caused users to be disconnected from the network.

The second point I will cover is our anticipated time to resolution.

Main Points [typically 2-3 main points]
As you know, at 22:00 PST we were alerted to an issue with a Core Production release. A script in this release led to subscribers being disconnected from the network. At this time, our team has identified the script that caused the incident. Currently, 30 subscribers are affected by this incident. We are in the process of further identifying the specific subscribers and our DBA group is developing a new script to run in order to restore service.

Second, I’d like to update you on the anticipated time to resolution. Based on the opinion of our Lead DBA, we anticipate the resolution to occur between 10:15 and 10:45 PST.

Summary/Conclusion
To conclude, we share your sense of urgency in resolving this incident and have identified the specific cause of the service interruption. We are in the process of resolving the issue and declaring the incident ‘All Clear’, again estimating the window of resolution to be between 10:15 and 10:45 PST.

My next briefing will be at 10:45 PST. If there are any significant changes prior to that, I will notify you and schedule the briefing for an earlier time.

At this time, is there any additional information I can provide?