Incident Management Training for IT Operations

Point in Time Exercise: Resolution of Incident

03:39:25 One of the SMEs suggests disabling incoming e-mail. In the current state, the e-mail is being lost. There is some discussion, then someone says, “Disable it, then we can discuss it.” Another unidentified bridge participant observes that the problem is only for Server 24. Joe says he needs to check Server 28 to make sure there are no problems there.

04:09:14 E-mail test failed for Server 24 and analytics as well. 

04:19:20 Pete reports that the executive has been contacted. Pete reports there are some issues in the Main Street Datacenter (MSDC), which started about 06:20. Pete reports there is another bridge opened by the network group and they are talking about the Server 28 issue. An SME wants to move to the other bridge to look at the database issues. There is confusion among the SMEs as to who is going to handle the database issues.

04:21:43 The Executive, Paul, joins and asks for an update. Pete says, “After the software release, there are a couple of customers that are not able to log in.” Paul chimes in, “This may be a metadata issue.” One of the SMEs confirms. Additional discussion occurs about the script. Paul asks, “How long are we going to investigate this issue?” The discussion continues, but no one answers the question. Paul asks the question a second and third time. One of the SMEs offers a very long-winded discussion on the investigation and, after several minutes, the SME answers that it will be about 30 minutes. Paul directs the group to “not do the investigation and to take quick action to restore service. This will cause a service disruption, but if successful would offer a quick solution.” Many of the SMEs want to do more investigation for 30 more minutes.

04:29:22 Another Executive joins, and someone provides an overview. One of the SMEs wants to rerun the script. Someone asks how long that might take. An unidentified person thinks it will take 30 minutes. The Executive says that if there was a problem with the script, rerunning it won’t solve the problem; the group needs to confirm that the script was correct.

04:46:22 Discussion on the two approaches to take to resolve the issue. 

04:58:41 Pete asks both Executives if they want the group to focus on resolving the issue or continue investigating. No answer from either one. 

05:08:05 An SME says, “It appears that the script did not run correctly and that we should run it again.” Some discussion on whether there is any harm in running the script again other than time. Discussion on the process to run the script. 

05:15:52 Customer Service says there are seven cases now and that the customers are unable to log in. 

05:44:23 The Executive asks for an update. Pete responds that the group is still looking at what caused it. “We are looking at action items to perform next,” he says. 

05:49:45 There is further general discussion on the path forward, but with no consensus. After several minutes, the Executive says, “Let’s move forward.”

06:10:43 Discussion on how long the script will run and the need to monitor the time. 

06:24:06 One of the SMEs asks about the e-mail issues one customer has, and another SME says she knows nothing about that and she needs more information. An SME provides some background info on the e-mail issue. The SME says, “Let’s work on the e-mail issue.” There is some discussion, but Pete is silent during this discussion.

06:29:41 Pete reports that the analytics look good and are all green. 

06:31:05 Customer Service reports they have checked with a customer and they are able to log in. They are checking with more customers. 

06:37:25 Paul asks who is running this bridge, Pete or another SME? Pete offers that he is the leader. 

07:02:26 SME reports the new e-mail is coming in, but the old e-mail in the queue is backed up. SME asks for some help in flushing the queue. Pete asks another SME to take care of it. The Executive wants to make sure everything is documented for the root cause analysis. 

07:13:12 Customer Service says another customer is reporting they are able to log in and do not appear to have any functionality issues. Acme Chemical Sales is only reporting one issue now. SMEs are still investigating. Problem in the chat window. 

07:17:00 Pete says, “Looks like e-mail is working again.” No objections from the group and the callers begin to drop off the bridge.

07:47:14 There is discussion on root cause between Paul and Pete.

07:50:00 Call is terminated.

Prepare a CAN (Conditions, Actions, Needs) report from the information provided.