THE RIDDLER: TAKING THE GUESSWORK OUT OF SYSTEM TROUBLESHOOTING

joker cto

 

PREPARATION

System troubleshooting starts at the development level.
A common attitude is: “Who has time for logs?!” 

 

There is a dramatic lack of technical leadership, which shows up as inadequate preparedness, communication, and rehearsal. We’re not even talking about prevention: basic structures and processes need to be in place to have any chance of responding to an incident in a timely manner.

 

Not everyone needs to trace their entire stack; authorizing logging is a business decision. But having no logging at all, or no infrastructure to process logs, is not something a business should accept.
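
As a rough illustration of logs that infrastructure can actually process, here is a minimal sketch using Python’s standard logging module to emit one JSON object per line. The service name "checkout", the field names, and the messages are assumptions made up for this example, not a prescribed format.

```python
import io
import json
import logging


class JsonFormatter(logging.Formatter):
    """Render each log record as a single JSON object on one line."""

    def format(self, record):
        entry = {
            "time": self.formatTime(record),
            "level": record.levelname,
            "logger": record.name,
            "message": record.getMessage(),
        }
        if record.exc_info:
            # Keep the traceback with the record so it is searchable later.
            entry["exception"] = self.formatException(record.exc_info)
        return json.dumps(entry)


def build_logger(name, stream):
    """Return a logger that writes JSON lines to the given stream."""
    logger = logging.getLogger(name)
    logger.setLevel(logging.INFO)
    logger.handlers.clear()
    logger.propagate = False
    handler = logging.StreamHandler(stream)
    handler.setFormatter(JsonFormatter())
    logger.addHandler(handler)
    return logger


# An in-memory buffer stands in for a log file or a log shipper.
buffer = io.StringIO()
log = build_logger("checkout", buffer)  # hypothetical service name

log.info("payment accepted")
try:
    1 / 0
except ZeroDivisionError:
    log.exception("payment failed")  # logged at ERROR level with traceback

# Because every line is valid JSON, downstream tooling can parse it.
records = [json.loads(line) for line in buffer.getvalue().splitlines()]
```

Whatever format you choose, the point is that each line can be parsed by a machine, so the logs are usable during an incident instead of merely existing.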

 

Also, remember that not all technology outages are a technology problem. Humans are still involved. Communication is key in high-risk, complex work. To have clear communication that works, especially during incident response, you need to cultivate the appropriate organizational culture.

 

Organizational culture is generally understood as all of a company’s beliefs, values and attitudes, and how these influence the behavior of its employees.

 

At BitWise MnM, we:

  • Believe in structures, processes and a continuous learning culture

  • Value communication

  • Uphold a primary attitude of rational problem-solving


1. DEFINE STRUCTURES AND PROCESSES

  • Common / anticipated problems

  • Methodology for solving

  • Roles and tasks

  • Succession planning

  • Communication channels and protocols

  • Budget for creating and maintaining logs

  • SOPs for troubleshooting and hypothesis creation and testing

  • Configure system to pay attention to exceptions

  • Identify an incident management group of technical and product SMEs

  • Incident response policy

  • Designate a team lead
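
One way to read “configure system to pay attention to exceptions” is to make sure unhandled errors always end up in the logs. The following is a minimal sketch in Python, assuming a standard-library logger; the logger name "incident-demo" and the helper make_exception_hook are hypothetical.

```python
import io
import logging
import sys


def make_exception_hook(logger):
    """Build a hook that records unhandled exceptions through the logger."""

    def hook(exc_type, exc_value, exc_traceback):
        # Let Ctrl+C behave as usual instead of logging it as a failure.
        if issubclass(exc_type, KeyboardInterrupt):
            sys.__excepthook__(exc_type, exc_value, exc_traceback)
            return
        logger.critical(
            "unhandled exception",
            exc_info=(exc_type, exc_value, exc_traceback),
        )

    return hook


# Demo logger writing to memory; a real service would write to its log sink.
buffer = io.StringIO()
logger = logging.getLogger("incident-demo")  # hypothetical logger name
logger.setLevel(logging.INFO)
logger.handlers.clear()
logger.propagate = False
logger.addHandler(logging.StreamHandler(buffer))

hook = make_exception_hook(logger)
# In a real service you would install it once at startup:
#     sys.excepthook = hook

# Simulate an unhandled error by passing the exception details to the hook.
try:
    raise ValueError("bad input")
except ValueError:
    hook(*sys.exc_info())

output = buffer.getvalue()
```

The demo calls the hook directly so that it does not change how the interpreter reports errors; installing it is a one-line decision for the team that owns the service.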

 

2. MONITOR & COMMUNICATE

  • Logs from front end to back end

  • Data analysis

  • Rotate on-call engineers
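
Rotating on-call engineers works best when the schedule is unambiguous. Here is a minimal sketch of a weekly round-robin rotation in Python; the engineer names, the start date, and the one-week shift length are assumptions for illustration only.

```python
from datetime import date


def on_call_engineer(day, engineers, rotation_start):
    """Return who is on call on `day`, rotating weekly from `rotation_start`."""
    if not engineers:
        raise ValueError("the rotation needs at least one engineer")
    weeks_elapsed = (day - rotation_start).days // 7
    # The modulo wraps around to the first engineer after the last one.
    return engineers[weeks_elapsed % len(engineers)]


ENGINEERS = ["alice", "bob", "carol"]  # hypothetical team
ROTATION_START = date(2024, 1, 1)      # a Monday, chosen arbitrarily

first_week = on_call_engineer(date(2024, 1, 3), ENGINEERS, ROTATION_START)
second_week = on_call_engineer(date(2024, 1, 8), ENGINEERS, ROTATION_START)
fourth_week = on_call_engineer(date(2024, 1, 22), ENGINEERS, ROTATION_START)
```

A fixed, computable rotation also supports succession planning: nobody has to remember whose turn it is, and no single engineer carries the pager indefinitely.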

 

3. REHEARSE RATIONAL PROBLEM-SOLVING

  • Practice incident response scenarios

  • Ascertain what is going to happen next

  • Determine what the core system is and how to keep it running

  • Apply root cause analysis as a core competency



INCIDENT MANAGEMENT

It is important to remember that in a crisis, senior executives just want the problem resolved. They want to get back to operation as fast as possible because outages cost money and reputation. Root cause analysis and solving the actual problem are also important, but less so. The real problem is that after the crisis is over, fixing the root cause often drops in priority.

 

Another important angle is having a protocol for emergencies. There is a real tension between applying the agreed-upon protocols (assuming you have any) and reaching for heuristics. Heuristics are obviously faster, but they carry a huge risk of missing something or skipping a crucial step.

 

Perhaps the biggest risk factor of all, and one that everyone agreed on, is relying on one main developer. Everyone understands that this comes at a cost. We’re all human, and anyone in that position will burn out sooner rather than later. When that happens, everything falls apart like a house of cards: fatigue leads to errors, pressure leads to misrepresenting the problem to avoid blame, and the developer’s absence results in brain drain.

 

If you are adequately prepared, you can follow your Standard Operating Procedures (SOPs) and crisis protocols as rehearsed. You check for what is the same; you check for what is different. You run processes separately and try to isolate the problem. You check the logs.
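
Checking what is the same and what is different can be made mechanical. The sketch below compares a known-good snapshot of settings against the current one; the setting names and values are hypothetical, and a real snapshot might come from configuration files, deployed versions, or environment variables.

```python
def compare_snapshots(known_good, current):
    """Split settings into what stayed the same and what differs."""
    same = {
        key: value
        for key, value in current.items()
        if key in known_good and known_good[key] == value
    }
    changed = {
        key: (known_good[key], current[key])  # (before, after)
        for key in current
        if key in known_good and known_good[key] != current[key]
    }
    added = {key: current[key] for key in current if key not in known_good}
    removed = {key: known_good[key] for key in known_good if key not in current}
    return {"same": same, "changed": changed, "added": added, "removed": removed}


# Hypothetical snapshots taken before and during an incident.
known_good = {
    "app_version": "1.4.2",
    "db_pool_size": 20,
    "feature_x": False,
    "region": "eu-west",
}
current = {
    "app_version": "1.4.3",
    "db_pool_size": 20,
    "feature_x": False,
    "cache_ttl": 30,
}

report = compare_snapshots(known_good, current)
```

Everything under "changed", "added", and "removed" is a candidate to isolate and test separately; everything under "same" can be set aside for now.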

 

But what happens when that doesn’t work?    

 

Humans try to make rational decisions but are ultimately impacted by cognitive heuristics. 

These shortcuts deliver short-term results, but they also create cognitive biases, which in turn create problems of their own.

 

Lucky teams have an experienced, “out of the box” thinker they can call on in a complex crisis when protocol fails. These game theory / hacker minds can get creative in crisis situations. However, explaining what they are doing in simple terms to upper management is nearly impossible, which is why they are also the first and easiest to blame.

 

1. FOLLOW PROTOCOL

  • Impatience will usually lead to a jump to:

2. APPLY HEURISTICS

  • Wait for The Riddler to appear. There will be consequences.

 

ROOT CAUSE ANALYSIS

Once the dust settles, investors, government regulators, and stakeholders will want someone or something to blame. This review is also known as a post mortem.

 

In conclusion, by cultivating a learning culture, you get rid of The Riddler. You win because, even post-incident, you build a database of knowledge, better-prepared teams, more informed management, better protocols, and elevated communications. That will eventually pay off. Entire products are born in crisis management.

 

1. REVIEW

  • What actions the team took and what decisions it made

2. DOCUMENT

  • What actions triggered what results

  • Incident metrics, and analyze them

3. TEST HYPOTHESES

  • Be sure you’re not guessing what the solutions were in the same way you guessed what the problem was and what caused it.

4. UPDATE

  • SOPs and policies

5. CREATE

  • An incident report

6. CULTIVATE

  • A culture of psychological safety and continuous learning
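
Documenting incident metrics can start very small. As a minimal sketch, assuming each incident is recorded with three timestamps, the following Python computes the average time to detect and time to resolve. The field names and the two incident records are invented for illustration; they are not real data.

```python
from datetime import datetime
from statistics import mean

# Hypothetical incident records with ISO 8601 timestamps.
INCIDENTS = [
    {
        "started": "2024-01-05T10:00",
        "detected": "2024-01-05T10:20",
        "resolved": "2024-01-05T12:00",
    },
    {
        "started": "2024-02-11T22:30",
        "detected": "2024-02-11T22:40",
        "resolved": "2024-02-12T00:10",
    },
]


def minutes_between(start, end):
    """Return the number of minutes between two ISO 8601 timestamps."""
    delta = datetime.fromisoformat(end) - datetime.fromisoformat(start)
    return delta.total_seconds() / 60


def summarize(incidents):
    """Average detection and resolution times, measured from the start."""
    to_detect = [minutes_between(i["started"], i["detected"]) for i in incidents]
    to_resolve = [minutes_between(i["started"], i["resolved"]) for i in incidents]
    return {
        "count": len(incidents),
        "mean_minutes_to_detect": mean(to_detect),
        "mean_minutes_to_resolve": mean(to_resolve),
    }


summary = summarize(INCIDENTS)
```

Tracked over time, numbers like these show whether updated SOPs and rehearsals are actually shortening incidents, which replaces one more guess with evidence.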