THE RIDDLER: TAKING THE GUESSWORK OUT OF SYSTEM TROUBLESHOOTING
System troubleshooting starts at the development level.
A common attitude is: “Who has time for logs?!”
There is a dramatic lack of technical leadership, which shows up as an inadequate level of preparedness, communication, and rehearsal. We are not even talking about prevention: basic structures and processes need to be in place to have any chance of responding to an incident in a timely manner.
Not everyone is required to trace their entire stack; authorizing logging is a business decision. But having no logging at all, or no infrastructure to process logs, shouldn’t be something that a business accepts.
Also, remember that not all technology outages are technology problems. Humans are still involved. Communication is key in high-risk, complex work. To have clear communications that work, especially during incident response, you need to cultivate the appropriate organizational culture.
Organizational culture is generally understood as all of a company’s beliefs, values and attitudes, and how these influence the behavior of its employees.
At BitWise MnM, we:
Believe in structures, processes and a continuous learning culture
Uphold a primary attitude of rational problem-solving
1. DEFINE STRUCTURES AND PROCESSES
Common / anticipated problems
Methodology for solving
Roles and tasks
Communication channels and protocols
Budget for creating and maintaining logs
SOPs for troubleshooting and hypothesis creation and testing
Configure system to pay attention to exceptions
Identify an incident management group of technical and product SMEs
Incident response policy
Designate a team lead
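As one illustration of "configure system to pay attention to exceptions", here is a minimal Python sketch that routes uncaught exceptions into the application log. The function name install_exception_logging and the log message are hypothetical choices for this example; only the standard library's logging module and sys.excepthook are used.

```python
import logging
import sys


def install_exception_logging(logger):
    """Send uncaught exceptions to `logger` instead of only to stderr.

    Hypothetical helper for illustration; returns the installed hook.
    """
    def handle(exc_type, exc_value, exc_traceback):
        # Let Ctrl-C behave normally rather than recording it as an incident.
        if issubclass(exc_type, KeyboardInterrupt):
            sys.__excepthook__(exc_type, exc_value, exc_traceback)
            return
        # exc_info attaches the full traceback to the log record.
        logger.error(
            "Uncaught exception",
            exc_info=(exc_type, exc_value, exc_traceback),
        )

    sys.excepthook = handle
    return handle
```

Calling install_exception_logging(logging.getLogger("app")) once at startup is enough in this sketch; where the log records end up is left to whatever handlers the team has budgeted for.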
2. MONITOR & COMMUNICATE
Logs from frontend to backend
Rotate on-call engineers
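One way to make logs traceable from frontend to backend is to stamp every log line for a given user action with the same request ID. The sketch below assumes JSON-formatted log lines and a field called request_id; both are illustrative choices, not a prescribed format, and the names JsonFormatter, new_request_id and log_event are hypothetical.

```python
import json
import logging
import uuid


class JsonFormatter(logging.Formatter):
    """Render each record as one JSON object so logs are easy to search."""

    def format(self, record):
        return json.dumps({
            "level": record.levelname,
            "layer": record.name,  # e.g. "frontend", "api", "database"
            "request_id": getattr(record, "request_id", None),
            "message": record.getMessage(),
        })


def new_request_id():
    """Create an ID at the edge of the system and pass it to every layer."""
    return uuid.uuid4().hex


def log_event(logger, request_id, message):
    """Log a message tagged with the shared request ID."""
    logger.info(message, extra={"request_id": request_id})
```

With a formatter like this attached to each layer's handler, filtering the collected logs on a single request_id shows the path of one action through the whole stack.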
3. REHEARSE RATIONAL PROBLEM-SOLVING
Practice incident response scenarios
Ascertain what is going to happen next
Determine what the core system is and how to keep it running
Apply root cause analysis as a core competency
It is important to remember that in a crisis, senior executives just want the problem resolved. They want to get back to operation as fast as possible because outages cost money and reputation. Root cause analysis and solving the actual problem are also important, but less so. The real problem is that after the crisis is over, fixing the root cause often drops in priority.
Another important angle is having a protocol for emergencies. There is a real tension between applying the agreed-upon protocols (assuming you have any) and reaching for heuristics. Heuristics are obviously faster, but carry a huge risk of missing something or skipping a crucial step.
Perhaps the biggest risk factor of all, and one that everyone agreed on, is relying on one main developer. Everyone understands that this comes at a cost. We’re all human, and as such will burn out sooner rather than later. And when that happens, everything falls apart like a house of cards: fatigue leads to error, pressure leads to misrepresentation of the problem to avoid blame, and the developer’s absence results in brain drain.
If you are adequately prepared, you can follow your SOPs (Standard Operating Procedures) and crisis protocols as rehearsed. You check for what is the same; you check for what is different. You run processes separately and try to isolate the problem. You check the logs.
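The "what is the same, what is different" check can be made mechanical. As a rough sketch, assume the team can capture a snapshot of settings or versions from a known-good environment and from the failing one as plain dictionaries; the function name diff_snapshots and the sample keys are hypothetical.

```python
def diff_snapshots(known_good, failing):
    """Compare two key/value snapshots and group keys by what happened."""
    shared = known_good.keys() & failing.keys()
    return {
        # Identical in both environments: unlikely to be the cause.
        "same": {k: known_good[k] for k in sorted(shared)
                 if known_good[k] == failing[k]},
        # Present in both but with different values: first suspects.
        "changed": {k: (known_good[k], failing[k]) for k in sorted(shared)
                    if known_good[k] != failing[k]},
        # Present only in the known-good snapshot.
        "removed": {k: known_good[k]
                    for k in sorted(known_good.keys() - failing.keys())},
        # Present only in the failing snapshot.
        "added": {k: failing[k]
                  for k in sorted(failing.keys() - known_good.keys())},
    }


if __name__ == "__main__":
    good = {"app_version": "1.4.2", "db_pool_size": 20, "cache": "on"}
    bad = {"app_version": "1.4.3", "db_pool_size": 20, "feature_x": "on"}
    for group, items in diff_snapshots(good, bad).items():
        print(group, items)
```

The output narrows the search rather than naming the culprit: anything under "changed", "removed" or "added" is a hypothesis to test in isolation, not a conclusion.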
But what happens when that doesn’t work?
Humans try to make rational decisions but are ultimately impacted by cognitive heuristics.
Heuristics deliver short-term results but create cognitive biases, which in turn create problems of their own.
Lucky teams have an experienced, “out of the box” thinker they can call on in a complex crisis when protocol fails. These game theory / hacker minds can get creative in crisis situations. However, explaining what they are doing in simple terms to upper management is nearly impossible. This is why they are also the easiest and the first to be blamed.
Impatience will usually lead to a jump to:
Wait for The Riddler to appear. There will be consequences.
ROOT CAUSE ANALYSIS
Also known as a post mortem. Only after the dust settles will the investors, government regulators, and stakeholders want someone or something to blame.
In conclusion, by cultivating a learning culture, you get rid of The Riddler. You win because even post-incident, you build a database of knowledge, better prepared teams, more informed management, better protocols, and elevated communications. That will eventually pay off. Entire products are born in crisis management.
What actions and decisions the team performed
What actions triggered what results
Incident metrics and their analysis
Be sure you’re not guessing what the solutions were in the same way you guessed what the problem was and what caused it.
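Incident metrics are one way to replace guesses with evidence. The sketch below assumes the incident timeline records three timestamps (when the problem started, when it was detected, and when it was resolved) and derives durations from them. The metric names and the function incident_metrics are illustrative assumptions; a team may track different measures.

```python
from datetime import datetime


def incident_metrics(started, detected, resolved):
    """Derive basic durations from three incident timestamps."""
    if not (started <= detected <= resolved):
        raise ValueError(
            "timestamps must be in order: started, detected, resolved"
        )
    return {
        "time_to_detect": detected - started,    # how long it went unnoticed
        "time_to_resolve": resolved - detected,  # how long the response took
        "total_outage": resolved - started,      # what the business felt
    }


if __name__ == "__main__":
    metrics = incident_metrics(
        started=datetime(2024, 1, 1, 10, 0),
        detected=datetime(2024, 1, 1, 10, 12),
        resolved=datetime(2024, 1, 1, 11, 0),
    )
    for name, duration in metrics.items():
        print(name, duration)
```

Tracked across incidents, numbers like these can show whether better logging shortens detection and whether rehearsal shortens resolution.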
SOPs and policies
An incident report
A culture of psychological safety and continuous learning