From the Cockpit to the IDE: Investigating Software Incidents

Raphael De Lio
11 min read · Aug 17, 2023


Twitter | LinkedIn | YouTube | Instagram

Civil aviation did not become the safest mode of transportation in the world by accident. Aeronautical technology has evolved dramatically over the past few decades, but have you ever stopped to think about what else makes this mode of transportation the safest in the world?

One of the reasons is the way air incident investigations are conducted by different entities worldwide. And although each country has its own organization responsible for these investigations, they are all guided globally by Annex 13 to the ICAO Chicago Convention.

If you watched the movie Sully, starring Tom Hanks, which depicts the story around the landing of US Airways Flight 1549 on the Hudson River, you might remember the acronym NTSB.

The NTSB (National Transportation Safety Board) is the entity responsible in the United States for investigating air accidents and incidents. In fact, it’s a shame that it was portrayed in the film as the villain because, as we will see below, it actually ensures that aviation continues to become safer over time.

And can you see how this relates to the world of software development?

Incidents in the programming world are not news to anyone. How many developers have had to spend hours awake at night resolving an incident? Many. But few stop to think about how to prevent those incidents from recurring. And that’s where aviation offers a light that has been shining for years and that can help us avoid more sleepless nights as software developers.

Don’t aim at blame. Aim at cause.

To start with, paragraph 3.1 of Chapter 3 of Annex 13 to the Chicago Convention states that “the sole objective of the investigation of an accident shall be the prevention of future accidents” and that “the purpose of this activity is not to determine blame or liability.”

This is because the main purpose of investigating an air accident is not to identify the guilty parties or to impose any punishment but rather to identify the contributing factors that led to the accident and to issue what are called recommendations so that responsible parties can become aware of their deficiencies and make the necessary adjustments to prevent future accidents.

It was him who messed up!

Depending on the culture of your company, when an incident occurs, the first thing some people may do is point fingers. For example, they search the git history for the last person who changed a certain part of the code, or they associate the incident with some functionality developed by a colleague, with the intention of assigning blame, humiliating, or firing them.

But have you ever stopped to think that an incident is usually not caused by a single factor, but by multiple ones?

In 1990, James Reason developed the Swiss cheese theory, which helps us understand why failures occur and shows us how accidents happen due to a succession of events.

According to James, each safety barrier of a given process is represented by a slice of cheese. The more slices there are, the more safety there will be.

The problem is that these slices (barriers) are imperfect, meaning they have holes, and these holes represent deficiencies in the system.

In an ideal world, these barriers would be solid and would not allow penetration for possible accident trajectories. In the real world, however, each barrier has weaknesses and holes.

How these holes are created points to the risks, hazards, and failures in the operation. And if a hazard manages to pass through the holes in all of the slices, an accident occurs.
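To make the model concrete, here is a minimal sketch in Kotlin. The barrier names and hole probabilities are purely illustrative, not taken from Reason’s work: each slice stops a hazard unless it happens to slip through one of its holes, and an accident only happens when a hazard gets through every slice.

```kotlin
import kotlin.random.Random

// A safety barrier ("slice of cheese"): it stops a hazard unless the hazard
// happens to line up with one of its holes, modeled here as a probability.
data class Barrier(val name: String, val holeProbability: Double) {
    fun letsThrough(rng: Random): Boolean = rng.nextDouble() < holeProbability
}

// An accident happens only when a hazard passes through the holes of ALL barriers.
fun hazardBecomesAccident(barriers: List<Barrier>, rng: Random = Random.Default): Boolean =
    barriers.all { it.letsThrough(rng) }

fun main() {
    // Illustrative barriers for a software change; names and numbers are made up.
    val barriers = listOf(
        Barrier("Code review", holeProbability = 0.2),
        Barrier("Automated tests", holeProbability = 0.1),
        Barrier("Canary deployment", holeProbability = 0.15),
        Barrier("Monitoring and alerting", holeProbability = 0.3)
    )

    val trials = 100_000
    val accidents = (1..trials).count { hazardBecomesAccident(barriers) }
    println("Hazards that became accidents: $accidents out of $trials")
}
```

Remove one slice, say the automated tests, and the fraction of hazards that make it all the way through jumps by an order of magnitude, which is exactly the argument the model makes for layered defenses.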

The theory separates the holes into active failures and latent conditions.

Active Failures

According to the theory, active failures are the unsafe acts committed by people, in other words, human error. Understanding the types of human error is crucial for identifying these weaknesses and mitigating their effects.

Let’s introduce three common types of human error: negligence, imprudence, and inexperience.

Negligence: A failure to exercise proper care, resulting in an error. For example, a developer not testing a new feature adequately, leading to software errors.

Imprudence: Consciously taking risks, even when aware of potential consequences. An example is a developer intentionally implementing insecure code due to tight deadlines.

Inexperience: Making mistakes due to lack of knowledge or skill. An example is a developer unfamiliar with a programming language, making coding errors that cause system failure.

These unsafe acts can have a direct impact on the safety system, and because of their adverse effects on the operation, James Reason characterized them as active failures.

But, Rapha, if a developer committed a violation or acted imprudently, shouldn’t they be punished?

That decision is not the investigator’s role. For the investigator, the important thing is to understand what led the developer to make these errors or commit these violations and, most importantly, how they could have been avoided. And that’s where latent conditions come in.

Latent Conditions

Latent conditions are hidden weaknesses that reside within a system or organization, often going unnoticed until they interact with other factors and contribute to an active failure. They arise due to factors such as inadequate resources, poor management, or faulty designs.

Inadequate Resources: Lack of funding, insufficient personnel, or outdated technology can lead to systemic vulnerabilities. For instance, a company without a budget for proper cybersecurity measures may be more susceptible to data breaches.

Poor Management: Leadership and organizational culture play a significant role in system safety. Ineffective leadership or a culture that prioritizes speed over safety may lead to latent conditions, such as the acceptance of unsafe practices.

Faulty Designs: Flawed system designs, even if unintentional, create latent conditions. For example, a software architecture that fails to consider potential security risks could make a system vulnerable to hacking.

These underlying issues may be present for a long time without causing harm, but they can be exposed when triggered by specific events or circumstances. Addressing latent conditions is crucial to preventing active failures and improving system safety.

Lightning never strikes the same place twice

Have you ever noticed that aviation accidents rarely happen for the same reasons?

That’s because each accident is independently investigated by the responsible entities, who focus on understanding the cause of an accident and not on finding someone to blame.

Simply labeling the worker’s behavior as an unsafe act won’t prevent these errors or violations from happening again in the future. We need to go beyond that and understand why that developer acted that way and change our operations to prevent the same error from happening again.

Mitigate Errors

Finally, by understanding why an incident occurred, we should generate recommendations and implement measures to prevent similar incidents from happening again in the future.

There are several actions that are probably already implemented in your company, such as code review, automated tests, CI and CD pipelines, monitoring, and logging, among many others.

But whenever an incident occurs, we should not only focus on what happened, but we should especially focus on why it happened in detail and what we can do to prevent it in the future.

Remember that accidents occur due to a series of factors, all of which should be identified.

Flight US1549, for example, was forced to ditch after ingesting large birds into each engine, which resulted in an almost total loss of thrust in both engines.

However, the NTSB made 34 recommendations in its final report, including:

That engines be tested for resistance to bird strikes at low speeds;

Development of checklists for dual-engine failures at low altitude, and changes to checklist design in general “to minimize the risk of flight crewmembers becoming stuck in an inappropriate checklist or portion of a checklist”;

Improved pilot training for water landings;

Provision of life vests on all flights regardless of route, and changes to the locations of vests and other emergency equipment;

Research into improved wildlife management, and into technical innovations on aircraft, to reduce bird strikes;

Research into possible changes in passenger brace positions;

Research into “methods of overcoming passengers’ inattention” during preflight safety briefings.

It’s a long list, but the NTSB’s comprehensive approach illustrates the aviation industry’s commitment to learning from every incident and accident, no matter how rare or unusual it may be. Even though the outcome of flight US1549 was largely positive, given that there were no fatalities, the investigation provided valuable lessons for the aviation industry, aiming to prevent future incidents and improve the overall safety of flights.

Every Incident Could Have Been an Accident

In the context of aviation, the difference between incidents and accidents refers to the severity of the event and the resulting consequences.

Incidents are events that occur during the operation of an aircraft that could adversely affect safety but do not result in serious injury, death, or substantial damage to the aircraft. Incidents can indicate underlying problems that must be resolved to prevent future accidents.

Accidents are more severe events that occur with an aircraft and result in serious injury or death to people or substantial damage to the aircraft. Accidents are thoroughly investigated to determine their causes and prevent similar future occurrences.

For example, when two planes almost collide but their collision avoidance systems, working from the aircraft’s transponder signals, instruct the crews to change course in time, that is an incident. And if this incident is not investigated and mitigated, the chances of two planes actually colliding due to a transponder failure increase, and that could effectively cause an accident.

Every incident could be an accident, so it must be investigated.

In programming, just like in aviation, it is crucial to treat incidents seriously and investigate them to prevent future accidents. An incident, no matter how small, can be a sign of a deeper system failure or security vulnerabilities that, if not fixed, can evolve into serious accidents.

For example, a slight system slowdown can indicate performance problems leading to more significant service failures. If not investigated and resolved, this slowdown can eventually become an accident, causing a prolonged service interruption and negatively affecting users and businesses.
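As a hedged illustration of taking slowdowns seriously (the operation name and the 200 ms budget below are invented for this example; real budgets should come from your own service-level objectives), a small guard like this makes a slowdown visible and recorded, so it can be investigated as an incident instead of degrading silently:

```kotlin
import java.util.logging.Logger

private val logger = Logger.getLogger("LatencyGuard")

// Hypothetical latency budget for the example; use your own SLOs in practice.
const val LATENCY_BUDGET_MS = 200L

// Runs an operation, measures how long it took, and logs a warning when the
// budget is exceeded so the slowdown leaves a trace that can be investigated.
fun <T> withLatencyGuard(operationName: String, block: () -> T): T {
    val start = System.nanoTime()
    try {
        return block()
    } finally {
        val elapsedMs = (System.nanoTime() - start) / 1_000_000
        if (elapsedMs > LATENCY_BUDGET_MS) {
            logger.warning(
                "$operationName took ${elapsedMs}ms (budget ${LATENCY_BUDGET_MS}ms); investigate before it becomes an outage"
            )
        }
    }
}

fun main() {
    // Simulated slow operation, standing in for a real call such as a database query.
    withLatencyGuard("loadUserProfile") {
        Thread.sleep(250)
    }
}
```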

Similarly, non-critical system failures may seem harmless at first glance but may indicate code or configuration problems that can lead to more serious failures. Investigating these incidents and fixing their underlying causes is crucial to prevent accidents.

Security alerts, too, can be considered incidents, and they play an important role in preventing security accidents. If these alerts are not treated with the necessary seriousness and investigated down to the root cause, they can lead to successful security attacks, compromising user security and privacy.

Therefore, just like in aviation, it is important that software development teams treat incidents seriously, investigate their causes, and implement corrective measures to improve system reliability and security. Remember that every incident could be an accident, and preventing accidents is always better than remedying their consequences.

In summary, the investigation and mitigation of incidents are fundamental steps to ensure the reliability and security of software systems. By treating incidents with the necessary seriousness, development teams can prevent accidents, protect users and businesses, and ensure the continuous and effective operation of systems.

Getting on board

Investigating incidents is a fundamental process in software development, and it should be approached with a clear objective of understanding the root cause and preventing future incidents rather than assigning blame.

Defining a protocol for investigating incidents is important for companies engaged in software development. A well-established protocol ensures consistency in the approach, promotes a structured investigation, and enables efficient identification of root causes. It also emphasizes the company’s commitment to understanding the reasons behind incidents rather than assigning blame.

Here is a draft to help you get started with creating your own incident investigation protocol. It details the steps and explains their significance in the context of understanding the cause instead of assigning blame (a small code sketch of an incident record follows the list):

Gather information: The first step involves collecting data related to the incident, such as logs, user reports, and system metrics. This is crucial for establishing a factual base for the investigation, eliminating guesswork, and identifying patterns and anomalies that might point to the root cause.

Analyze the data: After gathering information, the next step is to analyze it, identifying correlations and potential causes of the incident. This step is vital for narrowing down the root causes of the incident, eliminating unlikely possibilities, and focusing on the most probable reasons for the incident’s occurrence.

Recreate the incident: Once you have an idea of the potential root causes, try to recreate the incident in a controlled environment. This step is essential for validating the hypotheses formed during the analysis phase, ensuring that the identified causes are indeed responsible for the incident.

Develop a solution: Based on the results of the recreation phase, develop a solution that addresses the root causes and mitigates the risks of similar incidents in the future. This step is critical for implementing preventive measures and demonstrating the organization’s commitment to continuous improvement.

Implement and monitor: Deploy the solution and monitor its effectiveness over time. This step is essential for validating that the implemented measures are working as intended and for identifying any unforeseen consequences that may arise.

Document and share the findings: Finally, document the incident, the investigation process, and the implemented solution, and share this information with relevant stakeholders. This step is crucial for fostering a culture of transparency and learning within the organization, ensuring that everyone is aware of the incident’s causes and the measures taken to prevent similar occurrences in the future.
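To make the protocol easier to adopt, here is a minimal sketch of what an incident record could look like in code. The stages, fields, and example values are suggestions rather than a standard; the point is that the record captures facts, contributing factors, and recommendations, and deliberately has no field for who is to blame.

```kotlin
import java.time.Instant

// Suggested stages mirroring the protocol above; adapt them to your own process.
enum class InvestigationStage {
    GATHERING_INFORMATION,
    ANALYZING_DATA,
    RECREATING_INCIDENT,
    DEVELOPING_SOLUTION,
    IMPLEMENTING_AND_MONITORING,
    DOCUMENTED_AND_SHARED
}

// A blameless incident record: evidence, contributing factors, and recommendations,
// with no "culprit" field by design.
data class IncidentRecord(
    val title: String,
    val detectedAt: Instant,
    val stage: InvestigationStage,
    val evidence: List<String>,            // logs, user reports, system metrics
    val contributingFactors: List<String>, // active failures and latent conditions
    val recommendations: List<String>      // preventive measures, like the NTSB's recommendations
)

fun main() {
    // Hypothetical incident used only to show how the record reads.
    val record = IncidentRecord(
        title = "Checkout latency spike",
        detectedAt = Instant.parse("2023-08-17T10:15:00Z"),
        stage = InvestigationStage.ANALYZING_DATA,
        evidence = listOf("p99 latency dashboard", "error logs from the payment service"),
        contributingFactors = listOf(
            "Missing index on a new query (latent condition)",
            "Load test skipped under deadline pressure (active failure)"
        ),
        recommendations = listOf("Add a load-test stage to the release pipeline")
    )
    println(record)
}
```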

Throughout this process, it is essential to maintain a focus on understanding the root cause of the incident and developing effective preventive measures rather than assigning blame.

By doing so, organizations can foster a culture of continuous learning and improvement, where team members are encouraged to share their experiences and insights without fear of reprisal.

Before diving headfirst into the world of programming, I graduated with a degree in civil aviation from Anhembi Morumbi University in São Paulo. My initial plan was to become an airline pilot, but along the way, I decided to change direction and follow my childhood dream of becoming a programmer. Many of the lessons from my aviation degree have their place in the programming world as well. This is one of them.

My first instruction flight, in 2013. Ten years before presenting at the Dutch Kotlin MeetUp in 2023.

I hope you all enjoyed this. Until next time!

Contribute

Writing takes time and effort. I love writing and sharing knowledge, but I also have bills to pay. If you like my work, please, consider donating through Buy Me a Coffee: https://www.buymeacoffee.com/RaphaelDeLio

Or by sending me BitCoin: 1HjG7pmghg3Z8RATH4aiUWr156BGafJ6Zw

Follow Me on Social Media

Stay connected and dive deeper into the world of tech with me! Follow my journey across all major social platforms for exclusive content, tips, and discussions.

Twitter | LinkedIn | YouTube | Instagram
