Blay Whitby FBCS explores how air crashes are investigated and discusses how lessons from that industry might be applicable when IT and AI projects experience disastrous outcomes.

Whenever an airliner crashes, a thorough investigation takes place. This investigation aims not to apportion any blame or responsibility, but rather to learn so that similar crashes can be prevented in future. The aviation industry has been doing this for more than 75 years — or more than three quarters of the entire history of aviation. Over the decades, it has made commercial aviation very, very safe. Many other industries — and IT in particular — could learn a lot from this.

It might not always be so obvious, but we have known for some time – or should have known – that bad practice in IT causes suffering and death just as bad practice in aviation does. I remember the London Ambulance Service introducing an IT system in 1992 that failed completely and contributed to 30 or more deaths. More than enough time has elapsed since then for a process of learning from our IT mistakes to have been set in motion so that we do not repeat them. As yet, it has not.

The importance of open investigations

How can I be so definite about that? Well, accidents involving commercial airliners are thoroughly investigated, and the detailed technical conclusions of every investigation are openly published in English for all the world to see. The reason for that is obvious: an accident kept secret is likely to happen again. If one is truly concerned about preventing similar accidents, then completely open publication is the best way forward. I know of no comparable publication database for IT mishaps. Indeed, the level of secrecy common in the IT industry means that one often has to rely on journalistic speculation and whistleblowers to get any technical information about IT disasters. Rarely do we see printouts of defective code, or a detailed technical account of how code that might usually be satisfactory failed under the circumstances that led to a disaster. If we are to learn from our mistakes, we need far more openness. Ensuring that large software systems are open source would help with this, and that is something BCS could encourage.

Readers may be keen to point out important differences between the two industries. For example, one often cannot see the code that caused a problem because software is the intellectual property of the companies that sell it, and is therefore subject to commercial confidentiality. But commercial aviation is also competitive. Airlines compete for passengers and aircraft manufacturers compete to sell their products, yet they accept that their staff, business practices and products may need to be examined to find the cause of an accident, and that the findings will be openly published. It is also vital that the conclusions of an aviation investigation cannot be used as the basis for prosecution. This means that people talk freely to investigators, knowing that they are not getting themselves into trouble, and that might be another important lesson. Making the software of IT systems open source would be far easier under a similar approach.

The lessons to be learned from air crash investigation are far broader and more useful than simply the importance of conducting matters on a no-blame basis. It turns out that some organisational structures and, indeed, some national cultures are far more prone to accidents than others. Other transport sectors and, to some extent, medicine have deliberately imitated aviation accident investigation.

Humans in the loop

Of course, disasters aren't caused only by defective code, any more than they are caused only by defective aircraft. While there remain humans 'in the loop' on the flight deck, their role is often the most important part of an air accident investigation. A crucial observation here is that air accident investigation no longer uses the term 'pilot error'. So why do I still hear it so often in the IT world? Of course, pilots still make errors, but if we are to prevent an error happening again, we need to know why the pilot made that particular error. Great strides have been made in finding out why pilots make errors and in building systems to prevent them recurring.

It would be a great relief to think that the IT industry had made similar strides in tracing the human factors behind errors, but this isn't the case. The most glaring contrast is fatigue. Fatigue has been identified as a frequent and major contributor to human mistakes on the flight deck. Not only are there strict legal rules ensuring adequate rest between flights, but pilots are also taught to recognise their personal symptoms of fatigue and to ground themselves as unfit if they suspect they are suffering from it. Another interesting finding is that light-hearted banter or an excess of jokes heard on the cockpit voice recorder is typical of an over-tired flight crew. How different is this from IT practice, where long hours (and banter) are usual as deadlines approach, sometimes culminating in 'the crunch': pizza is ordered in, the music is turned up loud and employees are expected to work all night? If one set out to produce defective software, this would probably be one of the best ways to achieve it.

The impact of AI

Recent developments in AI add extreme urgency to the need to institute proper investigative practices. AI-induced disasters are an order of magnitude more complex to investigate than previous IT disasters; no longer can we simply look at the code and see the logical errors. However, investigating AI errors is going to be a necessity. We can now see clearly that deep learning and generative transformer systems can and do sometimes get things very wrong.

Already, the 'inscrutable AI excuse' is being served to customers in the financial services industry. That is to say, if you ask why your bank turned you down for a loan, you are told not only that 'computer says no' but also that, because it is AI, it is technically impossible to say why it 'said no'. EU law is currently being enacted that will make this sort of behaviour illegal. The UK may or may not follow, but no one could say that this has ever been an ethical way to treat customers.

Difficult is not impossible

Deep learning technologies are indeed very difficult to analyse or predict. We have to consider trillions of state transitions, which may be untraceable. 'Very difficult' is not the same as 'impossible', however, and in this context we should be sceptical of those who claim that it is. Maybe they have something to hide.

To return to the aviation analogy, there are also massive natural neural nets – human brains – in the system to be investigated. Investigators will ask what information they had at the time, how they were trained to respond to that information, and how they actually responded. These are good ways to tap into the astronomical number of possible configurations of a natural neural net, and they will work equally well with an artificial one.

It's not my intention here to define ways to investigate AI-induced disasters. This is a highly specialised technical task that should be done by technical experts, not lawyers, journalists or politicians. Investigators will learn most of what they need by actually doing it. My claim is that no-blame, freely published technical investigation of IT disasters is long overdue as standard practice in the industry.

About the author

Blay Whitby FBCS is a member of the Advisory Board to the APPG on AI, an ethics expert for the European Commission and chair of the BCS Sussex Branch.