Critical Systems Revisited

December 26, 2016 (Updated March 20, 2025)
Reading Time: 6 minutes, give or take a few thousand milliseconds

A while ago there was an outage of a major airline due to a computer system glitch. Jetstar's computer system went down.[1]

I was pondering how Facebook almost never seems to go down, yet a lot of corporate networks regularly experience outages.

Why?

Why is it that non-critical systems such as Twitter, Facebook, Amazon, eBay, and Google rarely go down while real-time critical systems fail more frequently?

I believe this comes down to two primary factors: Organisational Structures and Company Culture.

Organisational structures define where you sit within the company hierarchy. Usually, the people writing the code are at the lower end of this structure. One of the major drawbacks of such hierarchies is that you could have brilliance working for you, someone like Einstein offering CEO-level insights, but their advice will go unheard simply because of their position and the number of organisational layers between them and the decision makers.

I suspect that engineers working on Jetstar's systems likely knew about major flaws due to aging infrastructure and legacy code. The problem with culture is that it's typically defined by founders and early hires who eventually occupy the top positions on the org chart. Technology choices are often determined by these same people. If leadership makes incorrect technological assessments, they'll likely choose suboptimal solutions. Even if half their engineering team agrees that migrating from platform X to Y would yield substantial improvements with zero downtime, the decision isn't theirs to make, and the company culture doesn't create channels for their reasoning to be heard.

Paul Graham writes about culture and technology stacks in his excellent essay "Great Hackers."[2]

Companies with consistently reliable online services understand this dynamic. They recognise the value of culture and creating feedback mechanisms for engineers.

Perhaps the distance between reality and perception grows in proportion to how far someone is from the bottom of the organisational chart. The more levels in your hierarchy, the more likely executives are to become disconnected from operational priorities.

These thoughts represent my observations from my perspective at that time. I still fly Jetstar regularly, so no hard feelings there.


Nine Years Later: A Reflection

March 20, 2025

Reading my thoughts from nearly a decade ago, I'm struck both by how much has changed in the technology landscape and by how much of my original thinking still rings true.

Back in 2016, I observed the disconnect between decision-makers and technical practitioners. Today, this dynamic continues to play out across industries, though with some important changes. The rise of DevOps, site reliability engineering practices, and the "you build it, you run it" philosophy has helped bridge some of these gaps in forward-thinking companies. Yet many enterprises still struggle with the same fundamental organisational challenges I identified years ago.

What I didn't fully appreciate then was how deeply these companies' cultures reflect their broader philosophical approaches to problem-solving. The hierarchical organisations I criticised weren't just inefficient communication structures; they represented a worldview where specialisation and compartmentalisation were valued over holistic understanding. Operating on a need-to-know basis breaks holistic problem-solving... mostly.

In the years since writing that original piece, I've come to recognise that my frustration wasn't just about technical systems failing. It was about human systems failing to appreciate the interconnected nature of problems. Resilient companies aren't just technically sound; they cultivate environments where understanding the "why" is valued.

This realisation has profoundly shaped my choices and the projects I've been drawn to. I've found myself gravitating toward environments where curiosity is encouraged and where questioning assumptions is viewed as a strength rather than insubordination. I've seen firsthand how organisations that flatten communication hierarchies (not necessarily management hierarchies) tend to build more resilient systems.

What remains consistent in my thinking is the belief that technical problems are rarely just technical problems. They're usually manifestations of a company's organisational challenges. The companies that have the most reliable systems aren't necessarily those with the most advanced technology; they're the ones with cultures that value understanding root causes, that create space for reflection, and that respect the insights of those closest to the systems themselves.

As I look back on my younger self's observations, I see someone who was beginning to connect these dots. I was starting to understand that the reliability of critical systems isn't just a matter of better code or more redundancy; it's about creating organisations where an accurate picture of reality can flow freely up and down the hierarchy, and where perception and reality remain tightly coupled regardless of one's position on an org chart.

In today's world of increasing system complexity and interconnectedness, these insights seem more relevant than ever. The most successful modern startups and companies have found ways to combat the distance between perception and reality, creating feedback loops that keep decision-makers connected to ground truth. And while we've made progress, there's still much work to be done in building truly resilient socio-technical systems.

Perhaps the most important evolution in my thinking is recognising that effective problem-solving requires not just technical expertise but a deep curiosity about why things work the way they do, at both the technical and human levels. The organisations that cultivate this curiosity, that reward asking why instead of just accepting what is, are the ones building the most reliable systems for our future.


[1] This referenced a specific Jetstar outage from 2016.
[2] Graham, Paul. "Great Hackers." http://www.paulgraham.com/gh.html