Frozen Servers, Melting Trust: The Day IT Went Dark
I previously shared a post called “We all have war stories!” and, today, I’m going to tell a story that has stayed with me for many years and played a pivotal role in shaping the kind of leader I wanted to be.
It was a regular Wednesday in June back in 2009 but, little did I know, I would come to remember it as “Black Wednesday”, a day that would send shockwaves through our entire company, cripple our operations and stay etched in my mind for over 14 years. I was working as an IT Technician for a national chilled and frozen haulage and storage company based in the South West of the UK.
The Decision That Led to The Chaos
I was tasked with a critical update – installing Service Pack 2 onto our Windows Server 2003 machines. These were the days before we used virtualization, so we were dealing with physical servers. I had a decision to make: play it safe and update one server at a time, or take a more daring approach and install the service pack on both domain controllers simultaneously. Known for my innovative thinking (which can be translated as “cocky and naïve”), I chose the latter, thinking it would save time and showcase my efficiency. I think we can all see where this story is heading!
The Crisis Unfolds
Little did I know that hidden within the intricacies of the Windows Server 2003 domain controllers was a quirk that would turn this ordinary day into a crisis that would be referred to as Black Wednesday thereafter! When these servers booted up, they would first seek out another DNS server before fully starting. I did not know this, and across our network each domain controller was set to use the other as its primary DNS server… and I decided to reboot them at the same time. This would turn out to be a huge mistake!
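With hindsight, the trap is easy to describe: each server’s ability to finish booting depended on a DNS server that was itself waiting to boot. Below is a minimal sketch, in Python, of the kind of sanity check that would have flagged the loop before I pressed go. The server names and the idea of expressing the configuration as a simple name-to-primary-DNS mapping are purely illustrative; this is not tooling we had at the time.

```python
def find_dns_cycles(primary_dns: dict[str, str]) -> list[list[str]]:
    """Return chains of servers whose primary DNS settings loop back on themselves."""
    cycles = []
    for start in primary_dns:
        seen = [start]
        current = primary_dns[start]
        # Follow the chain of primary DNS servers until it leaves our map
        # or revisits a server we've already seen.
        while current in primary_dns and current not in seen:
            seen.append(current)
            current = primary_dns[current]
        if current == start and start == min(seen):  # report each loop only once
            cycles.append(seen + [start])
    return cycles


if __name__ == "__main__":
    # Each domain controller points at the other as its primary DNS server -
    # exactly the configuration that deadlocked when both rebooted together.
    config = {"DC01": "DC02", "DC02": "DC01"}
    for cycle in find_dns_cycles(config):
        print("Circular DNS dependency:", " -> ".join(cycle))
```

A check like this, run against the real DNS client settings, would have made it obvious that at least one of the two servers needed to stay up while the other rebooted.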
At 1:00 PM, I initiated the updates on both servers. I had informed our staff across the country about the planned downtime, and the company was used to coming off the systems at 1 PM for updates of various kinds. The old servers had a reputation for slow reboots, so after 30 minutes I wasn’t that worried, but 45 minutes in, concern started to gnaw at me and, after an hour, came the phone call from my IT Director, audibly tense, questioning the prolonged downtime.
Both servers seemed frozen in time, displaying the mysterious message “Applying Registry Settings”, and panic and confusion began to wash over me. Was this normal during a service pack install, or had something gone terribly wrong?
As I watched in horror, the gears of our business operations ground to a catastrophic halt. The once-reliable computer network, which we depended on for almost every aspect of our business, was now paralyzed, leaving the company in a state of utter chaos.
The ramifications were colossal. We were a company that prided itself on seamless logistics, efficiently guiding pallets through an intricate transportation maze. Yet, almost instantaneously, this well-oiled machine fell apart. Trucks queued up at depots with no knowledge of what goods to load onto them. Our vast warehouses were filled with pallets, and we were clueless about whether they needed to be stored in the chiller, deep chill, or freezer. Trucks departed depots with no clear destination in mind, wandering aimlessly. Staff members sat helplessly at their desks across the country, unable to carry out their work. Customers called to say that they weren’t able to book their jobs into our online system. Our reputation for flawlessly maintaining temperature-controlled goods during transit was about to be shattered. To add to this escalating crisis, a procession of anxious company directors converged on me, their expressions a canvas of concern and consternation.
Responding to the Crisis
The very core of our business, which prided itself on precision and efficiency, had been thrown into disarray. The repercussions of my ill-fated decision had sent shockwaves through the entire organization, and the pressure was mounting with each passing moment.
As the enormity of the situation became clear, we set about working out how to recover. We started by splitting the IT Team into four sections:
- Fix the Domain Controllers: The linchpin to regaining control
- Transport Team Triage: A temporary workaround to get trucks moving
- Customer Interface: Restoring our online booking system for minimal customer fallout
- Warehousing Strategy: Getting pallets back into the correct temperature zones
I was part of team 1 and worked to restore the entire network to operational status. With every minute that ticked away, the pressure to resolve the crisis intensified. Hours turned into a nerve-wracking ordeal, but slowly and steadily, we unravelled the complexities of the deadlock. We drove to another depot where there was a working domain controller; brought it back to our head office and powered it on with no NICs attached (which allowed it to boot without DNS); changed its IP address to “masquerade” as one of the faulting domain controllers; then connected it to the network, which allowed one of the faulting domain controllers to see it and finally boot up. In turn, the other domain controller did the same, and our whole network began to come back to life. It took us nearly 8 hours to get the computer network back up and operational, and it’s an 8-hour ordeal I will likely never forget.
Understanding the Haulage Network Model
To better explain the fallout from this incident, I need to briefly describe the sensitive nature of the just-in-time haulage network model. The system is fairly simple:
- Initial Delivery Phase: A truck is dispatched from a local depot to deliver pallets to various destinations.
- Collection Phase: After all the pallets are delivered, the same truck begins collecting pallets scheduled for delivery the next day.
- Return to Depot: The truck returns to its local depot, where the collected pallets are unloaded into the warehouse.
- Inter-Depot Transfer: In the warehouse, pallets are moved onto other trucks destined for the depot that is closest to each pallet’s final destination.
- Final Delivery Phase: At the receiving depots, pallets are sorted onto new trucks that will take them to their final destinations.
Each step is interdependent and time-sensitive, making the entire process vulnerable to delays or inefficiencies at any stage. The model relies on the flawless execution of each step to maintain optimal operations.
This process is incredibly efficient, and it allows the company to take a single pallet from anywhere in the UK and deliver it, next-day, while keeping it at the desired temperature through the whole journey. The problem with this process is that, if it’s ever interrupted, the time it takes to get everything straight again far exceeds the length of the interruption itself, because the backlog compounds day after day (the rough sketch after the list below shows the arithmetic). In the case of this story, the company missed almost all of the collections and deliveries for our customers on the Wednesday. This led to a cascade of challenges:
- Rescheduling to Thursday: On Thursday, we were occupied with completing Wednesday’s missed collections and deliveries, effectively pushing back our usual schedule.
- Resource Limitation: Because we were playing catch-up with Wednesday’s tasks, we didn’t have enough trucks available to handle Thursday’s scheduled collections and deliveries. This led us to postpone Thursday’s tasks to Friday.
- Snowball Effect: The delays had a domino effect, continually pushing tasks to subsequent days. This disrupted the entire operation to a degree that it took many weeks to fully rectify.
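To make the snowball effect concrete, here is a rough back-of-the-envelope model in Python. The figures are invented purely for illustration – they are not the company’s real volumes – but they show the underlying arithmetic: when daily capacity only barely exceeds daily demand, a single lost day of work takes weeks to absorb.

```python
def days_to_clear_backlog(daily_jobs: int, daily_capacity: int,
                          missed_days: int = 1, max_days: int = 365) -> int:
    """How many days it takes to work off a backlog of missed jobs."""
    backlog = daily_jobs * missed_days
    for day in range(1, max_days + 1):
        outstanding = backlog + daily_jobs              # yesterday's leftovers plus today's work
        backlog = max(0, outstanding - daily_capacity)  # whatever we can't move rolls over again
        if backlog == 0:
            return day
    return max_days  # never cleared within the window


if __name__ == "__main__":
    # Illustrative numbers: 1,000 pallet movements a day, with enough trucks
    # for 1,050 on a good day. One missed Wednesday takes 20 working days to absorb.
    print(days_to_clear_backlog(daily_jobs=1000, daily_capacity=1050))
```

Reality was messier than this, of course – trucks, drivers and warehouse space are not interchangeable units – but the shape matches what we lived through: weeks of catch-up for one lost Wednesday.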
The Aftermath: Customer and Financial Fallout
The fallout from missing just one day’s operations was immense, highlighting the operational vulnerability of our highly efficient but interconnected system.
During the incident, we also faced a loss of customers due to various issues:
- Switching to Competitors: Some customers, unable to book jobs with us on the missed day, opted for competitors and chose not to return to us.
- Failure to Maintain Temperature: In other cases, customers were disappointed because we didn’t maintain the proper temperature for their shipments. For example, items that were supposed to be frozen defrosted. This negligence not only led to customer dissatisfaction but also incurred costs for the company, as we had to reimburse these customers.
- Loss of High-Value Shipments: Further exacerbating the situation, we had high-value items in transit that also went out of their required temperature range. The financial repercussions were significant.
The Risk Department’s assessment of the total cost incurred due to this disruption was staggering, highlighting the severe financial and reputational risks associated with such incidents in our highly interdependent system.
This incident became more than a mere crisis; it became a profound lesson. It underscored the critical importance of meticulous planning and the unforeseen consequences of seemingly simple decisions. It also showed the company just how dependent it had become on its computer systems.
In the aftermath of the incident, we implemented a comprehensive strategy for future updates, ensuring that the lessons learned from this event became an integral part of our company’s procedures. I learnt to put proper planning in place to prevent such issues from happening again, and we also learnt the importance of not making decisions alone, instead collaborating with colleagues and listening to their thoughts and concerns.
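For what it’s worth, the core of that strategy boils down to one very small rule: never take both domain controllers (or any pair of interdependent servers) down at once, and confirm each one is healthy before touching the next. The sketch below captures that rule, assuming hypothetical apply_service_pack() and passes_health_check() helpers that stand in for whatever patching and monitoring tools are actually in use – neither is a real API from this story.

```python
import time


def staged_rollout(servers: list[str],
                   apply_service_pack,
                   passes_health_check,
                   wait_seconds: int = 300) -> None:
    """Patch servers strictly one at a time, halting on the first failure."""
    for server in servers:
        print(f"Updating {server}...")
        apply_service_pack(server)           # hypothetical helper: run the update and reboot
        time.sleep(wait_seconds)             # give the server time to come back up
        if not passes_health_check(server):  # hypothetical helper: DNS, AD and service checks
            raise RuntimeError(
                f"{server} failed its post-update checks - rollout halted "
                "before touching the remaining servers."
            )
        print(f"{server} is healthy; moving on.")
```

The point is not the code itself but the constraint it encodes: the rollout cannot proceed past a broken server, so at worst one machine is ever out of action at a time.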
Lessons and Reflections
The day after the incident stood out for its transformative impact on my future leadership philosophy, thanks in part to a visit from our Managing Director. I had spent a restless night, consumed by worries about my job security and the magnitude of the mistake. What unfolded the next day was entirely unexpected and incredibly enlightening.
Rather than chastising us or reacting with anger, the MD calmly inquired about the events, the measures we had taken to correct them, and our plans to prevent future occurrences. His approach was constructive: “We’ll treat this as a learning experience and use it to build a more resilient system,” he said.
This leadership style was a revelation, contrasting sharply with my initial fears. It demonstrated the value of turning setbacks into learning opportunities and the importance of building a resilient system rather than assigning blame. Our IT Director shared a similar mindset, and together, they epitomized the kind of leaders who cultivate trust and confidence within their teams.
They both emphasized the value of learning from errors and focused on how to prevent similar issues in the future. We were then tasked with designing a more robust system and presenting the associated costs for approval, which was promptly given.
This episode served as an invaluable lesson in effective leadership, highlighting the importance of constructive problem-solving over punitive measures. It was a defining moment that shaped my own managerial aspirations, inspiring me to be a leader who sees challenges as opportunities for improvement and growth.
Conclusion
The Black Wednesday incident of 2009 was an eye-opening experience that left an indelible mark on my professional journey. Not only did it reveal the vulnerabilities in our operational systems, but it also exposed the hazards of hubris and the necessity for meticulous planning in IT management. This catastrophe, while a severe setback for the company, turned out to be a masterclass in crisis management and the essence of resilient leadership.
While my initial decision had catastrophic effects, rippling through the very core of our time-sensitive, interdependent operations, the experience provided a deep well of lessons. It highlighted the company’s heavy dependence on its IT infrastructure and showcased the disastrous consequences of even a minor lapse in judgment or procedure.
The aftermath led to essential changes, both technical and managerial. These ranged from implementing comprehensive planning for updates to encouraging collective decision-making. The leadership demonstrated by our Managing Director and IT Director provided a template for constructive crisis management—focusing on solutions rather than blame, and on learning rather than punishment.
The incident shaped my outlook on leadership and drove home the critical importance of teamwork, thoroughness, and contingency planning. Although it was a day of profound professional hardship, it ultimately endowed me with invaluable insights into the type of leader I aspire to be. In embracing its lessons, we strengthened not only our systems but also our corporate culture, laying down a pathway for future resilience and growth.
Black Wednesday was indeed a dark day, but it was also a transformative one, steering me towards a more thoughtful, collaborative, and proactive approach to challenges—qualities that have been indispensable in my ongoing career.