The Software Monoculture Is Here to Stay

The recent CrowdStrike debacle has reignited an old argument among computer and security practitioners: should organizations do away with their software monocultures?

NOTE: I was recently quoted in a story for NPR’s Marketplace regarding this issue.

For clarity, a software monoculture exists when an organization relies on a small, standardized set of software, service providers, and/or hardware. The most obvious example is the dominance of Microsoft Windows on desktop and laptop computers. Software monocultures extend to security technologies as well, which is why the CrowdStrike outage was so widespread.

Like it or not, the software monoculture is here to stay. Standardized compute environments are preferred because they are easier to monitor, manage, and secure. The recent uproar over monoculture due to the CrowdStrike incident is a distraction. It avoids the real problem: organizations are unprepared for systemic outages and are looking for somebody else to blame.

Marge vs. the Monoculture*

In the early 2000s, my company was conducting a penetration test on a client. One of our scans crashed the customer’s network. After a tense 30 minutes, we got them back online. However, the CIO was enraged and demanded to know why we did this. When I explained that the firewall had a bug that made it crash when scanned, he persisted with his complaints. I reminded the CIO that discovering this kind of flaw is why you conduct penetration tests.

This incident was an opportunity to build resilience into the organization. However, this immature CIO was more interested in who he could blame for the outage rather than how to recover from it. Similarly, every time there is a large outage, social media fills with “thought-leaders” whining about how evil Microsoft is and that we need the government to intervene. The recent CrowdStrike debacle is no different.

Microsoft is not evil. CrowdStrike is not incompetent. Bugs like this are not indicative of some systemic failure. Mistakes happen. The mistake is not as important as how we react to it. Either you view an outage as an opportunity to improve or as an opportunity to blame.

Blaming others for the outage does nothing of value. It merely allows people to feel better about the situation. An outage should be seen as a chance to review response, recovery, and contingency plans. Organizations that had reliable plans breezed through the latest outage. Those that did not struggled to come back online.

More is Worse

Ultimately, monocultures are a net positive. A standardized, uniform, consistent environment is immensely easier to manage, monitor, and secure. This is not a new idea. Standardization has been a driving force in technology since the dawn of civilization. The entire Internet is built on standards. The benefits of a monoculture far outweigh the negatives.

This reminds me of another immature CIO I encountered. The CIO’s security team was struggling to operate their next-generation firewall (NGFW), resulting in numerous outages and security incidents. Consequently, the CIO wanted to purchase a competing NGFW and run both, believing that one could monitor the other. In a moment of brutal honesty, I replied: “You cannot effectively run one firewall; why do you think running two will be better?”

This CIO believed that the firewall (or monoculture) was the problem. He also believed that adding more technologies to the environment would compensate for this perceived weakness. Of course, the problem was him (and his team). They were blaming the technology for their own inexperience and ignorance. Unsurprisingly, the new firewall they installed caused additional problems and more outages.

Single Point of Fail

This CIO was consumed with preventing a “single point of failure.” The single point of failure issue is often applied to Microsoft Windows, since a single flaw in Windows can lead to systemic outages. There is truth to this. However, it is not a justification for adding complexity to the environment. Making an environment more complex with a diverse set of technologies merely to avoid a possible single point of failure only creates many more points of failure. At least with a single point of failure, you can identify, remediate, and recover more quickly.

When redundancy is necessary, it must extend to all dimensions of the environment. This is why containerization and cloud technologies are ideal for resilience. They have redundancy integrated into the platforms.

It does not make sense to spend millions building redundancy into a cloud architecture only to entrust its successful operation to a single overworked IT person or a single piece of security software (like CrowdStrike). Extending redundancy to every dimension of the environment is an immensely expensive proposition, which makes it unreasonable for all but the largest organizations.

Every organization has single points of failure. They are unavoidable. It is useful to know where they are, but it is not always useful to mitigate them. Rather than implement complex redundant systems, have a robust set of contingency plans to rapidly recover in the event of an outage.

Overcoming Monoculture Anxiety

The CrowdStrike incident added a lot of stress and anxiety to already overworked IT teams. It is natural to seek out ways to prevent the next incident. However, the answer is not necessarily to deploy more technology. CrowdStrike is an effective security control; it is effective far more often than it crashes.

A more reasoned response to this (or any other outage) would be:

  • Review your system backup and recovery processes. You should be able to restore any system, anywhere in your network, to a previous state at a moment’s notice.
  • Consider technologies that provide rapid recovery. Microsoft has many of these embedded in the operating system. There are plenty of third-party tools as well.
  • Have a contingency plan for affected workers. One suggestion is to quickly spin up cloud workstations in AWS or Azure that employees can use to continue working (a sketch of this appears after the list).
  • Have a communications plan. When systems are offline, employees, customers, and partners need to know what is going on. Have a way to contact everybody with a unified message. This message should come from senior leadership (like the CEO).
  • Perform annual “tabletop” exercises with your teams on how they would respond to an outage. This prepares people to handle the situation.
  • For mission-critical systems, migrate them to containerized platforms that can automatically reset to a known good state (a second sketch follows below). For security, consider moving target defense technologies.
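
If cloud workstations are part of your contingency plan, the provisioning can be scripted and tested long before an outage hits. Below is a minimal sketch, assuming an AWS environment and the boto3 SDK, that launches a batch of temporary workstations from a pre-built “golden image.” The AMI ID, instance type, subnet, key pair, and region are placeholders you would replace with your own values; Azure offers equivalent APIs if that is your platform.

```python
# Hypothetical sketch: pre-scripted contingency workstations on AWS.
# The AMI, instance type, subnet, and key pair below are placeholders --
# swap in a hardened golden image and values from your own account.
import boto3

ec2 = boto3.client("ec2", region_name="us-east-1")  # placeholder region

def launch_contingency_workstations(count: int) -> list[str]:
    """Launch a batch of temporary workstations from a known-good image."""
    response = ec2.run_instances(
        ImageId="ami-0123456789abcdef0",       # placeholder golden-image AMI
        InstanceType="t3.medium",              # placeholder size
        MinCount=count,
        MaxCount=count,
        SubnetId="subnet-0123456789abcdef0",   # placeholder subnet
        KeyName="contingency-keypair",         # placeholder key pair
        TagSpecifications=[{
            "ResourceType": "instance",
            "Tags": [{"Key": "Purpose", "Value": "outage-contingency"}],
        }],
    )
    return [i["InstanceId"] for i in response["Instances"]]

if __name__ == "__main__":
    ids = launch_contingency_workstations(count=5)
    print(f"Launched contingency workstations: {ids}")
```

The value is not in the script itself but in having it written, tested, and ready; the annual tabletop exercise is a good time to run it.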

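As for containers resetting to a known good state: the property comes from running workloads from an immutable image, so a failed instance can simply be discarded and recreated. The sketch below is only an illustration, assuming the Docker SDK for Python and a hypothetical pinned image name; it starts a service with a read-only root filesystem and an always-restart policy so that any crash brings back a clean copy of the image.

```python
# Illustrative sketch using the Docker SDK for Python (docker-py).
# The image name and port mapping are placeholders for your own workload.
import docker

client = docker.from_env()

container = client.containers.run(
    image="registry.example.com/critical-service:1.4.2",  # pinned, known-good image
    name="critical-service",
    detach=True,
    read_only=True,                     # root filesystem cannot be modified at runtime
    restart_policy={"Name": "always"},  # a crashed container restarts from the image
    tmpfs={"/tmp": "size=64m"},         # scratch space that is wiped on every restart
    ports={"8443/tcp": 8443},
)
print(f"Started {container.name} from a known-good image")
```

Orchestration platforms such as Kubernetes apply the same principle at scale, automatically recreating failed containers across a cluster.
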
Conclusion

Outages are inevitable. No amount of technology, people, or processes can overcome this. Rather than complain about Microsoft’s dominance, work on ensuring that when those Microsoft systems go down, they can be recovered and reset quickly. Microsoft already has integrated functions in Windows to support this. Moreover, numerous third-party companies provide rapid recovery software.

This most recent outage demonstrated clearly which organizations had dependable contingency plans. Those that did were up and running in a few hours. Those that did not spent time blaming others rather than fixing their problems.

The monoculture is here to stay. How we react to it can change.

* This is a reference to The Simpsons episode “Marge vs. the Monorail.”
