Poor testing allowed CrowdStrike error to crash millions of computers

Following a widespread IT outage last month, CrowdStrike revealed that a test it designed to detect problems with updates before they are issued had failed, allowing a problematic update to slip out to the public and cause millions of Windows devices to crash.

The File 291 incident — so named for the file at the root of the fiasco — raised questions about the development and testing process at CrowdStrike that allowed a bug to be released widely enough that banks were unable to function normally, airlines had to cancel flights and broadcasters went dark temporarily.

CrowdStrike recently answered some of these questions in a preliminary post-incident review, which revealed that a faulty test was one of the primary failures. The test had been designed to detect errors in a type of content that CrowdStrike issues on a rapid, as-needed basis in response to novel cybersecurity threats.

CrowdStrike has a regular track and fast track for updating cybersecurity threat sensors installed by customers on their Windows, Mac and Linux systems. These updates allow the sensors to detect new cybersecurity threats as CrowdStrike discovers them.

Updates issued via the fast track (CrowdStrike calls them Rapid Response Content) differ in design from updates issued via the regular track. Fast-tracked updates are built from templates that CrowdStrike can easily fill out, and because they are based on templates, they go through far less testing than regular updates.

CrowdStrike calls the suite of tests it runs on fast-tracked updates a Content Validator. Last month, CrowdStrike learned the hard way that the Content Validator had a flaw. The flaw caused the test suite to overlook a problem in an update, which then went out to millions of Windows computers (Microsoft estimates 8.5 million) and crashed them.
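
To make the idea concrete, here is a minimal Python sketch of the kind of check a validator for templated updates might perform. It is not CrowdStrike's Content Validator: the template, the field names and the rules are all hypothetical, chosen only for illustration.

```python
from dataclasses import dataclass


# Hypothetical template: the fields a fast-tracked update must fill in.
# None of these names come from CrowdStrike; they are assumptions.
@dataclass(frozen=True)
class Template:
    name: str
    required_fields: tuple


def validate_update(template, update):
    """Return a list of problems found; an empty list means the update passes."""
    problems = []
    for field in template.required_fields:
        if field not in update:
            problems.append(f"missing field: {field}")
        elif not isinstance(update[field], str) or not update[field]:
            problems.append(f"empty or non-text field: {field}")
    for field in update:
        if field not in template.required_fields:
            problems.append(f"unexpected field: {field}")
    return problems


EXAMPLE_TEMPLATE = Template("example_detection", ("pattern", "action"))
```

A validator is only as good as its checks. If one of them has a gap, a bad update can come back with an empty list of problems and ship anyway, which is broadly the kind of failure the review describes.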

CrowdStrike had trusted its Content Validator and its templated design for fast-tracked updates to provide sufficient protection against a faulty update like the one that ultimately went out. The company said it had trusted the process in part because it had released other templated updates without problems.

CrowdStrike will no longer trust this process alone to catch errors with fast-tracked updates, the company said in its post-incident review. The company promised additional testing processes to catch problems like the one that caused the File 291 incident last month.

Among the new testing CrowdStrike has promised is local developer testing. This type of testing involves deploying an update to developers' computers before it goes out to the broader public. This allows developers to catch any glaring issues (like a "blue screen of death") before an update goes out into the wild. It's a basic measure and standard practice in the software engineering industry.

CrowdStrike also promised better error handling in the software that crashed when running the problematic update. Ideally, this would ensure that even if an error in CrowdStrike's code causes a piece of its threat detection sensor to fail, the rest of the computer can continue to boot up and run as normal.
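
The principle can be sketched in a few lines of Python. This is an illustration only, not CrowdStrike's sensor code; the function names and the shape of the update are hypothetical.

```python
import logging


def interpret_content(content):
    """Hypothetical stand-in for the sensor code that reads an update."""
    fields = content["fields"]
    # Reading past the end of the list raises IndexError on malformed content.
    return fields[content["field_index"]]


def load_update_safely(content):
    """Try to apply an update; on failure, skip it and keep running."""
    try:
        return interpret_content(content)
    except (KeyError, IndexError, TypeError) as exc:
        # Reject the bad update instead of letting the error take down the host.
        logging.warning("update rejected, continuing without it: %s", exc)
        return None
```

In this sketch a malformed update costs the system one detection rule rather than the whole machine.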

The cybersecurity company also said it would start using more advanced testing techniques, such as rollback testing, stress testing, fuzzing and fault injection. These techniques add redundancy on top of the more basic tests CrowdStrike has promised.
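
Fuzzing, for instance, means feeding a program large volumes of random or malformed input to see whether it fails in an uncontrolled way. The Python sketch below shows the general technique under simplifying assumptions; the parser and its data format are invented for illustration and have nothing to do with CrowdStrike's actual file formats.

```python
import random


class ContentError(Exception):
    """Controlled rejection of bad content (hypothetical)."""


def parse_content(data: bytes):
    """Hypothetical parser: first byte is a field count, then one byte per field."""
    if not data:
        raise ContentError("empty content")
    count = data[0]
    if len(data) - 1 < count:
        raise ContentError("fewer fields than declared")
    return list(data[1:1 + count])


def fuzz(parser, rounds=1000, seed=0):
    """Feed random byte strings to the parser; collect inputs that crash it."""
    rng = random.Random(seed)
    crashes = []
    for _ in range(rounds):
        data = bytes(rng.randrange(256) for _ in range(rng.randrange(8)))
        try:
            parser(data)
        except ContentError:
            pass  # a controlled rejection is the desired outcome
        except Exception:
            crashes.append(data)  # anything else is a bug the fuzzer found
    return crashes
```

A parser that checks its input, like the one above, should survive every round. One that reads a field without first checking that it exists would show up in the list of crashes.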

Additionally, CrowdStrike is still developing a root cause analysis, which is likely to reveal more about the high-level thinking at the company that allowed, for example, fast-tracked updates to face such scant testing before getting issued to the public. The company has not provided a timeline of when it will release this analysis.

