Tracking the blackout bug
A number of factors and failings came together to make the August 14th northeastern blackout the worst outage in North American history. One of them was buried in a massive piece of software compiled from four million lines of C code and running on an energy management computer in Ohio.
To nobody's surprise, the final report on the blackout released by a U.S.-Canadian task force Monday puts most of the blame for the outage on Ohio-based FirstEnergy Corp., faulting poor communications, inadequate training, and the company's failure to trim back trees encroaching on high-voltage power lines. But over a dozen of the task force's 46 recommendations for preventing future outages across North America are focused squarely on cyberspace.
That may have something to do with the timing of the blackout, which came three days after the relentless Blaster worm began wreaking havoc around the Internet -- a coincidence that prompted speculation at the time that the worm, or the traffic it was generating in its efforts to spread, might have triggered or exacerbated the event. When U.S. and Canadian authorities assembled their investigative teams, they included a computer security contingent tasked with looking specifically at any cybersecurity angle on the outage.
In the end, it turned out that a computer snafu actually played a significant role in the cascading blackout -- though it had nothing to do with viruses or cyber terrorists. A silent failure of the alarm function in FirstEnergy's computerized Energy Management System (EMS) is listed in the final report as one of the direct causes of a blackout that eventually cut off electricity to 50 million people in eight states and Canada.
The alarm system failed at the worst possible time: in the early afternoon of August 14th, at the critical moment of the blackout's earliest events. The glitch kept FirstEnergy's control room operators in the dark while three of the company's high voltage lines sagged into unkempt trees and "tripped" off. Because the computerized alarm failed silently, control room operators didn't know they were relying on outdated information; trusting their systems, they even discounted phone calls warning them about worsening conditions on their grid, according to the blackout report.
"Without a functioning alarm system, the [FirstEnergy] control area operators failed to detect the tripping of electrical facilities essential to maintain the security of their control area," reads the report. "Unaware of the loss of alarms and a limited EMS, they made no alternate arrangements to monitor the system."
With the FirstEnergy control room blind to events, operators failed to take actions that could have prevented the blackout from cascading out of control.
In the aftermath, investigators quickly zeroed in on the Ohio line-tripping as a root cause. But the reason for the alarm failure remained a mystery. Solving that mystery fell squarely on the corporate shoulders of GE Energy, makers of the XA/21 EMS in use at FirstEnergy's control center. According to interviews, half a dozen workers at GE Energy began working feverishly with the utility and with energy consultants from KEMA Inc. to figure out what went wrong.
The XA/21 isn't based on Windows, so it couldn't have been infected by Blaster, but the company didn't immediately rule out the possibility that the worm somehow played a role in the alarm failure. "In the initial stages, nobody really knew what the root cause was," says Mike Unum, manager of commercial solutions at GE Energy. "We spent a considerable amount of time analyzing that, trying to understand if it was a software problem, or if -- like some had speculated -- something different had happened."
Sometimes working late into the night and the early hours of the morning, the team pored over the approximately one million lines of code that make up the XA/21's Alarm and Event Processing Routine, written in the C and C++ programming languages. Eventually they were able to reproduce the Ohio alarm crash in GE Energy's Florida laboratory, says Unum. "It took us a considerable amount of time to go in and reconstruct the events." In the end, they had to slow down the system, injecting deliberate delays in the code while feeding alarm inputs to the program. About eight weeks after the blackout, the bug was unmasked as a particularly subtle incarnation of a common programming error called a "race condition," triggered on August 14th by a perfect storm of events and alarm conditions on the equipment being monitored. The bug had a window of opportunity measured in milliseconds.
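To make the general class of bug concrete, here is a minimal Python sketch. It is an illustration only, not GE Energy's code: the AlarmTable class, its single counter, and the two writer threads are all hypothetical. A race condition arises when two writers update shared data without coordination, and a threading.Barrier stands in here for the deliberate delay, holding each writer between its read and its write so that an interleaving that normally needs millisecond-level bad luck happens on every run. The sketch shows only the simplest symptom, a lost update, followed by the same update guarded by a lock.

```python
import threading


class AlarmTable:
    """Hypothetical shared structure: a count of active alarms."""

    def __init__(self):
        self.active = 0


def unsafe_increment(table, both_have_read):
    # Read-modify-write with no lock. The barrier plays the role of the
    # injected delay: neither writer proceeds to its write until both
    # have finished reading, so both work from the same stale value.
    snapshot = table.active
    both_have_read.wait()
    table.active = snapshot + 1


def safe_increment(table, lock):
    # Same update, but the lock makes read-modify-write one atomic step.
    with lock:
        table.active = table.active + 1


def run_two_writers(target, *args):
    # Start two writers against a fresh table and return the final count.
    table = AlarmTable()
    threads = [
        threading.Thread(target=target, args=(table, *args)) for _ in range(2)
    ]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return table.active


lost = run_two_writers(unsafe_increment, threading.Barrier(2))
kept = run_two_writers(safe_increment, threading.Lock())
print(lost, kept)  # prints "1 2"
```

Run as written, the unguarded version records one alarm instead of two, while the locked version records both.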
"There was a couple of processes that were in contention for a common data structure, and through a software coding error in one of the application processes, they were both able to get write access to a data structure at the same time," says Unum. "And that corruption led to the alarm event application getting into an infinite loop and spinning."
Testing for Flaws
"This fault was so deeply embedded, it took them weeks of poring through millions of lines of code and data to find it," FirstEnergy spokesman Ralph DiNicola said in February.
After the alarm function crashed in FirstEnergy's control center, unprocessed events began to queue up, and within half an hour the EMS server hosting the alarm process folded under the burden, according to the blackout report. A backup server kicked in, but it also failed. By the time FirstEnergy operators figured out what was going on and restarted the necessary systems, hours had passed, and it was too late.
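One common defense against a silent stall of this kind is a watchdog that looks for signs of life from outside the process it monitors. The Python sketch below is a hedged illustration, not a feature of the XA/21 or a recommendation from the report: the AlarmProcessorStatus record, the health_problems function, and the thresholds of 30 seconds and 1,000 events are all assumptions chosen for the example. It flags two symptoms described above, a processor that has stopped reporting progress and a growing backlog of unprocessed events.

```python
from dataclasses import dataclass


@dataclass
class AlarmProcessorStatus:
    """Hypothetical status record published by an alarm processor."""

    last_heartbeat: float  # time (seconds) of the last reported progress
    queue_depth: int       # events still waiting to be processed


def health_problems(status, now, max_silence=30.0, max_backlog=1000):
    """Return a list of problems; an empty list means no warning signs."""
    problems = []
    silence = now - status.last_heartbeat
    if silence > max_silence:
        problems.append(f"no heartbeat for {silence:.0f}s")
    if status.queue_depth > max_backlog:
        problems.append(f"backlog of {status.queue_depth} events")
    return problems


# A processor that reported 10 seconds ago with a short queue.
healthy = health_problems(
    AlarmProcessorStatus(last_heartbeat=100.0, queue_depth=12), now=110.0
)

# A processor that went quiet 300 seconds ago while events piled up.
stalled = health_problems(
    AlarmProcessorStatus(last_heartbeat=100.0, queue_depth=5000), now=400.0
)

print(healthy)
print(stalled)
```

Passing the current time in as an argument, rather than reading the clock inside the function, keeps the check easy to test against recorded scenarios.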
This week's blackout report recommends that the U.S. and Canadian governments require all utilities using the XA/21 to check in with GE Energy to ensure "that appropriate actions have been taken to avert any recurrence of the malfunction." GE Energy says that's a moot point: though the flaw has not manifested itself elsewhere, last fall the company gave its customers a patch against the bug, along with installation instructions and a utility to repair any alarm log data corrupted by the glitch. According to Unum, the company sent the package to every XA/21 customer -- more than 100 utilities around the world -- and offered to help install it, "irrespective of their current support status," he says.
The company did everything it could, says Unum. "We test exhaustively, we test with third parties, and we had in excess of three million online operational hours in which nothing had ever exercised that bug," he says. "I'm not sure that more testing would have revealed that. Unfortunately, that's kind of the nature of software... you may never find the problem. I don't think that's unique to control systems or any particular vendor software."
Tom Kropp, manager of the enterprise information security program at the Electric Power Research Institute, an industry think tank, agrees. He says faulty software may always be a part of the electric grid's DNA. "Code is so complex, that there are always going to be some things that, no matter how hard you test, you're not going to catch," he says. "If we see a system that's behaving abnormally well, we should probably be suspicious, rather than assuming that it's behaving abnormally well."
But Peter Neumann, principal scientist at SRI International and moderator of the Risks Digest, says that the root problem is that makers of critical systems aren't availing themselves of a large body of academic research into how to make software bulletproof.
"We keep having these things happen again and again, and we're not learning from our mistakes," says Neumann. "There are many possible problems that can cause massive failures, but they require a certain discipline in the development of software, and in its operation and administration, that we don't seem to find. ... If you go way back to the AT&T collapse of 1990, that was a little software flaw that propagated across the AT&T network. If you go ten years before that you have the ARPAnet collapse.
"Whether it's a race condition, or a bug in a recovery process as in the AT&T case, there's this idea that you can build things that need to be totally robust without really thinking through the design and implementation and all of the things that might go wrong," Neumann says.
Despite the absence of cyber terrorism in the blackout's genesis, the final report includes 13 recommendations focused squarely on protecting critical power-grid systems from intruders. The computer security prescriptions came after task force investigators discovered that the practices of some of the utility companies involved in the blackout created "potential opportunities for cyber system compromise" of EMS computers.
"Indications of procedural and technical IT management vulnerabilities were observed in some facilities, such as unnecessary software services not denied by default, loosely controlled system access and perimeter control, poor patch and configuration management, and poor system security documentation," reads the report.
Among the recommendations, the task force says cyber security standards established by the North American Electric Reliability Council, the industry group responsible for keeping electricity flowing, should be vigorously enforced. Joe Weiss, a control system cyber security consultant at KEMA, and one of the authors of the NERC standards, says that's a good start. "The NERC cyber security standards are very basic standards," says Weiss. "They provide a minimum basis for due diligence."
But so far, it seems software failure has had more of an effect on the power grid than computer intrusion. Nevertheless, both Weiss and EPRI's Kropp believe that the final report is right to place more emphasis on cybersecurity than software reliability. "You don't try to look for something that's going to occur very, very, very infrequently," says Weiss. "Essentially, a blackout like this was something like that. There are other issues that are higher probability that need to be addressed."