How Y2K Could Hose Your Network (isen.com)

Intelligence at the Edge #5

HOW Y2K COULD HOSE YOUR NETWORK

Everyone needs a Plan B for Year 2000 network surprises.

By David S. Isenberg

From America's Network, January 1, 1999
http://www.americasnetwork.com/issues/99issues/990101/990101edge.htm

The Titanic’s maiden voyage was on a 99% iceberg-free ocean. So it is not good news when Judy List, head of Bellcore’s Year 2000 program, says that large U.S. companies will be, at best, 92% Y2K-compliant by Jan. 1, 2000. And, she adds, there’s no reason for telcos to be an exception.

List says that 100% of network management systems and 75% of voice networking devices are date-sensitive. If 92% of these systems are fixed by Jan. 1, 2000, then 8% won’t be.
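To see what 8% can mean in practice, here is a rough back-of-the-envelope sketch. The 92% figure is List's; the inventory sizes are hypothetical, and treating each system as fixed independently of the others is a simplifying assumption of this sketch, not something List claims.

```python
# Hypothetical illustration: List's best-case 92% figure applied to
# made-up inventory sizes. Nothing here is measured data.
COMPLIANCE_RATE = 0.92  # best-case share of systems fixed, per List


def expected_unfixed(n_systems: int, rate: float = COMPLIANCE_RATE) -> float:
    """Expected number of systems still unfixed out of n_systems."""
    return n_systems * (1.0 - rate)


def prob_all_fixed(n_systems: int, rate: float = COMPLIANCE_RATE) -> float:
    """Chance that every system is fixed, ASSUMING each system is fixed
    independently with probability `rate` (a simplification)."""
    return rate ** n_systems


if __name__ == "__main__":
    for n in (10, 100, 1000):  # hypothetical inventory sizes
        print(n, round(expected_unfixed(n), 1), f"{prob_all_fixed(n):.2e}")
```

Under these assumptions, even a small collection of date-sensitive systems is more likely than not to contain at least one that was missed, which is the point of the Titanic comparison.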

Not that some telcos aren’t taking Y2K very seriously — they are. They’re attacking internal systems head-on. AT&T, for example, has spent $500 million on Y2K fixes to date. John Pasqua, AT&T’s Y2K leader, is Type-A serious about it.

But I wonder about some of the other telcos. Bell Atlantic, for example, seems disturbingly naive. The carrier claims all of its mission-critical network elements will be fixed by June 30 "in sufficient time to allow for testing" (emphasis added). If customers are concerned about interoperability, they are invited to test, "simply by transmitting your data ... after [the Bell Atlantic network] is fully Year 2000-compliant."

The Y2K-sober telco must face the likelihood of network failure even as it works to prevent it. Some kinds of failures can be addressed directly by telcos, but others can’t. Let’s consider seven categories.

• Intrinsic failure. The typical telco head-on attack is focused on failure of intrinsic network components. But despite heroic efforts, systems still could contain Y2K flaws that could bring down the network, or big pieces of it. Flaws surface even without Y2K; last spring, in separate incidents, the Galaxy 4 satellite failed and the AT&T frame relay network went down. Y2K won’t make intrinsic failure less probable.

• Interconnection failure. One telephone company’s network might be working fine, but the networks of other telcos, or customer premises equipment, could send bad Signaling System 7 (SS7) messages — or other kinds of Y2K-contaminated data associated with operations or maintenance — that might cause network failures.

• Overload failure. If uncertainties in non-telecom sectors (or outright failures) cause a dramatic increase in call attempts, this could overload local switches, or other network resources. If call centers (for airlines or banks, for example) fail during times of high anxiety, this could cause even more overload.

• Infrastructure failure. There could be failures in the non-telco-related infrastructure. For example, if the electrical grid fails, it could bring telecom systems with it. Furthermore, it is plausible that Y2K electrical failures could last longer than network backup facilities can operate. If transportation systems fail, key network operations people might not be able to get to work.

• Failures due to Y2K fixes. Some of the biggest recent network failures have occurred during upgrades; witness the SS7 debacle of 1991. In the pre-Y2K months, people will be pressured to rush Y2K upgrades into service.

• Security breach failures. Many strange hands touch software during Year 2000 remediation. Some of these hands could plant malicious code. Others could plant well-intentioned access points (for maintenance, for example) that could provide entry for later security breaches.

• Emergent failures. In complex, tightly coupled systems, an unexpected conjunction of improbable events can be disastrous. Because so many elements can interact in so many ways, an emergent failure is virtually certain to be a surprise. When subsystems are tightly coupled, effects can cascade rapidly. A telecommunications network is a transmission system controlled by information systems that are joined by transmission systems. They depend on electrical power systems that, in turn, critically depend on telecommunications systems. In situations like this, improbable interactions cascade.
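The circular dependence in that last category can be sketched as a toy model. Everything in it is hypothetical: the system names, the dependency map, and the rule that a system goes down as soon as any one of its dependencies does (no backups or redundancy are modeled). It describes no real network; it only shows how a loop between power and telecom lets one failure spread.

```python
from collections import deque

# Hypothetical dependency map: each system lists the systems it needs.
# The power <-> transmission loop mirrors the circular dependence
# described in the text; the names are illustrative only.
DEPENDS_ON = {
    "transmission": ["network_management", "power"],
    "network_management": ["transmission", "power"],
    "power": ["transmission"],
    "call_center": ["transmission"],
}


def cascade(initial_failures, depends_on=DEPENDS_ON):
    """Return the set of systems that end up down, ASSUMING a system
    fails as soon as any one of its dependencies fails."""
    # Invert the map: for each system, which systems rely on it?
    dependents = {}
    for system, needs in depends_on.items():
        for need in needs:
            dependents.setdefault(need, []).append(system)

    failed = set(initial_failures)
    queue = deque(initial_failures)
    while queue:
        down = queue.popleft()
        for victim in dependents.get(down, []):
            if victim not in failed:
                failed.add(victim)
                queue.append(victim)
    return failed
```

In this toy map, `cascade(["power"])` takes down all four systems, while `cascade(["call_center"])` stays contained, because nothing depends on the call center. Real networks have backups that this sketch ignores, so it overstates how easily failures spread; the point is only that coupling, not the size of the first failure, determines the reach.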

Plan B

Many kinds of failures originate outside of telco systems, yet telco contingency planning could avert or lessen the damage in virtually every case. Furthermore, when a failure hits and time is critical, contingency plans that already exist save the time it would take to make them.

Has your company developed triage rules? Will motivation for good software practice hold up in the face of time pressure? Are there plans for shutting off operations interaction, signaling and/or traffic with other telcos? Are there plans for non-regional, widely distributed network overloads? How will network operations people get to work if transport systems fail? Are there non-network-dependent alerting systems to respond to surprise emergent failures?

The Titanic was unsinkable, so it sailed without lifeboats for everyone. Our networks are almost that good. We should have plans in place in case we hit one of the Y2K icebergs out there.

David S. Isenberg can be reached at isen@isen.com.