CIS Home | Penn Engineering | PENN
 

 
 Lui Sha: Software System Stability  

The development of large-scale, mission-critical systems has emerged as a major scientific and engineering challenge. For example,
· The FAA's major modernization project, the Advanced Automation System (AAS), was originally estimated to cost $2.5 billion, with a completion date of 1996. In 1994, “FAA cancelled the AAS program, casting aside 11 years of development time and, according to GAO, wasting more than $1.5 billion of taxpayer money.”

· The chaos at the opening of Hong Kong’s new airport: “Many people could not find their departure gate. The monitor would say Gate 15, but the airline staff would say Gate 43 . . . or 19 or 33. ... Out on the tarmac, some pilots didn't know where to park and passengers sat perspiring in their seats. … a computer gremlin had prevented the main air cargo operator, HACTL, from retrieving vital shipping information. As a result, freight operations ground to a halt, and containers full of perishables rotted on the steamy tarmac.”

These are not isolated incidents. Serious problems in the development and integration of large software systems are in fact very common. The problems are so serious that the US Congress passed laws mandating the reform of large-scale information system development, acquisition, and maintenance. This led to the creation of an architecture framework for system integration. In spite of such efforts, the system development and integration problems remain. Even in less ambitious, typical commercial system development, debugging and testing account for 50-75% of total development cost.

A large, networked system of systems is built from many components that were designed in the past for different requirements and that contain known and unknown defects. These components often have overly complex and unconstrained interactions. Indeed, major system failures are often traced back to unexpected global interactions that involve many components, not to an isolated defect in one module.

On the other hand, it is worth noting that in the current mission management software of civil aviation, there are hundreds of reported residual bugs, yet flights remain safe. There are at least thousands of residual bugs in the telecom network, and it remains highly reliable. There are perhaps millions of bugs in the World Wide Web system of systems, but it works quite acceptably. On an even larger scale, the United States of America is a highly stable and evolvable system. It has grown and made truly remarkable progress by the metric of civilization, even though many problems remain. And its basic components, human beings, are complex, error prone, and hard to test or verify. Complex yet stable systems are uncommon, but they have been built.

This talk presents an initial investigation of
· The root causes of large system failures and what needs to be done.
· Structures that keep highly complex systems stable in spite of residual errors.

http://www.house.gov/transportation/press/press2001/release15.html
http://www.asiaweek.com/asiaweek/98/0717/nat_6_clk.html
http://wwwoirm.nih.gov/policy/itmra.html
http://www.aitcnet.org/dodfw/
http://www.research.ibm.com/journal/sj/411/hailpern.html


For more information on Lui Sha, please visit his web site: http://www.cs.uiuc.edu/people/faculty/sha.html



