The development of large-scale, mission-critical systems has emerged
as a major scientific and engineering challenge. For example,
· The FAA's major modernization project, the Advanced Automation
System (AAS), was originally estimated to cost $2.5 billion with
a completion date of 1996. In 1994, “FAA cancelled the AAS
program, casting aside 11 years of development time and, according
to GAO, wasting more than $1.5 billion of taxpayer money.”
· The chaos at the opening of Hong Kong’s new airport:
“Many people could not find their departure gate. The monitor
would say Gate 15, but the airline staff would say Gate 43 ...
or 19 or 33. ... Out on the tarmac, some pilots didn't know
where to park and passengers sat perspiring in their seats. …
a computer gremlin had prevented the main air cargo operator,
HACTL, from retrieving vital shipping information. As a result,
freight operations ground to a halt, and containers full of perishables
rotted on the steamy tarmac.”
These are not isolated incidents. Serious problems in the development
and integration of large software systems are in fact very common.
The problems are so serious that the US Congress passed laws mandating
the reform of large-scale information system development, acquisition,
and maintenance. This led to the creation of an architecture framework
for system integration. In spite of such efforts, the system development
and integration problems remain. Even in less ambitious, typical
commercial system development, debugging and testing account for
50-75% of total development cost.
A large, networked system of systems is built from many components that
were designed for different requirements in the past and that contain
known and unknown defects. These components often have overly complex
and unconstrained interactions. Indeed, major system failures are often
traced back to unexpected global interactions that involve many components,
not to an isolated defect in one module.
On the other hand, it is worth noting that the current mission
management software of civil aviation has hundreds of reported
residual bugs, yet flights remain safe. There are at least
thousands of residual bugs in the telecom network, and it remains
highly reliable. There are perhaps millions of bugs in the World
Wide Web system of systems, but it works quite acceptably. On
an even larger scale, the United States of America is a highly stable
and evolvable system. It has grown and made truly remarkable progress
by the metric of civilization, even though many problems remain.
And its basic components, human beings, are complex, error-prone,
and hard to test or verify. Complex but stable systems are uncommon,
but they have been built.
This talk presents an initial investigation of
· The root causes of large system failures and what needs
to be done.
· Structures that keep highly complex systems stable in
spite of residual errors.
http://www.house.gov/transportation/press/press2001/release15.html
http://www.asiaweek.com/asiaweek/98/0717/nat_6_clk.html
http://wwwoirm.nih.gov/policy/itmra.html
http://www.aitcnet.org/dodfw/
http://www.research.ibm.com/journal/sj/411/hailpern.html
For more information on Lui Sha, please visit his web site: http://www.cs.uiuc.edu/people/faculty/sha.html