Last updated April 5, 2006

SIGMETRICS/Performance 2006

INVITED TALK (Friday, June 30, 9:00AM--10:00AM) Palais du Grand Large


Performance and Reliability: The Ubiquitous Challenge

Daniel A. Reed

Renaissance Computing Institute
University of North Carolina at Chapel Hill

Legend says that Archimedes remarked, on the discovery of the lever, "Give me a place to stand, and I can move the world." Today, computing pervades all aspects of science and engineering. "Science" and "computational science" have become largely synonymous, and computing is the intellectual lever that opens the pathway to discovery. As new discoveries increasingly lie at the interstices of traditional disciplines, computing is also the enabler for scholarship in the arts, humanities, creative practice and public policy. Equally important, computing is an enabler of our critical infrastructure, from monetary and communication systems to the electric power grid.

With such pervasive dependence, computing system reliability and performance are ever more critical. Although the mean time between failures (MTBF) of commodity hardware components (i.e., processors, disks, memories, power supplies and networks) is high, their use in highly parallel, mission-critical systems can still lead to systemic failures. In contrast, distributed software for networks, whether transport protocols or web/Grid services, is designed to be resilient to component failures. Our thesis is that these "two worlds" of software -- distributed systems and parallel systems -- must meet, embodying ideas from each, if we are to build resilient systems. This talk surveys some of these challenges and presents possible approaches for high-performance, resilient design, ranging from intelligent hardware monitoring and adaptation, through low-overhead recovery schemes, statistical sampling and differential scheduling, to alternative models of system software, including evolutionary adaptation.


Biographical Sketch

Dan Reed is the Chancellor's Eminent Professor at the University of North Carolina at Chapel Hill, as well as the Director of the Renaissance Computing Institute (RENCI), a venture supported by three universities (the University of North Carolina at Chapel Hill, Duke University and North Carolina State University) that is exploring the interactions of computing technology with the sciences, arts and humanities. Reed also serves as Vice-Chancellor for Information Technology and Chief Information Officer for the University of North Carolina at Chapel Hill.

Dr. Reed is chair of the board of directors for the Computing Research Association, which represents the interests of the major academic departments and industrial research laboratories. He recently completed a term of service as a member of President George W. Bush's Information Technology Advisory Committee (PITAC), where he chaired the subcommittee on computational science. He was previously Director of the National Center for Supercomputing Applications (NCSA) at the University of Illinois at Urbana-Champaign, where he also led the National Computational Science Alliance, a consortium of roughly fifty academic institutions and national laboratories that is developing next-generation software infrastructure for scientific computing. He was also one of the principal investigators and the chief architect for the NSF TeraGrid. He received his PhD in computer science in 1983 from Purdue University.