More than the sum of its parts
PC clusters, building on low-cost components, challenge supercomputer supremacy
Supercomputers traditionally have been a breed apart. Running on proprietary hardware and operating systems, these powerful machines allow scientists and engineers to complete in minutes, hours or a day calculations that would take an ordinary mainframe days or even weeks.
But traditional supercomputers are very expensive. And being proprietary, they fly in the face of the current open-systems zeitgeist. So a number of research organizations are testing the possibility of replacing bulked-up specialized systems with clusters of hundreds or even thousands of PCs running the open-source Linux operating system. To perform a calculation, clusters divide the work among multiple processors, then assemble the pieces at the end to provide the answer.
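In practice, that division of labor is handled by message-passing software running on every node. The following sketch, written in Python with the mpi4py message-passing library, illustrates the idea in miniature; it is not drawn from any of the projects described here and assumes an MPI runtime is installed on the cluster.

    # Divide-and-assemble in miniature: each processor computes part of a
    # series for pi, and processor 0 collects the pieces at the end.
    # Run with, for example: mpiexec -n 4 python approx_pi.py
    from mpi4py import MPI

    comm = MPI.COMM_WORLD
    rank = comm.Get_rank()   # this processor's ID within the cluster job
    size = comm.Get_size()   # total number of processors in the job

    N = 10_000_000           # total amount of work: N terms of the series
    # Each processor takes every size-th term, starting at its own rank.
    local = sum((-1.0) ** k / (2 * k + 1) for k in range(rank, N, size))

    # Assemble the partial sums on processor 0 to produce the final answer.
    total = comm.reduce(local, op=MPI.SUM, root=0)
    if rank == 0:
        print("pi is approximately", 4 * total)

Each processor works on its own slice of the problem independently; the only communication is the final step that gathers the partial results, which is why this style of computing maps so naturally onto a room full of networked PCs.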
Many of the projects are still embryonic, and questions about scalability, cost and administrative overhead remain. But the interim results, albeit more anecdotal than scientific, appear positive.
Take Sandia National Laboratories, an organization known for using the most massive supercomputers. Just about the time that Sandia's ASCI Red supercomputer, built with the help of Intel Corp., went fully operational in June 1997, the lab was beginning to study ways to develop a virtual supercomputing environment based on commodity and near-commodity hardware, the Linux operating system and other open-source software.
"Our worry then and now was mainly about what to do when the last [supercomputer] vendor disappeared," said William Camp, Sandia's director of Computation, Computers and Mathematics. "It was only secondarily about saving money."
Sandia's cluster project, called Computational Plant, or Cplant, is built from Compaq Computer Corp. AlphaServer systems linked by Myricom Inc.'s Myrinet interconnect. The largest cluster within Cplant comprises more than 1,500 AlphaServer nodes. The Energy Department and the National Nuclear Security Administration funded the research under the Accelerated Strategic Computing Initiative (ASCI).
Weighing Cost, Performance
Sandia is not alone in testing this new approach to supercomputing. Many major research organizations are seeking ways to move away from the traditional supercomputers or Unix reduced instruction-set computing (RISC) machines they now use for high-performance computing.
Sandia's focus on vendor independence rather than cost savings probably saved the project because Cplant has been very expensive. The lab has shelled out more than $10 million to develop it. But Camp is quick to point out that most of this effort and expense resulted from breaking new ground.
Now, with research and development almost completed, Cplant can provide a given level of performance at half the cost of an equivalent proprietary option, "taking into account the best deals you can get in both cases," Camp said.
Aside from using cheap processors, clusters can save money by eliminating the need to buy or lease a proprietary OS and software. Linux is free and uses open-source code.
Mark Seager, assistant department head for Terascale Systems at Lawrence Livermore National Laboratory, said the performance of his cluster project ranges from one-third the speed of the lab's fastest RISC machines, its legacy systems for high-performance computing, to three times as fast. The variation depends on the type of application: Linux does not perform as well on memory-intensive applications. However, the cluster costs less than one-quarter of the price of the RISC machines.
"Overall, the performance and cost benefit of the cluster [are] stunning," Seager said.
The Livermore project, also part of ASCI, is a 10-year program to reach 100 trillion calculations per second by 2005. The project, among other applications, will enable scientists to simulate the aging and operation of nuclear weapons. The cluster was developed in partnership with Linux NetworX Inc. and Silicon Graphics Inc. Federal.
Aside from cost savings, a cluster solution based on open-source software should make it easier to develop new applications, Seager said. "You don't have to worry about the vendor's methodology as you do with a proprietary system," he said. The Livermore lab would sometimes train people in the idiosyncrasies of one system only to have to retrain them when another system was used.
And getting machines from different vendors to work together was also a problem. "You never knew whose neck to wring when things went wrong," Seager said.
Can Clusters Make the Grade?
The most important question is whether Linux clusters can in fact replicate the power of supercomputers. Camp and Seager answer that question with a preliminary yes, but both admit the jury is still out. Many experts say it depends on the application.
"The only way to be sure that a cluster solution will work is to benchmark it," said Peter Ungaro, vice president for high-performance computing at IBM Corp. Ungaro said IBM encouraged some customers to create hybrid solutions, utilizing a traditional supercomputer when necessary and the Linux cluster when possible.
The National Center for Supercomputing Applications (NCSA) at the University of Illinois is in the process of conducting test runs on Linux clusters to determine their performance capabilities. The National Science Foundation has awarded $53 million to NCSA and other research organizations to build and deploy a distributed terascale facility for scientific research.
Dan Reed, director of NCSA, said a recent run proved that the cluster works well with up to 300 processors. "Considering the results of our tests so far, I believe clusters based on commodity equipment and processors will be the dominant scientific computing platform in the coming years," he said. NCSA's cluster will consist primarily of IBM servers based on Intel Itanium processors interconnected with Myrinet.
Cluster Problems
Despite optimistic results so far, it's important not to turn a blind eye to potential cluster problems. Fortunately, administration, one issue that plagued early high-performance computing clusters, has been at least partially solved.
Khalid Ahmed, product manager at cluster vendor Platform Computing Inc., said, "If you just tie a bunch of PCs together, you're going to have an administrative headache. And if the cluster is very large, you can't solve it by throwing bodies at it."
The answer: administrative software that presents the cluster as a single entity and handles tasks such as workload, resource and performance management.
Reed agrees with Ahmed. "There are two ways you can buy a cluster," he said. "You can buy the PCs off some Web site and just stack them up in a closet. That might work for a small lab. But for a production environment like ours, you need to use an integrator who can provide the administrative software to ensure the system equals, or at least comes close to, the reliability of a proprietary machine."
All Linux cluster vendors, including Compaq, Cray Inc., IBM, Linux NetworX and Platform Computing, provide administrative software. And all of the organizations using Linux clusters that were contacted for this article say that, with such software, they incur no extra administrative overhead compared with proprietary solutions.
However, Sandia's Camp is concerned about hardware reliability if he were to add a large number of processors. For example, if a single processor fails on average only once every 500,000 hours, which effectively never happens, a cluster of 10,000 processors can expect a failure roughly every 50 hours, and a cluster of 20,000 processors roughly every 25 hours. And because PC processors have less robust power supplies and fans and lack the triple-redundant parts of a system such as ASCI Red, they are more likely to fail.
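The arithmetic behind that concern is simple enough to check on the back of an envelope. The snippet below only illustrates the scaling; it is not Sandia's reliability model.

    # If one processor fails about once every node_mtbf hours, a cluster of
    # n processors can expect some failure roughly every node_mtbf / n hours.
    node_mtbf = 500_000  # hours between failures for a single processor

    for n in (1, 10_000, 20_000):
        print(f"{n:>6} processors: a failure about every {node_mtbf / n:g} hours")
    # prints 500000, 50 and 25 hours, matching the figures Camp cites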
As organizations begin to scale up during the next few years, many experts see a corresponding improvement in cluster hardware and management software. The Linux cluster industry is moving quickly to provide the speed and reliability required for very high-performance computing, IBM's Ungaro said. "Some problems are close to being solved now, some tomorrow. But we're getting there," he said.
Research organizations are generally upbeat about prospects for moving from a proprietary system to Linux clusters for at least some of their high-performance computing requirements. The extent to which clusters can handle really big jobs is still uncertain. Testing will continue for the next two to three years.
Stevens is a freelance journalist who has written about information technology since 1982.
***
A new look at performance
Comparing the performance of traditional supercomputers has always been a tricky problem, because most measures focus on peak processor performance to the exclusion of everything else. Last fall, IDC, with input from industry, government and academia, developed a new test to rate supercomputers. It is called the IDC Balanced Rating, and its creators believe it provides a better picture of a machine's overall capabilities.
The IDC Balanced Rating benchmark incorporates these three measures:
* Processor performance — The speed at which the processors could generate results if they are kept fully occupied with work.
* Memory system capability — Measures how much data can be moved into and out of the processors in a given time period.
* Scaling capability — Measures the total processor count and total system memory bandwidth.
For larger computer configurations, only the installed specifications are used in the benchmark tables.
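IDC has not published the exact formula behind the rating, so the sketch below is purely hypothetical: it simply shows how three normalized measures might be folded into one score rather than relying on peak speed alone.

    # Hypothetical illustration only; not IDC's actual Balanced Rating formula.
    # Each input is assumed to be normalized against a reference system.
    def balanced_rating(processor_perf, memory_capability, scaling_capability):
        # A simple geometric mean, so weakness in any one measure
        # drags down the overall score.
        return (processor_perf * memory_capability * scaling_capability) ** (1 / 3)

    # A machine with modest peak speed but strong memory and scaling
    # still earns a respectable overall score.
    print(balanced_rating(processor_perf=0.8,
                          memory_capability=1.4,
                          scaling_capability=1.2))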