Cluster headaches

Emerging file systems aim to relieve the pain of Linux storage bottlenecks.

The rise of clusters using low-cost computers that run on Linux has been the top story in high-performance computing during the past two years.

Linux clusters offer massive amounts of computing power at a fraction of the cost of proprietary symmetric multiprocessor (SMP) machines, once supercomputing's mainstay.

But there's a cloud inside this silver lining: Storage architectures have failed to keep up with the Linux cluster's processing punch. The humble file system, which plays a behind-the-scenes role in organizing storage, is at the core of this issue. SMP boxes had their own file systems, but organizations pursuing Linux clusters have found file systems designed for their style of computing to be sorely lacking.

Vendors have responded with a number of file system initiatives aimed at clusters. Officials at young companies such as Cluster File Systems Inc., Ibrix Inc. and Panasas Inc. are focusing on these issues. Industry veterans such as IBM Corp. are involved as well. Activity is all over the map. File systems may be open-source or proprietary, and geared toward large clusters or smaller installations. In addition, offerings may fare differently in different computing situations.

Industry efforts aim to tackle the same general problem. And according to Dan Wickard, chief information officer at the Idaho National Engineering and Environmental Laboratory (INEEL), that problem is how to effectively interlace large numbers of computer processors — and their large memory footprint — with equally high-performance input/output conduits for shuttling data to and from storage.

"We don't have a good solution for that today," Wickard said. "Everybody would like to and that is why we are seeing so much activity."

The current interest in file systems is a product of "a shift from monolithic computing to scale-out, or cluster, computing," according to Shaji John, chief executive officer at Ibrix.

Officials have found it much less expensive to harness the combined capabilities of multiple commodity servers than to buy an SMP box. A 128-way monolith might cost $3 million to $4 million, while a cluster of 64 dual-processor servers might cost $300,000, John said.
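A rough per-processor comparison, using the figures John cited, shows where the savings come from. The snippet below is only a back-of-the-envelope sketch, not vendor pricing, and it treats every processor as an equivalent unit of capacity.

```python
# Back-of-the-envelope price-per-processor comparison using the figures John
# cited; treating processors as interchangeable units is a simplification.

smp_cost = 3_500_000        # midpoint of the $3M-$4M range for a 128-way monolith
smp_cpus = 128

cluster_cost = 300_000      # 64 dual-processor commodity servers
cluster_cpus = 64 * 2

smp_per_cpu = smp_cost / smp_cpus              # about $27,000 per processor
cluster_per_cpu = cluster_cost / cluster_cpus  # about $2,300 per processor

print(f"SMP monolith:  ${smp_per_cpu:,.0f} per processor")
print(f"Linux cluster: ${cluster_per_cpu:,.0f} per processor")
print(f"Cluster is roughly {smp_per_cpu / cluster_per_cpu:.0f}x cheaper per processor")
```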

Linux clusters "deliver way more price/ performance than the old SMP monolithic supercomputers," said Larry Jones, vice president of marketing at Panasas. He noted that nearly 65 percent of the top 500 supercomputers are Linux clusters.

The clusters' compelling value proposition comes with a challenge, however. The new school of computing lacks the SMP machines' proprietary interconnects, buses and file systems — and the resulting performance advantages. John said organizations can get 1 gigabyte/sec of bandwidth from a single monolithic machine. "People tried to do the same with clusters and couldn't get that kind of performance," he said.

John cited a cluster used for simulation to illustrate the imbalance between computing power and input/output speed. The cluster ran simulations as a series of compute cycles, and after each iteration a large amount of data had to be flushed to disk before the next could begin. Each compute iteration took about 12 seconds, but writing its results to disk took more than five minutes, John said.
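The numbers make the bottleneck easy to quantify. A minimal sketch, plugging in the timings John described (roughly 12 seconds of computation followed by five minutes of checkpoint writing per cycle), shows how little of the wall-clock time the processors spend doing useful work:

```python
# Toy model of the compute-then-checkpoint cycle John described; the timings
# come from his example, and everything else here is illustrative.

compute_seconds = 12            # one compute iteration on the cluster
checkpoint_seconds = 5 * 60     # flushing that iteration's results to disk

cycle_seconds = compute_seconds + checkpoint_seconds
busy_fraction = compute_seconds / cycle_seconds

print(f"Each compute/write cycle lasts {cycle_seconds} seconds")
print(f"Processors do useful work about {busy_fraction:.0%} of the time")  # about 4%
```

In this example the processors sit idle roughly 96 percent of the time, waiting on storage rather than doing simulation work.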

If high-performance computing shops can't get the necessary bandwidth, they "can't keep the systems busy," said Garth Gibson, chief technology officer at Panasas. "File systems haven't kept up."

The task of providing a cluster-capable file system is far from simple.

File systems for clusters "have been a very difficult thing," said Arun Taneja, president of Taneja Group, a technology analysis firm. "It's only in the last five years, relatively speaking, that we've had some major breakthroughs."

Vendor approaches

Work on file systems goes back well beyond five years, however. In the early 1980s, Digital Equipment Corp. VAX clusters emerged with what John described as "probably one of the best cluster file systems ever invented."

At about the same time, researchers at Carnegie Mellon University began work on a distributed file system that became known as the Andrew File System. That effort was later commercialized as AFS and, still later, evolved into open-source software.

But the pace of development has quickened since 2000, largely because of high-performance computing demands in government and academia. Officials at Energy Department labs and the National Nuclear Security Administration, for example, requested proposals for a scalable global secure file system as part of the Accelerated Strategic Computing Initiative. In 2002, the group awarded a contract to a combined Hewlett-Packard Co., Intel Corp. and Cluster File Systems team.

The ongoing contract focuses on the development of Lustre, a file system designed for large-scale Linux clusters. Cluster File Systems officials maintain Lustre as open-source software under the GNU General Public License. The first version of Lustre debuted late last year. Users include Lawrence Livermore National Laboratory and the National Center for Supercomputing Applications (NCSA).

Phil Schwan, Cluster File Systems' CEO, said Lustre is scheduled to be installed on some of the largest supercomputers, including Sandia National Laboratories' Red Storm and Lawrence Livermore's IBM Blue Gene/L.

Government requirements also helped get Panasas and its ActiveScale File System off the ground. Panasas was among a handful of companies responding to Los Alamos National Laboratory's request for a scalable file system. Last year, the lab awarded a deal to Panasas.

Research in the mid-1990s at the University of Minnesota led to another cluster file system: the Global File System (GFS). To advance the software, professor Matt O'Keefe launched Sistina Software Inc., which Red Hat Inc. purchased last year.

Now officials at Red Hat, which has maintained GFS' original open-source status, aim to provide a complete solution stack — file system, volume manager and failover software — built on Red Hat Enterprise Linux, said O'Keefe, Red Hat's GFS product manager.

Ibrix officials, meanwhile, market the company's Fusion scalable file system. The company has an installation at NCSA, among other government users, John said.

Start-ups aren't the only story in cluster file systems, however. IBM weighs in with the General Parallel File System (GPFS), originally devised for Unix but now available for Linux clusters. Meanwhile, Sun Microsystems Inc.'s QFS and Silicon Graphics Inc.'s CXFS are high-performance file systems for data sharing in a storage-area network (SAN) environment.

INEEL officials are deploying QFS on two Solaris boxes and will export the file systems via Network File System to a Linux cluster and other computing resources.

IBM's GPFS and SAN File System operate in the SAN space as well. The difference: GPFS was specifically designed for clusters, while SAN File System works with sets of computers that may or may not be clustered.

You can't have it all

The diverse offerings begin to address the high-performance computing managers' scalability and performance problems. Lustre and GPFS, for example, can scale beyond 1,000 nodes, according to the companies' representatives. Last year, Panasas officials, working with Los Alamos scientists, reported that their technology exceeded 10 gigabytes/second of sustained data bandwidth for parallel Linux cluster applications.

Cluster file systems have seen significant progress, Taneja said, but the technology isn't home free.

One problem is that the latest crop of systems tends to be good at one thing but not another. Some are geared toward very large clusters, while others have a more modest focus. Some are good at handling large files, but stumble with small files.

NCSA officials have worked with a number of high-performance file systems. But they don't do everything, said Michelle Butler, technical program manager at NCSA's Storage Enabling Technologies Group. Some file systems, she added, aren't stable enough, while others are stable but don't scale.

Consequently, clusters may use more than one file system; Wickard said large clusters may run three. Clusters that must handle a wide range of file sizes are a particular problem. "To date, no one has been able to create one that works well across eight orders of magnitude or so" in file size, he said.
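For a sense of what "eight orders of magnitude" means in practice, the quick calculation below uses assumed endpoints, a 1 KB file and a 100 GB file; the article does not specify the actual range Wickard has in mind.

```python
import math

# Illustrative endpoints only; assumed for this example, not taken from the article.
small_file_bytes = 1 * 1024         # a 1 KB configuration or log file
large_file_bytes = 100 * 1024**3    # a 100 GB simulation checkpoint

span = math.log10(large_file_bytes / small_file_bytes)
print(f"That range spans about {span:.0f} orders of magnitude")  # about 8
```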

So while industry has made strides, the quest for the perfect file system continues.

Moore is a freelance writer based in Syracuse, N.Y.
