Multiserver clusters are the new supercomputer stand-in. Here’s how storage has adapted to its important new supporting role.
Storage was straightforward in supercomputing’s monolithic days. Storage subsystems came with supercomputers and were fairly simple to integrate.
Today, high-performance computing often involves clusters of computers: 72 percent of the world’s 500 fastest supercomputers are now clusters, according to Top500.org, which publishes a twice-yearly ranking. The increasingly dominant distributed computing model tests storage in ways centralized supercomputer systems never did.
“It’s infinitely harder to connect storage in a cluster environment,” said Eric Seidman, storage systems manager at Verari Systems, a high-performance computing vendor. The overall objective, Seidman said, is to achieve balanced performance among processors, memory and input/output subsystems, which connect the cluster to storage.
Many vendors now offer storage systems for clusters. Storage experts say, however, that no single solution on the market will fit every supercomputing task. Instead, tiers of solutions are emerging to address the input/output requirements of different high-performance computing environments.
High-end clusters with a heavy input/output demand typically call for features such as parallel file systems and high-speed storage interconnects. But clusters of more modest size may use more familiar storage staples, such as network-attached storage (NAS) systems.
Special requirements
High-performance computing storage differs from the mainstream, enterprise variety in that storage managers need to pursue a holistic approach, said Anne Vincenti, storage director at Linux Networx, which builds Linux-based supercomputing clusters.
“They need to not only look at the storage but…get into the guts of the system architecture in order to optimize clusters,” Vincenti said.
On the flip side, technology managers who concentrate on the computing side of clusters should also take a broader view, Seidman said. “There’s a lot of focus on the processor type and clock speed…and less focus on the overall architecture,” he added.
The architecture’s storage element has several critical parts, including the storage array, the link between the storage device and the computing cluster, and the file system that sits atop the storage architecture. Altogether, those elements supply data to the scientific and technical applications running on a cluster. Storage systems also house the data that those applications generate.
Customers must tune those various components to maximize performance. Because clusters vary in size and performance and run different applications, they need individualized storage approaches.
Lawrence Berkeley National Laboratory’s Scientific Cluster Support program has more than 20 Linux clusters in production. The group “specializes in building individual clusters for specific areas of scientific research,” said Gary Jung, SCS project manager at the lab’s Information Technology Division. “Since we are building [high-performance computing] systems for specific areas of science, we can address the unique storage considerations more directly.”
Storage classes
Storage solutions at SCS and other high-performance computing centers tend to fall into three categories based on cluster size: small, medium and large.
Linux Networx defines the value performance segment as small clusters with as many as 32 nodes. Applications may include crash simulations or smaller computational fluid dynamics jobs, Vincenti said.
Customers in the value camp often can manage with a file server running the standard Network File System (NFS) protocol or a commercial NAS box, she said. Value customers emphasize manageability and seek solutions “that will deploy and work quickly without having to hack around,” Vincenti said.
Smaller file sizes also characterize the value bracket. Vincenti said smaller files may be 50 megabytes or less, while a large file could be bigger than 50 gigabytes.
At SCS, smaller clusters with modest input/output requirements use some form of Linux-based NFS server with a direct-attached Redundant Array of Independent Disks (RAID) solution, Jung said. NFS, commonly used in mainstream Unix and Linux environments, provides a distributed file system through which users share files on a network.
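Part of NFS’ appeal at this tier is that it is invisible to application code: an NFS export mounted on every node looks like any local directory, so ordinary POSIX input/output works unchanged. Below is a minimal sketch in C; the /shared mount point and file path are hypothetical, not drawn from any site mentioned here.

```c
/* Minimal sketch: a cluster node reads a shared input file from an
 * NFS mount. The /shared path is an assumed mount point; NFS needs no
 * special client code, so plain POSIX/stdio calls are enough. */
#include <stdio.h>
#include <stdlib.h>

int main(void)
{
    const char *path = "/shared/input/params.dat";   /* hypothetical NFS path */
    FILE *fp = fopen(path, "rb");
    if (fp == NULL) {
        perror("fopen");
        return EXIT_FAILURE;
    }

    char buf[4096];
    size_t n, total = 0;
    while ((n = fread(buf, 1, sizeof buf, fp)) > 0)
        total += n;                                   /* consume the shared file */

    fclose(fp);
    printf("read %zu bytes over NFS\n", total);
    return EXIT_SUCCESS;
}
```

The convenience has a cost: every node’s traffic funnels through the single NFS server, which is why this approach is reserved for clusters with modest input/output demands.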
Midsize clusters, meanwhile, gravitate toward NAS. The SCS program has deployed Network Appliance’s NAS systems for clusters with moderate input/output requirements, Jung said. He said the systems are reliable and easy to integrate.
The Army Research Laboratory also uses NAS storage. “NAS systems with multiple Gigabit Ethernet connections have been successful for small- to moderate-sized clusters,” said Tom Kendall, chief engineer at the Army Research Laboratory’s Major Shared Resource Center.
The midsize market segment contains clusters with as many as 256 nodes, according to Linux Networx.
Applications in this segment include process integration and design optimization, as well as computational fluid dynamics. Clusters may handle both small- and large-file streaming applications; with the latter, scientists can quickly load massive amounts of data into memory for faster job execution.
In the case of top-tier clusters, storage systems aim to address the problem of hundreds or thousands of data-hungry nodes.
Vincenti said some high-performance computing systems don’t get enough data to take advantage of their computing power. “You have to have enough [input/output] built into the system to keep the [system] fully optimized,” she said.
Large clusters with intensive input/output demands often call for a parallel file system. Parallel file systems are designed for supercomputing tasks in which multiple nodes access the same files simultaneously. In addition, parallel file systems let a cluster’s nodes access data at speeds much faster than what NFS can achieve.
“Parallel file systems are the greatest hope in this area,” Kendall said. The Army lab is deploying clusters with input/output requirements approaching 10 gigabytes/sec. The clusters have “capacities of hundreds of terabytes, which are beyond the capability of nearly all single-server solutions,” he said.
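Applications typically tap a parallel file system through a parallel input/output interface; on Linux clusters that is most often MPI-IO, though the article does not name a specific interface. The sketch below, with an assumed output path and slice size, shows the access pattern such file systems are built for: every node writing its own disjoint slice of one shared file at the same time.

```c
/* Sketch of parallel I/O with MPI-IO: each rank writes a disjoint slice
 * of a single shared file collectively. The file path and slice size are
 * assumptions for illustration. Build with mpicc, run with mpirun. */
#include <mpi.h>
#include <stdlib.h>

enum { SLICE = 1 << 20 };               /* doubles written per rank (assumed) */

int main(int argc, char **argv)
{
    MPI_Init(&argc, &argv);

    int rank;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    /* Each rank fills its own portion of the data set. */
    double *buf = malloc(SLICE * sizeof *buf);
    for (int i = 0; i < SLICE; i++)
        buf[i] = rank + i * 1e-6;

    /* All ranks open the same file on the parallel file system and
     * write their slices in one collective operation. */
    MPI_File fh;
    MPI_File_open(MPI_COMM_WORLD, "/pfs/scratch/output.dat",
                  MPI_MODE_CREATE | MPI_MODE_WRONLY, MPI_INFO_NULL, &fh);

    MPI_Offset offset = (MPI_Offset)rank * SLICE * sizeof(double);
    MPI_File_write_at_all(fh, offset, buf, SLICE, MPI_DOUBLE,
                          MPI_STATUS_IGNORE);

    MPI_File_close(&fh);
    free(buf);
    MPI_Finalize();
    return EXIT_SUCCESS;
}
```

Over NFS, those writes would all pass through one server; a parallel file system can stripe each slice across many storage devices, which is how clusters approach the multigigabyte-per-second rates Kendall described.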
Parallel file systems don’t suit every supercomputing challenge, however.
“There are applications that are by their very nature serial and, therefore, unable to leverage existing parallel file systems beyond the capability” of a single node’s input/output performance, Kendall said.
“Fortunately, only a small percentage of our applications fall into this regime,” he said.
The San Diego Supercomputer Center’s three largest clusters — DataStar, TeraGrid IA-64 and BlueGene — use IBM’s General Parallel File System, said Richard Moore, director of production systems at the center. Similarly, SCS has used Panasas’ ActiveScale Storage Cluster for its larger clusters with intensive input/output requirements, Jung said. ActiveScale Storage Cluster includes Panasas’ parallel file system, PanFS.
Cluster File Systems’ Lustre is another example of a parallel file system. The Lawrence Livermore National Laboratory, the National Center for Supercomputing Applications and Sandia National Laboratories are among the government customers using this open-source file system.
The high-end cluster segment also marks a shift in storage architecture. One approach incorporates the lower-level, block-based data access method of storage-area networks (SANs) alongside the file-based access techniques of NAS. In this approach, a large cluster may reserve a certain number of nodes for input/output. Those nodes feed data to the cluster’s computing nodes. The ratio of input/output nodes to computing nodes varies. The San Diego Supercomputer Center’s BlueGene cluster, for example, operates 1,024 computing nodes and 128 input/output nodes, or eight computing nodes per input/output node.
Input/output nodes may connect to a cluster’s computing nodes via proprietary interconnects, such as Myrinet and Quadrics, or standard Gigabit Ethernet, storage experts say. Reaching out from clusters to external storage devices, input/output nodes typically use Fibre Channel and attach to storage in a SAN-like manner. Supercomputer SANs may use storage devices from vendors such as DataDirect Networks, Engenio Information Technologies and Verari Systems, which markets Engenio products as an OEM partner.
Vincenti cited interconnect bandwidth and fast and flexible disk controllers as attributes that make SANs a suitable environment for high-performance computing applications such as streaming. But storage technologists say such storage architectures aren’t strictly SANs. Seidman said input/output nodes are typically NAS-attached to computing nodes and SAN-attached to storage systems.
Some NAS-based approaches can also offer storage to high-end clusters. Vendors such as Panasas offer NAS solutions that let a cluster’s input/output nodes access data in parallel. “We are starting to see a big move toward parallelized NAS,” said Dan Smith, senior enterprise consultant at GTSI.
Panasas’ storage product deploys a small file system — the company’s DirectFlow client — on all the nodes in a Linux cluster. The DirectFlow client lets the nodes access Panasas’ storage devices in parallel.
This parallel NAS solution has a SAN element, however. Panasas uses Internet SCSI (iSCSI) over Ethernet as the underlying transport mechanism, said Larry Jones, vice president of marketing at Panasas.
“SAN and NAS get a bit muddy in clusters,” Seidman said.
Buyers have multiple options when it comes to cluster storage. They must weigh factors such as cluster size, applications and throughput requirements before reaching a decision.