No quit in these systems
Need affordable high-availability computing? You’ve got it.
High-availability systems have always been part of the computing mix, but until fairly recently, they commanded a steep price and options were limited. Now the same inexpensive yet powerful technology that’s driving much of the sea change in enterprise computing is also bringing high availability to the masses.
PC servers with redundant storage running either Microsoft Windows or Linux — systems that cost just a few thousand dollars but can support hundreds of users and hundreds of gigabytes of data — can now be designed to stay up and running for all but a few hours a year.
True continuous, fault-tolerant computing — where no transactions are lost, performance never degrades and users are shielded from system failures — will likely always be expensive. But high availability that falls just short of that pinnacle is a choice now within the reach of most information technology shops.
“It has become an expected feature for business and mission-critical systems, a checkbox item,” said Robert Desautels, president of Harvard Research Group, a market research and consulting company. “Then it becomes a business decision about what level of availability is needed and what kind of outage can be absorbed.”
The Federal Aviation Administration was faced with making such a decision when the aging mainframe system that supports the National Airspace Data Interchange Network (NADIN) failed catastrophically in 2000, causing hundreds of flight delays around the country and millions of dollars in lost revenue for the industry.
No aircraft can enter, leave or fly within the National Airspace System without first filing a flight plan that is accessible to all flight controllers through NADIN. It’s the very definition of a system that needs to run around the clock with no downtime. FAA officials are in the process of replacing the old Philips DS714 mainframes that drive the NADIN message switching network with two Stratus Technologies ftServer 6600 servers, which use Intel Xeon processors.
The new systems deliver on a number of requirements, said Andy Isaksen, an FAA computer scientist and NADIN program manager. They provide for a minimum of 99.999 percent availability, they cost substantially less than the old systems, and they come with Microsoft’s Windows operating system, which has been bolstered with Stratus’ failsafe system software.
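As a rough illustration of what such targets mean in practice (the arithmetic below is generic, not drawn from the FAA's contract), an availability percentage translates directly into an annual downtime budget:

```python
# Downtime budget implied by an availability target over one year.
MINUTES_PER_YEAR = 365.25 * 24 * 60

for availability in (0.999, 0.9999, 0.99999):
    downtime_min = (1 - availability) * MINUTES_PER_YEAR
    print(f"{availability:.3%} availability -> "
          f"{downtime_min:.1f} minutes of downtime per year")
```

Five nines leaves a budget of roughly five minutes of downtime a year; 99.9 percent, by contrast, allows nearly nine hours.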
“We are all Windows on the programming side, so we had a major investment there we had to support,” Isaksen said.
The FAA is spending $6.4 million to field the new system, including the cost of the Stratus hardware and a 10-year maintenance agreement. That’s about one-tenth of what it would have cost to keep the old mainframe system up and running for the next decade, Isaksen said. He said he hopes the new system will be operational by November.
Working side by side
As late as 1998, Stratus’ systems were based on proprietary technology. Fault tolerance was achieved by a combination of specially engineered hardware and software. But when organizations began designing entire businesses around IT, the high costs of proprietary systems quickly pushed them to the industry-standard Intel processor for many important applications. Stratus had to follow, said Denny Lane, the company’s director of product marketing.
A new line of Stratus systems is basically two computers that work side by side: one functions as the operational system, the other as the backup. The two machines process the same transactions simultaneously. If the operational system breaks down, the redundant system takes over with no interruption in processing.
The systems do not rely entirely on off-the-shelf components, however. Stratus uses specially designed chips to detect errors and faults in the system and ensure that the two computers are synchronized.
Company engineers also developed management and diagnostic software, and a system architecture that allows the servers to monitor their own condition so faults can be isolated and repaired before they can do any damage.
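Stratus does not publish its lockstep logic, but the general idea can be sketched in a few lines. The following is a conceptual illustration only, with hypothetical names; it assumes two replicas fed identical inputs, with a comparator that checks their outputs and isolates a replica that faults or diverges:

```python
# Conceptual sketch of lockstep redundancy: two replicas process the
# same transaction; a comparator verifies their outputs agree and
# drops a replica that faults. Illustrative only, not Stratus' design.

def lockstep_step(replicas, transaction):
    """Run one transaction through every healthy replica."""
    results = {}
    for name, replica in list(replicas.items()):
        try:
            results[name] = replica(transaction)
        except Exception:
            # Fault detected: isolate this replica and continue
            # on the survivor with no interruption in processing.
            del replicas[name]
    if not results:
        raise RuntimeError("all replicas failed")
    if len(set(results.values())) > 1:
        raise RuntimeError("replicas diverged; needs diagnosis")
    return next(iter(results.values()))

# Example: both replicas compute the same function in step.
replicas = {"primary": lambda tx: tx * 2, "shadow": lambda tx: tx * 2}
print(lockstep_step(replicas, 21))  # -> 42
```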
But the price difference for the new systems vs. Stratus’ original proprietary product line, which the company still offers, is dramatic. For example, the entry-level model in the company’s continuous availability Continuum line starts at around $500,000. In comparison, a model 2300 in Stratus’ new Windows-based ftServer W Series starts at about $10,000.
Another player in the high-availability market, Marathon Technologies, offers a product that works in a similar way, though solely through software. The company’s Marathon FTvirtual Server takes two standard PC servers connected by Gigabit Ethernet links and, using any mix of storage systems, combines them into a configuration that operates as a single system.
If the operational server or any component fails, the system hands control of transaction processing to the other server. The switch is transparent; as far as the end user is concerned, nothing has happened. The software then alerts the IT staff to the location of the failure so it can be repaired.
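Marathon's internal protocol is proprietary, but software products in this class are typically built on a heartbeat-style failure detector. The sketch below is a generic illustration of that building block, not Marathon's actual mechanism; the class name, interval and threshold are all placeholders:

```python
import time

# Generic heartbeat failure detection, the building block behind many
# software-based failover products. Names and thresholds are illustrative.

HEARTBEAT_INTERVAL = 0.5   # seconds between "I'm alive" messages
FAILURE_THRESHOLD = 3      # missed beats before declaring failure

class FailoverMonitor:
    def __init__(self):
        self.last_beat = time.monotonic()
        self.active = "server_a"
        self.standby = "server_b"

    def heartbeat(self):
        """Called each time the active server checks in."""
        self.last_beat = time.monotonic()

    def check(self):
        """Promote the standby if the active server has gone silent."""
        silent_for = time.monotonic() - self.last_beat
        if silent_for > FAILURE_THRESHOLD * HEARTBEAT_INTERVAL:
            self.active, self.standby = self.standby, self.active
            print(f"failover: {self.active} now handles all traffic")

monitor = FailoverMonitor()
monitor.heartbeat()   # active server checks in
time.sleep(2)         # simulate the active server going silent
monitor.check()       # standby is promoted
```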
The cluster option
Another option for high availability comes in the form of clusters of PC servers. Such clusters are often used to harness the combined processing power of multiple machines for big or complex computing jobs.
They also offer some high-availability capabilities because some servers in the cluster can take over if others fail.
Clusters can be an attractive option for cost-conscious IT managers. That's because with the active/active clustering approach, the second server doesn't sit idle performing redundant work; it can handle other tasks until a peer fails.
That is not possible with the Stratus and Marathon models.
But using clusters for high availability is not simple. “High availability is one of the abilities of clusters, but it has to be engineered,” said Larry Kelly, senior systems administrator for RGII Technologies. The company developed a cluster system to run public Web sites for the National Oceanic and Atmospheric Administration that provide potentially life-saving information, such as early warnings about hurricanes.
Among other features, load balancing is important in clusters because servers that have to take on new responsibilities can quickly become overwhelmed. Also, administrators must develop and test cluster-aware applications and failover scripts. And they must properly configure access to storage resources so that reserve servers have quick access to the right application data.
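Failover scripts vary by cluster product, but at their core is a service health check of the kind an administrator might write in a few lines. The sketch below shows the basic idea; the hostnames, port and timeout are placeholders, and a production script would also update the load balancer and alert staff:

```python
import socket

# Bare-bones service health check of the kind a failover script might
# run before redirecting load. Hostnames and port are placeholders.

NODES = ["web1.example.gov", "web2.example.gov"]
PORT = 80
TIMEOUT = 2.0  # seconds

def healthy(host):
    """Return True if the node accepts TCP connections on PORT."""
    try:
        with socket.create_connection((host, PORT), timeout=TIMEOUT):
            return True
    except OSError:
        return False

# Route traffic only to nodes that pass the check.
live_nodes = [n for n in NODES if healthy(n)]
print("serving from:", live_nodes or ["no healthy nodes!"])
```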
Stacy Morang, Internet/intranet administrator for the Maine Legislature, considered both clusters and Marathon’s product when searching for a high-availability system to support the legislature’s Microsoft Exchange messaging and bill-drafting applications. She chose the Marathon server.
“If you use clustering, you have to make the application aware of clustering for the failover part of the operation,” Morang said. “With the Marathon product, you don’t need that kind of awareness because the application doesn’t need to know what it’s running on.”
Another point to consider with clusters: Failover and recovery typically take anywhere from a few seconds to a minute or longer. End users may not see any interruption in service, but they could experience some system sluggishness.
In most cases, that kind of availability is enough, Desautels said. “Most computer users don’t need millisecond recovery times,” he said. “High availability for them is where they don’t see much [performance] difference.”
For many government applications, that kind of performance is improvement enough.