How cloud storage could catch up with big data
Erasure coding is emerging as a cost-effective way to safely maintain vast amounts of data in the cloud.
Cloud computing has managed to make the world’s already colossal appetite for data storage even more voracious.
Last year, IDC, an IT market research firm, cited public cloud-based service providers, from Amazon Web Services to YouTube, as the most significant drivers of storage consumption in the past three years. The government sector contributes as well: IDC noted that the private clouds of government and research sites compare in scope and complexity to their public cloud counterparts.
The so-called big data problem has surfaced in the past two years to rank among the primary IT challenges. Technologies such as the Apache Hadoop distributed computing framework and NoSQL databases have emerged to take on the challenge of very large — and unwieldy — datasets.
And now another technology, already at work behind the scenes, could grow in importance in the coming years. Erasure coding has been around since the 1980s, but until recently its use in storage circles has mainly been confined to single storage boxes as a way to boost reliability more efficiently.
Now erasure coding is moving into distributed storage. Its application becomes trickier here, but industry executives and storage researchers believe erasure coding — particularly in conjunction with increasingly popular techniques such as object-based storage — will play a growing role in cloud storage. Potential government adopters include Energy Department labs and other agencies with vast data stores.
Why it matters
When it comes to storage, everything is getting bigger, whether it’s an individual disk, a storage system or a cloud-based repository. Erasure coding, a family of error-correcting codes, plays a role across this range of ever-growing storage platforms.
Vendors most commonly use erasure coding to boost the resiliency and performance of their Redundant Array of Independent Disks (RAID) storage systems, said Bob Monahan, director of management information systems at DRC, a consulting and IT services firm.
But it’s the use of erasure coding as an alternative to data replication that is attracting new interest in this storage mechanism. In many traditional cases, redundancy is achieved by replicating data from primary storage devices to target arrays at the data center or an off-site location. Mirroring data in that way provides protection but also consumes lots of storage, particularly when organizations make multiple copies of data for greater redundancy. The approach becomes particularly unwieldy for organizations that deal with petabytes or more of data.
Erasure coding offers an alternative way to achieve redundancy while using less storage space, said Russ Kennedy, vice president of product strategy, marketing and customer solutions at storage vendor Cleversafe, which uses erasure codes in its object-based storage solutions.
Organizations that rely on replication might make three or four copies of data: one copy at another location, then a copy of that copy for added safety, and so on. In comparison, the overhead of storing data in a sufficiently fault-tolerant form with erasure coding is less than double the size of the original volume, Kennedy said.
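To put rough numbers on that comparison, here is a minimal back-of-the-envelope sketch in Python. It assumes a 1-petabyte dataset, triple replication, and a hypothetical erasure code that writes 16 slices but needs only 10 of them to read the data back; the figures are illustrative, not vendor benchmarks.

```python
# Back-of-the-envelope storage overhead: triple replication vs. a
# hypothetical erasure code that writes 16 slices but needs only 10 to read.
dataset_pb = 1.0                 # size of the original data, in petabytes
replica_count = 3                # full copies kept under replication
n_slices, k_needed = 16, 10      # erasure-coded slices written / slices needed

replicated_footprint = dataset_pb * replica_count       # 3.0 PB on disk
erasure_footprint = dataset_pb * n_slices / k_needed     # 1.6 PB on disk

print(f"Replication: {replicated_footprint:.1f} PB on disk, "
      f"erasure coding: {erasure_footprint:.1f} PB on disk")
```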
Jean-Luc Chatelain, executive vice president of strategy and technology at DataDirect Networks, said financial concerns are driving interest in erasure coding among customers who don’t want to replicate data two or three times. DataDirect takes advantage of erasure coding in its RAID system, file storage offerings and Web Object Scaler product for cloud storage.
The prospect of saving space and money hasn’t been lost on the cloud community. The major providers are on their way to adopting erasure coding, said James Plank, a professor in the Department of Electrical Engineering and Computer Science at the University of Tennessee. His research focuses on erasure codes in storage applications.
“Pretty much every cloud installation you can think of is either using erasure coding or converting to erasure coding,” he said, citing Amazon, Google and Microsoft as examples. “They are using erasure coding for fault tolerance because the disk space savings is huge.”
There’s a bandwidth benefit as well. “While the big savings today would come from reduced capacity requirements, the big win, from my standpoint, is the two- or threefold reduction in bandwidth [compared to what is] used during replication,” said Galen Shipman, group leader of the Technology Integration group at Oak Ridge National Laboratory’s National Center for Computational Sciences.
The fundamentals
Erasure coding might have implications for the nascent cloud, but the technology has been around the storage block a few times. In a storage setting, the technique encodes data into a larger set of fragments, only a subset of which is needed to reconstruct the original data.
For example, erasure coding is the underlying technology of Cleversafe’s dispersed storage method, which takes a data object (think of a file with self-describing metadata) and chunks it into segments. Each segment is encrypted and cut into 16 slices and dispersed across an organization’s network to reside on different hard drives and servers. If the organization has access to only 10 of the slices — because of disk failures, for instance — the original data can still be put back together, Kennedy said.
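For readers who want to see the mechanics, the sketch below shows the general idea in Python: data chunks become points on a polynomial over a finite field, extra parity points are written alongside them, and any k surviving points are enough to rebuild the original. The slice counts, field size and function names are illustrative assumptions in the spirit of the 16-slice example above; real systems work on fixed-size byte blocks and, in Cleversafe’s case, encrypt each segment before slicing, which this toy version omits.

```python
PRIME = 2**61 - 1  # a Mersenne prime; all arithmetic happens in this finite field

def _lagrange_at(x_target, points):
    """Evaluate the unique polynomial passing through `points` at x_target (mod PRIME)."""
    total = 0
    for i, (xi, yi) in enumerate(points):
        num, den = 1, 1
        for j, (xj, _) in enumerate(points):
            if i != j:
                num = num * (x_target - xj) % PRIME
                den = den * (xi - xj) % PRIME
        # pow(den, PRIME - 2, PRIME) is the modular inverse of den (Fermat's little theorem)
        total = (total + yi * num * pow(den, PRIME - 2, PRIME)) % PRIME
    return total

def encode(chunks, n):
    """Turn k data chunks into n slices; the first k slices are the data itself."""
    data_slices = list(enumerate(chunks))                  # data lives at x = 0..k-1
    parity_slices = [(x, _lagrange_at(x, data_slices))     # parity lives at x = k..n-1
                     for x in range(len(chunks), n)]
    return data_slices + parity_slices

def decode(surviving_slices, k):
    """Rebuild the original k chunks from any k surviving (x, y) slices."""
    points = surviving_slices[:k]
    return [_lagrange_at(x, points) for x in range(k)]

# A 16-slice layout in which any 10 slices suffice, echoing the example above.
original = [7, 42, 13, 99, 5, 8, 21, 34, 55, 89]   # k = 10 small integer "chunks"
slices = encode(original, n=16)
survivors = slices[6:]                              # pretend the first 6 slices were lost
assert decode(survivors, k=10) == original
```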
Numerous experts see erasure coding paired with object-based storage as a good option for achieving more fault-tolerant repositories with petabytes and even exabytes of capacity.
The hurdles
Government clouds and data centers have yet to jump on erasure coding, apart from agencies using RAID storage devices that embed the technique.
“It is less well understood and therefore less mature in commercially available solutions,” Monahan said. “As it becomes more mature, the use cases for when it is more appropriate will drive implementation scenarios.”
Performance is another limitation. Shank Shome, a storage engineer at Agilex Technologies, said the impact of erasure coding on storage performance has yet to be fully explored. He added that reading the data back from an erasure-coded system is generally fast, but the real performance cost lies in writing the data to storage.
“If the data is generally static with very few rewrites, such as media files and archive logs, creating and distributing the data is a one-time cost,” Shome said. “If the data is very dynamic, the erasure codes have to be re-created and the resulting data blocks redistributed.”
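A small sketch makes that trade-off concrete. It again assumes the hypothetical 16-slice, 10-needed layout and counts how many pieces must be rewritten when a single stored block changes in place; because every parity slice is computed from all of the data slices, one changed block forces every parity slice to be regenerated. The counts are illustrative, not measured results.

```python
# Illustrative write amplification for an in-place change to one data block,
# comparing triple replication with a hypothetical 16-slice / 10-needed layout.
replica_count = 3
n_slices, k_needed = 16, 10
parity_slices = n_slices - k_needed

replication_writes = replica_count      # rewrite the block on each of 3 replicas
erasure_writes = 1 + parity_slices      # rewrite the changed slice plus all 6 parity slices

print(f"Replication rewrites {replication_writes} blocks; "
      f"erasure coding rewrites {erasure_writes} slices")
```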
Erasure coding also runs into problems with high-performance computing. One complication arises when data is being written simultaneously from many sources and at a high rate, said Robert Ross, a computer scientist at DOE’s Argonne National Laboratory and a senior fellow at the University of Chicago’s Computation Institute. That activity requires a level of coordination that isn’t easy with current approaches.
In general, storage experts believe erasure coding faces its biggest obstacle with frequently accessed “hot data.” Accordingly, they believe a key initial use case lies in protecting data that has cooled enough to move to long-term storage.
Monahan said the benefits of erasure coding are “higher local availability at a lower cost and highly available dispersed archival systems that are an order of magnitude less expensive than traditional systems.”
The trick is knowing when to use replication to get data out of a system quickly and when to use erasure coding to create more economical, resilient long-term storage, Ross said.
“Both have important roles moving forward in high-performance computing,” he added.
The Oak Ridge lab is now exploring the use of erasure coding for the Oak Ridge Leadership Computing Facility. That facility already uses RAID 6 systems from DataDirect Networks. Shipman said erasure coding could play a significant role in two distributed storage systems: a Lustre parallel distributed file system and the large-scale archival High Performance Storage System, which uses replication for data integrity and resiliency.
“Erasure coding will likely emerge as a viable alternative to replication due to savings in the media and bandwidth consumed for replication,” Shipman said.
He acknowledged the computational demands of the more advanced erasure-coding techniques but said ongoing research on algorithms aims to minimize that cost.
Next steps: Updating the storage toolbox
As data storage needs continue to grow and cloud-based models introduce new options for distributed systems, agencies should constantly re-evaluate their storage strategies. Specifically, they should:
- Monitor current storage options. Erasure coding might not be at the top of your agenda today, but if your storage growth is outpacing your budget, it probably makes sense to add the technology to the mix of current or near-term options.
- Assess likely use cases. Beyond data archiving, erasure coding could prove useful for maintaining and protecting large quantities of sensor-derived data. For example, Cleversafe recently signed GeoEye, a provider of high-resolution satellite imagery, as a customer.