What the heck is Hadoop?
The open-source tool simplifies big-data management, but don't think of it as just another means of data analysis, experts say. In the right application, Hadoop frees users to explore information in whole new ways.
For applications that involve comparisons between gigantic databases, such as analyzing genomic sequences, Hadoop shines. (Stock image)
Every day, people send 150 billion new email messages. The number of mobile devices already exceeds the world's population and is growing. With every keystroke and click, we are creating new data at a blistering pace.
This brave new world is a potential treasure trove for data scientists and analysts who can comb through massive amounts of data for new insights, research breakthroughs, undetected fraud or other yet-to-be-discovered purposes. But it also presents a problem for traditional relational databases and analytics tools, which were not built to handle the volume and variety of data now being created. Another challenge is the mix of sources and formats, which includes XML, log files, objects, text, binary data and more.
"We have a lot of data in structured databases, traditional relational databases now, but we have data coming in from so many sources that trying to categorize that, classify it and get it entered into a traditional database is beyond the scope of our capabilities," said Jack Collins, director of the Advanced Biomedical Computing Center at the Frederick National Laboratory for Cancer Research. "Computer technology is growing rapidly, but the number of [full-time equivalent positions] that we have to work with this is not growing. We have to find a different way."
Enter Apache Hadoop, an open-source, distributed programming framework that relies on parallel processing to store and analyze tremendous amounts of structured and unstructured data. Although Hadoop is far from the only big-data tool, it is one that has generated remarkable buzz and excitement in recent years. And it offers a possible solution for IT leaders who are realizing that they will soon be buried in more data than they can efficiently manage and use.
"In the last 10 years, this is one of the most important developments because it's really transforming the way we work, our business processes and the way we think about data," said Ed Granstedt, a vice president at predictive analytics firm GoldBot Consulting. "This change is coming, and if government leaders don't understand how to use this change, they're going to get left behind or pushed aside."
Why it matters
Hadoop is more than just a faster, cheaper database and analytics tool. In some cases, the Hadoop framework lets users query datasets in previously unimaginable ways.
Take the Frederick laboratory, whose databases contain scientific knowledge about cancer genes, including the expression levels of a gene and what chromosome it is on. New projects seek to mine literature, scientific articles, results of clinical trials and adverse-event databases for related or useful connections. Other researchers are exploring whether big-data analysis of patient blogs, Google searches and Twitter feeds can also provide useful correlations.
"In many cases, we're trying to find associations, so we're doing mining and asking questions that weren't previously imagined," Collins said.
Last summer, his team conducted a study of two Hadoop implementations with both real and simulated data to see whether the framework would improve performance and allow for new types of analysis. The project reduced hours-long computations to minutes and won a government big-data award from CTOvision. Building on that success, the institution is now working on the next phase, which aims to better integrate data and improve the visualization of results.
"Data is the new natural resource," said Josh Sullivan, a vice president at Booz Allen Hamilton and founder of the Hadoop-DC Meetup group. "Hadoop is the first enterprise tool we have that lets us create value from data. Every agency should be looking at Hadoop."
However, implementation is not as simple as converting existing databases to a Hadoop framework, and doing so would be a missed opportunity for strategic data analysis, Sullivan said. In many cases, existing databases should be maintained separately and connected to Hadoop storage and analytics.
As a general rule, Sullivan said, any group with more than 2 terabytes of data should consider Hadoop. "Anything more than 100 [terabytes], you absolutely want to be looking at Hadoop," he said.
David Skinner, leader of the Outreach, Software and Programming Group at the Energy Department's Lawrence Berkeley National Laboratory, said he hopes Hadoop will offer a solution to the growing problem of data blindness, which keeps scientists from deeply understanding their own datasets. Skinner's group evaluates new technologies and makes them accessible to the thousands of scientists who use the lab’s National Energy Research Scientific Computing Center (NERSC).
"We're very interested in technologies that deliver data transparency and allow people to do analysis with large sets of data," said Skinner, whose group has been exploring scalable data solutions for a couple of years. "Science is increasingly inundated with data. If we can revolutionize the way we think about what scientists can do with data analysis, it would change the perspective on what is possible."
The fundamentals
Hadoop evolved out of Google researchers' work on the MapReduce framework, which Yahoo programmers brought into the open-source Apache environment. Core Hadoop consists of the Hadoop Distributed File System for storage and the MapReduce framework for processing. Queries migrate to the data rather than pulling the data into the analysis, which yields fast load times but potentially slower queries. In addition, programming Hadoop queries demands more advanced skills than writing user-friendly SQL, so developers have released additional software with colorful names such as Cassandra, HBase, Hive, Pig and ZooKeeper to make it easier to program Hadoop and perform complex analyses.
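The division of labor MapReduce imposes is easiest to see in a small example. The sketch below is a generic word-count job written in the Hadoop Streaming style, an interface that lets ordinary scripts serve as the map and reduce steps; the file names and the word-count task are illustrative, not drawn from any project described in this article.

```python
#!/usr/bin/env python3
# mapper.py -- illustrative Hadoop Streaming mapper (hypothetical example).
# Reads raw text lines from stdin and emits one "word<TAB>1" pair per word;
# the framework shuffles these pairs so that all counts for the same word
# reach the same reducer.
import sys

for line in sys.stdin:
    for word in line.split():
        print(f"{word.lower()}\t1")
```

```python
#!/usr/bin/env python3
# reducer.py -- the matching reducer. Hadoop delivers mapper output sorted
# by key, so all lines for a given word arrive consecutively and a simple
# running total is enough.
import sys

current_word, current_count = None, 0
for line in sys.stdin:
    word, count = line.rstrip("\n").split("\t")
    if word == current_word:
        current_count += int(count)
    else:
        if current_word is not None:
            print(f"{current_word}\t{current_count}")
        current_word, current_count = word, int(count)
if current_word is not None:
    print(f"{current_word}\t{current_count}")
```

Piping a text file through the mapper, a sort and the reducer reproduces in miniature what the framework does at scale: many mapper copies run next to the data blocks, their output is sorted and grouped by key, and each group is handed to a reducer. Tools such as Hive and Pig generate this kind of job automatically from higher-level queries.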
"Like a database, Hadoop is a mechanism for storing, manipulating and querying data," said Steven Hillion, chief product officer at Alpine Data Labs. "Unlike databases, Hadoop can handle data in a very fluid way. It doesn't insist that you've structured your data. Hadoop is sort of a big dumping ground for whatever data you can throw at it. People who have struggled to deal with big data have found Hadoop to be a cheap and flexible and powerful platform for dealing with these very large volumes of unstructured and fluid data." Because Hadoop evolved in the Internet space -- LinkedIn and Facebook were early adopters -- it is well-suited to the kind of data you find in those environments: log files, text files and the like. However, users should be aware of the upsides and downsides to parallel processing, which is Hadoop's salient characteristic.
"While the MapReduce programming model is very powerful because it makes it very easy to express a problem and run in parallel, there are lots of applications that just don't decompose that way," said Shane Canon, leader of the Technology Integration Group at NERSC. "It may require synchronization between the pieces. Maybe it's a complex workflow that has lots of synchronized parts."
Moreover, areas such as high-energy and nuclear physics typically rely on binary data formats, which do not work as well in Hadoop. On the other hand, bioinformatics is well-suited to Hadoop because the data comes from a sequencer and needs to be compared to a reference database for similarities. "That's something that fits well into a map model," Canon said.
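A toy sketch illustrates why: each sequencing read can be scored against a shared reference without consulting any other read, which is exactly the shape of work a mapper expects. The reference string, seed length and input format below are invented for illustration and are not the Frederick laboratory's actual pipeline.

```python
#!/usr/bin/env python3
# Toy, map-only sketch (hypothetical) of why sequence comparison "fits well
# into a map model": each read is scored against a shared reference
# independently of every other read, so the work spreads across a cluster
# with no coordination between mappers.
import sys

# Hypothetical reference: a real job would distribute a large indexed genome
# to every node; a short string stands in for it here.
REFERENCE = "ACGTACGTTTGACCGTAGGCTAACGGTTACGT"
K = 8  # length of the exact-match seeds the mapper looks for

# Pre-compute the set of k-mers (length-K substrings) in the reference.
reference_kmers = {REFERENCE[i:i + K] for i in range(len(REFERENCE) - K + 1)}

def map_read(read_id, read):
    """Return (read_id, number of reference k-mers the read shares)."""
    read_kmers = {read[i:i + K] for i in range(len(read) - K + 1)}
    return read_id, len(read_kmers & reference_kmers)

# Under Hadoop Streaming, each input line would be one read ("id sequence");
# stdin stands in for the file-system slice handed to this mapper.
for line in sys.stdin:
    if not line.strip():
        continue
    read_id, read = line.split()
    rid, hits = map_read(read_id, read)
    print(f"{rid}\t{hits}")
```

The contrast with the tightly synchronized physics workflows Canon describes is the point: here the mappers never need to talk to one another.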
In two years of working with Hadoop, NERSC leaders have found that rather than maintaining dedicated clusters, users have the most success by bringing up a Hadoop cluster on demand, running their applications and then tearing the cluster down -- even though that approach forgoes some of Hadoop's positive features. That could change as the technology evolves and more user-friendly applications are created.
A survey by the Data Warehousing Institute found an average of 45 machines in a Hadoop cluster, with a median of 12, suggesting the existence of a few extremely large clusters.
"You're not going to see the benefit until you're running a larger environment," said David Jonker, SAP's director of product marketing for big data. "The true benefit of Hadoop is when you have multiple machines together."
Hadoop appeals to IT leaders because of the improved performance, scalability, flexibility, efficiency, extensibility and fault tolerance it offers, said Glenn Tamkin, a software engineer at the NASA Center for Climate Simulation. Users can simply dump all their data into the framework without taking time to reformat it, which lifts a huge burden off NASA scientists working with 32 years of climate data.
The center has 36 nodes in its Hadoop cluster and envisions scaling up, said Tamkin, who added that potential Hadoop converts still have to understand their data and know whether Hadoop is an appropriate solution.
"Make sure that your base design or format is able to solve your use cases," he said. "If you make a wrong decision, you're kind of hosed.”
Nevertheless, skeptics need only look at how the Defense Department is using Hadoop to provide real-time tactical information in support of battlefield missions and intelligence operations, or at genome-sequencing analyses that can now be accomplished in minutes instead of hours.
"We've found it a really exciting technology to work with and investigate," said Deb Agarwal, head of the Advanced Computing for Science Department at Lawrence Berkeley National Laboratory. "Sure, there are places where it doesn’t fit our paradigms well, but it helps point us to areas where we could make improvements."