NSA shows how big 'big data' can be
Experts say the massive scale of the metadata the NSA is collecting poses a daunting challenge: finding useful information within it.
If big data were cheap and easy and always resulted in an abundance of relevant insights, every agency and organization would do it.
The fact that so few federal agencies are engaging with this new technology – zero out of 17 in a recent MeriTalk survey – only highlights the challenges inherent in what recent intelligence leaks show the National Security Agency is trying to do.
The NSA reportedly collects the daily phone records of hundreds of millions of customers from the largest providers in the nation, as well as a wealth of online information about individuals from Internet companies like Facebook, Microsoft, Google and others.
To put the NSA's big data problems into perspective, Facebook's 1 billion worldwide users alone generate 500 terabytes of information per day – about as much data as a digital library containing all books ever written in any language. Worldwide, humans generate 6.1 trillion text messages annually, and Americans alone make billions of phone calls each year.
Even if the NSA takes in only a small percentage of the metadata generated daily by those major companies and carriers in its efforts to produce foreign signals intelligence and thwart terrorists, it would still be left with a vast sea of data.
In response to the recent reports by The Guardian and The Washington Post, based on information they received from former NSA contractor Edward Snowden, Director of National Intelligence James Clapper confirmed the Prism program's existence but provided scant information about the system itself, except to clarify that it was not a data mining or collection tool. However, a story published June 11 on CNN.com detailed several terror plots that were apparently foiled by intercepted electronic communications.
While the public does not know what the NSA does with its available data sets, it is clear that the agency faces challenges that private-sector companies don't, according to Paul Kocher, president and chief scientist at San Francisco-based Cryptography Research.
Kocher said firms like Google typically analyze large data sets to draw general inferences that help them optimize business, achieve greater efficiencies or chart a phenomenon. Retail companies have perfected these models, which is why you might notice certain ads popping up in certain applications after you've visited a specific venue or searched for a particular keyword online.
But the NSA's challenge isn't in generalizing large data sets; it's in finding tiny nuggets of data that might turn out to be a terrorist communication or signal, and that is a huge undertaking.
"The NSA is focused on very specific information about a fairly well-defined threat – they are warehousing data and drilling it down at the narrowest levels and having people look at it," Kocher said. "The metadata the NSA is known to collect can give them a good picture of the very specific things they are interested in."
As reported by InformationWeek, the NSA relies heavily on Accumulo, "a highly distributed, massively parallel processing key/value store capable of analyzing structured and unstructured data," to process much of its data. The NSA's modified version of Accumulo, based on Google's BigTable data model, reportedly makes it possible for the agency to analyze data for patterns while protecting personally identifiable information – names, Social Security numbers and the like.
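Accumulo's distinguishing feature, cell-level security, gives a sense of how that protection could work: every key/value pair carries its own visibility expression, and a query returns only the cells the reader's authorizations satisfy. The sketch below uses the open-source Accumulo 1.x Java API; the table name, security labels and field values are hypothetical illustrations, not details of the NSA's actual system.

```java
import org.apache.accumulo.core.client.BatchWriter;
import org.apache.accumulo.core.client.BatchWriterConfig;
import org.apache.accumulo.core.client.Connector;
import org.apache.accumulo.core.client.Scanner;
import org.apache.accumulo.core.client.ZooKeeperInstance;
import org.apache.accumulo.core.client.security.tokens.PasswordToken;
import org.apache.accumulo.core.data.Key;
import org.apache.accumulo.core.data.Mutation;
import org.apache.accumulo.core.data.Value;
import org.apache.accumulo.core.security.Authorizations;
import org.apache.accumulo.core.security.ColumnVisibility;

import java.util.Map;

public class CellVisibilitySketch {
    public static void main(String[] args) throws Exception {
        // Connect to a (hypothetical) Accumulo instance via ZooKeeper.
        Connector conn = new ZooKeeperInstance("instance", "zk1:2181")
                .getConnector("analyst", new PasswordToken("secret"));

        BatchWriter writer = conn.createBatchWriter("call_records", new BatchWriterConfig());

        // One row per record; each cell carries its own visibility expression.
        Mutation m = new Mutation("record-0001");
        // Non-identifying metadata: visible to anyone holding the ANALYST label.
        m.put("meta", "duration_sec", new ColumnVisibility("ANALYST"), "112");
        // Personally identifiable field: requires both the ANALYST and PII labels.
        m.put("pii", "caller_number", new ColumnVisibility("ANALYST&PII"), "555-0100");
        writer.addMutation(m);
        writer.close();

        // A scan returns only cells whose visibility the reader's authorizations satisfy.
        // This reader lacks the PII label, so caller_number is silently filtered out.
        Scanner scan = conn.createScanner("call_records", new Authorizations("ANALYST"));
        for (Map.Entry<Key, Value> e : scan) {
            System.out.println(e.getKey().getColumnQualifier() + " -> " + e.getValue());
        }
    }
}
```

The design choice matters here: because the filtering happens inside the store at read time, analysts can run pattern queries over the full table while identifying fields stay invisible to anyone not explicitly authorized to see them.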
Before news of Prism broke, NSA officials revealed at a Carnegie Mellon tech conference a graph search capability the agency operates on top of Accumulo. The graph is based on 4.4 trillion data points, which could represent phone numbers, IP addresses, locations, or calls made and to whom; connecting those points creates a graph with more than 70 trillion edges. For a human being, that kind of visualization is impossible, but for a vast, high-end computer system with the right big data tools and mathematical algorithms, some signals can be pulled out.
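A graph of that size cannot be held in memory or drawn; it has to be stored as key/value pairs and traversed one row at a time. How the NSA lays out its tables is not public, but a common pattern in BigTable-style stores is an edge table: the row key is the source node and the column qualifier is the destination, so a single row scan returns a node's direct contacts. A minimal sketch, again against the open-source Accumulo API with hypothetical table and label names:

```java
import org.apache.accumulo.core.client.Connector;
import org.apache.accumulo.core.client.Scanner;
import org.apache.accumulo.core.data.Key;
import org.apache.accumulo.core.data.Range;
import org.apache.accumulo.core.data.Value;
import org.apache.accumulo.core.security.Authorizations;

import java.util.ArrayList;
import java.util.List;
import java.util.Map;

public class OneHopSketch {
    // Fetch the direct neighbors of a node by scanning its row in an "edges" table,
    // where row = source node and column qualifier = destination node.
    static List<String> neighbors(Connector conn, String node) throws Exception {
        Scanner scan = conn.createScanner("edges", new Authorizations("ANALYST"));
        scan.setRange(Range.exact(node)); // restrict the scan to this node's row

        List<String> out = new ArrayList<>();
        for (Map.Entry<Key, Value> e : scan) {
            out.add(e.getKey().getColumnQualifier().toString());
        }
        return out;
    }
}
```

Repeating that scan breadth-first from a suspect node yields the two-hop and three-hop contact network, the kind of pattern query the 70-trillion-edge figure suggests the agency runs at scale.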
Rep. Mike Rogers (R-Mich.), chairman of the House Intelligence Committee, publicly stated that the government's collection of phone records thwarted a terrorist plot inside the United States "within the last few years," and other media reports have cited anonymous intelligence insiders claiming several plots have been foiled.
Needles in endless haystacks of data are not easy to find, and the NSA's current big data analytics methodology is far from flawless, as evidenced by the April 15 Boston Marathon bombings that killed three people and injured more than 200. The bombings were carried out by Chechen brothers Dzhokhar and Tamerlan Tsarnaev; the Federal Bureau of Investigation had previously interviewed Tamerlan after the Russian Federal Security Service notified the agency in 2011 that he was a follower of radical Islam. The brothers had also made threats on Twitter prior to their attack. Several data points of suspicious behavior existed, yet no one detected a pattern in time to prevent them from setting off bombs in a public place filled with people.
"We're still in the genesis of big data, we haven't even scratched the surface yet," said big data expert Ari Zoldan, CEO of New-York-based Quantum Networks. "In many ways, the technology hasn't evolved yet, it's still a new industry."
In all likelihood, the NSA is one of the few organizations at the forefront of big data, and it has already gotten past some of the initial barriers to harnessing the technology: cost and manpower.
Double the size of the FBI and the Central Intelligence Agency, the NSA has the analysts necessary to delve into these massive data sets, Zoldan said, and whatever the costs have been – the NSA does not publish its budget – the agency has covered them.
"Up front, the costs are exorbitant, but as we better understand what big data is, there will be an industry that will evolve and focus on interpreting and understanding and translating that data, and it's going to get cheaper," Zoldan said. "But in my opinion, you can't put a price tag on human life, and the underlying goal of the NSA is to thwart terrorism and threats to the US."
The cost to store information, already cheap by comparison, will continue to decrease as well, Zoldan said. While it might take years before sufficient tools exist to analyze all stored information, Zoldan said at some point technology will catch up, and that may allow agencies whose large data sets sit siloed off to really put that data to use.
"It's like if someone is hoarding up land on Mars, what are they going to do with it right now?" Zoldan said. "But eventually, as space exploration continues to evolve, that land will be worth something."