The intelligence community's big-data problem
Analyzing fragmented information that doesn’t come neatly packaged is a challenge most other government users of big data can relate to.
The intelligence community is perhaps the most innovative data collector on the planet, with each of its 17 agencies able to siphon off various pools of information from nearly any source.
Yet the IC collects vast volumes of mostly fragmented data, and therein lies a challenge that every other government organization struggling to make use of big data can relate to.
“In our world, we’re very good at collecting data, we’re also pretty good at analyzing it – we have to quickly parse out what is valuable,” Roger Hockenberry, a former chief technology officer for the Central Intelligence Agency, said during a panel session March 11 at the Symantec Government Symposium in Washington.
“Our data is always fragmented, and we’re trying to make sense of fragmented data options, which is extremely difficult,” said Hockenberry, who is now a consultant. “How we analyze every piece of data, how we reprocess it to continue to make better sense of what is going on – that is the biggest [challenge] we have, especially when we can’t get complete databases.”
Former National Security Agency contractor Edward Snowden’s public disclosures of classified information have highlighted how the NSA and other agencies collect various sorts of signals intelligence. A significant amount of this data doesn’t come packaged neatly for ingestion and analysis in any open-source or proprietary platform. Social media feeds and emails, for example, represent large but highly unstructured datasets. “Normalizing” that kind of unstructured data so that it becomes useful continues to be a major challenge, Hockenberry said.
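To make the normalization problem concrete, here is a minimal sketch, not any agency’s actual pipeline; the schema, field names and sample payloads are invented for illustration. It maps two unstructured sources, a JSON social-media post and a raw email, onto one common record format that downstream analytics can query:

```python
import json
from dataclasses import dataclass
from datetime import datetime, timezone
from email import message_from_string
from email.utils import parsedate_to_datetime

@dataclass
class NormalizedRecord:
    """A hypothetical common schema for heterogeneous intake."""
    source: str        # where the item came from ("social", "email", ...)
    author: str        # best-effort identity of the originator
    timestamp: datetime
    text: str          # the free-text payload to analyze

def normalize_social_post(raw: str) -> NormalizedRecord:
    """Map a JSON social-media post onto the common schema."""
    post = json.loads(raw)
    return NormalizedRecord(
        source="social",
        author=post.get("user", "unknown"),
        timestamp=datetime.fromtimestamp(post["created_at"], tz=timezone.utc),
        text=post.get("text", ""),
    )

def normalize_email(raw: str) -> NormalizedRecord:
    """Map an RFC 5322-style email onto the common schema."""
    msg = message_from_string(raw)
    return NormalizedRecord(
        source="email",
        author=msg.get("From", "unknown"),
        timestamp=parsedate_to_datetime(msg["Date"]),
        text=msg.get_payload(),
    )

# Once normalized, disparate fragments can be sorted, joined and
# queried as a single timeline.
records = [
    normalize_social_post(
        '{"user": "acct42", "created_at": 1394539200, "text": "meeting at noon"}'
    ),
    normalize_email(
        "From: a@example.com\nDate: Tue, 11 Mar 2014 16:00:00 +0000\n\nconfirming noon meeting"
    ),
]
records.sort(key=lambda r: r.timestamp)
for r in records:
    print(r.source, r.timestamp.isoformat(), r.text)
```

The hard part in practice is everything this sketch glosses over: missing fields, conflicting identities and formats that change without notice.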
To conduct its large-scale analytics effectively, the CIA uses a mixture of open-source and commercial products built on a data-science-oriented reference architecture that emerged from one of its small labs over the past decade. The CIA started with OpenStack and added commercial products at various points to note differences and build an effective, scalable solution.
Hockenberry said platforms and tools differ in usefulness depending on the environment in which they’re operating, and that logic also carries over to the post-analytic visualizations a dataset produces.
“You have to decide the right mix,” said Hockenberry, adding that big data forces analysts or data scientists to be creative in how they ask questions.
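One way to read “deciding the right mix” in engineering terms is to put open-source and commercial engines behind a common interface so the same question can be run against each and the results compared. This is a hypothetical pattern, not a description of the CIA’s actual stack, and the backends here are stubs:

```python
from abc import ABC, abstractmethod

class AnalyticsBackend(ABC):
    """Common interface so engines can be swapped or run side by side."""
    @abstractmethod
    def query(self, question: str) -> list[dict]:
        ...

class OpenSourceBackend(AnalyticsBackend):
    def query(self, question: str) -> list[dict]:
        # A real implementation might dispatch a Spark or Hadoop job;
        # stubbed here to keep the sketch self-contained.
        return [{"engine": "open-source", "question": question, "hits": 3}]

class CommercialBackend(AnalyticsBackend):
    def query(self, question: str) -> list[dict]:
        # A real implementation might call a vendor's API; stubbed here.
        return [{"engine": "commercial", "question": question, "hits": 5}]

def compare(backends: list[AnalyticsBackend], question: str) -> None:
    """Run one question against every backend and print the differences."""
    for backend in backends:
        print(type(backend).__name__, backend.query(question))

compare([OpenSourceBackend(), CommercialBackend()], "logins from unusual locations")
```

The appeal of the pattern is that the choice of engine becomes an empirical question answered per environment, which is the point Hockenberry was making.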
The intelligence community is at the forefront of big data, but even at its best, analyzing piles of unstructured, fragmented data is hard. Algorithms will improve, and data holders will inevitably learn to ask better questions of their data. Yet as the deluge of unstructured information continues to pour forth, finding meaningful signal in the noise is likely to remain a problem for some time.
“It’d be nice if al-Qaeda would ship us all their records in a nice, standard format, but they don’t,” Hockenberry said.