Squeezing out the sparks

CIA to invest millions in developing software to sift through digital data

CIA officials plan to spend more than $27 million to develop advanced data mining and automated analysis techniques they hope will dramatically improve the ability of computers to find valuable intelligence buried in massive volumes of electronic data.

From TV broadcasts to e-mail traffic to wireless phone calls, the world is inundated by an unceasing torrent of electronic data. The volume is enormous: far too much for human analysts to digest, and overwhelming even for the current generation of data mining and computer analysis tools.

But CIA officials, through the Intelligence Technology Innovation Center, hope to perfect technologies such as "data mining on the fly," a sophisticated speech recognition technique called prosodics and instant automated translation of data from languages such as Chinese or Arabic into searchable English text.

A three-year research effort is planned, but CIA officials hope useful technology will be available in the next year, said Gary Strong, a program director at the National Science Foundation's Directorate for Computer and Information Sciences and Engineering.

NSF expects to receive $8 million a year for the next three years to develop better ways to automate intelligence analysis, Strong said.

The CIA is also paying about $3 million for software that can analyze huge volumes of documents, transcripts of conversations, espionage reports and other text and "unstructured data."

The software turns unstructured data into structured tables that make it easier to see cause-and-effect relationships, links between individuals, trends and other associations between people or events, according to software developer Attensity Corp.
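As a rough illustration of what turning free text into table rows can mean, the toy sketch below pulls "who met whom, where" records out of a short report. It is not Attensity's method. The sentence pattern, the function name extract_rows and the sample report are all hypothetical, and real extraction software would rely on far richer linguistic analysis than a single regular expression.

```python
import re

# Hypothetical pattern: "<First Last> met <First Last> in <Place>".
# This is an illustrative assumption, not how any commercial product works.
MEETING = re.compile(
    r"(?P<who>[A-Z][a-z]+ [A-Z][a-z]+) met "
    r"(?P<whom>[A-Z][a-z]+ [A-Z][a-z]+) in (?P<where>[A-Z][a-z]+)"
)


def extract_rows(text):
    """Return one (who, relation, whom, where) row per matched sentence."""
    return [
        (m.group("who"), "met", m.group("whom"), m.group("where"))
        for m in MEETING.finditer(text)
    ]


# Made-up sample text standing in for an unstructured report.
report = (
    "Sources say John Smith met Ali Hassan in Cairo. "
    "Weather was poor. Later, Ali Hassan met Maria Lopez in Madrid."
)
rows = extract_rows(report)
for row in rows:
    print(row)
```

Once the records sit in rows like these, links between individuals (here, the same name appearing in two meetings) can be found with ordinary table queries rather than by rereading the text.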

On Aug. 5, Attensity plans to announce receipt of $3 million in financing from In-Q-Tel, the CIA's venture capital branch.

The CIA's problem is that it receives too much data and has too few analysts to examine it, said Todd Wakefield, Attensity's chief executive officer. As a result, reports, transcripts and other potentially valuable text records wind up in a "data graveyard." Attensity software should enable the agency to perform at least a basic level of analysis of a much greater volume of data.

In the longer term, however, the CIA and other intelligence agencies must seek "some sort of conceptual breakthrough that will dramatically improve performance" of automated analysis, said Steven Aftergood of the Federation of American Scientists.

Pushed by the terrorist attacks of Sept. 11, NSF plans to tackle the problem by providing additional funding to speed research that already shows promise, Strong said.

Data mining on the fly is one such technology. Scientists at IBM Corp.'s Watson Research Center are already trying to develop software that can detect and extract significant intelligence information from multilingual streams of messages (see box).

Advanced voice recognition is another potentially useful technology. Current methods involving voice prints are not very accurate, Strong said. The same voice recorded with two different microphones, for example, might yield two different voice prints.

But a newer technique, prosodics, analyzes the "nonspeech content of spoken language." Speakers are identified by the pauses, hesitations, intonation and inflection of their speech, Strong said. The method is surprisingly accurate, and attempts to disguise speech by changing it do not work well, he said.

SRI International is trying to develop ways that machines can use prosodics to automatically identify speakers.

Meanwhile, researchers at Columbia University are working on a system that monitors news programs, online chat rooms, e-mail and voice mail messages, and automatically generates a summary of what was said about a specific event.

***

Information in motion

Officials at the CIA and other intelligence agencies believe the concept of "data mining on the fly" will enable them to monitor and analyze the growing volume of TV and radio broadcasts, e-mails, Web pages and other "streaming data" in real time.

Traditional data mining involves using software to examine static data stored in databases. For intelligence purposes, there is simply too much data being created too fast for standard data mining to be useful, said Gary Strong, a program director at the National Science Foundation's Directorate for Computer and Information Sciences and Engineering.

The solution, he said, appears to be giving researchers the ability to analyze data on the fly. IBM Corp.'s software development effort is one of 15 projects NSF has decided to fund "to move the projects in directions that are important to the intelligence community," Strong said.
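The difference between mining a stored database and mining a stream can be sketched in a few lines. The example below is only an assumption about what "on the fly" analysis might look like in miniature, not a description of the IBM project. The function name watch_stream, the watch terms, the window size and the alert threshold of two mentions are all invented for illustration.

```python
from collections import Counter, deque


def watch_stream(messages, watch_terms, window=3):
    """Yield (message_index, term) alerts as messages arrive.

    Only the term counts for the last `window` messages are kept in
    memory, so the stream never has to be stored in full first.
    """
    recent = deque()    # per-message term counts inside the window
    totals = Counter()  # running totals across the window
    for i, text in enumerate(messages):
        words = text.lower().split()
        hits = Counter(w for w in words if w in watch_terms)
        recent.append(hits)
        totals.update(hits)
        if len(recent) > window:
            # Forget the oldest message once it leaves the window.
            totals.subtract(recent.popleft())
        for term in hits:
            # Hypothetical rule: alert on two or more recent mentions.
            if totals[term] >= 2:
                yield i, term


# Made-up message stream; any iterable (a socket reader, a feed) would do.
stream = [
    "shipment arrives friday",
    "nothing new",
    "shipment delayed",
    "weather report",
    "quiet day",
    "shipment moved",
]
alerts = list(watch_stream(stream, {"shipment"}))
print(alerts)
```

Each message is examined once, as it arrives, and then discarded. That single-pass design is what separates this approach from traditional data mining, which queries data that has already been collected and stored.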
