Search fusion

Support grows for new way to integrate information analysis and retrieval tools.

Despite the hype surrounding popular search engines such as Google and Yahoo, the kind of keyword search technology those services use is not enough to satisfy many of the government’s industrial-strength information management requirements. With agencies riding herd on a growing volume of unstructured information contained in Web sites, e-mail messages and other file formats, the government must address a critical need for tools to help people make sense of the online information, not simply search for the occurrence of a handful of words.

Those needs have been the catalyst for the development of a diverse range of natural language processing tools that allow users to launch sophisticated queries of vast information stores using simple language.

However, product interoperability has not kept pace with the increasing specialization of analysis tools. Different tools excel at certain tasks, such as recognizing and understanding different languages or extracting the who, what, where, when and why contained in a file’s content. But the barriers to getting tools to work together have been difficult to overcome.

Now government agencies, including the Defense Advanced Research Projects Agency (DARPA), are exploring a new method for tool integration called the Unstructured Information Management Architecture. IBM developed UIMA to coordinate the efforts of its application development teams that work on ways to better manage unstructured information.

UIMA defines a common interface that gives application developers a standard way to exchange data among different applications. Without a standard approach, connecting such tools typically requires time-consuming and costly custom coding.

IBM recently made UIMA available to the open-source software development community, hoping to spur wider adoption. Signs indicate that is starting to happen.

Easier collaboration
The origins of UIMA date to five years ago at IBM’s development labs. At that time, “we had some 200 researchers around the world active in unstructured information management, which we viewed as a vital field, but we were not satisfied with our rate of progress,” said Arthur Ciccolo, department group manager of information and knowledge management at IBM Research and one of the leaders of UIMA’s development.

“One of the things that was lacking was a common infrastructure that everyone could use and not have to reinvent each time they needed it,” Ciccolo said. “Plus, there was no way people could share their results with each other.”

UIMA is IBM’s response to those challenges. It is a framework that supports an application from the acquisition of unstructured information in its raw form to its analysis and then use in tools such as databases, search engines and knowledge management systems.

IBM’s early work on UIMA caught the eye of officials at the Mayo Clinic, which was already collaborating with IBM on unstructured text processing. The clinic used UIMA to implement a system for extracting knowledge from 20 million clinical notes. The Memorial Sloan-Kettering Cancer Center also worked with IBM to develop a Web-based data warehouse that clinicians and researchers could use to search for various concepts in text-based pathology reports.

The early stages of UIMA also piqued DARPA’s interest because agency officials recognized its potential value to military systems. DARPA formed a working group with IBM that brought together university and industry experts in unstructured information management to help drive UIMA’s evolution.

UIMA is the first system that allows analytical applications to easily connect with one another as modules that plug into a common architecture through the use of a “really nice wraparound language,” said Joseph Olive, a DARPA program manager.

“You could do this before, but [users] had to do all of the work themselves to make these modules connect up with each other,” he said. “Now you can just wrap a [UIMA] envelope around them.”
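The wrapping idea Olive describes can be sketched in a few lines of Python. This is only an illustration of the general pattern, not UIMA's actual interface: the `wrap` and `run_pipeline` helpers, the shared `Document` record and the two toy tools are all hypothetical names invented for the example.

```python
from typing import Callable, Dict, List

# Hypothetical shared record that every wrapped module reads and writes.
Document = Dict[str, object]


def detect_language(text: str) -> str:
    """Stand-in for a specialized tool; a toy heuristic, not a real detector."""
    return "fr" if " le " in f" {text.lower()} " else "en"


def count_words(text: str) -> int:
    """Stand-in for a second, unrelated tool."""
    return len(text.split())


def wrap(name: str, tool: Callable[[str], object]) -> Callable[[Document], Document]:
    """Wrap a plain function so it reads the shared text and stores its result under `name`."""
    def module(doc: Document) -> Document:
        doc[name] = tool(str(doc["text"]))
        return doc
    return module


def run_pipeline(text: str, modules: List[Callable[[Document], Document]]) -> Document:
    """Pass one shared record through each wrapped module in order."""
    doc: Document = {"text": text}
    for module in modules:
        doc = module(doc)
    return doc


pipeline = [wrap("language", detect_language), wrap("word_count", count_words)]
result = run_pipeline("Le rapport est complet", pipeline)
print(result)
```

Neither tool knows about the other. Each only needs the wrapper, which is the point of a common architecture: adding a third tool means writing one more `wrap(...)` call rather than new glue code between every pair of tools.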

DARPA has used UIMA in various small projects, Olive said. Now the agency is putting it to large-scale use in the Global Autonomous Language Exploitation program, which aims to develop software that can analyze and interpret large volumes of speech and text in multiple languages.

Because that program employs three lead vendors who each have big teams of subcontractors, he said, UIMA will let the contractors more easily share and distribute their work.

Support grows
Although UIMA is still relatively early in development, some vendors have already committed to the IBM framework.

Attensity, for example, provides an applications suite that allows customers to extract information from unstructured text and combine it with structured data to quickly provide analysis-ready datasets. Some of Attensity’s government customers use a logistics analysis solution that lets them convert unstructured data from equipment service notes and repair logs into relational tables so that automated tools can then detect patterns indicating manufacturing or maintenance problems.
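To make the notes-to-tables step concrete, here is a minimal Python sketch under heavy assumptions. It does not represent Attensity's products or methods: the sample notes, the note format, the `PARTS` vocabulary and the `repairs` table are invented for illustration, and a simple regular expression stands in for real text extraction.

```python
import re
import sqlite3

# Invented sample notes; the format is an assumption for illustration only.
NOTES = [
    "2006-03-01 truck 17: replaced fuel pump after stall",
    "2006-03-04 truck 22: replaced fuel pump, engine stall reported",
    "2006-03-09 truck 17: tightened loose mirror",
]

# Splits a note into a date, a unit identifier and the free-text remainder.
PATTERN = re.compile(r"(?P<date>\d{4}-\d{2}-\d{2}) (?P<unit>\w+ \d+): (?P<text>.*)")
PARTS = ("fuel pump", "mirror")  # hypothetical vocabulary of part names


def extract_rows(notes):
    """Turn each free-text note into zero or more (date, unit, part) rows."""
    rows = []
    for note in notes:
        match = PATTERN.match(note)
        if not match:
            continue  # skip notes that do not follow the assumed format
        for part in PARTS:
            if part in match["text"]:
                rows.append((match["date"], match["unit"], part))
    return rows


# Load the extracted rows into a relational table held in memory.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE repairs (date TEXT, unit TEXT, part TEXT)")
conn.executemany("INSERT INTO repairs VALUES (?, ?, ?)", extract_rows(NOTES))

# Once the data is structured, an ordinary query can surface recurring parts.
part_counts = conn.execute(
    "SELECT part, COUNT(*) FROM repairs GROUP BY part ORDER BY COUNT(*) DESC"
).fetchall()
print(part_counts)
```

Once the notes are rows in a table, standard reporting tools can look for patterns, such as one part failing across several vehicles.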

Attensity’s products fully comply with UIMA, said Michelle de Haaff, Attensity’s vice president of marketing, and the company is working with government clients to develop their own UIMA adapters for plugging in other applications.

“Most agencies already use a wide range of tools in their search and analysis operations,” she said. “They can simply put out UIMA calls to get data extracted using our tools and then put it back into other applications that also use UIMA.”

This dynamic is especially beneficial for agencies that use tools that they can’t reveal much about for security reasons. UIMA provides a standard exchange method external to the classified application, said David Bean, Attensity’s co-founder, chief technology officer and vice president of engineering.

Even traditional search companies see an advantage in UIMA compliance. Exalead, a 6-year-old company that is already well established in industry and government in Europe, is looking to expand in the United States and sees UIMA support as a potential edge there.

Exalead sells an enterprise search platform built on Extensible Markup Language and Java that is complete for most search and retrieval purposes, said Francois Bourdoncle, the company’s president and chief executive officer and one of the early developers of the AltaVista search engine. But the government frequently requires third-party tools for its most complicated applications.

“That’s why UIMA is interesting to us,” he said. “It’s a document description language that allows those third-party tools to plug in to our product.”

Convera, which already has a large customer base in government agencies through its Excalibur and RetrievalWare search and knowledge discovery platforms, also recognizes the value of interacting with other companies’ tools, although it does not support UIMA yet. Its search products already provide integration with other products and services through the use of open interface standards such as XML and Representational State Transfer.

A huge demand exists right now for any capability that enables people to extract information from unstructured data, said Sameer Kalbag, Convera’s vice president of product management. But he said it’s unclear if UIMA will be the only answer.

Largely to address that need, IBM decided to make UIMA available to the open-source community in January by publishing the source code at SourceForge.net, the world’s largest open-source development site. IBM said it intends to move UIMA to a full open-source community development model later this year.

“That will allow other vendors to freely apply it to their product development,” Olive said. “We do firmly believe UIMA is a growing thing, which is why [the move to] open source is so important.”


**********

Soon the two shall meet

The Unstructured Information Management Architecture is one potential answer to a problem that many see as the golden ring for information management — the melding of the worlds of structured and unstructured data.

  • Structured data conforms to set formats and known relationships and is managed via constructs such as database technologies that allow for querying and reporting against those known data types and relationships.

  • Unstructured data is free-form data included in text (and, increasingly, in images, video and audio) that is not laid down according to any set format and does not exhibit any formal relationship with other data. As the Internet expands, most available data is unstructured.

— Brian Robinson

**********

UIMA development a balancing act

At the core of the Unstructured Information Management Architecture (UIMA) is a construct called the common analysis structure. It provides a way for one software application to produce output that can be easily interpreted by others.
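One way to picture such a shared structure is the original text plus a list of labeled spans that point back into it, so a downstream tool can read what an upstream tool found without knowing how it was found. The Python sketch below is a simplified illustration of that idea, not UIMA's actual data model or programming interface. The `AnalysisRecord` and `Annotation` classes, the label names and the `tag_terms` helper are hypothetical.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Annotation:
    kind: str    # label such as "Place" or "Date"; the label set is hypothetical
    begin: int   # character offset where the span starts
    end: int     # character offset just past the end of the span
    source: str  # name of the tool that produced the annotation


@dataclass
class AnalysisRecord:
    """Original text plus annotations that refer to it by offset."""
    text: str
    annotations: List[Annotation] = field(default_factory=list)

    def covered_text(self, ann: Annotation) -> str:
        return self.text[ann.begin:ann.end]

    def of_kind(self, kind: str) -> List[Annotation]:
        return [a for a in self.annotations if a.kind == kind]


def tag_terms(record: AnalysisRecord, kind: str, terms: List[str], source: str) -> None:
    """Toy annotator: label the first occurrence of each known term."""
    for term in terms:
        start = record.text.find(term)
        if start != -1:
            record.annotations.append(Annotation(kind, start, start + len(term), source))


record = AnalysisRecord("The convoy left Norfolk on Tuesday.")
tag_terms(record, "Place", ["Norfolk"], "place_tool")  # first tool writes its output
tag_terms(record, "Date", ["Tuesday"], "date_tool")    # second tool adds to the same record

# A third consumer reads only the labels it cares about.
places = [record.covered_text(a) for a in record.of_kind("Place")]
print(places)
```

Because annotations store offsets rather than copies of the text, several tools can add results to the same record without altering the original text.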

Developing a technology such as UIMA involves balancing thoroughness and economy. For example, IBM developers wanted to make UIMA as helpful as possible to the programmers who would use it by including a large set of application program interfaces, thereby supporting a wide range of existing applications and sparing the programmers lots of custom coding work.

At the same time, the developers had to resist the pressure to load the framework with what they felt were unnecessary capabilities.

“People were coming back to us and saying we really needed to put this and that into the framework,” said Arthur Ciccolo, department group manager for information and knowledge management at IBM Research. “But we said, ‘No, you can put that into an application. There’s no need for it in the framework itself.’”

In addition to the architecture framework, IBM has developed a UIMA software development toolkit, which is widely used in IBM Research and was made available to outside developers last year. It has registered more than 5,000 downloads in just one year, Ciccolo said.

— Brian Robinson
