The data reference model gets real

XML schema is a major step in tearing down agencies' data walls.

When data resides in incompatible systems, useful information can be hard to find, which means links are missed and conclusions are incomplete. It's expensive, too, when agencies duplicate data-collection efforts. Despite calls for greater information sharing governmentwide, agencies continue to gather and store data in incompatible ways.

The Office of Management and Budget aims to improve data sharing through governmentwide adoption of a data reference model (DRM) schema based on Extensible Markup Language (XML). The schema is a hierarchy of specifications for agencies to describe and exchange data.

OMB officials have selected the model, the fifth and final portion of the federal enterprise architecture, to fulfill a section of the E-Government Act of 2002 requiring government information to be organized, categorized and electronically searchable. Under the act's timeline, the DRM must be finalized by Dec. 17, say members of the CIO Council's Data Reference Model Task Force.

"This is a detailed blueprint for how organizations are going to describe the structure, categorization and exchange of their information," said Michael Daconta, the task force's leader, during the draft schema's public release June 13.

If agencies widely adopt the schema, they could more easily spot complementary or overlapping datasets, proponents say.

"You have to be able to have an organization find the other information assets and logical data models of other organizations," Daconta said. Public access to government information would likewise improve.

A draft version of the proposed schema is available for public discussion; the deadline for comments is Sept. 14.

The schema would be the template for the DRM documents produced by all agencies, said Owen Ambur, chief XML strategist at the Interior Department. Daconta credits Ambur with devising the schema's approach.

But not all agency data needs to be expressed within the schema.

"That's too big, too scary," Daconta said. "We very clearly state that this is information that you share or will share within a year."

Ideally, OMB would direct agencies to identify information that should adhere to the schema to achieve annual information-sharing objectives, Ambur said.

Some of the schema's elements would be optional when tagging data. "You populate different things to achieve different purposes," Daconta said.

Agencies already invested in their own approaches to data tagging will be able to reference them within the DRM schema.

"The intent is having to save them from redoing the effort they've already done and just capitalize on it automatically," Ambur said.

Unlike the data exchange model proposed in the first DRM version, which was released last October, the new XML approach will support the exchange of structured, unstructured and semi-structured data, Daconta said.

That is important because "80 percent of government information is either unstructured or semi-structured," said Andy Hoskinson, one of the contractors OMB employs in the Federal Enterprise Architecture Program Management Office.

The DRM schema allows unstructured information, such as text documents or photos, to be tagged with metadata that identifies the information's subject, source and creator. That information can also be linked to other resources.
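To make the tagging idea concrete, here is a minimal sketch in Python of what attaching subject, source and creator metadata to a photo might look like. The element names (resource, subject, source, creator, relatedResource), the function name and the example.gov addresses are hypothetical placeholders chosen for illustration; they are not taken from the draft DRM schema.

```python
import xml.etree.ElementTree as ET


def tag_resource(uri, subject, source, creator, related=()):
    """Build an XML metadata record for one unstructured resource.

    All element and attribute names are illustrative assumptions,
    not the names used by the draft DRM schema.
    """
    record = ET.Element("resource", {"uri": uri})
    ET.SubElement(record, "subject").text = subject
    ET.SubElement(record, "source").text = source
    ET.SubElement(record, "creator").text = creator
    # Each link points at another resource this one relates to.
    for link in related:
        ET.SubElement(record, "relatedResource", {"href": link})
    return record


# Hypothetical photo tagged with its subject, source and creator,
# plus a link to a related report.
record = tag_resource(
    uri="https://example.gov/photos/levee-0042.jpg",
    subject="Levee inspection",
    source="Example Agency field office",
    creator="J. Doe",
    related=["https://example.gov/reports/levee-survey.pdf"],
)

xml_text = ET.tostring(record, encoding="unicode")
print(xml_text)
```

The photo itself is never parsed. Only the descriptive record travels with it, which is what would let a search tool find the image by subject or creator.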

The revised reference model will allow agencies to exchange information and query data via a registry, preferably federated, Daconta said. The Core.gov site is under consideration as a registry.

The DRM is more abstract than an agency-specific effort called the National Information Exchange Model (NIEM), which seeks to identify and standardize a core set of XML schema terminology. NIEM participants include most law enforcement communities within the Justice and Homeland Security departments and state and local organizations.

They have also agreed to standardize the data elements for people, places, dates and other items.

"NIEM is creating a framework for how you assemble messages rapidly that are interoperable out of the box," Daconta said. He added that the DRM is "more about how does everything tie together to answer high-level questions, not all the details of how do you exchange it and why do you exchange it."

Experts say that inventorying data is a separate task from harmonizing data. Harmonization will be an ongoing bottom-up and top-down process, with emphasis on the bottom-up work, Daconta said.

"This approach will work," he said. "Let's move forward with it."

Schematic friction

The data reference model's Extensible Markup Language (XML) schema has three major sections: data description, data sharing and data context. Among other benefits, the schema will let agencies categorize their data according to the federal enterprise architecture, so that agencies with common lines of business can more easily identify one another.
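As a rough illustration of how categorization could help agencies find one another, the Python sketch below groups a made-up data inventory by line of business and reports the categories that more than one agency shares. The agency names, dataset names, category values and function names are all hypothetical; the draft schema's actual structure for data context is not reproduced here.

```python
from collections import defaultdict

# Hypothetical inventory entries: (agency, dataset, line of business).
# Every value here is a placeholder for illustration only.
inventory = [
    ("Agency A", "Permit records", "Environmental Management"),
    ("Agency B", "Water quality samples", "Environmental Management"),
    ("Agency C", "Grant awards", "Education"),
]


def agencies_by_line_of_business(entries):
    """Map each line of business to the sorted list of agencies tagged with it."""
    groups = defaultdict(set)
    for agency, _dataset, line in entries:
        groups[line].add(agency)
    return {line: sorted(agencies) for line, agencies in groups.items()}


def shared_lines(entries):
    """Keep only the lines of business that two or more agencies have in common."""
    grouped = agencies_by_line_of_business(entries)
    return {line: agencies for line, agencies in grouped.items() if len(agencies) > 1}


common = shared_lines(inventory)
print(common)
```

In this toy inventory, Agency A and Agency B would surface as candidates for sharing because both tagged data under the same category, while Agency C would not.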

But experts are debating what level of detail agencies should capture. The discussion can become heated because of the time and resource implications. The schema's supporters say data sharing must start as soon as possible, even if the schema is not perfect.

"As long as it is good or very good, we're going to need to move forward with it," said Michael Daconta, leader of the CIO Council's Data Reference Model Task Force. "We can't wait for the perfect representation format."

The U.S. intelligence community in particular disapproves of the XML schema approach. "I think there are second-tier issues that have to be solved," said Bryan Aucoin, the community's chief architect.

The matter boils down to different approaches toward data modeling, some federal officials say. "Abstract philosophical differences, in my opinion, are part of the problem, not part of the solution," Daconta said.

— David Perera
