16 Agencies Create One Confidential Data Process to Rule Them All
In three years, creators of the Standard Application Process condensed 16 different agencies’ processes into one central portal for confidential data requests.
Over three years, working across 16 federal agencies, more than 1,300 datasets and a cascade of privacy laws and interagency agreements underpinned by decades-old entrenched processes, a team of federal employees launched the Standard Application Process in December, creating for the first time a single portal for U.S. researchers to request access to the mountains of confidential statistical data generated by the federal government.
The Foundations for Evidence-Based Policymaking Act of 2018—better known as the Evidence Act—requires federal agencies to generate, use and share more data to improve all aspects of society. That includes sensitive and confidential data that could be useful for researchers in and out of government but must be treated with utmost care and only shared with responsible parties.
The Evidence Act acknowledges this issue and calls for the creation of a standardized application process to better enable researchers to apply for access to confidential data held by the federal government’s 16 statistical agencies.
Each agency is charged with collecting important data on its niche topic and maintaining those datasets for the public benefit. However, while the data is generated in order to be used, the subjects of that data—real people with privacy to protect—must be kept confidential and anonymous.
“The purpose of the SAP is to try to streamline the process by which researchers and data users request access to confidential data,” said Heather Madray, program director for the Data Access, Confidentiality and Quality Assessment project based out of the National Science Foundation’s National Center for Science and Engineering Statistics.
“This isn’t public use data,” she said. “This is data that’s protected by various confidentiality laws.”
That work began in September 2019 at the Census Bureau, where Madray and a team launched a pilot with a basic online application and data catalog. Over the next few years, the team grew to include a working group with representatives from all 16 statistical agencies, as well as a governance board and a policy board.
By mid-2020, the group was building out a metadata catalog to categorize all types of data held by the agencies—an effort that would turn out to be just as significant as the SAP itself.
The full catalog was completed in August 2022 and the first full version of the application went live in December.
“Before the SAP, every agency had their own way—their own application, their own process for people to apply for data. So, if you wanted data from five different agencies, you had to go to five different websites—five different applications, five different processes—[to] try to figure it all out. It was quite cumbersome and quite difficult trying to navigate all that,” Madray told Nextgov.
“What this does is bring all the principal statistical agencies and units—so, 16 right now—together under one common application and one standard process,” she said.
While establishing the SAP was the main goal of the project, the data catalog has been a huge achievement and value-add, Madray said. With the catalog in place, requestors can look in one place to discover all the types of data they can get access to, rather than having to search through each agency’s repository separately.
As of the end of February, 13 agencies had uploaded 1,337 datasets and the associated metadata to allow a user to quickly find the right data. That inventory will grow as three more agencies work to get their datasets included in the catalog.
“The metadata inventory is huge,” Madray said. “Having all of that restricted use data—and all of the information about that data—available in one location, it’s much easier for users to find information on data, contact an agency if they need additional information: That’s really a game-changer, I think.”
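For a sense of what such a catalog makes possible, here is a minimal sketch of a restricted-use dataset record and a keyword search across a single combined inventory. The field names and structure are hypothetical, not the SAP’s actual schema.

```python
# A minimal sketch of what one entry in a shared metadata catalog could contain,
# plus a keyword search across the combined inventory. All field names and the
# overall structure are hypothetical, not the SAP's actual schema.
from dataclasses import dataclass, field

@dataclass
class DatasetRecord:
    dataset_id: str                      # catalog identifier
    title: str                           # human-readable dataset name
    agency: str                          # owning statistical agency
    description: str                     # abstract describing the data
    keywords: list[str] = field(default_factory=list)
    access_level: str = "restricted"     # SAP entries are restricted-use, not public-use
    contact_email: str = ""              # where to ask for more information

def search(catalog: list[DatasetRecord], term: str) -> list[DatasetRecord]:
    """Return records whose title, description or keywords mention the term."""
    term = term.lower()
    return [
        record for record in catalog
        if term in record.title.lower()
        or term in record.description.lower()
        or any(term in keyword.lower() for keyword in record.keywords)
    ]
```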
But getting to this point took three years of dogged work involving lots of people, process and technology challenges.
The first major issue was synthesizing processes and requirements from 16 distinct federal agencies, each of which had developed its own way of doing things over the years.
“But the great thing about that was everybody coming together. We learned a lot about each other’s processes; we learned a lot about the different challenges that different agencies face with their own research programs,” Madray said. “We had to make a lot of compromises, but I think it built relationships.”
In developing a single approach, it was important to weed out bad processes that had developed over time at individual agencies. The team went “back to basics” to establish what was required by statute and then what supported administrative needs at each agency.
Finally, the team looked at the standing agreements—deals between agencies and with non-profits and academic institutions—that “had to be honored.” Sometimes, those conversations got uncomfortable, Madray said, but the team just had to “talk it through.”
“We all—every single one of us that was working on this—we had our security blankets, things we’d be doing for years that we had to think about, ‘Do we really need that or can we work in a different space?’” she said.
Once the team worked through the people and process issues, it faced the last part of any technology modernization project: the technology.
The first technological issue was realizing that the process could only be standardized up to a point. Each of the statistical agencies represents a different sector, different set of data and different potential use cases. While the final application needed to have a standardized functionality, it also needed some customization options.
The other major technical hurdle was building a system that could securely take in large documents. Because the data requested through this process is sensitive and confidential, the participating agencies need to review the methodology of the research the datasets will be used for.
“The first criteria: The use has to be statistical in nature,” Madray explained. “No law enforcement uses, no regulatory uses, no commercial uses.
“Then, the use of the data has to be allowed under the agency’s laws, their individual statute or any agreements they have in place to use the data,” she said. “For example, if they have a data sharing agreement with an agency, the use of the data has to be consistent with the terms of that data sharing agreement.”
The approving agency must make sure that any output from the studies—including charts and tables—doesn’t compromise the confidentiality of the real people represented by that data. This becomes even more important when data is being requested from multiple agencies, to avoid what’s known as the mosaic effect, in which real people can be reidentified by combining multiple sources of anonymous data.
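The mosaic effect is easy to illustrate with a toy example. The tables below are entirely invented, but they show how two individually “anonymous” extracts that share quasi-identifiers, such as ZIP code and birth year, can be joined to link sensitive attributes back to small, identifiable groups.

```python
import pandas as pd

# Two invented "anonymized" extracts that happen to share quasi-identifiers
# (ZIP code and birth year). The values are fabricated purely for illustration.
health = pd.DataFrame({
    "zip": ["20001", "20001", "22030"],
    "birth_year": [1961, 1985, 1961],
    "diagnosis": ["A", "B", "C"],
})
payroll = pd.DataFrame({
    "zip": ["20001", "22030", "22030"],
    "birth_year": [1961, 1961, 1990],
    "occupation": ["teacher", "nurse", "analyst"],
})

# Joining on the shared quasi-identifiers links sensitive attributes across the
# two sources. When a zip/birth-year pair describes only one person, that person
# is effectively re-identified even though neither table contains a name.
linked = health.merge(payroll, on=["zip", "birth_year"])
print(linked)
```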
From there, the agency has to validate that the research could not be done with publicly available data; that the research proposal is feasible; that the use of the data is consistent with and contributes to the agency’s mission; and that the work won’t jeopardize the public’s trust in the agency.
Some agencies are also required by law to ensure that the research work produces some public benefit, Madray added.
While each agency reviews applications through its own determination process, the SAP portal needed to accommodate collecting all of that relevant information.
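Distilled into a checklist, the review criteria described above might look something like the sketch below. The field names are invented for illustration, and in practice these determinations are made by agency reviewers rather than software.

```python
from dataclasses import dataclass

# The review criteria described above, distilled into a simple checklist. The
# field names are invented for illustration; the real determinations are made
# by agency reviewers, not by software.
@dataclass
class Application:
    use_is_statistical: bool               # no law enforcement, regulatory or commercial use
    allowed_under_statute: bool            # permitted by the agency's laws and agreements
    output_protects_confidentiality: bool  # charts and tables won't reveal individuals
    needs_restricted_data: bool            # could not be done with public-use data
    proposal_is_feasible: bool
    consistent_with_mission: bool
    provides_public_benefit: bool          # required by statute for some agencies

def failed_checks(application: Application) -> list[str]:
    """Return the names of any criteria the application does not meet."""
    return [name for name, passed in vars(application).items() if not passed]
```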
With the initial SAP live and in use, Madray and the team are now looking to future capabilities, such as developing application programming interfaces, or APIs, to connect directly to individual agency systems. This will allow agencies to have a direct feed of incoming applications, rather than having to download them from the portal.
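As a rough illustration of what such a feed could look like, here is a hypothetical sketch of an agency pulling its incoming applications over an API; the endpoint, parameters and response shape are assumptions, not a documented SAP interface.

```python
import requests

# Hypothetical sketch of an agency pulling its incoming SAP applications over an
# API instead of downloading them from the portal. The base URL, endpoint path,
# parameters and response shape are assumptions, not a documented interface.
BASE_URL = "https://sap.example.gov/api/v1"   # placeholder, not a real endpoint

def fetch_new_applications(agency_code: str, token: str) -> list[dict]:
    """Return applications submitted to one agency that have not yet been reviewed."""
    response = requests.get(
        f"{BASE_URL}/applications",
        params={"agency": agency_code, "status": "submitted"},
        headers={"Authorization": f"Bearer {token}"},
        timeout=30,
    )
    response.raise_for_status()
    return response.json()

# An agency's intake system could poll a feed like this on a schedule and route
# each new application into its own determination workflow.
```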
One of the big goals of future implementations will be to allow for a single application that requests data from multiple agencies.
“This is part technology, part working within the legal framework,” Madray said. “The ultimate down the road would be: Someone could go in and ask for data from five different agencies, they could link it and they could use it in one environment. We’re not quite there yet.”
Madray said the team has to work through conflicting statutes, security setups, data-sharing agreements and the like.
As it stands, the application can only process multi-agency requests if those agencies already have data sharing agreements in place.
A common framework for data requests would also let the SAP team develop a platform on which requestors could work with the data, allowing for more agency oversight and security.
“The multi-agency piece really is one of the Holy Grails,” Madray said.
Other capabilities in progress include a way to amend applications without having to withdraw and resubmit the request; an online appeal process so requestors can get more information when an application is denied; and establishment of “application windows” for agencies that are only able to accept applications at certain times of the year.
Further into the future, Madray said she hopes the program might include some identity and credential proofing, as well, to help agencies with the people side of the determination process.
For now, the team is soliciting feedback on the SAP and data catalog and providing metrics on whether the new process improves speed, ease of use and transparency.