DOD announces completion of pilot to identify medical AI vulnerabilities


By Edward Graham,
Managing Editor, Nextgov/FCW

January 2, 2025

The Pentagon’s red teaming effort identified more than 800 “potential vulnerabilities and biases” in the use of large language models for clinical note summarization and for a medical advisory chatbot.

The Department of Defense’s Chief Digital and Artificial Intelligence Office, or CDAO, announced on Thursday that it “successfully concluded” a generative artificial intelligence pilot focused on identifying vulnerabilities in the use of large language models to enhance military medical services.

The Pentagon said the initiative was conducted by Humane Intelligence, a technology nonprofit, through its Crowdsourced AI Red-Teaming Assurance Program. The Program Executive Office, Defense Healthcare Management Systems and the Defense Health Agency also collaborated on the pilot.

CDAO’s LLM pilot focused on identifying potential system weaknesses and flaws in the use of emerging tools for clinical note summarization and for a medical advisory chatbot. DOD said more than 200 people, including clinical providers and healthcare analysts within the department, participated in the red teaming effort, which “compared three popular LLMs.”

According to a press release, the initiative uncovered more than 800 “potential vulnerabilities and biases” in the use of these LLMs to enhance military medical care.

“This exercise will result in repeatable and scalable output via the development of benchmark datasets, which can be used to evaluate future vendors and tools for alignment with performance expectations,” DOD said. “Furthermore, these findings will play a crucial role in shaping DOD policies and best practices for responsible use of Generative AI (GenAI), ultimately improving military medical care.”

Matthew Johnson, who heads CDAO’s Responsible AI Division and served as the office’s lead on the pilot, also said in a statement that “this program acts as an essential pathfinder for generating a mass of testing data, surfacing areas for consideration and validating mitigation options that will shape future research, development and assurance of GenAI systems that may be deployed in the future.”

CDAO, which became operational in June 2022, has worked to test, expand and streamline DOD’s adoption and use of AI capabilities since its creation. The office previously launched a GenAI task force — known as Task Force Lima — in August 2023 to better study and understand how it could use emerging technologies “in a responsible and strategic manner.”

Although the department sunset the task force last month, it also created an Artificial Intelligence Rapid Capabilities Cell to carry out the group’s recommendations. CDAO said the new program, created in partnership with the Defense Innovation Unit, “will lead efforts to accelerate and scale the deployment of cutting-edge AI-enabled tools, to include Frontier models, across the Department of Defense.”

In its Thursday announcement, DOD said, in part, that pilot initiatives conducted as part of its Crowdsourced AI Red-Teaming Assurance Program “will be critical to accelerating the CDAO’s AI Rapid Capabilities Cell.”
