DoD to develop scalable genAI testing datasets

The U.S. Department of Defense’s Chief Digital and Artificial Intelligence Office and technology nonprofit Humane Intelligence announced the conclusion of the agency’s Crowdsourced Artificial Intelligence Red-Teaming Assurance Program pilot, which is focused on testing large language model chatbots used in military medical services.

The findings could ultimately improve military medical care by adhering to all required risk management practices for the use of AI, DoD officials said.

WHY IT MATTERS

In an announcement Thursday, DoD said the CAIRT program’s most recent red-team test involved more than 200 agency clinical providers and healthcare analysts to compare three LLMs for two prospective use cases: clinical note summarization and a medical advisory chatbot.

They found more than 800 potential vulnerabilities and biases where LLMs are being tested to enhance military medical care.

CAIRT aimed to build a community of practice around algorithmic evaluations in collaboration with the Defense Health Agency and the Program Executive Office, Defense Healthcare Management Systems. In 2024, the program also offered a financial AI bias bounty focused on unknown risks in LLMs, beginning with open-source chatbots.

Crowdsourcing casts a wide net that can produce large volumes of data across multiple stakeholders. DoD said the findings from all CAIRT program red-teaming efforts will be crucial to shaping policies and best practices for the responsible use of generative AI.

DoD also said continued testing of LLMs and AI systems through the CAIRT Assurance Program is critical to accelerating AI capabilities and justifying confidence across DoD genAI use cases.

THE LARGER TREND

Trust is essential for clinicians to embrace AI. To use genAI in clinical care, LLMs must meet critical performance expectations