The federal government is set to pilot the use of artificial intelligence to help review applications for its Digital Marketplace 2 (DM2) panel.

The Digital Transformation Agency (DTA) has developed a proof-of-concept in which a large language model works alongside an assessment officer to review the case studies IT suppliers submit with their applications.
The agency plans to expand the proof-of-concept into a pilot, with the goal of going live later this year using an AI–human pairing model, rather than AI-based assessment alone.
Launched in October 2024, DM2 is a government-wide procurement arrangement for IT labour hire, as well as professional and consulting services.
Speaking during the AI Government Showcase in Canberra, former DTA principal technology advisor Ben Bildstein said the marketplace draws around 20,000 applications.
“This is a really big and important piece of work,” he said.
“We've got all these people doing all this work, and we think maybe AI can rate applications.”
After reviewing the government’s procurement policies and standards, as well as AI ethics guidelines, the agency ruled out the idea of AI handling assessments entirely on its own.
“Pretty simply, AI can't do that,” Bildstein said. “It can't evaluate an application in a procurement context for you; that's a human's job.”
However, the agency agreed to trial AI on supplier case studies, which two human reviewers typically assess as evidence of prior work.
“We have the people rate that case study from one to five,” explained Bildstein.
“We've got two staff members doing that independently. If they agree with a margin of error of one point, we basically consider that to be sufficient.
“If they agree, the case is passed for [an additional] review by a delegate. If they disagree, it goes to a discussion with a third person.”
“So, can AI read a case study and give a rating from one to five? The answer is yes, of course.”
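In code, the triage Bildstein describes amounts to something like the following sketch (the function name and escalation messages here are illustrative assumptions, not the DTA's actual system):

```python
def review_case_study(rating_a: int, rating_b: int) -> str:
    """Triage two independent 1-5 ratings of a supplier case study.

    Mirrors the process Bildstein describes: ratings within one point
    of each other count as agreement; a wider gap is escalated.
    Illustrative sketch only.
    """
    for rating in (rating_a, rating_b):
        if not 1 <= rating <= 5:
            raise ValueError(f"ratings must be 1-5, got {rating}")
    if abs(rating_a - rating_b) <= 1:
        return "agreed: pass to a delegate for additional review"
    return "disagreed: discuss with a third reviewer"


print(review_case_study(4, 3))  # within one point -> agreement
print(review_case_study(5, 2))  # wider gap -> third reviewer
```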
Three metrics
The proof-of-concept tested 268 previous applications, comparing the AI model’s assessments with those made by two human case officers.
The testing used three metrics to assess how well the AI model performed when compared with human assessors in evaluating case studies.
The first was agreement rate: how often two human assessors agree with each other, compared with how often a single human and the AI agree.
According to Bildstein, the two case workers agree on average 81 percent of the time, which served as the benchmark for the AI’s performance.
By comparison, the AI agreed with a human 84 percent of the time.
“We still have a 16 percent disagreement here,” Bildstein noted.
“In those cases, we throw away the AI and we get two humans to evaluate, like [the DTA] always did, and proceed on that basis.
“But in the majority of cases, the AI agreed with the human.”
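As a rough sketch, that comparison could be computed as below (the ratings are made up, and it is an assumption that "agreement" uses the same one-point margin as the manual process):

```python
def agreement_rate(ratings_a, ratings_b, margin=1):
    """Share of paired 1-5 ratings that fall within `margin` points."""
    pairs = list(zip(ratings_a, ratings_b))
    agreed = sum(abs(a - b) <= margin for a, b in pairs)
    return agreed / len(pairs)


# Hypothetical ratings for five case studies.
human_1 = [4, 3, 5, 2, 4]
human_2 = [4, 4, 3, 2, 5]
ai      = [4, 3, 4, 2, 4]

print(f"human vs human: {agreement_rate(human_1, human_2):.0%}")  # 80%
print(f"human vs AI:    {agreement_rate(human_1, ai):.0%}")       # 100%
```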
The second metric was the average rating difference, or margin of error.
“The idea here is: we've got a human rating something out of five. We've got a second human rating something out of five. You look at the difference between those two humans’ ratings of an application. On average, how much do they disagree by?”
With two human assessors, the average disagreement score was 0.92, while the disagreement between a human and the AI was 0.76 — meaning the AI’s ratings are closer to a human’s than humans are to one another.
“So, we’re getting a little bit more consistency with a human and an AI,” Bildstein added.
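This second metric is simply the mean absolute difference between two raters’ scores; a minimal sketch, again with made-up ratings:

```python
def mean_abs_difference(ratings_a, ratings_b):
    """Average absolute gap between paired 1-5 ratings."""
    diffs = [abs(a - b) for a, b in zip(ratings_a, ratings_b)]
    return sum(diffs) / len(diffs)


# Hypothetical ratings for five case studies.
human_1 = [4, 3, 5, 2, 4]
human_2 = [4, 4, 3, 2, 5]
print(mean_abs_difference(human_1, human_2))  # 0.8
```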
The third metric was correlation, which measures how similarly two raters ranked case studies overall, or as Bildstein explained: “If the first person gives a high score, is the second person likely to do the same?”
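Bildstein did not say which correlation coefficient the DTA used; a Pearson correlation, as in the sketch below, is one common choice for this kind of inter-rater comparison:

```python
from statistics import correlation  # Pearson by default; Python 3.10+

# Hypothetical ratings for five case studies.
human = [4, 3, 5, 2, 4]
ai = [4, 3, 4, 2, 4]

# A value near 1 means that when one rater scores a case study
# high, the other tends to score it high as well.
print(f"correlation: {correlation(human, ai):.2f}")  # 0.93
```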
The next stage will see the DTA use a larger data set of 6448 applications to provide more statistical weight to the preliminary results.
Bildstein added that there were “further governance and assurance boxes to tick”, but that the model could potentially be live for the next marketplace round.
As some parting advice to the Canberra audience, Bildstein said: “These days, AI is in fact the easy part.
“I would say put some real effort into your AI assurance early on because that’s probably where you will spend most of your time.
"And decide what you’re going to measure; be really clear what good looks like and what is good enough.”