State Archives and Records NSW was spooked off using Microsoft cognitive services for a pilot project over fears the authority could permanently lose control of its data.
The authority last year said it planned to run both internal and other agency pilots of machine learning to determine whether the technology could be used to automate some records classification and disposal activities.
It is now in the process of running the internal pilot on a 30GB data corpus from the defunct National Occupational Licensing Authority (NOLA); the data was deemed to fall under NSW’s jurisdiction and has been permanently archived.
However, plans to run the data through Microsoft’s Azure machine learning workbench and cognitive services were scotched after an internal risk assessment raised two potential red flags.
One of those flags was from a condition listed online that “cognitive services collect and use many types of data, such as images, audio files, video files, or text, all of which may be retained by Microsoft indefinitely to improve Microsoft products and services, without a means for you to access or delete that retained data”.
Digital state archives project officer Glen Humphries told a record managers forum last month that the clause forced the authority to reconsider its planned technology choice.
“Microsoft has got an Azure machine learning workbench and really good algorithms. We were really thinking 'this is going to be great' until I started doing the risk assessment about the data,” Humphries said.
“I got a bit deeper in paperwork on Microsoft sites and ... [I noticed] that actually the machine learning components fit under Microsoft cognitive services where all the APIs and algorithms live, and they have a really small clause in there that says they may use content and retain data for their own production development.
“All of a sudden it dawned on me these are records of NSW and they’re going to keep them forever and there’s loss of ownership. The risk is too high. We can’t do this anymore.”
Humphries said a separate clause in the user agreements required consent to be obtained from anyone in the dataset before the data was sent to an Azure API.
This became a secondary barrier to progressing on Azure, but Humphries made clear that the “main barrier was they were going to hold onto our data”.
Instead of Azure, the authority turned to its own computers. Humphries said at least some were highly spec’ed “and can run quite hefty algorithms and data through them so we are totally doing it in-house”.
The pilot is still a work in progress.
“It’s really early days for us in this field,” he said.
In addition, State Archives and Records NSW is still looking to pair up with an agency on a machine learning project that would use vendor technology to understand how machine learning could play out in other parts of NSW government.
State Archives and Records NSW is also currently undergoing a restructure that will see certain functions consolidated into other business units.
One of the changes will see the digital archives team, which looks after the machine learning pilots, folded into corporate ICT.
Digital archives team lead Richard Lehane said the team would be “expanding slightly to have new roles that are dedicated towards looking at the kind of solutions that government needs to better support records management in a digital age”.
“Machine learning fits very neatly into that because it has great potential to assist in classification and disposal, particularly of unstructured digital records such as email and network shares,” Lehane said.
“We’re also looking at a couple of other digital initiatives at the moment, including a social media framework for whole-of-government.
“We’re also doing some initial work looking at a whole-of-government digital records repository for long-term management of digital records that aren’t required as state archives.”