The Australian Broadcasting Corporation is using machine learning to extract metadata from text, podcasts and other forms of media, making them easier to find via a new search engine.
Machine learning engineer Gareth Seneque told the YOW! Data 2019 conference that the ABC moved out of beta in February this year with a new search engine based on technology from US startup Algolia (which also runs search for the likes of Twitch and Stripe).
The search domain still sports beta labelling but is in full production use.
“There are reasons for [the url] behind the scenes - stuff involving CMS migrations and the like that I won't detour into - but we're very much in the scaling up and out phase of things,” Seneque said.
ABC’s existing search indexes about 600,000 bits of content from the past decade, including 230,000 articles, 270,000 pieces of audio and 85,000 videos.
But Seneque said user feedback on search was poor.
“Specifically, content types were not supported, indexing speeds were slow, as stuff would take a while to show up in the index, and the relevance of results was poor,” he said.
“The audience experienced difficulty while finding content as well due to accessibility reasons.”
The new Algolia-powered search is expected to be rolled out across all of ABC’s digital properties, including iView and the listen app.
But before that occurs, Seneque’s team is working to improve the metadata recorded against individual pieces of content - particularly podcasts and video - to make these more searchable.
“Our challenges are twofold: How do we get people to use our search engine? And once they’re using it, how do we deliver the most relevant content?” Seneque said.
Seneque noted the challenge that every digital property developer faced with search - making it function as well as Google does.
“People use Google every day,” he said.
“It sets the benchmark for what a search engine can do and conditions people's expectations accordingly.
“We have a public feedback form and see this kind of stuff all the time. We get comments like ‘Why won't the search answer my question’ or ‘Why don't you provide results that include biographies of presenters’.”
The second challenge is to provide relevant results to searchers - and, in most cases, that means improving the metadata recorded against each piece of content.
Seneque said the obvious way to improve metadata was to automate the selection of keywords and tags as much as possible, instead of relying on journalists to do it manually.
“We have a number of different content teams, each with their own standards for generating metadata, all doing this over many, many years, meaning staff and processes change,” he said.
“These folks are busy actually making content, so using automated systems to help them makes sense.
“Our team sits adjacent to the content development pipeline. We pick up the content after it has been created and published, and transform it into a searchable state.
“Clearly if we want to deliver relevant results, we need consistency of metadata and coverage across as many attributes as possible for all objects in our index.
“If, along the way, we can build a system that, say, the CMS can plug into to suggest metadata for teams to include should they choose to, why not?”
But ABC faced a separate issue with content such as podcasts.
These were rarely transcribed or converted into text, and therefore “had very little in the way of metadata” associated with the files, “yet were a key content type for our audience”, Seneque said.
The answer was to use some form of speech-to-text to transcribe podcast content.
This would provide some keywords to make them searchable. The transcripts could also be a useful accompaniment to the audio files, assuming they were accurate enough.
“The answer - but perhaps not the solution - was obvious: get machines to do it cheaply,” Seneque said.
“I emphasise cheaply because, as I'm sure many of you will have experienced when trying to implement completely new things in an organisation, it's difficult to get large amounts of budget to test unproven ideas.”
That led Seneque to initially experiment with Mozilla's open source implementation of Baidu's Deep Speech system, touted as an accurate, non-proprietary speech-to-text tool.
“It's freely available on GitHub along with pre-trained models, so it seemed like a reasonable starting point for our experiments,” he said.
One of the drawbacks of the model, however, was the “massive amount of memory” required for operations, meaning experiments were done on short clips of audio.
The pre-trained model also struggled with portions of a 30-second clip from Radio National’s Rear Vision podcast.
“Generating words for dates is to be expected, but producing ‘guly’ instead of ‘arguably’ and ‘hearth’ instead of ‘Earth’, as well as jamming words together, clearly points to more fundamental problems or challenges that would require training our own models,” Seneque said.
“This would be on top of a data pipeline that slices up and maintains mappings of podcasts that run up to 90 minutes plus in length.
“So the engineering required to build such a system is obviously not trivial, and we needed to prove our idea without that kind of investment of resource effort.”
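The slicing and mapping step Seneque describes can be pictured with a minimal Python sketch. It is an illustration only, not the ABC's pipeline: the `Chunk` type, the `plan_chunks` function, the 30-second window and the two-second overlap are all assumptions chosen for the example.

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class Chunk:
    index: int      # position of the chunk within the episode
    start_s: float  # chunk start, as an offset into the original audio (seconds)
    end_s: float    # chunk end, as an offset into the original audio (seconds)


def plan_chunks(duration_s: float, chunk_s: float = 30.0,
                overlap_s: float = 2.0) -> List[Chunk]:
    """Plan fixed-length windows over an episode, keeping the original offsets.

    The window and overlap sizes are hypothetical defaults, not ABC settings.
    """
    if chunk_s <= overlap_s:
        raise ValueError("chunk_s must be larger than overlap_s")
    chunks: List[Chunk] = []
    step = chunk_s - overlap_s
    start = 0.0
    while start < duration_s:
        end = min(start + chunk_s, duration_s)
        chunks.append(Chunk(len(chunks), start, end))
        if end >= duration_s:
            break  # the last window reached the end of the episode
        start += step
    return chunks


# A 70-second clip becomes three overlapping windows.
for chunk in plan_chunks(70.0):
    print(chunk)
```

Keeping the start and end offsets with each chunk is what would allow words transcribed from a short clip to be mapped back to their position in the full episode.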
Seneque changed tack and shifted to using machine learning as-a-service via AWS Transcribe.
“Long story short, we found the service to be very effective at generating good enough transcripts relatively cheaply,” he said.
“By that I mean, good enough for metadata, but not good enough to necessarily present to our audience - something you might consider an obvious next step.
“There are specific requirements [for audience presentation] around accessibility - things like identifying sounds in audio content, that Transcribe is not yet capable of, although I have full faith in Amazon to figure it out in due course.”
Seneque found AWS Transcribe handled names and domain-specific terms more accurately.
“We still get dates as words, but overall the transcript looks much better. The individuals' names are close with a couple being correct and a couple not,” Seneque said.
“We can also see that the program name is mistranscribed, but AWS Transcribe offers features to improve these kinds of issues, including custom vocabularies and speaker identification.
“We will be exploring the use of these in the future.”
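As a rough sketch of what enabling those two features might look like, the Python below assembles the arguments for Amazon Transcribe's `StartTranscriptionJob` operation, which accepts a custom vocabulary name and speaker-label options under `Settings`. The job name, S3 location, vocabulary name and the `build_transcribe_request` helper are hypothetical, and the ABC has not said how it would configure the service.

```python
from typing import Any, Dict, Optional


def build_transcribe_request(
    job_name: str,
    media_uri: str,
    vocabulary_name: Optional[str] = None,
    max_speakers: Optional[int] = None,
) -> Dict[str, Any]:
    """Assemble keyword arguments for Transcribe's StartTranscriptionJob call."""
    settings: Dict[str, Any] = {}
    if vocabulary_name:
        # A custom vocabulary must already exist in the AWS account.
        settings["VocabularyName"] = vocabulary_name
    if max_speakers:
        # Speaker identification needs both flags set together.
        settings["ShowSpeakerLabels"] = True
        settings["MaxSpeakerLabels"] = max_speakers
    request: Dict[str, Any] = {
        "TranscriptionJobName": job_name,
        "LanguageCode": "en-AU",
        "MediaFormat": "mp3",
        "Media": {"MediaFileUri": media_uri},
    }
    if settings:
        request["Settings"] = settings
    return request


request = build_transcribe_request(
    job_name="rear-vision-example",                   # hypothetical job name
    media_uri="s3://example-bucket/rear-vision.mp3",  # hypothetical S3 location
    vocabulary_name="abc-program-names",              # hypothetical vocabulary
    max_speakers=2,
)
print(request)

# With AWS credentials configured, the request could be submitted with boto3:
#   import boto3
#   boto3.client("transcribe").start_transcription_job(**request)
```

Building the request separately from the network call keeps the example runnable without an AWS account.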
Encouraging early results
Seneque’s small team has now built an automated metadata platform out of AWS components.
This extracts metadata from content and feeds it into the new search engine to improve search results. It also includes a serverless process that picks up podcasts, sends them off for transcription, grabs the results and pushes them into the search index. The early results are promising.
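The "grab the results and push them into the index" step might look something like the Python sketch below, which turns a Transcribe result document into a search record. The keyword selection here is a naive word-frequency count standing in for whatever the ABC actually uses, which the talk did not detail. The stopword list, the `transcript_to_record` function and the record fields other than Algolia's `objectID` identifier are assumptions.

```python
import json
import re
from collections import Counter
from typing import Any, Dict

# Tiny illustrative stopword list; a real system would use a fuller one.
STOPWORDS = {"the", "a", "an", "and", "of", "to", "in", "is", "it",
             "that", "on", "for", "was", "with", "as"}


def transcript_to_record(object_id: str, transcribe_json: str,
                         max_keywords: int = 5) -> Dict[str, Any]:
    """Convert a Transcribe result document into a search record."""
    payload = json.loads(transcribe_json)
    # Transcribe stores the full text under results.transcripts[0].transcript.
    text = payload["results"]["transcripts"][0]["transcript"]
    words = re.findall(r"[a-z']+", text.lower())
    counts = Counter(w for w in words if w not in STOPWORDS and len(w) > 2)
    keywords = [word for word, _ in counts.most_common(max_keywords)]
    return {"objectID": object_id, "transcript": text, "keywords": keywords}


# Hypothetical Transcribe output for a very short clip.
sample = json.dumps({
    "results": {"transcripts": [{"transcript": "The reef is warming. The reef needs help."}]}
})
record = transcript_to_record("podcast-episode-1", sample, max_keywords=2)
print(record["keywords"])
```

In a serverless setup a function like this would typically run when a finished transcript lands in storage, with the returned record then sent to the search index.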
“For the top articles on the news website, we see an average increase of 280 percent in terms of the number of keywords attached to the relevant search object,” he said, though he noted some of the keywords automatically applied were less useful than others.
“We also see a 22 percent increase in the number of results returned for popular search terms involving audio content.
“This is of course a pretty fuzzy metric. But when search is integrated into our listen app, where 91 percent of our podcast audience is, we'll have more in the way of hard data.”
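For readers checking the arithmetic, a 280 percent increase means the keyword count ends up at 3.8 times its starting value, not 2.8 times. The short Python sketch below shows the calculation with made-up counts; the figures of 5 and 19 keywords are illustrative only and do not come from the ABC.

```python
def percent_increase(before: float, after: float) -> float:
    """Return the percentage increase from `before` to `after`."""
    if before <= 0:
        raise ValueError("before must be positive")
    return (after - before) * 100 / before


# Hypothetical article: 5 keywords before automation, 19 afterwards.
increase = percent_increase(5, 19)
print(increase)  # 280.0
```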