The University of Southern Queensland has delivered a beta search engine for open access journals, using spare capacity on AWS rather than on-campus supercomputers to crunch more than 16 million full-text records.
Around the world, a growing number of academics are opting to publish their research in open access journals or university repositories rather than through journals that end up behind a paywall online.
With more than 4000 university research repositories now on the internet, each holding many thousands of academic papers, finding a paper on a particular topic without the use of a search engine is virtually impossible.
While there are a number of existing academic search engines, such as Google Scholar, these typically index both paywalled and open access journals, so students without journal subscriptions often hit paywalls.
In a bid to make freely available research more accessible, USQ’s senior project officer of technology demonstrator projects Tim McCallum started work creating a search engine targeting open access journals, winning an AMP Tomorrow Makers grant in the process.
However, McCallum soon discovered that assembling the compute power necessary to index millions of papers quickly would be very expensive, while attempting the project cheaply with fewer resources would take years.
“The job would take about seven years to index millions of papers, and to ramp it up by buying 1000 machines at $500 each would end up costing half a million dollars,” McCallum said.
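The trade-off in that quote can be made concrete with a little arithmetic. The sketch below assumes the seven-year figure refers to a single machine working alone and that the work divides evenly across machines with no overhead; neither assumption is stated in the article, and the function name is purely illustrative.

```python
def fleet_estimate(single_machine_years, machines, price_per_machine):
    """Return (wall-clock days, upfront hardware cost) for a fleet.

    Assumes perfectly parallel work: no queueing or network overhead.
    """
    days = single_machine_years * 365 / machines
    cost = machines * price_per_machine
    return days, cost


# Figures from the quote: seven years of work, 1000 machines, $500 each.
days, cost = fleet_estimate(single_machine_years=7, machines=1000, price_per_machine=500)
print(f"{days:.1f} days, ${cost:,} upfront")
```

Under those assumptions the fleet would finish in roughly two and a half days, but only after the $500,000 hardware outlay McCallum describes.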
USQ has significant high-performance computing resources, through both partnerships such as the Queensland Cyber Infrastructure Foundation (QCIF) and its own on-campus SGI system.
However, the nature of McCallum’s project, which involved querying data from thousands of endpoints, meant distributing the workload across a large number of virtual machines on a cloud platform provider was more efficient than scheduling tasks on a supercomputer.
Buying spare compute cycles
To square the circle, McCallum began bidding for spare compute capacity from major cloud vendors.
“Buying machines upfront is expensive, and cloud computing services such as Amazon and Google Compute provide easy access to a range of services,” McCallum said.
“To reduce the price of the operation I constantly bid on spare cloud infrastructure and expect, at all times, that the infrastructure will disappear at a moment’s notice due to pricing. This new ephemeral model works just fine thanks to my upfront design and software automation which accounts for these shifts.”
The catch is that if a paying customer wants that capacity, the cloud provider will terminate the bidder's machines and sell the capacity to that customer.
“The trick is to make sure that the whole process is automated, for example creating machines and queues of work for those machines to do using software automation – so nothing traditional about this exercise at all really,” McCallum said.
“There is the small task of setting up security credentials between machines; this literally takes minutes to set up but allows a deluge of information to be pushed and pulled between machines without further human intervention.”
McCallum said this approach allows him to spawn 1000 machines and have them all hit a queue of PDF files, convert them to text, and then index them.
“If a machine does complete a task it removes the job from the queue and then takes another job from the queue, and if the queue is empty or the workload lessens the machine can terminate itself - this also keeps the costs down,” McCallum said.
“It’s a much more sophisticated and scalable approach where we can ramp up or throttle down using code rather than always using brute force.”
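The queue-driven pattern McCallum describes can be sketched in a few lines. This is not his code: `JobQueue` is a hypothetical in-memory stand-in for a hosted queue service, `Preempted` simulates the provider reclaiming a machine, and the PDF conversion step is faked. The point is only to show why a job that is removed from the queue solely on completion survives the loss of the machine working on it.

```python
import collections


class Preempted(Exception):
    """Raised to simulate the provider reclaiming a machine mid-job."""


class JobQueue:
    """In-memory stand-in for a hosted queue (hypothetical, for illustration)."""

    def __init__(self, jobs):
        self._pending = collections.deque(jobs)
        self._in_flight = {}  # worker_id -> job currently being worked on

    def take(self, worker_id):
        """Hand out the next job; it stays tracked until explicitly removed."""
        if not self._pending:
            return None
        job = self._pending.popleft()
        self._in_flight[worker_id] = job
        return job

    def remove(self, worker_id):
        """Called only after the job has completed successfully."""
        del self._in_flight[worker_id]

    def reclaim(self, worker_id):
        """Return a vanished worker's job so another machine can take it."""
        job = self._in_flight.pop(worker_id, None)
        if job is not None:
            self._pending.append(job)

    def __len__(self):
        return len(self._pending) + len(self._in_flight)


def run_worker(worker_id, queue, process, index):
    """Take jobs until the queue is empty, then return (self-terminate)."""
    while True:
        job = queue.take(worker_id)
        if job is None:
            return  # queue empty: a real machine would shut itself down here
        try:
            index[job] = process(job)
        except Preempted:
            queue.reclaim(worker_id)  # a real queue would do this via a timeout
            return
        queue.remove(worker_id)


def make_flaky_converter(fail_on):
    """Return a fake PDF-to-text step that is preempted once on `fail_on`."""
    state = {"failed": False}

    def convert(pdf_name):
        if pdf_name == fail_on and not state["failed"]:
            state["failed"] = True
            raise Preempted(pdf_name)
        return f"text of {pdf_name}"

    return convert


queue = JobQueue(["a.pdf", "b.pdf", "c.pdf"])
index = {}
convert = make_flaky_converter(fail_on="b.pdf")
run_worker("machine-1", queue, convert, index)  # lost while working on b.pdf
run_worker("machine-2", queue, convert, index)  # picks up the remaining work
```

In this toy run the first machine disappears partway through its second job, yet all three files end up indexed because the unfinished job goes back on the queue for the second machine to pick up.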
While the open access search engine has predominantly used AWS so far, McCallum said there was no reason he wouldn’t work with either Google Compute or Microsoft Azure in the future.
“Amazon offers a robust suite of storage, queuing and computing infrastructure which can all be controlled by computer code,” McCallum said.
“I am not biased to any cloud platform and will continue to work with them all in order to maintain an edge for seeing opportunities.”
A beta version of the search engine is now live, having indexed around 16 million records. It offers traditional search results as well as a Google Books-style ngram interface.
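An ngram interface of the Google Books kind charts how often a word sequence appears across a body of text over time. The article does not describe how USQ's version is built, so the following is only a minimal sketch of the underlying counting step, using a made-up two-document corpus keyed by year and a hypothetical `ngram_counts` helper.

```python
from collections import Counter


def ngram_counts(text, n):
    """Count contiguous n-word sequences in lowercased, whitespace-split text."""
    words = text.lower().split()
    grams = (" ".join(words[i:i + n]) for i in range(len(words) - n + 1))
    return Counter(grams)


# Hypothetical mini-corpus keyed by publication year.
corpus = {
    2014: "open access journals and open access repositories",
    2015: "open access search",
}

# Occurrences of the two-word phrase "open access" per year.
by_year = {year: ngram_counts(text, 2)["open access"] for year, text in corpus.items()}
```

A production index would need proper tokenisation and normalisation by corpus size, which this sketch leaves out.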
“Most of the back-end or scaffolding is now done; I’d now like to make the front end a lot more slick,” McCallum said.