USQ's open access search project hits 16 million records

By Andrew Sadauskas

May 10 2016 12:00PM

Spare capacity on AWS used to full-text index journals.

The University of Southern Queensland has delivered a beta search engine for open access journals, using spare capacity on AWS rather than on-campus supercomputers to crunch more than 16 million full-text records.

Around the world, a growing number of academics are opting to publish their research in open access journals or university repositories rather than through journals that end up behind a paywall online.

With more than 4000 university search repositories now on the internet, each holding many thousands of academic papers, finding a paper on a particular topic without the use of a search engine is virtually impossible.

While there are a number of existing academic search engines, such as Google Scholar, these typically index both paywalled and open access research journals, meaning students without journal subscriptions often hit paywalls.

In a bid to make freely available research more accessible, USQ’s senior project officer of technology demonstrator projects Tim McCallum started work creating a search engine targeting open access journals, winning an AMP Tomorrow Makers grant in the process.

However, McCallum soon discovered that assembling the compute power needed to index millions of papers quickly would be very expensive, while running the project cheaply on fewer resources would take years.

“The job would take about seven years to index millions of papers, and to ramp it up by buying 1000 machines at $500 each would end up costing half a million dollars,” McCallum said.
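The trade-off McCallum describes can be checked with some back-of-envelope arithmetic. The sketch below assumes, purely for illustration, that the seven-year figure refers to a single machine and that the work divides evenly across machines with no coordination overhead; neither assumption is stated in the article.

```python
# Back-of-envelope arithmetic for the trade-off quoted above.
# Assumptions (not from the article): the seven-year estimate is for ONE
# machine, and the job parallelises perfectly across machines.

SINGLE_MACHINE_YEARS = 7   # "about seven years" (quoted)
MACHINES = 1000            # "1000 machines" (quoted)
PRICE_PER_MACHINE = 500    # "$500 each" (quoted)
DAYS_PER_YEAR = 365


def upfront_cost(machines: int, price_each: int) -> int:
    """Capital cost of buying the machines outright, in dollars."""
    return machines * price_each


def wall_clock_days(single_machine_years: float, machines: int) -> float:
    """Elapsed days if the work splits evenly across `machines`."""
    return single_machine_years * DAYS_PER_YEAR / machines


cost = upfront_cost(MACHINES, PRICE_PER_MACHINE)
days = wall_clock_days(SINGLE_MACHINE_YEARS, MACHINES)

print(f"Upfront hardware cost: ${cost:,}")
print(f"Idealised elapsed time on {MACHINES} machines: {days:.1f} days")
```

Under those assumptions the hardware bill matches the half a million dollars McCallum quotes, while the elapsed time falls from years to a few days, which is the gap that renting short-lived capacity is meant to close.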

USQ has significant high-performance computing resources, through both partnerships such as the Queensland Cyber Infrastructure Foundation (QCIF) and its own on-campus SGI system.

However, the nature of McCallum’s project, which involved querying data from thousands of endpoints, meant distributing the workload across a large number of virtual machines on a cloud platform provider was more efficient than scheduling tasks on a supercomputer.

Buying spare compute cycles

To square the circle, McCallum began bidding for spare compute capacity from major cloud vendors.

“Buying machines upfront is expensive, and cloud computing services such as Amazon and Google Compute provide easy access to a range of services,” McCallum said.

“To reduce the price of the operation I constantly bid on spare cloud infrastructure and expect, at all times, that the infrastructure will disappear at a moment’s notice due to pricing. This new ephemeral model works just fine thanks to my upfront design and software automation which accounts for these shifts.”

The catch with spare capacity is that if a customer comes along willing to pay for it, the cloud provider will terminate the bidder's machines and sell the capacity to that customer.
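One common way to survive machines vanishing mid-task is to remove a job from the queue only once a worker confirms it is finished. The article does not say which queueing service or mechanism McCallum used, so the following is a hypothetical in-memory sketch; the class `LeasedQueue` and its methods are invented for illustration. Managed queues such as Amazon SQS offer comparable behaviour through a visibility timeout.

```python
from collections import deque
from typing import Dict, Iterable, Optional


class LeasedQueue:
    """Toy queue in which a job must be acknowledged to be removed.

    Hypothetical illustration: a job that is received but never
    acknowledged (say, because its machine was terminated) becomes
    available again once its lease expires.
    """

    def __init__(self, jobs: Iterable[str], lease_seconds: float) -> None:
        self._ready = deque(jobs)
        self._leased: Dict[str, float] = {}  # job -> lease expiry time
        self._lease_seconds = lease_seconds

    def receive(self, now: float) -> Optional[str]:
        """Hand out the next job, first recycling any expired leases."""
        for job, expiry in list(self._leased.items()):
            if expiry <= now:
                del self._leased[job]
                self._ready.append(job)
        if not self._ready:
            return None
        job = self._ready.popleft()
        self._leased[job] = now + self._lease_seconds
        return job

    def ack(self, job: str) -> None:
        """Permanently remove a finished job."""
        self._leased.pop(job, None)

    def outstanding(self) -> int:
        """Jobs still waiting or currently leased."""
        return len(self._ready) + len(self._leased)


work_queue = LeasedQueue(["paper-1.pdf", "paper-2.pdf"], lease_seconds=30)

job = work_queue.receive(now=0)     # machine A takes paper-1.pdf ...
# ... and is terminated by the provider before it can acknowledge it.

job_b = work_queue.receive(now=10)  # machine B takes paper-2.pdf
work_queue.ack(job_b)               # and finishes it

retry = work_queue.receive(now=45)  # the lease on paper-1.pdf has expired
work_queue.ack(retry)

print(retry, work_queue.outstanding())
```

Time is passed in explicitly as `now` so the example is deterministic; a real worker would read the clock instead.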

“The trick is to make sure that the whole process is automated, for example creating machines and queues of work for those machines to do using software automation – so nothing traditional about this exercise at all really,” McCallum said.

“There is the small task of setting up security credentials between machines, this literally takes minutes to set up but allows a deluge of information to be pushed and pulled between machines without further human intervention.”

McCallum said this approach allows him to spawn 1000 machines and have them all hit a queue of PDF files, convert them to text, and then index them.
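The convert-then-index step can be pictured with a small inverted index. This is a minimal sketch, not the project's actual code: the function `fake_pdf_to_text` stands in for a real PDF-to-text converter (the article does not name one), and the sample documents are made up.

```python
import re
from collections import defaultdict
from typing import Callable, Dict, Iterable, List, Set, Tuple

TOKEN = re.compile(r"[a-z0-9]+")


def tokenize(text: str) -> List[str]:
    """Lower-case the text and split it into alphanumeric terms."""
    return TOKEN.findall(text.lower())


def build_index(
    documents: Iterable[Tuple[str, bytes]],
    to_text: Callable[[bytes], str],
) -> Dict[str, Set[str]]:
    """Map each term to the set of document ids that contain it."""
    index: Dict[str, Set[str]] = defaultdict(set)
    for doc_id, raw in documents:
        for term in tokenize(to_text(raw)):
            index[term].add(doc_id)
    return dict(index)


def search(index: Dict[str, Set[str]], query: str) -> Set[str]:
    """Return ids of documents containing every term in the query."""
    terms = tokenize(query)
    if not terms:
        return set()
    results = set(index.get(terms[0], set()))
    for term in terms[1:]:
        results &= index.get(term, set())
    return results


def fake_pdf_to_text(raw: bytes) -> str:
    """Placeholder converter: treats the bytes as UTF-8 text."""
    return raw.decode("utf-8")


# Made-up sample documents, for illustration only.
docs = [
    ("paper-1", b"Open access publishing in Australian universities"),
    ("paper-2", b"Cloud computing for open research data"),
]
index = build_index(docs, fake_pdf_to_text)

print(sorted(search(index, "open access")))
print(sorted(search(index, "open")))
```

Because each document is converted and indexed independently, the work splits naturally into one queue entry per file, which is what makes it suitable for a large fleet of short-lived machines.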

“If a machine does complete a task it removes the job from the queue and then takes another job from the queue, and if the queue is empty or the workload lessens the machine can terminate itself - this also keeps the costs down,” McCallum said.
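The worker behaviour McCallum describes, take a job, finish it, take another, and shut down when nothing is left, can be sketched with threads standing in for machines. The names `worker` and `run_fleet` are hypothetical, and the upper-casing step is only a placeholder for the real convert-and-index work. One simplification to note: Python's `queue.Queue` removes a job as soon as it is fetched, rather than on completion.

```python
import queue
import threading
from typing import List, Tuple


def worker(name: str, jobs: "queue.Queue[str]",
           done: List[Tuple[str, str]], lock: threading.Lock) -> None:
    """Process jobs until the queue is empty, then exit."""
    while True:
        try:
            job = jobs.get_nowait()
        except queue.Empty:
            return  # stand-in for the machine terminating itself
        result = job.upper()  # placeholder for convert + index
        with lock:
            done.append((name, result))
        jobs.task_done()


def run_fleet(job_names: List[str], fleet_size: int) -> List[Tuple[str, str]]:
    """Start `fleet_size` workers and wait until all of them have exited."""
    jobs: "queue.Queue[str]" = queue.Queue()
    for job_name in job_names:
        jobs.put(job_name)

    done: List[Tuple[str, str]] = []
    lock = threading.Lock()
    threads = [
        threading.Thread(target=worker,
                         args=(f"machine-{i}", jobs, done, lock))
        for i in range(fleet_size)
    ]
    for thread in threads:
        thread.start()
    for thread in threads:
        thread.join()
    return done


completed = run_fleet([f"paper-{n}.pdf" for n in range(10)], fleet_size=3)
print(len(completed), "jobs completed")
```

Changing `fleet_size` is the code-level equivalent of ramping up or throttling down: the queue and the worker logic stay the same whether three workers or a thousand are started.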

“It’s a much more sophisticated and scalable approach where we can ramp up or throttle down using code rather than always using brute force.”

While the open access search engine has predominantly used AWS so far, McCallum said there was no reason he wouldn’t work with either Google Compute or Microsoft Azure in the future.

“Amazon offers a robust suite of storage, queuing and computing infrastructure which can all be controlled by computer code,” McCallum said.

“I am not biased to any cloud platform and will continue to work with them all in order to maintain an edge for seeing opportunities.”

A beta version of the search engine is now live, having indexed around 16 million records. It offers traditional search results and a Google Books-style ngram interface.
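An ngram interface of this kind typically charts how often a phrase appears over time. As a rough sketch of the underlying counting, assuming each record carries a publication year and its full text (the article does not describe the project's data model, and the sample records below are made up):

```python
import re
from collections import Counter, defaultdict
from typing import Dict, Iterable, List, Tuple


def ngrams(text: str, n: int) -> List[str]:
    """Return every run of n consecutive words in the text."""
    words = re.findall(r"[a-z0-9]+", text.lower())
    return [" ".join(words[i:i + n]) for i in range(len(words) - n + 1)]


def counts_by_year(records: Iterable[Tuple[int, str]],
                   n: int) -> Dict[int, Counter]:
    """Count n-gram occurrences separately for each publication year."""
    table: Dict[int, Counter] = defaultdict(Counter)
    for year, text in records:
        table[year].update(ngrams(text, n))
    return table


# Made-up sample records, for illustration only.
records = [
    (2014, "Open access journals and open access repositories"),
    (2015, "Open access search for open data"),
]
table = counts_by_year(records, n=2)

for year in sorted(table):
    print(year, table[year]["open access"])
```

For the sample above, the phrase "open access" is counted twice in 2014 and once in 2015.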

“Most of the back-end or scaffolding is now done, I’d now like to make the front end a lot more slick,” McCallum said.
