REA Group has ramped up its defences against site scrapers and account takeover fraudsters, using technology from Australian security start-up Kasada to frustrate bots targeting its content and users.
The ASX-listed operator of property sites like realestate.com.au said it faced increasingly sophisticated attempts to scrape its content for monetisation elsewhere.
It also suspected it was the target of credential-stuffing attacks, where bots aimed reams of stolen credentials at its websites in the hope that some logins worked, allowing attackers to take over those accounts.
“One thing that [real estate listing platforms] have in common is they're pretty much a treasure trove of data,” REA Group systems manager Andrew Logue told the recent AppSec Day Australia conference.
The data is not just REA’s. Increasingly, third-party data, such as performance data of local schools, is surfaced through real estate listing sites.
“The problem with that is we're now not only safeguarding our own intellectual property, we're safeguarding somebody else's,” Logue said.
“And if you read the terms and conditions when you sign up to a third party, they're probably going to say, 'If you access our data feeds, you've got to make sure that nobody's going to scrape them'.
“We're going to put them on the public internet - so anyway, that's another hard problem.”
The bot problem
Logue said scrapers ranged from individuals running scripts for personal use, such as to be alerted to new rental listings each Monday, to more sophisticated operators that re-packaged the data and onsold it.
Some scrapers were relatively easy to identify and frustrate using “traditional” means, such as rate limiting, IP address blocking, geoblocking, or by turning on web application firewall (WAF) rules.
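As an illustration of the first of those "traditional" controls, a per-address rate limit can be sketched in a few lines of Python. This is a minimal, generic example and not REA's or any vendor's implementation; the class name, the limit of three requests and the 60-second window are assumptions chosen for the demonstration.

```python
from collections import defaultdict, deque


class SlidingWindowLimiter:
    """Allow at most `limit` requests per client key within `window` seconds."""

    def __init__(self, limit: int, window: float) -> None:
        self.limit = limit
        self.window = window
        self._hits = defaultdict(deque)  # key -> timestamps of recent requests

    def allow(self, key: str, now: float) -> bool:
        hits = self._hits[key]
        # Drop timestamps that have aged out of the window.
        while hits and now - hits[0] >= self.window:
            hits.popleft()
        if len(hits) >= self.limit:
            return False  # over the limit: reject this request
        hits.append(now)
        return True


# Hypothetical client address (documentation range) and request times in seconds.
limiter = SlidingWindowLimiter(limit=3, window=60.0)
decisions = [limiter.allow("203.0.113.7", t) for t in (0, 1, 2, 3, 61)]
print(decisions)
```

Because the limit is keyed on a single address, a client that spreads its requests across many addresses can stay under the threshold on each one.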
But Logue said scrapers were becoming more sophisticated and harder to detect as they were able to use tools and open source scripts to better blend in with normal site traffic.
“We've found that a lot of the scrapers and the people involved in potential account takeover activity [are] fairly easily circumventing a lot of the more traditional approaches,” he said.
“Scrapers are getting better, largely because they have a better set of tools at their disposal.
“You've got toolkits such as Selenium and Puppeteer - and then on the [account takeover] side you've got tools such as Sentry MBA that can be used to drip feed a whole ton of user creds under the radar using a previously disclosed breach.”
Scrapers were also able to distribute their activities over thousands - potentially millions - of distinct residential IP addresses and user agents (“crawlers”), and various geographies.
Logue noted that was “making the lives of folks out there that are managing users, managing intellectual property and trying to safeguard user identities a hell of a lot harder”.
“It means that you're not only dealing with these very specific traits,” he said.
“You've got an incredibly high-cardinality when it comes to IP addresses, locations, user agents, and most of the other traits that get associated with any kind of scraping activity or attack that you might encounter.”
High cardinality describes a field (such as a database column) that holds a large number of distinct values, many of them uncommon or unique.
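To make the term concrete, the cardinality of each trait in a set of request records is simply the count of distinct values seen for it. The sketch below assumes a hypothetical log format (a list of dictionaries with `ip`, `country` and `user_agent` keys) and made-up sample values; it is not drawn from REA's data.

```python
from collections import defaultdict


def trait_cardinality(requests):
    """Return the number of distinct values observed for each trait."""
    distinct = defaultdict(set)
    for record in requests:
        for trait, value in record.items():
            distinct[trait].add(value)
    return {trait: len(values) for trait, values in distinct.items()}


# Hypothetical request records; addresses are from a documentation range.
sample = [
    {"ip": "198.51.100.1", "country": "AU", "user_agent": "agent-a"},
    {"ip": "198.51.100.2", "country": "NZ", "user_agent": "agent-b"},
    {"ip": "198.51.100.3", "country": "AU", "user_agent": "agent-c"},
    {"ip": "198.51.100.1", "country": "AU", "user_agent": "agent-d"},
]
cardinality = trait_cardinality(sample)
print(cardinality)
```

When the counts for traits such as IP address and user agent approach the number of requests, blocking on any single value of those traits removes very little traffic.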
Logue said a scraping bid that REA called ‘The Monster’ was a case in point.
“It was a lower but much more sustained number of requests, so over time, it just blended in and made us kind of think that our base level of user traffic was a lot higher than it was,” he said.
“[It] had over 3000 distinct IPs and over 10,000 distinct user agents, and it was spread across 72 different countries.
“And you're like, Okay, where do we start? Luckily they were only trying to scrape. This wasn't an account takeover attempt, because the ramifications of that can be a lot greater.”
Logue said REA tackled ‘The Monster’ primarily by geoblocking.
“I can't remember the total count, but I think … we were talking about 50-70 different countries blocked around the world, and as we were blocking them, they were popping up elsewhere.
“Luckily, the only casualties were person hours. So despite what seemed like quite a grave attack at the time, the damage was really mitigated to probably about $100,000 worth of engineering time.
“But in terms of the opportunity cost, we'd rather spend that money elsewhere.”
Logue said that, at the time, REA had Kasada’s Polyform under proof-of-concept.
The technology uses a large amount of telemetry to distinguish between “good” and “bad” bots and human traffic, and then deploys a range of methods against sources of unwanted traffic.
“In the event of that what we thought might be a credential-stuffing attack, we actually fast tracked our proof-of-concept … into production, and it instantly stopped all of the attempts that we were seeing pop up,” Logue said.
Logue said Kasada had so far been set up to protect only “specific channels” associated with realestate.com.au, but the results had been promising.
Polyform was able to return traffic levels “back to what we think human traffic should look like” for one particular channel, knocking “out 43 percent of all the requests that were hitting our origin for that particular channel” without skewing the site’s audience metrics, Logue said.
Logue said REA also found scraping attempts - and vectors - that had previously gone undetected.
“We were getting scraped by Google Docs, Google Sheets, PHP, PowerShell … and we didn't know about any of this until we turned this device on,” he said.
Kasada’s field CTO Nick Rieniets said the company’s goal was to turn the tables on attackers.
“You're taking a situation where it's very cheap, easy and fast to attack the website, and flipping it so that it becomes very difficult, very expensive, and ultimately, very time consuming,” he said.
“The idea here is that we're now trying to make the REA website as expensive as possible to scrape, and we do that by changing the way that we defend them on a very regular basis.
“So we will allow an attacker to go down a route of investigating a particular tool of choice - maybe they're going to move from Python to a Headless Chrome environment. We'll follow that environment and then as they get more confident that's possible, we'll then neutralise them.”
Rieniets claimed Polyform offered “limitless” possibilities to the defenders of websites to frustrate “real, live attacks”.
“I could instruct the browser to solve a puzzle where I could make that puzzle super hard, and so all of a sudden, you've got no memory left in your bot,” he said.
“I could instruct the browser to download an incredibly large file and just blow that part of it up. I could consume the CPU, I could send you to the wrong information, and I could change things on the fly, so that it's totally unreliable to you and I've neutralised the output of your bot, but you still probably think that it's working because the thing about this is that we're actually delivering back to the user the response, which will not trigger to them that anything [untoward is] happening [to their bot].”
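Rieniets did not detail how such puzzles work. One widely known way to make a client spend CPU time before it is served is a hashcash-style proof of work, sketched below as a generic illustration only; it is not a description of Polyform, and the function names, the challenge string and the 12-bit difficulty are assumptions.

```python
import hashlib
from itertools import count


def _digest_value(challenge: str, nonce: int) -> int:
    """Interpret SHA-256(challenge:nonce) as a 256-bit integer."""
    digest = hashlib.sha256(f"{challenge}:{nonce}".encode()).digest()
    return int.from_bytes(digest, "big")


def solve(challenge: str, difficulty_bits: int) -> int:
    """Client side: search for a nonce whose hash has the required leading zero bits."""
    target = 1 << (256 - difficulty_bits)
    for nonce in count():
        if _digest_value(challenge, nonce) < target:
            return nonce


def verify(challenge: str, nonce: int, difficulty_bits: int) -> bool:
    """Server side: a single hash confirms the client did the work."""
    return _digest_value(challenge, nonce) < (1 << (256 - difficulty_bits))


# Each extra difficulty bit roughly doubles the expected number of hashes to solve.
nonce = solve("hypothetical-session-token", 12)
print(nonce, verify("hypothetical-session-token", nonce, 12))
```

The asymmetry is the point of the design: the defender verifies with one hash, while the solver's expected cost doubles with each added bit of difficulty, so the puzzle can be made harder for traffic judged to be automated.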
Rieniets said the aim was to “elongate and frustrate” attempts to scrape data or to run a credential-stuffing attack, to the point where the attacker “makes a decision to stop attacking REA and to move somewhere else.”
“The whole goal of changing the game of the economics of attacking this particular part of the website is to get the person that's writing the scripts to make one final statement in the battle for victory: to give up. And that's the goal.”