BitTorrent researchers stand by sponsored study

Call for others to replicate.

One of the authors of a contentious study of BitTorrent trackers effectively dismissed a critique of the work by a major torrent blog overnight, challenging the critics to produce some "comparable research".

The Internet Commerce Security Laboratory (ICSL) study, partly paid for by Village Roadshow, found "at least 89.9 percent" of a sample of 1,000 popular torrents infringed copyright.

The study was lampooned by file-sharing blog Torrent Freak, which claimed it was riddled with mistakes and that its conclusions were based on "painfully inaccurate data".

ICSL head Paul Watters forwarded what appeared to be part of written correspondence sent directly to the Torrent Freak site.

It thanked Torrent Freak "for your enquiry" and conceded the site "raised some interesting points that are fundamental to the validity of any study in this area: the sampling strategy; verification of results and so on."

"As researchers, we not only stand by the findings that we have arrived at, but - having made our methodology public - we are providing other bona fide researchers to replicate and/or dispute our findings," Watters said.

"Their results can in turn be assessed through the peer review process; this is the process that normal research activity takes.

"We believe that our methodology was rigorously applied to the sample that we obtained. Over time, we will replicate the sampling process, so that we will gain better estimates of the population results.

"This is the fundamental tenet of statistical sampling; I would be happy to send you a complimentary of my O'Reilly "Statistics [in] a Nutshell" book that might give further insight into statistical methodology.

"We look forward to reading the results of any comparable research that you produce!"

Watters didn't appear to have responded directly to Torrent Freak's criticisms of the methodology itself.

The study was released just over a week before the film studio that commissioned it heads back to the Federal Court to appeal an unfavourable decision in its long-running case against ISP iiNet.

The film industry wants to make iiNet and other ISPs responsible for the copyright infringing actions of internet users on their networks.


Comments: 9
ITrant
Jul 27, 2010 1:28 PM
Without equally funded peer review studies, this report is meaningless. Corporations have long realised they can 'write the history' if they fund a study.

The Torrent Freak author taught statistics and research methods to PhD students and has far greater specialist knowledge. How about Village Roadshow fund that review of the report?
Ezy2Confuze
Jul 27, 2010 2:33 PM
Blind Freddy can see AFACT using this against iiNet.

When Conjob loses his seat in the upcoming election, he should apply for a job at AFACT. It's a perfect fit: neither of them truly knows what they are talking about, both try to pull the wool over the public's eyes at every turn, and both of them know how to twist words just the right way to fit their own agendas.
Cham
Jul 27, 2010 2:35 PM
The impression I got from the TorrentFreak article wasn't just that they disagreed with the findings of the study, but that they disagreed with the methodology as well. It's all well and good to say "our methodology supports our results", but if the methodology is wrong, then the results are wrong too.
Bloemfontein78
Jul 27, 2010 2:52 PM
@Cham - actually read the study. The methodology is fairly grounded. Just because Ernesto says so doesn't make it so... Independent thought is a good start.

@Ezy2Confuze - I thought it had already been covered elsewhere that AFACT can't use it in court?

@ITrant:
Has far greater specialist knowledge than who? Eeek.



1. Ernesto claims he doesn't know how the researchers came to a 1 million+ torrent figure. From my reading of the research, it is clearly outlined in the methodology. The fact that OpenBitTorrent lists over 2 million torrents is pointless if it doesn't allow a full tracker scrape or the researchers were unable to obtain a full scrape. Did Ernesto try to do this (or replicate any part of the study)?

2. That the categorisation process is flawed is utter crap. I'm a post grad in Maths, and a sample set of 15000 is enough to get a statistically sound sample.

Anyway, TorrentFreak misinterprets (doesn't read) all the data from the IsoHunt gospel. In IsoHunt reporting, they state only 127168 torrents are online (http://isohunt.com/stats.php?mode=btSites), so > 10% of the total currently traded torrents is a huge statistical sample set and, one would think, contains books and all the other things that they claim are under-represented. How many people download an eBook on BT anyway?

3. The seed count issue they make at point 2 and 3 is probably the only valid point they (TorrentFreak) make. And the top 100 may be loaded with fakes. That said, TorrentFreak rely solely on IsoHunt data and what it is reporting. Hardly a scientific basis for critical analysis.

4. TorrentFreak/Ernesto deliberately misrepresent the data of the study. Rather than taking the lead from the executive summary of the initial research, they used the upper threshold of the analysis (whereas every other news agency mentions the 89%). The report erred on the side of caution when it stated 89% of BT was infringing, up to 97.9%, but there was a large unknown. TorrentFreak clearly uses the high figure to make their point, skewing the interpretation of the data. If TF want to seriously critically analyse the data, they should be referring to the complete set of data, not cherry picking.

The fact Ernesto taught stats once upon a time makes no difference. He is deliberately ignoring or skewing data in the initial research paper for his own (read: his/TorrentFreak/file sharers) purpose.

Actually read the research and attempt to understand it before criticising it based on what the blogosphere claims.
torrentfreak
Jul 27, 2010 6:47 PM
@Bloemfontein78

1. Fair point. Personally I found the mention of 1 million torrents misleading in reply to the question of how many torrents are shared. Also, they estimate that this figure would increase at a lower rate with more trackers being added, but this is obviously not the case.

2. The categorization process IS flawed (based on a biased sample). The sample is taken from the most seeded torrents and not all the torrents they found through their scrape. Because of this some categories are overrepresented, and others under (e.g. movies have more seeders than books generally). The size of the sample is irrelevant if your selection method (most peers) is biased.

How many people download ebooks? http://www.kickasstorrents.com/browse/

3. TorrentFreak is not relying on isoHunt, that was merely used as an example. We actually have a few machines dedicated to tracking all torrents and downloads for our weekly charts. I'd like to think that the system we built and optimized in the last 3 years is pretty good.

4. Ernesto's not misrepresenting anything. The authors do state that only 0.3% was confirmed legal.

The researchers should be ashamed of themselves for posting this weak and misleading report.
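
The sampling argument in point 2 can be illustrated with a toy example. The sketch below uses an invented population (the categories, counts and seeder numbers are assumptions made up for illustration, not data from either the ICSL study or TorrentFreak) and compares each category's share of all torrents with its share of a "most seeded" sample.

```python
from collections import Counter

# Hypothetical toy population of (category, seeders) pairs.
# All numbers are invented purely to illustrate the argument.
population = (
    [("movie", 500 + i) for i in range(20)]      # few torrents, many seeders
    + [("book", 5 + i % 3) for i in range(80)]   # many torrents, few seeders
)

def category_shares(torrents):
    """Return each category's share of the given torrents, as fractions."""
    counts = Counter(category for category, _ in torrents)
    total = sum(counts.values())
    return {category: n / total for category, n in counts.items()}

# Category shares across the whole population.
overall = category_shares(population)

# Category shares in a "most seeded" sample: sort by seeders, keep the top 20.
top_seeded = sorted(population, key=lambda t: t[1], reverse=True)[:20]
sampled = category_shares(top_seeded)

print("whole population:", overall)
print("top-seeded sample:", sampled)
```

In this made-up population books are 80% of torrents but vanish entirely from the top-seeded sample, and taking a larger top-N slice would not fix that until the slice reached into the low-seeded torrents. Whether real tracker data behaves this way is exactly what the two sides dispute.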

Bloemfontein78
Jul 27, 2010 9:28 PM
@torrentfreak - thanks for the reply.

1. Without further research, and given the IsoHunt bible states there are little over 100,000 torrents being actively seeded (shared; much less than the 5 million tracked; see summary at http://isohunt.com/stats.php?mode=btSites), the study is correct if you take at least one seeder as being a prerequisite for the file to be shared. So yes, there would be a law of diminishing returns.

2. See the point above: if at least one seed is required for a torrent to count as shared, 15000 is 15% of the population, which is huge. If you look at the distribution, I would personally hypothesize that the distribution would stay roughly the same, although without conducting my own study, I cannot confirm. The researchers could try a random sample also in the next iteration of their study. Or TF with their optimized system could produce some information?

The link you post to browse kickass torrents doesn't give any indication of torrents being actively seeded... This is something for further exploration - the bias with 15% of the population sampled is minimal at best.

3. Fair call, although that is not clear from any information which was actually posted on your site. However, I would suggest such a system which you possess could be used in order to produce a comparable survey.

4. Actually, he does:
"Here the researchers conclude that 97.9% of all files on BitTorrent are copyright infringing, and only 0.3% confirmed 'legal'"

Incorrect. The study claims 97.9% when the ambiguous titles were non-infringing and porn was not included. So he's presenting two separate sets of the result as one coherent result. The 97.9% figure was set when porn was not included and the 16 ambiguous titles were considered non-infringing, thereby allowing 2.1% of non-infringing. The 0.3% was when considering the entire population sample including porn and leaving ambiguous titles as such.

There is further research to be done, but certainly the researchers were right to stand by the study. The methodology was open, and plain to see. TF/Ernesto jumped the gun and misrepresented the data. Princeton University claimed 99%+ was infringing without publishing any methodology yet didn't come in for nearly the same criticism.

In most instances, the published study erred on the side of caution (with the exception of seed counts), so perhaps TF should show more respect to the academic process rather than merely paying lip service, misrepresenting the data presented and not providing any real proof beyond citing IsoHunt's stats.

To call this 'one of the most inaccurate reports we've seen thus far' is by far an exaggeration given the open way it was conducted and the results published.

It is TorrentFreak and Ernesto, who evidently taught stats, who should be ashamed for providing a misconstrued representation of the study as facts.
david.price
Jul 28, 2010 8:15 AM
I've been responsible for a more extensive study along these lines, examining the 10,000 most popular torrents tracked by OpenBitTorrent. We hope to publish soon but a brief breakdown is below.

I'm not here to enter the methodology arguments that surround the ICSL study outlined above. I think there may be errors in there, particularly in error checking some of the torrent swarms found to be most popular (for instance, The Incredible Hulk, a modestly popular film released in 2008, was the most seeded file on bittorrent in April 2010 with 1 *million* seeds?). But the main point to make is that the research we have produced found a significant amount of copyrighted content as well, across a sample of torrents ten times greater.

To summarise our study: we took a full scrape of OpenBT (the largest tracker when we scraped at about 1.9m unique infohashes). We then sorted these infohashes by number of *leechers*, not seeds (one reason for this is that fake files / malware are most often promoted as having many, many seeds in order to attract downloaders). We then took the top 10,000 torrents as ordered by leechers and:
i. checked infohashes against two portals
ii. weeded out fakes (a largely manual process)
iii. checked all infohashes up to the 2,000th most popular swarm against Google. In the end, all files up to the 2,000th most popular infohash were identified.
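
The ranking step described above (order by leechers rather than seeds, then keep the top N) can be sketched roughly as follows. The record layout, the infohash strings and the `top_by` helper are hypothetical and the counts are invented; this is not the code used in the study.

```python
# Hypothetical scrape records: infohash -> (seeders, leechers).
# Values are invented for illustration only.
scrape = {
    "aaa": (1_000_000, 12),   # suspiciously inflated seed count, few downloaders
    "bbb": (4_000, 9_500),
    "ccc": (2_500, 7_200),
    "ddd": (300, 40),
}

def top_by(records, field, n):
    """Return the n infohashes with the highest value of `field`."""
    index = {"seeders": 0, "leechers": 1}[field]
    ranked = sorted(records, key=lambda h: records[h][index], reverse=True)
    return ranked[:n]

by_seeders = top_by(scrape, "seeders", 2)    # led by the inflated entry
by_leechers = top_by(scrape, "leechers", 2)  # inflated entry drops out

print(by_seeders)
print(by_leechers)
```

In this toy data the entry with an implausible seed count tops the seeder ranking but does not make the leecher ranking at all, which is the effect the comment gives as one reason for sorting by leechers.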

That was a fairly long process, as you might imagine. We then categorised, analysed, etc.

So of the top 10,000:
+ 15.1% was identified as pornography
+ 25.3% of infohashes could not be identified by checks against the two portals and in some cases, Google. Subsequent analysis has led us to think that a good deal of this content is pornography (and a large amount of it appears to be found mostly on Asian forums - remember that EZTV research that found Xunlei the most used torrent client? Looks like they could all be after pornography).
+ 283 torrents were classified as fake

So after that we were left with 5,677 torrents which were non-pornography and identified. Of all these 5,677 infohashes, *only one* was non-copyrighted content - a Linux distribution.

As such, *at least* 56.76% of the content we found in the most popular 10,000 files on bittorrent was copyrighted and being shared illegitimately. I *could* say "99% of the infohashes we identified and which were not pornography were copyrighted" but I see where that kind of statement got the authors of the study above, so I won't.
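
The arithmetic behind those figures can be checked in a few lines. This sketch assumes the percentages quoted above are shares of the 10,000 torrents and that the 283 fakes are counted separately from the other two groups; the variable names are illustrative.

```python
TOP_N = 10_000

porn = round(TOP_N * 0.151)          # 15.1% identified as pornography
unidentified = round(TOP_N * 0.253)  # 25.3% could not be identified
fakes = 283                          # classified as fake

# Torrents that were identified and were not pornography.
identified_non_porn = TOP_N - porn - unidentified - fakes

# All but one (a Linux distribution) were copyrighted content.
copyrighted = identified_non_porn - 1

floor_share = copyrighted / TOP_N                      # share of the whole top 10,000
share_of_identified = copyrighted / identified_non_porn  # share of identified, non-porn

print(identified_non_porn)
print(f"{floor_share:.2%}")
print(f"{share_of_identified:.2%}")
```

Under those assumptions the remainder comes to 5,677 and the lower bound to 56.76%, matching the numbers in the comment. The second ratio is the one the commenter declines to headline.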

So to break down the top 10,000:
+ 25.3% was unknown
+ 15.1% was identified pornography, copyrighted status unknown and unexplored
+ 27.8% was films, all identified, all copyrighted
+ 14.8% was television (we sampled on a Tuesday morning from what I remember and Monday night's episode of House was the most in-demand single infohash), all identified, all copyrighted
+ 7.8% was games, all identified, all copyrighted
+ 4.1% was music, all identified, all copyrighted
The rest was a smattering of software, books, comics, a few sports broadcasts (UFC is very popular).

A few last stats:
+ the top 10,000 torrents sorted by leecher comprised 35.5% of all peers (seeds and leechers together). Top 10,000 torrents = only 0.54% of total torrents tracked by OpenBT.
+ the top 10,000 torrents represented 44.8% of all leechers - that is to say, nearly half of all active downloaders were only interested in 0.54% of the content.
+ over half of all infohashes tracked by OpenBT had no downloaders at all at the point of scrape
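
As a rough consistency check on these last figures, the quoted percentages imply a total tracker size and a concentration ratio. The sketch below uses only the numbers given in the comment; the derived values are back-of-envelope estimates, not results reported by the study.

```python
TOP_N = 10_000
top_share_of_torrents = 0.0054  # top 10,000 = 0.54% of torrents tracked
top_share_of_leechers = 0.448   # top 10,000 = 44.8% of all leechers

# Total torrents implied by "10,000 is 0.54% of the total".
implied_total = TOP_N / top_share_of_torrents

# How many times more leechers a top-10,000 torrent attracts than an
# average torrent would under an even spread.
concentration = top_share_of_leechers / top_share_of_torrents

print(f"implied total torrents: {implied_total:,.0f}")
print(f"concentration: about {concentration:.0f}x")
```

The implied total of roughly 1.85 million is in line with the "about 1.9m unique infohashes" quoted earlier, allowing for rounding in the percentages.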

As I say, we hope to publish all this in more detail soon - with a full methodology...
Ace
Jul 28, 2010 12:11 PM
Nice work people. However, are ISPs responsible for the software people run on their PCs at home? Can they legitimately snoop on traffic? If torrents are encrypted, would that even be possible?

There is clearly a problem, but short of raiding people's houses and inspecting their PCs, it is difficult to see what could be done about it.
anonymous
Jul 28, 2010 1:37 PM

@Ace, surely it is unthinkable that the torrent police could be allowed to - oh, just a minute, somebody's pounding on my front door. . .