US and New Zealand-based security vendor WitFoo has made freely available a large-scale labelled cyber security dataset, derived from security operations centre (SOC) signals and events and featuring real-world information instead of synthetic data.
Named the Precinct 6 Cybersecurity Dataset, it contains 114 million labelled security event records drawn from production environments monitored between July and August 2024, and was developed in partnership with the University of Canterbury / Te Whare Wānanga o Waitaha.
The dataset is freely available on Hugging Face under the Apache 2.0 open-source licence, covering telemetry from 158 security products across more than 70 vendors.
There is no Australian or New Zealand data in the collection of security events, which features over 10,000 incident graphs, WitFoo said.
Of the records, 99.34 percent describe benign events, with 0.11 percent confirmed as malicious, reflecting what a real SOC sees.
The University of Canterbury's Computer Science and Software Engineering department under associate professor Etienne Borde developed the dataset specifications such as the fields to include, labelling taxonomy and more, WitFoo co-founder Charles Herring told iTnews.
WitFoo collected the data from five of the company's US-based enterprise customers, processed it with the company's tools, and undertook a four-stage sanitisation process to de-identify it.
Past attempts at creating similar data troves have relied on synthesised data, which Herring said is not useless but does not reflect real-world data patterns.
"This dataset shows how we translate live commercial, proprietary signals into a common language and then how those translated artifacts are used to create relationships, and how those relationships tell stories about theories of crime as they played out across the participating organisations," Herring said.
Herring said he expects Anthropic's yet-to-be-released large language model (LLM), Claude Mythos, to absorb the dataset.
However, he said running an LLM directly over large volumes of data, such as the terabyte most organisations produce each month, would require generative AI to process at least 250 billion tokens.
"Based on current rates, that would cost Mythos Preview US$9.38 million [$13 million] using discounted batching, or US$1.88 million using [Claude] Opus 4.6," Herring said.
The electricity required would be around 360 MWh, enough to power 33 homes for a year, Herring said, adding that this is sustainable for neither the planet nor anyone's budget.
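The token and cost figures above follow from simple linear arithmetic. A minimal back-of-envelope sketch, assuming roughly four bytes of log text per token and a hypothetical per-million-token price chosen only to illustrate how the quoted batched cost could arise (neither assumption comes from WitFoo):

```python
# Back-of-envelope sketch of the token and cost arithmetic quoted above.
# ASSUMPTIONS (not from the article): ~4 bytes of log text per token,
# and a hypothetical blended price per million tokens.

TB = 10**12            # bytes in a terabyte (decimal)
BYTES_PER_TOKEN = 4    # rough average for English/log text; an assumption

monthly_bytes = 1 * TB
tokens = monthly_bytes // BYTES_PER_TOKEN
print(f"{tokens:,} tokens")  # 250,000,000,000 -- matches the 250 billion figure


def cost_usd(tokens: int, usd_per_million_tokens: float) -> float:
    """Linear cost model: price quoted per million processed tokens."""
    return tokens / 1_000_000 * usd_per_million_tokens


# Hypothetical rate chosen only so the result lands near the quoted US$9.38m
print(f"${cost_usd(tokens, 37.52):,.0f}")  # $9,380,000
```

The point of the sketch is that cost scales linearly with telemetry volume, which is why Herring argues brute-force LLM processing of raw SOC data does not scale.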
"[It is] roughly 50 times the size of CICIDS2017 and as far as we can tell the largest dataset of its kind built from real adversary behaviour," Herring said.
WitFoo said use cases for the data could include provenance graph-based intrusion detection, AI-driven cyber defence simulation and security alert classification, as well as detection rule evaluation.
