US and New Zealand-based security vendor WitFoo has made freely available a large-scale labelled cyber security dataset, derived from security operations centre (SOC) signals and events and featuring real-world information instead of synthetic data.
Named the Precinct 6 Cybersecurity Dataset, it contains 114 million labelled security event records drawn from production environments monitored between July and August 2024, and was developed in partnership with the University of Canterbury / Te Whare Wānanga o Waitaha.
The dataset is freely available on Hugging Face under the Apache 2.0 open-source licence, covering telemetry from 158 security products across more than 70 vendors.
There is no Australian or New Zealand data in the collection of security events, which features over 10,000 incident graphs, WitFoo said.
Of the records, 99.34 percent describe benign events, with 0.11 percent confirmed as malicious, reflecting what a real SOC sees.
The University of Canterbury's Computer Science and Software Engineering department under associate professor Etienne Borde developed the dataset specifications such as the fields to include, labelling taxonomy and more, WitFoo co-founder Charles Herring told iTnews.
WitFoo collected the data from five of the company's US-based enterprise customers, processed it with the company's tools, and undertook a four-stage sanitisation process to de-identify it.
Past attempts at creating similar data troves have relied on synthesised data, which Herring said is not useless but does not reflect real-world data patterns.
"This dataset shows how we translate live commercial, proprietary signals into a common language and then how those translated artifacts are used to create relationships, and how those relationships tell stories about theories of crime as they played out across the participating organisations," Herring said.
Herring said he expects Anthropic's yet-to-be-released large language model (LLM), Claude Mythos, to absorb the dataset.
However, he said running an LLM directly over large volumes of data, such as the terabyte most organisations produce each month, would require generative AI to process at least 250 billion tokens.
"Based on current rates, that would cost Mythos Preview US$9.38 million [$13 million] using discounted batching, or US$1.88 million using [Claude] Opus 4.6," Herring said.
The electricity required would be around 360 MWh, enough to power 33 homes for a year, Herring said, adding that this is sustainable for neither the planet nor anyone's budget.
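The token and cost figures above follow from simple linear arithmetic. A minimal back-of-envelope sketch, assuming roughly four bytes of log text per token and a hypothetical per-million-token price chosen only to illustrate how the quoted batched cost could arise (neither assumption comes from WitFoo):

```python
# Back-of-envelope sketch of the token and cost arithmetic quoted above.
# ASSUMPTIONS (not from the article): ~4 bytes of log text per token,
# and a hypothetical blended price per million tokens.

TB = 10**12            # bytes in a terabyte (decimal)
BYTES_PER_TOKEN = 4    # rough average for English/log text; an assumption

monthly_bytes = 1 * TB
tokens = monthly_bytes // BYTES_PER_TOKEN
print(f"{tokens:,} tokens")  # 250,000,000,000 -- matches the 250 billion figure


def cost_usd(tokens: int, usd_per_million_tokens: float) -> float:
    """Linear cost model: price quoted per million processed tokens."""
    return tokens / 1_000_000 * usd_per_million_tokens


# Hypothetical rate chosen only so the result lands near the quoted US$9.38m
print(f"${cost_usd(tokens, 37.52):,.0f}")  # $9,380,000
```

The point of the sketch is that cost scales linearly with telemetry volume, which is why Herring argues brute-force LLM processing of raw SOC data does not scale.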
"[It is] roughly 50 times the size of CICIDS2017 and as far as we can tell the largest dataset of its kind built from real adversary behaviour," Herring said.
WitFoo said use cases for the data could include provenance graph-based intrusion detection, AI-driven cyber defence simulation and security alert classification, as well as detection rule evaluation.
