A new document leak by former US National Security Agency contractor Edward Snowden sheds light on how Britain's main signals intelligence agency captures as much data flowing across the communications links and the internet as possible for processing.
Dated September 20, 2011, and first published by Boing Boing, the redacted 96-page Heilbronn Institute for Mathematical Research Data Mining Research Problem Book is marked top secret, to be shared only with the UK's Five Eyes partners.
As of 2011, the UK Government Communications Headquarters' (GCHQ) Special Source Collection wiretapping technology could keep up with 10 gigabit per second internet circuits.
The spy agency had connected probes to around 200 of the links since 2008, with the intercepted data being processed at GCHQ Cheltenham in Gloucestershire, Bude in Cornwall and "LECKWITH", sited in Oman.
According to the documents, the GCHQ has "access to many more bearers than we can have cover on at any one time, and the set we have on cover is changed to meet operational needs".
The amount of data collected from the links is massive: a single 10Gbps link produces so much information it is "far too much to store, or even to process in any complicated way", the GCHQ said.
To make the processing manageable, the GCHQ discards the vast majority of data packets captured. Software called the packet processing framework (PPF) filters incoming data, with matching information being sent to the TERRAIN platform for more complicated processing.
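The filter-then-forward pattern attributed to the PPF can be sketched as follows. This is an illustrative toy, not GCHQ code: the `Packet` shape, the selector set and the matching rule are all invented, and the point is only that non-matching packets are dropped immediately while matches are passed on for deeper processing.

```python
# Hypothetical sketch of selector-based packet filtering: packets matching
# no selector are discarded at once; matches are kept for further analysis.
from dataclasses import dataclass

@dataclass
class Packet:
    src: str       # source address
    dst: str       # destination address
    payload: bytes

def filter_packets(packets, selectors):
    """Keep only packets whose source or destination matches a selector."""
    kept = []
    for pkt in packets:
        if pkt.src in selectors or pkt.dst in selectors:
            kept.append(pkt)   # forward for more complicated processing
        # everything else is dropped immediately to keep the volume manageable
    return kept

traffic = [
    Packet("10.0.0.1", "192.0.2.7", b"hello"),
    Packet("10.0.0.2", "198.51.100.9", b"world"),
]
matches = filter_packets(traffic, selectors={"192.0.2.7"})
```

Discarding early is the whole trick: only the small matching fraction of a 10Gbps stream ever needs storage or complex analysis.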
GCHQ aims to separate communications metadata, which the agency is freer to capture at will, from the content of voice calls and internet-transmitted messages, which is covered by "extremely stringent legal and policy constraints".
However, the agency said that while metadata (phone numbers) is easy to separate from content (voice calls) in traditional telephony, the two are much harder to keep apart on the internet.
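The difficulty can be illustrated with a toy example (my own, not from the document): in telephony, metadata and content arrive as separate records, but in an internet protocol such as HTTP both live in the same byte stream and must be parsed apart before one can be kept and the other discarded.

```python
# Illustrative sketch: splitting a raw HTTP request into its header block
# (metadata-like) and its body (content). The request bytes are invented.
def split_http(raw: bytes):
    """Split a raw HTTP request at the blank line separating headers from body."""
    head, _, body = raw.partition(b"\r\n\r\n")
    return head, body

request = b"POST /mail HTTP/1.1\r\nHost: example.com\r\n\r\nDear Alice..."
headers, body = split_http(request)
```

Even this simple split requires protocol-aware parsing, and real traffic mixes many protocols, encodings and encryption, which is why the metadata/content boundary is far blurrier online than on the phone network.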
Botnets such as Conficker are of interest to the GCHQ, which detects them to track potential criminal and foreign government computer network exploitation (CNE) operations. The GCHQ also engages in CNE operations itself.
"GCHQ's first CNE operation was carried out in the early nineties, and since then CNE has grown to the scale of a small industry in GCHQ," the document said.
The spy agency also tries to keep lists of payphones in different countries for metadata collection.
IBM's InfoSphere Streams, which started out as a research project between the IT giant and the NSA with input from GCHQ, and which the spy agencies refer to as DISTILLERY, is used to process data as it arrives.
Building a processing flow from streams of data saves on storage, and also provides "near-real-time tipping" when an event of interest occurs.
"We can typically provide a tip-off within a second of the event occurring, although the latency of the analyst is somewhat higher," the document notes.
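The stream-processing model behind such tipping can be sketched in a few lines. This is a minimal illustration of the pattern, not DISTILLERY itself: events are examined as they arrive and a callback fires the moment one matches a rule, rather than everything being stored and queried later. The event and rule shapes here are invented.

```python
# Minimal sketch of stream processing with near-real-time "tipping":
# each event is checked on arrival; matches trigger an immediate tip-off,
# and non-matching events are processed and forgotten, saving storage.
def run_stream(events, predicate, tip):
    for event in events:
        if predicate(event):
            tip(event)          # fires within moments of the event occurring

tips = []
run_stream(
    events=[{"type": "dns", "name": "benign.example"},
            {"type": "dns", "name": "conficker-c2.example"}],
    predicate=lambda e: "conficker" in e["name"],
    tip=tips.append,
)
```

Because nothing is retained beyond the events that trip a rule, this style of processing keeps up with link-rate data that would be impossible to store in full.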
In 2011, the GCHQ used the Streams Processing Language (SPL) to write applications for DISTILLERY, running them on InfoSphere Streams version 2. Older applications were written in the Streams Processing Application Declarative Engine (SPADE) and ran on version 1 of InfoSphere Streams, with the GCHQ converting them to SPL.
The open source Hadoop distributed data processing package is also used by the GCHQ.
Hadoop is suggested for use "whenever you want to batch process a large amount of static data" such as the multi-terabyte datasets the GCHQ intercepts.
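The batch model Hadoop implements is MapReduce, which can be sketched in pure Python. This toy (my own, not from the document) shows the three phases: a map step emits key/value pairs, a shuffle groups them by key, and a reduce step aggregates each group; counting contacted hosts is an invented example.

```python
# Pure-Python sketch of the MapReduce pattern: map, shuffle (group by key),
# then reduce each group. Hadoop runs the same phases across many machines.
from collections import defaultdict

def map_reduce(records, mapper, reducer):
    groups = defaultdict(list)
    for rec in records:
        for key, value in mapper(rec):    # map: emit (key, value) pairs
            groups[key].append(value)     # shuffle: group values by key
    return {k: reducer(vs) for k, vs in groups.items()}  # reduce per key

logs = ["a.example", "b.example", "a.example"]
counts = map_reduce(logs, mapper=lambda host: [(host, 1)], reducer=sum)
```

The appeal for multi-terabyte static datasets is that each phase parallelises cleanly, so the same job scales from one machine to a cluster.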
Supervised and semi-supervised machine learning techniques used to churn through captured information "often produce functions with high accuracies on real-world data sets", the document notes.
Difficulties in creating training sets for machine learning have limited its usefulness for the GCHQ, however: the behaviour being modelled changes over time, so any algorithm used must be periodically retrained.
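The retraining problem can be made concrete with a deliberately simple sketch (entirely illustrative, not the agency's method): a one-dimensional classifier whose decision threshold is refit on each fresh window of labelled data, so it can track behaviour that drifts over time.

```python
# Toy retraining loop: a 1-D threshold classifier is refit after every
# full window of newly labelled samples, tracking drift in the data.
def fit_threshold(samples):
    """Fit a threshold as the midpoint of the two class means (labels 0/1)."""
    pos = [x for x, y in samples if y == 1]
    neg = [x for x, y in samples if y == 0]
    return (sum(pos) / len(pos) + sum(neg) / len(neg)) / 2

def retrain_periodically(stream, window=100):
    """Yield a freshly fitted threshold after each full window of labels."""
    recent = []
    for sample in stream:
        recent.append(sample)
        if len(recent) == window:
            yield fit_threshold(recent)   # retrain on the latest window
            recent = []
```

The expensive part in practice is not the refit but producing the labelled windows: without a steady supply of fresh training data, the model silently goes stale, which is the limitation the document describes.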