We now live in a world where text, images, videos and even voice can be created digitally by generative AI – upending the adage “seeing is believing”.

Many organisations are also building synthetic data sets to validate mathematical models and to train machine learning models for real-world scenarios where enough real data cannot be, or has not been, collected.
A great example: in 2021, Microsoft released a database of 100,000 synthetic faces, based on 500 real faces, that it claims "match real data in accuracy". Yet the "truth" of these synthetic data sets relies on the individual biases and the inclusions or omissions of the people and organisations conjuring them.
How do you know who created the data you consume? How much of that data do you personally trust? How do you know it can be trusted? (As an aside, are the authors of this article real? Did they use generative AI to write the article? How much of the article comes from original material?).
Earlier this year in the United States, a defendant sought to dismiss a federal court case, and the lawyer for the plaintiff filed a response compiled using generative AI that cited at least six other cases as precedent.
The court found that the cases didn't exist and that the filing contained "bogus judicial decisions with bogus quotes and bogus internal citations," leading a federal judge to consider sanctions. This simple example serves as a timely warning that we all need to know where our data comes from before using it.
What is data veracity?
Data veracity is generally defined as the accuracy or truthfulness of a data set. And it is becoming a big deal.
In a world awash with data from both real and created sources, knowing the provenance and originality of your data can make the difference between telling the truth and being caught up in a lie.
The veracity of data is tested by the level of bias, noise, and abnormalities present in a data set. It is also affected by incomplete data and by the presence of errors, outliers, and missing values.
With generative AI, this now also extends to having data that may be affected by deep fakes, confabulations, and hallucinations. Knowing whether data can be trusted will remain a massive challenge over the coming decade, until society has put in place reliable means of identifying and measuring data veracity.
Currently, data veracity is not easily measured or identified between different parties. Whilst data governance in most organisations (rightly) covers many detailed guidelines and stipulations for data privacy and security, data veracity is often barely considered.
For a data platform that combines data from a variety of internal and external sources and formats to help construct AI-powered insights and solutions, data veracity isn't a nice-to-have; it is essential.
Creating a framework
With a framework for data veracity in place, you can ensure that those insights are reliable, consistent, and generated rapidly.
How can data governance ensure data veracity is managed effectively? Below are some suggested components for a data veracity framework.
Data Source Reliability
The first step in measuring data veracity is evaluating the reliability of the data source. Credible sources, such as government agencies, established research institutions, and reputable companies, tend to produce more reliable data.
Steps should be taken to record the source of data, when it was created, and by what means, by updating tags in the metadata of any data set. To help counter the veracity challenges raised by synthetic image generation, for example, Google has just launched SynthID, a digital watermarking capability for its Imagen generative AI image creation engine.
SynthID-generated watermarks remain detectable even when image metadata has been manipulated or removed.
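As a minimal sketch of what provenance tagging could look like in practice (the field names and sidecar-file approach here are illustrative assumptions, not a prescribed standard), a short Python routine might record the source, collection method, creation time, and a content fingerprint alongside a data set:

```python
import hashlib
import json
from datetime import datetime, timezone
from pathlib import Path

def tag_provenance(data_path: str, source: str, collection_method: str) -> dict:
    """Write a sidecar file recording where a data set came from and when it was captured."""
    payload = Path(data_path).read_bytes()
    metadata = {
        "source": source,                          # e.g. a named agency, vendor, or internal system
        "collection_method": collection_method,    # e.g. "public API extract" or "manual upload"
        "created_at": datetime.now(timezone.utc).isoformat(),
        "sha256": hashlib.sha256(payload).hexdigest(),  # fingerprint to detect later tampering
    }
    Path(data_path + ".provenance.json").write_text(json.dumps(metadata, indent=2))
    return metadata
```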
Data Consistency
Ways to collect data have grown exponentially in recent years. Whether using structured or unstructured sources, direct queries, defined reports, or data visualisation tools, data should be consistent across different sources and time periods.
You should check that running the same query through every means by which data is accessible in your organisation returns the same results. The integrity of the data also requires that reasonably strong API authentication mechanisms be used for queries, including API keys, OAuth 2.0, and multi-factor authentication.
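A minimal consistency check, assuming a hypothetical sales table that is available both through direct SQL and through a reporting export (the table, column, and file names are placeholders), might simply confirm that both access paths agree on the same aggregate:

```python
import sqlite3
import pandas as pd

def totals_match(db_path: str, csv_export_path: str, tolerance: float = 0.001) -> bool:
    """Compare the same total pulled via direct SQL and via a reporting export."""
    with sqlite3.connect(db_path) as conn:
        sql_total = conn.execute("SELECT SUM(amount) FROM sales").fetchone()[0]
    export_total = pd.read_csv(csv_export_path)["amount"].sum()
    # Flag any divergence larger than a small relative tolerance.
    return abs(sql_total - export_total) <= tolerance * abs(sql_total)
```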
Data Security Measures
Robust security measures include encryption, access controls, and systems to protect data in transit and at rest.
For example, do your data pipelines use Transport Layer Security (e.g. TLS 1.2) for data transit? If you’re dealing with devices, do your sensor vendors ensure certificate-based authentication to verify the identity of their devices? Do they have a secure boot mechanism to ensure that only authorised (untampered) firmware can run on their devices?
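To illustrate the first of those questions, one way to enforce a minimum TLS version when pulling data over HTTPS in Python is to configure the SSL context explicitly (the feed URL below is a placeholder):

```python
import ssl
import urllib.request

# Refuse to negotiate anything older than TLS 1.2 when fetching a data feed.
context = ssl.create_default_context()            # verifies the server certificate by default
context.minimum_version = ssl.TLSVersion.TLSv1_2

with urllib.request.urlopen("https://data.example.com/feed.json", context=context) as response:
    payload = response.read()
```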
Data Validation and Verification
Putting checks in place to test the veracity of data as it arrives at your organisation can make all the difference. Like the need for two-factor (or even three-factor) authentication when accessing an online application, your organisation should have pre-registered indications that data is coming from a trusted source. The validation and verification should be captured in the metadata of any data set, showing when it was received and by what means.
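As a sketch of that idea, assuming senders share a secret key with you at onboarding (the registry, sender name, and key below are hypothetical), an incoming payload could be checked against its pre-registered key and the verification captured in metadata:

```python
import hashlib
import hmac
from datetime import datetime, timezone

# Hypothetical registry of keys exchanged with trusted senders at onboarding.
REGISTERED_SENDER_KEYS = {"sensor-network-01": b"shared-secret-from-onboarding"}

def verify_and_stamp(sender_id: str, payload: bytes, signature_hex: str) -> dict:
    """Check a payload against the sender's pre-registered key, then record how and when it arrived."""
    key = REGISTERED_SENDER_KEYS.get(sender_id)
    if key is None:
        raise ValueError(f"Unknown sender: {sender_id}")
    expected = hmac.new(key, payload, hashlib.sha256).hexdigest()
    if not hmac.compare_digest(expected, signature_hex):
        raise ValueError("Signature mismatch: payload cannot be trusted")
    return {
        "sender": sender_id,
        "received_at": datetime.now(timezone.utc).isoformat(),
        "verified_by": "hmac-sha256",
    }
```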
Data Quality Metrics
Various data quality metrics can be applied to assess data veracity. Examples include completeness (is the data missing any critical information?), accuracy (how close is the data to the actual value?), and timeliness (is the data up to date?).
These metrics help organisations identify and rectify data quality issues. Having a minimum viable level of quality for your data set will help determine if data should be kept or discarded.
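A rough sketch of how two of those metrics could be scored against a minimum viable threshold follows (the column name and 95 per cent cut-off are illustrative assumptions; accuracy would additionally need a trusted reference value to compare against):

```python
import pandas as pd

def quality_report(df: pd.DataFrame, timestamp_col: str, max_age_days: int = 30) -> dict:
    """Score a data set on completeness and timeliness."""
    completeness = 1.0 - df.isna().mean().mean()                     # share of non-missing cells
    ages = pd.Timestamp.now(tz="UTC") - pd.to_datetime(df[timestamp_col], utc=True)
    timeliness = (ages <= pd.Timedelta(days=max_age_days)).mean()    # share of sufficiently recent rows
    return {"completeness": completeness, "timeliness": timeliness}

def meets_minimum_quality(report: dict, threshold: float = 0.95) -> bool:
    """Keep the data set only if every metric clears the minimum viable level."""
    return all(score >= threshold for score in report.values())
```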
Cross-Referencing and Cross-Validation
Cross-referencing data with multiple sources or using cross-validation techniques can enhance data veracity. Check that any new data set correlates with previously received information and doesn't extend beyond the norm in size, type, or range.
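A minimal sketch of such a cross-check, assuming the previously received batch is available as a baseline (the column name and size tolerance are illustrative assumptions), might compare the new batch's size and value range against it:

```python
import pandas as pd

def within_historical_norms(new_batch: pd.DataFrame, previous_batch: pd.DataFrame,
                            value_col: str, size_tolerance: float = 0.5) -> bool:
    """Flag a new batch that differs sharply from previously received data in size or range."""
    expected_size = len(previous_batch)
    size_ok = abs(len(new_batch) - expected_size) <= size_tolerance * expected_size
    lo, hi = previous_batch[value_col].min(), previous_batch[value_col].max()
    range_ok = new_batch[value_col].between(lo, hi).all()
    return size_ok and range_ok
```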
Such cross-checking is particularly relevant for applications that use large language model (LLM)-based assistants to respond to user queries for critical data.
Hallucinations or deviations from accuracy can have serious consequences for users and for the reputation of the application provider. It is critical, in such applications, to have methods that can independently verify the accuracy of the models.
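As one illustration of independent verification (the model call, reference store, and tolerance below are all hypothetical placeholders rather than any specific product's API), a critical figure generated by an assistant could be cross-checked against a trusted record before it is shown to the user:

```python
from typing import Callable

# Hypothetical, independently maintained source of truth for critical figures.
TRUSTED_FIGURES = {"site_7_energy_use_kwh": 41250.0}

def answer_with_verification(question: str, fact_key: str,
                             llm_answer: Callable[[str], str]) -> str:
    """Return an LLM-generated figure only if it matches the trusted record."""
    claimed = float(llm_answer(question))            # numeric value extracted from the model's response
    trusted = TRUSTED_FIGURES[fact_key]
    if abs(claimed - trusted) > 0.01 * trusted:      # allow a small rounding tolerance
        return "The generated answer could not be verified against trusted records."
    return f"{claimed} (verified against trusted records)"
```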
Veracity is vital
In an era where misinformation and data manipulation can have far-reaching consequences, measuring data veracity is critical. No matter the organisation you work for, the accuracy and reliability of data plays a pivotal role in shaping your decisions and outcomes.
By applying these measurement methods, you can foster a more trustworthy data landscape that empowers individuals and organisations to make informed choices based on solid information. Or you can ignore the challenge at your peril like Colonel Nathan R. Jessup (played by Jack Nicholson) did in A Few Good Men, and just declare “You can’t handle the truth”.
Colin Dominish is the head of podium services at Lendlease Digital. He is a customer-first digital native with over thirty years of experience in bringing the best digital solutions and expertise from around the world and applying them to infrastructure projects.
Aditya Dayal, head of AI and Data Science at Lendlease Digital, contributed to this piece.