How AI Depends on Data Quality

"Garbage in, garbage out" is a computing principle that has never been more consequential than it is in the age of machine learning. AI systems trained on poor-quality data do not just produce poor outputs — they produce poor outputs confidently, consistently and at scale.

What Data Quality Means for AI

Data quality for AI encompasses several dimensions. Completeness means the dataset covers the full range of cases the model will encounter: a model trained only on data from certain demographics, geographies or time periods will underperform on cases outside that range. Accuracy means the data correctly represents reality, because errors in training data become systematic errors in outputs. Representativeness means the distribution of the training data matches the real-world distribution the model will face. Recency means the data reflects current patterns, not historical ones that may no longer apply.
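
The representativeness dimension can be made concrete by comparing each group's share of the training data with its expected share in deployment. What follows is a minimal sketch, assuming every record carries a single group label and that a reference distribution is available. The function name `share_gaps`, the group labels, the numbers and the 0.05 threshold are all hypothetical choices for illustration.

```python
from collections import Counter


def share_gaps(train_groups, reference_shares):
    """Return training share minus reference share for each group.

    train_groups: iterable of group labels, one per training record.
    reference_shares: dict mapping group label -> expected share (0..1)
    in the population the model will face. Both inputs are assumptions
    of this sketch, not part of any particular library.
    """
    counts = Counter(train_groups)
    total = sum(counts.values())
    if total == 0:
        raise ValueError("train_groups is empty")
    gaps = {}
    # Include groups that appear on only one side of the comparison.
    for group in set(counts) | set(reference_shares):
        train_share = counts.get(group, 0) / total
        gaps[group] = train_share - reference_shares.get(group, 0.0)
    return gaps


# Hypothetical data: rural records are scarce in the training set.
train = ["urban"] * 80 + ["suburban"] * 15 + ["rural"] * 5
reference = {"urban": 0.5, "suburban": 0.25, "rural": 0.25}

gaps = share_gaps(train, reference)
# Flag groups whose training share falls more than 5 points short.
underrepresented = sorted(g for g, gap in gaps.items() if gap < -0.05)
print(underrepresented)
```

A negative gap means the group is scarcer in training than in the population the model will serve. The threshold for "too scarce" is a judgment call that depends on the application.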

When Bad Data Creates Bad AI

Several well-documented cases illustrate the consequences. Hiring algorithms trained on historical resume data that reflected previous (often biased) hiring decisions reproduced those biases at scale. Healthcare risk models trained on insurance claims data, rather than actual health outcomes, systematically underestimated risk for populations that had less access to healthcare, because fewer claims were read as lower need. Credit models trained on data from periods of economic growth did not adequately account for recessionary patterns.

In education technology, early warning systems for student outcomes trained on data from well-resourced districts performed poorly when deployed in under-resourced ones, because the patterns that signal risk were genuinely different there and the training data did not capture them.

What Organizations Can Do

Before deploying any AI system, assess the quality of the data it depends on. Ask: What are the known gaps in our data? Which populations are underrepresented in it? How old is the training data, and do current patterns differ from historical ones? This is where the connection between data interoperability and AI governance becomes concrete: organizations with fragmented, siloed data feed their AI systems lower-quality inputs than organizations with integrated, validated data pipelines.
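
Two of the questions above, known gaps and data age, can be turned into a simple pre-deployment report. The sketch below is one possible approach, assuming records are plain dictionaries in which a missing value is stored as `None`. The function `audit_records`, the field names and the sample records are hypothetical.

```python
from datetime import date


def audit_records(records, required_fields, date_field, today):
    """Summarize missing values and data age for a list of dict records.

    Assumes a missing value is represented as None (or an absent key).
    """
    if not records:
        raise ValueError("records is empty")
    n = len(records)
    # Share of records with no value for each required field.
    missing_rate = {
        field: sum(1 for r in records if r.get(field) is None) / n
        for field in required_fields
    }
    dates = [r[date_field] for r in records if r.get(date_field) is not None]
    newest = max(dates) if dates else None
    oldest = min(dates) if dates else None
    return {
        "missing_rate": missing_rate,
        "oldest_record": oldest,
        "newest_record": newest,
        "days_since_newest": (today - newest).days if newest else None,
    }


# Hypothetical sample records for illustration only.
records = [
    {"income": 52000, "district": "A", "collected": date(2021, 3, 1)},
    {"income": None, "district": "A", "collected": date(2022, 6, 15)},
    {"income": 48000, "district": None, "collected": date(2023, 1, 10)},
    {"income": None, "district": "B", "collected": date(2023, 1, 20)},
]
report = audit_records(
    records, ["income", "district"], "collected", today=date(2024, 1, 20)
)
print(report["missing_rate"], report["days_since_newest"])
```

A report like this does not say whether the data is good enough. It only surfaces the gaps and the age of the data so that the decision is made deliberately.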