Maps Wavering in the Shadow of Undersampling: How to Decipher a World with Missing Data
December 2025
When an AI system learns about the world, its view is limited to the data it is given. If a particular group or attribute is barely represented, the model treats the missing pieces as blanks and forms a biased picture of reality. This is the problem of undersampling, and as long as the missingness itself is structural, the bias is hard to eliminate no matter how much data is added later. Unseen areas remain blind spots: learning improves, but the perspective never broadens.
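A minimal sketch of how this plays out, with entirely invented groups, numbers, and distributions: a model fit to training data dominated by one group can look accurate on that group while failing badly on the undersampled one, because the single decision boundary it learns is effectively the majority's boundary.

```python
import random

random.seed(0)

def sample(group, n):
    """Hypothetical toy data: one numeric feature, binary label.

    Negatives cluster near 0 for both groups; positives cluster near 2.0
    for majority group A but near 0.8 for minority group B, so no single
    threshold serves both groups equally well.
    """
    rows = []
    for _ in range(n):
        positive = random.random() < 0.5
        center = (2.0 if group == "A" else 0.8) if positive else 0.0
        rows.append((group, random.gauss(center, 0.4), positive))
    return rows

# Group B is heavily undersampled at training time.
train = sample("A", 1000) + sample("B", 20)

# "Model": the single feature threshold with the fewest training errors.
# With B nearly absent, this is effectively optimized for A alone.
candidates = sorted(x for _, x, _ in train)
threshold = min(candidates,
                key=lambda t: sum((x >= t) != y for _, x, y in train))

def error_rate(rows):
    return sum((x >= threshold) != y for _, x, y in rows) / len(rows)

test_a, test_b = sample("A", 2000), sample("B", 2000)
print(f"test error on majority A: {error_rate(test_a):.1%}")
print(f"test error on minority B: {error_rate(test_b):.1%}")
```

Note that collecting more data from group A leaves the disparity untouched; only more (or reweighted) data from group B moves the learned boundary toward serving both groups.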
These problems have had serious consequences in high-stakes domains such as credit scoring, medical diagnosis, and recidivism prediction. The controversial COMPAS recidivism model in the U.S., for example, was reported to systematically overestimate risk for Black defendants, producing unfair judgments rooted in structural bias in its data. In facial recognition, academic studies have shown that the scarcity of training data on women of color led to dramatically higher misidentification rates for them, sparking worldwide debate. These are symbolic examples of how undersampling is not merely a lack of data but has the power to reproduce discrimination and inequality.
The issue is clearly recognized in international governance as well: the EU AI Act makes representative and sufficient data a requirement for high-risk AI systems and mandates audits of bias in datasets, and the OECD AI Principles likewise emphasize data quality and balance. In other words, undersampling is treated not only as an ethical issue but as a fundamental risk that cannot be ignored in legal and institutional design.
The essential problem is that undersampling is a structural bias that takes hold before learning begins. Even if one tries to correct it after the model is built, the bias has already been internalized and is difficult to undo. Statistical experts and domain practitioners need to be involved at an early stage to examine which populations are missing from the data and why. Unless the design of data collection itself is reworked, added data will only further thicken coverage of the existing majority, leaving minority groups forever thin.
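The inspection step described above can be sketched as a simple representation audit run before any training: count each subgroup and flag the slices that fall below a floor. The fields, records, and the floor of 100 rows here are all assumptions for illustration; real thresholds depend on the task and the statistics involved.

```python
from collections import Counter

# Assumed minimum sample count per subgroup (hypothetical floor).
MIN_PER_GROUP = 100

# Invented records standing in for a real dataset.
records = (
    [{"sex": "F", "region": "urban"}] * 420
    + [{"sex": "M", "region": "urban"}] * 510
    + [{"sex": "F", "region": "rural"}] * 35   # undersampled slice
    + [{"sex": "M", "region": "rural"}] * 120
)

# Count every (sex, region) intersection, not just each attribute alone:
# undersampling often hides in the intersections.
counts = Counter((r["sex"], r["region"]) for r in records)

flagged = {group: n for group, n in counts.items() if n < MIN_PER_GROUP}
for (sex, region), n in sorted(flagged.items()):
    print(f"undersampled: sex={sex}, region={region} ({n} rows)")
```

An audit like this only surfaces the gap; deciding why the slice is thin, and how to collect for it, is where the domain practitioners come in.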
Undersampling not only distorts AI decisions; it reinforces social imbalances and undermines justice and fairness. To avoid the danger of navigating with an incomplete map, it is essential to ask from the outset whose voices are missing and to put mechanisms in place to compensate for those gaps. Shedding light on the silent voids in our data is the first step toward supporting ethics in the age of AI.