A Shaky Map in the Shadow of Undersampling: How to Decipher a World Missing Data
December 2025
Undersampling refers to a situation in which data on a particular group or attribute is extremely scarce, so that an AI system learns a biased picture of the world. No matter how sophisticated the model, one trained on data with large gaps cannot make unbiased decisions, and the bias remains structural. This is not simply a problem of data volume. It originates in how the collection was designed in the first place, and it is difficult to correct even if data is added later. As the failures of recidivism prediction systems and facial recognition systems have shown, undersampling reproduces social inequity and causes real harm to the groups it leaves out.
The importance of addressing undersampling has been recognized internationally. The EU AI Act specifies an obligation to use representative and sufficient data for high-risk AI, and the OECD AI Principles also emphasize data quality as a way to ensure fairness. Under both, undersampling left unchecked is considered a serious ethical risk. These frameworks also take into account the possibility that missing data stems from discrimination and exclusion rooted in social structures rather than from technical inadequacies.
The core of the problem is that the bias is already present before training begins and then affects the entire model. If experts do not intervene early and examine which populations are missing from the data and why, the bias becomes fixed. Proactive measures are essential: designing a well-balanced data collection plan, devising survey methods for hard-to-reach populations, and introducing fairness indicators.
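As a rough illustration of the inspection step described above, the following Python sketch compares each group's share of a dataset with its share of a reference population. Everything in it is an assumption made for illustration: the function names, the group labels, the reference shares, and the 0.8 cutoff are hypothetical and do not come from any regulation or standard.

```python
from collections import Counter


def representation_ratios(sample_groups, reference_shares):
    """Return observed share / reference share for each reference group.

    A ratio near 1.0 means the group appears in the sample about as often
    as in the reference population; a ratio well below 1.0 suggests the
    group is undersampled.
    """
    total = len(sample_groups)
    if total == 0:
        raise ValueError("sample_groups is empty")
    counts = Counter(sample_groups)
    ratios = {}
    for group, ref_share in reference_shares.items():
        if ref_share <= 0:
            raise ValueError(f"reference share for {group!r} must be positive")
        observed_share = counts.get(group, 0) / total
        ratios[group] = observed_share / ref_share
    return ratios


def underrepresented(ratios, threshold=0.8):
    """List groups whose ratio falls below an illustrative threshold."""
    return sorted(group for group, ratio in ratios.items() if ratio < threshold)


# Hypothetical data: group labels of 100 collected records, and each
# group's assumed share of the population the model is meant to serve.
sample = ["A"] * 80 + ["B"] * 15 + ["C"] * 5
reference = {"A": 0.60, "B": 0.25, "C": 0.15}

ratios = representation_ratios(sample, reference)
flagged = underrepresented(ratios)
print(flagged)
```

A check like this only shows which groups are thin in the data. It does not explain why they are missing, which still requires examining the collection design itself.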
Undersampling is a fundamental problem that undermines not only the performance of AI but also fairness in society. Making the gaps hidden in the shadows visible, and then filling them, will be a foundation for AI ethics and institutional design in the future.