Editorial Feature

The Ethics of Training AI on Human Data

The Scale of the Problem
Consent and Its Fundamental Limits
Bias as an Ethical Injury
The Legal and Regulatory Response
Towards Ethical Practice
References and Further Reading

Every time a large language model generates a sentence or a facial recognition system identifies a face, it is drawing on a vast body of human-generated data. Much of that data comes from real people: their words, photographs, voices, browsing habits, and even biometric markers.


Yet the use of this data often happens quietly, folded into the training processes that allow AI systems to function at scale. 

The ethical questions surrounding this practice are not simple ones. They touch on privacy laws, meaningful consent, and the fundamental right individuals have to control how their identity and personal information are used.

The Scale of the Problem

Modern AI systems depend on enormous volumes of data to function accurately. Large language models are trained on billions of web pages, forum posts, and digitized books. Computer vision systems rely in much the same way on millions of labeled photographs of real human faces. Gathering data on this scale would be impossible by hand, so most of it is collected through automated web scraping, a process that harvests publicly accessible content at machine speed.

A 2025 report from the Organisation for Economic Co-operation and Development (OECD) on intellectual property and AI training notes that this practice raises a wide range of legal questions.

Data scraping can affect copyright, database rights, trademarks, trade secrets, publicity rights, and moral rights. Yet many of the laws that govern these areas were written long before modern AI training practices existed. The result is a patchwork of legal standards that vary across countries and are often difficult to apply in practice. In practical terms, this means that a person’s publicly shared photo, comment, or post can end up in an AI training dataset without their knowledge, payment, or any real opportunity to object.1,2

One result is that real people become unwitting contributors to commercial AI systems.

Reddit, for example, licenses user-generated content to AI companies, and the platform has faced scrutiny from the US Federal Trade Commission over data-licensing practices connected to AI training.3,4 

In 2025, the number of AI copyright infringement cases more than doubled, rising from around 30 to more than 60 active lawsuits. Much of this litigation has come from content creators, journalists, and publishers whose work was incorporated into training datasets without permission.3,4 

Taken together, these disputes show that the tension surrounding AI data collection is more than an abstract ethical concern: it creates real conflicts for the people whose work and personal data help make these systems possible.

Consent and Its Fundamental Limits

Consent is often treated as the foundation of ethical data use.

In medical and social research, informed consent means that participants understand what their data will be used for, how it will be stored, and who will have access to it. In principle, this model is meant to give individuals meaningful control over how their information is used. AI training, however, complicates that standard in important ways.

A recent study describes what researchers call a “consent gap” in AI systems. The authors frame this gap around three related challenges:5 

  1. The scope problem
  2. The temporality problem
  3. The autonomy trap

The scope problem arises because people may agree to the initial use of their data but cannot realistically consent to every possible future use. Once data enters large training datasets, it may be reused, combined, and repurposed in ways that are difficult to anticipate.

The temporality problem highlights a similar difficulty. Consent given at one moment in time cannot predict how a model might draw on that data years later, long after the original context has disappeared.

Finally, the autonomy trap reflects a more subtle issue. Even when consent appears voluntary, it is not fully meaningful if individuals have no practical way to refuse participation in the systems that collect and process their data.5

The French data protection authority, the National Commission on Informatics and Liberty (CNIL), is one regulator trying to balance these tensions within the General Data Protection Regulation (GDPR) framework. Its guidance acknowledges that large-scale AI training on personal data sourced from public content can be lawful under the legitimate-interest basis, provided that a credible balancing of interests is documented, proportionality principles are observed, and mitigation measures address risks such as model memorization, which can cause training data to be reproduced verbatim in model outputs.6 

Yet the CNIL is explicit that training-stage compliance does not automatically permit commercial deployment. Copyright laws, database rights, and platform terms may separately prohibit the use of the same data.6 

This fragmented regulatory landscape means AI developers may satisfy one legal requirement while violating another, leaving individuals without a clear path to redress.
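The memorization risk raised in the CNIL guidance can be illustrated with a toy audit: compare a model's output against the training corpus and flag long verbatim runs of words. Everything below (the sample texts, the six-word threshold, the function name) is an illustrative assumption, not a real auditing tool.

```python
# Toy check for verbatim memorization: flag model output that shares a
# long word-for-word run with the training corpus. Texts and the
# six-word threshold are illustrative assumptions only.
def longest_common_run(output_words, corpus_words):
    """Length of the longest contiguous word sequence shared by both lists."""
    best = 0
    for i in range(len(output_words)):
        for j in range(len(corpus_words)):
            k = 0
            while (i + k < len(output_words) and j + k < len(corpus_words)
                   and output_words[i + k] == corpus_words[j + k]):
                k += 1
            best = max(best, k)
    return best

corpus = "the quick brown fox jumps over the lazy dog every single day".split()
output = "reports say the quick brown fox jumps over a fence".split()

run = longest_common_run(output, corpus)
print(run, run >= 6)  # prints "6 True": a 6+ word run suggests possible memorization
```

Production memorization audits on large models use far more efficient matching (suffix arrays, hashing of n-grams), but the underlying question is the same: does the output reproduce training data verbatim?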

Bias as an Ethical Injury

The ethics of using real people as training data extends beyond consent into questions of fairness.

When training datasets fail to adequately represent all demographic groups, AI systems can produce results that disproportionately harm those who are already underrepresented or historically marginalized. In this way, social inequalities embedded in historical data can reappear in technical systems that appear neutral but are not.7

Facial recognition technology provides a clear example of this dynamic. Early training datasets for facial recognition systems contained far more images of white men than of women or people of color. As a result, the models became significantly better at identifying white male faces than others. A review published in Frontiers in Big Data notes that facial recognition technology also removes anonymity in public spaces. Unlike passwords or identification numbers, facial features cannot easily be changed or hidden, raising serious ethical concerns about consent and personal identity.

These technical limitations have already produced real-world consequences. At least eight documented cases of wrongful arrest have been linked to facial recognition errors, and seven of those cases involved Black men. Researchers attribute this pattern in part to racially unrepresentative training datasets. When training data drawn from real people reflects existing social bias, the resulting AI system does not simply inherit those biases; it can reinforce and extend them at scale.8,9

A similar pattern appears in other algorithmic systems. In the criminal justice context, the Correctional Offender Management Profiling for Alternative Sanctions (COMPAS) algorithm, trained on historical criminal justice data, has been shown to assign higher recidivism risk scores to Black defendants compared to white defendants with equivalent criminal histories, producing higher false-positive rates for Black individuals.

Bias of this kind begins in the data pipeline itself. When datasets about real people are used without examining the structural inequalities they contain, the resulting systems can reproduce those inequalities long after the training process has ended.7

The Legal and Regulatory Response

Regulatory frameworks are starting to address ethical concerns, but the responses vary.

The legal landscape surrounding IP data scraping is complex and rapidly evolving. Existing IP laws, many predating modern AI practices, differ across jurisdictions, complicating their application.

Data scraping frequently involves content protected by IP rights, raising questions about whether collection or use of the scraped material constitutes infringement, whether exceptions such as fair use or text and data mining (TDM) provisions apply, and whether contractual terms and conditions are being honored.

Litigation in this area is increasing globally, with prominent cases emerging in the United States, the European Union, and beyond. Additionally, AI-generated outputs that mimic an individual's style, voice, or likeness without authorization have prompted varied legal responses aimed at protecting rights and preventing misuse.2

In the European Union, the General Data Protection Regulation (GDPR) requires that personal data be collected for clear and legitimate purposes and that it not be reused in ways that conflict with those original purposes. Building on this framework, the EU AI Act goes further by prohibiting certain high-risk practices, including AI systems that use biometric data to categorize individuals or infer sensitive personal attributes.3,10 

In the United States, lawmakers have started to explore similar protections. Senators have proposed the AI CONSENT Act, which would require online platforms to obtain explicit user permission before using personal data to train AI models. Although the legislation is still under consideration, it reflects growing concern about how personal data is collected and reused in the development of AI technologies.3,10

Regulators have also begun enforcing existing consumer protection laws in this space. The US Federal Trade Commission (FTC) has stated that companies cannot revise their privacy policies after collecting user data in order to permit new uses such as AI training. If a company originally collected data under specific privacy assurances, those promises still apply when AI systems are introduced.

Courts are beginning to reinforce this position. In 2025, a US federal court ruled against an AI company that had used proprietary legal research content without authorization to train its system. The court concluded that this use harmed the market value of the original material and did not qualify as fair use. Together, these regulatory and legal developments suggest that the legal system is slowly beginning to grapple with the realities of large-scale AI training.11,12

Towards Ethical Practice

Researchers and developers are beginning to explore ways of reducing the ethical challenges associated with using real people’s data to train AI systems.

One approach is anonymization, which removes or obscures identifying details so that individuals cannot easily be traced within a dataset. Another method, known as federated learning, keeps raw data on users’ devices while allowing models to learn from it indirectly by sharing only updated parameters rather than the underlying information.
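The federated learning idea described above can be sketched in a few lines: each client fits a model on its own private data, and only the learned parameters, never the raw records, are sent to a server that averages them (the federated averaging, or FedAvg, scheme). The data, client count, and noise level here are illustrative assumptions, not a real deployment.

```python
# Minimal FedAvg sketch: clients fit local linear models on private data;
# the server sees only weight vectors, never raw records. All numbers
# here are synthetic and illustrative.
import numpy as np

rng = np.random.default_rng(0)
true_w = np.array([2.0, -1.0])  # ground truth shared across clients

def local_update(n_samples):
    """Fit least squares on private local data; return the weights only."""
    X = rng.normal(size=(n_samples, 2))
    y = X @ true_w + rng.normal(scale=0.1, size=n_samples)
    w, *_ = np.linalg.lstsq(X, y, rcond=None)
    return w

client_weights = [local_update(200) for _ in range(5)]  # 5 clients train locally
global_w = np.mean(client_weights, axis=0)              # server-side aggregation
print(global_w)  # values close to [2.0, -1.0]
```

Real federated systems add secure aggregation and differential-privacy noise on top of this averaging step, since even shared parameters can leak information about the underlying data.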

Synthetic data offers another possibility. Instead of relying solely on photographs of real individuals, researchers can generate artificial images that resemble human faces but do not correspond to any actual person. Some studies suggest that carefully designed synthetic datasets can help reduce racial bias in facial recognition systems more effectively than similarly sized real-world datasets.9,10
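A toy version of this rebalancing idea: augment an underrepresented group with synthetic samples drawn from a distribution fitted to its real examples. The Gaussian fit, group sizes, and feature counts below are illustrative assumptions; real face datasets use generative models, not per-feature Gaussians.

```python
# Toy illustration of balancing a dataset with synthetic samples: the
# underrepresented group is augmented with draws from a Gaussian fitted
# to its real examples. A sketch under stated assumptions only.
import numpy as np

rng = np.random.default_rng(1)
group_a = rng.normal(loc=0.0, size=(900, 4))  # well-represented group
group_b = rng.normal(loc=3.0, size=(100, 4))  # underrepresented group

# Fit a per-feature Gaussian to the minority group and sample from it.
mu, sigma = group_b.mean(axis=0), group_b.std(axis=0)
n_needed = len(group_a) - len(group_b)
synthetic_b = rng.normal(loc=mu, scale=sigma, size=(n_needed, 4))

balanced_b = np.vstack([group_b, synthetic_b])
print(len(group_a), len(balanced_b))  # prints "900 900": groups now equal in size
```

The caveat, as with any synthetic augmentation, is that the generated samples can only reflect whatever the fitted model captured of the minority group; biases in the original sample carry over into the synthetic one.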

These approaches do not eliminate the ethical questions surrounding AI training, but they show that the tension between technological capability and individual rights is not fixed.

Responsible development requires more than technical safeguards alone. It also depends on transparency about the data used to train models, meaningful mechanisms for individuals to object to or request deletion of their data, and independent evaluation of training datasets for bias before systems are deployed.

Ethical AI development also depends on awareness across the broader data ecosystem. The OECD notes that rights holders, data producers, developers, and users all need clearer information about how data is collected, used, and protected. Improving public understanding of data scraping practices and AI training processes can help individuals better manage their rights while encouraging more responsible behavior from companies developing these systems.2

Ultimately, the ethical responsibility toward the people whose data contributes to AI systems does not end once a model is released. It continues throughout the entire lifecycle of the technology their information helped make possible, and it will shape how societies decide to govern AI in the years ahead.1

Further reading on data consent, AI governance, and algorithmic fairness can offer deeper insight into how societies are beginning to address these challenges:

References and Further Reading

  1. AI system development: CNIL’s recommendations to comply with the GDPR. (2026). CNIL. https://www.cnil.fr/en/ai-system-development-cnils-recommendations-to-comply-gdpr
  2. Intellectual Property Issues in Artificial Intelligence Trained on Scraped Data. (2025). OECD Artificial Intelligence Papers. https://www.oecd.org/content/dam/oecd/en/publications/reports/2025/02/intellectual-property-issues-in-artificial-intelligence-trained-on-scraped-data_a07f010b/d5241a23-en.pdf
  3. Who owns your data? The ethics of AI training practices. Secure Redact. https://www.secureredact.ai/news/who-owns-your-data-the-ethics-of-ai-training-practices
  4. Madigan, K. (2026). AI Copyright Lawsuit Developments in 2025: A Year in Review. Copyright Alliance. https://copyrightalliance.org/ai-copyright-lawsuit-developments-2025/
  5. Pistilli, G., & Trevelin, B. (2025). Can AI be Consentful? ArXiv. DOI:10.48550/arXiv.2507.01051. https://arxiv.org/abs/2507.01051
  6. Kirk, D. J. et al. (2025). CNIL Clarifies GDPR Basis for AI Training – But It’s Just One Part of the Compliance Picture. Skadden. https://www.skadden.com/insights/publications/2025/06/cnil-clarifies-gdpr-basis-for-ai-training
  7. Alhammad, T. (2024). Deployment of COMPAS Algorithm in the Criminal Justice System. ResearchGate. DOI:10.13140/RG.2.2.16917.23520. https://www.researchgate.net/publication/394443577_Deployment_of_COMPAS_Algorithm_in_the_Criminal_Justice_System
  8. Wang, X. et al. (2024). Beyond surveillance: Privacy, ethics, and regulations in face recognition technology. Frontiers in Big Data, 7, 1337465. DOI:10.3389/fdata.2024.1337465. https://www.frontiersin.org/journals/big-data/articles/10.3389/fdata.2024.1337465/full
  9. Zhao, C. (2025). Can fake faces make AI training more ethical? Science News. https://www.sciencenews.org/article/fake-faces-ai-training-ethical
  10. Ethical Use of Training Data: Ensuring Fairness and Data Protection in AI. (2024). Lamarr Institute. https://lamarr-institute.org/blog/ai-training-data-bias/
  11. Gatto, J. (2024). Legal Issues When Training AI On Previously Collected Data. Law360. https://www.law360.com/articles/1814964/legal-issues-when-training-ai-on-previously-collected-data
  12. Yudin, A. (2025). Web Scraping Legal Issues: The Complete 2026 Enterprise Compliance Guide. BWT Group. https://groupbwt.com/blog/is-web-scraping-legal/

Written by

Ankit Singh

Ankit is a research scholar based in Mumbai, India, specializing in neuronal membrane biophysics. He holds a Bachelor of Science degree in Chemistry and has a keen interest in building scientific instruments. He is also passionate about content writing and can adeptly convey complex concepts. Outside of academia, Ankit enjoys sports, reading books, and exploring documentaries, and has a particular interest in credit cards and finance. He also finds relaxation and inspiration in music, especially songs and ghazals.

Citations

Please use one of the following formats to cite this article in your essay, paper or report:

  • APA

    Singh, Ankit. (2026, March 04). The Ethics of Training AI on Human Data. AZoRobotics. Retrieved on March 04, 2026 from https://www.azorobotics.com/Article.aspx?ArticleID=813.

  • MLA

    Singh, Ankit. "The Ethics of Training AI on Human Data". AZoRobotics. 04 March 2026. <https://www.azorobotics.com/Article.aspx?ArticleID=813>.

  • Chicago

    Singh, Ankit. "The Ethics of Training AI on Human Data". AZoRobotics. https://www.azorobotics.com/Article.aspx?ArticleID=813. (accessed March 04, 2026).

  • Harvard

    Singh, Ankit. 2026. The Ethics of Training AI on Human Data. AZoRobotics, viewed 04 March 2026, https://www.azorobotics.com/Article.aspx?ArticleID=813.
