Thousands of live API keys and passwords found exposed in training data

On February 27, 2025, security researchers revealed that LLMs were trained on datasets containing approximately 12,000 live API keys and passwords.

What Happened?

On February 27, 2025, security researchers at Truffle Security reported that a dataset used to train large language models (LLMs), including DeepSeek, contained approximately 12,000 live API keys and passwords. The researchers scanned Common Crawl, a publicly available web archive widely used as LLM training data, and found hardcoded secrets across millions of web pages.

Why This Issue Matters

AI models trained on insecure data risk inadvertently suggesting unsafe coding practices, such as embedding sensitive credentials directly in source code. The repeated exposure of live secrets in widely used training datasets significantly increases the risk of compromised API keys and passwords.
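
To make the unsafe practice concrete, here is a minimal Python sketch that contrasts a hardcoded credential with one read at runtime. The names `load_api_key` and `EXAMPLE_API_KEY` are hypothetical and chosen only for illustration; this is one common approach, not a complete secrets-management setup.

```python
import os

# Unsafe pattern (kept as a comment on purpose): a literal credential in source
# ends up in version control, published bundles, and eventually web crawls.
# API_KEY = "<hardcoded-secret>"


def load_api_key(env_var: str = "EXAMPLE_API_KEY") -> str:
    """Read the credential from the environment at runtime.

    `EXAMPLE_API_KEY` is a hypothetical variable name used for illustration.
    """
    key = os.environ.get(env_var)
    if not key:
        # Fail loudly instead of falling back to a hardcoded default.
        raise RuntimeError(f"{env_var} is not set")
    return key
```

This only helps for server-side code. Any value shipped in front-end HTML or JavaScript is public regardless of how it was loaded, so secrets should stay behind a backend that the page calls.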

How the Secrets Were Exposed

  1. Websites inadvertently published live API keys, passwords, and other sensitive credentials in front-end HTML/JavaScript.
  2. The Common Crawl dataset captured snapshots of these insecure web pages.
  3. LLMs like DeepSeek were subsequently trained on this publicly available dataset.
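
The first step above is the one a site owner can check directly. The following is a minimal Python sketch of scanning page source for strings that look like credentials. The function name `find_candidate_secrets` and the three patterns are illustrative assumptions, not the method the researchers used; dedicated scanners cover far more credential formats and also verify whether a match is still live.

```python
import re

# Illustrative patterns only (assumptions for this sketch). A regex hit is a
# candidate, not proof that a credential is real or still active.
SECRET_PATTERNS = {
    "aws_access_key_id": re.compile(r"\bAKIA[0-9A-Z]{16}\b"),
    "mailchimp_api_key": re.compile(r"\b[0-9a-f]{32}-us[0-9]{1,2}\b"),
    "generic_assignment": re.compile(
        r"""(?i)\b(?:api[_-]?key|password|secret)\b\s*[:=]\s*["']([^"']{8,})["']"""
    ),
}


def find_candidate_secrets(page_source: str) -> list[tuple[str, str]]:
    """Return (pattern_name, matched_value) pairs found in HTML/JS source."""
    findings = []
    for name, pattern in SECRET_PATTERNS.items():
        for match in pattern.finditer(page_source):
            # Use the captured value when the pattern defines a group,
            # otherwise the whole match.
            value = match.group(1) if pattern.groups else match.group(0)
            findings.append((name, value))
    return findings
```

For example, calling `find_candidate_secrets` on a page containing `var apiKey = "abcd1234efgh5678";` would report one `generic_assignment` candidate. Any hit on a published page should be treated as exposed and rotated, since a crawler may already have captured it.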

Implications

  • Increased risk of credential misuse in phishing campaigns, data breaches, and brand impersonation.
  • Higher likelihood of insecure code recommendations from AI coding assistants.

Recommended Actions

  • Review API and Password Management: Immediately audit and rotate exposed API keys and passwords.
  • Enhanced Secret Scanning: Extend scanning to cover public internet datasets such as Common Crawl and archive.org.
  • Educate Developers: Incorporate secure coding guidelines explicitly into AI coding assistant instructions.
  • Engage AI Providers: Advocate for stricter filtering of training data and additional safeguards in AI model training processes.
