What Happened?
On February 27, 2025, security researchers at Truffle Security revealed that large language models (LLMs), including DeepSeek, were trained on datasets containing approximately 12,000 live API keys and passwords. The researchers scanned Common Crawl, a publicly available web archive widely used to train AI coding assistants, and found these secrets hardcoded across millions of web pages.
Why This Issue Matters
AI models trained on insecure data risk inadvertently suggesting unsafe coding practices, such as embedding sensitive credentials directly in source code. Because a model cannot distinguish a live key from a placeholder during training, the repeated exposure of live secrets in widely used training datasets significantly increases the risk that working API keys and passwords resurface and are compromised.
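A minimal sketch of the anti-pattern and the safer habit it displaces (the key value and variable names here are hypothetical):

```python
import os

# Anti-pattern: a live credential hardcoded in source. Anything that
# publishes this file (a public repo, a rendered page, a crawled snapshot)
# publishes the secret, and a model trained on such code can learn to
# suggest the same habit.
API_KEY = "sk-live-EXAMPLE0000"  # hypothetical placeholder, never a real key

# Safer: keep the secret out of the codebase and read it at runtime.
api_key = os.environ.get("PAYMENTS_API_KEY")  # hypothetical variable name
if api_key is None:
    raise RuntimeError("PAYMENTS_API_KEY is not set")
```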
How the Secrets Were Exposed
- Websites inadvertently published live API keys, passwords, and other sensitive credentials in front-end HTML and JavaScript (a minimal sketch follows this list).
- The Common Crawl dataset captured snapshots of these insecure web pages.
- LLMs such as DeepSeek were subsequently trained on this publicly available dataset.
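To make the leak path concrete, here is a minimal sketch, assuming a Flask app (the route, variable names, and key value are hypothetical), of how a live key ends up in the HTML that every visitor and every crawler receives:

```python
from flask import Flask, render_template_string

app = Flask(__name__)

# Hypothetical: a live key kept in server code "for convenience".
MAIL_API_KEY = "0123abcd-EXAMPLE"

@app.route("/")
def signup_page():
    # The key is interpolated server-side, so the page source delivered
    # to browsers (and preserved in Common Crawl's snapshot of this URL)
    # contains it verbatim.
    return render_template_string(
        '<script>initSignup("{{ key }}");</script>', key=MAIL_API_KEY
    )
```

Client-side JavaScript has no secure place to hold a secret, so any key shipped to the browser this way has to be treated as public.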
Implications
- Increased risk of credential misuse in phishing campaigns, data breaches, and brand impersonation.
- Higher likelihood of insecure code recommendations from AI coding assistants.
Recommended Actions
- Review API Key and Password Management: Immediately audit and rotate any exposed API keys and passwords.
- Enhance Secret Scanning: Extend scanning to cover public internet datasets such as Common Crawl and archive.org (see the sketch after this list).
- Educate Developers: Incorporate secure coding guidelines explicitly into AI coding assistant instructions.
- Engage AI Providers: Advocate for stricter training-data filtering and additional alignment safeguards in AI model training processes.
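For the secret scanning item above, here is a minimal sketch of a regex-based scanner over a directory of saved pages (the patterns, paths, and names are illustrative; production tools such as Truffle Security's TruffleHog maintain far larger detector sets and verify matches against the issuing provider):

```python
import re
import sys
from pathlib import Path

# Illustrative patterns for a few well-known key formats. A real scanner
# maintains hundreds of detectors and checks whether each match is live.
PATTERNS = {
    "aws_access_key_id": re.compile(r"AKIA[0-9A-Z]{16}"),
    "mailchimp_api_key": re.compile(r"[0-9a-f]{32}-us[0-9]{1,2}"),
    "slack_token": re.compile(r"xox[baprs]-[0-9A-Za-z-]{10,}"),
}

def scan_file(path: Path) -> list[tuple[str, str]]:
    """Return (detector_name, match) pairs found in one crawled page."""
    text = path.read_text(errors="ignore")
    hits: list[tuple[str, str]] = []
    for name, pattern in PATTERNS.items():
        hits.extend((name, m) for m in pattern.findall(text))
    return hits

if __name__ == "__main__":
    # Usage: python scan_pages.py <directory of saved .html pages>
    for page in Path(sys.argv[1]).rglob("*.html"):
        for name, match in scan_file(page):
            # Print only a prefix of each match to avoid re-exposing it.
            print(f"{page}: {name}: {match[:8]}...")
```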