Extracting Secrets: Model Extraction via Bias Maps

LLM Security & Reverse Engineering

Research Project by Krystof Mitka

Groundbreaking LLM Reverse-Engineering

Developed a novel method to reconstruct internal model predictions (logits) using only bias map API access, even in black-box scenarios.

New API Attack Vector Unveiled

Discovered and demonstrated a significant vulnerability, showing how limited API features can be exploited to extract sensitive model information.

Critical Security Insights for LLM Providers

Highlighted major API security implications, urging a reassessment of feature exposure to protect commercial language models.

Background

Krystof Mitka's project at the University of Twente explored a unique vulnerability in how production language models expose certain features via their APIs. Specifically, the work focused on the ability to reconstruct parts of a model's internal prediction mechanisms—even when access to log probabilities is restricted.

The Discovery

By studying the bias map functionality available in some large language model APIs, Mitka developed a technique to recover the full logit distribution over next-token predictions. This effectively allows internal model behavior to be reverse engineered without needing full API access.

The work extends earlier research by applying a formal transformer-based analysis and proving that logit recovery is possible purely through controlled bias manipulation.
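
To make the mechanism concrete, here is a minimal numeric sketch (not code from the project; the logit values, bias values, and three-token vocabulary are invented for illustration) of how a bias map is added to the hidden logits before the softmax, and why the bias value at which one token overtakes another leaks the gap between their logits:

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

# Hidden next-token logits the attacker wants to recover (made up for the example).
logits = np.array([4.0, 2.5, 1.0])

# Bias map supplied through the API: boost token 1 by +5 before sampling.
bias = np.array([0.0, 5.0, 0.0])

p_plain = softmax(logits)           # distribution the API would normally sample from
p_biased = softmax(logits + bias)   # distribution after the bias map is applied

# Token 1 now beats token 0. The smallest bias at which that flip happens
# equals the logit gap logits[0] - logits[1] = 1.5, which is exactly the
# quantity a bias-only probe can recover.
print(p_plain.round(3), p_biased.round(3))
```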

Key Innovations

  • Bias-Only Extraction: A method that uses only the bias map to infer the complete logit output.
  • Black-Box Attack Simulation: Demonstrated how attackers could exploit even limited access to gain deep insight into a model's internals.
  • Security Insight: This work signals a need to reassess what features are safe to expose through public APIs.

Technical Approach

Mitka systematically applied biases to target tokens and recorded the resulting changes in output probability. From this controlled manipulation, the underlying logits could be inferred; no log probabilities were needed, only access to the bias map.
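
The sketch below illustrates that probing loop under one common black-box assumption, namely an API that applies a client-supplied bias map and reveals only which token comes out on top. It is not Mitka's implementation: `query_top_token`, the search bounds, and the iteration count are placeholders.

```python
def recover_logit_gap(query_top_token, target_id, lo=0.0, hi=100.0, iters=30):
    """Binary-search the smallest bias that makes `target_id` the top token.

    At the crossover point the bias roughly equals the gap between the logit of
    the unbiased top token and the logit of `target_id`, so the search recovers
    that gap using nothing but the bias map. `query_top_token(bias_map)` stands
    in for an API call that applies the bias map and returns the top token's id.
    """
    for _ in range(iters):
        mid = (lo + hi) / 2
        if query_top_token({target_id: mid}) == target_id:
            hi = mid    # the bias was enough: the crossover lies at or below mid
        else:
            lo = mid    # not enough yet: the crossover lies above mid
    return (lo + hi) / 2

# Running this for every vocabulary token and anchoring to the unbiased top
# token reconstructs the full next-token logit distribution up to a constant.
```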

Impact

This project contributes to a growing awareness of how LLMs, even in limited-access environments, can be vulnerable to extraction attacks. The findings are particularly relevant for companies deploying commercial models behind APIs.

For a detailed technical breakdown of the research and methodology, read the full blog post by Krystof Mitka.

What's Next

Further research may explore mitigation techniques, such as limiting or obfuscating bias manipulation options, and better understanding the trade-off between model openness and robustness against reverse engineering.
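
One direction such mitigations might take, sketched below purely as an illustration and not as a documented provider feature, is server-side sanitization of the bias map: clamping client-supplied values and adding a small amount of noise so that repeated probes can no longer pin down exact logit gaps.

```python
import random

MAX_ABS_BIAS = 10.0   # hypothetical cap on the per-token bias a client may apply
NOISE_SCALE = 0.05    # hypothetical jitter added before the bias reaches the model

def sanitize_bias_map(bias_map):
    """Clamp and jitter a client-supplied bias map before applying it."""
    sanitized = {}
    for token_id, b in bias_map.items():
        b = max(-MAX_ABS_BIAS, min(MAX_ABS_BIAS, b))  # limit the probing range
        b += random.gauss(0.0, NOISE_SCALE)           # blur the exact crossover point
        sanitized[token_id] = b
    return sanitized
```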
