Background
Krystof Mitka's project at the University of Twente explored a vulnerability in how production language models expose certain features through their APIs. Specifically, the work focused on reconstructing parts of a model's internal prediction mechanism even when access to log probabilities is restricted.
The Discovery
By studying the bias map functionality available in some large language model APIs, Mitka developed a technique to recover the model's full next-token logits. This effectively allows internal model behavior to be reverse engineered without full API access.
The work extends earlier research by applying a formal transformer-based analysis and proving that logit recovery is possible purely through controlled bias manipulation.
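The post does not spell out the underlying math, but the core observation can be illustrated with a toy softmax model: adding a known bias to a single token's logit shifts the output probabilities in a way that depends only on the gap between that token's logit and a reference token's logit, so the gap can be read back off. The minimal sketch below uses made-up logits and a made-up bias value, and it assumes the attacker can observe the biased probabilities exactly; the sketches further down relax that assumption.

```python
import numpy as np

def softmax(z):
    z = z - z.max()                            # shift for numerical stability
    e = np.exp(z)
    return e / e.sum()

# Hidden next-token logits of a toy 5-token model (unknown to the attacker).
true_logits = np.array([2.0, -1.3, 0.7, 4.1, -0.2])

ref = 0        # reference token that all gaps are measured against
bias = 10.0    # known bias applied to one target token per query

recovered = np.zeros_like(true_logits)
for t in range(len(true_logits)):
    if t == ref:
        continue                               # gap to itself is 0 by definition
    biased = true_logits.copy()
    biased[t] += bias                          # the only knob the "API" exposes
    p = softmax(biased)                        # attacker observes these probabilities
    # softmax identity: log(p[t] / p[ref]) = (z[t] + bias) - z[ref]
    recovered[t] = np.log(p[t] / p[ref]) - bias

print(recovered)                               # estimated gaps z[t] - z[ref]
print(true_logits - true_logits[ref])          # ground truth, should match
```

With one biased query per token, this recovers every logit up to the usual softmax shift-invariance, i.e. up to a single additive constant.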
Key Innovations
- Bias-Only Extraction: A method that uses only the bias map to infer the complete logit output.
- Black-Box Attack Simulation: Demonstrated how attackers could exploit even limited access to gain deep insight into a model's internals (a sketch of one such probe follows this list).
- Security Insight: This work signals a need to reassess what features are safe to expose through public APIs.
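The announcement gives no implementation details for the attack simulation, but one way a purely black-box probe could work, assuming the API returns only the greedily decoded top token and accepts a per-token bias map, is to binary-search the bias at which a target token overtakes the model's default top token; that crossover bias equals the logit gap between them. The query_top_token function below is a local stand-in for such an API, not a real client.

```python
import numpy as np

rng = np.random.default_rng(0)
true_logits = rng.normal(size=8)            # hidden logits of a toy 8-token model

def query_top_token(bias_map):
    """Stand-in for an API call: greedy decoding with a user-supplied bias map."""
    z = true_logits.copy()
    for tok, b in bias_map.items():
        z[tok] += b
    return int(np.argmax(z))

top = query_top_token({})                   # model's default top token

def logit_gap(target, lo=-40.0, hi=40.0, iters=50):
    """Smallest bias that makes `target` win over `top`, i.e. z[top] - z[target]."""
    for _ in range(iters):
        mid = (lo + hi) / 2
        if query_top_token({target: mid}) == target:
            hi = mid                        # bias was enough; try a smaller one
        else:
            lo = mid                        # not enough; try a larger one
    return (lo + hi) / 2

for t in range(len(true_logits)):
    if t == top:
        continue
    est = -logit_gap(t)                     # z[target] - z[top]
    print(t, round(est, 4), round(true_logits[t] - true_logits[top], 4))
```

Each recovered gap costs a handful of queries (one per binary-search step), and no probabilities of any kind are observed, only which token wins.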
Technical Approach
Mitka systematically applied biases to target tokens and recorded the resulting changes in output probability. From this controlled manipulation, the underlying logits could be inferred; no log probabilities were needed, only access to the bias map.
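The write-up does not say how those changes in output probability were recorded; one hedged reading, sketched below against a simulated model, is that they are estimated from the frequencies of sampled completions rather than from reported log probabilities. Applying a fixed, known bias to one target token at a time and inverting the same softmax identity as above then recovers the logits. The token values, bias, and sample count are illustrative, and the estimator is deliberately naive.

```python
import numpy as np

rng = np.random.default_rng(1)

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

true_logits = np.array([1.5, -0.4, 0.9, 3.2, -2.0])   # hidden from the attacker

def sample_next_token(bias_map, n):
    """Stand-in for n API calls, each returning one sampled next token (no logprobs)."""
    z = true_logits.copy()
    for tok, b in bias_map.items():
        z[tok] += b
    return rng.choice(len(z), size=n, p=softmax(z))

ref, bias, n = 0, 3.0, 200_000
for t in range(1, len(true_logits)):
    samples = sample_next_token({t: bias}, n)
    p_t = np.mean(samples == t)      # empirical frequency of the biased target token
    p_r = np.mean(samples == ref)    # empirical frequency of the reference token
    gap = np.log(p_t / p_r) - bias   # softmax identity gives z[t] - z[ref]
    print(t, round(gap, 2), round(true_logits[t] - true_logits[ref], 2))
```

The estimates converge to the true gaps as the sample count grows; a real attack would be far more query-efficient, but the principle is the same.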
Impact
This project contributes to a growing awareness of how LLMs, even in limited-access environments, can be vulnerable to extraction attacks. The findings are particularly relevant for companies deploying commercial models behind APIs.
For a detailed technical breakdown of the research and methodology, read the full blog post by Krystof Mitka.
What's Next
Further research may explore mitigation techniques, such as limiting or obfuscating bias manipulation options, as well as a better understanding of the trade-off between model openness and robustness against reverse engineering.
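None of these mitigations are specified in the project; as one hypothetical illustration, an API gateway could cap how many tokens a single request may bias, clamp bias magnitudes, and add a small amount of noise so that repeated probes are harder to line up exactly. All limits below are placeholder values.

```python
import random

# Hypothetical server-side guard applied to a request's bias map before it
# reaches the model; the specific limits are placeholder values.
MAX_BIASED_TOKENS = 16       # cap how many tokens one request may bias
MAX_BIAS_MAGNITUDE = 5.0     # clamp extreme bias values
NOISE_SCALE = 0.05           # small jitter to blur repeated probes

def sanitize_bias_map(bias_map: dict[int, float]) -> dict[int, float]:
    if len(bias_map) > MAX_BIASED_TOKENS:
        raise ValueError("too many biased tokens in one request")
    clean = {}
    for token_id, bias in bias_map.items():
        clipped = max(-MAX_BIAS_MAGNITUDE, min(MAX_BIAS_MAGNITUDE, bias))
        clean[token_id] = clipped + random.gauss(0.0, NOISE_SCALE)
    return clean

print(sanitize_bias_map({42: 100.0, 7: -2.5}))
```

Such guards raise the cost of precise logit recovery rather than ruling it out, which is exactly the openness-versus-robustness trade-off the project points to.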