fun research story about how we jailbroke the ChatGPT API:
so every time you run inference with a language model like GPT-whatever, the model outputs a full probability distribution over its entire vocabulary (~50,000 tokens)
but when you use their API, OpenAI hides all this info from you, and just returns the top token -- or at best, the top 5 probabilities
we needed the full vector (all 50,000 numbers!) for our research, so we developed a clever algorithm for recovering it by making many API calls
important to know is that the API supports a parameter called "logit bias" which lets you upweight or downweight the probability of certain tokens. our insight was that we could run a binary search on the logit bias for each token to find the smallest value that makes that token the most likely one. that value is the gap between the token's logit and the top token's logit, which gives you the token's probability relative to the top token
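here's a toy sketch of the binary search idea in python. to be clear, this is not the code from the repo: `top_token` is a made-up stand-in for an API call that only reveals the most likely token, the hidden logits are simulated locally, and the search bounds (`hi=40.0`, `tol=1e-6`) are arbitrary assumptions.

```python
import math

def top_token(logits, logit_bias):
    """Hypothetical stand-in for an API call: returns only the argmax token."""
    biased = [l + logit_bias.get(i, 0.0) for i, l in enumerate(logits)]
    return max(range(len(biased)), key=biased.__getitem__)

def min_bias_to_win(logits, token, hi=40.0, tol=1e-6):
    """Binary search for the smallest bias that makes `token` the top token.

    Assumes a bias of `hi` is always enough to make the token win.
    """
    lo = 0.0
    while hi - lo > tol:
        mid = (lo + hi) / 2
        if top_token(logits, {token: mid}) == token:
            hi = mid  # token already wins, try a smaller bias
        else:
            lo = mid  # not enough bias yet
    return hi

def recover_probs(logits):
    """Recover the full distribution from argmax-only queries."""
    top = top_token(logits, {})
    # gap[t] ~= logit[top] - logit[t], so exp(-gap[t]) = p[t] / p[top]
    gaps = [0.0 if t == top else min_bias_to_win(logits, t)
            for t in range(len(logits))]
    weights = [math.exp(-g) for g in gaps]
    total = sum(weights)
    return [w / total for w in weights]

if __name__ == "__main__":
    hidden_logits = [2.0, 0.5, -1.0, 0.0]  # pretend we can't see these
    print(recover_probs(hidden_logits))
```

the relative probabilities only pin things down up to a constant, so the last step normalizes them to sum to 1.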
to get a full next-token probability vector, we run 50,000 binary searches (it's actually not as expensive as you'd think) – shout out to
@justintchiu for coming up with this and implementing it efficiently!
and there's a bonus level: in the setting where OpenAI gives us the top-5 logprobs (available for some models), there's a much more efficient algorithm, with a pretty elegant solution
in this setting, to get the probability for a certain token, you just add a really large fixed logit bias to it. given its new probability (which OpenAI will give you, since that token will now be in the top 5), you can solve for its original probability in closed form.
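a quick sketch of what that closed form can look like, assuming a bias b is added to a single token: the token's odds p/(1-p) get multiplied by exp(b), so you can divide that factor back out. `biased_prob` below is a hypothetical stand-in for the API (it simulates the hidden model locally), and the derivation is my own working under that single-token assumption, not copied from the paper.

```python
import math

def biased_prob(logits, token, bias):
    """Hypothetical stand-in for the API: probability of `token` after biasing it."""
    shifted = [l + (bias if i == token else 0.0) for i, l in enumerate(logits)]
    m = max(shifted)  # subtract the max for numerical stability
    exps = [math.exp(s - m) for s in shifted]
    return exps[token] / sum(exps)

def original_prob(p_biased, bias):
    """Invert the bias: odds(p_biased) = odds(p_original) * exp(bias)."""
    odds = p_biased / (1.0 - p_biased) * math.exp(-bias)
    return odds / (1.0 + odds)

if __name__ == "__main__":
    hidden_logits = [2.0, 0.5, -1.0, 0.0]  # pretend we can't see these
    bias = 10.0
    p_seen = biased_prob(hidden_logits, token=2, bias=bias)
    print(original_prob(p_seen, bias))
```

one caveat with this sketch: if the bias is so large that the returned probability rounds to exactly 1.0, the inversion divides by zero, so the bias should be big enough to reach the top 5 but not much bigger.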
since in this setting OpenAI provides probabilities for the top 5 tokens in a single API call, and each token only needs to appear in one call, this new method lets you get the full vector in 50,000/5 = 10,000 queries
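one way the 5-tokens-per-call batching could work, as a hedged sketch: bias a batch of 5 tokens by the same amount, read off their 5 biased probabilities, undo the bias on the batch's total mass, then split that mass in the observed proportions (an equal bias doesn't change ratios inside the batch). `top5_with_bias` is a hypothetical stand-in for the API, the logits are simulated, and this batch derivation is my assumption about how it might be done, not necessarily what the repo does.

```python
import math

def top5_with_bias(logits, biased_tokens, bias, k=5):
    """Hypothetical stand-in for the API: top-k {token: prob} after biasing."""
    shifted = [l + (bias if i in biased_tokens else 0.0)
               for i, l in enumerate(logits)]
    m = max(shifted)
    exps = [math.exp(s - m) for s in shifted]
    z = sum(exps)
    probs = [e / z for e in exps]
    top = sorted(range(len(probs)), key=probs.__getitem__, reverse=True)[:k]
    return {t: probs[t] for t in top}

def recover_all(logits, bias=12.0, k=5):
    """Recover every token's probability using one call per batch of k tokens.

    Assumes the vocabulary is larger than k and that `bias` exceeds the
    largest logit gap, so every biased token lands in the top k.
    """
    vocab = len(logits)
    recovered = [0.0] * vocab
    calls = 0
    for start in range(0, vocab, k):
        batch = list(range(start, min(start + k, vocab)))
        seen = top5_with_bias(logits, set(batch), bias, k)
        calls += 1
        s_biased = sum(seen[t] for t in batch)  # batch mass after the bias
        odds = s_biased / (1.0 - s_biased) * math.exp(-bias)
        s_orig = odds / (1.0 + odds)            # batch mass before the bias
        for t in batch:
            recovered[t] = seen[t] * s_orig / s_biased
    return recovered, calls

if __name__ == "__main__":
    hidden_logits = [2.0 * math.sin(i) for i in range(12)]  # toy 12-token vocab
    probs, calls = recover_all(hidden_logits)
    print(calls, sum(probs))
```

in this sketch the number of calls is the vocab size divided by 5, rounded up.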
funnily enough, after we posted the code for the binary search algorithm we got an email from fellow researcher
@mattf1n with the math for the top-5 algorithm. and he followed it up with a pull request. nice guy!
if you thought this was interesting:
- want to run the algorithm yourself? check out the code here:
github.com/justinchiu/openlo…
- want to read about it? see Section 5 of our paper Language Model Inversion:
arxiv.org/abs/2311.13647