Testing LLMs (and prompts) like we test software: towardsdatascience.com/testi…
TL;DR: (1) You should, (2) How to test: specific properties, evaluate these with LLMs (perception is easier than generation), (3) What to test: get the LLM to help you figure it out.
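A minimal sketch of point (2), assuming a hypothetical `generate` callable (the model or prompt under test) and a hypothetical `judge` callable (an LLM asked a yes/no question about one specific property). Both are replaced with offline stubs here so the example runs without any API; none of these names come from the post.

```python
from typing import Callable, List, Tuple

# Hypothetical signatures, introduced only for illustration.
Generate = Callable[[str], str]          # model/prompt under test
Judge = Callable[[str, str, str], bool]  # (property, input, output) -> holds?


def find_violations(
    generate: Generate, judge: Judge, prop: str, inputs: List[str]
) -> List[Tuple[str, str]]:
    """Return (input, output) pairs where the judge says the property fails."""
    failures = []
    for text in inputs:
        output = generate(text)
        if not judge(prop, text, output):
            failures.append((text, output))
    return failures


def stub_generate(text: str) -> str:
    # Stand-in for an LLM asked to shorten text: keeps the first three words.
    words = text.split()
    return text if len(words) <= 3 else " ".join(words[:3])


def stub_judge(prop: str, text: str, output: str) -> bool:
    # Stand-in for an LLM evaluator; a real one would be prompted with `prop`.
    return len(output) < len(text)


inputs = ["the quick brown fox jumps", "hello there", "a b c d"]
violations = find_violations(
    stub_generate, stub_judge, "Is the output shorter than the input?", inputs
)
print(violations)
```

With real models, `judge` would wrap a call that asks the evaluator LLM the property question and parses a yes/no answer; the loop itself stays the same.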
Also highly relevant: guidance, from Microsoft.
"Guidance programs allow you to interleave generation, prompting, and logical control"
It also internally handles subtle but important tokenization-related issues, e.g. "token healing".
github.com/microsoft/guidanc…
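A toy illustration of the idea behind "token healing", as I understand it: drop the last prompt token, then only allow the first generated token to be one that starts with the dropped text, so the model can pick a longer token that spans the prompt boundary. The vocabulary and the `heal_prompt` helper are made up for this sketch and are not guidance's API.

```python
from typing import List, Tuple


def heal_prompt(prompt_tokens: List[str], vocab: List[str]) -> Tuple[List[str], List[str]]:
    """Back up one token and list the vocab tokens allowed as the next token."""
    *kept, last = prompt_tokens
    # Any token that begins with the removed text is a valid continuation.
    allowed = [tok for tok in vocab if tok.startswith(last)]
    return kept, allowed


# Hypothetical toy vocabulary: "://" is a single token a model would prefer
# after "http", but a prompt ending in ":" would otherwise rule it out.
vocab = ["http", ":", "://", "//", "www"]
kept, allowed = heal_prompt(["http", ":"], vocab)
print(kept, allowed)
```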
Blog post: playing with Vicuna-13B, ChatGPT (3.5), and MPT-7B-Chat on harder tasks: medium.com/@marcotcr/explori…
TL;DR: We think ChatGPT is still way ahead, but sometimes the extra control from open source models is worth it.