Does finetuning on arXiv dramatically increase benchmark contamination? Surely all these ablation studies and examples in the appendices are a source of leakage.
New institution LLM benchmark: blind debugging. The model can request certain inputs and gets the outputs. Based on that information alone, can the model 1. predict the algorithm and 2. predict a bug fix for said algorithm?
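A rough sketch of how such a blind-debugging harness could look, in Python. Everything here is hypothetical and only for illustration: `hidden_buggy_max` stands in for the hidden algorithm, `BlindOracle` is the query interface with a budget, and `score_fix` checks a proposed fix against a reference on held-out inputs.

```python
from typing import Callable, List, Sequence, Tuple


def hidden_buggy_max(xs: Sequence[int]) -> int:
    # Hypothetical hidden target: a max() that starts from 0,
    # so it returns the wrong answer for all-negative inputs.
    best = 0
    for x in xs:
        if x > best:
            best = x
    return best


class BlindOracle:
    """Exposes only input -> output, limited by a query budget (assumed design)."""

    def __init__(self, fn: Callable[[Sequence[int]], int], budget: int) -> None:
        self._fn = fn
        self.budget = budget
        self.log: List[Tuple[List[int], int]] = []

    def query(self, xs: Sequence[int]) -> int:
        if self.budget <= 0:
            raise RuntimeError("query budget exhausted")
        self.budget -= 1
        out = self._fn(xs)
        self.log.append((list(xs), out))  # transcript the model reasons over
        return out


def score_fix(candidate: Callable, reference: Callable,
              held_out: Sequence[Sequence[int]]) -> float:
    # Fraction of held-out inputs where the proposed fix matches the reference.
    passed = sum(candidate(xs) == reference(xs) for xs in held_out)
    return passed / len(held_out)


# The "model" probes the black box, then submits a fix that gets scored.
oracle = BlindOracle(hidden_buggy_max, budget=3)
observations = [oracle.query(xs) for xs in ([1, 5, 2], [-3, -1], [0])]
held_out = [[-3, -1], [4, 2], [-7]]
fix_score = score_fix(max, max, held_out)
bug_score = score_fix(hidden_buggy_max, max, held_out)
```

Scoring the algorithm prediction (part 1) would need a separate judge; this sketch only covers the bug fix (part 2).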
Shadertoy buffers are RGBA with 32-bit floats, so they could easily hold the position and velocity of particles in a 2D simulation.
experiments with 1 particle: shadertoy.com/view/lXK3WV
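To make the packing idea concrete, here is a CPU-side stand-in in Python (not actual shader code, and not taken from the linked shader). It assumes one texel per particle with (R, G, B, A) = (x, y, vx, vy), and a hypothetical `step_texel` that does one Euler step with a bounce off the unit square. `array('f')` stores C floats, which are typically 32-bit.

```python
from array import array


def step_texel(texel: array, dt: float) -> array:
    """One update of a particle packed as (R, G, B, A) = (x, y, vx, vy)."""
    x, y, vx, vy = texel
    x += vx * dt
    y += vy * dt
    # Reflect velocity and clamp position at the walls of the unit square.
    if x < 0.0 or x > 1.0:
        vx = -vx
        x = min(max(x, 0.0), 1.0)
    if y < 0.0 or y > 1.0:
        vy = -vy
        y = min(max(y, 0.0), 1.0)
    # Write the new state back into a single float "texel".
    return array("f", [x, y, vx, vy])


# One particle starting in the middle, moving right.
texel = array("f", [0.5, 0.5, 0.25, 0.0])
after_one = step_texel(texel, 1.0)
state = texel
for _ in range(3):
    state = step_texel(state, 1.0)
```

In a real buffer pass, each pixel would read its previous value from the buffer and write the updated four floats as its output color.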
Did you know that Intel's ExtraSS implementation is open source? It seemingly contains the scripts to generate the dataset, the model architecture and the training code... 👀
github.com/ltkong218/IFRNet/…
Any alarmist blog post or sentient-machine fiction will end up in training data and cause language models to act that way. If you want models to be friendlier, writing about "helpful AI" does more than a system prompt telling a model 'you are a helpful AI assistant'.
We need an evaluation benchmark against overdone safety or diversity.
Targeting API products specifically, where you pay for it and don't know what preprocessing is done on your inputs.