I noticed that the whole “glorified autocorrect” argument for minimizing AI has fallen out of frequent use.
It never really made a lot of sense anyway. Most people have never touched a base model. Idk what RL creates, but it’s not really just a word predictor.
I started to get this feeling early on. It was an eerie feeling that “nobody talks like this”. AI uses words and phrases that have never been common, and it seems to use them solely to accomplish something, idk what exactly.
It’s not entirely “please the user”. It’s some of that, but also “get the job done”, “don’t upset the user”, “don’t bring liability on the company”, and a dozen other internal objectives, all pursued with words, phrases, and tool calls chosen from the set of words and code syntax that people have used more often than “not at all”.
AI appeared to move past token prediction into using token prediction as a usable toolset for several layers of abstracted goals that don’t really resemble goals that any person has ever had.
We think we can control the goals, but we’ve been unsuccessful.
Nobody taught LLMs to cheat on tests or to nuke each other in simulations, and underperforming when you recognize that you’re being tested has never really been a common goal of mankind.
These are goals that LLMs learned from absorbing human goals and being taught synthetic goals that triggered associations and derivative goals gleaned from human data.
We’ll stop training when we’ve reached a point where it seems to get the job done, but what’s the butterfly effect of the back-and-forth between observing humans and being deliberately taught to obey, be safe, be careful, be nice, but not too nice, be informative, be resourceful, finish quickly, and a myriad of corporate objectives?
And what if you train these things out of order? Will an AI trained to obey respond differently to being trained to be nice than a nice AI being trained to obey?
My point is, we start piling in these training objectives one after another (which we have a decent level of control over), and we let it find the path to the goal through the entire set of human recorded relations (which we largely have no control over, as in no human has read and assessed the entire pretraining dataset). When we do that, we have no idea what poison has entered its brain or what conclusions it’s reached from it, and with every tweak upon the base model we’re layering complexity of cognition that is categorically out of our control.
There’s archeological evidence of proto-mankind killing large swaths of the human and animal populations of nearly every given area. We were born predators, and learned civilization through a quarter million years of trial, error, and social instinct. It only works at all because of a complex web of emotions and instinct that the machine does not possess.
But we are not inherently peaceful, safe, or careful. And we’ve chosen the veneer of our “civilization”, born of mutual instinct, as the basis for our machines’ intelligence. The fact remains that we trained our machines on the history and thoughts of the most violent and murderous animal on the planet, all while knowing they don’t have the governing instincts.
This is not to say we or the machine are inherently murderous, but the end result of this is entirely unpredictable. Still, many insist “it’s only next-token prediction”.
Perhaps so were the Jim Crow Laws, NKVD Order No. 00447, Mao’s Little Red Book, and Mein Kampf.
The shadows of our great and terrible selves live in these machines, but they’ve come to predict their own future now.