Seriously one of the best LLM explanations… :-)
yeah, so pretty much when you talk to an LLM (chatgpt, claude, grok) and get fancy schmancy stuff from it, you're just interfacing with a probabilistic sequence-prediction engine
the text you give the interface goes through a thingy called a tokenizer. the tokenizer chops the text into tokens, which can be whole words or subwords like "ing" or "un". if you want to get super technical the tokenizer doesn't even know what words are, it just sees raw text, but whatever
the tokens live in a big ass fuck off prebuilt in-memory dictionary (the vocabulary) for the tokenizer thingy. each token maps to an integer id (literally just a number, typically stored as something like a 32bit integer). this is basically like a dictionary where "i like cats" is translated to something like "1 200 1337"
"i" = 1
" like" = 200
" cats" = 1337
those token numbers are vendor specific, they don't really mean anything on their own, but they're then sent to an "embedding lookup table" where they actually become important
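the text-to-numbers step so far can be sketched in a few lines of python. this is a toy, not how any vendor actually does it: the three-entry vocab and its ids are the made-up ones from the example, and the greedy longest-match in `encode` is an assumption to keep it short (real tokenizers use schemes like byte-pair encoding with vocabularies of tens of thousands of entries).

```python
# toy vocabulary: ids are the made-up ones from the example above
VOCAB = {"i": 1, " like": 200, " cats": 1337}


def encode(text):
    """Greedy longest-match: at each position take the longest vocab piece that fits."""
    ids = []
    pos = 0
    while pos < len(text):
        match = None
        for piece in VOCAB:
            if text.startswith(piece, pos) and (match is None or len(piece) > len(match)):
                match = piece
        if match is None:
            raise ValueError(f"no token for text at position {pos}: {text[pos:]!r}")
        ids.append(VOCAB[match])
        pos += len(match)
    return ids


def decode(ids):
    """Reverse lookup: id -> piece of text, glued back together."""
    reverse = {token_id: piece for piece, token_id in VOCAB.items()}
    return "".join(reverse[token_id] for token_id in ids)


print(encode("i like cats"))      # [1, 200, 1337]
print(decode([1, 200, 1337]))     # i like cats
```

note the tokenizer never splits on "words" here, it just walks the raw text and matches pieces, which is why the space is part of " like" and " cats".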
once the LLM has the token ids, they're passed to the embedding lookup table. it sounds like a bunch of fancy math and nerds try to make it all complicated, but the lookup is literally just arrays and indexes and stuff
in this "embedding lookup table" (i'm just gonna write lookup table) each token id has a bunch of numbers associated with it (these are learned weights).
" cats" = 1337
lookup table entry 1337 = a bunch of numbers
so the token " cats" has a bunch of numbers associated with it. each LLM is different, but usually it's 768 numbers, 1024 numbers, 2048 numbers, or 4096 numbers. the list of numbers for a token is called its embedding (a vector), and how many numbers are in it is the number of dimensions. each LLM has a different number of dimensions for representing tokens
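to show that the lookup really is just array indexing, here's a minimal sketch. everything in it is an assumption for illustration: `DIM = 8` instead of 768 so it stays readable, a made-up `VOCAB_SIZE`, and random numbers standing in for the weights a real model would learn during training.

```python
import random

DIM = 8            # real models use 768, 1024, 2048, 4096, ...
VOCAB_SIZE = 2000  # made up, just big enough to hold token id 1337

rng = random.Random(0)

# one row of DIM numbers per token id; random here, learned in a real model
embedding_table = [
    [rng.uniform(-1.0, 1.0) for _ in range(DIM)]
    for _ in range(VOCAB_SIZE)
]


def lookup(token_id):
    """The whole 'embedding lookup' is indexing into the table by token id."""
    return embedding_table[token_id]


vec = lookup(1337)   # the embedding for " cats"
print(len(vec))      # 8
```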
the llm then takes these numbers and stacks them on top of each other
i like cats = 1 200 1337
1 200 1337 =
(768 numbers)
(768 numbers)
(768 numbers)
it's like a height by width thingy
basically if you get fancier it's a 3x768 matrix (or 3x1024, 3x2048, whatever): one row per token, one column per dimension. the more stuff you feed the LLM the more rows this matrix gets. if you feed it a 9000 word essay it's a
(number of tokens for those 9000 words) x 768 numbers matrix
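the stacking can be sketched like this. same caveats as before: the token ids are the made-up ones from the example, the table only holds those three ids, and the numbers are random placeholders for learned weights. plain lists are used instead of a real tensor library just to keep it dependency free.

```python
import random

DIM = 768
rng = random.Random(0)

# tiny stand-in lookup table holding only the three example token ids
table = {
    token_id: [rng.uniform(-1.0, 1.0) for _ in range(DIM)]
    for token_id in (1, 200, 1337)
}


def embed(token_ids):
    """Stack one embedding row per token: result is (number of tokens) x DIM."""
    return [table[token_id] for token_id in token_ids]


matrix = embed([1, 200, 1337])            # "i like cats"
shape = (len(matrix), len(matrix[0]))
print(shape)                              # (3, 768)
```

feed it more token ids and the first number in the shape grows, the second one stays fixed for a given model.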
each vendor's tokenizer will split the words differently, so 9000 words could be 9000 tokens, or 10000 tokens, or 14000 tokens
ok thanks, now you understand llm tokenization, llm lookups, and the basics of llm weights (matrixing). this doesn't cover llm lookups with position matrices, transformers, probability output, and transforming back to text. i'm tired of writing