1. Token
A token is the smallest unit the model actually reads and writes.
It’s not exactly a word.
For example:
•"hello" → 1 token
•"unbelievable" → might be ["un", "believ", "able"]
•Code, emojis, spaces — all become tokens.
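To make the splitting concrete, here is a minimal sketch of a greedy longest-match tokenizer. The vocabulary is a hypothetical toy; real tokenizers learn their pieces from data (for example with byte-pair encoding) and will split text differently.

```python
# Hypothetical toy vocabulary; real tokenizers learn tens of thousands of pieces.
VOCAB = {"hello", "un", "believ", "able", " "}

def tokenize(text: str) -> list[str]:
    """Greedy longest-match split; unknown characters become single-char tokens."""
    tokens = []
    i = 0
    while i < len(text):
        # Try the longest candidate first, shrinking until a piece is in VOCAB.
        for j in range(len(text), i, -1):
            if text[i:j] in VOCAB:
                tokens.append(text[i:j])
                i = j
                break
        else:
            tokens.append(text[i])  # fallback: one character per token
            i += 1
    return tokens

print(tokenize("hello"))         # ['hello']
print(tokenize("unbelievable"))  # ['un', 'believ', 'able']
```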
Why engineers care:
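•Cost scales with tokens (most APIs bill per input and output token)
•Latency scales with tokens (output is generated one token at a time)
•Limits are counted in tokens
If you exceed the token limit, the request is typically rejected or the input is truncated, so the model never sees the overflow.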
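A rough sketch of the arithmetic behind those three points. The prices and the 8192-token limit below are made-up placeholders, not any provider's real numbers.

```python
def estimate_cost(input_tokens: int, output_tokens: int,
                  price_in_per_1k: float, price_out_per_1k: float) -> float:
    """Cost of one request when input and output tokens are billed separately."""
    return (input_tokens / 1000 * price_in_per_1k
            + output_tokens / 1000 * price_out_per_1k)

def fits_limit(input_tokens: int, max_output_tokens: int, context_limit: int) -> bool:
    """Input plus the reserved output budget must fit inside the limit."""
    return input_tokens + max_output_tokens <= context_limit

# Hypothetical prices (per 1,000 tokens) and a hypothetical limit.
print(estimate_cost(2000, 500, price_in_per_1k=0.50, price_out_per_1k=1.50))  # 1.75
print(fits_limit(8000, 1000, context_limit=8192))  # False
```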
2. Context Window
The context window is how much the model can “remember” at once.
It includes:
•Your prompt
•Conversation history
•System instructions
•Retrieved documents (RAG)
Once you cross the window size, something has to give: depending on the system, the request fails or the oldest tokens get dropped.
It’s like RAM, not disk.
If it’s not in memory, the model can’t reason about it.
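One common way applications cope is to trim history before sending it. A minimal sketch, assuming a crude word-count stand-in for a real tokenizer and a hypothetical `fit_to_window` helper:

```python
def count_tokens(text: str) -> int:
    # Crude stand-in: one token per whitespace-separated word.
    return len(text.split())

def fit_to_window(system: str, history: list[str], max_tokens: int) -> list[str]:
    """Keep the system message, then as many of the newest messages as fit."""
    budget = max_tokens - count_tokens(system)
    kept = []
    for message in reversed(history):  # walk from newest to oldest
        cost = count_tokens(message)
        if cost > budget:
            break  # this message and everything older is dropped
        kept.append(message)
        budget -= cost
    return [system] + kept[::-1]  # restore chronological order

history = ["one two three", "four five", "six"]
print(fit_to_window("be brief", history, max_tokens=5))
# ['be brief', 'four five', 'six']
```

Dropping the oldest turns is only one policy; summarizing old turns is another, at the cost of an extra model call.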
3. Prompt
A prompt is the input you give the model to shape its behavior and output.
This includes:
•Instructions (“Act like a senior backend engineer”)
•Data (logs, code, JSON)
•Constraints (format, tone, rules)
Important truth:
LLMs don’t “understand intent” — they follow patterns.
A bad prompt is like a vague API contract.
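Treating the prompt as a contract can be as simple as assembling it from named parts. This is a sketch with a hypothetical `build_prompt` helper; the section labels are an arbitrary choice, not a required format.

```python
def build_prompt(instructions: str, data: str, constraints: list[str]) -> str:
    """Assemble the three parts of a prompt into one explicit, labeled string."""
    rules = "\n".join(f"- {c}" for c in constraints)
    return (
        f"Instructions:\n{instructions}\n\n"
        f"Data:\n{data}\n\n"
        f"Constraints:\n{rules}"
    )

prompt = build_prompt(
    instructions="Act like a senior backend engineer and find the root cause.",
    data="ERROR 2024-01-01 db timeout after 30s",
    constraints=["Answer in JSON", "At most 3 sentences"],
)
print(prompt)
```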
4. Embedding
An embedding is a numerical vector representation of text meaning.
Similar meaning → vectors close together
Different meaning → vectors far apart
Used for:
•Semantic search
•Recommendations
•Clustering
•RAG
Mental model:
Text → vector → math → relevance
This is how machines compare meaning, not keywords.
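The "math" step is often cosine similarity. A minimal sketch with made-up 3-dimensional vectors; real embeddings come from a model and usually have hundreds or thousands of dimensions.

```python
import math

def cosine_similarity(a: list[float], b: list[float]) -> float:
    """Cosine of the angle between two vectors: 1.0 means same direction."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

# Hypothetical embeddings, hand-written for illustration only.
EMBEDDINGS = {
    "reset my password": [0.9, 0.1, 0.0],
    "forgot login credentials": [0.8, 0.2, 0.1],
    "best pizza in town": [0.0, 0.1, 0.9],
}

def most_similar(query_vec: list[float], embeddings: dict[str, list[float]]) -> str:
    """Return the text whose vector is closest in direction to the query."""
    return max(embeddings, key=lambda text: cosine_similarity(query_vec, embeddings[text]))

print(most_similar([0.85, 0.15, 0.05], EMBEDDINGS))
```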
5. Temperature
Temperature controls randomness.
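•0.0 → near-deterministic, boring, safe
•0.7 → balanced
•1.0 → the model's unmodified distribution: more creative, more risky
•Above 1.0 (where the API allows it) → increasingly chaotic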
Rule of thumb:
•Use low temperature for code, configs, facts
•Use higher temperature for brainstorming or writing
It doesn’t make the model smarter — just more adventurous.
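Under the hood, temperature usually divides the model's raw scores (logits) before they are turned into probabilities. A minimal sketch with made-up logits; a temperature of exactly 0 is typically handled as a special case (always pick the top token) since it cannot be divided by.

```python
import math

def softmax_with_temperature(logits: list[float], temperature: float) -> list[float]:
    """Turn logits into probabilities; lower temperature sharpens the distribution."""
    if temperature <= 0:
        raise ValueError("temperature must be > 0; treat 0 as 'pick the argmax'")
    scaled = [logit / temperature for logit in logits]
    peak = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(s - peak) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]

logits = [2.0, 1.0, 0.1]  # hypothetical scores for three candidate tokens
for t in (0.2, 1.0, 2.0):
    print(t, [round(p, 3) for p in softmax_with_temperature(logits, t)])
```

The ranking of tokens never changes; only how much probability the top choice hoards.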
6. Top-P (Nucleus Sampling)
Top-P limits the model to the smallest set of tokens whose total probability ≥ P.
Example:
•top_p = 0.9 → sample only from the most likely tokens that together cover 90% of the probability mass
Difference from temperature:
•Temperature reshapes probabilities
•Top-P trims the tail of unlikely nonsense
Production systems often set both, though some API providers suggest adjusting one at a time.
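The trimming step can be sketched in a few lines. The token probabilities are invented for illustration, and `top_p_filter` is a hypothetical helper name.

```python
def top_p_filter(probs: dict[str, float], top_p: float) -> dict[str, float]:
    """Keep the smallest set of top tokens whose cumulative probability >= top_p."""
    ranked = sorted(probs.items(), key=lambda kv: kv[1], reverse=True)
    kept = []
    cumulative = 0.0
    for token, p in ranked:
        kept.append((token, p))
        cumulative += p
        if cumulative >= top_p:
            break  # everything after this point is the discarded tail
    total = sum(p for _, p in kept)
    # Renormalize so the surviving probabilities sum to 1 again.
    return {token: p / total for token, p in kept}

probs = {"the": 0.5, "a": 0.3, "an": 0.15, "zebra": 0.05}
print(sorted(top_p_filter(probs, 0.9)))  # ['a', 'an', 'the']
```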
7. Hallucination
A hallucination is when the model confidently produces incorrect information.
Why it happens:
•Missing context
•No access to source of truth
•Probabilistic guessing under uncertainty
Key insight:
LLMs optimize for plausibility, not truth.
If correctness matters, you must:
•Ground it with data (RAG)
•Add verification
•Reduce temperature (fewer random slips, though it cannot supply missing facts)
8. LLM (Large Language Model)
An LLM is a neural network trained to predict the next token, at massive scale.
It doesn’t:
•Think
•Reason like humans
•Understand meaning inherently
It does:
•Recognize patterns extremely well
•Compress large amounts of knowledge
•Generate surprisingly useful behavior
Think of it as:
A probabilistic autocomplete trained on the internet.
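To see what "probabilistic autocomplete" means mechanically, here is a toy that counts which word follows which and samples from those counts. It is only an illustration of next-token prediction: real LLMs use neural networks over subword tokens and much longer context, not word-pair counts.

```python
import random
from collections import Counter, defaultdict
from typing import Optional

def train_bigram(text: str) -> dict[str, Counter]:
    """Count, for each word, how often each other word follows it."""
    counts: dict[str, Counter] = defaultdict(Counter)
    words = text.split()
    for current, following in zip(words, words[1:]):
        counts[current][following] += 1
    return counts

def next_token(counts: dict[str, Counter], current: str,
               rng: random.Random) -> Optional[str]:
    """Sample the next word in proportion to how often it followed `current`."""
    options = counts.get(current)
    if not options:
        return None  # never saw anything follow this word
    tokens = list(options)
    weights = [options[t] for t in tokens]
    return rng.choices(tokens, weights=weights, k=1)[0]

counts = train_bigram("the cat sat on the mat the cat ran")
print(counts["the"])  # Counter({'cat': 2, 'mat': 1})
print(next_token(counts, "the", random.Random(0)))
```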
9. RAG (Retrieval Augmented Generation)
RAG = fetch real data first, then ask the LLM to reason over it.
Flow:
1. User asks a question
2. System retrieves relevant docs (via embeddings)
3. Docs are injected into the prompt
4. LLM generates grounded output
Why engineers love RAG:
•Reduces hallucinations
•Keeps data fresh
•Avoids retraining models
It’s basically LLM + database search.
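The retrieve-then-inject part of the flow can be sketched without any external service. Everything here is a stand-in: `embed` uses word counts instead of a real embedding model, the documents are invented, and the final LLM call is left out.

```python
import math
from collections import Counter

def embed(text: str) -> Counter:
    # Stand-in for a real embedding model: lowercase word counts.
    return Counter(text.lower().split())

def cosine(a: Counter, b: Counter) -> float:
    dot = sum(a[w] * b[w] for w in a)
    norm = (math.sqrt(sum(v * v for v in a.values()))
            * math.sqrt(sum(v * v for v in b.values())))
    return dot / norm if norm else 0.0

def retrieve(question: str, docs: list[str], k: int = 1) -> list[str]:
    """Step 2: rank docs by similarity to the question, keep the top k."""
    q = embed(question)
    return sorted(docs, key=lambda d: cosine(q, embed(d)), reverse=True)[:k]

def build_rag_prompt(question: str, docs: list[str], k: int = 1) -> str:
    """Step 3: inject the retrieved docs into the prompt."""
    context = "\n".join(f"- {d}" for d in retrieve(question, docs, k))
    return (f"Answer using only the context below.\n\n"
            f"Context:\n{context}\n\nQuestion: {question}")

DOCS = [
    "Refunds are processed within 5 business days.",
    "Our office is closed on public holidays.",
    "Password resets require email verification.",
]
print(build_rag_prompt("How long do refunds take?", DOCS))
```

Step 4 would send this prompt to a model; the quality of the answer then depends heavily on whether retrieval surfaced the right document.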
10. Inference
Inference is the act of running the trained model to generate output.
Training = expensive, offline, done once (or rarely)
Inference = cheaper per run, online, repeated for every request
Concerns during inference:
•Latency
•Cost per token
•Throughput
•Streaming vs batch
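A small sketch of how latency and throughput might be measured around a streaming response. `fake_model_stream` is a hypothetical stand-in that sleeps instead of computing; a real client would iterate over chunks from an API or a local runtime.

```python
import time
from typing import Iterator

def fake_model_stream(prompt: str, delay_s: float = 0.01) -> Iterator[str]:
    """Hypothetical stand-in for a model: yields one token at a time."""
    for token in ["Inference", " is", " running", " the", " model", "."]:
        time.sleep(delay_s)  # pretend this is per-token compute
        yield token

def measure(prompt: str) -> dict[str, float]:
    """Consume the stream and record time to first token and throughput."""
    start = time.perf_counter()
    first_token_at = None
    count = 0
    for _ in fake_model_stream(prompt):
        if first_token_at is None:
            first_token_at = time.perf_counter()
        count += 1
    total = time.perf_counter() - start
    return {
        "time_to_first_token_s": first_token_at - start,
        "total_s": total,
        "tokens_per_s": count / total,
    }

print(measure("Explain inference"))
```

With streaming, the user starts reading after the time to first token; with batch, they wait for the full total.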