RL isn’t “just an algorithm.”
It’s an agent and an environment. The environment is everything outside the agent: what it can see, what it can do, and how it gets rewarded for doing it.
It’s the rules of the game, the physics, the scoreboard, and the win condition all rolled into one.
Formally, an RL environment is a Markov Decision Process: a state space (what situations exist), an action space (what moves are allowed), transition dynamics (how the world changes when you act), and a reward function (what you care about). At each step the agent gets a state and picks an action; the environment then jumps to a new state and hands back a scalar reward. All of the beautiful math in RL - value functions, Bellman equations, optimal policies - is built on this loop.
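To make the four pieces concrete, here is a minimal sketch of an MDP and the Bellman optimality backup applied to it. The two-state world, its probabilities, and the names `P`, `reward`, and `value_iteration` are all invented for illustration. The discount factor `gamma` is also an assumption, since the text above doesn't introduce one.

```python
# Hypothetical two-state MDP, invented purely for illustration.
STATES = ["start", "goal"]   # state space
ACTIONS = ["stay", "move"]   # action space

# Transition dynamics: P[state][action] -> list of (probability, next_state).
P = {
    "start": {"stay": [(1.0, "start")], "move": [(0.8, "goal"), (0.2, "start")]},
    "goal": {"stay": [(1.0, "goal")], "move": [(1.0, "goal")]},
}


def reward(state, action, next_state):
    """Reward function: +1 for arriving at the goal from elsewhere, else 0."""
    return 1.0 if state != "goal" and next_state == "goal" else 0.0


def value_iteration(gamma=0.9, tol=1e-8):
    """Repeat the Bellman optimality backup until the values stop changing."""
    V = {s: 0.0 for s in STATES}
    while True:
        delta = 0.0
        for s in STATES:
            # Best expected one-step reward plus discounted value of the next state.
            best = max(
                sum(p * (reward(s, a, s2) + gamma * V[s2]) for p, s2 in P[s][a])
                for a in ACTIONS
            )
            delta = max(delta, abs(best - V[s]))
            V[s] = best
        if delta < tol:
            return V


values = value_iteration()
```

In this toy world the goal state ends up with value 0 (nothing left to earn) and the start state with roughly 0.976, which is 0.8 / (1 - 0.9 * 0.2).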
The state and action spaces define what’s even possible to learn. A simple gridworld with four discrete moves is a different universe from a robot arm with continuous joint angles. Fully observable environments spoon‑feed you the whole state; partially observable ones force the agent to infer hidden variables from a stream of noisy observations, turning the problem into a POMDP. That’s where recurrence, memory, and belief states stop being “fancy” and become mandatory.
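One way to picture a belief state is as a probability distribution over hidden states that gets updated after every observation. The sketch below is a plain discrete Bayes filter, not any particular library's implementation. The door example, the sensor probabilities, and the names `belief_update`, `transition`, and `obs_model` are hypothetical, and for brevity it assumes the dynamics don't depend on the agent's action.

```python
def belief_update(belief, observation, transition, obs_model):
    """One Bayes-filter step: predict with the dynamics, then correct with the observation."""
    states = list(belief)
    # Predict: push the current belief through the transition model.
    predicted = {
        s2: sum(belief[s] * transition[s][s2] for s in states) for s2 in states
    }
    # Correct: weight each state by how likely the observation is there.
    unnormalized = {s: predicted[s] * obs_model[s][observation] for s in states}
    total = sum(unnormalized.values())  # assumed nonzero in this sketch
    return {s: v / total for s, v in unnormalized.items()}


# Hypothetical example: is a door open or closed? The door never changes,
# but the sensor is noisy.
transition = {
    "open": {"open": 1.0, "closed": 0.0},
    "closed": {"open": 0.0, "closed": 1.0},
}
obs_model = {
    "open": {"see_open": 0.8, "see_closed": 0.2},
    "closed": {"see_open": 0.3, "see_closed": 0.7},
}
belief = {"open": 0.5, "closed": 0.5}
belief = belief_update(belief, "see_open", transition, obs_model)
```

After a single "see_open" reading the belief in "open" rises from 0.5 to about 0.73. The agent never sees the true state; it only gets more or less confident about it.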
Rewards are where most real systems go to die. A reward is just a number, but it encodes the entire objective. Sparse rewards (only at the goal) are brutal to learn from; dense rewards (lots of shaping) are easier but invite reward hacking, where the agent finds shortcuts you didn’t intend. The classic pattern: “optimize clicks” turns into clickbait; “maximize points” turns into glitch‑farming. Good reward design is less about clever formulas and more about being paranoid about how your agent will exploit them.
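Here is a toy illustration of how a well-meant dense reward can be farmed. Everything in it is hypothetical: a 1-D corridor with the goal at position 5, three candidate reward functions, and two hand-written trajectories. The "signed" variant is a simple form of potential-based shaping (with no discounting), named here as one possible fix rather than the only one.

```python
GOAL = 5  # hypothetical goal position on a 1-D corridor


def sparse_reward(pos, new_pos):
    """Pays only on reaching the goal."""
    return 1.0 if new_pos == GOAL else 0.0


def naive_dense_reward(pos, new_pos):
    """Pays for progress but never charges for retreat: exploitable."""
    return float(max(0, abs(GOAL - pos) - abs(GOAL - new_pos)))


def signed_dense_reward(pos, new_pos):
    """Signed progress: retreating costs exactly what advancing earned."""
    return float(abs(GOAL - pos) - abs(GOAL - new_pos))


def total_reward(reward_fn, positions):
    """Sum a reward function over consecutive pairs of positions."""
    return sum(reward_fn(a, b) for a, b in zip(positions, positions[1:]))


direct = [0, 1, 2, 3, 4, 5]   # walks straight to the goal
oscillate = [0, 1] * 7        # steps forward and back, never gets anywhere

for name, fn in [
    ("sparse", sparse_reward),
    ("naive dense", naive_dense_reward),
    ("signed dense", signed_dense_reward),
]:
    print(name, total_reward(fn, direct), total_reward(fn, oscillate))
```

Under the naive dense reward, the oscillating trajectory out-earns the one that actually reaches the goal, which is exactly the kind of shortcut the paragraph above warns about. The signed version closes that particular loophole, but it is only a sketch of the idea, not a guarantee against other exploits.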
Reinforcement learning starts with a simple picture: an agent interacts with an environment, sees a state, takes an action, gets a reward, and the environment moves to a new state. Formally that environment is an MDP with four pieces: a state space, an action space, transition dynamics, and a reward function. Change any of those and you’ve defined a different “universe” for the agent to learn in.
Classical RL environments were mostly small, clean simulations: gridworlds, CartPole, MuJoCo robots, Atari. The design questions were: Is the state discrete or continuous? Is the action space discrete or continuous? Is the world deterministic or stochastic? Are rewards sparse or dense? Is it fully observable, or a POMDP where the agent only gets partial glimpses and must remember history? All the famous control benchmarks (CartPole, Pendulum, Ant, Humanoid) are just different choices along those axes, packaged behind a common Gym-style API with `reset()` and `step()` calls.
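As a rough sketch of what that interface looks like, here is a tiny hand-rolled environment. It deliberately does not import any library; it only mirrors the Gymnasium convention in which `reset()` returns `(observation, info)` and `step()` returns `(observation, reward, terminated, truncated, info)` (older Gym versions returned a single `done` flag instead). The `CorridorEnv` class and its parameters are hypothetical.

```python
import random


class CorridorEnv:
    """Hypothetical 1-D corridor: start at 0, reach position `length` to succeed."""

    def __init__(self, length=5, max_steps=20, slip_prob=0.0, seed=None):
        self.length = length
        self.max_steps = max_steps
        self.slip_prob = slip_prob      # > 0 makes the world stochastic
        self.rng = random.Random(seed)
        self.pos = 0
        self.t = 0

    def reset(self):
        """Start a new episode and return the first observation."""
        self.pos = 0
        self.t = 0
        return self.pos, {}

    def step(self, action):
        """Apply an action: 0 = left, 1 = right."""
        if self.rng.random() < self.slip_prob:
            action = 1 - action         # the move slips in the wrong direction
        delta = 1 if action == 1 else -1
        self.pos = min(self.length, max(0, self.pos + delta))
        self.t += 1
        terminated = self.pos == self.length                  # reached the goal
        truncated = not terminated and self.t >= self.max_steps  # ran out of time
        reward = 1.0 if terminated else 0.0                   # sparse reward
        return self.pos, reward, terminated, truncated, {}


# The standard interaction loop, with a trivial "always go right" policy.
env = CorridorEnv()
obs, info = env.reset()
total, steps, done = 0.0, 0, False
while not done:
    obs, reward, terminated, truncated, info = env.step(1)
    total += reward
    steps += 1
    done = terminated or truncated
```

Each design axis from the paragraph above maps to one knob here: `slip_prob` toggles deterministic versus stochastic dynamics, the reward line decides sparse versus dense, and returning `self.pos` directly makes the environment fully observable.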
Modern RL environments for LLMs and agents look very different. Instead of a tiny state vector and a handful of actions, the “state” can be a browser viewport or an OS screen, and the “action” is a sequence of keypresses, mouse moves, or tool calls. Benchmarks like WebArena, WebVoyager, and OSWorld, and newer frameworks like WebCanvas or WebRL, define tasks like “book a flight,” “edit a document,” or “configure a dashboard” by exposing a live or simulated web/desktop interface as the environment. The agent sees pixels or DOM/text, chooses atomic UI actions, and gets rewards based on task success or stepwise progress.
For alignment-style training like RLHF/RLAIF, the “environment” is even more abstract: it’s basically a bandit setup where an LLM outputs a whole response (sequence of actions), then a human or reward model scores that outcome with a single scalar. The state is the prompt, the action is the full completion, and the reward arrives at the end instead of after every step. This is why people describe RLHF as a contextual bandit on top of language models - there’s no rich simulator, but the environment still defines what feedback you get and how often.
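The bandit framing can be sketched in a few lines. This is a toy, not how any production RLHF system works: the policy is a softmax over three fixed candidate completions instead of a language model, `toy_reward_model` is a hypothetical keyword scorer standing in for a human or learned reward model, and the update uses the exact expected policy gradient rather than sampled completions so the result is deterministic.

```python
import math

PROMPT = "Explain what an MDP is."          # the context (state)
CANDIDATES = [                               # the possible actions (full completions)
    "I don't know.",
    "An MDP has states, actions, transitions, and rewards.",
    "It is a kind of process.",
]


def toy_reward_model(prompt, completion):
    """Stand-in scorer: fraction of key terms mentioned. Not a real reward model."""
    key_terms = ("states", "actions", "transitions", "rewards")
    return sum(term in completion for term in key_terms) / len(key_terms)


def softmax(logits):
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    z = sum(exps)
    return [e / z for e in exps]


def train(steps=100, lr=1.0):
    """Gradient ascent on expected reward for a one-step (bandit) episode."""
    logits = [0.0] * len(CANDIDATES)
    # One scalar per completion, delivered only after the whole response exists.
    rewards = [toy_reward_model(PROMPT, c) for c in CANDIDATES]
    for _ in range(steps):
        probs = softmax(logits)
        baseline = sum(p * r for p, r in zip(probs, rewards))  # expected reward
        # d E[r] / d logit_k = p_k * (r_k - E[r])
        logits = [
            l + lr * p * (r - baseline) for l, p, r in zip(logits, probs, rewards)
        ]
    return softmax(logits)


final_probs = train()
```

There is no transition to a next state anywhere in this loop: each episode is one prompt, one completion, one score. That is the sense in which the setup is a contextual bandit rather than a full multi-step MDP.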
The interesting shift is that environments have become large, messy, and open-ended. WebRL, WebAgent-R1, and OpenAI’s computer-using agents all operate in environments that change over time, have long horizons, and give extremely sparse or delayed feedback. That forces environment designers to add things like curriculum generation (creating new tasks from failed attempts), outcome-based reward models, and safety filters to keep agents from making harmful changes. In other words: in the age of LLM agents, “designing the environment” is no longer wrapping a small simulator - it’s designing the entire sandbox where your model will behave, explore, and sometimes break things.