3/ Answering Question 2: How do we define "data quality" in a way that doesn't depend on human taste, but on what the model itself structurally needs?
If you ask a human to label "good" pre-training text, you get FineWeb-Edu scores, QuRating preferences, or DCLM's fastText-based classifier. These are useful, but they are fundamentally human heuristics dressed up as metrics.
The problem with human-defined quality:
Static filters assume a document's utility is time-invariant: a "high-quality" math document stays high-quality whether the model is at step 1,000 or step 500,000. That assumption does not hold, because what the model needs changes as its parameters evolve.
Worse, human preference is often just taste. We favor Wikipedia prose over Reddit threads, but the model might learn more from a well-structured technical forum post than a generic encyclopedia entry.
In OPUS, we define quality as a dynamic, model-dependent utility:
A batch is valuable only if it moves the model's parameters in a direction that improves performance on the proxy distribution under the optimizer's specific geometry.
Formally, we score candidates by the expected one-step loss reduction on the proxy set, measured not in raw gradient space but in the optimizer-induced update space (for example, after AdamW's diagonal preconditioner or Muon's Newton-Schulz orthogonalization).
This means "quality" is no longer a scalar label on the data. It is a vector inner product between:
- The optimizer's effective update direction for this candidate
- The proxy's desired descent direction
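One way to write this down, as a sketch rather than the exact OPUS objective: the symbols below (candidate z, learning rate η, optimizer map P_t, score s_t) are notation assumed here for illustration. A first-order Taylor expansion of the proxy loss around the current parameters gives the inner product directly.

```latex
% Assumed notation: g_t(z) is the candidate's gradient, P_t is the map the
% optimizer applies to a gradient at step t (identity for plain SGD).
u_t(z) = P_t\big(g_t(z)\big), \qquad \theta_{t+1} = \theta_t - \eta\, u_t(z)

% First-order expansion of the proxy loss after one step:
\mathcal{L}_{\mathrm{proxy}}(\theta_t - \eta\, u_t(z))
  \approx \mathcal{L}_{\mathrm{proxy}}(\theta_t)
  - \eta \,\big\langle u_t(z),\, \nabla_\theta \mathcal{L}_{\mathrm{proxy}}(\theta_t) \big\rangle

% So the predicted one-step loss reduction, per unit learning rate, is:
s_t(z) = \big\langle u_t(z),\, \nabla_\theta \mathcal{L}_{\mathrm{proxy}}(\theta_t) \big\rangle
```

With P_t set to the identity this reduces to the familiar raw gradient alignment; any other choice of P_t changes the score without changing the data.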
If the optimizer geometry changes (e.g., switching from AdamW to Muon), the quality score changes, even for the exact same data point. Quality is a function of the (model, optimizer, data) triplet, not of the data alone.
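The dependence on the optimizer can be illustrated with a toy sketch. Everything here is an assumption made for illustration, not the OPUS implementation: the function names, the 2x2 "gradients", the shared second-moment state, and the use of the classical cubic Newton-Schulz iteration (Muon itself uses a tuned polynomial variant). The score is the first-order inner product between the transformed candidate gradient and the proxy gradient.

```python
import numpy as np

def adam_like_update(grad, second_moment, eps=1e-8):
    """Diagonal preconditioning: scale each entry by 1/sqrt(second moment).
    Momentum and bias correction are omitted (simplifying assumption)."""
    return grad / (np.sqrt(second_moment) + eps)

def newton_schulz_orthogonalize(grad, steps=30):
    """Approximate the orthogonal polar factor of `grad` with the classical
    cubic Newton-Schulz iteration (a stand-in for Muon's tuned variant)."""
    x = grad / (np.linalg.norm(grad) + 1e-12)  # Frobenius scaling keeps singular values <= 1
    for _ in range(steps):
        x = 1.5 * x - 0.5 * x @ x.T @ x
    return x

def utility(update, proxy_grad):
    """First-order predicted proxy-loss reduction per unit learning rate."""
    return float(np.sum(update * proxy_grad))

def score_all(candidates, proxy_grad, second_moment):
    """Score every candidate under three update geometries."""
    scores = {}
    for name, g in candidates.items():
        scores[name] = {
            "raw": utility(g, proxy_grad),
            "adam_like": utility(adam_like_update(g, second_moment), proxy_grad),
            "orthogonalized": utility(newton_schulz_orthogonalize(g), proxy_grad),
        }
    return scores

# Hypothetical toy inputs: one 2x2 weight matrix, two candidate gradients.
proxy_grad = np.eye(2)
candidates = {
    "A": np.diag([4.0, -0.5]),  # large, but partly opposed to the proxy direction
    "B": np.diag([1.0, 1.0]),   # small, but fully aligned with the proxy direction
}
second_moment = np.array([[16.0, 1.0], [1.0, 0.01]])  # assumed optimizer state

scores = score_all(candidates, proxy_grad, second_moment)
for name, s in scores.items():
    print(name, s)
```

In this toy setup the raw gradient score prefers candidate A, while both the Adam-like and the orthogonalized scores prefer candidate B: the same two data points are ranked differently once the update geometry is taken into account.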