In my post yesterday, I described the double descent phenomenon: in many situations, massively over-parametrized models outperform simpler models out of sample. As I emphasized, this is one of the most surprising and intriguing developments in computer science and statistics in recent decades and it has important implications for economics.
The double descent phenomenon raises many questions, most still unanswered or only partially answered, despite intense work by top researchers. Today I will sketch one key element of the emerging picture: the inductive bias of deep learning. This requires some work, so please stay with me.
When a model is heavily over-parametrized (for example, as I mentioned yesterday, with 12,001 parameters for 12 observations), there are many parameter configurations that interpolate the data, i.e., they fit all 12 observations perfectly.
Which parameters are selected in practice? In high dimensions, gradient-based optimizers that minimize the fitting loss often converge to minimum-norm (min-norm) interpolants under the relevant function class and parametrization. In linear settings (and in several over-parametrized regimes and initializations for neural networks), this can be proved; more broadly, practice shows a strong implicit bias toward min-norm solutions.
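To make the linear case concrete, here is a minimal sketch (not the neural network from the figure). It assumes a toy linear model with 12 observations and 200 parameters, random Gaussian data, and plain gradient descent started at zero; all names (`X`, `y`, `w_gd`, `w_min_norm`, `w_other`) are illustrative. In this setting, gradient descent lands on the same interpolant as the pseudoinverse, which is the one with the smallest Euclidean norm.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy over-parametrized linear model: 12 observations, 200 parameters (assumed sizes).
n_obs, n_params = 12, 200
X = rng.standard_normal((n_obs, n_params))
y = rng.standard_normal(n_obs)

# Closed-form minimum-Euclidean-norm interpolant via the pseudoinverse.
w_min_norm = np.linalg.pinv(X) @ y

# Plain gradient descent on 0.5 * ||X w - y||^2, started at zero.
w_gd = np.zeros(n_params)
step = 1.0 / np.linalg.norm(X, 2) ** 2  # 1 / (largest singular value)^2
for _ in range(2000):
    w_gd -= step * (X.T @ (X @ w_gd - y))

# A different interpolant: add a direction that X cannot "see" (its null space).
null_dir = rng.standard_normal(n_params)
null_dir -= np.linalg.pinv(X) @ (X @ null_dir)  # remove the row-space component
w_other = w_min_norm + null_dir

print("GD fits the data:        ", np.allclose(X @ w_gd, y))
print("other fits the data:     ", np.allclose(X @ w_other, y))
print("GD equals min-norm:      ", np.allclose(w_gd, w_min_norm))
print("norm of min-norm / other:", np.linalg.norm(w_min_norm), np.linalg.norm(w_other))
```

Both `w_gd` and `w_other` fit the 12 observations perfectly, but only the first is what gradient descent actually finds. The zero initialization matters: the iterates never leave the row space of `X`, which is why the limit is the min-norm solution.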
What does that mean? Think of the fitted model as a function (the curve that goes exactly through all 12 points). The optimizer effectively selects the curve that is “smoothest” in a well-defined sense: the curve that minimizes a functional seminorm (e.g., a Sobolev seminorm).
What is a functional seminorm? A basic mathematical task is to measure the “size” of an object (like the length of a vector).
A norm is just a concept of “size” that is useful for the task we are dealing with. That is why in high-school math we introduce ideas like the Euclidean norm of a vector: it gives us a very intuitive and useful way to think about the “size” of a vector.
Norms can be too strict for many problems; sometimes we want a size that treats certain nonzero objects as “equivalent to zero.” A seminorm does exactly that: it measures size while allowing some nonzero objects to have size zero.
A functional seminorm applies this idea to functions, giving a way to quantify the size of the whole function, not just its value at a point (though pointwise evaluation can itself be used as a seminorm).
Enter Sobolev seminorms (and their relatives): here, the aspect of the function whose size we measure is smoothness, that is, how large its derivatives are. A Sobolev seminorm ignores baseline level and measures only the magnitude of derivatives; it is a seminorm because adding a constant (or, for higher orders, a lower-degree polynomial) does not change it.
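For readers who prefer symbols, here is one standard way to write this down. The notation is my choice for this post: $p$ for a generic seminorm, $f, g$ for functions, $\alpha$ for a scalar, and a one-dimensional interval $[a,b]$ as the domain. Take it as a sketch of the usual definitions, not the exact statement from the paper.

```latex
% A seminorm p satisfies, for all functions f, g and scalars \alpha:
\begin{align*}
  p(f) &\ge 0, \\
  p(\alpha f) &= |\alpha|\, p(f), \\
  p(f + g) &\le p(f) + p(g).
\end{align*}
% A norm additionally requires: p(f) = 0 implies f = 0.

% Sobolev seminorm of order k on [a, b] (L^2 version):
\[
  |f|_{H^k} = \left( \int_a^b \bigl| f^{(k)}(x) \bigr|^2 \, dx \right)^{1/2}.
\]
% For k = 1, adding a constant c leaves it unchanged:
\[
  |f + c|_{H^1} = |f|_{H^1},
  \qquad \text{since } (f + c)' = f'.
\]
```

The last line is exactly why this is a seminorm and not a norm: every constant function has size zero even though it is not the zero function.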
A metaphor: imagine rating the difficulty of a bike trail for your weekend excursion. You want to assess the entire trail, not a single point. The Sobolev seminorm “measures” the curvature, wiggles, and twists of the trail, but it does not care whether the trail sits at 100 or 500 meters above sea level.
There are many reasons Sobolev seminorms matter across mathematics. For example, if you’re solving the heat equation for a rod, you care about how steep the temperature gradient is along the rod, not how hot the rod is.
But the key point for us is that they give an intuitive measure of how smooth a curve is.
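A small numerical sketch of the bike-trail idea, under assumptions of my own: a grid on [0, 1], a finite-difference derivative, and three made-up curves (`smooth`, `shifted`, `wiggly`). The helper `h1_seminorm` is a hypothetical name for a discrete approximation of the first-order Sobolev seminorm, not a library function.

```python
import numpy as np

def h1_seminorm(f_vals, x):
    """Discrete approximation of sqrt( integral of f'(x)^2 dx )."""
    df = np.gradient(f_vals, x)                 # finite-difference derivative
    g = df ** 2
    integral = np.sum(0.5 * (g[1:] + g[:-1]) * np.diff(x))  # trapezoid rule
    return np.sqrt(integral)

x = np.linspace(0.0, 1.0, 1001)

smooth = np.sin(2 * np.pi * x)                   # a gentle trail
shifted = smooth + 400.0                         # same trail, 400 m higher
wiggly = smooth + 0.1 * np.sin(40 * np.pi * x)   # same trail plus small, fast wiggles

s_smooth = h1_seminorm(smooth, x)
s_shifted = h1_seminorm(shifted, x)
s_wiggly = h1_seminorm(wiggly, x)

print("smooth :", s_smooth)
print("shifted:", s_shifted)
print("wiggly :", s_wiggly)
```

The shifted trail gets the same score as the original, while the wiggly one scores much higher even though its wiggles are only a tenth of the amplitude: the seminorm reacts to derivatives, not to levels.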
Now return to the figure from my post yesterday:
x.com/JesusFerna7026/status/…
Of all neural networks with 12,001 parameters that fit the 12 observations perfectly, a gradient-based optimizer tends to pick the smoothest one (in the sense above). That is the inductive bias at work, and it’s remarkable.
The formal statement is in the snapshot from my paper included in this post:
x.com/JesusFerna7026/status/…
We can prove this behavior in a number of cases; in practice, it appears beyond those instances as well. This is why yesterday’s example was not cherry-picked.
Why do we care about “smooth” curves?
1️⃣ Occam’s razor. Smooth curves are typically the simplest, so a preference for smoothness is an inductive bias toward simplicity.
2️⃣ Dynamic economic models. The smooth solutions are precisely those that satisfy the transversality condition. I have a full paper on this:
sas.upenn.edu/~jesusfv/Spook…
I will explain more another day.
3️⃣ Forecasting. In practice, smoother curves generalize better and forecast new observations more accurately.
Finally, a quick diagnostic from the figure: the ℓ₂ distance between the true function and the interpolated solution is much smaller in panel 4 than in panel 3. That is, the massively over-parametrized solution (panel 4) outperforms the merely overfit solution (panel 3) precisely because the massively over-parametrized solution minimizes the seminorm.
I realize this post likely opens more questions than it answers (and I had to be a bit sloppy with some details), but bear with me—I’ll try to address those next week.