Think of it as two parabolic bowls in weight space — one minimum for Task A, another for Task B. When you optimize for B, the weights physically move away from A’s minimum. There’s no single θ that satisfies both.
ALT Diagram showing weight space for minimising loss on MNIST digits (task A) and Fashion MNIST dataset (task B) and the usual training trajectory on task B with no awareness of task A