2/ The idea: a region-aware Mixture-of-Experts inside the document encoder.
A router reads each region's content, its 2D position, and a pooled query context z_q, then mixes 4 latent experts (+1 shared).
The page is now encoded differently per query, as D(q). Still MaxSim.
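
A rough sketch of that routing step in plain Python, to make the data flow concrete. Everything beyond the post itself is an assumption for illustration: the experts are plain linear maps, gating is a dense softmax over all 4 routed experts, the shared expert is always on and sits alongside the 4, z_q is a mean over query token embeddings, and every name here (`init_params`, `route`, `encode_page`, `maxsim`) is hypothetical.

```python
import math
import random


def matvec(W, x):
    """Multiply matrix W (list of rows) by vector x."""
    return [sum(w * xi for w, xi in zip(row, x)) for row in W]


def softmax(v):
    m = max(v)
    e = [math.exp(a - m) for a in v]
    s = sum(e)
    return [a / s for a in e]


def rand_matrix(rng, rows, cols, scale=0.5):
    return [[rng.uniform(-scale, scale) for _ in range(cols)] for _ in range(rows)]


def init_params(dim, n_experts=4, seed=0):
    """Random weights; router input = region content + (x, y) + z_q."""
    rng = random.Random(seed)
    return {
        "router": rand_matrix(rng, n_experts, dim + 2 + dim),
        "experts": [rand_matrix(rng, dim, dim) for _ in range(n_experts)],
        "shared": rand_matrix(rng, dim, dim),  # assumed always-on expert
    }


def pool_query(query_tokens):
    """Assumed pooling: mean over query token embeddings gives z_q."""
    n = len(query_tokens)
    return [sum(col) / n for col in zip(*query_tokens)]


def route(params, h, xy, z_q):
    """Gate weights over the routed experts for one region."""
    return softmax(matvec(params["router"], h + list(xy) + z_q))


def encode_page(params, regions, positions, query_tokens):
    """Query-conditioned page encoding D(q): one vector per region."""
    z_q = pool_query(query_tokens)
    page = []
    for h, xy in zip(regions, positions):
        gates = route(params, h, xy, z_q)
        out = matvec(params["shared"], h)  # shared expert, no gate
        for g, W in zip(gates, params["experts"]):
            out = [o + g * y for o, y in zip(out, matvec(W, h))]
        page.append(out)
    return page


def maxsim(query_tokens, doc_vectors):
    """Late-interaction score: sum over query tokens of the best dot product."""
    return sum(
        max(sum(a * b for a, b in zip(q, d)) for d in doc_vectors)
        for q in query_tokens
    )


# Toy usage: 3 regions with 4-dim features and normalized (x, y) positions.
rng = random.Random(1)
dim = 4
params = init_params(dim)
regions = [[rng.uniform(-1, 1) for _ in range(dim)] for _ in range(3)]
positions = [(0.1, 0.2), (0.5, 0.5), (0.9, 0.8)]
query = [[rng.uniform(-1, 1) for _ in range(dim)] for _ in range(2)]
score = maxsim(query, encode_page(params, regions, positions, query))
```

Only `encode_page` depends on the query; `maxsim` is the same scorer you would run over a static page encoding.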