Mirror Mirror on the Wall: The Serendipitous Journey Behind MirrorCheck
Every research paper has a polished, sanitized version that ends up on arXiv. It presents a seamless narrative of hypothesis, experimentation, and triumph. But behind our recent Distinguished Paper Award at the 6th AdvML@CV Workshop (CVPR 2026) lies a real and funny story, one involving a touch of the divine, a healthy dose of academic skepticism, and a classic fairytale metaphor.
Here is how MirrorCheck actually came to life.
Act I: Into the VLM Wilderness
It all started during the second semester of my first year as a PhD student. I was taking a course on Vision-Language Models (VLMs) taught by Prof. Ivan Laptev. The final requirement was simple yet daunting: execute a successful project within the scope of VLMs.
I teamed up with @Samar_M_Fares and @ziu_klea. Right from day one, we knew we wanted to attack a massive vulnerability in the space: adversarial defense for VLMs. The catch? At the time, no dedicated detection method existed specifically for VLMs. We were staring at a blank canvas and had absolutely no idea where to start.
We engineered a mountain of approaches. We failed, iterated, and failed again. Eventually, we stumbled upon a chaotic technique that actually seemed to work. Frankly, we didn't know why it worked. At the time, it felt like a direct touch of God.
Our chaotic first approach looked like this:
Input Image → VLM → Caption → T2I Generator → Overlay Image on Input (High Strength) → Relook at Caption
Given an input image (clean or adversarial), we'd feed it into a victim VLM and extract the output text caption. We then took that text, generated a brand-new image from it, and overlaid this new synthetic image directly on top of the original using a calibrated strength parameter. We fed this heavily overlaid image back into the VLM, extracted a new caption, and compared it to the original caption. A massive semantic distance between the two captions meant the original image was an adversarial manipulation.
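For readers who want to see the loop in code, here is a minimal sketch of that first pipeline. It is not our actual implementation: `caption_fn`, `t2i_fn`, and `embed_text_fn` are hypothetical callables standing in for the victim VLM, the text-to-image generator, and a text encoder, images are toy flat lists of pixel values, and the default `strength` and `threshold` values are made up for illustration.

```python
import math
from typing import Callable, List, Sequence, Tuple

Image = List[float]  # hypothetical stand-in: flattened pixel intensities in [0, 1]


def overlay(original: Sequence[float], synthetic: Sequence[float], strength: float) -> Image:
    """Alpha-blend the synthetic image onto the original (strength=1.0 keeps only the synthetic one)."""
    if len(original) != len(synthetic):
        raise ValueError("images must have the same number of pixels")
    if not 0.0 <= strength <= 1.0:
        raise ValueError("strength must lie in [0, 1]")
    return [(1.0 - strength) * o + strength * s for o, s in zip(original, synthetic)]


def cosine_distance(a: Sequence[float], b: Sequence[float]) -> float:
    """Return 1 - cosine similarity; treat a zero vector as maximally distant."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 1.0
    return 1.0 - dot / (norm_a * norm_b)


def overlay_check(
    image: Sequence[float],
    caption_fn: Callable[[Sequence[float]], str],      # assumed: victim VLM captioner
    t2i_fn: Callable[[str], Sequence[float]],          # assumed: text-to-image generator
    embed_text_fn: Callable[[str], Sequence[float]],   # assumed: text encoder
    strength: float = 0.8,    # illustrative value, not a tuned setting
    threshold: float = 0.5,   # illustrative value, not a tuned setting
) -> Tuple[bool, float]:
    """Flag the input as adversarial when the two captions drift far apart."""
    first_caption = caption_fn(image)
    synthetic = t2i_fn(first_caption)
    blended = overlay(image, synthetic, strength)
    second_caption = caption_fn(blended)
    distance = cosine_distance(embed_text_fn(first_caption), embed_text_fn(second_caption))
    return distance > threshold, distance
```

The only knob that mattered in practice was `strength`: the blend has to lean heavily toward the synthetic image, which is the part that later needed an explanation.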
It worked! Not perfectly, but well enough to give us hope.
Act II: "This Just Doesn't Make Sense"
Armed with our suspicious preliminary findings, we quietly took our results to Prof. Ivan. While he was genuinely impressed by the numbers, I will never forget his hilariously candid remark:
"While I'm a huge fan of simple tricks to solve a mystery, this just doesn't seem to make sense."
He wasn't wrong. Our underlying intuition was inspired by an earlier study showing that adding random noise to an image can neutralize or "purify" adversarial features. However, adaptive attackers easily bypass that by designing adversarial features that anticipate random noise. We had thought: What if instead of random noise, we used fine-grained semantic noise, namely the regenerated images?
The problem was, it only worked when the calibrated strength of the overlay was cranked up high. Visually, the resulting image looked absolutely disgusting. It worked, but we desperately needed a rigorous, logical explanation for it.
Act III: The Pivot to "Replaying"
Recognizing the potential, Prof. Ivan introduced us to @nikitadurasov, a student of his colleague Prof. Pascal Fua. Nikita specialized in uncertainty estimation, and there were fascinating conceptual crossovers between our approach and his paper, Zigzag. We jumped into intense brainstorming sessions with Nikita, initially trying to explicitly adapt his Zigzag framework to our problem. It flat-out didn't work.
But breakthroughs happen when you least expect them. During one of our sessions, Nikita proposed adding noise to selected layers inside the victim model and "replaying" the original image to estimate the model's internal prediction uncertainty.
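To make the "replaying" idea concrete, here is a rough sketch under heavy assumptions. It is neither Zigzag nor what we ended up building: the "model" is a hypothetical list of plain Python layer functions, the noise is Gaussian with an arbitrary `sigma`, and uncertainty is simply the output variance across replays.

```python
import random
import statistics
from typing import Callable, List, Sequence, Set

Vector = List[float]
Layer = Callable[[Vector], Vector]  # hypothetical stand-in for one layer of the victim model


def noisy_forward(
    layers: Sequence[Layer],
    x: Sequence[float],
    noisy_layers: Set[int],
    sigma: float,
    rng: random.Random,
) -> Vector:
    """Run one forward pass, adding Gaussian noise after each selected layer."""
    h = list(x)
    for index, layer in enumerate(layers):
        h = layer(h)
        if index in noisy_layers:
            h = [v + rng.gauss(0.0, sigma) for v in h]
    return h


def replay_uncertainty(
    layers: Sequence[Layer],
    x: Sequence[float],
    noisy_layers: Set[int],
    sigma: float = 0.1,   # illustrative noise scale
    replays: int = 20,    # illustrative number of replays
    seed: int = 0,
) -> float:
    """Replay the same input several times and return the mean per-dimension output variance."""
    rng = random.Random(seed)
    outputs = [noisy_forward(layers, x, noisy_layers, sigma, rng) for _ in range(replays)]
    per_dim = [statistics.pvariance(values) for values in zip(*outputs)]
    return sum(per_dim) / len(per_dim)
```

With no noisy layers, every replay is identical and the score is zero; the score grows as the selected layers amplify the injected noise.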
That word, replaying, stuck in my brain. Right there in the middle of the meeting, a lightning bolt struck (for dramatic effect). I thought of a variation of our original, "not-so-sensemaking" approach. Instead of overlaying the disgusting synthetic image back onto the original, why not just compare the two generated images directly?
We ran it. It worked beautifully. I blurted it out right there in the meeting: "At least this one makes so much sense!"
Act IV: The Psychology of a Horse and Snow White
The logic mimics human psychology. Humans know something is fishy based on prior experiences. If I own a brown horse with small white spots, and one day I walk into the stable and see a similar horse but with noticeably larger white spots, I instinctively know something is wrong. By comparing the VLM's visual interpretation directly against the source, we were capturing that exact cognitive discrepancy.
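In code, the horse check boils down to something like the sketch below. It is a simplification and not the released MirrorCheck implementation: `caption_fn`, `t2i_fn`, and `embed_image_fn` are hypothetical callables for the victim VLM, the text-to-image generator, and an image encoder, and the `threshold` value is a placeholder.

```python
import math
from typing import Callable, Sequence, Tuple


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine similarity between two embeddings; a zero vector yields 0.0."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def mirror_check(
    image: Sequence[float],
    caption_fn: Callable[[Sequence[float]], str],                  # assumed: victim VLM captioner
    t2i_fn: Callable[[str], Sequence[float]],                      # assumed: text-to-image generator
    embed_image_fn: Callable[[Sequence[float]], Sequence[float]],  # assumed: image encoder
    threshold: float = 0.7,  # placeholder, not a tuned value
) -> Tuple[bool, float]:
    """Flag the input when the regenerated image no longer resembles the source."""
    caption = caption_fn(image)        # what the VLM claims to see
    reflection = t2i_fn(caption)       # the VLM's interpretation, rendered back as an image
    similarity = cosine_similarity(embed_image_fn(image), embed_image_fn(reflection))
    return similarity < threshold, similarity
```

A clean image should come back from the "mirror" looking like itself; an adversarial one comes back looking like whatever the attacker wanted the caption to say.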
As we scaled our experiments and had more frequent general meetings with our Professors, things rapidly fell into place, and the framework began to solidify.
One morning, we woke up to a brilliant title proposed by Nikita: MirrorCheck. It was inspired by the iconic scene in Snow White where the Evil Queen demands, "Mirror, mirror on the wall, who is the fairest of all?", only to see Snow White's face staring back at her instead of her own. An adversarial image demands a specific malicious output from the VLM, but when passed through our defense, the mirror reflects the true, underlying reality. It was perfect.
Act V: From Rejection to the Big Stage
Getting the work accepted was its own grueling mountain to climb, but the journey has been nothing short of surreal.
The First Iteration: Focused heavily on defending against general attacks and utilizing a unique One-Time-Use (OTU) noise mechanism to completely shatter the optimization space for adaptive attackers.
The Current Iteration: The framework that secured the Distinguished Paper Award focuses on amplifying model uncertainty through strategic, stochastic model selection and layer perturbations.
Today, MirrorCheck has come a long way from a "simple trick that doesn't make sense" to an award-winning framework cited as crucial prior work in the broader domain of Trustworthy Machine Learning (yes, our work has been adapted to other domains too).
Pen drop.
#CVPR2026 #AISafety #ResponsibleAI #MBZUAI #EPFL #AdversarialRobustness #TrustworthyAI