DocAI exists for PDFs, invoices, contracts. But for video? Not really.
Video is messy. Audio says one thing, slides show another, people talk over each other. Most tools just summarize the transcript and stop there.
I had read a little bit about world models, not deeply, but enough to spark an idea. What if instead of summarizing a transcript, we try to build a kind of world state from the video? Something structured that tracks speakers, claims, decisions, and contradictions over time.
So I built it in 48 hours for the @MistralAI hackathon.
You drop in a video or a YouTube link, and it creates a Temporal Knowledge Graph with every insight tied to the exact second it came from.
It even catches when someone says “revenue up 18%” while the slide shows 8%, with proof from both.
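To make the contradiction check concrete, here is a minimal sketch of the idea, not the actual hackathon code. It assumes each insight is stored as a timestamped claim tagged with where it came from (audio or slide), and it flags pairs that talk about the same metric within a short time window but state different percentages. The names `Claim`, `find_contradictions`, and `window_s` are hypothetical, and the real system presumably uses the Mistral models rather than a regex to extract and compare claims.

```python
import re
from dataclasses import dataclass

# Hypothetical record for one node of the temporal graph:
# what was claimed, where it came from, and the second it appeared.
@dataclass(frozen=True)
class Claim:
    t_seconds: float  # timestamp in the video
    source: str       # "audio" or "slide" (assumed labels)
    metric: str       # what the claim is about, e.g. "revenue"
    text: str         # the raw evidence

PERCENT = re.compile(r"(\d+(?:\.\d+)?)\s*%")

def percent_in(text):
    """Return the first percentage found in text, or None."""
    match = PERCENT.search(text)
    return float(match.group(1)) if match else None

def find_contradictions(claims, window_s=30.0):
    """Pair spoken and on-slide claims about the same metric that are
    close in time but disagree on the number."""
    spoken = [c for c in claims if c.source == "audio"]
    shown = [c for c in claims if c.source == "slide"]
    conflicts = []
    for a in spoken:
        for s in shown:
            if a.metric != s.metric:
                continue
            if abs(a.t_seconds - s.t_seconds) > window_s:
                continue
            va, vs = percent_in(a.text), percent_in(s.text)
            if va is not None and vs is not None and va != vs:
                conflicts.append((a, s))  # keep both sides as proof
    return conflicts

# Illustrative data only; timestamps are made up.
claims = [
    Claim(754.0, "audio", "revenue", "revenue up 18%"),
    Claim(750.0, "slide", "revenue", "Revenue growth: 8%"),
    Claim(900.0, "audio", "headcount", "we hired 12% more engineers"),
]
conflicts = find_contradictions(claims)
for spoken, shown in conflicts:
    print(f"{spoken.t_seconds}s audio: {spoken.text!r} vs "
          f"{shown.t_seconds}s slide: {shown.text!r}")
```

Keeping both claims in each pair is what lets every flagged contradiction point back to the exact second in the audio and on the slide.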
3 Mistral models.
50-minute video in under 4 minutes.
10 full analyses for $0.28.
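Those numbers work out as follows. This is just arithmetic on the figures above: "under 4 minutes" is treated as exactly 4, so the speedup is a lower bound, and the per-analysis cost assumes the 10 analyses cost about the same each.

```python
# Figures taken from the post; variable names are illustrative.
video_minutes = 50
processing_minutes = 4   # "under 4 minutes", so this is an upper bound
total_cost_cents = 28    # $0.28
analyses = 10

speedup = video_minutes / processing_minutes      # at least this many times faster than real time
cost_per_analysis_cents = total_cost_cents / analyses

print(f"At least {speedup}x faster than real time")
print(f"About {cost_per_analysis_cents} cents per analysis")
```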