we took many of the best-in-class papers and methods from Meta, Adobe, ByteDance, etc. around agentic editing and implemented them, but couldn’t get anything to work really well (evaluation metric = edited outputs being loved EVERY SINGLE time...or in this case, even 50% of the time)
The truth is -- timing, cuts, and pacing are still very hard to nail, especially when the media is contextual and not just someone talking
Podcast clipping is much easier because the transcript gives you the structure to edit over…you can cut on the start and end of a sentence, trim the silences, find the most interesting soundbites, and end up with great videos. That’s what Opus Clip, Submagic, and similar apps have done really well. I imagine a lot of these apps get commoditized over time (as we’re already seeing), but they nailed a very important wedge early
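To make the "transcript gives you the structure" point concrete, here is a rough sketch of sentence-aligned cutting. It assumes you already have word-level timestamps from some speech-to-text step; the `Word` type, `sentence_cuts`, and the `max_gap` threshold are all hypothetical names and values for illustration, not what any of these apps actually do.

```python
from dataclasses import dataclass


@dataclass
class Word:
    text: str
    start: float  # seconds from the start of the recording
    end: float


def sentence_cuts(words, max_gap=0.6):
    """Return (start, end) ranges to keep.

    A range closes when a word ends a sentence or when the silence
    before the next word is longer than max_gap seconds (an assumed
    threshold; tune it per show).
    """
    cuts = []
    seg_start = None
    prev_end = None
    for w in words:
        if seg_start is None:
            seg_start = w.start
        elif w.start - prev_end > max_gap:
            # Long silence: close the running range and start a new one.
            cuts.append((seg_start, prev_end))
            seg_start = w.start
        prev_end = w.end
        if w.text.endswith((".", "?", "!")):
            # Sentence boundary: a natural place to cut.
            cuts.append((seg_start, prev_end))
            seg_start = None
    if seg_start is not None:
        cuts.append((seg_start, prev_end))
    return cuts


words = [
    Word("Hello", 0.0, 0.4),
    Word("world.", 0.5, 0.9),
    Word("Next", 3.0, 3.3),
    Word("bit", 3.4, 3.6),
    Word("here", 5.0, 5.3),
]
cuts = sentence_cuts(words)
```

Picking the "most interesting" ranges is a separate ranking step on top of this; the point is only that the candidate cut points fall out of the transcript almost for free.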
clips without speech are a totally different problem
you have to understand what is actually happening in the footage, whether the action is relevant, when the moment starts, when it peaks, when enough is enough, and when you should move on to the next clip. There isn’t a transcript / inherent structure telling you / the agent where to make the next cut
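One naive way to frame the "when it starts, when it peaks, when enough is enough" problem is sketched below. It assumes some upstream model has already given you a non-negative relevance score per sampled frame (that scoring step is the hard part and is not shown); `moment_window`, `fps`, and `keep_ratio` are hypothetical names for illustration, and this is not one of the methods from the papers mentioned above.

```python
def moment_window(scores, fps=1.0, keep_ratio=0.5):
    """Return (start_s, peak_s, end_s) around the highest-scoring frame.

    scores: one non-negative relevance score per sampled frame
            (assumed to come from an upstream model, not computed here).
    fps: how many frames were sampled per second.
    keep_ratio: keep neighbouring frames while they score at least
                this fraction of the peak (an assumed heuristic).
    """
    if not scores:
        return None
    peak = max(range(len(scores)), key=scores.__getitem__)
    threshold = scores[peak] * keep_ratio
    # Walk left from the peak until the action has not started yet.
    start = peak
    while start > 0 and scores[start - 1] >= threshold:
        start -= 1
    # Walk right from the peak until the action has died down.
    end = peak
    while end < len(scores) - 1 and scores[end + 1] >= threshold:
        end += 1
    # end is inclusive, so the cut-out point is the start of the next frame.
    return start / fps, peak / fps, (end + 1) / fps


scores = [0.1, 0.2, 0.6, 0.9, 1.0, 0.7, 0.3, 0.1]
window = moment_window(scores)
```

Even with perfect scores, a single threshold says nothing about pacing across clips, which is a big part of why this is so much harder than cutting on sentences.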
This is why it’s ironically easier to make full videos using AI-generated media than it is to take existing media and edit it. If you’re generating the scenes on demand, the scenes / structure / editing pace are already defined before they’re made. The agent knows what it’s supposed to create (has a predefined base to evaluate off of) and knows how to tie it together from its well-structured context
Editing existing media is entirely different. You’re trying to get an agent to understand what already happened (using a VLM like the Gemini models) and turn it into something compelling…which requires an insane amount of processing over media and great multimodal embeddings. I’m short-term bullish on agents being able to take a stab at the first pass -- but I feel the output must be an editable project within Premiere Pro / CapCut so a human can clean up the edges
when this gets solved, I think we’ll see a renaissance of old content getting recycled -- especially long-form, niche content that was never really watched because it was too long, too dense, or too hard to process. A lot of that content probably has great moments buried inside it, but they’re not obvious because they don’t have transcripts
If anyone’s working on this, would love to help shed some light on the scars I got and what to avoid