A socratic dialogue over the utility of DNA language models (Part 1 of 2)
here's the link:
owlposting.com/p/a-socratic-…
and here's a longpost of why i wrote this:
i think the effort that went into Evo 2 is very cool and it's clearly a very comprehensive paper
but the excitement over it made me realize that i didn't understand a more basic concept: what's the point of a DNA language model? it felt like all the instinctive 𝕏 takes i read about them were just...wrong at worst, and overly optimistic at best. i'm sure a Real Genomics person would instinctively understand the utility of this type of model. but i do not!
this is made worse by everyone i know irl agreeing that they too don't really get the point of models like these
this essay is an attempt to fix my own understanding and hopefully help others too. i interleave my own instinctive questions with the answers i stumbled across as i researched more. unfortunately, i have many dumb questions, but hopefully some smart ones too
part 1 is specifically focused on variant pathogenicity prediction using these models
i should note that this essay is not about Evo 2 specifically. Evo 2 is referred to heavily, specifically its pathogenic variant discovery results, but i do not spend much time on its data, model, etc. it is intended to be more broad than that