Today, RGC published an important paper in
@Nature, describing an analysis of close to a million human exomes (n=983,578) as a single variant call set (!). This is the largest and most diverse rare variant database created so far. This impressive feat was accomplished by a large
@RegeneronDNA team led by my wonderful colleague
@suganthibala.
@SunKat_y et al. Nature 2024
nature.com/articles/s41586-0…
What kinds of insights can we gain from sequencing ~980k exomes? Below is a summary of the major findings from the paper.
Background of RGC
Regeneron Genetics Center (RGC) was established in 2014, just in time, as major pharma companies started entering the human genomics playing field. Last year, RGC celebrated its 10th anniversary. I've written about the origin story of RGC before (
x.com/doctorveera/status/171…).
The business model of RGC is simple and efficient. It collaborates with academic institutions across the world and provides sequencing as a free service in exchange for access to genotypic and phenotypic data.
The first successful collaboration was with Geisinger Health System (GHS) to sequence 100,000 individuals, which was soon followed by an avalanche of large collaborations. Some of our largest collaborators include UK Biobank (N=500k), GHS (N=175k) and the Mexico City Prospective Study (N=150k). Today, RGC has more than 300 collaborations around the world. Just a few months ago, it surpassed the milestone of 2 million exomes. What is described in the current paper is only a fraction of that sample.
Diversity of samples
The 980k exome dataset comes from a diverse set of samples. 23% (n=190k) of the participants are of non-European ancestries, the largest proportion to date for any similar dataset. This includes both outbred populations and special populations enriched with communities that have a long-standing cultural history of consanguineous and endogamous unions.
When it comes to human genetics, diversity is the key to making discoveries. Almost everyone agrees, and the field is embracing it now. But RGC is way ahead of the game. Just a few months ago, RGC partnered with other companies and laid the foundation stone of what will become, a few years from now, the world's largest genomics resource comprising half a million African Americans and Africans (
x.com/doctorveera/status/171…).
Variant survey
The human genome is ~3 billion base pairs long. About 1% of it (~30 million base pairs), containing the exons, is targeted by exome sequencing. By sequencing 980k exomes, the authors have captured ~16.5 million unique variants. That is, on average, one variant for every two base pairs across the exome.
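As a quick back-of-the-envelope check of that density, here is a small Python sketch. It uses only the approximate figures quoted in this thread (the variable names are my own), so treat the result as a rough estimate and not a number taken from the paper.

```python
# Approximate figures quoted in this thread (not exact values from the paper)
genome_bp = 3_000_000_000       # ~3 billion base pairs in the human genome
unique_variants = 16_500_000    # ~16.5 million unique variants in the call set

# Exome sequencing targets roughly 1% of the genome
exome_bp = genome_bp // 100

# Average spacing between unique variants across the exome
bp_per_variant = exome_bp / unique_variants

print(f"Exome target: ~{exome_bp // 1_000_000} million bp")
print(f"About one unique variant per {bp_per_variant:.1f} bp")
```

The spacing comes out a little under two base pairs, in line with the "one for every two base pairs" figure.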
The main goal of concentrating on exomes is to capture deleterious spelling errors in the genome that result in a loss, a substantial decrease or, sometimes, an increase in gene function. The authors have identified:
- ~1.1 million predicted loss-of-function variants (pLOFs), ~50% of which are singletons (that is, seen in just one individual)
- ~10 million missense variants, 40% of which are singletons.
As expected, African ancestry groups had more variants (18% more) than any other ancestry group.
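To put the singleton shares above into absolute numbers, here is a rough Python sketch. The inputs are the rounded figures from this thread, and the variable names are only illustrative, so the outputs are approximations.

```python
# Rounded counts and singleton shares as quoted in this thread
plof_variants = 1_100_000        # ~1.1 million pLOF variants
missense_variants = 10_000_000   # ~10 million missense variants
plof_singleton_pct = 50          # ~50% of pLOFs are singletons
missense_singleton_pct = 40      # ~40% of missense variants are singletons

# Approximate number of variants seen in just one individual
plof_singletons = plof_variants * plof_singleton_pct // 100
missense_singletons = missense_variants * missense_singleton_pct // 100

# How many missense variants there are for every pLOF
missense_per_plof = missense_variants / plof_variants

print(f"pLOF singletons: ~{plof_singletons:,}")
print(f"Missense singletons: ~{missense_singletons:,}")
print(f"Missense variants per pLOF: ~{missense_per_plof:.1f}")
```

So roughly half a million pLOFs and about four million missense variants in this dataset were each observed in a single person.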
Footprints of selection
pLOFs in human genomes are like bullet holes in aircraft returning from war. The genes untouched or rarely hit by pLOFs are the most critical genes, without which life is probably impossible.
Studying ~980k exomes, the authors have identified ~4000 genes that are depleted of pLOFs, suggesting they are indispensable. For more than 20% of these genes, we are learning for the first time that they are critically required for normal life. Previous datasets were not able to quantify their mutational constraint because the genes are short. Most of these genes have not yet been linked to a human disease. The current list will inspire many Mendelian discoveries in the near future.
Regional selection
We have 10 times more missense variants than pLOFs, which means we can zoom in within genes and study which parts of a gene are indispensable and which parts aren't.
Not all parts of a protein are critical, but some parts are: for example, the DNA-binding regions of a transcription factor, the catalytic sites of an enzyme, or the transmembrane domains that form the pore of a channel protein. With knowledge of ~10 million missense variants from 980,000 humans, such critical regions are now starting to light up, illuminating the most crucial regions of proteins. For example, here is a trace of missense tolerance across different domains of the cancer gene KRAS. Human genetics shows that the first 80 amino acids are the most critical region of KRAS, falling in the top 1 percentile of the regional missense constraint metric.
Human knockouts
The function of a gene in an organism is typically understood by studying the phenotypic consequences of deleting the gene. We cannot do such experiments in humans. But fortunately, Nature has already done these mutagenesis experiments for us. By studying naturally occurring human knockouts, we can assess the consequences of completely inhibiting a gene. This is crucial data for drug developers, as it informs the safety of drugs that act by inhibiting a gene or its product.
Studying the pLOFs across 980k humans, the authors have found 4,686 genes with at least one human knockout, suggesting that life without these genes is likely possible. In line with that, the authors find that these genes are the ones that are mutationally least constrained (that is, they are enriched for pLOFs). For >1700 genes, we are learning for the first time that humans completely lacking these genes do exist in this world. This is an incredible resource for drug development.
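To make the idea of "genes with at least one human knockout" concrete, here is a toy Python sketch. It is only an illustration, not the paper's method: I am assuming a knockout means an individual carrying two pLOF alleles in the same gene, and the records, gene names and function name are all made up.

```python
from collections import defaultdict

# Hypothetical toy records: (individual, gene, number of pLOF alleles carried)
records = [
    ("ind1", "GENE_A", 2),  # both copies hit: a knockout for GENE_A
    ("ind2", "GENE_A", 1),  # one copy hit: a heterozygous carrier only
    ("ind3", "GENE_B", 1),
    ("ind4", "GENE_C", 2),
    ("ind5", "GENE_C", 2),
]


def genes_with_knockouts(records):
    """Map each gene to the set of individuals carrying two pLOF alleles in it."""
    knockouts = defaultdict(set)
    for individual, gene, plof_alleles in records:
        # Assumption: two pLOF alleles in a gene means the gene is knocked out
        if plof_alleles >= 2:
            knockouts[gene].add(individual)
    return dict(knockouts)


knockouts = genes_with_knockouts(records)
print(sorted(knockouts))
```

In this toy data, GENE_A and GENE_C each have at least one knockout individual, while GENE_B has only a heterozygous carrier and so does not count.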
Clinical genetics insights
One of the most important use cases of reference variant databases is to help clinical geneticists identify disease-causing variants in patients. Historically, variant databases have been biased towards European populations. As a result, clinical geneticists struggle when they study exomes of patients of non-European ancestry and often label the suspected variants as variants of unknown significance (VUS), because of the lack of a proper reference database.
Cross-referencing the ClinVar database with the RGC dataset, the authors find that European ancestry groups had more variants labelled "pathogenic" in ClinVar than African ancestry groups. Conversely, African ancestry groups had more VUS than European ancestry groups. This is not because Africans are protected from pathogenic variants; it simply reflects that current databases are blind to clinically important variants in individuals of non-European ancestry. With growing diverse databases such as the current one from RGC, the situation will soon change.
Conclusion
RGC has created one of the largest reference databases for studying human exomes. The implications of this resource are many, spanning all areas of human biology from basic science to drug discovery.
Congrats to all my colleagues (
@SunKat_y et al.) on this incredible accomplishment. And thanks to all RGC collaborators and research participants without whom such a dataset wouldn't exist.
Something big happened a few days ago. Industry leaders in the genomics field (Regeneron, AstraZeneca, Novo Nordisk and Roche) announced their collaboration with the US's largest Black medical school, Meharry Medical College, Nashville, to establish what might become the UK Biobank of Africa--the largest genomics database of 500,000 volunteers of African ancestries (
businesswire.com/news/home/2…)
The lack of diversity in genomics databases is today's greatest barrier to scientific progress. To break down this barrier, we need scientific minds, money, power and, importantly, trust and engagement from underrepresented communities. That can happen only when academia, industry and underrepresented communities join hands to change the future.
Just a week ago, the Regeneron Genetics Center (RGC), AstraZeneca and AbbVie published the largest genomics database of Mexican-Americans (N=150,000) in collaboration with academic institutes in Oxford and Mexico (
nature.com/articles/s41586-0…) (I'll write a separate post on this).
Now, RGC and other industry partners have established the Diaspora Human Genomics Institute (DHGI). Check out this website (
thedhgi.org/) and the launch event.
Note, this is not just a commitment towards establishing a genomics database. The initiative will also help uplift African communities by funding the education and training of young African and African-American students and researchers in the genomics field.
One of the things about RGC that inspires me most is its commitment to improving diversity in genomics databases. It is one of the many reasons why RGC is being flooded with top scientists from all over the world.