I watched the recording of the recent @gnomad_project @GA4GH webinar on v4, which addressed common misconceptions about gnomAD, by @HeidiRehm, @SamBaxterCGC and Katherine Chao. 👀 📺
Much of what was said is likely known to those who, like me, use gnomAD on a daily basis. Still, some interesting nuggets were in there. So here are my unsorted notes, in case you find them helpful:
gnomAD is not a control group in the normal sense, but represents a reasonable reference set for severe diseases such as child-onset muscular dystrophy. It is best thought of as a general population dataset. It is NOT a healthy population dataset.
Included are individuals from case-control studies and biobank participants. Not included are cohorts specifically recruited for the detection of early-onset Mendelian diseases, to avoid over-enrichment of causal variants.
Under the US regulatory framework, there is no chance that HPO terms can be made available by a resource such as gnomAD, because it is freely accessible with no login required. Only sex, age and the original study the person was recruited into are available to the gnomAD team. Only aggregated data can be shared, not details on any specific participant.
The only description they have is whether a person is a "control" or a "case" in the original study, and that only refers to the narrow phenotypic window addressed in that study (say schizophrenia or diabetes).
This is in contrast to @uk_biobank or @AllofUsResearch data. These datasets are not freely accessible: you can only access them through their data portals, where you need to log in. But you do get individual-level data.
@HeidiRehm said it will be "a while" until gnomAD data is available on a genome build other than the current hg38. Conversations about future gnomAD versions based on e.g. T2T or a pangenome graph have only just begun.
Roughly half of the exome data made available in gnomAD v4 is from @uk_biobank. As a biobank, it has a healthy volunteer bias: volunteers tend to be slightly wealthier and healthier than the general population (see PMID: 28641372). This actually means that the non-UK-Biobank subset in gnomAD has a higher disease burden, because it is more broadly recruited.
UK Biobank participants do have rare diseases, see PMID: 28641372
Individuals in gnomAD also come from case-control studies, e.g. of schizophrenia or myocardial infarction. The gnomAD team therefore created subsets, such as non-neuro, to filter out certain groups of study participants. See contributing projects here:
gnomad.broadinstitute.org/ab…
At the current size of gnomAD in v4, any phenotype enrichment is less of a concern than in the smaller previous versions, simply because the dataset is 800k people.
Question - should a subset be used to apply e.g. the ACMG BA1 or BS1 criteria? - @HeidiRehm says v4 is best used as a whole dataset, because the more people, the better. If you want the maximum allele frequency for a certain variant, then check the ancestry subgroups.
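To make the "check ancestry subgroups" advice concrete, here is a minimal Python sketch of picking the highest allele frequency across ancestry groups and comparing it with a cutoff. Everything in it is an assumption for illustration: the group labels and counts are invented, the minimum allele number is an arbitrary guard against tiny groups, and the 5% cutoff is the stand-alone BA1 threshold from the original 2015 ACMG/AMP guideline (gene- or disease-specific cutoffs are often used instead). gnomAD publishes its own group-maximum annotations, and this is not a reimplementation of those.

```python
from typing import Dict, Optional, Tuple

# Hypothetical allele count (AC) and allele number (AN) per ancestry group
# for a single variant. Invented numbers, NOT taken from gnomAD.
EXAMPLE_COUNTS: Dict[str, Tuple[int, int]] = {
    "group_a": (12, 60_000),
    "group_b": (900, 15_000),
    "group_c": (3, 1_200_000),
}

BA1_THRESHOLD = 0.05  # assumed cutoff: BA1 in the 2015 ACMG/AMP guideline


def overall_af(counts: Dict[str, Tuple[int, int]]) -> float:
    """Allele frequency across all groups pooled together."""
    total_ac = sum(ac for ac, _ in counts.values())
    total_an = sum(an for _, an in counts.values())
    return total_ac / total_an if total_an else 0.0


def max_group_af(
    counts: Dict[str, Tuple[int, int]], min_an: int = 2_000
) -> Optional[Tuple[str, float]]:
    """Return (group, AF) with the highest AF among groups with AN >= min_an.

    min_an is an arbitrary illustrative guard so that very small groups
    do not dominate the result.
    """
    best: Optional[Tuple[str, float]] = None
    for group, (ac, an) in counts.items():
        if an < min_an:
            continue
        af = ac / an
        if best is None or af > best[1]:
            best = (group, af)
    return best


def exceeds_ba1(counts: Dict[str, Tuple[int, int]]) -> bool:
    """True if the highest group AF is above the assumed BA1 cutoff."""
    best = max_group_af(counts)
    return best is not None and best[1] > BA1_THRESHOLD


if __name__ == "__main__":
    print("pooled AF:", overall_af(EXAMPLE_COUNTS))
    print("max group:", max_group_af(EXAMPLE_COUNTS))
    print("above BA1 cutoff:", exceeds_ba1(EXAMPLE_COUNTS))
```

In this made-up example the pooled frequency is far below the cutoff while one group is above it, which is exactly the situation where looking only at the whole-dataset number would mislead.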
All of gnomAD v4 can be considered roughly "non-cancer", because all of the data from TCGA (The Cancer Genome Atlas Program) was removed due to QC. People in gnomAD can of course still have cancer, but there is no enrichment of cancer phenotypes in v4.
Question - there seems to be a higher frequency of cancer variants in the non-UK-Biobank subset. Why? - KC: Could be due to recruiting, since UK Biobank only recruited people aged 40-69. They did find far lower rates of cancer in UKB than in the general UK population. Again, biobank participants are on average healthier.
@AllofUsResearch data will be included in v5 (another 800k individuals), which is "coming soon". Again, no phenotype data.
Question - what about CHIP variants? - @HeidiRehm: gnomAD covers the full age spectrum, so there is likely no enrichment, but of course there are older people with CHIP variants in there. There are flags on genes to signal a high rate of CHIP variants.
gnomAD does have age for many of the included individuals, but not all. This information was provided by the original studies. It is actually unclear to the gnomAD team what "age" means in each of the different studies, e.g. age at inclusion into the study? Age when the data was transferred to gnomAD? Something else? The gnomAD team did not tinker with the age data; they used it as is where possible and share it in buckets on individual variant pages.
Question - are you planning to include the Million Veteran Program dataset? - @HeidiRehm: not yet included, because there is no easy access to it.
Since external sequencing data has to be reprocessed by the gnomAD team (quite expensive), only highly "valuable" external datasets, meaning ancestrally diverse ones, will be reprocessed for future gnomAD versions. This is simply a matter of cost-benefit.
gnomAD v4 is mostly saturated for common variants from European ancestries, so non-European ancestries are more "valuable". If you want to dive into that, here is the preprint for gnomAD v4 by @konradjk and team: PMID: 41929314.
Question - why is #CFTR delta-F508 58x homozygous in gnomAD? - @HeidiRehm: the team did a little research, as this is likely a bit too high for a general population. It turns out that a cohort from a clinic with a pulmonary focus was included (important: not from UK Biobank).
gnomad.broadinstitute.org/va…
My former colleague @tomaeusTo also stumbled upon this #CFTR variant back in 2023, right after the release. See here for more details, with a link to the gnomAD FAQ forum:
x.com/platzer_k/status/17369…
There are actually more than a few over-represented #CFTR variants in gnomAD, again likely due to the inclusion of that one pulmonary clinic's cohort.
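A rough way to see why a homozygote count can look "too high": under Hardy-Weinberg assumptions, the expected number of homozygotes is about N × q² for allele frequency q in N people. The Python sketch below is only a back-of-the-envelope illustration. The allele frequencies and group sizes are placeholders I made up (not gnomAD values for delta-F508), and a pooled dataset like gnomAD does not follow Hardy-Weinberg exactly. The second function shows how a single enriched sub-cohort can push the homozygote count well above what the pooled frequency alone would suggest.

```python
from typing import List, Tuple


def expected_homozygotes(allele_freq: float, n_individuals: int) -> float:
    """Expected homozygote count under Hardy-Weinberg: N * q^2."""
    if not 0.0 <= allele_freq <= 1.0:
        raise ValueError("allele_freq must be between 0 and 1")
    if n_individuals < 0:
        raise ValueError("n_individuals must be non-negative")
    return n_individuals * allele_freq ** 2


def stratified_expected_homozygotes(groups: List[Tuple[float, int]]) -> float:
    """Sum of per-group expectations for (allele_freq, n_individuals) pairs."""
    return sum(expected_homozygotes(q, n) for q, n in groups)


def pooled_expected_homozygotes(groups: List[Tuple[float, int]]) -> float:
    """Expectation if all groups are (wrongly) treated as one mixed population."""
    total_n = sum(n for _, n in groups)
    if total_n == 0:
        return 0.0
    # Size-weighted mean allele frequency across groups.
    pooled_q = sum(q * n for q, n in groups) / total_n
    return expected_homozygotes(pooled_q, total_n)


if __name__ == "__main__":
    # Hypothetical: one small enriched cohort plus a large background cohort.
    hypothetical_groups = [(0.02, 100_000), (0.001, 700_000)]
    print("pooled expectation:    ", pooled_expected_homozygotes(hypothetical_groups))
    print("stratified expectation:", stratified_expected_homozygotes(hypothetical_groups))
```

With these invented inputs the stratified expectation is several times the pooled one, which is the same kind of effect an enriched clinic cohort would have.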
Question - what about long reads? - @HeidiRehm: @TalkowskiLab is working on a long-read gnomAD dataset. It will happen, but there is no release date just yet.
There is a Canadian gnomAD dataset, not yet included in the primary gnomAD dataset. For now, it is available for download, with a browser interface coming soon. They actually want to provide phenotypes, but that is still "quite a few months out" (Jordan Lerner-Ellis).
linkedin.com/posts/jordan-le…
genomicdatacommons.ca/
bcgsc.ca/gnomad/
There is also a Singapore gnomAD dataset, though it is probably not possible to access the data right now. So different groups can add their own wrinkles to these separate gnomAD-type datasets.
Recruiting more samples for gnomAD is ongoing. Federated gnomAD explained here:
gnomad.broadinstitute.org/fe…
@HeidiRehm points out there is also a gnomAD forum / discussion board. Please post questions or feature requests there. Link:
discuss.gnomad.broadinstitut…
Thanks to the gnomAD team for all their work on this invaluable resource. Couldn't do without it.
Hope this summary helps. 🙂