We’re excited to share our latest publication in @NatComputSci: “MetaSTAARlite: an all-in-one tool for biobank-scale whole-genome sequencing meta-analysis” 🧬
Sincere thanks to Yohhan Kumarasinghe (UNC, co-lead), Jacob Williams (NCI, co-lead), Yuxin Yuan, Wenbo Wang, @AlexiaDiasF, @AndrewHaoyu, @muzizimumu1, and the study participants from @uk_biobank and @AllofUsResearch.
Biobank-scale WGS/WES studies are transforming rare variant discovery, but pooled individual-level analyses across biobanks are often limited by data-sharing restrictions. MetaSTAARlite is designed to overcome this challenge by providing a scalable, resource-efficient, summary statistics–based pipeline for functionally informed rare variant meta-analysis across the coding and noncoding genome.
MetaSTAARlite provides an all-in-one workflow to:
• Generate resource-efficient study-specific summary statistics, including variant-level summary statistics and sparse LD matrices
• Perform functionally informed coding, noncoding, ncRNA, and custom-mask rare variant meta-analysis
• Dynamically incorporate multiple variant functional annotations to improve power and interpretation
• Exactly reconstruct the variance-covariance matrix of score statistics (referred to as the LD matrix), so meta-analysis results closely mirror pooled individual-level analysis
• Account for population structure and relatedness using a sparse GRM within a mixed-model framework
• Support conditional analysis, Manhattan/QQ plots, and analytical follow-up to identify annotations and variants driving associations
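To make the summary-statistics idea concrete, here is a minimal sketch of a generic score-based meta-analysis: per-study score statistics and sparse covariance entries are summed across studies, and a weighted burden test is computed from the sums. This is not MetaSTAARlite's code or API. The function names, the dictionary layout, and the toy numbers are all hypothetical and chosen only for illustration.

```python
import math
from collections import defaultdict

def combine_studies(studies):
    """Sum score statistics and sparse covariance entries across studies.

    Each study is a (scores, cov) pair (hypothetical layout):
      scores maps variant id -> score statistic
      cov maps (variant id, variant id) -> covariance entry; zero entries
      are omitted and both (a, b) and (b, a) are stored.
    """
    U = defaultdict(float)
    V = defaultdict(float)
    for scores, cov in studies:
        for variant, u in scores.items():
            U[variant] += u
        for pair, v in cov.items():
            V[pair] += v
    return dict(U), dict(V)

def burden_test(U, V, weights):
    """Weighted burden test: T = (w'U)^2 / (w'Vw), chi-square with 1 df."""
    num = sum(weights[v] * u for v, u in U.items())
    var = sum(weights[a] * weights[b] * c for (a, b), c in V.items())
    stat = num * num / var
    p_value = math.erfc(math.sqrt(stat / 2.0))  # upper tail of chi-square(1)
    return stat, p_value

# Toy summary statistics for two studies (made-up numbers).
study_a = (
    {"v1": 1.2, "v2": -0.3},
    {("v1", "v1"): 2.0, ("v2", "v2"): 1.0, ("v1", "v2"): 0.1, ("v2", "v1"): 0.1},
)
study_b = (
    {"v1": 0.8, "v3": 0.5},
    {("v1", "v1"): 1.5, ("v3", "v3"): 0.5},
)

U, V = combine_studies([study_a, study_b])
weights = {"v1": 1.0, "v2": 1.0, "v3": 1.0}  # equal weights, for illustration
stat, p_value = burden_test(U, V, weights)
```

Only the scores and covariance entries cross study boundaries in this sketch; no individual-level genotypes or phenotypes are shared.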
A key advantage of MetaSTAARlite is scalability. By leveraging a sparse GRM and operating directly on sparse genotype dosage matrices, MetaSTAARlite greatly reduces runtime, memory, and storage. Benchmarking on UK Biobank WES data, using missense variants in TTN (the largest gene in the human genome) and total cholesterol as the phenotype:
• At n = 300K, MetaSTAARlite achieved 332× and 1,386× lower peak memory than MetaSTAAR and Raremetal2, respectively
• It also achieved 24× and 2,206× lower computation time than MetaSTAAR and Raremetal2, respectively
• At n = 446K with 22,994 variants, summary statistics generation finished in 48.82 seconds with <1 GB peak memory
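One intuition for why sparse genotype dosages help: for a rare variant, almost every sample has dosage 0, so sums over samples reduce to sums over carriers. The sketch below illustrates this with a simplified setting that is an assumption here, not the paper's model: unrelated samples and an intercept-only linear null model. The function names and toy data are hypothetical.

```python
def score_statistic(carriers, residuals):
    """U_j = sum of dosage * residual over carriers; non-carriers add 0."""
    return sum(d * residuals[i] for i, d in carriers.items())

def covariance_entry(carriers_a, carriers_b, n, sigma2):
    """Cov(U_a, U_b) = sigma2 * (G_a'G_b - sum(G_a) * sum(G_b) / n).

    Assumes an intercept-only null model with unrelated samples.
    """
    # Iterate over the smaller carrier set when computing the cross-product.
    if len(carriers_a) > len(carriers_b):
        carriers_a, carriers_b = carriers_b, carriers_a
    cross = sum(d * carriers_b[i] for i, d in carriers_a.items() if i in carriers_b)
    total_a = sum(carriers_a.values())
    total_b = sum(carriers_b.values())
    return sigma2 * (cross - total_a * total_b / n)

# Toy data: 6 samples, phenotype residuals from an intercept-only fit.
y = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]
n = len(y)
mean_y = sum(y) / n
residuals = [v - mean_y for v in y]
sigma2 = sum(r * r for r in residuals) / (n - 1)

# Each variant stores only its carriers: sample index -> dosage.
variant_1 = {0: 1, 4: 2}
variant_2 = {4: 1, 5: 1}

u1 = score_statistic(variant_1, residuals)
u2 = score_statistic(variant_2, residuals)
cov_12 = covariance_entry(variant_1, variant_2, n, sigma2)
cov_11 = covariance_entry(variant_1, variant_1, n, sigma2)
```

The work per variant scales with the number of carriers rather than the sample size, which is the general reason sparse representations reduce time and memory for rare variants.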
Another important challenge in rare variant meta-analysis is the storage of LD matrices. MetaSTAARlite substantially reduces this burden. In a UK Biobank WGS total cholesterol benchmark, we randomly partitioned 190,110 participants into three studies and generated genome-wide summary statistics with 12 functional annotations. MetaSTAARlite required only:
• 0.48 GB total storage per mask for genome-wide coding meta-analysis summary statistics
• 1.67 GB total storage per mask for genome-wide noncoding meta-analysis summary statistics
Notably, the sparse LD matrices accounted for only 7.7% and 17.8% of the total storage for coding and noncoding analyses, respectively. This means that, in MetaSTAARlite, LD matrix storage is no longer a bottleneck for rare variant meta-analysis.
In UK Biobank WGS total cholesterol analyses, we compared MetaSTAARlite meta-analysis with pooled STAARpipeline analysis using the same individual-level data. The results were nearly perfectly concordant:
• Pearson r² > 0.999 for log10-transformed P values across genome-wide significant and suggestive masks
• 58 genome-wide significant coding associations
• 88 genome-wide significant noncoding associations
• Signals included known lipid biology: PCSK9, APOB, APOA5, LDLR and APOE clusters
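For readers who want to run a similar concordance check on their own results, here is a minimal sketch of computing Pearson r² between log10-transformed P values from two analyses. The P values below are made up for illustration and are not results from the paper; the function names are hypothetical, and the use of −log10 rather than log10 is an arbitrary choice here, since the sign does not change r².

```python
import math

def pearson_r2(xs, ys):
    """Squared Pearson correlation between two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    return sxy * sxy / (sxx * syy)

def p_value_concordance(p_meta, p_pooled):
    """r^2 between -log10(P) from meta-analysis and pooled analysis."""
    log_meta = [-math.log10(p) for p in p_meta]
    log_pooled = [-math.log10(p) for p in p_pooled]
    return pearson_r2(log_meta, log_pooled)

# Hypothetical P values for the same four masks in two analyses.
p_meta = [1e-12, 3e-9, 5e-8, 2e-7]
p_pooled = [2e-12, 2.5e-9, 6e-8, 2e-7]
r2 = p_value_concordance(p_meta, p_pooled)
```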
We further applied MetaSTAARlite to cross-biobank meta-analysis of UK Biobank (in the Research Analysis Platform) and All of Us (in the Research Workbench) data for five traits: total cholesterol, height, eGFR, calcium and elevated LDL-C (a binary trait). These analyses included up to 692,445 diverse participants. Across these traits, MetaSTAARlite identified 165, 536, 117, 38 and 94 genome-wide significant coding associations, respectively, while keeping average peak memory below 1 GB.
These cloud-based analyses were also inexpensive. In the UK Biobank RAP, for example, genome-wide summary statistics generation for all 5 masks had a theoretical cost of ~£3.60–£3.90 per trait; actual costs were typically below £7 and never exceeded ~£8.20.
We hope MetaSTAARlite will make cross-biobank rare variant discovery more accessible, scalable and privacy-preserving for large WGS/WES consortia.
Software and tutorial are open source:
Paper:
nature.com/articles/s43588-0…
MetaSTAARlite:
github.com/li-lab-genetics/M…
Tutorial:
github.com/li-lab-genetics/M…
Manuscript code:
github.com/li-lab-genetics/M…