If you have unlimited resources, supplement long-read data with very high depth short-read sequencing, which often has a lower per-base error rate. Sequence to 1000x depth for better resolution of within-sample mosaicism. The twins likely also differ at single-nucleotide variants (SNVs). Confirm each variant identified by short-read WGS with Sanger sequencing, as is done in real-world applications; this works provided those alleles have a high allele fraction. Using both SNV-based and structural variation-based results gives extra confidence if the two approaches agree.
Quantify the strength of evidence by modifying the likelihood ratio from Krawczak et al. (2018), "Distinguishing genetically between the germlines of male monozygotic twins," which operates on read counts and can account for multiple tissue types (doi.org/10.1371/journal.pgen…). At high read depth, bias from sample contamination could invalidate the read count estimator, so consider capping the likelihood contribution of any single variant.
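The capping idea can be sketched as follows. This is not the Krawczak et al. formula: it is a deliberately simplified binomial stand-in that assumes the child's alternate-allele reads follow a heterozygous fraction of 0.5 if the variant was transmitted and a flat error rate otherwise. The function names, the `error_rate` default, and the `cap` of 3 log10 units are all illustrative assumptions.

```python
import math

def log10_binom_pmf(k, n, p):
    """log10 of the Binomial(n, p) probability of observing k successes."""
    log_coeff = math.lgamma(n + 1) - math.lgamma(k + 1) - math.lgamma(n - k + 1)
    return (log_coeff + k * math.log(p) + (n - k) * math.log(1 - p)) / math.log(10)

def capped_log10_lr(variants, het_fraction=0.5, error_rate=0.001, cap=3.0):
    """Sum per-variant log10 likelihood ratios, each clipped to [-cap, +cap].

    variants: iterable of (alt_reads, depth) observed in the child at sites
    private to one twin. All parameter defaults are assumptions, not values
    taken from the cited paper.
    """
    total = 0.0
    for alt, depth in variants:
        transmitted = log10_binom_pmf(alt, depth, het_fraction)
        not_transmitted = log10_binom_pmf(alt, depth, error_rate)
        llr = transmitted - not_transmitted
        # Clip so one variant (possibly a contamination artifact) cannot dominate.
        total += max(-cap, min(cap, llr))
    return total
```

With the cap in place, a single site at 1000x contributes at most 3 log10 units no matter how extreme its read counts are, so the total has to be built from agreement across several variants.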
Validate the pipeline using genomic data in All of Us and the UK Biobank, both of which include cohorts with short- and long-read sequencing, to test the method with twins and children of known relationships. This calibrates confidence estimates and informs how sure you can be in the paternity result.
If follow-up analyses warrant it, sequence the mother to resolve ambiguities from maternal transmission or to confirm phasing, optionally combined with Strand-seq on the child's genome to improve assembly contiguity for reference-free structural variation analysis. For low allele fraction mosaic variants, validate SNVs with targeted amplicon sequencing that uses unique molecular identifiers (UMIs), since UMI-based consensus calling is more robust to PCR and sequencing errors.
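To illustrate why unique molecular identifier (UMI) reads are more robust, here is a minimal sketch of per-family consensus calling at a single site. Reads sharing a UMI derive from one original molecule, so a base that disagrees within a family is most likely a PCR or sequencing error. The function name and both thresholds are assumptions chosen for illustration, not taken from any particular UMI toolkit.

```python
from collections import Counter, defaultdict

def umi_consensus(reads, min_family_size=3, min_agreement=0.9):
    """Collapse (umi, base) observations at one site into per-molecule calls.

    Families with too few reads, or without a sufficiently dominant base,
    are dropped rather than called. Thresholds are illustrative assumptions.
    """
    families = defaultdict(list)
    for umi, base in reads:
        families[umi].append(base)

    consensus = {}
    for umi, bases in families.items():
        if len(bases) < min_family_size:
            continue  # too few reads to trust a consensus
        base, count = Counter(bases).most_common(1)[0]
        if count / len(bases) >= min_agreement:
            consensus[umi] = base
    return consensus
```

The allele fraction is then estimated from the consensus calls (one per original molecule) rather than from raw reads, which keeps amplification bias from inflating a low-fraction variant.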
The primary variants of interest are sites that are discordant between the twins' triplicate sperm samples, compared against the child's genome. Follow these up with PCR and Sanger sequencing for SNVs and small indels, and optionally droplet digital PCR for copy number variation differences. If PacBio HiFi plus ultra-long Oxford Nanopore assemblies suggest large higher-order repeat variation, use optical genome mapping, or optionally pulsed-field gel electrophoresis followed by Southern blot, as confirmation.
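The discordant-site selection can be sketched with plain set operations. This assumes variant calls are already reduced to hashable keys such as `(chrom, pos, ref, alt)`, and it uses a conservative rule that is an assumption of this sketch: a site counts as private to one twin only if it appears in all of his replicates and in none of his brother's. The function names are hypothetical.

```python
def private_sites(twin_a_reps, twin_b_reps):
    """Return (a_private, b_private) sets of variant keys.

    twin_a_reps, twin_b_reps: lists of sets, one set of variant keys per
    sperm replicate. A site is private to a twin if it is called in every
    one of his replicates and in none of the other twin's replicates.
    """
    a_all = set.intersection(*twin_a_reps)
    b_all = set.intersection(*twin_b_reps)
    a_any = set.union(*twin_a_reps)
    b_any = set.union(*twin_b_reps)
    return a_all - b_any, b_all - a_any

def child_support(child_calls, a_private, b_private):
    """Count how many of each twin's private sites are seen in the child."""
    return len(child_calls & a_private), len(child_calls & b_private)
```

Sites that survive this filter and show up in the child are the natural shortlist for the PCR and Sanger follow-up.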
Cost estimate of extreme version: 1000x WGS of the human genome is 3.2 Gb × 1000 = 3,200 Gb per sample; 100x is 320 Gb. PacBio Onso short-read sequencing runs about $772k total across blood, triplicate sperm from each twin, and saliva, using 300-cycle kits at roughly 120 Gb per flow cell. PacBio HiFi long-read WGS at 100x depth across two sperm and one blood sample runs about $19k. Oxford Nanopore ultra-long reads at roughly 50 Gb per PromethION flow cell for three samples runs about $41k. Optical genome mapping on three blood samples adds about $6k. PCR and Sanger sequencing for 30 loci across three samples, forward and reverse, comes to about $1k. Total: approximately $840,000. DNA extraction is completed by asking very nicely for someone with relevant wet lab experience to do it before sending out samples, so that part is free. Optional contingent components like Strand-seq, sequencing the mother, and ddPCR are omitted unless ambiguity arises.
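The arithmetic above can be checked with a few lines. The dollar figures are the rounded line items quoted in the estimate; the helper names are made up for this sketch, and the flow cell count is a simple ceiling that ignores run failures and uneven yield.

```python
import math

GENOME_GB = 3.2  # human genome size used in the estimate above

def required_gb(depth):
    """Sequencing output needed per sample at a given depth."""
    return GENOME_GB * depth

def flow_cells_needed(depth, gb_per_flow_cell):
    """Flow cells per sample, rounding up to whole flow cells."""
    return math.ceil(required_gb(depth) / gb_per_flow_cell)

# Rounded line items from the estimate, in US dollars.
COSTS_USD = {
    "PacBio Onso short-read, 1000x": 772_000,
    "PacBio HiFi long-read, 100x": 19_000,
    "Oxford Nanopore ultra-long": 41_000,
    "Optical genome mapping": 6_000,
    "PCR and Sanger, 30 loci": 1_000,
}

TOTAL_USD = sum(COSTS_USD.values())
```

At roughly 120 Gb per flow cell, `flow_cells_needed(1000, 120)` comes to 27 flow cells per 1000x sample, and the line items sum to $839,000, which matches the approximately $840,000 total.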