GenSI

@hello_gensi

Jan 21

What exactly is Protein Search? 🧬 Unlike De novo Design or Representation Learning, Protein Search focuses on "searching with an evolutionary perspective." Why is it harder than Web Search?
✅ Requires joint reasoning over missing modalities.
✅ Targets high-confidence evolutionary hypotheses.
✅ Matches true biological kinship, not just text similarity.
Check out our detailed video! 👇
#AIforScience #MachineLearning #ComputationalBiology #ProteinSearch #StructuralBiology #DeNovoDesign #RepresentationLearning #SequenceAnalysis #Tweetorial #AcademicTwitter

Biology AI Daily

Biology AI Daily @BiologyAIDaily

5 Oct 2025

Edit Distance Embedding with Genomic Large Language Model
1. A new study has made significant strides in the field of genomic sequence analysis. The research introduces LLMED, a model that leverages genomic large language models to produce sequence embeddings approximating the edit distance, outperforming existing methods in both accuracy and efficiency.
2. The core innovation of LLMED lies in its unique approach to edit distance approximation. Traditional methods struggle with the computational expense of calculating edit distance for large-scale genomic sequences. LLMED addresses this by embedding sequences into a normed space, allowing for faster and more efficient distance estimation. This not only enhances the speed of sequence analysis but also maintains high accuracy.
3. One of the standout features of LLMED is its versatility. The model is designed to be trained using any existing genomic foundation model, making it highly adaptable to various genomic datasets and applications. This flexibility ensures that LLMED can be easily integrated into different research pipelines, further expanding its potential impact in the field.
4. The study conducted extensive experimental comparisons to validate the performance of LLMED. Results showed that LLMED surpassed leading machine learning and rule-based embedding methods in approximating the edit distance. In critical applications such as similar sequence search, LLMED achieved significantly improved accuracy, demonstrating its superior embedding capabilities.
5. The training process of LLMED is also noteworthy. The model employs contrastive learning based on a pretrained genomic large language model. Three different loss functions (MAE loss, triplet loss, and combined loss) are explored to optimize the model’s performance. This rigorous training approach ensures that LLMED can effectively learn from data and generate high-quality embeddings.
6. Practical applications of LLMED are demonstrated through tasks like K-nearest neighbor search. The model’s ability to accurately identify similar sequences in both synthetic and real datasets highlights its potential for use in various biological applications, such as phylogeny reconstruction and nearest sequence search. This makes LLMED a valuable tool for researchers working with genomic data.
7. The research concludes that LLMED represents a significant advancement in the field of genomic sequence analysis. By leveraging the power of genomic large language models, LLMED offers a more efficient and accurate solution for edit distance approximation. Future work could focus on further enhancing the model’s performance by incorporating advanced techniques used in natural language processing.
📜Paper: biorxiv.org/content/10.1101/… 💻Code: github.com/Shao-Group/llmemb…
#Genomics #Bioinformatics #LargeLanguageModels #SequenceAnalysis #EditDistance #LLMED

1,494
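
To make the LLMED thread above more concrete, here is a minimal sketch of the quantities it mentions: the exact edit distance, an MAE loss comparing embedding distances against it, and a triplet loss. This is not the authors' code. `toy_embed` (plain nucleotide counts) is a hypothetical stand-in for the learned embedding, and the margin value is an assumption.

```python
from math import dist  # Euclidean distance between two points (Python 3.8+)


def edit_distance(a, b):
    """Levenshtein distance via the classic two-row dynamic program."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]


def toy_embed(seq):
    """Hypothetical embedding: counts of A, C, G, T (NOT a genomic language model)."""
    return [float(seq.count(c)) for c in "ACGT"]


def mae_loss(pairs, embed=toy_embed):
    """Mean absolute error between embedding distance and true edit distance."""
    errors = [abs(dist(embed(a), embed(b)) - edit_distance(a, b)) for a, b in pairs]
    return sum(errors) / len(errors)


def triplet_loss(anchor, positive, negative, embed=toy_embed, margin=1.0):
    """Hinge triplet loss: zero once the positive is closer than the negative by `margin`."""
    d_pos = dist(embed(anchor), embed(positive))
    d_neg = dist(embed(anchor), embed(negative))
    return max(0.0, d_pos - d_neg + margin)
```

The point of the exercise is that `edit_distance` costs O(len(a) × len(b)) per pair, whereas a distance between fixed-size embeddings does not grow with sequence length, which is the speed-up the thread describes.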

Biology AI Daily

Biology AI Daily @BiologyAIDaily

4 Jul 2025

Protein-protein interaction prediction using bidirectional GRUs with explicit ensemble @PLOSONE
1. This paper introduces a novel model for predicting protein-protein interactions (PPIs) using a combination of bidirectional gated recurrent units (BiGRUs) and an explicit ensemble approach. The model achieves state-of-the-art performance on both in-species and cross-species datasets.
2. The most attractive feature is its high generalizability. Trained on S. cerevisiae, the model maintains strong performance across multiple cross-species and disease-specific datasets, highlighting its robustness and potential utility in diverse biological settings.
3. To enhance sequence representation, the authors incorporate the SVHEHS descriptor, a 20×13-dimensional matrix derived from 457 physicochemical properties of amino acids, into three key feature encoding techniques: PseAAC, AD, and AC.
4. Six feature coding techniques are used in total (PseAAC, AD, AC, CT, LD, and MMI), each capturing different aspects of protein sequences (composition, sequence order, local/global dependencies, and mutual information).
5. Each feature vector is processed by a dedicated BiGRU for dimensionality reduction. These BiGRUs are then explicitly ensembled to retain diverse learned features while reducing noise and redundancy.
6. The final feature set is input into a LightGBM classifier. This combination of deep learning for feature transformation and gradient boosting for classification allows for both high accuracy and efficient inference.
7. On the H. pylori and S. cerevisiae datasets, the model achieves 96.47% and 97.79% accuracy, respectively, outperforming models like GcForest-PPI, GTB-PPI, and DeepPPI.
8. BiGRU outperforms both forward and backward GRUs in capturing sequence dependencies. Bidirectional context is shown to be crucial for capturing interaction-related patterns.
9. The explicit ensemble (MultiEns) of six independent BiGRUs outperforms simpler strategies like feature concatenation (MultiCon) or dual-branch networks (MultiSep), demonstrating the benefit of architectural diversity.
10. The model also surpasses traditional classifiers (e.g., SVM, KNN, RF, AdaBoost) in accuracy, robustness, and computational efficiency, justifying the choice of LightGBM for final prediction.
11. Evaluation on independent datasets (e.g., C. elegans, E. coli, H. sapiens, M. musculus) confirms excellent generalization, with accuracy consistently above 94%. On disease-specific datasets, it achieves perfect accuracy in some cases.
12. The framework is modular and data-efficient. A key limitation is the absence of protein structural data or pretrained language model embeddings, which the authors note as future work.
13. Overall, this study presents a flexible and accurate PPI prediction pipeline with demonstrated cross-domain utility and a well-grounded methodological design.
💻Code: github.com/bingo111111/BiGRU… 📜Paper: journals.plos.org/plosone/ar…
#PPI #DeepLearning #Bioinformatics #GRU #ProteinInteractions #MachineLearning #BiGRU #LightGBM #SequenceAnalysis

649
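
Of the six encodings listed in the PPI thread above, CT (conjoint triad) is the easiest to illustrate. The sketch below is a simplified version, not the paper's implementation. The seven residue classes follow the grouping commonly used for conjoint triads, and the min-max normalization is an assumption; the paper's exact classes and scaling may differ.

```python
from itertools import product

# Assumed residue grouping (seven classes), as commonly used for conjoint triads.
CLASSES = ["AGV", "ILFP", "YMTS", "HNQW", "RK", "DE", "C"]
RESIDUE_TO_CLASS = {aa: idx for idx, group in enumerate(CLASSES) for aa in group}

# Every ordered triple of classes gets one slot: 7 ** 3 = 343 features.
TRIADS = list(product(range(len(CLASSES)), repeat=3))
TRIAD_INDEX = {triad: i for i, triad in enumerate(TRIADS)}


def conjoint_triad(seq):
    """Return a 343-dimensional, min-max scaled triad frequency vector."""
    counts = [0] * len(TRIADS)
    classes = [RESIDUE_TO_CLASS.get(aa) for aa in seq.upper()]
    # Slide a window of three consecutive residues over the sequence.
    for window in zip(classes, classes[1:], classes[2:]):
        if None in window:  # skip windows containing non-standard residues
            continue
        counts[TRIAD_INDEX[window]] += 1
    lo, hi = min(counts), max(counts)
    if hi == lo:  # too short or no valid window: return an all-zero vector
        return [0.0] * len(counts)
    return [(c - lo) / (hi - lo) for c in counts]
```

In a pipeline like the one described, a fixed-length vector of this kind would be one of several inputs, each passed to its own BiGRU before the ensemble step.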

Tim Liao

Tim Liao @tfliao

20 Jun 2025

My new paper “Do Birds of a Nest Flock Together?” with the wonderful @yyortiga is finally out in @IMRjournal. It's a polyadic #SequenceAnalysis of HK Filipina domestic workers' migration trajectories. journals.sagepub.com/doi/abs…

Do Birds of a Nest Flock Together? A Study of Home Provinces and Migration Paths among Filipina...

There is strong evidence that migration is networked. However, it is unclear whether having the same province of origin can also lead to similar migratory paths...

journals.sagepub.com

966

Biology AI Daily

Biology AI Daily @BiologyAIDaily

26 May 2025

Predicting protein folding dynamics using sequence information
1. This study introduces a computational framework to predict protein folding dynamics directly from amino acid sequences, going beyond static structure predictions to model how proteins fold and how mutations impact their folding pathways.
2. The method leverages Direct Coupling Analysis (DCA) to infer a Potts model from multiple sequence alignments, capturing evolutionary constraints as a proxy for folding energetics.
3. Folding dynamics are simulated using a coarse-grained finite-chain Ising model, where proteins are partitioned into discrete folding units called foldons, each modeled as a two-state (folded/unfolded) spin.
4. The framework estimates folding temperatures and cooperative folding behavior for individual foldons, enabling the identification of subdomains and critical folding transitions within a protein.
5. A key innovation is the use of evolutionary energy landscapes to simulate folding curves, free energy profiles, and cooperative transitions without requiring structural input or experimental folding data.
6. The model accommodates a variety of foldon partitioning schemes, including repeat-based, secondary structure-based, exon-based, and neutral models, allowing tailored analyses for different protein topologies.
7. It also estimates the selection temperature (Tsel) for a protein family, quantifying the evolutionary pressure on folding stability, either from experimental ΔΔG data or inferred from sequence variability.
8. The Monte Carlo simulation protocol is optimized to detect folding/unfolding transitions across temperature ranges, and outputs thermal unfolding curves, cooperativity scores, and domain emergence maps.
9. The framework enables rapid in silico assessment of mutation effects, predicting changes in folding stability and cooperativity for all possible single-point mutants using the wild-type energy field.
10. By extending the simulation to many sequences from the same family, the model supports family-wide analyses and rational protein design, including ranking sequences by thermal stability.
11. Furthermore, it enables generation of novel protein sequences using the Potts model and maps them in an energy-cooperativity space, providing predictive insights into their folding properties before simulation.
12. A Google Colab notebook implementing the entire pipeline is publicly available, allowing researchers to run custom simulations from sequence and alignment data with minimal setup.
💻Code: colab.research.google.com/gi… 📜Paper: arxiv.org/abs/2505.17237
#ProteinFolding #EvolutionaryBiophysics #PottsModel #SequenceAnalysis #FoldingMechanism #ComputationalBiology #CoarseGraining #DirectCouplingAnalysis

5,134
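
The foldon picture in the folding thread above can be sketched with a toy chain of two-state units. Everything here is an illustrative assumption rather than the paper's model: the energy function, the parameter names, and the units (k_B = 1) are hypothetical, and the sketch uses exact enumeration of all states instead of the Monte Carlo protocol the thread describes, which is only feasible for short chains.

```python
from itertools import product
from math import exp


def folded_fraction(energies, couplings, entropies, temperature):
    """Boltzmann-averaged fraction of folded foldons in a toy Ising-like chain.

    energies[i]  : energy gained when foldon i folds (negative = stabilizing)
    couplings[i] : extra energy when foldons i and i+1 are both folded
    entropies[i] : entropic cost of folding foldon i (assumed, k_B = 1)
    """
    n = len(energies)
    partition = 0.0
    weighted_folded = 0.0
    for state in product((0, 1), repeat=n):  # 0 = unfolded, 1 = folded
        # Temperature-dependent free energy of this configuration.
        free_energy = sum(s * (energies[i] + temperature * entropies[i])
                          for i, s in enumerate(state))
        free_energy += sum(couplings[i] * state[i] * state[i + 1]
                           for i in range(n - 1))
        weight = exp(-free_energy / temperature)
        partition += weight
        weighted_folded += weight * sum(state) / n
    return weighted_folded / partition


def unfolding_curve(energies, couplings, entropies, temperatures):
    """Thermal unfolding curve: folded fraction at each temperature."""
    return [folded_fraction(energies, couplings, entropies, t) for t in temperatures]
```

Sweeping `temperatures` and locating where the curve crosses 0.5 gives a rough folding temperature for the toy chain; more negative `couplings` make that transition sharper, which is the sense of "cooperativity" used in the thread.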

Matthias Studer

Matthias Studer @studer_matthias

13 Jan 2025

New article with Kevin Emery and André Berchtold: A systematic comparison of methods to impute missing longitudinal categorical data. We also propose a new MICT-Timing algorithm. The #SequenceAnalysis methods are available in the seqimpute #Rpackage doi.org/10.1007/s11135-024-0…

129

Matthias Studer

Matthias Studer @studer_matthias

19 Dec 2024

A key article for the development of robust #SequenceAnalysis @SeqAnalysisAssn. The new framework takes measurement error into account when creating a typology with cluster analysis and using it in subsequent analysis. link.springer.com/10.1186/s1…

Robustness assessment of regressions using cluster analysis typologies: a bootstrap procedure with...

BMC Medical Research Methodology - In standard Sequence Analysis, similar trajectories are clustered together to create a typology of trajectories, which is then often used to evaluate the...

link.springer.com

544

Biology AI Daily

Biology AI Daily @BiologyAIDaily

7 Dec 2024

Embed-Search-Align: DNA Sequence Alignment using Transformer Models
1. Introducing Embed-Search-Align (ESA), a novel framework leveraging Transformer-based Reference-Free DNA Embedding (RDE) to align DNA sequences with unmatched efficiency and accuracy, rivaling traditional methods like Bowtie and BWA-Mem.
2. Key innovation: ESA transforms genome-wide sequence alignment into a vector search task, enabling efficient identification of top-matching fragments through a specialized DNA vector store.
3. RDE achieves 99% accuracy in aligning 250-length reads to a human reference genome, significantly outperforming 6 recent DNA-Transformer baselines like Hyena-DNA and DNABERT-2 in terms of both precision and scalability.
4. Unique features: Self-supervised training with contrastive loss allows RDE to generate rich embeddings, preserving sequence locality in the embedding space and enabling robust cross-species and cross-chromosome alignment.
5. ESA reduces computational complexity, achieving a speed of aligning 10,000 reads per minute while maintaining high accuracy. This represents a step forward in aligning reads for large and complex genomes.
6. Real-world implications: ESA’s superior performance in aligning short reads from simulated and experimental datasets offers transformative potential for genomics, including variant calling, transcriptomics, and epigenomics.
7. Looking ahead: ESA’s framework paves the way for advanced applications like pan-genome alignment and de novo genome assembly, with promising initial results on species like Thermus aquaticus.
@LajoyceMboning @KCEnevoldsen 📜Paper: arxiv.org/abs/2309.11087
#Genomics #DNAAlignment #Transformers #MachineLearning #Bioinformatics #SequenceAnalysis #GenomeAssembly

1,301
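
The "alignment as vector search" idea in the ESA thread above can be sketched as follows. This is a toy illustration under stated assumptions, not the ESA pipeline: `kmer_embed` (sparse 3-mer counts) is a hypothetical stand-in for the Transformer embedding, the "vector store" is a plain Python list searched by brute force, and the fragment length and stride are arbitrary.

```python
from collections import Counter
from math import sqrt


def kmer_embed(seq, k=3):
    """Hypothetical embedding: sparse k-mer count vector (NOT a learned model)."""
    return Counter(seq[i:i + k] for i in range(len(seq) - k + 1))


def cosine(u, v):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(u[key] * v[key] for key in u.keys() & v.keys())
    norm = sqrt(sum(x * x for x in u.values())) * sqrt(sum(x * x for x in v.values()))
    return dot / norm if norm else 0.0


def build_store(reference, fragment_len, stride):
    """Embed overlapping reference fragments; returns (start, embedding) pairs."""
    last_start = max(len(reference) - fragment_len, 0)
    return [(start, kmer_embed(reference[start:start + fragment_len]))
            for start in range(0, last_start + 1, stride)]


def search(read, store, top_k=1):
    """Return start positions of the top_k fragments most similar to the read."""
    query = kmer_embed(read)
    ranked = sorted(store, key=lambda item: cosine(query, item[1]), reverse=True)
    return [start for start, _ in ranked[:top_k]]
```

In a full pipeline the returned candidate fragments would then be refined with a conventional alignment step, and the brute-force scan would be replaced by an approximate nearest-neighbor index.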

Biology AI Daily

Biology AI Daily @BiologyAIDaily

21 Nov 2024

The Protein Language Visualizer: Sequence Similarity Networks for the Era of Language Models
• PLVis introduces an innovative pipeline to visualize protein sequence relationships using Protein Language Model (PLM) embeddings. It leverages dimensionality reduction techniques (e.g., UMAP, t-SNE) and clustering methods for interactive, intuitive exploration of protein similarities.
• Compared to traditional Sequence Similarity Networks (SSNs), PLVis demonstrates superior clustering efficiency by capturing high-dimensional protein family relationships. It identifies functional protein clusters that remain ambiguous or isolated in conventional SSNs.
• A head-to-head comparison on datasets such as radical SAM enzymes and Mycobacterium proteomes shows that PLVis clusters are well-defined, with high silhouette scores (>0.95), preserving local and global sequence similarities.
• Functional annotations in PLVis clusters reveal enriched protein families, enabling rapid identification of biologically significant patterns, including species-specific expansions in Mycobacterium and malaria-related Plasmodium species.
• The pipeline supports interactive exploration with a Google Colab Notebook, allowing users to upload their protein datasets, generate embeddings, and analyze cluster properties, bridging the gap between high-throughput data generation and biological insights.
@champiDicty 📜Paper: biorxiv.org/content/10.1101/…
#ProteinVisualization #MachineLearning #SequenceAnalysis #Bioinformatics #DimensionalityReduction

5,710
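
The silhouette scores quoted in the PLVis thread above measure how well separated the clusters are. As a minimal sketch of how such a score is computed, the function below implements the standard mean silhouette coefficient with Euclidean distance. The input points stand for already-reduced embeddings (for example 2-D coordinates); nothing here comes from the PLVis code, and the choice of distance is an assumption.

```python
from math import dist  # Euclidean distance (Python 3.8+)


def silhouette_score(points, labels):
    """Mean silhouette coefficient over all points (ranges from -1 to 1)."""
    clusters = {}
    for point, label in zip(points, labels):
        clusters.setdefault(label, []).append(point)
    if len(clusters) < 2:
        raise ValueError("silhouette needs at least two clusters")

    scores = []
    for point, label in zip(points, labels):
        own = clusters[label]
        if len(own) == 1:  # singleton clusters score 0 by convention
            scores.append(0.0)
            continue
        # a: mean distance to the other members of the point's own cluster
        # (the point's zero distance to itself is included in the sum).
        a = sum(dist(point, q) for q in own) / (len(own) - 1)
        # b: mean distance to the nearest other cluster.
        b = min(sum(dist(point, q) for q in other) / len(other)
                for key, other in clusters.items() if key != label)
        denom = max(a, b)
        scores.append((b - a) / denom if denom else 0.0)
    return sum(scores) / len(scores)
```

Values near 1 mean points sit much closer to their own cluster than to any other, so a score above 0.95 indicates very compact, well-separated groups in the reduced space.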

Lisa Crossman

Lisa Crossman @Lisa_Crossman

11 Nov 2024

We are looking for a PIPS student intern #SequenceAnalysis #Bioinformatics

954

EMBL-EBI Training

EMBL-EBI Training @EBItraining

5 Nov 2024

There’s still time to join the #webinar tomorrow to explore the sequence analysis tools and how to access them programmatically using the Job Dispatcher framework. Registration is free but essential: ebi.ac.uk/training/events/gu… #Bioinformatics #DataScience #SequenceAnalysis @emblebi

Graphics image showing webinar at EMBL-EBI about A guide to Job Dispatcher sequence analysis tools and programmatic access. The webinar will run on 06 November 2024 at 15:30 GMT. The speaker is Nandana Madhusoodanan. Image credit: Louise Walker, EMBL-EBI

1,090

EMBL-EBI Training

EMBL-EBI Training @EBItraining

30 Oct 2024

Join our #webinar next week to explore the sequence analysis tools and how to access them programmatically using the Job Dispatcher framework. Registration is free but essential: ebi.ac.uk/training/events/gu… #Bioinformatics #DataScience #SequenceAnalysis @emblebi @ebi_jdispatcher

760

Isabelle Nic Craith 🌻⭕️

Isabelle Nic Craith 🌻⭕️@iniccraith

25 Jun 2024

Returning to life in Dublin after spending last week in Barcelona attending the third annual @coordinate_eu summer school on #LifeCourse research & #SequenceAnalysis! 📚 The summer school was a fantastic introduction to theory and methods in this domain of #LongitudinalResearch.

1,015

lxcv

lxcv @latinxincv

18 Jun 2024

⚠️ Octavia Camps delivered an insightful talk on "Frugal, Interpretable, Dynamics-Inspired Architectures for Sequence Analysis." Thank you 👏 #SequenceAnalysis #Dynamics #CVPR2024 @Beto_OchoaRuiz 👏

776

Matthias Studer

Matthias Studer @studer_matthias

24 May 2024

A few days left to register for this 3-day workshop on #SequenceAnalysis with @GilbertRitscha1 and Kevin Emery. Online participation possible. 5-7 June 2024 forscenter.ch/swiss-househol…

343

EMBL-EBI Training

EMBL-EBI Training @EBItraining

14 May 2024

There’s still time to join tomorrow’s #webinar about exploring the sequence analysis tools provided by the Job Dispatcher team at EMBL-EBI. Registration is free but essential: ebi.ac.uk/training/events/ac… #bioinformatics #SequenceAnalysis #clustalomega #DataScience @ebi_jdispatcher

Graphics card showing EMBL-EBI webinar about accessing sequence analysis tools via the new Job Dispatcher website. The speaker is Nandana Madhusoodanan. The webinar will run on Wednesday 15 May at 3:30pm BST.

1,389

EMBL-EBI Training

EMBL-EBI Training @EBItraining

8 May 2024

Join our #webinar next week to explore the sequence analysis tools provided by the Job Dispatcher team at EMBL-EBI. Registration is free but essential: ebi.ac.uk/training/events/ac… #Bioinformatics #SequenceAnalysis #clustalomega #DataScience

4,544

Sequence Analysis Association

Sequence Analysis Association @SeqAnalysisAssn

24 Apr 2024

Reminder #SequenceAnalysis: Tomorrow, April 25, 4pm CET: Workshop on SA research design. Link to join at sequenceanalysis.org/webinar…

738