What exactly is Protein Search? 🧬 Unlike De novo Design or Representation Learning, Protein Search focuses on "searching with an evolutionary perspective." Why is it harder than Web Search? ✅ Requires joint reasoning of missing modalities. ✅ Targets high-confidence evolutionary hypotheses. ✅ Matches true biological kinship, not just text similarity. Check out our detailed video! 👇 #AIforScience #MachineLearning #ComputationalBiology #ProteinSearch #StructuralBiology #DeNovoDesign #RepresentationLearning #SequenceAnalysis #Tweetorial #AcademicTwitter
Edit Distance Embedding with Genomic Large Language Model
1. A new study has made significant strides in the field of genomic sequence analysis. The research introduces LLMED, a model that leverages genomic large language models to produce sequence embeddings approximating the edit distance, outperforming existing methods in both accuracy and efficiency.
2. The core innovation of LLMED lies in its unique approach to edit distance approximation. Traditional methods struggle with the computational expense of calculating edit distance for large-scale genomic sequences. LLMED addresses this by embedding sequences into a normed space, allowing for faster and more efficient distance estimation. This not only enhances the speed of sequence analysis but also maintains high accuracy.
3. One of the standout features of LLMED is its versatility. The model is designed to be trained using any existing genomic foundation model, making it highly adaptable to various genomic datasets and applications. This flexibility ensures that LLMED can be easily integrated into different research pipelines, further expanding its potential impact in the field.
4. The study conducted extensive experimental comparisons to validate the performance of LLMED. Results showed that LLMED surpassed leading machine learning and rule-based embedding methods in approximating the edit distance. In critical applications such as similar sequence search, LLMED achieved significantly improved accuracy, demonstrating its superior embedding capabilities.
5. The training process of LLMED is also noteworthy. The model employs contrastive learning based on a pretrained genomic large language model. Three different loss functions (MAE loss, triplet loss, and combined loss) are explored to optimize the model’s performance. This rigorous training approach ensures that LLMED can effectively learn from data and generate high-quality embeddings.
6. Practical applications of LLMED are demonstrated through tasks like K-nearest neighbor search. The model’s ability to accurately identify similar sequences in both synthetic and real datasets highlights its potential for use in various biological applications, such as phylogeny reconstruction and nearest sequence search. This makes LLMED a valuable tool for researchers working with genomic data.
7. The research concludes that LLMED represents a significant advancement in the field of genomic sequence analysis. By leveraging the power of genomic large language models, LLMED offers a more efficient and accurate solution for edit distance approximation. Future work could focus on further enhancing the model’s performance by incorporating advanced techniques used in natural language processing.
📜Paper: biorxiv.org/content/10.1101/… 💻Code: github.com/Shao-Group/llmemb… #Genomics #Bioinformatics #LargeLanguageModels #SequenceAnalysis #EditDistance #LLMED
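For reference, the quantity this thread says LLMED approximates, the edit distance, can be computed exactly with the textbook dynamic program sketched below. This is the standard Wagner-Fischer algorithm, not code from the paper or its repository, and the function name `edit_distance` is only illustrative. Its running time grows with the product of the two sequence lengths, which is the cost the thread says embedding methods try to avoid.

```python
def edit_distance(a: str, b: str) -> int:
    """Exact Levenshtein distance using a two-row dynamic program."""
    if len(a) < len(b):
        a, b = b, a  # keep the shorter sequence in the inner loop to save memory
    prev = list(range(len(b) + 1))  # distances from the empty prefix of a
    for i, ca in enumerate(a, start=1):
        curr = [i]  # distance from a[:i] to the empty prefix of b
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # delete ca
                            curr[j - 1] + 1,      # insert cb
                            prev[j - 1] + cost))  # substitute or match
        prev = curr
    return prev[-1]
```

An embedding approach would replace each pairwise call like this with a distance between two precomputed vectors, trading exactness for speed.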
Protein-protein interaction prediction using bidirectional GRUs with explicit ensemble @PLOSONE
1. This paper introduces a novel model for predicting protein-protein interactions (PPIs) using a combination of bidirectional gated recurrent units (BiGRUs) and an explicit ensemble approach. The model achieves state-of-the-art performance on both in-species and cross-species datasets.
2. The most attractive feature is its high generalizability. Trained on S. cerevisiae, the model maintains strong performance across multiple cross-species and disease-specific datasets, highlighting its robustness and potential utility in diverse biological settings.
3. To enhance sequence representation, the authors incorporate the SVHEHS descriptor, a 20×13-dimensional matrix derived from 457 physicochemical properties of amino acids, into three key feature encoding techniques: PseAAC, AD, and AC.
4. Six feature coding techniques are used in total (PseAAC, AD, AC, CT, LD, and MMI), each capturing different aspects of protein sequences (composition, sequence order, local/global dependencies, and mutual information).
5. Each feature vector is processed by a dedicated BiGRU for dimensionality reduction. These BiGRUs are then explicitly ensembled to retain diverse learned features while reducing noise and redundancy.
6. The final feature set is input into a LightGBM classifier. This combination of deep learning for feature transformation and gradient boosting for classification allows for both high accuracy and efficient inference.
7. On the H. pylori and S. cerevisiae datasets, the model achieves 96.47% and 97.79% accuracy, respectively, outperforming models like GcForest-PPI, GTB-PPI, and DeepPPI.
8. BiGRU outperforms both forward and backward GRUs in capturing sequence dependencies. Bidirectional context is shown to be crucial for capturing interaction-related patterns.
9. The explicit ensemble (MultiEns) of six independent BiGRUs outperforms simpler strategies like feature concatenation (MultiCon) or dual-branch networks (MultiSep), demonstrating the benefit of architectural diversity.
10. The model also surpasses traditional classifiers (e.g., SVM, KNN, RF, AdaBoost) in accuracy, robustness, and computational efficiency, justifying the choice of LightGBM for final prediction.
11. Evaluation on independent datasets (e.g., C. elegans, E. coli, H. sapiens, M. musculus) confirms excellent generalization, with accuracy consistently above 94%. On disease-specific datasets, it achieves perfect accuracy in some cases.
12. The framework is modular and data-efficient. A key limitation is the absence of protein structural data or pretrained language model embeddings, which the authors note as future work.
13. Overall, this study presents a flexible and accurate PPI prediction pipeline with demonstrated cross-domain utility and a well-grounded methodological design.
💻Code: github.com/bingo111111/BiGRU… 📜Paper: journals.plos.org/plosone/ar… #PPI #DeepLearning #Bioinformatics #GRU #ProteinInteractions #MachineLearning #BiGRU #LightGBM #SequenceAnalysis
Predicting protein folding dynamics using sequence information
1. This study introduces a computational framework to predict protein folding dynamics directly from amino acid sequences, going beyond static structure predictions to model how proteins fold and how mutations impact their folding pathways.
2. The method leverages Direct Coupling Analysis (DCA) to infer a Potts model from multiple sequence alignments, capturing evolutionary constraints as a proxy for folding energetics.
3. Folding dynamics are simulated using a coarse-grained finite-chain Ising model, where proteins are partitioned into discrete folding units called foldons, each modeled as a two-state (folded/unfolded) spin.
4. The framework estimates folding temperatures and cooperative folding behavior for individual foldons, enabling the identification of subdomains and critical folding transitions within a protein.
5. A key innovation is the use of evolutionary energy landscapes to simulate folding curves, free energy profiles, and cooperative transitions without requiring structural input or experimental folding data.
6. The model accommodates a variety of foldon partitioning schemes, including repeat-based, secondary structure-based, exon-based, and neutral models, allowing tailored analyses for different protein topologies.
7. It also estimates the selection temperature (Tsel) for a protein family, quantifying the evolutionary pressure on folding stability, either from experimental ΔΔG data or inferred from sequence variability.
8. The Monte Carlo simulation protocol is optimized to detect folding/unfolding transitions across temperature ranges, and outputs thermal unfolding curves, cooperativity scores, and domain emergence maps.
9. The framework enables rapid in silico assessment of mutation effects, predicting changes in folding stability and cooperativity for all possible single-point mutants using the wild-type energy field.
10. By extending the simulation to many sequences from the same family, the model supports family-wide analyses and rational protein design, including ranking sequences by thermal stability.
11. Furthermore, it enables generation of novel protein sequences using the Potts model and maps them in an energy-cooperativity space, providing predictive insights into their folding properties before simulation.
12. A Google Colab notebook implementing the entire pipeline is publicly available, allowing researchers to run custom simulations from sequence and alignment data with minimal setup.
💻Code: colab.research.google.com/gi… 📜Paper: arxiv.org/abs/2505.17237 #ProteinFolding #EvolutionaryBiophysics #PottsModel #SequenceAnalysis #FoldingMechanism #ComputationalBiology #CoarseGraining #DirectCouplingAnalysis
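To make the finite-chain Ising picture in this thread concrete, here is a toy sketch of a thermal unfolding curve for a short chain of two-state foldons. Everything in it is an assumption for illustration: the energy terms `eps`, `coupling`, and `entropy` are hypothetical parameters rather than values inferred by DCA, the Boltzmann constant is set to 1, and the partition function is summed by exact enumeration instead of the Monte Carlo protocol the thread describes, which is only feasible for a handful of foldons.

```python
import itertools
import math


def folded_fraction(eps, coupling, entropy, temperature):
    """Mean fraction of folded foldons at one temperature (k_B = 1).

    eps[i]      : energy of foldon i when folded (negative = stabilising)
    coupling[i] : extra energy when foldons i and i+1 are both folded
    entropy[i]  : entropy lost when foldon i folds
    All three are hypothetical inputs chosen by the caller.
    """
    n = len(eps)
    partition = 0.0
    folded = 0.0
    # Enumerate every folded (1) / unfolded (0) assignment of the chain.
    for state in itertools.product((0, 1), repeat=n):
        energy = sum(e * s for e, s in zip(eps, state))
        energy += sum(c * state[i] * state[i + 1] for i, c in enumerate(coupling))
        lost_entropy = sum(ds * s for ds, s in zip(entropy, state))
        # Free energy of the microstate relative to the fully unfolded chain.
        free_energy = energy + temperature * lost_entropy
        weight = math.exp(-free_energy / temperature)
        partition += weight
        folded += weight * sum(state) / n
    return folded / partition


# Sweep the temperature to trace an unfolding curve for a three-foldon chain.
curve = [
    (t, folded_fraction([-2.0, -2.0, -2.0], [-1.0, -1.0], [2.0, 2.0, 2.0], t))
    for t in (0.5, 1.0, 1.5, 2.0, 3.0)
]
```

Negative coupling between neighbours makes the foldons fold together, which is one simple way to picture the cooperativity the thread mentions.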
New article with Kevin Emery and André Berchtold: "A systematic comparison of methods to impute missing longitudinal categorical data." We also propose a new MICT-Timing algorithm. The #SequenceAnalysis methods are available in the seqimpute #Rpackage. doi.org/10.1007/s11135-024-0…
Embed-Search-Align: DNA Sequence Alignment using Transformer Models 1. Introducing Embed-Search-Align (ESA), a novel framework leveraging Transformer-based Reference-Free DNA Embedding (RDE) to align DNA sequences with unmatched efficiency and accuracy, rivaling traditional methods like Bowtie and BWA-Mem. 2. Key innovation: ESA transforms genome-wide sequence alignment into a vector search task, enabling efficient identification of top-matching fragments through a specialized DNA vector store. 3. RDE achieves 99% accuracy in aligning 250-length reads to a human reference genome, significantly outperforming 6 recent DNA-Transformer baselines like Hyena-DNA and DNABERT-2 in terms of both precision and scalability. 4. Unique features: Self-supervised training with contrastive loss allows RDE to generate rich embeddings, preserving sequence locality in the embedding space and enabling robust cross-species and cross-chromosome alignment. 5. ESA reduces computational complexity, achieving a speed of aligning 10,000 reads per minute while maintaining high accuracy. This represents a step forward in aligning reads for large and complex genomes. 6. Real-world implications: ESA’s superior performance in aligning short reads from simulated and experimental datasets offers transformative potential for genomics, including variant calling, transcriptomics, and epigenomics. 7. Looking ahead: ESA’s framework paves the way for advanced applications like pan-genome alignment and de novo genome assembly, with promising initial results on species like Thermus aquaticus. @LajoyceMboning @KCEnevoldsen 📜Paper: arxiv.org/abs/2309.11087 #Genomics #DNAAlignment #Transformers #MachineLearning #Bioinformatics #SequenceAnalysis #GenomeAssembly
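The idea of turning alignment into a vector search, as this thread describes it, can be illustrated with a toy sketch. The embedding here is a plain normalised 3-mer count vector standing in for the Transformer-based RDE model, and the "vector store" is just a Python list searched by brute force; the names `embed`, `build_store`, and `search` are hypothetical and do not come from the paper.

```python
from collections import Counter
from itertools import product
import math

K = 3
KMERS = ["".join(p) for p in product("ACGT", repeat=K)]  # fixed 64-dim vocabulary


def embed(seq: str) -> list[float]:
    """Stand-in embedding: unit-length vector of k-mer counts."""
    counts = Counter(seq[i:i + K] for i in range(len(seq) - K + 1))
    vec = [float(counts[kmer]) for kmer in KMERS]
    norm = math.sqrt(sum(v * v for v in vec)) or 1.0  # avoid dividing by zero
    return [v / norm for v in vec]


def build_store(reference: str, frag_len: int, stride: int):
    """Embed fixed-length reference fragments, keeping their start positions."""
    return [
        (start, embed(reference[start:start + frag_len]))
        for start in range(0, len(reference) - frag_len + 1, stride)
    ]


def search(store, read: str, top_k: int = 1) -> list[int]:
    """Return start positions of the top_k fragments by cosine similarity."""
    query = embed(read)
    scored = [(sum(a * b for a, b in zip(query, vec)), start) for start, vec in store]
    scored.sort(reverse=True)
    return [start for _, start in scored[:top_k]]
```

A real system would swap the brute-force scan for an approximate nearest-neighbour index and a learned embedding; this sketch only shows the shape of the embed-then-search step.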
The Protein Language Visualizer: Sequence Similarity Networks for the Era of Language Models • PLVis introduces an innovative pipeline to visualize protein sequence relationships using Protein Language Model (PLM) embeddings. It leverages dimensionality reduction techniques (e.g., UMAP, t-SNE) and clustering methods for interactive, intuitive exploration of protein similarities. • Compared to traditional Sequence Similarity Networks (SSNs), PLVis demonstrates superior clustering efficiency by capturing high-dimensional protein family relationships. It identifies functional protein clusters that remain ambiguous or isolated in conventional SSNs. • A head-to-head comparison on datasets such as radical SAM enzymes and Mycobacterium proteomes shows that PLVis clusters are well-defined, with high silhouette scores (>0.95), preserving local and global sequence similarities. • Functional annotations in PLVis clusters reveal enriched protein families, enabling rapid identification of biologically significant patterns, including species-specific expansions in Mycobacterium and malaria-related Plasmodium species. • The pipeline supports interactive exploration with a Google Colab Notebook, allowing users to upload their protein datasets, generate embeddings, and analyze cluster properties, bridging the gap between high-throughput data generation and biological insights. @champiDicty 📜Paper: biorxiv.org/content/10.1101/… #ProteinVisualization #MachineLearning #SequenceAnalysis #Bioinformatics #DimensionalityReduction
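The silhouette scores quoted in this post measure how much closer each point is to its own cluster than to the nearest other cluster, with values near 1 indicating tight, well-separated clusters. The sketch below computes the standard mean silhouette coefficient on small low-dimensional points in plain Python. It is not taken from the PLVis pipeline, the function name `silhouette` is illustrative, and it assumes at least two distinct cluster labels.

```python
import math


def silhouette(points, labels):
    """Mean silhouette coefficient; requires at least two distinct labels."""
    scores = []
    for i, (p, own) in enumerate(zip(points, labels)):
        # a: mean distance to the other members of the point's own cluster
        same = [math.dist(p, q)
                for j, (q, lab) in enumerate(zip(points, labels))
                if lab == own and j != i]
        if not same:
            scores.append(0.0)  # convention for single-member clusters
            continue
        a = sum(same) / len(same)
        # b: smallest mean distance to the members of any other cluster
        b = min(
            sum(math.dist(p, q) for q, lab in zip(points, labels) if lab == other)
            / labels.count(other)
            for other in set(labels) if other != own
        )
        scores.append((b - a) / max(a, b))
    return sum(scores) / len(scores)
```

For example, two tight pairs of points placed far apart score close to 1 when labelled by pair and below 0 when the labels are mixed across pairs.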
We are looking for a PIPS student intern #SequenceAnalysis #Bioinformatics
There’s still time to join the #webinar tomorrow to explore the sequence analysis tools and how to access them programmatically using the Job Dispatcher framework. Registration is free but essential: ebi.ac.uk/training/events/gu… #Bioinformatics #DataScience #SequenceAnalysis @emblebi
Join our #webinar next week to explore the sequence analysis tools and how to access them programmatically using the Job Dispatcher framework. Registration is free but essential: ebi.ac.uk/training/events/gu… #Bioinformatics #DataScience #SequenceAnalysis @emblebi @ebi_jdispatcher
Returning to life in Dublin after spending last week in Barcelona attending the third annual @coordinate_eu summer school on #LifeCourse research & #SequenceAnalysis! 📚 The summer school was a fantastic introduction to theory and methods in this domain of #LongitudinalResearch.
18 Jun 2024
⚠️ Octavia Camps delivered an insightful talk on "Frugal, Interpretable, Dynamics-Inspired Architectures for Sequence Analysis." Thank you 👏 #SequenceAnalysis #Dynamics #CVPR2024 @Beto_OchoaRuiz 👏
A few days left to register for this 3-day workshop on #SequenceAnalysis with @GilbertRitscha1 and Kevin Emery. Online participation possible. 5-7 June, 2024 forscenter.ch/swiss-househol…
There’s still time to join tomorrow’s #webinar about exploring the sequence analysis tools provided by the Job Dispatcher team at EMBL-EBI. Registration is free but essential: ebi.ac.uk/training/events/ac… #bioinformatics #SequenceAnalysis #clustalomega #DataScience @ebi_jdispatcher
Join our #webinar next week to explore the sequence analysis tools provided by the Job Dispatcher team at EMBL-EBI. Registration is free but essential: ebi.ac.uk/training/events/ac… #Bioinformatics #SequenceAnalysis #clustalomega #DataScience
Reminder #SequenceAnalysis: tomorrow, April 25, 4pm CET, workshop on SA research design. Link to join at sequenceanalysis.org/webinar…
