Annotating Publicly-Available Samples and Studies Using Interpretable Modeling of Unstructured Metadata
1. This study introduces txt2onto 2.0, an improved NLP- and ML-based tool that automates the annotation of unstructured biomedical metadata, linking samples and studies to controlled disease and tissue vocabularies without manual intervention.
2. By using a TF-IDF-based feature extraction approach instead of averaging word embeddings, txt2onto 2.0 offers more interpretable results, allowing it to accurately identify key predictive terms within sample and study metadata.
3. The model outperforms its predecessor in both tissue and disease annotation tasks, particularly when training data is limited, which makes it well suited to infrequent or rare biomedical terms.
4. A notable strength of txt2onto 2.0 is its ability to work across different biomedical text sources (e.g., GEO, PRIDE, ClinicalTrials), providing consistent annotations by capturing meaningful semantic relationships even with unseen terms.
5. The interpretability of txt2onto 2.0 is highlighted through word clouds of predictive terms, where it captures domain-specific keywords without requiring explicit mentions of target terms, showcasing its robustness and potential to adapt to new datasets.
6. This tool’s transparent prediction process and scalability support its application across various data repositories, advancing the FAIR data principles (Findable, Accessible, Interoperable, Reusable) in biomedical research.
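To make point 2 concrete, here is a minimal, stdlib-only sketch of why TF-IDF features are easy to interpret: every feature is a word, so a high weight points straight back to a term in the metadata, whereas the dimensions of an averaged embedding do not correspond to any single word. This is not the txt2onto 2.0 code; the toy metadata strings, the function names (`tokenize`, `tfidf`, `top_terms`), and the smoothed IDF formula are all assumptions chosen for illustration.

```python
import math
import re
from collections import Counter


def tokenize(text):
    """Lowercase the text and split it into alphanumeric tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())


def tfidf(docs):
    """Return one {term: weight} dict per document.

    Uses relative term frequency and a smoothed IDF,
    log((1 + N) / (1 + df)) + 1. This is an illustrative choice,
    not necessarily the weighting used by txt2onto 2.0.
    """
    tokenized = [tokenize(d) for d in docs]
    n_docs = len(tokenized)

    # Document frequency: in how many documents each term appears.
    df = Counter()
    for tokens in tokenized:
        df.update(set(tokens))

    idf = {t: math.log((1 + n_docs) / (1 + c)) + 1.0 for t, c in df.items()}

    vectors = []
    for tokens in tokenized:
        counts = Counter(tokens)
        total = len(tokens)
        vectors.append(
            {t: (c / total) * idf[t] for t, c in counts.items()} if total else {}
        )
    return vectors


def top_terms(vector, k=3):
    """Return the k highest-weighted terms, ties broken alphabetically."""
    ranked = sorted(vector.items(), key=lambda kv: (-kv[1], kv[0]))
    return [term for term, _ in ranked[:k]]


# Hypothetical metadata snippets, invented for this example.
toy_metadata = [
    "hepatocyte culture from liver biopsy",
    "liver tissue from adult donor",
    "cortex neurons from brain tissue",
]

vectors = tfidf(toy_metadata)
for text, vec in zip(toy_metadata, vectors):
    print(text, "->", top_terms(vec))
```

In this toy corpus, "from" appears in every snippet and so gets the lowest weight in each, while terms unique to one snippet rank highest. A linear classifier trained on such features would have one coefficient per word, which is what makes word clouds of predictive terms possible.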
@compbiologist
💻Code:
github.com/krishnanlab/txt2o…
📜Paper:
doi.org/10.1101/2024.06.03.5…
#BiomedicalNLP #DataAnnotation #MachineLearning #FAIRdata #ComputationalBiology