We just released "German Commons", the largest openly-licensed German text dataset for LLM training: 154B tokens with clear usage rights for research and commercial use. huggingface.co/datasets/cora…
For full technical details compliance Datasheet see our preprint @ arxiv.org/abs/2510.13996
As for German-specific models trained on this data... stay tuned 👀
Come join us at the poster session at ICTIR 2025 to discuss:
- Axioms for Retrieval-Augmented Generation webis.de/publications.html#m…
- Learning Effective Representations for Retrieval Using Self-Distillation with Adaptive Relevance Margins webis.de/publications.html#g…
Honored to win the ICTIR Best Paper Honorable Mention Award for "Axioms for Retrieval-Augmented Generation"!
Our new axioms are integrated with ir_axioms: github.com/webis-de/ir_axiom…
Nice to see axiomatic IR gaining momentum.
Happy to share that our paper "The Viability of Crowdsourcing for RAG Evaluation" received the Best Paper Honourable Mention at #SIGIR2025! Very grateful to the community for recognizing our work on improving RAG evaluation.
📄 webis.de/publications.html#g…
Do not forget to participate in the #TREC2025 Tip-of-the-Tongue (ToT) Track :)
The corpus and baselines (with run files) are now available and easily accessible via the ir_datasets API and the HuggingFace Datasets API.
More details are available at: trec-tot.github.io/guideline…
Our paper on self-distillation for training bi-encoders got accepted at #ICTIR2025! By exploiting pretrained encoder capabilities, our approach eliminates expensive teacher models and batch sampling while maintaining the same effectiveness.
Results on BEIR demonstrate that our method matches teacher distillation effectiveness, while using only 13.5% of the data and achieving 3-15x training speedup. This makes effective bi-encoder training more accessible, especially for low-resource settings.
Short: Rank-DistiLLM: Closing the Effectiveness Gap Between Cross-Encoders and LLMs for Passage Re-ranking webis.de/publications.html#s…
Full: Set-Encoder: Permutation-Invariant Inter-Passage Attention for Listwise Passage Re-Ranking with Cross-Encoders webis.de/publications.html#s…
📢 Our paper "The Viability of Crowdsourcing for RAG Evaluation" has been accepted to #SIGIR2025 ! We compared how good humans and LLMs are at writing and judging RAG responses, assembling 1800 responses across 3 styles, and 47K pairwise judgments in 7 quality dimensions. 🧵➡️
🧵 3/4 This fundamentally challenges previous assumptions about RAG evaluation and system design. But we also show how crowdsourcing offers a viable and scalable alternative! Check out the paper for more. 📝 Preprint @ downloads.webis.de/publicati…⚙️Code/Data is openly available.