Bryan Li

Tweets

Pinned Tweet

Bryan Li @bryanlics

27 Jun 2024

Do LLMs' reasoning abilities come from training on code🤔? Many think so, but how does this hold across languages🌐? We study the interplay of code and reasoning in our recent work (#acl2024). 📃arxiv.org/abs/2403.02567 🗃️github.com/amazon-science/xs… 1/6 🧵

Bryan Li @bryanlics

5 Jul 2025

In a world of geopolitical conflicts, how can AI help us navigate? Our #ACL2025-F work studies RAG robustness across 49 languages. TL;DR: 📈 boost robustness w/ multilingual RAG, 🤔 take care w/ low-resource citations 📜arxiv.org/abs/2410.01171 🤗huggingface.co/datasets/bord… 1/4 🧵

Bryan Li @bryanlics

28 Jul 2025

I'm in Vienna this week to present our poster on the robustness of RAG systems to multilingual contexts at #ACL2025NLP! 🗓️ Poster Session | Wednesday, July 30, 16:00 - 17:30 📍 Hall 4/5 @aclmeeting

Bryan Li @bryanlics

5 Jul 2025

We study cross-lingual robustness over 4 LLMs and 2 IR models. We find A) multilingual RAG performs best; B) LLMs' citations vary widely across languages. Our further experiments investigate aspects of cross-lingual RAG from IR to LLM explanations. 3/4 🧵

Bryan Li @bryanlics

5 Jul 2025

This is the final paper of my PhD! Thanks to my many @upennnlp collaborators: @samarhdr, Chris, and the 7 wonderful students who I was fortunate to mentor. Please look out for our poster at ACL 2025 in Vienna. 4/4 🧵

Bryan Li retweeted

Bowen Jiang (Lauren) @laurenbjiang

23 Apr 2025

🚀 How well can LLMs know you and personalize your response? Turns out, not so much! Introducing the PersonaMem Benchmark -- 👩🏻‍💻Evaluate LLMs' ability to understand evolving personas from 180 multi-session user-chatbot conversation histories 🎯Latest models (GPT-4.1, GPT-4.5, o4-mini, Llama-4, Gemini 2.0, Deepseek-R1, Claude-3.7) all struggle with personalization! 🎨7 personalization skills tested in 15 scenarios 🌟Realistic long-context evaluation up to 1M tokens 👇 Check out what we discovered… (1/6)

Fig 1: Overview of PersonaMem benchmark. Each benchmark sample is a user persona with static (e.g., demographic info.) and dynamic attributes (e.g., evolving preferences). Users engage with a chatbot in multi-session interactions across a variety of topics such as food recommendation, travel planning, and therapy consultation. As the user’s preferences evolve over time, the benchmark offers annotated questions assessing whether models can track and incorporate the changes into their responses.

Fig 2: Model performances by number of sessions elapsed since most recent preferences were mentioned in long context. Top: up to 20 sessions/128k tokens; Bottom: up to 60 sessions/1M tokens. Long-context retrieval is important for personalization in practice.

Bryan Li @bryanlics

11 Mar 2025

Externally retrieving knowledge empowers LLMs for domain-adapted MT ⚖️🩺. But how is knowledge best represented, and how viable is generating it from an LLM itself? Our @GoogleAI paper investigates these questions through a careful experimental setup 📜. arxiv.org/abs/2503.05010

Bryan Li @bryanlics

11 Mar 2025

TL;DR - translation pairs > bilingual terminologies, and generation especially boosts translations for small LLMs. Our ablations highlight the need for more challenging domain-adapted MT datasets with modern LLMs. Thanks to collaborators Jiaming, @ebriakou & @ColinCherry!

Bryan Li retweeted

Yue Yang @YueYangAI

24 Feb 2025

We share Code-Guided Synthetic Data Generation: using LLM-generated code to create multimodal datasets for text-rich images, such as charts📊, documents📄, etc., to enhance Vision-Language Models. Website: yueyang1996.github.io/cosyn/ Dataset: huggingface.co/datasets/alle… Paper: arxiv.org/pdf/2502.14846 Code: github.com/allenai/pixmo-doc…

ALT Average performance on 7 text-rich benchmarks: ChartQA, DocVQA, InfoVQA, TableVQA, AI2D, TextVQA, ScreenQA.

Bryan Li retweeted

Shreya Havaldar @shreyahavaldar

29 Jan 2025

🚨 LLMs must grasp implied language to reason about emotions, social cues, etc. Our @GoogleDeepMind paper presents the Implied NLI dataset. Targeting social norms 🌎 and conversational dynamics 💬, we enhance LLM understanding of real-world implication! arxiv.org/abs/2501.07719

Entailed Between the Lines: Incorporating Implication into NLI

Much of human communication depends on implication, conveying meaning beyond literal words to express a wider range of thoughts, intentions, and feelings. For models to better understand and...


Bryan Li @bryanlics

3 Oct 2024

RAG enables LLMs to access external info 📖. But when this info is in multiple languages 🌐, can LLMs reconcile differing viewpoints 🧐? We introduce BordIRlines, a dataset to study the robustness of cross-lingual RAG. 📃arxiv.org/abs/2410.01171 🗃️ huggingface.co/datasets/bord… 1/4 🧵

Bryan Li @bryanlics

3 Oct 2024

Using cross-lingually aligned queries, we analyze responses in a RAG setting. Responses can be "flipped" by varying passages' linguistic composition. We thus find these systems to be far from cross-lingually robust, as certain viewpoints can be amplified over others. 3/4 🧵

Bryan Li @bryanlics

3 Oct 2024

We'll be presenting this at the NLP for Wikipedia workshop @emnlpmeeting. This is ongoing work, and we'd love to hear feedback from the community! A shout-out to my collaborators Fiona and Adwait for their amazing first paper efforts, @samarhdr, and Chris. 4/4 🧵

Bryan Li @bryanlics

27 Jun 2024

Results on BLOOM(Z) show that both techniques in tandem supercharge LLMs' complex reasoning across languages. Also, results on GPT-3 show that our code prompt format alone works well for API-based LLMs. 5/6 🧵

Bryan Li @bryanlics

27 Jun 2024

Check out our paper for more details and results, and we invite you to download and work with our xSTREET dataset! A huge thanks to my @AmazonScience collaborators: Tamer, @dbonadim, @nik0spapp, & Saab ~ 6/6 🧵
