Research Scientist @Visa. CS Ph.D. @UICCS. Foundation Model & Anomaly/Fraud Detection. Opinions are my own.

Joined April 2016
Yingtong Dou retweeted
Some new results I found surprising that I'm tweeting for Chris (who isn't on here). With enough compute, the best data filter for LMs (on DCLM) might be no filter. Why? Large models can tolerate a surprising amount of nominally 'low quality' data, and can sometimes even benefit.
33
154
1,230
222,083
Yingtong Dou retweeted
If you're doing applied AI research (especially system design, benchmarks, evals, efficiency, or ops) you should be submitting to the Conference on AI and Agentic Systems... caisconf.org/pages/cfp/
7
34
264
26,340
🧵(3/6) 3D-Transformer Architecture. We design a new architecture, specifically to model and fuse the complex, multi-modal nature of transaction data. This approach hierarchically encodes transaction features, individual transactions, and their sequences.
1
47
🧵(4/6) Real-world Impact. TGPT achieves a significant improvement over a production model on transaction classification. TGPT excels at generating realistic future transaction trajectories, opening up new avenues for forecasting and personalization.
41
Yingtong Dou retweeted
Introducing Eigent: the first multi-agent workforce on your desktop. Eigent is a team of AI agents collaborating to complete complex tasks in parallel. It is your long-term working partner with fully customizable workers and MCPs. Public beta available to download for macOS and Windows. 100% open-source on GitHub. Comment for 500 extra credits.
142
136
682
220,916
Yingtong Dou retweeted
📣 Our spicy ICML 2025 position paper: "Graph Learning Will Lose Relevance Due To Poor Benchmarks". Graph learning is less trendy in the ML world than it was in 2020-2022. We believe the problem is in poor benchmarks that hold the field back - and suggest ways to fix it! 🧵1/10
5
50
294
84,298
Yingtong Dou retweeted
Visa President of Technology Rajat Taneja rebuilt the company's data platform from scratch, helping position it for the generative AI boom. trib.al/6XClvqz
1
1
7
6,161
Yingtong Dou retweeted
26 Dec 2024
There is a lot of unconscious emphasis of the DeepSeek model being "Chinese" and implicit connection with the Sino-US relationship or the GPU power. In my eyes, the success of DeepSeek has little to do with that. It is simple intelligence and pragmatism at work: given a limit of computation and manpower present, produce the best outcome with smart research. Same with the AlexNet model when Alex Krizhevsky needed to make magic with 2 GPUs, and not a supercluster.

There are a lot of super smart AI people and companies in the world. In terms of the Chinese ethnic group, people I had the privilege to have worked with include (but are not limited to)
- Kaiming He who is the OG of modern computer vision.
- Song Han who founded DeePhi, OmniML and now professor at MIT.
- the DMLC folks who created early frameworks like MxNet and TVM.
- Bing Xu who did MxNet, was coauthor of GAN, founded HippoML and is now at NVidia.
- Orbeus, a startup on early CV applications and now the foundation of AWS ReKognition.
And many more. They ace in the frontier of AI, whether it's research, product, small startups, or big companies.

AI should bring us closer rather than more separate. I was saddened by the discriminative comments given by Professor Rosalind Picard at NeurIPS, but was too busy to put my thoughts together and say something. Looking back at 2024, I think what really stood out is the fundamental seek for AI breakthrough - collect what we have, use our brain, and achieve our best. It's like the Olympics: faster, higher, stronger, together.
21
60
528
67,834
Yingtong Dou retweeted
How far is an LLM from not only understanding but also generating visually? Not very far! Introducing MetaMorph---a multimodal understanding and generation model. In MetaMorph, understanding and generation benefit each other. Very moderate generation data is needed to elicit visual generation from an LLM, when trained jointly with visual understanding.
25
133
718
253,438
Yingtong Dou retweeted
What is Agentic RAG?

In real-world applications, simple naive RAG systems are rarely used nowadays. To provide correct answers to a user query, we are always adding some agency to the RAG system. However, it is important to not get lost in the buzz and terminology and to understand that there is no single blueprint for adding the mentioned agency to your RAG system; you should adapt to your use case. My advice is to not get stuck on terminology and think about engineering flows.

Let's explore some of the moving pieces in Agentic RAG:

1. Analysis of the user query: we pass the original user query to an LLM-based Agent for analysis. This is where:
➡️ The original query can be rewritten, sometimes multiple times, to create either a single or multiple queries to be passed down the pipeline.
➡️ The agent decides if additional data sources are required to answer the query.
2. If additional data is required, the Retrieval step is triggered. In the Agentic RAG case, we could have a single or multiple agents responsible for figuring out what data sources should be tapped into. A few examples:
➡️ Real-time user data. This is a pretty cool concept as we might have some real-time information like current location available for the user.
➡️ Internal documents that a user might be interested in.
➡️ Data available on the web.
➡️ …
3. If there is no need for additional data, we try to compose the answer (or multiple answers) straight via an LLM.
4. The answer (or answers) gets analyzed, summarized and evaluated for correctness and relevance:
➡️ If the Agent decides that the answer is good enough, it gets returned to the user.
➡️ If the Agent decides that the answer needs improvement, we try to rewrite the user query and repeat the generation loop.

The real power of Agentic RAG lies in its ability to perform additional routing pre and post generation, handle multiple distinct data sources for retrieval if it is needed and recover from failures in generating correct answers. What are your thoughts on Agentic RAG? Let me know in the comments! 👇 #RAG #LLM #AI
15
223
1,064
196,756
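The four-step Agentic RAG loop described in the post above can be sketched in a few lines of Python. This is only an illustrative sketch, not the post author's implementation: the class name `AgenticRAG`, the `route` and `is_good_enough` callables, the `max_rounds` cap, and the `toy_llm` stand-in are all hypothetical names introduced here, and a real system would call an actual LLM and real retrievers instead of the stubs.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List

REWRITE_PREFIX = "Rewrite this query: "  # hypothetical prompt convention


@dataclass
class AgenticRAG:
    llm: Callable[[str], str]                        # any text-in, text-out model
    sources: Dict[str, Callable[[str], List[str]]]   # retrievers keyed by name
    route: Callable[[str], List[str]]                # step 1: which sources to use
    is_good_enough: Callable[[str, str], bool]       # step 4: answer evaluation
    max_rounds: int = 3                              # assumed cap on retries

    def answer(self, user_query: str) -> str:
        query, draft = user_query, ""
        for _ in range(self.max_rounds):
            # Step 1: analyse the query and decide whether extra data is needed.
            source_names = self.route(query)
            # Step 2: retrieve from every selected source (may be none).
            context: List[str] = []
            for name in source_names:
                context.extend(self.sources[name](query))
            # Step 3: compose an answer from the context plus the query.
            draft = self.llm("\n".join(context + [query]))
            # Step 4: return if acceptable, otherwise rewrite and loop again.
            if self.is_good_enough(user_query, draft):
                return draft
            query = self.llm(REWRITE_PREFIX + query)
        return draft  # best effort after max_rounds attempts


def toy_llm(prompt: str) -> str:
    """Deterministic stand-in for an LLM, used only to exercise the loop."""
    if prompt.startswith(REWRITE_PREFIX):
        return prompt[len(REWRITE_PREFIX):] + " (rewritten)"
    return "answer based on: " + prompt


if __name__ == "__main__":
    rag = AgenticRAG(
        llm=toy_llm,
        sources={"docs": lambda q: ["doc snippet"]},
        route=lambda q: ["docs"] if "policy" in q else [],
        is_good_enough=lambda q, a: "(rewritten)" in a,
    )
    print(rag.answer("refund policy?"))
```

Keeping routing, retrieval, and evaluation as plain callables reflects the post's point that there is no single blueprint: each piece can be swapped out per use case without changing the loop.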
Yingtong Dou retweeted
I am hiring a research intern, working on LLM (Llama 3) safety. The internship is expected to start in Summer/Spring 2025, based in New York City. Please drop me an email at jianfengchi@meta.com (Subject starts with "[2025 Intern]") Learn more here: metacareers.com/jobs/1719537…

3
30
262
32,181
Yingtong Dou retweeted
Respectfully disagree. It's the structure of language and words that make LLMs effective. Pure speech, time series, or video without linguistic co-supervision don't yield the same results. Language provides the minimal conceptual units that enable these models to work.
It's a bit sad and confusing that LLMs ("Large Language Models") have little to do with language; It's just historical. They are highly general purpose technology for statistical modeling of token streams. A better name would be Autoregressive Transformers or something. They don't care if the tokens happen to represent little text chunks. It could just as well be little image patches, audio chunks, action choices, molecules, or whatever. If you can reduce your problem to that of modeling token streams (for any arbitrary vocabulary of some set of discrete tokens), you can "throw an LLM at it". Actually, as the LLM stack becomes more and more mature, we may see a convergence of a large number of problems into this modeling paradigm. That is, the problem is fixed at that of "next token prediction" with an LLM, it's just the usage/meaning of the tokens that changes per domain. If that is the case, it's also possible that deep learning frameworks (e.g. PyTorch and friends) are way too general for what most problems want to look like over time. What's up with thousands of ops and layers that you can reconfigure arbitrarily if 80% of problems just want to use an LLM? I don't think this is true but I think it's half true.
9
15
153
31,394
Yingtong Dou retweeted
Excited to share Just read twice: going beyond causal language modeling to close quality gaps between efficient recurrent models and attention-based models!! There's so much recent progress on recurrent architectures, which are dramatically more memory efficient and asymptotically faster than attention 💨 But there's no free lunch 🥪 these models can't fit all the information from long contexts into the limited memory, degrading in-context learning quality. Is all lost?
7
57
299
93,074
The review form of #NeurIPS 2024 is cumbersome!
2
541
Yingtong Dou retweeted
The paper we have been waiting for essentially shows that #timeseries #llms do not work in forecasting. Back in 2022, the paper "Are Transformers Effective for Time Series Forecasting?" challenged the emerging narrative that transformers are useful for forecasting. By removing transformer elements the authors showed the performance went up ⬆️ And now people did the same with time series LLMs. The paper demonstrated:
- removing the LLM component or replacing it with a basic attention layer does not degrade the forecasting results; in most cases the results even improved!
- in fact, even removing the language model entirely yields comparable or better performance!
- these simpler methods, after removal of the LLM component, reduce training and inference time by up to three orders of magnitude while maintaining comparable performance!
- the sequence modeling capabilities of LLMs do not transfer to time series. By shuffling input time series the authors find no appreciable change in performance.
What this says is that LLMs can't deal with critical features of time series: the time order is key, and if an LLM's performance doesn't change when shuffling data, it basically means it doesn't model time series. These findings are as damning to time series LLMs as "Are Transformers Effective for Time Series Forecasting?" was for transformers. #timeseries #forecasting
11
181
856
120,077
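The shuffling check mentioned in the post above can be illustrated with a toy experiment. This is a minimal sketch and not the paper's setup: the two baseline forecasters and the helper `shuffle_sensitivity` are hypothetical names introduced here, and the linear series is made-up data. The idea is that a forecaster that ignores order (a window mean) scores the same on shuffled input, while one that depends on order (last value) does not improve and typically gets worse.

```python
import random
from typing import Callable, List, Sequence, Tuple


def mean_forecast(history: Sequence[float]) -> float:
    """Order-invariant baseline: predict the mean of the window."""
    return sum(history) / len(history)


def last_value_forecast(history: Sequence[float]) -> float:
    """Order-dependent baseline: predict the most recent value."""
    return history[-1]


def shuffle_sensitivity(
    forecaster: Callable[[Sequence[float]], float],
    series: Sequence[float],
    window: int,
    seed: int = 0,
) -> Tuple[float, float]:
    """Return (MAE on ordered windows, MAE on shuffled windows)."""
    rng = random.Random(seed)
    err_ordered = 0.0
    err_shuffled = 0.0
    steps = 0
    for t in range(window, len(series)):
        history: List[float] = list(series[t - window:t])
        target = series[t]
        shuffled = history[:]
        rng.shuffle(shuffled)  # destroy the time order inside the window
        err_ordered += abs(forecaster(history) - target)
        err_shuffled += abs(forecaster(shuffled) - target)
        steps += 1
    return err_ordered / steps, err_shuffled / steps


if __name__ == "__main__":
    trend = list(range(50))  # toy upward-trending series
    for name, model in [("mean", mean_forecast), ("last value", last_value_forecast)]:
        ordered, shuffled = shuffle_sensitivity(model, trend, window=8)
        print(f"{name}: ordered MAE={ordered:.3f}, shuffled MAE={shuffled:.3f}")
```

If a model's error barely moves between the two columns, that is a hint it is not exploiting temporal order, which is the argument the post attributes to the paper.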
Yingtong Dou retweeted
This is such a good paper. I love NLP error analysis papers -- @chrmanning and co do them so well (another great example will always be "Part-of-Speech Tagging from 97% to 100%").
4
28
158
29,829