Assistant prof @CIS_Penn. Machine learning for systems, databases.

Joined March 2009
26 Dec 2025
Most database teams optimize what they see in workload logs. But those very optimizations change what users choose to run! In our CIDR paper, we argue that industrial workloads exhibit *survivorship bias*: logs reflect a negotiation between users and the platform.
26 Dec 2025
For researchers, database traces are a MAJOR upgrade compared to synthetic benchmarks (or simply making something up, which is shockingly common). We argue we need more of these workload traces to build a complete picture, and, perhaps more importantly, see what is missing.
26 Dec 2025
We conclude with a discussion about how database researchers should use industrial traces, and how we might begin to build systems that optimize for "the query the user never sends." 📄 Paper: rm.cab/survivorshipbias

OLAP workloads are dominated by repetitive queries -- how can we optimize them? A promising direction is to do *offline* query optimization, allowing for a much more thorough plan search. Two new SIGMOD papers! 🧵
LimeQO (by @yi_zixuan), a *workload-level* approach to query optimization, can use neural networks or simple linear methods to find good query hints significantly faster than a random or brute force search. 📄 rm.cab/limeqo
For that one query that must go *really fast*, BayesQO (by Jeff Tao) finds superoptimized plans using Bayesian optimization in a learned plan space. It's costly, but the results can train an LLM to speed things up next time. 📄 rm.cab/bayesqo
15 Feb 2025
Pair(akeet) programming.
At aiDM@SIGMOD, PhD student Zixuan Yi will present LimeQO, the first *workload-level* learned query optimizer: learning to optimize an entire query workload at once! By casting the problem as low-rank matrix completion, we show that linear methods are all you need.
Check out the paper: rm.cab/limeqo
📄 Zixuan's NEDB poster: bu-disc.github.io/nedbday/20…
Zixuan's website: zixy17.github.io/
20 May 2024
Greatly enjoyed talking with Jack! We discussed the "research journey," what it means for DB research to be impactful, and new work from our lab about query optimization!
🚨 The first episode in our #HighImpact series with Ryan Marcus (@RyanMarcus) is available now! 🎧 Listen on Spotify ➡️ open.spotify.com/show/6IQIF9… 🎧 Listen on Apple ➡️ podcasts.apple.com/us/podcas…
12 Apr 2024
How much has everyone's favorite open source query optimizer, PostgreSQL, improved over the last 10 years? Turns out, quite a lot! Blog post: rmarcus.info/blog/2024/04/12…
10 Nov 2023
I'm recruiting PhD students for Fall 2024 @CIS_Penn! Our lab is using ML to build the next generation of data systems. Come build systems that automatically invent new algorithms, adapt to changing environments, and understand user intention! rm.cab/phd
25 Aug 2023
Excited to be involved in 3 VLDB papers, a demo, and 2 workshop papers! Collaborations between UPenn and Meta, Intel, MIT, TUM, and Stony Brook. Check them all out in this thread 👇 or on my website rm.cab/vldb23
27 Aug 2023
SageDB, a prototype instance-optimized analytics DB, is the culmination of several years of research into instance-optimized systems -- make sure to check out @jialin_ding's presentation on Wednesday at 10:30am in Gulf Islands. rm.cab/sagedb
27 Aug 2023
(technically, SageDB was in VLDB volume 15, last year, but it is being presented this year!)