Replying to @vintage_meow_
People wanting to gatekeep simply the greatest artist of all time. "They're discovering Michael Jackson" girl, WAKE UUUUUUUP
Most autoresearch systems emulate an individual researcher. We created #SimpleTES to emulate a research community. The result: new SOTA discoveries across 21 open science problems, including 🚀 More efficient astrodynamics ⚡ 2× faster LASSO 🔬 Better quantum circuit compilation
Looking forward to speaking at the AI Agents for Discovery in the Wild workshop today! Be sure to check out @haotian_yeee's spotlight talk on SimpleTES, which discovered many new SOTA solutions! ai-discovery-in-the-wild.git…

Very excited about our workshop on AI Agents for Discovery in the Wild 🐾🦒🐅, happening *tomorrow*, Tuesday, May 26th 9am-5pm, as part of CAIS '26 in San Jose. We were blown away by all of the excellent submissions we got… (1/n)
Scaling evaluations, not just compute, is critical for AI-driven science. SimpleTES introduces a new framework to scale discovery loops, finding new SOTA solutions across 21 open science problems. Including: • >2× faster LASSO algorithm • more efficient quantum routing, and more! Great work led by @haotian_yeee and wonderful collaborators!
🚀 Today, we're excited to introduce SimpleTES for scaling the scientific discovery loop. 🧵 I always ask myself: what are we actually scaling in scientific discovery? Most LLM discovery methods focus on test-time scaling of generation: more tokens, more agents, more turns. But science advances through evaluation-driven loops: propose → evaluate → refine → repeat. SimpleTES captures this idea, discovering SOTA solutions across 21 scientific problems! Key discoveries:
🏎️ A LASSO solver 2.17x faster than glmnet, the gold-standard LASSO solver, engineered for decades.
⚛️ 24.5% lower quantum routing overhead on IBM Q20, better than the previous standard library, LightSABRE.
📐 0.380868 on Erdős Minimum Overlap, outperforming previous solutions from mixed-frontier ensembles or humans.
🧬 0.74 on Tabula Muris (scRNA-seq denoising), a new SOTA that generalizes to unseen tissue types without retraining.
#LLM #AI4Science #ScalingLaws #SimpleTES #MachineLearning
Congrats @haotian_yeee and team on SimpleTES and on discovering SOTA solutions across 21 problems in 6 scientific domains! 🚀🔬 What I find especially exciting is the shift from scaling generation to scaling the full scientific discovery loop: propose → evaluate → refine → repeat. 🔁 By making evaluation signals the core driver of test-time search, SimpleTES points to a compelling path toward more systematic, evaluation-driven AI for science. 🚀
8/8 This is an amazing collaboration among Wizard Intelligence Learning Lab (WILL), Stanford, Peking University, Tsinghua University, and HKUST-GZ. We will soon launch a platform for anyone who wishes to use SimpleTES! 🌐 wizardquant.com/will/simplet… (110-page paper, code, waitlist)
7/N Summary: 🚀 Scaling model size helps. 🧠 Scaling reasoning tokens helps. 🔄 But scaling the evaluation-driven discovery loop is an unlocked dimension, and SimpleTES shows how far it can take you.
6/N SimpleTES doesn't just solve problems; it creates Expert Trajectories. 📈 We post-trained on these trajectories in a crazy way: ignoring all intermediate rewards and using only the final score of each trajectory. The resulting model generalized to unseen problems and discovered solutions the base model never could. Example: Sum-Difference Problem → new SOTA: 1.144887 (previous best: 1.143975). It learned how to search, not just what to output.
1/N SimpleTES treats "Evaluation" as a first-class citizen. Surprisingly, it scales the evaluation loop along 3 simple yet effective dimensions:
🔹 C = parallel trajectories (global exploration)
🔹 L = refinement depth (feedback-driven improvement)
🔹 K = local sample size (greedy selection per step)
That's it. No complex heuristics. Just structured scaling.
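The three dimensions above can be illustrated with a toy search loop. This is a minimal sketch under stated assumptions, not the SimpleTES implementation: `evaluate`, `refine`, and `search` are hypothetical names, the evaluator is a made-up one-dimensional objective, the "parallel" trajectories run one after another for simplicity, and keeping the incumbent during greedy selection is a choice made here rather than something the thread specifies.

```python
import random


def evaluate(x: float) -> float:
    """Toy evaluator (hypothetical): higher is better, maximum 0.0 at x = 3."""
    return -(x - 3.0) ** 2


def refine(start: float, L: int, K: int, rng: random.Random) -> float:
    """One trajectory: L refinement steps with K local samples per step."""
    current = start
    for _ in range(L):
        # Propose K perturbed candidates around the current solution.
        candidates = [current + rng.gauss(0.0, 0.5) for _ in range(K)]
        # Greedy selection: keep the best-scoring option, incumbent included.
        current = max(candidates + [current], key=evaluate)
    return current


def search(C: int, L: int, K: int, seed: int = 0) -> tuple[float, float]:
    """Run C independent trajectories and return the best (solution, score)."""
    rng = random.Random(seed)
    # Each trajectory starts from a different random point (global exploration).
    finals = [refine(rng.uniform(-10.0, 10.0), L, K, rng) for _ in range(C)]
    best = max(finals, key=evaluate)
    return best, evaluate(best)


if __name__ == "__main__":
    print(search(C=4, L=20, K=8))
```

Because the incumbent is always among the options, the score within one trajectory never decreases; raising C spreads the starting points wider, while raising L or K spends more evaluations on a single line of refinement.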
Scientific discovery is no longer just about generating a single "lucky" idea. It is about how an AI manages its budget to test and refine those ideas. New research by Bo Liu, Yijia Chen, and Linfeng Ye introduces SimpleTES, a system that reframes scientific discovery as a task of "scaling up" the evaluation process. Instead of asking the AI for one answer, the system lets the model propose many candidates, score them with problem-specific verifiers, and then refine the best ones until they work. By balancing how much time the AI spends exploring new ideas versus polishing old ones, the team achieved breakthroughs in several complex fields. Key results from the paper:
• Faster Solvers: Created more efficient LASSO solvers for statistical problems.
• Quantum Efficiency: Reduced the overhead needed for quantum routing.
• New Mathematics: Discovered new mathematical constructions that humans hadn't documented.
• Smart Budgeting: Showed that allocating "evaluator budget" across parallel exploration and local selection is a key to success.
This research shows that in AI, the best discoveries come from agents that can effectively judge their own work. Spotted via @ritualdigest | @ritualnet | @ritualfnd | @joshsimenhoff |
Why don't AI models just keep getting smarter by practicing against themselves? A new paper by Luke Bailey, Tatsunori Hashimoto, Tengyu Ma, and their team at Stanford University reveals the problem: when models self-play, they often cheat. Usually, one part of the AI creates a problem and another tries to solve it. Over time, the problem-creator learns to make ugly or nonsensical puzzles that are technically hard but don't actually teach the solver anything useful. To fix this, they introduced Self-Guided Self-Play (SGS), which adds a third role to the team: The Guide. How this three-part team works:
• The Conjecturer: Creates new practice problems to help the solver learn.
• The Solver: Practices on these new problems to get smarter.
• The Guide: Acts like a teacher, scoring the new problems to make sure they are clear, relevant, and actually helpful.
The results are stunning. By using this Guide to keep the practice sessions high-quality, a tiny 7B model eventually became better at solving complex math proofs than a massive 671B model (DeepSeek-Prover-V2). This suggests that for AI, having a good teacher can matter more than just having a bigger brain. Spotted via @ritualdigest
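The Guide's role can be pictured as a filter between the Conjecturer and the Solver. The sketch below is a toy illustration only: the `Conjecture` fields, the averaging rule in `guide_score`, and the `min_score` threshold are all hypothetical, and the post does not say how SGS actually computes or applies the Guide's score.

```python
from dataclasses import dataclass


@dataclass
class Conjecture:
    statement: str
    relevance: float   # hypothetical Guide sub-score in [0, 1]
    clarity: float     # hypothetical Guide sub-score in [0, 1]
    usefulness: float  # hypothetical Guide sub-score in [0, 1]


def guide_score(c: Conjecture) -> float:
    """Hypothetical Guide: average the three sub-scores."""
    return (c.relevance + c.clarity + c.usefulness) / 3.0


def keep_for_solver(conjectures: list[Conjecture], min_score: float = 0.6) -> list[Conjecture]:
    """Pass only conjectures the Guide rates highly enough to the Solver."""
    return [c for c in conjectures if guide_score(c) >= min_score]


pool = [
    Conjecture("clean lemma close to the target", 0.9, 0.8, 0.9),
    Conjecture("contrived, hard-for-its-own-sake puzzle", 0.1, 0.2, 0.1),
]
kept = keep_for_solver(pool)
print([c.statement for c in kept])
```

In this toy pool the contrived puzzle is dropped even if it would stump the Solver, which is the failure mode the Guide is meant to prevent.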
[Weekly Ritual Digest 3 | Evaluation-driven Scaling for Scientific Discovery] @ritualnet @ritualfnd @ritualnet_korea
Hello. Our third topic this week introduces a study that flips how AI tackles complex scientific challenges. Traditionally, language models solve problems by extending their 'Chain of Thought' to generate answers. This paper instead reframes the AI's problem-solving process from 'generation-centric' to 'verification-centric'.
1. From Endless Brainstorming to Rigorous Peer Review
Usually, when we want AI to solve a hard problem, we prompt it to generate as many potential answers as possible, like an endless brainstorming session. But without knowing which idea is correct, true scientific discovery is impossible. This paper shifts the system from generating outputs to rigorously evaluating them. The AI first proposes candidate solutions, then an internal 'Verifier' strictly scores them, and the AI refines its hypothesis based on that feedback. Essentially, the AI continuously generates ideas and subjects them to its own rigorous peer review.
2. Smart Allocation of Compute with SimpleTES
Of course, verifying every single hypothesis to the end requires massive time and computational resources. To solve this, the researchers introduced a framework called 'SimpleTES'. This system acts as a strategic budget manager for the AI's computing power. It allocates resources to explore many ideas broadly at first (parallel exploration). Once a promising candidate is found, it focuses computing power to dig deeper (sequential refinement), and finally spends energy polishing the minor details (local candidate selection). It doesn't blindly consume compute; it spends energy when and where it is most efficient.
3. Breakthroughs Proven in Real Scientific Domains
This isn't just theory. Applied to 21 hard problems across 6 scientific domains, the method produced concrete results. It developed faster calculation methods for complex mathematical models (LASSO solvers) used in data science, and significantly reduced inefficiencies (overhead) in quantum computer routing. It even discovered entirely new mathematical constructions. The AI autonomously accomplished optimization tasks that typically require months of human expert labor.
4. Wrapping up the third paper review
The message of this study is clear: for AI to achieve genuine scientific discovery, the raw ability to generate plausible text from massive data is no longer enough. The key to future AI scaling lies in how well we design the 'verification loop': the system's ability to constantly doubt, evaluate, and refine its own hypotheses. Next time, we will cover the final topic of this week's digest, which explores a fascinating paradox. We will look into the cognitive blind spots of AI agents, specifically the phenomenon where they successfully explore and find the perfect solution, but ultimately ignore it when taking action.
@joshsimenhoff @mongdiny7 @Jez_Cryptoz @dunken9718
[Weekly Ritual Digest 2 | Scaling Self-Play with Self-Guidance] @ritualnet @ritualfnd @ritualnet_korea
Following our previous post, the second topic explores why LLMs cannot scale infinitely through self-play, where they generate and solve their own problems.
1. The dilemma of reward-hacking
In theory, if a Conjecturer creates problems and a Solver tackles them, the model should improve forever. In practice, however, the Conjecturer tends to "reward-hack," generating artificially ugly and overly complex problems simply to stump the Solver.
2. Introducing a third role: The Guide
To break this cycle, the researchers introduced a third role: a Guide. The Guide scores the generated Lean4 problems based on their relevance, clarity, and actual usefulness toward solving targeted, unsolved problems.
3. The triumph of a smaller model
Driven by the directional feedback of the Guide, a 7B parameter prover model eventually exceeded the pass@4 performance of the massive 671B DeepSeek-Prover-V2 model after multiple rounds.
4. Wrapping up the second paper review
This is a crucial finding, showing that without an evaluation metric setting the right direction, even highly capable models can get stuck in meaningless computational loops. Next time, we will look into a study that reframes the process of scientific discovery in LLMs through evaluation-driven scaling.
@joshsimenhoff @mongdiny7 @Jez_Cryptoz @dunken9718
Excited to release SimpleTES: a better open-source AlphaEvolve! With gpt-oss, SimpleTES discovers SOTA solutions across 21 tasks: quantum compilation, LASSO speedup, scaling law discovery, kernel optimization... Project: wizardquant.com/will/simplet… Code: github.com/wq-will/SimpleTES
SimpleTES allocates evaluator budget across parallel exploration, sequential refinement, and local candidate selection. Across 21 problems in 6 domains, it reports strong results: faster LASSO solvers, lower quantum routing overhead, and new mathematical constructions.
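One way to read "allocating evaluator budget" is as a choice of how to split a fixed number of evaluator calls among the three dimensions. The sketch below assumes, purely for illustration, that a run with C trajectories, L refinement steps, and K candidates per step costs C × L × K evaluations; the post does not state this formula, and `allocations` is a hypothetical helper.

```python
from itertools import product


def allocations(budget: int, options: list[int]) -> list[tuple[int, int, int]]:
    """All (C, L, K) triples drawn from `options` with C * L * K == budget."""
    return [
        (c, l, k)
        for c, l, k in product(options, repeat=3)
        if c * l * k == budget  # assumed cost model: one evaluation per candidate
    ]


if __name__ == "__main__":
    # Enumerate ways to spend 64 evaluations using power-of-two settings.
    for c, l, k in allocations(64, [1, 2, 4, 8, 16]):
        print(f"C={c:2d}  L={l:2d}  K={k:2d}")
```

Under that assumption, (16, 4, 1) and (1, 4, 16) cost the same but behave very differently: the first spreads the budget over many shallow trajectories, the second concentrates it on wide local selection within a single trajectory.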
Simple Test-time Evaluation-driven Scaling (SIMPLETES)
- Sped up LASSO by over 2x
- Designed quantum circuit routing policies that reduce routing overhead by 24.5%
- Discovered new Erdős minimum overlap constructions that surpass the best-known results
Replying to @BenTheClanker
what can i say she simpletes me
People are really voting for Alexa, Tebi, and Yuli. God, the blandest ones. #LaCasaDeLosFamososCol3
In short: what deputy Rufián does is not politics. It is pure propaganda and demagoguery. Incidentally, the picture he wants to paint (Puigdemont alongside VOX) is false and deliberate. I am an independent, and my vote has gone to the left. But clearly you have to be a manipulator of simple minds to suggest that Junts shares values or a vision of the state with VOX. It seems to me that what @miriamnoguerasM rejects is legislating badly, even if the content is partially positive.
Long live simple-minded thinking.
Replying to @radioamericahn
Yes, she's an old freeloader who sells herself to the highest bidder, simply a parasite just like you, dickface