More than 50% of the reported reasoning abilities of LLMs might not be true reasoning.
How do we evaluate models trained on the entire internet? I.e., what novel questions can we ask of something that has seen all written knowledge? Below: new eval, results, code, and paper.
Functional benchmarks are a new way to do reasoning evals. Take a popular benchmark, e.g., MATH, and manually rewrite its reasoning into code, MATH(). Run the code to get a snapshot that asks for the same reasoning but not the same question. A reasoning gap exists if a model’s performance differs between the static benchmark and its snapshots. Big question: Are current SOTA models closer to gap 0 (proper reasoning) or gap 100 (lots of memorization)?
What we find: Gaps in the range of 58% to 80% in a bunch of SOTA models. Motivates us to build Gap 0 models.
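A minimal sketch of the idea, not the code from the repo. It assumes the gap is the relative drop in accuracy from the static benchmark to its snapshots, which fits the 0 to 100 scale above but is an assumption here. The names `math_snapshot`, `accuracy`, `reasoning_gap`, and `solver`, and the toy rectangle question, are hypothetical and only for illustration.

```python
import random
import re


def math_snapshot(seed):
    """Hypothetical functionalized question: same reasoning, fresh numbers per seed."""
    rng = random.Random(seed)
    a, b = rng.randint(2, 9), rng.randint(2, 9)
    question = f"A rectangle has sides {a} and {b}. What is its area?"
    return question, a * b


def accuracy(model, items):
    """Fraction of (question, answer) pairs the model answers correctly."""
    correct = sum(1 for question, answer in items if model(question) == answer)
    return correct / len(items)


def reasoning_gap(static_acc, functional_acc):
    """Assumed definition: relative accuracy drop from static to functional, in percent."""
    if static_acc <= 0:
        raise ValueError("static accuracy must be positive")
    return 100.0 * (static_acc - functional_acc) / static_acc


def solver(question):
    """Toy 'model' that actually does the reasoning: multiply the two side lengths."""
    a, b = (int(n) for n in re.findall(r"\d+", question))
    return a * b


static_items = [math_snapshot(0)]                        # the fixed, possibly memorized question
snapshot_items = [math_snapshot(s) for s in range(1, 6)]  # fresh instances of the same reasoning

gap = reasoning_gap(accuracy(solver, static_items), accuracy(solver, snapshot_items))
print(f"solver gap: {gap:.0f}")
```

A model that reasons scores the same on the static question and on every snapshot, so its gap is 0. A model that only memorized the static question would keep its static accuracy but lose accuracy on snapshots, pushing the gap toward 100.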
We’re releasing the paper, code, and 3 snapshots of functional MATH() today.
arxiv draft: arxiv.org/abs/2402.19450
github repo: github.com/ConsequentAI/fnev…
1/🧵
What started as an experiment in a Harvard dorm and became one of Sequoia India’s 1st partnerships in SEA has grown into a 🚀company powered by a maniacal focus on #customer centricity. Congrats Chih-Han, Joe, Winnie & everyone @GoAppier on today’s #IPO!
sequoiacap.com/india/build/a…
We are deeply grateful to Sequoia’s LPs, who have committed $1.35B to two new Sequoia India venture and growth funds. The region’s #startup ecosystem is at a fork in the road. We believe there is an opportunity to make different choices for the future. linkedin.com/pulse/fork-road…
techcrunch.com/2019/11/25/ap…
Appier continues to be one of the leading AI companies in Asia. We @Sequoia_India are thrilled to have been partners for 5 years and look fwd to their continued success. Onwards!
Pick just 1 or 2 metrics at each phase of your #startup journey & track them relentlessly, says @abheek. The ability to articulate a vision and break it down into measurable goals helps #founders move the ball forward every day. surgeahead.com/driving-every…
So happy to see my former employer Facebook's Return to Work program, helping those who have left the workforce for 2 yrs come back full-time. We must have more orgs do this, especially in Asia. fb.careers/returntowork
After years of reflection, my four (software) startup engineering killers are:
1. Premature scaling
2. Too much shiny/new tech
3. Bad hiring (great engineer, but not great startup eng)
4. Eng/business mismatch
What are yours?
nemil.com/musings/four-start…
A little over 72 hours before we close applications for Surge. It’s right down to the wire! Head over to surgeahead.com to apply. Let’s get this done!
#GetReadyToSurge
1/ We @sequoia hear stories like this about @ericsyuan every day
It’s no coincidence that this humble & amazing founder is at the helm of such a wonderful company, @zoom_us
In fact, his humility has been a key driver of their success
I stopped @ericsyuan on the street last night after watching his talk during @saastr
It was night and he was on his way somewhere but he took the time to talk to me. This is the founder and CEO of a decacorn @zoom_us
What a humble and amazing guy!
#SaaStrAnnual19
@silvanus_lee, @linuslee & Liu Feng-Yuan have a compelling vision to build scalable & accountable AI products for enterprises. We’re excited to partner with this highly technical, seasoned team of #datascience experts at the early stage of their journey. techinasia.com/prominent-sin…
We @sequoia are very proud to support the work @wellcometrust is doing on antimicrobial resistance, epidemic preparedness, safer births, Ebola response, and crazy ambitious new ideas to 10x healthcare around the world. It is an honor & a privilege to work for such great causes.
Another good year at @wellcometrust. Thanks to all our investment partners who have helped us grind out a 13.4% return despite all the headwinds, and 11.7% annualised over the decade since the financial crisis. Amazing!
50 years after he debuted "The Art of Computer Programming," Donald Knuth reflects on his opus-in-progress. “It started out that computer scientists were worried nobody was listening to us. Now I’m worried that too many people are listening.” nyti.ms/2EtFUtJ