Filter
Exclude
Time range
-
Near
Forevergreen_2015 🇨🇦 retweeted
LLB, BL 🇳🇬 LLM 🇳🇬 MSc, Distinction (Commonwealth Scholar) 🇬🇧 MA, Merit (ESRC Scholar) 🇬🇧 PhD (ESRC Scholar) 🇬🇧 Higher Education Awards (AFHEA, Horizon) 🇬🇧 Fully Funded Research Scholarships in 🇨🇦 🇳🇴 🇸🇪 University Graduate School Scholar 🇬🇧 6 degrees, 7 scholarships
Hi women, can you post pictures or talk about your academic achievements? I need some motivation this month. If you see this tweet, share it so women can see it.
9
49
201
2,412
雪ん子兎(ゆきんこ) retweeted
銀行子会社で鯖管してるわい。上司から今はやってるからAI試せと言われ情報漏洩は絶対まかりなならんと言明されてしかたなしに100万でローカルLLMでPOC機を組む ↓ できたマシンがあまりにも馬鹿すぎて泣く。 これが末端技術者のリアルな人生やで。Claude Fable 5 ガーとかワイも言ってイキってみたい
19
259
2,243
481,954
Dustin Warner retweeted
Probably this book should be forbidden llsoftsec.github.io/llsoftse…. Read it before it disappears just like your favorite LLM ;-) (via burakemir.ch/about/, not on X)

1
10
106
12,444
˚₊·—̳͟͞͞♡⟢ retweeted
B.sc Political Science 2:1 LLB, Law with distinction BL (Nigerian Law School) 2:1 LLM, International Energy Law and Policy with distinction
Hi women, can you post pictures or talk about your academic achievements? I need some motivation this month. If you see this tweet, share it so women can see it.
3
21
79
771
CrankGPT runs speech recognition, a small LLM, and text-to-speech locally on a Raspberry Pi 5 powered by a hand crank. No battery, no cloud. Part gag, part real edge-compute benchmark on power, latency, and model-size tradeoffs. tinyurl.com/mr9euc43
2
HIRO_IRI retweeted
現代のLLMの学習において、自分で問題を作り、自分で解けるようになっていく自己改善ループは重要である。SGS(Self-Guided Self-Play)は、Lean4による形式定理証明において、ガイド付きの自己学習環境を作ることの重要性を示した研究である。 次の問題設定を考える。モデルはLean4の定理を入力として受け取り、その証明を生成する。しかし、元のターゲット問題が難しすぎる場合、いきなり証明に成功することは難しい。そこで、元問題の証明を助けるような中間問題を生成する。 良い中間問題を作るモデルと、その中間問題を解けるようになるモデルが協調して改善していくことで、最終的に元問題の解決率を高めることを目指す。 これを実現するために、SGSではモデルに3つの役割を持たせる。 1つ目はSolverである。SolverはLean4の定理を入力として受け取り、その証明を生成する。生成された証明はLean4コンパイラによって検証され、正しければ報酬1、間違っていれば報酬0が与えられる。 2つ目はConjecturerである。Conjecturerは、Solverがまだ解けていない元のターゲット問題を入力として受け取り、それに関連する、より簡単な合成問題を作る。これは、難問を解くための補助問題のようなものであり、今のSolverが学習可能な範囲にある問題を作る役割を持つ。 3つ目はGuideである。Guideは、Conjecturerが作った合成問題が元のターゲット問題に本当に関連しているか、自然で明確な問題になっているかをLLM Judgeとして評価する。 Guideを導入しない場合、Conjecturerは単にSolverにとってほどよく難しい問題を作ろうとする。その結果、Conjecturerは報酬をハックし始める。例えば、結論が不自然に長い定理や、OR条件を大量に含む定理を作るようになる。これらは今のSolverにとって解ける場合があるものの、元のターゲット問題の解決にはあまり役立たない。このような現象をConjecturer collapseと呼ぶ。 これを防ぐため、Conjecturerの報酬は R_solve と R_guide の積として定義される。 R_solveは、Conjecturerが作った合成問題が、今のSolverにとってちょうどよい難しさかを評価する。簡単すぎず、難しすぎず、Solverが少しは解けるがまだ学習余地のある問題が高く評価される。 R_guideは、Guideによる品質評価である。合成問題が元のターゲット問題に関連しているか、結論が複雑すぎないか、冗長な仮定がないか、数学的に自然で明確かをLLM Judgeによって評価する。 Guideなしの場合でも一定の性能向上は見られるが、Conjecturerが不自然な問題を作る方向に崩壊しやすく、最終的な性能はGuideありの場合に劣る。GuideありのSGSでは、合成問題の品質を保ちながら学習を続けられるため、より高い解決率に到達する。 さらに興味深いことに、Conjecturerの学習にはSolver側の多様性も重要である。CISPOという最近のRL目的関数をSolverに使うと、Solverの出力エントロピーが急速に潰れる。するとSolverの挙動が決定的になり、合成問題に対するsolve rateが0か1に偏りやすくなる。その結果、Conjecturerは「中程度の難しさ」の問題を作るための報酬信号を得にくくなり、学習が停滞する。 この論文では、SolverにはREINFORCE 1/2という目的関数を使うことを提案している。これは、solve rateが0.5以下の問題だけを使って、正解証明の確率を上げる方法である。単純な方法だが、Solverのエントロピーを保ちやすく、Conjecturerにとっても学習可能な環境を維持しやすい。 コメント ==== 自己改善ループにおいて、次々と問題を作るモデルと、出てきた問題を解いていくモデルが協調して強くなっていく、うまくしないと崩壊する様子は、GANなどを彷彿とさせる。 現代はそこからさらに進化し、最初にターゲットとなる定理を与え、その補助問題を作っていく設定にすることで、意味のない問題に逸れていくことを防いでいる。そのうえで、補助問題の質や方向性をGuideが評価する。 今回の問題設定では、Guideは固定されたLLM Judgeとして使われていた。しかし将来的には、ここも学習ベースで評価することが考えられる。ただし、その場合はさらに学習の安定性が下がる可能性がある。複数の動的に変化する目標下で学習していくことは今後重要な研究テーマとなるだろう。 自己改善ループの学習は、最初、自己対戦型の問題で成功した後、そうでない問題に対しても適用可能になりつつある。そして、研究課題は、この自己改善ループを崩壊させず、うまく協調させ続けられるかに移ってきている。この改善ループを長時間、長ステップにわたって安定して実行できるようになるほど、モデルの能力を高められる。 現在のAIは継続学習が難しいという問題は残っている。しかし、一度この方法で特定タスクに強いモデルを作ることができれば、それらの専門家モデル群をさらに一つのモデルへ蒸留することで、より強い汎用モデルを作れる。これらの方法での限界はまだ見えていない
1
3
9
751
LicensioAI retweeted
Train your own LLM from scratch. This repo builds a GPT-style transformer from the ground up, without using any high-level libraries. You see exactly how attention, multi-head attention, the feed-forward block, embeddings, residuals, and layer norm fit together. And it doesn't stop at the model. It walks the whole path from raw data to generated text. ↳ Data download, preprocessing, training, and generation ↳ Training data from The Pile (825GB across 22 sources) ↳ Tokenized with tiktoken (r50k_base) and stored in HDF5 ↳ Training loop with eval, LR decay, and crash-safe checkpoints ↳ An SFT and RLHF guide for what comes after pretraining The same code scales by changing a few config values. Around 13M parameters is where the output starts producing correct grammar and spelling, and you can train that in about a day on a free Colab or Kaggle T4. If you've ever wanted to actually see how a transformer works instead of importing one, this is a clean place to start. Link to the repo in the comments. Interested in ML/AI Engineering? Check my FREE AI engineering Guidebook with 380 pages (downloaded over 80k times, link below)
10
38
309
12,359
LLMが知識を抽象化して細分化できない(ように見えるだが置いておいて)のはなぜだろうか 細分化に最適なハードが必要という意見がある。言語野や視覚野のように。しかし数式を解く専門ハードを備えてたっけな生き物。ソフト内で抽象化して細分化できるという考え自体が間違っているのだろうか
2
Replying to @TheLivingLibary
Didn't work for the book I gave it but interestingly the app pretended it had found the book then put together answers based on a general dataset, not the book. That aside, this app doesn't seem to do anything you can't ask an LLM to do for you with a simple prompt.
1
俵 retweeted
従来のコンピュータには簡単にできる算数やアルゴリズムの実行をLLMにやらせるゲームで、8-bit の01を睨んで謎のbit操作を探索するのがそのひとつ。これは解けたんだけどなあ...
あ、Nemotronコンペって今日の9時までなんすね。結局参加できなかったけどどういう感じだったのかな。 kaggle.com/competitions/nvid…
1
1
31
NTS-CoT: Mitigating Hallucinations in LLM-based News Timeline Summarization with Chain-of-Thought Reasoning Feng Lyu, Huiqin Yan, Sijing Duan, Hao Wu, Shuang Gu, Xue Qiao, Weixu Zhang, Haolun Wu arxiv.org/abs/2606.13171 [𝚌𝚜.𝙲𝙻 𝚌𝚜.𝙰𝙸]
4
たくの retweeted
Fableへのアクセス制限は、ミドルパワーにとって「主権AIを作れ」という目覚ましと捉えられがちだが、そうではなく、むしろ米国との関係を壊さずに過度に国産モデルに偏重せず、フロンティアモデルへのアクセス権を広範な交渉力で稼ぐべきとする論考(自国llmの開発に必要な計算資源はフロンティアモデルへのアクセス権の交渉カードとトレードオフになるため慎重な立場)。
The Fable decision is fundamentally a domestic US policy mess, and it seems likely to resolve itself, albeit chaotically. For middle powers, it's a tempting starting point for ill-conceived 'sovereign' AI takes, but I think the right move is to let it blow over and buy more time. The decision itself seems counterintuitively domestic in scope. USG got worried about the prospect of a jailbreak, didn't feel like it had a particularly precise and effective way to tackle that risk, and defaulted to limiting access in an obvious and available way. The easy way to do this was also a way to make Anthropic uncomfortable and annoy allies, but these seem like secondary considerations at best. That's still relevant information for middle powers--it goes to show that there is no immediately effective and legible reason for USG or American developers to consider their interests in making access decisions. As for middle powers' posture in reacting to this: I think nothing good comes from wading into a politically charged AI policy environment, solidarising with Anthropic and drawing the continued ire of the Trump administration. Right now, middle powers are collateral damage, and acting hastily risks making them parties to the conflict. What would have helped in this case? It's not clear that even a European model would really have. Imagine you had the absolute best case European sovereignty in place, up and running since 2024. would it have a Fable-class model by now? or would it 'only' be at the level of Opus 4.8 - a near-frontier model in its own right that remains available today? you'd have to be quite confident that the European champion would not just be 2 months behind the curve, but at the absolute bleeding edge of frontier development, to make a difference to this particular scenario. Not even the most bullish views of the domestic project makes that kind of outcome particularly likely, so this really is not a particularly incisive wake-up call on European frontier models specifically. But even if you disagree, what would it actually *mean* for this to be a wake-up call for European sovereignty? Are you going to build your own model now? What are you going to do in the--generously--three years between announcing this project and reaping the frontier models it would build: are you willing to give up frontier access in the meantime? Because the resources required for building this frontier model directly trade off against the resources you could invest into guaranteeing access instead (most notably through compute); and the political fallout from announcing an attempt to build the very kind of model capability the US is attempting to restrict would also make future access negotiations harder. Let's say you're willing to bear that delay: do you think a Trump administration that just refused to give you access to Fable is going to let you buy enough frontier chips to train an unrestricted Fable clone yourself? Are you willing to go the mat on semiconductor chokepoints, even if it comes with sky-high costs in Ukraine and trade policy? I don't actually think so. Look into the details of what would be required for a big European push right now, and you'll see the leverage for 'waking up' and divorcing from the US ecosystem simply is not feasible in the current technological or geopolitical environment. I regret that this is the case, but that doesn't make it the case any less! What, then, is the alternative? First, I think it's worth noting that this is fundamentally a very good version of a very bad thing. In a fortuitous turn of events, the Trump administration has picked the most ill-conceived version of access restrictions you could possibly come up with. It's legally fraught, so domestically impactful that it will lead to massive internal pushback, and likely extremely economically harmful. As a result, it will likely go down in flames eventually. The U.S. is not yet in the spot to actually go through with long-term cut off: international markets are still too important, the security situation is not yet sufficiently dire, and so on. So the first live fire exercise of cutting off the rest of the world is going to fail, which means labs and the admin are going to be much more wary of subsequent attempts to do the same, even if they end up more sophisticated. Second, I think the access recipe is fundamentally the same as it was yesterday: build leverage on the margins that makes cut-offs like these even less attractive, for instance through access-for-compute deals and by creating deep economic integrations that are economically central to US labs and strategically central to the US supply chain--create a lobby to push back harder against attempts like this in the future. In the future, we can use the resources and capacities that gives us to sprint toward our own frontier project if we must, but right now we neither have the political will nor the relative power to get even close to trying that. Third, and somewhat trivially, we should start thinking about what we want to do the next time this happens. I suspect any analyses that assess whether you can use ASML or any semiconductor chokepoint to avert this will come up short, but there's still value in analysing and then credibly precommitting to threats. Right now, USG did this operating under the assumption there would be absolutely no reaction from middle powers at all. Any plan in the drawer that suggests there is a non-negliblie cost for the US to act like this in the future would be helpful. There's little use in deploying it reactively now; there's lots of value in precommitting to it for the next iteration. That's different than actually going to the mat; the goal here is to play chicken a bit, increase uncertainty and latent risk for the administration in making these decisions to tilt the calculus toward integration, not to go all out on a highly costly tradewar. Fourth, I think this clarifies the specific concerns that could motivate access restrictions. Security concerns, both on misuse as well as distillation and model theft, fundamentally make the US more likely to restrict model access; this time around, it was concerns around reducing surface area for unmonitored jailbreak attempts. That is, in principle, fixable--middle power governments can and should engage with labs to create security conditions that create permission structure for exports and model sharing. Make your infrastructure as secure as they want it to be, and you reduce the risk they consider exporting to you a security vulnerability. Again, I understand if this sounds submissive and uncomfortable to you---but again, all this is necessary even if you go for the maximal sovereignty playbook at the same time, because you will need frontier access in the meantime. Instead of these reasonable responses, I worry that the low-resolution view on this whole affair is to think this should shake middle powers into the wrong kind of action. Realising how important and contingent frontier AI access is quickly leads down the path of wanting to build your own; realising how capricious the American ecosystem is makes you want to divorce from it faster. But for better or for worse, the central implication of this episode is the opposite: as evidenced by this episode being possible at all, middle powers currently do not have the leverage to do much about any of this, and building up this leverage is almost impossible to do in an openly adversarial relationship to the US. In that sense, waking up is not a matter of loud yelling, decisive action or pivotal decisions. For all the internal urgency with which I think we should precommit to some leverage and shore up our security concerns, I still think the optimal strategy is one of public restraint and progress on the margins of the current playbook. I'm just not sure there's that much to wake up from - this is just what life is like for now.
2
18
33
8,193
G. @ The Neuron retweeted
There are three primary grading strategies we can use when evaluating an LLM agent… Human evaluation is the definitive source of truth for measuring an agent’s performance. We can start the human evaluation process with quick manual inspections (with researchers or internal users). In fact, the creators of Claude Code even began their evaluation process using internal tests with Anthropic employees. These simple tests capture high-level trends in agent capabilities, but we need to be more rigorous to collect high-quality human feedback at scale. We start by creating a guideline / rubric that defines key quality dimensions and how they should be evaluated. Then, we calibrate the human evaluation process by collecting feedback, measuring agreement between humans, and refining our evaluation guidelines / process to improve agreement. Some tasks are more objective, but evaluation is usually subjective–human evaluators rarely agree with each other without investing significant effort into calibration. Although it is our source of truth, human feedback can be difficult to scale, so automatic evaluation techniques are often adopted to make experimentation more efficient. We can roughly categorize automatic evaluation techniques into two groups: 1. Code-based: heuristic checks that can be captured in a Python function. 2. Model-based: open-ended checks that are based on an LLM judge. Code-based graders include techniques like string matching and assertions on the agent’s outcome or transcript. For example, we can check if an agent’s final answer matches the ground truth or run predefined test cases for a coding problem. These are deterministic quality checks one can implement / execute with a Python function. Code-based graders are efficient, reproducible, and easy to debug, but they also struggle to capture subjective aspects of agent behavior. For example, checking whether code written by an agent passes test cases can be handled with a code-based grader. Judging whether the code is sufficiently clean or elegant, on the other hand, is subjective. To grade such open-ended and subjective criteria, we need model-based graders. The idea of a model-based grader is simple: we just prompt an LLM to evaluate some aspect of quality for us. This approach, known as LLM-as-a-Judge, is universally used in the evaluation process for both conventional LLMs and agent systems. To ensure the LLM judge is accurate, we should calibrate it by measuring agreement with human experts. Multiple graders. No one grading technique is perfect. LLM judges are flexible but non-deterministic and subject to bias. Code-based graders are narrow. Human evaluation is expensive and difficult to tune. Therefore, we should use all of these techniques in tandem instead of relying on one of them. Human evaluation serves as the core evaluation signal for an agent system. We can calibrate model-based graders based on their agreement with humans and use them to scale the evaluation process (while monitoring agreement with humans). Code-based graders can also be used for more deterministic or narrow checks that do not require model-based graders.
1
3
21
896