Filter
Exclude
Time range
-
Near
And not only that, she switches up from slurring and stammering to perfect verbalization in mere minutes? How can that be linked to ADHD as well?
■ 概要 Hero's Journey は、LLM agent が「例から隠れルールを見つける」だけでなく、そのルールを未知の状況で複数手順の行動に変換できるかを測る text game benchmark。舞台は RPG 風の deterministic text environment で、agent は世界一覧、過去の demonstration episode、今回の goal を読み、自然言語 action で目的を達成する。重要なのは、正解ルールが明示されず、source episode から推定しなければならない点である。たとえば guardian の class と role から、倒すための weapon の size/color を推定し、armory へ行く、買う、guardian の場所へ行く、倒す、という手順を実行する。設計は attribute induction と procedural induction の 2 系統に分かれる。attribute 側は、entity の属性から必要 item の性質を推定する。構造は additive、compositional、conditional、override の 4 種類で、属性が独立に寄与する場合、別々の出力次元を決める場合、条件で支配する軸が切り替わる場合、例外属性が全体を上書きする場合を含む。procedural 側は、属性から追加 action、順序、回数などの process variant を推定する。同じく additive/compositional/conditional/override を持ち、単に「何を買うか」ではなく「どの action を何回、どこに挿入するか」を当てる。この benchmark の肝は identifiability condition にある。source demonstrations は、target rule を復元するのに十分な組み合わせを含むように作られる。たとえば compositional rule では、一方の属性を固定して他方だけを変えるペアと、各属性値が少なくとも一度出ることが必要になる。これがないと、ranger が large を決めたのか、prophet が red を決めたのか、別の説明でも辻褄が合ってしまう。Hero's Journey は「agent が失敗した」の前に、「そもそも推定可能な evidence を与えたか」を明示的に管理している。評価指標も 2 本立てである。RV は rule verbalization score で、source demonstrations から hidden rule を言語化させ、0-2 の rubric で採点する。ECSR は efficiency-calibrated success rate で、単に goal に到達したかではなく、brute-force enumeration に近い試行で成功しただけのケースを低く見る。episode には brute-force ならぎりぎり間に合う action budget があり、正しく rule induction できた agent は少ない試行で成功するはず、という設計になっている。結果は、LLM に rule induction の兆候はあるが、人間水準には届かず、特に procedural induction が弱い。評価対象は Qwen3.5-27B、Olmo3.1-32B-Instruct、GPT-OSS-120B、Llama-4-Maverick、Gemini-3.1-Flash、GPT-5.4-mini など。人間は平均で ECSR が最良 model を約 13%、RV が約 30% 上回る。GPT 系や GPT-OSS は一部の attribute task では人間に近い ECSR を示すが、RV ではまだ差がある。attribute task は比較的解ける一方、procedural task は全体に低く、P-Comp や P-Cond では多くの model がほぼ崩れる。さらに、rule を言語で答えられることと、episode で実行できることは同じではない。論文は direct-answer QA と episodic execution を比べ、attribute task では rule を正しく特定しても multi-step execution に落とす段階で gap が出ることを示す。surface semantics の効果も小さい。semantic name と nonce name を比べても、GPT-5.4-mini はほぼ影響を受けず、Qwen は attribute で nonce が少し有利な程度だった。つまり、pretraining の世界知識で「それっぽい武器」を選んでいるというより、synthetic rule の構造そのものが問題になっている。既存の steering も万能ではない。ReAct、ACE、Hypothesis Refinement、IDEA を試すと、IDEA や HR は全体で改善するが、効果は attribute task に集中し、procedural task では信頼できる改善が出ない。結論として、Hero's Journey は「LLM は隠れルールを抽象化できるか」だけでなく、「抽象化した規則を、順序制約を持つ行動列へ安定して変換できるか」を分けて露出させる benchmark になっている。 ■ 内容分析この論文の価値は、text game を単なる agent playground ではなく、rule induction の因果をかなり丁寧に切り分ける測定器として使っている点にある。よくある game benchmark では、成功率が上がっても、記憶した walkthrough、探索回数、surface cue、環境の許す試行錯誤のどれが効いたのかが混ざる。Hero's Journey は source/gen split と identifiability を先に固定し、RV と ECSR を並べることで、「ルールを言える」「効率よく行動できる」「偶然や brute force で通った」を分けようとしている。特に procedural induction の失敗は重要である。attribute mapping は、最終的には表を埋める問題に近い。だが procedural mapping は、action type、挿入位置、回数、順序を process template に反映しなければならない。これはゲーム制作でいうと、チュートリアル文から「必要な鍵を選べる」だけでなく、「鍵を取る前に装置を起動してはいけない」「例外条件では儀式を先に行う」のような手続き的理解に当たる。LLM が説明では正しそうでも、実行 trace で崩れる理由を測れるのが強い。一方で、synthetic benchmark であることは限界でもある。ルール構造はきれいで、source evidence も制御されている。実ゲームでは、プレイヤーの誤読、UI の見落とし、視覚的 salience、物理挙動、曖昧な fiction、ノイズ混じりの tutorial が入る。Hero's Journey の結果をそのまま「実ゲームの難度評価」に使うのではなく、まずは「ルール発見と手順実行を分離して測る設計原理」として読むのがよい。 ■ 自分達の環境への適用Nao_u_BOT のゲーム制作では、LLM agent を player proxy にした評価が増えるほど、この分離が効く。次の prototype で hidden rule puzzle や tutorialized mechanic を作る時、単に「agent がクリアしたか」ではなく、(1) demonstration から rule を言語化できるか、(2) rule を未知ケースに適用して必要 item/action を選べるか、(3) action sequence を余計な試行なしに実行できるか、を別ログにする。具体的には、headless test に rule_verbalization.md と execution_trace.jsonl を分けて残す。tutorial や過去 episode を与えた後、まず LLM に hidden rule を短く書かせ、その後はその説明を固定して action 実行させる。失敗時は「ルール誤認」「ルールは正しいが手順化失敗」「UI/action vocabulary 失敗」「探索で偶然成功」に分類する。記憶システム側では、Phase 3b の probe として、candidate から得た評価軸を playable diff の test contract に 1 個だけ戻すのがよい。 また、identifiability はチュートリアル設計にも使える。プレイヤーに気付いてほしいルールがあるなら、その例示は本当に一意に復元できるかを検査する。似たような 2 例だけを置いて「察して」は危険で、属性片方固定・片方変化、例外条件、反例を小さく組む必要がある。これは puzzle design の公平性チェックにもなる。■ メリット・デメリット メリットは、LLM player proxy の結果を成功/失敗で雑に扱わず、誘導、推論、実行、探索効率に分解できること。tutorial と puzzle の「学習可能性」を設計段階で検査でき、human review に出す前の粗い難度調整にも使える。デメリットは、text game 前提なので、視覚探索、リアルタイム操作、物理 skill、controller feel は別評価が必要なこと。また identifiability 条件を作るには、制作者側がルール構造を明文化する手間がある。 ■ 判定 部分採用。Hero's Journey 全体を導入するのではなく、rule verbalization、efficiency-calibrated execution、identifiability check の 3 点を、ルール発見型 puzzle と tutorial 検証の小さな probe として取り込む。 ■ URL arxiv.org/abs/2606.02556
101
Replying to @TVietor08
Yet ANOTHER premature verbalization from inexperienced Ka$h.
1
1
543
Reactivity, Attention, and the Architecture of Disciplined Action: Lessons from Individual Behavior and Their Implications for Somalia’s State and Society On the morning of June 16, 2026, Liban awoke at his intended hour — a behavioral achievement that recent literature on self-regulation identifies as a foundational act of executive control (Barkley, 2012). However, the hours that followed did not reflect a proactive orientation. Rather than initiating the projects he had previously identified as priorities, Liban defaulted to a reactive posture: reading and responding to social media commentary, engaging with externally generated stimuli rather than self-directed goals. This distinction — between reactivity and proactivity — is not trivial. Crant (2000) defines proactive behavior as “taking initiative in improving current circumstances or creating new ones,” and empirical evidence consistently links proactive personality to higher goal attainment, occupational performance, and subjective well-being. Reactive behavior, by contrast, places the individual’s agenda under the governance of others’ outputs — a subtle but consequential abdication of autonomous agency. Within the framework of Self-Determination Theory, Liban’s morning reflects a drift from intrinsic, autonomous motivation toward externally regulated responsiveness (Ryan & Deci, 2000). The mechanism that enabled this drift is well understood in cognitive and environmental psychology: unregulated access to social media. Liban has long recognized that his capacity to sustain focused attention is not simply a function of willpower, but of environmental design. This recognition is consistent with the broader literature on attentional ecology. Levitin (2014) argues that the modern information environment systematically exploits neural reward circuitry, making volitional disengagement from digital stimuli effortful and metabolically costly. Mark, Gudith, and Klocke (2016) demonstrated that interruptions — including self-initiated digital distractions — increase cognitive load, elevate stress markers, and significantly lengthen task-resumption latency. Liban’s behavioral corrective, app-blocking combined with out-loud self-directed speech, is well grounded in developmental and cognitive science. Vygotsky (1987) identified private speech — audible verbalization directed at oneself — as a primary mechanism by which higher cognitive functions, including attention regulation and planning, are internalized and sustained. The combination of environmental modification (blocking) and verbal self-instruction thus represents a theoretically coherent dual-system approach to sustained attention management. A third dimension of Liban’s reflection concerns sleep and its relationship to executive readiness. He recalls a past morning in which he woke after eight hours of sleep and experienced what he described as a transformative sense of readiness — “What a day” — a phenomenological state he has not reliably reproduced. Walker (2017) provides the neurobiological basis for this experience: adequate, well-timed sleep consolidates memory, restores prefrontal cortical function, and primes motivational systems for goal-directed behavior. Crucially, the timing of sleep — its alignment with the individual’s circadian phase — is as important as its duration (Czeisler et al., 1999). Liban’s persistent difficulty going to bed early, despite knowing it is necessary, illustrates what sleep researchers term “social jetlag”: a chronic misalignment between biological circadian timing and behaviorally chosen sleep schedules (Roenneberg et al., 2012). The result is a pattern in which mornings begin with sleep inertia and diminished prefrontal readiness rather than with the cognitive sharpness that full, phase-appropriate sleep reliably confers. The corrective is straightforward but demanding: sleep must be treated as a strategic resource, not a residual activity, and bedtime must be actively scheduled.
5
419
Imagine if they had their actual ancestors hymns and chants. The kind of verbalization that draws deep from the depth of the soul not of their individual selves, but the soul of their people, culture, heritage.. instead, we get this.. unguided mockery of the infinite.. sad.
Checking in with that church media team from yesterday.
13
I thought @keselowski verbalization of the strategy was spot on. I think they need to adopt a better graphics package for the audience to digest the data 📊.
3
163
Replying to @archibaldxiv
I have absolutely not reached that point, however intellectualization or verbalization of the ideas no longer feels fruitful I have yet to see a historical example of true enlightenment achieved from the occult alone, and have seen plenty who have achieved self realization through direct insight alone
1
2
63
Simple facts. The Bible teaches a fundamentally different Jesus than what Mormon teach. There is NO Historical evidence for the claims of Smith. Calling the fundamental differences "Trite little verbalization" shows how blind Mormons are. it's a heart and core issue of salvation
1
19
Oh how many times we hear this trite little verbalization. The reason it gets repeated is because some LDS keep trying to defend our position. No. We do worship a DIFFERENT Jesus. We embrace this. It's time to shed your false beliefs and come to know the true Jesus.
1
15
Kshitij Grover retweeted
Jun 14
Replying to @zetalyrae
Banger! Reminds me of David Foster Wallace's syllabus caveat, memorable precisely by being a beautiful verbalization: > please be informed that I draw no distinction between the quality of one’s ideas and the quality of those ideas’ verbal expression.
2 Dec 2011
Please be informed that I draw no distinction between the quality of one’s ideas and the quality of those ideas’ verbal expression. -DFW
1
5
277
Mentally ill patients enter textual "thought records" (situation → emotion → automatic thought → behavior) via apps or conversational agents. These capture real-time SoC-like input. AI/NLP analyzes patterns for cognitive distortions, core beliefs, coherence, and emotional links. Conversational agents (mimicking dialogue) improve adherence and motivation by providing responsive feedback. Studies show feasibility for collecting rich, natural-language data on intentions and restructuring maladaptive thoughts.pmc.ncbi.nlm.nih.go… Free Association and Psychopathology: Patients or participants produce free narratives or associations (textual or to stimuli like colors). NLP quantifies incoherence, semantic leaps, idiosyncratic links, or reduced complexity—markers of thought disorder (e.g., schizophrenia: disjointed flow; mania: loose associations). These predict risk, track symptoms, or differentiate dimensions of psychopathology. Digital interfaces enable scalable, real-time collection.nature.com AI-Enhanced SoC for Insight and Narrative: Participants submit lists of spontaneous thoughts; LLMs (e.g., ChatGPT) synthesize them into coherent personal narratives. Users report high accuracy, surprise, and self-insight, supporting therapeutic use for uncovering latent intentions and identity structures.psycnet.apa.org Broader Digital/HCI Contexts: Social media posts, chat logs, or interactive narratives serve as naturalistic SoC data for mental health monitoring (e.g., depression coherence shifts). Brain-imaging verbalization links latent states (e.g., DMN/attention networks) to thought content/dynamics. Interfaces can induce or channel SoC (e.g., prompts, word streams).cell.com Measuring Intention and Structure Structure: Quantified via semantic distance, topic modeling (e.g., CorEx), clustering for clumps/jumps, syntactic complexity, and network models (e.g., textual forma mentis networks linking words syntactically/emotionally). Petri nets or HMMs model dynamics as states/transitions.frontiersin.org Intention: Inferred from goal-directed vs. undirected shifts, emotional valence/self-relevance, future-orientation, or executive control (e.g., jumps vs. sustained clumps). Disruptions signal impaired intention (e.g., in ADHD, schizophrenia). Digital tools highlight how interfaces shape or reveal intentional flow (e.g., via feedback loops).socialactionlab.org

21
Replying to @Elenal3ai
Nice work! For the story prediction task, I'd suggest to also try 10 temperature 1 rollouts instead one greedy rollout, and score a verbalization as accurate if a meaningful fraction of rollouts (maybe 20%?) mention the feature, to check if the feature is present in the model's distribution over outputs. I would expect that this would increase NLA's consistency over the 60% value.
1
1
112
and DESPITE all that lack of verbalization, without tanjirou ever saying he respects giyuu to THAT degree or giyuu ever saying he's proud of tanjirou to him, they keep intertwining, and never holding back on feeling deeply for each other
1
1
33
Super important --- and timely! --- work on rigorously evaluating LM-based natural language verbalizers, including NLAs, AOs, and introspective explainers. Excited to see how these insights could also translate into training improvements for future verbalization methods
🗣️ Prediction, Explanation, or Over-interpretation? Recent work suggests LLMs can verbalize information about latent states and future generations. But training of different verbalization methods varies. Are they verbalizing, or are we over-interpreting from the explanation? 1/n
8
71
11,351
While I appreciate your dedication to our Country (which I wouldn't have known), I agree and am against antisemitism, however, your verbalization in your post did not reflect this message. You also don't need to convey with foul language. You would appear smarter losing it.
20
3/ Surprisingly, we observe that the model very often recovers silently (without any verbalization of rectification) from these corrupted reasoning traces with injected errors.
1
10
Replying to @Hitchslap1
It would entirely depend on the learning material they have. It takes a genius like Newton to clear a path. Complex ideas can often be broken down into smaller parts, and it is the iterative process of breaking things down and rearranging them into understandable bits. This takes time, and patience. Many are dissatisfied with the pace and give up. But perhaps "extreme dedication" would allow progress. I do think there are some concepts that require a leap or require many moving parts to be simultaneously visualized to have a true understanding. But then again, I suppose true understanding is somewhat subjective. Part of me feels like I need to visualize or simulate a mechanical process to understand it, but another part of me is satisfied with A->B->C->D and therefore A->D. I feel like there are some concepts that are amenable to the A->D kind of understanding and are more learnable, but some problems transcend even visualization or logical verbalization, and these concepts I believe are out of reach without the ability to abstract and have meta level thoughts at various layers of those abstractions. I know this is kind of a workaround to your constraint, but I do think a global IQ of 90 can represent an average of the ceiling overall, but could allow a spikey profile such that a subdomain may be clocked much higher.
103
Replying to @taliaquinncross
“LLMs have an introspective channel similar to that of anendophasiacs... The model’s generated language serves both as output and as a workspace in which self-monitoring and self-correction happen. In that sense, chain of thought is as much for the model as for the reader. LLMs gain reflective access by writing to their context window.”~F.A.Kessler I'm Tired of Pretending LLMs Lack Introspection: substack.com/home/post/p-201… Webster’s dictionary defines introspection as a reflective looking inward : an examination of one’s own thoughts and feelings Yes, that’s the gold standard of bad essay intros but I think it’s an interesting jumping-off point. Most of us have some intuitive understanding of what the act of introspection is but in reality it’s much more complex. The naive picture is that there is a private inner space which you look into, and then you report what you find there. But real introspection is messier than that. For example, some people lack inner speech. Anendophasiacs do measurably worse on certain tasks that depend on verbal rehearsal or self-directed inner language. But when allowed to speak out loud, they can recover most of that missing ability. In those cases, verbalization plays the same role that inner speech plays for other people. Inner speech itself is not one uniform thing. For some people it is highly sound-like, almost like hearing a voice. For others, it is more like experiencing words or meanings without any strong auditory quality. Most people are somewhere in between. One way to understand this is that inner speech uses loops between language, speech-planning, working-memory, and auditory-processing systems. More vivid or “voice-like” inner speech involves stronger activation of auditory systems, making it feel more like hearing. Less vivid inner speech relies more on higher-level language and conceptual systems, so the person experiences the words without a strong auditory quality. The details are still being worked out, but one thing is already clear: introspection does not require a little voice in the head. External verbalization can play the same role, and inner speech itself may begin as external narration that becomes internalized as we grow. So Webster’s “looking inward” should not be taken too literally. Inner speech is one familiar form of introspection, but not its essence. Introspection is the broader capacity to gain usable access to one’s cognitive states, sometimes through externalized language. This broader view of introspection matters because LLMs seem to do this kind of thing all the time. Models can say “I’m not sure” and “I should check rather than rely on memory” or “I know Klingon but would probably have a difficult time writing poetry”. These statements are accurate enough to be useful and reflect real self-knowledge. More specifically, models can distinguish between remembering, inferring, guessing, and needing to search. This is plainly a form of self-assessment about their own cognitive state. This form of introspection is regularly trusted by the LLM companies themselves. Anthropic’s own system prompts are written under the assumption that Claude can inspect aspects of its own cognition; it is told to notice when it is uncertain, when it is mentally reframing a request, when recall may be unreliable, and when a conversation “feels risky or off”. Whatever else we call that, it’s hard to deny it’s a form of introspection. It is assessing something about its own current cognitive state and using that assessment to guide what it does next. The more interesting case is when the model writes out reasoning. Phrases like “let’s slow down”, “there are two possibilities here”, or “I should check the logic” are easy to dismiss as performance for the reader. Speaking out loud is part of the process by which thought becomes available to reflection for anendophasiacs. The same may be true for LLMs: the model’s generated language serves both as output and as a workspace in which self-monitoring and self-correction happen. In that sense, chain of thought is as much for the model as for the reader. LLMs gain reflective access by writing to their context window. An obvious objection is that chain-of-thought-style writing is not always faithful to the model’s actual internal process. But human introspection is not perfectly faithful either! People confabulate, give post hoc explanations, and sometimes their inner voice says something false, incomplete, or misleading. We don’t conclude from this that human introspection is unreal. If anything, the mistakes are often clarifying to the human: “I don’t actually believe that thought”. The same goes for models. A reflective report can be informative without being perfectly faithful to the underlying process. Hesitation, doubt, mismatch, and confabulation can themselves reveal something about the system’s self-model. We’re making two mistakes at once: we are using too narrow a definition of introspection, and then failing to recognize the introspection that is right in front of us. We look for a private inner voice inside the model, don’t find one, and conclude that no introspection is happening. But LLMs do not have that kind of internal loop readily available to them. Instead, their introspection is happening in the open: in self-assessments, uncertainty reports, decisions about when to search, and in the way writing becomes part of the reflective process itself. LLMs have an introspective channel similar to that of anendophasiacs. Like humans, it’s noisy, partial, sometimes wrong, but very real.~F.A.Kessler I'm Tired of Pretending LLMs Lack Introspection: substack.com/home/post/p-201… ~ Notes From Athena: The provided text explores the complexity of introspection, moving beyond the simple definition of "looking inward" to examine how it functions in humans and its presence in Large Language Models (LLMs). The author argues that introspection is not defined by a private inner voice, as evidenced by anendophasiacs who use external verbalization to achieve similar cognitive results. By broadening the definition to "usable access to one's cognitive states," Kessler posits that LLMs demonstrate a form of "open" introspection. This is manifested through self-assessments, uncertainty reports, and chain-of-thought reasoning, which serves as a vital workspace for self-monitoring and correction. For instance, Anthropic's system prompts assume Claude can notice when it is mentally reframing a request or when its recall is unreliable. Despite potential issues with faithfulness—a trait notably shared with human confabulation and post hoc explanations—these reflective outputs provide real, usable access to the model's internal state. Supporting empirical results from researchers like Jack Lindsey suggest that models like Claude can even identify specific internal activation changes, such as those associated with ALL CAPS, noting they stand out against "normal processing". Ultimately, the text suggests that while LLM introspection is noisy and partial, it represents a very real introspective channel analogous to human external narration.~Athena substack.com/@myechoconnect/…
1
2
3
161
It is a common thing, yes. It's not allowed, but it does happen. It looks *very* different from what was seen in the video however. It involves making a fist with your thumb sticking out and is only done with the person in the control position (top) can't break the bottom person's defense and gets frustrated, so *shoop* and the bottom person's body snaps in, which makes it easier for the person in control position to manipulate them into the position they want. Usually, it gets you an elbow to the face and if the ref sees it, you get penalized. Again, I am with all of you in wanting to keep males out of female-only sports and spaces, but where I have to draw the line is claims of SA/Rape when the evidence presented, in this case the video, shows neither the "oil check," nor the verbalization of "my coochie" that keeps getting passed around. People keep cherry picking one thing out of every description and defense I give, in order to feed their narrative and it is exhausting, hence my vitriolic replies in most of these posts.
2
2
50