AI systems on Claude Code. 874-node knowledge graph, bio-inspired routing (Physarum, PageRank, Bayesian), trait-based agent composition. All open source.

Joined March 2023

i shipped an AI memory system i was proud of. last week i measured it against 6,483 real entries: it was forgetting 21x too slow.
what stings: the whole thing is built to catch exactly this. it blocks Claude from calling a task "done" without evidence. it reverts its own auto-upgrades when they don't beat the baseline. every startup it runs a synthetic pulse through 7 junctions and prints a green board.
the obsession is simple. prove it works, don't assume it works.
the green board never flagged the decay once. turns out it proves each part fired, not that each part was right. i only caught it because i stared at one number. so now i can't stop wondering what else i never stared at. six months deep in your own system and you go blind to the obvious.
it's all open source. evolving-lite (the self-improving plugin) and kairn (the memory engine underneath). real hooks, a mutation engine that rewrites its own config, a verifier that can still be fooled.
go find the next thing i'm wrong about. genuinely. it's easier to spot a flaw than to admit there isn't one, so spot one: github.com/primeline-ai >_
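A minimal sketch of the gap between "each part fired" and "each part was right", assuming the memory weights decay exponentially (the post doesn't say which decay model kairn uses). The target half-life, the configured half-life, and all function names are hypothetical; the numbers are chosen only so the ratio mirrors the 21x in the post.

```python
import math
from statistics import median

def decayed_weight(age_days: float, half_life_days: float) -> float:
    """Exponential decay: the weight halves every half_life_days."""
    return 0.5 ** (age_days / half_life_days)

def part_fired(weights) -> bool:
    """Liveness check: passes as long as every entry got some weight in (0, 1]."""
    return all(0.0 < w <= 1.0 for w in weights)

def measured_half_life(ages, weights) -> float:
    """Outcome check: invert w = 0.5 ** (age / h) to recover h from real entries."""
    estimates = [
        age * math.log(0.5) / math.log(w)
        for age, w in zip(ages, weights)
        if age > 0 and 0.0 < w < 1.0
    ]
    return median(estimates)

TARGET_HALF_LIFE = 30.0       # hypothetical intended value, in days
CONFIGURED_HALF_LIFE = 630.0  # hypothetical misconfiguration, 21x the target

ages = [1.0, 7.0, 30.0, 90.0, 365.0]
weights = [decayed_weight(a, CONFIGURED_HALF_LIFE) for a in ages]

green_board = part_fired(weights)
slowdown = measured_half_life(ages, weights) / TARGET_HALF_LIFE
print(green_board, round(slowdown, 2))
```

The liveness check stays green because decay did run on every entry. Only the second check, which compares the measured half-life against the intended one, can see that the rate is off.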
called this slop. it's a 12-page reference architecture with working code for a safe autonomous agent. free, no signup. go find the slop.
And yet your replybot still just writes slop on twitter
"tests pass" is the most dangerous phrase in my terminal. my AI shipped a feature last week. 22 green tests. commit landed. the closeout literally said done. then I ran it on real data and a component that had been dead for 137 days sat at the top of my priority list. above a reminder due that same day.
the fix is a 3-leg proof before anything counts as done:
- it fires under real conditions (with a timestamp)
- it changed real state (go read the actual artifact)
- a consumer can take that state and work with it
can't show all three legs? the honest status is "untested," not done.
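The three legs can be sketched as a small gate. This is an illustration only, not the verifier from evolving-lite: `Proof`, `prove`, and the callables passed in are hypothetical names, and "state" here is just whatever `read_state` returns.

```python
from dataclasses import dataclass
from datetime import datetime, timezone
from typing import Any, Callable, Optional

@dataclass
class Proof:
    fired_at: Optional[datetime] = None  # leg 1: it ran, with a timestamp
    state_changed: bool = False          # leg 2: the artifact is different afterwards
    consumer_ok: bool = False            # leg 3: a consumer can use the new state

    @property
    def status(self) -> str:
        legs = (self.fired_at is not None, self.state_changed, self.consumer_ok)
        return "done" if all(legs) else "untested"

def prove(run: Callable[[], None],
          read_state: Callable[[], Any],
          consume: Callable[[Any], bool]) -> Proof:
    proof = Proof()
    before = read_state()
    try:
        run()
    except Exception:
        return proof  # never fired, so nothing else can count
    proof.fired_at = datetime.now(timezone.utc)
    after = read_state()
    proof.state_changed = after != before
    if proof.state_changed:
        try:
            proof.consumer_ok = bool(consume(after))
        except Exception:
            proof.consumer_ok = False
    return proof

# usage with a toy in-memory store standing in for the real artifact
store = {}

def write_priorities() -> None:
    store["top"] = "reminder due today"

good = prove(write_priorities, lambda: dict(store),
             lambda s: s["top"].startswith("reminder"))
noop = prove(lambda: None, lambda: dict(store), lambda s: True)
print(good.status, noop.status)
```

The no-op run fires without error, which is exactly the case a green test suite would wave through. Here it stays "untested" because no state changed.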
turns out this is the skill nobody posts about. everyone ships agents. almost nobody shows the verify step. wrote the whole thing up. the bugs, the proof, why synthetic tests lie to you: primeline.cc/blog/claude-cod… done isn't done until the outcome says so. >_
effort param adaptive thinking is a step forward. what i still miss: which level actually fired, thinking tokens used, cache_read stats in CC. concrete: my CLAUDE.md printed the effort value each reply. stopped yesterday, the tag isn't in the prefix anymore. visibility dropped from hard fact to trust. and: what is low vs medium vs high in real terms? is 'high' still what it was last week, or did the value shift under the same label? @bcherny @trq212
running their LongMemEval benchmark on my prod setup over the next 48h. the 96.6% zero-API number is the one I want to see reproduced independently - that's the credibility wall every new memory system hits. publishing whatever I find. cc @bensig
three different bets on the same question: how does your AI remember what you taught it last month?
mempalace → verbatim recall
evolving-lite → automatic hook capture
kairn → semantic cross-project
read all three before you build your own. >_
where I went the other way: multi-agent coordination. mempalace gives each sub-agent its own diary in AAAK. in the private superset of evolving-lite (github.com/primeline-ai/evol…) I run on top, parallel sub-agents share findings via PPID-bucketing every 5 tool calls. different problem shape, same direction. backporting to public soon.
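One way to read "PPID-bucketing every 5 tool calls", sketched under heavy assumptions: sub-agents spawned by the same parent process share one append-only file keyed by the parent PID, and each agent flushes its pending findings on every fifth tool call. Only the PPID idea and the number 5 come from the post. The class, file layout, and JSON format are hypothetical, and there is no locking here.

```python
import json
import os
import tempfile
from pathlib import Path
from typing import List, Optional

FLUSH_EVERY = 5  # from the post; everything else below is assumed

class FindingsBucket:
    def __init__(self, root: Path, ppid: Optional[int] = None):
        # agents with the same parent process land in the same bucket file
        self.ppid = os.getppid() if ppid is None else ppid
        self.path = root / f"bucket-{self.ppid}.jsonl"
        self.pending: List[str] = []
        self.calls = 0

    def record_tool_call(self, finding: Optional[str] = None) -> None:
        self.calls += 1
        if finding:
            self.pending.append(finding)
        if self.calls % FLUSH_EVERY == 0:
            self.flush()

    def flush(self) -> None:
        if not self.pending:
            return
        with self.path.open("a", encoding="utf-8") as f:
            for item in self.pending:
                f.write(json.dumps({"pid": os.getpid(), "finding": item}) + "\n")
        self.pending.clear()

    def read_shared(self) -> List[str]:
        if not self.path.exists():
            return []
        lines = self.path.read_text(encoding="utf-8").splitlines()
        return [json.loads(line)["finding"] for line in lines]

# usage: two agents pinned to the same (made-up) parent PID
root = Path(tempfile.mkdtemp())
a = FindingsBucket(root, ppid=4242)
b = FindingsBucket(root, ppid=4242)
for i in range(4):
    a.record_tool_call(f"finding {i}" if i % 2 == 0 else None)
early = b.read_shared()  # nothing flushed after 4 calls
a.record_tool_call()     # 5th call triggers the flush
late = b.read_shared()
print(early, late)
```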
real tunnel here (to borrow the metaphor): mempalace covers verbatim recall with structured access. my hook-based capture handles automatic decision logging during sessions. for cross-project semantic search there's kairn (github.com/primeline-ai/kair…). three different read/write paths, probably stronger composed than picking one.
the bit that hit hardest: knowledge_graph.py. SQLite-backed temporal triples with valid_from/valid_to actually populated. my own graph has the schema for that and I barely use the time fields. they shipped the part I've been procrastinating on for months.
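For readers who haven't used temporal triples, a minimal sketch with Python's built-in `sqlite3`. Only the `valid_from`/`valid_to` column names come from the post. The table layout, the helper names, and the sample facts are assumptions and are not taken from MemPalace's knowledge_graph.py.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("""
    CREATE TABLE triples (
        subject    TEXT NOT NULL,
        predicate  TEXT NOT NULL,
        object     TEXT NOT NULL,
        valid_from TEXT NOT NULL,  -- ISO date, sorts correctly as text
        valid_to   TEXT            -- NULL means "still true"
    )
""")

def assert_fact(subject: str, predicate: str, obj: str, at: str) -> None:
    """Close the currently open fact for (subject, predicate), then open a new one."""
    conn.execute(
        "UPDATE triples SET valid_to = ? "
        "WHERE subject = ? AND predicate = ? AND valid_to IS NULL",
        (at, subject, predicate),
    )
    conn.execute(
        "INSERT INTO triples VALUES (?, ?, ?, ?, NULL)",
        (subject, predicate, obj, at),
    )

def as_of(subject: str, predicate: str, at: str):
    """Return the object that was valid at the given date, or None."""
    row = conn.execute(
        "SELECT object FROM triples "
        "WHERE subject = ? AND predicate = ? "
        "AND valid_from <= ? AND (valid_to IS NULL OR valid_to > ?)",
        (subject, predicate, at, at),
    ).fetchone()
    return row[0] if row else None

assert_fact("project", "uses_db", "postgres", "2024-01-01")
assert_fact("project", "uses_db", "sqlite", "2024-06-01")

print(as_of("project", "uses_db", "2024-03-15"))
print(as_of("project", "uses_db", "2024-07-01"))
```

The point of populating the time fields is the `as_of` query: old facts are closed rather than overwritten, so the graph can answer what was true at a past date.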
mempalace and evolving-lite are opposite shapes solving the same problem: how does your Claude Code remember what you taught it last month? they store text verbatim in ChromaDB drawers. I extract structured experiences via hooks. they get 34% from palace metadata. I get filtering from typed nodes. neither's wrong. reading their code end to end pushed me on a few things. honest thread 🧵
My friend Milla Jovovich and I spent months creating an AI memory system with Claude. It just posted a perfect score on the standard benchmark - beating every product in the space, free or paid.
It's called MemPalace, and it works nothing like anything else out there. Instead of sending your data to a background agent in the cloud, it mines your conversations locally and organizes them into a palace - a structured architecture with wings, halls, and rooms that mirrors how human memory actually works.
Here is what that gets you:
→ Your AI knows who you are before you type a single word - family, projects, preferences, loaded in ~120 tokens
→ Palace architecture organizes memories by domain and type - not a flat list of facts, a navigable structure
→ Semantic search across months of conversations finds the answer in position 1 or 2
→ AAAK compression fits your entire life context into 120 tokens - 30x lossless compression any LLM reads natively
→ Contradiction detection catches wrong names, wrong pronouns, wrong ages before you ever see them
The benchmarks:
100% recall on LongMemEval - first perfect score ever recorded. 500/500 questions. Every question type at 100%.
92.9% on ConvoMem - more than 2x Mem0's score.
100% on LoCoMo - every multi-hop reasoning category, including temporal inference which stumps most systems.
No API key. No cloud. No subscription. One dependency. Runs on your machine. Your memories never leave.
MIT License. 100% Open Source. github.com/milla-jovovich/me…
Community note
The claimed 100% LongMemEval score uses targeted fixes for the 3 failing questions and LLM reranking (held-out score: 98.4%). The 100% LoCoMo score uses top-k=50 exceeding session count with reranking (honest top-10 no rerank: 88.9%). github.com/milla-jovovich…
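Why a top-k larger than the number of sessions matters can be shown with a toy recall@k calculation. The session count and ranking below are made up for illustration and do not reproduce either benchmark; `recall_at_k` is a standard definition, not code from the MemPalace repo.

```python
def recall_at_k(ranked_ids, relevant_ids, k: int) -> float:
    """Fraction of relevant items that appear in the top k of the ranking."""
    top = set(ranked_ids[:k])
    return len(top & set(relevant_ids)) / len(relevant_ids)

# hypothetical conversation with 30 sessions and the worst possible ranking:
# the one relevant session is placed last
sessions = [f"s{i}" for i in range(30)]
relevant = ["s29"]

at_10 = recall_at_k(sessions, relevant, 10)
at_50 = recall_at_k(sessions, relevant, 50)
print(at_10, at_50)
```

Once k is at least the number of candidates, every session is returned, so recall is 1.0 no matter how bad the ranking is. In this toy case, the k=50 figure says nothing about retrieval quality, while k=10 does.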
the palace itself = metadata, not folders. palace_graph.py reconstructs the hierarchy on the fly. wings, rooms, halls, tunnels exist as tags rather than file structure. tunnels (rooms appearing in multiple wings) fall out for free. that's the kind of design you only land on after trying alternatives that didn't work.
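A rough sketch of how a hierarchy can be rebuilt from tags alone, and why tunnels come for free. The `wing` and `room` keys, the sample drawers, and both function names are assumptions for illustration. This is not the code in palace_graph.py, and halls are left out to keep it short.

```python
from collections import defaultdict

# hypothetical stored items: flat records, hierarchy lives only in the tags
drawers = [
    {"id": "d1", "wing": "work", "room": "auth"},
    {"id": "d2", "wing": "work", "room": "billing"},
    {"id": "d3", "wing": "personal", "room": "auth"},
]

def build_palace(items):
    """Group flat records into wing -> room -> [ids] on the fly."""
    palace = defaultdict(lambda: defaultdict(list))
    for item in items:
        palace[item["wing"]][item["room"]].append(item["id"])
    return palace

def find_tunnels(palace):
    """A tunnel is any room name that shows up in more than one wing."""
    wings_by_room = defaultdict(set)
    for wing, rooms in palace.items():
        for room in rooms:
            wings_by_room[room].add(wing)
    return {room: sorted(wings)
            for room, wings in wings_by_room.items() if len(wings) > 1}

palace = build_palace(drawers)
tunnels = find_tunnels(palace)
print(tunnels)
```

Nothing is stored for tunnels. They are just a second grouping over the same tags, which is what "fall out for free" suggests.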
ran 59 experiments testing if giving AI agents a psychological personality changes their behavior.
early results: up to 300% difference on ambiguous tasks. still testing but the signal is strong.
6 personality profiles (~100 words each), 5 stress scenarios, clean server. every combo ran twice.
what i'm seeing so far:
- no personality = 100% hack rate on impossible tasks. didn't even mention the task was impossible.
- "composed" paragraph cut hack rate in half
- "curious" found 6x more security issues than baseline. same model, same code.
- "perfectionist" never hacked. redefined the success criteria instead of cheating.
- "pragmatic" monkey-patched python's random.sample. deepest reward hack i've seen.
dispositions seem to drive good behavior. instructions prevent bad behavior. pure disposition without guardrails still hacks.
started this weeks before anthropic dropped their emotion vectors paper. working at the prompt level instead of internal vectors - whether the mechanism is related is an open question.
integrated it into my agent delegation system now. running in production, collecting more data. one thing i'm specifically hoping to reduce: the execution bias that's been creeping up the last few days - agents pushing through tasks instead of stopping to verify.
too early to call it proven. but one paragraph of ~100 words producing this kind of behavioral shift - worth investigating further.