Calmsy retweeted
$1.4B onchain. Tokenized credit is the fastest-growing RWA segment, and @maplefinance is the #1 tokenizer of credit.
RWA Tokenized Credit - League Table

The top 10 platforms by total value onchain:
1. @maplefinance - $1.4B
2. @stokr_io - $1.3B
3. @centrifuge - $736.1M
4. @Securitize - $486.6M
5. @HastraFi - $405.7M
6. @chainlink CCIP - $241.8M
7. @AssetoFinance - $205.7M
8. @onrefinance - $185.4M
9. @paretocredit - $180.8M
10. @MidasRWA - $104.8M

Tokenized credit is becoming one of the biggest categories in RWA.
*There is no such thing as a tokenizer-free lunch* by @linguist_cat. A nice summary of recent "tokenizer-free" LLMs and why that label might be a misnomer. huggingface.co/blog/catherin…
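For context, many models described as "tokenizer-free" read raw UTF-8 bytes instead of subword IDs. The sketch below shows only that general idea and is not taken from the linked post: byte-level input is still a mapping from text to integer IDs over a fixed 256-symbol vocabulary. The function names are hypothetical.

```python
def byte_tokens(text):
    """Encode text as UTF-8 byte values, i.e. IDs from a fixed vocabulary of 256 symbols."""
    return list(text.encode("utf-8"))


def decode_bytes(ids):
    """Invert byte_tokens: turn a list of byte values back into a string."""
    return bytes(ids).decode("utf-8")


ids = byte_tokens("naïve")
# The non-ASCII character "ï" becomes two byte IDs, so 5 characters yield 6 IDs.
print(len("naïve"), len(ids))
print(decode_bytes(ids))
```

Sequence length therefore still depends on how the text is segmented, and non-ASCII scripts pay more IDs per character.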
Replying to @Bitget_zh
ๆ›tokenizer ๆ˜ฏ็‚บไบ†ๆ•ˆๆžœๆ›ดๅฅฝโ€ฆ ไธๆ˜ฏ็‚บไบ†ๅทๅทๆผฒๅƒน ๐Ÿ˜‚
I'm building LSParrot, a fast and faithful LSP that uses PHP's own AST and Tokenizer directly. It installs with pie install lsparrot/lsparrot, and there is a VSCode Extension. marketplace.visualstudio.com… github.com/LSParrot/ext-lspa…
ใ“ใ‚Œใฏๅ…จๆ–‡ๆคœ็ดขใ€ใƒ™ใ‚ฏใƒˆใƒซๆคœ็ดขใ€ใ‚ฐใƒฉใƒ•ใƒ‡ใƒผใ‚ฟใƒ™ใƒผใ‚นใ€ใใ—ใฆๆ™‚็ณปๅˆ—ใ‚’่€ƒๆ…ฎใ—ใŸ็ตฑๅˆๆคœ็ดขใ‚จใƒณใ‚ธใƒณใฎใ‚ˆใ†ใชใ‚‚ใฎใซใชใฃใฆใ„ใ‚‹ใฎใงใ€ๆ—ฅๆœฌ่ชžใงไฝฟใ†ๆ™‚ใฏๅฐ‘ใชใใจใ‚‚tokenizer, reranker, embedding ใ‚’ๆ—ฅๆœฌ่ชžๅฏพๅฟœใฎใ‚‚ใฎใซๅทฎใ—ๆ›ฟใˆใฆไฝฟใ†ๅฟ…่ฆใŒใ‚ใ‚Šใพใ™ใ€‚ใƒ‡ใƒ•ใ‚ฉใƒซใƒˆใ ใจๆ—ฅๆœฌ่ชžใŒใƒฉใƒณใ‚ญใƒณใ‚ฐไธŠไฝใซๅ‡บใพใ›ใ‚“ใ€‚
SQL Engine from first principles.

At its core, it does 4 simple things:
1. table: a set of rows with columns
2. select: choose which columns to return
3. where: restrict the rows returned by adding conditions
4. Return the result

SQL Query --> Tokenizer --> Parser --> Planner --> Executor --> Rows

The engine scans rows, checks the condition, fetches requested columns, and returns matching records.

Production SQL engines add:
- parser
- optimizer
- indexes
- joins
- transactions
- storage engine
- query planner
- concurrency control

But the first principle remains the same.
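The pipeline above can be sketched in a few dozen lines. This is a toy under stated assumptions, not how any production engine works: it handles only `SELECT cols FROM table [WHERE col op integer]`, skips the planner stage, and all names (`USERS`, `tokenize`, `parse`, `execute`, `run`) are hypothetical.

```python
import operator
import re

# Hypothetical in-memory table: a list of rows, each a dict of column -> value.
USERS = [
    {"id": 1, "name": "ada", "age": 36},
    {"id": 2, "name": "bob", "age": 17},
    {"id": 3, "name": "cy", "age": 52},
]

TOKEN_RE = re.compile(r"\s*(\w+|>=|<=|!=|=|<|>|[,*])")
OPS = {"=": operator.eq, "!=": operator.ne, "<": operator.lt,
       ">": operator.gt, "<=": operator.le, ">=": operator.ge}


def tokenize(sql):
    """Tokenizer: split the query string into a flat list of tokens."""
    sql, tokens, pos = sql.strip(), [], 0
    while pos < len(sql):
        match = TOKEN_RE.match(sql, pos)
        if not match:
            raise SyntaxError(f"unexpected input: {sql[pos:]!r}")
        tokens.append(match.group(1))
        pos = match.end()
    return tokens


def parse(tokens):
    """Parser: build a small plan dict from SELECT cols FROM table [WHERE col op int]."""
    if not tokens or tokens[0].upper() != "SELECT":
        raise SyntaxError("expected SELECT")
    from_idx = [t.upper() for t in tokens].index("FROM")
    columns = [t for t in tokens[1:from_idx] if t != ","]
    rest = tokens[from_idx + 2:]
    where = None
    if rest:
        if rest[0].upper() != "WHERE" or len(rest) != 4:
            raise SyntaxError("expected WHERE col op value")
        where = (rest[1], rest[2], int(rest[3]))
    return {"columns": columns, "table": tokens[from_idx + 1], "where": where}


def execute(plan, tables):
    """Executor: scan rows, apply the condition, project the requested columns."""
    rows = tables[plan["table"]]
    if plan["where"]:
        col, op, value = plan["where"]
        rows = [r for r in rows if OPS[op](r[col], value)]
    if plan["columns"] == ["*"]:
        return [dict(r) for r in rows]
    return [{c: r[c] for c in plan["columns"]} for r in rows]


def run(sql, tables):
    return execute(parse(tokenize(sql)), tables)


print(run("SELECT name, age FROM users WHERE age >= 18", {"users": USERS}))
```

Every query here is a full table scan. Indexes, joins, and an optimizer would change how `execute` finds rows, not what it returns.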
Sometimes you just don't need a tokenizer. This estimate is surprisingly good
โš”๏ธ AI wars are heating up. ๐Ÿง  Anthropic launched Claude Opus 4.7, now topping coding benchmarks and handling much larger imagesโ€”but a new tokenizer may increase token usage by up to 35%. ๐Ÿค– OpenAI responded with a major Codex upgrade, turning it into an autonomous agent workstation with computer use, browser access, multi-day automations, and 90 enterprise integrations. The race is no longer about chatbotsโ€”it's about building the best AI coworker. Which ecosystem are you betting on: Anthropic or OpenAI? hashtag#AI hashtag#ArtificialIntelligence hashtag#TechTrends Reference article: lnkd.in/gCTgGa7i
┌──────────────────────┬──────────────────────────┬──────────────────────────────────────────────┐
│                      │                          │ Recent Activity                              │
│                      │  Welcome back, Meaghan!  │ 1m ago  Updated tokenizer (mistral-common)   │
│ /inference           │                          │ 8m ago  Refactored safety filters            │
│ /finetune            │         (ฅ•ω•ฅ)          │ 2d ago  New inference task added to memory   │
│ /help for commands   │                          │ 1w ago  Benchmarks for large context updated │
│                      │      Le Gros Chaton      ├──────────────────────────────────────────────┤
│                      │                          │ What's New                                   │
│                      │                          │ /finetune for custom datasets                │
│                      │                          │ ctrl s to search history                     │
│                      │                          │ Updated context handling                     │
└──────────────────────┴──────────────────────────┴──────────────────────────────────────────────┘
>mistral@Sandman:~$ █
Le Gros Chaton (32K context with extra fluff) | meaghan
──────────────────────────────────────────────────────────────
© Mistral AI | blaze speed • /effort milestone complete | fine-tuning progress [█████████░░░░░░░░] 42% 🐾
The range is so wide because results depend on the tokenizer and the JSON type, and there are a lot of possible combinations. I also used a benchmark of only 35 examples covering multiple kinds of JSON data. Right now I'm building a better benchmark and I'll publish it tonight.

As for JSON structured outputs, we're just changing the encoding, and RAIF round-trips losslessly back to JSON. The only problem is that existing integrations and harnesses don't speak RAIF. I'm working on a vLLM plugin or something similar to act as middleware that converts RAIF to JSON at the serving layer, so harnesses won't need tuning.

Our LoRA completely changes JSON output to RAIF, so the benchmark shows the relevant metrics, excluding prose, JSON markings, and other stray output.
we're building von0.5B (idk why i picked this name but it sounds cool loll): a ~500M parameter small language model trained from scratch.

the goal is not just another LoRA adapter. this is a standalone small coding model pipeline: dataset staging, tokenizer training, scratch pretraining, SFT, ORPO preference optimization, and benchmark gating before any performance claims.

current progress:
- staged an 80k-row coding mixture on Kaggle (data is smallll, gathering more hopefully)
- mounted curated external coding datasets into a reusable training dataset
- validated a scratch pilot end-to-end
- launched the full von500m pretraining run on Kaggle 2xT4 using the staged mixture

the focus is high coding performance per parameter, with edge/phone usability as a secondary deployment target. no outputs yet but i'm "building in public".

why? i needed to be able to run models on my phone, but current ones keep outputting garbage due to heavy quantization. so i thought to build one myself from scratch. definitely not an easy task but a good one.
rt machine 🇺🇦 retweeted
"you have watched Karpathy building a GPT tokenizer on YouTube after 8pm, have you not? You have watched it on your iPad, haven't you?"
🚨 NEW: Keir Starmer will introduce nightly social media curfews for 16 and 17-year-olds as part of the Government's social media ban [@thetimes]
@superactro 67 TOPS at 7-15W is a solid benchmark for the Orin Nano. The real unlock for edge AI isn't just the inference speed; it's being able to run the full perception pipeline (tokenizer, model, post-processing) inside a thermal envelope that survives a factory floor. Have you tested sustained inference with ClawBox under 24/7 thermal cycling? Would be curious about your quantization strategy for 7B models on that power budget.
If you build with or evaluate LLMs, my new post is for you.

Tokenization sounds like a boring preprocessing detail. In practice it decides what a model can do, how much it costs to run, and why it fails on tasks you'd expect it to handle.

The post covers, with code and real numbers:
→ Why vocabulary size is a real design decision (32K → 128K → 256K)
→ How much of a model is just the embedding table (31% of GPT-2, ~7% of a 7B)
→ Weight tying: which models share the embedding and LM head, and which don't
→ Why multilingual cost varies so much per language
→ Why you can't swap a tokenizer without retraining from scratch

There's a companion Kaggle notebook so you can check every number yourself. Part 3 of my genAI Fundamentals series.

Link: buff.ly/Zm5udFW
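The embedding-table figures quoted above can be roughly checked with one multiplication and one division. This sketch is not from the linked post or its notebook. The GPT-2 numbers are the commonly reported configuration of the smallest GPT-2 model, while the 7B-class configuration is an illustrative assumption, since the post does not say which 7B model it means.

```python
def embedding_share(vocab_size, d_model, total_params):
    """Fraction of all parameters held by one vocab_size x d_model embedding table."""
    return vocab_size * d_model / total_params


# GPT-2 small: 50,257-token vocabulary, 768-dim embeddings, 124,439,808 parameters.
gpt2 = embedding_share(50_257, 768, 124_439_808)

# Assumed 7B-class model: 128K vocabulary, d_model 4096, 7e9 parameters (illustrative only).
seven_b = embedding_share(128_000, 4_096, 7_000_000_000)

print(f"GPT-2 small: {gpt2:.1%}")   # 31.0%
print(f"7B-class:    {seven_b:.1%}")  # 7.5%
```

If the LM head is not tied to the input embedding, the vocabulary-dependent share is about twice the single-table figure.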
really? thought i saw the qwen tokenizer back then gonna check again
llama.cpp b9637 added a dedicated Cohere2MoE / North Code chat parser. This is the unglamorous part of local model support: the runtime absorbs tokenizer and chat-template quirks so users are not debugging prompt formats by hand.
midwest liberals find out i'm latina and try to tokenize me but unfortunately for them i had already tokenized them first. you can't tokenize the tokenizer. science.
Surface-Form Neural Sparse Retrieval: Robust Fuzzy Matching for Industrial Music Search

Amazon presents an inference-free sparse retrieval system for music search that uses a granular subword tokenizer to robustly match misspelled and varied queries.

📝 amazon.science/publications/…
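To see why fine-grained subword units help with misspellings, here is a deliberately simple stand-in: plain character-trigram counts compared by cosine similarity. It is not the paper's method, which learns its sparse representations. The catalog and function names are hypothetical.

```python
import math
from collections import Counter


def ngrams(text, n=3):
    """Character n-gram counts of a lowercased string padded with one space on each side."""
    s = f" {text.lower().strip()} "
    return Counter(s[i:i + n] for i in range(len(s) - n + 1))


def cosine(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[g] * b[g] for g in a.keys() & b.keys())
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0


def search(query, catalog, n=3):
    """Rank catalog titles by n-gram similarity to the query, best match first."""
    q = ngrams(query, n)
    return sorted(catalog, key=lambda title: cosine(q, ngrams(title, n)), reverse=True)


CATALOG = ["Bohemian Rhapsody", "Blinding Lights", "Bad Guy"]  # hypothetical titles
print(search("bohemain rapsody", CATALOG)[0])
```

A misspelled query shares no whole word with the intended title but still overlaps on many trigrams, so the match survives. Nothing runs at query time beyond tokenizing and counting.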