Developer-friendly, open-source AI-Native Multimodal Lakehouse github.com/lancedb/lancedb

Joined April 2023
Jun 11
Most feature engineering pipelines rewrite the full table for every new column. In Lance, adding a column writes only the new data. Blobs, embeddings, indexes untouched. Every write is a versioned commit. That's data evolution. 🔗 lancedb.com/blog/scalable-fe…
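To make the claim in the post above concrete, here is a minimal in-memory sketch of the idea: adding a column writes one new data file and one new manifest that still points at the existing files. This is a toy model only. `ToyTable`, `Manifest`, and the file-id naming are hypothetical and are not Lance's on-disk format or API.

```python
from dataclasses import dataclass, field


@dataclass(frozen=True)
class Manifest:
    """One commit: maps each column name to the id of the file holding it."""
    version: int
    columns: dict


@dataclass
class ToyTable:
    """Hypothetical toy model of a versioned columnar table (not the Lance API)."""
    files: dict = field(default_factory=dict)      # file id -> values, written once
    manifests: list = field(default_factory=list)  # one manifest per commit

    def _commit(self, columns):
        self.manifests.append(Manifest(len(self.manifests) + 1, dict(columns)))

    def create(self, data):
        columns = {}
        for name, values in data.items():
            file_id = f"{name}-v1"
            self.files[file_id] = list(values)
            columns[name] = file_id
        self._commit(columns)

    def add_column(self, name, values):
        latest = self.manifests[-1]
        file_id = f"{name}-v{latest.version + 1}"
        self.files[file_id] = list(values)  # the only data written
        self._commit({**latest.columns, name: file_id})  # old files are reused

    def read(self, version=None):
        manifest = self.manifests[-1] if version is None else self.manifests[version - 1]
        return {name: self.files[fid] for name, fid in manifest.columns.items()}


table = ToyTable()
table.create({"id": [1, 2, 3], "embedding": [[0.1], [0.2], [0.3]]})
files_before = set(table.files)

table.add_column("caption_len", [12, 7, 31])
new_files = set(table.files) - files_before  # only the new column's file
```

Reading `table.read(1)` still returns the two original columns, while `table.read()` returns all three, and the `embedding` values are the same stored object in both versions.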
Jun 10
1/ Two LanceDB sessions at @databricks #DataAISummit next week in SF (June 15-18)
Jun 10
2/ Managing Data at Exabyte Scale for AI Model Training. The full exploration-to-GPU-loading path: why single-purpose tools force teams to copy data across systems for each workflow step, and how LanceDB collapses that into one table. 🔗 databricks.com/dataaisummit/…
Jun 10
3/ From Streaming to Search: How Exa Uses Lance and Apache Spark for High-Throughput AI Workloads. Joint session with @ExaAILabs where we walk through Exa's Spark Structured Streaming pipeline, ~10K rows/second into Lance, and how the same tables power their vector search. 🔗 databricks.com/dataaisummit/…
Prashanth Rao @tech_optimist and Sarwar Bhuiyan are running a workshop at TMLS on June 19: Enhancing Training Data Pipelines with Lance and the Multimodal Lakehouse. They're covering Lance's architecture and what makes it suited for ML workloads (fast random access, native blob storage, built-in versioning), live PyTorch and Hugging Face integration examples, a 3D world-model dataset case study, and I/O benchmarks during data loading. If you're managing multimodal training data at scale and your storage, search, and training layers are still three separate systems, this one's for you.
1/ May's highlight: stable-worldmodel paper published, standardizing world model pipelines on Lance, with Lance-backed S3 streaming several times faster than HDF5 for small-batch random access.
5/ Events: Check out our 2 sessions at Data AI Summit by @databricks, June 15–18, San Francisco:
- Managing Data at Exabyte Scale for AI Model Training
- From Streaming to Search: How Exa Uses Lance and Apache Spark for High-Throughput AI Workloads
1/ 3-4x faster data loading on Push-T vs HDF5 or video formats at a fraction of the disk size. stable-worldmodel uses Lance as the data layer; here's the training walkthrough.
2/ The loader is URI-agnostic: local, s3://, gs://, hf://buckets/... all use identical code. @huggingface Buckets is first-class: Lance's fragment design and Buckets' chunk storage share deduplication primitives, so you get deduplication for free.
Lei Xu is speaking at #SnowflakeSummit this Thursday alongside Vishwa Lakkundi (Sr. Manager, Snowflake) for Apache Polaris in Practice: Open Catalogs, Open Formats, Open Community. The session covers where the open catalog layer is heading. Lance is one of the formats Polaris now supports alongside Delta and Iceberg. The direction is one catalog spec that works across every engine and every format, multimodal included. If you're at @Snowflake Summit and building on Lance or thinking about how your catalog layer handles multimodal data as the format mix expands beyond Iceberg, this is the session.
Thursday, June 4 · 1:00–1:45 PM PDT: reg.snowflake.com/flow/snowf…

Cosmos 3 by @nvidia released today, a frontier omnimodal world model for Physical AI. For the data infrastructure behind it, they built on Lance. SILA, NVIDIA's internal curation platform, processes tens of billions of multimodal training candidates as a single Lance dataset. Curation signals, embeddings, and vector indexes all in one table. No separate vector DB. One table from raw data to training-ready.
Full infrastructure section in @nvidia's Cosmos 3 technical report: research.nvidia.com/labs/cos…

1/ Most curation pipelines have no shared version history. Six months later, nobody knows which rows trained that model.
2/ On LAION-1M, queried directly from the Hugging Face Hub:
→ 1.16M rows to 604K in a single SQL predicate chain
→ pHash catches ~25% near-duplicate clusters; CLIP-feature NN matching catches the rest
→ 95/5 stratified split, identical mean similarity (0.3318) across train and test
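As a rough illustration of the pHash step in the post above, the sketch below groups rows whose 64-bit perceptual hashes differ by only a few bits. It is not the code from the linked post: the hash values, the 4-bit threshold, and the function names are assumptions chosen for the example, and the all-pairs comparison is only practical for small inputs.

```python
from __future__ import annotations

from itertools import combinations


def hamming(a: int, b: int) -> int:
    """Number of differing bits between two integer hashes."""
    return bin(a ^ b).count("1")


def near_duplicate_clusters(hashes: dict[str, int], max_distance: int = 4) -> list[set[str]]:
    """Group ids whose hashes are within max_distance bits (union-find over all pairs)."""
    parent = {key: key for key in hashes}

    def find(x: str) -> str:
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path halving
            x = parent[x]
        return x

    for a, b in combinations(hashes, 2):
        if hamming(hashes[a], hashes[b]) <= max_distance:
            parent[find(a)] = find(b)

    clusters: dict[str, set[str]] = {}
    for key in hashes:
        clusters.setdefault(find(key), set()).add(key)
    # Singletons are not duplicates, so only multi-member groups are returned.
    return [members for members in clusters.values() if len(members) > 1]


# Hypothetical precomputed 64-bit perceptual hashes.
hashes = {
    "img_a": 0xF0F0F0F0F0F0F0F0,
    "img_b": 0xF0F0F0F0F0F0F0F1,  # 1 bit away from img_a
    "img_c": 0x0F0F0F0F0F0F0F0F,  # far from every other hash
    "img_d": 0xF0F0F0F0F0F0F0F3,  # 2 bits away from img_a
}
clusters = near_duplicate_clusters(hashes)
```

Keeping one representative per cluster removes the exact and near-exact copies; images that are similar in content but not in pixels need a feature-based pass, which is where the post's CLIP nearest-neighbor matching comes in.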
3/ Every write versions automatically. Tag the baseline, tighten the threshold, and both versions stay on disk. Open v1 and you get exactly the 604K rows the original run trained on. Full code: lancedb.com/blog/reproducibl…