Most feature engineering pipelines rewrite the full table for every new column.
In Lance, adding a column writes only the new data. Blobs, embeddings, indexes untouched. Every write is a versioned commit.
That's data evolution.
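To make the "writes only the new data" idea concrete, here is a minimal pure-Python sketch. It is a toy model, not Lance's actual storage format or API: `ToyColumnarTable`, its file naming, and the manifest layout are all hypothetical. The assumption is simply that each column lives in its own immutable file and each commit records which files make up that version.

```python
from dataclasses import dataclass, field


@dataclass
class ToyColumnarTable:
    """Toy model (not Lance): one immutable 'file' per column, one manifest per version."""

    files: dict = field(default_factory=dict)      # file name -> list of values
    manifests: list = field(default_factory=list)  # index i holds version i + 1

    def _commit(self, new_columns: dict) -> list:
        """Write one file per new column; the new manifest reuses all older files."""
        version = len(self.manifests) + 1
        manifest = dict(self.manifests[-1]) if self.manifests else {}
        written = []
        for name, values in new_columns.items():
            file_name = f"{name}.v{version}.col"  # hypothetical naming scheme
            self.files[file_name] = list(values)
            manifest[name] = file_name
            written.append(file_name)
        self.manifests.append(manifest)
        return written

    def create(self, columns: dict) -> list:
        return self._commit(columns)

    def add_column(self, name: str, fn) -> list:
        """Derive a column from the current rows and commit only that column."""
        return self._commit({name: [fn(row) for row in self.read()]})

    def read(self, version=None) -> list:
        """Materialize rows for a version (latest when version is None)."""
        manifest = self.manifests[(version or len(self.manifests)) - 1]
        cols = {c: self.files[f] for c, f in manifest.items()}
        n_rows = len(next(iter(cols.values())))
        return [{c: v[i] for c, v in cols.items()} for i in range(n_rows)]


table = ToyColumnarTable()
table.create({"id": [1, 2, 3], "caption": ["a cat", "a dog", "a red car"]})
written = table.add_column("caption_len", lambda row: len(row["caption"]))

print(written)                   # only the new column's file was written
print(table.read(version=1)[0])  # version 1 still has no caption_len
print(table.read()[0])           # latest version sees all three columns
```

In this sketch the second manifest points at the same `id` and `caption` files as the first, so adding a feature column never rewrites existing data, and older versions remain readable.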
lancedb.com/blog/scalable-fe…
2/ Managing Data at Exabyte Scale for AI Model Training
The full exploration-to-GPU-loading path: why single-purpose tools force teams to copy data across systems for each workflow step, and how LanceDB collapses that into one table.
databricks.com/dataaisummit/…
3/ From Streaming to Search: How Exa Uses Lance and Apache Spark for High-Throughput AI Workloads
Joint session with @ExaAILabs where we walk through Exa's Spark Structured Streaming pipeline, ~10K rows/second into Lance, and how the same tables power their vector search.
databricks.com/dataaisummit/…
Prashanth Rao @tech_optimist and Sarwar Bhuiyan are running a workshop at TMLS on June 19.
Enhancing Training Data Pipelines with Lance and the Multimodal Lakehouse
They're covering Lance's architecture and what makes it suited for ML workloads (fast random access, native blob storage, built-in versioning), live PyTorch and Hugging Face integration examples, a 3D world-model dataset case study, and I/O benchmarks during data loading.
If you're managing multimodal training data at scale and your storage, search, and training layers are still three separate systems, this one's for you.
1/ May's highlight: stable-worldmodel paper published, standardizing world model pipelines on Lance, with Lance-backed S3 streaming several times faster than HDF5 for small-batch random access.
5/ Events:
Check out our two sessions at Data + AI Summit by @databricks, June 15-18, San Francisco
- Managing Data at Exabyte Scale for AI Model Training
- From Streaming to Search: How Exa Uses Lance and Apache Spark for High-Throughput AI Workloads
1/ 3-4x faster data loading on Push-T vs HDF5 or video formats at a fraction of the disk size. stable-worldmodel uses Lance as the data layer. Here's the training walkthrough.
2/ The loader is URI-agnostic: local, s3://, gs://, hf://buckets/... all use identical code. @huggingface Buckets is first-class โ Lance's fragment design and Buckets' chunk storage share deduplication primitives, so you get it for free.
Lei Xu is speaking at #SnowflakeSummit this Thursday alongside Vishwa Lakkundi (Sr. Manager, Snowflake) for Apache Polaris in Practice: Open Catalogs, Open Formats, Open Community.
The session covers where the open catalog layer is heading. Lance is one of the formats Polaris now supports alongside Delta and Iceberg. The direction is one catalog spec that works across every engine and every format, multimodal included.
If you're at @Snowflake Summit and building on Lance or thinking about how your catalog layer handles multimodal data as the format mix expands beyond Iceberg, this is the session.
Cosmos 3 by @nvidia released today: a frontier omnimodal world model for Physical AI.
For the data infrastructure behind it, they built on Lance.
SILA, NVIDIA's internal curation platform, processes tens of billions of multimodal training candidates as a single Lance dataset. Curation signals, embeddings, and vector indexes all in one table. No separate vector DB.
One table from raw data to training-ready.
2/ On LAION-1M, queried directly from the Hugging Face Hub:
- 1.16M rows to 604K in a single SQL predicate chain
- pHash catches ~25% of the near-duplicate clusters; CLIP-feature NN matching catches the rest
- 95/5 stratified split, identical mean similarity (0.3318) across train and test
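The hash-based dedup and the stratified split can be sketched in plain Python. This is an illustration under stated assumptions, not the code from the post: the 16-bit hashes, the `max_distance` threshold, the "high"/"low" strata, and both function names are hypothetical, and the greedy pass is quadratic, so it only conveys the idea at toy scale. It covers the perceptual-hash step only, not the CLIP nearest-neighbour matching.

```python
import random
from collections import defaultdict


def hamming(a: int, b: int) -> int:
    """Number of differing bits between two non-negative integer hashes."""
    return bin(a ^ b).count("1")


def drop_near_duplicates(hashes: dict, max_distance: int = 4) -> list:
    """Greedy pass: keep an item only if every kept item is more than max_distance bits away."""
    kept = []
    for key, h in hashes.items():
        if all(hamming(h, hashes[k]) > max_distance for k in kept):
            kept.append(key)
    return kept


def stratified_split(labels: dict, test_fraction: float = 0.05, seed: int = 0):
    """Split keys so each stratum contributes about test_fraction of its rows to test."""
    rng = random.Random(seed)
    by_label = defaultdict(list)
    for key, label in labels.items():
        by_label[label].append(key)
    train, test = [], []
    for label in sorted(by_label):
        keys = sorted(by_label[label])
        rng.shuffle(keys)
        n_test = round(len(keys) * test_fraction)
        test.extend(keys[:n_test])
        train.extend(keys[n_test:])
    return train, test


# Hypothetical perceptual hashes (real pHashes are typically 64-bit).
hashes = {
    "img_a": 0b1111000011110000,
    "img_b": 0b1111000011110001,  # 1 bit away from img_a: near-duplicate
    "img_c": 0b0000111100001111,  # 16 bits away from img_a: distinct
}
kept = drop_near_duplicates(hashes)
print(kept)

# Hypothetical strata, e.g. buckets of image-text similarity.
labels = {f"row_{i}": ("high" if i % 2 else "low") for i in range(200)}
train, test = stratified_split(labels)
print(len(train), len(test))
```

Splitting within each stratum is what keeps summary statistics such as mean similarity close between train and test.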
3/ Every write versions automatically. Tag the baseline, tighten the threshold, both versions stay on disk. Open v1 and you get exactly the 604K rows the original run trained on. Full code: lancedb.com/blog/reproducibl…
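The tag-then-tighten workflow can be modelled with a toy append-only version log. Everything here is hypothetical (`ToyVersionedRows`, the scores, the thresholds), and it stores full snapshots for simplicity, which a real format would avoid by sharing data files between versions.

```python
class ToyVersionedRows:
    """Toy append-only log (not Lance): every write adds a snapshot, none are overwritten."""

    def __init__(self, rows):
        self._snapshots = [tuple(rows)]  # index 0 holds version 1
        self.tags = {}                   # tag name -> version number

    @property
    def latest(self) -> int:
        return len(self._snapshots)

    def filter(self, predicate) -> int:
        """Commit a new version with only the rows that pass; return its number."""
        current = self._snapshots[-1]
        self._snapshots.append(tuple(r for r in current if predicate(r)))
        return self.latest

    def tag(self, name: str, version: int) -> None:
        self.tags[name] = version

    def checkout(self, ref):
        """Resolve a tag name or a version number to its immutable snapshot."""
        version = self.tags[ref] if isinstance(ref, str) else ref
        return self._snapshots[version - 1]


# Hypothetical similarity scores standing in for the baseline rows.
data = ToyVersionedRows([0.28, 0.31, 0.35, 0.40])  # version 1
data.tag("baseline", 1)
v2 = data.filter(lambda score: score >= 0.30)      # tighter threshold -> version 2

print(data.checkout("baseline"))  # the exact rows the first run saw
print(data.checkout(v2))          # the tightened set
```

Because the tag pins a version number, a later experiment can tighten the threshold without changing what "baseline" resolves to.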