Lightning-fast unified analytics engine

Joined June 2013
For a long time, streaming architecture advice boiled down to two engines: one for high-throughput ETL, another when you need millisecond latency. At Data Engineering Open Forum 2026, Indrajit Roy (@databricks) walked through how Apache Spark Structured Streaming took a different path from day one: micro-batch processing. 🔸 Micro-batch model: Records arrive on a stream; the engine waits briefly, forms a batch, processes it, then repeats. 🔸 Batch query on stream slices: Each step is effectively running a small batch query over the latest slice of data. 🔸 One engine, different tradeoffs: The design challenges the “two streaming engines” default instead of accepting it as fixed. Watch the full keynote: lnkd.in/eN6nErib #ApacheSpark #StructuredStreaming #DataEngineering #OpenSource
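A minimal sketch of the micro-batch loop described above, in plain Python rather than Spark. The helper names `micro_batches` and `run_query` are hypothetical, and batches are cut by record count here as a simplifying assumption (the post describes the engine waiting briefly, i.e. a time-based cut).

```python
from itertools import islice
from typing import Callable, Iterable, Iterator, List


def micro_batches(stream: Iterable[int], batch_size: int) -> Iterator[List[int]]:
    """Cut an incoming stream into small batches (hypothetical helper, not a Spark API)."""
    it = iter(stream)
    while True:
        # Assumption: cut by record count; a real engine would cut by a trigger interval.
        batch = list(islice(it, batch_size))
        if not batch:
            return
        yield batch


def run_query(
    stream: Iterable[int],
    batch_size: int,
    batch_query: Callable[[List[int]], int],
) -> List[int]:
    """Run the same small batch query over each successive slice of the stream."""
    return [batch_query(batch) for batch in micro_batches(stream, batch_size)]


# Seven records, batches of three: the query (a sum) runs once per slice.
results = run_query(range(1, 8), batch_size=3, batch_query=sum)
print(results)
```

Each element of `results` is the output of one small batch query, which is the "batch query on stream slices" idea in miniature.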
#DataAISummit Session Spotlight ➡️ Learn how to build agentic workflows with OSS Spark Declarative Pipelines, with patterns for deterministic, testable, production-ready data workflows. 🗓️ June 15–18 📍 San Francisco 🔗 Session details: databricks.com/dataaisummit/… #ApacheSpark #DataAISummit
#DataAISummit (June 15-18) Session Spotlight 👇 Get a year in review and the roadmap for Apache Spark Structured Streaming in open source: what's shipping in Spark 4.1 and what's ahead in 4.2 for mission-critical streaming ingestion and ETL pipelines. Jerry Peng and Anish Shrigondekar (@databricks) will cover recent advances and what's next! 🔗 Details: databricks.com/dataaisummit/… #ApacheSpark #DataAISummit #StructuredStreaming #DataEngineering
At DEOF 2026, Indrajit Roy (@databricks) opened with a keynote on how Apache Spark Structured Streaming innovated on throughput, latency, and flexibility, and what that means for data engineers in 2026. 👇 Real-time isn’t just for streaming specialists anymore. Express the logic. Let the engine handle the rest. 📹 Full video: youtu.be/VLJhGDwTS3I #ApacheSpark
#DataAISummit Session Spotlight 👇 Apache Spark™ 4.2: unified batch + streaming for AI workloads: feature pipelines, multimodal data, planner-level optimizations. 🎤 DB Tsai & Xiao Li | 🗓️ June 15-18 | 📍 San Francisco Session details: databricks.com/dataaisummit/… #ApacheSpark #DataAISummit
#DataAISummit Session Spotlight 👇 Andreas Neumann and @lisancao will cover Spark Declarative Pipelines (4.1). Declare what your pipeline does, and Spark manages execution, parallelization, checkpoints, and failure recovery. 🗓️ June 15–18 | 📍 San Francisco 🔗 Session details: databricks.com/dataaisummit/… 🎟️ Register: dataaisummit.databricks.com/… #ApacheSpark #DataAISummit
For a decade, “streaming on Spark” meant micro-batches. Fine for ETL. A wall if your latency budget was under a second. Spark 4.1 stops that. Real-Time Mode (SPARK-50708) 👇
How: • Continuous execution: long-lived tasks • Simultaneous scheduling: stage N+1 starts on stage N’s first record • Streaming shuffle: in-memory handoff, no batch boundary
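To make the scheduling difference concrete, here is a toy sketch in plain Python, not Real-Time Mode itself. The two stage functions and the event logs are hypothetical; Python generators stand in for long-lived tasks with an in-memory handoff, which is an assumption made purely for illustration.

```python
from typing import Iterable, Iterator, List


def stage_one(records: Iterable[int], log: List[str]) -> Iterator[int]:
    """Upstream stage: emits each record as soon as it is processed."""
    for r in records:
        log.append(f"s1:{r}")
        yield r * 10


def stage_two(records: Iterable[int], log: List[str]) -> Iterator[int]:
    """Downstream stage: consumes whatever upstream hands over."""
    for r in records:
        log.append(f"s2:{r}")
        yield r + 1


# With a batch boundary: stage one's full output is materialized first.
batch_log: List[str] = []
staged = list(stage_one([1, 2], batch_log))
batch_out = list(stage_two(staged, batch_log))

# Pipelined: stage two starts on stage one's first record, no boundary.
stream_log: List[str] = []
stream_out = list(stage_two(stage_one([1, 2], stream_log), stream_log))

print(batch_log)   # stage one finishes before stage two begins
print(stream_log)  # the two stages interleave record by record
```

Both runs produce the same output; only the order of work changes, which is where the latency difference comes from.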
Stateless: 4.1. Stateful RTM: upstream. If you have a streaming workload that "shouldn't be on Spark" because it needed ms, pull the RC and try it. The next move is yours.
Agent-written Spark can pass static checks and a 10K-row sample, then fail at hour three. @lisancao breaks down how Spark 4.1 addresses that, with three patterns worth knowing 👇 🔹 SDP: declare intent, not triggers/checkpoints 🔹 RTM: one engine for sub-sec + batch 🔹 Connect: pyspark-client; prod = URL change 🔗 Read more: medium.com/apache-spark/apac… #ApacheSpark
#DataAISummit Session Spotlight 👇 Structured Streaming: year in review + roadmap. Real-Time Mode, stateful transforms, Spark 4.2 ahead. 🎤 Jerry Peng & Anish Shrigondekar 🗓️ June 15–18 📍 San Francisco 🔗 Details: databricks.com/dataaisummit/… 🎟️ Register: dataaisummit.databricks.com/…
#DataAISummit Session Spotlight 👇 Spark 4.1 introduces Spark Declarative Pipelines (SDP). Declare datasets and transformations. Spark manages the execution plan. Less boilerplate. Faster path to production. The session covers dependency resolution, checkpoint coordination, failure recovery, incremental processing, and testing patterns. 🎤 Andreas Neumann & Lisa Cao 📆 June 15-18 📍 San Francisco Session details: databricks.com/dataaisummit/… #ApacheSpark #DataAISummit #DataEngineering #Spark
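A rough illustration of "declare datasets, let the engine resolve the plan", in plain Python. This is not the SDP API: the `dataset` decorator, the `REGISTRY` dict, and `run_pipeline` are all hypothetical names, and the sample rows are made up for the sketch. Dependency resolution is done with the standard library's `graphlib`.

```python
from graphlib import TopologicalSorter
from typing import Callable, Dict, List, Tuple

# Hypothetical registry: dataset name -> (upstream names, function that builds it).
REGISTRY: Dict[str, Tuple[Tuple[str, ...], Callable[..., list]]] = {}


def dataset(*deps: str):
    """Hypothetical decorator: declare a dataset and what it reads from."""
    def register(fn: Callable[..., list]) -> Callable[..., list]:
        REGISTRY[fn.__name__] = (deps, fn)
        return fn
    return register


# Declared in "wrong" order on purpose: the runner works out execution order.
@dataset("valid_orders")
def revenue(valid_orders: list) -> list:
    return [sum(o["amount"] for o in valid_orders)]


@dataset("raw_orders")
def valid_orders(raw_orders: list) -> list:
    return [o for o in raw_orders if o["amount"] > 0]


@dataset()
def raw_orders() -> list:
    # Made-up sample rows for the sketch.
    return [{"id": 1, "amount": 40}, {"id": 2, "amount": -5}, {"id": 3, "amount": 60}]


def run_pipeline() -> Tuple[List[str], Dict[str, list]]:
    """Resolve dependencies, then build each dataset after its inputs."""
    graph = {name: set(deps) for name, (deps, _) in REGISTRY.items()}
    order = list(TopologicalSorter(graph).static_order())
    results: Dict[str, list] = {}
    for name in order:
        deps, fn = REGISTRY[name]
        results[name] = fn(*(results[d] for d in deps))
    return order, results


order, results = run_pipeline()
print(order)
print(results["revenue"])
```

The author of the pipeline only states what each dataset is and what it reads; ordering is derived, which is the part the session calls dependency resolution.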
#DataAISummit Session Spotlight 👇 Apache Spark™ 4.2: unified batch + streaming for AI workloads: feature pipelines, multimodal data, planner-level optimizations. 🎤 DB Tsai & Xiao Li | 🗓️ June 15–18 | 📍 San Francisco 🔗 Session details: databricks.com/dataaisummit/… #ApacheSpark
Spark 4.1 for agents 👇 🔹 SDP: triggers/checkpoints/DAG off the agent; dry-run fails fast 🔹 RTM: sub-second + batch, one engine (stateless in 4.1) 🔹 Connect: pyspark-client, no local JVM; sandbox→prod = URL Agent owns intent. Spark absorbs the rest. 🔗 Read more: medium.com/apache-spark/apac… #ApacheSpark #DataEngineering
Apache Spark 4.1 is out today. 🚀 AI data agents are now common in data engineering. They're also a real risk in production: tool sprawl and the glue code required to run real pipelines create a huge surface area for silent errors. The cost is wasted time and wasted compute on jobs you only notice are broken three hours into a four-hour run. Three architectural changes in 4.1 shrink that surface area. 1️⃣ Spark Declarative Pipelines (SDP) 2️⃣ Real-Time Mode 3️⃣ Spark Connect + Project Feather Three architectural changes. One platform shape. Fewer surfaces for the agent to drift on. Less technical debt as you ship. 👉 Get started: spark.apache.org/downloads #ApacheSpark #DataEngineering #OSS #AIagents
#DataAISummit Session Spotlight (June 15–18 | San Francisco) 👇 What's New in Apache Spark™ 4.1? 🔧 Spark Declarative Pipelines (SDP) ⚡ Structured Streaming Real-Time Mode 🐍 PySpark 🔗 Spark Connect & SQL Session details: databricks.com/dataaisummit/… Register: dataaisummit.databricks.com/… #ApacheSpark #DataAISummit
Apache Spark is great at petabytes. It can be heavy at 100 megabytes. Project Feather is a new SPIP to fix that. 👇 Three lines of work, all targeting Spark in local mode: 1️⃣ Compilation and scheduling. Skip unnecessary shuffles when the planner knows a scan is one file. Mark it SinglePartition and let the next aggregate run in place. 2️⃣ Arrow-based df.cache. Swap the row-oriented cache for Apache Arrow IPC. Columnar, compressed, iterable. 3️⃣ Shuffle-free execution. On a single node, replace blocking shuffle with in-process channels and Java virtual threads. No disk round-trip. Prototype today: a filter-and-sort query on a small in-memory table runs in 150 ms instead of 330 ms. One stage instead of two. The win compounds as the optimizations stack. 🔗 Project Feather: docs.google.com/document/d/1… The SPIP is open for comment. Pull the prototype, run it against your hardest small-data pipeline, file the bug we missed. ✍ Authors: Daniel Tenedorio and Liang-Chi Hsieh. #ApacheSpark #SPIP #OpenSource #DataEngineering #ApacheArrow
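The first item above (skip the shuffle when the scan is known to be a single partition) can be sketched as a toy planning rule in plain Python. This is not Spark's planner and not the Feather prototype: `plan_aggregate`, `count_stages`, and the step names are hypothetical, and counting one extra stage per shuffle is a simplifying assumption.

```python
from typing import List


def plan_aggregate(num_scan_partitions: int) -> List[str]:
    """Toy planner: decide whether an aggregate needs a shuffle (hypothetical)."""
    if num_scan_partitions < 1:
        raise ValueError("a scan needs at least one partition")
    if num_scan_partitions == 1:
        # All rows already sit in one partition: aggregate in place.
        return ["scan", "aggregate"]
    # Rows are spread out: aggregate partially, exchange, then finish.
    return ["scan", "partial_aggregate", "shuffle", "final_aggregate"]


def count_stages(plan: List[str]) -> int:
    """Assumption for this sketch: each shuffle adds one stage boundary."""
    return 1 + plan.count("shuffle")


single = plan_aggregate(1)
multi = plan_aggregate(8)
print(single, count_stages(single))
print(multi, count_stages(multi))
```

In this toy, the single-partition plan comes out as one stage and the multi-partition plan as two, mirroring the "one stage instead of two" shape the post reports for its prototype query.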
#DataAISummit Session Spotlight 👇 Faster, Leaner, and Easier to Debug: PySpark UDFs in 2026. At Data + AI Summit, Tian Gao and Yicong Huang will cover Arrow-based execution and improved debuggability for PySpark UDFs, including Native Arrow UDFs/UDTFs and built-in faulthandler profiling. 📍 June 15–18 · SF Add to your agenda: databricks.com/dataaisummit/… #ApacheSpark #PySpark #DataAISummit #DataEngineering #OpenSource