For a long time, streaming architecture advice boiled down to two engines: one for high-throughput ETL, another when you need millisecond latency.
At Data Engineering Open Forum 2026, Indrajit Roy (@databricks) walked through how Apache Spark Structured Streaming took a different path from day one: micro-batch processing.
🔸 Micro-batch model: Records arrive on a stream; the engine waits briefly, forms a batch, processes it, then repeats.
🔸 Batch query on stream slices: Each step is effectively running a small batch query over the latest slice of data.
🔸 One engine, different tradeoffs: The design challenges the “two streaming engines” default instead of accepting it as fixed.
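A minimal sketch of the micro-batch loop described above, in plain Python rather than Spark. The names `micro_batches`, `count_by_user`, and `run_stream` are hypothetical, and batches are cut by record count instead of a time-based wait so the example stays deterministic.

```python
from collections import Counter
from typing import Dict, Iterable, Iterator, List


def micro_batches(stream: Iterable[Dict], batch_size: int) -> Iterator[List[Dict]]:
    """Collect incoming records into small batches (stand-in for 'wait briefly, form a batch')."""
    batch: List[Dict] = []
    for record in stream:
        batch.append(record)
        if len(batch) == batch_size:
            yield batch
            batch = []
    if batch:  # flush the final, possibly smaller, slice
        yield batch


def count_by_user(batch: List[Dict]) -> Counter:
    """A small 'batch query' that only sees the latest slice of data."""
    return Counter(record["user"] for record in batch)


def run_stream(stream: Iterable[Dict], batch_size: int) -> Counter:
    """Process, then repeat: fold each slice's result into the running totals."""
    totals: Counter = Counter()
    for batch in micro_batches(stream, batch_size):
        totals.update(count_by_user(batch))
    return totals


events = [{"user": "a"}, {"user": "b"}, {"user": "a"}, {"user": "c"}, {"user": "a"}]
batches = list(micro_batches(events, batch_size=2))
totals = run_stream(events, batch_size=2)
```

The batch size is the latency knob in this toy version: smaller slices mean results arrive sooner, at the cost of more per-batch overhead.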
Watch the full keynote: lnkd.in/eN6nErib #ApacheSpark #StructuredStreaming #DataEngineering #OpenSource
#DataAISummit Session Spotlight ⚡️ Learn how to build agentic workflows with OSS Spark Declarative Pipelines, with patterns for deterministic, testable, production-ready data workflows.
🗓️ June 15–18
San Francisco
Session details: databricks.com/dataaisummit/… #ApacheSpark #DataAISummit
At DEOF 2026, Indrajit Roy (@databricks) opened with a keynote on how Apache Spark Structured Streaming innovated on throughput, latency, and flexibility, and what that means for data engineers in 2026.
Real-time isn’t just for streaming specialists anymore. Express the logic. Let the engine handle the rest.
📹 Full video: youtu.be/VLJhGDwTS3I #ApacheSpark
#DataAISummit Session Spotlight
Apache Spark™ 4.2: unified batch and streaming for AI workloads: feature pipelines, multimodal data, planner-level optimizations.
🎤 DB Tsai & Xiao Li | 🗓️ June 15-18 | San Francisco
Session details: databricks.com/dataaisummit/… #ApacheSpark #DataAISummit
For a decade, “streaming on Spark” meant micro-batches. Fine for ETL. A wall if your latency budget was under a second.
Spark 4.1 changes that with Real-Time Mode (SPARK-50708).
Stateless: 4.1. Stateful RTM: upstream.
If you have a streaming workload that "shouldn't be on Spark" because it needed ms, pull the RC and try it. The next move is yours.
Agent-written Spark can pass static checks and a 10K-row sample, then fail at hour three.
@lisancao breaks down how Spark 4.1 addresses that, with three patterns worth knowing:
🔹 SDP: declare intent, not triggers/checkpoints
🔹 RTM: one engine for sub-sec batch
🔹 Connect: pyspark-client; prod = URL change
Read more: medium.com/apache-spark/apac… #ApacheSpark
#DataAISummit Session Spotlight
Spark 4.1 introduces Spark Declarative Pipelines (SDP). Declare datasets and transformations. Spark manages the execution plan. Less boilerplate. Faster path to production.
The session covers dependency resolution, checkpoint coordination, failure recovery, incremental processing, and testing patterns.
🎤 Andreas Neumann & Lisa Cao
June 15-18
San Francisco
Session details: databricks.com/dataaisummit/… #ApacheSpark #DataAISummit #DataEngineering #Spark
Spark 4.1 for agents
🔹 SDP: triggers/checkpoints/DAG off the agent; dry-run fails fast
🔹 RTM: sub-second batch, one engine (stateless in 4.1)
🔹 Connect: pyspark-client, no local JVM; sandbox→prod = URL
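A rough sketch of the "sandbox→prod = URL" idea with Spark Connect, assuming `pyspark-client` (or full `pyspark`) is installed when `get_session` is called. The endpoint hostnames and the helper names `connect_url` and `get_session` are hypothetical; `SparkSession.builder.remote(...)` is the Spark Connect entry point.

```python
# Hypothetical endpoints; only the URL differs between environments.
CONNECT_URLS = {
    "sandbox": "sc://localhost:15002",
    "prod": "sc://spark-connect.example.internal:15002",
}


def connect_url(env: str) -> str:
    """Pick the Spark Connect URL for an environment name."""
    return CONNECT_URLS[env]


def get_session(env: str):
    """Open a Spark Connect session; no local JVM is started by the client."""
    # Imported lazily so the URL logic can be used without pyspark installed.
    from pyspark.sql import SparkSession

    return SparkSession.builder.remote(connect_url(env)).getOrCreate()
```

Pipeline code written against the session returned by `get_session("sandbox")` should, under these assumptions, run unchanged against `get_session("prod")`.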
Agent owns intent. Spark absorbs the rest.
Read more: medium.com/apache-spark/apac… #ApacheSpark #DataEngineering
Apache Spark 4.1 is out today.
AI data agents are now common in data engineering. They're also a real risk in production: tool sprawl and the glue code required to run real pipelines create a huge surface area for silent errors. The cost is wasted time and wasted compute on jobs you only notice are broken three hours into a four-hour run.
Three architectural changes in 4.1 shrink that surface area.
1️⃣ Spark Declarative Pipelines (SDP)
2️⃣ Real-Time Mode
3️⃣ Spark Connect Project Feather
Three architectural changes. One platform shape. Fewer surfaces for the agent to drift on. Less technical debt as you ship.
Get started: spark.apache.org/downloads #ApacheSpark #DataEngineering #OSS #AIagents
Apache Spark is great at petabytes. It can be heavy at 100 megabytes. Project Feather is a new SPIP to fix that.
Three lines of work, all targeting Spark in local mode:
1️⃣ Compilation and scheduling. Skip unnecessary shuffles when the planner knows a scan is one file. Mark it SinglePartition and let the next aggregate run in place.
2️⃣ Arrow-based df.cache. Swap the row-oriented cache for Apache Arrow IPC. Columnar, compressed, iterable.
3️⃣ Shuffle-free execution. On a single node, replace blocking shuffle with in-process channels and Java virtual threads. No disk round-trip.
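To make the first item concrete, here is a toy model in plain Python of why a single-partition input lets an aggregate skip the shuffle. This is not Spark's planner or the Feather prototype; `sum_by_key` and its helpers are hypothetical names, and the returned stage count is only illustrative.

```python
from collections import defaultdict
from typing import Dict, List, Tuple

Row = Tuple[str, int]


def aggregate_in_place(partition: List[Row]) -> Dict[str, int]:
    """Sum values per key within one partition; no data movement needed."""
    sums: Dict[str, int] = defaultdict(int)
    for key, value in partition:
        sums[key] += value
    return dict(sums)


def shuffle_by_key(partitions: List[List[Row]], num_out: int) -> List[List[Row]]:
    """Redistribute rows so that every key lands in exactly one output partition."""
    out: List[List[Row]] = [[] for _ in range(num_out)]
    for partition in partitions:
        for key, value in partition:
            out[hash(key) % num_out].append((key, value))
    return out


def sum_by_key(partitions: List[List[Row]], num_out: int = 2) -> Tuple[Dict[str, int], int]:
    """Return (result, stages). One input partition means the shuffle can be skipped."""
    if len(partitions) == 1:
        return aggregate_in_place(partitions[0]), 1
    result: Dict[str, int] = {}
    for partition in shuffle_by_key(partitions, num_out):
        result.update(aggregate_in_place(partition))
    return result, 2


single = sum_by_key([[("a", 1), ("b", 2), ("a", 3)]])
multi = sum_by_key([[("a", 1)], [("b", 2), ("a", 3)]])
```

Both calls produce the same sums; the single-partition case just gets there in one stage instead of two.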
Prototype today: a filter-and-sort query on a small in-memory table runs in 150 ms instead of 330 ms. One stage instead of two. The win compounds as the optimizations stack.
Project Feather: docs.google.com/document/d/1…
The SPIP is open for comment. Pull the prototype, run it against your hardest small-data pipeline, file the bug we missed.
Authors: Daniel Tenedorio and Liang-Chi Hsieh.
#ApacheSpark #SPIP #OpenSource #DataEngineering #ApacheArrow
#DataAISummit Session Spotlight
Faster, Leaner, and Easier to Debug: PySpark UDFs in 2026
At Data AI Summit, Tian Gao and Yicong Huang will cover Arrow-based execution and improved debuggability for PySpark UDFs, including Native Arrow UDFs/UDTFs and built-in faulthandler profiling.
June 15–18 · SF
Add to your agenda: databricks.com/dataaisummit/… #ApacheSpark #PySpark #DataAISummit #DataEngineering #OpenSource