Apache Spark

Apache Spark

80 Photos and videos

Tweets

Apache Spark

@ApacheSpark

Jun 2

For a long time, streaming architecture advice boiled down to two engines: one for high-throughput ETL, another when you need millisecond latency. At Data Engineering Open Forum 2026, Indrajit Roy (@databricks) walked through how Apache Spark Structured Streaming took a different path from day one: micro-batch processing. 🔸 Micro-batch model: Records arrive on a stream; the engine waits briefly, forms a batch, processes it, then repeats. 🔸 Batch query on stream slices: Each step is effectively running a small batch query over the latest slice of data. 🔸 One engine, different tradeoffs: The design challenges the “two streaming engines” default instead of accepting it as fixed. Watch the full keynote: lnkd.in/eN6nErib #ApacheSpark #StructuredStreaming #DataEngineering #OpenSource

0:42

2,271

Apache Spark

Apache Spark

@ApacheSpark

Jun 1

#DataAISummit Session Spotlight ➡️ Learn how to build agentic workflows with OSS Spark Declarative Pipelines, with patterns for deterministic, testable, production-ready data workflows. 🗓️ June 15–18 📍 San Francisco 🔗 Session details: databricks.com/dataaisummit/… #ApacheSpark #DataAISummit

954

Apache Spark

Apache Spark

@ApacheSpark

May 29

#DataAISummit (June 15-18) Session Spotlight 👇 Get a year in review and the roadmap for Apache Spark Structured Streaming in open source: what's shipping in Spark 4.1 and what's ahead in 4.2 for mission-critical streaming ingestion and ETL pipelines. Jerry Peng and Anish Shrigondekar (@databricks) will cover recent advances and what's next! 🔗Details: databricks.com/dataaisummit/… #ApacheSpark #DataAISummit #StructuredStreaming #DataEngineering

709

Apache Spark

Apache Spark

@ApacheSpark

May 29

At DEOF 2026, Indrajit Roy (@databricks) opened with a keynote on how Apache Spark Structured Streaming innovated on throughput, latency, and flexibility, and what that means for data engineers in 2026. 👇 Real-time isn’t just for streaming specialists anymore. Express the logic. Let the engine handle the rest. 📹 Full video: youtu.be/VLJhGDwTS3I #ApacheSpark

817

Apache Spark

Apache Spark

@ApacheSpark

May 28

#DataAISummit Session Spotlight 👇 Apache Spark™ 4.2: unified batch streaming for AI workloads—feature pipelines, multimodal data, planner-level optimizations. 🎤 DB Tsai & Xiao Li | 🗓️ June 15-18 | 📍 San Francisco Session details: databricks.com/dataaisummit/… #ApacheSpark #DataAISummit

808

Apache Spark

Apache Spark

@ApacheSpark

May 28

#DataAISummit Session Spotlight 👇 Andreas Neumann and @lisancao will cover Spark Declarative Pipelines (4.1). Declare what your pipeline does, and Spark manages execution, parallelization, checkpoints, and failure recovery. 🗓️ June 15–18 | 📍 San Francisco 🔗 Session details: databricks.com/dataaisummit/… 🎟️ Register: dataaisummit.databricks.com/… #ApacheSpark #DataAISummit

984

Apache Spark

Apache Spark

@ApacheSpark

May 27

For a decade, “streaming on Spark” meant micro-batches. Fine for ETL. A wall if your latency budget was under a second. Spark 4.1 stops that. Real-Time Mode (SPARK-50708) 👇

0:59

5,887

more replies

Apache Spark

Apache Spark

@ApacheSpark

May 27

How: • Continuous execution — long-lived tasks • Simultaneous scheduling — stage N 1 on N’s first record • Streaming shuffle — in-memory handoff, no batch boundary

973

Apache Spark

Apache Spark

@ApacheSpark

May 27

Stateless: 4.1. Stateful RTM: upstream. If you have a streaming workload that "shouldn't be on Spark" because it needed ms, pull the RC and try it. The next move is yours.

460

Apache Spark

Apache Spark

@ApacheSpark

May 26

Agent-written Spark can pass static checks and a 10K-row sample, then fail at hour three. @lisancao breaks down how Spark 4.1 addresses that, with three patterns worth knowing 👇 🔹 SDP: declare intent, not triggers/checkpoints 🔹 RTM: one engine for sub-sec batch 🔹 Connect: pyspark-client; prod = URL change 🔗 Read more: medium.com/apache-spark/apac… #ApacheSpark

1,454

Apache Spark

Apache Spark

@ApacheSpark

May 26

#DataAISummit Session Spotlight 👇 Structured Streaming: year in review roadmap. Real-Time Mode, stateful transforms, Spark 4.2 ahead. 🎤 Jerry Peng & Anish Shrigondekar 🗓️ June 15–18 📍 San Francisco 🔗 Details: databricks.com/dataaisummit/… 🎟️ Register: dataaisummit.databricks.com/…

798

Apache Spark

Apache Spark

@ApacheSpark

May 26

#DataAISummit Session Spotlight 👇 Spark 4.1 introduces Spark Declarative Pipelines (SDP). Declare datasets and transformations. Spark manages the execution plan. Less boilerplate. Faster path to production. The session covers dependency resolution, checkpoint coordination, failure recovery, incremental processing, and testing patterns. 🎤 Andreas Neumann & Lisa Cao 📆 June 15-18 📍 San Francisco Session details: databricks.com/dataaisummit/… #ApacheSpark #DataAISummit #DataEngineering #Spark

1,290

Apache Spark

Apache Spark

@ApacheSpark

May 25

#DataAISummit Session Spotlight 👇 Apache Spark™ 4.2: unified batch streaming for AI workloads: feature pipelines, multimodal data, planner-level optimizations. 🎤 DB Tsai & Xiao Li | 🗓️ June 15–18 | 📍 San Francisco 🔗 Session details: databricks.com/dataaisummit/… #ApacheSpark

1,509

Apache Spark

Apache Spark

@ApacheSpark

May 25

Spark 4.1 for agents 👇 🔹 SDP: triggers/checkpoints/DAG off the agent; dry-run fails fast 🔹 RTM: sub-second batch, one engine (stateless in 4.1) 🔹 Connect: pyspark-client, no local JVM; sandbox→prod = URL Agent owns intent. Spark absorbs the rest. 🔗 Read more: medium.com/apache-spark/apac… #ApacheSpark #DataEngineering

3,711

Apache Spark

Apache Spark

@ApacheSpark

May 22

Apache Spark 4.1 is out today. 🚀 AI data agents are now common in data engineering. They're also a real risk in production: tool sprawl and the glue code required to run real pipelines create a huge surface area for silent errors. The cost is wasted time and wasted compute on jobs you only notice are broken three hours into a four-hour run. Three architectural changes in 4.1 shrink that surface area. 1️⃣ Spark Declarative Pipelines (SDP) 2️⃣ Real-Time Mode 3️⃣ Spark Connect Project Feather Three architectural changes. One platform shape. Fewer surfaces for the agent to drift on. Less technical debt as you ship. 👉 Get started: spark.apache.org/downloads #ApacheSpark #DataEngineering #OSS #AIagents

1:00

9,328

Apache Spark

Apache Spark

@ApacheSpark

May 21

#DataAISummit Session Spotlight (June 15–18 | San Francisco)👇 What's New in Apache Spark™ 4.1? 🔧 Spark Declarative Pipelines (SDP) ⚡ Structured Streaming Real-Time Mode 🐍 PySpark 🔗 Spark Connect & SQL Session details: databricks.com/dataaisummit/… Register: dataaisummit.databricks.com/… #ApacheSpark #DataAISummit

1,513

Apache Spark

Apache Spark

@ApacheSpark

May 20

Apache Spark is great at petabytes. It can be heavy at 100 megabytes. Project Feather is a new SPIP to fix that. 👇 Three lines of work, all targeting Spark in local mode: 1️⃣ Compilation and scheduling. Skip unnecessary shuffles when the planner knows a scan is one file. Mark itSinglePartitionand let the next aggregate run in place. 2️⃣ Arrow-baseddf.cache. Swap the row-oriented cache for Apache Arrow IPC. Columnar, compressed, iterable. 3️⃣ Shuffle-free execution. On a single node, replace blocking shuffle with in-process channels and Java virtual threads. No disk round-trip. Prototype today: a filter-and-sort query on a small in-memory table runs in 150 ms instead of 330 ms. One stage instead of two. The win compounds as the optimizations stack. 🔗 Project Feather: docs.google.com/document/d/1… The SPIP is open for comment. Pull the prototype, run it against your hardest small-data pipeline, file the bug we missed. ✍ Authors: Daniel Tenedorio and Liang-Chi Hsieh. #ApacheSpark #SPIP #OpenSource #DataEngineering #ApacheArrow

0:25

4,148

Apache Spark

Apache Spark

@ApacheSpark

May 19

#DataAISummit Session Spotlight 👇 Faster, Leaner, and Easier to Debug: PySpark UDFs in 2026 At Data AI Summit, Tian Gao and Yicong Huang will cover Arrow-based execution and improved debuggability for PySpark UDFs — including Native Arrow UDFs/UDTFs and built-in faulthandler profiling. 📍 June 15–18 · SF Add to your agenda: databricks.com/dataaisummit/… #ApacheSpark #PySpark #DataAISummit #DataEngineering #OpenSource

1,227