🎉 We’re proud to announce the @apachehudi 1.0 release! This release has been the result of a massive community effort, with tons of new code (re)written. I want to thank all 60 contributors who worked on ~180K lines of change.
🗒️ Release blog:
hudi.apache.org/blog/2024/12…
Hudi is still the OG of the data lakehouse when it comes to real technical innovation, as will become apparent below. 👇
🔥 Secondary Indexing - Yes, you read that right! You can speed up queries using indexes, just like a #database. 95% lower latency on 10TB TPC-DS for low-to-moderate selectivity queries. You can create/drop indexes asynchronously.
✨ Logical partitioning via Expression Indexes - #postgres-style expression indexes that treat partitions like the coarse-grained indexes they are. This avoids the most common pitfall: users creating tons of small partitions.
🤯 Partial Updates - 2.6x performance and an 85% reduction in bytes written, dropping write/query costs on update-heavy workloads. Lays the foundation for multimodal and unstructured data.
⚡ Non-blocking Concurrency Control (NBCC) enables simultaneous writing from multiple writers and compaction of the same record without blocking any involved processes. This is an industry first!
🎉 Merge Modes - First-class support for both styles of stream data processing, commit_time_ordering and event_time_ordering, plus custom record merger APIs.
🦾 LSM timeline - Hudi has a revamped timeline that stores all action history on a table as a scalable LSM tree, allowing users to retain a large amount of table history.
⌛ TrueTime - Hudi strengthens TrueTime semantics. The default implementation ensures forward-moving clocks even with distributed processes, assuming a maximum tolerable clock skew similar to OLTP/NoSQL stores.
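
For anyone wondering what the two merge modes mentioned above mean in practice, here is a rough Python sketch of the idea. It is only an illustration, not Hudi's API: the `Record` class and `merge` function are hypothetical, and only the mode names come from this post. It shows how a late-arriving update for the same key resolves differently under each mode.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Record:
    """Hypothetical record shape, used only for this illustration."""
    key: str
    value: str
    event_time: int   # when the event happened at the source
    commit_time: int  # when the writer committed it to the table


def _by_commit_time(record):
    return record.commit_time


def _by_event_time(record):
    # Ties on event time fall back to the later commit.
    return (record.event_time, record.commit_time)


# Mode names are taken from the post; the ordering functions are assumptions.
ORDERINGS = {
    "commit_time_ordering": _by_commit_time,
    "event_time_ordering": _by_event_time,
}


def merge(records, mode):
    """Keep one record per key, choosing the winner by the given merge mode."""
    order = ORDERINGS[mode]
    latest = {}
    for record in records:
        current = latest.get(record.key)
        if current is None or order(record) >= order(current):
            latest[record.key] = record
    return latest


# A late-arriving event: committed second, but it happened earlier at the source.
updates = [
    Record(key="k1", value="v_new", event_time=10, commit_time=1),
    Record(key="k1", value="v_late", event_time=5, commit_time=2),
]

print(merge(updates, "commit_time_ordering")["k1"].value)
print(merge(updates, "event_time_ordering")["k1"].value)
```

With commit-time ordering the last write wins, so the late arrival overwrites the newer event; with event-time ordering the record with the larger event time is kept regardless of when it was committed.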
So, if you love open-source innovation as much as we do, check out the release and join our ~12,000-strong community across Slack & GitHub. We're a grassroots OSS community that has sustained innovation in a fiercely competitive commercial data ecosystem.
#apachehudi #datalakehouse #opentableformat #dataengineering #apachespark #apacheflink #trinodb #awss3 #distributedsystems #analytics #bigdata #datalake