I write about data engineering | SQL | Python | Distributed systems. Get my free data engineering course at startdataengineering.com/ema…

Joined April 2020
Pinned Tweet
Exercise project for anyone starting in data engineering startdataengineering.com/pos… #dataengineering #bigdata #ETL #ApacheAirflow #AWS #ApacheSpark
Upserting data in a warehouse is tricky. What happens when it fails partway through? Are you sure you're updating only the right rows? MERGE INTO puts all your insert/update/delete logic in one atomic statement. Full guide 👇 startdataengineering.com/pos… #spark #sql #dataengineer
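To make the atomicity point concrete, here is a minimal sketch. It is not Spark: SQLite has no MERGE statement, so it swaps in `INSERT ... ON CONFLICT DO UPDATE` inside a single transaction as a stand-in for the same idea. The `customers` table and the `upsert_customers` helper are hypothetical names used only for illustration.

```python
import sqlite3


def upsert_customers(conn, rows):
    """Insert new rows and update existing ones as one all-or-nothing unit."""
    # `with conn` commits if the block succeeds and rolls back if it raises,
    # so a failure partway through leaves the table unchanged.
    with conn:
        conn.executemany(
            """
            INSERT INTO customers (id, name) VALUES (?, ?)
            ON CONFLICT(id) DO UPDATE SET name = excluded.name
            """,
            rows,
        )


conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE customers (id INTEGER PRIMARY KEY, name TEXT NOT NULL)")

upsert_customers(conn, [(1, "Ada"), (2, "Grace")])
# id 2 already exists, so it is updated; id 3 is new, so it is inserted.
upsert_customers(conn, [(2, "Grace Hopper"), (3, "Edsger")])

# This batch fails on its second row (NULL name), so the update to id 1
# is rolled back as well.
try:
    upsert_customers(conn, [(1, "Ada L."), (4, None)])
except sqlite3.IntegrityError:
    pass

rows = conn.execute("SELECT id, name FROM customers ORDER BY id").fetchall()
```

After the failed batch, `rows` still holds the state from the last successful upsert, which is the "what happens when it fails partway through" guarantee the post is about.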
Building throwaway pipelines for quick wins, then losing days fixing them? The fix is boring but real: data modeling upfront. I broke down the design decisions that let you move fast without breaking everything 👇 startdataengineering.com/pos… #datapipeline #datamodel
The project structure dbt recommends for data modeling is complex and confusing. At almost every company I've worked at, we've used these 3 layers: 1. raw tables as-is from source, 2. tables modeled as facts and dims, 3. summary tables. I'm working on upgrading my dbt tutorial. #dbt
If you are a data engineer or looking to break into Data Engineering, Apache Airflow is a must-know. Check out my post on how to think about orchestration and scheduling with Airflow. startdataengineering.com/pos… #dataengineering #apacheairflow #datapipelines
Joseph Machado retweeted
I strongly believe there are entire companies right now under heavy AI psychosis and it's impossible to have rational conversations about it with them. I can't name any specific people because they include personal friends I deeply respect, but I worry about how this plays out. I lived through the great MTBF vs MTTR (mean-time-between-failure vs. mean-time-to-recovery) reckoning of infrastructure during the transition to cloud and cloud automation. All those arguments are rearing their ugly heads again but now it's... the whole software development industry (maybe the whole world, really). It's frightening, because the psychosis folks operate under an almost absolute "MTTR is all you need" mentality: "it's fine to ship bugs because the agents will fix them so quickly and at a scale humans can't do!" We learned in infrastructure that MTTR is great but you can't yeet resilient systems entirely. The main issue is I don't even know how to bring this up to people I know personally, because bringing this topic up leads to immediate dismissals like "no no, it has full test coverage" or "bug reports are going down" or something, which just don't paint the whole picture. We already learned this lesson once in infrastructure: you can automate yourself into a very resilient catastrophe machine. Systems can appear healthy by local metrics while globally becoming incomprehensible. Bug reports can go down while latent risk explodes. Test coverage can rise while semantic understanding falls. Changes happen so fast that nobody notices the underlying architecture decaying. I worry.
The Spark API is easy to learn. But to debug a hanging job, you need to know Spark internals. Here are 7 topics to know for production Spark 👇
6. Read the Spark UI: Slow stages, skewed tasks, and spill to disk are all there. If you can't diagnose a hanging job, you can't own one in production. 7. Observability, audit, and lineage: You need to know what ran, when, on what data, and whether it succeeded.
The API is the easy part. Production is where the real learning happens.
PSA: Understand the concepts and read the docs before using LLMs. Claude sent me on a wild goose chase: hallucinations, complex setup that breaks stuff, etc. I wasted a lot of time, only to realize the tool I work with (Quarto) already does what I needed.
Too many small files in your data lake hurt performance. Detect it with the Spark UI: 1. Go to the Stages tab and look at the event timeline. 2. Many small tasks (1 task = 1 green chunk) indicate a many-small-files (or many-small-partitions) problem. Fix coming tomorrow. #dataengineering
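As a complement to the Spark UI check, here is a rough sketch of a different technique: scanning the files directly and counting how many fall under a size threshold. It assumes a locally mounted directory (an object store would need its own listing API). The `small_file_ratio` helper, the 32 MiB default threshold, and the file names are all hypothetical choices made for illustration.

```python
import tempfile
from pathlib import Path


def small_file_ratio(root, threshold_bytes=32 * 1024 * 1024):
    """Return (total_files, small_files, ratio) for all files under `root`."""
    sizes = [p.stat().st_size for p in Path(root).rglob("*") if p.is_file()]
    small = sum(1 for size in sizes if size < threshold_bytes)
    ratio = small / len(sizes) if sizes else 0.0
    return len(sizes), small, ratio


# Demo on a temporary directory: five 10-byte files and one 2 KiB file,
# checked against a 1 KiB threshold.
with tempfile.TemporaryDirectory() as tmp:
    for i in range(5):
        (Path(tmp) / f"part-{i}.bin").write_bytes(b"x" * 10)
    (Path(tmp) / "part-big.bin").write_bytes(b"x" * 2048)
    total, small, ratio = small_file_ratio(tmp, threshold_bytes=1024)
```

A high ratio only hints at the problem; the event timeline in the Spark UI is still what shows how those files turn into many short tasks.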