Upserting data in a warehouse is tricky. What happens when it fails partway through?
Are you sure you're updating only the right rows?
MERGE INTO puts all your insert/update/delete logic in one atomic statement. Full guide 👇
startdataengineering.com/pos… #spark #sql #dataengineer
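
A minimal sketch of the "one atomic statement" idea, not the linked guide's code. SQLite has no MERGE INTO, so this stand-in uses `INSERT ... ON CONFLICT` plus a `DELETE` inside a single transaction to get the same all-or-nothing behavior. The table names (`customers`, `customer_updates`), the `op` column, and the `fail_midway` flag are hypothetical, made up for illustration.

```python
import sqlite3

# Hypothetical target and staging tables; not from the original post.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE customers (id INTEGER PRIMARY KEY, name TEXT);
    CREATE TABLE customer_updates (id INTEGER PRIMARY KEY, name TEXT, op TEXT);
    INSERT INTO customers VALUES (1, 'Ada'), (2, 'Bob');
    INSERT INTO customer_updates VALUES
        (1, 'Ada', 'delete'), (2, 'Robert', 'upsert'), (3, 'Cy', 'upsert');
""")


def read_customers(conn):
    return conn.execute("SELECT id, name FROM customers ORDER BY id").fetchall()


def merge_customers(conn, fail_midway=False):
    # "with conn" commits if the block succeeds and rolls back if it raises,
    # so the insert/update/delete steps land together or not at all.
    with conn:
        conn.execute("""
            INSERT INTO customers (id, name)
            SELECT id, name FROM customer_updates WHERE op = 'upsert'
            ON CONFLICT(id) DO UPDATE SET name = excluded.name
        """)
        if fail_midway:
            raise RuntimeError("simulated failure between steps")
        conn.execute("""
            DELETE FROM customers
            WHERE id IN (SELECT id FROM customer_updates WHERE op = 'delete')
        """)


before = read_customers(conn)
try:
    merge_customers(conn, fail_midway=True)
except RuntimeError:
    pass
after_failed = read_customers(conn)  # rolled back, same as before
merge_customers(conn)
after_ok = read_customers(conn)
print(before, after_failed, after_ok)
```

On an engine that supports MERGE INTO, the two statements would collapse into one with `WHEN MATCHED` / `WHEN NOT MATCHED` clauses; the transaction here only imitates that guarantee.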
Building throwaway pipelines for quick wins, then losing days fixing them?
The fix is boring but real: data modeling up front.
I broke down the design decisions that let you move fast without breaking everything 👇
startdataengineering.com/pos… #datapipeline #datamodel
SQL is the bread and butter of data engineering.
Whether you are a seasoned pro or new to data engineering, there is always a way to improve your SQL skills.
Check out 8 patterns to uplevel your SQL skills.
startdataengineering.com/pos… #dataengineering #sql #datapipeline
The project structure dbt recommends for data modeling is complex and confusing.
At almost every company I've worked at, we've used these 3 layers:
1. raw tables as-is from source
2. tables modeled as facts and dims
3. summary tables
I'm working on upgrading my dbt tutorial
#dbt
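
A rough sketch of the 3 layers above, using in-memory SQLite as a stand-in for a warehouse (this is not dbt code and not the tutorial's project layout). Every table and column name here (`raw_orders`, `dim_customer`, `fct_orders`, `summary_customer_revenue`) is a hypothetical example.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    -- Layer 1: raw table, loaded as-is from the source (amount is still text)
    CREATE TABLE raw_orders (
        order_id INTEGER, customer_name TEXT, amount TEXT, ordered_at TEXT
    );
    INSERT INTO raw_orders VALUES
        (1, 'Ada', '10.50', '2024-01-01'),
        (2, 'Bob', '5.00',  '2024-01-01'),
        (3, 'Ada', '4.50',  '2024-01-02');

    -- Layer 2: tables modeled as a dimension and a fact
    CREATE TABLE dim_customer (
        customer_key INTEGER PRIMARY KEY, customer_name TEXT UNIQUE
    );
    INSERT INTO dim_customer (customer_name)
        SELECT DISTINCT customer_name FROM raw_orders ORDER BY customer_name;

    CREATE TABLE fct_orders AS
        SELECT r.order_id,
               d.customer_key,
               CAST(r.amount AS REAL) AS amount,
               r.ordered_at
        FROM raw_orders r
        JOIN dim_customer d ON d.customer_name = r.customer_name;

    -- Layer 3: summary table built only from the modeled layer
    CREATE TABLE summary_customer_revenue AS
        SELECT d.customer_name,
               COUNT(*) AS order_count,
               SUM(f.amount) AS revenue
        FROM fct_orders f
        JOIN dim_customer d ON d.customer_key = f.customer_key
        GROUP BY d.customer_name;
""")

summary = conn.execute(
    "SELECT customer_name, order_count, revenue "
    "FROM summary_customer_revenue ORDER BY customer_name"
).fetchall()
print(summary)
```

The point of the sketch is the dependency direction: each layer reads only from the one before it, so type fixes happen once in layer 2 and the summary never touches raw data.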
I strongly believe there are entire companies right now under heavy AI psychosis and it's impossible to have rational conversations about it with them. I can't name any specific people because they include personal friends I deeply respect, but I worry about how this plays out.
I lived through the great MTBF vs MTTR (mean-time-between-failure vs. mean-time-to-recovery) reckoning of infrastructure during the transition to cloud and cloud automation. All those arguments are rearing their ugly heads again but now it's... the whole software development industry (maybe the whole world, really).
It's frightening, because the psychosis folks operate under an almost absolute "MTTR is all you need" mentality: "it's fine to ship bugs because the agents will fix them so quickly and at a scale humans can't do!" We learned in infrastructure that MTTR is great but you can't yeet resilient systems entirely.
The main issue is I don't even know how to bring this up to people I know personally, because bringing this topic up leads to immediate dismissals like "no no, it has full test coverage" or "bug reports are going down" or something, which just don't paint the whole picture.
We already learned this lesson once in infrastructure: you can automate yourself into a very resilient catastrophe machine. Systems can appear healthy by local metrics while globally becoming incomprehensible. Bug reports can go down while latent risk explodes. Test coverage can rise while semantic understanding falls. Changes happen so fast that nobody notices the underlying architecture decaying.
I worry.
Your data warehouse bill is high for one reason.
Full table scans. Every query reads everything, whether it needs to or not.
Here are 6 storage patterns that fix this 👇
startdataengineering.com/pos…
6. Read the Spark UI: Slow stages, skewed tasks, and spill to disk are all there. If you can't diagnose a hanging job, you can't own one in production.
7. Observability, audit, and lineage: You need to know what ran, when, on what data, and whether it succeeded.
PSA: Understand the concepts and read the docs before using LLMs
Claude sent me on a wild goose chase: hallucinations, complex setup that breaks stuff, etc.
Wasted a lot of time, only to realize the tool (Quarto) I work with already does what I needed
Too many small files in your data lake impact performance.
Detect it with the Spark UI
1. Go to the Stages tab and open the event timeline.
2. Many small tasks (1 task = 1 green chunk) indicate a many-small-files (or partitions) problem.
Fix coming tomorrow
#dataengineering
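
The Spark UI check above can be cross-checked from the storage side. This is a different technique from reading the event timeline: a minimal sketch that walks a local directory and counts data files under a size threshold. The function name `small_file_report` and the 32 MB default are assumptions for illustration, and it only works on a local filesystem path (an object store would need its own listing API).

```python
import os
import tempfile


def small_file_report(root, threshold_bytes=32 * 1024 * 1024):
    """Return (total_files, small_files) for data files under root.

    threshold_bytes is an assumed cutoff, not a Spark default.
    """
    total = small = 0
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            # Skip marker and hidden files such as _SUCCESS or .crc files.
            if name.startswith(("_", ".")):
                continue
            total += 1
            if os.path.getsize(os.path.join(dirpath, name)) < threshold_bytes:
                small += 1
    return total, small


# Demo on a throwaway directory with a tiny threshold so it runs instantly.
with tempfile.TemporaryDirectory() as root:
    part_dir = os.path.join(root, "date=2024-01-01")
    os.makedirs(part_dir)
    for i in range(3):
        with open(os.path.join(part_dir, f"part-{i}.parquet"), "wb") as f:
            f.write(b"x" * 10)      # three tiny files
    with open(os.path.join(part_dir, "part-3.parquet"), "wb") as f:
        f.write(b"x" * 2048)        # one file above the demo threshold
    with open(os.path.join(part_dir, "_SUCCESS"), "wb") as f:
        pass                        # marker file, ignored by the report
    demo_result = small_file_report(root, threshold_bytes=1024)

print(demo_result)
```

A high ratio of small files to total files points the same way as a timeline full of tiny green tasks.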