Joseph Machado

Joseph Machado

132 Photos and videos

Tweets

Pinned Tweet

Joseph Machado @startdataeng

25 May 2020

Exercise project for anyone starting in data engineering startdataengineering.com/pos… #dataengineering #bigdata #ETL #ApacheAirflow #AWS #ApacheSpark

475

Joseph Machado

Joseph Machado @startdataeng

Jun 12

Upserting data in a warehouse is tricky. What happens when it fails partway through? Are you sure you're updating only the right rows? MERGE INTO puts all your insert/update/delete logic in one atomic statement. Full guide 👇 startdataengineering.com/pos… #spark #sql #dataengineer

How to Use Spark SQL Merge Into - Step-by-Step Tutorial – Start Data Engineering

Understand when and why to use Merge Into and its tradeoffs, with real-world code examples and visual diagrams.

startdataengineering.com

109

Joseph Machado

Joseph Machado @startdataeng

Jun 10

Building throwaway pipelines for quick wins, then losing days fixing them? The fix is boring but real: data modeling upfront I broke down the design decisions that let you move fast without breaking everything 👇 startdataengineering.com/pos… #datapipeline #datamodel

Six Data Modeling Techniques For Building Production-Ready Tables Fast – Start Data Engineering

For data engineers tired of quick fixes and drowning in technical debt

startdataengineering.com

128

Joseph Machado

Joseph Machado @startdataeng

Jun 5

SQL is the bread and butter of data engineering. Whether you are a seasoned pro or new to data engineering, there is always a way to improve your SQL skills. Check out 8 patterns to uplevel your SQL skills. startdataengineering.com/pos… #dataengineering #sql #datapipeline

25 SQL tips to level up your data engineering skills – Start Data Engineering

As a data engineer, you always want to uplevel yourself. SQL is the bread and butter of data engineering. Whether you are a seasoned pro or new to data engineering, there is always a way to improve...

startdataengineering.com

225

Joseph Machado

Joseph Machado @startdataeng

Jun 2

The project structure dbt recommends for data modeling is complex and confusing. At almost every company I've worked we've used these 3 layers: 1. raw tables as-is from source 2. tables modeled as facts and dims 3. summary tables I'm working on upgrading my dbt tutorial #dbt

200

Joseph Machado

Joseph Machado @startdataeng

Jun 2

If you are a data engineer or looking to break into Data Engineering, Apache Airflow is a must-know. Check out my post on how to think about orchestration and scheduling with Airflow. startdataengineering.com/pos… #dataengineering #apacheairflow #datapipelines

ALT https://www.startdataengineering.com/post/airflow-tutorial/

703

Joseph Machado

Joseph Machado @startdataeng

Jun 1

Adding short video explanations to my posts, let's see how that goes startdataengineering.com/pos… #dataengineering

IaC (Infrastructure-as-Code) fundamentals for data engineers – Start Data Engineering

Create data infrastructure with Terraform. Understand how Terraform uses configs, state, and variable files.

startdataengineering.com

146

Joseph Machado

Joseph Machado @startdataeng

May 26

Stuck preparing for data engineering system design interviews? Use this article to guide you through the prep. startdataengineering.com/pos… #dataengineering #interview #systemdesign

Data Engineering Interview Preparation Series #2: System Design – Start Data Engineering

System design interviews are usually vague and depend on you (as the interviewee) to guide the interviewer. If you are thinking: How do I prepare for data engineering system design interviews? I...

startdataengineering.com

209

Joseph Machado

Joseph Machado @startdataeng

May 21

Everyone says you need to know Python for data work. But what does that mean exactly? In this post I cover the key Python concepts every DE needs to know ⬇️ startdataengineering.com/pos… #dataengineering #python #datapipeline #softwareengineering #dataanalysis

Python Essentials for Data Engineers – Start Data Engineering

You know Python is essential for a data engineer. Does anyone know how much one should learn to become a data engineer? When you’re in an interview with a hiring manager, how can you effectively...

startdataengineering.com

358

Joseph Machado

Joseph Machado @startdataeng

May 18

Pipeline code can be generated; set yourself apart by understanding how to design pipelines. Learn how to design your pipelines based on given inputs and required outputs. startdataengineering.com/pos… #dataengineering #datawarehouse #sql #etl #datapipeline

Data Pipeline Design Patterns - #1. Data flow patterns – Start Data Engineering

Data pipelines built (and added on to) without a solid foundation will suffer from poor efficiency, slow development speed, long times to triage production issues, and hard testability. What if your...

startdataengineering.com

595

Mitchell Hashimoto

Joseph Machado retweeted

Mitchell Hashimoto

@mitchellh

May 15

I strongly believe there are entire companies right now under heavy AI psychosis and its impossible to have rational conversations about it with them. I can't name any specific people because they include personal friends I deeply respect, but I worry about how this plays out. I lived through the great MTBF vs MTTR (mean-time-between-failure vs. mean-time-to-recovery) reckoning of infrastructure during the transition to cloud and cloud automation. All those arguments are rearing their ugly heads again but now its... the whole software development industry (maybe the whole world, really). It's frightening, because the psychosis folks operate under an almost absolute "MTTR is all you need" mentality: "its fine to ship bugs because the agents will fix them so quickly and at a scale humans can't do!" We learned in infrastructure that MTTR is great but you can't yeet resilient systems entirely. The main issue is I don't even know how to bring this up to people I know personally, because bringing this topic up leads to immediately dismissals like "no no, it has full test coverage" or "bug reports are going down" or something, which just don't paint the whole picture. We already learned this lesson once in infrastructure: you can automate yourself into a very resilient catastrophe machine. Systems can appear healthy by local metrics while globally becoming incomprehensible. Bug reports can go down while latent risk explodes. Test coverage can rise while semantic understanding falls. Changes happens so fast that nobody notices the underlying architecture decaying. I worry.

512

1,901

15,329

1,586,229

Joseph Machado

Joseph Machado @startdataeng

May 15

Your data warehouse bill is high for one reason. Full table scans. Every query reads everything, whether it needs to or not. Here are 6 storage patterns that fix this 👇 startdataengineering.com/pos…

3 Data Storage Techniques Every Data Engineer Should Know – Start Data Engineering

Skyrocketing data costs are due to full table scans. Here are the rules of thumb that cut costs and make your queries fast.

startdataengineering.com

367

Joseph Machado

Joseph Machado @startdataeng

May 14

Spark API is easy to learn. But to debug a hanging job, you need to know Spark internals. Here are 7 topics to know for production Spark 👇

396

more replies

Joseph Machado

Joseph Machado @startdataeng

May 14

6. Read the Spark UI: Slow stages, skewed tasks, spill to disk are all there. If you can't diagnose a hanging job, you can't own one in production. 7. Observability, audit, and lineage: You need to know what ran, when, on what data, and whether it succeeded.

Joseph Machado

Joseph Machado @startdataeng

May 14

The API is the easy part. Production is where the real learning happens.

Joseph Machado

Joseph Machado @startdataeng

May 13

Large datasets are stored as individual files in S3. Too many small files per dataset make reads expensive! Learn to: Detect it, Compact it or use table properties during insert. Read how to here 👇 startdataengineering.com/pos… #dataengineering #data #apachespark #apacheiceberg

How to Identify and Fix Small Files Problem with Spark & Iceberg – Start Data Engineering

Hands-on workshop covering compaction strategies to detect & fix small file issues in Apache Iceberg with Apache Spark.

startdataengineering.com

250

Joseph Machado

Joseph Machado @startdataeng

May 12

PSA: Understand the concepts and read the docs, before using LLMs Claude sent me on a wild goose chase, hallucinations, complex setup that breaks stuff, etc Wasted a lot of time, only to realize the tool(quarto) I work with already does what I needed

408

Joseph Machado

Joseph Machado @startdataeng

May 12

Too many small files in your data lake impact performance. Detect it with Spark UI 1. Go to the stages tab, see the event timeline. 2. Many small tasks (1 task = 1 green chunk) indicate a many-small-files (or partitions) problem. Fix coming tomorrow #dataengineering

584