Data Nerds! I ranked every data engineering tool by how often it shows up in 4M job postings. 📊
But here's the catch 😳.
Some critical skills show up far less than they should because postings treat them as an assumed baseline (e.g., Bash/Terminal for running pipelines).
Anyway, here's the breakdown of the tiers 👇 (Note: % = how often each tool appears in DE job postings)
🔴 S TIER — Non-Negotiable
The core skills needed for any DE job. Don't apply without these:
📊 SQL (~68%) — every warehouse runs on it. Query, transform, and model data.
🐍 Python (~67%) — the pipeline language. Ingestion, automation, APIs, glue between systems.
⌨️ Terminal/Bash (~11%) — most tools you'll use run from here. Heavily underrepresented in postings because it's assumed.
📁 Git (~11%) — version control. Every team uses it. Same posting-% caveat as Bash.
☁️ One cloud platform warehouse (~26-46%) — AWS Redshift, GCP BigQuery, or Azure Synapse. Combined cloud presence is in nearly every posting.
Start with SQL, then Python. Everything else you absorb alongside them.
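Here's a tiny sketch of why SQL and Python pair so well: Python does the loading and glue, SQL does the aggregation. It uses Python's built-in sqlite3 as a stand-in for a real warehouse, and the table name, columns, and rows are all made up for illustration.

```python
import sqlite3


def load_and_summarize(rows):
    """Load (tool, role, hits) rows into an in-memory table, then aggregate with SQL."""
    conn = sqlite3.connect(":memory:")  # stand-in for a warehouse connection
    conn.execute("CREATE TABLE mentions (tool TEXT, role TEXT, hits INTEGER)")
    # Python handles ingestion: parameterized bulk insert
    conn.executemany("INSERT INTO mentions VALUES (?, ?, ?)", rows)
    # SQL handles the transform: total hits per tool, largest first
    result = conn.execute(
        "SELECT tool, SUM(hits) AS total "
        "FROM mentions GROUP BY tool ORDER BY total DESC, tool"
    ).fetchall()
    conn.close()
    return result


# Hypothetical sample rows, not real posting data
rows = [("sql", "DE", 3), ("python", "DE", 2), ("sql", "AE", 4)]
summary = load_and_summarize(rows)
print(summary)
```

Swap the in-memory database for your cloud warehouse's Python connector and the shape of the code stays roughly the same.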
🟠 A TIER — Job-Ready Foundation
The tool that closes the gap from "learning DE" to "hireable for modern stacks":
🪛 dbt (~10%) — only 10% of all DE postings, but 36% in Analytics Engineer (AE) roles.
That's not a niche, it's a leading indicator. AE is the new hybrid role modern data teams are hiring for: part analyst, part engineer.
✅ Land the job with S + A. Pass the interview with conceptual knowledge of B Tier 👇
🟡 B TIER — Interview-Aware
Know what they solve. You won't be expected to code them from scratch:
⚙️ Airflow (~17%) — orchestration. Built on DAGs (directed acyclic graphs).
⚡ Spark (~38%) — distributed computing for processing large datasets.
🌊 Kafka (~19%) — real-time event streaming between systems.
All of these build on a solid foundation in Python & SQL; don't jump the gun on them.
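If "DAG" sounds abstract, here's the idea in plain Python. This is not Airflow's API, just a minimal sketch using the standard library's graphlib (Python 3.9+) to show what an orchestrator does with a dependency graph. The task names are hypothetical.

```python
from graphlib import CycleError, TopologicalSorter

# Each task maps to the set of tasks it depends on (hypothetical pipeline)
pipeline = {
    "extract": set(),
    "transform": {"extract"},
    "load": {"transform"},
    "report": {"load"},
}


def run_order(dag):
    """Return the tasks in an order that respects every dependency."""
    return list(TopologicalSorter(dag).static_order())


def has_cycle(dag):
    """A DAG must be acyclic: a task can't (indirectly) depend on itself."""
    try:
        run_order(dag)
    except CycleError:
        return True
    return False


order = run_order(pipeline)
print(order)
```

"Directed" means dependencies point one way, "acyclic" means no loops, and that's what lets a scheduler work out a valid run order.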
🟢 C TIER — Data Platform Awareness
Pick the one your company uses. Understand both conceptually:
❄️ Snowflake (~26%) — pure SQL warehouse. Optimized for analytics. Modern-stack favorite.
🧱 Databricks (~24%) — lakehouse on Spark. Handles structured + unstructured data. Favored by ML/AI-heavy teams.
🔵 D TIER — Versatility Multipliers
Lower headline demand, but high value per hour:
📊 Power BI (~15%) / Tableau (~10%) — but the kicker: in AE roles these jump to 28% / 33%.
Modern data teams want pipeline builders who can also visualize. For analysts pivoting to DE, lead with this in interviews.
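To put the AE jump in perspective, here's a quick back-of-the-envelope calculation using only the approximate percentages quoted in this post. The "lift" label is my own shorthand for AE share divided by overall DE share.

```python
# (share in all DE postings %, share in AE postings %), approximate figures from this post
shares = {
    "dbt": (10, 36),
    "Power BI": (15, 28),
    "Tableau": (10, 33),
}


def lift(de_pct, ae_pct):
    """How many times more often a tool appears in AE postings than in DE postings overall."""
    return round(ae_pct / de_pct, 1)


lifts = {tool: lift(*pair) for tool, pair in shares.items()}
for tool, value in lifts.items():
    print(f"{tool}: {value}x")
```

Since the inputs are rounded, treat the ratios as rough signals rather than precise measurements.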
🟣 E TIER — Path-Dependent
High demand on paper, but concentrated in legacy enterprise stacks. Skip until your job requires it:
☕ Java (~25%) — legacy enterprise data infrastructure
⚖️ Scala (~22%) — Spark's native language. Spark-heavy shops.
🎥 How did I derive this ranking? In my latest video, I walk through the concepts first (the DE lifecycle, what each tool actually solves) and then derive the tiers. (Link in comments 👇)