Early in my career, I got an IAM error.
AccessDenied: User is not authorized to perform: s3:GetObject
I panicked. Added "Action": "*" to the policy.
It worked.
That single move is one of the most common — and dangerous — mistakes data engineers make.
IAM isn’t “someone else’s job.”
Every Glue job, every Lambda function, every Redshift cluster needs permissions to talk to other AWS services.
If you’re writing pipelines, you’re writing IAM policies. Whether you realize it or not.
5 things every data engineer must know about IAM:
1. Users vs Roles — know the difference
👤 Users → for people (you logging into the AWS console)
🤖 Roles → for services (Glue, Lambda, Redshift assuming permissions)
Your pipelines should never use a user’s credentials. Always roles.
2. Two policies, two jobs
📜 Trust policy → “WHO can assume this role?”
📜 Permission policy → “WHAT can this role do?”
A Glue job’s role trusts glue.amazonaws.com AND has permissions to read from S3 and write to the Glue Data Catalog.
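Here's a rough sketch of that pair, written as Python dicts and dumped to JSON. The bucket, region, account ID, and database name are placeholders, and the list of Glue actions is only my assumption about what a catalog-writing job might need. Yours will likely differ.

```python
import json

# Trust policy: WHO can assume this role (here, the Glue service).
trust_policy = {
    "Version": "2012-10-17",
    "Statement": [
        {
            "Effect": "Allow",
            "Principal": {"Service": "glue.amazonaws.com"},
            "Action": "sts:AssumeRole",
        }
    ],
}

# Permission policy: WHAT the role can do once assumed.
# All ARNs below are placeholders, not real resources.
permission_policy = {
    "Version": "2012-10-17",
    "Statement": [
        {
            "Sid": "ReadRawData",
            "Effect": "Allow",
            "Action": ["s3:GetObject"],
            "Resource": "arn:aws:s3:::my-bucket/raw/*",
        },
        {
            "Sid": "WriteCatalog",
            "Effect": "Allow",
            "Action": [
                "glue:GetDatabase",
                "glue:GetTable",
                "glue:CreateTable",
                "glue:UpdateTable",
            ],
            "Resource": [
                "arn:aws:glue:us-east-1:123456789012:catalog",
                "arn:aws:glue:us-east-1:123456789012:database/my_db",
                "arn:aws:glue:us-east-1:123456789012:table/my_db/*",
            ],
        },
    ],
}

if __name__ == "__main__":
    print(json.dumps(trust_policy, indent=2))
    print(json.dumps(permission_policy, indent=2))
```

Two separate documents, attached to the same role. Mix them up and the role either can't be assumed or can't do anything.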
3. Least privilege isn’t optional
"Action": "s3:*" on "Resource": "*" = an open door.
Scope it down:
"Action": ["s3:GetObject", "s3:PutObject"]
"Resource": "arn:aws:s3:::my-bucket/raw/*"
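If you want a quick sanity check before a policy ships, something like the sketch below can flag the obvious open doors. The function name and the rules it applies are my own simplification (it only looks for "*" or "service:*" actions and a bare "*" resource), so treat it as a starting point, not a replacement for a proper policy analysis tool.

```python
def _as_list(value):
    """IAM lets Action/Resource be a string or a list; normalize to a list."""
    return [value] if isinstance(value, str) else list(value)


def find_open_doors(policy):
    """Return a description of every Allow statement with a wildcard
    action ("*" or "service:*") or a bare "*" resource."""
    statements = policy["Statement"]
    if isinstance(statements, dict):
        statements = [statements]

    problems = []
    for index, stmt in enumerate(statements):
        if stmt.get("Effect") != "Allow":
            continue
        for action in _as_list(stmt.get("Action", [])):
            if action == "*" or action.endswith(":*"):
                problems.append(f"statement {index}: wildcard action {action!r}")
        for resource in _as_list(stmt.get("Resource", [])):
            if resource == "*":
                problems.append(f"statement {index}: wildcard resource '*'")
    return problems


too_broad = {
    "Version": "2012-10-17",
    "Statement": [{"Effect": "Allow", "Action": "s3:*", "Resource": "*"}],
}

scoped = {
    "Version": "2012-10-17",
    "Statement": [
        {
            "Effect": "Allow",
            "Action": ["s3:GetObject", "s3:PutObject"],
            "Resource": "arn:aws:s3:::my-bucket/raw/*",
        }
    ],
}

if __name__ == "__main__":
    for name, policy in [("too_broad", too_broad), ("scoped", scoped)]:
        print(name, find_open_doors(policy))
```

The scoped version still has a "*" in it, but only at the end of a specific prefix. That's the difference between one folder and the whole account.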
4. Never hardcode credentials
No access keys in your PySpark scripts. No .env files committed to Git.
Glue, Lambda, EC2: each can run under a role and pick up temporary credentials automatically. Use that.
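In practice that means your code never mentions a key at all. A minimal sketch with boto3, assuming the job already runs under a role that allows s3:GetObject (the function name and the optional s3_client parameter are mine, added so the function can be exercised with a stand-in client):

```python
def read_raw_object(bucket, key, s3_client=None):
    """Read one S3 object using whatever role the runtime is running under."""
    if s3_client is None:
        # Imported lazily so this sketch loads even where boto3 is not installed.
        import boto3

        # No aws_access_key_id / aws_secret_access_key arguments here.
        # boto3's default credential chain finds the temporary credentials
        # that come with the role attached to the job, function, or instance.
        s3_client = boto3.client("s3")

    response = s3_client.get_object(Bucket=bucket, Key=key)
    return response["Body"].read()
```

Inside a Glue job or Lambda function you'd call it as read_raw_object("my-bucket", "raw/file.csv"), where both names are placeholders. If the call fails with AccessDenied, the fix belongs in the role's policy, not in the script.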
5. One role per job type
Don’t create one giant “DataEngineerRole” used everywhere.
A Glue ETL role ≠ a Lambda trigger role ≠ a Redshift COPY role.
Separate roles = smaller blast radius if something goes wrong.
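To make the blast radius idea concrete, here's a toy model of three separate roles. Every role name, ARN, and action list below is hypothetical and deliberately incomplete (a real Redshift COPY role, for example, usually needs more than one action). The point is only that each role trusts one service and carries one narrow set of permissions.

```python
# Hypothetical role names, ARNs, and action lists, purely for illustration.
ROLES = {
    "glue-etl-role": {
        "trusted_service": "glue.amazonaws.com",
        "actions": ["s3:GetObject", "s3:PutObject"],
        "resource": "arn:aws:s3:::my-bucket/raw/*",
    },
    "lambda-trigger-role": {
        "trusted_service": "lambda.amazonaws.com",
        "actions": ["glue:StartJobRun"],
        "resource": "arn:aws:glue:us-east-1:123456789012:job/my-etl-job",
    },
    "redshift-copy-role": {
        "trusted_service": "redshift.amazonaws.com",
        "actions": ["s3:GetObject"],
        "resource": "arn:aws:s3:::my-bucket/curated/*",
    },
}


def trust_policy_for(role_name):
    """Build the trust policy: only this role's one service may assume it."""
    return {
        "Version": "2012-10-17",
        "Statement": [
            {
                "Effect": "Allow",
                "Principal": {"Service": ROLES[role_name]["trusted_service"]},
                "Action": "sts:AssumeRole",
            }
        ],
    }


def permission_policy_for(role_name):
    """Build the permission policy from this role's narrow action list."""
    role = ROLES[role_name]
    return {
        "Version": "2012-10-17",
        "Statement": [
            {
                "Effect": "Allow",
                "Action": role["actions"],
                "Resource": role["resource"],
            }
        ],
    }


def blast_radius(role_name):
    """Actions exposed if this one role is compromised."""
    return set(ROLES[role_name]["actions"])


if __name__ == "__main__":
    for name in ROLES:
        print(name, sorted(blast_radius(name)))
```

If the Lambda trigger role leaks, the attacker can start one Glue job. With a single shared role, they'd get everything in the table at once.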
That "Action": "*" shortcut I used early on?
In a real production environment, that’s the kind of mistake that shows up in a security audit — and in an interview question.
Full breakdown with real policy JSON and common mistakes →
medium.com/@mallinathnpatil1…
#DataEngineering #AWS #IAM #CloudSecurity #Python #DataEngineer