Building a Data Engineering Agent SDK based on Claude-agent-sdk
- I built a minimal agent SDK with Polars pipelines, governance hooks, and MCP servers in ~2,000 lines
- Polars pipeline: CSV → filter → join → aggregate in 5 steps proves the pattern works
- Each step returns an Artifact (URI, schema, row_count, lineage) — structured outputs beat print statements
- Hooks wrap every tool call: PreToolUse for access control, PostToolUse for lineage logging
- Session context (user, role, warehouse) passed everywhere — no globals, no magic state
- MCP server isolates tools in a subprocess via JSON-RPC over stdio, so a crashing tool cannot take down the agent process
- Direct SDK for fast iteration: agent.register_tool(run_polars) → query("filter data")
- Three working examples: simple SQL, Polars pipeline, MCP subprocess
- Governance hook denies finance.* tables unless role=finance: declarative permissions
- Lineage hook writes every execution to /tmp/lineage.jsonl — observability for free
- Six tools handle real work: run_sql, read_metadata, run_polars, transform_csv, join_datasets, aggregate_data
- Working code over promises: CSV → clean → enrich → join → aggregate runs end-to-end
- No LLM integration yet — tools called directly via simple prompt parser, add Claude API later
- Artifact type does the heavy lifting — Dataset, QueryPlan, JobRun can wait for v2
- Plan mode stubbed (returns tool name) — full dry-run with cost estimates is future work
- Docker + uv = 30-second cold start, not 5-minute pip dependency hell
- Polars integration shows extensibility: add dbt, Spark, data quality next
- This demonstrates the pattern (message loop + tools + hooks), not a complete framework
- Learn how agent frameworks actually work, then extend based on your needs
- Types (Pydantic) and working examples beat architecture docs you never finish reading
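
The Artifact, session-context, and hook bullets above can be sketched in a few dozen lines. This is a minimal illustration, not the SDK's actual code: it swaps Pydantic for stdlib dataclasses so it runs without dependencies, and the names `Session`, `PermissionDenied`, `make_lineage_hook`, and `call_tool`, along with the hook signatures, are assumptions made for the example. The lineage path is a parameter rather than a hard-coded `/tmp/lineage.jsonl`.

```python
from __future__ import annotations

import json
from dataclasses import asdict, dataclass, field
from pathlib import Path
from typing import Any, Callable


@dataclass(frozen=True)
class Session:
    """Per-call context, passed explicitly instead of living in globals."""
    user: str
    role: str
    warehouse: str


@dataclass
class Artifact:
    """Structured tool output: where the data lives and how it was produced."""
    uri: str
    schema: dict[str, str]
    row_count: int
    lineage: list[str] = field(default_factory=list)


class PermissionDenied(Exception):
    """Raised by a PreToolUse hook to block a tool call."""


def governance_hook(session: Session, tool: str, args: dict[str, Any]) -> None:
    """PreToolUse: deny finance.* tables unless the caller has role=finance."""
    table = args.get("table", "")
    if table.startswith("finance.") and session.role != "finance":
        raise PermissionDenied(f"{session.user} may not access {table}")


def make_lineage_hook(path: Path) -> Callable[[Session, str, Artifact], None]:
    """PostToolUse: append one JSON line per successful tool execution."""
    def hook(session: Session, tool: str, artifact: Artifact) -> None:
        record = {"user": session.user, "tool": tool, **asdict(artifact)}
        with path.open("a", encoding="utf-8") as fh:
            fh.write(json.dumps(record) + "\n")
    return hook


def read_metadata(session: Session, table: str) -> Artifact:
    """Hypothetical stand-in tool that returns a fixed schema."""
    return Artifact(
        uri=f"{session.warehouse}://{table}",
        schema={"id": "int64", "amount": "float64"},
        row_count=0,
        lineage=[table],
    )


def call_tool(
    session: Session,
    tool: Callable[..., Artifact],
    args: dict[str, Any],
    pre_hooks: list[Callable[[Session, str, dict[str, Any]], None]],
    post_hooks: list[Callable[[Session, str, Artifact], None]],
) -> Artifact:
    """Run pre hooks, then the tool, then post hooks; a pre hook may raise."""
    for pre in pre_hooks:
        pre(session, tool.__name__, args)
    artifact = tool(session, **args)
    for post in post_hooks:
        post(session, tool.__name__, artifact)
    return artifact
```

Because a denied call raises before the tool runs, nothing is written to the lineage log for it; only successful executions leave a record.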
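
The subprocess isolation described in the MCP bullet can be illustrated with newline-delimited JSON-RPC 2.0 over stdio. The sketch below shows only the transport pattern and is not an MCP implementation (it skips the initialize handshake and tool listing); `ToolClient`, `aggregate_sum`, and the deliberate `crash` method are hypothetical names invented for the example.

```python
import json
import subprocess
import sys

# Hypothetical tool server: one JSON-RPC request per stdin line,
# one JSON-RPC response per stdout line.
SERVER_SOURCE = r'''
import json, os, sys

TOOLS = {
    "aggregate_sum": lambda values: sum(values),
    "crash": lambda: os._exit(1),  # simulates a hard tool crash
}

for line in sys.stdin:
    req = json.loads(line)
    try:
        result = TOOLS[req["method"]](**req.get("params", {}))
        resp = {"jsonrpc": "2.0", "id": req["id"], "result": result}
    except Exception as exc:
        resp = {"jsonrpc": "2.0", "id": req["id"],
                "error": {"code": -32000, "message": str(exc)}}
    sys.stdout.write(json.dumps(resp) + "\n")
    sys.stdout.flush()
'''


class ToolClient:
    """Talks to the tool subprocess and restarts it if it dies."""

    def __init__(self) -> None:
        self._next_id = 0
        self._start()

    def _start(self) -> None:
        self.proc = subprocess.Popen(
            [sys.executable, "-c", SERVER_SOURCE],
            stdin=subprocess.PIPE, stdout=subprocess.PIPE, text=True,
        )

    def _shutdown(self) -> None:
        for pipe in (self.proc.stdin, self.proc.stdout):
            try:
                pipe.close()
            except OSError:
                pass  # pipe already broken because the child exited
        self.proc.wait()

    def call(self, method: str, **params):
        self._next_id += 1
        request = {"jsonrpc": "2.0", "id": self._next_id,
                   "method": method, "params": params}
        try:
            self.proc.stdin.write(json.dumps(request) + "\n")
            self.proc.stdin.flush()
            line = self.proc.stdout.readline()
        except OSError:
            line = ""
        if not line:  # EOF: the server process died mid-call
            self._shutdown()
            self._start()
            raise RuntimeError(f"tool server died during {method!r}")
        response = json.loads(line)
        if "error" in response:
            raise RuntimeError(response["error"]["message"])
        return response["result"]

    def close(self) -> None:
        self._shutdown()
```

A crash in the child surfaces in the parent as an ordinary exception, and the next call goes to a fresh subprocess. The cost is that any in-memory state held by the tool server is lost on restart.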