ABC

Pinned Tweet

ABC

@Ubunta

11 Oct 2025

Using Postgres as a Data Warehouse

- Start with Postgres 18: asynchronous I/O makes table scans 2-3x faster than Postgres 15
- One command runs everything: `docker-compose up`. If partitioning breaks on localhost, it'll break in prod, so test the real structure first
- Async I/O in Postgres 18 changes everything: sequential scans that took 45 seconds now take 15
- No config changes needed: it just works faster out of the box
- Postgres isn't just storage: it's your transform layer, your cache, your query engine
- Materialized views = dashboards that don't run live queries when 500 people open Slack at 9 AM
- Partition by date or tenant: keeps queries under 3 seconds without bigger hardware
- VACUUM and ANALYZE aren't optional
- Use schemas like folders: `raw` for ingestion, `staging` for transforms, `analytics` for BI
- JSONB feels flexible until you try to aggregate millions of rows; use real columns for anything you'll query often
- Foreign keys and constraints catch bad data before your dashboard does
- DuckDB reads Postgres tables directly: `duckdb 'SELECT * FROM postgres_scan(...)'`
- Run heavy aggregations in DuckDB, write results back to Postgres: best of both worlds
- Postgres 18's async I/O + DuckDB's columnar engine = the fastest local analytics stack nobody talks about
- Indexes win 90% of performance battles: btree for filters, GIN for arrays, BRIN for time-series logs
- `EXPLAIN ANALYZE` until you understand how Postgres thinks; if it scans 5M rows, add an index
- Async I/O helps, but indexes help more: fix the query plan before throwing hardware at it
- Backup is boring by design: `pg_dump` to S3 every night
- Back up schemas separately from data: schema recovery is 10x faster than full restores
- Postgres 18's faster I/O means backups and restores complete in half the time
- The real test: can a new engineer clone your repo, run `docker-compose up`, and query prod-like data in 5 minutes?
- Postgres 18 is the warehouse you already have: just use it properly
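
The "partition by date" advice can be made concrete with a small helper that emits Postgres declarative-partitioning DDL. This is only a sketch: the `analytics.events` table, its columns, and the monthly granularity are assumptions for illustration, not something the thread specifies.

```python
from datetime import date


def month_start(year: int, month: int) -> date:
    """First day of a month; month 13 rolls over into January of the next year."""
    return date(year + (month - 1) // 12, (month - 1) % 12 + 1, 1)


def monthly_partition_ddl(parent: str, year: int, month: int) -> str:
    """Build the DDL for one monthly range partition of `parent`."""
    lower = month_start(year, month)
    upper = month_start(year, month + 1)  # upper bound is exclusive in Postgres
    return (
        f"CREATE TABLE IF NOT EXISTS {parent}_{lower:%Y_%m} "
        f"PARTITION OF {parent} "
        f"FOR VALUES FROM ('{lower}') TO ('{upper}');"
    )


# Hypothetical parent table, range-partitioned on event_date.
PARENT_DDL = (
    "CREATE TABLE IF NOT EXISTS analytics.events ("
    "event_date date NOT NULL, tenant_id int NOT NULL, payload jsonb"
    ") PARTITION BY RANGE (event_date);"
)

print(PARENT_DDL)
for m in (11, 12, 13):  # Nov 2025, Dec 2025, Jan 2026
    print(monthly_partition_ddl("analytics.events", 2025, m))
```

Running the generated statements against the local `docker-compose` database first is one way to follow the "test the real structure" point before anything reaches prod.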

ABC

@Ubunta

Feeling empowered because I can build an entire data pipeline… only to realize the design plan alone can burn 50% of the tokens. At this point, the real pipeline is: idea → architecture → token bankruptcy → coffee → wait...wait...wait.. 😂

ABC

@Ubunta

Jun 13

Building a Data AI agent is the easy part. Getting it into the hands of actual data users is where the work starts. First question nobody wants to answer: can you even send this data to a model? Security and governance come before the demo, not after it. Then the audit trail. Something will break eventually, and you'll need evidence of what happened and why. Say you clear all that. Users get access. Now the quieter problem: are they checking the output, or just trusting it? Most trust it. That's the part that bites you later. And when an answer is wrong, you're stuck on the annoying question: is it the model, or is it your tool wrapping the model? That takes longer to untangle than it should. Then cost. Once users start, you can't quietly pull it back. The bill only goes one direction. So it was never really an "AI" problem. It's the same old job. Build, ship, watch, fix, pay for it. Repeat.

ABC

@Ubunta

Jun 10

LLM companies may have millions of users, but only one real customer: Developers.

ABC

@Ubunta

Jun 10

Claude Fable 5 is outstanding for complex medical cohort generation, but it is extremely expensive. The only option:

Fable 5: Generate cohort
→ GPT-5.5: Validate cohort
→ If concerns are identified: Send only flagged cases back to Fable 5
→ Final validated cohort
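
A rough sketch of that generate, validate, re-send loop, with both models replaced by plain callables. The function names, the `max_retries` parameter, and the stub behaviour are all hypothetical; no real model API is called here.

```python
from typing import Callable, Dict, List, Tuple

Case = Dict[str, int]


def build_cohort(
    candidates: List[Case],
    generate: Callable[[List[Case]], List[Case]],   # expensive model
    validate: Callable[[Case], bool],               # cheaper validator model
    max_retries: int = 1,
) -> Tuple[List[Case], List[Case]]:
    """Return (accepted, unresolved). Only flagged cases go back to `generate`."""
    accepted: List[Case] = []
    flagged: List[Case] = []
    pending = generate(candidates)          # one expensive pass over everything
    for attempt in range(max_retries + 1):
        flagged = []
        for case in pending:
            (accepted if validate(case) else flagged).append(case)
        if not flagged or attempt == max_retries:
            break
        pending = generate(flagged)         # expensive pass over flagged cases only
    return accepted, flagged


def fake_generate(cases: List[Case]) -> List[Case]:
    # Stand-in for the expensive model: counts attempts so retries are visible.
    return [{**c, "attempts": c.get("attempts", 0) + 1} for c in cases]


def fake_validate(case: Case) -> bool:
    # Stand-in for the validator: odd ids pass only on their second attempt.
    return case["id"] % 2 == 0 or case["attempts"] >= 2


accepted, unresolved = build_cohort([{"id": 1}, {"id": 2}, {"id": 3}],
                                    fake_generate, fake_validate)
print([c["id"] for c in accepted], unresolved)
```

Anything still flagged after the retry budget is returned separately rather than silently dropped, so a human can review it.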

ABC

@Ubunta

Jun 4

Frugal Modern Data Engineering AI Infrastructure

An AI agent that builds and runs data pipelines, without burning money or reinventing what already exists. It doesn't spin up new infra. It learns your existing stack (what warehouses, catalogs, and pipelines you already have) and reuses them.

The data engineering loop:
- Engineer states the intent in plain English
- Agent discovers the existing infra stack + reusable pipelines (no rebuild)
- Estimates token + compute cost before it runs anything
- Checks the cache first: most asks are answered for free
- Runs on Databricks / Snowflake over secure M2M

The frugal part is the guardrail:
- Estimated cost over the max threshold? → pause, ask a human
- Burn rate spiking or errors past a threshold? → stop, hand back to a human
- A task fails? → it learns from the failure, self-heals, retries, re-routes

So it never quietly runs up a bill. It either does the cheap, reusable thing, or it stops and asks.

Governed end to end: every query, model call, and dollar logged. Token & cost ledger, lineage, audit trail.

Reuse over rebuild. Estimate before compute. Pause before you pay.
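
One possible shape for the "cache first, estimate, pause" guardrail, as a minimal sketch. The `FrugalRunner` class, its field names, and the dollar threshold are assumptions made for illustration; the estimate and execute steps are passed in as plain functions.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List


@dataclass
class FrugalRunner:
    max_cost_usd: float                                   # hypothetical per-task cap
    cache: Dict[str, str] = field(default_factory=dict)
    ledger: List[dict] = field(default_factory=list)      # cost and audit log

    def run(self, intent: str,
            estimate: Callable[[str], float],
            execute: Callable[[str], str]) -> str:
        # 1. Cache first: a repeated ask costs nothing.
        if intent in self.cache:
            self._log(intent, "cache_hit", 0.0)
            return self.cache[intent]
        # 2. Estimate before compute.
        cost = estimate(intent)
        if cost > self.max_cost_usd:
            # 3. Pause before paying: hand the decision back to a human.
            self._log(intent, "paused_for_human", cost)
            return "PAUSED: estimated cost exceeds threshold, needs approval"
        result = execute(intent)
        self.cache[intent] = result
        self._log(intent, "executed", cost)
        return result

    def _log(self, intent: str, status: str, estimated_cost: float) -> None:
        self.ledger.append(
            {"intent": intent, "status": status, "estimated_cost_usd": estimated_cost}
        )


runner = FrugalRunner(max_cost_usd=1.0)
print(runner.run("daily row counts", lambda i: 0.10, lambda i: "42 rows"))
print(runner.run("daily row counts", lambda i: 0.10, lambda i: "42 rows"))
print(runner.run("rebuild all history", lambda i: 25.0, lambda i: "done"))
print([entry["status"] for entry in runner.ledger])
```

The ledger records every decision, including the ones that never executed, which is what makes the later audit possible.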

ABC

@Ubunta

Jun 1

Talked to dozens of young EU devs lately. The brutal job market has them desperate — stacking AI subscriptions (OpenAI, Claude, Cursor) and starring repos they never open. It feels like progress. It isn't. Buying tools isn't building skill, and chasing the trend isn't knowing where it's going. The real question: what does the market want now, and where are dev jobs even heading?

ABC

@Ubunta

May 31

Which LLM I reach for in Data Engineering, and when:

Claude Opus 4.8: building from zero
- deterministic data apps
- pyspark / distributed pipelines
- hand-optimized DB access libs for big reads

OpenAI Codex 5.5: surgery on what exists
- bolting security onto live pipelines
- single-node duckdb work
- refactoring tangled multi-connector pipelines

ABC

@Ubunta

May 29

LLM releases now feel like Windows updates. Claude, OpenAI, DeepSeek — everyone is dropping models so frequently that you barely finish testing one before the next one arrives. And just like Windows updates, the changelog sounds exciting… …but after restarting, everything feels mostly the same.

ABC

@Ubunta

May 29

I built a Clinical Trials AI agent on top of the entire ClinicalTrials.gov database (AACT: 50 tables, 12M rows), and I want to share a few things I learned, because most of them surprised me.

The goal: not another chatbot that "talks to a database," but an agent that could actually reason over a massive, real-world dataset and be useful. And I vibe-coded the whole thing: directing AI to architect and build it, while I held the design decisions, the data model, and the deployment.

1. The hard part isn't the agent. It's the data.
You can wire up a model in an afternoon. But 12M rows don't answer questions quickly or cheaply by accident. Almost all the real effort went into the boring layer underneath: a two-DB split (a read-only pool just for AACT), materialized views + hand-tuned SQL instead of an ORM fighting the schema, versioned snapshots with automated sync as new dumps land, and caching so hot queries stay cheap. The agent is the easy 10%. The data engineering is the 90% nobody films.

2. You don't need an agent framework. You need a clean tool contract.
No LangChain, no orchestration library. Just a tiny registry: defineTool() to declare a tool, runTool() to call one. The thing I'd underline for anyone building agents: every call goes through the same pipeline: schema-validated (Zod) → policy-checked → executed → audited. That one invariant is worth more than any framework. The control loop stays mine, and adding a capability is one file.

3. Give the model a toolbelt, not a database.
Instead of raw SQL, I gave it nine purpose-built tools: trials_search, study_get, eligibility_lookup, feasibility, competitive_landscape, safety_profile, and more. Each encodes how a human actually thinks about clinical trials, and the model composes them. You're not building a query interface, you're building the agent's vocabulary for the domain.

4. Decouple from the model early.
Everything goes through one getModel(role) factory: Anthropic + OpenAI today, switchable per task. Adding Bedrock, Azure, or a local model is one file, zero refactor. Models change every few months now; your architecture shouldn't care which one is winning this week.

5. Vibe-coding is a real skill, and it's not "typing less."
The work wasn't writing code. It was steering an AI to produce a clean, production-shaped system: knowing what good architecture looks like, catching when it drifts, then getting it deployed: containerized, health checks, telemetry on every call.
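
The tool contract in point 2 might look roughly like this. The original is described as using Zod (so presumably TypeScript); this is a Python analogue with plain type checks standing in for schema validation, and `define_tool`, `run_tool`, the role names, and the stub handler are illustrative assumptions rather than the author's actual code.

```python
from typing import Any, Callable, Dict, List, Set

TOOLS: Dict[str, dict] = {}
AUDIT_LOG: List[dict] = []


def define_tool(name: str, schema: Dict[str, type],
                handler: Callable[..., Any], allowed_roles: Set[str]) -> None:
    """Register a tool with its argument schema and access policy."""
    TOOLS[name] = {"schema": schema, "handler": handler, "roles": allowed_roles}


def run_tool(name: str, args: Dict[str, Any], role: str) -> Any:
    """Single entry point: validate -> policy check -> execute -> audit."""
    tool = TOOLS[name]
    entry = {"tool": name, "args": args, "role": role, "status": "failed"}
    try:
        # 1. Schema validation (simple type checks in place of a schema library).
        if set(args) != set(tool["schema"]):
            raise ValueError(f"expected arguments {sorted(tool['schema'])}")
        for key, expected in tool["schema"].items():
            if not isinstance(args[key], expected):
                raise TypeError(f"{key} must be {expected.__name__}")
        # 2. Policy check.
        if role not in tool["roles"]:
            raise PermissionError(f"role {role!r} may not call {name}")
        # 3. Execute.
        result = tool["handler"](**args)
        entry["status"] = "ok"
        return result
    finally:
        # 4. Audit every call, whether it succeeded or not.
        AUDIT_LOG.append(entry)


# Stub handler; a real one would query the read-only AACT pool.
define_tool(
    "trials_search",
    {"condition": str, "limit": int},
    lambda condition, limit: [f"{condition}-trial-{i}" for i in range(limit)],
    allowed_roles={"analyst"},
)
print(run_tool("trials_search", {"condition": "asthma", "limit": 2}, role="analyst"))
```

Because `run_tool` is the only way to reach a handler, adding a capability means one more `define_tool` call and nothing else.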

ABC

@Ubunta

May 26

The hardest problem in AI-assisted Data Engineering isn't token burn. It's measuring token burn against actual engineering outcomes. You can burn millions of tokens and end up with a solid pipeline — or complete garbage. The real trap: letting AI talk directly to your database. That's a token-burn accelerant. AI shouldn't freely explore your DB. Give it bounded context, clear contracts, safe APIs, schema summaries, query limits, and measurable outcomes. Otherwise you're not doing AI-powered Data Engineering. You're just paying for confusion at scale.
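
To illustrate "schema summaries" and "query limits", here is a small sketch that uses SQLite from the standard library as a stand-in for the real database. The function names and the 100-row cap are assumptions, and the SELECT check is deliberately naive; in practice a read-only database role would sit underneath it.

```python
import re
import sqlite3

MAX_ROWS = 100  # hypothetical hard cap on rows returned to the model


def bounded_select(conn: sqlite3.Connection, sql: str, max_rows: int = MAX_ROWS):
    """Run a single SELECT with an enforced row cap; reject everything else."""
    statement = sql.strip().rstrip(";")
    # Conservative: any remaining semicolon is treated as a second statement.
    if ";" in statement or not re.match(r"(?i)^select\b", statement):
        raise ValueError("only a single SELECT statement is allowed")
    capped = f"SELECT * FROM ({statement}) LIMIT ?"
    return conn.execute(capped, (max_rows,)).fetchall()


def schema_summary(conn: sqlite3.Connection) -> str:
    """Compact table(column, ...) listing to give the model instead of DB access."""
    lines = []
    tables = conn.execute(
        "SELECT name FROM sqlite_master WHERE type = 'table' ORDER BY name"
    ).fetchall()
    for (table,) in tables:
        cols = [row[1] for row in conn.execute(f"PRAGMA table_info({table})")]
        lines.append(f"{table}({', '.join(cols)})")
    return "\n".join(lines)


conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE orders (id INTEGER, amount REAL)")
conn.executemany("INSERT INTO orders VALUES (?, ?)", [(i, i * 1.5) for i in range(500)])
print(schema_summary(conn))
print(len(bounded_select(conn, "SELECT * FROM orders")))
```

The model sees the one-line schema summary and at most `MAX_ROWS` rows per call, which keeps both the context size and the token bill bounded.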

ABC

@Ubunta

May 24

Tokenmaxxing won't save bad Data Engineering. Weak foundations will sink it faster.

If you ask AI to build entire pipelines without:
- clear desired outcomes
- a proper dev environment
- defined tech choices and responsibilities
- a solid design structure

…you'll burn infinite tokens and still ship garbage pipelines.

AI doesn't replace engineering fundamentals; it amplifies them. Strong software practices, local testing, and safe environments matter more now, not less.

If your data pipelines hit prod directly, don't blame AI. AI just exposed what was already broken.

ABC

@Ubunta

May 22

Hard truth for Data Engineers: Knowing your tools, writing ETL, and shipping PySpark jobs is no longer a moat. It's the baseline — and the baseline is being automated. The next wave isn't "Data Engineer 2.0." It's full-stack engineers who use AI to solve data challenges faster than any team could last year. The ask isn't more data engineers. It's more AI data automation.

ABC

@Ubunta

May 21

5 places AI automation still doesn’t fully belong in large-scale Data Engineering platforms:

1. Domain logic & sanity checks: too business-specific
2. Infra & secrets: no-go zone
3. Pipeline tests: no intent = fake confidence
4. DB migrations: schema changes, backfills, and rollbacks need humans
5. Prod decisions: reruns, data loss, and access changes need accountable owners

AI assists. Humans own intent, controls, and risk.

ABC

@Ubunta

May 15

A year ago I wouldn't trust AI with a JOIN. Last week it built a data pipeline in SQL and Python that's running in production, no issues.

Data engineers should stop asking:
- Can AI write production-grade code?
- Will it replace me?
- Should I bother learning AI-native tools?

The shift already happened. Now it's just about who keeps up.

ABC

@Ubunta

May 12

I’m honestly unsure which part of Data Engineering cannot be automated with GenAI anymore. That does not mean you don’t need data engineers. But it does mean you probably don’t need the same size team as before. In many cases, maybe not even half the team you needed earlier.

ABC

@Ubunta

May 7

Healthcare AI is forcing a rethink of how RAG systems should actually work.

Traditional vector RAG is great for FAQs, support systems, and broad semantic lookup. But once you move into clinical protocols, SAPs, regulatory submissions, research papers, or evidence packages, the retrieval problem changes completely.

The challenge is no longer: "find semantically similar text." It becomes:
- navigating document hierarchy
- reasoning across sections
- preserving traceability to source pages
- avoiding retrieval that is "similar" but contextually wrong

A clinician reviewing a protocol does not think in chunks and embeddings. They navigate endpoints, inclusion criteria, appendices, statistical methodology, references, and cross-document relationships. Retrieval systems should mirror that workflow instead of flattening everything into vector similarity.

This is why I've been experimenting with approaches like PageIndex (github.com/VectifyAI/PageInd…). What I find interesting is not the "vectorless" angle itself. It's the shift toward reasoning-based retrieval using hierarchical document structures and tree navigation that behaves much closer to how domain experts actually read long documents.

I don't think vector RAG disappears. It still solves many problems well. But for long-form, structured, regulated domains like healthcare, I increasingly think the future is hybrid: vector retrieval + reasoning-based document navigation working together in the same platform. Then let healthcare professionals judge which outputs are actually more trustworthy, traceable, and clinically useful.

Because in regulated AI systems, retrieval quality is not just a UX feature. It's part of the safety layer.

GitHub - VectifyAI/PageIndex: 📑 PageIndex: Document Index for Vectorless, Reasoning-based RAG
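
To show what "tree navigation with traceability to source pages" can mean in code, here is a toy sketch. It is not PageIndex's API: the `Section` type, the sample protocol, and the keyword-overlap scorer are all made up for illustration, and a real system would replace the scorer with a model reasoning over section summaries.

```python
from dataclasses import dataclass, field
from typing import Callable, List, Tuple


@dataclass
class Section:
    title: str
    pages: Tuple[int, int]                  # (first_page, last_page) for citation
    text: str = ""                          # section summary or body
    children: List["Section"] = field(default_factory=list)


def keyword_score(question: str, section: Section) -> int:
    """Toy relevance: number of question words found in the title or text."""
    words = set(question.lower().split())
    return len(words & set((section.title + " " + section.text).lower().split()))


def navigate(root: Section, question: str,
             score: Callable[[str, Section], int] = keyword_score) -> List[Section]:
    """Walk root -> leaf, following the best-scoring child at each level.
    The returned path lets an answer cite its section and page range."""
    path = [root]
    node = root
    while node.children:
        node = max(node.children, key=lambda child: score(question, child))
        path.append(node)
    return path


protocol = Section("Protocol", (1, 80), children=[
    Section("Study Population", (10, 25), "eligibility of participants", children=[
        Section("Inclusion Criteria", (10, 14), "adults aged 18 to 65 with asthma"),
        Section("Exclusion Criteria", (15, 25), "pregnancy or prior biologic therapy"),
    ]),
    Section("Statistical Methodology", (40, 60),
            "primary endpoint analysis and sample size"),
])

path = navigate(protocol, "inclusion criteria for eligibility")
print(" > ".join(s.title for s in path), path[-1].pages)
```

The point of returning the whole path is that the answer arrives with its location in the document, which a reviewer can check against the source pages.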

ABC

@Ubunta

Apr 28

Top dangerous things to do in Data Engineering

1. Backups inside the same blast radius. Same region, same admin key, same failure path. That is not disaster recovery.
2. Letting AI touch production directly. AI can draft deployment code. It should not control your production cluster.
3. Connecting MCP servers without governance. Every new tool connection is a new permission boundary, audit gap, and attack surface.
4. Running without serious observability. Silent failures, data drift, runaway costs, and wrong outputs are worse when nobody is watching.

Most data engineering disasters start with one thing: too much access and too little control.

ABC

@Ubunta

Apr 25

Through recent conferences and conversations, two approaches to GenAI keep showing up. On one side, enterprises are still debating the risks and relevance without ever touching it. On the other, teams are buying every tool in sight, burning budget at speed, then concluding: “AI doesn’t work.”

Different teams. Same mistake. One is fear without data. The other is spending without strategy. Both skip the only step that actually matters → small, deliberate experiments. Your environment. Your data. Your constraints.

You don’t get to an opinion on GenAI by only reading about it. You don’t get to ROI by buying your way there. You get there by building something small, watching where it fails, and paying attention to why.

ABC

@Ubunta

Apr 20

The way we build Data Pipelines in regulated healthcare is changing. AI is no longer a downstream consumer; it is becoming a component inside the pipeline itself. And that is where the architecture gets interesting.

The old shape was familiar. Sources → ingest → transform → warehouse → BI. Never fully deterministic (late-arriving data, schema drift, manual labeling all leaked in), but the failure modes were known and the fixes were boring. Healthcare data was messy but the pipeline behavior was predictable.

AI changes the shape. An LLM doing chart abstraction mid-DAG. An agent selecting a cohort definition. A RAG call enriching a record before it lands in the warehouse. Now the pipeline has a new class of failure: silent semantic corruption, non-reproducible outputs, cost blowups from agent loops. In a regulated environment, that is the whole problem.

The discipline is simple. Not easy. Keep the pipeline deterministic. Let AI live only inside bounded, validated nodes.

I keep going back to how the biodata community solved reproducibility: nf-core / Nextflow → DAG-first execution, content-addressed caching, resume-on-failure, containerized steps, provenance baked in. That mindset translates directly.

I am building it now:
- DAG as the backbone. Idempotent steps, content-hashed outputs.
- A common data model as the semantic layer. Schema validation non-negotiable.
- Provenance tracked per record, not per batch.
- LLM nowhere near the orchestrator. Only inside scoped nodes: chart abstraction, endpoint adjudication drafting, protocol-to-SQL translation.
- Every LLM output hits a deterministic validator before persistence.
- Eval layer built before the agents. Clinician-labeled ground truth, re-run on every model bump.

Then AI earns its place in the pipeline, and a regulator can still follow the trail.
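
A minimal sketch of one such bounded node: the output is cached under a content hash of its input, and nothing is stored until a deterministic validator accepts it. The `run_node` helper, the in-memory `CACHE`, and the chart-abstraction fields are assumptions for illustration; the LLM is a stub function.

```python
import hashlib
import json
from typing import Callable, Dict

CACHE: Dict[str, dict] = {}   # in-memory stand-in for a content-addressed store


def content_hash(step: str, payload: dict) -> str:
    """Stable hash of step name + input, so identical inputs reuse one output."""
    blob = json.dumps({"step": step, "input": payload}, sort_keys=True)
    return hashlib.sha256(blob.encode("utf-8")).hexdigest()


def validate_abstraction(output: dict) -> dict:
    """Deterministic gate: reject anything outside the expected schema."""
    if not isinstance(output.get("patient_id"), str):
        raise ValueError("patient_id must be a string")
    smoker = output.get("smoker")
    if smoker is not None and not isinstance(smoker, bool):
        raise ValueError("smoker must be true, false, or null")
    return output


def run_node(step: str, payload: dict,
             llm_call: Callable[[dict], dict],
             validator: Callable[[dict], dict]) -> dict:
    """Bounded LLM node: cached by content hash, validated before persistence."""
    key = content_hash(step, payload)
    if key in CACHE:                        # idempotent re-runs, resume-on-failure
        return CACHE[key]
    output = validator(llm_call(payload))   # unvalidated output is never stored
    CACHE[key] = {"output": output,
                  "provenance": {"step": step, "input_hash": key}}
    return CACHE[key]


calls = []


def fake_llm(payload: dict) -> dict:
    # Stub for the model call; records invocations so cache hits are visible.
    calls.append(payload["patient_id"])
    return {"patient_id": payload["patient_id"],
            "smoker": "never" not in payload["note"]}


record = {"patient_id": "p1", "note": "never smoked"}
first = run_node("chart_abstraction", record, fake_llm, validate_abstraction)
second = run_node("chart_abstraction", record, fake_llm, validate_abstraction)
print(first["output"], len(calls))
```

The second call returns the cached record without touching the model, and the stored provenance ties each output back to the exact input that produced it.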

ABC

@Ubunta

Apr 17

Switching from Claude or Codex to a local coding model for data engineering makes a few things very obvious. The planning quality drops: less context carried across steps, weaker breakdown of problems, and more gaps in logic (especially around joins, transformations, and edge cases). Iteration also slows down a lot. What used to be quick back-and-forth becomes noticeably delayed, which affects how fast you can validate ideas. On top of that, the Mac becomes the bottleneck. High resource usage leads to heating and throttling, and overall system responsiveness takes a hit. While local models reduce external dependencies, the current trade-off is lower reasoning quality and slower workflows, especially for non-trivial data engineering tasks.
