What is your go-to SQL pattern to return the latest row per group in Postgres? The first pattern to reach for is DISTINCT ON.
In a totally unscientific test on a table with 100,000 rows and 20,000 customer_ids, we compared 3 different query patterns running 20 times each, each pre-warmed with noise queries between runs to shuffle the buffer cache. The median execution times were:
No index:
- 32.92ms: Self-join with MAX
- 34.75ms: DISTINCT ON
- 38.68ms: Window function with ROW_NUMBER()
Index on `(customer_id, created_at DESC)`:
- 38.01ms: Self-join with MAX
- 15.99ms: DISTINCT ON
- 25.36ms: Window function with ROW_NUMBER()
Without an index, why are the results similar when I am recommending DISTINCT ON? For this small, unscientific example, the self-join uses a hash aggregate that fits entirely in memory, while `DISTINCT ON` and `ROW_NUMBER()` both require an external sort that spills to disk. Those patterns do less total work conceptually, but they pay a sort cost the self-join avoids. With an index, `DISTINCT ON` eliminates the sort entirely, and it drops from 35ms to 16ms while the self-join is unchanged (it actually ends up slower after an index due to buffer cache pressure from index pages loaded by the other queries in the shuffled run).
For examples below, imagine something as simple as the following orders table:
```
CREATE TABLE orders (
id bigint PRIMARY KEY,
customer_id int,
status text,
created_at timestamptz
);
```
Self-join with MAX
This is the pattern I was taught 20 years ago when I first learned SQL on a not-Postgres database. It's still common in ORMs that generate SQL for multiple databases, but it's not ideal for Postgres.
```
SELECT o.*
FROM orders o
INNER JOIN (
SELECT customer_id, MAX(created_at) AS latest
FROM orders
GROUP BY customer_id
) AS latest ON o.customer_id = latest.customer_id
AND o.created_at = latest.latest
```
This scans `orders` twice: once to find MAX per group, once to join back and fetch the full row. The join on customer_id and created_at is also fragile because two orders for the same customer can share the exact same timestamp. So, you'll want to add a tie-breaker.
In this small example, without an index self-join is the fastest of the three because the hash aggregate fits in memory. With an index it becomes the slowest because the other two queries load index pages into shared_buffers, displacing the heap pages.
DISTINCT ON
This is the Postgres-specific pattern to use. Why Postgres-specific? Because "DISTINCT ON" is not in the SQL Standard.
```sql
SELECT DISTINCT ON (customer_id)
customer_id, status, created_at
FROM orders
ORDER BY customer_id, created_at DESC
```
Returns exactly one row per customer without a join. The `DISTINCT ON (customer_id)` keeps the first row per group after the `ORDER BY` is applied.
```
Unique (cost=12091.82..12591.82 rows=19998 width=20) (actual time=20.538..28.124 rows=20000 loops=1)
Buffers: shared hit=742, temp read=417 written=418
-> Sort (cost=12091.82..12341.82 rows=100000 width=20) (actual time=20.537..24.219 rows=100000 loops=1)
Sort Key: customer_id, created_at DESC
Sort Method: external merge Disk: 3336kB
Buffers: shared hit=742, temp read=417 written=418
-> Seq Scan on orders (cost=0.00..1736.00 rows=100000 width=20) (actual time=0.005..4.951 rows=100000 loops=1)
Buffers: shared hit=736
```
In this example, the sort spills 3336kB to disk because `work_mem` is too small to hold the sort in memory.
Using indexes with DISTINCT ON
Eliminate the sort with a composite index:
```sql
CREATE INDEX ON orders (customer_id, created_at DESC);
```
The EXPLAIN:
```
Result (cost=0.42..5006.10 rows=20155 width=20) (actual time=0.029..11.558 rows=20000 loops=1)
Buffers: shared hit=1930 read=386
-> Unique (cost=0.42..5006.10 rows=20155 width=20) (actual time=0.027..10.411 rows=20000 loops=1)
Buffers: shared hit=1930 read=386
-> Index Scan using orders_customer_id_created_at_idx on orders (cost=0.42..4756.10 rows=100000 width=20) (actual time=0.026..6.576 rows=100000 loops=1)
Buffers: shared hit=1930 read=386
```
The sort is gone entirely because the index is already sorted. Postgres walks the index in `(customer_id, created_at DESC)` order, and `DISTINCT ON` keeps the first row per `customer_id`. Median execution time drops from 35ms to 16ms.
Window function with ROW_NUMBER()
For more flexibly in the number of rows returned per group, use window functions:
```sql
SELECT customer_id, status, created_at
FROM (
SELECT
*,
ROW_NUMBER() OVER (PARTITION BY customer_id ORDER BY created_at DESC) AS rn
FROM orders
) ranked
WHERE rn = 1
```
This reads every row, assigns row numbers, then filters. Generally similar cost to the sort-based `DISTINCT ON` but slightly more overhead in tracking the window frame. But, you can retrieve more than one row per group.