Elasticsearch can easily handle 1TB / 16M rows.
The real problem is how you model index it.
If I was starting from scratch, I’d do it like this:
1. First: clarify workload
Mostly reads? Mostly writes? Analytics or search?
Time based data (logs, events) or static records (users, products)?
Your index design totally depends on this, dont skip this step.
2. Do NOT dump all 1TB into a single index!!
Use time based indices if it is logs/events:
daily / weekly / monthly index like: logs-2025.11.17
This lets you: drop old data, move cold data to cheaper nodes, scale only hot data.
For static data, still split by logical boundaries (region / tenant / type) if needed.
3. Shard design (most noobs mess up here)
Too many shards will kill your cluster faster than too much data.
Target 30–50 GB per shard as a rough rule.
1TB data → around 20–30 primary shards total across all indices is usually fine.
Start simple: maybe 3–5 primary shards per major index, 1 replica.
Watch actual shard sizes and adjust for future indices, not for already created ones.
4. Mapping (this decides performance storage)
Explicit mappings only, do NOT rely on dynamic mapping for everything.
Use:
keyword for exact matches / aggregations
text only for real full text search fields
Turn off things you don’t need: _source only if you really can, doc_values or fielddata only on fields you aggregate/sort on.
Avoid indexing large blobs / descriptions unless you really search them.
5. Hardware / cluster
Prefer SSD, spinning disks gonna hurt you badly at this scale.
Start with 3 data nodes (for HA) if you can:
16–32GB RAM per node
Heap around 50% of RAM but never more than ~30–31GB
Use dedicated master nodes in bigger setups, but for small cluster you can share.
6. Indexing strategy for that initial 1TB load
Use the Bulk API only.
Aim for 5–20 MB per bulk request, tune based on response times.
Multiple parallel bulk workers instead of one gigantic request.
While bulk loading:
Set index.refresh_interval = -1 (disable refresh)
Optionally set number_of_replicas = 0 during load
After load finishes: set replicas back to 1 and refresh_interval to 1s or whatever.
After big import, you can run a forcemerge to reduce segments, but only on read-mostly indices, not on hot write ones.
7. Query design
Use filters (bool filter) for conditions on keyword / numeric / date fields, cause filters are cacheable and much cheaper.
Limit from/size deep pagination, use search_after / scroll for large scans.
Monitor slow logs to see bad queries.
8. Observability
Watch: heap usage, GC, search latency, index latency, queue sizes, shard counts.
If nodes are dying, first check shards and mappings before buying more hardware.
you can design a very solid layout.
But the high level answer:
1TB / 16M rows is not crazy for Elasticsearch
You just need: good mapping, sane shard count, bulk indexing, and time based indices instead of one giant “data” index.