The Reality of High-Traffic Systems
Most applications never hit 1M requests/day. But the ones that do — and weren't designed for it — experience the same failure modes: CPU saturation, connection pool exhaustion, memory leaks under load, and cascading failures when one dependency slows down.
The good news: scaling is not magic. It is a series of well-understood techniques applied in the right order.
Vertical vs Horizontal: Know When to Stop Scaling Up
Vertical scaling (bigger server) is fast and requires no code changes. It is also a dead end. A 64-core machine with 256GB RAM is expensive and still has an upper bound.
Horizontal scaling (more servers) requires your application to be stateless — any instance can handle any request. This means:
- Session state in Redis, not in memory
- File uploads to object storage (S3), not the local filesystem
- Scheduled jobs with distributed locking, not cron on a single host
Get your application stateless first. Everything else follows.
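The third bullet, distributed locking, reduces to an atomic set-if-not-exists with a TTL. A minimal sketch of the idea, where a `Map` and a hand-rolled `setNX` stand in for Redis's `SET key value NX PX ttl` (the lock key and `runNightlyJob` are hypothetical):

```javascript
// Stand-in for a Redis client. In production, use the real command:
// SET lock:nightly-report <instanceId> NX PX 60000
const store = new Map()

function setNX(key, value, ttlMs, now = Date.now()) {
  const entry = store.get(key)
  if (entry && entry.expiresAt > now) return false   // lock held by another instance
  store.set(key, { value, expiresAt: now + ttlMs })  // acquire (or take over an expired lock)
  return true
}

async function runNightlyJob(instanceId) {
  // Every instance schedules the job; only the one that wins the lock runs it.
  if (!setNX('lock:nightly-report', instanceId, 60_000)) return false
  // ...do the work...
  return true
}
```

The TTL matters: if the instance holding the lock crashes mid-job, the lock expires and another instance can take over on the next run.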
Node.js Clustering: Using All CPU Cores
Node.js executes your JavaScript on a single thread. Without clustering, a 16-core server runs your app at roughly 1/16th of its capacity.
```javascript
import cluster from 'cluster'
import { cpus } from 'os'

if (cluster.isPrimary) {
  const numCPUs = cpus().length
  for (let i = 0; i < numCPUs; i++) {
    cluster.fork()
  }
  cluster.on('exit', (worker) => {
    console.log(`Worker ${worker.process.pid} died — restarting`)
    cluster.fork()
  })
} else {
  // Your Express/Fastify app
  startServer()
}
```

In production, use PM2's cluster mode instead — it handles worker management, zero-downtime restarts, and memory limit enforcement.
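PM2 reads its cluster settings from an ecosystem file. A minimal sketch, assuming an app named `api` with an entry point at `./dist/server.js` (both names are placeholders):

```javascript
// ecosystem.config.js — start with: pm2 start ecosystem.config.js
module.exports = {
  apps: [
    {
      name: 'api',                  // placeholder app name
      script: './dist/server.js',   // placeholder entry point
      instances: 'max',             // one worker per CPU core
      exec_mode: 'cluster',         // share one port across workers
      max_memory_restart: '512M',   // recycle a worker that leaks past this
    },
  ],
}
```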
Connection Pool Tuning
Database connections are expensive. Default pool sizes are almost always wrong for production.
```javascript
// Prisma example — tune per your database and instance count.
// Pool size per instance should be: total_db_connections / number_of_app_instances.
// Prisma reads it from the connection URL, e.g. ...?connection_limit=20
const prisma = new PrismaClient({
  datasources: {
    db: {
      url: process.env.DATABASE_URL,
    },
  },
})
```

If you have a PostgreSQL instance that supports 100 connections and 5 app instances, each instance gets a pool of 20. Exceed this and requests queue — or fail.
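The sizing rule, worked through with the numbers from this section (Prisma takes the result as a `connection_limit` query parameter on the connection URL):

```javascript
// Hypothetical numbers — adjust to your database's max_connections setting.
const maxDbConnections = 100  // PostgreSQL max_connections (leave a few for admin tools)
const appInstances = 5        // copies of the app sharing the database
const poolPerInstance = Math.floor(maxDbConnections / appInstances)  // 20

// Prisma reads the pool size from the connection URL:
const url = `${process.env.DATABASE_URL}?connection_limit=${poolPerInstance}`
```

Remember to recompute this when autoscaling raises the instance count: 50 instances against the same 100-connection database means a pool of 2 each, which usually calls for a server-side pooler such as PgBouncer instead.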
Load Balancing Strategies
Round Robin works for homogeneous requests. Least Connections is better when request processing time varies significantly. IP Hash ensures a client always hits the same instance — useful if you have soft session state that is expensive to fully externalize.
At the infrastructure level: NGINX and HAProxy for self-hosted, ALB (AWS) or Cloud Load Balancing (GCP) for cloud-native deployments.
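In NGINX, Least Connections is a one-line directive in the upstream block. A sketch with placeholder backend addresses:

```nginx
upstream api_backend {
    least_conn;                 # route to the instance with fewest active connections
    server 10.0.0.11:3000;      # placeholder app instances
    server 10.0.0.12:3000;
    server 10.0.0.13:3000;
}

server {
    listen 80;
    location / {
        proxy_pass http://api_backend;
        proxy_set_header Host $host;
        proxy_set_header X-Forwarded-For $proxy_add_x_forwarded_for;
    }
}
```

Omitting `least_conn` gives you Round Robin, the default; `ip_hash` in the same position gives you IP Hash.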
Caching: The Force Multiplier
Every request that hits your cache does not hit your database. At 1M requests/day with a 70% cache hit rate, you have reduced database load by 700,000 queries per day.
Cache layers in order of speed:
- In-process cache (LRU in memory): microseconds, lost on restart
- Redis: sub-millisecond, shared across instances, survives app restarts
- CDN edge cache: global, zero origin load for cacheable responses
Cache invalidation strategy matters more than cache implementation. Use event-driven invalidation — when data changes, explicitly invalidate affected cache keys rather than relying on TTL expiry.
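The cache-aside pattern with event-driven invalidation can be sketched as follows. A `Map` stands in for a Redis client, and `getUser`/`updateUser` are hypothetical helpers:

```javascript
// Cache-aside with explicit invalidation. A Map stands in for Redis here;
// swap in a Redis client (get/set/del) for a shared, cross-instance cache.
const cache = new Map()

async function getUser(id, loadFromDb) {
  const key = `user:${id}`
  if (cache.has(key)) return cache.get(key)  // cache hit: no DB query
  const user = await loadFromDb(id)          // cache miss: hit the database
  cache.set(key, user)
  return user
}

// On write, invalidate the affected key instead of waiting for a TTL.
async function updateUser(id, fields, writeToDb) {
  await writeToDb(id, fields)
  cache.delete(`user:${id}`)
}
```

In practice you would keep a short TTL on top of this as a safety net; event-driven invalidation covers the common paths, and the TTL bounds the damage when an invalidation is missed.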
Autoscaling: Letting Traffic Drive Capacity
With containers and Kubernetes, you can scale horizontally in response to real traffic:
```yaml
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: api-hpa            # placeholder name
spec:
  scaleTargetRef:          # point at your Deployment
    apiVersion: apps/v1
    kind: Deployment
    name: api              # placeholder Deployment name
  minReplicas: 3
  maxReplicas: 50
  metrics:
    - type: Resource
      resource:
        name: cpu
        target:
          type: Utilization
          averageUtilization: 60
```

Scale on CPU at 60% utilization — not 80%. You need headroom for traffic spikes before the new pods become ready.
The Checklist Before You Hit the Wall
- Application is stateless (sessions in Redis)
- All CPU cores in use (clustering or container replicas)
- Database connection pool sized correctly per instance
- Redis caching on hot read paths
- Horizontal Pod Autoscaler configured with appropriate headroom
- Load testing done with realistic traffic patterns (k6, Locust)
- Runbook exists for database connection exhaustion