The Reality of High-Traffic Systems
Most applications never hit 1M requests/day. But the ones that do — and weren't designed for it — experience the same failure modes: CPU saturation, connection pool exhaustion, memory leaks under load, and cascading failures when one dependency slows down.
The good news: scaling is not magic. It is a series of well-understood techniques applied in the right order.
Vertical vs Horizontal: Know When to Stop Scaling Up
Vertical scaling (bigger server) is fast and requires no code changes. It is also a dead end. A 64-core machine with 256GB RAM is expensive and still has an upper bound.
Horizontal scaling (more servers) requires your application to be stateless — any instance can handle any request. This means:
- Session state in Redis, not in memory
- File uploads to object storage (S3), not the local filesystem
- Scheduled jobs with distributed locking, not cron on a single host
Get your application stateless first. Everything else follows.
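The third bullet, distributed locking, reduces to an atomic set-if-not-exists with a TTL. A minimal sketch of the idea, where a `Map` and a hand-rolled `setNX` stand in for Redis's `SET key value NX PX ttl` (the lock key and `runNightlyJob` are hypothetical):

```javascript
// Stand-in for a Redis client. In production, use the real command:
// SET lock:nightly-report <instanceId> NX PX 60000
const store = new Map()

function setNX(key, value, ttlMs, now = Date.now()) {
  const entry = store.get(key)
  if (entry && entry.expiresAt > now) return false   // lock held by another instance
  store.set(key, { value, expiresAt: now + ttlMs })  // acquire (or take over an expired lock)
  return true
}

async function runNightlyJob(instanceId) {
  // Every instance schedules the job; only the one that wins the lock runs it.
  if (!setNX('lock:nightly-report', instanceId, 60_000)) return false
  // ...do the work...
  return true
}
```

The TTL matters: if the instance holding the lock crashes mid-job, the lock expires and another instance can take over on the next run.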
Node.js Clustering: Using All CPU Cores
Node.js executes your JavaScript on a single thread. Without clustering, a 16-core server runs your app at roughly 1/16th of its capacity.
```javascript
import cluster from 'cluster'
import { cpus } from 'os'

if (cluster.isPrimary) {
  const numCPUs = cpus().length
  for (let i = 0; i < numCPUs; i++) {
    cluster.fork()
  }
  cluster.on('exit', (worker) => {
    console.log(`Worker ${worker.process.pid} died — restarting`)
    cluster.fork()
  })
} else {
  // Your Express/Fastify app
  startServer()
}
```

In production, use PM2's cluster mode instead — it handles worker management, zero-downtime restarts, and memory limit enforcement.
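PM2 reads its cluster settings from an ecosystem file. A minimal sketch, assuming an app named `api` with an entry point at `./dist/server.js` (both names are placeholders):

```javascript
// ecosystem.config.js — start with: pm2 start ecosystem.config.js
module.exports = {
  apps: [
    {
      name: 'api',                  // placeholder app name
      script: './dist/server.js',   // placeholder entry point
      instances: 'max',             // one worker per CPU core
      exec_mode: 'cluster',         // share one port across workers
      max_memory_restart: '512M',   // recycle a worker that leaks past this
    },
  ],
}
```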
Connection Pool Tuning
Database connections are expensive. Default pool sizes are almost always wrong for production.
```javascript
// Prisma example — tune per your database and instance count.
// Pool size per instance should be: total_db_connections / number_of_app_instances.
// Prisma reads it from the connection URL, e.g. ...?connection_limit=20
const prisma = new PrismaClient({
  datasources: {
    db: {
      url: process.env.DATABASE_URL,
    },
  },
})
```

If you have a PostgreSQL instance that supports 100 connections and 5 app instances, each instance gets a pool of 20. Exceed this and requests queue — or fail.
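The sizing rule, worked through with the numbers from this section (Prisma takes the result as a `connection_limit` query parameter on the connection URL):

```javascript
// Hypothetical numbers — adjust to your database's max_connections setting.
const maxDbConnections = 100  // PostgreSQL max_connections (leave a few for admin tools)
const appInstances = 5        // copies of the app sharing the database
const poolPerInstance = Math.floor(maxDbConnections / appInstances)  // 20

// Prisma reads the pool size from the connection URL:
const url = `${process.env.DATABASE_URL}?connection_limit=${poolPerInstance}`
```

Remember to recompute this when autoscaling raises the instance count: 50 instances against the same 100-connection database means a pool of 2 each, which usually calls for a server-side pooler such as PgBouncer instead.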
Load Balancing Strategies
Round Robin works for homogeneous requests. Least Connections is better when request processing time varies significantly. IP Hash ensures a client always hits the same instance — useful if you have soft session state that is expensive to fully externalize.
At the infrastructure level: NGINX and HAProxy for self-hosted, ALB (AWS) or Cloud Load Balancing (GCP) for cloud-native deployments.
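In NGINX, Least Connections is a one-line directive in the upstream block. A sketch with placeholder backend addresses:

```nginx
upstream api_backend {
    least_conn;                 # route to the instance with fewest active connections
    server 10.0.0.11:3000;      # placeholder app instances
    server 10.0.0.12:3000;
    server 10.0.0.13:3000;
}

server {
    listen 80;
    location / {
        proxy_pass http://api_backend;
        proxy_set_header Host $host;
        proxy_set_header X-Forwarded-For $proxy_add_x_forwarded_for;
    }
}
```

Omitting `least_conn` gives you Round Robin, the default; `ip_hash` in the same position gives you IP Hash.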
Caching: The Force Multiplier
Every request that hits your cache does not hit your database. At 1M requests/day with a 70% cache hit rate, you have reduced database load by 700,000 queries per day.
Cache layers in order of speed:
- In-process cache (LRU in memory): microseconds, lost on restart
- Redis: sub-millisecond, shared across instances, survives app restarts
- CDN edge cache: global, zero origin load for cacheable responses
Cache invalidation strategy matters more than cache implementation. Use event-driven invalidation — when data changes, explicitly invalidate affected cache keys rather than relying on TTL expiry.
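The cache-aside pattern with event-driven invalidation can be sketched as follows. A `Map` stands in for a Redis client, and `getUser`/`updateUser` are hypothetical helpers:

```javascript
// Cache-aside with explicit invalidation. A Map stands in for Redis here;
// swap in a Redis client (get/set/del) for a shared, cross-instance cache.
const cache = new Map()

async function getUser(id, loadFromDb) {
  const key = `user:${id}`
  if (cache.has(key)) return cache.get(key)  // cache hit: no DB query
  const user = await loadFromDb(id)          // cache miss: hit the database
  cache.set(key, user)
  return user
}

// On write, invalidate the affected key instead of waiting for a TTL.
async function updateUser(id, fields, writeToDb) {
  await writeToDb(id, fields)
  cache.delete(`user:${id}`)
}
```

In practice you would keep a short TTL on top of this as a safety net; event-driven invalidation covers the common paths, and the TTL bounds the damage when an invalidation is missed.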
Autoscaling: Letting Traffic Drive Capacity
With containers and Kubernetes, you can scale horizontally in response to real traffic:
```yaml
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: api-hpa            # placeholder name
spec:
  scaleTargetRef:          # point at your Deployment
    apiVersion: apps/v1
    kind: Deployment
    name: api              # placeholder Deployment name
  minReplicas: 3
  maxReplicas: 50
  metrics:
    - type: Resource
      resource:
        name: cpu
        target:
          type: Utilization
          averageUtilization: 60
```

Scale on CPU at 60% utilization — not 80%. You need headroom for traffic spikes before the new pods become ready.
The Checklist Before You Hit the Wall
- Application is stateless (sessions in Redis)
- All CPU cores in use (clustering or container replicas)
- Database connection pool sized correctly per instance
- Redis caching on hot read paths
- Horizontal Pod Autoscaler configured with appropriate headroom
- Load testing done with realistic traffic patterns (k6, Locust)
- Runbook exists for database connection exhaustion