Horizontal vs vertical scaling: the foundational choice
Every production engineering team eventually faces the moment when a single server is no longer enough. The instinctive response is often to buy a bigger machine — more cores, more RAM, faster SSDs. This is vertical scaling (scale-up), and it is a perfectly legitimate first step. A larger server requires no application changes, no infrastructure complexity and no rethinking of your data layer. For early-stage products, vertical scaling will carry you remarkably far: a modern 32-core, 128 GB RAM server can comfortably handle tens of thousands of concurrent users on a well-optimised monolithic web app.
The problem is the ceiling. Once you have vertically scaled to the largest available instance, you have nowhere left to go — and at that point you are also running your entire production workload on a single point of failure. A hardware fault, kernel panic or availability-zone outage takes down everything simultaneously. This is why cloud-native architecture treats vertical scaling as temporary relief, not a strategy.
Horizontal scaling (scale-out) means distributing load across multiple identical server instances. It has no theoretical ceiling: you add instances as load grows, and you remove them when traffic drops. This elasticity is the foundation of modern web app scalability. But horizontal scaling comes with a strict precondition: your application must be designed to run on many servers concurrently, each unaware of the state held by its siblings. This brings us to the most architecturally consequential decision in scaling work.
| Dimension | Vertical scaling | Horizontal scaling |
|---|---|---|
| Ceiling | Hard (largest available instance) | None in practice |
| Application changes required | None | Statelessness, shared data layer |
| Fault tolerance | Single point of failure | Redundant — instances fail independently |
| Cost efficiency at scale | Low — large instances carry high premium | High — commodity instances, pay only for used capacity |
| Deployment complexity | Low | Moderate (load balancer, service discovery) |
| Time to implement | Minutes (instance resize) | Days to weeks (architecture refactor) |
In practice, most teams use a combination: vertical scaling buys time while horizontal scaling is being engineered. The critical mistake is treating vertical as a permanent solution. Resize the server when you need breathing room; simultaneously start the stateless refactor so you are not forced to do it under production pressure.
Note that the horizontal-vs-vertical decision applies regardless of whether you are running a monolith or microservices. As we discuss in our guide to monolith vs microservices architecture, a well-structured monolith scales horizontally just as effectively as a service mesh — the key is statelessness, not the deployment topology. This article focuses on the scaling techniques themselves; the architectural decomposition question is addressed there.
Stateless application design
Statelessness is the single most important design principle for horizontally scalable web applications. A stateless application instance holds no per-user, per-session or per-request data in memory between requests. Every incoming HTTP request carries all the context needed to process it — typically via a JWT or session token — and can be routed to any available instance without side effects.
The contrast is a stateful application that stores session data in memory on a specific server. Once a user logs in and their session lives on Server A, every subsequent request must be routed to Server A — a pattern called sticky sessions. Sticky sessions fundamentally limit your scaling options: you cannot add an instance without some users being unable to reach their session, you cannot remove an instance without terminating live sessions, and you cannot deploy a new version without a disruptive rolling restart that drops in-flight sessions. Sticky sessions are not wrong for small-scale deployments, but they represent a ceiling that arrives earlier than most teams expect.
The path to statelessness involves moving all mutable state out of application memory and into shared external stores:
- Session data. Replace in-memory session objects with signed JWTs (for authentication state) or a distributed session store such as Redis. JWTs are self-contained and require no server-side storage; Redis sessions are centralised and can be invalidated server-side — both patterns allow any instance to handle any request.
- User uploads and generated files. Never write files to the local disk of an application server. Store them in object storage (AWS S3, Google Cloud Storage, Azure Blob) and reference them by URL. Any instance can serve or generate any file this way.
- Application-level caches. In-process caches (local HashMap, LRU cache) create divergent state across instances. Promote them to a shared distributed cache (Redis, Memcached) so all instances share a consistent view of cached data.
- Websocket and real-time state. WebSocket connections are inherently stateful — a connected client is attached to a specific server process. Use a pub/sub layer (Redis Pub/Sub, socket.io adapter, Ably) to broadcast events across all instances so any connected client on any server receives the message.
Load balancing strategies
A load balancer is the entry point to a horizontally scaled cluster. It accepts incoming connections and routes them to available backend instances according to a chosen algorithm. The choice of algorithm, health-check configuration and connection draining policy significantly affects both performance and reliability under real-world load patterns.
The most widely used algorithms in production web applications are:
- Round robin. Requests are distributed sequentially across instances. Statistically equal distribution in theory; in practice, slow requests can cause imbalance if some instances accumulate long-running jobs. Works well when request processing time is roughly uniform (REST APIs, static-asset servers).
- Least connections. Each new request goes to the instance with the fewest active connections at that moment. Significantly better than round robin when request duration varies widely — for example, when some requests trigger complex database queries while others are lightweight lookups. Used by default in NGINX Plus and AWS ALB weighted target groups.
- Weighted routing. Instances are assigned weights based on their capacity. A 16-core instance can receive twice the traffic of an 8-core instance in the same pool. Useful in mixed-instance deployments or during canary releases when you want only a fraction of traffic to reach a new version.
- IP hash / sticky routing. The same client IP always reaches the same instance. Useful for stateful backends where statelessness has not yet been achieved, or for WebSocket connections in single-instance setups. Avoid as a long-term strategy — it limits autoscaling and creates uneven load distribution.
Beyond the algorithm, three health-check properties define how quickly a load balancer reacts to instance failure: the check interval (how often it probes), the healthy threshold (how many consecutive successes before an instance is considered healthy), and the unhealthy threshold (how many consecutive failures before it is pulled from rotation). For production web APIs, typical values are: interval 10–15 s, healthy threshold 2, unhealthy threshold 2–3. This gives a worst-case failure detection time of 30–45 seconds — acceptable for most availability budgets.
Connection draining (also called deregistration delay) is equally important: when an instance is removed from the load balancer pool (during a deploy or scale-down), the load balancer should wait for in-flight requests to complete before terminating the connection — typically 30–60 seconds. Without draining, your deploys create a brief burst of 502/503 errors as active requests hit a terminating instance.
Caching layers: from in-process to CDN
Caching is the highest-leverage performance intervention available to a web backend engineer. A single cache hit avoids a database query that might take 5–50 ms; at scale, the cumulative saving is enormous. But caching is also one of the most commonly misapplied patterns — teams either under-cache (leaving obvious wins on the table) or over-cache (serving stale data with incorrect TTLs and no invalidation strategy).
Think of caching as a stack of layers, each progressively faster and progressively less durable:
- CDN edge caching. Fully static assets (images, JS bundles, CSS, fonts) should be cached at the CDN edge and served from the nearest point of presence to the user — adding far-future Cache-Control headers (max-age=31536000, immutable) and content-hashed filenames. For partially static content (blog posts, product pages), Stale-While-Revalidate allows the CDN to serve cached responses while asynchronously refreshing in the background. A well-configured CDN absorbs 60–95% of total HTTP requests before they reach your origin.
- Reverse proxy / gateway caching. An NGINX or Kong gateway layer can cache API responses for short TTLs (1–60 s) when the endpoint is safe to cache and the data changes infrequently. This layer protects your application servers from traffic spikes without requiring application code changes.
- Distributed in-memory cache (Redis / Memcached). Shared across all application instances, used for computed aggregates, database query results, third-party API responses, session tokens and rate-limit counters. Redis is the industry default in 2026 — it supports richer data structures than Memcached, includes TTL-based expiry, pub/sub for real-time events and optional persistence. Cache invalidation strategy choices: TTL (simple, risks brief staleness), explicit eviction on write (consistent, adds write-path complexity), or event-driven invalidation via a message queue.
- Application-layer in-process cache. A small in-process LRU cache (node-lru-cache, Caffeine in JVM) for truly hot, rarely changing data — for example, a lookup table of 200 country codes loaded at startup. Extremely fast (nanoseconds vs microseconds for Redis), but must be kept small to avoid memory pressure and must be invalidated correctly when the source data changes.
Cache hit ratio is the key metric. For a distributed cache, aim for a hit ratio above 80% on your busiest endpoints. Below 70% suggests either your TTLs are too short, your cache keys are too granular, or your dataset has genuine read-mostly hot spots that need a different access pattern. Profile your database slow query log and correlate it with cache miss logs to identify the highest-value caching opportunities.
Async processing and message queues
HTTP is a synchronous protocol: the client waits while the server processes the request. For most user interactions this is appropriate — the user submits a form and expects to see a result within a second. But many backend operations do not need to complete synchronously: sending an email, resizing an uploaded image, generating a PDF report, indexing a new record in a search engine, calling a slow third-party webhook. Blocking the HTTP response on any of these operations wastes server-side resources, increases latency perceived by the user, and creates fragile failure modes where a third-party timeout causes your entire request to fail.
The pattern is straightforward: your web server receives the request, persists the minimum required state (a job record in the database, or a message in a queue), returns HTTP 202 Accepted immediately, and a separate worker process does the heavy lifting asynchronously. The user's browser polls for completion or receives a push notification (WebSocket, Server-Sent Events) when the job finishes.
The message queue (also called a task queue or job queue) is the mechanism that connects the web tier to the worker tier. The most widely used options in production web applications in 2026:
- Amazon SQS. Fully managed, zero operational overhead, scales automatically to any throughput. Standard queues offer at-least-once delivery; FIFO queues provide exactly-once processing in order. The AWS-native default for teams already on AWS infrastructure.
- RabbitMQ. Mature, self-hosted or managed (CloudAMQP). Supports sophisticated routing topologies (exchanges, bindings, dead-letter queues) that SQS lacks. Preferred when you need fine-grained message routing or priority queues. Slightly more operational overhead than SQS.
- Apache Kafka. Log-based, high-throughput, designed for event streaming at millions of messages per second. Retains messages for configurable periods (not just until consumed), enabling replay and fan-out to multiple consumers. The correct choice for event sourcing, analytics pipelines and real-time data processing. Overkill for simple task queues — Kafka's operational complexity is significant.
- Redis Streams / Bull queue. For teams already running Redis, Bull (Node.js) or RQ (Python) provide simple job queues backed by Redis. Good for moderate throughput (thousands of jobs per minute) with minimal infrastructure addition. Not suitable for high-throughput or multi-consumer fan-out patterns at scale.
Worker tier scaling is independent of web tier scaling. You might run 10 web instances and 3 worker instances during normal operation, then scale workers to 20 during a batch processing surge — without touching the web tier. This decoupling is one of the most valuable properties of async architectures: it allows you to right-size each layer for its actual workload.
Database scaling: read replicas and sharding
The database is almost always the first bottleneck in a growing web application. Application servers are stateless and scale horizontally with minimal friction; databases carry state and are significantly harder to scale. Fortunately, the vast majority of production web apps have far more reads than writes — a ratio of 80:20 or 90:10 is common — which means read scaling is usually the right problem to solve first.
Read replicas are the standard first move. Most managed database services (AWS RDS, Google Cloud SQL, PlanetScale) support one or more read replicas: synchronised copies of your primary database that accept read-only queries. Your application routes all SELECT queries to replicas and sends all writes (INSERT, UPDATE, DELETE) to the primary. A single well-provisioned read replica can absorb 70–90% of your query load, reducing primary database CPU proportionally. Three or four replicas — behind a connection pooler like PgBouncer or ProxySQL — can handle the read load of very large production systems without requiring any schema changes.
The trade-off is replication lag. Async replication (the default in most managed services) means replicas may lag the primary by milliseconds to seconds under write-heavy load. For most read use cases this is acceptable; for reads that must reflect an immediately preceding write (for example, confirming a purchase to a user), route that specific query to the primary using read-your-writes consistency guarantees or with explicit routing logic in your application.
Database sharding is the next step when write throughput or dataset size exceeds what a single primary can handle. Sharding splits your data horizontally across multiple database nodes (shards) based on a shard key. Every record is assigned to exactly one shard based on a deterministic function of its shard key — for example, user_id % N where N is the number of shards. Each shard is an independent database with its own primary and (optionally) its own replicas.
Sharding introduces significant operational and query complexity. Cross-shard queries (joins or aggregations across users on different shards) require either scatter-gather fan-out or de-normalised data redundancy. Schema changes must be applied to all shards in coordination. Rebalancing shards as data grows requires careful migration orchestration. For these reasons, sharding should be treated as a last resort — most teams reach for application-level partitioning, time-series table partitioning (splitting one large table into monthly or yearly sub-tables), or a managed distributed database (CockroachDB, Vitess, PlanetScale) before hand-rolling a sharding layer.
| Technique | Solves | Complexity | When to use |
|---|---|---|---|
| Query optimisation | Slow individual queries | Low | Always first |
| Connection pooling | Connection exhaustion | Low | When connections > 100 |
| Read replicas | Read throughput | Low-medium | Read:write ratio > 5:1 |
| Table partitioning | Table size / query performance | Medium | Tables > 50M rows |
| Sharding | Write throughput / total data size | High | When replicas are insufficient |
| Managed distributed DB | All of the above | Medium (operational outsourced) | Greenfield or migration budget available |
Autoscaling and capacity planning
Autoscaling is the mechanism by which your infrastructure automatically adjusts compute capacity in response to observed demand — adding instances during traffic spikes and terminating them when load subsides. In cloud environments (AWS Auto Scaling Groups, Google Cloud Managed Instance Groups, Kubernetes Horizontal Pod Autoscaler), autoscaling is the standard approach to maintaining availability without over-provisioning baseline capacity for peak load.
Effective autoscaling requires three things to work reliably: meaningful scaling signals, correctly tuned thresholds, and application instances that can start up and shut down cleanly within the autoscaler's reaction window.
Scaling signals. The default signal for most autoscalers is CPU utilisation — add instances when average CPU exceeds a threshold (typically 60–70%), remove when it drops below a lower threshold (typically 30–40%). CPU is a reasonable proxy for compute-bound workloads but misleading for I/O-bound apps that spend most of their time waiting for database responses. For web APIs, request latency (p95 API response time) or active connection count are often more accurate signals. Use custom CloudWatch, Stackdriver or Datadog metrics for latency-based autoscaling when the default CPU metric gives you false signals.
Cooldown and stabilisation windows. Autoscalers include a cooldown period after a scaling action to prevent thrashing — repeatedly adding and removing instances in rapid succession as the metric oscillates around the threshold. A scale-out cooldown of 60–90 s and a scale-in cooldown of 300–600 s (scale-in is slower because removing capacity is riskier than adding it) are typical starting points. Adjust based on your instance startup time and your traffic volatility.
Capacity planning. Autoscaling does not eliminate the need for capacity planning — it changes its nature. You no longer need to pre-provision for sustained peak load, but you do need to specify minimum and maximum instance counts that match your availability requirements and your cost budget. A minimum of 2 instances (one per availability zone) ensures redundancy during scale-in events; a maximum that covers your estimated peak plus a 30% buffer prevents runaway scaling from a traffic spike or an application bug that causes request loops.
Pre-warming is relevant when instance startup time is significant relative to your traffic spike duration. For containerised apps (Docker on ECS or Kubernetes), startup times of 5–30 s are typical. For VM-based scale-out, 60–180 s is more common. If your traffic spikes are sudden (a marketing email drop, a product launch), pre-warming instances ahead of the event — or using scheduled scaling — prevents the initial wave of requests from hitting an under-resourced cluster while the autoscaler catches up.
SLOs and graceful degradation
Scalability is not only about handling more load — it is about maintaining acceptable performance as load increases and recovering gracefully when components fail. Service Level Objectives (SLOs) are the formal commitments that define "acceptable" for your system.
Well-chosen SLOs serve two purposes. Operationally, they define the conditions that trigger an incident and wake an on-call engineer. Architecturally, they define the performance envelope your system must be designed to sustain — and they reveal which components are the constraining factors. If your SLO is p95 API latency under 500 ms and you are regularly breaching it during database maintenance windows, you have identified both a reliability problem and an architectural gap (insufficient read replica capacity or lack of a query cache).
Standard SLO targets for production web applications:
- Availability: 99.9% (Three Nines) allows 8.7 hours of downtime per year. This is the minimum expectation for any commercial SaaS. 99.95% (4.4 hrs/yr) is achievable with multi-AZ deployment and automated failover. 99.99% requires active-active multi-region architecture.
- API latency: p50 < 150 ms, p95 < 500 ms, p99 < 1,000 ms for synchronous HTTP APIs. These are reference points for well-optimised web APIs; compute-heavy endpoints (ML inference, complex reports) carry their own latency budgets.
- Error rate: Less than 0.1% of all requests should return 5xx errors during normal operation. A sustained error rate above 0.1% for more than 5 minutes should trigger an alert.
- Throughput: Define your expected peak RPS (requests per second) and design your infrastructure to sustain it at the SLO latency targets with 30–50% headroom. The headroom absorbs traffic spikes while the autoscaler reacts.
Graceful degradation is the design principle that ensures your application degrades predictably under overload rather than failing completely. Rather than crashing when a dependency is slow or unavailable, the system falls back to a degraded but functional state. Practical implementation patterns:
- Circuit breakers. If calls to an external service (a third-party API, a non-critical internal microservice) fail repeatedly, open the circuit — short-circuit subsequent calls with a cached fallback or a "service temporarily unavailable" message rather than hammering a failing dependency. Libraries like Resilience4j (Java), Polly (.NET) or Opossum (Node.js) implement this pattern.
- Load shedding. When request queues exceed a threshold, actively reject low-priority requests with HTTP 429 (Too Many Requests) rather than accepting them into a queue that can never drain. This protects the core request flow from being starved by low-priority background traffic.
- Feature flags for non-critical paths. Non-essential features (activity feeds, personalisation, recommendation widgets) can be disabled at the feature-flag level during a high-load incident, allowing the application to continue serving core flows without the overhead of the degraded component.
- Stale data over no data. When a cache miss occurs and the database is slow or unavailable, serving a slightly stale cached response is often preferable to returning an error. Implement serve-stale-on-error policies in your cache layer for read-heavy endpoints where eventual consistency is acceptable.
The web application development services we provide at YuSMP Group include architectural review and scalability planning as part of every engagement. We design systems from the outset to meet SLOs — defining autoscaling policies, caching strategies and degradation patterns before the first line of code is written, not as a post-launch retrofit. For teams choosing between architectural approaches, our companion article on monolith vs microservices addresses how decomposition choices interact with scaling strategy.
FAQ
What is the difference between horizontal and vertical scaling?
Vertical scaling (scale-up) means adding more CPU, RAM or storage to a single server. It is fast to implement but has a hard ceiling and creates a single point of failure. Horizontal scaling (scale-out) means adding more server instances and distributing load across them. It has no theoretical ceiling, eliminates single points of failure, and is the foundation of cloud-native web app architecture.
Why does statelessness matter for web app scalability?
A stateless application stores no per-request session data in memory on any individual server. Every instance can handle any incoming request without coordination. This means you can add or remove instances instantly without routing requests to specific servers — which is the prerequisite for autoscaling and zero-downtime deployments. Stateful servers require sticky sessions, complicating load balancing and preventing true horizontal scale.
When should I add a caching layer to my web app?
Add caching when read traffic dominates — typically when your database CPU exceeds 60% or when the same data is read significantly more often than it changes. Start with an in-process cache for hot lookup tables, then add a distributed cache (Redis, Memcached) when your app runs across multiple instances. A well-tuned Redis cache can absorb 80–95% of read load, reducing database pressure proportionally.
What is database sharding and when is it necessary?
Sharding splits a database table horizontally across multiple servers (shards) based on a shard key — for example, by user ID range or geographic region. It is necessary when your dataset outgrows a single database server's capacity or when write throughput exceeds what one primary can handle. Most teams reach for read replicas first (for read scaling) and only introduce sharding when they hit write bottlenecks at significant scale, typically 10M+ rows with high write rates.
How do message queues improve web app scalability?
Message queues (RabbitMQ, SQS, Kafka) decouple slow or resource-intensive work from the HTTP request cycle. Instead of making a user wait while your server sends emails, resizes images, or calls a third-party API, you enqueue the task and respond immediately. Worker processes drain the queue independently and can be scaled separately from your web tier — meaning a spike in background work does not slow down interactive requests.
What SLOs should a scalable web app target?
Industry-standard targets for a production SaaS are: 99.9% uptime (8.7 hrs/yr downtime budget), p50 API latency under 150 ms, p95 under 500 ms, and p99 under 1,000 ms. Error rate should stay below 0.1% of all requests. For consumer-facing apps where UX is critical, tighten p95 to under 300 ms and track Core Web Vitals (LCP under 2.5 s, INP under 200 ms) alongside backend SLOs.
Published 13 June 2026. Technical recommendations reflect production patterns observed across YuSMP Group client engagements 2023–2026 and current cloud provider documentation. Specific thresholds and metrics will vary by workload — treat these as informed starting points, not universal rules.


