Scalability And High Availability


Purpose

Scalability is the ability to handle growth. High availability is the ability to continue serving users when failures happen.

They are related, but not the same. A system can scale to many users and still have a single point of failure. A system can be highly available for a small workload and still collapse when traffic grows.

Scalability Dimensions

Systems scale across several dimensions:

Dimension	Meaning
Traffic	More requests, users, devices, integrations
Data	More records, files, history, analytics, backups
Tenants	More customers with isolation and fairness needs
Features	More workflows and business rules
Teams	More developers making changes safely
Geography	More regions, languages, compliance zones

Good scalability design asks which dimension is actually growing. "Scale" without a noun is mostly architecture fog.

Scaling Patterns

Pattern	Use When	Caution
Vertical scaling	Need quick capacity by increasing machine size	Has limits and can become expensive
Horizontal scaling	Stateless services can run multiple instances	Requires load balancing and shared-state discipline
Caching	Repeated reads are expensive	Invalidation and stale data
Queue-based processing	Work can be asynchronous	Backlogs, retries, idempotency
Read replicas	Read load is high	Replication lag
Partitioning/sharding	Data set or write load exceeds one node	Operational complexity
CDN	Static/global content delivery	Cache rules and invalidation
Event-driven design	Loose coupling and async workflows	Debugging and eventual consistency
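
The caching caution in the table (invalidation and stale data) can be made concrete with a minimal cache-aside sketch. The class name `TTLCache`, the `load` callback, and the injectable `clock` are illustrative assumptions, not part of any particular library.

```python
import time
from typing import Callable, Dict, Tuple


class TTLCache:
    """Cache-aside with time-based expiry (illustrative names only)."""

    def __init__(self, ttl_seconds: float,
                 clock: Callable[[], float] = time.monotonic):
        self._ttl = ttl_seconds
        self._clock = clock  # injectable so expiry can be tested without sleeping
        self._entries: Dict[str, Tuple[float, object]] = {}

    def get(self, key: str, load: Callable[[str], object]) -> object:
        now = self._clock()
        entry = self._entries.get(key)
        if entry is not None and entry[0] > now:
            # Hit: the value may be stale by up to ttl_seconds.
            return entry[1]
        # Miss or expired: read from the source of truth and remember it.
        value = load(key)
        self._entries[key] = (now + self._ttl, value)
        return value

    def invalidate(self, key: str) -> None:
        # Call this when the underlying record changes.
        self._entries.pop(key, None)
```

The TTL bounds how stale a read can be; explicit invalidation shortens that window for writes the application knows about. Both are needed in most real designs, and neither makes the cache consistent with the database.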

High Availability Concepts

High availability (HA) design reduces single points of failure.

Common HA practices:

Multiple instances.
Load balancing.
Health checks.
Auto-restart.
Rolling/canary deployments.
Multi-zone deployment.
Database replication/failover.
Queue durability.
Backup and restore testing.
Disaster recovery plan.
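
As a rough illustration of how the first three practices fit together, the sketch below rotates requests across several instances and skips any that a health check has marked unhealthy. `RoundRobinBalancer` and its methods are hypothetical names; a production load balancer does far more (connection draining, weighting, probing on a schedule).

```python
from typing import Dict, List


class RoundRobinBalancer:
    """Round-robin selection that skips instances reported unhealthy."""

    def __init__(self, instances: List[str]):
        if not instances:
            raise ValueError("at least one instance is required")
        self._instances = list(instances)
        self._healthy: Dict[str, bool] = {name: True for name in instances}
        self._next = 0

    def report_health(self, instance: str, healthy: bool) -> None:
        # In practice this would be fed by a periodic health check.
        if instance not in self._healthy:
            raise KeyError(instance)
        self._healthy[instance] = healthy

    def pick(self) -> str:
        # Try each instance at most once per call, starting at the cursor.
        for _ in range(len(self._instances)):
            candidate = self._instances[self._next]
            self._next = (self._next + 1) % len(self._instances)
            if self._healthy[candidate]:
                return candidate
        raise RuntimeError("no healthy instances available")
```

With one instance the balancer is itself a single point of failure in disguise: a failed health check leaves nothing to route to.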

HA is not only infrastructure. Application code must tolerate retries, duplicate messages, timeouts, and partial failure.
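
A minimal sketch of tolerating duplicate messages, assuming each message carries a unique ID chosen by the sender. `PaymentLedger`, `apply_credit`, and `message_id` are hypothetical names, and the in-memory dictionary stands in for durable storage.

```python
from typing import Dict


class PaymentLedger:
    """Applies each credit once, even if the same message is delivered twice."""

    def __init__(self) -> None:
        self.balance_cents = 0
        # message_id -> balance after that message was first applied
        self._results: Dict[str, int] = {}

    def apply_credit(self, message_id: str, amount_cents: int) -> int:
        if message_id in self._results:
            # Duplicate delivery or client retry: return the earlier result
            # without changing state again.
            return self._results[message_id]
        self.balance_cents += amount_cents
        self._results[message_id] = self.balance_cents
        return self.balance_cents
```

In a real service the deduplication record and the state change would need to be committed together (for example, in one database transaction); otherwise a crash between the two steps reintroduces the duplicate.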

Availability Targets

Availability targets have cost implications:

Target	Approximate Downtime Per Year	Interpretation
99%	3.65 days	Acceptable for non-critical/internal systems
99.9%	8.76 hours	Common practical target for many services
99.95%	4.38 hours	Higher customer expectation
99.99%	52.6 minutes	Requires serious operational maturity
99.999%	5.26 minutes	Expensive and difficult; use only when justified
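
The downtime figures in the table follow from simple arithmetic, sketched below assuming a 365-day year. The function name is illustrative.

```python
def downtime_per_year_hours(availability_percent: float,
                            hours_per_year: float = 365 * 24) -> float:
    """Hours of allowed downtime per year for a given availability target."""
    if not 0 <= availability_percent <= 100:
        raise ValueError("availability must be between 0 and 100")
    return (1 - availability_percent / 100) * hours_per_year


if __name__ == "__main__":
    for target in (99, 99.9, 99.95, 99.99, 99.999):
        hours = downtime_per_year_hours(target)
        print(f"{target}%: {hours:.2f} hours/year ({hours * 60:.1f} minutes)")
```

Each extra nine cuts the downtime budget by a factor of ten, which is why the cost of each step tends to rise sharply.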

Do not promise availability because the number looks nice. Promise what the architecture, team, process, and budget can support.

Single Points Of Failure

Look for:

One database without failover.
One server.
One region.
One queue/broker.
One admin person.
One certificate renewal process.
One manual deployment step.
One shared secret.
One external dependency with no fallback.

The "one person knows this" failure mode is especially common and wonderfully invisible until that person is on leave.
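
For the infrastructure items in the list, a back-of-the-envelope calculation shows why redundancy matters. The sketch assumes component failures are independent, which is an optimistic simplification (shared regions, shared deployments, and shared secrets all correlate failures). Function names are illustrative.

```python
from math import prod
from typing import Iterable


def serial_availability(components: Iterable[float]) -> float:
    """All components must be up: availabilities multiply."""
    return prod(components)


def redundant_availability(replicas: Iterable[float]) -> float:
    """At least one replica must be up: unavailabilities multiply."""
    return 1 - prod(1 - a for a in replicas)


if __name__ == "__main__":
    # Two 99.9% dependencies in the critical path are worse than either alone.
    print(serial_availability([0.999, 0.999]))
    # Two independent 99% servers behind a load balancer do better than one.
    print(redundant_availability([0.99, 0.99]))
```

Every dependency added in series lowers the ceiling on availability; every truly independent replica raises it.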

Data And Consistency

Scalable systems often require tradeoffs:

Strong consistency vs eventual consistency.
Synchronous vs asynchronous workflows.
Single database vs distributed data.
Fresh reads vs cached reads.
Simpler operations vs higher scale.

Use strong consistency where correctness requires it: payments, balances, permissions, inventory reservations, critical state transitions.

Use eventual consistency where user experience can tolerate it: notifications, analytics, search indexes, reports, activity feeds.

Readiness Checklist

Before a launch or growth event:

What traffic/data growth do we expect?
What component saturates first?
What happens if one instance dies?
What happens if a dependency times out?
Can the system autoscale or scale manually?
Is data backed up, and has a restore been tested?
Are long-running jobs idempotent?
Are retries safe?
Are dashboards and alerts ready?
Is rollback tested?
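
Two of the questions above (dependency timeouts and safe retries) can be explored with a small sketch: retry a call a bounded number of times with jittered exponential backoff, then fall back instead of failing outright. `call_with_retries`, `operation`, and `fallback` are hypothetical names, and the sketch assumes the operation is idempotent and signals a timeout by raising `TimeoutError`.

```python
import random
import time
from typing import Callable, TypeVar

T = TypeVar("T")


def call_with_retries(operation: Callable[[], T],
                      fallback: Callable[[], T],
                      attempts: int = 3,
                      base_delay: float = 0.1,
                      sleep: Callable[[float], None] = time.sleep) -> T:
    """Retry a timed-out operation, then degrade to a fallback."""
    for attempt in range(attempts):
        try:
            return operation()
        except TimeoutError:
            if attempt == attempts - 1:
                break  # out of attempts; do not sleep before falling back
            # Random delay up to an exponentially growing cap, so many
            # clients do not retry in lockstep.
            sleep(random.uniform(0, base_delay * 2 ** attempt))
    return fallback()
```

Retrying is only safe if repeating the operation has no extra effect, and bounded attempts matter: unbounded retries against a struggling dependency tend to make the outage worse.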

Team Reference Guide

Guidelines For Teams

Make services stateless where practical.
Design idempotent operations for retryable workflows.
Use queues for asynchronous work with clear retry/dead-letter strategy.
Avoid single points of failure in critical paths.
Match availability target to business value.

Reflection Questions

What part of the system cannot be scaled horizontally today?
What happens if our main database is unavailable?
Which workflow must be strongly consistent?
Which dependency failure should degrade gracefully?

Further Study

AWS reliability pillar: https://docs.aws.amazon.com/wellarchitected/latest/reliability-pillar/welcome.html
Azure reliability guide: https://learn.microsoft.com/en-us/azure/well-architected/reliability/
Google Cloud reliability guidance: https://cloud.google.com/architecture/framework/reliability
Microsoft cloud design patterns: https://learn.microsoft.com/en-us/azure/architecture/patterns/
Martin Fowler, Patterns of Distributed Systems: https://martinfowler.com/articles/patterns-of-distributed-systems/