Scalability And High Availability
Purpose
Scalability is the ability to handle growth. High availability is the ability to continue serving users when failures happen.
They are related, but not the same. A system can scale to many users and still have a single point of failure. A system can be highly available for a small workload and still collapse when traffic grows.
Scalability Dimensions
Systems scale across several dimensions:
| Dimension | Meaning |
|---|---|
| Traffic | More requests, users, devices, integrations |
| Data | More records, files, history, analytics, backups |
| Tenants | More customers with isolation and fairness needs |
| Features | More workflows and business rules |
| Teams | More developers making changes safely |
| Geography | More regions, languages, compliance zones |
Good scalability design asks which dimension is actually growing. "Scale" without a noun is mostly architecture fog.
Scaling Patterns
| Pattern | Use When | Caution |
|---|---|---|
| Vertical scaling | Need quick capacity by increasing machine size | Has limits and can become expensive |
| Horizontal scaling | Stateless services can run multiple instances | Requires load balancing and shared-state discipline |
| Caching | Repeated reads are expensive | Invalidation and stale data |
| Queue-based processing | Work can be asynchronous | Backlogs, retries, idempotency |
| Read replicas | Read load is high | Replication lag |
| Partitioning/sharding | Data set or write load exceeds one node | Operational complexity |
| CDN | Static/global content delivery | Cache rules and invalidation |
| Event-driven design | Loose coupling and async workflows | Debugging and eventual consistency |
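To make the partitioning row concrete, here is a minimal sketch of hash-based shard routing. The function names and the `tenant-N` keys are hypothetical, chosen only for illustration. It also shows one source of the "operational complexity" caution: with plain modulo routing, changing the shard count moves most keys (about 80% in expectation when going from 4 to 5 shards), which is why real systems often reach for consistent hashing or a lookup directory instead.

```python
import hashlib


def shard_for(key: str, shard_count: int) -> int:
    """Map a key to a shard index using a stable hash.

    hashlib is used instead of the built-in hash(), which is randomized
    per process for strings and could route the same key differently
    after a restart.
    """
    if shard_count < 1:
        raise ValueError("shard_count must be at least 1")
    digest = hashlib.sha256(key.encode("utf-8")).digest()
    # Interpret the first 8 bytes as an unsigned integer, then take the modulus.
    return int.from_bytes(digest[:8], "big") % shard_count


def moved_fraction(keys: list[str], old_count: int, new_count: int) -> float:
    """Fraction of keys whose shard changes when the shard count changes."""
    moved = sum(
        1 for k in keys if shard_for(k, old_count) != shard_for(k, new_count)
    )
    return moved / len(keys)


if __name__ == "__main__":
    keys = [f"tenant-{i}" for i in range(10_000)]  # hypothetical tenant IDs
    print("tenant-42 routes to shard", shard_for("tenant-42", 4))
    print("fraction moved going 4 -> 5 shards:", moved_fraction(keys, 4, 5))
```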
High Availability Concepts
HA design removes or mitigates single points of failure so that the loss of one component does not take the whole service down.
Common HA practices:
- Multiple instances.
- Load balancing.
- Health checks.
- Auto-restart.
- Rolling/canary deployments.
- Multi-zone deployment.
- Database replication/failover.
- Queue durability.
- Backup and restore testing.
- Disaster recovery plan.
HA is not only infrastructure. Application code must tolerate retries, duplicate messages, timeouts, and partial failure.
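As a rough illustration of code that tolerates transient failure, the sketch below retries an operation with capped exponential backoff and random jitter. `TransientError`, `call_with_retries`, and the default values are assumptions made for this example, not part of any particular library. It assumes the wrapped operation enforces its own timeout and is safe to repeat.

```python
import random
import time


class TransientError(Exception):
    """Hypothetical error type for failures that are worth retrying."""


def call_with_retries(operation, max_attempts=4, base_delay=0.1,
                      max_delay=2.0, sleep=time.sleep):
    """Run operation(), retrying on TransientError.

    Delays grow exponentially, are capped at max_delay, and are randomized
    (full jitter) so that many clients do not retry in lockstep.
    """
    for attempt in range(1, max_attempts + 1):
        try:
            return operation()
        except TransientError:
            if attempt == max_attempts:
                raise  # Out of attempts: let the caller decide how to degrade.
            cap = min(max_delay, base_delay * 2 ** (attempt - 1))
            sleep(random.uniform(0, cap))


if __name__ == "__main__":
    attempts = {"count": 0}

    def flaky():
        # Fails twice, then succeeds, to exercise the retry path.
        attempts["count"] += 1
        if attempts["count"] < 3:
            raise TransientError("temporary failure")
        return "ok"

    print(call_with_retries(flaky), "after", attempts["count"], "attempts")
```

Retries only help when the operation is idempotent. Retrying a non-idempotent write can turn one failure into two charges.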
Availability Targets
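Every availability target implies a downtime budget, and each additional nine cuts that budget by a factor of ten while typically costing more to achieve: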
| Target | Approximate Downtime Per Year | Interpretation |
|---|---|---|
| 99% | 3.65 days | Acceptable for non-critical/internal systems |
| 99.9% | 8.76 hours | Common practical target for many services |
| 99.95% | 4.38 hours | Higher customer expectation |
| 99.99% | 52.6 minutes | Requires serious operational maturity |
| 99.999% | 5.26 minutes | Expensive and difficult; use only when justified |
Do not promise availability because the number looks nice. Promise what the architecture, team, process, and budget can support.
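The downtime figures in the table follow from simple arithmetic: the unavailable fraction multiplied by the length of the period. The helper below is a small sketch of that calculation (the function name is illustrative). It can also be used for shorter windows; for example, 99.9% over a 30-day month allows about 43 minutes.

```python
def allowed_downtime_minutes(availability_percent: float,
                             period_days: float = 365.0) -> float:
    """Return the downtime budget, in minutes, for a target over a period."""
    if not 0.0 <= availability_percent <= 100.0:
        raise ValueError("availability_percent must be between 0 and 100")
    unavailable_fraction = (100.0 - availability_percent) / 100.0
    return unavailable_fraction * period_days * 24 * 60


if __name__ == "__main__":
    for target in (99.0, 99.9, 99.95, 99.99, 99.999):
        yearly = allowed_downtime_minutes(target)
        monthly = allowed_downtime_minutes(target, period_days=30)
        print(f"{target:>7}%  {yearly:10.2f} min/year  {monthly:8.2f} min/30 days")
```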
Single Points Of Failure
Look for:
- One database without failover.
- One server.
- One region.
- One queue/broker.
- One admin person.
- One certificate renewal process.
- One manual deployment step.
- One shared secret.
- One external dependency with no fallback.
The "one person knows this" failure mode is especially common and wonderfully invisible until that person is on leave.
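A back-of-the-envelope model shows why single points of failure matter. Components that must all work multiply their availabilities together, while redundant replicas fail only if every replica fails. The sketch below assumes failures are independent, which is a strong simplification: replicas that share a region, a deployment, or a bad config tend to fail together, so real gains are smaller than this model suggests.

```python
from math import prod


def serial_availability(components: list[float]) -> float:
    """Availability of a chain where every component must be up.

    Assumes independent failures (a simplification).
    """
    return prod(components)


def redundant_availability(single: float, replicas: int) -> float:
    """Availability when any one of `replicas` identical copies is enough.

    Assumes independent failures (a simplification).
    """
    if replicas < 1:
        raise ValueError("replicas must be at least 1")
    return 1 - (1 - single) ** replicas


if __name__ == "__main__":
    # Three 99.9% components in series land below 99.9% overall.
    print(f"chain of three:  {serial_availability([0.999] * 3):.6f}")
    # Two independent 99% replicas behave like a single 99.99% component.
    print(f"two replicas:    {redundant_availability(0.99, 2):.6f}")
```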
Data And Consistency
Scalable systems often require tradeoffs:
- Strong consistency vs eventual consistency.
- Synchronous vs asynchronous workflows.
- Single database vs distributed data.
- Fresh reads vs cached reads.
- Simpler operations vs higher scale.
Use strong consistency where correctness requires it: payments, balances, permissions, inventory reservations, critical state transitions.
Use eventual consistency where user experience can tolerate it: notifications, analytics, search indexes, reports, activity feeds.
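The "fresh reads vs cached reads" tradeoff can be seen in a few lines. The sketch below is a minimal time-to-live cache; `TTLCache`, its `loader` callback, and the injectable `clock` are names invented for this example. A read may be stale for up to `ttl_seconds` after the underlying value changes, unless the writer explicitly invalidates the key.

```python
import time


class TTLCache:
    """Minimal read-through cache with a per-entry time to live."""

    def __init__(self, loader, ttl_seconds, clock=time.monotonic):
        self._loader = loader      # called on a miss to fetch the fresh value
        self._ttl = ttl_seconds    # upper bound on staleness without invalidation
        self._clock = clock        # injectable so the behavior can be tested
        self._entries = {}         # key -> (value, expires_at)

    def get(self, key):
        now = self._clock()
        entry = self._entries.get(key)
        if entry is not None and now < entry[1]:
            return entry[0]        # possibly stale, but cheap
        value = self._loader(key)
        self._entries[key] = (value, now + self._ttl)
        return value

    def invalidate(self, key):
        """Drop a key so that the next read goes back to the source."""
        self._entries.pop(key, None)


if __name__ == "__main__":
    source = {"plan": "basic"}     # stand-in for a database
    cache = TTLCache(loader=source.__getitem__, ttl_seconds=30)
    print(cache.get("plan"))
    source["plan"] = "pro"
    print(cache.get("plan"))       # still served from cache within the TTL
    cache.invalidate("plan")
    print(cache.get("plan"))       # reloaded from the source
```

A TTL is acceptable for an activity feed and dangerous for a permission check, which is the distinction the two lists above draw.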
Readiness Checklist
Before a launch or an expected growth event, ask:
- What traffic/data growth do we expect?
- What component saturates first?
- What happens if one instance dies?
- What happens if a dependency times out?
- Can the system autoscale, or can it at least be scaled manually in time?
- Is data backed up, and has a restore been tested?
- Are long-running jobs idempotent?
- Are retries safe?
- Are dashboards and alerts ready?
- Is rollback tested?
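One possible answer to "what happens if a dependency times out?" is to bound the wait and fall back to a degraded result. The sketch below uses the standard library's `concurrent.futures`; `call_with_timeout` and the fallback value are hypothetical names for this example. Note that the slow call keeps running in its worker thread after the timeout, so this pattern bounds caller latency, not resource use.

```python
import time
from concurrent.futures import ThreadPoolExecutor
from concurrent.futures import TimeoutError as FutureTimeout

# Shared pool for dependency calls; the size is an arbitrary example value.
_pool = ThreadPoolExecutor(max_workers=4)


def call_with_timeout(fn, timeout_seconds, fallback):
    """Return fn()'s result, or `fallback` if it does not finish in time.

    Exceptions raised by fn itself are not swallowed; they propagate.
    """
    future = _pool.submit(fn)
    try:
        return future.result(timeout=timeout_seconds)
    except FutureTimeout:
        future.cancel()  # Only has an effect if the call has not started yet.
        return fallback


if __name__ == "__main__":
    def slow_recommendations():
        time.sleep(0.5)  # Simulates a dependency that is responding slowly.
        return ["personalized", "items"]

    # Serve a generic list rather than failing the whole page.
    print(call_with_timeout(slow_recommendations, 0.1, ["popular", "items"]))
```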
Team Reference Guide
Guidelines For Teams
- Make services stateless where practical.
- Design idempotent operations for retryable workflows.
- Use queues for asynchronous work, with a clear retry and dead-letter strategy.
- Avoid single points of failure in critical paths.
- Match the availability target to business value.
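The idempotency and dead-letter guidelines can be sketched with an in-memory queue. Everything here (`Message`, `drain`, the `max_attempts` default) is an illustrative assumption rather than the API of any real broker. Duplicate deliveries are skipped by message ID, failures are retried a bounded number of times, and messages that keep failing are set aside for inspection instead of blocking the queue.

```python
from collections import deque
from dataclasses import dataclass


@dataclass
class Message:
    message_id: str
    payload: dict
    attempts: int = 0


def drain(queue, handler, processed_ids, dead_letters, max_attempts=3):
    """Process messages until the queue is empty.

    processed_ids makes redelivery harmless; dead_letters collects messages
    that failed max_attempts times.
    """
    while queue:
        msg = queue.popleft()
        if msg.message_id in processed_ids:
            continue  # Duplicate delivery: already applied, so skip it.
        try:
            handler(msg.payload)
        except Exception:
            msg.attempts += 1
            if msg.attempts >= max_attempts:
                dead_letters.append(msg)  # Park it for a human or a repair job.
            else:
                queue.append(msg)  # Retry later, behind other work.
            continue
        processed_ids.add(msg.message_id)


if __name__ == "__main__":
    def handler(payload):
        if payload.get("fail"):
            raise RuntimeError("downstream rejected the message")
        print("handled", payload["id"])

    queue = deque([
        Message("m1", {"id": "m1"}),
        Message("m1", {"id": "m1"}),               # duplicate delivery
        Message("m2", {"id": "m2", "fail": True}),  # always fails
    ])
    processed, dead = set(), []
    drain(queue, handler, processed, dead)
    print("processed:", processed, "dead-lettered:", [m.message_id for m in dead])
```

In a real system the handler's side effect and the processed-ID record would need to be committed together, or the handler itself would need to be idempotent. This sketch does not attempt that.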
Reflection Questions
- What part of the system cannot be scaled horizontally today?
- What happens if our main database is unavailable?
- Which workflow must be strongly consistent?
- Which dependency failure should degrade gracefully?
Further Study
- AWS reliability pillar: https://docs.aws.amazon.com/wellarchitected/latest/reliability-pillar/welcome.html
- Azure reliability guide: https://learn.microsoft.com/en-us/azure/well-architected/reliability/
- Google Cloud reliability guidance: https://cloud.google.com/architecture/framework/reliability
- Microsoft cloud design patterns: https://learn.microsoft.com/en-us/azure/architecture/patterns/
- Martin Fowler, Patterns of Distributed Systems: https://martinfowler.com/articles/patterns-of-distributed-systems/