Simpro Knowledge Base

Scalability And High Availability


Purpose

Scalability is the ability to handle growth. High availability is the ability to continue serving users when failures happen.

They are related, but not the same. A system can scale to many users and still have a single point of failure. A system can be highly available under a small workload and still collapse when traffic grows.

Scalability Dimensions

Systems scale across several dimensions:

| Dimension | Meaning |
|-----------|---------|
| Traffic   | More requests, users, devices, integrations |
| Data      | More records, files, history, analytics, backups |
| Tenants   | More customers with isolation and fairness needs |
| Features  | More workflows and business rules |
| Teams     | More developers making changes safely |
| Geography | More regions, languages, compliance zones |

Good scalability design asks which dimension is actually growing. "Scale" without a noun is mostly architecture fog.

Scaling Patterns

| Pattern | Use When | Caution |
|---------|----------|---------|
| Vertical scaling | Need quick capacity by increasing machine size | Has limits and can become expensive |
| Horizontal scaling | Stateless services can run multiple instances | Requires load balancing and shared-state discipline |
| Caching | Repeated reads are expensive | Invalidation and stale data |
| Queue-based processing | Work can be asynchronous | Backlogs, retries, idempotency |
| Read replicas | Read load is high | Replication lag |
| Partitioning/sharding | Data set or write load exceeds one node | Operational complexity |
| CDN | Static/global content delivery | Cache rules and invalidation |
| Event-driven design | Loose coupling and async workflows | Debugging and eventual consistency |
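
The caching row and its caution (invalidation and stale data) can be illustrated with a small cache-aside helper. This is a minimal sketch, not a recommended implementation: the class name TTLCache, the load callback, and the injectable clock are all assumptions made for illustration, and the store is a plain in-process dictionary.

```python
import time
from typing import Callable, Dict, Tuple


class TTLCache:
    """Cache-aside helper: reuse a loaded value until it is older than ttl_seconds."""

    def __init__(self, ttl_seconds: float, clock: Callable[[], float] = time.monotonic):
        self.ttl_seconds = ttl_seconds
        self.clock = clock  # injectable so expiry can be tested without sleeping
        self._entries: Dict[str, Tuple[float, object]] = {}

    def get(self, key: str, load: Callable[[], object]) -> object:
        now = self.clock()
        entry = self._entries.get(key)
        if entry is not None and now - entry[0] < self.ttl_seconds:
            return entry[1]  # hit: may be stale relative to the source of truth
        value = load()  # miss or expired: fall through to the expensive read
        self._entries[key] = (now, value)
        return value

    def invalidate(self, key: str) -> None:
        """Drop a key explicitly, for example right after the source is written."""
        self._entries.pop(key, None)
```

The TTL bounds how stale a read can be; explicit invalidation after writes narrows that window further, at the cost of more coupling between writers and the cache.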

High Availability Concepts

HA design reduces single points of failure.

Common HA practices:

  • Multiple instances.
  • Load balancing.
  • Health checks.
  • Auto-restart.
  • Rolling/canary deployments.
  • Multi-zone deployment.
  • Database replication/failover.
  • Queue durability.
  • Backup and restore testing.
  • Disaster recovery plan.

HA is not only infrastructure. Application code must tolerate retries, duplicate messages, timeouts, and partial failure.
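
Tolerating duplicate messages usually means making the handler idempotent. The sketch below deduplicates by message ID; the class name IdempotentConsumer and the in-memory set are assumptions for illustration. A real service would likely need a durable store and an atomic check-and-record step.

```python
from typing import Callable, Set


class IdempotentConsumer:
    """Run a handler at most once per message ID, even if the message is redelivered."""

    def __init__(self, handler: Callable[[dict], None]):
        self._handler = handler
        self._processed: Set[str] = set()  # assumption: in-memory, lost on restart

    def handle(self, message_id: str, payload: dict) -> bool:
        """Return True if the payload was processed, False if it was a duplicate."""
        if message_id in self._processed:
            return False  # duplicate delivery: skip the side effect
        self._handler(payload)
        # Record only after success so a failed attempt can be retried.
        self._processed.add(message_id)
        return True
```

With this shape, a broker that redelivers a message after a timeout does not apply the same side effect twice.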

Availability Targets

Availability targets have cost implications:

| Target  | Approximate Downtime Per Year | Interpretation |
|---------|-------------------------------|----------------|
| 99%     | 3.65 days    | Acceptable for non-critical/internal systems |
| 99.9%   | 8.76 hours   | Common practical target for many services |
| 99.95%  | 4.38 hours   | Higher customer expectation |
| 99.99%  | 52.6 minutes | Requires serious operational maturity |
| 99.999% | 5.26 minutes | Expensive and difficult; use only when justified |

Do not promise availability because the number looks nice. Promise what the architecture, team, process, and budget can support.
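
The downtime figures in the table above follow from simple arithmetic. A minimal sketch, assuming a 365-day year; the function name downtime_minutes_per_year is illustrative.

```python
def downtime_minutes_per_year(availability_percent: float) -> float:
    """Allowed downtime in minutes per 365-day year for a given availability target."""
    minutes_per_year = 365 * 24 * 60  # 525,600 minutes
    return (1 - availability_percent / 100) * minutes_per_year


if __name__ == "__main__":
    for target in (99.0, 99.9, 99.95, 99.99, 99.999):
        minutes = downtime_minutes_per_year(target)
        print(f"{target}% allows about {minutes:.2f} minutes ({minutes / 60:.2f} hours) per year")
```

The same function is useful in the other direction: compare the budget against the time a realistic incident takes to detect, diagnose, and fix.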

Single Points Of Failure

Look for:

  • One database without failover.
  • One server.
  • One region.
  • One queue/broker.
  • One admin person.
  • One certificate renewal process.
  • One manual deployment step.
  • One shared secret.
  • One external dependency with no fallback.

The "one person knows this" failure mode is especially common and wonderfully invisible until that person is on leave.
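
One way to see why single points of failure matter is to estimate combined availability. The sketch below assumes component failures are independent, which real systems often violate (shared region, shared deployment, shared secret), so treat the results as an optimistic upper bound. The function names are illustrative.

```python
from math import prod
from typing import Iterable


def serial_availability(components: Iterable[float]) -> float:
    """Availability of a chain where every component must be up (independence assumed)."""
    return prod(components)


def redundant_availability(single: float, copies: int) -> float:
    """Availability of N interchangeable copies where one healthy copy is enough."""
    return 1 - (1 - single) ** copies


if __name__ == "__main__":
    # A request path through three components, each 99.9% available.
    print(serial_availability([0.999, 0.999, 0.999]))
    # Two independent copies of a 99% available component.
    print(redundant_availability(0.99, 2))
```

Every extra component in the critical path lowers the serial figure, while redundancy raises it only as far as the copies really fail independently.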

Data And Consistency

Scalable systems often require tradeoffs:

  • Strong consistency vs eventual consistency.
  • Synchronous vs asynchronous workflows.
  • Single database vs distributed data.
  • Fresh reads vs cached reads.
  • Simpler operations vs higher scale.

Use strong consistency where correctness requires it: payments, balances, permissions, inventory reservations, critical state transitions.

Use eventual consistency where user experience can tolerate it: notifications, analytics, search indexes, reports, activity feeds.

Readiness Checklist

Before a launch or growth event:

  • What traffic/data growth do we expect?
  • What component saturates first?
  • What happens if one instance dies?
  • What happens if a dependency times out?
  • Can the system autoscale or scale manually?
  • Is data backed up, and has a restore been tested?
  • Are long-running jobs idempotent?
  • Are retries safe?
  • Are dashboards and alerts ready?
  • Is rollback tested?
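
The questions about dependency timeouts and safe retries can be made concrete with a bounded retry helper. This is a sketch under assumptions: the function name retry_with_backoff, the default limits, and the choice to retry only on TimeoutError are illustrative, and it is only safe when the wrapped operation is idempotent.

```python
import random
import time
from typing import Callable, TypeVar

T = TypeVar("T")


def retry_with_backoff(
    operation: Callable[[], T],
    attempts: int = 4,
    base_delay: float = 0.2,
    max_delay: float = 5.0,
    sleep: Callable[[float], None] = time.sleep,
) -> T:
    """Call operation, retrying on TimeoutError with capped exponential backoff and jitter."""
    for attempt in range(attempts):
        try:
            return operation()
        except TimeoutError:
            if attempt == attempts - 1:
                raise  # retry budget exhausted: surface the failure to the caller
            cap = min(max_delay, base_delay * 2 ** attempt)
            # Random jitter spreads retries out so clients do not retry in lockstep.
            sleep(random.uniform(0, cap))
    raise ValueError("attempts must be at least 1")
```

Bounding the attempts matters as much as the backoff: unbounded retries against a struggling dependency tend to make the outage worse.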

Team Reference Guide

Guidelines For Teams

  • Make services stateless where practical.
  • Design idempotent operations for retryable workflows.
  • Use queues for asynchronous work, with a clear retry and dead-letter strategy.
  • Avoid single points of failure in critical paths.
  • Match the availability target to business value.
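
The queue guideline above can be sketched as a worker loop that retries failed jobs a limited number of times and then parks them in a dead-letter list. The function name drain, the (job, attempts) tuple layout, and the in-memory deque are assumptions for illustration; a real broker would provide its own retry and dead-letter features.

```python
from collections import deque
from typing import Callable, Deque, List, Tuple


def drain(
    queue: Deque[Tuple[str, int]],
    handler: Callable[[str], None],
    max_attempts: int = 3,
) -> List[Tuple[str, str]]:
    """Process (job, attempts) entries until the queue is empty; return dead-lettered jobs."""
    dead_letters: List[Tuple[str, str]] = []
    while queue:
        job, attempts = queue.popleft()
        try:
            handler(job)
        except Exception as exc:
            attempts += 1
            if attempts >= max_attempts:
                # Stop retrying; keep the job and the error for later inspection.
                dead_letters.append((job, repr(exc)))
            else:
                queue.append((job, attempts))  # requeue at the back for another try
    return dead_letters
```

A dead-letter destination keeps one poisonous job from blocking the rest of the backlog while preserving it for investigation.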

Reflection Questions

  • What part of the system cannot be scaled horizontally today?
  • What happens if our main database is unavailable?
  • Which workflow must be strongly consistent?
  • Which dependency failure should degrade gracefully?

Further Study

  • AWS reliability pillar: https://docs.aws.amazon.com/wellarchitected/latest/reliability-pillar/welcome.html
  • Azure reliability guide: https://learn.microsoft.com/en-us/azure/well-architected/reliability/
  • Google Cloud reliability guidance: https://cloud.google.com/architecture/framework/reliability
  • Microsoft cloud design patterns: https://learn.microsoft.com/en-us/azure/architecture/patterns/
  • Martin Fowler, Patterns of Distributed Systems: https://martinfowler.com/articles/patterns-of-distributed-systems/