Skip to content

Systems Design Template

FEATURE EXPECTATIONS - 5 min

  • Waht does it need to do?
  • Use cases
  • Scenarios that will not be covered
  • Who will use
  • How many will use
  • Usage patterns

ESTIMATIONS - 5 min

  • Throughput (QPS for read and write queries)
  • Latency expected from the system (for read and write queries)
  • Read/Write ratio
  • Traffic estimates
    • Write (QPS, Volume of data)
    • Read (QPS, Volume of data)
  • Storage estimates
  • Memory estimates
    • If we are using a cache, what is the kind of data we want to store in cache
    • How much RAM and how many machines do we need for us to achieve this ?
    • Amount of data you want to store in disk/ssd

DESIGN GOALS - 5 min

  • Latency and Throughput requirements
  • Consistency vs Availability [Weak/strong/eventual => consistency | Failover/replication => availability]

HIGH LEVEL DESIGN - 5-10 min

  • APIs for Read/Write scenarios for crucial components
  • Database schema
  • Basic algorithm
  • High level design for Read/Write scenario

DEEP DIVE - 15-20 min

  • Deployment
  • Scaling the algorithm
  • Scaling individual components:
    • Availability, Consistency and Scale story for each component
    • Consistency and availability patterns
  • Call graph depth
    • Longer sequence call depths in the graph means compounding availability problems
      • If both Foo and Bar each had 99.9% availability, their total availability in sequence would be 99.8%.
  • What is a graceful (aka. zero downtime) restart?
    • A graceful restart takes place in two steps:
      • A new generation of parent and child processes are spawned to take over service.
      • Processes from the older generation only exit once they have finished their tasks.
  • Think about the following components, how they would fit in and how it would help
    • DNS Networking#How does DNS work
    • CDN [Push vs Pull]
    • Load Balancers [Active-Passive, Active-Active, Layer 4, Layer 7]
      • Reverse Proxy
    • Single points of failure
    • Application layer scaling [Microservices, Service Discovery]
    • DB Cliff Notes - Someguy#Relational model vs document model
      • RDBMS
        • Master-slave, Master-master, Federation, Sharding, Denormalization, SQL Tuning
        • Benefits - great at many-to-many and many-to-one relationships, one copy of data
        • Downfalls - high impedance (results format don't match the model format)
      • NoSQL
        • Benefits: Greater scalability, high write throughput, large datasets, dynamic schemas, low impendance
        • Downfalls: not good at many-to-one or many-to-many relationships, data can be replicated in many objects, eventual consistency
        • Key-Value, Wide-Column, Graph, Document
          • Fast-lookups:
            • RAM [Bounded size] => Redis, Memcached
            • AP [Unbounded size] => Cassandra, RIAK, Voldemort
            • CP [Unbounded size] => HBase, MongoDB, Couchbase, DynamoDB
    • Caches
    • Asynchronism
      • Message queues
      • Task queues
      • Back pressure
    • Communication
      • TCP
      • UDP
      • REST
      • RPC
    • Metrics
      • External (off-box, not off prem) Monitoring
        • in case local monitoring collapses
        • regional problems e.g. check for down isps
        • breaches in SLA
        • QPS, Latency, Errors, Calltimes
      • Host
        • health check
        • disk space, memory, swap, i/o, load
        • anything specific to the programming language
          • e.g. Java GC
      • SLA Alerting/Reporting/Breaches
        • An SLA is a contract.
        • An SLO is a specific goal that is defined in a contract.
        • An SLI measures the extent to which teams comply with the SLO promises they make in SLA contracts.
          • The SLI (Service Level Indicator) equation is the number of good events divided by the total number of valid events, multiplied by 100 to keep it a uniform percentage.
          • measures how well a company actually meets the SLO promises that it sets within SLAs.
          • Measured over some period
        • Burn rate - notify when you're consuming your error budget faster than threshold, measured over compliance period
    • Log Aggregation
      • Why this is important in a high-availability / distributed system
      • Query-able

JUSTIFY - 5 min

-  Throughput of each layer
-  Latency caused between each layer
-  Overall latency justification