Systems Design Template
FEATURE EXPECTATIONS - 5 min
- What does it need to do?
- Use cases
- Scenarios that will not be covered
- Who will use it
- How many will use it
- Usage patterns
ESTIMATIONS - 5 min
- Throughput (QPS for read and write queries)
- Latency expected from the system (for read and write queries)
- Read/Write ratio
- Traffic estimates
- Write (QPS, Volume of data)
- Read (QPS, Volume of data)
- Storage estimates
- Memory estimates
- If we are using a cache, what is the kind of data we want to store in cache
- How much RAM and how many machines do we need to achieve this?
- Amount of data you want to store on disk/SSD
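The estimates above reduce to a few multiplications. A minimal sketch, assuming hypothetical inputs that are not part of this template: 1M daily active users, 2 writes and 20 reads per user per day, 1 KB objects, one year of retention, 20% of daily read volume kept in cache, and 16 GB of usable RAM per cache machine. The function name `estimate` and its parameters are illustrative only, and the results are daily averages rather than peak figures.

```python
import math

SECONDS_PER_DAY = 86_400


def estimate(dau, writes_per_user, reads_per_user, object_bytes,
             retention_days, cache_fraction, ram_bytes_per_machine):
    """Back-of-envelope traffic, storage, and cache estimates (averages, not peaks)."""
    writes_per_day = dau * writes_per_user
    reads_per_day = dau * reads_per_user
    # Cache sizing assumption: keep a fixed share of one day's read volume in RAM.
    cache_bytes = reads_per_day * object_bytes * cache_fraction
    return {
        "write_qps": writes_per_day / SECONDS_PER_DAY,
        "read_qps": reads_per_day / SECONDS_PER_DAY,
        "read_write_ratio": reads_per_day / writes_per_day,
        "storage_bytes": writes_per_day * object_bytes * retention_days,
        "cache_bytes": cache_bytes,
        "cache_machines": math.ceil(cache_bytes / ram_bytes_per_machine),
    }


# All inputs below are hypothetical placeholders.
result = estimate(
    dau=1_000_000, writes_per_user=2, reads_per_user=20, object_bytes=1_000,
    retention_days=365, cache_fraction=0.2, ram_bytes_per_machine=16 * 10**9,
)
for name, value in result.items():
    print(f"{name}: {value:,.2f}")
```

In an interview, round aggressively and multiply the average QPS by a peak factor you state out loud; the point is the order of magnitude, not the decimals.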
DESIGN GOALS - 5 min
- Latency and Throughput requirements
- Consistency vs Availability [Weak/strong/eventual => consistency | Failover/replication => availability]
HIGH LEVEL DESIGN - 5-10 min
- APIs for Read/Write scenarios for crucial components
- Database schema
- Basic algorithm
- High level design for Read/Write scenario
DEEP DIVE - 15-20 min
- Deployment
- Scaling the algorithm
- Scaling individual components:
- Availability, Consistency and Scale story for each component
- Consistency and availability patterns
- Call graph depth
- Longer call sequences in the graph compound availability problems
- If both Foo and Bar each had 99.9% availability, their total availability in sequence would be about 99.8% (0.999 × 0.999 ≈ 0.998).
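The sequence arithmetic can be checked directly. A minimal sketch, assuming the components fail independently, so the availability of a chain is the product of its parts. The helper name `serial_availability` and the ten-deep chain are illustrative, not from the notes.

```python
import math


def serial_availability(availabilities):
    """Availability of a chain in which every component must be up."""
    return math.prod(availabilities)


# Foo and Bar, each at 99.9%, called in sequence.
two_deep = serial_availability([0.999, 0.999])
# A hypothetical call chain ten services deep, each at 99.9%.
ten_deep = serial_availability([0.999] * 10)

print(f"depth 2:  {two_deep:.4%}")
print(f"depth 10: {ten_deep:.4%}")
```

At depth ten the same 99.9% components give only about 99.0% end to end, which is why call graph depth belongs in the availability discussion.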
- What is a graceful (a.k.a. zero-downtime) restart?
- A graceful restart takes place in two steps:
- A new generation of parent and child processes is spawned to take over service.
- Processes from the older generation only exit once they have finished their tasks.
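The two steps can be modelled without real processes. The following is a toy in-memory sketch, not how any particular server implements it: `Generation` and `Supervisor` are hypothetical names, and a real implementation would hand over a listening socket and signal actual processes.

```python
class Generation:
    """One generation of worker processes (modelled here as a plain object)."""

    def __init__(self, number):
        self.number = number
        self.accepting = True
        self.in_flight = set()
        self.exited = False

    def start(self, task_id):
        if not self.accepting:
            raise RuntimeError("a draining generation accepts no new tasks")
        self.in_flight.add(task_id)

    def finish(self, task_id):
        self.in_flight.discard(task_id)
        self._exit_if_drained()

    def drain(self):
        self.accepting = False
        self._exit_if_drained()

    def _exit_if_drained(self):
        # Exit only when no new work can arrive and all accepted work is done.
        if not self.accepting and not self.in_flight:
            self.exited = True


class Supervisor:
    def __init__(self):
        self.current = Generation(1)

    def dispatch(self, task_id):
        self.current.start(task_id)
        return self.current

    def graceful_restart(self):
        old = self.current
        self.current = Generation(old.number + 1)  # step 1: new generation takes over
        old.drain()                                # step 2: old one finishes, then exits
        return old


supervisor = Supervisor()
gen1 = supervisor.dispatch("req-a")       # in flight on generation 1
old = supervisor.graceful_restart()
gen2 = supervisor.dispatch("req-b")       # new work lands on generation 2
still_running = not old.exited            # generation 1 is still finishing req-a
old.finish("req-a")
print(still_running, old.exited, gen2.number)
```

The property to look for is that no request is dropped: work accepted before the restart completes on the old generation, and work arriving after it goes to the new one.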
- Think about the following components: how each would fit in and how it would help
- DNS (see Networking: How does DNS work)
- CDN [Push vs Pull]
- Load Balancers [Active-Passive, Active-Active, Layer 4, Layer 7]
- Single points of failure
- Application layer scaling [Microservices, Service Discovery]
- DB (see Cliff Notes - Someguy: Relational model vs document model)
- RDBMS
- Master-slave, Master-master, Federation, Sharding, Denormalization, SQL Tuning
- Benefits - great at many-to-many and many-to-one relationships, one copy of data
- Downfalls - high impedance mismatch (the result format doesn't match the model format)
- NoSQL
- Benefits: Greater scalability, high write throughput, large datasets, dynamic schemas, low impedance mismatch
- Downfalls: not good at many-to-one or many-to-many relationships, data can be replicated in many objects, eventual consistency
- Key-Value, Wide-Column, Graph, Document
- Fast lookups:
- RAM [Bounded size] => Redis, Memcached
- AP [Unbounded size] => Cassandra, Riak, Voldemort
- CP [Unbounded size] => HBase, MongoDB, Couchbase, DynamoDB
- Caches
- Asynchronism
- Message queues
- Task queues
- Back pressure
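Back pressure is easiest to see with a bounded queue: once it is full, the producer is told to back off instead of the queue growing without limit. A minimal sketch using the standard library's `queue.Queue`. The helper `try_submit` and the capacity of 2 are illustrative assumptions.

```python
import queue


def try_submit(task_queue, task):
    """Enqueue without blocking; return False to tell the caller to back off."""
    try:
        task_queue.put_nowait(task)
        return True
    except queue.Full:
        return False


tasks = queue.Queue(maxsize=2)            # hypothetical capacity
accepted = [try_submit(tasks, n) for n in range(3)]
tasks.get_nowait()                        # a worker takes one task off the queue
accepted_after_drain = try_submit(tasks, 3)
print(accepted, accepted_after_drain)
```

In a service, a `False` here would typically become a "server busy" response so that clients retry later, ideally with exponential backoff.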
- Communication
- Metrics
- External (off-box, not off-prem) Monitoring
- in case local monitoring collapses
- regional problems, e.g. check for ISPs that are down
- breaches in SLA
- QPS, Latency, Errors, Call times
- Host
- health check
- disk space, memory, swap, i/o, load
- anything specific to the programming language
- SLA Alerting/Reporting/Breaches
- An SLA (Service Level Agreement) is a contract.
- An SLO (Service Level Objective) is a specific goal that is defined in that contract.
- An SLI (Service Level Indicator) measures how well a company actually meets the SLO promises that it sets within SLAs.
- SLI = (number of good events / total number of valid events) × 100, so it is always expressed as a percentage.
- Measured over some period
- Burn rate - notify when you're consuming your error budget faster than a threshold, measured over the compliance period
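The SLI equation and a burn-rate check can be sketched in a few lines. This assumes the common definition of burn rate as the observed error fraction divided by the error budget (1 minus the SLO), so a value of 1.0 would use up the budget exactly at the end of the compliance period. The function names, the event counts, and the alert threshold of 1.5 are all hypothetical.

```python
def sli_percent(good_events, valid_events):
    """SLI = good events / valid events, as a percentage."""
    return 100.0 * good_events / valid_events


def error_budget_fraction(slo_percent):
    """Share of valid events that may fail without breaching the SLO."""
    return 1.0 - slo_percent / 100.0


def burn_rate(good_events, valid_events, slo_percent):
    """How many times faster than 'exactly on budget' errors are occurring."""
    observed_error_fraction = (valid_events - good_events) / valid_events
    return observed_error_fraction / error_budget_fraction(slo_percent)


def should_alert(good_events, valid_events, slo_percent, threshold):
    return burn_rate(good_events, valid_events, slo_percent) > threshold


# Hypothetical window: 100,000 valid requests, 99,800 good, SLO of 99.9%.
sli = sli_percent(99_800, 100_000)
rate = burn_rate(99_800, 100_000, slo_percent=99.9)
alert = should_alert(99_800, 100_000, slo_percent=99.9, threshold=1.5)
print(f"SLI={sli:.2f}%  burn rate={rate:.2f}  alert={alert}")
```

Here the service is failing twice as fast as the budget allows, so at this pace the budget would be gone halfway through the compliance period.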
- Log Aggregation
- Why this is important in a high-availability / distributed system
- Queryable
JUSTIFY - 5 min
- Throughput of each layer
- Latency caused between each layer
- Overall latency justification
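The justification step amounts to adding up per-layer latencies and finding the layer with the lowest throughput. A minimal sketch, assuming a serial request path where every request crosses every layer. The layer names and all numbers are made up for illustration.

```python
# (layer name, added latency in ms, sustainable QPS); all values hypothetical
LAYERS = [
    ("load balancer", 1.0, 50_000),
    ("application", 20.0, 8_000),
    ("cache", 2.0, 100_000),
    ("database", 15.0, 5_000),
]


def end_to_end_latency_ms(layers):
    """Latency of a request that passes through every layer in turn."""
    return sum(latency for _, latency, _ in layers)


def bottleneck(layers):
    """The layer whose throughput caps the whole path."""
    return min(layers, key=lambda layer: layer[2])


total = end_to_end_latency_ms(LAYERS)
name, _, qps = bottleneck(LAYERS)
print(f"end-to-end latency: {total} ms; bottleneck: {name} at {qps} QPS")
```

Compare the total against the latency goal from DESIGN GOALS and the bottleneck QPS against the read and write estimates from ESTIMATIONS; any gap is what the deep dive has to close.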