# Google Systems Design - Code Deployment

Must be global and fast.

## Deployment from master to global
Questions asked:
* Assume it has already been tested and reviewed.
* Number of machines: hundreds of thousands of machines all over the world, across some number of regions.
* Is it internal or external service?
* Since it is internal, is a lower level of availability acceptable?
* Deployment system must be reliable enough to rollback and roll forward.
* 30 minutes for entire deployment?
* All deployments must end up in terminal state, failed or success.
* Availability target is 99% or 99.9%
* How big are the binaries? Say 10GB.
Requirements
* Trigger build
* Builds the binary
* Deploys the binary efficiently
Actions
* Trigger build using SHA from commits
* Commits are FIFO to create binaries
* Persist the status of builds so it survives a system disaster; status should not live only in memory for long periods
* Maybe use a SQL-backed queue
* Jobs Table Description
* Autogenerated ID
* Name
* Creation time
* Commit SHA
* Status
* Queued
* Processing
* Fail
* Success
* Cancelled
* Building Node Name
* Last HeartBeat
* Binary Hash
* SQL gives **ACID** transactions to make sure the builds are concurrency safe
* You can write the SQL so that two nodes never claim the same job: select and update inside one transaction
* `BEGIN TRANSACTION;`
* `select * from jobs where status="queued" order by created asc limit 1 for update;`
* `update jobs set status="processing" where id=_id;`
* `COMMIT;`
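The jobs table and the atomic claim described above can be sketched in Python with SQLite. The column names mirror the fields listed in the notes; the conditional `UPDATE` selects and transitions the oldest queued row in a single statement, which is what makes the claim concurrency safe. This is a minimal sketch, not a production queue.

```python
import sqlite3

# In-memory stand-in for the builds database described above.
conn = sqlite3.connect(":memory:")
conn.execute("""
    CREATE TABLE jobs (
        id             INTEGER PRIMARY KEY AUTOINCREMENT,
        name           TEXT,
        created        TIMESTAMP DEFAULT CURRENT_TIMESTAMP,
        commit_sha     TEXT,
        status         TEXT DEFAULT 'queued',  -- queued/processing/fail/success/cancelled
        building_node  TEXT,
        last_heartbeat TIMESTAMP,
        binary_hash    TEXT
    )
""")
conn.execute("INSERT INTO jobs (name, commit_sha) VALUES ('build-1', 'abc123')")
conn.execute("INSERT INTO jobs (name, commit_sha) VALUES ('build-2', 'def456')")
conn.commit()

def claim_next_job(conn, node_name):
    """Atomically claim the oldest queued job for this node.

    The subquery picks the oldest queued job, and the outer UPDATE only
    succeeds if that row is still queued, so two nodes racing on the
    same row cannot both claim it.
    """
    with conn:  # runs the statement inside a transaction
        cur = conn.execute(
            """
            UPDATE jobs
               SET status = 'processing', building_node = ?
             WHERE id = (SELECT id FROM jobs
                          WHERE status = 'queued'
                          ORDER BY created ASC LIMIT 1)
               AND status = 'queued'
            """,
            (node_name,),
        )
        if cur.rowcount == 0:
            return None  # queue is empty
        return conn.execute(
            "SELECT id, commit_sha FROM jobs"
            " WHERE building_node = ? AND status = 'processing'",
            (node_name,),
        ).fetchone()

job = claim_next_job(conn, "node-a")
```

With two jobs queued, a second node claims the remaining job and a third gets `None`, since nothing is left in the queue.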
* Health-check the building nodes
* Nodes update heartbeat in the table
* Set a max timeout for non-response of the heartbeat
* If a node times out but is actually still building, it can see at the end of the build that its job was marked as failed, and maybe send an alert so the system can be refined
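The heartbeat timeout check is simple enough to sketch directly. The 5-minute timeout and the dict-shaped job records below are illustrative assumptions; a real reaper would run this query against the jobs table and reset stale jobs to queued (or mark them failed).

```python
HEARTBEAT_TIMEOUT_SECONDS = 300  # assumption: 5 minutes without a heartbeat = dead node

def find_timed_out_jobs(jobs, now):
    """Return IDs of 'processing' jobs whose last heartbeat is stale.

    `jobs` maps job id -> {"status": ..., "last_heartbeat": epoch seconds}.
    Only processing jobs are considered; queued jobs have no node yet.
    """
    return [
        job_id
        for job_id, job in jobs.items()
        if job["status"] == "processing"
        and now - job["last_heartbeat"] > HEARTBEAT_TIMEOUT_SECONDS
    ]

jobs = {
    1: {"status": "processing", "last_heartbeat": 1000.0},  # stale at t=1400
    2: {"status": "processing", "last_heartbeat": 1290.0},  # still healthy
    3: {"status": "queued",     "last_heartbeat": 0.0},     # not yet claimed
}
stale = find_timed_out_jobs(jobs, now=1400.0)
```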
* Estimate size of build system in nodes
* Requirements: 5000 builds per day, 15 minutes per build
* Ask what the worst-case scenarios are for builds happening at once. **5000 builds per day don't come evenly**
* Estimation _if even_: 24hrs * 60m / 15m = 96 builds per node per day
* 5000 / 96 ≈ 52 nodes
* Padding: 20% ≈ 63 nodes
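The node estimate above works out as follows (mirroring the notes' rounding, which takes 5000/96 as roughly 52 before padding):

```python
import math

builds_per_day = 5000
minutes_per_build = 15

# One node runs builds back to back: 24h * 60m / 15m per build.
builds_per_node_per_day = 24 * 60 // minutes_per_build  # 96

# Assuming an even arrival rate (which the notes warn is unrealistic):
base_nodes = round(builds_per_day / builds_per_node_per_day)  # ~52

# Add 20% headroom for uneven load.
padded_nodes = math.ceil(base_nodes * 1.2)  # ~63
```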
* Each node can download the entire source code and apply incremental patches if needed, when needed
* Builds the binary
* Before marking build as successful
* Make sure the binary is in local blob storage
* Update tables/deployment managers in other regions so they know a binary is built and ready to be copied; maybe include the binary's blob location
* Store binary in blob storage
* Can't have 100,000 machines downloading a 10GB binary directly from blob storage
* Best to use peer-to-peer networking, like BitTorrent
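The success path described above can be sketched end to end. Everything here is an illustrative stub (the `blob_store` dict, `regional_managers` lists, and `finish_build` name are assumptions, not real APIs): verify the binary actually landed in blob storage before flipping the job to success, then notify the regional deployment managers where to copy it from.

```python
import hashlib

def finish_build(binary, blob_store, regional_managers, job):
    """Verify the binary is in blob storage, then mark success and
    tell every regional deployment manager where the blob lives."""
    digest = hashlib.sha256(binary).hexdigest()
    blob_location = f"blobs/{digest}"
    if blob_store.get(blob_location) != binary:
        job["status"] = "fail"  # blob missing or corrupt: never mark success
        return job
    job["binary_hash"] = digest
    job["status"] = "success"
    for region in regional_managers:  # other regions can now copy the blob
        region.append({"commit_sha": job["commit_sha"], "blob": blob_location})
    return job

binary = b"fake binary contents"
blob_store = {f"blobs/{hashlib.sha256(binary).hexdigest()}": binary}
regions = [[], [], []]
job = finish_build(binary, blob_store, regions,
                   {"commit_sha": "abc123", "status": "processing"})
```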
* Deploys the binary efficiently
* How do the peers know to download a new file?
* Peers should know what their "goal" state is, meaning the version they should be running
* Use a key/value pair to record it: App -> Version
* Maybe store this value in something like Zookeeper
* Every peer runs a watcher on the key/value store
* When the value changes, the peer starts downloading the new version via BitTorrent
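One iteration of that watch loop might look like the sketch below. The dict-based `kv_store` stands in for something like ZooKeeper, and `download` is a stub for the BitTorrent fetch; both are assumptions for illustration.

```python
def check_goal_state(kv_store, app, running_version, download):
    """Compare the desired version in the key/value store against what
    this peer is running; kick off a download when they differ.

    Returns the version the peer should now be running.
    """
    goal_version = kv_store.get(app)
    if goal_version is not None and goal_version != running_version:
        download(app, goal_version)  # stub for the peer-to-peer fetch
        return goal_version
    return running_version

downloads = []
kv_store = {"my-app": "v42"}
version = check_goal_state(kv_store, "my-app", "v41",
                           lambda app, v: downloads.append((app, v)))
```

A second check with the updated version is a no-op, so the loop only downloads when the goal actually changes.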
* How do peers know to not build over each other?