Google Systems Design - Code Deployment

Must be global and fast

Deployment from the master branch to machines globally

Questions asked:
* Assume it has already been tested and reviewed. 
* No. of machines: hundreds of thousands of machines all over the world, spread across many regions.
* Is it an internal or external service?
* Since it is internal, is a lower level of availability acceptable?
* Deployment system must be reliable enough to roll back and roll forward.
* Is 30 minutes acceptable for the entire deployment?
* All deployments must end up in a terminal state: success or failure.
* Availability target is 99% or 99.9%.
* How big are the binaries?  Say 10GB.

Requirements
* Trigger build
* Builds the binary
* Deploys the binary efficiently

Actions
* Trigger build using SHA from commits
    * Commits are FIFO to create binaries
    * Want to persist the status of builds in case of system disaster
        * Keep it for a very long time; do not leave it only in memory
        * Maybe use a SQL-backed queue
            * Jobs Table Description
                * Autogenerated ID
                * Name
                * Creation time
                * Commit SHA
                * Status
                    * Queued
                    * Processing
                    * Fail
                    * Success
                    * Cancelled
                * Building Node Name
                * Last HeartBeat
                * Binary Hash
        * SQL gives **ACID** transactions to make sure the builds are concurrency-safe
        * Write the dequeue query so that two nodes can never claim the same job (see the worker-loop sketch after this section)
            * `BEGIN TRANSACTION;`
            * `SELECT id FROM jobs WHERE status = 'queued' ORDER BY created_at ASC LIMIT 1 FOR UPDATE;`
            * `UPDATE jobs SET status = 'processing' WHERE id = <selected id>; COMMIT;`
        * Need a way to health-check the building nodes
            * Nodes update heartbeat in the table
            * Set a max timeout for non-response of the heartbeat
            * If a node times out but is actually still building, it can check at the end of the build whether its job was marked as failed, and if so send an alert so the timeout can be tuned
    * Estimate size of build system in nodes
        * Requirements: 5,000 builds per day, 15 minutes per build
        * Ask what the worst-case scenarios are for builds arriving at once. **5,000 builds per day do not arrive evenly**
        * Estimation _if evenly spread_: 24 hrs * 60 min / 15 min = 96 builds per node per day
            * 5,000 / 96 ≈ 52 nodes
            * Padding: +20% ≈ 63 nodes
    * Each node can download the entire source tree once and then apply incremental patches as needed
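
A minimal sketch of the build-node worker loop implied by the bullets above, assuming a Postgres-backed `jobs` table and the `psycopg2` driver; the column names (`status`, `created_at`, `commit_sha`, `building_node`, `last_heartbeat`) mirror the table fields listed earlier but are otherwise illustrative:

```python
import socket

import psycopg2  # assumed driver; any ACID-compliant SQL store works here

NODE_NAME = socket.gethostname()


def claim_next_job(conn):
    """Atomically claim the oldest queued job, or return None if the queue is empty."""
    with conn.cursor() as cur:
        # SELECT ... FOR UPDATE SKIP LOCKED keeps two nodes from claiming the same row.
        cur.execute(
            """
            SELECT id, commit_sha FROM jobs
            WHERE status = 'queued'
            ORDER BY created_at ASC
            LIMIT 1
            FOR UPDATE SKIP LOCKED
            """
        )
        row = cur.fetchone()
        if row is None:
            conn.rollback()
            return None
        job_id, commit_sha = row
        cur.execute(
            "UPDATE jobs SET status = 'processing', building_node = %s,"
            " last_heartbeat = now() WHERE id = %s",
            (NODE_NAME, job_id),
        )
    conn.commit()  # the SELECT and UPDATE land as a single transaction
    return job_id, commit_sha


def heartbeat(conn, job_id):
    """Refresh the heartbeat so a scheduler can detect dead build nodes."""
    with conn.cursor() as cur:
        cur.execute("UPDATE jobs SET last_heartbeat = now() WHERE id = %s", (job_id,))
    conn.commit()
```

`SKIP LOCKED` lets many idle nodes poll concurrently without blocking on one another; a plain `FOR UPDATE`, as in the inline query above, also works but serializes the dequeue.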
* Builds the binary
    * Before marking a build as successful
        * Make sure the binary is in local blob storage
        * Update the tables/deployment managers in other regions so they know a binary is built and ready to be copied; maybe include the binary's blob location (a rough sketch follows below)
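
A rough sketch of that post-build bookkeeping, continuing the Python/Postgres assumptions from the worker sketch above; the `regional_binaries` table and the `blob_client.upload` call are illustrative assumptions, not anything from the original notes:

```python
import hashlib


def finish_build(conn, blob_client, job_id, binary_path, other_regions):
    """Mark a build successful only after the binary is safely in local blob storage."""
    # Hash the binary in chunks (it may be ~10GB, so never load it all into memory).
    sha = hashlib.sha256()
    with open(binary_path, "rb") as f:
        for chunk in iter(lambda: f.read(1 << 20), b""):
            sha.update(chunk)
    binary_hash = sha.hexdigest()

    # Upload to the region-local blob store first; if this raises, the job stays
    # 'processing' and the heartbeat timeout will eventually surface the failure.
    blob_url = blob_client.upload(binary_path)  # assumed blob-store client API

    with conn.cursor() as cur:
        cur.execute(
            "UPDATE jobs SET status = 'success', binary_hash = %s WHERE id = %s",
            (binary_hash, job_id),
        )
        # Let every other region's deployment manager know a verified binary exists
        # and where to copy it from.
        for region in other_regions:
            cur.execute(
                "INSERT INTO regional_binaries (region, job_id, blob_url, binary_hash)"
                " VALUES (%s, %s, %s, %s)",
                (region, job_id, blob_url, binary_hash),
            )
    conn.commit()
```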
* Store binary in blob storage
    * Can't have 100k machines each downloading a 10GB binary straight from blob storage (roughly 1 PB of egress per deployment)
    * Best to use peer-to-peer networking, like BitTorrent (a back-of-the-envelope comparison follows below)
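
A quick back-of-the-envelope comparison of direct blob downloads versus peer-to-peer distribution, using the 10GB binary and ~100,000 machines from the requirements; the per-machine and blob-store bandwidth figures are purely assumed for illustration:

```python
import math

BINARY_GB = 10
MACHINES = 100_000
MACHINE_LINK_GBPS = 10     # assumed NIC speed per machine
BLOB_EGRESS_GBPS = 100     # assumed total egress the blob store can sustain

# One machine-to-machine copy of the binary.
copy_seconds = BINARY_GB * 8 / MACHINE_LINK_GBPS

# Naive fan-out: every machine pulls the binary straight from blob storage.
blob_only_seconds = MACHINES * BINARY_GB * 8 / BLOB_EGRESS_GBPS

# Peer-to-peer: the set of machines holding the binary roughly doubles each round.
p2p_rounds = math.ceil(math.log2(MACHINES))
p2p_seconds = p2p_rounds * copy_seconds

print(f"blob-only: ~{blob_only_seconds / 3600:.0f} hours")
print(f"peer-to-peer: ~{p2p_seconds / 60:.1f} minutes")
```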
* Deploys the binary efficiently
    * How do the peers know to download a new file?
        * Peers should know what their "goal" state is, i.e. the version they should be running
            * Use a key/value pair to express it: App -> Version
            * Maybe store this value in something like ZooKeeper
            * Every peer watches that key/value store for changes (see the watcher sketch at the end of these notes)
            * When a new value appears, the peer starts downloading it via BitTorrent
    * How do peers know not to build over each other?
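
A minimal sketch of the goal-state watch described above, using ZooKeeper via the `kazoo` client; the znode path, the payload format, and the `start_torrent_download` helper are illustrative assumptions:

```python
from kazoo.client import KazooClient

GOAL_STATE_PATH = "/deployments/my-app/goal-version"  # assumed znode layout


def start_torrent_download(version):
    """Placeholder: kick off the BitTorrent download for the given binary version."""
    print(f"downloading binary for version {version} via the peer-to-peer swarm")


zk = KazooClient(hosts="zk1:2181,zk2:2181,zk3:2181")  # assumed ensemble
zk.start()

current_version = None


# kazoo re-registers the watch after every trigger, so this callback fires on each
# update to the goal-state znode, not just the first one.
@zk.DataWatch(GOAL_STATE_PATH)
def on_goal_state_change(data, stat):
    global current_version
    if data is None:
        return  # znode does not exist yet
    version = data.decode("utf-8")
    if version != current_version:
        current_version = version
        start_torrent_download(version)
```

Keeping only the tiny goal-state key in ZooKeeper keeps the consensus store out of the data path; the multi-gigabyte binary itself still moves over the peer-to-peer swarm.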