Troubleshooting via Metrics¶
Looking at Graphs¶
Wants to check for deployment activity Wants to check for cyclical pattern (week over week etc) Wants to check for recent changes to threshold Wants to check for magnitude of difference below threshold Asks NOC for nothing (while investigating) OR Asks NOC to look for other reg-related alerts/flows/metrics Makes a sensible guess for the cause of the problem Makes a second sensible guess (some examples: ABI, seasonality, one colo failure) Questions the time it took NOC to escalate or mentions it as a followup item
Initial Troubleshooting¶
Reviews all presented data and asks appropriate questions about it Suggests a correlation between deployment and drop in regs from graph Suggests problem may be due to deployment from informed Wants to check error logs on the reg service Wants to check some other log (that makes some degree of sense to check)
Initial Resolution¶
Takes a logical action to resolve the issue Second action is also logical based on 1st failing Stays calm and doesn’t argue when told their solution didn’t work
Secondary Resolution¶
Last action also meaningful Suggests escalation to engineering Suggests escalation to management Has articulable thing for escalated person to do Articulable thing makes sense given the problem Second articulable thing for escalated person, that also makes sense