We've run into a problem: our customers and potential customers often "know" that we just provide custom software development. That isn't true, but we've done a pretty poor job of explaining or enumerating the other services we provide. A big part of what we offer could be considered a "Virtual CTO" offering, or vCTO. This post outlines something that falls under that offering: performing a Root Cause Analysis.

If your organization or project has an outage, you have a few options at your disposal. The simplest is to just fix the outage. That's pretty much always required, but if you stop there, you have no real confidence that a similar outage won't happen again. The best way to prevent similar outages in the future is to perform a Root Cause Analysis and put procedures in place to address the issues that led up to the outage.

Not just "What happened?", but "Why did it happen?"

One of our developers ran into a partial outage on a customer's system recently. He did a fairly good job of answering the "what happened?" question: our geocoding layer was failing to return useful results. However, he was having a little trouble getting to the "first why" quickly, so he brought me in.

If you're not familiar with the term "first why," then you probably haven't heard of the "5 Whys" process. At its core, the 5 Whys is a recursive questioning technique that models a toddler almost perfectly: once you've identified the immediate reason for a problem, you've got the first why, and then you keep asking "why?" of each answer. In our case, the answer to the question "Why was the customer-facing geocoding service failing?" was "because the geocoder layer is returning invalid results." We'll keep moving down the 5 Whys to show how you can get to a more useful answer.

  • Why is the geocoder returning invalid results?
    • Because Google's geocoding service has apparently changed its API since we last failed over to it, and our code doesn't know about that change.
  • Wait...why are we using Google? Aren't we supposed to be using Yahoo for this geocoding?
    • Yes. Yahoo was failing, and our code automatically failed over to Google, as it was designed to do (a sketch of that failover pattern follows this list).
    • Solution: Let's fix the Google geocoder code!
  • Wait, we'll ignore the Google issue right now. Why is Yahoo failing?
    • After some analysis, it appears that our OAuth login request to Yahoo's APIs is being rejected because of skew in our system clock: signed OAuth requests carry a timestamp, and the service rejects requests whose timestamp is too far from its own clock.
    • Solution: We can just fix the time manually!
  • Wait, don't do that. Why is our system clock skewed? Don't we use NTP like sane people?
    • Well, yes. However, our clock is still skewed. On analysis, the Ubuntu NTP servers we were using are ignoring our requests, perhaps because the OS is old.
    • Solution: Switch to NIST NTP servers, which work fine!
  • Alright, wait a minute. Why are we finding out about clock skew by having a login to Yahoo's API fail?
    • Because we aren't monitoring clock skew anywhere, or our failover code firing, or our OS falling out of date, or...
    • Solution: Implement robust monitoring!
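
To make that failover behavior concrete, here's a minimal sketch of the pattern described above. The provider functions and names are hypothetical stand-ins rather than our actual geocoder code; the point is that the secondary path only runs when the primary fails, so a stale parser on that path can sit broken and unnoticed until the exact moment you need it.

```python
# Hypothetical sketch of primary/fallback geocoding; the provider functions
# below are placeholders, not the real Yahoo/Google client code.

class GeocodeError(Exception):
    """Raised when a provider can't return a usable result."""


def geocode_with_yahoo(address):
    # Placeholder for an OAuth-signed request to Yahoo's geocoding API.
    # In the incident above, this failed because the signed request was
    # rejected (clock skew made the timestamp invalid).
    raise GeocodeError("Yahoo rejected the OAuth-signed request")


def geocode_with_google(address):
    # Placeholder for a request to Google's geocoding API. If Google's
    # response format changes and this parser is never updated (because the
    # path rarely runs), it can return garbage or fail outright.
    raise GeocodeError("couldn't parse Google's response")


def geocode(address):
    """Try the primary provider, then quietly fail over to the secondary."""
    try:
        return geocode_with_yahoo(address)
    except GeocodeError:
        return geocode_with_google(address)
```

The failover itself worked exactly as designed; what was missing was any signal that it had fired, which is why the chain of Whys ends at monitoring.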

In this moderately Socratic monologue, you can see that stopping at the first Why led to a drastically different result than stopping at the fifth Why. Now we've insulated ourselves and our clients against future time-sync issues, which, as you can see, can cause very strange and seemingly unrelated downstream symptoms. By getting to the real root cause (we had no monitoring in place, largely because we usually let our customers handle their own servers), we were able to identify a much broader fix that will increase the reliability of our product overall, rather than a fix that would have failed again as soon as the clock skewed further.
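
To give a flavor of what that monitoring can look like for the clock-skew piece specifically, here's a minimal sketch of a check that compares the system clock to an NTP server and complains when the offset crosses a threshold. It assumes the third-party ntplib package; the server, threshold, and Nagios-style exit codes are illustrative choices, not details from the report.

```python
# Minimal clock-skew check, suitable for cron or a monitoring agent.
# Requires the third-party "ntplib" package (pip install ntplib).
import sys

import ntplib

NTP_SERVER = "time.nist.gov"   # any reachable NTP server will do
MAX_SKEW_SECONDS = 5.0         # OAuth-signed requests often tolerate far less


def check_clock_skew():
    client = ntplib.NTPClient()
    response = client.request(NTP_SERVER, version=3)
    skew = abs(response.offset)  # estimated local-vs-server offset, in seconds
    if skew > MAX_SKEW_SECONDS:
        print(f"CRITICAL: clock skew of {skew:.2f}s exceeds {MAX_SKEW_SECONDS}s")
        return 2
    print(f"OK: clock skew is {skew:.2f}s")
    return 0


if __name__ == "__main__":
    sys.exit(check_clock_skew())
```

The same idea extends to the other blind spots in the list: alert when the failover path fires, and alert when the OS falls out of its support window.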

This is one of the ways that we bring stability to our clients and reduce the risk that their mission-critical software will fail. When it does go down, we're going to make it work, and make it work the right way. If you're working with us to ensure your web application is rock-solid, then you'll see a detailed report like this one.

We know how to fix site outages, and fix them for good. Get in touch now, and skip the 3 A.M. page tonight!

Here's the report we provided to our client

Josh Adams is a developer and architect with over eleven years of professional experience building production-quality software and managing projects. Josh is isotope|eleven's lead architect, and is responsible for overseeing architectural decisions and translating customer requirements into working software. Josh graduated from the University of Alabama at Birmingham (UAB) with Bachelor of Science degrees in both Mathematics and Philosophy. He runs the ElixirSips screencast series, teaching hundreds of developers Elixir. He also occasionally provides Technical Review for Apress Publishing, specifically regarding Arduino microcontrollers. When he's not working, Josh enjoys spending time with his family.