Designing for failure

by abhirama

In the world of software, failure is a certainty. Servers go kaput, databases go down, processes run out of memory; things break all the time. You can categorize software as good or bad based on how it behaves in these adverse scenarios. I am not trying to imply that software has to be resilient to all of these. On the contrary, I believe that it is perfectly fine to crap out when shit hits the fan. The question is how you handle this crapping out.

Whenever you architect a component, devote ample time to failure scenarios. Let us take the case of a piece of code which interacts with an external, third-party API. What are the questions you should be asking when designing this component? What happens if the API suddenly stops responding one day? Can I hold my main thread hostage to the API response time? What happens if the API takes eons to respond? In case there is an exception, am I logging enough data to debug it? If there are performance issues, do I have enough diagnostic data? Diagnostic data might mean graphs of the API response time, the number of times the code path was executed, etc. Do I need to send out an alert when something goes wrong? All these questions revolve around failure handling. They should be second nature to you as a software engineer.
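
To make these questions a little more concrete, here is a rough sketch in Python of what answering a few of them might look like. Treat it as an illustration, not a recipe: `call_with_guardrails`, `fetch`, `fallback` and the in-memory `metrics` dictionary are all names I made up for this example, and a real system would push the numbers to whatever graphing and alerting tools it already uses.

```python
import logging
import time
from concurrent.futures import ThreadPoolExecutor
from concurrent.futures import TimeoutError as FutureTimeout

logger = logging.getLogger("third_party_api")

# Hypothetical in-memory diagnostics; stand-in for a real metrics system.
metrics = {"calls": 0, "failures": 0, "latencies_ms": []}

# The API call runs on a worker thread so the caller only waits as long
# as it chooses to.
_executor = ThreadPoolExecutor(max_workers=4)


def call_with_guardrails(fetch, timeout_s=2.0, fallback=None):
    """Run fetch() on a worker thread and wait at most timeout_s seconds.

    On a timeout or an exception, log the failure, count it, and return
    fallback instead of raising.
    """
    metrics["calls"] += 1
    started = time.monotonic()
    future = _executor.submit(fetch)
    try:
        return future.result(timeout=timeout_s)
    except FutureTimeout:
        metrics["failures"] += 1
        # Best effort only: a call that is already running keeps running.
        future.cancel()
        logger.error("API call gave no answer within %.1fs", timeout_s)
        return fallback
    except Exception:
        metrics["failures"] += 1
        # logger.exception records the full traceback for later debugging.
        logger.exception("API call failed")
        return fallback
    finally:
        # Record how long the caller waited, whether or not the call succeeded.
        metrics["latencies_ms"].append((time.monotonic() - started) * 1000)
```

You would wrap the real API call in a function and hand it over, for example `call_with_guardrails(lambda: client.get_quote("XYZ"), timeout_s=1.0)`, where `client.get_quote` is again a placeholder. The places where `failures` is incremented are also the natural spots to decide whether an alert should go out.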

I have seen a tendency among developers to devote an inordinate amount of time to making their code adhere to the latest programming fad, trying to use the best possible library, etc., but not to failure scenarios. Logging data might not be as sexy as debating which design pattern to use, but once things break, logs are your only friend. The next time you are furiously pounding on the keyboard, take a step back and ask these questions. In the future, the developer who maintains the code that you wrote today will thank you for doing this.