In the modern software world, where micro-services are de rigueur, observability of systems is paramount. If you do not have a way to observe your application, you are as good as dead.
The first step towards embracing observability is figuring out what to track. Broadly, we can categorize software observability into:
1. Infrastructure metrics.
2. Application metrics.
3. Business metrics.
4. Distributed tracing.
Infrastructure metrics boil down to capturing the pulse of the underlying infrastructure where the application is running. Some examples are CPU utilization, memory usage, disc space usage, network ingress, and egress. Infrastructure metrics should give a clear picture as to how well the application is utilizing the hardware it is running on. Infrastructure metrics also aid in capacity planning and scaling.
Application metrics help in gauging the efficiency of the application; how fast or slow the application is responding and where are the bottlenecks. Some examples of application metrics are the API response time, the number of times a particular API is called, the processing time of a specific segment of code, calls to external services and their latency. Application metrics help in weeding out potential bottlenecks as well as in optimizing the application.
Infrastructure metrics give an overall picture whereas application metrics help in drilling down to the specifics. For example, if the infrastructure metric indicates more than 100% CPU utilization, application metrics help in zeroing in on the cause of this.
Business metrics are the numbers which are crucial from a functionality point of view. For example, if the piece of code deals with user login and sign-up, some business metrics of interest would be the number of people who sign up, number of people who log in, number of people who log out, the modes of login like social login versus direct. Business metrics help in keeping a pulse on the functionality and diagnosing feature specific breakdowns.
Business metrics should not be confused with business reports. Business metrics serve a very different purpose; they are not to quantify numbers accurately but more to gauge the trend and detect anomalous behavior.
It helps to think of infrastructure, application and business metrics as a hierarchy where you zoom in from one to the other when keeping a tab on the health of the system as well as diagnosing problems. Keeping a check on all three ensures you have hale and hearty application.
Logging enables to pinpoint specific errors. The big challenge with logs is making logs easily accessible to all in the organization. Business metrics help in tracking the overall trend and logging helps to zero in on the specific details.
Distributed tracing ties up all the microservices in the ecosystem and assists to trace a flow end to end, as it moves from one microservice to another. Microservices fail all the time; if distributed tracing is not in place, diagnosing issues which span microservices feels like searching for a needle in a haystack.
If you have infrastructure, application and business metrics in place, you can create alerts which should be triggered when they show abnormal behavior; this pre-empts potential downtimes and business loss. One golden rule for alerts is, if it is an alert, it should be actionable. If not, alerts lose their significance and meaning.
Both commercial, as well as open source software, are available to build observability. NewRelic is one of the primary contenders on the commercial side. StatsD, Prometheus and the ilk dominate the open source spectrum. For log management, Splunk is the clear leader in the commercial space. ELK stack takes the crown on the open source front. Zipkin is an open source reference implementation of distributed tracing. Most of the metrics tracking software have alerting capabilities these days.
If you already have microservices or are moving towards that paradigm, you should be investing heavily on observability. Microservices without observability is a fool’s errand.