Distributed Tracing

2019-07-12

Distributed tracing means different things to different people; the definition depends on whom you ask.

In this post, we will look at distributed tracing in the context of debugging web application errors using logs in a microservices environment. We will also work out a way to implement distributed tracing in the least disruptive manner. The aim is to leverage components that are already part of your application stack without introducing anything new for the sake of distributed tracing.

There are many closed-source as well as open-source products to get distributed tracing off the ground. There is nothing wrong with using these, but in many cases the investment may not be worth it.

The idea of distributed tracing is straightforward. Every request that lands on your application should have a trace id - a random unique alphanumeric string. The trace id is usually called a request id. From there onwards, the request id should be part of whichever code path the request invokes within the application as well as any external calls it makes. The request id should also be part of the logs generated in all these paths. If the code path invokes external services, those calls should have the request id in the header. The application serving the external calls should follow the same pattern as discussed here.

There you go, the simplest possible distributed tracing implementation.
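To make the idea concrete, here is a minimal sketch in Python using only the standard library. It generates a request id and attaches it to an outgoing call as a header. The header name `X-Request-Id`, the helper names, and the downstream URL are assumptions made for illustration; the request is only prepared, never sent.

```python
import urllib.request
import uuid

REQUEST_ID_HEADER = "X-Request-Id"  # assumed header name


def new_request_id() -> str:
    """Return a random 32-character hexadecimal string."""
    return uuid.uuid4().hex


def build_downstream_request(url: str, request_id: str) -> urllib.request.Request:
    """Prepare (but do not send) a call to another service that carries the request id."""
    return urllib.request.Request(url, headers={REQUEST_ID_HEADER: request_id})


request_id = new_request_id()
# Hypothetical downstream service URL, used only for illustration.
downstream = build_downstream_request("http://inventory.internal/items", request_id)
```

The service receiving this call would read the header and reuse the same id in its own logs and outgoing calls.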

One decision to make is where to generate the request id.

If you are serving a user-initiated action, generate the request id at the entry point of the request. After that, all the subsequent code paths and dependent services will use the same request id.

The best place to generate the request id is the web server fronting the application server. Most web servers have a way to add a custom header to an incoming request.

Below is how you would generate a request id in Nginx:

[code]

location / {
    proxy_pass http://upstream;
    proxy_set_header X-Request-Id $request_id;
}

[/code]

Nginx doc - http://nginx.org/en/docs/http/ngx_http_core_module.html#var_request_id

We should generate a request id only if the request does not already have one; otherwise, an id set by an upstream caller would be overwritten. This sort of conditional generation can be achieved using the configuration below:

[code]

map $http_x_request_id $reqid {
    default $http_x_request_id;
    ""      $request_id;
}

location @proxy_to_app {
    proxy_set_header X-Request-ID $reqid;
    proxy_pass http://backend;
}

[/code]

Helpful links:

https://stackoverflow.com/questions/17748735/setting-a-trace-id-in-nginx-load-balancer

https://stackoverflow.com/questions/13583501/nginx-how-to-add-header-if-it-is-not-set/44761645
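The same reuse-or-generate rule can also be sketched in application code, which may be handy as a fallback for requests that never pass through the web server. This is an illustrative sketch, not part of the Nginx setup; the function name and the plain-dict representation of headers are assumptions.

```python
import uuid
from typing import Mapping


def resolve_request_id(headers: Mapping[str, str]) -> str:
    """Reuse the caller's request id if present, otherwise generate a new one."""
    # HTTP header names are case-insensitive, so normalise before looking up.
    normalised = {name.lower(): value for name, value in headers.items()}
    incoming = normalised.get("x-request-id", "").strip()
    # An empty or missing header falls through to a freshly generated id.
    return incoming or uuid.uuid4().hex


reused = resolve_request_id({"X-Request-ID": "abc123"})  # "abc123"
generated = resolve_request_id({})                       # a new random id
```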

Also, make the request id part of the web server access log, for example by including the variable in Nginx's log_format directive.

Now that we have guaranteed all incoming requests have an id, we need to log the request id along with the application logs.

Most logging/application frameworks have a way to pass a thread-local custom context between layers. You can inject the request id in this context.

If you are on Java using Logback, MDC can help you achieve this - https://logback.qos.ch/manual/mdc.html

If you are on Python and Django - https://github.com/dabapps/django-log-request-id

If you are on Go and Gin - https://github.com/atarantini/ginrequestid
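If you are not using one of these libraries, the mechanism is simple enough to sketch with the Python standard library. The example below is a rough illustration: it uses `contextvars` (which covers threads as well as asyncio tasks) to hold the request id and a `logging.Filter` to stamp it on every log record. The logger name, log format, and `handle_request` function are assumptions; logs are written to an in-memory stream only to keep the sketch easy to inspect.

```python
import contextvars
import io
import logging

# Holds the id of the request currently being handled; "-" when there is none.
request_id_var = contextvars.ContextVar("request_id", default="-")


class RequestIdFilter(logging.Filter):
    """Copy the current request id onto every log record."""

    def filter(self, record: logging.LogRecord) -> bool:
        record.request_id = request_id_var.get()
        return True


stream = io.StringIO()  # in-memory log destination for this sketch
handler = logging.StreamHandler(stream)
handler.setFormatter(logging.Formatter("%(levelname)s [%(request_id)s] %(message)s"))
handler.addFilter(RequestIdFilter())

logger = logging.getLogger("orders")  # hypothetical logger name
logger.setLevel(logging.INFO)
logger.addHandler(handler)
logger.propagate = False


def handle_request(request_id: str) -> None:
    """Set the request id for the duration of one request, then restore the old value."""
    token = request_id_var.set(request_id)
    try:
        logger.info("loading order")  # any layer can log without passing the id around
    finally:
        request_id_var.reset(token)


handle_request("abc123")
print(stream.getvalue(), end="")  # INFO [abc123] loading order
```

The point of the design is that business code never passes the id explicitly; it is set once at the edge of the request and picked up by the logging layer.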

Now that the access log, the application log, and any external calls all carry the request id, you can trace a request’s entire journey through the application, debug errors, and pinpoint their cause.

Even if you are not trying to tie microservice calls together, a request id helps you debug issues faster within a single application - you can trace a request as it traverses the application's different layers.

Now that the strategy and tactics are clear, this is the right time to talk about an enhancement: span ids.

Let us say you have service A calling service B. Service A generates a trace id and sends it as part of the call to service B. In addition to logging A’s trace id, service B should create an id of its own and log that too; this is the concept of a span id - an id owned by each service.

You can extend the span concept to a unit of work within a single service. For example, if your application has three different logical units, you can generate a span id for each of them too; whether you do depends on your use case and what you are trying to achieve.
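Here is a minimal sketch of span ids in Python, assuming service B has already received the trace id `abc123` from service A. Each unit of work gets its own span id, and every log line carries both ids. The `span` helper, the 16-character id length, the unit names, and the log format are illustrative choices, and a plain list stands in for a real logger.

```python
import uuid
from contextlib import contextmanager

log_lines = []  # stands in for a real logger


@contextmanager
def span(trace_id: str, name: str):
    """Run a unit of work under its own span id, logged alongside the shared trace id."""
    span_id = uuid.uuid4().hex[:16]  # id length is an arbitrary choice
    log_lines.append(f"trace={trace_id} span={span_id} start {name}")
    try:
        yield span_id
    finally:
        log_lines.append(f"trace={trace_id} span={span_id} end {name}")


trace_id = "abc123"  # assumed to have arrived from service A in the request header

with span(trace_id, "service-b"):        # the span owned by service B as a whole
    with span(trace_id, "validate"):     # spans for logical units inside service B
        pass
    with span(trace_id, "persist"):
        pass

print("\n".join(log_lines))
```

Searching the logs for the trace id returns the whole journey; searching for a span id narrows it down to one service or one unit of work.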

Having distributed tracing in place is a huge productivity boost. It is a low-investment, high-ROI activity.
