Distributed Tracing: Finding the Needle in the Haystack
A user reports that their checkout is slow.
You check the logs for the Order service. Everything looks green. You check the Payment service. It’s also fine. You check the Inventory service. No errors there either.
Somewhere in that chain of five services, a request is hanging for 10 seconds. But because every service only sees its own little world, you’re blind.
This is why you need Distributed Tracing.
The Trace ID: A Digital Passport
In a monolith, you have a stack trace. In microservices, you have a Trace ID.
The idea is simple. When a request first hits your system (usually at the API Gateway), you generate a unique ID. As that request travels from service A to service B to service C, that ID travels with it in the headers.
If service B calls service C, it passes the Trace ID along. If service C writes to a database, it logs the Trace ID.
Now, when you want to see why that checkout was slow, you just search for that one ID. You see the entire journey.
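A minimal sketch of the idea in plain Java, with no tracing library. The header name follows the W3C Trace Context convention (`traceparent`; the real header also carries a span ID and flags), but the class and method names here are illustrative:

```java
import java.util.HashMap;
import java.util.Map;
import java.util.UUID;

public class TracePropagation {

    // Generate a trace ID once, at the edge of the system (e.g. the API gateway)
    static String newTraceId() {
        return UUID.randomUUID().toString().replace("-", ""); // 32 hex chars
    }

    // Every outgoing call copies the trace ID into the request headers
    static Map<String, String> outgoingHeaders(String traceId) {
        Map<String, String> headers = new HashMap<>();
        headers.put("traceparent", traceId);
        return headers;
    }

    public static void main(String[] args) {
        String traceId = newTraceId();                      // service A creates it
        Map<String, String> toB = outgoingHeaders(traceId); // A calls B
        Map<String, String> toC = outgoingHeaders(toB.get("traceparent")); // B calls C
        // C logs the same ID that A generated
        System.out.println(toC.get("traceparent").equals(traceId)); // prints "true"
    }
}
```

The point is that no service needs to know the whole topology; each one only has to copy one header forward.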
Spans and Hierarchies
A “Trace” is the whole journey. A “Span” is a single unit of work within that journey (e.g., a database query, an API call, or a heavy calculation).
Visualizing those spans on a timeline (the classic “waterfall” view) makes the bottleneck obvious: if a ‘Create Order’ span holds the longest bar, that’s where most of the time is being spent.
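To make the parent/child relationship concrete, here’s a toy model, not the OTel API, just illustrative names: each span records its parent and duration, so you can ask which child of the root ate the time.

```java
import java.util.List;

public class SpanTree {

    // A toy span: a named unit of work with a duration and an optional parent
    record ToySpan(String name, String parent, long durationMs) {}

    // Find the child of `parent` with the largest duration (the likely bottleneck)
    static ToySpan slowestChild(List<ToySpan> trace, String parent) {
        ToySpan slowest = null;
        for (ToySpan s : trace) {
            if (parent.equals(s.parent())
                    && (slowest == null || s.durationMs() > slowest.durationMs())) {
                slowest = s;
            }
        }
        return slowest;
    }

    public static void main(String[] args) {
        List<ToySpan> trace = List.of(
            new ToySpan("POST /checkout", null, 10_200),
            new ToySpan("Create Order", "POST /checkout", 9_800),
            new ToySpan("Charge Card", "POST /checkout", 300),
            new ToySpan("INSERT orders", "Create Order", 9_700) // the real culprit
        );
        System.out.println(slowestChild(trace, "POST /checkout").name()); // prints "Create Order"
    }
}
```

Real tracing UIs do exactly this walk for you, recursively, which is why the waterfall view works.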
The Glue: OpenTelemetry (OTel)
A few years ago, you had to pick a specific tool like Jaeger or Zipkin and bake their libraries into your code. If you wanted to switch tools, you had to rewrite your instrumentation.
OpenTelemetry changed that. It’s a vendor-neutral industry standard: you instrument your code once using OTel, and you can send that data to any backend you want (Honeycomb, Datadog, Jaeger, etc.) without changing a line of application code.
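Concretely, with the OTel Java agent the backend is chosen by configuration, not code. The environment variable names below come from the OpenTelemetry SDK configuration spec; the endpoint value is a placeholder, and swapping vendors means changing that one URL:

```
OTEL_SERVICE_NAME=order-service
OTEL_TRACES_EXPORTER=otlp
OTEL_EXPORTER_OTLP_ENDPOINT=http://collector.example.internal:4317
```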
How it looks in code (Java)
You don’t usually manually create spans for everything. The OTel agent does it for you automatically for most common libraries (Spring, JDBC, Kafka).
import io.opentelemetry.api.trace.Span;

// Automatic instrumentation handles Trace ID propagation,
// but you can add custom metadata (attributes) to make the span useful
Span span = Span.current(); // the span the agent already opened for this request
span.setAttribute("order.id", order.getId());
span.setAttribute("user.tier", "premium");
The Cost of Visibility
Tracing isn’t free.
- Performance: Generating and sending spans takes CPU and memory.
- Storage: Storing 100% of your traces is expensive. If you have 10,000 requests per second, you’re going to have a massive bill.
The solution is Sampling. You might keep only 1% of the successful traces, but keep 100% of the traces that end in an error or take longer than 500 ms. Because that keep-or-drop decision happens after the trace completes, this is known as tail-based sampling.
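That policy is easy to express in code. A sketch of such a sampling decision, using the example thresholds above (the 500 ms cutoff and 1% rate are this article’s numbers, not a standard):

```java
import java.util.concurrent.ThreadLocalRandom;

public class TailSampler {

    static final long SLOW_THRESHOLD_MS = 500;
    static final double SUCCESS_SAMPLE_RATE = 0.01; // keep 1% of boring traces

    // Decide after the trace has completed ("tail-based" sampling):
    // always keep errors and slow requests, dice-roll the rest
    static boolean shouldKeep(boolean hadError, long durationMs) {
        if (hadError) return true;
        if (durationMs > SLOW_THRESHOLD_MS) return true;
        return ThreadLocalRandom.current().nextDouble() < SUCCESS_SAMPLE_RATE;
    }

    public static void main(String[] args) {
        System.out.println(shouldKeep(true, 20));    // error -> prints "true"
        System.out.println(shouldKeep(false, 9000)); // slow  -> prints "true"
    }
}
```

In practice this logic usually lives in a collector (e.g. the OpenTelemetry Collector’s tail-sampling processor) rather than in your services, since it needs to see the whole trace before deciding.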
What I’m Thinking
I used to think logging was enough. I’d just grep through ELK/Splunk for a userId and hope I could piece the story together.
But logging is flat. It doesn’t tell you causality. It doesn’t show you that Service B was slow because it was waiting on a lock in Service C.
Distributed tracing is the “X-ray” for your architecture. It’s a lot of work to set up correctly, especially getting headers to propagate through async message queues like Kafka. But the first time an outage happens and you find the bottleneck in 30 seconds instead of 3 hours, it pays for itself.
Have you ever had a request “vanish” in your system? How did you track it down?