Structured Logging in Distributed Systems
Production bug. User says checkout failed. You SSH into the server and grep the logs. Nothing useful. The request hit 6 services. The error is in service 4. You’re grepping through service 1.
This is how I spent my first year debugging distributed systems. It doesn’t scale.
The Problem With Unstructured Logs
log.info("Processing order " + orderId + " for user " + userId);
log.error("Failed to process order: " + e.getMessage());
Good luck grepping that across 20 instances. Every developer formats messages differently. Some log the order ID, some don’t. Some include the user, some don’t. Finding the full request path is archaeology.
Structured Logging
Log as data, not strings.
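// kv(...) here is assumed to be StructuredArguments.kv from logstash-logback-encoder
import static net.logstash.logback.argument.StructuredArguments.kv;

log.info("Processing order",
    kv("orderId", orderId),
    kv("userId", userId),
    kv("action", "order.process"),
    kv("service", "order-service"));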
With a JSON encoder configured, the output is JSON. Every field is searchable. Your log aggregator (ELK, Loki, whatever) can query any field.
{"timestamp":"2026-02-10T08:30:00Z","level":"INFO","message":"Processing order","orderId":"ORD-123","userId":"USR-456","action":"order.process","service":"order-service","correlationId":"abc-def-789"}
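To make "log as data" concrete, here is a minimal sketch of how key-value pairs turn into one JSON object per line. `JsonLogLine` is a hypothetical helper written for illustration, not part of any logging library. It treats every value as a string and leaves the timestamp out so the output stays deterministic. In practice a JSON encoder for your logging framework does this for you.

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.stream.Collectors;

// Hypothetical helper for illustration only: renders one log event as JSON.
final class JsonLogLine {
    // Escape the characters that would break a JSON string literal.
    static String escape(String s) {
        StringBuilder out = new StringBuilder();
        for (char c : s.toCharArray()) {
            switch (c) {
                case '"': out.append("\\\""); break;
                case '\\': out.append("\\\\"); break;
                case '\n': out.append("\\n"); break;
                case '\r': out.append("\\r"); break;
                case '\t': out.append("\\t"); break;
                default:
                    if (c < 0x20) {
                        // Remaining control characters become numeric escapes
                        out.append(String.format("\\u%04x", (int) c));
                    } else {
                        out.append(c);
                    }
            }
        }
        return out.toString();
    }

    // Render level, message, and fields as a single-line JSON object.
    // Assumption: all values are strings; real encoders also handle numbers etc.
    static String format(String level, String message, Map<String, String> fields) {
        Map<String, String> all = new LinkedHashMap<>();
        all.put("level", level);
        all.put("message", message);
        all.putAll(fields);
        return all.entrySet().stream()
                .map(e -> "\"" + escape(e.getKey()) + "\":\"" + escape(e.getValue()) + "\"")
                .collect(Collectors.joining(",", "{", "}"));
    }
}
```

Calling `JsonLogLine.format("INFO", "Processing order", fields)` with `orderId=ORD-123` and `userId=USR-456` yields `{"level":"INFO","message":"Processing order","orderId":"ORD-123","userId":"USR-456"}`. The message stays constant and the variable parts live in their own fields, which is what makes them queryable.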
Correlation IDs
The real power: a single ID that follows the request across every service.
User hits API Gateway. Gateway generates a correlation ID. Passes it to Service A. A passes it to B. B passes it to the database layer. Every log line includes it.
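// Spring interceptor adds correlation ID to MDC
// (imports assume Spring 6+ and SLF4J; on Spring 5 use javax.servlet.http)
import jakarta.servlet.http.HttpServletRequest;
import jakarta.servlet.http.HttpServletResponse;
import java.util.UUID;
import org.slf4j.MDC;
import org.springframework.web.servlet.HandlerInterceptor;

public class CorrelationInterceptor implements HandlerInterceptor {
    public boolean preHandle(HttpServletRequest req,
            HttpServletResponse res, Object handler) {
        String correlationId = req.getHeader("X-Correlation-ID");
        if (correlationId == null) {
            correlationId = UUID.randomUUID().toString();
        }
        MDC.put("correlationId", correlationId);
        return true;
    }
    // Clear the ID so a pooled thread does not carry it into the next request
    public void afterCompletion(HttpServletRequest req,
            HttpServletResponse res, Object handler, Exception ex) {
        MDC.remove("correlationId");
    }
}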
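The interceptor covers the incoming side. For the ID to follow the request, each service also has to attach it to its outgoing calls. Here is a rough sketch of that half using only the JDK's `java.net.http` client. `CorrelationContext` is a hypothetical helper, and a plain `ThreadLocal` stands in for the logging MDC so the example runs without any dependencies.

```java
import java.net.URI;
import java.net.http.HttpRequest;
import java.util.UUID;

// Hypothetical helper for illustration: carries the current request's
// correlation ID onto outgoing calls. A ThreadLocal stands in for the MDC.
final class CorrelationContext {
    static final String HEADER = "X-Correlation-ID";
    private static final ThreadLocal<String> CURRENT = new ThreadLocal<>();

    // Store the ID for the request handled on this thread.
    static void set(String id) {
        CURRENT.set(id);
    }

    // Drop the ID once the request is done.
    static void clear() {
        CURRENT.remove();
    }

    // Return the current ID, creating one if this thread has none yet.
    static String currentOrNew() {
        String id = CURRENT.get();
        if (id == null) {
            id = UUID.randomUUID().toString();
            CURRENT.set(id);
        }
        return id;
    }

    // Build a GET request to a downstream service with the ID attached.
    static HttpRequest downstreamGet(String url) {
        return HttpRequest.newBuilder(URI.create(url))
                .header(HEADER, currentOrNew())
                .GET()
                .build();
    }
}
```

With the ID set to `abc-def-789`, a request built by `CorrelationContext.downstreamGet(...)` carries `X-Correlation-ID: abc-def-789`, and the next service's interceptor picks it up instead of generating a new one. One caveat: thread-local state does not follow work handed to another thread, so async code needs to pass the ID along explicitly.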
Now one query pulls the entire request journey:
correlationId: "abc-def-789"
Six services, 30 log lines, one story. At Oracle, adding correlation IDs to our NSSF services turned debugging from a multi-hour process into a 5-minute one. That’s not an exaggeration. We went from “which service even handled this?” to “here’s the exact line that failed.”
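Under the hood, that query is a filter plus a sort. Here is a small in-memory sketch of the idea, with each log event as a map of field names to values. `RequestJourney` is a hypothetical name, and the sketch assumes every event carries an ISO-8601 UTC `timestamp` in the same format, so plain string order matches time order.

```java
import java.util.Comparator;
import java.util.List;
import java.util.Map;
import java.util.stream.Collectors;

// Hypothetical in-memory stand-in for what a log aggregator query does.
final class RequestJourney {
    // Keep only events with the given correlation ID, oldest first.
    // Assumption: every matching event has a "timestamp" field.
    static List<Map<String, String>> trace(
            List<Map<String, String>> events, String correlationId) {
        return events.stream()
                .filter(e -> correlationId.equals(e.get("correlationId")))
                .sorted(Comparator.comparing(
                        (Map<String, String> e) -> e.get("timestamp")))
                .collect(Collectors.toList());
    }
}
```

Feed it the mixed log lines from every service and it returns just the ones for a single request, in order. None of this works with string-concatenated messages, because there is no `correlationId` field to filter on.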
What I’m Learning
Structured logging is one of those things that feels like overhead until you need it. Then it’s the difference between solving a production issue in minutes versus hours.
The minimum: JSON logs, correlation ID on every request, and consistent field names across services. Everything else is nice-to-have.
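One way to keep field names consistent is to put them in a small shared class that every service imports, instead of retyping string literals. The sketch below is only an illustration: `LogFields` is a hypothetical name, the field names mirror the JSON example earlier in the post, and which fields count as required is my assumption.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;

// Hypothetical shared constants so every service spells fields the same way.
final class LogFields {
    static final String CORRELATION_ID = "correlationId";
    static final String SERVICE = "service";
    static final String ACTION = "action";

    // Assumption: these three are the fields every log event must carry.
    static final List<String> REQUIRED = List.of(CORRELATION_ID, SERVICE, ACTION);

    // Return the required fields that an event is missing, in REQUIRED order.
    static List<String> missing(Map<String, String> event) {
        List<String> out = new ArrayList<>();
        for (String field : REQUIRED) {
            if (!event.containsKey(field)) {
                out.add(field);
            }
        }
        return out;
    }
}
```

A check like `LogFields.missing(event)` fits naturally in a unit test, where it catches a service that logs `correlation_id` or `requestId` before that inconsistency reaches production.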
What’s your logging setup? Do you use distributed tracing or just structured logs?