It is three in the morning, the on-call rotation is awake, and the logs scroll past at a rate the eye cannot track. Ten thousand identical lines reading "ERROR Request failed: Connection timeout" have appeared in the last fifteen minutes. The timestamps are dense, the request paths blurred, the causal chain absent. Somewhere in the system, a downstream call to an inventory service is failing. The log file does not, in any column, tell anyone which downstream call, which upstream caller, which user request started the cascade, or which retry attempt happens to be the one currently scrolling.
I want to take that scenario seriously, because it is not a logging-quality problem. The logs in question are well-formatted, well-timestamped, and well-aggregated. The team is doing all the things the 2014-vintage advice columns recommended. The problem is structural: a log line is the wrong unit of analysis for the failure they are looking at, and no quantity of better log lines will turn the wrong unit of analysis into the right one. The unit they need is the trace.
What a log line actually is
A structured log entry answers a specific question: "what did this service observe at this moment." It is local to the service, local to the moment, and — by design — has no native concept of where in a wider request lifecycle it sits. In a monolith this is a survivable limitation; the entire request runs in one process, every log line shares an in-memory request context, and a request_id field is enough to grep the picture together after the fact.
In a microservice deployment of any sophistication, the assumption breaks down. A user request that hits a Kubernetes ingress at the edge typically traverses five to twenty internal services before producing a response: an auth gateway, a session-resolution service, two or three domain APIs, a feature-flag layer, several backing data stores, possibly a recommendation service, possibly a billing path. Each of those services emits its own log lines, often to its own log destination, often without a shared correlation field. The request did happen; nobody recorded the shape of it.
The log-correlation problem isn't fixable inside the log abstraction. A log line cannot, by construction, contain information about the call graph it sat inside, because the call graph wasn't visible to the service writing the line. Someone has to record the call graph separately, and that someone is a different signal class entirely.
What a trace is
A trace is the structural answer to "what happened across services for one request." It is a tree of spans, where each span represents a unit of work — a service call, a database query, a cache lookup, an outbound HTTP request — and parent-child relationships preserve the causal nesting. A unique trace_id propagates from the top of the tree (typically the edge ingress) through every hop; each span carries the parent span's ID, plus its own span ID, plus a small bag of attributes (HTTP method, query params, error code, business identifiers).
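To make the tree structure concrete, here is a minimal sketch of how flat span records can be reassembled into the parent-child tree described above. The field names (spanId, parentSpanId, name, durationMs) and the sample data are assumptions for illustration, not the schema of any particular backend.

```javascript
// Hypothetical flat span records for one trace, as a backend might store them.
// Field names are illustrative only.
const spans = [
  { spanId: 'a1', parentSpanId: null, name: 'POST /checkout', durationMs: 320 },
  { spanId: 'b2', parentSpanId: 'a1', name: 'order-service', durationMs: 313 },
  { spanId: 'c3', parentSpanId: 'b2', name: 'inventory-service', durationMs: 301 },
  { spanId: 'd4', parentSpanId: 'b2', name: 'SELECT orders', durationMs: 4 },
];

// Rebuild the tree: index every span by ID, then attach each one to its parent.
function buildTraceTree(flatSpans) {
  const byId = new Map();
  for (const span of flatSpans) {
    byId.set(span.spanId, { ...span, children: [] });
  }
  const roots = [];
  for (const node of byId.values()) {
    const parent = node.parentSpanId ? byId.get(node.parentSpanId) : undefined;
    if (parent) {
      parent.children.push(node);
    } else {
      // No parent ID, or a parent that was never recorded: treat as a root.
      roots.push(node);
    }
  }
  return roots;
}

// Format the tree with indentation that mirrors the causal nesting.
function formatTree(nodes, depth = 0) {
  const lines = [];
  for (const node of nodes) {
    lines.push(`${'  '.repeat(depth)}${node.name} (${node.durationMs}ms)`);
    lines.push(...formatTree(node.children, depth + 1));
  }
  return lines;
}

const tree = buildTraceTree(spans);
console.log(formatTree(tree).join('\n'));
```

Spans whose parent never arrived are kept as extra roots rather than dropped, which is one reasonable choice when a service in the middle of the chain is not instrumented.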
Rendered visually, a trace is a waterfall: time on the horizontal axis, services and operations on the vertical, each span a coloured bar whose width is its duration. The slow span is the wide one. The failed span is red. The interesting question — "which of the twenty hops in this request consumed the time, and which one returned the error?" — is one screen, not twenty grep commands.
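A text-only approximation of that waterfall fits in a few lines. The sketch below assumes each span carries a start offset and a duration in milliseconds; the span names and numbers are hypothetical, and a real trace UI does considerably more than this.

```javascript
// Hypothetical spans for one trace: start offsets and durations in milliseconds.
const waterfallSpans = [
  { name: 'gateway', startMs: 0, durationMs: 320, error: false },
  { name: 'order', startMs: 12, durationMs: 305, error: false },
  { name: 'inventory', startMs: 14, durationMs: 301, error: true },
];

// Render one row per span: time runs left to right, bar width is duration.
// Failed spans are drawn with '!' in place of the red bar a UI would show.
function renderWaterfall(spanList, width = 40) {
  const totalMs = Math.max(...spanList.map((s) => s.startMs + s.durationMs));
  const scale = width / totalMs;
  const labelWidth = Math.max(...spanList.map((s) => s.name.length));
  return spanList.map((s) => {
    const offset = Math.floor(s.startMs * scale);
    const length = Math.max(1, Math.round(s.durationMs * scale));
    const bar = (s.error ? '!' : '#').repeat(Math.min(length, width - offset));
    return `${s.name.padEnd(labelWidth)} |${' '.repeat(offset)}${bar}`;
  });
}

console.log(renderWaterfall(waterfallSpans).join('\n'));
```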
This is not a logging upgrade. It is a different signal class, with a different unit of analysis (the request) than the log signal (the moment), and a different storage model (a tree per request) than the log storage model (a stream per service). The two are complementary, not interchangeable.
The standard you can rely on
Distributed tracing as a discipline is older than most engineers writing about it think. Google's 2010 Dapper paper, by Sigelman and colleagues, is the canonical reference; Twitter open-sourced Zipkin in 2012 as a Dapper-inspired implementation, and Uber open-sourced Jaeger in 2017 on similar lineage. For most of the 2010s, however, the operational reality was vendor-specific: each APM (Datadog, New Relic, AppDynamics, Dynatrace) shipped its own SDK, and instrumenting an application meant choosing a vendor and accepting that the instrumentation work was, structurally, lock-in.
The standardisation arrived in two pieces, both of which are worth pausing on.
The W3C published the Trace Context Recommendation on 6 February 2020, defining a vendor-neutral wire format for propagating trace IDs across HTTP service boundaries. The spec is small and unglamorous: a traceparent header carrying a version, the trace ID, the parent span ID, and trace flags (including the sampled bit), plus an optional tracestate header for vendor-specific context. Most major HTTP clients and frameworks now respect it as a matter of course.
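The header is simple enough to parse by hand. The following is a minimal sketch that accepts only the version 00 layout (two hex digits of version, a 32-hex-digit trace ID, a 16-hex-digit parent ID, and two hex digits of flags, joined by hyphens). The function names are hypothetical, and in practice an OpenTelemetry propagator would handle this, including the forward-compatibility rules for later versions that this sketch ignores.

```javascript
// Layout for version 00: version "-" trace-id "-" parent-id "-" trace-flags,
// all lowercase hex. This sketch rejects anything that is not version 00.
const TRACEPARENT_RE = /^00-([0-9a-f]{32})-([0-9a-f]{16})-([0-9a-f]{2})$/;

function parseTraceparent(header) {
  const match = TRACEPARENT_RE.exec(String(header).trim());
  if (!match) return null;
  const [, traceId, parentId, flags] = match;
  // All-zero trace IDs and parent IDs are defined as invalid.
  if (/^0+$/.test(traceId) || /^0+$/.test(parentId)) return null;
  return {
    traceId,
    parentId,
    // Bit 0 of the flags byte is the "sampled" flag.
    sampled: (parseInt(flags, 16) & 0x01) === 1,
  };
}

function formatTraceparent({ traceId, parentId, sampled }) {
  return `00-${traceId}-${parentId}-${sampled ? '01' : '00'}`;
}

// Example header value of the shape used in the specification.
const incoming = '00-4bf92f3577b34da6a3ce929d0e0e4736-00f067aa0ba902b7-01';
console.log(parseTraceparent(incoming));
```

A service that makes an outbound call keeps the trace ID, substitutes its own span ID as the parent ID, and sends the result downstream.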
OpenTelemetry, the merger of OpenTracing and OpenCensus, was accepted to the CNCF Sandbox in May 2019 and moved to Incubating maturity on 26 August 2021, where it remains as of mid-2026. The project ships SDKs for the major languages (Node.js, Python, Java, Go, .NET, Rust, Ruby, PHP), an OTLP wire protocol, and a Collector binary that brokers between application-side instrumentation and any compliant backend. For several of those languages (Java, Python, Node.js, and .NET among them), the project also ships automatic-instrumentation libraries that wire framework-level telemetry with little or no application code change: HTTP servers, ORMs, RPC clients, and message-queue libraries are instrumented at load time.
The practical consequence is that the 2010s pattern of "pick an APM and live with their SDK" has been replaced by "instrument with OpenTelemetry, ship the OTLP traffic to whichever backend you can afford this quarter." Jaeger, Grafana Tempo, Honeycomb, Datadog APM, New Relic, Splunk Observability, Elastic APM, and AWS X-Ray all accept OTLP. The instrumentation decision is now separable from the backend decision, and the backend choice is a renewable one.
Where the wiring breaks
Even teams running OpenTelemetry in production often fail at one specific bridge: linking each log line back to the trace span it occurred inside.
The fix is twenty lines of code in any language with an OpenTelemetry SDK: read the active span from the SDK at log time, and attach trace_id and span_id to the structured log payload as additional fields. From that point forward, the log-aggregation tool and the APM are a single navigable surface: click a span in the trace view, see the logs for that span; open a log entry in the aggregator, jump to the trace that produced it.
The Node.js shape of it, with no extra dependencies beyond the OpenTelemetry SDK already in the application:
    const { trace } = require('@opentelemetry/api');

    function log(level, message, fields = {}) {
      const span = trace.getActiveSpan();
      const ctx = span ? span.spanContext() : {};
      console.log(JSON.stringify({
        level,
        message,
        trace_id: ctx.traceId || null,
        span_id: ctx.spanId || null,
        timestamp: new Date().toISOString(),
        ...fields,
      }));
    }

    // Use it like any structured logger.
    log('error', 'Failed to reserve inventory', { item_id: 4821 });
The same pattern transposes directly to Python (opentelemetry.trace.get_current_span()), Go (trace.SpanFromContext(ctx).SpanContext()), Java (Span.current().getSpanContext()), and every other OpenTelemetry SDK; the language-specific entry point changes, the structure does not. A wrapper like this routed through the team's existing logger (Pino, Winston, Bunyan, slog, Logback, etc.) means every log line emitted in a request inherits the trace-and-span IDs of whatever span is active at emission time.
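Once the IDs are on the log line, the "click a span, see its logs" pivot is a lookup rather than a search. The sketch below illustrates the idea with plain JSON log lines; the sample lines, the short IDs, and the function names are hypothetical, and a real log aggregator would do this with an indexed query instead of an in-memory map.

```javascript
// Hypothetical JSON log lines, shaped like the output of a trace-aware logger.
// Real trace and span IDs are longer hex strings; short ones keep this readable.
const logLines = [
  '{"level":"info","message":"Reserving inventory","trace_id":"t1","span_id":"s2"}',
  '{"level":"error","message":"Failed to reserve inventory","trace_id":"t1","span_id":"s2"}',
  '{"level":"info","message":"Health check","trace_id":null,"span_id":null}',
  '{"level":"info","message":"Order created","trace_id":"t1","span_id":"s1"}',
];

// Group log entries under a "traceId/spanId" key.
function indexLogsBySpan(lines) {
  const index = new Map();
  for (const line of lines) {
    let entry;
    try {
      entry = JSON.parse(line);
    } catch {
      continue; // Skip lines that are not valid JSON.
    }
    // Lines emitted outside any span cannot be linked to a trace.
    if (!entry || !entry.trace_id || !entry.span_id) continue;
    const key = `${entry.trace_id}/${entry.span_id}`;
    if (!index.has(key)) index.set(key, []);
    index.get(key).push(entry);
  }
  return index;
}

function logsForSpan(index, traceId, spanId) {
  return index.get(`${traceId}/${spanId}`) || [];
}

const logIndex = indexLogsBySpan(logLines);
console.log(logsForSpan(logIndex, 't1', 's2').map((e) => e.message));
```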
The bridge is missing from a striking number of production deployments. Two screens, no link, the on-call engineer still grep-correlating by timestamp at 3am. The pattern is consistent enough across teams that it deserves to be called out as the single highest-leverage observability change a team can make: instrument once with OpenTelemetry, then add trace_id and span_id to every log line, and the entire observability surface becomes navigable.
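One way to find out whether the bridge is actually in place is to measure it. The sketch below counts how many JSON log lines in a sample carry a well-formed trace_id and span_id, assuming the hex shapes OpenTelemetry uses (32 hex characters for a trace ID, 16 for a span ID). The function name and the sample lines are hypothetical.

```javascript
const TRACE_ID_RE = /^[0-9a-f]{32}$/;
const SPAN_ID_RE = /^[0-9a-f]{16}$/;

// Report how many parseable log lines can be linked back to a span.
function bridgeCoverage(lines) {
  let total = 0;
  let linked = 0;
  for (const line of lines) {
    let entry;
    try {
      entry = JSON.parse(line);
    } catch {
      continue; // Unstructured lines are ignored in this sketch.
    }
    if (entry === null || typeof entry !== 'object') continue;
    total += 1;
    if (TRACE_ID_RE.test(entry.trace_id) && SPAN_ID_RE.test(entry.span_id)) {
      linked += 1;
    }
  }
  return { total, linked, ratio: total === 0 ? 0 : linked / total };
}

// Hypothetical sample: two linked lines, one unlinked, one unstructured.
const sampleLines = [
  '{"message":"a","trace_id":"4bf92f3577b34da6a3ce929d0e0e4736","span_id":"00f067aa0ba902b7"}',
  '{"message":"b","trace_id":"4bf92f3577b34da6a3ce929d0e0e4736","span_id":"b7ad6b7169203331"}',
  '{"message":"c","trace_id":null,"span_id":null}',
  'plain text line with no structure',
];
console.log(bridgeCoverage(sampleLines));
```

Lines emitted outside any request (startup messages, health checks) will legitimately lack IDs, so a ratio below 1 is not automatically a defect; a ratio near 0 on request-handling services is.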
The 3am scenario, replayed
Take the same incident from the opening, with traces wired. The trace view shows a POST /checkout request with 320ms total duration, breaking down into a 12ms hop from the API gateway to the order service, then a 301ms hop from order to inventory, where the inventory service's AcquireLock(item_id=4821) span shows a 300ms timeout. The full causal chain is visible on one screen. The trace_id on each log line in any log-aggregation tool is the same trace_id visible in the APM, so the engineer can pivot freely between the two surfaces.
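What the engineer does by eye on that screen can be sketched as a short walk down the span tree: start at the failed root and keep following failed children until there are none left. The trace object below is hypothetical, with span names, the status field, and durations loosely modelled on the replay above and not taken from a real backend.

```javascript
// Hypothetical trace for the checkout incident.
const checkoutTrace = {
  service: 'api-gateway', name: 'POST /checkout', durationMs: 320, status: 'error',
  children: [
    {
      service: 'order', name: 'CreateOrder', durationMs: 305, status: 'error',
      children: [
        { service: 'order', name: 'SELECT orders', durationMs: 3, status: 'ok', children: [] },
        {
          service: 'inventory', name: 'AcquireLock(item_id=4821)',
          durationMs: 300, status: 'error', children: [],
        },
      ],
    },
  ],
};

// Return the chain of failed spans from the root down to the deepest failure.
// An empty array means the span passed in did not fail.
function failingPath(span) {
  if (span.status !== 'error') return [];
  for (const child of span.children) {
    const below = failingPath(child);
    if (below.length > 0) return [span, ...below];
  }
  return [span];
}

for (const span of failingPath(checkoutTrace)) {
  console.log(`${span.service}: ${span.name} (${span.durationMs}ms)`);
}
```

The last span in the returned chain is the deepest failure, which is usually where the investigation should start.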
Same incident, two debug tools, very different work envelope. The team that landed at the trace view in fifteen seconds is back in bed within the hour. The team without it is constructing the call graph by hand from log timestamps until daylight.
The economic point is the part most teams underestimate when they're prioritising the work. Incident MTTR is not just an availability metric — it is a developer-velocity metric, because it determines how much engineering time per week is spent in active incident response versus shipping features. A team running with traces wired has a bounded MTTR; the worst case is "look at the trace, find the slow span, fix it." A team without has an unbounded MTTR, because the worst case is "look at twenty log streams, hope someone wrote a useful field, give up and start adding println calls in a hotfix."
The minimum viable observability stack in 2026
The 2010s observability stack was complicated. In 2026 it isn't.
An OpenTelemetry SDK in the application emits OTLP traffic. An OpenTelemetry Collector — a small, stateless binary — receives that traffic and forwards it to whichever backend the team uses for storage and visualisation. The backend can be open-source self-hosted, commercial SaaS, or part of a wider observability suite. Switching backends is a Collector configuration change, not a re-instrumentation project.
The OTLP-compatible backend market in 2026, briefly:
| Backend | Type | Hosting | Notable strength |
| --- | --- | --- | --- |
| Jaeger | Open source | Self-host (Docker / Kubernetes / standalone) | CNCF Graduated; OTLP-native receiver since 1.35; the reference implementation most other backends are read against |
| Grafana Tempo | Open source | Self-host or Grafana Cloud | Object-storage-backed; tightly integrated with Loki (logs) + Prometheus (metrics) for a unified Grafana stack |
| SigNoz | Open source | Self-host or SigNoz Cloud | OpenTelemetry-native end-to-end; ClickHouse-backed; combined logs/traces/metrics in one tool |
| Honeycomb | Commercial SaaS | Honeycomb Cloud | High-cardinality query model; pioneered observability-as-debugging rather than as monitoring |
| Datadog APM | Commercial SaaS | Datadog Cloud | Mature, widely deployed, expensive at high cardinality / high retention |
| New Relic | Commercial SaaS | New Relic Cloud | Bundled into a broader APM / monitoring / RUM suite; consumption-based pricing |
| Splunk Observability Cloud | Commercial SaaS | Splunk Cloud | Rebranded SignalFx; enterprise-oriented; integrates with the Splunk log / SIEM stack |
| Elastic APM | Open source + commercial | Self-host or Elastic Cloud | Tight integration with Elasticsearch + Kibana; strong for teams already on the Elastic stack |
| AWS X-Ray | Cloud-native | AWS | Native integration with AWS-only deployments; thin on cross-cloud or hybrid scenarios |
Switching among these is a Collector configuration change. The application's instrumentation code does not move.
The application code carries one library and a handful of attribute-tagging calls in the spans the team cares about. Log-aggregation continues to use whatever the team already runs (Loki, Elastic, OpenSearch, Datadog Logs, etc.); the bridge is the trace_id field on every log line.
This is a much smaller commitment than the 2017 vendor-SDK pattern, and the lock-in surface is a fraction of what it was. Teams that haven't made the move are usually not held back by complexity; they're held back by the fact that the previous-generation advice columns are still in everyone's bookmarks, and the new pattern hasn't been internalised as the default.
What logs and traces actually answer
It is worth stating clearly, because the field's vocabulary has been muddy on this point for a decade. Logs answer what happened in this service at this moment. Traces answer where in the causal chain across services the failure occurred. Metrics answer how often, with what shape, over what window. Each of the three signals is necessary for a different class of question; none of them is sufficient on its own.
| Signal | Question it answers | Unit of analysis | Storage model | Useful at 3am for |
| --- | --- | --- | --- | --- |
| Logs | What did this service observe at this moment? | An event in one process | Stream per service, time-ordered | Reading the exact error text, stack trace, payload |
| Traces | Where in the causal chain across services did the request go? | A request, end to end | Tree of spans per request | Locating the slow or failing hop in a multi-service path |
| Metrics | How often, with what shape, over what window? | Counter / gauge / histogram, time-bucketed | Time series | Detecting that something is broken right now and characterising the pattern |
The reason the three are routinely conflated is that the same word "observability" gets used for all of them, and the same vendor often sells all three as one product. They are still three different signals with three different jobs, and a deployment that has only one of them — usually logs, occasionally metrics — is missing two thirds of its debug surface.
The reason the 3am incident scenario is annoying is that the on-call engineer is asking a "where in the chain" question and being handed a "what at the moment" answer. The log line is technically correct and operationally useless, because it is the wrong signal for the question. Adding ten more log lines per service does not improve the situation; the additional log lines answer the same wrong question more loudly.
Wiring traces is not "doing observability properly." It is wiring the signal that the failure mode actually has a shape in. Once it is wired, the logs become useful again — not because the log lines got better, but because they now sit inside a structure that makes them addressable.
What the slow rollout actually looks like
There is a pattern in how teams adopt distributed tracing that is worth noting because it is so consistent.
A team starts with logs only. Logs are good enough until the deployment crosses some threshold, usually around four or five microservices, after which the log-correlation overhead becomes a daily friction. The team adds metrics, often Prometheus, and the metrics solve the "is something broken right now" question but do not help with "why is this specific request slow." The team adds tracing last, often after a particular incident has consumed enough engineering hours that someone has been given a quarter to fix it. The tracing rollout is not difficult; the SDK is mature, the Collector is small, the backends are interchangeable. The friction is organisational, not technical — the team has to allocate the time, agree on attribute conventions, and write the log-trace bridge.
After the rollout, teams tend to report the same observation: the marginal cost of the next incident drops sharply, often by something like an order of magnitude. The incident that would have consumed an afternoon now takes ten minutes. At that rate, the rollout typically pays for itself within the first couple of months.
Teams that haven't done it yet are paying that cost in incident hours, week over week, until they do.
The actual question
The question that matters in 2026 is not "should we do distributed tracing." The standardisation argument is over; OpenTelemetry has won the instrumentation layer, the W3C Trace Context spec has won the propagation layer, and the backend market has commoditised on the OTLP wire format. The question that matters is whether the team's logs are addressable from a trace, and whether the trace is addressable from a log line, in both directions, on every request.
If the answer is yes, the 3am incident has a different shape than the one in the opening. If the answer is no, the team is paying for that gap in MTTR every week, and the cost shows up in features that didn't ship because the on-call rotation was busy. Logs and traces are not in tension. They are two different signals, joined by twenty lines of code, and the team that has wired the join is the team that goes back to bed.