Multi-Cluster Tracing with Traefik, OpenTelemetry, and Datadog

A single request that crosses three clusters needs to look like a single request when you investigate it. Otherwise you're reading three disconnected log files and guessing at the join.

This post walks through the setup: Traefik Hub in a multi-cluster topology, OpenTelemetry as the wire format, Datadog as the observability platform, and Bits as the AI investigator on top.


The architecture

The topology is a Traefik Hub parent fronting a Traefik Hub child in a multi-cluster deployment. This example has one parent cluster and one child cluster, though there could be many children.

The parent owns ingress and policy. The child owns the IngressRoute CRDs and routes to the workloads. A request from a client traverses parent → uplink → child → upstream, which means a single user request generates spans in two Traefik instances and at least one upstream HTTP call.

Without trace propagation, that's three views of the same request. With trace propagation, it's one trace.


OpenTelemetry as the wire format

Traefik speaks OpenTelemetry natively for traces, metrics, and logs. You configure each signal in the Helm values for your Traefik Hub deployment. Here's the parent configuration:

# OTLP Tracing Configuration
tracing:
  otlp:
    enabled: true
    http:
      enabled: true
      endpoint: http://datadog-agent.datadog.svc.cluster.local:4318/v1/traces
  serviceName: traefik-hub-parent
  resourceAttributes:
    env: dev
    service: traefik-hub-parent
    deployment.environment: dev
  addInternals: true
  sampleRate: 1.0

# OTLP Metrics Configuration
metrics:
  otlp:
    enabled: true
    http:
      enabled: true
      endpoint: http://datadog-agent.datadog.svc.cluster.local:4318/v1/metrics
    addEntryPointsLabels: true
    addRoutersLabels: true
    addServicesLabels: true
    serviceName: traefik-hub-parent

# Access logs with OTLP export
logs:
  access:
    enabled: true
    format: json
    addInternals: true
    fields:
      general:
        defaultmode: keep
      headers:
        defaultmode: keep

# OTLP Logs
additionalArguments:
  - --experimental.otlpLogs=true
  - --accesslog.otlp.http=true
  - --accesslog.otlp.http.endpoint=http://datadog-agent.datadog.svc.cluster.local:4318/v1/logs
  - --log.otlp.http=true
  - --log.otlp.http.endpoint=http://datadog-agent.datadog.svc.cluster.local:4318/v1/logs

The child cluster uses the same pattern with serviceName: traefik-hub-child. Both instances point at the local Datadog Agent's OTLP HTTP port (4318). Trace context propagates over the uplink because Traefik forwards traceparent and tracestate headers by default. The child picks up the context, opens a child span, and the trace stays intact across the cluster boundary.
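To make the propagation step concrete, here is a minimal Python sketch of what any W3C Trace Context participant does with an incoming traceparent header: keep the trace ID, record the caller's span ID as the parent, and mint a new span ID for its own hop. This is an illustration of the header format only, not Traefik's implementation; the function name continue_trace and the sample header value are made up for the example.

```python
import re
import secrets

# W3C Trace Context layout: version-traceid-parentid-flags, lowercase hex.
TRACEPARENT_RE = re.compile(
    r"^([0-9a-f]{2})-([0-9a-f]{32})-([0-9a-f]{16})-([0-9a-f]{2})$"
)


def continue_trace(traceparent: str) -> dict:
    """Parse an incoming traceparent and build the header for the next hop."""
    match = TRACEPARENT_RE.match(traceparent)
    if match is None:
        raise ValueError(f"malformed traceparent: {traceparent!r}")
    version, trace_id, parent_span_id, flags = match.groups()
    span_id = secrets.token_hex(8)  # new 8-byte span ID for this hop
    return {
        "trace_id": trace_id,              # unchanged across every hop
        "parent_span_id": parent_span_id,  # the caller's span
        "span_id": span_id,                # this hop's span
        "outgoing": f"{version}-{trace_id}-{span_id}-{flags}",
    }


# Hypothetical header as the child might receive it from the parent.
incoming = "00-4bf92f3577b34da6a3ce929d0e0e4736-00f067aa0ba902b7-01"
hop = continue_trace(incoming)
print(hop["outgoing"])
```

The trace ID in the outgoing header is identical to the incoming one, which is the whole reason the parent and child spans end up in the same trace.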

Access logs go directly to the Datadog Agent over OTLP HTTP, not via container log collection. This gives you trace ID correlation out of the box without relying on log parsing.

Because everything is OTLP, the resource attributes are predictable: service.name, env, and deployment.environment, plus the Kubernetes labels the agent injects. Those attributes are what Datadog uses to build the service map, and they're what Bits later uses to correlate logs with spans.


Datadog ingests OTLP

The Datadog Agent has native OTLP ingestion. We deploy it using the Datadog Operator with a DatadogAgent CRD:

apiVersion: datadoghq.com/v2alpha1
kind: DatadogAgent
metadata:
  name: datadog
  namespace: datadog
spec:
  global:
    clusterName: traefik-child-1
    site: datadoghq.com
    credentials:
      apiSecret:
        secretName: datadog-secret
        keyName: api-key
    tags:
      - "env:dev"
      - "team:traefik"

  features:
    otlp:
      receiver:
        protocols:
          http:
            enabled: true    # listens on 0.0.0.0:4318
          grpc:
            enabled: true    # listens on 0.0.0.0:4317
    apm:
      enabled: true
    logCollection:
      enabled: true
      containerCollectAll: true

  override:
    nodeAgent:
      containers:
        agent:
          env:
            # OTLP logs ingestion (off by default)
            - name: DD_OTLP_CONFIG_LOGS_ENABLED
              value: "true"
            # Resource attributes → Datadog tags
            - name: DD_OTLP_CONFIG_METRICS_RESOURCE_ATTRIBUTES_AS_TAGS
              value: "true"
            - name: DD_OTLP_CONFIG_TRACES_RESOURCE_ATTRIBUTES_AS_TAGS
              value: "true"

Three things to know about that config. First, traces and metrics ingestion is on by default once the OTLP receiver is enabled in features.otlp. Second, OTLP logs are off by default to avoid surprise billing. You opt in explicitly by setting DD_OTLP_CONFIG_LOGS_ENABLED=true as an environment variable. Third, the resource attribute environment variables ensure that OTel attributes like service.name and deployment.environment become first-class Datadog tags.

Traefik emits OTLP, the agent receives OTLP, and Datadog maps the OTel semantic conventions onto its own service/resource/span model on the way in.
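If you want to check the receiver before pointing Traefik at it, one option is to hand-build a single span and POST it to the same endpoint. The sketch below assumes the agent's OTLP HTTP receiver accepts the OTLP/JSON encoding (the upstream OpenTelemetry Collector receiver does); the service name otlp-smoke-test and the helper names are hypothetical, and the send call is left commented out because it only works from inside the cluster.

```python
import json
import secrets
import time
import urllib.request

# Endpoint copied from the Traefik values earlier in this post.
OTLP_TRACES_URL = "http://datadog-agent.datadog.svc.cluster.local:4318/v1/traces"


def build_payload(service_name: str, env: str) -> dict:
    """Build a one-span OTLP/JSON trace export request."""
    now = time.time_ns()
    return {
        "resourceSpans": [{
            "resource": {"attributes": [
                {"key": "service.name", "value": {"stringValue": service_name}},
                {"key": "deployment.environment", "value": {"stringValue": env}},
            ]},
            "scopeSpans": [{
                "scope": {"name": "otlp-smoke-test"},
                "spans": [{
                    "traceId": secrets.token_hex(16),  # 32 hex chars
                    "spanId": secrets.token_hex(8),    # 16 hex chars
                    "name": "smoke-test",
                    "kind": 1,  # SPAN_KIND_INTERNAL
                    "startTimeUnixNano": str(now - 1_000_000),
                    "endTimeUnixNano": str(now),
                }],
            }],
        }]
    }


def send(payload: dict, url: str = OTLP_TRACES_URL) -> int:
    """POST the payload and return the HTTP status code."""
    request = urllib.request.Request(
        url,
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )
    with urllib.request.urlopen(request, timeout=5) as response:
        return response.status


payload = build_payload("otlp-smoke-test", "dev")
# send(payload)  # uncomment when running from a pod inside the cluster
```

If the span shows up in Datadog tagged with the service and environment you set, the receiver and the attribute mapping are both working.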


What Datadog does with the data

This is where the multi-cluster setup pays off. Datadog joins the spans across clusters by trace ID and renders the actual call graph.

The service map shows traefik-hub-parent → traefik-hub-child as the routing hop, and the downstream errors landing on the external dependencies. Per-edge request rates and error percentages come from the spans. P50/P95/P99 latencies come from the same data, broken out per upstream service in the panel below the graph.
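As a rough picture of how per-edge numbers fall out of span data, the sketch below groups spans by caller and callee and computes a request count and an error percentage for each edge. The span records and the upstream-a service name are invented for illustration, and treating HTTP 5xx as the error condition is my assumption; Datadog's own aggregation is more involved.

```python
from collections import defaultdict

# Hypothetical, simplified spans: one record per hop.
spans = [
    {"caller": "traefik-hub-parent", "callee": "traefik-hub-child", "status": 200},
    {"caller": "traefik-hub-parent", "callee": "traefik-hub-child", "status": 200},
    {"caller": "traefik-hub-parent", "callee": "traefik-hub-child", "status": 200},
    {"caller": "traefik-hub-parent", "callee": "traefik-hub-child", "status": 200},
    {"caller": "traefik-hub-child", "callee": "upstream-a", "status": 200},
    {"caller": "traefik-hub-child", "callee": "upstream-a", "status": 200},
    {"caller": "traefik-hub-child", "callee": "upstream-a", "status": 502},
    {"caller": "traefik-hub-child", "callee": "upstream-a", "status": 200},
]


def edge_stats(records):
    """Return request count, error count, and error percentage per edge."""
    totals = defaultdict(lambda: {"requests": 0, "errors": 0})
    for record in records:
        edge = (record["caller"], record["callee"])
        totals[edge]["requests"] += 1
        if record["status"] >= 500:  # assumption: 5xx counts as an error
            totals[edge]["errors"] += 1
    return {
        edge: {**counts, "error_pct": 100.0 * counts["errors"] / counts["requests"]}
        for edge, counts in totals.items()
    }


stats = edge_stats(spans)
for (caller, callee), values in stats.items():
    print(f"{caller} -> {callee}: {values}")
```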

Three signals end up correlated by trace ID and resource attributes:

  • Traces and spans give per-hop latency and the call structure.
  • Metrics give the rate and error trends over time.
  • Logs are joined by trace ID and pod identifiers, so you jump from a slow span directly into the log lines for that exact request.

That join is what makes the data investigable instead of just stored.
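The join itself is conceptually simple. Here is a minimal sketch, with made-up span and log records, of jumping from the slowest span to the log lines that share its trace ID. The field names (trace_id, duration_ms, message) are assumptions for the example, not Datadog's schema.

```python
from collections import defaultdict

# Hypothetical records; in practice both come from the OTLP pipeline.
spans = [
    {"trace_id": "aaa", "service": "traefik-hub-parent", "duration_ms": 12},
    {"trace_id": "bbb", "service": "traefik-hub-child", "duration_ms": 950},
]
logs = [
    {"trace_id": "aaa", "message": "GET /ok 200"},
    {"trace_id": "bbb", "message": "GET /slow 504"},
    {"trace_id": "bbb", "message": "upstream timeout"},
]


def index_logs(records):
    """Group log messages by trace ID, preserving their original order."""
    by_trace = defaultdict(list)
    for record in records:
        by_trace[record["trace_id"]].append(record["message"])
    return by_trace


logs_by_trace = index_logs(logs)
slowest = max(spans, key=lambda span: span["duration_ms"])
related = logs_by_trace[slowest["trace_id"]]
print(slowest["service"], related)
```

Without a shared trace ID on both sides, the same lookup would need timestamps and pod names and a fair amount of guessing.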


Bits surfaces two real issues

I ran a general investigation in Bits, Datadog's AI investigator, against a window with elevated noise on the child. Five minutes later, two findings.

Finding 1: missing Kubernetes services

The Traefik Kubernetes provider on the child cluster was logging continuous ERR-level messages from kubernetes_http.go:110. IngressRoute CRDs hold five references to four Kubernetes services that do not exist in the apps namespace:

  • apps/swapi-svc (referenced by traefik-lke-route-child)
  • apps/swapi-svc-svc (referenced by traefik-lke-route-child)
  • apps/httpbin-svc (referenced by both traefik-lke-route-child and traefik-lke-route)
  • apps/petstore-svc (referenced by traefik-lke-route)

The provider polls every 5 seconds (pollInterval=5s per the startup args), which produces a steady ~110 errors per 5-minute window. Bits read the logs, correlated them to the CRD definitions, and named the exact routes that needed fixing.
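The underlying check is a set difference between what the routes reference and what exists. The sketch below reproduces it with the route and service names from the list above; the existing_services set is hypothetical (I'm using a placeholder whoami-svc), and this is my own illustration rather than how Bits or Traefik performs the lookup.

```python
# Service references per IngressRoute, taken from the finding above.
route_refs = {
    "traefik-lke-route-child": ["swapi-svc", "swapi-svc-svc", "httpbin-svc"],
    "traefik-lke-route": ["httpbin-svc", "petstore-svc"],
}

# Hypothetical: the services that actually exist in the apps namespace.
existing_services = {"whoami-svc"}


def missing_services(refs, existing):
    """Map each missing service to the sorted list of routes referencing it."""
    missing = {}
    for route, services in refs.items():
        for service in services:
            if service not in existing:
                missing.setdefault(service, []).append(route)
    return {service: sorted(routes) for service, routes in missing.items()}


missing = missing_services(route_refs, existing_services)
for service, routes in sorted(missing.items()):
    print(f"apps/{service} is referenced by {', '.join(routes)}")
```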

Finding 2: shutdown-timeout panic

Pod traefik-685dcc7bcf-l6gq5 (image ghcr.io/traefik/traefik-hub:v3.20.0-ea.7, commit 19adc2f2d4) panicked at 08:50:40 UTC with:

panic: Timeout while stopping traefik, killing instance
goroutine 1940622 [running]
github.com/traefik/traefik/v3@v3.7.0-ea.1/pkg/server/server.go:82
  → Server.Close.func1

Bits lined this up with a deployment event detected at 08:50:20 UTC. The shutdown sequence could not drain in-flight goroutines within the configured grace period, so the runtime force-killed the instance. Bits also analyzed the HTTP 499 errors on traefik-hub-parent and concluded they were client cancellations from a bruno-runtime/3.2.2 test client, not server-side failures.
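The timing correlation is easy to reproduce by hand. This small sketch computes the gap between the deployment event and the panic using the two timestamps above; the 60-second correlation window is a hypothetical threshold I picked for the example, not a value from Bits or from the Traefik configuration.

```python
from datetime import datetime, timedelta

# Hypothetical threshold for calling two events "related".
CORRELATION_WINDOW = timedelta(seconds=60)


def seconds_between(earlier: str, later: str) -> float:
    """Return the gap in seconds between two same-day HH:MM:SS timestamps."""
    fmt = "%H:%M:%S"
    return (datetime.strptime(later, fmt) - datetime.strptime(earlier, fmt)).total_seconds()


deploy_at = "08:50:20"  # deployment event (UTC)
panic_at = "08:50:40"   # shutdown-timeout panic (UTC)

gap = seconds_between(deploy_at, panic_at)
likely_related = 0 <= gap <= CORRELATION_WINDOW.total_seconds()
print(f"panic followed the deployment by {gap:.0f}s, related={likely_related}")
```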


What's actually doing the work

Four layers, each earning its place:

  • OpenTelemetry gives a vendor-neutral wire format and predictable semantic conventions.
  • Traefik Hub propagates trace context across the multicluster uplink so spans from parent and child belong to the same trace.
  • The Datadog Agent ingests OTLP natively, so the pipeline is configuration, not code.
  • Datadog and Bits join traces, spans, metrics, and logs by trace ID and resource attributes, then run AI investigations on top of the joined data.

Take any one of those out and the value drops sharply. Run all four and a multi-cluster trace becomes a single artifact you can investigate.


And you?

How are you propagating trace context across cluster boundaries today? If you're running a parent/child gateway pattern or any kind of multi-cluster ingress, what does your OTel pipeline look like?
