Version: 3.0.0

Telemetry - Core Provider

This document describes the OpenTelemetry metrics emitted by core-provider and provider-runtime.

For information on how to set up the OpenTelemetry Collector, configure Prometheus, or import dashboards, please refer to the OpenTelemetry Configuration guide.

Telemetry Assets

The following assets are available for core-provider:

Metrics Reference

Naming note

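Metric names in code use dots. Prometheus usually normalizes the dots to underscores, and counters may appear with a _total suffix. For example, provider_runtime.startup.success is queried as provider_runtime_startup_success_total. Quantile queries on histograms use the generated _bucket series, which holds a cumulative count for each upper bound (le). For average duration, divide the rate of the generated _sum series by the rate of the _count series instead of using _bucket.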
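The naming rules above can be sketched as a small helper. This is a simplified illustration, not the exporter's actual implementation: the function names to_prometheus_name and histogram_series are hypothetical, and the sketch assumes only the dot-to-underscore and _total rules described in this note. Real exporters may apply additional rules, such as unit suffixes.

```python
def to_prometheus_name(otel_name: str, metric_type: str) -> str:
    """Approximate the Prometheus series name for an OpenTelemetry metric.

    Assumption: dots become underscores, and counters get a `_total`
    suffix unless the name already ends with it.
    """
    name = otel_name.replace(".", "_")
    if metric_type == "counter" and not name.endswith("_total"):
        name += "_total"
    return name


def histogram_series(otel_name: str) -> list[str]:
    """Return the three series names generated for a histogram metric."""
    base = to_prometheus_name(otel_name, "histogram")
    return [f"{base}_bucket", f"{base}_sum", f"{base}_count"]


if __name__ == "__main__":
    print(to_prometheus_name("provider_runtime.startup.success", "counter"))
    print(to_prometheus_name("core_provider.webhook.request.total", "counter"))
    print(histogram_series("provider_runtime.reconcile.duration_seconds"))
```

Note that a counter whose name already ends in total, such as core_provider.webhook.request.total, does not receive a second _total suffix in this sketch, which matches the query names used in the metrics table.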

Metrics Table

| Metric | Type | Unit | Description | PromQL example |
| --- | --- | --- | --- | --- |
| provider_runtime.startup.success | Counter | count | Provider started successfully. | sum(increase(provider_runtime_startup_success_total[1h])) |
| provider_runtime.startup.failure | Counter | count | Provider startup failed. | sum(increase(provider_runtime_startup_failure_total[1h])) |
| provider_runtime.reconcile.duration_seconds | Histogram | seconds | Total reconcile duration. | histogram_quantile(0.95, sum by (le) (rate(provider_runtime_reconcile_duration_seconds_bucket[5m]))) |
| provider_runtime.reconcile.queue.depth | UpDownCounter | count | Current queued requests for the controller. | max(provider_runtime_reconcile_queue_depth) |
| provider_runtime.reconcile.queue.wait.duration_seconds | Histogram | seconds | Time spent waiting in queue before processing. | histogram_quantile(0.95, sum by (le) (rate(provider_runtime_reconcile_queue_wait_duration_seconds_bucket[5m]))) |
| provider_runtime.reconcile.queue.oldest_item_age_seconds | Histogram | seconds | Age of the oldest queued item observed at enqueue/dequeue time. | histogram_quantile(0.95, sum by (le) (rate(provider_runtime_reconcile_queue_oldest_item_age_seconds_bucket[5m]))) |
| provider_runtime.reconcile.queue.work.duration_seconds | Histogram | seconds | Time spent processing a dequeued item before Done(). | histogram_quantile(0.95, sum by (le) (rate(provider_runtime_reconcile_queue_work_duration_seconds_bucket[5m]))) |
| provider_runtime.reconcile.queue.requeues | Counter | count | Total queue requeues grouped by reason. | sum(increase(provider_runtime_reconcile_queue_requeues_total[1h])) |
| core_provider.webhook.request.duration_seconds | Histogram | seconds | Duration of mutating and conversion webhook requests. | sum(rate(core_provider_webhook_request_duration_seconds_sum{webhook="mutating"}[5m])) / sum(rate(core_provider_webhook_request_duration_seconds_count{webhook="mutating"}[5m])) |
| core_provider.webhook.request.total | Counter | count | Total webhook requests grouped by webhook, operation, and outcome. | sum(increase(core_provider_webhook_request_total{webhook="conversion"}[1h])) |
| provider_runtime.external.connect.duration_seconds | Histogram | seconds | Time spent reading external references. | sum(rate(provider_runtime_external_connect_duration_seconds_sum[5m])) / sum(rate(provider_runtime_external_connect_duration_seconds_count[5m])) |
| provider_runtime.external.observe.duration_seconds | Histogram | seconds | Time spent observing external resources. | sum(rate(provider_runtime_external_observe_duration_seconds_sum[5m])) / sum(rate(provider_runtime_external_observe_duration_seconds_count[5m])) |
| provider_runtime.finalizer.add.duration_seconds | Histogram | seconds | Time spent adding finalizers. | histogram_quantile(0.95, sum by (le) (rate(provider_runtime_finalizer_add_duration_seconds_bucket[5m]))) |
| provider_runtime.reconcile.requeue.after | Counter | count | Reconcile returned RequeueAfter. | sum(increase(provider_runtime_reconcile_requeue_after_total[1h])) |
| provider_runtime.reconcile.requeue.immediate | Counter | count | Reconcile returned immediate Requeue. | sum(increase(provider_runtime_reconcile_requeue_immediate_total[1h])) |
| provider_runtime.reconcile.requeue.error | Counter | count | Reconcile returned an error and will be requeued. | sum(increase(provider_runtime_reconcile_requeue_error_total[1h])) |
| provider_runtime.reconcile.in_flight | Gauge | count | Number of reconcile operations currently running. | max(provider_runtime_reconcile_in_flight) |
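The two histogram query styles used in the table (a quantile from _bucket, and an average from _sum divided by _count) can be illustrated with a small calculation. The sketch below approximates the linear interpolation that histogram_quantile performs over cumulative buckets. The bucket boundaries and counts are made-up sample values, not real provider output, and the function operates on plain counts rather than per-second rates.

```python
import math


def histogram_quantile(q: float, buckets: list[tuple[float, float]]) -> float:
    """Estimate the q-quantile from cumulative histogram buckets.

    `buckets` is a list of (upper_bound, cumulative_count) pairs and must
    include the +Inf bucket, mirroring the `le` label on _bucket series.
    """
    if not 0 < q <= 1:
        raise ValueError("q must be in (0, 1]")
    buckets = sorted(buckets)
    total = buckets[-1][1]  # the +Inf bucket counts every observation
    if total == 0:
        return math.nan
    rank = q * total
    prev_bound, prev_count = 0.0, 0.0
    for bound, count in buckets:
        if count >= rank:
            if math.isinf(bound):
                # Quantile falls in the +Inf bucket: report the highest
                # finite upper bound instead of interpolating.
                return buckets[-2][0]
            # Linear interpolation inside the matching bucket.
            fraction = (rank - prev_count) / (count - prev_count)
            return prev_bound + (bound - prev_bound) * fraction
        prev_bound, prev_count = bound, count
    return math.nan


def average_duration(total_sum: float, total_count: float) -> float:
    """Average duration from the _sum and _count series."""
    return total_sum / total_count if total_count else math.nan


if __name__ == "__main__":
    # Hypothetical sample: 100 reconciles, bucket bounds in seconds.
    sample = [(0.1, 50), (0.5, 90), (1.0, 98), (math.inf, 100)]
    print(histogram_quantile(0.95, sample))
    print(average_duration(32.0, 100))
```

The quantile is only as precise as the bucket layout: the result is interpolated within one bucket, so it is an estimate. The average from _sum and _count is exact for the observed values but hides tail latency, which is why the table uses both styles.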

Additional Notes

  • The manager metrics endpoint on :8080 continues to expose the default controller-runtime metrics.
  • The custom provider-runtime metrics are exported via OTLP when --otel-enabled is set.
  • The webhook metrics are emitted by core-provider and flow through the same OTLP pipeline.
  • Webhook metrics are request-driven, so the Grafana panels remain empty until the admission webhooks receive actual mutating or conversion traffic.
  • The dashboard splits webhook panels by webhook="mutating" and webhook="conversion" so each admission path is easier to inspect.
  • If OTEL_ENABLED is false or the OTLP endpoint is unreachable, webhook metrics will not reach Prometheus/Grafana.
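  • Avoid high-cardinality labels (for example, per-resource names or UIDs) on queue metrics, because every distinct combination of label values creates a separate time series.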
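To make the cardinality note concrete, the following sketch estimates how many time series a single metric produces for a given set of labels. The function name estimated_series and the label names and counts in the example are hypothetical, chosen only for illustration. The sketch assumes classic Prometheus histograms, where each label combination yields one series per bucket plus _sum and _count.

```python
from math import prod


def estimated_series(label_values: dict[str, int], bucket_count: int = 0) -> int:
    """Estimate the number of time series produced by one metric.

    label_values maps a label name to its number of distinct values.
    bucket_count is the number of histogram buckets including +Inf;
    use 0 for counters and gauges.
    """
    combinations = prod(label_values.values())  # empty dict gives 1
    # A histogram emits one series per bucket plus _sum and _count.
    per_combination = bucket_count + 2 if bucket_count else 1
    return combinations * per_combination


if __name__ == "__main__":
    # Hypothetical labels: 5 controllers and 3 requeue reasons.
    low = {"controller": 5, "reason": 3}
    print(estimated_series(low))
    print(estimated_series(low, bucket_count=12))

    # Adding a per-resource label with 10,000 values multiplies everything.
    high = {**low, "resource_name": 10_000}
    print(estimated_series(high, bucket_count=12))
```

Because the counts multiply, a single unbounded label on a histogram can turn a few hundred series into millions.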