
Broker ingresses and filters silently stop exposing metrics #4645

@antoineco

Describe the bug

At some point, broker ingress and filter pods start returning the following error messages at :9092/metrics instead of actual metrics. Since the Prometheus registry fails the whole scrape as soon as it encounters a duplicate time series, a single duplicated series is enough to make a pod stop exposing metrics entirely.

Ingress

An error has occurred while serving metrics:
24 error(s) occurred:
* collected metric "mt_broker_ingress_event_count" { label:<name:"broker_name" value:"default" > label:<name:"container_name" value:"ingress" > label:<name:"event_type" value:"dev.knative.sources.ping" > label:<name:"namespace_name" value:"user1" > label:<name:"response_code" value:"202" > label:<name:"response_code_class" value:"2xx" > label:<name:"unique_name" value:"mt-broker-ingress-67ff9c869b-n88c8e5a848d2c97c277c43d81512e6cdf" > counter:<value:1 > } was collected before with the same name and label values
* collected metric "mt_broker_ingress_event_dispatch_latencies" { label:<name:"broker_name" value:"default" > label:<name:"container_name" value:"ingress" > label:<name:"event_type" value:"dev.knative.sources.ping" > label:<name:"namespace_name" value:"user1" > label:<name:"response_code" value:"202" > label:<name:"response_code_class" value:"2xx" > label:<name:"unique_name" value:"mt-broker-ingress-67ff9c869b-n88c8e5a848d2c97c277c43d81512e6cdf" > histogram:<sample_count:1 sample_sum:28 bucket:<cumulative_count:0 upper_bound:1 > bucket:<cumulative_count:0 upper_bound:2 > bucket:<cumulative_count:0 upper_bound:5 > bucket:<cumulative_count:0 upper_bound:10 > bucket:<cumulative_count:0 upper_bound:20 > bucket:<cumulative_count:1 upper_bound:50 > bucket:<cumulative_count:1 upper_bound:100 > bucket:<cumulative_count:1 upper_bound:200 > bucket:<cumulative_count:1 upper_bound:500 > bucket:<cumulative_count:1 upper_bound:1000 > bucket:<cumulative_count:1 upper_bound:2000 > bucket:<cumulative_count:1 upper_bound:5000 > bucket:<cumulative_count:1 upper_bound:10000 > > } was collected before with the same name and label values
* collected metric "mt_broker_ingress_event_count" { label:<name:"broker_name" value:"events" > label:<name:"container_name" value:"ingress" > label:<name:"event_type" value:"dev.knative.sources.ping" > label:<name:"namespace_name" value:"user2" > label:<name:"response_code" value:"202" > label:<name:"response_code_class" value:"2xx" > label:<name:"unique_name" value:"mt-broker-ingress-67ff9c869b-n88c8e5a848d2c97c277c43d81512e6cdf" > counter:<value:3 > } was collected before with the same name and label values
* collected metric "mt_broker_ingress_event_dispatch_latencies" { label:<name:"broker_name" value:"events" > label:<name:"container_name" value:"ingress" > label:<name:"event_type" value:"dev.knative.sources.ping" > label:<name:"namespace_name" value:"user2" > label:<name:"response_code" value:"202" > label:<name:"response_code_class" value:"2xx" > label:<name:"unique_name" value:"mt-broker-ingress-67ff9c869b-n88c8e5a848d2c97c277c43d81512e6cdf" > histogram:<sample_count:3 sample_sum:16231.000000000002 bucket:<cumulative_count:0 upper_bound:1 > bucket:<cumulative_count:0 upper_bound:2 > bucket:<cumulative_count:0 upper_bound:5 > bucket:<cumulative_count:0 upper_bound:10 > bucket:<cumulative_count:0 upper_bound:20 > bucket:<cumulative_count:1 upper_bound:50 > bucket:<cumulative_count:1 upper_bound:100 > bucket:<cumulative_count:1 upper_bound:200 > bucket:<cumulative_count:1 upper_bound:500 > bucket:<cumulative_count:1 upper_bound:1000 > bucket:<cumulative_count:1 upper_bound:2000 > bucket:<cumulative_count:2 upper_bound:5000 > bucket:<cumulative_count:2 upper_bound:10000 > > } was collected before with the same name and label values

...

(24 errors in total, each with a different namespace_name)

I noticed the failing series always carry the following tag, although I'm currently sending a lot more event types than just this one:

"event_type" value:"dev.knative.sources.ping"

Filter

An error has occurred while serving metrics:

104 error(s) occurred:
* collected metric "mt_broker_filter_event_count" { label:<name:"broker_name" value:"events" > label:<name:"container_name" value:"filter" > label:<name:"filter_type" value:"any" > label:<name:"namespace_name" value:"user1" > label:<name:"response_code" value:"200" > label:<name:"response_code_class" value:"2xx" > label:<name:"trigger_name" value:"sockeye" > label:<name:"unique_name" value:"mt-broker-filter-56964c97dd-5d5ce54f1b5276f60c0a4ee8462d7f216b7" > counter:<value:17 > } was collected before with the same name and label values
* collected metric "mt_broker_filter_event_dispatch_latencies" { label:<name:"broker_name" value:"events" > label:<name:"container_name" value:"filter" > label:<name:"filter_type" value:"any" > label:<name:"namespace_name" value:"user1" > label:<name:"response_code" value:"200" > label:<name:"response_code_class" value:"2xx" > label:<name:"trigger_name" value:"sockeye" > label:<name:"unique_name" value:"mt-broker-filter-56964c97dd-5d5ce54f1b5276f60c0a4ee8462d7f216b7" > histogram:<sample_count:17 sample_sum:134 bucket:<cumulative_count:0 upper_bound:1 > bucket:<cumulative_count:0 upper_bound:2 > bucket:<cumulative_count:9 upper_bound:5 > bucket:<cumulative_count:13 upper_bound:10 > bucket:<cumulative_count:15 upper_bound:20 > bucket:<cumulative_count:17 upper_bound:50 > bucket:<cumulative_count:17 upper_bound:100 > bucket:<cumulative_count:17 upper_bound:200 > bucket:<cumulative_count:17 upper_bound:500 > bucket:<cumulative_count:17 upper_bound:1000 > bucket:<cumulative_count:17 upper_bound:2000 > bucket:<cumulative_count:17 upper_bound:5000 > bucket:<cumulative_count:17 upper_bound:10000 > > } was collected before with the same name and label values
* collected metric "mt_broker_filter_event_processing_latencies" { label:<name:"broker_name" value:"events" > label:<name:"container_name" value:"filter" > label:<name:"filter_type" value:"any" > label:<name:"namespace_name" value:"user1" > label:<name:"trigger_name" value:"sockeye" > label:<name:"unique_name" value:"mt-broker-filter-56964c97dd-5d5ce54f1b5276f60c0a4ee8462d7f216b7" > histogram:<sample_count:17 sample_sum:75 bucket:<cumulative_count:10 upper_bound:1 > bucket:<cumulative_count:10 upper_bound:2 > bucket:<cumulative_count:10 upper_bound:5 > bucket:<cumulative_count:12 upper_bound:10 > bucket:<cumulative_count:16 upper_bound:20 > bucket:<cumulative_count:17 upper_bound:50 > bucket:<cumulative_count:17 upper_bound:100 > bucket:<cumulative_count:17 upper_bound:200 > bucket:<cumulative_count:17 upper_bound:500 > bucket:<cumulative_count:17 upper_bound:1000 > bucket:<cumulative_count:17 upper_bound:2000 > bucket:<cumulative_count:17 upper_bound:5000 > bucket:<cumulative_count:17 upper_bound:10000 > > } was collected before with the same name and label values
* collected metric "mt_broker_filter_event_processing_latencies" { label:<name:"broker_name" value:"default" > label:<name:"container_name" value:"filter" > label:<name:"filter_type" value:"any" > label:<name:"namespace_name" value:"user2" > label:<name:"trigger_name" value:"go-lambda-trigger" > label:<name:"unique_name" value:"mt-broker-filter-56964c97dd-5d5ce54f1b5276f60c0a4ee8462d7f216b7" > histogram:<sample_count:2 sample_sum:13 bucket:<cumulative_count:1 upper_bound:1 > bucket:<cumulative_count:1 upper_bound:2 > bucket:<cumulative_count:1 upper_bound:5 > bucket:<cumulative_count:1 upper_bound:10 > bucket:<cumulative_count:2 upper_bound:20 > bucket:<cumulative_count:2 upper_bound:50 > bucket:<cumulative_count:2 upper_bound:100 > bucket:<cumulative_count:2 upper_bound:200 > bucket:<cumulative_count:2 upper_bound:500 > bucket:<cumulative_count:2 upper_bound:1000 > bucket:<cumulative_count:2 upper_bound:2000 > bucket:<cumulative_count:2 upper_bound:5000 > bucket:<cumulative_count:2 upper_bound:10000 > > } was collected before with the same name and label values
* collected metric "mt_broker_filter_event_count" { label:<name:"broker_name" value:"events" > label:<name:"container_name" value:"filter" > label:<name:"filter_type" value:"any" > label:<name:"namespace_name" value:"user3" > label:<name:"response_code" value:"502" > label:<name:"response_code_class" value:"5xx" > label:<name:"trigger_name" value:"tekton" > label:<name:"unique_name" value:"mt-broker-filter-56964c97dd-5d5ce54f1b5276f60c0a4ee8462d7f216b7" > counter:<value:4 > } was collected before with the same name and label values
* collected metric "mt_broker_filter_event_dispatch_latencies" { label:<name:"broker_name" value:"events" > label:<name:"container_name" value:"filter" > label:<name:"filter_type" value:"any" > label:<name:"namespace_name" value:"user3" > label:<name:"response_code" value:"502" > label:<name:"response_code_class" value:"5xx" > label:<name:"trigger_name" value:"tekton" > label:<name:"unique_name" value:"mt-broker-filter-56964c97dd-5d5ce54f1b5276f60c0a4ee8462d7f216b7" > histogram:<sample_count:4 sample_sum:130 bucket:<cumulative_count:0 upper_bound:1 > bucket:<cumulative_count:0 upper_bound:2 > bucket:<cumulative_count:0 upper_bound:5 > bucket:<cumulative_count:0 upper_bound:10 > bucket:<cumulative_count:0 upper_bound:20 > bucket:<cumulative_count:4 upper_bound:50 > bucket:<cumulative_count:4 upper_bound:100 > bucket:<cumulative_count:4 upper_bound:200 > bucket:<cumulative_count:4 upper_bound:500 > bucket:<cumulative_count:4 upper_bound:1000 > bucket:<cumulative_count:4 upper_bound:2000 > bucket:<cumulative_count:4 upper_bound:5000 > bucket:<cumulative_count:4 upper_bound:10000 > > } was collected before with the same name and label values
* collected metric "mt_broker_filter_event_processing_latencies" { label:<name:"broker_name" value:"events" > label:<name:"container_name" value:"filter" > label:<name:"filter_type" value:"any" > label:<name:"namespace_name" value:"user3" > label:<name:"trigger_name" value:"tekton" > label:<name:"unique_name" value:"mt-broker-filter-56964c97dd-5d5ce54f1b5276f60c0a4ee8462d7f216b7" > histogram:<sample_count:4 sample_sum:47.99999999999999 bucket:<cumulative_count:2 upper_bound:1 > bucket:<cumulative_count:2 upper_bound:2 > bucket:<cumulative_count:3 upper_bound:5 > bucket:<cumulative_count:3 upper_bound:10 > bucket:<cumulative_count:3 upper_bound:20 > bucket:<cumulative_count:4 upper_bound:50 > bucket:<cumulative_count:4 upper_bound:100 > bucket:<cumulative_count:4 upper_bound:200 > bucket:<cumulative_count:4 upper_bound:500 > bucket:<cumulative_count:4 upper_bound:1000 > bucket:<cumulative_count:4 upper_bound:2000 > bucket:<cumulative_count:4 upper_bound:5000 > bucket:<cumulative_count:4 upper_bound:10000 > > } was collected before with the same name and label values
* collected metric "mt_broker_filter_event_count" { label:<name:"broker_name" value:"events" > label:<name:"container_name" value:"filter" > label:<name:"filter_type" value:"any" > label:<name:"namespace_name" value:"user3" > label:<name:"response_code" value:"502" > label:<name:"response_code_class" value:"5xx" > label:<name:"trigger_name" value:"tekton" > label:<name:"unique_name" value:"mt-broker-filter-56964c97dd-5d5ce54f1b5276f60c0a4ee8462d7f216b7" > counter:<value:1 > } was collected before with the same name and label values

...

This becomes obvious on a Grafana dashboard that uses those metrics. Below, I'm supposed to see 10 replicas and a rate of exactly 10k events/sec, but only 4, then 3, then 2, and eventually 0 replicas report metrics:

[screenshot: Grafana dashboard showing reporting replicas and event rate dropping off]

Expected behavior

No failure.

To Reproduce

Create a few PingSources, I suppose (we currently have 19 in this cluster, spread across multiple namespaces), then generate some load on the ingress, e.g. using vegeta; see the sketch below.
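
For the load-generation step, a sketch using vegeta's Go library (the ingress URL, event attributes, and rate are hypothetical placeholders; the vegeta CLI works just as well):

```go
package main

import (
	"log"
	"net/http"
	"time"

	vegeta "github.com/tsenart/vegeta/v12/lib"
)

func main() {
	// Hypothetical in-cluster ingress URL: <ingress host>/<namespace>/<broker>.
	targeter := vegeta.NewStaticTargeter(vegeta.Target{
		Method: "POST",
		URL:    "http://broker-ingress.knative-eventing.svc.cluster.local/user1/default",
		Header: http.Header{
			"Ce-Specversion": {"1.0"},
			"Ce-Type":        {"demo.load.test"}, // any type; the failing series above were dev.knative.sources.ping
			"Ce-Source":      {"vegeta"},
			"Ce-Id":          {"load-1"},
			"Content-Type":   {"application/json"},
		},
		Body: []byte(`{"hello":"world"}`),
	})

	rate := vegeta.Rate{Freq: 1000, Per: time.Second} // 1k events/sec
	attacker := vegeta.NewAttacker()

	var m vegeta.Metrics
	for res := range attacker.Attack(targeter, rate, 60*time.Second, "broker-load") {
		m.Add(res)
	}
	m.Close()
	log.Printf("success ratio: %.2f%%", m.Success*100)
}
```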

Knative release version

Eventing v0.19.2

Additional context

Labels

area/observability
area/performance
kind/bug: Categorizes issue or PR as related to a bug.
priority/important-soon: Must be staffed and worked on either currently, or very soon, ideally in time for the next release.
