[CC-4374] Scaffold HCP Telemetry #16797
Functions of the reporter
- The reporter continuously gathers Consul metrics from the go-metrics in-memory sink, over a configurable time interval
- Over a configurable time interval, it flushes these gathered metrics to the Exporter (in charge of exporting metrics to HCP)
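To make the two functions above concrete, here is a minimal sketch of a gather/flush loop. The `Sample`, `Gatherer`, and `Exporter` types and the `runReporter` name are hypothetical stand-ins, not the types in this PR, and the loop mirrors the current scaffold behavior of dropping the batch whether or not the export succeeds.

```go
package main

import (
	"context"
	"fmt"
	"time"
)

// Sample is a hypothetical stand-in for one gathered set of metrics.
type Sample struct{ Value int }

// Gatherer and Exporter are hypothetical interfaces for this sketch only.
type Gatherer interface{ Gather() Sample }

type Exporter interface {
	Export(ctx context.Context, batch []Sample) error
}

// runReporter gathers on one ticker and flushes the accumulated batch on another.
// It returns when ctx is cancelled.
func runReporter(ctx context.Context, g Gatherer, e Exporter, gatherEvery, flushEvery time.Duration) {
	gatherT := time.NewTicker(gatherEvery)
	defer gatherT.Stop()
	flushT := time.NewTicker(flushEvery)
	defer flushT.Stop()

	var batch []Sample
	for {
		select {
		case <-ctx.Done():
			return
		case <-gatherT.C:
			batch = append(batch, g.Gather())
		case <-flushT.C:
			if len(batch) == 0 {
				continue
			}
			// As in the scaffold, the batch is dropped even if the export fails.
			_ = e.Export(ctx, batch)
			batch = nil
		}
	}
}

// constGatherer and countingExporter are fakes used to exercise the loop.
type constGatherer struct{}

func (constGatherer) Gather() Sample { return Sample{Value: 1} }

type countingExporter struct{ exported int }

func (c *countingExporter) Export(_ context.Context, batch []Sample) error {
	c.exported += len(batch)
	return nil
}

// demoReporter runs the loop for a short, fixed time and returns how many samples were exported.
func demoReporter() int {
	ctx, cancel := context.WithTimeout(context.Background(), 200*time.Millisecond)
	defer cancel()
	exp := &countingExporter{}
	runReporter(ctx, constGatherer{}, exp, time.Millisecond, 10*time.Millisecond)
	return exp.exported
}

func main() {
	fmt.Println("samples exported:", demoReporter())
}
```

The exact count varies from run to run because it depends on ticker timing.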
Concerns
There are currently a few concerns with the batching strategy in the reporter:
- Flush failures, retries, and duplicates: with the current interval time comparisons, we do not handle the case where the exporter fails to export; we drop the metrics regardless. We want to offload some of the heavy lifting onto the Consul server. For example, if the Telemetry Gateway is down, we want the export to retry, or the batch to remain in memory and be flushed again, while avoiding duplicates.
- Memory concerns: with the ideal solution described above, we want to drop metrics that are too old and put a maximum size on the data structure holding the gathered metrics. We can't leave it unbounded and let the data structure grow forever. The limit needs to be configurable, and we need to run benchmarks to make sure we pick a sane value.
- The current BatchMetrics data structure might not be ideal for deciding which intervals are outdated. Right now we compare against time and update the last interval based on our export batch. A better solution will likely need to reconcile this information more carefully.
Overall, a sound batching strategy needs to be developed.
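As a starting point for discussion only, one possible shape for this is a bounded queue that evicts the oldest batch when full and removes a batch only after a successful export. This is a sketch under assumptions, not the strategy that will come out of CC-4636: `Batch`, `boundedQueue`, and `errGatewayDown` are hypothetical names, and it does not cover the case where the gateway accepted a batch but the client still saw an error.

```go
package main

import (
	"errors"
	"fmt"
)

// Batch is a hypothetical stand-in for one interval of gathered metrics.
type Batch struct{ Interval int }

// errGatewayDown simulates a failed export in this sketch.
var errGatewayDown = errors.New("telemetry gateway unavailable")

// boundedQueue keeps at most max batches and evicts the oldest when full.
type boundedQueue struct {
	max     int
	batches []Batch
	dropped int
}

// newBoundedQueue enforces a capacity of at least one batch.
func newBoundedQueue(max int) *boundedQueue {
	if max < 1 {
		max = 1
	}
	return &boundedQueue{max: max}
}

// push appends a batch, dropping the oldest one if the queue is full.
func (q *boundedQueue) push(b Batch) {
	if len(q.batches) == q.max {
		q.batches = q.batches[1:]
		q.dropped++
	}
	q.batches = append(q.batches, b)
}

// flush exports batches oldest-first. A batch is removed only after its export
// succeeds, so a failed batch stays queued for the next flush and a batch that
// was exported successfully is not sent again.
func (q *boundedQueue) flush(export func(Batch) error) error {
	for len(q.batches) > 0 {
		if err := export(q.batches[0]); err != nil {
			return err
		}
		q.batches = q.batches[1:]
	}
	return nil
}

func main() {
	q := newBoundedQueue(3)
	for i := 1; i <= 5; i++ {
		q.push(Batch{Interval: i})
	}
	fmt.Println("dropped:", q.dropped, "queued:", len(q.batches))

	// First flush fails: everything stays queued.
	err := q.flush(func(Batch) error { return errGatewayDown })
	fmt.Println("flush error:", err, "queued:", len(q.batches))

	// Second flush succeeds: the queue drains in order.
	err = q.flush(func(b Batch) error { fmt.Println("exported interval", b.Interval); return nil })
	fmt.Println("flush error:", err, "queued:", len(q.batches))
}
```

The capacity here is a count of batches; a real limit would more likely be tuned from the memory benchmarks mentioned above.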
This will be addressed via a concretized strategy developed in CC-4636, where I will evaluate patterns, document them, and perform memory benchmarks before making a decision.
For now, please consider this file a scaffold for a first iteration of this merging into the feature branch.
This also makes the PRs easier to review, as the follow-up can focus on the batching.
The interfacing methods will likely remain the same, and so will most of the reporter initialization.
However, the data structures and algorithms for the actual batching will change, and potentially so will the main for/select loop, depending on the algorithm.
If anyone has lots of experience with this kind of thing, would love to talk more!
@nickethier After playing around, I went with the hcp.Client being the client exporter. It holds the OTLP client, and it can configure it with HCP auth from cloud config. It exposes two methods:
- InitMetricsClient(ctx context.Context, endpoint string) error: given an endpoint, initializes an OTLP client internally with HCP auth setup. We can't init the OTLP client earlier, since we get CCM endpoint information later, not during bootstrap when the HCP client is configured.
- ExportMetrics(context.Context, metricdata.ResourceMetrics) error: given metrics, makes the HCP authenticated HTTP request using the OTLP client to export metrics.
The telemetry.Exporter uses this wrapped client to export metrics.
When the reporter starts, we try to:
- Fetch the CCM config
- If registered with CCM, call the InitMetricsClient function to init the OTLP client with the configured endpoint
- Create a Metrics exporter by injecting this HCP/OTLP Client
- The Metrics exporter can now call ExportMetrics(context.Context, metricdata.ResourceMetrics) error to export metrics
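Roughly, the startup flow above could be sketched as follows. This is an illustration rather than the code in the PR: `ResourceMetrics` is a placeholder for `metricdata.ResourceMetrics` so the snippet stays stdlib-only, and `fakeClient`, `startReporter`, and the `fetchEndpoint` callback are hypothetical names.

```go
package main

import (
	"context"
	"errors"
	"fmt"
)

// ResourceMetrics is a placeholder for metricdata.ResourceMetrics from the OTel SDK.
type ResourceMetrics struct{ Names []string }

// Client mirrors the two methods described above.
type Client interface {
	InitMetricsClient(ctx context.Context, endpoint string) error
	ExportMetrics(ctx context.Context, rm ResourceMetrics) error
}

// fakeClient is a hypothetical in-memory Client used only for this sketch.
type fakeClient struct {
	endpoint string
	exported int
}

func (c *fakeClient) InitMetricsClient(_ context.Context, endpoint string) error {
	if endpoint == "" {
		return errors.New("empty endpoint")
	}
	c.endpoint = endpoint
	return nil
}

func (c *fakeClient) ExportMetrics(_ context.Context, rm ResourceMetrics) error {
	if c.endpoint == "" {
		return errors.New("metrics client not initialized")
	}
	c.exported += len(rm.Names)
	return nil
}

// Exporter holds the injected client and delegates the export to it.
type Exporter struct{ client Client }

func (e *Exporter) Export(ctx context.Context, rm ResourceMetrics) error {
	return e.client.ExportMetrics(ctx, rm)
}

// startReporter fetches the endpoint, initializes the client, then builds the exporter.
func startReporter(ctx context.Context, c Client, fetchEndpoint func(context.Context) (string, error)) (*Exporter, error) {
	endpoint, err := fetchEndpoint(ctx)
	if err != nil {
		return nil, fmt.Errorf("fetch CCM config: %w", err)
	}
	if err := c.InitMetricsClient(ctx, endpoint); err != nil {
		return nil, fmt.Errorf("init metrics client: %w", err)
	}
	return &Exporter{client: c}, nil
}

// demoExport wires the pieces together and returns how many metrics were exported.
func demoExport() (int, error) {
	ctx := context.Background()
	client := &fakeClient{}
	fetch := func(context.Context) (string, error) { return "localhost:9090", nil }

	exp, err := startReporter(ctx, client, fetch)
	if err != nil {
		return 0, err
	}
	err = exp.Export(ctx, ResourceMetrics{Names: []string{"a", "b"}})
	return client.exported, err
}

func main() {
	n, err := demoExport()
	fmt.Println(n, err)
}
```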
See usage here:
https://github.com/hashicorp/consul/pull/16797/files#diff-12b6ad75045bab0d33ca50e727bfe4de502bb2322e7c4599a0c67f6a43e18a6fR84
Lmk what you think!
I think from a code readability standpoint, might still be best to inject this as a dependency into the Exporter 😓 But yeah that means exposing the cloud config via a method.. Was really divided between the two.
What if we define two interfaces in this package? If we had a client.MetricsClient interface with ExportMetrics, then we could create it after calling Client.FetchTelemetryConfig and avoid needing the InitMetricsClient func.
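A rough sketch of how that two-interface split might look, to make the suggestion concrete. Everything here is assumed for illustration: the `TelemetryConfig` shape, the `FetchTelemetryConfig` signature, the `[]byte` payload, and the `otlpClient`, `newMetricsClient`, and `staticClient` names are all hypothetical.

```go
package main

import (
	"context"
	"errors"
	"fmt"
)

// TelemetryConfig is a hypothetical shape for the fetched telemetry config.
type TelemetryConfig struct{ Endpoint string }

// Client is the HCP client reduced to the one call used here (signature assumed).
type Client interface {
	FetchTelemetryConfig(ctx context.Context) (*TelemetryConfig, error)
}

// MetricsClient is the second, narrower interface: it only exports.
type MetricsClient interface {
	ExportMetrics(ctx context.Context, payload []byte) error
}

// otlpClient is a stand-in for a type wrapping the real OTLP HTTP client.
type otlpClient struct {
	endpoint string
	sent     int
}

func (c *otlpClient) ExportMetrics(_ context.Context, payload []byte) error {
	c.sent += len(payload)
	return nil
}

// newMetricsClient builds the client once the endpoint is known,
// so no separate init step is needed on the HCP client.
func newMetricsClient(cfg *TelemetryConfig) (*otlpClient, error) {
	if cfg == nil || cfg.Endpoint == "" {
		return nil, errors.New("no telemetry endpoint in config")
	}
	return &otlpClient{endpoint: cfg.Endpoint}, nil
}

// staticClient is a fake Client that returns a fixed config.
type staticClient struct{ cfg TelemetryConfig }

func (s staticClient) FetchTelemetryConfig(context.Context) (*TelemetryConfig, error) {
	cfg := s.cfg
	return &cfg, nil
}

// demoTwoInterfaces fetches the config, then creates and uses the metrics client.
func demoTwoInterfaces() (string, error) {
	ctx := context.Background()
	var hcp Client = staticClient{cfg: TelemetryConfig{Endpoint: "localhost:9090"}}

	cfg, err := hcp.FetchTelemetryConfig(ctx)
	if err != nil {
		return "", err
	}
	oc, err := newMetricsClient(cfg)
	if err != nil {
		return "", err
	}
	var mc MetricsClient = oc
	if err := mc.ExportMetrics(ctx, []byte("payload")); err != nil {
		return "", err
	}
	return oc.endpoint, nil
}

func main() {
	endpoint, err := demoTwoInterfaces()
	fmt.Println(endpoint, err)
}
```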
@nickethier Where the Client gets initialized and injected into the exporter. We could even do this within the Exporter init, by passing the HCP client. I liked that it was more obvious what is happening this way.
Thought I should handle the case where we don't have an error per se, but the server isn't registered with CCM 🤔 In that case the response might not be nil, just empty (no endpoint), but since the CCM protos are not yet available I did it this way for now. Might change later.
Also, the InitMetricsClient is expecting an endpoint of the form <host>:<port>. No scheme. Depending on if CCM returns the endpoint with scheme, we will have to do parsing/validation here for the endpoint.
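If CCM does end up returning a scheme, the parsing/validation could look something like the sketch below. `normalizeEndpoint` is a hypothetical helper, not something in this PR; it uses only `net/url` and `net` from the standard library, and the `example.com` endpoint is made up.

```go
package main

import (
	"fmt"
	"net"
	"net/url"
	"strings"
)

// normalizeEndpoint returns the endpoint in <host>:<port> form.
// If a scheme is present it is stripped, along with any path.
func normalizeEndpoint(endpoint string) (string, error) {
	hostPort := endpoint
	if strings.Contains(endpoint, "://") {
		u, err := url.Parse(endpoint)
		if err != nil {
			return "", fmt.Errorf("parse endpoint %q: %w", endpoint, err)
		}
		hostPort = u.Host
	}

	host, port, err := net.SplitHostPort(hostPort)
	if err != nil {
		return "", fmt.Errorf("endpoint %q is not <host>:<port>: %w", endpoint, err)
	}
	if host == "" || port == "" {
		return "", fmt.Errorf("endpoint %q is missing a host or port", endpoint)
	}
	return net.JoinHostPort(host, port), nil
}

func main() {
	for _, in := range []string{
		"localhost:9090",
		"https://example.com:4318/v1/metrics",
		"localhost",
	} {
		out, err := normalizeEndpoint(in)
		fmt.Printf("%q -> %q, err=%v\n", in, out, err)
	}
}
```

The first two inputs normalize to `localhost:9090` and `example.com:4318`; the third is rejected because it has no port.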
Returns an error so we can unit test easily!
Could also make the m.reporter.Run(ctx) return an error to make this return m.reporter.Run(ctx), but the reporter can be run outside of this package, and so it felt forced to do that.
One thing that worries me is that we are not handling goroutine exits: if the ctx is cancelled, both the manager Run and this inner goroutine will return, and the goroutine may still be finishing after the parent that started it has exited.
Some thoughts:
- Maybe this is fine as is?
- If not:
  - Should I use a separate context for the reporter goroutine, and cancel it when the manager Run function cancels?
  - Should I convert the contents of the manager Run function into a goroutine run and wait on both runReporter and run?
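For reference, a minimal sketch that combines the two options: a derived context for the reporter goroutine plus a `sync.WaitGroup`, so the manager does not return until the reporter has exited. `runManager` and the function-typed parameters are hypothetical simplifications of the manager's Run and the reporter's Run.

```go
package main

import (
	"context"
	"fmt"
	"sync"
)

// runManager starts the reporter in its own goroutine with a context derived
// from ctx, and does not return until that goroutine has exited.
func runManager(ctx context.Context, runReporter, run func(context.Context)) {
	reporterCtx, cancel := context.WithCancel(ctx)

	var wg sync.WaitGroup
	wg.Add(1)
	go func() {
		defer wg.Done()
		runReporter(reporterCtx)
	}()

	run(ctx)  // the manager's own loop
	cancel()  // stop the reporter even if run returned for another reason
	wg.Wait() // never return while the reporter goroutine is still running
}

// demoShutdown reports whether the reporter goroutine had exited by the time
// runManager returned.
func demoShutdown() bool {
	exited := false
	reporter := func(ctx context.Context) {
		<-ctx.Done()
		exited = true
	}
	manager := func(context.Context) {} // simulates the manager loop returning
	runManager(context.Background(), reporter, manager)
	return exited
}

func main() {
	fmt.Println("reporter exited before manager returned:", demoShutdown())
}
```

Because `wg.Wait()` happens after the goroutine's `wg.Done()`, reading `exited` afterwards is race-free.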
Functions of the exporter
- Converts batched metrics from the Reporter into OTLP
- Filters metrics based on configured metric filters
- Adds attributes based on configured labels
- Exports these to HCP Telemetry Gateway. It holds an HCP Client capable of making this HTTP request (HCP authenticated)
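The filter and label steps above could be sketched like this. It assumes filters are regular expressions matched against the metric name and that an empty filter list keeps everything; neither is confirmed by this PR. `Metric` is a hypothetical simplified type rather than the OTLP data model, and the metric names and `server-1` label value are example inputs.

```go
package main

import (
	"fmt"
	"regexp"
)

// Metric is a hypothetical, simplified metric for this sketch.
type Metric struct {
	Name  string
	Value float64
	Attrs map[string]string
}

// matchesAny reports whether name matches at least one filter.
// An empty filter list keeps everything (an assumption of this sketch).
func matchesAny(name string, filters []*regexp.Regexp) bool {
	if len(filters) == 0 {
		return true
	}
	for _, f := range filters {
		if f.MatchString(name) {
			return true
		}
	}
	return false
}

// filterAndLabel keeps the metrics that pass the filters and tags each kept
// metric with the configured labels, without mutating the input.
func filterAndLabel(in []Metric, filters []*regexp.Regexp, labels map[string]string) []Metric {
	var out []Metric
	for _, m := range in {
		if !matchesAny(m.Name, filters) {
			continue
		}
		attrs := make(map[string]string, len(m.Attrs)+len(labels))
		for k, v := range m.Attrs {
			attrs[k] = v
		}
		for k, v := range labels {
			attrs[k] = v
		}
		out = append(out, Metric{Name: m.Name, Value: m.Value, Attrs: attrs})
	}
	return out
}

// demoFilter runs the pipeline on two example metrics.
func demoFilter() []Metric {
	in := []Metric{
		{Name: "consul.raft.commitTime", Value: 1.5},
		{Name: "consul.runtime.alloc_bytes", Value: 1024},
	}
	filters := []*regexp.Regexp{regexp.MustCompile(`^consul\.raft\.`)}
	labels := map[string]string{"service.instance.id": "server-1"}
	return filterAndLabel(in, filters, labels)
}

func main() {
	for _, m := range demoFilter() {
		fmt.Println(m.Name, m.Value, m.Attrs)
	}
}
```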
I think these are the right Temporality values...
Labels are attributes that the metrics will be tagged with, for example the service instance id (service.instance.id)
Note for reviewers: I initially wanted to have this otlpmetrichttp client directly in the Exporter (telemetry/exporter.go).
However, to do HCP auth, we need the hcpConfig which is injected during bootstrap into the HCP client.
The full configuration will be done in CC-4635, as we need to patch otlpmetrichttp to configure the internal client.
This design also made mocking and testing extremely easy, since the client already has a mock.
Other options I considered:
- Return the hcpConfig with a new HCPConfig method on the HCP Client and keep the OTLP specific client in Exporter. Although this works fine, I didn't get the mocking benefits, and it became hard to test the Exporter.
- Remove Exporter and use the Reporter + HCP Client only. This made the Reporter really complex.
From a code maintainability standpoint, the current design felt much cleaner and the concerns are well separated.
Safety guard, a client and logger are needed. If this package is used outside of our use case, this ensures the config object is configured well.
Safety guard, a client, logger, and gatherer are needed. If this package is used outside of our use case, this ensures the config object is configured well.
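As an illustration of this kind of guard, here is a minimal sketch of a config check that rejects missing dependencies. The `ReporterConfig` fields, the tiny `Client`, `Logger`, and `Gatherer` interfaces, and the `validate` method are hypothetical and simplified, not the exact types in this PR.

```go
package main

import (
	"errors"
	"fmt"
)

// Minimal hypothetical interfaces standing in for the real dependencies.
type Client interface{ ExportMetrics() error }
type Logger interface{ Log(msg string) }
type Gatherer interface{ Gather() int }

// ReporterConfig is a hypothetical config object with its required dependencies.
type ReporterConfig struct {
	Client   Client
	Logger   Logger
	Gatherer Gatherer
}

// validate returns an error naming the first missing dependency.
func (c *ReporterConfig) validate() error {
	switch {
	case c == nil:
		return errors.New("reporter config is nil")
	case c.Client == nil:
		return errors.New("reporter config: client is required")
	case c.Logger == nil:
		return errors.New("reporter config: logger is required")
	case c.Gatherer == nil:
		return errors.New("reporter config: gatherer is required")
	}
	return nil
}

// noop satisfies all three interfaces for the demo.
type noop struct{}

func (noop) ExportMetrics() error { return nil }
func (noop) Log(string)           {}
func (noop) Gather() int          { return 0 }

func main() {
	fmt.Println((&ReporterConfig{}).validate())
	fmt.Println((&ReporterConfig{Client: noop{}, Logger: noop{}, Gatherer: noop{}}).validate())
}
```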
Description
HCP Telemetry
TLDR: new feature to ship Server metrics directly to HCP.
main and backported
This allows us to iteratively improve on this feature over time before. Split up for easier PR review. 🚢
PRs to look at before this one:
Changes
This PR is a first to scaffold HCP telemetry, namely to:
Next steps
Testing & Reproduction steps
I tested this locally end to end with:
- server.json that contains HCP creds configuration to run the reporter
- localhost:9090 since the telemetry config returns that hardcoded value for now
- bin/consul agent -dev -config-file=config.json
In my OTLP receiver, I get metrics 🥳 :
Shortened for brevity ^

Links
PR Checklist