Skip to content

[CC-4374]Scaffold HCP Telemetry#16797

Closed
Achooo wants to merge 17 commits into
cc-4374/move-client-subpackagefrom
cc-4374/scaffold-hcp-telemetry
Closed

[CC-4374]Scaffold HCP Telemetry#16797
Achooo wants to merge 17 commits into
cc-4374/move-client-subpackagefrom
cc-4374/scaffold-hcp-telemetry

Conversation

@Achooo
Copy link
Copy Markdown
Contributor

@Achooo Achooo commented Mar 28, 2023

Description

HCP Telemetry

TLDR: new feature to ship Server metrics directly to HCP.

  • We will be merging these approved changes into feature/hcp-telemetry
  • That branch will be merged at the end into main and backported

This allows us to iteratively improve on this feature over time before. Split up for easier PR review. 🚢

PRs to look at before this one:

⚠️ Merge those first into feature branch, as this is rebased on top of it.

Changes

This PR is a first to scaffold HCP telemetry, namely to:

  • Create a reporter that obtains metrics from go metrics at a configurable interval
    • Reporter starts if HCP creds are configured.
    • Server is registered with CCM, and metrics endpoint configuration is obtained.
  • Create a metrics exporter that exports the metrics:
    • In OTLP format
    • If they are in the allowed list of filters (regex based)
    • Exports to the configured endpoint to the telemetry gateway
    • Uses a client capable of HCP auth (parts of it will be done later)

Next steps

  • Batching strategy improvements
  • Complete HCP Client auth in OTLP Exporter

Testing & Reproduction steps

I tested this locally end to end with:

  • Configuring a server.json that contains HCP creds configuration to run the reporter
  • Modyfing the OTLP client to allow HTTP requests
unc (c *hcpClient) InitMetricsClient(ctx context.Context, endpoint string) error {
	// hcpConfig, err := c.cfg.HCPConfig()
	// if err != nil {
	// 	return err
	// }

	// TODO:In follow up PR (CC-4635) : Need to use oauth2 transport in the otlpmetrichttp client to add token.
	// We likely need to fork and update the otlpmetrichttp client, as there is no interface to do this currently.
	exp, err := otlpmetrichttp.New(ctx, otlpmetrichttp.WithEndpoint(endpoint), otlpmetrichttp.WithInsecure())
	if err != nil {
		return err
	}

	c.exporter = exp
	return nil
}
  • Setting a mock OTLP receiver at localhost:9090 since the telemetry config returns that hardcoded value for now
  • Run bin/consul agent -dev -config-file=config.json

In my OTLP receiver, I get metrics 🥳 :

~/Github/HashiCorp/mock-receiver 6m 24s
❯ go run main.go
{"resourceMetrics":[{"resource":{"attributes":[{"key":"service.instance.id","value":{"stringValue":"04b1534a-fa03-dc66-e127-f829314ef448"}},{"key":"service.name","value":{"stringValue":"consul-server"}},{"key":"service.version","value":{"stringValue":"1.16.0-dev"}}]},"scopeMetrics":[{"scope":{"name":"github.com/hashicorp/consul/agent/hcp/client/telemetry","version":"v1"},"metrics":[{"name":"consul.raft.leader.oldestLogAge","gauge":{"dataPoints":[{"startTimeUnixNano":"11651379494838206464","timeUnixNano":"1680028130000000000","asDouble":0}]}},{"name":"consul.raft.last_index","gauge":{"dataPoints":[{"startTimeUnixNano":"11651379494838206464","timeUnixNano":"1680028130000000000","asDouble":17}]}},{"name":"consul.raft.applied_index","gauge":{"dataPoints":[{"startTimeUnixNano":"11651379494838206464","timeUnixNano":"1680028130000000000","asDouble":17}]}},{"name":"consul.raft.thread.main.saturation","histogram":{"dataPoints":[{"startTimeUnixNano":"11651379494838206464","timeUnixNano":"1680028130000000000","count":"10","sum":0,"min":0,"max":0}],"aggregationTemporality":2}}]}]}]}
{"resourceMetrics":[{"resource":{"attributes":[{"key":"service.instance.id","value":{"stringValue":"04b1534a-fa03-dc66-e127-f829314ef448"}},{"key":"service.name","value":{"stringValue":"consul-server"}},{"key":"service.version","value":{"stringValue":"1.16.0-dev"}}]},"scopeMetrics":[{"scope":{"name":"github.com/hashicorp/consul/agent/hcp/client/telemetry","version":"v1"},"metrics":[{"name":"consul.raft.last_index","gauge":{"dataPoints":[{"startTimeUnixNano":"11651379494838206464","timeUnixNano":"1680028140000000000","asDouble":17}]}},{"name":"consul.raft.leader.oldestLogAge","gauge":{"dataPoints":[{"startTimeUnixNano":"11651379494838206464","timeUnixNano":"1680028140000000000","asDouble":0}]}},{"name":"consul.raft.applied_index","gauge":{"dataPoints":[{"startTimeUnixNano":"11651379494838206464","timeUnixNano":"1680028140000000000","asDouble":17}]}},{"name":"consul.raft.thread.main.saturation","histogram":{"dataPoints":[{"startTimeUnixNano":"11651379494838206464","timeUnixNano":"1680028140000000000","count":"10","sum":0,"min":0,"max":0}],"aggregationTemporality":2}}]}]}]}

Shortened for brevity ^
Screenshot 2023-03-28 at 3 51 03 PM

Links

PR Checklist

  • updated test coverage
  • external facing docs updated
  • not a security concern

@github-actions github-actions Bot added pr/dependencies PR specifically updates dependencies of project theme/internals Serf, Raft, SWIM, Lifeguard, Anti-Entropy, locking topics labels Mar 28, 2023
@Achooo Achooo force-pushed the cc-4374/scaffold-hcp-telemetry branch 2 times, most recently from 74c7392 to 5745e4c Compare March 28, 2023 15:07
@Achooo Achooo changed the base branch from feature/hcp-telemetry to cc-4374/expose-go-metrics-as-backend March 28, 2023 15:07
@Achooo Achooo force-pushed the cc-4374/scaffold-hcp-telemetry branch from 5745e4c to bb571d8 Compare March 28, 2023 15:30
@Achooo Achooo changed the title [CC-4374]Scaffold HCP Telemetry Reporter/Exporter [CC-4374]Scaffold HCP Telemetry Mar 28, 2023
@Achooo Achooo force-pushed the cc-4374/scaffold-hcp-telemetry branch from bb571d8 to 44f12b0 Compare March 28, 2023 16:32
@Achooo Achooo changed the base branch from cc-4374/expose-go-metrics-as-backend to cc-4374/move-client-subpackage March 28, 2023 16:32
@Achooo Achooo force-pushed the cc-4374/move-client-subpackage branch from 512f03b to 8eaf996 Compare March 28, 2023 18:18
@Achooo Achooo force-pushed the cc-4374/scaffold-hcp-telemetry branch from 44f12b0 to f04ef96 Compare March 28, 2023 18:33
Comment thread agent/hcp/telemetry/reporter.go Outdated
Copy link
Copy Markdown
Contributor Author

@Achooo Achooo Mar 28, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Functions of the reporter

  • The reporter continuously gathers Consul metrics over a configurable time interval from the go metrics in memory sink
  • Over a configurable time interval, it flushes these gathered metrics to the Exporter (in charge of exporting metrics to HCP)

Concerns

There are currently a few concerns with the batching strategy in the reporter, right now:

  1. Flush failures/ Retry and Duplicates: currently with the interval time comparisons, we do not handle the case where the exporter fails to export. We currently drop metrics regardless. We want to offload some of the heavy lifting into the consul server. For example, if the Telemetry gateway is down, we want the export to retry or the batch to remain in memory and flush again, while avoiding duplicates.
  2. Memory concerns: With the ideal solution described above, we want to drop metrics that are too old, having a maximum size on the data structure holding the gathered metrics. We can't leave it unbound, and let the data structure grow forever. We need to have a configurable value here and we also need to run benchmarks to make sure we pick a sane value.
  3. The current BatchMetrics data structure might not be ideal to decide how to pick outdated intervals. Right now we compare agains time, and update the last interval based on our export batch. With a better solution, we likely need to reconcile this information better.

Overall, a sound batching strategy needs to be developed.
This will be addressed via a concretized strategy developed in CC-4636, where I will evaluate patterns, document them, and perform memory benchmarks before making a decision.

For now, please consider this file a scaffold for a first iteration of this merging into the feature branch.
This also is easier to review PR's, as the follow-up can be focused on the batching
The interfacing methods will likely remain the same, and so will most of the reporter initialization.

However, the data structures and algorithms for the actual batching will change, and potentially that main select for loop depending on the algorithm.

If anyone has lots of experience with this kind of thing, would love to talk more!

@Achooo Achooo force-pushed the cc-4374/scaffold-hcp-telemetry branch from f04ef96 to fa6d14b Compare March 28, 2023 19:33
Comment thread agent/hcp/client/client.go Outdated
Copy link
Copy Markdown
Contributor Author

@Achooo Achooo Mar 28, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@nickethier After playing around, I went with the hcp.Client being the client exporter. It holds the OTLP client, and it can configure it with HCP auth from cloud config. It exposes two methods:

  1. InitMetricsClient(ctx context.Context, endpoint string) error : given an endpoint, initializes an OTLP client internally with HCP auth setup. We can't init the OTLP client earlier, since we get CCM endpoint information later, not during bootstrap when the HCP client is configured.
  2. ExportMetrics(context.Context, metricdata.ResourceMetrics) error: given metrics, makes the HCP authenticated HTTP request using the OTLP client to export metrics.

The telemetry.Exporter uses this wrapped client to export metrics.
When the reporter starts, we try to :

  1. fetch CCM config
  2. if registered with CCM, call this InitMetricsClient function to init the OTLP client with the configured endpoint
  3. Create a Metrics exporter by injecting this HCP/OTLP Client
  4. Metrics exporter can now access the ExportMetrics(context.Context, metricdata.ResourceMetrics) error to export metrics

See usage here:
https://github.com/hashicorp/consul/pull/16797/files#diff-12b6ad75045bab0d33ca50e727bfe4de502bb2322e7c4599a0c67f6a43e18a6fR84

Lmk what you think!

I think from a code readability standpoint, might still be best to inject this as a dependency into the Exporter 😓 But yeah that means exposing the cloud config via a method.. Was really divided between the two.

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What if we define two interfaces in this package. If we had a client.MetricsClient interface with ExportMetrics then we could create it after calling Client.FetchTelemetryConfig and avoid needing the InitMetricsClient func

Comment thread agent/hcp/manager.go Outdated
Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@nickethier Where the Client gets initialized and injected into the exporter. We could even do this within the Exporter init, by passing the HCP client. I liked that it was more obvious what is happening this way.

@Achooo Achooo requested a review from nickethier March 28, 2023 19:46
Comment thread agent/hcp/manager.go Outdated
Copy link
Copy Markdown
Contributor Author

@Achooo Achooo Mar 28, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thought I should handle the case where we don't have an error per say, but the server isn't registered with CCM 🤔 This might not be nil, more empty (no endpoint) but for now since the CCM protos are not yet available I did it this way. Might change later.

Also, the InitMetricsClient is expecting an endpoint of the form <host>:<port>. No scheme. Depending on if CCM returns the endpoint with scheme, we will have to do parsing/validation here for the endpoint.

Comment thread agent/hcp/manager.go Outdated
Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Returns an error so we can unit test easily!

Comment thread agent/hcp/manager.go Outdated
Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Could also make the m.reporter.Run(ctx) return an error to make this return m.reporter.Run(ctx), but the reporter can be run outside of this package, and so it felt forced to do that.

Comment thread agent/hcp/manager.go Outdated
Copy link
Copy Markdown
Contributor Author

@Achooo Achooo Mar 28, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

One thing that worries me is that we are not handling goroutine exits, in the case that the ctx is cancelled both the manager Run and this inner goroutine will return with a parent that exited.

Some thoughts:

  • Maybe this is this fine as is?
  • If not:
    • Should I use a separate context for the reporter goroutine, and cancel it when the manager Run function cancels?
    • Should I convert the contents of the manager Run function into a goroutine run and wait on both runReporter and run ?

Comment thread agent/hcp/telemetry/exporter.go Outdated
Copy link
Copy Markdown
Contributor Author

@Achooo Achooo Mar 28, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Functions of the reporter

  • Converts batched metrics from the Reporter into OTLP
    • Filters metrics based on configured metric filters
    • Adds attributes based on configured labels
  • Exports these to HCP Telemetry Gateway. It holds an HCP Client capable of making this HTTP request (HCP authenticated)

Comment thread agent/hcp/telemetry/exporter.go Outdated
Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think these are the right Temporality values...

Comment thread agent/hcp/manager.go Outdated
Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Labels are attributes that the metrics will be tagged with, for example the service instance id (service.instance.id)

Comment thread agent/hcp/telemetry/filter.go Outdated
Comment thread agent/hcp/telemetry/filter.go Outdated
Comment thread agent/hcp/client/client.go Outdated
Copy link
Copy Markdown
Contributor Author

@Achooo Achooo Mar 28, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Note for reviewers: I initially wanted to have this otlpmetrichttp client directly in the Exporter (telemetry/exporter.go).

However, to do HCP auth, we need the hcpConfig which is injected during bootstrap into the HCP client.
The full configuration will be done in CC-4635, as we need to patch otlpmetrichttp to configure the internal client.

With this design, it made mocking/ testing extremely easy as well. The client already has a mock as well.

Other options I considered:

  1. Return the hcpConfig with a new HCPConfig method on the HCP Client and keep the OTLP specific client in Exporter. Although this works fine, I didn't get the mocking benefits, and it became hard to test the Exporter. A
  2. Remove Exporter and use the Reporter + HCP Client only. This made the Reporter really complex. From a code maintainability standpoint, this felt much cleaner and the concerns are well separated.

Comment thread agent/hcp/telemetry/exporter_test.go Outdated
Comment thread agent/hcp/telemetry/exporter.go Outdated
Comment thread agent/hcp/telemetry/exporter.go Outdated
Copy link
Copy Markdown
Contributor Author

@Achooo Achooo Mar 29, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Safety guard, a client and logger are needed. If this package is used outside of our use case, this ensures the config object is configured well.

Comment thread agent/hcp/telemetry/reporter.go Outdated
Copy link
Copy Markdown
Contributor Author

@Achooo Achooo Mar 29, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Safety guard, a client, logger, and gatherer are needed. If this package is used outside of our use case, this ensures the config object is configured well.

Comment thread agent/hcp/manager.go Outdated
@Achooo Achooo force-pushed the cc-4374/scaffold-hcp-telemetry branch from 0877382 to f2ffb6b Compare March 29, 2023 15:55
Comment thread agent/hcp/telemetry/reporter_test.go Outdated
@Achooo Achooo force-pushed the cc-4374/scaffold-hcp-telemetry branch from b96dcf9 to 559fde5 Compare April 3, 2023 15:45
@Achooo Achooo force-pushed the cc-4374/move-client-subpackage branch from 8eaf996 to 08e0026 Compare April 4, 2023 02:19
@Achooo Achooo force-pushed the cc-4374/scaffold-hcp-telemetry branch from 051969d to 8b17d7c Compare April 4, 2023 02:30
@Achooo Achooo force-pushed the cc-4374/scaffold-hcp-telemetry branch 2 times, most recently from ba26669 to a83bc18 Compare April 4, 2023 14:53
@Achooo Achooo force-pushed the cc-4374/scaffold-hcp-telemetry branch from a83bc18 to 0fb65e0 Compare April 4, 2023 14:57
@Achooo Achooo closed this Apr 21, 2023
@Achooo Achooo deleted the cc-4374/scaffold-hcp-telemetry branch April 27, 2023 02:54
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

pr/dependencies PR specifically updates dependencies of project theme/internals Serf, Raft, SWIM, Lifeguard, Anti-Entropy, locking topics

Projects

None yet

Development

Successfully merging this pull request may close these issues.

4 participants