Skip to content

tracing: add datadog extension#4699

Merged
mattklein123 merged 18 commits intoenvoyproxy:masterfrom
DataDog:cgilmour/datadog-tracer
Nov 5, 2018
Merged

tracing: add datadog extension#4699
mattklein123 merged 18 commits intoenvoyproxy:masterfrom
DataDog:cgilmour/datadog-tracer

Conversation

@cgilmour
Copy link
Copy Markdown
Contributor

Description:
This PR adds a tracer extension so envoy can produce HTTP traces and submit them to Datadog via an agent.
It has similar behavior to the existing LightStep and Zipkin tracers, and their implementations were used as a reference.

Risk Level:
Low.

Testing:
Unit tests ("borrowed" from Lightstep)
Integration tests using docker-compose. (An example is not submitted yet, but can be provided)
End-to-End tests from browser HTTP request to traces appearing in Datadog's system, internally and by external users in a staging environment.

Docs Changes:
Please suggest where doc changes need to be made.

Release Notes:
Please suggest about this also.

Fixes #3861 (already closed)

CC: @mattklein123 (offered to sponsor this submission) and other tracing experts @rnburn and @objectiser

Signed-off-by: Caleb Gilmour <caleb.gilmour@datadoghq.com>
@cgilmour cgilmour force-pushed the cgilmour/datadog-tracer branch from 3decae8 to 2b5c3e6 Compare October 12, 2018 04:23
@alyssawilk
Copy link
Copy Markdown
Contributor

Sorry, can you do a master merge to pick up #4701?

@mattklein123
Copy link
Copy Markdown
Member

@cgilmour unfortunately I'm out for the next 2 weeks, and I'm not going to have a chance to look at this before I go. I'm going to mark this as "no stalebot" in the interim. If anyone else wants to take a look including @rnburn and/or @objectiser that would be great.

@mattklein123 mattklein123 added the no stalebot Disables stalebot from closing an issue label Oct 12, 2018
Comment thread api/envoy/config/trace/v2/trace.proto Outdated
// The cluster to use for submitting traces to the Datadog agent.
string collector_cluster = 1 [(validate.rules).string.min_bytes = 1];
string service_name = 2 [(validate.rules).string.min_bytes = 1];
bool priority_sampling = 3;
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What's the priority_sampling option?

Also, does datadog's tracer support the sampling.priority tag? Envoy uses it to disable sampling for some of the traffic that would be noise.

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@rnburn Support for that tag is being added in DataDog/dd-opentracing-cpp#59 should be ready today.

@cgilmour We should copy/have similar comments as https://github.com/DataDog/dd-opentracing-cpp/blob/master/src/tracer_factory.cpp#L21 We should also default priority sampling to on here (whereas normally it's off by default), since envoy uses it.

message->headers().insertPath().value(encoder_->path());
message->headers().insertHost().value(driver_.cluster()->name());
for (auto& h : encoder_->headers()) {
ENVOY_LOG(debug, "Adding header {}: {}", h.first, h.second);
Copy link
Copy Markdown
Contributor

@rnburn rnburn Oct 15, 2018

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is there enough context in this log statement for it to be useful?

cluster_ = cluster->info();

tracer_options_.operation_name_override = "envoy.proxy";
if (datadog_config.service_name().size() > 0) {
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

!datadog_config.service_name().empty()?

Upstream::ClusterInfoConstSharedPtr cluster() { return cluster_; }
Runtime::Loader& runtime() { return runtime_; }
DatadogTracerStats& tracerStats() { return tracer_stats_; }
const datadog::opentracing::TracerOptions& tracerOptions() { return tracer_options_; }
Copy link
Copy Markdown
Contributor

@rnburn rnburn Oct 15, 2018

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think there's an Envoy convention for putting non-overridden methods like this before the overridden ones.

Signed-off-by: Caleb Gilmour <caleb.gilmour@datadoghq.com>
@cgilmour cgilmour force-pushed the cgilmour/datadog-tracer branch from 9a27059 to 569bbc2 Compare October 16, 2018 04:48
@cgilmour
Copy link
Copy Markdown
Contributor Author

I'll sort out an update to fix the failing tests.
Seems like a minor detail from some recent refactoring in our code.

Signed-off-by: Caleb Gilmour <caleb.gilmour@datadoghq.com>
Signed-off-by: Caleb Gilmour <caleb.gilmour@datadoghq.com>
@mattklein123 mattklein123 removed the no stalebot Disables stalebot from closing an issue label Oct 29, 2018
Copy link
Copy Markdown
Member

@mattklein123 mattklein123 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for working on this and sorry for the delay. Some comments to get started.

google.protobuf.Struct config = 2;
}

// Configuration for the Datadog tracer.
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can we add a release note for this? Are there any other docs that need updating? Perhaps https://www.envoyproxy.io/docs/envoy/latest/intro/arch_overview/tracing?

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Happy to do this once the main code changes are acceptable.


class TraceReporter;
typedef std::unique_ptr<TraceReporter> TraceReporterPtr;
typedef std::shared_ptr<datadog::opentracing::TraceEncoder> TraceEncoderPtr;
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit: TraceEncoderSharedPtr

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Will fix.

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done.

POOL_COUNTER_PREFIX(stats, "tracing.datadog."))},
tls_(tls.allocateSlot()), runtime_(runtime) {

Upstream::ThreadLocalCluster* cluster = cm_.get(datadog_config.collector_cluster());
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You can use Config::Utility::checkCluster here I think.

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yup, it was done this way for consistency (and not knowing about checkCluster).
Would you like the other tracers updated as well?

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sure please clean up.

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done for all three cases.


void TraceReporter::enableTimer() {
const uint64_t flush_interval =
driver_.runtime().snapshot().getInteger("tracing.datadog.flush_interval_ms", 1000U);
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Please make sure these runtime keys are documented somewhere. Don't recall of the top of my head where the other keys you based this on are documented.

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It was "inspired" by settings in other tracers, but can probably be replaced with a sensible default.
The impact of it is on the tracer mechanics more than the user.

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yup that's fine, you don't need to have things be configurable if you don't want.

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Replaced with a sensible default.

flush_timer_->enableTimer(std::chrono::milliseconds(flush_interval));
}

void TraceReporter::flushTraces() {
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm guessing that almost all this code was basically copied from the LS implementation? How much of it is different? Is it worth sharing the code in any way? cc @rnburn

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Correct, they were used as a reference, and where possible, used verbatim so that the result would be consistent.
If there's room to refactor things, that's great.
Flush timers and flush mechanisms are probably a good case for that.
Not sure about other bits.

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It's up to you but any de-dup would be appreciated. This is going to come up again when someone comes in and adds the OpenCensus tracer.

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Not done, but OK doing a followup change to de-dup things, and any suggestions from @rnburn


Http::MessagePtr message(new Http::RequestMessageImpl());
message->headers().insertMethod().value().setReference(Http::Headers::get().MethodValues.Post);
message->headers().insertPath().value(encoder_->path());
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This line and next line can be references I think, FWIW

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

OK. I'll see I can make this one return a reference. The next one is internal to envoy so easier to update.

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done.

message->headers().insertMethod().value().setReference(Http::Headers::get().MethodValues.Post);
message->headers().insertPath().value(encoder_->path());
message->headers().insertHost().value(driver_.cluster()->name());
for (auto& h : encoder_->headers()) {
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

for perf reasons, it would be better to pre-construct the lower case headers. Then you can just send them by reference.

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Alright, I'll see what I can do for that.

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done.

Signed-off-by: Caleb Gilmour <caleb.gilmour@datadoghq.com>
Signed-off-by: Caleb Gilmour <caleb.gilmour@datadoghq.com>
Signed-off-by: Caleb Gilmour <caleb.gilmour@datadoghq.com>
@cgilmour
Copy link
Copy Markdown
Contributor Author

Thanks @mattklein123, added some responses to the comments.
I'll work on the changes and push them when they're ready for another look.

Signed-off-by: Caleb Gilmour <caleb.gilmour@datadoghq.com>
Signed-off-by: Caleb Gilmour <caleb.gilmour@datadoghq.com>
Signed-off-by: Caleb Gilmour <caleb.gilmour@datadoghq.com>
Signed-off-by: Caleb Gilmour <caleb.gilmour@datadoghq.com>
Signed-off-by: Caleb Gilmour <caleb.gilmour@datadoghq.com>
Signed-off-by: Caleb Gilmour <caleb.gilmour@datadoghq.com>
Signed-off-by: Caleb Gilmour <caleb.gilmour@datadoghq.com>
Copy link
Copy Markdown
Member

@mattklein123 mattklein123 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This in general LGTM. Can we do a docs/release notes pass and then I can take a final pass? Thank you!

@mattklein123
Copy link
Copy Markdown
Member

Marking is waiting.

/wait

Copy link
Copy Markdown
Member

@mattklein123 mattklein123 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Very nice, thank you!

@mattklein123 mattklein123 merged commit 46c0693 into envoyproxy:master Nov 5, 2018
@lizan
Copy link
Copy Markdown
Member

lizan commented Nov 11, 2018

CODEOWNERS entry is missing, @cgilmour @mattklein123?

@cgilmour
Copy link
Copy Markdown
Contributor Author

Ok. I can submit a followup PR for that.

@mattklein123
Copy link
Copy Markdown
Member

Thank you @cgilmour!

@cgilmour cgilmour deleted the cgilmour/datadog-tracer branch January 30, 2019 22:13
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

Add datadog tracing driver to envoy

6 participants