validation: Add config validation mode. (#499) #863
Conversation
…nd add a factory method to it for the ClusterManager. This is preparatory to adding a ValidationClusterManagerFactory, which creates ValidationClusterManagers.
This adds a --mode command-line flag. "--mode=serve" is the default and can be omitted; in that mode Envoy runs as it always has. With "--mode=validate" Envoy checks that the configuration file in --config-file is valid, then exits. "Valid" means that it performs the JSON schema check as usual, but also initializes as much of its internal state as possible, exiting with any errors as it would under "--mode=serve". When initialization is complete, instead of listening for traffic, Envoy prints an "OK" message to stderr and exits, returning success. In validation mode, to the extent possible, Envoy avoids interfering with its environment. For example, the hot-restart process is skipped, so validation can safely be run on the same machine as a serving Envoy. Validation mode also doesn't send any upstream traffic or open any listeners. It *does* attempt to read any files referenced in the config, like certs and private keys, so validation will fail if those files aren't located at the expected paths. A future "lite validation" mode, that mocks out the filesystem as well as the network, is not yet implemented as of this patch.
@mattklein123: As we talked about a while ago, this ended up having a fair number of tentacles mocking out different parts of the class structure. Comments on how you'd organize it differently are more than welcome. Remember I'm still reasonably new to the language, so if anything's surprising in here, there's no subtle genius, just wrongness. :)
Nit: why
This looks like a useful feature, thanks @rlazarus. Before going too far into the review, I have an architectural question, which I actually feel bad about bringing up at such a late stage, given this looks like it's basically complete.
I'm wondering whether it would have been better to set a global dry_run flag and condition on the existing code paths for server initialization rather than create the bifurcated initialization in this PR? I feel it's going to be an ongoing maintenance issue to keep the two in sync. But, I haven't explored this other alternative, so please let me know what your thoughts are.
```cpp
namespace Server {

/**
```
Nit: s/config-validation/configuration validation/
```cpp
/**
 * Default mode: Regular Envoy serving process. Configs are validated in the normal course of
 * initialization, but if all is well we proceed to serve traffic.
 */
```
It's fine in the enum here, but for the command-line, I'm in agreement with @PiotrSikora: it's probably cleaner to have --dry-run and --dry-run-lite flags. I don't think anyone will ever set --mode=serve.
FWIW, my thinking is that we might end up with any number of distinct modes for running Envoy -- "full validation" and "lite validation" are two, but conceptually --hot-restart-version is another: if we implemented that today, it might be a value for --mode. These all have two important things in common: that they significantly modify the semantics of what it means to run the binary (you expect drastically different things to happen) and, more importantly to the flag question, that it doesn't make sense to specify more than one.
We can grow a lot of boolean flags and document them all with "set either zero or one of these flags because they're mutually exclusive" or we can make that implicit with an enum-valued flag, and I argue that's much clearer. I agree nobody will ever explicitly set --mode serve, but I think it's still semantically a default state.
I'm 100% open to different names (e.g. I don't feel strongly about "validate" over "dry run"). I'm also willing to be outvoted on an enum vs. a fleet of booleans, but it was an intentional stance.
(cc @PiotrSikora because this is in reply to your comment too)
What does hot restart version do that isn't done by --mode=serve?
It prints the hot restart version and exits immediately (https://github.com/lyft/envoy/blob/master/source/server/options_impl.cc#L63) without serving, validating, or doing anything else.
```cpp
} // Server

int validateConfigAndExit(OptionsImpl& options, Server::ProdComponentFactory& component_factory,
```
This would be better placed in an anonymous namespace to make it local to the translation unit (similar to a static function, but the C++ way is to use namespaces).
```cpp
    options.serviceClusterName(), options.serviceNodeName());

if (options.mode() != Server::Mode::Serve) {
  return validateConfigAndExit(options, component_factory, local_info);
```
I think it would be clearer to call it just validateConfig, as the AndExit implies to me that it's going to do a ::exit(something);, which it doesn't, it returns.
Good point -- the name was left over from an earlier version where that's what it did. Done.
```cpp
Server::ValidationInstance server(options, restarter, stats_store, restarter.accessLogLock(),
                                  component_factory, local_info);
std::cerr << "configuration '" << options.configPath() << "' OK" << std::endl;
```
Nit: Since this isn't an error, I think stdout might be more appropriate.
Done. (My original thinking was to have success and failure outputs go the same place, and this is now the only reference to stdout in non-test code -- but that's probably fine.)
```cpp
int validateConfigAndExit(OptionsImpl& options, Server::ProdComponentFactory& component_factory,
                          const LocalInfo::LocalInfoImpl& local_info) {
  Server::ValidationHotRestart restarter;
```
Do you need this restarter only for its logLock() and accessLogLock()? Could these be just independent log objects you create here instead?
The Server interface exposes it via hotRestart().
FWIW, @rlazarus and I discussed this a while back and decided to go with the "mock" solution. The general thinking was that plumbing a dry_run param through the code would make the primary server logic harder to read and more complicated, and the flag would have to go into a lot of places, since this testing is doing semantic testing, not just JSON validation. In some sense, it's doing mocking without using gmock. I agree with @htuch that this is ugly, but on the other hand I think we need to choose between 3 ugly options: 1) Do this. 2) Plumb dry_run through a ton of places. 3) Actually use gmock and the test mocks in prod code. I still think #1 is the least ugly, but let's discuss tomorrow if we need to all get on the same page.
One other random thought for discussion tomorrow. Assuming we keep this approach, I think it would be good to switch the ExampleConfig test over from gmock to use these "mocks." This will help us get better test coverage. Eventually we could probably also switch over the router test tool that @hennna wrote.
What's the strategy for maintenance? I haven't looked into the code (so feel free to say "go look at the code :)"). If I were to introduce a new feature, is there a structure in place that can be followed to keep adding more stuff to the validation logic? How are RDS, CDS and SDS validated? (I am going totally by the PR description.) On a side note, it seems that all you want to accomplish here is to ensure that Envoy initializes without perturbing the environment. From that perspective, reads of any form are okay, right (from files, DNS, etc.)? It's the writes that complicate things (pushing stats, and making requests). Rather than validating from inside Envoy, why can't you validate externally observable behavior (e.g. expect connection to foo.bar.com with so-and-so headers) using an external sink/request generator? (This is not the same as using gmock in prod code.) You could emulate the external environment without Envoy knowing about it (like the nginx perl tests?). Not sure if this covers the use cases you are looking for.
@rlazarus @htuch was there any offline conversation about this? Is it ready for wider review? I also need to sync with @junr03 tomorrow on whether this meets our internal needs / what might be missing (somewhat related to questions from @rshriram). In general I think if we convert example config tests away from mocks and to this stuff I'm more comfortable (haven't looked at whether that was done yet or not).
@mattklein123 I need to give this another pass with a broad "it's probably the cleanest way we have to handle this" notion in mind. Will try to make this happen today.
```cpp
/**
 * Allocate a cluster manager from configuration JSON.
 */
```
Can we remove the other methods from this interface now that we have this? E.g. can allocateConPool just be part of ClusterManagerImpl.
I think not easily -- that distinction is there for testability, see https://github.com/lyft/envoy/blob/master/test/common/upstream/cluster_manager_impl_test.cc#L38. (I actually wasn't originally 100% sure that the ClusterManagerFactory is, morally speaking, the right place for, well, a ClusterManager factory function -- but I successfully talked myself into it. I could be talked back out of it, if there's a better place.)
```cpp
    const Optional<std::chrono::milliseconds>& timeout) {
  AsyncRequestImpl* async_request =
      new AsyncRequestImpl(std::move(request), *this, callbacks, timeout);
  async_request->initialize();
```
Do we want consistency in AsyncStreamImpl with the initialization pattern? I.e. it might be surprising to someone using both AsyncRequestImpl and AsyncStreamImpl that they have slightly different ways to kick off a request after construction.
AsyncStreamImpl doesn't currently do any real work in its ctor, though. I could add an initialize() method but it would only be for consistency's sake.
```cpp
Server::ValidationHotRestart restarter;
Logger::Registry::initialize(options.logLevel(), restarter.logLock());
Stats::HeapRawStatDataAllocator alloc;
Stats::ThreadLocalStoreImpl stats_store(alloc);
```
Do we need TLS during validation? Or is it possible to complete all the initialization+validation on just the entry thread? If so, we could use a simpler stats_store, e.g. IsolatedStoreImpl.
```cpp
int validateConfig(OptionsImpl& options, Server::ProdComponentFactory& component_factory,
                   const LocalInfo::LocalInfoImpl& local_info) {
  Server::ValidationHotRestart restarter;
  Logger::Registry::initialize(options.logLevel(), restarter.logLock());
```
The export of the log and access locks isn't part of the HotRestart interface. I think you can simplify here by using locally defined lock objects in both cases, and also remove locking from the Server::ValidationHotRestart implementation. Can you take a look to see if this is the case?
Done -- I had had it that way for consistency between the regular and validation HotRestarts, but you're right that this is simpler.
```cpp
Event::Libevent::Global::initialize();
OptionsImpl options(argc, argv, Server::SharedMemory::version(), spdlog::level::warn);
Server::ProdComponentFactory component_factory;
LocalInfo::LocalInfoImpl local_info(Network::Utility::getLocalAddress(), options.serviceZone(),
```
Question: do we want to rely on local address or should this be something you can pass in? I'm imagining using Envoy in a config validation pipeline that isn't run directly on the target server.
```cpp
    options.serviceClusterName(), options.serviceNodeName());

if (options.mode() != Server::Mode::Serve) {
  return validateConfig(options, component_factory, local_info);
```
Do you want to add an if (option.mode() == Server::Mode::Validate) { NOT_IMPLEMENTED; }? That would avoid someone asking for light mode and getting back full validation behavior.
(Assuming you mean == Server::Mode::ValidateLight.)
In practice I just never set that enum value -- in options_impl.cc if you pass anything other than --mode serve or validate you get the "I've never heard of that mode" error. But a belt-and-suspenders check here seems reasonable; done.
```cpp
Event::Dispatcher& dispatcher() override { return handler_.dispatcher(); }
Network::DnsResolver& dnsResolver() override { return dns_resolver_; }
bool draining() override {
  throw EnvoyException("ValidationInstance::draining() not implemented");
```
This is a super common pattern with some boilerplate. What do you reckon about a #define (I know, the shock, the horror) in the spirit of MOCK_METHOD, which does all the exception generation and name mangling etc? I think this will make it more readable/maintainable.
Could do, or I could just replace all the exceptions with NOT_IMPLEMENTED;, which should do the right thing now that our stack traces are friendly -- and would also save me from falling into that exception trap I fell into. I think I've just talked myself into doing it that way, any objections?
Either way works, I have a slight preference for whatever is less verbose.
```cpp
Configuration::InitialImpl initial_config(*config_json);

for (uint32_t i = 0; i < std::max(1U, options.concurrency()); i++) {
  workers_.emplace_back(new Worker(thread_local_, options.fileFlushIntervalMsec()));
```
Yeah, I was wondering if we can get away with no threading during validation above, but as I look through this method I see there's a fair bit of stuff going on related to this. @mattklein123 Do you think it's possible in Envoy to perform all config validation without bringing up any additional threads?
After a quick chat with @mattklein123, I think we can do this more simply a la https://github.com/lyft/envoy/blob/master/test/config_test/config_test.cc#L25. We shouldn't need to start any workers, bring up thread-local state, or use threaded variants of things like the stats_store_, etc. You really just need to provide the mocks that are in config_test with your classes.
And as part of this change please just make config test use your mocks per other comments. Will provide much better test coverage (though might require the "light" version also which doesn't read files).
```cpp
 * If we finish initialization, and reach the point where an ordinary Envoy run would begin serving
 * requests, the validation is considered successful.
 */
class ValidationInstance : Logger::Loggable<Logger::Id::main>, public Instance {
```
Does this need to inherit from the implementation, or can it just be a bare minimum implementation that exports out methods similar to what is used in https://github.com/lyft/envoy/blob/master/test/config_test/config_test.cc?
This doesn't inherit from the implementation -- Server::Instance is the interface, as distinct from Server::InstanceImpl.
htuch left a comment:
BTW, the C++ style is great, I wouldn't have picked this for a PR from someone new to the language :)
@rlazarus can we move this PR forward? Would like to try to get it merged or closed for now.
@mattklein123 Yes -- I've already been working with @htuch today to get it out the door ASAP.
These don't do anything except test that the mocked out methods are mocked out. They're purely to keep coverage from dropping.
@htuch Assuming Travis smiles upon this, it's ready for another look. Nits more than welcome.
htuch left a comment:
LGTM modulo a few small comments.
```cpp
CdsApiPtr ValidationClusterManagerFactory::createCds(const Json::Object& config,
                                                     ClusterManager& cm) {
  // Create the CdsApiImpl...
  CdsApiPtr cds = ProdClusterManagerFactory::createCds(config, cm);
```
Do you even need to assign to cds here since it's unused later?
Nope, good point. Done.
```cpp
ValidationInstance server(options, stats_store, access_log_lock, component_factory, local_info);
std::cout << "configuration '" << options.configPath() << "' OK" << std::endl;
server.shutdown();
return 0;
```
I'd prefer using EXIT_SUCCESS/EXIT_FAILURE here for clarity, or just a plain bool, making validateConfig not tied to the exit status and performing the mapping to an exit code in main.cc.
```cpp
/**
 * validateConfig() takes over from main() for a config-validation run of Envoy. It returns an exit
 * code for the process: 0 if the config is valid, 1 if invalid.
```
See above comment on the return value.
```cpp
Upstream::ProdClusterManagerFactory cluster_manager_factory(
    server.runtime(), server.stats(), server.threadLocal(), server.random(), server.dnsResolver(),
    server.sslContextManager(), server.dispatcher(), server.localInfo());
MainImpl config(server, cluster_manager_factory);
```
Why not support multiple constructors (including the old one), so can avoid the boilerplate here?
Not sure I follow -- by multiple constructors do you mean offer MainImpl(Server::Instance&) as well as the new MainImpl(Server::Instance&, ClusterManagerFactory&)? That might work but I'm not sure about the lifetimes -- the MainImpl only holds a reference to the ClusterManagerFactory (and needs it in initialize(), not in its ctor) so I don't immediately see a clean way but maybe I'm missing it.
If the concern is just that these tests are noisy now, I agree -- if I refactored them to use a test fixture, so that we only had to say that once and the individual tests were more compact, would that suit you?
Yeah, whatever you think is best here. I'm just the enemy of boiler plate.
```diff
@@ -0,0 +1,58 @@
+#include "envoy/json/json_object.h"
```
It occurred to me that another way you could get coverage is to write sufficient example configs to exercise all paths, but I think what you have here is fine if it doesn't regress coverage.
Yeah -- once we've replaced the example config tests with this, adding more example configs sounds like a good way of upping the coverage. But we might as well get there first.
htuch left a comment:
@rlazarus Optionally in this PR you could also update the docs in https://github.com/lyft/envoy/tree/master/docs to talk about this tool. I say optionally as I think that would be fine in a followup PR, up to you.
@rlazarus Can you improve the coverage here so bazel.coverage passes again?
Frustrating: Merging master seems to have taken us from just above the code coverage setpoint to just below. I'll see if I can effectively cover the NOT_IMPLEMENTED lines with a death test, because there's not much else to do. |
htuch left a comment:
Ready to ship as soon as coverage is fixed.
Right -- that's what @htuch and I are looking at, I pushed that commit so he could patch it in.
If it helps, I found reference to manually calling __gcov_flush, although at the time I was investigating, the signal handler code wasn't in place yet. It might be possible to easily integrate a __gcov_flush call now that the custom signal handlers are installed.
…y NOT_IMPLEMENTED lines that coverage doesn't count
This reverts commit 18fb3c2. The tests were only added for the coverage metric, and they don't increase the coverage metric.
@rlazarus Looks like some minor bit rot to fix.