Skip to content
This repository was archived by the owner on Nov 24, 2025. It is now read-only.

CDN Snapshots and Monitoring Payloads rework (blueprint)#5367

Closed
ocket8888 wants to merge 1 commit intoapache:masterfrom
ocket8888:blueprint/snapshots-and-monitoring
Closed

CDN Snapshots and Monitoring Payloads rework (blueprint)#5367
ocket8888 wants to merge 1 commit intoapache:masterfrom
ocket8888:blueprint/snapshots-and-monitoring

Conversation

@ocket8888
Copy link
Copy Markdown
Contributor

What does this PR (Pull Request) do?

  • This PR is not related to any Issue

This PR adds a blueprint for reworking Snapshots and Monitoring payloads to only containing the information necessary for Traffic Router and Traffic Monitor to do their jobs, respectively. For details read the blueprint.

Which Traffic Control components are affected by this PR?

None.

What is the best way to verify this PR?

Don't. It's just a blueprint.

The following criteria are ALL met by this PR

  • Tests are unnecessary
  • Documentation is unnecessary
  • An update to CHANGELOG.md is not necessary
  • This PR includes any and all required license headers
  • This PR DOES NOT FIX A SERIOUS SECURITY VULNERABILITY

@ocket8888 ocket8888 added the blueprint feature requirements / implementation details label Dec 10, 2020
to handle the new format - but as always, a previous API version must still be
supported.

## Traffic Router Impact
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Please describe in more detail the impact from these changes.
For example,is there now a new condition that TR may receive configuration for servers for which TM does not yet have any health data?

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

"... is there now a new condition that TR may receive configuration for servers for which TM does not yet have any health data?"

That's actually already possible. So, no, that new condition is not being introduced.

```typescript
interface IPAddress {
address: string;
gateway: string | null;
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

How does TM use the gateway?

Copy link
Copy Markdown
Contributor Author

@ocket8888 ocket8888 Dec 17, 2020

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It doesn't, it's only included to keep the structure of a server's interfaces consistent across the API, and possibly as supplemental information in log messages and the UI. I don't think the UI currently displays it, but the TM API does expose them, I think.

configPollingInterval: number;
};
deliveryServices: {
[xmlID: string]: {
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Should we keep the DS active field?

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We could, but it'd be superfluous. Only Delivery Services which are actively routed appear in monitoring payloads.

// used to be a string containing a boolean, instead of just a boolean.
anonymousBlockingEnabled: boolean;
// Formerly 'ttl' - now consistent with the DS property it represents.
ccrDNSTTL: null | number;//integer
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

rename this to get rid of ccr?

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'd love to- but not until that can be done on Delivery Services themselves. This is the same name as the property on the DS and I think that consistency is more important than a good name, because it makes it very obvious where that data came from.

// Formerly 'ttl' - now consistent with the DS property it represents.
ccrDNSTTL: null | number;//integer
consistentHashQueryParams: Array<string>;
coverageZoneOnly: boolean;
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

isn't this currently an int?

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

No. According to the docs:

"[coverageZoneOnly is] A string containing a boolean that tells whether or not this Delivery Service routes traffic based only on its Coverage Zone File"

Comment thread blueprints/snapshots.and.monitoring.md
## Security Impact
There should be no changes from a security standpoint.

## Upgrade Impact
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Will TM/TR first try to request the v4 API and failing that retry with v3 APIs calls?

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes. TM already does this (currently v3->v2). I don't know if TR already does but it should because all components are supposed to do that to maintain backward-compatibility.

@limited
Copy link
Copy Markdown
Contributor

limited commented Dec 18, 2020

Thanks for the quick responses.

Overall, I think this is a great improvement. I've seen some strange cases where TRs dont always update from TMs and I think this will be in general more reliable.

Also, selfishly, this opens up potential for Traffic Ops to select which snapshot version a particular TR receives, which can be used as part of a staged or canary deployment.

data set is polled while another grows stale.

## Proposed Change
Traffic Router should receive Snapshots directly from Traffic Ops, while
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Traffic Router should receive Snapshots directly from Traffic Ops

The TO->TM->TR flow is very intentional, to make sure the Monitor is monitoring what the Router is serving.

Removing that flow path and making independent TO->TM and TO->TR flows will create more even worse race conditions than the one mentioned above.

What does TR do if it gets CRConfig changes before the Monitor? So health data doesn't exist for a cache? What if something important, like a Profile, is changed? So the thing being monitored, the server, thresholds, deliveryservices, anything, is different than what the Router needs to serve? Any and all data--routing, thresholding, services, anything--is now a potential race condition, and can lead to strange and very-difficult-to-debug errors, including outages. Worse, if either the Monitor or Router is having trouble getting the latest CRConfig for any reason, those inscrutable errors could be indefinite.

Traffic Monitor uses data partially from CDN Snapshots and partially from its own Monitoring payloads to determine what to monitor and how. This leads to race conditions

This can be solved either put all Monitoring data in the CRConfig, or to add a "bulk fetch" method to TO to get both objects in one request.

If necessary, the CRConfig Monitoring data TR doesn't need could be stripped before sending it to TR. But I suspect it's a tiny size increase, and not nearly worth the safety and correctness benefit of having the Monitor directly serve the raw CRConfig bytes it got from TO. It also gives us more flexibility in the future to make TR use that data somehow.

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks like you missed the conversation we had about this in Slack. This was @rawlinp's response to more or less exactly what you just said:

If we are getting CrConfig from TO and CrStates from TM, it is possible that a cache that has been set ADMIN_DOWN could still be health in CrStates.

that race doesn't seem like a big issue, because once TM has processed the ADMIN_DOWN-ing of the cache, it will set it unhealthy in CrStates, but in the meantime TR will still be sending traffic to it. This is no different than TR waiting to get the latest CRConfig+health directly from TM -- it will still send traffic to the cache until TM has done its thing.

It's also possible that a cache that has been moved to OFFLINE and should not be in the CrConfig is still in CrStates.

This isn't really an issue either, because TR only looks up health state for caches that are in its crconfig. I was just in that code a month ago to fix the bug where if a cache went from admin_down to offline, TM removed it from the crstates, causing TR to treat it as available.On the flipside, for caches going from offline to reported/admin_down, if TR has the cache in its crconfig but not in crstates, TR should treat it as unavailable until it gets a valid state.

which I think is at least as well as I could've put it. Originally the plan was actually to have TM generate configuration for TR, but this convinced me and @dneuman64 that it was okay to let TR go directly to TO.

Frankly I don't really care one way or the other, though as @limited has pointed out the latter has some advantages that would be nice to take advantage of.

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The Server.Status is not the only potential race. As I said

What if something important, like a Profile, is changed? So the thing being monitored, the server, thresholds, deliveryservices, anything, is different than what the Router needs to serve? Any and all data--routing, thresholding, services, anything--is now a potential race condition

We've discussed adding smarter health, a percentage instead of a boolean, better thresholding, and many other features. Every new feature like those is another potential outage-inducing race, if TM and TR are racing for the same data instead of passing it through a flow.

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The only data both TM and TR need is which caches to monitor/route to. If those get out of sync, TR has contingencies in place to not route traffic to caches it has no health information about. Basically any cache it knows about via the CRConfig that doesn't have health info yet is considered unavailable. And if TR gets health information about a cache it doesn't know about (because it's not in its CRConfig), it ignores it. So although TM/TR get out of sync with each other until they've both processed the config, it's not a problem.

TM should only be telling TR which caches are healthy or not. I don't see how changing profiles, thresholds, etc., is a potential outage-inducing race condition. Those would only affect how TM determines cache availability, which is all that TR really cares about from TM.

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The Monitor also has Stat data, and does Delivery Service Thresholding, as well as serving stats to be aggregated via Traffic Stats to an upstream location, which can potentially be critical that it be correct. For example, if it were used for Billing.

What happens if that Delivery Service stat and thresholding is updated on the Router before the Monitor? Now we're routing to things the Monitor doesn't know about, and doesn't know to aggregate, and thus that traffic data is lost upstream, with potential financial or legal ramifications.

TM should only be telling TR which caches are healthy or not. I don't see how changing profiles, thresholds, etc., is a potential outage-inducing race condition. Those would only affect how TM determines cache availability, which is all that TR really cares about from TM.

TM also serves Delivery Service availability to TR, as above. Moreover, as I said, we'd like to be able to add more complex health, such as DS-specific availability for each cache. These races make that either impossible, or worse, someone might not realize their subtlety and implement constantly racing and error-prone things.

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

TR wouldn't be getting any kind of stat or thresholding data from TO or TM. TM processes that data then tells TR simply "this cache is unavailable" or "this delivery service is unavailable". If we wanted to enhance that to have TM say "this cache is 50% available" for instance, then that still wouldn't be a race w/ TR. That's just TR waiting to get health data from TM as always.

@mitchell852
Copy link
Copy Markdown
Member

@rawlinp - i believe we have gone in a different but similar direction already regarding TR and TM configs. Do you think this should be closed?

@rawlinp
Copy link
Copy Markdown
Contributor

rawlinp commented Mar 15, 2022

I think the compromise of #6625 is that TM will get the data it needs from monitoring.json rather than relying on both monitoring.json and CRConfig.json, but it will still synchronize the distribution of CRConfig to TR because that is one of the contentious points in this blueprint. If that's agreeable to @ocket8888 and others I think this could be closed.

@ocket8888
Copy link
Copy Markdown
Contributor Author

It seems like this blueprint is slightly different from what's being done in #6625, but since that's much less controversial and already well on its way to getting merged, anything new or different in here will need to be reworked anyway before it could be started. Since I'm not going to do that right now, there's really no reason to leave this open.

@ocket8888 ocket8888 closed this Mar 15, 2022
@ocket8888 ocket8888 deleted the blueprint/snapshots-and-monitoring branch March 15, 2022 20:27
rawlinp added a commit to rawlinp/trafficcontrol that referenced this pull request Mar 24, 2022
Do not use CRConfig data when the same data already exists in the
TMConfig. For data that TM needs but doesn't exist in TMConfig, add it
to TMConfig. This way, TM only uses data in the TMConfig and can become
simply a shuttle for the CRConfig without actually using any CRConfig
data.

Closes: apache#6512
Related: apache#1739
Related: apache#5367
ocket8888 pushed a commit that referenced this pull request Mar 28, 2022
* Reduce TM dependency on CRConfig

Do not use CRConfig data when the same data already exists in the
TMConfig. For data that TM needs but doesn't exist in TMConfig, add it
to TMConfig. This way, TM only uses data in the TMConfig and can become
simply a shuttle for the CRConfig without actually using any CRConfig
data.

Closes: #6512
Related: #1739
Related: #5367

* Make trafficServer deliveryService field backwards-compatible

* Address review comments

* Fix grave (accent) mistakes, clarify field omission
Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.

Labels

blueprint feature requirements / implementation details

Projects

None yet

Development

Successfully merging this pull request may close these issues.

5 participants