Changes for Backup Edge Cache Group #1908
Vijay-1 wants to merge 4 commits into apache:master from Vijay-1:EdgeCache-BackupGroup
|
Can one of the admins verify this patch? |
|
add to whitelist |
|
test this please |
rawlinp left a comment
I just did an initial pass over this PR and it generally seems good but I did notice a couple things:
- The indentation seems off in several places. Please make sure indentation is consistent with the rest of the file.
- New features should be clearly documented. We do have documentation around the Coverage Zone File that will need to be updated for the new field.
Question: if a CZ's backup list exists, but none of the backup CZs have caches available, what happens? After the backup list is exhausted, should we fall back to the original logic of finding the closest available cachegroup?
|
@rawlinp This backupList would be used to gain more control over which CGs can be a backup for a group. If a CG is not in the backup list, it won't be used, even if that means the request will be rejected. We specifically implemented this code to not fall back to getClosestAvailableCachegroup(). This allows operators to compartmentalize traffic within portions of their network. (For example, if all the east coast caches go down, bandwidth-limited backbone/cross-country links will not be overwhelmed by all the requests going from east clients to west-coast caches). (This was actually the driving reason for this feature and why the existing getClosest... isn't sufficient in some cases). |
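To illustrate the behavior described above, here is a minimal sketch of a strict, ordered backup lookup. It is not the code from this PR: the class name, the method name, and the map of available cache counts are all hypothetical stand-ins, and the group names in `main` are invented for the east/west example.

```java
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical sketch; none of these names come from Traffic Router.
final class BackupListSketch {
    /**
     * Returns the first group, in order, that has at least one available cache:
     * the matched group first, then each entry of its backup list.
     * Returns null once the list is exhausted, so the caller can bypass or
     * reject instead of searching for the closest available cachegroup.
     */
    static String selectGroup(String matched, List<String> backupList,
                              Map<String, Integer> availableCaches) {
        if (availableCaches.getOrDefault(matched, 0) > 0) {
            return matched;
        }
        for (String backup : backupList) {
            if (availableCaches.getOrDefault(backup, 0) > 0) {
                return backup;
            }
        }
        return null; // deliberately no closest-group fallback
    }

    public static void main(String[] args) {
        Map<String, Integer> available = new HashMap<>();
        available.put("EAST", 0);
        available.put("EAST-2", 0);
        available.put("WEST", 4);
        // WEST has caches but is not listed as a backup for EAST,
        // so no group is selected and cross-country links are spared.
        System.out.println(selectGroup("EAST", Arrays.asList("EAST-2"), available));
    }
}
```

The point of returning null rather than searching further is that the operator, not the router, decides which groups may absorb traffic from a failed group.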
|
Code looks good. Please also add a section to CHANGELOG.md at the top-level with a few line description of the feature and a link to the new docs. |
|
I have addressed the review comments. Please review and merge if everything looks good. |
There was a problem hiding this comment.
"where zone name does not map to a cg name in Traffic Ops"
This might be more clearly written as:
The ``"backupList"`` section is optional and is used by Traffic Router for localization in the case of a CZF
"hit" when there are no caches available for that DS in the matched cache group.
There was a problem hiding this comment.
Please also mention that the list is ordered.
Also, include that if there are no caches available in any of the listed groups, the request will be bypassed (if configured) or rejected
|
Hopefully last request. Can you please fix your commit author and email (currently they are root). Then redo the push (you might need a |
|
Thank you for guiding me throughout this. |
rawlinp left a comment
couple minor spacing/indentation issues, change section of release notes
There was a problem hiding this comment.
Since we currently have an RC for 2.2, this is most likely landing in 2.3. So I would add a new v2.3.0 [unreleased] section at the top of this file and add this subsection, #### Backup Edge Cache group, under a ### Release Notes header, i.e.:
v2.3.0 [unreleased]
-------------------
### Release Notes
#### my new feature
...
v2.2.0 [unreleased]
...
There was a problem hiding this comment.
missing space between parameters
There was a problem hiding this comment.
question: did you build the docs with this change to make sure the backupList note is contained within the same box as the coordinates note? The syntax highlighting here makes me think that the 2nd note might not be in the same box
There was a problem hiding this comment.
I built the docs and loaded the file in a browser. Didn't find any issue.
|
Can one of the admins verify this patch? |
|
ok to test |
|
Refer to this link for build results (access rights to CI server needed): |
|
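I think we want to allow the backupList to be used for compartmentalizing traffic but also provide a configurable option to fall back to the default behavior of finding the next closest cachegroup once all backupList cachegroups have been exhausted. Perhaps by default we will fall back to finding the next closest cachegroup if all backup cachegroups have been exhausted, but if the coverageZone has a field "fallbackToNextClosest": false, the request will be dropped if the backupList has been exhausted? Maybe the format changes to a "backupZones" object that holds the "list" and "fallbackToNextClosest" fields. I'm not sure it matters whether the default is true or false as long as it's documented. What do you guys think? |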
|
I believe we are achieving compartmentalization by avoiding the closest cache
group altogether. If a user wants to fall back to the closest cache group, they can
do that with a suitable coverage zone file that doesn't have the backup list.
…On Tue, Feb 27, 2018 at 11:45 PM, Rawlin Peters ***@***.***> wrote:
I think we want to allow the backupList to be used for compartmentalizing
traffic but also provide a configurable option to fallback to the default
behavior of finding the next closest cachegroup once all backupList
cachegroups have been exhausted. Perhaps by default we will fall back to
finding the next closest cachegroup if all backup cachegroups have been
exhausted, but if the coverageZone has a field "fallbackToNextClosest":
false the request will be dropped if the backupList has been exhausted?
Maybe the format changes to:
"GROUP2": {
    "backupZones": {
        "list": ["GROUP1"],
        "fallbackToNextClosest": false <-----default: true or false?
    },
    "network6": [
        "1234:567a::/64",
        "1234:567b::/64"
    ],
    "network": [
        "10.197.69.0/24"
    ]
}
I'm not sure it matters whether the default is true or false as long as
it's documented. What do you guys think?
|
|
@asfgit Requesting your input on fixing these. |
|
I think I agree with Rawlin. I understand the reason we have the backup list, and why using |
|
@Vijay-1 I think we could add this functionality for a pretty minor change to the logic in |
|
@dneuman64 If everyone agrees that this fallback configuration is required, I will get it done in this PR. |
|
I think the "All checks have failed" status just means it couldn't find the unit tests to run. I think it's just an issue with the build environment. @dangogh can you confirm? |
|
Refer to this link for build results (access rights to CI server needed): |
There was a problem hiding this comment.
Even if the backupZones list is empty and the operator sets useClosest=true, I think we should enter this loop because it's equivalent to trying to route to the backupZones when they are all down/unavailable.
There was a problem hiding this comment.
I think this condition in NetworkNode.java, where we parse and store the useClosest flag, will take care of this.
if (backupConfigNode != null && backupConfigNode.has("list")) {
...
}
There was a problem hiding this comment.
With this, the following config will fall back to the closest group if the backup list is empty:
"backupZones": {
    "list": [],
    "fallbackToClosestGroup": true
},
and won't fall back to the closest group with the following:
"backupZones": {
    "list": [],
    "fallbackToClosestGroup": false
},
There was a problem hiding this comment.
OK, I'm trying to simplify this conditional, because it's awkward to have opposite checks on both sides of an OR. Do we really need to check for networkNode.getBackupLoc() here?
There was a problem hiding this comment.
Got you. Let me fix this.
|
Refer to this link for build results (access rights to CI server needed): |
|
Refer to this link for build results (access rights to CI server needed): |
|
I'm good with this. @rawlinp Any more comments? |
rawlinp left a comment
apologies in advance, the code looks good, but I have a bunch of nitpicks mostly around variable naming and some minor cleanup
There was a problem hiding this comment.
just as a heads up (you shouldn't have to take any action), I might be reformatting the changelog later per the mailing list discussion about using the keepachangelog.com format.
There was a problem hiding this comment.
The changelog format has now changed to the keepachangelog format (there's now a link to that in the changelog), so you will need to rebase this branch onto the latest version of master and update these notes to use the new format.
There was a problem hiding this comment.
I would break this last sentence up to make it clearer (starting from "This backup list..."):
This backup "list" contains an ordered list of backup groups to choose from if the matched cache group has no caches available for a requested DS. If an available cache cannot be found in any of the backup groups either, the "fallbackToClosestGroup" flag determines the Traffic Router's following behavior. If true, Traffic Router will find the next closest cache group with available caches. If false (the default), Traffic Router will bypass the request (if configured) or reject it.
There was a problem hiding this comment.
kind of a nitpick but can we call this variable backupConfigJson? All the Node stuff is confusing with JsonNode vs NetworkNode.
There was a problem hiding this comment.
I think this variable is unneeded; see comment on L131
There was a problem hiding this comment.
cachegroup rather than network?
There was a problem hiding this comment.
cachegroup rather than element
There was a problem hiding this comment.
in this case, it wasn't from backupCzGroup but the default behavior of nextClosest. Should we just omit this if-block?
There was a problem hiding this comment.
I believe anything other than an exact match on the subnet is a backup (i.e., both coordinates and Backup Cache Groups are fallback mechanisms), and hence I feel this if-block should be fine.
There was a problem hiding this comment.
I think we should just initialize useClosestOnBackupFailure to true on L113 and omit this if-else block. It doesn't change behavior; it would just be cleaner because it will be set to false in the previous block if configured anyway.
There was a problem hiding this comment.
I think these two lines can be replaced with:
useClosestOnBackupFailure = JsonUtils.optBoolean(backupConfigJson, "fallbackToClosestGroup", false);
Then you also don't need the fallbackNode intermediate variable
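As a rough illustration of the parsing being discussed, the sketch below reads a "backupZones" section with a default-valued boolean lookup. It is only a sketch under stated assumptions: plain `Map` and `List` objects stand in for the parsed JSON, the local `optBoolean` helper merely imitates the call suggested above and is not the real `JsonUtils` method, and the class name is hypothetical. It assumes the flag stays true when no "backupZones" section exists and defaults to false once one is configured.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical sketch: plain Maps and Lists stand in for parsed JSON nodes.
final class BackupConfigSketch {
    final List<String> backupList;
    final boolean useClosestOnBackupFailure;

    private BackupConfigSketch(List<String> backupList, boolean useClosestOnBackupFailure) {
        this.backupList = backupList;
        this.useClosestOnBackupFailure = useClosestOnBackupFailure;
    }

    // Returns the boolean stored under key, or defaultValue if it is absent or not a boolean.
    static boolean optBoolean(Map<?, ?> json, String key, boolean defaultValue) {
        Object value = json.get(key);
        return (value instanceof Boolean) ? (Boolean) value : defaultValue;
    }

    static BackupConfigSketch parse(Map<?, ?> coverageZone) {
        Object raw = coverageZone.get("backupZones");
        if (!(raw instanceof Map)) {
            // No "backupZones" section: keep the closest-group behavior.
            return new BackupConfigSketch(Collections.<String>emptyList(), true);
        }
        Map<?, ?> backupConfig = (Map<?, ?>) raw;
        List<String> names = new ArrayList<>();
        Object list = backupConfig.get("list");
        if (list instanceof List) {
            for (Object name : (List<?>) list) {
                names.add(String.valueOf(name)); // order of the list is preserved
            }
        }
        // Once backupZones is configured, falling back is opt-in.
        return new BackupConfigSketch(names,
                optBoolean(backupConfig, "fallbackToClosestGroup", false));
    }

    public static void main(String[] args) {
        Map<String, Object> backupZones = new HashMap<>();
        backupZones.put("list", new ArrayList<String>());
        Map<String, Object> zone = new HashMap<>();
        zone.put("backupZones", backupZones);
        System.out.println(parse(zone).useClosestOnBackupFailure); // flag omitted

        backupZones.put("fallbackToClosestGroup", true);
        System.out.println(parse(zone).useClosestOnBackupFailure); // flag set explicitly
    }
}
```

Reading the flag with a default in one call removes the need for an intermediate node variable and for a separate if-else that assigns the default.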
|
Refer to this link for build results (access rights to CI server needed): |
|
@rawlinp, I have addressed all your comments except the removal of the if-block that tags cache group selection based on coordinates as backup. I believe anything other than an exact match on the subnet is a backup/fallback (i.e., both coordinates and Backup Cache Groups are fallback mechanisms), and hence I feel this if-block should be fine. |
|
Refer to this link for build results (access rights to CI server needed): |
|
@Vijay-1 thanks for addressing my comments! I pulled this PR down to run some test scenarios against it, and this case was unexpected:
I would've expected item 3 to cause the client to be bypassed (if configured) or rejected. With your current code this only happens if the Delivery Service is set to CZ-only. |
There was a problem hiding this comment.
Rather than just returning null here, I think we have to return more information to the caller so that the caller knows the difference between "null because there wasn't a hit in the CZF" and "null because there was a hit in the CZF but we're not returning a CacheLocation due to the "backupZones" config".
There was a problem hiding this comment.
@rawlinp This test case and its results look fine to me. With Geo Limit set to any value other than CZF only (for example, None), TR will go all out to find a route, and in doing so it will find a route via 'GEO' since the DS is not limited to CZF only.
As for returning null, maybe we can add one more rdtl field to denote backup zone failure. But I think it is redundant, since we have enough tagging in the access logs to differentiate between selecting caches from the CZ and from the CZ's backup configuration in successful cases.
There was a problem hiding this comment.
I disagree, one of the reasons for specifying backupZones is to confine traffic to certain cachegroups when the matched cachegroups are unavailable. In this case we're getting a hit in the CZ (so we know where the client is in the network), that cachegroup is unavailable, but the client will still fall back to a Geo-lookup (even though from the CZ-hit we already know where the client is), best-case scenario get matched to the same cachegroup, find the closest available cachegroup, and altogether disregard the "confine traffic to certain cachegroups" rule in the CZF.
There was a problem hiding this comment.
"Confine traffic to certain cache groups" is defined by the CZF. So, if we remove that limitation (a Geo Limit value other than CZF only), TR can fetch a matching cache using a Geo lookup. I am still not clear on why TR should reject this when the configuration says None as the Geo Limit.
There was a problem hiding this comment.
Geo is generally the backup case when a client's IP cannot be found in the CZF. If the client is found in the CZF, we shouldn't be falling back to a Geo-lookup which subverts the CZF's backupZones config. We know the client is in our network, so we want it to follow the rules specified in the CZF.
If the client was not found in the CZF, then it should most likely not be in our network. In that case, the congestion might not be a problem because the client should be entering our network from somewhere else.
There was a problem hiding this comment.
If the client was found in the CZF, but the caches in that zone are down (and there's no backupList AND fallthrough == true), then we should try to use the coordinates of the matching zone to find the next closest cache group. If there are no coordinates, I think the current behavior is to try geolocation, even on a client that was in the CZF.
There was a problem hiding this comment.
I agree with that, but the scenario I'm concerned about is when fallbackToNextClosest == false. In that case I don't think the request should fallback to Geo when the client was found in the CZF but the matched cachegroup has no caches available. In the best-case geo-lookup that client would still be traversing the same network path to the next available cachegroup (which is what the backup list is supposed to prevent).
There was a problem hiding this comment.
OK, I agree that when fallbackToNextClosest == false we should NOT geolocate.
|
Refer to this link for build results (access rights to CI server needed): |
There was a problem hiding this comment.
Ideally this line should just pass track into this method and let the method default useDeep to false.
There was a problem hiding this comment.
is it possible to change the method signature to have this be getCoverageZoneCacheLocation(request.getClientIP(), ds, track); so that we're not passing useDeep explicitly as false? useDeep should default to false.
There was a problem hiding this comment.
I think this can remain a simple else if the logic in my comment above is implemented.
There was a problem hiding this comment.
maybe add DS_CZ_NO_BKP to signify that the matched cachegroup and its configured backup cachegroups were unavailable.
There was a problem hiding this comment.
Following from my suggestions above, this line would probably be:
} else {
track.SetNoBackupLocationsAvailable(true);
return null;
}
Then I'm not sure if we can also remove the return null statement below this.
There was a problem hiding this comment.
Generally I'm not a big fan of polluting this method with extra NetworkNode stuff, especially since getNetworkNode is already called in getCoverageZoneCacheLocation. I think ideally getCoverageZoneCacheLocation should return us the information we need.
Since we're passing in the Track object, we could probably add that information there, ie something like track.noBackupLocationsAvailable() which would get set within getCoverageZoneCacheLocation. Then we can update the logic starting on L277 to something like this:
if (track.noBackupLocationsAvailable() || ds.isCoverageZoneOnly()) {
    if (ds.geoRedirect) {...}
    else {
        resultDetails = ds.isCoverageZoneOnly() ? DS_CZ_ONLY : DS_CZ_NO_BKP;
        ...
    }
}
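For illustration only, the decision suggested above can be modeled as a small pure function. This is a hedged sketch, not the PR's code: `DS_CZ_ONLY` and `DS_CZ_NO_BKP` are the names mentioned in this review, while the class, the method, and the `GEO_REDIRECT_BRANCH` and `TRY_GEO_LOOKUP` outcomes are hypothetical placeholders for the elided branch bodies and for the remaining lookup path.

```java
// Hypothetical sketch; only DS_CZ_ONLY and DS_CZ_NO_BKP are names taken from the review.
final class CzResultSketch {
    enum Outcome { DS_CZ_ONLY, DS_CZ_NO_BKP, GEO_REDIRECT_BRANCH, TRY_GEO_LOOKUP }

    /**
     * Decides what happens when the coverage zone lookup yields no cache location.
     *
     * @param noBackupLocationsAvailable set when the CZF matched but the matched
     *        group and its configured backups had no available caches
     * @param coverageZoneOnly whether the Delivery Service is limited to the CZF
     * @param geoRedirect stands in for the ds.geoRedirect check in the suggestion
     */
    static Outcome decide(boolean noBackupLocationsAvailable,
                          boolean coverageZoneOnly,
                          boolean geoRedirect) {
        if (noBackupLocationsAvailable || coverageZoneOnly) {
            if (geoRedirect) {
                return Outcome.GEO_REDIRECT_BRANCH;
            }
            return coverageZoneOnly ? Outcome.DS_CZ_ONLY : Outcome.DS_CZ_NO_BKP;
        }
        // The client was not confined by the CZF, so a geo lookup may proceed.
        return Outcome.TRY_GEO_LOOKUP;
    }

    public static void main(String[] args) {
        // CZF hit, backups exhausted, Geo Limit "None": no geo lookup is attempted.
        System.out.println(decide(true, false, false));
    }
}
```

Keeping the flag on the tracking object lets the caller distinguish a CZF miss from a CZF hit whose backups were exhausted without a second network node lookup.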
|
Refer to this link for build results (access rights to CI server needed): |
- Addressed the following review comments: 1. Added documentation 2. Updated CHANGELOG.md 3. Fixed indentation
- Addressed review comments. Commit using proper Author fields.
- Addressed Rawlin's comments
- Fixing Note's section as per Rawlin's comment.
- Addressing Rawlin's comments on fallback configuration on backupZones.
- Rewrote the ambiguous if condition check based on Eric's comments
- Fixing unit test case failure.
- Addressing Rawlin's comments on: 1. Naming convention 2. Avoiding use of variable 'fallbackNode' 3. Documentation
- Documentation update for Traffic Router's access logs rdtl field
- Addressing Rawlin's comment on expanding the scope of backupZones's fallbackToClosestGroup configuration for all Geo Limit values.
- Use Track to record network node's configuration across.
|
Refer to this link for build results (access rights to CI server needed): |
|
Refer to this link for build results (access rights to CI server needed): |
|
@rawlinp
|
rawlinp left a comment
reviewed the changelog first, will continue reviewing the rest of the PR shortly
@@ -1,13 +1,18 @@
# Changelog
All notable changes to this project will be documented in this file.
=======
There was a problem hiding this comment.
this line is unnecessary, it turns the line above into a large header
### Added
- Per-DeliveryService Routing Names: you can now choose a Delivery Service's Routing Name (rather than a hardcoded "tr" or "edge" name). This might require a few pre-upgrade steps detailed [here](http://traffic-control-cdn.readthedocs.io/en/latest/admin/traffic_ops/migration_from_20_to_22.html#per-deliveryservice-routing-names)

- Backup Edge Cache group: Backup Edge group for a particular cache group can be configured using coverage zone files. With this, users will be able control traffic within portions of their network, there by avoiding choosing fall back cache groups using geo co-ordinates(getClosestAvailableCachegroup) This can be controlled using "backupZones" which contains configuration for backup cache groups parsed from Coverage zone file as explained [here] (http://traffic-control-cdn.readthedocs.io/en/latest/admin/traffic_ops/using.html#the-coverage-zone-file-and-asn-table)
There was a problem hiding this comment.
A few things here:
- Based on the keepachangelog format, there shouldn't be blank lines between bullet points, so this line should be moved up.
- "thereby" not "there by", "fallback" not "fall back", "coordinates" not "co-ordinates"
- A period is needed after "(getClosestAvailableCachegroup)"
- "the Coverage Zone File" not "Coverage zone file"
- Also, the [here] (http://...) link doesn't work if there's a space between [here] and the link (http://..)
- Reformatted this CHANGELOG file to the keep-a-changelog format

[Unreleased]: https://github.com/apache/incubator-trafficcontrol/compare/RELEASE-2.1.0...HEAD
-[Unreleased]: https://github.com/apache/incubator-trafficcontrol/compare/RELEASE-2.1.0...HEAD
There was a problem hiding this comment.
the change here breaks the [Unreleased] link above, this shouldn't be changed
- Backup Edge Cache group: Backup Edge group for a particular cache group can be configured using coverage zone files. With this, users will be able control traffic within portions of their network, there by avoiding choosing fall back cache groups using geo co-ordinates(getClosestAvailableCachegroup) This can be controlled using "backupZones" which contains configuration for backup cache groups parsed from Coverage zone file as explained [here] (http://traffic-control-cdn.readthedocs.io/en/latest/admin/traffic_ops/using.html#the-coverage-zone-file-and-asn-table)
There was a problem hiding this comment.
based on the keepachangelog format there should only be one blank line between sections
}
} else {
} else if (track.isUseNextClosest()) {
//Even with Geo limit none, TR wont do geo look-up, if fallback is diabled via backupZones configuration
|
@rawlinp |
|
Refer to this link for build results (access rights to CI server needed): |
|
I have successfully executed the following test cases:
With Geo Limit 'None':
|
|
Refer to this link for build results (access rights to CI server needed): |
|
This PR will be closed and a new one based on the review comments will be created. |
|
New PR: #2029 |
This PR implements a solution for issue #1907.
It places the backup policy in the CZF:
{
    "coverageZones": {
        "GROUP2": {
            "backupList": ["GROUP1"],
            "network6": [
                "1234:567a::/64",
                "1234:567b::/64"
            ],
            "network": [
                "10.197.69.0/24"
            ]
        },
        "GROUP1": {
            "backupList": ["GROUP2"],
            "network6": [
                "1234:5677::/64",
                "1234:5676::/64"
            ],
            "network": [
                "10.126.250.0/24"
            ]
        }
    }
}
The following test cases have been executed successfully for both DNS and HTTP routing:
Test Setup
- GROUP1: two cache servers
- GROUP2: one cache server
GEO Limit set to CZF only
- Request from GROUP1's subnet with one server in GROUP1 down; make sure the request is redirected to the remaining server in GROUP1.
- Request from GROUP1's subnet with both servers in GROUP1 down; make sure the request is redirected to GROUP2.
Regression cases
Set GEO Limit to None without a backup list configured in the CZF
- Request from GROUP1's subnet with both servers in GROUP1 down; make sure the request is redirected to GROUP2.