Blueprint/refetch#5910
Conversation
Just to be clear: if the value of invalidationType is malformed or of the wrong data type, it will result in a 400; otherwise, if it is missing, or any value other than refetch or refresh, it will always default to refresh, correct?
Oh excellent question. What are your thoughts? I suppose I would probably make this a bit more strict.
To be explicitly correct, the caller would have to send either refetch, refresh, or nothing at all. By that I mean they would not include the invalidationType key at all (which would then default to refresh). If they did include the key and the value was an empty string, that would result in a 400. In fact, any other value would result in a 400. I think this is probably the best approach. The response can contain a helpful message as to why it was rejected, to assist any callers in the future.
Should it be case insensitive?
Yeah I think if the user includes the invalidationType key and gives a blank string, that could probably be treated as refresh as well, but I'm new to invalidation jobs so maybe @ocket8888 would know better.
The following four types should be accepted IMO:
1.) refresh
2.) refetch
3.) blank string -> defaults to refresh
4.) invalidationType key absent -> defaults to refresh
Anything else is treated as a 400.
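The four-way rule proposed at this point in the thread (later tightened elsewhere in the discussion) could be sketched like this. This is an illustrative helper, not the blueprint's actual code; the function name and case-insensitive handling are assumptions:

```go
package main

import (
	"fmt"
	"strings"
)

// normalizeInvalidationType sketches the rules listed above: "refresh",
// "refetch", a blank string, or an absent key (nil) are accepted, with
// blank/absent defaulting to "refresh"; anything else is rejected (a 400).
func normalizeInvalidationType(raw *string) (string, error) {
	// Absent key (nil pointer) or blank string: default to "refresh".
	if raw == nil || *raw == "" {
		return "refresh", nil
	}
	// Treat the value case-insensitively, per the question above.
	switch v := strings.ToLower(*raw); v {
	case "refresh", "refetch":
		return v, nil
	default:
		return "", fmt.Errorf("invalid invalidationType %q: must be 'refresh' or 'refetch'", *raw)
	}
}
```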
You absolutely can do that - I would suggest instead that the field not be optional, and the client creates objects exactly as they desire them to exist with no room for surprises.
yeah, surprises are not good imo. so i guess 400 if invalidationType != refresh or refetch. and because you are adding this to a new major version of the api (4.0), you have the luxury of adding the new required field to the request...but tbh, if everyone likes the idea of a default value, i'm good with that too :)
Just because we have a new major version we can add this to doesn't mean we should just make every new change a breaking change. I don't think making this a new required field adds enough value to warrant that since we can easily provide a default instead.
I've modified this to be explicitly either refresh or refetch. All other submissions, including missing/omitted values, will be rejected as a client error (400 Bad Request).
> #### Data Model / Database Impact
>
> The current column `parameters` will now contain a CSKV (comma-separated key-value) string. Currently it stores only the `TTL` for the invalidation request:
Which table is this in?
Why not add a new column rather than storing multiple values in the same column?
agree. stuffing lots of values in a column seems problematic going forward. can't you just reuse the keyword column of the job table? currently, keyword is PURGE for EVERY job. maybe keyword should be REFRESH and REFETCH instead and do a database migration to change all existing PURGE values to REFRESH?
I've pivoted a bit from this original idea. My first attempt was to change as little as possible, reusing the parameters column. However, I believe we all agree that being explicit and purposeful is a better approach, so I've taken the approach of explicit columns (and data types). I also added a recommendation, which I think we should implement, that performs a great deal of cleanup for the jobs table. We are carrying a decent amount of future-proofing along with us that doesn't appear to be utilized anywhere.
Definitely let me know if this newer proposal is lacking in clarity.
> > The removal of these columns will impact the `UserInvalidationJob` struct. Even though the only endpoint utilizing this struct (v1.1) is deprecated, it is still in use.
>
> ### ORT/T3C Impact
Can you please describe how the construction of the regexrevalidate.config file will change given that this an external interface of Traffic Control?
I've added a bit more of a description. There isn't much more to the construction of T3C's regexrevalidate.config file: this is already implemented; we are simply allowing it to be explicitly controlled via ATC's APIs, interfaces, and clients.
Essentially, it remains the same format, but with an optional third field for each non-comment entry.
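As a hypothetical illustration of that format (the exact keyword for the third field is an assumption here, beyond the MISS value mentioned later in this thread), a non-comment regexrevalidate.config entry keeps its current two-field form, with an optional third field selecting the invalidation type:

```
# <regex> <expiry-epoch> [<type>]
http://amc-linear-origin.local.tld/path/.*\.jpeg 1622733801
http://amc-linear-origin.local.tld/path/.*\.jpeg 1622733801 MISS
```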
> ### ORT/T3C Impact
>
> The `regexrevalidatedotconfig.go` file in _lib/tc-atscfg_ currently has a function called `filterJobs` that is responsible for parsing the _parameters_ information from the job. Right now it parses only the **TTL**; it would also need to parse the optional extra field, **invalidationType**.
What considerations are made for TrafficServer version dependencies?
Is MISS supported by all ATS versions? If not, what happens when an operator sets a REFETCH job on a set of caches that do not support that invalidation type?
Seems like there are two options when MISS is unavailable based on ATS version: ignore it or use refresh. Can you suggest which would be better?
I'm not an ATS expert, so I can't make a solid suggestion here.
I think whatever happens probably needs to be bubbled up to the operator. You don't want to silently ignore or silently do a stale-refresh when you expect a Miss-Refetch.
Well, when I need an ATS and/or ORT/t3c/atstccfg expert I ask @rob05c - lets see if he has an opinion
It's definitely not supported on older versions. It's a new feature of the regex_revalidate plugin.

Unfortunately, there's not an easy way to bubble it up. We don't really have a mechanism to send data from caches to TO, and even if we did, it'd be pretty unwieldy to collect that from every single cache.

We do have the ATS version in a Parameter in TO, and we could hard-code it. But someone might backport the plugin changeset to an older version, and then ATC just won't allow using it. So that's not a great solution.

The "refetch" feature is only necessary or useful if the Origin server is fundamentally broken and completely violating the HTTP spec. Without "refetch," ATS does an IMS request. So the only reason it's ever needed is when a Tenant's Origin is responding with a 304 when the asset actually is modified, in complete violation of RFC 7234 and every HTTP caching specification. But of course, we deal with horribly broken Origins every day. I do think we should make that clear in the docs, and maybe the UI.

To work correctly, every single Cache has to support it. Which is mostly fine, since CDN operators don't usually operate with widely varying ATS versions. Some do, but I think the majority have a single version and maybe some newer Canaries.

With those three things in mind, I think the best solution might be to make it a config Feature Flag in Traffic Ops/+Portal, with documentation that they must have version X of regex_revalidate. As an additional safety, we should also make t3c check and log an error (which operators can then Alarm on).

It's not as good as bubbling it up, but considering all those things, that's the best solution I'm seeing at the moment.
It sounds like you're saying you set the flag in TO, then ORT/t3c/atstccfg looks at it and determines based on that what to do with jobs that use "refetch" - which makes sense, because otherwise I suppose I don't know how t3c could "check and log an error"
Yes, that is how use_reval_pending works currently and is how I imagine a feature flag for this feature would work as well.
When I said "feature flag" I was thinking the TO configuration file; but a global Parameter also works.
I thought it'd be like: you make a new job, if it's using "refetch" then TO checks if the feature flag is allowed. If it isn't, you can't make that job
That's what I was thinking, but also having a check in t3c-apply as a fail-safe, if it ever does get a job and the installed regex_revalidate is too old to understand it, it would log an error (but probably still do the reval, because an IMS reval is better than nothing and doesn't hurt if they wanted a no-IMS reval).
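That t3c-apply fail-safe could be sketched roughly as follows. The function name, the boolean capability check, and the log wiring are all illustrative assumptions, not actual t3c code:

```go
package main

// applyInvalidation sketches the fail-safe described above: if t3c sees a
// refetch job but the installed regex_revalidate plugin is too old to
// understand it, it logs an error and still performs a plain (IMS)
// revalidation, since that is better than dropping the job entirely.
func applyInvalidation(invalidationType string, pluginSupportsRefetch bool, logErr func(string)) string {
	if invalidationType == "refetch" && !pluginSupportsRefetch {
		logErr("refetch job requested but regex_revalidate does not support it; falling back to refresh")
		return "refresh"
	}
	return invalidationType
}
```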
We need to not allow creating those jobs at all, if the feature-flag isn't set. Because by the time it gets to the cache, it's too late: the user put in jobs, that they expected to work, and now we have an error we can't reasonably propagate up to them that the job won't work.
But we already knew it wasn't going to work, we have the flag in TO. We should be able to return an error from the /jobs POST.
Bearing in mind, these won't be intermittent errors. The endpoint will always return an error if the flag isn't set.
That's a good reason to put the flag in the Parameter instead of the config file: the Param is already available in the API. Which lets UIs like the Portal fetch it, to decide whether to hide that extra button from users, so they never even see the option that will always error.
I've added some language specifically stating the addition of a global Parameter, defaulting to rejecting refetch POSTs until specifically set to allow them.
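The TO-side gate described above could be sketched like this. The Parameter name and how it gets resolved are assumptions, not the blueprint's actual identifiers; the point is that the /jobs POST is rejected up front, rather than producing a job the caches silently cannot honor:

```go
package main

import "errors"

// checkRefetchAllowed rejects refetch job creation outright when the global
// Parameter is unset, so the caller gets an immediate error at POST time.
func checkRefetchAllowed(invalidationType string, refetchParamEnabled bool) error {
	if invalidationType == "refetch" && !refetchParamEnabled {
		return errors.New("refetch invalidation jobs are not enabled; set the global refetch Parameter to allow them")
	}
	return nil
}
```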
> ## Problem Description
>
> Currently, within ATC, there is a concept of Invalidation Jobs. These Invalidation Jobs give a user the ability to queue an invalidation for a resource, primarily based on regular expressions. The invalidation is gathered and treated as though there was a cache **STALE**, allowing the cache to query the origin server to **REFRESH** the resource. However, should the cache policy still be incorrect or misconfigured, the resource could be updated on the origin server while the cache still receives a 304 Not Modified HTTP response.
so i don't think the cache queries the "origin server" per se. instead it queries the "parent server", which in some cases will be a parent mid (if the cache is an edge) or the origin server (if the cache is a mid), and is simply the "parent server" when a topology is employed by a delivery service (where there can be N tiers).
Hm, this is a really good point and observation. My understanding is that this all revolves entirely around misconfigured origins, and even more specifically around origins we can't "fix". If it was an issue within the CDN topology, would we be able to manually intervene in the chain?
That is the purpose for introducing the feature, but as I understand it the implementation will have the same effects on any requests made by a cache server to its parent for the affected content - regardless of whether or not it's an origin.
I modified the original description to reference a CDN rather than a cache. Also added clarity that this could be a parent along the chain.
> ##### Read
>
> When displaying the information, the **Invalidation Requests** table currently shows the `Parameters` field, so the display will be straightforward, with no manipulation.
>
> However, we derive and calculate the expiration field based on the TTL. This code will need to be modified to account for the additional information contained in the `Parameters` field.
tbh i think the parameters column was a poor design (there should have simply been a TTL column) and we should probably not follow that approach. imo i would suggest hijacking the keyword column and maybe renaming it to invalidation_type or simply type with 2 acceptable values: refresh and refetch
I agree, any jobs that don't have a numeric TTL parameter will fail to operate properly; I think a migration could reasonably parse it and change it to a numeric TTL column, since if the migration fails it means ORT is failing in some regard.
I've modified the proposal to be more explicit, rather than the first attempt (which was to make the most minimal changes). Relish the feedback.
>     {
>       "alerts":[
>         {
>           "text":"Invalidation request created for http://amc-linear-origin.local.tld/path/.*\\.jpeg, start:2021-06-02 15:23:21.348 +0000 UTC end 2021-06-03 15:23:21.348 +0000 UTC",
How about changing the response text to "Invalidation request (refetch) created for..."
I added some clarity to the alert message response below.
> This struct now contains the `InvalidationType *string` field.
>
> ```go
> type InvalidationJobInput struct {
> 	DeliveryService *interface{} `json:"deliveryService"`
> ```
This was a *interface{} in APIv3 to give a transition period between using Delivery Service ID (APIv2 and earlier) to XMLID (hopefully all versions moving forward) - so this could be a string. It's not really related to the changes we're making, but it's something that could be cleaned up since we're making structure changes anyway.
That would also eliminate the need for the unexported dsid and the hoops you need to jump through for that.
This is a good catch, and I believe it should be documented at the very least. Same with the TTL below as well. I'm happy to add this as part of the refactor effort, especially since I expect more - than + :)
> ```go
> type InvalidationJobInput struct {
> 	DeliveryService *interface{} `json:"deliveryService"`
> 	Regex           *string      `json:"regex"`
> ```
regex isn't actually optional, right? Can this just be a string instead of a *string?
Another good observation. What is actually optional? Any of the fields? We can make them all explicitly required as well, if you believe that's a benefit at this time.
I don't think any of the properties were optional, but it's easier to tell for strings than other things if a value was actually given. For example "regex": "" is just as invalid as "regex": null or a missing "regex" property, and Go's parser can't tell the difference between those cases for a string property. However, for something numeric, Go can't tell the difference between 0, null and undefined values for an int property, but sometimes 0 is a valid value, so you have to use *int. It's a very annoying and pedantic balancing act you have to perform with the number of things you have to check for nil all the time versus what the Go parser can and can't distinguish between.
Historically, though, the reason these fields are all pointers is because originally this was going to be the first endpoint to allow PATCH requests, where every property is optional by definition. That didn't pan out because there was prerequisite work that needed to be done to support PATCH safely, but I guess I never changed the fields back from pointers.
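The distinction described above can be demonstrated concretely. With a plain `string` field, Go's JSON decoder yields the same zero value for `""`, `null`, and a missing key; with a `*string`, a provided empty string is distinguishable (non-nil pointer), though `null` and a missing key both decode to nil. The struct names here are illustrative:

```go
package main

import "encoding/json"

// plainInput uses a value field: "", null, and missing all decode to "".
type plainInput struct {
	Regex string `json:"regex"`
}

// pointerInput uses a pointer field: null and missing decode to nil, while a
// provided "" decodes to a non-nil pointer to an empty string.
type pointerInput struct {
	Regex *string `json:"regex"`
}

// describe decodes the same document both ways and reports what each sees.
func describe(doc string) (plainVal string, ptrIsNil bool) {
	var p plainInput
	var q pointerInput
	// Errors ignored for brevity; the documents used here are valid JSON.
	json.Unmarshal([]byte(doc), &p)
	json.Unmarshal([]byte(doc), &q)
	return p.Regex, q.Regex == nil
}
```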
Cool info on the PATCH. That makes sense. I've gone ahead and made both DeliveryService and Regex non-optional (required, for brevity?).
>     type InvalidationJobInput struct {
>     	DeliveryService  *interface{} `json:"deliveryService"`
>     	Regex            *string      `json:"regex"`
>     	InvalidationType *string      `json:"invalidationType,omitempty"`
I don't really like optional properties, but also why omitempty? That won't actually be allowed to be empty in the database, right?
Removed omitempty. I originally thought we could aim for backwards compatibility. Also, even as an explicit column in the DB, I figured it would default to NULL. However, with the newer changes this will be explicit and no longer optional.
>     	Regex            *string      `json:"regex"`
>     	InvalidationType *string      `json:"invalidationType,omitempty"`
>     	StartTime        *Time        `json:"startTime"`
>     	TTL              *interface{} `json:"ttl"`
Does anyone really like being able to put in "duration strings"? I don't think that's as useful as I thought it was. JSON and Javascript numbers are IEEE double-precision floating point numbers, which means that as long as we don't need jobs to last longer than 290 years this could just be a time.Duration which is a duration in nanoseconds.
That would also eliminate the need for the unexported ttl and the hoops you need to jump through for that.
The API call implementations (whether Go clients or TP), from what I can gather, are sending up JS numbers. We can certainly enforce a numeric data type here, and even if we stick with the hours format, it would still be clearer than an empty interface.
- RFC3339 for Time format
- Explain impact for T3C
- Explicitly set 'refresh' or 'refetch' for all API interaction
- Minor formatting fixes
- Updated InvalidationJobInput fields
- Incorporate global Parameter check
- Add clarity around parent vs origin
What does this PR (Pull Request) do?
This is a blueprint proposal to solve a few issues; however, the primary one is:
Which Traffic Control components are affected by this PR?
This blueprint has the potential to impact:
What is the best way to verify this PR?
No verification necessary for the blueprint. Just review the file `blueprints/refetch_invalidation.md`.