Blueprint/refetch#5910
Conversation
Just to be clear: if the value of invalidationType is malformed or of the wrong data type, it will result in a 400; otherwise, if it is missing, or any value other than refetch or refresh, it will always default to refresh, correct?
Oh excellent question. What are your thoughts? I suppose I would probably make this a bit more strict.
To be explicitly correct, the caller would have to send either refetch, refresh, or nothing at all. By that I mean they would not include the invalidationType key at all (which would then default to refresh). If they did include the key and the value was an empty string, that would result in a 400. In fact, any other value would result in a 400. I think this is probably the best approach. The response can contain a helpful message as to why it was rejected, to assist any callers in the future.
Should it be case insensitive?
Yeah I think if the user includes the invalidationType key and gives a blank string, that could probably be treated as refresh as well, but I'm new to invalidation jobs so maybe @ocket8888 would know better.
The following four types should be accepted IMO:
1.) refresh
2.) refetch
3.) blank string -> defaults to refresh
4.) invalidationType key absent -> defaults to refresh
Anything else is treated as a 400.
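The four-way rule proposed at this point in the thread (later tightened elsewhere in the discussion) could be sketched like this. This is an illustrative helper, not the blueprint's actual code; the function name and case-insensitive handling are assumptions:

```go
package main

import (
	"fmt"
	"strings"
)

// normalizeInvalidationType sketches the rules listed above: "refresh",
// "refetch", a blank string, or an absent key (nil) are accepted, with
// blank/absent defaulting to "refresh"; anything else is rejected (a 400).
func normalizeInvalidationType(raw *string) (string, error) {
	// Absent key (nil pointer) or blank string: default to "refresh".
	if raw == nil || *raw == "" {
		return "refresh", nil
	}
	// Treat the value case-insensitively, per the question above.
	switch v := strings.ToLower(*raw); v {
	case "refresh", "refetch":
		return v, nil
	default:
		return "", fmt.Errorf("invalid invalidationType %q: must be 'refresh' or 'refetch'", *raw)
	}
}
```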
You absolutely can do that - I would suggest instead that the field not be optional, and the client creates objects exactly as they desire them to exist with no room for surprises.
yeah, surprises are not good imo. so i guess 400 if invalidationType != refresh or refetch. and because you are adding this to a new major version of the api (4.0), you have the luxury of adding the new required field to the request...but tbh, if everyone likes the idea of a default value, i'm good with that too :)
Just because we have a new major version we can add this to doesn't mean we should just make every new change a breaking change. I don't think making this a new required field adds enough value to warrant that since we can easily provide a default instead.
I've modified this to be explicitly either refresh or refetch. All other submissions, including missing/omitted values, will be rejected as a client error (400 Bad Request).
> #### Data Model / Database Impact
>
> The current column `parameters` will now contain a CSKV (comma-separated key-value) string. Currently it stores only the `TTL` for the invalidation request:
Which table is this in?
Why not add a new column rather than storing multiple values in the same column?
agree. stuffing lots of values in a column seems problematic going forward. can't you just reuse the keyword column of the job table? currently, keyword is PURGE for EVERY job. maybe keyword should be REFRESH and REFETCH instead and do a database migration to change all existing PURGE values to REFRESH?
I've pivoted a bit from this original idea. My first attempt was to change as little as possible, reusing the parameters column. However, I believe we all agree that being explicit and purposeful is a better approach, so I've taken the approach of explicit columns (and data types). I also added a recommendation, which I think we should implement, that performs a great deal of cleanup for the jobs table. We are carrying a decent amount of future-proofing along with us that doesn't appear to be utilized anywhere.
Definitely let me know if this newer proposal is lacking in clarity.
> > The removal of these columns will impact the `UserInvalidationJob` struct. Even though the only endpoint utilizing this struct (v1.1) is deprecated, it is still in use.
>
> ### ORT/T3C Impact
Can you please describe how the construction of the regexrevalidate.config file will change given that this an external interface of Traffic Control?
I've added a bit more of a description. There isn't much more to the construction of T3C's regexrevalidate.config file: this is already implemented; we are simply allowing it to be explicitly controlled via ATC's APIs, interfaces, and clients.
Essentially, it remains the same format, but with an optional third field for each non-comment entry.
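As a hypothetical illustration of that format (the exact keyword for the third field is an assumption here, beyond the MISS value mentioned later in this thread), a non-comment regexrevalidate.config entry keeps its current two-field form, with an optional third field selecting the invalidation type:

```
# <regex> <expiry-epoch> [<type>]
http://amc-linear-origin.local.tld/path/.*\.jpeg 1622733801
http://amc-linear-origin.local.tld/path/.*\.jpeg 1622733801 MISS
```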
> ### ORT/T3C Impact
>
> The `regexrevalidatedotconfig.go` file in _lib/tc-atscfg_ currently has a function called `filterJobs` that is responsible for parsing the _parameters_ information from the job. Right now it parses only the **TTL**; it would also need to parse the optional extra field, **invalidationType**.
What considerations are made for TrafficServer version dependencies?
Is MISS supported by all ATS versions? If not, what happens when an operator sets a REFETCH job on a set of caches that do not support that invalidation type?
Seems like there are two options when MISS is unavailable based on ATS version: ignore it or use refresh. Can you suggest which would be better?
I'm not an ATS expert, so I can't make a solid suggestion here.
I think whatever happens probably needs to be bubbled up to the operator. You don't want to silently ignore or silently do a stale-refresh when you expect a Miss-Refetch.
Well, when I need an ATS and/or ORT/t3c/atstccfg expert I ask @rob05c - lets see if he has an opinion
It's definitely not supported on older versions. It's a new feature of the regex_revalidate plugin.

Unfortunately, there's not an easy way to bubble it up. We don't really have a mechanism to send data from caches to TO, and even if we did, it'd be pretty unwieldy to collect that from every single cache.

We do have the ATS version in a Parameter in TO, and we could hard-code it. But someone might backport the plugin changeset to an older version, and then ATC just won't allow using it. So that's not a great solution.

The "refetch" feature is only necessary or useful if the Origin server is fundamentally broken and completely violating the HTTP spec. Without "refetch," ATS does an IMS request. So the only reason it's ever needed is when a Tenant's Origin is responding with a 304 when the asset actually is modified, in complete violation of RFC 7234 and every HTTP caching specification. But of course, we deal with horribly broken Origins every day. I do think we should make that clear in the docs, and maybe the UI.

To work correctly, every single Cache has to support it. Which is mostly fine, since CDN operators don't usually operate with widely varying ATS versions. Some do, but I think the majority have a single version and maybe some newer Canaries.

With those three things in mind, I think the best solution might be to make it a config Feature Flag in Traffic Ops/+Portal, with documentation that they must have version X of regex_revalidate. As an additional safety, we should also make t3c check and log an error (which operators can then Alarm on).

It's not as good as bubbling it up, but considering all those things, that's the best solution I'm seeing at the moment.
It sounds like you're saying you set the flag in TO, then ORT/t3c/atstccfg looks at it and determines based on that what to do with jobs that use "refetch" - which makes sense, because otherwise I suppose I don't know how t3c could "check and log an error"
Yes, that is how use_reval_pending works currently and is how I imagine a feature flag for this feature would work as well.
When I said "feature flag" I was thinking the TO configuration file; but a global Parameter also works.
I thought it'd be like: you make a new job, if it's using "refetch" then TO checks if the feature flag is allowed. If it isn't, you can't make that job
That's what I was thinking, but also having a check in t3c-apply as a fail-safe, if it ever does get a job and the installed regex_revalidate is too old to understand it, it would log an error (but probably still do the reval, because an IMS reval is better than nothing and doesn't hurt if they wanted a no-IMS reval).
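That t3c-apply fail-safe could be sketched roughly as follows. The function name, the boolean capability check, and the log wiring are all illustrative assumptions, not actual t3c code:

```go
package main

// applyInvalidation sketches the fail-safe described above: if t3c sees a
// refetch job but the installed regex_revalidate plugin is too old to
// understand it, it logs an error and still performs a plain (IMS)
// revalidation, since that is better than dropping the job entirely.
func applyInvalidation(invalidationType string, pluginSupportsRefetch bool, logErr func(string)) string {
	if invalidationType == "refetch" && !pluginSupportsRefetch {
		logErr("refetch job requested but regex_revalidate does not support it; falling back to refresh")
		return "refresh"
	}
	return invalidationType
}
```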
We need to not allow creating those jobs at all, if the feature-flag isn't set. Because by the time it gets to the cache, it's too late: the user put in jobs, that they expected to work, and now we have an error we can't reasonably propagate up to them that the job won't work.
But we already knew it wasn't going to work, we have the flag in TO. We should be able to return an error from the /jobs POST.
Bearing in mind, these won't be intermittent errors. The endpoint will always return an error if the flag isn't set.
That's a good reason to put the flag in the Parameter instead of the config file: the Param is already available in the API. Which lets UIs like the Portal fetch it, to decide whether to hide that extra button from users, so they never even see the option that will always error.
I've added some language specifically stating the addition of a global Parameter, defaulting to rejecting refetch POSTs until specifically set to allow them.
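The TO-side gate described above could be sketched like this. The Parameter name and how it gets resolved are assumptions, not the blueprint's actual identifiers; the point is that the /jobs POST is rejected up front, rather than producing a job the caches silently cannot honor:

```go
package main

import "errors"

// checkRefetchAllowed rejects refetch job creation outright when the global
// Parameter is unset, so the caller gets an immediate error at POST time.
func checkRefetchAllowed(invalidationType string, refetchParamEnabled bool) error {
	if invalidationType == "refetch" && !refetchParamEnabled {
		return errors.New("refetch invalidation jobs are not enabled; set the global refetch Parameter to allow them")
	}
	return nil
}
```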
> ## Problem Description
>
> Currently, within ATC, there is a concept of Invalidation Jobs. These Invalidation Jobs give a user the ability to queue an invalidation for a resource, primarily based on regular expressions. The invalidation is gathered and treated as though there was a cache **STALE**, allowing the cache to query the origin server to **REFRESH** the resource. However, should the cache policy still be incorrect or misconfigured, the resource could be updated on the origin server while the cache still receives a 304 Not Modified HTTP response.
so i don't think the cache queries the "origin server" per se. instead it queries the "parent server", which in some cases will be a parent mid (if the cache is an edge) or the origin server (if the cache is a mid), and is simply the "parent server" when a topology is employed by a delivery service (where there can be N tiers).
Hm, this is a really good point and observation. My understanding is that this all revolves entirely around misconfigured origins, and even more specifically around origins we can't "fix". If it was an issue within the CDN topology, would we be able to manually intervene in the chain?
That is the purpose for introducing the feature, but as I understand it the implementation will have the same effects on any requests made by a cache server to its parent for the affected content - regardless of whether or not it's an origin.
I modified the original description to reference a CDN rather than a cache. Also added clarity that this could be a parent along the chain.
> ##### Read
>
> When displaying the information, the **Invalidation Requests** table currently shows the `Parameters` field, so the display will be straightforward, with no manipulation.
>
> However, we derive and calculate the expiration field based on the TTL. This code will need to be modified to account for the additional information contained in the `Parameters` field.
tbh i think the parameters column was a poor design (there should have simply been a TTL column) and we should probably not follow that approach. imo i would suggest hijacking the keyword column and maybe renaming it to invalidation_type or simply type with 2 acceptable values: refresh and refetch
I agree, any jobs that don't have a numeric TTL parameter will fail to operate properly; I think a migration could reasonably parse it and change it to a numeric TTL column, since if the migration fails it means ORT is failing in some regard.
I've modified the proposal to be more explicit, rather than the first attempt (which was to make the most minimal changes). Relish the feedback.
>     {
>       "alerts":[
>         {
>           "text":"Invalidation request created for http://amc-linear-origin.local.tld/path/.*\\.jpeg, start:2021-06-02 15:23:21.348 +0000 UTC end 2021-06-03 15:23:21.348 +0000 UTC",
How about changing the response text to "Invalidation request (refetch) created for..."
I added some clarity to the alert message response below.
> This struct now contains the `InvalidationType *string` field.
>
> ```go
> type InvalidationJobInput struct {
> 	DeliveryService *interface{} `json:"deliveryService"`
> ```
This was a *interface{} in APIv3 to give a transition period between using Delivery Service ID (APIv2 and earlier) to XMLID (hopefully all versions moving forward) - so this could be a string. It's not really related to the changes we're making, but it's something that could be cleaned up since we're making structure changes anyway.
That would also eliminate the need for the unexported dsid and the hoops you need to jump through for that.
This is a good catch, and I believe it should be documented at the very least. Same with the TTL below as well. I'm happy to add this as part of the refactor effort, especially since I expect more - than + :)
> ```go
> type InvalidationJobInput struct {
> 	DeliveryService *interface{} `json:"deliveryService"`
> 	Regex           *string      `json:"regex"`
> ```
regex isn't actually optional, right? Can this just be a string instead of a *string?
Another good observation. What is actually optional? Any of the fields? We can make them all explicitly required as well, if you believe that's a benefit at this time.
I don't think any of the properties were optional, but it's easier to tell for strings than other things if a value was actually given. For example "regex": "" is just as invalid as "regex": null or a missing "regex" property, and Go's parser can't tell the difference between those cases for a string property. However, for something numeric, Go can't tell the difference between 0, null and undefined values for an int property, but sometimes 0 is a valid value, so you have to use *int. It's a very annoying and pedantic balancing act you have to perform with the number of things you have to check for nil all the time versus what the Go parser can and can't distinguish between.
Historically, though, the reason these fields are all pointers is because originally this was going to be the first endpoint to allow PATCH requests, where every property is optional by definition. That didn't pan out because there was prerequisite work that needed to be done to support PATCH safely, but I guess I never changed the fields back from pointers.
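The distinction described above can be demonstrated concretely. With a plain `string` field, Go's JSON decoder yields the same zero value for `""`, `null`, and a missing key; with a `*string`, a provided empty string is distinguishable (non-nil pointer), though `null` and a missing key both decode to nil. The struct names here are illustrative:

```go
package main

import "encoding/json"

// plainInput uses a value field: "", null, and missing all decode to "".
type plainInput struct {
	Regex string `json:"regex"`
}

// pointerInput uses a pointer field: null and missing decode to nil, while a
// provided "" decodes to a non-nil pointer to an empty string.
type pointerInput struct {
	Regex *string `json:"regex"`
}

// describe decodes the same document both ways and reports what each sees.
func describe(doc string) (plainVal string, ptrIsNil bool) {
	var p plainInput
	var q pointerInput
	// Errors ignored for brevity; the documents used here are valid JSON.
	json.Unmarshal([]byte(doc), &p)
	json.Unmarshal([]byte(doc), &q)
	return p.Regex, q.Regex == nil
}
```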
Cool info on the PATCH. That makes sense. I've gone ahead and made both DeliveryService and Regex non-optional (required, for brevity?).
>     type InvalidationJobInput struct {
>     	DeliveryService  *interface{} `json:"deliveryService"`
>     	Regex            *string      `json:"regex"`
>     	InvalidationType *string      `json:"invalidationType,omitempty"`
I don't really like optional properties, but also why omitempty? That won't actually be allowed to be empty in the database, right?
Removed omitempty. I originally thought we could aim for backwards compatibility. Also, even as an explicit column in the DB, I figured it would default to NULL. However, with the newer changes this will be explicit and no longer optional.
>     	Regex            *string      `json:"regex"`
>     	InvalidationType *string      `json:"invalidationType,omitempty"`
>     	StartTime        *Time        `json:"startTime"`
>     	TTL              *interface{} `json:"ttl"`
Does anyone really like being able to put in "duration strings"? I don't think that's as useful as I thought it was. JSON and Javascript numbers are IEEE double-precision floating point numbers, which means that as long as we don't need jobs to last longer than 290 years this could just be a time.Duration which is a duration in nanoseconds.
That would also eliminate the need for the unexported ttl and the hoops you need to jump through for that.
The API call implementations (whether Go clients or TP), from what I can gather, are sending up JS numbers. We can certainly enforce a numeric data type here, and even if we stick with the hours format, it would still be clearer than an empty interface.
- RFC3339 for Time format
- Explain impact for T3C
- Explicitly set 'refresh' or 'refetch' for all API interaction
- Minor formatting fixes
- Updated InvalidationJobInput fields
- Incorporate global Parameter check
- Add clarity around parent vs origin
What does this PR (Pull Request) do?
This is a blueprint proposal to solve a few issues; however, the primary one is:
Which Traffic Control components are affected by this PR?
This blueprint has the potential to impact:
What is the best way to verify this PR?
No verification necessary for the blueprint. Just review the file `blueprints/refetch_invalidation.md`.