Bug 4616: store_client.cc:92: "mem" assertion by eduard-bagdasaryan · Pull Request #50 · squid-cache/squid

eduard-bagdasaryan · 2017-08-21T12:40:35Z

This bug was probably caused by Bug 2833 feature/fix (39fe14b).

Collapsed forwarding code must ensure StoreEntry::mem_obj existence. It
is missing for hits purged from (or never addmitted into) the memory
cache. Most storeClientListAdd() callers either have similar code or
call storeCreateEntry() which also creates StoreEntry::mem_obj.

Also: eliminated code duplication with StoreEntry::createMemObject().

XXX: another(related) problem was discovered: MemObject URIs/method
should not be overwritten. For example, it is wrong to change
MemObject::method when another request, having different method, hits
this entry, because the originally stored method is used for many purposes
(determining size of the incoming response, validating entry length,
finding vary markers, etc.). We tried to fix this problem, making
MemObject URIs/method constant and removing MemObject::setUris(). This
attempt was rejected, because of a negative effect for shared memory
cache: URIs/method would stay uninitialized. Since this is the only
place where MemObject is created without providing URIs/method
information, we decided to postpone these changes until the related XXX
within MemStore::get() is fixed.

eduard-bagdasaryan · 2017-08-21T14:46:37Z

rebuild

eduard-bagdasaryan · 2017-08-21T14:57:00Z

rebuild

eduard-bagdasaryan · 2017-08-21T16:48:13Z

rebuild

eduard-bagdasaryan · 2017-08-21T17:05:14Z

rebuild

eduard-bagdasaryan · 2017-08-21T18:07:32Z

rebuild

rousskov

Please check whether we actually need both createMemObject() and ensureMemObject() methods. In other words, do existing createMemObject() callers overwrite the old URL/method by choice or by accident?

For example, why would purgeFoundObject() want to overwrite the request method (or the URL) of the found cache entry? And overwrite the request method with what, the DELETE method?!

eduard-bagdasaryan · 2017-08-22T15:42:45Z

For example, why would purgeFoundObject() want to overwrite the request method (or the URL) of the found cache entry? And overwrite the request method with what, the DELETE method?!

Looks like this 'example' is the only one where using createMemObject() is controversial, because other usecases have a StoreEntry just created. When purging (a "PURGE" request), we get first the corresponding StoreEntry from storage ( using METHOD_GET for that) and then reset the method to the actual, i.e., METHOD_PURGE. I assume this is needed, e.g., to do correct logging into store.log (if enabled via cache_store_log parameter).

rousskov · 2017-08-24T20:53:01Z

Looks like purgeFoundObject() is the only one where using createMemObject() is controversial, because other usecases have a StoreEntry just created.

If true, then purgeFoundObject() is currently the only possible justification for having both createMemObject() and ensureMemObject() methods.

When purging (a "PURGE" request), we get first the corresponding StoreEntry from storage ( using METHOD_GET for that) and then reset the method to the actual, i.e., METHOD_PURGE. I assume this is needed, e.g., to do correct logging into store.log (if enabled via cache_store_log parameter).

Your assumption sounds strange to me because the same StoreEntry object may be shared by many Store clients -- modifying it using a particular client information seems wrong. Each transaction may log its to-Squid method (%>rm) and/or the original from-Squid method (%<rm), but not the method of the last request that found the StoreEntry object:

[http::]rm    Request method (GET/POST etc)
[http::]>rm  Request method from client
[http::]<rm  Request method sent to server or peer

It is possible that the old code is buggy in this area. If so: Fixing that bug is probably outside this PR scope, but we should add an XXX to mark the bug and note that one of the methods should be removed once that bug is fixed. It may be a good idea to mark the "wrong" or "extra" method name with an XXX so that others are not tempted to use it.

eduard-bagdasaryan · 2017-08-24T23:09:20Z

It is possible that the old code is buggy in this area. If so: Fixing that bug is probably outside this PR scope, but we should add an XXX to mark the bug and note that one of the methods should be removed once that bug is fixed. It may be a good idea to mark the "wrong" or "extra" method name with an XXX so that others are not tempted to use it.

The old code already has a TODO inside clientReplyContext::purgeRequestFindObjectToPurge():

// TODO: can we use purgeAllCached() here instead of doing the
// getPublicByRequestMethod() dance?

I am not yet sure what are 'wrong' methods which should be removed, but to fix these problems we probably need to rework the last code block inside clientReplyContext::purgeRequest() with purgeAllCached().

rousskov · 2017-08-24T23:44:49Z

to fix these problems we probably need to rework the last code block inside clientReplyContext::purgeRequest() with purgeAllCached().

That feels out of scope to me. ~~The PR is already huge. Let's not make it even bigger if we do not have to.~~

Edit: Sorry, I was thinking about the other pending PR. This PR is still small, but fixing method iteration in purgeRequestFindObjectToPurge() sounds completely out of this PR scope.

squid-prbot · 2017-08-25T23:32:18Z

Jenkins test

eduard-bagdasaryan · 2017-08-25T23:33:19Z

rebuild

eduard-bagdasaryan · 2017-08-25T23:56:08Z

rebuild

eduard-bagdasaryan · 2017-08-26T00:12:17Z

rebuild

rousskov · 2017-08-26T17:32:05Z

@eduard-bagdasaryan, looks like the basic Jenkins build test is working. Please switch to the full "matrix" tests when you get a chance.

eduard-bagdasaryan · 2017-08-26T19:02:56Z

looks like the basic Jenkins build test is working

FYI: as you can see, the context is 'default' for Jenkins check. Jenkins allows to configure this, but this option works incorrectly, confusing Github with wrong statuses. There is an open bug issue, describing this.

rousskov · 2017-08-26T20:43:25Z

as you can see, the context is 'default' for Jenkins check.

I did not notice that earlier, but I do see now that GitHub refers to the check as "default" instead of something more reasonable like "Jenkins build test" or "basic build test".

Jenkins allows to configure this, but this option works incorrectly, confusing Github with wrong statuses. There is an open bug issue, describing this.

If the label "default" works, we can continue using that uninformative label until the Jenkins bug is fixed.

If we add another independent/parallel test (job?) to Jenkins (e.g., Co-Advisor or Coverity), will GitHub get confused with two Jenkins jobs having the same "default" label? In other words, does this bug limit us to using just one Jenkins job when integrating with GitHub CI?

eduard-bagdasaryan · 2017-08-27T14:59:08Z

rebuild

eduard-bagdasaryan · 2017-08-27T15:16:20Z

Jenkins allows to configure this, but this option works incorrectly, confusing Github with wrong statuses.

I found a workaround for this bug: we should configure this context for each job independently. This can be done inside "Trigger Setup" withing job configuration. Moreover, I think this is the only correct way to manage our configuration, since we will have several independent jobs with different names, like "Jenkins(build test)", "Jenkins(Co-Advisor test), etc.

instead of something more reasonable like "Jenkins build test" or "basic build test"

I configured our Jenkins "5-pr-test" job, but the "default" status is stuck still there (due to that bug). I checked on other(test) project that new pull requests will not have this "default", only proper "Jenkins(build test)" (and other) build statuses.

If we add another independent/parallel test (job?) to Jenkins (e.g., Co-Advisor or Coverity), will GitHub get confused with two Jenkins jobs having the same "default" label?

Github is not confused with the same "default". Though all related jobs are triggered via Webhook in this case, Github displays only one status for one of the completed jobs (probably the last finished). That is not a problem, once we configure a specific 'build context' for each job (as I described above).

eduard-bagdasaryan · 2017-08-27T20:50:16Z

rebuild

rousskov

Refreshing the review because the author thought that all change requests have been addressed, and I disagree: Please avoid adding a second mem_obj initialization method/logic if the code can be adjusted to use a single method/logic.

You have argued that (only) the entry purging code may need a special method/logic, but there were serious doubts in the validity of that argument.

eduard-bagdasaryan · 2017-08-29T10:54:03Z

rebuild

rousskov · 2017-09-01T17:27:15Z

Thanks for describing this in detail

Please note that I just rephrased what the code does. These facts were not some secret knowledge that I magically possessed before writing that comment. Discovering that knowledge required manual labor, but I am sure you could have done it as well or better.

remove third parameter from MemObject::setUris()

AFAICT, we should not separate the two URIs from the method -- those three pieces always go together. No valid StoreEntry user may want to change just one them. Please correct me if I am wrong.

If I am right, then to make all three constant, we would have to provide all three at MemObject construction time. I think that is doable, but I did not check carefully. If you think it would work well, please try to implement it.

eduard-bagdasaryan · 2017-09-07T08:07:32Z

Though SemaphoreCI reported about whitespace errors, it is unrelated to this PR and caused by 5107d2c and 70cfe22 master revisions.

eduard-bagdasaryan · 2017-09-07T08:09:27Z

@rousskov, I made Jenkins do complete build tests with test_builds.sh, so you can mark this check as 'required' if needed.

eduard-bagdasaryan · 2017-09-08T22:49:05Z

rebuild

rousskov · 2017-09-09T16:16:08Z

Though SemaphoreCI reported about whitespace errors, it is unrelated to this PR and caused by 5107d2c and 70cfe22 master revisions.

I would not be surprised if git diff --check may report false errors, but it is probably pointless to discuss them until the branch is in sync with master because the diff for an an out-of-sync branch is different that a diff for an in-sync branch. This is one more case where the check should be done on the fully merged and about-to-be-committed code rather than some approximation of that.

The current primitive git diff --check on Semaphore CI essentially takes whatever branch GitHub/Semaphore offers it and compares it to the guessed branching point in master. That is not the right approach. The right approach is to do all git manipulations that an auto-merging plugin/bot would do up to "git commit" and then run git diff --check.

Whether we should ignore errors from out-of-sync builds or not test out-of-sync builds is an open question.

rousskov · 2017-09-09T17:09:15Z

I made Jenkins do complete build tests with test_builds.sh, so you can mark this check as 'required' if needed.

Done. IIRC, you have added stable Jenkins nodes to that test. When you get a chance, please rename that job from Jenkins(build test) to Comprehensive build test.

rousskov · 2017-09-09T18:00:43Z

src/MemStore.cc

    // store the response for the Root().get() callers to be happy because they
    // expect IN_MEMORY entries to already have the response headers and body.
-    e->makeMemObject();
+    e->ensureMemObject(nullptr, nullptr, HttpRequestMethod());


Something does not add up here. This ensureMemObject(nullptr, nullptr...) call will always trigger the (moved) Bug: Missing MemObject::storeId warning, right? I am sure you can adjust the code further to avoid that warning in this specific case, but does not this problem tell us that we cannot make the URIs/method trio constant?! AFAICT, here we are creating a StoreEntry with MemObject that lacks any URIs/method information. After this PR, this StoreEntry will remain in this half-baked state because nobody will be able to reset URIs/method. Would not exposing Store clients to such half-baked StoreEntry/MemObject objects create all sorts of problems?

I am guessing that before this PR the half-baked StoreEntry/MemObject were updated with URIs/method information as soon as that info was loaded from the memory cache by/in MemStore::copyFromShm() or perhaps shortly after that.

If my suspicions are correct, then either we have to go back to mutable URIs/method trio (at least for the specific case where the original trio was empty) or we need to reshape this code so that it does not need the MemObject object until it knows what the actual URIs/method trio is. The latter approach would address the above XXX comment, so it is better in general, but I suspect it may not be feasible without revamping a lot of code. In that case, we would have to allow trio updates, at least from the empty to non-empty state.

There are other small problems with the recent changes, but let's address this big issue first.

eduard-bagdasaryan · 2017-09-10T09:55:27Z

rebuild

eduard-bagdasaryan · 2017-09-13T07:28:39Z

@rousskov , please review e400119.

rousskov

I really do not like how the semi-repeated calls in this PR iteration look. Please see the code comments for a specific fix suggestion.

rousskov · 2017-09-13T14:35:49Z

src/MemObject.cc

-    vary_headers(nullptr)
+    vary_headers(nullptr),
+    storeId_(aStoreId),
+    logUri_((aLogUri == aStoreId) ? nullptr : aLogUri)


Until the trio becomes constant, I do not think it is a good idea to silently duplicate this tricky trio initialization logic (and the URL_CHECKSUM_DEBUG hack). If the sketch described in another change request is implemented, then the old constructor may continue to work because setUris() will only be called with a set/non-nil trio.

If you do need to add a MemObject(trio) constructor for some reason, then the new constructor can probably call the old MemObject(void) constructor (C++11 allows constructor reuse) and then call setUris().

Undone previous changes, postponed creating MemObject(trio) until it is required by other changes.

rousskov · 2017-09-13T14:37:44Z

src/MemObject.h


    /// sets store ID, log URI, and request method; TODO: find a better name
+    /// XXX: remove this method and make corresponding URI fields constant
+    /// when another XXX within MemStore::get() is addressed.


Please s/another XXX within /the XXX in/

rousskov · 2017-09-13T15:26:51Z

src/MemObject.h


-    /// whether setUris() has been called
+    /// Whether entry StoreID was provided.
+    /// TODO: probably misnamed.


If the sketch described in another change request is implemented, then the old setUri() documentation will remain valid because setUris() will only be called with a set/non-nil trio.

rousskov · 2017-09-13T15:27:45Z

src/Store.h

-    MemObject *makeMemObject();
-
-    /// initialize mem_obj member (if needed) and supply URI-related info
+    /// initialize mem_obj (if needed)


If the sketch described in another change request is implemented, then this documentation will need to be adjusted further. Please check other affected methods as well.

rousskov · 2017-09-13T15:30:53Z

src/client_side_reply.cc


    if (entry) {
+        entry->createMemObject(url, http->log_uri, http->request->method);
+        entry->mem_obj->setUris(url, http->log_uri, http->request->method);


This repetition and similar code elsewhere in this PR look really bad. We need to find a better way to express what we know. AFAICT, we need to handle three or four distinct cases (from the caller point of view):

The caller knows that the StoreEntry object does not have a MemObject. The caller needs to create the MemObject but the caller does not have the trio information yet. This case is related to the old MemStore problem that we are not going to fix in this PR (and that makes it impossible to make the trio members constant).

The caller knows that the StoreEntry object does not have a MemObject. The caller needs to create the MemObject and set its trio information.

The caller knows that the StoreEntry object has a MemObject. The caller needs to update the MemObject trio if it is unset. I am not sure these cases actually exist.

The caller does not know whether the StoreEntry object has a MemObject member. If the StoreEntry does not have MemObject, then the caller needs to create one (with the caller-supplied trio). Otherwise, the caller needs to update the MemObject trio if it is unset.

I suggest using the following APIs to support all of the above cases:

MemObject::MemObject(void): Old MemObject constructor. No changes here.

MemObject::setUris(trio): Old method adjusted to do nothing when hasUris() is already true and to assert that hasUris() becomes true otherwise.

`StoreEntry::ensureMemObject(trio): Creates MemObject if needed and always calls setUri().

`StoreEntry::createMemObject(trio): Asserts that there is no MemObject and calls ensureMemObject().

`StoreEntry::createMemObject(void): Asserts that there is no MemObject and creates one.

When/if we no longer need to update the trio, MemObject(void) and setUris() will disappear, with their code merged into MemObject(trio). The createMemObject(void) method will disappear as well as unneeded.

The use cases listed earlier can use the following APIs:

createMemObject(void)

createMemObject(trio)

assert(mem_obj); mem_obj->setUris(trio)

ensureMemObject(trio)

Since each case gets its own API, it would be easier to find and adjust them later. Again, I hope that case 3 does not actually exist.

If you do not have a better idea, please see whether the above sketch results in a better looking but still correct code.

I decided to implement your sketch "as is".

eduard-bagdasaryan · 2017-09-14T22:50:30Z

@rousskov , I addressed latest review comments.

rousskov · 2017-09-15T01:13:10Z

I removed the setUris() exit assertion, added a few polishing touches, rebased, and squashed. You can still see individual changes at SQUID-314-bug-4616-mem-assertion.bak

@eduard-bagdasaryan, if the tests pass, do you have any objections to merging ~~0d4baf8~~ 4da97c7?

Edit: I had to force push a trivial change in hope to awake SemaphoreCI that got stuck/confused. That did not help so I will try to find another trick to unstuck it.

This bug was probably caused by Bug 2833 feature/fix (1a210de). The primary fix here is limited to clientReplyContext::processExpired(): Collapsed forwarding code must ensure StoreEntry::mem_obj existence. It was missing for cache hits purged from (or never admitted into) the memory cache. Most storeClientListAdd() callers either have similar code or call storeCreateEntry() which also creates StoreEntry::mem_obj. Also avoided clobbering known StoreEntry URIs/method in some cases. The known effect of this change is fixed store.log URI and method fields when a hit transaction did not match the stored entry exactly (e.g., a HEAD hit for a GET cached entry), but this improvement may have even more important consequences: The original method is used by possibly still-running entry filling code (e.g., determining the end of the incoming response, validating the entry length, finding vary markers, etc.). Changing the method affects those actions, essentially corrupting the entry state. The same argument may apply to store ID and log URI. We even tried to make URIs/method constant, but that is impractical w/o addressing an XXX in MemStore::get(), which is outside this issue scope. To facilitate that future fix, the code now distinguishes these cases: * createMemObject(void): Buggy callers that create a new memory object but do not know what URIs/method the hosting StoreEntry was based on. Once these callers are fixed, we can make the URIs/method constant. * createMemObject(trio): Callers that create a new memory object with URIs/method that match the hosting StoreEntry. * ensureMemObject(trio): Callers that are not sure whether StoreEntry has a memory object but have URIs/method to create one if needed.

rousskov · 2017-09-15T03:58:36Z

For the record, closing and reopening this PR unstuck Semaphore CI. Please note that not everybody has the right to reopen and that pushing into the branch of a closed PR may force you to jump through additional hoops later.

eduard-bagdasaryan · 2017-09-15T07:41:36Z

do you have any objections to merging

I do not mind merging it.

This bug was probably caused by Bug 2833 feature/fix (1a210de). The primary fix here is limited to clientReplyContext::processExpired(): Collapsed forwarding code must ensure StoreEntry::mem_obj existence. It was missing for cache hits purged from (or never admitted into) the memory cache. Most storeClientListAdd() callers either have similar code or call storeCreateEntry() which also creates StoreEntry::mem_obj. Also avoided clobbering known StoreEntry URIs/method in some cases. The known effect of this change is fixed store.log URI and method fields when a hit transaction did not match the stored entry exactly (e.g., a HEAD hit for a GET cached entry), but this improvement may have even more important consequences: The original method is used by possibly still-running entry filling code (e.g., determining the end of the incoming response, validating the entry length, finding vary markers, etc.). Changing the method affects those actions, essentially corrupting the entry state. The same argument may apply to store ID and log URI. We even tried to make URIs/method constant, but that is impractical w/o addressing an XXX in MemStore::get(), which is outside this issue scope. To facilitate that future fix, the code now distinguishes these cases: * createMemObject(void): Buggy callers that create a new memory object but do not know what URIs/method the hosting StoreEntry was based on. Once these callers are fixed, we can make the URIs/method constant. * createMemObject(trio): Callers that create a new memory object with URIs/method that match the hosting StoreEntry. * ensureMemObject(trio): Callers that are not sure whether StoreEntry has a memory object but have URIs/method to create one if needed.

yadij approved these changes Aug 21, 2017

View reviewed changes

yadij added the S-waiting-for-committer privileged action is expected (and usually required) label Aug 21, 2017

rousskov requested changes Aug 21, 2017

View reviewed changes

rousskov added S-waiting-for-author author action is expected (and usually required) and removed S-waiting-for-committer privileged action is expected (and usually required) labels Aug 21, 2017

rousskov added S-waiting-for-reviewer ready for review: Set this when requesting a (re)review using GitHub PR Reviewers box and removed S-waiting-for-author author action is expected (and usually required) labels Aug 28, 2017

rousskov requested changes Aug 28, 2017

View reviewed changes

rousskov added S-waiting-for-author author action is expected (and usually required) and removed S-waiting-for-reviewer ready for review: Set this when requesting a (re)review using GitHub PR Reviewers box labels Aug 28, 2017

rousskov reviewed Sep 9, 2017

View reviewed changes

rousskov requested changes Sep 13, 2017

View reviewed changes

rousskov added S-waiting-for-reviewer ready for review: Set this when requesting a (re)review using GitHub PR Reviewers box and removed S-waiting-for-author author action is expected (and usually required) labels Sep 14, 2017

rousskov force-pushed the SQUID-314-bug-4616-mem-assertion branch from 4d33a73 to 0d4baf8 Compare September 15, 2017 01:03

rousskov approved these changes Sep 15, 2017

View reviewed changes

rousskov added S-waiting-for-author author action is expected (and usually required) and removed S-waiting-for-reviewer ready for review: Set this when requesting a (re)review using GitHub PR Reviewers box labels Sep 15, 2017

rousskov force-pushed the SQUID-314-bug-4616-mem-assertion branch from 0d4baf8 to 4da97c7 Compare September 15, 2017 03:22

rousskov closed this Sep 15, 2017

rousskov reopened this Sep 15, 2017

rousskov merged commit 76d6111 into squid-cache:master Sep 15, 2017

rousskov deleted the SQUID-314-bug-4616-mem-assertion branch September 15, 2017 14:13

eduard-bagdasaryan mentioned this pull request Sep 20, 2017

Bug 4505: SMP caches sometimes do not purge entries #46

Merged

rousskov mentioned this pull request Jan 20, 2018

Fixed store.cc "!mem_obj" assertion via peerDigestRequest #134

Merged

Conversation

eduard-bagdasaryan commented Aug 21, 2017 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

eduard-bagdasaryan commented Aug 21, 2017

Uh oh!

eduard-bagdasaryan commented Aug 21, 2017

Uh oh!

eduard-bagdasaryan commented Aug 21, 2017

Uh oh!

eduard-bagdasaryan commented Aug 21, 2017

Uh oh!

eduard-bagdasaryan commented Aug 21, 2017

Uh oh!

rousskov left a comment

Choose a reason for hiding this comment

Uh oh!

eduard-bagdasaryan commented Aug 22, 2017

Uh oh!

rousskov commented Aug 24, 2017 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

eduard-bagdasaryan commented Aug 24, 2017

Uh oh!

rousskov commented Aug 24, 2017 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

squid-prbot commented Aug 25, 2017

Uh oh!

eduard-bagdasaryan commented Aug 25, 2017

Uh oh!

eduard-bagdasaryan commented Aug 25, 2017

Uh oh!

eduard-bagdasaryan commented Aug 26, 2017

Uh oh!

rousskov commented Aug 26, 2017

Uh oh!

eduard-bagdasaryan commented Aug 26, 2017

Uh oh!

rousskov commented Aug 26, 2017

Uh oh!

eduard-bagdasaryan commented Aug 27, 2017

Uh oh!

eduard-bagdasaryan commented Aug 27, 2017

Uh oh!

eduard-bagdasaryan commented Aug 27, 2017

Uh oh!

rousskov left a comment

Choose a reason for hiding this comment

Uh oh!

eduard-bagdasaryan commented Aug 29, 2017

Uh oh!

rousskov commented Sep 1, 2017

Uh oh!

eduard-bagdasaryan commented Sep 7, 2017

Uh oh!

eduard-bagdasaryan commented Sep 7, 2017

Uh oh!

eduard-bagdasaryan commented Sep 8, 2017

Uh oh!

rousskov commented Sep 9, 2017

Uh oh!

rousskov commented Sep 9, 2017

Uh oh!

Choose a reason for hiding this comment

Uh oh!

eduard-bagdasaryan commented Sep 10, 2017

Uh oh!

eduard-bagdasaryan commented Sep 13, 2017

Uh oh!

rousskov left a comment

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

eduard-bagdasaryan commented Aug 21, 2017 •

edited

Loading

rousskov commented Aug 24, 2017 •

edited

Loading

rousskov commented Aug 24, 2017 •

edited

Loading

rousskov commented Sep 15, 2017 •

edited

Loading