Bug 4505: SMP caches sometimes do not purge entries by eduard-bagdasaryan · Pull Request #46 · squid-cache/squid

eduard-bagdasaryan · 2017-08-16T09:45:18Z

When Squid finds a requested entry in the memory cache, it does not
check whether the same entry is also stored in a cache_dir. The
StoreEntry object may become associated with its store entry in the
memory cache but not with its store entry on disk. This inconsistency
causes two known problems:

Squid may needlessly swap out the memory hit to disk, either
overwriting an existing (and identical) disk entry or, worse,
creating a duplicate entry on another disk. In the second case, the
two disk entries are not synchronized and may eventually start to
differ if one of them is removed or updated.
Squid may not delete a stale disk entry when needed, violating
various HTTP MUSTs, and eventually serving stale [disk] cache entries
to clients.

Another purging problem is not caused by the above inconsistency:

A DELETE request or equivalent may come for the entry which is still
locked for writing. Squid fails to get a lock for such an entry (in
order to purge it) and the entry remains in disk and/or memory cache.

To solve the first two problems:

StoreEntry::mayStartSwapout() now avoids needless swapouts by checking
whether StoreEntry was fully loaded, is being loaded, or could have
been loaded from disk. To be able to reject swapouts in the last case,
we now require that the newer (disk) entries explicitly delete their
older variants instead of relying on the Store to overwrite the older
(unlocked) variant. That explicit delete should already be happening
in higher-level code (that knows which entry is newer and must mark
any stale entries for deletion anyway).

To fix problem #3:

A new Store::Controller::evictIfFound(key) method purges (or marks for
deletion if purging is impossible) all the matching store entries,
without loading the StoreEntry information from stores. Avoiding
StoreEntry creation reduces waste of resources (the StoreEntry object
would have to be deleted anyway) and allows us to mark being-created
entries (that are locked for writing and, hence, cannot be loaded into
a StoreEntry object).

Note that even if Squid properly avoids storing duplicate disk entries,
some cache_dir manipulations by humans and Squid crashes may lead to
such duplicates being present. This patch leaves dealing with potential
duplicates out of scope except it guarantees that if an entry is
deleted, then all [possible] duplicates are deleted as well.

Fixing the above problems required (and/or benefited from) many related
improvements, including some Store API changes. It is impractical to
detail each change here, but several are highlighted below.

To propagate DELETEs across workers, every public StoreEntry now has a
Transients entry.

Prevented concurrent cache readers from aborting when their entry is
release()d. Unlike abort, release should not affect current readers.

Fixed store.log code to avoid "Bug: Missing MemObject::storeId value".

Removed Transients extras used to initialize MemObject StoreID/method in
StoreEntry objects created by Transients::get() for collapsed requests.
Controlled::get() and related Controller APIs do not require setting
those MemObject details: get() methods for all cache stores return
StoreEntry objects without them (because entry basics lack Store ID and
request method). The caller is responsible for cache key collision
detection. Controlled::get() parameters could include Store ID and
request method for early cache key collision detection, but adding a
StoreQuery class and improving collision detection code is outside this
project scope (and requires many changes).

Found more cases where release() should not prevent sharing.
Remaining cases need further analysis as discussed in master 39fe14b.

Greatly simplified UFS store rebuilding, possibly fixing subtle bug(s).

Clarified RELEASE_REQUEST flag meaning, becoming 'a private StoreEntry
which can't become public anymore'. Refactored the related code,
combining two related notions: 'a private entry' and 'an entry marked
for removal'.

Do not abort collapsed StoreEntries during syncing just because the
corresponding being stored shared entry was marked for deletion. Abort
them if the shared entry has been also aborted.

Added StoreEntry helper methods to prevent direct manipulation of
individual disk-related data members (swap_dirn, swap_filen, and
swap_status). These methods help keep these related data members in a
coherent state and minimize code duplication.

XXX: SMP cache purges may continue to malfunction when the Transients
table is missing. Currently, Transients are created only when the
collapsed_forwarding is on. After Squid bug 4579 is fixed, every public
StoreEntry will have the corresponding Transients entry and vice versa,
extending these fixes to all SMP environments.

yadij

looks mostly okay to me. Please fix the compile issues in stubs. The rest are optional polish requests and I will approve with just the compile ones fixed.

yadij · 2017-08-21T17:46:35Z

src/store.cc

+                    (swap_dirn < 0) == (swap_status == SWAPOUT_NONE) &&
+                    (swap_dirn < 0 || swap_dirn < Config.cacheSwap.n_configured);
+
+    if (!ok) {


seems to be missing a WARNING label, unless it is not that important really and more deserving of a level 3+ debugs.

Done with 'ERROR' label, since it is an 'error' rather than a 'warning' here.

yadij · 2017-08-21T17:50:55Z

src/neighbors.cc

@@ -1766,19 +1778,22 @@ neighborsHtcpClear(StoreEntry * e, const char *uri, HttpRequest * req, const Htt
    char buf[128];

    for (p = Config.peers; p; p = p->next) {


This function can be polished up a bit more. Please consider making the p variable be defined as part of the loop definition like neighborsHtcpClearNeeded and the buf a static local inside the new if-condition.

yadij · 2017-08-21T17:57:44Z

src/store/Controller.h

    virtual bool smpAware() const override;

+    /// Whether the entry with the key exists and was marked for removal
+    /// some time ago; get(key) will return NULL in such case.


please reference 'nil' or nullptr whichever is more descriptive instead of NULL.

yadij · 2017-08-21T18:01:23Z

src/store/Controller.h

    virtual void sync() override;
    virtual void maintain() override;
    virtual void markForUnlink(StoreEntry &) override;
+    virtual void unlinkByKeyIfFound(const cache_key *key) override;


please follow the updated coding style in this file, not mentioning parameter names unless they are needed to document the parameter better than type or method name can. StoreEntry and cache_key are already descriptive enough types to convey the necessary detail.

Please also consider using the new style in the other new function and methods. I'm not asking for those others as a review item there since it is polish and they are situated in places not using the new style yet.

Done. Also fixed some other new methods.

yadij · 2017-08-21T18:06:36Z

src/tests/stub_MemStore.cc

 int64_t MemStore::maxObjectSize() const STUB_RETVAL(0)
 bool MemStore::dereference(StoreEntry &) STUB_RETVAL(false)
 void MemStore::markForUnlink(StoreEntry&) STUB
+void MemStore::unlinkByKeyIfFound(const cache_key *key) STUB


please remove the parameter name here. it leads to compile error 'unused variable' on strict compilers.

same in new stub_store.cc and TestSwapDir.h lines.

Done.
BTW, there are currently lots of unused variables in Squid sources. I assume this should be fixed first before we are going to test new patches with 'strict' compilers.

eduard-bagdasaryan · 2017-08-23T14:29:14Z

@yadij , I addressed your review comments and updated the PR.

rousskov

I may have discovered a new release() problem, stated my recommendations regarding the syncCollapsed() TODO, and added a few minor polishing requests.

rousskov · 2017-08-23T15:03:09Z

src/store/Controller.cc

    assert(collapsed->mem_obj);
    assert(collapsed->mem_obj->smpCollapsed);
+    // TODO: confirm whether this is reasonable:
+    // assert(collapsed->store_status == STORE_PENDING);


Eduard: what to do with STORE_PENDING check? I suspect that it should always store_status==STORE_PENDING inside Controller::syncCollapsed(), because syncing should be performed on not-yet completed/STORE_OK entry.

Ignoring the trivial case of findCollapsed() returning nil, syncCollapsed() can be called for any transient entry. To assert that any transient entry is STORE_PENDING, we must establish two conditions:

syncCollapsed() relies on the transient entry being STORE_PENDING.

A transient entry cannot be STORE_OK.

For item 1: I do not see why syncCollapsed() would need the transient entry to always be in STORE_PENDING state. Do you? Sure, there are places inside syncCollapsed() where STORE_PENDING is needed (e.g., when we call abort(), but there are also places where it is irrelevant (e.g., when the entry is inSync).

For item 2: AFAICT, a StoreEntry object becomes STORE_OK when it is fully stored in disk or memory cache OR when it is aborted. I see no code that would guarantee that a StoreEntry is always removed from Transients when it becomes STORE_OK. Did I miss it?

Eduard: There is another reason indirectly confirming that: StoreEntry::abort() asserts if store_status!=STORE_PENDING and this method is invoked throughout Controller::syncCollapsed().

This is indeed an important observation -- syncCollapsed() must not abort() aborted or completed StoreEntry objects. However, that makes STORE_PENDING a precondition for the syncCollapsed() code calling abort(). It does not imply that the entire syncCollapsed() code has such a precondition.

If you do not find flaws in the above reasoning, then I suggest removing this TODO comment. More related same-method changes are requested below.

then I suggest removing this TODO comment.

Done at c0a470d.

syncCollapsed() must not abort() aborted or completed StoreEntry objects.

The related c0a470d change handles 'aborted' case. Not sure what was specifically suggested for 'completed' case, left as is for now. So, currently, abort() will assert there if the entry is STORE_OK.

Alex: This is indeed an important observation -- syncCollapsed() must not abort() aborted or completed StoreEntry objects.

Eduard: Not sure what was specifically suggested for 'completed' case

Nothing was suggested in that phrase, I was just retelling what you said (not STORE_PENDING means aborted or completed StoreEntry). If the current syncCollapsed() code never calls abort() for aborted or completed StoreEntry objects, then we are all set as far as this "remove TODO" change request is concerned.

rousskov · 2017-08-23T15:44:02Z

src/store/Controller.cc

    debugs(20, 7, "syncing " << *collapsed);

-    bool abandoned = transients->abandoned(*collapsed);
+    bool aborted = false;


s/aborted/abortedByWriter/ to clarify what kind of abort we are dealing with here

Done at c0a470d.

rousskov · 2017-08-23T15:48:52Z

src/store/Controller.cc

+    bool waitingToBeFreed = false;
+    transients->status(*collapsed, aborted, waitingToBeFreed);
+    if (aborted) {
+        debugs(20, 3, "aborting " << *collapsed << " due to aborted shared status");


s/due to aborted shared status/because its writer has aborted/

Done at c0a470d.

rousskov · 2017-08-23T15:51:55Z

src/store/Controller.cc

+        collapsed->abort();
+        return;
+    }
    bool found = false;


Let's add an empty line here -- we are done processing the aborted case. The rest of the method can assume that the shared entry is not aborted.

Done at c0a470d.

rousskov · 2017-08-23T15:52:10Z

src/store/Controller.cc

+    bool aborted = false;
+    bool waitingToBeFreed = false;
+    transients->status(*collapsed, aborted, waitingToBeFreed);
+    if (aborted) {


Let's add an empty line above the if statement to better isolate the aborted case.

Done at c0a470d.

rousskov · 2017-08-23T16:54:43Z

src/store/Controller.cc

+{
+    if (StoreEntry *entry = static_cast<StoreEntry*>(hash_lookup(store_table, key))) {
+        debugs(20, 3, "got in-transit entry: " << *entry);
+        entry->release();


If this entry is collapsed but still detached, will its local client(s) get stuck for a while, waiting for a syncCollapsed() call (that this particular method does not trigger for local clients)? And then, when that syncCollapsed() call finally comes, we will call abort() because we cannot attach this entry?

AFAICT, collapsed but still detached entries in other workers will not get stuck because StoreEntry::release() eventually calls Transients::abandon() that calls CollapsedForwarding::Broadcast(). Please double check the current code and correct me if I am wrong.

If my concerns are at least partially valid, then we could add more code to StoreEntry::release(), but I am worried that calling abort() from release() would create dangerous reentrant code (i.e., the object calling release() will be synchronously called back to be notified about the abort and then will resume running with an aborted entry). While abort.callback is asynchronous, invokeHandlers() (also called by StoreEntry::abort()) is not!

To address this, perhaps we should do something like this instead:

// entry->release() notifies other waiting workers but // XXX: it does not abort local collapsed Store clients if needed. entry->release(); // Work around the above XXX. TODO: Move into release() after making invokeHandlers() asynchronous. if (... has transient index ...) syncCollapsed(...);

Any better ideas?

AFAICT, collapsed but still detached entries in other workers will not get stuck because StoreEntry::release() eventually calls Transients::abandon() that calls CollapsedForwarding::Broadcast().

While experimenting, I saw that this does not work properly with our changes. Currently, 'DELETE' causes purgeEntriesByUrl(), Controller::unlinkByKeyIfFound() and Transients::unlinkByKeyIfFound() (for USE_HTCP off scenario). These were our 'innovations' to optimize 'heavy' StoreEntry loading. I.e., instead of StoreEntry::release() and eventually Transients::abandon() now we have unlinkByKeyIfFound() and no abandon(). So other worker clients keep 'hanging' on detached entry which became deleted.
To fix this, we can either:

revert our changes in purgeEntriesByUrl() and load StoreEntry or

teach Transients::unlinkByKeyIfFound() to do CollapsedForwarding::Broadcast(). There should be added a suitable api for this though.

What do you think?

revert our changes in purgeEntriesByUrl() and load StoreEntry

IIRC, loading StoreEntry is not an option in general because Controller::find() returns nil for not-yet-readable entries. See your PR description near "mark being-created
entries". Our changes were not simply an optimization. They addressed a problem with marking unreadable entries for deletion. We need to fix/improve our changes, not revert them.

teach Transients::unlinkByKeyIfFound() to do CollapsedForwarding::Broadcast().

unlinkByKeyIfFound() or a related method should broadcast (in some cases): A worker must broadcast whenever it changes the state of a shared being-read entry -- the broadcast is the mechanism for readers in other workers to learn about those changes. Marking a transient entry for deletion requires broadcasting (when there are other entry readers).

I am not sure this broadcast should be initiated by the Transients::unlinkByKeyIfFound() method, by the corresponding Controller method, or by similar methods in the cache Stores. I hope you can find and defend the answer to that question. Ideally, there should be at most one broadcast per change. See below for more clues.

There should be added a suitable api for this though.

I would check whether calling Transients::get() and broadcasting the resulting entry (or not broadcasting at all if it returns nil) is sufficient (and a good idea) in this case. That would still be different than relying on Controller::find() for deletion. Here, we are relying on (fast) get() to broadcast the change, not to mark the store entries for deletion.

Note that, AFAICT, we do not need to broadcast if the transients entry has already been marked for deletion (which means we changed nothing) or when the transients entry is locked for writing (which means it cannot have any readers yet).

Also, if there is a local StoreEntry object already, we should probably use that instead of calling Transients::get(). Perhaps the code already has this optimization.

Combining all of the above, you might arrive at something like this (entry locking/unlocking not shown and may be missing):

if (auto entry = transients->get(key)) entry->release(); // broadcasts to inform other readers if needed else Controller::unlinkByKeyIfFound(key); // does not broadcast but there are no readers

As always, please check my reasoning. The overarching "broadcast changes when there are readers" principle should be correct, but I could miss various nuances/details.

IIRC, loading StoreEntry is not an option in general because Controller::find() returns nil for not-yet-readable entries. See your PR description near "mark being-created
entries". Our changes were not simply an optimization. They addressed a problem with marking unreadable entries for deletion. We need to fix/improve our changes, not revert them.

To clarify: my (1) suggestion did not assume 100% reverting purgeEntriesByUrl(): unlinkByKeyIfFound() method should stay of course, to handle situation when it is already impossible/not_yet_possible to 'get' the entry from store but it is still possible to get it by the key. I suggested to keep getting entry when we can(as the old code did). This would be a simple but not a perfect/complete solution though.

My sketch is similar to what you are suggesting then, except it does not pay the relatively high price of creating a new StoreEntry object when transients do not have a usable StoreEntry object already.

If the optimization in my sketch causes trouble, then reusing the old code (i.e., creating a new StoreEntry object from store metadata whenever possible) is fine. ~~Just add a source code comment for the reader who might be tempted to optimize this code by using Transients::get() alone.~~

As we discussed, release() sometimes did not cause broadcasting in my tests (including the old code). I see the reason in Store::Controller::unlink(): it does not care about Transients, calling transientsAbandon().

I agree that the lack of a transients->unlink() call in Controller::unlink() looks wrong, especially since Controller::markForUnlink() has the expected transients->markForUnlink() call.

However, I cannot say, without doing research, whether changing that method alone would guarantee broadcasts during release(). Please make sure release() broadcasts (directly or indirectly) whenever the state changes in such a way that a broadcast is required, regardless of whether Controller::unlink() has been called. Locating the best place(s) to (check state and) initiate that broadcast is not easy.

However, I cannot say, without doing research, whether changing that method alone would guarantee broadcasts during release()

No, it is not guaranteed: there is another problem. Transients::unlink() does broadcasting only for 'writers', i.e., when "e.mem_obj->xitTable.io == MemObject::ioWriting". That was the initial design: only writers can broadcast, but now, evidently, we need to broadcast from any worker, which received DELETE.

we need to broadcast from any worker, which received DELETE

To be more precise:

from any worker which changed the shared entry state

to any worker reading that shared entry

This PR focus is on DELETE-triggered changes, and we can broaden these narrow rules if needed to ease implementation (e.g., broadcast to all workers), but we should keep these rules in mind.

Please note that the rules have not changed. What changed is our understanding of their impact: We now know that DELETEs (and such) may require initiating broadcasts from non-writers.

The related fixes for this thread discussion at 2765f0b.

rousskov · 2017-08-23T17:22:41Z

src/enums.h

    ENTRY_REVALIDATE_ALWAYS,
    DELAY_SENDING,
-    RELEASE_REQUEST,
+    RELEASE_REQUEST, ///< makes private key permanent


Since the flag itself does not really make anything, I would rephrase as "prohibits making the key public".

Done at c0a470d.

rousskov · 2017-08-23T17:26:11Z

src/enums.h

 typedef enum {
-    SWAPOUT_NONE,
-    SWAPOUT_WRITING,
+    SWAPOUT_NONE, ///< the store entry is in a 'disconnected' from disk state


disconnected can be easily misinterpreted as "was connected but not longer connected". How about "StoreEntry is currently not associated with any disk store entry". And, to make it more symmetric with other definitions:

/// StoreEntry is currently not associated with any disk store entry. /// Does not guarantee (or preclude!) a matching disk store entry existence.

Done at c0a470d.

rousskov · 2017-08-23T17:30:30Z

src/enums.h

-    SWAPOUT_WRITING,
+    SWAPOUT_NONE, ///< the store entry is in a 'disconnected' from disk state
+    SWAPOUT_WRITING, ///< the store entry is being stored on a disk
+    /// The store entry has been stored on a disk and still is 'connected'


I suggest:

/// StoreEntry is associated with a complete (i.e., fully swapped out) disk store entry. /// Guarantees the disk store entry existence.

Done at c0a470d.

rousskov · 2017-08-23T17:31:55Z

src/enums.h

-    SWAPOUT_NONE,
-    SWAPOUT_WRITING,
+    SWAPOUT_NONE, ///< the store entry is in a 'disconnected' from disk state
+    SWAPOUT_WRITING, ///< the store entry is being stored on a disk


I suggest:

/// StoreEntry is being swapped out to the associated disk store entry. /// Guarantees the disk store entry existence.

Done at c0a470d.

rousskov · 2017-08-25T14:32:35Z

src/store.cc

-    expireNow();
+    if (!shareable) {
+        // This object should not be reused
+        expireNow();


Why do we need to expire entries when making them private?

This makePrivate(false) is called when the decision is 'reuseNot': we can't share the entry even with already collapsed clients. E.g., we got a header forbidding any caching/sharing (smth. like CC:no-store).

Why do we need to expire entries when making them private and unshareable? You are explaining why unshareable entry cannot be shared, but I am asking why it has to expire.

Probably this sounds vague, but to my mind 'expiration' prevents sharing shareable but not reusable entries. I.e., it makes difference between doNotCacheButShare and reuseNot states.
I am not insist this is [the only] correct way for doing this. This is how it is implemented and working now.

'expiration' prevents sharing shareable but not reusable entries

True, but should not be relevant because we are talking about unshareable entries.

I.e., it makes difference between doNotCacheButShare and reuseNot states.

My argument that it should not make a difference. I do not know whether it does. Unshareable entries should be unshareable regardless of their expiration time. If unsharable entries are shareable, there is a bug we should fix.

This is how it is implemented and working now.

I know but that is largely irrelevant to my question. If the current code works correctly (in some cases) because of a bad hack, and we are fixing a very related aspect of that code because of a very related bug (in other cases), then we should remove the hack instead of splitting the code into two parts, one that continues to rely on a hack and one that is no longer affected by the bugs that hack caused.

If you remove that if statement and expireNow() call, will the code break in some cases?

If yes, then the code relies on a weak expiration check to prevent sharing of unsharable entries. Such code must be fixed IMO.

If no, then the code is already OK and the expireNow() call is simply not needed (after all your other changes/fixes).

If no, then the code is already OK and the expireNow() call is simply not needed (after all your other changes/fixes).

I think you are right with this. I overlooked our StoreEntry::mayStartHitting(), developed to distinguish such shareable and unshareable entries. So there should be no need to do extra work with expireNow().

will the code break in some cases

It is difficult to say. I can state only that the major code paths (clientReplyContext::cacheHit() or clientReplyContext::handleIMSReply()) already use mayStartHitting() and there is no need there for protection with 'expiration'.

I can state only that the major code paths (clientReplyContext::cacheHit() or clientReplyContext::handleIMSReply()) already use mayStartHitting() and there is no need there for protection with 'expiration'.

The undocumented StoreEntry::validToSend() method also checks for "released" entries and, hence, does not rely on the expiration date alone (although I suspect it should check mayStartHitting() instead).

Finally, setting StoreEntry::expires is a rather weak protection because that setting can be ignored based on max-stale directives (at least). Look for FRESH_REQUEST_MAX_STALE_... Unlike the expiration date, private keys and shareableWhenPrivate should never be ignored.

rousskov · 2017-08-30T15:27:46Z

src/client_side_reply.cc

-            Store::Root().unlinkByKeyIfFound(key);
+                // Are there local collapsed clients we should notify?
+                // If yes, get transients index and use it later.
+                const auto transientsIndex = entry->hasTransients() && entry->locked() ? entry->mem_obj->xitTable.index : -1;


I understand that you call locked() here because Controller::get() might return an unlocked entry, and the transientsIndex is not going to survive StoreEntry::release() of an unlocked entry. I think we should make this code safer and simpler by always locking/unlocking the entry returned by get(). Doing so will remove this complex locked() logic and make post-release() entry manipulation safe. As a positive side-effect, it may also delete unlocked intransit entries that some other code might leave behind.

if (auto entry = Store::Root().get(key)) { entry->lock("purgeEntriesByUrl"); ... // entry->release() notifies other waiting workers but // XXX: does not abort local collapsed Store clients if needed. entry->release(true); // Work around the above XXX. // TODO: Move into release() after making invokeHandlers() asynchronous. if (entry->hasTransients()) Store::Root().syncCollapsed(entry->mem_obj->xitTable.index); entry->unlock("purgeEntriesByUrl"); } else ...

Done at fdcfaa3.

I am afraid I applied your improvement in hurry: I forgot about one reason why I decided to implement without locks. When we get from the Store, we could create a 'fake' entry (when store_table misses it). We need this entry only for notification purpose in release(). The syncCollapsed() method should handle local 'detached' entries, but it should not handle(abort for example) our 'fake' entry! That is why my previous version destroyed this fake entry first(after receiving required index) and then called syncCollapsed().
So I would suggest to undo that fdcfaa3.

When we get from the Store, we could create a 'fake' entry (when store_table misses it).

Why do you call that entry "fake"? It is a regular/valid entry AFAICT. Does it have some special properties that are important in this context?

We need this entry only for notification purpose in release().

Yes, and, until that XXX is fixed, for the synchronization purposes in syncCollapsed().

The syncCollapsed() method should handle local 'detached' entries,

Yes.

but it should not handle(abort for example) our 'fake' entry!

Why not?

Why do you call that entry "fake"?

I call this 'fake' because of my understanding of its usage. It is not used by clients. It is a 'helper' way to notify others. We could avoid using it at all, if, for example, transients->unlinkByKeyIfFound() had a mechanism to notify all workers(including this one) in case Transients has this entry. That was my initial suggestion you did not like.

Why not?

My suggestion prevents this extra work done on this entry in syncCollapsed(), including probably unnecessary(and misleading) logging. BTW, the abort() method says: "Someone wants to abort this transfer." What "transfer" do we have with our "fake" entry? If there is no local entry, we would return after transients->findCollapsed(xitIndex).

I think we must guarantee that Transients::unlinkByKeyIfFound() is not called when there is a local StoreEntry object.

Or, equivalently, if Transients table exists, for each local StoreEntry hasTransients() must be true. destroyMemObject() disconnects Transients from the StoreEntry. I see two scenarios to consider:

Transients was disconnected while deleting StoreEntry, i.e., destroyStoreEntry() is called. In this case we firstly disconnect it from Transients and secondly remove it from store_table. Is it possible that anybody attaches to the already disconnected, but not yet deleted entry? Looks like impossible that this code became interrupted.

Transients was disconnected while purging memory (handleIdleEntry(), discussed above, or storeGetMemSpace() methods). As I noted above, there could be cases when such memobj-less StoreEntries keep existing in the local store_table. What is the purpose of leaving such half-baked local entries? Optimization or anything else? If it is a bug, it should be fixed and the invariant becomes correct. If not, the invariant does not work.
Also, one thing is known that such transient-less and memobj-less entries should be also unlocked, meaning that there are no clients attached to it. If this consideration is correct, the invariant should be valid without any additional fixes: we should not worry about such 'stand-alone' entries, because they stay apart from collapsing, notifying, etc. Also, we can assert that entry.locked()==false, before calling unlinkByKeyIfFound() inside Transients::markForUnlink().

FYI: since we are dealing with a multi-process/thread environment any process may cause a kernel interrupt and do something with the shared parts unless the Transients disconnect part of the sequence is wrapped in proper atomic locks and the reader/writer counts checked inside the atomic lock (our StoreEntry locking does not count as thread-safety).

[ I have not checked the code to see if that is present, but you dont mention it as the reason for being "impossible", so though I should mention it. ]

@yadij, AFAICT, when talking about "impossible" and "code interrupts" in item 1, Eduard is actually talking about things happening inside one Squid process. Here, "code interrupts" means AsyncCalls and such rather than process interrupts . This particular concern is unrelated to SMP, and Eduard is right that it is impossible for a new Store client to attach to a being destroyed StoreEntry object (i.e., attach during destroyStoreEntry()).

Eduard is actually talking about things happening inside one Squid process.

That is correct. No other clients can attach during destruction process, because no async calls start processing during it. Also my supposition was that if Squid worker is suspended by OS at the beginning of StoreEntry destruction, it will resume executing from that very point of code, finishing the destruction process.

Alex: it is impossible for a new Store client to attach to a being destroyed StoreEntry object (i.e., attach during destroyStoreEntry()).

Eduard: No other clients can attach during destruction process

For the record, both statements above talk about necessarily local (i.e., same-process) clients attaching to the StoreEntry object (which is always local!). Store clients in other workers (i.e., "remote" clients) may attach to the (shared) Transients entry and/or other StoreEntry objects in their processes, but that is a different issue.

eduard-bagdasaryan · 2017-08-30T17:22:46Z

@rousskov , I addressed last discussed comments, but I can't set it to S-waiting-for-reviewer.

rousskov

I found a couple of problems. The rest is polishing. I hope we are getting close, but more changes are needed.

Also, if possible, please assert that RELEASE_FLAG is not set inside clearPrivate(), confirming that the private key setting is permanent.

rousskov · 2017-08-31T00:08:07Z

src/Transients.cc

        // there will be no more updates from us after this, so we must prevent
        // future readers from joining
-        map->freeEntry(e.mem_obj->xitTable.index); // just marks the locked entry
        map->closeForWriting(e.mem_obj->xitTable.index);


The comment above is still correct, but instead of freeing the entry, we are now using complete() test to prevent new readers from joining. Let's clarify what is going on and document a dependency on our get() logic:

// There will be no more updates from us after this, so we must prevent // future readers from joining. Making the entry complete() is sufficient // because Transients::get() does not return completed entries. map->closeForWriting(e.mem_obj->xitTable.index);

Done at a40bbf6.

rousskov · 2017-08-31T00:10:55Z

src/Transients.cc

 {
-    unlink(e);
+    assert(e.key);
+    e.hasTransients() ? unlink(e) :


Please do not use tertiary operator as an if-statement replacement. Use an if-statement. A tertiary operator imposes certain requirements on its second and third arguments, and ignoring its result may trigger compiler warnings.

A tertiary operator imposes certain requirements on its second and third arguments, and ignoring its result may trigger compiler warnings.

In this particular case these 'requirements' should be satisfied and no warnings should be generated. Both arguments result is 'void' so the result of the operator is also 'void'.

Yes, I know, but the return type(s) may change. The current void types are not a part of some natural interface invariant for those methods. The if statement would not be affected by such changes. A tertiary operator offers no advantages here, so I am puzzled why you are spending time to defend its use.

Yes, I know, but the return type(s) may change.

And in this case we would need to fix our if/else, prefixing methods with (void) methods ...

I am puzzled why you are spending time to defend its use.

I am not going to defend this minor issue anymore. In general (if everything is equal), I would prefer one line of the ternary operator to a clumsy equivalent four lines of if/else.

And (if the return types change), we would need to fix our if/else, prefixing methods with (void) methods ...

Not necessarily. The if statement will continue to work and read correctly if the method(s) gain a an optional return value. If the return value is required/important, then the if statement will need to be changed, but the changes are very likely to be much smaller in this case (because the two methods are too different to return the same kind of a required/important result).

In general, I prefer using separate lines for conditions and actions, which results in a one line difference between a tertiary operator and an if statement. Removing that extra else line is not worth the problems that tertiary operator brings.

AFAICT, the tertiary operator is meant for obtaining a result from either of the two different algorithms. The code which does not care about the result should not use the operator that does.

Done at a40bbf6.

rousskov · 2017-08-31T00:25:35Z

src/Transients.cc

 Transients::unlink(StoreEntry &e)
 {
-    if (e.mem_obj && e.mem_obj->xitTable.io == MemObject::ioWriting)
+    if (e.mem_obj)


This change looks unfinished. The old code was (indirectly) protecting abandon() from entries that do not have xitTable.index set. The new code does not and may result in asserts if unlink() is called without a hasTransients() check. I see two correct models:

All callers must check hasTransients() before calling unlink().

Unlink() must check hasTransients().

AFAICT, this PR is using approach 1 right now. Following that approach means not checking e.mem_obj either! If you agree and this PR continues to follow approach 1 (see below), then please remove that check as well. Otherwise, unlink() has to check hasTransients() and call markForUnlink() if there are none (approach 2).

I am not sure which approach is the best, but it bothers me that unlink() is using approach 1 while markForUnlink() is using approach 2. Approach 2 seems to be required for markForUnlink() because we must mark the entry even if it is being created (i.e., locked for writing) right now, correct? Should not the same logic apply to unlink()?

Otherwise, unlink() has to check hasTransients() and call markForUnlink() if there are none (approach 2).

After these changes I noticed that unlink() and markForUnlink() became identical. So I made unlink() private(which now only calls markForUnlink()) and use markForUnlink() in that single place instead.

Done at a40bbf6.

rousskov · 2017-08-31T00:29:18Z

src/fs/rock/RockSwapDir.cc

    debugs(47, 5, e);
-    map->freeEntry(e.swap_filen);
+    assert(e.key);
+    e.hasDisk() ? map->freeEntry(e.swap_filen) :


Please use an if statement instead of the tertiary operator.

Done at a40bbf6.

rousskov · 2017-08-31T00:36:39Z

src/fs/ufs/UFSSwapDir.h

    virtual uint64_t currentCount() const override { return n_disk_objects; }
    virtual ConfigOption *getOptionTree() const override;
    virtual bool smpAware() const override { return false; }
+    virtual bool hasReadableEntry(const StoreEntry &) const override { return false; };


Do you think we should document why this method always returns false? Something like this, perhaps:

/// as long as ufs relies on the global store_table to index entries, /// it is wrong to ask individual ufs cache_dirs whether they have an entry virtual bool hasReadableEntry(const StoreEntry &) const override { return false; }

Done at a40bbf6.

rousskov · 2017-08-31T05:11:53Z

src/store/Controller.cc

        debugs(20, 3, "aborting unsyncable " << *collapsed);
        collapsed->abort();
-    } else { // the entry is still not in one of the caches
+    } else  {


I meant this:

if (inSync) { debugs(20, 5, "synced " << *collapsed); collapsed->invokeHandlers(); return; } if (found) { // unrecoverable problem syncing this entry debugs(20, 3, "aborting unsyncable " << *collapsed); collapsed->abort(); return; } // the entry is still not in one of the caches debugs(20, 7, "waiting " << *collapsed);

rousskov · 2017-08-31T05:15:32Z

src/store/Disks.cc

+            if (e.hasDisk(i))
+                dir(i).unlink(e);
+            else
+                dir(i).unlinkByKeyIfFound(reinterpret_cast<const cache_key*>(e.key));


As already discussed elsewhere, with "approach 2", this inner if-statement would be replaced with just dir(i).unlink(e); and each cache dir will do either mark the (locked) entry for unlinking or unlink by key.

Done at a40bbf6.
Following it, I removed a hasDisk() check from storeSwapOutFileClosed() because the surrounding code assumes that the entry should be attached to a disk: it uses e->disk() without check. Probably there should be an assert(e->hasDisk()) near existing assert(e->swappingOut())?

I would not add a hasDisk() assert to protect disk() calls because those calls have that assert().

Since storeSwapOutFileClosed() does not manipulate raw disk indexes/pointers itself (only dumps them in stale debugs() messages), I think we are OK without an assert protecting such raw manipulations.

I consider this change request addressed.

rousskov · 2017-08-31T05:16:53Z

src/store/Storage.h

    virtual void markForUnlink(StoreEntry &e) = 0;

-    /// remove the entry from the store
+    /// Remove the entry from the store if possible


s/the entry/the matching entry/

Done at a40bbf6.

rousskov · 2017-08-31T05:17:47Z

src/store/Storage.h

+
+    /// Remove the entry from the store if possible
+    /// or mark it as waiting to be freed otherwise.
+    /// The entry must be present in the store.


With "approach 2", the "must be present" part will be gone.

rousskov · 2017-08-31T05:19:30Z

src/store_swapout.cc


+    if (Store::Root().markedForDeletionAndAbandoned(*this)) {
+        debugs(20, 3, "marked for deletion and abandoned");
+        return false;


Should this be MemObject::SwapOut::swImpossible?

Yes, it should. I think this bug is a consequence of code duplication throughout this method: swImpossible state is always followed by returning 'false'. I would suggest to fix this with something like:

static bool CreateImpossibleDecision(StoreEntry *e, const char *msg) { debugs(20, 3, msg << " " << *e); swapOutDecision(MemObject::SwapOut::swImpossible); return false; }

and the above block(and other similar checks) would transform into:

if (Store::Root().markedForDeletionAndAbandoned(*this)) return CreateImpossibleDecision(this, "marked for deletion and abandoned");

Let's not polish this now. I am concerned that either something went wrong in this method (before this PR) or the method itself needs to be redesigned again (outside of this PR) because false always means swImpossible. Addressing that concern is not trivial and is outside this PR scope.

Done without polishing at a40bbf6.

rousskov · 2017-08-31T22:27:40Z

I approved the changes I requested. I do not like the final unlink()/markForUnlink() state. I think we should polish that split. However, that polishing can be done separately, and a separate polish may reduce overheads. @eduard-bagdasaryan, please proceed with rebasing the branch and doing the (hopefully) last batch of comprehensive tests.

@yadij, please approve the changes you requested if Eduard has indeed addressed them per #46 (comment)

eduard-bagdasaryan · 2017-09-01T17:06:49Z

Also, if possible, please assert that RELEASE_FLAG is not set inside clearPrivate(), confirming that the private key setting is permanent.

s/RELEASE_FLAG/RELEASE_REQUEST/
This is possible, I think: clearPrivate() usage implies that the flags at the moment if its call is either already checked with the same assertion (StoreEntry::setPublicKey()), or checked that the entry is not private (StoreEntry::clearPublicKeyScope()). The entry must not have RELEASE_REQUEST without being KEY_PRIVATE.
Done at 3935345.

eduard-bagdasaryan · 2017-09-12T05:03:52Z

@rousskov, please review the latest revision.

rousskov

I found a couple of more problems and added a few questions and polishing remarks as well. Please see the code comments for details.

rousskov · 2017-09-12T20:21:42Z

src/client_side_reply.cc

                }
 #endif
-                entry->release();
+                // entry->release() notifies other waiting workers, including this one


s/other waiting workers/waiting workers/

Also, the opening curly brace should go on the same line as the if-statement (it does in the original code). Please check that you have run the source code formatting script on these changes.

rousskov · 2017-09-12T21:35:17Z

src/Transients.cc

+    const auto idx = entry.mem_obj->xitTable.index;
+    const auto &anchor = collapsedWriter(entry) ?
+        map->writeableEntry(idx) : map->readableEntry(idx);
+    aborted = EBIT_TEST(anchor.basics.flags, ENTRY_ABORTED);


anchor.basics.flags is not atomic but may be updated by writer's StoreMap::abortWriting() calls, so it is wrong to read it without holding an exclusive lock. For example, when collapsedWriter() is false, we do not have an exclusive lock and might try to read anchor.basics.flags when another worker (writer) is doing StoreMap::abortWriting(). To properly fix this, we would probably have to add a second dedicated atomic flag to StoreMapAnchor, like waitingToBeFreed, and use it to mark and detect aborted map entries. The ENTRY_ABORTED flag in StoreMapAnchor will be unused (like it was before these changes). Any better ideas?

Can we place that second aborted flag so that sizeof(StoreMapAnchor) does not change on 32bit and 64bit platforms?

The rest of StoreMapAnchor::basics was and remains constant when readers are allowed AFAIK, but this PR started changing flags...

Any better ideas?

Having two duplicated flags looks not very well. Moreover, what if in future we would need to read one more 'flag' from anchor.basics.flags in another reader? Again add such 'atomic' duplicate?

I would like to suggest another approach:

aborted = false; // Since a shared entry gets ENTRY_ABORTED only within StoreMap::abortWriting(), // (where anchor.writing() becomes false), we should not check this value before. // Also note that trying to read ENTRY_ABORTED for being written shared entry // may cause an unpredictable result, because anchor.basics.flags is not 'atomic'. if (!anchor.writing() && EBIT_TEST(anchor.basics.flags, ENTRY_ABORTED)) aborted = true; waitingToBeFreed = anchor.waitingToBeFreed;

Having two duplicated flags looks not very well

I think we can avoid any duplication of flags in my proposal: The new atomic flag will be used without any relationship or reference to the unused(?) same-name bit in basics.flags. And we can name the new atomic flag writerGone or something like that to minimize the implication that the two are flags are duplicated.

Moreover, there is actually some extra danger in using ENTRY_ABORTED like we do in the current PR code. I suspect that some higher level code may assert if it gets an ENTRY_ABORTED entry from the cache. Our code probably copies shared basic.flags into local StoreEntry objects blindly, which (in rare cases) might result in a cache reader getting a ENTRY_ABORTED StoreEntry hit and asserting. The code should handle later hit aborts correctly (because they can always happen on incomplete entries), but finding an already aborted hit in the cache, especially for a "complete" entry, may be problematic. This concern alone may be a sufficient reason to not use ENTRY_ABORTED for our purposes, even if the atomicity issue does not exist!

Finally, we should consider removing basics.flags as effectively unused, but probably not in this PR because that decision is not trivial.

// Since a shared entry gets ENTRY_ABORTED only within StoreMap::abortWriting(), // (where anchor.writing() becomes false), we should not check this value before. if (!anchor.writing() && EBIT_TEST(anchor.basics.flags, ENTRY_ABORTED)) aborted = true;

This trick misses the (rare) case where the flag has been set (before the check) but the write lock has not been released (until after the check). This problem is not critical because the affected reading worker should receive another notification later, when the writer finishes updating its state. At that time, the reader will detect the aborted entry correctly. Thus, we can get away with this by adding an XXX.

The question is: Why should we use tricks and add XXXs when we can use an atomic flag instead? I do not think we should if the earlier discussion addresses your "duplicated flags" concern.

Done at 8cd8303. Does it meet your expectation?

Can we place that second aborted flag so that sizeof(StoreMapAnchor) does not change on 32bit and 64bit platforms?

I placed it just after waitingToBeFreed. I assume that since sizeof(waitintToBeFreed) == sizeof (writerHalted) == 1, these fields are not aligned. The following data field (uint64_t key[2]) should be aligned with 3-byte offset before changes and with 2-byte offset after changes. The sizeof(StoreMapAnchor) should be the same in both cases, I think. In my environment it is 96 bytes.

rousskov · 2017-09-12T21:38:42Z

src/store/Controller.cc

+Store::Controller::syncCollapsedWriter(StoreEntry &collapsed)
 {
-    assert(transients);
+    bool aborted = false;


Edit: This change request is probably not relevant because syncCollapsed() restructuring discussed in another change request should make these polishing touches unnecessary.

s/aborted/abortedByUs/ to restore symmetry with syncCollapsedReader().

Also, please add a comment to confirm and explain why this variable is unused. For example:

// keeps Transients::status() API simple; unused because we already // know whether we are aborted (so this flag does not need syncing) bool abortedByUs = false;

You can write know we are not aborted instead because syncCollapsed() checks that, but I would not introduce that unnecessary implied dependency.

Skipped as irrelevant.

rousskov · 2017-09-12T21:53:25Z

src/store/Controller.cc

-        debugs(20, 7, "not SMP-syncing not-transient " << xitIndex);
-        return;
+    if (waitingToBeFreed) {
+        debugs(20, 3, "releasing writing " << collapsed << " due to waitingToBeFreed shared status");


We cannot really release a being-written entry so I would

s/releasing/marking/
or
s/releasing/will release/

I would also drop "shared status" for brevity sake.

Skipped as irrelevant.

rousskov · 2017-09-12T22:46:12Z

src/store/Controller.cc

-        return;
+    if (waitingToBeFreed) {
+        debugs(20, 3, "releasing writing " << collapsed << " due to waitingToBeFreed shared status");
+        collapsed.release(true);


This check and release() call are missing for the syncCollapsedReader() case. There is a similar check calling abort() there, but that check does not cover the non-aborting cases AFAICT.

When I imagine the resulting (fixed) code, it feels like we should restructure Store::Controller::syncCollapsed() to reduce the chance of similar future bugs and to reduce code duplication. A single method like this may work better:

... debugs(20, 7, "syncing " << *collapsed); bool abortedByWriter = false; bool waitingToBeFreed = false; transients->status(collapsed, abortedByWriter, waitingToBeFreed); if (waitingToBeFreed) { debugs(20, 3, "will release " << collapsed << " due to waitingToBeFreed"); collapsed.release(true); // may already be marked } if (transients->collapsedWriter(*collapsed)) return; // readers can only change our waitingToBeFreed flag assert(transients->collapsedReader(*collapsed)); if (abortedByWriter) { debugs(20, 3, "aborting " << collapsed << " because its writer has aborted"); collapsed.abort(); return; } ... the rest of the current syncCollapsedReader() code follows ...

rousskov · 2017-09-12T23:39:01Z

src/ipc/StoreMap.h

    /// free the entry if possible or mark it as waiting to be freed if not
-    void freeEntry(const sfileno fileno);
+    /// \param stateChanged when provided, true if the entry state changes, false otherwise
+    void freeEntry(const sfileno fileno, bool *stateChanged = nullptr);


I suggest a simpler interface:

/// free the entry if possible or mark it as waiting to be freed if not /// \returns whether the entry was neither empty nor marked bool freeEntry(const sfileno fileno);

I did similarly at first, but almost all (~15) freeEntry() calls end up with ignoring its return value. We would have to prefix them with (void), which looks ugly.

almost all freeEntry() calls end up with ignoring its return value. We would have to prefix them with (void)

I do not think we have to or should prefix those calls with (void). That prefix is useful or needed when you want to emphasize that a usually important return value is ignored on purpose rather than by accident. In this case, the return value is usually unimportant (because it does not indicate a success or failure -- the method cannot fail) and, hence, can be silently ignored by most callers.

rousskov · 2017-09-12T23:49:28Z

src/ipc/StoreMap.cc

        freeChain(fileno, s, false);
-    else
-        s.waitingToBeFreed = true; // mark to free it later
+        if (stateChanged)


To avoid returning true for the already marked and/or already free entries, we can do something like this:

if (s.lock.lockExclusive()) { const bool result = !(s.waitingToBeFreed || s.empty()); freeChain(...); return result; } // mark to free the locked entry later (if not already marked) bool expected = false; return s.waitingToBeFreed.compare_exchange_strong(expected, true);

Done.
I used !s.waitingToBeFreed && !s.empty() instead, which looks simpler for me.

rousskov · 2017-09-13T00:32:42Z

src/store/Controller.cc

 {
    debugs(20, 3, storeKeyText(key));

+    if (markedForDeletion(key))


Please add a debugs() here, to match the other lines in this method and to provide an explanation to admins trying to understand why the previous cached object does not produce hits:

if (markedForDeletion(key)) { debugs(20, 3, "ignoring marked " << storeKeyText(key)); return nullptr; }

rousskov · 2017-09-13T00:38:59Z

src/store/Controller.cc

+        debugs(20, 3, "got in-transit entry: " << *entry);
+        entry->release(true);
+        // The entry should be locked (but see find()). Return here because the
+        // release of a locked entry should take care of all its stores.


Why do we care whether the entry is locked here? StoreEntry::release() takes care of all its stores regardless of whether the entry is locked, right? If this comment is stale or just wrong, I would do:

entry->release(true); // takes cares of all its stores if any return;

However, the "all of its stores" logic implies that release() may ignore Store entries that are not attached to the entry object. On the other hand, Controller::unlinkByKeyIfFound() should mark all matching store entries in existence, even those not currently associated with any StoreEntry object. In other words:

The above if-statement takes care of the (matching) store entries attached to the matching local StoreEntry object.

The code below that if-statement takes care of the matching store entries when there is no matching local StoreEntry object.

Who will take care of the matching store entries unattached to the matching local StoreEntry object? I believe such entries may exist because, for example, one worker could be writing a response to disk while another worker may be writing the same-key response to the memory cache. Can Transients guarantee that these two independent concurrent writes can never happen? I do not think so: Transients may not be able to create a shared entry to coordinate those two workers because, for example, the required Transients slot in shared memory may already be locked by another key due to a benign hash collision. And there may be cases other than two concurrent different-worker writes where a local matching StoreEntry coexists with matching but unattached store entries.

If my concern is valid, then we cannot return from the above if-statement. We have to continue to the code below. We still need to if-statement to keep any local StoreEntry objects in sync with Store changes, of course.

However, the "all of its stores" logic implies that release() may ignore Store entries that are not attached to the entry object.

Looks like this comment is not precise: release() should mark not attached entry for deletion from storages via Controller::markForUnlink(), and corresponding markForUnlink() storage methods.

BTW, Controller::unlinkByKeyIfFound() is only called from purgeEntriesByUrl() with precondition that Store::Root().get() returned nil. That makes me thinking that the "if" block in Controller::unlinkByKeyIfFound() is not necessary.

Code: release(true); // takes cares of all its stores if any

Alex: the "all of its stores" logic implies that release() may ignore Store entries that are not attached to the entry object

Eduard: release() should mark not attached entry for deletion from storages via Controller::markForUnlink

Whether release() should unlink unattached entries is actually debatable, but, for this discussion, let's assume that release() does unlink unattached entries.

Eduard: Looks like [Alex] comment is not precise

My comment seems OK, but the source code comment implies something that it should not imply. In other words, if release() unlinks unattached entries, then we need to fix the comment. Something like this may work better:

// unlink all Store entries with entry->key, even unattached ones entry->release(true); return;

And to minimize similar doubts in the future, we should improve the release() description as well. For example, something like this would be better than the current text IMO (we are running rather thin on Store-related terminology, but let's not try to fix that here, in a method description):

/// Removes all unlocked (and marks for eventual removal all locked) Store /// entries, including attached and unattached entries that have our key. /// Also destroys us if we are unlocked or makes us private otherwise. virtual void release(const bool shareable = false);

Controller::unlinkByKeyIfFound() is only called from purgeEntriesByUrl() with precondition that Store::Root().get() returned nil. That makes me thinking that the "if" block in Controller::unlinkByKeyIfFound() is not necessary.

AFAICT, that thinking does not match the code because Root().get() can return nil when store_table has a matching entry. See the markedForDeletion() check in Controller::find().

And, more importantly, we should not make nil get() a hidden precondition for calling Root().unlinkByKeyIfFound() because some caller will eventually forget to check that. Your comment and several similar earlier discussions all suggest to me that we need to revise the unlink/markForUnlink API. I hope we can do that after this PR. For example, it seems likely that a chunk of the current purgeEntriesByUrl() logic should go directly inside Controller::unlinkByKeyIfFound() instead of appearing as preconditions for (or alternative actions to) that method call.

Done: fixed comments, the code is the same.

rousskov · 2017-09-13T04:36:50Z

src/client_side_reply.cc

+
+                entry->unlock("purgeEntriesByUrl");
+            } else {
+                Store::Root().unlinkByKeyIfFound(key); // does not broadcast but there are no waiting workers


The argument is valid IMO, but please s/are/were/ to make it a bit more accurate.

eduard-bagdasaryan · 2017-09-14T17:34:15Z

@rousskov , I addressed your latest remarks.There is a couple of not yet fully discussed questions.

rousskov · 2017-09-14T19:02:02Z

src/ipc/StoreMap.cc

        return result;
    }

+    // TODO: check whether 'empty' anchors can be still locked.


Yes, any anchor/entry can be locked. Locking does not depend on the entry state (other than the state of the lock itself, of course, but the lock can/should be viewed as being "outside" the entry itself).

rousskov · 2017-09-14T19:47:50Z

src/client_side_reply.cc

+
+                entry->unlock("purgeEntriesByUrl");
+            } else {
+                Store::Root().unlinkByKeyIfFound(key); // does not broadcast but there were no waiting workers


This comment is still misleading or even wrong. There could be workers working on an entry with key, of course. What we are trying to say here is that we should not worry about ignoring them. Why? I believe we can safely ignore such workers because either the entry has already been marked for deletion (and so another notification would make no difference) or Controller::find() failed to create new or join an existing Transients entry (and so we would not be able to notify anybody anyway). If my logic is correct, then let's replace this comment with something like this:

// The call below does not notify workers but that is OK here because nil // Store::Root().get() guarantees that either the matching entry has already // been marked (so new notifications are useless) OR get() could not create // (or join) a Transients entry (so notifications are impossible).

// The call below does not notify workers but that is OK here because nil // Store::Root().get() guarantees that either the matching entry has already // been marked (so new notifications are useless) OR get() could not create // (or join) a Transients entry (so notifications are impossible).

I do not mind this in general, however note that if get() returns nil that only means that "it could join a Transients entry" (as well as any other of shared or local storages), we don't know whether the entry was deleted or have not been created. Also, AFAIK, get() does not create Transients entries: it is done by Controller::allowCollapsing() and Transients::startWriting().

if get() returns nil that only means that "it could join a Transients entry"

I assume you meant "could not join a Transients entry".

(as well as any other of shared or local storages)

Right, but those other stores do not matter as far as notifications are concerned, and the comment we are discussing is about notifications.

get() does not create Transients entries

Good point! It only creates StoreEntries from existing Transient entries.

Another problem with that comment is that it is actually possible to broadcast a notification even if we could not join an existing Transients record -- broadcast messages just need xitTable.index value which we could compute. Such broadcasts would be useless because if we could not join then nobody else could, and so there are no readers to broadcast to (and the Transients writer does not care about the delete broadcasts, right?).

Here is a fixed(?) comment:

// The call below does not notify workers but that is OK here because nil // Store::Root().get() guarantees that the matching Transients entry does // not exist (so there are no readers to notify), could not be joined yet // (ditto), or has already been marked (so new notifications are useless).

However, all the conditions in the above comment were true only at the get() time. They may no longer be true at the unlinkByKeyIfFound() time! What happens if a new reader comes and manages to attach to the Transients entry after our failed get() but before our unlinkByKeyIfFound() call? Will that reader have to wait for the next store write to realize that it can never attach to the stores and has to abandon the current transaction? If yes, is there a simple way to improve this?

Perhaps unlinkByKeyIfFound() should broadcast using raw Transients index but it feels like we should just reorder the markings instead:

Mark stores (not Transients!) by key.

Root().get().

If we got an entry, do the usual entry->release(), with Broadcast().

Otherwise, do nothing.

What do you think?

What happens if a new reader comes and manages to attach to the Transients entry after our failed get() but before our unlinkByKeyIfFound() call?

BTW, similar logic should be applied to the writer itself: an entry may be locked for Transients writing (without appending) at the time of get() but become 'appending' (and available for another get()) just before unlinkByKeyIfFound(). Then the skipped broadcasting would not release() that local writer entry (Controller::syncCollapsed()).

but it feels like we should just reorder the markings instead

What is the need of (1)? If Root.get() works for us(Transients is not marked) why it will not work for others? I.e., If shared Transients entry is unavailable at (2) but becomes available just before (3) wont we face same problems?

Perhaps unlinkByKeyIfFound() should broadcast using raw Transients index

This would be more reliable solution, I assume.

similar logic should be applied to the writer itself: an entry may be locked for Transients writing (without appending) at the time of get() but become 'appending' (and available for another get()) just before unlinkByKeyIfFound(). Then the skipped broadcasting would not release() that local writer entry (Controller::syncCollapsed()).

Agreed. The writer problem cannot be (easily) solved by broadcasting AFAICT. Instead, right before switching to appending mode, the writer must check waitingToBeFreed and release() its entry. If that code is missing, please add it. I think there is already similar code somewhere.

What is the need of (1)?

Step 1 calls unlinkByKeyIfFound() for memory and disk stores (if any).

If Root.get() works for us(Transients is not marked) why it will not work for others?

Root.get() will work for others as well, but they will get our notification and will be fine. They will not attach themselves to stores because we marked those first (in step 1).

If shared Transients entry is unavailable at (2) but becomes available just before (3) wont we face same problems?

We will! Glad you found this bug. I also realized that my step 4 was missing unlinkByKeyIfFound() for Transients, but adding that would not fix this bug.

Also, I really do not want to broadcast based on unlocked Transient index values because key:index mapping is (or will be) unreliable (if StoreMap supports hash collision resolution).

I do not have any good solutions to offer right now. Back to the drawing board... If you can offer something that addresses all these problems, please do!

I tested latest adjustments with our Daft test. One problem occasionally happens: when several clients request a disk-cached entry, Squid returns unexpected "503 Service Unavailable" responses instead of "200" hits. Cache.log discovery showed that the reason is Transients entry creation collision: two workers simultaneously find the entry in Rock::SwapDir::get(), create StoreEntry objects and try to create Transients entry, locking it for writing. Unexpected 503s looks like the test imperfection, in reality we will get unexpected cache misses in this case.

Glad our tests are capable of exposing such hidden problems!

In general, it is acceptable to return a miss when we cannot guarantee reliable hit delivery. A hit test must either accommodate/tolerate/retry such misses or avoid them.

However, we should also avoid making hit delivery unreliable without a good reason. In the context of your "concurrent hits" example, I do not see such a reason (from an independent reviewer point of view; ignoring internal problems of our own making). There are two different cases to consider here:

A complete() store entry. AFAICT, such an entry does not need a Transients entry. There is no need to signal other workers, even when handling DELETE or similar purging events.

An in-complete() store entry. Such an entry must already have a Transients entry so there should be no creation conflicts at hit time.

If the above analysis is correct, then we should relax the earlier "every active Store entry must have a corresponding Transients entry" rule to "every Store entry being written must have a corresponding Transients entry", right? A store reader must still attach to the existing Transients entry (if any), of course, but it should not try to create that Transients entry.

If you agree with the above analysis, please make the necessary code adjustments. Otherwise, please discuss what I have missed.

A complete() store entry. AFAICT, such an entry does not need a Transients entry. There is no need to signal other workers, even when handling DELETE or similar purging events.

I am thinking about possible problems with scenario:

There is a disk-cached entry.

Two clients request it at the same time, only one of them manages to create Transients entry.

The two clients start receiving cache hit.

DELETE comes on another worker, marking the the shared cache and releasing only the first StoreEntry(via notification).

The first client finishes getting the response and the corresponding Transients entry is freed. Our Store::Controller::markedForDeletion() protection stops working since now.

Another client hits the the second worker, thus accessing its local still active StoreEntry (and ignoring previous DELETE request).

Two clients request an existing disk-cached entry at the same time, only one of them manages to create Transients entry.

With the adjusted rule, none of the reading workers will try to create a Transients entry.

Another client hits the the second worker, thus accessing its local still active StoreEntry (and ignoring previous DELETE request).

This concern is valid (and we do not even need two clients in step 2 to trigger this problem!).

I see two ways to fix this problem:

A. Go back to the original "all active store entries have Transients entries" rule and accept the increased probability of rejecting perfectly valid cache hits.

B. Stay with the reduced "store writers must create Transients entries" rule but adjust the existing Controller::markedForDeletion() method to return true for keys that have local StoreEntry objects reading store entries marked for deletion. Please note that such StoreEntry objects will have a lock on the store entries they are reading.

I think solution B is preferable. Does it miss any use cases?

Stay with the reduced "store writers must create Transients entries" rule

be9511bb implements this suggestion. While re-running the test, the above problem has gone, but a new, similar one appeared. I found the source of that new problem and posted the fix. After that, the test succeeded for both shared memory and disk cache.

rousskov · 2017-09-14T19:55:05Z

There is a couple of not yet fully discussed questions.

I believe I answered all the pending questions. I also added a few additional (minor) change requests. I think we are getting really close to being done here. The ball is on your side.

rousskov

Nice choice for the new flag name -- much better than writerGone I have suggested.

I added a few minor change requests for the last commit but it looks good to me overall.

The only big issue remaining is the purgeEntriesByUrl() race condition discussed a bit earlier: #46 (comment)

rousskov · 2017-09-15T16:16:46Z

src/ipc/StoreMap.h


 public:
    mutable ReadWriteLock lock; ///< protects slot data below
-    std::atomic<uint8_t> waitingToBeFreed; ///< may be accessed w/o a lock


Please do not change this type unless we have to for some reason. The correct type here may actually be std::atomic_bool, but this PR is not about improving that type and such improvements are not trivial.

Still open.

Adding this old problem to the new review: Please do not change this type unless we have to for some reason. The correct type here may actually be std::atomic_bool, but this PR is not about improving that type and such improvements are not trivial.

rousskov · 2017-09-15T16:22:18Z

src/ipc/StoreMap.h

    mutable ReadWriteLock lock; ///< protects slot data below
-    std::atomic<uint8_t> waitingToBeFreed; ///< may be accessed w/o a lock
+    std::atomic<bool> waitingToBeFreed; ///< may be accessed w/o a lock
+    /// true when StoreMap::abortWriting() was called but there were still readers


I think we can simplify because we do not need to document the free entry state:

std::atomic<uint8_t> writerHalted; ///< whether abortWriting() has been called

Still open.

Adding this old problem to the new review: I think we can simplify because we do not need to document the free entry state:

std::atomic<uint8_t> writerHalted; ///< whether abortWriting() has been called

rousskov · 2017-09-15T16:24:27Z

src/Transients.cc

+    const auto idx = entry.mem_obj->xitTable.index;
+    const auto &anchor = collapsedWriter(entry) ?
+        map->writeableEntry(idx) : map->readableEntry(idx);
+


I would remove this empty line: All the lines above and below it are about extracting those two flags.

Still open. Please also remove the new empty line in Transients::copyFromShm().

https://measurementfactory.atlassian.net/secure/attachment/32001/SQUID-175-methods-do-not-invalidate-v4-t17.patch

(http://lists.squid-cache.org/pipermail/squid-dev/2017-July/009220.html): * MemStore::updateCollapsed(): restored old behavior with returning true when the given collapsed entry is disconnected from he cache. * unlinkByKeyIfFound(): do nothing if the 'map' is null. * Corrected several methods documentation and code comments. Still not addressed: * Whether(and how) remove RELEASE_REQUEST flag direct manipulation for a just created StoreEntry.

Before this change, the code abused RELEASE_REQUEST flag setting, violating general 'releaseRequest()' API. For now, RELEASE_REQUEST flag means an (locked) entry marked for removal. As a consequence, such StoreEntry becomes not reachable for new store clients. On the other hand, if a StoreEntry is private, it is not usable by new store clients also. This patch aggregates these two related notions: all work of making StoreEntry 'unusable for new store clients' is performed now inside StoreEntry::setPrivateKey(). RELEASE_REQUEST flag now gets a proper meaning: it means a private StoreEntry which can't become public anymore. In other words, this new meaning implies that after setting RELEASE_REQUEST (making the StoreEntry 'permanently' private), the other code must not clear that flag (and change the key from private to public). Some existing EBIT_CLR(RELEASE_REQUEST) lines were in the way. It was carefully analysed that removing them is safe (i.e., there are no scenarios when StoreEntry has the flag set at these points).

Keeping to this new meaning introduced at 9cfa044, simplified with a new StoreEntry::permanentlyPrivate().

rousskov

I am done fixing and polishing this PR branch. I will freshen up the description and am asking @eduard-bagdasaryan to review my commits, but if there are no surprises, then this code can be finally committed.

rousskov · 2018-01-03T21:30:57Z

Some [Jenkins] checks were not successful

After checking a few, I am going to assume that all build failures were those mysterious "reference is not a tree" Jenkins plugin(?) failures that I keep asking @kinkie about. I am going to ignore them in hope that the next build will magically avoid them.

yadij · 2018-01-17T05:16:45Z

There are some real build errors remaining to be fixed in the bb533f0 test results. Please can one of you attend to those and get this merged into master ASAP. This is now not going into the Jan v4 beta, but I would really like it to have as long a period of use as possible for the Feb beta.

eduard-bagdasaryan · 2018-01-17T13:28:21Z

Looks like these errors were caused by the fact that only a couple of SwapDir virtual members are overridden. We should override all such methods or none of them.

rousskov · 2018-01-17T15:30:39Z

@eduard-bagdasaryan, please remove those inconsistent overrides while you are reviewing the changes.

* Removed unused EntryGuard::releaseLocked(). * Protected EntryGuard destructor from the throwing code inside of it.

... and honor this rule in Transients::evictCached().

eduard-bagdasaryan · 2018-01-26T08:33:09Z

Checking merge bot, attempt 2.

... this is what previous 5c09e58 missed.

The removing of writing state from Transients was incorrect: there are several places where writers and readers behave differently. For example, broadcasting should only work when there are readers and should not work when there is only writer. Another example is checking for transient readers (giving incorrect results) inside cheCheckQuickAbortIsReasonable().

* Fixed Transients::addEntry() comment * Fixed wrong argument passing to StoreEntry::release() * Removed wrong statement in Controller::transientsCompleteWriting()

Before this change (and after 5c09e58) a Transients entry sometimes could not be properly evicted using Transients::evictCached() because the StoreEntry had already been disconnected from the Transients cache, while evicting shared memory and destroying memory object. To overcome this, we need to evict the entry from Transients before destroying memory object. It may seem that doing so would violate the "transients invariant" (every public StoreEntry must have the corresponding Transients entry), for a very short period of time. However this is not a problem, because the StoreEntry is not accessible via Controller interface after it has been evicted form Transients.

rousskov · 2018-01-26T18:23:44Z

Status update: @eduard-bagdasaryan is investigating one potential problem discovered during his last review. Once he addresses that concern, I plan to merge these changes.

rousskov · 2018-02-02T04:16:45Z

Status update: Problem confirmed. The fix is being tested.

StoreEntry::release() (during disk rebuild) and storeGetMemSpace() were purging unlocked entries from the memory cache without abandon()ing them. Purging an entry from a memory cache destroys mem_obj and, hence, its Transients (if any). Such silent Transients loss violates the public=hasTransients invariant and leads to bugs, especially if StoreEntry was not cached in memory and, hence, did not broadcast its eviction. Also do not memoryEvictCached() in StoreEntry::release() because all memoryEvictCached() calls are dangerous, and that call was not needed as releaseRequest() yields to a (safe and broader) evictCached() call. Dangerous purgeMem() is gone, and memoryEvictCached() is now private. Also possibly fixed purging of incomplete on-disk entries that managed to lose their mem_obj (in non-SMP caching environment): Controller::handleIdleEntry() was not releasing such entries because it relied on purgeMem() that did nothing without a mem_obj.

rousskov · 2018-02-02T15:58:39Z

Tests failed with the usual reference is not a tree Jenkins plugin error. I suspect Jenkins GitHub integration plugin (or something related to it) has a bug where target branch changes (at certain sensitive stages of the test execution) result in these failures. Resyncing the branch and retesting...

I plan to merge this PR if the tests are successful.

rousskov · 2018-02-02T18:52:27Z

Three out of ten Jenkins tests failed with the usual reference is not a tree error (again).

the error is not related to Squid code
the same tests have all passed earlier without significant branch code changes
other (random) tests have failed earlier for the same non-Squid reason
successful tests cover both compilers in diverse environments
GitHub is having trouble with this huge PR -- often unicorning this Conversation page; I suspect GitHub may have similar troubles when the PR is accessed by the Jenkins plugin using GitHub API.

Given the combination of the above facts, I plan to ignore similar future failures for this PR and merge this PR in a few hours.

P.S. I tried to restart just the failed Jenkins test runs, but could not find a way to do it. I then tried to delete the failed tests in hope that Jenkins will report an overall success to GitHub, but it looks like Jenkins deleted the whole build 312 (all tests) instead. Sigh.

rousskov · 2018-02-02T20:56:16Z

After triggering another round of tests, I forced-merged ignoring several Jenkins test failures. All failures were due to the same reference is not a tree error.

I expect these changes to trigger new and expose old bugs: While we did perform several rounds of fairly extensive lab testing, the vast scope and complexity of these changes virtually guarantee problems in some untested use cases.

The next big step is to fix Squid bug 4579 mentioned as an important XXX in this PR commit message. We also need to officially record a dozen out-of-scope problems discovered while working on this PR.

When Squid finds a requested entry in the memory cache, it does not check whether the same entry is also stored in a cache_dir. The StoreEntry object may become associated with its store entry in the memory cache but not with its store entry on disk. This inconsistency causes two known problems: 1. Squid may needlessly swap out the memory hit to disk, either overwriting an existing (and identical) disk entry or, worse, creating a duplicate entry on another disk. In the second case, the two disk entries are not synchronized and may eventually start to differ if one of them is removed or updated. 2. Squid may not delete a stale disk entry when needed, violating various HTTP MUSTs, and eventually serving stale [disk] cache entries to clients. Another purging problem is not caused by the above inconsistency: 3. A DELETE request or equivalent may come for the entry which is still locked for writing. Squid fails to get a lock for such an entry (in order to purge it) and the entry remains in disk and/or memory cache. To solve the first two problems: * StoreEntry::mayStartSwapout() now avoids needless swapouts by checking whether StoreEntry was fully loaded, is being loaded, or could have been loaded from disk. To be able to reject swapouts in the last case, we now require that the newer (disk) entries explicitly delete their older variants instead of relying on the Store to overwrite the older (unlocked) variant. That explicit delete should already be happening in higher-level code (that knows which entry is newer and must mark any stale entries for deletion anyway). To fix problem squid-cache#3: * A new Store::Controller::evictIfFound(key) method purges (or marks for deletion if purging is impossible) all the matching store entries, without loading the StoreEntry information from stores. Avoiding StoreEntry creation reduces waste of resources (the StoreEntry object would have to be deleted anyway) _and_ allows us to mark being-created entries (that are locked for writing and, hence, cannot be loaded into a StoreEntry object). XXX: SMP cache purges may continue to malfunction when the Transients table is missing. Currently, Transients are created only when the collapsed_forwarding is on. After Squid bug 4579 is fixed, every public StoreEntry will have the corresponding Transients entry and vice versa, extending these fixes to all SMP environments. Note that even if Squid properly avoids storing duplicate disk entries, some cache_dir manipulations by humans and Squid crashes may lead to such duplicates being present. This patch leaves dealing with potential duplicates out of scope except it guarantees that if an entry is deleted, then all [possible] duplicates are deleted as well. Fixing the above problems required (and/or benefited from) many related improvements, including some Store API changes. It is impractical to detail each change here, but several are highlighted below. To propagate DELETEs across workers, every public StoreEntry now has a Transients entry. Prevented concurrent cache readers from aborting when their entry is release()d. Unlike abort, release should not affect current readers. Fixed store.log code to avoid "Bug: Missing MemObject::storeId value". Removed Transients extras used to initialize MemObject StoreID/method in StoreEntry objects created by Transients::get() for collapsed requests. Controlled::get() and related Controller APIs do not _require_ setting those MemObject details: get() methods for all cache stores return StoreEntry objects without them (because entry basics lack Store ID and request method). The caller is responsible for cache key collision detection. Controlled::get() parameters could include Store ID and request method for early cache key collision detection, but adding a StoreQuery class and improving collision detection code is outside this project scope (and requires many changes). Found more cases where release() should not prevent sharing. Remaining cases need further analysis as discussed in master 39fe14b. Greatly simplified UFS store rebuilding, possibly fixing subtle bug(s). Clarified RELEASE_REQUEST flag meaning, becoming 'a private StoreEntry which can't become public anymore'. Refactored the related code, combining two related notions: 'a private entry' and 'an entry marked for removal'. Do not abort collapsed StoreEntries during syncing just because the corresponding being stored shared entry was marked for deletion. Abort them if the shared entry has been also aborted. Added StoreEntry helper methods to prevent direct manipulation of individual disk-related data members (swap_dirn, swap_filen, and swap_status). These methods help keep these related data members in a coherent state and minimize code duplication. Conflicts: src/MemObject.cc

When Squid finds a requested entry in the memory cache, it does not check whether the same entry is also stored in a cache_dir. The StoreEntry object may become associated with its store entry in the memory cache but not with its store entry on disk. This inconsistency causes two known problems: 1. Squid may needlessly swap out the memory hit to disk, either overwriting an existing (and identical) disk entry or, worse, creating a duplicate entry on another disk. In the second case, the two disk entries are not synchronized and may eventually start to differ if one of them is removed or updated. 2. Squid may not delete a stale disk entry when needed, violating various HTTP MUSTs, and eventually serving stale [disk] cache entries to clients. Another purging problem is not caused by the above inconsistency: 3. A DELETE request or equivalent may come for the entry which is still locked for writing. Squid fails to get a lock for such an entry (in order to purge it) and the entry remains in disk and/or memory cache. To solve the first two problems: * StoreEntry::mayStartSwapout() now avoids needless swapouts by checking whether StoreEntry was fully loaded, is being loaded, or could have been loaded from disk. To be able to reject swapouts in the last case, we now require that the newer (disk) entries explicitly delete their older variants instead of relying on the Store to overwrite the older (unlocked) variant. That explicit delete should already be happening in higher-level code (that knows which entry is newer and must mark any stale entries for deletion anyway). To fix problem #3: * A new Store::Controller::evictIfFound(key) method purges (or marks for deletion if purging is impossible) all the matching store entries, without loading the StoreEntry information from stores. Avoiding StoreEntry creation reduces waste of resources (the StoreEntry object would have to be deleted anyway) _and_ allows us to mark being-created entries (that are locked for writing and, hence, cannot be loaded into a StoreEntry object). XXX: SMP cache purges may continue to malfunction when the Transients table is missing. Currently, Transients are created only when the collapsed_forwarding is on. After Squid bug 4579 is fixed, every public StoreEntry will have the corresponding Transients entry and vice versa, extending these fixes to all SMP environments. Note that even if Squid properly avoids storing duplicate disk entries, some cache_dir manipulations by humans and Squid crashes may lead to such duplicates being present. This patch leaves dealing with potential duplicates out of scope except it guarantees that if an entry is deleted, then all [possible] duplicates are deleted as well. Fixing the above problems required (and/or benefited from) many related improvements, including some Store API changes. It is impractical to detail each change here, but several are highlighted below. To propagate DELETEs across workers, every public StoreEntry now has a Transients entry. Prevented concurrent cache readers from aborting when their entry is release()d. Unlike abort, release should not affect current readers. Fixed store.log code to avoid "Bug: Missing MemObject::storeId value". Removed Transients extras used to initialize MemObject StoreID/method in StoreEntry objects created by Transients::get() for collapsed requests. Controlled::get() and related Controller APIs do not _require_ setting those MemObject details: get() methods for all cache stores return StoreEntry objects without them (because entry basics lack Store ID and request method). The caller is responsible for cache key collision detection. Controlled::get() parameters could include Store ID and request method for early cache key collision detection, but adding a StoreQuery class and improving collision detection code is outside this project scope (and requires many changes). Found more cases where release() should not prevent sharing. Remaining cases need further analysis as discussed in master 39fe14b. Greatly simplified UFS store rebuilding, possibly fixing subtle bug(s). Clarified RELEASE_REQUEST flag meaning, becoming 'a private StoreEntry which can't become public anymore'. Refactored the related code, combining two related notions: 'a private entry' and 'an entry marked for removal'. Do not abort collapsed StoreEntries during syncing just because the corresponding being stored shared entry was marked for deletion. Abort them if the shared entry has been also aborted. Added StoreEntry helper methods to prevent direct manipulation of individual disk-related data members (swap_dirn, swap_filen, and swap_status). These methods help keep these related data members in a coherent state and minimize code duplication. Conflicts: src/MemObject.cc

yadij requested changes Aug 21, 2017

View reviewed changes

yadij added the S-waiting-for-author author action is expected (and usually required) label Aug 21, 2017

rousskov requested changes Aug 23, 2017

View reviewed changes

rousskov reviewed Aug 25, 2017

View reviewed changes

rousskov reviewed Aug 30, 2017

View reviewed changes

rousskov changed the title ~~SMP Squid with SMP-aware caches sometimes does not purge cache entries.~~ SMP caches sometimes do not purge cache entries Aug 30, 2017

rousskov added S-waiting-for-reviewer ready for review: Set this when requesting a (re)review using GitHub PR Reviewers box and removed S-waiting-for-author author action is expected (and usually required) labels Aug 30, 2017

rousskov requested changes Aug 31, 2017

View reviewed changes

rousskov approved these changes Aug 31, 2017

View reviewed changes

rousskov requested changes Sep 13, 2017

View reviewed changes

rousskov added the S-waiting-for-author author action is expected (and usually required) label Sep 13, 2017

rousskov reviewed Sep 14, 2017

View reviewed changes

rousskov requested changes Sep 15, 2017

View reviewed changes

eduard-bagdasaryan added 4 commits September 21, 2017 12:09

Ported to master the corresponding SQUID-175 v4.0 patch:

35efd80

https://measurementfactory.atlassian.net/secure/attachment/32001/SQUID-175-methods-do-not-invalidate-v4-t17.patch

Refactored and clarified RELEASE_REQUEST flag meaning (pt. 2).

8502645

Keeping to this new meaning introduced at 9cfa044, simplified with a new StoreEntry::permanentlyPrivate().

rousskov approved these changes Jan 3, 2018

View reviewed changes

rousskov changed the title ~~SMP caches sometimes do not purge cache entries~~ Bug 4505: SMP caches sometimes do not purge entries Jan 3, 2018

rousskov removed the S-waiting-for-reviewer ready for review: Set this when requesting a (re)review using GitHub PR Reviewers box label Jan 3, 2018

Merge branch 'master' into SQUID-175-methods-do-not-invalidate-cache

bb533f0

eduard-bagdasaryan added 4 commits January 18, 2018 14:49

Fix an 'inconsistent override' compiler error caused by 957e7c6

1c85f6e

Polished Store::EntryGuard class

a2a0d4d

* Removed unused EntryGuard::releaseLocked(). * Protected EntryGuard destructor from the throwing code inside of it.

Assert if 'every public StoreEntry must have transients' rule failed

5c09e58

... and honor this rule in Transients::evictCached().

Remove an unused method argument

8c533e3

eduard-bagdasaryan added 4 commits January 26, 2018 12:03

Skip ENTRY_SPECIAL entries when checking Transients invariant

8a30b69

... this is what previous 5c09e58 missed.

Several minor fixes

b24aa05

* Fixed Transients::addEntry() comment * Fixed wrong argument passing to StoreEntry::release() * Removed wrong statement in Controller::transientsCompleteWriting()

Merged master into SQUID-175-methods-do-not-invalidate-cache.

1f599bb

Polished comment (primarily to trigger a Jenkins rebuild).

238f83f

rousskov merged commit 4310f8b into squid-cache:master Feb 2, 2018

rousskov removed the S-waiting-for-author author action is expected (and usually required) label Feb 2, 2018

		@@ -1766,19 +1778,22 @@ neighborsHtcpClear(StoreEntry * e, const char uri, HttpRequest req, const Htt
		char buf[128];

		for (p = Config.peers; p; p = p->next) {

Conversation

eduard-bagdasaryan commented Aug 16, 2017 • edited by rousskov Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

yadij left a comment

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

eduard-bagdasaryan commented Aug 23, 2017

Uh oh!

rousskov left a comment

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

rousskov Aug 25, 2017 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

eduard-bagdasaryan commented Aug 16, 2017 •

edited by rousskov

Loading

rousskov Aug 25, 2017 •

edited

Loading

rousskov Aug 25, 2017 •

edited

Loading