Skip to content

Conversation

@selmf
Copy link
Member

@selmf selmf commented Feb 24, 2021

Switching comics too fast in YACReader can lead to an accumulation of comic threads which are in different stages of parsing, processing and ultimatively cleanup. While these threads and the comic objects are usually disconnected from the render when spawning a new thread, the nature of Qt's queued connection and thread scheduling by the operating system leads to the situation that the disconnection and thread cleanup does not take effect immediately.

As thread disconnection and cleanup is not reliable, race conditions occur.

This PR protects against a race condition where the number of pages the render tries to prepare is set via a signal from the comic object. If this value is set by a stray comic thread while loading a comic, the false value in combination with a bookmarked page index (also from the stray comic) can lead to the render code trying to access a nonexistent page, which leads to a segfault.

Less obvious symptoms which do not lead to crashing can be loading a comic with the wrong total pagecount (navigation problem) and all sort of undefined behavior.

One quick and very dirty way to protect against this is checking if the sender matches the current comic object. Note that while this fixes #211 it is not a fix for the underlying thread problems nor does it take care of all possible symptoms. It does not remove the race condition, it just prevents it from wreaking havoc.

Copy link
Contributor

@vedgy vedgy left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I guess comic has already sent the signal before the disconnect() call. The call to Render::setNumPages() must have already been queued. I don't see how this situation can be avoided other than by destroying the Render object along with the Comic object and creating a new Render for a new Comic.

void Render::setNumPages(unsigned int numPages)
{
if (sender() != comic)
return;
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Comic::numPages signal is connected to Render::numPages signal in addition to this Render::setNumPages slot:

connect(comic, SIGNAL(numPages(unsigned int)), this, SIGNAL(numPages(unsigned int)), Qt::QueuedConnection);
connect(comic, SIGNAL(numPages(unsigned int)), this, SLOT(setNumPages(unsigned int)), Qt::QueuedConnection);

This signal forwarding is likely to cause some other bug under the conditions that trigger #211.

@selmf
Copy link
Member Author

selmf commented Feb 24, 2021

Well … the obvious way to fix this is to never have more than one active thread and to introduce a proper mechanism to manage the comic switching. Ideally, you would reuse the thread instead of creating new threads for each comic.

Sadly, the obvious way is not the easy one in this case and I am speaking from experience. The current system resembles a Gordian thread knot and untangling it is a delicate matter.

The forwarded signal is likely responsible for setting the comic page count in the interface outside of the render. In my tests, I was able to provoke a state where the program did not crash but interface and comic page count were not in sync, i.e. the page count in the interface was lower than the actual comic pages loaded in memory.

Another option to defuse this situation a little would be to remove the numPages() signal from the comic and introduce a more general purpose signal that indicates the comic has finished parsing and it is safe to use the data now.

@vedgy
Copy link
Contributor

vedgy commented Feb 25, 2021

There is one more easy way to fix the bug: create a proxy QObject, connect its signals to Comic's signals with Qt::QueuedConnection and connect Render's slots to the proxy's signals with Qt::DirectConnection. When the Comic object is replaced, simply delete the proxy (or proxy->disconnect() and proxy->deleteLater() if there is a chance that the replacement was triggered by a proxy's signal somehow).

By the way, I think I may have spotted the reason why #5 had caused memory leaks. When only QObject::destroyed signal of Comic was connected to QThread::quit, this disconnect() call in Render prevented this signal from being emitted and the thread from being destroyed. Currently the leak is absent because Render::createComic() calls Comic::invalidate() (which emits invalidated() connected to QThread::quit synchronously) before disconnecting all Comic's signals from every object. With the proxy object fix described above, this comic->disconnect() call can be safely removed and maybe #5 can be attempted again.

@selmf
Copy link
Member Author

selmf commented Feb 26, 2021

I don't think introducing an extra layer of abstraction at this point is a good idea. A proper handling mechanism would of course include a class to encapsulate the threading, a thread manager, so to speak. But that is beyond the scope of what we're trying to achieve here.

The idea to salvage the thread-ping-pong PR occurred to me too, but at this point this is a little too dangerous. @luisangelsm and I have discussed and we agree that the current implementation despite all its flaws and bad maintainability is proven to be good enough for most cases and reasonably stable, so it should not be unnecessarily exposed to possible breakage.

What we will do instead is to create parallel versions of mission critical code that can be enabled by opt-in via configuration switches. That way we can keep the old code around as long as necessary an at the same time work on the much needed improvements and replacements without worrying too much about regressions.

@luisangelsm : I will convert this PR to a WIP for now to run some more tests on the second signal @vedgy pointed out and try some more things on the connection queue and event loop front.

@selmf selmf marked this pull request as draft February 26, 2021 11:15
@vedgy
Copy link
Contributor

vedgy commented Feb 26, 2021

Even if the single worker thread is reused for different comic files, the already emitted signals for the previously opened comic would have to be filtered out somehow.

Apart from Comic::numPages, Render is connected to many other Comic's signals. I guess some of them may also be emitted just before the user switches to a different comic.

The new simple proxy class wouldn't require changing Comic or Render much. I expect it to completely eliminate this just-emitted signal bug - for all connections between Comic and Render - and with little risk. The proxy doesn't have to be an extra layer of abstraction and encapsulate Comic completely, because implementing a wrapper around each of Comic's member functions would mean a lot of boilerplate code. The proxy could consist of a bunch of signals and a single constructor that would connect it to a Comic's signals.

@selmf selmf force-pushed the fix/invalidate_race_condition branch from 9cb95f8 to 142fc87 Compare March 1, 2021 16:11
@selmf selmf marked this pull request as ready for review March 1, 2021 16:13
@selmf
Copy link
Member Author

selmf commented Mar 1, 2021

After some more testing and experimenting, I found a nice solution. All events for render are now processed after invalidating the comic and the comic is subsequentially disconnected and cleaned up. I have verified that this prevents the stray signals from zombie threads happening and I have left the sender check in the page number slot for extra safety.

I also tested running the processing step after disconnect, but that produced stray signals.

@luisangelsm the code impact of this is so low that we can safely merge this fix without any danger. Let me know if you want to take a look first or if I can go ahead.

Comment on lines 702 to 703
QCoreApplication::sendPostedEvents(this);
comic->disconnect();
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think there is still an API race between the QCoreApplication::sendPostedEvents(this) and comic->disconnect() lines. The comic's thread can post a signal-emitted event before the disconnection.

Quote from https://woboq.com/blog/how-qt-signals-slots-work-part3-queuedconnection.html:

When posting an event (in QCoreApplication::postEvent), the event will be pushed in a per-thread queue (QThreadData::postEventList). The event queued is protected by a mutex, so there is no race conditions when threads push events to another thread's event queue.

I also tested running the processing step after disconnect, but that produced stray signals.

What stray signals and why? I think reordering these two statements would eliminate the race...

@selmf
Copy link
Member Author

selmf commented Mar 1, 2021

I don't agree. As I explicitly wrote, I tested it. Running sendPostedEvents before calling disconnect resulted in no more signals from old comic objects arriving on my test slots (verified by comparing sender() with the current comic object).

When I ran the clear after the disconnect the race condition was present and stray signals appeared (verified the same way).

I haven't bothered to investigate the exact reason for this behavior. If you want to run your own tests, you just need to add a debug message to the sender check in the render code. You could also investigate Qt's source code directly.

But consider this: The cleanup of the posted events does not happen on the comic thread but on the main application thread. We first invalidate the comic to make it stop processing and sending new signals, then make sure all pent-up signals are processed and finally we disconnect the comic before deleting it. This might not be a picture book example of a cleanup, but it fixes the issue at hand and does so in a way that is not purely cosmetic. For me that is good enough.

You are of course welcome to investigate further using proper tools like thread sanitizer or valgrind, but I'll call it a day here.

@vedgy
Copy link
Contributor

vedgy commented Mar 1, 2021

We first invalidate the comic to make it stop processing and sending new signals

comic->invalidate() does not prevent comic from emitting e.g. numPages(). Comic::_invalidated isn't even checked before this signal is emitted in FileComic::process(). I'll try to trigger the supposed race tomorrow by inserting QThread::sleep calls at the beginning of FileComic::process() and between the sendPostedEvents and disconnect statements in Render::createComic().

@selmf
Copy link
Member Author

selmf commented Mar 1, 2021

Making the signal depend on the invalidation status was the first thing I tried to fix this. It made no difference. If you poke this legacy code hard enough you will almost certainly find other corner cases.

The goal here is not to create a perfect solution but a reasonably good one. As a rule of thumb we avoid investing time in code we expect to be replaced or substantially refactored and the render code is exactly such a case. You are of course free to tinker and test with it, but please avoid spending too much time on it. Keep it simple, go for good, not perfect, keep in mind that someone else needs to review and test and that time spent on legacy code is lost when said code is replaced.

@vedgy
Copy link
Contributor

vedgy commented Mar 2, 2021

Making the signal depend on the invalidation status was the first thing I tried to fix this. It made no difference.

This confirms my hypothesis: the signal has already been emitted and the event has already been posted (but not yet sent) by the time comic->invalidate() was called.

The goal here is not to create a perfect solution but a reasonably good one. As a rule of thumb we avoid investing time in code we expect to be replaced or substantially refactored and the render code is exactly such a case. You are of course free to tinker and test with it, but please avoid spending too much time on it. Keep it simple, go for good, not perfect, keep in mind that someone else needs to review and test and that time spent on legacy code is lost when said code is replaced.

These are valid points. Yet our involvement in YACReader development is a volunteer endeavor. It should be fun, not a trudge. Maximizing development speed at the expense of code quality and deep understanding can be depressing. At least for me, personally, implementing an optimal, correct solution with a good understanding of what is going on and why, is much more satisfying than hurrying to implement as many features and fix as many bugs as possible in the shortest possible time.

When you have fun doing something, plus learn and deepen your understanding in the process, that time is not lost. The bug fix we are working on here is not thoroughly tied to the current suboptimal legacy implementation. So this part will not necessarily be discarded when Comic class is reimplemented. Even if the fix won't retain its current form, the approach may be reused in the new implementation (which would also have to discard signals for load-canceled comic files). We may encounter a similar issue in other parts of YACReader code or in another project we work on. If we come up with and implement a reasonably optimal solution now, we can reuse it elsewhere. The knowledge and understanding gained during the investigation can be applied and put to good use elsewhere too.

One more point: with the current speed of legacy YACReader code elimination and replacement, rewriting Comic from scratch may actually happen 5 or 10 years from now. Fixing and improving the code gradually makes the application run better immediately, is less risky and does not require spending lots of time in one go. Such improvements are much more feasible in the short and middle term. In these years, while we wait and hope for a perfect rewritten solution, a hastily implemented workaround may cause another bug, which will take more time to patch up. Perhaps the thread ping-pong issue is the result of a "temporary" workaround of moving comic back to the main thread instead of investigating deeper and disconnecting only Render's slots from Comic's signals. And now the "temporary" workaround haunts you for years.

@vedgy
Copy link
Contributor

vedgy commented Mar 2, 2021

Below I describe experiments in my branch based on this PR: https://github.com/vedgy/yacreader/commits/experiments/invalidate_race_condition.

The first commit Fix 1-minute startup crash works around a common Qt application crash under X11 if its start-up takes 60 seconds or longer. I'll create a separate pull request with the fix soon.

The second commit Trigger the API race allows to easily and reliably reproduce the API race I have outlined in a previous comment: simply press Ctrl+Right twice after opening a comic file.

The third commit Fix the API race applies the statement reordering fix I have suggested earlier - and it successfully eliminates the race.

The fourth commit No sleeping - reproduce the crash by holding Ctrl+Left or Ctrl+Right reverts the QThread::sleep calls. At this commit I managed to reproduce a crash by holding down Ctrl+Right, then Ctrl+Left. And indeed I couldn't reproduce the crash when I swapped the QCoreApplication::sendPostedEvents(this) and comic->disconnect() lines again. So I placed a breakpoint at the qFatal line in Render::setNumPages() and discovered that the slot was invoked from inside the sendPostedEvents() call in Render::createComic(). sender() equaled nullptr, not Render::comic. Apparently the comic->disconnect() call retroactively edits data referenced by the already posted event and sets sender to nullptr. But this is not a problem for YACReader!

The fifth commit Don't crash if setNumPages() is invoked from sendPostedEvents() carefully disables the setNumPages()'s inequality check during createComic()'s sendPostedEvents call. Did some testing and couldn't reproduce any crash or weird behavior at this commit. Note that I have been able to reliably reproduce the crash described in #211 when I removed the sendPostedEvents fix once I figured out that Ctrl+Right has to be held down rather than manually repeated.

The sixth commit Revert "No sleeping - reproduce the crash by holding Ctrl+Left..." restores the QThread::sleep calls. At this point I experimented with moving the QThread::sleep(3); line above sendPostedEvents, disconnect and invalidate lines in Render::createComic() => got no crashes.

At the seventh commit Restore the API race I can easily crash YACReader by quickly pressing Ctrl+Right twice. And the qFatal call stack shows that the slot is invoked from the main thread's event loop, not from the sendPostedEvents call in createComic (in which case there would have beeen no crash).

So I recommend to reorder the disconnect and sendPostedEvents lines and fix the comment above the sendPostedEvents line. Also the too-strict sender() check should be removed from Render::setNumPages(). Early-returning from this one slot while still executing all others can result in some other bug. Best to run all the queued slots even if they are not useful anymore.

Note that the proxy object fix I described above would prevent these extra slot calls at the cost of more code changes and the extra signal indirection performance hit. Seeing as #211 is difficult to reproduce accidentally, the slots would very rarely be called inside the sendPostedEvents call in practice. Thus sendPostedEvents fix seems better than the proxy object fix, even long-term. However, there is a risk of the sendPostedEvents fix too. First of all, invoking slots connected to the obsolete comic's signals inside the call to Render::createComic(), which is called by Render::load(), which in turn is called by Viewer::open(), can cause problems elsewhere. For example, Viewer::open() calls prepareForOpening() before Render::load(), which calls goToFlow->reset(). So the unexpected signals from Render can cause issues in goToFlow. Furthermore, Render's slots connected to signals from other objects could be executed out of order and cause more weird bugs - now or after some future code changes. Consider the connect(pageRenders[currentPageBufferedIndex], SIGNAL(pageReady(int)), this, SLOT(prepareAvailablePage(int))); connection. The pageReady signal is emitted from a non-main thread, so prepareAvailablePage() can theoretically be called from sendPostedEvents. I don't know if any of these possible unexpected invocations cause issues to the current code. But because of these potential bugs elsewhere, the alternative proxy object fix seems safer to me.

@vedgy
Copy link
Contributor

vedgy commented Mar 2, 2021

The sixth commit Revert "No sleeping - reproduce the crash by holding Ctrl+Left..." restores the QThread::sleep calls. At this point I experimented with moving the QThread::sleep(3); line above sendPostedEvents, disconnect and invalidate lines in Render::createComic() => got no crashes.

To clarify this paragraph above, I have tried and failed to reproduce a crash with 4 different positions of the QThread::sleep(3); call: just above comic->deleteLater();, just below comic->disconnect(), just above comic->disconnect() and just above comic->invalidate(); lines.

To be on the safe side, just tested 3 more positions of QThread::sleep(3); - just under comic->deleteLater();, just under comic = FactoryComic::newComic(path); and just under pagesReady.clear(); statements in Render::createComic() at the sixth commit => no crashes.

@selmf
Copy link
Member Author

selmf commented Mar 3, 2021

One more point: with the current speed of legacy YACReader code elimination and replacement, rewriting Comic from scratch may actually happen 5 or 10 years from now. Fixing and improving the code gradually makes the application run better immediately, is less risky and does not require spending lots of time in one go. Such improvements are much more feasible in the short and middle term. In these years, while we wait and hope for a perfect rewritten solution, a hastily implemented workaround may cause another bug, which will take more time to patch up. Perhaps the thread ping-pong issue is the result of a "temporary" workaround of moving comic back to the main thread instead of investigating deeper and disconnecting only Render's slots from Comic's signals. And now the "temporary" workaround haunts you for years.

I don't think it is going to be 5 or 10 years. We have reached a point in development where several parts of our code are no longer suitable for the features we want to develop for YACReader. Additionally, Qt5 is no longer being developed and if we want to continue to support mac OS X as a platform, we will need to port to Qt6 soon.
A significant part of our refactoring work in the past has gone towards improving the detection of errors, cleaning up third party code, identifying deprecated code and most important lessening the overall test, review and maintenance burden for us as a team. And that is quite important.

As much as we enjoy working on YACReader it is still something we do in our free time and we have other stuff going on in our lives too and we have to think about how to best use this time. That goes double for @luisangelsm who codes for a living and already spends more than 8 hours a day in front of his computer.

As for this PR and your experiments - kudos for going the extra mile and testing all that corner cases. Personally, the information that disconnect() sets sender() to a nullpointer would have been a strong enough argument for me to change my opinion.

I will move the sendPostedEvents call after the disconnect and I will remove the sender check.

@selmf selmf force-pushed the fix/invalidate_race_condition branch 3 times, most recently from a8d3846 to cad7dd3 Compare March 4, 2021 13:00
@vedgy
Copy link
Contributor

vedgy commented Mar 4, 2021

The current version fixes the crash in a most concise way. So I agree that it can be merged as a temporary workaround.

In the long term, however, I believe that QCoreApplication::sendPostedEvents() should not be used for this purpose as it may cause other bugs elsewhere as described in the last paragraph of my recent comment here.

@selmf
Copy link
Member Author

selmf commented Mar 4, 2021

I think it is pretty clear to everyone involved that this is not a longterm solution to the general design problems of the code involved but a measure to prevent the issues at hand. We all agree a better mechanism is needed and yes, that probably would involve a thread management mechanism (single thread queue?) similar to what you theorized with the proxy object.

However, right now is not the right time. @luisangelsm is currently preparing for a 9.8 release and once that has stabilized we will have a better window of opportunity to introduce some bigger changes to stabilize the general foundation of YACReader so we can implement necessary changes like this more easily.

@selmf selmf force-pushed the fix/invalidate_race_condition branch from cad7dd3 to 24a5beb Compare March 4, 2021 14:57
@selmf selmf merged commit e5526de into YACReader:develop Mar 4, 2021
@selmf selmf deleted the fix/invalidate_race_condition branch March 4, 2021 15:45
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

YACReader segfaults when switching comics too fast (unarr only?)

2 participants