Render: Protect against race condition in setNumPages #220
vedgy
left a comment
I guess comic has already sent the signal before the disconnect() call. The call to Render::setNumPages() must have already been queued. I don't see how this situation can be avoided other than by destroying the Render object along with the Comic object and creating a new Render for a new Comic.
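To make the "already queued" point concrete, here is a minimal sketch that does not use Qt at all. `EventQueue`, the simplified `Comic` and `Render` structs and `staleDeliveryDemo()` are hypothetical stand-ins written for illustration, and the model is single-threaded, so it only shows the ordering problem, not the real thread scheduling.

```cpp
#include <cassert>
#include <deque>
#include <functional>
#include <utility>

// Hypothetical stand-in for the receiver thread's posted-event queue.
struct EventQueue {
    std::deque<std::function<void()>> posted;

    void post(std::function<void()> call) { posted.push_back(std::move(call)); }

    // Deliver everything that is pending, in FIFO order.
    void sendPosted() {
        while (!posted.empty()) {
            auto call = std::move(posted.front());
            posted.pop_front();
            call();
        }
    }
};

struct Render {
    unsigned numPages = 0;
    void setNumPages(unsigned n) { numPages = n; }
};

// Simplified comic: "emitting" over a queued connection only posts a call.
struct Comic {
    EventQueue *queue = nullptr;
    Render *receiver = nullptr; // nullptr models "disconnected"

    void emitNumPages(unsigned n) {
        if (!receiver)
            return; // no connection, so nothing is posted
        Render *r = receiver;
        queue->post([r, n] { r->setNumPages(n); });
    }
    void disconnect() { receiver = nullptr; }
};

// Returns the page count the render ends up with.
unsigned staleDeliveryDemo() {
    EventQueue queue;
    Render render;
    Comic oldComic{&queue, &render};

    oldComic.emitNumPages(10); // posted, not yet delivered
    oldComic.disconnect();     // too late for the call already in the queue
    assert(queue.posted.size() == 1);

    queue.sendPosted();        // the stale call runs anyway
    return render.numPages;
}
```

In this model the render ends up with the old comic's page count even though the disconnect happened before delivery, which is the situation described above.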
YACReader/render.cpp
Outdated
void Render::setNumPages(unsigned int numPages)
{
    if (sender() != comic)
        return;
Comic::numPages signal is connected to Render::numPages signal in addition to this Render::setNumPages slot:
yacreader/YACReader/render.cpp
Lines 721 to 722 in 32e1db7
connect(comic, SIGNAL(numPages(unsigned int)), this, SIGNAL(numPages(unsigned int)), Qt::QueuedConnection);
connect(comic, SIGNAL(numPages(unsigned int)), this, SLOT(setNumPages(unsigned int)), Qt::QueuedConnection);
This signal forwarding is likely to cause some other bug under the conditions that trigger #211.
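A small Qt-free sketch of why the forwarded signal is a concern: a sender check inside the slot cannot filter a signal that is forwarded straight to other listeners. All names here (`Ui`, `deliver`, `forwardingDemo`, the explicit `sender` parameter) are hypothetical and only model the two connections quoted above.

```cpp
#include <cassert>

struct Comic {}; // identity only; hypothetical stand-in

struct Ui {
    unsigned shownPages = 0;
};

struct Render {
    const Comic *comic = nullptr; // comic currently being displayed
    unsigned numPages = 0;
    Ui *ui = nullptr;

    // Models connection 1: signal-to-signal forwarding, no filtering happens.
    void forwardNumPages(unsigned n) {
        assert(ui != nullptr);
        ui->shownPages = n;
    }

    // Models connection 2: the guarded slot.
    void setNumPages(const Comic *sender, unsigned n) {
        if (sender != comic)
            return;
        numPages = n;
    }

    // One emission of the comic's signal triggers both connections.
    void deliver(const Comic *sender, unsigned n) {
        forwardNumPages(n);
        setNumPages(sender, n);
    }
};

struct Outcome {
    unsigned renderPages;
    unsigned uiPages;
};

Outcome forwardingDemo() {
    Comic oldComic, newComic;
    Ui ui;
    Render render{&newComic, 0, &ui};

    render.deliver(&newComic, 200); // legitimate signal
    render.deliver(&oldComic, 10);  // stale signal from the previous comic
    return {render.numPages, ui.shownPages};
}
```

Under these assumptions the render keeps the correct count while the interface shows the stale one, so the guard protects the slot but not whatever listens to the forwarded signal.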
Well … the obvious way to fix this is to never have more than one active thread and to introduce a proper mechanism to manage the comic switching. Ideally, you would reuse the thread instead of creating new threads for each comic. Sadly, the obvious way is not the easy one in this case and I am speaking from experience. The current system resembles a Gordian thread knot and untangling it is a delicate matter. The forwarded signal is likely responsible for setting the comic page count in the interface outside of the render. In my tests, I was able to provoke a state where the program did not crash but interface and comic page count were not in sync, i.e. the page count in the interface was lower than the actual comic pages loaded in memory. Another option to defuse this situation a little would be to remove the numPages() signal from the comic and introduce a more general purpose signal that indicates the comic has finished parsing and it is safe to use the data now. |
There is one more easy way to fix the bug: create a proxy By the way, I think I may have spotted the reason why #5 had caused memory leaks. When only |
I don't think introducing an extra layer of abstraction at this point is a good idea. A proper handling mechanism would of course include a class to encapsulate the threading, a thread manager, so to speak. But that is beyond the scope of what we're trying to achieve here. The idea to salvage the thread-ping-pong PR occurred to me too, but at this point this is a little too dangerous. @luisangelsm and I have discussed this and we agree that the current implementation, despite all its flaws and poor maintainability, has proven to be good enough for most cases and reasonably stable, so it should not be unnecessarily exposed to possible breakage. What we will do instead is create parallel versions of mission-critical code that can be enabled by opt-in via configuration switches. That way we can keep the old code around as long as necessary and at the same time work on the much-needed improvements and replacements without worrying too much about regressions. @luisangelsm: I will convert this PR to a WIP for now to run some more tests on the second signal @vedgy pointed out and try some more things on the connection queue and event loop front.
Even if the single worker thread is reused for different comic files, the already emitted signals for the previously opened comic would have to be filtered out somehow. Apart from The new simple proxy class wouldn't require changing |
9cb95f8 to
142fc87
Compare
After some more testing and experimenting, I found a nice solution. All events for the render are now processed after invalidating the comic, and the comic is subsequently disconnected and cleaned up. I have verified that this prevents stray signals from zombie threads, and I have left the sender check in the page number slot for extra safety. I also tested running the processing step after disconnect, but that produced stray signals. @luisangelsm the code impact of this is so low that we can safely merge this fix without any danger. Let me know if you want to take a look first or if I can go ahead.
YACReader/render.cpp
Outdated
QCoreApplication::sendPostedEvents(this);
comic->disconnect();
I think there is still an API race between the QCoreApplication::sendPostedEvents(this) and comic->disconnect() lines. The comic's thread can post a signal-emitted event before the disconnection.
Quote from https://woboq.com/blog/how-qt-signals-slots-work-part3-queuedconnection.html:
When posting an event (in QCoreApplication::postEvent), the event will be pushed in a per-thread queue (QThreadData::postEventList). The event queued is protected by a mutex, so there is no race conditions when threads push events to another thread's event queue.
I also tested running the processing step after disconnect, but that produced stray signals.
What stray signals and why? I think reordering these two statements would eliminate the race...
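The argument about statement order can be modelled without Qt. The sketch below is deterministic and pins the comic thread's emission to the worst possible moment, which a real scheduler would only hit occasionally. `Order`, `pendingAfterCleanup` and the lambdas are hypothetical names, and `drain` merely stands in for `QCoreApplication::sendPostedEvents(this)`.

```cpp
#include <cstddef>
#include <deque>

enum class Order { DrainThenDisconnect, DisconnectThenDrain };

// Returns how many calls from the old comic are still pending after cleanup.
std::size_t pendingAfterCleanup(Order order) {
    std::deque<unsigned> posted; // pending setNumPages(n) calls for the render
    bool connected = true;

    // Models the comic thread emitting numPages at an arbitrary moment.
    auto comicThreadEmits = [&](unsigned n) {
        if (connected)
            posted.push_back(n);
    };
    // Stands in for processing and removing all pending calls.
    auto drain = [&] { posted.clear(); };
    auto disconnect = [&] { connected = false; };

    comicThreadEmits(10); // an emission that happened before cleanup started

    if (order == Order::DrainThenDisconnect) {
        drain();
        comicThreadEmits(10); // worst case: lands between the two steps
        disconnect();
    } else {
        disconnect();
        comicThreadEmits(10); // ignored, the connection is already gone
        drain();
    }
    return posted.size();
}
```

In this model, draining first leaves one stale call in the queue, while disconnecting first leaves none. It says nothing about why the opposite was observed in the tests mentioned in this thread.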
I don't agree. As I explicitly wrote, I tested it. Running sendPostedEvents before calling disconnect resulted in no more signals from old comic objects arriving on my test slots (verified by comparing sender() with the current comic object). When I ran the clear after the disconnect the race condition was present and stray signals appeared (verified the same way). I haven't bothered to investigate the exact reason for this behavior. If you want to run your own tests, you just need to add a debug message to the sender check in the render code. You could also investigate Qt's source code directly. But consider this: The cleanup of the posted events does not happen on the comic thread but on the main application thread. We first invalidate the comic to make it stop processing and sending new signals, then make sure all pent-up signals are processed and finally we disconnect the comic before deleting it. This might not be a picture book example of a cleanup, but it fixes the issue at hand and does so in a way that is not purely cosmetic. For me that is good enough. You are of course welcome to investigate further using proper tools like thread sanitizer or valgrind, but I'll call it a day here. |
Making the signal depend on the invalidation status was the first thing I tried to fix this. It made no difference. If you poke this legacy code hard enough you will almost certainly find other corner cases. The goal here is not to create a perfect solution but a reasonably good one. As a rule of thumb we avoid investing time in code we expect to be replaced or substantially refactored and the render code is exactly such a case. You are of course free to tinker and test with it, but please avoid spending too much time on it. Keep it simple, go for good, not perfect, keep in mind that someone else needs to review and test and that time spent on legacy code is lost when said code is replaced. |
This confirms my hypothesis: the signal has already been emitted and the event has already been posted (but not yet sent) by the time
These are valid points. Yet our involvement in YACReader development is a volunteer endeavor. It should be fun, not a trudge. Maximizing development speed at the expense of code quality and deep understanding can be depressing. At least for me, personally, implementing an optimal, correct solution with a good understanding of what is going on and why, is much more satisfying than hurrying to implement as many features and fix as many bugs as possible in the shortest possible time. When you have fun doing something, plus learn and deepen your understanding in the process, that time is not lost. The bug fix we are working on here is not thoroughly tied to the current suboptimal legacy implementation. So this part will not necessarily be discarded when One more point: with the current speed of legacy YACReader code elimination and replacement, rewriting |
Below I describe experiments in my branch based on this PR: https://github.com/vedgy/yacreader/commits/experiments/invalidate_race_condition. The first commit Fix 1-minute startup crash works around a common Qt application crash under X11 if its start-up takes 60 seconds or longer. I'll create a separate pull request with the fix soon. The second commit Trigger the API race allows to easily and reliably reproduce the API race I have outlined in a previous comment: simply press Ctrl+Right twice after opening a comic file. The third commit Fix the API race applies the statement reordering fix I have suggested earlier - and it successfully eliminates the race. The fourth commit No sleeping - reproduce the crash by holding Ctrl+Left or Ctrl+Right reverts the The fifth commit Don't crash if setNumPages() is invoked from sendPostedEvents() carefully disables the The sixth commit Revert "No sleeping - reproduce the crash by holding Ctrl+Left..." restores the At the seventh commit Restore the API race I can easily crash YACReader by quickly pressing Ctrl+Right twice. And the So I recommend to reorder the Note that the proxy object fix I described above would prevent these extra slot calls at the cost of more code changes and the extra signal indirection performance hit. Seeing as #211 is difficult to reproduce accidentally, the slots would very rarely be called inside the |
To clarify this paragraph above, I have tried and failed to reproduce a crash with 4 different positions of the To be on the safe side, just tested 3 more positions of |
I don't think it is going to be 5 or 10 years. We have reached a point in development where several parts of our code are no longer suitable for the features we want to develop for YACReader. Additionally, Qt5 is no longer being developed, and if we want to continue to support macOS as a platform, we will need to port to Qt6 soon. As much as we enjoy working on YACReader, it is still something we do in our free time. We have other stuff going on in our lives too and we have to think about how to best use this time. That goes double for @luisangelsm, who codes for a living and already spends more than 8 hours a day in front of his computer. As for this PR and your experiments: kudos for going the extra mile and testing all those corner cases. Personally, the information that disconnect() sets sender() to a null pointer would have been a strong enough argument for me to change my opinion. I will move the sendPostedEvents call after the disconnect and I will remove the sender check.
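The behaviour reported in this thread, where a call that was queued before `disconnect()` still runs but sees a null `sender()`, can be sketched in plain C++. This is a model of what the experiments describe, not Qt's implementation. `connectedSenders`, `senderFor` and `senderIsNullAfterDisconnect` are hypothetical names.

```cpp
#include <deque>
#include <set>

struct Comic {}; // identity only; hypothetical stand-in

struct Render {
    std::set<const Comic *> connectedSenders; // senders with a live connection
    std::deque<const Comic *> posted;         // each pending call remembers its emitter
    const Comic *lastSeenSender = nullptr;
    int slotCalls = 0;

    // Models sender(): report the emitter only if it is still connected.
    const Comic *senderFor(const Comic *emitter) const {
        return connectedSenders.count(emitter) ? emitter : nullptr;
    }

    // Delivers every pending call and records what the slot would see.
    void sendPosted() {
        while (!posted.empty()) {
            const Comic *emitter = posted.front();
            posted.pop_front();
            ++slotCalls;
            lastSeenSender = senderFor(emitter);
        }
    }
};

bool senderIsNullAfterDisconnect() {
    Comic comic;
    Render render;
    render.connectedSenders.insert(&comic);
    render.posted.push_back(&comic);       // signal emitted, call queued
    render.connectedSenders.erase(&comic); // disconnect
    render.sendPosted();                   // the slot still runs once
    return render.slotCalls == 1 && render.lastSeenSender == nullptr;
}
```

If the model matches the real behaviour, a slot invoked while draining after the disconnect can no longer identify the old comic through `sender()`.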
a8d3846 to
cad7dd3
Compare
The current version fixes the crash in a most concise way. So I agree that it can be merged as a temporary workaround. In the long term, however, I believe that |
I think it is pretty clear to everyone involved that this is not a long-term solution to the general design problems of the code involved but a measure to prevent the issues at hand. We all agree a better mechanism is needed and yes, that probably would involve a thread management mechanism (single thread queue?) similar to what you theorized with the proxy object. However, right now is not the right time. @luisangelsm is currently preparing for a 9.8 release, and once that has stabilized we will have a better window of opportunity to introduce some bigger changes to stabilize the general foundation of YACReader so we can implement necessary changes like this more easily.
cad7dd3 to
24a5beb
Compare
Switching comics too fast in YACReader can lead to an accumulation of comic threads which are in different stages of parsing, processing and ultimately cleanup. While these threads and the comic objects are usually disconnected from the render when spawning a new thread, the nature of Qt's queued connections and thread scheduling by the operating system means that the disconnection and thread cleanup do not take effect immediately.
As thread disconnection and cleanup are not reliable, race conditions occur.
This PR protects against a race condition where the number of pages the render tries to prepare is set via a signal from the comic object. If this value is set by a stray comic thread while loading a comic, the false value in combination with a bookmarked page index (also from the stray comic) can lead to the render code trying to access a nonexistent page, which leads to a segfault.
Less obvious symptoms which do not lead to crashing can be loading a comic with the wrong total page count (navigation problem) and all sorts of undefined behavior.
One quick and very dirty way to protect against this is checking if the sender matches the current comic object. Note that while this fixes #211 it is not a fix for the underlying thread problems nor does it take care of all possible symptoms. It does not remove the race condition, it just prevents it from wreaking havoc.