-
-
Notifications
You must be signed in to change notification settings - Fork 19.4k
EHN: Improved thread safety for read_html() GH16928 #16930
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
doc/source/whatsnew/v0.21.0.txt
Outdated
| - :func:`crosstab` has gained a ``margins_name`` parameter to define the name of the row / column that will contain the totals when ``margins=True``. (:issue:`15972`) | ||
| - :func:`Dataframe.select_dtypes` now accepts scalar values for include/exclude as well as list-like. (:issue:`16855`) | ||
|
|
||
| - Improved thread safety for `read_html()`. (:issue:`16928`) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm actually inclined to call this a bug because multi-threading is something that we check for unless explicitly stated otherwise. Thus, I would move to this to the bugs section. Also, I would expand your description about the issue was (just a sentence but a longer one would suffice).
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
use :func:`read_html`
Codecov Report
@@ Coverage Diff @@
## master #16930 +/- ##
==========================================
- Coverage 90.99% 90.97% -0.02%
==========================================
Files 161 161
Lines 49303 49303
==========================================
- Hits 44863 44854 -9
- Misses 4440 4449 +9
Continue to review full report at Codecov.
|
Codecov Report
@@ Coverage Diff @@
## master #16930 +/- ##
==========================================
- Coverage 91.05% 91.01% -0.05%
==========================================
Files 161 161
Lines 49350 49350
==========================================
- Hits 44936 44915 -21
- Misses 4414 4435 +21
Continue to review full report at Codecov.
|
jreback
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
can you add the test from the issue
doc/source/whatsnew/v0.21.0.txt
Outdated
| - :func:`crosstab` has gained a ``margins_name`` parameter to define the name of the row / column that will contain the totals when ``margins=True``. (:issue:`15972`) | ||
| - :func:`Dataframe.select_dtypes` now accepts scalar values for include/exclude as well as list-like. (:issue:`16855`) | ||
|
|
||
| - Improved thread safety for `read_html()`. (:issue:`16928`) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
use :func:`read_html`
|
Regarding adding the test. I have trouble creating a test that reproduces the issue. def get_html_file():
filename = os.path.join(DATA_PATH, 'valid_markup.html')
read_html(filename)
class ErrorThread(threading.Thread):
def run(self):
try:
threading.Thread.run(self)
except Exception as e:
self.err = e
else:
self.err = None
def test_importcheck_thread_safety():
t1 = ErrorThread(target = get_html_file)
t2 = ErrorThread(target = get_html_file)
t1.start()
t2.start()
while(t1.is_alive() or t2.is_alive()):
pass
assert None == t1.err == t2.err, "Errors when run in parallel" |
I'm confused...isn't that what you want (because it doesn't fail)? |
|
The threads don't throw exceptions regardless of whether I include my fix or not. Meaning the test I wrote is useless and would indicate that everything is fine while actual programs are crashing. |
Ah, okay. That makes sense. Does your code from the issue raise an Exception if you incorporate it as a test (just as a starter)? |
|
I'm not sure what you mean by starter. I tried using the code from the issue directly earlier and it didn't work. Apparently exceptions need to be thrown by the main-thread. |
|
By starter, I meant: does your code from the original issue form a valid test for our code-base (when incorporated into the |
|
@3553x : As I'm not sure at this point why If it does, commit your test and push to this PR. Otherwise, let us know if it worked or not. How does that sound to you? |
|
That sounds like something worth trying out. I'll let you know once I'm done. |
|
One other thing, if this operation is successful, if you could provide us to the links to the failing and passing builds. That way we can see the differential between having and not having your patch. |
|
I tried using travis with my test added on top of the unfixed version. Some builds fail because of linting, however they all seem to pass pytest. I realised that my earlier version might have passed since These are the additions that were made to try:
from importlib import reload
except ImportError:
pass
[...]
import pandas.io.html
[...]
class ErrorThread(threading.Thread):
def run(self):
try:
super(ErrorThread, self).run()
except Exception as e:
self.err = e
else:
self.err = None
@pytest.mark.slow
def test_importcheck_thread_safety():
reload(pandas.io.html)
filename = os.path.join(DATA_PATH, 'valid_markup.html')
helper_thread1 = ErrorThread(target = read_html, args = (filename,))
helper_thread2 = ErrorThread(target = read_html, args = (filename,))
helper_thread1.start()
helper_thread2.start()
while(helper_thread1.is_alive() or helper_thread2.is_alive()):
pass
assert None == helper_thread1.err == helper_thread2.errOutput of Output of |
|
@3553x you can add your test to the PR. |
7688010 to
9d640cc
Compare
|
Hello @3553x! Thanks for updating the PR. Cheers ! There are no PEP8 issues in this Pull Request. 🍻 Comment last updated on July 21, 2017 at 14:55 Hours UTC |
|
pls rebase on master |
2fdcfb1 to
ad8541b
Compare
doc/source/whatsnew/v0.21.0.txt
Outdated
|
|
||
| - Bug in :func:`read_stata` where value labels could not be read when using an iterator (:issue:`16923`) | ||
|
|
||
| - Bug in :func:`read_html` importcheck fails when run concurrently (:issue:`16928`) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
importcheck --> "import check"
|
|
||
| @pytest.mark.slow | ||
| def test_importcheck_thread_safety(): | ||
| reload(pandas.io.html) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Could you explain (or at least comment) why we need this?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Also, add a newline beneath this (just for readability).
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The import check only happens when a variable in html.py is set to false, it's initial value. However, the variable will be set to true during the first call to read_html. Reloading the module allows us to reinitialise that variable and effectively force an import check.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Could you briefly write that as a comment above the reload call?
| reload(pandas.io.html) | ||
| filename = os.path.join(DATA_PATH, 'valid_markup.html') | ||
| helper_thread1 = ErrorThread(target=read_html, args=(filename,)) | ||
| helper_thread2 = ErrorThread(target=read_html, args=(filename,)) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Add a newline beneath this (just for readability).
| helper_thread1 = ErrorThread(target=read_html, args=(filename,)) | ||
| helper_thread2 = ErrorThread(target=read_html, args=(filename,)) | ||
| helper_thread1.start() | ||
| helper_thread2.start() |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Add a newline beneath this (just for readability).
ad8541b to
10aee08
Compare
pandas/tests/io/test_html.py
Outdated
| import_module = __import__ | ||
|
|
||
| try: | ||
| from importlib import reload |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Let's merge these two try-except together since both come from Python 2.x. Also, make a comment about that importlib is from Python 2.x.
10aee08 to
1632ed8
Compare
doc/source/whatsnew/v0.21.0.txt
Outdated
|
|
||
| - Bug in :func:`read_stata` where value labels could not be read when using an iterator (:issue:`16923`) | ||
|
|
||
| - Bug in :func:`read_html` import check fails when run concurrently (:issue:`16928`) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Add the word "where" before "import check"
|
|
||
|
|
||
| @pytest.mark.slow | ||
| def test_importcheck_thread_safety(): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Can you reference issue number below the test function definition? Something like "see gh-16928" will suffice.
1632ed8 to
fcd7bdf
Compare
pandas/tests/io/test_html.py
Outdated
| helper_thread1.start() | ||
| helper_thread2.start() | ||
|
|
||
| while(helper_thread1.is_alive() or helper_thread2.is_alive()): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
For readability, add a space between "while" and "("
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Actually, do you need the parentheses? I don't think you do...
fcd7bdf to
9d5e040
Compare
jreback
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
lgtm. some comments.
doc/source/whatsnew/v0.21.0.txt
Outdated
|
|
||
| - Bug in :func:`read_stata` where value labels could not be read when using an iterator (:issue:`16923`) | ||
|
|
||
| - Bug in :func:`read_html` where import check fails when run concurrently (:issue:`16928`) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
say in multiple threads
| import warnings | ||
|
|
||
|
|
||
| # imports needed for Python 3.x but will fail under Python 2.x |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
not needed here
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Can you elaborate?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@3553x : He means to just remove the line. I thought that line would be useful for clarity?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
if this is not supported in py2, then you should either conditionally import it, or set reload=None and check it (and skip in a test).
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
reload is a built-in in Python2. There is no need to import it unless you are using Python3.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
ok
|
|
||
|
|
||
| @pytest.mark.slow | ||
| def test_importcheck_thread_safety(): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
if you can only test under py3 then use a @pytest.mark.skipf(not compat.PY3, reason=.....) decorator
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I've just tested it with python3 `which pytest` test_html.py::test_importcheck_thread_safety and python2 `which pytest` test_html.py::test_importcheck_thread_safety for both cases (with and without fix). The test seems to work for both versions on my system.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
so if this works, then remove the comment above
9d5e040 to
c01caaf
Compare
|
thanks @3553x |
git diff upstream/master --name-only -- '*.py' | flake8 --diff(On Windows,git diff upstream/master -u -- "*.py" | flake8 --diffmight work as an alternative.)It failed a total of 6 tests when running
test_fast.shbut failed none when runningpytest pandas/tests/io/test_html.py.It seems that these 6 failed tests are unrelated to my changes since they also occur on the master branch.