@3553x
Contributor

@3553x 3553x commented Jul 14, 2017

  • closes read_html() Thread Safety #16928
  • tests added / passed
  • passes git diff upstream/master --name-only -- '*.py' | flake8 --diff (On Windows, git diff upstream/master -u -- "*.py" | flake8 --diff might work as an alternative.)
  • whatsnew entry

It failed a total of 6 tests when running test_fast.sh but failed none when running pytest pandas/tests/io/test_html.py.
It seems that these 6 failed tests are unrelated to my changes since they also occur on the master branch.

@gfyoung gfyoung added the IO HTML (read_html, to_html, Styler.apply, Styler.applymap), Multithreading (Parallelism in pandas), and Bug labels Jul 14, 2017
- :func:`crosstab` has gained a ``margins_name`` parameter to define the name of the row / column that will contain the totals when ``margins=True``. (:issue:`15972`)
- :func:`Dataframe.select_dtypes` now accepts scalar values for include/exclude as well as list-like. (:issue:`16855`)

- Improved thread safety for `read_html()`. (:issue:`16928`)
Member

I'm actually inclined to call this a bug because multi-threading is something that we check for unless explicitly stated otherwise. Thus, I would move this to the bugs section. Also, I would expand your description of what the issue was (just one sentence, but a longer one, would suffice).

Contributor

use :func:`read_html`

@codecov

codecov bot commented Jul 14, 2017

Codecov Report

Merging #16930 into master will decrease coverage by 0.01%.
The diff coverage is 100%.

@@            Coverage Diff             @@
##           master   #16930      +/-   ##
==========================================
- Coverage   90.99%   90.97%   -0.02%     
==========================================
  Files         161      161              
  Lines       49303    49303              
==========================================
- Hits        44863    44854       -9     
- Misses       4440     4449       +9
Flag          Coverage Δ
#multiple     88.74% <100%> (ø) ⬆️
#single       40.19% <0%> (-0.07%) ⬇️

Impacted Files          Coverage Δ
pandas/io/html.py       84.85% <100%> (ø) ⬆️
pandas/io/gbq.py        25% <0%> (-58.34%) ⬇️
pandas/core/frame.py    97.71% <0%> (-0.1%) ⬇️

Legend:
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Last update 6000c5b...7688010.

@codecov

codecov bot commented Jul 14, 2017

Codecov Report

Merging #16930 into master will decrease coverage by 0.04%.
The diff coverage is 100%.

@@            Coverage Diff             @@
##           master   #16930      +/-   ##
==========================================
- Coverage   91.05%   91.01%   -0.05%     
==========================================
  Files         161      161              
  Lines       49350    49350              
==========================================
- Hits        44936    44915      -21     
- Misses       4414     4435      +21
Flag          Coverage Δ
#multiple     88.78% <100%> (-0.03%) ⬇️
#single       40.26% <0%> (-0.07%) ⬇️

Impacted Files                   Coverage Δ
pandas/io/html.py                84.85% <100%> (ø) ⬆️
pandas/io/gbq.py                 25% <0%> (-58.34%) ⬇️
pandas/plotting/_converter.py    63.23% <0%> (-1.82%) ⬇️
pandas/core/frame.py             97.75% <0%> (-0.1%) ⬇️

Legend:
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Last update 4efe656...c01caaf.

Contributor

@jreback jreback left a comment

can you add the test from the issue

@3553x
Contributor Author

3553x commented Jul 15, 2017

Regarding adding the test.

I have trouble creating a test that reproduces the issue.
The following code snippet reproduces the failure when I run pytest test_html.py::test_importcheck_thread_safety on its own, but as soon as I run pytest test_html.py the test always passes. Neither of the threads throws an exception in that case.

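import os
import threading

from pandas import read_html

# DATA_PATH is assumed to be defined elsewhere in test_html.py.


def get_html_file():
    filename = os.path.join(DATA_PATH, 'valid_markup.html')
    read_html(filename)


class ErrorThread(threading.Thread):
    # Store any exception raised by the target so the main thread can check it.
    def run(self):
        try:
            threading.Thread.run(self)
        except Exception as e:
            self.err = e
        else:
            self.err = None


def test_importcheck_thread_safety():
    t1 = ErrorThread(target=get_html_file)
    t2 = ErrorThread(target=get_html_file)
    t1.start()
    t2.start()
    # Wait for both threads to finish before inspecting their results.
    while t1.is_alive() or t2.is_alive():
        pass
    assert None == t1.err == t2.err, "Errors when run in parallel"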

@gfyoung
Member

gfyoung commented Jul 15, 2017

Neither of the threads throw an exception in that case.

I'm confused...isn't that what you want (because it doesn't fail)?

@3553x
Contributor Author

3553x commented Jul 15, 2017

The threads don't throw exceptions regardless of whether I include my fix or not, meaning the test I wrote is useless: it would indicate that everything is fine while actual programs are crashing.

@gfyoung
Member

gfyoung commented Jul 15, 2017

The threads don't throw exceptions regardless of whether I include my fix or not.

Ah, okay. That makes sense. Does your code from the issue raise an Exception if you incorporate it as a test (just as a starter)?

@3553x
Contributor Author

3553x commented Jul 15, 2017

I'm not sure what you mean by starter. I tried using the code from the issue directly earlier and it didn't work. Apparently, exceptions need to be raised in the main thread for the test to fail.
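
The point about the main thread can be illustrated with a minimal sketch. An exception raised inside a worker thread ends that thread but does not propagate to the thread that started it, so a test only sees the failure if the worker records it and the main thread checks it afterwards. The names `CapturingThread`, `failing_target`, and `run_and_collect` are hypothetical and not taken from the PR.

```python
import threading


def failing_target():
    # Hypothetical worker that always fails.
    raise ValueError("boom")


class CapturingThread(threading.Thread):
    """Thread that stores any exception raised by its target."""

    def __init__(self, *args, **kwargs):
        super().__init__(*args, **kwargs)
        self.err = None

    def run(self):
        try:
            super().run()
        except Exception as exc:
            # Keep the exception so the starting thread can inspect it.
            self.err = exc


def run_and_collect(target):
    """Run target in a worker thread and return the captured exception, if any."""
    worker = CapturingThread(target=target)
    worker.start()
    worker.join()
    return worker.err


err = run_and_collect(failing_target)
print(type(err).__name__)  # prints "ValueError"
```

An assertion on the returned value in the main thread is then what makes the test fail.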

@gfyoung
Member

gfyoung commented Jul 15, 2017

By starter, I meant: does your code from the original issue form a valid test for our code-base (when incorporated into the tests directory)? You indeed answered the question.

@gfyoung
Member

gfyoung commented Jul 15, 2017

@3553x : As I'm not sure at this point why pytest is behaving the way it is for you, how about this: you need the test to break and then pass (with your changes) on the build machines before we merge. On a separate branch, add your test, push it, and have Travis and/or Appveyor (either one is sufficient) run and see if it breaks without your changes. Then push your changes to that branch and see if they then pass.

If it does, commit your test and push to this PR. Otherwise, let us know if it worked or not. How does that sound to you?

@3553x
Contributor Author

3553x commented Jul 15, 2017

That sounds like something worth trying out. I'll let you know once I'm done.

@gfyoung
Member

gfyoung commented Jul 15, 2017

One other thing: if this operation is successful, could you provide us with links to the failing and passing builds? That way we can see the difference between having and not having your patch.

@3553x
Contributor Author

3553x commented Jul 15, 2017

I tried using Travis with my test added on top of the unfixed version. Some builds fail because of linting; however, they all seem to pass pytest. I realised that my earlier version might have passed because _IMPORTS is set to True after the first call to read_html(). So I imported pandas.io.html and reloaded it, but this didn't fix the issue. This change was included in the version that was tested by Travis.

These are the additions that were made to test_html.py:

try:
    from importlib import reload
except ImportError:
    pass
[...]
import pandas.io.html
[...]
class ErrorThread(threading.Thread):
    def run(self):
        try:
            super(ErrorThread, self).run()
        except Exception as e:
            self.err = e
        else:
            self.err = None

@pytest.mark.slow
def test_importcheck_thread_safety():
    reload(pandas.io.html)
    filename = os.path.join(DATA_PATH, 'valid_markup.html')
    helper_thread1 = ErrorThread(target = read_html, args = (filename,))
    helper_thread2 = ErrorThread(target = read_html, args = (filename,))
    helper_thread1.start()
    helper_thread2.start()
    while(helper_thread1.is_alive() or helper_thread2.is_alive()):
        pass
    assert None == helper_thread1.err == helper_thread2.err

Output of pytest test_html.py::test_importcheck_thread_safety:

=========================== test session starts ============================
platform linux -- Python 3.6.1, pytest-3.1.1, py-1.4.34, pluggy-0.4.0
rootdir: /home/fe/code/pandas, inifile: setup.cfg
plugins: xdist-1.16.0
collected 8 items 

test_html.py F

================================= FAILURES =================================
______________________ test_importcheck_thread_safety ______________________

    @pytest.mark.slow
    def test_importcheck_thread_safety():
        reload(pandas.io.html)
        filename = os.path.join(DATA_PATH, 'valid_markup.html')
        helper_thread1 = ErrorThread(target = read_html, args = (filename,))
        helper_thread2 = ErrorThread(target = read_html, args = (filename,))
        helper_thread1.start()
        helper_thread2.start()
        while(helper_thread1.is_alive() or helper_thread2.is_alive()):
            pass
>       assert None == helper_thread1.err == helper_thread2.err
E       AssertionError: assert None == ImportError('lxml not found, please install it',)
E        +  where None = <ErrorThread(Thread-1, stopped 140146488710912)>.err
E        +  and   ImportError('lxml not found, please install it',) = <ErrorThread(Thread-2, stopped 140146480318208)>.err

test_html.py:961: AssertionError
========================= 1 failed in 7.14 seconds =========================

Output of pytest test_html.py

=========================== test session starts ============================
platform linux -- Python 3.6.1, pytest-3.1.1, py-1.4.34, pluggy-0.4.0
rootdir: /home/fe/code/pandas, inifile: setup.cfg
plugins: xdist-1.16.0
collected 74 items 

test_html.py ..........................................................................

======================== 74 passed in 50.77 seconds ========================
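
The single-test failure above, where one thread succeeds and the other raises ImportError, is consistent with a check-then-set race on module-level flags. The following is a hypothetical sketch of that pattern and is not pandas' actual code: `_checked`, `_has_parser`, and both `check_*` functions are illustrative stand-ins. Events force the interleaving so the result is deterministic.

```python
import threading

# Hypothetical stand-ins for module-level state; not pandas' actual code.
_checked = False      # "the import check has already run"
_has_parser = False   # "the optional parser was found"


def reset():
    global _checked, _has_parser
    _checked = False
    _has_parser = False


def check_flag_first(slow_import):
    """Racy ordering: mark the check as done before the work is finished."""
    global _checked, _has_parser
    if _checked:
        return
    _checked = True
    slow_import()          # another thread can run while this is in progress
    _has_parser = True


def check_flag_last(slow_import):
    """Safer ordering: mark the check as done only after the results are set."""
    global _checked, _has_parser
    if _checked:
        return
    slow_import()
    _has_parser = True
    _checked = True


def second_caller_sees_parser(check):
    """Force the interleaving: a second caller arrives while the first is mid-check."""
    reset()
    started = threading.Event()
    release = threading.Event()

    def slow_import():
        started.set()
        release.wait(timeout=5)

    first = threading.Thread(target=check, args=(slow_import,))
    first.start()
    started.wait(timeout=5)       # the first caller is now inside its "import"

    check(lambda: None)           # second caller, here on the main thread
    seen = _has_parser            # what the second caller would observe

    release.set()
    first.join()
    return seen


print(second_caller_sees_parser(check_flag_first))  # False
print(second_caller_sees_parser(check_flag_last))   # True
```

With the flag set last, both callers may repeat the check, which is redundant but harmless in this sketch; guarding the check with a lock would avoid the repeated work.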

@jreback
Contributor

jreback commented Jul 15, 2017

@3553x you can add your test to the PR.

@3553x 3553x force-pushed the read_html_thread_safety branch from 7688010 to 9d640cc Compare July 16, 2017 15:11
@pep8speaks

pep8speaks commented Jul 16, 2017

Hello @3553x! Thanks for updating the PR.

Cheers ! There are no PEP8 issues in this Pull Request. 🍻

Comment last updated on July 21, 2017 at 14:55 Hours UTC

@jreback
Contributor

jreback commented Jul 16, 2017

pls rebase on master

git rebase -i upstream/master
git push yourremote yourbranch -f

@3553x 3553x force-pushed the read_html_thread_safety branch 2 times, most recently from 2fdcfb1 to ad8541b Compare July 16, 2017 16:49

- Bug in :func:`read_stata` where value labels could not be read when using an iterator (:issue:`16923`)

- Bug in :func:`read_html` importcheck fails when run concurrently (:issue:`16928`)
Member

importcheck --> "import check"


@pytest.mark.slow
def test_importcheck_thread_safety():
    reload(pandas.io.html)
Member

Could you explain (or at least comment) why we need this?

Member

Also, add a newline beneath this (just for readability).

Contributor Author

The import check only happens when a variable in html.py is set to False, its initial value. However, the variable will be set to True during the first call to read_html. Reloading the module allows us to reinitialise that variable and effectively force an import check.
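
The effect of reloading on a module-level flag can be sketched as follows. This is an illustration under stated assumptions: the throwaway module `flagmod_demo` and its `_CHECKED` variable are hypothetical and only stand in for the variable described above.

```python
import importlib
import os
import sys
import tempfile

# Write a tiny throwaway module with a module-level flag.
# The module name "flagmod_demo" and its contents are hypothetical.
tmpdir = tempfile.mkdtemp()
with open(os.path.join(tmpdir, "flagmod_demo.py"), "w") as fh:
    fh.write("_CHECKED = False\n")

sys.path.insert(0, tmpdir)
importlib.invalidate_caches()
import flagmod_demo

flagmod_demo._CHECKED = True          # simulates the state after a first call
state_before = flagmod_demo._CHECKED

importlib.reload(flagmod_demo)        # re-executes the module body
state_after = flagmod_demo._CHECKED

sys.path.remove(tmpdir)
print(state_before, state_after)      # True False
```

Because reload re-executes the module body in the existing module namespace, functions that were imported from the module earlier share that namespace and also see the reset flag.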

Member

Could you briefly write that as a comment above the reload call?

    reload(pandas.io.html)
    filename = os.path.join(DATA_PATH, 'valid_markup.html')
    helper_thread1 = ErrorThread(target=read_html, args=(filename,))
    helper_thread2 = ErrorThread(target=read_html, args=(filename,))
Member

Add a newline beneath this (just for readability).

    helper_thread1 = ErrorThread(target=read_html, args=(filename,))
    helper_thread2 = ErrorThread(target=read_html, args=(filename,))
    helper_thread1.start()
    helper_thread2.start()
Member

Add a newline beneath this (just for readability).

@3553x 3553x force-pushed the read_html_thread_safety branch from ad8541b to 10aee08 Compare July 17, 2017 09:08
import_module = __import__

try:
    from importlib import reload
Member

Let's merge these two try-except blocks together, since both deal with Python 2.x compatibility. Also, add a comment noting that the importlib import is there because of Python 2.x.

@3553x 3553x force-pushed the read_html_thread_safety branch from 10aee08 to 1632ed8 Compare July 18, 2017 12:49

- Bug in :func:`read_stata` where value labels could not be read when using an iterator (:issue:`16923`)

- Bug in :func:`read_html` import check fails when run concurrently (:issue:`16928`)
Member

Add the word "where" before "import check"



@pytest.mark.slow
def test_importcheck_thread_safety():
Member

@gfyoung gfyoung Jul 18, 2017

Can you reference issue number below the test function definition? Something like "see gh-16928" will suffice.

@3553x 3553x force-pushed the read_html_thread_safety branch from 1632ed8 to fcd7bdf Compare July 18, 2017 17:04
    helper_thread1.start()
    helper_thread2.start()

    while(helper_thread1.is_alive() or helper_thread2.is_alive()):
Member

For readability, add a space between "while" and "("

Member

Actually, do you need the parentheses? I don't think you do...
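
As an aside on the loop under review, polling is_alive() with an empty body keeps the CPU busy while waiting. A possible alternative, which is not what the PR uses, is Thread.join(), which blocks until the thread has finished. The `work` function and `results` list below are hypothetical.

```python
import threading
import time


def work(results, index):
    # Hypothetical worker: sleep briefly, then record that it finished.
    time.sleep(0.05)
    results[index] = "done"


results = [None, None]
threads = [threading.Thread(target=work, args=(results, i)) for i in range(2)]

for t in threads:
    t.start()

# join() blocks until each thread has finished, without spinning the CPU.
for t in threads:
    t.join()

print(results)  # ['done', 'done']
```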

@3553x 3553x force-pushed the read_html_thread_safety branch from fcd7bdf to 9d5e040 Compare July 18, 2017 17:41
Contributor

@jreback jreback left a comment

lgtm. some comments.


- Bug in :func:`read_stata` where value labels could not be read when using an iterator (:issue:`16923`)

- Bug in :func:`read_html` where import check fails when run concurrently (:issue:`16928`)
Contributor

say in multiple threads

import warnings


# imports needed for Python 3.x but will fail under Python 2.x
Contributor

not needed here

Contributor Author

Can you elaborate?

Member

@3553x : He means to just remove the line. I thought that line would be useful for clarity?

Contributor

if this is not supported in py2, then you should either conditionally import it, or set reload=None and check it (and skip in a test).

Contributor Author

reload is a built-in in Python 2. There is no need to import it unless you are using Python 3.

Contributor

ok



@pytest.mark.slow
def test_importcheck_thread_safety():
Contributor

if you can only test under py3 then use a @pytest.mark.skipif(not compat.PY3, reason=.....) decorator

Contributor Author

I've just tested it with python3 `which pytest` test_html.py::test_importcheck_thread_safety and python2 `which pytest` test_html.py::test_importcheck_thread_safety for both cases (with and without fix). The test seems to work for both versions on my system.

Contributor

so if this works, then remove the comment above

@jreback jreback added this to the 0.21.0 milestone Jul 19, 2017
@gfyoung
Member

gfyoung commented Jul 21, 2017

@3553x : @jreback has some minor comments; if you could address them, we will be able to merge this!

@3553x 3553x force-pushed the read_html_thread_safety branch from 9d5e040 to c01caaf Compare July 21, 2017 14:55
@gfyoung
Member

gfyoung commented Jul 21, 2017

@3553x : Thanks for updating!

@jreback : this LGTM, but I'll leave it to you to check one more time before merging.

@jreback jreback merged commit d884e51 into pandas-dev:master Jul 21, 2017
@jreback
Contributor

jreback commented Jul 21, 2017

thanks @3553x

