
Update to PyO3 0.7#56

Merged
rth merged 9 commits into master from pyo3-0.7-update
Jun 5, 2019


@rth rth commented May 26, 2019

This updates to the latest PyO3, which allows using lifetimes in pymethods. As a result, tokenization in Python is a bit faster because string copies are avoided.

On master,

python3.7 benchmarks/bench_tokenizers.py
 Tokenizing 19924 documents
         Python re.findall(r'\b\w\w+\b', ...): 2.93s [31.0 MB/s, 2450 kWPS]
                RegexpTokenizer(r'\b\w\w+\b'): 1.96s [46.5 MB/s, 3671 kWPS]
   UnicodeSegmentTokenizer(word_bounds=False): 2.97s [30.7 MB/s, 2269 kWPS]
    UnicodeSegmentTokenizer(word_bounds=True): 3.58s [25.4 MB/s, 3182 kWPS]
                         VTextTokenizer('en'): 4.11s [22.1 MB/s, 2467 kWPS]
                        CharacterTokenizer(4): 7.73s [11.8 MB/s, 5927 kWPS]

after this PR,

# Tokenizing 19924 documents
         Python re.findall(r'\b\w\w+\b', ...): 2.92s [31.2 MB/s, 2460 kWPS]
                RegexpTokenizer(r'\b\w\w+\b'): 1.40s [64.8 MB/s, 5119 kWPS]
   UnicodeSegmentTokenizer(word_bounds=False): 2.48s [36.8 MB/s, 2721 kWPS]
    UnicodeSegmentTokenizer(word_bounds=True): 2.65s [34.3 MB/s, 4292 kWPS]
                         VTextTokenizer('en'): 3.32s [27.4 MB/s, 3053 kWPS]
                        CharacterTokenizer(4): 4.47s [20.4 MB/s, 10252 kWPS]
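
To make the two runs easier to compare, the per-tokenizer speedup can be computed as the ratio of the wall-clock times. This is only a minimal sketch: the times are copied by hand from the output above, and the names `timings`, `speedup` and `speedups` are made up for illustration and are not part of the benchmark script.

```python
# Wall-clock times in seconds as (on master, after this PR),
# copied by hand from the two benchmark outputs above.
timings = {
    "Python re.findall": (2.93, 2.92),
    "RegexpTokenizer": (1.96, 1.40),
    "UnicodeSegmentTokenizer(word_bounds=False)": (2.97, 2.48),
    "UnicodeSegmentTokenizer(word_bounds=True)": (3.58, 2.65),
    "VTextTokenizer('en')": (4.11, 3.32),
    "CharacterTokenizer(4)": (7.73, 4.47),
}


def speedup(before, after):
    """Return how many times faster the 'after' run is than the 'before' run."""
    return before / after


# Hypothetical helper dict: tokenizer name -> speedup factor.
speedups = {name: speedup(before, after) for name, (before, after) in timings.items()}

for name, factor in speedups.items():
    print(f"{name:>45}: {factor:.2f}x")
```

The pure-Python `re.findall` line does not go through PyO3, so its ratio staying close to 1 gives a rough idea of the run-to-run noise in these numbers.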

@rth rth changed the title Update to PyO3 0.7, faster tokenizers in Python Update to PyO3 0.7 May 26, 2019

rth commented May 26, 2019

Hmm, no, actually: creating a PyList from Vec<&str> works, but it segfaults on Windows (probably due to the use of unsafe in PyO3 and the fact that the lifetimes are not right). Reverting the change to tokenizers, unfortunately, though it should be possible to optimize this further.

Edit: or rather, it seems to be a regression in PyO3, as the vectorization tests segfault.

@rth rth force-pushed the pyo3-0.7-update branch from 747a0e5 to 5468df5 May 29, 2019 19:37

rth commented Jun 4, 2019

Managed to reproduce the error on Windows. It's unrelated to tokenizers,

tests/test_vectorize.py::test_count_vectorizer thread '<unnamed>' panicked at 'An error occurred while initializing class SliceBox', C:\Users\Administrator\.cargo\registry\src\github.com-1ecc6299db9ec823\pyo3-0.7.0\src\type_object.rs:260:17
stack backtrace:
   0: std::sys::windows::backtrace::set_frames
             at /rustc/7e001e5c6c7c090b41416a57d4be412ed3ccd937\/src\libstd\sys\windows\backtrace\mod.rs:94
   1: std::sys::windows::backtrace::unwind_backtrace
             at /rustc/7e001e5c6c7c090b41416a57d4be412ed3ccd937\/src\libstd\sys\windows\backtrace\mod.rs:81
   2: std::sys_common::backtrace::_print
             at /rustc/7e001e5c6c7c090b41416a57d4be412ed3ccd937\/src\libstd\sys_common\backtrace.rs:70
   3: std::sys_common::backtrace::print
             at /rustc/7e001e5c6c7c090b41416a57d4be412ed3ccd937\/src\libstd\sys_common\backtrace.rs:58
   4: std::panicking::default_hook::{{closure}}
             at /rustc/7e001e5c6c7c090b41416a57d4be412ed3ccd937\/src\libstd\panicking.rs:200
   5: std::panicking::default_hook
             at /rustc/7e001e5c6c7c090b41416a57d4be412ed3ccd937\/src\libstd\panicking.rs:215
   6: std::panicking::rust_panic_with_hook
             at /rustc/7e001e5c6c7c090b41416a57d4be412ed3ccd937\/src\libstd\panicking.rs:478
   7: std::panicking::continue_panic_fmt
             at /rustc/7e001e5c6c7c090b41416a57d4be412ed3ccd937\/src\libstd\panicking.rs:385
   8: std::panicking::begin_panic_fmt
             at /rustc/7e001e5c6c7c090b41416a57d4be412ed3ccd937\/src\libstd\panicking.rs:340
   9: <T as pyo3::type_object::PyTypeObject>::init_type::{{closure}}
  10: <numpy::slice_box::SliceBox<T>>::new
  11: <numpy::array::PyArray<T, D>>::from_boxed_slice
  12: <ndarray::ArrayBase<ndarray::OwnedRepr<A>, D> as numpy::convert::IntoPyArray>::into_pyarray
  13: <T as pyo3::type_object::PyTypeObject>::init_type::{{closure}}
  14: <T as pyo3::type_object::PyTypeObject>::init_type::{{closure}}
  15: PyMethodDef_RawFastCallKeywords
  16: PyMethodDef_RawFastCallKeywords
  17: PyEval_EvalFrameDefault
  18: PyMethodDef_RawFastCallKeywords
  19: PyEval_EvalFrameDefault
  20: PyFunction_FastCallDict
  21: PySlice_New
  22: PyEval_EvalFrameDefault
  23: PyEval_EvalCodeWithName
  24: PyFunction_FastCallDict
  25: PySlice_New
  26: PyEval_EvalFrameDefault
  27: PyEval_EvalCodeWithName
  28: PyMethodDef_RawFastCallKeywords
  29: PyEval_EvalFrameDefault
  30: PyEval_EvalCodeWithName
  31: PyMethodDef_RawFastCallKeywords
  32: PyEval_EvalFrameDefault
  33: PyMethodDef_RawFastCallKeywords
  34: PyEval_EvalFrameDefault
  35: PyEval_EvalCodeWithName
  36: PyFunction_FastCallDict
  37: PyObject_Call_Prepend
  38: PyType_FromSpecWithBases
  39: PyObject_FastCallKeywords
  40: PyMethodDef_RawFastCallKeywords
  41: PyEval_EvalFrameDefault
  42: PyMethodDef_RawFastCallKeywords
  43: PyEval_EvalFrameDefault
  44: PyFunction_FastCallDict
  45: PySlice_New
  46: PyEval_EvalFrameDefault
  47: PyEval_EvalCodeWithName
  48: PyMethodDef_RawFastCallKeywords
  49: PyEval_EvalFrameDefault
  50: PyEval_EvalCodeWithName
  51: PyMethodDef_RawFastCallKeywords
  52: PyEval_EvalFrameDefault
  53: PyMethodDef_RawFastCallKeywords
  54: PyEval_EvalFrameDefault
  55: PyEval_EvalCodeWithName
  56: PyFunction_FastCallDict
  57: PyObject_Call_Prepend
  58: PyType_FromSpecWithBases
  59: PySlice_New
  60: PyEval_EvalFrameDefault
  61: PyEval_EvalCodeWithName
  62: PyMethodDef_RawFastCallKeywords
  63: PyEval_EvalFrameDefault
  64: PyEval_EvalCodeWithName
  65: PyMethodDef_RawFastCallKeywords
  66: PyEval_EvalFrameDefault
  67: PyEval_EvalCodeWithName
  68: PyFunction_FastCallDict
  69: PySlice_New
  70: PyEval_EvalFrameDefault
  71: PyEval_EvalCodeWithName
  72: PyMethodDef_RawFastCallKeywords
  73: PyEval_EvalFrameDefault
  74: PyEval_EvalCodeWithName
  75: PyMethodDef_RawFastCallKeywords
  76: PyEval_EvalFrameDefault
  77: PyFunction_FastCallDict
  78: PySlice_New
  79: PyEval_EvalFrameDefault
  80: PyEval_EvalCodeWithName
  81: PyMethodDef_RawFastCallKeywords
  82: PyEval_EvalFrameDefault
  83: PyEval_EvalCodeWithName
  84: PyMethodDef_RawFastCallKeywords
  85: PyEval_EvalFrameDefault
  86: PyMethodDef_RawFastCallKeywords
  87: PyEval_EvalFrameDefault
  88: PyEval_EvalCodeWithName
  89: PyFunction_FastCallDict
  90: PyObject_Call_Prepend
  91: PyType_FromSpecWithBases
  92: PyObject_FastCallKeywords
  93: PyMethodDef_RawFastCallKeywords
  94: PyEval_EvalFrameDefault
  95: PyFunction_FastCallDict
  96: PySlice_New
  97: PyEval_EvalFrameDefault
  98: PyEval_EvalCodeWithName
  99: PyMethodDef_RawFastCallKeywords
Windows fatal exception: code 0xc000001d

and it only happens when building a wheel (as opposed to installing in development mode).


rth commented Jun 5, 2019

Using the latest Rust nightly (nightly-2019-02-28 was used before) appears to resolve the previous rust-numpy error. Merging.

@rth rth merged commit 974221a into master Jun 5, 2019
@rth rth deleted the pyo3-0.7-update branch June 5, 2019 21:06