clib: Fix the bug when passing multiple columns of strings with variable lengths to the GMT C API#2719
clib: Fix the bug when passing multiple columns of strings with variable lengths to the GMT C API#2719
Conversation
|
I just tried different versions of @seismann's code example: import numpy as np
string_arrays01 = [np.array(["ABC", "DEF"]), np.array(["ABC", "DEFGHIJK"])]
test01 = np.apply_along_axis(func1d=" ".join, axis=0, arr=string_arrays01)
test01
# array(['ABC ABC', 'DEF DEF'], dtype='<U7') -> cut after 7 signs
string_arrays02 = [np.array(["ABC", "DEFGHIJK"]), np.array(["ABC", "DEF"])]
test02 = np.apply_along_axis(func1d=" ".join, axis=0, arr=string_arrays02)
test02
# array(['ABC ABC', 'DEFGHIJ'], dtype='<U7') -> cut after 7 signs
string_arrays03= [np.array(["ABC", "DEF"]), np.array(["DEFGHIJK", "ABC"])]
test03 = np.apply_along_axis(func1d=" ".join, axis=0, arr=string_arrays03)
test03
# array(['ABC DEFGHIJK', 'DEF ABC'], dtype='<U12') -> NOT cut after 7 signs and now <U12 instead of <U7
string_arrays04 = [np.array(["DEFGHIJK", "ABC"]), np.array(["ABC", "DEF"])]
test04 = np.apply_along_axis(func1d=" ".join, axis=0, arr=string_arrays04)
test04
# array(['DEFGHIJK ABC', 'ABC DEF'], dtype='<U12') -> NOT cut after 7 signs and now <U12 instead of <U7Here, it looks like the length of the first concentrated string sets the maximum length of all other / following concentrated strings. This occurs identically when using string_arrays05 = [np.array(["ABC", "DEF"]), np.array(["ABC", "DEFGHIJK"])]
test05 = np.apply_along_axis(func1d=" ".join, axis=1, arr=string_arrays05)
test05
# array(['ABC DEF', 'ABC DEF'], dtype='<U7') -> cut after 7 signs
string_arrays06 = [np.array(["ABC", "DEFGHIJK"]), np.array(["ABC", "DEF"])]
test06 = np.apply_along_axis(func1d=" ".join, axis=1, arr=string_arrays06)
test06
# array(['ABC DEFGHIJK', 'ABC DEF'], dtype='<U12') -> NOT cut after 7 signs and now <U12 instead of <U7 |
|
If you do Google search "apply_along_axis string arrays", you will see many posts that have exactly the same error, and also the upstream issue report numpy/numpy#8352. |
>>> import numpy as np
>>> string_arrays = [np.array(["ABC", "DEF"]), np.array(["ABC", "DEFGHIJK"])]
>>> %timeit np.apply_along_axis(func1d=" ".join, axis=0, arr=string_arrays)
37.3 µs ± 1.65 µs per loop (mean ± std. dev. of 7 runs, 10,000 loops each)
>>> %timeit np.array([" ".join(vals) for vals in zip(*string_arrays)])
3.02 µs ± 278 ns per loop (mean ± std. dev. of 7 runs, 100,000 loops each)The new & correct version is 10x faster than the old & incorrect version! |
|
Thanks @seisman for spotting the bug, fixing it and adding the benchmark!
Could you try to add a unit test to |
Modified the existing test With the main branch, the test fails: Done in 80e500c. |
weiji14
left a comment
There was a problem hiding this comment.
Modified the existing test
test_virtualfile_from_vectors_two_string_or_object_columnsto catch this bug.
Brilliant! Wish I had used a different string length back in #520 (you actually asked me to make variable length strings in #520 (comment), but I didn't make sure that string1 + string2 were different lengths) 😆
|
I've revised the PR title to make it more descriptive as a changelog entry. |
Description of proposed changes
Found this bug when working on PR #2720.
The changed code is meant to element-wisely join two string arrays with a space, but it doesn't work if strings have different lengths.
Here is a simple example to reproduce the issue:
Obvisouly the result is incorrect, although I don't fully understand why it doesn't work.
The new code works as expected.
BTW: a test related to this bug will be added in PR #2720.