pyarrow: Check compatibility of pyarrow.array with string type#2933
pyarrow: Check compatibility of pyarrow.array with string type#2933
Conversation
Ensure that pyarrow.array objects with string type can be read by pygmt.text.
Convert a pyarrow.StringArray via a Python list to a ctypes array in the strings_to_ctypes_array function. Updated docstrings and type hints in `clib.Session.put_strings` method and `clib.conversion.strings_to_ctypes_array` function. Added two parametrized unit tests to ensure that pyarrow.StringArray can be passed into the clib methods.
pygmt/clib/conversion.py
Outdated
|
|
||
|
|
||
| def strings_to_ctypes_array(strings: Sequence[str]) -> ctp.Array: | ||
| def strings_to_ctypes_array(strings: Sequence[str] | pa.StringArray) -> ctp.Array: |
There was a problem hiding this comment.
Need to figure out how to fix the mypy type errors for an optional import 🤔
pygmt/clib/conversion.py:298: error: Item "Sequence[str]" of "Sequence[str] | Any" has no attribute "to_pylist" [union-attr]
pygmt/clib/session.py:1004: error: Item "Sequence[str]" of "Sequence[str] | Any" has no attribute "dtype" [union-attr]
Fixes `AttributeError: 'NoneType' object has no attribute 'string'`
Also improve docstring of strings_to_ctypes_array function to mention np.ndarray as supported input.
So that the pyarrow.StringArray type hint will show up on Readthedocs and the PyGMT docs page.
CodSpeed Performance ReportMerging #2933 will not alter performanceComparing Summary
Benchmarks breakdown
|
Faster for longer string arrays. Co-authored-by: Dongdong Tian <seisman.info@gmail.com>
Still need to work on Duration and GeoArrow geometry types.
Accidentally removed from `test_virtualfile_from_vectors_one_string_or_object_column` during merge conflict handling.
|
With PR #3619 and #3608 merged, pyarrow arrays with string type now can be correctly converted to numpy string array. So, we don't need to add support of pyarrow arrays in the low-level functions like |
pygmt/tests/test_clib_put_strings.py
Outdated
| from pygmt import clib | ||
| from pygmt.exceptions import GMTCLibError | ||
| from pygmt.helpers import GMTTempFile | ||
| from pygmt.helpers.testing import skip_if_no |
There was a problem hiding this comment.
Also need to revert changes in this PR, since the low-level function Session.put_strings only accepts numpy.str_ and everything else should be converted to numpy.str_ by the _to_numpy function before calling put_strings.
There was a problem hiding this comment.
Ok, just to document things, the way we are intending pyarrow.string inputs to go through high-level modules like fig.text is:
pygmt.Figure.text->clib.Session.virtualfile_in->clib.Session.virtualfile_from_vectors- In
clib.Session.virtualfile_from_vectors:- First call
clib.conversion.vectors_to_arrays->clib.conversion._to_numpy->np.ascontiguous_arrayto convertpyarrow.stringtonp.str_ - With a
list[np.str_], this is then passed toclib.Session.put_strings
- First call
For the record, the original intention in this PR, mostly in commit d379e46, was to let clib.Session.put_strings handle both np.str_ and pyarrow.string.
d92ca80 to
7dc353b
Compare
| assert output == expected | ||
|
|
||
|
|
||
| @pytest.mark.benchmark |
There was a problem hiding this comment.
The test was moved to test_clib_virtualfile_from_vectors in #3512.
| pytest.param( | ||
| getattr(pa, "array", None), | ||
| {}, # pa.string() | ||
| marks=skip_if_no(package="pyarrow"), | ||
| id="pyarrow", | ||
| ), |
There was a problem hiding this comment.
A possible solution to avoid the complicated param is:
try:
import pyarrow as pa
pa_array = pa.array
except ImportError:
pa_array = None
@pytest.mark.parametrize(
("array_func", "dtype"),
[
pytest.param(np.array, {"dtype": np.str_}, id="str"),
pytest.param(np.array, {"dtype": np.object_}, id="object"),
pytest.param(pa_array, {}, marks=skip_if_no(package="pyarrow"), id="pyarrow"),
]
)
Co-Authored-By: Dongdong Tian <seisman.info@gmail.com>
85d2bb5 to
8172102
Compare
|
It's likely that pytest-codspeed v3.0 breaks our Benchmark workflow. For comparison, see the CI job on the main branch yesterday (https://github.com/GenericMappingTools/pygmt/actions/runs/11830721578/job/32964716012) and this PR (https://github.com/GenericMappingTools/pygmt/actions/runs/11849193208/job/33022039738?pr=2933). Edit: Will explore the issue in a separate PR. |
Description of proposed changes
Check that
pyarrow.arrayobjects of string type can be passed to PyGMT modules/functions.This PR also adds type hints to the docstring of the
pygmt.Figure.text'stextparameter.Preview at https://pygmt-dev--2933.org.readthedocs.build/en/2933/api/generated/pygmt.Figure.text.html
TODO:
pyarrow.arrayby adding parametrized tests topygmt.texttest_clib_virtualfile_from_vectors.pypa.StringArraycan be passed intopygmt.Figure.textAddresses #2800 (comment), part of #2800.
Reminders
make formatandmake checkto make sure the code follows the style guide.doc/api/index.rst.Slash Commands
You can write slash commands (
/command) in the first line of a comment to performspecific operations. Supported slash commands are:
/format: automatically format and lint the code/test-gmt-dev: run full tests on the latest GMT development version