Skip to content

Conversation

@chilin0525
Copy link
Contributor

@chilin0525 chilin0525 commented Feb 25, 2025

Rationale for this change

See #45619.

What changes are included in this PR?

Refactor using f-string instead of string.format. But do not use f-string for following case, string.format allows passing parameters, making the code more reusable.

_read_table_docstring = """
{0}
Parameters
----------
source : str, pyarrow.NativeFile, or file-like object
If a string passed, can be a single file name or directory name. For
file-like objects, only read a single file. Use pyarrow.BufferReader to
read a file contained in a bytes or buffer-like object.
columns : list
If not None, only these columns will be read from the file. A column
name may be a prefix of a nested field, e.g. 'a' will select 'a.b',
'a.c', and 'a.d.e'. If empty, no columns will be read. Note
that the table will still have the correct num_rows set despite having
no columns.
use_threads : bool, default True
Perform multi-threaded column reads.
schema : Schema, optional
Optionally provide the Schema for the parquet dataset, in which case it
will not be inferred from the source.
{1}
filesystem : FileSystem, default None
If nothing passed, will be inferred based on path.
Path will try to be found in the local on-disk filesystem otherwise
it will be parsed as an URI to determine the filesystem.
filters : pyarrow.compute.Expression or List[Tuple] or List[List[Tuple]], default None
Rows which do not match the filter predicate will be removed from scanned
data. Partition keys embedded in a nested directory structure will be
exploited to avoid loading files at all if they contain no matching rows.
Within-file level filtering and different partitioning schemes are supported.
{3}
use_legacy_dataset : bool, optional
Deprecated and has no effect from PyArrow version 15.0.0.
ignore_prefixes : list, optional
Files matching any of these prefixes will be ignored by the
discovery process.
This is matched to the basename of a path.
By default this is ['.', '_'].
Note that discovery happens only if a directory is passed as source.
pre_buffer : bool, default True
Coalesce and issue file reads in parallel to improve performance on
high-latency filesystems (e.g. S3). If True, Arrow will use a
background I/O thread pool. If using a filesystem layer that itself
performs readahead (e.g. fsspec's S3FS), disable readahead for best
results.
coerce_int96_timestamp_unit : str, default None
Cast timestamps that are stored in INT96 format to a particular
resolution (e.g. 'ms'). Setting to None is equivalent to 'ns'
and therefore INT96 timestamps will be inferred as timestamps
in nanoseconds.
decryption_properties : FileDecryptionProperties or None
File-level decryption properties.
The decryption properties can be created using
``CryptoFactory.file_decryption_properties()``.
thrift_string_size_limit : int, default None
If not None, override the maximum total string size allocated
when decoding Thrift structures. The default limit should be
sufficient for most Parquet files.
thrift_container_size_limit : int, default None
If not None, override the maximum total size of containers allocated
when decoding Thrift structures. The default limit should be
sufficient for most Parquet files.
page_checksum_verification : bool, default False
If True, verify the checksum for each page read from the file.
Returns
-------
{2}
{4}
"""

Are these changes tested?

Via CI.

Are there any user-facing changes?

No.

@kou
Copy link
Member

kou commented Feb 26, 2025

We have many string.format codes in https://github.com/apache/arrow/tree/main/python/pyarrow . If you want to work on this step by step (for example, 1 PR per file), could you open a sub-issue for each PR instead of associating all PRs to GH-45619? (GitHub added sub-issue related features recently.)

@chilin0525 chilin0525 marked this pull request as draft February 26, 2025 00:51
@chilin0525
Copy link
Contributor Author

@kou Thank you for the reminder🙏. I personally prefer to implement all changes within a single PR, so I am converting the PR to draft status.

@chilin0525 chilin0525 marked this pull request as ready for review March 1, 2025 13:58
@chilin0525
Copy link
Contributor Author

I have already changed all the files under the pyarrow folder. As discussed in #45619, certain scenarios where the template is reused across multiple methods using string.format will not be refactored to f-strings.

Copy link
Member

@raulcd raulcd left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for the PR, could you take a look on the CI failures:
https://github.com/apache/arrow/actions/runs/13620611692/job/38077510062?pr=45629#step:6:8540
There are some cases were the expected doctest is failing due to the changes.

@kou kou mentioned this pull request Mar 5, 2025

def __repr__(self):
return "<pyarrow.acero.Declaration>\n{0}".format(str(self))
return f"<pyarrow.acero.Declaration>\n{str(self)}"
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

str is implicit in f-strings, you don't need to call it explicitly. Example:

>>> from decimal import Decimal
>>> d = Decimal('1.500')
>>> repr(d)
"Decimal('1.500')"
>>> str(d)
'1.500'
>>> f"d is {d}"
'd is 1.500'
>>> f"d is {d!r}"
"d is Decimal('1.500')"

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for review! I will change this.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Solved in cef7a3b

@chilin0525
Copy link
Contributor Author

Hi @kou @raulcd @pitrou , the PR #45679 found some files outside pyarrow using string.format. I just want to confirm whether I should update all occurrences of string.format across the entire project, rather than limiting the changes to files under pyarrow. Is that correct?

@pitrou
Copy link
Member

pitrou commented Mar 6, 2025

I just want to confirm whether I should update all occurrences of string.format across the entire project, rather than limiting the changes to files under pyarrow. Is that correct?

You can, but you don't have to. It's as you prefer.

@github-actions github-actions bot added awaiting committer review Awaiting committer review and removed awaiting review Awaiting review labels Mar 6, 2025
@chilin0525
Copy link
Contributor Author

I just want to confirm whether I should update all occurrences of string.format across the entire project, rather than limiting the changes to files under pyarrow. Is that correct?

You can, but you don't have to. It's as you prefer.

Got it! I prefer to update all files in this PR, so I will mark it as a draft until all changes are made. Thanks!

@chilin0525 chilin0525 marked this pull request as draft March 6, 2025 17:34
@chilin0525
Copy link
Contributor Author

@pitrou @kszucs Sorry for the late update. I’ve rechecked all the files and removed unnecessary f prefixes from f-strings. The CI error doesn’t appear to be related to this change — but please correct me if I’m wrong. Thanks!

@kou
Copy link
Member

kou commented Apr 6, 2025

"C GLib & Ruby / AMD64 Windows MSVC GLib" will be fixed by #46006 .

Copy link
Member

@raulcd raulcd left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@chilin0525 will you have some time to rebase main and fix the conflicts?

@chilin0525 chilin0525 requested review from AlenkaF and rok as code owners May 10, 2025 11:58
@chilin0525
Copy link
Contributor Author

@raulcd I've rebased onto main and resolved the conflicts. The CI error looks unrelated to this change. Let me know if there's anything else I should adjust. Thanks!

@raulcd
Copy link
Member

raulcd commented May 12, 2025

@github-actions crossbow submit -g python

Copy link
Member

@raulcd raulcd left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I've applied some minor suggestions and fixed a new conflict with main. If CI is successful I am going to merge.

@github-actions github-actions bot added awaiting merge Awaiting merge and removed awaiting change review Awaiting change review labels May 12, 2025
@github-actions
Copy link

Revision: 7079403

Submitted crossbow builds: ursacomputing/crossbow @ actions-11bf920665

Task Status
example-python-minimal-build-fedora-conda GitHub Actions
example-python-minimal-build-ubuntu-venv GitHub Actions
test-conda-python-3.10 GitHub Actions
test-conda-python-3.10-hdfs-2.9.2 GitHub Actions
test-conda-python-3.10-hdfs-3.2.1 GitHub Actions
test-conda-python-3.10-pandas-latest-numpy-latest GitHub Actions
test-conda-python-3.11 GitHub Actions
test-conda-python-3.11-dask-latest GitHub Actions
test-conda-python-3.11-dask-upstream_devel GitHub Actions
test-conda-python-3.11-hypothesis GitHub Actions
test-conda-python-3.11-pandas-latest-numpy-1.26 GitHub Actions
test-conda-python-3.11-pandas-latest-numpy-latest GitHub Actions
test-conda-python-3.11-pandas-nightly-numpy-nightly GitHub Actions
test-conda-python-3.11-pandas-upstream_devel-numpy-nightly GitHub Actions
test-conda-python-3.11-spark-master GitHub Actions
test-conda-python-3.12 GitHub Actions
test-conda-python-3.12-cpython-debug GitHub Actions
test-conda-python-3.13 GitHub Actions
test-conda-python-3.9 GitHub Actions
test-conda-python-3.9-pandas-1.1.3-numpy-1.19.5 GitHub Actions
test-conda-python-emscripten GitHub Actions
test-cuda-python-ubuntu-22.04-cuda-11.7.1 GitHub Actions
test-debian-12-python-3-amd64 GitHub Actions
test-debian-12-python-3-i386 GitHub Actions
test-fedora-39-python-3 GitHub Actions
test-ubuntu-22.04-python-3 GitHub Actions
test-ubuntu-22.04-python-313-freethreading GitHub Actions
test-ubuntu-24.04-python-3 GitHub Actions

@raulcd
Copy link
Member

raulcd commented May 12, 2025

The CI failures are unrelated.
The minimal examples fail because the fork repository does not have the development tags. This is a known issue:

The Python 3.10 are related to the test_gdb.py issue:

@raulcd raulcd merged commit 992bee2 into apache:main May 12, 2025
40 of 41 checks passed
@raulcd raulcd removed the awaiting merge Awaiting merge label May 12, 2025
@conbench-apache-arrow
Copy link

After merging your PR, Conbench analyzed the 4 benchmarking runs that have been run so far on merge-commit 992bee2.

There were no benchmark performance regressions. 🎉

The full Conbench report has more details. It also includes information about 11 possible false positives for unstable benchmarks that are known to sometimes produce them.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Projects

None yet

Development

Successfully merging this pull request may close these issues.

7 participants