GH-45619: [Python] Use f-string instead of string.format #45629

chilin0525 · 2025-02-25T17:31:17Z

Rationale for this change

What changes are included in this PR?

Refactor to use f-strings instead of string.format. The following case is left unchanged, because string.format allows the same template to be filled with different parameters, which makes the code more reusable.

arrow/python/pyarrow/parquet/core.py

Lines 1624 to 1695 in 0fbf982

    
    _read_table_docstring = """
    {0}

    Parameters
    ----------
    source : str, pyarrow.NativeFile, or file-like object
        If a string passed, can be a single file name or directory name. For
        file-like objects, only read a single file. Use pyarrow.BufferReader to
        read a file contained in a bytes or buffer-like object.
    columns : list
        If not None, only these columns will be read from the file. A column
        name may be a prefix of a nested field, e.g. 'a' will select 'a.b',
        'a.c', and 'a.d.e'. If empty, no columns will be read. Note
        that the table will still have the correct num_rows set despite having
        no columns.
    use_threads : bool, default True
        Perform multi-threaded column reads.
    schema : Schema, optional
        Optionally provide the Schema for the parquet dataset, in which case it
        will not be inferred from the source.
    {1}
    filesystem : FileSystem, default None
        If nothing passed, will be inferred based on path.
        Path will try to be found in the local on-disk filesystem otherwise
        it will be parsed as an URI to determine the filesystem.
    filters : pyarrow.compute.Expression or List[Tuple] or List[List[Tuple]], default None
        Rows which do not match the filter predicate will be removed from scanned
        data. Partition keys embedded in a nested directory structure will be
        exploited to avoid loading files at all if they contain no matching rows.
        Within-file level filtering and different partitioning schemes are supported.

        {3}
    use_legacy_dataset : bool, optional
        Deprecated and has no effect from PyArrow version 15.0.0.
    ignore_prefixes : list, optional
        Files matching any of these prefixes will be ignored by the
        discovery process.
        This is matched to the basename of a path.
        By default this is ['.', '_'].
        Note that discovery happens only if a directory is passed as source.
    pre_buffer : bool, default True
        Coalesce and issue file reads in parallel to improve performance on
        high-latency filesystems (e.g. S3). If True, Arrow will use a
        background I/O thread pool. If using a filesystem layer that itself
        performs readahead (e.g. fsspec's S3FS), disable readahead for best
        results.
    coerce_int96_timestamp_unit : str, default None
        Cast timestamps that are stored in INT96 format to a particular
        resolution (e.g. 'ms'). Setting to None is equivalent to 'ns'
        and therefore INT96 timestamps will be inferred as timestamps
        in nanoseconds.
    decryption_properties : FileDecryptionProperties or None
        File-level decryption properties.
        The decryption properties can be created using
        ``CryptoFactory.file_decryption_properties()``.
    thrift_string_size_limit : int, default None
        If not None, override the maximum total string size allocated
        when decoding Thrift structures. The default limit should be
        sufficient for most Parquet files.
    thrift_container_size_limit : int, default None
        If not None, override the maximum total size of containers allocated
        when decoding Thrift structures. The default limit should be
        sufficient for most Parquet files.
    page_checksum_verification : bool, default False
        If True, verify the checksum for each page read from the file.

    Returns
    -------
    {2}

    {4}
    """

Are these changes tested?

Via CI.

Are there any user-facing changes?

No.

GitHub Issue: [Python] Use f-string instead of string.format #45619

kou · 2025-02-26T00:00:46Z

We have many uses of string.format in https://github.com/apache/arrow/tree/main/python/pyarrow . If you want to work on this step by step (for example, 1 PR per file), could you open a sub-issue for each PR instead of associating all PRs with GH-45619? (GitHub added sub-issue related features recently.)

chilin0525 · 2025-02-26T01:52:43Z

@kou Thank you for the reminder🙏. I personally prefer to implement all changes within a single PR, so I am converting the PR to draft status.

chilin0525 · 2025-03-01T14:01:37Z

I have already changed all the files under the pyarrow folder. As discussed in #45619, certain scenarios where the template is reused across multiple methods using string.format will not be refactored to f-strings.

raulcd

Thanks for the PR. Could you take a look at the CI failures:
https://github.com/apache/arrow/actions/runs/13620611692/job/38077510062?pr=45629#step:6:8540
There are some cases where the expected doctest is failing due to the changes.

pitrou · 2025-03-06T09:19:19Z

python/pyarrow/_acero.pyx


    def __repr__(self):
-        return "<pyarrow.acero.Declaration>\n{0}".format(str(self))
+        return f"<pyarrow.acero.Declaration>\n{str(self)}"


str is implicit in f-strings, you don't need to call it explicitly. Example:

    >>> from decimal import Decimal
    >>> d = Decimal('1.500')
    >>> repr(d)
    "Decimal('1.500')"
    >>> str(d)
    '1.500'
    >>> f"d is {d}"
    'd is 1.500'
    >>> f"d is {d!r}"
    "d is Decimal('1.500')"

Thanks for review! I will change this.

Solved in cef7a3b
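
A small sketch of the point made in this review thread, using a hypothetical `Declaration` class rather than the real pyarrow.acero one: interpolating an object into an f-string formats it through `format()`, which for classes that do not override `__format__` gives the same text as `str()`, so the explicit `str(self)` call is redundant.

```python
class Declaration:
    """Hypothetical stand-in, not the pyarrow.acero class."""

    def __init__(self, name):
        self.name = name

    def __str__(self):
        return f"Declaration({self.name})"

    def __repr__(self):
        # {self} falls back to __str__ here; no explicit str(self) is needed.
        return f"<Declaration>\n{self}"


decl = Declaration("scan")
explicit = f"<Declaration>\n{str(decl)}"  # the form flagged in review
implicit = repr(decl)                      # the simplified form
```

Both strings are identical, which is why the simplification is safe for this kind of class.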

chilin0525 · 2025-03-06T17:25:39Z

Hi @kou @raulcd @pitrou , the PR #45679 found some files outside pyarrow using string.format. I just want to confirm whether I should update all occurrences of string.format across the entire project, rather than limiting the changes to files under pyarrow. Is that correct?

pitrou · 2025-03-06T17:27:31Z

> I just want to confirm whether I should update all occurrences of string.format across the entire project, rather than limiting the changes to files under pyarrow. Is that correct?

You can, but you don't have to. It's as you prefer.

chilin0525 · 2025-03-06T17:33:47Z

> I just want to confirm whether I should update all occurrences of string.format across the entire project, rather than limiting the changes to files under pyarrow. Is that correct?

> You can, but you don't have to. It's as you prefer.

Got it! I prefer to update all files in this PR, so I will mark it as a draft until all changes are made. Thanks!

chilin0525 · 2025-04-06T09:30:32Z

@pitrou @kszucs Sorry for the late update. I’ve rechecked all the files and removed unnecessary f prefixes from f-strings. The CI error doesn’t appear to be related to this change — but please correct me if I’m wrong. Thanks!
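
Two pitfalls touched on in this update and in the commit messages can be sketched briefly: an `f` prefix on a string with no placeholders does nothing (linters such as pyflakes report it as F541), and literal braces are escaped by doubling in both string.format and f-strings, so a conversion should keep one level of doubling rather than add another. The variable names are illustrative only.

```python
name = "arrow"

# An f prefix without placeholders is a no-op, so it can be dropped.
with_prefix = f"no placeholders here"
without_prefix = "no placeholders here"

# Literal braces are written as {{ and }} in both styles.
old_style = "{{{0}}}".format(name)   # '{arrow}'
new_style = f"{{{name}}}"            # '{arrow}'

# Doubling the escape again emits extra literal braces.
over_escaped = f"{{{{{name}}}}}"     # '{{arrow}}'
```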

kou · 2025-04-06T22:07:50Z

"C GLib & Ruby / AMD64 Windows MSVC GLib" will be fixed by #46006 .

raulcd

@chilin0525 will you have some time to rebase on main and fix the conflicts?

chilin0525 · 2025-05-10T15:36:07Z

@raulcd I've rebased onto main and resolved the conflicts. The CI error looks unrelated to this change. Let me know if there's anything else I should adjust. Thanks!

dev/archery/archery/crossbow/core.py

python/pyarrow/tests/parquet/test_dataset.py

raulcd · 2025-05-12T10:59:10Z

@github-actions crossbow submit -g python

raulcd

I've applied some minor suggestions and fixed a new conflict with main. If CI is successful I am going to merge.

github-actions · 2025-05-12T11:02:14Z

Revision: 7079403

Submitted crossbow builds: ursacomputing/crossbow @ actions-11bf920665

Task	Status
example-python-minimal-build-fedora-conda
example-python-minimal-build-ubuntu-venv
test-conda-python-3.10
test-conda-python-3.10-hdfs-2.9.2
test-conda-python-3.10-hdfs-3.2.1
test-conda-python-3.10-pandas-latest-numpy-latest
test-conda-python-3.11
test-conda-python-3.11-dask-latest
test-conda-python-3.11-dask-upstream_devel
test-conda-python-3.11-hypothesis
test-conda-python-3.11-pandas-latest-numpy-1.26
test-conda-python-3.11-pandas-latest-numpy-latest
test-conda-python-3.11-pandas-nightly-numpy-nightly
test-conda-python-3.11-pandas-upstream_devel-numpy-nightly
test-conda-python-3.11-spark-master
test-conda-python-3.12
test-conda-python-3.12-cpython-debug
test-conda-python-3.13
test-conda-python-3.9
test-conda-python-3.9-pandas-1.1.3-numpy-1.19.5
test-conda-python-emscripten
test-cuda-python-ubuntu-22.04-cuda-11.7.1
test-debian-12-python-3-amd64
test-debian-12-python-3-i386
test-fedora-39-python-3
test-ubuntu-22.04-python-3
test-ubuntu-22.04-python-313-freethreading
test-ubuntu-24.04-python-3

raulcd · 2025-05-12T12:33:25Z

The CI failures are unrelated.
The minimal examples fail because the fork repository does not have the development tags. This is a known issue:

[Python] Jobs fail if Pyarrow version is not correctly generated due to missing remote dev tags #44803

The Python 3.10 failures are related to the test_gdb.py issue:

[CI][Python] Conda Python 3.10 jobs fail with UnicodeDecodeError due to gdb issue #46343

conbench-apache-arrow · 2025-05-12T20:37:04Z

After merging your PR, Conbench analyzed the 4 benchmarking runs that have been run so far on merge-commit 992bee2.

There were no benchmark performance regressions. 🎉

The full Conbench report has more details. It also includes information about 11 possible false positives for unstable benchmarks that are known to sometimes produce them.

Refactor using f-string instead of string.format

66b3a5a

github-actions bot added Component: Python and awaiting review labels Feb 25, 2025

chilin0525 marked this pull request as draft February 26, 2025 00:51

Refactor using f-string instead of string.format

f5b0933

github-actions bot added Component: FlightRPC Component: Gandiva labels Feb 28, 2025

chilin0525 added 5 commits March 1, 2025 02:55

Merge branch 'main' into using-f-string-instead-string-format

a5c1112

Rollback _filesystem_uri to test fail testcase on CI

4b46338

Refactor using f-string instead of string.format

02cbd03

Refactor using f-string instead of string.format

e63da8b

Refactor using f-string instead of string.format

edab1db

chilin0525 marked this pull request as ready for review March 1, 2025 13:58

chilin0525 requested review from lidavidm and wjones127 as code owners March 1, 2025 13:58

Fix inconsistent f-string format in Tensor.repr method

0b9b9e8

raulcd reviewed Mar 3, 2025

Fix Tensor.repr method by using proper f-string formatting

c711cb0

kou mentioned this pull request Mar 5, 2025

45619 #45679

Closed

pitrou reviewed Mar 6, 2025

github-actions bot added awaiting committer review and removed awaiting review labels Mar 6, 2025

chilin0525 marked this pull request as draft March 6, 2025 17:34

Merge branch 'main' into using-f-string-instead-string-format

965d70a

chilin0525 added 7 commits April 6, 2025 10:36

rollback from f-string to string formating in bot.py

97070b2

remove unnecessary f prefixes in f-string

2c7c552

trigger GitHub actions

79d9aa0

fix incorrect use of '{{{{' in f-string formatting

e6d63dd

remove all file contains unnecessary f prefixes in f-string

ffe19ed

fix missing f prefix in f-string in feather.py

471192d

test CI

cfebe1f

Merge branch 'main' into using-f-string-instead-string-format

146e79e

raulcd reviewed May 7, 2025

fix conflict

6f0f7ca

chilin0525 requested review from AlenkaF and rok as code owners May 10, 2025 11:58

raulcd reviewed May 12, 2025

raulcd added 2 commits May 12, 2025 12:55

Merge branch 'main' into using-f-string-instead-string-format

7f20c82

Apply suggestions from code review

7079403

raulcd approved these changes May 12, 2025

github-actions bot added awaiting merge and removed awaiting change review labels May 12, 2025

AlenkaF approved these changes May 12, 2025

rok approved these changes May 12, 2025

raulcd merged commit 992bee2 into apache:main May 12, 2025
40 of 41 checks passed

raulcd removed the awaiting merge label May 12, 2025

raulcd mentioned this pull request May 12, 2025

[Python] Use f-string instead of string.format #45619

Closed


7 participants
