gh-76535: Make `PyUnicode_ToLowerFull` and friends public by lysnikolaou · Pull Request #136176 · python/cpython

lysnikolaou · 2025-07-01T13:30:09Z

Make _PyUnicode_ToLowerFull, _PyUnicode_ToUpperFull, _PyUnicode_ToTitleFull and _PyUnicode_ToFoldedFull public and rename them to PyUnicode_ToLower etc.

Issue: Unclear intention of deprecating Py_UNICODE_TOLOWER / Py_UNICODE_TOUPPER #76535

📚 Documentation preview 📚: https://cpython-previews--136176.org.readthedocs.build/

Make `PyUnicode_ToLowerFull`, `PyUnicode_ToUpperFull` and `PyUnicode_ToTitleFull` public and rename them to `PyUnicode_ToLower` etc.

Doc/c-api/unicode.rst

lysnikolaou · 2025-07-01T14:53:13Z

Thanks for taking a look @vstinner! Feedback addressed.

serhiy-storchaka

In #76535 (comment) @vstinner suggested providing a constant for the minimum buffer size.

If this is indeed a hard constant which will never be changed in future Unicode standards, then I prefer this way. It is too expensive to allocate the output buffer dynamically.
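Whether the bound is a hard constant is the open question here, but the value in a given interpreter can at least be measured. A rough sketch, using only Python's `str` methods rather than the C functions under review (the helper name `max_case_expansion` is made up for illustration):

```python
import sys
import unicodedata


def max_case_expansion() -> dict[str, int]:
    """Return, per case operation, the longest result (in code points)
    of converting a single code point with this interpreter's tables."""
    operations = {
        "lower": str.lower,
        "upper": str.upper,
        "title": str.title,
        "casefold": str.casefold,
    }
    longest = dict.fromkeys(operations, 0)
    for code_point in range(sys.maxunicode + 1):
        ch = chr(code_point)
        for name, convert in operations.items():
            longest[name] = max(longest[name], len(convert(ch)))
    return longest


result = max_case_expansion()
print("Unicode", unicodedata.unidata_version, result)
```

This only reports what the bundled Unicode database contains today; it says nothing about future Unicode versions.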

cc @ezio-melotti, our Unicode expert.

lysnikolaou · 2025-07-01T15:09:49Z

Another question I have is whether we want to expose something like the following to handle the Greek letter sigma edge case:

int PyUnicode_ToLowerHandleSigma(Py_UCS4 *str, Py_UCS4 ch, Py_UCS4 *buffer, int size)

vstinner

Thanks, I prefer this API: it is more future-proof because it doesn't depend on a specific Unicode version.

Objects/unicodeobject.c

Doc/c-api/unicode.rst

vstinner · 2025-07-01T15:14:26Z

PyUnicode_ToLowerHandleSigma

Would you mind elaborating? I'm not aware of this special case.

vstinner · 2025-07-01T15:18:24Z

If this is indeed a hard constant which will never be changed in future Unicode standards

Even if it's a constant which will never (!) change, IMO it's better to take the size as an argument, so that the caller is responsible for checking the buffer size. APIs which accept a pointer with no size are a bad pattern, like the deprecated gets() function.

lysnikolaou · 2025-07-01T15:33:17Z

Further feedback addressed.

Would you mind elaborating? I'm not aware of this special case.

There's one special case, the Greek letter sigma, where the result of lower casing is context-specific. More specifically, Σ gets lower-cased to ς if it's at the end of the word or to σ otherwise. This is handled in lower_ucs4 right now.
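Python's own `str.lower()` shows the rule in action. A small sketch, with three Greek words picked only for illustration:

```python
# Whole-string lowercasing is context-aware for capital sigma.
words = ["ΟΔΟΣ", "ΣΟΦΙΑ", "ΜΕΣΟΣ"]
lowered = {word: word.lower() for word in words}

for word, result in lowered.items():
    # Show the last code point so ς (U+03C2) and σ (U+03C3) are easy to tell apart.
    print(f"{word} -> {result} (last code point U+{ord(result[-1]):04X})")

# A lone Σ has no preceding letter, so it is not treated as word-final.
print(f"U+{ord('Σ'.lower()):04X}")
```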

vstinner

Can you try to add tests to Modules/_testcapi/unicode.c and Lib/test/test_capi/test_unicode.py?

Doc/c-api/unicode.rst

vstinner · 2025-07-01T15:46:38Z

There's one special case, the Greek letter sigma, where the result of lower casing is context-specific. More specifically, Σ gets lower-cased to ς if it's at the end of the word or to σ otherwise.

Oh, that's a tricky case. The proposed API takes a single character, so we don't know if Σ is at the end of a word or not. I don't think that it's worth handling this special case in the proposed API.
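To illustrate what is lost without context, here is a sketch that lowercases one code point at a time, roughly what a caller looping over a single-character API would do. The helper `lower_per_code_point` is hypothetical and uses `str.lower()` on one-character strings as a stand-in for the C function:

```python
def lower_per_code_point(text: str) -> str:
    """Lowercase each code point in isolation, with no surrounding context."""
    return "".join(ch.lower() for ch in text)


word = "ΟΔΟΣ"
context_aware = word.lower()
context_free = lower_per_code_point(word)

print(context_aware, context_free, context_aware == context_free)
# The two results differ only in the last code point: ς versus σ.
print(f"U+{ord(context_aware[-1]):04X} vs U+{ord(context_free[-1]):04X}")
```

A caller that needs the final-sigma behavior would have to apply that rule itself on top of the per-character results.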

lysnikolaou · 2025-07-01T16:32:41Z

Can you try to add tests to Modules/_testcapi/unicode.c and Lib/test/test_capi/test_unicode.py?

Done.

serhiy-storchaka · 2025-07-01T16:53:18Z

If we add too many parameters and runtime checks, this will make the API slower and more difficult to use.

Modules/_testcapi/unicode.c

Lib/test/test_capi/test_unicode.py

Doc/c-api/unicode.rst

Modules/_testcapi/unicode.c

Co-authored-by: Victor Stinner <vstinner@python.org>

vstinner

LGTM, I just have some minor comments.

@serhiy-storchaka: Would you be ok with this approach? An API with a size parameter.

Objects/unicodectype.c

Doc/c-api/unicode.rst

lysnikolaou · 2025-07-02T13:00:47Z

All feedback addressed! Thanks for all the help and patience @vstinner!

lysnikolaou · 2025-07-14T13:47:16Z

Friendly ping.

Lib/test/test_capi/test_unicode.py


Doc/c-api/unicode.rst

vstinner · 2025-08-13T09:58:26Z

I ran a benchmark on:

with_size: Py_ssize_t PyUCS4_ToLower(Py_UCS4 ch, Py_UCS4 *res, Py_ssize_t size)
without_size: Py_ssize_t PyUCS4_ToLower(Py_UCS4 ch, Py_UCS4 *res)

I used ASCII letters for my benchmark: letters A to J (10 letters).

Mean +- std dev: [with_size] 2.21 ns +- 0.02 ns -> [without_size] 1.91 ns +- 0.01 ns: 1.16x faster

IMO the difference is too small (0.3 ns, 1.16x faster) to justify removing the size parameter.
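For reference, the two figures quoted above follow directly from the reported means. A quick arithmetic check, using only the numbers from this comment:

```python
# Mean timings per call reported by pyperf, in nanoseconds.
with_size_ns = 2.21
without_size_ns = 1.91

speedup = with_size_ns / without_size_ns
saved_ns = with_size_ns - without_size_ns

print(f"{speedup:.2f}x faster, {saved_ns:.2f} ns saved per call")
```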

Details

Patch for Modules/_testcapimodule.c:

diff --git a/Modules/_testcapimodule.c b/Modules/_testcapimodule.c
index d0c0b45c20c..c65e3db137b 100644
--- a/Modules/_testcapimodule.c
+++ b/Modules/_testcapimodule.c
@@ -2555,6 +2555,48 @@ toggle_reftrace_printer(PyObject *ob, PyObject *arg)
     Py_RETURN_NONE;
 }
 
+
+static PyObject *
+bench(PyObject *self, PyObject *args)
+{
+    Py_ssize_t loops;
+    if (!PyArg_ParseTuple(args, "n", &loops)) {
+        return NULL;
+    }
+    Py_UCS4 buffer[10];
+
+    PyTime_t start;
+    (void)PyTime_PerfCounterRaw(&start);
+    for (Py_ssize_t i=0; i < loops; i++) {
+        Py_ssize_t res;
+        res = PyUCS4_ToLower('A', buffer, 3);
+        if (res < 0) return NULL;
+        res = PyUCS4_ToLower('B', buffer, 3);
+        if (res < 0) return NULL;
+        res = PyUCS4_ToLower('C', buffer, 3);
+        if (res < 0) return NULL;
+        res = PyUCS4_ToLower('D', buffer, 3);
+        if (res < 0) return NULL;
+        res = PyUCS4_ToLower('E', buffer, 3);
+        if (res < 0) return NULL;
+        res = PyUCS4_ToLower('F', buffer, 3);
+        if (res < 0) return NULL;
+        res = PyUCS4_ToLower('G', buffer, 3);
+        if (res < 0) return NULL;
+        res = PyUCS4_ToLower('H', buffer, 3);
+        if (res < 0) return NULL;
+        res = PyUCS4_ToLower('I', buffer, 3);
+        if (res < 0) return NULL;
+        res = PyUCS4_ToLower('J', buffer, 3);
+        if (res < 0) return NULL;
+    }
+    PyTime_t end;
+    (void)PyTime_PerfCounterRaw(&end);
+
+    return PyFloat_FromDouble(PyTime_AsSecondsDouble(end - start));
+}
+
+
 static PyMethodDef TestMethods[] = {
     {"set_errno",               set_errno,                       METH_VARARGS},
     {"test_config",             test_config,                     METH_NOARGS},
@@ -2649,6 +2691,7 @@ static PyMethodDef TestMethods[] = {
     {"test_atexit", test_atexit, METH_NOARGS},
     {"code_offset_to_line", _PyCFunction_CAST(code_offset_to_line), METH_FASTCALL},
     {"toggle_reftrace_printer", toggle_reftrace_printer, METH_O},
+    {"bench", bench, METH_VARARGS},
     {NULL, NULL} /* sentinel */
 };

Benchmark:

import pyperf, _testcapi
runner = pyperf.Runner()
runner.bench_time_func('bench', _testcapi.bench, inner_loops=10)

vstinner · 2025-08-13T10:03:23Z

I ran a benchmark on:

with_size: Py_ssize_t PyUCS4_ToLower(Py_UCS4 ch, Py_UCS4 *res, Py_ssize_t size)
use_int: int PyUCS4_ToLower(Py_UCS4 ch, Py_UCS4 *res, int size)

I used ASCII letters for my benchmark: letters A to J (10 letters).

Mean +- std dev: [with_size] 2.21 ns +- 0.02 ns -> [use_int] 2.26 ns +- 0.02 ns: 1.02x slower

The difference is not significant (if anything, use_int is slightly slower).

serhiy-storchaka · 2025-08-13T10:03:26Z

It means that str.lower() will now be 16% slower. Not good.

vstinner · 2025-08-13T10:10:39Z

@serhiy-storchaka:

It means that str.lower() will now be 16% slower. Not good.

I ran a microbenchmark on str.lower:

env/bin/python -m pyperf timeit -s 's="a"*1_000' 's.lower()'

Result:

Mean +- std dev: [ref] 431 ns +- 6 ns -> [change] 418 ns +- 6 ns: 1.03x faster

str.lower() becomes faster with this change, not slower. At least, it's not 16% slower.

UPDATE: Just to be sure, I ran the benchmark again using the --rigorous option. I got similar results:

Mean +- std dev: [ref] 418 ns +- 9 ns -> [change] 411 ns +- 10 ns: 1.02x faster

serhiy-storchaka · 2025-08-13T10:28:01Z

Then you did not test with the right data. If PyUCS4_ToLower() becomes 16% slower, str.lower() that calls it in a tight loop should be 16% slower for some data. Perhaps in your tests other things dominated.
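One way to probe this is to time `str.lower()` on inputs other than pure ASCII. The sketch below assumes, without having confirmed it against the source here, that ASCII-only strings may take a shortcut that never reaches the per-code-point conversion; the helper `time_lower` and the sample strings are made up for illustration, and it uses `timeit` rather than pyperf to stay self-contained:

```python
import timeit


def time_lower(text: str, number: int = 2_000) -> float:
    """Return the best mean time of one text.lower() call, in nanoseconds."""
    timer = timeit.Timer("s.lower()", globals={"s": text})
    best = min(timer.repeat(repeat=5, number=number))
    return best / number * 1e9


samples = {
    "ascii": "a" * 1_000,              # same input as the pyperf command above
    "latin-1 upper": "\u00c9" * 1_000,  # É, non-ASCII and actually converted
    "greek sigma": "\u03a3" * 1_000,    # Σ, also exercises the sigma check
}
for label, text in samples.items():
    print(f"{label:>14}: {time_lower(text):10.1f} ns per call")
```

Comparing these timings between the reference build and the patched build would show whether the per-call overhead is visible on non-ASCII data.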

vstinner · 2025-08-13T10:46:23Z

If PyUCS4_ToLower() becomes 16% slower, str.lower() that calls it in a tight loop should be 16% slower for some data.

I'm not sure about this logic. We are talking about nanoseconds. Things get more complicated when the difference is smaller than 1 nanosecond.

serhiy-storchaka · 2025-08-13T11:09:01Z

This scales with the length of the string.

vstinner · 2025-08-13T11:13:29Z

Well, I trust the benchmark numbers :) You can easily run the str.lower() benchmark yourself if you don't trust the numbers :-)

vstinner · 2025-09-25T16:03:36Z

I wrote #139333 which is based on this PR but changes the API to:

Py_ssize_t PyUCS4_ToLower(const Py_UCS4 *str, Py_ssize_t str_size, Py_UCS4 *buffer, Py_ssize_t buf_size)

gh-76535: Make PyUnicode_ToLowerFull and friends public

431abba

Make `PyUnicode_ToLowerFull`, `PyUnicode_ToUpperFull` and `PyUnicode_ToTitleFull` public and rename them to `PyUnicode_ToLower` etc.

bedevere-app bot added the awaiting core review label Jul 1, 2025

bedevere-app bot mentioned this pull request Jul 1, 2025

Unclear intention of deprecating Py_UNICODE_TOLOWER / Py_UNICODE_TOUPPER #76535

Open

lysnikolaou mentioned this pull request Jul 1, 2025

gh-76535: Add C API functions for changing case of a single codepoint #117117

Closed

vstinner reviewed Jul 1, 2025

Doc/c-api/unicode.rst (outdated, resolved)

vstinner reviewed Jul 1, 2025

Doc/c-api/unicode.rst (outdated, resolved)

Address feedback; add size parameter and do PyUnicode_ToFolded as well

d604fc8

📜🤖 Added by blurb_it.

fbbf841

serhiy-storchaka reviewed Jul 1, 2025

vstinner reviewed Jul 1, 2025

Objects/unicodeobject.c (outdated, resolved)

Doc/c-api/unicode.rst (outdated, resolved)

Address more feedback; assert return value and raise ValueError

f17aa0c

vstinner reviewed Jul 1, 2025

Doc/c-api/unicode.rst (resolved)

Add tests

4a70489

Document the maximum numbers of characters needed in the buffer

61afd9a

vstinner reviewed Jul 2, 2025

Modules/_testcapi/unicode.c (resolved)

Modules/_testcapi/unicode.c (outdated, resolved)

Lib/test/test_capi/test_unicode.py (resolved)

Lib/test/test_capi/test_unicode.py (resolved)

vstinner reviewed Jul 2, 2025

Doc/c-api/unicode.rst (outdated, resolved)

Address feedback; test more characters and refactor _testcapi functions

7885b17

vstinner reviewed Jul 2, 2025

lysnikolaou and others added 3 commits July 2, 2025 14:14

Address more review comments

6f9cb95

Disallow passing NULL

6a974c4

Only return NULL when chars < 0 in C test functions

ae033ff

Co-authored-by: Victor Stinner <vstinner@python.org>

vstinner approved these changes Jul 2, 2025

Objects/unicodectype.c (outdated, resolved)

Objects/unicodectype.c (outdated, resolved)

Doc/c-api/unicode.rst (outdated, resolved)

bedevere-app bot added awaiting merge and removed awaiting core review labels Jul 2, 2025

Use Py_ssize_t and don't check overflow in loop

e7ef477

Use Py_ssize_t for return value variable in unicodeobject.c

fff25db

lysnikolaou mentioned this pull request Jul 21, 2025

Expose PyUnicode_ToLower and friends capi-workgroup/decisions#71

Open

encukou reviewed Jul 22, 2025

Lib/test/test_capi/test_unicode.py (resolved)

lysnikolaou added 2 commits July 27, 2025 20:34

Merge branch 'main' into pyunicode-tolower-public

dd85fd4

Address feedback; Rename to PyUCS4_*, define macro and test small buffer case

f378cea

encukou reviewed Jul 29, 2025

Doc/c-api/unicode.rst (resolved)

Doc/c-api/unicode.rst (outdated, resolved)

Address feedback

1caaa85

Merge branch 'main' into pyunicode-tolower-public

77eebaf
