gh-76535: Make PyUnicode_ToLowerFull and friends public#136176
gh-76535: Make PyUnicode_ToLowerFull and friends public#136176lysnikolaou wants to merge 16 commits intopython:mainfrom
PyUnicode_ToLowerFull and friends public#136176Conversation
Make `PyUnicode_ToLowerFull`, `PyUnicode_ToUpperFull` and `PyUnicode_ToTitleFull` public and rename them to `PyUnicode_ToLower` etc.
|
Thanks for taking a look @vstinner! Feedback addressed. |
serhiy-storchaka
left a comment
There was a problem hiding this comment.
In #76535 (comment) @vstinner suggested to provide a constant which is the minimum buffer size.
If this is indeed a hard constant which will never be changed in future Unicode standards, then I prefer this way. It is too expensive to allocate the output buffer dynamically.
cc @ezio-melotti, our Unicode expert.
|
Another question I have is whether we want to expose something like the following to handle the Greek letter sigma edge case: int PyUnicode_ToLowerHandleSigma(Py_UCS4 *str, Py_UCS4 ch, Py_UCS4 *buffer, int size) |
vstinner
left a comment
There was a problem hiding this comment.
Thanks, I prefer this API which is more future-proof, it doesn't depend on a specific Unicode version.
Would you mind to elaborate? I'm not aware of this special case. |
Even if it's a constant which will never (!) change, IMO it's better to request a size as an argument to make the caller responsible to check the buffer size. APIs which accept a pointer with no size are a bad pattern, like the deprecated gets() function. |
|
Further feedback addressed.
There's one special case, the Greek letter sigma, where the result of lower casing is context-specific. More specifically, |
vstinner
left a comment
There was a problem hiding this comment.
Can you try to add tests to Modules/_testcapi/unicode.c and Lib/test/test_capi/test_unicode.py?
Oh, that's a tricky case. Proposed API takes a single character, so we don't know if Σ is at the end of a word or not. I don't think that it's worth it to handle this special case in proposed API. |
Done. |
|
If we add too many parameters and runtime checks, this will make the API slower and more difficult to use. |
Co-authored-by: Victor Stinner <vstinner@python.org>
vstinner
left a comment
There was a problem hiding this comment.
LGTM, I just have some minor comments.
@serhiy-storchaka: Would you be ok with this approach? An API with a size parameter.
|
All feedback addressed! Thanks for all the help and patience @vstinner! |
|
Friendly ping. |
|
I ran a benchmark on:
I used ASCII letters for my benchmark: letters A to I (10 letters). IMO the difference is too low (0.3 ns, 1.16x faster) to justify removing the size parameter. DetailsPatch for Modules/_testcapimodule.c: diff --git a/Modules/_testcapimodule.c b/Modules/_testcapimodule.c
index d0c0b45c20c..c65e3db137b 100644
--- a/Modules/_testcapimodule.c
+++ b/Modules/_testcapimodule.c
@@ -2555,6 +2555,48 @@ toggle_reftrace_printer(PyObject *ob, PyObject *arg)
Py_RETURN_NONE;
}
+
+static PyObject *
+bench(PyObject *self, PyObject *args)
+{
+ Py_ssize_t loops;
+ if (!PyArg_ParseTuple(args, "n", &loops)) {
+ return NULL;
+ }
+ Py_UCS4 buffer[10];
+
+ PyTime_t start;
+ (void)PyTime_PerfCounterRaw(&start);
+ for (Py_ssize_t i=0; i < loops; i++) {
+ Py_ssize_t res;
+ res = PyUCS4_ToLower('A', buffer, 3);
+ if (res < 0) return NULL;
+ res = PyUCS4_ToLower('B', buffer, 3);
+ if (res < 0) return NULL;
+ res = PyUCS4_ToLower('C', buffer, 3);
+ if (res < 0) return NULL;
+ res = PyUCS4_ToLower('D', buffer, 3);
+ if (res < 0) return NULL;
+ res = PyUCS4_ToLower('E', buffer, 3);
+ if (res < 0) return NULL;
+ res = PyUCS4_ToLower('F', buffer, 3);
+ if (res < 0) return NULL;
+ res = PyUCS4_ToLower('G', buffer, 3);
+ if (res < 0) return NULL;
+ res = PyUCS4_ToLower('H', buffer, 3);
+ if (res < 0) return NULL;
+ res = PyUCS4_ToLower('I', buffer, 3);
+ if (res < 0) return NULL;
+ res = PyUCS4_ToLower('J', buffer, 3);
+ if (res < 0) return NULL;
+ }
+ PyTime_t end;
+ (void)PyTime_PerfCounterRaw(&end);
+
+ return PyFloat_FromDouble(PyTime_AsSecondsDouble(end - start));
+}
+
+
static PyMethodDef TestMethods[] = {
{"set_errno", set_errno, METH_VARARGS},
{"test_config", test_config, METH_NOARGS},
@@ -2649,6 +2691,7 @@ static PyMethodDef TestMethods[] = {
{"test_atexit", test_atexit, METH_NOARGS},
{"code_offset_to_line", _PyCFunction_CAST(code_offset_to_line), METH_FASTCALL},
{"toggle_reftrace_printer", toggle_reftrace_printer, METH_O},
+ {"bench", bench, METH_VARARGS},
{NULL, NULL} /* sentinel */
};
Benchmark: import pyperf, _testcapi
runner = pyperf.Runner()
runner.bench_time_func('bench', _testcapi.bench, inner_loops=10) |
|
I ran a benchmark on:
I used ASCII letters for my benchmark: letters A to I (10 letters). The difference is not significant (or a little bit slower). |
|
It means that |
I ran a microbenchmark on str.lower: Result: str.lower() becomes faster with this change, not slower. At least, it's not 16% slower. UPDATE: Just to be sure, I ran again the benchmark using |
|
Then you did not test with right data. If |
I'm not sure about this logic. We are talking about nanoseconds. Things get more complicated when the difference is smaller than 1 nanosecond. |
|
This scales with the length of the string. |
|
Well, I trust the benchmark numbers :) You can easily run the str.lower() benchmark if you don't trust numbers :-) |
|
I wrote #139333 which is based on this PR but changes the API to: Py_ssize_t PyUCS4_ToLower(const Py_UCS4 *str, Py_ssize_t str_size, Py_UCS4 *buffer, Py_ssize_t buf_size) |
Make
_PyUnicode_ToLowerFull,_PyUnicode_ToUpperFull,_PyUnicode_ToTitleFulland_PyUnicode_ToFoldedFullpublic and rename them toPyUnicode_ToLoweretc.📚 Documentation preview 📚: https://cpython-previews--136176.org.readthedocs.build/