CLN: memory-mapping code #44766

twoertwein · 2021-12-05T05:36:08Z

Rebased on #44761

No need for codecs.getincrementaldecoder, as io.TextIOWrapper will do the incremental decoding for us (and we can use io.TextIOWrapper because mmap is wrapped inside _IOWrapper). io.TextIOWrapper also provides __next__ for us :)
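A minimal sketch of the equivalence described above, using only the standard library. It uses io.BytesIO as a stand-in for the wrapped mmap object (an assumption made for illustration; this is not the pandas code), and shows that io.TextIOWrapper handles a multi-byte character split across chunks the same way a manual incremental decoder does, while also supporting line iteration.

```python
import codecs
import io

# Sample bytes; "é" and "Ä" each take two bytes in UTF-8.
raw = "Léonardo,1\nÄ,2\n".encode("utf-8")

# Manual approach: feed chunks to an incremental decoder.
# The split at index 2 cuts "é" in half; the decoder buffers the partial byte.
decoder = codecs.getincrementaldecoder("utf-8")()
manual = decoder.decode(raw[:2]) + decoder.decode(raw[2:], final=True)

# TextIOWrapper approach: decoding and line iteration come for free.
# newline="" keeps line endings untranslated.
wrapped = io.TextIOWrapper(io.BytesIO(raw), encoding="utf-8", newline="")
first_line = next(wrapped)  # __next__ is provided by TextIOWrapper
remaining = list(wrapped)
```

Here `manual` and `first_line + "".join(remaining)` hold the same decoded text.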

Probably will need some benchmarking with utf-8/non-utf8 files.

pandas/io/common.py

twoertwein · 2021-12-05T05:40:28Z

pandas/tests/io/test_common.py

-            df = tm.makeDataFrame()
-            df.to_csv(path, mode="w+b")
-            tm.assert_frame_equal(df, pd.read_csv(path, index_col=0))
+def test_binary_mode():


This test and test_warning_missing_utf_bom had nothing to do with mmap but were in its test class.

pandas/io/common.py

twoertwein · 2021-12-05T16:24:25Z

pandas/tests/io/parser/test_encoding.py

+
+    # add one entry with a special character
+    encoding_ = encoding or "utf-8"
+    leonardo = "Léonardo".encode(encoding_, errors="ignore").decode(encoding_)


This stricter version of the test would have failed on master with the python engine.
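For reference, a small standalone sketch of what the encode/decode round trip in the added test line does: characters the target encoding cannot represent are dropped by errors="ignore", so the expected string always survives decoding. The helper name `round_trip` is hypothetical and only used for illustration.

```python
def round_trip(text: str, encoding: str) -> str:
    """Drop characters the encoding cannot represent, then decode back."""
    return text.encode(encoding, errors="ignore").decode(encoding)


utf8_name = round_trip("Léonardo", "utf-8")        # "é" is representable
latin1_name = round_trip("Léonardo", "iso8859_1")  # "é" is representable
ascii_name = round_trip("Léonardo", "ascii")       # "é" is silently dropped
```

With UTF-8 or ISO 8859-1 the name comes back unchanged; with ASCII it becomes "Lonardo".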

twoertwein · 2021-12-05T21:04:43Z

pandas/tests/io/parser/test_read_fwf.py

    GH 23254.
    """
    encoding = "iso8859_1"
-    data = BytesIO(" 1 A Ä 2\n".encode(encoding))


This test wasn't actually using memory_map, because the memory mapping silently failed.
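One plausible reason for the silent failure, stated here as an assumption rather than something confirmed in this thread: mmap needs a real file descriptor, and an in-memory BytesIO does not have one. The sketch below uses only the standard library (not pandas) and a temporary file path chosen for illustration.

```python
import io
import mmap
import os
import tempfile

payload = " 1 A Ä 2\n".encode("iso8859_1")

# An in-memory buffer has no file descriptor, so it cannot be memory-mapped.
buffer = io.BytesIO(payload)
try:
    buffer.fileno()
    has_fileno = True
except io.UnsupportedOperation:
    has_fileno = False

# A file on disk does have a descriptor and can be mapped.
with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "data.txt")
    with open(path, "wb") as fh:
        fh.write(payload)
    with open(path, "rb") as fh:
        with mmap.mmap(fh.fileno(), 0, access=mmap.ACCESS_READ) as mapped:
            mapped_text = mapped[:].decode("iso8859_1")
```

Under that assumption, a test that wants to exercise memory_map has to read from a file on disk rather than from a BytesIO.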

twoertwein · 2021-12-05T23:34:03Z

This PR should speed up all read_csv(..., engine="c", encoding="utf-8") calls when a binary handle is used (this should now be faster than using text handles), by applying the same shortcut as done for memory_map in #43787.

Using TextIOWrapper seems to be slower than the previous solution :( Will revert most of the changes in this PR.