fwrite/fread UTF8 and native in file names#3141
fwrite/fread UTF8 and native in file names#3141mattdowle merged 7 commits intoRdatatable:masterfrom dpprdan:fix_#3078
Conversation
Codecov Report
@@ Coverage Diff @@
## master #3141 +/- ##
==========================================
+ Coverage 92.34% 92.34% +<.01%
==========================================
Files 61 61
Lines 11590 11592 +2
==========================================
+ Hits 10703 10705 +2
Misses 887 887
Continue to review full report at Codecov.
|
|
Thank you for this PR. It is generally good practice to include unit test so we can be sure there will be no regression in future. Although not sure there won't be issues with different encoding in tests when submitting to CRAN, @mattdowle? additionally NEWS entry is useful to mention that, as others might experience the same issue and NEWS is the place where they may look up if that has been fixed in more recent releases. All is well described in Contributing wiki page. |
|
Not sure about the encoding in tests either, but these are the tests I came up with. They test file paths in native and utf8 encoding. Not all of them fail without my patch, but the point is that First the tests without my patch. Sys.setlocale(locale = "English")
#> [1] "LC_COLLATE=English_United States.1252;LC_CTYPE=English_United States.1252;LC_MONETARY=English_United States.1252;LC_NUMERIC=C;LC_TIME=English_United States.1252"test path with non-ASCII char in native and utf8 encoding with library(data.table)
if (.Platform$OS.type=="windows") {
f = tempfile("ö"); cat("3.14", file = f); fn = enc2native(f); f8 = enc2utf8(f);
data.table:::test(1.1, fread(fn), data.table(V1=3.14));
data.table:::test(1.2, fread(f8), data.table(V1=3.14));
unlink(c(fn, f8))
}
#> Test 1.2 produced 1 errors but expected 0
#> Expected:
#> Observed: File not found: C:\Users\Daniel\AppData\Local\Temp\RtmpYvzCw1\ö178c6a04798ctest path with non-ASCII char in native and utf8 encoding with if (.Platform$OS.type=="windows") {
DT = data.table("a"); pth = tempdir();
f = "ö.csv"; fp = file.path(pth, f);
fpn = enc2native(fp); fp8 = enc2utf8(fp);
fwrite(DT, fpn);
data.table:::test(2.1, list.files(path = pth, pattern = "\\.csv$"), f);
unlink(c(f, "ö.csv"));
fwrite(DT, fp8);
data.table:::test(2.2, list.files(path = pth, pattern = "\\.csv$"), f);
unlink(c(f, "ö.csv"));
p = file.path(pth, "ü"); dir.create(p); f = tempfile(tmpdir = p);
fwrite(DT, enc2native(f));
data.table:::test(2.3, basename(list.files(path = p)), basename(f));
unlink(f);
fwrite(DT, enc2utf8(f))
data.table:::test(2.4, basename(list.files(path = p)), basename(f));
unlink(p, recursive = TRUE)
}
#> Test 2.1 ran without errors but failed check that x equals y:
#> > x = list.files(path = pth, pattern = "\\.csv$")
#> First 6 of 1 (type 'character'): [1] "ö.csv"
#> > y = f
#> First 6 of 1 (type 'character'): [1] "ö.csv"
#> 1 string mismatch
#> Test 2.2 ran without errors but failed check that x equals y:
#> > x = list.files(path = pth, pattern = "\\.csv$")
#> First 6 of 1 (type 'character'): [1] "ö.csv"
#> > y = f
#> First 6 of 1 (type 'character'): [1] "ö.csv"
#> 1 string mismatch
#> Error in fwrite(DT, enc2native(f)): No such file or directory: 'C:\Users\Daniel\AppData\Local\Temp\RtmpYvzCw1/ü\file178c151f1d3'. Unable to create new file for writing (it does not exist already). Do you have permission to write here, is there space on the disk and does the path exist?Now the same tests with my patch applied. library(data.table)
if (.Platform$OS.type=="windows") {
f = tempfile("ö"); cat("3.14", file = f); fn = enc2native(f); f8 = enc2utf8(f);
data.table:::test(1.1, fread(fn), data.table(V1=3.14));
data.table:::test(1.2, fread(f8), data.table(V1=3.14));
unlink(c(fn, f8))
}if (.Platform$OS.type=="windows") {
DT = data.table("a"); pth = tempdir();
f = "ö.csv"; fp = file.path(pth, f);
fpn = enc2native(fp); fp8 = enc2utf8(fp);
fwrite(DT, fpn);
data.table:::test(2.1, list.files(path = pth, pattern = "\\.csv$"), f);
unlink(c(f, "ö.csv"));
fwrite(DT, fp8);
data.table:::test(2.2, list.files(path = pth, pattern = "\\.csv$"), f);
unlink(c(f, "ö.csv"));
p = file.path(pth, "ü"); dir.create(p); f = tempfile(tmpdir = p);
fwrite(DT, enc2native(f));
data.table:::test(2.3, basename(list.files(path = p)), basename(f));
unlink(f);
fwrite(DT, enc2utf8(f))
data.table:::test(2.4, basename(list.files(path = p)), basename(f));
unlink(p, recursive = TRUE)
}Like I said, the tests do not work with a Sys.setlocale(locale = "C")
#> [1] "C"
print("ö")
#> [1] "v"Would you like me to commit these tests (with proper test numbers, of course)? |
|
@dpprdan Thanks for this. Yes if you could add the tests and the NEWS item please. We can't place the actual non-ascii letters in |
|
I added the tests and the NEWS item. FWIW I got errors during Running the tests in 'tests/main.R' failed.
Last 13 lines of output:
1: 1438 34.65 738
2: 1648 30.85 91
3: 1652 29.15 91
4: 1650 27.06 91
5: 1644 18.53 91
6: 1642 18.45 91
7: 1646 17.91 91
8: 1835 13.68 1
9: 1912 11.97 2
10: 1223 10.77 728
Error in eval(exprs[i], envir) :
9 errors out of 7935 in 00:06:36 elapsed (00:02:19 cpu) on Sun Dec 02 20:25:10 2018. [endian==little, sizeof(long double)==16, sizeof(pointer)==8, TZ=Europe/Berlin, locale='LC_COLLATE=C;LC_CTYPE=German_Germany.1252;LC_MONETARY=German_Germany.1252;LC_NUMERIC=C;LC_TIME=German_Germany.1252', l10n_info()='MBCS=FALSE; UTF-8=FALSE; Latin-1=TRUE; codepage=1252']. Search inst/tests/tests.Rraw for test numbers: 1445, 1585.1, 1585.3, 1585.5, 1585.6, 1585.7, 1621.1, 1621.2, 1869.7.
Calls: test.data.table -> sys.source -> eval -> eval
Execution haltedAlso a NOTE: > checking Rd line widths ... NOTE
Rd file 'setattr.Rd':
\examples lines wider than 100 characters:
setnames(DT, old, new, skip_absent = TRUE) # skips `old[3]` because `"c"` is not a column name of `DT`
These lines will be truncated in the PDF manual.The log also says |
|
I see this NOTE in https://gitlab.com/Rdatatable/data.table/-/jobs/128243787 |
|
Thanks @dpprdan. I invited you to be project member. Welcome! Among other things you can now create branches directly in the main project so we can more easily commit to each other's branches. |
|
Thanks! |
|
I found this issue by searching for test failures 1445, 1585.1, 1585.3, 1585.5, 1585.6, 1585.7 which was happening on windows because of inst/tests/* files such as doublequote_newline.csv which had windows line endings \r\n instead of the linux line endings \n which is expected in the test. This was solved by doing a fresh git clone of data.table, which resulted in new inst/tests/* files with the expected \n, see also https://docs.github.com/en/free-pro-team@latest/github/using-git/configuring-git-to-handle-line-endings and https://git-scm.com/docs/gitattributes |
|
Note that we recently added gitattributes to address windows line ending issue (for GLCI windows public runner). At the time of this PR we might not have yet gitattributes file. |
Closes #3078. See the original as well as this post for context.
I still think this ought to be fixed on the C side to have a
UTF-8only solution, but this is a quick fix.It also fixes the problem with UTF-8 encoded paths in
fread