Skip to content

fwrite/fread UTF8 and native in file names#3141

Merged
mattdowle merged 7 commits intoRdatatable:masterfrom
dpprdan:fix_#3078
Dec 6, 2018
Merged

fwrite/fread UTF8 and native in file names#3141
mattdowle merged 7 commits intoRdatatable:masterfrom
dpprdan:fix_#3078

Conversation

@dpprdan
Copy link
Copy Markdown
Contributor

@dpprdan dpprdan commented Nov 11, 2018

Closes #3078. See the original as well as this post for context.

I still think this ought to be fixed on the C side to have a UTF-8 only solution, but this is a quick fix.

library("data.table")
setwd(tempdir())
DF = data.frame(A=1:3, B=c("foo","A,Name","baz"))
fwrite(DF, "töst.csv")
list.files(pattern = "\\.csv")
#> [1] "töst.csv"
dir.create("ä")
fpath <- "ä/test.csv"
data.table::fwrite(DF, fpath)
list.files(path = "ä", pattern = "\\.csv")
#> [1] "test.csv"

It also fixes the problem with UTF-8 encoded paths in fread

Encoding(fpath)
#> [1] "latin1"
fread(fpath)
#>    A      B
#> 1: 1    foo
#> 2: 2 A,Name
#> 3: 3    baz
fread(enc2utf8(fpath))
#>    A      B
#> 1: 1    foo
#> 2: 2 A,Name
#> 3: 3    baz

@codecov
Copy link
Copy Markdown

codecov bot commented Nov 11, 2018

Codecov Report

Merging #3141 into master will increase coverage by <.01%.
The diff coverage is 100%.

Impacted file tree graph

@@            Coverage Diff             @@
##           master    #3141      +/-   ##
==========================================
+ Coverage   92.34%   92.34%   +<.01%     
==========================================
  Files          61       61              
  Lines       11590    11592       +2     
==========================================
+ Hits        10703    10705       +2     
  Misses        887      887
Impacted Files Coverage Δ
R/fread.R 98.57% <100%> (+0.01%) ⬆️
R/fwrite.R 96.36% <100%> (+0.06%) ⬆️

Continue to review full report at Codecov.

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update 0fc6dde...6971710. Read the comment docs.

@jangorecki
Copy link
Copy Markdown
Member

jangorecki commented Nov 12, 2018

Thank you for this PR. It is generally good practice to include unit test so we can be sure there will be no regression in future. Although not sure there won't be issues with different encoding in tests when submitting to CRAN, @mattdowle? additionally NEWS entry is useful to mention that, as others might experience the same issue and NEWS is the place where they may look up if that has been fixed in more recent releases. All is well described in Contributing wiki page.

@dpprdan
Copy link
Copy Markdown
Contributor Author

dpprdan commented Nov 14, 2018

Not sure about the encoding in tests either, but these are the tests I came up with. They test file paths in native and utf8 encoding. Not all of them fail without my patch, but the point is that fread/fwrite should not fail with either of them (or the expected encoding should be made explicit in the help file). These tests probably only work reliably on Windows and a cp1252 locale, possibly others as well, definitely not with the C locale (see below).

First the tests without my patch.

Sys.setlocale(locale = "English")
#> [1] "LC_COLLATE=English_United States.1252;LC_CTYPE=English_United States.1252;LC_MONETARY=English_United States.1252;LC_NUMERIC=C;LC_TIME=English_United States.1252"

test path with non-ASCII char in native and utf8 encoding with fread.

library(data.table)
if (.Platform$OS.type=="windows") {
  f = tempfile("ö"); cat("3.14", file = f); fn = enc2native(f); f8 = enc2utf8(f);
  data.table:::test(1.1, fread(fn), data.table(V1=3.14));
  data.table:::test(1.2, fread(f8), data.table(V1=3.14));
  unlink(c(fn, f8))
}
#> Test 1.2 produced 1 errors but expected 0
#> Expected: 
#> Observed: File not found: C:\Users\Daniel\AppData\Local\Temp\RtmpYvzCw1\ö178c6a04798c

test path with non-ASCII char in native and utf8 encoding with fwrite.

if (.Platform$OS.type=="windows") {
  DT = data.table("a"); pth = tempdir(); 
  f = "ö.csv"; fp = file.path(pth, f);
  fpn = enc2native(fp); fp8 = enc2utf8(fp);
  fwrite(DT, fpn);
  data.table:::test(2.1, list.files(path = pth, pattern = "\\.csv$"), f);
  unlink(c(f, "ö.csv"));
  fwrite(DT, fp8);
  data.table:::test(2.2, list.files(path = pth, pattern = "\\.csv$"), f);
  unlink(c(f, "ö.csv"));
  p = file.path(pth, "ü"); dir.create(p); f = tempfile(tmpdir = p);
  fwrite(DT, enc2native(f));
  data.table:::test(2.3, basename(list.files(path = p)), basename(f));
  unlink(f);
  fwrite(DT, enc2utf8(f))
  data.table:::test(2.4, basename(list.files(path = p)), basename(f));
  unlink(p, recursive = TRUE)
}
#> Test 2.1 ran without errors but failed check that x equals y:
#> > x = list.files(path = pth, pattern = "\\.csv$") 
#> First 6 of 1 (type 'character'): [1] "ö.csv"
#> > y = f 
#> First 6 of 1 (type 'character'): [1] "ö.csv"
#> 1 string mismatch
#> Test 2.2 ran without errors but failed check that x equals y:
#> > x = list.files(path = pth, pattern = "\\.csv$") 
#> First 6 of 1 (type 'character'): [1] "ö.csv"
#> > y = f 
#> First 6 of 1 (type 'character'): [1] "ö.csv"
#> 1 string mismatch
#> Error in fwrite(DT, enc2native(f)): No such file or directory: 'C:\Users\Daniel\AppData\Local\Temp\RtmpYvzCw1/ü\file178c151f1d3'. Unable to create new file for writing (it does not exist already). Do you have permission to write here, is there space on the disk and does the path exist?

Now the same tests with my patch applied.

library(data.table)
if (.Platform$OS.type=="windows") {
  f = tempfile("ö"); cat("3.14", file = f); fn = enc2native(f); f8 = enc2utf8(f);
  data.table:::test(1.1, fread(fn), data.table(V1=3.14));
  data.table:::test(1.2, fread(f8), data.table(V1=3.14));
  unlink(c(fn, f8))
}
if (.Platform$OS.type=="windows") {
  DT = data.table("a"); pth = tempdir();
  f = "ö.csv"; fp = file.path(pth, f);
  fpn = enc2native(fp); fp8 = enc2utf8(fp);
  fwrite(DT, fpn);
  data.table:::test(2.1, list.files(path = pth, pattern = "\\.csv$"), f);
  unlink(c(f, "ö.csv"));
  fwrite(DT, fp8);
  data.table:::test(2.2, list.files(path = pth, pattern = "\\.csv$"), f);
  unlink(c(f, "ö.csv"));
  p = file.path(pth, "ü"); dir.create(p); f = tempfile(tmpdir = p);
  fwrite(DT, enc2native(f));
  data.table:::test(2.3, basename(list.files(path = p)), basename(f));
  unlink(f);
  fwrite(DT, enc2utf8(f))
  data.table:::test(2.4, basename(list.files(path = p)), basename(f));
  unlink(p, recursive = TRUE)
}

Like I said, the tests do not work with a C locale, which enforces ASCII.

Sys.setlocale(locale = "C")
#> [1] "C"
print("ö")
#> [1] "v"

Would you like me to commit these tests (with proper test numbers, of course)?

@mattdowle
Copy link
Copy Markdown
Member

@dpprdan Thanks for this. Yes if you could add the tests and the NEWS item please. We can't place the actual non-ascii letters in tests.Rraw like that, but we can use \Uxxxx. See for example test 1864.

@dpprdan
Copy link
Copy Markdown
Contributor Author

dpprdan commented Dec 2, 2018

I added the tests and the NEWS item.

FWIW I got errors during R CMD check, but these are all not for my tests.

   Running the tests in 'tests/main.R' failed.
   Last 13 lines of output:
      1: 1438 34.65   738
      2: 1648 30.85    91
      3: 1652 29.15    91
      4: 1650 27.06    91
      5: 1644 18.53    91
      6: 1642 18.45    91
      7: 1646 17.91    91
      8: 1835 13.68     1
      9: 1912 11.97     2
     10: 1223 10.77   728
     
     Error in eval(exprs[i], envir) : 
       9 errors out of 7935 in 00:06:36 elapsed (00:02:19 cpu) on Sun Dec 02 20:25:10 2018. [endian==little, sizeof(long double)==16, sizeof(pointer)==8, TZ=Europe/Berlin, locale='LC_COLLATE=C;LC_CTYPE=German_Germany.1252;LC_MONETARY=German_Germany.1252;LC_NUMERIC=C;LC_TIME=German_Germany.1252', l10n_info()='MBCS=FALSE; UTF-8=FALSE; Latin-1=TRUE; codepage=1252']. Search inst/tests/tests.Rraw for test numbers: 1445, 1585.1, 1585.3, 1585.5, 1585.6, 1585.7, 1621.1, 1621.2, 1869.7.
     Calls: test.data.table -> sys.source -> eval -> eval
     Execution halted

Also a NOTE:

> checking Rd line widths ... NOTE
  Rd file 'setattr.Rd':
    \examples lines wider than 100 characters:
       setnames(DT, old, new, skip_absent = TRUE) # skips `old[3]` because `"c"` is not a column name of `DT`
  
  These lines will be truncated in the PDF manual.

The log also says Running test id 1961.4 towards the end, so I guess my tests were not skipped.

@jangorecki
Copy link
Copy Markdown
Member

jangorecki commented Dec 3, 2018

I see this NOTE in https://gitlab.com/Rdatatable/data.table/-/jobs/128243787
I noted down to fail CRAN CI build when there are any NOTEs #3147

@mattdowle mattdowle added this to the 1.12.0 milestone Dec 6, 2018
@mattdowle mattdowle changed the title workaround for fwrite/fread file path encoding issue #3078 fwrite/fread UTF8 and native in file names Dec 6, 2018
@mattdowle
Copy link
Copy Markdown
Member

Thanks @dpprdan. I invited you to be project member. Welcome! Among other things you can now create branches directly in the main project so we can more easily commit to each other's branches.

@mattdowle mattdowle merged commit da98fb2 into Rdatatable:master Dec 6, 2018
@dpprdan
Copy link
Copy Markdown
Contributor Author

dpprdan commented Dec 6, 2018

Thanks!

@dpprdan dpprdan deleted the fix_#3078 branch December 6, 2018 13:16
@tdhock
Copy link
Copy Markdown
Member

tdhock commented Oct 2, 2020

I found this issue by searching for test failures 1445, 1585.1, 1585.3, 1585.5, 1585.6, 1585.7 which was happening on windows because of inst/tests/* files such as doublequote_newline.csv which had windows line endings \r\n instead of the linux line endings \n which is expected in the test. This was solved by doing a fresh git clone of data.table, which resulted in new inst/tests/* files with the expected \n, see also https://docs.github.com/en/free-pro-team@latest/github/using-git/configuring-git-to-handle-line-endings and https://git-scm.com/docs/gitattributes

@jangorecki
Copy link
Copy Markdown
Member

Note that we recently added gitattributes to address windows line ending issue (for GLCI windows public runner). At the time of this PR we might not have yet gitattributes file.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

fwrite encoding problem with file names

4 participants