Skip to content

Fwrite bom#3580

Merged
mattdowle merged 28 commits intoRdatatable:masterfrom
philippechataignon:fwrite_withbom
May 24, 2019
Merged

Fwrite bom#3580
mattdowle merged 28 commits intoRdatatable:masterfrom
philippechataignon:fwrite_withbom

Conversation

@philippechataignon
Copy link
Copy Markdown
Contributor

@philippechataignon philippechataignon commented May 22, 2019

Closes #3488
Related to #1770 too. It adds a parameter 'with_bom' in fwrite to create a UTF-8 file with BOM. By default, with_bom is FALSE and fwrite creates a UTF-8 file without BOM.
When with_bom is TRUE, BOM sequence (EF BB BF) is added at the beginning of the file.
with_bom is compatible with options yaml = T and compress="gzip"

Philippe Chataignon added 4 commits May 22, 2019 09:52
Add a new with_bom parameter to create a UTF-8 file with BOM. By
default, with_bom is FALSE and fwrite creates a UTF-8 file without BOM.
When with_bom is TRUE, BOM sequence (EF BB BF) is added at the beginning
of the file but only when col.names is TRUE (default).
@philippechataignon philippechataignon changed the title Fwrite withbom Fwrite with_bom May 22, 2019
@MichaelChirico
Copy link
Copy Markdown
Member

Awesome PR! Tackling fwrite's most wanted again :)

For argument name, maybe use_bom or even use_byte_order_mark?

Comment thread R/fwrite.R Outdated
file <- path.expand(file) # "~/foo/bar"
if (append && missing(col.names) && (file=="" || file.exists(file)))
col.names = FALSE # test 1658.16 checks this
if (with_bom && !col.names) stop("with_bom can be TRUE only if col.names is TRUE")
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Maybe just ignore with_bom here (possibly with warning)?

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Shouldn't with_bom && append be a warning/error/ignore case as well?

Copy link
Copy Markdown
Member

@jangorecki jangorecki May 22, 2019

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why BOM couldn't be used without col names? I recall using some excel data without col names.

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Because of the organization of fwrite : rows writing is threaded and every thread has its own buffer. It's difficult to write BOM in this section of fwrite. The header is written with its own buffer : with_bom uses this buffer and so col.names must be TRUE.

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Shouldn't with_bom && append be a warning/error/ignore case as well?

You're right. I propose that when appending, bom is silently set to FALSE, like col.names.
See f877558. So stop is only when creating and col.names = FALSE.

Comment thread R/fwrite.R Outdated
bom[1] <- as.raw(0xEF)
bom[2] <- as.raw(0xBB)
bom[3] <- as.raw(0xBF)
writeBin(bom, file)
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Does this interact properly with append? I guess with_bom && append should be another error or ignore case

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I see, this is nested within if(yaml) so already append case is handled

Copy link
Copy Markdown
Member

@jangorecki jangorecki May 22, 2019

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I would add internal stopifnot(!(bom && append)) # nocov just in case if the code will be moved later on, to ensure what Michael mentioned.

@jangorecki
Copy link
Copy Markdown
Member

jangorecki commented May 22, 2019

My vote for arg name would be just bom or eventually useBom, none of the current args uses underscore so not point to adding new like that, there is camel case already so better to align to this. @philippechataignon don't forget to git pull before amending changes because @MichaelChirico pushed some new commits.

@philippechataignon
Copy link
Copy Markdown
Contributor Author

BOM is often written in uppercase. What do you think of addBOM ?

@MichaelChirico
Copy link
Copy Markdown
Member

BOM is often written in uppercase

so is YAML... bom would work for me

@philippechataignon
Copy link
Copy Markdown
Contributor Author

Let's go for bom 4e28d61

@philippechataignon philippechataignon changed the title Fwrite with_bom Fwrite bom May 22, 2019
Philippe Chataignon added 2 commits May 22, 2019 15:03
In some cases of appending,
`append && missing(col.names) && (file=="" || file.exists(file))`
col.names is set to FALSE. Now bom is silently set to FALSE too.
@MichaelChirico
Copy link
Copy Markdown
Member

Could you clarify the relation to #1770?

@philippechataignon
Copy link
Copy Markdown
Contributor Author

#1770 is not very clear and his title is fwrite UTF8, so it seems to be related. My understanding is that, in Windows world (I use Linux), utf-8 files often need to have a BOM to be recognized as UTF-8. If not, they may be considered as "Latin-1" and there are "encoding" problems. I'm french, so I meet often "Latin-1" files. I'm happy that fread can handle them. But I never need to write Latin-1 files. And I never had encoding problems with utf8 files written with fwrite .

@MichaelChirico
Copy link
Copy Markdown
Member

OK, that's what I was thinking as well. We can leave it open and post in that thread seeing if it solves peoples' problems after merging.

Copy link
Copy Markdown
Member

@jangorecki jangorecki left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

very nice, one minor follow up comment
also CI failures has to be resolved.

@mattdowle
Copy link
Copy Markdown
Member

mattdowle commented May 22, 2019

Hi @philippechataignon. Looks great. I'll try and get it passing now and merge.
You're a project member already: please next time create the branch directly in the main project. Just makes it easier for us to push changes to your branch. I can still push to your branch ok this time but I have to clone separately and my local diff tool comparing branch to master doesn't work unless it's a branch in main project.
(The master of your fork was behind at v1.12.1 so that probably explains problems. I brought your fork's master up to date and then merged the fork's master into the fork's branch. Creating a branch in the main project means we don't need to keep two masters in sync.)

@mattdowle mattdowle added this to the 1.12.4 milestone May 22, 2019
@mattdowle
Copy link
Copy Markdown
Member

mattdowle commented May 22, 2019

Good - passing now. I needed to change yaml tests 2033.06 and 2033.07 to pass. I just replaced with the new values without understanding why or if those are correct to change: @philippechataignon @MichaelChirico please check.
That one line needs covering in fwrite.R, and a news item please.

@MichaelChirico
Copy link
Copy Markdown
Member

I had a glance at the logs yesterday and saw that, seems familiar to an error I was getting on the original PR that was overcome by explicitly setting eol across platforms.

Dunno why this PR affected it. maybe strip.white is needed? will check

@MichaelChirico
Copy link
Copy Markdown
Member

MichaelChirico commented May 23, 2019

Investigating, not sure I'll have time to finish. I notice that with the interaction with yaml seems broken...

I see:

tmp = tempfile()
DT = data.table(a = 1:5, b = c(pi, 1:4), c = letters[1:5])
fwrite(DT, tmp, yaml = TRUE, eol = '\r\n'); length(readLines(tmp))
# [1] 90
fwrite(DT, tmp, yaml = TRUE, eol = '\r\n', bom = TRUE); length(readLines(tmp))
# [2] 30
fwrite(DT, tmp, yaml = TRUE, eol = '\r\n', bom = FALSE); length(readLines(tmp))
# [3] 60

vs normal behavior

fwrite(DT, tmp, eol = '\r\n'); length(readLines(tmp))
# [1] 6
fwrite(DT, tmp, eol = '\r\n', bom = TRUE); length(readLines(tmp))
# [1] 6
fwrite(DT, tmp, eol = '\r\n', bom = FALSE); length(readLines(tmp))
# [1] 6

@MichaelChirico
Copy link
Copy Markdown
Member

MichaelChirico commented May 23, 2019

@philippechataignon i believe the problem is that writeBin always appends to the connection, so it won't work if the file already exists; from ?writeBin:

If the connection is open it is read/written from its current position.

@MichaelChirico
Copy link
Copy Markdown
Member

MichaelChirico commented May 23, 2019

@mattdowle @philippechataignon made some changes -- as noted, writeBin was not obeying append correctly, so instead I used rawToChar to but the BOM in the file with cat.

This solved the 2033.06/2033.07 problem (the indices in the tests were referring to the parts of the file written for earlier tests), and Travis is now passing... but the Windows problem is stubborn & I'm out of ideas for now.

From the fail log:

Test 2033.11 ran without errors but failed check that x equals y:
> x = readBin(f, raw(), 6L) 
First 6 of 6 (type 'raw'): [1] c3 af c2 bb c2 bf
> y = as.raw(c(239, 187, 191, 45, 45, 45)) 
First 6 of 6 (type 'raw'): [1] ef bb bf 2d 2d 2d
6 element mismatches

Perhaps the rawToChar output is different on Windows vs. Linux, or it's locale-dependent?

@MichaelChirico
Copy link
Copy Markdown
Member

Latest commit tried using close(file()) to reset any existing file before proceeding; same thing -- success on linux, failure on windows

@philippechataignon
Copy link
Copy Markdown
Contributor Author

What do you think of removing compatibility between bom and yaml for release 1.12.4 ? This will let more time for analysing. For me, even before this PR, I have eol: |2+ in output. Except yaml header, all fwritewriting is made in C. I think this can be an aim to write yaml header in C and, as side effet, yaml will be compatible with compress=gzip.

@MichaelChirico
Copy link
Copy Markdown
Member

the eol: |2+ part was due to the append bug. eol: |2+ is the expected output when eol = '\n'; this was left in the file for the eol = '\r\n' tests because the file wasn't being overwritten.

I'm not sure 1.12.4 release is imminent so no need to write it off yet. that said, I'd be fine merging this and filing a follow-up to fix the windows bug.

@mattdowle
Copy link
Copy Markdown
Member

mattdowle commented May 23, 2019

I think this can be an aim to write yaml header in C and, as side effet, yaml will be compatible with compress=gzip.

Done. Great idea. It also now writes the bom when no column names too.
Just two test fails left, just on Windows ...

@mattdowle
Copy link
Copy Markdown
Member

That's odd. I don't see why the two errors happen on Windows :

Running test id 1658.5      Test 1658.5 produced 1 warnings but expected 0
Expected: 
Observed: invalid input found on input connection 'C:/Users/appveyor/AppData/Local/Temp/1\Rtmpa2ZQeO\file95075ee329b'
...
Running test id 2033.12      Test 2033.12 produced 1 warnings but expected 0
Expected: 
Observed: invalid input found on input connection 'C:/Users/appveyor/AppData/Local/Temp/1\Rtmpa2ZQeO\file95054067179'

@codecov
Copy link
Copy Markdown

codecov bot commented May 24, 2019

Codecov Report

Merging #3580 into master will increase coverage by <.01%.
The diff coverage is 100%.

Impacted file tree graph

@@            Coverage Diff             @@
##           master    #3580      +/-   ##
==========================================
+ Coverage   97.62%   97.63%   +<.01%     
==========================================
  Files          66       66              
  Lines       12696    12721      +25     
==========================================
+ Hits        12395    12420      +25     
  Misses        301      301
Impacted Files Coverage Δ
src/fwrite.c 97.63% <100%> (+0.04%) ⬆️
R/fwrite.R 98.78% <100%> (+0.03%) ⬆️
src/fwriteR.c 98.03% <100%> (+0.02%) ⬆️
src/rbindlist.c 100% <0%> (ø) ⬆️
src/froll.c 100% <0%> (ø) ⬆️
src/assign.c 100% <0%> (ø) ⬆️
src/frolladaptive.c 100% <0%> (ø) ⬆️
R/data.table.R 97.68% <0%> (ø) ⬆️

Continue to review full report at Codecov.

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update 1520943...14b6e23. Read the comment docs.

@Rdatatable Rdatatable deleted a comment from codecov bot May 24, 2019
@mattdowle mattdowle merged commit df28194 into Rdatatable:master May 24, 2019
@MichaelChirico
Copy link
Copy Markdown
Member

nice Matt!

do I understand right that your debugging for Windows here was to put a debugging tracer line then push & wait for appveyor & repeat?

@mattdowle
Copy link
Copy Markdown
Member

Yep.

@mattdowle
Copy link
Copy Markdown
Member

mattdowle commented May 24, 2019

I'm expecting you and Philippe will likely need to revise. I just wanted to merge now to see if it passed the stronger tests in GL pipelines, and get it into master too so further work (follow up PRs) can be in a branch.

@jangorecki
Copy link
Copy Markdown
Member

jangorecki commented May 24, 2019

gcc-8 -Wall -pedantic

fwrite.c: In function ‘fwriteMain’:
fwrite.c:692:26: warning: overflow in implicit constant conversion [-Woverflow]
     if (args.bom) {*ch++=0xEF; *ch++=0xBB; *ch++=0xBF; }  // 3 appears above (search for "bom")
                          ^~~~
fwrite.c:692:38: warning: overflow in implicit constant conversion [-Woverflow]
     if (args.bom) {*ch++=0xEF; *ch++=0xBB; *ch++=0xBF; }  // 3 appears above (search for "bom")
                                      ^~~~
fwrite.c:692:50: warning: overflow in implicit constant conversion [-Woverflow]
     if (args.bom) {*ch++=0xEF; *ch++=0xBB; *ch++=0xBF; }  // 3 appears above (search for "bom")
                                                  ^~~~

@mattdowle
Copy link
Copy Markdown
Member

Thanks @jangorecki. Now fixed in c21ac65.
We should add -Wall -pedantic to one of the GLCI jobs.

@jangorecki
Copy link
Copy Markdown
Member

@mattdowle we have r-rel-lin using those, and it produces those warnings in 00install.out, but because those are compiler warnings R package check is fine about them. We would need to add extra grep on that file.

@shrektan shrektan mentioned this pull request Oct 28, 2020
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

fwrite: Support writing UTF-8 files with BOM

4 participants