Skip to content

Add gzip support to fwrite#3278

Closed
philippechataignon wants to merge 9 commits intoRdatatable:masterfrom
philippechataignon:fwrite_gzip
Closed

Add gzip support to fwrite#3278
philippechataignon wants to merge 9 commits intoRdatatable:masterfrom
philippechataignon:fwrite_gzip

Conversation

@philippechataignon
Copy link
Copy Markdown
Contributor

@philippechataignon philippechataignon commented Jan 12, 2019

This is a first attempt to implement gzipped csv output in fwrite (issue #2016).

It uses zlib library and replaces open/write/close used for csv file by gzopen/gzwrite/gzclose for gzipped csv. A second buffer, which size is around 10% bigger than main buffer, is allocated for each thread. zlib is thread-safe and the gzip compression use all the available threads.

Option compress="gzip" is added to fwrite and is automatically set when file ends with .gz. Default is compress = "none".

On my system (Debian Linux), r-base-dev install zlib1g-dev, which is needed to compile. I don't need to add a -lz for gcc to compile but that may be not true for others systems and on Windows or Mac. I guess that file cc.R must be modified to indicate the zlib dependence but it's too hard for me.

I've added 2 tests but it seems to be difficult to test a binary output like a gzipped csv. Test uses command zcat and runs only on unix platforms.

Please feel free to test.

Philippe Chataignon added 3 commits January 12, 2019 21:56
Use zlib and gzopen/gzwrite/gzclose function to write buffer directly in
a gzipped csv file.

zlib is thread-safe and the gzip compression use the fwrite threads.

Option compress="gzip" is added to fwrite et is automatically set when file ends
with ".gz"
@codecov
Copy link
Copy Markdown

codecov bot commented Jan 12, 2019

Codecov Report

Merging #3278 into master will decrease coverage by 0.13%.
The diff coverage is 47.82%.

Impacted file tree graph

@@            Coverage Diff             @@
##           master    #3278      +/-   ##
==========================================
- Coverage   94.81%   94.68%   -0.14%     
==========================================
  Files          65       65              
  Lines       12094    12125      +31     
==========================================
+ Hits        11467    11480      +13     
- Misses        627      645      +18
Impacted Files Coverage Δ
R/fwrite.R 96.55% <100%> (+0.18%) ⬆️
src/fwriteR.c 94.95% <100%> (+0.04%) ⬆️
src/fwrite.c 87.78% <41.46%> (-3.23%) ⬇️

Continue to review full report at Codecov.

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update 7d2acb5...041bb4a. Read the comment docs.

@codecov
Copy link
Copy Markdown

codecov bot commented Jan 12, 2019

Codecov Report

Merging #3278 into master will decrease coverage by 0.08%.
The diff coverage is 60.86%.

Impacted file tree graph

@@            Coverage Diff             @@
##           master    #3278      +/-   ##
==========================================
- Coverage   94.81%   94.72%   -0.09%     
==========================================
  Files          65       65              
  Lines       12094    12125      +31     
==========================================
+ Hits        11467    11486      +19     
- Misses        627      639      +12
Impacted Files Coverage Δ
R/fwrite.R 96.55% <100%> (+0.18%) ⬆️
src/fwriteR.c 94.95% <100%> (+0.04%) ⬆️
src/fwrite.c 89.02% <56.09%> (-1.99%) ⬇️

Continue to review full report at Codecov.

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update 7d2acb5...67a9c36. Read the comment docs.

Copy link
Copy Markdown
Member

@jangorecki jangorecki left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Very nice contribution. Reviewed mostly R code and UI. Also would be better to avoid formatting code like removing empty newlines.

Comment thread R/fwrite.R Outdated
length(nThread)==1L && !is.na(nThread) && nThread>=1L
)

is_gzip <- compress == "gzip" || grepl("\\.gz$", file)
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nice to recognise filename but we could warn if user explicitly set compress to none and uses gz filename

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

In commit 75af89e, I propose this approch :

In fwrite, compress has now 3 options :

  • default : gzip if file ends with .gz, else csv
  • none : force csv
  • gzip : force gzip

Copy link
Copy Markdown
Member

@jangorecki jangorecki Jan 13, 2019

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Might be better "auto" instead of "default".
Anyway new argument is not required as we might do:

is_gzip <- compress == "gzip" || (missing(compress) && grepl("\\.gz$", file))

Comment thread inst/tests/tests.Rraw Outdated
Comment thread inst/tests/tests.Rraw Outdated
Comment thread man/fwrite.Rd Outdated
Philippe Chataignon added 4 commits January 13, 2019 10:19
Comment thread inst/tests/tests.Rraw
if (.Platform$OS.type=="unix") {
f <- tempfile()
fwrite(data.table(a=c(1:3), b=c(1:3)), file=f, compress="gzip")
test(1658.38, system(paste("zcat", f), intern=T), output='[1] "a,b" "1,1" "2,2" "3,3"')
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

there is manual check that will make those line (and next test too) fail, related to T instead of TRUE, see

# No T or F symbols in tests.Rraw. 24 valid F (quoted, column name or in data) and 1 valid T at the time of writing

@philippechataignon
Copy link
Copy Markdown
Contributor Author

philippechataignon commented Jan 14, 2019

Thanks for reviews. I will close this PR for now to improve this enhancement. I saw 2 problems :

  • actually append is not implemented for gzip : I don't know if it has a sense to append a gzip.
  • the approach with gzwrite create a bootleneck : only one thread compress. Gzip in thread would be a better solution but it requires to modify the actual buffer gestion.

@st-pasha
Copy link
Copy Markdown
Contributor

There is a program called pigz which implements parallel writing of gzip archives. It works on top of zlib: each thread creates compresses its own block of data, and then sends ZLIB_SYNC_FLUSH command, which allows compressed blocks be concatenated. Same strategy can be used here.

@philippechataignon philippechataignon mentioned this pull request Jan 16, 2019
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants