Skip to content

Conversation

@wesm
Copy link
Member

@wesm wesm commented Aug 8, 2019

  • Write Arrow arrays directly to ColumnWriter, to allow internal optimizations and other features, like direct DictionaryArray writes
  • Refactor and streamline implementation for maintainability and readability
  • Move Arrow reader/writer properties to parquet/properties.h
  • Some minor miscellaneous code reorganization to help

Functionally the library is unchanged

@wesm
Copy link
Member Author

wesm commented Aug 8, 2019

cc @hatemhelal @xhochy

Copy link
Contributor

@hatemhelal hatemhelal left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Overall, looks good to me.

struct ArrowWriteContext {
ArrowWriteContext(MemoryPool* memory_pool, ArrowWriterProperties* properties)
: memory_pool(memory_pool), properties(properties) {
this->data_buffer = AllocateBuffer(memory_pool);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I can't tell without compiling but I wondered if this could be done in the member initializer list:

  ArrowWriteContext(MemoryPool* memory_pool, ArrowWriterProperties* properties)
      : memory_pool(memory_pool), properties(properties), 
        data_buffer(AllocateBuffer(memory_pool)),
        def_levels_buffer(AllocateBuffer(memory_pool)) { }

This might fall in the category of a micro (or even nano) optimization.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'll make this change

@wesm
Copy link
Member Author

wesm commented Aug 8, 2019

@romainfrancois @nealrichardson I've noted that when I run R CMD INSTALL . that when the R extension is tested, system library paths are given precedence over the contents of $LD_LIBRARY_PATH. This is very annoying for me because there is an older version of liblz4 in Ubuntu 18.04's system paths that is incompatible with the one that we use in libarrow.so. Do you know any way to fix this? I can also open a JIRA issue

@nealrichardson
Copy link
Member

@wesm can you provide a traceback/console output of what the failure looks like/where it happens?

@wesm
Copy link
Member Author

wesm commented Aug 8, 2019

$ R CMD INSTALL .
* installing to library ‘/home/wesm/R/x86_64-pc-linux-gnu-library/3.6’
* installing *source* package ‘arrow’ ...
** using staged installation
Arrow C++ libraries found via pkg-config
PKG_CFLAGS=-I/home/wesm/local/include -DARROW_R_WITH_ARROW
PKG_LIBS=-L/home/wesm/local/lib -larrow -lparquet
** libs
make: Nothing to be done for 'all'.
installing to /home/wesm/R/x86_64-pc-linux-gnu-library/3.6/00LOCK-r/00new/arrow/libs
** R
** inst
** byte-compile and prepare package for lazy loading
** help
*** installing help indices
** building package indices
** testing if installed package can be loaded from temporary location
Error: package or namespace load failed for ‘arrow’ in dyn.load(file, DLLpath = DLLpath, ...):
 unable to load shared object '/home/wesm/R/x86_64-pc-linux-gnu-library/3.6/00LOCK-r/00new/arrow/libs/arrow.so':
  /home/wesm/local/lib/libarrow.so.100: undefined symbol: LZ4F_resetDecompressionContext
Error: loading failed
Execution halted
ERROR: loading failed
* removing ‘/home/wesm/R/x86_64-pc-linux-gnu-library/3.6/arrow’
* restoring previous ‘/home/wesm/R/x86_64-pc-linux-gnu-library/3.6/arrow’

@wesm
Copy link
Member Author

wesm commented Aug 8, 2019

And of course

$ ldd ~/local/lib/libarrow.so
	linux-vdso.so.1 (0x00007ffd2bcfc000)
	libcrypto.so.1.1 => /home/wesm/cpp-runtime-toolchain/lib/libcrypto.so.1.1 (0x00007fce910f2000)
	libssl.so.1.1 => /home/wesm/cpp-runtime-toolchain/lib/libssl.so.1.1 (0x00007fce92fe4000)
	libglog.so.0 => /home/wesm/cpp-runtime-toolchain/lib/libglog.so.0 (0x00007fce92fab000)
	libdl.so.2 => /lib/x86_64-linux-gnu/libdl.so.2 (0x00007fce90eee000)
	librt.so.1 => /lib/x86_64-linux-gnu/librt.so.1 (0x00007fce90ce6000)
	libbz2.so.1.0 => /home/wesm/cpp-runtime-toolchain/lib/libbz2.so.1.0 (0x00007fce92f73000)
	liblz4.so.1 => /home/wesm/cpp-runtime-toolchain/lib/liblz4.so.1 (0x00007fce92f3b000)
	libsnappy.so.1 => /home/wesm/cpp-runtime-toolchain/lib/libsnappy.so.1 (0x00007fce92f30000)
	libz.so.1 => /home/wesm/cpp-runtime-toolchain/lib/libz.so.1 (0x00007fce92f16000)
	libzstd.so.1 => /home/wesm/cpp-runtime-toolchain/lib/libzstd.so.1 (0x00007fce90a49000)
	libboost_system.so.1.70.0 => /home/wesm/cpp-runtime-toolchain/lib/libboost_system.so.1.70.0 (0x00007fce92f11000)
	libboost_filesystem.so.1.70.0 => /home/wesm/cpp-runtime-toolchain/lib/libboost_filesystem.so.1.70.0 (0x00007fce92eef000)
	libstdc++.so.6 => /home/wesm/cpp-runtime-toolchain/lib/libstdc++.so.6 (0x00007fce908d5000)
	libm.so.6 => /lib/x86_64-linux-gnu/libm.so.6 (0x00007fce90537000)
	libgcc_s.so.1 => /home/wesm/cpp-runtime-toolchain/lib/libgcc_s.so.1 (0x00007fce92edb000)
	libpthread.so.0 => /lib/x86_64-linux-gnu/libpthread.so.0 (0x00007fce90318000)
	libc.so.6 => /lib/x86_64-linux-gnu/libc.so.6 (0x00007fce8ff27000)
	/lib64/ld-linux-x86-64.so.2 (0x00007fce92e4f000)
	libgflags.so.2.2 => /home/wesm/cpp-runtime-toolchain/lib/./libgflags.so.2.2 (0x00007fce92eb5000)

So outside of the context of R CMD INSTALL the correct liblz4.so is resolved

@wesm
Copy link
Member Author

wesm commented Aug 8, 2019

+1, will merge this PR once Travis CI runs...

@nealrichardson
Copy link
Member

One thing you can check out is what is in $(R RHOME)/etc/ldpaths. I saw that referenced in The Bible. Maybe you can amend that and it will pick up the right path? Or make it pay attention to the LD_LIBRARY_PATH you set? Totally guessing though, this is not my area of expertise.

@wesm
Copy link
Member Author

wesm commented Aug 8, 2019

Ugh

$ cat /etc/R/ldpaths 
: ${JAVA_HOME=/usr/lib/jvm/default-java}
: ${R_JAVA_LD_LIBRARY_PATH=${JAVA_HOME}/lib/server}
if test -n "/usr/lib/x86_64-linux-gnu"; then
: ${R_LD_LIBRARY_PATH=${R_HOME}/lib:/usr/lib/x86_64-linux-gnu}
else
: ${R_LD_LIBRARY_PATH=${R_HOME}/lib}
fi
if test -n "${R_JAVA_LD_LIBRARY_PATH}"; then
  R_LD_LIBRARY_PATH="${R_LD_LIBRARY_PATH}:${R_JAVA_LD_LIBRARY_PATH}"
fi
## This is DYLD_FALLBACK_LIBRARY_PATH on Darwin (macOS) and
## LD_LIBRARY_PATH elsewhere.
## However, on macOS >=10.11 (if SIP is enabled, the default), the
## environment value will not be passed to a script such as R.sh, so
## would not seen here.
if test -z "${LD_LIBRARY_PATH}"; then
  LD_LIBRARY_PATH="${R_LD_LIBRARY_PATH}"
else
  LD_LIBRARY_PATH="${R_LD_LIBRARY_PATH}:${LD_LIBRARY_PATH}"
fi
export LD_LIBRARY_PATH

@wesm
Copy link
Member Author

wesm commented Aug 8, 2019

IMHO this is a bug in R. If you have $LD_LIBRARY_PATH set it should respect that

@nealrichardson
Copy link
Member

OIC, it puts its path before the one you provide. That's odd. If you reverse the order of the paths in that next to last line, does it work as expected?

@wesm
Copy link
Member Author

wesm commented Aug 8, 2019

This works fine

$ R_LD_LIBRARY_PATH=$LD_LIBRARY_PATH R CMD INSTALL .
* installing to library ‘/home/wesm/R/x86_64-pc-linux-gnu-library/3.6’
* installing *source* package ‘arrow’ ...
** using staged installation
Arrow C++ libraries found via pkg-config
PKG_CFLAGS=-I/home/wesm/local/include -DARROW_R_WITH_ARROW
PKG_LIBS=-L/home/wesm/local/lib -larrow -lparquet
** libs
g++ -std=gnu++11 -I"/usr/share/R/include" -DNDEBUG -I/home/wesm/local/include -DARROW_R_WITH_ARROW -I"/home/wesm/R/x86_64-pc-linux-gnu-library/3.6/Rcpp/include"  -fvisibility=hidden -fpic  -g -O2 -fdebug-prefix-map=/build/r-base-uuRxut/r-base-3.6.1=. -fstack-protector-strong -Wformat -Werror=format-security -Wdate-time -D_FORTIFY_SOURCE=2 -g  -c arrowExports.cpp -o arrowExports.o
g++ -std=gnu++11 -I"/usr/share/R/include" -DNDEBUG -I/home/wesm/local/include -DARROW_R_WITH_ARROW -I"/home/wesm/R/x86_64-pc-linux-gnu-library/3.6/Rcpp/include"  -fvisibility=hidden -fpic  -g -O2 -fdebug-prefix-map=/build/r-base-uuRxut/r-base-3.6.1=. -fstack-protector-strong -Wformat -Werror=format-security -Wdate-time -D_FORTIFY_SOURCE=2 -g  -c parquet.cpp -o parquet.o
g++ -std=gnu++11 -shared -L/usr/lib/R/lib -Wl,-Bsymbolic-functions -Wl,-z,relro -o arrow.so array.o array__to_vector.o array_from_vector.o arraydata.o arrowExports.o buffer.o chunkedarray.o compression.o compute.o csv.o datatype.o feather.o field.o io.o json.o memorypool.o message.o parquet.o recordbatch.o recordbatchreader.o recordbatchwriter.o schema.o symbols.o table.o threadpool.o -L/home/wesm/local/lib -larrow -lparquet -L/usr/lib/R/lib -lR
installing to /home/wesm/R/x86_64-pc-linux-gnu-library/3.6/00LOCK-r/00new/arrow/libs
** R
** inst
** byte-compile and prepare package for lazy loading
** help
*** installing help indices
** building package indices
** testing if installed package can be loaded from temporary location
** checking absolute paths in shared objects and dynamic libraries
** testing if installed package can be loaded from final location
** testing if installed package keeps a record of temporary installation path
* DONE (arrow)

I'll do that as a workaround. We should document it for developers

@wesm
Copy link
Member Author

wesm commented Aug 8, 2019

I guess I can kind of understand the point of view of wanting to isolate R from being contaminated by the user's environment, including modifications of LD_LIBRARY_PATH. Since if you build a package when LD_LIBRARY_PATH set and then unset it, that package could be broken

@codecov-io
Copy link

Codecov Report

Merging #5036 into master will increase coverage by 1.59%.
The diff coverage is 94.25%.

Impacted file tree graph

@@            Coverage Diff             @@
##           master    #5036      +/-   ##
==========================================
+ Coverage   87.57%   89.17%   +1.59%     
==========================================
  Files        1008      727     -281     
  Lines      143814   102958   -40856     
  Branches     1418        0    -1418     
==========================================
- Hits       125952    91814   -34138     
+ Misses      17500    11144    -6356     
+ Partials      362        0     -362
Impacted Files Coverage Δ
cpp/src/parquet/arrow/reader_internal.cc 91.27% <ø> (ø) ⬆️
cpp/src/parquet/arrow/reader_internal.h 93.02% <ø> (ø) ⬆️
cpp/src/parquet/arrow/reader.h 80% <ø> (-12.31%) ⬇️
cpp/src/parquet/schema.cc 90.01% <ø> (ø) ⬆️
cpp/src/parquet/arrow/writer.h 100% <ø> (ø) ⬆️
cpp/src/parquet/types.h 100% <ø> (ø) ⬆️
cpp/src/parquet/arrow/reader.cc 84.86% <ø> (-0.13%) ⬇️
cpp/src/parquet/column_writer.h 95.65% <100%> (+6.76%) ⬆️
cpp/src/parquet/arrow/arrow-reader-writer-test.cc 93.73% <100%> (ø) ⬆️
cpp/src/parquet/properties.cc 73.33% <100%> (+17.77%) ⬆️
... and 290 more

Continue to review full report at Codecov.

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update d549b7c...b2ddcb3. Read the comment docs.

@wesm wesm closed this in b4c1763 Aug 9, 2019
@wesm wesm deleted the ARROW-6152 branch August 9, 2019 01:23
@nealrichardson
Copy link
Member

If it's a bug, it's been around a while. I traced it back to a commit from the end of the Clinton presidency: wch/r-source@b4f9299

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

4 participants