Closed
1761 commits
3e87f77
ARROW-2131: [Python] Prepend module path to PYTHONPATH when spawning …
wesm Feb 22, 2018
bcbcf02
[JS] Fix typo in npm target for esNext/CommonJS. (#1645)
jheer Feb 22, 2018
5f10067
ARROW-2180: [C++] Remove deprecated APIs from 0.8.0 cycle
wesm Feb 22, 2018
cdc347c
ARROW-2132: Add link to Plasma in main README
wesm Feb 22, 2018
27f7eba
ARROW-2069: [Python] Add note that Plasma is not supported on Windows
wesm Feb 22, 2018
81bfb38
ARROW-2185: Strip CI directives from commit messages
wesm Feb 23, 2018
d52f2ff
[Dev] Follow-up, use angle brackets for commit author instead of squa…
wesm Feb 23, 2018
c2865d0
ARROW-2093: [Python] Do not install PyTorch in Travis CI
wesm Feb 23, 2018
2f01658
ARROW-2201: [Website] Publish JS API Docs
Feb 23, 2018
3e3f7c2
ARROW-2066: [Python] Document using pyarrow with Azure Blob Store
rjrussell77 Feb 23, 2018
cca4a74
ARROW-2197: Document C++ ABI issue and workaround
pitrou Feb 23, 2018
e2dd864
ARROW-2184: [C++] Add static constructor for FileOutputStream return…
xuepanchen Feb 24, 2018
c0b0e33
ARROW-2191: [C++] Only use specific version of jemalloc
xhochy Feb 24, 2018
2fd8f0a
ARROW-2204: Fix TLS errors in manylinux1 build
xhochy Feb 25, 2018
27d8339
ARROW-2214: [JS] add nullBitmap getter to DictionaryData that proxies…
trxcllnt Feb 26, 2018
655eb74
ARROW-2212: [C++/Python] Build Protobuf in base manylinux 1 docker image
xhochy Feb 26, 2018
5521bcf
ARROW-2094: [C++] Install libprotobuf and set PROTOBUF_HOME when usin…
wesm Feb 26, 2018
564fefe
ARROW-2213: [JS] fix npm release
trxcllnt Feb 26, 2018
8c493cd
ARROW-2219: [JS] rename indicies to indices
trxcllnt Feb 26, 2018
e50a8ec
ARROW-2206: [JS] Document Perspective project
lmeyerov Feb 26, 2018
e0328b0
ARROW-2023: [C++] Fix ASAN failure on malformed / empty stream input,…
wesm Feb 27, 2018
c017a63
ARROW-1035: [Python] Add streaming dataframe reconstruction benchmark
pitrou Feb 27, 2018
a5c5ad2
ARROW-2203: [C++] StderrStream class
rvernica Feb 27, 2018
887e893
ARROW-1937: [Python] Document nested array initialization
pitrou Feb 27, 2018
482fc58
ARROW-2210: [C++] Reset ptr on failed memory allocation
xhochy Feb 27, 2018
1a92846
ARROW-2223: [JS] compile src/bin as es5-cjs to all output targets
trxcllnt Feb 27, 2018
d3fabe0
ARROW-2230: [Python] Strip catch-all tag matching from git-describe
xhochy Feb 28, 2018
524b522
ARROW-2218: [Python] PythonFile should infer mode when not given
pitrou Feb 28, 2018
0a672bc
ARROW-2226, ARROW-2233: [JS] Dictionary bugfixes
trxcllnt Feb 28, 2018
1d9b834
ARROW-2225: [JS] support tables split across buffers
trxcllnt Feb 28, 2018
671b53c
ARROW-2046: [Python] Support path-like objects
pitrou Feb 28, 2018
3d5880a
ARROW-2040: [Python] Deserialized Numpy array must keep ref to underl…
pitrou Feb 28, 2018
5321582
ARROW-2231: [CI] Use clcache on AppVeyor for faster builds
pitrou Feb 28, 2018
af2047e
ARROW-2215: [Plasma] Hugetables munmap issue
pcmoritz Mar 1, 2018
8b3bbae
ARROW-2198: [Python] correct docstring for parquet.read_table
wesm Mar 1, 2018
b2eb6ac
ARROW-1632: [Python] Permit categorical conversions in Table.to_panda…
xhochy Mar 1, 2018
bfac60d
ARROW-2145/ARROW-2153/ARROW-2157/ARROW-2160/ARROW-2177: [Python] Deci…
cpcloud Mar 1, 2018
99899d6
ARROW-2232: [Python] pyarrow.Tensor constructor segfaults
cpcloud Mar 2, 2018
29495ce
ARROW-2176: [C++] Extend DictionaryBuilder to support delta dictionaries
Mar 2, 2018
f403804
ARROW-2205: [Python] Option for integer object nulls
Mar 2, 2018
5994094
ARROW-2209: [Python] Partition columns are not correctly loaded in sc…
xhochy Mar 2, 2018
34c33f1
[Python] Document serialization parameter as "string" instead of "bytes"
mitar Mar 4, 2018
8b1c811
ARROW-2245: ARROW-2246: [Python] Revert static linkage of parquet-cpp…
xhochy Mar 4, 2018
6e699d7
ARROW-2252: [Python] Create buffer from address, size and base
xhochy Mar 5, 2018
03db8a3
ARROW-2251: [GLib] Keep GArrowBuffer alive while GArrowTensor for the…
kou Mar 5, 2018
9ceda35
ARROW-2244: [C++] Add unit test to explicitly check that NullArray in…
wesm Mar 5, 2018
b89c124
ARROW-2253: [Python] Support __eq__ on scalar values
xhochy Mar 5, 2018
49f1d00
ARROW-2258: [Python] Add additional information to find Boost on windows
xhochy Mar 5, 2018
55bdae5
ARROW-2254: [Python] Ignore JS tags in local dev versions
xhochy Mar 5, 2018
c6359cb
ARROW-1929: [C++] Copy over testing utility code from PARQUET-1092
wesm Mar 5, 2018
45f5da2
ARROW-1982: [Python] Coerce Parquet statistics as bytes to more usefu…
wesm Mar 5, 2018
01a099c
ARROW-2199: [JAVA] Control the memory allocated for inner vectors in …
siddharthteotia Mar 5, 2018
06e9fb4
[Python] Add missing dependency to development.rst
mitar Mar 6, 2018
57e4dd8
ARROW-2265: [Python] Use CheckExact when serializing lists and numpy …
robertnishihara Mar 6, 2018
51e117d
ARROW-2154: [Python] Implement equality on buffers
pitrou Mar 6, 2018
cde18a6
ARROW-2234: [JS] Read timestamp low bits as Uint32s
trxcllnt Mar 6, 2018
a58bd72
ARROW-2272: [Python] Clean up leftovers in test_plasma.py
pitrou Mar 6, 2018
5f8a793
ARROW-2279: [Python] Better error message if lib cannot be found
mitar Mar 6, 2018
60c8081
ARROW-2261: [GLib] Improve memory management for GArrowBuffer data
kou Mar 6, 2018
9effbed
ARROW-2283: [C++] Support Arrow C++ installed in /usr detection by pk…
kou Mar 7, 2018
fb2316c
ARROW-2238: [C++] Detect and use clcache in cmake configuration
pitrou Mar 7, 2018
c372dfb
ARROW-2280: [Python] Return the offset for the buffers in pyarrow.Array
xhochy Mar 7, 2018
5e945a3
ARROW-2239: [C++] Update Windows build docs
pitrou Mar 8, 2018
f3f91b0
ARROW-2263: [Python] Prepend local pyarrow/ path to PYTHONPATH in tes…
wesm Mar 9, 2018
34b18f7
ARROW-1940: [Python] Extra metadata gets added after multiple convers…
cpcloud Mar 9, 2018
04f4e6b
ARROW-2289: [GLib] Add Numeric, Integer, FloatingPoint data types
kou Mar 9, 2018
f56fdc9
ARROW-2270: [Python] Fix lifetime of ForeignBuffer base object
pitrou Mar 9, 2018
40a0008
[Python] Adding more missing Linux dependencies to developer docs
mitar Mar 9, 2018
23d08b7
ARROW-2150: [Python] Raise NotImplementedError when comparing with py…
wesm Mar 9, 2018
7354a19
ARROW-2284: [Python] Fix error display on test_plasma error
pitrou Mar 9, 2018
8167472
ARROW-2275: [C++] Guard against bad use of Buffer.mutable_data()
pitrou Mar 9, 2018
3511c65
ARROW-2268: Drop usage of md5 checksums for source releases, verifica…
wesm Mar 9, 2018
c7c2393
ARROW-2269: [Python] Make boost namespace selectable in wheels
xhochy Mar 9, 2018
fc9f89a
ARROW-2250: [Python] Do not create a subprocess for plasma but just u…
mitar Mar 9, 2018
d0284cb
ARROW-2236: [JS] Add more complete set of predicates
Mar 9, 2018
412bb91
ARROW-2291: [C++] Add additional libboost-regex-dev to build instruct…
andygrove Mar 9, 2018
907a27d
ARROW-2288: [Python] Fix slicing logic
pitrou Mar 9, 2018
2f718d7
ARROW-2262: [Python] Support slicing on pyarrow.ChunkedArray
xhochy Mar 9, 2018
d64a231
ARROW-2181: [PYTHON][DOC] Add doc on usage of concat_tables
BryanCutler Mar 10, 2018
dc45a1a
ARROW-2099: [Python] Add safe option to DictionaryArray.from_arrays t…
wesm Mar 12, 2018
c7b3c05
ARROW-2297: [JS] babel-jest is not listed as a dev dependency
Mar 12, 2018
8f2ff30
ARROW-2240: [Python] Array initialization with leading numpy nan fail…
cpcloud Mar 12, 2018
3917e85
ARROW-2292: [Python] Rename frombuffer() to py_buffer()
pitrou Mar 12, 2018
58fa873
ARROW-2282: [Python] Create StringArray from buffers
xhochy Mar 12, 2018
317b543
ARROW-2293: [JS] Print release vote e-mail template when making sourc…
Mar 12, 2018
6fc9922
ARROW-2118: [C++] Fix misleading error when memory mapping a zero-len…
wesm Mar 12, 2018
171340f
ARROW-2135: [Python] Fix NaN conversion when casting from Numpy array
pitrou Mar 12, 2018
0b28dc5
ARROW-2142: [Python] Allow conversion from Numpy struct array
pitrou Mar 13, 2018
7c7b09f
ARROW-1643: [Python] Accept hdfs:// prefixes in parquet.read_table an…
Mar 13, 2018
33d1091
ARROW-2227: [Python] Fix off-by-one error in chunked binary conversions
wesm Mar 13, 2018
a430758
ARROW-2306: [Python] Fix partitioned Parquet test against HDFS
wesm Mar 14, 2018
385656c
ARROW-2304: [C++] Fix HDFS MultipleClients unit test
wesm Mar 14, 2018
e25e3ef
ARROW-2307: [Python] Allow reading record batch streams with zero rec…
wesm Mar 15, 2018
98012cb
ARROW-2312: [JS] run test_js before test_integration
trxcllnt Mar 15, 2018
b185951
ARROW-2313: [C++] Add -NDEBUG flag to arrow.pc
kou Mar 15, 2018
630ce5e
ARROW-2311: [Python/C++] Fix struct array slicing
pitrou Mar 15, 2018
019a560
ARROW-2309: [C++] Use std::make_unsigned
pitrou Mar 15, 2018
60749b2
ARROW-2316: [C++] Revert Buffer::mutable_data to inline so that linke…
wesm Mar 15, 2018
20ea781
[Python] Pin Cython to 0.27.3 in verify-release-candidate.sh (#1758)
wesm Mar 15, 2018
e29df7d
ARROW-2320: [C++] Vendored Boost build does not build regex library
cpcloud Mar 16, 2018
79e19c3
[JS] Small fixes to source release workflow and e-mail template (#1750)
wesm Mar 16, 2018
82c8b6f
ARROW-2318: [Plasma] Run plasma store tests with unique socket
pcmoritz Mar 16, 2018
95ba6ef
ARROW-2321: [C++] Release verification script fails with if CMAKE_INS…
cpcloud Mar 16, 2018
7be8d37
[Release] Update CHANGELOG.md for 0.9.0
wesm Mar 16, 2018
c695a5d
[maven-release-plugin] prepare release apache-arrow-0.9.0
wesm Mar 16, 2018
bb17a0d
[maven-release-plugin] prepare for next development iteration
wesm Mar 16, 2018
a50ef9f
ARROW-2329: [Website] 0.9.0 release update
siddharthteotia Mar 20, 2018
60848c0
ARROW-2299: [Go] Import Go arrow implementation from influxdata/arrow
stuartcarnie Mar 21, 2018
607c7fa
ARROW-2340: [Website] Add blog post about Go code donation
wesm Mar 22, 2018
948cb4a
ARROW-2336: [Website] Add 0.9.0 release blog post
wesm Mar 22, 2018
f45abf0
[Website] Add link to press release
wesm Mar 22, 2018
07beb51
ARROW-2333: [Python] Fix bundling boost with default namespace
pitrou Mar 22, 2018
47fcef3
ARROW-2334: [C++] Update boost to 1.66.0
cpcloud Mar 22, 2018
d623567
ARROW-2341: [Python] Improve pa.union() mode argument behaviour
pitrou Mar 22, 2018
eecb1bc
ARROW-2281: [Python] Add Array.from_buffers()
pitrou Mar 22, 2018
f50d858
ARROW-2343: [Java/Packaging] Run mvn clean in API doc builds
cpcloud Mar 22, 2018
29268ec
ARROW-2342: [Python] Allow pickling more types
pitrou Mar 22, 2018
0c8d164
ARROW-2345: [Documentation] Fix bundle exec and set sphinx nosidebar …
cpcloud Mar 23, 2018
e6d8eed
ARROW-2322: [Java] Document dev environment requirements for publishi…
wesm Mar 23, 2018
a0ca9b4
ARROW-2346: [Python] Fix PYARROW_CXX_FLAGS with multiple options
pitrou Mar 23, 2018
777f986
ARROW-2331: [Python] Fix indexing for negative or out-of-bounds indices
pitrou Mar 23, 2018
7b2c797
ARROW-2349: [Python] Opt in to bundling Boost shared libraries separa…
wesm Mar 24, 2018
af6e3ec
ARROW-1913: [Java] Disable Javadoc doclint with Java 8
icexelloss Mar 25, 2018
29f744f
ARROW-2350: Consolidated RUN step in spark_integration Dockerfile
Mar 25, 2018
9c7e06b
ARROW-2348: [GLib] Remove GLib + Go example
kou Mar 25, 2018
6156b1d
ARROW-640: [Python] Implement __hash__ and equality for Array scalar …
Mar 26, 2018
27f5a42
ARROW-2301: [Python] Build source distribution inside the manylinux1 …
xhochy Mar 26, 2018
f9f8320
ARROW-2354: [C++] Make PyDecimal_Check() faster
pitrou Mar 26, 2018
3d4b6c1
ARROW-2356: [JS] Fix JSON Reader FixedSizeBinary Vectors
trxcllnt Mar 27, 2018
f29e5a1
ARROW-2368: [JAVA] Correctly pad negative values in DecimalVector#set…
vkorukanti Mar 29, 2018
866e9b8
ARROW-2327: [JS] Table.fromStruct missing from externs
Mar 29, 2018
97f5ec0
[C++] Fix documentation typo in arrow/array.h
rsabhi Mar 29, 2018
ba0cea3
ARROW-2140: [Python] Improve float16 support
pitrou Mar 29, 2018
3f72d14
ARROW-2361: [Rust] Starting point for a native Rust implementation of…
andygrove Mar 31, 2018
3975de5
Update README.md to include new components
wesm Mar 31, 2018
00b334f
[Rust] Update READMEs to add Rust libraries link and to remove out-of…
andygrove Mar 31, 2018
be049fa
ARROW-2370: [GLib] Fix include path in .pc on Meson build
kou Apr 1, 2018
d2d4cc7
ARROW-2371: [GLib] Update "Requires" in .pc on GNU Autotools build
kou Apr 1, 2018
7e27cf5
ARROW-2376: [Rust] Travis builds the Rust library
andygrove Apr 2, 2018
8fdad18
ARROW-2377: [GLib] Support old GObject Introspection
kou Apr 3, 2018
11b15a5
ARROW-2357: [Python] Add microbenchmark for PandasObjectIsNull()
pitrou Apr 3, 2018
fff992a
ARROW-2122: [Python] Pyarrow fails to serialize dataframe with timest…
Apr 3, 2018
b6e8b4b
ARROW-2381: [Rust] Adds iterator support to Buffer<T>
andygrove Apr 3, 2018
fce183c
ARROW-2378: [Rust] Rustfmt
max-sixty Apr 3, 2018
4c68eca
ARROW-2375: [Rust] Implement Drop for Buffer so memory is released
andygrove Apr 3, 2018
65d2558
ARROW-2351 [C++] StringBuilder::append(vector<string>...) not impleme…
lizhougao Apr 3, 2018
65493a6
ARROW-2014: [Python] Document read_pandas method in pyarrow.parquet
Apr 3, 2018
9fc4d89
DOC: Fix a tiny typo in parquet documentation (#1824)
kjordahl Apr 3, 2018
b0f376a
Fix broken build on master (remove duplicate Drop impl for Buffer) (#…
andygrove Apr 3, 2018
82d4555
ARROW-2141: [Python] Support variable length binary conversion from P…
BryanCutler Apr 3, 2018
933b32b
ARROW-2388: [C++] Use valid_bytes API for StringBuilder::Append
kou Apr 4, 2018
806979b
ARROW-2382: [Rust] Bug fix: List was not using aligned mem
andygrove Apr 4, 2018
7081752
ARROW-2385: [Rust] implement to_json for DataType and Field
andygrove Apr 4, 2018
cf39686
ARROW-2195: [Plasma] Return auto-releasing buffers
pitrou Apr 4, 2018
26bc4ab
ARROW-2308: [Python] Make deserialized numpy arrays 64-byte aligned.
robertnishihara Apr 4, 2018
640fc83
ARROW-2276: [Python] Expose buffer protocol on Tensor
pitrou Apr 4, 2018
76edf43
ARROW-1463: [Java] Cleanup usage of Types.MinorType to MinorType
BryanCutler Apr 4, 2018
486d592
ARROW-2384: [Rust] Additional test & Trait standardization
max-sixty Apr 5, 2018
02b0c72
ARROW-2325: [Python] Update setup.py to use Markdown project description
Apr 5, 2018
045470c
ARROW-2396: [Rust] Unify Rust Errors
max-sixty Apr 5, 2018
9515fe9
ARROW-2380: [Python] Streamline conversions
pitrou Apr 5, 2018
29c376d
ARROW-2398: [Rust] Create Builder<T> for building buffers directly in…
andygrove Apr 6, 2018
83bfb39
ARROW-2404: [C++] Fix "declaration of 'type_id' hides class member" w…
rip-nsk Apr 6, 2018
946517d
ARROW-2405: [C++] <function> is required for std::function
kou Apr 6, 2018
e3f7edc
ARROW-2401 Support filters on Hive partitioned Parquet files
Apr 6, 2018
f9c0701
ARROW-2402: [C++] Avoid spurious copies with FixedSizeBinaryBuilder
pitrou Apr 6, 2018
87284a5
[Site] Add Antoine to committers list (#1853)
pitrou Apr 9, 2018
f88949b
ARROW-2418: [Rust] BUG FIX: reserve memory when building list
andygrove Apr 9, 2018
408aa5a
ARROW-2416: [C++] Support system libprotobuf
kou Apr 9, 2018
b4dafa5
ARROW-2414: Fix a variety of typos.
waywardmonkeys Apr 9, 2018
55c1075
ARROW-2353: [CI] Check correctness of built wheel on AppVeyor
pitrou Apr 9, 2018
b095994
ARROW-2408: [Rust] Remove build warnings
max-sixty Apr 9, 2018
57db8b5
ARROW-2419: [Site] Hard-code timezone
pitrou Apr 9, 2018
7376aab
ARROW-2413: [Rust] Remove useless calls to format!().
waywardmonkeys Apr 9, 2018
ca3dbbb
ARROW-2415: [Rust] Fix clippy ref-match-pats warnings.
waywardmonkeys Apr 9, 2018
abf4ed2
ARROW-2408: [Rust] Ability to get `&mut [T]` from `Buffer<T>`
andygrove Apr 9, 2018
5030e23
ARROW-2420: [Rust] Fix major memory bug and add benches
andygrove Apr 9, 2018
ad39d1f
ARROW-2424: [Rust] Fix build - add missing import
andygrove Apr 9, 2018
1bb7fba
ARROW-2100: [Python] Drop Python 3.4 support
pitrou Apr 9, 2018
f56d765
ARROW-2305: [Python] Bump Cython requirement to 0.27+
pitrou Apr 9, 2018
27417b2
ARROW-2328: [C++] Fixed and unit tested feather writing with slice
Adriandorr Apr 9, 2018
e941af8
ARROW-2391: [C++/Python] Segmentation fault from PyArrow when mapping…
kszucs Apr 9, 2018
33d92a0
ARROW-2434: [Rust] Add windows support
paddyhoran Apr 10, 2018
ca277ae
ARROW-2425: [Rust] BUG FIX: Add u8 mappings for Array::from
andygrove Apr 10, 2018
c5574f4
ARROW-2426: [GLib] Follow python -> python@3 change in Homebrew
kou Apr 10, 2018
6633cc9
ARROW-2433: [Rust] Add Builder.push_slice(&[T])
andygrove Apr 10, 2018
91ec792
ARROW-2411: [C++] Add StringBuilder::Append(const char **values)
kou Apr 10, 2018
265142b
ARROW-2441: [Rust] Builder<T>::slice_mut assertions are too strict
andygrove Apr 10, 2018
42e195b
ARROW-2440: [Rust] Implement ListBuilder<T>
andygrove Apr 10, 2018
1ee7d11
ARROW-2407: [GLib] Add garrow_string_array_builder_append_values()
kou Apr 11, 2018
ed7db7c
ARROW-2097: [CI, Python] Reduce Travis-CI verbosity
pitrou Apr 11, 2018
4009b62
ARROW-2224: [C++] Remove boost-regex dependency
pitrou Apr 11, 2018
6e8ecb5
ARROW-2445: [Rust] Add documentation and make some fields private
andygrove Apr 12, 2018
db03663
ARROW-2182: [Python] Build C++ libraries in benchmarks build step
pitrou Apr 12, 2018
9ad8602
ARROW-2432: [Python] Fix Pandas decimal type conversion with None values
BryanCutler Apr 12, 2018
f177404
ARROW-2369: [Python] Fix reading large Parquet files (> 4 GB)
pitrou Apr 12, 2018
685147c
ARROW-2451: [Python] Handle non-object arrays more efficiently in cus…
robertnishihara Apr 12, 2018
0f87c12
ARROW-2437: [C++] Add ReadMessage without aligned argument.
robertnishihara Apr 12, 2018
c96747b
ARROW-2455: [C++] Initialize the atomic bytes_allocated_ properly
sighingnow Apr 13, 2018
7de1264
ARROW-2387: [Python] Flip test for rescale loss if value < 0
Apr 13, 2018
98d250e
ARROW-2397: [Documentation] Update format documentation to describe t…
robertnishihara Apr 14, 2018
b2167e4
ARROW-2435: [Rust] Add memory pool abstraction.
liurenjie1024 Apr 15, 2018
3eee3e4
ARROW-2101: [Python/C++] Correctly convert numpy arrays of bytes to a…
Apr 16, 2018
2d0fbf1
ARROW-2464: [Python] Use a python_version marker instead of a condition
thedrow Apr 16, 2018
72c7f5d
ARROW-2454: [C++] Allow zero-array chunked arrays
pitrou Apr 16, 2018
66d0ad1
ARROW-2315: [C++/Python] Flatten struct array
pitrou Apr 17, 2018
2876a3f
ARROW-2463: [C++] Update flatbuffers to 1.9.0
xhochy Apr 17, 2018
f1ef708
ARROW-2319: [C++] Add BufferedOutputStream class
pitrou Apr 18, 2018
d7d3196
ARROW-2442: [C++] Disambiguate builder Append() overloads
pitrou Apr 18, 2018
72df18c
ARROW-2465: [Plasma/GPU] Preserve plasma_store rpath
pitrou Apr 18, 2018
4c31b37
ARROW-2147: [Python] Fix type inference of numpy arrays
xhochy Apr 18, 2018
25eff99
ARROW-2468: [Rust] Builder::slice_mut() should take mut self.
waywardmonkeys Apr 19, 2018
d58057b
ARROW-2473: [Rust] List empty slice assertion
andygrove Apr 19, 2018
c2e0d42
ARROW-2423: [Python] Enable DataType, Field and plasma ObjectID equal…
kszucs Apr 19, 2018
18999bb
ARROW-2469: [C++] Make out arguments last in ReadMessage.
robertnishihara Apr 19, 2018
1299931
ARROW-2443: [Python] Allow creation of empty Dictionary indices
xhochy Apr 19, 2018
7eeca3a
ARROW-2458: [Plasma] Use one thread pool per PlasmaClient
pcmoritz Apr 19, 2018
09be7b4
ARROW-2472: [Rust] Remove public attributes from Schema and Field and…
andygrove Apr 20, 2018
46fe09a
ARROW-2471: [Rust] Builder zero capacity fix
andygrove Apr 20, 2018
c19b1f0
ARROW-2481: [Rust] Move all calls to free() into memory.rs
andygrove Apr 21, 2018
249e039
ARROW-1928: [C++] Add BitmapReader/BitmapWriter benchmarks
pitrou Apr 21, 2018
4c71f30
ARROW-2390: [C++/Python] Map Python exceptions to Arrow status codes
pitrou Apr 21, 2018
c9ad33e
ARROW-2457: [GLib] Support large is_valids in builder's append_values()
kou Apr 21, 2018
54df19d
ARROW-1018: [C++] Create FileOutputStream, ReadableFile from file des…
pitrou Apr 21, 2018
2452a46
ARROW-2393: [C++] Moving ARROW_CHECK_OK_[PREPEND] macros from status.…
Apr 21, 2018
3b69c5a
ARROW-2450: [Python] Test for Parquet roundtrip of null lists
pitrou Apr 21, 2018
1ba7d51
ARROW-2222: handle untrusted inputs
crepererum Apr 21, 2018
5381295
ARROW-2314: [C++/Python] Fix union array slicing
pitrou Apr 21, 2018
138717a
ARROW-1858: [Python] Added documentation for pq.write_dataset
dsimmie Apr 21, 2018
a6c9d30
ARROW-2453: [Python] Improve Table column access
Apr 21, 2018
a5ae134
ARROW-1731: [Python] Add columns selector in Table.from_array
Gatisseja Apr 23, 2018
03251e9
ARROW-2427: [C++] Implement ReadAt properly
pitrou Apr 23, 2018
77a5c59
ARROW-2494: [C++] Return status codes from PlasmaClient::Seal instead…
kszucs Apr 23, 2018
7545e3e
ARROW-2492: [Python] Prevent segfault on accidental call of pyarrow.A…
xhochy Apr 23, 2018
b65205e
ARROW-2470: [C++] Avoid seeking in GetFileSize
pitrou Apr 23, 2018
2abc889
ARROW-2489: [Plasma] Fix PlasmaClient ABI variation
pitrou Apr 24, 2018
a609309
ARROW-2502: [Rust] Restore Windows Compatibility
paddyhoran Apr 24, 2018
2d278ab
ARROW-2508: [Python] Fix pytest.raises msg to message
pcmoritz Apr 25, 2018
3d7a5a6
ARROW-2074: [Python] Infer lists of dicts as struct arrays
pitrou Apr 25, 2018
5f9cf9c
ARROW-2448: [Plasma] Reference counting for PlasmaClient::Impl
pcmoritz Apr 25, 2018
c8a3ed8
ARROW-2286: [C++/Python] Allow subscripting pyarrow.lib.StructValue
kszucs Apr 26, 2018
c574006
ARROW-2498: [Java] Use java 1.8 instead of java 1.7
Apr 26, 2018
c8f17dd
ARROW-2518: [Java] Re-instate JDK tests in matrix, but with JDK 8 ins…
Apr 28, 2018
16820a2
ARROW-2452: [TEST] Spark integration test fails with permission error
kszucs Apr 28, 2018
3f5819a
[GLib] Fix a typo
kou Apr 30, 2018
e8d45eb
ARROW-2515 [Python] Add DictionaryValue class, fixing bugs with neste…
blkerby Apr 30, 2018
e3fafae
ARROW-2513: [Python] DictionaryType should give access to index type …
crepererum Apr 30, 2018
f93a635
First try at implementing a CountValues kernel
andrioni May 1, 2018
2d99fc6
Remove commented out prototype
andrioni May 1, 2018
95da385
Fix formatting issues
andrioni May 1, 2018
92 changes: 92 additions & 0 deletions cpp/src/arrow/compute/compute-test.cc
@@ -876,6 +876,23 @@ void CheckDictEncode(FunctionContext* ctx, const shared_ptr<DataType>& type,
ASSERT_ARRAYS_EQUAL(expected, *result);
}

template <typename Type, typename T>
void CheckCountValues(FunctionContext* ctx, const shared_ptr<DataType>& type,
                      const vector<T>& in_values, const vector<bool>& in_is_valid,
                      const vector<T>& out_values, const vector<bool>& out_is_valid,
                      const vector<int64_t>& out_counts) {
  shared_ptr<Array> input = _MakeArray<Type, T>(type, in_values, in_is_valid);
  shared_ptr<Array> ex_values = _MakeArray<Type, T>(type, out_values, out_is_valid);
  shared_ptr<Array> ex_counts =
      _MakeArray<Int64Type, int64_t>(int64(), out_counts, out_is_valid);

  shared_ptr<Array> result_values;
  shared_ptr<Array> result_counts;
  ASSERT_OK(CountValues(ctx, Datum(input), &result_values, &result_counts));
  ASSERT_ARRAYS_EQUAL(*ex_values, *result_values);
  ASSERT_ARRAYS_EQUAL(*ex_counts, *result_counts);
}

class TestHashKernel : public ComputeFixture, public TestBase {};

template <typename Type>
@@ -903,6 +920,14 @@ TYPED_TEST(TestHashKernelPrimitive, DictEncode) {
{0, 0, 0, 1, 0, 2});
}

TYPED_TEST(TestHashKernelPrimitive, CountValues) {
using T = typename TypeParam::c_type;
auto type = TypeTraits<TypeParam>::type_singleton();
CheckCountValues<TypeParam, T>(&this->ctx_, type, {2, 1, 2, 1, 2, 3, 4},
{true, false, true, true, true, true, false}, {2, 1, 3},
{}, {3, 1, 1});
}

TYPED_TEST(TestHashKernelPrimitive, PrimitiveResizeTable) {
using T = typename TypeParam::c_type;
// Skip this test for (u)int8
@@ -916,12 +941,14 @@ TYPED_TEST(TestHashKernelPrimitive, PrimitiveResizeTable) {
vector<T> values;
vector<T> uniques;
vector<int32_t> indices;
vector<int64_t> counts;
for (int64_t i = 0; i < kTotalValues * kRepeats; i++) {
const auto val = static_cast<T>(i % kTotalValues);
values.push_back(val);

if (i < kTotalValues) {
uniques.push_back(val);
counts.push_back(kRepeats);
}
indices.push_back(static_cast<int32_t>(i % kTotalValues));
}
@@ -930,6 +957,8 @@ TYPED_TEST(TestHashKernelPrimitive, PrimitiveResizeTable) {
CheckUnique<TypeParam, T>(&this->ctx_, type, values, {}, uniques, {});

CheckDictEncode<TypeParam, T>(&this->ctx_, type, values, {}, uniques, {}, indices);

CheckCountValues<TypeParam, T>(&this->ctx_, type, values, {}, uniques, {}, counts);
}

TEST_F(TestHashKernel, UniqueTimeTimestamp) {
@@ -944,6 +973,19 @@ TEST_F(TestHashKernel, UniqueTimeTimestamp) {
{});
}

TEST_F(TestHashKernel, CountValuesTimeTimestamp) {
CheckCountValues<Time32Type, int32_t>(&this->ctx_, time32(TimeUnit::SECOND),
{2, 1, 2, 1}, {true, false, true, true}, {2, 1},
{}, {2, 1});

CheckCountValues<Time64Type, int64_t>(&this->ctx_, time64(TimeUnit::NANO), {2, 1, 2, 1},
{true, false, true, true}, {2, 1}, {}, {2, 1});

CheckCountValues<TimestampType, int64_t>(&this->ctx_, timestamp(TimeUnit::NANO),
{2, 1, 2, 1}, {true, false, true, true},
{2, 1}, {}, {2, 1});
}

TEST_F(TestHashKernel, UniqueBoolean) {
CheckUnique<BooleanType, bool>(&this->ctx_, boolean(), {true, true, false, true},
{true, false, true, true}, {true, false}, {});
@@ -978,6 +1020,23 @@ TEST_F(TestHashKernel, DictEncodeBoolean) {
{}, {0, 1, 0, 1, 0});
}

TEST_F(TestHashKernel, CountValuesBoolean) {
CheckCountValues<BooleanType, bool>(&this->ctx_, boolean(), {true, true, false, true},
{true, false, true, true}, {true, false}, {},
{2, 1});

CheckCountValues<BooleanType, bool>(&this->ctx_, boolean(), {false, true, false, true},
{true, false, true, true}, {false, true}, {},
{2, 1});

// No nulls
CheckCountValues<BooleanType, bool>(&this->ctx_, boolean(), {true, true, false, true},
{}, {true, false}, {}, {3, 1});

CheckCountValues<BooleanType, bool>(&this->ctx_, boolean(), {false, true, false, true},
{}, {false, true}, {}, {2, 2});
}

TEST_F(TestHashKernel, UniqueBinary) {
CheckUnique<BinaryType, std::string>(&this->ctx_, binary(),
{"test", "", "test2", "test"},
@@ -997,6 +1056,16 @@ TEST_F(TestHashKernel, DictEncodeBinary) {
{true, false, true, true, true}, {"test", "test2", "baz"}, {}, {0, 0, 1, 0, 2});
}

TEST_F(TestHashKernel, CountValuesBinary) {
CheckCountValues<BinaryType, std::string>(
&this->ctx_, binary(), {"test", "", "test2", "test"}, {true, false, true, true},
{"test", "test2"}, {}, {2, 1});

CheckCountValues<StringType, std::string>(
&this->ctx_, utf8(), {"test", "", "test2", "test"}, {true, false, true, true},
{"test", "test2"}, {}, {2, 1});
}

TEST_F(TestHashKernel, BinaryResizeTable) {
const int64_t kTotalValues = 10000;
const int64_t kRepeats = 10;
@@ -1046,6 +1115,7 @@ TEST_F(TestHashKernel, FixedSizeBinaryResizeTable) {
vector<std::string> values;
vector<std::string> uniques;
vector<int32_t> indices;
vector<int64_t> counts;
for (int64_t i = 0; i < kTotalValues * kRepeats; i++) {
int64_t index = i % kTotalValues;
std::stringstream ss;
@@ -1056,6 +1126,7 @@ TEST_F(TestHashKernel, FixedSizeBinaryResizeTable) {

if (i < kTotalValues) {
uniques.push_back(val);
counts.push_back(kRepeats);
}
indices.push_back(static_cast<int32_t>(i % kTotalValues));
}
@@ -1065,6 +1136,8 @@ TEST_F(TestHashKernel, FixedSizeBinaryResizeTable) {
{});
CheckDictEncode<FixedSizeBinaryType, std::string>(&this->ctx_, type, values, {},
uniques, {}, indices);
CheckCountValues<FixedSizeBinaryType, std::string>(&this->ctx_, type, values, {},
uniques, {}, counts);
}

TEST_F(TestHashKernel, UniqueDecimal) {
@@ -1084,6 +1157,15 @@ TEST_F(TestHashKernel, DictEncodeDecimal) {
{}, {0, 0, 1, 0, 2});
}

TEST_F(TestHashKernel, CountValuesDecimal) {
vector<Decimal128> values{12, 12, 11, 12};
vector<Decimal128> expected{12, 11};

CheckCountValues<Decimal128Type, Decimal128>(&this->ctx_, decimal(2, 0), values,
{true, false, true, true}, expected, {},
{2, 1});
}

TEST_F(TestHashKernel, ChunkedArrayInvoke) {
vector<std::string> values1 = {"foo", "bar", "foo"};
vector<std::string> values2 = {"bar", "baz", "quuux", "foo"};
@@ -1095,6 +1177,9 @@ TEST_F(TestHashKernel, ChunkedArrayInvoke) {
vector<std::string> dict_values = {"foo", "bar", "baz", "quuux"};
auto ex_dict = _MakeArray<StringType, std::string>(type, dict_values, {});

vector<int64_t> counts = {3, 2, 1, 1};
auto ex_counts = _MakeArray<Int64Type, int64_t>(int64(), counts, {});

ArrayVector arrays = {a1, a2};
auto carr = std::make_shared<ChunkedArray>(arrays);

@@ -1103,6 +1188,13 @@ TEST_F(TestHashKernel, ChunkedArrayInvoke) {
ASSERT_OK(Unique(&this->ctx_, Datum(carr), &result));
ASSERT_ARRAYS_EQUAL(*ex_dict, *result);

// Count values
shared_ptr<Array> cv_uniques;
shared_ptr<Array> cv_counts;
ASSERT_OK(CountValues(&this->ctx_, Datum(carr), &cv_uniques, &cv_counts));
ASSERT_ARRAYS_EQUAL(*ex_dict, *cv_uniques);
ASSERT_ARRAYS_EQUAL(*ex_counts, *cv_counts);

// Dictionary encode
auto dict_type = dictionary(int32(), ex_dict);

99 changes: 99 additions & 0 deletions cpp/src/arrow/compute/kernels/hash.cc
@@ -749,6 +749,50 @@ class DictEncodeImpl : public HashTableKernel<Type, DictEncodeImpl<Type>> {
Int32Builder indices_builder_;
};

// ----------------------------------------------------------------------
// Count values implementation

template <typename Type>
class CountValuesImpl : public HashTableKernel<Type, CountValuesImpl<Type>> {
 public:
  static constexpr bool allow_expand = true;
  using Base = HashTableKernel<Type, CountValuesImpl>;

  CountValuesImpl(const std::shared_ptr<DataType>& type, MemoryPool* pool)
      : Base(type, pool) {}

  Status Reserve(const int64_t length) {
    counts_.reserve(length);
    return Status::OK();
  }

  // Nulls are skipped: they add neither a unique value nor a count.
  void ObserveNull() {}

  // Value already in the hash table: increment the count stored for its slot.
  void ObserveFound(const hash_slot_t slot) { counts_[slot]++; }

  // First occurrence of a value: start its count at 1.
  void ObserveNotFound(const hash_slot_t slot) { counts_.emplace_back(1); }

  Status DoubleSize() { return Base::DoubleTableSize(); }

  // Emit the accumulated counts as an Int64 array.
  Status Flush(Datum* out) override {
    Int64Builder builder(Base::pool_);
    std::shared_ptr<ArrayData> result;

    for (const int64_t value : counts_) {
      RETURN_NOT_OK(builder.Append(value));
    }

    RETURN_NOT_OK(builder.FinishInternal(&result));
    out->value = std::move(result);
    return Status::OK();
  }

  using Base::Append;

 private:
  std::vector<int64_t> counts_;
};
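
The counting scheme in `CountValuesImpl` can be illustrated outside Arrow with a minimal, self-contained sketch. Everything in it is a hypothetical stand-in: `ValueCounter` is not an Arrow class, and `std::unordered_map` replaces the kernel's own hash table. It assumes, as the tests in this PR suggest, that uniques are reported in first-seen order, that nulls are skipped, and that an empty validity vector means "no nulls".

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <string>
#include <unordered_map>
#include <vector>

// Hypothetical stand-in for the hash kernel; not part of Arrow.
template <typename T>
class ValueCounter {
 public:
  // Consume one chunk of input. An empty is_valid vector means "no nulls".
  void Append(const std::vector<T>& values, const std::vector<bool>& is_valid = {}) {
    assert(is_valid.empty() || is_valid.size() == values.size());
    for (std::size_t i = 0; i < values.size(); ++i) {
      if (!is_valid.empty() && !is_valid[i]) continue;  // null: not counted
      auto it = slots_.find(values[i]);
      if (it != slots_.end()) {
        ++counts_[it->second];  // seen before: bump the count at its slot
      } else {
        slots_.emplace(values[i], uniques_.size());  // assign the next slot
        uniques_.push_back(values[i]);
        counts_.push_back(1);  // first occurrence starts at 1
      }
    }
  }

  const std::vector<T>& uniques() const { return uniques_; }
  const std::vector<int64_t>& counts() const { return counts_; }

 private:
  std::unordered_map<T, std::size_t> slots_;  // value -> index into uniques_/counts_
  std::vector<T> uniques_;                    // first-seen order
  std::vector<int64_t> counts_;               // counts_[i] belongs to uniques_[i]
};
```

Because the state lives in the object, calling `Append` once per chunk gives the same result as counting the concatenated input, which is the behaviour the `ChunkedArrayInvoke` test checks for the real kernel.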

// ----------------------------------------------------------------------
// Kernel wrapper for generic hash table kernels

@@ -871,6 +915,48 @@ Status GetDictionaryEncodeKernel(FunctionContext* ctx,
return Status::OK();
}

Status GetCountValuesKernel(FunctionContext* ctx, const std::shared_ptr<DataType>& type,
                            std::unique_ptr<HashKernel>* out) {
  std::unique_ptr<HashTable> hasher;

#define COUNT_VALUES_CASE(InType)                                        \
  case InType::type_id:                                                  \
    hasher.reset(new CountValuesImpl<InType>(type, ctx->memory_pool())); \
    break

  switch (type->id()) {
    COUNT_VALUES_CASE(NullType);
    COUNT_VALUES_CASE(BooleanType);
    COUNT_VALUES_CASE(UInt8Type);
    COUNT_VALUES_CASE(Int8Type);
    COUNT_VALUES_CASE(UInt16Type);
    COUNT_VALUES_CASE(Int16Type);
    COUNT_VALUES_CASE(UInt32Type);
    COUNT_VALUES_CASE(Int32Type);
    COUNT_VALUES_CASE(UInt64Type);
    COUNT_VALUES_CASE(Int64Type);
    COUNT_VALUES_CASE(FloatType);
    COUNT_VALUES_CASE(DoubleType);
    COUNT_VALUES_CASE(Date32Type);
    COUNT_VALUES_CASE(Date64Type);
    COUNT_VALUES_CASE(Time32Type);
    COUNT_VALUES_CASE(Time64Type);
    COUNT_VALUES_CASE(TimestampType);
    COUNT_VALUES_CASE(BinaryType);
    COUNT_VALUES_CASE(StringType);
    COUNT_VALUES_CASE(FixedSizeBinaryType);
    COUNT_VALUES_CASE(Decimal128Type);

Review comment (Member):
Is there a way to promote some code reuse with the other unary (single-argument) hash kernels?

Reply (Contributor Author):
I think so, although I am not sure how to do it. Maybe moving everything to a macro? I am willing to try if somebody could give me some pointers on what's the best way to do it.

    default:
      break;
  }

#undef COUNT_VALUES_CASE

  CHECK_IMPLEMENTED(hasher, "count-values", type);
  out->reset(new HashKernelImpl(std::move(hasher)));
  return Status::OK();
}
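
On the review question about reusing code across the unary hash kernels: one possible direction, sketched here with toy types only, is to write the type switch once and parameterize it on the implementation template. All names below (`TypeId`, the `*Tag` structs, `ToyHashTable`, `ToyUniqueImpl`, `ToyCountValuesImpl`, `MakeHasher`) are hypothetical and are not Arrow's API. The real kernels take a type and a memory pool in their constructors, so an actual refactoring would need to forward those.

```cpp
#include <cassert>
#include <memory>
#include <string>

// Toy type system standing in for arrow::Type ids (hypothetical).
enum class TypeId { kInt32, kDouble, kString, kList };

struct Int32Tag { static constexpr TypeId type_id = TypeId::kInt32; };
struct DoubleTag { static constexpr TypeId type_id = TypeId::kDouble; };
struct StringTag { static constexpr TypeId type_id = TypeId::kString; };

struct ToyHashTable {
  virtual ~ToyHashTable() = default;
  virtual std::string name() const = 0;
};

// Two kernel families that share the same per-type dispatch.
template <typename Tag>
struct ToyUniqueImpl : ToyHashTable {
  std::string name() const override { return "unique"; }
};

template <typename Tag>
struct ToyCountValuesImpl : ToyHashTable {
  std::string name() const override { return "count-values"; }
};

// The switch is written once; the kernel family is a template template parameter.
template <template <typename> class Impl>
std::unique_ptr<ToyHashTable> MakeHasher(TypeId id) {
#define TOY_HASHER_CASE(Tag) \
  case Tag::type_id:         \
    return std::unique_ptr<ToyHashTable>(new Impl<Tag>())

  switch (id) {
    TOY_HASHER_CASE(Int32Tag);
    TOY_HASHER_CASE(DoubleTag);
    TOY_HASHER_CASE(StringTag);
    default:
      break;
  }
#undef TOY_HASHER_CASE
  return nullptr;  // unsupported type: the caller reports "not implemented"
}
```

With this shape, each `Get...Kernel` function would reduce to a call such as `MakeHasher<ToyCountValuesImpl>(id)` followed by the null check. Whether this fits the real kernels depends on details not visible in this diff, such as per-type specializations.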

namespace {

Status InvokeHash(FunctionContext* ctx, HashKernel* func, const Datum& value,
@@ -918,5 +1004,18 @@ Status DictionaryEncode(FunctionContext* ctx, const Datum& value, Datum* out) {
return Status::OK();
}

Status CountValues(FunctionContext* ctx, const Datum& value,
                   std::shared_ptr<Array>* out_uniques,
                   std::shared_ptr<Array>* out_counts) {
  std::unique_ptr<HashKernel> func;
  RETURN_NOT_OK(GetCountValuesKernel(ctx, value.type(), &func));

  std::vector<Datum> counts_datum;
  RETURN_NOT_OK(InvokeHash(ctx, func.get(), value, &counts_datum, out_uniques));

  // The counts are taken from the last datum produced by the kernel.
  *out_counts = MakeArray(counts_datum.back().array());
  return Status::OK();
}

} // namespace compute
} // namespace arrow
22 changes: 17 additions & 5 deletions cpp/src/arrow/compute/kernels/hash.h
@@ -51,6 +51,10 @@ Status GetDictionaryEncodeKernel(FunctionContext* ctx,
const std::shared_ptr<DataType>& type,
std::unique_ptr<HashKernel>* kernel);

ARROW_EXPORT
Status GetCountValuesKernel(FunctionContext* ctx, const std::shared_ptr<DataType>& type,
std::unique_ptr<HashKernel>* kernel);

/// \brief Compute unique elements from an array-like object
/// \param[in] context the FunctionContext
/// \param[in] datum array-like input
@@ -71,6 +75,19 @@ Status Unique(FunctionContext* context, const Datum& datum, std::shared_ptr<Arra
ARROW_EXPORT
Status DictionaryEncode(FunctionContext* context, const Datum& data, Datum* out);

/// \brief Return counts of unique elements from an array-like object
/// \param[in] context the FunctionContext
/// \param[in] value array-like input
/// \param[out] out_uniques unique elements as Array
/// \param[out] out_counts counts per element as Array, same shape as out_uniques
///
/// \since 0.10.0
/// \note API not yet finalized
ARROW_EXPORT
Status CountValues(FunctionContext* context, const Datum& value,
std::shared_ptr<Array>* out_uniques,

Review comment (Contributor):
It seems more natural to me to have the output type be a struct (but maybe there was discussion on this previously, I guess the existing API had this)?

std::shared_ptr<Array>* out_counts);
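
The reviewer's suggestion of a struct-shaped result can be pictured with a small sketch in plain C++. `ValueCounts` and `ZipCounts` are hypothetical names, not Arrow types or functions. The sketch only shows the API shape: two parallel outputs, as in the signature above, bundled into a single record whose fields stay aligned.

```cpp
#include <cassert>
#include <cstdint>
#include <utility>
#include <vector>

// Hypothetical struct-shaped result; not part of Arrow.
template <typename T>
struct ValueCounts {
  std::vector<T> values;
  std::vector<int64_t> counts;  // counts[i] belongs to values[i]
};

// Bundle two parallel outputs into one record, checking that they line up.
template <typename T>
ValueCounts<T> ZipCounts(std::vector<T> uniques, std::vector<int64_t> counts) {
  assert(uniques.size() == counts.size());
  return ValueCounts<T>{std::move(uniques), std::move(counts)};
}
```

A single return value makes it harder for callers to pair counts with the wrong uniques, at the cost of one more type in the public interface. The two-pointer form in this PR matches the commented-out declaration it replaces.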

// TODO(wesm): Define API for incremental dictionary encoding

// TODO(wesm): Define API for regularizing DictionaryArray objects with
@@ -95,11 +112,6 @@ Status DictionaryEncode(FunctionContext* context, const Datum& data, Datum* out)
// Status IsIn(FunctionContext* context, const Datum& values, const Datum& member_set,
// Datum* out);

// ARROW_EXPORT
// Status CountValues(FunctionContext* context, const Datum& values,
// std::shared_ptr<Array>* out_uniques,
// std::shared_ptr<Array>* out_counts);

} // namespace compute
} // namespace arrow
