ARROW-3200: [C++] Support dictionaries in Flight streams #4113
Conversation
Force-pushed from aa70508 to 370c832.
cpp/src/arrow/ipc/writer.cc
This may work but it's highly irregular. For clarity it might be worthwhile to make a deep copy of the schema, i.e. replace

```cpp
std::shared_ptr<Schema> schema_ptr(const_cast<Schema*>(&schema), deleter);
```

with

```cpp
auto schema_ptr = arrow::schema(schema.fields(), schema.metadata());
```
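As a side note, here is a minimal sketch contrasting the two approaches; it assumes the original `deleter` was a no-op (an illustration, not the PR's actual code):

```cpp
#include <memory>

#include "arrow/type.h"  // arrow::Schema, arrow::schema()

void Example(const arrow::Schema& schema) {
  // Irregular: alias the const reference through a shared_ptr with a no-op
  // deleter. It works, but the pointer must not outlive `schema`.
  std::shared_ptr<arrow::Schema> aliased(const_cast<arrow::Schema*>(&schema),
                                         [](arrow::Schema*) {});

  // Clearer: build an independent Schema that shares the immutable fields.
  std::shared_ptr<arrow::Schema> copy =
      arrow::schema(schema.fields(), schema.metadata());

  (void)aliased;  // silence unused-variable warnings in this sketch
  (void)copy;
}
```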
I will try to spend a good amount of time reviewing this before I leave town on Monday.
cpp/src/arrow/flight/client.cc
You can get the underlying gRPC error via ClientWriter.Finish: https://grpc.io/grpc/cpp/classgrpc_1_1_client_writer.html#ab6c58e110c289ac0f76ae1fdb8fe24a3
As documented, it will return right away if writing failed (as it would have here): https://grpc.io/grpc/cpp/classgrpc_1_1internal_1_1_client_streaming_interface.html#a47321b8c130947bcef5c793329a54619
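For reference, a rough sketch of surfacing that gRPC error after a failed Write(); the helper name and the mapping onto arrow::Status are illustrative, not the PR's code:

```cpp
#include <grpcpp/grpcpp.h>

#include "arrow/status.h"

// If a Write() on the client stream failed, Finish() returns immediately
// with the underlying grpc::Status explaining why.
template <typename T>
arrow::Status FinishAndCheck(grpc::ClientWriter<T>* writer) {
  grpc::Status rpc_status = writer->Finish();
  if (!rpc_status.ok()) {
    return arrow::Status::IOError("gRPC stream failed: " +
                                  rpc_status.error_message());
  }
  return arrow::Status::OK();
}
```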
ghost left a comment
The Flight changes look good to me (and it's much cleaner now!), but it seems to have broken compatibility with Java - it looks like schema messages can now carry body buffers? (And Java would need to be updated to account for dictionary-batch-type messages anyway.) Is the plan to follow up with that change, or should I or someone else follow up with it?
That sounds unexpected. What do the body buffers contain?
It is a single empty buffer. It's the same problem as ARROW-4213 - the body tag gets written even though there is no body. It looks like the IpcPayload's body length field is corrupted when the IpcPayload is allocated in … This diff fixes it:

```diff
--- a/cpp/src/arrow/ipc/writer.cc
+++ b/cpp/src/arrow/ipc/writer.cc
@@ -535,7 +535,7 @@ Status WriteIpcPayload(const IpcPayload& payload, io::OutputStream* dst,
 Status GetSchemaPayloads(const Schema& schema, MemoryPool* pool, DictionaryMemo* out_memo,
                          std::vector<IpcPayload>* out_payloads) {
   DictionaryMemo dictionary_memo;
-  IpcPayload payload;
+  IpcPayload payload{};
   out_payloads->clear();
   payload.type = Message::SCHEMA;
```
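For context, a tiny standalone illustration (not Arrow code) of why the `{}` matters: default-initializing a struct whose members have no initializers leaves built-in members indeterminate, while value-initialization zeroes them.

```cpp
#include <cstdint>
#include <iostream>

struct Payload {
  int64_t body_length;  // no default member initializer
};

int main() {
  Payload a;    // default-initialized: body_length holds an indeterminate value
  Payload b{};  // value-initialized: body_length is zeroed
  (void)a;      // silence unused-variable warning; reading a.body_length would be UB
  std::cout << b.body_length << std::endl;  // prints 0
  return 0;
}
```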
Thanks @lihalite. I'll try it out.

I have a lull in the airport today so I'll spend some time reviewing this since it's important.
Force-pushed from 370c832 to f9f458d.
Rebased and hopefully fixed the schema-message-with-body issue.
wesm left a comment
The code looks OK, and the refactoring to use common interfaces is good.
The meta-problem that isn't discussed anywhere in this PR is that the dictionaries are being serialized and written multiple times. When you have dictionaries, they are serialized and sent in the FlightGetInfo message
https://github.com/apache/arrow/blob/master/format/Flight.proto#L220
It is safe to assume that the Schema message is of a reasonable size (< 1 MB), but it isn't safe to assume the same of the entire schema including dictionaries.
I would suggest adding a benchmark with one or more large dictionaries (e.g. 32MB-128MB). The issue should immediately present itself.
I had commented in the JIRA "Some work is needed to handle schemas sent separately from their dictionaries" and there is also ARROW-3144 "Better solution for cases where dictionaries are unknown at schema reconstruction time, or for delta dictionaries".
cpp/src/arrow/flight/flight-test.cc
We could theoretically encode some standardized structure in the error messages so that the actual problem can be "unboxed".
Right... Though we would first have to standardize errors across all Arrow implementations ;-)
cpp/src/arrow/flight/server.cc
Does Finish() return it here?
AFAICT, Finish() is only available on client-side streams :-(
cpp/src/arrow/ipc/test-common.cc
Probably a lot of this file can be moved to arrow/testing/*, as there is some redundancy in the data generation
I created https://issues.apache.org/jira/browse/ARROW-5141 for it.
cpp/src/arrow/ipc/test-common.cc
Do we have random dictionary generation yet? Might be useful
It doesn't look like it. Also one should check if the IPC-testing dict payloads have some specifically useful characteristics.
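For what it's worth, a rough sketch of what such a generator could look like, assuming a recent Arrow C++ API (the helper name is made up, and DictionaryArray::FromArrays is used in its Result-returning form):

```cpp
#include <cstdint>
#include <memory>
#include <random>
#include <string>

#include "arrow/api.h"

// Build a random string dictionary of `dict_size` values and `length` random
// indices into it, then assemble a DictionaryArray.
arrow::Result<std::shared_ptr<arrow::Array>> MakeRandomDictionaryArray(
    int64_t dict_size, int64_t length, uint32_t seed = 42) {
  std::mt19937 rng(seed);

  // Random dictionary values
  arrow::StringBuilder dict_builder;
  for (int64_t i = 0; i < dict_size; ++i) {
    ARROW_RETURN_NOT_OK(dict_builder.Append("value-" + std::to_string(rng())));
  }
  std::shared_ptr<arrow::Array> dictionary;
  ARROW_RETURN_NOT_OK(dict_builder.Finish(&dictionary));

  // Random indices into the dictionary
  std::uniform_int_distribution<int32_t> dist(0, static_cast<int32_t>(dict_size) - 1);
  arrow::Int32Builder index_builder;
  for (int64_t i = 0; i < length; ++i) {
    ARROW_RETURN_NOT_OK(index_builder.Append(dist(rng)));
  }
  std::shared_ptr<arrow::Array> indices;
  ARROW_RETURN_NOT_OK(index_builder.Finish(&indices));

  auto type = arrow::dictionary(arrow::int32(), arrow::utf8());
  return arrow::DictionaryArray::FromArrays(type, indices, dictionary);
}
```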
cpp/src/arrow/ipc/writer.cc
Not a bad idea. However in most cases this shouldn't allocate a lot of memory (unless there are a lot of sliced bitmaps that have to be "fixed"). Are we optimizing for reuse of padding buffers yet (Java does this) -- it is probably not an issue in C++ since we are pretty good about being 8-byte padded?
I'm not sure what you mean by padding buffers. There's a static constexpr uint8_t kPaddingBytes[kArrowAlignment] declaration in ipc/util.h (the constexpr looks a bit unexpected here btw).
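As a rough illustration of what a reusable padding buffer does in an IPC-style writer (a sketch in the spirit of kPaddingBytes, not the actual writer code; the alignment value is an assumption for the example):

```cpp
#include <cstdint>

#include "arrow/io/interfaces.h"
#include "arrow/status.h"

namespace {
// Reusable block of zeros, written whenever the stream needs padding.
constexpr int64_t kAlignment = 8;  // illustration only
const uint8_t kZeros[kAlignment] = {0};
}  // namespace

// Write `nbytes` from `data`, then pad the stream with zeros up to the next
// multiple of kAlignment.
arrow::Status WriteAligned(arrow::io::OutputStream* out, const uint8_t* data,
                           int64_t nbytes) {
  ARROW_RETURN_NOT_OK(out->Write(data, nbytes));
  const int64_t remainder = nbytes % kAlignment;
  if (remainder != 0) {
    ARROW_RETURN_NOT_OK(out->Write(kZeros, kAlignment - remainder));
  }
  return arrow::Status::OK();
}
```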
cpp/src/arrow/ipc/writer.cc
Aside, we should decide on an evolution of DCHECK_OK:
- Always evaluates: EVAL_AND_DCHECK_OK
- Some variant that only runs and checks the statement in debug mode -- here is a good example: DEBUG_CHECK_OK(statement) or something
Yes, the current situation is a bit unfortunate. I think most (all?) uses of DCHECK_OK should be replaced with ABORT_NOT_OK, which doesn't ignore the status return in release mode.
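For illustration, a self-contained sketch of the two flavors discussed above; the macro names come from the comments and are not existing Arrow macros, and the Status type here is a stand-in:

```cpp
#include <cassert>

// Stand-in for arrow::Status, just for this illustration.
struct Status {
  bool ok_;
  bool ok() const { return ok_; }
};

// Always evaluates `expr`; only checks the result in debug builds.
#ifdef NDEBUG
#define EVAL_AND_DCHECK_OK(expr) ((void)(expr))
#else
#define EVAL_AND_DCHECK_OK(expr) assert((expr).ok())
#endif

// Evaluates and checks `expr` only in debug builds; a no-op in release builds.
#ifdef NDEBUG
#define DEBUG_CHECK_OK(expr) ((void)0)
#else
#define DEBUG_CHECK_OK(expr) assert((expr).ok())
#endif

Status DoWork() { return Status{true}; }

int main() {
  EVAL_AND_DCHECK_OK(DoWork());  // DoWork() runs in both build modes
  DEBUG_CHECK_OK(DoWork());      // DoWork() runs only in debug builds
  return 0;
}
```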
I see... This is because of …
Codecov Report
```diff
@@            Coverage Diff             @@
##           master    #4113      +/-   ##
==========================================
- Coverage   87.84%   87.77%   -0.08%
==========================================
  Files          739      741       +2
  Lines        90980    91159     +179
  Branches      1252     1252
==========================================
+ Hits         79921    80013      +92
- Misses       10938    11029      +91
+ Partials       121      117       -4
```
Continue to review full report at Codecov.
As another general comment about dictionaries: in many applications, dictionary encoding is a data compression technique, and two tables or two arrays having different dictionaries are considered semantically "the same". This is often the case with analytic databases -- we are already seeing this issue when reading Parquet files as dictionary-encoded (the dictionaries may differ between row groups). We will likely be forced by this practical consideration to evolve the C++ API around this.

To make the problem more concrete for Flight -- if a dataset is distributed, the nodes may not "agree" about what the dictionary is. I do not think it is practical to go about some kind of "dictionary consensus" step. The only point at which a "consensus" dictionary is really required is when you do analytics. Then you have various cases:

Unfortunately, the case of "semantically the same data, but different dictionaries" is poorly modeled in the C++ library, so we should think a bit about this. I plan to work on Parquet stuff a bunch when I'm back from vacation, so I intend to address the dictionary-encoded / categorical issue as part of this.
BTW in the case of analytics, it is generally most practical to address dictionary / categorical normalization at the last possible moment. When the cardinality is not large, this is also a lot cheaper. (Imagine a billion-row dataset split into 1000 dictionary-encoded pieces, each having 1 million elements and a different dictionary.)
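As a concrete (if simplistic) picture of that late normalization, here is a plain-C++ sketch that merges one piece's string dictionary into a running global dictionary and remaps the piece's indices; it is illustrative only, not an Arrow API:

```cpp
#include <cstdint>
#include <string>
#include <unordered_map>
#include <vector>

// Merge a piece's dictionary into a global one and remap its indices.
// `lookup` maps dictionary values to their index in `global_dict`.
std::vector<int32_t> RemapPiece(const std::vector<std::string>& piece_dict,
                                const std::vector<int32_t>& piece_indices,
                                std::vector<std::string>* global_dict,
                                std::unordered_map<std::string, int32_t>* lookup) {
  // old index in the piece -> new index in the global dictionary
  std::vector<int32_t> transpose(piece_dict.size());
  for (size_t i = 0; i < piece_dict.size(); ++i) {
    auto it = lookup->find(piece_dict[i]);
    if (it == lookup->end()) {
      const int32_t new_index = static_cast<int32_t>(global_dict->size());
      global_dict->push_back(piece_dict[i]);
      it = lookup->emplace(piece_dict[i], new_index).first;
    }
    transpose[i] = it->second;
  }
  // Rewrite the piece's indices against the global dictionary.
  std::vector<int32_t> remapped(piece_indices.size());
  for (size_t i = 0; i < piece_indices.size(); ++i) {
    remapped[i] = transpose[piece_indices[i]];
  }
  return remapped;
}
```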
@wesm @lihalite I'm gonna merge this soon if you don't object.
I am good with merging, thanks for the heads up. The only thing I might want is to see if we can stop skipping the dictionary case in the integration tests.
I'm not comfortable dealing with the integration tests now. Would you like to tackle that in a separate issue?
Of course - filed ARROW-5143.