Skip to content

Conversation

@pitrou
Copy link
Member

@pitrou pitrou commented Apr 4, 2019

No description provided.

@pitrou
Copy link
Member Author

pitrou commented Apr 4, 2019

@wesm There's some IPC refactoring here that deserves reviewing.
@lidavidm Also some Flight refactoring.

@pitrou pitrou force-pushed the ARROW-3200-flight-dicts branch 3 times, most recently from aa70508 to 370c832 Compare April 4, 2019 17:35
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This may work but it's highly irregular. For clarity it might be worthwhile to make a deep copy of the schema:

Suggested change
std::shared_ptr<Schema> schema_ptr(const_cast<Schema*>(&schema), deleter);
auto schema_ptr = arrow::schema(schema.fields(), schema.metadata());

@apache apache deleted a comment from pitrou Apr 4, 2019
@wesm
Copy link
Member

wesm commented Apr 4, 2019

I will try to spend a good amount of time reviewing this before I leave town on Monday

Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You can get the underlying gRPC error via ClientWriter.Finish: https://grpc.io/grpc/cpp/classgrpc_1_1_client_writer.html#ab6c58e110c289ac0f76ae1fdb8fe24a3

As documented, it will return right away if writing failed (as it would have here): https://grpc.io/grpc/cpp/classgrpc_1_1internal_1_1_client_streaming_interface.html#a47321b8c130947bcef5c793329a54619

Copy link

@ghost ghost left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The Flight changes look good to me (and it's much cleaner now!), but it seems to have broken compatibility with Java - looks like now schema messages can carry body buffers? (And Java would need to be updated to account for dictionary batch-type messages anyways). Is the plan to follow up with that change/should I or someone else follow up with that?

@pitrou
Copy link
Member Author

pitrou commented Apr 5, 2019

looks like now schema messages can carry body buffers?

That sounds unexpected. What do the body buffers contain?

@ghost
Copy link

ghost commented Apr 5, 2019

It is a single empty buffer. It's the same problem as ARROW-4213 - the body tag gets written even though there is not a body. It looks like the IpcPayload's body length field is corrupted - when the IpcPayload is allocated in GetSchemaPayload, the fields aren't zero-initialized.

This diff fixes it:

--- a/cpp/src/arrow/ipc/writer.cc
+++ b/cpp/src/arrow/ipc/writer.cc
@@ -535,7 +535,7 @@ Status WriteIpcPayload(const IpcPayload& payload, io::OutputStream* dst,
 Status GetSchemaPayloads(const Schema& schema, MemoryPool* pool, DictionaryMemo* out_memo,
                          std::vector<IpcPayload>* out_payloads) {
   DictionaryMemo dictionary_memo;
-  IpcPayload payload;
+  IpcPayload payload{};
 
   out_payloads->clear();
   payload.type = Message::SCHEMA;

@pitrou
Copy link
Member Author

pitrou commented Apr 5, 2019

Thanks @lihalite. I'll try it out.

@wesm
Copy link
Member

wesm commented Apr 8, 2019

I have a lull in the airport today so I'll spend a some time reviewing this since it's important

@pitrou pitrou force-pushed the ARROW-3200-flight-dicts branch from 370c832 to f9f458d Compare April 8, 2019 13:27
@pitrou
Copy link
Member Author

pitrou commented Apr 8, 2019

Rebased and hopefully fixed the schema-message-with-body issue.

Copy link
Member

@wesm wesm left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The code looks OK, and the refactoring to use common interfaces is good.

The meta-problem that isn't discussed anywhere in this PR is that the dictionaries are being serialized and written multiple times. When you have dictionaries, they are serialized and sent in the FlightGetInfo message

https://github.com/apache/arrow/blob/master/format/Flight.proto#L220

It is safe to assume that the Schema message is of a reasonable size (< 1 MB), but It isn't safe to assume the same of the entire schema including dictionaries.

I would suggest adding a benchmark with one or more large dictionaries (e.g. 32MB-128MB). The issue should immediately present itself.

I had commented in the JIRA "Some work is needed to handle schemas sent separately from their dictionaries" and there is also ARROW-3144 "Better solution for cases where dictionaries are unknown at schema reconstruction time, or for delta dictionaries".

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

we could theoretically encode some standardized structure in the error messages so that the actual problem can be "unboxed"

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Right... Though we would first have to standardize errors accross all Arrow implementations ;-)

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

does Finish() return it here?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

AFAICT, Finish() is only available on client-side streams :-(

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Probably a lot of this file can be moved to arrow/testing/*, as there is some redundancy in the data generation

@fsaintjacques

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Do we have random dictionary generation yet? Might be useful

Copy link
Member Author

@pitrou pitrou Apr 8, 2019

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It doesn't look like it. Also one should check if the IPC-testing dict payloads have some specifically useful characteristics.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Not a bad idea. However in most cases this shouldn't allocate a lot of memory (unless there are a lot of sliced bitmaps that have to be "fixed"). Are we optimizing for reuse of padding buffers yet (Java does this) -- it is probably not an issue in C++ since we are pretty good about being 8-byte padded?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm not sure what you mean with padding buffers. There's a static constexpr uint8_t kPaddingBytes[kArrowAlignment] declaration in ipc/util.h (the constexpr looks a bit unexpected here btw).

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Aside, we should decide on an evolution of DCHECK_OK:

  • Always evaluates EVAL_AND_DCHECK_OK
  • Some variant that only runs and checks the statement in debug mode -- here is a good example. DEBUG_CHECK_OK(statement) or something

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, the current situation is a bit unfortunate. I think most (all?) uses of DCHECK_OK should be replaced with ABORT_NOT_OK, which doesn't ignore the status return in release mode.

@pitrou
Copy link
Member Author

pitrou commented Apr 8, 2019

The meta-problem that isn't discussed anywhere in this PR is that the dictionaries are being serialized and written multiple times. When you have dictionaries, they are serialized and sent in the FlightGetInfo message

I see... This is because of SerializeSchema, right? The solution could go hand in hand with #4067.

@codecov-io
Copy link

codecov-io commented Apr 8, 2019

Codecov Report

Merging #4113 into master will decrease coverage by 0.07%.
The diff coverage is 91.71%.

Impacted file tree graph

@@            Coverage Diff             @@
##           master    #4113      +/-   ##
==========================================
- Coverage   87.84%   87.77%   -0.08%     
==========================================
  Files         739      741       +2     
  Lines       90980    91159     +179     
  Branches     1252     1252              
==========================================
+ Hits        79921    80013      +92     
- Misses      10938    11029      +91     
+ Partials      121      117       -4
Impacted Files Coverage Δ
cpp/src/arrow/flight/serialization-internal.h 100% <ø> (ø) ⬆️
cpp/src/arrow/flight/test-util.h 100% <ø> (ø) ⬆️
cpp/src/arrow/ipc/json-test.cc 98.44% <ø> (ø) ⬆️
cpp/src/arrow/flight/customize_protobuf.h 100% <ø> (ø) ⬆️
cpp/src/arrow/ipc/metadata-internal.h 100% <ø> (ø) ⬆️
cpp/src/arrow/array-test.cc 100% <ø> (ø) ⬆️
cpp/src/arrow/ipc/dictionary.h 100% <ø> (ø) ⬆️
cpp/src/gandiva/expression_registry.cc 81.9% <0%> (ø) ⬆️
cpp/src/arrow/python/numpy_to_arrow.cc 95.41% <0%> (+0.81%) ⬆️
cpp/src/arrow/ipc/json-internal.cc 93.28% <0%> (+0.21%) ⬆️
... and 41 more

Continue to review full report at Codecov.

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update e881aef...efe0f5a. Read the comment docs.

@wesm
Copy link
Member

wesm commented Apr 8, 2019

As another general comment about dictionaries:

in many applications, dictionary encoding is a data compression technique, and two tables or two arrays having different dictionaries are considered semantically "the same". This is often the case with analytic databases -- we are already having the issue with reading Parquet files as dictionary-encoded (the dictionaries may differ between row groups). We will likely be forced by this practical consideration to evolve the C++ API around this.

To make the problem more concrete for Flight -- if a dataset is distributed, the nodes may not "agree" about what the dictionary is. I do not think it is practical to go about some kind of "dictionary consensus" step. The only point at which a "consensus" dictionary is really required is when you do analytics. Then you have various cases:

  • Dictionary is the same
  • One dictionary is a prefix of the other (e.g. ['a', 'b'] is a prefix of ['a', 'b', 'c', 'd'])
  • Neither case, dictionary merge / permutation required to perform analytics

Unfortunately, the case of "semantically the same data, but different dictionaries" is poorly modeled in the C++ library so we should think a bit about this.

I plan to work on Parquet stuff a bunch when I'm back from vacation so I intend to address the dictionary-encoded / categorical issue as part of this

@wesm
Copy link
Member

wesm commented Apr 8, 2019

BTW in the case of analytics, it is generally most practical to address dictionary / categorical normalization at the last possible moment. When the cardinality is not large, then this is also a lot cheaper. (Imagine a billion-row dataset split into 1000 dictionary-encoded pieces, each having 1 million elements and a different dictionary)

@pitrou
Copy link
Member Author

pitrou commented Apr 8, 2019

@wesm @lihalite I'm gonna merge this soon if you don't object.

@ghost
Copy link

ghost commented Apr 8, 2019

I am good with merging, thanks for the heads up. The only thing I might want is to see if we can stop skipping the dictionary case in the integration tests.

@pitrou
Copy link
Member Author

pitrou commented Apr 8, 2019

I'm not comfortable dealing with the integration tests now. Would you like to tackle that in a separate issue?

@ghost
Copy link

ghost commented Apr 8, 2019

Of course - filed ARROW-5143.

@pitrou pitrou closed this in d8e4763 Apr 9, 2019
@pitrou pitrou deleted the ARROW-3200-flight-dicts branch April 9, 2019 09:53
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

4 participants