Skip to content

Conversation

@fsaintjacques
Copy link
Contributor

@fsaintjacques fsaintjacques commented Dec 14, 2018

…stamps fields.

This changes the behavior of use_deprecated_int96_timestamps to support
all timestamp fields irregardless of the time unit. It would previously
only apply this conversion to fields with Nanosecond resolution.

People will only use this option when they use a system that only
supports INT96 timestamps, systems that also support INT64 timestamps in
other resolutions would not need the option.

A notable API change is that this option now take precedence over the
coerce_timestamps option.

@fsaintjacques
Copy link
Contributor Author

fsaintjacques commented Dec 14, 2018

As a note for reviewer, I could not track the exact cause (memory corruption) of the segfault cause in https://issues.apache.org/jira/browse/PARQUET-1274, which I reverted in this PR in order to implement the desired behavior.

But I noted that there was a mismatch between the TypedColumnWriter<Int64Type> and TypedWriteBatch<Int96Type, ::arrow::TimestampType> which likely caused the problem. The change in GetTimestampMetadata made sure that both type matched and made the problem go away.

@wesm
Copy link
Member

wesm commented Dec 14, 2018

@fsaintjacques does the build fail here if you do not revert PARQUET-1274? I can look into that bug and try to fix it

@wesm
Copy link
Member

wesm commented Dec 14, 2018

I'll just try to repro the segfault in that JIRA on this branch

@fsaintjacques
Copy link
Contributor Author

fsaintjacques commented Dec 14, 2018

The segfault is not present anymore. If you want to trigger it, revert the schema.cc changes, then parquet-arrow-reader-writer-test will segfault consistently.

The top stack will be corrupted and meaningless, the real issue is the pool_ property being corrupted, so unwind to the proper frame where pool points to the global default memory pool.

@wesm
Copy link
Member

wesm commented Dec 14, 2018

Got it. I read the comments in the JIRA and it makes sense (there was an invalid int64 -> int96 type cast happening). I'm going to quickly check the original example

@fsaintjacques
Copy link
Contributor Author

fsaintjacques commented Dec 14, 2018

As @joshuastorck recommended, we might prefer a checked_cast in ArrowColumnWriter::WriteBatch* which will be dynamic in debug and static in production builds.

@fsaintjacques
Copy link
Contributor Author

Yep, that would be it, I get a nullptr.

@wesm
Copy link
Member

wesm commented Dec 14, 2018

Sounds good. Can you make all the writer type casts checked_cast here?

Copy link
Member

@wesm wesm left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I also confirmed that the failure in PARQUET-1274 isn't present with this patch so no issues reverting that patch

Copy link
Member

@wesm wesm Dec 14, 2018

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Must be mindful of parquet v1/v2 issues here -- @xhochy could you take a look at this?

Copy link
Contributor Author

@fsaintjacques fsaintjacques Dec 14, 2018

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The python unit tests found another issue, I'll add a c++ test to catch this earlier.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This was fixed in GetTimestampMetadata, it's now closer to the original.

…stamps fields.

This changed the behavior of use_deprecated_int96_timestamps to support
all timestamp fields irregardless of the time unit. It would previously
only apply this conversion to fields with Nanosecond resolution.

People will only use this option when they use a system that only
supports INT96 timestamps, systems that also support INT64 timestamps in
other resolutions would not need the option.

A notable API change is that this option now take precedence over the
coerce_timestamps option.
@fsaintjacques fsaintjacques force-pushed the ARROW-2026-parquet-int96-conversion branch from 7a88806 to 2897a72 Compare December 15, 2018 01:22
@codecov-io
Copy link

Codecov Report

Merging #3173 into master will increase coverage by <.01%.
The diff coverage is 96.96%.

Impacted file tree graph

@@            Coverage Diff             @@
##           master    #3173      +/-   ##
==========================================
+ Coverage   86.39%   86.39%   +<.01%     
==========================================
  Files         505      505              
  Lines       69612    69620       +8     
==========================================
+ Hits        60140    60149       +9     
+ Misses       9371     9370       -1     
  Partials      101      101
Impacted Files Coverage Δ
python/pyarrow/parquet.py 93.7% <ø> (ø) ⬆️
cpp/src/parquet/types.h 100% <100%> (ø) ⬆️
cpp/src/parquet/arrow/arrow-reader-writer-test.cc 95.92% <100%> (-0.04%) ⬇️
cpp/src/parquet/arrow/writer.h 100% <100%> (ø) ⬆️
python/pyarrow/tests/test_parquet.py 96.26% <100%> (ø) ⬆️
cpp/src/parquet/arrow/reader.cc 84.72% <100%> (-0.11%) ⬇️
cpp/src/parquet/arrow/writer.cc 91.56% <90%> (+0.33%) ⬆️
cpp/src/parquet/arrow/schema.cc 89.7% <94.73%> (+0.08%) ⬆️
... and 1 more

Continue to review full report at Codecov.

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update ce12fb5...2897a72. Read the comment docs.

Copy link
Member

@wesm wesm left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

+1. Thank you!

@wesm
Copy link
Member

wesm commented Dec 15, 2018

BTW there's no need to squash your commits in a PR. It's actually better if you push additional commits because then I see e-mail notifications that the branch has been updated, which prompts me to review again. If you force push, I do not get a notification. The other way would be to comment asking me to look again =)

@wesm wesm closed this in ec154d2 Dec 15, 2018
@fsaintjacques
Copy link
Contributor Author

Didn't know about force push and notifications!

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants