ARROW-881: [Python] Reconstruct Pandas DataFrame indexes using metadata #612
Conversation
cpp/src/arrow/ipc/metadata.cc (outdated)

```diff
@@ -944,7 +944,7 @@ static Status VisitField(const flatbuf::Field* field, DictionaryTypeMap* id_to_f
 Status GetDictionaryTypes(const void* opaque_schema, DictionaryTypeMap* id_to_field) {
   auto schema = static_cast<const flatbuf::Schema*>(opaque_schema);
-  int num_fields = static_cast<int>(schema->fields()->size());
+  auto num_fields = schema->fields()->size();
```
I'm going to revert these.
cpp/src/arrow/ipc/metadata.cc (outdated)

```diff
@@ -954,7 +954,7 @@ Status GetDictionaryTypes(const void* opaque_schema, DictionaryTypeMap* id_to_fi
 Status GetSchema(const void* opaque_schema, const DictionaryMemo& dictionary_memo,
                  std::shared_ptr<Schema>* out) {
   auto schema = static_cast<const flatbuf::Schema*>(opaque_schema);
-  int num_fields = static_cast<int>(schema->fields()->size());
+  auto num_fields = schema->fields()->size();
```
This too.
cpp/src/arrow/type.h (outdated)

```diff
@@ -721,7 +721,6 @@ class ARROW_EXPORT Schema {
  private:
   std::vector<std::shared_ptr<Field>> fields_;
   std::unordered_map<std::string, int> name_to_index_;
-
```
This is a rebase artifact.
python/pyarrow/_parquet.pyx (outdated)

```diff
@@ -159,14 +159,26 @@ cdef class FileMetaData:
         result.init_from_file(self, i)
         return result

+    property metadata:
```
This needs a test
is this cool now?
Let me add a small test, doing it now.
done
python/pyarrow/tests/test_parquet.py (outdated)

```diff
@@ -91,8 +87,14 @@ def test_pandas_parquet_2_0_rountrip(tmpdir):
     filename = tmpdir.join('pandas_rountrip.parquet')
     arrow_table = pa.Table.from_pandas(df, timestamps_to_ms=True)
+    expected_custom_metadata = {
+        b'indices': bytes([len(df.columns)]),
```
This limits us to MultiIndexes with <= 255 levels (because we're using string -> string for metadata). I think that's reasonable for now. We can always come up with a more complex encoding if we want to support more levels than that. I'd be surprised if this ever comes up in practice.
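To make the limit concrete, here is a minimal sketch (the helper name is hypothetical, not from the patch) of why a single-byte value caps the level count at 255:

```python
def encode_num_levels(n):
    """Encode an index level count as one metadata byte (hypothetical helper)."""
    if not 0 <= n <= 255:
        # bytes([n]) itself raises ValueError outside 0-255, which is
        # exactly the <= 255 level limit discussed above
        raise ValueError("single-byte encoding supports at most 255 levels")
    return bytes([n])
```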
python/pyarrow/__init__.py (outdated)

```diff
@@ -104,6 +104,7 @@ def jemalloc_memory_pool():
 from pyarrow.filesystem import Filesystem, HdfsClient, LocalFilesystem

 from pyarrow.ipc import FileReader, FileWriter, StreamReader, StreamWriter
+from pyarrow.ipc import to_pandas_wire, from_pandas_wire
```
Maybe serialize_pandas and deserialize_pandas?
Done
python/pyarrow/_table.pyx (outdated)

```diff
@@ -321,8 +322,11 @@ cdef tuple _dataframe_to_arrays(df, bint timestamps_to_ms, Schema schema):
     cdef:
```
This function is getting "chubby" enough that we should probably move it to a pandas utility module in pure Python.
python/pyarrow/_table.pyx (outdated)

```python
    _pandas().MultiIndex.from_arrays(
        index_arrays
    ) if index_arrays else _pandas().RangeIndex(row_count),
]
```
Same comment as above re: doing this in pure Python. It would also encourage adding appropriate public APIs to pyarrow.Table. We already have Table.remove_column, so it is probably better to use that if possible.
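As a sketch of this suggestion, dropping index columns through a remove_column-style API might look like the following. The Table class here is a minimal stand-in, not pyarrow.Table, and drop_index_columns is a hypothetical helper:

```python
class Table:
    """Minimal stand-in for a columnar table with a remove_column API."""

    def __init__(self, names):
        self.names = list(names)

    def remove_column(self, i):
        # Return a new table without column i, mirroring the immutable
        # style of a remove_column public API
        return Table(self.names[:i] + self.names[i + 1:])


def drop_index_columns(table, index_names):
    # Remove index columns one at a time via the public API instead of
    # rebuilding the column list by hand
    for name in index_names:
        table = table.remove_column(table.names.index(name))
    return table
```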
python/pyarrow/ipc.py (outdated)

```python
    return sink.get_result()


def from_pandas_wire(buf):
```
Docstrings
```diff
@@ -573,7 +579,7 @@ def test_decimal_128_from_pandas(self):
         })
         converted = pa.Table.from_pandas(expected)
         field = pa.field('decimals', pa.decimal(26, 11))
-        schema = pa.schema([field])
+        schema = pa.schema([field, DEFAULT_INDEX_FIELD])
```
This DEFAULT_INDEX_FIELD is a slight nuisance. Perhaps add an argument to from_pandas controlling whether to ingest the index (the default could be True or False, I guess)?
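A sketch of what such a flag could look like, using plain dicts as stand-ins for DataFrame columns. The function name and the '__index_level_0__' column name are illustrative assumptions, not taken from the patch:

```python
def dataframe_to_columns(columns, index, preserve_index=True):
    """Convert column data to an output mapping, optionally ingesting the index.

    columns: dict mapping column name -> list of values
    index:   list of index values
    """
    out = dict(columns)
    if preserve_index:
        # Only when requested does the index become an extra column, so
        # tests would no longer need a DEFAULT_INDEX_FIELD in every schema
        out['__index_level_0__'] = list(index)
    return out
```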
python/pyarrow/tests/test_parquet.py (outdated)

```python
    expected_custom_metadata = {
        b'indices': bytes([len(df.columns)]),
        b'has_name': bytes([0])
    }
```
I think <= 255 levels is OK. I would actually rather see this metadata stored as a JSON blob under a single pandas key, otherwise we are possibly muddying the metadata namespace.

```python
metadata = {b'pandas': json.dumps(pandas_meta).encode('utf8')}
```
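The suggested encoding round-trips cleanly with the standard library alone; a sketch, under the assumption that pandas_meta is a plain dict of index metadata (the keys shown are illustrative):

```python
import json

# Hypothetical index metadata; the real keys are up to the patch
pandas_meta = {'index_columns': ['__index_level_0__'], 'num_levels': 1}

# Write side: everything lives under one b'pandas' key
metadata = {b'pandas': json.dumps(pandas_meta).encode('utf8')}

# Read side: decode the single blob instead of scanning many raw keys
decoded = json.loads(metadata[b'pandas'].decode('utf8'))
```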
PARQUET-595 is merged.
@wesm This is ready for another round of review when you get a chance.
OK, taking a look now. Minor rebase conflict from #679.
Fixed the conflict and addressed the comments.
wesm left a comment:
Overall this looks fine; this will be very nice to have! I would say we should start factoring out code from pyarrow.lib that doesn't need to be cythonized, which will make iterative development a little easier in some cases, too.
cpp/src/arrow/type.h (outdated)

```diff
@@ -701,6 +701,9 @@ class ARROW_EXPORT Schema {
   // Returns nullptr if name not found
   std::shared_ptr<Field> GetFieldByName(const std::string& name);

+  // Returns -1 if name not found
+  int64_t GetFieldIndex(const std::string& name);
```
You should be able to call this and the one above in a const context. You'll have to mark name_to_index_ as mutable to make this work
done
python/pyarrow/array.pxi (outdated)

```python
    TimeUnit_NANO: lambda x, tzinfo: pd.Timestamp(
        x, tz=tzinfo, unit='ns',
    )
}
```
Maybe we should factor this out into a pandas_compat.py module, along with the rest of the stuff below
This is pretty awkward to factor out because of the TimeUnit_* enum values. We'd have to make pandas_compat.pxi if we wanted to keep those available to Cython but not Python (which would seem to defeat part of the purpose of factoring out) or expose the enum values to Python. This doesn't seem worth it for something that will never be seen by a user. Still, if you feel strongly about it I can spend some more time on it.
True true, no worries, this is fine as is.
python/pyarrow/parquet.py (outdated)

```python
    custom_metadata = self.metadata.metadata

    # TODO(phillipc): Hack?
    if custom_metadata and b'pandas' in custom_metadata:
```
I think we need an explicit read_pandas function in this class so that the user must express intent to use the additional pandas metadata
I think this is the last thing in this patch. I would like to have the option to ignore the metadata and read the file as-is as an Arrow table (without having the index columns tacked on against my will). So we can either add a read_pandas method to enables the metadata wrangling logic, or an option to read that does the same thing.
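One possible shape for the opt-in API being discussed here, as a sketch only: the class name, constructor, and return values are assumptions, not the merged implementation.

```python
class ParquetReaderSketch:
    """Stand-in reader: `read` ignores pandas metadata, `read_pandas` uses it."""

    def __init__(self, table, key_value_metadata=None):
        self._table = table
        self._metadata = key_value_metadata or {}

    def read(self):
        # Plain Arrow read: index columns, if any, stay as ordinary columns
        return self._table, None

    def read_pandas(self):
        # Opt-in path: only here is the b'pandas' metadata consulted, so
        # index reconstruction never happens against the caller's will
        return self._table, self._metadata.get(b'pandas')
```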
Yep, fully on board here. Just trying to iron out pandas_compat stuff, then moving on to this.
👍
python/pyarrow/table.pxi (outdated)

```python
from collections import OrderedDict

import pandas as pd
```
This is a regression, since pandas is not a hard dependency.
fixed
python/pyarrow/table.pxi (outdated)

```cython
        CColumn* col
        int i

cdef table_to_blockmanager(const shared_ptr[CTable]& ctable, int nthreads):
```
Move some of this code to pyarrow.pandas_compat?
python/pyarrow/table.pxi (outdated)

```cython
        Schema schema

    table.init(ctable)
    block_table.init(ctable)
```
Use pyarrow_wrap_table here
done
```diff
@@ -67,9 +67,10 @@ def tearDown(self):
     def _check_pandas_roundtrip(self, df, expected=None, nthreads=1,
                                 timestamps_to_ms=False, expected_schema=None,
-                                check_dtype=True, schema=None):
+                                check_dtype=True, schema=None,
+                                check_index=True):
```
Maybe make check_index default to false?
done
python/pyarrow/tests/test_ipc.py (outdated)

```python
    )
    buf = pa.serialize_pandas(df)
    result = pa.deserialize_pandas(buf)
    assert_frame_equal(result, df)
```
Test a MultiIndex here?
What is the behavior when the columns are not strings?
This now raises a TypeError alerting the user to the fact that column names cannot be anything other than strings.
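The check described might look roughly like this (a sketch; the helper name and the exact error message in the patch may differ):

```python
def check_column_names(names):
    # Serialized metadata is string -> string, so non-string column
    # names cannot round-trip; fail loudly up front instead of
    # silently coercing or corrupting them
    for name in names:
        if not isinstance(name, str):
            raise TypeError(
                'column name {!r} is not a string'.format(name))
```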
Also added a multiindex test.
python/pyarrow/table.pxi (outdated)

```diff
@@ -455,7 +568,7 @@ cdef class RecordBatch:
         return Table.from_batches([self]).to_pandas(nthreads=nthreads)

     @classmethod
-    def from_pandas(cls, df, schema=None):
+    def from_pandas(cls, df, Schema schema=None, bint preserve_index=True):
```
Document this extra parameter
done
I think this and #602 are the last things I'd like to get in before cutting 0.4.0 (outside some cleanup patches).
Sounds good!
python/pyarrow/ipc.py (outdated)

```python
    return sink.get_result()


def deserialize_pandas(buf):
```
Can you add nthreads=None here and pass it through to to_pandas (single-threaded by default)?
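The requested change, sketched with a stand-in table object: everything named here (_TableStub, the _read hook) is illustrative, not the real stream-reader code.

```python
class _TableStub:
    """Stand-in for the Arrow table produced by reading the buffer."""

    def to_pandas(self, nthreads=1):
        return {'nthreads': nthreads}


def deserialize_pandas(buf, nthreads=None, _read=lambda buf: _TableStub()):
    # nthreads=None keeps the single-threaded default the reviewer asked
    # for, while still letting callers opt into parallel conversion
    table = _read(buf)
    return table.to_pandas(nthreads=1 if nthreads is None else nthreads)
```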
Will do.
done
Made a last comment, #612 (comment), but outside of that I think this is about good to go.
python/pyarrow/parquet.py (outdated)

```python
    index_columns = []

    if column_indices and index_columns:
        column_indices += index_columns
```
Need to call _get_column_indices on these?
Ah crap. Yep. Will also add a test since this wasn't failing for me locally.
done
Here's the appveyor build: https://ci.appveyor.com/project/cpcloud/arrow/build/1.0.158

+1, thanks for doing this!

cc @mrocklin

Author: Phillip Cloud <cpcloud@gmail.com>

Closes apache#612 from cpcloud/ARROW-881 and squashes the following commits:

- 4fa679d [Phillip Cloud] Add metadata test
- 60f71aa [Phillip Cloud] More doc
- de616e8 [Phillip Cloud] Add doc
- a42a084 [Phillip Cloud] Decode metadata to utf8 because JSON
- 2198dc5 [Phillip Cloud] Call column_name_idx on index_columns
- 32c5e64 [Phillip Cloud] Add test for read_pandas subset
- 2fa1f16 [Phillip Cloud] Do not write index_column metadata if not requested
- 21a8829 [Phillip Cloud] Add docs to pq.read_pandas
- c35970c [Phillip Cloud] Add test for no index written and pq.read_pandas
- 59477b5 [Phillip Cloud] ARROW-881: [Python] Reconstruct Pandas DataFrame indexes using custom_metadata