ARROW-881: [Python] Reconstruct Pandas DataFrame indexes using metadata #612
Conversation
cpp/src/arrow/ipc/metadata.cc (outdated)

```diff
@@ -944,7 +944,7 @@ static Status VisitField(const flatbuf::Field* field, DictionaryTypeMap* id_to_f
 Status GetDictionaryTypes(const void* opaque_schema, DictionaryTypeMap* id_to_field) {
   auto schema = static_cast<const flatbuf::Schema*>(opaque_schema);
-  int num_fields = static_cast<int>(schema->fields()->size());
+  auto num_fields = schema->fields()->size();
```
I'm going to revert these.
cpp/src/arrow/ipc/metadata.cc (outdated)

```diff
@@ -954,7 +954,7 @@ Status GetDictionaryTypes(const void* opaque_schema, DictionaryTypeMap* id_to_fi
 Status GetSchema(const void* opaque_schema, const DictionaryMemo& dictionary_memo,
                  std::shared_ptr<Schema>* out) {
   auto schema = static_cast<const flatbuf::Schema*>(opaque_schema);
-  int num_fields = static_cast<int>(schema->fields()->size());
+  auto num_fields = schema->fields()->size();
```
This too.
cpp/src/arrow/type.h (outdated)

```diff
@@ -721,7 +721,6 @@ class ARROW_EXPORT Schema {
  private:
   std::vector<std::shared_ptr<Field>> fields_;
   std::unordered_map<std::string, int> name_to_index_;
-
```
This is a rebase artifact.
python/pyarrow/_parquet.pyx (outdated)

```diff
@@ -159,14 +159,26 @@ cdef class FileMetaData:
         result.init_from_file(self, i)
         return result

+    property metadata:
```
This needs a test
is this cool now?
Let me add a small test, doing it now.
done
python/pyarrow/tests/test_parquet.py (outdated)

```diff
@@ -91,8 +87,14 @@ def test_pandas_parquet_2_0_rountrip(tmpdir):
     filename = tmpdir.join('pandas_rountrip.parquet')
     arrow_table = pa.Table.from_pandas(df, timestamps_to_ms=True)
+    expected_custom_metadata = {
+        b'indices': bytes([len(df.columns)]),
```
This limits us to MultiIndexes with <= 255 levels (because we're using string -> string for metadata). I think that's reasonable for now. We can always come up with a more complex encoding if we want to support more levels than that. I'd be surprised if this ever comes up in practice.
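To make the limit concrete, here is a minimal sketch (the helper name is hypothetical, not from the patch) of why a single-byte value caps the level count at 255:

```python
def encode_num_levels(n):
    """Encode an index level count as one metadata byte (hypothetical helper)."""
    if not 0 <= n <= 255:
        # bytes([n]) itself raises ValueError outside 0-255, which is
        # exactly the <= 255 level limit discussed above
        raise ValueError("single-byte encoding supports at most 255 levels")
    return bytes([n])
```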
python/pyarrow/__init__.py (outdated)

```diff
@@ -104,6 +104,7 @@ def jemalloc_memory_pool():
 from pyarrow.filesystem import Filesystem, HdfsClient, LocalFilesystem

 from pyarrow.ipc import FileReader, FileWriter, StreamReader, StreamWriter
+from pyarrow.ipc import to_pandas_wire, from_pandas_wire
```
Maybe serialize_pandas and deserialize_pandas?
Done
python/pyarrow/_table.pyx (outdated)

```diff
@@ -321,8 +322,11 @@ cdef tuple _dataframe_to_arrays(df, bint timestamps_to_ms, Schema schema):
     cdef:
```
This function is getting "chubby" enough that we should probably move it to a pandas utility module in pure Python.
python/pyarrow/_table.pyx (outdated)

```python
    _pandas().MultiIndex.from_arrays(
        index_arrays
    ) if index_arrays else _pandas().RangeIndex(row_count),
]
```
Same comment as above re: doing this in pure Python. It would also encourage adding appropriate public APIs to pyarrow.Table. We already have Table.remove_column, so it is probably better to use that if possible.
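As a sketch of this suggestion, dropping index columns through a remove_column-style API might look like the following. The Table class here is a minimal stand-in, not pyarrow.Table, and drop_index_columns is a hypothetical helper:

```python
class Table:
    """Minimal stand-in for a columnar table with a remove_column API."""

    def __init__(self, names):
        self.names = list(names)

    def remove_column(self, i):
        # Return a new table without column i, mirroring the immutable
        # style of a remove_column public API
        return Table(self.names[:i] + self.names[i + 1:])


def drop_index_columns(table, index_names):
    # Remove index columns one at a time via the public API instead of
    # rebuilding the column list by hand
    for name in index_names:
        table = table.remove_column(table.names.index(name))
    return table
```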
python/pyarrow/ipc.py (outdated)

```python
    return sink.get_result()


def from_pandas_wire(buf):
```
Docstrings
```diff
@@ -573,7 +579,7 @@ def test_decimal_128_from_pandas(self):
         })
         converted = pa.Table.from_pandas(expected)
         field = pa.field('decimals', pa.decimal(26, 11))
-        schema = pa.schema([field])
+        schema = pa.schema([field, DEFAULT_INDEX_FIELD])
```
This DEFAULT_INDEX_FIELD is a slight nuisance. Perhaps add an argument to from_pandas controlling whether to ingest the index (the default could be True or False, I guess)?
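A sketch of what such a flag could look like, using plain dicts as stand-ins for DataFrame columns. The function name and the '__index_level_0__' column name are illustrative assumptions, not taken from the patch:

```python
def dataframe_to_columns(columns, index, preserve_index=True):
    """Convert column data to an output mapping, optionally ingesting the index.

    columns: dict mapping column name -> list of values
    index:   list of index values
    """
    out = dict(columns)
    if preserve_index:
        # Only when requested does the index become an extra column, so
        # tests would no longer need a DEFAULT_INDEX_FIELD in every schema
        out['__index_level_0__'] = list(index)
    return out
```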
python/pyarrow/tests/test_parquet.py (outdated)

```python
    expected_custom_metadata = {
        b'indices': bytes([len(df.columns)]),
        b'has_name': bytes([0])
    }
```
I think <= 255 levels is OK. I would actually rather see this metadata stored as a JSON blob under a single pandas key, otherwise we are possibly muddying the metadata namespace.

```python
metadata = {b'pandas': json.dumps(pandas_meta).encode('utf8')}
```
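The suggested encoding round-trips cleanly with the standard library alone; a sketch, under the assumption that pandas_meta is a plain dict of index metadata (the keys shown are illustrative):

```python
import json

# Hypothetical index metadata; the real keys are up to the patch
pandas_meta = {'index_columns': ['__index_level_0__'], 'num_levels': 1}

# Write side: everything lives under one b'pandas' key
metadata = {b'pandas': json.dumps(pandas_meta).encode('utf8')}

# Read side: decode the single blob instead of scanning many raw keys
decoded = json.loads(metadata[b'pandas'].decode('utf8'))
```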
PARQUET-595 is merged.
@wesm This is ready for another round of review when you get a chance.
OK, taking a look now. Minor rebase conflict from #679.
Fixed the conflict and addressed the comments.
wesm left a comment:
Overall this looks fine; this will be very nice to have! I would say we should start factoring out code from pyarrow.lib that doesn't need to be cythonized, which will make iterative development a little easier in some cases, too.
cpp/src/arrow/type.h (outdated)

```diff
@@ -701,6 +701,9 @@ class ARROW_EXPORT Schema {
   // Returns nullptr if name not found
   std::shared_ptr<Field> GetFieldByName(const std::string& name);

+  // Returns -1 if name not found
+  int64_t GetFieldIndex(const std::string& name);
```
You should be able to call this and the one above in a const context. You'll have to mark name_to_index_ as mutable to make this work
done
python/pyarrow/array.pxi (outdated)

```python
    TimeUnit_NANO: lambda x, tzinfo: pd.Timestamp(
        x, tz=tzinfo, unit='ns',
    )
}
```
Maybe we should factor this out into a pandas_compat.py module, along with the rest of the stuff below
This is pretty awkward to factor out because of the TimeUnit_* enum values. We'd have to make pandas_compat.pxi if we wanted to keep those available to Cython but not Python (which would seem to defeat part of the purpose of factoring out) or expose the enum values to Python. This doesn't seem worth it for something that will never be seen by a user. Still, if you feel strongly about it I can spend some more time on it.
True true, no worries, this is fine as is.
python/pyarrow/parquet.py (outdated)

```python
    custom_metadata = self.metadata.metadata

    # TODO(phillipc): Hack?
    if custom_metadata and b'pandas' in custom_metadata:
```
I think we need an explicit read_pandas function in this class so that the user must express intent to use the additional pandas metadata
I think this is the last thing in this patch. I would like to have the option to ignore the metadata and read the file as-is as an Arrow table (without having the index columns tacked on against my will). So we can either add a read_pandas method to enables the metadata wrangling logic, or an option to read that does the same thing.
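One possible shape for the opt-in API being discussed here, as a sketch only: the class name, constructor, and return values are assumptions, not the merged implementation.

```python
class ParquetReaderSketch:
    """Stand-in reader: `read` ignores pandas metadata, `read_pandas` uses it."""

    def __init__(self, table, key_value_metadata=None):
        self._table = table
        self._metadata = key_value_metadata or {}

    def read(self):
        # Plain Arrow read: index columns, if any, stay as ordinary columns
        return self._table, None

    def read_pandas(self):
        # Opt-in path: only here is the b'pandas' metadata consulted, so
        # index reconstruction never happens against the caller's will
        return self._table, self._metadata.get(b'pandas')
```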
Yep, fully on board here. Just trying to iron out pandas_compat stuff, then moving on to this.
👍
python/pyarrow/table.pxi (outdated)

```python
from collections import OrderedDict

import pandas as pd
```
This is a regression, since pandas is not a hard dependency.
fixed
python/pyarrow/table.pxi (outdated)

```cython
        CColumn* col
        int i

cdef table_to_blockmanager(const shared_ptr[CTable]& ctable, int nthreads):
```
Move some of this code to pyarrow.pandas_compat?
python/pyarrow/table.pxi (outdated)

```cython
        Schema schema

    table.init(ctable)
    block_table.init(ctable)
```
Use pyarrow_wrap_table here
done
```diff
@@ -67,9 +67,10 @@ def tearDown(self):
     def _check_pandas_roundtrip(self, df, expected=None, nthreads=1,
                                 timestamps_to_ms=False, expected_schema=None,
-                                check_dtype=True, schema=None):
+                                check_dtype=True, schema=None,
+                                check_index=True):
```
Maybe make check_index default to false?
done
python/pyarrow/tests/test_ipc.py (outdated)

```python
    )
    buf = pa.serialize_pandas(df)
    result = pa.deserialize_pandas(buf)
    assert_frame_equal(result, df)
```
Test a MultiIndex here?
What is the behavior when the columns are not strings?
This now raises a TypeError alerting the user to the fact that column names cannot be anything other than strings.
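The check described might look roughly like this (a sketch; the helper name and the exact error message in the patch may differ):

```python
def check_column_names(names):
    # Serialized metadata is string -> string, so non-string column
    # names cannot round-trip; fail loudly up front instead of
    # silently coercing or corrupting them
    for name in names:
        if not isinstance(name, str):
            raise TypeError(
                'column name {!r} is not a string'.format(name))
```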
Also added a multiindex test.
python/pyarrow/table.pxi (outdated)

```diff
@@ -455,7 +568,7 @@ cdef class RecordBatch:
         return Table.from_batches([self]).to_pandas(nthreads=nthreads)

     @classmethod
-    def from_pandas(cls, df, schema=None):
+    def from_pandas(cls, df, Schema schema=None, bint preserve_index=True):
```
Document this extra parameter
done
I think this and #602 are the last things I'd like to get in before cutting 0.4.0 (outside some cleanup patches).
Sounds good!
python/pyarrow/ipc.py (outdated)

```python
    return sink.get_result()


def deserialize_pandas(buf):
```
Can you add nthreads=None here and pass it through to to_pandas (single-threaded by default)?
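The requested change, sketched with a stand-in table object: everything named here (_TableStub, the _read hook) is illustrative, not the real stream-reader code.

```python
class _TableStub:
    """Stand-in for the Arrow table produced by reading the buffer."""

    def to_pandas(self, nthreads=1):
        return {'nthreads': nthreads}


def deserialize_pandas(buf, nthreads=None, _read=lambda buf: _TableStub()):
    # nthreads=None keeps the single-threaded default the reviewer asked
    # for, while still letting callers opt into parallel conversion
    table = _read(buf)
    return table.to_pandas(nthreads=1 if nthreads is None else nthreads)
```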
Will do.
done
Made a last comment, #612 (comment), but outside of that I think this is about good to go.
python/pyarrow/parquet.py (outdated)

```python
    index_columns = []

    if column_indices and index_columns:
        column_indices += index_columns
```
Need to call _get_column_indices on these?
Ah crap. Yep. Will also add a test since this wasn't failing for me locally.
done
Here's the appveyor build: https://ci.appveyor.com/project/cpcloud/arrow/build/1.0.158

+1, thanks for doing this!

cc @mrocklin

Author: Phillip Cloud <cpcloud@gmail.com>

Closes apache#612 from cpcloud/ARROW-881 and squashes the following commits:

- 4fa679d [Phillip Cloud] Add metadata test
- 60f71aa [Phillip Cloud] More doc
- de616e8 [Phillip Cloud] Add doc
- a42a084 [Phillip Cloud] Decode metadata to utf8 because JSON
- 2198dc5 [Phillip Cloud] Call column_name_idx on index_columns
- 32c5e64 [Phillip Cloud] Add test for read_pandas subset
- 2fa1f16 [Phillip Cloud] Do not write index_column metadata if not requested
- 21a8829 [Phillip Cloud] Add docs to pq.read_pandas
- c35970c [Phillip Cloud] Add test for no index written and pq.read_pandas
- 59477b5 [Phillip Cloud] ARROW-881: [Python] Reconstruct Pandas DataFrame indexes using custom_metadata