Closed
25 changes: 25 additions & 0 deletions cpp/src/arrow/type.cc
@@ -108,6 +108,31 @@ std::string Date32Type::ToString() const {
return std::string("date32");
}

static inline void print_time_unit(TimeUnit unit, std::ostream* stream) {
Member: Reference instead of pointer?

Member Author: The Google style guide recommends pointers for mutable arguments and discourages mutable reference arguments: https://google.github.io/styleguide/cppguide.html#Reference_Arguments. Our code isn't 100% consistent about this, but we should try to follow that convention. In the case of std::ostream you can see this applied in Protocol Buffers and other Google codebases: https://github.com/google/protobuf/blob/fd046f6263fb17383cafdbb25c361e3451c31105/src/google/protobuf/io/zero_copy_stream_impl.h#L265

Member: Ok, that makes sense. Thanks!

switch (unit) {
case TimeUnit::SECOND:
(*stream) << "s";
break;
case TimeUnit::MILLI:
(*stream) << "ms";
break;
case TimeUnit::MICRO:
(*stream) << "us";
break;
case TimeUnit::NANO:
(*stream) << "ns";
break;
}
}

std::string TimestampType::ToString() const {
std::stringstream ss;
ss << "timestamp[";
print_time_unit(this->unit, &ss);
ss << "]";
return ss.str();
}
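The new ToString() output can be sketched in Python. This mirrors the C++ switch above for illustration only; the dictionary and function names are not part of any real API.

```python
# Python sketch of TimestampType::ToString() and print_time_unit above.
# String keys stand in for the TimeUnit enum values.
_TIME_UNIT_ABBREV = {
    "SECOND": "s",
    "MILLI": "ms",
    "MICRO": "us",
    "NANO": "ns",
}

def timestamp_to_string(unit):
    """Render a timestamp type the way TimestampType::ToString() does."""
    return "timestamp[%s]" % _TIME_UNIT_ABBREV[unit]

print(timestamp_to_string("MILLI"))  # timestamp[ms]
```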

// ----------------------------------------------------------------------
// Union type

2 changes: 1 addition & 1 deletion cpp/src/arrow/type.h
@@ -495,7 +495,7 @@ struct ARROW_EXPORT TimestampType : public FixedWidthType {
TimestampType(const TimestampType& other) : TimestampType(other.unit) {}

Status Accept(TypeVisitor* visitor) const override;
std::string ToString() const override { return name(); }
std::string ToString() const override;
static std::string name() { return "timestamp"; }

TimeUnit unit;
3 changes: 3 additions & 0 deletions python/pyarrow/__init__.py
@@ -56,6 +56,8 @@
FloatValue, DoubleValue, ListValue,
BinaryValue, StringValue)

import pyarrow.schema as _schema
Member: Importing it here as private seems useless; I thought the imports here were mainly meant to expose the public interface?

Member Author: This was pretty annoying: after from pyarrow.schema import schema, the statement import pyarrow.schema returns pyarrow.schema.schema instead of the module. Other ideas?

Member: No, then we keep it.
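The shadowing behavior described above can be reproduced with a throwaway package built in memory. The package pkg and its schema submodule here are hypothetical stand-ins for pyarrow and pyarrow.schema.

```python
# Reproduction of the shadowing issue: ``from pkg.schema import schema``
# inside pkg/__init__.py rebinds the package attribute ``schema`` from the
# submodule to the function, and a later ``import pkg.schema`` does not
# restore it because the module is already in sys.modules.
import sys
import types

pkg = types.ModuleType("pkg")
schema_mod = types.ModuleType("pkg.schema")
exec("def schema():\n    return 'a schema'", schema_mod.__dict__)
sys.modules["pkg"] = pkg
sys.modules["pkg.schema"] = schema_mod
pkg.schema = schema_mod            # normal state: attribute is the module

pkg.schema = schema_mod.schema     # what ``from pkg.schema import schema`` does

import pkg.schema                  # already imported; attribute stays shadowed
print(callable(pkg.schema))        # True -- the function, not the module
```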


from pyarrow.schema import (null, bool_,
int8, int16, int32, int64,
uint8, uint16, uint32, uint64,
@@ -64,6 +66,7 @@
list_, struct, dictionary, field,
DataType, Field, Schema, schema)


from pyarrow.table import Column, RecordBatch, Table, concat_tables


76 changes: 59 additions & 17 deletions python/pyarrow/array.pyx
@@ -34,7 +34,8 @@ from pyarrow.memory cimport MemoryPool, maybe_unbox_memory_pool
cimport pyarrow.scalar as scalar
from pyarrow.scalar import NA

from pyarrow.schema cimport Field, Schema, DictionaryType
from pyarrow.schema cimport (DataType, Field, Schema, DictionaryType,
box_data_type)
import pyarrow.schema as schema

cimport cpython
@@ -45,16 +46,40 @@ cdef _pandas():
return pd


cdef maybe_coerce_datetime64(values, dtype, DataType type,
Member: We should probably reduce the usage of the reserved word type in our codebase.

Member Author: Since there's __builtin__.type, I haven't worried too much about it, but I'm OK with another naming convention, like type_. What do you prefer?

Member: Either type_ or dtype would be OK for me, but as it's only used internally we can keep it. I wasn't 100% sure about the implications of using a reserved keyword.
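The concern in this thread can be illustrated in plain Python. The describe function below is hypothetical, purely for demonstration.

```python
import builtins

# A parameter named ``type`` shadows the builtin inside the function body,
# which is the readability concern raised above; the builtin itself stays
# reachable through the builtins module.
def describe(values, type=None):
    actual = builtins.type(values).__name__
    return actual, type

print(describe([1, 2], type="list"))  # ('list', 'list')
```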

timestamps_to_ms=False):

from pyarrow.compat import DatetimeTZDtype

if values.dtype.type != np.datetime64:
return values, type

coerce_ms = timestamps_to_ms and values.dtype != 'datetime64[ms]'

if coerce_ms:
values = values.astype('datetime64[ms]')

if isinstance(dtype, DatetimeTZDtype):
tz = dtype.tz
unit = 'ms' if coerce_ms else dtype.unit
type = schema.timestamp(unit, tz)
else:
# Trust the NumPy dtype
type = schema.type_from_numpy_dtype(values.dtype)

return values, type
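The coercion step inside maybe_coerce_datetime64 can be demonstrated directly with NumPy: nanosecond-resolution timestamps (the pandas default) are cast down to millisecond resolution when timestamps_to_ms is set.

```python
import numpy as np

# The same astype('datetime64[ms]') cast the Cython code performs when the
# incoming values are not already millisecond resolution.
values = np.array(['2017-01-01T00:00:00.123456789'], dtype='datetime64[ns]')
coerced = values.astype('datetime64[ms]')
print(coerced.dtype)  # datetime64[ms]
```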


cdef class Array:

cdef init(self, const shared_ptr[CArray]& sp_array):
self.sp_array = sp_array
self.ap = sp_array.get()
self.type = DataType()
self.type.init(self.sp_array.get().type())
self.type = box_data_type(self.sp_array.get().type())

@staticmethod
def from_pandas(obj, mask=None, timestamps_to_ms=False, Field field=None,
def from_pandas(obj, mask=None, DataType type=None,
timestamps_to_ms=False,
MemoryPool memory_pool=None):
"""
Convert pandas.Series to an Arrow Array.
@@ -66,6 +91,9 @@
mask : pandas.Series or numpy.ndarray, optional
boolean mask if the object is valid or null

type : pyarrow.DataType
Explicit type to attempt to coerce to

timestamps_to_ms : bool, optional
Convert datetime columns to ms resolution. This is needed for
compatibility with other functionality like Parquet I/O which
@@ -107,33 +135,43 @@
"""
cdef:
shared_ptr[CArray] out
shared_ptr[CField] c_field
shared_ptr[CDataType] c_type
CMemoryPool* pool

pd = _pandas()

if field is not None:
c_field = field.sp_field

if mask is not None:
mask = get_series_values(mask)

series_values = get_series_values(obj)
values = get_series_values(obj)
pool = maybe_unbox_memory_pool(memory_pool)

if isinstance(series_values, pd.Categorical):
if isinstance(values, pd.Categorical):
return DictionaryArray.from_arrays(
series_values.codes, series_values.categories.values,
values.codes, values.categories.values,
mask=mask, memory_pool=memory_pool)
elif values.dtype == object:
# Object dtype undergoes a different conversion path as more type
# inference may be needed
if type is not None:
c_type = type.sp_type
with nogil:
check_status(pyarrow.PandasObjectsToArrow(
pool, values, mask, c_type, &out))
else:
if series_values.dtype.type == np.datetime64 and timestamps_to_ms:
series_values = series_values.astype('datetime64[ms]')
values, type = maybe_coerce_datetime64(
values, obj.dtype, type, timestamps_to_ms=timestamps_to_ms)

if type is None:
check_status(pyarrow.PandasDtypeToArrow(values.dtype, &c_type))
else:
c_type = type.sp_type

pool = maybe_unbox_memory_pool(memory_pool)
with nogil:
check_status(pyarrow.PandasToArrow(
pool, series_values, mask, c_field, &out))
pool, values, mask, c_type, &out))

return box_array(out)
return box_array(out)

@staticmethod
def from_list(object list_obj, DataType type=None,
@@ -338,6 +376,10 @@ cdef class DateArray(NumericArray):
pass


cdef class TimestampArray(NumericArray):
pass


cdef class FloatArray(FloatingPointArray):
pass

@@ -423,7 +465,7 @@ cdef dict _array_classes = {
Type_LIST: ListArray,
Type_BINARY: BinaryArray,
Type_STRING: StringArray,
Type_TIMESTAMP: Int64Array,
Type_TIMESTAMP: TimestampArray,
Type_DICTIONARY: DictionaryArray
}
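The one-line change above routes timestamp arrays to the new TimestampArray wrapper through the _array_classes lookup. The dispatch pattern itself is just a dict from type id to class, sketched here with string keys standing in for the real Type_* constants:

```python
# Sketch of the type-id -> wrapper-class dispatch used by box_array.
class Int64Array: ...
class TimestampArray: ...

_array_classes = {
    "int64": Int64Array,
    "timestamp": TimestampArray,  # previously mapped to Int64Array
}

def box_array(type_name):
    return _array_classes[type_name]()

print(type(box_array("timestamp")).__name__)  # TimestampArray
```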

9 changes: 9 additions & 0 deletions python/pyarrow/compat.py
@@ -17,9 +17,11 @@

# flake8: noqa

from distutils.version import LooseVersion
import itertools

import numpy as np
import pandas as pd

import sys
import six
@@ -115,6 +117,13 @@ def encode_file_path(path):
return encoded_path


if LooseVersion(pd.__version__) < '0.19.0':
pdapi = pd.core.common
from pandas.core.dtypes import DatetimeTZDtype
else:
from pandas.types.dtypes import DatetimeTZDtype
pdapi = pd.api.types
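The version gate above can be sketched without pandas installed. parse_version below is a simplified stand-in for LooseVersion, and only the import-location choice is modeled:

```python
# Simplified stand-in for the LooseVersion comparison used above: pick the
# import location of DatetimeTZDtype based on the pandas version string.
def parse_version(v):
    return tuple(int(p) for p in v.split(".")[:3])

def datetimetz_module(pd_version):
    if parse_version(pd_version) < (0, 19, 0):
        return "pandas.core.dtypes"
    return "pandas.types.dtypes"

print(datetimetz_module("0.18.1"))  # pandas.core.dtypes
print(datetimetz_module("0.19.2"))  # pandas.types.dtypes
```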

integer_types = six.integer_types + (np.integer,)

__all__ = []
4 changes: 2 additions & 2 deletions python/pyarrow/config.pyx
@@ -17,10 +17,10 @@
cdef extern from 'pyarrow/do_import_numpy.h':
pass

cdef extern from 'pyarrow/numpy_interop.h' namespace 'pyarrow':
cdef extern from 'pyarrow/numpy_interop.h' namespace 'arrow::py':
int import_numpy()

cdef extern from 'pyarrow/config.h' namespace 'pyarrow':
cdef extern from 'pyarrow/config.h' namespace 'arrow::py':
void pyarrow_init()
void pyarrow_set_numpy_nan(object o)

6 changes: 1 addition & 5 deletions python/pyarrow/feather.py
@@ -19,6 +19,7 @@
from distutils.version import LooseVersion
import pandas as pd

from pyarrow.compat import pdapi
from pyarrow._feather import FeatherError # noqa
from pyarrow.table import Table
import pyarrow._feather as ext
@@ -27,11 +28,6 @@
if LooseVersion(pd.__version__) < '0.17.0':
raise ImportError("feather requires pandas >= 0.17.0")

if LooseVersion(pd.__version__) < '0.19.0':
pdapi = pd.core.common
else:
pdapi = pd.api.types


class FeatherReader(ext.FeatherReader):

11 changes: 8 additions & 3 deletions python/pyarrow/includes/libarrow.pxd
@@ -84,6 +84,13 @@ cdef extern from "arrow/api.h" namespace "arrow" nogil:
shared_ptr[CArray] indices()
shared_ptr[CArray] dictionary()

cdef cppclass CTimestampType" arrow::TimestampType"(CFixedWidthType):
TimeUnit unit
c_string timezone

cdef cppclass CTimeType" arrow::TimeType"(CFixedWidthType):
TimeUnit unit

cdef cppclass CDictionaryType" arrow::DictionaryType"(CFixedWidthType):
CDictionaryType(const shared_ptr[CDataType]& index_type,
const shared_ptr[CArray]& dictionary)
@@ -92,6 +99,7 @@ cdef extern from "arrow/api.h" namespace "arrow" nogil:
shared_ptr[CArray] dictionary()

shared_ptr[CDataType] timestamp(TimeUnit unit)
shared_ptr[CDataType] timestamp(const c_string& timezone, TimeUnit unit)

cdef cppclass CMemoryPool" arrow::MemoryPool":
int64_t bytes_allocated()
@@ -117,9 +125,6 @@ cdef extern from "arrow/api.h" namespace "arrow" nogil:
cdef cppclass CStringType" arrow::StringType"(CDataType):
pass

cdef cppclass CTimestampType" arrow::TimestampType"(CDataType):
TimeUnit unit

cdef cppclass CField" arrow::Field":
c_string name
shared_ptr[CDataType] type
19 changes: 13 additions & 6 deletions python/pyarrow/includes/pyarrow.pxd
@@ -18,22 +18,29 @@
# distutils: language = c++

from pyarrow.includes.common cimport *
from pyarrow.includes.libarrow cimport (CArray, CBuffer, CColumn, CField,
from pyarrow.includes.libarrow cimport (CArray, CBuffer, CColumn,
CTable, CDataType, CStatus, Type,
CMemoryPool, TimeUnit)

cimport pyarrow.includes.libarrow_io as arrow_io


cdef extern from "pyarrow/api.h" namespace "pyarrow" nogil:
cdef extern from "pyarrow/api.h" namespace "arrow::py" nogil:
shared_ptr[CDataType] GetPrimitiveType(Type type)
shared_ptr[CDataType] GetTimestampType(TimeUnit unit)
CStatus ConvertPySequence(object obj, CMemoryPool* pool, shared_ptr[CArray]* out)
CStatus ConvertPySequence(object obj, CMemoryPool* pool,
shared_ptr[CArray]* out)

CStatus PandasDtypeToArrow(object dtype, shared_ptr[CDataType]* type)

CStatus PandasToArrow(CMemoryPool* pool, object ao, object mo,
shared_ptr[CField] field,
const shared_ptr[CDataType]& type,
shared_ptr[CArray]* out)

CStatus PandasObjectsToArrow(CMemoryPool* pool, object ao, object mo,
const shared_ptr[CDataType]& type,
shared_ptr[CArray]* out)

CStatus ConvertArrayToPandas(const shared_ptr[CArray]& arr,
PyObject* py_ref, PyObject** out)

@@ -47,12 +54,12 @@ cdef extern from "pyarrow/api.h" namespace "pyarrow" nogil:
CMemoryPool* get_memory_pool()


cdef extern from "pyarrow/common.h" namespace "pyarrow" nogil:
cdef extern from "pyarrow/common.h" namespace "arrow::py" nogil:
cdef cppclass PyBytesBuffer(CBuffer):
PyBytesBuffer(object o)


cdef extern from "pyarrow/io.h" namespace "pyarrow" nogil:
cdef extern from "pyarrow/io.h" namespace "arrow::py" nogil:
cdef cppclass PyReadableFile(arrow_io.ReadableFileInterface):
PyReadableFile(object fo)

10 changes: 9 additions & 1 deletion python/pyarrow/schema.pxd
@@ -16,7 +16,9 @@
# under the License.

from pyarrow.includes.common cimport *
from pyarrow.includes.libarrow cimport (CDataType, CDictionaryType,
from pyarrow.includes.libarrow cimport (CDataType,
CDictionaryType,
CTimestampType,
CField, CSchema)

cdef class DataType:
@@ -31,6 +33,12 @@ cdef class DictionaryType(DataType):
cdef:
const CDictionaryType* dict_type


cdef class TimestampType(DataType):
cdef:
const CTimestampType* ts_type


cdef class Field:
cdef:
shared_ptr[CField] sp_field