Support msgpack format as to_td option #78

@akito19

Purpose

Import object-dtype values of a DataFrame into Treasure Data as string type.

Behaviors

Currently, when we import such data into Treasure Data with pandas_td.to_td, object-dtype values are not necessarily stored as strings.

For example:

import pytd.pandas_td as td
import pandas as pd

df = pd.DataFrame([{'id':'00123', 'num': '45678'}])
con = td.connect()

td.to_td(df, 'kasai_test.pytd', con, if_exists='replace')

In this case, both id and num are stored as long type in Treasure Data, even though they are object type in the DataFrame.
I suspect the conversion to CSV causes this, since CSV has no type system.
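
To illustrate why CSV could lose the type, here is a minimal sketch using only Python's standard csv module. It does not reproduce what pytd or Treasure Data actually do internally; the helper name to_csv_text is made up for this example. It only shows that a string and an integer with the same digits serialize to identical CSV text, so whatever reads the file has to guess the type.

```python
import csv
import io


def to_csv_text(rows):
    """Serialize rows to CSV text with an id/num header (illustrative helper)."""
    buf = io.StringIO()
    writer = csv.writer(buf, lineterminator="\n")
    writer.writerow(["id", "num"])
    writer.writerows(rows)
    return buf.getvalue()


# Same digits, once as a string and once as an integer.
as_strings = to_csv_text([("00123", "45678")])
as_numbers = to_csv_text([("00123", 45678)])

# The CSV text carries no type information, so both serialize identically.
same_text = as_strings == as_numbers
print(as_strings)
print(same_text)
```

Since the two outputs are indistinguishable, a reader that infers types from the text alone has no way to know the column was meant to be a string.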

However, fmt='msgpack' has been available when calling BulkImportWriter.write_dataframe since version 1.0.0.
Thus, we can store these values as strings by calling write_dataframe explicitly, as below:

import pytd
import pandas as pd

df = pd.DataFrame({'id': '00324', 'test': '5678'}, index=range(2))
client = pytd.Client()

table = pytd.table.Table(client, "kasai_test", "pytd")
writer = pytd.writer.BulkImportWriter()
writer.write_dataframe(df, table, if_exists="overwrite", fmt="msgpack")

Because pytd already lets users choose the format when using bulk_import, users might expect to be able to set the format in to_td as well, just as they can in write_dataframe.
Thus, I propose that to_td accept a format option that supports msgpack.
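
As a rough sketch of the idea, the helper below forwards a to_td-style call to a writer object and passes fmt through. Everything here is hypothetical: to_td_with_fmt and IF_EXISTS_TO_WRITER_MODE are not part of pytd, and the mapping covers only the 'replace' / 'overwrite' pair that appears in the two examples in this issue. The writer argument is assumed to expose write_dataframe(df, table, if_exists=..., fmt=...) as in the BulkImportWriter call shown earlier.

```python
# Hypothetical helper, not part of pytd.

# Only the pair seen in this issue: to_td uses 'replace' where
# write_dataframe uses 'overwrite'. Other values pass through unchanged.
IF_EXISTS_TO_WRITER_MODE = {"replace": "overwrite"}


def to_td_with_fmt(df, table, writer, *, if_exists, fmt):
    """Forward a to_td-style call to writer.write_dataframe, keeping fmt."""
    mode = IF_EXISTS_TO_WRITER_MODE.get(if_exists, if_exists)
    writer.write_dataframe(df, table, if_exists=mode, fmt=fmt)
```

With something along these lines, the first example could become a single call that passes if_exists='replace' and fmt='msgpack', without the user constructing the writer by hand.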

What do you think about it?
