[WIP] Converts dataframe to/from named numpy arrays #4
thunterdb wants to merge 5 commits into databricks:master from
Conversation
python/pdspark/converter.py
Outdated
I assume this will be very slow for larger data? That's OK for now.
Yes it will; we can always improve it later.
python/pdspark/converter.py
Outdated
The docs should list the supported input types and how they are handled: lists of common types, or lists of vector or numerical array types.
Done. I also documented that we support a subset of numpy types (there are so many) and SQL types.
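For illustration only, a dtype mapping of roughly this shape is what such a converter relies on; the exact subset supported in this PR may differ, and the names below are placeholders, not the actual table in converter.py:

```python
# Illustrative numpy dtype -> Spark SQL type mapping (a sketch; the subset
# actually supported by this PR's converter.py may be different).
import numpy as np
from pyspark.sql.types import (
    BooleanType, DoubleType, FloatType, IntegerType, LongType, StringType)

NUMPY_TO_SQL = {
    np.dtype("float64"): DoubleType(),
    np.dtype("float32"): FloatType(),
    np.dtype("int64"): LongType(),
    np.dtype("int32"): IntegerType(),
    np.dtype("bool"): BooleanType(),
    np.dtype("object"): StringType(),  # e.g. arrays of Python strings
}
```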
Just had a couple more comments.
@jkbradley comments addressed
This PR should unskip the following: test_cv_lasso_with_mllib_featurization (spark_sklearn.tests.test_grid_search_2.CVTests) ... SKIP: disable this test until we have numpy <-> dataframe conversion
I'm starting to look through the open PRs to see if we can merge them or whether they're stale -- @thunterdb is this one too old to resurrect?
I found this incredibly convenient for creating small dataframes; here is how you can use it:
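As a rough sketch of what the conversion saves you from writing by hand, here is the equivalent construction in plain pyspark; the converter's actual API and function names in this PR may differ:

```python
import numpy as np
from pyspark.sql import SparkSession
from pyspark.ml.linalg import Vectors

spark = SparkSession.builder.getOrCreate()

# A tiny dataset described as named numpy arrays.
arrays = {
    "label": np.array([0.0, 1.0, 0.0]),
    "features": np.array([[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]]),
}

# What the converter automates: zip the arrays row-wise and build a DataFrame,
# turning the 2-D array into a vector column and the 1-D array into doubles.
rows = [(float(l), Vectors.dense(f))
        for l, f in zip(arrays["label"], arrays["features"])]
df = spark.createDataFrame(rows, ["label", "features"])
df.show()
```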
And the conversion in the other direction: it extracts the proper shape for vectors, matrices, etc.
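A sketch of what that reverse direction amounts to, again with placeholder code rather than the converter's real API (continuing from the snippet above):

```python
import numpy as np

# Collect the DataFrame back into named numpy arrays, stacking the vector
# column into a 2-D array so the original shape is recovered.
collected = df.collect()
arrays_back = {
    "label": np.array([row["label"] for row in collected]),
    "features": np.vstack([row["features"].toArray() for row in collected]),
}
print(arrays_back["features"].shape)  # (3, 2)
```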
Currently missing are more tests, better names, and sparse vectors. I am not sure how easy sparse vectors are to support, because they have an irregular shape between rows. It is probably easier to disallow them and force users to use the CSC conversion that you already wrote.