[WIP] Converts dataframe to/from named numpy arrays #4
thunterdb wants to merge 5 commits into databricks:master from
Conversation
python/pdspark/converter.py
Outdated
I assume this will be very slow for larger data? That's OK for now.
Yes it will; we can always improve it later.
python/pdspark/converter.py
Outdated
The docs should list the supported input types and how they are handled: lists of common types, or lists of vector or numerical array types.
Done. I also documented that we support a subset of numpy types (there are so many) and SQL types.
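For illustration only, a dtype mapping of roughly this shape is what such a converter relies on; the exact subset supported in this PR may differ, and the names below are placeholders, not the actual table in converter.py:

```python
# Illustrative numpy dtype -> Spark SQL type mapping (a sketch; the subset
# actually supported by this PR's converter.py may be different).
import numpy as np
from pyspark.sql.types import (
    BooleanType, DoubleType, FloatType, IntegerType, LongType, StringType)

NUMPY_TO_SQL = {
    np.dtype("float64"): DoubleType(),
    np.dtype("float32"): FloatType(),
    np.dtype("int64"): LongType(),
    np.dtype("int32"): IntegerType(),
    np.dtype("bool"): BooleanType(),
    np.dtype("object"): StringType(),  # e.g. arrays of Python strings
}
```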
Just had a couple more comments.
@jkbradley comments addressed
This PR should unskip the following: test_cv_lasso_with_mllib_featurization (spark_sklearn.tests.test_grid_search_2.CVTests) ... SKIP: disable this test until we have numpy <-> dataframe conversion
I'm starting to look through the open PRs to see if we can merge them or whether they're stale -- @thunterdb is this one too old to resurrect?
I found this incredibly convenient for creating small dataframes; here is how you can use it:
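As a rough sketch of what the conversion saves you from writing by hand, here is the equivalent construction in plain pyspark; the converter's actual API and function names in this PR may differ:

```python
import numpy as np
from pyspark.sql import SparkSession
from pyspark.ml.linalg import Vectors

spark = SparkSession.builder.getOrCreate()

# A tiny dataset described as named numpy arrays.
arrays = {
    "label": np.array([0.0, 1.0, 0.0]),
    "features": np.array([[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]]),
}

# What the converter automates: zip the arrays row-wise and build a DataFrame,
# turning the 2-D array into a vector column and the 1-D array into doubles.
rows = [(float(l), Vectors.dense(f))
        for l, f in zip(arrays["label"], arrays["features"])]
df = spark.createDataFrame(rows, ["label", "features"])
df.show()
```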
And the conversion in the other direction: it extracts the proper shape for vectors, matrices, etc.
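A sketch of what that reverse direction amounts to, again with placeholder code rather than the converter's real API (continuing from the snippet above):

```python
import numpy as np

# Collect the DataFrame back into named numpy arrays, stacking the vector
# column into a 2-D array so the original shape is recovered.
collected = df.collect()
arrays_back = {
    "label": np.array([row["label"] for row in collected]),
    "features": np.vstack([row["features"].toArray() for row in collected]),
}
print(arrays_back["features"].shape)  # (3, 2)
```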
Currently missing are more tests, better names, and sparse vectors. I am not sure how easy sparse vectors are to support, because they have an irregular shape between rows. It is probably easier to disallow them and force users to use the CSC conversion that you already wrote.