feat: Add parquet upload #14449
|
@john-bodley @villebro just wanted to follow up on this in case it got lost in the shuffle. How should we proceed on this PR? |
|
@exemplary-citizen sorry for dropping the ball on this - I'll have this reviewed within the next 24 hours |
06f9602 to 4cc378a
|
/testenv up |
| "iterator": True, | ||
| "keep_default_na": not form.null_values.data, | ||
| "mangle_dupe_cols": form.mangle_dupe_cols.data, | ||
| "usecols": form.usecols.data, |
This appears to change the behavior of the existing CSV upload functionality by specifying columns. Can you add some tests around this?
added a scenario to test_import_csv that tests uploading a CSV with specific columns
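For readers unfamiliar with the option under discussion, here is a minimal sketch of what `usecols` does when it is forwarded to `pandas.read_csv`. The sample data and the helper name `read_selected_columns` are assumptions made for illustration; they are not taken from the PR or from Superset's test suite.

```python
import io

import pandas as pd

# Hypothetical sample data standing in for an uploaded CSV file.
CSV_TEXT = "a,b,c\n1,2,3\n4,5,6\n"


def read_selected_columns(csv_text, usecols=None):
    """Read CSV text, keeping only `usecols` when it is not None."""
    # With usecols=None pandas reads every column, which is the pre-existing behavior.
    return pd.read_csv(io.StringIO(csv_text), usecols=usecols)


subset = read_selected_columns(CSV_TEXT, usecols=["a", "c"])
everything = read_selected_columns(CSV_TEXT)
```

Leaving the field empty should therefore keep the existing CSV behavior, which is the case the requested test needs to cover alongside the column-subset case.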
| "If not None, only these columns will be read from the file." | ||
| ), | ||
| validators=[Optional()], | ||
| ) |
Can you provide a screenshot of the updated form UI?
Added a screenshot to the summary above
|
@robdiciuccio Ephemeral environment spinning up at http://34.214.127.48:8080. Credentials are |
villebro
left a comment
A few comments:
- While it's convenient to add this functionality to the CSV upload form, I feel this should be in a form of its own, as the majority of the current fields are specific to CSV only (IIUC only usecols is needed for Parquet upload)
- If the CSV upload form will also handle Parquet, the title needs to be updated to reflect this. However, I'd personally prefer moving this into a form of its own.
- It would be nice if the form could handle directories/zip files, as it's fairly common to have partitioned data that is split up into multiple Parquet files. As pandas also supports reading from a directory path, this would be a great feature to avoid having to manually upload and append each file.
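To make the zip suggestion concrete, here is a rough sketch of unpacking an uploaded archive into in-memory buffers while checking that every member shares one extension. It uses only the standard library; the function names `extract_columnar_members` and `make_zip` are hypothetical and are not part of Superset.

```python
import io
import zipfile


def extract_columnar_members(zip_bytes):
    """Return (extension, buffers) for a zip whose members all share one extension."""
    with zipfile.ZipFile(io.BytesIO(zip_bytes)) as archive:
        # Skip directory entries; only real files carry data.
        names = [name for name in archive.namelist() if not name.endswith("/")]
        extensions = {name.rsplit(".", 1)[-1].lower() for name in names}
        if len(extensions) != 1:
            raise ValueError(f"Expected exactly one file type, found: {sorted(extensions)}")
        # Read each member fully so the buffers outlive the archive handle.
        buffers = [io.BytesIO(archive.read(name)) for name in names]
    return extensions.pop(), buffers


def make_zip(members):
    """Build an in-memory zip from a {name: bytes} mapping (helper for the demo)."""
    buffer = io.BytesIO()
    with zipfile.ZipFile(buffer, "w") as archive:
        for name, payload in members.items():
            archive.writestr(name, payload)
    return buffer.getvalue()


# Placeholder payloads; real partitions would hold Parquet data.
demo_zip = make_zip({"part-0.parquet": b"aaa", "part-1.parquet": b"bbb"})
extension, partitions = extract_columnar_members(demo_zip)
```

Each buffer could then be handed to `pandas.read_parquet` and the resulting frames combined with `pandas.concat`, assuming a Parquet engine such as pyarrow is installed.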
| config["ALLOWED_EXTENSIONS"].intersection(config["CSV_EXTENSIONS"]), | ||
| config["ALLOWED_EXTENSIONS"].intersection( | ||
| config["CSV_EXTENSIONS"].union(config["OTHER_EXTENSIONS"]) | ||
| ), |
You need to add the union to the error message below.
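A small sketch of the point being made: if the accepted set is computed from the union, the error message should list that same set. The set contents and the helper names below are hypothetical stand-ins, not the real Superset config values or view code.

```python
# Hypothetical values; the real sets live in Superset's config.
ALLOWED_EXTENSIONS = {"csv", "tsv", "parquet"}
CSV_EXTENSIONS = {"csv", "tsv", "txt"}
OTHER_EXTENSIONS = {"parquet", "zip"}


def accepted_extensions(allowed, csv_exts, other_exts):
    """Extensions the form accepts: allowed AND (CSV OR other)."""
    return allowed & (csv_exts | other_exts)


def extension_error(filename, accepted):
    """Return None when the file is acceptable, else a message listing the same set."""
    extension = filename.rsplit(".", 1)[-1].lower() if "." in filename else ""
    if extension in accepted:
        return None
    listed = ", ".join(sorted(accepted))
    return f"Only the following file extensions are allowed: {listed}"


accepted = accepted_extensions(ALLOWED_EXTENSIONS, CSV_EXTENSIONS, OTHER_EXTENSIONS)
```

Building the message from the same computed set keeps the validation and the user-facing text from drifting apart.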
|
Yeah I agree that this probably belongs in a separate form. Just went in this direction because creating a new form would mean that we'd effectively be abandoning #13834. I'll go ahead and get started working on a new form for |
|
@villebro can you restart CI? |
360e17e to 5b5eb74
|
@villebro should come back green now |
|
@exemplary-citizen there's a linting error in one of the files. You can either set up pre-commit hooks or apply the diff below to fix the problem:
diff --git a/superset/views/database/views.py b/superset/views/database/views.py
index 8d0a92f6c..3863b165c 100644
--- a/superset/views/database/views.py
+++ b/superset/views/database/views.py
@@ -406,17 +406,23 @@ class ColumnarToDatabaseView(SimpleFormView):
     def form_get(self, form: ColumnarToDatabaseForm) -> None:
         form.if_exists.data = "fail"
-    def form_post(self, form: ColumnarToDatabaseForm) -> Response:  # pylint: disable=too-many-locals
+    def form_post(
+        self, form: ColumnarToDatabaseForm
+    ) -> Response:  # pylint: disable=too-many-locals
         database = form.con.data
         columnar_table = Table(table=form.name.data, schema=form.schema.data)
         files = form.columnar_file.data
         file_type = {file.filename.split(".")[-1] for file in files}
         if file_type == {"zip"}:
-            zipfile_ob = zipfile.ZipFile(form.columnar_file.data[0])  # pylint: disable=consider-using-with
+            zipfile_ob = zipfile.ZipFile(
+                form.columnar_file.data[0]
+            )  # pylint: disable=consider-using-with
             file_type = {filename.split(".")[-1] for filename in zipfile_ob.namelist()}
             files = [
-                io.BytesIO((zipfile_ob.open(filename).read(), filename)[0])  # pylint: disable=consider-using-with
+                io.BytesIO(
+                    (zipfile_ob.open(filename).read(), filename)[0]
+                )  # pylint: disable=consider-using-with
                 for filename in zipfile_ob.namelist()
             ]
|
@villebro took care of the code formatting |
|
@exemplary-citizen sorry to bother you again, but we've recently updated the version of |
villebro
left a comment
LGTM! While testing I found some edge cases that caused trouble, but those can be improved upon later (I'll try to open up a PR for some of them; will tag you for a review when I do).
|
Ephemeral environment shutdown and build artifacts deleted. |
That sounds great @exemplary-citizen! Oh, and I forgot; thanks so much for your patience with the review process! |
* allow csv upload to accept parquet file
* fix mypy
* fix if statement
* add test for specificying columns in CSV upload
* clean up test
* change order in test
* fix failures
* upload parquet to seperate table in test
* fix error message
* fix mypy again
* rename other extensions to columnar
* add new form for columnar upload
* add support for zip files
* undo csv form changes except usecols
* add more tests for zip
* isort & black
* pylint
* fix trailing space
* address more review comments
* pylint
* black
* resolve remaining issues
Hi @villebro and @exemplary-citizen, I am using Superset v2.1.0 with Docker Compose and I couldn't upload a Parquet file to Superset. Has this feature been deprecated in the new version? |
I am sorry for the question above, I see the Is there a way to programmatically import Parquet files into the Superset DB? Thanks |


SUMMARY
Allow CSV upload form to accept parquet file. Went in this direction so as not to exacerbate what was brought up in #13834 by adding a new form specifically for parquet files. I believe small modifications can be made to this PR to accommodate feather and orc files.
BEFORE/AFTER SCREENSHOTS OR ANIMATED GIF
TEST PLAN
Added a test similar to the ones already in csv_upload_tests.py
ADDITIONAL INFORMATION
Fixes #14020