chore(datasets): Add /api/v1/dataset/{pk}/column endpoint#22332
chore(datasets): Add /api/v1/dataset/{pk}/column endpoint#22332john-bodley wants to merge 1 commit into
Conversation
There was a problem hiding this comment.
Regenerated by running superset update-api-docs.
f4c2c43 to
a4aa974
Compare
There was a problem hiding this comment.
we should probably check and move allow_browser_login = True into BaseSupersetModelRestApi since it seems it's used everywhere, I'll make a PR for it
a4aa974 to
939d13e
Compare
Codecov Report
@@ Coverage Diff @@
## master #22332 +/- ##
==========================================
- Coverage 66.75% 66.11% -0.65%
==========================================
Files 1847 1847
Lines 70562 70577 +15
Branches 7742 7737 -5
==========================================
- Hits 47106 46664 -442
- Misses 21455 21906 +451
- Partials 2001 2007 +6
Flags with carried forward coverage won't be shown. Click here to find out more.
📣 We’re building smart automated test selection to slash your CI/CD build times. Learn more |
939d13e to
f8d36c7
Compare
|
Thanks for taking this performance issue under your wing. I'm hopeful this can mean other orgs with this issue can take this new filters feature to production, and move us another step toward 3.0 and the end of Filter Boxes :D |
7684198 to
f1bd7f2
Compare
ktmud
left a comment
There was a problem hiding this comment.
Not sure why cypress is failing but the change makes sense to me
There was a problem hiding this comment.
| endpoint: `/api/v1/dataset/${datasetId}/column`, | |
| endpoint: `/api/v1/dataset/${datasetId}/columns`, |
Should this end point be /columns?
There was a problem hiding this comment.
+1 on columns as it represents a set of columns.
I added a ticket to Superset 3.0 project board to refactor the endpoints to use the plural form.
There was a problem hiding this comment.
@ktmud and @michael-s-molina I agree that it should be columns and not column—it should also be datasets and not dataset, however per,
we already use the singular form, hence using column ensures consistency which I sense is more important. Per @michael-s-molina's comment I think changing to plural forms should be a bulk operation.
There was a problem hiding this comment.
@dpgaspar, @michael-s-molina, or @villebro I would love your input on this as I'm not overly familiar with the new FAB API and/or how best to define said endpoint. I came across the pattern by prodding around.
There was a problem hiding this comment.
Can we use something more like: https://github.com/apache/superset/blob/master/superset/annotation_layers/annotations/api.py#L129
why the rison_data["page_size"] = -1 because the frontend is not handling pagination yet?
also we probably need to apply DatasourceFilter here, or users will be able to view dataset columns from datasets their not authorized to view
There was a problem hiding this comment.
@dpgaspar I added the rison_data["page_size"] = -1 as the /api/v1/dataset/{pk} response is non-paginated and I wanted to ensure consistency. Additionally this endpoint is currently solely used in a non-paginated way, i.e., it's not powering any /*/list/ endpoints.
There was a problem hiding this comment.
Regarding the pattern I see it being only used twice and it only applies one rison override (as opposed to two which is needed in our case). I think there's merit in refactoring if it's used more than once. Furthermore this example is actually a misnomer in terms of naming, i.e., the function name has been copied and pasted as is even though it's applying the report_schedule as opposed to layer.
There was a problem hiding this comment.
I Agree let's refactor this after. Still think we need to apply the DatasourceFilter here, have you checked it?
There was a problem hiding this comment.
I'm not overly sure the difference between list_columns and show_columns but I thought that it was prudent that the table_columns columns returned was consistent with the columns field in the /api/v1/dataset response.
There was a problem hiding this comment.
list_columns is for get_list and show_columns is for get item. Not great naming, at the time I aimed for compatibility with the previous MovelView classes on FAB to allow for a smooth migration
There was a problem hiding this comment.
Can we use this opportunity to add comments in the code explaining the difference?
dpgaspar
left a comment
There was a problem hiding this comment.
This is great @john-bodley! thank you
There was a problem hiding this comment.
Can we use something more like: https://github.com/apache/superset/blob/master/superset/annotation_layers/annotations/api.py#L129
why the rison_data["page_size"] = -1 because the frontend is not handling pagination yet?
also we probably need to apply DatasourceFilter here, or users will be able to view dataset columns from datasets their not authorized to view
There was a problem hiding this comment.
we should probably check and move allow_browser_login = True into BaseSupersetModelRestApi since it seems it's used everywhere, I'll make a PR for it
There was a problem hiding this comment.
we just need the RouteMethod.GET_LIST
There was a problem hiding this comment.
list_columns is for get_list and show_columns is for get item. Not great naming, at the time I aimed for compatibility with the previous MovelView classes on FAB to allow for a smooth migration
f1bd7f2 to
6845322
Compare
|
Kicking CI 🤞 |
|
Regrettably when examining the initial problem I missed that the I thought it was prudent to return to the drawing board and close this PR in favor of trying to optimize the underlying FAB/SQLAlchemy logic when fetching a model with its associated relationships. @dpgaspar hopefully dpgaspar/Flask-AppBuilder#1959 is a step in the right direction. All going well I'll follow up with another PR which updates the existing |


SUMMARY
Per SIP-64 this PR adds the
/api/v1/dataset/{pk}/columnendpoints for obtaining just the columns associated with a dataset to aid the performance of the native filters.As indicated in the SIP the
/api/v1/dataset/{pk}endpoint is highly inefficient when fetching a dataset comprising of a large number of columns and/or metrics, i.e., specifically at Airbnb we have a virtual dataset which houses thousands of metrics and dimensions which results in an interim query which computes the cross product of# metricsx# columnsx# owners—resulting in millions of rows.Given the native filters only use the columns/metrics independently from the
/api/v1/dataset/{pk}endpoint this PR simply adds a new endpoint to fetch only the columns metadata associated with said dataset.BEFORE/AFTER SCREENSHOTS OR ANIMATED GIF
TESTING INSTRUCTIONS
Added unit tests. Also tested the new endpoint with a dataset comprising of ~ 7,000 columns and the
/api/v1/dataset/{pk}/columnendpoint returned in ~ 1 second whereas/api/v1/dataset/{pk}timed out.ADDITIONAL INFORMATION