chore(datasets): Add /api/v1/dataset/{pk}/column endpoint by john-bodley · Pull Request #22332 · apache/superset

john-bodley · 2022-12-05T21:54:29Z

SUMMARY

Per SIP-64 this PR adds the /api/v1/dataset/{pk}/column endpoints for obtaining just the columns associated with a dataset to aid the performance of the native filters.

As indicated in the SIP the /api/v1/dataset/{pk} endpoint is highly inefficient when fetching a dataset comprising of a large number of columns and/or metrics, i.e., specifically at Airbnb we have a virtual dataset which houses thousands of metrics and dimensions which results in an interim query which computes the cross product of # metrics x # columns x # owners—resulting in millions of rows.

Given the native filters only use the columns/metrics independently from the /api/v1/dataset/{pk} endpoint this PR simply adds a new endpoint to fetch only the columns metadata associated with said dataset.

BEFORE/AFTER SCREENSHOTS OR ANIMATED GIF

TESTING INSTRUCTIONS

Added unit tests. Also tested the new endpoint with a dataset comprising of ~ 7,000 columns and the /api/v1/dataset/{pk}/column endpoint returned in ~ 1 second whereas /api/v1/dataset/{pk} timed out.

ADDITIONAL INFORMATION

Has associated issue:
Required feature flags:
Changes UI
Includes DB Migration (follow approval process in SIP-59)
- Migration is atomic, supports rollback & is backwards-compatible
- Confirm DB migration upgrade and downgrade tested
- Runtime estimates and downtime expectations provided
Introduces new feature or API
Removes existing feature or API

john-bodley · 2022-12-05T21:54:56Z

Regenerated by running superset update-api-docs.

john-bodley · 2022-12-06T01:19:15Z

we should probably check and move allow_browser_login = True into BaseSupersetModelRestApi since it seems it's used everywhere, I'll make a PR for it

codecov · 2022-12-06T02:18:44Z

Codecov Report

Merging #22332 (e10ca2b) into master (d41cb66) will decrease coverage by 0.64%.
The diff coverage is 78.26%.

❗ Current head e10ca2b differs from pull request most recent head 6845322. Consider uploading reports for the commit 6845322 to get more accurate results

@@            Coverage Diff             @@
##           master   #22332      +/-   ##
==========================================
- Coverage   66.75%   66.11%   -0.65%     
==========================================
  Files        1847     1847              
  Lines       70562    70577      +15     
  Branches     7742     7737       -5     
==========================================
- Hits        47106    46664     -442     
- Misses      21455    21906     +451     
- Partials     2001     2007       +6

Flag	Coverage Δ
hive	`52.53% <76.19%> (?)`
mysql	`?`
postgres	`?`
presto	`52.43% <76.19%> (+<0.01%)`	⬆️
python	`79.67% <76.19%> (-1.34%)`	⬇️
sqlite	`76.48% <76.19%> (-0.01%)`	⬇️
unit	`50.92% <76.19%> (+0.01%)`	⬆️

Flags with carried forward coverage won't be shown. Click here to find out more.

Impacted Files	Coverage Δ
superset/datasets/columns/api.py	`70.00% <76.19%> (-30.00%)`	⬇️
...tersConfigModal/FiltersConfigForm/ColumnSelect.tsx	`76.92% <100.00%> (ø)`
superset/datasets/commands/create.py	`30.61% <0.00%> (-69.39%)`	⬇️
superset/datasets/commands/update.py	`25.00% <0.00%> (-69.05%)`	⬇️
superset/datasets/commands/bulk_delete.py	`33.33% <0.00%> (-53.34%)`	⬇️
superset/datasets/columns/commands/delete.py	`44.11% <0.00%> (-52.95%)`	⬇️
superset/datasets/metrics/commands/delete.py	`44.11% <0.00%> (-52.95%)`	⬇️
superset/datasets/dao.py	`44.21% <0.00%> (-50.35%)`	⬇️
superset/sql_validators/postgres.py	`50.00% <0.00%> (-50.00%)`	⬇️
superset/datasets/commands/delete.py	`48.57% <0.00%> (-48.58%)`	⬇️
... and 45 more

📣 We’re building smart automated test selection to slash your CI/CD build times. Learn more

rusackas · 2022-12-06T16:41:38Z

Thanks for taking this performance issue under your wing. I'm hopeful this can mean other orgs with this issue can take this new filters feature to production, and move us another step toward 3.0 and the end of Filter Boxes :D

ktmud

Not sure why cypress is failing but the change makes sense to me

ktmud · 2022-12-06T19:46:24Z

Suggested change

endpoint: `/api/v1/dataset/${datasetId}/column`,

endpoint: `/api/v1/dataset/${datasetId}/columns`,

Should this end point be /columns?

+1 on columns as it represents a set of columns.

I added a ticket to Superset 3.0 project board to refactor the endpoints to use the plural form.

@ktmud and @michael-s-molina I agree that it should be columns and not column—it should also be datasets and not dataset, however per,

we already use the singular form, hence using column ensures consistency which I sense is more important. Per @michael-s-molina's comment I think changing to plural forms should be a bulk operation.

john-bodley · 2022-12-06T23:33:58Z

@dpgaspar, @michael-s-molina, or @villebro I would love your input on this as I'm not overly familiar with the new FAB API and/or how best to define said endpoint. I came across the pattern by prodding around.

Can we use something more like: https://github.com/apache/superset/blob/master/superset/annotation_layers/annotations/api.py#L129

why the rison_data["page_size"] = -1 because the frontend is not handling pagination yet?

also we probably need to apply DatasourceFilter here, or users will be able to view dataset columns from datasets their not authorized to view

@dpgaspar I added the rison_data["page_size"] = -1 as the /api/v1/dataset/{pk} response is non-paginated and I wanted to ensure consistency. Additionally this endpoint is currently solely used in a non-paginated way, i.e., it's not powering any /*/list/ endpoints.

Regarding the pattern I see it being only used twice and it only applies one rison override (as opposed to two which is needed in our case). I think there's merit in refactoring if it's used more than once. Furthermore this example is actually a misnomer in terms of naming, i.e., the function name has been copied and pasted as is even though it's applying the report_schedule as opposed to layer.

I Agree let's refactor this after. Still think we need to apply the DatasourceFilter here, have you checked it?

john-bodley · 2022-12-06T23:35:51Z

I'm not overly sure the difference between list_columns and show_columns but I thought that it was prudent that the table_columns columns returned was consistent with the columns field in the /api/v1/dataset response.

list_columns is for get_list and show_columns is for get item. Not great naming, at the time I aimed for compatibility with the previous MovelView classes on FAB to allow for a smooth migration

Can we use this opportunity to add comments in the code explaining the difference?

dpgaspar

This is great @john-bodley! thank you

dpgaspar · 2022-12-07T10:46:56Z

Can we use something more like: https://github.com/apache/superset/blob/master/superset/annotation_layers/annotations/api.py#L129

why the rison_data["page_size"] = -1 because the frontend is not handling pagination yet?

also we probably need to apply DatasourceFilter here, or users will be able to view dataset columns from datasets their not authorized to view

dpgaspar · 2022-12-07T10:56:10Z

we should probably check and move allow_browser_login = True into BaseSupersetModelRestApi since it seems it's used everywhere, I'll make a PR for it

dpgaspar · 2022-12-07T10:59:12Z

we just need the RouteMethod.GET_LIST

dpgaspar · 2022-12-07T11:01:06Z

list_columns is for get_list and show_columns is for get item. Not great naming, at the time I aimed for compatibility with the previous MovelView classes on FAB to allow for a smooth migration

rusackas · 2022-12-08T23:18:45Z

Kicking CI 🤞

john-bodley · 2022-12-10T00:44:49Z

Regrettably when examining the initial problem I missed that the /api/v1/dataset/{pk} endpoint is also used as part of the native filters to fetch the metrics + dataset metadata here.

I thought it was prudent to return to the drawing board and close this PR in favor of trying to optimize the underlying FAB/SQLAlchemy logic when fetching a model with its associated relationships. @dpgaspar hopefully dpgaspar/Flask-AppBuilder#1959 is a step in the right direction.

All going well I'll follow up with another PR which updates the existing /api/v1/dataset/{pk} queries to filter on only the relevant dataset, columns, and metrics fields which should further improve query performance and reduce the response payload size.

pull-request-size Bot added the size/XXL label Dec 5, 2022

john-bodley commented Dec 5, 2022

View reviewed changes

john-bodley force-pushed the john-bodley--api-datasets-columns branch 4 times, most recently from f4c2c43 to a4aa974 Compare December 6, 2022 01:17

john-bodley commented Dec 6, 2022

View reviewed changes

john-bodley marked this pull request as ready for review December 6, 2022 01:31

john-bodley force-pushed the john-bodley--api-datasets-columns branch from a4aa974 to 939d13e Compare December 6, 2022 02:10

john-bodley force-pushed the john-bodley--api-datasets-columns branch from 939d13e to f8d36c7 Compare December 6, 2022 08:50

rusackas requested review from betodealmeida, dpgaspar and villebro December 6, 2022 16:39

john-bodley force-pushed the john-bodley--api-datasets-columns branch 3 times, most recently from 7684198 to f1bd7f2 Compare December 6, 2022 18:44

john-bodley requested review from geido and jinghua-qa as code owners December 6, 2022 18:44

john-bodley requested a review from ktmud December 6, 2022 19:14

ktmud approved these changes Dec 6, 2022

View reviewed changes

john-bodley commented Dec 6, 2022

View reviewed changes

dpgaspar reviewed Dec 7, 2022

View reviewed changes

chore(datasets): Add /api/v1/dataset/{pk}/column endpoint

6845322

john-bodley force-pushed the john-bodley--api-datasets-columns branch from f1bd7f2 to 6845322 Compare December 8, 2022 19:55

john-bodley changed the title ~~chore(datasets): Add /api/v1/dataset/{pk}/column endpoint~~ chore(datasets): Add /api/v1/dataset/{pk}/[column|metric] endpoints Dec 9, 2022

john-bodley changed the title ~~chore(datasets): Add /api/v1/dataset/{pk}/[column|metric] endpoints~~ chore(datasets): Add /api/v1/dataset/{pk}/column endpoint Dec 9, 2022

john-bodley closed this Dec 10, 2022

john-bodley mentioned this pull request Dec 14, 2022

chore(sqla): Add explicit bidirectional performant relationships for SQLA model #22413

Merged

9 tasks

	endpoint: `/api/v1/dataset/${datasetId}/column`,
	endpoint: `/api/v1/dataset/${datasetId}/columns`,

Conversation

john-bodley commented Dec 5, 2022 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

SUMMARY

BEFORE/AFTER SCREENSHOTS OR ANIMATED GIF

TESTING INSTRUCTIONS

ADDITIONAL INFORMATION

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

codecov Bot commented Dec 6, 2022 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Codecov Report

Uh oh!

rusackas commented Dec 6, 2022

Uh oh!

ktmud left a comment

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

michael-s-molina Dec 7, 2022 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Choose a reason for hiding this comment

Uh oh!

john-bodley Dec 8, 2022 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

dpgaspar left a comment

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

rusackas commented Dec 8, 2022

Uh oh!

john-bodley commented Dec 10, 2022 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

5 participants

john-bodley commented Dec 5, 2022 •

edited

Loading

codecov Bot commented Dec 6, 2022 •

edited

Loading

michael-s-molina Dec 7, 2022 •

edited

Loading

john-bodley Dec 8, 2022 •

edited

Loading

john-bodley commented Dec 10, 2022 •

edited

Loading