ARROW-12214: [Rust][DataFusion] Add tests for limit #9897
    async fn limit() -> Result<()> {
        let tmp_dir = TempDir::new()?;
        let mut ctx = create_ctx(&tmp_dir, 1)?;
        ctx.register_table("t", table_with_sequence(1, 1000).unwrap())
I was looking at Limit yesterday - it seems to poll new batches even when the limit has been reached (and throw away the result in the end)? Not wrong - but quite inefficient of course :)
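To illustrate the behavior being described, here is a minimal sketch of a limit operator that stops polling its input once the limit is reached. It is not DataFusion's implementation: `Batch`, `CountingInput`, and `Limit` are hypothetical names, and plain iterators stand in for the async record batch streams the real operator works on.

```rust
/// Hypothetical stand-in for a record batch: just a vector of row values.
type Batch = Vec<i32>;

/// Wraps an input iterator of batches and counts how often it is polled.
/// (Illustrative helper, not part of DataFusion.)
struct CountingInput<I: Iterator<Item = Batch>> {
    inner: I,
    polls: usize,
}

impl<I: Iterator<Item = Batch>> Iterator for CountingInput<I> {
    type Item = Batch;

    fn next(&mut self) -> Option<Batch> {
        self.polls += 1;
        self.inner.next()
    }
}

/// Limit sketch: yields at most `remaining` rows in total and never
/// polls the input again after the limit has been satisfied.
struct Limit<I: Iterator<Item = Batch>> {
    input: I,
    remaining: usize,
}

impl<I: Iterator<Item = Batch>> Iterator for Limit<I> {
    type Item = Batch;

    fn next(&mut self) -> Option<Batch> {
        // Limit already reached: return early without touching the input.
        if self.remaining == 0 {
            return None;
        }
        let mut batch = self.input.next()?;
        // Keep only as many rows as are still allowed.
        batch.truncate(self.remaining);
        self.remaining -= batch.len();
        Some(batch)
    }
}
```

With three input batches of three rows each and a limit of 4, this version polls the input twice and never requests the third batch. An operator that drains its input and discards the surplus would return the same rows while doing the extra work.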
I am not sure --- I have a program (https://github.com/influxdata/influxdb_iox/pull/1117) that I made for testing some stuff as we roll out IOx and there is definitely a problem with limit, though I haven't figured out if it is a problem in DataFusion or not
For this plan:
> explain select * from chunks limit 1;
+--------------+-----------------------------------------------------------------------------------------------------------------------------------------+
| plan_type | plan |
+--------------+-----------------------------------------------------------------------------------------------------------------------------------------+
| logical_plan | Limit: 1 |
| | Projection: #database_name, #id, #partition_key, #storage, #estimated_bytes, #time_of_first_write, #time_of_last_write, #time_closing |
| | TableScan: chunks projection=None |
+--------------+-----------------------------------------------------------------------------------------------------------------------------------------+
1 rows in set. Query took 0 seconds.
Produces no rows (and no columns) when there is a limit:
> select * from chunks limit 1;
++
||
++
++
0 rows in set. Query took 0 seconds.
But there is definitely data in that table:
> select * from chunks;
+-----------------------------------+-----+---------------------+---------------------+-----------------+-------------------------------+-------------------------------+-------------------------------+
| database_name | id | partition_key | storage | estimated_bytes | time_of_first_write | time_of_last_write | time_closing |
+-----------------------------------+-----+---------------------+---------------------+-----------------+-------------------------------+-------------------------------+-------------------------------+
| 844910ece80be8bc_05a7a51565539000 | 0 | 2021-04-05 21:00:00 | OpenMutableBuffer | 259733 | 2021-04-05 21:29:38.978576237 | 2021-04-05 21:49:47.995408514 | |
....
| 844910ece80be8bc_eaec8df57a81a1e9 | 1 | 2021-04-05 21:00:00 | OpenMutableBuffer | 6408933 | 2021-04-05 21:44:31.507950286 | 2021-04-05 21:55:37.226659960 | |
+-----------------------------------+-----+---------------------+---------------------+-----------------+-------------------------------+-------------------------------+-------------------------------+
226 rows in set. Query took 0 seconds.
I am still looking into it
https://issues.apache.org/jira/browse/ARROW-12235 is one possible problem I found
> I was looking at Limit yesterday - it seems to poll new batches even when the limit has been reached (and throw away the result in the end)? Not wrong - but quite inefficient of course :)
@Dandandan I spent some time looking into this and you were absolutely correct. I have a PR (draft) up with a proposed fix: #9926
I was debugging another issue (not a bug in DataFusion I don't think) but noticed there wasn't any coverage for LIMIT in exec.rs, so I figured I would add some.
(well really I was writing a test to trigger what I thought was a bug in DataFusion -- lol)
Closes apache#9897 from alamb/limit_fix
Authored-by: Andrew Lamb <andrew@nerdnetworks.org>
Signed-off-by: Andrew Lamb <andrew@nerdnetworks.org>