@alamb alamb commented Apr 5, 2021

I was debugging another issue (I don't think it is a bug in DataFusion) and noticed there wasn't any coverage for LIMIT in exec.rs, so I figured I would add some.

(well really I was writing a test to trigger what I thought was a bug in DataFusion -- lol)

github-actions bot commented Apr 5, 2021

async fn limit() -> Result<()> {
    let tmp_dir = TempDir::new()?;
    let mut ctx = create_ctx(&tmp_dir, 1)?;
    ctx.register_table("t", table_with_sequence(1, 1000).unwrap())
Contributor

I was looking at Limit yesterday - it seems to poll new batches even when the limit has been reached (and throw away the result in the end)? Not wrong - but quite inefficient of course :)
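
To make the concern concrete, here is a minimal sketch of a limit that stops pulling from its input once enough rows have been produced. It is not DataFusion's implementation: it uses a plain `Iterator` over `Vec<i64>` as a stand-in for a stream of record batches, and the names `Batch`, `Limit`, `counted_batches` and `run_limit` are all hypothetical, made up for this illustration.

```rust
use std::cell::Cell;
use std::rc::Rc;

/// Stand-in for a record batch (assumption): just a vector of row values.
type Batch = Vec<i64>;

/// Hypothetical limit adapter that never asks its input for more batches
/// than it needs to produce `limit` rows.
struct Limit<I: Iterator<Item = Batch>> {
    input: I,
    remaining: usize,
}

impl<I: Iterator<Item = Batch>> Limit<I> {
    fn new(input: I, limit: usize) -> Self {
        Self { input, remaining: limit }
    }
}

impl<I: Iterator<Item = Batch>> Iterator for Limit<I> {
    type Item = Batch;

    fn next(&mut self) -> Option<Batch> {
        // Limit already satisfied: return without touching the input.
        if self.remaining == 0 {
            return None;
        }
        let mut batch = self.input.next()?;
        // Trim the last batch so that exactly `limit` rows come out in total.
        batch.truncate(self.remaining);
        self.remaining -= batch.len();
        Some(batch)
    }
}

/// Produces `num_batches` batches of `batch_size` sequential values and
/// counts in `polls` how many batches the consumer actually pulled.
fn counted_batches(
    num_batches: usize,
    batch_size: usize,
    polls: Rc<Cell<usize>>,
) -> impl Iterator<Item = Batch> {
    (0..num_batches).map(move |i| {
        polls.set(polls.get() + 1);
        let start = (i * batch_size) as i64;
        (start..start + batch_size as i64).collect::<Batch>()
    })
}

/// Applies the limit and returns the output rows together with the number
/// of input batches that were pulled to produce them.
fn run_limit(num_batches: usize, batch_size: usize, limit: usize) -> (Vec<i64>, usize) {
    let polls = Rc::new(Cell::new(0));
    let input = counted_batches(num_batches, batch_size, Rc::clone(&polls));
    let rows: Vec<i64> = Limit::new(input, limit).flatten().collect();
    (rows, polls.get())
}
```

With ten input batches of 100 rows each, `run_limit(10, 100, 150)` pulls only two batches; a limit that keeps polling after it is satisfied would pull all ten and discard the extra eight.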

Contributor Author

I am not sure. I have a program (https://github.com/influxdata/influxdb_iox/pull/1117) that I made for testing as we roll out IOx, and there is definitely a problem with limit, though I haven't figured out whether the problem is in DataFusion.

For this plan:

> explain select * from chunks limit 1;
+--------------+-----------------------------------------------------------------------------------------------------------------------------------------+
| plan_type    | plan                                                                                                                                    |
+--------------+-----------------------------------------------------------------------------------------------------------------------------------------+
| logical_plan | Limit: 1                                                                                                                                |
|              |   Projection: #database_name, #id, #partition_key, #storage, #estimated_bytes, #time_of_first_write, #time_of_last_write, #time_closing |
|              |     TableScan: chunks projection=None                                                                                                   |
+--------------+-----------------------------------------------------------------------------------------------------------------------------------------+
1 rows in set. Query took 0 seconds.

Produces no rows (and no columns) when there is a limit:

> select * from chunks limit 1;
++
||
++
++
0 rows in set. Query took 0 seconds.

But there is definitely data in that table:

> select * from chunks;
+-----------------------------------+-----+---------------------+---------------------+-----------------+-------------------------------+-------------------------------+-------------------------------+
| database_name                     | id  | partition_key       | storage             | estimated_bytes | time_of_first_write           | time_of_last_write            | time_closing                  |
+-----------------------------------+-----+---------------------+---------------------+-----------------+-------------------------------+-------------------------------+-------------------------------+
| 844910ece80be8bc_05a7a51565539000 | 0   | 2021-04-05 21:00:00 | OpenMutableBuffer   | 259733          | 2021-04-05 21:29:38.978576237 | 2021-04-05 21:49:47.995408514 |                               |
....
| 844910ece80be8bc_eaec8df57a81a1e9 | 1   | 2021-04-05 21:00:00 | OpenMutableBuffer   | 6408933         | 2021-04-05 21:44:31.507950286 | 2021-04-05 21:55:37.226659960 |                               |
+-----------------------------------+-----+---------------------+---------------------+-----------------+-------------------------------+-------------------------------+-------------------------------+
226 rows in set. Query took 0 seconds.

I am still looking into it.

Contributor Author

https://issues.apache.org/jira/browse/ARROW-12235 is one possible problem I found.

Contributor Author

> I was looking at Limit yesterday - it seems to poll new batches even when the limit has been reached (and throw away the result in the end)? Not wrong - but quite inefficient of course :)

@Dandandan I spent some time looking into this and you were absolutely correct. I have a PR (draft) up with a proposed fix: #9926

@alamb alamb closed this in fc1e54e Apr 6, 2021
@alamb alamb deleted the limit_fix branch April 6, 2021 10:28
pachadotdev pushed a commit to pachadotdev/arrow that referenced this pull request Apr 6, 2021
Closes apache#9897 from alamb/limit_fix

Authored-by: Andrew Lamb <andrew@nerdnetworks.org>
Signed-off-by: Andrew Lamb <andrew@nerdnetworks.org>