Skip to content

[R] Cast scalars to type of field in Expression building #32726

@asfimport

Description

@asfimport

After looking at the ExecPlan output of some queries, it jumped out at me how we translate int_field == 5 in R as cast(int_field, float64) == 5 because 5 is a double in R.

This extra work has a noticeable performance impact. Here's a simple query on the taxi dataset, filtering down to 54 out of 1.5 billion rows and selecting a single column. My idea was to make a query that does not much other than evaluate the filter.

> system.time(ds |> select(passenger_count) |> filter(passenger_count > 10) |> compute())
   user  system elapsed 
  0.391   0.024   0.362 

> system.time(ds |> select(passenger_count) |> filter(passenger_count > Scalar$create(10, type = int8())) |> compute())
   user  system elapsed 
  0.206   0.025   0.179 

You can see the difference in the query plans too:

> ds |> select(passenger_count) |> filter(passenger_count > 10) |> explain()
ExecPlan with 4 nodes:
3:SinkNode{}
  2:ProjectNode{projection=[passenger_count]}
    1:FilterNode{filter=(cast(passenger_count, {to_type=double, allow_int_overflow=false, allow_time_truncate=false, allow_time_overflow=false, allow_decimal_truncate=false, allow_float_truncate=false, allow_invalid_utf8=false}) > 10)}
      0:SourceNode{}

> ds |> select(passenger_count) |> filter(passenger_count > Scalar$create(10, type = int8())) |> explain()
ExecPlan with 4 nodes:
3:SinkNode{}
  2:ProjectNode{projection=[passenger_count]}
    1:FilterNode{filter=(passenger_count > 10)}
      0:SourceNode{}

Ideally Acero would do this more intelligently (cf. ARROW-11402), but we should also be able to do smarter things when assembling the Expression in R.

Reporter: Neal Richardson / @nealrichardson
Assignee: Neal Richardson / @nealrichardson

Related issues:

PRs and other links:

Note: This issue was originally created as ARROW-17462. Please see the migration documentation for further details.

Metadata

Metadata

Type

No type

Projects

No projects

Milestone

Relationships

None yet

Development

No branches or pull requests

Issue actions