Skip to content

Conversation

@paleolimbot
Copy link
Member

Implements log(x, base = (something other than 2, e, or 10)).

@github-actions
Copy link

@github-actions
Copy link

⚠️ Ticket has not been started in JIRA, please click 'Start Progress'.

Copy link
Member

@nealrichardson nealrichardson left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for doing this! If you don't have one already, can you sign up for an ASF Jira account so I can assign the issue to you?

input %>%
# suppress 'NaNs produced' warning on the first row of df
# that evaluates to NaN (R raises warning but Arrow does not)
suppressWarnings(mutate(., y = log(x, base = x))) %>%
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks like the old Windows build failed--perhaps the magrittr improvement that supports this is too new for 3.6 binary packages. So maybe the best option is to move the suppressWarnings and the accompanying comment to be around/outside of expect_dplyr_equal().

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I uncovered another corner case in testing this (log(10, base = 1) evaluates to Inf in both R and Arrow). I think the latest commit does this well without silencing any warnings that come out of the other (more common) cases!

Copy link
Contributor

@edponce edponce Sep 13, 2021

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

log(x) with base 1 is not a reasonable calculation because this implies solving the following equation 1^y = x which is only defined for x = 1. Nevertheless, most math libraries will yield a value of infinity for base 1 logarithms when x != 1 and NaN for x = 1. Arrow C++ follows this scheme.

These cases are missing from the tests.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I added a test in 812f941 that ensures NaN for log(1, base = 1) and Inf for log(10, base = 1) in both dplyr and arrow. Is that test reasonable or should it be rewritten/clarified?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think it is reasonable. I also noticed that these corner cases are not tested in the C++ log tests, but that corresponds to another PR.

@paleolimbot
Copy link
Member Author

Sure! I should be 'paleolimbot' in ASF Jira.

@paleolimbot
Copy link
Member Author

Fixed the base with length != 1 issue, although I think my error message could be better since base with length != 1 is supported, just not as a non-expression. Reprex summary of new functionality:

library(arrow)
library(dplyr, warn.conflicts = FALSE)

# with base as a column
RecordBatch$create(a = 2:5, b = 2:5) %>% 
  mutate(log(a, base = b)) %>% 
  collect()
#> # A tibble: 4 × 3
#>       a     b `log(a, base = b)`
#>   <int> <int>              <dbl>
#> 1     2     2                  1
#> 2     3     3                  1
#> 3     4     4                  1
#> 4     5     5                  1
# with base as a scalar
RecordBatch$create(a = 2:5) %>% 
  mutate(log(a, base = 3)) %>% 
  collect()
#> # A tibble: 4 × 2
#>       a `log(a, base = 3)`
#>   <int>              <dbl>
#> 1     2              0.631
#> 2     3              1    
#> 3     4              1.26 
#> 4     5              1.46
# errors for base with length != 1
RecordBatch$create(a = 2:5) %>% 
  mutate(log(a, base = 3:4)) %>% 
  collect()
#> Warning: In log(a, base = 3:4), base with length != 1 not supported by Arrow;
#> pulling data into R
#> # A tibble: 4 × 2
#>       a `log(a, base = 3:4)`
#>   <int>                <dbl>
#> 1     2                0.631
#> 2     3                0.792
#> 3     4                1.26 
#> 4     5                1.16

Should x as a scalar be supported? It currently triggers bringing the data in to R but could theoretically not:

# errors
RecordBatch$create(a = 2:5) %>% 
  mutate(a * log(2)) %>% 
  collect()
#> Warning: Expression a * log(2) not supported in Arrow; pulling data into R
#> # A tibble: 4 × 2
#>       a `a * log(2)`
#>   <int>        <dbl>
#> 1     2         1.39
#> 2     3         2.08
#> 3     4         2.77
#> 4     5         3.47
# works!
RecordBatch$create(a = 2:5) %>% 
  mutate(a * log(Expression$scalar(2))) %>% 
  collect()
#> # A tibble: 4 × 2
#>       a `a * log(Expression$scalar(2))`
#>   <int>                           <dbl>
#> 1     2                            1.39
#> 2     3                            2.08
#> 3     4                            2.77
#> 4     5                            3.47

paleolimbot and others added 2 commits September 13, 2021 21:52
Co-authored-by: Neal Richardson <neal.p.richardson@gmail.com>
…ions (either can be scalar or Expression)
@paleolimbot
Copy link
Member Author

With updated handling of x:

library(arrow, warn.conflicts = FALSE)
library(dplyr, warn.conflicts = FALSE)

# with base as a column
RecordBatch$create(a = 2:5, b = 2:5) %>% 
  mutate(log(a, base = b)) %>% 
  collect()
#> # A tibble: 4 × 3
#>       a     b `log(a, base = b)`
#>   <int> <int>              <dbl>
#> 1     2     2                  1
#> 2     3     3                  1
#> 3     4     4                  1
#> 4     5     5                  1
# with base as a scalar
RecordBatch$create(a = 2:5) %>% 
  mutate(log(a, base = 3)) %>% 
  collect()
#> # A tibble: 4 × 2
#>       a `log(a, base = 3)`
#>   <int>              <dbl>
#> 1     2              0.631
#> 2     3              1    
#> 3     4              1.26 
#> 4     5              1.46
# with x as a scalar
RecordBatch$create(a = 2:5) %>% 
  mutate(log(2, base = a)) %>% 
  collect()
#> # A tibble: 4 × 2
#>       a `log(2, base = a)`
#>   <int>              <dbl>
#> 1     2              1    
#> 2     3              0.631
#> 3     4              0.5  
#> 4     5              0.431
# errors for base with length != 1
RecordBatch$create(a = 2:5) %>% 
  mutate(log(a, base = 3:4)) %>% 
  collect()
#> Warning: In log(a, base = 3:4), base must be a column or a length-1 numeric;
#> other values not supported by Arrow; pulling data into R
#> # A tibble: 4 × 2
#>       a `log(a, base = 3:4)`
#>   <int>                <dbl>
#> 1     2                0.631
#> 2     3                0.792
#> 3     4                1.26 
#> 4     5                1.16
# errors for x with length != 1
RecordBatch$create(a = 2:5) %>% 
  mutate(log(a, base = 3:4)) %>% 
  collect()
#> Warning: In log(a, base = 3:4), base must be a column or a length-1 numeric;
#> other values not supported by Arrow; pulling data into R
#> # A tibble: 4 × 2
#>       a `log(a, base = 3:4)`
#>   <int>                <dbl>
#> 1     2                0.631
#> 2     3                0.792
#> 3     4                1.26 
#> 4     5                1.16

Created on 2021-09-14 by the reprex package (v2.0.1)

Copy link
Member

@nealrichardson nealrichardson left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

One note about a test I think we need to add but otherwise this is excellent, thank you!

Co-authored-by: Neal Richardson <neal.p.richardson@gmail.com>
ViniciusSouzaRoque pushed a commit to s1mbi0se/arrow that referenced this pull request Oct 20, 2021
Implements `log(x, base = (something other than 2, e, or 10))`.

Closes apache#11133 from paleolimbot/arrow-13626-log-base-b

Authored-by: Dewey Dunnington <dewey@fishandwhistle.net>
Signed-off-by: Neal Richardson <neal.p.richardson@gmail.com>
@paleolimbot paleolimbot deleted the arrow-13626-log-base-b branch November 1, 2021 16:25
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants