replace substring globally with substr#4447
Conversation
Codecov Report
@@ Coverage Diff @@
## master #4447 +/- ##
==========================================
+ Coverage 99.47% 99.60% +0.13%
==========================================
Files 75 72 -3
Lines 14808 13913 -895
==========================================
- Hits 14730 13858 -872
+ Misses 78 55 -23
Continue to review full report at Codecov.
|
|
Have you seen this branch? https://github.com/Rdatatable/data.table/compare/micro-optimizations It includes |
| } | ||
|
|
||
| # R 3.3.0 [April 2016] | ||
| if (!exists('startsWith', as.environment('package:base'))) { |
There was a problem hiding this comment.
It unlikely matters, but it may be slightly more robust to pass inherits=FALSE to exists() here.
For example: exists("startsWith", "package:utils") returns TRUE even though startsWith is in base not utils. exists("startsWith", "package:utils", inherits=FALSE) is FALSE as expected.
But do we care that startsWith is in base and not moved to utils in future? No. Maybe just exists("startsWith") then. I don't think that would pick up startsWith defined in another package the user has loaded because packages have a namespace which protects their internals from seeing functions defined in other packages. However I'm not quite sure about utils vs base; i.e. whether exists("startsWith") would find it if it is moved to utils.
Not something to worry about. I'll just add the inherits=FALSE and be done with it.
There was a problem hiding this comment.
Good point. I think you're right that it doesn't matter but I agree it's probably better practice to use inherits=FALSE
| for (..name in av) { | ||
| name = substring(..name, 3L) | ||
| if (name=="") stop("The symbol .. is invalid. The .. prefix must be followed by at least one character.") | ||
| name = substr(..name, 3L, nchar(..name)) |
There was a problem hiding this comment.
Writing out loud ... I suspect that passing stop=1000000L here instead of nchar(..name) would work, and avoid the allocation for the nchar() result.
Passing .Machine$integer.max might be more robust. But substring()'s default for last= is a hard coded 1000000L. Maybe R's internal max string length is indeed 1000000L characters, but if they ever change that they'll have to remember to update substring's last= argument. If R's maximum string length is greater than 1000000L then substring's argument is already not correct and a bug report could be created.
So, let's see ...
word = paste0(c(rep("A",1e6-2),"hello"), collapse="")
substring(word, 1e6-5)
# [1] "AAAAhe" # unexpectedly chops "llo" off
substring(word, 1e6-5, nchar(word))
# [1] "AAAAhello" # expected resultSo that's a bug in R it seems. I looked at ?substring and although the default of 1000000 is there in the definition, I don't see any text indicating this is intended behaviour. And I see no reason that the default could not be 2^31-1 which is what the max string length is in R, iiuc. Do you feel like reporting it @MichaelChirico? I feel like you get some traction on BugZilla and enjoy interacting there. So I'd appreciate it if you could handle that :-)
There was a problem hiding this comment.
I will take a look. My own quick investigation:
# works
x = strrep(" ", .Machine$integer.max-1)
# fails
x = strrep(" ", .Machine$integer.max)
So I agree it seems like an odd default.
There was a problem hiding this comment.
There was a problem hiding this comment.
Great. Interested to see what their response is. I think my short example would have been better to post as-is, as the post to me on first reading dwells too much on .Machine$integer.max vs .Machine$integer.max-1. A phrase like "makes no sense" can be seen as critical and therefore risks raising hairs and inducing defensive responses. Better just to stick to code, use less English, and be constructive. But who am I to give advice, I didn't even want to engage on R-devel or Bugzilla myself because these difficulties are magnified 10x on those forums. You are braver than I.
There was a problem hiding this comment.
I see the replies now and saw that Brodie concentrated on .Machine$integer.max vs .Machine$integer.max-1. That to me is inefficient communication, as I feared. I see also time spent on defending why the default was 1000000, as I predicted. But it got there in the end which is good. It could have been more efficient by just pointing to the problem (my short example) and leaving them to come up with the best solution. But again: forum threads like this is why I don't engage, so you're braver than me.
There was a problem hiding this comment.
I can guarantee you that some people (perhaps most, perhaps a few, certainly not all, but definitely some) reading that thread now think that the problem is to do with strings that are 2GB long. They therefore think that that it's a silly edge case that hardly ever comes up and folk that do have strings that are 2GB long shouldn't be doing that anyway and instead be restructuring their code.
Where in fact, the problem occurs at 1e6 characters, which is 976K. At under 1MB, let alone GB, a string that large is relatively much more reasonable, commonplace and has relatively little to do with large servers or esoteric edge cases.
That view being formed could have been avoided by using the 1e6 example that I showed above, which is why I created it that way with that in mind. So I should have written "please post this example" and been explicit in that way. I wrote these comments for future reference, for next time.
Ran some benchmarks:
With output:
Admittedly on the microsecond scale, but
substris clearly better thansubstringin all cases (similar applies if the input is a vector -- I tested withs1=rep(s1, 10L), etc).Best of all is
startsWith, but this is only from R 3.3+; nevertheless, I usedstartsWithwhere possible, and added a wrapper tosubstrin casebase::startsWithis unavailable.What exactly is the benefit of
substringis a bit obscure -- seems mainly to do with recycling arguments and naming, neither of which apply to the cases I saw in our code base. There's also the convenientsubstring(x, n)automatically becomessubstring(x, n, nchar(x)), but I don't think it's worth it (and it may be confusing -- doessubstring(x, n)mean the characters 1-n?)Replaced & added to the CRAN_release checks