replace substring globally with substr #4447

Merged
mattdowle merged 7 commits into master from substring-substr
Jun 16, 2021

Member

@MichaelChirico MichaelChirico commented May 14, 2020

Ran some benchmarks:

s1 = 'na.rm'
s2 = 'trim'
s3 = paste(rep(letters, 50L), collapse = '')
s4 = paste0('na', s3)

microbenchmark::microbenchmark(
  times = 1e5,
  grepl('^na', s1),
  grepl('^na', s2),
  grepl('^na', s3),
  grepl('^na', s4),
  grepl('^na', s1, perl=TRUE),
  grepl('^na', s2, perl=TRUE),
  grepl('^na', s3, perl=TRUE),
  grepl('^na', s4, perl=TRUE),
  grep('^na', s1),
  grep('^na', s2),
  grep('^na', s3),
  grep('^na', s4),
  grep('^na', s1, perl=TRUE),
  grep('^na', s2, perl=TRUE),
  grep('^na', s3, perl=TRUE),
  grep('^na', s4, perl=TRUE),
  substring(s1, 1L, 2L)=='na',
  substring(s2, 1L, 2L)=='na',
  substring(s3, 1L, 2L)=='na',
  substring(s4, 1L, 2L)=='na',
  substring(s1, 2L)=='na',
  substring(s2, 2L)=='na',
  substring(s3, 2L)=='na',
  substring(s4, 2L)=='na',
  substr(s1, 1L, 2L)=='na',
  substr(s2, 1L, 2L)=='na',
  substr(s3, 1L, 2L)=='na',
  substr(s4, 1L, 2L)=='na',
  startsWith(s1, 'na'),
  startsWith(s2, 'na'),
  startsWith(s3, 'na'),
  startsWith(s4, 'na')
)

With output:

Unit: nanoseconds
                          expr   min    lq       mean median      uq       max neval
              grepl("^na", s1)  3543  4165  4935.0231   4377  4668.0   5035990 1e+05
              grepl("^na", s2)  3562  4179  4913.5268   4389  4679.0   3774953 1e+05
              grepl("^na", s3) 11450 12892 14578.0890  13708 14177.0   5499260 1e+05
              grepl("^na", s4)  3537  4168  5097.6933   4379  4668.0   7537153 1e+05
 grepl("^na", s1, perl = TRUE)  9150 10352 11773.7375  10983 11409.5   5876706 1e+05
 grepl("^na", s2, perl = TRUE)  9011 10209 11545.6510  10841 11255.0   4499622 1e+05
 grepl("^na", s3, perl = TRUE)  9404 10642 12016.8648  11303 11718.0   5530146 1e+05
 grepl("^na", s4, perl = TRUE)  9586 10791 12157.4919  11454 11890.0   3373385 1e+05
               grep("^na", s1)  3673  4321  5046.7225   4541  4834.0   5198785 1e+05
               grep("^na", s2)  3737  4353  5096.7217   4575  4873.0   4505951 1e+05
               grep("^na", s3) 11714 13054 14668.8226  13890 14361.0   3868269 1e+05
               grep("^na", s4)  3718  4321  5133.3707   4540  4838.0   4530341 1e+05
  grep("^na", s1, perl = TRUE)  9289 10531 11979.0810  11177 11612.0   7283472 1e+05
  grep("^na", s2, perl = TRUE)  9202 10407 13000.8980  11048 11473.0 117889308 1e+05
  grep("^na", s3, perl = TRUE)  9567 10832 12296.0602  11513 11944.0   4965770 1e+05
  grep("^na", s4, perl = TRUE)  9699 10966 12564.5620  11644 12088.0   6423057 1e+05
 substring(s1, 1L, 2L) == "na"  1307  1713  2205.3468   1867  2054.0   4163819 1e+05
 substring(s2, 1L, 2L) == "na"  1323  1712  2254.4013   1866  2053.0   4458996 1e+05
 substring(s3, 1L, 2L) == "na"  1329  1717  2551.2385   1871  2059.0  18218403 1e+05
 substring(s4, 1L, 2L) == "na"  1326  1709  2157.2002   1863  2049.0   3373503 1e+05
     substring(s1, 2L) == "na"  1311  1687  2219.3048   1840  2032.0   5426113 1e+05
     substring(s2, 2L) == "na"  1315  1694  2360.0351   1844  2033.0  11231641 1e+05
     substring(s3, 2L) == "na"  2754  3278  3918.8794   3469  3716.0   6439116 1e+05
     substring(s4, 2L) == "na"  2732  3303  4409.6636   3494  3740.0  44304150 1e+05
    substr(s1, 1L, 2L) == "na"   794  1035  1306.2621   1128  1244.0   3644805 1e+05
    substr(s2, 1L, 2L) == "na"   788  1040  1320.3129   1134  1250.0   3827747 1e+05
    substr(s3, 1L, 2L) == "na"   801  1043  1357.2179   1135  1253.0   3761958 1e+05
    substr(s4, 1L, 2L) == "na"   816  1036  1388.7924   1130  1248.0   4170388 1e+05
          startsWith(s1, "na")   388   529   741.6149    579   648.0   4085174 1e+05
          startsWith(s2, "na")   394   527   664.0718    576   645.0    137888 1e+05
          startsWith(s3, "na")   389   528   657.6853    578   646.0     62022 1e+05
          startsWith(s4, "na")   392   531   666.3153    582   651.0     60802 1e+05

Admittedly the differences are on the microsecond scale, but substr is clearly faster than substring in all cases (the same holds if the input is a vector -- I tested with s1=rep(s1, 10L), etc.).
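Speed aside, the three approaches should give the same answer for a prefix check on non-missing input. A minimal sketch to confirm that, assuming R >= 3.3.0 so that startsWith exists; the vector x is made up for illustration and is not part of the benchmark:

```r
# Illustrative input; not taken from the benchmark above.
x = c('na.rm', 'trim', 'nan', 'banana')

by_regex  = grepl('^na', x)
by_substr = substr(x, 1L, 2L) == 'na'
by_starts = startsWith(x, 'na')   # base R >= 3.3.0

# All three agree for non-missing input.
stopifnot(identical(by_regex, by_substr), identical(by_substr, by_starts))
```

One difference to keep in mind when swapping them: for NA input, grepl returns FALSE while the substr comparison and startsWith return NA.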

Best of all is startsWith, but it is only available from R 3.3.0 onward; nevertheless, I used startsWith where possible, and added a fallback wrapper around substr in case base::startsWith is unavailable.
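A fallback of this kind might look roughly like the sketch below. The names starts_with_fallback and starts_with_compat are hypothetical, chosen for illustration; the wrapper actually added in this PR may be written differently.

```r
# Hypothetical helper: substr-based prefix test for R versions without startsWith.
starts_with_fallback = function(x, prefix) {
  substr(x, 1L, nchar(prefix)) == prefix
}

# Use base::startsWith when base provides it, otherwise the fallback.
# base::startsWith is only evaluated if the exists() check succeeds.
starts_with_compat = if (exists('startsWith', envir = baseenv(), inherits = FALSE)) {
  base::startsWith
} else {
  starts_with_fallback
}

starts_with_compat(c('na.rm', 'trim'), 'na')
```

The fallback also handles inputs shorter than the prefix: substr simply returns the shorter string, which then compares unequal.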

The exact benefit of substring is a bit obscure -- it seems mainly to do with recycling arguments and naming, neither of which applies to the cases I saw in our code base. There's also the convenience that substring(x, n) effectively behaves like substring(x, n, nchar(x)), but I don't think it's worth it (and it may be confusing -- does substring(x, n) mean characters 1 to n?)
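To make the recycling difference concrete, here is a small sketch with a made-up string; it is not code from the data.table code base:

```r
x = 'abcdef'

# substring() recycles first/last, returning one piece per (first, last) pair.
pieces = substring(x, 1:3, 3:5)

# substr() returns a vector the same length as x, so only the first pair is used.
single = substr(x, 1:3, 3:5)

# Spelling out the end position removes any doubt about what a lone n means.
n = 3L
tail_part = substr(x, n, nchar(x))
```

Here pieces has three elements, single has one, and tail_part is the string from position n to the end.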

Replaced & added to the CRAN_release checks

codecov bot commented May 14, 2020

Codecov Report

Merging #4447 (683349b) into master (2791043) will increase coverage by 0.13%.
The diff coverage is 100.00%.

❗ Current head 683349b differs from pull request most recent head e8ac4ee. Consider uploading reports for the commit e8ac4ee to get more accurate results

@@            Coverage Diff             @@
##           master    #4447      +/-   ##
==========================================
+ Coverage   99.47%   99.60%   +0.13%     
==========================================
  Files          75       72       -3     
  Lines       14808    13913     -895     
==========================================
- Hits        14730    13858     -872     
+ Misses         78       55      -23     

Impacted Files         Coverage Δ
R/test.data.table.R    100.00% <ø> (ø)
R/utils.R              100.00% <ø> (ø)
R/data.table.R         100.00% <100.00%> (+0.05%) ⬆️
R/fread.R              100.00% <100.00%> (ø)
src/fmelt.c            99.00% <0.00%> (-1.00%) ⬇️
src/ijoin.c            95.29% <0.00%> (-0.18%) ⬇️
src/fsort.c            95.83% <0.00%> (-0.10%) ⬇️
R/like.R               100.00% <0.00%> (ø)
src/cj.c               100.00% <0.00%> (ø)
R/fcast.R              100.00% <0.00%> (ø)
... and 51 more

Legend: Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Last update 2791043...e8ac4ee.

@MichaelChirico MichaelChirico mentioned this pull request May 19, 2020
Michael Chirico added 2 commits May 19, 2020 12:23
@ColeMiller1
Contributor

Have you seen this branch?

https://github.com/Rdatatable/data.table/compare/micro-optimizations

It includes substring -> substr so there are a lot of neat ideas similar to what you have been proposing

@mattdowle mattdowle added this to the 1.14.1 milestone Jun 16, 2021
@mattdowle mattdowle changed the title replace substring globally with substr [efficiency] replace substring globally with substr Jun 16, 2021
Comment thread R/utils.R Outdated
}

# R 3.3.0 [April 2016]
if (!exists('startsWith', as.environment('package:base'))) {
Member

@mattdowle mattdowle Jun 16, 2021

It's unlikely to matter, but it may be slightly more robust to pass inherits=FALSE to exists() here.
For example: exists("startsWith", "package:utils") returns TRUE even though startsWith is in base not utils. exists("startsWith", "package:utils", inherits=FALSE) is FALSE as expected.
But do we care that startsWith is in base and not moved to utils in future? No. Maybe just exists("startsWith") then. I don't think that would pick up startsWith defined in another package the user has loaded because packages have a namespace which protects their internals from seeing functions defined in other packages. However I'm not quite sure about utils vs base; i.e. whether exists("startsWith") would find it if it is moved to utils.
Not something to worry about. I'll just add the inherits=FALSE and be done with it.

Member Author

Good point. I think you're right that it doesn't matter but I agree it's probably better practice to use inherits=FALSE
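The difference inherits makes can be checked directly. This sketch assumes R >= 3.3.0 (so startsWith is in base) and that the utils package is attached, as it is in a default session; the variable names are illustrative:

```r
utils_env = as.environment('package:utils')

# Default inherits=TRUE walks up the enclosing environments, eventually reaching base.
found_by_inheritance = exists('startsWith', envir = utils_env)

# inherits=FALSE looks only in the environment given.
found_in_utils = exists('startsWith', envir = utils_env, inherits = FALSE)
found_in_base  = exists('startsWith', envir = baseenv(), inherits = FALSE)
```

With inherits=FALSE the check answers the narrower question "is it defined in this exact environment?", which is what the version guard is trying to ask.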

Comment thread R/data.table.R
for (..name in av) {
name = substring(..name, 3L)
if (name=="") stop("The symbol .. is invalid. The .. prefix must be followed by at least one character.")
name = substr(..name, 3L, nchar(..name))
Member

@mattdowle mattdowle Jun 16, 2021

Writing out loud ... I suspect that passing stop=1000000L here instead of nchar(..name) would work, and avoid the allocation for the nchar() result.
Passing .Machine$integer.max might be more robust. But substring()'s default for last= is a hard coded 1000000L. Maybe R's internal max string length is indeed 1000000L characters, but if they ever change that they'll have to remember to update substring's last= argument. If R's maximum string length is greater than 1000000L then substring's argument is already not correct and a bug report could be created.

So, let's see ...

word = paste0(c(rep("A",1e6-2),"hello"), collapse="")
substring(word, 1e6-5)
# [1] "AAAAhe"       # unexpectedly chops "llo" off
substring(word, 1e6-5, nchar(word))
# [1] "AAAAhello"    # expected result

So that's a bug in R it seems. I looked at ?substring and although the default of 1000000 is there in the definition, I don't see any text indicating this is intended behaviour. And I see no reason that the default could not be 2^31-1 which is what the max string length is in R, iiuc. Do you feel like reporting it @MichaelChirico? I feel like you get some traction on BugZilla and enjoy interacting there. So I'd appreciate it if you could handle that :-)
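The truncation can also be reproduced without relying on substring's default, by passing the cap to substr explicitly. This is only a sketch reusing the example above; the variable names start, capped and full are added for illustration:

```r
# Same test string as above: 1e6-2 copies of "A" followed by "hello".
word = paste0(c(rep("A", 1e6-2), "hello"), collapse="")
len = nchar(word)

start = 1e6-5
capped = substr(word, start, 1000000L)   # end position fixed at 1e6
full   = substr(word, start, len)        # end position taken from the string itself
```

capped stops at character 1e6 and so loses the last three characters, while full runs to the true end of the string. Passing nchar() costs one extra call but never depends on a hard-coded limit.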

Member Author

I will take a look. My own quick investigation:

# works
x = strrep(" ", .Machine$integer.max-1)
# fails
x = strrep(" ", .Machine$integer.max)

So I agree it seems like an odd default.

Member Author

Member

Great. Interested to see what their response is. I think my short example would have been better to post as-is, as the post to me on first reading dwells too much on .Machine$integer.max vs .Machine$integer.max-1. A phrase like "makes no sense" can be seen as critical and therefore risks raising hairs and inducing defensive responses. Better just to stick to code, use less English, and be constructive. But who am I to give advice, I didn't even want to engage on R-devel or Bugzilla myself because these difficulties are magnified 10x on those forums. You are braver than I.

Member

@mattdowle mattdowle Jun 21, 2021

I see the replies now and saw that Brodie concentrated on .Machine$integer.max vs .Machine$integer.max-1. That to me is inefficient communication, as I feared. I see also time spent on defending why the default was 1000000, as I predicted. But it got there in the end which is good. It could have been more efficient by just pointing to the problem (my short example) and leaving them to come up with the best solution. But again: forum threads like this are why I don't engage, so you're braver than me.

Member

@mattdowle mattdowle Jun 21, 2021

I can guarantee you that some people (perhaps most, perhaps a few, certainly not all, but definitely some) reading that thread now think that the problem is to do with strings that are 2GB long. They therefore think that it's a silly edge case that hardly ever comes up and folk that do have strings that are 2GB long shouldn't be doing that anyway and instead be restructuring their code.
Where in fact, the problem occurs at 1e6 characters, which is 976K. At under 1MB, let alone GB, a string that large is relatively much more reasonable, commonplace and has relatively little to do with large servers or esoteric edge cases.
That view being formed could have been avoided by using the 1e6 example that I showed above, which is why I created it that way with that in mind. So I should have written "please post this example" and been explicit in that way. I wrote these comments for future reference, for next time.
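For reference, the size arithmetic behind the 976K figure, assuming plain ASCII text at one byte per character (the variable names are illustrative):

```r
n_chars = 1e6          # characters at which the default cut-off applies
kib = n_chars / 1024   # size in KiB, assuming one byte per character
mib = kib / 1024       # size in MiB; just under 1
```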

@mattdowle mattdowle merged commit 80365ff into master Jun 16, 2021
@mattdowle mattdowle deleted the substring-substr branch June 16, 2021 23:27
mattdowle added a commit that referenced this pull request Jun 17, 2021
@jangorecki jangorecki modified the milestones: 1.14.9, 1.15.0 Oct 29, 2023