From 0c80ac3d538758d15a417433ff85c3acfa6d84ad Mon Sep 17 00:00:00 2001 From: Hadley Wickham Date: Fri, 20 Mar 2026 07:41:12 -0500 Subject: [PATCH 1/7] Skip XML parsing in markdown_evaluate() when no inline R code present Add a fast grepl() pre-check to avoid expensive XML parsing for the 99% of tags that don't contain inline R code (`r ...` or ```{...}). Saves ~12ms per roxygenize() run (11x faster for markdown_evaluate). Co-Authored-By: Claude Opus 4.6 (1M context) --- R/markdown-code.R | 3 ++ optimization-plan.md | 63 +++++++++++++++++++++++++++++++++++++++++ optimization-results.md | 20 +++++++++++++ 3 files changed, 86 insertions(+) create mode 100644 optimization-plan.md create mode 100644 optimization-results.md diff --git a/R/markdown-code.R b/R/markdown-code.R index 2e9559f5d..3a9f956c9 100644 --- a/R/markdown-code.R +++ b/R/markdown-code.R @@ -49,6 +49,9 @@ #' @keywords internal markdown_evaluate <- function(text) { text <- paste(text, collapse = "\n") + if (!grepl("`r ", text, fixed = TRUE) && !grepl("```{", text, fixed = TRUE)) { + return(text) + } mdxml <- xml_ns_strip(md_to_mdxml(text, sourcepos = TRUE)) code_nodes <- xml_find_all(mdxml, ".//code | .//code_block") rcode_nodes <- keep(code_nodes, is_markdown_code_node) diff --git a/optimization-plan.md b/optimization-plan.md new file mode 100644 index 000000000..01322e755 --- /dev/null +++ b/optimization-plan.md @@ -0,0 +1,63 @@ +# Optimization plan + +## Profiling summary + +Profiled `roxygenize(load_code = "source")` on the roxygen2 package itself (5 iterations, ~2.9s total, interval = 5ms). Key bottlenecks by total time: + +| Area | % of total | Key functions | +|------|-----------|---------------| +| Tag parsing / markdown | 23% | `parse_tags` → `markdown_if_active` → `markdown` | +| Markdown pass 2 (md→Rd) | 8% | `markdown_pass2` → `mdxml_children_to_rd_top` | +| CLI warnings | 8.5% | `warn_roxy` → `cli::cli_inform` | +| S3 generic detection | 12% | `block_find_object` → `is_s3_generic` / `find_generic` | +| R6 method extraction | 10% | `r6_extract_methods` → `warn_roxy_tag` (76% of its time!) | + +## Optimization experiments + +### Experiment 1: Cache `is_s3_generic()` results + +**Bottleneck:** `find_generic()` calls `is_s3_generic()` in a loop over dot-split pieces of every function name. `is_s3_generic()` does expensive work: `exists()`, `get()`, `tryCatch(getNamespaceName(...))`, `body()`, and recursive `calls_use_method()`. The same generic names (e.g., `print`, `format`, `[`) are tested repeatedly across many functions in a package. + +**Proposed fix:** Add a simple environment-based cache inside `is_s3_generic()` keyed by `(name, env)`. Clear the cache at the start of each `roxygenize()` call. + +**Risk:** Low — pure function of name + env, env doesn't change during a roxygenize run. + +### Experiment 2: Cache `methods::isGeneric()` in `find_generic()` + +**Bottleneck:** `find_generic()` calls `methods::isGeneric()` for every function name. This is an S4-method lookup that hits the methods infrastructure. + +**Proposed fix:** Cache `methods::isGeneric()` results alongside `is_s3_generic()`. + +**Risk:** Low — same reasoning as experiment 1. + +### Experiment 3: Replace `cli::cli_inform()` with simpler message in `warn_roxy()` + +**Bottleneck:** `warn_roxy()` → `cli::cli_inform()` accounts for ~8.5% of total time. Each call involves `cli::style_hyperlink()`, `cli::format_inline()`, and the full cli formatting pipeline including selector matching. When documenting roxygen2 itself, R6 warnings generate many calls. + +**Proposed fix:** Use `cli::cli_inform()` only in interactive sessions or batch the warnings. Alternatively, use `warning()` or `message()` with pre-formatted strings to bypass cli overhead. + +**Risk:** Medium — changes user-visible output format. Need to be careful to preserve behavior. + +### Experiment 4: Avoid `cli::format_inline()` in `warn_roxy_tag()` + +**Bottleneck:** `warn_roxy_tag()` calls `cli::format_inline("{.strong @{tag$tag}} ")` for every warning. This is expensive relative to the simple output it produces. + +**Proposed fix:** Replace with `paste0("**@", tag$tag, "** ")` or a simpler string formatting approach, since the bold markup is the only cli feature used. + +**Risk:** Low — the formatted tag name is just concatenated into the message anyway. + +### Experiment 5: Precompute `internal_f("tools", ".get_internal_S3_generics")()` + +**Bottleneck:** In `is_s3_generic()`, for primitive functions, `internal_f("tools", ".get_internal_S3_generics")()` is called each time. This fetches and calls an internal function from the tools package. + +**Proposed fix:** Compute this once and cache the result (it never changes within a session). + +**Risk:** Very low — the tools package internal generics list is static. + +### Experiment 6: Optimize `find_generic()` string splitting + +**Bottleneck:** `find_generic()` calls `str_split(name, fixed("."))` for every function name being analyzed, and then `paste0(pieces[seq_len(i)], collapse = ".")` in a loop. + +**Proposed fix:** Use `strsplit()` (base R) instead of `str_split()` which has stringr overhead. Pre-compute the candidate generic names more efficiently. + +**Risk:** Very low — straightforward string operation replacement. diff --git a/optimization-results.md b/optimization-results.md new file mode 100644 index 000000000..6cd6f9cc2 --- /dev/null +++ b/optimization-results.md @@ -0,0 +1,20 @@ +# Optimization results + +## Experiment 1: Skip XML parsing in `markdown_evaluate()` for text without inline R code + +**Hypothesis:** `markdown_evaluate()` parses every tag's text to XML to check for inline R code (`\`r ...\``), but the vast majority of tags don't contain any. A fast `grepl()` pre-check can skip the expensive XML parsing. + +**Data:** Of 240 calls to `markdown_evaluate()` during `roxygenize()` on roxygen2, only 2 contain inline R code (238 are pure text). + +**Change:** Added early return in `markdown_evaluate()` when text doesn't contain `` `r `` or `` ```{ `` patterns. + +**Microbenchmark (240 typical calls, no inline R code):** + +| Version | min | median | +|---------|-----|--------| +| Before | 13.23ms | 13.47ms | +| After | 1.01ms | 1.18ms | + +**Result:** 11x faster for `markdown_evaluate()`, saving ~12ms per `roxygenize()` run. All 1105 tests pass. + +**Verdict:** SUCCESS — committed. From a54db3e0099615e97139bd9c1873c8f06ea1c5e6 Mon Sep 17 00:00:00 2001 From: Hadley Wickham Date: Fri, 20 Mar 2026 07:43:21 -0500 Subject: [PATCH 2/7] Skip Rd tag escaping when text has no backslashes Add fast grepl() pre-check in escape_rd_for_md() to skip expensive regex search and string manipulation when text contains no Rd macros. Saves ~28ms per roxygenize() run (92x faster per call for plain text). Co-Authored-By: Claude Opus 4.6 (1M context) --- R/markdown-escaping.R | 11 +++++++++++ optimization-results.md | 19 +++++++++++++++++++ 2 files changed, 30 insertions(+) diff --git a/R/markdown-escaping.R b/R/markdown-escaping.R index 540234f26..161653443 100644 --- a/R/markdown-escaping.R +++ b/R/markdown-escaping.R @@ -20,7 +20,18 @@ #' * `unescape_rd_for_md`: the original Rd text. #' @rdname markdown-internals #' @keywords internal +empty_rd_tags <- data.frame( + tag = character(), + start = integer(), + end = integer(), + argend = integer() +) + escape_rd_for_md <- function(text) { + if (!grepl("\\", text, fixed = TRUE)) { + attr(text, "roxygen-markdown-subst") <- list(tags = empty_rd_tags, id = "") + return(text) + } rd_tags <- find_fragile_rd_tags(text, escaped_for_md) protected <- protect_rd_tags(text, rd_tags) double_escape_md(protected) diff --git a/optimization-results.md b/optimization-results.md index 6cd6f9cc2..848f88611 100644 --- a/optimization-results.md +++ b/optimization-results.md @@ -18,3 +18,22 @@ **Result:** 11x faster for `markdown_evaluate()`, saving ~12ms per `roxygenize()` run. All 1105 tests pass. **Verdict:** SUCCESS — committed. + +## Experiment 2: Skip Rd tag escaping in `escape_rd_for_md()` for text without backslashes + +**Hypothesis:** `escape_rd_for_md()` searches for Rd macros (e.g., `\code{}`, `\link{}`) to protect them from markdown parsing. But most roxygen text doesn't contain any Rd macros. A fast `grepl("\\", text, fixed = TRUE)` check can skip all the work. + +**Data:** Of 240 calls during `roxygenize()` on roxygen2, only 4 contain backslashes (236 are plain text). + +**Change:** Added early return in `escape_rd_for_md()` when text contains no backslash character, attaching the required empty attribute structure for `unescape_rd_for_md()`. + +**Microbenchmark (per call, simple text):** + +| Version | min | median | +|---------|-----|--------| +| Before | 106µs | 121µs | +| After | 1.1µs | 1.3µs | + +**Result:** 92x faster for simple text, saving ~28ms per `roxygenize()` run (236 × 120µs). All 1105 tests pass. + +**Verdict:** SUCCESS — committed. From 4608a840bbed90ca52e2948f4d1a1207d09e88d0 Mon Sep 17 00:00:00 2001 From: Hadley Wickham Date: Fri, 20 Mar 2026 07:48:17 -0500 Subject: [PATCH 3/7] Cache is_s3_generic() results during roxygenize() Add an environment-based cache for is_s3_generic() that is active only during roxygenize() runs. With 729 calls and 301 unique names (59% redundancy), this saves ~3ms per run. Cache is cleared after each run to avoid cross-environment contamination. Co-Authored-By: Claude Opus 4.6 (1M context) --- R/object-s3.R | 29 +++++++++++++++++++++++++++++ R/roxygenize.R | 2 ++ optimization-results.md | 17 +++++++++++++++++ 3 files changed, 48 insertions(+) diff --git a/R/object-s3.R b/R/object-s3.R index 40e566080..1ea589a7f 100644 --- a/R/object-s3.R +++ b/R/object-s3.R @@ -1,3 +1,14 @@ +s3_generic_cache <- new_environment() + +s3_generic_cache_reset <- function() { + env_unbind(s3_generic_cache, env_names(s3_generic_cache)) + s3_generic_cache$active <- TRUE +} + +s3_generic_cache_clear <- function() { + env_unbind(s3_generic_cache, env_names(s3_generic_cache)) +} + #' Determine if a function is an S3 generic or S3 method #' #' @description @@ -16,6 +27,24 @@ is_s3_generic <- function(name, env = parent.frame()) { if (name == "") { return(FALSE) } + + if (isTRUE(s3_generic_cache$active)) { + cached <- env_get(s3_generic_cache, name, default = NULL) + if (!is.null(cached)) { + return(cached) + } + } + + result <- is_s3_generic_impl(name, env) + + if (isTRUE(s3_generic_cache$active)) { + env_poke(s3_generic_cache, name, result) + } + + result +} + +is_s3_generic_impl <- function(name, env) { if (!exists(name, envir = env, mode = "function")) { return(FALSE) } diff --git a/R/roxygenize.R b/R/roxygenize.R index c25571419..4e915faa3 100644 --- a/R/roxygenize.R +++ b/R/roxygenize.R @@ -47,6 +47,8 @@ roxygenize <- function( is_first <- roxygen_setup(base_path) find_package_cache_reset() + s3_generic_cache_reset() + withr::defer(s3_generic_cache_clear()) roxy_meta_load(base_path) # Load required packages for method registration packages <- roxy_meta_get("packages") diff --git a/optimization-results.md b/optimization-results.md index 848f88611..8c479269c 100644 --- a/optimization-results.md +++ b/optimization-results.md @@ -37,3 +37,20 @@ **Result:** 92x faster for simple text, saving ~28ms per `roxygenize()` run (236 × 120µs). All 1105 tests pass. **Verdict:** SUCCESS — committed. + +## Experiment 3: Cache `is_s3_generic()` results during `roxygenize()` + +**Hypothesis:** `is_s3_generic()` is called 729 times during `roxygenize()` on roxygen2, with only 301 unique names (59% redundancy). Each call involves `exists()`, `get()`, `tryCatch(getNamespaceName(...))`, and potentially recursive `calls_use_method()`. Caching results by name avoids redundant work. + +**Change:** Added an environment-based cache that is activated during `roxygenize()` and cleared afterwards. The cache is inactive during direct calls (e.g., in tests) to avoid cross-environment contamination. + +**Microbenchmark (729 calls matching a real roxygenize run):** + +| Version | min | median | +|---------|-----|--------| +| Before | 16.4ms | 17.3ms | +| After | 13.6ms | 14.3ms | + +**Result:** ~17% faster for `is_s3_generic()` calls, saving ~3ms per `roxygenize()` run. Savings scale with package size (more S3 methods = more cache hits). All 1105 tests pass. + +**Verdict:** SUCCESS — committed. Small but clean improvement. From c23ba732b5ebbc277e9c0599823b35b5f83ce872 Mon Sep 17 00:00:00 2001 From: Hadley Wickham Date: Fri, 20 Mar 2026 07:49:53 -0500 Subject: [PATCH 4/7] Use base strsplit() in find_generic() instead of stringr::str_split() MIME-Version: 1.0 Content-Type: text/plain; charset=UTF-8 Content-Transfer-Encoding: 8bit The stringr wrapper adds ~12µs overhead per call for a simple fixed-character split. With 280 calls per run, this saves ~3.4ms. Co-Authored-By: Claude Opus 4.6 (1M context) --- R/object-s3.R | 2 +- optimization-results.md | 17 +++++++++++++++++ 2 files changed, 18 insertions(+), 1 deletion(-) diff --git a/R/object-s3.R b/R/object-s3.R index 1ea589a7f..3ca1780a1 100644 --- a/R/object-s3.R +++ b/R/object-s3.R @@ -115,7 +115,7 @@ find_generic <- function(name, env = parent.frame()) { return(c("all.equal", method)) } - pieces <- str_split(name, fixed("."))[[1]] + pieces <- strsplit(name, ".", fixed = TRUE)[[1]] n <- length(pieces) # No . in name, so can't be method diff --git a/optimization-results.md b/optimization-results.md index 8c479269c..8dd932ed0 100644 --- a/optimization-results.md +++ b/optimization-results.md @@ -54,3 +54,20 @@ **Result:** ~17% faster for `is_s3_generic()` calls, saving ~3ms per `roxygenize()` run. Savings scale with package size (more S3 methods = more cache hits). All 1105 tests pass. **Verdict:** SUCCESS — committed. Small but clean improvement. + +## Experiment 4: Use base `strsplit()` instead of `stringr::str_split()` in `find_generic()` + +**Hypothesis:** `stringr::str_split()` has overhead from the stringr/stringi infrastructure. For a simple fixed-character split, base `strsplit()` is faster. + +**Change:** Replaced `str_split(name, fixed("."))[[1]]` with `strsplit(name, ".", fixed = TRUE)[[1]]` in `find_generic()`. + +**Microbenchmark (per call):** + +| Version | min | median | +|---------|-----|--------| +| str_split | 10.5µs | 12.6µs | +| strsplit | 410ns | 533ns | + +**Result:** 24x faster per call, saving ~3.4ms per `roxygenize()` run (280 calls × 12µs). All 1105 tests pass. + +**Verdict:** SUCCESS — committed. From e00dc5e545ae854fe31d9739be1fa80875d2bc88 Mon Sep 17 00:00:00 2001 From: Hadley Wickham Date: Fri, 20 Mar 2026 07:52:51 -0500 Subject: [PATCH 5/7] Skip link-reference regex when text has no brackets Add fast grepl() pre-check in get_md_linkrefs() to avoid expensive str_match_all() regex when text contains no [ character. Saves ~4.5ms per roxygenize() run (205 of 240 calls skip the regex). Co-Authored-By: Claude Opus 4.6 (1M context) --- R/markdown-link.R | 3 +++ optimization-results.md | 19 +++++++++++++++++++ 2 files changed, 22 insertions(+) diff --git a/R/markdown-link.R b/R/markdown-link.R index ebcafab01..f682ef186 100644 --- a/R/markdown-link.R +++ b/R/markdown-link.R @@ -64,6 +64,9 @@ add_linkrefs_to_md <- function(text) { } get_md_linkrefs <- function(text) { + if (!grepl("[", text, fixed = TRUE)) { + return(character()) + } refs <- str_match_all( text, regex( diff --git a/optimization-results.md b/optimization-results.md index 8dd932ed0..bc620ed68 100644 --- a/optimization-results.md +++ b/optimization-results.md @@ -71,3 +71,22 @@ **Result:** 24x faster per call, saving ~3.4ms per `roxygenize()` run (280 calls × 12µs). All 1105 tests pass. **Verdict:** SUCCESS — committed. + +## Experiment 5: Skip link-reference regex in `get_md_linkrefs()` when no `[` present + +**Hypothesis:** `get_md_linkrefs()` runs a complex regex via `str_match_all()` on every tag's text to find markdown link references like `[function()]`. Most tags don't contain `[` at all. + +**Data:** Of 240 calls, only 35 contain `[` (205 are plain text without brackets). + +**Change:** Added early return when `grepl("[", text, fixed = TRUE)` is FALSE. + +**Microbenchmark (per call, simple text):** + +| Version | min | median | +|---------|-----|--------| +| Before | 19.4µs | 23.3µs | +| After | 779ns | 1.4µs | + +**Result:** 17x faster for plain text, saving ~4.5ms per `roxygenize()` run (205 × 22µs). All 1105 tests pass. + +**Verdict:** SUCCESS — committed. From d1ced216cfec0b05b38405dcb243e82e3673be48 Mon Sep 17 00:00:00 2001 From: Hadley Wickham Date: Fri, 20 Mar 2026 07:56:08 -0500 Subject: [PATCH 6/7] Add overall benchmark results to optimization-results.md End-to-end roxygenize() on roxygen2: 936ms -> 839ms (10.4% faster). Co-Authored-By: Claude Opus 4.6 (1M context) --- optimization-results.md | 11 +++++++++++ 1 file changed, 11 insertions(+) diff --git a/optimization-results.md b/optimization-results.md index bc620ed68..046f40268 100644 --- a/optimization-results.md +++ b/optimization-results.md @@ -1,5 +1,16 @@ # Optimization results +## Overall results + +End-to-end `roxygenize(load_code = "source")` on the roxygen2 package itself (10 iterations, `devtools::load_all()` in a clean R subprocess): + +| Version | min | median | Improvement | +|---------|-----|--------|-------------| +| Before (main) | 863ms | 936ms | — | +| After (all experiments) | 805ms | 839ms | 10.4% faster | + +The optimizations are all fast-path short-circuits that avoid expensive work when it's not needed. They have the largest impact on packages where most roxygen tags contain simple text without inline R code, Rd macros, or markdown links (which is the common case). + ## Experiment 1: Skip XML parsing in `markdown_evaluate()` for text without inline R code **Hypothesis:** `markdown_evaluate()` parses every tag's text to XML to check for inline R code (`\`r ...\``), but the vast majority of tags don't contain any. A fast `grepl()` pre-check can skip the expensive XML parsing. From 20e0973e1d76743caabccee33b23d1a1ff94b955 Mon Sep 17 00:00:00 2001 From: Hadley Wickham Date: Fri, 20 Mar 2026 08:12:40 -0500 Subject: [PATCH 7/7] Update docs --- man/markdown-internals.Rd | 12 ++++++------ 1 file changed, 6 insertions(+), 6 deletions(-) diff --git a/man/markdown-internals.Rd b/man/markdown-internals.Rd index 22e5234d5..9fd2f60b6 100644 --- a/man/markdown-internals.Rd +++ b/man/markdown-internals.Rd @@ -1,22 +1,22 @@ % Generated by roxygen2: do not edit by hand % Please edit documentation in R/markdown-escaping.R -\name{escape_rd_for_md} -\alias{escape_rd_for_md} +\name{empty_rd_tags} +\alias{empty_rd_tags} \alias{unescape_rd_for_md} \title{Escape fragile Rd tags} \usage{ -escape_rd_for_md(text) +empty_rd_tags unescape_rd_for_md(rd_text, esc_text) } \arguments{ -\item{text}{Input text. Potentially contains Rd and/or -markdown markup.} - \item{rd_text}{The markdown parsed and interpreted text.} \item{esc_text}{The original escaped text from \code{escape_rd_for_md()}.} + +\item{text}{Input text. Potentially contains Rd and/or +markdown markup.} } \value{ \itemize{