
perf: optimize gs_design_ahr (~3-4x speedup)#623

Open
yihui wants to merge 13 commits into main from perf/optimize-gs-design-ahr

Conversation

@yihui
Collaborator

@yihui yihui commented May 5, 2026

Summary

  • Optimize gs_design_ahr() and its dependency chain for 3-4x speedup across all input variants
  • Replace expensive object.size() calls in cache pruning with O(1) numhash() check
  • Replace dplyr operations (tibble, mutate, full_join, select, arrange, filter) with base R equivalents in hot-path functions (gs_power_npe, gs_design_npe, gs_design_ahr, expected_time, gs_info_ahr)
  • Rewrite expected_event() internals using pure vector arithmetic instead of data.frame/merge/order operations (8x speedup for this function alone)

Benchmark results (20 calls each, after warm-up)

| Scenario | Before | After | Speedup |
|---|---|---|---|
| Default (single analysis) | 0.985s | 0.327s | 3.0x |
| Multiple analysis_time (3 analyses) | 2.707s | 0.997s | 2.7x |
| Info_frac driven | 2.847s | 1.157s | 2.5x |
| Info_frac + analysis_time | 3.790s | 1.318s | 2.9x |
| 2-sided symmetric (O'Brien-Fleming) | 3.059s | 0.910s | 3.4x |
| gs_b lower (no futility bound) | 3.399s | 0.773s | 4.4x |

Key changes by commit

  1. prune_hash: Replace object.size() (walks entire hash, ~2ms/call) with numhash() (O(1)) for the frequent check; clear when entry count > 100
  2. gs_power_npe output: Replace tibble() + mutate() + arrange() with data.frame() + base R sort (called 9+ times per design via gs_design_npe's root-finding)
  3. gs_design_npe output: Replace full_join + select + rename + arrange with merge() + column subsetting
  4. gs_design_ahr output: Replace dplyr chain (mutate, full_join, select, arrange, filter) with base R equivalents
  5. Hot-path functions: Remove dplyr::select(), dplyr::mutate(), dplyr::transmute() from expected_time, gs_info_ahr, and the info_frac loop in gs_design_ahr
  6. expected_event: Rewrite internals using vectors instead of data.frame/merge/order (8x speedup: 3.6ms to 0.43ms per call)
  7. Backward compatibility: Exported functions (gs_power_npe, gs_design_npe) still return tibbles via as_tibble() at the return boundary
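
The dplyr-to-base-R pattern in items 2-4 can be sketched as follows. This is an illustrative toy example, not the package's actual code: the column names and values are invented, but the structure mirrors the described change (`data.frame()` plus `order()` replacing `tibble()` + `mutate()` + `arrange()`, with upper bounds sorted before lower within each analysis).

```r
# Toy data standing in for per-analysis bound results (not real gsDesign2 output)
k <- c(1, 1, 2, 2)
b <- c("upper", "lower", "upper", "lower")
z <- c(2.1, -0.5, 1.8, 0.3)

# Base R equivalent of tibble() |> mutate(probability = ...) |> arrange(...)
out <- data.frame(analysis = k, bound = b, z = z)
out$probability <- pnorm(out$z, lower.tail = FALSE)
# analysis ascending, "upper" before "lower" within each analysis
out <- out[order(out$analysis, out$bound,
                 decreasing = c(FALSE, TRUE), method = "radix"), ]
```

The radix method is required when `decreasing` is a per-key vector; it also happens to be the fastest sort for small keys like these.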

Test plan

  • All 787 existing tests pass (0 failures, 28 pre-existing skips)
  • Numerical output verified identical to baseline for all design variants
  • Tested with: default args, multiple analysis_time, info_frac driven, info_frac + analysis_time, 2-sided symmetric, gs_b lower bound

🤖 Generated with Claude Code

Xie and others added 8 commits May 4, 2026 20:38
object.size() walks the entire hash table structure on every
cache_fun() call, taking ~2ms per invocation. Since cache_fun is
called 15+ times per gs_design_ahr run (via expected_time/ahr and
gs_power_npe), this adds up to significant overhead.

Replace with numhash() which returns the entry count in O(1), and
use clrhash() for a simple eviction strategy when the limit is
exceeded.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
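
The count-based eviction idea can be sketched as below. Names and details are hypothetical (`make_cached` is not the package's `cache_fun`, and `length()` on an environment stands in for the `numhash()` call the commit describes); the point is that checking a binding count is cheap, while `object.size()` walks the whole structure.

```r
# Hypothetical memoization wrapper with count-based eviction
make_cached <- function(fun, max_entries = 100) {
  cache <- new.env(hash = TRUE, parent = emptyenv())
  function(...) {
    key <- paste(deparse(list(...)), collapse = " ")
    if (exists(key, envir = cache, inherits = FALSE)) {
      return(get(key, envir = cache))
    }
    # length() of an environment returns its binding count cheaply,
    # unlike object.size(), which traverses the entire hash table
    if (length(cache) >= max_entries) {
      rm(list = ls(envir = cache, all.names = TRUE), envir = cache)
    }
    val <- fun(...)
    assign(key, val, envir = cache)
    val
  }
}
```

Clearing the whole cache at the limit (as `clrhash()` does in the commit) trades cache warmth for a one-line eviction policy with a hard memory bound.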
gs_power_npe is called 9+ times per gs_design_ahr invocation (via
gs_design_npe's bracket search and uniroot). The tibble() + mutate()
+ arrange() output assembly accounted for ~39% of gs_power_npe time.

Replacing with data.frame() and base R ordering is 8x faster for
the output assembly step, yielding ~25% improvement in overall
gs_design_ahr runtime for multi-analysis designs.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
…_npe

The output assembly in gs_design_npe used full_join (to merge H0 and
H1 probabilities), select, rename, and arrange from dplyr. Since
gs_design_npe is called once per gs_design_ahr and these operations
are on small data frames (6 rows), base R merge() and column
subsetting are much faster.

Combined with the gs_power_npe change, this yields ~50% overall
improvement for multi-analysis designs.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
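
A minimal sketch of the `full_join`-to-`merge()` swap, with made-up column names (the real frames carry the H0/H1 crossing probabilities the commit mentions):

```r
# Toy per-analysis probability tables under H1 and H0 (invented values)
h1 <- data.frame(analysis = 1:3, probability  = c(0.20, 0.60, 0.90))
h0 <- data.frame(analysis = 1:3, probability0 = c(0.010, 0.020, 0.025))

# merge(all = TRUE) is base R's full join; column subsetting replaces
# select()/rename(), and order() replaces arrange()
out <- merge(h1, h0, by = "analysis", all = TRUE)
out <- out[order(out$analysis), c("analysis", "probability", "probability0")]
```

On a 6-row frame the asymptotics are irrelevant; the win is avoiding dplyr's per-call dispatch and tibble construction overhead.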
Replace mutate, full_join, select, arrange, and filter operations
in the output assembly section of gs_design_ahr with equivalent base
R operations (direct column assignment, merge, column subsetting,
order).

This eliminates the dplyr overhead for the final output formatting
which previously involved multiple tibble round-trips on small data
frames.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Replace select(-n) with base R column removal, and replace
mutate/transmute in the info_frac loop of gs_design_ahr with
direct column assignment. These functions are called repeatedly
during uniroot iterations, so even small per-call savings add up.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
…frames

The original expected_event used data.frame(), merge(), and multiple
order() calls for computation on small interval tables. Profiling
showed expected_event accounted for 65% of pw_info time, and
data.frame overhead was 35% of expected_event time.

Rewrite using pure vector operations: compute the union of enrollment
and failure breakpoints directly, use stepfun2 for rate lookups, and
perform all arithmetic on plain numeric vectors. Only construct a
data.frame for the final output when simple=FALSE.

This yields an 8x speedup for expected_event (3.6ms -> 0.43ms per
call) and ~2.5x speedup for pw_info.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
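
The rate-lookup idea can be illustrated with `stats::stepfun` (the commit's `stepfun2` is an internal variant; breakpoints and rates below are invented). A piecewise-constant rate becomes a vectorized function call instead of a `merge()` over interval tables:

```r
# Hypothetical piecewise-constant failure rates: 0.10 on [0, 3),
# 0.05 on [3, 6), 0.02 on [6, Inf)
fail_breaks <- c(0, 3, 6)
fail_rate   <- c(0.10, 0.05, 0.02)

# Right-continuous step function over the interior breakpoints
rate_at <- stats::stepfun(fail_breaks[-1], fail_rate)

# Vectorized lookup: no data.frame, no merge, no order
rates <- rate_at(c(1, 4, 10))
```

Expected-event integrals then reduce to sums of `rate * exposure` products over plain numeric vectors, which is where the reported 8x comes from.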
Exported functions gs_power_npe and gs_design_npe must return tibbles
for backward compatibility. Add tibble::as_tibble() at the return
point to convert the base R data.frame used for fast internal
computation back to the expected output type.

Also fix row ordering in gs_design_npe to maintain upper-before-lower
within each analysis (matching the original arrange(analysis) with
upper-first convention).

Refine prune_hash to use a 100-entry limit per function, giving
predictable memory bounds (each entry is typically a few KB, so
~100KB per cached function).

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
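
The boundary-conversion pattern is simple enough to sketch in a few lines; the function name and columns here are illustrative, not the package's actual signature:

```r
# Sketch: compute on a plain data.frame internally, convert to tibble
# exactly once at the exported return point
gs_power_sketch <- function(z = c(2.0, 1.9)) {
  out <- data.frame(analysis = seq_along(z), z = z)  # fast internal path
  out$probability <- pnorm(out$z, lower.tail = FALSE)
  tibble::as_tibble(out)  # backward-compatible return type
}
```

Callers relying on tibble semantics (e.g. no partial `$` matching, no row names) see no change, while the internal root-finding loop never pays tibble overhead.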
Profiling revealed that object.size() overcounts gs_power_npe cache
entries by ~600x (reports 1.8 MB per entry when true incremental cost
is ~3 KB). This is because object.size() walks into shared namespace
environments of function arguments, counting the same gsDesign2
namespace (833 KB) and gsDesign namespace (75 KB) for every entry.

Changes:
- Remove object.size() from the pruning path (both slow and inaccurate)
- Only check entry count before insertions, not on cache hits
- Set max_entries = 1024, justified by:
  - True cost: ~3 KB (gs_power_npe) to ~5 KB (ahr) per entry
  - 1024 entries ≈ 3-5 MB real memory
  - Supports ~200 cached designs in a session
  - A single design creates only 5-28 entries

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
@yihui
Collaborator Author

yihui commented May 5, 2026

R CMD check timing comparison (GHA R-CMD-check workflow)

Comparing the latest run on main (24492c62) vs this PR (838cd521).

ubuntu-latest (release)

| Step | main | PR | Speedup |
|---|---|---|---|
| checking examples | 31s/27s | 25s/21s | 1.3x |
| checking examples --run-donttest | 62s/55s | 51s/45s | 1.2x |
| Running testthat.R | 112s/99s | 88s/77s | 1.3x |
| Total R CMD check | 262s | 221s | 1.2x |

windows-latest (release)

| Step | main | PR | Speedup |
|---|---|---|---|
| checking examples | 34s | 28s | 1.2x |
| checking examples --run-donttest | 82s | 67s | 1.2x |
| Running testthat.R | 154s | 126s | 1.2x |

macos-latest (release)

| Step | main | PR | Speedup |
|---|---|---|---|
| checking examples | 22s/23s | 14s/15s | 1.6x |
| checking examples --run-donttest | 50s/51s | 35s/36s | 1.4x |
| checking tests (total) | 86s | 72s | 1.2x |

Notes

  • The GHA speedup (~1.2-1.4x) is more modest than local benchmarks (~3-4x) because R CMD check includes overhead (compilation, documentation checks, etc.) and the test suite exercises many functions beyond gs_design_ahr.
  • The examples show clear improvement because they directly call gs_design_ahr() with various arguments.
  • All platforms pass with Status: OK.

@yihui yihui requested review from LittleBeannie and jdblischak May 7, 2026 03:03
@yihui
Collaborator Author

yihui commented May 7, 2026

@jdblischak @LittleBeannie This PR is ready. Most commits should be straightforward to understand. The only one that's a little challenging is 85f2dd4 (that's because the original code was also not easy to digest).

