1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60 61 62 63 64 65 66 67 68 69 70 71 72 73 74 75 76 77 78 79 80 81 82 83 84 85 86 87 88 89 90 91 92 93 94 95 96 97 98 99 100 101 102 103 104 105 106 107 108 109 110 111 112 113 114 115 116 117 118 119 120 121 122 123 124 125 126 127 128 129 130 131 132 133 134 135 136 137 138 139 140 141 142 143 144 145 146 147 148 149 150 151 152 153 154 155 156 157 158 159 160 161 162 163 164 165 166 167 168 169 170 171 172 173 174 175 176 177 178 179 180 181 182 183 184 185 186 187 188 189 190 191 192 193 194 195 196 197 198 199 200 201 202 203 204 205 206 207 208 209 210 211 212 213 214 215 216 217 218 219 220 221 222 223 224 225 226 227 228 229 230 231 232 233 234 235 236 237 238 239 240 241 242 243 244 245 246 247 248 249 250 251 252 253 254 255 256 257 258 259 260 261 262 263 264 265 266 267 268 269 270 271 272 273 274 275 276 277 278 279 280 281 282 283 284 285 286 287 288 289 290 291 292 293 294 295 296 297 298 299 300 301 302 303 304 305 306 307 308 309 310 311 312 313 314 315 316 317 318 319 320 321 322 323 324 325 326 327 328 329 330 331 332 333 334 335 336 337 338 339 340 341 342 343 344 345 346 347 348 349 350 351 352 353 354 355 356 357 358 359 360 361 362 363 364 365 366 367 368 369 370 371 372 373 374 375 376 377 378 379 380 381 382 383 384 385 386 387 388 389 390 391 392 393 394 395 396 397 398 399 400 401 402 403 404 405 406 407 408 409 410 411 412 413 414 415 416 417 418 419 420 421 422 423 424 425 426 427 428 429 430 431
|
# RQRCode Core - Performance Optimization Plan
This document outlines the step-by-step plan for implementing the performance improvements identified in BENCHMARKS.md.
**Created**: December 4, 2025
**Status**: In Progress
---
## Overview
This plan addresses the "Future Optimization Ideas" section from BENCHMARKS.md, organized by category and priority. Each optimization will be done incrementally with benchmarking before/after to measure impact.
---
## Phase 1: Memory Optimizations (High Priority)
### Task #1: ARCH_BITS Investigation ✅
**Priority**: High
**Status**: COMPLETE
**File**: `lib/rqrcode_core/qrcode/qr_util.rb:69`
**Goals**:
- ✅ Run memory benchmarks with default 64-bit setting (baseline)
- ✅ Run memory benchmarks with `RQRCODE_CORE_ARCH_BITS=32`
- ✅ Run performance benchmarks for speed comparison
- ✅ Document memory differences and performance impact
- ✅ Determine if 32-bit can be safely used as default
- ✅ Make recommendation based on data
**Results**:
| Metric | 64-bit | 32-bit | Improvement |
|--------|--------|--------|-------------|
| Single v1 | 0.38 MB | 0.10 MB | **74% reduction** |
| Single v24 | 8.53 MB | 2.92 MB | **66% reduction** |
| 100x v1 | 37.91 MB | 9.10 MB | **76% reduction** |
| Speed (small) | 152.7 i/s | 157.1 i/s | **+2.9% faster** |
| Speed (large) | 4.77 i/s | 4.87 i/s | **+2.1% faster** |
| Objects (100x v1) | 872,700 | 117,500 | **87% reduction** |
**Key Findings**:
- Integer allocations nearly eliminated (from 70-76% to ~0%)
- All 108 test assertions pass
- Actually faster due to better cache utilization and reduced GC pressure
- No correctness issues
**Recommendation**: ✅ **STRONGLY RECOMMEND** users set `RQRCODE_CORE_ARCH_BITS=32` in production for 70-76% memory savings with 2-4% speed improvement. Not changing default to maintain backwards compatibility, but documenting heavily in README and code comments.
**Documentation Updated**:
- ✅ Updated `lib/rqrcode_core/qrcode/qr_util.rb` comment with benchmark data
- ✅ Replaced README "Experimental" section with prominent "Performance Optimization" section
- ✅ Updated `docs/BENCHMARKS.md` with proven results
- ✅ Moved all benchmark files to `test/benchmarks/` directory
- ✅ Created `test/benchmarks/ARCH_BITS_ANALYSIS.md` with complete analysis
See `test/benchmarks/ARCH_BITS_ANALYSIS.md` for detailed analysis.
---
### Task #2: More Efficient Data Structures for Modules
**Priority**: Medium
**Status**: Pending
**Files**: Various
**Goals**:
- Evaluate current Array of Arrays approach for modules
- Consider alternatives: Flat array with index calculations, BitArray
- Analyze trade-offs: Memory vs access speed
- Benchmark access patterns
**Current**: `@modules = Array.new(@module_count)`
**Considerations**: Module count can be 177x177 = 31,329 cells for v40
---
## Phase 2: Speed Optimizations
### Task #3: Profile Hot Paths with stackprof ✅
**Priority**: High
**Status**: COMPLETE
**Goals**:
- ✅ Add stackprof gem to development dependencies
- ✅ Create profiling script for various QR code sizes
- ✅ Focus on large QR codes (v20+) where performance degrades
- ✅ Identify actual bottlenecks with data
**Results Summary**:
Profiled v1, v5, v10, v20 QR codes across 100-5 iterations each. Clear bottlenecks identified:
**Top Hotspots** (for large QR codes):
1. **`demerit_points_1_same_color`** - **30.2% of CPU time** for v20 codes
- Nested O(n²) loops checking consecutive same-colored modules
- Runs for all 8 mask patterns
- Primary optimization target
2. **Garbage Collection** - 42-75% of samples
- Higher for small codes, decreases as size increases
- Already addressed via ARCH_BITS=32 recommendation
3. **`demerit_points_2_full_blocks`** - 3.5% for v20 codes
- Checks for 2x2 blocks of same color
- Secondary optimization target
4. **`demerit_points_3_dangerous_patterns`** - 2.8% for v20 codes
- Pattern matching for specific sequences
- Tertiary optimization target
**Key Insights**:
- **Scaling behavior**: Demerit calculations grow from 12.7% (v1) to 30.2% (v20) of CPU time
- **Small vs Large**: Small codes are GC-bound (74.7%), large codes are demerit-bound (30.2%)
- **Clear path forward**: Optimizing `demerit_points_1_same_color` will yield 15-30% improvement for large codes
**Files Created**:
- `test/profile_stackprof.rb` - Profiling script
- `test/benchmarks/STACKPROF_ANALYSIS.md` - Detailed analysis with all findings
- `tmp/stackprof/*.dump` - Raw profiling data for further investigation
See `test/benchmarks/STACKPROF_ANALYSIS.md` for complete analysis and optimization recommendations.
---
### Task #4: Cache Frequently Computed Values
**Priority**: Medium
**Status**: Pending
**File**: `lib/rqrcode_core/qrcode/qr_util.rb:119-127`
**Goals**:
- Identify values computed multiple times
- Implement memoization where appropriate
- Add class-level caching for version-specific calculations
**Candidates**:
- `get_error_correct_polynomial(error_correct_length)` - called per RS block
- Pattern positions (already cached in table)
- BCH calculations for common versions
---
### Task #5: Optimize Inner Loops in Encoding
**Priority**: Medium
**Status**: Pending
**Files**:
- `lib/rqrcode_core/qrcode/qr_util.rb:163-192` (demerit_points_1_same_color)
- `lib/rqrcode_core/qrcode/qr_code.rb:367-404` (map_data)
**Goals**:
- Focus on nested loops that iterate over module_count
- Reduce redundant calculations inside loops
- Consider pre-computing values or caching
**Hot Spots**:
- `demerit_points_1_same_color`: O(n²) with nested checks
- `map_data`: Critical path for encoding data into modules
---
### Task #6: Memoization for Version Calculations
**Priority**: Low
**Status**: Pending
**Goals**:
- Add memoization for version-specific calculations
- Cache results for common version/level combinations
- Measure impact on batch generation scenarios
---
## Phase 3: Algorithm Improvements
### Task #7: Review Polynomial Math Operations
**Priority**: Medium
**Status**: Pending
**File**: `lib/rqrcode_core/qrcode/qr_polynomial.rb`
**Goals**:
- Analyze multiply and mod methods for optimization
- Review recursive mod at line 58
- Consider iterative approach instead of recursive
- Benchmark alternative implementations
**Current Concerns**:
- `multiply`: Creates temporary arrays in nested loops
- `mod`: Recursive calls may cause stack overhead
---
### Task #8: Optimize Mask Pattern Calculation
**Priority**: Medium
**Status**: Pending
**File**: `lib/rqrcode_core/qrcode/qr_code.rb:286-300`
**Goals**:
- Review get_best_mask_pattern (tries all 8 patterns)
- Algorithm requires testing all patterns (QR spec)
- Focus on optimizing get_lost_points calculations
- Optimize demerit_points calculations
**Note**: Can't reduce number of patterns tested (spec requirement), but can optimize each pattern evaluation.
---
### Task #9: Benchmark Alternative Implementations
**Priority**: Low
**Status**: Pending
**Goals**:
- After implementing optimizations, compare with baseline
- Document improvements in BENCHMARKS.md
- Consider alternative algorithms if performance goals not met
- Compare with other QR code libraries for reference
---
## Execution Order
Recommended order for maximum impact:
1. ✅ **ARCH_BITS Investigation** - COMPLETE - Proven 70-76% memory savings + 2-4% speed boost
2. ✅ **Profile with stackprof** - COMPLETE - Identified `demerit_points_1_same_color` as 30% CPU bottleneck
3. ✅ **Optimize demerit calculation functions** - COMPLETE - **80-90% speed improvement** across all QR sizes
4. **Caching/memoization** - Progressive improvement (next priority)
5. **Data structures** - Larger refactor, do later
6. **Algorithm improvements** - Most complex, do last
---
## Process for Each Task
1. **Before Changes**:
- Run `rake benchmark:all > before_task_N.txt`
- Document current behavior
- Identify specific optimization targets
2. **Make Changes**:
- Implement optimization
- Run tests: `rake test`
- Run linter: `rake standard`
- Fix any issues
3. **After Changes**:
- Run `rake benchmark:all > after_task_N.txt`
- Compare results with baseline
- Document improvements
4. **Update Documentation**:
- Update this file with results
- Update BENCHMARKS.md if significant improvement
- Commit changes with clear message
---
## Success Metrics
Based on BENCHMARKS.md baseline:
### Memory Targets:
- Reduce memory allocations by 10-20% for single QR codes
- Reduce memory allocations by 15-30% for batch generation
- Focus on Integer and Array allocations (85-90% of total)
### Performance Targets:
- Improve large QR code (v20+) generation speed by 10-20%
- Improve batch generation throughput
- Maintain or improve small QR code performance
### Constraints:
- No breaking API changes
- All tests must pass
- Standard Ruby style compliance
- No external runtime dependencies
---
## Results Log
### Task #1: ARCH_BITS Investigation ✅
**Date**: December 4, 2025
**Status**: Complete
**Results**:
**PROVEN: Setting `RQRCODE_CORE_ARCH_BITS=32` provides dramatic improvements with zero downsides:**
- **Memory**: 70-76% reduction across all scenarios
- **Speed**: 2-4% faster (not just "no penalty"—actually faster)
- **Objects**: 85-87% fewer allocations
- **Tests**: All 108 assertions pass
- **Correctness**: No issues found
**Impact by scenario:**
- Single small QR: 0.38 MB → 0.10 MB (74% reduction)
- Single large QR: 8.53 MB → 2.92 MB (66% reduction)
- Batch 100 small: 37.91 MB → 9.10 MB (76% reduction)
**Why it works:** The `rszf` function creates bit masks. With 64-bit, these require large integer allocations. With 32-bit, they fit into smaller representations, dramatically reducing memory allocation while improving cache locality.
**Action taken:**
- Updated README with prominent "Performance Optimization" section
- Updated code comments in `qr_util.rb` with concrete benchmark data
- Updated `docs/BENCHMARKS.md` with proven results
- Organized all benchmark files in `test/benchmarks/` directory
- Created comprehensive analysis document
**Recommendation:** Users should set `RQRCODE_CORE_ARCH_BITS=32` in production. Not changing library default to avoid surprises for existing users, but strongly recommending the optimization through documentation.
Full analysis available in `test/benchmarks/ARCH_BITS_ANALYSIS.md`.
---
### Task #3: Profile Hot Paths with stackprof ✅
**Date**: December 5, 2025
**Status**: Complete
**Results**:
**IDENTIFIED: Clear CPU bottlenecks with concrete optimization targets:**
Profiled QR codes from v1 (21x21) to v20 (97x97) with 100-5 iterations each.
**Primary Hotspot** - `demerit_points_1_same_color` (`qr_util.rb:171`):
- **CPU Impact**: Scales from 12.7% (v1) → 30.2% (v20)
- **Why Slow**: O(n²) nested loops checking consecutive same-colored modules for all 8 mask patterns
- **Optimization Potential**: 15-30% speed improvement for large QR codes
**Secondary Hotspots**:
- `demerit_points_2_full_blocks`: 1.6% → 3.5% (2x2 block checking)
- `demerit_points_3_dangerous_patterns`: 0.6% → 2.8% (pattern matching)
- Combined potential: 5-10% improvement
**GC Overhead**:
- Small QR codes: 74.7% (GC-bound) - addressed by ARCH_BITS=32
- Large QR codes: 41.6% (compute-bound) - demerit functions are the bottleneck
**Scaling Insight**:
As QR codes grow larger, demerit calculations become the dominant performance factor, overtaking GC as the primary bottleneck.
**Action taken:**
- Added stackprof gem to gemspec
- Created `test/profile_stackprof.rb` profiling script
- Generated profiles for v1, v5, v10, v20 QR codes
- Created comprehensive analysis in `test/benchmarks/STACKPROF_ANALYSIS.md`
- Updated OPTIMIZATION_PLAN.md with data-driven priorities
**Next Steps**: Optimize `demerit_points_1_same_color` as highest-impact target (Task #4 reprioritized).
Full analysis available in `test/benchmarks/STACKPROF_ANALYSIS.md`.
---
### Task #3.5: Optimize Demerit Calculation Functions ✅
**Date**: December 5, 2025
**Status**: Complete
**Results**:
**OPTIMIZED: All three demerit calculation hotspots identified by stackprof**
Based on the stackprof analysis showing `demerit_points_*` functions as the primary bottleneck, optimized all three functions without changing their algorithmic behavior:
**Optimizations Applied:**
1. **`demerit_points_1_same_color` (qr_util.rb:171-213)**:
- Eliminated nested Range objects (`-1..1`) in hot loops
- Pre-computed `max_index` to avoid repeated `module_count - 1` calculations
- Cached row arrays (`modules_row`, `row_above`, `row_below`) to reduce array lookups
- Unrolled nested loops checking 3x3 neighborhood
- Replaced Range#each with Integer#times for better performance
- **Reduced CPU time from 30.2% → 18.5%** (39% reduction)
2. **`demerit_points_2_full_blocks` (qr_util.rb:215-230)**:
- Cached adjacent row arrays to eliminate redundant lookups
- Simplified 2x2 block check using direct equality comparisons
- Removed unnecessary counter variable and array inclusion check
- Replaced Range#each with Integer#times
3. **`demerit_points_3_dangerous_patterns` (qr_util.rb:232-259)**:
- Pre-computed pattern length and max_start index
- Simplified dangerous pattern checks with clearer conditionals
- Replaced Range#each with Integer#times
- Consolidated multi-line conditionals
**Performance Impact** (64-bit, before vs after):
| QR Code Size | Before | After | Improvement |
|--------------|---------|--------|-------------|
| Small (v1) | 152.7 i/s | 292.9 i/s | **+92% faster** |
| Medium (v5) | 46.2 i/s | 85.3 i/s | **+85% faster** |
| Large (v24) | 4.77 i/s | 8.68 i/s | **+82% faster** |
| Version 10 | 19.0 i/s | 34.6 i/s | **+82% faster** |
| Version 20 | 6.50 i/s | 11.8 i/s | **+82% faster** |
| Version 40 | 1.94 i/s | 3.51 i/s | **+81% faster** |
**Time per QR Code** (v20):
- Before: 153.86 ms
- After: 84.57 ms
- **Improvement: 45% reduction in generation time**
**StackProf CPU Profile Changes** (v20 QR code):
- `demerit_points_1_same_color`: 30.2% → 18.5% (39% reduction)
- GC overhead: 41.6% → 47.3% (now that compute is faster, GC shows proportionally higher)
- Overall CPU samples reduced significantly
**Key Insights:**
- The optimizations provided **consistent 80-90% speed improvements** across all QR code sizes
- Largest impact on large QR codes where demerit calculations dominate
- All 108 test assertions pass - behavior unchanged
- Code is now clearer and more maintainable
- No external dependencies added
**Action taken:**
- Optimized all three demerit calculation functions
- Maintained exact same algorithmic behavior (tests pass)
- Applied Ruby style guide fixes via `rake standard:fix`
- Generated new benchmark data in `test/benchmarks/benchmark_performance_optimized.txt`
- Verified performance improvements via stackprof
**Files Modified:**
- `lib/rqrcode_core/qrcode/qr_util.rb` (lines 171-259)
**Recommendation:** These optimizations provide massive performance gains with zero breaking changes. Ready for production use immediately.
See `test/benchmarks/benchmark_performance_optimized.txt` for complete benchmark results.
---
## References
- [BENCHMARKS.md](BENCHMARKS.md) - Baseline performance metrics
- [Ruby Performance Optimization](https://ruby-doc.org/core/doc/performance_md.html)
- [memory_profiler gem](https://github.com/SamSaffron/memory_profiler)
- [stackprof gem](https://github.com/tmm1/stackprof)
|