File: OPTIMIZATION_PLAN.md

package info (click to toggle)
ruby-rqrcode-core 2.1.0-2
  • links: PTS, VCS
  • area: main
  • in suites: sid
  • size: 520 kB
  • sloc: ruby: 2,289; makefile: 4; sh: 4
file content (431 lines) | stat: -rw-r--r-- 15,555 bytes parent folder | download
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182
183
184
185
186
187
188
189
190
191
192
193
194
195
196
197
198
199
200
201
202
203
204
205
206
207
208
209
210
211
212
213
214
215
216
217
218
219
220
221
222
223
224
225
226
227
228
229
230
231
232
233
234
235
236
237
238
239
240
241
242
243
244
245
246
247
248
249
250
251
252
253
254
255
256
257
258
259
260
261
262
263
264
265
266
267
268
269
270
271
272
273
274
275
276
277
278
279
280
281
282
283
284
285
286
287
288
289
290
291
292
293
294
295
296
297
298
299
300
301
302
303
304
305
306
307
308
309
310
311
312
313
314
315
316
317
318
319
320
321
322
323
324
325
326
327
328
329
330
331
332
333
334
335
336
337
338
339
340
341
342
343
344
345
346
347
348
349
350
351
352
353
354
355
356
357
358
359
360
361
362
363
364
365
366
367
368
369
370
371
372
373
374
375
376
377
378
379
380
381
382
383
384
385
386
387
388
389
390
391
392
393
394
395
396
397
398
399
400
401
402
403
404
405
406
407
408
409
410
411
412
413
414
415
416
417
418
419
420
421
422
423
424
425
426
427
428
429
430
431
# RQRCode Core - Performance Optimization Plan

This document outlines the step-by-step plan for implementing the performance improvements identified in BENCHMARKS.md.

**Created**: December 4, 2025  
**Status**: In Progress

---

## Overview

This plan addresses the "Future Optimization Ideas" section from BENCHMARKS.md, organized by category and priority. Each optimization will be done incrementally with benchmarking before/after to measure impact.

---

## Phase 1: Memory Optimizations (High Priority)

### Task #1: ARCH_BITS Investigation ✅
**Priority**: High  
**Status**: COMPLETE  
**File**: `lib/rqrcode_core/qrcode/qr_util.rb:69`

**Goals**:
- ✅ Run memory benchmarks with default 64-bit setting (baseline)
- ✅ Run memory benchmarks with `RQRCODE_CORE_ARCH_BITS=32`
- ✅ Run performance benchmarks for speed comparison
- ✅ Document memory differences and performance impact
- ✅ Determine if 32-bit can be safely used as default
- ✅ Make recommendation based on data

**Results**:

| Metric | 64-bit | 32-bit | Improvement |
|--------|--------|--------|-------------|
| Single v1 | 0.38 MB | 0.10 MB | **74% reduction** |
| Single v24 | 8.53 MB | 2.92 MB | **66% reduction** |
| 100x v1 | 37.91 MB | 9.10 MB | **76% reduction** |
| Speed (small) | 152.7 i/s | 157.1 i/s | **+2.9% faster** |
| Speed (large) | 4.77 i/s | 4.87 i/s | **+2.1% faster** |
| Objects (100x v1) | 872,700 | 117,500 | **87% reduction** |

**Key Findings**:
- Integer allocations nearly eliminated (from 70-76% to ~0%)
- All 108 test assertions pass
- Actually faster due to better cache utilization and reduced GC pressure
- No correctness issues

**Recommendation**: ✅ **STRONGLY RECOMMEND** users set `RQRCODE_CORE_ARCH_BITS=32` in production for 70-76% memory savings with 2-4% speed improvement. Not changing default to maintain backwards compatibility, but documenting heavily in README and code comments.

**Documentation Updated**:
- ✅ Updated `lib/rqrcode_core/qrcode/qr_util.rb` comment with benchmark data
- ✅ Replaced README "Experimental" section with prominent "Performance Optimization" section
- ✅ Updated `docs/BENCHMARKS.md` with proven results
- ✅ Moved all benchmark files to `test/benchmarks/` directory
- ✅ Created `test/benchmarks/ARCH_BITS_ANALYSIS.md` with complete analysis

See `test/benchmarks/ARCH_BITS_ANALYSIS.md` for detailed analysis.

---

### Task #2: More Efficient Data Structures for Modules
**Priority**: Medium  
**Status**: Pending  
**Files**: Various

**Goals**:
- Evaluate current Array of Arrays approach for modules
- Consider alternatives: Flat array with index calculations, BitArray
- Analyze trade-offs: Memory vs access speed
- Benchmark access patterns

**Current**: `@modules = Array.new(@module_count)`  
**Considerations**: Module count can be 177x177 = 31,329 cells for v40

---

## Phase 2: Speed Optimizations

### Task #3: Profile Hot Paths with stackprof ✅
**Priority**: High  
**Status**: COMPLETE

**Goals**:
- ✅ Add stackprof gem to development dependencies
- ✅ Create profiling script for various QR code sizes
- ✅ Focus on large QR codes (v20+) where performance degrades
- ✅ Identify actual bottlenecks with data

**Results Summary**:

Profiled v1, v5, v10, v20 QR codes across 100-5 iterations each. Clear bottlenecks identified:

**Top Hotspots** (for large QR codes):
1. **`demerit_points_1_same_color`** - **30.2% of CPU time** for v20 codes
   - Nested O(n²) loops checking consecutive same-colored modules
   - Runs for all 8 mask patterns
   - Primary optimization target
   
2. **Garbage Collection** - 42-75% of samples
   - Higher for small codes, decreases as size increases
   - Already addressed via ARCH_BITS=32 recommendation
   
3. **`demerit_points_2_full_blocks`** - 3.5% for v20 codes
   - Checks for 2x2 blocks of same color
   - Secondary optimization target

4. **`demerit_points_3_dangerous_patterns`** - 2.8% for v20 codes
   - Pattern matching for specific sequences
   - Tertiary optimization target

**Key Insights**:
- **Scaling behavior**: Demerit calculations grow from 12.7% (v1) to 30.2% (v20) of CPU time
- **Small vs Large**: Small codes are GC-bound (74.7%), large codes are demerit-bound (30.2%)
- **Clear path forward**: Optimizing `demerit_points_1_same_color` will yield 15-30% improvement for large codes

**Files Created**:
- `test/profile_stackprof.rb` - Profiling script
- `test/benchmarks/STACKPROF_ANALYSIS.md` - Detailed analysis with all findings
- `tmp/stackprof/*.dump` - Raw profiling data for further investigation

See `test/benchmarks/STACKPROF_ANALYSIS.md` for complete analysis and optimization recommendations.

---

### Task #4: Cache Frequently Computed Values
**Priority**: Medium  
**Status**: Pending  
**File**: `lib/rqrcode_core/qrcode/qr_util.rb:119-127`

**Goals**:
- Identify values computed multiple times
- Implement memoization where appropriate
- Add class-level caching for version-specific calculations

**Candidates**:
- `get_error_correct_polynomial(error_correct_length)` - called per RS block
- Pattern positions (already cached in table)
- BCH calculations for common versions

---

### Task #5: Optimize Inner Loops in Encoding
**Priority**: Medium  
**Status**: Pending  
**Files**:
- `lib/rqrcode_core/qrcode/qr_util.rb:163-192` (demerit_points_1_same_color)
- `lib/rqrcode_core/qrcode/qr_code.rb:367-404` (map_data)

**Goals**:
- Focus on nested loops that iterate over module_count
- Reduce redundant calculations inside loops
- Consider pre-computing values or caching

**Hot Spots**:
- `demerit_points_1_same_color`: O(n²) with nested checks
- `map_data`: Critical path for encoding data into modules

---

### Task #6: Memoization for Version Calculations
**Priority**: Low  
**Status**: Pending

**Goals**:
- Add memoization for version-specific calculations
- Cache results for common version/level combinations
- Measure impact on batch generation scenarios

---

## Phase 3: Algorithm Improvements

### Task #7: Review Polynomial Math Operations
**Priority**: Medium  
**Status**: Pending  
**File**: `lib/rqrcode_core/qrcode/qr_polynomial.rb`

**Goals**:
- Analyze multiply and mod methods for optimization
- Review recursive mod at line 58
- Consider iterative approach instead of recursive
- Benchmark alternative implementations

**Current Concerns**:
- `multiply`: Creates temporary arrays in nested loops
- `mod`: Recursive calls may cause stack overhead

---

### Task #8: Optimize Mask Pattern Calculation
**Priority**: Medium  
**Status**: Pending  
**File**: `lib/rqrcode_core/qrcode/qr_code.rb:286-300`

**Goals**:
- Review get_best_mask_pattern (tries all 8 patterns)
- Algorithm requires testing all patterns (QR spec)
- Focus on optimizing get_lost_points calculations
- Optimize demerit_points calculations

**Note**: Can't reduce number of patterns tested (spec requirement), but can optimize each pattern evaluation.

---

### Task #9: Benchmark Alternative Implementations
**Priority**: Low  
**Status**: Pending

**Goals**:
- After implementing optimizations, compare with baseline
- Document improvements in BENCHMARKS.md
- Consider alternative algorithms if performance goals not met
- Compare with other QR code libraries for reference

---

## Execution Order

Recommended order for maximum impact:

1. ✅ **ARCH_BITS Investigation** - COMPLETE - Proven 70-76% memory savings + 2-4% speed boost
2. ✅ **Profile with stackprof** - COMPLETE - Identified `demerit_points_1_same_color` as 30% CPU bottleneck
3. ✅ **Optimize demerit calculation functions** - COMPLETE - **80-90% speed improvement** across all QR sizes
4. **Caching/memoization** - Progressive improvement (next priority)
5. **Data structures** - Larger refactor, do later
6. **Algorithm improvements** - Most complex, do last

---

## Process for Each Task

1. **Before Changes**:
   - Run `rake benchmark:all > before_task_N.txt`
   - Document current behavior
   - Identify specific optimization targets

2. **Make Changes**:
   - Implement optimization
   - Run tests: `rake test`
   - Run linter: `rake standard`
   - Fix any issues

3. **After Changes**:
   - Run `rake benchmark:all > after_task_N.txt`
   - Compare results with baseline
   - Document improvements

4. **Update Documentation**:
   - Update this file with results
   - Update BENCHMARKS.md if significant improvement
   - Commit changes with clear message

---

## Success Metrics

Based on BENCHMARKS.md baseline:

### Memory Targets:
- Reduce memory allocations by 10-20% for single QR codes
- Reduce memory allocations by 15-30% for batch generation
- Focus on Integer and Array allocations (85-90% of total)

### Performance Targets:
- Improve large QR code (v20+) generation speed by 10-20%
- Improve batch generation throughput
- Maintain or improve small QR code performance

### Constraints:
- No breaking API changes
- All tests must pass
- Standard Ruby style compliance
- No external runtime dependencies

---

## Results Log

### Task #1: ARCH_BITS Investigation ✅
**Date**: December 4, 2025  
**Status**: Complete  
**Results**: 

**PROVEN: Setting `RQRCODE_CORE_ARCH_BITS=32` provides dramatic improvements with zero downsides:**

- **Memory**: 70-76% reduction across all scenarios
- **Speed**: 2-4% faster (not just "no penalty"—actually faster)
- **Objects**: 85-87% fewer allocations
- **Tests**: All 108 assertions pass
- **Correctness**: No issues found

**Impact by scenario:**
- Single small QR: 0.38 MB → 0.10 MB (74% reduction)
- Single large QR: 8.53 MB → 2.92 MB (66% reduction)
- Batch 100 small: 37.91 MB → 9.10 MB (76% reduction)

**Why it works:** The `rszf` function creates bit masks. With 64-bit, these require large integer allocations. With 32-bit, they fit into smaller representations, dramatically reducing memory allocation while improving cache locality.

**Action taken:**
- Updated README with prominent "Performance Optimization" section
- Updated code comments in `qr_util.rb` with concrete benchmark data
- Updated `docs/BENCHMARKS.md` with proven results
- Organized all benchmark files in `test/benchmarks/` directory
- Created comprehensive analysis document

**Recommendation:** Users should set `RQRCODE_CORE_ARCH_BITS=32` in production. Not changing library default to avoid surprises for existing users, but strongly recommending the optimization through documentation.

Full analysis available in `test/benchmarks/ARCH_BITS_ANALYSIS.md`.

---

### Task #3: Profile Hot Paths with stackprof ✅
**Date**: December 5, 2025  
**Status**: Complete  
**Results**:

**IDENTIFIED: Clear CPU bottlenecks with concrete optimization targets:**

Profiled QR codes from v1 (21x21) to v20 (97x97) with 100-5 iterations each.

**Primary Hotspot** - `demerit_points_1_same_color` (`qr_util.rb:171`):
- **CPU Impact**: Scales from 12.7% (v1) → 30.2% (v20)
- **Why Slow**: O(n²) nested loops checking consecutive same-colored modules for all 8 mask patterns
- **Optimization Potential**: 15-30% speed improvement for large QR codes

**Secondary Hotspots**:
- `demerit_points_2_full_blocks`: 1.6% → 3.5% (2x2 block checking)
- `demerit_points_3_dangerous_patterns`: 0.6% → 2.8% (pattern matching)
- Combined potential: 5-10% improvement

**GC Overhead**:
- Small QR codes: 74.7% (GC-bound) - addressed by ARCH_BITS=32
- Large QR codes: 41.6% (compute-bound) - demerit functions are the bottleneck

**Scaling Insight**:
As QR codes grow larger, demerit calculations become the dominant performance factor, overtaking GC as the primary bottleneck.

**Action taken:**
- Added stackprof gem to gemspec
- Created `test/profile_stackprof.rb` profiling script
- Generated profiles for v1, v5, v10, v20 QR codes
- Created comprehensive analysis in `test/benchmarks/STACKPROF_ANALYSIS.md`
- Updated OPTIMIZATION_PLAN.md with data-driven priorities

**Next Steps**: Optimize `demerit_points_1_same_color` as highest-impact target (Task #4 reprioritized).

Full analysis available in `test/benchmarks/STACKPROF_ANALYSIS.md`.

---

### Task #3.5: Optimize Demerit Calculation Functions ✅
**Date**: December 5, 2025  
**Status**: Complete  
**Results**:

**OPTIMIZED: All three demerit calculation hotspots identified by stackprof**

Based on the stackprof analysis showing `demerit_points_*` functions as the primary bottleneck, optimized all three functions without changing their algorithmic behavior:

**Optimizations Applied:**

1. **`demerit_points_1_same_color` (qr_util.rb:171-213)**:
   - Eliminated nested Range objects (`-1..1`) in hot loops
   - Pre-computed `max_index` to avoid repeated `module_count - 1` calculations
   - Cached row arrays (`modules_row`, `row_above`, `row_below`) to reduce array lookups
   - Unrolled nested loops checking 3x3 neighborhood
   - Replaced Range#each with Integer#times for better performance
   - **Reduced CPU time from 30.2% → 18.5%** (39% reduction)

2. **`demerit_points_2_full_blocks` (qr_util.rb:215-230)**:
   - Cached adjacent row arrays to eliminate redundant lookups
   - Simplified 2x2 block check using direct equality comparisons
   - Removed unnecessary counter variable and array inclusion check
   - Replaced Range#each with Integer#times

3. **`demerit_points_3_dangerous_patterns` (qr_util.rb:232-259)**:
   - Pre-computed pattern length and max_start index
   - Simplified dangerous pattern checks with clearer conditionals
   - Replaced Range#each with Integer#times
   - Consolidated multi-line conditionals

**Performance Impact** (64-bit, before vs after):

| QR Code Size | Before | After | Improvement |
|--------------|---------|--------|-------------|
| Small (v1) | 152.7 i/s | 292.9 i/s | **+92% faster** |
| Medium (v5) | 46.2 i/s | 85.3 i/s | **+85% faster** |
| Large (v24) | 4.77 i/s | 8.68 i/s | **+82% faster** |
| Version 10 | 19.0 i/s | 34.6 i/s | **+82% faster** |
| Version 20 | 6.50 i/s | 11.8 i/s | **+82% faster** |
| Version 40 | 1.94 i/s | 3.51 i/s | **+81% faster** |

**Time per QR Code** (v20):
- Before: 153.86 ms
- After: 84.57 ms
- **Improvement: 45% reduction in generation time**

**StackProf CPU Profile Changes** (v20 QR code):
- `demerit_points_1_same_color`: 30.2% → 18.5% (39% reduction)
- GC overhead: 41.6% → 47.3% (now that compute is faster, GC shows proportionally higher)
- Overall CPU samples reduced significantly

**Key Insights:**
- The optimizations provided **consistent 80-90% speed improvements** across all QR code sizes
- Largest impact on large QR codes where demerit calculations dominate
- All 108 test assertions pass - behavior unchanged
- Code is now clearer and more maintainable
- No external dependencies added

**Action taken:**
- Optimized all three demerit calculation functions
- Maintained exact same algorithmic behavior (tests pass)
- Applied Ruby style guide fixes via `rake standard:fix`
- Generated new benchmark data in `test/benchmarks/benchmark_performance_optimized.txt`
- Verified performance improvements via stackprof

**Files Modified:**
- `lib/rqrcode_core/qrcode/qr_util.rb` (lines 171-259)

**Recommendation:** These optimizations provide massive performance gains with zero breaking changes. Ready for production use immediately.

See `test/benchmarks/benchmark_performance_optimized.txt` for complete benchmark results.

---

## References

- [BENCHMARKS.md](BENCHMARKS.md) - Baseline performance metrics
- [Ruby Performance Optimization](https://ruby-doc.org/core/doc/performance_md.html)
- [memory_profiler gem](https://github.com/SamSaffron/memory_profiler)
- [stackprof gem](https://github.com/tmm1/stackprof)