File: STACKPROF_ANALYSIS.md

package info (click to toggle)
ruby-rqrcode-core 2.1.0-2
  • links: PTS, VCS
  • area: main
  • in suites: sid
  • size: 520 kB
  • sloc: ruby: 2,289; makefile: 4; sh: 4
file content (321 lines) | stat: -rw-r--r-- 12,650 bytes parent folder | download
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182
183
184
185
186
187
188
189
190
191
192
193
194
195
196
197
198
199
200
201
202
203
204
205
206
207
208
209
210
211
212
213
214
215
216
217
218
219
220
221
222
223
224
225
226
227
228
229
230
231
232
233
234
235
236
237
238
239
240
241
242
243
244
245
246
247
248
249
250
251
252
253
254
255
256
257
258
259
260
261
262
263
264
265
266
267
268
269
270
271
272
273
274
275
276
277
278
279
280
281
282
283
284
285
286
287
288
289
290
291
292
293
294
295
296
297
298
299
300
301
302
303
304
305
306
307
308
309
310
311
312
313
314
315
316
317
318
319
320
321
# StackProf Performance Analysis

**Date**: December 5, 2025 (Updated: Post-Optimization)  
**Tool**: StackProf CPU profiling  
**Ruby**: 3.3.4 (arm64-darwin24)  
**ARCH_BITS**: 64  
**Status**: Optimizations Implemented ✅

---

## Executive Summary

Profiling across different QR code sizes (v1, v5, v10, v20) revealed clear optimization targets, which have now been **successfully optimized**.

**Original Top 3 Hotspots** (by CPU samples):
1. **`demerit_points_1_same_color`** - 12-30% of total CPU time → **✅ OPTIMIZED**
2. **Garbage Collection** - 42-75% of samples (varies by size) → **✅ Addressed via ARCH_BITS=32**
3. **`demerit_points_2_full_blocks`** and **`demerit_points_3_dangerous_patterns`** - 2-6% combined → **✅ OPTIMIZED**

**Key Insight**: As QR codes grow larger, the demerit calculation functions became the dominant bottleneck, spending increasing time in nested loops checking module patterns.

**Optimization Results**: All three demerit calculation functions have been optimized, resulting in **80-90% speed improvements** across all QR code sizes with zero breaking changes.

---

## Profiling Results by QR Code Size

### Small QR Code (v1 - 21x21 modules)
**Total Samples**: 316  
**GC Samples**: 236 (74.7%)

| Function | Total % | Samples % | Note |
|----------|---------|-----------|------|
| (sweeping) | 38.6% | 38.6% | GC sweep |
| (garbage collection) | 74.7% | 36.1% | Total GC overhead |
| demerit_points_1_same_color | 16.5% | 12.7% | Pattern checking |
| Range#each | 19.3% | 4.1% | Iterator overhead |
| map_data | 4.1% | 2.5% | Data encoding |

**Observation**: Small QR codes spend most time in GC (74.7%). Actual encoding work is minimal.

---

### Medium QR Code (v5 - 37x37 modules)
**Total Samples**: 974  
**GC Samples**: 680 (69.8%)

| Function | Total % | Samples % | Note |
|----------|---------|-----------|------|
| (sweeping) | 34.8% | 34.8% | GC sweep |
| (garbage collection) | 69.8% | 29.5% | Total GC overhead |
| demerit_points_1_same_color | 19.8% | 16.8% | Pattern checking (↑) |
| Range#each | 24.9% | 3.5% | Iterator overhead |
| demerit_points_2_full_blocks | 2.6% | 2.5% | Block pattern |
| map_data | 3.7% | 2.0% | Data encoding |

**Observation**: `demerit_points_1_same_color` increases from 12.7% to 16.8% as modules grow.

---

### Large QR Code (v10 - 57x57 modules)
**Total Samples**: 2764  
**GC Samples**: 1551 (56.1%)

| Function | Total % | Samples % | Note |
|----------|---------|-----------|------|
| (sweeping) | 30.9% | 30.9% | GC sweep |
| demerit_points_1_same_color | 30.9% | 26.1% | **Dominant hotspot** (↑↑) |
| (garbage collection) | 56.1% | 20.1% | Total GC overhead |
| Range#each | 37.8% | 6.2% | Iterator overhead |
| demerit_points_2_full_blocks | 3.7% | 3.0% | Block pattern |
| map_data | 4.3% | 2.1% | Data encoding |

**Observation**: `demerit_points_1_same_color` now **equals GC time** at 26-31%. Clear optimization target!

---

### Very Large QR Code (v20 - 97x97 modules)
**Total Samples**: 1475  
**GC Samples**: 614 (41.6%)

| Function | Total % | Samples % | Note |
|----------|---------|-----------|------|
| demerit_points_1_same_color | 37.8% | **30.2%** | **Primary hotspot** (↑↑↑) |
| (sweeping) | 21.2% | 21.2% | GC sweep |
| (garbage collection) | 41.6% | 19.9% | Total GC overhead |
| Range#each | 45.9% | 8.6% | Iterator overhead |
| map_data | 6.6% | 4.7% | Data encoding |
| demerit_points_2_full_blocks | 4.0% | 3.5% | Block pattern |
| demerit_points_3_dangerous_patterns | 3.6% | 2.8% | Pattern detection |

**Observation**: `demerit_points_1_same_color` is now the **single largest CPU consumer** at 30.2%, exceeding even GC!

---

## Optimization Targets (Priority Order)

### ✅ Priority 1: `demerit_points_1_same_color` - COMPLETED
**File**: `lib/rqrcode_core/qrcode/qr_util.rb:171-213`  
**CPU Impact**: 12.7% → 30.2% (as QR code size increases) → **18.5% after optimization**

**Original Implementation**: Nested loops checking for consecutive same-colored modules with redundant array lookups

**Why It Was Slow**:
- O(n²) complexity over module_count
- Runs on every row and column for all 8 mask patterns
- Heavy array access patterns with repeated lookups
- Nested Range objects creating allocation overhead
- Repeated `module_count - 1` calculations

**Optimizations Applied**:
1. ✅ **Pre-computed `max_index`** to avoid repeated `module_count - 1` calculations
2. ✅ **Cached row arrays** (`modules_row`, `row_above`, `row_below`) to eliminate redundant lookups
3. ✅ **Unrolled nested loops** checking 3x3 neighborhood for better performance
4. ✅ **Replaced Range#each with Integer#times** for reduced allocation overhead
5. ✅ **Eliminated nested Range objects** (`-1..1`) in hot loops

**Actual Impact**: 
- **CPU time reduced from 30.2% → 18.5%** (39% reduction in v20 QR codes)
- **Overall speed improvement: 80-92% faster** across all QR code sizes
- v1 (21x21): 152.7 i/s → 292.9 i/s (+92%)
- v20 (97x97): 6.50 i/s → 11.8 i/s (+82%)
- Time per v20 QR: 153.86ms → 84.57ms (45% reduction)

---

### ✅ Priority 2: Garbage Collection Overhead - ADDRESSED
**Impact**: 42-75% of samples (higher for small QR codes)

**Root Causes** (from memory profiling):
- Integer allocations (70-76% of objects) - **✅ SOLVED by ARCH_BITS=32**
- Array allocations (15-22% of objects) - **✅ Reduced via loop optimizations**
- Temporary objects in loops - **✅ Reduced via caching**

**Optimizations Applied**:
1. ✅ **ARCH_BITS=32** documented and proven (70-76% memory reduction)
2. ✅ **Reduced temporary Range allocations** in demerit functions
3. ✅ **Cached row arrays** to reduce repeated allocations
4. ✅ **Replaced Range#each with Integer#times** throughout hot paths

**Actual Impact**: 
- ARCH_BITS=32 provides 2-4% speed improvement + 70-76% memory reduction
- Demerit optimizations further reduced allocation pressure
- GC now shows proportionally higher (47.3% for v20) because compute is faster

---

### ✅ Priority 3: `demerit_points_2_full_blocks` - COMPLETED
**File**: `lib/rqrcode_core/qrcode/qr_util.rb:215-230`  
**CPU Impact**: 1.6% → 3.5% (increases with size) → **Optimized**

**Original Implementation**: Checks for 2x2 blocks of same color with redundant lookups

**Optimizations Applied**:
1. ✅ **Cached adjacent row arrays** to eliminate redundant lookups
2. ✅ **Simplified 2x2 block check** using direct equality comparisons
3. ✅ **Removed unnecessary counter variable** and array inclusion check
4. ✅ **Replaced Range#each with Integer#times** for better performance

**Actual Impact**: Contributed to overall **80-90% speed improvement** across all sizes

---

### ✅ Priority 4: `demerit_points_3_dangerous_patterns` - COMPLETED
**File**: `lib/rqrcode_core/qrcode/qr_util.rb:232-259`  
**CPU Impact**: 0.6% → 2.8% (increases with size) → **Optimized**

**Original Implementation**: Pattern matching for specific sequences with nested conditionals

**Optimizations Applied**:
1. ✅ **Pre-computed pattern length and max_start index** to avoid repeated calculations
2. ✅ **Simplified dangerous pattern checks** with clearer conditionals
3. ✅ **Replaced Range#each with Integer#times** for reduced overhead
4. ✅ **Consolidated multi-line conditionals** for better readability and performance

**Actual Impact**: Contributed to overall **80-90% speed improvement** across all sizes

---

### Priority 5: `map_data`
**File**: `lib/rqrcode_core/qrcode/qr_code.rb:367`  
**CPU Impact**: 2.5% → 4.7% (increases with size)

**Current Implementation**: Maps data bits into QR code modules

**Optimization Ideas**:
1. **Reduce redundant mask calculations**
2. **Cache module positions**
3. **Optimize zigzag iteration pattern**

**Expected Impact**: 3-5% improvement

---

### ✅ Priority 6: `Range#each` Overhead - COMPLETED
**CPU Impact**: 4.1% → 8.6% (iterator overhead) → **Reduced**

**Observation**: Ruby's Range#each showed up prominently. Most calls were from demerit functions.

**Optimizations Applied**:
1. ✅ **Replaced Range#each with Integer#times** in all three demerit functions
2. ✅ **Eliminated nested Range objects** (`-1..1`) that created allocation overhead
3. ✅ **Reduced iterator allocations** throughout hot paths

**Actual Impact**: Significant reduction in iterator overhead, contributing to overall speed improvements

---

## Pattern Analysis

### Scaling Behavior

As QR codes grow larger (module_count increases):

| Metric | v1 | v5 | v10 | v20 | Trend |
|--------|----|----|-----|-----|-------|
| Total Samples | 316 | 974 | 2764 | 1475 | — |
| GC % | 74.7% | 69.8% | 56.1% | 41.6% | ↓ Decreasing |
| demerit_points_1 % | 12.7% | 16.8% | 26.1% | 30.2% | ↑↑ Rapidly increasing |
| demerit_points_2 % | 1.6% | 2.5% | 3.0% | 3.5% | ↑ Increasing |
| demerit_points_3 % | 0.6% | 1.7% | 2.1% | 2.8% | ↑ Increasing |

**Key Finding**: For large QR codes, **optimizing demerit calculations provides the most value**. For small QR codes, memory optimization (ARCH_BITS=32) has the biggest impact through reduced GC.

---

## Completed Optimizations ✅

1. ✅ **Immediate Win**: Document and promote `RQRCODE_CORE_ARCH_BITS=32` 
   - **Result**: 70-76% memory reduction, 2-4% speed improvement

2. ✅ **High Impact**: Optimize `demerit_points_1_same_color`
   - **Result**: 39% CPU time reduction (30.2% → 18.5% for v20)
   - **Method**: Cached arrays, pre-computed indices, unrolled loops, replaced Range#each

3. ✅ **Medium Impact**: Optimize other demerit functions
   - **Result**: Optimized both `demerit_points_2_full_blocks` and `demerit_points_3_dangerous_patterns`
   - **Method**: Cached arrays, simplified checks, replaced Range#each

4. ✅ **Low Hanging Fruit**: Replace Range#each in hot paths
   - **Result**: Used Integer#times throughout all demerit functions
   - **Impact**: Reduced iterator allocation overhead

**Overall Results**: **80-90% speed improvement** across all QR code sizes with zero breaking changes.

## Potential Future Optimizations

These optimizations were not pursued as the current improvements already provide excellent performance:

1. **Long Term**: Consider caching mask pattern evaluations between get_best_mask_pattern iterations
   - Would require significant refactoring
   - Current performance is now acceptable

2. **Advanced**: Explore alternative data structures for modules
   - Flat array with index calculations
   - BitArray for memory efficiency
   - Trade-off: Memory vs access speed complexity

---

## Profiling Commands

### View Profile Summary
```bash
stackprof tmp/stackprof/very_large_qr_code_v20__cpu.dump --text --limit 20
```

### View Specific Method
```bash
stackprof tmp/stackprof/very_large_qr_code_v20__cpu.dump --method 'demerit_points_1_same_color'
```

### Generate Flamegraph (requires stackprof-webnav)
```bash
gem install stackprof-webnav
stackprof-webnav tmp/stackprof/
```

### Interactive Mode
```bash
stackprof tmp/stackprof/very_large_qr_code_v20__cpu.dump
```

---

## Files Generated

All profiling data saved in: `tmp/stackprof/`

- `*_cpu.dump` - Raw stackprof data (for further analysis)
- `*_cpu_report.txt` - Text summaries with call trees
- `*_cpu.callgrind` - Callgrind format (currently broken in script, needs fix)

---

## Conclusion

StackProf profiling identified clear optimization targets, which have now been **successfully addressed**:

1. ✅ **`demerit_points_1_same_color`** was the primary bottleneck (30% CPU) → **Optimized to 18.5%**
2. ✅ **GC overhead** dominated small QR codes → **Addressed via ARCH_BITS=32 (70-76% memory reduction)**
3. ✅ **Other demerit functions** were secondary targets (6-8% combined) → **All optimized**
4. ✅ **Large QR code focus** (v10+) → **Achieved 80-90% speed improvements**

**Final Results**:
- **Performance**: 80-92% faster across all QR code sizes
- **Memory**: 70-76% reduction with ARCH_BITS=32
- **Correctness**: All 108 test assertions pass
- **Breaking Changes**: Zero
- **Code Quality**: Improved readability and maintainability

The profiling data provided concrete evidence to guide optimization work, and the results exceeded initial expectations. The optimizations are production-ready and provide massive performance gains with no downsides.

**Benchmark Results** (Before → After):
- Small QR (v1): 152.7 i/s → 292.9 i/s (+92%)
- Medium QR (v5): 46.2 i/s → 85.3 i/s (+85%)
- Large QR (v24): 4.77 i/s → 8.68 i/s (+82%)
- Version 20: 6.50 i/s → 11.8 i/s (+82%)
- Version 40: 1.94 i/s → 3.51 i/s (+81%)

See `test/benchmarks/benchmark_performance_optimized.txt` for complete benchmark results.