1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60 61 62 63 64 65 66 67 68 69 70 71 72 73 74 75 76 77 78 79 80 81 82 83 84 85 86 87 88 89 90 91 92 93 94 95 96 97 98 99 100 101 102 103 104 105 106 107 108 109 110 111 112 113 114 115 116 117 118 119 120 121 122 123 124 125 126 127 128 129 130 131 132 133 134 135 136 137 138 139 140 141 142 143 144 145 146 147 148 149 150 151 152 153 154 155 156 157 158 159 160 161 162 163 164 165 166 167 168 169 170 171 172 173 174 175 176 177 178 179 180 181 182 183 184 185 186 187 188 189 190 191 192 193 194 195 196 197 198 199 200 201 202 203 204 205 206 207 208 209 210 211 212 213 214 215 216 217 218 219 220 221 222 223 224 225 226 227 228 229 230 231 232 233 234 235 236 237 238 239 240 241 242 243 244 245 246 247 248 249 250 251 252 253 254 255 256 257 258 259 260 261 262 263 264 265 266 267 268 269 270 271 272 273 274 275 276 277 278 279 280 281 282 283 284 285 286 287 288 289 290 291 292 293 294 295 296 297 298 299 300 301 302 303 304 305 306 307 308 309 310 311 312 313 314 315 316 317 318 319 320 321 322 323 324 325 326 327 328 329 330 331 332 333 334 335 336 337 338 339 340 341 342 343 344 345 346 347 348 349 350 351 352 353 354 355 356 357 358 359 360 361 362 363 364 365 366 367 368 369 370 371 372 373 374 375 376 377 378 379 380 381 382
|
# System Resource Tracker Checker
## Overview
The System Resource Tracker is a Level Zero validation layer checker that monitors both Level Zero API resources and system resources in real-time. It tracks resource allocation and deallocation across all Level Zero API calls that create or destroy resources, providing detailed insights into memory usage, resource lifecycles, and system-level metrics.
**Platform Support:** Linux only. This checker uses `/proc/self/status` for system metrics and is not available on Windows or macOS.
## Features
- **Level Zero Resource Tracking**: Monitors all L0 resources including contexts, command queues, modules, kernels, event pools, command lists, events, fences, images, samplers, and memory allocations
- **Memory Leak Detection**: Automatically detects memory leaks by tracking handle-to-metrics mappings for all create/destroy operations:
- Compares memory usage at resource creation vs destruction
- Logs warnings when memory increases during destroy operations
- Reports cumulative leaks per resource type at program exit
- Provides detailed per-handle leak information
- **System Resource Monitoring**: Tracks real system metrics via `/proc/self/status` including:
- Virtual memory size (VmSize)
- Resident set size (VmRSS)
- Data segment size (VmData)
- Peak virtual memory (VmPeak)
- Thread count
- File descriptor count
- **Signed Delta Tracking**: Calculates both positive and negative resource changes (deltas) for each API call with proper signed arithmetic
- **Cumulative Summaries**: Maintains running totals of all resource types and leak totals
- **CSV Export**: Optionally exports timestamped data for graphing and analysis
- **Per-Process Isolation**: Each process creates a unique output file with PID appended
- **Thread-Safe**: Uses mutex protection for concurrent access using Meyers' singleton pattern
- **Low Overhead**: Minimal performance impact (< 1ms per API call)
## Usage
### Basic Tracking with Debug Logging
Enable the checker to log resource usage to the Level Zero debug log:
```bash
export ZE_ENABLE_VALIDATION_LAYER=1
export ZEL_ENABLE_SYSTEM_RESOURCE_TRACKER_CHECKER=1
export ZEL_ENABLE_LOADER_LOGGING=1
export ZEL_LOADER_LOGGING_LEVEL=debug
# Run your Level Zero application
./my_level_zero_app
```
### CSV Output for Graphing
Set the `ZEL_SYSTEM_RESOURCE_TRACKER_CSV` environment variable to specify the output CSV file path, this path will be relative to the current working directory of the application:
```bash
export ZE_ENABLE_VALIDATION_LAYER=1
export ZEL_ENABLE_SYSTEM_RESOURCE_TRACKER_CHECKER=1
export ZEL_SYSTEM_RESOURCE_TRACKER_CSV=tracker_output.csv
# Run your Level Zero application
./my_level_zero_app
```
**Note:** The actual output file will include the process ID (e.g., `tracker_output_pid12345.csv`) to ensure each process creates a unique file. This prevents conflicts when multiple processes use the tracker simultaneously.
## Tracked API Calls
The System Resource Tracker monitors the following Level Zero API functions:
### Context Management
- `zeContextCreate` - Creates L0 context, increments context count
### Command Queue Management
- `zeCommandQueueCreate` - Creates command queue, increments queue count
- `zeCommandQueueDestroy` - Destroys command queue, decrements queue count
### Module and Kernel Management
- `zeModuleCreate` - Creates module from binary, increments module count
- `zeModuleDestroy` - Destroys module, decrements module count
- `zeKernelCreate` - Creates kernel from module, increments kernel count
- `zeKernelDestroy` - Destroys kernel, decrements kernel count
### Event Management
- `zeEventPoolCreate` - Creates event pool, increments pool count
- `zeEventPoolDestroy` - Destroys event pool, decrements pool count
- `zeEventCreate` - Creates event, increments event count
- `zeEventDestroy` - Destroys event, decrements event count
### Command List Management
- `zeCommandListCreate` - Creates regular command list, increments list count
- `zeCommandListCreateImmediate` - Creates immediate command list, increments list count
- `zeCommandListDestroy` - Destroys command list, decrements list count
- `zeCommandListReset` - Resets command list state
### Synchronization
- `zeFenceCreate` - Creates fence, increments fence count
- `zeFenceDestroy` - Destroys fence, decrements fence count
### Image and Sampler Management
- `zeImageCreate` - Creates image, increments image count
- `zeImageDestroy` - Destroys image, decrements image count
- `zeSamplerCreate` - Creates sampler, increments sampler count
- `zeSamplerDestroy` - Destroys sampler, decrements sampler count
### Memory Management
- `zeMemAllocDevice` - Allocates device memory, tracks allocation size and count
- `zeMemAllocHost` - Allocates host memory, tracks allocation size and count
- `zeMemAllocShared` - Allocates shared memory, tracks allocation size and count
- `zeMemFree` - Frees memory, tracks deallocation
### Command List Append Operations
The tracker also monitors all command list append operations to measure memory deltas after each call:
**Memory Operations:**
- `zeCommandListAppendMemoryCopy` - Appends memory copy command
- `zeCommandListAppendMemoryFill` - Appends memory fill command
- `zeCommandListAppendMemoryCopyRegion` - Appends region-based memory copy
- `zeCommandListAppendMemoryCopyFromContext` - Appends cross-context memory copy
- `zeCommandListAppendMemoryPrefetch` - Appends memory prefetch hint
- `zeCommandListAppendMemAdvise` - Appends memory advice hint
**Image Operations:**
- `zeCommandListAppendImageCopy` - Appends image copy command
- `zeCommandListAppendImageCopyRegion` - Appends region-based image copy
- `zeCommandListAppendImageCopyToMemory` - Appends image-to-memory copy
- `zeCommandListAppendImageCopyFromMemory` - Appends memory-to-image copy
**Synchronization:**
- `zeCommandListAppendBarrier` - Appends barrier command
- `zeCommandListAppendMemoryRangesBarrier` - Appends memory ranges barrier
- `zeCommandListAppendSignalEvent` - Appends event signal command
- `zeCommandListAppendWaitOnEvents` - Appends wait on events command
- `zeCommandListAppendEventReset` - Appends event reset command
**Kernel Execution:**
- `zeCommandListAppendLaunchKernel` - Appends kernel launch command
- `zeCommandListAppendLaunchCooperativeKernel` - Appends cooperative kernel launch
- `zeCommandListAppendLaunchKernelIndirect` - Appends indirect kernel launch
**Query Operations:**
- `zeCommandListAppendQueryKernelTimestamps` - Appends kernel timestamp query
- `zeCommandListAppendWriteGlobalTimestamp` - Appends global timestamp write
## Output Format
### Debug Log Output
When logging is enabled, each tracked API call produces output like:
```
[debug] zeContextCreate: Context Count=1, handle=0x5567c00e2f98, Delta: VmRSS=+192KB, VmSize=+0KB, VmData=+0KB
[debug]
System Use After Call:
Total System: VmRSS=158784KB, VmSize=427744KB, VmData=167456KB, Threads=2
L0 Contexts (1): VmRSS=0KB, VmSize=0KB, VmData=0KB
```
When memory leaks are detected on destroy operations:
```
[warning] MEMORY LEAK DETECTED: Context handle=0x5567c00e2f98 - After destroy: VmRSS=+384KB, VmSize=+1796KB (Total Context leaks: 384KB)
```
At program exit, a comprehensive leak summary is logged:
```
[error]
=== MEMORY LEAK SUMMARY ===
Context leaks: 384 KB
EventPool leaks: 384 KB
CommandList leaks: 384 KB
Event leaks: 192 KB
TOTAL LEAKS: 1344 KB
===========================
```
### CSV Format
The CSV file contains the following columns:
| Column | Description |
|--------|-------------|
| `CallNumber` | Sequential call number |
| `TimeMs` | Elapsed time in milliseconds since tracker started |
| `APICall` | Level Zero API function name |
| `VmSize_KB` | Virtual memory size (KB) |
| `VmRSS_KB` | Resident set size (KB) |
| `VmData_KB` | Data segment size (KB) |
| `VmPeak_KB` | Peak virtual memory size (KB) |
| `Threads` | Number of threads |
| `Delta_VmSize_KB` | Signed change in VmSize since previous call (KB, can be negative) |
| `Delta_VmRSS_KB` | Signed change in VmRSS since previous call (KB, can be negative) |
| `Delta_VmData_KB` | Signed change in VmData since previous call (KB, can be negative) |
| `Contexts` | Current number of L0 contexts |
| `CommandQueues` | Current number of L0 command queues |
| `Modules` | Current number of L0 modules |
| `Kernels` | Current number of L0 kernels |
| `EventPools` | Current number of L0 event pools |
| `CommandLists` | Current number of L0 command lists |
| `Events` | Current number of L0 events |
| `Fences` | Current number of L0 fences |
| `Images` | Current number of L0 images |
| `Samplers` | Current number of L0 samplers |
| `TotalMemory_Bytes` | Total L0 memory allocated (bytes) |
**Note on Delta Values:** Delta columns use signed arithmetic to show both increases (positive) and decreases (negative) in memory usage. This enables accurate tracking of memory being freed or reduced.
## Visualization
### Using the Provided Python Script
A plotting script is provided to automatically generate visualizations:
```bash
# Install required Python packages
pip install pandas matplotlib
# Generate plots
python3 scripts/plot_resource_tracker.py tracker_output.csv
```
This will:
- Display interactive plots showing memory usage, resource counts, API call distribution, and **memory leak detection**
- Save plots to `tracker_output_plot.png`
- Print summary statistics including detected memory leaks
### Manual Analysis
You can also import the CSV into any data analysis tool:
**Python/Pandas:**
```python
import pandas as pd
df = pd.read_csv('tracker_output.csv')
print(df.describe())
```
**Excel/LibreOffice Calc:**
- Open the CSV file directly
- Create charts using the time series data
**R:**
```r
data <- read.csv('tracker_output.csv')
plot(data$TimeMs, data$VmRSS_KB, type='l')
```
## Example Visualizations
The plotting script generates 12 comprehensive visualizations:
1. **Memory Leak Detection Over Time** - Red cumulative leak line with scatter markers for individual leak events; light red background if leaks detected
2. **System Memory Usage Over Time** - VmRSS, VmSize, VmData trends
3. **Memory Leaks by API Call** - Horizontal bar chart showing total leaked memory per API type, or success message if no leaks
4. **Memory Deltas Per API Call** - Shows which calls increase/decrease memory in MB with signed deltas
5. **L0 Resource Counts** - Tracks contexts, queues, modules, kernels over time
6. **Command Lists and Events** - Monitors command list and event pool usage
7. **Total Memory Allocations** - Shows cumulative L0 memory allocations in MB
8. **Top 10 Most Frequent API Calls** - Bar chart sorted by call frequency
9. **Top 10 API Calls by Memory Impact** - Total memory delta in MB per API type
10. **Top 10 API Calls by Avg Memory per Call** - Average memory delta in MB per call
11. **Cumulative Memory Impact by Top 5 APIs** - Memory trends over time for highest-impact APIs
12. **Individual Leak Events Timeline** - Bar chart showing leak magnitude at each time point with total annotation
### Leak Detection Algorithm
The plotting script identifies potential memory leaks by:
1. Matching create/destroy API pairs (e.g., `zeContextCreate` ↔ `zeContextDestroy`)
2. Flagging destroy operations where memory **increased** by more than 100KB
3. Calculating cumulative leaked memory across the program's execution
4. Highlighting leak events with red visual indicators
**Example leak detection output:**
```
=== MEMORY LEAK DETECTION ===
⚠️ LEAKS DETECTED!
Total leaked memory: 0.19 MB (192.00 KB)
Number of leak events: 1
Leak events by API:
zeEventPoolDestroy: 1 events, 0.19 MB total, 0.19 MB avg
```
### Example Plot

The example above shows a typical visualization with all 12 plots in a 4x3 grid displaying:
- **Row 1:** Memory leak detection, system memory trends, and leak summary by API
- **Row 2:** Memory deltas, resource counts, and command list/event tracking
- **Row 3:** Total memory allocations, API call frequency, and memory impact rankings
- **Row 4:** Average memory per call, cumulative API impact, and individual leak events timeline
Command list append operations that have high memory deltas will automatically appear in the memory impact and cumulative trend plots. Memory leaks are prominently displayed with red highlighting and detailed annotations.
## Use Cases
- **Memory Leak Detection**: Automatically identify resources that aren't properly cleaned up
- Detects leaks at both the validation layer level (real-time warnings) and plotting level (post-analysis)
- Pinpoints exact handles and API calls that leaked memory
- Quantifies leak severity with KB/MB measurements
- **Performance Analysis**: Identify excessive allocations and memory usage patterns
- **Resource Optimization**: Track resource lifecycle and identify unused resources
- **Debugging**: Correlate application behavior with resource usage patterns and leak events
- **Benchmarking**: Compare resource usage and leak counts across different runs or code versions
- **CI/CD Integration**: Automated tracking of resource usage and leak detection in test pipelines
- **Capacity Planning**: Understand peak resource requirements for production workloads
- **Regression Testing**: Detect unexpected increases in resource consumption or new memory leaks
## Implementation Details
### Architecture
The System Resource Tracker is implemented as a validation layer checker that uses both prologue and epilogue intercepts. When enabled, the validation layer injects calls to the tracker's functions before and after each tracked API call completes.
**Key Components:**
- `SystemResourceMetrics`: Structure holding system-level metrics (VmSize, VmRSS, etc.) with `signedDiff()` method for proper signed delta calculation
- `ResourceTracker`: Meyers' singleton pattern structure maintaining:
- Cumulative resource counts and metrics
- Handle-to-metrics maps for all 11 resource types (contexts, queues, modules, kernels, event pools, command lists, events, fences, images, samplers, memory allocations)
- Per-resource-type leak counters
- Thread-local pre-call metrics storage for append operations
- `getResourceTracker()`: Function-local static singleton accessor ensuring proper initialization order
- `getSystemResourceMetrics()`: Parses `/proc/self/status` to read current system metrics
- `checkForLeak()`: Compares creation metrics to destruction metrics and logs warnings if memory increased
- `writeCsvData()`: Atomic CSV line writer using ostringstream with signed delta support
- `logResourceSummary()`: Formats and logs cumulative resource usage
- Prologue/Epilogue function pairs: 44 functions (22 pairs) for create/destroy operations plus 28 functions (14 pairs) for append operations
### Thread Safety
The tracker uses multiple mechanisms to ensure thread safety:
- **Meyers' Singleton Pattern**: Function-local static `ResourceTracker` ensures initialization on first use, avoiding static initialization order fiasco
- **Mutex Protection**: All prologue and epilogue functions acquire the tracker's mutex before accessing shared state
- **Thread-Local Storage**: Pre-call metrics for append operations are stored per-thread to avoid conflicts
- **Atomic CSV Writes**: CSV lines are built using ostringstream and written atomically
- **Handle-to-Metrics Maps**: Each resource type has its own map protected by the shared mutex
### Performance Considerations
- Tracking overhead is approximately < 1ms per API call
- System metrics are read by parsing a small text file (`/proc/self/status` on Linux)
- CSV writes are buffered and flushed after each call to ensure crash safety
- The tracker only runs when explicitly enabled via environment variable
### Platform Support
The System Resource Tracker is **Linux-only** and relies on `/proc/self/status` for system resource metrics. The checker is automatically excluded from builds on Windows and macOS.
## Troubleshooting
### Empty CSV File (Header Only)
If the CSV file contains only the header row with no data, this was caused by a static initialization order bug that has been fixed. The current version uses a Meyers' singleton pattern (`getResourceTracker()`) to ensure proper initialization.
**Solution**: Rebuild with the latest version and regenerate the CSV file.
### CSV Parsing Errors
If you encounter `pandas.errors.ParserError` with messages like "Expected X fields, saw Y", this indicates corrupted CSV data. This can happen if:
1. The CSV file was generated by an older version with a data race bug (fixed in current version)
2. The application crashed during a write operation
**Solution**: Generate a new CSV file with the current version of the tracker, which uses:
- Atomic line writes to prevent data corruption
- Unique filenames per process (with PID appended) to prevent conflicts
- Meyers' singleton pattern for proper initialization
### False Positive Leak Detection
The plotting script uses a 100KB threshold for leak detection. Small memory increases during destroy operations may be normal due to:
- Internal driver caching
- Deferred cleanup in the driver
- Memory allocation granularity
If you see small leaks (< 100KB), they may not represent actual application leaks. The validation layer's real-time leak detection provides more accurate results as it compares metrics at the exact moment of creation vs destruction.
## Notes
- Each process creates a unique CSV file with the process ID appended (e.g., `tracker_pid12345.csv`)
- CSV output is flushed after each API call to ensure data is captured even if the application crashes
- The tracker uses atomic line writes to prevent data corruption in multi-threaded applications
- The tracker adds minimal overhead (< 1ms per API call)
- CSV files can grow large for long-running applications; consider log rotation for production use
- Both CSV output and debug logging can be enabled simultaneously
|