# Statedump
A statedump is a file generated by a glusterfs process that captures the state of its internal data structures. It may contain the active inodes, fds, mempools, iobufs, and memory allocation statistics of the different data structures of each xlator, among other things.
## How to generate statedump
We can find the directory where statedump files are created using the `gluster --print-statedumpdir` command.
Create that directory if it is not already present, based on the type of installation.
Let's call this directory `statedump-directory`.
We can generate a statedump using `kill -USR1 <pid-of-gluster-process>`.
Here, gluster-process refers to a glusterd, glusterfs, or glusterfsd process.
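For example, a minimal sketch of that workflow on one node could look like the following; the directory path in the comments is only illustrative, use whatever `gluster --print-statedumpdir` reports on your installation:
```
# find out where statedump files will be written
gluster --print-statedumpdir
# e.g. /var/run/gluster

# create the directory if it does not exist yet (path is illustrative)
mkdir -p /var/run/gluster

# trigger a statedump of a running gluster process, e.g. glusterd
kill -USR1 $(pidof glusterd)
```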
There are also commands to generate statedumps for brick processes, the nfs server, and quotad.
For bricks:
```
gluster volume statedump <volname>
```
For nfs server:
```
gluster volume statedump <volname> nfs
```
For quotad:
```
gluster volume statedump <volname> quotad
```
For brick processes, the statedump files will be created in `statedump-directory` with the file name `hyphenated-brick-path.<pid>.dump.timestamp`. For all other processes the file name will be `glusterdump.<pid>.dump.timestamp`.
For applications using libgfapi, `SIGUSR1` cannot be used; processes such as smbd or
libvirtd may already be using that signal for other purposes.
To generate a statedump for a process using libgfapi, the command below can be
executed from one of the nodes in the gluster cluster to which the libgfapi
application is connected.
```
gluster volume statedump <volname> client <hostname>:<process id>
```
The statedumps can be found in `statedump-directory`, with the name of each
statedump being `glusterdump.<pid>.dump.timestamp`. For a single process there can be
multiple such files, depending on the number of times the volume is
accessed by the process (related to the number of `glfs_init()` calls).
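For example, to dump the state of a hypothetical libgfapi application (the process name `smbd` and the hostname placeholder below are only illustrative), something like this could be run:
```
# on the machine running the libgfapi application: find its pid
pidof smbd
# e.g. 3210

# on one of the nodes in the trusted pool: request the client statedump
gluster volume statedump <volname> client <hostname-of-smbd-machine>:3210
```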
## How to read statedump
We shall look at snippets of each type of information found in a statedump.
The first and last lines of the file contain the start and end time of writing the statedump. Times are in UTC.
### Mallinfo
The mallinfo statistics are printed in the following format. Please read `man mallinfo` for more information about what each field means.
```
[mallinfo]
mallinfo_arena=100020224 /* Non-mmapped space allocated (bytes) */
mallinfo_ordblks=69467 /* Number of free chunks */
mallinfo_smblks=449 /* Number of free fastbin blocks */
mallinfo_hblks=13 /* Number of mmapped regions */
mallinfo_hblkhd=20144128 /* Space allocated in mmapped regions (bytes) */
mallinfo_usmblks=0 /* Maximum total allocated space (bytes) */
mallinfo_fsmblks=39264 /* Space in freed fastbin blocks (bytes) */
mallinfo_uordblks=96710112 /* Total allocated space (bytes) */
mallinfo_fordblks=3310112 /* Total free space (bytes) */
mallinfo_keepcost=133712 /* Top-most, releasable space (bytes) */
```
### Data structure allocation stats
For every translator loaded in the call-graph, the memory consumed by each of its data structures is displayed in the following format.
For the xlator with the name glusterfs:
```
[global.glusterfs - Memory usage] #[global.xlator-name - Memory usage]
num_types=119 #It shows the number of data types it is using
```
Now for each data-type it prints the memory usage.
```
[global.glusterfs - usage-type gf_common_mt_gf_timer_t memusage]
#[global.xlator-name - usage-type <tag associated with the data-type> memusage]
size=112 #num_allocs times the sizeof(data-type) i.e. num_allocs * sizeof (data-type)
num_allocs=2 #Number of allocations of the data-type which are active at the time of taking statedump.
max_size=168 #max_num_allocs times the sizeof(data-type) i.e. max_num_allocs * sizeof (data-type)
max_num_allocs=3 #Maximum number of active allocations at any point in the life of the process.
total_allocs=7 #Number of times this data is allocated in the life of the process.
```
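A quick, hedged way to spot the heaviest allocators in a dump is to sort these counters; the file name below is a placeholder:
```
# list the largest active allocation counts across all usage-type sections
grep -w num_allocs glusterdump.<pid>.dump.<timestamp> | sort -t'=' -k2 -n -r | head -5
```
The `usage-type` section headers a few lines above each match in the file tell you which data-type tags those counts belong to.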
### Mempools
Mempools are an optimization to reduce the number of allocations of a data type. If we create a mem-pool of, let's say, 1024 elements of a data type, new elements will be allocated from the heap using syscalls like calloc only when all 1024 elements in the pool are in active use.
The memory pool allocated by each xlator is displayed in the following format:
```
[mempool] #Section name
-----=-----
pool-name=fuse:fd_t #pool-name=<xlator-name>:<data-type>
hot-count=1 #number of mempool elements that are in active use. i.e. for this pool it is the number of 'fd_t' s in active use.
cold-count=1023 #number of mempool elements that are not in use. If a new allocation is required it will be served from here until all the elements in the pool are in use i.e. cold-count becomes 0.
padded_sizeof=108 #Each mempool element is padded with a doubly-linked-list + ptr of mempool + is-in-use info to operate the pool of elements, this size is the element-size after padding
pool-misses=0 #Number of times the element had to be allocated from heap because all elements from the pool are in active use.
alloc-count=314 #Number of times this type of data is allocated through out the life of this process. This may include pool-misses as well.
max-alloc=3 #Maximum number of elements from the pool in active use at any point in the life of the process. This does *not* include pool-misses.
cur-stdalloc=0 #Denotes the number of allocations made from heap once cold-count reaches 0, that are yet to be released via mem_put().
max-stdalloc=0 #Maximum number of allocations from heap that are in active use at any point in the life of the process.
```
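One way to spot pools that are under pressure is to look for pools that had to fall back to the heap; this is only a sketch using the keys shown above, with a placeholder file name:
```
# show every pool that has a non-zero pool-misses count, along with its name
grep -E 'pool-name|pool-misses' glusterdump.<pid>.dump.<timestamp> | grep -B1 'pool-misses=[1-9]'
```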
### Iobufs
```
[iobuf.global]
iobuf_pool=0x1f0d970 #The memory pool for iobufs
iobuf_pool.default_page_size=131072 #The default size of iobuf (if no iobuf size is specified the default size is allocated)
#iobuf_arena: One arena represents a group of iobufs of a particular size
iobuf_pool.arena_size=12976128 # The initial size of the iobuf pool (doesn't include the stdalloc'd memory or the newly added arenas)
iobuf_pool.arena_cnt=8 #Total number of arenas in the pool
iobuf_pool.request_misses=0 #The number of iobufs that were stdalloc'd (as they exceeded the default max page size provided by iobuf_pool).
```
There are 3 lists of arenas:
1. Arena list: arenas allocated during iobuf pool creation, and arenas that are in use (active_cnt != 0), are part of this list.
2. Purge list: arenas that can be purged (no active iobufs, active_cnt == 0).
3. Filled list: arenas without free iobufs.
```
[purge.1] #purge.<S.No.>
purge.1.mem_base=0x7fc47b35f000 #The address of the arena structure
purge.1.active_cnt=0 #The number of iobufs active in that arena
purge.1.passive_cnt=1024 #The number of unused iobufs in the arena
purge.1.alloc_cnt=22853 #Total allocs in this pool(number of times the iobuf was allocated from this arena)
purge.1.max_active=7 #Max active iobufs from this arena, at any point in the life of this process.
purge.1.page_size=128 #Size of all the iobufs in this arena.
[arena.5] #arena.<S.No.>
arena.5.mem_base=0x7fc47af1f000
arena.5.active_cnt=0
arena.5.passive_cnt=64
arena.5.alloc_cnt=0
arena.5.max_active=0
arena.5.page_size=32768
```
If the active_cnt of any arena is non-zero, the statedump will also list the active iobufs of that arena.
```
[arena.6.active_iobuf.1] #arena.<S.No>.active_iobuf.<iobuf.S.No.>
arena.6.active_iobuf.1.ref=1 #refcount of the iobuf
arena.6.active_iobuf.1.ptr=0x7fdb921a9000 #address of the iobuf
[arena.6.active_iobuf.2]
arena.6.active_iobuf.2.ref=1
arena.6.active_iobuf.2.ptr=0x7fdb92189000
```
At any given point in time if there are lots of filled arenas then that could be a sign of iobuf leaks.
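A quick overview of arena usage can be pulled from the dump with the counters shown above (the file name is a placeholder):
```
# show the number of arenas and the active iobuf count of each arena
grep -E 'arena_cnt|active_cnt' glusterdump.<pid>.dump.<timestamp>
```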
### Call stack
All the fops received by gluster are handled using call-stacks. A call-stack contains information such as the uid/gid/pid of the process that is executing the fop. Each call-stack contains the call-frames of every xlator that handles the fop.
```
[global.callpool.stack.3] #global.callpool.stack.<Serial-Number>
stack=0x7fc47a44bbe0 #Stack address
uid=0 #Uid of the process which is executing the fop
gid=0 #Gid of the process which is executing the fop
pid=6223 #Pid of the process which is executing the fop
unique=2778 #Xlators like afr do copy_frame and perform the operation in a different stack, this id is useful to find out the stacks that are inter-related because of copy-frame
lk-owner=0000000000000000 #Some of the fuse fops have lk-owner.
op=LOOKUP #Fop
type=1 #Type of the op i.e. FOP/MGMT-OP
cnt=9 #Number of frames in this stack.
```
### Call-frame
Each frame has information about which xlator the frame belongs to, which function it wound from/to and which function it will unwind to. It also mentions whether the unwind has happened. If we observe hangs in the system and want to find out which xlator is causing them, take a statedump and see which is the final frame that is yet to be unwound.
```
[global.callpool.stack.3.frame.2]#global.callpool.stack.<stack-serial-number>.frame.<frame-serial-number>
frame=0x7fc47a611dbc #Frame address
ref_count=0 #Incremented at the time of wind and decremented at the time of unwind.
translator=r2-client-1 #Xlator this frame belongs to
complete=0 #if this value is 1 that means this frame is already unwound. 0 if it is yet to unwind.
parent=r2-replicate-0 #Parent xlator of this frame
wind_from=afr_lookup #Parent xlator function from which the wind happened
wind_to=priv->children[i]->fops->lookup
unwind_to=afr_lookup_cbk #Parent xlator function to which unwind happened
```
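To quickly list the frames that have not yet unwound, together with the xlator each one belongs to, a grep over the keys shown above can be used (the file name is a placeholder):
```
# print the translator of every frame that has complete=0
grep -E 'translator=|complete=' glusterdump.<pid>.dump.<timestamp> | grep -B1 'complete=0'
```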
### History of operations in Fuse
The fuse xlator maintains a history of the operations that happened in fuse.
```
[xlator.mount.fuse.history]
TIME=2014-07-09 16:44:57.523364
message=[0] fuse_release: RELEASE(): 4590:, fd: 0x1fef0d8, gfid: 3afb4968-5100-478d-91e9-76264e634c9f
TIME=2014-07-09 16:44:57.523373
message=[0] send_fuse_err: Sending Success for operation 18 on inode 3afb4968-5100-478d-91e9-76264e634c9f
TIME=2014-07-09 16:44:57.523394
message=[0] fuse_getattr_resume: 4591, STAT, path: (/iozone.tmp), gfid: (3afb4968-5100-478d-91e9-76264e634c9f)
```
### Xlator configuration
```
[cluster/replicate.r2-replicate-0] #Xlator type, name information
child_count=2 #Number of children to the xlator
#Xlator specific configuration below
child_up[0]=1
pending_key[0]=trusted.afr.r2-client-0
child_up[1]=1
pending_key[1]=trusted.afr.r2-client-1
data_self_heal=on
metadata_self_heal=1
entry_self_heal=1
data_change_log=1
metadata_change_log=1
entry-change_log=1
read_child=1
favorite_child=-1
wait_count=1
```
### Graph/inode table
```
[active graph - 1]
conn.1.bound_xl./data/brick01a/homegfs.hashsize=14057
conn.1.bound_xl./data/brick01a/homegfs.name=/data/brick01a/homegfs/inode
conn.1.bound_xl./data/brick01a/homegfs.lru_limit=16384 #Least recently used size limit
conn.1.bound_xl./data/brick01a/homegfs.active_size=690 #Number of inodes undergoing some kind of fop, i.e. inodes on which there is at least one ref
conn.1.bound_xl./data/brick01a/homegfs.lru_size=183 #Number of inodes present in lru list
conn.1.bound_xl./data/brick01a/homegfs.purge_size=0 #Number of inodes present in purge list
```
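To get a quick view of the inode table sizes of every connection in the dump, the counters shown above can be extracted directly (the file name is a placeholder):
```
# show active/lru/purge inode counts for each inode table
grep -E 'active_size|lru_size|purge_size' glusterdump.<pid>.dump.<timestamp>
```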
### Inode
```
[conn.1.bound_xl./data/brick01a/homegfs.active.324] #324th inode in active inode list
gfid=e6d337cf-97eb-44b3-9492-379ba3f6ad42 #Gfid of the inode
nlookup=13 #Number of times lookups happened from the client or from fuse kernel
fd-count=4 #Number of fds opened on the inode
ref=11 #Number of refs taken on the inode
ia_type=1 #Type of the inode. This should be changed to some string :-(
Ref by xl:.patchy-md-cache=11 #After this, there will be a list of xlators and the ref count taken by each of them on this inode at the time of the statedump
[conn.1.bound_xl./data/brick01a/homegfs.lru.1] #1st inode in lru list. Note that ref count is zero for these inodes.
gfid=5114574e-69bc-412b-9e52-f13ff087c6fc
nlookup=5
fd-count=0
ref=0
ia_type=2
Ref by xl:.fuse=1
Ref by xl:.patchy-client-0=-1
```
### Inode context
Each xlator can store some context per inode. This context can also be printed in the statedump. Here is the inode ctx of the locks xlator:
```
[xlator.features.locks.homegfs-locks.inode]
path=/homegfs/users/dfrobins/gfstest/r4/SCRATCH/fort.5102 #Path of the file
mandatory=0
inodelk-count=5 #Number of inode locks
lock-dump.domain.domain=homegfs-replicate-0:self-heal #Domain name where self-heals take locks to prevent more than one heal on the same file
inodelk.inodelk[0](ACTIVE)=type=WRITE, whence=0, start=0, len=0, pid = 18446744073709551615, owner=080b1ada117f0000, client=0xb7fc30, connection-id=compute-30-029.com-3505-2014/06/29-14:46:12:477358-homegfs-client-0-0-1, granted at Sun Jun 29 11:01:00 2014 #Active lock information
inodelk.inodelk[1](BLOCKED)=type=WRITE, whence=0, start=0, len=0, pid = 18446744073709551615, owner=c0cb091a277f0000, client=0xad4f10, connection-id=gfs01a.com-4080-2014/06/29-14:41:36:917768-homegfs-client-0-0-0, blocked at Sun Jun 29 11:04:44 2014 #Blocked lock information
lock-dump.domain.domain=homegfs-replicate-0:metadata #Domain name where metadata operations take locks to maintain replication consistency
lock-dump.domain.domain=homegfs-replicate-0 #Domain name where entry/data operations take locks to maintain replication consistency
inodelk.inodelk[0](ACTIVE)=type=WRITE, whence=0, start=11141120, len=131072, pid = 18446744073709551615, owner=080b1ada117f0000, client=0xb7fc30, connection-id=compute-30-029.com-3505-2014/06/29-14:46:12:477358-homegfs-client-0-0-1, granted at Sun Jun 29 11:10:36 2014 #Active lock information
```
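When debugging hangs that involve locks, a quick scan for blocked locks across the whole dump can narrow things down; the `(BLOCKED)` marker is the one shown in the snippet above (the file name is a placeholder):
```
# list all blocked inode/entry locks with their line numbers
grep -n 'BLOCKED' glusterdump.<pid>.dump.<timestamp>
```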
## FAQ
### How to debug Memory leaks using statedump?
#### Using memory accounting feature:
[Bug 1120151](https://bugzilla.redhat.com/show_bug.cgi?id=1120151) is one of the bugs that was debugged using statedump to see which data structure was leaking. Here is the process used to find the leak using statedump. According to the bug, the observation was that the process memory usage kept increasing whenever one of the bricks in a replicate volume was wiped and a `full` self-heal was invoked to heal the contents. A statedump of the self-heal daemon was taken using `kill -USR1 <pid-of-gluster-self-heal-daemon>`.
```
grep -w num_allocs glusterdump.5225.dump.1405493251
num_allocs=77078
num_allocs=87070
num_allocs=117376
....
grep hot-count glusterdump.5225.dump.1405493251
hot-count=16384
hot-count=16384
hot-count=4095
....
```
Find the occurrences of these counts in the statedump file to figure out which tags (data types) they belong to.
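For example, to find which tag a suspiciously large count belongs to, grep for that count with a few lines of leading context so that the `usage-type` section header is visible; the count and file name here are the ones from the output above:
```
# show the usage-type section header for the largest num_allocs value
grep -B4 'num_allocs=117376' glusterdump.5225.dump.1405493251
```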
The grep of the statedump revealed too many allocations for the following data types under replicate:
1. gf_common_mt_asprintf
2. gf_common_mt_char
3. gf_common_mt_mem_pool
After checking the afr code for allocations with the tag `gf_common_mt_char`, it was found that the `data-self-heal` code path does not free one such allocation. `gf_common_mt_mem_pool` suggests a leak in pool memory: the `replicate-0:dict_t`, `glusterfs:data_t` and `glusterfs:data_pair_t` pools were using a lot of memory, i.e. the cold count was `0` and there were too many allocations. Checking the source code of dict.c revealed that the `key` in a `dict` is allocated with the `gf_common_mt_char` tag (item 2 above), and the value is created using gf_asprintf, which in turn uses `gf_common_mt_asprintf` (item 1 above). Browsing the self-heal code paths for the leak led to a line which overwrites a variable with a new dictionary even when it is already holding a reference to another dictionary. After fixing these leaks, the same test was run to verify that none of the `num_allocs` values in the statedump of the self-heal daemon keep increasing, even after healing a directory hierarchy of 10,000 files.
Please check this [patch](http://review.gluster.org/8316) for more info about the fix.
#### Debugging leaks in memory pools:
Statedump output of memory pools was used to test and verify the fixes to [Bug 1134221](https://bugzilla.redhat.com/show_bug.cgi?id=1134221). On code analysis, `dict_t` objects were found to be leaking (in terms of not being unref'd enough times) during name self-heal. The test involved creating 100 files on a plain replicate volume, removing them from one brick's backend, and then triggering lookups on them from the mount point. A statedump of the mount process was taken before and after executing the test case, after compiling glusterfs with the -DDEBUG flag (so that the cold count is set to 0 by default).
Statedump output of the fuse mount process before the test case was executed:
```
pool-name=glusterfs:dict_t
hot-count=0
cold-count=0
padded_sizeof=140
alloc-count=33
max-alloc=0
pool-misses=33
cur-stdalloc=14
max-stdalloc=18
```
Statedump output of the fuse mount process after the test case was executed:
```
pool-name=glusterfs:dict_t
hot-count=0
cold-count=0
padded_sizeof=140
alloc-count=2841
max-alloc=0
pool-misses=2841
cur-stdalloc=214
max-stdalloc=220
```
Here, with the cold count being 0 by default, `cur-stdalloc` indicates the number of `dict_t` objects that were allocated from the heap using `mem_get()` and are yet to be freed using `mem_put()` (refer to this [page](../developer-guide/datastructure-mem-pool.md) for more details on how mempools work). After the test case (name self-heal of 100 files), there was a rise in the cur-stdalloc value (from 14 to 214) for `dict_t`.
After these leaks were fixed, glusterfs was again compiled with the -DDEBUG flag, the same steps were performed, and statedumps of the mount process were taken before and after executing the test case, to ascertain the validity of the fix. The following are the results:
Statedump output of the fuse mount process before executing the test case:
```
pool-name=glusterfs:dict_t
hot-count=0
cold-count=0
padded_sizeof=140
alloc-count=33
max-alloc=0
pool-misses=33
cur-stdalloc=14
max-stdalloc=18
```
Statedump output of the fuse mount process after executing the test case:
```
pool-name=glusterfs:dict_t
hot-count=0
cold-count=0
padded_sizeof=140
alloc-count=2837
max-alloc=0
pool-misses=2837
cur-stdalloc=14
max-stdalloc=119
```
The value of cur-stdalloc remained 14 before and after the test, indicating that the fix indeed does what it's supposed to do.
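In practice, this kind of before/after comparison can be pulled straight out of the two dumps; a minimal sketch, where the two file names are placeholders for the actual dump files:
```
# compare the glusterfs:dict_t pool section across the two statedumps
for f in glusterdump.<pid>.dump.<before> glusterdump.<pid>.dump.<after>; do
    echo "== $f =="
    grep -A 8 'pool-name=glusterfs:dict_t' "$f"
done
```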
### How to debug hangs because of frame-loss?
[Bug 994959](https://bugzilla.redhat.com/show_bug.cgi?id=994959) is one of the bugs where statedump was helpful in finding where the frame was lost. Here is the process used to find the hang using statedump.
When the hang was observed, statedumps were taken for all the processes. The mount's statedump showed the following stack:
```
[global.callpool.stack.1.frame.1]
ref_count=1
translator=fuse
complete=0
[global.callpool.stack.1.frame.2]
ref_count=0
translator=r2-client-1
complete=1 <<----- Client xlator completed the readdirp call and unwound to afr
parent=r2-replicate-0
wind_from=afr_do_readdir
wind_to=children[call_child]->fops->readdirp
unwind_from=client3_3_readdirp_cbk
unwind_to=afr_readdirp_cbk
[global.callpool.stack.1.frame.3]
ref_count=0
translator=r2-replicate-0
complete=0 <<---- Afr xlator is not unwinding for some reason.
parent=r2-dht
wind_from=dht_do_readdir
wind_to=xvol->fops->readdirp
unwind_to=dht_readdirp_cbk
[global.callpool.stack.1.frame.4]
ref_count=1
translator=r2-dht
complete=0
parent=r2-io-cache
wind_from=ioc_readdirp
wind_to=FIRST_CHILD(this)->fops->readdirp
unwind_to=ioc_readdirp_cbk
[global.callpool.stack.1.frame.5]
ref_count=1
translator=r2-io-cache
complete=0
parent=r2-quick-read
wind_from=qr_readdirp
wind_to=FIRST_CHILD (this)->fops->readdirp
unwind_to=qr_readdirp_cbk
```
`unwind_to` shows that the call was unwound to `afr_readdirp_cbk` from the client xlator.
Inspecting that function revealed that afr was not unwinding the stack when the fop failed.
Check this [patch](http://review.gluster.org/5531) for more info about the fix.