[[{linux.perf,monitoring.perf]]
[[{monitoring.perf,monitoring.kernel,monitoring.apps,profiling.jobs]]
[[profiling.perf,troubleshooting.locks,PM.TODO]]
# perf "Performance Counters for Linux" (PCL), also known as perf events.
* External Refs:
* <http://www.brendangregg.com/perf.html>
* <http://www.brendangregg.com/flamegraphs.html>
* perf is a performance-analysis (multi-)tool built into the Linux kernel (2.6.31+, 2009)
  and driven from user space by the `perf` command.
* It is capable of statistical profiling of the entire system (both kernel and userland code).
  "It provides a framework for all things related to performance analysis."
[[doc_has.keypoint]]
* perf_events can instrument in three ways (see the sketch after this list):
  1. counting events in kernel context, where a summary of counts is printed by perf
     (no perf.data file is generated).
  2. sampling events, which writes event data to a kernel buffer that is read at
     a gentle asynchronous rate by the perf command and written to a perf.data file.
     This file is then read by the perf report or perf script commands.
  3. bpf programs on events, a feature of Linux 4.4+ kernels that can execute
     custom user-defined programs in kernel space to perform efficient
     filters and summaries of the data, e.g. efficiently measured latency histograms.
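* A minimal sketch contrasting the counting and sampling modes (`sleep 5` is just a
  placeholder workload; any command works):
```
| # 1) counting: in-kernel counters, summary printed, no perf.data file
| $ perf stat -e context-switches -a -- sleep 5
|
| # 2) sampling: events are written to perf.data for later analysis
| $ perf record -e context-switches -a -- sleep 5
| $ perf report
```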
* Details about the events can be collected, including:
```
┌───────────────────────────────────┐
│ event │
├───────────────────────────────────┤
│ + timestamp │
│ + path leading to it (stack trace)│
│ + details (ex: value of stack,...)│
│ + ... │
└───────────────────────────────────┘
```
* Events covered: <!-- { -->
```
| $ sudo perf list # list currently known-to-kernel events that
| # can be gathered (more than 650 on recent kernels).
|
| List of pre-defined events (to be used in -e):
| event-type
| ----------
| cpu-cycles OR cycles Hardware
| instructions Hardware
| cache-references Hardware
| cache-misses Hardware
| branch-instructions OR branches Hardware
| branch-misses Hardware
| bus-cycles Hardware
| (stalled|idle)-cycles-frontend Hardware
| (stalled|idle)-cycles-backend Hardware
| ref-cycles Hardware
| cpu-clock Software
| task-clock Software
| page-faults OR faults Software
| context-switches OR cs Software
| cpu-migrations OR migrations Software
| minor-faults Software
| major-faults Software
| alignment-faults Software
| emulation-faults Software
| bpf-output Software
| dummy Software
| L1-dcache-loads Hardw.cache
| L1-dcache-load-misses Hardw.cache
| L1-dcache-stores Hardw.cache
| [...]
| rNNN Raw hard.descriptor
| cpu/t1=v1[,t2=v2,t3 ...]/modifier Raw hard.descriptor
| (see 'man perf-list' on how
| to encode it)
| mem:<addr>[:access] Hard.breakpoint
| probe:tcp_sendmsg Tracepoint <· Dyn.Tracing
|                                 extends `perf list` ¹
| [...]
| sched:sched_process_exec Tracepoint
| sched:sched_process_fork Tracepoint
| sched:sched_process_wait Tracepoint
| sched:sched_wait_task Tracepoint
| sched:sched_process_exit Tracepoint
| [...] └───┬────┘
| ┌─ event types ──────────────────────────┘
| [[{doc_has.keypoint]]
| · Hardware: CPU performance monitoring counters.
|
| · Software: CPU migrations, minor faults, major faults, etc.
|
| · Kernel Tracepoint: static kernel-level instrumentation points
| hardcoded in interesting places.
| · User Statically-Defined Tracing (USDT): static tracepoints for
| user-level programs and applications.
| · ¹ Dynamic Tracing: events created in any location
| For kernel software, `kprobes` framework is used.
| For user software, `uprobes` framework is used.
| · Timed Profiling: Snapshots can be collected at arbitrary
| frequency, (`perf record -F`). Commonly used for CPU
| usage profiling, and works by creating custom timed
| interrupt events.
| [[doc_has.keypoint}]]
|
| $ perf list 'sched:*' <· Listing sched tracepoints
```
<!-- } -->
Start experimenting with perf like this:
```
| ┌───────┬─···· Lowest system overhead, while
| · · "good enough" for many purposes.
| · ·
| v v ┌─────┬─ any binary or shell command will work
| $ perf stat -- make -j
| (Output will be similar to ...)
| Performance counter stats for 'make -j':
|
| 83723.452481 task-clock:u (msec) # 1.004 CPUs utilized
| 0 context-switches:u # 0.000 K/sec
| 0 cpu-migrations:u # 0.000 K/sec
| 3,228,188 page-faults:u # 0.039 M/sec
| 229,570,665,834 cycles:u # 2.742 GHz
| 313,163,853,778 instructions:u # 1.36 insn per cycle
| 69,704,684,856 branches:u # 832.559 M/sec
| 2,078,861,393 branch-misses:u # 2.98% of all branches
|
| 83.409183620 seconds time elapsed
|
| 74.684747000 seconds user
| 8.739217000 seconds sys
|
| $ perf stat \
|   -r 4 \                   <· repeat 4 times
|   --event=cycles:{k,u} \   <· split into {k}ernel / {u}ser cycles
| -- \
| make -j <· alternatively -a for system-wide
| ... events.
| 123,123,11 cycles:k
| 1,123,11 cycles:u
|
| 0.00013913 secs elapsed
```
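* As a sketch, `perf stat -I <ms>` prints the counters at a fixed interval instead of once
  at the end, which is handy for watching a long-running workload:
```
| $ perf stat -I 1000 -e cycles,instructions -a -- sleep 10
| (prints one line of counter values per second, system-wide, for 10 seconds)
```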
[[{profiling.jobs.IO.BLOCK]]
### Example: tracing/profiling disk I/O:
```
| $ sudo perf record \
| -e block:block_rq_issue \ # <· fires when a block device I/O request is issued
| -a \ # <· trace all CPUs.
| -g # <· capture call graphs (stack traces)
| (output is dumped to perf.data)
|
| $ sudo perf report # <· takes perf.data as input
| # build tree from the stack traces.
| # coalescing common paths and showing
| # percentages for each path.
| ...
| # Samples: 2K of event 'block:block_rq_issue'
| # Event count (approx.): 2216 <·· related events captured
| #
| # Over Com Shared Symbol
| # head mand Object
| # .... ... ................ ...................
| #
| 32.1% dd [kernel.kallsyms] [k] blk_peek_request
| | └ issued by kernel func.
| -- blk_peek_request
| virtblk_request
| _blk_run_queue
| |
| |-98.31%- queue_unplugged
| | blk_flush_plug_list
| | |
| | |-91.00%-blk_queue_bio
| | | generic_make_request
| | | submit_bio
| | | ext4_io_submit
| | | |
| | | |-58.71%-ext4_bio_write_page
| | | | mpage_da_submit_io
| | | | mpage_da_map_and_submit
| | | | write_cache_pages_da
| | | | ext4_da_writepages
| | | | do_writepages
| | | | __filemap_fdatawrite_range
| | | | filemap_flush
| | | | ext4_alloc_da_blocks
| | | | ext4_release_file
| | | | __fput
| | | | ____fput
| | | | task_work_run
| | | | do_notify_resume
| | | | int_signal
| | | | close
| | | | 0x0
| | | |
| | | -41.29%-mpage_da_submit_io
| [...]
```
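* Follow-up sketch: the same perf.data can also be dumped event by event with `perf script`,
  useful when the coalesced report hides individual requests:
```
| $ sudo perf script # <· one line per block_rq_issue event, with
|                    #    process, timestamp and the captured stack
```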
[[profiling.jobs.IO}]]
### perf-top
```
| $ perf top # <· Sort of detailed/advanced "top"
|
| Samples: 12K of event 'cycles:ppp', Event count (approx.): 54543453553535
| Overhead Shared Object Symbol
| 13.11% libQt5Core.so.5.7.0 [.] QHashData::nextNode
| 5.11% libc-2.24.so [.] _int_malloc
| 2.90% perf [.] symbols__insert
| ...
```
## perf "one-liner" recipes
* Summary from <http://www.brendangregg.com/perf.html>, by Brendan Gregg
* perf is an event-oriented observability tool, which can answer advanced performance and troubleshooting questions like:
* Why is the kernel on-CPU so much? What code-paths? [[troubleshooting.performance,monitoring.cpu]]
* Which code-paths are causing CPU level 2 cache misses? [[monitoring.cpu.cache]]
* Are the CPUs stalled on memory I/O? [[profiling.memory]]
* Which code-paths are allocating memory, and how much? [[profiling.memory]]
* What is triggering TCP retransmits? [[profiling.network,monitoring.network]]
* Is a certain kernel function being called, and how often? [[profiling.kernel]]
* For what reasons are threads leaving the CPU? [[profiling.threads]]
```
| $ perf stat -a sleep 5 # <· CPU counter statistics for the entire system, for 5 seconds:
|
| "LLC-load-misses" # <· last level cache misses
| "cycles:k" # <· CPU kernel instructions.
| "cycles:u" # <· CPU user instructions.
| "cycles:up" # <· CPU user instructions, precisely (using PEBS)
| "context-switches"
|
| ────────────────────────────────────────────────────── # collect counter statistics (perf stat) for ...
| $ EVENT_L00="cycles,instructions,cache-references"
| $ EVENT_L00="${EVENT_L00},cache-misses,bus-cycles" # <· basic CPU statistics
| $ EVENT_L01="L1-dcache-loads,L1-dcache-load-misses"
| $ EVENT_L01="${EVENT_L01},L1-dcache-stores" # <· Level 1 data cache stats.
| $ EVENT_L02="dTLB-loads,dTLB-load-misses"
| $ EVENT_L02="${EVENT_L02},dTLB-prefetch-misses" # <· Various CPU data TLB statistics.
| $ EVENT_L03="LLC-loads,LLC-load-misses,LLC-stores"
| $ EVENT_L03="${EVENT_L03},LLC-prefetches" # <· CPU last level cache statistics
| $ EVENT_L04="r003c" # <· eg of raw PMC counters, (unhalted core cycles)
| $ EVENT_L05="cycles" # <· counting cycles
| $ EVENT_L06="cpu/event=0x0e,umask=0x01,inv,cmask=0x01/" # <· frontend stalls via raw specification
| $ EVENT_L07="raw_syscalls:sys_enter" # <· system calls
| $ EVENT_L08="syscalls:sys_enter_*" # <· system calls by type
| $ EVENT_L09="sched:*" # <· scheduler events
| $ EVENT_L10="ext4:*" # <· ext4 events
| $ EVENT_L11="block:*" # <· block device I/O events
| $ EVENT_L12="vmscan:*" # <· vmscan events
| $ EVENT_L..="vmscan:mm_vmscan_wakeup_kswapd" # <· vmscan: kswapd wakeup events
| $ EVENT_L..="migrations" # CPU migrations
| $ EVENT_L..="syscalls:sys_enter_connect" # connect()s (outbound connections)
| $ EVENT_L..="syscalls:sys_enter_accept*" # accepts()s (inbound connections)
| $ EVENT_L..="block:block_rq_insert" # block device (disk I/O) requests
| $ EVENT_L..="block:block_rq_insert" # block device requests
| $ EVENT_L..="block:block_rq_issue" # block device issues
| $ EVENT_L..="block:block_rq_complete" # block device completions (has timestamps)
| # filter completion examples:
| --filter 'nr_sector > 200' : size at least 100 Kbytes
| --filter 'rwbs == "WS"' : synchronous writes only
| --filter 'rwbs ~ "*W*"' : all types of writes.
| $ EVENT_L..="minor-faults" # minor faults (RSS growth)
| $ EVENT_L..="page-faults"
|
| $ sudo perf stat \ # add -I 1000 to collect per-second
|     -e ${EVENT_L00} -e ${EVENT_L01} -e ... \
|     command ¹
|
| ¹: alternatively:
| -a: System-wide
| -p, --pid=<pid_list> stat over running list of processes.
| -t, --tid=<tid_list> stat over running list of threads.
| --cgroup=docker/1d567f... : container identified by its /sys/fs/cgroup/perf_event cgroup
| -b, --bpf-prog=<bid_list> stat over running list of BPF programs.
| bpf programs can be listed like:
| # bpftool prog | head -n 1
| 17247: tracepoint name sys_enter tag 192d548b9d754067 gpl
|
| # perf stat -e cycles,instructions --bpf-prog 17247 --timeout 1000
|
| Performance counter stats for 'BPF program(s) 17247':
|
| 85,967 cycles
| 28,982 instructions # 0.34 insn per cycle
|
| 1.102235068 seconds time elapsed
```
### One-liners: perf PROFILING
```
| # Sample ...
| # perf record -F 99 command # on-CPU functions, at 99 Hertz:
| # -g: on-CPU stack traces (via frame pointers)
| # --call-graph dwarf: on-CPU stack traces. use dwarf (dbg info)
| # to unwind stacks.
| # --call-graph lbr: on-CPU stack traces using last branch record for stacks
| # (>= Linux 4.?).
| # -b: branch tracing (needs HW support)
| # -T: record timestamp
|
| # perf record \ # <·· Sample CPU stack traces,
| -e L1-dcache-load-misses \# <· once every 10,000 Level 1 data cache misses,
| -c 10000 \ #
| -a \ # <· system wide
| -g \ # <· capture call graphs (stack traces)
| -- sleep 5 # for 5 seconds
|
| # perf top -F 49 # show top addresses and symbols, live (no perf.data file):
| # -ns comm,dso: Show also names and segments
```
```
| $ perf buildid-cache --add `which node` # Add Node.js USDT probes (Linux 4.10+):
```
### One-liners: Dynamic Tracing recipes (`perf probe kernel_function`)
* Remember: dynamic probes add events anywhere in kernel or userspace code.
```
| $ perf probe -l # <· List currently available dynamic probes
| $ perf probe --add tcp_sendmsg # Add tracepoint for tcp_sendmsg function entry
| $ perf probe --add 'tcp_sendmsg%return' # Add tracepoint for tcp_sendmsg function return
| $ perf probe -d tcp_sendmsg # Remove tracepoint
| $ perf probe -V tcp_sendmsg # Show available variables for tcp_sendmsg function (needs debuginfo)
| $ perf probe -V tcp_sendmsg --externs # --externs: show also external vars (needs debuginfo)
| $ perf probe -L tcp_sendmsg # Show available line probes (needs debuginfo):
| $ perf probe -V tcp_sendmsg:81 # Show available variables for tcp_sendmsg at line 81 (needs debuginfo).
|
| $ perf probe ... # Add tracepoint for tcp_sendmsg ...
| ...'tcp_sendmsg %ax %dx %cx' # <· with 3 entry argument registers (platform specific)
| ...'tcp_sendmsg bytes=%cx' # <· with an alias ("bytes") for the %cx register (platform specific)
| # --filter 'bytes > 100' # trace only when bytes > 100
| ...'tcp_sendmsg size' # <· with "size" entry argument (reliable, needs debuginfo)
| ...'tcp_sendmsg size sk->_sk_common.skc_state' # <· "size"+socket state ( " , " " )
| ...'tcp_sendmsg:81 seglen' # <· in line 81 with local variable seglen (needs debuginfo)
|
| $ perf probe ... # Add tracepoint ...
| .. 'do_sys_open filename:string' # <· for do_sys_open() with the filename as a string (needs debuginfo)
| .. 'myfunc%return +0($retval):string' # <· for myfunc() return, and include the retval as a string
| .. -x /lib64/libc.so.6 malloc # <· for the user-level malloc() function from libc
| # (-x => file separator)
| .. -x /usr/lib64/libpthread-2.24.so \ # <· for user-level static probe (USDT, aka SDT event)
| %sdt_libpthread:mutex_entry
| └─┬────────────────────────────────┘
| Add --dry-run --verbose (or -nv) to simulate the probe setup without actually adding it.
|
| $ perf record \ # Trace previous probe
| -e probe:tcp_sendmsg \
| --filter 'size > 0 && skc_state != 1'# <· 1 == TCP_ESTABLISHED(1) (needs debuginfo):
| -a # <· global (all processes and threads)
|
|
| Mixed
|
| $ perf top \ # Show summary by process every 2 seconds
| -e raw_syscalls:sys_enter \ # for system calls
| -ns comm
|
| $ stdbuf -oL \ # Trace sent network packets by on-CPU process,
| perf top -e net:net_dev_xmit \ # rolling output (no clear):
| -ns comm | \
| strings
|
| $ perf record -F99 \ # Sample stacks+context switches at 99 Hz
| -e cpu-clock -e cs -a -g # system wide.
|
| $ perf record -F99 \ # Sample stacks to 2 levels deep, and,
| -e cpu-clock/max-stack=2/ \ # context switch stacks (5 levels) (needs 4.8)
| -e cs/max-stack=5/ -a -g
|
| $ perf c2c record -a -- sleep 10 # Record cacheline events (Linux 4.10+):
|
| REPORTING ─────────────────────────────
| $ perf c2c report # Report it. (ncurses browser by default)
| $ perf report -n # Report it with a column for sample count.
| $ perf report --stdio # Report as text with data coalesced and percentages
| $ perf report --stdio -n -g folded # Report, with stacks in folded format:
| # one line per stack (needs 4.4)
| $ perf script # List all events from perf.data
| $ perf script --header # List all perf.data events, with data header
| (newer kernels; was previously default)
| $ perf script -F time,event,trace # List all with customized fields (Linux > 4.1)
| (-f for Linux <= 4.1)
| "my recomended fields":
| comm,pid,tid,cpu,time,event,ip,sym,dso
| $ perf annotate --stdio # Disassemble and annotate instructions with percentages
| # (needs some debuginfo)
```
... When installing perf, it may tell you to install another related package (linux-tools-<kernelversion>).
... To get the most out of perf, you'll want symbols and stack traces.
... I find it useful to add both libc6-dbgsym and coreutils-dbgsym, to provide
some symbol coverage of user-level OS codepaths.
```
| # file .../some_executable_ELF
| ... ELF 64-bit LSB executable, ... not stripped
| └─────┬────┘
| => "not stripped" => symbols included
```
... Kernel-level symbols are in the kernel debuginfo package, or when compiled with CONFIG_KALLSYMS.
[[{dev_stack.java,dev_stack.nodeJS]]
... JIT Symbols for VMs (JVM, Node.js, ...):
when profiling with `perf_events`, you'll see symbols for the VM engine, which have some use
(e.g. identifying GC cycles), but you won't see the language-level context you might be expecting.
**perf_events has JIT support to solve this**, which requires the VM to maintain a
/tmp/perf-PID.map file for symbol translation.
* Java can do this with perf-map-agent.<br/>
  Full stacks may not be shown, because HotSpot on x86 omits the frame pointer (just like gcc);
  `-XX:+PreserveFramePointer` (JDK 8u60+) fixes it.
* Node.js 0.11.13+ can do it with --perf_basic_prof (see the sketch below).
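* A minimal sketch for the Node.js case (the script name `app.js` is a placeholder):
```
| $ node --perf_basic_prof app.js &         # writes /tmp/perf-<PID>.map
| $ perf record -F 99 -g -p $! -- sleep 30  # sample that PID for 30 seconds
| $ perf report                             # JS frames resolved via the map file
```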
[[dev_stack.java,dev_stack.nodeJS}]]
[[{qa.best_patterns]]
... Always compile with frame pointers. Omitting them (a default gcc optimization at -O2)
    is an evil compiler optimization that breaks debuggers and leads to incomplete stacks.
    Fixes (see the sketch after this list):
    - use dwarf data (kernel 3.9+) to unwind user-level stacks via libunwind;
      enabled with "--call-graph dwarf".
    - use the last branch record (LBR) if available (a processor feature disabled
      in most cloud environments).
    - keep the frame pointers (compile with -fno-omit-frame-pointer).
other stack walking techniques include:
- BTS (Branch Trace Store)
- new ORC unwinder.
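A small sketch of the two main fixes, assuming a C program built from a hypothetical myprog.c:
```
| # option A: keep frame pointers at build time, then use frame-pointer unwinding
| $ gcc -O2 -fno-omit-frame-pointer -g -o myprog myprog.c
| $ perf record -F 99 -g ./myprog
|
| # option B: binary without frame pointers: unwind with dwarf debug info
| $ perf record -F 99 --call-graph dwarf ./myprog
| $ perf report
```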
[[qa.best_patterns}]]
<!-- { -->
## flamegraph
* <https://github.com/flamegraph-rs/flamegraph>
* perf+dtrace flamegraph generator with additional support for Cargo projects! (Rust based)
It can be used to profile anything, not just Rust projects!
* See also notes on latencytop
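* For reference, a sketch of the classic flame-graph pipeline using Brendan Gregg's FlameGraph
  scripts (the clone path ~/FlameGraph is an assumption):
```
| $ git clone https://github.com/brendangregg/FlameGraph ~/FlameGraph
| $ perf record -F 99 -a -g -- sleep 30      # sample system-wide for 30 seconds
| $ perf script \
|     | ~/FlameGraph/stackcollapse-perf.pl \ # fold stacks: one line per unique stack
|     | ~/FlameGraph/flamegraph.pl > out.svg # render the interactive SVG
```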
<!-- } -->
## Syscall Profiling
```
| $ perf trace --duration=100   # <· show only syscalls lasting more than 100 ms
| 340.448 (1000.122 ms): JS Watchdog/15221 futex(uaddr: 0x7f3e9cd1a434, op: WAIT_BITSET|PRIVATE_FLAG|CLOCK_REALTIME, utime: 0x7f3eae487df0, val3: MATCH_ANY) = -1 ETIMEDOUT (Connection timed out)
| 119.549 (1221.529 ms): tmux: server/2501 poll(ufds: 0x55edaa47c990, nfds: 11, timeout_msecs: 12189) = 1
| 395.984 (1000.133 ms): tuned/19297 futex(uaddr: 0x7f37a4027130, op: WAIT_BITSET|PRIVATE_FLAG|CLOCK_REALTIME, utime: 0x7f37aad37e30, val3: MATCH_ANY) = -1 ETIMEDOUT (Connection timed out)
| 691.446 (1000.105 ms): JS Watchdog/15347 futex(uaddr: 0x7f6c829550b0, op: WAIT_BITSET|PRIVATE_FLAG|CLOCK_REALTIME, utime: 0x7f6c942a0df0, val3: MATCH_ANY) = -1 ETIMEDOUT (Connection timed out)
| 755.478 (1000.029 ms): Timer/15227 futex(uaddr: 0x7f3eb5b5cc80, op: WAIT|PRIVATE_FLAG, utime: 0x7f3e9c2c1a60) = -1 ETIMEDOUT (Connection timed out)
| 755.609 (1000.017 ms): Web Content/15215 poll(ufds: 0x7f3e9bd04760, nfds: 3, timeout_msecs: 4294967295) = 1
| 311.581 (1527.461 ms): Gecko_IOThread/15157 epoll_wait(epfd: 8<anon_inode:[eventpoll]>, events: 0x7f3d6d1f5200, maxevents: 32, timeout: 4294967295) = 1
| 311.955 (1527.194 ms): firefox/15132 poll(ufds: 0x7f3d1ebd5610, nfds: 5, timeout_msecs: 4294967295) = 1
| 876.905 (1000.146 ms): dockerd/32491 futex(uaddr: 0x561e1da0b920, utime: 0xc42045bed8) = -1 ETIMEDOUT (Connection timed out)
| 877.069 (1000.064 ms): dockerd/27832 futex(uaddr: 0x561e1da07950, utime: 0x7f50e7c61b90) = 0
| 877.025 (1000.145 ms): dockerd/27904 futex(uaddr: 0xc420c82548) = 0
| 912.964 (1000.133 ms): JS Watchdog/15158 futex(uaddr: 0x7f3d57c4c0f0, op: WAIT_BITSET|PRIVATE_FLAG|CLOCK_REALTIME, utime: 0x7f3d65a41df0, val3: MATCH_ANY) = -1 ETIMEDOUT (Connection timed out)
| 311.586 (1607.337 ms): Chrome_~dThrea/15346 epoll_wait(epfd: 11<anon_inode:[eventpoll]>, events: 0x7f6c9b9cd080, maxevents: 32, timeout: 4294967295) = 1
| 937.245 (1000.102 ms): JS Watchdog/15276 futex(uaddr: 0x7feca361bbf0, op: WAIT_BITSET|PRIVATE_FLAG|CLOCK_REALTIME, utime: 0x7fecb4e27df0, val3: MATCH_ANY) = -1 ETIMEDOUT (Connection timed out)
| 214.944 (1927.025 ms): Timer/15164 futex(uaddr: 0x7f3d6d165be0, op: WAIT|PRIVATE_FLAG, utime: 0x7f3d542a1a60) = -1 ETIMEDOUT (Connection timed out)
| 215.042 (1927.063 ms): Socket Thread/15166 poll(ufds: 0x7f3d539028f0, nfds: 8, timeout_msecs: 4294967295) = 1
| 1340.624 (1000.072 ms): JS Watchdog/15221 futex(uaddr: 0x7f3e9cd1a434, op: WAIT_BITSET|PRIVATE_FLAG|CLOCK_REALTIME, utime: 0x7f3eae487df0, val3: MATCH_ANY) = -1 ETIMEDOUT (Connection timed out)
| 1396.377 (1000.131 ms): tuned/19297 futex(uaddr: 0x7f37a4027130, op: WAIT_BITSET|PRIVATE_FLAG|CLOCK_REALTIME, utime: 0x7f37aad37e30, val3: MATCH_ANY) = -1 ETIMEDOUT (Connection timed out)
| 1691.606 (1000.059 ms): JS Watchdog/15347 futex(uaddr: 0x7f6c829550b0, op: WAIT_BITSET|PRIVATE_FLAG|CLOCK_REALTIME, utime: 0x7f6c942a0df0, val3: MATCH_ANY) = -1 ETIMEDOUT (Connection timed out)
| 1877.200 (1000.115 ms): dockerd/27844 futex(uaddr: 0x561e1da0b9a0, utime: 0xc420460ed8) = -1 ETIMEDOUT (Connection timed out)
| 876.826 (2000.665 ms): dockerd/27840 futex(uaddr: 0xc4206d7148) = 0
| 1877.252 (1000.149 ms): dockerd/27832 futex(uaddr: 0x561e1da07950, utime: 0x7f50e7c61b90) = 0
| 1877.190 (1000.239 ms): dockerd/27904 futex(uaddr: 0xc420c82548) = 0
| 1877.189 (1000.372 ms): dockerd/32491 futex(uaddr: 0xc420c83948) = 0
```
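A related sketch: `perf trace` can also print a strace-like summary per syscall instead of the
raw event stream (the target PID is a placeholder):
```
| $ sudo perf trace -s -p <pid>  # <· per-syscall counts and latency summary
|                                #    for <pid>; Ctrl-C to stop and print it
```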
[[{troubleshooting.locks]]
## perf-lock
* Analyze lock events
* <https://linux.die.net/man/1/perf-lock>
```
| perf lock {record|report|script|info}
|
| $ perf lock record <command> # <· records lock events for <command>
| into perf.data
| $ perf lock report # <· reports statistical data.
|
| $ perf lock script # <· shows raw lock events.
|
| $ perf lock info # <· shows metadata like threads or addresses
| of lock instances.
|
| Common Options
|
| --input=<file> : (or stdin)
| --verbose : (show symbol address, etc).
| --dump-raw-trace: (in ASCII)
|
| Common Report Options
|
| --key=<value> Sorting key. One of acquired* | contended | wait_total | wait_max | wait_min
|
| Common Info Options
|
| --threads : dump thread list in perf.data
| --map : dump (address:name) map of lock instances
```
[[troubleshooting.locks}]]
[[{scalability.profiling]]
## Profiling 60 Terabytes of input data with Perf at Facebook
Extracted from: "Apache Spark @Scale: A production use case"
<https://engineering.fb.com/core-data/apache-spark-scale-a-60-tb-production-use-case/>
... Tools we used to find performance bottlenecks:
* Spark Linux Perf/Flame Graph support: Although the two tools
  described earlier in that post are very handy, they do not provide an aggregated view of CPU
profiling for the job running across hundreds of machines at the same
time. On a per-job basis, **we added support for enabling Perf
profiling (via libperfagent for Java symbols) and can customize the
duration/frequency of sampling. The profiling samples are aggregated
and displayed as a Flame Graph across the executors using our
internal metrics collection framework**.
[[scalability.profiling}]]
[[monitoring.perf}]]
* Special-purpose perf "tools" (subcommands):
```
| perf c2c (Linux 4.10+): cache-2-cache and cacheline false sharing analysis.
| perf kmem: kernel memory allocation analysis.
| perf kvm: KVM virtual guest analysis.
| perf lock: lock analysis.
| perf mem: memory access analysis.
| perf sched: kernel scheduler statistics.
| perf-bench: General framework for benchmark suites
```
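* A quick sketch of one of these, perf sched (`sleep 5` is a placeholder workload):
```
| $ sudo perf sched record -- sleep 5  # <· record scheduler events system-wide
| $ sudo perf sched latency            # <· per-task scheduling latency summary
| $ sudo perf sched map                # <· show which task ran on which CPU
```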
[[}]]
[[linux.}]]