[[{linux.perf,monitoring.perf]]
[[{monitoring.perf,monitoring.kernel,monitoring.apps,profiling.jobs]]
[[profiling.perf,troubleshooting.locks,PM.TODO]]
# perf "Performance Counters for Linux" (PCL), also known as perf events.
* External Refs:
* <http://www.brendangregg.com/perf.html>
* <http://www.brendangregg.com/flamegraphs.html>
* perf is a performance-analysis (multi-)tool built into the Linux kernel (2.6.31+, 2009)
  and driven from user space by the `perf` command.
* It is capable of statistical profiling of the entire system (both kernel and userland code).
  "It provides a framework for all things related to performance analysis."
[[doc_has.keypoint]]
* perf_events can instrument in three ways (see the sketch after this list):
  1. counting events in kernel context, where a summary of counts is printed by perf
     (no perf.data file is generated).
  2. sampling events, which writes event data to a kernel buffer that is read at
     a gentle asynchronous rate by the perf command and written to a perf.data file.
     This file is then read by the perf report or perf script commands.
  3. bpf programs on events, a feature of Linux 4.4+ kernels that can execute
     custom user-defined programs in kernel space to perform efficient
     filters and summaries of the data, e.g. efficiently measured latency histograms.
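* A minimal sketch contrasting the counting and sampling modes (`sleep 5` is just a
  placeholder workload; any command works):
```
| # 1) counting: in-kernel counters, summary printed, no perf.data file
| $ perf stat -e context-switches -a -- sleep 5
|
| # 2) sampling: events are written to perf.data for later analysis
| $ perf record -e context-switches -a -- sleep 5
| $ perf report
```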
* Details about the events can be collected, including:
```
┌───────────────────────────────────┐
│ event │
├───────────────────────────────────┤
│ + timestamp │
│ + path leading to it (stack trace)│
│ + details (ex: value of stack,...)│
│ + ... │
└───────────────────────────────────┘
```
* Events covered: <!-- { -->
```
| $ sudo perf list # list currently known-to-kernel events that
| # can be gathered (more than 650 on recent kernels).
|
| List of pre-defined events (to be used in -e):
| event-type
| ----------
| cpu-cycles OR cycles Hardware
| instructions Hardware
| cache-references Hardware
| cache-misses Hardware
| branch-instructions OR branches Hardware
| branch-misses Hardware
| bus-cycles Hardware
| (stalled|idle)-cycles-frontend Hardware
| (stalled|idle)-cycles-backend Hardware
| ref-cycles Hardware
| cpu-clock Software
| task-clock Software
| page-faults OR faults Software
| context-switches OR cs Software
| cpu-migrations OR migrations Software
| minor-faults Software
| major-faults Software
| alignment-faults Software
| emulation-faults Software
| bpf-output Software
| dummy Software
| L1-dcache-loads Hardw.cache
| L1-dcache-load-misses Hardw.cache
| L1-dcache-stores Hardw.cache
| [...]
| rNNN Raw hard.descriptor
| cpu/t1=v1[,t2=v2,t3 ...]/modifier Raw hard.descriptor
| (see 'man perf-list' on how
| to encode it)
| mem:<addr>[:access] Hard.breakpoint
| probe:tcp_sendmsg Tracepoint <· Dyn.Tracing
|                                 extends `perf list` ¹
| [...]
| sched:sched_process_exec Tracepoint
| sched:sched_process_fork Tracepoint
| sched:sched_process_wait Tracepoint
| sched:sched_wait_task Tracepoint
| sched:sched_process_exit Tracepoint
| [...] └───┬────┘
| ┌─ event types ──────────────────────────┘
| [[{doc_has.keypoint]]
| · Hardware: CPU performance monitoring counters.
|
| · Software: CPU migrations, minor faults, major faults, etc.
|
| · Kernel Tracepoint: static kernel-level instrumentation points
| hardcoded in interesting places.
| · User Statically-Defined Tracing (USDT): static tracepoints for
| user-level programs and applications.
| · ¹ Dynamic Tracing: events created in any location
| For kernel software, `kprobes` framework is used.
| For user software, `uprobes` framework is used.
| · Timed Profiling: Snapshots can be collected at arbitrary
| frequency, (`perf record -F`). Commonly used for CPU
| usage profiling, and works by creating custom timed
| interrupt events.
| [[doc_has.keypoint}]]
|
| $ perf list 'sched:*' <· Listing sched tracepoints
```
<!-- } -->
Start experimenting with perf like this:
```
| ┌───────┬─···· Lowest system overhead, while
| · · "good enough" for many purposes.
| · ·
| v v ┌─────┬─ any binary or shell command will work
| $ perf stat -- make -j
| (Output will be similar to ...)
| Performance counter stats for 'make -j':
|
| 83723.452481 task-clock:u (msec) # 1.004 CPUs utilized
| 0 context-switches:u # 0.000 K/sec
| 0 cpu-migrations:u # 0.000 K/sec
| 3,228,188 page-faults:u # 0.039 M/sec
| 229,570,665,834 cycles:u # 2.742 GHz
| 313,163,853,778 instructions:u # 1.36 insn per cycle
| 69,704,684,856 branches:u # 832.559 M/sec
| 2,078,861,393 branch-misses:u # 2.98% of all branches
|
| 83.409183620 seconds time elapsed
|
| 74.684747000 seconds user
| 8.739217000 seconds sys
|
| $ perf stat \
|   -r 4 \                   <· repeat 4 times
|   --event=cycles:{k,u} \   <· split into {k}ernel / {u}ser cycles
| -- \
| make -j <· alternatively -a for system-wide
| ... events.
| 123,123,11 cycles:k
| 1,123,11 cycles:u
|
| 0.00013913 secs elapsed
```
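* As a sketch, `perf stat -I <ms>` prints the counters at a fixed interval instead of once
  at the end, which is handy for watching a long-running workload:
```
| $ perf stat -I 1000 -e cycles,instructions -a -- sleep 10
| (prints one line of counter values per second, system-wide, for 10 seconds)
```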
[[{profiling.jobs.IO.BLOCK]]
### Example: tracing/profiling disk I/O:
```
| $ sudo perf record \
| -e block:block_rq_issue \ # <· fires when a block device I/O request is issued
| -a \ # <· trace all CPUs.
| -g # <· capture call graphs (stack traces)
| (output is dumped to perf.data)
|
| $ sudo perf report # <· takes perf.data as input
| # build tree from the stack traces.
| # coalescing common paths and showing
| # percentages for each path.
| ...
| # Samples: 2K of event 'block:block_rq_issue'
| # Event count (approx.): 2216 <·· related events captured
| #
| # Over Com Shared Symbol
| # head mand Object
| # .... ... ................ ...................
| #
| 32.1% dd [kernel.kallsyms] [k] blk_peek_request
| | └ issued by kernel func.
| -- blk_peek_request
| virtblk_request
| _blk_run_queue
| |
| |-98.31%- queue_unplugged
| | blk_flush_plug_list
| | |
| | |-91.00%-blk_queue_bio
| | | generic_make_request
| | | submit_bio
| | | ext4_io_submit
| | | |
| | | |-58.71%-ext4_bio_write_page
| | | | mpage_da_submit_io
| | | | mpage_da_map_and_submit
| | | | write_cache_pages_da
| | | | ext4_da_writepages
| | | | do_writepages
| | | | __filemap_fdatawrite_range
| | | | filemap_flush
| | | | ext4_alloc_da_blocks
| | | | ext4_release_file
| | | | __fput
| | | | ____fput
| | | | task_work_run
| | | | do_notify_resume
| | | | int_signal
| | | | close
| | | | 0x0
| | | |
| | | -41.29%-mpage_da_submit_io
| [...]
```
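* Follow-up sketch: the same perf.data can also be dumped event by event with `perf script`,
  useful when the coalesced report hides individual requests:
```
| $ sudo perf script # <· one line per block_rq_issue event, with
|                    #    process, timestamp and the captured stack
```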
[[profiling.jobs.IO}]]
### perf-top
```
| $ perf top # <· Sort of detailed/advanced "top"
|
| Samples: 12K of event 'cycles:ppp', Event count (approx.): 54543453553535
| Overhead Shared Object Symbol
| 13.11% libQt5Core.so.5.7.0 [.] QHashData::nextNode
| 5.11% libc-2.24.so [.] _int_malloc
| 2.90% perf [.] symbols__insert
| ...
```
## perf "one-liner" recipes
* Summary from <http://www.brendangregg.com/perf.html>, by Brendan Gregg
* perf is an event-oriented observability tool, which can answer advanced performance and troubleshooting questions like:
* Why is the kernel on-CPU so much? What code-paths? [[troubleshooting.performance,monitoring.cpu]]
* Which code-paths are causing CPU level 2 cache misses? [[monitoring.cpu.cache]]
* Are the CPUs stalled on memory I/O? [[profiling.memory]]
* Which code-paths are allocating memory, and how much? [[profiling.memory]]
* What is triggering TCP retransmits? [[profiling.network,monitoring.network]]
* Is a certain kernel function being called, and how often? [[profiling.kernel]]
* For what reasons are threads leaving the CPU? [[profiling.threads]]
```
| $ perf stat -a sleep 5 # <· CPU counter statistics for the entire system, for 5 seconds:
|
| "LLC-load-misses" # <· last level cache misses
| "cycles:k" # <· CPU kernel instructions.
| "cycles:u" # <· CPU user instructions.
| "cycles:up" # <· CPU user instructions, precisely (using PEBS)
| "context-switches"
|
| ────────────────────────────────────────────────────── # collect counter statistics (perf stat) for ...
| $ EVENT_L00="cycles,instructions,cache-references"
| $ EVENT_L00="${EVENT_L00},cache-misses,bus-cycles" # <· basic CPU statistics
| $ EVENT_L01="L1-dcache-loads,L1-dcache-load-misses"
| $ EVENT_L01="${EVENT_L01},L1-dcache-stores" # <· Level 1 data cache stats.
| $ EVENT_L02="dTLB-loads,dTLB-load-misses"
| $ EVENT_L02="${EVENT_L02},dTLB-prefetch-misses" # <· Various CPU data TLB statistics.
| $ EVENT_L03="LLC-loads,LLC-load-misses,LLC-stores"
| $ EVENT_L03="${EVENT_L03},LLC-prefetches" # <· CPU last level cache statistics
| $ EVENT_L04="r003c" # <· eg of raw PMC counters, (unhalted core cycles)
| $ EVENT_L05="cycles" # <· counting cycles
| $ EVENT_L06="cpu/event=0x0e,umask=0x01,inv,cmask=0x01/" # <· frontend stalls via raw specification
| $ EVENT_L07="raw_syscalls:sys_enter" # <· system calls
| $ EVENT_L08="syscalls:sys_enter_*" # <· system calls by type
| $ EVENT_L09="sched:*" # <· scheduler events
| $ EVENT_L10="ext4:*" # <· ext4 events
| $ EVENT_L11="block:*" # <· block device I/O events
| $ EVENT_L12="vmscan:*" # <· vmscan events
| $ EVENT_L..="vmscan:mm_vmscan_wakeup_kswapd" # <· vmscan: kswapd wakeup events
| $ EVENT_L..="migrations" # CPU migrations
| $ EVENT_L..="syscalls:sys_enter_connect" # connect()s (outbound connections)
| $ EVENT_L..="syscalls:sys_enter_accept*" # accepts()s (inbound connections)
| $ EVENT_L..="block:block_rq_insert" # block device (disk I/O) requests
| $ EVENT_L..="block:block_rq_insert" # block device requests
| $ EVENT_L..="block:block_rq_issue" # block device issues
| $ EVENT_L..="block:block_rq_complete" # block device completions (has timestamps)
| # filter completion examples:
| --filter 'nr_sector > 200' : size at least 100 Kbytes
| --filter 'rwbs == "WS"' : synchronous writes only
| --filter 'rwbs ~ "*W*"' : all types of writes.
| $ EVENT_L..="minor-faults" # minor faults (RSS growth)
| $ EVENT_L..="page-faults"
|
| $ sudo perf stat \ # add -I 1000 to collect per-second
|     -e ${EVENT_L00} -e ${EVENT_L01} -e ... \
|     command ¹
|
| ¹: alternatively:
| -a: System-wide
| -p, --pid=<pid_list> stat over running list of processes.
| -t, --tid=<tid_list> stat over running list of threads.
| --cgroup=docker/1d567f... : container identified by its /sys/fs/cgroup/perf_event cgroup
| -b, --bpf-prog=<bid_list> stat over running list of BPF programs.
| bpf programs can be listed like:
| # bpftool prog | head -n 1
| 17247: tracepoint name sys_enter tag 192d548b9d754067 gpl
|
| # perf stat -e cycles,instructions --bpf-prog 17247 --timeout 1000
|
| Performance counter stats for 'BPF program(s) 17247':
|
| 85,967 cycles
| 28,982 instructions # 0.34 insn per cycle
|
| 1.102235068 seconds time elapsed
```
### One-liners: perf PROFILING
```
| # Sample ...
| # perf record -F 99 command # on-CPU functions, at 99 Hertz:
| # -g: on-CPU stack traces (via frame pointers)
| # --call-graph dwarf: on-CPU stack traces. use dwarf (dbg info)
| # to unwind stacks.
| # --call-graph lbr: on-CPU stack traces using last branch record for stacks
| # (>= Linux 4.?).
| # -b: branch tracing (needs HW support)
| # -T: record timestamp
|
| # perf record \ # <·· Sample CPU stack traces,
| -e L1-dcache-load-misses \# <· once every 10,000 Level 1 data cache misses,
| -c 10000 \ #
| -a \ # <· system wide
| -g \ # <· capture call graphs (stack traces)
| -- sleep 5 # for 5 seconds
|
| # perf top -F 49 # show top addresses and symbols, live (no perf.data file):
| # -ns comm,dso: Show also names and segments
```
```
| $ perf buildid-cache --add `which node` # Add Node.js USDT probes (Linux 4.10+):
```
### One-liners: Dynamic Tracing recipes (`perf probe kernel_function`)
* Remember: dynamic probes add events anywhere in kernel or userspace code.
```
| $ perf probe -l # <· List currently available dynamic probes
| $ perf probe --add tcp_sendmsg # Add tracepoint for tcp_sendmsg function entry
| $ perf probe --add 'tcp_sendmsg%return' # Add tracepoint for tcp_sendmsg function return
| $ perf probe -d tcp_sendmsg # Remove tracepoint
| $ perf probe -V tcp_sendmsg # Show available variables for tcp_sendmsg function (needs debuginfo)
| $ perf probe -V tcp_sendmsg --externs # --externs: show also external vars (needs debuginfo)
| $ perf probe -L tcp_sendmsg # Show available line probes (needs debuginfo):
| $ perf probe -V tcp_sendmsg:81 # Show available variables for tcp_sendmsg at line 81 (needs debuginfo).
|
| $ perf probe ... # Add tracepoint for tcp_sendmsg ...
| ...'tcp_sendmsg %ax %dx %cx' # <· with 3 entry argument registers (platform specific)
| ...'tcp_sendmsg bytes=%cx' # <· with an alias ("bytes") for the %cx register (platform specific)
| # --filter 'bytes > 100' # trace only when bytes > 100
| ...'tcp_sendmsg size' # <· with "size" entry argument (reliable, needs debuginfo)
| ...'tcp_sendmsg size sk->_sk_common.skc_state' # <· "size"+socket state ( " , " " )
| ...'tcp_sendmsg:81 seglen' # <· in line 81 with local variable seglen (needs debuginfo)
|
| $ perf probe ... # Add tracepoint ...
| .. 'do_sys_open filename:string' # <· for do_sys_open() with the filename as a string (needs debuginfo)
| .. 'myfunc%return +0($retval):string' # <· for myfunc() return, and include the retval as a string
| .. -x /lib64/libc.so.6 malloc # <· for the user-level malloc() function from libc
| # (-x => file separator)
| .. -x /usr/lib64/libpthread-2.24.so \ # <· for user-level static probe (USDT, aka SDT event)
| %sdt_libpthread:mutex_entry
| └─┬────────────────────────────────┘
| Add --dry-run --verbose (or -nv) to simulate the probe setup without actually adding it.
|
| $ perf record \ # Trace previous probe
| -e probe:tcp_sendmsg \
| --filter 'size > 0 && skc_state != 1'# <· 1 == TCP_ESTABLISHED(1) (needs debuginfo):
| -a # <· global (all processes and threads)
|
|
| Mixed
|
| $ perf top \ # Show summary by process every 2 seconds
| -e raw_syscalls:sys_enter \ # for system calls
| -ns comm
|
| $ stdbuf -oL \ # Trace sent network packets by on-CPU process,
| perf top -e net:net_dev_xmit \ # rolling output (no clear):
| -ns comm | \
| strings
|
| $ perf record -F99 \ # Sample stacks+context switches at 99 Hz
| -e cpu-clock -e cs -a -g # system wide.
|
| $ perf record -F99 \ # Sample stacks to 2 levels deep, and,
| -e cpu-clock/max-stack=2/ \ # context switch stacks (5 levels) (needs 4.8)
| -e cs/max-stack=5/ -a -g
|
| $ perf c2c record -a -- sleep 10 # Record cacheline events (Linux 4.10+):
|
| REPORTING ─────────────────────────────
| $ perf c2c report # Report it. (ncurses browser by default)
| $ perf report -n # Report it with a column for sample count.
| $ perf report --stdio # Report as text with data coalesced and percentages
| $ perf report --stdio -n -g folded # Report, with stacks in folded format:
| # one line per stack (needs 4.4)
| $ perf script # List all events from perf.data
| $ perf script --header # List all perf.data events, with data header
| (newer kernels; was previously default)
| $ perf script -F time,event,trace # List all with customized fields (Linux > 4.1)
| (-f for Linux <= 4.1)
| "my recomended fields":
| comm,pid,tid,cpu,time,event,ip,sym,dso
| $ perf annotate --stdio # Disassemble and annotate instructions with percentages
| # (needs some debuginfo)
```
... When installing perf, it may tell you to install another related package (linux-tools-<kernelversion>).
... To get the most out of perf, you'll want symbols and stack traces.
... I find it useful to add both libc6-dbgsym and coreutils-dbgsym, to provide
some symbol coverage of user-level OS codepaths.
```
| # file .../some_executable_ELF
| ... ELF 64-bit LSB executable, ... not stripped
| └─────┬────┘
| => "not stripped" => symbols included
```
... Kernel-level symbols are in the kernel debuginfo package, or when compiled with CONFIG_KALLSYMS.
[[{dev_stack.java,dev_stack.nodeJS]]
... JIT Symbols for VMs (JVM, Node.js, ...):
when profiling with `perf_events`, you'll see symbols for the VM engine, which have some use
(e.g. identifying GC cycles), but you won't see the language-level context you might be expecting.
**perf_events has JIT support to solve this**, which requires the VM to maintain a
/tmp/perf-PID.map file for symbol translation.
* Java can do this with perf-map-agent.<br/>
  Full stacks may not be shown, because HotSpot on x86 omits the frame pointer (just like gcc);
  `-XX:+PreserveFramePointer` (JDK 8u60+) fixes it.
* Node.js 0.11.13+ can do it with --perf_basic_prof (see the sketch below).
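* A minimal sketch for the Node.js case (the script name `app.js` is a placeholder):
```
| $ node --perf_basic_prof app.js &         # writes /tmp/perf-<PID>.map
| $ perf record -F 99 -g -p $! -- sleep 30  # sample that PID for 30 seconds
| $ perf report                             # JS frames resolved via the map file
```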
[[dev_stack.java,dev_stack.nodeJS}]]
[[{qa.best_patterns]]
... Always compile with frame pointers. Omitting them (a default gcc optimization at -O2)
    is an evil compiler optimization that breaks debuggers and leads to incomplete stacks.
    Fixes (see the sketch after this list):
    - use dwarf data (kernel 3.9+) to unwind user-level stacks via libunwind;
      enabled with "--call-graph dwarf".
    - use the last branch record (LBR) if available (a processor feature disabled
      in most cloud environments).
    - keep the frame pointers (compile with -fno-omit-frame-pointer).
other stack walking techniques include:
- BTS (Branch Trace Store)
- new ORC unwinder.
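A small sketch of the two main fixes, assuming a C program built from a hypothetical myprog.c:
```
| # option A: keep frame pointers at build time, then use frame-pointer unwinding
| $ gcc -O2 -fno-omit-frame-pointer -g -o myprog myprog.c
| $ perf record -F 99 -g ./myprog
|
| # option B: binary without frame pointers: unwind with dwarf debug info
| $ perf record -F 99 --call-graph dwarf ./myprog
| $ perf report
```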
[[qa.best_patterns}]]
<!-- { -->
## flamegraph
* <https://github.com/flamegraph-rs/flamegraph>
* perf+dtrace flamegraph generator with additional support for Cargo projects! (Rust based)
It can be used to profile anything, not just Rust projects!
* See also notes on latencytop
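* For reference, a sketch of the classic flame-graph pipeline using Brendan Gregg's FlameGraph
  scripts (the clone path ~/FlameGraph is an assumption):
```
| $ git clone https://github.com/brendangregg/FlameGraph ~/FlameGraph
| $ perf record -F 99 -a -g -- sleep 30      # sample system-wide for 30 seconds
| $ perf script \
|     | ~/FlameGraph/stackcollapse-perf.pl \ # fold stacks: one line per unique stack
|     | ~/FlameGraph/flamegraph.pl > out.svg # render the interactive SVG
```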
<!-- } -->
## Syscall Profiling
```
| $ perf trace --duration=100   # <· show only syscalls lasting more than 100 ms
| 340.448 (1000.122 ms): JS Watchdog/15221 futex(uaddr: 0x7f3e9cd1a434, op: WAIT_BITSET|PRIVATE_FLAG|CLOCK_REALTIME, utime: 0x7f3eae487df0, val3: MATCH_ANY) = -1 ETIMEDOUT (Connection timed out)
| 119.549 (1221.529 ms): tmux: server/2501 poll(ufds: 0x55edaa47c990, nfds: 11, timeout_msecs: 12189) = 1
| 395.984 (1000.133 ms): tuned/19297 futex(uaddr: 0x7f37a4027130, op: WAIT_BITSET|PRIVATE_FLAG|CLOCK_REALTIME, utime: 0x7f37aad37e30, val3: MATCH_ANY) = -1 ETIMEDOUT (Connection timed out)
| 691.446 (1000.105 ms): JS Watchdog/15347 futex(uaddr: 0x7f6c829550b0, op: WAIT_BITSET|PRIVATE_FLAG|CLOCK_REALTIME, utime: 0x7f6c942a0df0, val3: MATCH_ANY) = -1 ETIMEDOUT (Connection timed out)
| 755.478 (1000.029 ms): Timer/15227 futex(uaddr: 0x7f3eb5b5cc80, op: WAIT|PRIVATE_FLAG, utime: 0x7f3e9c2c1a60) = -1 ETIMEDOUT (Connection timed out)
| 755.609 (1000.017 ms): Web Content/15215 poll(ufds: 0x7f3e9bd04760, nfds: 3, timeout_msecs: 4294967295) = 1
| 311.581 (1527.461 ms): Gecko_IOThread/15157 epoll_wait(epfd: 8<anon_inode:[eventpoll]>, events: 0x7f3d6d1f5200, maxevents: 32, timeout: 4294967295) = 1
| 311.955 (1527.194 ms): firefox/15132 poll(ufds: 0x7f3d1ebd5610, nfds: 5, timeout_msecs: 4294967295) = 1
| 876.905 (1000.146 ms): dockerd/32491 futex(uaddr: 0x561e1da0b920, utime: 0xc42045bed8) = -1 ETIMEDOUT (Connection timed out)
| 877.069 (1000.064 ms): dockerd/27832 futex(uaddr: 0x561e1da07950, utime: 0x7f50e7c61b90) = 0
| 877.025 (1000.145 ms): dockerd/27904 futex(uaddr: 0xc420c82548) = 0
| 912.964 (1000.133 ms): JS Watchdog/15158 futex(uaddr: 0x7f3d57c4c0f0, op: WAIT_BITSET|PRIVATE_FLAG|CLOCK_REALTIME, utime: 0x7f3d65a41df0, val3: MATCH_ANY) = -1 ETIMEDOUT (Connection timed out)
| 311.586 (1607.337 ms): Chrome_~dThrea/15346 epoll_wait(epfd: 11<anon_inode:[eventpoll]>, events: 0x7f6c9b9cd080, maxevents: 32, timeout: 4294967295) = 1
| 937.245 (1000.102 ms): JS Watchdog/15276 futex(uaddr: 0x7feca361bbf0, op: WAIT_BITSET|PRIVATE_FLAG|CLOCK_REALTIME, utime: 0x7fecb4e27df0, val3: MATCH_ANY) = -1 ETIMEDOUT (Connection timed out)
| 214.944 (1927.025 ms): Timer/15164 futex(uaddr: 0x7f3d6d165be0, op: WAIT|PRIVATE_FLAG, utime: 0x7f3d542a1a60) = -1 ETIMEDOUT (Connection timed out)
| 215.042 (1927.063 ms): Socket Thread/15166 poll(ufds: 0x7f3d539028f0, nfds: 8, timeout_msecs: 4294967295) = 1
| 1340.624 (1000.072 ms): JS Watchdog/15221 futex(uaddr: 0x7f3e9cd1a434, op: WAIT_BITSET|PRIVATE_FLAG|CLOCK_REALTIME, utime: 0x7f3eae487df0, val3: MATCH_ANY) = -1 ETIMEDOUT (Connection timed out)
| 1396.377 (1000.131 ms): tuned/19297 futex(uaddr: 0x7f37a4027130, op: WAIT_BITSET|PRIVATE_FLAG|CLOCK_REALTIME, utime: 0x7f37aad37e30, val3: MATCH_ANY) = -1 ETIMEDOUT (Connection timed out)
| 1691.606 (1000.059 ms): JS Watchdog/15347 futex(uaddr: 0x7f6c829550b0, op: WAIT_BITSET|PRIVATE_FLAG|CLOCK_REALTIME, utime: 0x7f6c942a0df0, val3: MATCH_ANY) = -1 ETIMEDOUT (Connection timed out)
| 1877.200 (1000.115 ms): dockerd/27844 futex(uaddr: 0x561e1da0b9a0, utime: 0xc420460ed8) = -1 ETIMEDOUT (Connection timed out)
| 876.826 (2000.665 ms): dockerd/27840 futex(uaddr: 0xc4206d7148) = 0
| 1877.252 (1000.149 ms): dockerd/27832 futex(uaddr: 0x561e1da07950, utime: 0x7f50e7c61b90) = 0
| 1877.190 (1000.239 ms): dockerd/27904 futex(uaddr: 0xc420c82548) = 0
| 1877.189 (1000.372 ms): dockerd/32491 futex(uaddr: 0xc420c83948) = 0
```
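A related sketch: `perf trace` can also print a strace-like summary per syscall instead of the
raw event stream (the target PID is a placeholder):
```
| $ sudo perf trace -s -p <pid>  # <· per-syscall counts and latency summary
|                                #    for <pid>; Ctrl-C to stop and print it
```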
[[{troubleshooting.locks]]
## perf-lock
* Analyze lock events
* <https://linux.die.net/man/1/perf-lock>
```
| perf lock {record|report|script|info}
|
| $ perf lock record <command> # <· records lock events for <command>
| into perf.data
| $ perf lock report # <· reports statistical data.
|
| $ perf lock script # <· shows raw lock events.
|
| $ perf lock info # <· shows metadata like threads or addresses
| of lock instances.
|
| Common Options
|
| --input=<file> : (or stdin)
| --verbose : (show symbol address, etc).
| --dump-raw-trace: (in ASCII)
|
| Common Report Options
|
| --key=<value> Sorting key. One of acquired* | contended | wait_total | wait_max | wait_min
|
| Common Info Options
|
| --threads : dump thread list in perf.data
| --map : dump (address:name) map of lock instances
```
[[troubleshooting.locks}]]
[[{scalability.profiling]]
## Profiling 60 Terabytes of input data with Perf at Facebook
Extracted from: "Apache Spark @Scale: A production use case"
<https://engineering.fb.com/core-data/apache-spark-scale-a-60-tb-production-use-case/>
... Tools we used to find performance bottlenecks:
* Spark Linux Perf/Flame Graph support: Although the two tools
  described earlier in that post are very handy, they do not provide an aggregated view of CPU
profiling for the job running across hundreds of machines at the same
time. On a per-job basis, **we added support for enabling Perf
profiling (via libperfagent for Java symbols) and can customize the
duration/frequency of sampling. The profiling samples are aggregated
and displayed as a Flame Graph across the executors using our
internal metrics collection framework**.
[[scalability.profiling}]]
[[monitoring.perf}]]
* Special-purpose perf "tools" (subcommands):
```
| perf c2c (Linux 4.10+): cache-2-cache and cacheline false sharing analysis.
| perf kmem: kernel memory allocation analysis.
| perf kvm: KVM virtual guest analysis.
| perf lock: lock analysis.
| perf mem: memory access analysis.
| perf sched: kernel scheduler statistics.
| perf-bench: General framework for benchmark suites
```
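* A quick sketch of one of these, perf sched (`sleep 5` is a placeholder workload):
```
| $ sudo perf sched record -- sleep 5  # <· record scheduler events system-wide
| $ sudo perf sched latency            # <· per-task scheduling latency summary
| $ sudo perf sched map                # <· show which task ran on which CPU
```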
[[}]]
[[linux.}]]