Below you will find a list of possible metrics which Rezolus can export as well as details on their source and meaning. Metrics below may typically be followed by:
/count - the value of the counter
/histogram/(percentile) - a percentile of a counter's per-second rate, a gauge's instantaneous readings, or the percentile taken from a distribution
Sampler configurations will refer to the metrics according to their basenames as used in the descriptions below.
Note: summary metrics taken from underlying distributions use histogram binning that preserves a fixed number of significant figures. Reported values are rounded up to the highest value that still has the same leading digits. The precision is currently fixed at 2 significant figures to help maintain a low memory footprint. For example, you may see a percentile like 10999, which implies the true value is somewhere between 10000 and 10999 (inclusive).
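As an illustration of the rounding described in the note above, the following Python sketch maps a raw value to the largest value that shares its first two significant digits. The function name `round_up_sig_figs` is hypothetical, and this is not Rezolus's histogram implementation; it only reproduces the arithmetic behind the 10999 example.

```python
def round_up_sig_figs(value: int, sig_figs: int = 2) -> int:
    """Return the largest integer sharing the leading `sig_figs` digits of `value`."""
    if value < 0:
        raise ValueError("expected a non-negative value")
    digits = len(str(value))
    if digits <= sig_figs:
        # Small values are already exact at this precision.
        return value
    # Everything below the kept digits is filled with nines.
    scale = 10 ** (digits - sig_figs)
    return (value // scale) * scale + (scale - 1)


for raw in (10000, 10500, 10999, 11000):
    print(raw, "->", round_up_sig_figs(raw))
```

Every input from 10000 through 10999 maps to the same reported value, which is why a reported percentile only bounds the true value from above.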
Summary metrics for counters and gauges use a different strategy for percentile calculation, as we can hold the number of samples to calculate an exact percentile in memory.
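A minimal sketch of an exact percentile over samples held in memory is shown below. It assumes the nearest-rank method, which is one common definition; the text above does not say which method Rezolus uses, and the function name `exact_percentile` is hypothetical.

```python
import math


def exact_percentile(samples, percentile):
    """Nearest-rank percentile: the smallest sample with at least
    `percentile` percent of all samples at or below it."""
    if not samples:
        raise ValueError("no samples")
    if not 0 < percentile <= 100:
        raise ValueError("percentile must be in (0, 100]")
    ordered = sorted(samples)
    # Multiply before dividing to keep the rank computation exact for integers.
    rank = math.ceil(percentile * len(ordered) / 100)
    return ordered[rank - 1]


# Illustrative per-second rates; these numbers are made up.
rates = [15, 20, 35, 40, 50]
print(exact_percentile(rates, 50))
```

Unlike the histogram approach, the result is always one of the observed samples, so no rounding is involved.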
Provides telemetry around CPU usage and performance.
cpu/cstate/c0/time - nanoseconds spent in c0 state, Active Mode
cpu/cstate/c1/time - nanoseconds spent in c1 state, Auto Halt
cpu/cstate/c1e/time - nanoseconds spent in c1e state, Auto Halt + low frequency + low voltage
cpu/cstate/c2/time - nanoseconds spent in c2 state, temporary before c3 with memory paths still open
cpu/cstate/c3/time - nanoseconds spent in c3 state, L1/L2 flush + clocks off
cpu/cstate/c6/time - nanoseconds spent in c6 state, save core states before shutdown and PLL off
cpu/cstate/c7/time - nanoseconds spent in c7 state, c6 + LLC may flush
cpu/cstate/c8/time - nanoseconds spent in c8 state, c7 + LLC must flush
cpu/frequency - instantaneous cpu frequency in Hz
cpu/usage/guest - nanoseconds spent running a guest VM
cpu/usage/guestnice - nanoseconds spent running a low-priority guest VM
cpu/usage/idle - nanoseconds spent idle
cpu/usage/irq - nanoseconds spent handling interrupts
cpu/usage/nice - nanoseconds spent on lower-priority tasks
cpu/usage/softirq - nanoseconds spent handling soft interrupts
cpu/usage/steal - nanoseconds stolen by the hypervisor
cpu/usage/system - nanoseconds spent in kernel-space
cpu/usage/user - nanoseconds spent in user-space
cpu/bpu/branch - total branch instructions
cpu/bpu/miss - branch predictions resulting in miss
cpu/cache/access - total cache accesses
cpu/cache/miss - cache accesses resulting in miss
cpu/cycles - cpu cycles elapsed, may not be accurate with frequency scaling. consult processor documentation for details and consider using cpu/reference_cycles metric
cpu/dtlb/load/access - total dtlb loads
cpu/dtlb/load/miss - dtlb loads resulting in miss
cpu/dtlb/store/access - total dtlb stores
cpu/dtlb/store/miss - dtlb stores resulting in miss
cpu/instructions - instructions retired
cpu/reference_cycles - reference number of cpu cycles elapsed, may not be present on all processors. consult processor documentation
cpu/stalled_cycles/backend - cycles stalled waiting on backend, eg memory access
cpu/stalled_cycles/frontend - cycles stalled waiting on frontend, eg instructions
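The counters above are most useful as ratios over an interval. The sketch below derives instructions per cycle and two miss ratios from a pair of counter readings. The `derive` helper and the sample readings are hypothetical, and how the readings are obtained from Rezolus is outside the scope of this example.

```python
def ratio(numerator: int, denominator: int) -> float:
    """Divide two counter deltas, returning 0.0 when the denominator is zero."""
    return numerator / denominator if denominator else 0.0


def derive(before: dict, after: dict) -> dict:
    """Compute derived ratios from two readings of the same counters."""
    # Counters only grow, so the difference is the activity in the interval.
    delta = {name: after[name] - before[name] for name in after}
    return {
        "instructions_per_cycle": ratio(delta["cpu/instructions"], delta["cpu/cycles"]),
        "branch_miss_ratio": ratio(delta["cpu/bpu/miss"], delta["cpu/bpu/branch"]),
        "cache_miss_ratio": ratio(delta["cpu/cache/miss"], delta["cpu/cache/access"]),
    }


# Made-up readings taken at the start and end of an interval.
names = ["cpu/instructions", "cpu/cycles", "cpu/bpu/miss",
         "cpu/bpu/branch", "cpu/cache/miss", "cpu/cache/access"]
start = dict.fromkeys(names, 0)
end = dict(zip(names, [2000, 1000, 5, 100, 10, 40]))
print(derive(start, end))
```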
Provides system-wide telemetry for disk devices
disk/discard/bytes - bytes marked as unused on SSD devices
disk/discard/operations - total number of discards completed
disk/read/bytes - bytes read from disk devices
disk/read/operations - total number of reads completed
disk/write/bytes - bytes written to disk devices
disk/write/operations - total number of writes completed
disk/read/device_latency - latency distribution, in nanoseconds, waiting for disk to complete a read operation
disk/read/latency - end-to-end latency distribution, in nanoseconds, for read operations
disk/read/io_size - size distribution, in bytes, for read operations
disk/read/queue_latency - latency distribution, in nanoseconds, where read was waiting on the device queue
disk/write/device_latency - latency distribution, in nanoseconds, waiting for disk to complete a write operation
disk/write/io_size - size distribution, in bytes, for write operations
disk/write/latency - end-to-end latency distribution, in nanoseconds, for write operations
disk/write/queue_latency - latency distribution, in nanoseconds, where write was waiting on the device queue
Provides system-wide telemetry for EXT4 filesystems
ext4/fsync/latency - latency distribution, in nanoseconds, for fsync() on ext4 filesystems
ext4/open/latency - latency distribution, in nanoseconds, for open() on ext4 filesystems
ext4/read/latency - latency distribution, in nanoseconds, for read() on ext4 filesystems
ext4/write/latency - latency distribution, in nanoseconds, for write() on ext4 filesystems
Provides system-wide telemetry for IRQs
interrupt/local_timer - APIC interrupts which fire on a specific CPU as a result of a local timer
interrupt/machine_check_exception - interrupts caused by machine check exceptions
interrupt/network - interrupts for servicing network devices (NIC queues)
interrupt/nmi - Non-Maskable Interrupts
interrupt/node0/network - interrupts for servicing network devices which were handled on NUMA node 0
interrupt/node0/nvme - interrupts for servicing NVMe devices which were handled on NUMA node 0
interrupt/node0/total - total interrupts which were handled on NUMA node 0
interrupt/node1/network - interrupts for servicing network devices which were handled on NUMA node 1
interrupt/node1/nvme - interrupts for servicing NVMe devices which were handled on NUMA node 1
interrupt/node1/total - total interrupts which were handled on NUMA node 1
interrupt/nvme - interrupts for servicing NVMe queues
interrupt/performance_monitoring - interrupts generated when a performance counter overflows or PEBS interrupt threshold is reached
interrupt/rescheduling - interrupts used to notify a core to schedule a thread
interrupt/rtc - interrupts caused by the realtime clock
interrupt/spurious - interrupts which were marked spurious and not handled
interrupt/thermal_event - interrupts caused by thermal events, like throttling
interrupt/timer - interrupts related to the system timer (PIT/HPET)
interrupt/tlb_shootdowns - interrupts caused to trigger TLB shootdowns
interrupt/total - total interrupts
Provides telemetry to track MIT Kerberos ticket requests served by the krb5kdc binary. This is accomplished by attaching user-space probes to the following functions: finish_process_as_req, finish_dispatch_cache, and process_tgs_req. Each function exports a call count that is broken down by the resulting error code. Because the list of possible error codes is very large, only the first 30 error codes are exported. All other error code values are exported as UNKNOWN.
Each error code is reformatted to better fit metric naming standards: "KRB5KDC_ERR_BAD_PVNO" -> "bad_pvno"
krb5kdc/finish_process_as_req/{ERROR_CODE} - count of finish_process_as_req calls by error
krb5kdc/finish_dispatch_cache/{ERROR_CODE} - count of finish_dispatch_cache calls by error
krb5kdc/process_tgs_req/{ERROR_CODE} - count of process_tgs_req calls by error
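The renaming rule for error codes can be sketched as follows. This is an illustration rather than the sampler's actual code: the helper names are hypothetical, only the `KRB5KDC_ERR_` prefix from the example above is handled, and the UNKNOWN bucket for codes outside the exported set is not modeled.

```python
PREFIX = "KRB5KDC_ERR_"


def metric_label(error_code: str) -> str:
    """Drop the KRB5KDC_ERR_ prefix (if present) and lowercase the rest."""
    if error_code.startswith(PREFIX):
        error_code = error_code[len(PREFIX):]
    return error_code.lower()


def metric_name(function: str, error_code: str) -> str:
    """Build the full metric name for a probed function and an error code."""
    return f"krb5kdc/{function}/{metric_label(error_code)}"


print(metric_name("process_tgs_req", "KRB5KDC_ERR_BAD_PVNO"))
```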
Provides telemetry around memory usage, transparent huge-pages, huge-pages, compaction, NUMA access, etc.
memory/active/anon - the amount of anonymous and tmpfs/shmem memory, in bytes, that is in active use, or was in active use since the last time the system moved something to swap.
memory/active/file - the amount of file cache memory, in bytes, that is in active use, or was in active use since the last time the system reclaimed memory.
memory/active/total - the amount of memory, in bytes, that has been used more recently and is usually not reclaimed unless absolutely necessary.
memory/anon_hugepages - the total amount of memory, in bytes, used by huge pages that are not backed by files and are mapped into userspace page tables.
memory/anon_pages - the total amount of memory, in bytes, used by pages that are not backed by files and are mapped into userspace page tables.
memory/available - estimate of the amount of memory, in bytes, available on the system to allocate without swapping
memory/bounce - the amount of memory, in bytes, used for the block device "bounce buffers".
memory/buffers - the amount, in bytes, of temporary storage for raw disk blocks
memory/cached - the amount of physical RAM, in bytes, used as cache memory
memory/commit/committed - the total amount of memory, in bytes, estimated to complete the workload. This value represents the worst case scenario value, and also includes swap memory.
memory/commit/limit - total amount of memory, in bytes, currently available to be allocated on the system based on the overcommit ratio
memory/compact/daemon/free_scanned - the number of pages kcompactd has scanned to potentially free
memory/compact/daemon/migrate_scanned - the number of pages kcompactd has scanned to potentially migrate
memory/compact/daemon/wake - the number of times kcompactd has woken
memory/compact/fail - the number of compactions which fail to free a hugepage
memory/compact/free_scanned - the number of pages scanned to potentially free
memory/compact/isolated - the number of pages isolated by compaction
memory/compact/migrate_scanned - the number of pages scanned to potentially migrate
memory/compact/stall - the number of times processes stall to run compaction
memory/compact/success - the number of compactions resulting in successfully freeing a hugepage
memory/directmap/1G - the amount of memory, in bytes, mapped into kernel address space with 1 GB page mappings.
memory/directmap/2M - the amount of memory, in bytes, mapped into kernel address space with 2 MB page mappings.
memory/directmap/4k - the amount of memory, in bytes, mapped into kernel address space with 4 kB page mappings.
memory/dirty - the total amount of memory, in bytes, waiting to be written back to the disk.
memory/free - the amount of physical RAM, in bytes, left unused by the system
memory/hardware_corrupted - the amount of memory, in bytes, with physical memory corruption problems, identified by the hardware and set aside by the kernel so it does not get used.
memory/hugepage_size - the size of each hugepage unit, in bytes.
memory/hugepages/free - the total number of hugepages available for the system.
memory/hugepages/reserved - the number of unused huge pages reserved for hugetlbfs.
memory/hugepages/surplus - the number of surplus huge pages.
memory/hugepages/total - the total number of hugepages for the system.
memory/hugetlb
memory/inactive/anon - the amount of anonymous and tmpfs/shmem memory, in bytes, that is a candidate for eviction.
memory/inactive/file - the amount of file cache memory, in bytes, that is newly loaded from the disk, or is a candidate for reclaiming.
memory/inactive/total - the amount of memory, in bytes, that has been used less recently and is more eligible to be reclaimed for other purposes.
memory/kernel_stack - the amount of memory, in bytes, used by the kernel stack allocations done for each task in the system.
memory/mapped - the memory, in bytes, used for files that have been mmapped, such as libraries.
memory/mlocked - the total amount of memory, in bytes, that is not evictable because it is locked into memory by user programs.
memory/nfs_unstable - the amount, in bytes, of NFS pages sent to the server but not yet committed to the stable storage.
memory/numa/foreign - the number of bytes which had to be allocated on a remote node even though the allocation should have been local
memory/numa/hit - the number of bytes successfully allocated on the intended node
memory/numa/interleave - the number of bytes allocated on the remote node as intended by interleave policy
memory/numa/local - the number of bytes allocated on the node where the process was running at time of allocation
memory/numa/miss - the number of bytes which could not be allocated on the intended node
memory/numa/other - the number of bytes allocated on a node where the process was not running at time of allocation
memory/page_tables - the total amount of memory, in bytes, dedicated to the lowest page table level.
memory/shmem_hugepages - the number of hugepages which are used for shared memory allocated as transparent hugepages
memory/shmem_pmd_mapped - the number of hugepages which are used for application transparent hugepages
memory/shmem - the total amount of memory, in bytes, used by shared memory (shmem) and tmpfs.
memory/slab/reclaimable - the part of Slab that can be reclaimed, such as caches.
memory/slab/total - the total amount of memory, in bytes, used by the kernel to cache data structures for its own use.
memory/slab/unreclaimable - the part of Slab that cannot be reclaimed even when lacking memory.
memory/swap/cached - the amount of memory, in bytes, that has once been moved into swap, then back into the main memory, but still also remains in the swapfile. This saves I/O, because the memory does not need to be moved into swap again.
memory/swap/free - the total amount of swap free, in bytes.
memory/swap/total - the total amount of swap available, in bytes.
memory/thp/collapse_alloc - number of times a hugepage was successfully allocated to collapse multiple pages
memory/thp/collapse_alloc_failed - number of times the allocation of a hugepage failed when attempting to collapse multiple pages
memory/thp/deferred_split_page - number of times a page split was deferred by placing it on the split queue. This means the page is partially unmapped and splitting will free some memory
memory/thp/fault_alloc - the number of times a huge page was allocated to satisfy a page fault
memory/thp/fault_fallback - the number of times a page fault required a base page allocation following a failure allocating a huge page
memory/thp/split_page - the number of huge pages which have been split into base pages
memory/thp/split_page_failed - the number of times a huge page split failed
memory/total - total amount of usable RAM, in bytes, which is physical RAM minus a number of reserved bits and the kernel binary code
memory/unevictable - the amount of memory, in bytes, discovered by the pageout code, that is not evictable because it is locked into memory by user programs.
memory/vmalloc/chunk - the largest contiguous block of memory, in bytes, of available virtual address space.
memory/vmalloc/total - total amount of memory, in bytes, of total allocated virtual address space.
memory/vmalloc/used - total amount of memory, in bytes, of used virtual address space.
memory/writeback_temp - the amount of memory, in bytes, used by FUSE for temporary writeback buffers.
memory/writeback - the total amount of memory, in bytes, actively being written back to the disk.
Provides system-wide network telemetry
network/receive/bytes - number of bytes received on all network interfaces
network/receive/compressed - number of compressed packets received
network/receive/drops - number of received packets which were dropped by the device driver
network/receive/errors - number of receive errors detected by the device driver
network/receive/fifo - number of FIFO buffer errors on receive
network/receive/frame - number of packets received with framing errors
network/receive/multicast - number of multicast packets received
network/receive/packets - total number of packets received
network/transmit/bytes - number of bytes transmitted on all network interfaces
network/transmit/carrier - number of carrier losses detected by the device driver
network/transmit/collisions - number of collisions detected
network/transmit/compressed - number of compressed packets transmitted
network/transmit/drops - number of packets to transmit which were dropped by the device driver
network/transmit/errors - total number of errors when transmitting packets
network/transmit/fifo - number of FIFO buffer errors on transmit
network/transmit/packets - total number of packets transmitted
network/receive/size - size distribution, in bytes, of received packets
network/transmit/size - size distribution, in bytes, of transmitted packets
NTP sampler provides some basic stats about time synchronization via NTP.
NOTE: this sampler is currently not supported for musl toolchains
ntp/estimated_error - the current estimated error of the local clock in nanoseconds
ntp/maximum_error - the maximum error of the local clock in nanoseconds
Telemetry for Nvidia GPUs, collected by using the Nvidia Management Library
(NVML). Unlike other samplers, these stats are fully scoped to specific GPUs
within the system. Exported metrics will have the form: nvidia/gpu_[id]/...
where the id is the device identifier as reported by the NVML. The set of
metrics to collect uses the short form of the metric name, as provided below.
clock/sm/current - current streaming multiprocessor clock speed in MHz
clock/memory/current - current memory clock speed in MHz
decoder/utilization - video decoder utilization as a percentage
encoder/utilization - video encoder utilization as a percentage
energy/consumption - total energy consumption since boot in Joules
gpu/temperature - current GPU temperature in °C
cpu/utilization - GPU utilization as a percentage
memory/ecc/enabled - boolean (0 or 1) indicating if ECC is enabled
memory/ecc/dbe - count of double-bit errors (uncorrectable)
memory/ecc/sbe - count of single-bit errors (correctable)
memory/fb/free - framebuffer memory free in bytes
memory/fb/total - total framebuffer memory in bytes
memory/fb/used - framebuffer memory used in bytes
memory/retired/sbe - memory pages retired due to multiple single-bit errors
memory/retired/dbe - memory pages retired due to double-bit error
memory/retired/pending - boolean (0 or 1) indicating that memory pages are pending retirement
memory/utilization - memory copy utilization as a percentage
pcie/replay - count of PCIe replays. May indicate link issues.
pcie/rx/throughput - PCIe receive throughput in KB/s
pcie/tx/throughput - PCIe transmit throughput in KB/s
power/limit - enforced power limit in Watts
power/usage - current power usage in Watts
processes/compute - number of processes running in compute context
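To show how the short-form names relate to the exported per-GPU names, here is a small sketch. It assumes the device identifier is rendered as a plain integer after `gpu_`; the text above only gives the pattern nvidia/gpu_[id]/..., and the helper names are hypothetical.

```python
def nvidia_metric_name(gpu_id: int, short_name: str) -> str:
    """Prefix a short-form metric name with the per-GPU scope."""
    return f"nvidia/gpu_{gpu_id}/{short_name}"


def expand(gpu_ids, short_names):
    """List the exported names for every GPU and configured short name."""
    return [nvidia_metric_name(g, s) for g in gpu_ids for s in short_names]


for name in expand([0, 1], ["memory/fb/used", "power/usage"]):
    print(name)
```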
The page cache is a transparent cache for pages originating from secondary storage. Telemetry about page cache performance can be useful when tuning applications which rely on the page cache.
page_cache/hit - the number of times a read request was served from the page cache
page_cache/miss - the number of times a read request resulted in a page cache miss
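A common way to use these two counters is to compute a hit ratio over an interval. The sketch below does so from counter deltas; the function name `page_cache_hit_ratio` and the sample numbers are hypothetical.

```python
from typing import Optional


def page_cache_hit_ratio(hit_delta: int, miss_delta: int) -> Optional[float]:
    """Fraction of read requests served from the page cache in an interval.

    Returns None when there were no requests, since the ratio is undefined.
    """
    total = hit_delta + miss_delta
    if total == 0:
        return None
    return hit_delta / total


# Made-up deltas of page_cache/hit and page_cache/miss over one interval.
print(page_cache_hit_ratio(900, 100))
```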
Provides telemetry about Rezolus itself. This can be used to understand the runtime characteristics of Rezolus for various configurations.
rezolus/cpu/user - nanoseconds spent in user mode running Rezolus
rezolus/cpu/system - nanoseconds spent in system mode running Rezolus
rezolus/memory/virtual - total virtual memory allocated to Rezolus
rezolus/memory/resident - amount of memory actually used by Rezolus
Provides telemetry about the Linux scheduler. Provides insights into thread/process characteristics. The runqueue latency is useful when performing scheduler tuning or investigating potential interference between workloads.
scheduler/context_switches - number of context switches
scheduler/processes/created - number of processes created
scheduler/processes/running - number of processes currently running
scheduler/processes/blocked - number of processes currently blocked
scheduler/cpu_migrations - number of times processes have been migrated across CPUs
scheduler/runqueue/latency - the distribution of time that runnable tasks were waiting on the runqueue
Softnet telemetry provides a view into kernel packet processing.
softnet/processed - the total number of packets processed in the softnet layer
softnet/dropped - the number of packets dropped
softnet/time_squeezed - number of times that packet processing did not complete within the time slice
softnet/cpu_collision - collisions occurring obtaining device lock while transmitting
softnet/received_rps - number of times cpus woken up for received rps
softnet/flow_limit_count - number of times the flow limit count was reached
This sampler provides telemetry about TCP traffic and connections.
tcp/abort/failed - failed to send RST on abort due to memory pressure
tcp/abort/on_close - connections reset due to early user close
tcp/abort/on_data - connections reset due to unexpected data
tcp/abort/on_linger - connections reset after user close while in linger timeout
tcp/abort/on_memory - connections reset due to memory pressure or too many orphaned sockets
tcp/abort/on_timeout - connections reset due to timeout
tcp/receive/checksum_error - segments received with invalid checksum
tcp/receive/collapsed - segments collapsed in the receive queue
tcp/receive/error - total number of errors on receive
tcp/receive/listen_drops - number of SYNs to LISTEN sockets ignored
tcp/receive/listen_overflows - times the listen queue of a socket overflowed
tcp/receive/ofo_pruned - number of packets pruned from the out-of-order queue due to socket buffer overrun
tcp/receive/prune_called - number of packets pruned from the receive queue because of socket buffer overrun
tcp/receive/pruned - packets pruned from the receive queue
tcp/receive/segment - total number of segments received
tcp/syncookies/failed - number of invalid SYN cookies received
tcp/syncookies/received - number of SYN cookies received
tcp/syncookies/sent - number of SYN cookies sent
tcp/transmit/delayed_ack - number of delayed ACKs sent
tcp/transmit/reset - number of RSTs sent
tcp/transmit/retransmit - number of segments retransmitted
tcp/transmit/segment - number of segments transmitted
tcp/connect/latency - end-to-end latency, in nanoseconds, from an active outbound connect() until the socket is established
tcp/srtt - the smoothed round trip latency distribution, in nanoseconds
tcp/jitter - the median deviation of the smoothed round trip latency distribution, in nanoseconds
tcp/connection/accepted - number of connections accepted passively
tcp/connection/initiated - number of connections initiated actively
tcp/drop - number of packets dropped in the kernel TCP stack
tcp/tlp - number of Tail Loss Recovery Probes sent
tcp/transmit/retransmit_timeout - number of retransmit timeouts
tcp/receive/duplicate - number of duplicate TCP segments received.
tcp/receive/out_of_order - number of out of order TCP segments received.
udp/receive/datagrams - number of datagrams received
udp/receive/errors - number of errors on receive
udp/transmit/datagrams - number of datagrams transmitted
Provides telemetry about XFS filesystem performance.
xfs/fsync/latency - latency distribution, in nanoseconds, for fsync() on xfs filesystems
xfs/open/latency - latency distribution, in nanoseconds, for open() on xfs filesystems
xfs/read/latency - latency distribution, in nanoseconds, for read() on xfs filesystems
xfs/write/latency - latency distribution, in nanoseconds, for write() on xfs filesystems