Add Intel Xe GPU driver support #1457
e5e3853 to 1f8f375
Pull request overview
This PR adds support for Intel GPUs using the Xe kernel driver, which is required for newer Intel hardware like Lunar Lake and Battlemage that don't use the legacy i915 driver.
Changes:
- Added minimal `xe_drm.h` header with UAPI definitions for interfacing with the Xe driver
- Implemented Xe namespace with PMU-based GPU monitoring using perf events
- Added dynamic driver detection and PMU device discovery to support both i915 and Xe drivers
- Routes collection to appropriate code path based on detected driver type
Reviewed changes
Copilot reviewed 2 out of 2 changed files in this pull request and generated 14 comments.
| File | Description |
|---|---|
| src/linux/xe_drm.h | New header file containing minimal Xe DRM UAPI definitions (structs and ioctls) for device query and engine enumeration |
| src/linux/btop_collect.cpp | Added Xe namespace with PMU-based monitoring, dynamic driver detection, PMU device discovery, and routing logic to switch between i915 and Xe code paths |
Comments suppressed due to low confidence (1)
src/linux/btop_collect.cpp:2103
- Missing free() call for gpu_path on error path. If get_intel_device_id() returns null, the function returns false without freeing gpu_path that was allocated by find_intel_gpu_dir().
char *gpu_device_id = get_intel_device_id(gpu_path);
if (!gpu_device_id) {
Logger::debug("Failed to find Intel GPU device ID, Intel GPUs will not be detected");
return false;
Add support for Intel GPUs using the new Xe kernel driver, which is required for newer hardware like Lunar Lake. This addresses issue aristocratos#1407.

Implementation approach (per maintainer feedback on PR aristocratos#1408):
- Add Xe namespace in btop_collect.cpp (does not modify IGT files)
- Add minimal xe_drm.h header with UAPI definitions from Linux kernel
- Detect driver type (xe vs i915) and route to appropriate code path
- Use Xe PMU perf events for GPU utilization and clock speed

Supported metrics for Xe driver:
- GPU utilization (via engine-active-ticks/engine-total-ticks)
- GPU clock speed (via gt-actual-frequency)

Also fixes dynamic PMU device discovery for discrete Intel GPUs (addresses issue aristocratos#938) by checking for device-specific PMU names like 'xe_0000_03_00.0' before falling back to generic 'xe' or 'i915'.

Closes aristocratos#1407
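The lookup order described in this commit (device-specific PMU name first, then the generic driver name) could be sketched roughly as follows. This is an illustration, not the PR's code: `pmu_name_candidates` and `find_pmu` are hypothetical names, and the rule for building the device-specific name (driver prefix plus the PCI slot with `:` replaced by `_`) is inferred from the `xe_0000_03_00.0` example above.

```cpp
#include <algorithm>
#include <filesystem>
#include <optional>
#include <string>
#include <vector>

// Hypothetical helper: PMU names to probe, most specific first.
// Assumption: the device-specific name is "<driver>_<pci slot with ':' -> '_'>".
std::vector<std::string> pmu_name_candidates(const std::string& driver,
                                             const std::string& pci_slot) {
    std::string slot = pci_slot;
    std::replace(slot.begin(), slot.end(), ':', '_');
    return {driver + "_" + slot, driver};
}

// Hypothetical helper: return the first candidate that exists under `base`
// (on Linux, PMUs are listed under /sys/bus/event_source/devices).
std::optional<std::string> find_pmu(const std::filesystem::path& base,
                                    const std::string& driver,
                                    const std::string& pci_slot) {
    for (const auto& name : pmu_name_candidates(driver, pci_slot)) {
        std::error_code ec;  // non-throwing overload; errors count as "not found"
        if (std::filesystem::exists(base / name, ec)) return name;
    }
    return std::nullopt;
}
```

A caller would try `find_pmu("/sys/bus/event_source/devices", "xe", slot)` and treat an empty result as "no PMU available for this device".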
- Fix critical heap corruption: remove free() on static buffer from find_intel_gpu_dir()
- Fix GPU clock unit: convert Hz to MHz to match other drivers
- Fix MAX_GPU_CLOCK overflow: change from 10e9 Hz to 10000 MHz
- Add bounds validation for num_engines to prevent OOB access
- Add first-sample baselining to prevent initial utilization spike
- Add error handling for ioctl PERF_EVENT_IOC_ENABLE
- Add error handling for clock_gettime in init() and collect()
- Add group_fd validation before assignment
- Add dt minimum clamping to prevent division issues
- Add stoull exception handling
- Fix memory leak: call free_engines() when pmu_init fails
- Replace magic number 14 with strlen(PCI_SLOT_PREFIX)
- Use ull suffix and static_cast for type safety in build_config()
The PMU gt-actual-frequency event requires complex time-weighted calculation that wasn't working correctly (always showed 0 MHz). Switch to reading frequency directly from sysfs:

/sys/class/drm/cardX/device/tile0/gtN/freq0/cur_freq

This matches how nvtop reads Xe GPU frequency and provides accurate real-time clock speed values.
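As a rough illustration of the sysfs approach, the sketch below builds the path quoted in this commit and reads one integer from it. The helper names are hypothetical, and it assumes the file holds a single unsigned integer (taken here to be the current frequency in MHz); the PR's actual parsing may differ.

```cpp
#include <fstream>
#include <optional>
#include <string>

// Hypothetical helper: read one unsigned integer from a sysfs-style file.
// Returns std::nullopt if the file cannot be opened or does not start with a number.
std::optional<unsigned long long> read_sysfs_u64(const std::string& path) {
    std::ifstream f(path);
    unsigned long long value = 0;
    if (f >> value) return value;
    return std::nullopt;
}

// Hypothetical helper: fill in the cardX / gtN placeholders of the path above.
std::string xe_cur_freq_path(int card, int gt) {
    return "/sys/class/drm/card" + std::to_string(card) +
           "/device/tile0/gt" + std::to_string(gt) + "/freq0/cur_freq";
}
```

For example, `read_sysfs_u64(xe_cur_freq_path(0, 0))` yields a value only when that file exists and is readable, so a caller can fall back to reporting no clock speed otherwise.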
Restore original 2-tab indentation that was accidentally changed to 1-tab in previous commit.
- Replace PMU-based engine-active-ticks with sysfs idle_residency_ms (works on Battlemage without CAP_PERFMON, fixes 0% utilization)
- Add DRM_XE_DEVICE_QUERY_MEM_REGIONS for VRAM usage reporting
- Implement GT separation: RC (Render/Compute) and MC (Media) tracking for architectures with split GT layout (Lunar Lake, Battlemage)
- Update UI to show RC/MC labels instead of ENC/DEC when gt_utilization is enabled, with separate graphs for each GT type

Tested on: Lunar Lake (Core Ultra 5 228V)

Fixes: aristocratos#1407 (partial - needs Battlemage testing)
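The idea behind residency-based utilization is that the busy share of an interval is whatever was not spent idle: busy % = 100 × (1 − Δidle / Δt). A minimal sketch under that assumption follows; `gt_busy_percent` is a hypothetical name, and power gating and smoothing are deliberately left out.

```cpp
#include <algorithm>
#include <cstdint>
#include <optional>

// Hypothetical helper: busy percentage for one GT over one sampling interval.
// idle_prev_ms / idle_now_ms are two readings of an idle residency counter (ms),
// dt_ms is the wall-clock time between them.
// Returns std::nullopt when the counter went backwards (reset or wrap).
std::optional<double> gt_busy_percent(uint64_t idle_prev_ms,
                                      uint64_t idle_now_ms,
                                      double dt_ms) {
    if (idle_now_ms < idle_prev_ms) return std::nullopt;
    dt_ms = std::max(dt_ms, 1.0);  // clamp to avoid dividing by ~0
    const double idle_fraction =
        static_cast<double>(idle_now_ms - idle_prev_ms) / dt_ms;
    return std::clamp(100.0 * (1.0 - idle_fraction), 0.0, 100.0);
}
```

With 750 ms of idle time in a 1000 ms interval this gives 25 % busy. Note that an idle counter that does not advance reads as 100 % busy here, which is exactly the ambiguity a later commit addresses.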
- Add first_sample flag to skip first gtidle calculation (fixes 100% spike on startup)
- Add pci.ids database lookup for accurate GPU product names (e.g. 'Intel Arc B580' instead of 'Intel Battlemage (Gen20)')
- Fallback to codename-based naming if pci.ids lookup fails
- Read idle_status sysfs to detect power gating state (gt-c6 vs gt-c0)
- When idle counter doesn't advance AND GT is power-gated: report 0% (not 100%)
- When idle counter doesn't advance AND GT is active: report 100% (real load)
- Add EMA smoothing (alpha=0.3) to reduce transient spikes from compositor
- Handle counter wrap/reset by preserving previous smoothed value

This fixes false 100% utilization spikes that occurred when the GPU entered power gating (RC6/MC6) and the idle_residency_ms counter stopped advancing, which was incorrectly interpreted as 100% busy.
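The decision rules listed in this commit can be condensed into one update function. The sketch below is an approximation under stated assumptions: `GtState` and `update_gt_utilization` are hypothetical names, `idle_status` is assumed to arrive already trimmed (no trailing newline), and the first sample is used only as a baseline.

```cpp
#include <algorithm>
#include <cstdint>
#include <string>

// Hypothetical per-GT state carried between samples.
struct GtState {
    uint64_t prev_idle_ms = 0;
    double smoothed = 0.0;     // EMA-smoothed utilization in percent
    bool first_sample = true;
};

// Hypothetical update step: returns the smoothed utilization in percent.
double update_gt_utilization(GtState& s, uint64_t idle_ms, double dt_ms,
                             const std::string& idle_status, double alpha = 0.3) {
    if (s.first_sample) {              // baseline only, avoids a startup spike
        s.prev_idle_ms = idle_ms;
        s.first_sample = false;
        return s.smoothed;
    }
    if (idle_ms < s.prev_idle_ms) {    // counter wrap/reset: keep previous value
        s.prev_idle_ms = idle_ms;
        return s.smoothed;
    }
    const uint64_t d_idle = idle_ms - s.prev_idle_ms;
    s.prev_idle_ms = idle_ms;

    double raw = 0.0;
    if (d_idle == 0) {
        // Counter did not advance: power-gated means idle, otherwise real load.
        raw = (idle_status == "gt-c6") ? 0.0 : 100.0;
    } else {
        const double idle_fraction = static_cast<double>(d_idle) / std::max(dt_ms, 1.0);
        raw = std::clamp(100.0 * (1.0 - idle_fraction), 0.0, 100.0);
    }
    s.smoothed = alpha * raw + (1.0 - alpha) * s.smoothed;  // EMA
    return s.smoothed;
}
```

With alpha = 0.3, a single fully busy sample after an idle period moves the reported value to about 30 % rather than jumping to 100 %, which is the spike damping the commit aims for.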
- Fix indentation (2 tabs -> 3 tabs in structs/globals)
- Replace !, &&, || with not, and, or operators
- Add comments explaining DRM ioctl patterns, EMA smoothing, and power gating detection logic
- Wrap long lines for readability
- Refactor Intel namespace to support multiple GPUs via GpuInstance struct
- Add discover_intel_gpus() to find all Intel GPUs via sysfs vendor ID
- Refactor Xe namespace with per-GPU state (states vector, gpu_index params)
- Update Intel::init/shutdown/collect to iterate over gpu_instances
- Add has_pmu_permissions() to prevent crash from assert in i915 PMU code
- Add empty gpus vector early-return defense in Gpu::collect()
- Fix division by zero guards and typo (mem_total -> pwr_total)
- Initialize gpu-vram-totals and gpu-pwr-totals in Xe first_sample block

Fixes: Multi-GPU not detected (Issue aristocratos#1407)
Fixes: Crash without sudo (Aborted core dumped)
178426b to c4341c0
Implement fdinfo-based GPU utilization measurement for Intel Xe GPUs:
- Add FdinfoCycles struct and collect_fdinfo_cycles function
- Parse /proc/*/fdinfo/* for drm-cycles-rcs/vcs data
- Use client-id deduplication to prevent double-counting
- Apply EMA smoothing for stable readings
- Fall back to gtidle when fdinfo unavailable

This provides more accurate utilization data compared to residency-based gtidle measurements.
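The client-id deduplication step can be illustrated with a small parser over the text of one fdinfo file. This is a sketch and not the PR's `collect_fdinfo_cycles`: `FdinfoTotals` and `accumulate_fdinfo` are hypothetical names, and it assumes the `key:<whitespace>value` layout with `drm-client-id` and `drm-cycles-<engine>` keys. Several file descriptors (possibly in different processes) can refer to the same DRM client, so each client is counted once.

```cpp
#include <cstdint>
#include <istream>
#include <map>
#include <set>
#include <sstream>
#include <string>

// Hypothetical accumulator across all fdinfo files seen in one collection pass.
struct FdinfoTotals {
    std::map<std::string, uint64_t> cycles;  // engine name ("rcs", "vcs", ...) -> cycles
    std::set<uint64_t> seen_clients;         // drm-client-id values already counted
};

// Hypothetical helper: parse one fdinfo file and add its cycles once per client.
void accumulate_fdinfo(std::istream& in, FdinfoTotals& totals) {
    const std::string prefix = "drm-cycles-";
    std::map<std::string, uint64_t> local;
    uint64_t client_id = 0;
    bool have_id = false;

    std::string line;
    while (std::getline(in, line)) {
        const auto colon = line.find(':');
        if (colon == std::string::npos) continue;
        const std::string key = line.substr(0, colon);
        std::istringstream value_stream(line.substr(colon + 1));
        uint64_t value = 0;
        if (not (value_stream >> value)) continue;  // non-numeric value, e.g. driver name

        if (key == "drm-client-id") {
            client_id = value;
            have_id = true;
        } else if (key.rfind(prefix, 0) == 0) {     // key starts with "drm-cycles-"
            local[key.substr(prefix.size())] += value;
        }
    }
    // Skip files without a client id and clients that were already counted.
    if (not have_id or not totals.seen_clients.insert(client_id).second) return;
    for (const auto& [engine, count] : local) totals.cycles[engine] += count;
}
```

Utilization would then come from the difference in summed cycles between two passes, with the gtidle path as a fallback when no fdinfo data is found.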
Summary
This PR adds support for Intel GPUs using the Xe kernel driver, which is required for newer hardware like Lunar Lake, Battlemage, and other recent Intel GPUs.
Closes #1407
Changes
- Add `Xe` namespace in `btop_collect.cpp` with PMU-based GPU monitoring
- Add minimal `xe_drm.h` header with UAPI definitions from Linux kernel

Implementation Approach
Per maintainer feedback on PR #1408, this implementation:
- Does not modify `src/linux/intel_gpu_top/` (IGT files)
- Adds the Xe namespace within the `btop_collect.cpp` structure

Supported Metrics (Xe driver)
- GPU utilization via `engine-active-ticks`/`engine-total-ticks` PMU events
- GPU clock speed via `gt-actual-frequency` PMU event

Testing
- Build with `make GPU_SUPPORT=true`

Technical Notes
- Uses `perf_event_open()` syscall directly for PMU counter access
- Queries the device and enumerates engines via the `DRM_IOCTL_XE_DEVICE_QUERY` ioctl