Add allocation profile export and zleak utility for import #17576
Conversation
That was fast. I haven't had a chance to test it locally but I like where this is going.
Looks good. Can you add a basic test case to tests/functional/cli_root/zhack/?
Signed-off-by: Paul Dagnelie <[email protected]>
Sponsored-by: Klara, Inc.
Sponsored-by: Wasabi Technology, Inc.
Motivation and Context
When attempting to debug performance problems on large systems, one of the major factors that affect performance is free space fragmentation. This heavily affects the allocation process, which is an area of active development in ZFS. Unfortunately, fragmenting a large pool for testing purposes is time consuming; it usually involves filling the pool and then repeatedly overwriting data until the free space becomes fragmented, which can take many hours. And even if the time is available, artificial workloads rarely generate the same fragmentation patterns as the natural workloads they're attempting to mimic. Finally, it may also be difficult to source storage that is large enough to match what's being used at customer/production sites for budgetary or procurement reasons.
Description
The core idea of the solution is this: If we know what regions are allocated on the production system we're trying to mimic, we don't actually need to do the process that got us there. We can skip straight to the final state by doing raw allocations of the allocated regions on that system, with no data underlying them or block pointers pointing to them.
This patch has two parts. First, in zdb, we add the ability to export the full allocation map of the pool. It iterates over each vdev, printing every allocated segment in the ms_allocatable range tree. This can be done while the pool is online, though if the process takes long enough, the older TXGs we are reading from can start to get overwritten. A checkpoint is a good way to preserve the system state at a single point in time for analysis while the system is serving IO.
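For illustration, here is a minimal sketch of how such an export might be consumed on the import side, assuming a hypothetical one-segment-per-line `vdev offset size` text format (the format zdb actually emits in this patch may differ):

```python
# Hypothetical parser for an allocation-map export. The one-segment-per-line
# "vdev offset size" format is an assumption for illustration; the format zdb
# actually produces in this patch may differ.
from collections import defaultdict


def parse_allocation_profile(path):
    """Return {vdev_id: [(offset, size), ...]} from a text export."""
    segments = defaultdict(list)
    with open(path) as f:
        for line in f:
            line = line.strip()
            if not line or line.startswith("#"):
                continue
            vdev, offset, size = line.split()
            segments[int(vdev)].append((int(offset, 0), int(size, 0)))
    return segments


if __name__ == "__main__":
    profile = parse_allocation_profile("pool_allocs.txt")
    nsegs = sum(len(s) for s in profile.values())
    nbytes = sum(size for segs in profile.values() for _, size in segs)
    print(f"{nsegs} segments, {nbytes} bytes allocated")
```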
The second is a new utility called zleak (and its supporting library and kernel changes). This is a small Python program that invokes a new ioctl (via libzfs_core): zfs_ioc_raw_alloc. This ioctl takes in an nvlist of allocations to perform, and then allocates them. It does not currently store those allocations anywhere to make them reversible, and there is no corresponding raw_free ioctl (which would be extremely dangerous); this is an irreversible process, only intended for performance testing. The only way to reclaim the space afterwards is to destroy the pool or roll back to a checkpoint.
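As a rough sketch of how the import side might drive the new ioctl: the wrapper name `lzc_raw_alloc`, the `allocations`/`vdev`/`offset`/`size` nvlist keys, and the batching are all assumptions made for illustration; the patch defines the actual libzfs_core interface and nvlist layout.

```python
# Sketch only: lzc_raw_alloc and the nvlist layout below are hypothetical
# stand-ins for the interface this patch adds to libzfs_core and the kernel.
def build_alloc_nvlist(segments):
    """Pack {vdev: [(offset, size), ...]} into the nvlist-shaped structure
    handed to the raw-alloc ioctl (layout assumed for illustration)."""
    allocs = [
        {"vdev": vdev, "offset": offset, "size": size}
        for vdev, segs in segments.items()
        for offset, size in segs
    ]
    return {"allocations": allocs}


def leak_allocations(pool, segments, batch=10000):
    """Hand allocations to the kernel in batches so no single ioctl has to
    carry the entire profile."""
    # Hypothetical Python binding for the new raw-alloc ioctl.
    from libzfs_core import lzc_raw_alloc

    allocs = build_alloc_nvlist(segments)["allocations"]
    for i in range(0, len(allocs), batch):
        lzc_raw_alloc(pool, {"allocations": allocs[i:i + batch]})
```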
We verify that the system receiving the allocation profile has the same layout as the source system, to prevent any issues with violating ZFS's expectations or triggering assertions. This includes the number of vdevs, the number of metaslabs per vdev, and the metaslab size. There is a `-f` option to skip the check for the number of metaslabs per vdev, in which case allocations beyond the last metaslab are dropped.
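A minimal sketch of that layout check, assuming the source layout (vdev count, metaslabs per vdev, metaslab size) is recorded alongside the profile; the field names and exact behavior are assumptions, not zleak's actual implementation:

```python
# Sketch of the pre-import layout check; field names and details are assumed.
def check_layout(source, target, force=False):
    """Compare vdev count, metaslabs per vdev, and metaslab size.

    With force (-f), a mismatch in metaslabs per vdev is tolerated and the
    out-of-range allocations are dropped by filter_out_of_range() instead.
    Any other mismatch is always fatal.
    """
    if source["nvdevs"] != target["nvdevs"]:
        raise SystemExit("vdev count mismatch")
    if source["metaslab_size"] != target["metaslab_size"]:
        raise SystemExit("metaslab size mismatch")
    if source["metaslabs_per_vdev"] != target["metaslabs_per_vdev"] and not force:
        raise SystemExit(
            "metaslab count mismatch (use -f to drop out-of-range allocations)"
        )


def filter_out_of_range(segments, target):
    """Drop segments past the end of each target vdev (the -f behavior)."""
    limit = target["metaslabs_per_vdev"] * target["metaslab_size"]
    return {
        vdev: [(off, size) for off, size in segs if off + size <= limit]
        for vdev, segs in segments.items()
    }
```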
How Has This Been Tested?
Tested with the ZFS test suite to ensure no regressions. The new utility and functionality have been used to perform performance testing multiple times. We also manually verified the space map contents to ensure that the allocation map matches the original system.
Types of changes
Checklist:
Signed-off-by