Description
Create a LINSTOR cluster with 4 nodes. Configure a resource group with --place-count 2. Lower the eviction and balance intervals to observe the behavior. Eviction itself works as expected, as described in this post:
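A minimal sketch of the setup described above, as a shell script that can print its commands instead of running them (`DRY_RUN=1`). Node names, addresses, the `lvmpool`/`scratch` storage pool and the `rsc` resource are taken from the test output in this report; the resource group name `rg_test` and the `setup_cluster`/`run` helpers are hypothetical. The timing values mirror the ones used here, assuming AutoEvictAfterTime is in minutes and the two balance properties are in seconds.

```shell
#!/usr/bin/env bash
# Sketch of the test cluster from this report.
# DRY_RUN=1 prints the commands instead of running them.

run() {
    if [ -n "${DRY_RUN:-}" ]; then
        echo "$*"
    else
        "$@"
    fi
}

setup_cluster() {
    # Four satellites; echo gets no LVM pool, so it only has DfltDisklessStorPool.
    run linstor node create bravo 192.168.1.110
    run linstor node create charlie 192.168.1.120
    run linstor node create delta 192.168.1.130
    run linstor node create echo 192.168.1.140

    # LVM-backed pool "lvmpool" on volume group "scratch" for the three diskful nodes.
    local node
    for node in bravo charlie delta; do
        run linstor storage-pool create lvm "$node" lvmpool scratch
    done

    # Resource group placing 2 diskful replicas (rg_test is a hypothetical name).
    run linstor resource-group create rg_test --storage-pool lvmpool --place-count 2
    run linstor volume-group create rg_test
    run linstor resource-group spawn-resources rg_test rsc 1G

    # Shortened timings so eviction and rebalancing happen within minutes.
    run linstor controller set-property DrbdOptions/AutoEvictAfterTime 1   # minutes (assumed)
    run linstor controller set-property BalanceResourcesInterval 20        # seconds
    run linstor controller set-property BalanceResourcesGracePeriod 20     # seconds (assumed)
}
```

For example, `DRY_RUN=1 setup_cluster` lists the 13 commands; calling `setup_cluster` without it runs them against the cluster.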
And no, I could not reproduce this issue anymore. I ran the following test:
4 nodes, where in one test the 4th node (echo) has no disk (only the DfltDisklessStorPool).
Test:
linstor n c bravo
linstor n c charlie
linstor n c delta
linstor n c echo
linstor n l
linstor sp c lvm bravo lvmpool scratch
linstor sp c lvm charlie lvmpool scratch
linstor sp c lvm delta lvmpool scratch
linstor rd c rsc
linstor vd c rsc 1G
linstor r c bravo charlie rsc -s lvmpool
linstor c sp DrbdOptions/AutoEvictAfterTime 1 # speed up eviction
# test is shutting down charlie(Satellite)
ssh root@charlie drbdadm down all
linstor --no-utf8 --no-color n l
+-------------------------------------------------------------------------------------------------+
| Node    | NodeType  | Addresses                  | State                                        |
|=================================================================================================|
| bravo   | SATELLITE | 192.168.1.110:3366 (PLAIN) | Online                                       |
| charlie | SATELLITE | 192.168.1.120:3366 (PLAIN) | OFFLINE (Auto-eviction: 2023-08-29 07:50:30) |
| delta   | SATELLITE | 192.168.1.130:3366 (PLAIN) | Online                                       |
| echo    | SATELLITE | 192.168.1.140:3366 (PLAIN) | Online                                       |
+-------------------------------------------------------------------------------------------------+
To cancel automatic eviction please consider the corresponding DrbdOptions/AutoEvict* properties on controller and / or node level
See 'linstor controller set-property --help' or 'linstor node set-property --help' for more details
linstor --no-utf8 --no-color sp l
+---------------------------------------------------------------------------------------------------------------------------------------------+
| StoragePool          | Node    | Driver   | PoolName | FreeCapacity | TotalCapacity | CanSnapshots | State   | SharedName                   |
|=============================================================================================================================================|
| DfltDisklessStorPool | bravo   | DISKLESS |          |              |               | False        | Ok      | bravo;DfltDisklessStorPool   |
| DfltDisklessStorPool | charlie | DISKLESS |          |              |               | False        | Warning | charlie;DfltDisklessStorPool |
| DfltDisklessStorPool | delta   | DISKLESS |          |              |               | False        | Ok      | delta;DfltDisklessStorPool   |
| DfltDisklessStorPool | echo    | DISKLESS |          |              |               | False        | Ok      | echo;DfltDisklessStorPool    |
| lvmpool              | bravo   | LVM      | scratch  |    18.99 GiB |     20.00 GiB | False        | Ok      | bravo;lvmpool                |
| lvmpool              | charlie | LVM      | scratch  |              |               | False        | Warning | charlie;lvmpool              |
| lvmpool              | delta   | LVM      | scratch  |    10.00 GiB |     10.00 GiB | False        | Ok      | delta;lvmpool                |
+---------------------------------------------------------------------------------------------------------------------------------------------+
WARNING:
Description:
    No active connection to satellite 'charlie'
Details:
    The controller is trying to (re-) establish a connection to the satellite. The controller stored the changes and as soon the satellite is connected, it will receive this update.
linstor --no-utf8 --no-color r l -a
+-------------------------------------------------------------------------------------------------+
| ResourceName | Node    | Port | Usage  | Conns               | State      | CreatedOn           |
|=================================================================================================|
| rsc          | bravo   | 7000 | Unused | Connecting(charlie) | UpToDate   | 2023-08-29 07:49:28 |
| rsc          | charlie | 7000 |        |                     | Unknown    | 2023-08-29 07:49:28 |
| rsc          | delta   | 7000 | Unused | Connecting(charlie) | TieBreaker | 2023-08-29 07:49:27 |
+-------------------------------------------------------------------------------------------------+
sleep 65.0s
linstor --no-utf8 --no-color n l
+------------------------------------------------------------+
| Node    | NodeType  | Addresses                  | State   |
|============================================================|
| bravo   | SATELLITE | 192.168.1.110:3366 (PLAIN) | Online  |
| charlie | SATELLITE | 192.168.1.120:3366 (PLAIN) | EVICTED |
| delta   | SATELLITE | 192.168.1.130:3366 (PLAIN) | Online  |
| echo    | SATELLITE | 192.168.1.140:3366 (PLAIN) | Online  |
+------------------------------------------------------------+
linstor --no-utf8 --no-color r l -a
+-------------------------------------------------------------------------------------+
| ResourceName | Node    | Port | Usage  | Conns | State        | CreatedOn           |
|=====================================================================================|
| rsc          | bravo   | 7000 | Unused | Ok    | UpToDate     | 2023-08-29 07:49:28 |
| rsc          | charlie | 7000 |        | Ok    | INACTIVE     | 2023-08-29 07:49:28 |
| rsc          | delta   | 7000 | Unused | Ok    | Inconsistent | 2023-08-29 07:49:27 |
| rsc          | echo    | 7000 | Unused | Ok    | TieBreaker   | 2023-08-29 07:50:46 |
+-------------------------------------------------------------------------------------+
sleep 5.0s
linstor --no-utf8 --no-color r l -a
+-------------------------------------------------------------------------------------------+
| ResourceName | Node    | Port | Usage  | Conns | State              | CreatedOn           |
|===========================================================================================|
| rsc          | bravo   | 7000 | Unused | Ok    | UpToDate           | 2023-08-29 07:49:28 |
| rsc          | charlie | 7000 |        | Ok    | INACTIVE           | 2023-08-29 07:49:28 |
| rsc          | delta   | 7000 | Unused | Ok    | SyncTarget(53.63%) | 2023-08-29 07:49:27 |
| rsc          | echo    | 7000 | Unused | Ok    | TieBreaker         | 2023-08-29 07:50:46 |
+-------------------------------------------------------------------------------------------+
sleep 5.0s
linstor --no-utf8 --no-color r l -a
+-----------------------------------------------------------------------------------+
| ResourceName | Node    | Port | Usage  | Conns | State      | CreatedOn           |
|===================================================================================|
| rsc          | bravo   | 7000 | Unused | Ok    | UpToDate   | 2023-08-29 07:49:28 |
| rsc          | charlie | 7000 |        | Ok    | INACTIVE   | 2023-08-29 07:49:28 |
| rsc          | delta   | 7000 | Unused | Ok    | UpToDate   | 2023-08-29 07:49:27 |
| rsc          | echo    | 7000 | Unused | Ok    | TieBreaker | 2023-08-29 07:50:46 |
+-----------------------------------------------------------------------------------+
Feel free to re-open this issue if you experience a different behavior or if I missed something.
Originally posted by @ghernadi in #236
However, after the balance task runs, the resource gets stuck in the DELETING state on the node that went from diskless TieBreaker to diskful UpToDate (in the output below, the newly added TieBreaker also ends up in DELETING). To reproduce, follow the snippet quoted above, but also set:
linstor controller sp BalanceResourcesInterval 20 # Run balance task every 20 sec
linstor controller sp BalanceResourcesGracePeriod 20
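Putting the steps of this report together, the reproduction can be scripted roughly as follows. This is a sketch, assuming the cluster and the 1-minute AutoEvictAfterTime from the quoted test, root SSH access to the node being failed, and a network interface named eth0 as on test-storage4. `reproduce_stuck_deleting`, `run` and `FAILING_IFACE` are hypothetical names, and the sleep durations are guesses rather than values derived from LINSTOR.

```shell
#!/usr/bin/env bash
# Sketch of the reproduction sequence.
# DRY_RUN=1 prints the commands instead of running them.

run() {
    if [ -n "${DRY_RUN:-}" ]; then
        echo "$*"
    else
        "$@"
    fi
}

reproduce_stuck_deleting() {
    local failing_node=$1
    local iface=${FAILING_IFACE:-eth0}   # assumed interface name

    # Short balance timings from this report; keep the balance task off for now.
    run linstor controller set-property BalanceResourcesInterval 20
    run linstor controller set-property BalanceResourcesGracePeriod 20
    run linstor controller set-property BalanceResourcesEnabled false

    # Simulate a node failure by taking its network down in the background,
    # so the ssh session can return before the link disappears.
    run ssh "root@${failing_node}" "nohup ifdown ${iface} >/dev/null 2>&1 &"

    # AutoEvictAfterTime is 1 minute; wait longer so the replacement replica appears.
    run sleep 90
    run linstor --no-color --no-utf8 resource list

    # Re-enable balancing and wait for at least one balance interval.
    run linstor controller set-property BalanceResourcesEnabled true
    run sleep 30
    run linstor --no-color --no-utf8 resource list
}
```

Running `DRY_RUN=1 reproduce_stuck_deleting test-storage4` prints the sequence without touching the cluster, which makes it easy to review before running it for real.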
To pinpoint the issue, I first set BalanceResourcesEnabled to false and re-enabled it only after the eviction had completed. Example:
root@test-storage4:~# linstor --no-color --no-utf8 r l
+-------------------------------------------------------------------------------------------------+
| ResourceName | Node          | Layers       | Usage  | Conns | State      | CreatedOn           |
|=================================================================================================|
| linstor_db   | test-storage4 | DRBD,STORAGE | InUse  | Ok    | UpToDate   | 2025-11-07 10:37:22 |
| linstor_db   | test-storage5 | DRBD,STORAGE | Unused | Ok    | UpToDate   | 2025-11-07 10:37:22 |
| linstor_db   | test-storage6 | DRBD,STORAGE | Unused | Ok    | TieBreaker | 2025-11-07 10:37:22 |
+-------------------------------------------------------------------------------------------------+
root@test-storage4:~# linstor controller sp BalanceResourcesEnabled false
root@test-storage4:~# linstor controller lp | grep Balance
| BalanceResourcesEnabled | false |
| BalanceResourcesGracePeriod | 20 |
| BalanceResourcesInterval | 20 |
root@test-storage4:~# ifdown eth0
# Switch to another node
root@test-storage6:~# linstor --no-color --no-utf8 r l
+---------------------------------------------------------------------------------------------------------------------+
| ResourceName | Node          | Layers       | Usage  | Conns                     | State      | CreatedOn           |
|=====================================================================================================================|
| linstor_db   | test-storage4 | DRBD,STORAGE |        |                           | Unknown    | 2025-11-07 10:37:22 |
| linstor_db   | test-storage5 | DRBD,STORAGE | InUse  | Connecting(test-storage4) | UpToDate   | 2025-11-07 10:37:22 |
| linstor_db   | test-storage6 | DRBD,STORAGE | Unused | Connecting(test-storage4) | TieBreaker | 2025-11-07 10:37:22 |
+---------------------------------------------------------------------------------------------------------------------+
# After about 1 minute, eviction is triggered and linstor_db is restored:
root@test-storage6:~# linstor --no-color --no-utf8 r l
+------------------------------------------------------------------------------------------------------+
| ResourceName | Node               | Layers       | Usage  | Conns | State      | CreatedOn           |
|======================================================================================================|
| linstor_db   | test-storage4      | DRBD,STORAGE |        | Ok    | INACTIVE   | 2025-11-07 10:37:22 |
| linstor_db   | test-storage5      | DRBD,STORAGE | InUse  | Ok    | UpToDate   | 2025-11-07 10:37:22 |
| linstor_db   | test-storage6      | DRBD,STORAGE | Unused | Ok    | UpToDate   | 2025-11-07 10:37:22 |
| linstor_db   | test-swarm-worker1 | DRBD,STORAGE | Unused | Ok    | TieBreaker | 2025-11-07 10:57:16 |
+------------------------------------------------------------------------------------------------------+
root@test-storage6:~# linstor controller sp BalanceResourcesEnabled true # Enable rebalancing to trigger the bug
root@test-storage6:~# sleep 20
root@test-storage6:~# linstor --no-color --no-utf8 r l
+----------------------------------------------------------------------------------------------------------------------------+
| ResourceName | Node               | Layers       | Usage  | Conns                          | State    | CreatedOn           |
|============================================================================================================================|
| linstor_db   | test-storage4      | DRBD,STORAGE |        | Ok                             | INACTIVE | 2025-11-07 10:37:22 |
| linstor_db   | test-storage5      | DRBD,STORAGE | InUse  | Connecting(test-swarm-worker1) | UpToDate | 2025-11-07 10:37:22 |
| linstor_db   | test-storage6      | DRBD,STORAGE |        | Connecting(test-swarm-worker1) | DELETING | 2025-11-07 10:37:22 |
| linstor_db   | test-swarm-worker1 | DRBD,STORAGE |        | Ok                             | DELETING | 2025-11-07 10:57:16 |
+----------------------------------------------------------------------------------------------------------------------------+
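To spot this state in scripts (for example right after re-enabling the balance task), the resource list can be scanned for rows stuck in DELETING. A minimal sketch, assuming the `--no-color --no-utf8` ASCII table layout shown in this report; `resources_in_state` is a hypothetical helper name.

```shell
#!/usr/bin/env bash
# Print "<resource> <node>" for every row of a LINSTOR ASCII resource table (read from stdin)
# whose State column equals the requested state (DELETING by default).
resources_in_state() {
    local wanted=${1:-DELETING}
    awk -F'|' -v wanted="$wanted" '
        function trim(s) {
            gsub(/^[ \t]+|[ \t]+$/, "", s)
            return s
        }
        # Header row: remember which columns hold the resource name, node and state.
        $2 ~ /ResourceName/ {
            for (i = 2; i < NF; i++) {
                h = trim($i)
                if (h == "ResourceName") rc = i
                else if (h == "Node") nc = i
                else if (h == "State") sc = i
            }
            next
        }
        # Data rows start with a pipe; skip the "|===|" separator (borders start with "+").
        rc && /^[|]/ && !/^[|]=/ {
            if (trim($sc) == wanted) print trim($rc), trim($nc)
        }
    '
}
```

Usage: `linstor --no-color --no-utf8 resource list | resources_in_state DELETING`. Locating columns by header name lets the same helper read both the quoted test's output (with a Port column) and the output above (with a Layers column). For anything beyond a quick check, LINSTOR's machine-readable output (`linstor -m ...`) is likely a sturdier input than the ASCII table.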
For now, disabling the auto-balancing task is acceptable for my use case, but I would expect this scenario to work with it enabled as well, since that is the default configuration.
Versions used:
root@test-storage6:~# /usr/share/linstor-server/bin/Satellite --version
LINSTOR, Module Satellite
Version: 1.32.3 (6dac06aed233f2c89ac7cc6b1185d6dce9ec74c4)
Build time: 2025-10-13T06:37:58+00:00 Log v2
Java Version: 17
Java VM: Debian, Version 17.0.17+10-Debian-1deb12u1
Operating system: Linux, Version 6.1.0-37-amd64
Environment: amd64, 12 processors, 2048 MiB memory reserved for allocations
System components initialization in progress
LINSTOR Satellite 1.32.3
root@test-storage6:~# linstor controller version
linstor controller 1.32.3; GIT-hash: 6dac06aed233f2c89ac7cc6b1185d6dce9ec74c4