
Default recurring task to Balance the Resource Placement breaks the state of a resource after node eviction #467

@Sebiee


Create a LINSTOR cluster with 4 nodes. Configure a resource group with --place-count 2. Lower the eviction and balance intervals so the behavior can be observed quickly. Everything is fine after eviction, as described in this post:

And no, I could not reproduce this issue anymore. I ran the following test:

4 nodes, whereas in one test the 4th node (echo) has no disk (only the DfltDisklessStorPool).

Test

linstor n c bravo
linstor n c charlie
linstor n c delta
linstor n c echo
linstor n l
linstor sp c lvm bravo lvmpool scratch
linstor sp c lvm charlie lvmpool scratch
linstor sp c lvm delta lvmpool scratch
linstor rd c rsc
linstor vd c rsc 1G
linstor r c bravo charlie rsc -s lvmpool
linstor c sp DrbdOptions/AutoEvictAfterTime 1                         # speed up eviction

# test is shutting down charlie(Satellite)
ssh root@charlie drbdadm down all
linstor --no-utf8 --no-color n l
+-------------------------------------------------------------------------------------------------+
| Node    | NodeType  | Addresses                  | State                                        |
|=================================================================================================|
| bravo   | SATELLITE | 192.168.1.110:3366 (PLAIN) | Online                                       |
| charlie | SATELLITE | 192.168.1.120:3366 (PLAIN) | OFFLINE (Auto-eviction: 2023-08-29 07:50:30) |
| delta   | SATELLITE | 192.168.1.130:3366 (PLAIN) | Online                                       |
| echo    | SATELLITE | 192.168.1.140:3366 (PLAIN) | Online                                       |
+-------------------------------------------------------------------------------------------------+
To cancel automatic eviction please consider the corresponding DrbdOptions/AutoEvict* properties on controller and / or node level
See 'linstor controller set-property --help' or 'linstor node set-property --help' for more details



linstor --no-utf8 --no-color sp l
+---------------------------------------------------------------------------------------------------------------------------------------------+
| StoragePool          | Node    | Driver   | PoolName | FreeCapacity | TotalCapacity | CanSnapshots | State   | SharedName                   |
|=============================================================================================================================================|
| DfltDisklessStorPool | bravo   | DISKLESS |          |              |               | False        | Ok      | bravo;DfltDisklessStorPool   |
| DfltDisklessStorPool | charlie | DISKLESS |          |              |               | False        | Warning | charlie;DfltDisklessStorPool |
| DfltDisklessStorPool | delta   | DISKLESS |          |              |               | False        | Ok      | delta;DfltDisklessStorPool   |
| DfltDisklessStorPool | echo    | DISKLESS |          |              |               | False        | Ok      | echo;DfltDisklessStorPool    |
| lvmpool              | bravo   | LVM      | scratch  |    18.99 GiB |     20.00 GiB | False        | Ok      | bravo;lvmpool                |
| lvmpool              | charlie | LVM      | scratch  |              |               | False        | Warning | charlie;lvmpool              |
| lvmpool              | delta   | LVM      | scratch  |    10.00 GiB |     10.00 GiB | False        | Ok      | delta;lvmpool                |
+---------------------------------------------------------------------------------------------------------------------------------------------+
WARNING:
Description:
    No active connection to satellite 'charlie'
Details:
    The controller is trying to (re-) establish a connection to the satellite. The controller stored the changes and as soon the satellite is connected, it will receive this update.



linstor --no-utf8 --no-color r l -a
+-------------------------------------------------------------------------------------------------+
| ResourceName | Node    | Port | Usage  | Conns               |      State | CreatedOn           |
|=================================================================================================|
| rsc          | bravo   | 7000 | Unused | Connecting(charlie) |   UpToDate | 2023-08-29 07:49:28 |
| rsc          | charlie | 7000 |        |                     |    Unknown | 2023-08-29 07:49:28 |
| rsc          | delta   | 7000 | Unused | Connecting(charlie) | TieBreaker | 2023-08-29 07:49:27 |
+-------------------------------------------------------------------------------------------------+



sleep 65.0s
linstor --no-utf8 --no-color n l
+------------------------------------------------------------+
| Node    | NodeType  | Addresses                  | State   |
|============================================================|
| bravo   | SATELLITE | 192.168.1.110:3366 (PLAIN) | Online  |
| charlie | SATELLITE | 192.168.1.120:3366 (PLAIN) | EVICTED |
| delta   | SATELLITE | 192.168.1.130:3366 (PLAIN) | Online  |
| echo    | SATELLITE | 192.168.1.140:3366 (PLAIN) | Online  |
+------------------------------------------------------------+



linstor --no-utf8 --no-color r l -a
+-------------------------------------------------------------------------------------+
| ResourceName | Node    | Port | Usage  | Conns |        State | CreatedOn           |
|=====================================================================================|
| rsc          | bravo   | 7000 | Unused | Ok    |     UpToDate | 2023-08-29 07:49:28 |
| rsc          | charlie | 7000 |        | Ok    |     INACTIVE | 2023-08-29 07:49:28 |
| rsc          | delta   | 7000 | Unused | Ok    | Inconsistent | 2023-08-29 07:49:27 |
| rsc          | echo    | 7000 | Unused | Ok    |   TieBreaker | 2023-08-29 07:50:46 |
+-------------------------------------------------------------------------------------+



sleep 5.0s
linstor --no-utf8 --no-color r l -a
+-------------------------------------------------------------------------------------------+
| ResourceName | Node    | Port | Usage  | Conns |              State | CreatedOn           |
|===========================================================================================|
| rsc          | bravo   | 7000 | Unused | Ok    |           UpToDate | 2023-08-29 07:49:28 |
| rsc          | charlie | 7000 |        | Ok    |           INACTIVE | 2023-08-29 07:49:28 |
| rsc          | delta   | 7000 | Unused | Ok    | SyncTarget(53.63%) | 2023-08-29 07:49:27 |
| rsc          | echo    | 7000 | Unused | Ok    |         TieBreaker | 2023-08-29 07:50:46 |
+-------------------------------------------------------------------------------------------+



sleep 5.0s
linstor --no-utf8 --no-color r l -a
+-----------------------------------------------------------------------------------+
| ResourceName | Node    | Port | Usage  | Conns |      State | CreatedOn           |
|===================================================================================|
| rsc          | bravo   | 7000 | Unused | Ok    |   UpToDate | 2023-08-29 07:49:28 |
| rsc          | charlie | 7000 |        | Ok    |   INACTIVE | 2023-08-29 07:49:28 |
| rsc          | delta   | 7000 | Unused | Ok    |   UpToDate | 2023-08-29 07:49:27 |
| rsc          | echo    | 7000 | Unused | Ok    | TieBreaker | 2023-08-29 07:50:46 |
+-----------------------------------------------------------------------------------+

Feel free to re-open this issue if you experience a different behavior or if I missed something.

Originally posted by @ghernadi in #236

However, once the balance task runs, the resource gets stuck in state DELETING on the node that went from diskless TieBreaker to diskful UpToDate. To reproduce, follow the snippet quoted above, but also set:

linstor controller sp BalanceResourcesInterval 20 # Run balance task every 20 sec
linstor controller sp BalanceResourcesGracePeriod 20

To pinpoint the issue, I set BalanceResourcesEnabled to false before the eviction and only re-enabled it afterwards. Example:

root@test-storage4:~# linstor --no-color --no-utf8 r l
+-------------------------------------------------------------------------------------------------+
| ResourceName | Node          | Layers       | Usage  | Conns |      State | CreatedOn           |
|=================================================================================================|
| linstor_db   | test-storage4 | DRBD,STORAGE | InUse  | Ok    |   UpToDate | 2025-11-07 10:37:22 |
| linstor_db   | test-storage5 | DRBD,STORAGE | Unused | Ok    |   UpToDate | 2025-11-07 10:37:22 |
| linstor_db   | test-storage6 | DRBD,STORAGE | Unused | Ok    | TieBreaker | 2025-11-07 10:37:22 |
+-------------------------------------------------------------------------------------------------+
root@test-storage4:~# linstor controller sp BalanceResourcesEnabled false
root@test-storage4:~# linstor controller lp | grep Balance
| BalanceResourcesEnabled                   | false                                                  |
| BalanceResourcesGracePeriod               | 20                                                     |
| BalanceResourcesInterval                  | 20                                                     |
root@test-storage4:~# ifdown eth0
# Switch to another node
root@test-storage6:~# linstor --no-color --no-utf8 r l
+---------------------------------------------------------------------------------------------------------------------+
| ResourceName | Node          | Layers       | Usage  | Conns                     |      State | CreatedOn           |
|=====================================================================================================================|
| linstor_db   | test-storage4 | DRBD,STORAGE |        |                           |    Unknown | 2025-11-07 10:37:22 |
| linstor_db   | test-storage5 | DRBD,STORAGE | InUse  | Connecting(test-storage4) |   UpToDate | 2025-11-07 10:37:22 |
| linstor_db   | test-storage6 | DRBD,STORAGE | Unused | Connecting(test-storage4) | TieBreaker | 2025-11-07 10:37:22 |
+---------------------------------------------------------------------------------------------------------------------+
# After 1 min, eviction triggered and linstor_db restored:
root@test-storage6:~# linstor --no-color --no-utf8 r l
+------------------------------------------------------------------------------------------------------+
| ResourceName | Node               | Layers       | Usage  | Conns |      State | CreatedOn           |
|======================================================================================================|
| linstor_db   | test-storage4      | DRBD,STORAGE |        | Ok    |   INACTIVE | 2025-11-07 10:37:22 |
| linstor_db   | test-storage5      | DRBD,STORAGE | InUse  | Ok    |   UpToDate | 2025-11-07 10:37:22 |
| linstor_db   | test-storage6      | DRBD,STORAGE | Unused | Ok    |   UpToDate | 2025-11-07 10:37:22 |
| linstor_db   | test-swarm-worker1 | DRBD,STORAGE | Unused | Ok    | TieBreaker | 2025-11-07 10:57:16 |
+------------------------------------------------------------------------------------------------------+
root@test-storage6:~# linstor controller sp BalanceResourcesEnabled true # Enable rebalancing to trigger the bug
root@test-storage6:~# sleep 20
root@test-storage6:~# linstor --no-color --no-utf8 r l
+----------------------------------------------------------------------------------------------------------------------------+
| ResourceName | Node               | Layers       | Usage | Conns                          |    State | CreatedOn           |
|============================================================================================================================|
| linstor_db   | test-storage4      | DRBD,STORAGE |       | Ok                             | INACTIVE | 2025-11-07 10:37:22 |
| linstor_db   | test-storage5      | DRBD,STORAGE | InUse | Connecting(test-swarm-worker1) | UpToDate | 2025-11-07 10:37:22 |
| linstor_db   | test-storage6      | DRBD,STORAGE |       | Connecting(test-swarm-worker1) | DELETING | 2025-11-07 10:37:22 |
| linstor_db   | test-swarm-worker1 | DRBD,STORAGE |       | Ok                             | DELETING | 2025-11-07 10:57:16 |
+----------------------------------------------------------------------------------------------------------------------------+
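To spot this state without reading the table by eye, the `r l` output could be filtered for rows in state DELETING. The following is only a sketch: `find_deleting` is a hypothetical helper name, and it assumes the `--no-color --no-utf8` table layout shown in this report (pipe-separated cells, a header row containing `ResourceName` and `State`).

```shell
# Print "resource node" for every row of a `linstor --no-color --no-utf8 r l`
# table (read from stdin) whose State column is DELETING.
# Assumption: the table layout matches the output shown in this report.
find_deleting() {
    awk -F'|' '
        # Header row: remember which field holds the State column.
        $2 ~ /ResourceName/ {
            for (i = 1; i <= NF; i++) {
                col = $i
                gsub(/^ +| +$/, "", col)
                if (col == "State") state_col = i
            }
            next
        }
        # Data rows: border lines have too few fields and are skipped.
        state_col && NF > state_col {
            state = $state_col
            gsub(/^ +| +$/, "", state)
            if (state == "DELETING") {
                rsc = $2
                node = $3
                gsub(/^ +| +$/, "", rsc)
                gsub(/^ +| +$/, "", node)
                print rsc, node
            }
        }
    '
}
```

It would be used as `linstor --no-color --no-utf8 r l | find_deleting`. For the last table above, it should list `linstor_db` on test-storage6 and on test-swarm-worker1.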

For now, disabling the auto-balancing task is fine for my use case, but I would expect this scenario to work with it enabled as well (the default configuration).
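The workaround could be wrapped in a small helper so that the property change is always followed by a check. This is a minimal sketch, not an official procedure: `disable_balance_task` and the `LINSTOR` override variable are hypothetical names, while the two client invocations are the ones already used in the transcript above.

```shell
# Command used to talk to the controller; override it for a dry run.
LINSTOR="${LINSTOR:-linstor}"

# Disable the balance task, then print the Balance* controller properties
# so the new value can be confirmed.
disable_balance_task() {
    $LINSTOR controller sp BalanceResourcesEnabled false || return 1
    $LINSTOR controller lp | grep Balance
}
```

Calling `disable_balance_task` on a node with a working client should print the same three `Balance*` rows as in the transcript, with `BalanceResourcesEnabled` set to `false`.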

Versions used:

root@test-storage6:~# /usr/share/linstor-server/bin/Satellite --version
LINSTOR, Module Satellite
Version:            1.32.3 (6dac06aed233f2c89ac7cc6b1185d6dce9ec74c4)
Build time:         2025-10-13T06:37:58+00:00 Log v2
Java Version:       17
Java VM:            Debian, Version 17.0.17+10-Debian-1deb12u1
Operating system:   Linux, Version 6.1.0-37-amd64
Environment:        amd64, 12 processors, 2048 MiB memory reserved for allocations


System components initialization in progress

LINSTOR Satellite 1.32.3
root@test-storage6:~# linstor controller version
linstor controller 1.32.3; GIT-hash: 6dac06aed233f2c89ac7cc6b1185d6dce9ec74c4
