vdsm: get gluster volume info from any gluster peer #412

josgutie · 2024-03-20T12:59:22Z

The function _get_gluster_volinfo query the glusterfs volume info the the storage server, this is translated to the gluster client adding the parameter --remote-host which limits the query to one server, so we are converting the storage server as a single point of failure, if it is not available, it can led to cluster outtage. The proposed changed let the cluster cli to use any available gluster peer.

nirs

Why did you send a new PR when we have #408?

You need to answer my questions in #408. Please close this PR so we can continue the review on the existing PR.

josgutie · 2024-03-20T13:46:58Z

Why did you send a new PR when we have #408?

You need to answer my questions in #408. Please close this PR so we can continue the review on the existing PR.

Because the PR was bad formed, and I guess that it's more clean this PR.

Here is your question in the other PR,

I think this was added years ago because not specifying the server could be a problem. Not sure that the gluster cluster can cope with a case when one of the server is down.

Gluster cluster can cope with one of the server down, the issue is triggering because adding the --remote-host to the gluster cli limits the gluster query to this host, avoiding querying to the backup servers. Maybe it's a mechanism to avoid to register not updated info.

_Please check git history to understand when and why self.volfileserver was added.

In the git history the function was always called with the sef._volfileserver, as you can see here 907560b.

Also how did you test this change? Please describe what was tested - how you simulated the issue and tested that this change solves the issue, and how you tested that this change does not cause regressions in other flows.

Regarding the tests, I will describe in a new comment if you don't mind.

nirs · 2024-03-20T14:01:29Z

@josgutie ok lets continue here, but please do not close PRs and submit new one like this. If there was an issue with the previous PR it can be easily fixed by pushing a new version.

josgutie · 2024-03-21T11:53:14Z

Environment:
RHHI:
- 1.8
- ovirt-engine:
  - 4.5.3.10
- RHVH:
  - 4.5.3.10
  - vdsm:
    - 4.50.3.9-1
  - vdsm-gluster:
    - 4.50.3.9-1
  - glusterfs:
    - 6.0-63
SPM: rhhi02.storage01.lab
Gluster Storage domains configuration:

$ /usr/share/ovirt-engine/dbscripts/engine-psql.sh -c "SELECT connection, mount_options FROM storage_server_connections;"
connection | mount_options
-------------------------------+------------------------------------------------------------------
rhhi01.storage01.lab:/data | backup-volfile-servers=rhhi01.storage01.lab:rhhi02.storage01.lab
rhhi03.storage01.lab:/engine | backup-volfile-servers=rhhi01.storage01.lab:rhhi02.storage01.lab
rhhi03.storage01.lab:/vmstore | backup-volfile-servers=rhhi01.storage01.lab:rhhi02.storage01.lab

$ grep -E "storage|mnt_options" /etc/ovirt-hosted-engine/hosted-engine.conf
storage=rhhi03.storage01.lab:/engine
mnt_options=backup-volfile-servers=rhhi01.storage01.lab:rhhi02.storage01.lab

$ df -h | grep glusterSD
rhhi03.storage01.lab:/engine 100G 14G 87G 14% /rhev/data-center/mnt/glusterSD/rhhi03.storage01.lab:_engine
rhhi03.storage01.lab:/vmstore 150G 2.6G 148G 2% /rhev/data-center/mnt/glusterSD/rhhi03.storage01.lab:_vmstore
rhhi01.storage01.lab:/data 50G 908M 50G 2% /rhev/data-center/mnt/glusterSD/rhhi01.storage01.lab:_data

josgutie · 2024-03-21T13:26:48Z

Issue:
- Put on maintenance mode data storage domain.
- Shutdown rhhi01 host.
- data storage domain can't be activated.
Here are the tests completed and their result after applying the patch.
- Scenario of the issue:
  - Result: SD data can be activated.
- Environment power off:
  - Scenario:
    - Host rhhi01 is powered of, rest are active.
    - Shutdown managed by the ansible role rhv.shutdown_env.
  - Result:
    - Shutdown without any issues.
- Environment power up
  - Only two hosts started, rhhi02 and rhhi03.
  - Result:
    - All Storage Domains are active
    - Cluster and Data Center active.

@nirs suggested testing a new deployment, it will take some time, but I guess that I could check the scenario.

nirs · 2024-03-21T13:48:46Z

@nirs suggested testing a new deployment, it will take some time, but I guess that I could check the scenario.

Test looks good, and sure we want to test the common operations like creating a new storage domain, putting storage domain to maintenance and activating it.

But I think you change is not complete - it assumes that backup-volfile-servers= is configured with all the nodes. In this case it seems that gluster does use any of the nodes and does not need the volfileserver argument (otherwise it would not work). But if I remember correctly, backup-volfile-servers= is optional on engine side, so your change must handle the case when it is not configured on engine side.

So the logic seems to be:

if backup_volfile_servers:
    ommit the volfileserver argument
else:
    use the volfileserver argument

Also not using backup-volfile-servers seems very fragile, so we can consider warning about it when connecting to the storage.

josgutie · 2024-03-21T15:40:49Z

@nirs the gluster command doesn't have any parameter to point to the backup servers, you can review the following client documentation Gluster Command Line Interface. In fact it's hard to find the definition of the --remote-host gluster client parameter.

Regarding the suggested change, the problem is if the volfile server is down, the call will fail because --remote-host limits the call to the volfile server and omit calling to the backup servers.

We could deal with connection issue in that part of the code, but from my point of view, it's a gluster cli responsibility not vdsm.

I will create and destroy a gluster storage domain.

josgutie · 2024-03-25T12:59:49Z

New tests completed:

GlusterFS based Storage Domain creation:
- Create a new Storage Domain from a gluster volume.
- New domain parameters:
- Name: data2
- Storage Type: GlusterFS
- Host: rhhi01.libvirt.lab
- Path: rhhi01.libvirt.lab:/data2
- Mount Options: backup-volfile-servers=rhhi02.storage01.lab:rhhi03.storage01.lab
- Result: Storage Domain data2 created.
GlusterFS based Storage Domain maintenance:
- Domain name: data2
- Result: Correct
GlusterFS based Storage Domain activation:
- Domain name: data2
- Result: Correct
GlusterFS based Storage Domain detach:
- Domain name: data2
- Result: Correct
GlusterFS based Storage Domain attach:
- Domain name: data2
- Result: Correct
GlusterFS based Storage Domain remove:
- Domain name: data2
- Result: Correct, gluster volume is empty.

The only thing left to test is to perform a full RHHI 1.8 deployment.

josgutie · 2024-04-02T10:23:45Z

Hi,
A new RHHI 1.8 deployment test completed.
Scenario:

RHVH: rhvh-4.5.3.9
SHE: 4.5.0.7-0.9.
Hypervisors: 3
Storage domains: engine, vmstore and data.
Networking:
ovirtmgmt
gluster(MTU=9000)
livemigration
Result: Deployment completed without any issue.

In this new deployment I've created a new brick, glusterfs volume and glusterfs storage domain.

If you need further details of the tests performed, do not hesitate to ask.

aesteve-rh · 2024-04-02T13:33:17Z

/ost

aesteve-rh · 2024-04-02T13:38:47Z

@josgutie Thanks, your test battery looks correct. @nirs wdyt? do you still have concerns regarding the update?

@josgutie Could you check the module tests that are failing in the pipelines? Probably expectations changed after the update.

aesteve-rh · 2024-04-03T12:39:19Z

/ost

aesteve-rh

LGTM :)

aesteve-rh · 2024-04-10T12:06:48Z

@josgutie linter failed, mind the longer lines. Limit is 79 characters.

I was also wondering if you could add a small test that throws an exceptions and triggers the retry with the backup servers. So that at least we add some coverage for the new code.

tests/storage/storageserver_test.py

aesteve-rh · 2024-04-18T12:32:11Z

/ost

aesteve-rh · 2024-04-19T07:07:50Z

/ost

aesteve-rh · 2024-04-19T12:43:54Z

/ost

aesteve-rh · 2024-04-22T12:59:03Z

/ost

tinez · 2024-04-22T13:52:58Z

/ost el8stream basic-suite-master

tinez · 2024-04-22T13:55:55Z

/ost basic-suite-master el8stream

michalskrivanek · 2024-04-22T15:10:04Z

el8stream OST is not working, el9stream should pass

michalskrivanek · 2024-04-22T15:10:18Z

/ost basic-suite-master el9stream

aesteve-rh · 2024-04-23T10:48:29Z

/ost

aesteve-rh · 2024-04-23T12:50:42Z

@nirs do you have any comment on this PR? Otherwise, I will integrate.

nirs

This change will not work in all cases, is little bit messy and need better documentation. See comments on to make it more clear.

lib/vdsm/storage/storageServer.py

tests/storage/storageserver_test.py

josgutie · 2024-04-24T16:06:57Z

Hi Nir, PR is updated, my doubt in on the test coverage, but it should work in both cases, where backup server are defined and not.

nirs

Thanks @josgutie! Looks very clean and safe change. The test can be renamed and documented.

lib/vdsm/storage/storageServer.py

tests/storage/storageserver_test.py

The function _get_gluster_volinfo query the glusterfs volume info the the storage server, this is translated to the gluster client adding the parameter --remote-host which limits the query to one server, so we are converting the storage server as a single point of failure, if it is not available, it can led to cluster outtage. The proposed changed let the cluster cli to use any available gluster peer. Signed-off-by: José Enrique Gutiérrez Mazón <[email protected]>

aesteve-rh · 2024-04-25T08:17:47Z

/ost

aesteve-rh · 2024-04-25T08:20:28Z

/ost

aesteve-rh · 2024-04-25T11:21:44Z

/ost

josgutie requested a review from aesteve-rh as a code owner March 20, 2024 12:59

aesteve-rh requested a review from nirs March 20, 2024 13:04

nirs requested changes Mar 20, 2024

View reviewed changes

aesteve-rh force-pushed the josgutie branch from a90b8f2 to 523e55f Compare April 2, 2024 13:30

josgutie force-pushed the josgutie branch 3 times, most recently from 8edc354 to a000ba4 Compare April 10, 2024 11:53

aesteve-rh previously approved these changes Apr 10, 2024

View reviewed changes

josgutie dismissed aesteve-rh’s stale review via 4928a83 April 17, 2024 08:29

josgutie force-pushed the josgutie branch 2 times, most recently from 4928a83 to c15b799 Compare April 17, 2024 10:17

aesteve-rh reviewed Apr 17, 2024

View reviewed changes

tests/storage/storageserver_test.py Outdated Show resolved Hide resolved

josgutie force-pushed the josgutie branch 5 times, most recently from c72cad3 to de8ff6b Compare April 18, 2024 11:55

aesteve-rh previously approved these changes Apr 18, 2024

View reviewed changes

nirs reviewed Apr 23, 2024

View reviewed changes

josgutie dismissed aesteve-rh’s stale review via cb497ae April 24, 2024 16:04

josgutie force-pushed the josgutie branch from de8ff6b to cb497ae Compare April 24, 2024 16:04

josgutie requested a review from nirs April 24, 2024 16:07

nirs previously approved these changes Apr 24, 2024

View reviewed changes

lib/vdsm/storage/storageServer.py Outdated Show resolved Hide resolved

tests/storage/storageserver_test.py Show resolved Hide resolved

josgutie dismissed nirs’s stale review via 85834a3 April 25, 2024 07:50

josgutie force-pushed the josgutie branch from cb497ae to 85834a3 Compare April 25, 2024 07:50

josgutie force-pushed the josgutie branch from 85834a3 to 4ef7345 Compare April 25, 2024 07:59

aesteve-rh approved these changes Apr 25, 2024

View reviewed changes

aesteve-rh merged commit d206f0b into oVirt:master Apr 25, 2024
18 checks passed

josgutie deleted the josgutie branch April 25, 2024 15:33

sandrobonazzola added this to the ovirt-4.5.7 milestone Oct 18, 2024

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

vdsm: get gluster volume info from any gluster peer #412

vdsm: get gluster volume info from any gluster peer #412

josgutie commented Mar 20, 2024

nirs left a comment

josgutie commented Mar 20, 2024 •

edited

Loading

nirs commented Mar 20, 2024

josgutie commented Mar 21, 2024 •

edited

Loading

josgutie commented Mar 21, 2024 •

edited

Loading

nirs commented Mar 21, 2024

josgutie commented Mar 21, 2024

josgutie commented Mar 25, 2024 •

edited

Loading

josgutie commented Apr 2, 2024

aesteve-rh commented Apr 2, 2024

aesteve-rh commented Apr 2, 2024

aesteve-rh commented Apr 3, 2024

aesteve-rh left a comment

aesteve-rh commented Apr 10, 2024

aesteve-rh commented Apr 18, 2024

aesteve-rh commented Apr 19, 2024

aesteve-rh commented Apr 19, 2024

aesteve-rh commented Apr 22, 2024

tinez commented Apr 22, 2024

tinez commented Apr 22, 2024

michalskrivanek commented Apr 22, 2024

michalskrivanek commented Apr 22, 2024

aesteve-rh commented Apr 23, 2024

aesteve-rh commented Apr 23, 2024

nirs left a comment

josgutie commented Apr 24, 2024

nirs left a comment

aesteve-rh commented Apr 25, 2024

aesteve-rh commented Apr 25, 2024

aesteve-rh commented Apr 25, 2024

vdsm: get gluster volume info from any gluster peer #412

vdsm: get gluster volume info from any gluster peer #412

Conversation

josgutie commented Mar 20, 2024

nirs left a comment

Choose a reason for hiding this comment

josgutie commented Mar 20, 2024 • edited Loading

nirs commented Mar 20, 2024

josgutie commented Mar 21, 2024 • edited Loading

josgutie commented Mar 21, 2024 • edited Loading

nirs commented Mar 21, 2024

josgutie commented Mar 21, 2024

josgutie commented Mar 25, 2024 • edited Loading

josgutie commented Apr 2, 2024

aesteve-rh commented Apr 2, 2024

aesteve-rh commented Apr 2, 2024

aesteve-rh commented Apr 3, 2024

aesteve-rh left a comment

Choose a reason for hiding this comment

aesteve-rh commented Apr 10, 2024

aesteve-rh commented Apr 18, 2024

aesteve-rh commented Apr 19, 2024

aesteve-rh commented Apr 19, 2024

aesteve-rh commented Apr 22, 2024

tinez commented Apr 22, 2024

tinez commented Apr 22, 2024

michalskrivanek commented Apr 22, 2024

michalskrivanek commented Apr 22, 2024

aesteve-rh commented Apr 23, 2024

aesteve-rh commented Apr 23, 2024

nirs left a comment

Choose a reason for hiding this comment

josgutie commented Apr 24, 2024

nirs left a comment

Choose a reason for hiding this comment

aesteve-rh commented Apr 25, 2024

aesteve-rh commented Apr 25, 2024

aesteve-rh commented Apr 25, 2024

josgutie commented Mar 20, 2024 •

edited

Loading

josgutie commented Mar 21, 2024 •

edited

Loading

josgutie commented Mar 21, 2024 •

edited

Loading

josgutie commented Mar 25, 2024 •

edited

Loading