
vdsm: get gluster volume info from any gluster peer #412

Merged
1 commit merged into oVirt:master from the josgutie branch on Apr 25, 2024

Conversation

josgutie
Contributor

The function _get_gluster_volinfo queries the glusterfs volume info from the storage server. This translates into the gluster client being invoked with the --remote-host parameter, which limits the query to that single server, so we turn the storage server into a single point of failure: if it is not available, it can lead to a cluster outage. The proposed change lets the gluster cli use any available gluster peer.
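For illustration, here is a minimal sketch of the difference, assuming the standard gluster CLI options; the volume and host names are placeholders and this is not the exact command line vdsm builds:

import subprocess

VOLUME = "data"
VOLFILE_SERVER = "rhhi01.storage01.lab"  # the single configured storage server

# Current behaviour: the query is pinned to one glusterd instance; if that
# host is down, the command fails even when the other peers are healthy.
subprocess.run(
    ["gluster", "--mode=script", "--remote-host=" + VOLFILE_SERVER,
     "volume", "info", VOLUME, "--xml"],
    check=True)

# Proposed behaviour: without --remote-host the CLI talks to the local
# glusterd, so the query can be answered as long as the cluster is reachable,
# regardless of which single storage server is down.
subprocess.run(
    ["gluster", "--mode=script", "volume", "info", VOLUME, "--xml"],
    check=True)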

Member

@nirs nirs left a comment


Why did you send a new PR when we have #408?

You need to answer my questions in #408. Please close this PR so we can continue the review on the existing PR.

@josgutie
Contributor Author

josgutie commented Mar 20, 2024

Why did you send a new PR when we have #408?

You need to answer my questions in #408. Please close this PR so we can continue the review on the existing PR.

Because the previous PR was badly formed, and I think this PR is cleaner.

Here is your question from the other PR:

I think this was added years ago because not specifying the server could be a problem. Not sure that the gluster cluster can cope with a case when one of the servers is down.

The gluster cluster can cope with one of the servers being down. The issue is triggered because adding --remote-host to the gluster cli limits the query to that host, preventing queries to the backup servers. Maybe it's a mechanism to avoid registering outdated info.

Please check git history to understand when and why self._volfileserver was added.

In the git history the function was always called with self._volfileserver, as you can see in 907560b.

Also how did you test this change? Please describe what was tested - how you simulated the issue and tested that this change solves the issue, and how you tested that this change does not cause regressions in other flows.

Regarding the tests, I will describe them in a new comment if you don't mind.

@nirs
Member

nirs commented Mar 20, 2024

@josgutie ok, let's continue here, but please do not close PRs and submit a new one like this. If there was an issue with the previous PR, it can easily be fixed by pushing a new version.

@josgutie
Contributor Author

josgutie commented Mar 21, 2024

  • Environment:
    • RHHI: 1.8
      • ovirt-engine: 4.5.3.10
      • RHVH: 4.5.3.10
        • vdsm: 4.50.3.9-1
        • vdsm-gluster: 4.50.3.9-1
        • glusterfs: 6.0-63
  • SPM: rhhi02.storage01.lab
  • Gluster Storage domains configuration:

$ /usr/share/ovirt-engine/dbscripts/engine-psql.sh -c "SELECT connection, mount_options FROM storage_server_connections;"
connection | mount_options
-------------------------------+------------------------------------------------------------------
rhhi01.storage01.lab:/data | backup-volfile-servers=rhhi01.storage01.lab:rhhi02.storage01.lab
rhhi03.storage01.lab:/engine | backup-volfile-servers=rhhi01.storage01.lab:rhhi02.storage01.lab
rhhi03.storage01.lab:/vmstore | backup-volfile-servers=rhhi01.storage01.lab:rhhi02.storage01.lab

$ grep -E "storage|mnt_options" /etc/ovirt-hosted-engine/hosted-engine.conf
storage=rhhi03.storage01.lab:/engine
mnt_options=backup-volfile-servers=rhhi01.storage01.lab:rhhi02.storage01.lab

$ df -h | grep glusterSD
rhhi03.storage01.lab:/engine 100G 14G 87G 14% /rhev/data-center/mnt/glusterSD/rhhi03.storage01.lab:_engine
rhhi03.storage01.lab:/vmstore 150G 2.6G 148G 2% /rhev/data-center/mnt/glusterSD/rhhi03.storage01.lab:_vmstore
rhhi01.storage01.lab:/data 50G 908M 50G 2% /rhev/data-center/mnt/glusterSD/rhhi01.storage01.lab:_data

@josgutie
Contributor Author

josgutie commented Mar 21, 2024

  • Issue:

    • Put the data storage domain into maintenance mode.
    • Shut down the rhhi01 host.
    • The data storage domain can't be activated.
  • Here are the tests completed and their results after applying the patch.

    • Scenario of the issue:

      • Result: SD data can be activated.
    • Environment power off:

      • Scenario:
        • Host rhhi01 is powered off, the rest are active.
        • Shutdown managed by the ansible role rhv.shutdown_env.
      • Result:
        • Shutdown without any issues.
    • Environment power up:

      • Only two hosts started, rhhi02 and rhhi03.
      • Result:
        • All Storage Domains are active
        • Cluster and Data Center active.

@nirs suggested testing a new deployment; it will take some time, but I think I can check that scenario.

@nirs
Member

nirs commented Mar 21, 2024

@nirs suggested testing a new deployment; it will take some time, but I think I can check that scenario.

Test looks good, and sure we want to test the common operations like creating a new storage domain, putting a storage domain into maintenance and activating it.

But I think your change is not complete - it assumes that backup-volfile-servers= is configured with all the nodes. In this case it seems that gluster does use any of the nodes and does not need the volfileserver argument (otherwise it would not work). But if I remember correctly, backup-volfile-servers= is optional on the engine side, so your change must handle the case when it is not configured on the engine side.

So the logic seems to be:

if backup_volfile_servers:
    omit the volfileserver argument
else:
    use the volfileserver argument

Also not using backup-volfile-servers seems very fragile, so we can consider warning about it when connecting to the storage.
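As a rough sketch of that logic (a standalone function with an injected query callable, since the actual vdsm helper is not shown here and its name is only assumed):

def get_gluster_volinfo(volname, volfile_server, backup_servers, query):
    # `query` stands in for whatever call vdsm uses to run
    # `gluster volume info`; it takes the volume name and an optional
    # remote host to pin the query to.
    if backup_servers:
        # backup-volfile-servers is configured on the engine side: any peer
        # can answer, so do not pin the query to a single host.
        return query(volname)
    # No backup servers configured: keep the historical behaviour and query
    # the configured volfile server explicitly.
    return query(volname, volfile_server)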

@josgutie
Contributor Author

@nirs the gluster command doesn't have any parameter to point to the backup servers; you can review the Gluster Command Line Interface client documentation. In fact, it's hard to find the definition of the --remote-host gluster client parameter.

Regarding the suggested change, the problem is that if the volfile server is down, the call will fail, because --remote-host limits the call to the volfile server and prevents calling the backup servers.

We could deal with the connection issue in that part of the code, but from my point of view it's the gluster cli's responsibility, not vdsm's.
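Roughly, dealing with it on the vdsm side would mean something like the sketch below; the helper and its arguments are illustrative only, not the actual vdsm code:

def get_volinfo_with_fallback(volname, volfile_server, backup_servers, query):
    # Try the configured volfile server first, then each backup server,
    # and re-raise the last error if no peer can answer.
    last_error = None
    for server in [volfile_server] + list(backup_servers):
        try:
            return query(volname, server)
        except Exception as error:
            last_error = error
    raise last_error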

I will create and destroy a gluster storage domain.

@josgutie
Contributor Author

josgutie commented Mar 25, 2024

New tests completed:

  • GlusterFS based Storage Domain creation:
    • Create a new Storage Domain from a gluster volume.
    • New domain parameters:
      • Name: data2
      • Storage Type: GlusterFS
      • Host: rhhi01.libvirt.lab
      • Path: rhhi01.libvirt.lab:/data2
      • Mount Options: backup-volfile-servers=rhhi02.storage01.lab:rhhi03.storage01.lab
    • Result: Storage Domain data2 created.
  • GlusterFS based Storage Domain maintenance:
    • Domain name: data2
    • Result: Correct
  • GlusterFS based Storage Domain activation:
    • Domain name: data2
    • Result: Correct
  • GlusterFS based Storage Domain detach:
    • Domain name: data2
    • Result: Correct
  • GlusterFS based Storage Domain attach:
    • Domain name: data2
    • Result: Correct
  • GlusterFS based Storage Domain remove:
    • Domain name: data2
    • Result: Correct, gluster volume is empty.

The only thing left to test is to perform a full RHHI 1.8 deployment.

@josgutie
Contributor Author

josgutie commented Apr 2, 2024

Hi,
A new RHHI 1.8 deployment test completed.
Scenario:

  • RHVH: rhvh-4.5.3.9
  • SHE: 4.5.0.7-0.9
  • Hypervisors: 3
  • Storage domains: engine, vmstore and data
  • Networking:
    • ovirtmgmt
    • gluster (MTU=9000)
    • livemigration
  • Result: Deployment completed without any issues.

In this new deployment I've created a new brick, a glusterfs volume and a glusterfs storage domain.

If you need further details of the tests performed, do not hesitate to ask.

@aesteve-rh
Member

/ost

@aesteve-rh
Member

@josgutie Thanks, your test battery looks correct. @nirs wdyt? Do you still have concerns regarding the update?

@josgutie Could you check the module tests that are failing in the pipelines? Probably expectations changed after the update.

@aesteve-rh
Member

/ost

@josgutie josgutie force-pushed the josgutie branch 3 times, most recently from 8edc354 to a000ba4 on April 10, 2024 at 11:53
aesteve-rh previously approved these changes Apr 10, 2024
Member

@aesteve-rh aesteve-rh left a comment


LGTM :)

@aesteve-rh
Member

@josgutie the linter failed, mind the long lines. The limit is 79 characters.

I was also wondering if you could add a small test that throws an exception and triggers the retry with the backup servers, so that we at least add some coverage for the new code.
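For example, something along these lines (a pytest-style sketch with a fake query; the fallback helper is a local stand-in that mirrors the retry behaviour, not the actual vdsm function):

def _volinfo_with_fallback(volname, servers, query):
    # Local stand-in for the code under test: try each server in order and
    # re-raise the last error if all of them fail.
    last_error = None
    for server in servers:
        try:
            return query(volname, server)
        except Exception as error:
            last_error = error
    raise last_error


def test_volinfo_retries_with_backup_server():
    calls = []

    def fake_query(volname, server):
        calls.append(server)
        if server == "down.example.com":
            raise RuntimeError("glusterd unreachable")
        return {"volname": volname}

    info = _volinfo_with_fallback(
        "data", ["down.example.com", "backup.example.com"], fake_query)

    assert info == {"volname": "data"}
    assert calls == ["down.example.com", "backup.example.com"]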

@josgutie josgutie force-pushed the josgutie branch 5 times, most recently from c72cad3 to de8ff6b on April 18, 2024 at 11:55
aesteve-rh previously approved these changes Apr 18, 2024
@aesteve-rh
Member

/ost

3 similar comments
@aesteve-rh
Member

/ost

@aesteve-rh
Member

/ost

@aesteve-rh
Member

/ost

@tinez
Member

tinez commented Apr 22, 2024

/ost el8stream basic-suite-master

@tinez
Member

tinez commented Apr 22, 2024

/ost basic-suite-master el8stream

@michalskrivanek
Member

el8stream OST is not working, el9stream should pass

@michalskrivanek
Member

/ost basic-suite-master el9stream

@aesteve-rh
Member

/ost

@aesteve-rh
Member

@nirs do you have any comment on this PR? Otherwise, I will integrate.

Member

@nirs nirs left a comment


This change will not work in all cases, is a little bit messy, and needs better documentation. See the comments on how to make it clearer.

lib/vdsm/storage/storageServer.py: 4 review threads (outdated, resolved)
tests/storage/storageserver_test.py: 1 review thread (resolved)
@josgutie
Contributor Author

Hi Nir, the PR is updated. My doubt is about the test coverage, but it should work in both cases, whether backup servers are defined or not.

@josgutie josgutie requested a review from nirs April 24, 2024 16:07
nirs previously approved these changes Apr 24, 2024
Member

@nirs nirs left a comment


Thanks @josgutie! This looks like a very clean and safe change. The test can be renamed and documented.

lib/vdsm/storage/storageServer.py: 1 review thread (outdated, resolved)
tests/storage/storageserver_test.py: 1 review thread (resolved)
The function _get_gluster_volinfo queries the glusterfs volume info from the storage server. This translates into the gluster client being invoked with the --remote-host parameter, which limits the query to that single server, so we turn the storage server into a single point of failure: if it is not available, it can lead to a cluster outage.
The proposed change lets the gluster cli use any available gluster peer.

Signed-off-by: José Enrique Gutiérrez Mazón <[email protected]>
@aesteve-rh
Member

/ost

@aesteve-rh
Member

/ost

1 similar comment
@aesteve-rh
Member

/ost

@aesteve-rh aesteve-rh merged commit d206f0b into oVirt:master Apr 25, 2024
18 checks passed
@josgutie josgutie deleted the josgutie branch April 25, 2024 15:33
@sandrobonazzola sandrobonazzola added this to the ovirt-4.5.7 milestone Oct 18, 2024