ReadWriteMany support #1567
hmm RWX is indeed not supported. |
I have a small k8s cluster at home and I use mayastor as my primary storage solution. Most of the time RWO works fine, but sometimes I get errors when the descheduler moves an app to another node or when I need different apps to use the same PVC. |
If you have apps using the same pvc then ext4 or xfs will not suit your use case, you need a distributed filesystem. |
Sharing the same PVC is a rare thing to do, I'm ok with having to use nodeAffinity but what about apps getting scheduled on different nodes?
|
I can manually re-attach a PVC by deleting the VolumeAttachment:
$ kubectl get pvc
NAME STATUS VOLUME CAPACITY ACCESS MODES STORAGECLASS VOLUMEATTRIBUTESCLASS AGE
jellyseerr Bound pvc-5c622d91-8dcf-4a47-b84a-859a606ced94 5Gi RWO mayastor-thin <unset> 3d
$ kubectl get volumeattachments.storage.k8s.io
NAME ATTACHER PV NODE ATTACHED AGE
csi-80304c558ff6466b5f78f1de5633c760f350743f03b6de5cb6b7be24d2dcdfe8 io.openebs.csi-mayastor pvc-5c622d91-8dcf-4a47-b84a-859a606ced94 gea true 24h
$ kubectl patch volumeattachments.storage.k8s.io csi-80304c558ff6466b5f78f1de5633c760f350743f03b6de5cb6b7be24d2dcdfe8 -p '{"metadata":{"finalizers":[]}}' --type=merge
volumeattachment.storage.k8s.io/csi-80304c558ff6466b5f78f1de5633c760f350743f03b6de5cb6b7be24d2dcdfe8 patched
$ kubectl delete volumeattachments.storage.k8s.io csi-80304c558ff6466b5f78f1de5633c760f350743f03b6de5cb6b7be24d2dcdfe8
volumeattachment.storage.k8s.io "csi-80304c558ff6466b5f78f1de5633c760f350743f03b6de5cb6b7be24d2dcdfe8" deleted
$ kubectl rollout restart deployment jellyseerr
With this, the PVC can be attached to a new node, solving the error mentioned before. |
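Before removing an attachment by hand, it can help to confirm which node currently holds the volume. The following is a minimal sketch, assuming the output of `kubectl get volumeattachments.storage.k8s.io -o json` has been captured as text. The function name `attachments_for_pv` and the sample document (with a shortened attachment name) are hypothetical; the field paths `spec.source.persistentVolumeName`, `spec.nodeName` and `status.attached` follow the VolumeAttachment API. It only reads data and does not change the cluster.

```python
import json


def attachments_for_pv(kubectl_json: str, pv_name: str) -> list[dict]:
    """Return name/node/attached for every VolumeAttachment that references pv_name."""
    doc = json.loads(kubectl_json)
    found = []
    for item in doc.get("items", []):
        spec = item.get("spec", {})
        # Skip attachments that belong to other persistent volumes.
        if spec.get("source", {}).get("persistentVolumeName") != pv_name:
            continue
        found.append({
            "name": item["metadata"]["name"],
            "node": spec.get("nodeName"),
            "attached": item.get("status", {}).get("attached", False),
        })
    return found


# Hypothetical sample shaped like the kubectl listing in the comment above.
SAMPLE = json.dumps({"items": [{
    "metadata": {"name": "csi-example"},
    "spec": {
        "attacher": "io.openebs.csi-mayastor",
        "nodeName": "gea",
        "source": {"persistentVolumeName": "pvc-5c622d91-8dcf-4a47-b84a-859a606ced94"},
    },
    "status": {"attached": True},
}]})

if __name__ == "__main__":
    for att in attachments_for_pv(SAMPLE, "pvc-5c622d91-8dcf-4a47-b84a-859a606ced94"):
        print(att)
```

If the reported node is still running a pod that has the filesystem mounted, removing the attachment is the situation the next comment warns about.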
I don't know the exact context of what's being done, but patching the attachments to allow multiple nodes access to the same filesystem volume can be a recipe for data corruption, because the filesystem drivers on both nodes don't know about each other. |
I have a use case for this where I would like to use Mayastor for high performance VM storage with KubeVirt, but also preserve the ability to live migrate VMs, which requires RWX. In this case, access to the block device is orchestrated by KubeVirt and KVM, and RWX is simply needed to be able to attach the PV to both VM pods simultaneously during the live migration. It is not being accessed concurrently, but both the source and destination VM pods require access to the PV simultaneously for the "handover" of the block device, before the source pod terminates. Please see https://kubevirt.io/user-guide/operations/live_migration/#enabling-the-live-migration-support |
Does the kubevirt use reservations on the "handover"? |
I'm not a go programmer, but looking at kubevirt's documentation, architecture, and code, it seems to me that kubevirt doesn't get involved in the actual migration itself and offloads this entirely to libvirt/qemu layer. So for the precise handling of the block devices on either host, we would have to look at libvirt's source, there is an explanation at https://www.linux-kvm.org/page/Migration , but I have no idea how up to date this is, though it states the following:
Start guest on destination, connect, enable dirty page logging and more
Guest continues to run
And sync VM image(s) (guest's hard drives).
As fast as possible (no bandwidth limitation)
On destination upon success
Based on this, I can only assume that step 4 would flush to disk on the source after it has stopped the source VM, then inform libvirt on the destination that the block device is ready for takeover?
edit: I found the actual migration handler here:
This seems to be handled by methods:
I think qemuMigrationSrcBeginPhaseBlockDirtyBitmaps is of most interest for this.
ref: |
I second this. Mayastor is really compelling for running as a datastore for Kubevirt. RWX would be a really great addition. |
Another reasonable use case: I want to serve LLMs, and not being able to share the same storage means I would need the exact same model in each PV for each replica. This is a huge waste of storage space when you could just keep a single model and load it into each replica as needed. I really like that mayastor is future focused and not a mess (at least according to talos and openEBS), so having the ability to share storage between pods would be ideal. I get that there are other methods to do this, but they require significantly more complex setups, when just allowing the PV to connect to multiple pods would simplify everything and reduce the complexity of the entire deployment. I also get that there are security implications, but these can be mitigated using other methods. For my personal use case, the ideal version of this would be readManyWriteOnce, but it doesn't seem like anyone can do that. Also, because you don't support volumeExpansion, there's no straightforward method to transfer files to a new PV... |
What you're asking here is for a clustered filesystem. |
@Mechputer Would the openebs nfs provisioner not suit your needs? |
@tsteine Unfortunately, no. I'm running talos, which supports mayastor. I can't find anywhere if an NFS server can be put on talos, or how. I don't know if they just think that's excessive, or unsafe. There are several factors that make this not work:
I'm looking for a solution that uses newer technology (mayastor/NVMEoF), works on a read-only filesystem (and has instructions somewhere), and doesn't fail to reconnect if something goes down. I need redundancy and stability, not assumed functionality. Unless there's something I'm not aware of, that would allow the new NFS container to have the same ID as the previous one (which I'm pretty sure goes against how k8s works, in the first place). In fact, this should just be something that is obviously wrong with the mayastor system as a whole. Anything you're using, that uses a mayastor PVC, you're assuming that the container will never crash, or go down, or anything. Unless the assumption is that you would always constantly back up the mayastor drive, and recreate it if there's a current image stored somewhere outside the container, that is the only container that has access to the PVC (new or old, since only one container can have access to a mayastor drive ever), and tell it to restore the drive from the backup. |
Why is the data lost?
I don't understand what it is you're implying here. Mayastor survives both container crash and node crash. As mentioned above you might need to delete the volumeattachment manually, and this is not something which is specific to mayastor for that matter, though perhaps could be better automated in certain cases. @tsteine I'm meeting up with some folks to discuss RWM for live migration at the kubecon :) |
I'm realizing I might be an idiot. If the node that goes down has the PVC with the data, of course it's not accessible to the rest of the cluster. If it's replicated to multiple drives on multiple machines, that should keep the data in the PVC. Is that correct? |
If you have mayastor volume with N replicas, then we can generally support loss of N-1 nodes because the data exists on N nodes. |
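The N-1 rule can be sketched as simple set arithmetic. This is an illustration only, assuming each replica sits on a distinct node and ignoring rebuilds and any control-plane requirements; the function names are hypothetical and not part of any Mayastor API.

```python
def tolerated_node_losses(replica_count: int) -> int:
    """With N replicas on N distinct nodes, up to N-1 of those nodes can be lost."""
    if replica_count < 1:
        raise ValueError("a volume needs at least one replica")
    return replica_count - 1


def data_survives(replica_nodes: set[str], failed_nodes: set[str]) -> bool:
    """True if at least one replica lives on a node that has not failed."""
    return len(replica_nodes - failed_nodes) > 0


if __name__ == "__main__":
    # Three replicas: losing two of the three nodes still leaves one copy.
    print(tolerated_node_losses(3), data_survives({"a", "b", "c"}, {"a", "b"}))
    # A single replica: losing its node loses the only copy.
    print(tolerated_node_losses(1), data_survives({"a"}, {"a"}))
```

With a single replica the tolerance is zero, so losing the node that holds it loses the only copy of the data.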
Yeah, no, please ignore my idiocy. I only had 1pvc, no replicas. Both the container and the pvc were on the node that went down. So of course I lost everything when it was recreated, with nothing to copy from. Still, RWM please. To be fair, I'm only 3 months into learning anything k8s related. |
So I can't create an NFS server on talos, because it's read only and/or it's not explicitly stated and/or there are no instructions. I can't add RWX to mayastor, because that's not allowed. I can't use NFS on top of mayastor, because it needs kernel permissions, and I don't have an NFS server. I can't use ganesha for openEBS's nfs provisioner, because for whatever reason, they don't include that. I can even create an nfs pvc, but then my pods can't access it, because it's expected to be accessed by root. Once again, it sounds like all of the issues stem from using a read-only filesystem OS like talos. Am I just a complete idiot? Is there a way for this to work that I'm just not able to find? Do I have to switch to a mutable filesystem in order to use anything? talos 1.6.5, k8s 1.29. I'm able to create PVCs using mayastor, they're just not RWX. I can create "kernel" nfs PVCs, they just can't be accessed. All of my add-ons and service function without issue, except for the things trying to access the read-only filesystem of talos, which is expected. |
@Mechputer I think this is getting off topic, with regards to the project, and that a mayastor issue on RWX support is not the appropriate forum for how to run NFS servers on Talos. That being said, the way this would normally work would be that you would set up the OpenEBS nfs provisioner in the kubernetes cluster, with setup pointing it to the storage class for mayastor, and a new storage class with an appropriate name like "OpenEBS NFS RWX" for the nfs provisioner. I don't see why this kind of setup shouldn't work just fine on Talos, since the NFS server is run in a pod. edit: see here for included packages in the kubelet for talos, nfs-common is included. Edit 2: You might be running into this issue: https://kubernetes.io/blog/2021/11/09/non-root-containers-and-devices/ |
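The claim side of that setup can be sketched as follows, assuming an NFS provisioner is already installed and exposes its own StorageClass. The class name `openebs-nfs-rwx` and the claim name `shared-data` are hypothetical. Kubernetes object names must be lowercase DNS-style names, so a name with spaces such as "OpenEBS NFS RWX" would need to be adapted along these lines. The script only prints a manifest as JSON, which kubectl accepts as well as YAML.

```python
import json


def rwx_claim(name: str, storage_class: str, size: str) -> dict:
    """Build a PersistentVolumeClaim manifest that requests ReadWriteMany access."""
    return {
        "apiVersion": "v1",
        "kind": "PersistentVolumeClaim",
        "metadata": {"name": name},
        "spec": {
            "accessModes": ["ReadWriteMany"],
            # Hypothetical class served by the NFS provisioner, not by mayastor directly.
            "storageClassName": storage_class,
            "resources": {"requests": {"storage": size}},
        },
    }


if __name__ == "__main__":
    print(json.dumps(rwx_claim("shared-data", "openebs-nfs-rwx", "5Gi"), indent=2))
```

If the block is saved as, say, `claim.py`, the output could be piped to `kubectl apply -f -`. The NFS server pod behind such a class still mounts an RWO volume, so it remains a single point of access.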
A single point of failure is usually a no-go for production, and in that state it does not really solve RWX support on mayastor. Even manually setting 2 replicas for the nfs-pvc pods does not fully solve that, though it might help a bit. I also think there is no option to set a default number of replicas for newly created nfs-pvc pods. And these pods actually mount mayastor RWO volumes, so they can still only be deployed on the same node. So there is no option to tolerate node failures. |
Hi @tsteine @synthe102 ... as @tiagolobocastro mentioned earlier, our team was set to meet with the KubeVirt / RedHat Storage engineering team (e.g. KubeVirt, KVM & QEMU folks) at KubeCon Paris 2024 last week. This meeting did happen and was very good. We discussed this issue and how to safely enable this RWX label/tag explicitly for KVM/KubeVirt/QEMU. We are going to do this engineering work, as it's not too complex, so your solution is coming soon. Please note that we are going to add some safety precautions around enabling the RWX label, so that we only allow it for KVM/KubeVirt/QEMU, as it's a dangerous recipe for data corruption if we allow it to be generally and widely open to any apps. It's guaranteed that users will try all sorts of unsupported operations with this and corrupt their filesystems (& block devices) very easily. |
That's great to hear. As for restricting it to Kubevirt/KVM/QEMU, that makes perfect sense. Was the kubevirt csi driver mentioned during the meeting? I am making note of this, as I would like to be able to live-migrate my downstream k8s vm nodes, however, I suspect that if we set RWX on the pvc in the virtualized cluster, and on the pvc in the kubevirt infrastructure cluster, that it would be possible to hotplug the volume to multiple vms simultaneously outside of live-migration, since this is all done by kubevirt. I think it may be necessary to look at a mechanism for restricting kubevirt-csi provisioned volumes to only RWO, if using mayastor with the kubevirt-csi in virtualized K8S clusters. Either that, or document specifically that the kubevirt-csi driver is not supported for mayastor, and if you do and corrupt your data, that is "your bad", I don't mind using a different CSI in a downstream cluster, being able to hook into upstream storage just makes it easier. |
Firstly, peppered in with that discussion, we did diverge a little to the downstream KubeVirt CSI driver, but the conversation quickly came back to the QEMU/KVM/live-migration discussion. We didn't spend much time on the KV CSI driver. The issues you bring up with the downstream KV CSI driver are complex and very tricky, especially when you add the live-migration component into that mix via the CSI of the storage platform (OpenEBS). It feels complex and a bit dangerous, and I get the feeling we may be doing some pioneering work here that is probably going to be a bit messy until we figure it all out. I don't have the answers yet, but this is a high-priority project for OpenEBS + RedHat that we (OpenEBS) will start coding up once we drop the next release of OpenEBS (v2.6) in a week or 2. |
Not sure how effective this is, but maybe it can be an improvement for openebs nfs?
Longhorn also has some failure handling. Although, from my testing, if NFS terminates, the pods using the volume are also terminated. |
I probably should've replied earlier, but I'm certainly interested in helping testing this feature. |
Found this thread trying to set up RWX on a 3-node replicated mayastor storage-class. I am seeing the same error mentioned above:
Here are the storage-class details:
Here is my use-case - I'm trying to set up 2 replicas of an application that only works with a local filesystem. I am trying to create a deployment with 2 replicas sitting behind a loadbalancer and hoping that both pods will have RW access to the same pvc backed by the same sc backed by 3 mayastor disk pools. The intended topology should support nodenotready scenarios for either the application or mayastor in any combination. As long as there is one node available in both the control plane and the data plane, the overall application should seamlessly work. Is this possible today? |
@anmshkmr, that is not possible because the local filesystem is "local", it is mounted on the node where the application is running. |
@tiagolobocastro appreciate the helpful response. A few questions if you could answer them:
|
The storage class would not change I'd say. If you use an NFS provisioner
We don't currently have docs for this. There might be some older docs for other engines you could use perhaps, maybe @avishnu can suggest some? |
You may refer to the NFS-on-CStor provisioning steps here and replicate the same setup, replacing CStor with Mayastor. |
Thank you. I am going to try that and let you know. |
Hi @tiagolobocastro Is there a plan to support exporting a volume through multiple nodes to achieve NVMe-oF multipathing and support RWX features? |
@avishnu @tiagolobocastro Just a follow-up here: unfortunately, nfs-server-provisioner has been deprecated. When I install it using the helm chart mentioned on the page you linked, the nfs-provisioner pod cannot be rescheduled on a multi-node cluster, because the PVC that's created for the NFS server's internal use is bound to a single node. This appears to be an alternative: https://github.com/openebs-archive/dynamic-nfs-provisioner, but that is part of openebs-archive as well. I also found this: https://blog.mayadata.io/openebs/setting-up-persistent-volumes-in-rwx-mode-using-openebs, but that looks pretty involved. Do you have a recommendation for a good alternative to nfs-server-provisioner that can work seamlessly on a multi-node cluster? |
Hey @anmshkmr, we've now documented this: https://openebs.io/docs/Solutioning/read-write-many/nfspvc |
For a block mode volume? |
Thanks, that's very helpful in general. I was able to set up the whole stack end-to-end a few weeks ago though. I can tell you that the setup is still not practical to use because the cpu usage is very high from the io-engine. It makes the overall k8s cluster very unstable to run the other workloads. Any suggestion to keep the overall resource footprint low? |
You can reduce the io-engine cpu usage to 1 core. |
@tiagolobocastro is there any advantage to using the configuration from https://openebs.io/docs/Solutioning/read-write-many/nfspvc in practice? From a user perspective, the main difference I see in the new approach is that it uses only one PVC for NFS and a subdir per volume, versus multiple volumes in the old approach. For me that is rather a disadvantage if I'd like to, for example, snapshot individual volumes. But maybe there are some other pros to the new way? |
Yes! |
Issue #1127 was closed as completed but trying to create a PVC with RWX access mode
throws an error:
failed to provision volume with StorageClass "mayastor-thin": rpc error: code = InvalidArgument desc = Invalid volume access mode: 5
Is this not supported?
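The number in that error is the CSI access mode enum value rather than a Kubernetes access mode string. A small lookup sketch, with values as I understand the `VolumeCapability.AccessMode.Mode` enum in the CSI specification; the class and function names are hypothetical:

```python
from enum import IntEnum


class CsiAccessMode(IntEnum):
    """Numeric values of the CSI VolumeCapability.AccessMode.Mode enum."""
    UNKNOWN = 0
    SINGLE_NODE_WRITER = 1
    SINGLE_NODE_READER_ONLY = 2
    MULTI_NODE_READER_ONLY = 3
    MULTI_NODE_SINGLE_WRITER = 4
    MULTI_NODE_MULTI_WRITER = 5
    SINGLE_NODE_SINGLE_WRITER = 6
    SINGLE_NODE_MULTI_WRITER = 7


def describe(mode_number: int) -> str:
    """Translate the number from a CSI error message into its enum name."""
    return CsiAccessMode(mode_number).name


if __name__ == "__main__":
    print(describe(5))
```

Mode 5 is `MULTI_NODE_MULTI_WRITER`, which is what a `ReadWriteMany` claim is translated to, so the error indicates that the driver rejected the RWX request itself rather than some other field of the claim.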
PVC manifest
StorageClass manifest