
EOEPCA+ Infrastructure Deployment Guide should focus on pre-requisites #21

Open
spinto opened this issue Nov 12, 2024 · 6 comments

@spinto

spinto commented Nov 12, 2024

I appreciate the effort to document how to create a cluster satisfying EOEPCA+ needs, but I think what is in the Infrastructure Deployment guide is a bit too much and risks diverting people's attention from the peculiarities of EOEPCA.

So my fear is twofold: one, it will be hard and a bit pointless to maintain a guide on how to install Kubernetes and set up a k8s cluster when there are several on the internet we can point to (like the Rancher k8s non-production environment installation); two, people may just skip that section and assume their existing Kubernetes cluster is enough, while some peculiarities of EOEPCA, like the need to run containers as root, the ReadWriteMany storage, and the specific storage class names for persistence, may get lost.

So my proposal would be to rename "Infrastructure Deployment" to "EOEPCA pre-requisites" and have the following sections there:

  • Kubernetes, where we can still point to instructions on how to install Kubernetes (e.g. the Rancher distribution), but mostly explain what we require/recommend from the Kubernetes installation: execution of containers as root (required), an ingress with wildcard DNS (required), a load balancer with incoming internet traffic on ports 80/443 (recommended), cert-manager (recommended), etc.
  • EBS, where we explain that we strongly recommend block storage attached to the K8S containers, i.e. a ReadWriteMany storage class provisioner for persistence; this is a requirement for some of the BBs, like the CWL processing engine. It can be provided with NFS, OpenEBS or Longhorn, for example (see the sketch after this list).
  • Object Storage, where we explain that we need an S3-compatible object store (in principle any S3 endpoint), and that if you do not have one you can deploy one on K8S.
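To illustrate the ReadWriteMany point above, here is a minimal sketch (using the Python kubernetes client) of the kind of PVC some BBs expect to be able to bind; the namespace and the storage class name "managed-nfs-storage" are placeholders, not the names the guide would mandate:

```python
from kubernetes import client, config

# Load credentials from the local kubeconfig (assumes kubectl access to the cluster).
config.load_kube_config()

# A claim that only binds if the cluster offers a ReadWriteMany-capable provisioner
# under the expected storage class name ("managed-nfs-storage" is a placeholder).
pvc_manifest = {
    "apiVersion": "v1",
    "kind": "PersistentVolumeClaim",
    "metadata": {"name": "eoepca-rwx-check"},
    "spec": {
        "accessModes": ["ReadWriteMany"],
        "storageClassName": "managed-nfs-storage",
        "resources": {"requests": {"storage": "1Gi"}},
    },
}

client.CoreV1Api().create_namespaced_persistent_volume_claim(
    namespace="default", body=pvc_manifest
)
print("PVC created; it should reach the 'Bound' phase if the prerequisite is met.")
```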

Plus, the check-prerequisites script should be more "invasive" and run some tests in the cluster, e.g. running a pod as root, starting a pod service with an ingress and checking if the pod is accessible, checking if the certificate for that pod is correct, etc. A sketch of what I mean is below.
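As an illustration only, a rough Python sketch of two such checks (run-as-root and TLS), using the kubernetes client; the pod name and the test hostname are placeholders, and a real script would also need to create the test service/ingress and do proper error handling:

```python
import socket
import ssl
import time

from kubernetes import client, config

config.load_kube_config()
core = client.CoreV1Api()


def check_run_as_root(namespace="default"):
    """Launch a short-lived busybox pod that asks for UID 0 and verify it got it."""
    pod = {
        "apiVersion": "v1",
        "kind": "Pod",
        "metadata": {"name": "eoepca-root-check"},  # placeholder name
        "spec": {
            "restartPolicy": "Never",
            "containers": [{
                "name": "check",
                "image": "busybox",
                "command": ["id", "-u"],
                "securityContext": {"runAsUser": 0},
            }],
        },
    }
    core.create_namespaced_pod(namespace=namespace, body=pod)
    try:
        for _ in range(30):
            phase = core.read_namespaced_pod(
                name="eoepca-root-check", namespace=namespace
            ).status.phase
            if phase in ("Succeeded", "Failed"):
                break
            time.sleep(2)
        uid = core.read_namespaced_pod_log(
            name="eoepca-root-check", namespace=namespace
        ).strip()
        return uid == "0"
    finally:
        core.delete_namespaced_pod(name="eoepca-root-check", namespace=namespace)


def check_ingress_tls(hostname, port=443):
    """Check that the certificate served for a test ingress hostname is trusted and valid."""
    ctx = ssl.create_default_context()
    try:
        with socket.create_connection((hostname, port), timeout=10) as raw:
            with ctx.wrap_socket(raw, server_hostname=hostname) as tls:
                tls.getpeercert()
        return True
    except (ssl.SSLError, OSError):
        return False


if __name__ == "__main__":
    print("run-as-root:", check_run_as_root())
    print("ingress TLS:", check_ingress_tls("test.eoepca.example.com"))  # placeholder host
```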

@spinto spinto changed the title EOEPCA+ Infrastructure Deployment Guide should be renamed more to pre-requisites EOEPCA+ Infrastructure Deployment Guide should be renamed to pre-requisites Nov 13, 2024
@spinto spinto changed the title EOEPCA+ Infrastructure Deployment Guide should be renamed to pre-requisites EOEPCA+ Infrastructure Deployment Guide should focus on pre-requisites Nov 13, 2024
@james-hinton james-hinton self-assigned this Nov 13, 2024
@spinto
Author

spinto commented Nov 14, 2024

As a note from the discussions in #23 and #14, in the pre-requisites page we should consider putting info about what is recommended for production and what is recommended for development. This is valid for all three areas: the K8S cluster, the EBS storage and the Object Storage.

I would imagine that, for the K8S cluster, in production we would recommend an external IP address, cert-manager with Let's Encrypt and Rancher (production install), while for development/internal testing/demos we would recommend Rancher (single-node install) and manual TLS.
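For the production side, a sketch of what the Let's Encrypt setup could look like, created through the Python kubernetes client (this assumes cert-manager CRDs are already installed; the issuer name, e-mail, secret name and nginx ingress class are placeholders):

```python
from kubernetes import client, config

config.load_kube_config()

# A cert-manager ClusterIssuer pointing at the Let's Encrypt production ACME endpoint.
# Email, secret name and ingress class below are placeholders.
issuer = {
    "apiVersion": "cert-manager.io/v1",
    "kind": "ClusterIssuer",
    "metadata": {"name": "letsencrypt-prod"},
    "spec": {
        "acme": {
            "server": "https://acme-v02.api.letsencrypt.org/directory",
            "email": "ops@example.com",
            "privateKeySecretRef": {"name": "letsencrypt-prod-account-key"},
            "solvers": [{"http01": {"ingress": {"class": "nginx"}}}],
        }
    },
}

client.CustomObjectsApi().create_cluster_custom_object(
    group="cert-manager.io", version="v1", plural="clusterissuers", body=issuer
)
```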

For the EBS, I have run several solutions in the past; in production, IBM Spectrum Scale (proprietary) and GlusterFS (open source) work quite well, while for development Longhorn and OpenEBS are supposed to be much simpler to set up.

For Object Storage too, the EOEPCA MinIO Helm chart is good for development/testing/demos, but for production a standalone MinIO installation or something like the EMC object storage solution (or Amazon S3) is probably a better option.
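Whichever option is chosen, the BBs only need a reachable S3 endpoint plus credentials, so a quick connectivity check looks the same everywhere; a sketch with boto3 (endpoint URL, bucket name and credentials are placeholders):

```python
import boto3

# Placeholder endpoint and credentials; the same code works against MinIO,
# an EMC object store or Amazon S3, since all expose the S3 API.
s3 = boto3.client(
    "s3",
    endpoint_url="https://minio.example.com",
    aws_access_key_id="EXAMPLE_KEY",
    aws_secret_access_key="EXAMPLE_SECRET",
)

s3.create_bucket(Bucket="eoepca-smoke-test")
s3.put_object(Bucket="eoepca-smoke-test", Key="hello.txt", Body=b"hello")
print(s3.get_object(Bucket="eoepca-smoke-test", Key="hello.txt")["Body"].read())
```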

@jdries
Contributor

jdries commented Nov 14, 2024

So you are saying all operational platforms should operate GlusterFS or else a proprietary solution, right? (Unless RWX volumes are offered by the cloud provider?)

@spinto
Author

spinto commented Nov 14, 2024

No, I am not saying that. I am saying that there are several solutions which are proven to be operationally ready; GlusterFS is one of them, but there are others. OpenEBS may be one of them: I have not used it personally, but Fabrice was saying that it is used in operations on different platforms.

@jdries
Contributor

jdries commented Nov 15, 2024

Thanks for all the explanation, it's already helpful!
Anyway, the main concern for operational platforms is to get an idea of what the operational cost will be, and how complex it is to run something like that on an autoscaling cluster in a cloud environment where VMs are ephemeral. From my own experience running a data storage cluster, it does require significantly more expertise and work, but perhaps something modern like OpenEBS solves that. (Even though I hear that cloud providers themselves are also struggling, or have struggled, with providing RWX volumes.)
The other interesting option to explore is CWL runners that avoid the shared storage requirement altogether, but again, I would hope that this has all been researched in the past.

@spinto
Author

spinto commented Nov 15, 2024

About the cost/complexity vs advantages, I think it mostly depends on which kind of applications you want to support. CWL is mostly used in HTC/HPC, so it feels "natural" that a CWL runner would assume, or be configured by default with, shared storage across your nodes... but CWL, even if born in HPC, is just a workflow language and does not per se require distributed storage.

And yes, this was explored in the past: CWL (or OGC Application Package, BTW) does not mean Calrissian. That is what we have in one of the EOEPCA processing BB "engines", but we already have Toil as a CWL runner for another "engine", and Toil, for example, should not require ReadWriteMany if configured with HTCondor as the scheduler. Also, for openEO UDFs, as the use case is not really HTC, you could just have a simple execution via cwltool. We can chat more about what is best.
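For the "simple execution via cwltool" case, a minimal sketch of what I mean, running an application package locally with no shared file system involved (the workflow and parameter file names are placeholders):

```python
import subprocess

# Run a CWL workflow with plain cwltool: inputs, intermediates and outputs stay
# on local disk, so no ReadWriteMany volume is needed. File names are placeholders.
result = subprocess.run(
    ["cwltool", "--outdir", "results/", "app-package.cwl", "params.yml"],
    capture_output=True,
    text=True,
    check=False,
)
print(result.stdout)
print(result.stderr)
```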

NOTE: we are digressing outside the scope of this ticket. For that, as I said before, what we need to ensure is that the documentation is also clear in stating that OpenEBS or other ReadWriteMany solutions are required only by some of the EOEPCA BBs (and we should specify which ones).

@james-hinton
Contributor

I am making progress with this issue on this branch.

Still got quite a bit to go, but I will update this as I go along.
