Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Use s-nail host mail server (FNAL/CERN) #1566

Merged
merged 1 commit into from
Jan 23, 2025
Merged

Conversation

khurtado
Copy link
Contributor

@khurtado khurtado commented Nov 21, 2024

Fixes dmwm/WMCore#12159
Depends on #1573

Issue: Both CERN and FNAL already run mail agents at the host-level. Deploying a mail server inside docker (like exim4) conflicts with the host-server, since both would operate on ports 25. Running an independent mail agent on different port can lead to firewall issues as well.

Solution: Both CERN and FNAL use the same mail agent (s-nail). Therefore, configure the containers to use the host mail agent instead.

@khurtado
Copy link
Contributor Author

I have the following workaround for the exim4 mail server failing inside docker.

In general, the problem is related to the fact that some users are created at build time (exim4 creates a couple of users), that we lose when we bind mount the users from the host, but we need the latter in order to run as e.g.: cmst1, etc (since the host at CERN for example runs s-nail rather than exim4, their local users do not include anything related to exim4).

We also had issues with cron due to this bind mounting approach which we are working around already.

My opinion is that this change here should workaround exim4, but in the future a general solution would be to stop bind mounting /etc/group and /etc/passwd into the docker container and somehow create them at build time, in order to avoid services like cron or exim4 failing.

@todor-ivanov
Copy link
Contributor

My opinion is that this change here should workaround exim4, but in the future a general solution would be to stop bind mounting /etc/group and /etc/passwd into the docker container and somehow create them at build time, in order to avoid services like cron or exim4 failing.

I completely agree with you Kenyi

Copy link
Contributor

@todor-ivanov todor-ivanov left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

looking good

@amaltaro
Copy link
Contributor

@khurtado can you please also test this at FNAL? Once we confirm this is working in both CERN and FNAL (and virtual env, but Todor already confirmed it works for him), then we can move forward and also work on a new hot-patch for the agents.

@khurtado
Copy link
Contributor Author

khurtado commented Nov 22, 2024

@amaltaro It seems I will need to work a bit more on this. The test at FNAL failed. The reason is both s-nail at the host and exim4 in docker are running on port 25 and we use the host network. I will need to work that around as well (in the config)

@amaltaro
Copy link
Contributor

amaltaro commented Jan 9, 2025

@khurtado I am not sure I have already mentioned this, but if needed, we should also consider reaching out to the CERN/FNAL teams and request new packages/services/etc to be installed. In case that would help to resolve this issue.

@khurtado khurtado requested a review from todor-ivanov January 9, 2025 04:01
@khurtado khurtado changed the title Fix exim4 mail server Use s-nail host mail server (FNAL/CERN) Jan 9, 2025
@khurtado
Copy link
Contributor Author

khurtado commented Jan 9, 2025

@amaltaro @todor-ivanov
I gave up on fixing exim4 because

  1. Running on port 25 is not possible, since it is already taken by the host mail agent (even though it worked at CERN somehow)
  2. Using a different port would require firewall settings at the host-level

So, I changed the solution to use the host-level mail agent instead (re-adding system users/groups for this to work well was still needed).

This current solution is working both at CERN and FNAL.

There is one thing I would likely change though:
This line:
https://github.com/dmwm/CMSKubernetes/pull/1566/files#diff-d276e67ac0a9cb57c1f967acb04ff6f223d21bb3cded596c8de279941c934dcfR178

Installs the s-nail package at runtime (this package has 500KB in size and installs no additional dependencies. When I checked with "docker ps -s", the additional size to the container was about 5MB total).
I would rather add this line as part of the wmagent-base image:

https://github.com/dmwm/CMSKubernetes/blob/master/docker/pypi/wmagent-base/Dockerfile

But of course, for that I would need to

  1. Add that line as a different PR
  2. Wait for someone to approve and merge
  3. Get help to produce a new wmagent-base uploaded to the CERN registry
  4. Adjust this line accordingly to use the new base
    https://github.com/dmwm/CMSKubernetes/blob/master/docker/pypi/wmagent/Dockerfile#L2

If you guys approve the rest of the code here and prefer the s-nail installation in the base image, let me know and I can create the PR to proceed with the steps above.
If you guys think the PR is good as is, then I'm fine with it as well.

@amaltaro
Copy link
Contributor

amaltaro commented Jan 9, 2025

Given all the problems we had with this cronjob and email setup since the migration to dockers, I am starting to think that we should make a little extra effort and integrate all this logic into AgentStatusWatcher.
Having it implemented in python will give us more visibility and easier maintenance.

The open question is, how do we notify our team upon component automatic restart? I think we would have the same problems, as the python library(ies) are probably using the system mail system.
We could look into making this notification through Prometheus alerts, but AFAIK it is not open to nodes outside CERN. Any other suggestions?

Any thoughts?

@vkuznet
Copy link
Collaborator

vkuznet commented Jan 9, 2025

Alan, please keep in mind the following:

  • implementing in Python does not mean it will be easier to maintain. Over time learnt that python implementation is not the easiest to maintain, quite the opposite (sorry, it is my opinion). Too many complexities, dependencies, run-time environment, setup, etc.
  • What if we'll eventually write a service in another language, or use another (Python) framework. How custom python module will be integrated and handle this?
  • It is unclear what are requirements?

Here is my suggestion:

  • write small HTTP server (preferably in Go) and put it on CMSWEB
    • use ANY HTTP client to make POST request to such service
    • use HTTP GET API to get alerts as JSON docs
    • optionally: write service web UI to display alerts as dashboard (there are plenty of ready-to-use JS libraies)
    • optionally, add redirect API to route alerts to CMSMonitoring Prometheus and then use Prometheus/AM for alerts routing

In Go, it would be one page server, static executable (i.e. no run-time dependencies), it will be accessible in CMSWEB, it will be reachable from anywhere, it will work with any language via HTTP protocol, it will support customizable auth/authz as any other services we run, etc. In my view, it is the easiest and most portable and scalable solution.

Here is an example of working one-page alerts web server and here is how it can be used in any programming language (here is curl shows its functionality):

  • POST an alert with curl -X POST -H "Content-Type: application/json" -d '{"message":"Alert 1"}' http://localhost:8080/alerts
  • Retrieve alerts with curl http://localhost:8080/alerts

@khurtado
Copy link
Contributor Author

khurtado commented Jan 9, 2025

@amaltaro I think we should first put some effort into fixing the root problem both this issue and the cron issue had in common: getting rid of the host-level passwd/group binding for the production user/group uuids inside docker. I say this because we may encounter other issues in the future with other services because of this binding trick.

For the mail issue, considering this solution is working and uses the host-level mail agent, I would vote for using this now and perhaps open another issue with the tag "new feature", with a feature migration proposal, and assign a priority accordingly. In my opinion, while reimplementing this logic would be nice, if the goal is to make maintenance easier, then getting rid of the binding trick would remove a lot of the hacks we are doing at runtime and make maintenance much easier already though.

@todor-ivanov
Copy link
Contributor

hi @khurtado

In my opinion, while reimplementing this logic would be nice, if the goal is to make maintenance easier, then getting rid of the binding trick would remove a lot of the hacks we are doing at runtime and make maintenance much easier already though.

I couldn't agree more!

Copy link
Contributor

@todor-ivanov todor-ivanov left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hi @khurtado I had only two comments inline concerning readability of the code, but not game changing ones. In case we decide to continue in that direction consider my approval of this PR.

@@ -148,3 +172,13 @@ userStatus="$(docker exec -u root -it wmagent sh -c "passwd -S $wmaUser" | awk '
if [ "${userStatus:0:1}" == "P" ]; then
docker exec -u root -it wmagent sh -c "echo $wmaUser:$wmaUser | chpasswd"
fi

# Install s-nail
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This indeed would be better to go into the Docker file, either wmagent or wmbase. Which might also alleviate/remove the need for the package configuration here.

docker/pypi/wmagent/wmagent-docker-run.sh Outdated Show resolved Hide resolved
docker/pypi/wmagent/wmagent-docker-run.sh Outdated Show resolved Hide resolved
@khurtado
Copy link
Contributor Author

khurtado commented Jan 10, 2025

@todor-ivanov Thank you! I added implemented your change proposals for the nested conditionals.
@amaltaro If you are okay with the changes as is, we could merge it. If you agree with installing s-nail (500 KB package) in wmagent-base (or wmagent itself, but I would rather do it in the base image, because the wmagent Dockerfile has no package installations, so I would like to keep it consistent), then I can proceed with a new PR for that part.

@todor-ivanov todor-ivanov self-requested a review January 13, 2025 07:38
Copy link
Contributor

@todor-ivanov todor-ivanov left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

thanks @khurtado

@amaltaro
Copy link
Contributor

@khurtado can you please update this PR (Dockerfile and scripts) with the new image tag available in CERN Registry? Please see: #1573 (comment)

@khurtado
Copy link
Contributor Author

@amaltaro I just made the changes and tested with 2.3.8rc10 (which has the latest htcondor python bindings, as 10.9.0 is not on pypi anymore and couldn't build the image).
Everything worked fine.

Copy link
Contributor

@amaltaro amaltaro left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@khurtado thank you for making the relevant changes and testing them out.

I left one comment along the code, which I think it would provide further improvements to the way we are dealing with this mail service setup. In case we ever need to review this, we might consider that.

@@ -148,3 +169,10 @@ userStatus="$(docker exec -u root -it wmagent sh -c "passwd -S $wmaUser" | awk '
if [ "${userStatus:0:1}" == "P" ]; then
docker exec -u root -it wmagent sh -c "echo $wmaUser:$wmaUser | chpasswd"
fi

# Configure s-nail to use the host s-nail mail server
docker exec -u root -it wmagent cp /etc/s-nail_host.rc /etc/s-nail.rc
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I was thinking that we could keep the modified s-nail.rc under $HOST_MOUNT_DIR/admin/etc/ and simply mount it from there to /etc/s-nail.rc. This way we could do some of this tweaking only once in a lifetime of an agent.

The update-alternatives seem to be still required though - but perhaps it could go to the wmagent Dockerfile?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Regarding update-alternatives, I agree this could be a further improvement.
Regarding keeping the modified s-nail.rc in $HOST_MOUNT_DIR/admin/etc/. What I'm afraid of in this case is that a s-nail update in the host could modify the host config and would not be in sync anymore (with a potential failure). We would need to add a verification step to make sure the config we have in $HOST_MOUNT_DIR/admin/etc/ without the tweaks match the host config, or update it and then tweak it otherwise.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

That's a good point! Thank you for this clarification, Kenyi.

@khurtado
Copy link
Contributor Author

@amaltaro Thank you!

@amaltaro
Copy link
Contributor

Let me merge this before we forget. This is a great timing as well, as we are about to release a final stable release for WMAgent.

@amaltaro amaltaro merged commit 1338a03 into dmwm:master Jan 23, 2025
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

can not send email from wmagent docker container
4 participants