Telegraf Generating Orphaned DBus Processes on RHEL Servers #2 #13635
Comments
@powersj |
As an aside, it looks like so far this issue appears to be absent from 1.24.2. |
Here are the dbus details we are seeing on the server:
|
telegraf 5850 1 0 06:54 ? 00:00:02 /usr/bin/telegraf -pidfile /var/run/telegraf/telegraf.pid -config /etc/telegraf/telegraf.conf -config-directory /etc/telegraf telegraf.d
telegraf 5866 1 0 06:54 ? 00:00:00 /usr/bin/dbus-daemon --fork --print-pid 4 --print-address 6 --session
root 11702 24282 0 07:34 pts/0 00:00:00 grep --color=auto -i telegraf |
Is there a way to disable the secret store completely? We don't use it and some component related to it seems to be causing the issues. |
Thanks for the issue and logs. Are you seeing this across RHEL 6, 7, and 8 this time? Or only RHEL 6? I have got a RHEL 7 VM up looping over telegraf with
Only with a custom build of Telegraf. Assuming that the issue is with the same code of the secret store as last time, that dbus command runs in the |
Thanks, Joshua, for your reply. I see this issue only on RHEL 6. I have the latest version deployed on RHEL 7/8, and there I don't see any issue with DBus.
|
@crflanigan, @elangovanseshan, if you must have the newer version of Telegraf on RHEL 6, my suggestion is to consider building Telegraf with the custom builder. The result would be a ~23 MB binary containing only the plugins you need, and the secret store plugins would not be present.
git clone https://github.com/influxdata/telegraf
cd telegraf
go build -o ./tools/custom_builder/custom_builder ./tools/custom_builder
./tools/custom_builder/custom_builder --config <conf_file> --config-dir <conf_dir>
Would this be an option for you? |
Hi @powersj, We can look at that. Is Telegraf not supported on RHEL 6? If so, what was the last release where it was supported? Thanks! |
Thank you, @powersj. Let me try the custom builder without the secret store plugins. |
Internally to Telegraf, the secret stores are treated like the other plugins, so you can build Telegraf without them.
We have a published doc for supported platforms, which essentially says we support OSes that are under standard support. In line with that, RHEL 6 stopped being supported at the end of 2020. RHEL 7 support will end in June 2024. While we will not go out of our way to break any previous releases, if we do make a change that breaks them, we are less inclined to revert it, and we will not continue to test on them. |
OK @powersj, it sounds like patching this issue is unlikely since it's occurring on an unsupported OS, is that right? Thanks! |
If you proposed a PR or an idea to get around this we would certainly consider it. We are not going to completely close the door on a fix. |
Fair enough, thanks buddy! |
Thanks, @powersj! custom_builder is working fine for me. I passed the sample conf file to build the binary; it contains the cpu, disk, diskio, exec, mem, net, swap, and system input plugins, and it's working fine. Multiple internal teams use input plugins other than the ones I mentioned above, so building the binary with a limited set of input plugins would affect those internal customers. We would therefore like to build the custom binary with all input plugins but without the secret store plugins. Is there a way to build it without passing a conf file for each input plugin, or can we build with dummy conf files that omit the secret store plugins? |
You can get a list of all the input plugins by generating the default config and grep'ing out all the input headers:
You could then add that to your example config or pass that as a second file to the custom builder. You could also use the various build tags to build telegraf as the customization docs show using
If you do start to go this route, please ensure you include everything you actually need ;) It is easy to forget or not realize you are using a serializer for example. This is why I like the custom builder + an actual config better. |
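As a rough illustration of the header-listing step described above, here is a minimal sketch. It assumes the default config is printed by `telegraf config` and that plugin headers have the form `[[inputs.cpu]]`, possibly commented out with `#`. The function name `list_input_headers` is hypothetical and not part of Telegraf.

```shell
# Hypothetical helper: read a Telegraf config on stdin and print each
# distinct input plugin header once, whether or not it is commented out.
list_input_headers() {
  grep -Eo '\[\[inputs\.[A-Za-z0-9_]+\]\]' | sort -u
}
```

It could be used as `telegraf config | list_input_headers > all_inputs.conf`, with the resulting file passed to the custom builder as an additional config. Serializers, parsers, and outputs would still need to be listed separately.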
@powersj our initial testing is working fine with a custom Telegraf build containing limited input and output plugins, and there is no evidence of the dbus process. I would also like to know how we can add serializers to the custom build. I added the required inputs, outputs, aggregators, and processors through the example conf, but I am not sure about serializers. Do we need to pass them through the conf file, or is there another option? |
You can reference any of the serializers the same way. For example, if you want only the JSON serializer you can add The way to determine these build tags is to look in each plugin's Does that help? |
Thank you, @powersj, let me try this out. One more thing for your information: I initially reported that the dbus issue happens only on RHEL 6 servers, but we have had the issue on RHEL 7/8 as well. So we are planning to go with a custom Telegraf build with limited plugins. |
@elangovanseshan, @crflanigan,
Sorry I never responded to this. Looking at the mentioned gosnowflake issue it looks like a workaround is setting For Telegraf, I am inclined to document this and link to the still open upstream issue. Thoughts? |
Hi @powersj, Sorry for the delayed response. I actually commented on one of these issues for keyring and got a notification this morning that they may have resolved it? Seems like a lot of people use this library. What do you think? |
Hey @crflanigan, Did someone delete their comment? Latest I see is from Apr 12, 2023. |
@powersj I think @crflanigan was referring to snowflakedb/gosnowflake#773 (comment) BTW, I now have that message even when not using
This does not happen with telegraf 1.30.3 |
After upgrading telegraf to 1.31.0 all of our hosts seem to report the
|
@trauta Indeed, it also seems to happen on RHEL, and I warned the maintainers about it two weeks ago: https://influxcommunity.slack.com/archives/C019JDRJAE7/p1717146621896149 |
When this was only on RHEL 6/7/8 I was less concerned about this especially given the upcoming EOL date. However, this is also appearing on newer releases as well (e.g. Ubuntu Noble). The root cause is from the keyring dependency. We use this for secret store, but it appears the snowflake library we use also does. The keyring library has not been updated and does not appear to be planned to update anytime soon. Even if we moved to a fork, the snowflake library also uses it. I haven't played with the go replace enough, but it may be possible to use it? The warning message from snowflake does tell you what to do to get rid of the message, so I don't consider this critical to fix, but it is something we are looking to figure out how to address. |
I am seeing this on my setup on the current version of Debian (12, Bookworm), too. I don't see the warning about |
@l33, @crflanigan or @elangovanseshan could anyone please test the binary in #15860 to verify it solves this issue!? |
Closing this issue as it is likely solved by #15860. If I'm wrong, please reopen the issue or drop me a note so I can reopen it. |
That PR is not merged yet. Better would have been to add a |
I just installed telegraf_1.33.0-1_amd64.deb (fresh install) from repos.influxdata.com on Debian unstable. Telegraf is still leaving behind dbus-daemon processes, one per telegraf (re)start. No mention of dbus in debug log. I'm new to Telegraf, but let me know if I can somehow help with debugging the issue. |
@pyksy do you have some idea on what exactly is opening DBUS? Could you post your (redacted) config? |
@srebhan unfortunately I have no idea why Telegraf spawns dbus-session processes and leaves them hanging around. I haven't really investigated much. Here's my uncommented telegraf.conf (nothing in telegraf.d)
Here's how the problem manifests:
|
Just to be sure, does this also happen if you are starting Telegraf manually using the following command
i.e. without using the service? |
Thanks for the hint @srebhan. I did some more digging and it depends on whether the DBUS_SESSION_BUS_ADDRESS environment variable is set: launching the daemon as the telegraf user with the variable unset will leave a stray dbus-daemon (edit: namefix) process behind.
Running telegraf as root will not leave a stray dbus-daemon process. I still run SysV init on this box. If I launch telegraf by running the init script directly (/etc/init.d/telegraf start), it does not launch the extra dbus-daemon, because the launching root shell has a valid DBUS_SESSION_BUS_ADDRESS environment variable set that gets passed to telegraf. If I use the service command (service telegraf start), then the extra dbus-daemon gets spawned, because the service command runs the init script in a clean environment with DBUS_SESSION_BUS_ADDRESS unset. I guess not that many people run Debian with sysvinit nowadays, so the impact probably isn't that big. It might be a concern to folks running Devuan (the systemd-free Debian fork), but I haven't tested it personally. |
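To check which of the two cases above applies to an already running Telegraf, one option is to inspect the process environment. The sketch below assumes a Linux system where `/proc/<pid>/environ` is readable (as root or as the same user) and holds NUL-separated `NAME=value` entries. The function name `dbus_addr_from_environ` is hypothetical.

```shell
# Hypothetical helper: given a NUL-separated environ file, print the
# DBUS_SESSION_BUS_ADDRESS entry, or a note saying that it is unset.
dbus_addr_from_environ() {
  environ_file=$1
  # Turn NUL separators into newlines so grep can match whole entries.
  value=$(tr '\0' '\n' < "$environ_file" | grep '^DBUS_SESSION_BUS_ADDRESS=' | head -n 1)
  if [ -n "$value" ]; then
    printf '%s\n' "$value"
  else
    printf 'DBUS_SESSION_BUS_ADDRESS is unset\n'
  fi
}
```

With the pidfile path from the `ps` output earlier in the thread, a call might look like `dbus_addr_from_environ "/proc/$(cat /var/run/telegraf/telegraf.pid)/environ"`. If the variable is reported as unset, a stray dbus-daemon would be expected according to the observation above.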
Would the workaround setting I'm asking because digging into the code, it seems like solving this would involve replacing an (unmaintained) dependency which is a bigger issue... |
@srebhan yes, it does :) and could be a quick fix in scripts/init.sh. Thanks!
|
Could you please open a pull request with that fix!? |
Relevant telegraf.conf
Logs from Telegraf
System info
telegraf-1.27.2, running on Linux 2.6.32-754.50.1.el6.x86_64
Docker
No response
Steps to reproduce
Reproducing has been tricky as it doesn't always appear to occur, but on systems that were impacted (hundreds+) reverting Telegraf to an earlier version, stopping the Telegraf service and removing the orphaned process, or performing the below actions resolved the issue.
What we have seen:
Upgrading Telegraf from version 1.14 to 1.25.2 on RHEL servers seems to create an issue where DBus generates many orphaned processes. This eventually causes the system to hit the ceiling of available PIDs. Rolling back to 1.14 seems to clear the problem.
Example from one of our systems:
ps -ef|grep dbus|grep -v grep|wc -l
1459
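The count above includes every process whose command line mentions dbus. A narrower check, sketched below, counts only dbus-daemon processes that are owned by the Telegraf service user and have been reparented to PID 1, which matches the `ps` listing shown earlier in the thread. It assumes a procps-style `ps` that accepts `-eo user=,ppid=,pid=,args=`. The function name `count_orphaned_dbus` and the default user `telegraf` are illustrative choices.

```shell
# Hypothetical helper: read "user ppid pid args..." lines on stdin and count
# dbus-daemon processes owned by the given user whose parent is PID 1.
count_orphaned_dbus() {
  awk -v user="${1:-telegraf}" '
    $1 == user && $2 == 1 && $4 ~ /dbus-daemon$/ { n++ }
    END { print n + 0 }
  '
}
```

A typical invocation would be `ps -eo user=,ppid=,pid=,args= | count_orphaned_dbus telegraf`. Running it before and after a Telegraf restart should show whether each restart leaves another process behind.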
According to issue #13481, this was resolved in the recent release telegraf-1.27.2, but we are experiencing the same issue with that release as well.
Expected behavior
Telegraf works as expected.
Actual behavior
Telegraf inadvertently creates thousands of orphaned DBus processes, which eventually causes the number of PIDs in use to hit the maximum and leads to system degradation.
Additional info
No response