collector logging running amok #935
Comments
Thanks for sharing this @Elyytscha, I'm sorry that this caused you to have to delete stackrox. To help in our investigation, can you share any relevant details about the deployments running on the 3/10 affected nodes? Also, are you using the GKE stable release channel?
Yeah, it was just an evaluation setup, so we didn't run this in production and it's not that critical that we deleted it, but you will agree that it's a bad situation when this happens during an evaluation. Yes, we are on the GKE stable release channel. For relevant details about the deployments I have to investigate a little bit.
@Elyytscha, thanks for confirming the release channel. We will make sure to fix the logging flood soon. Are you aware of a common factor among the 3 nodes having the issue? Is something different about them relative to the other 7? We will try to reproduce the issue, but in the meantime it would greatly help if you could provide us with the result of
@Elyytscha, based on the early analysis of this issue, it seems that you should be able to work around it by disabling scraping (turnOffScrape) via the collector config: https://github.com/stackrox/collector/blob/master/docs/references.md#collector-config
Hello, first I wanted to say thanks for your help! Getting answers this fast is not a given in open source projects.
No, those nodes are from the same resource pool; the only difference from the other nodes could be the pods which get scheduled onto them. Maybe one relevant piece of information is that we use Calico as the SDN. This is the info I can give about the pods running on one of the affected nodes (but it's possible that the pods which caused this have already been scheduled to another node).
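(For reference, a per-node pod listing like this can be gathered with a command along these lines; a sketch for illustration only, assuming kubectl access to the cluster, with the node name as a placeholder.)

```bash
# List all pods scheduled on one of the affected nodes (placeholder node name).
kubectl get pods --all-namespaces -o wide \
  --field-selector spec.nodeName=<affected-node>
```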
Be prepared, the logfile I got for this is long. It's so big that I had to split it into two files, but I think you will find what you want to see in there, because there are also 'No such file or directory' messages.
I have some questions about that:
We debugged a little bit down the rabbit hole. What we found out is that this happens on nodes where we run static Jenkins agents which build Docker containers in containerd. The PIDs for which the error appears within stackrox look like this:
and we think it's related to this problem in containerd:
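(Added for illustration: a rough way to check a node for this condition, under the assumption from the comments above that defunct container processes are involved. The PID is a placeholder, not taken from the reporter's listing.)

```bash
# List zombie/defunct processes (state "Z") on the node.
ps -eo pid,ppid,stat,comm | awk '$3 ~ /Z/'

# For such a PID (placeholder 12345), resolving the network-namespace link the
# way a /proc scraper does would fail, which is consistent with the
# "No such file or directory" message in this issue.
readlink /proc/12345/ns/net || echo "network namespace link not resolvable"
```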
Good to hear that you may have identified the root cause. Setting turnOffScrape to true will solve the problem with the logging statement, but finding and fixing the root cause is the best option. turnOffScrape has not been extensively tested and is used for internal debugging, and when you set it to true you will lose information about endpoints and connections formed before collector was started. You can set it via Helm charts, but that is not the method I would recommend: you have to be careful when doing it that way, as COLLECTOR_CONFIG is used to set a few different parameters and you want to set all of those correctly, not just turnOffScrape. A command you could run to set turnOffScrape is:
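(The exact command is not reproduced here; the following is a sketch only, assuming the default stackrox namespace and the collector DaemonSet with collector as the first container.)

```bash
# Inspect the current COLLECTOR_CONFIG first -- it also carries tlsConfig and
# other settings that have to be preserved, not just turnOffScrape.
kubectl -n stackrox get ds/collector \
  -o jsonpath='{.spec.template.spec.containers[0].env[?(@.name=="COLLECTOR_CONFIG")].value}'

# Then re-apply the same JSON with "turnOffScrape": true merged in
# (the tlsConfig content below is a placeholder for the existing values).
kubectl -n stackrox set env ds/collector \
  COLLECTOR_CONFIG='{"tlsConfig":{ ...existing values... },"turnOffScrape":true}'
```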
To use Helm charts you have to set "tlsConfig" correctly. See https://github.com/stackrox/helm-charts/blob/5cd826a14d7c30d1b7ca538b4ff71d1723339a2c/3.72.2/secured-cluster-services/templates/collector.yaml#L[…]2
Before your latest comment I thought that the problem might be that your resource limits and requests in namespaces other than stackrox were too low. It might still be worth looking into that.
We fixed it. The issue was due to our old Docker-in-Docker container build system: for new systems we actually use Kaniko, but for old legacy systems there are Docker builds via dockerd in a containerd system. We fixed it basically thanks to this comment: Docker has added tini as docker-init in their container, and we used it with:
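(For illustration only: the standard ways of enabling Docker's bundled init, docker-init/tini, which reaps zombie processes inside a container. This is not necessarily the exact invocation used here, and the image name is a placeholder.)

```bash
# Per container:
docker run --init my-build-image

# Or daemon-wide in /etc/docker/daemon.json, so every container gets an init:
# {
#   "init": true
# }
```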
After this we had no zombie/defunct processes anymore after Docker-in-Docker builds had run. Actually, one could argue that stackrox showed us an issue in our k8s cluster, but the way it showed us the issue basically produced another issue (flooding our log system with an unnecessary amount of logs). Still, I think it would be a good idea to limit the log messages that stackrox, or rather collector, produces in such a situation. The situation appears when there are zombie/defunct processes from old containers which somehow don't get reaped.
Glad to hear that you resolved your problem, and thanks for bringing this to our attention. Based on this experience we plan a few improvements to collector, including throttling of logging statements and better handling of defunct processes.
Our aggregated logging was exploding due to stackrox collector logging. Over the last few days there were millions of lines like this from 3 of our 10 collector nodes.
To mitigate this as fast as possible, we had to delete stackrox.
This is the error message that was present millions of times; it appeared on 3 of the 10 collector nodes about 6-10 times every millisecond:
[E 20221209 165715 ConnScraper.cpp:415] Could not determine network namespace: No such file or directory
It was deployed like this:
/bin/bash <(curl -fsSL https://raw.githubusercontent.com/stackrox/stackrox/master/scripts/quick-helm-install.sh)
on GKE v1.24