Investigate why ELK stack crashed #14

Open

tschaffter opened this issue Aug 10, 2021 · 9 comments

tschaffter commented Aug 10, 2021

Today only the container docker-elk_logspout_1 was running, though it was constantly restarting. The three containers listed below were stopped.

  • docker-elk_logstash_1
  • docker-elk_kibana_1
  • docker-elk_elasticsearch_1

I ran docker-compose stop followed by docker-compose up. The three containers above were then running, but docker-elk_logspout_1 was still restarting. A couple of hours later (now), docker-elk_logspout_1 seems to be running fine.
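
For reference, a minimal sketch of the recovery and status-check commands (run from the docker-elk compose directory; the --filter value is an assumption based on the container names above):

$ docker-compose ps                        # confirm which services are stopped or restarting
$ docker-compose stop && docker-compose up -d
$ docker ps --filter "name=docker-elk"     # all four containers should now be Up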

Task

@tschaffter tschaffter self-assigned this Aug 10, 2021

tschaffter commented Aug 10, 2021

Kernel logs are being flooded with the following messages. This has been happening for several hours, several times per minute. The last occurrence was about 3 hours ago.

Note: I currently have the same issue with a personal project where I'm running a stack with docker-compose.

$ sudo cat /var/log/kern.log
Aug 10 18:40:43 ip-10-255-21-47 kernel: [7033680.045360] br-3563230c915f: port 1(veth57e5ce0) entered disabled state
Aug 10 18:40:43 ip-10-255-21-47 kernel: [7033680.045433] device veth57e5ce0 entered promiscuous mode
Aug 10 18:40:43 ip-10-255-21-47 kernel: [7033680.365546] eth0: renamed from veth8e92727
Aug 10 18:40:43 ip-10-255-21-47 kernel: [7033680.381452] IPv6: ADDRCONF(NETDEV_CHANGE): veth57e5ce0: link becomes ready
Aug 10 18:40:43 ip-10-255-21-47 kernel: [7033680.381537] br-3563230c915f: port 1(veth57e5ce0) entered blocking state
Aug 10 18:40:43 ip-10-255-21-47 kernel: [7033680.381539] br-3563230c915f: port 1(veth57e5ce0) entered forwarding state
Aug 10 18:40:48 ip-10-255-21-47 kernel: [7033685.481871] br-3563230c915f: port 1(veth57e5ce0) entered disabled state
Aug 10 18:40:48 ip-10-255-21-47 kernel: [7033685.481968] veth8e92727: renamed from eth0
Aug 10 18:40:48 ip-10-255-21-47 kernel: [7033685.519298] br-3563230c915f: port 1(veth57e5ce0) entered disabled state
Aug 10 18:40:48 ip-10-255-21-47 kernel: [7033685.524291] device veth57e5ce0 left promiscuous mode
Aug 10 18:40:48 ip-10-255-21-47 kernel: [7033685.524295] br-3563230c915f: port 1(veth57e5ce0) entered disabled state
Aug 10 18:41:20 ip-10-255-21-47 kernel: [7033716.737296] br-3563230c915f: port 1(veth06320fa) entered blocking state
Aug 10 18:41:20 ip-10-255-21-47 kernel: [7033716.737298] br-3563230c915f: port 1(veth06320fa) entered disabled state
Aug 10 18:41:20 ip-10-255-21-47 kernel: [7033716.737376] device veth06320fa entered promiscuous mode
Aug 10 18:41:20 ip-10-255-21-47 kernel: [7033717.115059] eth0: renamed from vethf25aa97
Aug 10 18:41:20 ip-10-255-21-47 kernel: [7033717.130071] IPv6: ADDRCONF(NETDEV_CHANGE): veth06320fa: link becomes ready
Aug 10 18:41:20 ip-10-255-21-47 kernel: [7033717.130146] br-3563230c915f: port 1(veth06320fa) entered blocking state
Aug 10 18:41:20 ip-10-255-21-47 kernel: [7033717.130148] br-3563230c915f: port 1(veth06320fa) entered forwarding state
Aug 10 18:41:20 ip-10-255-21-47 kernel: [7033717.196003] br-3563230c915f: port 2(veth387fab8) entered blocking state
Aug 10 18:41:20 ip-10-255-21-47 kernel: [7033717.196006] br-3563230c915f: port 2(veth387fab8) entered disabled state
Aug 10 18:41:20 ip-10-255-21-47 kernel: [7033717.196128] device veth387fab8 entered promiscuous mode
Aug 10 18:41:20 ip-10-255-21-47 kernel: [7033717.196349] br-3563230c915f: port 2(veth387fab8) entered blocking state
Aug 10 18:41:20 ip-10-255-21-47 kernel: [7033717.196350] br-3563230c915f: port 2(veth387fab8) entered forwarding state
Aug 10 18:41:20 ip-10-255-21-47 kernel: [7033717.411681] br-3563230c915f: port 3(vethb438dc0) entered blocking state
Aug 10 18:41:20 ip-10-255-21-47 kernel: [7033717.411683] br-3563230c915f: port 3(vethb438dc0) entered disabled state
Aug 10 18:41:20 ip-10-255-21-47 kernel: [7033717.411747] device vethb438dc0 entered promiscuous mode
Aug 10 18:41:20 ip-10-255-21-47 kernel: [7033717.411834] br-3563230c915f: port 3(vethb438dc0) entered blocking state
Aug 10 18:41:20 ip-10-255-21-47 kernel: [7033717.411836] br-3563230c915f: port 3(vethb438dc0) entered forwarding state
Aug 10 18:41:21 ip-10-255-21-47 kernel: [7033717.724602] eth0: renamed from veth6614e9f
Aug 10 18:41:21 ip-10-255-21-47 kernel: [7033717.738921] br-3563230c915f: port 2(veth387fab8) entered disabled state
Aug 10 18:41:21 ip-10-255-21-47 kernel: [7033717.739036] br-3563230c915f: port 3(vethb438dc0) entered disabled state
Aug 10 18:41:21 ip-10-255-21-47 kernel: [7033717.740785] IPv6: ADDRCONF(NETDEV_CHANGE): veth387fab8: link becomes ready
Aug 10 18:41:21 ip-10-255-21-47 kernel: [7033717.740836] br-3563230c915f: port 2(veth387fab8) entered blocking state
Aug 10 18:41:21 ip-10-255-21-47 kernel: [7033717.740838] br-3563230c915f: port 2(veth387fab8) entered forwarding state
Aug 10 18:41:21 ip-10-255-21-47 kernel: [7033717.943298] eth0: renamed from vethf4ab668
Aug 10 18:41:21 ip-10-255-21-47 kernel: [7033717.966464] IPv6: ADDRCONF(NETDEV_CHANGE): vethb438dc0: link becomes ready
Aug 10 18:41:21 ip-10-255-21-47 kernel: [7033717.966538] br-3563230c915f: port 3(vethb438dc0) entered blocking state
Aug 10 18:41:21 ip-10-255-21-47 kernel: [7033717.966540] br-3563230c915f: port 3(vethb438dc0) entered forwarding state
Aug 10 18:41:48 ip-10-255-21-47 kernel: [7033745.478319] br-3563230c915f: port 4(veth66a7447) entered blocking state
Aug 10 18:41:48 ip-10-255-21-47 kernel: [7033745.478321] br-3563230c915f: port 4(veth66a7447) entered disabled state
Aug 10 18:41:48 ip-10-255-21-47 kernel: [7033745.478388] device veth66a7447 entered promiscuous mode
Aug 10 18:41:48 ip-10-255-21-47 kernel: [7033745.478468] br-3563230c915f: port 4(veth66a7447) entered blocking state
Aug 10 18:41:48 ip-10-255-21-47 kernel: [7033745.478470] br-3563230c915f: port 4(veth66a7447) entered forwarding state
Aug 10 18:41:48 ip-10-255-21-47 kernel: [7033745.478801] br-3563230c915f: port 4(veth66a7447) entered disabled state
Aug 10 18:41:49 ip-10-255-21-47 kernel: [7033746.041035] eth0: renamed from veth30a2e5f
Aug 10 18:41:49 ip-10-255-21-47 kernel: [7033746.055140] IPv6: ADDRCONF(NETDEV_CHANGE): veth66a7447: link becomes ready
Aug 10 18:41:49 ip-10-255-21-47 kernel: [7033746.055208] br-3563230c915f: port 4(veth66a7447) entered blocking state
Aug 10 18:41:49 ip-10-255-21-47 kernel: [7033746.055211] br-3563230c915f: port 4(veth66a7447) entered forwarding state
Aug 10 18:41:49 ip-10-255-21-47 kernel: [7033746.324061] br-3563230c915f: port 4(veth66a7447) entered disabled state
Aug 10 18:41:49 ip-10-255-21-47 kernel: [7033746.325787] veth30a2e5f: renamed from eth0
Aug 10 18:41:49 ip-10-255-21-47 kernel: [7033746.375076] br-3563230c915f: port 4(veth66a7447) entered disabled state
Aug 10 18:41:49 ip-10-255-21-47 kernel: [7033746.379709] device veth66a7447 left promiscuous mode
Aug 10 18:41:49 ip-10-255-21-47 kernel: [7033746.379713] br-3563230c915f: port 4(veth66a7447) entered disabled state
Aug 10 18:42:49 ip-10-255-21-47 kernel: [7033806.326297] br-3563230c915f: port 4(veth76b4d20) entered blocking state
Aug 10 18:42:49 ip-10-255-21-47 kernel: [7033806.326300] br-3563230c915f: port 4(veth76b4d20) entered disabled state
Aug 10 18:42:49 ip-10-255-21-47 kernel: [7033806.326361] device veth76b4d20 entered promiscuous mode
Aug 10 18:42:49 ip-10-255-21-47 kernel: [7033806.326432] br-3563230c915f: port 4(veth76b4d20) entered blocking state
Aug 10 18:42:49 ip-10-255-21-47 kernel: [7033806.326433] br-3563230c915f: port 4(veth76b4d20) entered forwarding state
Aug 10 18:42:49 ip-10-255-21-47 kernel: [7033806.327497] br-3563230c915f: port 4(veth76b4d20) entered disabled state
Aug 10 18:42:50 ip-10-255-21-47 kernel: [7033806.651766] eth0: renamed from veth878d8a8
Aug 10 18:42:50 ip-10-255-21-47 kernel: [7033806.679912] IPv6: ADDRCONF(NETDEV_CHANGE): veth76b4d20: link becomes ready
Aug 10 18:42:50 ip-10-255-21-47 kernel: [7033806.679988] br-3563230c915f: port 4(veth76b4d20) entered blocking state
Aug 10 18:42:50 ip-10-255-21-47 kernel: [7033806.679990] br-3563230c915f: port 4(veth76b4d20) entered forwarding state
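
To gauge how often these bridge/veth messages recur (the churn matches containers being created and torn down on the br-3563230c915f compose bridge), something like the following can group one of the repeating messages by minute:

$ sudo grep 'entered promiscuous mode' /var/log/kern.log \
    | awk '{print $1, $2, substr($3, 1, 5)}' \
    | uniq -c \
    | tail -n 20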


tschaffter commented Aug 10, 2021

Trying to look at the logs of the three main containers, but they are cropped to today, when I (re)started them.

docker logs --since 2021-08-05 docker-elk_logstash_1
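
(Hypothetical check: docker inspect can confirm whether the containers were recreated today, which would explain why their logs only go back to the restart.)

$ docker inspect -f '{{.Name}} created={{.Created}} started={{.State.StartedAt}}' \
    docker-elk_logstash_1 docker-elk_kibana_1 docker-elk_elasticsearch_1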

The logs of the container docker-elk_logspout_1 are crowded with repetitions of an error. Interestingly, these errors stopped at the same time the repeating messages stopped in /var/log/kern.log.

$ docker logs --since 2021-08-05 docker-elk_logspout_1
2021/08/10 18:34:11 persist:/mnt/routes
2021/08/10 18:34:16 !! lookup logstash: Try again
2021/08/10 18:35:16 # logspout v3.2.13-custom by gliderlabs
2021/08/10 18:35:16 # adapters: logstash raw tcp udp
2021/08/10 18:35:16 # options :
2021/08/10 18:35:16 persist:/mnt/routes
2021/08/10 18:35:21 !! lookup logstash: Try again
2021/08/10 18:36:22 # logspout v3.2.13-custom by gliderlabs
2021/08/10 18:36:22 # adapters: raw tcp udp logstash
2021/08/10 18:36:22 # options :
2021/08/10 18:36:22 persist:/mnt/routes
2021/08/10 18:36:27 !! lookup logstash: Try again
2021/08/10 18:37:27 # logspout v3.2.13-custom by gliderlabs
2021/08/10 18:37:27 # adapters: raw tcp udp logstash
2021/08/10 18:37:27 # options :
2021/08/10 18:37:27 persist:/mnt/routes
2021/08/10 18:37:32 !! lookup logstash: Try again
2021/08/10 18:38:33 # logspout v3.2.13-custom by gliderlabs
2021/08/10 18:38:33 # adapters: logstash raw tcp udp
2021/08/10 18:38:33 # options :
2021/08/10 18:38:33 persist:/mnt/routes
2021/08/10 18:38:38 !! lookup logstash: Try again
2021/08/10 18:39:38 # logspout v3.2.13-custom by gliderlabs
2021/08/10 18:39:38 # adapters: logstash raw tcp udp
2021/08/10 18:39:38 # options :
2021/08/10 18:39:38 persist:/mnt/routes
2021/08/10 18:39:43 !! lookup logstash: Try again
2021/08/10 18:40:43 # logspout v3.2.13-custom by gliderlabs
2021/08/10 18:40:43 # adapters: raw tcp udp logstash
2021/08/10 18:40:43 # options :
2021/08/10 18:40:43 persist:/mnt/routes
2021/08/10 18:40:48 !! lookup logstash: Try again
2021/08/10 18:41:49 # logspout v3.2.13-custom by gliderlabs
2021/08/10 18:41:49 # adapters: udp logstash raw tcp
2021/08/10 18:41:49 # options :
2021/08/10 18:41:49 persist:/mnt/routes
#   ADAPTER     ADDRESS         CONTAINERS      SOURCES OPTIONS
#   logstash    logstash:5000                           map[]
2021/08/10 18:41:49 # jobs    : pump routes http[health]:80
2021/08/10 18:41:49 # routes  :
2021/08/10 18:41:49 logstash: could not write:write udp 172.18.0.5:55326->172.18.0.4:5000: write: connection refused
2021/08/10 18:42:50 # logspout v3.2.13-custom by gliderlabs
2021/08/10 18:42:50 # adapters: raw tcp udp logstash
2021/08/10 18:42:50 # options :
2021/08/10 18:42:50 persist:/mnt/routes
2021/08/10 18:42:50 # jobs    : http[health]:80 pump routes
2021/08/10 18:42:50 # routes  :
#   ADAPTER     ADDRESS         CONTAINERS      SOURCES OPTIONS
#   logstash    logstash:5000                           map[]


tschaffter commented Aug 10, 2021

Using an approach like the one below could prevent the loss of container logs:

docker run --log-driver=syslog \
    --log-opt syslog-address=udp://logsN.papertrailapp.com:XXXXX \
    image-name

See https://www.papertrail.com/solution/tips/how-to-live-tail-docker-logs/
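
Alternatively (a sketch, not something we run today), the same idea can be applied daemon-wide through /etc/docker/daemon.json, so that every container's logs are shipped to the remote syslog endpoint. Merge this with any existing daemon.json options; the daemon must be restarted for the change to take effect:

$ cat <<'EOF' | sudo tee /etc/docker/daemon.json
{
  "log-driver": "syslog",
  "log-opts": {
    "syslog-address": "udp://logsN.papertrailapp.com:XXXXX"
  }
}
EOF
$ sudo systemctl restart docker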


tschaffter commented Aug 10, 2021

The ELK stack stopped on Aug 06 06:34:19, as shown by the Docker daemon logs. Before that, the last Docker daemon log entries were from Jul 21 06:26:58.

$ sudo journalctl -u docker.service | less
Jul 21 06:26:54 ip-10-255-21-47 dockerd[23615]: time="2021-07-21T06:26:54.312597722Z" level=info msg="Waiting for containerd to be ready to restart event processing" module=libcontainerd namespace=moby
Jul 21 06:26:55 ip-10-255-21-47 dockerd[23615]: time="2021-07-21T06:26:55.312429578Z" level=warning msg="grpc: addrConn.createTransport failed to connect to {unix:///run/containerd/containerd.sock  <nil> 0 <nil>}. Err :connection error: desc = \"transport: Error while dialing dial unix:///run/containerd/containerd.sock: timeout\". Reconnecting..." module=grpc
Jul 21 06:26:55 ip-10-255-21-47 dockerd[23615]: time="2021-07-21T06:26:55.312564297Z" level=warning msg="grpc: addrConn.createTransport failed to connect to {unix:///run/containerd/containerd.sock  <nil> 0 <nil>}. Err :connection error: desc = \"transport: Error while dialing dial unix:///run/containerd/containerd.sock: timeout\". Reconnecting..." module=grpc
Jul 21 06:26:57 ip-10-255-21-47 dockerd[23615]: time="2021-07-21T06:26:57.647921388Z" level=warning msg="grpc: addrConn.createTransport failed to connect to {unix:///run/containerd/containerd.sock  <nil> 0 <nil>}. Err :connection error: desc = \"transport: Error while dialing dial unix:///run/containerd/containerd.sock: timeout\". Reconnecting..." module=grpc
Jul 21 06:26:58 ip-10-255-21-47 dockerd[23615]: time="2021-07-21T06:26:58.158405264Z" level=warning msg="grpc: addrConn.createTransport failed to connect to {unix:///run/containerd/containerd.sock  <nil> 0 <nil>}. Err :connection error: desc = \"transport: Error while dialing dial unix:///run/containerd/containerd.sock: timeout\". Reconnecting..." module=grpc
Aug 06 06:34:19 ip-10-255-21-47 systemd[1]: Stopping Docker Application Container Engine...
Aug 06 06:34:19 ip-10-255-21-47 dockerd[23615]: time="2021-08-06T06:34:19.445042555Z" level=info msg="Processing signal 'terminated'"
Aug 06 06:34:19 ip-10-255-21-47 dockerd[23615]: time="2021-08-06T06:34:19.504716118Z" level=info msg="ignoring event" container=17923a2b103539581549ac53b736ae10c4756825ea18917680d4b76bd502c854 module=libcontainerd namespace=moby topic=/tasks/delete type="*events.TaskDelete"
Aug 06 06:34:19 ip-10-255-21-47 dockerd[23615]: time="2021-08-06T06:34:19.577507133Z" level=info msg="ignoring event" container=9dbe8b7a1b275b634a05d2effc5a65e8ae60e99ca25a7bc255d1b9454d235eb0 module=libcontainerd namespace=moby topic=/tasks/delete type="*events.TaskDelete"
Aug 06 06:34:23 ip-10-255-21-47 dockerd[23615]: time="2021-08-06T06:34:23.314745501Z" level=info msg="ignoring event" container=ea4c2036eacd2bef872e5017c900b8d29a99e76a65a220a61cdcae12217ea75b module=libcontainerd namespace=moby topic=/tasks/delete type="*events.TaskDelete"
Aug 06 06:34:29 ip-10-255-21-47 dockerd[23615]: time="2021-08-06T06:34:29.460829735Z" level=info msg="Container 12d0da455654c1d346e9104b3d4615672b391cac2d4e8dc32dae45ae3ac82085 failed to exit within 10 seconds of signal 15 - using the force"
Aug 06 06:34:29 ip-10-255-21-47 dockerd[23615]: time="2021-08-06T06:34:29.547262539Z" level=info msg="ignoring event" container=12d0da455654c1d346e9104b3d4615672b391cac2d4e8dc32dae45ae3ac82085 module=libcontainerd namespace=moby topic=/tasks/delete type="*events.TaskDelete"
Aug 06 06:34:29 ip-10-255-21-47 dockerd[23615]: time="2021-08-06T06:34:29.751936728Z" level=info msg="stopping event stream following graceful shutdown" error="<nil>" module=libcontainerd namespace=moby
Aug 06 06:34:29 ip-10-255-21-47 dockerd[23615]: time="2021-08-06T06:34:29.752151007Z" level=info msg="Daemon shutdown complete"
Aug 06 06:34:29 ip-10-255-21-47 dockerd[23615]: time="2021-08-06T06:34:29.752192593Z" level=info msg="stopping event stream following graceful shutdown" error="context canceled" module=libcontainerd namespace=plugins.moby
Aug 06 06:34:29 ip-10-255-21-47 systemd[1]: Stopped Docker Application Container Engine.
Aug 06 06:34:29 ip-10-255-21-47 systemd[1]: Starting Docker Application Container Engine...

Since then, the following messages had been repeating for four days, until I restarted the stack today. This is likely due to the container docker-elk_logspout_1, which I saw restarting continuously today.

Aug 10 18:34:10 nlpsandbox-elk dockerd[11787]: time="2021-08-10T18:34:10.870278990Z" level=info msg="IPv6 enabled; Adding default IPv6 external servers: [nameserver 2001:4860:4860::8888 nameserver 2001:4860:4860::8844]"
Aug 10 18:34:16 nlpsandbox-elk dockerd[11787]: time="2021-08-10T18:34:16.278855383Z" level=info msg="ignoring event" container=17923a2b103539581549ac53b736ae10c4756825ea18917680d4b76bd502c854 module=libcontainerd namespace=moby topic=/tasks/delete type="*events.TaskDelete"
Aug 10 18:35:16 nlpsandbox-elk dockerd[11787]: time="2021-08-10T18:35:16.304553929Z" level=info msg="No non-localhost DNS nameservers are left in resolv.conf. Using default external servers: [nameserver 8.8.8.8 nameserver 8.8.4.4]"
Aug 10 18:35:16 nlpsandbox-elk dockerd[11787]: time="2021-08-10T18:35:16.304599323Z" level=info msg="IPv6 enabled; Adding default IPv6 external servers: [nameserver 2001:4860:4860::8888 nameserver 2001:4860:4860::8844]"
Aug 10 18:35:21 nlpsandbox-elk dockerd[11787]: time="2021-08-10T18:35:21.718824239Z" level=info msg="ignoring event" container=17923a2b103539581549ac53b736ae10c4756825ea18917680d4b76bd502c854 module=libcontainerd namespace=moby topic=/tasks/delete type="*events.TaskDelete"
Aug 10 18:36:21 nlpsandbox-elk dockerd[11787]: time="2021-08-10T18:36:21.747615827Z" level=info msg="No non-localhost DNS nameservers are left in resolv.conf. Using default external servers: [nameserver 8.8.8.8 nameserver 8.8.4.4]"
Aug 10 18:36:21 nlpsandbox-elk dockerd[11787]: time="2021-08-10T18:36:21.747666359Z" level=info msg="IPv6 enabled; Adding default IPv6 external servers: [nameserver 2001:4860:4860::8888 nameserver 2001:4860:4860::8844]"
Aug 10 18:36:27 nlpsandbox-elk dockerd[11787]: time="2021-08-10T18:36:27.183085356Z" level=info msg="ignoring event" container=17923a2b103539581549ac53b736ae10c4756825ea18917680d4b76bd502c854 module=libcontainerd namespace=moby topic=/tasks/delete type="*events.TaskDelete"
Aug 10 18:37:27 nlpsandbox-elk dockerd[11787]: time="2021-08-10T18:37:27.203592595Z" level=info msg="No non-localhost DNS nameservers are left in resolv.conf. Using default external servers: [nameserver 8.8.8.8 nameserver 8.8.4.4]"
Aug 10 18:37:27 nlpsandbox-elk dockerd[11787]: time="2021-08-10T18:37:27.203646031Z" level=info msg="IPv6 enabled; Adding default IPv6 external servers: [nameserver 2001:4860:4860::8888 nameserver 2001:4860:4860::8844]"
Aug 10 18:37:32 nlpsandbox-elk dockerd[11787]: time="2021-08-10T18:37:32.611124014Z" level=info msg="ignoring event" container=17923a2b103539581549ac53b736ae10c4756825ea18917680d4b76bd502c854 module=libcontainerd namespace=moby topic=/tasks/delete type="*events.TaskDelete"
Aug 10 18:38:32 nlpsandbox-elk dockerd[11787]: time="2021-08-10T18:38:32.637699306Z" level=info msg="No non-localhost DNS nameservers are left in resolv.conf. Using default external servers: [nameserver 8.8.8.8 nameserver 8.8.4.4]"
Aug 10 18:38:32 nlpsandbox-elk dockerd[11787]: time="2021-08-10T18:38:32.637743766Z" level=info msg="IPv6 enabled; Adding default IPv6 external servers: [nameserver 2001:4860:4860::8888 nameserver 2001:4860:4860::8844]"
Aug 10 18:38:38 nlpsandbox-elk dockerd[11787]: time="2021-08-10T18:38:38.055149655Z" level=info msg="ignoring event" container=17923a2b103539581549ac53b736ae10c4756825ea18917680d4b76bd502c854 module=libcontainerd namespace=moby topic=/tasks/delete type="*events.TaskDelete"
Aug 10 18:39:38 nlpsandbox-elk dockerd[11787]: time="2021-08-10T18:39:38.077773360Z" level=info msg="No non-localhost DNS nameservers are left in resolv.conf. Using default external servers: [nameserver 8.8.8.8 nameserver 8.8.4.4]"
Aug 10 18:39:38 nlpsandbox-elk dockerd[11787]: time="2021-08-10T18:39:38.077826648Z" level=info msg="IPv6 enabled; Adding default IPv6 external servers: [nameserver 2001:4860:4860::8888 nameserver 2001:4860:4860::8844]"
Aug 10 18:39:43 nlpsandbox-elk dockerd[11787]: time="2021-08-10T18:39:43.487112371Z" level=info msg="ignoring event" container=17923a2b103539581549ac53b736ae10c4756825ea18917680d4b76bd502c854 module=libcontainerd namespace=moby topic=/tasks/delete type="*events.TaskDelete"
Aug 10 18:40:43 nlpsandbox-elk dockerd[11787]: time="2021-08-10T18:40:43.508926251Z" level=info msg="No non-localhost DNS nameservers are left in resolv.conf. Using default external servers: [nameserver 8.8.8.8 nameserver 8.8.4.4]"
Aug 10 18:40:43 nlpsandbox-elk dockerd[11787]: time="2021-08-10T18:40:43.509363351Z" level=info msg="IPv6 enabled; Adding default IPv6 external servers: [nameserver 2001:4860:4860::8888 nameserver 2001:4860:4860::8844]"
Aug 10 18:40:48 nlpsandbox-elk dockerd[11787]: time="2021-08-10T18:40:48.918864763Z" level=info msg="ignoring event" container=17923a2b103539581549ac53b736ae10c4756825ea18917680d4b76bd502c854 module=libcontainerd namespace=moby topic=/tasks/delete type="*events.TaskDelete"

Is this a DNS server issue? Even when Docker tries to fall back to its default external DNS servers, the issue keeps repeating. Or is it the entire network, then?
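
One way to narrow this down (a hedged check; the network name docker-elk_default is an assumption, confirm it with docker network ls) is to test name resolution both inside the compose network and against the host's stub resolver:

$ docker run --rm --network docker-elk_default busybox nslookup logstash
$ nslookup google.com 127.0.0.53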

Content of /etc/resolv.conf:

$ sudo cat /etc/resolv.conf
# This file is managed by man:systemd-resolved(8). Do not edit.
#
# This is a dynamic resolv.conf file for connecting local clients to the
# internal DNS stub resolver of systemd-resolved. This file lists all
# configured search domains.
#
# Run "systemd-resolve --status" to see details about the uplink DNS servers
# currently in use.
#
# Third party programs must not access this file directly, but only through the
# symlink at /etc/resolv.conf. To manage man:resolv.conf(5) in a different way,
# replace this symlink by a static file or a different symlink.
#
# See man:systemd-resolved.service(8) for details about the supported modes of
# operation for /etc/resolv.conf.

nameserver 127.0.0.53
options edns0
search ec2.internal

@tschaffter

Docker populates a container's /etc/resolv.conf by copying the host's /etc/resolv.conf and filtering out any local nameservers such as 127.0.1.1. If there are no nameservers left after that, Docker will add Google's public DNS servers (8.8.8.8 and 8.8.4.4).

Source: https://stackoverflow.com/a/39400887
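
This can be checked directly by looking at the resolv.conf that Docker generates for a throwaway container (a quick sanity check, assuming the busybox image is available):

$ docker run --rm busybox cat /etc/resolv.conf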


tschaffter commented Aug 10, 2021

Add non-local DNS servers to resolved

Edit /etc/systemd/resolved.conf to include non-local DNS servers (here Google DNS):

[Resolve]
DNS=8.8.8.8 8.8.4.4 2001:4860:4860::8888 2001:4860:4860::8844

EDIT: Specify only two DNS servers, otherwise /etc/resolv.conf will include a message that it will only consider the top 8. The third one should be the local DNS server (no need to specify it here).

Don't forget to reload configuration and to restart services:

$ sudo systemctl daemon-reload
$ sudo systemctl restart systemd-networkd
$ sudo systemctl restart systemd-resolved

Source: https://unix.stackexchange.com/a/442599

The change is visible below, but /etc/resolv.conf is not updated, and that is the file that Docker uses.

$ sudo systemd-resolve --status
Global
         DNS Servers: 8.8.8.8
                      8.8.4.4
                      2001:4860:4860::8888
                      2001:4860:4860::8844
          DNSSEC NTA: 10.in-addr.arpa
                      16.172.in-addr.arpa
                      168.192.in-addr.arpa
...

The issue is that /etc/resolv.conf is a symlink to /run/systemd/resolve/stub-resolv.conf, while it should point to /run/systemd/resolve/resolv.conf, which now contains our new DNS servers.

Here is how to fix this:

sudo unlink /etc/resolv.conf
sudo ln -s /run/systemd/resolve/resolv.conf /etc/resolv.conf
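
To verify the fix, the symlink should now resolve to the full resolv.conf and list the new servers:

$ readlink -f /etc/resolv.conf
/run/systemd/resolve/resolv.conf
$ grep ^nameserver /etc/resolv.conf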

The log lines "No non-localhost DNS nameservers are left in resolv.conf. Using default external servers: [nameserver 8.8.8.8 nameserver 8.8.4.4]" no longer appear in the Docker daemon logs. 🎉


tschaffter commented Aug 10, 2021

Why does the container logspout keep restarting?

From the container logs:

2021/08/10 22:41:31 !! lookup logstash on 127.0.0.11:53: no such host
2021/08/10 22:41:44 # logspout v3.2.13-custom by gliderlabs
2021/08/10 22:41:44 # adapters: udp logstash raw tcp
2021/08/10 22:41:44 # options :
2021/08/10 22:41:44 persist:/mnt/routes

The reason is that the container has a dependency on the ELK network/containers. The solution to the logspout restart loop is to start the ELK stack first. Note that the logspout container is set with restart: always, but this is not the case for the ELK containers.

@thomasyu888 How did you start logspout? Shall we add this container to the ELK docker-compose.yml?
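
If we do add it, here is a rough sketch of what the service entry could look like (for discussion only; the image name and service layout are placeholders for the custom logspout build with the logstash adapter, and depends_on only orders startup, it does not wait for logstash to be ready):

$ cat <<'EOF' > docker-compose.override.yml
services:
  logspout:
    image: our-registry/logspout:v3.2.13-custom   # placeholder for the custom build
    command: logstash://logstash:5000
    volumes:
      - /var/run/docker.sock:/var/run/docker.sock
    depends_on:
      - logstash
    restart: always
EOF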

@tschaffter

@thomasyu888 Do you have any idea why the ELK stack stopped on Aug 06 06:34:19?

@tschaffter

Conclusion

  • We have not identified why the ELK stack stopped on Aug 06 06:34:19.
  • We have added two non-local DNS servers to systemd-resolved to prevent a message from repeating in docker daemon logs.
  • The container logspout was restarting in a loop because it depends on the ELK network/containers, which were down at the time.
  • The CloudWatch agent is now collecting logs; hopefully we will have more information available the next time the ELK stack stops.
  • The repeating log messages similar to br-3563230c915f: port 3(vethb438dc0) entered blocking state are still a mystery.
