TiFlash panics with Too many open files due to grpc connection socket leak in the cloud GCP env #9663

Open
solotzg opened this issue Nov 21, 2024 · 2 comments

solotzg commented Nov 21, 2024

Bug Report

1. Minimal reproduce step (Required)

  • Topology: 3-tidb-16C32G / 9-tikv-16C64G-2000G / 3-pd-4C15G-50G / 2-tiflash-16C128G-500G
  • Platform: GCP
  • Workload: normal MPP workloads

2. What did you expect to see? (Required)

  • No panic

3. What did you see instead (Required)

The TiFlash process held far too many open file descriptors (FDs). As the FD count grew, queries started to fail, and TiFlash eventually crashed. Most of the FDs were sockets, and a large number of those sockets were still open in the process but had no corresponding entry under "/proc/net".

sh-5.1# ls -l /proc/1/fd/ | wc -l
295268
sh-5.1# ls -l /proc/1/fd/ | grep "eventfd" | wc -l
98368
sh-5.1# ls -l /proc/1/fd/ | grep "eventpoll" | wc -l
98391
sh-5.1# ls -l /proc/1/fd/ | grep "socket" | wc -l
98492
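
The same breakdown can be collected in one pass instead of one grep per type. The following is a minimal sketch, assuming a Linux host where the targets of the /proc/<pid>/fd symlinks look like "socket:[12345]" or "anon_inode:[eventfd]"; the function names are hypothetical and not part of TiFlash or any tool used in this report.

```python
import os
import re
from collections import Counter

# Matches both "anon_inode:[eventfd]" and the bracket-less "anon_inode:inotify" form.
ANON_INODE_RE = re.compile(r"anon_inode:\[?([^\]]+)\]?$")


def classify_fd_target(target):
    """Map one /proc/<pid>/fd symlink target to a coarse category."""
    if target.startswith("socket:["):
        return "socket"
    match = ANON_INODE_RE.match(target)
    if match:
        return match.group(1)  # e.g. "eventfd" or "eventpoll"
    return "other"


def count_targets(targets):
    """Count categories over an iterable of symlink targets."""
    return Counter(classify_fd_target(t) for t in targets)


def count_fd_types(pid):
    """Read /proc/<pid>/fd (Linux only) and count the open FDs per category."""
    fd_dir = "/proc/{}/fd".format(pid)
    targets = []
    for name in os.listdir(fd_dir):
        try:
            targets.append(os.readlink(os.path.join(fd_dir, name)))
        except OSError:
            continue  # the FD was closed between listdir() and readlink()
    return count_targets(targets)
```

Calling count_fd_types(1) inside the container would report the socket, eventfd, and eventpoll counts together, which makes it easy to see whether the three grow in lockstep as they do above.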

After MPP was disabled with the settings below, the number of sockets stopped growing, which suggests a bug in the MPP-related code path.

set global tidb_allow_fallback_to_tikv = "tiflash";
set global tidb_allow_mpp = 0;
set global tidb_allow_tiflash_cop = 1;

4. What is your TiFlash version? (Required)

v7.5.3, v8.1.1

Root Cause

  • When the TiDB planner prepares an MPP query, it creates a new gRPC connection to the TiFlash instances, calls the ServerInfo RPC to obtain the hardware information of each TiFlash instance, and then closes the connection.
  • If TLS is enabled, creating a new gRPC connection in synchronous mode and invoking server_info produces:
    • 1 TCP connection (corresponding to 1 independent socket)
    • 1 execution thread
    • 1 eventfd
    • 1 eventpoll that listens to the socket and the eventfd
  • After the gRPC connection is closed, these objects are normally destroyed.
  • When TiDB closes the gRPC connection abnormally (TiFlash receives a [RST, ACK] TCP packet), the TCP connection is torn down and the execution thread is destroyed, but the associated socket, eventpoll, and eventfd are not.
    • On a healthy cluster this exception does not necessarily occur, but when it does, it is almost always accompanied by a socket leak.
    • After tidb_max_tiflash_threads is set, which bypasses the problematic method getServerInfoByGRPC, the socket leak no longer occurs.
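
A socket that stays open in the process after its TCP connection is gone should show up as a "socket:[inode]" entry in the FD table with no matching inode in the kernel's TCP tables. The sketch below illustrates that check under stated assumptions: the table text uses the /proc/net/tcp or /proc/net/tcp6 layout (inode in the tenth column), and the helper names are hypothetical. /proc/net/unix uses a different layout and is not handled.

```python
import re

SOCKET_RE = re.compile(r"^socket:\[(\d+)\]$")


def socket_inodes_from_fd_targets(targets):
    """Extract socket inode numbers from /proc/<pid>/fd symlink targets."""
    inodes = set()
    for target in targets:
        match = SOCKET_RE.match(target)
        if match:
            inodes.add(int(match.group(1)))
    return inodes


def inodes_from_proc_net(table_text):
    """Parse text in the /proc/net/tcp layout and return the inode column."""
    inodes = set()
    for line in table_text.splitlines()[1:]:  # skip the header row
        fields = line.split()
        if len(fields) > 9:
            inodes.add(int(fields[9]))
    return inodes


def orphan_sockets(fd_targets, net_tables):
    """Return socket inodes held by the process but absent from every table."""
    known = set()
    for text in net_tables:
        known |= inodes_from_proc_net(text)
    return socket_inodes_from_fd_targets(fd_targets) - known
```

In practice the inputs would come from os.readlink on each entry of /proc/<pid>/fd and from reading /proc/<pid>/net/tcp and /proc/<pid>/net/tcp6, so that the tables belong to the same network namespace as the process.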

tcpdump capture of abnormal getServerInfoByGRPC exchanges (the client ends each connection with [R.] instead of a FIN handshake):

03:58:52.561704 IP 10.0.128.40.40836 > 10.0.159.129.3930: Flags [S], seq 2249365032, win 21300, options [mss 1420,sackOK,TS val 2891417206 ecr 0,nop,wscale 7], length 0
03:58:52.561746 IP 10.0.159.129.3930 > 10.0.128.40.40836: Flags [S.], seq 4043278122, ack 2249365033, win 21120, options [mss 1420,sackOK,TS val 2474091398 ecr 2891417206,nop,wscale 7], length 0
03:58:52.561854 IP 10.0.128.40.40836 > 10.0.159.129.3930: Flags [.], ack 1, win 167, options [nop,nop,TS val 2891417206 ecr 2474091398], length 0
03:58:52.562091 IP 10.0.128.40.40836 > 10.0.159.129.3930: Flags [P.], seq 1:298, ack 1, win 167, options [nop,nop,TS val 2891417206 ecr 2474091398], length 297
03:58:52.562101 IP 10.0.159.129.3930 > 10.0.128.40.40836: Flags [.], ack 298, win 163, options [nop,nop,TS val 2474091398 ecr 2891417206], length 0
03:58:52.562177 IP 10.0.159.129.3930 > 10.0.128.40.40836: Flags [P.], seq 1:100, ack 298, win 163, options [nop,nop,TS val 2474091398 ecr 2891417206], length 99
03:58:52.562238 IP 10.0.128.40.40836 > 10.0.159.129.3930: Flags [.], ack 100, win 167, options [nop,nop,TS val 2891417206 ecr 2474091398], length 0
03:58:52.562279 IP 10.0.128.40.40836 > 10.0.159.129.3930: Flags [P.], seq 298:304, ack 100, win 167, options [nop,nop,TS val 2891417206 ecr 2474091398], length 6
03:58:52.562356 IP 10.0.128.40.40836 > 10.0.159.129.3930: Flags [P.], seq 304:634, ack 100, win 167, options [nop,nop,TS val 2891417206 ecr 2474091398], length 330
03:58:52.562385 IP 10.0.159.129.3930 > 10.0.128.40.40836: Flags [.], ack 634, win 163, options [nop,nop,TS val 2474091398 ecr 2891417206], length 0
03:58:52.563221 IP 10.0.159.129.3930 > 10.0.128.40.40836: Flags [P.], seq 100:2065, ack 634, win 163, options [nop,nop,TS val 2474091399 ecr 2891417206], length 1965
03:58:52.563345 IP 10.0.128.40.40836 > 10.0.159.129.3930: Flags [.], ack 2065, win 163, options [nop,nop,TS val 2891417207 ecr 2474091399], length 0
03:58:52.565853 IP 10.0.128.40.40836 > 10.0.159.129.3930: Flags [P.], seq 634:2292, ack 2065, win 163, options [nop,nop,TS val 2891417210 ecr 2474091399], length 1658
03:58:52.565863 IP 10.0.159.129.3930 > 10.0.128.40.40836: Flags [.], ack 2292, win 163, options [nop,nop,TS val 2474091402 ecr 2891417210], length 0
03:58:52.565948 IP 10.0.128.40.40836 > 10.0.159.129.3930: Flags [P.], seq 2292:2338, ack 2065, win 163, options [nop,nop,TS val 2891417210 ecr 2474091402], length 46
03:58:52.566209 IP 10.0.159.129.3930 > 10.0.128.40.40836: Flags [P.], seq 2065:2133, ack 2338, win 163, options [nop,nop,TS val 2474091402 ecr 2891417210], length 68
03:58:52.566283 IP 10.0.128.40.40836 > 10.0.159.129.3930: Flags [P.], seq 2338:2369, ack 2133, win 163, options [nop,nop,TS val 2891417210 ecr 2474091402], length 31
03:58:52.566414 IP 10.0.128.40.40836 > 10.0.159.129.3930: Flags [P.], seq 2369:2598, ack 2133, win 163, options [nop,nop,TS val 2891417210 ecr 2474091402], length 229
03:58:52.566443 IP 10.0.159.129.3930 > 10.0.128.40.40836: Flags [.], ack 2598, win 163, options [nop,nop,TS val 2474091402 ecr 2891417210], length 0
03:58:52.566547 IP 10.0.159.129.3930 > 10.0.128.40.40836: Flags [P.], seq 2133:2194, ack 2598, win 163, options [nop,nop,TS val 2474091402 ecr 2891417210], length 61
03:58:52.566658 IP 10.0.128.40.40836 > 10.0.159.129.3930: Flags [P.], seq 2598:2637, ack 2194, win 163, options [nop,nop,TS val 2891417211 ecr 2474091402], length 39
03:58:52.568290 IP 10.0.159.129.3930 > 10.0.128.40.40836: Flags [P.], seq 2194:4845, ack 2637, win 163, options [nop,nop,TS val 2474091404 ecr 2891417211], length 2651
03:58:52.568392 IP 10.0.128.40.40836 > 10.0.159.129.3930: Flags [.], ack 4845, win 163, options [nop,nop,TS val 2891417212 ecr 2474091404], length 0
03:58:52.568471 IP 10.0.128.40.40836 > 10.0.159.129.3930: Flags [P.], seq 2637:2689, ack 4845, win 163, options [nop,nop,TS val 2891417212 ecr 2474091404], length 52
03:58:52.568513 IP 10.0.159.129.3930 > 10.0.128.40.40836: Flags [P.], seq 4845:4884, ack 2689, win 163, options [nop,nop,TS val 2474091404 ecr 2891417212], length 39
03:58:52.568588 IP 10.0.128.40.40836 > 10.0.159.129.3930: Flags [P.], seq 2689:2713, ack 4884, win 163, options [nop,nop,TS val 2891417212 ecr 2474091404], length 24
03:58:52.568600 IP 10.0.128.40.40836 > 10.0.159.129.3930: Flags [R.], seq 2713, ack 4884, win 163, options [nop,nop,TS val 2891417212 ecr 2474091404], length 0

...
...

04:04:52.554279 IP 10.0.128.40.44108 > 10.0.159.129.3930: Flags [S], seq 1897954624, win 21300, options [mss 1420,sackOK,TS val 2891777198 ecr 0,nop,wscale 7], length 0
04:04:52.554310 IP 10.0.159.129.3930 > 10.0.128.40.44108: Flags [S.], seq 652598690, ack 1897954625, win 21120, options [mss 1420,sackOK,TS val 2474451390 ecr 2891777198,nop,wscale 7], length 0
04:04:52.554375 IP 10.0.128.40.44108 > 10.0.159.129.3930: Flags [.], ack 1, win 167, options [nop,nop,TS val 2891777198 ecr 2474451390], length 0
04:04:52.554568 IP 10.0.128.40.44108 > 10.0.159.129.3930: Flags [P.], seq 1:298, ack 1, win 167, options [nop,nop,TS val 2891777198 ecr 2474451390], length 297
04:04:52.554576 IP 10.0.159.129.3930 > 10.0.128.40.44108: Flags [.], ack 298, win 163, options [nop,nop,TS val 2474451391 ecr 2891777198], length 0
04:04:52.554640 IP 10.0.159.129.3930 > 10.0.128.40.44108: Flags [P.], seq 1:100, ack 298, win 163, options [nop,nop,TS val 2474451391 ecr 2891777198], length 99
04:04:52.554695 IP 10.0.128.40.44108 > 10.0.159.129.3930: Flags [.], ack 100, win 167, options [nop,nop,TS val 2891777199 ecr 2474451391], length 0
04:04:52.554735 IP 10.0.128.40.44108 > 10.0.159.129.3930: Flags [P.], seq 298:304, ack 100, win 167, options [nop,nop,TS val 2891777199 ecr 2474451391], length 6
04:04:52.554784 IP 10.0.128.40.44108 > 10.0.159.129.3930: Flags [P.], seq 304:634, ack 100, win 167, options [nop,nop,TS val 2891777199 ecr 2474451391], length 330
04:04:52.554799 IP 10.0.159.129.3930 > 10.0.128.40.44108: Flags [.], ack 634, win 163, options [nop,nop,TS val 2474451391 ecr 2891777199], length 0
04:04:52.555630 IP 10.0.159.129.3930 > 10.0.128.40.44108: Flags [P.], seq 100:2065, ack 634, win 163, options [nop,nop,TS val 2474451392 ecr 2891777199], length 1965
04:04:52.555716 IP 10.0.128.40.44108 > 10.0.159.129.3930: Flags [.], ack 2065, win 163, options [nop,nop,TS val 2891777200 ecr 2474451392], length 0
04:04:52.558248 IP 10.0.128.40.44108 > 10.0.159.129.3930: Flags [P.], seq 634:2292, ack 2065, win 163, options [nop,nop,TS val 2891777202 ecr 2474451392], length 1658
04:04:52.558255 IP 10.0.159.129.3930 > 10.0.128.40.44108: Flags [.], ack 2292, win 163, options [nop,nop,TS val 2474451394 ecr 2891777202], length 0
04:04:52.558316 IP 10.0.128.40.44108 > 10.0.159.129.3930: Flags [P.], seq 2292:2338, ack 2065, win 163, options [nop,nop,TS val 2891777202 ecr 2474451392], length 46
04:04:52.558580 IP 10.0.159.129.3930 > 10.0.128.40.44108: Flags [P.], seq 2065:2133, ack 2338, win 163, options [nop,nop,TS val 2474451395 ecr 2891777202], length 68
04:04:52.558636 IP 10.0.128.40.44108 > 10.0.159.129.3930: Flags [P.], seq 2338:2369, ack 2133, win 163, options [nop,nop,TS val 2891777203 ecr 2474451395], length 31
04:04:52.558754 IP 10.0.128.40.44108 > 10.0.159.129.3930: Flags [P.], seq 2369:2598, ack 2133, win 163, options [nop,nop,TS val 2891777203 ecr 2474451395], length 229
04:04:52.558778 IP 10.0.159.129.3930 > 10.0.128.40.44108: Flags [.], ack 2598, win 163, options [nop,nop,TS val 2474451395 ecr 2891777203], length 0
04:04:52.558858 IP 10.0.159.129.3930 > 10.0.128.40.44108: Flags [P.], seq 2133:2194, ack 2598, win 163, options [nop,nop,TS val 2474451395 ecr 2891777203], length 61
04:04:52.558946 IP 10.0.128.40.44108 > 10.0.159.129.3930: Flags [P.], seq 2598:2637, ack 2194, win 163, options [nop,nop,TS val 2891777203 ecr 2474451395], length 39
04:04:52.560214 IP 10.0.159.129.3930 > 10.0.128.40.44108: Flags [P.], seq 2194:4845, ack 2637, win 163, options [nop,nop,TS val 2474451396 ecr 2891777203], length 2651
04:04:52.560284 IP 10.0.128.40.44108 > 10.0.159.129.3930: Flags [.], ack 4845, win 163, options [nop,nop,TS val 2891777204 ecr 2474451396], length 0
04:04:52.560345 IP 10.0.128.40.44108 > 10.0.159.129.3930: Flags [P.], seq 2637:2689, ack 4845, win 163, options [nop,nop,TS val 2891777204 ecr 2474451396], length 52
04:04:52.560399 IP 10.0.159.129.3930 > 10.0.128.40.44108: Flags [P.], seq 4845:4884, ack 2689, win 163, options [nop,nop,TS val 2474451396 ecr 2891777204], length 39
04:04:52.560441 IP 10.0.128.40.44108 > 10.0.159.129.3930: Flags [P.], seq 2689:2713, ack 4845, win 163, options [nop,nop,TS val 2891777204 ecr 2474451396], length 24
04:04:52.560478 IP 10.0.128.40.44108 > 10.0.159.129.3930: Flags [R.], seq 2713, ack 4884, win 163, options [nop,nop,TS val 2891777204 ecr 2474451396], length 0

tcpdump capture of a normal getServerInfoByGRPC exchange (the connection ends with a FIN handshake):

03:58:52.987154 IP 10.0.128.40.40852 > 10.0.159.129.3930: Flags [S], seq 3797980295, win 21300, options [mss 1420,sackOK,TS val 2891417631 ecr 0,nop,wscale 7], length 0
03:58:52.987186 IP 10.0.159.129.3930 > 10.0.128.40.40852: Flags [S.], seq 3858219694, ack 3797980296, win 21120, options [mss 1420,sackOK,TS val 2474091823 ecr 2891417631,nop,wscale 7], length 0
03:58:52.987273 IP 10.0.128.40.40852 > 10.0.159.129.3930: Flags [.], ack 1, win 167, options [nop,nop,TS val 2891417631 ecr 2474091823], length 0
03:58:52.987438 IP 10.0.128.40.40852 > 10.0.159.129.3930: Flags [P.], seq 1:298, ack 1, win 167, options [nop,nop,TS val 2891417631 ecr 2474091823], length 297
03:58:52.987447 IP 10.0.159.129.3930 > 10.0.128.40.40852: Flags [.], ack 298, win 163, options [nop,nop,TS val 2474091823 ecr 2891417631], length 0
03:58:52.987511 IP 10.0.159.129.3930 > 10.0.128.40.40852: Flags [P.], seq 1:100, ack 298, win 163, options [nop,nop,TS val 2474091823 ecr 2891417631], length 99
03:58:52.987579 IP 10.0.128.40.40852 > 10.0.159.129.3930: Flags [.], ack 100, win 167, options [nop,nop,TS val 2891417631 ecr 2474091823], length 0
03:58:52.987617 IP 10.0.128.40.40852 > 10.0.159.129.3930: Flags [P.], seq 298:304, ack 100, win 167, options [nop,nop,TS val 2891417632 ecr 2474091823], length 6
03:58:52.987674 IP 10.0.128.40.40852 > 10.0.159.129.3930: Flags [P.], seq 304:634, ack 100, win 167, options [nop,nop,TS val 2891417632 ecr 2474091823], length 330
03:58:52.987688 IP 10.0.159.129.3930 > 10.0.128.40.40852: Flags [.], ack 634, win 163, options [nop,nop,TS val 2474091824 ecr 2891417632], length 0
03:58:52.988494 IP 10.0.159.129.3930 > 10.0.128.40.40852: Flags [P.], seq 100:2065, ack 634, win 163, options [nop,nop,TS val 2474091824 ecr 2891417632], length 1965
03:58:52.988636 IP 10.0.128.40.40852 > 10.0.159.129.3930: Flags [.], ack 2065, win 163, options [nop,nop,TS val 2891417633 ecr 2474091824], length 0
03:58:52.991165 IP 10.0.128.40.40852 > 10.0.159.129.3930: Flags [P.], seq 634:2292, ack 2065, win 163, options [nop,nop,TS val 2891417635 ecr 2474091824], length 1658
03:58:52.991177 IP 10.0.159.129.3930 > 10.0.128.40.40852: Flags [.], ack 2292, win 163, options [nop,nop,TS val 2474091827 ecr 2891417635], length 0
03:58:52.991245 IP 10.0.128.40.40852 > 10.0.159.129.3930: Flags [P.], seq 2292:2338, ack 2065, win 163, options [nop,nop,TS val 2891417635 ecr 2474091824], length 46
03:58:52.991491 IP 10.0.159.129.3930 > 10.0.128.40.40852: Flags [P.], seq 2065:2133, ack 2338, win 163, options [nop,nop,TS val 2474091827 ecr 2891417635], length 68
03:58:52.991564 IP 10.0.128.40.40852 > 10.0.159.129.3930: Flags [P.], seq 2338:2369, ack 2133, win 163, options [nop,nop,TS val 2891417635 ecr 2474091827], length 31
03:58:52.991751 IP 10.0.128.40.40852 > 10.0.159.129.3930: Flags [P.], seq 2369:2598, ack 2133, win 163, options [nop,nop,TS val 2891417636 ecr 2474091827], length 229
03:58:52.991779 IP 10.0.159.129.3930 > 10.0.128.40.40852: Flags [.], ack 2598, win 163, options [nop,nop,TS val 2474091828 ecr 2891417635], length 0
03:58:52.991876 IP 10.0.159.129.3930 > 10.0.128.40.40852: Flags [P.], seq 2133:2194, ack 2598, win 163, options [nop,nop,TS val 2474091828 ecr 2891417635], length 61
03:58:52.991996 IP 10.0.128.40.40852 > 10.0.159.129.3930: Flags [P.], seq 2598:2637, ack 2194, win 163, options [nop,nop,TS val 2891417636 ecr 2474091828], length 39
03:58:52.993177 IP 10.0.159.129.3930 > 10.0.128.40.40852: Flags [P.], seq 2194:4845, ack 2637, win 163, options [nop,nop,TS val 2474091829 ecr 2891417636], length 2651
03:58:52.993255 IP 10.0.128.40.40852 > 10.0.159.129.3930: Flags [.], ack 4845, win 163, options [nop,nop,TS val 2891417637 ecr 2474091829], length 0
03:58:52.993312 IP 10.0.128.40.40852 > 10.0.159.129.3930: Flags [P.], seq 2637:2689, ack 4845, win 163, options [nop,nop,TS val 2891417637 ecr 2474091829], length 52
03:58:52.993356 IP 10.0.159.129.3930 > 10.0.128.40.40852: Flags [P.], seq 4845:4884, ack 2689, win 163, options [nop,nop,TS val 2474091829 ecr 2891417637], length 39
03:58:52.993485 IP 10.0.128.40.40852 > 10.0.159.129.3930: Flags [P.], seq 2689:2713, ack 4884, win 163, options [nop,nop,TS val 2891417637 ecr 2474091829], length 24
03:58:52.993496 IP 10.0.128.40.40852 > 10.0.159.129.3930: Flags [F.], seq 2713, ack 4884, win 163, options [nop,nop,TS val 2891417637 ecr 2474091829], length 0
03:58:52.993516 IP 10.0.159.129.3930 > 10.0.128.40.40852: Flags [F.], seq 4884, ack 2714, win 163, options [nop,nop,TS val 2474091829 ecr 2891417637], length 0
03:58:52.993587 IP 10.0.128.40.40852 > 10.0.159.129.3930: Flags [.], ack 4885, win 163, options [nop,nop,TS val 2891417637 ecr 2474091829], length 0
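
The abnormal and normal traces differ only in how the client ends the connection, so a long capture can be triaged by grouping packets per client port and recording whether each flow ended with [R.] or [F.]. This is a minimal sketch, assuming tcpdump text output in the format shown above and the TiFlash port 3930 seen in these traces; the function name is hypothetical.

```python
import re

LINE_RE = re.compile(
    r"^(?P<ts>\S+) IP (?P<src>\d+\.\d+\.\d+\.\d+)\.(?P<sport>\d+) > "
    r"(?P<dst>\d+\.\d+\.\d+\.\d+)\.(?P<dport>\d+): Flags \[(?P<flags>[^\]]+)\]"
)


def classify_flows(lines, server_port=3930):
    """Return {(client_ip, client_port): "rst" | "fin" | "open"} per flow."""
    flows = {}
    for line in lines:
        match = LINE_RE.match(line.strip())
        if not match:
            continue  # skip "..." separators and shell output
        if int(match["dport"]) == server_port:
            key = (match["src"], int(match["sport"]))
        elif int(match["sport"]) == server_port:
            key = (match["dst"], int(match["dport"]))
        else:
            continue  # not traffic to or from the server port
        state = flows.get(key, "open")
        flags = match["flags"]
        if "R" in flags:
            state = "rst"  # a reset always wins over an earlier FIN
        elif "F" in flags and state != "rst":
            state = "fin"
        flows[key] = state
    return flows
```

Applied to the captures above, client port 40836 would be classified as "rst" and client port 40852 as "fin".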

Correlation between the number of leaked sockets and the abnormal [R.] TCP packets (socket counts from lsof, interleaved with the [R.] packets captured by tcpdump):

sh-5.1# lsof -p1 -nP | grep sock | wc -l
8997

sh-5.1# lsof -p1 -nP | grep sock | wc -l
8999

06:52:29.758566 IP 10.0.128.40.58928 > 10.0.159.129.3930: Flags [R.], seq 2713, ack 4884, win 163, options [nop,nop,TS val 2815434402 ecr 2398108594], length 0
06:52:32.583755 IP 10.0.128.40.42908 > 10.0.159.129.3930: Flags [R.], seq 2691, ack 4884, win 163, options [nop,nop,TS val 2815437228 ecr 2398111420], length 0
06:56:00.184564 IP 10.0.128.40.49098 > 10.0.159.129.3930: Flags [R.], seq 2713, ack 4884, win 163, options [nop,nop,TS val 2815644828 ecr 2398319020], length 0

sh-5.1# lsof -p1 -nP | grep sock | wc -l
9000

06:59:15.739577 IP 10.0.128.40.33722 > 10.0.159.129.3930: Flags [R.], seq 2713, ack 4884, win 163, options [nop,nop,TS val 2815840383 ecr 2398514575], length 0
07:03:02.503602 IP 10.0.128.40.50298 > 10.0.159.129.3930: Flags [R.], seq 2713, ack 4884, win 163, options [nop,nop,TS val 2816067147 ecr 2398741339], length 0
07:03:34.253606 IP 10.0.128.40.32906 > 10.0.159.129.3930: Flags [R.], seq 2713, ack 4884, win 163, options [nop,nop,TS val 2816098897 ecr 2398773089], length 0

sh-5.1# lsof -p1 -nP | grep sock | wc -l
9001

07:05:28.859558 IP 10.0.128.40.43968 > 10.0.159.129.3930: Flags [R.], seq 2713, ack 4884, win 163, options [nop,nop,TS val 2816213503 ecr 2398887695], length 0
07:05:37.802734 IP 10.0.128.40.33316 > 10.0.159.129.3930: Flags [R.], seq 2713, ack 4884, win 163, options [nop,nop,TS val 2816222447 ecr 2398896639], length 0

sh-5.1# lsof -p1 -nP | grep sock | wc -l
9002

07:11:12.165851 IP 10.0.128.40.52334 > 10.0.159.129.3930: Flags [R.], seq 2713, ack 4884, win 163, options [nop,nop,TS val 2816556810 ecr 2399231002], length 0

sh-5.1# lsof -p1 -nP | grep sock | wc -l
9003

07:26:30.298483 IP 10.0.128.40.41346 > 10.0.159.129.3930: Flags [R.], seq 2691, ack 4884, win 163, options [nop,nop,TS val 2817474942 ecr 2400149134], length 0
07:26:45.536537 IP 10.0.128.40.32954 > 10.0.159.129.3930: Flags [R.], seq 2713, ack 4884, win 163, options [nop,nop,TS val 2817490180 ecr 2400164372], length 0

sh-5.1# lsof -p1 -nP | grep sock | wc -l
9004

07:29:59.297741 IP 10.0.128.40.50300 > 10.0.159.129.3930: Flags [R.], seq 2713, ack 4884, win 163, options [nop,nop,TS val 2817683942 ecr 2400358134], length 0

sh-5.1# lsof -p1 -nP | grep sock | wc -l
9005

07:30:11.838733 IP 10.0.128.40.40660 > 10.0.159.129.3930: Flags [R.], seq 2713, ack 4884, win 163, options [nop,nop,TS val 2817696483 ecr 2400370675], length 0
07:30:15.959509 IP 10.0.128.40.33626 > 10.0.159.129.3930: Flags [R.], seq 2713, ack 4884, win 163, options [nop,nop,TS val 2817700603 ecr 2400374795], length 0

sh-5.1# lsof -p1 -nP | grep sock | wc -l
9009

07:36:03.308641 IP 10.0.128.40.33016 > 10.0.159.129.3930: Flags [R.], seq 2713, ack 4884, win 163, options [nop,nop,TS val 2818047952 ecr 2400722144], length 0
07:36:21.119600 IP 10.0.128.40.33674 > 10.0.159.129.3930: Flags [R.], seq 2713, ack 4884, win 163, options [nop,nop,TS val 2818065763 ecr 2400739955], length 0
07:36:39.588145 IP 10.0.128.40.40878 > 10.0.159.129.3930: Flags [R.], seq 2713, ack 4884, win 163, options [nop,nop,TS val 2818084232 ecr 2400758424], length 0
07:36:41.698560 IP 10.0.128.40.40900 > 10.0.159.129.3930: Flags [R.], seq 2713, ack 4884, win 163, options [nop,nop,TS val 2818086342 ecr 2400760534], length 0
07:37:17.337552 IP 10.0.128.40.36594 > 10.0.159.129.3930: Flags [R.], seq 2713, ack 4884, win 163, options [nop,nop,TS val 2818121981 ecr 2400796173], length 0
07:37:44.534595 IP 10.0.128.40.41562 > 10.0.159.129.3930: Flags [R.], seq 2713, ack 4884, win 163, options [nop,nop,TS val 2818149178 ecr 2400823370], length 0
07:37:49.308730 IP 10.0.128.40.41582 > 10.0.159.129.3930: Flags [R.], seq 2713, ack 4884, win 163, options [nop,nop,TS val 2818153953 ecr 2400828145], length 0

sh-5.1# lsof -p1 -nP | grep sock | wc -l
9011

07:39:10.227657 IP 10.0.128.40.52256 > 10.0.159.129.3930: Flags [R.], seq 2713, ack 4884, win 163, options [nop,nop,TS val 2818234871 ecr 2400909064], length 0
07:39:21.941574 IP 10.0.128.40.48532 > 10.0.159.129.3930: Flags [R.], seq 2713, ack 4884, win 163, options [nop,nop,TS val 2818246585 ecr 2400920777], length 0

Feasible solution

  • TiDB could reuse an existing gRPC client to invoke the RPC rather than creating a new gRPC connection each time
  • Cache the server info
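
The caching idea can be illustrated with a small time-to-live cache placed in front of the ServerInfo call, so that a new connection is opened only when an entry is missing or stale. This is a sketch of the idea only: TiDB is written in Go, and the class name, the fetch callable (standing in for getServerInfoByGRPC), and the TTL value are all hypothetical.

```python
import time


class ServerInfoCache:
    """Cache per-address server info for ttl_seconds before refetching."""

    def __init__(self, fetch, ttl_seconds=300.0, clock=time.monotonic):
        self._fetch = fetch        # callable(address) -> server info
        self._ttl = ttl_seconds    # hypothetical default; tune as needed
        self._clock = clock        # injectable for testing
        self._entries = {}         # address -> (fetched_at, info)

    def get(self, address):
        now = self._clock()
        entry = self._entries.get(address)
        if entry is not None and now - entry[0] < self._ttl:
            return entry[1]        # fresh entry: no new connection needed
        info = self._fetch(address)
        self._entries[address] = (now, info)
        return info
```

With such a cache, the number of short-lived connections depends on the TTL and the number of TiFlash instances rather than on the query rate. Hardware information rarely changes, so a long TTL seems plausible, but that is a judgment call for the maintainers.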

Other

  • This kind of exception is hard to reproduce in a simulated GCP environment or in a local environment.
  • Exiting the program on the Go client side before closing the gRPC connection makes the TCP stack send a [RST, ACK] packet to the server. However, this did not reproduce the socket leak described above.
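
For local experiments, another common way to make a client end a connection with RST instead of FIN is to enable SO_LINGER with a zero timeout before closing the socket. The sketch below demonstrates this on the loopback interface, assuming Linux (where struct linger is two ints); it is not the method used in this report, and, as noted above, an RST alone was not enough to reproduce the leak.

```python
import socket
import struct


def close_with_rst(sock):
    """Abortive close: SO_LINGER on with a zero timeout sends RST, not FIN."""
    linger = struct.pack("ii", 1, 0)  # l_onoff=1, l_linger=0 (Linux layout)
    sock.setsockopt(socket.SOL_SOCKET, socket.SO_LINGER, linger)
    sock.close()


def server_sees_reset(timeout=5.0):
    """Return True if the accepting side observes a connection reset."""
    listener = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    listener.settimeout(timeout)
    listener.bind(("127.0.0.1", 0))  # port 0: let the kernel pick a free port
    listener.listen(1)
    try:
        client = socket.create_connection(listener.getsockname(), timeout=timeout)
        conn, _ = listener.accept()
        conn.settimeout(timeout)
        close_with_rst(client)
        try:
            conn.recv(1)
            return False  # recv returned: orderly shutdown (FIN), no reset
        except ConnectionResetError:
            return True
        except socket.timeout:
            return False  # neither FIN nor RST arrived in time
        finally:
            conn.close()
    finally:
        listener.close()
```

Pointing a client like close_with_rst at a test TiFlash instance while watching the FD count would show whether a bare RST is sufficient or whether the leak also depends on the TLS and gRPC state at the moment the reset arrives.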
@solotzg solotzg added the type/bug The issue is confirmed as a bug. label Nov 21, 2024
@solotzg solotzg added affects-7.5 This bug affects the 7.5.x(LTS) versions. and removed may-affects-7.5 labels Dec 6, 2024
@solotzg solotzg changed the title TiFlash panics with Too many open files in the cloud GCP env TiFlash panics with Too many open files due to sockets leak in the cloud GCP env Jan 22, 2025
@solotzg solotzg added affects-8.1 This bug affects the 8.1.x(LTS) versions. and removed may-affects-8.1 labels Jan 22, 2025
@solotzg solotzg changed the title TiFlash panics with Too many open files due to sockets leak in the cloud GCP env TiFlash panics with Too many open files due to grpc connection socket leak in the cloud GCP env Jan 23, 2025
@windtalker

Would setting tidb_max_tiflash_threads explicitly be another workaround?

@solotzg

solotzg commented Jan 23, 2025

> Would setting tidb_max_tiflash_threads explicitly be another workaround?

Yes. It has been verified that setting tidb_max_tiflash_threads works around the socket leak by bypassing the problematic method getServerInfoByGRPC.
