"brimcap analyze" hangs on a pcap that produces no Suricata logs #112

Closed
philrz opened this issue Jun 30, 2021 · 2 comments · Fixed by #115


philrz commented Jun 30, 2021

Repro is with Brimcap commit 357bd57, but I also confirmed this issue was present at commit 6406f89, so it's not unique to the recent refactor of brimcap analyze (#110).

The ultimate user problem is that they dragged into Brim a pcap from which neither Zeek nor Suricata can produce meaningful logs. The attached test pcap
cap_00001_20210622092340.pcap.gz (after uncompressing) reproduces the issue. Using Brim commit 976d840 with its package.json pointing at Brimcap commit 357bd57, the following video shows the current user-facing result:

  1. The pcap load never seems to finish
  2. If the user gets sick of waiting and quits Brim, the brimcap process is left behind and must be killed manually
Repro.mp4

The inability to parse is ultimately an orthogonal Zeek/Suricata problem that's separately tracked in #107. However, knowing that unparseable pcaps are likely to come up in practice, the purpose of this issue is to ensure that we fail on them gracefully, so we don't end up with the kind of hang and orphaned process we just saw.

Knowing that our intent is to move to a brimcap analyze approach, I then reproduced it at the CLI, seeing it hang there as well.

Analyze-Hang.mp4

The dump it showed after I gave up and sent it a SIGQUIT with Ctrl-\:

$ brimcap analyze -z cap_00001_20210622092340.pcap 
  0.1% 64KiB/1MiB37KiB434B records=0 
^C^\SIGQUIT: quit
PC=0x7fff2046acde m=0 sigcode=0

goroutine 0 [idle]:
runtime.pthread_cond_wait(0x2060220, 0x20601e0, 0x0)
	/usr/local/opt/go/libexec/src/runtime/sys_darwin.go:384 +0x39
runtime.semasleep(0xffffffffffffffff, 0xc000046000)
	/usr/local/opt/go/libexec/src/runtime/os_darwin.go:63 +0x8d
runtime.notesleep(0x205ffd0)
	/usr/local/opt/go/libexec/src/runtime/lock_sema.go:181 +0xdb
runtime.mPark()
	/usr/local/opt/go/libexec/src/runtime/proc.go:1340 +0x39
runtime.stoplockedm()
	/usr/local/opt/go/libexec/src/runtime/proc.go:2495 +0x6e
runtime.schedule()
	/usr/local/opt/go/libexec/src/runtime/proc.go:3103 +0x48c
runtime.park_m(0xc000001980)
	/usr/local/opt/go/libexec/src/runtime/proc.go:3318 +0x9d
runtime.mcall(0x106ef36)
	/usr/local/opt/go/libexec/src/runtime/asm_amd64.s:327 +0x5b

goroutine 1 [chan send (nil chan)]:
github.com/gosuri/uilive.(*Writer).Stop(...)
	/Users/phil/.go/pkg/mod/github.com/gosuri/[email protected]/writer.go:119
github.com/brimdata/brimcap/cli/analyzecli.(*statusLineDisplay).End(0xc00039a420)
	/Users/phil/work/brimcap/cli/analyzecli/display.go:75 +0x66
github.com/brimdata/brimcap/cmd/brimcap/analyze.(*Command).Run(0xc0002b2100, 0xc000032070, 0x1, 0x1, 0x1b85ec0, 0xc0002d2000)
	/Users/phil/work/brimcap/cmd/brimcap/analyze/command.go:88 +0x494
github.com/brimdata/zed/pkg/charm.path.run(0xc0002c4450, 0x2, 0x2, 0xc000032070, 0x1, 0x1, 0xc0002c4450, 0x2)
	/Users/phil/.go/pkg/mod/github.com/brimdata/[email protected]/pkg/charm/path.go:11 +0x8f
github.com/brimdata/zed/pkg/charm.(*Spec).ExecRoot(0x20535e0, 0xc000032050, 0x3, 0x3, 0xffffffff, 0xc00009c058)
	/Users/phil/.go/pkg/mod/github.com/brimdata/[email protected]/pkg/charm/charm.go:63 +0x1e7
main.main()
	/Users/phil/work/brimcap/cmd/brimcap/main.go:19 +0x74

goroutine 34 [syscall]:
os/signal.signal_recv(0x1b918e0)
	/usr/local/opt/go/libexec/src/runtime/sigqueue.go:165 +0x9d
os/signal.loop()
	/usr/local/opt/go/libexec/src/os/signal/signal_unix.go:23 +0x25
created by os/signal.Notify.func1.1
	/usr/local/opt/go/libexec/src/os/signal/signal.go:151 +0x45

rax    0x104
rbx    0xab53e00
rcx    0x7ffeefbff4a8
rdx    0x500
rdi    0x2060220
rsi    0x80100000900
rbp    0x7ffeefbff550
rsp    0x7ffeefbff4a8
r8     0x0
r9     0xa0
r10    0x0
r11    0x246
r12    0x20601e0
r13    0x16
r14    0x80100000900
r15    0x500
rip    0x7fff2046acde
rflags 0x247
cs     0x7
fs     0x0
gs     0x0
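
For what it's worth, the "chan send (nil chan)" state on goroutine 1 is Go's indicator of a send on a nil channel, which blocks forever. The trace shows uilive's Writer.Stop blocked that way, which suggests the display's writer was never started, presumably because no records ever arrived for it to show. A minimal standalone illustration (not brimcap code):

package main

// Sending on a nil channel blocks forever: the same "chan send (nil chan)"
// state shown for goroutine 1 above. In this tiny program the runtime can
// see that no goroutine can ever proceed, so it aborts with "fatal error:
// all goroutines are asleep - deadlock!".
func main() {
	var done chan bool // nil: declared but never made
	done <- true       // blocks forever
}

In brimcap itself the runtime can't detect the deadlock because other goroutines (like the signal handler, goroutine 34 above) are still present, so the process simply hangs instead of crashing.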

Knowing as I do that brimcap analyze is just calling the built-in zeekrunner and suricatarunner, I also reproduced it at that layer, and I think I see what's going on.

I think Zeek is doing fine. It produces a few near-useless logs that reflect little beyond its inability to parse, but at minimum the weird.log it produces could have been converted to ZNG/ZSON just fine.

$ cat ../cap_00001_20210622092340.pcap | ~/work/brimcap/build/dist/zeek/zeekrunner 
WARNING: No Site::local_nets have been defined.  It's usually a good idea to define your local networks.

$ ls -l
total 24
-rw-r--r--  1 phil  staff  276 Jun 30 14:45 capture_loss.log
-rw-r--r--  1 phil  staff  791 Jun 30 14:45 stats.log
-rw-r--r--  1 phil  staff  333 Jun 30 14:45 weird.log

I suspect Suricata is the source of the problem. Its output reflects its inability to parse the pcap, but unfortunately it then exits with code 0 and leaves behind a zero-length eve.json.

$ cat ../cap_00001_20210622092340.pcap | ~/work/brimcap/build/dist/suricata/suricatarunner 
30/6/2021 -- 14:45:51 - <Notice> - This is Suricata version 5.0.3 RELEASE running in USER mode
30/6/2021 -- 14:45:51 - <Info> - CPUs/cores online: 12
30/6/2021 -- 14:45:51 - <Info> - No 'host-mode': suricata is in IDS mode, using default setting 'sniffer-only'
30/6/2021 -- 14:45:51 - <Info> - eve-log output device (regular) initialized: eve.json
30/6/2021 -- 14:45:51 - <Info> - 1 rule files processed. 22608 rules successfully loaded, 0 rules failed
30/6/2021 -- 14:45:51 - <Info> - Threshold config parsed: 0 rule(s) found
30/6/2021 -- 14:45:51 - <Info> - 22611 signatures processed. 1134 are IP-only rules, 3887 are inspecting packet payload, 17392 inspect application layer, 103 are decoder event only
30/6/2021 -- 14:45:57 - <Error> - [ERRCODE: SC_ERR_UNIMPLEMENTED(88)] - datalink type 127 not (yet) supported in module PcapFile.
30/6/2021 -- 14:45:57 - <Warning> - [ERRCODE: SC_ERR_PCAP_DISPATCH(20)] - Failed to init pcap file -, skipping
30/6/2021 -- 14:45:57 - <Notice> - all 1 packet processing threads, 2 management threads initialized, engine started.
30/6/2021 -- 14:45:57 - <Error> - [ERRCODE: SC_ERR_INVALID_ARGUMENT(13)] - pcap file reader thread failed to initialize
30/6/2021 -- 14:45:57 - <Notice> - Signal Received.  Stopping engine.
30/6/2021 -- 14:45:57 - <Info> - time elapsed 0.018s
30/6/2021 -- 14:45:57 - <Info> - Alerts: 0
30/6/2021 -- 14:45:57 - <Info> - cleaning up signature grouping structure... complete

$ echo $?
0

$ ls -l
total 0
-rw-r--r--  1 phil  staff  0 Jun 30 14:45 eve.json

Is it possible the successful exit of the analyzer process is somehow not playing well with the zero-length log file, such that maybe we're hung waiting on output that's never going to arrive? If so, maybe we need some special handling to catch this corner case.
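
In case it helps, here's a minimal sketch of the kind of special handling I have in mind. The helper name checkAnalyzerOutput and its use are hypothetical, not brimcap's actual API; the idea is just to treat "exit 0 plus an empty log file" as "no records coming" rather than waiting on the reader forever:

package main

import (
	"fmt"
	"os"
)

// checkAnalyzerOutput is a hypothetical helper: after an analyzer process
// exits with code 0, confirm it actually produced log output before
// blocking on a reader that may never deliver records.
func checkAnalyzerOutput(logPath string) error {
	info, err := os.Stat(logPath)
	if err != nil {
		return fmt.Errorf("analyzer wrote no log file %s: %w", logPath, err)
	}
	if info.Size() == 0 {
		// The corner case above: a successful exit but an empty eve.json.
		return fmt.Errorf("analyzer exited successfully but %s is empty", logPath)
	}
	return nil
}

func main() {
	if err := checkAnalyzerOutput("eve.json"); err != nil {
		fmt.Fprintln(os.Stderr, err)
		os.Exit(1)
	}
}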

@philrz philrz added the bug Something isn't working label Jun 30, 2021

mattnibs commented Jul 1, 2021

Is it possible the successful exit of the analyzer process is somehow not playing well with the zero-length log file, such that maybe we're hung waiting on output that's never going to arrive? If so, maybe we need some special handling to catch this corner case.

Huh, interesting. This is a good bit of debug for me to go off of.

@philrz philrz added this to the Data MVP0 milestone Jul 1, 2021
mattnibs added a commit that referenced this issue Jul 6, 2021
Fix error where an analyze process that writes no records caused a
deadlock.

Closes #112
mattnibs added a commit that referenced this issue Jul 7, 2021
Fix error where an analyze process that writes no records caused a
deadlock.

Closes #112
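
The commit message matches the trace above: with no records written, brimcap analyze reached the display's End without the live writer ever having started. A hedged sketch of the general shape such a fix could take (hypothetical type and fields, not the actual PR #115 diff): track whether the display ever started and make End a no-op otherwise.

package analyzecli

import "github.com/gosuri/uilive"

// Hypothetical sketch, not the PR #115 diff: ensure End never calls
// Stop on a uilive writer whose Start was never invoked, which would
// otherwise send on a nil channel and block forever.
type display struct {
	live    *uilive.Writer
	started bool
}

func (d *display) start() {
	d.live.Start()
	d.started = true
}

func (d *display) End() {
	if !d.started {
		return // no records were ever displayed; nothing to stop
	}
	d.live.Stop()
}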

philrz commented Jul 7, 2021

Verified in Brimcap commit 1760eba.

Instead of hanging when processing the example pcap, Brimcap now exits with an appropriate error.

$ brimcap -version
Version: v0.0.5-26-g1760eba

$ brimcap analyze -z cap_00001_20210622092340.pcap
  6.0% 65.54KB/1.09MB records=0 
{_path:"weird",ts:2021-06-22T07:23:40.390809Z,uid:null (bstring),id:{orig_h:null (ip),orig_p:null (port=(uint16)),resp_h:null (ip),resp_p:null (port)} (=0),name:"unknown_packet_type" (bstring),addl:null (bstring),notice:false,peer:"zeek" (bstring)} (=1)
{_path:"stats",ts:2021-06-22T07:23:40.390809Z,peer:"zeek" (bstring),mem:67 (uint64),pkts_proc:1 (uint64),bytes_recv:382 (uint64),pkts_dropped:null (uint64),pkts_link:null (uint64),pkt_lag:null (duration),events_proc:423 (uint64),events_queued:12 (uint64),active_tcp_conns:0 (uint64),active_udp_conns:0 (uint64),active_icmp_conns:0 (uint64),tcp_conns:0 (uint64),udp_conns:0 (uint64),icmp_conns:0 (uint64),timers:39 (uint64),active_timers:35 (uint64),files:0 (uint64),active_files:0 (uint64),dns_requests:0 (uint64),active_dns_requests:0 (uint64),reassem_tcp_size:0 (uint64),reassem_file_size:0 (uint64),reassem_frag_size:0 (uint64),reassem_unknown_size:0 (uint64)} (=2)
zeekrunner exited with code 1
stdout:
WARNING: No Site::local_nets have been defined.  It's usually a good idea to define your local networks.
stderr:
1624346621.976873 fatal error in <command line>, line 3: failed to read a packet from -: truncated dump file; tried to read 16 header bytes, only got 10

$ echo $?
1

Unfortunately the higher-level problem of Brim hanging is still present even with this newer Brimcap version. Per the verification video in #111 (comment), that's because the app still uses brimcap load, which has other problems. So until brimdata/zui#1716 is addressed, users can't get the benefit of this fix. I'll make this a priority with the UX team.

Thanks @mattnibs!
