RUN_ONCE and many flows fails #291
Hey Daniel,
I found the problem; it was a bug in my application, not related to the library.
Hey Daniel, https://github.com/NEAT-project/neat/blob/master/examples/client_http_run_once.c#L110 Here you call
I honestly can't remember any good reason for that. Possibly I added it during some debug session to see if it helped and forgot to remove it...!
Should be fixed by #334
It works differently now, but it is still not working correctly.
The cutoff line for me: at
Running it with gdb and breaking at that "Cleanup!" point, the stack trace looks like this:
@karlgrin
@weinrank @bagder
@weinrank do you have some time to start looking at this?
How did you change the timeout to 10 ms?
@weinrank Hardcoded '10 ms' in the poll() call at line 293 of client_http_run_once.c.
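For context, a minimal sketch of what such a change looks like, assuming a plain poll()-based wait; `backend_fd` and the 10 ms constant are placeholders for illustration, not the actual code at line 293:

```c
#include <poll.h>

/* Wait for readability on a single descriptor.  With timeout = -1,
 * poll() blocks until an event arrives; with a finite timeout such as
 * 10 ms it returns periodically, so the caller re-enters the event
 * loop even when no descriptor became ready. */
static int wait_for_backend(int backend_fd)
{
    struct pollfd pfd;

    pfd.fd     = backend_fd;   /* hypothetical backend descriptor */
    pfd.events = POLLIN;

    return poll(&pfd, 1, 10 /* ms, instead of -1 */);
}
```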
Just wondering: does this really fix a bug? Can you explain why it did not work without the 10 ms timeout and does work with it? I just want to make sure that this is not a workaround protecting against a race condition in some scenarios...
@tuexen I don't consider this a fix of the bug, only an observation. Since I cannot explain the problem, I don't consider it fixed. I'll continue looking into this issue tomorrow. Part of the problem is that I have not been able to replicate it myself. I'll set up a new machine with Ubuntu 17.10 tomorrow and see if I am able to replicate the problem on that machine.
@karlgrin OK, makes sense. Thanks for the clarification. I was also able to reproduce the problem intermittently on FreeBSD head (running the program multiple times, it sometimes hangs and sometimes works), so the problem is not platform dependent. The system I was using to test has IPv4 and IPv6 addresses; possibly this is relevant.
The problem is related to the resolver. When the resolver is not used (i.e., when an IP address is given directly), the problem goes away. For example (towards bsd10.fh-muenster.de):
One of the problems when opening a large number of connections is that the maximum number of open files is reached.
I'm not that familiar with the "resolver business", so I hope that someone else can take a look at this. @naeemk?
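For reference, the per-process limit that gets exhausted here is RLIMIT_NOFILE; a small standalone sketch (plain POSIX, not NEAT-specific) for inspecting it:

```c
#include <stdio.h>
#include <sys/resource.h>

int main(void)
{
    struct rlimit rl;

    /* RLIMIT_NOFILE is the per-process limit on open file descriptors;
     * every flow plus every resolver instance counts against it (the
     * comment below suggests one resolver is started per connection). */
    if (getrlimit(RLIMIT_NOFILE, &rl) == 0)
        printf("open files: soft=%llu hard=%llu\n",
               (unsigned long long)rl.rlim_cur,
               (unsigned long long)rl.rlim_max);
    return 0;
}
```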
This is not a resolver issue per se; it is an issue of the resolver in combination with operating-system limits. The resolver currently in NEAT is the first version and quite naive, which means there are roughly zero optimizations. I haven't looked at the code in detail for a long time, but I suspect that one resolver is started for every connection. With 85 connections and multiple interfaces, this will exhaust the per-process limit on file descriptors pretty fast.

There were plans to make the resolver smarter, for example by implementing some sort of caching or doing something smarter than attaching the resolver to a flow, but this never materialized. I unfortunately doubt that I will have time to do anything about the resolver before NEAT is over, but if anyone else wants to have a go, that would be great.

The best solution would probably be to have a small cache, say 10 entries with a short TTL, attach the resolver to the context, and change the resolver from being on-demand to processing a request queue (with a limit N on how many requests can be processed in parallel). We could also attach flows to DNS requests already in progress, if there are multiple requests for the same domain. The first request to a given domain would then trigger a resolve, and the others would hit the cache or be attached to that request.
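To make the suggestion concrete, here is a minimal sketch of such a cache, with hypothetical names (dns_cache, dns_entry, cache_lookup) and illustrative constants; this is not the NEAT resolver API, just an outline of a small fixed-size table with a TTL and a cap on parallel requests:

```c
#include <stdio.h>
#include <string.h>
#include <time.h>

#define CACHE_SIZE   10   /* "say 10 entries" */
#define CACHE_TTL_S  30   /* short TTL in seconds (illustrative value) */
#define MAX_PARALLEL  4   /* limit N on resolves in flight (illustrative value) */

struct dns_entry {
    char   name[256];     /* queried domain */
    char   addr[64];      /* resolved address, textual form */
    time_t resolved_at;   /* for TTL expiry */
};

struct dns_cache {
    struct dns_entry entries[CACHE_SIZE];
    int              in_flight;   /* resolves currently running */
};

/* Return a cached address for `name`, or NULL if missing or expired.
 * On a miss the caller would enqueue a resolve request, or attach the
 * flow to an already running request for the same name, instead of
 * spawning a new resolver per flow. */
static const char *cache_lookup(struct dns_cache *c, const char *name)
{
    time_t now = time(NULL);

    for (int i = 0; i < CACHE_SIZE; i++) {
        if (strcmp(c->entries[i].name, name) == 0 &&
            now - c->entries[i].resolved_at < CACHE_TTL_S)
            return c->entries[i].addr;
    }
    return NULL;
}

/* Whether a new resolve may be started now or should stay queued. */
static int cache_can_resolve(const struct dns_cache *c)
{
    return c->in_flight < MAX_PARALLEL;
}

/* Called when a resolve completes: store the result in a free or
 * expired slot so later flows to the same domain hit the cache. */
static void cache_store(struct dns_cache *c, const char *name, const char *addr)
{
    time_t now = time(NULL);

    for (int i = 0; i < CACHE_SIZE; i++) {
        struct dns_entry *e = &c->entries[i];
        if (e->name[0] == '\0' || now - e->resolved_at >= CACHE_TTL_S) {
            snprintf(e->name, sizeof(e->name), "%s", name);
            snprintf(e->addr, sizeof(e->addr), "%s", addr);
            e->resolved_at = now;
            return;
        }
    }
}
```

The request queue itself and attaching multiple flows to one in-flight query are omitted here; the point is only that the cache would be tied to the context rather than to an individual flow.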
./client_http_run_once -n 80 bsd10.fh-muenster.de
and
./client_http_run_once -n 80 localhost
both show the same symptom (bsd10 is at a 42 ms RTT from my test machine). It quickly handles the first 55-56 flows and then gets stuck with 24 or 25 flows remaining, spinning at 100% CPU until, after some time, on_error is called and the application exits. Running on Linux kernel 4.9.2, no working IPv6, no SCTP.