RUN_ONCE and many flows fails #291
Hey Daniel,
I found the problem; it was a bug in my application, not related to the library.
Hey Daniel, https://github.com/NEAT-project/neat/blob/master/examples/client_http_run_once.c#L110 Here you call
I honestly can't remember any good reason for that. Possibly I added it during some debug session to see if it helped and forgot to remove it...!
Should be fixed by #334
It works differently now, but it is still not working correctly.
The cutoff line for me: at
Running it with gdb and breaking at that "Cleanup!" point, the stack trace looks like this:
@karlgrin
@weinrank @bagder
@weinrank do you have some time to start looking at this?
How did you change the timeout to 10 ms?
@weinrank Hardcoded '10 ms' in the poll() call at line 293 of client_http_run_once.c.
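For context, a minimal sketch of what such a change looks like, assuming a plain poll()-based wait; `backend_fd` and the 10 ms constant are placeholders for illustration, not the actual code at line 293:

```c
#include <poll.h>

/* Wait for readability on a single descriptor.  With timeout = -1,
 * poll() blocks until an event arrives; with a finite timeout such as
 * 10 ms it returns periodically, so the caller re-enters the event
 * loop even when no descriptor became ready. */
static int wait_for_backend(int backend_fd)
{
    struct pollfd pfd;

    pfd.fd     = backend_fd;   /* hypothetical backend descriptor */
    pfd.events = POLLIN;

    return poll(&pfd, 1, 10 /* ms, instead of -1 */);
}
```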
Just wondering: does this really fix a bug? Can you explain why it did not work without the 10 ms timeout and does work with it? I just want to make sure that this is not a workaround protecting against a race condition in some scenarios...
@tuexen I don't consider this a fix of the bug, only an observation. Since I cannot explain the problem, I don't consider it fixed. I'll continue looking into this issue tomorrow. Part of the problem is that I have not been able to replicate it myself. I'll set up a new machine with Ubuntu 17.10 tomorrow and see if I am able to replicate the problem on that machine.
@karlgrin OK, makes sense. Thanks for the clarification. I was also able to reproduce the problem intermittently on FreeBSD head (running the program multiple times, it sometimes hangs and sometimes works), so the problem is not platform dependent. The system I was using to test has IPv4 and IPv6 addresses; possibly this is relevant.
The problem is related to the resolver. When the resolver is not used (i.e., when an IP address is given directly), the problem goes away. For example (towards bsd10.fh-muenster.de):
One of the problems when opening a large number of connections is that the maximum number of open files is reached.
I'm not that familiar with the "resolver business", so I hope that someone else can take a look at this. @naeemk?
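For reference, the per-process limit that gets exhausted here is RLIMIT_NOFILE; a small standalone sketch (plain POSIX, not NEAT-specific) for inspecting it:

```c
#include <stdio.h>
#include <sys/resource.h>

int main(void)
{
    struct rlimit rl;

    /* RLIMIT_NOFILE is the per-process limit on open file descriptors;
     * every flow plus every resolver instance counts against it (the
     * comment below suggests one resolver is started per connection). */
    if (getrlimit(RLIMIT_NOFILE, &rl) == 0)
        printf("open files: soft=%llu hard=%llu\n",
               (unsigned long long)rl.rlim_cur,
               (unsigned long long)rl.rlim_max);
    return 0;
}
```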
This is not a resolver issue per se; it is an issue of the resolver in combination with operating-system limits. The resolver currently in NEAT is the first version and quite naive, which means there are roughly zero optimizations. I haven't looked at the code in detail for a long time, but I suspect that one resolver is started for every connection. With 85 connections and multiple interfaces, this will exhaust the per-process limit on file descriptors pretty fast.

There were plans to make the resolver smarter, for example by implementing some sort of caching or doing something smarter than attaching the resolver to a flow, but this never materialized. I unfortunately doubt that I will have time to do anything about the resolver before NEAT is over, but if anyone else wants to have a go, that would be great.

The best solution would probably be to have a small cache, say 10 entries with a short TTL, attach the resolver to the context, and change the resolver from being on-demand to processing a request queue (with a limit N on how many requests can be processed in parallel). We could also attach flows to DNS requests already in progress, if there are multiple requests for the same domain. The first request to a given domain would then trigger a resolve, and the others would hit the cache or be attached to that request.
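To make the suggestion concrete, here is a minimal sketch of such a cache, with hypothetical names (dns_cache, dns_entry, cache_lookup) and illustrative constants; this is not the NEAT resolver API, just an outline of a small fixed-size table with a TTL and a cap on parallel requests:

```c
#include <stdio.h>
#include <string.h>
#include <time.h>

#define CACHE_SIZE   10   /* "say 10 entries" */
#define CACHE_TTL_S  30   /* short TTL in seconds (illustrative value) */
#define MAX_PARALLEL  4   /* limit N on resolves in flight (illustrative value) */

struct dns_entry {
    char   name[256];     /* queried domain */
    char   addr[64];      /* resolved address, textual form */
    time_t resolved_at;   /* for TTL expiry */
};

struct dns_cache {
    struct dns_entry entries[CACHE_SIZE];
    int              in_flight;   /* resolves currently running */
};

/* Return a cached address for `name`, or NULL if missing or expired.
 * On a miss the caller would enqueue a resolve request, or attach the
 * flow to an already running request for the same name, instead of
 * spawning a new resolver per flow. */
static const char *cache_lookup(struct dns_cache *c, const char *name)
{
    time_t now = time(NULL);

    for (int i = 0; i < CACHE_SIZE; i++) {
        if (strcmp(c->entries[i].name, name) == 0 &&
            now - c->entries[i].resolved_at < CACHE_TTL_S)
            return c->entries[i].addr;
    }
    return NULL;
}

/* Whether a new resolve may be started now or should stay queued. */
static int cache_can_resolve(const struct dns_cache *c)
{
    return c->in_flight < MAX_PARALLEL;
}

/* Called when a resolve completes: store the result in a free or
 * expired slot so later flows to the same domain hit the cache. */
static void cache_store(struct dns_cache *c, const char *name, const char *addr)
{
    time_t now = time(NULL);

    for (int i = 0; i < CACHE_SIZE; i++) {
        struct dns_entry *e = &c->entries[i];
        if (e->name[0] == '\0' || now - e->resolved_at >= CACHE_TTL_S) {
            snprintf(e->name, sizeof(e->name), "%s", name);
            snprintf(e->addr, sizeof(e->addr), "%s", addr);
            e->resolved_at = now;
            return;
        }
    }
}
```

The request queue itself and attaching multiple flows to one in-flight query are omitted here; the point is only that the cache would be tied to the context rather than to an individual flow.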
./client_http_run_once -n 80 bsd10.fh-muenster.de
and
./client_http_run_once -n 80 localhost
both show the same symptom (bsd10 is at a 42 ms RTT from my test machine). It quickly handles the first 55-56 flows and then gets stuck with 24 or 25 flows remaining, spinning at 100% CPU until, after some time, on_error is called and the application exits. Running on Linux kernel 4.9.2, no working IPv6, no SCTP.