Update dependencies #428

Merged (46 commits, Sep 6, 2019)
Conversation

@araffin (Collaborator) commented Jul 31, 2019

Update dependencies:

Docker images (notably the one used by Travis) are also updated (using the latest gym version).

@araffin (Collaborator, Author) commented Aug 1, 2019

I have to investigate why it fails more often; I suspect pytest is responsible...

@araffin mentioned this pull request on Aug 2, 2019
@araffin (Collaborator, Author) commented Aug 4, 2019

@AdamGleave @hill-a @erniejunior I identified the issue, and unfortunately it comes from TensorFlow (v1.5.0 works fine; tf>=1.8.0 makes Travis hang more often, even though no test fails).
What should we do?

@AdamGleave (Collaborator)

Do we know what's causing it to fail? Do the tests run OK in the Docker image on local machines?

We can debug what's causing the test to hang by attaching gdb and seeing where it gets stuck. I've run into TensorFlow deadlock issues before, usually due to multiprocessing or multithreading, so we might want to try setting the start method to forkserver in SubprocVecEnv, which should be thread-safe, and maybe try different versions of libopenmpi.
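
A minimal sketch of that suggestion, assuming SubprocVecEnv exposes a start_method keyword argument (the exact name may differ between releases):

```python
import gym

from stable_baselines.common.vec_env import SubprocVecEnv


def make_env(rank):
    def _init():
        env = gym.make("CartPole-v1")
        env.seed(rank)
        return env
    return _init


if __name__ == "__main__":
    # 'forkserver' (or 'spawn') starts each worker in a fresh interpreter,
    # so it does not inherit the parent's TensorFlow state the way 'fork' does.
    env = SubprocVecEnv([make_env(i) for i in range(4)], start_method="forkserver")
    obs = env.reset()
    env.close()
```

If the hang goes away with forkserver/spawn but persists with fork, that would point at a fork-related TensorFlow issue rather than pytest.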

@araffin (Collaborator, Author) commented Aug 5, 2019

Do we know what's causing it to fail?

I suspect an out-of-RAM error on Travis. We are pretty close to the limit, and some memory is not properly released during the tests (I tried to fix that in the past but was not successful).

Do the tests run OK in the Docker image on local machines?

I need to check, but I would say yes: I've been using tf 1.8.0 on my machine since the beginning and could run the tests without any problem in the past, but we should double-check.

@AdamGleave (Collaborator)
I suspect an out-of-RAM error on Travis. We are pretty close to the limit, and some memory is not properly released during the tests (I tried to fix that in the past but was not successful).

One option would be to split our test suite in half (e.g. alphabetically; a sketch is below) and run it in two separate Travis instances. This won't help if a single test consumes more memory than is available on Travis, but it would help if there are leaks, and it would have the added benefit of speeding up the test suite as well.

I'm confused why running out of RAM would cause hanging rather than, e.g., an out-of-memory error, but it seems possible, especially if swapping ends up being very slow.
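
A hypothetical sketch of the alphabetical split, using a pytest conftest.py hook (the TEST_SHARD variable and the exact split are illustrative, not part of the current setup):

```python
# conftest.py
import os


def pytest_collection_modifyitems(config, items):
    shard = os.environ.get("TEST_SHARD")
    if shard is None:
        return  # no sharding requested: run the full suite

    # Alphabetical split: sort by test id and cut the collected list in half.
    items.sort(key=lambda item: item.nodeid)
    middle = len(items) // 2
    selected = items[:middle] if int(shard) == 0 else items[middle:]
    deselected = [item for item in items if item not in selected]

    if deselected:
        config.hook.pytest_deselected(items=deselected)
        items[:] = selected
```

Each Travis job would then set TEST_SHARD=0 or TEST_SHARD=1 before invoking pytest.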

@araffin (Collaborator, Author) commented Aug 5, 2019

if swapping could end up very slow.

Yes, there may be two things: first, if it uses swap it becomes super slow; then, when you have no RAM left, you usually don't get any error (see what happens when you run a fork bomb) and things just hang...

One option would be to split our test suite in half (e.g. alphabetically) and run in two separate Travis instances.

Good idea; how easy is it to implement? And yes, memory leaks were definitely happening.

@AdamGleave (Collaborator)
It's now hanging on test_subproc_start_method, so I'm disabling the non-thread-safe methods there as well. Frustratingly, there's some non-determinism in the tests (e.g. the branch build passed but the PR one failed), so we'll probably need to run it a few times to make sure things work.

@araffin I've changed the default for SubprocVecEnv back to forkserver/spawn. I remember from #217 that you found this broke some applications. I believe the errors were largely MPI-related, so I'm hoping it's OK now that #433 is merged. An alternative would be to keep the non-thread-safe fork as the default and just have the tests run with a thread-safe method. I feel a bit uneasy about this (not testing the default seems bad), but it might be OK: I think this kind of deadlock is unusually likely to happen in large test suites, where TensorFlow is used in the parent before starting the environments.
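
For the "only run the tests with a thread-safe method" alternative, a hypothetical sketch (the real test_subproc_start_method may be structured quite differently, and the start_method keyword is assumed):

```python
import multiprocessing

import gym
import pytest

from stable_baselines.common.vec_env import SubprocVecEnv

# 'fork' is deliberately excluded: with TensorFlow already initialised in the
# parent process it can deadlock in the graph destructor.
THREAD_SAFE_START_METHODS = [
    method for method in multiprocessing.get_all_start_methods()
    if method in ("forkserver", "spawn")
]


@pytest.mark.parametrize("start_method", THREAD_SAFE_START_METHODS)
def test_thread_safe_start_methods(start_method):
    env = SubprocVecEnv([lambda: gym.make("CartPole-v1")], start_method=start_method)
    env.reset()
    env.close()
```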

@araffin (Collaborator, Author) commented Aug 6, 2019

@AdamGleave I managed to reproduce the deadlock (for me, it is currently happening on the Atari test with ACER + LSTM). You were right, it is not due to a memory problem.
I will need to check with my normal env (so without Docker).

I've changed the default for SubprocVecEnv back to forkserver/spawn

I would avoid that, because it makes things really user-unfriendly (e.g. you cannot use it in an IPython terminal anymore), and I would be interested in knowing where it actually comes from (in the sense of: what changed in tf that broke it?).
But it is not an easy choice :/

@AdamGleave (Collaborator) commented Aug 6, 2019

I've changed the default for SubprocVecEnv back to forkserver/spawn

I would avoid that, because it makes things really user-unfriendly (e.g. you cannot use it in an IPython terminal anymore), and I would be interested in knowing where it actually comes from (in the sense of: what changed in tf that broke it?).
But it is not an easy choice :/

I normally use DummyVecEnv anyway for small experiments, but I agree it's not ideal.
I'd be OK with keeping fork as the default but emitting a Python warning that it should be avoided for non-interactive use.
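
A minimal sketch of such a warning, assuming fork stays the default (the helper name and the message wording are illustrative only):

```python
import warnings


def _warn_if_fork(start_method):
    # Hypothetical helper that SubprocVecEnv could call with the resolved start method.
    if start_method == "fork":
        warnings.warn(
            "SubprocVecEnv is using the 'fork' start method. It is convenient for "
            "interactive use (e.g. IPython) but is not thread-safe and can deadlock "
            "when TensorFlow is initialised in the parent process; consider "
            "start_method='forkserver' or 'spawn' in scripts.",
            UserWarning,
        )
```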

When I last looked into it, the deadlock happened in the graph destructor. I think what happens is:

  • TensorFlow is initialised in the parent; the child inherits a pointer to the graph.
  • When the environment processes are closed, there's a race condition where the graph destructor gets called in multiple processes.
  • TensorFlow deadlocks.

Although irritating, TensorFlow is working as intended, and I think we'll always be rolling the dice on this if using fork. I think tensorflow/tensorflow#20600 had a similar problem.
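
A minimal sketch of that failure mode and the start-method change that avoids it (it won't deadlock reliably, since it's a race, so treat it as an illustration of the pattern rather than a repro script; note that forkserver is Unix-only):

```python
import multiprocessing

import tensorflow as tf  # TF 1.x API, as used in this project


def worker(_):
    # Under 'fork' this child would share the TensorFlow graph/session state
    # copied from the parent at fork time, and cleanup at exit can race with
    # the parent. Under 'forkserver' or 'spawn' it starts from a clean interpreter.
    return 0


if __name__ == "__main__":
    # TensorFlow state created in the parent *before* the workers are started.
    sess = tf.Session()
    sess.run(tf.constant(1.0))

    ctx = multiprocessing.get_context("forkserver")  # thread-safe alternative to "fork"
    with ctx.Pool(2) as pool:
        pool.map(worker, range(2))

    sess.close()
```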

@AdamGleave (Collaborator)
I'm done fiddling around with this. Happy to make changes to SubprocVecEnv if others have opinions or better ideas on how to fix this.

@araffin if you want me to work on merging the Codacy reports then ping me once the Docker image is updated with the dependencies.

@araffin (Collaborator, Author) commented Aug 6, 2019

if you want me to work on merging the Codacy reports then ping me once the Docker image is updated with the dependencies.

@AdamGleave I'll do that (I just don't have a good internet connection right now...)

@araffin (Collaborator, Author) commented Aug 7, 2019

@AdamGleave the image is pushed! (tag: 2.7.1; you can run the Codacy coverage reporter using ./../codacy-coverage-reporter)

@AdamGleave (Collaborator)
Why are we pip-installing things inside the test-running script rather than making that part of the Docker image? I think it will slow down each test run by a constant amount.

@araffin (Collaborator, Author) commented Aug 7, 2019

Why are we pip-installing things inside the test-running script rather than making that part of the Docker image?

I was just testing what was causing Travis to hang (for tf and gym); you should be able to remove them now (tf 1.8.0 and gym 0.14.0 are in the Docker image).

@araffin (Collaborator, Author) commented Aug 23, 2019

@hill-a could you review this one?

@araffin mentioned this pull request on Aug 29, 2019
@hill-a (Owner) left a comment

LGTM
