Update dependencies #428

Merged (46 commits, Sep 6, 2019)
Conversation

@araffin (Collaborator) commented Jul 31, 2019

Update dependencies:

Docker images (notably the one used by Travis) are also updated (using the latest gym version).

@araffin (Collaborator, Author) commented Aug 1, 2019

I have to investigate why it fails more often; I suspect pytest is responsible...

@araffin mentioned this pull request on Aug 2, 2019
@araffin (Collaborator, Author) commented Aug 4, 2019

@AdamGleave @hill-a @erniejunior I identified the issue, and unfortunately it comes from TensorFlow (v1.5.0 works fine; tf>=1.8.0 makes Travis hang more often, even though no test fails).
What should we do?

@AdamGleave (Collaborator)

Do we know what's causing it to fail? Do the tests run OK in the Docker image on local machines?

We can debug what's causing the test to hang by attaching gdb and seeing where it gets stuck. I've run into TensorFlow deadlock issues before, usually due to multiprocessing or multithreading, so we might want to try setting the start method to forkserver in SubprocVecEnv, which should be thread-safe, and maybe try different versions of libopenmpi.
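
A minimal sketch of that suggestion, assuming SubprocVecEnv exposes a start_method keyword argument (the exact name may differ between releases):

```python
import gym

from stable_baselines.common.vec_env import SubprocVecEnv


def make_env(rank):
    def _init():
        env = gym.make("CartPole-v1")
        env.seed(rank)
        return env
    return _init


if __name__ == "__main__":
    # 'forkserver' (or 'spawn') starts each worker in a fresh interpreter,
    # so it does not inherit the parent's TensorFlow state the way 'fork' does.
    env = SubprocVecEnv([make_env(i) for i in range(4)], start_method="forkserver")
    obs = env.reset()
    env.close()
```

If the hang goes away with forkserver/spawn but persists with fork, that would point at a fork-related TensorFlow issue rather than pytest.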

@araffin (Collaborator, Author) commented Aug 5, 2019

Do we know what's causing it to fail?

I suspect an out-of-RAM error on Travis. We are pretty close to the limit, and some memory is not properly released during the tests (I tried to fix that in the past but was not successful).

Do the tests run OK in the Docker image on local machines?

I need to check, but I would say yes: I've been using tf 1.8.0 on my machine since the beginning and could run the tests without any problem in the past, but we should double-check.

@AdamGleave (Collaborator)
I suspect an out-of-RAM error on Travis. We are pretty close to the limit, and some memory is not properly released during the tests (I tried to fix that in the past but was not successful).

One option would be to split our test suite in half (e.g. alphabetically; a sketch is below) and run it in two separate Travis instances. This won't help if a single test consumes more memory than is available on Travis, but it would help if there are leaks, and it would have the added benefit of speeding up the test suite as well.

I'm confused why running out of RAM would cause hanging rather than, e.g., an out-of-memory error, but it seems possible, especially if swapping ends up being very slow.
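
A hypothetical sketch of the alphabetical split, using a pytest conftest.py hook (the TEST_SHARD variable and the exact split are illustrative, not part of the current setup):

```python
# conftest.py
import os


def pytest_collection_modifyitems(config, items):
    shard = os.environ.get("TEST_SHARD")
    if shard is None:
        return  # no sharding requested: run the full suite

    # Alphabetical split: sort by test id and cut the collected list in half.
    items.sort(key=lambda item: item.nodeid)
    middle = len(items) // 2
    selected = items[:middle] if int(shard) == 0 else items[middle:]
    deselected = [item for item in items if item not in selected]

    if deselected:
        config.hook.pytest_deselected(items=deselected)
        items[:] = selected
```

Each Travis job would then set TEST_SHARD=0 or TEST_SHARD=1 before invoking pytest.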

@araffin (Collaborator, Author) commented Aug 5, 2019

if swapping could end up very slow.

Yes, there may be two things: first, if it uses swap it becomes super slow; then, when you have no RAM left, you usually don't get any error (see what happens when you run a fork bomb) and things just hang...

One option would be to split our test suite in half (e.g. alphabetically) and run in two separate Travis instances.

Good idea; how easy is it to implement? And yes, memory leaks were definitely happening.

@AdamGleave (Collaborator)
It's now hanging on test_subproc_start_method, so I'm disabling the non-thread-safe methods there as well. Frustratingly, there's some non-determinism in the tests (e.g. the branch build passed but the PR one failed), so we'll probably need to run it a few times to make sure things work.

@araffin I've changed the default for SubprocVecEnv back to forkserver/spawn. I remember from #217 that you found this broke some applications. I believe the errors were largely MPI-related, so I'm hoping it's OK now that #433 is merged. An alternative would be to keep the non-thread-safe fork as the default and just have the tests run with a thread-safe method. I feel a bit uneasy about this (not testing the default seems bad), but it might be OK: I think this kind of deadlock is unusually likely to happen in large test suites, where TensorFlow is used in the parent before starting the environments.
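
For the "only run the tests with a thread-safe method" alternative, a hypothetical sketch (the real test_subproc_start_method may be structured quite differently, and the start_method keyword is assumed):

```python
import multiprocessing

import gym
import pytest

from stable_baselines.common.vec_env import SubprocVecEnv

# 'fork' is deliberately excluded: with TensorFlow already initialised in the
# parent process it can deadlock in the graph destructor.
THREAD_SAFE_START_METHODS = [
    method for method in multiprocessing.get_all_start_methods()
    if method in ("forkserver", "spawn")
]


@pytest.mark.parametrize("start_method", THREAD_SAFE_START_METHODS)
def test_thread_safe_start_methods(start_method):
    env = SubprocVecEnv([lambda: gym.make("CartPole-v1")], start_method=start_method)
    env.reset()
    env.close()
```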

@araffin (Collaborator, Author) commented Aug 6, 2019

@AdamGleave I managed to reproduce the deadlock (for me, it is currently happening on the Atari test with ACER + LSTM). You were right, it is not due to a memory problem.
I will need to check with my normal env (so without Docker).

I've changed the default for SubprocVecEnv back to forkserver/spawn

I would avoid that, because it makes things really user-unfriendly (e.g. you cannot use it in an IPython terminal anymore), and I would be interested in knowing where it actually comes from (in the sense of: what changed in tf that broke it?).
But it is not an easy choice :/

@AdamGleave (Collaborator) commented Aug 6, 2019

I've changed the default for SubprocVecEnv back to forkserver/spawn

I would avoid that, because it makes things really user-unfriendly (e.g. you cannot use it in an IPython terminal anymore), and I would be interested in knowing where it actually comes from (in the sense of: what changed in tf that broke it?).
But it is not an easy choice :/

I normally use DummyVecEnv anyway for small experiments, but I agree it's not ideal.
I'd be OK with keeping fork as the default but emitting a Python warning that it should be avoided for non-interactive use.
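
A minimal sketch of such a warning, assuming fork stays the default (the helper name and the message wording are illustrative only):

```python
import warnings


def _warn_if_fork(start_method):
    # Hypothetical helper that SubprocVecEnv could call with the resolved start method.
    if start_method == "fork":
        warnings.warn(
            "SubprocVecEnv is using the 'fork' start method. It is convenient for "
            "interactive use (e.g. IPython) but is not thread-safe and can deadlock "
            "when TensorFlow is initialised in the parent process; consider "
            "start_method='forkserver' or 'spawn' in scripts.",
            UserWarning,
        )
```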

When I last looked into it, the deadlock happened in the graph destructor. I think what happens is:

  • TensorFlow is initialised in the parent; the child inherits a pointer to the graph.
  • When the environment processes are closed, there's a race condition where the graph destructor gets called in multiple processes.
  • TensorFlow deadlocks.

Although irritating, TensorFlow is working as intended, and I think we'll always be rolling the dice on this if using fork. I think tensorflow/tensorflow#20600 had a similar problem.
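
A minimal sketch of that failure mode and the start-method change that avoids it (it won't deadlock reliably, since it's a race, so treat it as an illustration of the pattern rather than a repro script; note that forkserver is Unix-only):

```python
import multiprocessing

import tensorflow as tf  # TF 1.x API, as used in this project


def worker(_):
    # Under 'fork' this child would share the TensorFlow graph/session state
    # copied from the parent at fork time, and cleanup at exit can race with
    # the parent. Under 'forkserver' or 'spawn' it starts from a clean interpreter.
    return 0


if __name__ == "__main__":
    # TensorFlow state created in the parent *before* the workers are started.
    sess = tf.Session()
    sess.run(tf.constant(1.0))

    ctx = multiprocessing.get_context("forkserver")  # thread-safe alternative to "fork"
    with ctx.Pool(2) as pool:
        pool.map(worker, range(2))

    sess.close()
```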

@AdamGleave (Collaborator)
I'm done fiddling around with this. Happy to make changes to SubprocVecEnv if others have opinions or better ideas on how to fix this.

@araffin if you want me to work on merging the Codacy reports then ping me once the Docker image is updated with the dependencies.

@araffin (Collaborator, Author) commented Aug 6, 2019

if you want me to work on merging the Codacy reports then ping me once the Docker image is updated with the dependencies.

@AdamGleave I'll do that (I just don't have a good internet connection right now...)

@araffin (Collaborator, Author) commented Aug 7, 2019

@AdamGleave the image is pushed! (tag: 2.7.1; you can run the Codacy coverage reporter using ./../codacy-coverage-reporter)

@AdamGleave (Collaborator)
Why are we pip-installing things inside the test-running script rather than making that part of the Docker image? I think it will slow down each test run by a constant amount.

@araffin (Collaborator, Author) commented Aug 7, 2019

Why are we pip-installing things inside the test-running script rather than making that part of the Docker image?

I was just testing what was causing Travis to hang (for tf and gym); you should be able to remove them now (tf 1.8.0 and gym 0.14.0 are in the Docker image).

@araffin (Collaborator, Author) commented Aug 23, 2019

@hill-a could you review this one?

@araffin mentioned this pull request on Aug 29, 2019
@hill-a (Owner) left a comment

LGTM
