Sync ray_merge with master #5

cathywu · 2018-12-30T23:09:50Z

What do these changes do?

Sync ray_merge with master, since this is the branch used for Travis tests.

Related issue number

* Add the extra fallback for serialization. * Better comments & warnings. quotes. * Update test/runtest.py Co-Authored-By: suquark <[email protected]> * Update test/runtest.py Co-Authored-By: suquark <[email protected]> * linting * Don't hijack too much errors. * simplify the test * Update runtest.py * simplify

`test_free_objects_multi_node` can fail intermittently: if we run this test 20 times, we may find at least one failure. The cause is that the test is based on function tasks. One raylet may create more than one worker to execute the tasks, so the flush operations may be spread across several workers and may not clean up all the worker objects held by the plasma client. This PR changes the function tasks to actor tasks, which guarantees that all the tasks are executed in one worker of a raylet.

## What do these changes do?

JSON Logger now uses cloudpickle to dump the configs as well, which pickles the functions needed for multi-agent replay.

## Related issue number

* Init commit for async plasma client * Create an eventloop model for ray/plasma * Implement a poll-like selector base on `ray.wait`. Huge improvements. * Allow choosing workers & selectors * remove original design * initial implementation of epoll-like selector for plasma * Add a param for `worker` used in `PlasmaSelectorEventLoop` * Allow accepting a `Future` which returns object_id * Do not need `io.py` anymore * Create a basic testing model * fix: `ray.wait` returns tuple of lists * fix a few bugs * improving performance & bug fixing * add test * several improvements & fixing * fix relative import * [async] change code format, remove old files * [async] Create context wrapper for the eventloop * [async] fix: context should return a value * [async] Implement futures grouping * [async] Fix bugs & replace old functions * [async] Fix bugs found in tests * [async] Implement `PlasmaEpoll` * [async] Make test faster, add tests for epoll * [async] Fix code format * [async] Add comments for main code. * [async] Fix import path. * [async] Fix test. * [async] Compatibility. * [async] less verbose to not annoy the CI. * [async] Add test for new API * [async] Allow showing debug info in some of the test. * [async] Fix test. * [async] Proper shutdown. * [async] Lint~ * [async] Move files to experimental and create API * [async] Use async/await syntax * [async] Fix names & styles * [async] comments * [async] bug fixing & use pytest * [async] bug fixing & change tests * [async] use logger * [async] add tests * [async] lint * [async] type checking * [async] add more tests * [async] fix bugs on waiting a future while timeout. Add more docs. * [async] Formal docs. * [async] Add typing info since these codes are compatible with py3.5+. * [async] Documents. * [async] Lint. * [async] Fix deprecated call. * [async] Fix deprecated call. * [async] Implement a more reasonable way for dealing with pending inputs. 
* [async] Fix docs * [async] Lint * [async] Fix bug: Type for time * [async] Set our eventloop as the default eventloop so that we can get it through `asyncio.get_event_loop()`. * [async] Update test & docs. * [async] Lint. * [async] Temporarily print more debug info. * [async] Use `Poll` as a default option. * [async] Limit resources. * new async implementation for Ray * implement linked list * bug fix * update * support seamless async operations * update * update API * fix tests * lint * bug fix * refactor names * improve doc * properly shutdown async_api * doc * Change the table on the index page. * Adjust table size. * Only keeps `as_future`. * change how we init connection * init connection in `ray.worker.connect` * doc * fix * Move initialization code into the module. * Fix docs & code * Update pyarrow version. * lint * Restore index.rst * Add known issues. * Apply suggestions from code review Co-Authored-By: suquark <[email protected]> * rename * Update async_api.rst * Update async_api.py * Update async_api.rst * Update async_api.py * Update worker.py * Update async_api.rst * fix tests * lint * lint * replace the magic number

…ay-project#3591)

* Allowing multiple users to access the /tmp/ray file at the same time

  Previous sequence that caused this issue:
  * User A starts ray with `ray.init` when /tmp/ray does not exist
  * User B starts ray with `ray.init` and /tmp/ray now exists

  User B will get a permissions error. Checking the permissions, /tmp/ray is 700.

  I have identified a race condition in `try_to_create_directory`:
  * Multiple processes try to create /tmp/ray at the same time
  * chmod is either silently erroring or working properly within the race condition

  Resolution: Move chmod outside of the check for whether the directory exists or not.

* Adding try except for users who do not own the directory
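The resolution described in this commit message (run chmod regardless of whether the directory already existed, and tolerate failure for users who do not own it) can be sketched as follows. This is a minimal illustration using only the standard library, not the code from the commit: the function name is borrowed from the message, but the body, the `mode` parameter, and the default of `0o777` are assumptions.

```python
import os
import stat
import tempfile


def try_to_create_directory(directory_path, mode=0o777):
    """Create directory_path if needed, then try to open up its permissions.

    Hypothetical sketch: the name mirrors the function mentioned in the
    commit message, but this body is an illustration, not Ray's code.
    """
    # exist_ok=True makes it harmless for several processes to race on
    # creating the same directory.
    os.makedirs(directory_path, exist_ok=True)
    try:
        # chmod runs whether or not this process created the directory,
        # so losing the creation race no longer skips the permission fix.
        os.chmod(directory_path, mode)
    except OSError:
        # A user who does not own the directory cannot chmod it; ignore.
        pass
    return directory_path


if __name__ == "__main__":
    base = tempfile.mkdtemp()
    shared = os.path.join(base, "ray")
    try_to_create_directory(shared)
    try_to_create_directory(shared)  # second call succeeds without error
    print(oct(stat.S_IMODE(os.stat(shared).st_mode)))
```

Because the chmod call sits outside any "does the directory exist" branch, the outcome no longer depends on which process wins the creation race.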

* Export tensorflow model of policy graph * Add tests,examples,pydocs and infer extra signatures from existing methods * Add example usage in export_policy_model comment * Fix lint error * Fix lint error * Fix lint error

ericl and others added 30 commits December 3, 2018 01:24

[rllib] Better error message for unsupported non-atari image observat…

7abfbfd

…ion sizes (ray-project#3444)

[rllib] Auto clip actions to Box space range; deprecate squash_to_ran…

d820597

…ge (ray-project#3426) * fix clip * tweak wording * remove squash entirely * Update rllib-models.rst * fix argument order * Apply suggestions from code review Co-Authored-By: ericl <[email protected]>

Tweak/exec attach info (ray-project#3447)

be6567e

* Add custom cluster name to exec info * Update submit info to match exec info

[rllib] Allow envs to be auto-registered; add on_train_result callbac…

ce355d1

…k with curriculum example (ray-project#3451) * train step and docs * debug * doc * doc * fix examples * fix code * integration test * fix * ... * space * instance * Update .travis.yml * fix test

[tune] Component notification on node failure + Tests (ray-project#3414)

9d0bd50

Changes include: - Notify Components on Requeue - Slight refactoring of Node Failure handling - Better tests

added cloudpickle

9575500

[docs] Switch docs to use rllib train instead of train.py

93a9d32

Make test_actor_multiple_gpus_from_multiple_tasks less stressful in t…

06f6431

…ravis

increase container memory and shm to 20G (ray-project#3475)

7a79b7f

* increase container memory and shm to 20G * variables are POWERFUL

[rllib] fixes from dogfooding multi-agent (ray-project#3456)

d864f29

auto wrap multi-agent dict and tuple spaces by keeping a policy -> preprocessor in the sampler add some Q-learning debug stats report min, max of custom metrics better errors

[tune] Deprecate ambiguous function values (use tune.function / tune.…

412aaa5

…sample_from instead) (ray-project#3457) * wip * exclude

Removing the check about the size re: ray-project#3450 (ray-project#3464

970babf

) * Removing the check about the size re: ray-project#3450 * Addressing comments * Update services.py

[rllib] Copy data before passing to Ape-X learner thread (fixes trans…

8395523

…ient plasma crashes) (ray-project#3484)

Resolve no handlers could be found for logger 'ray.worker' when impor…

f6490f9

…ting ray (ray-project#3483)

[rllib] Use smoothed version of collect metrics for DQN (ray-project#…

462e6ef

…3491) * fix * lint

[rllib] Better document which methods are abstract and which ones are…

8b5827b

… overrides (ray-project#3480)

[rllib] Multi-GPU support for Multi-Agent PPO (ray-project#3479)

7aec357

* wip * fix * remove check * fix null * revert * lint and kl * also fix rollout

Add return value for recontruction RPC. (ray-project#3493)

0136af5

* Add return value for recontruct RPC. * Fix comment function name

Add option to evict keys LRU from the sharded redis tables (ray-proje…

cffe8f9

…ct#3499) * wip * wip * format * wip * note * lint * fix * flag * typo * raise timeout * fix * optional get * fix flag * increase timeout in test * update docs * format

[sgd] Add file lock to protect compilation of sgd op (ray-project#3486)

87c0d24

* add file lock to protect compilation of sgd op * lint * update * fix * fix * lint * update * rebase on arrow * Update sgd_worker.py
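A file lock of the kind this commit title describes could look roughly like the sketch below. It is not the code from `sgd_worker.py`: the helper `run_exclusively`, the lock path, and the stand-in build callback are all hypothetical, and the sketch assumes a POSIX system where `fcntl.flock` is available.

```python
import fcntl
import os
import tempfile


def run_exclusively(lock_path, build_fn):
    """Run build_fn while holding an exclusive advisory lock on lock_path.

    Hypothetical helper: other processes calling this with the same
    lock_path block until the current holder releases the lock.
    """
    with open(lock_path, "w") as lock_file:
        fcntl.flock(lock_file, fcntl.LOCK_EX)  # blocks until the lock is free
        try:
            return build_fn()
        finally:
            fcntl.flock(lock_file, fcntl.LOCK_UN)


if __name__ == "__main__":
    lock_path = os.path.join(tempfile.mkdtemp(), "compile.lock")
    # The lambda stands in for an expensive compilation step.
    print(run_exclusively(lock_path, lambda: "compiled"))
```

The lock is advisory, so it only protects the compilation step if every process that might compile goes through the same helper and lock path.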

[rllib] Learner should not see clipped actions (ray-project#3496)

ce388a4

Make stress test time shorter. (ray-project#3506)

abd781d

[autoscaler] Use fixed timestamp to check against health timeouts (ra…

962f187

…y-project#3503)

[tune] Fix PyTorch example after PyTorch v1 (ray-project#3500)

1f4a01c

* [tune] * fix * lint * fix

[rllib] Fix multiagent_two_trainer test (ray-project#3509)

52df4df

* update * fix * dict ordre * fix * fix

Show slowest tests in travis. (ray-project#3507)

74c3370

devin-petersohn and others added 28 commits December 20, 2018 18:46

[tune] resources_per_trial from trial_resources (ray-project#3580)

e046a5c

Renaming variable due to user errors.

[java] change RayLog.core to org.slf4j.Logger (ray-project#3579)

e65b8f1

change the order of allocation for io_service and gcs client in rayle…

6b179cb

…t main (ray-project#3597)

[rllib] Add requested clarifications to test requirement of contrib d…

ddc9786

…ocs (ray-project#3589)

Use BaseTest to instead of TestListener. (ray-project#3577)

8393df2

Fix TensorFlow and PyTorch compatibility (ray-project#3574)

e578a38

* remove tensorflow workaround * update docker * add boost threads * add date_time, too * change link order * cosmetics

object store notification mgr: fix using uninitialized variables (ray…

bada42c

…-project#3592) Initialize private class variables to avoid valgrind errors. They are used before initialization.

Initialize some variables in constructor instead of header file. (ray…

ddd4c84

…-project#3617) * Initialize some variables in constructor instead of header file

Upgrade flatbuffers version to 1.10.0. (ray-project#3559)

bb7ca3b

* Upgrade flatbuffers version to 1.10.0. * Temporarily change ray.utils.decode for backwards compatibility.

bump version from 0.6.0 to 0.6.1 (ray-project#3610)

9b8d757

Resize logo in README. (ray-project#3619)

a1995ff

[modin] Append to path to avoid namespace collision on development br…

c13b268

…anches (ray-project#3621)

[rllib] Allow development without needing to compile Ray (ray-project…

9f63119

…#3623) * wip * lint * wip * wip * rename * wip * Cleaner handling of cli prompt

Ensure numpy is at least 1.10.4 in setup.py (ray-project#2462)

3d8f564

In the build script, numpy is specifically set at 1.10.4. We should also ensure that it is indeed the case in `setup.py`.

Update release documentation. (ray-project#3587)

1e8cdb5

* Update release instructions. * Add note about wheels. * Fix * Update * update example * Update RELEASE_PROCESS.rst

Update documentation to reflect 0.6.1 release. (ray-project#3622)

5426234

Fix: ServerConnection should be closed before being removed (ray-proj…

f401175

…ect#3626) Otherwise, in the event of a remote raylet crashing, the connection might be held by boost asio forever, and the pending callbacks will never get invoked. See also ray-project#3586.

[Java] Fix the issue when waiting an empty list or a null pointer (ra…

a971b73

…y-project#3632)

Fix Jenkins test failures and function descriptor bug. (ray-project#3569

1b98fb8

) ## What do these changes do? 1. Fix the Jenkins test failure by adding the driver id to the Actor GCS Key. 2. Move `object_manager_test.py` from Jenkins to Travis.

[Java] Print the log message slowly. (ray-project#3633)

4cde971

Average aggregated gradients before put in plasma store (ray-project#…

4ce3818

…3631)

[tune] Support Configuration Merging (ray-project#3584)

6e2d7a9

* merge configs * deep merge * lint * add resolve * test
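The commit notes say only "merge configs" and "deep merge"; the exact semantics in Tune are not shown here. As a general illustration of what deep-merging nested config dictionaries means, here is a minimal sketch. The function `deep_merge` and the sample keys are hypothetical and are not Tune's API.

```python
import copy


def deep_merge(base, override):
    """Return a new dict where override is merged into base recursively.

    Hypothetical helper: nested dicts are merged key by key, and any
    other value in override replaces the value in base.
    """
    merged = copy.deepcopy(base)
    for key, value in override.items():
        if isinstance(value, dict) and isinstance(merged.get(key), dict):
            merged[key] = deep_merge(merged[key], value)
        else:
            merged[key] = copy.deepcopy(value)
    return merged


if __name__ == "__main__":
    defaults = {"lr": 0.01, "model": {"layers": 2, "units": 64}}
    overrides = {"model": {"units": 128}}
    print(deep_merge(defaults, overrides))
    # prints {'lr': 0.01, 'model': {'layers': 2, 'units': 128}}
```

A shallow `dict.update` would instead replace the whole `model` entry and lose `layers`, which is the difference a deep merge is meant to avoid.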

[rllib] Export policy model checkpoint (ray-project#3637)

b4f61df

* Export policy model checkpoint * update comment

[rllib] Add starcraft multiagent env as example (ray-project#3542)

ac792d7

merge in ray upstream master

e319e49

Merge pull request #4 from flow-project/master_merge

34b5d13

Master merge

cathywu requested review from eugenevinitsky and AboudyKreidieh December 30, 2018 23:09
