Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Sync ray_merge with master #5

Open
wants to merge 94 commits into
base: ray_merge
Choose a base branch
from
Open

Sync ray_merge with master #5

wants to merge 94 commits into from

Conversation

cathywu
Copy link
Member

@cathywu cathywu commented Dec 30, 2018

What do these changes do?

Sync ray_merge with master, since this is the branch used for Travis tests.

Related issue number

Should resolve flow-project/flow#338

ericl and others added 30 commits December 3, 2018 01:24
…ge (ray-project#3426)

* fix clip

* tweak wording

* remove squash entirely

* Update rllib-models.rst

* fix argument order

* Apply suggestions from code review

Co-Authored-By: ericl <[email protected]>
* Add custom cluster name to exec info

* Update submit info to match exec info
…k with curriculum example (ray-project#3451)

* train step and docs

* debug

* doc

* doc

* fix examples

* fix code

* integration test

* fix

* ...

* space

* instance

* Update .travis.yml

* fix test
Changes include:
 - Notify Components on Requeue
 - Slight refactoring of Node Failure handling
 - Better tests
* Add the extra fallback for serialization.

* Better comments & warnings. quotes.

* Update test/runtest.py

Co-Authored-By: suquark <[email protected]>

* Update test/runtest.py

Co-Authored-By: suquark <[email protected]>

* linting

* Don't hijack too much errors.

* simplify the test

* Update runtest.py

* simplify
* increase container memory and shm to 20G

* variables are POWERFUL
auto wrap multi-agent dict and tuple spaces by keeping a policy -> preprocessor in the sampler
add some Q-learning debug stats
report min, max of custom metrics
better errors
It is possible that `test_free_objects_multi_node` would fail sometimes. If we run this test 20 times, we may found at least one failure.

The cause is that the test is based on function tasks. One raylet may create more than one worker to execute the tasks. So flush operations may be separated to several workers and not clean all the worker objects held by the plasma client.

In this PR, I change function task to actor tasks, which guarantee all the tasks are executed in one worker of a raylet.
<!--
Thank you for your contribution!

Please review https://github.com/ray-project/ray/blob/master/CONTRIBUTING.rst before opening a pull request.
-->

## What do these changes do?

JSON Logger now uses cloudpickle to dump the configs as welll, which pkls the functions needed for multi-agent replay.

## Related issue number

<!-- Are there any issues opened that will be resolved by merging this change? -->
)

* Removing the check about the size re: ray-project#3450

* Addressing comments

* Update services.py
* Init commit for async plasma client

* Create an eventloop model for ray/plasma

* Implement a poll-like selector base on `ray.wait`. Huge improvements.

* Allow choosing workers & selectors

* remove original design

* initial implementation of epoll-like selector for plasma

* Add a param for `worker` used in `PlasmaSelectorEventLoop`

* Allow accepting a `Future` which returns object_id

* Do not need `io.py` anymore

* Create a basic testing model

* fix: `ray.wait` returns tuple of lists

* fix a few bugs

* improving performance & bug fixing

* add test

* several improvements & fixing

* fix relative import

* [async] change code format, remove old files

* [async] Create context wrapper for the eventloop

* [async] fix: context should return a value

* [async] Implement futures grouping

* [async] Fix bugs & replace old functions

* [async] Fix bugs found in tests

* [async] Implement `PlasmaEpoll`

* [async] Make test faster, add tests for epoll

* [async] Fix code format

* [async] Add comments for main code.

* [async] Fix import path.

* [async] Fix test.

* [async] Compatibility.

* [async] less verbose to not annoy the CI.

* [async] Add test for new API

* [async] Allow showing debug info in some of the test.

* [async] Fix test.

* [async] Proper shutdown.

* [async] Lint~

* [async] Move files to experimental and create API

* [async] Use async/await syntax

* [async] Fix names & styles

* [async] comments

* [async] bug fixing & use pytest

* [async] bug fixing & change tests

* [async] use logger

* [async] add tests

* [async] lint

* [async] type checking

* [async] add more tests

* [async] fix bugs on waiting a future while timeout. Add more docs.

* [async] Formal docs.

* [async] Add typing info since these codes are compatible with py3.5+.

* [async] Documents.

* [async] Lint.

* [async] Fix deprecated call.

* [async] Fix deprecated call.

* [async] Implement a more reasonable way for dealing with pending inputs.

* [async] Fix docs

* [async] Lint

* [async] Fix bug: Type for time

* [async] Set our eventloop as the default eventloop so that we can get it through `asyncio.get_event_loop()`.

* [async] Update test & docs.

* [async] Lint.

* [async] Temporarily print more debug info.

* [async] Use `Poll` as a default option.

* [async] Limit resources.

* new async implementation for Ray

* implement linked list

* bug fix

* update

* support seamless async operations

* update

* update API

* fix tests

* lint

* bug fix

* refactor names

* improve doc

* properly shutdown async_api

* doc

* Change the table on the index page.

* Adjust table size.

* Only keeps `as_future`.

* change how we init connection

* init connection in `ray.worker.connect`

* doc

* fix

* Move initialization code into the module.

* Fix docs & code

* Update pyarrow version.

* lint

* Restore index.rst

* Add known issues.

* Apply suggestions from code review

Co-Authored-By: suquark <[email protected]>

* rename

* Update async_api.rst

* Update async_api.py

* Update async_api.rst

* Update async_api.py

* Update worker.py

* Update async_api.rst

* fix tests

* lint

* lint

* replace the magic number
* wip

* fix

* remove check

* fix null

* revert

* lint and kl

* also fix rollout
* Add return value for recontruct RPC.

* Fix comment function name
…ct#3499)

* wip

* wip

* format

* wip

* note

* lint

* fix

* flag

* typo

* raise timeout

* fix

* optional get

* fix flag

* increase timeout in test

* update docs

* format
* add file lock to protect compilation of sgd op

* lint

* update

* fix

* fix

* lint

* update

* rebase on arrow

* Update sgd_worker.py
* update

* fix

* dict ordre

* fix

* fix
devin-petersohn and others added 28 commits December 20, 2018 18:46
…ay-project#3591)

* Allowing multiple users to access the /tmp/ray file at the same time

Previous sequence that caused this issue:
* User A starts ray with `ray.init` when /tmp/ray does not exist
* User B starts ray with `ray.init` and /tmp/ray now exists

User B will get a permissions error
Checking the permissions, /tmp/ray is 700

I have identified a race condition in `try_to_create_directory`
* Multiple processes try to create /tmp/ray at the same time
* chmod is either silently erroring or working properly within the race condition

Resolution: Move chmod outside of the check for whether the directory exists or not.

* Adding try except for users who do not own the directory
* Export tensorflow model of policy graph

* Add tests,examples,pydocs and infer extra signatures from existing methods

* Add example usage in export_policy_model comment

* Fix lint error

* Fix lint error

* Fix lint error
* remove tensorflow workaround
* update docker
* add boost threads
* add date_time, too
* change link order
* cosmetics
…-project#3592)

Initialize private class variables to avoid valgrind errors. They are used before initialization.
…-project#3617)

* Initialize some variables in constructor instead of header file
* Upgrade flatbuffers version to 1.10.0.

* Temporarily change ray.utils.decode for backwards compatibility.
…#3623)

* wip

* lint

* wip

* wip

* rename

* wip

* Cleaner handling of cli prompt
In the build script, numpy is specifically set at 1.10.4. We should also ensure that it is indeed the case in `setup.py`.
* Update release instructions.

* Add note about wheels.

* Fix

* Update

* update example

* Update RELEASE_PROCESS.rst
…ect#3626)

Otherwise, in the event of a remote raylet crashing, the connection might be held by boost asio forever, and the pending callbacks will never get invoked. See also ray-project#3586.
)

## What do these changes do?
1. Fix the Jenkins test failure by add driver id to Actor GCS Key.
2. Move `object_manager_test.py` from Jenkins to Travis.
* merge configs

* deep merge

* lint

* add resolve

* test
* Export policy model checkpoint

* update comment
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.