Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Change XML saving for maze to fix infrequent crash #206

Open
wants to merge 5 commits into
base: main
Choose a base branch
from

Conversation

mishmish66
Copy link

Description

I changed the way that temporary xmls are stored for the maze envs. I have used this change for training and it improves stability, without it my training would crash sometimes, usually after about 1m env steps. With this change it seems to be stable.

Type of change

Please delete options that are not relevant.

  • Bug fix (non-breaking change which fixes an issue)

Checklist:

  • I have run the pre-commit checks with pre-commit run --all-files (see CONTRIBUTING.md instructions to set it up)
  • I have commented my code, particularly in hard-to-understand areas
  • I have made corresponding changes to the documentation N/A
  • My changes generate no new warnings
  • I have added tests that prove my fix is effective or that my feature works
  • New and existing unit tests pass locally with my changes

@Kallinteris-Andreas
Copy link
Collaborator

Can you please elaborate on why the crashing was happening prior to this change?

@mishmish66
Copy link
Author

mishmish66 commented Jan 19, 2024

Honestly it's a little unclear to me, the crash threw an XML parsing error giving a line and reason, but obviously the XML is supposed to be working. My guess is either the XML tool we are using to write to the XML file does not flush the buffers correctly internally, and since here we use the open() context manager it flushes as needed, or it's because there is a collision since I'm using a lot of parallelism and this is using timestamping. With these changes the directories are unique regardless of time, and also the file is flushed fully.

@mishmish66
Copy link
Author

@Kallinteris-Andreas is there anything else I should do to get this fix in?

Copy link
Collaborator

@Kallinteris-Andreas Kallinteris-Andreas left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

In principle, those changes are an improvement over the hacky solution with timestamping XML files I had come up with.

To elaborate a bit, the likely cause of the previous version crashing was because of possible timestamp aliasing, which this PR fixes by using /tmp/ files

Other than the testing issue, it is ready for merge

@@ -39,3 +39,23 @@ def test_goal_cell():
obs = env.reset(options={"goal_cell": [2, 1]}, seed=42)[0]
desired_goal = np.array([-0.36302198, -0.53056078])
np.testing.assert_almost_equal(desired_goal, obs["desired_goal"], decimal=4)


def test_multiprocessing_not_crash():
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The Problem with this kind of nondeterministic testing, is that I can verify it working or not

I can not get it to fail with the version of maze

If it can not prove that a bug has been fixed, it should not be added

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

let me see if I can make a test that fails it, I think the actual problem is that when using Tianshou RL it recreates the env a bunch of times, so what I should be doing to test is creating a buttload of envs.

@Kallinteris-Andreas
Copy link
Collaborator

@mishmish66 any progress on finding the test?

Note: I am willing to merge without a test, since it is hard to assert concurrency behavior

@mishmish66
Copy link
Author

No, I couldn't replicate the errors in a test which is weird. I was seeing this on a node in a cluster so I guess testing it locally is slightly different, but in any case I guess it isn't a very significant issue if it only happens on strangely configured cluster nodes.

@Kallinteris-Andreas
Copy link
Collaborator

Kallinteris-Andreas commented Feb 15, 2024

Just remove test_multiprocessing_not_crash and I can merge this, it is still an improvement over the existing implementation

@Kallinteris-Andreas
Copy link
Collaborator

@mishmish66 rebase and remove the test, and I can merge

@mishmish66
Copy link
Author

I'll do that! Now that the semester is going again it may take me a second to get to it. Thanks!

@Kallinteris-Andreas
Copy link
Collaborator

@mishmish66 please remove tests and rebase

@mishmish66
Copy link
Author

@Kallinteris-Andreas I rebased and fixed a warning that was thrown from the env on cleanup that was causing the tests to fail. I also added the change to maze-v3 envs, let me know if that is the wrong thing to do and I'll go remove that!

Thanks for your patience!

@@ -308,3 +316,7 @@ def compute_truncated(

def update_target_site_pos(self, pos):
raise NotImplementedError

def __del__(self):
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is this actually required? does it now cleanup in /tmp without it?

Copy link
Collaborator

@Kallinteris-Andreas Kallinteris-Andreas left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sorry for late response,

please run pre-commit

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants