Bugtest sumo crash reproduction #619

Open
wants to merge 24 commits into
base: develop

Conversation

Gamenot
Collaborator

@Gamenot Gamenot commented Feb 25, 2021

This is a branch that contains scripts for generating a crash and reproducing it.

Reproduction:

Option 1 - Docker

$ docker pull huaweinoah/smarts:v0.4.13-SUMO_CRASH_UPDATED
$ docker run --rm -it -p 8081:8081 huaweinoah/smarts:v0.4.13-SUMO_CRASH_UPDATED

# Inside docker container
$ export PYTHONHASHSEED=42
$ CRASH_DIR=./crash_test
$ STEPS=1500
$ SPEED=$(cat "${CRASH_DIR}/speed.txt")
$ python examples/bugtest.py scenarios/loop --save-dir $CRASH_DIR --speed $SPEED --max-steps $STEPS --headless

If the replay fails on assert action == base_action after dependencies have been swapped out, the recorded actions no longer match and the crash needs to be regenerated:

CRASH_DIR=./crash_test
STEPS=1500
export PYTHONHASHSEED=42
# Create a crash
# If the trials do not create a new crash the total_trials may need to be increased
# parallel_crash_trial.sh <save_dir> <parallel_runs> <total_trials> <max_steps_per_trial>
bash parallel_crash_trial.sh $CRASH_DIR 2 300 $STEPS
# Run a generated crash
SPEED=$(cat "${CRASH_DIR}/speed.txt") 
python examples/bugtest.py scenarios/loop --save-dir $CRASH_DIR --speed $SPEED --max-steps $STEPS --headless

Option 2 - Repo

External dependencies:
Ubuntu 18.04
Python3.7.5
Eclipse SUMO 1.8.0

# 1. Construct your virtual environment (can skip if testing on the docker image)
cd <project>
./install_deps.sh # skip if you have python installed
python3.7 -m venv .venv
. .venv/bin/activate
pip install --upgrade pip
pip install -e .
pip install -r bugtest_requirements.txt

# 2. Run the crashing test:
export PYTHONHASHSEED=42
CRASH_DIR=./crash_test
SPEED=$(cat "${CRASH_DIR}/speed.txt")
python examples/bugtest.py scenarios/loop --save-dir $CRASH_DIR --speed $SPEED --headless

Generating a new crash:

CRASH_DIR=./crash_test
export PYTHONHASHSEED=42
# Create a crash
# parallel_crash_trial.sh <save_dir> <parallel_runs> <total_trials> <max_steps_per_trial>
bash parallel_crash_trial.sh $CRASH_DIR 2 300 1500
# Run a generated crash
SPEED=$(cat "${CRASH_DIR}/speed.txt")
python examples/bugtest.py scenarios/loop --save-dir $CRASH_DIR --speed $SPEED --headless
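
For readers without the script at hand, the trial search can be pictured roughly as follows. This is only a sketch of the idea suggested by the usage comment above (run trials in batches of parallel_runs until one crashes or total_trials is exhausted). It is not the contents of parallel_crash_trial.sh, and run_trial, find_crash, and the random "crash" condition are hypothetical stand-ins.

```python
import random
from concurrent.futures import ThreadPoolExecutor
from typing import Callable, Optional, Tuple


def run_trial(trial_id: int, max_steps: int) -> Optional[int]:
    """Hypothetical stand-in for one simulation run.

    Returns the step at which the trial "crashed", or None if it
    reached max_steps without crashing.
    """
    rng = random.Random(trial_id)  # seeded per trial so each trial is repeatable
    for step in range(max_steps):
        if rng.random() < 0.0005:  # made-up crash probability, for illustration only
            return step
    return None


def find_crash(
    trial: Callable[[int, int], Optional[int]],
    parallel_runs: int,
    total_trials: int,
    max_steps: int,
) -> Optional[Tuple[int, int]]:
    """Run trials in batches of parallel_runs.

    Returns (trial_id, crash_step) for the lowest-numbered crashing trial in
    the first batch that contains a crash, or None if no trial crashed.
    """
    with ThreadPoolExecutor(max_workers=parallel_runs) as pool:
        for start in range(0, total_trials, parallel_runs):
            ids = list(range(start, min(start + parallel_runs, total_trials)))
            results = list(pool.map(lambda i: trial(i, max_steps), ids))
            for trial_id, step in zip(ids, results):
                if step is not None:
                    return trial_id, step
    return None


if __name__ == "__main__":
    # Mirrors the arguments used above: 2 parallel runs, 300 trials, 1500 steps.
    print(find_crash(run_trial, parallel_runs=2, total_trials=300, max_steps=1500))
```

If no trial crashes, find_crash returns None, which corresponds to the case noted earlier where total_trials may need to be increased.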

Expected output

...
Step:  1040
Step:  1050
Step:  1060
ERROR:SMARTS:Simulation crashed with exception. Attempting to cleanly shutdown.
ERROR:SMARTS:connection closed by SUMO
Traceback (most recent call last):
  File "/home/dev/Desktop/repos/NewSMARTS/SMARTS_SUMO_BUG/smarts/core/smarts.py", line 172, in step
    return self._step(agent_actions)
  File "/home/dev/Desktop/repos/NewSMARTS/SMARTS_SUMO_BUG/smarts/core/smarts.py", line 221, in _step
    provider_state = self._step_providers(all_agent_actions, dt)
  File "/home/dev/Desktop/repos/NewSMARTS/SMARTS_SUMO_BUG/smarts/core/smarts.py", line 700, in _step_providers
    provider, actions, dt, self._elapsed_sim_time
  File "/home/dev/Desktop/repos/NewSMARTS/SMARTS_SUMO_BUG/smarts/core/smarts.py", line 739, in _step_provider
    provider_state = provider.step(provider_actions, dt, elapsed_sim_time)
  File "/home/dev/Desktop/repos/NewSMARTS/SMARTS_SUMO_BUG/smarts/core/sumo_traffic_simulation.py", line 341, in step
    self._traci_conn.simulationStep(self._cumulative_sim_seconds)
  File "/usr/share/sumo/tools/traci/connection.py", line 302, in simulationStep
    result = self._sendCmd(tc.CMD_SIMSTEP, None, None, "D", step)
  File "/usr/share/sumo/tools/traci/connection.py", line 180, in _sendCmd
    return self._sendExact()
  File "/usr/share/sumo/tools/traci/connection.py", line 90, in _sendExact
    raise FatalTraCIError("connection closed by SUMO")
traci.exceptions.FatalTraCIError: connection closed by SUMO
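
When comparing runs, it helps to pull the last reported step and the final exception out of a log like the one above. A minimal sketch, assuming the output has been captured to a string and that step lines keep the "Step:  N" form shown here; summarize_crash_log and CrashSummary are hypothetical helpers, not part of SMARTS.

```python
import re
from typing import NamedTuple, Optional


class CrashSummary(NamedTuple):
    last_step: Optional[int]   # last "Step:  N" value seen, if any
    exception: Optional[str]   # last unindented "SomethingError[: message]" line, if any


STEP_RE = re.compile(r"^Step:\s+(\d+)\s*$")
# Final line of a Python traceback: a dotted name ending in Error/Exception,
# optionally followed by ": message". Indented "raise ..." lines do not match.
EXC_RE = re.compile(r"^[A-Za-z_][\w.]*(?:Error|Exception)(?::\s*.*)?$")


def summarize_crash_log(log: str) -> CrashSummary:
    """Scan captured output and report the last step and last exception line."""
    last_step = None
    exception = None
    for line in log.splitlines():
        step_match = STEP_RE.match(line)
        if step_match:
            last_step = int(step_match.group(1))
        elif EXC_RE.match(line):
            exception = line.strip()
    return CrashSummary(last_step, exception)
```

Two runs that report the same exception but a different last_step are a hint that the replay is not deterministic across environments.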

This was linked to issues Mar 3, 2021
@Gamenot
Collaborator Author

Gamenot commented Mar 7, 2021

I think this crash generation could be useful for more than just the SUMO crash.

@Gamenot Gamenot requested a review from Adaickalavan March 16, 2021 06:29
@Adaickalavan
Member

Following the Docker instructions, I can reproduce the FatalTraCIError("connection closed by SUMO") error at step 1166 of the simulation.

@behrisch

I tried to reproduce it with the current HEAD of SUMO but got a different crash:

Step:  105
Step:  106
ERROR:grpc._server:Exception calling application: 
Traceback (most recent call last):
  File "/usr/local/lib/python3.7/dist-packages/grpc/_server.py", line 435, in _call_behavior
    response_or_iterator = behavior(argument, context)
  File "/src/smarts/zoo/worker_servicer.py", line 65, in act
    action = self._agent.act(adapted_obs)
  File "/src/zoo/policies/replay_agent.py", line 52, in act
    raise e
  File "/src/zoo/policies/replay_agent.py", line 50, in act
    assert action == base_action
AssertionError
ERROR:AgentManager:Resolving the remote agent's action (a Future object) generated exception.
ERROR:SMARTS:Simulation crashed with exception. Attempting to cleanly shutdown.
ERROR:SMARTS:<_MultiThreadedRendezvous of RPC that terminated with:
        status = StatusCode.UNKNOWN
        details = "Exception calling application: "
        debug_error_string = "{"created":"@1616076344.087839117","description":"Error received from peer ipv4:127.0.0.1:44463","file":"src/core/lib/surface/call.cc","file_line":1055,"grpc_message":"Exception calling application: ","grpc_status":2}"
>
Traceback (most recent call last):
  File "/src/smarts/core/smarts.py", line 172, in step
    return self._step(agent_actions)
  File "/src/smarts/core/smarts.py", line 218, in _step
    all_agent_actions = self._agent_manager.fetch_agent_actions(self, agent_actions)
  File "/src/smarts/core/agent_manager.py", line 242, in fetch_agent_actions
    raise e
  File "/src/smarts/core/agent_manager.py", line 236, in fetch_agent_actions
    for agent_id, remote_agent in self._remote_social_agents.items()
  File "/src/smarts/core/agent_manager.py", line 236, in <dictcomp>
    for agent_id, remote_agent in self._remote_social_agents.items()
  File "/usr/local/lib/python3.7/dist-packages/grpc/_channel.py", line 625, in result
    raise self
grpc._channel._MultiThreadedRendezvous: <_MultiThreadedRendezvous of RPC that terminated with:
        status = StatusCode.UNKNOWN
        details = "Exception calling application: "
        debug_error_string = "{"created":"@1616076344.087839117","description":"Error received from peer ipv4:127.0.0.1:44463","file":"src/core/lib/surface/call.cc","file_line":1055,"grpc_message":"Exception calling application: ","grpc_status":2}"
>

Is this to be expected?

@Gamenot
Collaborator Author

Gamenot commented Mar 20, 2021

@behrisch Yes, this is quite possible. SUMO's behaviour has likely changed between Eclipse SUMO 1.8.0 and the current HEAD. The crash we generated replays the exact recorded agent actions, so if the simulation diverges the crash has to be regenerated.
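
To illustrate the divergence check, here is a rough sketch of a record/replay wrapper. It is an assumption about the general shape of such an agent, not the code in zoo/policies/replay_agent.py; ReplaySketch, its constructor arguments, and the in-memory list standing in for the saved actions are all hypothetical.

```python
from typing import Any, Callable, List, Optional


class ReplaySketch:
    """Hypothetical record/replay wrapper around a base policy.

    With recorded=None it records every action the base policy produces.
    With a recorded list it replays those actions and asserts that the base
    policy, given the current observation, would still have chosen the same one.
    """

    def __init__(
        self,
        base_policy: Callable[[Any], Any],
        recorded: Optional[List[Any]] = None,
    ) -> None:
        self._base = base_policy
        self._replaying = recorded is not None
        self._actions: List[Any] = list(recorded) if recorded is not None else []
        self._index = 0

    @property
    def recorded(self) -> List[Any]:
        """Copy of the actions recorded (or being replayed) so far."""
        return list(self._actions)

    def act(self, obs: Any) -> Any:
        base_action = self._base(obs)
        if not self._replaying:
            self._actions.append(base_action)
            return base_action
        action = self._actions[self._index]
        self._index += 1
        # A mismatch means the observations differ from the recorded run,
        # for example because the simulator version changed.
        assert action == base_action, f"replay diverged at action {self._index - 1}"
        return action
```

In this sketch, a changed simulator shows up as different observations, which lead the base policy to a different action and trip the assertion. That is the same symptom as the AssertionError in the traceback above.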

I think I have found a quicker way to generate the crash. I will try to get a new container up by tomorrow.

@Gamenot
Collaborator Author

Gamenot commented Mar 21, 2021

@behrisch I have created a new docker image and updated the instructions.

@Gamenot
Collaborator Author

Gamenot commented Apr 28, 2021

This should eventually be converted into a proper PR for more general bug-testing purposes.

@Gamenot Gamenot mentioned this pull request Jun 28, 2021
@Gamenot
Collaborator Author

Gamenot commented Feb 7, 2023

I will work on turning this into a proper utility. It was useful for hunting down the SUMO bug and should be just as useful for tracking down most other bugs and for testing determinism.

@Gamenot Gamenot mentioned this pull request Feb 8, 2023
14 tasks
@Gamenot
Collaborator Author

Gamenot commented Mar 10, 2023

It is no longer certain that this is still an issue, but that needs to be confirmed before #1884 can be merged.

Edit: moved to #1842

@Gamenot Gamenot removed the request for review from Adaickalavan March 10, 2023 16:12
@Gamenot Gamenot self-assigned this Mar 10, 2023
@Gamenot Gamenot linked an issue Mar 10, 2023 that may be closed by this pull request
14 tasks
@Gamenot Gamenot mentioned this pull request Mar 14, 2023
4 tasks
3 participants