Implemented the Monte-Carlo method on the GymMinigrid environment `MiniGrid-Empty-8x8-v0`.

`gen_obs` generates the agent's partially observable view (an image). For a discrete observation we instead use `agent_pos`, which returns the grid position at which the agent is present.
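A minimal sketch of this discrete-observation setup; the `state_from_pos` helper and its mapping from `agent_pos` and `agent_dir` to a single state index are assumptions, not the exact implementation.

```python
import gym
import gym_minigrid  # registers the MiniGrid-* environments

env = gym.make("MiniGrid-Empty-8x8-v0")
obs = env.reset()                      # image view built internally via gen_obs()

def state_from_pos(env):
    """Hypothetical helper: map the agent's (x, y) grid position (and heading)
    to a single discrete state index."""
    x, y = env.unwrapped.agent_pos
    return (y * env.unwrapped.width + x) * 4 + env.unwrapped.agent_dir

state = state_from_pos(env)
```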
Num | Action |
---|---|
0 | Turn Left |
1 | Turn Right |
2 | Move Forward |
- Reward is 1 when agent reaches goal, else 0
- Gamma
  - 0.9
- Training Episodes
  - 75
- Exploration
  - Epsilon = Epsilon/1.1
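A minimal sketch of how an epsilon-greedy first-visit Monte-Carlo control loop could look with these settings; the tabular Q layout and the averaging update are assumptions, and `env` / `state_from_pos` are reused from the setup sketch above.

```python
import numpy as np
from collections import defaultdict

# Sketch of first-visit Monte-Carlo control (assumed structure);
# env and state_from_pos come from the setup sketch above.
gamma, epsilon, n_episodes, n_actions = 0.9, 1.0, 75, 3
Q = defaultdict(lambda: np.zeros(n_actions))
visit_count = defaultdict(int)

for episode in range(n_episodes):
    trajectory, done = [], False
    env.reset()
    state = state_from_pos(env)
    while not done:
        if np.random.rand() < epsilon:                    # epsilon-greedy exploration
            action = np.random.randint(n_actions)
        else:
            action = int(np.argmax(Q[state]))
        _, reward, done, _ = env.step(action)
        trajectory.append((state, action, reward))
        state = state_from_pos(env)

    # Walk the episode backwards accumulating the return G; overwriting the dict
    # entry leaves the return from the first visit of each (state, action) pair.
    G, first_visit_return = 0.0, {}
    for state, action, reward in reversed(trajectory):
        G = reward + gamma * G
        first_visit_return[(state, action)] = G

    for (state, action), G in first_visit_return.items():
        visit_count[(state, action)] += 1
        Q[state][action] += (G - Q[state][action]) / visit_count[(state, action)]

    epsilon /= 1.1                                        # decay exploration as listed above
```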
Implemented the SARSA-λ and backward SARSA methods on the GymMinigrid environments `MiniGrid-Empty-8x8-v0` and `MiniGrid-FourRooms-v0`.

`gen_obs` generates the agent's partially observable view (an image). For a discrete observation we instead use `agent_pos`, which returns the grid position at which the agent is present.
Num | Action |
---|---|
0 | Turn Left |
1 | Turn Right |
2 | Move Forward |
- Reward is 1 when agent reaches goal, else 0
- Gamma
  - 0.9
- SARSA Lambda
  - 0.99
- Training Episodes
  - 50
- Exploration
  - Epsilon = Epsilon/1.05
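A minimal sketch of the backward-view SARSA(λ) update with eligibility traces under these settings; the learning rate `alpha = 0.1`, the accumulating-trace variant, and the reuse of `env` / `state_from_pos` from the setup sketch are assumptions.

```python
import numpy as np
from collections import defaultdict

# Sketch of backward-view SARSA(lambda) with accumulating eligibility traces
# (assumed structure; env and state_from_pos as in the earlier sketch).
gamma, lam, alpha, epsilon, n_episodes, n_actions = 0.9, 0.99, 0.1, 1.0, 50, 3
Q = defaultdict(lambda: np.zeros(n_actions))

def epsilon_greedy(state):
    if np.random.rand() < epsilon:
        return np.random.randint(n_actions)
    return int(np.argmax(Q[state]))

for episode in range(n_episodes):
    E = defaultdict(lambda: np.zeros(n_actions))     # eligibility traces, reset each episode
    env.reset()
    state = state_from_pos(env)
    action = epsilon_greedy(state)
    done = False
    while not done:
        _, reward, done, _ = env.step(action)
        next_state = state_from_pos(env)
        next_action = epsilon_greedy(next_state)
        # TD error for the (state, action) pair just taken; no bootstrap at terminal states.
        delta = reward + gamma * Q[next_state][next_action] * (not done) - Q[state][action]
        E[state][action] += 1.0                      # bump the trace of the visited pair
        for s in list(E.keys()):                     # propagate the error back along all traces
            Q[s] += alpha * delta * E[s]
            E[s] *= gamma * lam
        state, action = next_state, next_action
    epsilon /= 1.05                                  # decay exploration as listed above
```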
Implemented the SARSA-λ and backward SARSA methods on the GymMinigrid environment `MiniGrid-Empty-8x8-v0`, this time comparing several values of the discount factor gamma.

`gen_obs` generates the agent's partially observable view (an image). For a discrete observation we instead use `agent_pos`, which returns the grid position at which the agent is present.
Num | Action |
---|---|
0 | Turn Left |
1 | Turn Right |
2 | Move Forward |
- Reward is 1 when agent reaches goal, else 0
- Gamma
  - Trained agents with 5 different values of gamma: 0.9, 0.7, 0.5, 0.3, 0.1
- Training Episodes
  - 150
- Exploration
  - Epsilon = Epsilon/1.1
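A minimal sketch of how such a gamma sweep could be organized; `train_sarsa_lambda` is a hypothetical wrapper around the SARSA(λ) loop sketched earlier, not a function from this repository.

```python
# Hypothetical sweep over the five discount factors listed above.
results = {}
for gamma in [0.9, 0.7, 0.5, 0.3, 0.1]:
    results[gamma] = train_sarsa_lambda(env, gamma=gamma, lam=0.99,
                                        n_episodes=150, epsilon_decay=1.1)
```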
Implemented DQN on the Gym environment `CartPole-v0`.
Num | Observation | Min | Max |
---|---|---|---|
0 | Cart Position | -4.8 | 4.8 |
1 | Cart Velocity | -Inf | Inf |
2 | Pole Angle | -0.418 rad (-24 deg) | 0.418 rad (24 deg) |
3 | Pole Angular Velocity | -Inf | Inf |
Num | Action |
---|---|
0 | Push Cart to Left |
1 | Push Cart to Right |
- Reward is 1 for every step taken, including the termination step
- Network Architecture
  - 4 Linear Layers of dim = [16, 32, 16, 2]
- Optimizer
  - Adam Optimizer
- Learning Rate
  - 0.0001
- Batch Size
  - 128
- Training Episodes
  - 700
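A minimal sketch of how the Q-network and optimizer above could be set up in PyTorch; treating the listed dims as the output widths of the four linear layers on top of the 4-dimensional CartPole observation, and the ReLU activations, are assumptions.

```python
import torch
import torch.nn as nn

# Sketch of the DQN Q-network; layer widths follow the list above, and the
# 4-dimensional CartPole observation is taken as the input size (assumption).
class QNetwork(nn.Module):
    def __init__(self, obs_dim=4, n_actions=2):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(obs_dim, 16), nn.ReLU(),
            nn.Linear(16, 32), nn.ReLU(),
            nn.Linear(32, 16), nn.ReLU(),
            nn.Linear(16, n_actions),       # one Q-value per action (push left / push right)
        )

    def forward(self, x):
        return self.net(x)

q_net = QNetwork()
optimizer = torch.optim.Adam(q_net.parameters(), lr=1e-4)   # Adam, lr = 0.0001
batch_size = 128
```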
Implemented a Policy Gradient method (Actor-Critic) on the Gym environment `Pendulum-v0`.

The observation is an `ndarray` with shape `(3,)` representing the x-y coordinates of the pendulum's free end and its angular velocity.
Num | Observation | Min | Max |
---|---|---|---|
0 | x = cos(theta) | -1.0 | 1.0 |
1 | y = sin(theta) | -1.0 | 1.0 |
2 | Angular Velocity | -8.0 | 8.0 |
The action is an `ndarray` with shape `(1,)` representing the torque applied to the free end of the pendulum.
Num | Action | Min | Max |
---|---|---|---|
0 | Torque | -2.0 | 2.0 |
- The reward is a function of theta, the angle the pendulum makes with the upright position (deviation from upright is penalized).
- Network Architecture
  - Actor
    - 4 Linear Layers of dim = [31, 128, 32, 2]
  - Critic
    - 4 Linear Layers of dim = [31, 128, 32, 1]
- Optimizer
  - Adam Optimizer
- Learning Rate
  - 0.0005
- Batch Size
  - 64
- Training Episodes
  - 1200
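A minimal sketch of the actor and critic networks above in PyTorch; reading the listed dims as the output widths of the four linear layers on top of the 3-dimensional Pendulum observation, the ReLU activations, and the use of separate optimizers are assumptions.

```python
import torch
import torch.nn as nn

def mlp(sizes):
    """Stack of Linear layers with ReLU in between; widths copied from the list above."""
    layers = []
    for i in range(len(sizes) - 2):
        layers += [nn.Linear(sizes[i], sizes[i + 1]), nn.ReLU()]
    layers += [nn.Linear(sizes[-2], sizes[-1])]
    return nn.Sequential(*layers)

obs_dim = 3                                    # Pendulum observation size (assumed input dim)
actor = mlp([obs_dim, 31, 128, 32, 2])         # e.g. mean and log-std of the torque distribution
critic = mlp([obs_dim, 31, 128, 32, 1])        # scalar state-value estimate

actor_opt = torch.optim.Adam(actor.parameters(), lr=5e-4)    # Adam, lr = 0.0005
critic_opt = torch.optim.Adam(critic.parameters(), lr=5e-4)
batch_size = 64
```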