Commit 963f2db

Clean up the write up
1 parent 5c1142b commit 963f2db

File tree

1 file changed: +34 -28 lines changed

mdps/intro.qmd (+34 -28)

@@ -2,7 +2,7 @@
 title: Finite horizon MDPs
 keywords:
 - MDPs
-- Markov strategies
+- Markov policies
 - dynamic programming
 - comparison principle
 - principle of irrelevant information
@@ -16,8 +16,8 @@ where $S_t \in \ALPHABET S$ is the state, $A_t \in \ALPHABET A$ is the control
 input, and $W_t \in \ALPHABET W$ is the noise. An agent/controller observes
 the state and chooses the control input $A_t$.
 
-Eq. \\eqref{eq:state} is a _non-linear_ _τtochastic_ state-space
-model—_non-linear_ because $f_t$ can be any nonlinear function; _τtochastic_
+Eq. \\eqref{eq:state} is a _non-linear_ _stochastic_ state-space
+model—_non-linear_ because $f_t$ can be any nonlinear function; _stochastic_
 because the system is driven by stochastic noise $\{W_t\}_{t \ge 1}$.
 
 The controller can be as sophisticated as we want. In principle, it can
@@ -41,29 +41,29 @@ benign assumption is critical for the theory that we present to go through.
 Suppose we have to design such a controller. We are told the probability
 distribution of the initial state and the noise. We are also told the system
 update functions $(f_1, \dots, f_T)$ and the cost functions $(c_1, \dots,
-c_T)$. We are asked to choose a _control strategy_ $π = (π_1, \dots, π_T)$ to
+c_T)$. We are asked to choose a _control policy_ $π = (π_1, \dots, π_T)$ to
 minimize the expected total cost
 $$ J(π) := \EXP\bigg[ \sum_{t=1}^T c_t(S_t, A_t) \bigg]. $$
 How should we proceed?
 
 At first glance, the problem looks intimidating. It appears that we have to
 design a very sophisticated controller: one that analyzes all past data to
 choose a control input. However, this is not the case. A remarkable result is
-that the optimal control station can discard all past data and choose the
+that the optimal controller can discard all past data and choose the
 control input based only on the current state of the system. Formally, we have
 the following:
 
 :::{#thm-MDP-markov}
-Optimality of Markov strategies.
+#### Optimality of Markov policies
 For the system model described above, there is no loss of optimality in
 choosing the control action according to
 $$ A_t = π_t(S_t), \quad t=1, \dots, T.$$
-Such a control strategy is called a _Markov strategy_.
+Such a control policy is called a _Markov policy_.
 :::
 
 
-The above result claims that the cost incurred by the best Markovian strategy
-is the same as the cost incurred by the best history dependent strategy. This
+The above result claims that the cost incurred by the best Markovian policy
+is the same as the cost incurred by the best history-dependent policy. This
 appears to be a tall claim, so let's see how we can prove it. The main idea of
 the proof is to repeatedly apply [Blackwell's principle of irrelevant
 information][Blackwell] [@Blackwell1964]
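As a quick illustration of the objective, $J(π)$ can be estimated by simulating the state equation $S_{t+1} = f_t(S_t, A_t, W_t)$ under a Markov policy and averaging the accumulated cost. The sketch below is an editorial aside, not something in `intro.qmd`: the horizon, dynamics `f`, cost `c`, noise distribution, and policy `pi` are all made-up placeholders.

```python
import random

T = 4                      # horizon (placeholder)
states = [0, 1, 2]         # a toy finite state space (placeholder)

def f(t, s, a, w):
    # made-up dynamics S_{t+1} = f_t(S_t, A_t, W_t)
    return (s + a + w) % len(states)

def c(t, s, a):
    # made-up per-step cost c_t(s, a)
    return (s - 1) ** 2 + 0.5 * a

def pi(t, s):
    # a Markov policy: the action depends only on the current state
    return 0 if s == 1 else 1

def estimate_J(num_episodes=10_000):
    """Monte Carlo estimate of J(pi) = E[sum_t c_t(S_t, A_t)]."""
    total = 0.0
    for _ in range(num_episodes):
        s = random.choice(states)          # initial state ~ uniform (placeholder)
        episode_cost = 0.0
        for t in range(1, T + 1):
            a = pi(t, s)
            episode_cost += c(t, s, a)
            w = random.randint(0, 1)       # noise W_t (placeholder distribution)
            s = f(t, s, a, w)
        total += episode_cost
    return total / num_episodes

print(estimate_J())
```

Running the same loop with a history-dependent policy and comparing the estimates is a cheap way to see the theorem's claim at work on small examples.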
@@ -73,7 +73,7 @@ information][Blackwell] [@Blackwell1964]
 ::: {#lem-MDP-two-step-lemma}
 ## Two-Step Lemma
 Consider an MDP that operates for two steps ($T=2$). Then there is no loss
-of optimality in restricting attention to a Markov control strategy at time
+of optimality in restricting attention to a Markov control policy at time
 $t=2$.
 :::
 
@@ -132,10 +132,10 @@ A_1) ].$$
 :::
 
 Now we have enough background to present the proof of optimality of Markov
-strategies.
+policies.
 
 :::{.callout-note collapse="true"}
-#### Proof of optimality of Markov strategies {-}
+#### Proof of optimality of Markov policies {-}
 
 The main idea is that any system can be thought of as a two- or three-step
 system by aggregating time. Suppose that the system operates for $T$ steps.
@@ -145,7 +145,7 @@ lemma, there is no loss of optimality in restricting attention to Markov
 control law at step 2 (i.e., at time $t=T$), i.e.,
 $$ A_T = π_T(S_T). $$
 
-Now consider a system where we are using a Markov strategy at time $t=T$. This
+Now consider a system where we are using a Markov policy at time $t=T$. This
 system can be thought of as a three-step system where $t \in \{1, \dots,
 T-2\}$ corresponds to step 1, $t = T-1$ corresponds to step 2, and $t=T$
 corresponds to step 3. Since the controller at time $T$ is Markov, the
@@ -154,7 +154,7 @@ no loss of optimality in restricting attention to Markov controllers at step 2
 (i.e., at time $t=T-1$), i.e.,
 $$A_{T-1} = π_{T-1}(S_{T-1}).$$
 
-Now consider a system where we are using a Markov strategy at time $t \in
+Now consider a system where we are using a Markov policy at time $t \in
 \{T-1, T\}$. This can be thought of as a three-step system where $t \in \{1,
 \dots, T - 3\}$ corresponds to step 1, $t = T-2$ corresponds to step 2, and $t
 \in \{T-1, T\}$ corresponds to step 3. Since the controllers at time $t \in
@@ -177,15 +177,15 @@ $$A_τ = π_τ(S_τ).$$
 Proceeding until $s=2$ completes the proof.
 :::
 
-## Performance of Markov strategies {#performance}
+## Performance of Markov policies {#performance}
 
 We have shown that there is no loss of optimality in restricting attention to
-Markov strategies. One of the advantages of Markov strategies is that their performance can be computed recursively. In particular, given any Markov
-strategy $π = (π_1, \dots, π_T)$, define _the cost-to-go functions_ as
+Markov policies. One of the advantages of Markov policies is that their performance can be computed recursively. In particular, given any Markov
+policy $π = (π_1, \dots, π_T)$, define _the cost-to-go functions_ (or _value functions_) as
 follows:
 $$V^{π}_t(s) = \EXP^π \bigg[ \sum_{τ = t}^{T} c_τ(S_τ, π_τ(S_τ)) \biggm| S_t =
 s\bigg]. $$
-Note that $V^{π}_t(s)$ only depends on the future strategy $(π_t, \dots, π_T)$. These functions can be computed recursively as follows:
+Note that $V^{π}_t(s)$ only depends on the future policy $(π_t, \dots, π_T)$. These functions can be computed recursively as follows:
 \begin{align*}
 V^{π}_t(s) &= \EXP^π \bigg[ \sum_{τ = t}^{T} c_τ(S_τ, π_τ(S_τ)) \biggm| S_t =
 s \bigg] \\
@@ -227,18 +227,18 @@ Instead of proving the above result, we prove a related result
 
 :::{#thm-comparison-principle}
 ### The comparison principle
-For any Markov strategy $π$
+For any Markov policy $π$
 $$ V^{π}_t(s) \ge V_t(s) $$
-with equality at $t$ if and only if the _future strategy_ $π_{t:T}$
+with equality at $t$ if and only if the _future policy_ $π_{t:T}$
 satisfies the verification step \\eqref{eq:verification}.
 :::
 
-Note that the comparison principle immediately implies that the strategy
+Note that the comparison principle immediately implies that the policy
 obtained using dynamic programming is optimal.
 
 The comparison principle also allows us to interpret the value functions. The
 value function at time $t$ is the minimum of all the cost-to-go functions over
-all future strategies. The comparison principle also allows us to interpret the
+all future policies. The comparison principle also allows us to interpret the
 optimal policy (the interpretation is due to Bellman and is colloquially
 called Bellman's principle of optimality).
 
@@ -251,7 +251,7 @@ policy with regard to the state resulting from the first decision.
 
 :::{.callout-note collapse="true"}
 #### Proof of the comparison principle {-}
-The proof proceeds by backward induction. Consider any Markov strategy $π =
+The proof proceeds by backward induction. Consider any Markov policy $π =
 (π_1, \dots, π_T)$. For $t = T$,
 $$ \begin{align*}
 V_T(s) &= \min_{a \in \ALPHABET A} Q_T(s,a) \\
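To complement the proof excerpt above, here is a small numerical sketch (an editorial aside with a made-up finite MDP; all numbers, names, and the policy are placeholders) that runs both backward recursions from the [performance](#performance) section and checks the comparison principle $V^{π}_t(s) \ge V_t(s)$ pointwise. It uses a transition kernel $P_t(s' \mid s, a)$ in place of the functional form $f_t(s, a, w)$, which is an equivalent description for finite noise.

```python
import itertools

# A made-up finite-horizon MDP (placeholder data).
T = 3
states = [0, 1]
actions = [0, 1]

def P(t, s, a):
    """Transition kernel P_t(s' | s, a), returned as a dict (placeholder numbers)."""
    p_stay = 0.8 if a == 0 else 0.3
    return {s: p_stay, 1 - s: 1 - p_stay}

def c(t, s, a):
    """Per-step cost c_t(s, a) (placeholder numbers)."""
    return abs(s - a) + 0.1 * t

def pi(t, s):
    """An arbitrary Markov policy pi_t(s) to evaluate."""
    return 0

# Policy evaluation: V_pi[t][s] = c_t(s, pi_t(s)) + E[ V_pi[t+1][S_{t+1}] ].
V_pi = {T + 1: {s: 0.0 for s in states}}
for t in range(T, 0, -1):
    V_pi[t] = {}
    for s in states:
        a = pi(t, s)
        V_pi[t][s] = c(t, s, a) + sum(
            prob * V_pi[t + 1][s_next] for s_next, prob in P(t, s, a).items()
        )

# Dynamic program: V[t][s] = min_a ( c_t(s, a) + E[ V[t+1][S_{t+1}] ] ).
V = {T + 1: {s: 0.0 for s in states}}
for t in range(T, 0, -1):
    V[t] = {}
    for s in states:
        V[t][s] = min(
            c(t, s, a) + sum(prob * V[t + 1][s_next] for s_next, prob in P(t, s, a).items())
            for a in actions
        )

# Comparison principle: the cost-to-go of any Markov policy dominates the value function.
for t, s in itertools.product(range(1, T + 1), states):
    assert V_pi[t][s] >= V[t][s] - 1e-12
```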
@@ -293,8 +293,7 @@ induction, is true for all time.
 
 In the basic model that we have considered above, we assumed that the per-step
 cost depends only on the current state and current action. In some
-applications, such as the [inventory management](inventory-management.html)
-model considered in class, it is more natural to have a cost function where
+applications, such as the [inventory management](inventory-management.html) model, it is more natural to have a cost function where
 the cost depends on the current state, current action, and the next state.
 Conceptually, such problems can be treated in the same way as the standard
 model.
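One standard way to make this reduction precise (an editorial sketch; the treatment later in the file may phrase it differently) is to average the cost over the next state,
$$ \tilde c_t(s, a) := \EXP\big[ c_t(s, a, S_{t+1}) \bigm| S_t = s, A_t = a \big] = \EXP\big[ c_t(s, a, f_t(s, a, W_t)) \big], $$
so that, by the tower property of conditional expectation,
$$ \EXP\bigg[ \sum_{t=1}^T c_t(S_t, A_t, S_{t+1}) \bigg] = \EXP\bigg[ \sum_{t=1}^T \tilde c_t(S_t, A_t) \bigg], $$
and the problem reduces to a standard model with per-step cost $\tilde c_t$.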
@@ -380,6 +379,10 @@ $$J(π) = \EXP\Bigl[ \prod_{t=1}^T \exp( \theta c_t(S_t, A_t)) \Bigr]. $$
 Therefore, the dynamic program for multiplicative cost is also applicable for
 this model.
 
+See notes on [risk-sensitive MDPs] for more details.
+
+[risk-sensitive MDPs]: ../risk-sensitive/risk-sensitive-mdps.qmd
+
 
 ### Optimal stopping {#optimal-stopping}
 
@@ -525,10 +528,13 @@ us to use the theory developed in the class to tackle this setup.
 
 ## Notes {-}
 
-The proof idea for the optimality of Markov strategies is based on a proof
-by @Witsenhausen1979 on the structure of optimal coding strategies for
+The proof idea for the optimality of Markov policies is based on a proof
+by @Witsenhausen1979 on the structure of optimal coding policies for
 real-time communication. Note that the proof does not require us to find a
 dynamic programming decomposition of the problem. This is in contrast with the
-standard textbook proof where the optimality of Markov strategies is proved as
+standard textbook proof where the optimality of Markov policies is proved as
 part of the dynamic programming decomposition.
 
+<!-- FIXME
+Add notes on the history of MDPs. Zermelo's proof of existence of value of chess. Results on inventory management. Bellman's book. Howard's book (Howard's paper on history of MDP). Bellman's anecdote on the term "dynamic programming".
+-->
