title: Finite horizon MDPs
keywords:
- MDPs
- - Markov strategies
+ - Markov policies
- dynamic programming
- comparison principle
- principle of irrelevant information
@@ -16,8 +16,8 @@ where $S_t \in \ALPHABET S$ is the state, $A_t \in \ALPHABET A$ is the control
input, and $W_t \in \ALPHABET W$ is the noise. An agent/controller observes
the state and chooses the control input $A_t$.

- Eq. \\eqref{eq:state} is a _non-linear_ _τtochastic_ state-space
- model—_non-linear_ because $f_t$ can be any nonlinear function; _τtochastic_
+ Eq. \\eqref{eq:state} is a _non-linear_ _stochastic_ state-space
+ model—_non-linear_ because $f_t$ can be any nonlinear function; _stochastic_
because the system is driven by stochastic noise $\{W_t\}_{t \ge 1}$.

The controller can be as sophisticated as we want. In principle, it can
@@ -41,29 +41,29 @@ benign assumption is critical for the theory that we present to go through.
Suppose we have to design such a controller. We are told the probability
distribution of the initial state and the noise. We are also told the system
update functions $(f_1, \dots, f_T)$ and the cost functions $(c_1, \dots,
- c_T)$. We are asked to choose a _control strategy_ $π = (π_1, \dots, π_T)$ to
+ c_T)$. We are asked to choose a _control policy_ $π = (π_1, \dots, π_T)$ to
minimize the expected total cost
$$ J(π) := \EXP\bigg[ \sum_{t=1}^T c_t(S_t, A_t) \bigg]. $$
How should we proceed?
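
Before turning to that question, it may help to see the objective in computational terms. The sketch below simulates the state equation under a placeholder history-dependent controller and estimates $J(π)$ by Monte Carlo; the dynamics `f`, the cost `c`, the noise law, and the policy are all hypothetical stand-ins, not taken from these notes.

```python
# Minimal sketch: simulate S_{t+1} = f_t(S_t, A_t, W_t) under a policy and
# estimate J(pi) = E[ sum_t c_t(S_t, A_t) ] by Monte Carlo.
# f, c, the noise distribution, and the policy are hypothetical placeholders.
import numpy as np

T = 5
rng = np.random.default_rng(0)

def f(t, s, a, w):                  # placeholder system dynamics f_t
    return 0.9 * s + a + w

def c(t, s, a):                     # placeholder per-step cost c_t
    return s**2 + 0.1 * a**2

def policy(t, history):             # a history-dependent controller (placeholder)
    return -0.5 * np.mean(history)  # uses all past states, as the general setup allows

def J(policy, runs=10_000):
    total = 0.0
    for _ in range(runs):
        s, history = rng.normal(), []
        for t in range(1, T + 1):
            history.append(s)
            a = policy(t, history)
            total += c(t, s, a)
            s = f(t, s, a, rng.normal())
    return total / runs

print(J(policy))                    # estimated expected total cost of this policy
```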

At first glance, the problem looks intimidating. It appears that we have to
design a very sophisticated controller: one that analyzes all past data to
choose a control input. However, this is not the case. A remarkable result is
- that the optimal control station can discard all past data and choose the
+ that the optimal controller can discard all past data and choose the
control input based only on the current state of the system. Formally, we have
the following:

:::{#thm-MDP-markov}
- Optimality of Markov strategies.
+ #### Optimality of Markov policies
For the system model described above, there is no loss of optimality in
choosing the control action according to
$$ A_t = π_t(S_t), \quad t=1, \dots, T. $$
- Such a control strategy is called a _Markov strategy_.
+ Such a control policy is called a _Markov policy_.
:::

- The above result claims that the cost incurred by the best Markovian strategy
- is the same as the cost incurred by the best history dependent strategy. This
+ The above result claims that the cost incurred by the best Markovian policy
+ is the same as the cost incurred by the best history dependent policy. This
appears to be a tall claim, so let's see how we can prove it. The main idea of
the proof is to repeatedly apply [Blackwell's principle of irrelevant
information][Blackwell] [@Blackwell1964]
@@ -73,7 +73,7 @@ information][Blackwell] [@Blackwell1964]
::: {#lem-MDP-two-step-lemma}
## Two-Step Lemma
Consider an MDP that operates for two steps ($T=2$). Then there is no loss
- of optimality in restricting attention to a Markov control strategy at time
+ of optimality in restricting attention to a Markov control policy at time
$t=2$.
:::
@@ -132,10 +132,10 @@ A_1) ].$$
:::

Now we have enough background to present the proof of optimality of Markov
- strategies.
+ policies.

:::{.callout-note collapse="true"}
- #### Proof of optimality of Markov strategies {-}
+ #### Proof of optimality of Markov policies {-}

The main idea is that any system can be thought of as a two- or three-step
system by aggregating time. Suppose that the system operates for $T$ steps.
@@ -145,7 +145,7 @@ lemma, there is no loss of optimality in restricting attention to Markov
control law at step 2 (i.e., at time $t=T$), i.e.,
$$ A_T = π_T(S_T). $$

- Now consider a system where we are using a Markov strategy at time $t=T$. This
+ Now consider a system where we are using a Markov policy at time $t=T$. This
system can be thought of as a three-step system where $t \in \{1, \dots,
T-2\}$ corresponds to step 1, $t = T-1$ corresponds to step 2, and $t=T$
corresponds to step 3. Since the controller at time $T$ is Markov, the
@@ -154,7 +154,7 @@ no loss of optimality in restricting attention to Markov controllers at step 2
(i.e., at time $t=T-1$), i.e.,
$$ A_{T-1} = π_{T-1}(S_{T-1}). $$

- Now consider a system where we are using a Markov strategy at time $t \in
+ Now consider a system where we are using a Markov policy at time $t \in
\{T-1, T\}$. This can be thought of as a three-step system where $t \in \{1,
\dots, T - 3\}$ corresponds to step 1, $t = T-2$ corresponds to step 2, and $t
\in \{T-1, T\}$ corresponds to step 3. Since the controllers at time $t \in
@@ -177,15 +177,15 @@ $$A_τ = π_τ(S_τ).$$
Proceeding until $s=2$ completes the proof.
:::

- ## Performance of Markov strategies {#performance}
+ ## Performance of Markov policies {#performance}

We have shown that there is no loss of optimality to restrict attention to
- Markov strategies. One of the advantages of Markov strategies is that their performance can be computed recursively. In particular, given any Markov
- strategy $π = (π_1, \dots, π_T)$, define _the cost-to-go functions_ as
+ Markov policies. One of the advantages of Markov policies is that their performance can be computed recursively. In particular, given any Markov
+ policy $π = (π_1, \dots, π_T)$, define _the cost-to-go functions_ or _value functions_ as
follows:
$$ V^{π}_t(s) = \EXP^π \bigg[ \sum_{τ = t}^{T} c_τ(S_τ, π_τ(S_τ)) \biggm| S_t =
s\bigg]. $$
- Note that $V^{π}_t(s)$ only depends on the future strategy $(π_t, \dots, π_T)$. These functions can be computed recursively as follows:
+ Note that $V^{π}_t(s)$ only depends on the future policy $(π_t, \dots, π_T)$. These functions can be computed recursively as follows:
\begin{align*}
V^{π}_t(s) &= \EXP^π \bigg[ \sum_{τ = t}^{T} c_τ(S_τ, π_τ(S_τ)) \biggm| S_t =
s \bigg] \\
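
To make the recursive computation concrete, here is a minimal sketch of evaluating a fixed Markov policy by backward recursion on a finite MDP. The horizon, transition kernel `P`, costs `c`, and policy `pi` are illustrative placeholders; the recursion implemented is the standard one implied by the definition of $V^{π}_t$ above, namely $V^{π}_t(s) = c_t(s, π_t(s)) + \EXP[V^{π}_{t+1}(S_{t+1}) \mid S_t = s, A_t = π_t(s)]$.

```python
# A minimal sketch: evaluate a fixed Markov policy by backward recursion.
# T, nS, nA, the kernel P, costs c, and the policy pi are all placeholders.
import numpy as np

T, nS, nA = 4, 3, 2                        # horizon, |S|, |A|
rng = np.random.default_rng(0)

P = rng.random((T, nS, nA, nS))            # P[t, s, a, s'] = Pr(S_{t+1}=s' | s, a)
P /= P.sum(axis=-1, keepdims=True)
c = rng.random((T, nS, nA))                # c[t, s, a] = per-step cost
pi = rng.integers(nA, size=(T, nS))        # an arbitrary Markov policy

V = np.zeros((T + 1, nS))                  # V^pi_{T+1} = 0
for t in reversed(range(T)):               # t = T-1, ..., 0 (0-based time index)
    for s in range(nS):
        a = pi[t, s]
        V[t, s] = c[t, s, a] + P[t, s, a] @ V[t + 1]

print(V[0])                                # cost-to-go of pi from each initial state
```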
@@ -227,18 +227,18 @@ Instead of proving the above result, we prove a related result

:::{#thm-comparison-principle}
### The comparison principle
- For any Markov strategy $π$
+ For any Markov policy $π$
$$ V^{π}_t(s) \ge V_t(s) $$
- with equality at $t$ if and only if the _future strategy_ $π_{t:T}$
+ with equality at $t$ if and only if the _future policy_ $π_{t:T}$
satisfies the verification step \\eqref{eq:verification}.
:::

- Note that the comparison principle immediately implies that the strategy
+ Note that the comparison principle immediately implies that the policy
obtained using dynamic programming is optimal.

The comparison principle also allows us to interpret the value functions. The
value function at time $t$ is the minimum of all the cost-to-go functions over
- all future strategies. The comparison principle also allows us to interpret the
+ all future policies. The comparison principle also allows us to interpret the
optimal policy (the interpretation is due to Bellman and is colloquially
called Bellman's principle of optimality).
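
The dynamic program itself is a short backward induction. Below is a minimal sketch, with the same placeholder `P` and `c` conventions as the evaluation sketch above; taking a minimizer at every state is one way to satisfy the verification step.

```python
# A minimal sketch of the dynamic program: compute Q_t, V_t, and a greedy
# policy by backward induction. P and c are illustrative placeholders with
# P[t, s, a, s'] = Pr(S_{t+1}=s' | s, a) and c[t, s, a] the per-step cost.
import numpy as np

T, nS, nA = 4, 3, 2
rng = np.random.default_rng(0)
P = rng.random((T, nS, nA, nS))
P /= P.sum(axis=-1, keepdims=True)
c = rng.random((T, nS, nA))

V = np.zeros((T + 1, nS))                  # V_{T+1} = 0
pi_star = np.zeros((T, nS), dtype=int)
for t in reversed(range(T)):
    Q = c[t] + P[t] @ V[t + 1]             # Q_t(s,a) = c_t(s,a) + E[V_{t+1}(S_{t+1}) | s, a]
    V[t] = Q.min(axis=1)
    pi_star[t] = Q.argmin(axis=1)          # any minimizer satisfies the verification step
```

Evaluating any other Markov policy with the recursion from the earlier sketch and comparing the two value arrays entrywise gives a numerical illustration of the comparison principle: $V^{π}_t(s) \ge V_t(s)$ for every $t$ and $s$.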
@@ -251,7 +251,7 @@ policy with regard to the state resulting from the first decision.

:::{.callout-note collapse="true"}
#### Proof of the comparison principle {-}
- The proof proceeds by backward induction. Consider any Markov strategy $π =
+ The proof proceeds by backward induction. Consider any Markov policy $π =
(π_1, \dots, π_T)$. For $t = T$,
$$ \begin{align*}
V_T(s) &= \min_{a \in \ALPHABET A} Q_T(s,a) \\
@@ -293,8 +293,7 @@ induction, is true for all time.

In the basic model that we have considered above, we assumed that the per-step
cost depends only on the current state and current action. In some
- applications, such as the [inventory management](inventory-management.html)
- model considered in class, it is more natural to have a cost function where
+ applications, such as the [inventory management](inventory-management.html) model, it is more natural to have a cost function where
the cost depends on the current state, current action, and the next state.
Conceptually, such problems can be treated in the same way as the standard
model.
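
One standard way to carry out this reduction (a sketch; the notes may present it differently) is to average the cost over the next state. Since, in the model above, $W_t$ is independent of $(S_t, A_t)$, define
$$ \hat{c}_t(s, a) := \EXP\bigl[ c_t(s, a, S_{t+1}) \bigm| S_t = s, A_t = a \bigr]
                    = \EXP\bigl[ c_t\bigl(s, a, f_t(s, a, W_t)\bigr) \bigr]. $$
By the smoothing property of conditional expectation, $\EXP\bigl[ \sum_{t=1}^T c_t(S_t, A_t, S_{t+1}) \bigr] = \EXP\bigl[ \sum_{t=1}^T \hat{c}_t(S_t, A_t) \bigr]$ for every policy, so the model with per-step cost $\hat{c}_t$ is a standard MDP with the same optimal policies.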
@@ -380,6 +379,10 @@ $$J(π) = \EXP\Bigl[ \prod_{t=1}^T \exp( \theta c_t(S_t, A_t)) \Bigr]. $$
Therefore, the dynamic program for multiplicative cost is also applicable for
this model.

+ See notes on [risk-sensitive MDPs] for more details.
+
+ [risk-sensitive MDPs]: ../risk-sensitive/risk-sensitive-mdps.qmd
+
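
For reference, here is a sketch of that multiplicative-cost recursion (the notation is mine and not taken verbatim from the notes; assume $\theta > 0$ so that every factor $\exp(\theta c_t(s,a))$ is positive): set $V_{T+1}(s) = 1$ and, for $t = T, \dots, 1$,
$$ V_t(s) = \min_{a \in \ALPHABET A} \exp\bigl( \theta c_t(s, a) \bigr)\, \EXP\bigl[ V_{t+1}(S_{t+1}) \bigm| S_t = s, A_t = a \bigr]. $$
Then $V_1(s)$ equals the minimum over policies of $\EXP\bigl[ \prod_{t=1}^T \exp(\theta c_t(S_t, A_t)) \bigm| S_1 = s \bigr]$, which is exactly the objective above.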

### Optimal stopping {#optimal-stopping}

@@ -525,10 +528,13 @@ us to use the theory developed in the class to tackle this setup.
## Notes {-}

- The proof idea for the optimality of Markov strategies is based on a proof
- by @Witsenhausen1979 on the structure of optimal coding strategies for
+ The proof idea for the optimality of Markov policies is based on a proof
+ by @Witsenhausen1979 on the structure of optimal coding policies for
real-time communication. Note that the proof does not require us to find a
dynamic programming decomposition of the problem. This is in contrast with the
- standard textbook proof where the optimality of Markov strategies is proved as
+ standard textbook proof where the optimality of Markov policies is proved as
part of the dynamic programming decomposition.

+ <!-- FIXME
+ Add notes on the history of MDPs. Zermelo's proof of the existence of the value of chess. Results on inventory management. Bellman's book. Howard's book (Howard's paper on the history of MDPs). Bellman's anecdote on the term "dynamic programming".
+ -->