Built site for gh-pages
Quarto GHA Workflow Runner committed Jul 22, 2024
1 parent 767c999 commit 7127087
Showing 4 changed files with 20 additions and 20 deletions.
2 changes: 1 addition & 1 deletion .nojekyll
@@ -1 +1 @@
0d1f6c45
216560de
28 changes: 14 additions & 14 deletions rl/intro.html
@@ -7,7 +7,7 @@
<meta name="viewport" content="width=device-width, initial-scale=1.0, user-scalable=yes">

<meta name="author" content="Aditya Mahajan">
<meta name="dcterms.date" content="2024-07-12">
<meta name="dcterms.date" content="2024-07-22">
<meta name="description" content="Course Notes for ECSE 506 (McGill University)">

<title>30&nbsp; The learning setup – Stochastic Control and Decision Theory</title>
@@ -718,7 +718,7 @@
<div>
<div class="quarto-title-meta-heading">Updated</div>
<div class="quarto-title-meta-contents">
<p class="date">July 12, 2024</p>
<p class="date">July 22, 2024</p>
</div>
</div>

@@ -740,15 +740,15 @@
</div>
</div>
<div class="callout-body-container callout-body">
<p>In the planning setting we assumed that the decision maker is interested in minimizing costs. To be consistent with the reinforcement learning literature, in this and subsequent sections, we assume that the decision maker is interested in maximizing rewards. Mathematically, maximizing rewards is the dual of minimizing costs, so going from costs to rewards makes no difference. Philosophically, though, there is a difference. See <span class="citation" data-cites="Lu2023">Lu et al. (<a href="../references.html#ref-Lu2023" role="doc-biblioref">2023</a>)</span> for a discussion.</p>
<p>In the planning setting we showed that there is no difference between a per-step cost that is a function of only the current state and action and <a href="../mdps/intro.html#sec-cost-depending-on-next-state">a per-step cost that depends on the next state as well</a>. However, in the learning setting, the two setups are different. So, in the learning setting we will assume that the per-step reward depends on the next state as well, i.e., it is of the form <span class="math inline">\(r(s_t,a_t,s_{t+1})\)</span>.</p>
</div>
</div>
<p>So far, we have assumed that the decision maker knows the system model. We now consider the case when the system model is not known but one of the following is true:</p>
<ul>
<li>the decision maker acts in the environment, i.e., it observes the state and the per-step reward and chooses the action.</li>
<li>the decision maker has access to a system simulator: it can provide a state-action input to the simulator and observe the next state and the realized reward.</li>
<li>the decision maker has access to an offline dataset of interactions with the environment (i.e., tuples of (current state, current action, current reward, next state)); the data might have been generated by a sub-optimal policy.</li>
</ul>
<p>To fix ideas, we start with the <a href="../mdps/inf-horizon.html">infinite horizon discounted reward setting</a>. The same fundamental concerns are present in the finite horizon and infinite horizon average cost settings as well.</p>
<p>Thus, we have the same model as before. There is a controlled Markov process <span class="math inline">\(\{S_t\}_{t \ge 1}\)</span>, <span class="math inline">\(S_t \in \ALPHABET S\)</span>, controlled by the process <span class="math inline">\(\{A_t\}_{t \ge 1}\)</span>, <span class="math inline">\(A_t \in \ALPHABET A\)</span>. The system dynamics are time-homogeneous and are given by the transition matrices <span class="math inline">\(P(a)\)</span>, <span class="math inline">\(a \in \ALPHABET A\)</span>. At each step, the system yields a reward <span class="math inline">\(R_t = r(S_t, A_t, S_{t+1})\)</span>. <strong>The dynamics <span class="math inline">\(P\)</span> and the reward <span class="math inline">\(r\)</span> are not known to the decision maker.</strong></p>
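<p>As a concrete illustration, the following is a minimal sketch of the interaction interface described above (an illustration added here, not part of the original notes): the transition matrices <span class="math inline">\(P(a)\)</span> and the reward function <span class="math inline">\(r\)</span> live inside the simulator, and the learner only observes the sampled next state and the realized reward. The class and method names (<code>TabularSimulator</code>, <code>step</code>) are hypothetical.</p>
<pre><code class="sourceCode python">import numpy as np

class TabularSimulator:
    """Hides the model (P, r) from the learner; the learner can only
    query a state-action pair and observe (next state, realized reward)."""

    def __init__(self, P, r, seed=0):
        # P[a][s, s_next] is the transition probability under action a;
        # r(s, a, s_next) is the per-step reward, which depends on the next state.
        self.P = P
        self.r = r
        self.num_states = P[0].shape[0]
        self.rng = np.random.default_rng(seed)

    def step(self, s, a):
        """Sample the next state from P(a)[s, .] and return it with the realized reward."""
        s_next = int(self.rng.choice(self.num_states, p=self.P[a][s]))
        return s_next, self.r(s, a, s_next)

# An offline dataset, as in the last bullet above, is then just a list of
# (state, action, reward, next_state) tuples logged while some behaviour
# policy interacts with such a simulator or with the real environment.</code></pre>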
@@ -758,12 +758,12 @@
<section id="the-learning-objective" class="level2" data-number="30.1">
<h2 data-number="30.1" class="anchored" data-anchor-id="the-learning-objective"><span class="header-section-number">30.1</span> The learning objective</h2>
<p>Modulo the difference between cost minimization and reward maximization, the setup above is the same as the models that we have considered before, with one key difference: earlier it was assumed that the decision maker knows the system model <span class="math inline">\((P,r)\)</span>, while in the current setup it does not.</p>
<p>The objective is also the same as before: choose a policy to maximize the expected discounted reward. There are different ways to formalize this objective. One possibility is to seek a policy <span class="math inline">\(π^\star\)</span> that has the property that for all other history dependent policies <span class="math inline">\(π\)</span> and all model parameters <span class="math inline">\(θ\)</span>, we have: <span class="math display">\[
J(π^\star, θ) \ge J(π,θ).
\]</span> This is equivalent to a min-max problem: <span class="math display">\[\begin{equation}\label{eq:min-max}
\tag{min-max-opt}
\max_{π} \min_{θ \in Θ} J(π,θ),
\end{equation}\]</span> where the max is over all history dependent policies. This approach is akin to a frequentist approach in inference. It is also possible to set up the problem in a Bayesian manner. In particular, suppose there is a prior <span class="math inline">\(μ\)</span> on the set <span class="math inline">\(Θ\)</span>. Then the Bayesian approach is to solve the following: <span class="math display">\[\begin{equation}\label{eq:Bayes}
\tag{Bayes-opt}
\max_{π} \int_{Θ} J(π,θ) μ(d θ).
\end{equation}\]</span></p>
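<p>To make the quantity <span class="math inline">\(J(π,θ)\)</span> concrete, the sketch below (an added illustration, under the assumption of a finite model and a deterministic Markov policy; not part of the original notes) evaluates the expected discounted reward of a fixed policy when the model <span class="math inline">\(θ = (P, r)\)</span> <em>is</em> known, by solving the policy evaluation equations. The frequentist objective then maximizes the worst case of this quantity over <span class="math inline">\(θ \in Θ\)</span>, while the Bayesian objective averages it against the prior <span class="math inline">\(μ\)</span>.</p>
<pre><code class="sourceCode python">import numpy as np

def policy_value(P, r, policy, gamma, initial_dist):
    """Evaluate J(pi, theta) for a deterministic Markov policy `policy`
    under a known model theta = (P, r): solve V = r_pi + gamma * P_pi V."""
    n = P[0].shape[0]
    # Transition matrix and expected per-step reward induced by the policy.
    P_pi = np.array([P[policy[s]][s] for s in range(n)])
    r_pi = np.array([
        P[policy[s]][s] @ np.array([r(s, policy[s], sn) for sn in range(n)])
        for s in range(n)
    ])
    V = np.linalg.solve(np.eye(n) - gamma * P_pi, r_pi)
    return float(initial_dist @ V)</code></pre>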
@@ -784,13 +784,13 @@ <h2 data-number="30.2" class="anchored" data-anchor-id="the-different-learning-s
<h2 data-number="30.3" class="anchored" data-anchor-id="high-level-overview-of-the-learning-algorithms"><span class="header-section-number">30.3</span> High-level overview of the learning algorithms</h2>
<p>Loosely speaking, there are two approaches to find an asymptotically optimal policy. The simplest idea is <strong>model-based learning</strong>: use the experience to generate estimates <span class="math inline">\(\hat θ_t\)</span> of the model parameters, and then choose a policy related to the optimal policy of the estimated model. These algorithms differ in how they <strong>trade off exploration and exploitation</strong> and are broadly classified as follows:</p>
<ol type="1">
<li><p><strong>Certainty equivalence or plug-in estimator</strong>. In these methods, one generates an estimate <span class="math inline">\(\hat θ_t\)</span> (e.g., a least squares or maximum likelihood estimator) and then chooses the control action as a <em>noisy version</em> of <span class="math inline">\(π^\star_{\hat θ_t}(s_t)\)</span>. For continuous state models such as <a href="../linear-systems/lqr.html">LQR</a>, the noisy version is usually just additive noise, thus the control is <span class="math display">\[ a_t = π^\star_{\hat θ_t}(s_t) + ε_t \]</span> where <span class="math inline">\(ε_t \sim {\cal N}(0,σ_t^2I)\)</span> and the variance <span class="math inline">\(σ_t^2\)</span> is slowly reduced in a controlled manner. For discrete state models, the noisy version of the control is often chosen as an <span class="math inline">\(ε\)</span>-greedy policy, i.e., <span class="math display">\[ a_t = \begin{cases}
π^\star_{\hat θ_t}(s_t), &amp; \hbox{w.p. } 1 - ε_t \\
\hbox{random action}, &amp; \hbox{w.p. } ε_t
\end{cases}
\]</span> where the <em>exploration rate</em> <span class="math inline">\(ε_t\)</span> is slowly reduced in a controlled manner (a sketch of this scheme for tabular models is given after this list).</p></li>
<li><p><strong>Upper confidence based reinforcement learning (UCRL)</strong>. In these methods, one generates an upper confidence bound based estimate <span class="math inline">\(\bar θ_t\)</span> and then chooses a control action <span class="math display">\[ a_t = π^\star_{\bar θ_t}(s_t). \]</span></p></li>
<li><p><strong>Thompson/Posterior sampling (PSRL)</strong>. This is a Bayesian method where one maintains a posterior distribution <span class="math inline">\(μ_t\)</span> on the unknown parameters <span class="math inline">\(θ\)</span> of the system model. At appropriately chosen times, one samples a model <span class="math inline">\(θ_t\)</span> from the posterior <span class="math inline">\(μ_t\)</span> and chooses the control action <span class="math display">\[ a_t = π^\star_{θ_t}(s_t). \]</span></p></li>
</ol>
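<p>As referenced in the first item above, here is a sketch (illustrative and assumption-laden, not from the notes) of an <span class="math inline">\(ε\)</span>-greedy certainty-equivalence loop for a tabular model: maintain empirical estimates of the transitions and rewards, re-plan on the estimated model, and follow the resulting policy except with probability <span class="math inline">\(ε_t\)</span>, which is slowly reduced. The helper <code>plan</code> (plain value iteration) and a simulator with the interface sketched earlier are assumptions of this illustration.</p>
<pre><code class="sourceCode python">import numpy as np

def plan(P_hat, r_hat, gamma, iters=500):
    """Value iteration on the *estimated* model.
    P_hat: shape (A, S, S); r_hat: shape (S, A), expected rewards."""
    V = np.zeros(P_hat.shape[1])
    for _ in range(iters):
        Q = r_hat + gamma * (P_hat @ V).T   # Q[s, a]
        V = Q.max(axis=1)
    return Q.argmax(axis=1)                 # greedy policy of the estimated model

def epsilon_greedy_certainty_equivalence(sim, num_states, num_actions, gamma, T, seed=0):
    rng = np.random.default_rng(seed)
    counts = np.ones((num_actions, num_states, num_states))  # smoothed transition counts
    reward_sum = np.zeros((num_states, num_actions))
    visits = np.zeros((num_states, num_actions))
    s = 0
    for t in range(1, T + 1):
        # Plug-in estimates of the model from the data collected so far.
        P_hat = counts / counts.sum(axis=2, keepdims=True)
        r_hat = reward_sum / np.maximum(visits, 1.0)
        policy = plan(P_hat, r_hat, gamma)      # re-planning every step, for simplicity
        eps_t = 1.0 / np.sqrt(t)                # exploration rate, slowly reduced
        if rng.random() >= eps_t:
            a = int(policy[s])                  # exploit: follow the plug-in optimal policy
        else:
            a = int(rng.integers(num_actions))  # explore: a uniformly random action
        s_next, rew = sim.step(s, a)            # only samples are observed, never (P, r)
        counts[a, s, s_next] += 1.0
        visits[s, a] += 1.0
        reward_sum[s, a] += rew
        s = s_next
    return policy</code></pre>
<p>UCRL and PSRL have the same overall shape; what changes is the model handed to the planner: an optimistic model built from confidence bounds for UCRL, and a model sampled from the posterior for PSRL.</p>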
<p>The other class of algorithms consists of <strong>model-free</strong> algorithms, which mimic <a href="../mdps/mdp-algorithms.html">the algorithms to solve MDPs</a>. These are broadly classified as
<ol type="1">
@@ -806,11 +806,11 @@ <h2 data-number="30.4" class="anchored" data-anchor-id="beyond-asymptotic-optima
<li><p><strong>Sample complexity</strong>: Given an accuracy level <span class="math inline">\(α\)</span> and a probability <span class="math inline">\(p\)</span>, how many samples <span class="math inline">\(T = T(α,p)\)</span> are needed such that with probability at least <span class="math inline">\(1-p\)</span> we have <span class="math display">\[ \NORM{V^\star - V^{\pi_T}} \le α? \]</span> The scaling of <span class="math inline">\(T\)</span> with <span class="math inline">\(α\)</span> and <span class="math inline">\(p\)</span> is known as the sample complexity.</p></li>
<li><p><strong>Regret</strong>: Given a horizon <span class="math inline">\(T\)</span>, what is the difference between the sample path performance of the learning algorithm until time <span class="math inline">\(T\)</span> and the performance of the optimal policy until time <span class="math inline">\(T\)</span>? (An empirical version of this quantity is sketched after this list.)</p></li>
</ol>
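<p>As a rough illustration of the regret computation referenced above (an added sketch with assumptions, not a formal definition from the notes): on a single sample path one can compare the rewards actually collected by the learning algorithm with the per-step performance of the optimal policy, here summarized by a single scalar <code>optimal_gain</code> (e.g., the optimal long-run average reward) that is assumed known for evaluation purposes.</p>
<pre><code class="sourceCode python">import numpy as np

def empirical_regret(realized_rewards, optimal_gain):
    """Cumulative regret on one sample path: the gap between what the
    optimal policy would collect per step (summarized by `optimal_gain`)
    and what the learning algorithm actually collected."""
    realized_rewards = np.asarray(realized_rewards, dtype=float)
    T = realized_rewards.shape[0]
    return optimal_gain * T - realized_rewards.sum()

# Usage: record the realized rewards while running a learning algorithm and
# plot empirical_regret(rewards[:t], optimal_gain) against t; a good algorithm
# keeps this curve sublinear in t (e.g., growing like sqrt(t)).</code></pre>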
<p>The general approach for characterizing sample complexity and regret proceeds as follows. We establish <strong>lower bounds</strong> by constructing models such that no algorithm can learn them without using a certain number of samples or incurring a certain regret. Then, one tries to <strong>upper bound</strong> the sample complexity or regret of a specific algorithm. If the lower and the upper bounds match (up to logarithmic factors), we say that the algorithm under consideration is <strong>order optimal</strong>.</p>
</section>
<section id="notes" class="level2 unnumbered">
<h2 class="unnumbered anchored" data-anchor-id="notes">Notes</h2>
<p>A good resource for getting started in reinforcement learning is <span class="citation" data-cites="SuttonBarto2018">Sutton and Barto (<a href="../references.html#ref-SuttonBarto2018" role="doc-biblioref">2018</a>)</span>. For a more formal treatment, see <span class="citation" data-cites="BertsekasTsitsiklis1996">Bertsekas and Tsitsiklis (<a href="../references.html#ref-BertsekasTsitsiklis1996" role="doc-biblioref">1996</a>)</span>.</p>


<div id="refs" class="references csl-bib-body hanging-indent" role="list" style="display: none">