diff --git a/README.md b/README.md
index d9058efd0..cb1d965ec 100644
--- a/README.md
+++ b/README.md
@@ -2,7 +2,7 @@
 
 [![Build Status](http://ci.d2l.ai/job/d2l-zh/job/master/badge/icon)](http://ci.d2l.ai/job/d2l-zh/job/master/)
 
-[第一版：zh-v1.D2L.ai](https://zh-v1.d2l.ai/) |  [第二版预览版：zh.D2L.ai](https://zh.d2l.ai)  | 安装和使用书中源代码：[第一版](https://zh-v1.d2l.ai/chapter_prerequisite/install.html) [第二版](https://zh.d2l.ai/chapter_installation/index.html) | 当前版本: v2.0.0-alpha2
+[第一版：zh-v1.D2L.ai](https://zh-v1.d2l.ai/) |  [第二版预览版：zh.D2L.ai](https://zh.d2l.ai)  | 安装和使用书中源代码：[第一版](https://zh-v1.d2l.ai/chapter_prerequisite/install.html) [第二版](https://zh.d2l.ai/chapter_installation/index.html) | 当前版本: v2.0.0-beta1
 
 <h5 align="center"><i>理解深度学习的最佳方法是学以致用。</i></h5>
 
diff --git a/chapter_appendix-tools-for-deep-learning/aws.md b/chapter_appendix-tools-for-deep-learning/aws.md
index 98c719e3a..054aebe55 100644
--- a/chapter_appendix-tools-for-deep-learning/aws.md
+++ b/chapter_appendix-tools-for-deep-learning/aws.md
@@ -7,7 +7,7 @@
 1. 安装CUDA（或使用预装CUDA的Amazon机器映像）。
 1. 安装深度学习框架和其他库以运行本书的代码。
 
-此过程也适用于其他实例（和其他云），尽管需要一些细微的修改。在继续操作之前，你需要创建一个aws帐户，有关更多详细信息，请参阅 :numref:`sec_sagemaker`。
+此过程也适用于其他实例（和其他云），尽管需要一些细微的修改。在继续操作之前，你需要创建一个AWS帐户，有关更多详细信息，请参阅 :numref:`sec_sagemaker`。
 
 ## 创建和运行EC2实例
 
diff --git a/chapter_attention-mechanisms/self-attention-and-positional-encoding.md b/chapter_attention-mechanisms/self-attention-and-positional-encoding.md
index cdc8a36fb..c6ccdf1ed 100644
--- a/chapter_attention-mechanisms/self-attention-and-positional-encoding.md
+++ b/chapter_attention-mechanisms/self-attention-and-positional-encoding.md
@@ -2,7 +2,7 @@
 :label:`sec_self-attention-and-positional-encoding`
 
 在深度学习中，我们经常使用卷积神经网络（CNN）或循环神经网络（RNN）对序列进行编码。
-想象一下，有了注意力机制之后，我们将词元序列输入注意力池化中，
+想象一下，有了注意力机制之后，我们将词元序列输入注意力汇聚中，
 以便同一组词元同时充当查询、键和值。
 具体来说，每个查询都会关注所有的键－值对并生成一个注意力输出。
 由于查询、键和值来自同一组输入，因此被称为
@@ -43,7 +43,7 @@ $\mathbf{y}_1, \ldots, \mathbf{y}_n$，其中：
 
 $$\mathbf{y}_i = f(\mathbf{x}_i, (\mathbf{x}_1, \mathbf{x}_1), \ldots, (\mathbf{x}_n, \mathbf{x}_n)) \in \mathbb{R}^d$$
 
-根据 :eqref:`eq_attn-pooling`中定义的注意力池化函数$f$。
+根据 :eqref:`eq_attn-pooling`中定义的注意力汇聚函数$f$。
 下面的代码片段是基于多头注意力对一个张量完成自注意力的计算，
 张量的形状为（批量大小，时间步的数目或词元序列的长度，$d$）。
 输出与输入的张量形状相同。
diff --git a/chapter_computer-vision/anchor.md b/chapter_computer-vision/anchor.md
index 8278cdcc1..604164562 100644
--- a/chapter_computer-vision/anchor.md
+++ b/chapter_computer-vision/anchor.md
@@ -59,7 +59,7 @@ def multibox_prior(data, sizes, ratios):
     ratio_tensor = d2l.tensor(ratios, ctx=device)
 
     # 为了将锚点移动到像素的中心，需要设置偏移量。
-    # 因为一个像素的的高为1且宽为1，我们选择偏移我们的中心0.5
+    # 因为一个像素的高为1且宽为1，我们选择偏移我们的中心0.5
     offset_h, offset_w = 0.5, 0.5
     steps_h = 1.0 / in_height  # 在y轴上缩放步长
     steps_w = 1.0 / in_width  # 在x轴上缩放步长
@@ -101,7 +101,7 @@ def multibox_prior(data, sizes, ratios):
     ratio_tensor = d2l.tensor(ratios, device=device)
 
     # 为了将锚点移动到像素的中心，需要设置偏移量。
-    # 因为一个像素的的高为1且宽为1，我们选择偏移我们的中心0.5
+    # 因为一个像素的高为1且宽为1，我们选择偏移我们的中心0.5
     offset_h, offset_w = 0.5, 0.5
     steps_h = 1.0 / in_height  # 在y轴上缩放步长
     steps_w = 1.0 / in_width  # 在x轴上缩放步长
@@ -109,7 +109,7 @@ def multibox_prior(data, sizes, ratios):
     # 生成锚框的所有中心点
     center_h = (torch.arange(in_height, device=device) + offset_h) * steps_h
     center_w = (torch.arange(in_width, device=device) + offset_w) * steps_w
-    shift_y, shift_x = torch.meshgrid(center_h, center_w)
+    shift_y, shift_x = torch.meshgrid(center_h, center_w, indexing='ij')
     shift_y, shift_x = shift_y.reshape(-1), shift_x.reshape(-1)
 
     # 生成“boxes_per_pixel”个高和宽，
@@ -324,8 +324,8 @@ def assign_anchor_to_bbox(ground_truth, anchors, device, iou_threshold=0.5):
     anchors_bbox_map = np.full((num_anchors,), -1, dtype=np.int32, ctx=device)
     # 根据阈值，决定是否分配真实边界框
     max_ious, indices = np.max(jaccard, axis=1), np.argmax(jaccard, axis=1)
-    anc_i = np.nonzero(max_ious >= 0.5)[0]
-    box_j = indices[max_ious >= 0.5]
+    anc_i = np.nonzero(max_ious >= iou_threshold)[0]
+    box_j = indices[max_ious >= iou_threshold]
     anchors_bbox_map[anc_i] = box_j
     col_discard = np.full((num_anchors,), -1)
     row_discard = np.full((num_gt_boxes,), -1)
@@ -352,8 +352,8 @@ def assign_anchor_to_bbox(ground_truth, anchors, device, iou_threshold=0.5):
                                   device=device)
     # 根据阈值，决定是否分配真实边界框
     max_ious, indices = torch.max(jaccard, dim=1)
-    anc_i = torch.nonzero(max_ious >= 0.5).reshape(-1)
-    box_j = indices[max_ious >= 0.5]
+    anc_i = torch.nonzero(max_ious >= iou_threshold).reshape(-1)
+    box_j = indices[max_ious >= iou_threshold]
     anchors_bbox_map[anc_i] = box_j
     col_discard = torch.full((num_anchors,), -1)
     row_discard = torch.full((num_gt_boxes,), -1)
diff --git a/chapter_computer-vision/image-augmentation.md b/chapter_computer-vision/image-augmentation.md
index 1306c6596..56ef513a0 100644
--- a/chapter_computer-vision/image-augmentation.md
+++ b/chapter_computer-vision/image-augmentation.md
@@ -203,11 +203,11 @@ test_augs = torchvision.transforms.Compose([
 ```
 
 :begin_tab:`mxnet`
-接下来，我们定义了一个辅助函数，以便于读取图像和应用图像增广。Gluon数据集提供的`transform_first`函数将图像增广应用于每个训练示例的第一个元素（图像和标签），即图像顶部的元素。有关`DataLoader`的详细介绍，请参阅 :numref:`sec_fashion_mnist`。
+接下来，我们定义了一个辅助函数，以便于读取图像和应用图像增广。Gluon数据集提供的`transform_first`函数将图像增广应用于每个训练样本的第一个元素（由图像和标签组成），即应用在图像上。有关`DataLoader`的详细介绍，请参阅 :numref:`sec_fashion_mnist`。
 :end_tab:
 
 :begin_tab:`pytorch`
-接下来，我们[**定义一个辅助函数，以便于读取图像和应用图像增广**]。PyTorch数据集提供的`transform`函数应用图像增广来转化图像。有关`DataLoader`的详细介绍，请参阅 :numref:`sec_fashion_mnist`。
+接下来，我们[**定义一个辅助函数，以便于读取图像和应用图像增广**]。PyTorch数据集提供的`transform`参数应用图像增广来转化图像。有关`DataLoader`的详细介绍，请参阅 :numref:`sec_fashion_mnist`。
 :end_tab:
 
 ```{.python .input}
diff --git a/chapter_computer-vision/rcnn.md b/chapter_computer-vision/rcnn.md
index 26bc1204d..0cd010578 100644
--- a/chapter_computer-vision/rcnn.md
+++ b/chapter_computer-vision/rcnn.md
@@ -89,7 +89,7 @@ rois = torch.Tensor([[0, 0, 0, 20, 20], [0, 0, 10, 30, 30]])
 ```
 
 由于`X`的高和宽是输入图像高和宽的$1/10$，因此，两个提议区域的坐标先按`spatial_scale`乘以0.1。
-然后，在`X`上分别标出这两个兴趣区域`X[:, :, 1:4, 0:4]`和`X[:, :, 1:4, 0:4]`。
+然后，在`X`上分别标出这两个兴趣区域`X[:, :, 0:3, 0:3]`和`X[:, :, 1:4, 0:4]`。
 最后，在$2\times 2$的兴趣区域汇聚层中，每个兴趣区域被划分为子窗口网格，并进一步抽取相同形状$2\times 2$的特征。
 
 ```{.python .input}
diff --git a/chapter_convolutional-modern/batch-norm.md b/chapter_convolutional-modern/batch-norm.md
index 9df586e30..e26de82a3 100644
--- a/chapter_convolutional-modern/batch-norm.md
+++ b/chapter_convolutional-modern/batch-norm.md
@@ -70,7 +70,7 @@ $$\begin{aligned} \hat{\boldsymbol{\mu}}_\mathcal{B} &= \frac{1}{|\mathcal{B}|}
 ### 全连接层
 
 通常，我们将批量规范化层置于全连接层中的仿射变换和激活函数之间。
-设全连接层的输入为u，权重参数和偏置参数分别为$\mathbf{W}$和$\mathbf{b}$，激活函数为$\phi$，批量规范化的运算符为$\mathrm{BN}$。
+设全连接层的输入为$\mathbf{x}$，权重参数和偏置参数分别为$\mathbf{W}$和$\mathbf{b}$，激活函数为$\phi$，批量规范化的运算符为$\mathrm{BN}$。
 那么，使用批量规范化的全连接层的输出的计算详情如下：
 
 $$\mathbf{h} = \phi(\mathrm{BN}(\mathbf{W}\mathbf{x} + \mathbf{b}) ).$$
diff --git a/chapter_convolutional-neural-networks/lenet.md b/chapter_convolutional-neural-networks/lenet.md
index 037cb0c3b..eea38e70d 100644
--- a/chapter_convolutional-neural-networks/lenet.md
+++ b/chapter_convolutional-neural-networks/lenet.md
@@ -27,7 +27,7 @@ LeNet被广泛用于自动取款机（ATM）机中，帮助识别处理支票的
 ![LeNet中的数据流。输入是手写数字，输出为10种可能结果的概率。](../img/lenet.svg)
 :label:`img_lenet`
 
-每个卷积块中的基本单元是一个卷积层、一个sigmoid激活函数和平均汇聚层。请注意，虽然ReLU和最大汇聚层更有效，但它们在20世纪90年代还没有出现。每个卷积层使用$5\times 5$卷积核和一个sigmoid激活函数。这些层将输入映射到多个二维特征输出，通常同时增加通道的数量。第一卷积层有6个输出通道，而第二个卷积层有16个输出通道。每个$2\times2$池操作（步骤2）通过空间下采样将维数减少4倍。卷积的输出形状由批量大小、通道数、高度、宽度决定。
+每个卷积块中的基本单元是一个卷积层、一个sigmoid激活函数和平均汇聚层。请注意，虽然ReLU和最大汇聚层更有效，但它们在20世纪90年代还没有出现。每个卷积层使用$5\times 5$卷积核和一个sigmoid激活函数。这些层将输入映射到多个二维特征输出，通常同时增加通道的数量。第一卷积层有6个输出通道，而第二个卷积层有16个输出通道。每个$2\times2$汇聚操作（步幅2）通过空间下采样将维数减少4倍。卷积的输出形状由批量大小、通道数、高度、宽度决定。
 
 为了将卷积块的输出传递给稠密块，我们必须在小批量中展平每个样本。换言之，我们将这个四维输入转换成全连接层所期望的二维输入。这里的二维表示的第一个维度索引小批量中的样本，第二个维度给出每个样本的平面向量表示。LeNet的稠密块有三个全连接层，分别有120、84和10个输出。因为我们在执行分类任务，所以输出层的10维对应于最后输出结果的数量。
 
diff --git a/chapter_convolutional-neural-networks/padding-and-strides.md b/chapter_convolutional-neural-networks/padding-and-strides.md
index 1254696e4..21a3f12f2 100644
--- a/chapter_convolutional-neural-networks/padding-and-strides.md
+++ b/chapter_convolutional-neural-networks/padding-and-strides.md
@@ -182,7 +182,7 @@ conv2d = tf.keras.layers.Conv2D(1, kernel_size=(3,5), padding='valid',
 comp_conv2d(conv2d, X).shape
 ```
 
-为了简洁起见，当输入高度和宽度两侧的填充数量分别为$p_h$和$p_w$时，我们称之为填充$(p_h, p_w)$。当$p_h = p_w = p$时，填充是$p$。同理，当高度和宽度上的步幅分别为$s_h$和$s_w$时，我们称之为步幅$(s_h, s_w)$。当时的步幅为$s_h = s_w = s$时，步幅为$s$。默认情况下，填充为0，步幅为1。在实践中，我们很少使用不一致的步幅或填充，也就是说，我们通常有$p_h = p_w$和$s_h = s_w$。
+为了简洁起见，当输入高度和宽度两侧的填充数量分别为$p_h$和$p_w$时，我们称之为填充$(p_h, p_w)$。当$p_h = p_w = p$时，填充是$p$。同理，当高度和宽度上的步幅分别为$s_h$和$s_w$时，我们称之为步幅$(s_h, s_w)$。特别地，当$s_h = s_w = s$时，我们称步幅为$s$。默认情况下，填充为0，步幅为1。在实践中，我们很少使用不一致的步幅或填充，也就是说，我们通常有$p_h = p_w$和$s_h = s_w$。
 
 ## 小结
 
diff --git a/chapter_deep-learning-computation/model-construction.md b/chapter_deep-learning-computation/model-construction.md
index a6589f076..8499af677 100644
--- a/chapter_deep-learning-computation/model-construction.md
+++ b/chapter_deep-learning-computation/model-construction.md
@@ -295,7 +295,7 @@ class MySequential(nn.Module):
         super().__init__()
         for idx, module in enumerate(args):
             # 这里，module是Module子类的一个实例。我们把它保存在'Module'类的成员
-            # 变量_modules中。module的类型是OrderedDict
+            # 变量_modules中。_modules的类型是OrderedDict
             self._modules[str(idx)] = module
 
     def forward(self, X):
diff --git a/chapter_deep-learning-computation/parameters.md b/chapter_deep-learning-computation/parameters.md
index 95a6b680b..41929f6b2 100644
--- a/chapter_deep-learning-computation/parameters.md
+++ b/chapter_deep-learning-computation/parameters.md
@@ -372,14 +372,14 @@ print(net[1].weight.data())
 
 ```{.python .input}
 #@tab pytorch
-def xavier(m):
+def init_xavier(m):
     if type(m) == nn.Linear:
         nn.init.xavier_uniform_(m.weight)
 def init_42(m):
     if type(m) == nn.Linear:
         nn.init.constant_(m.weight, 42)
 
-net[0].apply(xavier)
+net[0].apply(init_xavier)
 net[2].apply(init_42)
 print(net[0].weight.data[0])
 print(net[2].weight.data)
diff --git a/chapter_introduction/index.md b/chapter_introduction/index.md
index 44702be51..579bd8beb 100644
--- a/chapter_introduction/index.md
+++ b/chapter_introduction/index.md
@@ -594,7 +594,7 @@ agent的动作会影响后续的观察，而奖励只与所选的动作相对应
 [罗纳德·费舍尔（1890-1962）](https://en.wikipedia.org/wiki/Ronald_-Fisher)对统计理论和在遗传学中的应用做出了重大贡献。
 他的许多算法（如线性判别分析）和公式（如费舍尔信息矩阵）至今仍被频繁使用。
 甚至，费舍尔在1936年发布的鸢尾花卉数据集，有时仍然被用来解读机器学习算法。
-他也是优生学的倡导者，这提醒我们：使用数据科学虽然在道德上存在疑问，但是与数据科学在工业和自然科学中的生产性使用一样，有着悠久的历史。
+他也是优生学的倡导者，这提醒我们：数据科学在道德上存疑的使用，与其在工业和自然科学中的生产性使用一样，有着悠远而持久的历史。
 
 机器学习的第二个影响来自[克劳德·香农(1916--2001)](https://en.wikipedia.org/wiki/Claude_Shannon)的信息论和[艾伦·图灵（1912-1954）](https://en.wikipedia.org/wiki/Alan_Turing)的计算理论。
 图灵在他著名的论文《计算机器与智能》 :cite:`Turing.1950` 中提出了“机器能思考吗？”的问题。
diff --git a/chapter_linear-networks/linear-regression.md b/chapter_linear-networks/linear-regression.md
index fe2fa22f6..dec8319ca 100644
--- a/chapter_linear-networks/linear-regression.md
+++ b/chapter_linear-networks/linear-regression.md
@@ -459,9 +459,9 @@ $$-\log P(\mathbf y \mid \mathbf X) = \sum_{i=1}^n \frac{1}{2} \log(2 \pi \sigma
 这种想法归功于我们对真实生物神经系统的研究。
 
 当今大多数深度学习的研究几乎没有直接从神经科学中获得灵感。
-我们援引斯图尔特·罗素和彼得·诺维格谁，在他们的经典人工智能教科书
+我们援引斯图尔特·罗素和彼得·诺维格在他们的经典人工智能教科书
 *Artificial Intelligence:A Modern Approach* :cite:`Russell.Norvig.2016`
-中所说：虽然飞机可能受到鸟类的启发，但几个世纪以来，鸟类学并不是航空创新的主要驱动力。
+中所说的：虽然飞机可能受到鸟类的启发，但几个世纪以来，鸟类学并不是航空创新的主要驱动力。
 同样地，如今在深度学习中的灵感同样或更多地来自数学、统计学和计算机科学。
 
 ## 小结
diff --git a/chapter_linear-networks/softmax-regression.md b/chapter_linear-networks/softmax-regression.md
index 3cc091349..a8dc141b2 100644
--- a/chapter_linear-networks/softmax-regression.md
+++ b/chapter_linear-networks/softmax-regression.md
@@ -109,22 +109,24 @@ $$
 这些违反了 :numref:`sec_prob`中所说的概率基本公理。
 
 要将输出视为概率，我们必须保证在任何数据上的输出都是非负的且总和为1。
-此外，我们需要一个训练目标，来鼓励模型精准地估计概率。
-在分类器输出0.5的所有样本中，我们希望这些样本有一半实际上属于预测的类。
+此外，我们需要一个训练的目标函数，来激励模型精准地估计概率。
+例如，
+在分类器输出0.5的所有样本中，我们希望恰好有一半实际上属于预测的类别。
 这个属性叫做*校准*（calibration）。
 
 社会科学家邓肯·卢斯于1959年在*选择模型*（choice model）的理论基础上
 发明的*softmax函数*正是这样做的：
-softmax函数将未规范化的预测变换为非负并且总和为1，同时要求模型保持可导。
-我们首先对每个未规范化的预测求幂，这样可以确保输出非负。
-为了确保最终输出的总和为1，我们再对每个求幂后的结果除以它们的总和。如下式：
+softmax函数能够将未规范化的预测变换为非负数并且总和为1，同时让模型保持
+可导的性质。
+为了完成这一目标，我们首先对每个未规范化的预测求幂，这样可以确保输出非负。
+为了确保最终输出的概率值总和为1，我们再让每个求幂后的结果除以它们的总和。如下式：
 
 $$\hat{\mathbf{y}} = \mathrm{softmax}(\mathbf{o})\quad \text{其中}\quad \hat{y}_j = \frac{\exp(o_j)}{\sum_k \exp(o_k)}$$
 :eqlabel:`eq_softmax_y_and_o`
 
 这里，对于所有的$j$总有$0 \leq \hat{y}_j \leq 1$。
 因此，$\hat{\mathbf{y}}$可以视为一个正确的概率分布。
-softmax运算不会改变未规范化的预测$\mathbf{o}$之间的顺序，只会确定分配给每个类别的概率。
+softmax运算不会改变未规范化的预测$\mathbf{o}$之间的大小次序，只会确定分配给每个类别的概率。
 因此，在预测过程中，我们仍然可以用下式来选择最有可能的类别。
 
 $$
@@ -137,11 +139,11 @@ $$
 ## 小批量样本的矢量化
 :label:`subsec_softmax_vectorization`
 
-为了提高计算效率并且充分利用GPU，我们通常会针对小批量数据执行矢量计算。
+为了提高计算效率并且充分利用GPU，我们通常会对小批量样本的数据执行矢量计算。
 假设我们读取了一个批量的样本$\mathbf{X}$，
 其中特征维度（输入数量）为$d$，批量大小为$n$。
 此外，假设我们在输出中有$q$个类别。
-那么小批量特征为$\mathbf{X} \in \mathbb{R}^{n \times d}$，
+那么小批量样本的特征为$\mathbf{X} \in \mathbb{R}^{n \times d}$，
 权重为$\mathbf{W} \in \mathbb{R}^{d \times q}$，
 偏置为$\mathbf{b} \in \mathbb{R}^{1\times q}$。
 softmax回归的矢量计算表达式为：
@@ -155,7 +157,7 @@ $$ \begin{aligned} \mathbf{O} &= \mathbf{X} \mathbf{W} + \mathbf{b}, \\ \hat{\ma
 那么softmax运算可以*按行*（rowwise）执行：
 对于$\mathbf{O}$的每一行，我们先对所有项进行幂运算，然后通过求和对它们进行标准化。
 在 :eqref:`eq_minibatch_softmax_reg`中，
-$\mathbf{X} \mathbf{W} + \mathbf{b}$的求和会使用广播，
+$\mathbf{X} \mathbf{W} + \mathbf{b}$的求和会使用广播机制，
 小批量的未规范化预测$\mathbf{O}$和输出概率$\hat{\mathbf{Y}}$
 都是形状为$n \times q$的矩阵。
 
diff --git a/chapter_multilayer-perceptrons/environment.md b/chapter_multilayer-perceptrons/environment.md
index e8b1caa07..762606667 100644
--- a/chapter_multilayer-perceptrons/environment.md
+++ b/chapter_multilayer-perceptrons/environment.md
@@ -122,7 +122,7 @@ $P(y \mid \mathbf{x})$的分布可能会因我们的位置不同而得到不同
 然后这家初创公司问我们是否可以帮助他们建立一个用于检测疾病的分类器。
 
 正如我们向他们解释的那样，用近乎完美的精度来区分健康和患病人群确实很容易。
-然而，这是可能因为受试者在年龄、激素水平、体力活动、
+然而，这可能是因为受试者在年龄、激素水平、体力活动、
 饮食、饮酒以及其他许多与疾病无关的因素上存在差异。
 这对检测疾病的分类器可能并不适用。
 这些抽样可能会遇到极端的协变量偏移。
@@ -236,7 +236,7 @@ $$\mathop{\mathrm{minimize}}_f \frac{1}{n} \sum_{i=1}^n \beta_i l(f(\mathbf{x}_i
 这是用于二元分类的softmax回归（见 :numref:`sec_softmax`）的一个特例。
 综上所述，我们学习了一个分类器来区分从$p(\mathbf{x})$抽取的数据
 和从$q(\mathbf{x})$抽取的数据。
-如果无法区分这两个分布，则意味着想相关的样本可能来自这两个分布中的任何一个。
+如果无法区分这两个分布，则意味着相关的样本可能来自这两个分布中的任何一个。
 另一方面，任何可以很好区分的样本都应该相应地显著增加或减少权重。
 
 为了简单起见，假设我们分别从$p(\mathbf{x})$和$q(\mathbf{x})$
diff --git a/chapter_multilayer-perceptrons/environment_origin.md b/chapter_multilayer-perceptrons/environment_origin.md
index c04cba61e..0dbe31a97 100644
--- a/chapter_multilayer-perceptrons/environment_origin.md
+++ b/chapter_multilayer-perceptrons/environment_origin.md
@@ -603,7 +603,7 @@ Likewise, a user's behavior on a news site will depend on what we showed her pre
 Recently,
 control theory (e.g., PID variants) has also been used
 to automatically tune hyperparameters
-to achive better disentangling and reconstruction quality,
+to achieve better disentangling and reconstruction quality,
 and improve the diversity of generated text and the reconstruction quality of generated images :cite:`Shao.Yao.Sun.ea.2020`.
 
 
diff --git a/chapter_multilayer-perceptrons/kaggle-house-price.md b/chapter_multilayer-perceptrons/kaggle-house-price.md
index cb60524eb..455b95e2b 100644
--- a/chapter_multilayer-perceptrons/kaggle-house-price.md
+++ b/chapter_multilayer-perceptrons/kaggle-house-price.md
@@ -11,7 +11,7 @@ Kaggle的房价预测比赛是一个很好的起点。
 数据集要大得多，也有更多的特征。
 
 本节我们将详细介绍数据预处理、模型设计和超参数选择。
-通过亲身实践，你将获得一手经验，这些经验将指导你数据科学家职业生涯。
+通过亲身实践，你将获得一手经验，这些经验将有益于数据科学家的职业成长。
 
 ## 下载和缓存数据集
 
diff --git a/chapter_multilayer-perceptrons/mlp-concise.md b/chapter_multilayer-perceptrons/mlp-concise.md
index bbe2ead64..69b4bb1ea 100644
--- a/chapter_multilayer-perceptrons/mlp-concise.md
+++ b/chapter_multilayer-perceptrons/mlp-concise.md
@@ -60,7 +60,7 @@ net = tf.keras.models.Sequential([
 ```
 
 [**训练过程**]的实现与我们实现softmax回归时完全相同，
-这种模块化设计使我们能够将与和模型架构有关的内容独立出来。
+这种模块化设计使我们能够将与模型架构有关的内容独立出来。
 
 ```{.python .input}
 batch_size, lr, num_epochs = 256, 0.1, 10
diff --git a/chapter_multilayer-perceptrons/mlp.md b/chapter_multilayer-perceptrons/mlp.md
index 590736769..8784a0106 100644
--- a/chapter_multilayer-perceptrons/mlp.md
+++ b/chapter_multilayer-perceptrons/mlp.md
@@ -160,7 +160,7 @@ $$
 例如，在一对输入上进行基本逻辑操作，多层感知机是通用近似器。
 即使是网络只有一个隐藏层，给定足够的神经元和正确的权重，
 我们可以对任意函数建模，尽管实际中学习该函数是很困难的。
-你可能认为神经网络有点像C语言。
+神经网络有点像C语言。
 C语言和任何其他现代编程语言一样，能够表达任何可计算的程序。
 但实际上，想出一个符合规范的程序才是最困难的部分。
 
diff --git a/chapter_multilayer-perceptrons/underfit-overfit.md b/chapter_multilayer-perceptrons/underfit-overfit.md
index 7bedf92e9..01979745c 100644
--- a/chapter_multilayer-perceptrons/underfit-overfit.md
+++ b/chapter_multilayer-perceptrons/underfit-overfit.md
@@ -158,7 +158,7 @@
 又有时，我们需要比较不同的超参数设置下的同一类模型。
 
 例如，训练多层感知机模型时，我们可能希望比较具有
-不同数量的隐藏层、不同数量的隐藏单元以及不同的的激活函数组合的模型。
+不同数量的隐藏层、不同数量的隐藏单元以及不同的激活函数组合的模型。
 为了确定候选模型中的最佳模型，我们通常会使用验证集。
 
 ### 验证集
diff --git a/chapter_multilayer-perceptrons/weight-decay.md b/chapter_multilayer-perceptrons/weight-decay.md
index 3885b1c68..677990bd1 100644
--- a/chapter_multilayer-perceptrons/weight-decay.md
+++ b/chapter_multilayer-perceptrons/weight-decay.md
@@ -16,13 +16,12 @@
 单项式的阶数是幂的和。
 例如，$x_1^2 x_2$和$x_3 x_5^2$都是3次单项式。
 
-注意，随着阶数$d$的增长，带有阶数$d$的项数迅速增加。
-给定$k$个变量，阶数$d$（即$k$多选$d$）的个数为
-${k - 1 + d} \choose {k - 1}$。
-即使是阶数上的微小变化，比如从$2$到$3$，
-也会显著增加我们模型的复杂性。
-因此，我们经常需要一个更细粒度的工具来调整函数的复杂性。
-
+注意，随着阶数$d$的增长，带有阶数$d$的项数迅速增加。
+给定$k$个变量，阶数为$d$的项的个数为
+${k - 1 + d} \choose {k - 1}$，即$C^{k-1}_{k-1+d} = \frac{(k-1+d)!}{d!(k-1)!}$。
+因此即使是阶数上的微小变化，比如从$2$到$3$，也会显著增加我们模型的复杂性。
+仅仅通过简单地限制特征数量（在多项式回归中体现为限制阶数），可能仍然使模型在过简单和过复杂中徘徊，
+我们需要一个更细粒度的工具来调整函数的复杂性，使其达到一个合适的平衡位置。
 ## 范数与权重衰减
 
 在 :numref:`subsec_lin-algebra-norms`中，
diff --git a/chapter_natural-language-processing-applications/finetuning-bert_origin.md b/chapter_natural-language-processing-applications/finetuning-bert_origin.md
index f6c836b97..cdbe34f8e 100644
--- a/chapter_natural-language-processing-applications/finetuning-bert_origin.md
+++ b/chapter_natural-language-processing-applications/finetuning-bert_origin.md
@@ -12,7 +12,7 @@ In :numref:`sec_bert`,
 we introduced a pretraining model, BERT,
 that requires minimal architecture changes
 for a wide range of natural language processing tasks.
-One one hand,
+On one hand,
 at the time of its proposal,
 BERT improved the state of the art on various natural language processing tasks.
 On the other hand,
diff --git a/chapter_natural-language-processing-applications/natural-language-inference-attention.md b/chapter_natural-language-processing-applications/natural-language-inference-attention.md
index 4924bf817..6f6d5e163 100644
--- a/chapter_natural-language-processing-applications/natural-language-inference-attention.md
+++ b/chapter_natural-language-processing-applications/natural-language-inference-attention.md
@@ -169,7 +169,7 @@ class Compare(nn.Module):
 
 ### 聚合
 
-现在我们有有两组比较向量$\mathbf{v}_{A,i}$（$i = 1, \ldots, m$）和$\mathbf{v}_{B,j}$（$j = 1, \ldots, n$）。在最后一步中，我们将聚合这些信息以推断逻辑关系。我们首先求和这两组比较向量：
+现在我们有两组比较向量$\mathbf{v}_{A,i}$（$i = 1, \ldots, m$）和$\mathbf{v}_{B,j}$（$j = 1, \ldots, n$）。在最后一步中，我们将聚合这些信息以推断逻辑关系。我们首先求和这两组比较向量：
 
 $$
 \mathbf{v}_A = \sum_{i=1}^{m} \mathbf{v}_{A,i}, \quad \mathbf{v}_B = \sum_{j=1}^{n}\mathbf{v}_{B,j}.
diff --git a/chapter_natural-language-processing-applications/natural-language-inference-bert.md b/chapter_natural-language-processing-applications/natural-language-inference-bert.md
index 1108a7c70..835a9e889 100644
--- a/chapter_natural-language-processing-applications/natural-language-inference-bert.md
+++ b/chapter_natural-language-processing-applications/natural-language-inference-bert.md
@@ -284,7 +284,7 @@ net = BERTClassifier(bert)
 
 回想一下，在 :numref:`sec_bert`中，`MaskLM`类和`NextSentencePred`类在其使用的多层感知机中都有一些参数。这些参数是预训练BERT模型`bert`中参数的一部分，因此是`net`中的参数的一部分。然而，这些参数仅用于计算预训练过程中的遮蔽语言模型损失和下一句预测损失。这两个损失函数与微调下游应用无关，因此当BERT微调时，`MaskLM`和`NextSentencePred`中采用的多层感知机的参数不会更新（陈旧的，staled）。
 
-为了允许具有陈旧梯度的参数，标志`ignore_stale_grad=True`在`step`函数`d2l.train_batch_ch13`中被设置。我们通过该函数使用SNLI的训练集（`train_iter`）和测试集（`test_iter`）对`net`模型进行训练和评估。。由于计算资源有限，[**训练**]和测试精度可以进一步提高：我们把对它的讨论留在练习中。
+为了允许具有陈旧梯度的参数，标志`ignore_stale_grad=True`在`step`函数`d2l.train_batch_ch13`中被设置。我们通过该函数使用SNLI的训练集（`train_iter`）和测试集（`test_iter`）对`net`模型进行训练和评估。由于计算资源有限，[**训练**]和测试精度可以进一步提高：我们把对它的讨论留在练习中。
 
 ```{.python .input}
 lr, num_epochs = 1e-4, 5
diff --git a/chapter_natural-language-processing-pretraining/bert.md b/chapter_natural-language-processing-pretraining/bert.md
index 389daca20..57084c150 100644
--- a/chapter_natural-language-processing-pretraining/bert.md
+++ b/chapter_natural-language-processing-pretraining/bert.md
@@ -229,7 +229,7 @@ class MaskLM(nn.Module):
         batch_size = X.shape[0]
         batch_idx = torch.arange(0, batch_size)
         # 假设batch_size=2，num_pred_positions=3
-        # 那么batch_idx是np.array（[0,0,0,1,1]）
+        # 那么batch_idx是np.array（[0,0,0,1,1,1]）
         batch_idx = torch.repeat_interleave(batch_idx, num_pred_positions)
         masked_X = X[batch_idx, pred_positions]
         masked_X = masked_X.reshape((batch_size, num_pred_positions, -1))
diff --git a/chapter_natural-language-processing-pretraining/glove.md b/chapter_natural-language-processing-pretraining/glove.md
index 20c6f0f20..d06ab6424 100644
--- a/chapter_natural-language-processing-pretraining/glove.md
+++ b/chapter_natural-language-processing-pretraining/glove.md
@@ -61,7 +61,7 @@ $$\sum_{i\in\mathcal{V}} \sum_{j\in\mathcal{V}} h(x_{ij}) \left(\mathbf{u}_j^\to
 
 从 :numref:`tab_glove`中，我们可以观察到以下几点：
 
-* 对于与“ice”相关但与“gas”无关的单词$w_k$，例如$w_k=\text{solid}$，我们预计会有更大的共现概率比值，例如8.9。
+* 对于与“ice”相关但与“steam”无关的单词$w_k$，例如$w_k=\text{solid}$，我们预计会有更大的共现概率比值，例如8.9。
 * 对于与“steam”相关但与“ice”无关的单词$w_k$，例如$w_k=\text{gas}$，我们预计较小的共现概率比值，例如0.085。
 * 对于同时与“ice”和“steam”相关的单词$w_k$，例如$w_k=\text{water}$，我们预计其共现概率的比值接近1，例如1.36.
 * 对于与“ice”和“steam”都不相关的单词$w_k$，例如$w_k=\text{fashion}$，我们预计共现概率的比值接近1，例如0.96.
diff --git a/chapter_optimization/lr-scheduler.md b/chapter_optimization/lr-scheduler.md
index 71fd11a2e..2f6f141ab 100644
--- a/chapter_optimization/lr-scheduler.md
+++ b/chapter_optimization/lr-scheduler.md
@@ -288,9 +288,11 @@ train(net, train_iter, test_iter, num_epochs, lr,
 此外，余弦学习率调度在实践中的一些问题上运行效果很好。
 在某些问题上，最好在使用较高的学习率之前预热优化器。
 
-### 多因子调度器
+### 单因子调度器
 
-多项式衰减的一种替代方案是乘法衰减，即$\eta_{t+1} \leftarrow \eta_t \cdot \alpha$其中$\alpha \in (0, 1)$。为了防止学习率衰减超出合理的下限，更新方程经常修改为$\eta_{t+1} \leftarrow \mathop{\mathrm{max}}(\eta_{\mathrm{min}}, \eta_t \cdot \alpha)$。
+多项式衰减的一种替代方案是乘法衰减，即$\eta_{t+1} \leftarrow \eta_t \cdot \alpha$其中$\alpha \in (0, 1)$。
+为了防止学习率衰减到一个合理的下界之下，
+更新方程经常修改为$\eta_{t+1} \leftarrow \mathop{\mathrm{max}}(\eta_{\mathrm{min}}, \eta_t \cdot \alpha)$。
 
 ```{.python .input}
 #@tab all
@@ -312,8 +314,9 @@ d2l.plot(d2l.arange(50), [scheduler(t) for t in range(50)])
 
 ### 多因子调度器
 
-训练深度网络的常见策略之一是保持分段稳定的学习率，并且每隔一段时间就一定程度学习率降低。
-具体地说，给定一组降低学习率的时间，例如$s = \{5, 10, 20\}$每当$t \in s$时降低$\eta_{t+1} \leftarrow \eta_t \cdot \alpha$。
+训练深度网络的常见策略之一是保持学习率为一组分段的常量，并且不时地按给定的参数对学习率做乘法衰减。
+具体地说，给定一组降低学习率的时间点，例如$s = \{5, 10, 20\}$，
+每当$t \in s$时，降低$\eta_{t+1} \leftarrow \eta_t \cdot \alpha$。
 假设每步中的值减半，我们可以按如下方式实现这一点。
 
 ```{.python .input}
@@ -427,8 +430,8 @@ scheduler = CosineScheduler(max_update=20, base_lr=0.3, final_lr=0.01)
 d2l.plot(d2l.arange(num_epochs), [scheduler(t) for t in range(num_epochs)])
 ```
 
-在计算机视觉中，这个调度可以引出改进的结果。
-但请注意，如下所示，这种改进并不能保证成立。
+在计算机视觉的背景下，这个调度方式可能产生改进的结果。
+但请注意，如下所示，这种改进并不一定成立。
 
 ```{.python .input}
 trainer = gluon.Trainer(net.collect_params(), 'sgd',
diff --git a/chapter_optimization/optimization-intro.md b/chapter_optimization/optimization-intro.md
index f6e2cf7ab..43bca4dea 100644
--- a/chapter_optimization/optimization-intro.md
+++ b/chapter_optimization/optimization-intro.md
@@ -161,7 +161,7 @@ annotate('vanishing gradient', (4, 1), (2, 0.0))
 
 ## 练习
 
-1. 考虑一个简单的的MLP，它有一个隐藏层，比如，隐藏层中维度为$d$和一个输出。证明对于任何局部最小值，至少有$d！$个等效方案。
+1. 考虑一个简单的MLP，它有一个隐藏层，比如，隐藏层中维度为$d$和一个输出。证明对于任何局部最小值，至少有$d!$个等效方案。
 1. 假设我们有一个对称随机矩阵$\mathbf{M}$，其中条目$M_{ij} = M_{ji}$各自从某种概率分布$p_{ij}$中抽取。此外，假设$p_{ij}(x) = p_{ij}(-x)$，即分布是对称的（详情请参见 :cite:`Wigner.1958`）。
     1. 证明特征值的分布也是对称的。也就是说，对于任何特征向量$\mathbf{v}$，关联的特征值$\lambda$满足$P(\lambda > 0) = P(\lambda < 0)$的概率为$P(\lambda > 0) = P(\lambda < 0)$。
     1. 为什么以上*没有*暗示$P(\lambda > 0) = 0.5$？
diff --git a/chapter_preliminaries/calculus.md b/chapter_preliminaries/calculus.md
index 08baf7030..e5e3d8eca 100644
--- a/chapter_preliminaries/calculus.md
+++ b/chapter_preliminaries/calculus.md
@@ -36,7 +36,7 @@
 简而言之，对于每个参数，
 如果我们把这个参数*增加*或*减少*一个无穷小的量，我们可以知道损失会以多快的速度增加或减少，
 
-假设我们有一个函数$f: \mathbb{R}^n \rightarrow \mathbb{R}$，其输入和输出都是标量。
+假设我们有一个函数$f: \mathbb{R} \rightarrow \mathbb{R}$，其输入和输出都是标量。
 (**如果$f$的*导数*存在，这个极限被定义为**)
 
 (**$$f'(x) = \lim_{h \rightarrow 0} \frac{f(x+h) - f(x)}{h}.$$**)
diff --git a/chapter_recurrent-modern/gru.md b/chapter_recurrent-modern/gru.md
index 52ade35c9..36959a60e 100644
--- a/chapter_recurrent-modern/gru.md
+++ b/chapter_recurrent-modern/gru.md
@@ -35,7 +35,7 @@
 ## 门控隐状态
 
 门控循环单元与普通的循环神经网络之间的关键区别在于：
-后者支持隐状态的门控。
+前者支持隐状态的门控。
 这意味着模型有专门的机制来确定应该何时更新隐状态，
 以及应该何时重置隐状态。
 这些机制是可学习的，并且能够解决了上面列出的问题。
diff --git a/chapter_recurrent-modern/machine-translation-and-dataset_origin.md b/chapter_recurrent-modern/machine-translation-and-dataset_origin.md
index c3142de02..c9eb67f96 100644
--- a/chapter_recurrent-modern/machine-translation-and-dataset_origin.md
+++ b/chapter_recurrent-modern/machine-translation-and-dataset_origin.md
@@ -148,7 +148,7 @@ for machine translation
 we prefer word-level tokenization here
 (state-of-the-art models may use more advanced tokenization techniques).
 The following `tokenize_nmt` function
-tokenizes the the first `num_examples` text sequence pairs,
+tokenizes the first `num_examples` text sequence pairs,
 where
 each token is either a word or a punctuation mark.
 This function returns
diff --git a/chapter_recurrent-modern/seq2seq.md b/chapter_recurrent-modern/seq2seq.md
index ea822a14b..4e1ce1bb1 100644
--- a/chapter_recurrent-modern/seq2seq.md
+++ b/chapter_recurrent-modern/seq2seq.md
@@ -134,7 +134,7 @@ class Seq2SeqEncoder(d2l.Encoder):
         state = self.rnn.begin_state(batch_size=X.shape[1], ctx=X.ctx)
         output, state = self.rnn(X, state)
         # output的形状:(num_steps,batch_size,num_hiddens)
-        # state[0]的形状:(num_layers,batch_size,num_hiddens)
+        # state的形状:(num_layers,batch_size,num_hiddens)
         return output, state
 ```
 
@@ -159,7 +159,7 @@ class Seq2SeqEncoder(d2l.Encoder):
         # 如果未提及状态，则默认为0
         output, state = self.rnn(X)
         # output的形状:(num_steps,batch_size,num_hiddens)
-        # state[0]的形状:(num_layers,batch_size,num_hiddens)
+        # state的形状:(num_layers,batch_size,num_hiddens)
         return output, state
 ```
 
@@ -303,7 +303,7 @@ class Seq2SeqDecoder(d2l.Decoder):
         output, state = self.rnn(X_and_context, state)
         output = self.dense(output).swapaxes(0, 1)
         # output的形状:(batch_size,num_steps,vocab_size)
-        # state[0]的形状:(num_layers,batch_size,num_hiddens)
+        # state的形状:(num_layers,batch_size,num_hiddens)
         return output, state
 ```
 
@@ -331,7 +331,7 @@ class Seq2SeqDecoder(d2l.Decoder):
         output, state = self.rnn(X_and_context, state)
         output = self.dense(output).permute(1, 0, 2)
         # output的形状:(batch_size,num_steps,vocab_size)
-        # state[0]的形状:(num_layers,batch_size,num_hiddens)
+        # state的形状:(num_layers,batch_size,num_hiddens)
         return output, state
 ```
 
diff --git a/chapter_recurrent-neural-networks/sequence.md b/chapter_recurrent-neural-networks/sequence.md
index 690a311ff..64967fe25 100644
--- a/chapter_recurrent-neural-networks/sequence.md
+++ b/chapter_recurrent-neural-networks/sequence.md
@@ -152,9 +152,9 @@ $$P(x_1, \ldots, x_T) = \prod_{t=T}^1 P(x_t \mid x_{t+1}, \ldots, x_T).$$
 例如，在某些情况下，对于某些可加性噪声$\epsilon$，
 显然我们可以找到$x_{t+1} = f(x_t) + \epsilon$，
 而反之则不行 :cite:`Hoyer.Janzing.Mooij.ea.2009`。
-这是个好消息，因为这个前进方向通常也是我们感兴趣的方向。
-彼得斯等人写的这本书 :cite:`Peters.Janzing.Scholkopf.2017`
-已经解释了关于这个主题的更多内容，而我们仅仅触及了它的皮毛。
+而这个向前推进的方向恰好也是我们通常感兴趣的方向。
+彼得斯等人 :cite:`Peters.Janzing.Scholkopf.2017`
+对该主题的更多内容做了详尽的解释，而我们的上述讨论只是其中的冰山一角。
 
 ## 训练
 
@@ -424,11 +424,11 @@ max_steps = 64
 ```{.python .input}
 #@tab mxnet, pytorch
 features = d2l.zeros((T - tau - max_steps + 1, tau + max_steps))
-# 列i（i<tau）是来自x的观测，其时间步从（i+1）到（i+T-tau-max_steps+1）
+# 列i（i<tau）是来自x的观测，其时间步从（i）到（i+T-tau-max_steps+1）
 for i in range(tau):
     features[:, i] = x[i: i + T - tau - max_steps + 1]
 
-# 列i（i>=tau）是来自（i-tau+1）步的预测，其时间步从（i+1）到（i+T-tau-max_steps+1）
+# 列i（i>=tau）是来自（i-tau+1）步的预测，其时间步从（i）到（i+T-tau-max_steps+1）
 for i in range(tau, tau + max_steps):
     features[:, i] = d2l.reshape(net(features[:, i - tau: i]), -1)
 ```
@@ -436,11 +436,11 @@ for i in range(tau, tau + max_steps):
 ```{.python .input}
 #@tab tensorflow
 features = tf.Variable(d2l.zeros((T - tau - max_steps + 1, tau + max_steps)))
-# 列i（i<tau）是来自x的观测，其时间步从（i+1）到（i+T-tau-max_steps+1）
+# 列i（i<tau）是来自x的观测，其时间步从（i）到（i+T-tau-max_steps+1）
 for i in range(tau):
     features[:, i].assign(x[i: i + T - tau - max_steps + 1].numpy())
 
-# 列i（i>=tau）是来自（i-tau+1）步的预测，其时间步从（i+1）到（i+T-tau-max_steps+1）
+# 列i（i>=tau）是来自（i-tau+1）步的预测，其时间步从（i）到（i+T-tau-max_steps+1）
 for i in range(tau, tau + max_steps):
     features[:, i].assign(d2l.reshape(net((features[:, i - tau: i])), -1))
 ```
diff --git a/chapter_recurrent-neural-networks/sequence_origin.md b/chapter_recurrent-neural-networks/sequence_origin.md
index 4217ec2f6..aa953fb2e 100644
--- a/chapter_recurrent-neural-networks/sequence_origin.md
+++ b/chapter_recurrent-neural-networks/sequence_origin.md
@@ -355,12 +355,12 @@ max_steps = 64
 #@tab mxnet, pytorch
 features = d2l.zeros((T - tau - max_steps + 1, tau + max_steps))
 # Column `i` (`i` < `tau`) are observations from `x` for time steps from
-# `i + 1` to `i + T - tau - max_steps + 1`
+# `i` to `i + T - tau - max_steps + 1`
 for i in range(tau):
     features[:, i] = x[i: i + T - tau - max_steps + 1].T
 
 # Column `i` (`i` >= `tau`) are the (`i - tau + 1`)-step-ahead predictions for
-# time steps from `i + 1` to `i + T - tau - max_steps + 1`
+# time steps from `i` to `i + T - tau - max_steps + 1`
 for i in range(tau, tau + max_steps):
     features[:, i] = d2l.reshape(net(features[:, i - tau: i]), -1)
 ```
@@ -369,12 +369,12 @@ for i in range(tau, tau + max_steps):
 #@tab tensorflow
 features = tf.Variable(d2l.zeros((T - tau - max_steps + 1, tau + max_steps)))
 # Column `i` (`i` < `tau`) are observations from `x` for time steps from
-# `i + 1` to `i + T - tau - max_steps + 1`
+# `i` to `i + T - tau - max_steps + 1`
 for i in range(tau):
     features[:, i].assign(x[i: i + T - tau - max_steps + 1].numpy().T)
 
 # Column `i` (`i` >= `tau`) are the (`i - tau + 1`)-step-ahead predictions for
-# time steps from `i + 1` to `i + T - tau - max_steps + 1`
+# time steps from `i` to `i + T - tau - max_steps + 1`
 for i in range(tau, tau + max_steps):
     features[:, i].assign(d2l.reshape(net((features[:, i - tau: i])), -1))
 ```
diff --git a/config.ini b/config.ini
index dbb3ba6a5..eb2e0c179 100644
--- a/config.ini
+++ b/config.ini
@@ -8,7 +8,7 @@ author = Aston Zhang, Zachary C. Lipton, Mu Li, and Alexander J. Smola
 
 copyright = 2022, All authors. Licensed under CC-BY-SA-4.0 and MIT-0.
 
-release = 2.0.0-beta0
+release = 2.0.0-beta1
 
 lang = zh
 
@@ -99,6 +99,7 @@ post_latex = ./static/post_latex/main.py
 
 latex_logo = static/logo.png
 
+bibfile = d2l.bib
 
 [library]
 
diff --git a/d2l/__init__.py b/d2l/__init__.py
index 685a93002..5af511a7c 100644
--- a/d2l/__init__.py
+++ b/d2l/__init__.py
@@ -8,4 +8,4 @@
 
 """
 
-__version__ = "2.0.0-beta0"
+__version__ = "2.0.0-beta1"
diff --git a/d2l/mxnet.py b/d2l/mxnet.py
index 8ed5a5f34..27275dd91 100644
--- a/d2l/mxnet.py
+++ b/d2l/mxnet.py
@@ -1478,7 +1478,7 @@ def multibox_prior(data, sizes, ratios):
     ratio_tensor = d2l.tensor(ratios, ctx=device)
 
     # 为了将锚点移动到像素的中心，需要设置偏移量。
-    # 因为一个像素的的高为1且宽为1，我们选择偏移我们的中心0.5
+    # 因为一个像素的高为1且宽为1，我们选择偏移我们的中心0.5
     offset_h, offset_w = 0.5, 0.5
     steps_h = 1.0 / in_height  # 在y轴上缩放步长
     steps_w = 1.0 / in_width  # 在x轴上缩放步长
diff --git a/d2l/torch.py b/d2l/torch.py
index c092d774d..d7f7da3ad 100644
--- a/d2l/torch.py
+++ b/d2l/torch.py
@@ -1579,7 +1579,7 @@ def multibox_prior(data, sizes, ratios):
     ratio_tensor = d2l.tensor(ratios, device=device)
 
     # 为了将锚点移动到像素的中心，需要设置偏移量。
-    # 因为一个像素的的高为1且宽为1，我们选择偏移我们的中心0.5
+    # 因为一个像素的高为1且宽为1，我们选择偏移我们的中心0.5
     offset_h, offset_w = 0.5, 0.5
     steps_h = 1.0 / in_height  # 在y轴上缩放步长
     steps_w = 1.0 / in_width  # 在x轴上缩放步长