page fault

digoal · digoal · commit 88bf274acbee · 2017-03-03T19:50:41.000+08:00
diff --git a/201607/20160725_06.md b/201607/20160725_06.md
@@ -0,0 +1,185 @@
+## page fault带来的性能问题  
+              
+### 作者                                                           
+digoal         
+                
+### 日期           
+2016-07-25          
+            
+### 标签         
+PostgreSQL , Linux , page fault , major , minor         
+              
+----        
+                 
+## 背景       
+## Linux进程如何访问内存  
+Linux下，进程并不是直接访问物理内存，而是通过内存管理单元(MMU)来访问内存资源。  
+  
+原因后面会讲到。  
+  
+## 为什么需要虚拟内存地址空间  
+假设某个进程需要4MB的空间，内存假设是1MB的，如果进程直接使用物理地址，这个进程会因为内存不足跑不起来。  
+  
+既然进程不是直接访问物理内存，那么进程中涉及的内存地址当然也不是物理内存地址。  
+  
+而是虚拟的内存地址，虚拟的内存地址和物理的内存地址之间保持一种映射关系，这种关系由MMU进行管理。  
+  
+每个进程都有自己独立的虚拟地址空间。  
+  
+## 什么是MMU  
+![pic](20160725_06_pic_001.png)  
+  
+MMU全称是内存管理单元，它将物理内存分割成多个pages，MMU管理进程的虚拟地址空间中的PAGE和物理内存中的PAGE之间的映射关系。  
+  
+因为是映射，所以随时都可能发生变化，例如某个进程虚拟内存空间中的PAGE1，在不同的时间点，可能出现在物理内存中的不同位置（当发生了页交换时）。  
+  
+## 什么是page fault  
+当进程访问它的虚拟地址空间中的PAGE时，如果这个PAGE目前还不在物理内存中，此时CPU是不能干活的，  
+  
+Linux会产生一个hard page fault中断。  
+  
+系统需要从慢速设备（如磁盘）将对应的数据PAGE读入物理内存，并建立物理内存地址与虚拟地址空间PAGE的映射关系。  
+  
+然后进程才能访问这部分虚拟地址空间的内存。  
+  
+page fault 又分为几种，major page fault、 minor page fault、 invalid(segment fault)。  
+  
+major page fault也称为hard page fault, 指需要访问的内存不在虚拟地址空间，也不在物理内存中，需要从慢速设备载入。从swap回到物理内存也是hard page fault。  
+  
+minor page fault也称为soft page fault, 指需要访问的内存不在虚拟地址空间，但是在物理内存中，只需要MMU建立物理内存和虚拟地址空间的映射关系即可。  
+  
+（通常是多个进程访问同一个共享内存中的数据，可能某些进程还没有建立起映射关系，所以访问时会出现soft page fault）  
+  
+invalid fault也称为segment fault, 指进程需要访问的内存地址不在它的虚拟地址空间范围内，属于越界访问，内核会报segment fault错误。  
+  
+## 什么是进程的working set  
+进程的working set是指当前在物理内存中，属于该进程的pages组成的集合。  
+  
+这个集合中的page数随着系统的运行，可能扩大也可能缩小。  
+  
+扩大working set是指进程正在访问更多的内存时。  
+  
+缩小working set是指其他进程正在访问更多的内存，并且整个物理内存的空间不足，需要将某些干净的内存页free掉或者一些脏的内存页swap到交换分区去，如果这个操作设计到当前进程，对当前进程来说就是shrink working set。  
+  
+缩小working set需要遵循lru (内核的内存老化)算法，并不是随便挑选PAGE进行shrink的。  
+  
+## 什么是swap  
+swap和前面提到的shrink working set有关，如果是干净页（即从读入虚拟地址空间后，没有修改过的页），则直接标记为free，不需要写盘，也不会发生swap。  
+  
+如果是脏页，那么它需要写盘，或者需要swap 到交换分区。  
+  
+## 如何优化swap  
+建议使用IOPS和读写带宽很高的盘作为SWAP分区，例如PCI-E SSD。  
+  
+## 什么时候需要加内存  
+如果你发现经常发生swap in , out。  
+  
+说明进程的脏页被换出到swap后，紧接着这些页可能又需要被进程访问，从而这些页需要从swap写入物理内存，发生了hard page fault。  
+  
+这种情况下，你就需要加内存了，确实是内存不够用了。  
+  
+hard page fault非常损耗性能，因为发生page fault时，是需要等待的，而且IO通常来说都是比较慢的，容易成为性能瓶颈。  
+  
+## 什么是oom, 为什么会发生OOM  
+前面讲到了shrink working set，是指系统在内存调度时，使用的一种手段，尽可能的让系统能使用有限的内存资源，支撑更多的并发任务。  
+  
+oom是指系统已经没有足够的内存给进程使用，即能free的都已经free了，能swap out的也已经swap out了，再也不能挤出物理内存的情况。  
+  
+如果遇到这种情况就会发生OOM，表示系统内存以及不足，Linux会挑选并KILL一些进程，来释放内存。  
+  
+## Linux page相关统计信息  
+sar  
+  
+```  
+de  >       -B     Report paging statistics.  The following values are displayed:  
+  
+              pgpgin/s  
+                     Total number of kilobytes the system paged in from disk per second.  
+  
+              pgpgout/s  
+                     Total number of kilobytes the system paged out to disk per second.  
+  
+              fault/s  
+                     Number of page faults (major + minor) made by the system per second.  This is not a count of page faults that generate I/O, because some page faults can be resolved without I/O.  
+  
+              majflt/s  
+                     Number of major faults the system has made per second, those which have required loading a memory page from disk.  
+  
+              pgfree/s  
+                     Number of pages placed on the free list by the system per second.  
+  
+              pgscank/s  
+                     Number of pages scanned by the kswapd daemon per second.  
+  
+              pgscand/s  
+                     Number of pages scanned directly per second.  
+  
+              pgsteal/s  
+                     Number of pages the system has reclaimed from cache (pagecache and swapcache) per second to satisfy its memory demands.  
+  
+              %vmeff  
+                     Calculated as pgsteal / pgscan, this is a metric of the efficiency of page reclaim. If it is near 100% then almost every page coming off the tail of the inactive list is being reaped. If it gets too  low  (e.g.  
+                     less than 30%) then the virtual memory is having some difficulty.  This field is displayed as zero if no pages have been scanned during the interval of time.  
+de>  
+```  
+  
+## 一些CASE  
+1\. 如果PostgreSQL 设置了非常大的shared buffer, 为什么会有一段性能低潮(指全力压测时，平时估计感觉不到)  
+  
+在PostgreSQL shared buffer设得非常大的情况下，Shared buffer作为进程的虚拟地址空间中的一部分，刚启动时这些虚拟地址空间还没有在物理内存中，所以填充过程中会发生page fault，性能较低。  
+  
+当shared buffer被快速填满后，page fault减少，性能会上来。  
+  
+例如在压测持续高并发的COPY数据入库时，看这个case https://yq.aliyun.com/articles/8528  
+  
+在前面几百GB，page fault较多，产生较多的中断(这个case是软的，因为是在shared buffer中进行extend block，么有落盘)，性能比shared buffer填满后有一定的差异。  
+  
+240GB shared buffer,   
+  
+shared buffer 填满前，2.8GB/s 导入速度  
+  
+```  
+de  >12:13:46 AM  pgpgin/s pgpgout/s   fault/s  majflt/s  pgfree/s pgscank/s pgscand/s pgsteal/s    %vmeff  
+12:13:47 AM      8.08 1699410.10  40058.59      0.00 142064.65 137696.97      0.00 137632.32     99.95  
+12:13:48 AM     15.84 1634396.04  38605.94      0.00 412391.09 411564.36      0.00 411627.72    100.02  
+  
+ 22  44  33   0   0   0|  32k 2763M| 239B  364B|   0     0 | 184k  335k  
+ 22  44  34   0   0   0|  12k 2791M| 120B  544B|   0     0 | 184k  336k  
+ 22  45  33   0   0   0|  12k 2768M| 471B 1198B|   0     0 | 185k  341k  
+ 22  44  33   0   0   0|  16k 2764M| 210B  454B|   0     0 | 183k  336k  
+ 22  44  34   0   0   0|  16k 2777M| 239B  364B|   0     0 | 185k  341k  
+de>  
+```  
+  
+shared buffer 填满后, 4.7GB/s导入速度  
+  
+```  
+de  >12:14:40 AM  pgpgin/s pgpgout/s   fault/s  majflt/s  pgfree/s pgscank/s pgscand/s pgsteal/s    %vmeff  
+12:14:41 AM     16.33 2831730.61    823.47      0.00 735045.92 725126.53      0.00 724985.71     99.98  
+12:14:42 AM     12.00 2760160.00    743.00      0.00 738037.00 728352.00      0.00 728370.00    100.00  
+12:14:43 AM     11.88 2751279.21    326.73      0.00 720171.29 715310.89      0.00 715374.26    100.01  
+12:14:44 AM     20.00 2788468.00    338.00      0.00 727221.00 720480.00      0.00 720480.00    100.00  
+12:14:45 AM     12.50 2858970.83    425.00      0.00 753507.29 740266.67      0.00 740166.67     99.99  
+  
+ 24  18  56   0   0   1|  28k 4703M| 120B  364B|   0     0 | 366k  655k  
+ 24  18  57   0   0   1|  16k 5031M| 892B  994B|   0     0 | 382k  690k  
+ 24  19  56   0   0   0|  12k 4833M| 120B  364B|   0     0 | 374k  672k  
+ 24  19  56   0   0   1|8192B 4737M| 239B  544B|   0     0 | 371k  668k  
+ 24  19  56   0   0   1|  20k 4826M| 120B  454B|   0     0 | 375k  674k  
+de>  
+```  
+  
+## 参考  
+https://access.redhat.com/documentation/en-US/Red_Hat_Enterprise_Linux/3/html/Introduction_to_System_Administration/s1-memory-virt-details.html  
+  
+https://en.wikipedia.org/wiki/Page_fault  
+  
+http://www.cnblogs.com/xelatex/p/3491305.html  
+  
+https://en.wikipedia.org/wiki/Memory_management_unit  
+  
+http://pages.cs.wisc.edu/~remzi/OSTEP/vm-segmentation.pdf?spm=5176.100239.blogcont55820.33.KDVPoF&file=vm-segmentation.pdf  
+  
+http://www.tldp.org/LDP/tlk/mm/memory.html?spm=5176.100239.blogcont55820.44.tOZ1UX  
+            
+[Count](http://info.flagcounter.com/h9V1)                                               
diff --git a/201607/20160725_06_pic_001.png b/201607/20160725_06_pic_001.png
diff --git a/201607/readme.md b/201607/readme.md
@@ -8,6 +8,7 @@
 ##### 20160727_02.md   [《固若金汤 - PostgreSQL pgcrypto加密插件》](20160727_02.md)  
 ##### 20160727_01.md   [《PostgreSQL 会话级资源隔离探索》](20160727_01.md)  
 ##### 20160726_01.md   [《弱水三千,只取一瓢,当图像搜索遇见PostgreSQL(Haar wavelet)》](20160726_01.md)  
+##### 20160725_06.md   [《page fault带来的性能问题》](20160725_06.md)  
 ##### 20160725_05.md   [《PostgreSQL 如何高效解决 按任意字段分词检索的问题 - case 1》](20160725_05.md)  
 ##### 20160725_04.md   [《PostgreSQL Oracle 兼容性之 - 锁定执行计划(Outline system)》](20160725_04.md)  
 ##### 20160725_03.md   [《mongoDB BI 分析利器 - PostgreSQL FDW (MongoDB Connector for BI)》](20160725_03.md)  
diff --git a/README.md b/README.md
@@ -277,6 +277,7 @@ http://pan.baidu.com/s/1pKVCgHX  ,  如果连接失效请通知我, 谢谢
 ##### 201607/20160727_02.md   [《固若金汤 - PostgreSQL pgcrypto加密插件》](201607/20160727_02.md)  
 ##### 201607/20160727_01.md   [《PostgreSQL 会话级资源隔离探索》](201607/20160727_01.md)  
 ##### 201607/20160726_01.md   [《弱水三千,只取一瓢,当图像搜索遇见PostgreSQL(Haar wavelet)》](201607/20160726_01.md)  
+##### 201607/20160725_06.md   [《page fault带来的性能问题》](201607/20160725_06.md)  
 ##### 201607/20160725_05.md   [《PostgreSQL 如何高效解决 按任意字段分词检索的问题 - case 1》](201607/20160725_05.md)  
 ##### 201607/20160725_04.md   [《PostgreSQL Oracle 兼容性之 - 锁定执行计划(Outline system)》](201607/20160725_04.md)  
 ##### 201607/20160725_03.md   [《mongoDB BI 分析利器 - PostgreSQL FDW (MongoDB Connector for BI)》](201607/20160725_03.md)