| 1 | +## PostgreSQL database cannot start up because of memory overcommit |
| 2 | + |
| 3 | +### Author |
| 4 | +digoal |
| 5 | + |
| 6 | +### Date |
| 7 | +2015-07-30 |
| 8 | + |
| 9 | +### Tags |
| 10 | +PostgreSQL, OOM, resource limits |
| 11 | + |
| 12 | +---- |
| 13 | + |
| 14 | +## Background |
| 15 | +You may have run into a database startup failure like this one: |
| 16 | + |
| 17 | +``` |
| 18 | +postgres@digoal-> FATAL: XX000: could not map anonymous shared memory: Cannot allocate memory |
| 19 | +HINT: This error usually means that PostgreSQL's request for a shared memory segment exceeded available memory, swap space, or huge pages. To reduce the request size (currently 3322716160 bytes), reduce PostgreSQL's shared memory usage, perhaps by reducing shared_buffers or max_connections. |
| 20 | +LOCATION: CreateAnonymousSegment, pg_shmem.c:398 |
| 21 | +``` |
| 22 | + |
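| 23 | +The cause can be found by looking at meminfo. The kernel documentation describes the two relevant fields, CommitLimit and Committed_AS, as follows. |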
| 24 | + |
| 25 | +``` |
| 26 | +CommitLimit: Based on the overcommit ratio ('vm.overcommit_ratio'), |
| 27 | + this is the total amount of memory currently available to |
| 28 | + be allocated on the system. This limit is only adhered to |
| 29 | + if strict overcommit accounting is enabled (mode 2 in |
| 30 | + 'vm.overcommit_memory'). |
| 31 | + The CommitLimit is calculated with the following formula: |
| 32 | + CommitLimit = ([total RAM pages] - [total huge TLB pages]) * |
| 33 | + overcommit_ratio / 100 + [total swap pages] |
| 34 | + For example, on a system with 1G of physical RAM and 7G |
| 35 | + of swap with a `vm.overcommit_ratio` of 30 it would |
| 36 | + yield a CommitLimit of 7.3G. |
| 37 | + For more details, see the memory overcommit documentation |
| 38 | + in vm/overcommit-accounting. |
| 39 | +Committed_AS: The amount of memory presently allocated on the system. |
| 40 | + The committed memory is a sum of all of the memory which |
| 41 | + has been allocated by processes, even if it has not been |
| 42 | + "used" by them as of yet. A process which malloc()'s 1G |
| 43 | + of memory, but only touches 300M of it will show up as |
| 44 | + using 1G. This 1G is memory which has been "committed" to |
| 45 | + by the VM and can be used at any time by the allocating |
| 46 | + application. With strict overcommit enabled on the system |
| 47 | + (mode 2 in 'vm.overcommit_memory'), allocations which would |
| 48 | + exceed the CommitLimit (detailed above) will not be permitted. |
| 49 | + This is useful if one needs to guarantee that processes will |
| 50 | + not fail due to lack of memory once that memory has been |
| 51 | + successfully allocated. |
| 52 | +``` |
| 53 | + |
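| 54 | +What happens next depends on the value of vm.overcommit_memory. |
| 55 | + |
| 56 | +When vm.overcommit_memory=0 (the default), the kernel uses a heuristic: obvious overcommits of address space are refused, and root is allowed to allocate slightly more memory than ordinary users. |
| 57 | + |
| 58 | +When vm.overcommit_memory=1, overcommit is always allowed. |
| 59 | + |
| 60 | +When vm.overcommit_memory=2, Committed_AS is not allowed to exceed CommitLimit. |
| 61 | + |
| 62 | +How the commit limit is calculated: |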
| 63 | + |
| 64 | +``` |
| 65 | + The CommitLimit is calculated with the following formula: |
| 66 | + CommitLimit = ([total RAM pages] - [total huge TLB pages]) * |
| 67 | + overcommit_ratio / 100 + [total swap pages] |
| 68 | + For example, on a system with 1G of physical RAM and 7G |
| 69 | + of swap with a `vm.overcommit_ratio` of 30 it would |
| 70 | + yield a CommitLimit of 7.3G. |
| 71 | +[root@digoal postgresql-9.4.4]# free |
| 72 | +             total       used       free     shared    buffers     cached |
| 73 | +Mem:       1914436     713976    1200460      72588      32384     529364 |
| 74 | +-/+ buffers/cache:     152228    1762208 |
| 75 | +Swap:      1048572     542080     506492 |
| 76 | +[root@digoal ~]# cat /proc/meminfo |grep Commit |
| 77 | +CommitLimit: 2005788 kB |
| 78 | +Committed_AS: 132384 kB |
| 79 | +``` |
| 80 | + |
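| 81 | +The CommitLimit of roughly 2 GB in this example comes from the formula above: half of the 1914436 kB of RAM (which matches the default vm.overcommit_ratio of 50) plus the 1048572 kB of swap. |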
| 82 | + |
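To make the arithmetic concrete, here is a minimal Python sketch of the CommitLimit formula applied to the `free` output above. It assumes a 4 kB page size, no huge pages, and the default `vm.overcommit_ratio` of 50. None of these values appear in the output itself, and the function and parameter names are only illustrative. The formula is stated in pages, so the sketch converts kB to pages and uses integer arithmetic.

```python
PAGE_KB = 4  # assumed page size in kB (not stated in the article)


def commit_limit_kb(mem_total_kb, swap_total_kb,
                    overcommit_ratio=50, hugetlb_kb=0, page_kb=PAGE_KB):
    """CommitLimit = (RAM pages - huge TLB pages) * ratio / 100 + swap pages.

    All sizes are in kB; the result is converted back to kB.
    """
    ram_pages = mem_total_kb // page_kb
    huge_pages = hugetlb_kb // page_kb
    swap_pages = swap_total_kb // page_kb
    limit_pages = (ram_pages - huge_pages) * overcommit_ratio // 100 + swap_pages
    return limit_pages * page_kb


if __name__ == "__main__":
    # Mem total and Swap total taken from the `free` output in this article.
    print(commit_limit_kb(1914436, 1048572))
    # Kernel documentation example: 1G RAM, 7G swap, ratio 30.
    print(commit_limit_kb(1024 * 1024, 7 * 1024 * 1024, overcommit_ratio=30))
```

Under those assumptions the first call gives 2005788, the same figure /proc/meminfo reports as CommitLimit, and the second gives about 7.3G, in line with the example in the kernel documentation.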
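| 83 | +Overcommit exists because memory obtained with malloc is not used right away. If several processes request memory at the same time and overcommit is not allowed, some of those requests may fail even though memory is in fact still available. The Linux kernel therefore offers several modes. Mode 2 is the more reliable, gentler choice. Mode 1 is riskier because it can lead to OOM. |
| 84 | + |
| 85 | +So when the database cannot start, either reduce the amount of memory the database requests (for example, lower shared_buffers or max_connections), or change the overcommit policy. |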
| 86 | + |
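Before changing either setting, it can help to compare the requested size with what strict accounting would still allow. The following is a rough sketch, not what the kernel does internally. It parses meminfo-style text and checks whether a request fits between Committed_AS and CommitLimit. The helper names are hypothetical, and the sample values are copied from the output earlier in this article, which may not have been captured at the moment of the failure. The kernel also applies additional reserves in mode 2 that this sketch ignores.

```python
def parse_meminfo(text):
    """Return {field: value_in_kB} for lines such as 'CommitLimit: 2005788 kB'."""
    fields = {}
    for line in text.splitlines():
        name, sep, rest = line.partition(":")
        parts = rest.split()
        if sep and parts and parts[0].isdigit():
            fields[name.strip()] = int(parts[0])
    return fields


def fits_strict_limit(meminfo, request_bytes):
    """True if the request fits in CommitLimit - Committed_AS (the mode 2 rule)."""
    headroom_kb = meminfo["CommitLimit"] - meminfo["Committed_AS"]
    request_kb = -(-request_bytes // 1024)  # round up to whole kB
    return request_kb <= headroom_kb


# Sample taken from the /proc/meminfo output shown in this article.
SAMPLE = """\
CommitLimit:     2005788 kB
Committed_AS:     132384 kB
"""

if __name__ == "__main__":
    info = parse_meminfo(SAMPLE)
    # 3322716160 bytes is the request size reported in the HINT.
    print(fits_strict_limit(info, 3322716160))
```

With the sample values the headroom is 1873404 kB while the request is 3244840 kB, so the check prints `False`. On a live system you could pass the contents of /proc/meminfo instead of `SAMPLE`.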
| 87 | +## References |
| 88 | +1\. kernel-doc-2.6.32/Documentation/filesystems/proc.txt |
| 89 | + |
| 90 | +``` |
| 91 | + MemTotal: Total usable ram (i.e. physical ram minus a few reserved |
| 92 | + bits and the kernel binary code) |
| 93 | + MemFree: The sum of LowFree+HighFree |
| 94 | +MemAvailable: An estimate of how much memory is available for starting new |
| 95 | + applications, without swapping. Calculated from MemFree, |
| 96 | + SReclaimable, the size of the file LRU lists, and the low |
| 97 | + watermarks in each zone. |
| 98 | + The estimate takes into account that the system needs some |
| 99 | + page cache to function well, and that not all reclaimable |
| 100 | + slab will be reclaimable, due to items being in use. The |
| 101 | + impact of those factors will vary from system to system. |
| 102 | + This line is only reported if sysctl vm.meminfo_legacy_layout = 0 |
| 103 | + Buffers: Relatively temporary storage for raw disk blocks |
| 104 | + shouldn't get tremendously large (20MB or so) |
| 105 | + Cached: in-memory cache for files read from the disk (the |
| 106 | + pagecache). Doesn't include SwapCached |
| 107 | + SwapCached: Memory that once was swapped out, is swapped back in but |
| 108 | + still also is in the swapfile (if memory is needed it |
| 109 | + doesn't need to be swapped out AGAIN because it is already |
| 110 | + in the swapfile. This saves I/O) |
| 111 | + Active: Memory that has been used more recently and usually not |
| 112 | + reclaimed unless absolutely necessary. |
| 113 | + Inactive: Memory which has been less recently used. It is more |
| 114 | + eligible to be reclaimed for other purposes |
| 115 | + HighTotal: |
| 116 | + HighFree: Highmem is all memory above ~860MB of physical memory |
| 117 | + Highmem areas are for use by userspace programs, or |
| 118 | + for the pagecache. The kernel must use tricks to access |
| 119 | + this memory, making it slower to access than lowmem. |
| 120 | + LowTotal: |
| 121 | + LowFree: Lowmem is memory which can be used for everything that |
| 122 | + highmem can be used for, but it is also available for the |
| 123 | + kernel's use for its own data structures. Among many |
| 124 | + other things, it is where everything from the Slab is |
| 125 | + allocated. Bad things happen when you're out of lowmem. |
| 126 | + SwapTotal: total amount of swap space available |
| 127 | + SwapFree: Memory which has been evicted from RAM, and is temporarily |
| 128 | + on the disk |
| 129 | + Dirty: Memory which is waiting to get written back to the disk |
| 130 | + Writeback: Memory which is actively being written back to the disk |
| 131 | + AnonPages: Non-file backed pages mapped into userspace page tables |
| 132 | +AnonHugePages: Non-file backed huge pages mapped into userspace page tables |
| 133 | + Mapped: files which have been mmaped, such as libraries |
| 134 | + Slab: in-kernel data structures cache |
| 135 | +SReclaimable: Part of Slab, that might be reclaimed, such as caches |
| 136 | + SUnreclaim: Part of Slab, that cannot be reclaimed on memory pressure |
| 137 | + PageTables: amount of memory dedicated to the lowest level of page |
| 138 | + tables. |
| 139 | +NFS_Unstable: NFS pages sent to the server, but not yet committed to stable |
| 140 | + storage |
| 141 | + Bounce: Memory used for block device "bounce buffers" |
| 142 | +WritebackTmp: Memory used by FUSE for temporary writeback buffers |
| 143 | + CommitLimit: Based on the overcommit ratio ('vm.overcommit_ratio'), |
| 144 | + this is the total amount of memory currently available to |
| 145 | + be allocated on the system. This limit is only adhered to |
| 146 | + if strict overcommit accounting is enabled (mode 2 in |
| 147 | + 'vm.overcommit_memory'). |
| 148 | + The CommitLimit is calculated with the following formula: |
| 149 | + CommitLimit = ([total RAM pages] - [total huge TLB pages]) * |
| 150 | + overcommit_ratio / 100 + [total swap pages] |
| 151 | + For example, on a system with 1G of physical RAM and 7G |
| 152 | + of swap with a `vm.overcommit_ratio` of 30 it would |
| 153 | + yield a CommitLimit of 7.3G. |
| 154 | + For more details, see the memory overcommit documentation |
| 155 | + in vm/overcommit-accounting. |
| 156 | +Committed_AS: The amount of memory presently allocated on the system. |
| 157 | + The committed memory is a sum of all of the memory which |
| 158 | + has been allocated by processes, even if it has not been |
| 159 | + "used" by them as of yet. A process which malloc()'s 1G |
| 160 | + of memory, but only touches 300M of it will show up as |
| 161 | + using 1G. This 1G is memory which has been "committed" to |
| 162 | + by the VM and can be used at any time by the allocating |
| 163 | + application. With strict overcommit enabled on the system |
| 164 | + (mode 2 in 'vm.overcommit_memory'), allocations which would |
| 165 | + exceed the CommitLimit (detailed above) will not be permitted. |
| 166 | + This is useful if one needs to guarantee that processes will |
| 167 | + not fail due to lack of memory once that memory has been |
| 168 | + successfully allocated. |
| 169 | +VmallocTotal: total size of vmalloc memory area |
| 170 | + VmallocUsed: amount of vmalloc area which is used |
| 171 | +VmallocChunk: largest contiguous block of vmalloc area which is free |
| 172 | +``` |
| 173 | + |
| 174 | +2\. kernel-doc-2.6.32/Documentation/vm/overcommit-accounting |
| 175 | + |
| 176 | +``` |
| 177 | +The Linux kernel supports the following overcommit handling modes |
| 178 | + |
| 179 | +0 - Heuristic overcommit handling. Obvious overcommits of |
| 180 | + address space are refused. Used for a typical system. It |
| 181 | + ensures a seriously wild allocation fails while allowing |
| 182 | + overcommit to reduce swap usage. root is allowed to |
| 183 | + allocate slightly more memory in this mode. This is the |
| 184 | + default. |
| 185 | + |
| 186 | +1 - Always overcommit. Appropriate for some scientific |
| 187 | + applications. |
| 188 | + |
| 189 | +2 - Don't overcommit. The total address space commit |
| 190 | + for the system is not permitted to exceed swap + a |
| 191 | + configurable amount (default is 50%) of physical RAM. |
| 192 | + Depending on the amount you use, in most situations |
| 193 | + this means a process will not be killed while accessing |
| 194 | + pages but will receive errors on memory allocation as |
| 195 | + appropriate. |
| 196 | + |
| 197 | +The overcommit policy is set via the sysctl `vm.overcommit_memory'. |
| 198 | + |
| 199 | +The overcommit amount can be set via `vm.overcommit_ratio' (percentage) |
| 200 | +or `vm.overcommit_kbytes' (absolute value). |
| 201 | + |
| 202 | +The current overcommit limit and amount committed are viewable in |
| 203 | +/proc/meminfo as CommitLimit and Committed_AS respectively. |
| 204 | + |
| 205 | +Gotchas |
| 206 | +------- |
| 207 | + |
| 208 | +The C language stack growth does an implicit mremap. If you want absolute |
| 209 | +guarantees and run close to the edge you MUST mmap your stack for the |
| 210 | +largest size you think you will need. For typical stack usage this does |
| 211 | +not matter much but it's a corner case if you really really care |
| 212 | + |
| 213 | +In mode 2 the MAP_NORESERVE flag is ignored. |
| 214 | + |
| 215 | + |
| 216 | +How It Works |
| 217 | +------------ |
| 218 | + |
| 219 | +The overcommit is based on the following rules |
| 220 | + |
| 221 | +For a file backed map |
| 222 | + SHARED or READ-only - 0 cost (the file is the map not swap) |
| 223 | + PRIVATE WRITABLE - size of mapping per instance |
| 224 | + |
| 225 | +For an anonymous or /dev/zero map |
| 226 | + SHARED - size of mapping |
| 227 | + PRIVATE READ-only - 0 cost (but of little use) |
| 228 | + PRIVATE WRITABLE - size of mapping per instance |
| 229 | + |
| 230 | +Additional accounting |
| 231 | + Pages made writable copies by mmap |
| 232 | + shmfs memory drawn from the same pool |
| 233 | + |
| 234 | +Status |
| 235 | +------ |
| 236 | + |
| 237 | +o We account mmap memory mappings |
| 238 | +o We account mprotect changes in commit |
| 239 | +o We account mremap changes in size |
| 240 | +o We account brk |
| 241 | +o We account munmap |
| 242 | +o We report the commit status in /proc |
| 243 | +o Account and check on fork |
| 244 | +o Review stack handling/building on exec |
| 245 | +o SHMfs accounting |
| 246 | +o Implement actual limit enforcement |
| 247 | + |
| 248 | +To Do |
| 249 | +----- |
| 250 | +o Account ptrace pages (this is hard) |
| 251 | +``` |
| 252 | + |
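The "How It Works" rules quoted above can be restated as a small Python function, which makes it easier to see why the PostgreSQL request counts in full: an anonymous shared mapping is charged its whole size. This is only an illustration of the quoted table. The function name and the `instances` parameter (the number of private copies) are hypothetical, and the "Additional accounting" items are not modeled.

```python
def commit_cost(size, *, anonymous, shared, writable, instances=1):
    """Bytes charged against the commit limit for one kind of mapping."""
    if anonymous:
        if shared:
            return size              # anonymous SHARED: size of mapping
        if writable:
            return size * instances  # PRIVATE WRITABLE: size per instance
        return 0                     # PRIVATE READ-only: 0 cost
    if shared or not writable:
        return 0                     # file-backed SHARED or READ-only: 0 cost
    return size * instances          # file-backed PRIVATE WRITABLE: per instance


if __name__ == "__main__":
    # The anonymous shared segment from the error message at the top.
    print(commit_cost(3322716160, anonymous=True, shared=True, writable=True))
    # A read-only file mapping, such as a shared library's code.
    print(commit_cost(4096 * 100, anonymous=False, shared=False, writable=False))
```

The first call returns the full 3322716160 bytes, and the second returns 0.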