Ubuntu HA Cluster with lsyncd, remote MariaDB, Apache Reverse Proxy Setup Guide
The following document describes how to set up a High Availability and failover ERPNext cluster based on the latest stable distribution of Ubuntu Server, 14.04 LTS x86-64.
It is expected that you:
- have enough experience in Ubuntu server administration, but hate vim and love mc (Midnight Commander),
- understand what each provided command is doing,
- follow this guide step by step, replacing certain values to match your network structure,
- have your own dedicated servers located in the same local TCP/IP network (e.g. 192.168.2.0/24),
- have already set up and tweaked your Ubuntu servers,
- have password-less SSH (RSA or DSA key) root login enabled between the machines,
- have at least two servers identically set up with two different public IP addresses (preferably in different subnets) to act as balancers/failover/gateways and as ERPNext Application Servers,
- have the Bind9 service set up on each of these machines, providing round-robin/failover DNS resolution of your ERPNext FQDNs (top-level domains with wildcard sub-domains, for easy SaaS service setups),
- have a third server as a dedicated MariaDB (MySQL) & Memcached "Master" server, properly tuned for best performance (MariaDB and Memcached must be allowed to listen/bind on the 0.0.0.0 address) and available locally to the frontend Application Servers,
- have taken care of backing up your data prior to starting the setup,
- and have lots of patience.
In the example setup we use 3 servers as follows:
Server 1: PUBLIC "GATEWAY" with local IP address 192.168.2.1
Server 2: PUBLIC "GATEWAY2" with local IP address 192.168.2.2
Server 3: LOCAL DATABASE "DB" SERVER with local IP address 192.168.2.21
Add to /etc/hosts on each server:
192.168.2.1 gateway
192.168.2.2 gateway2
192.168.2.21 db
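The three entries above can be added idempotently, so that repeating the step on a server does not duplicate lines. This is only a minimal sketch: the helper name add_host_entry and the HOSTS_FILE variable are illustrative assumptions, not part of the original setup. By default it writes to a scratch file so it can be tried safely.

```shell
#!/bin/sh
# Hypothetical helper: append "IP NAME" to a hosts file unless NAME is already mapped.
add_host_entry() {
    hosts_file=$1; ip=$2; name=$3
    # Look for NAME as a whole field after the address column, ignoring comment lines.
    if awk -v n="$name" '
        $1 !~ /^#/ { for (i = 2; i <= NF; i++) if ($i == n) found = 1 }
        END { exit !found }
    ' "$hosts_file"; then
        return 0
    fi
    printf '%s %s\n' "$ip" "$name" >> "$hosts_file"
}

# Dry run against a scratch file; set HOSTS_FILE=/etc/hosts (as root) to edit the real one.
HOSTS_FILE=${HOSTS_FILE:-$(mktemp)}
add_host_entry "$HOSTS_FILE" 192.168.2.1 gateway
add_host_entry "$HOSTS_FILE" 192.168.2.2 gateway2
add_host_entry "$HOSTS_FILE" 192.168.2.21 db
cat "$HOSTS_FILE"
```

Matching on the whole field keeps "gateway" and "gateway2" from being confused with each other.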
I recommend installing the low-latency kernel (at the time of writing this guide the kernel version was 3.13.0-36):
sudo -i
apt-get install linux-image-lowlatency linux-image-3.13.0-36-lowlatency
update-initramfs -ck all
update-grub2
reboot
If you have /var/www mounted as separate partition, make sure /etc/fstab entry for it looks similar to this:
# /var/www was on /dev/sda4 during installation
UUID=d1bb10a1-0f00-4595-b8e2-13a53c8aa534 /var/www ext4 noatime,nodiratime 0 2
Optimize the Linux kernel for better security, performance, and proper lsyncd syncing. Edit /etc/sysctl.conf if you wish (and replace the "kernel.domainname" value with your own public FQDN). The last two lines are very important; they are needed because the default kernel inotify limits are too low for lsyncd:
#
# /etc/sysctl.conf - Configuration file for setting system variables
# See /etc/sysctl.d/ for additional system variables.
# See sysctl.conf (5) for information.
#
kernel.domainname = MY-FQDN-REPLACE-THIS-FIELD.com
# Uncomment the following to stop low-level messages on console
#kernel.printk = 3 4 1 3
# Number of times SYNACKs for passive TCP connection.
net.ipv4.tcp_synack_retries = 2
# Allowed local port range
net.ipv4.ip_local_port_range = 1024 65535
# Protect against ICMP attacks
net.ipv4.icmp_echo_ignore_broadcasts = 1
# Turn on protection for bad icmp error messages
net.ipv4.icmp_ignore_bogus_error_responses = 1
# Protect Against TCP Time-Wait
net.ipv4.tcp_rfc1337 = 1
# Decrease the time default value for tcp_fin_timeout connection
net.ipv4.tcp_fin_timeout = 15
# Decrease the time default value for connections to keep alive
net.ipv4.tcp_keepalive_time = 300
net.ipv4.tcp_keepalive_probes = 5
net.ipv4.tcp_keepalive_intvl = 15
# reduce TIME_WAIT from the 120s default
net.netfilter.nf_conntrack_tcp_timeout_time_wait = 30
# Enable Spoof protection (reverse-path filter)
#net.ipv4.conf.default.rp_filter=1
#net.ipv4.conf.all.rp_filter=1
# Enable TCP/IP SYN cookies
net.ipv4.tcp_syncookies=1
# Make sure no one can alter the routing tables
net.ipv4.conf.all.accept_redirects = 0
net.ipv4.conf.default.accept_redirects = 0
net.ipv4.conf.all.secure_redirects = 0
net.ipv4.conf.default.secure_redirects = 0
# Uncomment the next line to enable packet forwarding for IPv4
net.ipv4.ip_forward=1
# Do not send ICMP redirects (we are not a router)
net.ipv4.conf.all.send_redirects = 0
net.ipv4.conf.default.send_redirects = 0
# Turn on exec-shield
#kernel.exec-shield = 1
kernel.randomize_va_space = 1
# Log Martian Packets
net.ipv4.conf.all.log_martians = 1
# Disable IPv6
net.ipv6.conf.all.disable_ipv6 = 1
net.ipv6.conf.default.disable_ipv6 = 1
net.ipv6.conf.lo.disable_ipv6 = 1
# Default Socket Receive Buffer
net.core.rmem_default = 31457280
# Maximum Socket Receive Buffer
net.core.rmem_max = 12582912
# Default Socket Send Buffer
net.core.wmem_default = 31457280
# Maximum Socket Send Buffer
net.core.wmem_max = 12582912
# Increase number of incoming connections
net.core.somaxconn = 65535
# Increase number of incoming connections backlog
net.core.netdev_max_backlog = 65536
# Increase the maximum amount of option memory buffers
net.core.optmem_max = 25165824
# Increase the maximum total buffer-space allocatable
# This is measured in units of pages (4096 bytes)
net.ipv4.tcp_mem = 65536 131072 262144
net.ipv4.udp_mem = 65536 131072 262144
# Increase the read-buffer space allocatable
net.ipv4.tcp_rmem = 8192 87380 16777216
net.ipv4.udp_rmem_min = 16384
# Increase the write-buffer-space allocatable
net.ipv4.tcp_wmem = 8192 65536 16777216
net.ipv4.udp_wmem_min = 16384
# Increase the tcp-time-wait buckets pool size to prevent simple DOS attacks
net.ipv4.tcp_max_tw_buckets = 1440000
# Note: tcp_tw_recycle can break connections from clients behind NAT and was removed in Linux 4.12
net.ipv4.tcp_tw_recycle = 1
net.ipv4.tcp_tw_reuse = 1
# Increase size of file handles and inode cache
fs.file-max = 2097152
vm.overcommit_memory = 1
# Do less swapping
vm.swappiness = 0
vm.dirty_ratio = 60
vm.dirty_background_ratio = 2
vm.vfs_cache_pressure=50
# Fix network WAN failover kernel cache problems
# Just be careful to not make this too long – you can run out of memory. On a large site, use with care...
net.ipv4.ipfrag_secret_interval=14400
net.ipv4.route.gc_elasticity=80
#fs.inotify.max_user_watches=16384
fs.inotify.max_user_watches = 1000000
fs.inotify.max_queued_events = 1000000
Apply sysctl.conf changes. The output should show no errors, just applied tunings:
sudo sysctl -p
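lsyncd places one inotify watch on every directory it follows, which is why the inotify limits matter. A rough sketch for comparing the number of directories in a tree with the limit currently in effect follows; the function name count_dirs and the WATCH_ROOT variable are illustrative assumptions.

```shell
#!/bin/sh
# Count directories under a tree; each one costs lsyncd one inotify watch.
count_dirs() {
    find "$1" -type d 2>/dev/null | wc -l | tr -d ' '
}

WATCH_ROOT=${WATCH_ROOT:-/var/www/erpnext}
LIMIT_FILE=/proc/sys/fs/inotify/max_user_watches

if [ -d "$WATCH_ROOT" ] && [ -r "$LIMIT_FILE" ]; then
    needed=$(count_dirs "$WATCH_ROOT")
    limit=$(cat "$LIMIT_FILE")
    echo "directories: $needed, max_user_watches: $limit"
    [ "$needed" -lt "$limit" ] || echo "WARNING: raise fs.inotify.max_user_watches"
else
    echo "skipping check: $WATCH_ROOT or $LIMIT_FILE not available"
fi
```

Run it again once the bench directories exist, since the count grows with every installed app and site.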
Install required software to begin with:
sudo -i
apt-get update
apt-get upgrade
apt-get purge apparmor
apt-get install python-software-properties
apt-get install apache2 apache2-bin apache2-data apache2-utils
a2dismod mpm_prefork mpm_event
apt-get install apache2-mpm-worker
a2enmod mpm_worker
apt-get install mc htop git socat ufw sysv-rc-conf lsyncd nano python-dev python-setuptools build-essential python-mysqldb git ntp screen mariadb-common libmariadbclient-dev libxslt1.1 libxslt1-dev redis-server libssl-dev libcrypto++-dev postfix supervisor python-pip python-setproctitle python-concurrent.futures python-eventlet python-greenlet python-celery
pip install gunicorn
cd /tmp && wget http://downloads.sourceforge.net/project/wkhtmltopdf/0.12.1/wkhtmltox-0.12.1_linux-trusty-amd64.deb
dpkg -i wkhtmltox-0.12.1_linux-trusty-amd64.deb
If you use the UFW firewall, make sure ports 22 and 8000 on the Application Servers are open bidirectionally for the local network. I suggest completely disabling UFW on the DATABASE server, since it is reachable only within the local TCP/IP network.
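As a sketch of what opening those ports could look like, the helper below prints the ufw commands instead of running them, so they can be reviewed first. The function name ufw_rules is an illustrative assumption, and the network is the example 192.168.2.0/24 used in this guide. Only inbound rules are shown, because UFW's default policy allows outgoing traffic.

```shell
#!/bin/sh
# Print (do not execute) ufw rules that open the given TCP ports to one network.
ufw_rules() {
    net=$1; shift
    for port in "$@"; do
        echo "ufw allow from $net to any port $port proto tcp"
    done
}

# Example network from this guide; review the output, then run the lines as root.
ufw_rules 192.168.2.0/24 22 8000
```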
Create system user for ERPNext:
useradd erpnext -U -m -r -b /var/www -s /bin/bash
chmod o+x /var/www/erpnext
chmod o+r /var/www/erpnext
Basic install of Frappe Bench and Initialization thereof:
su -l erpnext
git clone https://github.com/frappe/bench bench-repo
exit
You should still be logged in as root; continue:
pip install -e /var/www/erpnext/bench-repo
su -l erpnext
bench init frappe-bench && cd frappe-bench
bench get-app erpnext https://github.com/frappe/erpnext
bench get-app shopping_cart https://github.com/frappe/shopping-cart
We will continue initializing sites a bit later, after setting up basic cluster synchronization with lsyncd and using socat to relay the remotely running MariaDB and Memcached to local endpoints (these services should not be running on the Application Servers, i.e. GATEWAY & GATEWAY2).
Let's create a placeholder mysql user on the Application Servers, if it does not exist there:
useradd mysql -U -M -r
Open the /etc/rc.local file and add these lines before the last line, "exit 0":
# Start relaying remote MySQL to local socket
mkdir -p /var/run/mysqld
chown mysql: /var/run/mysqld
socat UNIX-LISTEN:/var/run/mysqld/mysqld.sock,fork,reuseaddr,unlink-early,user=mysql,group=mysql,mode=777 TCP:192.168.2.21:3306 &
# Start relaying remote memcached to localhost
socat TCP-LISTEN:11211,reuseaddr,fork,user=proxy,group=proxy,perm=0777 TCP:192.168.2.21:11211 &
Now execute this /etc/rc.local file and then run:
htop
Press "F4"
Start typing "socat"
Now you should see two "socat" processes running.
Exit "htop" by pressing ESC several times, then "q" and ENTER.
Check socat relaying remote memcached with telnet, to see if it is connecting:
telnet localhost 11211
^]
quit
Now check mysql socket for relayed remote mysql service:
stat /var/run/mysqld/mysqld.sock
And this way to make sure it is working:
mysqlreport --user root --password YOURMYSQLROOTPASSWORD
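The manual checks above (htop, stat) can also be done non-interactively, which is handy after a reboot. The following is a minimal sketch: the function names is_socket and socat_running and the MYSQL_SOCK variable are illustrative assumptions. It only reports whether the relays appear to be in place and does not test that the database actually answers.

```shell
#!/bin/sh
# Succeeds only if the given path exists and is a UNIX domain socket.
is_socket() {
    [ -S "$1" ]
}

# Succeeds if at least one process named exactly "socat" is running.
socat_running() {
    pgrep -x socat >/dev/null 2>&1
}

MYSQL_SOCK=${MYSQL_SOCK:-/var/run/mysqld/mysqld.sock}

if is_socket "$MYSQL_SOCK"; then
    echo "OK: $MYSQL_SOCK is a socket"
else
    echo "MISSING: $MYSQL_SOCK"
fi
if socat_running; then
    echo "OK: socat is running"
else
    echo "MISSING: no socat process found"
fi
```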
Good stuff. So far, we are more than halfway there! Now let's set up lsyncd synchronization of the /var/www/erpnext sub-directories:
cd /etc/lsyncd
nano lsyncd.conf.lua
If the file is empty, add the lines below, where "gatewayXYZ" is the local name (you set it up previously in /etc/hosts) of "the other" Application Server that you want to sync the data to. You need to perform these edits on each Application Server to make the data sync to and fro.
The best way to sync data is the STAR pattern, where you have a "master" lsyncd instance and slaves, allowing you to do all maintenance and further setup on the master server with automatic syncing to the slaves, i.e.:
- GATEWAY (1st App Server) <--> GATEWAY2 (2nd App Server)
- GATEWAY (1st App Server) <--> APPSERVER3 (3rd App Server)
- GATEWAY (1st App Server) <--> APPSERVER4 (4th App Server) etc.
Here is the code to add to /etc/lsyncd/lsyncd.conf.lua:
settings{
logfile = "/var/log/lsyncd.log",
statusFile = "/var/log/lsyncd.status",
statusInterval=2,
delay=2,
maxDelays = 1,
insist = 1,
maxProcesses = 3
}
sync{default.rsync, source="/var/www/erpnext/bench-repo/", target="root@gatewayXYZ:/var/www/erpnext/bench-repo/",
rsync = {acls=true, verbose=true, archive = true, owner = true, perms = true, group = true, compress = false, whole_file = true, rsh="/usr/bin/ssh -p 22 -o StrictHostKeyChecking=no"}
}
sync{default.rsync, source="/var/www/erpnext/frappe-bench/", target="root@gatewayXYZ:/var/www/erpnext/frappe-bench/", exclude = {"*.log"},
rsync = {acls=true, verbose=true, archive = true, owner = true, perms = true, group = true, compress = false, whole_file = true, rsh="/usr/bin/ssh -p 22 -o StrictHostKeyChecking=no"}
}
I recommend also syncing /etc/ssl, /etc/bind, /etc/apache2/sites-available and /etc/apache2/sites-enabled between the frontend GATEWAYs (in the same /etc/lsyncd/lsyncd.conf.lua) to keep these essentials uniform:
sync{default.rsync, source="/etc/apache2/sites-available/", target="root@gatewayXYZ:/etc/apache2/sites-available/",
rsync = {acls=true, verbose=true, archive = true, owner = true, perms = true, group = true, compress = false, whole_file = true, rsh="/usr/bin/ssh -p 22 -o StrictHostKeyChecking=no"}
}
sync{default.rsync, source="/etc/apache2/sites-enabled/", target="root@gatewayXYZ:/etc/apache2/sites-enabled/",
rsync = {acls=true, verbose=true, archive = true, owner = true, perms = true, group = true, compress = false, whole_file = true, rsh="/usr/bin/ssh -p 22 -o StrictHostKeyChecking=no"}
}
sync{default.rsync, source="/etc/bind/", target="root@gatewayXYZ:/etc/bind/",
rsync = {acls=true, verbose=true, archive = true, owner = true, perms = true, group = true, compress = false, whole_file = true, rsh="/usr/bin/ssh -p 22 -o StrictHostKeyChecking=no"}
}
sync{default.rsync, source="/etc/ssl/", target="root@gatewayXYZ:/etc/ssl/",
rsync = {acls=true, verbose=true, archive = true, owner = true, perms = true, group = true, compress = false, whole_file = true, rsh="/usr/bin/ssh -p 22 -o StrictHostKeyChecking=no"}
}
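The sync stanzas repeat the same rsync options for every directory, and in the STAR pattern they are repeated again for every slave. A small generator can print them for review before they are pasted into lsyncd.conf.lua. This is a sketch under assumptions: the function name emit_sync is hypothetical, and it mirrors each directory to the same path on the target host.

```shell
#!/bin/sh
# Emit one lsyncd "sync" stanza that mirrors DIR to the same path on HOST.
emit_sync() {
    host=$1; dir=$2
    cat <<EOF
sync{default.rsync, source="$dir/", target="root@$host:$dir/",
rsync = {acls=true, verbose=true, archive = true, owner = true, perms = true, group = true, compress = false, whole_file = true, rsh="/usr/bin/ssh -p 22 -o StrictHostKeyChecking=no"}
}
EOF
}

# Star pattern: run on the master, one group of stanzas per slave host.
for host in gateway2; do
    for dir in /etc/apache2/sites-available /etc/apache2/sites-enabled /etc/bind /etc/ssl; do
        emit_sync "$host" "$dir"
    done
done
```

Adding another slave is then a matter of appending its /etc/hosts name to the outer loop.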
Save the lsyncd.conf.lua file and start/restart the lsyncd service:
service lsyncd restart
You may check the lsyncd log files to make sure everything is working smoothly. It is also worth noting that lsyncd, though much better than unison and the like, has a bug where exclude rules do not behave as expected. Further on in this guide you will see that I work around it to avoid overwrite conflicts on frappe-bench's celerybeat.pid and scheduler.schedule files: each Application Server generates its own *.pid and *.schedule file.
tail -n 100 /var/log/lsyncd.log
tail -n 100 /var/log/lsyncd.status
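On a busy server the log is long, so it can help to filter it for problems only. The sketch below assumes that lsyncd tags failing entries with the category "Error:" in its log lines; check a few lines of your own log to confirm that before relying on it. The function name lsyncd_errors and the LOG variable are illustrative.

```shell
#!/bin/sh
# Print log lines that lsyncd tagged as errors (assumes the "Error:" category tag).
lsyncd_errors() {
    grep 'Error:' "$1"
}

LOG=${LOG:-/var/log/lsyncd.log}
if [ -r "$LOG" ]; then
    # Show at most the 20 most recent matches.
    lsyncd_errors "$LOG" | tail -n 20
else
    echo "cannot read $LOG"
fi
```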
And now the interesting but tricky part: setting up ERPNext sites with a remote MariaDB (MySQL) server:
# run as root
bench setup sudoers erpnext
WILL ADD INFO HERE
Test-driving frappe-bench. There should be no errors, just nice output:
su -l erpnext
cd frappe-bench/
bench start
If everything is OK, kill the processes by executing this command a few times:
killall -9 python
Tuning frappe-bench:
su -l erpnext
cd frappe-bench/
bench config auto_update off
bench config dns_multitenant on
bench config serve_default_site on
bench config restart_supervisor_on_update on
bench config http_timeout 300
bench setup procfile
bench setup backups
bench setup auto-update
bench update
Setting up supervisor application to run ERPNext unattended in our fancy-dancy ERPNext High Availability / Failover Cluster...
In /var/www/erpnext/frappe-bench/config create two files:
su -l erpnext
cd frappe-bench/config
touch supervisor-gateway1.conf
touch supervisor-gateway2.conf
Add the following code to supervisor-gateway1.conf:
[program:frappe-web]
environment=SITES_PATH='/var/www/erpnext/frappe-bench/sites'
command=/var/www/erpnext/frappe-bench/env/bin/gunicorn -b 0.0.0.0:8000 --proxy-allow-from 127.0.0.1,192.168.2.1,192.168.2.2 --forwarded-allow-ips 127.0.0.1,192.168.2.1,192.168.2.2 -w 25 -t 120 --keep-alive 5 -n erpnext-gateway1 --proxy-protocol --access-logfile /var/www/erpnext/frappe-bench/logs/gateway1-web-gunicorn-access.log --error-logfile /var/www/erpnext/frappe-bench/logs/gateway1-web-gunicorn-error.log --log-level info frappe.app:application
autostart=true
autorestart=true
stopsignal=QUIT
stdout_logfile=/var/www/erpnext/frappe-bench/logs/gateway1-web.log
stderr_logfile=/var/www/erpnext/frappe-bench/logs/gateway1-web.error.log
user=erpnext
directory=/var/www/erpnext/frappe-bench/sites
[program:frappe-worker]
command=/var/www/erpnext/frappe-bench/env/bin/python -m frappe.celery_app worker -n erpnext-gateway1
autostart=true
autorestart=true
stopsignal=QUIT
stdout_logfile=/var/www/erpnext/frappe-bench/logs/gateway1-worker.log
stderr_logfile=/var/www/erpnext/frappe-bench/logs/gateway1-worker.error.log
user=erpnext
directory=/var/www/erpnext/frappe-bench/sites
[program:frappe-workerbeat]
command=/var/www/erpnext/frappe-bench/env/bin/python -m frappe.celery_app beat -s scheduler-gateway1.schedule --pidfile celerybeat-gateway1.pid
autostart=true
autorestart=true
stopsignal=QUIT
stdout_logfile=/var/www/erpnext/frappe-bench/logs/gateway1-workerbeat.log
stderr_logfile=/var/www/erpnext/frappe-bench/logs/gateway1-workerbeat.error.log
user=erpnext
directory=/var/www/erpnext/frappe-bench/sites
[group:frappe]
programs=frappe-web,frappe-worker,frappe-workerbeat
Add the same code to supervisor-gateway2.conf, but with "gateway1" changed to "gateway2":
[program:frappe-web]
environment=SITES_PATH='/var/www/erpnext/frappe-bench/sites'
command=/var/www/erpnext/frappe-bench/env/bin/gunicorn -b 0.0.0.0:8000 --proxy-allow-from 127.0.0.1,192.168.2.1,192.168.2.2 --forwarded-allow-ips 127.0.0.1,192.168.2.1,192.168.2.2 -w 25 -t 120 --keep-alive 5 -n erpnext-gateway2 --proxy-protocol --access-logfile /var/www/erpnext/frappe-bench/logs/gateway2-web-gunicorn-access.log --error-logfile /var/www/erpnext/frappe-bench/logs/gateway2-web-gunicorn-error.log --log-level info frappe.app:application
autostart=true
autorestart=true
stopsignal=QUIT
stdout_logfile=/var/www/erpnext/frappe-bench/logs/gateway2-web.log
stderr_logfile=/var/www/erpnext/frappe-bench/logs/gateway2-web.error.log
user=erpnext
directory=/var/www/erpnext/frappe-bench/sites
[program:frappe-worker]
command=/var/www/erpnext/frappe-bench/env/bin/python -m frappe.celery_app worker -n erpnext-gateway2
autostart=true
autorestart=true
stopsignal=QUIT
stdout_logfile=/var/www/erpnext/frappe-bench/logs/gateway2-worker.log
stderr_logfile=/var/www/erpnext/frappe-bench/logs/gateway2-worker.error.log
user=erpnext
directory=/var/www/erpnext/frappe-bench/sites
[program:frappe-workerbeat]
command=/var/www/erpnext/frappe-bench/env/bin/python -m frappe.celery_app beat -s scheduler-gateway2.schedule --pidfile celerybeat-gateway2.pid
autostart=true
autorestart=true
stopsignal=QUIT
stdout_logfile=/var/www/erpnext/frappe-bench/logs/gateway2-workerbeat.log
stderr_logfile=/var/www/erpnext/frappe-bench/logs/gateway2-workerbeat.error.log
user=erpnext
directory=/var/www/erpnext/frappe-bench/sites
[group:frappe]
programs=frappe-web,frappe-worker,frappe-workerbeat
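Since the two files differ only in the host label, the second one can be derived from the first instead of being edited by hand, which avoids the two drifting apart. A minimal sketch, assuming the file names used above; the function name derive_gateway2_conf and the CONF_DIR variable are illustrative.

```shell
#!/bin/sh
# Write a copy of SRC to DST with every "gateway1" replaced by "gateway2".
derive_gateway2_conf() {
    sed 's/gateway1/gateway2/g' "$1" > "$2"
}

CONF_DIR=${CONF_DIR:-/var/www/erpnext/frappe-bench/config}
if [ -f "$CONF_DIR/supervisor-gateway1.conf" ]; then
    derive_gateway2_conf "$CONF_DIR/supervisor-gateway1.conf" "$CONF_DIR/supervisor-gateway2.conf"
    # Show the differences; only the host label should change.
    diff "$CONF_DIR/supervisor-gateway1.conf" "$CONF_DIR/supervisor-gateway2.conf" || true
else
    echo "skipping: $CONF_DIR/supervisor-gateway1.conf not found"
fi
```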
Now symlink /var/www/erpnext/frappe-bench/config/supervisor-gateway1.conf to /etc/supervisor/conf.d/frappe.conf on GATEWAY:
exit # go back to root user
ln -s /var/www/erpnext/frappe-bench/config/supervisor-gateway1.conf /etc/supervisor/conf.d/frappe.conf
... and on GATEWAY2:
ln -s /var/www/erpnext/frappe-bench/config/supervisor-gateway2.conf /etc/supervisor/conf.d/frappe.conf
The above supervisor-gatewayX.conf configuration is tuned for large, powerful Dell PowerEdge G11 servers with lots of processing power and RAM; you might need to fine-tune this line (at least lower the worker count, -w) according to your environment:
command=/var/www/erpnext/frappe-bench/env/bin/gunicorn -b 0.0.0.0:8000 --proxy-allow-from 127.0.0.1,192.168.2.1,192.168.2.2 --forwarded-allow-ips 127.0.0.1,192.168.2.1,192.168.2.2 -w 25 -t 120 --keep-alive 5 -n erpnext-gatewayX --proxy-protocol --access-logfile /var/www/erpnext/frappe-bench/logs/gatewayX-web-gunicorn-access.log --error-logfile /var/www/erpnext/frappe-bench/logs/gatewayX-web-gunicorn-error.log --log-level info frappe.app:application
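For choosing the -w value, the Gunicorn documentation suggests (2 x number of cores) + 1 as a starting point. The sketch below computes that figure for the current machine; the function name suggest_workers is an illustrative assumption, and the result is only a first guess to be adjusted against available RAM and observed load.

```shell
#!/bin/sh
# Starting point suggested by the Gunicorn documentation: (2 x cores) + 1.
suggest_workers() {
    echo $(( 2 * $1 + 1 ))
}

# Number of online CPU cores; fall back to nproc if getconf lacks the variable.
cores=$(getconf _NPROCESSORS_ONLN 2>/dev/null || nproc)
echo "cores: $cores, suggested -w value: $(suggest_workers "$cores")"
```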
Restart supervisor on both servers, and check that the ERPNext (gunicorn) server is running:
service supervisor restart
htop -u erpnext
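Besides looking at the process list, supervisor itself can report the state of each program. The sketch below filters the output of "supervisorctl status" (one program per line, with the state in the second column) and lists everything that is not RUNNING. The function name not_running is an illustrative assumption.

```shell
#!/bin/sh
# Read "supervisorctl status" output on stdin; print programs not in the RUNNING state.
not_running() {
    awk 'NF >= 2 && $2 != "RUNNING" { print $1 }'
}

if command -v supervisorctl >/dev/null 2>&1; then
    bad=$(supervisorctl status | not_running)
    if [ -n "$bad" ]; then
        echo "not running: $bad"
    else
        echo "all supervised programs are RUNNING"
    fi
else
    echo "supervisorctl not found"
fi
```

Run it as root on both GATEWAY and GATEWAY2 after the restart.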
Making sure everything needed is installed:
WILL ADD INFO HERE
Editing the /etc/apache2 configuration ...
WILL ADD INFO HERE
Check if all needed services are set up to be run on server reboot/startup:
WILL ADD INFO HERE
If you notice any errors or omissions, or want to suggest anything to be added, please email me directly at "victor at devdesco.com".