Network configuration and administration is a complex topic. In this section, I mostly describe administrative tasks that are important yet unique to the Rocks cluster operating system. I also give some tips and tricks for troubleshooting the system and for accomplishing common tasks.
One of the most important things, which you cannot easily learn elsewhere (such as from the Rocks manual), is that Rocks uses a MySQL database to store all information related to network configuration. I learned this the hard way, because it is not clearly documented anywhere. It is therefore very important to always use rocks commands to administer networks, rather than the typical Linux commands: if you use the typical Linux commands, the changes are not stored in MySQL and will not survive compute node re-installation or system upgrades. For example, setting up a new network should be done with the rocks list/set host network commands, and setting up a firewall should be done with the rocks list/set host firewall commands.
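For example, a minimal sketch of inspecting and changing a compute node's private IP entirely through the Rocks database (the interface name and IP value here are just illustrations):
# show the interfaces that Rocks knows about for this node
rocks list host interface compute-0-0
# change the private IP in the Rocks database (illustrative value), then push the change out
rocks set host interface ip compute-0-0 iface=eth0 ip=10.1.1.250
rocks sync config
rocks sync host network compute-0-0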
In a typical Linux system, there are several configuration files for enabling networking on a host machine. These files include /etc/resolv.conf, which gives name server information; /etc/hosts, which gives host information (as hosts may not be in the name server); and /etc/sysconfig/network-scripts/ifcfg-eth0, which configures a network interface.
Typical configuration files
[kaiwang@biocluster ~]$ cat /etc/resolv.conf
search local med.usc.edu
nameserver 127.0.0.1
nameserver 128.125.253.143
nameserver 128.125.7.23
However, the user-defined hosts are stored in the /etc/hosts.local file, and rocks sync config and rocks sync network will copy this to all cluster nodes. In other words, you should never edit the /etc/hosts file yourself, because any edits will not be preserved: this file is generated automatically by Rocks whenever the system refreshes its configuration files or restarts. Instead, to define new hosts, you must edit the /etc/hosts.local file. I recommend that you put the IP addresses for IPMI into this file (technically Rocks can handle IPMI by DHCP, but it rarely works well, so it is best to set up static IPs for the IPMI ports and put these IPs into this file).
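For example, adding static IPMI entries might look like this (the addresses and host names are placeholders):
# append IPMI entries to the local hosts file
cat >> /etc/hosts.local << 'EOF'
10.1.2.1   ipmi-compute-0-0
10.1.2.2   ipmi-compute-0-1
EOF
# regenerate /etc/hosts and propagate it to the cluster
rocks sync config
rocks sync network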
Similarly, the configuration files below are also automatically generated by Rocks and you should not edit them, as any edit will not survive a refresh of the Rocks configuration or a restart.
[root@biocluster ~]# cat /etc/sysconfig/network-scripts/ifcfg-eth0
DEVICE=eth0
HWADDR=00:25:90:09:E6:66
IPADDR=10.1.1.1
NETMASK=255.255.255.0
BOOTPROTO=none
ONBOOT=yes
MTU=1500
[root@biocluster ~]# cat /etc/sysconfig/network-scripts/ifcfg-ib0
DEVICE=ib0
HWADDR=80:00:00:48:FE:80:00:00:00:00:00:00:00:02:C9:03:00:0E:CF:FF
IPADDR=192.168.1.1
NETMASK=255.255.255.0
BOOTPROTO=none
ONBOOT=yes
MTU=65520
Finally, do not set up iptables (/etc/sysconfig/iptables) yourself, for the same reasons. See more details later in this article.
The following description only applies to version 6. It seems that Rocks 7 automatically handles these types of issues.
By default, if an InfiniBand card is already present in the system, then IB-related packages (such as opensm, ibutils, and drivers) will be installed automatically by Rocks. However, occasionally this is not the case, and sometimes you may want to add IB to an existing Rocks installation. How shall we address this?
I found that simply doing yum install openib ibutils infiniband-diags opensm does not solve the problem per se. I still get the error umad_init: can't read ABI version from /sys/class/infiniband_mad/abi_version if I run service openibd restart; service opensmd restart followed by ibhosts. This occurs on some systems, but not others.
To address this problem, I had to download OFED version 3.4-1.0.0.0 (which is suitable for CentOS 6.6) from http://www.mellanox.com/page/mlnx_ofed_matrix?mtag=linux_sw_drivers. Install it on the head node first. Then do mkdir -p /export/rocks/install/contrib/extra/install and copy the downloaded file into this directory. Then edit /export/rocks/install/site-profiles/6.2/nodes/extend-compute.xml and add the following lines to the <post> </post> section:
echo 'Installing OFED...'
cd /root
wget http://127.0.0.1/install/contrib/extra/install/MLNX_OFED_LINUX-3.4-1.0.0.0-rhel6.6-x86_64.tgz 2> /root/temp1
ls -l > /root/temp2
tar xvfz MLNX_OFED_LINUX-3.4-1.0.0.0-rhel6.6-x86_64.tgz 2> /root/temp3 > /root/temp3.5
echo y | ./MLNX_OFED_LINUX-3.4-1.0.0.0-rhel6.6-x86_64/mlnxofedinstall 2> /root/temp4 > /root/temp4.5
/etc/init.d/openibd restart 2> /root/temp5 > /root/temp5.5
cd /
Recently, I verified that MLNX_OFED version 4.0 also works okay:
echo 'Installing OFED...'
cd /root
wget http://127.0.0.1/install/contrib/extra/install/MLNX_OFED_LINUX-4.0-2.0.0.1-rhel6.6-x86_64.tgz 2> /root/temp1
ls -l > /root/temp2
tar xvfz MLNX_OFED_LINUX-4.0-2.0.0.1-rhel6.6-x86_64.tgz 2> /root/temp3 > /root/temp3.5
echo y | ./MLNX_OFED_LINUX-4.0-2.0.0.1-rhel6.6-x86_64/mlnxofedinstall 2> /root/temp4 > /root/temp4.5
/etc/init.d/openibd restart 2> /root/temp5 > /root/temp5.5
cd /
The redirection of stderr and stdout is to help diagnose potential problems if the installation fails. From the head node, you may also do rocks list host profile compute-0-0 > /dev/null to check whether there is any error message in compiling this XML file into the kickstart script.
Then run cd /export/rocks/install; rocks create distro, and ssh into each compute node and reinstall it with /boot/kickstart/cluster-kickstart.
It is important to change the MTU for the IB network, since by default it is set to 1500 (the default value for ethernet), which does not give optimal performance for IB traffic.
rocks set network mtu ipoib 65520
This should set the MTU for all hosts in the system to 65520.
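To push the new MTU out to the nodes, I assume the usual sync commands apply (a sketch; alternatively, simply reinstall the compute nodes):
rocks sync config
rocks sync host network compute-0-0   # repeat for each compute node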
The iblinkinfo command is perhaps the most commonly used command to check IB status. It tells you which hosts are actively connected, what the speed of each connection is, and which hosts have trouble linking up. The ibhosts and ibswitches commands can be used to check the hosts and switches in the network.
[root@biocluster ~]# ibhosts
Ca : 0x0002c903000cc172 ports 1 "nas-0-0 mlx4_0"
Ca : 0xf452140300ee07b0 ports 1 "compute-0-0 mlx4_0"
Ca : 0xf452140300ee0798 ports 1 "compute-0-1 mlx4_0"
Ca : 0xf452140300ee0814 ports 1 "compute-0-2 mlx4_0"
Ca : 0xf452140300ee082c ports 1 "compute-0-3 mlx4_0"
Ca : 0x0002c90300505e94 ports 1 "biocluster mlx4_0"
[root@biocluster ~]# ibswitches
Switch : 0x0002c90200422f90 ports 36 "Infiniscale-IV Mellanox Technologies" base port 0 lid 30 lmc 0
First run qperf without arguments on nas-0-0, then run:
[kaiwang@biocluster ~]$ qperf nas-0-0-ib tcp_bw tcp_lat udp_bw udp_lat ud_bw ud_lat rc_bi_bw sdp_bw
[kaiwang@biocluster ~]$ qperf nas-0-0 tcp_bw tcp_lat udp_bw udp_lat ud_bw ud_lat rc_bi_bw sdp_bw
There is a clear difference in tcp_bw between the two (1.49 GB/s over IB versus 119 MB/s over ethernet), but the latency over IB is not great.
First run ibv_rc_pingpong on biocluster, then run:
[kaiwang@compute-0-10 ~]$ ibv_rc_pingpong biocluster
local address: LID 0x000e, QPN 0x48004b, PSN 0x74f5ba, GID ::
remote address: LID 0x0003, QPN 0x74004b, PSN 0xbaa1c2, GID ::
8192000 bytes in 0.01 seconds = 5671.66 Mbit/sec
1000 iters in 0.01 seconds = 11.56 usec/iter
Similarly, ibv_uc_pingpong, ibv_ud_pingpong, and ibv_srq_pingpong can be used. The ibv_srq_pingpong is very fast, presumably because it uses 16 different queue pairs by default.
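As with ibv_rc_pingpong above, the usage is server first, then client; a sketch for the SRQ variant:
# on biocluster (server side)
ibv_srq_pingpong
# on a compute node (client side), connect to the server
ibv_srq_pingpong biocluster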
First run ibstat on biocluster to find the port GUID, then run ibping -S on biocluster as root, then run ibping -G 0x0002c903000ecfffa on compute-0-0.
Network security is very important, even for an academic research computing cluster. The combined use of DenyHosts and an appropriate firewall has helped our cluster withstand multiple external attacks.
yum --enablerepo=epel install denyhosts
service denyhosts restart
chkconfig denyhosts on
Edit the /etc/denyhosts.conf file for configuration changes. The file should be self-explanatory to edit.
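A few settings worth reviewing, as a sketch (the parameter names come from the stock configuration file; the values are only examples):
# /etc/denyhosts.conf (excerpt)
# the log file that DenyHosts scans on CentOS
SECURE_LOG = /var/log/secure
# block ssh only, rather than ALL services
BLOCK_SERVICE = sshd
# ban a host after a single failed root login attempt
DENY_THRESHOLD_ROOT = 1
# where ban notifications are sent
ADMIN_EMAIL = root@localhost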
Some useful guidelines can be found online. However, as I have emphasized, you should not edit iptables yourself; always use the Rocks commands to make modifications, to ensure the longevity of the network configuration.
rocks list host firewall compute-0-0
RULENAME SERVICE PROTOCOL ACTION CHAIN NETWORK OUTPUT-NETWORK FLAGS COMMENT CATEGORY
A15-ALL-LOCAL all all ACCEPT INPUT ------- -------------- -i lo ------- global
A20-ALL-PRIVATE all all ACCEPT INPUT private -------------- ------------------------ ------- global
A20-SSH-PUBLIC ssh tcp ACCEPT INPUT public -------------- -m state --state NEW ------- global
A30-RELATED-PUBLIC all all ACCEPT INPUT public -------------- -m state --state RELATED ------- global
R900-PRIVILEGED-TCP all tcp REJECT INPUT public -------------- --dport 0:1023 ------- global
R900-PRIVILEGED-UDP all udp REJECT INPUT public -------------- --dport 0:1023 ------- global
rocks add firewall host=compute-0-0 network=public2 service=50030 chain=INPUT action=ACCEPT rulename=A40-PUBLIC-HADOOP-50030 protocol=tcp
rocks sync host firewall compute-0-0
This command opens port 50030 on a particular compute node (which, by the way, serves as the Hadoop head node). Note that the protocol must be tcp, not all, for the command to work.
rocks remove firewall host=compute-0-0 rulename=A40-PUBLIC-HADOOP-50030
You must specify the protocol (as tcp or udp) for the command to work. Otherwise, the --dport argument cannot be recognized (this is not a problem with the rocks add firewall command itself). To understand this better, see below:
[root@biocluster /home/kaiwang]$ iptables -A INPUT -p all -i eth1 --dport 50030 -j ACCEPT
iptables v1.4.7: unknown option `--dport'
Try `iptables -h' or 'iptables --help' for more information.
[root@biocluster /home/kaiwang]$ iptables -A INPUT -p tcp -i eth1 --dport 50030 -j ACCEPT
This is useful when you set up a web service on one compute node (such as compute-0-0), but want to access this service from the outside world through the frontend using a specific port such as 8010.
In general, I do not recommend doing this for a variety of security reasons. Even if you want to do this, it is perhaps best to use iptables temporarily, without saving such rules to the firewall configuration.
The following two commands can do this on the frontend:
iptables -t nat -A PREROUTING -p tcp -i eth1 --dport 8010 -j DNAT --to-destination 10.1.1.253:8010
iptables -A FORWARD -p tcp -d 10.1.1.253 --dport 8010 -m state --state NEW,ESTABLISHED,RELATED -j ACCEPT
To allow LAN nodes with private IP addresses to communicate with external public networks, configure the firewall for IP masquerading, which masks requests from LAN nodes with the IP address of the firewall's external device (in this case, eth1):
iptables -t nat -A POSTROUTING -o eth1 -j MASQUERADE
The rule uses the NAT packet matching table (-t nat) and specifies the built-in POSTROUTING chain for NAT (-A POSTROUTING) on the firewall's external networking device (-o eth1). Note that eth1 could be replaced by the actual name of the public interface, such as eno2.
If this works out well, try to add the rule permanently using rocks add firewall.
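I have not verified this exact invocation, but judging from the parameters that rocks add firewall accepts (including table= and output-network=), the persistent version of the masquerading rule would look roughly like this:
rocks add firewall host=biocluster table=nat chain=POSTROUTING action=MASQUERADE service=all protocol=all output-network=public rulename=A40-MASQUERADE
rocks sync host firewall biocluster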
If the masquerading does not work, it is possible that IP forwarding is disabled in the kernel. See below for an example. Consider adding it to rc.d so it is enabled after a system restart.
[root@biocluster wangk]# cat /proc/sys/net/ipv4/ip_forward
0
[root@biocluster wangk]# echo "1" > /proc/sys/net/ipv4/ip_forward
[root@biocluster wangk]# cat /proc/sys/net/ipv4/ip_forward
1
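To make this setting survive a reboot, the rc.d approach mentioned above works; on CentOS 6 the sysctl configuration file is another common option (a sketch, assuming the net.ipv4.ip_forward key is already present in /etc/sysctl.conf, as it is by default):
# option 1: flip the existing key in /etc/sysctl.conf and reload it
sed -i 's/^net.ipv4.ip_forward.*/net.ipv4.ip_forward = 1/' /etc/sysctl.conf
sysctl -p
# option 2: re-apply the echo at boot time via rc.local
echo 'echo 1 > /proc/sys/net/ipv4/ip_forward' >> /etc/rc.d/rc.local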
It could be as simple as this:
[root@biocluster /]$ rocks add host interface compute-0-0 iface=eth1 subnet=public2 name=compute-0-0-public ip=128.125.248.xxx
However, in my case, since the new IP is not in the same subnet as the IP of the frontend, I had to create a new subnet called public2 to make this work:
[root@biocluster /]$ rocks add network public2 128.125.248.0 255.255.254.0
[root@biocluster /]$ rocks list network
NETWORK SUBNET NETMASK MTU DNSZONE SERVEDNS
ipoib: 192.168.1.0 255.255.255.0 65520 ipoib True
private: 10.1.1.0 255.255.255.0 1500 local True
public: 68.181.163.xxx 255.255.255.128 1500 med.usc.edu False
public2: 128.125.248.xxx 255.255.254.0 1500 public2 False
[root@biocluster /]$ rocks remove host interface compute-0-0 eth1
[root@biocluster /]$ rocks add host interface compute-0-0 iface=eth1 subnet=public2 name=compute-0-0-public ip=128.125.248.xxx
A further complication is that, typically, the default gateway for any network is the head node. However, the head node may have iptables rules that prevent other nodes from accessing the internet. In these cases, one needs to change the default gateway for eth1 to something else.
Use rocks list host route compute-0-0 to check the current gateway. It is clear that the head node (10.1.1.1) is the default:
bioinform2: 68.181.163.xxx 255.255.255.255 10.1.1.1 G
Now do this:
[root@biocluster /]$ rocks add host route compute-0-0 0.0.0.0 128.125.249.xxx netmask=0.0.0.0
[root@biocluster /]$ rocks sync host network compute-0-0
Note that 128.125.249.xxx is the gateway address, which needs to be specified correctly. If you do not know the gateway, check the head node's gateway using the route command. This information was by design not included in the subnet setup (so rocks list network will not show it), and needs to be provided by the user instead.
Now compute-0-0 can access the internet directly through eth1. If not, try doing it manually on compute-0-0 with route add default gw 10.30.10.254.
To remove a route (sometimes you may want to change a route, but in my experience there is no rocks set host route command, so you have to remove and re-add the route), use rocks remove host route compute-0-0 address=0.0.0.0. Other syntax does not work.
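So changing the default route on a node amounts to removing it and adding it back, reusing the commands shown above:
rocks remove host route compute-0-0 address=0.0.0.0
rocks add host route compute-0-0 0.0.0.0 128.125.249.xxx netmask=0.0.0.0
rocks sync host network compute-0-0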
In 2020, I found that after adding a public interface to a compute node, I still could not ssh into this node directly from the outside world. I believe it is a routing issue that binds the default route to eno1 (the first ethernet adapter in this particular machine). Initially I followed the manual but still could not solve this issue with Rocks commands. Instead, I first ran route del default to delete the existing external route that goes through eno1, then route add default gw 10.30.10.254 eno2 (note that 10.30.10.254 is also the default gateway of the head node). The results are:
[root@compute-0-0 ~]# route
Kernel IP routing table
Destination Gateway Genmask Flags Metric Ref Use Iface
default gateway 0.0.0.0 UG 0 0 0 eno2
default 0.0.0.0 0.0.0.0 U 0 0 0 eno1
10.1.0.0 0.0.0.0 255.255.0.0 U 0 0 0 eno1
10.30.10.0 0.0.0.0 255.255.255.0 U 0 0 0 eno2
biocluster biocluster.loca 255.255.255.255 UGH 0 0 0 eno1
link-local 0.0.0.0 255.255.0.0 U 1002 0 0 eno1
link-local 0.0.0.0 255.255.0.0 U 1003 0 0 eno2
link-local 0.0.0.0 255.255.0.0 U 1004 0 0 ib0
192.168.0.0 0.0.0.0 255.255.0.0 U 0 0 0 ib0
224.0.0.0 0.0.0.0 255.255.255.0 U 0 0 0 eno1
255.255.255.255 0.0.0.0 255.255.255.255 UH 0 0 0 eno1
Now I can ping google.com correctly, and I can log into this machine remotely and directly (this is the biocluster2 machine). However, this is not a perfect solution, because whenever compute-0-0 is reinstalled, this change will be lost.
I finally solved the problem by adding routes to the gateway IP for both 0.0.0.0 and default (if I set up only one of them, it does not work).
[root@biocluster ~]# rocks add host route compute-0-0 0.0.0.0 10.30.10.254 netmask=0.0.0.0
[root@biocluster ~]# rocks add host route compute-0-0 default 10.30.10.254 netmask=0.0.0.0
[root@biocluster ~]# rocks sync host network compute-0-0
Using the route command, I can see that default is bound to eno2 (the second ethernet interface). Using traceroute google.com on compute-0-0, I can see that the network traffic does not go through the head node. So this is a somewhat awkward solution, but it works in the end. (No other method worked, although I tried many times after reading through the manual and googling for a long time.)
Instead of IB, you can also install 10G Ethernet. This allows fast access to the Internet for uploading and downloading files.
The hardware requirements depend very much on your specific network setup. In my case, there is a 10G fiber switch in the server room, so we only need three things: (1) a fiber cable (for example, we use a BELKIN 5M LC/LC OM3 50/125 fiber); (2) a 10G ethernet card (for example, Intel X520-DA2); (3) a 10G-SR transceiver (for example, Intel E10GSFPSR). Basically, we connect the fiber cable to the transceiver, which plugs into the SFP+ port on the ethernet card.
[root@biocluster ~]# arp -a | sort
Use ethtool -p eth0 to blink the eth0 port, to help identify which physical port is eth0. This is important, since when inserting a new ethernet card into the system, the new card may be treated as eth0/1, so the built-in ethernet ports become eth2/3.
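Besides blinking the port light, the driver name reported by ethtool -i usually tells you which physical card an interface belongs to; here I assume the Intel X520 shows up with the ixgbe driver and the onboard ports with something like igb:
ethtool -i eth0 | grep '^driver'    # e.g. "driver: ixgbe" would be the 10G card
ethtool -i eth2 | grep '^driver'    # e.g. "driver: igb" would be an onboard 1G port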
Use the nmap -sP 192.168.1.* command, which basically pings all hosts in the subnet.
You can check the current link speed with the ethtool eth0 command, as shown below:
[root@biocluster /]$ ethtool eth0
Settings for eth0:
Supported ports: [ TP ]
Supported link modes: 10baseT/Half 10baseT/Full
100baseT/Half 100baseT/Full
1000baseT/Full
Supported pause frame use: Symmetric
Supports auto-negotiation: Yes
Advertised link modes: 10baseT/Half 10baseT/Full
100baseT/Half 100baseT/Full
1000baseT/Full
Advertised pause frame use: Symmetric
Advertised auto-negotiation: Yes
Speed: 100Mb/s
Duplex: Full
Port: Twisted Pair
PHYAD: 1
Transceiver: internal
Auto-negotiation: on
MDI-X: off (auto)
Supports Wake-on: pumbg
Wake-on: g
Current message level: 0x00000007 (7)
drv probe link
Link detected: yes
A link speed of only 100Mb/s (as shown above) is almost always caused by a low-quality cable. Replace it with a better CAT6 cable to solve the issue.
Depending on your motherboard, eth0/eth1 may be called em1/em2 instead, for example on Dell C6100 servers.
Sometimes the MAC address recorded for the IB card does not match what the IB programs detect, resulting in configuration errors. This is very difficult to diagnose in my experience. The following commands can be used to check for this problem and to fix it by manually setting the MAC address.
[root@ ~]# ssh compute-0-0 ifdown ib0
Device ib0 has MAC address A0:00:02:20:FE:80:00:00:00:00:00:00:00:02:C9:03:00:0E:50:E9, instead of configured address 80:00:00:48:FE:80:00:00:00:00:00:00:00:02:C9:03:00:0E:50:E9. Ignoring.
[root@ ~]# rocks set host interface mac compute-0-0 iface=ib0 mac=A0:00:02:20:FE:80:00:00:00:00:00:00:00:02:C9:03:00:0E:50:E9
[root@ ~]# rocks sync host network compute-0-0