Network administration

Introduction

Network configuration and administration is a complex topic. In this section, I mostly describe administrative tasks that are important yet unique to the Rocks cluster operating system. I also give some tips and tricks for troubleshooting the system and for accomplishing common tasks.

One of the most important things, which you cannot easily learn from other places such as the Rocks manual, is that Rocks uses a MySQL database to store all information related to network configuration. I learned this the hard way, because it is not clearly documented anywhere. It is therefore very important to always use rocks commands to administer networks, rather than typical Linux commands: if you use a typical Linux command, the changes are not stored in MySQL and will not survive compute node re-installation or system upgrades. For example, setting up a new network should be done with the rocks list/set host network commands, and setting up the firewall should be done with the rocks list/set host firewall commands. A sketch of this workflow is given below.
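The pattern is: inspect settings with rocks list, change them with rocks set/add/remove, then push the result out with rocks sync. A minimal sketch, using only commands that appear elsewhere in this article (compute-0-0 is just an example host):

rocks list host interface compute-0-0    # interfaces as stored in the database
rocks list host firewall compute-0-0     # firewall rules as stored in the database
# ... make changes with the corresponding rocks set/add/remove commands ...
rocks sync host network compute-0-0      # regenerate and push network config
rocks sync host firewall compute-0-0     # regenerate and push iptables rules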

Configuration files

In a typical Linux system, there are several configuration files for enabling networking on a host machine. These files include /etc/resolv.conf, which gives name server information; /etc/hosts, which gives host information (for hosts that may not be in the name server); and /etc/sysconfig/network-scripts/ifcfg-eth0, which configures a network interface.

Typical configuration files

[kaiwang@biocluster ~]$ cat /etc/resolv.conf 
search local med.usc.edu
nameserver 127.0.0.1
nameserver 128.125.253.143
nameserver 128.125.7.23

However, the user-defined hosts are stored in the /etc/hosts.local file, and rocks sync config and rocks sync network will copy this information to all cluster nodes. In other words, you should never edit the /etc/hosts file yourself: any edits will not be preserved, because this file is regenerated automatically by Rocks whenever the system refreshes its configuration files or restarts. Instead, to define new hosts, you must edit the /etc/hosts.local file. I recommend that you put the IP addresses for IPMI into this file (technically Rocks has the ability to handle IPMI by DHCP, but it rarely works well, so it is best to set up static IPs for the IPMI ports and put these IPs into the hosts file), as in the sketch below.
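A sketch of what /etc/hosts.local might look like (the addresses and host names below are hypothetical):

# static entries for IPMI ports; Rocks merges these into the generated /etc/hosts
10.1.2.1     compute-0-0-ipmi
10.1.2.2     compute-0-1-ipmi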

Similarly, the configuration files below are also automatically generated by Rocks, and you should not edit them, as any edit will not survive a refresh of the Rocks configuration or a restart.

[root@biocluster ~]# cat /etc/sysconfig/network-scripts/ifcfg-eth0 
DEVICE=eth0
HWADDR=00:25:90:09:E6:66
IPADDR=10.1.1.1
NETMASK=255.255.255.0
BOOTPROTO=none
ONBOOT=yes
MTU=1500
[root@biocluster ~]# cat /etc/sysconfig/network-scripts/ifcfg-ib0  
DEVICE=ib0
HWADDR=80:00:00:48:FE:80:00:00:00:00:00:00:00:02:C9:03:00:0E:CF:FF
IPADDR=192.168.1.1
NETMASK=255.255.255.0
BOOTPROTO=none
ONBOOT=yes
MTU=65520

Finally, do not set up iptables (/etc/sysconfig/iptables) yourself, for the same reasons. See more details later in this article.

Adding infiniband to a system

The following description only applies to Rocks 6. It seems that Rocks 7 automatically handles these types of issues.

By default, if an infiniband card is already present in the system, then IB-related packages (such as opensm and ibutils, as well as drivers) will be installed automatically by Rocks. However, occasionally this is not the case, and sometimes you may want to add IB to an existing Rocks installation. How shall we address this?

I found that simply doing yum install openib ibutils infiniband-diags opensm does not solve the problem per se. I still get the error umad_init: can't read ABI version from /sys/class/infiniband_mad/abi_version if I run service openibd restart; service opensmd restart followed by ibhosts. This occurs on some systems, but not others.

To address this problem, I had to download OFED version 3.4-1.0.0.0 (which is suitable for CentOS 6.6) from http://www.mellanox.com/page/mlnx_ofed_matrix?mtag=linux_sw_drivers. Install it on the head node first. Then do mkdir -p /export/rocks/install/contrib/extra/install and copy the downloaded tarball into this directory. Then edit /export/rocks/install/site-profiles/6.2/nodes/extend-compute.xml and add the following lines to the <post> </post> section (a sketch of the surrounding file layout follows):

echo 'Installing OFED...'
cd /root
wget http://127.0.0.1/install/contrib/extra/install/MLNX_OFED_LINUX-3.4-1.0.0.0-rhel6.6-x86_64.tgz 2> /root/temp1
ls -l > /root/temp2
tar xvfz MLNX_OFED_LINUX-3.4-1.0.0.0-rhel6.6-x86_64.tgz 2> /root/temp3 > /root/temp3.5
echo y | ./MLNX_OFED_LINUX-3.4-1.0.0.0-rhel6.6-x86_64/mlnxofedinstall 2> /root/temp4 > /root/temp4.5
/etc/init.d/openibd restart 2> /root/temp5 > /root/temp5.5
cd /
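For context, this is roughly how those lines sit inside extend-compute.xml (a sketch based on the skeleton file that Rocks provides; only the <post> section is shown filled in):

<?xml version="1.0" standalone="no"?>
<kickstart>
  <post>
    <!-- the OFED installation commands above go here -->
  </post>
</kickstart>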

Recently, I verified that MLNX_OFED version 4.0 also works okay:

echo 'Installing OFED...'
cd /root
wget http://127.0.0.1/install/contrib/extra/install/MLNX_OFED_LINUX-4.0-2.0.0.1-rhel6.6-x86_64.tgz 2> /root/temp1
ls -l > /root/temp2
tar xvfz MLNX_OFED_LINUX-4.0-2.0.0.1-rhel6.6-x86_64.tgz 2> /root/temp3 > /root/temp3.5
echo y | ./MLNX_OFED_LINUX-4.0-2.0.0.1-rhel6.6-x86_64/mlnxofedinstall 2> /root/temp4 > /root/temp4.5
/etc/init.d/openibd restart 2> /root/temp5 > /root/temp5.5
cd /

The redirection of stderr and stdout is there to help diagnose potential problems if the installation fails. From the head node, you may also run rocks list host profile compute-0-0 > /dev/null to check whether there is any error message when compiling this xml file into the kickstart script.

Then cd /export/rocks/install; rocks create distro, and ssh into each compute node and reinstall it with /boot/kickstart/cluster-kickstart; the full sequence is sketched below.
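The full sequence from the head node might look like this (the rocks run host line is an assumption, used here as a convenience to avoid ssh-ing into each node by hand):

cd /export/rocks/install
rocks create distro
# reinstall every compute appliance; each node reboots into the installer
rocks run host compute command="/boot/kickstart/cluster-kickstart"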

Infiniband specific configuration and commands

Change MTU for IB card for the cluster

It is important to change the MTU for the IB network, since by default it is set to 1500 (the default value for ethernet), which does not result in optimal performance for IB traffic.

rocks set network mtu ipoib 65520

This should set the MTU for all hosts in the system to 65520; see the follow-up steps sketched below.
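Presumably the change then needs to be pushed out and verified per node; the steps below follow the general inspect/sync pattern of this article and are a sketch rather than a tested recipe:

rocks sync host network compute-0-0
ssh compute-0-0 ip link show ib0    # the reported mtu should now be 65520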

Check IB status

The iblinkinfo command is perhaps the most commonly used command to check IB status. It tells you which hosts are actively connected, what the speed of each connection is, and which hosts have trouble linking up. The ibhosts and ibswitches commands can be used to check the hosts and switches in the network.

[root@biocluster ~]# ibhosts 
Ca      : 0x0002c903000cc172 ports 1 "nas-0-0 mlx4_0"
Ca      : 0xf452140300ee07b0 ports 1 "compute-0-0 mlx4_0"
Ca      : 0xf452140300ee0798 ports 1 "compute-0-1 mlx4_0"
Ca      : 0xf452140300ee0814 ports 1 "compute-0-2 mlx4_0"
Ca      : 0xf452140300ee082c ports 1 "compute-0-3 mlx4_0"
Ca      : 0x0002c90300505e94 ports 1 "biocluster mlx4_0"
[root@biocluster ~]# ibswitches 
Switch  : 0x0002c90200422f90 ports 36 "Infiniscale-IV Mellanox Technologies" base port 0 lid 30 lmc 0

Measure network bandwidth and latency

First run qperf without arguments on nas-0-0, then run:

[kaiwang@biocluster ~]$ qperf nas-0-0-ib tcp_bw tcp_lat udp_bw udp_lat ud_bw ud_lat rc_bi_bw sdp_bw
[kaiwang@biocluster ~]$ qperf nas-0-0 tcp_bw tcp_lat udp_bw udp_lat ud_bw ud_lat rc_bi_bw sdp_bw

There is a clear difference in tcp_bw (1.49 GB/s over IB versus 119 MB/s over ethernet), but the latency results for IB are not as impressive.

Measure infiniband status

First run ibv_rc_pingpong on biocluster, then run:

[kaiwang@compute-0-10 ~]$ ibv_rc_pingpong biocluster   
  local address:  LID 0x000e, QPN 0x48004b, PSN 0x74f5ba, GID ::
  remote address: LID 0x0003, QPN 0x74004b, PSN 0xbaa1c2, GID ::
8192000 bytes in 0.01 seconds = 5671.66 Mbit/sec
1000 iters in 0.01 seconds = 11.56 usec/iter

Similarly, ibv_uc_pingpong, ibv_ud_pingpong and ibv_srq_pingpong can be used. The ibv_srq_pingpong is very fast, as it presumably uses 16 different queue pairs by default.

ibping

First run ibstat on biocluster to find the port GUID, then run ibping -S on biocluster as root, then run ibping -G 0x0002c903000ecfffa on compute-0-0; see the sketch below.
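The same sequence as commands, with the prompts indicating where each command runs:

[root@biocluster ~]# ibstat                           # find the port GUID
[root@biocluster ~]# ibping -S                        # start the ibping server
[root@compute-0-0 ~]# ibping -G 0x0002c903000ecfffa   # ping the server by GUID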

Network security

Network security is very important, even for an academic research computing cluster. The combined use of DenyHosts and an appropriate firewall has helped our cluster withstand multiple external attacks.

DenyHosts to prevent brute-force login attacks

yum --enablerepo=epel install denyhosts
service denyhosts restart
chkconfig denyhosts on

Edit the /etc/denyhosts.conf file for configuration changes. The file should be self-evident to edit; a few commonly tuned settings are highlighted below.
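A few settings that are commonly tuned (the values below are illustrative, not recommendations):

# /etc/denyhosts.conf
DENY_THRESHOLD_INVALID = 5     # failed attempts allowed for invalid accounts
DENY_THRESHOLD_VALID = 10      # failed attempts allowed for valid accounts
ADMIN_EMAIL = root@localhost   # where DenyHosts reports are sent
PURGE_DENY = 4w                # unblock offending hosts after four weeks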

firewall and iptables

Useful guidelines are available elsewhere. However, as I have emphasized, you should not edit iptables yourself; always use the Rocks commands to make modifications, to ensure the longevity of network configurations.

Adding firewall

rocks list host firewall compute-0-0
RULENAME            SERVICE PROTOCOL ACTION CHAIN NETWORK OUTPUT-NETWORK FLAGS                    COMMENT CATEGORY
A15-ALL-LOCAL       all     all      ACCEPT INPUT ------- -------------- -i lo                    ------- global  
A20-ALL-PRIVATE     all     all      ACCEPT INPUT private -------------- ------------------------ ------- global  
A20-SSH-PUBLIC      ssh     tcp      ACCEPT INPUT public  -------------- -m state --state NEW     ------- global  
A30-RELATED-PUBLIC  all     all      ACCEPT INPUT public  -------------- -m state --state RELATED ------- global  
R900-PRIVILEGED-TCP all     tcp      REJECT INPUT public  -------------- --dport 0:1023           ------- global  
R900-PRIVILEGED-UDP all     udp      REJECT INPUT public  -------------- --dport 0:1023           ------- global  

rocks add firewall host=compute-0-0 network=public2 service=50030 chain=INPUT action=ACCEPT rulename=A40-PUBLIC-HADOOP-50030 protocol=tcp
rocks sync host firewall compute-0-0

This command opens port 50030 on a particular compute node (which, by the way, serves as the Hadoop head node).

Note that the protocol must be tcp, not all, for the command to work. A quick way to verify the rule on the node is shown below.
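To confirm that the rule actually landed in the running firewall, a quick read-only check with plain iptables on compute-0-0:

iptables -L INPUT -n | grep 50030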

Removing firewall

rocks remove firewall host=compute-0-0 rulename=A40-PUBLIC-HADOOP-50030

iptables arguments

You must specify the protocol (as tcp or udp) for the command to work. Otherwise, the --dport argument is not recognized (this is a limitation of iptables itself, not of the rocks add firewall command). To see this directly:

[root@biocluster /home/kaiwang]$ iptables -A INPUT -p all -i eth1 --dport 50030 -j ACCEPT
iptables v1.4.7: unknown option `--dport'
Try `iptables -h' or 'iptables --help' for more information.
[root@biocluster /home/kaiwang]$ iptables -A INPUT -p tcp -i eth1 --dport 50030 -j ACCEPT

Port forwarding

This is useful when you set up a web service on one compute node (such as compute-0-0), but want to access this service from the outside world through the frontend, using a specific port such as 8010.

In general, I do not recommend doing this, for a variety of security reasons. Even if you want to do this, it is perhaps best to just use iptables temporarily, without saving such rules to the firewall (a matching cleanup is sketched after the commands).

The following two commands can do this on the frontend:

iptables -t nat -A PREROUTING -p tcp -i eth1 --dport 8010 -j DNAT --to-destination 10.1.1.253:8010
iptables -A FORWARD -p tcp -d 10.1.1.253 --dport 8010 -m state --state NEW,ESTABLISHED,RELATED -j ACCEPT
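Since the advice above is to keep these rules temporary, the matching cleanup is the same two commands with -D (delete) instead of -A (append); this is standard iptables behavior:

iptables -t nat -D PREROUTING -p tcp -i eth1 --dport 8010 -j DNAT --to-destination 10.1.1.253:8010
iptables -D FORWARD -p tcp -d 10.1.1.253 --dport 8010 -m state --state NEW,ESTABLISHED,RELATED -j ACCEPT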

Allow node to access outside world

To allow LAN nodes with private IP addresses to communicate with external public networks, configure the firewall for IP masquerading, which masks requests from LAN nodes with the IP address of the firewall's external device (in this case, eth1):

iptables -t nat -A POSTROUTING -o eth1 -j MASQUERADE 

The rule uses the NAT packet matching table (-t nat) and specifies the built-in POSTROUTING chain for NAT (-A POSTROUTING) on the firewall's external networking device (-o eth1). Note that eth1 should be replaced by the actual name of the public interface, such as eno2.

If this works out well, try to add the command using rocks add firewall.

If this does not work, it is possible that the kernel has IP forwarding disabled. See below for an example. Consider adding the command to rc.d so forwarding is re-enabled after a system restart, or use the sysctl approach sketched after the example.

[root@biocluster wangk]# cat /proc/sys/net/ipv4/ip_forward
0
[root@biocluster wangk]# echo "1" > /proc/sys/net/ipv4/ip_forward
[root@biocluster wangk]# cat /proc/sys/net/ipv4/ip_forward
1
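A more permanent alternative to rc.d is the standard sysctl configuration; this is general CentOS practice rather than a Rocks-specific step:

echo "net.ipv4.ip_forward = 1" >> /etc/sysctl.conf
sysctl -p    # reload, so the setting takes effect immediately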

Adding an ethernet adapter on a compute node to access the Internet

It could be as simple as this:

[root@biocluster /]$ rocks add host interface compute-0-0 iface=eth1 subnet=public2 name=compute-0-0-public ip=128.125.248.xxx

However, in my case, since the new IP is not in the same subnet as the IP of the frontend, I had to create a new subnet called public2 to make this work:

[root@biocluster /]$ rocks add network public2 128.125.248.0 255.255.254.0 
[root@biocluster /]$ rocks list network
NETWORK  SUBNET         NETMASK         MTU    DNSZONE     SERVEDNS
ipoib:   192.168.1.0    255.255.255.0   65520  ipoib       True    
private: 10.1.1.0       255.255.255.0   1500   local       True    
public:  68.181.163.xxx 255.255.255.128 1500   med.usc.edu False   
public2: 128.125.248.xxx  255.255.254.0   1500   public2     False   
[root@biocluster /]$ rocks remove host interface compute-0-0 eth1
[root@biocluster /]$ rocks add host interface compute-0-0 iface=eth1 subnet=public2 name=compute-0-0-public ip=128.125.248.xxx

A further complication is that, typically, the default gateway for any network is the head node. However, the head node may have iptables rules that prevent any other node from accessing the internet. In these cases, one needs to change the default gateway for eth1 to something else.

Use rocks list host route compute-0-0 to check the current gateway. It is clear that the head node (10.1.1.1) is the default:

bioinform2: 68.181.163.xxx  255.255.255.255 10.1.1.1 G     

Now do this:

[root@biocluster /]$ rocks add host route compute-0-0 0.0.0.0 128.125.249.xxx netmask=0.0.0.0
[root@biocluster /]$ rocks sync host network compute-0-0

Note that 128.125.249.xxx is the gateway address, which needs to be specified correctly. If you do not know the gateway, check the head node's gateway using the route command (see the sketch below). By design, this information is not included in the subnet setup (so rocks list network will not show it), and it needs to be provided by the user instead.
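A quick way to read the gateway off the head node (standard route usage; the default route carries the UG flags):

route -n | grep UG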

Now compute-0-0 can access the internet directly through eth1. If not, try doing it manually on compute-0-0 with route add default gw 10.30.10.254.

To remove a route, use rocks remove host route compute-0-0 address=0.0.0.0; other syntax does not work. (Sometimes you may want to change a route, but in my experience there is no rocks set host route command, so you have to remove the route and add it back.)

In 2020, I found that after adding a public interface to a compute node, I still could not ssh into this node directly from the outside world. I believe it is a routing issue that binds the default route to eno1 (the first ethernet adapter in this particular machine). Initially, I followed the manual but still could not solve this issue with Rocks commands. Instead, I first ran route del default to delete the existing external route that goes through eno1, then route add default gw 10.30.10.254 eno2 (note that 10.30.10.254 is also the default gateway on the head node). The results are:

[root@compute-0-0 ~]# route
Kernel IP routing table
Destination     Gateway         Genmask         Flags Metric Ref    Use Iface
default         gateway         0.0.0.0         UG    0      0        0 eno2
default         0.0.0.0         0.0.0.0         U     0      0        0 eno1
10.1.0.0        0.0.0.0         255.255.0.0     U     0      0        0 eno1
10.30.10.0      0.0.0.0         255.255.255.0   U     0      0        0 eno2
biocluster      biocluster.loca 255.255.255.255 UGH   0      0        0 eno1
link-local      0.0.0.0         255.255.0.0     U     1002   0        0 eno1
link-local      0.0.0.0         255.255.0.0     U     1003   0        0 eno2
link-local      0.0.0.0         255.255.0.0     U     1004   0        0 ib0
192.168.0.0     0.0.0.0         255.255.0.0     U     0      0        0 ib0
224.0.0.0       0.0.0.0         255.255.255.0   U     0      0        0 eno1
255.255.255.255 0.0.0.0         255.255.255.255 UH    0      0        0 eno1

Now I can ping google.com correctly, and I can log into this machine remotely and directly (so this serves as the biocluster2 machine). However, this is not a perfect solution, because whenever compute-0-0 reinstalls, this change will be lost.

I finally solved the problem by assigning the gateway IP to both 0.0.0.0 and default (setting only one of them does not work):

[root@biocluster ~]# rocks add host route compute-0-0 0.0.0.0 10.30.10.254 netmask=0.0.0.0
[root@biocluster ~]# rocks add host route compute-0-0 default 10.30.10.254 netmask=0.0.0.0
[root@biocluster ~]# rocks sync host network compute-0-0

Using the route command, I can see that default is bound to eno2 (the second ethernet interface). Using traceroute google.com on compute-0-0, I can see that the network traffic does not go through the head node. This is a somewhat awkward solution, but it works in the end. (No other method worked, and I tried many times after reading through the manual and googling for a long time.)

10G ethernet specific settings

Instead of IB, you can install 10G ethernet. This allows fast access to the Internet for uploading and downloading files.

The hardware requirement depends very much on your specific network setup. In my case, there is a 10G fiber switch in the server room, so we only need three things: (1) a fiber cable (for example, we use a BELKIN 5M LC/LC OM3 50/125 fiber cable); (2) a 10G ethernet card (for example, Intel X520-DA2); (3) a 10G-SR transceiver (for example, Intel E10GSFPSR). Basically, we connect the fiber cable to the transceiver, which converts the optical signal and plugs into the SFP+ port on the ethernet card.

Useful command line tricks

Find all network devices on the network that this machine connects to

[root@biocluster ~]# arp -a| sort

Examine cable connection

ethtool -p eth0

This blinks the LED on the eth0 port, to help identify which physical port is eth0. This is important because, when a new ethernet card is inserted into the system, the new card may be assigned eth0/eth1 instead, so the built-in ethernet ports become eth2/eth3.

List all hosts in the local network

Use the nmap -sP 192.168.1.* command, which basically pings all hosts in the subnet. (In newer versions of nmap, -sP is a deprecated alias for -sn.)

Troubleshooting

Only getting 100Mb/s speed for gigabit connection

You can check the current speed with the ethtool eth0 command, as below:

[root@biocluster /]$ ethtool eth0
Settings for eth0:
        Supported ports: [ TP ]
        Supported link modes:   10baseT/Half 10baseT/Full 
                                100baseT/Half 100baseT/Full 
                                1000baseT/Full 
        Supported pause frame use: Symmetric
        Supports auto-negotiation: Yes
        Advertised link modes:  10baseT/Half 10baseT/Full 
                                100baseT/Half 100baseT/Full 
                                1000baseT/Full 
        Advertised pause frame use: Symmetric
        Advertised auto-negotiation: Yes
        Speed: 100Mb/s
        Duplex: Full
        Port: Twisted Pair
        PHYAD: 1
        Transceiver: internal
        Auto-negotiation: on
        MDI-X: off (auto)
        Supports Wake-on: pumbg
        Wake-on: g
        Current message level: 0x00000007 (7)
                               drv probe link
        Link detected: yes

This is almost always caused by a low-quality cable. Replacing it with a better CAT6 cable should solve the issue.

eth0 not found in system

Depending on your motherboard, eth0/eth1 can be called em1/em2, for example in Dell C6100 servers.

incorrect MAC address

Sometimes IB programs cannot detect the MAC address of the IB card correctly, resulting in configuration errors. This is very difficult to diagnose in my experience. The following commands can be used to check for this problem and fix it by manually changing the MAC address.

[root@ ~]# ssh compute-0-0 ifdown ib0
Device ib0 has MAC address A0:00:02:20:FE:80:00:00:00:00:00:00:00:02:C9:03:00:0E:50:E9, instead of configured address 80:00:00:48:FE:80:00:00:00:00:00:00:00:02:C9:03:00:0E:50:E9. Ignoring.
[root@ ~]# rocks set host interface mac compute-0-0 iface=ib0 mac=A0:00:02:20:FE:80:00:00:00:00:00:00:00:02:C9:03:00:0E:50:E9
[root@ ~]# rocks sync host network compute-0-0