EGO issue in IBM SCF CE

Problem:

[root@pcmce-co68 ~]# egosh resource list
Cannot get host info. Not logged on.

Solution:
Log on to EGO through egosh (this is needed only once). The default user name and password are both Admin.

[root@pcmce-co68 ~]# egosh user logon
user account: Admin
password:
Logged on successfully
[root@pcmce-co68 ~]# egosh resource list
NAME status mem swp tmp ut it pg r1m r15s r15m ls
pcmce-c* ok 827M 1516M 69G 7% 258 3.1 0.2 3.3 0.7 1
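
For scripts, the logon can also be done without the interactive prompt. Below is a minimal sketch, assuming this EGO version accepts the -u and -x options of "egosh user logon" and that the default Admin/Admin credentials are still in place. The egosh_with_logon helper and the EGOSH variable are hypothetical names used only for illustration. Keep in mind that a password given on the command line is visible in the process list.

```shell
# Path of the egosh binary; override it if egosh is not on PATH.
EGOSH=${EGOSH:-egosh}

# Run an egosh subcommand and log on first if EGO reports no session.
# Hypothetical helper, assuming "egosh user logon -u USER -x PASSWORD" works here.
egosh_with_logon() {
    out=$("$EGOSH" "$@" 2>&1)
    case $out in
        *"Not logged on"*)
            # Default credentials from the note above; change them if yours differ.
            "$EGOSH" user logon -u Admin -x Admin >/dev/null || return 1
            out=$("$EGOSH" "$@" 2>&1)
            ;;
    esac
    printf '%s\n' "$out"
}
```

Usage: egosh_with_logon resource list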

 

Disable Serial Console Redirection in xCAT

PXE boot config:

"... console=tty0 console=ttyS0,115200 ..."

Note: serial console redirection is managed by the hardware profile.

To check hardware profile:

$ tabdump nodehm
#node,power,mgt,cons,termserver,termport,conserver,serialport,serialspeed,serialflow,getmac,cmdmapping,consoleondemand,comments,disable
"__HardwareProfile_IPMI",,"ipmi",,,,,,,,,"/opt/pcm/etc/hwmgt/mappings/HWCmdMapping_ipmi.xml",,,
"__HardwareProfile_IBM_Flex_System_x",,"ipmi",,,,,"0","115200","hard",,"/opt/pcm/etc/hwmgt/mappings/HWCmdMapping_flex_x.xml",,,
"__HardwareProfile_IBM_System_x_M4",,"ipmi",,,,,"0","115200","hard",,"/opt/pcm/etc/hwmgt/mappings/HWCmdMapping_rackmount_x.xml",,,
"__HardwareProfile_IBM_iDataPlex_M4",,"ipmi",,,,,"0","115200","hard",,"/opt/pcm/etc/hwmgt/mappings/HWCmdMapping_rackmount_x.xml",,,
"__HardwareProfile_IBM_NeXtScale_M4",,"ipmi",,,,,"0","115200","hard",,"/opt/pcm/etc/hwmgt/mappings/HWCmdMapping_nextscale_x.xml",,,
"__Chassis_IBM_Flex_chassis",,"blade",,,,,,,,,,,,

To disable: clear the serialport, serialspeed, and serialflow attributes of the hardware profile, for example:

$ chdef -t group -o __HardwareProfile_IBM_Flex_System_x serialport= serialspeed= serialflow=
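
The chdef command above clears a single profile. To find every profile that still carries serial settings, the nodehm table can be filtered with awk. This is a rough sketch, assuming the column order shown in the tabdump header above (serialport, serialspeed, and serialflow in columns 8 to 10) and no commas inside field values. profiles_with_serial and print_chdef_commands are hypothetical helper names. The second function only prints the chdef commands so they can be reviewed before being run.

```shell
# Read "tabdump nodehm" output on stdin and print the names of rows
# that still have a serialport, serialspeed, or serialflow value.
profiles_with_serial() {
    awk -F, '
        /^#/ { next }                                    # skip the header line
        {
            for (i = 1; i <= NF; i++) gsub(/"/, "", $i)  # strip the quotes
            if ($8 != "" || $9 != "" || $10 != "") print $1
        }'
}

# Dry run: print one chdef command per matching profile.
print_chdef_commands() {
    profiles_with_serial | while read -r profile; do
        echo "chdef -t group -o $profile serialport= serialspeed= serialflow="
    done
}
```

Usage: tabdump nodehm | print_chdef_commands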

 

How to Configure Mellanox Switch

Required Tools:

  1. Serial console cable (provided inside the box)
  2. Converter cable from serial port to USB port (Prolific)

Step-by-step guide using macOS:

  1. Download and install the driver for the converter cable: http://plugable.com/drivers/prolific
  2. Check that the driver is installed correctly:
     $ kextstat | grep prolific
     159 0 0xffffff7f832fa000 0x6000 0x6000 com.prolific.driver.PL2303 (1.6.0) F6A6805D-685D-3E6D-BF81-106EBBC0A386
     $ ioreg -c IOSerialBSDClient | grep usb
     | | "IOTTYBaseName" = "usbserial"
     | | "IOCalloutDevice" = "/dev/cu.usbserial"
     | | "IODialinDevice" = "/dev/tty.usbserial"
     | | "IOTTYDevice" = "usbserial"
  3. Start the connection:
     $ screen /dev/cu.usbserial
  4. Press Enter and follow the instructions in the official user guide.
     For example:
Mellanox Switch

Mellanox configuration wizard
 Do you want to use the wizard for initial configuration? y
 Step 1: Hostname? [switch-56d680] switch-10g
 Step 2: Use DHCP on mgmt0 interface? [yes] no
 Step 3: Use zeroconf on mgmt0 interface? [no] no
 Step 4: Primary IPv4 address and masklen? [0.0.0.0/0] 172.21.35.60/23
 Step 5: Default gateway? 172.21.35.254
 Step 6: Primary DNS server? 155.69.3.8,155.69.3.7
 % Value must be an IPv4 address in the format of '192.168.0.1'.
 Step 6: Primary DNS server? 155.69.3.8
 Step 7: Domain name?
 Step 8: Enable IPv6? [yes] yes
 Step 9: Enable IPv6 autoconfig (SLAAC) on mgmt0 interface? [no] no
 Step 10: Enable DHCPv6 on mgmt0 interface? [no] no
 Step 11: Admin password (Enter to leave unchanged)?
 Step 11: Confirm admin password?
 You have entered the following information:
 1. Hostname: switch-10g
 2. Use DHCP on mgmt0 interface: no
 3. Use zeroconf on mgmt0 interface: no
 4. Primary IPv4 address and masklen: 172.21.35.60/23
 5. Default gateway: 172.21.35.254
 6. Primary DNS server: 155.69.3.8
 7. Domain name:
 8. Enable IPv6: yes
 9. Enable IPv6 autoconfig (SLAAC) on mgmt0 interface: no
 10. Enable DHCPv6 on mgmt0 interface: no
 11. Admin password (Enter to leave unchanged): (CHANGED)
 To change an answer, enter the step number to return to.
 Otherwise hit <enter> to save changes and exit.
 Choice:
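
A quick sanity check for steps 4 and 5 is that the default gateway falls inside the subnet of the management address. The sketch below does this with plain shell arithmetic. ip_to_int and same_subnet are hypothetical helper names, not part of any Mellanox tool. For the values used above, 172.21.35.60/23 covers 172.21.34.0 to 172.21.35.255, so the gateway 172.21.35.254 is on the same subnet.

```shell
# Convert a dotted-quad IPv4 address (a.b.c.d) to a 32-bit integer.
ip_to_int() {
    old_ifs=$IFS
    IFS=.
    set -- $1            # split the address on dots into $1..$4
    IFS=$old_ifs
    echo $(( ($1 << 24) | ($2 << 16) | ($3 << 8) | $4 ))
}

# usage: same_subnet ADDRESS/MASKLEN GATEWAY
# Succeeds when both addresses share the same network part.
same_subnet() {
    addr=${1%/*}
    len=${1#*/}
    mask=$(( (4294967295 << (32 - $len)) & 4294967295 ))
    [ $(( $(ip_to_int "$addr") & $mask )) -eq $(( $(ip_to_int "$2") & $mask )) ]
}
```

Usage: same_subnet 172.21.35.60/23 172.21.35.254 && echo "gateway is on the management subnet"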

High Utilization in NVIDIA GPU

Problem:

On recent GPUs, you may notice high GPU utilization even though no process is running. According to the explanation at http://docs.nvidia.com/deploy/driver-persistence/, this happens because the kernel module is loaded but the GPU is not yet initialized. By default, the GPU is initialized when a GPU process starts working on it and is deinitialized when that process completes.

[code language="bash"]
sysadmin@sap-dl:~$ nvidia-smi
Tue Sep 27 19:01:52 2016
+------------------------------------------------------+
| NVIDIA-SMI 352.99     Driver Version: 352.99         |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|===============================+======================+======================|
|   0  Tesla M40           Off  | 0000:02:00.0     Off |                    0 |
|  0%   33C    P0    67W / 250W |     55MiB / 11519MiB |     67%      Default |
+-------------------------------+----------------------+----------------------+

+-----------------------------------------------------------------------------+
| Processes:                                                       GPU Memory |
|  GPU       PID  Type  Process name                               Usage      |
|=============================================================================|
|  No running processes found                                                 |
+-----------------------------------------------------------------------------+
[/code]

 

Solution:

We can keep the GPU initialized at all times. This is called “persistence mode”. To enable it:
sysadmin@sap-dl:~$ sudo nvidia-smi -i 0 -pm 1
[sudo] password for sysadmin:
Enabled persistence mode for GPU 0000:02:00.0.
All done.

[code language="bash"]
sysadmin@sap-dl:~$ nvidia-smi
Tue Sep 27 19:02:52 2016
+------------------------------------------------------+
| NVIDIA-SMI 352.99     Driver Version: 352.99         |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|===============================+======================+======================|
|   0  Tesla M40           On   | 0000:02:00.0     Off |                    0 |
|  0%   36C    P0    66W / 250W |     55MiB / 11519MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+

+-----------------------------------------------------------------------------+
| Processes:                                                       GPU Memory |
|  GPU       PID  Type  Process name                               Usage      |
|=============================================================================|
|  No running processes found                                                 |
+-----------------------------------------------------------------------------+
[/code]

However, this setting is lost when the server reboots, so we need to set it in rc.local to re-enable it at startup.
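
As a sketch of what that rc.local entry could look like, assuming /etc/rc.local is executed as root at boot on this system: enable_persistence is a hypothetical wrapper that skips the call when nvidia-smi is missing, so a machine without the driver still boots cleanly. Without the -i option, -pm 1 applies to every GPU in the system. The NVIDIA page linked above also describes a persistence daemon (nvidia-persistenced) as an alternative to this approach.

```shell
# Fragment for /etc/rc.local (assumed to run as root at boot).
# enable_persistence is a hypothetical helper; $1 optionally overrides
# the nvidia-smi binary, which is mainly useful for testing.
enable_persistence() {
    smi=${1:-nvidia-smi}
    if command -v "$smi" >/dev/null 2>&1; then
        "$smi" -pm 1          # no -i given: applies to all GPUs
    else
        echo "nvidia-smi not found, persistence mode not set" >&2
    fi
}

# "|| true" keeps rc.local going even if it is run with "sh -e".
enable_persistence || true
```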

Unable to Mount Ext HDD Volume in Mac

Problem:
$ sudo mount -t exfat /dev/disk2s2 /mnt/
mount_exfat: /dev/disk2s2 on /mnt: Resource busy

$ diskutil unmountdisk /dev/disk2s2
Unmount of disk2 failed: at least one volume could not be unmounted

Solution:
$ hdiutil detach /dev/disk2s2
"disk2" unmounted.
"disk2" ejected.

Limiting CPU Usage of A Process in CentOS/RHEL 7

In HPC, we may need to protect the head node from unnecessarily heavy processes that can cause login problems for users. One solution is cpulimit: we can create a cron job that monitors all processes and applies a CPU limit to them. This is how I usually do it on CentOS/RHEL 7.x.

  1. Install cpulimit package from EPEL repo.

yum install cpulimit

  2. Create a script to monitor the processes. The script below is a modified version of the script in this forum. You can adjust the first three variables: CPU_LIMIT, BLACK_PROCESSES_LIST, and WHITE_PROCESSES_LIST.
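
The following is only a rough sketch of the idea, not the forum script or the modified version referred to above. The three variable names come from the text, but their exact meaning here is an assumption: BLACK_PROCESSES_LIST, when set, restricts throttling to matching commands, and otherwise WHITE_PROCESSES_LIST exempts matching commands. select_pids and throttle_heavy_processes are hypothetical function names, and the sample values are placeholders. The sketch uses only the -p, -l, and -z options of cpulimit.

```shell
CPU_LIMIT=20                               # max %CPU allowed per process
BLACK_PROCESSES_LIST=""                    # regex: if set, throttle only these
WHITE_PROCESSES_LIST="sshd|systemd|cron"   # regex: never throttle these

# Read "PID %CPU COMMAND" lines on stdin and print the PIDs to throttle.
select_pids() {
    awk -v limit="$CPU_LIMIT" -v black="$BLACK_PROCESSES_LIST" \
        -v white="$WHITE_PROCESSES_LIST" '
        $2 + 0 <= limit                          { next }  # under the limit
        black != "" && $3 !~ black               { next }  # not blacklisted
        black == "" && white != "" && $3 ~ white { next }  # whitelisted
        { print $1 }'
}

# Attach one cpulimit instance to each selected process.
throttle_heavy_processes() {
    ps -eo pid=,pcpu=,comm= | select_pids | while read -r pid; do
        # Skip processes that already have a cpulimit attached.
        pgrep -f "cpulimit -p $pid " >/dev/null 2>&1 && continue
        # -z makes cpulimit exit once the target process is gone.
        cpulimit -p "$pid" -l "$CPU_LIMIT" -z >/dev/null 2>&1 &
    done
}
```

Call throttle_heavy_processes at the end of the script and run the script from cron at whatever interval suits the head node.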
