EGO issue in IBM SCF CE

Problem:

[root@pcmce-co68 ~]# egosh resource list
Cannot get host info. Not logged on.

Solution:
Log on to EGO through egosh (needed only once). The default user name and password are both Admin.

[root@pcmce-co68 ~]# egosh user logon
user account: Admin
password:
Logged on successfully
[root@pcmce-co68 ~]# egosh resource list
NAME status mem swp tmp ut it pg r1m r15s r15m ls
pcmce-c* ok 827M 1516M 69G 7% 258 3.1 0.2 3.3 0.7 1
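
The check-then-logon sequence above can also be scripted. The following is a minimal sketch, not a definitive implementation: it assumes that your EGO version accepts the -u and -x options of "egosh user logon" and that the default Admin credentials are still in place. The helper names needs_logon and ensure_logon, and the EGO_USER and EGO_PASS variables, are hypothetical.

```shell
#!/bin/sh
# Return 0 if the given egosh output shows that no user is logged on.
needs_logon() {
    printf '%s\n' "$1" | grep -q 'Not logged on'
}

# Log on only when egosh reports that we are not logged on.
# Assumption: "egosh user logon" accepts -u (user) and -x (password).
ensure_logon() {
    out=$(egosh resource list 2>&1)
    if needs_logon "$out"; then
        egosh user logon -u "${EGO_USER:-Admin}" -x "${EGO_PASS:-Admin}"
    fi
}
```

Call ensure_logon before running other egosh commands in a script. Passing a password on the command line exposes it in the process list, so this is only reasonable for the default credentials on a test cluster.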

 

Disable Serial Console Redirection in xCAT

PXE boot config:

"... console=tty0 console=ttyS0,115200 ..."

Note: serial console redirection is managed by the hardware profile.

To check hardware profile:

$ tabdump nodehm
#node,power,mgt,cons,termserver,termport,conserver,serialport,serialspeed,serialflow,getmac,cmdmapping,consoleondemand,comments,disable
"__HardwareProfile_IPMI",,"ipmi",,,,,,,,,"/opt/pcm/etc/hwmgt/mappings/HWCmdMapping_ipmi.xml",,,
"__HardwareProfile_IBM_Flex_System_x",,"ipmi",,,,,"0","115200","hard",,"/opt/pcm/etc/hwmgt/mappings/HWCmdMapping_flex_x.xml",,,
"__HardwareProfile_IBM_System_x_M4",,"ipmi",,,,,"0","115200","hard",,"/opt/pcm/etc/hwmgt/mappings/HWCmdMapping_rackmount_x.xml",,,
"__HardwareProfile_IBM_iDataPlex_M4",,"ipmi",,,,,"0","115200","hard",,"/opt/pcm/etc/hwmgt/mappings/HWCmdMapping_rackmount_x.xml",,,
"__HardwareProfile_IBM_NeXtScale_M4",,"ipmi",,,,,"0","115200","hard",,"/opt/pcm/etc/hwmgt/mappings/HWCmdMapping_nextscale_x.xml",,,
"__Chassis_IBM_Flex_chassis",,"blade",,,,,,,,,,,,

To disable redirection, clear the serialport, serialspeed and serialflow columns of the relevant hardware profile:

$ chdef -t group -o __HardwareProfile_IBM_Flex_System_x serialport= serialspeed= serialflow=
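
To find which profiles still carry serial settings, the tabdump output can be filtered. This is a minimal sketch under one stated assumption: no field in the nodehm table contains an embedded comma, so a plain comma split is safe. The function name profiles_with_serial is hypothetical.

```shell
#!/bin/sh
# Print the name of every nodehm entry that still has a serialport value.
# Reads "tabdump nodehm" output on stdin.
# Assumption: no field contains an embedded comma.
profiles_with_serial() {
    awk -F',' '
        /^#/ { next }                 # skip the header line
        $8 != "" && $8 != "\"\"" {    # column 8 is serialport
            gsub(/"/, "", $1)         # strip the surrounding quotes
            print $1
        }'
}
```

Usage: run "tabdump nodehm | profiles_with_serial" before and after the chdef command; a profile that was cleared should no longer be listed.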

 

How to Configure Mellanox Switch

Required Tools:

  1. Serial console cable (provided inside the box)
  2. Converter cable from serial port to USB port (Prolific)

Step-by-step guide using macOS:

  1. Download and install the driver for the converter cable: http://plugable.com/drivers/prolific
  2. Check that the driver is installed correctly:
     $ kextstat | grep prolific
     159 0 0xffffff7f832fa000 0x6000 0x6000 com.prolific.driver.PL2303 (1.6.0) F6A6805D-685D-3E6D-BF81-106EBBC0A386
     $ ioreg -c IOSerialBSDClient | grep usb
     | | "IOTTYBaseName" = "usbserial"
     | | "IOCalloutDevice" = "/dev/cu.usbserial"
     | | "IODialinDevice" = "/dev/tty.usbserial"
     | | "IOTTYDevice" = "usbserial"
  3. Start the connection
     $ screen /dev/cu.usbserial
  4. Press Enter and follow the instructions in the official user guide.
     For example:
Mellanox Switch

Mellanox configuration wizard
 Do you want to use the wizard for initial configuration? y
 Step 1: Hostname? [switch-56d680] switch-10g
 Step 2: Use DHCP on mgmt0 interface? [yes] no
 Step 3: Use zeroconf on mgmt0 interface? [no] no
 Step 4: Primary IPv4 address and masklen? [0.0.0.0/0] 172.21.35.60/23
 Step 5: Default gateway? 172.21.35.254
 Step 6: Primary DNS server? 155.69.3.8,155.69.3.7
 % Value must be an IPv4 address in the format of '192.168.0.1'.
 Step 6: Primary DNS server? 155.69.3.8
 Step 7: Domain name?
 Step 8: Enable IPv6? [yes] yes
 Step 9: Enable IPv6 autoconfig (SLAAC) on mgmt0 interface? [no] no
 Step 10: Enable DHCPv6 on mgmt0 interface? [no] no
 Step 11: Admin password (Enter to leave unchanged)?
 Step 11: Confirm admin password?
 You have entered the following information:
 1. Hostname: switch-10g
 2. Use DHCP on mgmt0 interface: no
 3. Use zeroconf on mgmt0 interface: no
 4. Primary IPv4 address and masklen: 172.21.35.60/23
 5. Default gateway: 172.21.35.254
 6. Primary DNS server: 155.69.3.8
 7. Domain name:
 8. Enable IPv6: yes
 9. Enable IPv6 autoconfig (SLAAC) on mgmt0 interface: no
 10. Enable DHCPv6 on mgmt0 interface: no
 11. Admin password (Enter to leave unchanged): (CHANGED)
 To change an answer, enter the step number to return to.
 Otherwise hit <enter> to save changes and exit.
 Choice:
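
Depending on the converter cable and driver version, the callout device may appear with a suffix rather than as plain /dev/cu.usbserial. The following minimal sketch lists the candidates so that the right path can be passed to screen. The suffixed naming is an assumption about other adapters, not something shown in the output above, and the function name find_usb_serial is hypothetical.

```shell
#!/bin/sh
# List candidate USB serial callout devices.
# The optional argument is the directory to search (default: /dev).
find_usb_serial() {
    dir=${1:-/dev}
    for dev in "$dir"/cu.usbserial*; do
        # An unmatched glob stays a literal pattern, so test for existence.
        [ -e "$dev" ] && printf '%s\n' "$dev"
    done
    return 0
}
```

Usage: run find_usb_serial, then open the reported device with screen as in step 3.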

High Utilization in NVIDIA GPU

Problem:

On recent GPUs, nvidia-smi may report high GPU utilization even though no process is running. According to the explanation at http://docs.nvidia.com/deploy/driver-persistence/, this happens because the kernel module is loaded but the GPU is not yet initialized. By default, the GPU is initialized when a GPU process starts using it and deinitialized when that process completes.

sysadmin@sap-dl:~$ nvidia-smi
Tue Sep 27 19:01:52 2016
+------------------------------------------------------+
| NVIDIA-SMI 352.99     Driver Version: 352.99         |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|===============================+======================+======================|
|   0  Tesla M40           Off  | 0000:02:00.0     Off |                    0 |
|  0%   33C    P0    67W / 250W |     55MiB / 11519MiB |     67%      Default |
+-------------------------------+----------------------+----------------------+

+-----------------------------------------------------------------------------+
| Processes:                                                       GPU Memory |
|  GPU       PID  Type  Process name                               Usage      |
|=============================================================================|
|  No running processes found                                                 |
+-----------------------------------------------------------------------------+

 

Solution:

Keep the GPU initialized at all times by enabling “persistence mode”:
sysadmin@sap-dl:~$ sudo nvidia-smi -i 0 -pm 1
[sudo] password for sysadmin:
Enabled persistence mode for GPU 0000:02:00.0.
All done.

sysadmin@sap-dl:~$ nvidia-smi
Tue Sep 27 19:02:52 2016
+------------------------------------------------------+
| NVIDIA-SMI 352.99     Driver Version: 352.99         |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|===============================+======================+======================|
|   0  Tesla M40           On   | 0000:02:00.0     Off |                    0 |
|  0%   36C    P0    66W / 250W |     55MiB / 11519MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+

+-----------------------------------------------------------------------------+
| Processes:                                                       GPU Memory |
|  GPU       PID  Type  Process name                               Usage      |
|=============================================================================|
|  No running processes found                                                 |
+-----------------------------------------------------------------------------+

However, this setting is reset when the server reboots, so add the command to rc.local to reapply it at every startup.
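
A minimal sketch of such an rc.local fragment follows. It assumes that rc.local runs as root on this system and that nvidia-smi is on the PATH at that point in the boot sequence. The helper gpus_without_persistence is a hypothetical name, included only to verify the result afterwards.

```shell
#!/bin/sh
# Print the index of every GPU whose persistence mode is not enabled.
# Reads lines such as "0, Disabled" on stdin, as produced by:
#   nvidia-smi --query-gpu=index,persistence_mode --format=csv,noheader
gpus_without_persistence() {
    awk -F', *' 'NF >= 2 && $2 != "Enabled" { print $1 }'
}

# Enable persistence mode on all GPUs; skipped when nvidia-smi is absent.
if command -v nvidia-smi >/dev/null 2>&1; then
    nvidia-smi -pm 1
fi
```

If rc.local ends with "exit 0", place the fragment before that line. After a reboot, piping the query command shown in the comment into gpus_without_persistence should print nothing.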