kubelet fails with error “misconfiguration…”

Error:
Kubelet 1.6.0 fails with the error: misconfiguration: kubelet cgroup driver: "cgroupfs" is different from docker cgroup driver: "systemd"

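Solution:
The kubelet and Docker must use the same cgroup driver. Check which driver Docker uses:
docker info | grep -i cgroup
Then edit the kubelet drop-in file:
vi /etc/systemd/system/kubelet.service.d/10-kubeadm.conf
and set --cgroup-driver to the value Docker reports. For example, if Docker reports cgroupfs, change
KUBELET_CGROUP_ARGS=--cgroup-driver=systemd
to
KUBELET_CGROUP_ARGS=--cgroup-driver=cgroupfs
Finally, reload systemd and restart the kubelet:
systemctl daemon-reload
systemctl restart kubelet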
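
The manual edit can also be scripted. The following is only a minimal sketch, assuming GNU sed and the drop-in path used above; docker_cgroup_driver and set_kubelet_cgroup_driver are hypothetical helper names, not part of kubeadm or Docker.

```shell
#!/bin/sh
# Hypothetical helpers (not shipped with kubeadm or Docker).

# Read `docker info` output on stdin and print the cgroup driver name.
docker_cgroup_driver() {
    sed -n 's/^ *Cgroup Driver: *//p'
}

# Rewrite the --cgroup-driver value in a kubelet drop-in file.
#   $1: path to the drop-in, e.g.
#       /etc/systemd/system/kubelet.service.d/10-kubeadm.conf
#   $2: driver name to set ("systemd" or "cgroupfs")
set_kubelet_cgroup_driver() {
    conf=$1
    driver=$2
    # Requires GNU sed for in-place editing with -i.
    sed -i "s/--cgroup-driver=[a-z]*/--cgroup-driver=${driver}/" "$conf"
}

# Typical use on a real node (run as root):
#   driver=$(docker info 2>/dev/null | docker_cgroup_driver)
#   set_kubelet_cgroup_driver \
#       /etc/systemd/system/kubelet.service.d/10-kubeadm.conf "$driver"
#   systemctl daemon-reload && systemctl restart kubelet
```

Reading the driver from Docker rather than hard-coding it keeps the two settings in sync whichever way the mismatch goes.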

EGO issue in IBM SCF CE

Problem:

[root@pcmce-co68 ~]# egosh resource list
Cannot get host info. Not logged on.

Solution:

Log in to the egosh shell (one time only). The default user name and password are both Admin.
[root@pcmce-co68 ~]# egosh user logon
user account: Admin
password:
Logged on successfully
[root@pcmce-co68 ~]# egosh resource list
NAME     status  mem   swp    tmp  ut  it   pg   r1m  r15s  r15m  ls
pcmce-c* ok      827M  1516M  69G  7%  258  3.1  0.2  3.3   0.7   1
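
In a script, the "Not logged on" message can be detected before retrying. This is a rough sketch only: ego_needs_logon is a hypothetical helper name, and the non-interactive form egosh user logon -u Admin -x Admin in the comment is an assumption about the egosh options, using the default credentials mentioned above.

```shell
#!/bin/sh
# Hypothetical helper: succeeds if the egosh output read on stdin
# contains the "Not logged on" message shown in the problem above.
ego_needs_logon() {
    grep -q 'Not logged on'
}

# Typical use (assumes egosh is on PATH and the default Admin account):
#   if egosh resource list 2>&1 | ego_needs_logon; then
#       egosh user logon -u Admin -x Admin
#   fi
```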

Disable Serial Console Redirection in xCAT

PXE boot config:
"... console=tty0 console=ttyS0,115200 ..."

Note: serial console redirection is managed by hardware profile.

To check hardware profile:
$ tabdump nodehm

#node,power,mgt,cons,termserver,termport,conserver,serialport,serialspeed,serialflow,getmac,cmdmapping,consoleondemand,comments,disable
"__HardwareProfile_IPMI",,"ipmi",,,,,,,,,"/opt/pcm/etc/hwmgt/mappings/HWCmdMapping_ipmi.xml",,,
"__HardwareProfile_IBM_Flex_System_x",,"ipmi",,,,,"0","115200","hard",,"/opt/pcm/etc/hwmgt/mappings/HWCmdMapping_flex_x.xml",,,
"__HardwareProfile_IBM_System_x_M4",,"ipmi",,,,,"0","115200","hard",,"/opt/pcm/etc/hwmgt/mappings/HWCmdMapping_rackmount_x.xml",,,
"__HardwareProfile_IBM_iDataPlex_M4",,"ipmi",,,,,"0","115200","hard",,"/opt/pcm/etc/hwmgt/mappings/HWCmdMapping_rackmount_x.xml",,,
"__HardwareProfile_IBM_NeXtScale_M4",,"ipmi",,,,,"0","115200","hard",,"/opt/pcm/etc/hwmgt/mappings/HWCmdMapping_nextscale_x.xml",,,
"__Chassis_IBM_Flex_chassis",,"blade",,,,,,,,,,,,

To disable it, clear the serialport, serialspeed and serialflow columns of the hardware profile:
$ chdef -t group -o __HardwareProfile_IBM_Flex_System_x serialport= serialspeed= serialflow=
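
To see at a glance which hardware profiles still have serial redirection configured, the tabdump output can be filtered. This is a minimal sketch, assuming the column order shown in the nodehm header above (serialport is the 8th field) and that no field contains a comma; profiles_with_serial is a hypothetical helper name, not an xCAT command.

```shell
#!/bin/sh
# Hypothetical helper: read `tabdump nodehm` output on stdin and print
# the names of entries whose serialport column (8th field) is not empty.
profiles_with_serial() {
    awk -F, '
        /^#/ { next }            # skip the header line
        $8 != "" {
            gsub(/"/, "", $1)    # strip the quotes around the name
            print $1
        }
    '
}

# Typical use on the management node:
#   tabdump nodehm | profiles_with_serial
```

After running chdef, the cleared profile should no longer appear in this list.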

High Utilization in NVIDIA GPU

Problem:

On recent GPUs, you may notice that the GPU shows high utilization even though no process is running. According to the explanation at http://docs.nvidia.com/deploy/driver-persistence/, this happens because the kernel module is loaded but the GPU is not yet initialized. By default, the GPU is initialized when a GPU process starts working on it, and deinitialized when the process completes.

sysadmin@sap-dl:~$ nvidia-smi
Tue Sep 27 19:01:52 2016
+------------------------------------------------------+
| NVIDIA-SMI 352.99     Driver Version: 352.99         |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|===============================+======================+======================|
|   0  Tesla M40           Off  | 0000:02:00.0     Off |                    0 |
|  0%   33C    P0    67W / 250W |     55MiB / 11519MiB |     67%      Default |
+-------------------------------+----------------------+----------------------+

+-----------------------------------------------------------------------------+
| Processes:                                                       GPU Memory |
|  GPU       PID  Type  Process name                               Usage      |
|=============================================================================|
|  No running processes found                                                 |
+-----------------------------------------------------------------------------+

 

Solution:

We can keep the GPU initialized all the time. This is called “persistence mode”. To enable it:
sysadmin@sap-dl:~$ sudo nvidia-smi -i 0 -pm 1
[sudo] password for sysadmin:
Enabled persistence mode for GPU 0000:02:00.0.
All done.

sysadmin@sap-dl:~$ nvidia-smi
Tue Sep 27 19:02:52 2016
+------------------------------------------------------+
| NVIDIA-SMI 352.99     Driver Version: 352.99         |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|===============================+======================+======================|
|   0  Tesla M40           On   | 0000:02:00.0     Off |                    0 |
|  0%   36C    P0    66W / 250W |     55MiB / 11519MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+

+-----------------------------------------------------------------------------+
| Processes:                                                       GPU Memory |
|  GPU       PID  Type  Process name                               Usage      |
|=============================================================================|
|  No running processes found                                                 |
+-----------------------------------------------------------------------------+

However, this setting is reset when the server reboots. Hence, we need to set it in rc.local so that it is applied at startup.
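
One way to add the command to rc.local is sketched below. It assumes GNU sed, that nvidia-smi lives at /usr/bin/nvidia-smi, and an rc.local that ends with the usual "exit 0" line as on Ubuntu; add_persistence_to_rc_local is a hypothetical helper name. Without the -i option, nvidia-smi -pm 1 applies to all GPUs.

```shell
#!/bin/sh
# Hypothetical helper: add the persistence-mode command to an rc.local
# style file, once, placing it before the final "exit 0" if present.
#   $1: path to the file, normally /etc/rc.local
add_persistence_to_rc_local() {
    rc=$1
    line='/usr/bin/nvidia-smi -pm 1'   # assumed install path

    # Do nothing if the command is already there.
    if grep -qF "$line" "$rc"; then
        return 0
    fi

    if grep -q '^exit 0' "$rc"; then
        # Insert the command just before "exit 0" (GNU sed).
        sed -i "s|^exit 0|${line}\nexit 0|" "$rc"
    else
        printf '%s\n' "$line" >> "$rc"
    fi
}

# Typical use (run as root):
#   add_persistence_to_rc_local /etc/rc.local
```

The command has to come before "exit 0", otherwise rc.local exits without ever running it.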

Installing CUDA 7.5 for Tesla M40 on Ubuntu 14.04.5 LTS

Install Driver

  1. Download the Tesla driver (http://www.nvidia.com/Download/index.aspx?lang=en-us)
  2. Move to runlevel 3
    $ telinit 3
  3. Stop lightdm service
    $ service lightdm stop

  4. Change file mode of the driver package
    $ chmod +x NVIDIA-Linux-x86_64-352.99.run


CUDA 7.5 and Visual Studio 2015

Sorry, I won’t tell you the solution. Instead, I will show you why you should not expect a solution to the CUDA 7.5 and Visual Studio 2015 integration problem. 😀

If you try to compile a simple kernel with nvcc and point it at the VS2015 C++ compiler like this:

> nvcc .\kernel.cu -ccbin "C:\Program Files (x86)\Microsoft Visual Studio 14.0\VC\bin\x86_amd64\"

then you will get this error: