# hami-local


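Adapted for V10. There is no need to preconfigure how many slices each GPU is split into; each container can be allocated any amount of GPU memory, specified in Mi.

This installer supports the full range of NVIDIA GPUs on x86 hosts. It provides GPU resources to k8s containers and allows each physical GPU to be shared, so that one physical GPU can be used by multiple pods at the same time.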

## 1. Install the NVIDIA driver (version > 440)

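- On every server that has a GPU installed, make sure the `nvidia-smi` command runs correctly and the driver version is > 440. The latest driver from the official website is recommended.
  
- Clone this repository onto the master node.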

```
# git clone https://bdgit.educoder.net/pqyr82v69/hami-local.git --depth=1
```
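
The driver requirement above can be checked with a small script on each GPU server. This is a minimal sketch, not part of the repository: the helper name `driver_ok` is hypothetical, and it assumes that "> 440" means a major version number above 440 (so `440.x` is rejected).

```shell
#!/bin/sh
# Sketch: check that the NVIDIA driver major version is above a minimum.

# driver_ok VERSION [MIN_MAJOR]   (hypothetical helper)
# Succeeds when the major part of VERSION (the text before the first dot)
# is strictly greater than MIN_MAJOR (default 440).
driver_ok() {
  _major="${1%%.*}"
  _min="${2:-440}"
  # Reject empty or non-numeric input instead of guessing.
  case "$_major" in
    ''|*[!0-9]*) return 2 ;;
  esac
  [ "$_major" -gt "$_min" ]
}

# Only query the driver when nvidia-smi is actually installed.
if command -v nvidia-smi >/dev/null 2>&1; then
  # Prints one driver version per GPU, without a header line.
  nvidia-smi --query-gpu=driver_version --format=csv,noheader |
    while read -r v; do
      if driver_ok "$v"; then
        echo "driver $v: OK"
      else
        echo "driver $v: too old or unreadable"
      fi
    done
fi
```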

## 2. Configure the NVIDIA runtime

**Run the following steps on every host that has a GPU installed and is to be managed by EduCoder (头歌):**

* Clean up old rpm packages. First list the installed NVIDIA packages:
  
  ```
  # rpm -qa|grep nvidia
  ```

* Remove every package that was found
  
  ```
  # rpm -e <package-name>
  ```

* Install nvidia-container-toolkit  
  Copy the `*.rpm` files to the relevant nodes with scp, then run the following on each GPU node:
  
  ```
  rpm -ivh *.rpm
  nvidia-ctk runtime configure --runtime=containerd --set-as-default
  systemctl restart containerd
  ```
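
After containerd restarts, it can be worth confirming that the NVIDIA runtime really became the default. The sketch below only inspects a config file. It assumes that the setting ends up as a `default_runtime_name = "nvidia"` line in `/etc/containerd/config.toml`; the path may differ on your system, and the helper name `nvidia_is_default` is hypothetical.

```shell
#!/bin/sh
# Sketch: confirm that containerd's default runtime is "nvidia".
# Assumption: the setting is stored in /etc/containerd/config.toml.

# nvidia_is_default FILE   (hypothetical helper)
# Succeeds when FILE contains a line of the form:
#   default_runtime_name = "nvidia"
nvidia_is_default() {
  grep -Eq '^[[:space:]]*default_runtime_name[[:space:]]*=[[:space:]]*"nvidia"' "$1"
}

# Override CONFIG in the environment if the file lives elsewhere.
CONFIG="${CONFIG:-/etc/containerd/config.toml}"
if [ -r "$CONFIG" ]; then
  if nvidia_is_default "$CONFIG"; then
    echo "$CONFIG: nvidia is the default runtime"
  else
    echo "$CONFIG: nvidia is NOT the default runtime"
  fi
fi
```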

## 3. Import the offline image archives

Run on the master node:

```
    # nerdctl -nk8s.io load -i hami260.tar
    # nerdctl -nk8s.io push sealos.hub:5000/projecthami/hami:v2.6.0
    # nerdctl -nk8s.io load -i scheduler1-28-9.tar
    # nerdctl -nk8s.io push sealos.hub:5000/google_containers/kube-scheduler:v1.28.9
    # nerdctl -nk8s.io load -i kube-webhook-certgen.tar
    # nerdctl -nk8s.io push sealos.hub:5000/liangjw/kube-webhook-certgen:v1.1.1
```
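
The three load/push pairs above follow one pattern, so they can also be generated from a list. The following is a sketch only: the function names `plan_import` and `plan_all` are hypothetical, the archive and image names are copied from the commands above, and the script just prints the commands instead of running them.

```shell
#!/bin/sh
# Sketch: print the load/push command pair for each offline archive.

# plan_import ARCHIVE IMAGE   (hypothetical helper)
# Prints the two nerdctl commands needed for one archive.
plan_import() {
  printf 'nerdctl -n k8s.io load -i %s\n' "$1"
  printf 'nerdctl -n k8s.io push %s\n' "$2"
}

# plan_all   (hypothetical helper)
# Archive/image pairs copied from the commands in this section.
plan_all() {
  while read -r archive image; do
    [ -n "$archive" ] && plan_import "$archive" "$image"
  done <<'EOF'
hami260.tar sealos.hub:5000/projecthami/hami:v2.6.0
scheduler1-28-9.tar sealos.hub:5000/google_containers/kube-scheduler:v1.28.9
kube-webhook-certgen.tar sealos.hub:5000/liangjw/kube-webhook-certgen:v1.1.1
EOF
}

plan_all
```

Once the printed commands look right, they can be piped to `sh` from the directory that holds the `.tar` files.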

## 4. Configure k8s

Run on the master node:

* Label the GPU nodes that need to be managed

```
kubectl label nodes {nodeid} gpu=on
```

* Deploy

```
    # cd helm
    # ./helm.sh
```

* Check that the pods started correctly with `kubectl get po -n kube-system`

Pods whose names begin with the following prefixes should be running:

- `hami-device-plugin`
- `hami-scheduler`
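
The pod check can be scripted by filtering the `kubectl get po` output. This is a minimal sketch: the helper name `pods_running` is hypothetical, and it assumes the default column layout of `kubectl get po -n kube-system --no-headers`, where the third column is STATUS.

```shell
#!/bin/sh
# Sketch: verify that the HAMi pods in kube-system are Running.

# pods_running PREFIX   (hypothetical helper)
# Reads `kubectl get po --no-headers` output on stdin. Succeeds when at
# least one pod name starts with PREFIX and every such pod is "Running".
pods_running() {
  awk -v p="$1" '
    index($1, p) == 1 { seen++; if ($3 != "Running") bad++ }
    END { exit ((seen > 0 && bad == 0) ? 0 : 1) }
  '
}

# Only talk to the cluster when kubectl is available.
if command -v kubectl >/dev/null 2>&1; then
  for prefix in hami-device-plugin hami-scheduler; do
    if kubectl get po -n kube-system --no-headers | pods_running "$prefix"; then
      echo "$prefix: Running"
    else
      echo "$prefix: missing or not Running"
    fi
  done
fi
```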

* Verify that the configuration succeeded

Run `kubectl describe no xxxx` (where `xxxx` is the node name) to inspect a node that has GPUs.  
The output should look like this:

```
Capacity:
      cpu:                                 112
      devices.kubevirt.io/kvm:             1k
      devices.kubevirt.io/tun:             1k
      devices.kubevirt.io/vhost-net:       1k
      ephemeral-storage:                   1872759360Ki
      memory:                              527804556Ki
      nvidia.com/gpu:                      40
      pods:                                110
```

If the output contains a line similar to `nvidia.com/gpu: 40`, the configuration is OK.
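
The same check can be done without reading the output by eye. The sketch below extracts the `nvidia.com/gpu` value from `kubectl describe node` output; the helper name `gpu_capacity` and the `NODE` environment variable are hypothetical and exist only for this example.

```shell
#!/bin/sh
# Sketch: read the nvidia.com/gpu capacity of one node.

# gpu_capacity   (hypothetical helper)
# Reads `kubectl describe node` output on stdin and prints the value of the
# first `nvidia.com/gpu:` line. Capacity is listed before Allocatable, so
# the first match is the Capacity entry.
gpu_capacity() {
  awk '$1 == "nvidia.com/gpu:" { print $2; exit }'
}

# NODE is an assumed environment variable holding the node name.
if command -v kubectl >/dev/null 2>&1 && [ -n "${NODE:-}" ]; then
  count=$(kubectl describe node "$NODE" | gpu_capacity)
  if [ -n "$count" ] && [ "$count" -gt 0 ]; then
    echo "$NODE exposes $count nvidia.com/gpu"
  else
    echo "$NODE exposes no nvidia.com/gpu resource"
  fi
fi
```

For example, `NODE=xxxx sh check-gpu.sh` would report the value for node `xxxx`, assuming the script is saved under that (arbitrary) file name.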

* Test GPU usage from a container (optional)
  
  ```
  cat <<EOF | kubectl apply -f -
  apiVersion: v1
  kind: Pod
  metadata:
    name: gpu-pod
  spec:
    restartPolicy: Never
    containers:
      - name: cuda-container
        image: nvcr.io/nvidia/k8s/cuda-sample:vectoradd-cuda10.2
        resources:
          limits:
            nvidia.com/gpu: 1 # requesting 1 GPU
            nvidia.com/gpumem: 1024 # allocate 1024 Mi (1 GiB) of GPU memory
    tolerations:
      - key: nvidia.com/gpu
        operator: Exists
        effect: NoSchedule
  EOF
  ```
  
  ```
  $ kubectl logs gpu-pod
    [Vector addition of 50000 elements]
    Copy input data from the host memory to the CUDA device
    CUDA kernel launch with 196 blocks of 256 threads
    Copy output data from the CUDA device to the host memory
    Test PASSED
    Done
  ```
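
To turn the log check above into a pass/fail result, the log can be searched for the `Test PASSED` line. A minimal sketch, assuming the test pod is named `gpu-pod` as in the manifest above; the helper name `vectoradd_passed` is hypothetical.

```shell
#!/bin/sh
# Sketch: decide from the pod log whether the vectorAdd sample passed.

# vectoradd_passed   (hypothetical helper)
# Reads the pod log on stdin; succeeds when it contains "Test PASSED".
vectoradd_passed() {
  grep -q 'Test PASSED'
}

# Only talk to the cluster when kubectl is available.
if command -v kubectl >/dev/null 2>&1; then
  if kubectl logs gpu-pod | vectoradd_passed; then
    echo "gpu-pod: GPU test passed"
  else
    echo "gpu-pod: GPU test did not pass (or the pod has not finished yet)"
  fi
fi
```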