# hami-local
Adapted for the V10 release: there is no need to pre-configure the number of slices per GPU, and each container can be assigned any amount of GPU memory in MiB.
This installer supports the full NVIDIA GPU lineup on x86. It provides GPU resources to k8s containers and allows every physical GPU to be shared, so a single physical GPU can be used by multiple pods at the same time.
## 1. Install the NVIDIA driver (version > 440)
- On every server that has a GPU, make sure `nvidia-smi` runs correctly and the driver version is > 440; the latest driver from NVIDIA's official site is recommended (a quick check is shown below).
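
As a minimal check on each node, assuming the driver is installed, `nvidia-smi` can report the version directly:

```
# print only the driver version, one line per GPU
nvidia-smi --query-gpu=driver_version --format=csv,noheader
```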
## 2. Configure the NVIDIA runtime
**Run the following on every GPU-equipped host that 头歌 needs to manage:**
* Clean up the old rpm packages
```
# rpm -qa|grep nvidia
```
* Remove every package found
```
# rpm -e <package-name>
```
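
If many packages turn up, the query can be piped straight into the removal; a sketch, assuming everything matched by the grep really should be removed:

```
# remove every installed rpm whose name contains "nvidia"
rpm -qa | grep nvidia | xargs -r rpm -e
```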
* Install nvidia-container-toolkit
scp the `*.rpm` packages to the relevant nodes, then run the following on each GPU node:
```
rpm -ivh *.rpm
nvidia-ctk runtime configure --runtime=containerd --set-as-default
systemctl restart containerd
```
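
To confirm the change took effect, inspect the containerd config that `nvidia-ctk` rewrites; `/etc/containerd/config.toml` is containerd's default path and may differ on a customized install:

```
# the nvidia runtime should appear and be set as the default
grep -n 'nvidia\|default_runtime_name' /etc/containerd/config.toml
```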
## 3. Import the offline images
Run on the master node:
```
# nerdctl -n k8s.io load -i hami250.tar
# nerdctl -n k8s.io push sealos.hub:5000/projecthami/hami:v2.5.0
# nerdctl -n k8s.io load -i scheduler1-28-9.tar
# nerdctl -n k8s.io push sealos.hub:5000/google_containers/kube-scheduler:v1.28.9
```
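
Assuming both tars were saved with the tags shown above, the loaded images can be verified before pushing:

```
# list the freshly loaded images in containerd's k8s.io namespace
nerdctl -n k8s.io images | grep -E 'hami|kube-scheduler'
```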
## 4. k8s configuration
Run on the master node:
* Label the GPU nodes that should be managed
```
kubectl label nodes {nodeid} gpu=on
```
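
For example, with a hypothetical node named `gpu-node-1`:

```
kubectl label nodes gpu-node-1 gpu=on
# confirm which nodes carry the label
kubectl get nodes -l gpu=on
```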
* Deploy
```
# cd helm
# ./helm.sh
```
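
Assuming `helm.sh` wraps a `helm install` of the HAMi chart into `kube-system` (where the pods appear in the next step), the release status can be checked afterwards:

```
# the hami release should show STATUS "deployed"
helm list -n kube-system
```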
* Check that the pods started correctly

```
kubectl get po -n kube-system
```

Verify that the pods whose names begin with the following prefixes are running:

```
hami-device-plugin
hami-scheduler
```
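
A narrower view that filters for just these pods:

```
kubectl get po -n kube-system | grep hami
```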
* Check that the configuration succeeded
Run `kubectl describe no xxxx` to view the information for a node that contains GPUs.
The output should look like this:
```
Capacity:
  cpu:                            112
  devices.kubevirt.io/kvm:        1k
  devices.kubevirt.io/tun:        1k
  devices.kubevirt.io/vhost-net:  1k
  ephemeral-storage:              1872759360Ki
  memory:                         527804556Ki
  nvidia.com/gpu:                 40
  pods:                           110
```
If it contains an entry like `nvidia.com/gpu: 40`, the setup is working.
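
The same value can be read directly, which is handier for scripting; `xxxx` is again the node name:

```
# dots in the resource name must be escaped inside jsonpath
kubectl get node xxxx -o jsonpath='{.status.capacity.nvidia\.com/gpu}'
```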
* Test GPU use from a container (optional)
```
cat <<EOF | kubectl apply -f -
apiVersion: v1
kind: Pod
metadata:
  name: gpu-pod
spec:
  restartPolicy: Never
  containers:
    - name: cuda-container
      image: nvcr.io/nvidia/k8s/cuda-sample:vectoradd-cuda10.2
      resources:
        limits:
          nvidia.com/gpu: 1 # requesting 1 GPU
          nvidia.com/gpumem: 1024 # allocate 1 GiB (1024 MiB) of GPU memory
  tolerations:
    - key: nvidia.com/gpu
      operator: Exists
      effect: NoSchedule
EOF
```
```
$ kubectl logs gpu-pod
[Vector addition of 50000 elements]
Copy input data from the host memory to the CUDA device
CUDA kernel launch with 196 blocks of 256 threads
Copy output data from the CUDA device to the host memory
Test PASSED
Done
```
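
Once the log shows `Test PASSED`, the test pod can be cleaned up:

```
kubectl delete pod gpu-pod
```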