# hami-local
Adapted for V10: there is no need to pre-configure how each GPU is sliced, and you can freely specify how many MiB of GPU memory each container is allocated.

This installer supports the full range of Nvidia GPUs on x86 and provides GPU resources to Kubernetes containers. Every physical GPU can be shared, so a single physical GPU can be used by multiple pods at the same time.
## 1. Install the Nvidia driver (version > 440)
- On every server with GPUs installed, make sure the `nvidia-smi` command runs normally. The driver version must be greater than 440; the latest driver from the Nvidia website is recommended. A quick batch check is sketched below.
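A minimal batch check, assuming SSH access to the GPU nodes; the hostnames `gpu-node1` and `gpu-node2` are placeholders:
```
# print the driver version reported by each GPU node (hostnames are placeholders)
for host in gpu-node1 gpu-node2; do
  echo "== $host =="
  ssh "$host" "nvidia-smi --query-gpu=driver_version --format=csv,noheader" \
    || echo "nvidia-smi failed on $host"
done
```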
## 2. Configure the Nvidia runtime
**Perform the following operations on every GPU-equipped host that needs to be managed by 头歌.**
* Clean up old rpm packages
```
# rpm -qa | grep nvidia
```
* Remove every package that was found
```
# rpm -e <package-name>
```
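If many old packages are installed, they can be removed in one pass; a sketch, to be used only after reviewing the `rpm -qa` output (dependency errors may require removing packages in order):
```
# remove every installed rpm whose name contains "nvidia"
rpm -qa | grep nvidia | xargs -r rpm -e
```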
* Install nvidia-container-toolkit
scp the `*.rpm` packages to the relevant nodes, then run the following on every GPU node:
```
rpm -ivh *.rpm
nvidia-ctk runtime configure --runtime=containerd --set-as-default
systemctl restart containerd
```
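To confirm the toolkit took effect, the generated containerd configuration can be inspected; a sketch, assuming the default config path `/etc/containerd/config.toml`:
```
# the nvidia runtime should be present and set as the default runtime
grep -n "nvidia" /etc/containerd/config.toml
grep -n "default_runtime_name" /etc/containerd/config.toml
# containerd should be active again after the restart
systemctl is-active containerd
```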
## 3. Import the offline image packages
Run the following on the master node:
```
# nerdctl -n k8s.io load -i hami250.tar
# nerdctl -n k8s.io push sealos.hub:5000/projecthami/hami:v2.5.0
# nerdctl -n k8s.io load -i scheduler1-28-9.tar
# nerdctl -n k8s.io push sealos.hub:5000/google_containers/kube-scheduler:v1.28.9
```
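A quick sanity check that the images were loaded and pushed; a sketch, assuming the sealos registry at `sealos.hub:5000` is reachable over plain HTTP (it may also require the registry credentials configured by sealos):
```
# both images should appear in the k8s.io namespace after the load
nerdctl -n k8s.io images | grep -E "hami|kube-scheduler"
# list the repositories the local registry now serves
curl -s http://sealos.hub:5000/v2/_catalog
```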
## 4. Kubernetes configuration
Run the following on the master node.
* Label the GPU nodes that need to be managed (a verification one-liner follows the command below):
```
kubectl label nodes {nodeid} gpu=on
```
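To verify the label, list the nodes that carry it (no assumptions beyond the label set above):
```
# every GPU node that should be managed must show up here
kubectl get nodes -l gpu=on
```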
* Deploy:
```
# cd helm
# ./helm.sh
```
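For reference only: `helm.sh` is expected to install the HAMi chart offline, using the images imported in step 3. With internet access, the upstream equivalent would look roughly like the sketch below (chart repo and value names follow the public HAMi documentation; the values actually set by `helm.sh` may differ):
```
# online reference install of the HAMi chart (not necessarily what helm.sh runs)
helm repo add hami-charts https://project-hami.github.io/HAMi/
helm install hami hami-charts/hami \
  --set scheduler.kubeScheduler.imageTag=v1.28.9 \
  -n kube-system
```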
* Check that the pods started correctly: run `kubectl get po -n kube-system` and confirm that the pods whose names begin with `hami-device-plugin` and `hami-scheduler` are running.
* Check that the configuration succeeded
Use `kubectl describe no xxxx` to view the information of a node that contains GPUs.
The output should look like this:
```
Capacity:
  cpu:                           112
  devices.kubevirt.io/kvm:       1k
  devices.kubevirt.io/tun:       1k
  devices.kubevirt.io/vhost-net: 1k
  ephemeral-storage:             1872759360Ki
  memory:                        527804556Ki
  nvidia.com/gpu:                40
  pods:                          110
```
If the output contains an entry like `nvidia.com/gpu: 40`, the configuration is OK.
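The same check can be scripted by reading the extended resource straight from the node object; `<node-name>` is a placeholder:
```
# number of shareable virtual GPUs the device plugin registered on the node
kubectl get node <node-name> -o jsonpath='{.status.capacity.nvidia\.com/gpu}'; echo
```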
* Test using a GPU in a container (optional)
```
cat <<EOF | kubectl apply -f -
apiVersion: v1
kind: Pod
metadata:
  name: gpu-pod
spec:
  restartPolicy: Never
  containers:
    - name: cuda-container
      image: nvcr.io/nvidia/k8s/cuda-sample:vectoradd-cuda10.2
      resources:
        limits:
          nvidia.com/gpu: 1       # requesting 1 GPU
          nvidia.com/gpumem: 1024 # allocates 1024 MiB of GPU memory
  tolerations:
    - key: nvidia.com/gpu
      operator: Exists
      effect: NoSchedule
EOF
```
```
$ kubectl logs gpu-pod
[Vector addition of 50000 elements]
Copy input data from the host memory to the CUDA device
CUDA kernel launch with 196 blocks of 256 threads
Copy output data from the CUDA device to the host memory
Test PASSED
Done
```
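Since sharing is the point of HAMi, a quick way to see several pods land on the same physical GPU is to give each one only a slice of GPU memory. The sketch below creates two such pods; the pod names and the 1024 MiB figure are arbitrary:
```
for i in 1 2; do
cat <<EOF | kubectl apply -f -
apiVersion: v1
kind: Pod
metadata:
  name: gpu-share-demo-$i
spec:
  restartPolicy: Never
  containers:
    - name: cuda-container
      image: nvcr.io/nvidia/k8s/cuda-sample:vectoradd-cuda10.2
      resources:
        limits:
          nvidia.com/gpu: 1       # one virtual GPU per pod
          nvidia.com/gpumem: 1024 # a 1024 MiB slice of GPU memory
EOF
done
# even on a node with a single physical GPU, both pods should be scheduled
kubectl get po -o wide | grep gpu-share-demo
```
Both pods run the same vectoradd sample; once they have completed, `kubectl delete po gpu-share-demo-1 gpu-share-demo-2` cleans them up.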
