# hami-local
Adapted for the V10 release: there is no need to pre-configure the number of slices per GPU, and each container can be assigned any amount of GPU memory in MiB.
This installer supports the full NVIDIA GPU lineup on x86. It provides GPU resources to k8s containers and allows every physical GPU to be shared, so a single physical GPU can be used by multiple pods at the same time.
## 1. Install the NVIDIA driver (version > 440)
- On every server that has a GPU, make sure `nvidia-smi` runs correctly and the driver version is > 440; the latest driver from NVIDIA's official site is recommended (a quick check is shown below).
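
As a minimal check on each node, assuming the driver is installed, `nvidia-smi` can report the version directly:

```
# print only the driver version, one line per GPU
nvidia-smi --query-gpu=driver_version --format=csv,noheader
```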
## 2. Configure the NVIDIA runtime
**Run the following on every GPU-equipped host that 头歌 needs to manage:**
* Clean up the old rpm packages
```
# rpm -qa|grep nvidia
```
* Remove every package found
```
# rpm -e <package-name>
```
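
If many packages turn up, the query can be piped straight into the removal; a sketch, assuming everything matched by the grep really should be removed:

```
# remove every installed rpm whose name contains "nvidia"
rpm -qa | grep nvidia | xargs -r rpm -e
```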
* Install nvidia-container-toolkit
scp the `*.rpm` packages to the relevant nodes, then run the following on each GPU node:
```
rpm -ivh *.rpm
nvidia-ctk runtime configure --runtime=containerd --set-as-default
systemctl restart containerd
```
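
To confirm the change took effect, inspect the containerd config that `nvidia-ctk` rewrites; `/etc/containerd/config.toml` is containerd's default path and may differ on a customized install:

```
# the nvidia runtime should appear and be set as the default
grep -n 'nvidia\|default_runtime_name' /etc/containerd/config.toml
```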
## 3. Import the offline images
Run on the master node:
```
# nerdctl -n k8s.io load -i hami250.tar
# nerdctl -n k8s.io push sealos.hub:5000/projecthami/hami:v2.5.0
# nerdctl -n k8s.io load -i scheduler1-28-9.tar
# nerdctl -n k8s.io push sealos.hub:5000/google_containers/kube-scheduler:v1.28.9
```
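
Assuming both tars were saved with the tags shown above, the loaded images can be verified before pushing:

```
# list the freshly loaded images in containerd's k8s.io namespace
nerdctl -n k8s.io images | grep -E 'hami|kube-scheduler'
```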
## 4. k8s configuration
Run on the master node:
* Label the GPU nodes that should be managed
```
kubectl label nodes {nodeid} gpu=on
```
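
For example, with a hypothetical node named `gpu-node-1`:

```
kubectl label nodes gpu-node-1 gpu=on
# confirm which nodes carry the label
kubectl get nodes -l gpu=on
```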
* Deploy
```
# cd helm
# ./helm.sh
```
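
Assuming `helm.sh` wraps a `helm install` of the HAMi chart into `kube-system` (where the pods appear in the next step), the release status can be checked afterwards:

```
# the hami release should show STATUS "deployed"
helm list -n kube-system
```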
* Check that the pods started correctly

```
kubectl get po -n kube-system
```

Verify that the pods whose names begin with the following prefixes are running:

```
hami-device-plugin
hami-scheduler
```
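
A narrower view that filters for just these pods:

```
kubectl get po -n kube-system | grep hami
```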
* Check that the configuration succeeded
Run `kubectl describe no xxxx` to view the information for a node that contains GPUs.
The output should look like this:
```
Capacity:
  cpu:                            112
  devices.kubevirt.io/kvm:        1k
  devices.kubevirt.io/tun:        1k
  devices.kubevirt.io/vhost-net:  1k
  ephemeral-storage:              1872759360Ki
  memory:                         527804556Ki
  nvidia.com/gpu:                 40
  pods:                           110
```
If it contains an entry like `nvidia.com/gpu: 40`, the setup is working.
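
The same value can be read directly, which is handier for scripting; `xxxx` is again the node name:

```
# dots in the resource name must be escaped inside jsonpath
kubectl get node xxxx -o jsonpath='{.status.capacity.nvidia\.com/gpu}'
```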
* Test GPU use from a container (optional)
```
cat <<EOF | kubectl apply -f -
apiVersion: v1
kind: Pod
metadata:
  name: gpu-pod
spec:
  restartPolicy: Never
  containers:
    - name: cuda-container
      image: nvcr.io/nvidia/k8s/cuda-sample:vectoradd-cuda10.2
      resources:
        limits:
          nvidia.com/gpu: 1 # requesting 1 GPU
          nvidia.com/gpumem: 1024 # allocate 1 GiB (1024 MiB) of GPU memory
  tolerations:
    - key: nvidia.com/gpu
      operator: Exists
      effect: NoSchedule
EOF
```
```
$ kubectl logs gpu-pod
[Vector addition of 50000 elements]
Copy input data from the host memory to the CUDA device
CUDA kernel launch with 196 blocks of 256 threads
Copy output data from the CUDA device to the host memory
Test PASSED
Done
```
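
Once the log shows `Test PASSED`, the test pod can be cleaned up:

```
kubectl delete pod gpu-pod
```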