hami-local
==========
Adapted for V10: there is no need to pre-configure GPU slicing counts, and each container can be assigned an arbitrary amount of GPU memory in MiB.
This installer supports the full range of NVIDIA GPUs in x86 environments. It provides GPU resources to Kubernetes containers and allows each physical GPU to be shared, so that one physical GPU can be used by multiple pods at the same time.
1. Install the NVIDIA driver (> 440)
- On every server with GPUs installed, make sure the nvidia-smi command runs correctly and the driver version is above 440; the latest driver from the official NVIDIA site is recommended.
- Clone this repository to the master node.
# git clone https://bdgit.educoder.net/pqyr82v69/hami-local.git --depth=1
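For a quick check of the installed driver version on each GPU node, nvidia-smi can report it directly, for example:
# nvidia-smi --query-gpu=driver_version --format=csv,noheader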
2. Configure the NVIDIA Runtime
On every GPU-equipped host that is to be managed by 头歌, perform the following:
- List the old NVIDIA rpm packages
# rpm -qa|grep nvidia
- Remove every package that was found
# rpm -e <package-name>
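If many packages turn up, the listing and removal can be combined into one pipeline (a sketch, assuming GNU xargs with the -r flag is available):
# rpm -qa | grep nvidia | xargs -r rpm -e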
- Install nvidia-container-toolkit
scp the *.rpm files to the relevant nodes, then run the following on every GPU node:
# rpm -ivh *.rpm
# nvidia-ctk runtime configure --runtime=containerd --set-as-default
# systemctl restart containerd
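To confirm that NVIDIA is now the default runtime (assuming containerd's configuration sits at the default /etc/containerd/config.toml), inspect the setting written by nvidia-ctk:
# grep default_runtime_name /etc/containerd/config.toml
The output should show default_runtime_name = "nvidia".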
3. Import the offline image archives
On the master node, run:
# nerdctl -nk8s.io load -i hami260.tar
# nerdctl -nk8s.io push sealos.hub:5000/projecthami/hami:v2.6.0
# nerdctl -nk8s.io load -i scheduler1-28-9.tar
# nerdctl -nk8s.io push sealos.hub:5000/google_containers/kube-scheduler:v1.28.9
# nerdctl -nk8s.io load -i kube-webhook-certgen.tar
# nerdctl -nk8s.io push sealos.hub:5000/liangjw/kube-webhook-certgen:v1.1.1
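To verify that all three images were loaded into the k8s.io namespace, list them, for example:
# nerdctl -nk8s.io images | grep -E 'hami|kube-scheduler|kube-webhook-certgen'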
4. Kubernetes configuration
On the master node, run:
- Label the GPU nodes that need to be managed
kubectl label nodes {nodeid} gpu=on
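You can confirm which nodes carry the label with:
kubectl get nodes -l gpu=on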
- Deploy
# cd helm
# ./helm.sh
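If helm.sh wraps a standard Helm install (an assumption; check the script's contents), the resulting release can be listed with:
# helm list -A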
- Check that the pods started correctly: kubectl get po -n kube-system
Verify that pods whose names begin with the following are running:
hami-device-plugin hami-scheduler
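A one-liner that filters the pod list down to just these components:
kubectl get po -n kube-system | grep -E 'hami-device-plugin|hami-scheduler'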
- Check whether the configuration succeeded
Run kubectl describe no xxxx
to view the details of a GPU-equipped node. The output should include:
Capacity:
cpu: 112
devices.kubevirt.io/kvm: 1k
devices.kubevirt.io/tun: 1k
devices.kubevirt.io/vhost-net: 1k
ephemeral-storage: 1872759360Ki
memory: 527804556Ki
nvidia.com/gpu: 40
pods: 110
If the output contains an entry like nvidia.com/gpu: 40, the configuration is working.
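To extract just that capacity field, a jsonpath query works (the dots in the resource name must be escaped):
kubectl get node xxxx -o jsonpath='{.status.capacity.nvidia\.com/gpu}'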
- Test GPU use from a container (optional)
cat <<EOF | kubectl apply -f -
apiVersion: v1
kind: Pod
metadata:
  name: gpu-pod
spec:
  restartPolicy: Never
  containers:
    - name: cuda-container
      image: nvcr.io/nvidia/k8s/cuda-sample:vectoradd-cuda10.2
      resources:
        limits:
          nvidia.com/gpu: 1 # requesting 1 GPU
          nvidia.com/gpumem: 1024 # allocates 1 GiB of GPU memory
  tolerations:
    - key: nvidia.com/gpu
      operator: Exists
      effect: NoSchedule
EOF
$ kubectl logs gpu-pod
[Vector addition of 50000 elements]
Copy input data from the host memory to the CUDA device
CUDA kernel launch with 196 blocks of 256 threads
Copy output data from the CUDA device to the host memory
Test PASSED
Done
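Because HAMi shares physical GPUs, two pods can each claim a slice of GPU memory and may be scheduled onto the same card. A minimal sketch (pod names and memory sizes are illustrative):
cat <<EOF | kubectl apply -f -
apiVersion: v1
kind: Pod
metadata:
  name: gpu-share-a
spec:
  restartPolicy: Never
  containers:
    - name: cuda
      image: nvcr.io/nvidia/k8s/cuda-sample:vectoradd-cuda10.2
      resources:
        limits:
          nvidia.com/gpu: 1       # one (shared) GPU
          nvidia.com/gpumem: 2048 # a 2 GiB slice of GPU memory
---
apiVersion: v1
kind: Pod
metadata:
  name: gpu-share-b
spec:
  restartPolicy: Never
  containers:
    - name: cuda
      image: nvcr.io/nvidia/k8s/cuda-sample:vectoradd-cuda10.2
      resources:
        limits:
          nvidia.com/gpu: 1       # one (shared) GPU
          nvidia.com/gpumem: 2048 # another 2 GiB slice; may land on the same card
EOF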