
hami-local
==========

Adapted for the V10 version: no GPU slice count needs to be configured in advance; you can freely specify how many MiB of GPU memory each container is allocated.

This installer supports the full range of NVIDIA GPUs in x86 environments and provides GPU resources to Kubernetes containers. It allows each physical GPU to be shared, so a single physical GPU can be used by multiple pods at the same time.

1. Install the NVIDIA driver (version > 440)

  • On every server with a GPU installed, make sure the nvidia-smi command runs correctly and the driver version is > 440; the latest driver from NVIDIA's website is recommended (see the check sketch after this list).

  • Clone this repository to the master node.

# git clone https://bdgit.educoder.net/pqyr82v69/hami-local.git --depth=1
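
A minimal driver-check sketch, assuming passwordless SSH from the master to each GPU node and a hypothetical gpu-nodes.txt file listing their hostnames; both the file name and the SSH setup are assumptions, not part of this installer:

    # Sketch: print the driver version on every GPU node (should be > 440).
    # gpu-nodes.txt is a hypothetical file, one hostname per line.
    while read -r node; do
      echo "== ${node} =="
      # -n keeps ssh from consuming the loop's stdin.
      ssh -n "${node}" nvidia-smi --query-gpu=driver_version --format=csv,noheader
    done < gpu-nodes.txt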

2. Configure the NVIDIA runtime

On every GPU-equipped host that 头歌 (TouGe) needs to manage, perform the following steps:

  • Clean up the old rpm packages:

    # rpm -qa | grep nvidia
    
  • Remove every package found (a batch-removal sketch follows after this step's command blocks):

      # rpm -e <package-name>
    
  • Install nvidia-container-toolkit: scp the *.rpm files to the relevant nodes, then run the following on every GPU node:

    rpm -ivh *.rpm
    nvidia-ctk runtime configure --runtime=containerd --set-as-default
    systemctl restart containerd
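
For the removal bullet above, a one-pass sketch (run it before installing the new packages), plus a hedged post-configuration check; /etc/containerd/config.toml is containerd's default config path:

    # Sketch: remove all old nvidia rpm packages in one pass (before reinstalling).
    rpm -qa | grep -i nvidia | xargs -r rpm -e

    # After nvidia-ctk has run, confirm containerd picked up the nvidia runtime.
    grep -n 'nvidia' /etc/containerd/config.toml
    systemctl is-active containerd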
    

3. Load the offline image archives

Run the following on the master node:

    # nerdctl -n k8s.io load -i hami260.tar
    # nerdctl -n k8s.io push sealos.hub:5000/projecthami/hami:v2.6.0
    # nerdctl -n k8s.io load -i scheduler1-28-9.tar
    # nerdctl -n k8s.io push sealos.hub:5000/google_containers/kube-scheduler:v1.28.9
    # nerdctl -n k8s.io load -i kube-webhook-certgen.tar
    # nerdctl -n k8s.io push sealos.hub:5000/liangjw/kube-webhook-certgen:v1.1.1
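
A quick check sketch that the images landed in containerd's k8s.io namespace (sealos.hub:5000 is the local registry the commands above push to):

    # List the loaded images; all three should appear.
    nerdctl -n k8s.io images | grep -E 'hami|kube-scheduler|kube-webhook-certgen'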

4. Kubernetes configuration

Run the following on the master node:

  • Label every GPU node that needs to be managed:

    # kubectl label nodes {nodeid} gpu=on

  • Deploy:

    # cd helm
    # ./helm.sh

  • Check that the pods started correctly (see the verification sketch below):

    # kubectl get po -n kube-system

Check that the pods whose names begin with the following are running:

hami-device-plugin hami-scheduler
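
A minimal verification sketch combining the label check and the pod check:

    # Nodes labeled for HAMi management:
    kubectl get nodes -l gpu=on

    # HAMi pods in kube-system; all should be Running.
    kubectl get po -n kube-system | grep -E 'hami-device-plugin|hami-scheduler'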

  • Check whether the configuration succeeded

Run kubectl describe no xxxx to view the details of a node that contains GPUs.
The output should include:

Capacity:
      cpu:                                 112
      devices.kubevirt.io/kvm:             1k
      devices.kubevirt.io/tun:             1k
      devices.kubevirt.io/vhost-net:       1k
      ephemeral-storage:                   1872759360Ki
      memory:                              527804556Ki
      nvidia.com/gpu:                      40
      pods:                                110

If the output contains something like nvidia.com/gpu: 40, the setup is working.
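
A scriptable equivalent sketch (the backslash escapes the dots inside the resource name for kubectl's jsonpath; xxxx is the node name as above):

    # Non-empty output means the HAMi device plugin registered the GPUs.
    kubectl get node xxxx -o jsonpath='{.status.capacity.nvidia\.com/gpu}'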

  • Test GPU use from a container (optional)

      cat <<EOF | kubectl apply -f -
      apiVersion: v1
      kind: Pod
      metadata:
        name: gpu-pod
      spec:
        restartPolicy: Never
        containers:
          - name: cuda-container
            image: nvcr.io/nvidia/k8s/cuda-sample:vectoradd-cuda10.2
            resources:
              limits:
                nvidia.com/gpu: 1 # requesting 1 GPU
                nvidia.com/gpumem: 1024 # allocate 1024 MiB (1 GiB) of GPU memory
        tolerations:
        - key: nvidia.com/gpu
          operator: Exists
          effect: NoSchedule
      EOF
    
    $ kubectl logs gpu-pod
      [Vector addition of 50000 elements]
      Copy input data from the host memory to the CUDA device
      CUDA kernel launch with 196 blocks of 256 threads
      Copy output data from the CUDA device to the host memory
      Test PASSED
      Done
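
To also confirm that the gpumem cap is enforced, a hedged sketch: a pod that requests 1024 MiB and simply runs nvidia-smi, whose output inside the container should report a total of roughly 1024 MiB. The image tag here is an assumption; any image that ships nvidia-smi will do:

      cat <<EOF | kubectl apply -f -
      apiVersion: v1
      kind: Pod
      metadata:
        name: gpumem-check
      spec:
        restartPolicy: Never
        containers:
          - name: smi
            image: nvcr.io/nvidia/cuda:12.2.0-base-ubuntu22.04 # assumed tag; any image with nvidia-smi works
            command: ["nvidia-smi"]
            resources:
              limits:
                nvidia.com/gpu: 1
                nvidia.com/gpumem: 1024
      EOF

    $ kubectl logs gpumem-check   # the memory column should show about 1024MiB total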