# CPU Inference

When GPU resources are limited, GPUStack can offload some of a model's layers to the CPU, and it falls back to full CPU inference when no GPU is available.

*(Screenshot: VRAM Lack)*

When deploying the model, enable the **Allow CPU Offloading** option to allow layers to be placed on the CPU.

*(Screenshot: Allow CPU Offload)*

Once the deployment is complete, you can see how many layers have been offloaded to the CPU.

*(Screenshot: CPU Offload)*
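For reference, llama.cpp-style backends (such as those GPUStack uses for GGUF models) typically report the GPU/CPU layer split in the model instance's startup log. An illustrative excerpt, where the layer counts will vary with the model and the available VRAM:

```
llm_load_tensors: offloaded 20/33 layers to GPU
```

Here 20 of the model's 33 layers run on the GPU, and the remaining 13 are computed on the CPU.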

Next, you can test the model's inference performance in the Playground.

*(Screenshot: CPU Inference)*
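Besides the Playground UI, you can exercise the same deployment programmatically. Below is a minimal sketch using the official `openai` Python client against GPUStack's OpenAI-compatible API; the server URL, API key, and model name are placeholders to replace with the values from your own deployment.

```python
from openai import OpenAI

# Placeholders: substitute your GPUStack server URL, an API key created
# in GPUStack, and the name you gave the model when deploying it.
client = OpenAI(
    base_url="http://your-gpustack-server/v1",  # adjust to your server's OpenAI-compatible endpoint
    api_key="your-api-key",
)

# Send a simple chat request to the CPU-offloaded model.
response = client.chat.completions.create(
    model="your-model-name",
    messages=[{"role": "user", "content": "Briefly explain CPU offloading."}],
)
print(response.choices[0].message.content)
```

Response latency through this API reflects the same GPU/CPU split as in the Playground, so it is a convenient way to benchmark inference performance with CPU offloading enabled.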