- install GCC if the Ubuntu version is older than 21.04
- link llama-box-rpc-server via directory-relative linking
- build Python from source instead of installing it
- install MindIE via a pipx venv
- use a multi-stage build to install vLLM in parallel
- remove the running envs of vLLM instances
Signed-off-by: thxCode <thxcode0824@gmail.com>
- accept new arguments: --swa-full, --max-projected-cache
- add exception handling for specific versions
Signed-off-by: thxCode <thxcode0824@gmail.com>
@@ -34,7 +35,7 @@ GPUStack is an open-source GPU cluster manager for running AI models.
- **Broad GPU Compatibility:** Seamlessly supports GPUs from various vendors across Apple Macs, Windows PCs, and Linux servers.
- **Extensive Model Support:** Supports a wide range of models including LLMs, VLMs, image models, audio models, embedding models, and rerank models.
- **Flexible Inference Backends:** Flexibly integrates with multiple inference backends including llama-box (llama.cpp & stable-diffusion.cpp), vox-box, vLLM, and Ascend MindIE.
- **Multi-Version Backend Support:** Run multiple versions of inference backends concurrently to meet the diverse runtime requirements of different models.
- **Distributed Inference:** Supports single-node and multi-node multi-GPU inference, including heterogeneous GPUs across vendors and runtime environments.
- **Scalable GPU Architecture:** Easily scale up by adding more GPUs or nodes to your infrastructure.
"A minion holding a sign that says 'GPUStack'. The background is filled with futuristic elements like neon lights, circuit boards, and holographic displays. The minion is wearing a tech-themed outfit, possibly with LED lights or digital patterns. The sign itself has a sleek, modern design with glowing edges. The overall atmosphere is high-tech and vibrant, with a mix of dark and neon colors." \
GPUStack supports a mix of Linux, Windows, and macOS nodes, as well as x86_64 and arm64 architectures. It also supports various GPUs, including NVIDIA, Apple Metal, AMD, Ascend, Hygon, and Moore Threads.
When running GPUStack with Docker, it works out of the box in an air-gapped environment as long as the Docker images are available. To do this, follow these steps:
1. Pull the GPUStack Docker image in an online environment:
```bash
docker pull gpustack/gpustack:latest-corex
```
If your online environment differs from the air-gapped environment in terms of OS or arch, specify the OS and arch of the air-gapped environment when pulling the image:
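For example, a sketch assuming the air-gapped hosts are arm64 Linux (substitute the platform of your own target environment):

```bash
docker pull --platform linux/arm64 gpustack/gpustack:latest-corex
```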
Run the following command to start the GPUStack server **and built-in worker** (host network mode is recommended):
```bash
docker run -d --name gpustack \
-v /lib/modules:/lib/modules \
-v /dev:/dev \
--privileged \
--cap-add=ALL \
--pid=host \
--restart=unless-stopped \
--network=host \
--ipc=host \
-v gpustack-data:/var/lib/gpustack \
gpustack/gpustack:latest-corex
```
If you need to change the default server port 80, please use the `--port` parameter:
```bash
docker run -d --name gpustack \
-v /lib/modules:/lib/modules \
-v /dev:/dev \
--privileged \
--cap-add=ALL \
--pid=host \
--restart=unless-stopped \
--network=host \
--ipc=host \
-v gpustack-data:/var/lib/gpustack \
gpustack/gpustack:latest-corex \
--port 9090
```
If other ports are in conflict, or if you want to customize startup options, refer to the [CLI Reference](../../cli-reference/start.md) for available flags and configuration instructions.
Check if the startup logs are normal:
```bash
docker logs -f gpustack
```
If the logs are normal, open `http://your_host_ip` in the browser to access the GPUStack UI. Log in to GPUStack with username `admin` and the default password. You can run the following command to get the password for the default setup:
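For example, assuming the container name and data volume from the commands above (the password file path may differ between GPUStack versions):

```bash
docker exec -it gpustack cat /var/lib/gpustack/initial_admin_password
```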
You can add more GPU nodes to GPUStack to form a GPU cluster. You need to add workers on other GPU nodes and specify the `--server-url` and `--token` parameters to join GPUStack.
To get the token used for adding workers, run the following command on the GPUStack **server node**:
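A sketch assuming the server container is named `gpustack` as above; the token file path under the data directory is an assumption and may vary by version:

```bash
docker exec -it gpustack cat /var/lib/gpustack/token
```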
To start GPUStack as a worker, and **register it with the GPUStack server**, run the following command on the **worker node**. Be sure to replace the URL and token with your specific values:
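A minimal sketch that mirrors the server command above, with the worker registration flags appended; `http://your_gpustack_url` and `your_gpustack_token` are placeholders:

```bash
docker run -d --name gpustack \
-v /lib/modules:/lib/modules \
-v /dev:/dev \
--privileged \
--cap-add=ALL \
--pid=host \
--restart=unless-stopped \
--network=host \
--ipc=host \
-v gpustack-data:/var/lib/gpustack \
gpustack/gpustack:latest-corex \
--server-url http://your_gpustack_url \
--token your_gpustack_token
```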
1. **Heterogeneous clusters are supported.** Regardless of the device type, you can add it to the current GPUStack as a worker by specifying the `--server-url` and `--token` parameters.
2. You can set additional flags for the `gpustack start` command by appending them to the `docker run` command.
For configuration details, please refer to the [CLI Reference](../../cli-reference/start.md).
@@ -8,7 +8,7 @@ GPUStack is an open-source GPU cluster manager for running AI models.
- **Broad GPU Compatibility:** Seamlessly supports GPUs from various vendors across Apple Macs, Windows PCs, and Linux servers.
- **Extensive Model Support:** Supports a wide range of models including LLMs, VLMs, image models, audio models, embedding models, and rerank models.
- **Flexible Inference Backends:** Flexibly integrates with multiple inference backends including llama-box (llama.cpp & stable-diffusion.cpp), vox-box, vLLM, and Ascend MindIE.
- **Multi-Version Backend Support:** Run multiple versions of inference backends concurrently to meet the diverse runtime requirements of different models.
- **Distributed Inference:** Supports single-node and multi-node multi-GPU inference, including heterogeneous GPUs across vendors and runtime environments.
- **Scalable GPU Architecture:** Easily scale up by adding more GPUs or nodes to your infrastructure.
@@ -35,6 +35,7 @@ GPUStack is an open-source GPU cluster manager for running AI models.
- [x] Ascend CANN
- [x] Hygon DTK
- [x] Moore Threads MUSA
- [x] Iluvatar Corex
We plan to support the following accelerators in future releases.
@@ -75,12 +75,12 @@ For more details, please refer to [vLLM documentation](https://docs.vllm.ai/en/s
### Supported Platforms
The vLLM backend works on Linux.
!!! Note
1. When users install GPUStack on amd64 Linux using the installation script, vLLM is automatically installed.
2. When users deploy a model using the vLLM backend, GPUStack sets worker label selectors to `{"os": "linux"}` by default to ensure the model instance is scheduled to proper workers. You can customize the worker label selectors in the model configuration.
description: DeepSeek-R1-0528-Qwen3-8B is a post-trained model derived by distilling the chain-of-thought reasoning patterns from DeepSeek-R1-0528 into the Qwen3 8B Base model. As a result, it achieves state-of-the-art (SOTA) performance among open-source models on the AIME 2024 benchmark, outperforming the original Qwen3 8B by 10.0% and reaching the level of Qwen3-235B-thinking.
description: DeepSeek-R1-0528 is a minor version of the DeepSeek R1 model that features enhanced reasoning depth and inference capabilities. These improvements are achieved through increased computational resources and algorithmic optimizations applied during post-training. The model delivers strong performance across a range of benchmark evaluations, including mathematics, programming, and general logic, with overall capabilities approaching those of leading models such as O3 and Gemini 2.5 Pro.
home: https://www.deepseek.com
icon: /static/catalog_icons/deepseek.png
categories:
  - llm
capabilities:
  - context/128K
sizes:
  - 671
licenses:
  - mit
release_date: "2025-05-28"
templates:
  - quantizations:
      - UD-IQ1_M
      - UD-IQ1_S
      - UD-Q2_K_XL
      - UD-Q3_K_XL
      - Q4_K_M
      - Q8_0
      - BF16
    source: huggingface
    huggingface_repo_id: unsloth/DeepSeek-R1-0528-GGUF
    huggingface_filename: "*-{quantization}*.gguf"
    replicas: 1
    backend: llama-box
    backend_parameters:
      # give R1 more default context to think
      - --ctx-size=32768
      # recommended temperature and top_p for R1
      - --temp=0.6
      - --top-p=0.95
    cpu_offloading: true
    distributed_inference_across_workers: true
  - quantizations: ["FP8"]
    source: huggingface
    huggingface_repo_id: deepseek-ai/DeepSeek-R1-0528
    replicas: 1
    backend: vllm
    backend_parameters:
      - --trust-remote-code
      - --max-model-len=32768
  - quantizations: ["BF16"]
    source: huggingface
    huggingface_repo_id: unsloth/DeepSeek-R1-0528-BF16
    replicas: 1
    backend: vllm
    backend_parameters:
      - --trust-remote-code
      - --max-model-len=32768
- name: Deepseek R1
  description: DeepSeek's first-generation reasoning model that delivers superior performance in math, code, and reasoning tasks. It effectively overcomes reasoning challenges and achieves performance comparable to OpenAI-o1 across various benchmarks. This includes six dense models distilled from DeepSeek-R1 based on Llama and Qwen.