# GPUStack


GPUStack is an open-source GPU cluster manager for running large language models (LLMs).

## Key Features

- Supports a Wide Variety of Hardware: Run with different brands of GPUs in Apple MacBooks, Windows PCs, and Linux servers.
- Scales with Your GPU Inventory: Easily add more GPUs or nodes to scale up your operations.
- Distributed Inference: Supports both single-node multi-GPU and multi-node inference and serving.
- Multiple Inference Backends: Supports llama-box (llama.cpp) and vLLM as inference backends.
- Lightweight Python Package: Minimal dependencies and operational overhead.
- OpenAI-Compatible APIs: Serves APIs that are compatible with OpenAI standards.
- User and API Key Management: Simplified management of users and API keys.
- GPU Metrics Monitoring: Monitor GPU performance and utilization in real time.
- Token Usage and Rate Metrics: Track token usage and manage rate limits effectively.

## Supported Platforms

- macOS
- Windows
- Linux

The following Linux distributions are verified to work with GPUStack:

| Distributions | Versions       |
| ------------- | -------------- |
| Ubuntu        | >= 20.04       |
| Debian        | >= 11          |
| RHEL          | >= 8           |
| Rocky         | >= 8           |
| Fedora        | >= 36          |
| OpenSUSE      | >= 15.3 (leap) |
| OpenEuler     | >= 22.03       |

!!! note

    Installing the GPUStack worker on a Linux system requires GLIBC 2.29 or higher.
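
As a quick way to verify this requirement before installing a worker, the snippet below checks the local GLIBC version using Python's standard library. It is an illustrative sketch only and is not part of GPUStack itself.

```python
# Illustrative check of the local GLIBC version before installing a GPUStack worker.
import platform

libc_name, libc_version = platform.libc_ver()
print(f"Detected libc: {libc_name} {libc_version}")

# GPUStack workers on Linux require GLIBC >= 2.29.
if libc_name == "glibc":
    major, minor = (int(part) for part in libc_version.split(".")[:2])
    if (major, minor) < (2, 29):
        print("GLIBC is older than 2.29; this system cannot run the GPUStack worker.")
    else:
        print("GLIBC version is sufficient for the GPUStack worker.")
```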

## Supported Accelerators

We plan to support the following accelerators in future releases.

- AMD ROCm
- Intel oneAPI
- MTHREADS MUSA
- Qualcomm AI Engine

## Supported Models

GPUStack uses llama.cpp and vLLM as the backends and supports a wide range of models. Models from the following sources are supported:

1. Hugging Face
2. ModelScope
3. Ollama Library

Example language models:

Example multimodal models:

For the full list of supported models, please refer to the supported models section in the inference backends documentation.

## OpenAI-Compatible APIs

GPUStack serves OpenAI-compatible APIs. For details, please refer to OpenAI-Compatible APIs.
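
Because the APIs follow the OpenAI standard, existing OpenAI client libraries can be pointed at a GPUStack server. The sketch below uses the official `openai` Python package; the base URL, API key, and model name are placeholders, so substitute the endpoint path and credentials from your own deployment.

```python
# Minimal sketch: querying a GPUStack deployment through the standard OpenAI Python client.
# The base_url, api_key, and model values are placeholders for illustration only.
from openai import OpenAI

client = OpenAI(
    base_url="http://your-gpustack-server/v1",  # placeholder: your GPUStack OpenAI-compatible endpoint
    api_key="your-gpustack-api-key",            # placeholder: an API key created in GPUStack
)

response = client.chat.completions.create(
    model="your-deployed-model",  # placeholder: the name of a model deployed in GPUStack
    messages=[{"role": "user", "content": "Hello from GPUStack!"}],
)
print(response.choices[0].message.content)
```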