Compare commits

..

293 Commits

Author SHA1 Message Date
gitlawr 8fe557148d fix: qwen3-coder param
4 months ago
gitlawr dfdafd036d docs: update huggingface_token config example
4 months ago
yxf 26e81dc700 feat: Add support for Nvidia MIG detection in containerized environments.
4 months ago
gitlawr b24d08bbf1 chore: update auth config
4 months ago
thxCode f7636a5f63 ci(docker): lock ray version
4 months ago
thxCode 80509b5900 ci(docker/npu): fix different torch library
4 months ago
gitlawr 4535fb182d docs: update sso cli flags
4 months ago
gitlawr e91da52145 feat: make JWT expiration configurable
4 months ago
gitlawr 67b93d156e feat: add user avatar
4 months ago
gitlawr 30533f7275 feat: improve SSO
4 months ago
gitlawr e627a4c79e fix: update pipx envs
4 months ago
gitlawr f7fdcdb9d0 chore: remove legacy models
4 months ago
gitlawr 0d77f69e99 feat: add GLM4.5, Qwen3-Coder, Qwen3-2507 and gpt-oss models
4 months ago
ZhouForrest 54d75c7a45
Add SSO Authentication (#2658)
5 months ago
thxCode ea6e5ca9dc ci(docker/cuda): fix failed on llama-box starting
5 months ago
yxf df31850efc fix: Enhance download log display logic
5 months ago
yxf 67df35a364 fix: Update progress bar styling for frontend formatting compatibility.
5 months ago
yxf 10eaab6f90 feat: Optimize dependency management for specified backend versions
5 months ago
gitlawr fac6ed8d25 chore: bump llama-box to v0.0.169
5 months ago
thxCode 7e08098981 ci(docker/cpu): bump python version
5 months ago
thxCode bf6287c069 ci(docker/npu): bump vllm version
5 months ago
thxCode e54e269f58 ci(docker/cuda): build flashinfer from source
5 months ago
thxCode 5dd58c72cb ci(docker): support git-lfs checkout
5 months ago
thxCode a6441ff92e ci(docker/dcu): refer base image from gpustack acr
5 months ago
thxCode e2c0a7ccdc ci(docker/corex): refer base image from gpustack acr
5 months ago
thxCode cb14580408 ci(docker/npu): bump mindie version to 2.1.rc1
5 months ago
thxCode 359dc3c6e2 chore(docker): collect dockerfile together
5 months ago
linyinli 8ff76ebfee docs: update wechat qrcode
5 months ago
linyinli b5b2272e12 docs: update wechat qrcode
5 months ago
thxCode 53d255cba6 refactor(box): version candidate selecting
5 months ago
cyx 036c40753f fix: proxy request fail when no reposnse within 300s
5 months ago
yxf 7331059605 feat: Log the download progress in the model instance's log file.
5 months ago
linyinli 5e22fe99ca docs: update wechat qrcode
5 months ago
cyx 2232e4baf6 chore: update gguf-parser to 0.22.0
5 months ago
Yuxing Deng 527f7b3644 fix: skip headers in proxy response
5 months ago
gitlawr 35b570459b chore: update cosyvoice huggingface repos
5 months ago
yxf a436eff8d3 fix: Fix potential division-by-zero exceptions.
5 months ago
thxCode 6168713804 fix(ray): conflict in random runtime env agent port
5 months ago
Xiaodong Ye 309c27a9ef llama-box: bump version
5 months ago
gitlawr 307dadbbd5 ci: skip corex docker build
5 months ago
yxf 74c937acc4 fix: Enhance multi-GPU scheduling tests and improve attention heads validation messages
5 months ago
gitlawr b0a638711b fix: none distributed servers on upgrade
5 months ago
cyx 14a60fd640 fix: the environment variable configuration "HTTP_PROXY HTTPS_PROXY" is invalid
5 months ago
rushyrush 34c21e055f fix: broken links in documentation
5 months ago
yxf 775fc18d5d fix: Exclude vision_config.num_attention_heads from the num_attention_heads validation check.
5 months ago
yxf eef43b3426 fix: Specify the Triton version only for platform amd64
5 months ago
thxCode 976b4ccf1f ci(npu): export hccl lib default
5 months ago
gitlawr b24cb7466d fix: update broken tests
5 months ago
gitlawr a035f94425 fix: update qwen3 vllm parameters
5 months ago
gitlawr 0d2a42b932 chore: update vLLM to v0.10.0
5 months ago
thxCode be05857963 fix(mindie): panic with quantization config
5 months ago
Xiaodong Ye f4fc88f114 Update TBD
5 months ago
Xiaodong Ye fed3bff688 musa: restore rc in docker image tag
5 months ago
Xiaodong Ye ffeeb4830d musa: upgrade musa sdk to 4.2.0
5 months ago
gitlawr b193ae59fc docs: update JP readme
5 months ago
linyinli 18dd045598 docs: update readme
5 months ago
linyinli 13f96d7e6f docs: corrections and additions
5 months ago
gitlawr c97693ec3f chore: update parser
5 months ago
linyinli 1703ed9af4 docs: update architecture diagrams and cli options
5 months ago
yxf aebae09a10 fix: Specify the Triton version to enable DIA model execution on RTX 5090 GPUs.
5 months ago
yxf 17ff2c4b56 revert: Force reinstall a standalone torch for vox-box in docker
5 months ago
linyinli eec744793b docs: update installation and faq docs
5 months ago
thxCode 0d4bf13f4c ci(docker): fix failed to confirm build cache
5 months ago
gitlawr 8118f25941 fix: set instance starting with message reset
5 months ago
gitlawr 9c89d04008 fix: wrong allocatable from subordinate workers
5 months ago
gitlawr 4f2ae77577 fix: skip download files for llama-box subordinate workers
5 months ago
gitlawr f5b70329dc chore: bump parser
5 months ago
yxf 0a28b2a4e0 fix: Remove the TORCH_CUDA_ARCH_LIST environment variable to prevent runtime errors.
5 months ago
yxf f017065122 fix: Force reinstall a standalone torch for vox-box in docker
5 months ago
linyinli bf33757623 docs: update installer url
5 months ago
linyinli 8b8b859fb6 docs: update troubleshooting docs
5 months ago
Joe_Z 6ea4988572 docs: add GPUStack integration note for Docker Desktop installed Dify
5 months ago
thxCode e45a4e9961 refactor(vllm/ray): receive ray port setting
5 months ago
thxCode 6208c7c3a5 refactor(vllm/ray): parameterize all ports as we can
5 months ago
gitlawr 9b3850fe7c fix: remove ernie from catalog
5 months ago
thxCode 6c7728d4f3 ci(cuda): reinstall vox-box
5 months ago
thxCode feba7ecb51 ci(cuda): reinstall flashinfer
5 months ago
thxCode c3a34f3f09 ci(cuda): tidy up dockerfile
5 months ago
gitlawr 733f2c1482 docs: use deployments page
5 months ago
gitlawr a888661621 docs: update script installation deprecation
5 months ago
gitlawr 65591e10ea docs: desktop installation rephrasing
5 months ago
gitlawr 4d30c23d8f docs: update quickstart
5 months ago
yxf 9f9e0217d6 fix: improve error handling for missing pretrained configuration files
5 months ago
cyx a16e78dc32 fix: evaluation model failure in air-gapped environment
5 months ago
Yuxing Deng 0aae3fa0a6 fix: model instance not found after restart gpustack
5 months ago
thxCode 0749e40d1b fix(vllm/ascend/ray): failed on picking several nodes from the whole group 2
5 months ago
KodeMunkie b1015f353a Corrected spelling of retrieving
5 months ago
thxCode 12a4a4ef9f refactor(ray): extend ray worker port range
5 months ago
thxCode 8872d0687b fix(vllm/ascend/ray): failed on picking several nodes from the whole group
5 months ago
gitlawr 17a7c4b6e1 fix: Broken pipe on high concurrency
5 months ago
Yuxing Deng fc4aa6b6c2 docs: add installer docs
5 months ago
thxCode 16b5333fcb ci(docker): enhance cache reusing
5 months ago
gitlawr 3836fd34bb fix: only skip worker proxy for mindIE backend
5 months ago
jialin d79343b324 docs: update quickstart
5 months ago
jialin f8cc92ab0b docs: update quickstart and upgrade docs
5 months ago
gitlawr 0c760c34b5 chore: update ernie logo
5 months ago
yxf fef108171c feat: add CLI arguments for backend version switching
5 months ago
thxCode b0457b7adb chore(tool): bump gguf-parser version
5 months ago
linyinli e9f17063ca docs: fix incorrect info and image
5 months ago
cyx 57b55a8207 refactor: use --cache-expiration option can configure repo file cache expirationand
5 months ago
cyx c2b4b95824 fix: fail evaluation when single-node multi-GPU VRAM utilization
5 months ago
gitlawr ba411603ac chore: update vox-box to 0.0.19
5 months ago
gitlawr 1aeb35352a Revert "fix: reset file lock on retry download"
5 months ago
gitlawr ba5a49edd3 fix: allow nullable finish_reason for streaming legacy completions
5 months ago
gitlawr 2c16096055 fix: reset file lock on retry download
5 months ago
gitlawr a57b9c1768 feat: sqlite tuning
5 months ago
gitlawr 53be75c23c refactor: streaming with db engine
5 months ago
gitlawr 9787b28a44 fix: catch no response runtime error
5 months ago
thxCode 345024182b refactor(vllm): show process running envs in debug mode
5 months ago
thxCode b7bfda3b1d fix(vllm/ascend): failed on running across multiple devices
5 months ago
thxCode 78eda5b83a style: rename func for getting visible devices env name
5 months ago
thxCode 70b1f8486d refactor(npu): default mindie and vllm parameters in 310p
5 months ago
thxCode 76bd6838e5 ci(npu): fix vllm in 310p
5 months ago
cyx 554f5c0180 chore: update start.md
5 months ago
Yuxing Deng 7ddb9edb86 deps: relaxing transformers version restrictions
5 months ago
cyx 3fa5bf01fe chore: update gguf-parse to v0.21.1
5 months ago
yxf 1b3a192d07 feat: add installation of flashinfer-python JIT for vllm on linux/amd64 platform
5 months ago
thxCode 152e57f8a7 test(npu-smi): refine example and test case
5 months ago
thxCode 089b7fa813 ci(npu): check mindie turbo and vllm compatibility
5 months ago
thxCode c7ee3661ac refactor(npu-smi): gather 310p chip
5 months ago
yxf c52b0c1c44 feat: Isolate virtual environments for vox-box and gpustack to prevent transformer version conflicts.
5 months ago
cyx 4b30a6b743 refactor: always use cached model metadata
5 months ago
jialin e254d7bfd8 Revert "docs: update quick start"
5 months ago
jialin b0c85bb603 Revert "docs(upgrade): update upgrade steps for deprecated installation script"
5 months ago
jialin 5ffdefdbae docs(upgrade): update upgrade steps for deprecated installation script
5 months ago
jialin 10c2ae6fbc docs: update quick start
5 months ago
gitlawr 23c0b77850 chore: suppress ollama-library-url arg
5 months ago
thxCode 7eca3cabe1 refactor(mindie): validation
5 months ago
jialin eef0700bdc docs: update screenshots, add desktop installer
5 months ago
thxCode 5d65dcc0a7 refactor(selector/mindie): return object instead of tuple in _estimate_usage
5 months ago
thxCode a4cdf37379 fix(selector/mindie): failed on recoginizing n_groups=1 mla model
5 months ago
gitlawr a54ff956a1 fix: incorrect file path for qwen3 embedding from modelscope
5 months ago
thxCode 91dbbd31af chore(npu): bump vllm version
5 months ago
yxf 58af72bf08 fix:remove the redundant formatting issues in schedule_cycle prompt messages; optimize compatibility check text.
5 months ago
thxCode 8cb1a38013 fix(selector/gguf): incorrect partial offloading in uma
5 months ago
gitlawr 4aea7bcdf9 feat: apply qwen3 reranker templates
6 months ago
gitlawr b8910817b6 fix: respect vram utilization for qwen3 embedding
6 months ago
thxCode eae62e02fc chore(tool): bump version
6 months ago
thxCode d46932ae80 chore(mindie): fix typo in lable matching filter
6 months ago
gitlawr a4da5fae63 refactor: install custom-version backends under data-dir
6 months ago
thxCode 83146853ac fix(selector/mindie): failed on evaluating mulitmodal models
6 months ago
Yuxing Deng d094c2e008 fix: should use single quote instead of double quote
6 months ago
yxf f16718c776 fix: correct compatibility check messages
6 months ago
thxCode a368ac1000 docs(mindie): introduce distributed serving guide
6 months ago
cyx 8e8d41b24c fix: failed on db connecting with special chars
6 months ago
gitlawr d95edf3e29 chore: update catalog
6 months ago
gitlawr e1a4a3fca2 chore: update transformers
6 months ago
cyx 8d5c70a2ac chore: optimize error prompts when GPUDetectExepction
6 months ago
thxCode 3b8ef1c3e7 refactor(calculator): gguf-parser receive --rope-* params
6 months ago
thxCode 8b0c521458 chore(tool): bump version
6 months ago
cyx 330694ea9a fix: not providing sufficient information for Nvidia error
6 months ago
gitlawr 8d04eb7797 fix: missing distributed server allocation in dashboard
6 months ago
gitlawr 4939b4a1f7 fix: include subworkers in worker allocated summary
6 months ago
gitlawr a0e4f0c889 feat: add worker_name to subworkers
6 months ago
gitlawr c34cd72c36 fix: reject distributed vllm with custom version
6 months ago
cyx 2f50cc60ae fix: /v1/workers returns incorrect values for the "allocated" field
6 months ago
gitlawr da3e8060c6 chore: update vox-box to 0.0.18
6 months ago
thxCode f196b844ae chore(tool): bump llama-box version
6 months ago
gitlawr bca385f869 chore: update vllm to v0.9.2
6 months ago
yxf 013fc3d510 feat: skip select_multi_workers by distributed_inference_across_workers in MindIE
6 months ago
gitlawr a3cb2445c5 fix: capitalize dataclass
6 months ago
gitlawr ccf4b6d660 refactor: migration script
6 months ago
gitlawr 5a816afe2b refactor: migrate subordinate_workers
6 months ago
gitlawr c371216700 fix: missing subordinate_workers resource claim in evaluation results
6 months ago
cyx 8e5bfca828 fix: password reset takes a while to take effect
6 months ago
yxf 9647cd4237 feat: Update compatibility check messages for multi-worker/multi-GPU scenarios and refactor vllm_selector code structure
6 months ago
yxf 6812d39148 feat: enhance mindie and vllm scheduling messages
6 months ago
yxf 6c27bb8945 feat: clarifying compatibility check messages
6 months ago
thxCode 96f673b7f6 chore(tool): bump gguf-parser version
6 months ago
thxCode 3d3dde0db8 fix(scheduler): allow unsupported gguf model in evaluation
6 months ago
thxCode e6f4dd03f7 ci(npu): avoid docker build cache invalid
6 months ago
gitlawr 2083ce0778 fix: duplicate user message
6 months ago
thxCode cee7492913 chore(tool): bump gguf-parser version
6 months ago
gitlawr 501231471e chore: update box to v0.0.161
6 months ago
linyinli 0b90cf50c0 docs: fix typos and minor corrections
6 months ago
thxCode b54c72a7c1 chore(tool): bump llama-box version
6 months ago
yxf f01a4bb491 fix: improve path handling for incomplete model file deletions
6 months ago
thxCode c27a504a8a chore(logging): unify tools downloading log prefix
6 months ago
thxCode b692aeb0e5 refactor(llama-box): static link bin dectect
6 months ago
thxCode 15b796288b refactor(llama-box): tidy up rpc-server launching logic
6 months ago
gitlawr ac42c6b21d fix: flexible vendor validation
6 months ago
gitlawr 9c9caa848e fix: bencharm embedding without multiplier
6 months ago
cyx 890f15faed fix:rpc-server-args recognization
6 months ago
thxCode ce868a7a79 refactor(npu-smi): pring warning in trace
6 months ago
thxCode 250723f9a1 fix(calculator): missing default --image-vae-tiling
6 months ago
thxCode 5bbb027dbe refactor(calculator): gguf-parser receive -ot param
6 months ago
thxCode c2b7456269 chore(tool): bump gguf-parser version
6 months ago
cyx 2916a306d9 feat: environment variable `GPUSTACK_DISABLE_DYNAMIC_LINK_LLAMA_BOX` apply to default llama-box version
6 months ago
gitlawr 5bc81bbb9e fix: add ray pg readiness check
6 months ago
kekao e7f41af249 feat: upgrade dcu vllm to 0.8.5
6 months ago
gitlawr 2a02ee689a fix: empty params evaluation matches catalog specs
6 months ago
gitlawr 9fba121898 fix: model usage for images
6 months ago
thxCode b1543211a8 fix(mindie): failed on moe-tp less than local world size
6 months ago
thxCode 6f72944222 refactor(mindie): validate moe tp/ep in selector
6 months ago
thxCode 1b6760169a feat(mindie): support mtp
6 months ago
thxCode 498f143944 refactor(mindie): tidy up params
6 months ago
cyx 1d7bb41776 feat(config): Add DISABLE_DYNAMIC_LINK_LLAMA_BOX env var for llama-box deployment control
6 months ago
linyinli 2bd6648236 fix: allow custom vLLM on arm64 platform
6 months ago
gitlawr 3f8d45ded1 fix: update trust-remote-code value error message
6 months ago
gitlawr 6714afa2ca refactor: simplify test fixtures
6 months ago
gitlawr 48f10bcf1b chore: update llama-box to v0.0.159
6 months ago
gitlawr acf4ad2917 chore: default usage sort
6 months ago
thxCode 4e1d6d900c fix(mindie): invalid world size in manual selection
6 months ago
thxCode c9f44f33aa test(mindie): simplify fixture import
6 months ago
Yuxing Deng cc3fc57bc1 fix: simplify llama-box download logic
6 months ago
thxCode 777fd3d0cb refactor(mindie): refine parallelism params
6 months ago
gitlawr 3f34bdd216 fix: handle form file and stream chunks
6 months ago
gitlawr b054dc00b5 refactor: update benchmark script
6 months ago
gitlawr 94e515c844 fix: set tcp connector limit
6 months ago
thxCode cc0a88f88f refactor(bus): default queue maxsize to 256
6 months ago
thxCode 6d7d666f1e refactor(logging): print milliseconds
6 months ago
thxCode 1eda8a7810 refactor(proxy): transfer request without /proxy route
6 months ago
thxCode d12f1e9fee refactor(mindie): support distributed deployment
6 months ago
thxCode afc1023922 refactor(mindie): allow distributed model download
6 months ago
thxCode 3c5c9feb6c feat(util): introduce attribute path operator
6 months ago
thxCode a9fb44a43d fix(mindie): resource fit selection
6 months ago
thxCode ed8e4ba51f chore(detector): degrade npu-smi log level
6 months ago
thxCode feaa894000 style(mindie): chore name changed
6 months ago
yxf 62f78c10cc fix: refactor model file size handling to return file info and paths
6 months ago
cyx 85e5f717ba fix(llama-box): failed on rpc server and linux/cann starting
6 months ago
cyx bec76bac22 refactor: Llama box multi version directory structure changes
6 months ago
yxf 323d76df4c feat: enhance num_attention_heads retrieval from model configurations
6 months ago
Yuxing Deng 45b8018f82 fix: only import resource in *nix platform
6 months ago
gitlawr 38d2e31f9f feat: add usage query apis
6 months ago
yxf 0dfdafcd16 fix: handle case for empty filename in size computation
6 months ago
yxf 9272d5a9c7 fix: fix size computation for models with mmproj files
6 months ago
yxf 340b4d8d90 fix: correct model file path during ModelScope incomplete model deletion
6 months ago
yxf fab2813d56 feat: add methods for deleting incomplete model file
6 months ago
hello-daydayup d38a120a3d delete mlu comments
6 months ago
thxCode 56bf30f199 refactor(mindie): introduce candidate selectors
6 months ago
thxCode 7804c89f1e chore(base): declare abstract method get_messages
6 months ago
thxCode 383baac2e0 chore(selectors): support import from package
6 months ago
thxCode 884c94f63f refactor(worker): introduce subordinate workers
6 months ago
thxCode 7b648ce634 refactor(mindie): parameters parsing
6 months ago
yxf 6f3fab1051 feat: add disable_logger_names
6 months ago
hello-daydayup 64a7d18cf8 add Cambricon MLU
6 months ago
thxCode 7c5927db99 ci(musa): build amd64 only
6 months ago
thxCode 0b8f24f993 chore(llama-box): bump version
6 months ago
yxf 9addaad168 fix: update VRAM calc for gpu-memory-utilization
6 months ago
yxf f268725d8f feat: set default ulimit
6 months ago
cyx 68a3915e4e fix: allow override --rpc-server-cache-dir
6 months ago
cyx 5927e924cb fix: resolve intermittent error logs during rapid batch model start/stop operations
6 months ago
yxf 7c26f6f012 feat: add support for enabling Hugging Face Xet in configuration
6 months ago
yxf 3bf76a5a6e feat: add worker_uuid for name conflict detection
7 months ago
cyx 39d8e7ab0f refactor:The 'backend-parameters' configuration does not support the '--port' argument/option.
7 months ago
Yuxing Deng f4f84e0931 fix: the llama-box-rpc-server brokes with pre-downloaded llama-box
7 months ago
cyx d006860a38 refactor: enable rpc server cache
7 months ago
Yuxing Deng 37f23c2549 refactor: replace sysconfig with get_command_path
7 months ago
gitlawr b69a785f41 chore: remove pydantic workaround
7 months ago
gitlawr a06319d76c chore: update vllm to v0.9.1
7 months ago
thxCode c7d77e6e95 refactor(mindie): idempotent multi-version installation
7 months ago
yxf 296d6b5d32 chore: remove hf-xet dependency from pyproject.toml
7 months ago
gitlawr 7f247a6400 fix: disallow abbrev parsing backend parameters
7 months ago
thxCode d9996538f9 refactor(npu): tidy up mindie and vllm
7 months ago
gitlawr ec9e646105 docs: remove invalid link
7 months ago
gitlawr 8e44612eb8 ci: seperate docker cache per branch
7 months ago
Yuxing Deng 0810e2319d fix: multiple AMD GPU are detected when multiple driver installed
7 months ago
gitlawr 00e858f40f refactor: use aiohttp in proxy
7 months ago
gitlawr 7a2187aaa3 feat: cache secret hashing
7 months ago
gitlawr d9b67ff2eb feat: benchmark embeddings
7 months ago
gitlawr cc2e0bbe78 chore: update llama-box to v0.0.154
7 months ago
gitlawr 88253b59c3 chore: bump llama-box to v0.0.153
7 months ago
gitlawr 7b5078856c fix: exception on log file not ready
7 months ago
gitlawr 555d0c81ba docs: update description for start flags
7 months ago
Yuxing Deng 03cd217189 feat: support using full path of llama-box
7 months ago
linyinli bef3b521ef fix: remove incompatible Qwen3 config for vLLM Ascend
7 months ago
gitlawr b99bd3f6de chore: bump backends
7 months ago
gitlawr 157ad58ee3 feat: add r1 0528 to catalog
7 months ago
thxCode a95650f3e1 chore(tool): bump version
7 months ago
gitlawr d3c908ea07 ci: drop cuda11.8
7 months ago
linyinli 0e34f487a7 feat: support vllm ascend
7 months ago
Terada Kousuke 211132d172 Add Japanese README
7 months ago
gitlawr 45812787a0 ci: update branches
7 months ago
Yuxing Deng e39a103b79 feat: add support for built binary to run mulitprocessing
7 months ago
peiyuan.zhang 27ee8566cb remove Installation Script
7 months ago
peiyuan.zhang c060a51f76 support iluvatar
7 months ago
thxCode a964ad0816 refactor(llama-box): enable max projected cache by default
7 months ago
thxCode bee9da9c65 chore(tool): bump version
7 months ago
thxCode 457e418cd5 refactor(scheduler): params processing in gguf-parser
7 months ago
thxCode bd1313f0ef chore(tool): bump version
7 months ago
thxCode 443466e740 refactor(catalog): enable visual max image size in pixtral
7 months ago
gitlawr fc92106506 chore: update vox-box
7 months ago
gitlawr 2f42df90ad feat: add glibc check
7 months ago
gitlawr ea34a10f26 ci: use tag ui on release
7 months ago
thxCode b6f847024f refactor(mindie): enhance features
7 months ago
thxCode b2844560ff chore(tool): bump version
7 months ago
thxCode 549750b739 refactor(detectors): collect device_index/device_chip_index
7 months ago
thxCode 7222a07049 refactor(llama-box): get downloaded platform name
7 months ago
thxCode faec4babe3 fix(stmt): failed to fetch network
7 months ago
linyinli 7819ce076f docs: add FAQ for viewing Ray logs
7 months ago
thxCode 2a6af6062e ci: docker build cache
7 months ago
thxCode 14b70d472d test(detectors): npu-smi adjust
7 months ago
thxCode a7ac3f52a4 refactor(detectors): npu-smi collect
7 months ago

.gitattributes

@ -1,3 +1,3 @@
install.ps1.sha256sum text eol=lf
* text=auto eol=lf
*.tar.gz filter=lfs diff=lfs merge=lfs -text
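The new `.gitattributes` rule routes `*.tar.gz` archives through Git LFS, which is why the workflow's checkout step below now enables `lfs: true`. As a minimal local sketch (assuming the `git-lfs` CLI is installed), fetching those archives after a plain clone would look like:

```bash
# One-time setup: install the Git LFS hooks for the current user
git lfs install

# Inside a clone of the repository, download the LFS-tracked *.tar.gz payloads
git lfs pull
```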

@ -60,7 +60,7 @@ jobs:
retention-days: 5
- name: Release GitHub Assets
uses: softprops/action-gh-release@v2.2.2
uses: softprops/action-gh-release@v2
if: startsWith(github.ref, 'refs/tags/') && matrix.python-version == '3.11' && matrix.os == 'linux'
with:
# Draft for official releases to prepare and review release notes before publishing

@ -15,18 +15,15 @@ on:
- "**.png"
- "**.jpg"
- "**.gif"
- "Dockerfile.rocm.base"
- "Dockerfile.dcu.base"
- "pack/**.base"
pull_request:
branches:
- main
- "v*-dev"
paths:
- "Dockerfile"
- "Dockerfile.*"
- ".github/workflows/docker-ci.yaml"
- "!Dockerfile.rocm.base"
- "!Dockerfile.dcu.base"
- "pack/**"
- "!pack/**.base"
jobs:
publish-docker:
@ -48,12 +45,16 @@ jobs:
tag_suffix: ""
build_args:
- "CUDA_VERSION=12.4.1"
- "CUDA_DEVEL_VERSION=12.6.3"
- "FLASHINFER_BUILD_MAX_JOBS=1"
- device: cuda
dockerfile: "Dockerfile"
platforms: "linux/amd64,linux/arm64"
tag_suffix: "-cuda12.8"
build_args:
- "CUDA_VERSION=12.8.1"
- "CUDA_DEVEL_VERSION=12.8.1"
- "FLASHINFER_BUILD_MAX_JOBS=1"
#
# HIP RoCM
#
@ -83,7 +84,7 @@ jobs:
#
- device: musa
dockerfile: "Dockerfile.musa"
platforms: "linux/amd64,linux/arm64"
platforms: "linux/amd64"
tag_suffix: "-musa"
build_args: []
#
@ -115,6 +116,7 @@ jobs:
- name: Checkout
uses: actions/checkout@v4
with:
lfs: true
fetch-depth: 1
persist-credentials: false
- name: Maximize Docker Build Space
@ -159,12 +161,66 @@ jobs:
echo "$arg" >> $GITHUB_OUTPUT
done
echo "EOF" >> $GITHUB_OUTPUT
- name: Get Cache Ref
id: cache-ref
run: |
#
# Use different cache ref for different branches.
#
# Examples:
# CACHE_FROM_REF
# - vX.Y.Z -> "gpustack/build-cache:gpustack${TAG_SUFFIX}-${VERSION|DEFAULT_BRANCH}"
# - PR/PUSH to branch -> "gpustack/build-cache:gpustack${TAG_SUFFIX}-${BRANCH|DEFAULT_BRANCH}"
# CACHE_TO_REF
# - vX.Y.Z -> "gpustack/build-cache:gpustack${TAG_SUFFIX}-${VERSION}"
# - PUSH to branch -> "gpustack/build-cache:gpustack${TAG_SUFFIX}-${BRANCH}"
#
# Stories(device cpu):
# CACHE_FROM_REF
# - Release tag v0.7.0rc1 -> gpustack/build-cache:gpustack-cpu-v0.7, if not found, fallback to gpustack/build-cache:gpustack-cpu-main
# - Release tag v0.7.0 -> gpustack/build-cache:gpustack-cpu-v0.7
# - PR to "main" branch -> gpustack/build-cache:gpustack-cpu-main
# - PR to "v0.7-dev" branch -> gpustack/build-cache:gpustack-cpu-v0.7, if not found, fallback to gpustack/build-cache:gpustack-cpu-main
# - Push to "main" branch -> gpustack/build-cache:gpustack-cpu-main
# - Push to "v0.7-dev" branch -> gpustack/build-cache:gpustack-cpu-v0.7, if not found, fallback to gpustack/build-cache:gpustack-cpu-main
# CACHE_TO_REF
# - Release tag v0.7.0rc1 -> gpustack/build-cache:gpustack-cpu-v0.7
# - Release tag v0.7.0 -> gpustack/build-cache:gpustack-cpu-v0.7
# - PR to "main" branch -> gpustack/build-cache:gpustack-cpu-main
# - PR to "v0.7-dev" branch -> gpustack/build-cache:gpustack-cpu-v0.7
# - Push to "main" branch -> gpustack/build-cache:gpustack-cpu-main
# - Push to "v0.7-dev" branch -> gpustack/build-cache:gpustack-cpu-v0.7
DEFAULT_BRANCH="main"
TAG_SUFFIX="${{ matrix.tag_suffix }}"
if [[ "${GITHUB_REF}" == refs/tags/* ]]; then
REF="${GITHUB_REF#refs/tags/}"
IFS="." read -r VERSION_MAJOR VERSION_MINOR VERSION_PATCH <<< "${REF}"
VERSION="${VERSION_MAJOR}.${VERSION_MINOR}"
CACHE_FROM_REF="gpustack/build-cache:gpustack${TAG_SUFFIX}-${VERSION}"
CACHE_TO_REF="${CACHE_FROM_REF}"
else
REF="${GITHUB_BASE_REF:-${GITHUB_REF}}"
BRANCH="${REF#refs/heads/}"
BRANCH="${BRANCH%-dev}"
CACHE_FROM_REF="gpustack/build-cache:gpustack${TAG_SUFFIX}-${BRANCH}"
CACHE_TO_REF="${CACHE_FROM_REF}"
fi
if ! docker manifest inspect "${CACHE_FROM_REF}" >/dev/null 2>&1; then
CACHE_FROM_REF="gpustack/build-cache:gpustack${TAG_SUFFIX}-${DEFAULT_BRANCH}"
fi
echo "CACHE_FROM_REF=${CACHE_FROM_REF}" >> $GITHUB_ENV
echo "CACHE_TO_REF=${CACHE_TO_REF}" >> $GITHUB_ENV
echo "DEBUG: GITHUB_BASE_REF=${GITHUB_BASE_REF}"
echo "DEBUG: GITHUB_REF=${GITHUB_REF}"
echo "DEBUG: TAG_SUFFIX=${TAG_SUFFIX}"
echo "DEBUG: CACHE_FROM_REF=${CACHE_FROM_REF}"
echo "DEBUG: CACHE_TO_REF=${CACHE_TO_REF}"
- name: Package
uses: docker/build-push-action@v6
id: package
with:
push: ${{ github.event_name != 'pull_request' }}
file: ${{ github.workspace }}/${{ matrix.dockerfile }}
file: ${{ github.workspace }}/pack/${{ matrix.dockerfile }}
context: ${{ github.workspace }}
platforms: ${{ matrix.platforms }}
tags: ${{ steps.metadata.outputs.tags }}
@ -174,6 +230,6 @@ jobs:
build-args: |
${{ steps.build-args.outputs.BUILD_ARGS }}
cache-from: |
type=registry,ref=gpustack/build-cache:gpustack${{ matrix.tag_suffix }}
type=registry,ref=${{ env.CACHE_FROM_REF }}
cache-to: |
${{ github.event_name != 'pull_request' && format('type=registry,mode=max,compression=gzip,ref=gpustack/build-cache:gpustack{0},ignore-error=true', matrix.tag_suffix) || '' }}
${{ github.event_name != 'pull_request' && format('type=registry,mode=max,compression=gzip,ref={0},ignore-error=true', env.CACHE_TO_REF) || '' }}
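Outside of GitHub Actions, the same registry-backed cache wiring can be exercised directly with `docker buildx build`. A rough sketch, using illustrative values that mirror the cpu-device refs from the comment block above (the image tag is hypothetical):

```bash
# Build with a remote registry cache; push an updated cache on success
docker buildx build \
  --file pack/Dockerfile \
  --cache-from type=registry,ref=gpustack/build-cache:gpustack-cpu-main \
  --cache-to type=registry,mode=max,compression=gzip,ignore-error=true,ref=gpustack/build-cache:gpustack-cpu-main \
  --tag gpustack/gpustack:dev \
  .
```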

@ -4,12 +4,14 @@ on:
push:
branches:
- main
- "v*-dev"
paths:
- "install.ps1"
- ".github/workflows/install-script-windows.yml"
pull_request:
branches:
- main
- "v*-dev"
paths:
- "install.ps1"
- ".github/workflows/install-script-windows.yml"
@ -74,7 +76,7 @@ jobs:
$env:INSTALL_PACKAGE_SPEC = [System.IO.Path]::Combine("dist", $env:whlPackageName)
Write-Host "INSTALL_PACKAGE_SPEC: $env:INSTALL_PACKAGE_SPEC"
Write-Host "AppData $env:APPDATA"
# Use port 8080 since 80 is occupied by the System
./install.ps1 -ServerPort 8080
@ -100,8 +102,7 @@ jobs:
Start-Sleep -Seconds $retryDelaySeconds
}
}
if ($responseCode -ne 200) {
Write-Host "All retry attempts failed. Last error: $lastError"
}

@ -4,12 +4,14 @@ on:
push:
branches:
- main
- "v*-dev"
paths:
- "install.sh"
- ".github/workflows/install-script.yml"
pull_request:
branches:
- main
- "v*-dev"
paths:
- "install.sh"
- ".github/workflows/install-script.yml"

@ -11,6 +11,7 @@ repos:
hooks:
- id: flake8
exclude: ".*/migrations"
args: [--max-complexity=15]
- repo: https://github.com/psf/black
rev: 24.4.2
hooks:

@ -1,44 +0,0 @@
ARG CUDA_VERSION=12.4.1
ARG CUDA_TAG_SUFFIX=-cudnn-runtime-ubuntu22.04
FROM nvidia/cuda:${CUDA_VERSION}${CUDA_TAG_SUFFIX}
ARG TARGETPLATFORM
ENV DEBIAN_FRONTEND=noninteractive
RUN apt-get update && apt-get install -y \
git \
curl \
wget \
tzdata \
iproute2 \
python3 \
python3-pip \
python3-venv \
tini \
&& rm -rf /var/lib/apt/lists/*
COPY . /workspace/gpustack
RUN cd /workspace/gpustack && \
make build
ARG VLLM_VERSION=0.8.5.post1
RUN <<EOF
if [ "$TARGETPLATFORM" = "linux/amd64" ]; then
# Install vllm dependencies for x86_64
WHEEL_PACKAGE="$(ls /workspace/gpustack/dist/*.whl)[all]";
else
WHEEL_PACKAGE="$(ls /workspace/gpustack/dist/*.whl)[audio]";
fi
pip install pipx
pip install $WHEEL_PACKAGE
pip cache purge
rm -rf /workspace/gpustack
EOF
RUN gpustack download-tools
# Download dac weights used by audio models like Dia
RUN python3 -m dac download
ENTRYPOINT [ "tini", "--", "gpustack", "start" ]

@ -1,32 +0,0 @@
FROM ubuntu:22.04
ARG TARGETPLATFORM
ENV DEBIAN_FRONTEND=noninteractive
RUN apt-get update && apt-get install -y \
git \
curl \
wget \
tzdata \
iproute2 \
python3 \
python3-pip \
python3-venv \
tini \
&& rm -rf /var/lib/apt/lists/*
COPY . /workspace/gpustack
RUN cd /workspace/gpustack && \
make build && \
WHEEL_PACKAGE="$(ls /workspace/gpustack/dist/*.whl)[audio]" && \
pip install pipx && \
pip install $WHEEL_PACKAGE && \
pip cache purge && \
rm -rf /workspace/gpustack
RUN gpustack download-tools
# Download dac weights used by audio models like Dia
RUN python3 -m dac download
ENTRYPOINT [ "tini", "--", "gpustack", "start" ]

@ -10,7 +10,7 @@
<img alt="Documentation" src="https://img.shields.io/badge/Docs-GPUStack-blue?logo=readthedocs&logoColor=white"></a>
<a href="./LICENSE" target="_blank">
<img alt="License" src="https://img.shields.io/github/license/gpustack/gpustack?logo=github&logoColor=white&label=License&color=blue"></a>
<a href="./docs/assets/wechat-assistant.png" target="_blank">
<a href="./docs/assets/wechat-group-qrcode.jpg" target="_blank">
<img alt="WeChat" src="https://img.shields.io/badge/微信群-GPUStack-blue?logo=wechat&logoColor=white"></a>
<a href="https://discord.gg/VXYJzuaqwD" target="_blank">
<img alt="Discord" src="https://img.shields.io/badge/Discord-GPUStack-blue?logo=discord&logoColor=white"></a>
@ -35,7 +35,7 @@ GPUStack is an open-source GPU cluster manager for running AI models.
- **Broad GPU Compatibility:** Seamlessly supports GPUs from various vendors across Apple Macs, Windows PCs, and Linux servers.
- **Extensive Model Support:** Supports a wide range of models including LLMs, VLMs, image models, audio models, embedding models, and rerank models.
- **Flexible Inference Backends:** Flexibly integrates with multiple inference backends including llama-box (llama.cpp & stable-diffusion.cpp), vox-box, vLLM and Ascend MindIE.
- **Flexible Inference Backends:** Flexibly integrates with multiple inference backends including vLLM, Ascend MindIE, llama-box (llama.cpp & stable-diffusion.cpp) and vox-box.
- **Multi-Version Backend Support:** Run multiple versions of inference backends concurrently to meet the diverse runtime requirements of different models.
- **Distributed Inference:** Supports single-node and multi-node multi-GPU inference, including heterogeneous GPUs across vendors and runtime environments.
- **Scalable GPU Architecture:** Easily scale up by adding more GPUs or nodes to your infrastructure.
@ -50,83 +50,71 @@ GPUStack is an open-source GPU cluster manager for running AI models.
## Installation
### Linux or macOS
### Linux
GPUStack provides a script to install it as a service on systemd or launchd based systems with default port 80. To install GPUStack using this method, just run:
If you are using NVIDIA GPUs, ensure [Docker](https://docs.docker.com/engine/install/) and [NVIDIA Container Toolkit](https://docs.nvidia.com/datacenter/cloud-native/container-toolkit/install-guide.html) are installed on your system. Then, run the following command to start the GPUStack server.
```bash
curl -sfL https://get.gpustack.ai | sh -s -
docker run -d --name gpustack \
--restart=unless-stopped \
--gpus all \
--network=host \
--ipc=host \
-v gpustack-data:/var/lib/gpustack \
gpustack/gpustack
```
### Windows
For more details on the installation or other GPU hardware platforms, please refer to the [Installation Documentation](docs/installation/installation-requirements.md).
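If the container does not come up as expected, the standard Docker CLI (nothing GPUStack-specific) can be used to check its status and follow the startup logs:

```bash
# Confirm the container is running
docker ps --filter name=gpustack

# Follow the server logs while it initializes
docker logs -f gpustack
```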
Run PowerShell as administrator (**avoid** using PowerShell ISE), then run the following command to install GPUStack:
```powershell
Invoke-Expression (Invoke-WebRequest -Uri "https://get.gpustack.ai" -UseBasicParsing).Content
```
### Other Installation Methods
For manual installation, docker installation or detailed configuration options, please refer to the [Installation Documentation](https://docs.gpustack.ai/latest/installation/installation-script/).
## Getting Started
1. Run and chat with the **llama3.2** model:
After the server starts, run the following command to get the default admin password:
```bash
gpustack chat llama3.2 "tell me a joke."
docker exec gpustack cat /var/lib/gpustack/initial_admin_password
```
2. Run and generate an image with the **stable-diffusion-v3-5-large-turbo** model:
Open your browser and navigate to `http://your_host_ip` to access the GPUStack UI. Use the default username `admin` and the password you retrieved above to log in.
> ### 💡 Tip
>
> This command downloads the model (~12GB) from Hugging Face. The download time depends on your network speed. Ensure you have enough disk space and VRAM (12GB) to run the model. If you encounter issues, you can skip this step and move to the next one.
### macOS & Windows
```bash
gpustack draw hf.co/gpustack/stable-diffusion-v3-5-large-turbo-GGUF:stable-diffusion-v3-5-large-turbo-Q4_0.gguf \
"A minion holding a sign that says 'GPUStack'. The background is filled with futuristic elements like neon lights, circuit boards, and holographic displays. The minion is wearing a tech-themed outfit, possibly with LED lights or digital patterns. The sign itself has a sleek, modern design with glowing edges. The overall atmosphere is high-tech and vibrant, with a mix of dark and neon colors." \
--sample-steps 5 --show
```
A desktop installer is available for macOS and Windows — see the [documentation](https://docs.gpustack.ai/latest/installation/desktop-installer/) for installation details.
Once the command completes, the generated image will appear in the default viewer. You can experiment with the prompt and CLI options to customize the output.
## Deploy a Model
![Generated Image](https://raw.githubusercontent.com/gpustack/gpustack/main/docs/assets/quickstart-minion.png)
1. Navigate to the `Catalog` page in the GPUStack UI.
3. Open `http://your_host_ip` in the browser to access the GPUStack UI. Log in to GPUStack with username `admin` and the default password. You can run the following command to get the password for the default setup:
2. Select the `Qwen3` model from the list of available models.
**Linux or macOS**
3. After the deployment compatibility checks pass, click the `Save` button to deploy the model.
```bash
cat /var/lib/gpustack/initial_admin_password
```
![deploy qwen3 from catalog](docs/assets/quick-start/quick-start-qwen3.png)
**Windows**
4. GPUStack will start downloading the model files and deploying the model. When the deployment status shows `Running`, the model has been deployed successfully.
```powershell
Get-Content -Path "$env:APPDATA\gpustack\initial_admin_password" -Raw
```
![model is running](docs/assets/quick-start/model-running.png)
5. Click `Playground - Chat` in the navigation menu, check that the model `qwen3` is selected from the top-right `Model` dropdown. Now you can chat with the model in the UI playground.
4. Click `Playground - Chat` in the navigation menu. Now you can chat with the LLM in the UI playground.
![quick chat](docs/assets/quick-start/quick-chat.png)
![Playground Screenshot](https://raw.githubusercontent.com/gpustack/gpustack/main/docs/assets/playground-screenshot.png)
## Use the model via API
5. Click `API Keys` in the navigation menu, then click the `New API Key` button.
1. Hover over the user avatar and navigate to the `API Keys` page, then click the `New API Key` button.
6. Fill in the `Name` and click the `Save` button.
2. Fill in the `Name` and click the `Save` button.
7. Copy the generated API key and save it somewhere safe. Please note that you can only see it once on creation.
3. Copy the generated API key and save it somewhere safe. Please note that you can only see it once on creation.
8. Now you can use the API key to access the OpenAI-compatible API. For example, use curl as the following:
4. You can now use the API key to access the OpenAI-compatible API endpoints provided by GPUStack. For example, use curl as the following:
```bash
# Replace `your_api_key` and `your_gpustack_server_url`
# with your actual API key and GPUStack server URL.
export GPUSTACK_API_KEY=your_api_key
curl http://your_gpustack_server_url/v1-openai/chat/completions \
curl http://your_gpustack_server_url/v1/chat/completions \
-H "Content-Type: application/json" \
-H "Authorization: Bearer $GPUSTACK_API_KEY" \
-d '{
"model": "llama3.2",
"model": "qwen3",
"messages": [
{
"role": "system",
@ -134,7 +122,7 @@ curl http://your_gpustack_server_url/v1-openai/chat/completions \
},
{
"role": "user",
"content": "Hello!"
"content": "Tell me a joke."
}
],
"stream": true
@ -143,8 +131,8 @@ curl http://your_gpustack_server_url/v1-openai/chat/completions \
## Supported Platforms
- [x] macOS
- [x] Linux
- [x] macOS
- [x] Windows
## Supported Accelerators
@ -156,15 +144,11 @@ curl http://your_gpustack_server_url/v1-openai/chat/completions \
- [x] Hygon DTK
- [x] Moore Threads MUSA
- [x] Iluvatar Corex
We plan to support the following accelerators in future releases.
- [ ] Intel oneAPI
- [ ] Qualcomm AI Engine
- [x] Cambricon MLU
## Supported Models
GPUStack uses [llama-box](https://github.com/gpustack/llama-box) (bundled [llama.cpp](https://github.com/ggml-org/llama.cpp) and [stable-diffusion.cpp](https://github.com/leejet/stable-diffusion.cpp) server), [vLLM](https://github.com/vllm-project/vllm), [Ascend MindIE](https://www.hiascend.com/en/software/mindie) and [vox-box](https://github.com/gpustack/vox-box) as the backends and supports a wide range of models. Models from the following sources are supported:
GPUStack uses [vLLM](https://github.com/vllm-project/vllm), [Ascend MindIE](https://www.hiascend.com/en/software/mindie), [llama-box](https://github.com/gpustack/llama-box) (bundled [llama.cpp](https://github.com/ggml-org/llama.cpp) and [stable-diffusion.cpp](https://github.com/leejet/stable-diffusion.cpp) server) and [vox-box](https://github.com/gpustack/vox-box) as the backends and supports a wide range of models. Models from the following sources are supported:
1. [Hugging Face](https://huggingface.co/)
@ -172,16 +156,16 @@ GPUStack uses [llama-box](https://github.com/gpustack/llama-box) (bundled [llama
3. Local File Path
### Example Models:
### Example Models
| **Category** | **Models** |
| -------------------------------- | ---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- |
| **Large Language Models(LLMs)** | [Qwen](https://huggingface.co/models?search=Qwen/Qwen), [LLaMA](https://huggingface.co/meta-llama), [Mistral](https://huggingface.co/mistralai), [DeepSeek](https://huggingface.co/models?search=deepseek-ai/deepseek), [Phi](https://huggingface.co/models?search=microsoft/phi), [Gemma](https://huggingface.co/models?search=Google/gemma) |
| **Vision Language Models(VLMs)** | [Llama3.2-Vision](https://huggingface.co/models?pipeline_tag=image-text-to-text&search=llama3.2), [Pixtral](https://huggingface.co/models?search=pixtral) , [Qwen2.5-VL](https://huggingface.co/models?search=Qwen/Qwen2.5-VL), [LLaVA](https://huggingface.co/models?search=llava), [InternVL2.5](https://huggingface.co/models?search=internvl2_5) |
| **Diffusion Models** | [Stable Diffusion](https://huggingface.co/models?search=gpustack/stable-diffusion), [FLUX](https://huggingface.co/models?search=gpustack/flux) |
| **Embedding Models** | [BGE](https://huggingface.co/gpustack/bge-m3-GGUF), [BCE](https://huggingface.co/gpustack/bce-embedding-base_v1-GGUF), [Jina](https://huggingface.co/models?search=gpustack/jina-embeddings) |
| **Reranker Models** | [BGE](https://huggingface.co/gpustack/bge-reranker-v2-m3-GGUF), [BCE](https://huggingface.co/gpustack/bce-reranker-base_v1-GGUF), [Jina](https://huggingface.co/models?search=gpustack/jina-reranker) |
| **Audio Models** | [Whisper](https://huggingface.co/models?search=Systran/faster) (Speech-to-Text), [CosyVoice](https://huggingface.co/models?search=FunAudioLLM/CosyVoice) (Text-to-Speech) |
| **Category** | **Models** |
| -------------------------------- | ------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------ |
| **Large Language Models(LLMs)** | [Qwen](https://huggingface.co/models?search=Qwen/Qwen), [LLaMA](https://huggingface.co/meta-llama), [Mistral](https://huggingface.co/mistralai), [DeepSeek](https://huggingface.co/models?search=deepseek-ai/deepseek), [Phi](https://huggingface.co/models?search=microsoft/phi), [Gemma](https://huggingface.co/models?search=Google/gemma) |
| **Vision Language Models(VLMs)** | [Llama3.2-Vision](https://huggingface.co/models?pipeline_tag=image-text-to-text&search=llama3.2), [Pixtral](https://huggingface.co/models?search=pixtral) , [Qwen2.5-VL](https://huggingface.co/models?search=Qwen/Qwen2.5-VL), [LLaVA](https://huggingface.co/models?search=llava), [InternVL3](https://huggingface.co/models?search=internvl3) |
| **Diffusion Models** | [Stable Diffusion](https://huggingface.co/models?search=gpustack/stable-diffusion), [FLUX](https://huggingface.co/models?search=gpustack/flux) |
| **Embedding Models** | [BGE](https://huggingface.co/gpustack/bge-m3-GGUF), [BCE](https://huggingface.co/gpustack/bce-embedding-base_v1-GGUF), [Jina](https://huggingface.co/models?search=gpustack/jina-embeddings), [Qwen3-Embedding](https://huggingface.co/models?search=qwen/qwen3-embedding) |
| **Reranker Models** | [BGE](https://huggingface.co/gpustack/bge-reranker-v2-m3-GGUF), [BCE](https://huggingface.co/gpustack/bce-reranker-base_v1-GGUF), [Jina](https://huggingface.co/models?search=gpustack/jina-reranker), [Qwen3-Reranker](https://huggingface.co/models?search=qwen/qwen3-reranker) |
| **Audio Models** | [Whisper](https://huggingface.co/models?search=Systran/faster) (Speech-to-Text), [CosyVoice](https://huggingface.co/models?search=FunAudioLLM/CosyVoice) (Text-to-Speech) |
For full list of supported models, please refer to the supported models section in the [inference backends](https://docs.gpustack.ai/latest/user-guide/inference-backends/) documentation.
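As a follow-up to the API example above: because the endpoints are OpenAI-compatible, listing the currently deployed models should work the same way. A minimal sketch, assuming the standard `/v1/models` route is exposed and reusing the placeholder key and server URL from the curl example:

```bash
# List deployed models through the OpenAI-compatible API
curl http://your_gpustack_server_url/v1/models \
  -H "Authorization: Bearer $GPUSTACK_API_KEY"
```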

@ -10,7 +10,7 @@
<img alt="Documentation" src="https://img.shields.io/badge/文档-GPUStack-blue?logo=readthedocs&logoColor=white"></a>
<a href="./LICENSE" target="_blank">
<img alt="License" src="https://img.shields.io/github/license/gpustack/gpustack?logo=github&logoColor=white&label=License&color=blue"></a>
<a href="./docs/assets/wechat-assistant.png" target="_blank">
<a href="./docs/assets/wechat-group-qrcode.jpg" target="_blank">
<img alt="WeChat" src="https://img.shields.io/badge/微信群-GPUStack-blue?logo=wechat&logoColor=white"></a>
<a href="https://discord.gg/VXYJzuaqwD" target="_blank">
<img alt="Discord" src="https://img.shields.io/badge/Discord-GPUStack-blue?logo=discord&logoColor=white"></a>
@ -35,7 +35,7 @@ GPUStack 是一个用于运行 AI 模型的开源 GPU 集群管理器。
- **广泛的 GPU 兼容性**:无缝支持 Apple Mac、Windows PC 和 Linux 服务器上各种供应商的 GPU。
- **广泛的模型支持**:支持各种模型,包括 LLM、多模态 VLM、图像模型、语音模型、文本嵌入模型和重排序模型。
- **灵活的推理后端**：支持与 llama-box（llama.cpp 和 stable-diffusion.cpp）、vox-box、vLLM 和 Ascend MindIE 等多种推理后端的灵活集成。
- **灵活的推理后端**：支持与 vLLM、Ascend MindIE、llama-box（llama.cpp 和 stable-diffusion.cpp）和 vox-box 等多种推理后端的灵活集成。
- **多版本后端支持**:同时运行推理后端的多个版本,以满足不同模型的不同运行依赖。
- **分布式推理**:支持单机和多机多卡并行推理,包括跨供应商和运行环境的异构 GPU。
- **可扩展的 GPU 架构**:通过向基础设施添加更多 GPU 或节点轻松进行扩展。
@ -50,84 +50,71 @@ GPUStack 是一个用于运行 AI 模型的开源 GPU 集群管理器。
## 安装
### Linux 或 macOS
### Linux
GPUStack 提供了安装脚本，可以将其安装为 Linux 的 systemd 服务或 macOS 的 launchd 服务，默认端口为 80。要使用此方法安装 GPUStack，执行以下命令：
如果你是 NVIDIA GPU 环境，请确保 [Docker](https://docs.docker.com/engine/install/) 和 [NVIDIA Container Toolkit](https://docs.nvidia.com/datacenter/cloud-native/container-toolkit/install-guide.html) 都已经在系统中安装。然后，执行如下命令启动 GPUStack：
```bash
curl -sfL https://get.gpustack.ai | INSTALL_INDEX_URL=https://pypi.tuna.tsinghua.edu.cn/simple sh -s -
docker run -d --name gpustack \
--restart=unless-stopped \
--gpus all \
--network=host \
--ipc=host \
-v gpustack-data:/var/lib/gpustack \
gpustack/gpustack
```
### Windows
有关其它平台的安装或详细配置选项,请参考[安装文档](docs/installation/installation-requirements.md).
以管理员身份运行 PowerShell（**避免**使用 PowerShell ISE），然后执行以下命令安装 GPUStack：
```powershell
$env:INSTALL_INDEX_URL = "https://pypi.tuna.tsinghua.edu.cn/simple"
Invoke-Expression (Invoke-WebRequest -Uri "https://get.gpustack.ai" -UseBasicParsing).Content
```
### 其他安装方式
有关 pip 安装、Docker 安装或详细配置选项,请参考[安装文档](https://docs.gpustack.ai/latest/installation/installation-requirements/)。
## 新手入门
1. 在命令行运行 **llama3.2** 模型并进行对话:
容器正常运行后,执行以下命令获取默认密码:
```bash
gpustack chat llama3.2 "tell me a joke."
docker exec gpustack cat /var/lib/gpustack/initial_admin_password
```
2. 运行 **stable-diffusion-v3-5-large-turbo** 模型并生成图像:
在浏览器中打开 `http://your_host_ip`,访问 GPUStack 界面。使用 `admin` 用户名和默认密码登录 GPUStack。
> ### 💡 Tip
>
> 此命令将从 Hugging Face 下载模型（约 12GB）。下载时间取决于你的网络速度。确保你有足够的磁盘空间和 VRAM（12GB）来运行模型。如果遇到问题，你可以跳过此步骤并转到下一步。
### macOS & Windows
```bash
gpustack draw hf.co/gpustack/stable-diffusion-v3-5-large-turbo-GGUF:stable-diffusion-v3-5-large-turbo-Q4_0.gguf \
"A minion holding a sign that says 'GPUStack'. The background is filled with futuristic elements like neon lights, circuit boards, and holographic displays. The minion is wearing a tech-themed outfit, possibly with LED lights or digital patterns. The sign itself has a sleek, modern design with glowing edges. The overall atmosphere is high-tech and vibrant, with a mix of dark and neon colors." \
--sample-steps 5 --show
```
对于 macOS 和 Windows我们提供了桌面安装程序。请参阅[文档](https://docs.gpustack.ai/latest/installation/desktop-installer/)了解安装细节。
命令完成后,生成的图像将出现在默认查看器中。你可以尝试修改 prompt 和 CLI 参数来定制输出。
## 部署模型
![Generated Image](https://raw.githubusercontent.com/gpustack/gpustack/main/docs/assets/quickstart-minion.png)
1. 在 GPUStack 界面,在菜单中点击“模型库”。
3. 在浏览器中打开 `http://your_host_ip`，访问 GPUStack 界面。使用“admin”用户名和默认密码登录 GPUStack。可以执行以下命令获取默认密码：
2. 从模型列表中选择 `Qwen3` 模型。
**Linux 或 macOS**
3. 在部署兼容性检查通过之后,选择保存部署模型。
```bash
cat /var/lib/gpustack/initial_admin_password
```
![deploy qwen3 from catalog](docs/assets/quick-start/quick-start-qwen3.png)
**Windows**
4. GPUStack 将开始下载模型文件并部署模型。当部署状态显示为 `Running` 时,表示模型已成功部署。
```powershell
Get-Content -Path "$env:APPDATA\gpustack\initial_admin_password" -Raw
```
![model is running](docs/assets/quick-start/model-running.png)
5. 点击菜单中的“试验场 - 对话”,在右上方模型菜单中选择模型 `qwen3`。现在你可以在试验场中与 LLM 进行对话。
4. 在菜单中点击“试验场 - 对话”,现在你可以在试验场中与 LLM 进行对话。
![quick chat](docs/assets/quick-start/quick-chat.png)
![Playground Screenshot](https://raw.githubusercontent.com/gpustack/gpustack/main/docs/assets/playground-screenshot.png)
## 通过 API 使用模型
5. 在菜单中点击“API 秘钥”,然后点击“新建 API 秘钥”按钮。
1. 将鼠标移动到右下角的用户头像上选择“API 密钥”,然后点击“新建 API 秘钥”按钮。
6. 填写“名称”,然后点击“保存”按钮。
2. 填写“名称”,然后点击“保存”按钮。
7. 复制生成的 API 密钥并将其保存。请注意,秘钥只在创建时可见。
3. 复制生成的 API 密钥并将其保存。请注意,秘钥只在创建时可见。
8. 现在你可以使用 API 密钥访问 OpenAI 兼容 API。例如，curl 的用法如下：
4. 现在你可以使用 API 密钥访问 OpenAI 兼容 API。例如，curl 的用法如下：
```bash
# Replace `your_api_key` and `your_gpustack_server_url`
# with your actual API key and GPUStack server URL.
export GPUSTACK_API_KEY=your_api_key
curl http://your_gpustack_server_url/v1-openai/chat/completions \
curl http://your_gpustack_server_url/v1/chat/completions \
-H "Content-Type: application/json" \
-H "Authorization: Bearer $GPUSTACK_API_KEY" \
-d '{
"model": "llama3.2",
"model": "qwen3",
"messages": [
{
"role": "system",
@ -135,7 +122,7 @@ curl http://your_gpustack_server_url/v1-openai/chat/completions \
},
{
"role": "user",
"content": "Hello!"
"content": "Tell me a joke."
}
],
"stream": true
@ -144,8 +131,8 @@ curl http://your_gpustack_server_url/v1-openai/chat/completions \
## 平台支持
- [x] macOS
- [x] Linux
- [x] macOS
- [x] Windows
## 加速框架支持
@ -157,15 +144,11 @@ curl http://your_gpustack_server_url/v1-openai/chat/completions \
- [x] 海光 DTK
- [x] 摩尔线程 MUSA
- [x] 天数智芯 Corex
我们计划在未来的版本中支持以下加速框架:
- [ ] Intel oneAPI
- [ ] Qualcomm AI Engine
- [x] 寒武纪 MLU
## 模型支持
GPUStack 使用 [llama-box](https://github.com/gpustack/llama-box)（基于 [llama.cpp](https://github.com/ggml-org/llama.cpp) 和 [stable-diffusion.cpp](https://github.com/leejet/stable-diffusion.cpp)）、[vLLM](https://github.com/vllm-project/vllm)、[Ascend MindIE](https://www.hiascend.com/en/software/mindie) 和 [vox-box](https://github.com/gpustack/vox-box) 作为后端并提供广泛的模型支持。支持从以下来源部署模型：
GPUStack 使用 [vLLM](https://github.com/vllm-project/vllm)、[Ascend MindIE](https://www.hiascend.com/en/software/mindie)、[llama-box](https://github.com/gpustack/llama-box)（基于 [llama.cpp](https://github.com/ggml-org/llama.cpp) 和 [stable-diffusion.cpp](https://github.com/leejet/stable-diffusion.cpp)）和 [vox-box](https://github.com/gpustack/vox-box) 作为后端并提供广泛的模型支持。支持从以下来源部署模型：
1. [Hugging Face](https://huggingface.co/)
@ -175,14 +158,14 @@ GPUStack 使用 [llama-box](https://github.com/gpustack/llama-box)(基于 [lla
### 示例模型
| **类别** | **模型** |
| ---------------------- | ---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- |
| **大语言模型LLM** | [Qwen](https://huggingface.co/models?search=Qwen/Qwen), [LLaMA](https://huggingface.co/meta-llama), [Mistral](https://huggingface.co/mistralai), [DeepSeek](https://huggingface.co/models?search=deepseek-ai/deepseek), [Phi](https://huggingface.co/models?search=microsoft/phi), [Gemma](https://huggingface.co/models?search=Google/gemma) |
| **多模态模型VLM** | [Llama3.2-Vision](https://huggingface.co/models?pipeline_tag=image-text-to-text&search=llama3.2), [Pixtral](https://huggingface.co/models?search=pixtral) , [Qwen2.5-VL](https://huggingface.co/models?search=Qwen/Qwen2.5-VL), [LLaVA](https://huggingface.co/models?search=llava), [InternVL2.5](https://huggingface.co/models?search=internvl2_5) |
| **Diffusion 扩散模型** | [Stable Diffusion](https://huggingface.co/models?search=gpustack/stable-diffusion), [FLUX](https://huggingface.co/models?search=gpustack/flux) |
| **Embedding 模型** | [BGE](https://huggingface.co/gpustack/bge-m3-GGUF), [BCE](https://huggingface.co/gpustack/bce-embedding-base_v1-GGUF), [Jina](https://huggingface.co/models?search=gpustack/jina-embeddings) |
| **Reranker 模型** | [BGE](https://huggingface.co/gpustack/bge-reranker-v2-m3-GGUF), [BCE](https://huggingface.co/gpustack/bce-reranker-base_v1-GGUF), [Jina](https://huggingface.co/models?search=gpustack/jina-reranker) |
| **语音模型** | [Whisper](https://huggingface.co/models?search=Systran/faster) (Speech-to-Text), [CosyVoice](https://huggingface.co/models?search=FunAudioLLM/CosyVoice) (Text-to-Speech) |
| **类别** | **模型** |
| ---------------------- | ------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------ |
| **大语言模型LLM** | [Qwen](https://huggingface.co/models?search=Qwen/Qwen), [LLaMA](https://huggingface.co/meta-llama), [Mistral](https://huggingface.co/mistralai), [DeepSeek](https://huggingface.co/models?search=deepseek-ai/deepseek), [Phi](https://huggingface.co/models?search=microsoft/phi), [Gemma](https://huggingface.co/models?search=Google/gemma) |
| **多模态模型VLM** | [Llama3.2-Vision](https://huggingface.co/models?pipeline_tag=image-text-to-text&search=llama3.2), [Pixtral](https://huggingface.co/models?search=pixtral) , [Qwen2.5-VL](https://huggingface.co/models?search=Qwen/Qwen2.5-VL), [LLaVA](https://huggingface.co/models?search=llava), [InternVL3](https://huggingface.co/models?search=internvl3) |
| **Diffusion 扩散模型** | [Stable Diffusion](https://huggingface.co/models?search=gpustack/stable-diffusion), [FLUX](https://huggingface.co/models?search=gpustack/flux) |
| **Embedding 模型** | [BGE](https://huggingface.co/gpustack/bge-m3-GGUF), [BCE](https://huggingface.co/gpustack/bce-embedding-base_v1-GGUF), [Jina](https://huggingface.co/models?search=gpustack/jina-embeddings), [Qwen3-Embedding](https://huggingface.co/models?search=qwen/qwen3-embedding) |
| **Reranker 模型** | [BGE](https://huggingface.co/gpustack/bge-reranker-v2-m3-GGUF), [BCE](https://huggingface.co/gpustack/bce-reranker-base_v1-GGUF), [Jina](https://huggingface.co/models?search=gpustack/jina-reranker), [Qwen3-Reranker](https://huggingface.co/models?search=qwen/qwen3-reranker) |
| **语音模型** | [Whisper](https://huggingface.co/models?search=Systran/faster) (Speech-to-Text), [CosyVoice](https://huggingface.co/models?search=FunAudioLLM/CosyVoice) (Text-to-Speech) |
有关支持模型的完整列表,请参阅 [inference backends](https://docs.gpustack.ai/latest/user-guide/inference-backends/) 文档中的 Supported Models 部分。
@ -236,20 +219,20 @@ GPUStack 用户可以在 UI 中生成自己的 API 密钥。
## 加入社区
扫码添加 GPUStack 微信小助手加入社区群:
扫码加入社区群:
<p align="left">
<img alt="Wechat-assistant" src="./docs/assets/wechat-assistant.png" width="300px"/>
<img alt="Wechat-group" src="./docs/assets/wechat-group-qrcode.jpg" width="300px"/>
</p>
## License
版权所有 (c) 2024 GPUStack 作者
本项目基于 Apache-2.0 许可证(以下简称“许可证”)授权。
您只能在遵守许可证条款的前提下使用本项目。
本项目基于 Apache-2.0 许可证(以下简称“许可证”)授权。
您只能在遵守许可证条款的前提下使用本项目。
许可证的完整内容请参阅 [LICENSE](./LICENSE) 文件。
除非适用法律另有规定或双方另有书面约定,依据许可证分发的软件按“原样”提供,
除非适用法律另有规定或双方另有书面约定,依据许可证分发的软件按“原样”提供,
不附带任何明示或暗示的保证或条件。
有关许可证规定的具体权利和限制,请参阅许可证了解更多详细信息。

@ -10,7 +10,7 @@
<img alt="Documentation" src="https://img.shields.io/badge/ドキュメント-GPUStack-blue?logo=readthedocs&logoColor=white"></a>
<a href="./LICENSE" target="_blank">
<img alt="License" src="https://img.shields.io/github/license/gpustack/gpustack?logo=github&logoColor=white&label=License&color=blue"></a>
<a href="./docs/assets/wechat-assistant.png" target="_blank">
<a href="./docs/assets/wechat-group-qrcode.jpg" target="_blank">
<img alt="WeChat" src="https://img.shields.io/badge/微信群-GPUStack-blue?logo=wechat&logoColor=white"></a>
<a href="https://discord.gg/VXYJzuaqwD" target="_blank">
<img alt="Discord" src="https://img.shields.io/badge/Discord-GPUStack-blue?logo=discord&logoColor=white"></a>
@ -35,7 +35,7 @@ GPUStack は、AI モデルを実行するためのオープンソース GPU ク
- **幅広い GPU 互換性:** Apple Mac、Windows PC、Linux サーバー上のさまざまなベンダーの GPU をシームレスにサポート。
- **豊富なモデルサポート:** LLM、VLM、画像モデル、音声モデル、埋め込みモデル、リランクモデルを含む幅広いモデルをサポート。
- **柔軟な推論バックエンド:** llama-box（llama.cpp と stable-diffusion.cpp）、vox-box、vLLM、Ascend MindIE と統合。
- **柔軟な推論バックエンド:** vLLM、Ascend MindIE、llama-box（llama.cpp と stable-diffusion.cpp）、vox-box と統合。
- **マルチバージョンバックエンドサポート:** 異なるモデルの多様なランタイム要件を満たすために、推論バックエンドの複数バージョンを同時実行。
- **分散推論:** ベンダーやランタイム環境をまたぐ異種 GPU を含む、シングルノードおよびマルチノードのマルチ GPU 推論をサポート。
- **スケーラブルな GPU アーキテクチャ:** インフラストラクチャに GPU やノードを追加することで簡単にスケールアップ。
@ -50,83 +50,71 @@ GPUStack は、AI モデルを実行するためのオープンソース GPU ク
## インストール
### Linux または macOS
### Linux
GPUStack は、systemd または launchd ベースのシステムでサービスとしてインストールするスクリプトを提供しており、デフォルトポートは 80 です。この方法で GPUStack をインストールするには、以下を実行します:
NVIDIA GPU を使用している場合は、Docker と NVIDIA Container Toolkit をインストールしてください。その後、以下のコマンドで GPUStack サーバーを起動します:
```bash
curl -sfL https://get.gpustack.ai | sh -s -
docker run -d --name gpustack \
--restart=unless-stopped \
--gpus all \
--network=host \
--ipc=host \
-v gpustack-data:/var/lib/gpustack \
gpustack/gpustack
```
### Windows
詳細なインストール手順やその他の GPU ハードウェアプラットフォームについては、[インストールドキュメント](docs/installation/installation-requirements.md)を参照してください。
管理者として PowerShell を実行しPowerShell ISE の使用は**避けてください**)、以下のコマンドを実行して GPUStack をインストールします:
```powershell
Invoke-Expression (Invoke-WebRequest -Uri "https://get.gpustack.ai" -UseBasicParsing).Content
```
### Other Installation Methods
For manual installation, Docker installation, or detailed configuration options, refer to the [installation documentation](https://docs.gpustack.ai/latest/installation/installation-script/).
## Getting Started
1. Run and chat with the **llama3.2** model:
After the server starts, you can retrieve the default admin password with the following command:
```bash
gpustack chat llama3.2 "tell me a joke."
cat /var/lib/gpustack/initial_admin_password
```
2. Generate an image with the **stable-diffusion-v3-5-large-turbo** model:
Open http://your_host_ip in your browser and log in with the username admin and the password you retrieved.
> ### 💡 Tip
>
> This command downloads the model (about 12GB) from Hugging Face, so download time depends on your network speed. Make sure you have enough disk space and VRAM (12GB) to run the model. If you run into problems, you can skip this step and move on to the next one.
### macOS & Windows
```bash
gpustack draw hf.co/gpustack/stable-diffusion-v3-5-large-turbo-GGUF:stable-diffusion-v3-5-large-turbo-Q4_0.gguf \
"A minion holding a sign that says 'GPUStack'. The background is filled with futuristic elements like neon lights, circuit boards, and holographic displays. The minion is wearing a tech-themed outfit, possibly with LED lights or digital patterns. The sign itself has a sleek, modern design with glowing edges. The overall atmosphere is high-tech and vibrant, with a mix of dark and neon colors." \
--sample-steps 5 --show
```
Desktop installers are available for macOS and Windows. See the [documentation](https://docs.gpustack.ai/latest/installation/desktop-installer/) for installation details.
When the command completes, the generated image is shown in your default viewer. You can experiment with the prompt and CLI options to customize the output.
## Deploying a Model
![Generated Image](https://raw.githubusercontent.com/gpustack/gpustack/main/docs/assets/quickstart-minion.png)
1. Navigate to the Catalog page in the GPUStack UI.
3. Open `http://your_host_ip` in your browser to access the GPUStack UI. Log in to GPUStack with the username `admin` and the default password. To retrieve the password for the default setup, run the following command:
2. Select the Qwen3 model from the model list.
**Linux or macOS**
3. After the deployment compatibility check completes, click the Save button to deploy.
```bash
cat /var/lib/gpustack/initial_admin_password
```
![deploy qwen3 from catalog](docs/assets/quick-start/quick-start-qwen3.png)
**Windows**
4. The model download and deployment will start. When the status turns to Running, the deployment has succeeded.
```powershell
Get-Content -Path "$env:APPDATA\gpustack\initial_admin_password" -Raw
```
![model is running](docs/assets/quick-start/model-running.png)
5. Select Playground - Chat from the navigation menu, make sure qwen3 is selected in the Model dropdown at the top right, and start chatting.
4. Click `Playground - Chat` in the navigation menu. You can now chat with the LLM in the UI playground.
![quick chat](docs/assets/quick-start/quick-chat.png)
![Playground Screenshot](https://raw.githubusercontent.com/gpustack/gpustack/main/docs/assets/playground-screenshot.png)
## Using the Model via the API
5. Click `API Keys` in the navigation menu, then click the `New API Key` button.
1. Hover over the user avatar, navigate to the API Keys page, and click New API Key.
6. Enter a `Name` and click the `Save` button.
2. Enter a Name and click Save.
7. Copy the generated API key and store it somewhere safe. Note that it is only shown once, at creation time.
3. Copy the generated API key and keep it somewhere safe (it is only shown once).
8. You can now use the API key to access the OpenAI-compatible API. For example, with curl:
4. You can now access the OpenAI-compatible endpoints. For example:
```bash
# Replace `your_api_key` and `your_gpustack_server_url`
# with your actual API key and GPUStack server URL.
export GPUSTACK_API_KEY=your_api_key
curl http://your_gpustack_server_url/v1-openai/chat/completions \
curl http://your_gpustack_server_url/v1/chat/completions \
-H "Content-Type: application/json" \
-H "Authorization: Bearer $GPUSTACK_API_KEY" \
-d '{
"model": "llama3.2",
"model": "qwen3",
"messages": [
{
"role": "system",
@ -134,7 +122,7 @@ curl http://your_gpustack_server_url/v1-openai/chat/completions \
},
{
"role": "user",
"content": "Hello!"
"content": "Tell me a joke."
}
],
"stream": true
@ -143,8 +131,8 @@ curl http://your_gpustack_server_url/v1-openai/chat/completions \
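The same endpoint also works with any OpenAI-compatible client library. Below is a minimal sketch using the official `openai` Python package; the server URL, API key, and model name (`qwen3`) are placeholders matching the curl example above, not values from this repository:

```python
from openai import OpenAI

# Placeholders: replace with your GPUStack server URL and API key.
client = OpenAI(
    base_url="http://your_gpustack_server_url/v1",
    api_key="your_api_key",
)

# Stream a chat completion from the deployed qwen3 model.
stream = client.chat.completions.create(
    model="qwen3",
    messages=[{"role": "user", "content": "Tell me a joke."}],
    stream=True,
)
for chunk in stream:
    if chunk.choices and chunk.choices[0].delta.content:
        print(chunk.choices[0].delta.content, end="", flush=True)
print()
```

Non-streaming requests work the same way: omit `stream=True` and read `response.choices[0].message.content` instead.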
## Supported Platforms
- [x] macOS
- [x] Linux
- [x] macOS
- [x] Windows
## Supported Accelerators
@ -156,15 +144,11 @@ curl http://your_gpustack_server_url/v1-openai/chat/completions \
- [x] Hygon DTK
- [x] Moore Threads MUSA
- [x] Iluvatar Corex
The following accelerators are planned for support in future releases.
- [ ] Intel oneAPI
- [ ] Qualcomm AI Engine
- [x] Cambricon MLU
## Supported Models
GPUStack uses [llama-box](https://github.com/gpustack/llama-box) (a bundled [llama.cpp](https://github.com/ggml-org/llama.cpp) and [stable-diffusion.cpp](https://github.com/leejet/stable-diffusion.cpp) server), [vLLM](https://github.com/vllm-project/vllm), [Ascend MindIE](https://www.hiascend.com/en/software/mindie), and [vox-box](https://github.com/gpustack/vox-box) as backends and supports a wide range of models. Models from the following sources are supported:
GPUStack uses [vLLM](https://github.com/vllm-project/vllm), [Ascend MindIE](https://www.hiascend.com/en/software/mindie), [llama-box](https://github.com/gpustack/llama-box) (a bundled [llama.cpp](https://github.com/ggml-org/llama.cpp) and [stable-diffusion.cpp](https://github.com/leejet/stable-diffusion.cpp) server), and [vox-box](https://github.com/gpustack/vox-box) as backends and supports a wide range of models. Models from the following sources are supported:
1. [Hugging Face](https://huggingface.co/)
@ -172,16 +156,16 @@ GPUStack uses [llama-box](https://github.com/gpustack/llama-box) (a bundled
3. Local file paths
### Example Models
| **Category** | **Models** |
| --- | --- |
| **Large Language Models (LLMs)** | [Qwen](https://huggingface.co/models?search=Qwen/Qwen), [LLaMA](https://huggingface.co/meta-llama), [Mistral](https://huggingface.co/mistralai), [DeepSeek](https://huggingface.co/models?search=deepseek-ai/deepseek), [Phi](https://huggingface.co/models?search=microsoft/phi), [Gemma](https://huggingface.co/models?search=Google/gemma) |
| **Vision Language Models (VLMs)** | [Llama3.2-Vision](https://huggingface.co/models?pipeline_tag=image-text-to-text&search=llama3.2), [Pixtral](https://huggingface.co/models?search=pixtral), [Qwen2.5-VL](https://huggingface.co/models?search=Qwen/Qwen2.5-VL), [LLaVA](https://huggingface.co/models?search=llava), [InternVL2.5](https://huggingface.co/models?search=internvl2_5) |
| **Diffusion Models** | [Stable Diffusion](https://huggingface.co/models?search=gpustack/stable-diffusion), [FLUX](https://huggingface.co/models?search=gpustack/flux) |
| **Embedding Models** | [BGE](https://huggingface.co/gpustack/bge-m3-GGUF), [BCE](https://huggingface.co/gpustack/bce-embedding-base_v1-GGUF), [Jina](https://huggingface.co/models?search=gpustack/jina-embeddings) |
| **Reranker Models** | [BGE](https://huggingface.co/gpustack/bge-reranker-v2-m3-GGUF), [BCE](https://huggingface.co/gpustack/bce-reranker-base_v1-GGUF), [Jina](https://huggingface.co/models?search=gpustack/jina-reranker) |
| **Audio Models** | [Whisper](https://huggingface.co/models?search=Systran/faster) (speech-to-text), [CosyVoice](https://huggingface.co/models?search=FunAudioLLM/CosyVoice) (text-to-speech) |
| **Category** | **Models** |
| --- | --- |
| **Large Language Models (LLMs)** | [Qwen](https://huggingface.co/models?search=Qwen/Qwen), [LLaMA](https://huggingface.co/meta-llama), [Mistral](https://huggingface.co/mistralai), [DeepSeek](https://huggingface.co/models?search=deepseek-ai/deepseek), [Phi](https://huggingface.co/models?search=microsoft/phi), [Gemma](https://huggingface.co/models?search=Google/gemma) |
| **Vision Language Models (VLMs)** | [Llama3.2-Vision](https://huggingface.co/models?pipeline_tag=image-text-to-text&search=llama3.2), [Pixtral](https://huggingface.co/models?search=pixtral), [Qwen2.5-VL](https://huggingface.co/models?search=Qwen/Qwen2.5-VL), [LLaVA](https://huggingface.co/models?search=llava), [InternVL3](https://huggingface.co/models?search=internvl3) |
| **Diffusion Models** | [Stable Diffusion](https://huggingface.co/models?search=gpustack/stable-diffusion), [FLUX](https://huggingface.co/models?search=gpustack/flux) |
| **Embedding Models** | [BGE](https://huggingface.co/gpustack/bge-m3-GGUF), [BCE](https://huggingface.co/gpustack/bce-embedding-base_v1-GGUF), [Jina](https://huggingface.co/models?search=gpustack/jina-embeddings), [Qwen3-Embedding](https://huggingface.co/models?search=qwen/qwen3-embedding) |
| **Reranker Models** | [BGE](https://huggingface.co/gpustack/bge-reranker-v2-m3-GGUF), [BCE](https://huggingface.co/gpustack/bce-reranker-base_v1-GGUF), [Jina](https://huggingface.co/models?search=gpustack/jina-reranker), [Qwen3-Reranker](https://huggingface.co/models?search=qwen/qwen3-reranker) |
| **Audio Models** | [Whisper](https://huggingface.co/models?search=Systran/faster) (speech-to-text), [CosyVoice](https://huggingface.co/models?search=FunAudioLLM/CosyVoice) (text-to-speech) |
For a complete list of supported models, see the Supported Models section of the [inference backends](https://docs.gpustack.ai/latest/user-guide/inference-backends/) documentation.

@ -1,346 +0,0 @@
import asyncio
import time
import httpx
import numpy
import logging
import argparse
import json
import random
from openai import AsyncOpenAI
logging.basicConfig(
level=logging.INFO, format="%(asctime)s - %(levelname)s - %(message)s"
)
# Avoid client side connection error: https://github.com/encode/httpx/discussions/3084
http_client = httpx.AsyncClient(
limits=httpx.Limits(
max_connections=10000, max_keepalive_connections=10000, keepalive_expiry=30
)
)
SAMPLE_PROMPTS = [
"Explain how blockchain technology works, and provide a real-world example of its application outside of cryptocurrency.",
"Compare and contrast the philosophies of Nietzsche and Kant, including their views on morality and human nature.",
"Imagine you're a travel blogger. Write a detailed post describing a week-long adventure through rural Japan.",
"Write a fictional letter from Albert Einstein to a modern-day physicist, discussing the current state of quantum mechanics.",
"Provide a comprehensive explanation of how transformers work in machine learning, including attention mechanisms and positional encoding.",
"Draft a business proposal for launching a new AI-powered productivity app, including target audience, key features, and a monetization strategy.",
"Simulate a panel discussion between Elon Musk, Marie Curie, and Sun Tzu on the topic of 'Leadership in Times of Crisis'.",
"Describe the process of photosynthesis in depth, and explain its importance in the global carbon cycle.",
"Analyze the impact of social media on political polarization, citing relevant studies or historical examples.",
"Write a short science fiction story where humans discover a parallel universe that operates under different physical laws.",
"Explain the role of the Federal Reserve in the U.S. economy and how it manages inflation and unemployment.",
"Describe the architecture of a modern web application, from frontend to backend, including databases, APIs, and deployment.",
"Write an essay discussing whether artificial general intelligence (AGI) poses an existential threat to humanity.",
"Summarize the key events and consequences of the Cuban Missile Crisis, and reflect on lessons for modern diplomacy.",
"Create a guide for beginners on how to train a custom LLM using open-source tools and publicly available datasets.",
]
async def process_stream(stream):
first_token_time = None
total_tokens = 0
async for chunk in stream:
if first_token_time is None:
first_token_time = time.time()
if chunk.choices[0].delta.content:
total_tokens += 1
if chunk.choices[0].finish_reason is not None:
break
return first_token_time, total_tokens
async def make_request(
client: AsyncOpenAI, model, max_completion_tokens, request_timeout
):
start_time = time.time()
content = random.choice(SAMPLE_PROMPTS)
try:
stream = await client.chat.completions.create(
model=model,
messages=[{"role": "user", "content": content}],
max_completion_tokens=max_completion_tokens,
stream=True,
)
first_token_time, total_tokens = await asyncio.wait_for(
process_stream(stream), timeout=request_timeout
)
end_time = time.time()
elapsed_time = end_time - start_time
ttft = first_token_time - start_time if first_token_time else None
tokens_per_second = total_tokens / elapsed_time if elapsed_time > 0 else 0
return total_tokens, elapsed_time, tokens_per_second, ttft
except asyncio.TimeoutError:
logging.warning(f"Request timed out after {request_timeout} seconds")
return None
except Exception as e:
logging.error(f"Error during request: {str(e)}")
return None
async def worker(
client,
model,
semaphore,
queue,
results,
max_completion_tokens,
request_timeout,
):
while True:
async with semaphore:
task_id = await queue.get()
if task_id is None:
queue.task_done()
break
logging.info(f"Starting request {task_id}")
result = await make_request(
client, model, max_completion_tokens, request_timeout
)
if result:
results.append(result)
else:
logging.warning(f"Request {task_id} failed")
queue.task_done()
logging.info(f"Finished request {task_id}")
def calculate_percentile(values, percentile, reverse=False):
if not values:
return None
if reverse:
return numpy.percentile(values, 100 - percentile)
return numpy.percentile(values, percentile)
async def preflight_check(client, model) -> bool:
result = await make_request(client, model, 16, 60)
return result is not None
async def main(
model,
num_requests,
concurrency,
request_timeout,
max_completion_tokens,
server_url,
api_key,
):
client = AsyncOpenAI(
base_url=f"{server_url}/v1",
api_key=api_key,
http_client=http_client,
max_retries=0,
)
if not await preflight_check(client, model):
logging.error(
"Preflight check failed. Please check configuration and the service status."
)
return
semaphore = asyncio.Semaphore(concurrency)
queue = asyncio.Queue()
results = []
# Add tasks to the queue
for i in range(num_requests):
await queue.put(i)
# Add sentinel values to stop workers
for _ in range(concurrency):
await queue.put(None)
# Create worker tasks
workers = [
asyncio.create_task(
worker(
client,
model,
semaphore,
queue,
results,
max_completion_tokens,
request_timeout,
)
)
for _ in range(concurrency)
]
start_time = time.time()
# Wait for all tasks to complete
await queue.join()
await asyncio.gather(*workers)
end_time = time.time()
# Calculate metrics
total_elapsed_time = end_time - start_time
total_tokens = sum(tokens for tokens, _, _, _ in results if tokens is not None)
latencies = [
elapsed_time for _, elapsed_time, _, _ in results if elapsed_time is not None
]
tokens_per_second_list = [tps for _, _, tps, _ in results if tps is not None]
ttft_list = [ttft for _, _, _, ttft in results if ttft is not None]
successful_requests = len(results)
success_rate = successful_requests / num_requests if num_requests > 0 else 0
requests_per_second = (
successful_requests / total_elapsed_time if total_elapsed_time > 0 else 0
)
avg_latency = sum(latencies) / len(latencies) if latencies else 0
avg_tokens_per_second = (
sum(tokens_per_second_list) / len(tokens_per_second_list)
if tokens_per_second_list
else 0
)
overall_tokens_per_second = (
total_tokens / total_elapsed_time if total_elapsed_time > 0 else 0
)
avg_ttft = sum(ttft_list) / len(ttft_list) if ttft_list else 0
# Calculate percentiles
percentiles = [50, 95, 99]
latency_percentiles = [calculate_percentile(latencies, p) for p in percentiles]
tps_percentiles = [
calculate_percentile(tokens_per_second_list, p, reverse=True)
for p in percentiles
]
ttft_percentiles = [calculate_percentile(ttft_list, p) for p in percentiles]
return {
"model": model,
"total_requests": num_requests,
"successful_requests": successful_requests,
"success_rate": success_rate,
"concurrency": concurrency,
"request_timeout": request_timeout,
"max_completion_tokens": max_completion_tokens,
"total_time": total_elapsed_time,
"requests_per_second": requests_per_second,
"total_completion_tokens": total_tokens,
"latency": {
"average": avg_latency,
"p50": latency_percentiles[0],
"p95": latency_percentiles[1],
"p99": latency_percentiles[2],
},
"tokens_per_second": {
"overall": overall_tokens_per_second,
"average": avg_tokens_per_second,
"p50": tps_percentiles[0],
"p95": tps_percentiles[1],
"p99": tps_percentiles[2],
},
"time_to_first_token": {
"average": avg_ttft,
"p50": ttft_percentiles[0],
"p95": ttft_percentiles[1],
"p99": ttft_percentiles[2],
},
}
def output_results(results, result_file=None):
# Round all floats in results to two decimal places for output
def _round_floats(obj, ndigits=2):
if isinstance(obj, dict):
return {k: _round_floats(v, ndigits) for k, v in obj.items()}
if isinstance(obj, list):
return [_round_floats(v, ndigits) for v in obj]
if isinstance(obj, float):
return round(obj, ndigits)
return obj
formatted_results = _round_floats(results, 2)
if result_file:
with open(result_file, "w") as f:
json.dump(formatted_results, f, indent=2)
logging.info(f"Results saved to {result_file}")
else:
print(json.dumps(formatted_results, indent=2))
def set_http_client(args):
if args.headers:
for header in args.headers:
if ":" not in header:
parser.error(f"Invalid header format: {header}. Expected Key:Value")
key, value = header.split(":", 1)
http_client.headers[key.strip()] = value.strip()
http_client.timeout = args.request_timeout
if __name__ == "__main__":
parser = argparse.ArgumentParser(description="Benchmark Chat Completions API")
parser.add_argument(
"-m", "--model", type=str, required=True, help="Name of the model"
)
parser.add_argument(
"-n",
"--num-requests",
type=int,
default=100,
help="Number of requests to make (default: 100)",
)
parser.add_argument(
"-c",
"--concurrency",
type=int,
default=10,
help="Number of concurrent requests (default: 10)",
)
parser.add_argument(
"--request-timeout",
type=int,
default=300,
help="Timeout for each request in seconds (default: 300)",
)
parser.add_argument(
"--max-completion-tokens",
type=int,
default=1024,
help="Maximum number of tokens in the completion (default: 1024)",
)
parser.add_argument(
"--server-url",
type=str,
default="http://127.0.0.1",
help="URL of the GPUStack server",
)
parser.add_argument("--api-key", type=str, default="fake", help="GPUStack API key")
parser.add_argument(
"--result-file",
type=str,
help="Result file path to save benchmark json results",
)
parser.add_argument(
"-H",
"--header",
action="append",
dest="headers",
help="Custom HTTP header in Key:Value format. May be specified multiple times.",
)
args = parser.parse_args()
set_http_client(args)
results = asyncio.run(
main(
args.model,
args.num_requests,
args.concurrency,
args.request_timeout,
args.max_completion_tokens,
args.server_url,
args.api_key,
)
)
output_results(results, args.result_file)

@ -0,0 +1,654 @@
import asyncio
from dataclasses import asdict, dataclass, is_dataclass
import time
from typing import List, Optional
import aiohttp
import numpy
import logging
import argparse
import json
import random
from openai import APIConnectionError, AsyncOpenAI
from aiohttp import ClientSession
from httpx_aiohttp import AiohttpTransport
from openai import DefaultAsyncHttpxClient
from openai.types.chat import (
ChatCompletionStreamOptionsParam,
)
from tqdm import tqdm
logging.basicConfig(
level=logging.WARNING, format="%(asctime)s - %(levelname)s - %(message)s"
)
SAMPLE_PROMPTS = [
"Explain how blockchain technology works, and provide a real-world example of its application outside of cryptocurrency.",
"Compare and contrast the philosophies of Nietzsche and Kant, including their views on morality and human nature.",
"Imagine you're a travel blogger. Write a detailed post describing a week-long adventure through rural Japan.",
"Write a fictional letter from Albert Einstein to a modern-day physicist, discussing the current state of quantum mechanics.",
"Provide a comprehensive explanation of how transformers work in machine learning, including attention mechanisms and positional encoding.",
"Draft a business proposal for launching a new AI-powered productivity app, including target audience, key features, and a monetization strategy.",
"Simulate a panel discussion between Elon Musk, Marie Curie, and Sun Tzu on the topic of 'Leadership in Times of Crisis'.",
"Describe the process of photosynthesis in depth, and explain its importance in the global carbon cycle.",
"Analyze the impact of social media on political polarization, citing relevant studies or historical examples.",
"Write a short science fiction story where humans discover a parallel universe that operates under different physical laws.",
"Explain the role of the Federal Reserve in the U.S. economy and how it manages inflation and unemployment.",
"Describe the architecture of a modern web application, from frontend to backend, including databases, APIs, and deployment.",
"Write an essay discussing whether artificial general intelligence (AGI) poses an existential threat to humanity.",
"Summarize the key events and consequences of the Cuban Missile Crisis, and reflect on lessons for modern diplomacy.",
"Create a guide for beginners on how to train a custom LLM using open-source tools and publicly available datasets.",
]
@dataclass
class PercentileResults:
average: float
p50: float
p95: float
p99: float
@dataclass
class BenchmarkResults:
model: str
total_requests: int
successful_requests: int
success_rate: float
concurrency: int
request_timeout: int
max_completion_tokens: int
total_time: float
requests_per_second: float
total_tokens: int
total_prompt_tokens: int
total_completion_tokens: int
total_tokens_per_second: float
total_prompt_tokens_per_second: float
total_completion_tokens_per_second: float
latency: PercentileResults
completion_tokens_per_second: PercentileResults
time_to_first_token: PercentileResults
async def process_stream(stream):
first_token_time = None
async for chunk in stream:
if first_token_time is None:
first_token_time = time.time()
if chunk.usage:
return first_token_time, chunk.usage
return first_token_time, None
def get_random_prompt(prompt_multiplier):
"""
Returns a random prompt from the SAMPLE_PROMPTS list, repeated prompt_multiplier times.
"""
# Add a random prefix to avoid prefix cache hits
random_prefix = str(random.randint(100000, 999999))
return (
random_prefix + " " + (random.choice(SAMPLE_PROMPTS) + " ") * prompt_multiplier
)
async def make_chat_completion_request(
client: AsyncOpenAI,
model,
max_completion_tokens,
ignore_eos,
request_timeout,
prompt_multiplier,
):
start_time = time.time()
content = get_random_prompt(prompt_multiplier)
try:
stream = await client.chat.completions.create(
model=model,
messages=[{"role": "user", "content": content}],
max_completion_tokens=max_completion_tokens,
stream=True,
stream_options=ChatCompletionStreamOptionsParam(include_usage=True),
extra_body={"ignore_eos": ignore_eos} if ignore_eos else None,
)
first_token_time, usage = await asyncio.wait_for(
process_stream(stream), timeout=request_timeout
)
end_time = time.time()
elapsed_time = end_time - start_time
ttft = (first_token_time - start_time) * 1000 if first_token_time else None
return usage, elapsed_time, ttft
except asyncio.TimeoutError:
logging.warning(f"Request timed out after {request_timeout} seconds")
return None
except APIConnectionError as e:
logging.error(f"API connection error: {str(e)}")
return None
except Exception as e:
logging.error(f"Error during request: {str(e)}")
return None
async def make_embedding_request(
client: AsyncOpenAI,
model,
request_timeout,
prompt_multiplier=1,
):
start_time = time.time()
content = get_random_prompt(prompt_multiplier)
try:
response = await asyncio.wait_for(
client.embeddings.create(model=model, input=content),
timeout=request_timeout,
)
end_time = time.time()
elapsed_time = end_time - start_time
ttft = None # Embeddings do not have a time to first token in the same way as chat completions
return response.usage, elapsed_time, ttft
except asyncio.TimeoutError:
logging.warning(f"Embedding request timed out after {request_timeout} seconds")
return None
except Exception as e:
logging.error(f"Error during embedding request: {str(e)}")
return None
async def worker(
client,
model,
semaphore,
queue,
results,
max_completion_tokens,
ignore_eos,
request_timeout,
embeddings=False,
prompt_multiplier=1,
pbar=None,
):
while True:
async with semaphore:
task_id = await queue.get()
if task_id is None:
queue.task_done()
break
logging.debug(f"Starting request {task_id}")
if embeddings:
result = await make_embedding_request(
client, model, request_timeout, prompt_multiplier
)
else:
result = await make_chat_completion_request(
client,
model,
max_completion_tokens,
ignore_eos,
request_timeout,
prompt_multiplier,
)
if result:
results.append(result)
else:
logging.warning(f"Request {task_id} failed")
queue.task_done()
if pbar:
pbar.update(1)
logging.debug(f"Finished request {task_id}")
def calculate_percentile(values, percentile, reverse=False):
if not values:
return None
if reverse:
return numpy.percentile(values, 100 - percentile)
return numpy.percentile(values, percentile)
async def preflight_check(client, model, embeddings=False) -> bool:
if embeddings:
result = await make_embedding_request(client, model, 16)
else:
result = await make_chat_completion_request(client, model, 16, False, 60, 1)
return result is not None
def set_headers(aiohttp_session: ClientSession, headers: Optional[List[str]]):
if headers:
for header in headers:
if ":" not in header:
raise ValueError(f"Invalid header format: {header}. Expected Key:Value")
key, value = header.split(":", 1)
aiohttp_session.headers[key.strip()] = value.strip()
async def main(
model,
num_requests,
concurrency,
request_timeout,
max_completion_tokens,
ignore_eos,
server_url,
api_key,
headers=None,
embeddings=False,
prompt_multiplier=1,
) -> Optional[BenchmarkResults]:
connector = aiohttp.TCPConnector(
limit=2000,
force_close=True,
)
async with ClientSession(connector=connector, trust_env=True) as aiohttp_session:
if headers:
set_headers(aiohttp_session, headers)
transport = AiohttpTransport(client=aiohttp_session)
httpx_client = DefaultAsyncHttpxClient(
transport=transport, timeout=request_timeout
)
client = AsyncOpenAI(
base_url=f"{server_url}/v1",
api_key=api_key,
http_client=httpx_client,
max_retries=0,
)
if not await preflight_check(client, model, embeddings=embeddings):
raise Exception(
"Preflight check failed. Please check configuration and the service status."
)
semaphore = asyncio.Semaphore(concurrency)
queue = asyncio.Queue()
results = []
# Add tasks to the queue
for i in range(num_requests):
await queue.put(i)
# Add sentinel values to stop workers
for _ in range(concurrency):
await queue.put(None)
pbar = tqdm(
total=num_requests,
desc="Running Benchmark requests",
unit="request",
dynamic_ncols=True,
)
# Create worker tasks
workers = [
asyncio.create_task(
worker(
client,
model,
semaphore,
queue,
results,
max_completion_tokens,
ignore_eos,
request_timeout,
embeddings,
prompt_multiplier,
pbar=pbar,
)
)
for _ in range(concurrency)
]
start_time = time.time()
# Wait for all tasks to complete
await queue.join()
await asyncio.gather(*workers)
end_time = time.time()
total_elapsed_time = end_time - start_time
return calculate_results(
model,
concurrency,
request_timeout,
max_completion_tokens,
total_elapsed_time,
num_requests,
results,
)
def calculate_results(
model,
concurrency,
request_timeout,
max_completion_tokens,
total_elapsed_time,
num_requests,
results,
):
# Calculate metrics
total_tokens = 0
prompt_tokens = 0
completion_tokens = 0
tokens_per_second_list = []
prompt_tokens_per_second_list = []
completion_tokens_per_second_list = []
for usage, elapsed_time, _ in results:
if usage is not None:
total_tokens += usage.total_tokens
prompt_tokens += usage.prompt_tokens
completion_tokens += usage.completion_tokens
prompt_tokens_per_second = (
usage.prompt_tokens / elapsed_time if elapsed_time > 0 else 0
)
completion_tokens_per_second = (
usage.completion_tokens / elapsed_time if elapsed_time > 0 else 0
)
tokens_per_second = (
usage.total_tokens / elapsed_time if elapsed_time > 0 else 0
)
tokens_per_second_list.append(tokens_per_second)
prompt_tokens_per_second_list.append(prompt_tokens_per_second)
completion_tokens_per_second_list.append(completion_tokens_per_second)
latencies = [
elapsed_time for _, elapsed_time, _ in results if elapsed_time is not None
]
ttft_list = [ttft for _, _, ttft in results if ttft is not None]
successful_requests = len(results)
success_rate = successful_requests / num_requests if num_requests > 0 else 0
requests_per_second = (
successful_requests / total_elapsed_time if total_elapsed_time > 0 else 0
)
avg_latency = sum(latencies) / len(latencies) if latencies else 0
avg_completion_tokens_per_second = (
sum(completion_tokens_per_second_list) / len(completion_tokens_per_second_list)
if completion_tokens_per_second_list
else 0
)
total_tokens_per_second = (
total_tokens / total_elapsed_time if total_elapsed_time > 0 else 0
)
total_prompt_tokens_per_second = (
prompt_tokens / total_elapsed_time if total_elapsed_time > 0 else 0
)
total_completion_tokens_per_second = (
completion_tokens / total_elapsed_time if total_elapsed_time > 0 else 0
)
avg_ttft = sum(ttft_list) / len(ttft_list) if ttft_list else 0
# Calculate percentiles
percentiles = [50, 95, 99]
latency_percentiles = [calculate_percentile(latencies, p) for p in percentiles]
completion_tps_percentiles = [
calculate_percentile(completion_tokens_per_second_list, p, reverse=True)
for p in percentiles
]
ttft_percentiles = [calculate_percentile(ttft_list, p) for p in percentiles]
return BenchmarkResults(
model=model,
total_requests=num_requests,
successful_requests=successful_requests,
success_rate=success_rate,
concurrency=concurrency,
request_timeout=request_timeout,
max_completion_tokens=max_completion_tokens,
total_time=total_elapsed_time,
requests_per_second=requests_per_second,
total_tokens=total_tokens,
total_prompt_tokens=prompt_tokens,
total_completion_tokens=completion_tokens,
total_tokens_per_second=total_tokens_per_second,
total_prompt_tokens_per_second=total_prompt_tokens_per_second,
total_completion_tokens_per_second=total_completion_tokens_per_second,
latency=PercentileResults(
average=avg_latency,
p50=latency_percentiles[0],
p95=latency_percentiles[1],
p99=latency_percentiles[2],
),
completion_tokens_per_second=PercentileResults(
average=avg_completion_tokens_per_second,
p50=completion_tps_percentiles[0],
p95=completion_tps_percentiles[1],
p99=completion_tps_percentiles[2],
),
time_to_first_token=PercentileResults(
average=avg_ttft,
p50=ttft_percentiles[0],
p95=ttft_percentiles[1],
p99=ttft_percentiles[2],
),
)
def fmt_line(label, *values, width=40):
label_part = f"{label:<{width}}"
value_part = " ".join(str(v) for v in values)
return f"{label_part}{value_part}"
def fmt_float(v, suffix=""):
return f"{v:.2f}{suffix}"
def output_benchmark_results_pretty(
results: BenchmarkResults, file: str = None, embeddings: bool = False
):
lines = []
lines.append("============== Serving Benchmark Result ===============")
lines.append(fmt_line("Model:", results.model))
lines.append(
fmt_line(
"Total requests:",
f"{results.successful_requests}/{results.total_requests}({results.success_rate:.2%})",
)
)
lines.append(fmt_line("Concurrency:", results.concurrency))
lines.append(fmt_line("Benchmark duration (s):", fmt_float(results.total_time)))
lines.append(
fmt_line("Request throughput (req/s):", fmt_float(results.requests_per_second))
)
lines.append(fmt_line("Total input tokens:", results.total_prompt_tokens))
if not embeddings:
lines.append(fmt_line("Total output tokens:", results.total_completion_tokens))
output_tok_per_sec = (
results.total_completion_tokens / results.total_time
if results.total_time > 0
else 0
)
total_tok_per_sec = (
results.total_tokens / results.total_time if results.total_time > 0 else 0
)
if not embeddings:
lines.append(
fmt_line("Output token throughput (tok/s):", fmt_float(output_tok_per_sec))
)
lines.append(
fmt_line("Total token throughput (tok/s):", fmt_float(total_tok_per_sec))
)
lines.append("------------------- Request Latency -------------------")
lines.append(fmt_line("Average latency (s):", fmt_float(results.latency.average)))
lines.append(fmt_line("P50 latency (s):", fmt_float(results.latency.p50)))
lines.append(fmt_line("P95 latency (s):", fmt_float(results.latency.p95)))
lines.append(fmt_line("P99 latency (s):", fmt_float(results.latency.p99)))
if not embeddings:
lines.append("--------------- Output Token Per Second ---------------")
lines.append(
fmt_line(
"Average TPS (tok/s):",
fmt_float(results.completion_tokens_per_second.average),
)
)
lines.append(
fmt_line(
"P50 TPS (tok/s):", fmt_float(results.completion_tokens_per_second.p50)
)
)
lines.append(
fmt_line(
"P95 TPS (tok/s):", fmt_float(results.completion_tokens_per_second.p95)
)
)
lines.append(
fmt_line(
"P99 TPS (tok/s):", fmt_float(results.completion_tokens_per_second.p99)
)
)
lines.append("----------------- Time to First Token -----------------")
lines.append(
fmt_line(
"Average TTFT (ms):", fmt_float(results.time_to_first_token.average)
)
)
lines.append(
fmt_line("P50 TTFT (ms):", fmt_float(results.time_to_first_token.p50))
)
lines.append(
fmt_line("P95 TTFT (ms):", fmt_float(results.time_to_first_token.p95))
)
lines.append(
fmt_line("P99 TTFT (ms):", fmt_float(results.time_to_first_token.p99))
)
lines.append("=" * 55)
output = "\n".join(lines)
if file:
with open(file, "w") as f:
f.write(output + "\n")
logging.info(f"Pretty benchmark results saved to {file}")
else:
print(output)
def output_benchmark_results_json(
results: BenchmarkResults, result_file=None, embeddings: bool = False
):
# Round all floats in results to two decimal places for output
def _round_floats(obj, ndigits=2):
if is_dataclass(obj):
obj = asdict(obj)
if isinstance(obj, dict):
return {k: _round_floats(v, ndigits) for k, v in obj.items()}
if isinstance(obj, list):
return [_round_floats(v, ndigits) for v in obj]
if isinstance(obj, float):
return round(obj, ndigits)
return obj
formatted_results = _round_floats(results, 2)
if result_file:
with open(result_file, "w") as f:
json.dump(formatted_results, f, indent=2)
logging.info(f"Results saved to {result_file}")
else:
print(json.dumps(formatted_results, indent=2))
if __name__ == "__main__":
parser = argparse.ArgumentParser(description="Benchmark Chat Completions API")
parser.add_argument(
"-m", "--model", type=str, required=True, help="Name of the model"
)
parser.add_argument(
"-n",
"--num-requests",
type=int,
default=100,
help="Number of requests to make (default: 100)",
)
parser.add_argument(
"-c",
"--concurrency",
type=int,
default=10,
help="Number of concurrent requests (default: 10)",
)
parser.add_argument(
"--request-timeout",
type=int,
default=300,
help="Timeout for each request in seconds (default: 300)",
)
parser.add_argument(
"--max-completion-tokens",
type=int,
default=1024,
help="Maximum number of tokens in the completion (default: 1024)",
)
parser.add_argument(
"--prompt-multiplier",
type=int,
default=1,
help="Repeat the randomly selected prompt N times to create longer inputs",
)
parser.add_argument(
'--ignore-eos',
action='store_true',
help='Set ignore_eos flag when sending the benchmark request. This will not stop the stream when the model generates an EOS token.',
)
parser.add_argument(
"--server-url",
type=str,
default="http://127.0.0.1",
help="URL of the GPUStack server",
)
parser.add_argument("--api-key", type=str, default="fake", help="GPUStack API key")
parser.add_argument(
"--result-file",
type=str,
help="Result file path to save benchmark json results",
)
parser.add_argument(
"-H",
"--header",
action="append",
dest="headers",
help="Custom HTTP header in Key:Value format. May be specified multiple times.",
)
parser.add_argument(
'--embeddings',
action='store_true',
help='Run embedding benchmark instead of chat completions',
)
parser.add_argument(
'--json',
action='store_true',
help='Output results in JSON format instead of pretty format',
)
args = parser.parse_args()
try:
results = asyncio.run(
main(
args.model,
args.num_requests,
args.concurrency,
args.request_timeout,
args.max_completion_tokens,
args.ignore_eos,
args.server_url,
args.api_key,
args.headers,
args.embeddings,
args.prompt_multiplier,
)
)
if args.json:
output_benchmark_results_json(
results, args.result_file, embeddings=args.embeddings
)
else:
output_benchmark_results_pretty(
results, args.result_file, embeddings=args.embeddings
)
except Exception as e:
logging.error(f"Benchmarking failed: {str(e)}")
exit(1)
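Purely as an illustration of the interface above, here is a sketch of driving the benchmark programmatically rather than through its CLI; the module name `benchmark`, the model name, the server URL, and the API key are assumptions, not part of this diff:

```python
import asyncio

# Assumes the script above is saved as benchmark.py and is importable.
from benchmark import main, output_benchmark_results_pretty

results = asyncio.run(
    main(
        model="qwen3",
        num_requests=50,
        concurrency=5,
        request_timeout=300,
        max_completion_tokens=256,
        ignore_eos=False,
        server_url="http://your_gpustack_server_url",
        api_key="your_api_key",
    )
)
output_benchmark_results_pretty(results)
```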

@ -0,0 +1,26 @@
aiohappyeyeballs==2.6.1
aiohttp==3.12.13
aiosignal==1.3.2
annotated-types==0.7.0
anyio==4.9.0
attrs==25.3.0
certifi==2025.6.15
distro==1.9.0
frozenlist==1.7.0
h11==0.16.0
httpcore==1.0.9
httpx==0.28.1
httpx-aiohttp==0.1.6
idna==3.10
jiter==0.10.0
multidict==6.5.1
numpy==2.3.1
openai==1.92.2
propcache==0.3.2
pydantic==2.11.7
pydantic_core==2.33.2
sniffio==1.3.1
tqdm==4.67.1
typing-inspection==0.4.1
typing_extensions==4.14.0
yarl==1.20.1

@ -0,0 +1,21 @@
import shutil
import tempfile
import pytest
from gpustack.config.config import Config, set_global_config
@pytest.fixture(scope="module", autouse=True)
def temp_dir():
tmp_dir = tempfile.mkdtemp()
print(f"Created temporary directory: {tmp_dir}")
yield tmp_dir
shutil.rmtree(tmp_dir)
@pytest.fixture(scope="module", autouse=True)
def config(temp_dir):
cfg = Config(
token="test", jwt_secret_key="test", data_dir=temp_dir, enable_ray=True
)
set_global_config(cfg)
return cfg
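As an illustration of how these fixtures might be consumed, a hypothetical test (not part of this diff) could request the `config` fixture explicitly; this assumes `Config` exposes its constructor arguments as attributes:

```python
# In a test module alongside this conftest.py; the module-scoped, autouse
# fixtures are applied automatically, and `config` can also be injected.
def test_global_config_is_set(config):
    assert config.token == "test"
    assert config.enable_ray is True
    assert config.data_dir  # the temporary directory created by temp_dir
```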

@ -27,7 +27,7 @@ The GPUStack server connects to a SQL database as the datastore. GPUStack uses S
### Inference Server
Inference servers are the backends that perform the inference tasks. GPUStack supports [llama-box](https://github.com/gpustack/llama-box), [vLLM](https://github.com/vllm-project/vllm), [Ascend MindIE](https://www.hiascend.com/en/software/mindie) and [vox-box](https://github.com/gpustack/vox-box) as inference servers.
Inference servers are the backends that perform the inference tasks. GPUStack supports [vLLM](https://github.com/vllm-project/vllm), [Ascend MindIE](https://www.hiascend.com/en/software/mindie), [llama-box](https://github.com/gpustack/llama-box) and [vox-box](https://github.com/gpustack/vox-box) as inference servers.
### RPC Server

(Binary image files changed; file names and contents are not shown in this diff.)
Some files were not shown because too many files have changed in this diff.