High Performance LLM with vLLM
```yaml
build:
  arguments:
    endpoint: Completions
    model: facebook/opt-125M
  model_server: VLLM
runtime:
  predict_concurrency: 128
environment_variables: {}
external_package_dirs: []
model_metadata:
  example_model_input: {"prompt": "What is the meaning of life?"}
model_name: OPT-125M vLLM
python_version: py39
requirements: []
resources:
  accelerator: T4
  cpu: "4"
  memory: 16Gi
  use_gpu: true
secrets: {}
system_packages: []
```
vLLM is a Python package that optimizes the attention layer in Transformer models. By allocating the memory used during attention computation more efficiently, vLLM reduces a model's memory footprint and significantly improves inference speed. Truss supports vLLM out of the box, so you can deploy vLLM-optimized models with ease.
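To see roughly what Truss is wrapping, here is a minimal sketch of using vLLM directly in Python. It assumes the vLLM package is installed (`pip install vllm`) and a CUDA-capable GPU is available; the model matches the one in the config below.

```python
# Minimal sketch: generating text with vLLM directly, outside of Truss.
from vllm import LLM, SamplingParams

# vLLM loads the model once and manages KV-cache memory efficiently,
# which is where its memory and throughput gains come from.
llm = LLM(model="facebook/opt-125M")

sampling_params = SamplingParams(temperature=0.8, max_tokens=64)

# generate() accepts a batch of prompts and schedules them together.
outputs = llm.generate(["What is the meaning of life?"], sampling_params)
for output in outputs:
    print(output.outputs[0].text)
```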
Start with the build section of config.yaml:

```yaml
build:
  arguments:
```
vLLM supports multiple types of endpoints:
- Completions: follows the same API as the OpenAI Completions API
- ChatCompletions: follows the same API as the OpenAI ChatCompletions API
```yaml
    endpoint: Completions
```
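The two endpoint types expect differently shaped request bodies, mirroring their OpenAI counterparts. An illustrative sketch (the field names follow the OpenAI spec that these endpoints mirror):

```python
# Completions: a single prompt string.
completions_input = {"prompt": "What is the meaning of life?"}

# ChatCompletions: a list of role-tagged messages.
chat_completions_input = {
    "messages": [
        {"role": "user", "content": "What is the meaning of life?"}
    ]
}
```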
Select which vLLM-compatible model you'd like to use:
```yaml
    model: facebook/opt-125M
```
The model_server parameter allows you to specify vLLM as the model server:

```yaml
  model_server: VLLM
```
Another important parameter to configure when choosing vLLM is predict_concurrency. One of the main benefits of vLLM is continuous batching, in which multiple requests can be processed at the same time. Without setting predict_concurrency, you cannot take advantage of this feature; a sketch of concurrent invocation follows the snippet below.
```yaml
runtime:
  predict_concurrency: 128
```
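To illustrate why this matters, here is a minimal sketch that fires several requests at a deployed model at once so the server can batch them. The URL is a hypothetical placeholder, and any auth headers your deployment platform requires are omitted:

```python
# Sketch: sending concurrent requests so vLLM's continuous batching
# can process them together. MODEL_URL is a placeholder, not a real endpoint.
from concurrent.futures import ThreadPoolExecutor

import requests

MODEL_URL = "https://example.com/your-model/predict"  # placeholder

def invoke(prompt: str) -> str:
    resp = requests.post(MODEL_URL, json={"prompt": prompt})
    resp.raise_for_status()
    return resp.text

prompts = [f"Tell me fact #{i} about language models." for i in range(16)]

# With predict_concurrency > 1, these requests reach the model server
# together instead of queuing one at a time.
with ThreadPoolExecutor(max_workers=16) as pool:
    results = list(pool.map(invoke, prompts))

for result in results:
    print(result)
```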
The remaining config options listed are standard Truss config options.
```yaml
environment_variables: {}
external_package_dirs: []
model_metadata:
  example_model_input: {"prompt": "What is the meaning of life?"}
model_name: OPT-125M vLLM
python_version: py39
requirements: []
resources:
  accelerator: T4
  cpu: "4"
  memory: 16Gi
  use_gpu: true
secrets: {}
system_packages: []
```
Deploy the model
Deploy the vLLM model like you would any other Truss, with:

```sh
truss push
```
You can then invoke the model with:
```sh
truss predict -d '{"prompt": "What is a large language model?", "model": "facebook/opt-125M"}' --published
```
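If you prefer to call the deployed model from Python rather than the CLI, a sketch follows. The URL and API key are placeholders; substitute whatever invocation URL and auth scheme your deployment platform provides:

```python
# Sketch: invoking the deployed model over HTTP.
# MODEL_URL and API_KEY are placeholders, not real values.
import requests

MODEL_URL = "https://example.com/your-model/predict"  # placeholder
API_KEY = "YOUR_API_KEY"  # placeholder

payload = {
    "prompt": "What is a large language model?",
    "model": "facebook/opt-125M",
}

resp = requests.post(
    MODEL_URL,
    headers={"Authorization": f"Api-Key {API_KEY}"},  # auth scheme varies
    json=payload,
)
resp.raise_for_status()
print(resp.json())
```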