Serve models with vLLM
vLLM is a Python-based package that optimizes the attention layer in Transformer models. By allocating the memory used during attention computation more efficiently, vLLM can reduce a model’s memory footprint and significantly improve inference speed. Truss supports vLLM out of the box, so you can deploy vLLM-optimized models with ease. We’re going to walk through deploying a vLLM-optimized OPT-125M model.
You can see the complete config for the finished model at the end of this guide. Keep reading for step-by-step instructions on how to generate it.
This example will cover:
- Generating the base Truss
- Setting sufficient model resources for inference
- Deploying the model
Step 1: Generating the base Truss
Get started by creating a new Truss:
```sh
truss init --backend VLLM opt125
```
You’re going to see a couple of prompts. Follow along with the instructions below:
- Type `facebook/opt-125M` when prompted for `model`.
- Press the `tab` key when prompted for `endpoint`, then select the `Completions` endpoint.
- Give your model a name like `OPT-125M`.
The underlying server we use is OpenAI compatible. If you plan to use the model as a chat model, select the `ChatCompletions` endpoint instead; OPT-125M is not a chat model, so we selected `Completions`.
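Because the server speaks the OpenAI protocol, once your model is deployed you should be able to query it with any OpenAI-style client. Below is a minimal sketch using the `openai` Python package; the base URL and API key are placeholders, and whether your deployment exposes an OpenAI-style route depends on your setup, so treat this as an illustration rather than the exact invocation:

```python
# Minimal sketch: querying an OpenAI-compatible vLLM server.
# The base_url and api_key values are placeholders (assumptions),
# not values from this guide; substitute your deployment's own.
from openai import OpenAI

client = OpenAI(
    base_url="https://your-deployment-url/v1",  # placeholder
    api_key="YOUR_API_KEY",                     # placeholder
)

completion = client.completions.create(
    model="facebook/opt-125M",
    prompt="What is a large language model?",
    max_tokens=64,
)
print(completion.choices[0].text)
```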
Finally, navigate to the directory:
```sh
cd opt125
```
Step 2: Setting resources and other arguments
You’ll notice that there’s a `config.yaml` in the new directory. This is where we’ll set the resources and other arguments for the model. Open the file in your favorite editor.
OPT-125M will need a GPU, so let’s set the correct resources. Update the `resources` key with the following:
```yaml
resources:
  accelerator: T4
  cpu: "4"
  memory: 16Gi
  use_gpu: true
```
Also notice the `build` key, which contains the `model_server` we’re using as well as other arguments. These arguments are passed to the underlying vLLM server; see the vLLM documentation for the full list.
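As an illustration, you could add an extra argument such as `max_model_len` (a real vLLM engine argument, though whether the model server forwards it unchanged is an assumption here) to cap the context length:

```yaml
build:
  arguments:
    endpoint: Completions
    model: facebook/opt-125M
    max_model_len: 512  # hypothetical extra argument; caps context length
  model_server: VLLM
```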
Step 3: Deploying the model
You’ll need a Baseten API key for this step.
Let’s deploy our OPT-125M vLLM model.
```sh
truss push
```
You can invoke the model with:
```sh
truss predict -d '{"prompt": "What is a large language model?", "model": "facebook/opt-125M"}' --published
```
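If you prefer to call the deployed model over plain HTTP, here is a sketch using Python’s `requests`. The URL assumes Baseten’s standard model prediction endpoint format, and `MODEL_ID` plus the API key are placeholders you would replace with your own values:

```python
# Sketch: invoking the deployed model over HTTP.
# MODEL_ID and the API key are placeholders; the URL format
# assumes Baseten's standard model prediction endpoint.
import requests

resp = requests.post(
    "https://model-MODEL_ID.api.baseten.co/production/predict",
    headers={"Authorization": "Api-Key YOUR_BASETEN_API_KEY"},
    json={
        "prompt": "What is a large language model?",
        "model": "facebook/opt-125M",
    },
)
print(resp.json())
```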
For reference, here’s the complete `config.yaml` for the finished model:

```yaml
build:
  arguments:
    endpoint: Completions
    model: facebook/opt-125M
  model_server: VLLM
environment_variables: {}
external_package_dirs: []
model_metadata: {}
model_name: OPT-125M
python_version: py39
requirements: []
resources:
  accelerator: T4
  cpu: "4"
  memory: 16Gi
  use_gpu: true
secrets: {}
system_packages: []
```