vLLM is a Python-based inference engine that optimizes the attention layer in Transformer models. By allocating the memory used during the attention computation (the KV cache) more efficiently, vLLM reduces a model's memory footprint and significantly improves inference speed. Truss supports vLLM out of the box, so you can deploy vLLM-optimized models with ease. We're going to walk through deploying a vLLM-optimized OPT-125M model.

You can see the config for the finished model at the end of this example. Keep reading for step-by-step instructions on how to generate it.

This example will cover:

  1. Generating the base Truss
  2. Setting sufficient model resources for inference
  3. Deploying the model

Step 1: Generating the base Truss

Get started by creating a new Truss:

truss init --backend VLLM opt125

You're going to see a few prompts. Follow along with the instructions below:

  1. Type facebook/opt-125M when prompted for a model.
  2. Press the Tab key when prompted for an endpoint and select Completions.
  3. Give your model a name like OPT-125M.

The underlying server that we use is OpenAI compatible. If you plan on using the model as a chat model, select the ChatCompletions endpoint instead. OPT-125M is not a chat model, so we selected Completions, which is reflected in the generated config shown below.
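Based on those choices, your new Truss's config.yaml will contain a build section like the following (this matches the finished config shown at the end of this example):

config.yaml
build:
  arguments:
    endpoint: Completions
    model: facebook/opt-125M
  model_server: VLLM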

Finally, navigate to the directory:

cd opt125

Step 2: Setting resources and other arguments

You’ll notice that there’s a config.yaml in the new directory. This is where we’ll set the resources and other arguments for the model. Open the file in your favorite editor.

OPT-125M will need a GPU, so let's set the correct resources. Update the resources key with the following:

config.yaml
resources:
  accelerator: T4
  cpu: "4"
  memory: 16Gi
  use_gpu: true

Also notice the build key, which contains the model_server we're using as well as other arguments. These arguments are passed through to the underlying vLLM server; see the vLLM documentation for the full list of supported arguments.
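For example, you could tune the server by adding engine arguments under build.arguments. This is a sketch rather than a definitive reference: max_model_len is a standard vLLM engine argument, and we're assuming here that arguments are forwarded to the vLLM server unchanged.

config.yaml
build:
  arguments:
    endpoint: Completions
    model: facebook/opt-125M
    max_model_len: 2048  # assumption: forwarded as a vLLM engine argument
  model_server: VLLM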

Step 3: Deploying the model

You’ll need a Baseten API key for this step.

Let’s deploy our OPT-125M vLLM model.

truss push

You can invoke the model with:

truss predict -d '{"prompt": "What is a large language model?", "model": "facebook/opt-125M"}' --published
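Once deployed, you can also call the model directly over HTTP. This sketch assumes the standard Baseten model endpoint URL pattern; MODEL_ID (your model's ID from the Baseten dashboard) and YOUR_API_KEY are placeholders you'll need to replace:

# Hypothetical invocation; assumes the standard Baseten predict endpoint
curl -X POST https://model-MODEL_ID.api.baseten.co/production/predict \
  -H 'Authorization: Api-Key YOUR_API_KEY' \
  -d '{"prompt": "What is a large language model?", "model": "facebook/opt-125M"}'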
Here's the finished config.yaml for reference:

config.yaml
build:
  arguments:
    endpoint: Completions
    model: facebook/opt-125M
  model_server: VLLM
environment_variables: {}
external_package_dirs: []
model_metadata: {}
model_name: OPT-125M
python_version: py39
requirements: []
resources:
  accelerator: T4
  cpu: "4"
  memory: 16Gi
  use_gpu: true
secrets: {}
system_packages: []