# Serve LLMs with TGI
TGI (Text Generation Inference) is a model server optimized for serving large language models.
You can see the config for the finished model at the end of this guide. Keep reading for step-by-step instructions on how to generate it.
This example will cover:
- Generating the base Truss
- Setting sufficient model resources for inference
- Deploying the model
## Step 1: Generating the base Truss
Get started by creating a new Truss:
```sh
truss init --backend TGI falcon-7b
```
You're going to see a few prompts. Follow along with the instructions below:

- Type `tiiuae/falcon-7b` when prompted for `model`.
- Press the `tab` key when prompted for `endpoint` and select the `generate_stream` endpoint.
- Give your model a name like `Falcon 7B`.
Finally, navigate to the directory:
```sh
cd falcon-7b
```
## Step 2: Setting resources and other arguments
You'll notice that there's a `config.yaml` file in the new directory. This is where we'll set the resources and other arguments for the model. Open the file in your favorite editor.
Falcon 7B will need a GPU, so let's set the correct resources. Update the `resources` key with the following:
```yaml
resources:
  accelerator: A10G
  cpu: "4"
  memory: 16Gi
  use_gpu: true
```
Also notice the `build` key, which contains the `model_server` we're using as well as other arguments. These arguments are passed to the underlying TGI server.
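As a sketch of what that looks like, the fragment below adds two pass-through options under `build.arguments`. The keys mirror real TGI launcher flags (`max_input_length`, `max_total_tokens`), but the exact set of supported arguments depends on your TGI and Truss versions, so verify them before relying on this:

```yaml
build:
  arguments:
    endpoint: generate_stream
    model: tiiuae/falcon-7b
    # Assumed pass-through TGI launcher options -- check the names
    # against your TGI version.
    max_input_length: 1024
    max_total_tokens: 2048
  model_server: TGI
```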
## Step 3: Deploy the model
You’ll need a Baseten API key for this step.
Let’s deploy our Falcon 7B TGI model.
```sh
truss push
```
You can invoke the model with:
```sh
truss predict -d '{"inputs": "What is a large language model?", "parameters": {"max_new_tokens": 128, "do_sample": true}}' --published
```

Note that TGI's generation parameter for sampling is `do_sample`.
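If you'd rather invoke the deployed model from code, here's a minimal Python sketch. It assumes Baseten's REST predict endpoint format (`https://model-{id}.api.baseten.co/production/predict`) and uses placeholder values for the model ID and API key; check your Baseten dashboard for the exact invocation URL for your deployment.

```python
import json

# Placeholder values -- replace with your own from the Baseten dashboard.
BASETEN_MODEL_ID = "YOUR_MODEL_ID"
BASETEN_API_KEY = "YOUR_API_KEY"

def build_request(prompt: str, max_new_tokens: int = 128) -> dict:
    """Build the same JSON body that `truss predict -d ...` sends to TGI."""
    return {
        "inputs": prompt,
        "parameters": {"max_new_tokens": max_new_tokens, "do_sample": True},
    }

# To actually call the published model (requires the `requests` package and
# a live deployment), POST the body to Baseten's predict endpoint:
#
#   import requests
#   resp = requests.post(
#       f"https://model-{BASETEN_MODEL_ID}.api.baseten.co/production/predict",
#       headers={"Authorization": f"Api-Key {BASETEN_API_KEY}"},
#       json=build_request("What is a large language model?"),
#   )
#   print(resp.json())

body = build_request("What is a large language model?")
print(json.dumps(body))
```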
Here's the complete `config.yaml` for the finished model:

```yaml
build:
  arguments:
    endpoint: generate_stream
    model: tiiuae/falcon-7b
  model_server: TGI
environment_variables: {}
external_package_dirs: []
model_metadata: {}
model_name: Falcon 7B
python_version: py39
requirements: []
resources:
  accelerator: A10G
  cpu: "4"
  memory: 16Gi
  use_gpu: true
secrets: {}
system_packages: []
```