TGI (Text Generation Inference) is a model server from Hugging Face optimized for serving large language models.

You can see the complete config for the finished model at the end of this example. Keep reading for step-by-step instructions on how to generate it.

This example will cover:

  1. Generating the base Truss
  2. Setting sufficient model resources for inference
  3. Deploying the model

Step 1: Generating the base Truss

Get started by creating a new Truss:

truss init --backend TGI falcon-7b

You’re going to see a couple of prompts. Follow along with the instructions below:

  1. Type tiiuae/falcon-7b when prompted for a model.
  2. Press the Tab key when prompted for an endpoint and select the generate_stream endpoint.
  3. Give your model a name, like Falcon 7B.

Finally, navigate to the directory:

cd falcon-7b

Step 2: Setting resources and other arguments

You’ll notice that there’s a config.yaml in the new directory. This is where we’ll set the resources and other arguments for the model. Open the file in your favorite editor.

Falcon 7B needs a GPU, so let's set the correct resources. Update the resources key with the following:

config.yaml
resources:
  accelerator: A10G
  cpu: "4"
  memory: 16Gi
  use_gpu: true
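
Why an A10G? As a rough sizing check: Falcon 7B has about 7 billion parameters, and at 2 bytes per parameter in fp16 that's roughly 14 GB of weights, which fits comfortably within the A10G's 24 GB of VRAM with room left for activations and the KV cache.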

Also notice the build key, which specifies the model_server we're using along with its arguments. These arguments are passed through to the underlying TGI server, as shown in the excerpt below.
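
For reference, here's the build key that truss init generated for this example; the same values appear in the complete config at the end.

config.yaml
build:
  arguments:
    endpoint: generate_stream
    model: tiiuae/falcon-7b
  model_server: TGI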

Step 3: Deploying the model

You’ll need a Baseten API key for this step.

Let’s deploy our Falcon 7B TGI model.

truss push

You can invoke the model with:

truss predict -d '{"inputs": "What is a large language model?", "parameters": {"max_new_tokens": 128, "do_sample": true}}' --published
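
You can also call the deployed model directly over HTTPS. The snippet below is a minimal sketch using Python's requests library; it assumes the standard Baseten predict endpoint format and uses placeholder values for the model ID and API key, so adjust both to match your deployment.

import os

import requests

# Placeholders: substitute the model ID Baseten assigns after `truss push`
# and an API key from your Baseten account.
MODEL_ID = "YOUR_MODEL_ID"
API_KEY = os.environ["BASETEN_API_KEY"]

resp = requests.post(
    f"https://model-{MODEL_ID}.api.baseten.co/production/predict",
    headers={"Authorization": f"Api-Key {API_KEY}"},
    json={
        "inputs": "What is a large language model?",
        "parameters": {"max_new_tokens": 128, "do_sample": True},
    },
    stream=True,  # the generate_stream endpoint returns tokens as they are produced
)

# Print the raw streamed response as it arrives.
for chunk in resp.iter_content(chunk_size=None):
    print(chunk.decode("utf-8"), end="", flush=True)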

Here's the complete config.yaml for the finished model:

config.yaml
build:
  arguments:
    endpoint: generate_stream
    model: tiiuae/falcon-7b
  model_server: TGI
environment_variables: {}
external_package_dirs: []
model_metadata: {}
model_name: Falcon 7B
python_version: py39
requirements: []
resources:
  accelerator: A10G
  cpu: "4"
  memory: 16Gi
  use_gpu: true
secrets: {}
system_packages: []