High Performance LLM with TGI
build:
  arguments:
    endpoint: generate_stream
    model_id: tiiuae/falcon-7b
  model_server: TGI
runtime:
  predict_concurrency: 128
environment_variables: {}
external_package_dirs: []
model_metadata:
  example_model_input: {"inputs": "what is the meaning of life"}
model_name: Falcon-TGI
python_version: py39
requirements: []
resources:
  accelerator: A10G
  cpu: "4"
  memory: 16Gi
  use_gpu: true
secrets: {}
system_packages: []
TGI is a model server optimized for language models. In this example, we put together a Truss that serves Falcon 7B using TGI.
For Trusses that use TGI, there is no user code to define, so the Truss consists only of a config.yaml file. You can run any model that TGI supports.
build:
  arguments:
The endpoint argument has two options:
- generate: This returns the response as JSON when the full response is generated
- generate_stream: If you choose this option, results will be streamed as they are ready, using server-sent events
    endpoint: generate_stream
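As a rough illustration of the difference between the two endpoints, here is a minimal client-side sketch using Python's requests library. The MODEL_URL placeholder is hypothetical, and the exact response shape depends on your deployment, so treat this as a sketch rather than a definitive client.

# Hypothetical sketch: consuming "generate" vs. "generate_stream" responses.
import requests

MODEL_URL = "https://example.com/model/predict"  # placeholder, not a real endpoint
payload = {"inputs": "what is the meaning of life", "parameters": {"max_new_tokens": 64}}

# endpoint: generate -- one JSON body once the full response has been generated
resp = requests.post(MODEL_URL, json=payload)
print(resp.json())

# endpoint: generate_stream -- results arrive incrementally as server-sent events
with requests.post(MODEL_URL, json=payload, stream=True) as resp:
    for line in resp.iter_lines():
        if line:
            print(line.decode())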
Select the model that you'd like to use with TGI by specifying its Hugging Face model ID.
    model_id: tiiuae/falcon-7b
The model_server parameter allows you to specify a supported backend (in this example, TGI).
  model_server: TGI
Another important parameter to configure if you are choosing TGI is predict_concurrency. One of the main benefits of TGI is continuous batching, in which multiple requests can be processed at the same time. Without predict_concurrency set to a high enough number, you cannot take advantage of this feature.
runtime:
  predict_concurrency: 128
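To see why this matters from the client side, here is a hedged sketch that fires several requests at once. With predict_concurrency set high enough, these in-flight requests can reach TGI together and be batched; the MODEL_URL placeholder and the prompts are illustrative assumptions.

# Hypothetical sketch: concurrent requests that let continuous batching do its work.
from concurrent.futures import ThreadPoolExecutor
import requests

MODEL_URL = "https://example.com/model/predict"  # placeholder, not a real endpoint
prompts = [f"Tell me a fact about the number {i}" for i in range(8)]

def ask(prompt: str) -> str:
    resp = requests.post(
        MODEL_URL,
        json={"inputs": prompt, "parameters": {"max_new_tokens": 32}},
    )
    return resp.text

# If predict_concurrency were 1, these requests would queue one behind another
# instead of being processed together by TGI's continuous batching.
with ThreadPoolExecutor(max_workers=8) as pool:
    for answer in pool.map(ask, prompts):
        print(answer)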
The remaining config options listed are standard Truss Config options.
environment_variables: {}
external_package_dirs: []
model_metadata:
  example_model_input: {"inputs": "what is the meaning of life"}
model_name: Falcon-TGI
python_version: py39
requirements: []
resources:
  accelerator: A10G
  cpu: "4"
  memory: 16Gi
  use_gpu: true
secrets: {}
system_packages: []
Deploy the model
Deploy the TGI model like you would any other Truss, with:
$ truss push
You can then invoke the model with:
$ truss predict -d '{"inputs": "What is a large language model?", "parameters": {"max_new_tokens": 128, "do_sample": true}}' --published
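Because this Truss uses endpoint: generate_stream, the response arrives as a stream of TGI-style server-sent events. The sketch below is one way to reassemble the streamed tokens into a single string; the MODEL_URL placeholder, the data: prefix, and the token.text field are assumptions based on TGI's streaming format, so adjust them to match what your deployment actually returns.

# Hypothetical sketch: reassembling streamed tokens into a full response.
import json
import requests

MODEL_URL = "https://example.com/model/predict"  # placeholder, not a real endpoint
payload = {
    "inputs": "What is a large language model?",
    "parameters": {"max_new_tokens": 128, "do_sample": True},
}

pieces = []
with requests.post(MODEL_URL, json=payload, stream=True) as resp:
    for raw in resp.iter_lines():
        if not raw:
            continue
        line = raw.decode()
        if not line.startswith("data:"):
            continue
        event = json.loads(line[len("data:"):])
        token = event.get("token", {})
        # Skip special tokens (e.g. end-of-sequence) when assembling the text.
        if not token.get("special"):
            pieces.append(token.get("text", ""))

print("".join(pieces))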