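Weight streaming in TensorRT is a feature designed to overcome GPU memory limits when working with large models. It streams weight data from host (CPU) memory to GPU memory during inference, which makes it possible to run models larger than the available GPU memory.
This example uses a pre-trained Llama-2 model and shows how to use the weight streaming feature with Torch-TensorRT.
Compile option - build a TRT engine with the weight streaming feature
Runtime API - control the weight streaming budget through a context manager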
import copy
import timeit
import numpy as np
import torch
import torch_tensorrt
from transformers import AutoModelForCausalLM
from utils import export_llm
def time_generate(model, inputs, output_seq_length, iterations=10):
    """
    Measure the mean time, in milliseconds, for generating a sentence
    over a given number of iterations.
    """
    # We only support single input (B x seq_len) for LLMs now
    input_seq = inputs[0]
    with torch.no_grad():
        timings = []
        for _ in range(iterations):
            start_time = timeit.default_timer()
            inputs_copy = copy.copy(input_seq)
            # Greedy decoding of the model. This generates up to max_tokens.
            while inputs_copy.shape[1] <= output_seq_length:
                outputs = model(inputs_copy)
                logits = outputs.logits
                next_token_logits = logits[:, -1, :]
                # Pick the most likely next token and append it to the sequence
                next_tokens = torch.argmax(next_token_logits, dim=-1)
                inputs_copy = torch.cat([inputs_copy, next_tokens[:, None]], dim=-1)
            # Wait for queued GPU work to finish so the timing is accurate
            torch.cuda.synchronize()
            end_time = timeit.default_timer()
            timings.append(end_time - start_time)
    times = np.array(timings)
    time_mean_ms = np.mean(times) * 1000
    return time_mean_ms
# Load the LLaMA-2 model
DEVICE = torch.device("cuda:0")
llama_path = "meta-llama/Llama-2-7b-chat-hf"
with torch.no_grad():
    model = AutoModelForCausalLM.from_pretrained(
        llama_path, use_cache=False, attn_implementation="eager"
    )
# Set input and output sequence lengths
isl = 128
osl = 256
# Create random input tensors
input_tensors = [torch.randint(0, 5, (1, isl), dtype=torch.int64).cuda()]
# Convert the model to half precision (FP16)
model = model.half()
# Exports the LLM model into an ExportedProgram with dynamic shapes.
llama2_ep = export_llm(model, input_tensors[0], max_seq_len=osl)
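Building an engine with weight streaming requires both enable_weight_streaming=True and use_explicit_typing=True. The use_explicit_typing=True option creates a strongly typed network, and only float32 precision is allowed in the enabled_precisions option.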
# Create a TensorRT-compiled model
trt_model = torch_tensorrt.dynamo.compile(
    llama2_ep,
    inputs=input_tensors,
    use_explicit_typing=True,
    enable_weight_streaming=True,
)
# Warm up for 3 iterations
_ = time_generate(trt_model, input_tensors, osl, 3)
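Once the enable_weight_streaming compile option is specified, a budget size is configured automatically. This automatic size may not always be the best choice, because the automatically determined budget has no insight into the user's specific memory constraints and usage patterns.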
# Weight streaming context to get current weight budget information
weight_streaming_ctx = torch_tensorrt.runtime.weight_streaming(trt_model)
# Measure the mean latency of the model with weight streaming
mean_latency = time_generate(trt_model, input_tensors, osl, 1)
# Calculate the percentage of current weight budget used
weight_budget_pct = (
    weight_streaming_ctx.device_budget / weight_streaming_ctx.total_device_budget * 100
)
print(
    f"Set weight streaming budget as {weight_budget_pct}%. {weight_streaming_ctx.device_budget} bytes out of {weight_streaming_ctx.total_device_budget}. mean latency = {mean_latency} ms"
)
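The point above, that the automatic budget does not know about the user's memory constraints, can be made concrete with a small sketch. It derives a manual budget from the amount of free device memory. The helper name `budget_from_free_memory` and its `reserve_bytes` parameter are hypothetical and not part of Torch-TensorRT; all inputs are plain byte counts, so the function runs without a GPU.

```python
def budget_from_free_memory(total_streamable_bytes, free_bytes, reserve_bytes=0):
    """Pick a weight streaming budget that fits in the memory left on the device.

    Hypothetical helper: all three arguments are plain byte counts supplied by
    the caller, so the function itself has no GPU dependency.
    """
    if total_streamable_bytes < 0 or free_bytes < 0 or reserve_bytes < 0:
        raise ValueError("byte counts must be non-negative")
    # Memory we are willing to spend on weights after keeping some headroom
    usable = max(free_bytes - reserve_bytes, 0)
    # Never ask for more than the total size of the streamable weights
    return min(usable, total_streamable_bytes)


# Made-up numbers: 13 GiB of streamable weights, 8 GiB free on the device,
# and 2 GiB held back for activations and other allocations.
GIB = 1024**3
budget = budget_from_free_memory(13 * GIB, 8 * GIB, reserve_bytes=2 * GIB)
print(f"requested budget: {budget / GIB:.1f} GiB")
```

In this tutorial, the result could be assigned to `weight_streaming_ctx.device_budget`, with `weight_streaming_ctx.total_device_budget` as the first argument. One possible source for the free byte count is `torch.cuda.mem_get_info()`, which returns a `(free, total)` pair. How much headroom to reserve is a workload-specific assumption.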
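The weight streaming budget can be limited with the weight streaming context manager. The budget size may range from 0 to ctx.total_device_budget. A value of 0 gives the largest memory saving by keeping the minimum amount of weights on the device. A value equal to ctx.total_device_budget disables weight streaming. If multiple TRT engines are created, the budget is distributed among them proportionally.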
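The proportional distribution mentioned above can be illustrated with a short sketch. It assumes each engine receives a share of the requested budget in proportion to its streamable weight size; this is an illustration of the idea, not the library's actual implementation. The function name `split_budget` and the sizes used are hypothetical.

```python
def split_budget(requested_budget, streamable_sizes):
    """Split one requested budget across engines in proportion to their size.

    Hypothetical helper. `streamable_sizes` holds the streamable weight size
    in bytes of each engine; the return value is one budget per engine.
    """
    if requested_budget < 0:
        raise ValueError("budget must be non-negative")
    total = sum(streamable_sizes)
    if total == 0:
        # Nothing can be streamed, so every engine gets a zero budget
        return [0 for _ in streamable_sizes]
    # Requests above the total are capped; the cap corresponds to no streaming
    requested_budget = min(requested_budget, total)
    # Integer arithmetic keeps each share at or below the engine's own size
    return [size * requested_budget // total for size in streamable_sizes]


sizes = [600, 300, 100]  # made-up streamable sizes in bytes
print(split_budget(100, sizes))   # a 10% budget
print(split_budget(0, sizes))     # maximum memory saving
print(split_budget(5000, sizes))  # capped at the total
```

Integer division is used instead of a floating-point ratio so that the shares never exceed the per-engine sizes because of rounding.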
# Use a context manager for weight streaming
with torch_tensorrt.runtime.weight_streaming(trt_model) as weight_streaming_ctx:
    # Get the total size of streamable weights in the engine
    streamable_budget = weight_streaming_ctx.total_device_budget
    # Scenario 1: Automatic weight streaming budget
    # Get the automatically determined weight streaming budget
    requested_budget = weight_streaming_ctx.get_automatic_weight_streaming_budget()
    # Set the device budget to the automatically determined value
    weight_streaming_ctx.device_budget = requested_budget
    # Measure the mean latency with automatic budget
    mean_latency = time_generate(trt_model, input_tensors, osl, 1)
    # Calculate the percentage of the weight budget used
    weight_budget_pct = (
        weight_streaming_ctx.device_budget
        / weight_streaming_ctx.total_device_budget
        * 100
    )
    print(
        f"Set auto weight streaming budget as {weight_budget_pct}%. {weight_streaming_ctx.device_budget} bytes out of {weight_streaming_ctx.total_device_budget}. mean latency = {mean_latency} ms"
    )
    # Scenario 2: Manual 10% weight streaming budget
    # Set the budget to 10% of the total streamable weights
    requested_budget = int(streamable_budget * 0.1)
    weight_streaming_ctx.device_budget = requested_budget
    # Measure the mean latency with 10% budget
    mean_latency = time_generate(trt_model, input_tensors, osl, 1)
    # Calculate the percentage of the weight budget used
    weight_budget_pct = (
        weight_streaming_ctx.device_budget
        / weight_streaming_ctx.total_device_budget
        * 100
    )
    print(
        f"Set weight streaming budget as {weight_budget_pct}%. {weight_streaming_ctx.device_budget} bytes out of {weight_streaming_ctx.total_device_budget}. mean latency = {mean_latency} ms"
    )
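The two scenarios above repeat the same steps: set a budget, measure, report. A possible generalization is a sweep over several budget fractions, sketched below. The helper `sweep_budgets` and the `FakeContext` stand-in are hypothetical; the sketch assumes only that the context object exposes `total_device_budget` and a settable `device_budget`, as used earlier in this example.

```python
def sweep_budgets(ctx, measure, fractions):
    """Try several budget fractions and record the latency of each.

    Hypothetical helper. `ctx` is any object exposing `total_device_budget`
    and a settable `device_budget`; `measure` is a zero-argument callable
    returning a latency in milliseconds.
    """
    results = []
    for fraction in fractions:
        if not 0.0 <= fraction <= 1.0:
            raise ValueError(f"fraction out of range: {fraction}")
        ctx.device_budget = int(ctx.total_device_budget * fraction)
        # Read the budget back in case the runtime adjusted the request
        results.append((fraction, ctx.device_budget, measure()))
    return results


class FakeContext:
    """Stand-in for the weight streaming context, for illustration only."""

    def __init__(self, total_device_budget):
        self.total_device_budget = total_device_budget
        self.device_budget = total_device_budget


fake_ctx = FakeContext(total_device_budget=1000)
# A constant latency stands in for a real call to time_generate
results = sweep_budgets(fake_ctx, lambda: 1.0, [0.1, 0.5, 1.0])
for fraction, budget, latency_ms in results:
    print(f"{fraction:.0%}: {budget} bytes, mean latency = {latency_ms} ms")
```

Inside the `with` block of this tutorial, the call could look like `sweep_budgets(weight_streaming_ctx, lambda: time_generate(trt_model, input_tensors, osl, 1), [0.1, 0.5, 1.0])`. Lower fractions are expected to save more GPU memory at the cost of latency, so a sweep like this is one way to find an acceptable trade-off for a given workload.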