torchtrtc
torchtrtc is a CLI application for using the Torch-TensorRT compiler. It serves as an easy way to compile a TorchScript module with Torch-TensorRT from the command line to quickly check support or to use as part of a deployment pipeline. All basic features of the compiler are supported, including post-training quantization (though you must already have a calibration cache file to use the PTQ feature). The compiler can output two formats: either a TorchScript program with the TensorRT engine embedded, or the TensorRT engine itself as a PLAN file.
All that is required to run a compiled program is, in C++, linking against libtorchtrt.so,
or, in Python, importing the torch_tensorrt package. All other aspects of using a compiled module are identical to standard TorchScript: load it with torch.jit.load()
and run it like any other module.
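As a minimal sketch, running a torchtrtc-compiled module from Python might look like the following. The path `"ssd_trt.ts"` is a placeholder for whatever output path you passed to torchtrtc, and the input shape is hypothetical:

```python
def run_compiled(path, example):
    """Load a torchtrtc-compiled TorchScript file and run it on one input.

    Illustrative sketch: assumes a CUDA GPU and a module previously
    compiled by torchtrtc at `path`.
    """
    # Importing torch_tensorrt first registers the TensorRT runtime
    # classes needed to deserialize the embedded engine.
    import torch
    import torch_tensorrt  # noqa: F401

    module = torch.jit.load(path)  # loads like any other TorchScript module
    return module(example)         # runs like any other module

# Example usage (requires a compiled module on disk):
# import torch
# out = run_compiled("ssd_trt.ts", torch.randn(1, 3, 300, 300).half().cuda())
```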
torchtrtc [input_file_path] [output_file_path]
[input_specs...] {OPTIONS}
torchtrtc is a compiler for TorchScript; it will compile and optimize
TorchScript programs to run on NVIDIA GPUs using TensorRT
OPTIONS:
-h, --help Display this help menu
Verbosity of the compiler
-v, --verbose Dumps debugging information about the
compilation process onto the console
-w, --warnings Disables warnings generated during
compilation from being printed to the
console (warnings are on by default)
--i, --info Dumps info messages generated during
compilation onto the console
--build-debuggable-engine Creates a debuggable engine
--allow-gpu-fallback (Only used when targeting DLA
(device-type)) Lets engine run layers on
GPU if they are not supported on DLA
--require-full-compilation Require that the model should be fully
compiled to TensorRT or throw an error
--check-method-support=[method_name]
Check the support for end-to-end
compilation of a specified method in the
TorchScript module
--disable-tf32 Prevent Float32 layers from using the
TF32 data format
--sparse-weights Enable sparsity for weights of conv and
FC layers
-p[precision...],
--enable-precision=[precision...] (Repeatable) Enabling an operating
precision for kernels to use when
building the engine (Int8 requires a
calibration-cache argument) [ float |
float32 | f32 | fp32 | half | float16 |
f16 | fp16 | int8 | i8 | char ]
(default: float)
-d[type], --device-type=[type] The type of device the engine should be
built for [ gpu | dla ] (default: gpu)
--gpu-id=[gpu_id] GPU id if running on multi-GPU platform
(defaults to 0)
--dla-core=[dla_core] DLACore id if running on available DLA
(defaults to 0)
--engine-capability=[capability] The capability class the engine should
be built for [ standard | safety |
dla_standalone ]
--calibration-cache-file=[file_path]
Path to calibration cache file to use
for post training quantization
--teo=[op_name...],
--torch-executed-op=[op_name...] (Repeatable) Operator in the graph that
should always be run in PyTorch for
execution (partial compilation must be
enabled)
--tem=[module_name...],
--torch-executed-mod=[module_name...]
(Repeatable) Module that should always
be run in PyTorch for execution (partial
compilation must be enabled)
--mbs=[num_ops],
--min-block-size=[num_ops] Minimum number of contiguous TensorRT
supported ops to compile a subgraph to
TensorRT
--embed-engine Whether to treat input file as a
serialized TensorRT engine and embed it
into a TorchScript module (device spec
must be provided)
--num-avg-timing-iters=[num_iters]
Number of averaging timing iterations
used to select kernels
--workspace-size=[workspace_size] Maximum size of workspace given to
TensorRT
--dla-sram-size=[dla_sram_size] Fast software-managed RAM used by DLA
to communicate within a layer.
--dla-local-dram-size=[dla_local_dram_size] Host RAM used by DLA to share
intermediate tensor data across operations.
--dla-global-dram-size=[dla_global_dram_size] Host RAM used by DLA to store
weights and metadata for execution
--atol=[atol] Absolute tolerance threshold for acceptable
numerical deviation from standard TorchScript
output (default 1e-8)
--rtol=[rtol] Relative tolerance threshold for acceptable
numerical deviation from standard TorchScript
output (default 1e-5)
--no-threshold-check Skip checking threshold compliance
--truncate-long-double,
--truncate, --truncate-64bit Truncate weights that are provided in
64bit to 32bit (Long, Double to Int,
Float)
--save-engine Instead of compiling a full
TorchScript program, save the created
engine to the path specified as the
output path
--custom-torch-ops (repeatable) Shared object/DLL containing custom torch operators
--custom-converters (repeatable) Shared object/DLL containing custom converters
input_file_path Path to input TorchScript file
output_file_path Path for compiled TorchScript (or
TensorRT engine) file
input_specs... Specs for inputs to engine, can either
be a single size or a range defined by
Min, Optimal, Max sizes, e.g.
"(N,..,C,H,W)"
"[(MIN_N,..,MIN_C,MIN_H,MIN_W);(OPT_N,..,OPT_C,OPT_H,OPT_W);(MAX_N,..,MAX_C,MAX_H,MAX_W)]".
Data Type and format can be specified by
adding an "@" followed by dtype and "%"
followed by format to the end of the
shape spec. e.g. "(3, 3, 32,
32)@f16%NHWC"
"--" can be used to terminate flag options and force all following
arguments to be treated as positional options
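To make the spec grammar concrete, here is an illustrative parser (not part of torchtrtc itself) that splits a spec string into its shape, optional dtype, and optional memory-format components, following the "@" and "%" suffix rules described above:

```python
import re

def parse_input_spec(spec):
    """Split a torchtrtc input spec into (shape_part, dtype, format).

    Illustrative only: "@" introduces an optional dtype and "%" an
    optional memory format, both appended to the shape spec.
    """
    m = re.fullmatch(r"(?P<shape>.+?)(?:@(?P<dtype>\w+))?(?:%(?P<fmt>\w+))?", spec)
    if m is None:
        raise ValueError(f"unrecognized input spec: {spec!r}")
    return m.group("shape"), m.group("dtype"), m.group("fmt")

# parse_input_spec("(3, 3, 32, 32)@f16%NHWC")
#   -> ("(3, 3, 32, 32)", "f16", "NHWC")
# parse_input_spec("(1,3,300,300)")
#   -> ("(1,3,300,300)", None, None)
```

A dtype or format may be omitted independently, so all four combinations of suffixes are valid spec strings.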
For example:
torchtrtc tests/modules/ssd_traced.jit.pt ssd_trt.ts "[(1,3,300,300); (1,3,512,512); (1, 3, 1024, 1024)]@f16%contiguous" -p f16
To include a set of custom operators:
torchtrtc tests/modules/ssd_traced.jit.pt ssd_trt.ts --custom-torch-ops=<path to custom library .so file> "[(1,3,300,300); (1,3,512,512); (1, 3, 1024, 1024)]@fp16%contiguous" -p f16
To include a set of custom converters:
torchtrtc tests/modules/ssd_traced.jit.pt ssd_trt.ts --custom-converters=<path to custom library .so file> "[(1,3,300,300); (1,3,512,512); (1, 3, 1024, 1024)]@fp16%contiguous" -p f16