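⚠️ Notice: Limited Maintenance

This project is no longer actively maintained. Existing releases remain available, but there are no planned updates, bug fixes, new features, or security patches. Users should be aware that vulnerabilities may not be addressed.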

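Batch Inference with TorchServe

Contents of this Document

Introduction
Prerequisites
Batch Inference with TorchServe's default handlers
Batch Inference with TorchServe using ResNet-152 model
Demo to configure TorchServe ResNet-152 model with batch-supported model
Demo to configure TorchServe ResNet-152 model with batch-supported model using Docker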

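Introduction

Batch inference is the process of aggregating inference requests and sending the aggregated requests through the ML/DL framework in a single pass. TorchServe was designed to natively support batching of incoming inference requests. This lets you make better use of host resources, because most ML/DL frameworks are optimized for batched requests. Better resource utilization in turn reduces the operational cost of hosting an inference service with TorchServe.

This document walks through an example of using batch inference in TorchServe when serving models locally or in Docker containers.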

Prerequisites

Before reading this document, please read the following documents

Batch Inference with TorchServe's default handlers

TorchServe's default handlers support batch inference out of the box, except for the text_classifier handler.

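Batch Inference with TorchServe using ResNet-152 model

To support batch inference, TorchServe needs the following:

TorchServe model configuration: Configure batch_size and max_batch_delay either through the "POST /models" management API or through settings in config.properties. TorchServe needs to know the maximum batch size the model can handle and the maximum time it should wait to fill each batch.
Model handler code: TorchServe requires the model handler to handle batched inference requests.

For a full working example of a custom model handler with batch processing, see the Hugging Face transformer generalized handler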
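As a rough illustration of what "handling batched inference requests" means for handler code, the following Python sketch accepts a list of requests and returns a list of responses of the same length, in the same order. It assumes each request is a dict whose payload sits under the "data" or "body" key. The function classify_batch is a hypothetical stand-in for a real model forward pass and is not part of TorchServe.

```python
from typing import Any, Dict, List


def classify_batch(payloads: List[bytes]) -> List[Dict[str, float]]:
    """Hypothetical model call: returns one result per input payload."""
    # A real handler would decode the payloads, stack them into one tensor,
    # and run a single forward pass. Here we only report the payload size.
    return [{"num_bytes": float(len(p))} for p in payloads]


def handle(data: List[Dict[str, Any]], context: Any = None) -> List[Dict[str, float]]:
    """Process every request in the batch and return one response per request."""
    # Each element of `data` is one client request (assumed layout: the
    # payload is under "data", falling back to "body").
    payloads = [row.get("data") or row.get("body") or b"" for row in data]
    results = classify_batch(payloads)
    # The response list must line up one-to-one with the request list.
    if len(results) != len(data):
        raise RuntimeError("handler must return one response per request")
    return results
```

The key design point is that the handler never assumes a batch of exactly batch_size: a partial batch of one or two requests goes through the same code path.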

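TorchServe Model Configuration

Starting from TorchServe 0.4.1, there are two ways to configure TorchServe to use batching:

Provide the batch configuration through the POST /models API.
Provide the batch configuration through the configuration file, config.properties.

The configuration properties we are interested in are:

batch_size: The maximum batch size the model is expected to handle.
max_batch_delay: The maximum time, in milliseconds, that TorchServe waits to receive batch_size requests. If TorchServe does not receive batch_size requests before this timer expires, it sends the requests it has received to the model handler.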
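The interplay of the two properties can be sketched in Python as follows. This is a simplified model of the behavior described above, not TorchServe's actual frontend code; the function name collect_batch and the use of queue.Queue are assumptions made for illustration.

```python
import queue
import time
from typing import Any, List


def collect_batch(requests: "queue.Queue[Any]", batch_size: int,
                  max_batch_delay_ms: int) -> List[Any]:
    """Gather up to batch_size requests, waiting at most max_batch_delay_ms."""
    # Block until at least one request is available.
    batch = [requests.get()]
    deadline = time.monotonic() + max_batch_delay_ms / 1000.0
    while len(batch) < batch_size:
        remaining = deadline - time.monotonic()
        if remaining <= 0:
            break  # timer expired: send what we have
        try:
            batch.append(requests.get(timeout=remaining))
        except queue.Empty:
            break  # no more requests arrived in time: send a partial batch
    return batch
```

With five queued requests and batch_size=3, the first call returns a full batch of three and the second returns the remaining two once the delay has elapsed.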

Let's look at an example that uses this configuration through the management API:

# The following command will register a model "resnet-152.mar" and configure TorchServe to use a batch_size of 8 and a max batch delay of 50 milliseconds.
curl -X POST "localhost:8081/models?url=resnet-152.mar&batch_size=8&max_batch_delay=50"
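The same registration call can be issued from Python using only the standard library. This is a sketch; the helper names register_url and register_model, and the default management address, are assumptions for illustration and are not part of TorchServe.

```python
import urllib.request
from urllib.parse import urlencode


def register_url(mar_url: str, batch_size: int, max_batch_delay: int,
                 management: str = "http://localhost:8081") -> str:
    """Build the POST /models URL with the batching parameters."""
    query = urlencode({
        "url": mar_url,
        "batch_size": batch_size,
        "max_batch_delay": max_batch_delay,  # milliseconds
    })
    return f"{management}/models?{query}"


def register_model(mar_url: str, batch_size: int, max_batch_delay: int) -> bytes:
    """Send the registration request (requires a running TorchServe instance)."""
    request = urllib.request.Request(
        register_url(mar_url, batch_size, max_batch_delay), method="POST")
    with urllib.request.urlopen(request) as response:
        return response.read()
```

Calling register_url("resnet-152.mar", 8, 50) produces the same URL as the curl command above.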

The following is an example of this configuration in config.properties:

# The following config.properties snippet registers the model "resnet-152.mar" and configures TorchServe to use a batch_size of 8 and a max batch delay of 50 milliseconds.

models={\
  "resnet-152": {\
    "1.0": {\
        "defaultVersion": true,\
        "marName": "resnet-152.mar",\
        "minWorkers": 1,\
        "maxWorkers": 1,\
        "batchSize": 8,\
        "maxBatchDelay": 50,\
        "responseTimeout": 120\
    }\
  }\
}
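Because the models value is JSON spread over backslash-continued lines, a small syntax error is easy to miss. The Python sketch below joins the continued lines and parses the result so the batching values can be checked before starting the server. The helper name parse_models_property is an assumption for illustration, and it handles only this one property, not a full .properties file.

```python
import json


def parse_models_property(text: str) -> dict:
    """Parse a `models=` entry that uses backslash line continuations."""
    # Drop the trailing backslash of each line and join the pieces.
    joined = "".join(line.rstrip().rstrip("\\") for line in text.splitlines())
    key, _, value = joined.partition("=")
    if key.strip() != "models":
        raise ValueError("expected a 'models=' property")
    return json.loads(value)


EXAMPLE = r'''models={\
  "resnet-152": {\
    "1.0": {\
        "defaultVersion": true,\
        "marName": "resnet-152.mar",\
        "minWorkers": 1,\
        "maxWorkers": 1,\
        "batchSize": 8,\
        "maxBatchDelay": 50,\
        "responseTimeout": 120\
    }\
  }\
}'''

settings = parse_models_property(EXAMPLE)["resnet-152"]["1.0"]
print(settings["batchSize"], settings["maxBatchDelay"])
```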

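These configurations are used both in TorchServe and in the model's custom service code (the handler code). TorchServe associates the batch-related configuration with each model. The frontend then tries to aggregate up to batch_size requests and sends them to the backend.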

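Demo to configure TorchServe ResNet-152 model with batch-supported model

In this section, we bring up the model server and load the ResNet-152 model, which uses the default image_classifier handler for batch inference.

Setup TorchServe and Torch Model Archiver

First, follow the instructions in the main Readme to install all the required packages, including torchserve.

Batch inference of ResNet-152 configured with management API

Start the model server. In this example, the model server runs on inference port 8080 and management port 8081.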

$ cat config.properties
...
inference_address=http://127.0.0.1:8080
management_address=http://127.0.0.1:8081
...
$ torchserve --start --model-store model_store

Verify that TorchServe is up and running

$ curl localhost:8080/ping
{
  "status": "Healthy"
}

Now let's load the resnet-152 model, which supports batch inference. Because this is an example, we launch 1 worker that handles a batch size of 3 with a max_batch_delay of 10 ms.

$ curl -X POST "localhost:8081/models?url=https://torchserve.pytorch.org/mar_files/resnet-152-batch_v2.mar&batch_size=3&max_batch_delay=10&initial_workers=1"
{
  "status": "Processing worker updates..."
}

Verify that the worker started properly.

curl http://localhost:8081/models/resnet-152-batch_v2

[
  {
    "modelName": "resnet-152-batch_v2",
    "modelVersion": "2.0",
    "modelUrl": "https://torchserve.pytorch.org/mar_files/resnet-152-batch_v2.mar",
    "runtime": "python",
    "minWorkers": 1,
    "maxWorkers": 1,
    "batchSize": 3,
    "maxBatchDelay": 10,
    "loadedAtStartup": false,
    "workers": [
      {
        "id": "9000",
        "startTime": "2021-06-14T23:18:21.793Z",
        "status": "READY",
        "memoryUsage": 1726554112,
        "pid": 19946,
        "gpu": true,
        "gpuUsage": "gpuId::0 utilization.gpu [%]::0 % utilization.memory [%]::0 % memory.used [MiB]::678 MiB"
      }
    ]
  }
]
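The fields worth checking in this response are batchSize, maxBatchDelay, and each worker's status. The Python sketch below performs that check on the response body. The function name batching_is_active is an assumption for illustration; it relies only on the field names visible in the output above.

```python
import json


def batching_is_active(describe_json: str, batch_size: int,
                       max_batch_delay: int) -> bool:
    """Return True if every listed model has the expected batching settings
    and all of its workers report READY."""
    models = json.loads(describe_json)
    for model in models:
        settings_ok = (model["batchSize"] == batch_size
                       and model["maxBatchDelay"] == max_batch_delay)
        workers = model["workers"]
        workers_ok = bool(workers) and all(w["status"] == "READY" for w in workers)
        if not (settings_ok and workers_ok):
            return False
    # An empty response means no model was registered.
    return bool(models)
```

For the response shown above, batching_is_active(body, 3, 10) would hold, where body is the text returned by the curl command.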

Now let's test this service.

Get an image to test this service

$ curl -LJO https://github.com/pytorch/serve/raw/master/examples/image_classifier/kitten.jpg

Run inference to test the model.

  $ curl http://localhost:8080/predictions/resnet-152-batch_v2 -T kitten.jpg
  {
      "tiger_cat": 0.5798614621162415,
      "tabby": 0.38344162702560425,
      "Egyptian_cat": 0.0342114195227623,
      "lynx": 0.0005819813231937587,
      "quilt": 0.000273319921689108
  }
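A single curl request never fills a batch of 3, so it is served only after max_batch_delay expires. To give the frontend a chance to aggregate, several requests have to arrive close together. The Python sketch below sends requests concurrently. The helper names post_image and send_concurrently are assumptions for illustration, and post_image uses PUT to mirror what curl -T does.

```python
import urllib.request
from concurrent.futures import ThreadPoolExecutor
from typing import Any, Callable, List


def post_image(url: str, payload: bytes) -> bytes:
    """Upload one image and return the raw response body
    (requires a running TorchServe instance)."""
    request = urllib.request.Request(url, data=payload, method="PUT")
    with urllib.request.urlopen(request) as response:
        return response.read()


def send_concurrently(payloads: List[bytes],
                      send: Callable[[bytes], Any]) -> List[Any]:
    """Call `send` for every payload in parallel; results keep input order."""
    with ThreadPoolExecutor(max_workers=max(1, len(payloads))) as pool:
        return list(pool.map(send, payloads))
```

A possible usage, assuming kitten.jpg has been read into the variable image: send_concurrently([image] * 3, lambda p: post_image("http://localhost:8080/predictions/resnet-152-batch_v2", p)). Each caller still receives its own individual response even when the requests were processed together.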

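Batch inference of ResNet-152 configured through config.properties

Here, we first set batch_size and max_batch_delay in config.properties. Make sure the mar file is located in the model-store and that the version in the models setting matches the version of the mar file that was created. To read more about configuration, please refer to this document.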

load_models=resnet-152-batch_v2.mar
models={\
  "resnet-152-batch_v2": {\
    "2.0": {\
        "defaultVersion": true,\
        "marName": "resnet-152-batch_v2.mar",\
        "minWorkers": 1,\
        "maxWorkers": 1,\
        "batchSize": 3,\
        "maxBatchDelay": 5000,\
        "responseTimeout": 120\
    }\
  }\
}

Then start TorchServe, passing config.properties with the --ts-config flag

torchserve --start --model-store model_store  --ts-config config.properties

Verify that TorchServe is up and running

$ curl localhost:8080/ping
{
  "status": "Healthy"
}

Verify that the worker started properly.

curl http://localhost:8081/models/resnet-152-batch_v2

[
  {
    "modelName": "resnet-152-batch_v2",
    "modelVersion": "2.0",
    "modelUrl": "resnet-152-batch_v2.mar",
    "runtime": "python",
    "minWorkers": 1,
    "maxWorkers": 1,
    "batchSize": 3,
    "maxBatchDelay": 5000,
    "loadedAtStartup": true,
    "workers": [
      {
        "id": "9000",
        "startTime": "2021-06-14T22:44:36.742Z",
        "status": "READY",
        "memoryUsage": 0,
        "pid": 19116,
        "gpu": true,
        "gpuUsage": "gpuId::0 utilization.gpu [%]::0 % utilization.memory [%]::0 % memory.used [MiB]::678 MiB"
      }
    ]
  }
]

Now let's test this service.

Get an image to test this service

$ curl -LJO https://github.com/pytorch/serve/raw/master/examples/image_classifier/kitten.jpg

Run inference to test the model.

  $ curl http://localhost:8080/predictions/resnet-152-batch_v2 -T kitten.jpg
  {
      "tiger_cat": 0.5798614621162415,
      "tabby": 0.38344162702560425,
      "Egyptian_cat": 0.0342114195227623,
      "lynx": 0.0005819813231937587,
      "quilt": 0.000273319921689108
  }

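Demo to configure TorchServe ResNet-152 model with batch-supported model using Docker

Here, we show how to register a model with batch inference support when serving the model in Docker containers. As in the previous section, we set batch_size and max_batch_delay in config.properties, which is used by dockerd-entrypoint.sh.

Batch inference of ResNet-152 using Docker container

Set batch_size and max_batch_delay in config.properties, as referenced in dockerd-entrypoint.sh.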

inference_address=http://127.0.0.1:8080
management_address=http://127.0.0.1:8081
metrics_address=http://127.0.0.1:8082
number_of_netty_threads=32
job_queue_size=1000
model_store=/home/model-server/model-store
load_models=resnet-152-batch_v2.mar
models={\
  "resnet-152-batch_v2": {\
    "2.0": {\
        "defaultVersion": true,\
        "marName": "resnet-152-batch_v2.mar",\
        "minWorkers": 1,\
        "maxWorkers": 1,\
        "batchSize": 3,\
        "maxBatchDelay": 100,\
        "responseTimeout": 120\
    }\
  }\
}

Build the targeted Docker image from here; in this example we use the GPU image

./build_image.sh -g -cv cu102

Start serving the model with the container, passing config.properties to the container

 docker run --rm -it --gpus all -p 127.0.0.1:8080:8080 -p 127.0.0.1:8081:8081 --name mar -v /home/ubuntu/serve/model_store:/home/model-server/model-store -v <path to config.properties>:/home/model-server/config.properties pytorch/torchserve:latest-gpu

Verify that the worker started properly.

curl http://localhost:8081/models/resnet-152-batch_v2

[
  {
    "modelName": "resnet-152-batch_v2",
    "modelVersion": "2.0",
    "modelUrl": "resnet-152-batch_v2.mar",
    "runtime": "python",
    "minWorkers": 1,
    "maxWorkers": 1,
    "batchSize": 3,
    "maxBatchDelay": 100,
    "loadedAtStartup": true,
    "workers": [
      {
        "id": "9000",
        "startTime": "2021-06-14T22:44:36.742Z",
        "status": "READY",
        "memoryUsage": 0,
        "pid": 19116,
        "gpu": true,
        "gpuUsage": "gpuId::0 utilization.gpu [%]::0 % utilization.memory [%]::0 % memory.used [MiB]::678 MiB"
      }
    ]
  }
]

Now let's test this service.

Get an image to test this service

$ curl -LJO https://github.com/pytorch/serve/raw/master/examples/image_classifier/kitten.jpg

Run inference to test the model.

  $ curl http://localhost:8080/predictions/resnet-152-batch_v2 -T kitten.jpg
  {
      "tiger_cat": 0.5798614621162415,
      "tabby": 0.38344162702560425,
      "Egyptian_cat": 0.0342114195227623,
      "lynx": 0.0005819813231937587,
      "quilt": 0.000273319921689108
  }
