Reinforcement Learning (PPO) with TorchRL Tutorial

Created On: Mar 15, 2023 | Last Updated: Jan 27, 2025 | Last Verified: Nov 05, 2024

Author: Vincent Moens

This tutorial demonstrates how to use PyTorch and torchrl to train a parametric policy network to solve the Inverted Pendulum task from the OpenAI-Gym/Farama-Gymnasium control library.

Inverted pendulum

Key learnings:

  • How to create an environment in TorchRL, transform its outputs, and collect data from this environment;

  • How to make your classes talk to each other using TensorDict;

  • The basics of building your training loop with TorchRL:

    • How to compute the advantage signal for policy gradient methods;

    • How to create a stochastic policy using a probabilistic neural network;

    • How to create a dynamic replay buffer and sample from it without repetition.

We will cover six crucial components of TorchRL: environments, transforms, models (policy and value function), loss modules, data collectors, and replay buffers.

If you are running this in Google Colab, make sure you install the following dependencies:

!pip3 install torchrl
!pip3 install gym[mujoco]
!pip3 install tqdm

Proximal Policy Optimization (PPO) is a policy-gradient algorithm where a batch of data is collected and directly consumed to train the policy to maximize the expected return given some proximality constraints. You can think of it as a sophisticated version of REINFORCE, the foundational policy-optimization algorithm. For more information, see the Proximal Policy Optimization Algorithms paper.

PPO is usually regarded as a fast and efficient method for online, on-policy reinforcement learning. TorchRL provides a loss module that does all the work for you, so that you can rely on this implementation and focus on solving your problem rather than re-inventing the wheel every time you want to train a policy.

For completeness, here is a brief overview of how the loss is computed, even though this is taken care of by our ClipPPOLoss module. The algorithm works as follows: 1. we will sample a batch of data by playing the policy in the environment for a given number of steps; 2. then, we will perform a given number of optimization steps with random sub-samples of this batch using a clipped version of the REINFORCE loss; 3. the clipping will put a pessimistic bound on our loss: lower return estimates will be favored compared to higher ones. The precise formula of the loss is:

\[L(s,a,\theta_k,\theta) = \min\left( \frac{\pi_{\theta}(a|s)}{\pi_{\theta_k}(a|s)} A^{\pi_{\theta_k}}(s,a), \;\; g(\epsilon, A^{\pi_{\theta_k}}(s,a)) \right),\]

There are two components in that loss: in the first part of the minimum operator, we simply compute an importance-weighted version of the REINFORCE loss (i.e., a REINFORCE loss corrected for the fact that the current policy configuration lags behind the one that was used for the data collection). The second part of that minimum operator is a similar loss where we have clipped the ratios when they exceeded or fell below a given pair of thresholds.
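
Here, g is the clipping function: it replaces the probability ratio by its clipped value before multiplying it by the advantage, and can be written as

\[g(\epsilon, A) = \begin{cases} (1+\epsilon) A & \text{if } A \geq 0, \\ (1-\epsilon) A & \text{if } A < 0. \end{cases}\]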

This loss ensures that whether the advantage is positive or negative, policy updates that would produce significant shifts from the previous configuration are discouraged.

This tutorial is structured as follows:

  1. First, we will define a set of hyperparameters we will be using for training.

  2. Next, we will focus on creating our environment, or simulator, using TorchRL's wrappers and transforms.

  3. Next, we will design the policy network and the value model, which is indispensable to the loss function. These modules will be used to configure our loss module.

  4. Next, we will create the replay buffer and data loader.

  5. Finally, we will run our training loop and analyze the results.

Throughout this tutorial, we'll be using the tensordict library. TensorDict is the lingua franca of TorchRL: it helps us abstract what a module reads and writes and care less about the specific data description and more about the algorithm itself.

import warnings
warnings.filterwarnings("ignore")
from torch import multiprocessing


from collections import defaultdict

import matplotlib.pyplot as plt
import torch
from tensordict.nn import TensorDictModule
from tensordict.nn.distributions import NormalParamExtractor
from torch import nn
from torchrl.collectors import SyncDataCollector
from torchrl.data.replay_buffers import ReplayBuffer
from torchrl.data.replay_buffers.samplers import SamplerWithoutReplacement
from torchrl.data.replay_buffers.storages import LazyTensorStorage
from torchrl.envs import (Compose, DoubleToFloat, ObservationNorm, StepCounter,
                          TransformedEnv)
from torchrl.envs.libs.gym import GymEnv
from torchrl.envs.utils import check_env_specs, ExplorationType, set_exploration_type
from torchrl.modules import ProbabilisticActor, TanhNormal, ValueOperator
from torchrl.objectives import ClipPPOLoss
from torchrl.objectives.value import GAE
from tqdm import tqdm

Define Hyperparameters

We set the hyperparameters for our algorithm. Depending on the resources available, one may choose to execute the policy on GPU or on another device. The frame_skip will control how many frames a single action is being executed for. The rest of the arguments that count frames must be corrected for this value (since one environment step will actually return frame_skip frames).

is_fork = multiprocessing.get_start_method() == "fork"
device = (
    torch.device(0)
    if torch.cuda.is_available() and not is_fork
    else torch.device("cpu")
)
num_cells = 256  # number of cells in each layer i.e. output dim.
lr = 3e-4
max_grad_norm = 1.0

Data collection parameters

When collecting data, we will be able to choose how big each batch will be by defining a frames_per_batch parameter. We will also define how many frames (such as the number of interactions with the simulator) we will allow ourselves to use. In general, the goal of an RL algorithm is to learn to solve the task as fast as it can in terms of environment interactions: the lower the total_frames, the better.

frames_per_batch = 1000
# For a complete training, bring the number of frames up to 1M
total_frames = 50_000

PPO parameters

At each data collection (or batch collection), we will run the optimization over a certain number of epochs, each time consuming the entire data we just acquired in a nested training loop. Here, sub_batch_size is different from frames_per_batch above: recall that we are working with a "batch of data" coming from our collector, whose size is defined by frames_per_batch, and that we will further split into smaller sub-batches during the inner training loop. The size of these sub-batches is controlled by sub_batch_size. With the values below, each 1000-frame batch yields 1000 // 64 = 15 optimization steps per epoch, i.e. 150 per collected batch.

sub_batch_size = 64  # cardinality of the sub-samples gathered from the current data in the inner loop
num_epochs = 10  # optimization steps per batch of data collected
clip_epsilon = (
    0.2  # clip value for PPO loss: see the equation in the intro for more context.
)
gamma = 0.99
lmbda = 0.95
entropy_eps = 1e-4

Define an environment

In RL, an environment is usually the way we refer to a simulator or a control system. Various libraries provide simulation environments for reinforcement learning, including Gymnasium (previously OpenAI Gym), the DeepMind control suite, and many others. As a general library, TorchRL's goal is to provide an interchangeable interface to a large panel of RL simulators, allowing you to easily swap one environment with another. For example, creating a wrapped gym environment can be achieved with few characters:

base_env = GymEnv("InvertedDoublePendulum-v4", device=device)

There are a few things to notice in this code: first, we created the environment by calling the GymEnv wrapper. If extra keyword arguments are passed, they will be transmitted to the gym.make method, hence covering the most common environment construction commands. Alternatively, one could also directly create a gym environment using gym.make(env_name, **kwargs) and wrap it in a GymWrapper class.
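
As a quick illustration of that alternative, the construction could look roughly as follows (a sketch only, assuming the gym package installed above; GymWrapper is not used elsewhere in this tutorial):

import gym
from torchrl.envs.libs.gym import GymWrapper

# build the gym environment directly, then wrap it for TorchRL
base_env_alt = GymWrapper(gym.make("InvertedDoublePendulum-v4"), device=device)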

Note also the device argument: for gym, this only controls the device where the input action and observed states will be stored, but the execution will always be done on CPU. The reason for this is simply that gym does not support on-device execution, unless specified otherwise. For other libraries, we have control over the execution device and, as much as we can, we try to stay consistent in terms of storing and execution backends.

Transforms

We will append some transforms to our environment to prepare the data for the policy. In Gym, this is usually achieved via wrappers. TorchRL takes a different approach, more similar to other pytorch domain libraries, through the use of transforms. To add transforms to an environment, one should simply wrap it in a TransformedEnv instance and append the sequence of transforms to it. The transformed environment will inherit the device and meta-data of the wrapped environment, and transform these depending on the sequence of transforms it contains.

Normalization

The first to encode is a normalization transform. As a rule of thumb, it is preferable to have data that loosely matches a unit Gaussian distribution: to obtain this, we will run a certain number of random steps in the environment and compute the summary statistics of these observations.

We'll append two other transforms: the DoubleToFloat transform will convert double-precision entries to single-precision numbers, ready to be read by the policy. The StepCounter transform will be used to count the steps before the environment is terminated. We will use this measure as a supplementary measure of performance.

As we will see later, many of TorchRL's classes rely on TensorDict to communicate. You could think of it as a python dictionary with some extra tensor features. In practice, this means that many modules we will be working with need to be told what key to read (in_keys) and what key to write (out_keys) in the tensordict they will receive. Usually, if out_keys is omitted, it is assumed that the in_keys entries will be updated in-place. For our transforms, the only entry we are interested in is referred to as "observation" and our transform layers will be told to modify this entry and this entry only:

env = TransformedEnv(
    base_env,
    Compose(
        # normalize observations
        ObservationNorm(in_keys=["observation"]),
        DoubleToFloat(),
        StepCounter(),
    ),
)

As you may have noticed, we have created a normalization layer but we did not set its normalization parameters. To do this, ObservationNorm can automatically gather the summary statistics of our environment:

env.transform[0].init_stats(num_iter=1000, reduce_dim=0, cat_dim=0)

The ObservationNorm transform has now been populated with a location and a scale that will be used to normalize the data.

Let us do a little sanity check for the shape of our summary stats:

print("normalization constant shape:", env.transform[0].loc.shape)
normalization constant shape: torch.Size([11])

An environment is not only defined by its simulator and transforms, but also by a series of metadata that describe what can be expected during its execution. For efficiency purposes, TorchRL is quite stringent when it comes to environment specs, but you can easily check that your environment specs are adequate. In our example, the GymWrapper and the GymEnv that inherits from it already take care of setting the proper specs for your environment, so you should not have to care about this.

Nevertheless, let's see a concrete example using our transformed environment by looking at its specs. There are three specs to look at: observation_spec, which defines what is to be expected when executing an action in the environment; reward_spec, which indicates the reward domain; and finally the input_spec (which contains the action_spec), which represents everything an environment requires to execute a single step.

print("observation_spec:", env.observation_spec)
print("reward_spec:", env.reward_spec)
print("input_spec:", env.input_spec)
print("action_spec (as defined by input_spec):", env.action_spec)
observation_spec: Composite(
    observation: UnboundedContinuous(
        shape=torch.Size([11]),
        space=ContinuousBox(
            low=Tensor(shape=torch.Size([11]), device=cpu, dtype=torch.float32, contiguous=True),
            high=Tensor(shape=torch.Size([11]), device=cpu, dtype=torch.float32, contiguous=True)),
        device=cpu,
        dtype=torch.float32,
        domain=continuous),
    step_count: BoundedDiscrete(
        shape=torch.Size([1]),
        space=ContinuousBox(
            low=Tensor(shape=torch.Size([1]), device=cpu, dtype=torch.int64, contiguous=True),
            high=Tensor(shape=torch.Size([1]), device=cpu, dtype=torch.int64, contiguous=True)),
        device=cpu,
        dtype=torch.int64,
        domain=discrete),
    device=cpu,
    shape=torch.Size([]))
reward_spec: UnboundedContinuous(
    shape=torch.Size([1]),
    space=ContinuousBox(
        low=Tensor(shape=torch.Size([1]), device=cpu, dtype=torch.float32, contiguous=True),
        high=Tensor(shape=torch.Size([1]), device=cpu, dtype=torch.float32, contiguous=True)),
    device=cpu,
    dtype=torch.float32,
    domain=continuous)
input_spec: Composite(
    full_state_spec: Composite(
        step_count: BoundedDiscrete(
            shape=torch.Size([1]),
            space=ContinuousBox(
                low=Tensor(shape=torch.Size([1]), device=cpu, dtype=torch.int64, contiguous=True),
                high=Tensor(shape=torch.Size([1]), device=cpu, dtype=torch.int64, contiguous=True)),
            device=cpu,
            dtype=torch.int64,
            domain=discrete),
        device=cpu,
        shape=torch.Size([])),
    full_action_spec: Composite(
        action: BoundedContinuous(
            shape=torch.Size([1]),
            space=ContinuousBox(
                low=Tensor(shape=torch.Size([1]), device=cpu, dtype=torch.float32, contiguous=True),
                high=Tensor(shape=torch.Size([1]), device=cpu, dtype=torch.float32, contiguous=True)),
            device=cpu,
            dtype=torch.float32,
            domain=continuous),
        device=cpu,
        shape=torch.Size([])),
    device=cpu,
    shape=torch.Size([]))
action_spec (as defined by input_spec): BoundedContinuous(
    shape=torch.Size([1]),
    space=ContinuousBox(
        low=Tensor(shape=torch.Size([1]), device=cpu, dtype=torch.float32, contiguous=True),
        high=Tensor(shape=torch.Size([1]), device=cpu, dtype=torch.float32, contiguous=True)),
    device=cpu,
    dtype=torch.float32,
    domain=continuous)

The check_env_specs() function runs a small rollout and compares its output against the environment specs. If no error is raised, we can be confident that the specs are properly defined:

check_env_specs(env)

For fun, let's see what a simple random rollout looks like. You can call env.rollout(n_steps) and get an overview of what the environment inputs and outputs look like. Actions will automatically be drawn from the action spec domain, so you don't need to care about designing a random sampler.

Typically, at each step, an RL environment receives an action as input, and outputs an observation, a reward and a done state. The observation may be composite, meaning that it could be composed of more than one tensor. This is not a problem for TorchRL, since the whole set of observations is automatically packed in the output TensorDict. After executing a rollout (for example, a sequence of environment steps and random action generation) for a given number of steps, we will retrieve a TensorDict instance with a shape that matches this trajectory length:

rollout = env.rollout(3)
print("rollout of three steps:", rollout)
print("Shape of the rollout TensorDict:", rollout.batch_size)
rollout of three steps: TensorDict(
    fields={
        action: Tensor(shape=torch.Size([3, 1]), device=cpu, dtype=torch.float32, is_shared=False),
        done: Tensor(shape=torch.Size([3, 1]), device=cpu, dtype=torch.bool, is_shared=False),
        next: TensorDict(
            fields={
                done: Tensor(shape=torch.Size([3, 1]), device=cpu, dtype=torch.bool, is_shared=False),
                observation: Tensor(shape=torch.Size([3, 11]), device=cpu, dtype=torch.float32, is_shared=False),
                reward: Tensor(shape=torch.Size([3, 1]), device=cpu, dtype=torch.float32, is_shared=False),
                step_count: Tensor(shape=torch.Size([3, 1]), device=cpu, dtype=torch.int64, is_shared=False),
                terminated: Tensor(shape=torch.Size([3, 1]), device=cpu, dtype=torch.bool, is_shared=False),
                truncated: Tensor(shape=torch.Size([3, 1]), device=cpu, dtype=torch.bool, is_shared=False)},
            batch_size=torch.Size([3]),
            device=cpu,
            is_shared=False),
        observation: Tensor(shape=torch.Size([3, 11]), device=cpu, dtype=torch.float32, is_shared=False),
        step_count: Tensor(shape=torch.Size([3, 1]), device=cpu, dtype=torch.int64, is_shared=False),
        terminated: Tensor(shape=torch.Size([3, 1]), device=cpu, dtype=torch.bool, is_shared=False),
        truncated: Tensor(shape=torch.Size([3, 1]), device=cpu, dtype=torch.bool, is_shared=False)},
    batch_size=torch.Size([3]),
    device=cpu,
    is_shared=False)
Shape of the rollout TensorDict: torch.Size([3])

Our rollout data has a shape of torch.Size([3]), which matches the number of steps we ran it for. The "next" entry points to the data coming after the current step. In most cases, the "next" data at time t matches the data at t+1, but this may not be the case if we are using some specific transformations (for example, multi-step).
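
As a quick optional sanity check of this relationship (a sketch, assuming no reset happened inside the rollout), the shifted entries can be compared directly:

# the "next" observation at step t should equal the observation at step t+1
torch.testing.assert_close(
    rollout["observation"][1:], rollout["next", "observation"][:-1]
)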

Policy

PPO utilizes a stochastic policy to handle exploration. This means that our neural network will have to output the parameters of a distribution, rather than a single value corresponding to the action taken.

As the data is continuous, we use a Tanh-Normal distribution to respect the action space boundaries. TorchRL provides such a distribution, and the only thing we need to care about is to build a neural network that outputs the right number of parameters for the policy to work with (a location, or mean, and a scale):

\[f_{\theta}(\text{observation}) = \mu_{\theta}(\text{observation}), \sigma^{+}_{\theta}(\text{observation})\]

The only extra difficulty that is brought up here is to split our output in two equal parts and map the second to a strictly positive space.

We design the policy in three steps:

  1. Define a neural network D_obs -> 2 * D_action. Indeed, our loc (mu) and scale (sigma) both have dimension D_action.

  2. Append a NormalParamExtractor to extract a location and a scale (for example, splitting the input in two equal parts and applying a positive transformation to the scale parameter).

  3. Create a probabilistic TensorDictModule that can generate this distribution and sample from it.
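
Steps 1 and 2 can be covered by a plain multi-layer perceptron followed by a NormalParamExtractor. The sketch below mirrors the value network defined later (three hidden layers of num_cells units); any architecture with the right output size would do:

actor_net = nn.Sequential(
    nn.LazyLinear(num_cells, device=device),
    nn.Tanh(),
    nn.LazyLinear(num_cells, device=device),
    nn.Tanh(),
    nn.LazyLinear(num_cells, device=device),
    nn.Tanh(),
    # output 2 * D_action values: the location and the (pre-transform) scale
    nn.LazyLinear(2 * env.action_spec.shape[-1], device=device),
    # split the output into loc and scale, and map scale to a strictly positive value
    NormalParamExtractor(),
)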

To enable the policy to "talk" with the environment through the tensordict data carrier, we wrap the nn.Module in a TensorDictModule. This class will simply read the in_keys it is provided with and write the outputs in-place at the registered out_keys:

policy_module = TensorDictModule(
    actor_net, in_keys=["observation"], out_keys=["loc", "scale"]
)

We now need to build a distribution out of the location and scale of our normal distribution. To do so, we instruct the ProbabilisticActor class to build a TanhNormal out of the location and scale parameters. We also provide the minimum and maximum values of this distribution, which we gather from the environment specs.

The name of the in_keys (and hence the name of the out_keys from the TensorDictModule above) cannot be set to just any value one may like, as the TanhNormal distribution constructor will expect the loc and scale keyword arguments. That being said, ProbabilisticActor also accepts Dict[str, str] typed in_keys, where the key-value pairs indicate which in_key string should be used for each keyword argument.

policy_module = ProbabilisticActor(
    module=policy_module,
    spec=env.action_spec,
    in_keys=["loc", "scale"],
    distribution_class=TanhNormal,
    distribution_kwargs={
        "low": env.action_spec.space.low,
        "high": env.action_spec.space.high,
    },
    return_log_prob=True,
    # we'll need the log-prob for the numerator of the importance weights
)

Value network

The value network is a crucial component of the PPO algorithm, even though it won't be used at inference time. This module will read the observations and return an estimation of the discounted return for the following trajectory. This allows us to amortize learning by relying on some utility estimation that is learned on-the-fly during training.

value_net = nn.Sequential(
    nn.LazyLinear(num_cells, device=device),
    nn.Tanh(),
    nn.LazyLinear(num_cells, device=device),
    nn.Tanh(),
    nn.LazyLinear(num_cells, device=device),
    nn.Tanh(),
    nn.LazyLinear(1, device=device),
)

value_module = ValueOperator(
    module=value_net,
    in_keys=["observation"],
)

Let's try our policy and value modules. As we said earlier, the usage of TensorDictModule makes it possible to directly read the output of the environment to run these modules, as they know what information to read and where to write it:

print("Running policy:", policy_module(env.reset()))
print("Running value:", value_module(env.reset()))
Running policy: TensorDict(
    fields={
        action: Tensor(shape=torch.Size([1]), device=cpu, dtype=torch.float32, is_shared=False),
        done: Tensor(shape=torch.Size([1]), device=cpu, dtype=torch.bool, is_shared=False),
        loc: Tensor(shape=torch.Size([1]), device=cpu, dtype=torch.float32, is_shared=False),
        observation: Tensor(shape=torch.Size([11]), device=cpu, dtype=torch.float32, is_shared=False),
        sample_log_prob: Tensor(shape=torch.Size([]), device=cpu, dtype=torch.float32, is_shared=False),
        scale: Tensor(shape=torch.Size([1]), device=cpu, dtype=torch.float32, is_shared=False),
        step_count: Tensor(shape=torch.Size([1]), device=cpu, dtype=torch.int64, is_shared=False),
        terminated: Tensor(shape=torch.Size([1]), device=cpu, dtype=torch.bool, is_shared=False),
        truncated: Tensor(shape=torch.Size([1]), device=cpu, dtype=torch.bool, is_shared=False)},
    batch_size=torch.Size([]),
    device=cpu,
    is_shared=False)
Running value: TensorDict(
    fields={
        done: Tensor(shape=torch.Size([1]), device=cpu, dtype=torch.bool, is_shared=False),
        observation: Tensor(shape=torch.Size([11]), device=cpu, dtype=torch.float32, is_shared=False),
        state_value: Tensor(shape=torch.Size([1]), device=cpu, dtype=torch.float32, is_shared=False),
        step_count: Tensor(shape=torch.Size([1]), device=cpu, dtype=torch.int64, is_shared=False),
        terminated: Tensor(shape=torch.Size([1]), device=cpu, dtype=torch.bool, is_shared=False),
        truncated: Tensor(shape=torch.Size([1]), device=cpu, dtype=torch.bool, is_shared=False)},
    batch_size=torch.Size([]),
    device=cpu,
    is_shared=False)

Data collector

TorchRL provides a set of DataCollector classes. Briefly, these classes execute three operations: reset an environment, compute an action given the latest observation, execute a step in the environment, and repeat the last two steps until the environment signals a stop (or reaches a done state).

They allow you to control how many frames to collect at each iteration (via the frames_per_batch parameter), when to reset the environment (via the max_frames_per_traj argument), on which device the policy should be executed, and so on. They are also designed to work efficiently with batched and multiprocessed environments.

The simplest data collector is the SyncDataCollector: it is an iterator that you can use to get batches of data of a given length, and that will stop once a total number of frames (total_frames) has been collected. Other data collectors (MultiSyncDataCollector and MultiaSyncDataCollector) will execute the same operations in a synchronous and asynchronous manner over a set of multiprocessed workers.

As with the policy and environment before, the data collector will return TensorDict instances with a total number of elements that matches frames_per_batch. Using TensorDict to pass data to the training loop allows you to write data-loading pipelines that are 100% oblivious to the actual specificities of the rollout content.

collector = SyncDataCollector(
    env,
    policy_module,
    frames_per_batch=frames_per_batch,
    total_frames=total_frames,
    split_trajs=False,
    device=device,
)

Replay buffer

Replay buffers are a common building piece of off-policy RL algorithms. In on-policy contexts, a replay buffer is refilled every time a batch of data is collected, and its data is repeatedly consumed for a certain number of epochs.

TorchRL's replay buffers are built using a common container, ReplayBuffer, which takes as arguments the components of the buffer: a storage, a writer, a sampler and possibly some transforms. Only the storage (which indicates the replay buffer capacity) is mandatory. We also specify a sampler without repetition to avoid sampling the same item multiple times in one epoch. Using a replay buffer for PPO is not mandatory and we could simply sample the sub-batches from the collected batch (a buffer-free alternative is sketched after the code below), but using these classes makes it easy for us to build the inner training loop in a reproducible way.

replay_buffer = ReplayBuffer(
    storage=LazyTensorStorage(max_size=frames_per_batch),
    sampler=SamplerWithoutReplacement(),
)
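
For reference, the buffer-free alternative mentioned above could be sketched as follows, where data stands for a flattened batch coming from the collector (this sketch is not used in the rest of the tutorial):

# shuffle the flattened batch and iterate over sub-batches of it,
# instead of extending and sampling a replay buffer
for idx in torch.randperm(frames_per_batch).split(sub_batch_size):
    subdata = data[idx]  # a TensorDict holding ``sub_batch_size`` transitions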

Loss function

The PPO loss can be directly imported from TorchRL for convenience using the ClipPPOLoss class. This is the easiest way of utilizing PPO: it hides away the mathematical operations of PPO and the control flow that goes with it.

PPO requires some "advantage estimation" to be computed. In short, an advantage is a value that reflects an expectancy over the return value while dealing with the bias/variance tradeoff. To compute the advantage, one just needs to (1) build the advantage module, which utilizes our value operator, and (2) pass each batch of data through it before each epoch. The GAE module will update the input tensordict with new "advantage" and "value_target" entries. The "value_target" is a gradient-free tensor that represents the empirical value that the value network should produce for the input observation. Both of these will be used by ClipPPOLoss to return the policy and value losses.
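
For reference, the generalized advantage estimate computed here is the exponentially-weighted sum of temporal-difference residuals, with gamma and lmbda as defined in the hyperparameters above:

\[\delta_t = r_t + \gamma V(s_{t+1}) - V(s_t), \qquad \hat{A}_t = \sum_{l \geq 0} (\gamma \lambda)^{l} \, \delta_{t+l}.\]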

advantage_module = GAE(
    gamma=gamma, lmbda=lmbda, value_network=value_module, average_gae=True
)

loss_module = ClipPPOLoss(
    actor_network=policy_module,
    critic_network=value_module,
    clip_epsilon=clip_epsilon,
    entropy_bonus=bool(entropy_eps),
    entropy_coef=entropy_eps,
    # these keys match by default but we set this for completeness
    critic_coef=1.0,
    loss_critic_type="smooth_l1",
)

optim = torch.optim.Adam(loss_module.parameters(), lr)
scheduler = torch.optim.lr_scheduler.CosineAnnealingLR(
    optim, total_frames // frames_per_batch, 0.0
)

Training loop

We now have all the pieces needed to code our training loop. The steps include:

  • Collect data

    • Compute advantage

      • Loop over the collected data to compute the loss values

      • Back propagate

      • Optimize

      • Repeat

    • Repeat

  • Repeat

logs = defaultdict(list)
pbar = tqdm(total=total_frames)
eval_str = ""

# We iterate over the collector until it reaches the total number of frames it was
# designed to collect:
for i, tensordict_data in enumerate(collector):
    # we now have a batch of data to work with. Let's learn something from it.
    for _ in range(num_epochs):
        # We'll need an "advantage" signal to make PPO work.
        # We re-compute it at each epoch as its value depends on the value
        # network which is updated in the inner loop.
        advantage_module(tensordict_data)
        data_view = tensordict_data.reshape(-1)
        replay_buffer.extend(data_view.cpu())
        for _ in range(frames_per_batch // sub_batch_size):
            subdata = replay_buffer.sample(sub_batch_size)
            loss_vals = loss_module(subdata.to(device))
            loss_value = (
                loss_vals["loss_objective"]
                + loss_vals["loss_critic"]
                + loss_vals["loss_entropy"]
            )

            # Optimization: backward, grad clipping and optimization step
            loss_value.backward()
            # this is not strictly mandatory but it's good practice to keep
            # your gradient norm bounded
            torch.nn.utils.clip_grad_norm_(loss_module.parameters(), max_grad_norm)
            optim.step()
            optim.zero_grad()

    logs["reward"].append(tensordict_data["next", "reward"].mean().item())
    pbar.update(tensordict_data.numel())
    cum_reward_str = (
        f"average reward={logs['reward'][-1]: 4.4f} (init={logs['reward'][0]: 4.4f})"
    )
    logs["step_count"].append(tensordict_data["step_count"].max().item())
    stepcount_str = f"step count (max): {logs['step_count'][-1]}"
    logs["lr"].append(optim.param_groups[0]["lr"])
    lr_str = f"lr policy: {logs['lr'][-1]: 4.4f}"
    if i % 10 == 0:
        # We evaluate the policy once every 10 batches of data.
        # Evaluation is rather simple: execute the policy without exploration
        # (take the expected value of the action distribution) for a given
        # number of steps (1000, which is our ``env`` horizon).
        # The ``rollout`` method of the ``env`` can take a policy as argument:
        # it will then execute this policy at each step.
        with set_exploration_type(ExplorationType.DETERMINISTIC), torch.no_grad():
            # execute a rollout with the trained policy
            eval_rollout = env.rollout(1000, policy_module)
            logs["eval reward"].append(eval_rollout["next", "reward"].mean().item())
            logs["eval reward (sum)"].append(
                eval_rollout["next", "reward"].sum().item()
            )
            logs["eval step_count"].append(eval_rollout["step_count"].max().item())
            eval_str = (
                f"eval cumulative reward: {logs['eval reward (sum)'][-1]: 4.4f} "
                f"(init: {logs['eval reward (sum)'][0]: 4.4f}), "
                f"eval step-count: {logs['eval step_count'][-1]}"
            )
            del eval_rollout
    pbar.set_description(", ".join([eval_str, cum_reward_str, stepcount_str, lr_str]))

    # We're also using a learning rate scheduler. Like the gradient clipping,
    # this is a nice-to-have but nothing necessary for PPO to work.
    scheduler.step()
  0%|          | 0/50000 [00:00<?, ?it/s]
  2%|2         | 1000/50000 [00:04<03:17, 248.03it/s]
eval cumulative reward:  91.9271 (init:  91.9271), eval step-count: 9, average reward= 9.0969 (init= 9.0969), step count (max): 11, lr policy:  0.0003:   2%|2         | 1000/50000 [00:04<03:17, 248.03it/s]
eval cumulative reward:  91.9271 (init:  91.9271), eval step-count: 9, average reward= 9.0969 (init= 9.0969), step count (max): 11, lr policy:  0.0003:   4%|4         | 2000/50000 [00:07<03:09, 253.73it/s]
eval cumulative reward:  91.9271 (init:  91.9271), eval step-count: 9, average reward= 9.1171 (init= 9.0969), step count (max): 16, lr policy:  0.0003:   4%|4         | 2000/50000 [00:07<03:09, 253.73it/s]
eval cumulative reward:  91.9271 (init:  91.9271), eval step-count: 9, average reward= 9.1171 (init= 9.0969), step count (max): 16, lr policy:  0.0003:   6%|6         | 3000/50000 [00:11<03:04, 255.42it/s]
eval cumulative reward:  91.9271 (init:  91.9271), eval step-count: 9, average reward= 9.1550 (init= 9.0969), step count (max): 20, lr policy:  0.0003:   6%|6         | 3000/50000 [00:11<03:04, 255.42it/s]
eval cumulative reward:  91.9271 (init:  91.9271), eval step-count: 9, average reward= 9.1550 (init= 9.0969), step count (max): 20, lr policy:  0.0003:   8%|8         | 4000/50000 [00:15<02:58, 258.33it/s]
eval cumulative reward:  91.9271 (init:  91.9271), eval step-count: 9, average reward= 9.1764 (init= 9.0969), step count (max): 19, lr policy:  0.0003:   8%|8         | 4000/50000 [00:15<02:58, 258.33it/s]
eval cumulative reward:  91.9271 (init:  91.9271), eval step-count: 9, average reward= 9.1764 (init= 9.0969), step count (max): 19, lr policy:  0.0003:  10%|#         | 5000/50000 [00:19<02:52, 260.64it/s]
eval cumulative reward:  91.9271 (init:  91.9271), eval step-count: 9, average reward= 9.2009 (init= 9.0969), step count (max): 26, lr policy:  0.0003:  10%|#         | 5000/50000 [00:19<02:52, 260.64it/s]
eval cumulative reward:  91.9271 (init:  91.9271), eval step-count: 9, average reward= 9.2009 (init= 9.0969), step count (max): 26, lr policy:  0.0003:  12%|#2        | 6000/50000 [00:23<02:47, 262.48it/s]
eval cumulative reward:  91.9271 (init:  91.9271), eval step-count: 9, average reward= 9.2242 (init= 9.0969), step count (max): 26, lr policy:  0.0003:  12%|#2        | 6000/50000 [00:23<02:47, 262.48it/s]
eval cumulative reward:  91.9271 (init:  91.9271), eval step-count: 9, average reward= 9.2242 (init= 9.0969), step count (max): 26, lr policy:  0.0003:  14%|#4        | 7000/50000 [00:26<02:43, 263.76it/s]
eval cumulative reward:  91.9271 (init:  91.9271), eval step-count: 9, average reward= 9.2368 (init= 9.0969), step count (max): 30, lr policy:  0.0003:  14%|#4        | 7000/50000 [00:26<02:43, 263.76it/s]
eval cumulative reward:  91.9271 (init:  91.9271), eval step-count: 9, average reward= 9.2368 (init= 9.0969), step count (max): 30, lr policy:  0.0003:  16%|#6        | 8000/50000 [00:30<02:38, 265.00it/s]
eval cumulative reward:  91.9271 (init:  91.9271), eval step-count: 9, average reward= 9.2486 (init= 9.0969), step count (max): 46, lr policy:  0.0003:  16%|#6        | 8000/50000 [00:30<02:38, 265.00it/s]
eval cumulative reward:  91.9271 (init:  91.9271), eval step-count: 9, average reward= 9.2486 (init= 9.0969), step count (max): 46, lr policy:  0.0003:  18%|#8        | 9000/50000 [00:34<02:37, 259.69it/s]
eval cumulative reward:  91.9271 (init:  91.9271), eval step-count: 9, average reward= 9.2509 (init= 9.0969), step count (max): 47, lr policy:  0.0003:  18%|#8        | 9000/50000 [00:34<02:37, 259.69it/s]
eval cumulative reward:  91.9271 (init:  91.9271), eval step-count: 9, average reward= 9.2509 (init= 9.0969), step count (max): 47, lr policy:  0.0003:  20%|##        | 10000/50000 [00:38<02:32, 262.13it/s]
eval cumulative reward:  91.9271 (init:  91.9271), eval step-count: 9, average reward= 9.2599 (init= 9.0969), step count (max): 52, lr policy:  0.0003:  20%|##        | 10000/50000 [00:38<02:32, 262.13it/s]
eval cumulative reward:  91.9271 (init:  91.9271), eval step-count: 9, average reward= 9.2599 (init= 9.0969), step count (max): 52, lr policy:  0.0003:  22%|##2       | 11000/50000 [00:42<02:27, 263.79it/s]
eval cumulative reward:  371.6797 (init:  91.9271), eval step-count: 39, average reward= 9.2698 (init= 9.0969), step count (max): 56, lr policy:  0.0003:  22%|##2       | 11000/50000 [00:42<02:27, 263.79it/s]
eval cumulative reward:  371.6797 (init:  91.9271), eval step-count: 39, average reward= 9.2698 (init= 9.0969), step count (max): 56, lr policy:  0.0003:  24%|##4       | 12000/50000 [00:45<02:24, 263.22it/s]
eval cumulative reward:  371.6797 (init:  91.9271), eval step-count: 39, average reward= 9.2751 (init= 9.0969), step count (max): 75, lr policy:  0.0003:  24%|##4       | 12000/50000 [00:45<02:24, 263.22it/s]
eval cumulative reward:  371.6797 (init:  91.9271), eval step-count: 39, average reward= 9.2751 (init= 9.0969), step count (max): 75, lr policy:  0.0003:  26%|##6       | 13000/50000 [00:49<02:19, 264.82it/s]
eval cumulative reward:  371.6797 (init:  91.9271), eval step-count: 39, average reward= 9.2811 (init= 9.0969), step count (max): 92, lr policy:  0.0003:  26%|##6       | 13000/50000 [00:49<02:19, 264.82it/s]
eval cumulative reward:  371.6797 (init:  91.9271), eval step-count: 39, average reward= 9.2811 (init= 9.0969), step count (max): 92, lr policy:  0.0003:  28%|##8       | 14000/50000 [00:53<02:15, 265.99it/s]
eval cumulative reward:  371.6797 (init:  91.9271), eval step-count: 39, average reward= 9.2798 (init= 9.0969), step count (max): 75, lr policy:  0.0003:  28%|##8       | 14000/50000 [00:53<02:15, 265.99it/s]
eval cumulative reward:  371.6797 (init:  91.9271), eval step-count: 39, average reward= 9.2798 (init= 9.0969), step count (max): 75, lr policy:  0.0003:  30%|###       | 15000/50000 [00:57<02:11, 266.73it/s]
eval cumulative reward:  371.6797 (init:  91.9271), eval step-count: 39, average reward= 9.2881 (init= 9.0969), step count (max): 64, lr policy:  0.0002:  30%|###       | 15000/50000 [00:57<02:11, 266.73it/s]
eval cumulative reward:  371.6797 (init:  91.9271), eval step-count: 39, average reward= 9.2881 (init= 9.0969), step count (max): 64, lr policy:  0.0002:  32%|###2      | 16000/50000 [01:00<02:07, 267.02it/s]
eval cumulative reward:  371.6797 (init:  91.9271), eval step-count: 39, average reward= 9.2940 (init= 9.0969), step count (max): 107, lr policy:  0.0002:  32%|###2      | 16000/50000 [01:00<02:07, 267.02it/s]
eval cumulative reward:  371.6797 (init:  91.9271), eval step-count: 39, average reward= 9.2940 (init= 9.0969), step count (max): 107, lr policy:  0.0002:  34%|###4      | 17000/50000 [01:04<02:03, 267.38it/s]
eval cumulative reward:  371.6797 (init:  91.9271), eval step-count: 39, average reward= 9.2961 (init= 9.0969), step count (max): 128, lr policy:  0.0002:  34%|###4      | 17000/50000 [01:04<02:03, 267.38it/s]
eval cumulative reward:  371.6797 (init:  91.9271), eval step-count: 39, average reward= 9.2961 (init= 9.0969), step count (max): 128, lr policy:  0.0002:  36%|###6      | 18000/50000 [01:08<01:59, 267.44it/s]
eval cumulative reward:  371.6797 (init:  91.9271), eval step-count: 39, average reward= 9.2795 (init= 9.0969), step count (max): 55, lr policy:  0.0002:  36%|###6      | 18000/50000 [01:08<01:59, 267.44it/s]
eval cumulative reward:  371.6797 (init:  91.9271), eval step-count: 39, average reward= 9.2795 (init= 9.0969), step count (max): 55, lr policy:  0.0002:  38%|###8      | 19000/50000 [01:12<01:55, 267.51it/s]
eval cumulative reward:  371.6797 (init:  91.9271), eval step-count: 39, average reward= 9.2901 (init= 9.0969), step count (max): 87, lr policy:  0.0002:  38%|###8      | 19000/50000 [01:12<01:55, 267.51it/s]
eval cumulative reward:  371.6797 (init:  91.9271), eval step-count: 39, average reward= 9.2901 (init= 9.0969), step count (max): 87, lr policy:  0.0002:  40%|####      | 20000/50000 [01:15<01:52, 267.58it/s]
eval cumulative reward:  371.6797 (init:  91.9271), eval step-count: 39, average reward= 9.2891 (init= 9.0969), step count (max): 70, lr policy:  0.0002:  40%|####      | 20000/50000 [01:15<01:52, 267.58it/s]
eval cumulative reward:  371.6797 (init:  91.9271), eval step-count: 39, average reward= 9.2891 (init= 9.0969), step count (max): 70, lr policy:  0.0002:  42%|####2     | 21000/50000 [01:19<01:50, 261.70it/s]
eval cumulative reward:  755.6193 (init:  91.9271), eval step-count: 80, average reward= 9.2790 (init= 9.0969), step count (max): 71, lr policy:  0.0002:  42%|####2     | 21000/50000 [01:19<01:50, 261.70it/s]
eval cumulative reward:  755.6193 (init:  91.9271), eval step-count: 80, average reward= 9.2790 (init= 9.0969), step count (max): 71, lr policy:  0.0002:  44%|####4     | 22000/50000 [01:23<01:47, 260.02it/s]
eval cumulative reward:  755.6193 (init:  91.9271), eval step-count: 80, average reward= 9.2980 (init= 9.0969), step count (max): 97, lr policy:  0.0002:  44%|####4     | 22000/50000 [01:23<01:47, 260.02it/s]
eval cumulative reward:  755.6193 (init:  91.9271), eval step-count: 80, average reward= 9.2980 (init= 9.0969), step count (max): 97, lr policy:  0.0002:  46%|####6     | 23000/50000 [01:27<01:42, 262.49it/s]
eval cumulative reward:  755.6193 (init:  91.9271), eval step-count: 80, average reward= 9.3049 (init= 9.0969), step count (max): 100, lr policy:  0.0002:  46%|####6     | 23000/50000 [01:27<01:42, 262.49it/s]
eval cumulative reward:  755.6193 (init:  91.9271), eval step-count: 80, average reward= 9.3049 (init= 9.0969), step count (max): 100, lr policy:  0.0002:  48%|####8     | 24000/50000 [01:31<01:38, 264.37it/s]
eval cumulative reward:  755.6193 (init:  91.9271), eval step-count: 80, average reward= 9.3054 (init= 9.0969), step count (max): 110, lr policy:  0.0002:  48%|####8     | 24000/50000 [01:31<01:38, 264.37it/s]
eval cumulative reward:  755.6193 (init:  91.9271), eval step-count: 80, average reward= 9.3054 (init= 9.0969), step count (max): 110, lr policy:  0.0002:  50%|#####     | 25000/50000 [01:34<01:34, 265.43it/s]
eval cumulative reward:  755.6193 (init:  91.9271), eval step-count: 80, average reward= 9.2923 (init= 9.0969), step count (max): 66, lr policy:  0.0002:  50%|#####     | 25000/50000 [01:34<01:34, 265.43it/s]
eval cumulative reward:  755.6193 (init:  91.9271), eval step-count: 80, average reward= 9.2923 (init= 9.0969), step count (max): 66, lr policy:  0.0002:  52%|#####2    | 26000/50000 [01:38<01:30, 266.25it/s]
eval cumulative reward:  755.6193 (init:  91.9271), eval step-count: 80, average reward= 9.3045 (init= 9.0969), step count (max): 85, lr policy:  0.0001:  52%|#####2    | 26000/50000 [01:38<01:30, 266.25it/s]
eval cumulative reward:  755.6193 (init:  91.9271), eval step-count: 80, average reward= 9.3045 (init= 9.0969), step count (max): 85, lr policy:  0.0001:  54%|#####4    | 27000/50000 [01:42<01:26, 266.81it/s]
eval cumulative reward:  755.6193 (init:  91.9271), eval step-count: 80, average reward= 9.3033 (init= 9.0969), step count (max): 75, lr policy:  0.0001:  54%|#####4    | 27000/50000 [01:42<01:26, 266.81it/s]
eval cumulative reward:  755.6193 (init:  91.9271), eval step-count: 80, average reward= 9.3033 (init= 9.0969), step count (max): 75, lr policy:  0.0001:  56%|#####6    | 28000/50000 [01:46<01:22, 267.28it/s]
eval cumulative reward:  755.6193 (init:  91.9271), eval step-count: 80, average reward= 9.2913 (init= 9.0969), step count (max): 74, lr policy:  0.0001:  56%|#####6    | 28000/50000 [01:46<01:22, 267.28it/s]
eval cumulative reward:  755.6193 (init:  91.9271), eval step-count: 80, average reward= 9.2913 (init= 9.0969), step count (max): 74, lr policy:  0.0001:  58%|#####8    | 29000/50000 [01:49<01:18, 267.79it/s]
eval cumulative reward:  755.6193 (init:  91.9271), eval step-count: 80, average reward= 9.3053 (init= 9.0969), step count (max): 82, lr policy:  0.0001:  58%|#####8    | 29000/50000 [01:49<01:18, 267.79it/s]
eval cumulative reward:  755.6193 (init:  91.9271), eval step-count: 80, average reward= 9.3053 (init= 9.0969), step count (max): 82, lr policy:  0.0001:  60%|######    | 30000/50000 [01:53<01:14, 267.81it/s]
eval cumulative reward:  755.6193 (init:  91.9271), eval step-count: 80, average reward= 9.2982 (init= 9.0969), step count (max): 93, lr policy:  0.0001:  60%|######    | 30000/50000 [01:53<01:14, 267.81it/s]
eval cumulative reward:  755.6193 (init:  91.9271), eval step-count: 80, average reward= 9.2982 (init= 9.0969), step count (max): 93, lr policy:  0.0001:  62%|######2   | 31000/50000 [01:57<01:10, 267.77it/s]
eval cumulative reward:  465.6833 (init:  91.9271), eval step-count: 49, average reward= 9.2977 (init= 9.0969), step count (max): 75, lr policy:  0.0001:  62%|######2   | 31000/50000 [01:57<01:10, 267.77it/s]
eval cumulative reward:  465.6833 (init:  91.9271), eval step-count: 49, average reward= 9.2977 (init= 9.0969), step count (max): 75, lr policy:  0.0001:  64%|######4   | 32000/50000 [02:01<01:07, 265.36it/s]
eval cumulative reward:  465.6833 (init:  91.9271), eval step-count: 49, average reward= 9.2932 (init= 9.0969), step count (max): 62, lr policy:  0.0001:  64%|######4   | 32000/50000 [02:01<01:07, 265.36it/s]
eval cumulative reward:  465.6833 (init:  91.9271), eval step-count: 49, average reward= 9.2932 (init= 9.0969), step count (max): 62, lr policy:  0.0001:  66%|######6   | 33000/50000 [02:04<01:03, 266.44it/s]
eval cumulative reward:  465.6833 (init:  91.9271), eval step-count: 49, average reward= 9.3195 (init= 9.0969), step count (max): 169, lr policy:  0.0001:  66%|######6   | 33000/50000 [02:04<01:03, 266.44it/s]
eval cumulative reward:  465.6833 (init:  91.9271), eval step-count: 49, average reward= 9.3195 (init= 9.0969), step count (max): 169, lr policy:  0.0001:  68%|######8   | 34000/50000 [02:08<01:01, 261.42it/s]
eval cumulative reward:  465.6833 (init:  91.9271), eval step-count: 49, average reward= 9.3232 (init= 9.0969), step count (max): 184, lr policy:  0.0001:  68%|######8   | 34000/50000 [02:08<01:01, 261.42it/s]
eval cumulative reward:  465.6833 (init:  91.9271), eval step-count: 49, average reward= 9.3232 (init= 9.0969), step count (max): 184, lr policy:  0.0001:  70%|#######   | 35000/50000 [02:12<00:56, 263.63it/s]
eval cumulative reward:  465.6833 (init:  91.9271), eval step-count: 49, average reward= 9.3380 (init= 9.0969), step count (max): 284, lr policy:  0.0001:  70%|#######   | 35000/50000 [02:12<00:56, 263.63it/s]
eval cumulative reward:  465.6833 (init:  91.9271), eval step-count: 49, average reward= 9.3380 (init= 9.0969), step count (max): 284, lr policy:  0.0001:  72%|#######2  | 36000/50000 [02:16<00:52, 265.49it/s]
eval cumulative reward:  465.6833 (init:  91.9271), eval step-count: 49, average reward= 9.3322 (init= 9.0969), step count (max): 157, lr policy:  0.0001:  72%|#######2  | 36000/50000 [02:16<00:52, 265.49it/s]
eval cumulative reward:  465.6833 (init:  91.9271), eval step-count: 49, average reward= 9.3322 (init= 9.0969), step count (max): 157, lr policy:  0.0001:  74%|#######4  | 37000/50000 [02:19<00:48, 266.75it/s]
eval cumulative reward:  465.6833 (init:  91.9271), eval step-count: 49, average reward= 9.3289 (init= 9.0969), step count (max): 250, lr policy:  0.0001:  74%|#######4  | 37000/50000 [02:19<00:48, 266.75it/s]
eval cumulative reward:  465.6833 (init:  91.9271), eval step-count: 49, average reward= 9.3289 (init= 9.0969), step count (max): 250, lr policy:  0.0001:  76%|#######6  | 38000/50000 [02:23<00:44, 267.51it/s]
eval cumulative reward:  465.6833 (init:  91.9271), eval step-count: 49, average reward= 9.3309 (init= 9.0969), step count (max): 190, lr policy:  0.0000:  76%|#######6  | 38000/50000 [02:23<00:44, 267.51it/s]
eval cumulative reward:  465.6833 (init:  91.9271), eval step-count: 49, average reward= 9.3309 (init= 9.0969), step count (max): 190, lr policy:  0.0000:  78%|#######8  | 39000/50000 [02:27<00:41, 268.12it/s]
eval cumulative reward:  465.6833 (init:  91.9271), eval step-count: 49, average reward= 9.3267 (init= 9.0969), step count (max): 131, lr policy:  0.0000:  78%|#######8  | 39000/50000 [02:27<00:41, 268.12it/s]
eval cumulative reward:  465.6833 (init:  91.9271), eval step-count: 49, average reward= 9.3267 (init= 9.0969), step count (max): 131, lr policy:  0.0000:  80%|########  | 40000/50000 [02:31<00:37, 268.64it/s]
eval cumulative reward:  465.6833 (init:  91.9271), eval step-count: 49, average reward= 9.3280 (init= 9.0969), step count (max): 131, lr policy:  0.0000:  80%|########  | 40000/50000 [02:31<00:37, 268.64it/s]
eval cumulative reward:  465.6833 (init:  91.9271), eval step-count: 49, average reward= 9.3280 (init= 9.0969), step count (max): 131, lr policy:  0.0000:  82%|########2 | 41000/50000 [02:34<00:33, 268.98it/s]
eval cumulative reward:  615.5883 (init:  91.9271), eval step-count: 65, average reward= 9.3319 (init= 9.0969), step count (max): 162, lr policy:  0.0000:  82%|########2 | 41000/50000 [02:34<00:33, 268.98it/s]
eval cumulative reward:  615.5883 (init:  91.9271), eval step-count: 65, average reward= 9.3319 (init= 9.0969), step count (max): 162, lr policy:  0.0000:  84%|########4 | 42000/50000 [02:38<00:30, 265.99it/s]
eval cumulative reward:  615.5883 (init:  91.9271), eval step-count: 65, average reward= 9.3313 (init= 9.0969), step count (max): 166, lr policy:  0.0000:  84%|########4 | 42000/50000 [02:38<00:30, 265.99it/s]
eval cumulative reward:  615.5883 (init:  91.9271), eval step-count: 65, average reward= 9.3313 (init= 9.0969), step count (max): 166, lr policy:  0.0000:  86%|########6 | 43000/50000 [02:42<00:26, 266.75it/s]
eval cumulative reward:  615.5883 (init:  91.9271), eval step-count: 65, average reward= 9.3250 (init= 9.0969), step count (max): 161, lr policy:  0.0000:  86%|########6 | 43000/50000 [02:42<00:26, 266.75it/s]
eval cumulative reward:  615.5883 (init:  91.9271), eval step-count: 65, average reward= 9.3250 (init= 9.0969), step count (max): 161, lr policy:  0.0000:  88%|########8 | 44000/50000 [02:46<00:22, 267.36it/s]
eval cumulative reward:  615.5883 (init:  91.9271), eval step-count: 65, average reward= 9.3278 (init= 9.0969), step count (max): 131, lr policy:  0.0000:  88%|########8 | 44000/50000 [02:46<00:22, 267.36it/s]
eval cumulative reward:  615.5883 (init:  91.9271), eval step-count: 65, average reward= 9.3278 (init= 9.0969), step count (max): 131, lr policy:  0.0000:  90%|######### | 45000/50000 [02:49<00:18, 267.73it/s]
eval cumulative reward:  615.5883 (init:  91.9271), eval step-count: 65, average reward= 9.3244 (init= 9.0969), step count (max): 167, lr policy:  0.0000:  90%|######### | 45000/50000 [02:49<00:18, 267.73it/s]
eval cumulative reward:  615.5883 (init:  91.9271), eval step-count: 65, average reward= 9.3244 (init= 9.0969), step count (max): 167, lr policy:  0.0000:  92%|#########2| 46000/50000 [02:53<00:15, 262.46it/s]
eval cumulative reward:  615.5883 (init:  91.9271), eval step-count: 65, average reward= 9.3307 (init= 9.0969), step count (max): 156, lr policy:  0.0000:  92%|#########2| 46000/50000 [02:53<00:15, 262.46it/s]
eval cumulative reward:  615.5883 (init:  91.9271), eval step-count: 65, average reward= 9.3307 (init= 9.0969), step count (max): 156, lr policy:  0.0000:  94%|#########3| 47000/50000 [02:57<00:11, 264.54it/s]
eval cumulative reward:  615.5883 (init:  91.9271), eval step-count: 65, average reward= 9.3390 (init= 9.0969), step count (max): 196, lr policy:  0.0000:  94%|#########3| 47000/50000 [02:57<00:11, 264.54it/s]
eval cumulative reward:  615.5883 (init:  91.9271), eval step-count: 65, average reward= 9.3390 (init= 9.0969), step count (max): 196, lr policy:  0.0000:  96%|#########6| 48000/50000 [03:01<00:07, 265.83it/s]
eval cumulative reward:  615.5883 (init:  91.9271), eval step-count: 65, average reward= 9.3307 (init= 9.0969), step count (max): 132, lr policy:  0.0000:  96%|#########6| 48000/50000 [03:01<00:07, 265.83it/s]
eval cumulative reward:  615.5883 (init:  91.9271), eval step-count: 65, average reward= 9.3307 (init= 9.0969), step count (max): 132, lr policy:  0.0000:  98%|#########8| 49000/50000 [03:04<00:03, 266.75it/s]
eval cumulative reward:  615.5883 (init:  91.9271), eval step-count: 65, average reward= 9.3348 (init= 9.0969), step count (max): 339, lr policy:  0.0000:  98%|#########8| 49000/50000 [03:04<00:03, 266.75it/s]
eval cumulative reward:  615.5883 (init:  91.9271), eval step-count: 65, average reward= 9.3348 (init= 9.0969), step count (max): 339, lr policy:  0.0000: 100%|##########| 50000/50000 [03:08<00:00, 267.31it/s]
eval cumulative reward:  615.5883 (init:  91.9271), eval step-count: 65, average reward= 9.3294 (init= 9.0969), step count (max): 147, lr policy:  0.0000: 100%|##########| 50000/50000 [03:08<00:00, 267.31it/s]

Results

Before the 1M-step cap is reached, the algorithm should have reached a maximum step count of 1000 steps, which is the maximum number of steps before the trajectory is truncated.

plt.figure(figsize=(10, 10))
plt.subplot(2, 2, 1)
plt.plot(logs["reward"])
plt.title("training rewards (average)")
plt.subplot(2, 2, 2)
plt.plot(logs["step_count"])
plt.title("Max step count (training)")
plt.subplot(2, 2, 3)
plt.plot(logs["eval reward (sum)"])
plt.title("Return (test)")
plt.subplot(2, 2, 4)
plt.plot(logs["eval step_count"])
plt.title("Max step count (test)")
plt.show()
Figure: training rewards (average), Max step count (training), Return (test), Max step count (test)

Conclusion and next steps

In this tutorial, we have learned:

  1. How to create and customize an environment with torchrl;

  2. How to write a model and a loss function;

  3. How to set up a typical training loop.

If you want to experiment with this tutorial a bit more, you can apply the following modifications:

  • From an efficiency perspective, we could run several simulations in parallel to speed up data collection. Check ParallelEnv for further information (a minimal usage sketch follows this list).

  • From a logging perspective, one could add a torchrl.record.VideoRecorder transform to the environment after asking for rendering to get a visual rendering of the inverted pendulum in action. Check torchrl.record to know more.
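
As a minimal sketch of the first point (not used in this tutorial's run, and assuming ParallelEnv as exposed in torchrl.envs), a batched environment with four workers could be created like this and then transformed and collected from just like the single environment above:

from torchrl.envs import ParallelEnv

# run 4 copies of the base environment in separate worker processes
parallel_base_env = ParallelEnv(4, lambda: GymEnv("InvertedDoublePendulum-v4"))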

Total running time of the script: (3 minutes 10.450 seconds)
