Reinforcement Learning (PPO) with TorchRL Tutorial¶
Author: Vincent Moens
This tutorial demonstrates how to use PyTorch and torchrl to train a parametric policy network to solve the Inverted Pendulum task from the OpenAI-Gym/Farama-Gymnasium control library.
Key learnings:
How to create an environment in TorchRL, transform its outputs, and collect data from this environment;
How to make your classes talk to each other using TensorDict;
The basics of building your training loop with TorchRL;
How to compute the advantage signal for policy gradient methods;
How to create a stochastic policy using a probabilistic neural network;
How to create a dynamic replay buffer and sample from it without repetition.
We will cover six crucial components of TorchRL: environments, transforms, models (policy and value function), loss modules, data collectors, and replay buffers.
If you are running this in Google Colab, make sure you install the following dependencies:
!pip3 install torchrl
!pip3 install gym[mujoco]
!pip3 install tqdm
Proximal Policy Optimization (PPO) is a policy-gradient algorithm in which a batch of data is collected and directly consumed to train the policy to maximize the expected return under some proximality constraints. You can think of it as a more sophisticated version of REINFORCE, the foundational policy-optimization algorithm. For more information, see the Proximal Policy Optimization Algorithms paper.
PPO is usually regarded as a fast and efficient method for online, on-policy reinforcement learning. TorchRL provides a loss module that does all the work for you, so that you can rely on this implementation and focus on solving your problem rather than reinventing the wheel every time you want to train a policy.
For completeness, here is a brief overview of what the loss computes, even though this is taken care of by our ClipPPOLoss module. The algorithm works as follows: 1. we will sample a batch of data by playing the policy in the environment for a given number of steps; 2. then, we will perform a given number of optimization steps with random sub-samples of this batch using a clipped version of the REINFORCE loss; 3. the clipping will put a pessimistic bound on our loss: lower return estimates will be favored compared to higher ones. The precise formula of the loss is

L(s, a, \theta_k, \theta) = \min\left( \frac{\pi_\theta(a \mid s)}{\pi_{\theta_k}(a \mid s)} A^{\pi_{\theta_k}}(s, a),\; \operatorname{clip}\left(\frac{\pi_\theta(a \mid s)}{\pi_{\theta_k}(a \mid s)},\, 1-\varepsilon,\, 1+\varepsilon\right) A^{\pi_{\theta_k}}(s, a) \right)

There are two components in that loss: in the first part of the minimum operator, we simply compute an importance-weighted version of the REINFORCE loss (that is, a REINFORCE loss corrected for the fact that the current policy configuration lags behind the one that was used for the data collection). The second part of that minimum operator is a similar loss in which the ratio has been clipped when it exceeds or falls below a given pair of thresholds.
This loss ensures that, whether the advantage is positive or negative, policy updates that would produce a significant shift from the previous configuration are discouraged.
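For intuition, here is a minimal, self-contained sketch of that clipped objective. It is not the tutorial's code: the tensors log_prob, log_prob_old and advantage are hypothetical placeholders for the log-probabilities under the current and data-collection policies and for the advantage estimate; in the tutorial, the ClipPPOLoss module performs this computation for us.

import torch

clip_epsilon = 0.2
log_prob = torch.randn(64)       # log pi_theta(a|s), hypothetical values
log_prob_old = torch.randn(64)   # log pi_theta_k(a|s), hypothetical values
advantage = torch.randn(64)      # A^{pi_theta_k}(s, a), hypothetical values

ratio = (log_prob - log_prob_old).exp()  # importance weight pi_theta / pi_theta_k
surrogate = ratio * advantage
clipped_surrogate = ratio.clamp(1 - clip_epsilon, 1 + clip_epsilon) * advantage
# The objective is maximized, so the corresponding loss is its negation:
loss_objective = -torch.min(surrogate, clipped_surrogate).mean()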
This tutorial is structured as follows:
First, we will define a set of hyperparameters to be used for training.
Next, we will focus on creating our environment, or simulator, using TorchRL's wrappers and transforms.
Next, we will design the policy network and the value model, which is indispensable to the loss function. These modules will be used to configure our loss module.
Next, we will create the replay buffer and data loader.
Finally, we will run our training loop and analyze the results.
Throughout this tutorial, we will be using the tensordict library. TensorDict is the lingua franca of TorchRL: it helps us abstract what a module reads and writes, and care less about the specific data description and more about the algorithm itself.
import warnings
warnings.filterwarnings("ignore")
from torch import multiprocessing
from collections import defaultdict
import matplotlib.pyplot as plt
import torch
from tensordict.nn import TensorDictModule
from tensordict.nn.distributions import NormalParamExtractor
from torch import nn
from torchrl.collectors import SyncDataCollector
from torchrl.data.replay_buffers import ReplayBuffer
from torchrl.data.replay_buffers.samplers import SamplerWithoutReplacement
from torchrl.data.replay_buffers.storages import LazyTensorStorage
from torchrl.envs import (Compose, DoubleToFloat, ObservationNorm, StepCounter,
TransformedEnv)
from torchrl.envs.libs.gym import GymEnv
from torchrl.envs.utils import check_env_specs, ExplorationType, set_exploration_type
from torchrl.modules import ProbabilisticActor, TanhNormal, ValueOperator
from torchrl.objectives import ClipPPOLoss
from torchrl.objectives.value import GAE
from tqdm import tqdm
Define Hyperparameters¶
We set the hyperparameters for our algorithm. Depending on the resources available, one may choose to execute the policy on a GPU or on another device. The frame_skip will control how many frames a single action is executed for. The rest of the arguments that count frames must be corrected for this value (since one environment step will actually return frame_skip frames).
is_fork = multiprocessing.get_start_method() == "fork"
device = (
torch.device(0)
if torch.cuda.is_available() and not is_fork
else torch.device("cpu")
)
num_cells = 256 # number of cells in each layer i.e. output dim.
lr = 3e-4
max_grad_norm = 1.0
Data collection parameters¶
When collecting data, we will be able to choose how big each batch will be by defining a frames_per_batch parameter. We will also define how many frames (such as the number of interactions with the simulator) we will allow ourselves to use. In general, the goal of an RL algorithm is to learn to solve the task as fast as it can in terms of environment interactions: the lower the total_frames, the better.
frames_per_batch = 1000
# For a complete training, bring the number of frames up to 1M
total_frames = 50_000
PPO parameters¶
At each data collection (or batch collection), we will run the optimization over a certain number of epochs, each time consuming the entire data we just acquired in a nested training loop. Here, the sub_batch_size differs from the frames_per_batch above: recall that we are working with a "batch of data" coming from our collector, whose size is defined by frames_per_batch, and that we will further split into smaller sub-batches during the inner training loop. The size of these sub-batches is controlled by sub_batch_size (with the values below, each epoch runs frames_per_batch // sub_batch_size = 1000 // 64 = 15 inner optimization steps).
sub_batch_size = 64 # cardinality of the sub-samples gathered from the current data in the inner loop
num_epochs = 10 # optimization steps per batch of data collected
clip_epsilon = (
0.2 # clip value for PPO loss: see the equation in the intro for more context.
)
gamma = 0.99
lmbda = 0.95
entropy_eps = 1e-4
Define an environment¶
In RL, an environment is usually the way we refer to a simulator or a control system. Various libraries provide simulation environments for reinforcement learning, including Gymnasium (previously OpenAI Gym), DeepMind Control Suite, and many others. As a general library, TorchRL's goal is to provide an interchangeable interface to a large panel of RL simulators, allowing you to easily swap one environment with another. For example, a wrapped gym environment can be created with a few characters:
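The line below sketches that call. The environment name is inferred from the inverted-pendulum task (and the 11-dimensional observation) used throughout this tutorial, and device is the one defined above.

base_env = GymEnv("InvertedDoublePendulum-v4", device=device)  # inverted double pendulum from Gymnasium (MuJoCo)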
There are a few things to notice in this code: first, we created the environment by calling the GymEnv wrapper. If extra keyword arguments are passed, they will be transmitted to the gym.make method, hence covering the most common environment construction commands. Alternatively, one could also directly create a gym environment using gym.make(env_name, **kwargs) and wrap it in a GymWrapper class.
Also note the device argument: for gym, this only controls the device where the input actions and observed states will be stored, but the execution will always be done on CPU. The reason for this is simply that gym does not support on-device execution, unless specified otherwise. For other libraries, we have control over the execution device and, as much as we can, we try to stay consistent in terms of storing and execution backends.
Transforms¶
We will append some transforms to our environment to prepare the data for the policy. In Gym, this is usually achieved via wrappers. TorchRL takes a different approach, more similar to other pytorch domain libraries, through the use of transforms. To add transforms to an environment, one should simply wrap it in a TransformedEnv instance and append the sequence of transforms to it. The transformed environment will inherit the device and meta-data of the wrapped environment, and transform these depending on the sequence of transforms it contains.
Normalization¶
The first transform to encode is a normalization transform. As a rule of thumb, it is preferable to have data that loosely matches a unit Gaussian distribution: to obtain this, we will run a certain number of random steps in the environment and compute the summary statistics of these observations.
We will append two other transforms: the DoubleToFloat transform will convert double-precision entries into single-precision numbers, ready to be read by the policy. The StepCounter transform will be used to count the steps before the environment is terminated. We will use this measure as a supplementary measure of performance.
As we will see later, many of TorchRL's classes rely on TensorDict to communicate. You could think of it as a python dictionary with some extra tensor features. In practice, this means that many modules we will be working with need to be told what key to read (in_keys) and what key to write (out_keys) in the tensordict they will receive. Usually, if out_keys is omitted, it is assumed that the in_keys entries will be updated in-place. For our transforms, the only entry we are interested in is referred to as "observation", and our transform layers will be told to modify this entry and this entry only.
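As a quick illustration of this dictionary-like behaviour, here is a tiny, standalone sketch with made-up values (it is independent of the environment built below):

from tensordict import TensorDict

td = TensorDict({"observation": torch.randn(3, 11)}, batch_size=[3])
print(td["observation"].shape)  # torch.Size([3, 11])
# Entries can be read and written like in a dictionary; modules do this through their in_keys/out_keys.
td["observation"] = td["observation"] * 2.0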
env = TransformedEnv(
base_env,
Compose(
# normalize observations
ObservationNorm(in_keys=["observation"]),
DoubleToFloat(),
StepCounter(),
),
)
As you may have noticed, we have created a normalization layer but we did not set its normalization parameters. To do this, ObservationNorm can automatically gather the summary statistics of our environment:
env.transform[0].init_stats(num_iter=1000, reduce_dim=0, cat_dim=0)
The ObservationNorm transform has now been populated with a location and a scale that will be used to normalize the data.
Let us do a little sanity check on the shape of our summary stats:
print("normalization constant shape:", env.transform[0].loc.shape)
normalization constant shape: torch.Size([11])
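Optionally, one can also check the effect of the transform on a short random rollout. This is a sketch, not part of the original tutorial; the printed statistics will vary, but they should be roughly centered with a scale close to one:

check_td = env.rollout(100)  # random actions; may stop early if the episode terminates
print("normalized obs mean/std:", check_td["observation"].mean().item(), check_td["observation"].std().item())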
An environment is not only defined by its simulator and transforms, but also by a series of metadata that describe what can be expected during its execution. For efficiency purposes, TorchRL is quite stringent when it comes to environment specs, but you can easily check that your environment specs are adequate. In our example, the GymWrapper and the GymEnv that inherits from it already take care of setting the proper specs for your environment, so you should not have to care about this.
Nevertheless, let's look at a concrete example using our transformed environment by inspecting its specs. There are three specs to look at: observation_spec, which defines what is to be expected when executing an action in the environment; reward_spec, which indicates the reward domain; and finally the input_spec (which contains the action_spec), which represents everything the environment requires to execute a single step.
print("observation_spec:", env.observation_spec)
print("reward_spec:", env.reward_spec)
print("input_spec:", env.input_spec)
print("action_spec (as defined by input_spec):", env.action_spec)
observation_spec: CompositeSpec(
observation: UnboundedContinuousTensorSpec(
shape=torch.Size([11]),
space=None,
device=cpu,
dtype=torch.float32,
domain=continuous),
step_count: BoundedTensorSpec(
shape=torch.Size([1]),
space=ContinuousBox(
low=Tensor(shape=torch.Size([1]), device=cpu, dtype=torch.int64, contiguous=True),
high=Tensor(shape=torch.Size([1]), device=cpu, dtype=torch.int64, contiguous=True)),
device=cpu,
dtype=torch.int64,
domain=continuous),
device=cpu,
shape=torch.Size([]))
reward_spec: UnboundedContinuousTensorSpec(
shape=torch.Size([1]),
space=ContinuousBox(
low=Tensor(shape=torch.Size([1]), device=cpu, dtype=torch.float32, contiguous=True),
high=Tensor(shape=torch.Size([1]), device=cpu, dtype=torch.float32, contiguous=True)),
device=cpu,
dtype=torch.float32,
domain=continuous)
input_spec: CompositeSpec(
full_state_spec: CompositeSpec(
step_count: BoundedTensorSpec(
shape=torch.Size([1]),
space=ContinuousBox(
low=Tensor(shape=torch.Size([1]), device=cpu, dtype=torch.int64, contiguous=True),
high=Tensor(shape=torch.Size([1]), device=cpu, dtype=torch.int64, contiguous=True)),
device=cpu,
dtype=torch.int64,
domain=continuous),
device=cpu,
shape=torch.Size([])),
full_action_spec: CompositeSpec(
action: BoundedTensorSpec(
shape=torch.Size([1]),
space=ContinuousBox(
low=Tensor(shape=torch.Size([1]), device=cpu, dtype=torch.float32, contiguous=True),
high=Tensor(shape=torch.Size([1]), device=cpu, dtype=torch.float32, contiguous=True)),
device=cpu,
dtype=torch.float32,
domain=continuous),
device=cpu,
shape=torch.Size([])),
device=cpu,
shape=torch.Size([]))
action_spec (as defined by input_spec): BoundedTensorSpec(
shape=torch.Size([1]),
space=ContinuousBox(
low=Tensor(shape=torch.Size([1]), device=cpu, dtype=torch.float32, contiguous=True),
high=Tensor(shape=torch.Size([1]), device=cpu, dtype=torch.float32, contiguous=True)),
device=cpu,
dtype=torch.float32,
domain=continuous)
The check_env_specs() function runs a small rollout and compares its output against the environment specs. If no error is raised, we can be confident that the specs are properly defined:
check_env_specs(env)
For fun, let's see what a simple random rollout looks like. You can call env.rollout(n_steps) and get an overview of what the environment inputs and outputs look like. Actions will automatically be drawn from the action spec domain, so you don't need to care about designing a random sampler.
Typically, at each step, an RL environment receives an action as input and outputs an observation, a reward and a done state. The observation may be composite, meaning that it could be composed of more than one tensor. This is not a problem for TorchRL, since the whole set of observations is automatically packed in the output TensorDict. After executing a rollout (for example, a sequence of environment steps and random action generations) over a given number of steps, we retrieve a TensorDict instance with a shape that matches this trajectory length.
rollout = env.rollout(3)
print("rollout of three steps:", rollout)
print("Shape of the rollout TensorDict:", rollout.batch_size)
rollout of three steps: TensorDict(
fields={
action: Tensor(shape=torch.Size([3, 1]), device=cpu, dtype=torch.float32, is_shared=False),
done: Tensor(shape=torch.Size([3, 1]), device=cpu, dtype=torch.bool, is_shared=False),
next: TensorDict(
fields={
done: Tensor(shape=torch.Size([3, 1]), device=cpu, dtype=torch.bool, is_shared=False),
observation: Tensor(shape=torch.Size([3, 11]), device=cpu, dtype=torch.float32, is_shared=False),
reward: Tensor(shape=torch.Size([3, 1]), device=cpu, dtype=torch.float32, is_shared=False),
step_count: Tensor(shape=torch.Size([3, 1]), device=cpu, dtype=torch.int64, is_shared=False),
terminated: Tensor(shape=torch.Size([3, 1]), device=cpu, dtype=torch.bool, is_shared=False),
truncated: Tensor(shape=torch.Size([3, 1]), device=cpu, dtype=torch.bool, is_shared=False)},
batch_size=torch.Size([3]),
device=cpu,
is_shared=False),
observation: Tensor(shape=torch.Size([3, 11]), device=cpu, dtype=torch.float32, is_shared=False),
step_count: Tensor(shape=torch.Size([3, 1]), device=cpu, dtype=torch.int64, is_shared=False),
terminated: Tensor(shape=torch.Size([3, 1]), device=cpu, dtype=torch.bool, is_shared=False),
truncated: Tensor(shape=torch.Size([3, 1]), device=cpu, dtype=torch.bool, is_shared=False)},
batch_size=torch.Size([3]),
device=cpu,
is_shared=False)
Shape of the rollout TensorDict: torch.Size([3])
Our rollout data has a shape of torch.Size([3]), which matches the number of steps we ran it for. The "next" entry points to the data coming after the current step. In most cases, the "next" data at time t matches the data at t+1, but this may not be the case if we are using some specific transformations (for example, multi-step).
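To make that relation concrete, here is a small check one could run on the rollout above. It is a sketch rather than part of the tutorial, and it assumes no transform such as multi-step reshuffles the data:

# Within the short rollout above, the root entries at step t+1 equal the "next" entries at step t:
assert (rollout["next", "step_count"][:-1] == rollout["step_count"][1:]).all()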
Policy¶
PPO utilizes a stochastic policy to handle exploration. This means that our neural network will have to output the parameters of a distribution, rather than a single value corresponding to the action taken.
As the data is continuous, we use a Tanh-Normal distribution to respect the action space boundaries. TorchRL provides such a distribution, and the only thing we need to care about is building a neural network that outputs the right number of parameters for the policy to work with (a location, or mean, and a scale).
The only extra difficulty brought up here is to split our output in two equal parts and map the second one to a strictly positive space.
We design the policy in three steps:
1. Define a neural network D_obs -> 2 * D_action. Indeed, our loc (mu) and scale (sigma) both have dimension D_action.
2. Append a NormalParamExtractor to extract a location and a scale (for example, split the input in two equal parts and apply a positive transformation to the scale parameter).
3. Create a probabilistic TensorDictModule that can generate this distribution and sample from it.
actor_net = nn.Sequential(
nn.LazyLinear(num_cells, device=device),
nn.Tanh(),
nn.LazyLinear(num_cells, device=device),
nn.Tanh(),
nn.LazyLinear(num_cells, device=device),
nn.Tanh(),
nn.LazyLinear(2 * env.action_spec.shape[-1], device=device),
NormalParamExtractor(),
)
For the policy to "talk" with the environment through the tensordict data carrier, we wrap the nn.Module in a TensorDictModule. This class will simply ready the in_keys it is provided with and write the outputs in-place at the registered out_keys.
policy_module = TensorDictModule(
actor_net, in_keys=["observation"], out_keys=["loc", "scale"]
)
We now need to build a distribution out of the location and scale of our normal distribution. To do so, we instruct the ProbabilisticActor class to build a TanhNormal out of the location and scale parameters. We also provide the minimum and maximum values of this distribution, which we gather from the environment specs.
The names of the in_keys (and hence the names of the out_keys from the TensorDictModule above) cannot be set to just any value one may like, as the TanhNormal distribution constructor expects the loc and scale keyword arguments. That being said, ProbabilisticActor also accepts in_keys of type Dict[str, str], where each key-value pair indicates which in_key string should be used for each keyword argument that is to be used.
policy_module = ProbabilisticActor(
module=policy_module,
spec=env.action_spec,
in_keys=["loc", "scale"],
distribution_class=TanhNormal,
distribution_kwargs={
"min": env.action_spec.space.low,
"max": env.action_spec.space.high,
},
return_log_prob=True,
# we'll need the log-prob for the numerator of the importance weights
)
Value network¶
The value network is a crucial component of the PPO algorithm, even though it won't be used at inference time. This module will read the observations and return an estimate of the discounted return for the following trajectory. This allows us to amortize learning by relying on a utility estimate that is learned on-the-fly during training. Our value network shares the same structure as the policy, but for simplicity we assign it its own set of parameters.
value_net = nn.Sequential(
nn.LazyLinear(num_cells, device=device),
nn.Tanh(),
nn.LazyLinear(num_cells, device=device),
nn.Tanh(),
nn.LazyLinear(num_cells, device=device),
nn.Tanh(),
nn.LazyLinear(1, device=device),
)
value_module = ValueOperator(
module=value_net,
in_keys=["observation"],
)
Let's try our policy and value modules. As we said earlier, the usage of TensorDictModule makes it possible to directly read the output of the environment to run these modules, as they know what information to read and where to write it:
print("Running policy:", policy_module(env.reset()))
print("Running value:", value_module(env.reset()))
Running policy: TensorDict(
fields={
action: Tensor(shape=torch.Size([1]), device=cpu, dtype=torch.float32, is_shared=False),
done: Tensor(shape=torch.Size([1]), device=cpu, dtype=torch.bool, is_shared=False),
loc: Tensor(shape=torch.Size([1]), device=cpu, dtype=torch.float32, is_shared=False),
observation: Tensor(shape=torch.Size([11]), device=cpu, dtype=torch.float32, is_shared=False),
sample_log_prob: Tensor(shape=torch.Size([]), device=cpu, dtype=torch.float32, is_shared=False),
scale: Tensor(shape=torch.Size([1]), device=cpu, dtype=torch.float32, is_shared=False),
step_count: Tensor(shape=torch.Size([1]), device=cpu, dtype=torch.int64, is_shared=False),
terminated: Tensor(shape=torch.Size([1]), device=cpu, dtype=torch.bool, is_shared=False),
truncated: Tensor(shape=torch.Size([1]), device=cpu, dtype=torch.bool, is_shared=False)},
batch_size=torch.Size([]),
device=cpu,
is_shared=False)
Running value: TensorDict(
fields={
done: Tensor(shape=torch.Size([1]), device=cpu, dtype=torch.bool, is_shared=False),
observation: Tensor(shape=torch.Size([11]), device=cpu, dtype=torch.float32, is_shared=False),
state_value: Tensor(shape=torch.Size([1]), device=cpu, dtype=torch.float32, is_shared=False),
step_count: Tensor(shape=torch.Size([1]), device=cpu, dtype=torch.int64, is_shared=False),
terminated: Tensor(shape=torch.Size([1]), device=cpu, dtype=torch.bool, is_shared=False),
truncated: Tensor(shape=torch.Size([1]), device=cpu, dtype=torch.bool, is_shared=False)},
batch_size=torch.Size([]),
device=cpu,
is_shared=False)
Data collector¶
TorchRL provides a set of DataCollector classes. Briefly, these classes execute three operations: reset an environment, compute an action given the latest observation, execute a step in the environment, and repeat the last two steps until the environment signals a stop (or reaches a done state).
They allow you to control how many frames to collect at each iteration (via the frames_per_batch parameter), when to reset the environment (via the max_frames_per_traj argument), on which device the policy should be executed, and so on. They are also designed to work efficiently with batched and multiprocessed environments.
The simplest data collector is the SyncDataCollector: it is an iterator that you can use to get batches of data of a given length, and that will stop once a total number of frames (total_frames) has been collected. Other data collectors (MultiSyncDataCollector and MultiaSyncDataCollector) execute the same operations in a synchronous and an asynchronous manner over a set of multiprocessed workers.
As with the policy and environment before, the data collector will return TensorDict instances whose total number of elements matches frames_per_batch. Using TensorDict to pass data to the training loop allows you to write data-loading pipelines that are completely oblivious to the actual specificities of the rollout content.
collector = SyncDataCollector(
env,
policy_module,
frames_per_batch=frames_per_batch,
total_frames=total_frames,
split_trajs=False,
device=device,
)
Replay buffer¶
Replay buffers are a common building block of off-policy RL algorithms. In on-policy contexts, a replay buffer is refilled every time a batch of data is collected, and its data is repeatedly consumed for a certain number of epochs.
TorchRL's replay buffers are built using the common container ReplayBuffer, which takes the components of the buffer as arguments: a storage, a writer, a sampler, and possibly some transforms. Only the storage (which indicates the replay buffer capacity) is mandatory. We also specify a sampler without repetition to avoid sampling the same item multiple times in one epoch. Using a replay buffer for PPO is not mandatory and we could simply sample the sub-batches from the collected batch, but using these classes makes it easy for us to build the inner training loop in a reproducible way.
replay_buffer = ReplayBuffer(
storage=LazyTensorStorage(max_size=frames_per_batch),
sampler=SamplerWithoutReplacement(),
)
Loss function¶
The PPO loss can be directly imported from TorchRL for convenience using the ClipPPOLoss class. This is the easiest way of utilizing PPO: it hides away the mathematical operations of PPO and the control flow that goes with it.
PPO requires some "advantage estimation" to be computed. In short, an advantage is a value that reflects an expectancy over the return value while dealing with the bias/variance tradeoff. To compute the advantage, one just needs to (1) build the advantage module, which utilizes our value operator, and (2) pass each batch of data through it before each epoch. The GAE module will update the input tensordict with new "advantage" and "value_target" entries. The "value_target" is a gradient-free tensor that represents the empirical value that the value network should output given the input observation. Both of these will be used by ClipPPOLoss to return the policy and value losses.
advantage_module = GAE(
gamma=gamma, lmbda=lmbda, value_network=value_module, average_gae=True
)
loss_module = ClipPPOLoss(
actor_network=policy_module,
critic_network=value_module,
clip_epsilon=clip_epsilon,
entropy_bonus=bool(entropy_eps),
entropy_coef=entropy_eps,
# these keys match by default but we set this for completeness
critic_coef=1.0,
loss_critic_type="smooth_l1",
)
optim = torch.optim.Adam(loss_module.parameters(), lr)
scheduler = torch.optim.lr_scheduler.CosineAnnealingLR(
optim, total_frames // frames_per_batch, 0.0
)
Training loop¶
We now have all the pieces needed to code our training loop. The steps include:
Collect data
Compute advantage
Loop over the collected data to compute loss values
Back propagate
Optimize
Repeat
Repeat
Repeat
logs = defaultdict(list)
pbar = tqdm(total=total_frames)
eval_str = ""
# We iterate over the collector until it reaches the total number of frames it was
# designed to collect:
for i, tensordict_data in enumerate(collector):
# we now have a batch of data to work with. Let's learn something from it.
for _ in range(num_epochs):
# We'll need an "advantage" signal to make PPO work.
# We re-compute it at each epoch as its value depends on the value
# network which is updated in the inner loop.
advantage_module(tensordict_data)
data_view = tensordict_data.reshape(-1)
replay_buffer.extend(data_view.cpu())
for _ in range(frames_per_batch // sub_batch_size):
subdata = replay_buffer.sample(sub_batch_size)
loss_vals = loss_module(subdata.to(device))
loss_value = (
loss_vals["loss_objective"]
+ loss_vals["loss_critic"]
+ loss_vals["loss_entropy"]
)
# Optimization: backward, grad clipping and optimization step
loss_value.backward()
# this is not strictly mandatory but it's good practice to keep
# your gradient norm bounded
torch.nn.utils.clip_grad_norm_(loss_module.parameters(), max_grad_norm)
optim.step()
optim.zero_grad()
logs["reward"].append(tensordict_data["next", "reward"].mean().item())
pbar.update(tensordict_data.numel())
cum_reward_str = (
f"average reward={logs['reward'][-1]: 4.4f} (init={logs['reward'][0]: 4.4f})"
)
logs["step_count"].append(tensordict_data["step_count"].max().item())
stepcount_str = f"step count (max): {logs['step_count'][-1]}"
logs["lr"].append(optim.param_groups[0]["lr"])
lr_str = f"lr policy: {logs['lr'][-1]: 4.4f}"
if i % 10 == 0:
# We evaluate the policy once every 10 batches of data.
# Evaluation is rather simple: execute the policy without exploration
# (take the expected value of the action distribution) for a given
# number of steps (1000, which is our ``env`` horizon).
# The ``rollout`` method of the ``env`` can take a policy as argument:
# it will then execute this policy at each step.
with set_exploration_type(ExplorationType.MEAN), torch.no_grad():
# execute a rollout with the trained policy
eval_rollout = env.rollout(1000, policy_module)
logs["eval reward"].append(eval_rollout["next", "reward"].mean().item())
logs["eval reward (sum)"].append(
eval_rollout["next", "reward"].sum().item()
)
logs["eval step_count"].append(eval_rollout["step_count"].max().item())
eval_str = (
f"eval cumulative reward: {logs['eval reward (sum)'][-1]: 4.4f} "
f"(init: {logs['eval reward (sum)'][0]: 4.4f}), "
f"eval step-count: {logs['eval step_count'][-1]}"
)
del eval_rollout
pbar.set_description(", ".join([eval_str, cum_reward_str, stepcount_str, lr_str]))
# We're also using a learning rate scheduler. Like the gradient clipping,
# this is a nice-to-have but nothing necessary for PPO to work.
scheduler.step()
0%| | 0/50000 [00:00<?, ?it/s]
2%|2 | 1000/50000 [00:03<03:05, 264.37it/s]
eval cumulative reward: 129.4145 (init: 129.4145), eval step-count: 13, average reward= 9.0782 (init= 9.0782), step count (max): 14, lr policy: 0.0003: 2%|2 | 1000/50000 [00:03<03:05, 264.37it/s]
eval cumulative reward: 129.4145 (init: 129.4145), eval step-count: 13, average reward= 9.0782 (init= 9.0782), step count (max): 14, lr policy: 0.0003: 4%|4 | 2000/50000 [00:07<03:00, 265.70it/s]
eval cumulative reward: 129.4145 (init: 129.4145), eval step-count: 13, average reward= 9.1195 (init= 9.0782), step count (max): 12, lr policy: 0.0003: 4%|4 | 2000/50000 [00:07<03:00, 265.70it/s]
eval cumulative reward: 129.4145 (init: 129.4145), eval step-count: 13, average reward= 9.1195 (init= 9.0782), step count (max): 12, lr policy: 0.0003: 6%|6 | 3000/50000 [00:11<02:55, 268.13it/s]
eval cumulative reward: 129.4145 (init: 129.4145), eval step-count: 13, average reward= 9.1496 (init= 9.0782), step count (max): 14, lr policy: 0.0003: 6%|6 | 3000/50000 [00:11<02:55, 268.13it/s]
eval cumulative reward: 129.4145 (init: 129.4145), eval step-count: 13, average reward= 9.1496 (init= 9.0782), step count (max): 14, lr policy: 0.0003: 8%|8 | 4000/50000 [00:14<02:50, 269.73it/s]
eval cumulative reward: 129.4145 (init: 129.4145), eval step-count: 13, average reward= 9.1796 (init= 9.0782), step count (max): 21, lr policy: 0.0003: 8%|8 | 4000/50000 [00:14<02:50, 269.73it/s]
eval cumulative reward: 129.4145 (init: 129.4145), eval step-count: 13, average reward= 9.1796 (init= 9.0782), step count (max): 21, lr policy: 0.0003: 10%|# | 5000/50000 [00:18<02:46, 270.72it/s]
eval cumulative reward: 129.4145 (init: 129.4145), eval step-count: 13, average reward= 9.2063 (init= 9.0782), step count (max): 22, lr policy: 0.0003: 10%|# | 5000/50000 [00:18<02:46, 270.72it/s]
eval cumulative reward: 129.4145 (init: 129.4145), eval step-count: 13, average reward= 9.2063 (init= 9.0782), step count (max): 22, lr policy: 0.0003: 12%|#2 | 6000/50000 [00:22<02:42, 271.49it/s]
eval cumulative reward: 129.4145 (init: 129.4145), eval step-count: 13, average reward= 9.2230 (init= 9.0782), step count (max): 28, lr policy: 0.0003: 12%|#2 | 6000/50000 [00:22<02:42, 271.49it/s]
eval cumulative reward: 129.4145 (init: 129.4145), eval step-count: 13, average reward= 9.2230 (init= 9.0782), step count (max): 28, lr policy: 0.0003: 14%|#4 | 7000/50000 [00:25<02:37, 272.56it/s]
eval cumulative reward: 129.4145 (init: 129.4145), eval step-count: 13, average reward= 9.2281 (init= 9.0782), step count (max): 27, lr policy: 0.0003: 14%|#4 | 7000/50000 [00:25<02:37, 272.56it/s]
eval cumulative reward: 129.4145 (init: 129.4145), eval step-count: 13, average reward= 9.2281 (init= 9.0782), step count (max): 27, lr policy: 0.0003: 16%|#6 | 8000/50000 [00:29<02:36, 268.78it/s]
eval cumulative reward: 129.4145 (init: 129.4145), eval step-count: 13, average reward= 9.2403 (init= 9.0782), step count (max): 35, lr policy: 0.0003: 16%|#6 | 8000/50000 [00:29<02:36, 268.78it/s]
eval cumulative reward: 129.4145 (init: 129.4145), eval step-count: 13, average reward= 9.2403 (init= 9.0782), step count (max): 35, lr policy: 0.0003: 18%|#8 | 9000/50000 [00:33<02:31, 270.72it/s]
eval cumulative reward: 129.4145 (init: 129.4145), eval step-count: 13, average reward= 9.2558 (init= 9.0782), step count (max): 45, lr policy: 0.0003: 18%|#8 | 9000/50000 [00:33<02:31, 270.72it/s]
eval cumulative reward: 129.4145 (init: 129.4145), eval step-count: 13, average reward= 9.2558 (init= 9.0782), step count (max): 45, lr policy: 0.0003: 20%|## | 10000/50000 [00:36<02:26, 273.02it/s]
eval cumulative reward: 129.4145 (init: 129.4145), eval step-count: 13, average reward= 9.2503 (init= 9.0782), step count (max): 35, lr policy: 0.0003: 20%|## | 10000/50000 [00:36<02:26, 273.02it/s]
eval cumulative reward: 129.4145 (init: 129.4145), eval step-count: 13, average reward= 9.2503 (init= 9.0782), step count (max): 35, lr policy: 0.0003: 22%|##2 | 11000/50000 [00:40<02:21, 274.67it/s]
eval cumulative reward: 249.8351 (init: 129.4145), eval step-count: 26, average reward= 9.2695 (init= 9.0782), step count (max): 50, lr policy: 0.0003: 22%|##2 | 11000/50000 [00:40<02:21, 274.67it/s]
eval cumulative reward: 249.8351 (init: 129.4145), eval step-count: 26, average reward= 9.2695 (init= 9.0782), step count (max): 50, lr policy: 0.0003: 24%|##4 | 12000/50000 [00:44<02:18, 274.38it/s]
eval cumulative reward: 249.8351 (init: 129.4145), eval step-count: 26, average reward= 9.2648 (init= 9.0782), step count (max): 45, lr policy: 0.0003: 24%|##4 | 12000/50000 [00:44<02:18, 274.38it/s]
eval cumulative reward: 249.8351 (init: 129.4145), eval step-count: 26, average reward= 9.2648 (init= 9.0782), step count (max): 45, lr policy: 0.0003: 26%|##6 | 13000/50000 [00:47<02:14, 275.40it/s]
eval cumulative reward: 249.8351 (init: 129.4145), eval step-count: 26, average reward= 9.2593 (init= 9.0782), step count (max): 41, lr policy: 0.0003: 26%|##6 | 13000/50000 [00:47<02:14, 275.40it/s]
eval cumulative reward: 249.8351 (init: 129.4145), eval step-count: 26, average reward= 9.2593 (init= 9.0782), step count (max): 41, lr policy: 0.0003: 28%|##8 | 14000/50000 [00:51<02:10, 276.41it/s]
eval cumulative reward: 249.8351 (init: 129.4145), eval step-count: 26, average reward= 9.2575 (init= 9.0782), step count (max): 41, lr policy: 0.0003: 28%|##8 | 14000/50000 [00:51<02:10, 276.41it/s]
eval cumulative reward: 249.8351 (init: 129.4145), eval step-count: 26, average reward= 9.2575 (init= 9.0782), step count (max): 41, lr policy: 0.0003: 30%|### | 15000/50000 [00:54<02:06, 277.47it/s]
eval cumulative reward: 249.8351 (init: 129.4145), eval step-count: 26, average reward= 9.2722 (init= 9.0782), step count (max): 78, lr policy: 0.0002: 30%|### | 15000/50000 [00:54<02:06, 277.47it/s]
eval cumulative reward: 249.8351 (init: 129.4145), eval step-count: 26, average reward= 9.2722 (init= 9.0782), step count (max): 78, lr policy: 0.0002: 32%|###2 | 16000/50000 [00:58<02:02, 277.90it/s]
eval cumulative reward: 249.8351 (init: 129.4145), eval step-count: 26, average reward= 9.2703 (init= 9.0782), step count (max): 63, lr policy: 0.0002: 32%|###2 | 16000/50000 [00:58<02:02, 277.90it/s]
eval cumulative reward: 249.8351 (init: 129.4145), eval step-count: 26, average reward= 9.2703 (init= 9.0782), step count (max): 63, lr policy: 0.0002: 34%|###4 | 17000/50000 [01:02<02:00, 273.91it/s]
eval cumulative reward: 249.8351 (init: 129.4145), eval step-count: 26, average reward= 9.2840 (init= 9.0782), step count (max): 86, lr policy: 0.0002: 34%|###4 | 17000/50000 [01:02<02:00, 273.91it/s]
eval cumulative reward: 249.8351 (init: 129.4145), eval step-count: 26, average reward= 9.2840 (init= 9.0782), step count (max): 86, lr policy: 0.0002: 36%|###6 | 18000/50000 [01:05<01:56, 275.43it/s]
eval cumulative reward: 249.8351 (init: 129.4145), eval step-count: 26, average reward= 9.2718 (init= 9.0782), step count (max): 48, lr policy: 0.0002: 36%|###6 | 18000/50000 [01:05<01:56, 275.43it/s]
eval cumulative reward: 249.8351 (init: 129.4145), eval step-count: 26, average reward= 9.2718 (init= 9.0782), step count (max): 48, lr policy: 0.0002: 38%|###8 | 19000/50000 [01:09<01:52, 276.77it/s]
eval cumulative reward: 249.8351 (init: 129.4145), eval step-count: 26, average reward= 9.2883 (init= 9.0782), step count (max): 67, lr policy: 0.0002: 38%|###8 | 19000/50000 [01:09<01:52, 276.77it/s]
eval cumulative reward: 249.8351 (init: 129.4145), eval step-count: 26, average reward= 9.2883 (init= 9.0782), step count (max): 67, lr policy: 0.0002: 40%|#### | 20000/50000 [01:13<01:48, 277.67it/s]
eval cumulative reward: 249.8351 (init: 129.4145), eval step-count: 26, average reward= 9.2912 (init= 9.0782), step count (max): 61, lr policy: 0.0002: 40%|#### | 20000/50000 [01:13<01:48, 277.67it/s]
eval cumulative reward: 249.8351 (init: 129.4145), eval step-count: 26, average reward= 9.2912 (init= 9.0782), step count (max): 61, lr policy: 0.0002: 42%|####2 | 21000/50000 [01:16<01:44, 278.36it/s]
eval cumulative reward: 484.9409 (init: 129.4145), eval step-count: 51, average reward= 9.2908 (init= 9.0782), step count (max): 68, lr policy: 0.0002: 42%|####2 | 21000/50000 [01:16<01:44, 278.36it/s]
eval cumulative reward: 484.9409 (init: 129.4145), eval step-count: 51, average reward= 9.2908 (init= 9.0782), step count (max): 68, lr policy: 0.0002: 44%|####4 | 22000/50000 [01:20<01:41, 276.21it/s]
eval cumulative reward: 484.9409 (init: 129.4145), eval step-count: 51, average reward= 9.3026 (init= 9.0782), step count (max): 96, lr policy: 0.0002: 44%|####4 | 22000/50000 [01:20<01:41, 276.21it/s]
eval cumulative reward: 484.9409 (init: 129.4145), eval step-count: 51, average reward= 9.3026 (init= 9.0782), step count (max): 96, lr policy: 0.0002: 46%|####6 | 23000/50000 [01:23<01:37, 277.32it/s]
eval cumulative reward: 484.9409 (init: 129.4145), eval step-count: 51, average reward= 9.2973 (init= 9.0782), step count (max): 94, lr policy: 0.0002: 46%|####6 | 23000/50000 [01:23<01:37, 277.32it/s]
eval cumulative reward: 484.9409 (init: 129.4145), eval step-count: 51, average reward= 9.2973 (init= 9.0782), step count (max): 94, lr policy: 0.0002: 48%|####8 | 24000/50000 [01:27<01:33, 278.18it/s]
eval cumulative reward: 484.9409 (init: 129.4145), eval step-count: 51, average reward= 9.3012 (init= 9.0782), step count (max): 90, lr policy: 0.0002: 48%|####8 | 24000/50000 [01:27<01:33, 278.18it/s]
eval cumulative reward: 484.9409 (init: 129.4145), eval step-count: 51, average reward= 9.3012 (init= 9.0782), step count (max): 90, lr policy: 0.0002: 50%|##### | 25000/50000 [01:30<01:29, 278.98it/s]
eval cumulative reward: 484.9409 (init: 129.4145), eval step-count: 51, average reward= 9.2977 (init= 9.0782), step count (max): 67, lr policy: 0.0002: 50%|##### | 25000/50000 [01:30<01:29, 278.98it/s]
eval cumulative reward: 484.9409 (init: 129.4145), eval step-count: 51, average reward= 9.2977 (init= 9.0782), step count (max): 67, lr policy: 0.0002: 52%|#####2 | 26000/50000 [01:34<01:27, 274.85it/s]
eval cumulative reward: 484.9409 (init: 129.4145), eval step-count: 51, average reward= 9.3041 (init= 9.0782), step count (max): 119, lr policy: 0.0001: 52%|#####2 | 26000/50000 [01:34<01:27, 274.85it/s]
eval cumulative reward: 484.9409 (init: 129.4145), eval step-count: 51, average reward= 9.3041 (init= 9.0782), step count (max): 119, lr policy: 0.0001: 54%|#####4 | 27000/50000 [01:38<01:23, 276.58it/s]
eval cumulative reward: 484.9409 (init: 129.4145), eval step-count: 51, average reward= 9.3032 (init= 9.0782), step count (max): 104, lr policy: 0.0001: 54%|#####4 | 27000/50000 [01:38<01:23, 276.58it/s]
eval cumulative reward: 484.9409 (init: 129.4145), eval step-count: 51, average reward= 9.3032 (init= 9.0782), step count (max): 104, lr policy: 0.0001: 56%|#####6 | 28000/50000 [01:41<01:19, 277.73it/s]
eval cumulative reward: 484.9409 (init: 129.4145), eval step-count: 51, average reward= 9.3076 (init= 9.0782), step count (max): 103, lr policy: 0.0001: 56%|#####6 | 28000/50000 [01:41<01:19, 277.73it/s]
eval cumulative reward: 484.9409 (init: 129.4145), eval step-count: 51, average reward= 9.3076 (init= 9.0782), step count (max): 103, lr policy: 0.0001: 58%|#####8 | 29000/50000 [01:45<01:15, 278.20it/s]
eval cumulative reward: 484.9409 (init: 129.4145), eval step-count: 51, average reward= 9.2944 (init= 9.0782), step count (max): 103, lr policy: 0.0001: 58%|#####8 | 29000/50000 [01:45<01:15, 278.20it/s]
eval cumulative reward: 484.9409 (init: 129.4145), eval step-count: 51, average reward= 9.2944 (init= 9.0782), step count (max): 103, lr policy: 0.0001: 60%|###### | 30000/50000 [01:49<01:11, 278.86it/s]
eval cumulative reward: 484.9409 (init: 129.4145), eval step-count: 51, average reward= 9.3060 (init= 9.0782), step count (max): 81, lr policy: 0.0001: 60%|###### | 30000/50000 [01:49<01:11, 278.86it/s]
eval cumulative reward: 484.9409 (init: 129.4145), eval step-count: 51, average reward= 9.3060 (init= 9.0782), step count (max): 81, lr policy: 0.0001: 62%|######2 | 31000/50000 [01:52<01:08, 279.27it/s]
eval cumulative reward: 624.3900 (init: 129.4145), eval step-count: 66, average reward= 9.3064 (init= 9.0782), step count (max): 90, lr policy: 0.0001: 62%|######2 | 31000/50000 [01:52<01:08, 279.27it/s]
eval cumulative reward: 624.3900 (init: 129.4145), eval step-count: 66, average reward= 9.3064 (init= 9.0782), step count (max): 90, lr policy: 0.0001: 64%|######4 | 32000/50000 [01:56<01:05, 276.29it/s]
eval cumulative reward: 624.3900 (init: 129.4145), eval step-count: 66, average reward= 9.3080 (init= 9.0782), step count (max): 143, lr policy: 0.0001: 64%|######4 | 32000/50000 [01:56<01:05, 276.29it/s]
eval cumulative reward: 624.3900 (init: 129.4145), eval step-count: 66, average reward= 9.3080 (init= 9.0782), step count (max): 143, lr policy: 0.0001: 66%|######6 | 33000/50000 [01:59<01:01, 277.15it/s]
eval cumulative reward: 624.3900 (init: 129.4145), eval step-count: 66, average reward= 9.2948 (init= 9.0782), step count (max): 76, lr policy: 0.0001: 66%|######6 | 33000/50000 [01:59<01:01, 277.15it/s]
eval cumulative reward: 624.3900 (init: 129.4145), eval step-count: 66, average reward= 9.2948 (init= 9.0782), step count (max): 76, lr policy: 0.0001: 68%|######8 | 34000/50000 [02:03<00:57, 277.97it/s]
eval cumulative reward: 624.3900 (init: 129.4145), eval step-count: 66, average reward= 9.2974 (init= 9.0782), step count (max): 96, lr policy: 0.0001: 68%|######8 | 34000/50000 [02:03<00:57, 277.97it/s]
eval cumulative reward: 624.3900 (init: 129.4145), eval step-count: 66, average reward= 9.2974 (init= 9.0782), step count (max): 96, lr policy: 0.0001: 70%|####### | 35000/50000 [02:07<00:53, 278.83it/s]
eval cumulative reward: 624.3900 (init: 129.4145), eval step-count: 66, average reward= 9.3052 (init= 9.0782), step count (max): 75, lr policy: 0.0001: 70%|####### | 35000/50000 [02:07<00:53, 278.83it/s]
eval cumulative reward: 624.3900 (init: 129.4145), eval step-count: 66, average reward= 9.3052 (init= 9.0782), step count (max): 75, lr policy: 0.0001: 72%|#######2 | 36000/50000 [02:10<00:50, 274.76it/s]
eval cumulative reward: 624.3900 (init: 129.4145), eval step-count: 66, average reward= 9.3087 (init= 9.0782), step count (max): 126, lr policy: 0.0001: 72%|#######2 | 36000/50000 [02:10<00:50, 274.76it/s]
eval cumulative reward: 624.3900 (init: 129.4145), eval step-count: 66, average reward= 9.3087 (init= 9.0782), step count (max): 126, lr policy: 0.0001: 74%|#######4 | 37000/50000 [02:14<00:47, 275.96it/s]
eval cumulative reward: 624.3900 (init: 129.4145), eval step-count: 66, average reward= 9.2989 (init= 9.0782), step count (max): 100, lr policy: 0.0001: 74%|#######4 | 37000/50000 [02:14<00:47, 275.96it/s]
eval cumulative reward: 624.3900 (init: 129.4145), eval step-count: 66, average reward= 9.2989 (init= 9.0782), step count (max): 100, lr policy: 0.0001: 76%|#######6 | 38000/50000 [02:17<00:43, 276.95it/s]
eval cumulative reward: 624.3900 (init: 129.4145), eval step-count: 66, average reward= 9.3150 (init= 9.0782), step count (max): 120, lr policy: 0.0000: 76%|#######6 | 38000/50000 [02:17<00:43, 276.95it/s]
eval cumulative reward: 624.3900 (init: 129.4145), eval step-count: 66, average reward= 9.3150 (init= 9.0782), step count (max): 120, lr policy: 0.0000: 78%|#######8 | 39000/50000 [02:21<00:39, 277.69it/s]
eval cumulative reward: 624.3900 (init: 129.4145), eval step-count: 66, average reward= 9.3061 (init= 9.0782), step count (max): 131, lr policy: 0.0000: 78%|#######8 | 39000/50000 [02:21<00:39, 277.69it/s]
eval cumulative reward: 624.3900 (init: 129.4145), eval step-count: 66, average reward= 9.3061 (init= 9.0782), step count (max): 131, lr policy: 0.0000: 80%|######## | 40000/50000 [02:25<00:35, 278.15it/s]
eval cumulative reward: 624.3900 (init: 129.4145), eval step-count: 66, average reward= 9.3043 (init= 9.0782), step count (max): 156, lr policy: 0.0000: 80%|######## | 40000/50000 [02:25<00:35, 278.15it/s]
eval cumulative reward: 624.3900 (init: 129.4145), eval step-count: 66, average reward= 9.3043 (init= 9.0782), step count (max): 156, lr policy: 0.0000: 82%|########2 | 41000/50000 [02:28<00:32, 278.59it/s]
eval cumulative reward: 296.9429 (init: 129.4145), eval step-count: 31, average reward= 9.3127 (init= 9.0782), step count (max): 106, lr policy: 0.0000: 82%|########2 | 41000/50000 [02:28<00:32, 278.59it/s]
eval cumulative reward: 296.9429 (init: 129.4145), eval step-count: 31, average reward= 9.3127 (init= 9.0782), step count (max): 106, lr policy: 0.0000: 84%|########4 | 42000/50000 [02:32<00:28, 277.40it/s]
eval cumulative reward: 296.9429 (init: 129.4145), eval step-count: 31, average reward= 9.3016 (init= 9.0782), step count (max): 103, lr policy: 0.0000: 84%|########4 | 42000/50000 [02:32<00:28, 277.40it/s]
eval cumulative reward: 296.9429 (init: 129.4145), eval step-count: 31, average reward= 9.3016 (init= 9.0782), step count (max): 103, lr policy: 0.0000: 86%|########6 | 43000/50000 [02:35<00:25, 278.12it/s]
eval cumulative reward: 296.9429 (init: 129.4145), eval step-count: 31, average reward= 9.2988 (init= 9.0782), step count (max): 102, lr policy: 0.0000: 86%|########6 | 43000/50000 [02:35<00:25, 278.12it/s]
eval cumulative reward: 296.9429 (init: 129.4145), eval step-count: 31, average reward= 9.2988 (init= 9.0782), step count (max): 102, lr policy: 0.0000: 88%|########8 | 44000/50000 [02:39<00:21, 274.47it/s]
eval cumulative reward: 296.9429 (init: 129.4145), eval step-count: 31, average reward= 9.3090 (init= 9.0782), step count (max): 148, lr policy: 0.0000: 88%|########8 | 44000/50000 [02:39<00:21, 274.47it/s]
eval cumulative reward: 296.9429 (init: 129.4145), eval step-count: 31, average reward= 9.3090 (init= 9.0782), step count (max): 148, lr policy: 0.0000: 90%|######### | 45000/50000 [02:43<00:18, 275.78it/s]
eval cumulative reward: 296.9429 (init: 129.4145), eval step-count: 31, average reward= 9.3001 (init= 9.0782), step count (max): 100, lr policy: 0.0000: 90%|######### | 45000/50000 [02:43<00:18, 275.78it/s]
eval cumulative reward: 296.9429 (init: 129.4145), eval step-count: 31, average reward= 9.3001 (init= 9.0782), step count (max): 100, lr policy: 0.0000: 92%|#########2| 46000/50000 [02:46<00:14, 276.92it/s]
eval cumulative reward: 296.9429 (init: 129.4145), eval step-count: 31, average reward= 9.2957 (init= 9.0782), step count (max): 60, lr policy: 0.0000: 92%|#########2| 46000/50000 [02:46<00:14, 276.92it/s]
eval cumulative reward: 296.9429 (init: 129.4145), eval step-count: 31, average reward= 9.2957 (init= 9.0782), step count (max): 60, lr policy: 0.0000: 94%|#########3| 47000/50000 [02:50<00:10, 278.08it/s]
eval cumulative reward: 296.9429 (init: 129.4145), eval step-count: 31, average reward= 9.3069 (init= 9.0782), step count (max): 103, lr policy: 0.0000: 94%|#########3| 47000/50000 [02:50<00:10, 278.08it/s]
eval cumulative reward: 296.9429 (init: 129.4145), eval step-count: 31, average reward= 9.3069 (init= 9.0782), step count (max): 103, lr policy: 0.0000: 96%|#########6| 48000/50000 [02:53<00:07, 278.78it/s]
eval cumulative reward: 296.9429 (init: 129.4145), eval step-count: 31, average reward= 9.3046 (init= 9.0782), step count (max): 146, lr policy: 0.0000: 96%|#########6| 48000/50000 [02:53<00:07, 278.78it/s]
eval cumulative reward: 296.9429 (init: 129.4145), eval step-count: 31, average reward= 9.3046 (init= 9.0782), step count (max): 146, lr policy: 0.0000: 98%|#########8| 49000/50000 [02:57<00:03, 279.50it/s]
eval cumulative reward: 296.9429 (init: 129.4145), eval step-count: 31, average reward= 9.3119 (init= 9.0782), step count (max): 95, lr policy: 0.0000: 98%|#########8| 49000/50000 [02:57<00:03, 279.50it/s]
eval cumulative reward: 296.9429 (init: 129.4145), eval step-count: 31, average reward= 9.3119 (init= 9.0782), step count (max): 95, lr policy: 0.0000: 100%|##########| 50000/50000 [03:01<00:00, 279.88it/s]
eval cumulative reward: 296.9429 (init: 129.4145), eval step-count: 31, average reward= 9.3130 (init= 9.0782), step count (max): 100, lr policy: 0.0000: 100%|##########| 50000/50000 [03:01<00:00, 279.88it/s]
Results¶
Before the 1M step cap is reached, the algorithm should have reached a max step count of 1000 steps, which is the maximum number of steps before a trajectory is truncated.
plt.figure(figsize=(10, 10))
plt.subplot(2, 2, 1)
plt.plot(logs["reward"])
plt.title("training rewards (average)")
plt.subplot(2, 2, 2)
plt.plot(logs["step_count"])
plt.title("Max step count (training)")
plt.subplot(2, 2, 3)
plt.plot(logs["eval reward (sum)"])
plt.title("Return (test)")
plt.subplot(2, 2, 4)
plt.plot(logs["eval step_count"])
plt.title("Max step count (test)")
plt.show()
Conclusion and next steps¶
In this tutorial, we have learned:
How to create and customize an environment with torchrl;
How to write a model and a loss function;
How to set up a typical training loop.
If you want to experiment more with this tutorial, you can apply the following modifications:
From an efficiency perspective, we could run several simulations in parallel to speed up data collection. Check ParallelEnv for more information (see the sketch after this list).
From a logging perspective, one could add a torchrl.record.VideoRecorder transform to the environment after asking for rendering, to get a visual rendering of the inverted pendulum in action. Check torchrl.record to know more.
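As a starting point for the first suggestion, a parallel environment could be created along these lines. This is a sketch only (assuming 4 workers) and is not used elsewhere in this tutorial:

from torchrl.envs import ParallelEnv

# Run 4 copies of the base environment in separate processes; the result can then be
# wrapped in a TransformedEnv just like the single environment above.
parallel_base_env = ParallelEnv(4, lambda: GymEnv("InvertedDoublePendulum-v4"))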
Total running time of the script: (3 minutes 2.677 seconds)