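Pendulum: Writing your environment and transforms with TorchRL
Author: Vincent Moens
Creating an environment (a simulator or an interface to a physical control system) is an integral part of reinforcement learning and control engineering.
TorchRL provides a set of tools to do this in multiple contexts. This tutorial demonstrates how to use PyTorch and TorchRL to code a pendulum simulator from the ground up. It is freely inspired by the Pendulum-v1 implementation from the OpenAI-Gym/Farama-Gymnasium control library.
Key learnings:
- How to design an environment in TorchRL:
  - writing specs (input, observation and reward);
  - implementing behavior: seeding, reset and step.
- Transforming your environment inputs and outputs, and writing your own transforms;
- How to use TensorDict to carry arbitrary data structures through the codebase.
In the process, we will touch three crucial components of TorchRL: environments, transforms and models.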
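To give a sense of what can be achieved with TorchRL's environments, we will design a stateless environment. While stateful environments keep track of the latest physical state encountered and rely on it to simulate the state-to-state transition, stateless environments expect the current state to be provided at each step, along with the action taken. TorchRL supports both types of environments, but stateless environments are more generic and hence cover a broader range of features of the environment API in TorchRL.
Modeling stateless environments gives users full control over the inputs and outputs of the simulator: one can reset an experiment at any stage or actively modify the dynamics from the outside. However, it assumes that we have some control over the task, which may not always be the case: solving a problem where we cannot control the current state is more challenging but has a much wider range of applications.
Another advantage of stateless environments is that they can enable batched execution of transition simulations. If the backend and the implementation allow it, an algebraic operation can be executed seamlessly on scalars, vectors or tensors. This tutorial gives such examples.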
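This tutorial is structured as follows:
- We will first get acquainted with the environment properties: its shape (batch_size), its methods (mainly step(), reset() and set_seed()) and finally its specs.
- After having coded our simulator, we will demonstrate how it can be used during training with transforms.
- We will explore new avenues that follow from TorchRL's API, including: the possibility of transforming inputs, the vectorized execution of the simulation and the possibility of backpropagating through the simulation graph.
- Finally, we will train a simple policy to solve the system we implemented.
A built-in version of this environment can be found in the class torchrl.envs.PendulumEnv.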
from collections import defaultdict
from typing import Optional
import numpy as np
import torch
import tqdm
from tensordict import TensorDict, TensorDictBase
from tensordict.nn import TensorDictModule
from torch import nn
from torchrl.data import BoundedTensorSpec, CompositeSpec, UnboundedContinuousTensorSpec
from torchrl.envs import (
    CatTensors,
    EnvBase,
    Transform,
    TransformedEnv,
    UnsqueezeTransform,
)
from torchrl.envs.transforms.transforms import _apply_to_composite
from torchrl.envs.utils import check_env_specs, step_mdp
DEFAULT_X = np.pi
DEFAULT_Y = 1.0
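When designing a new environment class, there are four things to take care of:
- EnvBase._reset(), which codes for the resetting of the simulator at a (potentially random) initial state;
- EnvBase._step(), which codes for the state transition dynamics;
- EnvBase._set_seed(), which implements the seeding mechanism;
- the environment specs.
Let us first describe the problem at hand: we would like to model a simple pendulum over which we can control the torque applied at its fixed point. Our goal is to bring the pendulum to the upward position (angular position 0 by convention) and have it stand still there. To design our dynamic system, we need to define two equations: the motion equation following an action (the applied torque) and the reward equation that constitutes our objective function.
For the motion equation, we update the angular velocity following:
\[\dot{\theta}_{t+1} = \dot{\theta}_t + \left(\frac{3 g}{2 L} \sin(\theta_t) + \frac{3}{m L^2} u\right) dt\]
where \(\dot{\theta}\) is the angular velocity in rad/sec, \(g\) is the gravitational force, \(L\) is the pendulum length, \(m\) is its mass, \(\theta\) is its angular position and \(u\) is the torque. The angular position is then updated according to:
\[\theta_{t+1} = \theta_t + \dot{\theta}_{t+1} \, dt\]
We define our reward as:
\[r = -\left(\theta^2 + 0.1 \, \dot{\theta}^2 + 0.001 \, u^2\right)\]
which is maximized when the angle is close to 0 (pendulum in upward position), the angular velocity is close to 0 (no motion) and the torque is 0 too.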
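As a rough, framework-free illustration, the update rule and reward can be sketched with nothing but Python's math module. The function name pendulum_step and its default parameter values are assumptions made for this example (they mirror the defaults used later in the tutorial's gen_params()); it is not part of TorchRL. As in the tutorial's _step() code, the angle is wrapped to [-pi, pi) before the cost is computed.

```python
import math


def pendulum_step(th, thdot, u, g=10.0, m=1.0, length=1.0, dt=0.05,
                  max_torque=2.0, max_speed=8.0):
    """One Euler step of the pendulum; returns (new_th, new_thdot, reward)."""
    # Clip the torque to the admissible range.
    u = max(-max_torque, min(max_torque, u))
    # Wrap the angle to [-pi, pi) before computing the cost.
    th_wrapped = ((th + math.pi) % (2 * math.pi)) - math.pi
    cost = th_wrapped ** 2 + 0.1 * thdot ** 2 + 0.001 * u ** 2
    # Angular velocity update, then clip to the speed limit.
    new_thdot = thdot + (3 * g / (2 * length) * math.sin(th)
                         + 3.0 / (m * length ** 2) * u) * dt
    new_thdot = max(-max_speed, min(max_speed, new_thdot))
    # The angle update uses the *new* velocity.
    new_th = th + new_thdot * dt
    return new_th, new_thdot, -cost


# Upright, at rest, no torque: the state does not move and the reward is 0.
print(pendulum_step(0.0, 0.0, 0.0))
# Close to the downward position: the reward is strongly negative.
print(pendulum_step(3.0, 0.0, 0.0))
```

The reward is evaluated on the state and action before the update, which is the same ordering used in the tutorial's _step() implementation.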
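Coding the effect of an action: _step()
The step method is the first thing to consider, as it encodes the simulation that is of interest to us. In TorchRL, the EnvBase class has an EnvBase.step() method that receives a tensordict.TensorDict instance with an "action" entry indicating what action is to be taken.
To facilitate reading from and writing to that tensordict, and to make sure that the keys are consistent with what the library expects, the simulation part has been delegated to a private abstract method _step(), which reads input data from a tensordict and writes a new tensordict with the output data.
The _step() method should do the following:
- Read the input keys (such as "action") and execute the simulation based on these;
- Retrieve observations, done state and reward;
- Write the set of observation values along with the reward and done state at the corresponding entries in a new TensorDict.
Next, the step() method merges the output of _step() into the input tensordict to enforce input/output consistency.
Typically, for stateful environments, this will look like this: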
>>> tensordict = policy(env.reset())
>>> print(tensordict)
TensorDict(
fields={
action: Tensor(shape=torch.Size([1]), device=cpu, dtype=torch.float32, is_shared=False),
done: Tensor(shape=torch.Size([1]), device=cpu, dtype=torch.bool, is_shared=False),
observation: Tensor(shape=torch.Size([]), device=cpu, dtype=torch.float32, is_shared=False)},
batch_size=torch.Size([]),
device=cpu,
is_shared=False)
>>> env.step(tensordict)
>>> print(tensordict)
TensorDict(
fields={
action: Tensor(shape=torch.Size([1]), device=cpu, dtype=torch.float32, is_shared=False),
done: Tensor(shape=torch.Size([1]), device=cpu, dtype=torch.bool, is_shared=False),
next: TensorDict(
fields={
done: Tensor(shape=torch.Size([1]), device=cpu, dtype=torch.bool, is_shared=False),
observation: Tensor(shape=torch.Size([]), device=cpu, dtype=torch.float32, is_shared=False),
reward: Tensor(shape=torch.Size([1]), device=cpu, dtype=torch.float32, is_shared=False)},
batch_size=torch.Size([]),
device=cpu,
is_shared=False),
observation: Tensor(shape=torch.Size([]), device=cpu, dtype=torch.float32, is_shared=False)},
batch_size=torch.Size([]),
device=cpu,
is_shared=False)
Notice that the root tensordict has not changed; the only modification is the appearance of a new "next" entry that contains the new information.
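The merge behaviour can be mimicked with plain Python dictionaries. In this sketch, toy_private_step and toy_step are hypothetical stand-ins for _step() and step(); they only illustrate how the output ends up nested under "next" and are not TorchRL code.

```python
def toy_private_step(data):
    """Hypothetical stand-in for ``_step()``: reads inputs, returns fresh outputs."""
    new_obs = data["observation"] + data["action"]
    return {"observation": new_obs, "reward": -abs(new_obs), "done": False}


def toy_step(data):
    """Hypothetical stand-in for ``step()``: nests the outputs under "next"."""
    merged = dict(data)  # root entries are kept as they were
    merged["next"] = toy_private_step(data)
    return merged


root = {"observation": 1.0, "action": 0.5, "done": False}
out = toy_step(root)
print(sorted(out))                 # ['action', 'done', 'next', 'observation']
print(out["next"]["observation"])  # 1.5
```

The root "observation" still holds the value the action was computed from, while the reward only exists under "next".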
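In the Pendulum example, our _step() method reads the relevant entries from the input tensordict and computes the position and velocity of the pendulum after the force encoded by the "action" key has been applied to it. We compute the new angular position of the pendulum "new_th" as the previous position "th" plus the new velocity "new_thdot" multiplied by the time interval dt.
Since our goal is to turn the pendulum up and keep it still in that position, our cost (negative reward) function is lower for positions close to the target and for low speeds. Indeed, we want to discourage positions that are far from "upward" and/or speeds that are far from 0.
In our example, EnvBase._step() is encoded as a static method since our environment is stateless. In stateful settings, the self argument is needed because the state has to be read from the environment.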
def _step(tensordict):
    th, thdot = tensordict["th"], tensordict["thdot"]  # th := theta

    g_force = tensordict["params", "g"]
    mass = tensordict["params", "m"]
    length = tensordict["params", "l"]
    dt = tensordict["params", "dt"]
    u = tensordict["action"].squeeze(-1)
    u = u.clamp(-tensordict["params", "max_torque"], tensordict["params", "max_torque"])
    costs = angle_normalize(th) ** 2 + 0.1 * thdot**2 + 0.001 * (u**2)

    new_thdot = (
        thdot
        + (3 * g_force / (2 * length) * th.sin() + 3.0 / (mass * length**2) * u) * dt
    )
    new_thdot = new_thdot.clamp(
        -tensordict["params", "max_speed"], tensordict["params", "max_speed"]
    )
    new_th = th + new_thdot * dt
    reward = -costs.view(*tensordict.shape, 1)
    done = torch.zeros_like(reward, dtype=torch.bool)
    out = TensorDict(
        {
            "th": new_th,
            "thdot": new_thdot,
            "params": tensordict["params"],
            "reward": reward,
            "done": done,
        },
        tensordict.shape,
    )
    return out


def angle_normalize(x):
    return ((x + torch.pi) % (2 * torch.pi)) - torch.pi
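Resetting the simulator: _reset()
The second method we need to care about is the _reset() method. Like _step(), it should write the observation entries and possibly a done state in the tensordict it outputs (if the done state is omitted, it will be filled as False by the parent method reset()). In some contexts, the _reset method needs to receive a command from the function that called it (for example, in multi-agent settings we may want to indicate which agents need to be reset). This is why the _reset() method also expects a tensordict as input, although it may perfectly well be empty or None.
The parent EnvBase.reset() performs some simple checks, as EnvBase.step() does, such as making sure that a "done" state is returned in the output tensordict and that the shapes match what is expected from the specs.
For us, the only important thing to consider is whether the output of EnvBase._reset() contains all the expected observations. Once more, since we are working with a stateless environment, we pass the configuration of the pendulum in a nested tensordict named "params".
In this example, we do not pass a done state, as this is not mandatory for _reset() and our environment is non-terminating, so we always expect it to be False.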
def _reset(self, tensordict):
    if tensordict is None or tensordict.is_empty():
        # if no ``tensordict`` is passed, we generate a single set of hyperparameters
        # Otherwise, we assume that the input ``tensordict`` contains all the relevant
        # parameters to get started.
        tensordict = self.gen_params(batch_size=self.batch_size)

    high_th = torch.tensor(DEFAULT_X, device=self.device)
    high_thdot = torch.tensor(DEFAULT_Y, device=self.device)
    low_th = -high_th
    low_thdot = -high_thdot

    # for non batch-locked environments, the input ``tensordict`` shape dictates the number
    # of simulators run simultaneously. In other contexts, the initial
    # random state's shape will depend upon the environment batch-size instead.
    th = (
        torch.rand(tensordict.shape, generator=self.rng, device=self.device)
        * (high_th - low_th)
        + low_th
    )
    thdot = (
        torch.rand(tensordict.shape, generator=self.rng, device=self.device)
        * (high_thdot - low_thdot)
        + low_thdot
    )
    out = TensorDict(
        {
            "th": th,
            "thdot": thdot,
            "params": tensordict["params"],
        },
        batch_size=tensordict.shape,
    )
    return out
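Environment metadata: env.*_spec
The specs define the input and output domain of the environment. It is important that the specs accurately describe the tensors that will be received at runtime, as they are often used to carry information about environments in multiprocessing and distributed settings. They can also be used to instantiate lazily defined neural networks and test scripts without actually querying the environment (which can be costly with real-world physical systems, for instance).
There are four specs that we must code in our environment:
- EnvBase.observation_spec: a CompositeSpec instance where each key is an observation (a CompositeSpec can be viewed as a dictionary of specs);
- EnvBase.action_spec: it can be any type of spec, but it must correspond to the "action" entry in the input tensordict;
- EnvBase.reward_spec: provides information about the reward space;
- EnvBase.done_spec: provides information about the space of the done flag.
TorchRL specs are organized in two general containers: input_spec, which contains the specs of the information that the step function reads (divided between action_spec, containing the action, and state_spec, containing all the rest), and output_spec, which encodes the specs of what the step outputs (observation_spec, reward_spec and done_spec). In general, you should not interact directly with output_spec and input_spec but only with their content: observation_spec, reward_spec, done_spec, action_spec and state_spec. The reason is that the specs are organized in a non-trivial way within output_spec and input_spec, and neither of these should be modified directly.
In other words, observation_spec and the related properties are convenient shortcuts to the content of the output and input spec containers.
TorchRL offers multiple TensorSpec subclasses to encode the environment's input and output characteristics.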
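To make the idea of a spec as a "domain description" more tangible, here is a deliberately simplified sketch in plain Python. ToyBoundedSpec is a hypothetical class invented for this illustration; it is not a TorchRL class and ignores shapes, dtypes and devices, which real specs also carry.

```python
import random
from dataclasses import dataclass


@dataclass
class ToyBoundedSpec:
    """Hypothetical, simplified stand-in for a bounded scalar spec."""
    low: float
    high: float

    def is_in(self, value):
        # True when the value lies inside the declared domain.
        return self.low <= value <= self.high

    def rand(self, rng):
        # Sample a value uniformly from the domain.
        return rng.uniform(self.low, self.high)


action_spec = ToyBoundedSpec(low=-2.0, high=2.0)  # torque limits used in this tutorial
rng = random.Random(0)
sample = action_spec.rand(rng)
print(action_spec.is_in(sample))  # True
print(action_spec.is_in(3.0))     # False
```

A description like this is enough to draw random actions or validate data without ever querying the simulator, which is the role the text above assigns to specs.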
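Specs shape
The leading dimensions of the environment specs must match the environment batch size. This ensures that every component of an environment (including its transforms) has an accurate representation of the expected input and output shapes. This is something that should be accurately coded in stateful settings.
For non batch-locked environments, such as the one in our example (see below), this is irrelevant, as the environment batch size will most likely be empty.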
def _make_spec(self, td_params):
    # Under the hood, this will populate self.output_spec["observation"]
    self.observation_spec = CompositeSpec(
        th=BoundedTensorSpec(
            low=-torch.pi,
            high=torch.pi,
            shape=(),
            dtype=torch.float32,
        ),
        thdot=BoundedTensorSpec(
            low=-td_params["params", "max_speed"],
            high=td_params["params", "max_speed"],
            shape=(),
            dtype=torch.float32,
        ),
        # we need to add the ``params`` to the observation specs, as we want
        # to pass it at each step during a rollout
        params=make_composite_from_td(td_params["params"]),
        shape=(),
    )
    # since the environment is stateless, we expect the previous output as input.
    # For this, ``EnvBase`` expects some state_spec to be available
    self.state_spec = self.observation_spec.clone()
    # the action spec is automatically wrapped in input_spec when
    # ``self.action_spec = spec`` is assigned
    self.action_spec = BoundedTensorSpec(
        low=-td_params["params", "max_torque"],
        high=td_params["params", "max_torque"],
        shape=(1,),
        dtype=torch.float32,
    )
    self.reward_spec = UnboundedContinuousTensorSpec(shape=(*td_params.shape, 1))


def make_composite_from_td(td):
    # custom function to convert a ``tensordict`` into a similar spec structure
    # of unbounded values.
    composite = CompositeSpec(
        {
            key: make_composite_from_td(tensor)
            if isinstance(tensor, TensorDictBase)
            else UnboundedContinuousTensorSpec(
                dtype=tensor.dtype, device=tensor.device, shape=tensor.shape
            )
            for key, tensor in td.items()
        },
        shape=td.shape,
    )
    return composite
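Reproducible experiments: seeding
Seeding an environment is a common operation when initializing an experiment. The only goal of EnvBase._set_seed() is to set the seed of the contained simulator. If possible, this operation should not call reset() or interact with the environment execution. The parent EnvBase.set_seed() method incorporates a mechanism that allows seeding multiple environments with different, pseudo-random and reproducible seeds.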
def _set_seed(self, seed: Optional[int]):
    rng = torch.manual_seed(seed)
    self.rng = rng
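To see why keeping a generator around makes resets reproducible, here is a small sketch in plain PyTorch. The helper sample_initial_angle is hypothetical. Unlike the _set_seed() above, which seeds and stores PyTorch's global generator, this variant uses separate torch.Generator objects; that is an assumption made only to keep the example isolated from global state.

```python
import math

import torch


def sample_initial_angle(rng, shape=()):
    """Draw angles uniformly between -pi and pi from the given generator (hypothetical helper)."""
    return torch.rand(shape, generator=rng) * 2 * math.pi - math.pi


rng_a = torch.Generator().manual_seed(0)
rng_b = torch.Generator().manual_seed(0)

# Same seed, same sequence of initial states.
first_a = sample_initial_angle(rng_a, (3,))
first_b = sample_initial_angle(rng_b, (3,))
print(torch.equal(first_a, first_b))  # True
```

Passing the generator explicitly to torch.rand, as _reset() does with self.rng, is what ties the sampled initial states to the seed.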
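Wrapping things together: the EnvBase class
We can finally put the pieces together and design our environment class. The specs must be initialized during the environment construction, so we must take care to call the _make_spec() method within PendulumEnv.__init__().
We add a static method PendulumEnv.gen_params(), which deterministically generates a set of hyperparameters to be used during execution.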
def gen_params(g=10.0, batch_size=None) -> TensorDictBase:
    """Returns a ``tensordict`` containing the physical parameters such as gravitational force and torque or speed limits."""
    if batch_size is None:
        batch_size = []
    td = TensorDict(
        {
            "params": TensorDict(
                {
                    "max_speed": 8,
                    "max_torque": 2.0,
                    "dt": 0.05,
                    "g": g,
                    "m": 1.0,
                    "l": 1.0,
                },
                [],
            )
        },
        [],
    )
    if batch_size:
        td = td.expand(batch_size).contiguous()
    return td
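We define the environment as non-batch-locked by setting the attribute of the same name, batch_locked, to False. This means that we will **not** require the input tensordict to have a batch size that matches the one of the environment.
The following code simply puts together the pieces we have coded above.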
class PendulumEnv(EnvBase):
    metadata = {
        "render_modes": ["human", "rgb_array"],
        "render_fps": 30,
    }
    batch_locked = False

    def __init__(self, td_params=None, seed=None, device="cpu"):
        if td_params is None:
            td_params = self.gen_params()

        super().__init__(device=device, batch_size=[])
        self._make_spec(td_params)
        if seed is None:
            seed = torch.empty((), dtype=torch.int64).random_().item()
        self.set_seed(seed)

    # Helpers: _make_spec and gen_params
    gen_params = staticmethod(gen_params)
    _make_spec = _make_spec

    # Mandatory methods: _step, _reset and _set_seed
    _reset = _reset
    _step = staticmethod(_step)
    _set_seed = _set_seed
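Testing our environment
TorchRL provides a simple function, check_env_specs(), to check that a (transformed) environment has an input/output structure that matches the one dictated by its specs. Let us try it out.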
env = PendulumEnv()
check_env_specs(env)
We can have a look at our specs to get a visual representation of the environment signature.
print("observation_spec:", env.observation_spec)
print("state_spec:", env.state_spec)
print("reward_spec:", env.reward_spec)
observation_spec: CompositeSpec(
th: BoundedTensorSpec(
shape=torch.Size([]),
space=ContinuousBox(
low=Tensor(shape=torch.Size([]), device=cpu, dtype=torch.float32, contiguous=True),
high=Tensor(shape=torch.Size([]), device=cpu, dtype=torch.float32, contiguous=True)),
device=cpu,
dtype=torch.float32,
domain=continuous),
thdot: BoundedTensorSpec(
shape=torch.Size([]),
space=ContinuousBox(
low=Tensor(shape=torch.Size([]), device=cpu, dtype=torch.float32, contiguous=True),
high=Tensor(shape=torch.Size([]), device=cpu, dtype=torch.float32, contiguous=True)),
device=cpu,
dtype=torch.float32,
domain=continuous),
params: CompositeSpec(
max_speed: UnboundedContinuousTensorSpec(
shape=torch.Size([]),
space=None,
device=cpu,
dtype=torch.int64,
domain=discrete),
max_torque: UnboundedContinuousTensorSpec(
shape=torch.Size([]),
space=None,
device=cpu,
dtype=torch.float32,
domain=continuous),
dt: UnboundedContinuousTensorSpec(
shape=torch.Size([]),
space=None,
device=cpu,
dtype=torch.float32,
domain=continuous),
g: UnboundedContinuousTensorSpec(
shape=torch.Size([]),
space=None,
device=cpu,
dtype=torch.float32,
domain=continuous),
m: UnboundedContinuousTensorSpec(
shape=torch.Size([]),
space=None,
device=cpu,
dtype=torch.float32,
domain=continuous),
l: UnboundedContinuousTensorSpec(
shape=torch.Size([]),
space=None,
device=cpu,
dtype=torch.float32,
domain=continuous),
device=cpu,
shape=torch.Size([])),
device=cpu,
shape=torch.Size([]))
state_spec: CompositeSpec(
th: BoundedTensorSpec(
shape=torch.Size([]),
space=ContinuousBox(
low=Tensor(shape=torch.Size([]), device=cpu, dtype=torch.float32, contiguous=True),
high=Tensor(shape=torch.Size([]), device=cpu, dtype=torch.float32, contiguous=True)),
device=cpu,
dtype=torch.float32,
domain=continuous),
thdot: BoundedTensorSpec(
shape=torch.Size([]),
space=ContinuousBox(
low=Tensor(shape=torch.Size([]), device=cpu, dtype=torch.float32, contiguous=True),
high=Tensor(shape=torch.Size([]), device=cpu, dtype=torch.float32, contiguous=True)),
device=cpu,
dtype=torch.float32,
domain=continuous),
params: CompositeSpec(
max_speed: UnboundedContinuousTensorSpec(
shape=torch.Size([]),
space=None,
device=cpu,
dtype=torch.int64,
domain=discrete),
max_torque: UnboundedContinuousTensorSpec(
shape=torch.Size([]),
space=None,
device=cpu,
dtype=torch.float32,
domain=continuous),
dt: UnboundedContinuousTensorSpec(
shape=torch.Size([]),
space=None,
device=cpu,
dtype=torch.float32,
domain=continuous),
g: UnboundedContinuousTensorSpec(
shape=torch.Size([]),
space=None,
device=cpu,
dtype=torch.float32,
domain=continuous),
m: UnboundedContinuousTensorSpec(
shape=torch.Size([]),
space=None,
device=cpu,
dtype=torch.float32,
domain=continuous),
l: UnboundedContinuousTensorSpec(
shape=torch.Size([]),
space=None,
device=cpu,
dtype=torch.float32,
domain=continuous),
device=cpu,
shape=torch.Size([])),
device=cpu,
shape=torch.Size([]))
reward_spec: UnboundedContinuousTensorSpec(
shape=torch.Size([1]),
space=ContinuousBox(
low=Tensor(shape=torch.Size([1]), device=cpu, dtype=torch.float32, contiguous=True),
high=Tensor(shape=torch.Size([1]), device=cpu, dtype=torch.float32, contiguous=True)),
device=cpu,
dtype=torch.float32,
domain=continuous)
We can also execute a couple of commands to check that the output structure matches what is expected.
td = env.reset()
print("reset tensordict", td)
reset tensordict TensorDict(
fields={
done: Tensor(shape=torch.Size([1]), device=cpu, dtype=torch.bool, is_shared=False),
params: TensorDict(
fields={
dt: Tensor(shape=torch.Size([]), device=cpu, dtype=torch.float32, is_shared=False),
g: Tensor(shape=torch.Size([]), device=cpu, dtype=torch.float32, is_shared=False),
l: Tensor(shape=torch.Size([]), device=cpu, dtype=torch.float32, is_shared=False),
m: Tensor(shape=torch.Size([]), device=cpu, dtype=torch.float32, is_shared=False),
max_speed: Tensor(shape=torch.Size([]), device=cpu, dtype=torch.int64, is_shared=False),
max_torque: Tensor(shape=torch.Size([]), device=cpu, dtype=torch.float32, is_shared=False)},
batch_size=torch.Size([]),
device=None,
is_shared=False),
terminated: Tensor(shape=torch.Size([1]), device=cpu, dtype=torch.bool, is_shared=False),
th: Tensor(shape=torch.Size([]), device=cpu, dtype=torch.float32, is_shared=False),
thdot: Tensor(shape=torch.Size([]), device=cpu, dtype=torch.float32, is_shared=False)},
batch_size=torch.Size([]),
device=None,
is_shared=False)
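We can run env.rand_step() to generate an action randomly from the action_spec domain. A tensordict containing the hyperparameters and the current state **must** be passed, since our environment is stateless. In stateful contexts, env.rand_step() works perfectly too.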
td = env.rand_step(td)
print("random step tensordict", td)
random step tensordict TensorDict(
fields={
action: Tensor(shape=torch.Size([1]), device=cpu, dtype=torch.float32, is_shared=False),
done: Tensor(shape=torch.Size([1]), device=cpu, dtype=torch.bool, is_shared=False),
next: TensorDict(
fields={
done: Tensor(shape=torch.Size([1]), device=cpu, dtype=torch.bool, is_shared=False),
params: TensorDict(
fields={
dt: Tensor(shape=torch.Size([]), device=cpu, dtype=torch.float32, is_shared=False),
g: Tensor(shape=torch.Size([]), device=cpu, dtype=torch.float32, is_shared=False),
l: Tensor(shape=torch.Size([]), device=cpu, dtype=torch.float32, is_shared=False),
m: Tensor(shape=torch.Size([]), device=cpu, dtype=torch.float32, is_shared=False),
max_speed: Tensor(shape=torch.Size([]), device=cpu, dtype=torch.int64, is_shared=False),
max_torque: Tensor(shape=torch.Size([]), device=cpu, dtype=torch.float32, is_shared=False)},
batch_size=torch.Size([]),
device=None,
is_shared=False),
reward: Tensor(shape=torch.Size([1]), device=cpu, dtype=torch.float32, is_shared=False),
terminated: Tensor(shape=torch.Size([1]), device=cpu, dtype=torch.bool, is_shared=False),
th: Tensor(shape=torch.Size([]), device=cpu, dtype=torch.float32, is_shared=False),
thdot: Tensor(shape=torch.Size([]), device=cpu, dtype=torch.float32, is_shared=False)},
batch_size=torch.Size([]),
device=None,
is_shared=False),
params: TensorDict(
fields={
dt: Tensor(shape=torch.Size([]), device=cpu, dtype=torch.float32, is_shared=False),
g: Tensor(shape=torch.Size([]), device=cpu, dtype=torch.float32, is_shared=False),
l: Tensor(shape=torch.Size([]), device=cpu, dtype=torch.float32, is_shared=False),
m: Tensor(shape=torch.Size([]), device=cpu, dtype=torch.float32, is_shared=False),
max_speed: Tensor(shape=torch.Size([]), device=cpu, dtype=torch.int64, is_shared=False),
max_torque: Tensor(shape=torch.Size([]), device=cpu, dtype=torch.float32, is_shared=False)},
batch_size=torch.Size([]),
device=None,
is_shared=False),
terminated: Tensor(shape=torch.Size([1]), device=cpu, dtype=torch.bool, is_shared=False),
th: Tensor(shape=torch.Size([]), device=cpu, dtype=torch.float32, is_shared=False),
thdot: Tensor(shape=torch.Size([]), device=cpu, dtype=torch.float32, is_shared=False)},
batch_size=torch.Size([]),
device=None,
is_shared=False)
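Transforming an environment
Writing environment transforms for stateless simulators is slightly more complicated than for stateful ones: when a transformed output entry has to be read again at the following iteration, the inverse transform must be applied before calling step() at the next step. This is an ideal scenario to showcase all the features of TorchRL's transforms!
For instance, in the following transformed environment we unsqueeze the entries ["th", "thdot"] to be able to stack them along the last dimension. We also pass them as in_keys_inv so that they are squeezed back to their original shape once they are passed as input in the next iteration.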
env = TransformedEnv(
    env,
    # ``Unsqueeze`` the observations that we will concatenate
    UnsqueezeTransform(
        unsqueeze_dim=-1,
        in_keys=["th", "thdot"],
        in_keys_inv=["th", "thdot"],
    ),
)
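The shape bookkeeping behind this transform can be checked directly on tensors. The sketch below uses only torch.unsqueeze, torch.cat and torch.squeeze; it does not go through TorchRL, and the batch size of 4 is arbitrary.

```python
import torch

th = torch.zeros(4)      # one angle per environment, shape [4]
thdot = torch.ones(4)    # one velocity per environment, shape [4]

# Forward direction: add a trailing dimension so entries can be concatenated.
th_u, thdot_u = th.unsqueeze(-1), thdot.unsqueeze(-1)
obs = torch.cat([th_u, thdot_u], dim=-1)
print(obs.shape)  # torch.Size([4, 2])

# Inverse direction: squeeze back before handing the state to the simulator.
th_back = th_u.squeeze(-1)
print(th_back.shape)  # torch.Size([4])
```

Without the inverse step, the simulator would receive tensors with an extra trailing dimension at the next iteration, which is why the same keys appear in both in_keys and in_keys_inv.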
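Writing custom transforms
TorchRL's transforms may not cover all the operations one wants to execute after an environment has been executed. Writing a transform does not require much effort. As for the environment design, there are two steps in writing a transform:
- Getting the dynamics right (forward and inverse);
- Adapting the environment specs.
A transform can be used in two settings: on its own, it can be used as a Module. It can also be appended to a TransformedEnv. The structure of the class allows customizing the behavior in each of these contexts.
A Transform skeleton can be summarized as follows: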
class Transform(nn.Module):
    def forward(self, tensordict):
        ...

    def _apply_transform(self, tensordict):
        ...

    def _step(self, tensordict):
        ...

    def _call(self, tensordict):
        ...

    def inv(self, tensordict):
        ...

    def _inv_apply_transform(self, tensordict):
        ...
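There are three entry points (forward(), _step() and inv()), which all receive tensordict.TensorDict instances. The first two eventually go through the keys indicated by in_keys and call _apply_transform() on each of them. The results are written in the entries pointed to by Transform.out_keys if provided (if not, the in_keys are updated with the transformed values). If inverse transforms need to be executed, a similar data flow is followed, but with the Transform.inv() and Transform._inv_apply_transform() methods and across the in_keys_inv and out_keys_inv lists of keys. The following figure summarizes this flow for environments and replay buffers.
[Figure: Transform API]
In some cases, a transform does not work on a subset of keys one at a time, but executes some operation on the parent environment or works with the entire input tensordict. In those cases, the _call() and forward() methods should be rewritten, and the _apply_transform() method can be skipped.
Let us code new transforms that compute the sine and cosine values of the position angle, as these values are more useful for learning a policy than the raw angle value.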
class SinTransform(Transform):
    def _apply_transform(self, obs: torch.Tensor) -> torch.Tensor:
        return obs.sin()

    # The transform must also modify the data at reset time
    def _reset(
        self, tensordict: TensorDictBase, tensordict_reset: TensorDictBase
    ) -> TensorDictBase:
        return self._call(tensordict_reset)

    # _apply_to_composite will execute the observation spec transform across all
    # in_keys/out_keys pairs and write the result in the observation_spec which
    # is of type ``Composite``
    @_apply_to_composite
    def transform_observation_spec(self, observation_spec):
        return BoundedTensorSpec(
            low=-1,
            high=1,
            shape=observation_spec.shape,
            dtype=observation_spec.dtype,
            device=observation_spec.device,
        )


class CosTransform(Transform):
    def _apply_transform(self, obs: torch.Tensor) -> torch.Tensor:
        return obs.cos()

    # The transform must also modify the data at reset time
    def _reset(
        self, tensordict: TensorDictBase, tensordict_reset: TensorDictBase
    ) -> TensorDictBase:
        return self._call(tensordict_reset)

    # _apply_to_composite will execute the observation spec transform across all
    # in_keys/out_keys pairs and write the result in the observation_spec which
    # is of type ``Composite``
    @_apply_to_composite
    def transform_observation_spec(self, observation_spec):
        return BoundedTensorSpec(
            low=-1,
            high=1,
            shape=observation_spec.shape,
            dtype=observation_spec.dtype,
            device=observation_spec.device,
        )


t_sin = SinTransform(in_keys=["th"], out_keys=["sin"])
t_cos = CosTransform(in_keys=["th"], out_keys=["cos"])
env.append_transform(t_sin)
env.append_transform(t_cos)
TransformedEnv(
env=PendulumEnv(),
transform=Compose(
UnsqueezeTransform(unsqueeze_dim=-1, in_keys=['th', 'thdot'], out_keys=['th', 'thdot'], in_keys_inv=['th', 'thdot'], out_keys_inv=['th', 'thdot']),
SinTransform(keys=['th']),
CosTransform(keys=['th'])))
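One likely reason sine and cosine features are easier to learn from than the raw angle is that the raw angle jumps at the wrap-around point, while the (sin, cos) pair varies smoothly. The short sketch below, using only the math module, compares two poses on either side of the downward position; the value of eps is arbitrary.

```python
import math

eps = 0.01
th_a = math.pi - eps    # just short of the downward position on one side
th_b = -math.pi + eps   # just short of it on the other side: almost the same pose

raw_gap = abs(th_a - th_b)
feature_gap = math.hypot(math.sin(th_a) - math.sin(th_b),
                         math.cos(th_a) - math.cos(th_b))

print(f"gap in raw angle:  {raw_gap:.3f}")      # close to 2*pi
print(f"gap in (sin, cos): {feature_gap:.3f}")  # close to 0
```

Two nearly identical physical configurations are far apart as raw angles but close together in the transformed features.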
We concatenate the observations into an "observation" entry. Setting del_keys=False ensures that we keep these values for the next iteration.
cat_transform = CatTensors(
    in_keys=["sin", "cos", "thdot"], dim=-1, out_key="observation", del_keys=False
)
env.append_transform(cat_transform)
TransformedEnv(
env=PendulumEnv(),
transform=Compose(
UnsqueezeTransform(unsqueeze_dim=-1, in_keys=['th', 'thdot'], out_keys=['th', 'thdot'], in_keys_inv=['th', 'thdot'], out_keys_inv=['th', 'thdot']),
SinTransform(keys=['th']),
CosTransform(keys=['th']),
CatTensors(in_keys=['cos', 'sin', 'thdot'], out_key=observation)))
Once more, let us check that our environment specs match what is received:
check_env_specs(env)
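Executing a rollout
Executing a rollout is a succession of simple steps:
- reset the environment;
- while some condition is not met:
  - compute an action given a policy;
  - execute a step given this action;
  - collect the data;
  - make an MDP step;
- gather the data and return.
These operations have been conveniently wrapped in the rollout() method, of which we provide a simplified version below.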
def simple_rollout(steps=100):
    # preallocate:
    data = TensorDict({}, [steps])
    # reset
    _data = env.reset()
    for i in range(steps):
        _data["action"] = env.action_spec.rand()
        _data = env.step(_data)
        data[i] = _data
        _data = step_mdp(_data, keep_other=True)
    return data


print("data from rollout:", simple_rollout(100))
data from rollout: TensorDict(
fields={
action: Tensor(shape=torch.Size([100, 1]), device=cpu, dtype=torch.float32, is_shared=False),
cos: Tensor(shape=torch.Size([100, 1]), device=cpu, dtype=torch.float32, is_shared=False),
done: Tensor(shape=torch.Size([100, 1]), device=cpu, dtype=torch.bool, is_shared=False),
next: TensorDict(
fields={
cos: Tensor(shape=torch.Size([100, 1]), device=cpu, dtype=torch.float32, is_shared=False),
done: Tensor(shape=torch.Size([100, 1]), device=cpu, dtype=torch.bool, is_shared=False),
observation: Tensor(shape=torch.Size([100, 3]), device=cpu, dtype=torch.float32, is_shared=False),
params: TensorDict(
fields={
dt: Tensor(shape=torch.Size([100]), device=cpu, dtype=torch.float32, is_shared=False),
g: Tensor(shape=torch.Size([100]), device=cpu, dtype=torch.float32, is_shared=False),
l: Tensor(shape=torch.Size([100]), device=cpu, dtype=torch.float32, is_shared=False),
m: Tensor(shape=torch.Size([100]), device=cpu, dtype=torch.float32, is_shared=False),
max_speed: Tensor(shape=torch.Size([100]), device=cpu, dtype=torch.int64, is_shared=False),
max_torque: Tensor(shape=torch.Size([100]), device=cpu, dtype=torch.float32, is_shared=False)},
batch_size=torch.Size([100]),
device=None,
is_shared=False),
reward: Tensor(shape=torch.Size([100, 1]), device=cpu, dtype=torch.float32, is_shared=False),
sin: Tensor(shape=torch.Size([100, 1]), device=cpu, dtype=torch.float32, is_shared=False),
terminated: Tensor(shape=torch.Size([100, 1]), device=cpu, dtype=torch.bool, is_shared=False),
th: Tensor(shape=torch.Size([100, 1]), device=cpu, dtype=torch.float32, is_shared=False),
thdot: Tensor(shape=torch.Size([100, 1]), device=cpu, dtype=torch.float32, is_shared=False)},
batch_size=torch.Size([100]),
device=None,
is_shared=False),
observation: Tensor(shape=torch.Size([100, 3]), device=cpu, dtype=torch.float32, is_shared=False),
params: TensorDict(
fields={
dt: Tensor(shape=torch.Size([100]), device=cpu, dtype=torch.float32, is_shared=False),
g: Tensor(shape=torch.Size([100]), device=cpu, dtype=torch.float32, is_shared=False),
l: Tensor(shape=torch.Size([100]), device=cpu, dtype=torch.float32, is_shared=False),
m: Tensor(shape=torch.Size([100]), device=cpu, dtype=torch.float32, is_shared=False),
max_speed: Tensor(shape=torch.Size([100]), device=cpu, dtype=torch.int64, is_shared=False),
max_torque: Tensor(shape=torch.Size([100]), device=cpu, dtype=torch.float32, is_shared=False)},
batch_size=torch.Size([100]),
device=None,
is_shared=False),
sin: Tensor(shape=torch.Size([100, 1]), device=cpu, dtype=torch.float32, is_shared=False),
terminated: Tensor(shape=torch.Size([100, 1]), device=cpu, dtype=torch.bool, is_shared=False),
th: Tensor(shape=torch.Size([100, 1]), device=cpu, dtype=torch.float32, is_shared=False),
thdot: Tensor(shape=torch.Size([100, 1]), device=cpu, dtype=torch.float32, is_shared=False)},
batch_size=torch.Size([100]),
device=None,
is_shared=False)
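The role of the MDP step in such a loop can be illustrated with plain dictionaries. In this sketch, toy_env_step and toy_step_mdp are hypothetical stand-ins: toy_step_mdp promotes the content of "next" to the root and carries the remaining entries over, which is the general idea behind step_mdp, but the exact set of entries the real function keeps or drops is not reproduced here.

```python
def toy_step_mdp(data):
    """Hypothetical sketch: promote the "next" entries to the root for the next iteration."""
    carried = {k: v for k, v in data.items() if k not in ("next", "action")}
    carried.update(data["next"])
    return carried


def toy_env_step(data):
    # Hypothetical dynamics: the observation moves by the action.
    nxt = {"observation": data["observation"] + data["action"]}
    return {**data, "next": nxt}


data = {"observation": 0.0, "params": {"dt": 0.05}}
trajectory = []
for _ in range(3):
    data["action"] = 1.0          # a constant "policy"
    data = toy_env_step(data)     # writes the "next" entry
    trajectory.append(data)       # collect the transition
    data = toy_step_mdp(data)     # "next" becomes the new root

print([t["next"]["observation"] for t in trajectory])  # [1.0, 2.0, 3.0]
print(data)  # {'observation': 3.0, 'params': {'dt': 0.05}}
```

Note how "params" survives every iteration: carrying such extra entries forward is what a stateless environment relies on.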
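Batching computations
The last unexplored part of our tutorial is the ability to batch computations in TorchRL. Because our environment does not make any assumptions regarding the input data shape, we can seamlessly execute it over batches of data. Even better: for non-batch-locked environments such as our Pendulum, we can change the batch size on the fly without recreating the environment. To do this, we just generate parameters with the desired shape.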
batch_size = 10 # number of environments to be executed in batch
td = env.reset(env.gen_params(batch_size=[batch_size]))
print("reset (batch size of 10)", td)
td = env.rand_step(td)
print("rand step (batch size of 10)", td)
reset (batch size of 10) TensorDict(
fields={
cos: Tensor(shape=torch.Size([10, 1]), device=cpu, dtype=torch.float32, is_shared=False),
done: Tensor(shape=torch.Size([10, 1]), device=cpu, dtype=torch.bool, is_shared=False),
observation: Tensor(shape=torch.Size([10, 3]), device=cpu, dtype=torch.float32, is_shared=False),
params: TensorDict(
fields={
dt: Tensor(shape=torch.Size([10]), device=cpu, dtype=torch.float32, is_shared=False),
g: Tensor(shape=torch.Size([10]), device=cpu, dtype=torch.float32, is_shared=False),
l: Tensor(shape=torch.Size([10]), device=cpu, dtype=torch.float32, is_shared=False),
m: Tensor(shape=torch.Size([10]), device=cpu, dtype=torch.float32, is_shared=False),
max_speed: Tensor(shape=torch.Size([10]), device=cpu, dtype=torch.int64, is_shared=False),
max_torque: Tensor(shape=torch.Size([10]), device=cpu, dtype=torch.float32, is_shared=False)},
batch_size=torch.Size([10]),
device=None,
is_shared=False),
sin: Tensor(shape=torch.Size([10, 1]), device=cpu, dtype=torch.float32, is_shared=False),
terminated: Tensor(shape=torch.Size([10, 1]), device=cpu, dtype=torch.bool, is_shared=False),
th: Tensor(shape=torch.Size([10, 1]), device=cpu, dtype=torch.float32, is_shared=False),
thdot: Tensor(shape=torch.Size([10, 1]), device=cpu, dtype=torch.float32, is_shared=False)},
batch_size=torch.Size([10]),
device=None,
is_shared=False)
rand step (batch size of 10) TensorDict(
fields={
action: Tensor(shape=torch.Size([10, 1]), device=cpu, dtype=torch.float32, is_shared=False),
cos: Tensor(shape=torch.Size([10, 1]), device=cpu, dtype=torch.float32, is_shared=False),
done: Tensor(shape=torch.Size([10, 1]), device=cpu, dtype=torch.bool, is_shared=False),
next: TensorDict(
fields={
cos: Tensor(shape=torch.Size([10, 1]), device=cpu, dtype=torch.float32, is_shared=False),
done: Tensor(shape=torch.Size([10, 1]), device=cpu, dtype=torch.bool, is_shared=False),
observation: Tensor(shape=torch.Size([10, 3]), device=cpu, dtype=torch.float32, is_shared=False),
params: TensorDict(
fields={
dt: Tensor(shape=torch.Size([10]), device=cpu, dtype=torch.float32, is_shared=False),
g: Tensor(shape=torch.Size([10]), device=cpu, dtype=torch.float32, is_shared=False),
l: Tensor(shape=torch.Size([10]), device=cpu, dtype=torch.float32, is_shared=False),
m: Tensor(shape=torch.Size([10]), device=cpu, dtype=torch.float32, is_shared=False),
max_speed: Tensor(shape=torch.Size([10]), device=cpu, dtype=torch.int64, is_shared=False),
max_torque: Tensor(shape=torch.Size([10]), device=cpu, dtype=torch.float32, is_shared=False)},
batch_size=torch.Size([10]),
device=None,
is_shared=False),
reward: Tensor(shape=torch.Size([10, 1]), device=cpu, dtype=torch.float32, is_shared=False),
sin: Tensor(shape=torch.Size([10, 1]), device=cpu, dtype=torch.float32, is_shared=False),
terminated: Tensor(shape=torch.Size([10, 1]), device=cpu, dtype=torch.bool, is_shared=False),
th: Tensor(shape=torch.Size([10, 1]), device=cpu, dtype=torch.float32, is_shared=False),
thdot: Tensor(shape=torch.Size([10, 1]), device=cpu, dtype=torch.float32, is_shared=False)},
batch_size=torch.Size([10]),
device=None,
is_shared=False),
observation: Tensor(shape=torch.Size([10, 3]), device=cpu, dtype=torch.float32, is_shared=False),
params: TensorDict(
fields={
dt: Tensor(shape=torch.Size([10]), device=cpu, dtype=torch.float32, is_shared=False),
g: Tensor(shape=torch.Size([10]), device=cpu, dtype=torch.float32, is_shared=False),
l: Tensor(shape=torch.Size([10]), device=cpu, dtype=torch.float32, is_shared=False),
m: Tensor(shape=torch.Size([10]), device=cpu, dtype=torch.float32, is_shared=False),
max_speed: Tensor(shape=torch.Size([10]), device=cpu, dtype=torch.int64, is_shared=False),
max_torque: Tensor(shape=torch.Size([10]), device=cpu, dtype=torch.float32, is_shared=False)},
batch_size=torch.Size([10]),
device=None,
is_shared=False),
sin: Tensor(shape=torch.Size([10, 1]), device=cpu, dtype=torch.float32, is_shared=False),
terminated: Tensor(shape=torch.Size([10, 1]), device=cpu, dtype=torch.bool, is_shared=False),
th: Tensor(shape=torch.Size([10, 1]), device=cpu, dtype=torch.float32, is_shared=False),
thdot: Tensor(shape=torch.Size([10, 1]), device=cpu, dtype=torch.float32, is_shared=False)},
batch_size=torch.Size([10]),
device=None,
is_shared=False)
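Batching comes almost for free here because the dynamics are written with element-wise tensor operations. The following sketch is independent of TorchRL: angular_velocity_update is a hypothetical excerpt of the velocity update (without clamping), with default parameter values assumed to match gen_params().

```python
import torch


def angular_velocity_update(th, thdot, u, g=10.0, m=1.0, length=1.0, dt=0.05):
    """Hypothetical excerpt of the update rule; works for any broadcastable shapes."""
    return thdot + (3 * g / (2 * length) * th.sin() + 3.0 / (m * length**2) * u) * dt


# A single pendulum (scalar tensors) ...
single = angular_velocity_update(torch.tensor(0.5), torch.tensor(0.0), torch.tensor(1.0))
# ... and ten pendulums at once, with the very same function.
batch = angular_velocity_update(torch.full((10,), 0.5), torch.zeros(10), torch.ones(10))

print(single.shape, batch.shape)  # torch.Size([]) torch.Size([10])
print(torch.allclose(batch, single.expand(10)))  # True
```

Nothing in the function refers to a batch dimension, which is the property that lets the environment accept inputs of any batch size.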
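Executing a rollout with a batch of data requires us to reset the environment outside of the rollout function, since we need to define the batch size dynamically and this is not supported by rollout().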
rollout = env.rollout(
    3,
    auto_reset=False,  # we're executing the reset out of the ``rollout`` call
    tensordict=env.reset(env.gen_params(batch_size=[batch_size])),
)
print("rollout of len 3 (batch size of 10):", rollout)
rollout of len 3 (batch size of 10): TensorDict(
fields={
action: Tensor(shape=torch.Size([10, 3, 1]), device=cpu, dtype=torch.float32, is_shared=False),
cos: Tensor(shape=torch.Size([10, 3, 1]), device=cpu, dtype=torch.float32, is_shared=False),
done: Tensor(shape=torch.Size([10, 3, 1]), device=cpu, dtype=torch.bool, is_shared=False),
next: TensorDict(
fields={
cos: Tensor(shape=torch.Size([10, 3, 1]), device=cpu, dtype=torch.float32, is_shared=False),
done: Tensor(shape=torch.Size([10, 3, 1]), device=cpu, dtype=torch.bool, is_shared=False),
observation: Tensor(shape=torch.Size([10, 3, 3]), device=cpu, dtype=torch.float32, is_shared=False),
params: TensorDict(
fields={
dt: Tensor(shape=torch.Size([10, 3]), device=cpu, dtype=torch.float32, is_shared=False),
g: Tensor(shape=torch.Size([10, 3]), device=cpu, dtype=torch.float32, is_shared=False),
l: Tensor(shape=torch.Size([10, 3]), device=cpu, dtype=torch.float32, is_shared=False),
m: Tensor(shape=torch.Size([10, 3]), device=cpu, dtype=torch.float32, is_shared=False),
max_speed: Tensor(shape=torch.Size([10, 3]), device=cpu, dtype=torch.int64, is_shared=False),
max_torque: Tensor(shape=torch.Size([10, 3]), device=cpu, dtype=torch.float32, is_shared=False)},
batch_size=torch.Size([10, 3]),
device=None,
is_shared=False),
reward: Tensor(shape=torch.Size([10, 3, 1]), device=cpu, dtype=torch.float32, is_shared=False),
sin: Tensor(shape=torch.Size([10, 3, 1]), device=cpu, dtype=torch.float32, is_shared=False),
terminated: Tensor(shape=torch.Size([10, 3, 1]), device=cpu, dtype=torch.bool, is_shared=False),
th: Tensor(shape=torch.Size([10, 3, 1]), device=cpu, dtype=torch.float32, is_shared=False),
thdot: Tensor(shape=torch.Size([10, 3, 1]), device=cpu, dtype=torch.float32, is_shared=False)},
batch_size=torch.Size([10, 3]),
device=None,
is_shared=False),
observation: Tensor(shape=torch.Size([10, 3, 3]), device=cpu, dtype=torch.float32, is_shared=False),
params: TensorDict(
fields={
dt: Tensor(shape=torch.Size([10, 3]), device=cpu, dtype=torch.float32, is_shared=False),
g: Tensor(shape=torch.Size([10, 3]), device=cpu, dtype=torch.float32, is_shared=False),
l: Tensor(shape=torch.Size([10, 3]), device=cpu, dtype=torch.float32, is_shared=False),
m: Tensor(shape=torch.Size([10, 3]), device=cpu, dtype=torch.float32, is_shared=False),
max_speed: Tensor(shape=torch.Size([10, 3]), device=cpu, dtype=torch.int64, is_shared=False),
max_torque: Tensor(shape=torch.Size([10, 3]), device=cpu, dtype=torch.float32, is_shared=False)},
batch_size=torch.Size([10, 3]),
device=None,
is_shared=False),
sin: Tensor(shape=torch.Size([10, 3, 1]), device=cpu, dtype=torch.float32, is_shared=False),
terminated: Tensor(shape=torch.Size([10, 3, 1]), device=cpu, dtype=torch.bool, is_shared=False),
th: Tensor(shape=torch.Size([10, 3, 1]), device=cpu, dtype=torch.float32, is_shared=False),
thdot: Tensor(shape=torch.Size([10, 3, 1]), device=cpu, dtype=torch.float32, is_shared=False)},
batch_size=torch.Size([10, 3]),
device=None,
is_shared=False)
Training a simple policy
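In this example, we train a simple policy using the reward as a differentiable objective (that is, as a negative loss). Because our dynamical system is fully differentiable, we can backpropagate through the trajectory return and adjust the policy weights to maximize that value directly. Of course, many of the assumptions we make here, such as a differentiable system and full access to the underlying mechanics, do not hold in many settings.
Still, this is a very simple example that shows how to write a training loop with a custom environment in TorchRL.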
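To make the idea of backpropagating through a trajectory return concrete, here is a minimal sketch in plain PyTorch. It does not use the tutorial's pendulum or TorchRL: the one-dimensional system, the `rollout_return` helper, and the reward shape are all assumptions made up for illustration.

```python
import torch
from torch import nn

torch.manual_seed(0)

# Hypothetical toy policy: maps a 1-D state to a 1-D action.
toy_policy = nn.Linear(1, 1)


def rollout_return(policy, x0, steps=10, dt=0.1):
    """Unroll a toy differentiable system and return the mean reward."""
    x = x0
    rewards = []
    for _ in range(steps):
        u = policy(x)  # the action depends on the current state
        x = x + dt * u  # differentiable state transition
        # The reward is largest (zero) when both state and action are zero.
        rewards.append(-(x**2) - 0.001 * u**2)
    return torch.stack(rewards).mean()


x0 = torch.ones(8, 1)  # a batch of 8 initial states
traj_return = rollout_return(toy_policy, x0)

# Maximizing the return is the same as minimizing its negative.
(-traj_return).backward()
grads = [p.grad for p in toy_policy.parameters()]
```

Every operation between the policy output and the reward is a differentiable tensor operation, so a single `backward()` call populates the gradients of the policy parameters with respect to the whole trajectory.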
Let us first write the policy network:
torch.manual_seed(0)
env.set_seed(0)

net = nn.Sequential(
    nn.LazyLinear(64),
    nn.Tanh(),
    nn.LazyLinear(64),
    nn.Tanh(),
    nn.LazyLinear(64),
    nn.Tanh(),
    nn.LazyLinear(1),
)
policy = TensorDictModule(
    net,
    in_keys=["observation"],
    out_keys=["action"],
)
and our optimizer:
optim = torch.optim.Adam(policy.parameters(), lr=2e-3)
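Training loop
We will successively:
generate a trajectory;
sum the rewards;
backpropagate through the graph defined by these operations;
clip the gradient norm and take an optimization step;
repeat.
At the end of the training loop, the final reward should be close to 0, which indicates that the pendulum is upright and still, as desired.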
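Of the steps listed above, gradient-norm clipping is the easiest to misread in the logs. The sketch below is a standalone illustration rather than part of the tutorial's loop: `toy_layer` and the artificially scaled loss are assumptions. It shows that `torch.nn.utils.clip_grad_norm_` returns the total norm measured before clipping, while the stored gradients are rescaled to the threshold.

```python
import torch
from torch import nn

torch.manual_seed(0)

# Hypothetical layer and loss, scaled up so the gradient is large.
toy_layer = nn.Linear(3, 1)
loss = (toy_layer(torch.ones(4, 3)) * 100.0).sum()
loss.backward()

max_norm = 1.0
# The returned value is the total gradient norm *before* clipping.
# By construction every gradient entry is 400, so this norm is 800.
norm_before = torch.nn.utils.clip_grad_norm_(toy_layer.parameters(), max_norm)

# Recompute the norm from the gradients that are now stored on the parameters.
norm_after = torch.sqrt(sum(p.grad.pow(2).sum() for p in toy_layer.parameters()))
```

This is why a logged gradient norm can be far larger than the clipping threshold of 1.0: the logged value is the pre-clipping norm, not the norm of the update that is actually applied.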
batch_size = 32
pbar = tqdm.tqdm(range(20_000 // batch_size))
scheduler = torch.optim.lr_scheduler.CosineAnnealingLR(optim, 20_000)
logs = defaultdict(list)

for _ in pbar:
    # Sample a fresh batch of parameters and reset the environment with them.
    init_td = env.reset(env.gen_params(batch_size=[batch_size]))
    # Unroll 100 steps with the current policy, starting from init_td.
    rollout = env.rollout(100, policy, tensordict=init_td, auto_reset=False)
    traj_return = rollout["next", "reward"].mean()
    # Maximize the return by minimizing its negative.
    (-traj_return).backward()
    # clip_grad_norm_ returns the total norm computed before clipping.
    gn = torch.nn.utils.clip_grad_norm_(net.parameters(), 1.0)
    optim.step()
    optim.zero_grad()
    pbar.set_description(
        f"reward: {traj_return: 4.4f}, "
        f"last reward: {rollout[..., -1]['next', 'reward'].mean(): 4.4f}, gradient norm: {gn: 4.4}"
    )
    logs["return"].append(traj_return.item())
    logs["last_reward"].append(rollout[..., -1]["next", "reward"].mean().item())
    scheduler.step()
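The loop above steps a cosine-annealing scheduler once per iteration. As a rough sketch of what that schedule does to the learning rate, the snippet below evaluates the standard closed-form cosine annealing formula (without restarts) in plain Python. The helper `cosine_annealing_lr` is hypothetical and not a PyTorch function; it assumes the scheduler follows that closed form with a minimum learning rate of 0.

```python
import math


def cosine_annealing_lr(step, base_lr, t_max, eta_min=0.0):
    """Closed-form cosine annealing value after `step` scheduler steps."""
    cosine = math.cos(math.pi * step / t_max)
    return eta_min + 0.5 * (base_lr - eta_min) * (1.0 + cosine)


base_lr = 2e-3  # learning rate passed to Adam in the tutorial
t_max = 20_000  # second argument given to CosineAnnealingLR
n_iters = 20_000 // 32  # number of loop iterations (625)

lr_start = cosine_annealing_lr(0, base_lr, t_max)
# Learning rate reached at the end of the loop with the settings above.
lr_end = cosine_annealing_lr(n_iters, base_lr, t_max)
# For comparison only: the value if t_max matched the number of iterations.
lr_if_matched = cosine_annealing_lr(n_iters, base_lr, n_iters)
```

Under these assumptions the learning rate stays within about 0.25% of its initial value over the 625 iterations, because the annealing horizon is much longer than the run; a full decay to 0 would only occur if the horizon matched the number of scheduler steps.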
def plot():
    import matplotlib
    from matplotlib import pyplot as plt

    is_ipython = "inline" in matplotlib.get_backend()
    if is_ipython:
        from IPython import display

    with plt.ion():
        plt.figure(figsize=(10, 5))
        plt.subplot(1, 2, 1)
        plt.plot(logs["return"])
        plt.title("returns")
        plt.xlabel("iteration")
        plt.subplot(1, 2, 2)
        plt.plot(logs["last_reward"])
        plt.title("last reward")
        plt.xlabel("iteration")
        if is_ipython:
            display.display(plt.gcf())
            display.clear_output(wait=True)
        plt.show()


plot()
0%| | 0/625 [00:00<?, ?it/s]
reward: -6.0488, last reward: -5.0748, gradient norm: 8.519: 0%| | 0/625 [00:00<?, ?it/s]
reward: -6.0488, last reward: -5.0748, gradient norm: 8.519: 0%| | 1/625 [00:00<01:23, 7.44it/s]
reward: -7.0499, last reward: -7.4472, gradient norm: 5.073: 0%| | 1/625 [00:00<01:23, 7.44it/s]
reward: -7.0499, last reward: -7.4472, gradient norm: 5.073: 0%| | 2/625 [00:00<01:23, 7.48it/s]
reward: -7.0685, last reward: -7.0408, gradient norm: 5.552: 0%| | 2/625 [00:00<01:23, 7.48it/s]
reward: -7.0685, last reward: -7.0408, gradient norm: 5.552: 0%| | 3/625 [00:00<01:22, 7.50it/s]
reward: -6.5154, last reward: -5.9086, gradient norm: 2.527: 0%| | 3/625 [00:00<01:22, 7.50it/s]
reward: -6.5154, last reward: -5.9086, gradient norm: 2.527: 1%| | 4/625 [00:00<01:22, 7.48it/s]
reward: -6.2006, last reward: -5.9385, gradient norm: 8.155: 1%| | 4/625 [00:00<01:22, 7.48it/s]
reward: -6.2006, last reward: -5.9385, gradient norm: 8.155: 1%| | 5/625 [00:00<01:22, 7.49it/s]
reward: -6.2568, last reward: -5.4981, gradient norm: 6.223: 1%| | 5/625 [00:00<01:22, 7.49it/s]
reward: -6.2568, last reward: -5.4981, gradient norm: 6.223: 1%| | 6/625 [00:00<01:22, 7.53it/s]
reward: -5.8929, last reward: -8.4491, gradient norm: 4.581: 1%| | 6/625 [00:00<01:22, 7.53it/s]
reward: -5.8929, last reward: -8.4491, gradient norm: 4.581: 1%| | 7/625 [00:00<01:21, 7.55it/s]
reward: -6.3233, last reward: -9.0664, gradient norm: 7.596: 1%| | 7/625 [00:01<01:21, 7.55it/s]
reward: -6.3233, last reward: -9.0664, gradient norm: 7.596: 1%|▏ | 8/625 [00:01<01:21, 7.56it/s]
reward: -6.1021, last reward: -9.5263, gradient norm: 0.9579: 1%|▏ | 8/625 [00:01<01:21, 7.56it/s]
reward: -6.1021, last reward: -9.5263, gradient norm: 0.9579: 1%|▏ | 9/625 [00:01<01:21, 7.56it/s]
reward: -6.5807, last reward: -8.8075, gradient norm: 3.212: 1%|▏ | 9/625 [00:01<01:21, 7.56it/s]
reward: -6.5807, last reward: -8.8075, gradient norm: 3.212: 2%|▏ | 10/625 [00:01<01:21, 7.56it/s]
reward: -6.2009, last reward: -8.5525, gradient norm: 2.914: 2%|▏ | 10/625 [00:01<01:21, 7.56it/s]
reward: -6.2009, last reward: -8.5525, gradient norm: 2.914: 2%|▏ | 11/625 [00:01<01:21, 7.54it/s]
reward: -6.2894, last reward: -8.0115, gradient norm: 52.06: 2%|▏ | 11/625 [00:01<01:21, 7.54it/s]
reward: -6.2894, last reward: -8.0115, gradient norm: 52.06: 2%|▏ | 12/625 [00:01<01:21, 7.53it/s]
reward: -6.0977, last reward: -6.1845, gradient norm: 18.09: 2%|▏ | 12/625 [00:01<01:21, 7.53it/s]
reward: -6.0977, last reward: -6.1845, gradient norm: 18.09: 2%|▏ | 13/625 [00:01<01:21, 7.55it/s]
reward: -6.1830, last reward: -7.4858, gradient norm: 5.233: 2%|▏ | 13/625 [00:01<01:21, 7.55it/s]
reward: -6.1830, last reward: -7.4858, gradient norm: 5.233: 2%|▏ | 14/625 [00:01<01:20, 7.55it/s]
reward: -6.2863, last reward: -5.0297, gradient norm: 1.464: 2%|▏ | 14/625 [00:01<01:20, 7.55it/s]
reward: -6.2863, last reward: -5.0297, gradient norm: 1.464: 2%|▏ | 15/625 [00:01<01:20, 7.56it/s]
reward: -6.4617, last reward: -5.5997, gradient norm: 2.904: 2%|▏ | 15/625 [00:02<01:20, 7.56it/s]
reward: -6.4617, last reward: -5.5997, gradient norm: 2.904: 3%|▎ | 16/625 [00:02<01:20, 7.57it/s]
reward: -6.1647, last reward: -6.0777, gradient norm: 4.901: 3%|▎ | 16/625 [00:02<01:20, 7.57it/s]
reward: -6.1647, last reward: -6.0777, gradient norm: 4.901: 3%|▎ | 17/625 [00:02<01:20, 7.55it/s]
reward: -6.4709, last reward: -6.6813, gradient norm: 0.8317: 3%|▎ | 17/625 [00:02<01:20, 7.55it/s]
reward: -6.4709, last reward: -6.6813, gradient norm: 0.8317: 3%|▎ | 18/625 [00:02<01:20, 7.56it/s]
reward: -6.3221, last reward: -6.5554, gradient norm: 1.276: 3%|▎ | 18/625 [00:02<01:20, 7.56it/s]
reward: -6.3221, last reward: -6.5554, gradient norm: 1.276: 3%|▎ | 19/625 [00:02<01:20, 7.57it/s]
reward: -6.3353, last reward: -7.9999, gradient norm: 4.701: 3%|▎ | 19/625 [00:02<01:20, 7.57it/s]
reward: -6.3353, last reward: -7.9999, gradient norm: 4.701: 3%|▎ | 20/625 [00:02<01:20, 7.56it/s]
reward: -5.8570, last reward: -7.6656, gradient norm: 5.463: 3%|▎ | 20/625 [00:02<01:20, 7.56it/s]
reward: -5.8570, last reward: -7.6656, gradient norm: 5.463: 3%|▎ | 21/625 [00:02<01:19, 7.56it/s]
reward: -5.7779, last reward: -6.6911, gradient norm: 6.875: 3%|▎ | 21/625 [00:02<01:19, 7.56it/s]
reward: -5.7779, last reward: -6.6911, gradient norm: 6.875: 4%|▎ | 22/625 [00:02<01:19, 7.56it/s]
reward: -6.0796, last reward: -5.7082, gradient norm: 5.308: 4%|▎ | 22/625 [00:03<01:19, 7.56it/s]
reward: -6.0796, last reward: -5.7082, gradient norm: 5.308: 4%|▎ | 23/625 [00:03<01:19, 7.54it/s]
reward: -6.0421, last reward: -6.1496, gradient norm: 12.4: 4%|▎ | 23/625 [00:03<01:19, 7.54it/s]
reward: -6.0421, last reward: -6.1496, gradient norm: 12.4: 4%|▍ | 24/625 [00:03<01:19, 7.54it/s]
reward: -5.5037, last reward: -5.1755, gradient norm: 22.62: 4%|▍ | 24/625 [00:03<01:19, 7.54it/s]
reward: -5.5037, last reward: -5.1755, gradient norm: 22.62: 4%|▍ | 25/625 [00:03<01:19, 7.54it/s]
reward: -5.5029, last reward: -4.9454, gradient norm: 3.665: 4%|▍ | 25/625 [00:03<01:19, 7.54it/s]
reward: -5.5029, last reward: -4.9454, gradient norm: 3.665: 4%|▍ | 26/625 [00:03<01:19, 7.54it/s]
reward: -5.9330, last reward: -6.2118, gradient norm: 5.444: 4%|▍ | 26/625 [00:03<01:19, 7.54it/s]
reward: -5.9330, last reward: -6.2118, gradient norm: 5.444: 4%|▍ | 27/625 [00:03<01:19, 7.56it/s]
reward: -6.0995, last reward: -6.6294, gradient norm: 11.69: 4%|▍ | 27/625 [00:03<01:19, 7.56it/s]
reward: -6.0995, last reward: -6.6294, gradient norm: 11.69: 4%|▍ | 28/625 [00:03<01:18, 7.57it/s]
reward: -6.3146, last reward: -7.2909, gradient norm: 5.461: 4%|▍ | 28/625 [00:03<01:18, 7.57it/s]
reward: -6.3146, last reward: -7.2909, gradient norm: 5.461: 5%|▍ | 29/625 [00:03<01:18, 7.57it/s]
reward: -5.9720, last reward: -6.1298, gradient norm: 19.91: 5%|▍ | 29/625 [00:03<01:18, 7.57it/s]
reward: -5.9720, last reward: -6.1298, gradient norm: 19.91: 5%|▍ | 30/625 [00:03<01:18, 7.57it/s]
reward: -5.9923, last reward: -7.0345, gradient norm: 3.464: 5%|▍ | 30/625 [00:04<01:18, 7.57it/s]
reward: -5.9923, last reward: -7.0345, gradient norm: 3.464: 5%|▍ | 31/625 [00:04<01:18, 7.58it/s]
reward: -5.3438, last reward: -4.3688, gradient norm: 2.424: 5%|▍ | 31/625 [00:04<01:18, 7.58it/s]
reward: -5.3438, last reward: -4.3688, gradient norm: 2.424: 5%|▌ | 32/625 [00:04<01:18, 7.58it/s]
reward: -5.6953, last reward: -4.5233, gradient norm: 3.411: 5%|▌ | 32/625 [00:04<01:18, 7.58it/s]
reward: -5.6953, last reward: -4.5233, gradient norm: 3.411: 5%|▌ | 33/625 [00:04<01:18, 7.56it/s]
reward: -5.4288, last reward: -2.8011, gradient norm: 10.82: 5%|▌ | 33/625 [00:04<01:18, 7.56it/s]
reward: -5.4288, last reward: -2.8011, gradient norm: 10.82: 5%|▌ | 34/625 [00:04<01:18, 7.55it/s]
reward: -5.5329, last reward: -4.2677, gradient norm: 15.71: 5%|▌ | 34/625 [00:04<01:18, 7.55it/s]
reward: -5.5329, last reward: -4.2677, gradient norm: 15.71: 6%|▌ | 35/625 [00:04<01:17, 7.57it/s]
reward: -5.6969, last reward: -3.7010, gradient norm: 1.376: 6%|▌ | 35/625 [00:04<01:17, 7.57it/s]
reward: -5.6969, last reward: -3.7010, gradient norm: 1.376: 6%|▌ | 36/625 [00:04<01:17, 7.58it/s]
reward: -5.9352, last reward: -4.7707, gradient norm: 15.49: 6%|▌ | 36/625 [00:04<01:17, 7.58it/s]
reward: -5.9352, last reward: -4.7707, gradient norm: 15.49: 6%|▌ | 37/625 [00:04<01:17, 7.57it/s]
reward: -5.6178, last reward: -4.5646, gradient norm: 3.348: 6%|▌ | 37/625 [00:05<01:17, 7.57it/s]
reward: -5.6178, last reward: -4.5646, gradient norm: 3.348: 6%|▌ | 38/625 [00:05<01:17, 7.58it/s]
reward: -5.7304, last reward: -3.9407, gradient norm: 4.942: 6%|▌ | 38/625 [00:05<01:17, 7.58it/s]
reward: -5.7304, last reward: -3.9407, gradient norm: 4.942: 6%|▌ | 39/625 [00:05<01:17, 7.58it/s]
reward: -5.3882, last reward: -3.7604, gradient norm: 9.85: 6%|▌ | 39/625 [00:05<01:17, 7.58it/s]
reward: -5.3882, last reward: -3.7604, gradient norm: 9.85: 6%|▋ | 40/625 [00:05<01:17, 7.58it/s]
reward: -5.3507, last reward: -2.8928, gradient norm: 1.258: 6%|▋ | 40/625 [00:05<01:17, 7.58it/s]
reward: -5.3507, last reward: -2.8928, gradient norm: 1.258: 7%|▋ | 41/625 [00:05<01:16, 7.59it/s]
reward: -5.6978, last reward: -4.4641, gradient norm: 4.549: 7%|▋ | 41/625 [00:05<01:16, 7.59it/s]
reward: -5.6978, last reward: -4.4641, gradient norm: 4.549: 7%|▋ | 42/625 [00:05<01:16, 7.59it/s]
reward: -5.5263, last reward: -3.6047, gradient norm: 2.544: 7%|▋ | 42/625 [00:05<01:16, 7.59it/s]
reward: -5.5263, last reward: -3.6047, gradient norm: 2.544: 7%|▋ | 43/625 [00:05<01:16, 7.59it/s]
reward: -5.5005, last reward: -4.4136, gradient norm: 11.49: 7%|▋ | 43/625 [00:05<01:16, 7.59it/s]
reward: -5.5005, last reward: -4.4136, gradient norm: 11.49: 7%|▋ | 44/625 [00:05<01:16, 7.59it/s]
reward: -5.2993, last reward: -6.3222, gradient norm: 32.53: 7%|▋ | 44/625 [00:05<01:16, 7.59it/s]
reward: -5.2993, last reward: -6.3222, gradient norm: 32.53: 7%|▋ | 45/625 [00:05<01:16, 7.59it/s]
reward: -5.4046, last reward: -5.7314, gradient norm: 7.275: 7%|▋ | 45/625 [00:06<01:16, 7.59it/s]
reward: -5.4046, last reward: -5.7314, gradient norm: 7.275: 7%|▋ | 46/625 [00:06<01:16, 7.60it/s]
reward: -5.6331, last reward: -4.9318, gradient norm: 6.961: 7%|▋ | 46/625 [00:06<01:16, 7.60it/s]
reward: -5.6331, last reward: -4.9318, gradient norm: 6.961: 8%|▊ | 47/625 [00:06<01:16, 7.60it/s]
reward: -4.8331, last reward: -4.1604, gradient norm: 26.26: 8%|▊ | 47/625 [00:06<01:16, 7.60it/s]
reward: -4.8331, last reward: -4.1604, gradient norm: 26.26: 8%|▊ | 48/625 [00:06<01:15, 7.60it/s]
reward: -5.4099, last reward: -4.4761, gradient norm: 8.125: 8%|▊ | 48/625 [00:06<01:15, 7.60it/s]
reward: -5.4099, last reward: -4.4761, gradient norm: 8.125: 8%|▊ | 49/625 [00:06<01:15, 7.59it/s]
reward: -5.4262, last reward: -3.6363, gradient norm: 2.382: 8%|▊ | 49/625 [00:06<01:15, 7.59it/s]
reward: -5.4262, last reward: -3.6363, gradient norm: 2.382: 8%|▊ | 50/625 [00:06<01:15, 7.59it/s]
reward: -5.3593, last reward: -5.7377, gradient norm: 22.62: 8%|▊ | 50/625 [00:06<01:15, 7.59it/s]
reward: -5.3593, last reward: -5.7377, gradient norm: 22.62: 8%|▊ | 51/625 [00:06<01:15, 7.59it/s]
reward: -5.2847, last reward: -3.3443, gradient norm: 2.867: 8%|▊ | 51/625 [00:06<01:15, 7.59it/s]
reward: -5.2847, last reward: -3.3443, gradient norm: 2.867: 8%|▊ | 52/625 [00:06<01:15, 7.58it/s]
reward: -5.3592, last reward: -6.4760, gradient norm: 8.441: 8%|▊ | 52/625 [00:07<01:15, 7.58it/s]
reward: -5.3592, last reward: -6.4760, gradient norm: 8.441: 8%|▊ | 53/625 [00:07<01:15, 7.59it/s]
reward: -5.9950, last reward: -10.8021, gradient norm: 11.77: 8%|▊ | 53/625 [00:07<01:15, 7.59it/s]
reward: -5.9950, last reward: -10.8021, gradient norm: 11.77: 9%|▊ | 54/625 [00:07<01:15, 7.60it/s]
reward: -6.3528, last reward: -7.1214, gradient norm: 7.708: 9%|▊ | 54/625 [00:07<01:15, 7.60it/s]
reward: -6.3528, last reward: -7.1214, gradient norm: 7.708: 9%|▉ | 55/625 [00:07<01:15, 7.60it/s]
reward: -6.4023, last reward: -7.3583, gradient norm: 9.041: 9%|▉ | 55/625 [00:07<01:15, 7.60it/s]
reward: -6.4023, last reward: -7.3583, gradient norm: 9.041: 9%|▉ | 56/625 [00:07<01:14, 7.59it/s]
reward: -6.3801, last reward: -7.0310, gradient norm: 120.1: 9%|▉ | 56/625 [00:07<01:14, 7.59it/s]
reward: -6.3801, last reward: -7.0310, gradient norm: 120.1: 9%|▉ | 57/625 [00:07<01:14, 7.59it/s]
reward: -6.4244, last reward: -6.2039, gradient norm: 15.48: 9%|▉ | 57/625 [00:07<01:14, 7.59it/s]
reward: -6.4244, last reward: -6.2039, gradient norm: 15.48: 9%|▉ | 58/625 [00:07<01:14, 7.59it/s]
reward: -6.4850, last reward: -6.8748, gradient norm: 4.706: 9%|▉ | 58/625 [00:07<01:14, 7.59it/s]
reward: -6.4850, last reward: -6.8748, gradient norm: 4.706: 9%|▉ | 59/625 [00:07<01:14, 7.59it/s]
reward: -6.4897, last reward: -5.9210, gradient norm: 11.63: 9%|▉ | 59/625 [00:07<01:14, 7.59it/s]
reward: -6.4897, last reward: -5.9210, gradient norm: 11.63: 10%|▉ | 60/625 [00:07<01:14, 7.59it/s]
reward: -6.2299, last reward: -7.8964, gradient norm: 13.35: 10%|▉ | 60/625 [00:08<01:14, 7.59it/s]
reward: -6.2299, last reward: -7.8964, gradient norm: 13.35: 10%|▉ | 61/625 [00:08<01:14, 7.60it/s]
reward: -6.0832, last reward: -9.3934, gradient norm: 4.456: 10%|▉ | 61/625 [00:08<01:14, 7.60it/s]
reward: -6.0832, last reward: -9.3934, gradient norm: 4.456: 10%|▉ | 62/625 [00:08<01:14, 7.60it/s]
reward: -5.8971, last reward: -10.2933, gradient norm: 10.74: 10%|▉ | 62/625 [00:08<01:14, 7.60it/s]
reward: -5.8971, last reward: -10.2933, gradient norm: 10.74: 10%|█ | 63/625 [00:08<01:13, 7.60it/s]
reward: -5.3377, last reward: -4.6996, gradient norm: 23.29: 10%|█ | 63/625 [00:08<01:13, 7.60it/s]
reward: -5.3377, last reward: -4.6996, gradient norm: 23.29: 10%|█ | 64/625 [00:08<01:13, 7.59it/s]
reward: -5.2274, last reward: -2.8916, gradient norm: 4.098: 10%|█ | 64/625 [00:08<01:13, 7.59it/s]
reward: -5.2274, last reward: -2.8916, gradient norm: 4.098: 10%|█ | 65/625 [00:08<01:13, 7.59it/s]
reward: -5.2660, last reward: -4.9110, gradient norm: 12.28: 10%|█ | 65/625 [00:08<01:13, 7.59it/s]
reward: -5.2660, last reward: -4.9110, gradient norm: 12.28: 11%|█ | 66/625 [00:08<01:13, 7.60it/s]
reward: -5.4503, last reward: -5.6956, gradient norm: 12.22: 11%|█ | 66/625 [00:08<01:13, 7.60it/s]
reward: -5.4503, last reward: -5.6956, gradient norm: 12.22: 11%|█ | 67/625 [00:08<01:13, 7.59it/s]
reward: -5.9172, last reward: -5.4026, gradient norm: 7.946: 11%|█ | 67/625 [00:08<01:13, 7.59it/s]
reward: -5.9172, last reward: -5.4026, gradient norm: 7.946: 11%|█ | 68/625 [00:08<01:13, 7.59it/s]
reward: -5.9229, last reward: -4.5205, gradient norm: 6.294: 11%|█ | 68/625 [00:09<01:13, 7.59it/s]
reward: -5.9229, last reward: -4.5205, gradient norm: 6.294: 11%|█ | 69/625 [00:09<01:13, 7.60it/s]
reward: -5.8872, last reward: -5.6637, gradient norm: 8.019: 11%|█ | 69/625 [00:09<01:13, 7.60it/s]
reward: -5.8872, last reward: -5.6637, gradient norm: 8.019: 11%|█ | 70/625 [00:09<01:13, 7.59it/s]
reward: -5.9281, last reward: -4.2082, gradient norm: 5.724: 11%|█ | 70/625 [00:09<01:13, 7.59it/s]
reward: -5.9281, last reward: -4.2082, gradient norm: 5.724: 11%|█▏ | 71/625 [00:09<01:13, 7.59it/s]
reward: -5.8561, last reward: -5.6574, gradient norm: 8.357: 11%|█▏ | 71/625 [00:09<01:13, 7.59it/s]
reward: -5.8561, last reward: -5.6574, gradient norm: 8.357: 12%|█▏ | 72/625 [00:09<01:12, 7.59it/s]
reward: -5.4138, last reward: -4.5230, gradient norm: 7.385: 12%|█▏ | 72/625 [00:09<01:12, 7.59it/s]
reward: -5.4138, last reward: -4.5230, gradient norm: 7.385: 12%|█▏ | 73/625 [00:09<01:12, 7.59it/s]
reward: -5.4065, last reward: -5.5642, gradient norm: 9.921: 12%|█▏ | 73/625 [00:09<01:12, 7.59it/s]
reward: -5.4065, last reward: -5.5642, gradient norm: 9.921: 12%|█▏ | 74/625 [00:09<01:12, 7.59it/s]
reward: -4.9786, last reward: -3.2894, gradient norm: 32.73: 12%|█▏ | 74/625 [00:09<01:12, 7.59it/s]
reward: -4.9786, last reward: -3.2894, gradient norm: 32.73: 12%|█▏ | 75/625 [00:09<01:12, 7.59it/s]
reward: -5.4129, last reward: -7.5831, gradient norm: 9.266: 12%|█▏ | 75/625 [00:10<01:12, 7.59it/s]
reward: -5.4129, last reward: -7.5831, gradient norm: 9.266: 12%|█▏ | 76/625 [00:10<01:12, 7.59it/s]
reward: -5.7723, last reward: -7.4152, gradient norm: 5.608: 12%|█▏ | 76/625 [00:10<01:12, 7.59it/s]
reward: -5.7723, last reward: -7.4152, gradient norm: 5.608: 12%|█▏ | 77/625 [00:10<01:12, 7.60it/s]
reward: -6.1604, last reward: -8.0898, gradient norm: 4.389: 12%|█▏ | 77/625 [00:10<01:12, 7.60it/s]
reward: -6.1604, last reward: -8.0898, gradient norm: 4.389: 12%|█▏ | 78/625 [00:10<01:12, 7.59it/s]
reward: -6.5155, last reward: -5.5376, gradient norm: 36.34: 12%|█▏ | 78/625 [00:10<01:12, 7.59it/s]
reward: -6.5155, last reward: -5.5376, gradient norm: 36.34: 13%|█▎ | 79/625 [00:10<01:11, 7.59it/s]
reward: -6.5616, last reward: -6.4094, gradient norm: 8.283: 13%|█▎ | 79/625 [00:10<01:11, 7.59it/s]
reward: -6.5616, last reward: -6.4094, gradient norm: 8.283: 13%|█▎ | 80/625 [00:10<01:11, 7.59it/s]
reward: -6.5333, last reward: -7.4803, gradient norm: 5.895: 13%|█▎ | 80/625 [00:10<01:11, 7.59it/s]
reward: -6.5333, last reward: -7.4803, gradient norm: 5.895: 13%|█▎ | 81/625 [00:10<01:11, 7.57it/s]
reward: -6.6566, last reward: -5.2588, gradient norm: 7.662: 13%|█▎ | 81/625 [00:10<01:11, 7.57it/s]
reward: -6.6566, last reward: -5.2588, gradient norm: 7.662: 13%|█▎ | 82/625 [00:10<01:11, 7.59it/s]
reward: -6.4732, last reward: -6.7503, gradient norm: 6.068: 13%|█▎ | 82/625 [00:10<01:11, 7.59it/s]
reward: -6.4732, last reward: -6.7503, gradient norm: 6.068: 13%|█▎ | 83/625 [00:10<01:11, 7.59it/s]
reward: -6.0714, last reward: -7.3370, gradient norm: 8.059: 13%|█▎ | 83/625 [00:11<01:11, 7.59it/s]
reward: -6.0714, last reward: -7.3370, gradient norm: 8.059: 13%|█▎ | 84/625 [00:11<01:11, 7.60it/s]
reward: -5.8612, last reward: -6.1915, gradient norm: 9.3: 13%|█▎ | 84/625 [00:11<01:11, 7.60it/s]
reward: -5.8612, last reward: -6.1915, gradient norm: 9.3: 14%|█▎ | 85/625 [00:11<01:11, 7.60it/s]
reward: -5.3855, last reward: -5.0349, gradient norm: 15.2: 14%|█▎ | 85/625 [00:11<01:11, 7.60it/s]
reward: -5.3855, last reward: -5.0349, gradient norm: 15.2: 14%|█▍ | 86/625 [00:11<01:10, 7.60it/s]
reward: -4.9644, last reward: -3.4538, gradient norm: 3.445: 14%|█▍ | 86/625 [00:11<01:10, 7.60it/s]
reward: -4.9644, last reward: -3.4538, gradient norm: 3.445: 14%|█▍ | 87/625 [00:11<01:10, 7.60it/s]
reward: -5.0392, last reward: -4.4080, gradient norm: 11.45: 14%|█▍ | 87/625 [00:11<01:10, 7.60it/s]
reward: -5.0392, last reward: -4.4080, gradient norm: 11.45: 14%|█▍ | 88/625 [00:11<01:10, 7.60it/s]
reward: -5.1648, last reward: -5.9599, gradient norm: 143.4: 14%|█▍ | 88/625 [00:11<01:10, 7.60it/s]
reward: -5.1648, last reward: -5.9599, gradient norm: 143.4: 14%|█▍ | 89/625 [00:11<01:10, 7.61it/s]
reward: -5.4284, last reward: -5.5946, gradient norm: 10.3: 14%|█▍ | 89/625 [00:11<01:10, 7.61it/s]
reward: -5.4284, last reward: -5.5946, gradient norm: 10.3: 14%|█▍ | 90/625 [00:11<01:10, 7.61it/s]
reward: -5.2590, last reward: -5.9181, gradient norm: 11.15: 14%|█▍ | 90/625 [00:12<01:10, 7.61it/s]
reward: -5.2590, last reward: -5.9181, gradient norm: 11.15: 15%|█▍ | 91/625 [00:12<01:10, 7.60it/s]
reward: -5.4621, last reward: -5.9075, gradient norm: 8.674: 15%|█▍ | 91/625 [00:12<01:10, 7.60it/s]
reward: -5.4621, last reward: -5.9075, gradient norm: 8.674: 15%|█▍ | 92/625 [00:12<01:10, 7.61it/s]
reward: -5.1772, last reward: -4.9444, gradient norm: 8.351: 15%|█▍ | 92/625 [00:12<01:10, 7.61it/s]
reward: -5.1772, last reward: -4.9444, gradient norm: 8.351: 15%|█▍ | 93/625 [00:12<01:09, 7.60it/s]
reward: -4.9391, last reward: -4.5595, gradient norm: 8.1: 15%|█▍ | 93/625 [00:12<01:09, 7.60it/s]
reward: -4.9391, last reward: -4.5595, gradient norm: 8.1: 15%|█▌ | 94/625 [00:12<01:09, 7.60it/s]
reward: -4.8673, last reward: -4.6240, gradient norm: 14.43: 15%|█▌ | 94/625 [00:12<01:09, 7.60it/s]
reward: -4.8673, last reward: -4.6240, gradient norm: 14.43: 15%|█▌ | 95/625 [00:12<01:09, 7.60it/s]
reward: -4.5919, last reward: -5.0018, gradient norm: 26.09: 15%|█▌ | 95/625 [00:12<01:09, 7.60it/s]
reward: -4.5919, last reward: -5.0018, gradient norm: 26.09: 15%|█▌ | 96/625 [00:12<01:09, 7.56it/s]
reward: -5.1071, last reward: -3.9127, gradient norm: 2.251: 15%|█▌ | 96/625 [00:12<01:09, 7.56it/s]
reward: -5.1071, last reward: -3.9127, gradient norm: 2.251: 16%|█▌ | 97/625 [00:12<01:09, 7.57it/s]
reward: -4.9799, last reward: -5.3131, gradient norm: 19.65: 16%|█▌ | 97/625 [00:12<01:09, 7.57it/s]
reward: -4.9799, last reward: -5.3131, gradient norm: 19.65: 16%|█▌ | 98/625 [00:12<01:09, 7.59it/s]
reward: -4.9612, last reward: -3.9705, gradient norm: 12.55: 16%|█▌ | 98/625 [00:13<01:09, 7.59it/s]
reward: -4.9612, last reward: -3.9705, gradient norm: 12.55: 16%|█▌ | 99/625 [00:13<01:09, 7.59it/s]
reward: -4.8741, last reward: -4.2230, gradient norm: 6.19: 16%|█▌ | 99/625 [00:13<01:09, 7.59it/s]
reward: -4.8741, last reward: -4.2230, gradient norm: 6.19: 16%|█▌ | 100/625 [00:13<01:09, 7.60it/s]
reward: -5.0972, last reward: -5.0337, gradient norm: 11.86: 16%|█▌ | 100/625 [00:13<01:09, 7.60it/s]
reward: -5.0972, last reward: -5.0337, gradient norm: 11.86: 16%|█▌ | 101/625 [00:13<01:08, 7.60it/s]
reward: -5.0350, last reward: -5.0654, gradient norm: 10.83: 16%|█▌ | 101/625 [00:13<01:08, 7.60it/s]
reward: -5.0350, last reward: -5.0654, gradient norm: 10.83: 16%|█▋ | 102/625 [00:13<01:08, 7.60it/s]
reward: -5.2441, last reward: -4.4596, gradient norm: 7.362: 16%|█▋ | 102/625 [00:13<01:08, 7.60it/s]
reward: -5.2441, last reward: -4.4596, gradient norm: 7.362: 16%|█▋ | 103/625 [00:13<01:08, 7.60it/s]
reward: -5.1664, last reward: -5.4362, gradient norm: 8.171: 16%|█▋ | 103/625 [00:13<01:08, 7.60it/s]
reward: -5.1664, last reward: -5.4362, gradient norm: 8.171: 17%|█▋ | 104/625 [00:13<01:08, 7.60it/s]
reward: -5.4041, last reward: -5.6907, gradient norm: 7.77: 17%|█▋ | 104/625 [00:13<01:08, 7.60it/s]
reward: -5.4041, last reward: -5.6907, gradient norm: 7.77: 17%|█▋ | 105/625 [00:13<01:08, 7.58it/s]
reward: -5.4664, last reward: -6.2760, gradient norm: 11.19: 17%|█▋ | 105/625 [00:13<01:08, 7.58it/s]
reward: -5.4664, last reward: -6.2760, gradient norm: 11.19: 17%|█▋ | 106/625 [00:13<01:08, 7.59it/s]
reward: -5.0299, last reward: -3.9712, gradient norm: 9.349: 17%|█▋ | 106/625 [00:14<01:08, 7.59it/s]
reward: -5.0299, last reward: -3.9712, gradient norm: 9.349: 17%|█▋ | 107/625 [00:14<01:08, 7.59it/s]
reward: -4.3332, last reward: -2.4479, gradient norm: 5.772: 17%|█▋ | 107/625 [00:14<01:08, 7.59it/s]
reward: -4.3332, last reward: -2.4479, gradient norm: 5.772: 17%|█▋ | 108/625 [00:14<01:08, 7.59it/s]
reward: -4.4357, last reward: -2.9591, gradient norm: 4.543: 17%|█▋ | 108/625 [00:14<01:08, 7.59it/s]
reward: -4.4357, last reward: -2.9591, gradient norm: 4.543: 17%|█▋ | 109/625 [00:14<01:08, 7.58it/s]
reward: -4.6216, last reward: -3.1353, gradient norm: 4.692: 17%|█▋ | 109/625 [00:14<01:08, 7.58it/s]
reward: -4.6216, last reward: -3.1353, gradient norm: 4.692: 18%|█▊ | 110/625 [00:14<01:07, 7.58it/s]
reward: -4.6261, last reward: -3.7086, gradient norm: 4.496: 18%|█▊ | 110/625 [00:14<01:07, 7.58it/s]
reward: -4.6261, last reward: -3.7086, gradient norm: 4.496: 18%|█▊ | 111/625 [00:14<01:07, 7.59it/s]
reward: -4.7758, last reward: -5.9818, gradient norm: 21.71: 18%|█▊ | 111/625 [00:14<01:07, 7.59it/s]
reward: -4.7758, last reward: -5.9818, gradient norm: 21.71: 18%|█▊ | 112/625 [00:14<01:07, 7.58it/s]
reward: -4.7772, last reward: -7.5055, gradient norm: 62.86: 18%|█▊ | 112/625 [00:14<01:07, 7.58it/s]
reward: -4.7772, last reward: -7.5055, gradient norm: 62.86: 18%|█▊ | 113/625 [00:14<01:07, 7.56it/s]
reward: -4.5840, last reward: -5.3180, gradient norm: 18.74: 18%|█▊ | 113/625 [00:15<01:07, 7.56it/s]
reward: -4.5840, last reward: -5.3180, gradient norm: 18.74: 18%|█▊ | 114/625 [00:15<01:07, 7.56it/s]
reward: -4.2976, last reward: -3.2083, gradient norm: 10.63: 18%|█▊ | 114/625 [00:15<01:07, 7.56it/s]
reward: -4.2976, last reward: -3.2083, gradient norm: 10.63: 18%|█▊ | 115/625 [00:15<01:07, 7.58it/s]
reward: -4.5275, last reward: -3.6873, gradient norm: 15.65: 18%|█▊ | 115/625 [00:15<01:07, 7.58it/s]
reward: -4.5275, last reward: -3.6873, gradient norm: 15.65: 19%|█▊ | 116/625 [00:15<01:07, 7.59it/s]
reward: -4.4107, last reward: -3.1624, gradient norm: 19.7: 19%|█▊ | 116/625 [00:15<01:07, 7.59it/s]
reward: -4.4107, last reward: -3.1624, gradient norm: 19.7: 19%|█▊ | 117/625 [00:15<01:07, 7.58it/s]
reward: -4.6372, last reward: -3.2571, gradient norm: 15.83: 19%|█▊ | 117/625 [00:15<01:07, 7.58it/s]
reward: -4.6372, last reward: -3.2571, gradient norm: 15.83: 19%|█▉ | 118/625 [00:15<01:06, 7.58it/s]
reward: -4.4039, last reward: -4.4428, gradient norm: 13.06: 19%|█▉ | 118/625 [00:15<01:06, 7.58it/s]
reward: -4.4039, last reward: -4.4428, gradient norm: 13.06: 19%|█▉ | 119/625 [00:15<01:06, 7.59it/s]
reward: -4.4728, last reward: -3.5628, gradient norm: 12.04: 19%|█▉ | 119/625 [00:15<01:06, 7.59it/s]
reward: -4.4728, last reward: -3.5628, gradient norm: 12.04: 19%|█▉ | 120/625 [00:15<01:06, 7.60it/s]
reward: -4.6767, last reward: -5.2466, gradient norm: 6.522: 19%|█▉ | 120/625 [00:15<01:06, 7.60it/s]
reward: -4.6767, last reward: -5.2466, gradient norm: 6.522: 19%|█▉ | 121/625 [00:15<01:06, 7.58it/s]
reward: -4.5873, last reward: -6.5072, gradient norm: 19.21: 19%|█▉ | 121/625 [00:16<01:06, 7.58it/s]
reward: -4.5873, last reward: -6.5072, gradient norm: 19.21: 20%|█▉ | 122/625 [00:16<01:06, 7.59it/s]
reward: -4.6548, last reward: -6.3766, gradient norm: 5.692: 20%|█▉ | 122/625 [00:16<01:06, 7.59it/s]
reward: -4.6548, last reward: -6.3766, gradient norm: 5.692: 20%|█▉ | 123/625 [00:16<01:06, 7.59it/s]
reward: -4.5134, last reward: -7.1955, gradient norm: 11.11: 20%|█▉ | 123/625 [00:16<01:06, 7.59it/s]
reward: -4.5134, last reward: -7.1955, gradient norm: 11.11: 20%|█▉ | 124/625 [00:16<01:06, 7.59it/s]
reward: -4.2481, last reward: -7.0591, gradient norm: 11.85: 20%|█▉ | 124/625 [00:16<01:06, 7.59it/s]
reward: -4.2481, last reward: -7.0591, gradient norm: 11.85: 20%|██ | 125/625 [00:16<01:05, 7.59it/s]
reward: -4.4500, last reward: -5.3368, gradient norm: 10.19: 20%|██ | 125/625 [00:16<01:05, 7.59it/s]
reward: -4.4500, last reward: -5.3368, gradient norm: 10.19: 20%|██ | 126/625 [00:16<01:05, 7.60it/s]
reward: -3.9708, last reward: -2.7059, gradient norm: 42.81: 20%|██ | 126/625 [00:16<01:05, 7.60it/s]
reward: -3.9708, last reward: -2.7059, gradient norm: 42.81: 20%|██ | 127/625 [00:16<01:05, 7.60it/s]
reward: -4.3031, last reward: -3.2534, gradient norm: 4.843: 20%|██ | 127/625 [00:16<01:05, 7.60it/s]
reward: -4.3031, last reward: -3.2534, gradient norm: 4.843: 20%|██ | 128/625 [00:16<01:05, 7.60it/s]
reward: -4.3327, last reward: -4.6193, gradient norm: 20.96: 20%|██ | 128/625 [00:17<01:05, 7.60it/s]
reward: -4.3327, last reward: -4.6193, gradient norm: 20.96: 21%|██ | 129/625 [00:17<01:05, 7.60it/s]
reward: -4.4831, last reward: -4.1172, gradient norm: 24.81: 21%|██ | 129/625 [00:17<01:05, 7.60it/s]
reward: -4.4831, last reward: -4.1172, gradient norm: 24.81: 21%|██ | 130/625 [00:17<01:05, 7.60it/s]
reward: -4.2593, last reward: -4.4219, gradient norm: 5.962: 21%|██ | 130/625 [00:17<01:05, 7.60it/s]
reward: -4.2593, last reward: -4.4219, gradient norm: 5.962: 21%|██ | 131/625 [00:17<01:04, 7.60it/s]
reward: -4.4800, last reward: -3.8380, gradient norm: 2.899: 21%|██ | 131/625 [00:17<01:04, 7.60it/s]
reward: -4.4800, last reward: -3.8380, gradient norm: 2.899: 21%|██ | 132/625 [00:17<01:04, 7.60it/s]
reward: -4.2721, last reward: -4.9048, gradient norm: 7.166: 21%|██ | 132/625 [00:17<01:04, 7.60it/s]
reward: -4.2721, last reward: -4.9048, gradient norm: 7.166: 21%|██▏ | 133/625 [00:17<01:04, 7.60it/s]
reward: -4.2419, last reward: -4.5248, gradient norm: 25.93: 21%|██▏ | 133/625 [00:17<01:04, 7.60it/s]
reward: -4.2419, last reward: -4.5248, gradient norm: 25.93: 21%|██▏ | 134/625 [00:17<01:04, 7.61it/s]
reward: -4.2139, last reward: -4.4278, gradient norm: 20.26: 21%|██▏ | 134/625 [00:17<01:04, 7.61it/s]
reward: -4.2139, last reward: -4.4278, gradient norm: 20.26: 22%|██▏ | 135/625 [00:17<01:04, 7.61it/s]
reward: -4.0690, last reward: -2.5140, gradient norm: 22.5: 22%|██▏ | 135/625 [00:17<01:04, 7.61it/s]
reward: -4.0690, last reward: -2.5140, gradient norm: 22.5: 22%|██▏ | 136/625 [00:17<01:04, 7.60it/s]
reward: -4.1140, last reward: -3.7402, gradient norm: 11.11: 22%|██▏ | 136/625 [00:18<01:04, 7.60it/s]
reward: -4.1140, last reward: -3.7402, gradient norm: 11.11: 22%|██▏ | 137/625 [00:18<01:04, 7.60it/s]
reward: -4.5356, last reward: -5.1636, gradient norm: 400.1: 22%|██▏ | 137/625 [00:18<01:04, 7.60it/s]
reward: -4.5356, last reward: -5.1636, gradient norm: 400.1: 22%|██▏ | 138/625 [00:18<01:04, 7.61it/s]
reward: -5.0671, last reward: -5.8798, gradient norm: 13.34: 22%|██▏ | 138/625 [00:18<01:04, 7.61it/s]
reward: -5.0671, last reward: -5.8798, gradient norm: 13.34: 22%|██▏ | 139/625 [00:18<01:03, 7.60it/s]
reward: -4.8918, last reward: -6.3298, gradient norm: 7.307: 22%|██▏ | 139/625 [00:18<01:03, 7.60it/s]
reward: -4.8918, last reward: -6.3298, gradient norm: 7.307: 22%|██▏ | 140/625 [00:18<01:03, 7.60it/s]
reward: -5.1779, last reward: -4.1915, gradient norm: 11.43: 22%|██▏ | 140/625 [00:18<01:03, 7.60it/s]
reward: -5.1779, last reward: -4.1915, gradient norm: 11.43: 23%|██▎ | 141/625 [00:18<01:03, 7.59it/s]
reward: -5.1771, last reward: -4.3624, gradient norm: 6.936: 23%|██▎ | 141/625 [00:18<01:03, 7.59it/s]
reward: -5.1771, last reward: -4.3624, gradient norm: 6.936: 23%|██▎ | 142/625 [00:18<01:03, 7.59it/s]
reward: -5.1683, last reward: -3.4810, gradient norm: 13.29: 23%|██▎ | 142/625 [00:18<01:03, 7.59it/s]
reward: -5.1683, last reward: -3.4810, gradient norm: 13.29: 23%|██▎ | 143/625 [00:18<01:03, 7.60it/s]
reward: -4.9373, last reward: -5.4435, gradient norm: 19.33: 23%|██▎ | 143/625 [00:18<01:03, 7.60it/s]
reward: -4.9373, last reward: -5.4435, gradient norm: 19.33: 23%|██▎ | 144/625 [00:18<01:03, 7.60it/s]
reward: -4.4396, last reward: -4.8092, gradient norm: 118.9: 23%|██▎ | 144/625 [00:19<01:03, 7.60it/s]
reward: -4.4396, last reward: -4.8092, gradient norm: 118.9: 23%|██▎ | 145/625 [00:19<01:03, 7.60it/s]
reward: -4.3911, last reward: -8.2572, gradient norm: 15.04: 23%|██▎ | 145/625 [00:19<01:03, 7.60it/s]
reward: -4.3911, last reward: -8.2572, gradient norm: 15.04: 23%|██▎ | 146/625 [00:19<01:03, 7.60it/s]
reward: -4.4212, last reward: -3.0260, gradient norm: 26.01: 23%|██▎ | 146/625 [00:19<01:03, 7.60it/s]
reward: -4.4212, last reward: -3.0260, gradient norm: 26.01: 24%|██▎ | 147/625 [00:19<01:02, 7.60it/s]
reward: -4.0939, last reward: -4.6478, gradient norm: 9.605: 24%|██▎ | 147/625 [00:19<01:02, 7.60it/s]
reward: -4.0939, last reward: -4.6478, gradient norm: 9.605: 24%|██▎ | 148/625 [00:19<01:02, 7.60it/s]
reward: -4.6606, last reward: -4.7289, gradient norm: 11.19: 24%|██▎ | 148/625 [00:19<01:02, 7.60it/s]
reward: -4.6606, last reward: -4.7289, gradient norm: 11.19: 24%|██▍ | 149/625 [00:19<01:02, 7.60it/s]
reward: -4.9300, last reward: -4.7193, gradient norm: 8.563: 24%|██▍ | 149/625 [00:19<01:02, 7.60it/s]
reward: -4.9300, last reward: -4.7193, gradient norm: 8.563: 24%|██▍ | 150/625 [00:19<01:02, 7.60it/s]
reward: -5.1166, last reward: -4.8514, gradient norm: 8.384: 24%|██▍ | 150/625 [00:19<01:02, 7.60it/s]
reward: -5.1166, last reward: -4.8514, gradient norm: 8.384: 24%|██▍ | 151/625 [00:19<01:02, 7.60it/s]
reward: -4.9108, last reward: -5.0672, gradient norm: 9.292: 24%|██▍ | 151/625 [00:20<01:02, 7.60it/s]
reward: -4.9108, last reward: -5.0672, gradient norm: 9.292: 24%|██▍ | 152/625 [00:20<01:02, 7.60it/s]
reward: -4.8591, last reward: -4.3768, gradient norm: 9.72: 24%|██▍ | 152/625 [00:20<01:02, 7.60it/s]
reward: -4.8591, last reward: -4.3768, gradient norm: 9.72: 24%|██▍ | 153/625 [00:20<01:02, 7.59it/s]
reward: -4.2721, last reward: -3.9976, gradient norm: 10.37: 24%|██▍ | 153/625 [00:20<01:02, 7.59it/s]
reward: -4.2721, last reward: -3.9976, gradient norm: 10.37: 25%|██▍ | 154/625 [00:20<01:01, 7.60it/s]
reward: -4.0576, last reward: -2.0067, gradient norm: 8.935: 25%|██▍ | 154/625 [00:20<01:01, 7.60it/s]
reward: -4.0576, last reward: -2.0067, gradient norm: 8.935: 25%|██▍ | 155/625 [00:20<01:01, 7.60it/s]
reward: -4.4199, last reward: -5.1722, gradient norm: 18.7: 25%|██▍ | 155/625 [00:20<01:01, 7.60it/s]
reward: -4.4199, last reward: -5.1722, gradient norm: 18.7: 25%|██▍ | 156/625 [00:20<01:01, 7.60it/s]
reward: -4.8310, last reward: -7.3466, gradient norm: 28.52: 25%|██▍ | 156/625 [00:20<01:01, 7.60it/s]
reward: -4.8310, last reward: -7.3466, gradient norm: 28.52: 25%|██▌ | 157/625 [00:20<01:01, 7.60it/s]
reward: -4.8631, last reward: -6.2492, gradient norm: 89.17: 25%|██▌ | 157/625 [00:20<01:01, 7.60it/s]
reward: -4.8631, last reward: -6.2492, gradient norm: 89.17: 25%|██▌ | 158/625 [00:20<01:01, 7.60it/s]
reward: -4.8763, last reward: -6.1277, gradient norm: 24.43: 25%|██▌ | 158/625 [00:20<01:01, 7.60it/s]
reward: -4.8763, last reward: -6.1277, gradient norm: 24.43: 25%|██▌ | 159/625 [00:20<01:01, 7.60it/s]
reward: -4.5562, last reward: -5.7446, gradient norm: 23.35: 25%|██▌ | 159/625 [00:21<01:01, 7.60it/s]
reward: -4.5562, last reward: -5.7446, gradient norm: 23.35: 26%|██▌ | 160/625 [00:21<01:01, 7.60it/s]
reward: -4.1082, last reward: -4.9830, gradient norm: 22.14: 26%|██▌ | 160/625 [00:21<01:01, 7.60it/s]
reward: -4.1082, last reward: -4.9830, gradient norm: 22.14: 26%|██▌ | 161/625 [00:21<01:01, 7.60it/s]
reward: -4.0946, last reward: -2.5229, gradient norm: 10.47: 26%|██▌ | 161/625 [00:21<01:01, 7.60it/s]
reward: -4.0946, last reward: -2.5229, gradient norm: 10.47: 26%|██▌ | 162/625 [00:21<01:00, 7.59it/s]
reward: -4.4574, last reward: -4.6900, gradient norm: 112.6: 26%|██▌ | 162/625 [00:21<01:00, 7.59it/s]
reward: -4.4574, last reward: -4.6900, gradient norm: 112.6: 26%|██▌ | 163/625 [00:21<01:00, 7.60it/s]
reward: -5.2229, last reward: -4.0318, gradient norm: 6.482: 26%|██▌ | 163/625 [00:21<01:00, 7.60it/s]
reward: -5.2229, last reward: -4.0318, gradient norm: 6.482: 26%|██▌ | 164/625 [00:21<01:00, 7.61it/s]
reward: -5.0543, last reward: -4.0817, gradient norm: 5.761: 26%|██▌ | 164/625 [00:21<01:00, 7.61it/s]
reward: -5.0543, last reward: -4.0817, gradient norm: 5.761: 26%|██▋ | 165/625 [00:21<01:00, 7.60it/s]
reward: -5.2809, last reward: -4.5118, gradient norm: 5.366: 26%|██▋ | 165/625 [00:21<01:00, 7.60it/s]
reward: -5.2809, last reward: -4.5118, gradient norm: 5.366: 27%|██▋ | 166/625 [00:21<01:00, 7.60it/s]
reward: -5.1142, last reward: -4.5635, gradient norm: 5.04: 27%|██▋ | 166/625 [00:22<01:00, 7.60it/s]
reward: -5.1142, last reward: -4.5635, gradient norm: 5.04: 27%|██▋ | 167/625 [00:22<01:00, 7.60it/s]
reward: -5.1949, last reward: -4.2327, gradient norm: 4.982: 27%|██▋ | 167/625 [00:22<01:00, 7.60it/s]
reward: -5.1949, last reward: -4.2327, gradient norm: 4.982: 27%|██▋ | 168/625 [00:22<01:00, 7.60it/s]
reward: -5.0967, last reward: -5.0387, gradient norm: 7.457: 27%|██▋ | 168/625 [00:22<01:00, 7.60it/s]
reward: -5.0967, last reward: -5.0387, gradient norm: 7.457: 27%|██▋ | 169/625 [00:22<00:59, 7.60it/s]
reward: -5.0782, last reward: -5.2150, gradient norm: 10.54: 27%|██▋ | 169/625 [00:22<00:59, 7.60it/s]
reward: -5.0782, last reward: -5.2150, gradient norm: 10.54: 27%|██▋ | 170/625 [00:22<00:59, 7.60it/s]
reward: -4.5222, last reward: -4.3725, gradient norm: 22.63: 27%|██▋ | 170/625 [00:22<00:59, 7.60it/s]
reward: -4.5222, last reward: -4.3725, gradient norm: 22.63: 27%|██▋ | 171/625 [00:22<00:59, 7.60it/s]
reward: -3.9288, last reward: -3.9837, gradient norm: 83.59: 27%|██▋ | 171/625 [00:22<00:59, 7.60it/s]
reward: -3.9288, last reward: -3.9837, gradient norm: 83.59: 28%|██▊ | 172/625 [00:22<00:59, 7.60it/s]
reward: -4.1416, last reward: -4.1099, gradient norm: 30.57: 28%|██▊ | 172/625 [00:22<00:59, 7.60it/s]
reward: -4.1416, last reward: -4.1099, gradient norm: 30.57: 28%|██▊ | 173/625 [00:22<01:22, 5.47it/s]
reward: -4.8620, last reward: -6.8475, gradient norm: 18.91: 28%|██▊ | 173/625 [00:23<01:22, 5.47it/s]
reward: -4.8620, last reward: -6.8475, gradient norm: 18.91: 28%|██▊ | 174/625 [00:23<01:15, 5.97it/s]
reward: -5.1807, last reward: -6.4375, gradient norm: 18.48: 28%|██▊ | 174/625 [00:23<01:15, 5.97it/s]
reward: -5.1807, last reward: -6.4375, gradient norm: 18.48: 28%|██▊ | 175/625 [00:23<01:10, 6.38it/s]
reward: -5.1148, last reward: -5.0645, gradient norm: 14.36: 28%|██▊ | 175/625 [00:23<01:10, 6.38it/s]
reward: -5.1148, last reward: -5.0645, gradient norm: 14.36: 28%|██▊ | 176/625 [00:23<01:07, 6.67it/s]
reward: -5.2751, last reward: -4.8313, gradient norm: 15.32: 28%|██▊ | 176/625 [00:23<01:07, 6.67it/s]
reward: -5.2751, last reward: -4.8313, gradient norm: 15.32: 28%|██▊ | 177/625 [00:23<01:05, 6.88it/s]
reward: -4.9286, last reward: -6.9770, gradient norm: 24.75: 28%|██▊ | 177/625 [00:23<01:05, 6.88it/s]
reward: -4.9286, last reward: -6.9770, gradient norm: 24.75: 28%|██▊ | 178/625 [00:23<01:03, 7.06it/s]
reward: -4.5735, last reward: -5.2837, gradient norm: 15.2: 28%|██▊ | 178/625 [00:23<01:03, 7.06it/s]
reward: -4.5735, last reward: -5.2837, gradient norm: 15.2: 29%|██▊ | 179/625 [00:23<01:02, 7.18it/s]
reward: -4.2926, last reward: -1.9489, gradient norm: 18.24: 29%|██▊ | 179/625 [00:23<01:02, 7.18it/s]
reward: -4.2926, last reward: -1.9489, gradient norm: 18.24: 29%|██▉ | 180/625 [00:23<01:01, 7.25it/s]
reward: -4.1507, last reward: -3.5593, gradient norm: 37.66: 29%|██▉ | 180/625 [00:24<01:01, 7.25it/s]
reward: -4.1507, last reward: -3.5593, gradient norm: 37.66: 29%|██▉ | 181/625 [00:24<01:00, 7.34it/s]
reward: -3.8724, last reward: -4.3567, gradient norm: 16.67: 29%|██▉ | 181/625 [00:24<01:00, 7.34it/s]
reward: -3.8724, last reward: -4.3567, gradient norm: 16.67: 29%|██▉ | 182/625 [00:24<01:00, 7.38it/s]
reward: -4.3574, last reward: -3.6140, gradient norm: 13.96: 29%|██▉ | 182/625 [00:24<01:00, 7.38it/s]
reward: -4.3574, last reward: -3.6140, gradient norm: 13.96: 29%|██▉ | 183/625 [00:24<00:59, 7.41it/s]
reward: -4.7895, last reward: -6.2518, gradient norm: 14.74: 29%|██▉ | 183/625 [00:24<00:59, 7.41it/s]
reward: -4.7895, last reward: -6.2518, gradient norm: 14.74: 29%|██▉ | 184/625 [00:24<00:59, 7.44it/s]
reward: -4.6146, last reward: -5.6969, gradient norm: 11.45: 29%|██▉ | 184/625 [00:24<00:59, 7.44it/s]
reward: -4.6146, last reward: -5.6969, gradient norm: 11.45: 30%|██▉ | 185/625 [00:24<00:58, 7.47it/s]
reward: -4.8776, last reward: -5.7358, gradient norm: 13.16: 30%|██▉ | 185/625 [00:24<00:58, 7.47it/s]
reward: -4.8776, last reward: -5.7358, gradient norm: 13.16: 30%|██▉ | 186/625 [00:24<00:58, 7.46it/s]
reward: -4.3722, last reward: -4.8428, gradient norm: 23.57: 30%|██▉ | 186/625 [00:24<00:58, 7.46it/s]
reward: -4.3722, last reward: -4.8428, gradient norm: 23.57: 30%|██▉ | 187/625 [00:24<00:58, 7.48it/s]
reward: -4.2656, last reward: -3.7955, gradient norm: 54.67: 30%|██▉ | 187/625 [00:24<00:58, 7.48it/s]
reward: -4.2656, last reward: -3.7955, gradient norm: 54.67: 30%|███ | 188/625 [00:24<00:58, 7.48it/s]
reward: -4.0092, last reward: -1.7106, gradient norm: 7.829: 30%|███ | 188/625 [00:25<00:58, 7.48it/s]
reward: -4.0092, last reward: -1.7106, gradient norm: 7.829: 30%|███ | 189/625 [00:25<00:58, 7.48it/s]
reward: -4.2264, last reward: -3.6919, gradient norm: 16.17: 30%|███ | 189/625 [00:25<00:58, 7.48it/s]
reward: -4.2264, last reward: -3.6919, gradient norm: 16.17: 30%|███ | 190/625 [00:25<00:58, 7.49it/s]
reward: -4.1438, last reward: -2.1362, gradient norm: 19.43: 30%|███ | 190/625 [00:25<00:58, 7.49it/s]
reward: -4.1438, last reward: -2.1362, gradient norm: 19.43: 31%|███ | 191/625 [00:25<00:57, 7.50it/s]
reward: -4.0618, last reward: -2.8217, gradient norm: 73.63: 31%|███ | 191/625 [00:25<00:57, 7.50it/s]
reward: -4.0618, last reward: -2.8217, gradient norm: 73.63: 31%|███ | 192/625 [00:25<00:57, 7.49it/s]
reward: -3.9420, last reward: -3.6765, gradient norm: 34.1: 31%|███ | 192/625 [00:25<00:57, 7.49it/s]
reward: -3.9420, last reward: -3.6765, gradient norm: 34.1: 31%|███ | 193/625 [00:25<00:57, 7.46it/s]
reward: -3.7745, last reward: -4.0709, gradient norm: 26.48: 31%|███ | 193/625 [00:25<00:57, 7.46it/s]
reward: -3.7745, last reward: -4.0709, gradient norm: 26.48: 31%|███ | 194/625 [00:25<00:57, 7.48it/s]
reward: -3.9478, last reward: -2.6867, gradient norm: 22.82: 31%|███ | 194/625 [00:25<00:57, 7.48it/s]
reward: -3.9478, last reward: -2.6867, gradient norm: 22.82: 31%|███ | 195/625 [00:25<00:57, 7.47it/s]
reward: -3.6507, last reward: -2.6225, gradient norm: 37.44: 31%|███ | 195/625 [00:26<00:57, 7.47it/s]
reward: -3.6507, last reward: -2.6225, gradient norm: 37.44: 31%|███▏ | 196/625 [00:26<00:57, 7.48it/s]
reward: -4.2244, last reward: -3.2195, gradient norm: 10.71: 31%|███▏ | 196/625 [00:26<00:57, 7.48it/s]
reward: -4.2244, last reward: -3.2195, gradient norm: 10.71: 32%|███▏ | 197/625 [00:26<00:57, 7.49it/s]
reward: -4.5385, last reward: -3.9263, gradient norm: 31.03: 32%|███▏ | 197/625 [00:26<00:57, 7.49it/s]
reward: -4.5385, last reward: -3.9263, gradient norm: 31.03: 32%|███▏ | 198/625 [00:26<00:57, 7.48it/s]
reward: -4.1878, last reward: -3.2374, gradient norm: 34.35: 32%|███▏ | 198/625 [00:26<00:57, 7.48it/s]
reward: -4.1878, last reward: -3.2374, gradient norm: 34.35: 32%|███▏ | 199/625 [00:26<00:57, 7.46it/s]
reward: -3.8054, last reward: -2.3504, gradient norm: 5.557: 32%|███▏ | 199/625 [00:26<00:57, 7.46it/s]
reward: -3.8054, last reward: -2.3504, gradient norm: 5.557: 32%|███▏ | 200/625 [00:26<00:56, 7.48it/s]
reward: -4.0766, last reward: -4.6825, gradient norm: 38.72: 32%|███▏ | 200/625 [00:26<00:56, 7.48it/s]
reward: -4.0766, last reward: -4.6825, gradient norm: 38.72: 32%|███▏ | 201/625 [00:26<00:56, 7.47it/s]
reward: -4.2011, last reward: -5.8393, gradient norm: 21.06: 32%|███▏ | 201/625 [00:26<00:56, 7.47it/s]
reward: -4.2011, last reward: -5.8393, gradient norm: 21.06: 32%|███▏ | 202/625 [00:26<00:56, 7.49it/s]
reward: -4.0803, last reward: -3.7815, gradient norm: 10.6: 32%|███▏ | 202/625 [00:26<00:56, 7.49it/s]
reward: -4.0803, last reward: -3.7815, gradient norm: 10.6: 32%|███▏ | 203/625 [00:26<00:56, 7.50it/s]
reward: -3.8363, last reward: -3.2460, gradient norm: 32.57: 32%|███▏ | 203/625 [00:27<00:56, 7.50it/s]
reward: -3.8363, last reward: -3.2460, gradient norm: 32.57: 33%|███▎ | 204/625 [00:27<00:56, 7.49it/s]
reward: -3.8643, last reward: -3.2191, gradient norm: 8.593: 33%|███▎ | 204/625 [00:27<00:56, 7.49it/s]
reward: -3.8643, last reward: -3.2191, gradient norm: 8.593: 33%|███▎ | 205/625 [00:27<00:55, 7.50it/s]
reward: -4.0773, last reward: -5.1343, gradient norm: 14.49: 33%|███▎ | 205/625 [00:27<00:55, 7.50it/s]
reward: -4.0773, last reward: -5.1343, gradient norm: 14.49: 33%|███▎ | 206/625 [00:27<00:55, 7.50it/s]
reward: -4.1400, last reward: -5.8657, gradient norm: 17.05: 33%|███▎ | 206/625 [00:27<00:55, 7.50it/s]
reward: -4.1400, last reward: -5.8657, gradient norm: 17.05: 33%|███▎ | 207/625 [00:27<00:55, 7.49it/s]
reward: -3.9304, last reward: -2.7584, gradient norm: 33.25: 33%|███▎ | 207/625 [00:27<00:55, 7.49it/s]
reward: -3.9304, last reward: -2.7584, gradient norm: 33.25: 33%|███▎ | 208/625 [00:27<00:55, 7.49it/s]
reward: -3.8752, last reward: -4.2307, gradient norm: 10.76: 33%|███▎ | 208/625 [00:27<00:55, 7.49it/s]
reward: -3.8752, last reward: -4.2307, gradient norm: 10.76: 33%|███▎ | 209/625 [00:27<00:55, 7.50it/s]
reward: -3.5250, last reward: -1.4869, gradient norm: 40.8: 33%|███▎ | 209/625 [00:27<00:55, 7.50it/s]
reward: -3.5250, last reward: -1.4869, gradient norm: 40.8: 34%|███▎ | 210/625 [00:27<00:55, 7.51it/s]
reward: -3.7837, last reward: -2.5762, gradient norm: 193.3: 34%|███▎ | 210/625 [00:28<00:55, 7.51it/s]
reward: -3.7837, last reward: -2.5762, gradient norm: 193.3: 34%|███▍ | 211/625 [00:28<00:55, 7.51it/s]
reward: -3.6661, last reward: -1.8600, gradient norm: 136.5: 34%|███▍ | 211/625 [00:28<00:55, 7.51it/s]
reward: -3.6661, last reward: -1.8600, gradient norm: 136.5: 34%|███▍ | 212/625 [00:28<00:54, 7.52it/s]
reward: -4.2502, last reward: -3.1752, gradient norm: 21.44: 34%|███▍ | 212/625 [00:28<00:54, 7.52it/s]
reward: -4.2502, last reward: -3.1752, gradient norm: 21.44: 34%|███▍ | 213/625 [00:28<00:54, 7.50it/s]
reward: -4.3075, last reward: -2.8871, gradient norm: 30.65: 34%|███▍ | 213/625 [00:28<00:54, 7.50it/s]
reward: -4.3075, last reward: -2.8871, gradient norm: 30.65: 34%|███▍ | 214/625 [00:28<00:54, 7.48it/s]
reward: -3.9406, last reward: -2.8090, gradient norm: 20.18: 34%|███▍ | 214/625 [00:28<00:54, 7.48it/s]
reward: -3.9406, last reward: -2.8090, gradient norm: 20.18: 34%|███▍ | 215/625 [00:28<00:54, 7.49it/s]
reward: -3.6291, last reward: -2.8923, gradient norm: 7.876: 34%|███▍ | 215/625 [00:28<00:54, 7.49it/s]
reward: -3.6291, last reward: -2.8923, gradient norm: 7.876: 35%|███▍ | 216/625 [00:28<00:54, 7.47it/s]
reward: -3.5112, last reward: -3.9504, gradient norm: 3.21e+03: 35%|███▍ | 216/625 [00:28<00:54, 7.47it/s]
reward: -3.5112, last reward: -3.9504, gradient norm: 3.21e+03: 35%|███▍ | 217/625 [00:28<00:54, 7.48it/s]
reward: -3.7431, last reward: -2.7880, gradient norm: 13.73: 35%|███▍ | 217/625 [00:28<00:54, 7.48it/s]
reward: -3.7431, last reward: -2.7880, gradient norm: 13.73: 35%|███▍ | 218/625 [00:28<00:54, 7.49it/s]
reward: -3.4463, last reward: -4.5432, gradient norm: 32.37: 35%|███▍ | 218/625 [00:29<00:54, 7.49it/s]
reward: -3.4463, last reward: -4.5432, gradient norm: 32.37: 35%|███▌ | 219/625 [00:29<00:54, 7.49it/s]
reward: -3.3793, last reward: -3.3313, gradient norm: 60.63: 35%|███▌ | 219/625 [00:29<00:54, 7.49it/s]
reward: -3.3793, last reward: -3.3313, gradient norm: 60.63: 35%|███▌ | 220/625 [00:29<00:53, 7.50it/s]
reward: -3.8843, last reward: -3.0369, gradient norm: 5.065: 35%|███▌ | 220/625 [00:29<00:53, 7.50it/s]
reward: -3.8843, last reward: -3.0369, gradient norm: 5.065: 35%|███▌ | 221/625 [00:29<00:53, 7.50it/s]
reward: -3.4828, last reward: -3.8391, gradient norm: 59.85: 35%|███▌ | 221/625 [00:29<00:53, 7.50it/s]
reward: -3.4828, last reward: -3.8391, gradient norm: 59.85: 36%|███▌ | 222/625 [00:29<00:53, 7.52it/s]
reward: -3.6265, last reward: -4.2913, gradient norm: 8.947: 36%|███▌ | 222/625 [00:29<00:53, 7.52it/s]
reward: -3.6265, last reward: -4.2913, gradient norm: 8.947: 36%|███▌ | 223/625 [00:29<00:53, 7.52it/s]
reward: -3.5541, last reward: -4.1252, gradient norm: 255.9: 36%|███▌ | 223/625 [00:29<00:53, 7.52it/s]
reward: -3.5541, last reward: -4.1252, gradient norm: 255.9: 36%|███▌ | 224/625 [00:29<00:53, 7.52it/s]
reward: -3.7342, last reward: -2.2396, gradient norm: 7.995: 36%|███▌ | 224/625 [00:29<00:53, 7.52it/s]
reward: -3.7342, last reward: -2.2396, gradient norm: 7.995: 36%|███▌ | 225/625 [00:29<00:53, 7.43it/s]
reward: -3.5936, last reward: -4.1924, gradient norm: 59.49: 36%|███▌ | 225/625 [00:30<00:53, 7.43it/s]
reward: -3.5936, last reward: -4.1924, gradient norm: 59.49: 36%|███▌ | 226/625 [00:30<00:53, 7.43it/s]
reward: -3.9975, last reward: -4.2045, gradient norm: 21.77: 36%|███▌ | 226/625 [00:30<00:53, 7.43it/s]
reward: -3.9975, last reward: -4.2045, gradient norm: 21.77: 36%|███▋ | 227/625 [00:30<00:53, 7.43it/s]
reward: -3.8367, last reward: -1.9540, gradient norm: 32.26: 36%|███▋ | 227/625 [00:30<00:53, 7.43it/s]
reward: -3.8367, last reward: -1.9540, gradient norm: 32.26: 36%|███▋ | 228/625 [00:30<00:53, 7.46it/s]
reward: -3.7259, last reward: -3.6743, gradient norm: 28.62: 36%|███▋ | 228/625 [00:30<00:53, 7.46it/s]
reward: -3.7259, last reward: -3.6743, gradient norm: 28.62: 37%|███▋ | 229/625 [00:30<00:53, 7.47it/s]
reward: -3.4827, last reward: -3.7528, gradient norm: 64.85: 37%|███▋ | 229/625 [00:30<00:53, 7.47it/s]
reward: -3.4827, last reward: -3.7528, gradient norm: 64.85: 37%|███▋ | 230/625 [00:30<00:53, 7.45it/s]
reward: -3.7361, last reward: -3.8756, gradient norm: 24.69: 37%|███▋ | 230/625 [00:30<00:53, 7.45it/s]
reward: -3.7361, last reward: -3.8756, gradient norm: 24.69: 37%|███▋ | 231/625 [00:30<00:52, 7.45it/s]
reward: -3.7646, last reward: -3.1116, gradient norm: 14.25: 37%|███▋ | 231/625 [00:30<00:52, 7.45it/s]
reward: -3.7646, last reward: -3.1116, gradient norm: 14.25: 37%|███▋ | 232/625 [00:30<00:52, 7.47it/s]
reward: -3.5426, last reward: -2.8385, gradient norm: 34.07: 37%|███▋ | 232/625 [00:30<00:52, 7.47it/s]
reward: -3.5426, last reward: -2.8385, gradient norm: 34.07: 37%|███▋ | 233/625 [00:30<00:52, 7.49it/s]
reward: -3.5662, last reward: -1.8585, gradient norm: 11.26: 37%|███▋ | 233/625 [00:31<00:52, 7.49it/s]
reward: -3.5662, last reward: -1.8585, gradient norm: 11.26: 37%|███▋ | 234/625 [00:31<00:52, 7.49it/s]
reward: -3.8234, last reward: -2.7930, gradient norm: 32.18: 37%|███▋ | 234/625 [00:31<00:52, 7.49it/s]
reward: -3.8234, last reward: -2.7930, gradient norm: 32.18: 38%|███▊ | 235/625 [00:31<00:52, 7.47it/s]
reward: -4.2648, last reward: -4.9309, gradient norm: 24.83: 38%|███▊ | 235/625 [00:31<00:52, 7.47it/s]
reward: -4.2648, last reward: -4.9309, gradient norm: 24.83: 38%|███▊ | 236/625 [00:31<00:51, 7.49it/s]
reward: -4.2039, last reward: -3.6817, gradient norm: 19.24: 38%|███▊ | 236/625 [00:31<00:51, 7.49it/s]
reward: -4.2039, last reward: -3.6817, gradient norm: 19.24: 38%|███▊ | 237/625 [00:31<00:51, 7.49it/s]
reward: -4.0943, last reward: -3.1533, gradient norm: 145.1: 38%|███▊ | 237/625 [00:31<00:51, 7.49it/s]
reward: -4.0943, last reward: -3.1533, gradient norm: 145.1: 38%|███▊ | 238/625 [00:31<00:51, 7.49it/s]
reward: -4.3045, last reward: -3.0483, gradient norm: 20.89: 38%|███▊ | 238/625 [00:31<00:51, 7.49it/s]
reward: -4.3045, last reward: -3.0483, gradient norm: 20.89: 38%|███▊ | 239/625 [00:31<00:51, 7.50it/s]
reward: -4.4128, last reward: -5.2528, gradient norm: 24.97: 38%|███▊ | 239/625 [00:31<00:51, 7.50it/s]
reward: -4.4128, last reward: -5.2528, gradient norm: 24.97: 38%|███▊ | 240/625 [00:31<00:51, 7.48it/s]
reward: -4.6415, last reward: -8.0201, gradient norm: 26.74: 38%|███▊ | 240/625 [00:32<00:51, 7.48it/s]
reward: -4.6415, last reward: -8.0201, gradient norm: 26.74: 39%|███▊ | 241/625 [00:32<00:51, 7.48it/s]
reward: -4.4437, last reward: -5.4365, gradient norm: 132.7: 39%|███▊ | 241/625 [00:32<00:51, 7.48it/s]
reward: -4.4437, last reward: -5.4365, gradient norm: 132.7: 39%|███▊ | 242/625 [00:32<00:51, 7.49it/s]
reward: -4.0358, last reward: -3.4943, gradient norm: 11.46: 39%|███▊ | 242/625 [00:32<00:51, 7.49it/s]
reward: -4.0358, last reward: -3.4943, gradient norm: 11.46: 39%|███▉ | 243/625 [00:32<00:50, 7.49it/s]
reward: -4.1272, last reward: -3.5003, gradient norm: 68.09: 39%|███▉ | 243/625 [00:32<00:50, 7.49it/s]
reward: -4.1272, last reward: -3.5003, gradient norm: 68.09: 39%|███▉ | 244/625 [00:32<00:51, 7.47it/s]
reward: -4.1180, last reward: -4.2637, gradient norm: 39.25: 39%|███▉ | 244/625 [00:32<00:51, 7.47it/s]
reward: -4.1180, last reward: -4.2637, gradient norm: 39.25: 39%|███▉ | 245/625 [00:32<00:50, 7.48it/s]
reward: -4.7197, last reward: -3.0873, gradient norm: 12.2: 39%|███▉ | 245/625 [00:32<00:50, 7.48it/s]
reward: -4.7197, last reward: -3.0873, gradient norm: 12.2: 39%|███▉ | 246/625 [00:32<00:50, 7.47it/s]
reward: -4.2917, last reward: -3.6656, gradient norm: 17.17: 39%|███▉ | 246/625 [00:32<00:50, 7.47it/s]
reward: -4.2917, last reward: -3.6656, gradient norm: 17.17: 40%|███▉ | 247/625 [00:32<00:50, 7.49it/s]
reward: -4.0160, last reward: -3.0738, gradient norm: 43.07: 40%|███▉ | 247/625 [00:32<00:50, 7.49it/s]
reward: -4.0160, last reward: -3.0738, gradient norm: 43.07: 40%|███▉ | 248/625 [00:32<00:50, 7.50it/s]
reward: -4.3689, last reward: -4.0120, gradient norm: 11.81: 40%|███▉ | 248/625 [00:33<00:50, 7.50it/s]
reward: -4.3689, last reward: -4.0120, gradient norm: 11.81: 40%|███▉ | 249/625 [00:33<00:50, 7.48it/s]
reward: -4.5570, last reward: -7.0475, gradient norm: 22.45: 40%|███▉ | 249/625 [00:33<00:50, 7.48it/s]
reward: -4.5570, last reward: -7.0475, gradient norm: 22.45: 40%|████ | 250/625 [00:33<00:50, 7.49it/s]
reward: -4.4423, last reward: -5.2220, gradient norm: 18.4: 40%|████ | 250/625 [00:33<00:50, 7.49it/s]
reward: -4.4423, last reward: -5.2220, gradient norm: 18.4: 40%|████ | 251/625 [00:33<00:49, 7.50it/s]
reward: -4.2118, last reward: -4.6803, gradient norm: 15.86: 40%|████ | 251/625 [00:33<00:49, 7.50it/s]
reward: -4.2118, last reward: -4.6803, gradient norm: 15.86: 40%|████ | 252/625 [00:33<00:49, 7.51it/s]
reward: -4.1465, last reward: -3.7214, gradient norm: 25.93: 40%|████ | 252/625 [00:33<00:49, 7.51it/s]
reward: -4.1465, last reward: -3.7214, gradient norm: 25.93: 40%|████ | 253/625 [00:33<00:49, 7.47it/s]
reward: -3.8801, last reward: -2.7034, gradient norm: 103.6: 40%|████ | 253/625 [00:33<00:49, 7.47it/s]
reward: -3.8801, last reward: -2.7034, gradient norm: 103.6: 41%|████ | 254/625 [00:33<00:49, 7.48it/s]
reward: -3.9136, last reward: -4.4076, gradient norm: 17.63: 41%|████ | 254/625 [00:33<00:49, 7.48it/s]
reward: -3.9136, last reward: -4.4076, gradient norm: 17.63: 41%|████ | 255/625 [00:33<00:49, 7.47it/s]
reward: -3.7589, last reward: -4.5013, gradient norm: 143.3: 41%|████ | 255/625 [00:34<00:49, 7.47it/s]
reward: -3.7589, last reward: -4.5013, gradient norm: 143.3: 41%|████ | 256/625 [00:34<00:49, 7.48it/s]
reward: -3.8150, last reward: -3.2241, gradient norm: 113.9: 41%|████ | 256/625 [00:34<00:49, 7.48it/s]
reward: -3.8150, last reward: -3.2241, gradient norm: 113.9: 41%|████ | 257/625 [00:34<00:49, 7.49it/s]
reward: -4.0753, last reward: -3.8081, gradient norm: 14.8: 41%|████ | 257/625 [00:34<00:49, 7.49it/s]
reward: -4.0753, last reward: -3.8081, gradient norm: 14.8: 41%|████▏ | 258/625 [00:34<00:48, 7.50it/s]
reward: -4.1951, last reward: -4.8314, gradient norm: 27.63: 41%|████▏ | 258/625 [00:34<00:48, 7.50it/s]
reward: -4.1951, last reward: -4.8314, gradient norm: 27.63: 41%|████▏ | 259/625 [00:34<00:48, 7.48it/s]
reward: -4.0038, last reward: -2.5333, gradient norm: 42.85: 41%|████▏ | 259/625 [00:34<00:48, 7.48it/s]
reward: -4.0038, last reward: -2.5333, gradient norm: 42.85: 42%|████▏ | 260/625 [00:34<00:48, 7.50it/s]
reward: -4.0889, last reward: -2.4616, gradient norm: 13.78: 42%|████▏ | 260/625 [00:34<00:48, 7.50it/s]
reward: -4.0889, last reward: -2.4616, gradient norm: 13.78: 42%|████▏ | 261/625 [00:34<00:48, 7.51it/s]
reward: -4.0655, last reward: -2.6873, gradient norm: 10.98: 42%|████▏ | 261/625 [00:34<00:48, 7.51it/s]
reward: -4.0655, last reward: -2.6873, gradient norm: 10.98: 42%|████▏ | 262/625 [00:34<00:48, 7.52it/s]
reward: -3.8333, last reward: -1.9476, gradient norm: 13.47: 42%|████▏ | 262/625 [00:34<00:48, 7.52it/s]
reward: -3.8333, last reward: -1.9476, gradient norm: 13.47: 42%|████▏ | 263/625 [00:34<00:48, 7.54it/s]
reward: -3.7554, last reward: -4.3798, gradient norm: 41.76: 42%|████▏ | 263/625 [00:35<00:48, 7.54it/s]
reward: -3.7554, last reward: -4.3798, gradient norm: 41.76: 42%|████▏ | 264/625 [00:35<00:47, 7.53it/s]
reward: -3.3717, last reward: -2.3947, gradient norm: 6.529: 42%|████▏ | 264/625 [00:35<00:47, 7.53it/s]
reward: -3.3717, last reward: -2.3947, gradient norm: 6.529: 42%|████▏ | 265/625 [00:35<00:47, 7.51it/s]
reward: -4.3060, last reward: -4.6495, gradient norm: 11.24: 42%|████▏ | 265/625 [00:35<00:47, 7.51it/s]
reward: -4.3060, last reward: -4.6495, gradient norm: 11.24: 43%|████▎ | 266/625 [00:35<00:47, 7.51it/s]
reward: -4.7467, last reward: -5.8889, gradient norm: 12.35: 43%|████▎ | 266/625 [00:35<00:47, 7.51it/s]
reward: -4.7467, last reward: -5.8889, gradient norm: 12.35: 43%|████▎ | 267/625 [00:35<00:47, 7.50it/s]
reward: -4.9281, last reward: -4.8457, gradient norm: 6.591: 43%|████▎ | 267/625 [00:35<00:47, 7.50it/s]
reward: -4.9281, last reward: -4.8457, gradient norm: 6.591: 43%|████▎ | 268/625 [00:35<00:47, 7.49it/s]
reward: -4.7137, last reward: -4.0536, gradient norm: 5.771: 43%|████▎ | 268/625 [00:35<00:47, 7.49it/s]
reward: -4.7137, last reward: -4.0536, gradient norm: 5.771: 43%|████▎ | 269/625 [00:35<00:47, 7.50it/s]
reward: -4.7197, last reward: -4.1651, gradient norm: 5.388: 43%|████▎ | 269/625 [00:35<00:47, 7.50it/s]
reward: -4.7197, last reward: -4.1651, gradient norm: 5.388: 43%|████▎ | 270/625 [00:35<00:47, 7.50it/s]
reward: -4.8246, last reward: -5.5709, gradient norm: 8.281: 43%|████▎ | 270/625 [00:36<00:47, 7.50it/s]
reward: -4.8246, last reward: -5.5709, gradient norm: 8.281: 43%|████▎ | 271/625 [00:36<00:47, 7.48it/s]
reward: -4.7502, last reward: -5.0521, gradient norm: 9.032: 43%|████▎ | 271/625 [00:36<00:47, 7.48it/s]
reward: -4.7502, last reward: -5.0521, gradient norm: 9.032: 44%|████▎ | 272/625 [00:36<00:47, 7.50it/s]
reward: -4.5475, last reward: -4.7253, gradient norm: 21.18: 44%|████▎ | 272/625 [00:36<00:47, 7.50it/s]
reward: -4.5475, last reward: -4.7253, gradient norm: 21.18: 44%|████▎ | 273/625 [00:36<00:46, 7.49it/s]
reward: -4.2856, last reward: -3.7130, gradient norm: 13.53: 44%|████▎ | 273/625 [00:36<00:46, 7.49it/s]
reward: -4.2856, last reward: -3.7130, gradient norm: 13.53: 44%|████▍ | 274/625 [00:36<00:46, 7.49it/s]
reward: -3.2778, last reward: -3.4122, gradient norm: 28.52: 44%|████▍ | 274/625 [00:36<00:46, 7.49it/s]
reward: -3.2778, last reward: -3.4122, gradient norm: 28.52: 44%|████▍ | 275/625 [00:36<00:46, 7.50it/s]
reward: -3.8368, last reward: -2.1841, gradient norm: 2.07: 44%|████▍ | 275/625 [00:36<00:46, 7.50it/s]
reward: -3.8368, last reward: -2.1841, gradient norm: 2.07: 44%|████▍ | 276/625 [00:36<00:46, 7.50it/s]
reward: -3.9622, last reward: -3.1603, gradient norm: 1.003e+03: 44%|████▍ | 276/625 [00:36<00:46, 7.50it/s]
reward: -3.9622, last reward: -3.1603, gradient norm: 1.003e+03: 44%|████▍ | 277/625 [00:36<00:46, 7.48it/s]
reward: -4.0247, last reward: -2.9830, gradient norm: 8.346: 44%|████▍ | 277/625 [00:36<00:46, 7.48it/s]
reward: -4.0247, last reward: -2.9830, gradient norm: 8.346: 44%|████▍ | 278/625 [00:36<00:46, 7.49it/s]
reward: -4.2238, last reward: -4.6418, gradient norm: 14.55: 44%|████▍ | 278/625 [00:37<00:46, 7.49it/s]
reward: -4.2238, last reward: -4.6418, gradient norm: 14.55: 45%|████▍ | 279/625 [00:37<00:46, 7.49it/s]
reward: -4.0626, last reward: -4.2538, gradient norm: 17.88: 45%|████▍ | 279/625 [00:37<00:46, 7.49it/s]
reward: -4.0626, last reward: -4.2538, gradient norm: 17.88: 45%|████▍ | 280/625 [00:37<00:46, 7.47it/s]
reward: -4.0149, last reward: -3.7380, gradient norm: 13.13: 45%|████▍ | 280/625 [00:37<00:46, 7.47it/s]
reward: -4.0149, last reward: -3.7380, gradient norm: 13.13: 45%|████▍ | 281/625 [00:37<00:46, 7.46it/s]
reward: -4.2167, last reward: -2.8911, gradient norm: 11.41: 45%|████▍ | 281/625 [00:37<00:46, 7.46it/s]
reward: -4.2167, last reward: -2.8911, gradient norm: 11.41: 45%|████▌ | 282/625 [00:37<00:45, 7.46it/s]
reward: -3.8725, last reward: -4.1983, gradient norm: 18.88: 45%|████▌ | 282/625 [00:37<00:45, 7.46it/s]
reward: -3.8725, last reward: -4.1983, gradient norm: 18.88: 45%|████▌ | 283/625 [00:37<00:45, 7.46it/s]
reward: -2.8142, last reward: -2.3709, gradient norm: 43.73: 45%|████▌ | 283/625 [00:37<00:45, 7.46it/s]
reward: -2.8142, last reward: -2.3709, gradient norm: 43.73: 45%|████▌ | 284/625 [00:37<00:45, 7.48it/s]
reward: -3.2022, last reward: -2.4989, gradient norm: 11.14: 45%|████▌ | 284/625 [00:37<00:45, 7.48it/s]
reward: -3.2022, last reward: -2.4989, gradient norm: 11.14: 46%|████▌ | 285/625 [00:37<00:45, 7.43it/s]
reward: -3.6464, last reward: -1.6210, gradient norm: 43.37: 46%|████▌ | 285/625 [00:38<00:45, 7.43it/s]
reward: -3.6464, last reward: -1.6210, gradient norm: 43.37: 46%|████▌ | 286/625 [00:38<00:45, 7.47it/s]
reward: -3.9726, last reward: -3.0820, gradient norm: 39.93: 46%|████▌ | 286/625 [00:38<00:45, 7.47it/s]
reward: -3.9726, last reward: -3.0820, gradient norm: 39.93: 46%|████▌ | 287/625 [00:38<00:44, 7.51it/s]
reward: -3.6975, last reward: -2.9091, gradient norm: 29.46: 46%|████▌ | 287/625 [00:38<00:44, 7.51it/s]
reward: -3.6975, last reward: -2.9091, gradient norm: 29.46: 46%|████▌ | 288/625 [00:38<00:44, 7.53it/s]
reward: -3.4926, last reward: -2.4791, gradient norm: 160.7: 46%|████▌ | 288/625 [00:38<00:44, 7.53it/s]
reward: -3.4926, last reward: -2.4791, gradient norm: 160.7: 46%|████▌ | 289/625 [00:38<00:44, 7.55it/s]
reward: -3.0905, last reward: -1.3500, gradient norm: 31.38: 46%|████▌ | 289/625 [00:38<00:44, 7.55it/s]
reward: -3.0905, last reward: -1.3500, gradient norm: 31.38: 46%|████▋ | 290/625 [00:38<00:44, 7.55it/s]
reward: -3.2287, last reward: -2.7137, gradient norm: 26.31: 46%|████▋ | 290/625 [00:38<00:44, 7.55it/s]
reward: -3.2287, last reward: -2.7137, gradient norm: 26.31: 47%|████▋ | 291/625 [00:38<00:44, 7.57it/s]
reward: -2.9918, last reward: -1.5543, gradient norm: 29.73: 47%|████▋ | 291/625 [00:38<00:44, 7.57it/s]
reward: -2.9918, last reward: -1.5543, gradient norm: 29.73: 47%|████▋ | 292/625 [00:38<00:43, 7.57it/s]
reward: -2.9245, last reward: -0.6444, gradient norm: 2.631: 47%|████▋ | 292/625 [00:38<00:43, 7.57it/s]
reward: -2.9245, last reward: -0.6444, gradient norm: 2.631: 47%|████▋ | 293/625 [00:38<00:43, 7.57it/s]
reward: -3.0448, last reward: -0.4769, gradient norm: 7.266: 47%|████▋ | 293/625 [00:39<00:43, 7.57it/s]
reward: -3.0448, last reward: -0.4769, gradient norm: 7.266: 47%|████▋ | 294/625 [00:39<00:43, 7.57it/s]
reward: -2.8566, last reward: -1.7208, gradient norm: 25.22: 47%|████▋ | 294/625 [00:39<00:43, 7.57it/s]
reward: -2.8566, last reward: -1.7208, gradient norm: 25.22: 47%|████▋ | 295/625 [00:39<00:43, 7.55it/s]
reward: -2.8872, last reward: -1.0966, gradient norm: 8.247: 47%|████▋ | 295/625 [00:39<00:43, 7.55it/s]
reward: -2.8872, last reward: -1.0966, gradient norm: 8.247: 47%|████▋ | 296/625 [00:39<00:43, 7.56it/s]
reward: -2.5303, last reward: -0.1537, gradient norm: 2.023: 47%|████▋ | 296/625 [00:39<00:43, 7.56it/s]
reward: -2.5303, last reward: -0.1537, gradient norm: 2.023: 48%|████▊ | 297/625 [00:39<00:43, 7.56it/s]
reward: -2.6817, last reward: -0.2682, gradient norm: 7.564: 48%|████▊ | 297/625 [00:39<00:43, 7.56it/s]
reward: -2.6817, last reward: -0.2682, gradient norm: 7.564: 48%|████▊ | 298/625 [00:39<00:43, 7.56it/s]
reward: -2.4318, last reward: -0.5063, gradient norm: 14.87: 48%|████▊ | 298/625 [00:39<00:43, 7.56it/s]
reward: -2.4318, last reward: -0.5063, gradient norm: 14.87: 48%|████▊ | 299/625 [00:39<00:43, 7.57it/s]
reward: -2.7475, last reward: -1.4190, gradient norm: 21.66: 48%|████▊ | 299/625 [00:39<00:43, 7.57it/s]
reward: -2.7475, last reward: -1.4190, gradient norm: 21.66: 48%|████▊ | 300/625 [00:39<00:42, 7.57it/s]
reward: -2.8186, last reward: -2.5077, gradient norm: 22.4: 48%|████▊ | 300/625 [00:40<00:42, 7.57it/s]
reward: -2.8186, last reward: -2.5077, gradient norm: 22.4: 48%|████▊ | 301/625 [00:40<00:42, 7.58it/s]
reward: -3.1883, last reward: -1.5291, gradient norm: 7.472: 48%|████▊ | 301/625 [00:40<00:42, 7.58it/s]
reward: -3.1883, last reward: -1.5291, gradient norm: 7.472: 48%|████▊ | 302/625 [00:40<00:42, 7.58it/s]
reward: -2.1256, last reward: -0.3998, gradient norm: 11.01: 48%|████▊ | 303/625 [00:40<00:42, 7.58it/s]
reward: -2.3622, last reward: -0.0930, gradient norm: 1.626: 49%|████▊ | 304/625 [00:40<00:42, 7.56it/s]
reward: -1.9500, last reward: -0.0075, gradient norm: 0.5664: 49%|████▉ | 305/625 [00:40<00:42, 7.56it/s]
reward: -2.5697, last reward: -0.3024, gradient norm: 22.61: 49%|████▉ | 306/625 [00:40<00:42, 7.57it/s]
reward: -2.3117, last reward: -0.0052, gradient norm: 1.006: 49%|████▉ | 307/625 [00:40<00:42, 7.57it/s]
reward: -2.0981, last reward: -0.0018, gradient norm: 0.9312: 49%|████▉ | 308/625 [00:40<00:41, 7.56it/s]
reward: -2.5140, last reward: -0.3873, gradient norm: 3.93: 49%|████▉ | 309/625 [00:41<00:41, 7.56it/s]
reward: -2.0411, last reward: -0.2650, gradient norm: 3.183: 50%|████▉ | 310/625 [00:41<00:41, 7.57it/s]
reward: -2.1656, last reward: -0.0228, gradient norm: 2.004: 50%|████▉ | 311/625 [00:41<00:41, 7.55it/s]
reward: -2.1196, last reward: -0.2478, gradient norm: 11.78: 50%|████▉ | 312/625 [00:41<00:41, 7.56it/s]
reward: -2.7353, last reward: -3.0812, gradient norm: 82.91: 50%|█████ | 313/625 [00:41<00:41, 7.56it/s]
reward: -3.0995, last reward: -2.3022, gradient norm: 8.758: 50%|█████ | 314/625 [00:41<00:41, 7.56it/s]
reward: -3.1406, last reward: -2.4626, gradient norm: 15.99: 50%|█████ | 315/625 [00:41<00:40, 7.57it/s]
reward: -3.2156, last reward: -1.9055, gradient norm: 7.851: 51%|█████ | 316/625 [00:42<00:40, 7.56it/s]
reward: -3.1953, last reward: -2.3774, gradient norm: 19.78: 51%|█████ | 317/625 [00:42<00:40, 7.57it/s]
reward: -2.6385, last reward: -0.9917, gradient norm: 16.15: 51%|█████ | 318/625 [00:42<00:40, 7.57it/s]
reward: -2.2764, last reward: -0.0536, gradient norm: 2.905: 51%|█████ | 319/625 [00:42<00:40, 7.56it/s]
reward: -2.6391, last reward: -1.9317, gradient norm: 23.78: 51%|█████ | 320/625 [00:42<00:40, 7.56it/s]
reward: -2.9748, last reward: -4.2679, gradient norm: 59.43: 51%|█████▏ | 321/625 [00:42<00:40, 7.54it/s]
reward: -2.8495, last reward: -4.5125, gradient norm: 52.19: 52%|█████▏ | 322/625 [00:42<00:40, 7.55it/s]
reward: -2.8177, last reward: -2.6602, gradient norm: 52.75: 52%|█████▏ | 323/625 [00:42<00:39, 7.55it/s]
reward: -2.0704, last reward: -0.5776, gradient norm: 59.07: 52%|█████▏ | 324/625 [00:43<00:39, 7.55it/s]
reward: -1.9833, last reward: -0.1339, gradient norm: 4.402: 52%|█████▏ | 325/625 [00:43<00:39, 7.55it/s]
reward: -2.2760, last reward: -2.1238, gradient norm: 30.36: 52%|█████▏ | 326/625 [00:43<00:39, 7.54it/s]
reward: -2.9299, last reward: -5.0227, gradient norm: 100.5: 52%|█████▏ | 327/625 [00:43<00:39, 7.52it/s]
reward: -2.7727, last reward: -2.1607, gradient norm: 336.7: 52%|█████▏ | 328/625 [00:43<00:39, 7.52it/s]
reward: -2.3958, last reward: -0.3223, gradient norm: 2.763: 53%|█████▎ | 329/625 [00:43<00:39, 7.54it/s]
reward: -2.4742, last reward: -0.1797, gradient norm: 47.32: 53%|█████▎ | 330/625 [00:43<00:39, 7.55it/s]
reward: -2.0144, last reward: -0.0085, gradient norm: 4.791: 53%|█████▎ | 331/625 [00:44<00:38, 7.55it/s]
reward: -1.8284, last reward: -0.0428, gradient norm: 12.29: 53%|█████▎ | 332/625 [00:44<00:38, 7.57it/s]
reward: -2.5229, last reward: -0.0098, gradient norm: 0.7365: 53%|█████▎ | 333/625 [00:44<00:38, 7.57it/s]
reward: -2.4566, last reward: -0.0781, gradient norm: 2.086: 53%|█████▎ | 334/625 [00:44<00:38, 7.56it/s]
reward: -2.3355, last reward: -0.0230, gradient norm: 1.311: 54%|█████▎ | 335/625 [00:44<00:38, 7.57it/s]
reward: -1.9346, last reward: -0.0423, gradient norm: 1.076: 54%|█████▍ | 336/625 [00:44<00:38, 7.57it/s]
reward: -2.3711, last reward: -0.1335, gradient norm: 0.6855: 54%|█████▍ | 337/625 [00:44<00:38, 7.57it/s]
reward: -2.0304, last reward: -0.0023, gradient norm: 0.8459: 54%|█████▍ | 338/625 [00:44<00:37, 7.57it/s]
reward: -1.9998, last reward: -0.4399, gradient norm: 13.1: 54%|█████▍ | 339/625 [00:45<00:37, 7.57it/s]
reward: -2.2303, last reward: -2.1346, gradient norm: 45.99: 54%|█████▍ | 340/625 [00:45<00:37, 7.57it/s]
reward: -2.2915, last reward: -1.7116, gradient norm: 40.34: 55%|█████▍ | 341/625 [00:45<00:37, 7.58it/s]
reward: -2.5560, last reward: -0.0487, gradient norm: 1.195: 55%|█████▍ | 342/625 [00:45<00:37, 7.57it/s]
reward: -2.5119, last reward: -0.0358, gradient norm: 1.061: 55%|█████▍ | 343/625 [00:45<00:37, 7.58it/s]
reward: -2.3305, last reward: -0.3705, gradient norm: 1.957: 55%|█████▌ | 344/625 [00:45<00:37, 7.58it/s]
reward: -2.6068, last reward: -0.2112, gradient norm: 13.83: 55%|█████▌ | 345/625 [00:45<00:37, 7.55it/s]
reward: -2.5731, last reward: -1.8455, gradient norm: 66.75: 55%|█████▌ | 346/625 [00:46<00:36, 7.56it/s]
reward: -2.3897, last reward: -0.0376, gradient norm: 1.608: 56%|█████▌ | 347/625 [00:46<00:36, 7.57it/s]
reward: -2.2264, last reward: -0.0434, gradient norm: 2.012: 56%|█████▌ | 348/625 [00:46<00:36, 7.58it/s]
reward: -2.1300, last reward: -0.1215, gradient norm: 2.557: 56%|█████▌ | 349/625 [00:46<00:36, 7.57it/s]
reward: -2.0968, last reward: -0.0885, gradient norm: 3.389: 56%|█████▌ | 350/625 [00:46<00:49, 5.53it/s]
reward: -2.1348, last reward: -0.0073, gradient norm: 0.5052: 56%|█████▌ | 351/625 [00:46<00:45, 6.02it/s]
reward: -2.4184, last reward: -3.2817, gradient norm: 108.6: 56%|█████▋ | 352/625 [00:46<00:42, 6.42it/s]
reward: -2.3774, last reward: -1.8887, gradient norm: 54.07: 56%|█████▋ | 353/625 [00:47<00:40, 6.73it/s]
reward: -2.4779, last reward: -0.1009, gradient norm: 10.91: 57%|█████▋ | 354/625 [00:47<00:38, 6.97it/s]
reward: -2.2588, last reward: -0.0604, gradient norm: 2.599: 57%|█████▋ | 355/625 [00:47<00:37, 7.15it/s]
reward: -2.4486, last reward: -0.1176, gradient norm: 3.656: 57%|█████▋ | 356/625 [00:47<00:37, 7.26it/s]
reward: -2.2436, last reward: -0.0668, gradient norm: 2.724: 57%|█████▋ | 357/625 [00:47<00:36, 7.36it/s]
reward: -1.8849, last reward: -0.0012, gradient norm: 5.326: 57%|█████▋ | 358/625 [00:47<00:35, 7.43it/s]
reward: -2.7511, last reward: -0.8804, gradient norm: 13.6: 57%|█████▋ | 359/625 [00:47<00:35, 7.45it/s]
reward: -2.8870, last reward: -3.6728, gradient norm: 33.56: 58%|█████▊ | 360/625 [00:48<00:35, 7.50it/s]
reward: -2.8841, last reward: -2.5508, gradient norm: 30.93: 58%|█████▊ | 361/625 [00:48<00:35, 7.52it/s]
reward: -2.5242, last reward: -1.0268, gradient norm: 33.15: 58%|█████▊ | 362/625 [00:48<00:34, 7.52it/s]
reward: -2.3232, last reward: -0.0013, gradient norm: 0.6185: 58%|█████▊ | 363/625 [00:48<00:34, 7.54it/s]
reward: -2.1378, last reward: -0.0204, gradient norm: 1.337: 58%|█████▊ | 364/625 [00:48<00:34, 7.55it/s]
reward: -2.2677, last reward: -0.0355, gradient norm: 1.685: 58%|█████▊ | 365/625 [00:48<00:34, 7.54it/s]
reward: -2.4884, last reward: -0.0231, gradient norm: 1.213: 59%|█████▊ | 366/625 [00:48<00:34, 7.55it/s]
reward: -2.0770, last reward: -0.0014, gradient norm: 0.6793: 59%|█████▊ | 367/625 [00:48<00:34, 7.55it/s]
reward: -1.9834, last reward: -0.0349, gradient norm: 1.863: 59%|█████▉ | 368/625 [00:49<00:34, 7.55it/s]
reward: -2.6709, last reward: -0.1416, gradient norm: 5.462: 59%|█████▉ | 369/625 [00:49<00:33, 7.55it/s]
reward: -2.5199, last reward: -3.9790, gradient norm: 47.67: 59%|█████▉ | 370/625 [00:49<00:33, 7.54it/s]
reward: -2.9401, last reward: -3.7802, gradient norm: 32.47: 59%|█████▉ | 371/625 [00:49<00:33, 7.54it/s]
reward: -2.6723, last reward: -3.6507, gradient norm: 45.1: 60%|█████▉ | 372/625 [00:49<00:33, 7.54it/s]
reward: -2.2678, last reward: -0.6201, gradient norm: 32.94: 60%|█████▉ | 373/625 [00:49<00:33, 7.55it/s]
reward: -2.2184, last reward: -0.0075, gradient norm: 0.7385: 60%|█████▉ | 374/625 [00:49<00:33, 7.55it/s]
reward: -2.6344, last reward: -0.0576, gradient norm: 1.617: 60%|██████ | 375/625 [00:49<00:33, 7.54it/s]
reward: -1.9945, last reward: -0.0772, gradient norm: 2.567: 60%|██████ | 376/625 [00:50<00:32, 7.56it/s]
reward: -1.7576, last reward: -0.0398, gradient norm: 1.961: 60%|██████ | 377/625 [00:50<00:32, 7.56it/s]
reward: -2.3396, last reward: -0.0022, gradient norm: 1.094: 60%|██████ | 378/625 [00:50<00:32, 7.53it/s]
reward: -2.3073, last reward: -0.4018, gradient norm: 29.23: 61%|██████ | 379/625 [00:50<00:32, 7.54it/s]
reward: -2.3313, last reward: -1.1869, gradient norm: 38.62: 61%|██████ | 380/625 [00:50<00:32, 7.53it/s]
reward: -2.0481, last reward: -0.1117, gradient norm: 5.321: 61%|██████ | 381/625 [00:50<00:32, 7.54it/s]
reward: -1.6823, last reward: -0.0001, gradient norm: 1.981: 61%|██████ | 382/625 [00:50<00:32, 7.55it/s]
reward: -1.8305, last reward: -0.0210, gradient norm: 1.228: 61%|██████▏ | 383/625 [00:51<00:32, 7.54it/s]
reward: -1.4908, last reward: -0.0272, gradient norm: 1.538: 61%|██████▏ | 384/625 [00:51<00:31, 7.55it/s]
reward: -2.3267, last reward: -0.0111, gradient norm: 0.7965: 62%|██████▏ | 385/625 [00:51<00:31, 7.56it/s]
reward: -2.1796, last reward: -0.0039, gradient norm: 0.5396: 62%|██████▏ | 386/625 [00:51<00:31, 7.56it/s]
reward: -2.3757, last reward: -0.0490, gradient norm: 2.237: 62%|██████▏ | 387/625 [00:51<00:31, 7.57it/s]
reward: -2.1394, last reward: -0.4187, gradient norm: 52.11: 62%|██████▏ | 388/625 [00:51<00:31, 7.57it/s]
reward: -2.2986, last reward: -0.0038, gradient norm: 0.7954: 62%|██████▏ | 389/625 [00:51<00:31, 7.56it/s]
reward: -2.1274, last reward: -0.0063, gradient norm: 0.813: 62%|██████▏ | 390/625 [00:51<00:31, 7.56it/s]
reward: -1.8706, last reward: -0.0114, gradient norm: 3.325: 63%|██████▎ | 391/625 [00:52<00:30, 7.56it/s]
reward: -1.6922, last reward: -0.0004, gradient norm: 0.2423: 63%|██████▎ | 392/625 [00:52<00:30, 7.56it/s]
reward: -1.9115, last reward: -0.2602, gradient norm: 2.599: 63%|██████▎ | 393/625 [00:52<00:30, 7.54it/s]
reward: -2.2449, last reward: -0.0783, gradient norm: 5.199: 63%|██████▎ | 394/625 [00:52<00:30, 7.54it/s]
reward: -2.0631, last reward: -0.0057, gradient norm: 0.7444: 63%|██████▎ | 395/625 [00:52<00:30, 7.54it/s]
reward: -2.3339, last reward: -0.0167, gradient norm: 1.39: 63%|██████▎ | 396/625 [00:52<00:30, 7.56it/s]
reward: -2.4806, last reward: -0.0023, gradient norm: 2.317: 64%|██████▎ | 397/625 [00:52<00:30, 7.56it/s]
reward: -2.4171, last reward: -0.1438, gradient norm: 5.067: 64%|██████▎ | 398/625 [00:53<00:30, 7.56it/s]
reward: -2.2618, last reward: -0.5809, gradient norm: 20.39: 64%|██████▍ | 399/625 [00:53<00:29, 7.57it/s]
reward: -2.0115, last reward: -0.0054, gradient norm: 0.3364: 64%|██████▍ | 400/625 [00:53<00:29, 7.57it/s]
reward: -1.8733, last reward: -0.0184, gradient norm: 2.275: 64%|██████▍ | 401/625 [00:53<00:29, 7.58it/s]
reward: -1.9137, last reward: -0.0113, gradient norm: 1.025: 64%|██████▍ | 402/625 [00:53<00:29, 7.58it/s]
reward: -2.0386, last reward: -0.0625, gradient norm: 2.763: 64%|██████▍ | 403/625 [00:53<00:29, 7.57it/s]
reward: -2.1332, last reward: -0.0582, gradient norm: 0.7816: 65%|██████▍ | 404/625 [00:53<00:29, 7.57it/s]
reward: -1.8341, last reward: -0.0941, gradient norm: 5.854: 65%|██████▍ | 405/625 [00:53<00:29, 7.56it/s]
reward: -1.8615, last reward: -0.0968, gradient norm: 4.588: 65%|██████▍ | 406/625 [00:54<00:28, 7.57it/s]
reward: -2.0981, last reward: -0.3849, gradient norm: 6.008: 65%|██████▌ | 407/625 [00:54<00:28, 7.58it/s]
reward: -1.9395, last reward: -0.0765, gradient norm: 4.055: 65%|██████▌ | 408/625 [00:54<00:28, 7.56it/s]
reward: -2.2685, last reward: -0.2235, gradient norm: 1.688: 65%|██████▌ | 409/625 [00:54<00:28, 7.55it/s]
reward: -2.3052, last reward: -1.4249, gradient norm: 25.99: 66%|██████▌ | 410/625 [00:54<00:28, 7.56it/s]
reward: -2.6806, last reward: -1.6383, gradient norm: 30.59: 66%|██████▌ | 411/625 [00:54<00:28, 7.55it/s]
reward: -2.3721, last reward: -2.9981, gradient norm: 74.37: 66%|██████▌ | 412/625 [00:54<00:28, 7.55it/s]
reward: -2.1862, last reward: -0.0063, gradient norm: 1.822: 66%|██████▌ | 413/625 [00:55<00:28, 7.55it/s]
reward: -1.9811, last reward: -0.0171, gradient norm: 1.013: 66%|██████▌ | 414/625 [00:55<00:27, 7.55it/s]
reward: -2.0252, last reward: -0.0049, gradient norm: 0.6205: 66%|██████▋ | 415/625 [00:55<00:27, 7.56it/s]
reward: -2.1108, last reward: -0.4921, gradient norm: 23.74: 67%|██████▋ | 416/625 [00:55<00:27, 7.54it/s]
reward: -1.9142, last reward: -0.8130, gradient norm: 52.65: 67%|██████▋ | 417/625 [00:55<00:27, 7.54it/s]
reward: -2.1725, last reward: -0.0036, gradient norm: 0.3196: 67%|██████▋ | 418/625 [00:55<00:27, 7.55it/s]
reward: -1.7795, last reward: -0.0242, gradient norm: 1.799: 67%|██████▋ | 419/625 [00:55<00:27, 7.55it/s]
reward: -1.7737, last reward: -0.0138, gradient norm: 1.39: 67%|██████▋ | 420/625 [00:55<00:27, 7.52it/s]
reward: -2.1462, last reward: -0.0053, gradient norm: 0.47: 67%|██████▋ | 421/625 [00:56<00:27, 7.53it/s]
reward: -1.9226, last reward: -0.6139, gradient norm: 40.3: 68%|██████▊ | 422/625 [00:56<00:26, 7.54it/s]
reward: -1.9889, last reward: -0.0403, gradient norm: 1.112: 68%|██████▊ | 423/625 [00:56<00:26, 7.53it/s]
reward: -1.6194, last reward: -0.0032, gradient norm: 0.79: 68%|██████▊ | 424/625 [00:56<00:26, 7.50it/s]
reward: -2.3989, last reward: -0.0104, gradient norm: 1.134: 68%|██████▊ | 425/625 [00:56<00:26, 7.50it/s]
reward: -1.9960, last reward: -0.0009, gradient norm: 0.6009: 68%|██████▊ | 426/625 [00:56<00:26, 7.51it/s]
reward: -2.2697, last reward: -0.0914, gradient norm: 2.905: 68%|██████▊ | 427/625 [00:56<00:26, 7.49it/s]
reward: -2.4256, last reward: -0.1114, gradient norm: 2.102: 68%|██████▊ | 428/625 [00:57<00:26, 7.49it/s]
reward: -1.9862, last reward: -0.1932, gradient norm: 22.44: 69%|██████▊ | 429/625 [00:57<00:26, 7.49it/s]
reward: -2.0637, last reward: -0.0623, gradient norm: 3.082: 69%|██████▉ | 430/625 [00:57<00:25, 7.50it/s]
reward: -1.9906, last reward: -0.2031, gradient norm: 5.5: 69%|██████▉ | 431/625 [00:57<00:25, 7.49it/s]
reward: -1.9948, last reward: -0.0895, gradient norm: 3.456: 69%|██████▉ | 432/625 [00:57<00:25, 7.48it/s]
reward: -2.1970, last reward: -0.0256, gradient norm: 1.593: 69%|██████▉ | 433/625 [00:57<00:25, 7.50it/s]
reward: -2.4231, last reward: -0.0449, gradient norm: 3.644: 69%|██████▉ | 434/625 [00:57<00:25, 7.50it/s]
reward: -2.1039, last reward: -3.1973, gradient norm: 87.37: 70%|██████▉ | 435/625 [00:57<00:25, 7.48it/s]
reward: -2.4561, last reward: -0.1225, gradient norm: 6.119: 70%|██████▉ | 436/625 [00:58<00:25, 7.50it/s]
reward: -2.0211, last reward: -0.2125, gradient norm: 2.94: 70%|██████▉ | 437/625 [00:58<00:25, 7.51it/s]
reward: -2.3866, last reward: -0.0050, gradient norm: 0.7202: 70%|███████ | 438/625 [00:58<00:24, 7.50it/s]
reward: -1.6388, last reward: -0.0072, gradient norm: 0.8657: 70%|███████ | 439/625 [00:58<00:24, 7.52it/s]
reward: -2.1187, last reward: -0.0015, gradient norm: 0.5116: 70%|███████ | 440/625 [00:58<00:24, 7.51it/s]
reward: -2.0432, last reward: -0.0025, gradient norm: 0.7809: 71%|███████ | 441/625 [00:58<00:24, 7.50it/s]
reward: -2.1925, last reward: -0.0103, gradient norm: 2.83: 71%|███████ | 442/625 [00:58<00:24, 7.49it/s]
reward: -1.9570, last reward: -0.0002, gradient norm: 0.35: 71%|███████ | 443/625 [00:59<00:24, 7.50it/s]
reward: -2.0871, last reward: -0.0022, gradient norm: 0.5601: 71%|███████ | 444/625 [00:59<00:24, 7.51it/s]
reward: -2.0165, last reward: -0.0047, gradient norm: 0.6061: 71%|███████ | 445/625 [00:59<00:23, 7.50it/s]
reward: -2.2746, last reward: -0.0027, gradient norm: 0.7887: 71%|███████▏ | 446/625 [00:59<00:23, 7.48it/s]
reward: -2.1835, last reward: -0.0035, gradient norm: 0.855: 72%|███████▏ | 447/625 [00:59<00:23, 7.49it/s]
reward: -1.8420, last reward: -0.0103, gradient norm: 1.548: 72%|███████▏ | 448/625 [00:59<00:23, 7.50it/s]
reward: -2.2653, last reward: -0.0126, gradient norm: 0.9736: 72%|███████▏ | 449/625 [00:59<00:23, 7.52it/s]
reward: -2.0594, last reward: -0.0119, gradient norm: 0.6196: 72%|███████▏ | 450/625 [00:59<00:23, 7.52it/s]
reward: -2.4509, last reward: -0.0373, gradient norm: 11.44: 72%|███████▏ | 451/625 [01:00<00:23, 7.54it/s]
reward: -2.2528, last reward: -0.0620, gradient norm: 3.992: 72%|███████▏ | 452/625 [01:00<00:22, 7.55it/s]
reward: -1.6898, last reward: -0.3235, gradient norm: 6.687: 72%|███████▏ | 452/625 [01:00<00:22, 7.55it/s]
reward: -1.6898, last reward: -0.3235, gradient norm: 6.687: 72%|███████▏ | 453/625 [01:00<00:22, 7.55it/s]
reward: -1.5879, last reward: -0.0905, gradient norm: 2.84: 72%|███████▏ | 453/625 [01:00<00:22, 7.55it/s]
reward: -1.5879, last reward: -0.0905, gradient norm: 2.84: 73%|███████▎ | 454/625 [01:00<00:22, 7.56it/s]
reward: -1.8406, last reward: -0.0694, gradient norm: 2.288: 73%|███████▎ | 454/625 [01:00<00:22, 7.56it/s]
reward: -1.8406, last reward: -0.0694, gradient norm: 2.288: 73%|███████▎ | 455/625 [01:00<00:22, 7.55it/s]
reward: -1.8259, last reward: -0.0235, gradient norm: 1.304: 73%|███████▎ | 455/625 [01:00<00:22, 7.55it/s]
reward: -1.8259, last reward: -0.0235, gradient norm: 1.304: 73%|███████▎ | 456/625 [01:00<00:22, 7.53it/s]
reward: -1.8500, last reward: -0.0024, gradient norm: 1.416: 73%|███████▎ | 456/625 [01:00<00:22, 7.53it/s]
reward: -1.8500, last reward: -0.0024, gradient norm: 1.416: 73%|███████▎ | 457/625 [01:00<00:22, 7.54it/s]
reward: -1.9649, last reward: -0.4054, gradient norm: 39.3: 73%|███████▎ | 457/625 [01:01<00:22, 7.54it/s]
reward: -1.9649, last reward: -0.4054, gradient norm: 39.3: 73%|███████▎ | 458/625 [01:01<00:22, 7.55it/s]
reward: -2.2027, last reward: -0.0894, gradient norm: 4.275: 73%|███████▎ | 458/625 [01:01<00:22, 7.55it/s]
reward: -2.2027, last reward: -0.0894, gradient norm: 4.275: 73%|███████▎ | 459/625 [01:01<00:21, 7.57it/s]
reward: -1.5966, last reward: -0.0113, gradient norm: 1.368: 73%|███████▎ | 459/625 [01:01<00:21, 7.57it/s]
reward: -1.5966, last reward: -0.0113, gradient norm: 1.368: 74%|███████▎ | 460/625 [01:01<00:21, 7.56it/s]
reward: -1.6942, last reward: -0.0016, gradient norm: 0.4254: 74%|███████▎ | 460/625 [01:01<00:21, 7.56it/s]
reward: -1.6942, last reward: -0.0016, gradient norm: 0.4254: 74%|███████▍ | 461/625 [01:01<00:21, 7.57it/s]
reward: -1.6703, last reward: -0.0145, gradient norm: 2.142: 74%|███████▍ | 461/625 [01:01<00:21, 7.57it/s]
reward: -1.6703, last reward: -0.0145, gradient norm: 2.142: 74%|███████▍ | 462/625 [01:01<00:21, 7.56it/s]
reward: -1.8124, last reward: -0.0218, gradient norm: 0.9196: 74%|███████▍ | 462/625 [01:01<00:21, 7.56it/s]
reward: -1.8124, last reward: -0.0218, gradient norm: 0.9196: 74%|███████▍ | 463/625 [01:01<00:21, 7.56it/s]
reward: -1.8657, last reward: -0.0188, gradient norm: 0.8986: 74%|███████▍ | 463/625 [01:01<00:21, 7.56it/s]
reward: -1.8657, last reward: -0.0188, gradient norm: 0.8986: 74%|███████▍ | 464/625 [01:01<00:21, 7.57it/s]
reward: -2.0884, last reward: -0.0084, gradient norm: 0.5624: 74%|███████▍ | 464/625 [01:01<00:21, 7.57it/s]
reward: -2.0884, last reward: -0.0084, gradient norm: 0.5624: 74%|███████▍ | 465/625 [01:01<00:21, 7.55it/s]
reward: -1.8862, last reward: -0.0006, gradient norm: 0.5384: 74%|███████▍ | 465/625 [01:02<00:21, 7.55it/s]
reward: -1.8862, last reward: -0.0006, gradient norm: 0.5384: 75%|███████▍ | 466/625 [01:02<00:21, 7.56it/s]
reward: -2.1973, last reward: -0.0022, gradient norm: 0.5837: 75%|███████▍ | 466/625 [01:02<00:21, 7.56it/s]
reward: -2.1973, last reward: -0.0022, gradient norm: 0.5837: 75%|███████▍ | 467/625 [01:02<00:20, 7.56it/s]
reward: -1.8954, last reward: -0.0101, gradient norm: 0.6751: 75%|███████▍ | 467/625 [01:02<00:20, 7.56it/s]
reward: -1.8954, last reward: -0.0101, gradient norm: 0.6751: 75%|███████▍ | 468/625 [01:02<00:20, 7.55it/s]
reward: -1.8063, last reward: -0.0122, gradient norm: 0.9635: 75%|███████▍ | 468/625 [01:02<00:20, 7.55it/s]
reward: -1.8063, last reward: -0.0122, gradient norm: 0.9635: 75%|███████▌ | 469/625 [01:02<00:20, 7.54it/s]
reward: -2.0692, last reward: -0.0027, gradient norm: 0.4216: 75%|███████▌ | 469/625 [01:02<00:20, 7.54it/s]
reward: -2.0692, last reward: -0.0027, gradient norm: 0.4216: 75%|███████▌ | 470/625 [01:02<00:20, 7.54it/s]
reward: -2.1227, last reward: -0.0586, gradient norm: 3.162e+03: 75%|███████▌ | 470/625 [01:02<00:20, 7.54it/s]
reward: -2.1227, last reward: -0.0586, gradient norm: 3.162e+03: 75%|███████▌ | 471/625 [01:02<00:20, 7.55it/s]
reward: -1.9690, last reward: -0.0074, gradient norm: 0.4166: 75%|███████▌ | 471/625 [01:02<00:20, 7.55it/s]
reward: -1.9690, last reward: -0.0074, gradient norm: 0.4166: 76%|███████▌ | 472/625 [01:02<00:20, 7.55it/s]
reward: -2.6324, last reward: -0.0119, gradient norm: 1.345: 76%|███████▌ | 472/625 [01:02<00:20, 7.55it/s]
reward: -2.6324, last reward: -0.0119, gradient norm: 1.345: 76%|███████▌ | 473/625 [01:02<00:20, 7.56it/s]
reward: -2.0778, last reward: -0.0098, gradient norm: 1.166: 76%|███████▌ | 473/625 [01:03<00:20, 7.56it/s]
reward: -2.0778, last reward: -0.0098, gradient norm: 1.166: 76%|███████▌ | 474/625 [01:03<00:19, 7.56it/s]
reward: -1.8548, last reward: -0.0017, gradient norm: 0.4408: 76%|███████▌ | 474/625 [01:03<00:19, 7.56it/s]
reward: -1.8548, last reward: -0.0017, gradient norm: 0.4408: 76%|███████▌ | 475/625 [01:03<00:19, 7.56it/s]
reward: -1.8125, last reward: -0.0003, gradient norm: 0.1515: 76%|███████▌ | 475/625 [01:03<00:19, 7.56it/s]
reward: -1.8125, last reward: -0.0003, gradient norm: 0.1515: 76%|███████▌ | 476/625 [01:03<00:19, 7.54it/s]
reward: -2.2733, last reward: -0.0044, gradient norm: 0.2836: 76%|███████▌ | 476/625 [01:03<00:19, 7.54it/s]
reward: -2.2733, last reward: -0.0044, gradient norm: 0.2836: 76%|███████▋ | 477/625 [01:03<00:19, 7.54it/s]
reward: -1.7497, last reward: -0.0149, gradient norm: 0.7681: 76%|███████▋ | 477/625 [01:03<00:19, 7.54it/s]
reward: -1.7497, last reward: -0.0149, gradient norm: 0.7681: 76%|███████▋ | 478/625 [01:03<00:19, 7.55it/s]
reward: -1.8547, last reward: -0.0105, gradient norm: 0.7212: 76%|███████▋ | 478/625 [01:03<00:19, 7.55it/s]
reward: -1.8547, last reward: -0.0105, gradient norm: 0.7212: 77%|███████▋ | 479/625 [01:03<00:19, 7.56it/s]
reward: -1.9848, last reward: -0.0019, gradient norm: 0.6498: 77%|███████▋ | 479/625 [01:03<00:19, 7.56it/s]
reward: -1.9848, last reward: -0.0019, gradient norm: 0.6498: 77%|███████▋ | 480/625 [01:03<00:19, 7.54it/s]
reward: -2.1987, last reward: -0.0011, gradient norm: 0.5473: 77%|███████▋ | 480/625 [01:04<00:19, 7.54it/s]
reward: -2.1987, last reward: -0.0011, gradient norm: 0.5473: 77%|███████▋ | 481/625 [01:04<00:19, 7.53it/s]
reward: -1.8991, last reward: -0.0033, gradient norm: 0.6091: 77%|███████▋ | 481/625 [01:04<00:19, 7.53it/s]
reward: -1.8991, last reward: -0.0033, gradient norm: 0.6091: 77%|███████▋ | 482/625 [01:04<00:18, 7.55it/s]
reward: -1.9189, last reward: -0.0032, gradient norm: 0.5771: 77%|███████▋ | 482/625 [01:04<00:18, 7.55it/s]
reward: -1.9189, last reward: -0.0032, gradient norm: 0.5771: 77%|███████▋ | 483/625 [01:04<00:18, 7.55it/s]
reward: -1.6781, last reward: -0.0004, gradient norm: 0.7542: 77%|███████▋ | 483/625 [01:04<00:18, 7.55it/s]
reward: -1.6781, last reward: -0.0004, gradient norm: 0.7542: 77%|███████▋ | 484/625 [01:04<00:18, 7.52it/s]
reward: -1.5959, last reward: -0.0064, gradient norm: 0.4295: 77%|███████▋ | 484/625 [01:04<00:18, 7.52it/s]
reward: -1.5959, last reward: -0.0064, gradient norm: 0.4295: 78%|███████▊ | 485/625 [01:04<00:18, 7.54it/s]
reward: -2.2547, last reward: -0.0103, gradient norm: 0.4641: 78%|███████▊ | 485/625 [01:04<00:18, 7.54it/s]
reward: -2.2547, last reward: -0.0103, gradient norm: 0.4641: 78%|███████▊ | 486/625 [01:04<00:18, 7.54it/s]
reward: -2.1509, last reward: -0.0636, gradient norm: 6.547: 78%|███████▊ | 486/625 [01:04<00:18, 7.54it/s]
reward: -2.1509, last reward: -0.0636, gradient norm: 6.547: 78%|███████▊ | 487/625 [01:04<00:18, 7.54it/s]
reward: -2.0972, last reward: -0.0065, gradient norm: 0.2593: 78%|███████▊ | 487/625 [01:04<00:18, 7.54it/s]
reward: -2.0972, last reward: -0.0065, gradient norm: 0.2593: 78%|███████▊ | 488/625 [01:04<00:18, 7.55it/s]
reward: -2.1694, last reward: -0.0083, gradient norm: 0.5759: 78%|███████▊ | 488/625 [01:05<00:18, 7.55it/s]
reward: -2.1694, last reward: -0.0083, gradient norm: 0.5759: 78%|███████▊ | 489/625 [01:05<00:18, 7.55it/s]
reward: -2.0493, last reward: -0.0021, gradient norm: 0.7805: 78%|███████▊ | 489/625 [01:05<00:18, 7.55it/s]
reward: -2.0493, last reward: -0.0021, gradient norm: 0.7805: 78%|███████▊ | 490/625 [01:05<00:17, 7.56it/s]
reward: -2.0950, last reward: -0.0021, gradient norm: 0.497: 78%|███████▊ | 490/625 [01:05<00:17, 7.56it/s]
reward: -2.0950, last reward: -0.0021, gradient norm: 0.497: 79%|███████▊ | 491/625 [01:05<00:17, 7.54it/s]
reward: -1.9717, last reward: -0.0012, gradient norm: 0.3672: 79%|███████▊ | 491/625 [01:05<00:17, 7.54it/s]
reward: -1.9717, last reward: -0.0012, gradient norm: 0.3672: 79%|███████▊ | 492/625 [01:05<00:17, 7.55it/s]
reward: -2.0207, last reward: -0.0009, gradient norm: 0.331: 79%|███████▊ | 492/625 [01:05<00:17, 7.55it/s]
reward: -2.0207, last reward: -0.0009, gradient norm: 0.331: 79%|███████▉ | 493/625 [01:05<00:17, 7.56it/s]
reward: -1.8266, last reward: -0.0069, gradient norm: 0.5365: 79%|███████▉ | 493/625 [01:05<00:17, 7.56it/s]
reward: -1.8266, last reward: -0.0069, gradient norm: 0.5365: 79%|███████▉ | 494/625 [01:05<00:17, 7.56it/s]
reward: -2.2623, last reward: -0.0065, gradient norm: 0.5078: 79%|███████▉ | 494/625 [01:05<00:17, 7.56it/s]
reward: -2.2623, last reward: -0.0065, gradient norm: 0.5078: 79%|███████▉ | 495/625 [01:05<00:17, 7.55it/s]
reward: -2.0230, last reward: -0.0027, gradient norm: 0.4545: 79%|███████▉ | 495/625 [01:06<00:17, 7.55it/s]
reward: -2.0230, last reward: -0.0027, gradient norm: 0.4545: 79%|███████▉ | 496/625 [01:06<00:17, 7.51it/s]
reward: -1.6047, last reward: -0.0000, gradient norm: 0.09636: 79%|███████▉ | 496/625 [01:06<00:17, 7.51it/s]
reward: -1.6047, last reward: -0.0000, gradient norm: 0.09636: 80%|███████▉ | 497/625 [01:06<00:16, 7.53it/s]
reward: -1.8754, last reward: -0.0010, gradient norm: 0.2: 80%|███████▉ | 497/625 [01:06<00:16, 7.53it/s]
reward: -1.8754, last reward: -0.0010, gradient norm: 0.2: 80%|███████▉ | 498/625 [01:06<00:16, 7.52it/s]
reward: -2.6216, last reward: -0.0031, gradient norm: 0.8269: 80%|███████▉ | 498/625 [01:06<00:16, 7.52it/s]
reward: -2.6216, last reward: -0.0031, gradient norm: 0.8269: 80%|███████▉ | 499/625 [01:06<00:16, 7.52it/s]
reward: -1.7361, last reward: -0.0023, gradient norm: 0.4082: 80%|███████▉ | 499/625 [01:06<00:16, 7.52it/s]
reward: -1.7361, last reward: -0.0023, gradient norm: 0.4082: 80%|████████ | 500/625 [01:06<00:16, 7.54it/s]
reward: -1.6642, last reward: -0.0006, gradient norm: 0.2284: 80%|████████ | 500/625 [01:06<00:16, 7.54it/s]
reward: -1.6642, last reward: -0.0006, gradient norm: 0.2284: 80%|████████ | 501/625 [01:06<00:16, 7.54it/s]
reward: -1.9130, last reward: -0.0008, gradient norm: 0.3031: 80%|████████ | 501/625 [01:06<00:16, 7.54it/s]
reward: -1.9130, last reward: -0.0008, gradient norm: 0.3031: 80%|████████ | 502/625 [01:06<00:16, 7.54it/s]
reward: -2.2944, last reward: -0.0035, gradient norm: 0.2986: 80%|████████ | 502/625 [01:06<00:16, 7.54it/s]
reward: -2.2944, last reward: -0.0035, gradient norm: 0.2986: 80%|████████ | 503/625 [01:06<00:16, 7.55it/s]
reward: -1.7624, last reward: -0.0056, gradient norm: 0.3858: 80%|████████ | 503/625 [01:07<00:16, 7.55it/s]
reward: -1.7624, last reward: -0.0056, gradient norm: 0.3858: 81%|████████ | 504/625 [01:07<00:16, 7.56it/s]
reward: -2.0890, last reward: -0.0042, gradient norm: 0.38: 81%|████████ | 504/625 [01:07<00:16, 7.56it/s]
reward: -2.0890, last reward: -0.0042, gradient norm: 0.38: 81%|████████ | 505/625 [01:07<00:15, 7.56it/s]
reward: -1.7505, last reward: -0.0017, gradient norm: 0.2157: 81%|████████ | 505/625 [01:07<00:15, 7.56it/s]
reward: -1.7505, last reward: -0.0017, gradient norm: 0.2157: 81%|████████ | 506/625 [01:07<00:15, 7.56it/s]
reward: -1.8394, last reward: -0.0013, gradient norm: 0.3413: 81%|████████ | 506/625 [01:07<00:15, 7.56it/s]
reward: -1.8394, last reward: -0.0013, gradient norm: 0.3413: 81%|████████ | 507/625 [01:07<00:15, 7.56it/s]
reward: -1.9609, last reward: -0.0041, gradient norm: 0.6905: 81%|████████ | 507/625 [01:07<00:15, 7.56it/s]
reward: -1.9609, last reward: -0.0041, gradient norm: 0.6905: 81%|████████▏ | 508/625 [01:07<00:15, 7.56it/s]
reward: -1.8467, last reward: -0.0011, gradient norm: 0.4409: 81%|████████▏ | 508/625 [01:07<00:15, 7.56it/s]
reward: -1.8467, last reward: -0.0011, gradient norm: 0.4409: 81%|████████▏ | 509/625 [01:07<00:15, 7.56it/s]
reward: -2.0252, last reward: -0.0021, gradient norm: 0.213: 81%|████████▏ | 509/625 [01:07<00:15, 7.56it/s]
reward: -2.0252, last reward: -0.0021, gradient norm: 0.213: 82%|████████▏ | 510/625 [01:07<00:15, 7.57it/s]
reward: -1.8128, last reward: -0.0073, gradient norm: 0.3559: 82%|████████▏ | 510/625 [01:08<00:15, 7.57it/s]
reward: -1.8128, last reward: -0.0073, gradient norm: 0.3559: 82%|████████▏ | 511/625 [01:08<00:15, 7.56it/s]
reward: -2.1479, last reward: -0.0264, gradient norm: 3.68: 82%|████████▏ | 511/625 [01:08<00:15, 7.56it/s]
reward: -2.1479, last reward: -0.0264, gradient norm: 3.68: 82%|████████▏ | 512/625 [01:08<00:14, 7.56it/s]
reward: -2.1589, last reward: -0.0025, gradient norm: 5.566: 82%|████████▏ | 512/625 [01:08<00:14, 7.56it/s]
reward: -2.1589, last reward: -0.0025, gradient norm: 5.566: 82%|████████▏ | 513/625 [01:08<00:14, 7.56it/s]
reward: -2.2756, last reward: -0.0046, gradient norm: 0.5266: 82%|████████▏ | 513/625 [01:08<00:14, 7.56it/s]
reward: -2.2756, last reward: -0.0046, gradient norm: 0.5266: 82%|████████▏ | 514/625 [01:08<00:14, 7.56it/s]
reward: -1.9873, last reward: -0.0112, gradient norm: 0.9314: 82%|████████▏ | 514/625 [01:08<00:14, 7.56it/s]
reward: -1.9873, last reward: -0.0112, gradient norm: 0.9314: 82%|████████▏ | 515/625 [01:08<00:14, 7.57it/s]
reward: -2.3791, last reward: -0.0721, gradient norm: 1.14: 82%|████████▏ | 515/625 [01:08<00:14, 7.57it/s]
reward: -2.3791, last reward: -0.0721, gradient norm: 1.14: 83%|████████▎ | 516/625 [01:08<00:14, 7.57it/s]
reward: -2.4580, last reward: -0.0758, gradient norm: 0.6114: 83%|████████▎ | 516/625 [01:08<00:14, 7.57it/s]
reward: -2.4580, last reward: -0.0758, gradient norm: 0.6114: 83%|████████▎ | 517/625 [01:08<00:14, 7.58it/s]
reward: -1.9748, last reward: -0.0001, gradient norm: 0.2431: 83%|████████▎ | 517/625 [01:08<00:14, 7.58it/s]
reward: -1.9748, last reward: -0.0001, gradient norm: 0.2431: 83%|████████▎ | 518/625 [01:08<00:14, 7.58it/s]
reward: -2.1958, last reward: -0.0044, gradient norm: 0.5553: 83%|████████▎ | 518/625 [01:09<00:14, 7.58it/s]
reward: -2.1958, last reward: -0.0044, gradient norm: 0.5553: 83%|████████▎ | 519/625 [01:09<00:13, 7.58it/s]
reward: -1.8924, last reward: -0.0097, gradient norm: 17.34: 83%|████████▎ | 519/625 [01:09<00:13, 7.58it/s]
reward: -1.8924, last reward: -0.0097, gradient norm: 17.34: 83%|████████▎ | 520/625 [01:09<00:13, 7.58it/s]
reward: -2.3737, last reward: -0.0234, gradient norm: 1.899: 83%|████████▎ | 520/625 [01:09<00:13, 7.58it/s]
reward: -2.3737, last reward: -0.0234, gradient norm: 1.899: 83%|████████▎ | 521/625 [01:09<00:13, 7.58it/s]
reward: -1.9125, last reward: -0.0063, gradient norm: 0.4623: 83%|████████▎ | 521/625 [01:09<00:13, 7.58it/s]
reward: -1.9125, last reward: -0.0063, gradient norm: 0.4623: 84%|████████▎ | 522/625 [01:09<00:13, 7.58it/s]
reward: -2.3230, last reward: -0.0589, gradient norm: 0.3784: 84%|████████▎ | 522/625 [01:09<00:13, 7.58it/s]
reward: -2.3230, last reward: -0.0589, gradient norm: 0.3784: 84%|████████▎ | 523/625 [01:09<00:13, 7.58it/s]
reward: -1.9482, last reward: -0.0051, gradient norm: 1.105: 84%|████████▎ | 523/625 [01:09<00:13, 7.58it/s]
reward: -1.9482, last reward: -0.0051, gradient norm: 1.105: 84%|████████▍ | 524/625 [01:09<00:13, 7.58it/s]
reward: -2.1979, last reward: -0.0045, gradient norm: 0.6401: 84%|████████▍ | 524/625 [01:09<00:13, 7.58it/s]
reward: -2.1979, last reward: -0.0045, gradient norm: 0.6401: 84%|████████▍ | 525/625 [01:09<00:13, 7.57it/s]
reward: -2.1588, last reward: -0.0048, gradient norm: 0.6255: 84%|████████▍ | 525/625 [01:10<00:13, 7.57it/s]
reward: -2.1588, last reward: -0.0048, gradient norm: 0.6255: 84%|████████▍ | 526/625 [01:10<00:17, 5.54it/s]
reward: -1.6084, last reward: -0.0010, gradient norm: 0.3477: 84%|████████▍ | 526/625 [01:10<00:17, 5.54it/s]
reward: -1.6084, last reward: -0.0010, gradient norm: 0.3477: 84%|████████▍ | 527/625 [01:10<00:16, 6.02it/s]
reward: -2.1475, last reward: -0.0209, gradient norm: 0.3456: 84%|████████▍ | 527/625 [01:10<00:16, 6.02it/s]
reward: -2.1475, last reward: -0.0209, gradient norm: 0.3456: 84%|████████▍ | 528/625 [01:10<00:15, 6.41it/s]
reward: -1.7611, last reward: -0.1040, gradient norm: 18.52: 84%|████████▍ | 528/625 [01:10<00:15, 6.41it/s]
reward: -1.7611, last reward: -0.1040, gradient norm: 18.52: 85%|████████▍ | 529/625 [01:10<00:14, 6.71it/s]
reward: -2.0099, last reward: -0.0173, gradient norm: 1.643: 85%|████████▍ | 529/625 [01:10<00:14, 6.71it/s]
reward: -2.0099, last reward: -0.0173, gradient norm: 1.643: 85%|████████▍ | 530/625 [01:10<00:13, 6.94it/s]
reward: -2.8189, last reward: -1.4358, gradient norm: 46.61: 85%|████████▍ | 530/625 [01:10<00:13, 6.94it/s]
reward: -2.8189, last reward: -1.4358, gradient norm: 46.61: 85%|████████▍ | 531/625 [01:10<00:13, 7.11it/s]
reward: -2.9897, last reward: -2.4869, gradient norm: 51.23: 85%|████████▍ | 531/625 [01:10<00:13, 7.11it/s]
reward: -2.9897, last reward: -2.4869, gradient norm: 51.23: 85%|████████▌ | 532/625 [01:10<00:12, 7.25it/s]
reward: -2.1548, last reward: -0.9751, gradient norm: 72.21: 85%|████████▌ | 532/625 [01:11<00:12, 7.25it/s]
reward: -2.1548, last reward: -0.9751, gradient norm: 72.21: 85%|████████▌ | 533/625 [01:11<00:12, 7.34it/s]
reward: -1.6362, last reward: -0.0022, gradient norm: 0.7495: 85%|████████▌ | 533/625 [01:11<00:12, 7.34it/s]
reward: -1.6362, last reward: -0.0022, gradient norm: 0.7495: 85%|████████▌ | 534/625 [01:11<00:12, 7.37it/s]
reward: -2.1749, last reward: -0.0105, gradient norm: 0.9513: 85%|████████▌ | 534/625 [01:11<00:12, 7.37it/s]
reward: -2.1749, last reward: -0.0105, gradient norm: 0.9513: 86%|████████▌ | 535/625 [01:11<00:12, 7.42it/s]
reward: -1.7708, last reward: -0.0371, gradient norm: 1.432: 86%|████████▌ | 535/625 [01:11<00:12, 7.42it/s]
reward: -1.7708, last reward: -0.0371, gradient norm: 1.432: 86%|████████▌ | 536/625 [01:11<00:11, 7.47it/s]
reward: -2.2649, last reward: -0.0437, gradient norm: 2.327: 86%|████████▌ | 536/625 [01:11<00:11, 7.47it/s]
reward: -2.2649, last reward: -0.0437, gradient norm: 2.327: 86%|████████▌ | 537/625 [01:11<00:11, 7.46it/s]
reward: -2.5491, last reward: -0.0276, gradient norm: 1.246: 86%|████████▌ | 537/625 [01:11<00:11, 7.46it/s]
reward: -2.5491, last reward: -0.0276, gradient norm: 1.246: 86%|████████▌ | 538/625 [01:11<00:11, 7.47it/s]
reward: -2.6426, last reward: -0.7294, gradient norm: 1.078e+03: 86%|████████▌ | 538/625 [01:11<00:11, 7.47it/s]
reward: -2.6426, last reward: -0.7294, gradient norm: 1.078e+03: 86%|████████▌ | 539/625 [01:11<00:11, 7.49it/s]
reward: -1.9928, last reward: -0.0003, gradient norm: 1.576: 86%|████████▌ | 539/625 [01:12<00:11, 7.49it/s]
reward: -1.9928, last reward: -0.0003, gradient norm: 1.576: 86%|████████▋ | 540/625 [01:12<00:11, 7.51it/s]
reward: -1.7937, last reward: -0.0124, gradient norm: 0.9664: 86%|████████▋ | 540/625 [01:12<00:11, 7.51it/s]
reward: -1.7937, last reward: -0.0124, gradient norm: 0.9664: 87%|████████▋ | 541/625 [01:12<00:11, 7.53it/s]
reward: -2.3342, last reward: -0.0204, gradient norm: 1.81: 87%|████████▋ | 541/625 [01:12<00:11, 7.53it/s]
reward: -2.3342, last reward: -0.0204, gradient norm: 1.81: 87%|████████▋ | 542/625 [01:12<00:11, 7.52it/s]
reward: -2.2046, last reward: -0.0122, gradient norm: 1.004: 87%|████████▋ | 542/625 [01:12<00:11, 7.52it/s]
reward: -2.2046, last reward: -0.0122, gradient norm: 1.004: 87%|████████▋ | 543/625 [01:12<00:10, 7.49it/s]
reward: -2.0000, last reward: -0.0014, gradient norm: 0.5496: 87%|████████▋ | 543/625 [01:12<00:10, 7.49it/s]
reward: -2.0000, last reward: -0.0014, gradient norm: 0.5496: 87%|████████▋ | 544/625 [01:12<00:10, 7.51it/s]
reward: -2.0956, last reward: -0.0059, gradient norm: 1.425: 87%|████████▋ | 544/625 [01:12<00:10, 7.51it/s]
reward: -2.0956, last reward: -0.0059, gradient norm: 1.425: 87%|████████▋ | 545/625 [01:12<00:10, 7.54it/s]
reward: -2.9028, last reward: -0.5843, gradient norm: 21.12: 87%|████████▋ | 545/625 [01:12<00:10, 7.54it/s]
reward: -2.9028, last reward: -0.5843, gradient norm: 21.12: 87%|████████▋ | 546/625 [01:12<00:10, 7.55it/s]
reward: -2.0674, last reward: -0.0178, gradient norm: 0.797: 87%|████████▋ | 546/625 [01:12<00:10, 7.55it/s]
reward: -2.0674, last reward: -0.0178, gradient norm: 0.797: 88%|████████▊ | 547/625 [01:12<00:10, 7.55it/s]
reward: -2.2815, last reward: -0.0599, gradient norm: 1.227: 88%|████████▊ | 547/625 [01:13<00:10, 7.55it/s]
reward: -2.2815, last reward: -0.0599, gradient norm: 1.227: 88%|████████▊ | 548/625 [01:13<00:10, 7.55it/s]
reward: -3.1587, last reward: -0.9276, gradient norm: 20.56: 88%|████████▊ | 548/625 [01:13<00:10, 7.55it/s]
reward: -3.1587, last reward: -0.9276, gradient norm: 20.56: 88%|████████▊ | 549/625 [01:13<00:10, 7.54it/s]
reward: -3.8228, last reward: -2.9229, gradient norm: 308.2: 88%|████████▊ | 549/625 [01:13<00:10, 7.54it/s]
reward: -3.8228, last reward: -2.9229, gradient norm: 308.2: 88%|████████▊ | 550/625 [01:13<00:09, 7.52it/s]
reward: -1.6164, last reward: -0.0120, gradient norm: 2.259: 88%|████████▊ | 550/625 [01:13<00:09, 7.52it/s]
reward: -1.6164, last reward: -0.0120, gradient norm: 2.259: 88%|████████▊ | 551/625 [01:13<00:09, 7.53it/s]
reward: -1.6850, last reward: -0.0227, gradient norm: 0.9167: 88%|████████▊ | 551/625 [01:13<00:09, 7.53it/s]
reward: -1.6850, last reward: -0.0227, gradient norm: 0.9167: 88%|████████▊ | 552/625 [01:13<00:09, 7.54it/s]
reward: -2.3092, last reward: -0.0670, gradient norm: 0.9177: 88%|████████▊ | 552/625 [01:13<00:09, 7.54it/s]
reward: -2.3092, last reward: -0.0670, gradient norm: 0.9177: 88%|████████▊ | 553/625 [01:13<00:09, 7.55it/s]
reward: -2.1599, last reward: -0.0043, gradient norm: 1.195: 88%|████████▊ | 553/625 [01:13<00:09, 7.55it/s]
reward: -2.1599, last reward: -0.0043, gradient norm: 1.195: 89%|████████▊ | 554/625 [01:13<00:09, 7.55it/s]
reward: -2.4672, last reward: -0.0057, gradient norm: 0.6367: 89%|████████▊ | 554/625 [01:14<00:09, 7.55it/s]
reward: -2.4672, last reward: -0.0057, gradient norm: 0.6367: 89%|████████▉ | 555/625 [01:14<00:09, 7.55it/s]
reward: -2.3657, last reward: -0.1970, gradient norm: 4.202: 89%|████████▉ | 555/625 [01:14<00:09, 7.55it/s]
reward: -2.3657, last reward: -0.1970, gradient norm: 4.202: 89%|████████▉ | 556/625 [01:14<00:09, 7.56it/s]
reward: -2.6694, last reward: -0.1215, gradient norm: 1.324: 89%|████████▉ | 556/625 [01:14<00:09, 7.56it/s]
reward: -2.6694, last reward: -0.1215, gradient norm: 1.324: 89%|████████▉ | 557/625 [01:14<00:09, 7.54it/s]
reward: -2.2622, last reward: -0.0372, gradient norm: 0.4841: 89%|████████▉ | 557/625 [01:14<00:09, 7.54it/s]
reward: -2.2622, last reward: -0.0372, gradient norm: 0.4841: 89%|████████▉ | 558/625 [01:14<00:08, 7.53it/s]
reward: -2.2707, last reward: -0.0058, gradient norm: 5.757: 89%|████████▉ | 558/625 [01:14<00:08, 7.53it/s]
reward: -2.2707, last reward: -0.0058, gradient norm: 5.757: 89%|████████▉ | 559/625 [01:14<00:08, 7.54it/s]
reward: -2.2267, last reward: -0.0014, gradient norm: 0.5415: 89%|████████▉ | 559/625 [01:14<00:08, 7.54it/s]
reward: -2.2267, last reward: -0.0014, gradient norm: 0.5415: 90%|████████▉ | 560/625 [01:14<00:08, 7.55it/s]
reward: -2.4556, last reward: -0.0163, gradient norm: 1.146: 90%|████████▉ | 560/625 [01:14<00:08, 7.55it/s]
reward: -2.4556, last reward: -0.0163, gradient norm: 1.146: 90%|████████▉ | 561/625 [01:14<00:08, 7.55it/s]
reward: -2.1839, last reward: -0.0809, gradient norm: 0.6262: 90%|████████▉ | 561/625 [01:14<00:08, 7.55it/s]
reward: -2.1839, last reward: -0.0809, gradient norm: 0.6262: 90%|████████▉ | 562/625 [01:14<00:08, 7.55it/s]
reward: -2.0278, last reward: -0.0018, gradient norm: 1.327: 90%|████████▉ | 562/625 [01:15<00:08, 7.55it/s]
reward: -2.0278, last reward: -0.0018, gradient norm: 1.327: 90%|█████████ | 563/625 [01:15<00:08, 7.55it/s]
reward: -2.1112, last reward: -0.0011, gradient norm: 0.354: 90%|█████████ | 563/625 [01:15<00:08, 7.55it/s]
reward: -2.1112, last reward: -0.0011, gradient norm: 0.354: 90%|█████████ | 564/625 [01:15<00:08, 7.56it/s]
reward: -2.6155, last reward: -0.0004, gradient norm: 2.008: 90%|█████████ | 564/625 [01:15<00:08, 7.56it/s]
reward: -2.6155, last reward: -0.0004, gradient norm: 2.008: 90%|█████████ | 565/625 [01:15<00:07, 7.57it/s]
reward: -3.1427, last reward: -0.3582, gradient norm: 7.624: 90%|█████████ | 565/625 [01:15<00:07, 7.57it/s]
reward: -3.1427, last reward: -0.3582, gradient norm: 7.624: 91%|█████████ | 566/625 [01:15<00:07, 7.56it/s]
reward: -2.7870, last reward: -0.9490, gradient norm: 18.26: 91%|█████████ | 566/625 [01:15<00:07, 7.56it/s]
reward: -2.7870, last reward: -0.9490, gradient norm: 18.26: 91%|█████████ | 567/625 [01:15<00:07, 7.56it/s]
reward: -3.0439, last reward: -0.8796, gradient norm: 29.89: 91%|█████████ | 567/625 [01:15<00:07, 7.56it/s]
reward: -3.0439, last reward: -0.8796, gradient norm: 29.89: 91%|█████████ | 568/625 [01:15<00:07, 7.55it/s]
reward: -2.8026, last reward: -0.2720, gradient norm: 8.612: 91%|█████████ | 568/625 [01:15<00:07, 7.55it/s]
reward: -2.8026, last reward: -0.2720, gradient norm: 8.612: 91%|█████████ | 569/625 [01:15<00:07, 7.54it/s]
reward: -2.3147, last reward: -0.8486, gradient norm: 41.13: 91%|█████████ | 569/625 [01:16<00:07, 7.54it/s]
reward: -2.3147, last reward: -0.8486, gradient norm: 41.13: 91%|█████████ | 570/625 [01:16<00:07, 7.55it/s]
reward: -1.7917, last reward: -0.0129, gradient norm: 2.365: 91%|█████████ | 570/625 [01:16<00:07, 7.55it/s]
reward: -1.7917, last reward: -0.0129, gradient norm: 2.365: 91%|█████████▏| 571/625 [01:16<00:07, 7.56it/s]
reward: -1.9553, last reward: -0.0020, gradient norm: 0.6871: 91%|█████████▏| 571/625 [01:16<00:07, 7.56it/s]
reward: -1.9553, last reward: -0.0020, gradient norm: 0.6871: 92%|█████████▏| 572/625 [01:16<00:07, 7.56it/s]
reward: -2.3132, last reward: -0.0159, gradient norm: 0.8646: 92%|█████████▏| 572/625 [01:16<00:07, 7.56it/s]
reward: -2.3132, last reward: -0.0159, gradient norm: 0.8646: 92%|█████████▏| 573/625 [01:16<00:06, 7.52it/s]
reward: -1.5320, last reward: -0.0269, gradient norm: 1.02: 92%|█████████▏| 573/625 [01:16<00:06, 7.52it/s]
reward: -1.5320, last reward: -0.0269, gradient norm: 1.02: 92%|█████████▏| 574/625 [01:16<00:06, 7.54it/s]
reward: -2.2955, last reward: -0.0245, gradient norm: 1.267: 92%|█████████▏| 574/625 [01:16<00:06, 7.54it/s]
reward: -2.2955, last reward: -0.0245, gradient norm: 1.267: 92%|█████████▏| 575/625 [01:16<00:06, 7.55it/s]
reward: -2.3347, last reward: -0.0179, gradient norm: 1.528: 92%|█████████▏| 575/625 [01:16<00:06, 7.55it/s]
reward: -2.3347, last reward: -0.0179, gradient norm: 1.528: 92%|█████████▏| 576/625 [01:16<00:06, 7.54it/s]
reward: -1.9718, last reward: -0.1629, gradient norm: 8.804: 92%|█████████▏| 576/625 [01:16<00:06, 7.54it/s]
reward: -1.9718, last reward: -0.1629, gradient norm: 8.804: 92%|█████████▏| 577/625 [01:16<00:06, 7.54it/s]
reward: -2.4164, last reward: -0.0070, gradient norm: 0.4335: 92%|█████████▏| 577/625 [01:17<00:06, 7.54it/s]
reward: -2.4164, last reward: -0.0070, gradient norm: 0.4335: 92%|█████████▏| 578/625 [01:17<00:06, 7.55it/s]
reward: -2.2993, last reward: -0.0011, gradient norm: 1.371: 92%|█████████▏| 578/625 [01:17<00:06, 7.55it/s]
reward: -2.2993, last reward: -0.0011, gradient norm: 1.371: 93%|█████████▎| 579/625 [01:17<00:06, 7.53it/s]
reward: -3.3049, last reward: -0.9063, gradient norm: 34.23: 93%|█████████▎| 579/625 [01:17<00:06, 7.53it/s]
reward: -3.3049, last reward: -0.9063, gradient norm: 34.23: 93%|█████████▎| 580/625 [01:17<00:05, 7.52it/s]
reward: -2.8785, last reward: -0.3295, gradient norm: 10.91: 93%|█████████▎| 580/625 [01:17<00:05, 7.52it/s]
reward: -2.8785, last reward: -0.3295, gradient norm: 10.91: 93%|█████████▎| 581/625 [01:17<00:05, 7.53it/s]
reward: -2.5184, last reward: -0.0546, gradient norm: 21.09: 93%|█████████▎| 581/625 [01:17<00:05, 7.53it/s]
reward: -2.5184, last reward: -0.0546, gradient norm: 21.09: 93%|█████████▎| 582/625 [01:17<00:05, 7.55it/s]
reward: -2.4039, last reward: -0.4589, gradient norm: 10.86: 93%|█████████▎| 582/625 [01:17<00:05, 7.55it/s]
reward: -2.4039, last reward: -0.4589, gradient norm: 10.86: 93%|█████████▎| 583/625 [01:17<00:05, 7.56it/s]
reward: -2.4697, last reward: -0.2476, gradient norm: 4.689: 93%|█████████▎| 583/625 [01:17<00:05, 7.56it/s]
reward: -2.4697, last reward: -0.2476, gradient norm: 4.689: 93%|█████████▎| 584/625 [01:17<00:05, 7.56it/s]
reward: -2.0018, last reward: -0.2397, gradient norm: 8.393: 93%|█████████▎| 584/625 [01:17<00:05, 7.56it/s]
reward: -2.0018, last reward: -0.2397, gradient norm: 8.393: 94%|█████████▎| 585/625 [01:17<00:05, 7.56it/s]
reward: -2.4953, last reward: -0.1775, gradient norm: 24.17: 94%|█████████▎| 585/625 [01:18<00:05, 7.56it/s]
reward: -2.4953, last reward: -0.1775, gradient norm: 24.17: 94%|█████████▍| 586/625 [01:18<00:05, 7.57it/s]
reward: -2.2258, last reward: -0.0110, gradient norm: 0.7671: 94%|█████████▍| 586/625 [01:18<00:05, 7.57it/s]
reward: -2.2258, last reward: -0.0110, gradient norm: 0.7671: 94%|█████████▍| 587/625 [01:18<00:05, 7.58it/s]
reward: -2.3981, last reward: -0.0011, gradient norm: 1.617: 94%|█████████▍| 587/625 [01:18<00:05, 7.58it/s]
reward: -2.3981, last reward: -0.0011, gradient norm: 1.617: 94%|█████████▍| 588/625 [01:18<00:04, 7.54it/s]
reward: -1.8590, last reward: -0.0007, gradient norm: 1.131: 94%|█████████▍| 588/625 [01:18<00:04, 7.54it/s]
reward: -1.8590, last reward: -0.0007, gradient norm: 1.131: 94%|█████████▍| 589/625 [01:18<00:04, 7.55it/s]
reward: -1.9820, last reward: -0.4221, gradient norm: 49.4: 94%|█████████▍| 589/625 [01:18<00:04, 7.55it/s]
reward: -2.1253, last reward: -0.0001, gradient norm: 0.6622: 100%|██████████| 625/625 [01:23<00:00, 7.50it/s]
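Conclusion
In this tutorial, we learned how to write a stateless environment from scratch. We covered the following topics:
The four essential components to handle when writing an environment (step, reset, seeding, and building the specs), and how these methods and classes interact with the TensorDict class;
How to test that an environment is written correctly using check_env_specs();
How to append transforms in the context of stateless environments, and how to write custom transforms;
How to train a policy on a fully differentiable simulator.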
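As a compact recap of what "stateless" means here, the following is a minimal sketch in plain Python, not the TorchRL implementation built in this tutorial. The function name `stateless_step` and the module-level constants are illustrative assumptions; the constants follow the Pendulum-v1 defaults. The point it shows is that the full state is passed in at every call and returned by it, so nothing is stored between steps.

```python
import math

# Assumed constants for illustration, following the Pendulum-v1 defaults.
G, MASS, LENGTH, DT = 10.0, 1.0, 1.0, 0.05
MAX_SPEED, MAX_TORQUE = 8.0, 2.0


def angle_normalize(x):
    """Wrap an angle to the interval [-pi, pi)."""
    return ((x + math.pi) % (2 * math.pi)) - math.pi


def stateless_step(th, thdot, u):
    """Hypothetical helper: one transition that reads and returns the full state."""
    # Clamp the torque to the allowed range.
    u = max(-MAX_TORQUE, min(MAX_TORQUE, u))
    # The reward penalizes angle, angular velocity and torque.
    reward = -(angle_normalize(th) ** 2 + 0.1 * thdot ** 2 + 0.001 * u ** 2)
    # Update the angular velocity, then the angle.
    accel = 3 * G / (2 * LENGTH) * math.sin(th) + 3.0 / (MASS * LENGTH ** 2) * u
    new_thdot = max(-MAX_SPEED, min(MAX_SPEED, thdot + accel * DT))
    new_th = th + new_thdot * DT
    return new_th, new_thdot, reward


# The caller owns the state and feeds it back in at every step.
state = (math.pi, 0.0)
for _ in range(3):
    th, thdot, reward = stateless_step(*state, u=0.5)
    state = (th, thdot)
```

Because the function keeps no internal state, the same inputs always produce the same outputs, which is what makes resetting at any point and batched execution straightforward in the tensor-based version.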
Total running time of the script: (2 minutes 39.477 seconds)
Estimated memory usage: 317 MB