TargetReturn

class torchrl.envs.transforms.TargetReturn(target_return: float, mode: str = 'reduce', in_keys: Sequence[NestedKey] | None = None, out_keys: Sequence[NestedKey] | None = None, reset_key: NestedKey | None = None)

Sets a target return for the agent to achieve in the environment.

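In goal-conditioned RL, the TargetReturn is defined as the expected cumulative reward obtained from the current state to the goal state or the end of the episode. It is used as an input to the policy to guide its behavior. For a trained policy, the maximum return achievable in the environment is typically chosen as the target return. However, because it is fed to the policy module as an input, it should be scaled accordingly. The TargetReturn transform updates the tensordict to include the user-specified target return. The mode parameter specifies whether the target return is updated at every step by subtracting the reward obtained at that step ("reduce") or remains constant ("constant").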
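The two modes can be illustrated with a minimal plain-Python sketch. This is not TorchRL's implementation: the helper name `target_return_trace` is hypothetical, and the sketch assumes that the reward is simply subtracted from the running target once per step.

```python
from typing import List, Sequence


def target_return_trace(
    target_return: float,
    rewards: Sequence[float],
    mode: str = "reduce",
) -> List[float]:
    """Return the target-return value before each step and after the last one.

    Hypothetical helper for illustration only; it mirrors the documented
    behavior of the two modes, not the library's internals.
    """
    if mode not in ("reduce", "constant"):
        raise ValueError(f"mode must be 'reduce' or 'constant', got {mode!r}")

    current = target_return
    trace = [current]  # value seen at the first step
    for reward in rewards:
        if mode == "reduce":
            # Subtract the reward just collected from the remaining target.
            current = current - reward
        # In "constant" mode the target is left untouched.
        trace.append(current)
    return trace


reduce_trace = target_return_trace(10.0, [1.0, 1.0, 1.0], mode="reduce")
constant_trace = target_return_trace(10.0, [1.0, 1.0, 1.0], mode="constant")
```

With a reward of 1 per step, the "reduce" trace counts down from the target while the "constant" trace stays flat.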

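Parameters:

target_return (float) – the target return to be achieved by the agent.
mode (str) – the mode used to update the target return. Can be either "reduce" or "constant". Default: "reduce".
in_keys (sequence of NestedKey, optional) – keys pointing to the reward entries. Defaults to the reward keys of the parent env.
out_keys (sequence of NestedKey, optional) – keys pointing to the target-return entries. Defaults to a copy of in_keys in which the last element is replaced with "target_return"; an exception is raised if the resulting keys are not unique.
reset_key (NestedKey, optional) – the reset key to be used as the partial-reset indicator. Must be unique. If not provided, defaults to the only reset key of the parent environment (if it has exactly one); otherwise an exception is raised.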
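The default for out_keys described above can be sketched as follows. The helper `default_out_keys` and the `NestedKey` alias defined here are hypothetical and only mirror the documented rule (replace the last element of each reward key with "target_return" and reject duplicates); the library's own key handling may differ in detail.

```python
from typing import List, Sequence, Tuple, Union

# Local stand-in for a nested key: a string or a tuple of strings (assumption).
NestedKey = Union[str, Tuple[str, ...]]


def default_out_keys(in_keys: Sequence[NestedKey]) -> List[NestedKey]:
    """Derive target-return keys from reward keys (illustrative sketch)."""
    out_keys: List[NestedKey] = []
    for key in in_keys:
        # Treat a plain string as a one-element nested key.
        as_tuple = (key,) if isinstance(key, str) else tuple(key)
        # Keep the prefix and swap the last element for "target_return".
        new_key = as_tuple[:-1] + ("target_return",)
        # Collapse one-element tuples back to a plain string for readability.
        out_keys.append(new_key[0] if len(new_key) == 1 else new_key)
    if len(set(out_keys)) != len(out_keys):
        raise ValueError(f"derived out_keys are not unique: {out_keys}")
    return out_keys


flat = default_out_keys(["reward"])
nested = default_out_keys([("agents", "reward")])
```

Two reward keys that share a prefix, such as ("agents", "reward") and ("agents", "bonus"), would both map to ("agents", "target_return"), which is the non-unique case in which out_keys must be passed explicitly.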

Examples

>>> import torch
>>> from torchrl.envs import GymEnv, TransformedEnv
>>> from torchrl.envs.transforms import TargetReturn
>>> env = TransformedEnv(
...     GymEnv("CartPole-v1"),
...     TargetReturn(10.0, mode="reduce"))
>>> env.set_seed(0)
>>> torch.manual_seed(0)
>>> env.rollout(20)['target_return'].squeeze()
tensor([10.,  9.,  8.,  7.,  6.,  5.,  4.,  3.,  2.,  1.,  0., -1., -2., -3.])

forward(tensordict: TensorDictBase) → TensorDictBase

Reads the input tensordict and applies the transform to the selected keys.

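transform_input_spec(input_spec: TensorSpec) → TensorSpec

Transforms the input spec such that the resulting spec matches the transform mapping.

Parameters: input_spec (TensorSpec) – spec before the transform
Returns: expected spec after the transform

transform_observation_spec(observation_spec: TensorSpec) → TensorSpec

Transforms the observation spec such that the resulting spec matches the transform mapping.

Parameters: observation_spec (TensorSpec) – spec before the transform
Returns: expected spec after the transform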
