
Hyperparameter Tuning with Ray Tune

Created: Aug 31, 2020 | Last updated: Oct 31, 2024 | Last verified: Nov 05, 2024

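Hyperparameter tuning can make the difference between an average model and a highly accurate one. Often, simple changes such as choosing a different learning rate or changing the size of a network layer can have a significant impact on model performance.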

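Fortunately, there are tools that help with finding the best combination of parameters. Ray Tune is an industry-standard tool for distributed hyperparameter tuning. Ray Tune includes the latest hyperparameter search algorithms, integrates with various analysis libraries, and natively supports distributed training through Ray's distributed machine learning engine.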

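In this tutorial, we will show you how to integrate Ray Tune into your PyTorch training workflow. We will extend the tutorial from the PyTorch documentation that trains a CIFAR10 image classifier.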

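As you will see, we only need to make a few small modifications. In particular, we need to

  1. wrap data loading and training in functions,

  2. make some network parameters configurable,

  3. add checkpointing (optional),

  4. and define the search space for the model tuning.


To run this tutorial, please make sure the following packages are installed:

  • ray[tune]: distributed hyperparameter tuning library

  • torchvision: for the data transformers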

Setup / Imports

Let's start with the imports:

from functools import partial
import os
import tempfile
from pathlib import Path
import torch
import torch.nn as nn
import torch.nn.functional as F
import torch.optim as optim
from torch.utils.data import random_split
import torchvision
import torchvision.transforms as transforms
from ray import tune
from ray import train
from ray.train import Checkpoint, get_checkpoint
from ray.tune.schedulers import ASHAScheduler
import ray.cloudpickle as pickle

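Most of the imports are needed for building the PyTorch model. Only the last few imports are for Ray Tune.

Data loaders

We wrap the data loaders in their own function and pass a global data directory. This way we can share a data directory between different trials.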

def load_data(data_dir="./data"):
    transform = transforms.Compose(
        [transforms.ToTensor(), transforms.Normalize((0.5, 0.5, 0.5), (0.5, 0.5, 0.5))]
    )

    trainset = torchvision.datasets.CIFAR10(
        root=data_dir, train=True, download=True, transform=transform
    )

    testset = torchvision.datasets.CIFAR10(
        root=data_dir, train=False, download=True, transform=transform
    )

    return trainset, testset
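
As a rough illustration of what the Normalize step in load_data does: ToTensor scales pixel values to the range [0, 1], and Normalize then computes (x - mean) / std for each channel. The plain-Python sketch below uses a hypothetical helper, normalize_value, which is not part of torchvision, to show that a mean and standard deviation of 0.5 map that range to [-1, 1].

```python
def normalize_value(x, mean=0.5, std=0.5):
    """Apply the per-channel formula used by Normalize: (x - mean) / std.

    Hypothetical helper for illustration only; it works on a single number,
    whereas torchvision applies the same formula to whole image tensors.
    """
    return (x - mean) / std


# Pixel values after ToTensor lie in [0, 1].
for pixel in (0.0, 0.5, 1.0):
    print(pixel, "->", normalize_value(pixel))
```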

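Configurable neural network

We can only tune those parameters that are configurable. In this example, we can specify the layer sizes of the fully connected layers: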

class Net(nn.Module):
    def __init__(self, l1=120, l2=84):
        super(Net, self).__init__()
        self.conv1 = nn.Conv2d(3, 6, 5)
        self.pool = nn.MaxPool2d(2, 2)
        self.conv2 = nn.Conv2d(6, 16, 5)
        self.fc1 = nn.Linear(16 * 5 * 5, l1)
        self.fc2 = nn.Linear(l1, l2)
        self.fc3 = nn.Linear(l2, 10)

    def forward(self, x):
        x = self.pool(F.relu(self.conv1(x)))
        x = self.pool(F.relu(self.conv2(x)))
        x = torch.flatten(x, 1)  # flatten all dimensions except batch
        x = F.relu(self.fc1(x))
        x = F.relu(self.fc2(x))
        x = self.fc3(x)
        return x
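
To see where the 16 * 5 * 5 input size of fc1 comes from, and how l1 and l2 change the size of the model, the shape arithmetic can be sketched in plain Python. This assumes 32x32 CIFAR10 input images and no padding, as in the Net class above; the helper names are hypothetical.

```python
def conv_out(size, kernel, stride=1, padding=0):
    """Spatial output size of a convolution or pooling layer."""
    return (size + 2 * padding - kernel) // stride + 1


def fc1_input_features(image_size=32, out_channels=16):
    """Follow one spatial dimension through conv1, pool, conv2, pool."""
    size = conv_out(image_size, kernel=5)      # conv1: 32 -> 28
    size = conv_out(size, kernel=2, stride=2)  # pool:  28 -> 14
    size = conv_out(size, kernel=5)            # conv2: 14 -> 10
    size = conv_out(size, kernel=2, stride=2)  # pool:  10 -> 5
    return out_channels * size * size


def fc_parameter_count(l1=120, l2=84, in_features=400, num_classes=10):
    """Weights plus biases of fc1, fc2 and fc3."""
    return (
        (in_features * l1 + l1)
        + (l1 * l2 + l2)
        + (l2 * num_classes + num_classes)
    )


print(fc1_input_features())          # 16 * 5 * 5 = 400
print(fc_parameter_count())          # defaults l1=120, l2=84
print(fc_parameter_count(256, 256))  # largest values in the search space
```

The fully connected layers dominate the parameter count, which is why l1 and l2 are natural candidates for tuning.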

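The train function

Now it gets interesting, because we introduce some changes to the example from the PyTorch documentation.

We wrap the training script in a function train_cifar(config, data_dir=None). The config parameter will receive the hyperparameters we would like to train with. The data_dir specifies the directory where we load and store the data, so that multiple runs can share the same data source. If a checkpoint is provided, we also load the model and optimizer state at the start of the run. Further down in this tutorial you will find information on how to save checkpoints and what they are used for.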

net = Net(config["l1"], config["l2"])

checkpoint = get_checkpoint()
if checkpoint:
    with checkpoint.as_directory() as checkpoint_dir:
        data_path = Path(checkpoint_dir) / "data.pkl"
        with open(data_path, "rb") as fp:
            checkpoint_state = pickle.load(fp)
        start_epoch = checkpoint_state["epoch"]
        net.load_state_dict(checkpoint_state["net_state_dict"])
        optimizer.load_state_dict(checkpoint_state["optimizer_state_dict"])
else:
    start_epoch = 0

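The learning rate of the optimizer is made configurable, too. Note that in the full training function further down, the optimizer is created before the checkpoint is restored: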

optimizer = optim.SGD(net.parameters(), lr=config["lr"], momentum=0.9)

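We also split the training data into a training and a validation subset. We thus train on 80% of the data and calculate the validation loss on the remaining 20%. The batch size used to iterate through the training and validation sets is configurable as well.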
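
To make the effect of the split and of the batch size concrete, here is a small arithmetic sketch. It assumes the 50,000 images of the CIFAR10 training set; the helper names are hypothetical and not part of PyTorch.

```python
import math


def split_sizes(num_samples, train_fraction=0.8):
    """Sizes of the training and validation subsets, as computed in train_cifar."""
    num_train = int(num_samples * train_fraction)
    return num_train, num_samples - num_train


def batches_per_epoch(num_samples, batch_size):
    """Number of mini-batches a DataLoader yields (the last one may be smaller)."""
    return math.ceil(num_samples / batch_size)


num_train, num_val = split_sizes(50_000)
print("train:", num_train, "validation:", num_val)
for batch_size in (2, 4, 8, 16):
    print("batch_size", batch_size, "->", batches_per_epoch(num_train, batch_size), "steps")
```

Small batch sizes mean many more optimizer steps per epoch, which is one reason trials with batch_size=2 take noticeably longer per iteration.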

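Adding (multi) GPU support with DataParallel

Image classification benefits greatly from GPUs. Luckily, we can continue to use PyTorch's abstractions in Ray Tune. Thus, we can wrap our model in nn.DataParallel to support data-parallel training on multiple GPUs: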

device = "cpu"
if torch.cuda.is_available():
    device = "cuda:0"
    if torch.cuda.device_count() > 1:
        net = nn.DataParallel(net)
net.to(device)

By using a device variable we make sure that training also works when no GPUs are available. PyTorch requires us to send our data to GPU memory explicitly, like this:

for i, data in enumerate(trainloader, 0):
    inputs, labels = data
    inputs, labels = inputs.to(device), labels.to(device)

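The code now supports training on CPUs, on a single GPU, and on multiple GPUs. Notably, Ray also supports fractional GPUs, so we can share GPUs among trials as long as the model still fits in GPU memory. We will come back to that later.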

Communicating with Ray Tune

The most interesting part is the communication with Ray Tune:

checkpoint_data = {
    "epoch": epoch,
    "net_state_dict": net.state_dict(),
    "optimizer_state_dict": optimizer.state_dict(),
}
with tempfile.TemporaryDirectory() as checkpoint_dir:
    data_path = Path(checkpoint_dir) / "data.pkl"
    with open(data_path, "wb") as fp:
        pickle.dump(checkpoint_data, fp)

    checkpoint = Checkpoint.from_directory(checkpoint_dir)
    train.report(
        {"loss": val_loss / val_steps, "accuracy": correct / total},
        checkpoint=checkpoint,
    )

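Here we first save a checkpoint and then report some metrics back to Ray Tune. Specifically, we send the validation loss and accuracy back to Ray Tune. Ray Tune can then use these metrics to decide which hyperparameter configuration leads to the best results. These metrics can also be used to stop badly performing trials early in order to avoid wasting resources on them.

Saving checkpoints is optional. However, it is necessary if we want to use advanced schedulers like Population Based Training. Saving checkpoints also allows us to load the trained models later and validate them on a test set. Lastly, checkpoints are useful for fault tolerance, and they allow us to interrupt training and continue it later.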
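
The save-and-restore round trip can be tried in isolation. The following is a minimal sketch, not the tutorial's code: it uses the standard-library pickle module instead of ray.cloudpickle, plain dictionaries as stand-ins for the real state_dict objects, and two hypothetical helpers, save_state and load_state. Only the data.pkl file name and the dictionary keys are taken from the snippets above.

```python
import pickle
import tempfile
from pathlib import Path


def save_state(checkpoint_dir, state):
    """Write the checkpoint dictionary to data.pkl inside checkpoint_dir."""
    data_path = Path(checkpoint_dir) / "data.pkl"
    with open(data_path, "wb") as fp:
        pickle.dump(state, fp)
    return data_path


def load_state(checkpoint_dir):
    """Read the checkpoint dictionary back from data.pkl."""
    data_path = Path(checkpoint_dir) / "data.pkl"
    with open(data_path, "rb") as fp:
        return pickle.load(fp)


# Stand-in values; a real run would store net.state_dict() and
# optimizer.state_dict() here.
state = {
    "epoch": 3,
    "net_state_dict": {"fc3.bias": [0.0] * 10},
    "optimizer_state_dict": {"lr": 0.01, "momentum": 0.9},
}

with tempfile.TemporaryDirectory() as checkpoint_dir:
    save_state(checkpoint_dir, state)
    restored = load_state(checkpoint_dir)

print("restored epoch:", restored["epoch"])
```

In the tutorial, the temporary directory is handed to Checkpoint.from_directory and passed to train.report before the with block exits, so that the files still exist when Ray Tune persists them.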

Full training function

The full code example looks like this:

def train_cifar(config, data_dir=None):
    net = Net(config["l1"], config["l2"])

    device = "cpu"
    if torch.cuda.is_available():
        device = "cuda:0"
        if torch.cuda.device_count() > 1:
            net = nn.DataParallel(net)
    net.to(device)

    criterion = nn.CrossEntropyLoss()
    optimizer = optim.SGD(net.parameters(), lr=config["lr"], momentum=0.9)

    checkpoint = get_checkpoint()
    if checkpoint:
        with checkpoint.as_directory() as checkpoint_dir:
            data_path = Path(checkpoint_dir) / "data.pkl"
            with open(data_path, "rb") as fp:
                checkpoint_state = pickle.load(fp)
            start_epoch = checkpoint_state["epoch"]
            net.load_state_dict(checkpoint_state["net_state_dict"])
            optimizer.load_state_dict(checkpoint_state["optimizer_state_dict"])
    else:
        start_epoch = 0

    trainset, testset = load_data(data_dir)

    test_abs = int(len(trainset) * 0.8)
    train_subset, val_subset = random_split(
        trainset, [test_abs, len(trainset) - test_abs]
    )

    trainloader = torch.utils.data.DataLoader(
        train_subset, batch_size=int(config["batch_size"]), shuffle=True, num_workers=8
    )
    valloader = torch.utils.data.DataLoader(
        val_subset, batch_size=int(config["batch_size"]), shuffle=True, num_workers=8
    )

    for epoch in range(start_epoch, 10):  # loop over the dataset multiple times
        running_loss = 0.0
        epoch_steps = 0
        for i, data in enumerate(trainloader, 0):
            # get the inputs; data is a list of [inputs, labels]
            inputs, labels = data
            inputs, labels = inputs.to(device), labels.to(device)

            # zero the parameter gradients
            optimizer.zero_grad()

            # forward + backward + optimize
            outputs = net(inputs)
            loss = criterion(outputs, labels)
            loss.backward()
            optimizer.step()

            # print statistics
            running_loss += loss.item()
            epoch_steps += 1
            if i % 2000 == 1999:  # print every 2000 mini-batches
                print(
                    "[%d, %5d] loss: %.3f"
                    % (epoch + 1, i + 1, running_loss / epoch_steps)
                )
                running_loss = 0.0

        # Validation loss
        val_loss = 0.0
        val_steps = 0
        total = 0
        correct = 0
        for i, data in enumerate(valloader, 0):
            with torch.no_grad():
                inputs, labels = data
                inputs, labels = inputs.to(device), labels.to(device)

                outputs = net(inputs)
                _, predicted = torch.max(outputs.data, 1)
                total += labels.size(0)
                correct += (predicted == labels).sum().item()

                loss = criterion(outputs, labels)
                val_loss += loss.item()
                val_steps += 1

        checkpoint_data = {
            "epoch": epoch,
            "net_state_dict": net.state_dict(),
            "optimizer_state_dict": optimizer.state_dict(),
        }
        with tempfile.TemporaryDirectory() as checkpoint_dir:
            data_path = Path(checkpoint_dir) / "data.pkl"
            with open(data_path, "wb") as fp:
                pickle.dump(checkpoint_data, fp)

            checkpoint = Checkpoint.from_directory(checkpoint_dir)
            train.report(
                {"loss": val_loss / val_steps, "accuracy": correct / total},
                checkpoint=checkpoint,
            )

    print("Finished Training")

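As you can see, most of the code is adapted directly from the original example.

Test set accuracy

Commonly, the performance of a machine learning model is tested on a hold-out test set with data that has not been used for training the model. We also wrap this in a function: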

def test_accuracy(net, device="cpu"):
    trainset, testset = load_data()

    testloader = torch.utils.data.DataLoader(
        testset, batch_size=4, shuffle=False, num_workers=2
    )

    correct = 0
    total = 0
    with torch.no_grad():
        for data in testloader:
            images, labels = data
            images, labels = images.to(device), labels.to(device)
            outputs = net(images)
            _, predicted = torch.max(outputs.data, 1)
            total += labels.size(0)
            correct += (predicted == labels).sum().item()

    return correct / total

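The function also expects a device parameter, so we can do the test set validation on a GPU.

Configuring the search space

Lastly, we need to define Ray Tune's search space. Here is an example: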

config = {
    "l1": tune.choice([2 ** i for i in range(9)]),
    "l2": tune.choice([2 ** i for i in range(9)]),
    "lr": tune.loguniform(1e-4, 1e-1),
    "batch_size": tune.choice([2, 4, 8, 16])
}

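tune.choice() accepts a list of values that are uniformly sampled from. In this example, the l1 and l2 parameters are powers of 2 between 1 and 256, that is, 1, 2, 4, 8, 16, 32, 64, 128, or 256. The lr (learning rate) is sampled log-uniformly between 0.0001 and 0.1. Lastly, the batch size is a choice between 2, 4, 8, and 16.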
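
To build some intuition for these distributions without running Ray, the sampling can be imitated with the standard library. This is only a sketch of the idea: sample_loguniform and sample_config are hypothetical helpers, and Ray Tune's own sampling code may differ in its details. Log-uniform here means that the logarithm of the value is drawn uniformly, so each decade between 1e-4 and 1e-1 is about equally likely.

```python
import math
import random


def sample_loguniform(rng, low, high):
    """Draw a value whose logarithm is uniform between log(low) and log(high)."""
    log_value = rng.uniform(math.log(low), math.log(high))
    # Clamp to guard against tiny floating-point overshoot at the bounds.
    return min(max(math.exp(log_value), low), high)


def sample_config(rng):
    """Imitate one draw from the search space defined above."""
    layer_sizes = [2 ** i for i in range(9)]  # 1, 2, 4, ..., 256
    return {
        "l1": rng.choice(layer_sizes),
        "l2": rng.choice(layer_sizes),
        "lr": sample_loguniform(rng, 1e-4, 1e-1),
        "batch_size": rng.choice([2, 4, 8, 16]),
    }


rng = random.Random(0)
samples = [sample_config(rng) for _ in range(1000)]
share_small = sum(s["lr"] < 1e-3 for s in samples) / len(samples)
print(f"share of lr samples below 1e-3: {share_small:.2f}")
```

With a plain uniform distribution, fewer than 1% of the samples would fall below 1e-3; with log-uniform sampling, roughly a third of them should.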

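For each trial, Ray Tune will now randomly sample a combination of parameters from these search spaces. It will then train a number of models in parallel and find the best performing one among them. We also use the ASHAScheduler, which terminates badly performing trials early.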
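
The idea behind early termination can be sketched as follows. This is a simplified, synchronous illustration of successive halving, assuming geometrically spaced decision points ("rungs") and a rule that keeps roughly the best 1/reduction_factor of the losses seen at a rung. It is not Ray Tune's implementation, which makes these decisions asynchronously; the function names are hypothetical.

```python
def rung_milestones(grace_period, reduction_factor, max_t):
    """Iterations at which a trial is compared against its peers."""
    milestones = []
    t = grace_period
    while t <= max_t:
        milestones.append(t)
        t *= reduction_factor
    return milestones


def survives(loss, rung_losses, reduction_factor):
    """Return True if `loss` is among the best 1/reduction_factor at this rung.

    `rung_losses` holds the losses other trials reported at the same rung;
    lower is better.
    """
    ranked = sorted(rung_losses + [loss])
    keep = max(1, len(ranked) // reduction_factor)
    return loss <= ranked[keep - 1]


# Settings used by the scheduler in the main function of this tutorial.
print(rung_milestones(grace_period=1, reduction_factor=2, max_t=10))
print(survives(1.0, [2.0, 3.0, 4.0], reduction_factor=2))
print(survives(3.5, [2.0, 3.0, 4.0], reduction_factor=2))
```

Under these assumptions, a weak trial can be stopped after as little as one epoch, which matches the "completed after 1 iterations" messages in the log output.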

We wrap the train_cifar function with functools.partial to set the constant data_dir parameter. We can also tell Ray Tune what resources should be available for each trial:

gpus_per_trial = 2
# ...
result = tune.run(
    partial(train_cifar, data_dir=data_dir),
    resources_per_trial={"cpu": 8, "gpu": gpus_per_trial},
    config=config,
    num_samples=num_samples,
    scheduler=scheduler,
)
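
The role of functools.partial in the call above can be shown on its own. In this sketch, train_stub is a hypothetical stand-in for train_cifar and the directory path is made up; the point is that partial returns a callable which only needs the config argument, which is the argument Ray Tune supplies for each trial.

```python
from functools import partial


def train_stub(config, data_dir=None):
    """Hypothetical stand-in for train_cifar that just reports its inputs."""
    return f"lr={config['lr']}, data_dir={data_dir}"


# Bind the constant argument once; only `config` remains open.
trainable = partial(train_stub, data_dir="/tmp/cifar-data")

print(trainable({"lr": 0.01}))
print(trainable.keywords)
```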

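You can specify the number of CPUs, which are then available, for example, to increase the num_workers of the PyTorch DataLoader instances. The selected number of GPUs is made visible to PyTorch in each trial. Trials do not have access to GPUs that have not been requested for them, so you do not have to worry about two trials using the same set of resources.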

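Here we can also specify fractional GPUs, so something like gpus_per_trial=0.5 is completely valid. The trials will then share GPUs among each other. You just have to make sure that the models still fit in GPU memory.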
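
As a back-of-the-envelope model of how these settings limit parallelism, the number of trials that can run at once is bounded by the scarcest resource. The helper below is hypothetical and deliberately simplified; Ray's scheduler takes more into account, such as custom resources and placement across nodes.

```python
from fractions import Fraction


def max_concurrent_trials(total_cpus, total_gpus, cpus_per_trial, gpus_per_trial):
    """Upper bound on simultaneously running trials under a simple resource model."""
    limits = []
    # Fractions avoid float rounding surprises such as 1 // 0.1 == 9.0.
    if cpus_per_trial > 0:
        limits.append(Fraction(str(total_cpus)) // Fraction(str(cpus_per_trial)))
    if gpus_per_trial > 0:
        limits.append(Fraction(str(total_gpus)) // Fraction(str(gpus_per_trial)))
    if not limits:
        raise ValueError("at least one resource must be requested per trial")
    return min(limits)


# 16 CPUs and 1 GPU, as reported in the log output of this tutorial.
print(max_concurrent_trials(16, 1, cpus_per_trial=2, gpus_per_trial=0))
print(max_concurrent_trials(16, 1, cpus_per_trial=2, gpus_per_trial=0.5))
print(max_concurrent_trials(16, 1, cpus_per_trial=8, gpus_per_trial=2))
```

The first case corresponds to the run shown in the log output, where 8 of the 10 trials are running at the same time. In the last case no trial could start on such a machine, because each one requests more GPUs than exist.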

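After training the models, we will find the best performing one and load the trained network from the checkpoint file. We then obtain the test set accuracy and report everything by printing.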
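
Selecting the best trial by its last reported loss can be illustrated with a few lines of plain Python. The helper best_trial_by_last is hypothetical and only mimics the idea of comparing each trial's most recent result. The numbers are a subset of the intermediate results printed in the log output of this tutorial, not final results.

```python
def best_trial_by_last(results, metric="loss", mode="min"):
    """Return the trial name whose most recent report is best for `metric`."""
    pick = min if mode == "min" else max
    return pick(results, key=lambda name: results[name][-1][metric])


# Trial name -> list of reported results, in reporting order.
results = {
    "train_cifar_3a47b_00001": [{"loss": 2.31772, "accuracy": 0.1006}],
    "train_cifar_3a47b_00003": [
        {"loss": 2.06515, "accuracy": 0.2221},
        {"loss": 2.19641, "accuracy": 0.1971},
    ],
    "train_cifar_3a47b_00005": [{"loss": 1.68402, "accuracy": 0.3719}],
    "train_cifar_3a47b_00006": [
        {"loss": 2.24278, "accuracy": 0.1586},
        {"loss": 2.12796, "accuracy": 0.2028},
    ],
}

best = best_trial_by_last(results, metric="loss", mode="min")
print(best, results[best][-1])
```

Note that only the last entry of each trial is compared: trial 00003 had a lower loss after its first iteration than after its second, and it is the second value that counts.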

The full main function looks like this:

def main(num_samples=10, max_num_epochs=10, gpus_per_trial=2):
    data_dir = os.path.abspath("./data")
    load_data(data_dir)
    config = {
        "l1": tune.choice([2**i for i in range(9)]),
        "l2": tune.choice([2**i for i in range(9)]),
        "lr": tune.loguniform(1e-4, 1e-1),
        "batch_size": tune.choice([2, 4, 8, 16]),
    }
    scheduler = ASHAScheduler(
        metric="loss",
        mode="min",
        max_t=max_num_epochs,
        grace_period=1,
        reduction_factor=2,
    )
    result = tune.run(
        partial(train_cifar, data_dir=data_dir),
        resources_per_trial={"cpu": 2, "gpu": gpus_per_trial},
        config=config,
        num_samples=num_samples,
        scheduler=scheduler,
    )

    best_trial = result.get_best_trial("loss", "min", "last")
    print(f"Best trial config: {best_trial.config}")
    print(f"Best trial final validation loss: {best_trial.last_result['loss']}")
    print(f"Best trial final validation accuracy: {best_trial.last_result['accuracy']}")

    best_trained_model = Net(best_trial.config["l1"], best_trial.config["l2"])
    device = "cpu"
    if torch.cuda.is_available():
        device = "cuda:0"
        if gpus_per_trial > 1:
            best_trained_model = nn.DataParallel(best_trained_model)
    best_trained_model.to(device)

    best_checkpoint = result.get_best_checkpoint(trial=best_trial, metric="accuracy", mode="max")
    with best_checkpoint.as_directory() as checkpoint_dir:
        data_path = Path(checkpoint_dir) / "data.pkl"
        with open(data_path, "rb") as fp:
            best_checkpoint_data = pickle.load(fp)

        best_trained_model.load_state_dict(best_checkpoint_data["net_state_dict"])
        test_acc = test_accuracy(best_trained_model, device)
        print("Best trial test set accuracy: {}".format(test_acc))


if __name__ == "__main__":
    # You can change the number of GPUs per trial here:
    main(num_samples=10, max_num_epochs=10, gpus_per_trial=0)
2025-04-23 16:30:10,527 WARNING services.py:1889 -- WARNING: The object store is using /tmp instead of /dev/shm because /dev/shm has only 2147479552 bytes available. This will harm performance! You may be able to free up space by deleting files in /dev/shm. If you are inside a Docker container, you can increase /dev/shm size by passing '--shm-size=10.24gb' to 'docker run' (or add it to the run_options list in a Ray cluster config). Make sure to set this to more than 30% of available RAM.
2025-04-23 16:30:10,579 INFO worker.py:1642 -- Started a local Ray instance.
2025-04-23 16:30:11,504 INFO tune.py:228 -- Initializing Ray automatically. For cluster usage or custom Ray initialization, call `ray.init(...)` before `tune.run(...)`.
2025-04-23 16:30:11,506 INFO tune.py:654 -- [output] This will use the new output engine with verbosity 2. To disable the new output and use the legacy output engine, set the environment variable RAY_AIR_NEW_OUTPUT=0. For more information, please see https://github.com/ray-project/ray/issues/36949
+--------------------------------------------------------------------+
| Configuration for experiment     train_cifar_2025-04-23_16-30-11   |
+--------------------------------------------------------------------+
| Search algorithm                 BasicVariantGenerator             |
| Scheduler                        AsyncHyperBandScheduler           |
| Number of trials                 10                                |
+--------------------------------------------------------------------+

View detailed results here: /var/lib/ci-user/ray_results/train_cifar_2025-04-23_16-30-11
To visualize your results with TensorBoard, run: `tensorboard --logdir /var/lib/ci-user/ray_results/train_cifar_2025-04-23_16-30-11`

Trial status: 10 PENDING
Current time: 2025-04-23 16:30:11. Total running time: 0s
Logical resource usage: 16.0/16 CPUs, 0/1 GPUs (0.0/1.0 accelerator_type:A10G)
+-------------------------------------------------------------------------------+
| Trial name                status       l1     l2            lr     batch_size |
+-------------------------------------------------------------------------------+
| train_cifar_3a47b_00000   PENDING      16      1   0.00213327               2 |
| train_cifar_3a47b_00001   PENDING       1      2   0.013416                 4 |
| train_cifar_3a47b_00002   PENDING     256     64   0.0113784                2 |
| train_cifar_3a47b_00003   PENDING      64    256   0.0274071                8 |
| train_cifar_3a47b_00004   PENDING      16      2   0.056666                 4 |
| train_cifar_3a47b_00005   PENDING       8     64   0.000353097              4 |
| train_cifar_3a47b_00006   PENDING      16      4   0.000147684              8 |
| train_cifar_3a47b_00007   PENDING     256    256   0.00477469               8 |
| train_cifar_3a47b_00008   PENDING     128    256   0.0306227                8 |
| train_cifar_3a47b_00009   PENDING       2     16   0.0286986                2 |
+-------------------------------------------------------------------------------+

Trial train_cifar_3a47b_00000 started with configuration:
+--------------------------------------------------+
| Trial train_cifar_3a47b_00000 config             |
+--------------------------------------------------+
| batch_size                                     2 |
| l1                                            16 |
| l2                                             1 |
| lr                                       0.00213 |
+--------------------------------------------------+

Trial train_cifar_3a47b_00002 started with configuration:
+--------------------------------------------------+
| Trial train_cifar_3a47b_00002 config             |
+--------------------------------------------------+
| batch_size                                     2 |
| l1                                           256 |
| l2                                            64 |
| lr                                       0.01138 |
+--------------------------------------------------+

Trial train_cifar_3a47b_00007 started with configuration:
+--------------------------------------------------+
| Trial train_cifar_3a47b_00007 config             |
+--------------------------------------------------+
| batch_size                                     8 |
| l1                                           256 |
| l2                                           256 |
| lr                                       0.00477 |
+--------------------------------------------------+

Trial train_cifar_3a47b_00004 started with configuration:
+--------------------------------------------------+
| Trial train_cifar_3a47b_00004 config             |
+--------------------------------------------------+
| batch_size                                     4 |
| l1                                            16 |
| l2                                             2 |
| lr                                       0.05667 |
+--------------------------------------------------+

Trial train_cifar_3a47b_00003 started with configuration:
+--------------------------------------------------+
| Trial train_cifar_3a47b_00003 config             |
+--------------------------------------------------+
| batch_size                                     8 |
| l1                                            64 |
| l2                                           256 |
| lr                                       0.02741 |
+--------------------------------------------------+

Trial train_cifar_3a47b_00001 started with configuration:
+--------------------------------------------------+
| Trial train_cifar_3a47b_00001 config             |
+--------------------------------------------------+
| batch_size                                     4 |
| l1                                             1 |
| l2                                             2 |
| lr                                       0.01342 |
+--------------------------------------------------+

Trial train_cifar_3a47b_00005 started with configuration:
+--------------------------------------------------+
| Trial train_cifar_3a47b_00005 config             |
+--------------------------------------------------+
| batch_size                                     4 |
| l1                                             8 |
| l2                                            64 |
| lr                                       0.00035 |
+--------------------------------------------------+

Trial train_cifar_3a47b_00006 started with configuration:
+--------------------------------------------------+
| Trial train_cifar_3a47b_00006 config             |
+--------------------------------------------------+
| batch_size                                     8 |
| l1                                            16 |
| l2                                             4 |
| lr                                       0.00015 |
+--------------------------------------------------+
(func pid=4375) [1,  2000] loss: 2.327
(func pid=4375) [1,  4000] loss: 1.153 [repeated 8x across cluster] (Ray deduplicates logs by default. Set RAY_DEDUP_LOGS=0 to disable log deduplication, or see https://docs.ray.io/en/master/ray-observability/ray-logging.html#log-deduplication for more options.)

Trial status: 8 RUNNING | 2 PENDING
Current time: 2025-04-23 16:30:41. Total running time: 30s
Logical resource usage: 16.0/16 CPUs, 0/1 GPUs (0.0/1.0 accelerator_type:A10G)
+-------------------------------------------------------------------------------+
| Trial name                status       l1     l2            lr     batch_size |
+-------------------------------------------------------------------------------+
| train_cifar_3a47b_00000   RUNNING      16      1   0.00213327               2 |
| train_cifar_3a47b_00001   RUNNING       1      2   0.013416                 4 |
| train_cifar_3a47b_00002   RUNNING     256     64   0.0113784                2 |
| train_cifar_3a47b_00003   RUNNING      64    256   0.0274071                8 |
| train_cifar_3a47b_00004   RUNNING      16      2   0.056666                 4 |
| train_cifar_3a47b_00005   RUNNING       8     64   0.000353097              4 |
| train_cifar_3a47b_00006   RUNNING      16      4   0.000147684              8 |
| train_cifar_3a47b_00007   RUNNING     256    256   0.00477469               8 |
| train_cifar_3a47b_00008   PENDING     128    256   0.0306227                8 |
| train_cifar_3a47b_00009   PENDING       2     16   0.0286986                2 |
+-------------------------------------------------------------------------------+
(func pid=4375) [1,  6000] loss: 0.769 [repeated 8x across cluster]

Trial train_cifar_3a47b_00006 finished iteration 1 at 2025-04-23 16:30:57. Total running time: 46s
+------------------------------------------------------------+
| Trial train_cifar_3a47b_00006 result                       |
+------------------------------------------------------------+
| checkpoint_dir_name                      checkpoint_000000 |
| time_this_iter_s                                   40.8067 |
| time_total_s                                       40.8067 |
| training_iteration                                       1 |
| accuracy                                            0.1586 |
| loss                                               2.24278 |
+------------------------------------------------------------+
Trial train_cifar_3a47b_00006 saved a checkpoint for iteration 1 at: (local)/var/lib/ci-user/ray_results/train_cifar_2025-04-23_16-30-11/train_cifar_3a47b_00006_6_batch_size=8,l1=16,l2=4,lr=0.0001_2025-04-23_16-30-11/checkpoint_000000
(func pid=4381) Checkpoint successfully created at: Checkpoint(filesystem=local, path=/var/lib/ci-user/ray_results/train_cifar_2025-04-23_16-30-11/train_cifar_3a47b_00006_6_batch_size=8,l1=16,l2=4,lr=0.0001_2025-04-23_16-30-11/checkpoint_000000)

Trial train_cifar_3a47b_00003 finished iteration 1 at 2025-04-23 16:30:58. Total running time: 46s
+------------------------------------------------------------+
| Trial train_cifar_3a47b_00003 result                       |
+------------------------------------------------------------+
| checkpoint_dir_name                      checkpoint_000000 |
| time_this_iter_s                                  41.74049 |
| time_total_s                                      41.74049 |
| training_iteration                                       1 |
| accuracy                                            0.2221 |
| loss                                               2.06515 |
+------------------------------------------------------------+
Trial train_cifar_3a47b_00003 saved a checkpoint for iteration 1 at: (local)/var/lib/ci-user/ray_results/train_cifar_2025-04-23_16-30-11/train_cifar_3a47b_00003_3_batch_size=8,l1=64,l2=256,lr=0.0274_2025-04-23_16-30-11/checkpoint_000000

Trial train_cifar_3a47b_00007 finished iteration 1 at 2025-04-23 16:30:59. Total running time: 47s
+------------------------------------------------------------+
| Trial train_cifar_3a47b_00007 result                       |
+------------------------------------------------------------+
| checkpoint_dir_name                      checkpoint_000000 |
| time_this_iter_s                                  43.10883 |
| time_total_s                                      43.10883 |
| training_iteration                                       1 |
| accuracy                                            0.4819 |
| loss                                                1.4359 |
+------------------------------------------------------------+
Trial train_cifar_3a47b_00007 saved a checkpoint for iteration 1 at: (local)/var/lib/ci-user/ray_results/train_cifar_2025-04-23_16-30-11/train_cifar_3a47b_00007_7_batch_size=8,l1=256,l2=256,lr=0.0048_2025-04-23_16-30-11/checkpoint_000000
(func pid=4375) [1,  8000] loss: 0.576 [repeated 5x across cluster]
(func pid=4375) [1, 10000] loss: 0.461 [repeated 5x across cluster]

Trial status: 8 RUNNING | 2 PENDING
Current time: 2025-04-23 16:31:11. Total running time: 1min 0s
Logical resource usage: 16.0/16 CPUs, 0/1 GPUs (0.0/1.0 accelerator_type:A10G)
+----------------------------------------------------------------------------------------------------------------------------------+
| Trial name                status       l1     l2            lr     batch_size     iter     total time (s)      loss     accuracy |
+----------------------------------------------------------------------------------------------------------------------------------+
| train_cifar_3a47b_00000   RUNNING      16      1   0.00213327               2                                                    |
| train_cifar_3a47b_00001   RUNNING       1      2   0.013416                 4                                                    |
| train_cifar_3a47b_00002   RUNNING     256     64   0.0113784                2                                                    |
| train_cifar_3a47b_00003   RUNNING      64    256   0.0274071                8        1            41.7405   2.06515       0.2221 |
| train_cifar_3a47b_00004   RUNNING      16      2   0.056666                 4                                                    |
| train_cifar_3a47b_00005   RUNNING       8     64   0.000353097              4                                                    |
| train_cifar_3a47b_00006   RUNNING      16      4   0.000147684              8        1            40.8067   2.24278       0.1586 |
| train_cifar_3a47b_00007   RUNNING     256    256   0.00477469               8        1            43.1088   1.4359        0.4819 |
| train_cifar_3a47b_00008   PENDING     128    256   0.0306227                8                                                    |
| train_cifar_3a47b_00009   PENDING       2     16   0.0286986                2                                                    |
+----------------------------------------------------------------------------------------------------------------------------------+
(func pid=4375) [1, 12000] loss: 0.384 [repeated 8x across cluster]

Trial train_cifar_3a47b_00001 finished iteration 1 at 2025-04-23 16:31:24. Total running time: 1min 12s
+------------------------------------------------------------+
| Trial train_cifar_3a47b_00001 result                       |
+------------------------------------------------------------+
| checkpoint_dir_name                      checkpoint_000000 |
| time_this_iter_s                                  67.90545 |
| time_total_s                                      67.90545 |
| training_iteration                                       1 |
| accuracy                                            0.1006 |
| loss                                               2.31772 |
+------------------------------------------------------------+
Trial train_cifar_3a47b_00001 saved a checkpoint for iteration 1 at: (local)/var/lib/ci-user/ray_results/train_cifar_2025-04-23_16-30-11/train_cifar_3a47b_00001_1_batch_size=4,l1=1,l2=2,lr=0.0134_2025-04-23_16-30-11/checkpoint_000000

Trial train_cifar_3a47b_00001 completed after 1 iterations at 2025-04-23 16:31:24. Total running time: 1min 12s

Trial train_cifar_3a47b_00008 started with configuration:
+--------------------------------------------------+
| Trial train_cifar_3a47b_00008 config             |
+--------------------------------------------------+
| batch_size                                     8 |
| l1                                           128 |
| l2                                           256 |
| lr                                       0.03062 |
+--------------------------------------------------+
(func pid=4376) Checkpoint successfully created at: Checkpoint(filesystem=local, path=/var/lib/ci-user/ray_results/train_cifar_2025-04-23_16-30-11/train_cifar_3a47b_00001_1_batch_size=4,l1=1,l2=2,lr=0.0134_2025-04-23_16-30-11/checkpoint_000000) [repeated 3x across cluster]

Trial train_cifar_3a47b_00004 finished iteration 1 at 2025-04-23 16:31:26. Total running time: 1min 15s
+------------------------------------------------------------+
| Trial train_cifar_3a47b_00004 result                       |
+------------------------------------------------------------+
| checkpoint_dir_name                      checkpoint_000000 |
| time_this_iter_s                                  70.33586 |
| time_total_s                                      70.33586 |
| training_iteration                                       1 |
| accuracy                                            0.0995 |
| loss                                               2.32697 |
+------------------------------------------------------------+
Trial train_cifar_3a47b_00004 saved a checkpoint for iteration 1 at: (local)/var/lib/ci-user/ray_results/train_cifar_2025-04-23_16-30-11/train_cifar_3a47b_00004_4_batch_size=4,l1=16,l2=2,lr=0.0567_2025-04-23_16-30-11/checkpoint_000000

Trial train_cifar_3a47b_00004 completed after 1 iterations at 2025-04-23 16:31:26. Total running time: 1min 15s

Trial train_cifar_3a47b_00009 started with configuration:
+-------------------------------------------------+
| Trial train_cifar_3a47b_00009 config            |
+-------------------------------------------------+
| batch_size                                    2 |
| l1                                            2 |
| l2                                           16 |
| lr                                       0.0287 |
+-------------------------------------------------+
(func pid=4382) [2,  4000] loss: 0.682 [repeated 4x across cluster]

Trial train_cifar_3a47b_00005 finished iteration 1 at 2025-04-23 16:31:27. Total running time: 1min 16s
+------------------------------------------------------------+
| Trial train_cifar_3a47b_00005 result                       |
+------------------------------------------------------------+
| checkpoint_dir_name                      checkpoint_000000 |
| time_this_iter_s                                  71.31317 |
| time_total_s                                      71.31317 |
| training_iteration                                       1 |
| accuracy                                            0.3719 |
| loss                                               1.68402 |
+------------------------------------------------------------+
Trial train_cifar_3a47b_00005 saved a checkpoint for iteration 1 at: (local)/var/lib/ci-user/ray_results/train_cifar_2025-04-23_16-30-11/train_cifar_3a47b_00005_5_batch_size=4,l1=8,l2=64,lr=0.0004_2025-04-23_16-30-11/checkpoint_000000
(func pid=4377) [1, 14000] loss: 0.331 [repeated 2x across cluster]

Trial train_cifar_3a47b_00006 finished iteration 2 at 2025-04-23 16:31:36. Total running time: 1min 24s
+------------------------------------------------------------+
| Trial train_cifar_3a47b_00006 result                       |
+------------------------------------------------------------+
| checkpoint_dir_name                      checkpoint_000001 |
| time_this_iter_s                                  38.60301 |
| time_total_s                                      79.40971 |
| training_iteration                                       2 |
| accuracy                                            0.2028 |
| loss                                               2.12796 |
+------------------------------------------------------------+
Trial train_cifar_3a47b_00006 saved a checkpoint for iteration 2 at: (local)/var/lib/ci-user/ray_results/train_cifar_2025-04-23_16-30-11/train_cifar_3a47b_00006_6_batch_size=8,l1=16,l2=4,lr=0.0001_2025-04-23_16-30-11/checkpoint_000001
(func pid=4381) Checkpoint successfully created at: Checkpoint(filesystem=local, path=/var/lib/ci-user/ray_results/train_cifar_2025-04-23_16-30-11/train_cifar_3a47b_00006_6_batch_size=8,l1=16,l2=4,lr=0.0001_2025-04-23_16-30-11/checkpoint_000001) [repeated 3x across cluster]

Trial train_cifar_3a47b_00003 finished iteration 2 at 2025-04-23 16:31:36. Total running time: 1min 25s
+------------------------------------------------------------+
| Trial train_cifar_3a47b_00003 result                       |
+------------------------------------------------------------+
| checkpoint_dir_name                      checkpoint_000001 |
| time_this_iter_s                                   38.6455 |
| time_total_s                                      80.38599 |
| training_iteration                                       2 |
| accuracy                                            0.1971 |
| loss                                               2.19641 |
+------------------------------------------------------------+
Trial train_cifar_3a47b_00003 saved a checkpoint for iteration 2 at: (local)/var/lib/ci-user/ray_results/train_cifar_2025-04-23_16-30-11/train_cifar_3a47b_00003_3_batch_size=8,l1=64,l2=256,lr=0.0274_2025-04-23_16-30-11/checkpoint_000001

Trial train_cifar_3a47b_00003 completed after 2 iterations at 2025-04-23 16:31:36. Total running time: 1min 25s

Trial train_cifar_3a47b_00007 finished iteration 2 at 2025-04-23 16:31:40. Total running time: 1min 28s
+------------------------------------------------------------+
| Trial train_cifar_3a47b_00007 result                       |
+------------------------------------------------------------+
| checkpoint_dir_name                      checkpoint_000001 |
| time_this_iter_s                                  40.65258 |
| time_total_s                                      83.76141 |
| training_iteration                                       2 |
| accuracy                                            0.4792 |
| loss                                               1.50171 |
+------------------------------------------------------------+
Trial train_cifar_3a47b_00007 saved a checkpoint for iteration 2 at: (local)/var/lib/ci-user/ray_results/train_cifar_2025-04-23_16-30-11/train_cifar_3a47b_00007_7_batch_size=8,l1=256,l2=256,lr=0.0048_2025-04-23_16-30-11/checkpoint_000001

Trial status: 7 RUNNING | 3 TERMINATED
Current time: 2025-04-23 16:31:41. Total running time: 1min 30s
Logical resource usage: 14.0/16 CPUs, 0/1 GPUs (0.0/1.0 accelerator_type:A10G)
+------------------------------------------------------------------------------------------------------------------------------------+
| Trial name                status         l1     l2            lr     batch_size     iter     total time (s)      loss     accuracy |
+------------------------------------------------------------------------------------------------------------------------------------+
| train_cifar_3a47b_00000   RUNNING        16      1   0.00213327               2                                                    |
| train_cifar_3a47b_00002   RUNNING       256     64   0.0113784                2                                                    |
| train_cifar_3a47b_00005   RUNNING         8     64   0.000353097              4        1            71.3132   1.68402       0.3719 |
| train_cifar_3a47b_00006   RUNNING        16      4   0.000147684              8        2            79.4097   2.12796       0.2028 |
| train_cifar_3a47b_00007   RUNNING       256    256   0.00477469               8        2            83.7614   1.50171       0.4792 |
| train_cifar_3a47b_00008   RUNNING       128    256   0.0306227                8                                                    |
| train_cifar_3a47b_00009   RUNNING         2     16   0.0286986                2                                                    |
| train_cifar_3a47b_00001   TERMINATED      1      2   0.013416                 4        1            67.9055   2.31772       0.1006 |
| train_cifar_3a47b_00003   TERMINATED     64    256   0.0274071                8        2            80.386    2.19641       0.1971 |
| train_cifar_3a47b_00004   TERMINATED     16      2   0.056666                 4        1            70.3359   2.32697       0.0995 |
+------------------------------------------------------------------------------------------------------------------------------------+
(func pid=4377) [1, 16000] loss: 0.289 [repeated 5x across cluster]
(func pid=4376) [1,  4000] loss: 1.061 [repeated 5x across cluster]
(func pid=4378) [1,  6000] loss: 0.777 [repeated 3x across cluster]

Trial train_cifar_3a47b_00008 finished iteration 1 at 2025-04-23 16:32:02. Total running time: 1min 50s
+------------------------------------------------------------+
| Trial train_cifar_3a47b_00008 result                       |
+------------------------------------------------------------+
| checkpoint_dir_name                      checkpoint_000000 |
| time_this_iter_s                                  37.58341 |
| time_total_s                                      37.58341 |
| training_iteration                                       1 |
| accuracy                                            0.2113 |
| loss                                               2.16249 |
+------------------------------------------------------------+
Trial train_cifar_3a47b_00008 saved a checkpoint for iteration 1 at: (local)/var/lib/ci-user/ray_results/train_cifar_2025-04-23_16-30-11/train_cifar_3a47b_00008_8_batch_size=8,l1=128,l2=256,lr=0.0306_2025-04-23_16-30-11/checkpoint_000000

Trial train_cifar_3a47b_00008 completed after 1 iterations at 2025-04-23 16:32:02. Total running time: 1min 50s
(func pid=4376) Checkpoint successfully created at: Checkpoint(filesystem=local, path=/var/lib/ci-user/ray_results/train_cifar_2025-04-23_16-30-11/train_cifar_3a47b_00008_8_batch_size=8,l1=128,l2=256,lr=0.0306_2025-04-23_16-30-11/checkpoint_000000) [repeated 3x across cluster]
(func pid=4377) [1, 20000] loss: 0.231 [repeated 4x across cluster]

Trial train_cifar_3a47b_00006 finished iteration 3 at 2025-04-23 16:32:07. Total running time: 1min 55s
+------------------------------------------------------------+
| Trial train_cifar_3a47b_00006 result                       |
+------------------------------------------------------------+
| checkpoint_dir_name                      checkpoint_000002 |
| time_this_iter_s                                  30.94938 |
| time_total_s                                     110.35909 |
| training_iteration                                       3 |
| accuracy                                             0.225 |
| loss                                               2.06686 |
+------------------------------------------------------------+
Trial train_cifar_3a47b_00006 saved a checkpoint for iteration 3 at: (local)/var/lib/ci-user/ray_results/train_cifar_2025-04-23_16-30-11/train_cifar_3a47b_00006_6_batch_size=8,l1=16,l2=4,lr=0.0001_2025-04-23_16-30-11/checkpoint_000002
(func pid=4381) Checkpoint successfully created at: Checkpoint(filesystem=local, path=/var/lib/ci-user/ray_results/train_cifar_2025-04-23_16-30-11/train_cifar_3a47b_00006_6_batch_size=8,l1=16,l2=4,lr=0.0001_2025-04-23_16-30-11/checkpoint_000002)

Trial status: 6 RUNNING | 4 TERMINATED
Current time: 2025-04-23 16:32:12. Total running time: 2min 0s
Logical resource usage: 12.0/16 CPUs, 0/1 GPUs (0.0/1.0 accelerator_type:A10G)
+------------------------------------------------------------------------------------------------------------------------------------+
| Trial name                status         l1     l2            lr     batch_size     iter     total time (s)      loss     accuracy |
+------------------------------------------------------------------------------------------------------------------------------------+
| train_cifar_3a47b_00000   RUNNING        16      1   0.00213327               2                                                    |
| train_cifar_3a47b_00002   RUNNING       256     64   0.0113784                2                                                    |
| train_cifar_3a47b_00005   RUNNING         8     64   0.000353097              4        1            71.3132   1.68402       0.3719 |
| train_cifar_3a47b_00006   RUNNING        16      4   0.000147684              8        3           110.359    2.06686       0.225  |
| train_cifar_3a47b_00007   RUNNING       256    256   0.00477469               8        2            83.7614   1.50171       0.4792 |
| train_cifar_3a47b_00009   RUNNING         2     16   0.0286986                2                                                    |
| train_cifar_3a47b_00001   TERMINATED      1      2   0.013416                 4        1            67.9055   2.31772       0.1006 |
| train_cifar_3a47b_00003   TERMINATED     64    256   0.0274071                8        2            80.386    2.19641       0.1971 |
| train_cifar_3a47b_00004   TERMINATED     16      2   0.056666                 4        1            70.3359   2.32697       0.0995 |
| train_cifar_3a47b_00008   TERMINATED    128    256   0.0306227                8        1            37.5834   2.16249       0.2113 |
+------------------------------------------------------------------------------------------------------------------------------------+
(func pid=4378) [1, 10000] loss: 0.467 [repeated 4x across cluster]

Trial train_cifar_3a47b_00000 finished iteration 1 at 2025-04-23 16:32:13. Total running time: 2min 1s
+------------------------------------------------------------+
| Trial train_cifar_3a47b_00000 result                       |
+------------------------------------------------------------+
| checkpoint_dir_name                      checkpoint_000000 |
| time_this_iter_s                                 116.93764 |
| time_total_s                                     116.93764 |
| training_iteration                                       1 |
| accuracy                                            0.1011 |
| loss                                               2.30351 |
+------------------------------------------------------------+
Trial train_cifar_3a47b_00000 saved a checkpoint for iteration 1 at: (local)/var/lib/ci-user/ray_results/train_cifar_2025-04-23_16-30-11/train_cifar_3a47b_00000_0_batch_size=2,l1=16,l2=1,lr=0.0021_2025-04-23_16-30-11/checkpoint_000000

Trial train_cifar_3a47b_00000 completed after 1 iterations at 2025-04-23 16:32:13. Total running time: 2min 1s
(func pid=4375) Checkpoint successfully created at: Checkpoint(filesystem=local, path=/var/lib/ci-user/ray_results/train_cifar_2025-04-23_16-30-11/train_cifar_3a47b_00000_0_batch_size=2,l1=16,l2=1,lr=0.0021_2025-04-23_16-30-11/checkpoint_000000)

Trial train_cifar_3a47b_00007 finished iteration 3 at 2025-04-23 16:32:13. Total running time: 2min 1s
+------------------------------------------------------------+
| Trial train_cifar_3a47b_00007 result                       |
+------------------------------------------------------------+
| checkpoint_dir_name                      checkpoint_000002 |
| time_this_iter_s                                  33.34819 |
| time_total_s                                     117.10959 |
| training_iteration                                       3 |
| accuracy                                             0.448 |
| loss                                               1.67289 |
+------------------------------------------------------------+
Trial train_cifar_3a47b_00007 saved a checkpoint for iteration 3 at: (local)/var/lib/ci-user/ray_results/train_cifar_2025-04-23_16-30-11/train_cifar_3a47b_00007_7_batch_size=8,l1=256,l2=256,lr=0.0048_2025-04-23_16-30-11/checkpoint_000002

Trial train_cifar_3a47b_00002 finished iteration 1 at 2025-04-23 16:32:18. Total running time: 2min 6s
+------------------------------------------------------------+
| Trial train_cifar_3a47b_00002 result                       |
+------------------------------------------------------------+
| checkpoint_dir_name                      checkpoint_000000 |
| time_this_iter_s                                 121.95704 |
| time_total_s                                     121.95704 |
| training_iteration                                       1 |
| accuracy                                            0.1034 |
| loss                                               2.31648 |
+------------------------------------------------------------+
Trial train_cifar_3a47b_00002 saved a checkpoint for iteration 1 at: (local)/var/lib/ci-user/ray_results/train_cifar_2025-04-23_16-30-11/train_cifar_3a47b_00002_2_batch_size=2,l1=256,l2=64,lr=0.0114_2025-04-23_16-30-11/checkpoint_000000

Trial train_cifar_3a47b_00002 completed after 1 iterations at 2025-04-23 16:32:18. Total running time: 2min 6s
(func pid=4377) Checkpoint successfully created at: Checkpoint(filesystem=local, path=/var/lib/ci-user/ray_results/train_cifar_2025-04-23_16-30-11/train_cifar_3a47b_00002_2_batch_size=2,l1=256,l2=64,lr=0.0114_2025-04-23_16-30-11/checkpoint_000000) [repeated 2x across cluster]
(func pid=4378) [1, 12000] loss: 0.389 [repeated 3x across cluster]

Trial train_cifar_3a47b_00005 finished iteration 2 at 2025-04-23 16:32:21. Total running time: 2min 10s
+------------------------------------------------------------+
| Trial train_cifar_3a47b_00005 result                       |
+------------------------------------------------------------+
| checkpoint_dir_name                      checkpoint_000001 |
| time_this_iter_s                                  53.70092 |
| time_total_s                                     125.01409 |
| training_iteration                                       2 |
| accuracy                                            0.4516 |
| loss                                                  1.48 |
+------------------------------------------------------------+
Trial train_cifar_3a47b_00005 saved a checkpoint for iteration 2 at: (local)/var/lib/ci-user/ray_results/train_cifar_2025-04-23_16-30-11/train_cifar_3a47b_00005_5_batch_size=4,l1=8,l2=64,lr=0.0004_2025-04-23_16-30-11/checkpoint_000001
(func pid=4381) [4,  4000] loss: 1.010 [repeated 2x across cluster]
(func pid=4382) [4,  4000] loss: 0.593 [repeated 3x across cluster]

Trial train_cifar_3a47b_00006 finished iteration 4 at 2025-04-23 16:32:32. Total running time: 2min 20s
+------------------------------------------------------------+
| Trial train_cifar_3a47b_00006 result                       |
+------------------------------------------------------------+
| checkpoint_dir_name                      checkpoint_000003 |
| time_this_iter_s                                  25.17148 |
| time_total_s                                     135.53057 |
| training_iteration                                       4 |
| accuracy                                            0.2354 |
| loss                                               1.99812 |
+------------------------------------------------------------+
Trial train_cifar_3a47b_00006 saved a checkpoint for iteration 4 at: (local)/var/lib/ci-user/ray_results/train_cifar_2025-04-23_16-30-11/train_cifar_3a47b_00006_6_batch_size=8,l1=16,l2=4,lr=0.0001_2025-04-23_16-30-11/checkpoint_000003
(func pid=4381) Checkpoint successfully created at: Checkpoint(filesystem=local, path=/var/lib/ci-user/ray_results/train_cifar_2025-04-23_16-30-11/train_cifar_3a47b_00006_6_batch_size=8,l1=16,l2=4,lr=0.0001_2025-04-23_16-30-11/checkpoint_000003) [repeated 2x across cluster]

Trial train_cifar_3a47b_00007 finished iteration 4 at 2025-04-23 16:32:40. Total running time: 2min 28s
+------------------------------------------------------------+
| Trial train_cifar_3a47b_00007 result                       |
+------------------------------------------------------------+
| checkpoint_dir_name                      checkpoint_000003 |
| time_this_iter_s                                  26.63057 |
| time_total_s                                     143.74016 |
| training_iteration                                       4 |
| accuracy                                            0.5752 |
| loss                                               1.21551 |
+------------------------------------------------------------+
Trial train_cifar_3a47b_00007 saved a checkpoint for iteration 4 at: (local)/var/lib/ci-user/ray_results/train_cifar_2025-04-23_16-30-11/train_cifar_3a47b_00007_7_batch_size=8,l1=256,l2=256,lr=0.0048_2025-04-23_16-30-11/checkpoint_000003
(func pid=4382) Checkpoint successfully created at: Checkpoint(filesystem=local, path=/var/lib/ci-user/ray_results/train_cifar_2025-04-23_16-30-11/train_cifar_3a47b_00007_7_batch_size=8,l1=256,l2=256,lr=0.0048_2025-04-23_16-30-11/checkpoint_000003)
(func pid=4378) [1, 18000] loss: 0.259 [repeated 3x across cluster]

Trial status: 6 TERMINATED | 4 RUNNING
Current time: 2025-04-23 16:32:42. Total running time: 2min 30s
Logical resource usage: 8.0/16 CPUs, 0/1 GPUs (0.0/1.0 accelerator_type:A10G)
+------------------------------------------------------------------------------------------------------------------------------------+
| Trial name                status         l1     l2            lr     batch_size     iter     total time (s)      loss     accuracy |
+------------------------------------------------------------------------------------------------------------------------------------+
| train_cifar_3a47b_00005   RUNNING         8     64   0.000353097              4        2           125.014    1.48          0.4516 |
| train_cifar_3a47b_00006   RUNNING        16      4   0.000147684              8        4           135.531    1.99812       0.2354 |
| train_cifar_3a47b_00007   RUNNING       256    256   0.00477469               8        4           143.74     1.21551       0.5752 |
| train_cifar_3a47b_00009   RUNNING         2     16   0.0286986                2                                                    |
| train_cifar_3a47b_00000   TERMINATED     16      1   0.00213327               2        1           116.938    2.30351       0.1011 |
| train_cifar_3a47b_00001   TERMINATED      1      2   0.013416                 4        1            67.9055   2.31772       0.1006 |
| train_cifar_3a47b_00002   TERMINATED    256     64   0.0113784                2        1           121.957    2.31648       0.1034 |
| train_cifar_3a47b_00003   TERMINATED     64    256   0.0274071                8        2            80.386    2.19641       0.1971 |
| train_cifar_3a47b_00004   TERMINATED     16      2   0.056666                 4        1            70.3359   2.32697       0.0995 |
| train_cifar_3a47b_00008   TERMINATED    128    256   0.0306227                8        1            37.5834   2.16249       0.2113 |
+------------------------------------------------------------------------------------------------------------------------------------+
(func pid=4378) [1, 20000] loss: 0.233 [repeated 3x across cluster]

Trial train_cifar_3a47b_00006 finished iteration 5 at 2025-04-23 16:32:56. Total running time: 2min 45s
+------------------------------------------------------------+
| Trial train_cifar_3a47b_00006 result                       |
+------------------------------------------------------------+
| checkpoint_dir_name                      checkpoint_000004 |
| time_this_iter_s                                  24.28981 |
| time_total_s                                     159.82038 |
| training_iteration                                       5 |
| accuracy                                            0.2481 |
| loss                                               1.91047 |
+------------------------------------------------------------+
Trial train_cifar_3a47b_00006 saved a checkpoint for iteration 5 at: (local)/var/lib/ci-user/ray_results/train_cifar_2025-04-23_16-30-11/train_cifar_3a47b_00006_6_batch_size=8,l1=16,l2=4,lr=0.0001_2025-04-23_16-30-11/checkpoint_000004
(func pid=4381) Checkpoint successfully created at: Checkpoint(filesystem=local, path=/var/lib/ci-user/ray_results/train_cifar_2025-04-23_16-30-11/train_cifar_3a47b_00006_6_batch_size=8,l1=16,l2=4,lr=0.0001_2025-04-23_16-30-11/checkpoint_000004)

Trial train_cifar_3a47b_00009 finished iteration 1 at 2025-04-23 16:32:58. Total running time: 2min 46s
+------------------------------------------------------------+
| Trial train_cifar_3a47b_00009 result                       |
+------------------------------------------------------------+
| checkpoint_dir_name                      checkpoint_000000 |
| time_this_iter_s                                  91.71496 |
| time_total_s                                      91.71496 |
| training_iteration                                       1 |
| accuracy                                            0.0986 |
| loss                                                2.3265 |
+------------------------------------------------------------+
Trial train_cifar_3a47b_00009 saved a checkpoint for iteration 1 at: (local)/var/lib/ci-user/ray_results/train_cifar_2025-04-23_16-30-11/train_cifar_3a47b_00009_9_batch_size=2,l1=2,l2=16,lr=0.0287_2025-04-23_16-30-11/checkpoint_000000

Trial train_cifar_3a47b_00009 completed after 1 iterations at 2025-04-23 16:32:58. Total running time: 2min 46s
(func pid=4380) [3, 10000] loss: 0.277 [repeated 4x across cluster]

Trial train_cifar_3a47b_00005 finished iteration 3 at 2025-04-23 16:33:04. Total running time: 2min 52s
+------------------------------------------------------------+
| Trial train_cifar_3a47b_00005 result                       |
+------------------------------------------------------------+
| checkpoint_dir_name                      checkpoint_000002 |
| time_this_iter_s                                  42.52939 |
| time_total_s                                     167.54348 |
| training_iteration                                       3 |
| accuracy                                            0.4965 |
| loss                                               1.39015 |
+------------------------------------------------------------+
Trial train_cifar_3a47b_00005 saved a checkpoint for iteration 3 at: (local)/var/lib/ci-user/ray_results/train_cifar_2025-04-23_16-30-11/train_cifar_3a47b_00005_5_batch_size=4,l1=8,l2=64,lr=0.0004_2025-04-23_16-30-11/checkpoint_000002
(func pid=4380) Checkpoint successfully created at: Checkpoint(filesystem=local, path=/var/lib/ci-user/ray_results/train_cifar_2025-04-23_16-30-11/train_cifar_3a47b_00005_5_batch_size=4,l1=8,l2=64,lr=0.0004_2025-04-23_16-30-11/checkpoint_000002) [repeated 2x across cluster]
(func pid=4381) [6,  2000] loss: 1.894 [repeated 2x across cluster]

Trial train_cifar_3a47b_00007 finished iteration 5 at 2025-04-23 16:33:06. Total running time: 2min 54s
+------------------------------------------------------------+
| Trial train_cifar_3a47b_00007 result                       |
+------------------------------------------------------------+
| checkpoint_dir_name                      checkpoint_000004 |
| time_this_iter_s                                   26.0278 |
| time_total_s                                     169.76797 |
| training_iteration                                       5 |
| accuracy                                            0.5697 |
| loss                                               1.23288 |
+------------------------------------------------------------+
Trial train_cifar_3a47b_00007 saved a checkpoint for iteration 5 at: (local)/var/lib/ci-user/ray_results/train_cifar_2025-04-23_16-30-11/train_cifar_3a47b_00007_7_batch_size=8,l1=256,l2=256,lr=0.0048_2025-04-23_16-30-11/checkpoint_000004
(func pid=4380) [4,  2000] loss: 1.349

Trial status: 7 TERMINATED | 3 RUNNING
Current time: 2025-04-23 16:33:12. Total running time: 3min 0s
Logical resource usage: 6.0/16 CPUs, 0/1 GPUs (0.0/1.0 accelerator_type:A10G)
+------------------------------------------------------------------------------------------------------------------------------------+
| Trial name                status         l1     l2            lr     batch_size     iter     total time (s)      loss     accuracy |
+------------------------------------------------------------------------------------------------------------------------------------+
| train_cifar_3a47b_00005   RUNNING         8     64   0.000353097              4        3           167.543    1.39015       0.4965 |
| train_cifar_3a47b_00006   RUNNING        16      4   0.000147684              8        5           159.82     1.91047       0.2481 |
| train_cifar_3a47b_00007   RUNNING       256    256   0.00477469               8        5           169.768    1.23288       0.5697 |
| train_cifar_3a47b_00000   TERMINATED     16      1   0.00213327               2        1           116.938    2.30351       0.1011 |
| train_cifar_3a47b_00001   TERMINATED      1      2   0.013416                 4        1            67.9055   2.31772       0.1006 |
| train_cifar_3a47b_00002   TERMINATED    256     64   0.0113784                2        1           121.957    2.31648       0.1034 |
| train_cifar_3a47b_00003   TERMINATED     64    256   0.0274071                8        2            80.386    2.19641       0.1971 |
| train_cifar_3a47b_00004   TERMINATED     16      2   0.056666                 4        1            70.3359   2.32697       0.0995 |
| train_cifar_3a47b_00008   TERMINATED    128    256   0.0306227                8        1            37.5834   2.16249       0.2113 |
| train_cifar_3a47b_00009   TERMINATED      2     16   0.0286986                2        1            91.715    2.3265        0.0986 |
+------------------------------------------------------------------------------------------------------------------------------------+
(func pid=4381) [6,  4000] loss: 0.929
(func pid=4380) [4,  4000] loss: 0.680 [repeated 2x across cluster]

Trial train_cifar_3a47b_00006 finished iteration 6 at 2025-04-23 16:33:19. Total running time: 3min 7s
+------------------------------------------------------------+
| Trial train_cifar_3a47b_00006 result                       |
+------------------------------------------------------------+
| checkpoint_dir_name                      checkpoint_000005 |
| time_this_iter_s                                  22.73444 |
| time_total_s                                     182.55482 |
| training_iteration                                       6 |
| accuracy                                            0.2913 |
| loss                                               1.84608 |
+------------------------------------------------------------+
Trial train_cifar_3a47b_00006 saved a checkpoint for iteration 6 at: (local)/var/lib/ci-user/ray_results/train_cifar_2025-04-23_16-30-11/train_cifar_3a47b_00006_6_batch_size=8,l1=16,l2=4,lr=0.0001_2025-04-23_16-30-11/checkpoint_000005
(func pid=4381) Checkpoint successfully created at: Checkpoint(filesystem=local, path=/var/lib/ci-user/ray_results/train_cifar_2025-04-23_16-30-11/train_cifar_3a47b_00006_6_batch_size=8,l1=16,l2=4,lr=0.0001_2025-04-23_16-30-11/checkpoint_000005) [repeated 2x across cluster]
(func pid=4382) [6,  4000] loss: 0.542
(func pid=4380) [4,  6000] loss: 0.444

Trial train_cifar_3a47b_00007 finished iteration 6 at 2025-04-23 16:33:30. Total running time: 3min 19s
+------------------------------------------------------------+
| Trial train_cifar_3a47b_00007 result                       |
+------------------------------------------------------------+
| checkpoint_dir_name                      checkpoint_000005 |
| time_this_iter_s                                  24.77263 |
| time_total_s                                      194.5406 |
| training_iteration                                       6 |
| accuracy                                            0.5585 |
| loss                                               1.28396 |
+------------------------------------------------------------+
Trial train_cifar_3a47b_00007 saved a checkpoint for iteration 6 at: (local)/var/lib/ci-user/ray_results/train_cifar_2025-04-23_16-30-11/train_cifar_3a47b_00007_7_batch_size=8,l1=256,l2=256,lr=0.0048_2025-04-23_16-30-11/checkpoint_000005
(func pid=4382) Checkpoint successfully created at: Checkpoint(filesystem=local, path=/var/lib/ci-user/ray_results/train_cifar_2025-04-23_16-30-11/train_cifar_3a47b_00007_7_batch_size=8,l1=256,l2=256,lr=0.0048_2025-04-23_16-30-11/checkpoint_000005)
(func pid=4380) [4,  8000] loss: 0.330 [repeated 2x across cluster]
(func pid=4380) [4, 10000] loss: 0.264 [repeated 2x across cluster]

Trial train_cifar_3a47b_00006 finished iteration 7 at 2025-04-23 16:33:41. Total running time: 3min 29s
+------------------------------------------------------------+
| Trial train_cifar_3a47b_00006 result                       |
+------------------------------------------------------------+
| checkpoint_dir_name                      checkpoint_000006 |
| time_this_iter_s                                  21.84277 |
| time_total_s                                     204.39759 |
| training_iteration                                       7 |
| accuracy                                            0.3138 |
| loss                                               1.80075 |
+------------------------------------------------------------+
Trial train_cifar_3a47b_00006 saved a checkpoint for iteration 7 at: (local)/var/lib/ci-user/ray_results/train_cifar_2025-04-23_16-30-11/train_cifar_3a47b_00006_6_batch_size=8,l1=16,l2=4,lr=0.0001_2025-04-23_16-30-11/checkpoint_000006
(func pid=4381) Checkpoint successfully created at: Checkpoint(filesystem=local, path=/var/lib/ci-user/ray_results/train_cifar_2025-04-23_16-30-11/train_cifar_3a47b_00006_6_batch_size=8,l1=16,l2=4,lr=0.0001_2025-04-23_16-30-11/checkpoint_000006)

Trial status: 7 TERMINATED | 3 RUNNING
Current time: 2025-04-23 16:33:42. Total running time: 3min 30s
Logical resource usage: 6.0/16 CPUs, 0/1 GPUs (0.0/1.0 accelerator_type:A10G)
+------------------------------------------------------------------------------------------------------------------------------------+
| Trial name                status         l1     l2            lr     batch_size     iter     total time (s)      loss     accuracy |
+------------------------------------------------------------------------------------------------------------------------------------+
| train_cifar_3a47b_00005   RUNNING         8     64   0.000353097              4        3           167.543    1.39015       0.4965 |
| train_cifar_3a47b_00006   RUNNING        16      4   0.000147684              8        7           204.398    1.80075       0.3138 |
| train_cifar_3a47b_00007   RUNNING       256    256   0.00477469               8        6           194.541    1.28396       0.5585 |
| train_cifar_3a47b_00000   TERMINATED     16      1   0.00213327               2        1           116.938    2.30351       0.1011 |
| train_cifar_3a47b_00001   TERMINATED      1      2   0.013416                 4        1            67.9055   2.31772       0.1006 |
| train_cifar_3a47b_00002   TERMINATED    256     64   0.0113784                2        1           121.957    2.31648       0.1034 |
| train_cifar_3a47b_00003   TERMINATED     64    256   0.0274071                8        2            80.386    2.19641       0.1971 |
| train_cifar_3a47b_00004   TERMINATED     16      2   0.056666                 4        1            70.3359   2.32697       0.0995 |
| train_cifar_3a47b_00008   TERMINATED    128    256   0.0306227                8        1            37.5834   2.16249       0.2113 |
| train_cifar_3a47b_00009   TERMINATED      2     16   0.0286986                2        1            91.715    2.3265        0.0986 |
+------------------------------------------------------------------------------------------------------------------------------------+

Trial train_cifar_3a47b_00005 finished iteration 4 at 2025-04-23 16:33:44. Total running time: 3min 32s
+------------------------------------------------------------+
| Trial train_cifar_3a47b_00005 result                       |
+------------------------------------------------------------+
| checkpoint_dir_name                      checkpoint_000003 |
| time_this_iter_s                                  39.91204 |
| time_total_s                                     207.45552 |
| training_iteration                                       4 |
| accuracy                                            0.5203 |
| loss                                               1.32945 |
+------------------------------------------------------------+
Trial train_cifar_3a47b_00005 saved a checkpoint for iteration 4 at: (local)/var/lib/ci-user/ray_results/train_cifar_2025-04-23_16-30-11/train_cifar_3a47b_00005_5_batch_size=4,l1=8,l2=64,lr=0.0004_2025-04-23_16-30-11/checkpoint_000003
(func pid=4382) [7,  4000] loss: 0.526 [repeated 2x across cluster]

Trial train_cifar_3a47b_00007 finished iteration 7 at 2025-04-23 16:33:55. Total running time: 3min 43s
+------------------------------------------------------------+
| Trial train_cifar_3a47b_00007 result                       |
+------------------------------------------------------------+
| checkpoint_dir_name                      checkpoint_000006 |
| time_this_iter_s                                  24.24598 |
| time_total_s                                     218.78658 |
| training_iteration                                       7 |
| accuracy                                             0.565 |
| loss                                               1.32398 |
+------------------------------------------------------------+
Trial train_cifar_3a47b_00007 saved a checkpoint for iteration 7 at: (local)/var/lib/ci-user/ray_results/train_cifar_2025-04-23_16-30-11/train_cifar_3a47b_00007_7_batch_size=8,l1=256,l2=256,lr=0.0048_2025-04-23_16-30-11/checkpoint_000006
(func pid=4382) Checkpoint successfully created at: Checkpoint(filesystem=local, path=/var/lib/ci-user/ray_results/train_cifar_2025-04-23_16-30-11/train_cifar_3a47b_00007_7_batch_size=8,l1=256,l2=256,lr=0.0048_2025-04-23_16-30-11/checkpoint_000006) [repeated 2x across cluster]
(func pid=4381) [8,  4000] loss: 0.880 [repeated 3x across cluster]

Trial train_cifar_3a47b_00006 finished iteration 8 at 2025-04-23 16:34:03. Total running time: 3min 51s
+------------------------------------------------------------+
| Trial train_cifar_3a47b_00006 result                       |
+------------------------------------------------------------+
| checkpoint_dir_name                      checkpoint_000007 |
| time_this_iter_s                                  21.80595 |
| time_total_s                                     226.20354 |
| training_iteration                                       8 |
| accuracy                                            0.3235 |
| loss                                               1.76241 |
+------------------------------------------------------------+
Trial train_cifar_3a47b_00006 saved a checkpoint for iteration 8 at: (local)/var/lib/ci-user/ray_results/train_cifar_2025-04-23_16-30-11/train_cifar_3a47b_00006_6_batch_size=8,l1=16,l2=4,lr=0.0001_2025-04-23_16-30-11/checkpoint_000007
(func pid=4381) Checkpoint successfully created at: Checkpoint(filesystem=local, path=/var/lib/ci-user/ray_results/train_cifar_2025-04-23_16-30-11/train_cifar_3a47b_00006_6_batch_size=8,l1=16,l2=4,lr=0.0001_2025-04-23_16-30-11/checkpoint_000007)
(func pid=4380) [5,  6000] loss: 0.424 [repeated 2x across cluster]
(func pid=4381) [9,  2000] loss: 1.724 [repeated 2x across cluster]

Trial status: 7 TERMINATED | 3 RUNNING
Current time: 2025-04-23 16:34:12. Total running time: 4min 0s
Logical resource usage: 6.0/16 CPUs, 0/1 GPUs (0.0/1.0 accelerator_type:A10G)
+------------------------------------------------------------------------------------------------------------------------------------+
| Trial name                status         l1     l2            lr     batch_size     iter     total time (s)      loss     accuracy |
+------------------------------------------------------------------------------------------------------------------------------------+
| train_cifar_3a47b_00005   RUNNING         8     64   0.000353097              4        4           207.456    1.32945       0.5203 |
| train_cifar_3a47b_00006   RUNNING        16      4   0.000147684              8        8           226.204    1.76241       0.3235 |
| train_cifar_3a47b_00007   RUNNING       256    256   0.00477469               8        7           218.787    1.32398       0.565  |
| train_cifar_3a47b_00000   TERMINATED     16      1   0.00213327               2        1           116.938    2.30351       0.1011 |
| train_cifar_3a47b_00001   TERMINATED      1      2   0.013416                 4        1            67.9055   2.31772       0.1006 |
| train_cifar_3a47b_00002   TERMINATED    256     64   0.0113784                2        1           121.957    2.31648       0.1034 |
| train_cifar_3a47b_00003   TERMINATED     64    256   0.0274071                8        2            80.386    2.19641       0.1971 |
| train_cifar_3a47b_00004   TERMINATED     16      2   0.056666                 4        1            70.3359   2.32697       0.0995 |
| train_cifar_3a47b_00008   TERMINATED    128    256   0.0306227                8        1            37.5834   2.16249       0.2113 |
| train_cifar_3a47b_00009   TERMINATED      2     16   0.0286986                2        1            91.715    2.3265        0.0986 |
+------------------------------------------------------------------------------------------------------------------------------------+
(func pid=4380) [5, 10000] loss: 0.253 [repeated 3x across cluster]

Trial train_cifar_3a47b_00007 finished iteration 8 at 2025-04-23 16:34:19. Total running time: 4min 7s
+------------------------------------------------------------+
| Trial train_cifar_3a47b_00007 result                       |
+------------------------------------------------------------+
| checkpoint_dir_name                      checkpoint_000007 |
| time_this_iter_s                                  24.35562 |
| time_total_s                                      243.1422 |
| training_iteration                                       8 |
| accuracy                                            0.5739 |
| loss                                               1.30763 |
+------------------------------------------------------------+
Trial train_cifar_3a47b_00007 saved a checkpoint for iteration 8 at: (local)/var/lib/ci-user/ray_results/train_cifar_2025-04-23_16-30-11/train_cifar_3a47b_00007_7_batch_size=8,l1=256,l2=256,lr=0.0048_2025-04-23_16-30-11/checkpoint_000007
(func pid=4382) Checkpoint successfully created at: Checkpoint(filesystem=local, path=/var/lib/ci-user/ray_results/train_cifar_2025-04-23_16-30-11/train_cifar_3a47b_00007_7_batch_size=8,l1=256,l2=256,lr=0.0048_2025-04-23_16-30-11/checkpoint_000007)

Trial train_cifar_3a47b_00005 finished iteration 5 at 2025-04-23 16:34:22. Total running time: 4min 10s
+------------------------------------------------------------+
| Trial train_cifar_3a47b_00005 result                       |
+------------------------------------------------------------+
| checkpoint_dir_name                      checkpoint_000004 |
| time_this_iter_s                                   38.5072 |
| time_total_s                                     245.96273 |
| training_iteration                                       5 |
| accuracy                                            0.5457 |
| loss                                               1.26669 |
+------------------------------------------------------------+
Trial train_cifar_3a47b_00005 saved a checkpoint for iteration 5 at: (local)/var/lib/ci-user/ray_results/train_cifar_2025-04-23_16-30-11/train_cifar_3a47b_00005_5_batch_size=4,l1=8,l2=64,lr=0.0004_2025-04-23_16-30-11/checkpoint_000004

Trial train_cifar_3a47b_00006 finished iteration 9 at 2025-04-23 16:34:24. Total running time: 4min 13s
+------------------------------------------------------------+
| Trial train_cifar_3a47b_00006 result                       |
+------------------------------------------------------------+
| checkpoint_dir_name                      checkpoint_000008 |
| time_this_iter_s                                  21.84632 |
| time_total_s                                     248.04986 |
| training_iteration                                       9 |
| accuracy                                            0.3542 |
| loss                                               1.70474 |
+------------------------------------------------------------+
Trial train_cifar_3a47b_00006 saved a checkpoint for iteration 9 at: (local)/var/lib/ci-user/ray_results/train_cifar_2025-04-23_16-30-11/train_cifar_3a47b_00006_6_batch_size=8,l1=16,l2=4,lr=0.0001_2025-04-23_16-30-11/checkpoint_000008
(func pid=4381) Checkpoint successfully created at: Checkpoint(filesystem=local, path=/var/lib/ci-user/ray_results/train_cifar_2025-04-23_16-30-11/train_cifar_3a47b_00006_6_batch_size=8,l1=16,l2=4,lr=0.0001_2025-04-23_16-30-11/checkpoint_000008) [repeated 2x across cluster]
(func pid=4382) [9,  2000] loss: 0.938 [repeated 2x across cluster]
(func pid=4381) [10,  2000] loss: 1.692 [repeated 2x across cluster]
(func pid=4381) [10,  4000] loss: 0.836 [repeated 3x across cluster]

Trial status: 7 TERMINATED | 3 RUNNING
Current time: 2025-04-23 16:34:42. Total running time: 4min 30s
Logical resource usage: 6.0/16 CPUs, 0/1 GPUs (0.0/1.0 accelerator_type:A10G)
+------------------------------------------------------------------------------------------------------------------------------------+
| Trial name                status         l1     l2            lr     batch_size     iter     total time (s)      loss     accuracy |
+------------------------------------------------------------------------------------------------------------------------------------+
| train_cifar_3a47b_00005   RUNNING         8     64   0.000353097              4        5           245.963    1.26669       0.5457 |
| train_cifar_3a47b_00006   RUNNING        16      4   0.000147684              8        9           248.05     1.70474       0.3542 |
| train_cifar_3a47b_00007   RUNNING       256    256   0.00477469               8        8           243.142    1.30763       0.5739 |
| train_cifar_3a47b_00000   TERMINATED     16      1   0.00213327               2        1           116.938    2.30351       0.1011 |
| train_cifar_3a47b_00001   TERMINATED      1      2   0.013416                 4        1            67.9055   2.31772       0.1006 |
| train_cifar_3a47b_00002   TERMINATED    256     64   0.0113784                2        1           121.957    2.31648       0.1034 |
| train_cifar_3a47b_00003   TERMINATED     64    256   0.0274071                8        2            80.386    2.19641       0.1971 |
| train_cifar_3a47b_00004   TERMINATED     16      2   0.056666                 4        1            70.3359   2.32697       0.0995 |
| train_cifar_3a47b_00008   TERMINATED    128    256   0.0306227                8        1            37.5834   2.16249       0.2113 |
| train_cifar_3a47b_00009   TERMINATED      2     16   0.0286986                2        1            91.715    2.3265        0.0986 |
+------------------------------------------------------------------------------------------------------------------------------------+

Trial train_cifar_3a47b_00007 finished iteration 9 at 2025-04-23 16:34:42. Total running time: 4min 31s
+------------------------------------------------------------+
| Trial train_cifar_3a47b_00007 result                       |
+------------------------------------------------------------+
| checkpoint_dir_name                      checkpoint_000008 |
| time_this_iter_s                                  23.15711 |
| time_total_s                                     266.29931 |
| training_iteration                                       9 |
| accuracy                                            0.5424 |
| loss                                               1.46234 |
+------------------------------------------------------------+
Trial train_cifar_3a47b_00007 saved a checkpoint for iteration 9 at: (local)/var/lib/ci-user/ray_results/train_cifar_2025-04-23_16-30-11/train_cifar_3a47b_00007_7_batch_size=8,l1=256,l2=256,lr=0.0048_2025-04-23_16-30-11/checkpoint_000008
(func pid=4382) Checkpoint successfully created at: Checkpoint(filesystem=local, path=/var/lib/ci-user/ray_results/train_cifar_2025-04-23_16-30-11/train_cifar_3a47b_00007_7_batch_size=8,l1=256,l2=256,lr=0.0048_2025-04-23_16-30-11/checkpoint_000008)

Trial train_cifar_3a47b_00006 finished iteration 10 at 2025-04-23 16:34:46. Total running time: 4min 35s
+------------------------------------------------------------+
| Trial train_cifar_3a47b_00006 result                       |
+------------------------------------------------------------+
| checkpoint_dir_name                      checkpoint_000009 |
| time_this_iter_s                                  21.77325 |
| time_total_s                                     269.82311 |
| training_iteration                                      10 |
| accuracy                                            0.3718 |
| loss                                               1.65719 |
+------------------------------------------------------------+
Trial train_cifar_3a47b_00006 saved a checkpoint for iteration 10 at: (local)/var/lib/ci-user/ray_results/train_cifar_2025-04-23_16-30-11/train_cifar_3a47b_00006_6_batch_size=8,l1=16,l2=4,lr=0.0001_2025-04-23_16-30-11/checkpoint_000009

Trial train_cifar_3a47b_00006 completed after 10 iterations at 2025-04-23 16:34:46. Total running time: 4min 35s
(func pid=4381) Checkpoint successfully created at: Checkpoint(filesystem=local, path=/var/lib/ci-user/ray_results/train_cifar_2025-04-23_16-30-11/train_cifar_3a47b_00006_6_batch_size=8,l1=16,l2=4,lr=0.0001_2025-04-23_16-30-11/checkpoint_000009)
(func pid=4380) [6,  8000] loss: 0.313 [repeated 2x across cluster]
(func pid=4380) [6, 10000] loss: 0.245 [repeated 2x across cluster]

Trial train_cifar_3a47b_00005 finished iteration 6 at 2025-04-23 16:34:59. Total running time: 4min 47s
+------------------------------------------------------------+
| Trial train_cifar_3a47b_00005 result                       |
+------------------------------------------------------------+
| checkpoint_dir_name                      checkpoint_000005 |
| time_this_iter_s                                  36.67234 |
| time_total_s                                     282.63507 |
| training_iteration                                       6 |
| accuracy                                            0.5495 |
| loss                                               1.27286 |
+------------------------------------------------------------+
Trial train_cifar_3a47b_00005 saved a checkpoint for iteration 6 at: (local)/var/lib/ci-user/ray_results/train_cifar_2025-04-23_16-30-11/train_cifar_3a47b_00005_5_batch_size=4,l1=8,l2=64,lr=0.0004_2025-04-23_16-30-11/checkpoint_000005
(func pid=4380) Checkpoint successfully created at: Checkpoint(filesystem=local, path=/var/lib/ci-user/ray_results/train_cifar_2025-04-23_16-30-11/train_cifar_3a47b_00005_5_batch_size=4,l1=8,l2=64,lr=0.0004_2025-04-23_16-30-11/checkpoint_000005)

Trial train_cifar_3a47b_00007 finished iteration 10 at 2025-04-23 16:35:04. Total running time: 4min 52s
+------------------------------------------------------------+
| Trial train_cifar_3a47b_00007 result                       |
+------------------------------------------------------------+
| checkpoint_dir_name                      checkpoint_000009 |
| time_this_iter_s                                  21.62619 |
| time_total_s                                     287.92551 |
| training_iteration                                      10 |
| accuracy                                            0.5714 |
| loss                                               1.36822 |
+------------------------------------------------------------+
Trial train_cifar_3a47b_00007 saved a checkpoint for iteration 10 at: (local)/var/lib/ci-user/ray_results/train_cifar_2025-04-23_16-30-11/train_cifar_3a47b_00007_7_batch_size=8,l1=256,l2=256,lr=0.0048_2025-04-23_16-30-11/checkpoint_000009

Trial train_cifar_3a47b_00007 completed after 10 iterations at 2025-04-23 16:35:04. Total running time: 4min 52s
(func pid=4382) Checkpoint successfully created at: Checkpoint(filesystem=local, path=/var/lib/ci-user/ray_results/train_cifar_2025-04-23_16-30-11/train_cifar_3a47b_00007_7_batch_size=8,l1=256,l2=256,lr=0.0048_2025-04-23_16-30-11/checkpoint_000009)
(func pid=4380) [7,  2000] loss: 1.185 [repeated 2x across cluster]
(func pid=4380) [7,  4000] loss: 0.600

Trial status: 9 TERMINATED | 1 RUNNING
Current time: 2025-04-23 16:35:12. Total running time: 5min 0s
Logical resource usage: 2.0/16 CPUs, 0/1 GPUs (0.0/1.0 accelerator_type:A10G)
+------------------------------------------------------------------------------------------------------------------------------------+
| Trial name                status         l1     l2            lr     batch_size     iter     total time (s)      loss     accuracy |
+------------------------------------------------------------------------------------------------------------------------------------+
| train_cifar_3a47b_00005   RUNNING         8     64   0.000353097              4        6           282.635    1.27286       0.5495 |
| train_cifar_3a47b_00000   TERMINATED     16      1   0.00213327               2        1           116.938    2.30351       0.1011 |
| train_cifar_3a47b_00001   TERMINATED      1      2   0.013416                 4        1            67.9055   2.31772       0.1006 |
| train_cifar_3a47b_00002   TERMINATED    256     64   0.0113784                2        1           121.957    2.31648       0.1034 |
| train_cifar_3a47b_00003   TERMINATED     64    256   0.0274071                8        2            80.386    2.19641       0.1971 |
| train_cifar_3a47b_00004   TERMINATED     16      2   0.056666                 4        1            70.3359   2.32697       0.0995 |
| train_cifar_3a47b_00006   TERMINATED     16      4   0.000147684              8       10           269.823    1.65719       0.3718 |
| train_cifar_3a47b_00007   TERMINATED    256    256   0.00477469               8       10           287.926    1.36822       0.5714 |
| train_cifar_3a47b_00008   TERMINATED    128    256   0.0306227                8        1            37.5834   2.16249       0.2113 |
| train_cifar_3a47b_00009   TERMINATED      2     16   0.0286986                2        1            91.715    2.3265        0.0986 |
+------------------------------------------------------------------------------------------------------------------------------------+
(func pid=4380) [7,  6000] loss: 0.403
(func pid=4380) [7,  8000] loss: 0.300
(func pid=4380) [7, 10000] loss: 0.238

Trial train_cifar_3a47b_00005 finished iteration 7 at 2025-04-23 16:35:31. Total running time: 5min 19s
+------------------------------------------------------------+
| Trial train_cifar_3a47b_00005 result                       |
+------------------------------------------------------------+
| checkpoint_dir_name                      checkpoint_000006 |
| time_this_iter_s                                  31.79974 |
| time_total_s                                     314.43481 |
| training_iteration                                       7 |
| accuracy                                            0.5696 |
| loss                                                1.2172 |
+------------------------------------------------------------+
Trial train_cifar_3a47b_00005 saved a checkpoint for iteration 7 at: (local)/var/lib/ci-user/ray_results/train_cifar_2025-04-23_16-30-11/train_cifar_3a47b_00005_5_batch_size=4,l1=8,l2=64,lr=0.0004_2025-04-23_16-30-11/checkpoint_000006
(func pid=4380) Checkpoint successfully created at: Checkpoint(filesystem=local, path=/var/lib/ci-user/ray_results/train_cifar_2025-04-23_16-30-11/train_cifar_3a47b_00005_5_batch_size=4,l1=8,l2=64,lr=0.0004_2025-04-23_16-30-11/checkpoint_000006)
(func pid=4380) [8,  2000] loss: 1.198
(func pid=4380) [8,  4000] loss: 0.582

Trial status: 9 TERMINATED | 1 RUNNING
Current time: 2025-04-23 16:35:42. Total running time: 5min 30s
Logical resource usage: 2.0/16 CPUs, 0/1 GPUs (0.0/1.0 accelerator_type:A10G)
+------------------------------------------------------------------------------------------------------------------------------------+
| Trial name                status         l1     l2            lr     batch_size     iter     total time (s)      loss     accuracy |
+------------------------------------------------------------------------------------------------------------------------------------+
| train_cifar_3a47b_00005   RUNNING         8     64   0.000353097              4        7           314.435    1.2172        0.5696 |
| train_cifar_3a47b_00000   TERMINATED     16      1   0.00213327               2        1           116.938    2.30351       0.1011 |
| train_cifar_3a47b_00001   TERMINATED      1      2   0.013416                 4        1            67.9055   2.31772       0.1006 |
| train_cifar_3a47b_00002   TERMINATED    256     64   0.0113784                2        1           121.957    2.31648       0.1034 |
| train_cifar_3a47b_00003   TERMINATED     64    256   0.0274071                8        2            80.386    2.19641       0.1971 |
| train_cifar_3a47b_00004   TERMINATED     16      2   0.056666                 4        1            70.3359   2.32697       0.0995 |
| train_cifar_3a47b_00006   TERMINATED     16      4   0.000147684              8       10           269.823    1.65719       0.3718 |
| train_cifar_3a47b_00007   TERMINATED    256    256   0.00477469               8       10           287.926    1.36822       0.5714 |
| train_cifar_3a47b_00008   TERMINATED    128    256   0.0306227                8        1            37.5834   2.16249       0.2113 |
| train_cifar_3a47b_00009   TERMINATED      2     16   0.0286986                2        1            91.715    2.3265        0.0986 |
+------------------------------------------------------------------------------------------------------------------------------------+
(func pid=4380) [8,  6000] loss: 0.381
(func pid=4380) [8,  8000] loss: 0.293
(func pid=4380) [8, 10000] loss: 0.233

Trial train_cifar_3a47b_00005 finished iteration 8 at 2025-04-23 16:36:02. Total running time: 5min 50s
+------------------------------------------------------------+
| Trial train_cifar_3a47b_00005 result                       |
+------------------------------------------------------------+
| checkpoint_dir_name                      checkpoint_000007 |
| time_this_iter_s                                  31.24044 |
| time_total_s                                     345.67525 |
| training_iteration                                       8 |
| accuracy                                              0.57 |
| loss                                               1.21771 |
+------------------------------------------------------------+
Trial train_cifar_3a47b_00005 saved a checkpoint for iteration 8 at: (local)/var/lib/ci-user/ray_results/train_cifar_2025-04-23_16-30-11/train_cifar_3a47b_00005_5_batch_size=4,l1=8,l2=64,lr=0.0004_2025-04-23_16-30-11/checkpoint_000007
(func pid=4380) Checkpoint successfully created at: Checkpoint(filesystem=local, path=/var/lib/ci-user/ray_results/train_cifar_2025-04-23_16-30-11/train_cifar_3a47b_00005_5_batch_size=4,l1=8,l2=64,lr=0.0004_2025-04-23_16-30-11/checkpoint_000007)
(func pid=4380) [9,  2000] loss: 1.161

Trial status: 9 TERMINATED | 1 RUNNING
Current time: 2025-04-23 16:36:12. Total running time: 6min 0s
Logical resource usage: 2.0/16 CPUs, 0/1 GPUs (0.0/1.0 accelerator_type:A10G)
+------------------------------------------------------------------------------------------------------------------------------------+
| Trial name                status         l1     l2            lr     batch_size     iter     total time (s)      loss     accuracy |
+------------------------------------------------------------------------------------------------------------------------------------+
| train_cifar_3a47b_00005   RUNNING         8     64   0.000353097              4        8           345.675    1.21771       0.57   |
| train_cifar_3a47b_00000   TERMINATED     16      1   0.00213327               2        1           116.938    2.30351       0.1011 |
| train_cifar_3a47b_00001   TERMINATED      1      2   0.013416                 4        1            67.9055   2.31772       0.1006 |
| train_cifar_3a47b_00002   TERMINATED    256     64   0.0113784                2        1           121.957    2.31648       0.1034 |
| train_cifar_3a47b_00003   TERMINATED     64    256   0.0274071                8        2            80.386    2.19641       0.1971 |
| train_cifar_3a47b_00004   TERMINATED     16      2   0.056666                 4        1            70.3359   2.32697       0.0995 |
| train_cifar_3a47b_00006   TERMINATED     16      4   0.000147684              8       10           269.823    1.65719       0.3718 |
| train_cifar_3a47b_00007   TERMINATED    256    256   0.00477469               8       10           287.926    1.36822       0.5714 |
| train_cifar_3a47b_00008   TERMINATED    128    256   0.0306227                8        1            37.5834   2.16249       0.2113 |
| train_cifar_3a47b_00009   TERMINATED      2     16   0.0286986                2        1            91.715    2.3265        0.0986 |
+------------------------------------------------------------------------------------------------------------------------------------+
(func pid=4380) [9,  4000] loss: 0.566
(func pid=4380) [9,  6000] loss: 0.385
(func pid=4380) [9,  8000] loss: 0.288
(func pid=4380) [9, 10000] loss: 0.226

Trial train_cifar_3a47b_00005 finished iteration 9 at 2025-04-23 16:36:33. Total running time: 6min 21s
+------------------------------------------------------------+
| Trial train_cifar_3a47b_00005 result                       |
+------------------------------------------------------------+
| checkpoint_dir_name                      checkpoint_000008 |
| time_this_iter_s                                   31.2056 |
| time_total_s                                     376.88085 |
| training_iteration                                       9 |
| accuracy                                            0.5853 |
| loss                                               1.16936 |
+------------------------------------------------------------+
Trial train_cifar_3a47b_00005 saved a checkpoint for iteration 9 at: (local)/var/lib/ci-user/ray_results/train_cifar_2025-04-23_16-30-11/train_cifar_3a47b_00005_5_batch_size=4,l1=8,l2=64,lr=0.0004_2025-04-23_16-30-11/checkpoint_000008
(func pid=4380) Checkpoint successfully created at: Checkpoint(filesystem=local, path=/var/lib/ci-user/ray_results/train_cifar_2025-04-23_16-30-11/train_cifar_3a47b_00005_5_batch_size=4,l1=8,l2=64,lr=0.0004_2025-04-23_16-30-11/checkpoint_000008)
(func pid=4380) [10,  2000] loss: 1.132

Trial status: 9 TERMINATED | 1 RUNNING
Current time: 2025-04-23 16:36:42. Total running time: 6min 31s
Logical resource usage: 2.0/16 CPUs, 0/1 GPUs (0.0/1.0 accelerator_type:A10G)
+------------------------------------------------------------------------------------------------------------------------------------+
| Trial name                status         l1     l2            lr     batch_size     iter     total time (s)      loss     accuracy |
+------------------------------------------------------------------------------------------------------------------------------------+
| train_cifar_3a47b_00005   RUNNING         8     64   0.000353097              4        9           376.881    1.16936       0.5853 |
| train_cifar_3a47b_00000   TERMINATED     16      1   0.00213327               2        1           116.938    2.30351       0.1011 |
| train_cifar_3a47b_00001   TERMINATED      1      2   0.013416                 4        1            67.9055   2.31772       0.1006 |
| train_cifar_3a47b_00002   TERMINATED    256     64   0.0113784                2        1           121.957    2.31648       0.1034 |
| train_cifar_3a47b_00003   TERMINATED     64    256   0.0274071                8        2            80.386    2.19641       0.1971 |
| train_cifar_3a47b_00004   TERMINATED     16      2   0.056666                 4        1            70.3359   2.32697       0.0995 |
| train_cifar_3a47b_00006   TERMINATED     16      4   0.000147684              8       10           269.823    1.65719       0.3718 |
| train_cifar_3a47b_00007   TERMINATED    256    256   0.00477469               8       10           287.926    1.36822       0.5714 |
| train_cifar_3a47b_00008   TERMINATED    128    256   0.0306227                8        1            37.5834   2.16249       0.2113 |
| train_cifar_3a47b_00009   TERMINATED      2     16   0.0286986                2        1            91.715    2.3265        0.0986 |
+------------------------------------------------------------------------------------------------------------------------------------+
(func pid=4380) [10,  4000] loss: 0.562
(func pid=4380) [10,  6000] loss: 0.383
(func pid=4380) [10,  8000] loss: 0.279
(func pid=4380) [10, 10000] loss: 0.225

Trial train_cifar_3a47b_00005 finished iteration 10 at 2025-04-23 16:37:04. Total running time: 6min 53s
+------------------------------------------------------------+
| Trial train_cifar_3a47b_00005 result                       |
+------------------------------------------------------------+
| checkpoint_dir_name                      checkpoint_000009 |
| time_this_iter_s                                  31.40795 |
| time_total_s                                     408.28881 |
| training_iteration                                      10 |
| accuracy                                             0.573 |
| loss                                               1.19828 |
+------------------------------------------------------------+
Trial train_cifar_3a47b_00005 saved a checkpoint for iteration 10 at: (local)/var/lib/ci-user/ray_results/train_cifar_2025-04-23_16-30-11/train_cifar_3a47b_00005_5_batch_size=4,l1=8,l2=64,lr=0.0004_2025-04-23_16-30-11/checkpoint_000009

Trial train_cifar_3a47b_00005 completed after 10 iterations at 2025-04-23 16:37:04. Total running time: 6min 53s

Trial status: 10 TERMINATED
Current time: 2025-04-23 16:37:04. Total running time: 6min 53s
Logical resource usage: 2.0/16 CPUs, 0/1 GPUs (0.0/1.0 accelerator_type:A10G)
+------------------------------------------------------------------------------------------------------------------------------------+
| Trial name                status         l1     l2            lr     batch_size     iter     total time (s)      loss     accuracy |
+------------------------------------------------------------------------------------------------------------------------------------+
| train_cifar_3a47b_00000   TERMINATED     16      1   0.00213327               2        1           116.938    2.30351       0.1011 |
| train_cifar_3a47b_00001   TERMINATED      1      2   0.013416                 4        1            67.9055   2.31772       0.1006 |
| train_cifar_3a47b_00002   TERMINATED    256     64   0.0113784                2        1           121.957    2.31648       0.1034 |
| train_cifar_3a47b_00003   TERMINATED     64    256   0.0274071                8        2            80.386    2.19641       0.1971 |
| train_cifar_3a47b_00004   TERMINATED     16      2   0.056666                 4        1            70.3359   2.32697       0.0995 |
| train_cifar_3a47b_00005   TERMINATED      8     64   0.000353097              4       10           408.289    1.19828       0.573  |
| train_cifar_3a47b_00006   TERMINATED     16      4   0.000147684              8       10           269.823    1.65719       0.3718 |
| train_cifar_3a47b_00007   TERMINATED    256    256   0.00477469               8       10           287.926    1.36822       0.5714 |
| train_cifar_3a47b_00008   TERMINATED    128    256   0.0306227                8        1            37.5834   2.16249       0.2113 |
| train_cifar_3a47b_00009   TERMINATED      2     16   0.0286986                2        1            91.715    2.3265        0.0986 |
+------------------------------------------------------------------------------------------------------------------------------------+

Best trial config: {'l1': 8, 'l2': 64, 'lr': 0.0003530972286268149, 'batch_size': 4}
Best trial final validation loss: 1.1982811905056239
Best trial final validation accuracy: 0.573
(func pid=4380) Checkpoint successfully created at: Checkpoint(filesystem=local, path=/var/lib/ci-user/ray_results/train_cifar_2025-04-23_16-30-11/train_cifar_3a47b_00005_5_batch_size=4,l1=8,l2=64,lr=0.0004_2025-04-23_16-30-11/checkpoint_000009)
Best trial test set accuracy: 0.5953

If you run the code, an example output could look like this:

Number of trials: 10/10 (10 TERMINATED)
+-----+--------------+------+------+-------------+--------+---------+------------+
| ... |   batch_size |   l1 |   l2 |          lr |   iter |    loss |   accuracy |
|-----+--------------+------+------+-------------+--------+---------+------------|
| ... |            2 |    1 |  256 | 0.000668163 |      1 | 2.31479 |     0.0977 |
| ... |            4 |   64 |    8 | 0.0331514   |      1 | 2.31605 |     0.0983 |
| ... |            4 |    2 |    1 | 0.000150295 |      1 | 2.30755 |     0.1023 |
| ... |           16 |   32 |   32 | 0.0128248   |     10 | 1.66912 |     0.4391 |
| ... |            4 |    8 |  128 | 0.00464561  |      2 | 1.7316  |     0.3463 |
| ... |            8 |  256 |    8 | 0.00031556  |      1 | 2.19409 |     0.1736 |
| ... |            4 |   16 |  256 | 0.00574329  |      2 | 1.85679 |     0.3368 |
| ... |            8 |    2 |    2 | 0.00325652  |      1 | 2.30272 |     0.0984 |
| ... |            2 |    2 |    2 | 0.000342987 |      2 | 1.76044 |     0.292  |
| ... |            4 |   64 |   32 | 0.003734    |      8 | 1.53101 |     0.4761 |
+-----+--------------+------+------+-------------+--------+---------+------------+

Best trial config: {'l1': 64, 'l2': 32, 'lr': 0.0037339984519545164, 'batch_size': 4}
Best trial final validation loss: 1.5310075663924216
Best trial final validation accuracy: 0.4761
Best trial test set accuracy: 0.4737

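In this example output, most trials were stopped early to avoid wasting resources. The best-performing trial reached a validation accuracy of about 47.6%, and its test set accuracy of about 47.4% confirms that result.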
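To make the reading of the example table concrete, here is a minimal sketch in plain Python that does not call Ray Tune at all. It copies the rows of the example output table into a list, picks the trial with the lowest final validation loss, and counts the trials that ran fewer iterations than the per-trial budget. The names `ROWS`, `FIELDS`, `MAX_ITERATIONS`, and `summarize` are hypothetical helpers introduced only for this illustration, and the budget of 10 iterations is an assumption based on the largest `iter` value shown in the table.

```python
# Rows copied from the example output table above, in column order:
# (batch_size, l1, l2, lr, iter, loss, accuracy)
ROWS = [
    (2, 1, 256, 0.000668163, 1, 2.31479, 0.0977),
    (4, 64, 8, 0.0331514, 1, 2.31605, 0.0983),
    (4, 2, 1, 0.000150295, 1, 2.30755, 0.1023),
    (16, 32, 32, 0.0128248, 10, 1.66912, 0.4391),
    (4, 8, 128, 0.00464561, 2, 1.7316, 0.3463),
    (8, 256, 8, 0.00031556, 1, 2.19409, 0.1736),
    (4, 16, 256, 0.00574329, 2, 1.85679, 0.3368),
    (8, 2, 2, 0.00325652, 1, 2.30272, 0.0984),
    (2, 2, 2, 0.000342987, 2, 1.76044, 0.292),
    (4, 64, 32, 0.003734, 8, 1.53101, 0.4761),
]
FIELDS = ("batch_size", "l1", "l2", "lr", "iter", "loss", "accuracy")

# Assumed per-trial iteration budget (hypothetical; taken from the
# largest "iter" value that appears in the table).
MAX_ITERATIONS = 10


def summarize(rows, max_iterations=MAX_ITERATIONS):
    """Return the lowest-loss trial and the number of trials stopped early."""
    trials = [dict(zip(FIELDS, row)) for row in rows]
    # "Best" here means the smallest final validation loss.
    best = min(trials, key=lambda trial: trial["loss"])
    # A trial counts as stopped early if it used less than the full budget.
    stopped_early = sum(1 for trial in trials if trial["iter"] < max_iterations)
    return best, stopped_early


best, stopped_early = summarize(ROWS)
print("Best trial by loss:", best)
print("Trials stopped before the full budget:", stopped_early, "of", len(ROWS))
```

Under these assumptions, the trial selected by lowest loss has the same `l1`, `l2`, and `batch_size` as the "Best trial config" line in the example output, and only one of the ten trials used the full iteration budget.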

That's it! You can now tune the parameters of your PyTorch models.

Total running time of the script: (7 minutes 7.666 seconds)
