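Hyperparameter tuning with Ray Tune¶
Hyperparameter tuning can make the difference between an average model and a highly accurate one. Often, simple things like choosing a different learning rate or changing a network layer size can have a dramatic impact on model performance.
Fortunately, there are tools that help with finding the best combination of parameters. Ray Tune is an industry-standard tool for distributed hyperparameter tuning. Ray Tune includes the latest hyperparameter search algorithms, integrates with TensorBoard and other analysis libraries, and natively supports distributed training through Ray's distributed machine learning engine.
In this tutorial, we will show you how to integrate Ray Tune into your PyTorch training workflow. We will extend the tutorial from the PyTorch documentation on training a CIFAR10 image classifier.
As you will see, we only need to make a few small modifications. In particular, we need to
wrap data loading and training in functions,
make some network parameters configurable,
add checkpointing (optional),
and define the search space for the model tuning
To run this tutorial, please make sure the following packages are installed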
ray[tune]: Distributed hyperparameter tuning library
torchvision: For the data transformers
Setup / Imports¶
Let's start with the imports
from functools import partial
import os
import tempfile
from pathlib import Path
import torch
import torch.nn as nn
import torch.nn.functional as F
import torch.optim as optim
from torch.utils.data import random_split
import torchvision
import torchvision.transforms as transforms
from ray import tune
from ray import train
from ray.train import Checkpoint, get_checkpoint
from ray.tune.schedulers import ASHAScheduler
import ray.cloudpickle as pickle
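Most of the imports are needed for building the PyTorch model. Only the last few imports are for Ray Tune.
Data loaders¶
We wrap the data loaders in their own function and pass a global data directory. This way we can share a data directory between different trials.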
def load_data(data_dir="./data"):
    transform = transforms.Compose(
        [transforms.ToTensor(), transforms.Normalize((0.5, 0.5, 0.5), (0.5, 0.5, 0.5))]
    )
    trainset = torchvision.datasets.CIFAR10(
        root=data_dir, train=True, download=True, transform=transform
    )
    testset = torchvision.datasets.CIFAR10(
        root=data_dir, train=False, download=True, transform=transform
    )
    return trainset, testset
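Configurable neural network¶
We can only tune parameters that are configurable. In this example, we can specify the layer sizes of the fully connected layers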
class Net(nn.Module):
    def __init__(self, l1=120, l2=84):
        super(Net, self).__init__()
        self.conv1 = nn.Conv2d(3, 6, 5)
        self.pool = nn.MaxPool2d(2, 2)
        self.conv2 = nn.Conv2d(6, 16, 5)
        self.fc1 = nn.Linear(16 * 5 * 5, l1)
        self.fc2 = nn.Linear(l1, l2)
        self.fc3 = nn.Linear(l2, 10)

    def forward(self, x):
        x = self.pool(F.relu(self.conv1(x)))
        x = self.pool(F.relu(self.conv2(x)))
        x = torch.flatten(x, 1)  # flatten all dimensions except batch
        x = F.relu(self.fc1(x))
        x = F.relu(self.fc2(x))
        x = self.fc3(x)
        return x
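The input size of fc1, 16 * 5 * 5, follows from the convolution and pooling layers above applied to 32x32 CIFAR10 images. The following is a small sketch of that arithmetic in plain Python; the helper names conv_out and flattened_features are made up for this illustration and are not part of PyTorch.

```python
def conv_out(size, kernel, stride=1, padding=0):
    """Spatial size after a convolution or pooling window (floor division)."""
    return (size + 2 * padding - kernel) // stride + 1


def flattened_features(image_size=32, channels=16):
    """Number of features entering fc1 for the Net architecture above."""
    s = conv_out(image_size, kernel=5)    # conv1 (5x5):      32 -> 28
    s = conv_out(s, kernel=2, stride=2)   # max pool (2x2):   28 -> 14
    s = conv_out(s, kernel=5)             # conv2 (5x5):      14 -> 10
    s = conv_out(s, kernel=2, stride=2)   # max pool (2x2):   10 -> 5
    return channels * s * s


print(flattened_features())  # 400, i.e. 16 * 5 * 5
```

Only l1 and l2 are tunable here; the 400 input features of fc1 are fixed by the convolutional part of the network.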
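The train function¶
Now it gets interesting, because we introduce some changes to the example from the PyTorch documentation.
We wrap the training script in a function train_cifar(config, data_dir=None). The config parameter will receive the hyperparameters we would like to train with. The data_dir specifies the directory where we load and store the data, so that multiple runs can share the same data source. We also load the model and optimizer state at the start of the run, if a checkpoint is provided. Further down in this tutorial you will find information on how to save the checkpoint and what it is used for.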
net = Net(config["l1"], config["l2"])

# Note: in the full training function, the optimizer is created before this block.
checkpoint = get_checkpoint()
if checkpoint:
    with checkpoint.as_directory() as checkpoint_dir:
        data_path = Path(checkpoint_dir) / "data.pkl"
        with open(data_path, "rb") as fp:
            checkpoint_state = pickle.load(fp)
        start_epoch = checkpoint_state["epoch"]
        net.load_state_dict(checkpoint_state["net_state_dict"])
        optimizer.load_state_dict(checkpoint_state["optimizer_state_dict"])
else:
    start_epoch = 0
The learning rate of the optimizer is made configurable, too
optimizer = optim.SGD(net.parameters(), lr=config["lr"], momentum=0.9)
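We also split the training data into a training and a validation subset. We thus train on 80% of the data and calculate the validation loss on the remaining 20%. The batch sizes with which we iterate through the training and validation sets are configurable as well.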
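As a rough illustration of the 80/20 split, the sketch below reproduces the size arithmetic with plain Python lists instead of torch.utils.data.random_split. The function name split_indices and the fixed seed are assumptions made for this example; the CIFAR10 training set contains 50,000 images.

```python
import random


def split_indices(n_samples, train_fraction=0.8, seed=0):
    """Shuffle sample indices and cut them into a training and a validation part."""
    n_train = int(n_samples * train_fraction)
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)
    return indices[:n_train], indices[n_train:]


train_idx, val_idx = split_indices(50_000)
print(len(train_idx), len(val_idx))  # 40000 10000
```

The two index lists are disjoint, so no validation image is ever used for a gradient step.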
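Adding (multi) GPU support with DataParallel¶
Image classification benefits largely from GPUs. Luckily, we can continue to use PyTorch's abstractions in Ray Tune. Thus, we can wrap our model in nn.DataParallel to support data-parallel training on multiple GPUs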
device = "cpu"
if torch.cuda.is_available():
    device = "cuda:0"
    if torch.cuda.device_count() > 1:
        net = nn.DataParallel(net)
net.to(device)
By using a device variable, we make sure that training also works when no GPUs are available. PyTorch requires us to send our data to GPU memory explicitly, like this
for i, data in enumerate(trainloader, 0):
    inputs, labels = data
    inputs, labels = inputs.to(device), labels.to(device)
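The code now supports training on CPUs, on a single GPU, and on multiple GPUs. Notably, Ray also supports fractional GPUs, so we can share GPUs among trials, as long as the model still fits in GPU memory. We will come back to that later.
Communicating with Ray Tune¶
The most interesting part is the communication with Ray Tune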
checkpoint_data = {
    "epoch": epoch,
    "net_state_dict": net.state_dict(),
    "optimizer_state_dict": optimizer.state_dict(),
}
with tempfile.TemporaryDirectory() as checkpoint_dir:
    data_path = Path(checkpoint_dir) / "data.pkl"
    with open(data_path, "wb") as fp:
        pickle.dump(checkpoint_data, fp)

    checkpoint = Checkpoint.from_directory(checkpoint_dir)
    train.report(
        {"loss": val_loss / val_steps, "accuracy": correct / total},
        checkpoint=checkpoint,
    )
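Here we first save a checkpoint and then report some metrics back to Ray Tune. Specifically, we send the validation loss and accuracy back to Ray Tune. Ray Tune can then use these metrics to decide which hyperparameter configuration leads to the best results. These metrics can also be used to stop badly performing trials early, to avoid wasting resources on them.
Saving the checkpoint is optional, but it is necessary if we want to use advanced schedulers like Population Based Training. In addition, saving the checkpoint lets us later load the trained models and validate them on a test set. Lastly, saving checkpoints is useful for fault tolerance, and it allows us to interrupt training and continue training later.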
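To make the save-and-resume cycle concrete without needing Ray or PyTorch, here is a minimal sketch that writes a state dictionary to data.pkl in a temporary directory and reads it back. It uses the standard-library pickle rather than ray.cloudpickle, and the helpers save_state, load_state and resume_epoch are hypothetical names for this illustration only.

```python
import pickle
import tempfile
from pathlib import Path


def save_state(directory, state):
    """Write the state dictionary to <directory>/data.pkl."""
    path = Path(directory) / "data.pkl"
    with open(path, "wb") as fp:
        pickle.dump(state, fp)
    return path


def load_state(directory):
    """Read the state dictionary back from <directory>/data.pkl."""
    with open(Path(directory) / "data.pkl", "rb") as fp:
        return pickle.load(fp)


def resume_epoch(directory):
    """Return the stored epoch, or 0 when no checkpoint file exists yet."""
    if not (Path(directory) / "data.pkl").exists():
        return 0
    return load_state(directory)["epoch"]


with tempfile.TemporaryDirectory() as checkpoint_dir:
    print(resume_epoch(checkpoint_dir))  # 0, nothing saved yet
    # Stand-in values; a real checkpoint would hold net.state_dict() etc.
    save_state(checkpoint_dir, {"epoch": 3, "net_state_dict": {"fc1.weight": [0.1, 0.2]}})
    print(resume_epoch(checkpoint_dir))  # 3
```

In the tutorial itself, the directory is handed to Checkpoint.from_directory and Ray Tune takes care of persisting it, so the temporary directory can be discarded after train.report returns.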
Full training function¶
The full code example looks like this
def train_cifar(config, data_dir=None):
    net = Net(config["l1"], config["l2"])

    device = "cpu"
    if torch.cuda.is_available():
        device = "cuda:0"
        if torch.cuda.device_count() > 1:
            net = nn.DataParallel(net)
    net.to(device)

    criterion = nn.CrossEntropyLoss()
    optimizer = optim.SGD(net.parameters(), lr=config["lr"], momentum=0.9)

    checkpoint = get_checkpoint()
    if checkpoint:
        with checkpoint.as_directory() as checkpoint_dir:
            data_path = Path(checkpoint_dir) / "data.pkl"
            with open(data_path, "rb") as fp:
                checkpoint_state = pickle.load(fp)
            start_epoch = checkpoint_state["epoch"]
            net.load_state_dict(checkpoint_state["net_state_dict"])
            optimizer.load_state_dict(checkpoint_state["optimizer_state_dict"])
    else:
        start_epoch = 0

    trainset, testset = load_data(data_dir)

    test_abs = int(len(trainset) * 0.8)
    train_subset, val_subset = random_split(
        trainset, [test_abs, len(trainset) - test_abs]
    )

    trainloader = torch.utils.data.DataLoader(
        train_subset, batch_size=int(config["batch_size"]), shuffle=True, num_workers=8
    )
    valloader = torch.utils.data.DataLoader(
        val_subset, batch_size=int(config["batch_size"]), shuffle=True, num_workers=8
    )

    for epoch in range(start_epoch, 10):  # loop over the dataset multiple times
        running_loss = 0.0
        epoch_steps = 0
        for i, data in enumerate(trainloader, 0):
            # get the inputs; data is a list of [inputs, labels]
            inputs, labels = data
            inputs, labels = inputs.to(device), labels.to(device)

            # zero the parameter gradients
            optimizer.zero_grad()

            # forward + backward + optimize
            outputs = net(inputs)
            loss = criterion(outputs, labels)
            loss.backward()
            optimizer.step()

            # print statistics
            running_loss += loss.item()
            epoch_steps += 1
            if i % 2000 == 1999:  # print every 2000 mini-batches
                # Note: epoch_steps is not reset here, so after the first print
                # the value shown is not a true 2000-batch average.
                print(
                    "[%d, %5d] loss: %.3f"
                    % (epoch + 1, i + 1, running_loss / epoch_steps)
                )
                running_loss = 0.0

        # Validation loss
        val_loss = 0.0
        val_steps = 0
        total = 0
        correct = 0
        for i, data in enumerate(valloader, 0):
            with torch.no_grad():
                inputs, labels = data
                inputs, labels = inputs.to(device), labels.to(device)

                outputs = net(inputs)
                _, predicted = torch.max(outputs.data, 1)
                total += labels.size(0)
                correct += (predicted == labels).sum().item()

                loss = criterion(outputs, labels)
                val_loss += loss.cpu().numpy()
                val_steps += 1

        checkpoint_data = {
            "epoch": epoch,
            "net_state_dict": net.state_dict(),
            "optimizer_state_dict": optimizer.state_dict(),
        }
        with tempfile.TemporaryDirectory() as checkpoint_dir:
            data_path = Path(checkpoint_dir) / "data.pkl"
            with open(data_path, "wb") as fp:
                pickle.dump(checkpoint_data, fp)

            checkpoint = Checkpoint.from_directory(checkpoint_dir)
            train.report(
                {"loss": val_loss / val_steps, "accuracy": correct / total},
                checkpoint=checkpoint,
            )

    print("Finished Training")
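As you can see, most of the code is adapted directly from the original example.
Test set accuracy¶
Commonly, the performance of a machine learning model is tested on a hold-out test set with data that has not been used for training the model. We wrap this in a function as well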
def test_accuracy(net, device="cpu"):
    trainset, testset = load_data()

    testloader = torch.utils.data.DataLoader(
        testset, batch_size=4, shuffle=False, num_workers=2
    )

    correct = 0
    total = 0
    with torch.no_grad():
        for data in testloader:
            images, labels = data
            images, labels = images.to(device), labels.to(device)
            outputs = net(images)
            _, predicted = torch.max(outputs.data, 1)
            total += labels.size(0)
            correct += (predicted == labels).sum().item()

    return correct / total
The function also expects a device parameter, so we can do the test set validation on a GPU.
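The accuracy computed above is simply the fraction of samples whose highest-scoring class equals the label. A dependency-free sketch of the same calculation, with plain lists standing in for tensors and made-up scores, might look like this (the function name accuracy is an assumption for this example):

```python
def accuracy(outputs, labels):
    """Fraction of rows whose arg-max class index matches the label."""
    correct = 0
    for scores, label in zip(outputs, labels):
        predicted = max(range(len(scores)), key=scores.__getitem__)
        correct += int(predicted == label)
    return correct / len(labels)


# Four samples, three classes; the scores are invented for illustration.
outputs = [[0.1, 2.0, -1.0], [1.5, 0.2, 0.3], [0.0, 0.1, 0.9], [0.3, 0.2, 0.1]]
labels = [1, 0, 2, 1]
print(accuracy(outputs, labels))  # 0.75
```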
Configuring the search space¶
Lastly, we need to define Ray Tune's search space. Here is an example
config = {
    "l1": tune.choice([2 ** i for i in range(9)]),
    "l2": tune.choice([2 ** i for i in range(9)]),
    "lr": tune.loguniform(1e-4, 1e-1),
    "batch_size": tune.choice([2, 4, 8, 16])
}
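tune.choice() accepts a list of values that are sampled from uniformly. In this example, the l1 and l2 parameters are powers of 2 between 1 and 256, so either 1, 2, 4, 8, 16, 32, 64, 128, or 256. The lr (learning rate) is sampled log-uniformly between 0.0001 and 0.1. Lastly, the batch size is a choice between 2, 4, 8, and 16.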
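To give a feel for what drawing one configuration from this search space means, the sketch below mimics it with the standard library only. It is not Ray Tune's implementation: sample_loguniform and sample_config are hypothetical helpers, and the log-uniform draw is written as the exponential of a uniform sample between log(low) and log(high).

```python
import math
import random


def sample_loguniform(low, high, rng):
    """Draw a value whose logarithm is uniform between log(low) and log(high)."""
    return math.exp(rng.uniform(math.log(low), math.log(high)))


def sample_config(rng):
    """Draw one hyperparameter combination resembling the search space above."""
    sizes = [2 ** i for i in range(9)]  # 1, 2, 4, ..., 256
    return {
        "l1": rng.choice(sizes),
        "l2": rng.choice(sizes),
        "lr": sample_loguniform(1e-4, 1e-1, rng),
        "batch_size": rng.choice([2, 4, 8, 16]),
    }


rng = random.Random(0)
for _ in range(3):
    print(sample_config(rng))
```

Sampling the exponent uniformly gives each decade (0.0001 to 0.001, 0.001 to 0.01, 0.01 to 0.1) roughly the same share of samples, which is usually what we want for a learning rate.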
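At each trial, Ray Tune will now randomly sample a combination of parameters from these search spaces. It will then train a number of models in parallel and find the best performing one among these. We also use the ASHAScheduler, which terminates badly performing trials early.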
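The idea behind this kind of early stopping can be sketched as successive halving: trials are compared at a few checkpoints ("rungs"), and only the better fraction continues. The code below is a deliberately simplified, synchronous illustration with hypothetical helper names and made-up loss values; Ray's ASHAScheduler makes these decisions asynchronously and differs in its details.

```python
def rungs(grace_period, reduction_factor, max_t):
    """Iterations at which trials are compared: grace_period * reduction_factor**k."""
    points = []
    t = grace_period
    while t <= max_t:
        points.append(t)
        t *= reduction_factor
    return points


def keep_at_rung(losses, reduction_factor):
    """Losses that survive a rung: the best 1/reduction_factor (at least one)."""
    n_keep = max(1, len(losses) // reduction_factor)
    return sorted(losses)[:n_keep]


print(rungs(grace_period=1, reduction_factor=2, max_t=10))    # [1, 2, 4, 8]
print(keep_at_rung([2.29, 2.09, 1.41, 1.76, 2.31, 2.31], 2))  # [1.41, 1.76, 2.09]
```

With grace_period=1 and reduction_factor=2, a trial can be stopped after its very first reported iteration, which is why every trial must report its metrics once per epoch.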
We wrap the train_cifar function with functools.partial to set the constant data_dir parameter. We can also tell Ray Tune what resources should be available for each trial
gpus_per_trial = 2
# ...
result = tune.run(
    partial(train_cifar, data_dir=data_dir),
    resources_per_trial={"cpu": 8, "gpu": gpus_per_trial},
    config=config,
    num_samples=num_samples,
    scheduler=scheduler,
    checkpoint_at_end=True)
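You can specify the number of CPUs, which are then available, for example, to increase the num_workers of the PyTorch DataLoader instances. The selected number of GPUs is made visible to PyTorch in each trial. Trials do not have access to GPUs that haven't been requested for them, so you don't have to worry about two trials using the same set of resources.
Here we can also specify fractional GPUs, so something like gpus_per_trial=0.5 is completely valid. The trials will then share GPUs among each other. You just have to make sure that the models still fit in the GPU memory.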
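As a back-of-the-envelope check of how resources_per_trial limits parallelism, the following sketch estimates how many trials can run at once. It only considers logical CPU and GPU counts and ignores memory and any other scheduling detail; max_concurrent_trials is a hypothetical helper, not a Ray function.

```python
from fractions import Fraction


def max_concurrent_trials(total_cpus, total_gpus, cpus_per_trial, gpus_per_trial):
    """Upper bound on simultaneously running trials from CPU/GPU counts alone."""
    limits = []
    if cpus_per_trial > 0:
        limits.append(Fraction(total_cpus) / Fraction(str(cpus_per_trial)))
    if gpus_per_trial > 0:
        limits.append(Fraction(total_gpus) / Fraction(str(gpus_per_trial)))
    # No positive request means these resources impose no limit.
    return int(min(limits)) if limits else None


print(max_concurrent_trials(16, 1, 2, 0))    # 8: CPU-bound, no GPU requested
print(max_concurrent_trials(16, 1, 2, 0.5))  # 2: two trials share the single GPU
print(max_concurrent_trials(16, 1, 2, 2))    # 0: a 2-GPU trial never fits on 1 GPU
```

The first case matches the run logged further down, where the machine reports 16 logical CPUs, each trial requests 2 CPUs and no GPU, and 8 trials are running at once.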
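After training the models, we will find the best performing one and load the trained network from the checkpoint file. We then obtain the test set accuracy and report everything by printing.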
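Conceptually, picking the best trial is a minimum (or maximum) over the last reported metric of each trial. The sketch below shows this with a plain dictionary; the numbers are an intermediate snapshot copied from the status table in the output further down, not final results, and best_trial_name is a hypothetical helper rather than Ray's get_best_trial.

```python
def best_trial_name(last_results, metric="loss", mode="min"):
    """Return the trial whose last reported metric is smallest (or largest)."""
    pick = min if mode == "min" else max
    return pick(last_results, key=lambda name: last_results[name][metric])


# Snapshot of last reported results per trial (intermediate values).
last_results = {
    "train_cifar_f1b1b_00005": {"loss": 1.76039, "accuracy": 0.3303},
    "train_cifar_f1b1b_00006": {"loss": 2.08109, "accuracy": 0.2044},
    "train_cifar_f1b1b_00007": {"loss": 1.32872, "accuracy": 0.5344},
    "train_cifar_f1b1b_00001": {"loss": 2.30901, "accuracy": 0.097},
}

print(best_trial_name(last_results, "loss", "min"))      # train_cifar_f1b1b_00007
print(best_trial_name(last_results, "accuracy", "max"))  # train_cifar_f1b1b_00007
```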
The full main function looks like this
def main(num_samples=10, max_num_epochs=10, gpus_per_trial=2):
    data_dir = os.path.abspath("./data")
    load_data(data_dir)
    config = {
        "l1": tune.choice([2**i for i in range(9)]),
        "l2": tune.choice([2**i for i in range(9)]),
        "lr": tune.loguniform(1e-4, 1e-1),
        "batch_size": tune.choice([2, 4, 8, 16]),
    }
    scheduler = ASHAScheduler(
        metric="loss",
        mode="min",
        max_t=max_num_epochs,
        grace_period=1,
        reduction_factor=2,
    )
    result = tune.run(
        partial(train_cifar, data_dir=data_dir),
        resources_per_trial={"cpu": 2, "gpu": gpus_per_trial},
        config=config,
        num_samples=num_samples,
        scheduler=scheduler,
    )

    best_trial = result.get_best_trial("loss", "min", "last")
    print(f"Best trial config: {best_trial.config}")
    print(f"Best trial final validation loss: {best_trial.last_result['loss']}")
    print(f"Best trial final validation accuracy: {best_trial.last_result['accuracy']}")

    best_trained_model = Net(best_trial.config["l1"], best_trial.config["l2"])
    device = "cpu"
    if torch.cuda.is_available():
        device = "cuda:0"
        if gpus_per_trial > 1:
            best_trained_model = nn.DataParallel(best_trained_model)
    best_trained_model.to(device)

    best_checkpoint = result.get_best_checkpoint(trial=best_trial, metric="accuracy", mode="max")
    with best_checkpoint.as_directory() as checkpoint_dir:
        data_path = Path(checkpoint_dir) / "data.pkl"
        with open(data_path, "rb") as fp:
            best_checkpoint_data = pickle.load(fp)

        best_trained_model.load_state_dict(best_checkpoint_data["net_state_dict"])
        test_acc = test_accuracy(best_trained_model, device)
        print("Best trial test set accuracy: {}".format(test_acc))


if __name__ == "__main__":
    # You can change the number of GPUs per trial here:
    main(num_samples=10, max_num_epochs=10, gpus_per_trial=0)
Downloading https://www.cs.toronto.edu/~kriz/cifar-10-python.tar.gz to /var/lib/workspace/beginner_source/data/cifar-10-python.tar.gz
100% 170M/170M [00:01<00:00, 91.2MB/s]
Extracting /var/lib/workspace/beginner_source/data/cifar-10-python.tar.gz to /var/lib/workspace/beginner_source/data
Files already downloaded and verified
2024-10-17 21:58:28,302 WARNING services.py:1889 -- WARNING: The object store is using /tmp instead of /dev/shm because /dev/shm has only 2147479552 bytes available. This will harm performance! You may be able to free up space by deleting files in /dev/shm. If you are inside a Docker container, you can increase /dev/shm size by passing '--shm-size=10.24gb' to 'docker run' (or add it to the run_options list in a Ray cluster config). Make sure to set this to more than 30% of available RAM.
2024-10-17 21:58:28,552 INFO worker.py:1642 -- Started a local Ray instance.
2024-10-17 21:58:29,750 INFO tune.py:228 -- Initializing Ray automatically. For cluster usage or custom Ray initialization, call `ray.init(...)` before `tune.run(...)`.
2024-10-17 21:58:29,752 INFO tune.py:654 -- [output] This will use the new output engine with verbosity 2. To disable the new output and use the legacy output engine, set the environment variable RAY_AIR_NEW_OUTPUT=0. For more information, please see https://github.com/ray-project/ray/issues/36949
+--------------------------------------------------------------------+
| Configuration for experiment train_cifar_2024-10-17_21-58-29 |
+--------------------------------------------------------------------+
| Search algorithm BasicVariantGenerator |
| Scheduler AsyncHyperBandScheduler |
| Number of trials 10 |
+--------------------------------------------------------------------+
View detailed results here: /var/lib/ci-user/ray_results/train_cifar_2024-10-17_21-58-29
To visualize your results with TensorBoard, run: `tensorboard --logdir /var/lib/ci-user/ray_results/train_cifar_2024-10-17_21-58-29`
Trial status: 10 PENDING
Current time: 2024-10-17 21:58:30. Total running time: 0s
Logical resource usage: 0/16 CPUs, 0/1 GPUs (0.0/1.0 accelerator_type:M60)
+-------------------------------------------------------------------------------+
| Trial name status l1 l2 lr batch_size |
+-------------------------------------------------------------------------------+
| train_cifar_f1b1b_00000 PENDING 16 1 0.00213327 2 |
| train_cifar_f1b1b_00001 PENDING 1 2 0.013416 4 |
| train_cifar_f1b1b_00002 PENDING 256 64 0.0113784 2 |
| train_cifar_f1b1b_00003 PENDING 64 256 0.0274071 8 |
| train_cifar_f1b1b_00004 PENDING 16 2 0.056666 4 |
| train_cifar_f1b1b_00005 PENDING 8 64 0.000353097 4 |
| train_cifar_f1b1b_00006 PENDING 16 4 0.000147684 8 |
| train_cifar_f1b1b_00007 PENDING 256 256 0.00477469 8 |
| train_cifar_f1b1b_00008 PENDING 128 256 0.0306227 8 |
| train_cifar_f1b1b_00009 PENDING 2 16 0.0286986 2 |
+-------------------------------------------------------------------------------+
Trial train_cifar_f1b1b_00002 started with configuration:
+--------------------------------------------------+
| Trial train_cifar_f1b1b_00002 config |
+--------------------------------------------------+
| batch_size 2 |
| l1 256 |
| l2 64 |
| lr 0.01138 |
+--------------------------------------------------+
Trial train_cifar_f1b1b_00006 started with configuration:
+--------------------------------------------------+
| Trial train_cifar_f1b1b_00006 config |
+--------------------------------------------------+
| batch_size 8 |
| l1 16 |
| l2 4 |
| lr 0.00015 |
+--------------------------------------------------+
Trial train_cifar_f1b1b_00003 started with configuration:
+--------------------------------------------------+
| Trial train_cifar_f1b1b_00003 config |
+--------------------------------------------------+
| batch_size 8 |
| l1 64 |
| l2 256 |
| lr 0.02741 |
+--------------------------------------------------+
Trial train_cifar_f1b1b_00004 started with configuration:
+--------------------------------------------------+
| Trial train_cifar_f1b1b_00004 config |
+--------------------------------------------------+
| batch_size 4 |
| l1 16 |
| l2 2 |
| lr 0.05667 |
+--------------------------------------------------+
Trial train_cifar_f1b1b_00000 started with configuration:
+--------------------------------------------------+
| Trial train_cifar_f1b1b_00000 config |
+--------------------------------------------------+
| batch_size 2 |
| l1 16 |
| l2 1 |
| lr 0.00213 |
+--------------------------------------------------+
Trial train_cifar_f1b1b_00001 started with configuration:
+--------------------------------------------------+
| Trial train_cifar_f1b1b_00001 config |
+--------------------------------------------------+
| batch_size 4 |
| l1 1 |
| l2 2 |
| lr 0.01342 |
+--------------------------------------------------+
(func pid=5895) Files already downloaded and verified
Trial train_cifar_f1b1b_00005 started with configuration:
+--------------------------------------------------+
| Trial train_cifar_f1b1b_00005 config |
+--------------------------------------------------+
| batch_size 4 |
| l1 8 |
| l2 64 |
| lr 0.00035 |
+--------------------------------------------------+
Trial train_cifar_f1b1b_00007 started with configuration:
+--------------------------------------------------+
| Trial train_cifar_f1b1b_00007 config |
+--------------------------------------------------+
| batch_size 8 |
| l1 256 |
| l2 256 |
| lr 0.00477 |
+--------------------------------------------------+
(func pid=5889) [1, 2000] loss: 2.319
(func pid=5904) Files already downloaded and verified [repeated 15x across cluster] (Ray deduplicates logs by default. Set RAY_DEDUP_LOGS=0 to disable log deduplication, or see https://docs.ray.io/en/master/ray-observability/ray-logging.html#log-deduplication for more options.)
Trial status: 8 RUNNING | 2 PENDING
Current time: 2024-10-17 21:59:00. Total running time: 30s
Logical resource usage: 16.0/16 CPUs, 0/1 GPUs (0.0/1.0 accelerator_type:M60)
+-------------------------------------------------------------------------------+
| Trial name status l1 l2 lr batch_size |
+-------------------------------------------------------------------------------+
| train_cifar_f1b1b_00000 RUNNING 16 1 0.00213327 2 |
| train_cifar_f1b1b_00001 RUNNING 1 2 0.013416 4 |
| train_cifar_f1b1b_00002 RUNNING 256 64 0.0113784 2 |
| train_cifar_f1b1b_00003 RUNNING 64 256 0.0274071 8 |
| train_cifar_f1b1b_00004 RUNNING 16 2 0.056666 4 |
| train_cifar_f1b1b_00005 RUNNING 8 64 0.000353097 4 |
| train_cifar_f1b1b_00006 RUNNING 16 4 0.000147684 8 |
| train_cifar_f1b1b_00007 RUNNING 256 256 0.00477469 8 |
| train_cifar_f1b1b_00008 PENDING 128 256 0.0306227 8 |
| train_cifar_f1b1b_00009 PENDING 2 16 0.0286986 2 |
+-------------------------------------------------------------------------------+
(func pid=5889) [1, 4000] loss: 1.153 [repeated 8x across cluster]
(func pid=5889) [1, 6000] loss: 0.768 [repeated 8x across cluster]
Trial train_cifar_f1b1b_00006 finished iteration 1 at 2024-10-17 21:59:29. Total running time: 59s
+------------------------------------------------------------+
| Trial train_cifar_f1b1b_00006 result |
+------------------------------------------------------------+
| checkpoint_dir_name checkpoint_000000 |
| time_this_iter_s 53.84218 |
| time_total_s 53.84218 |
| training_iteration 1 |
| accuracy 0.0991 |
| loss 2.29352 |
+------------------------------------------------------------+
Trial train_cifar_f1b1b_00006 saved a checkpoint for iteration 1 at: (local)/var/lib/ci-user/ray_results/train_cifar_2024-10-17_21-58-29/train_cifar_f1b1b_00006_6_batch_size=8,l1=16,l2=4,lr=0.0001_2024-10-17_21-58-29/checkpoint_000000
(func pid=5899) Checkpoint successfully created at: Checkpoint(filesystem=local, path=/var/lib/ci-user/ray_results/train_cifar_2024-10-17_21-58-29/train_cifar_f1b1b_00006_6_batch_size=8,l1=16,l2=4,lr=0.0001_2024-10-17_21-58-29/checkpoint_000000)
Trial status: 8 RUNNING | 2 PENDING
Current time: 2024-10-17 21:59:30. Total running time: 1min 0s
Logical resource usage: 16.0/16 CPUs, 0/1 GPUs (0.0/1.0 accelerator_type:M60)
+----------------------------------------------------------------------------------------------------------------------------------+
| Trial name status l1 l2 lr batch_size iter total time (s) loss accuracy |
+----------------------------------------------------------------------------------------------------------------------------------+
| train_cifar_f1b1b_00000 RUNNING 16 1 0.00213327 2 |
| train_cifar_f1b1b_00001 RUNNING 1 2 0.013416 4 |
| train_cifar_f1b1b_00002 RUNNING 256 64 0.0113784 2 |
| train_cifar_f1b1b_00003 RUNNING 64 256 0.0274071 8 |
| train_cifar_f1b1b_00004 RUNNING 16 2 0.056666 4 |
| train_cifar_f1b1b_00005 RUNNING 8 64 0.000353097 4 |
| train_cifar_f1b1b_00006 RUNNING 16 4 0.000147684 8 1 53.8422 2.29352 0.0991 |
| train_cifar_f1b1b_00007 RUNNING 256 256 0.00477469 8 |
| train_cifar_f1b1b_00008 PENDING 128 256 0.0306227 8 |
| train_cifar_f1b1b_00009 PENDING 2 16 0.0286986 2 |
+----------------------------------------------------------------------------------------------------------------------------------+
Trial train_cifar_f1b1b_00003 finished iteration 1 at 2024-10-17 21:59:31. Total running time: 1min 1s
+------------------------------------------------------------+
| Trial train_cifar_f1b1b_00003 result |
+------------------------------------------------------------+
| checkpoint_dir_name checkpoint_000000 |
| time_this_iter_s 55.38453 |
| time_total_s 55.38453 |
| training_iteration 1 |
| accuracy 0.1974 |
| loss 2.088 |
+------------------------------------------------------------+
Trial train_cifar_f1b1b_00003 saved a checkpoint for iteration 1 at: (local)/var/lib/ci-user/ray_results/train_cifar_2024-10-17_21-58-29/train_cifar_f1b1b_00003_3_batch_size=8,l1=64,l2=256,lr=0.0274_2024-10-17_21-58-29/checkpoint_000000
Trial train_cifar_f1b1b_00007 finished iteration 1 at 2024-10-17 21:59:31. Total running time: 1min 1s
+------------------------------------------------------------+
| Trial train_cifar_f1b1b_00007 result |
+------------------------------------------------------------+
| checkpoint_dir_name checkpoint_000000 |
| time_this_iter_s 55.54326 |
| time_total_s 55.54326 |
| training_iteration 1 |
| accuracy 0.4803 |
| loss 1.41309 |
+------------------------------------------------------------+
Trial train_cifar_f1b1b_00007 saved a checkpoint for iteration 1 at: (local)/var/lib/ci-user/ray_results/train_cifar_2024-10-17_21-58-29/train_cifar_f1b1b_00007_7_batch_size=8,l1=256,l2=256,lr=0.0048_2024-10-17_21-58-29/checkpoint_000000
(func pid=5889) [1, 8000] loss: 0.576 [repeated 5x across cluster]
(func pid=5895) [1, 8000] loss: 0.572 [repeated 4x across cluster]
(func pid=5889) [1, 10000] loss: 0.461 [repeated 4x across cluster]
Trial status: 8 RUNNING | 2 PENDING
Current time: 2024-10-17 22:00:00. Total running time: 1min 30s
Logical resource usage: 16.0/16 CPUs, 0/1 GPUs (0.0/1.0 accelerator_type:M60)
+----------------------------------------------------------------------------------------------------------------------------------+
| Trial name status l1 l2 lr batch_size iter total time (s) loss accuracy |
+----------------------------------------------------------------------------------------------------------------------------------+
| train_cifar_f1b1b_00000 RUNNING 16 1 0.00213327 2 |
| train_cifar_f1b1b_00001 RUNNING 1 2 0.013416 4 |
| train_cifar_f1b1b_00002 RUNNING 256 64 0.0113784 2 |
| train_cifar_f1b1b_00003 RUNNING 64 256 0.0274071 8 1 55.3845 2.088 0.1974 |
| train_cifar_f1b1b_00004 RUNNING 16 2 0.056666 4 |
| train_cifar_f1b1b_00005 RUNNING 8 64 0.000353097 4 |
| train_cifar_f1b1b_00006 RUNNING 16 4 0.000147684 8 1 53.8422 2.29352 0.0991 |
| train_cifar_f1b1b_00007 RUNNING 256 256 0.00477469 8 1 55.5433 1.41309 0.4803 |
| train_cifar_f1b1b_00008 PENDING 128 256 0.0306227 8 |
| train_cifar_f1b1b_00009 PENDING 2 16 0.0286986 2 |
+----------------------------------------------------------------------------------------------------------------------------------+
(func pid=5895) [1, 10000] loss: 0.455 [repeated 4x across cluster]
(func pid=5889) [1, 12000] loss: 0.384 [repeated 4x across cluster]
Trial train_cifar_f1b1b_00005 finished iteration 1 at 2024-10-17 22:00:12. Total running time: 1min 42s
+------------------------------------------------------------+
| Trial train_cifar_f1b1b_00005 result |
+------------------------------------------------------------+
| checkpoint_dir_name checkpoint_000000 |
| time_this_iter_s 96.45002 |
| time_total_s 96.45002 |
| training_iteration 1 |
| accuracy 0.3303 |
| loss 1.76039 |
+------------------------------------------------------------+
(func pid=5898) Checkpoint successfully created at: Checkpoint(filesystem=local, path=/var/lib/ci-user/ray_results/train_cifar_2024-10-17_21-58-29/train_cifar_f1b1b_00005_5_batch_size=4,l1=8,l2=64,lr=0.0004_2024-10-17_21-58-29/checkpoint_000000) [repeated 3x across cluster]
Trial train_cifar_f1b1b_00005 saved a checkpoint for iteration 1 at: (local)/var/lib/ci-user/ray_results/train_cifar_2024-10-17_21-58-29/train_cifar_f1b1b_00005_5_batch_size=4,l1=8,l2=64,lr=0.0004_2024-10-17_21-58-29/checkpoint_000000
Trial train_cifar_f1b1b_00001 finished iteration 1 at 2024-10-17 22:00:12. Total running time: 1min 42s
+------------------------------------------------------------+
| Trial train_cifar_f1b1b_00001 result |
+------------------------------------------------------------+
| checkpoint_dir_name checkpoint_000000 |
| time_this_iter_s 96.71572 |
| time_total_s 96.71572 |
| training_iteration 1 |
| accuracy 0.097 |
| loss 2.30901 |
+------------------------------------------------------------+
Trial train_cifar_f1b1b_00001 saved a checkpoint for iteration 1 at: (local)/var/lib/ci-user/ray_results/train_cifar_2024-10-17_21-58-29/train_cifar_f1b1b_00001_1_batch_size=4,l1=1,l2=2,lr=0.0134_2024-10-17_21-58-29/checkpoint_000000
Trial train_cifar_f1b1b_00001 completed after 1 iterations at 2024-10-17 22:00:12. Total running time: 1min 42s
Trial train_cifar_f1b1b_00008 started with configuration:
+--------------------------------------------------+
| Trial train_cifar_f1b1b_00008 config |
+--------------------------------------------------+
| batch_size 8 |
| l1 128 |
| l2 256 |
| lr 0.03062 |
+--------------------------------------------------+
(func pid=5890) Files already downloaded and verified
(func pid=5890) Files already downloaded and verified
Trial train_cifar_f1b1b_00004 finished iteration 1 at 2024-10-17 22:00:14. Total running time: 1min 44s
+------------------------------------------------------------+
| Trial train_cifar_f1b1b_00004 result |
+------------------------------------------------------------+
| checkpoint_dir_name checkpoint_000000 |
| time_this_iter_s 98.8246 |
| time_total_s 98.8246 |
| training_iteration 1 |
| accuracy 0.0961 |
| loss 2.31015 |
+------------------------------------------------------------+
Trial train_cifar_f1b1b_00004 saved a checkpoint for iteration 1 at: (local)/var/lib/ci-user/ray_results/train_cifar_2024-10-17_21-58-29/train_cifar_f1b1b_00004_4_batch_size=4,l1=16,l2=2,lr=0.0567_2024-10-17_21-58-29/checkpoint_000000
Trial train_cifar_f1b1b_00004 completed after 1 iterations at 2024-10-17 22:00:14. Total running time: 1min 44s
Trial train_cifar_f1b1b_00009 started with configuration:
+-------------------------------------------------+
| Trial train_cifar_f1b1b_00009 config |
+-------------------------------------------------+
| batch_size 2 |
| l1 2 |
| l2 16 |
| lr 0.0287 |
+-------------------------------------------------+
Trial train_cifar_f1b1b_00006 finished iteration 2 at 2024-10-17 22:00:22. Total running time: 1min 52s
+------------------------------------------------------------+
| Trial train_cifar_f1b1b_00006 result |
+------------------------------------------------------------+
| checkpoint_dir_name checkpoint_000001 |
| time_this_iter_s 52.91409 |
| time_total_s 106.75627 |
| training_iteration 2 |
| accuracy 0.2044 |
| loss 2.08109 |
+------------------------------------------------------------+
(func pid=5899) Checkpoint successfully created at: Checkpoint(filesystem=local, path=/var/lib/ci-user/ray_results/train_cifar_2024-10-17_21-58-29/train_cifar_f1b1b_00006_6_batch_size=8,l1=16,l2=4,lr=0.0001_2024-10-17_21-58-29/checkpoint_000001) [repeated 3x across cluster]
Trial train_cifar_f1b1b_00006 saved a checkpoint for iteration 2 at: (local)/var/lib/ci-user/ray_results/train_cifar_2024-10-17_21-58-29/train_cifar_f1b1b_00006_6_batch_size=8,l1=16,l2=4,lr=0.0001_2024-10-17_21-58-29/checkpoint_000001
(func pid=5895) [1, 12000] loss: 0.369
(func pid=5897) Files already downloaded and verified [repeated 2x across cluster]
Trial train_cifar_f1b1b_00007 finished iteration 2 at 2024-10-17 22:00:24. Total running time: 1min 54s
+------------------------------------------------------------+
| Trial train_cifar_f1b1b_00007 result |
+------------------------------------------------------------+
| checkpoint_dir_name checkpoint_000001 |
| time_this_iter_s 52.83889 |
| time_total_s 108.38216 |
| training_iteration 2 |
| accuracy 0.5344 |
| loss 1.32872 |
+------------------------------------------------------------+
Trial train_cifar_f1b1b_00007 saved a checkpoint for iteration 2 at: (local)/var/lib/ci-user/ray_results/train_cifar_2024-10-17_21-58-29/train_cifar_f1b1b_00007_7_batch_size=8,l1=256,l2=256,lr=0.0048_2024-10-17_21-58-29/checkpoint_000001
Trial train_cifar_f1b1b_00003 finished iteration 2 at 2024-10-17 22:00:27. Total running time: 1min 57s
+------------------------------------------------------------+
| Trial train_cifar_f1b1b_00003 result |
+------------------------------------------------------------+
| checkpoint_dir_name checkpoint_000001 |
| time_this_iter_s 56.17008 |
| time_total_s 111.55461 |
| training_iteration 2 |
| accuracy 0.155 |
| loss 2.2556 |
+------------------------------------------------------------+
Trial train_cifar_f1b1b_00003 saved a checkpoint for iteration 2 at: (local)/var/lib/ci-user/ray_results/train_cifar_2024-10-17_21-58-29/train_cifar_f1b1b_00003_3_batch_size=8,l1=64,l2=256,lr=0.0274_2024-10-17_21-58-29/checkpoint_000001
Trial train_cifar_f1b1b_00003 completed after 2 iterations at 2024-10-17 22:00:27. Total running time: 1min 57s
(func pid=5889) [1, 14000] loss: 0.329
Trial status: 7 RUNNING | 3 TERMINATED
Current time: 2024-10-17 22:00:30. Total running time: 2min 0s
Logical resource usage: 14.0/16 CPUs, 0/1 GPUs (0.0/1.0 accelerator_type:M60)
+------------------------------------------------------------------------------------------------------------------------------------+
| Trial name status l1 l2 lr batch_size iter total time (s) loss accuracy |
+------------------------------------------------------------------------------------------------------------------------------------+
| train_cifar_f1b1b_00000 RUNNING 16 1 0.00213327 2 |
| train_cifar_f1b1b_00002 RUNNING 256 64 0.0113784 2 |
| train_cifar_f1b1b_00005 RUNNING 8 64 0.000353097 4 1 96.45 1.76039 0.3303 |
| train_cifar_f1b1b_00006 RUNNING 16 4 0.000147684 8 2 106.756 2.08109 0.2044 |
| train_cifar_f1b1b_00007 RUNNING 256 256 0.00477469 8 2 108.382 1.32872 0.5344 |
| train_cifar_f1b1b_00008 RUNNING 128 256 0.0306227 8 |
| train_cifar_f1b1b_00009 RUNNING 2 16 0.0286986 2 |
| train_cifar_f1b1b_00001 TERMINATED 1 2 0.013416 4 1 96.7157 2.30901 0.097 |
| train_cifar_f1b1b_00003 TERMINATED 64 256 0.0274071 8 2 111.555 2.2556 0.155 |
| train_cifar_f1b1b_00004 TERMINATED 16 2 0.056666 4 1 98.8246 2.31015 0.0961 |
+------------------------------------------------------------------------------------------------------------------------------------+
(func pid=5899) [3, 2000] loss: 2.023 [repeated 4x across cluster]
(func pid=5897) [1, 4000] loss: 1.168 [repeated 5x across cluster]
(func pid=5899) [3, 4000] loss: 0.961 [repeated 2x across cluster]
Trial status: 7 RUNNING | 3 TERMINATED
Current time: 2024-10-17 22:01:00. Total running time: 2min 30s
Logical resource usage: 14.0/16 CPUs, 0/1 GPUs (0.0/1.0 accelerator_type:M60)
+------------------------------------------------------------------------------------------------------------------------------------+
| Trial name status l1 l2 lr batch_size iter total time (s) loss accuracy |
+------------------------------------------------------------------------------------------------------------------------------------+
| train_cifar_f1b1b_00000 RUNNING 16 1 0.00213327 2 |
| train_cifar_f1b1b_00002 RUNNING 256 64 0.0113784 2 |
| train_cifar_f1b1b_00005 RUNNING 8 64 0.000353097 4 1 96.45 1.76039 0.3303 |
| train_cifar_f1b1b_00006 RUNNING 16 4 0.000147684 8 2 106.756 2.08109 0.2044 |
| train_cifar_f1b1b_00007 RUNNING 256 256 0.00477469 8 2 108.382 1.32872 0.5344 |
| train_cifar_f1b1b_00008 RUNNING 128 256 0.0306227 8 |
| train_cifar_f1b1b_00009 RUNNING 2 16 0.0286986 2 |
| train_cifar_f1b1b_00001 TERMINATED 1 2 0.013416 4 1 96.7157 2.30901 0.097 |
| train_cifar_f1b1b_00003 TERMINATED 64 256 0.0274071 8 2 111.555 2.2556 0.155 |
| train_cifar_f1b1b_00004 TERMINATED 16 2 0.056666 4 1 98.8246 2.31015 0.0961 |
+------------------------------------------------------------------------------------------------------------------------------------+
Trial train_cifar_f1b1b_00008 finished iteration 1 at 2024-10-17 22:01:03. Total running time: 2min 33s
+------------------------------------------------------------+
| Trial train_cifar_f1b1b_00008 result |
+------------------------------------------------------------+
| checkpoint_dir_name checkpoint_000000 |
| time_this_iter_s 50.87284 |
| time_total_s 50.87284 |
| training_iteration 1 |
| accuracy 0.2337 |
| loss 2.05218 |
+------------------------------------------------------------+
Trial train_cifar_f1b1b_00008 saved a checkpoint for iteration 1 at: (local)/var/lib/ci-user/ray_results/train_cifar_2024-10-17_21-58-29/train_cifar_f1b1b_00008_8_batch_size=8,l1=128,l2=256,lr=0.0306_2024-10-17_21-58-29/checkpoint_000000
(func pid=5890) Checkpoint successfully created at: Checkpoint(filesystem=local, path=/var/lib/ci-user/ray_results/train_cifar_2024-10-17_21-58-29/train_cifar_f1b1b_00008_8_batch_size=8,l1=128,l2=256,lr=0.0306_2024-10-17_21-58-29/checkpoint_000000) [repeated 3x across cluster]
Trial train_cifar_f1b1b_00006 finished iteration 3 at 2024-10-17 22:01:06. Total running time: 2min 37s
+------------------------------------------------------------+
| Trial train_cifar_f1b1b_00006 result |
+------------------------------------------------------------+
| checkpoint_dir_name checkpoint_000002 |
| time_this_iter_s 44.66284 |
| time_total_s 151.41911 |
| training_iteration 3 |
| accuracy 0.3018 |
| loss 1.85412 |
+------------------------------------------------------------+
Trial train_cifar_f1b1b_00006 saved a checkpoint for iteration 3 at: (local)/var/lib/ci-user/ray_results/train_cifar_2024-10-17_21-58-29/train_cifar_f1b1b_00006_6_batch_size=8,l1=16,l2=4,lr=0.0001_2024-10-17_21-58-29/checkpoint_000002
(func pid=5889) [1, 20000] loss: 0.230 [repeated 6x across cluster]
Trial train_cifar_f1b1b_00007 finished iteration 3 at 2024-10-17 22:01:10. Total running time: 2min 41s
+------------------------------------------------------------+
| Trial train_cifar_f1b1b_00007 result |
+------------------------------------------------------------+
| checkpoint_dir_name checkpoint_000002 |
| time_this_iter_s 46.44777 |
| time_total_s 154.82992 |
| training_iteration 3 |
| accuracy 0.5743 |
| loss 1.22 |
+------------------------------------------------------------+
Trial train_cifar_f1b1b_00007 saved a checkpoint for iteration 3 at: (local)/var/lib/ci-user/ray_results/train_cifar_2024-10-17_21-58-29/train_cifar_f1b1b_00007_7_batch_size=8,l1=256,l2=256,lr=0.0048_2024-10-17_21-58-29/checkpoint_000002
(func pid=5904) Checkpoint successfully created at: Checkpoint(filesystem=local, path=/var/lib/ci-user/ray_results/train_cifar_2024-10-17_21-58-29/train_cifar_f1b1b_00007_7_batch_size=8,l1=256,l2=256,lr=0.0048_2024-10-17_21-58-29/checkpoint_000002) [repeated 2x across cluster]
(func pid=5890) [2, 2000] loss: 2.095 [repeated 4x across cluster]
(func pid=5895) [1, 20000] loss: 0.232 [repeated 4x across cluster]
Trial train_cifar_f1b1b_00000 finished iteration 1 at 2024-10-17 22:01:28. Total running time: 2min 59s
+------------------------------------------------------------+
| Trial train_cifar_f1b1b_00000 result |
+------------------------------------------------------------+
| checkpoint_dir_name checkpoint_000000 |
| time_this_iter_s 173.29934 |
| time_total_s 173.29934 |
| training_iteration 1 |
| accuracy 0.0993 |
| loss 2.30543 |
+------------------------------------------------------------+
Trial train_cifar_f1b1b_00000 saved a checkpoint for iteration 1 at: (local)/var/lib/ci-user/ray_results/train_cifar_2024-10-17_21-58-29/train_cifar_f1b1b_00000_0_batch_size=2,l1=16,l2=1,lr=0.0021_2024-10-17_21-58-29/checkpoint_000000
Trial train_cifar_f1b1b_00000 completed after 1 iterations at 2024-10-17 22:01:28. Total running time: 2min 59s
(func pid=5889) Checkpoint successfully created at: Checkpoint(filesystem=local, path=/var/lib/ci-user/ray_results/train_cifar_2024-10-17_21-58-29/train_cifar_f1b1b_00000_0_batch_size=2,l1=16,l2=1,lr=0.0021_2024-10-17_21-58-29/checkpoint_000000)
Trial status: 4 TERMINATED | 6 RUNNING
Current time: 2024-10-17 22:01:30. Total running time: 3min 0s
Logical resource usage: 12.0/16 CPUs, 0/1 GPUs (0.0/1.0 accelerator_type:M60)
+------------------------------------------------------------------------------------------------------------------------------------+
| Trial name status l1 l2 lr batch_size iter total time (s) loss accuracy |
+------------------------------------------------------------------------------------------------------------------------------------+
| train_cifar_f1b1b_00002 RUNNING 256 64 0.0113784 2 |
| train_cifar_f1b1b_00005 RUNNING 8 64 0.000353097 4 1 96.45 1.76039 0.3303 |
| train_cifar_f1b1b_00006 RUNNING 16 4 0.000147684 8 3 151.419 1.85412 0.3018 |
| train_cifar_f1b1b_00007 RUNNING 256 256 0.00477469 8 3 154.83 1.22 0.5743 |
| train_cifar_f1b1b_00008 RUNNING 128 256 0.0306227 8 1 50.8728 2.05218 0.2337 |
| train_cifar_f1b1b_00009 RUNNING 2 16 0.0286986 2 |
| train_cifar_f1b1b_00000 TERMINATED 16 1 0.00213327 2 1 173.299 2.30543 0.0993 |
| train_cifar_f1b1b_00001 TERMINATED 1 2 0.013416 4 1 96.7157 2.30901 0.097 |
| train_cifar_f1b1b_00003 TERMINATED 64 256 0.0274071 8 2 111.555 2.2556 0.155 |
| train_cifar_f1b1b_00004 TERMINATED 16 2 0.056666 4 1 98.8246 2.31015 0.0961 |
+------------------------------------------------------------------------------------------------------------------------------------+
Trial train_cifar_f1b1b_00005 finished iteration 2 at 2024-10-17 22:01:34. Total running time: 3min 4s
+------------------------------------------------------------+
| Trial train_cifar_f1b1b_00005 result |
+------------------------------------------------------------+
| checkpoint_dir_name checkpoint_000001 |
| time_this_iter_s 81.94404 |
| time_total_s 178.39407 |
| training_iteration 2 |
| accuracy 0.4135 |
| loss 1.57162 |
+------------------------------------------------------------+
Trial train_cifar_f1b1b_00005 saved a checkpoint for iteration 2 at: (local)/var/lib/ci-user/ray_results/train_cifar_2024-10-17_21-58-29/train_cifar_f1b1b_00005_5_batch_size=4,l1=8,l2=64,lr=0.0004_2024-10-17_21-58-29/checkpoint_000001
(func pid=5898) Checkpoint successfully created at: Checkpoint(filesystem=local, path=/var/lib/ci-user/ray_results/train_cifar_2024-10-17_21-58-29/train_cifar_f1b1b_00005_5_batch_size=4,l1=8,l2=64,lr=0.0004_2024-10-17_21-58-29/checkpoint_000001)
(func pid=5897) [1, 12000] loss: 0.389 [repeated 2x across cluster]
(func pid=5904) [4, 4000] loss: 0.564 [repeated 3x across cluster]
Trial train_cifar_f1b1b_00002 finished iteration 1 at 2024-10-17 22:01:45. Total running time: 3min 16s
+------------------------------------------------------------+
| Trial train_cifar_f1b1b_00002 result |
+------------------------------------------------------------+
| checkpoint_dir_name checkpoint_000000 |
| time_this_iter_s 190.34233 |
| time_total_s 190.34233 |
| training_iteration 1 |
| accuracy 0.0957 |
| loss 2.32163 |
+------------------------------------------------------------+
Trial train_cifar_f1b1b_00002 saved a checkpoint for iteration 1 at: (local)/var/lib/ci-user/ray_results/train_cifar_2024-10-17_21-58-29/train_cifar_f1b1b_00002_2_batch_size=2,l1=256,l2=64,lr=0.0114_2024-10-17_21-58-29/checkpoint_000000
Trial train_cifar_f1b1b_00002 completed after 1 iterations at 2024-10-17 22:01:45. Total running time: 3min 16s
(func pid=5895) Checkpoint successfully created at: Checkpoint(filesystem=local, path=/var/lib/ci-user/ray_results/train_cifar_2024-10-17_21-58-29/train_cifar_f1b1b_00002_2_batch_size=2,l1=256,l2=64,lr=0.0114_2024-10-17_21-58-29/checkpoint_000000)
(func pid=5898) [3, 2000] loss: 1.550
(func pid=5897) [1, 14000] loss: 0.334
Trial train_cifar_f1b1b_00006 finished iteration 4 at 2024-10-17 22:01:48. Total running time: 3min 18s
+------------------------------------------------------------+
| Trial train_cifar_f1b1b_00006 result |
+------------------------------------------------------------+
| checkpoint_dir_name checkpoint_000003 |
| time_this_iter_s 41.1182 |
| time_total_s 192.5373 |
| training_iteration 4 |
| accuracy 0.3254 |
| loss 1.763 |
+------------------------------------------------------------+
Trial train_cifar_f1b1b_00006 saved a checkpoint for iteration 4 at: (local)/var/lib/ci-user/ray_results/train_cifar_2024-10-17_21-58-29/train_cifar_f1b1b_00006_6_batch_size=8,l1=16,l2=4,lr=0.0001_2024-10-17_21-58-29/checkpoint_000003
(func pid=5899) Checkpoint successfully created at: Checkpoint(filesystem=local, path=/var/lib/ci-user/ray_results/train_cifar_2024-10-17_21-58-29/train_cifar_f1b1b_00006_6_batch_size=8,l1=16,l2=4,lr=0.0001_2024-10-17_21-58-29/checkpoint_000003)
Trial train_cifar_f1b1b_00008 finished iteration 2 at 2024-10-17 22:01:49. Total running time: 3min 20s
+------------------------------------------------------------+
| Trial train_cifar_f1b1b_00008 result |
+------------------------------------------------------------+
| checkpoint_dir_name checkpoint_000001 |
| time_this_iter_s 46.32766 |
| time_total_s 97.2005 |
| training_iteration 2 |
| accuracy 0.206 |
| loss 2.06562 |
+------------------------------------------------------------+
Trial train_cifar_f1b1b_00008 saved a checkpoint for iteration 2 at: (local)/var/lib/ci-user/ray_results/train_cifar_2024-10-17_21-58-29/train_cifar_f1b1b_00008_8_batch_size=8,l1=128,l2=256,lr=0.0306_2024-10-17_21-58-29/checkpoint_000001
Trial train_cifar_f1b1b_00008 completed after 2 iterations at 2024-10-17 22:01:49. Total running time: 3min 20s
Trial train_cifar_f1b1b_00007 finished iteration 4 at 2024-10-17 22:01:53. Total running time: 3min 23s
+------------------------------------------------------------+
| Trial train_cifar_f1b1b_00007 result |
+------------------------------------------------------------+
| checkpoint_dir_name checkpoint_000003 |
| time_this_iter_s 42.42913 |
| time_total_s 197.25905 |
| training_iteration 4 |
| accuracy 0.572 |
| loss 1.23591 |
+------------------------------------------------------------+
Trial train_cifar_f1b1b_00007 saved a checkpoint for iteration 4 at: (local)/var/lib/ci-user/ray_results/train_cifar_2024-10-17_21-58-29/train_cifar_f1b1b_00007_7_batch_size=8,l1=256,l2=256,lr=0.0048_2024-10-17_21-58-29/checkpoint_000003
(func pid=5904) Checkpoint successfully created at: Checkpoint(filesystem=local, path=/var/lib/ci-user/ray_results/train_cifar_2024-10-17_21-58-29/train_cifar_f1b1b_00007_7_batch_size=8,l1=256,l2=256,lr=0.0048_2024-10-17_21-58-29/checkpoint_000003) [repeated 2x across cluster]
(func pid=5898) [3, 4000] loss: 0.760
(func pid=5897) [1, 16000] loss: 0.292
Trial status: 6 TERMINATED | 4 RUNNING
Current time: 2024-10-17 22:02:00. Total running time: 3min 30s
Logical resource usage: 8.0/16 CPUs, 0/1 GPUs (0.0/1.0 accelerator_type:M60)
+------------------------------------------------------------------------------------------------------------------------------------+
| Trial name status l1 l2 lr batch_size iter total time (s) loss accuracy |
+------------------------------------------------------------------------------------------------------------------------------------+
| train_cifar_f1b1b_00005 RUNNING 8 64 0.000353097 4 2 178.394 1.57162 0.4135 |
| train_cifar_f1b1b_00006 RUNNING 16 4 0.000147684 8 4 192.537 1.763 0.3254 |
| train_cifar_f1b1b_00007 RUNNING 256 256 0.00477469 8 4 197.259 1.23591 0.572 |
| train_cifar_f1b1b_00009 RUNNING 2 16 0.0286986 2 |
| train_cifar_f1b1b_00000 TERMINATED 16 1 0.00213327 2 1 173.299 2.30543 0.0993 |
| train_cifar_f1b1b_00001 TERMINATED 1 2 0.013416 4 1 96.7157 2.30901 0.097 |
| train_cifar_f1b1b_00002 TERMINATED 256 64 0.0113784 2 1 190.342 2.32163 0.0957 |
| train_cifar_f1b1b_00003 TERMINATED 64 256 0.0274071 8 2 111.555 2.2556 0.155 |
| train_cifar_f1b1b_00004 TERMINATED 16 2 0.056666 4 1 98.8246 2.31015 0.0961 |
| train_cifar_f1b1b_00008 TERMINATED 128 256 0.0306227 8 2 97.2005 2.06562 0.206 |
+------------------------------------------------------------------------------------------------------------------------------------+
(func pid=5904) [5, 2000] loss: 1.039 [repeated 2x across cluster]
(func pid=5899) [5, 4000] loss: 0.864 [repeated 3x across cluster]
(func pid=5897) [1, 20000] loss: 0.234
(func pid=5904) [5, 4000] loss: 0.543
Trial train_cifar_f1b1b_00006 finished iteration 5 at 2024-10-17 22:02:22. Total running time: 3min 52s
+------------------------------------------------------------+
| Trial train_cifar_f1b1b_00006 result |
+------------------------------------------------------------+
| checkpoint_dir_name checkpoint_000004 |
| time_this_iter_s 34.15932 |
| time_total_s 226.69663 |
| training_iteration 5 |
| accuracy 0.3465 |
| loss 1.71249 |
+------------------------------------------------------------+
Trial train_cifar_f1b1b_00006 saved a checkpoint for iteration 5 at: (local)/var/lib/ci-user/ray_results/train_cifar_2024-10-17_21-58-29/train_cifar_f1b1b_00006_6_batch_size=8,l1=16,l2=4,lr=0.0001_2024-10-17_21-58-29/checkpoint_000004
(func pid=5899) Checkpoint successfully created at: Checkpoint(filesystem=local, path=/var/lib/ci-user/ray_results/train_cifar_2024-10-17_21-58-29/train_cifar_f1b1b_00006_6_batch_size=8,l1=16,l2=4,lr=0.0001_2024-10-17_21-58-29/checkpoint_000004)
Trial train_cifar_f1b1b_00007 finished iteration 5 at 2024-10-17 22:02:29. Total running time: 3min 59s
+------------------------------------------------------------+
| Trial train_cifar_f1b1b_00007 result |
+------------------------------------------------------------+
| checkpoint_dir_name checkpoint_000004 |
| time_this_iter_s 35.95874 |
| time_total_s 233.2178 |
| training_iteration 5 |
| accuracy 0.5716 |
| loss 1.25063 |
+------------------------------------------------------------+
Trial train_cifar_f1b1b_00007 saved a checkpoint for iteration 5 at: (local)/var/lib/ci-user/ray_results/train_cifar_2024-10-17_21-58-29/train_cifar_f1b1b_00007_7_batch_size=8,l1=256,l2=256,lr=0.0048_2024-10-17_21-58-29/checkpoint_000004
(func pid=5904) Checkpoint successfully created at: Checkpoint(filesystem=local, path=/var/lib/ci-user/ray_results/train_cifar_2024-10-17_21-58-29/train_cifar_f1b1b_00007_7_batch_size=8,l1=256,l2=256,lr=0.0048_2024-10-17_21-58-29/checkpoint_000004)
(func pid=5898) [3, 10000] loss: 0.294 [repeated 2x across cluster]
Trial status: 6 TERMINATED | 4 RUNNING
Current time: 2024-10-17 22:02:30. Total running time: 4min 0s
Logical resource usage: 8.0/16 CPUs, 0/1 GPUs (0.0/1.0 accelerator_type:M60)
+------------------------------------------------------------------------------------------------------------------------------------+
| Trial name status l1 l2 lr batch_size iter total time (s) loss accuracy |
+------------------------------------------------------------------------------------------------------------------------------------+
| train_cifar_f1b1b_00005 RUNNING 8 64 0.000353097 4 2 178.394 1.57162 0.4135 |
| train_cifar_f1b1b_00006 RUNNING 16 4 0.000147684 8 5 226.697 1.71249 0.3465 |
| train_cifar_f1b1b_00007 RUNNING 256 256 0.00477469 8 5 233.218 1.25063 0.5716 |
| train_cifar_f1b1b_00009 RUNNING 2 16 0.0286986 2 |
| train_cifar_f1b1b_00000 TERMINATED 16 1 0.00213327 2 1 173.299 2.30543 0.0993 |
| train_cifar_f1b1b_00001 TERMINATED 1 2 0.013416 4 1 96.7157 2.30901 0.097 |
| train_cifar_f1b1b_00002 TERMINATED 256 64 0.0113784 2 1 190.342 2.32163 0.0957 |
| train_cifar_f1b1b_00003 TERMINATED 64 256 0.0274071 8 2 111.555 2.2556 0.155 |
| train_cifar_f1b1b_00004 TERMINATED 16 2 0.056666 4 1 98.8246 2.31015 0.0961 |
| train_cifar_f1b1b_00008 TERMINATED 128 256 0.0306227 8 2 97.2005 2.06562 0.206 |
+------------------------------------------------------------------------------------------------------------------------------------+
Trial train_cifar_f1b1b_00009 finished iteration 1 at 2024-10-17 22:02:33. Total running time: 4min 4s
+------------------------------------------------------------+
| Trial train_cifar_f1b1b_00009 result |
+------------------------------------------------------------+
| checkpoint_dir_name checkpoint_000000 |
| time_this_iter_s 139.4363 |
| time_total_s 139.4363 |
| training_iteration 1 |
| accuracy 0.098 |
| loss 2.33758 |
+------------------------------------------------------------+
Trial train_cifar_f1b1b_00009 saved a checkpoint for iteration 1 at: (local)/var/lib/ci-user/ray_results/train_cifar_2024-10-17_21-58-29/train_cifar_f1b1b_00009_9_batch_size=2,l1=2,l2=16,lr=0.0287_2024-10-17_21-58-29/checkpoint_000000
Trial train_cifar_f1b1b_00009 completed after 1 iterations at 2024-10-17 22:02:33. Total running time: 4min 4s
Trial train_cifar_f1b1b_00005 finished iteration 3 at 2024-10-17 22:02:37. Total running time: 4min 7s
+------------------------------------------------------------+
| Trial train_cifar_f1b1b_00005 result |
+------------------------------------------------------------+
| checkpoint_dir_name checkpoint_000002 |
| time_this_iter_s 63.37139 |
| time_total_s 241.76546 |
| training_iteration 3 |
| accuracy 0.4627 |
| loss 1.45451 |
+------------------------------------------------------------+
Trial train_cifar_f1b1b_00005 saved a checkpoint for iteration 3 at: (local)/var/lib/ci-user/ray_results/train_cifar_2024-10-17_21-58-29/train_cifar_f1b1b_00005_5_batch_size=4,l1=8,l2=64,lr=0.0004_2024-10-17_21-58-29/checkpoint_000002
(func pid=5898) Checkpoint successfully created at: Checkpoint(filesystem=local, path=/var/lib/ci-user/ray_results/train_cifar_2024-10-17_21-58-29/train_cifar_f1b1b_00005_5_batch_size=4,l1=8,l2=64,lr=0.0004_2024-10-17_21-58-29/checkpoint_000002) [repeated 2x across cluster]
(func pid=5904) [6, 2000] loss: 0.991 [repeated 2x across cluster]
(func pid=5898) [4, 2000] loss: 1.424 [repeated 2x across cluster]
(func pid=5904) [6, 4000] loss: 0.525
Trial train_cifar_f1b1b_00006 finished iteration 6 at 2024-10-17 22:02:54. Total running time: 4min 24s
+------------------------------------------------------------+
| Trial train_cifar_f1b1b_00006 result |
+------------------------------------------------------------+
| checkpoint_dir_name checkpoint_000005 |
| time_this_iter_s 32.1987 |
| time_total_s 258.89533 |
| training_iteration 6 |
| accuracy 0.3643 |
| loss 1.67179 |
+------------------------------------------------------------+
Trial train_cifar_f1b1b_00006 saved a checkpoint for iteration 6 at: (local)/var/lib/ci-user/ray_results/train_cifar_2024-10-17_21-58-29/train_cifar_f1b1b_00006_6_batch_size=8,l1=16,l2=4,lr=0.0001_2024-10-17_21-58-29/checkpoint_000005
(func pid=5899) Checkpoint successfully created at: Checkpoint(filesystem=local, path=/var/lib/ci-user/ray_results/train_cifar_2024-10-17_21-58-29/train_cifar_f1b1b_00006_6_batch_size=8,l1=16,l2=4,lr=0.0001_2024-10-17_21-58-29/checkpoint_000005)
(func pid=5898) [4, 4000] loss: 0.708
Trial status: 7 TERMINATED | 3 RUNNING
Current time: 2024-10-17 22:03:00. Total running time: 4min 30s
Logical resource usage: 6.0/16 CPUs, 0/1 GPUs (0.0/1.0 accelerator_type:M60)
+------------------------------------------------------------------------------------------------------------------------------------+
| Trial name status l1 l2 lr batch_size iter total time (s) loss accuracy |
+------------------------------------------------------------------------------------------------------------------------------------+
| train_cifar_f1b1b_00005 RUNNING 8 64 0.000353097 4 3 241.765 1.45451 0.4627 |
| train_cifar_f1b1b_00006 RUNNING 16 4 0.000147684 8 6 258.895 1.67179 0.3643 |
| train_cifar_f1b1b_00007 RUNNING 256 256 0.00477469 8 5 233.218 1.25063 0.5716 |
| train_cifar_f1b1b_00000 TERMINATED 16 1 0.00213327 2 1 173.299 2.30543 0.0993 |
| train_cifar_f1b1b_00001 TERMINATED 1 2 0.013416 4 1 96.7157 2.30901 0.097 |
| train_cifar_f1b1b_00002 TERMINATED 256 64 0.0113784 2 1 190.342 2.32163 0.0957 |
| train_cifar_f1b1b_00003 TERMINATED 64 256 0.0274071 8 2 111.555 2.2556 0.155 |
| train_cifar_f1b1b_00004 TERMINATED 16 2 0.056666 4 1 98.8246 2.31015 0.0961 |
| train_cifar_f1b1b_00008 TERMINATED 128 256 0.0306227 8 2 97.2005 2.06562 0.206 |
| train_cifar_f1b1b_00009 TERMINATED 2 16 0.0286986 2 1 139.436 2.33758 0.098 |
+------------------------------------------------------------------------------------------------------------------------------------+
Trial train_cifar_f1b1b_00007 finished iteration 6 at 2024-10-17 22:03:02. Total running time: 4min 32s
+------------------------------------------------------------+
| Trial train_cifar_f1b1b_00007 result |
+------------------------------------------------------------+
| checkpoint_dir_name checkpoint_000005 |
| time_this_iter_s 33.33426 |
| time_total_s 266.55205 |
| training_iteration 6 |
| accuracy 0.5779 |
| loss 1.2377 |
+------------------------------------------------------------+
Trial train_cifar_f1b1b_00007 saved a checkpoint for iteration 6 at: (local)/var/lib/ci-user/ray_results/train_cifar_2024-10-17_21-58-29/train_cifar_f1b1b_00007_7_batch_size=8,l1=256,l2=256,lr=0.0048_2024-10-17_21-58-29/checkpoint_000005
(func pid=5904) Checkpoint successfully created at: Checkpoint(filesystem=local, path=/var/lib/ci-user/ray_results/train_cifar_2024-10-17_21-58-29/train_cifar_f1b1b_00007_7_batch_size=8,l1=256,l2=256,lr=0.0048_2024-10-17_21-58-29/checkpoint_000005)
(func pid=5899) [7, 2000] loss: 1.644
(func pid=5898) [4, 6000] loss: 0.468
(func pid=5904) [7, 2000] loss: 0.956
(func pid=5899) [7, 4000] loss: 0.813
Trial train_cifar_f1b1b_00006 finished iteration 7 at 2024-10-17 22:03:25. Total running time: 4min 55s
+------------------------------------------------------------+
| Trial train_cifar_f1b1b_00006 result |
+------------------------------------------------------------+
| checkpoint_dir_name checkpoint_000006 |
| time_this_iter_s 30.81064 |
| time_total_s 289.70597 |
| training_iteration 7 |
| accuracy 0.3875 |
| loss 1.61252 |
+------------------------------------------------------------+
Trial train_cifar_f1b1b_00006 saved a checkpoint for iteration 7 at: (local)/var/lib/ci-user/ray_results/train_cifar_2024-10-17_21-58-29/train_cifar_f1b1b_00006_6_batch_size=8,l1=16,l2=4,lr=0.0001_2024-10-17_21-58-29/checkpoint_000006
(func pid=5899) Checkpoint successfully created at: Checkpoint(filesystem=local, path=/var/lib/ci-user/ray_results/train_cifar_2024-10-17_21-58-29/train_cifar_f1b1b_00006_6_batch_size=8,l1=16,l2=4,lr=0.0001_2024-10-17_21-58-29/checkpoint_000006)
(func pid=5904) [7, 4000] loss: 0.507 [repeated 2x across cluster]
Trial status: 7 TERMINATED | 3 RUNNING
Current time: 2024-10-17 22:03:30. Total running time: 5min 0s
Logical resource usage: 6.0/16 CPUs, 0/1 GPUs (0.0/1.0 accelerator_type:M60)
+------------------------------------------------------------------------------------------------------------------------------------+
| Trial name status l1 l2 lr batch_size iter total time (s) loss accuracy |
+------------------------------------------------------------------------------------------------------------------------------------+
| train_cifar_f1b1b_00005 RUNNING 8 64 0.000353097 4 3 241.765 1.45451 0.4627 |
| train_cifar_f1b1b_00006 RUNNING 16 4 0.000147684 8 7 289.706 1.61252 0.3875 |
| train_cifar_f1b1b_00007 RUNNING 256 256 0.00477469 8 6 266.552 1.2377 0.5779 |
| train_cifar_f1b1b_00000 TERMINATED 16 1 0.00213327 2 1 173.299 2.30543 0.0993 |
| train_cifar_f1b1b_00001 TERMINATED 1 2 0.013416 4 1 96.7157 2.30901 0.097 |
| train_cifar_f1b1b_00002 TERMINATED 256 64 0.0113784 2 1 190.342 2.32163 0.0957 |
| train_cifar_f1b1b_00003 TERMINATED 64 256 0.0274071 8 2 111.555 2.2556 0.155 |
| train_cifar_f1b1b_00004 TERMINATED 16 2 0.056666 4 1 98.8246 2.31015 0.0961 |
| train_cifar_f1b1b_00008 TERMINATED 128 256 0.0306227 8 2 97.2005 2.06562 0.206 |
| train_cifar_f1b1b_00009 TERMINATED 2 16 0.0286986 2 1 139.436 2.33758 0.098 |
+------------------------------------------------------------------------------------------------------------------------------------+
Trial train_cifar_f1b1b_00005 finished iteration 4 at 2024-10-17 22:03:33. Total running time: 5min 3s
+------------------------------------------------------------+
| Trial train_cifar_f1b1b_00005 result |
+------------------------------------------------------------+
| checkpoint_dir_name checkpoint_000003 |
| time_this_iter_s 55.51741 |
| time_total_s 297.28287 |
| training_iteration 4 |
| accuracy 0.5025 |
| loss 1.35613 |
+------------------------------------------------------------+
Trial train_cifar_f1b1b_00005 saved a checkpoint for iteration 4 at: (local)/var/lib/ci-user/ray_results/train_cifar_2024-10-17_21-58-29/train_cifar_f1b1b_00005_5_batch_size=4,l1=8,l2=64,lr=0.0004_2024-10-17_21-58-29/checkpoint_000003
(func pid=5898) Checkpoint successfully created at: Checkpoint(filesystem=local, path=/var/lib/ci-user/ray_results/train_cifar_2024-10-17_21-58-29/train_cifar_f1b1b_00005_5_batch_size=4,l1=8,l2=64,lr=0.0004_2024-10-17_21-58-29/checkpoint_000003)
Trial train_cifar_f1b1b_00007 finished iteration 7 at 2024-10-17 22:03:35. Total running time: 5min 6s
+------------------------------------------------------------+
| Trial train_cifar_f1b1b_00007 result |
+------------------------------------------------------------+
| checkpoint_dir_name checkpoint_000006 |
| time_this_iter_s 33.28508 |
| time_total_s 299.83713 |
| training_iteration 7 |
| accuracy 0.5743 |
| loss 1.28765 |
+------------------------------------------------------------+
Trial train_cifar_f1b1b_00007 saved a checkpoint for iteration 7 at: (local)/var/lib/ci-user/ray_results/train_cifar_2024-10-17_21-58-29/train_cifar_f1b1b_00007_7_batch_size=8,l1=256,l2=256,lr=0.0048_2024-10-17_21-58-29/checkpoint_000006
(func pid=5899) [8, 2000] loss: 1.601 [repeated 2x across cluster]
(func pid=5898) [5, 2000] loss: 1.341
(func pid=5899) [8, 4000] loss: 0.794
Trial train_cifar_f1b1b_00006 finished iteration 8 at 2024-10-17 22:03:56. Total running time: 5min 26s
+------------------------------------------------------------+
| Trial train_cifar_f1b1b_00006 result |
+------------------------------------------------------------+
| checkpoint_dir_name checkpoint_000007 |
| time_this_iter_s 31.4599 |
| time_total_s 321.16586 |
| training_iteration 8 |
| accuracy 0.4042 |
| loss 1.5959 |
+------------------------------------------------------------+
Trial train_cifar_f1b1b_00006 saved a checkpoint for iteration 8 at: (local)/var/lib/ci-user/ray_results/train_cifar_2024-10-17_21-58-29/train_cifar_f1b1b_00006_6_batch_size=8,l1=16,l2=4,lr=0.0001_2024-10-17_21-58-29/checkpoint_000007
(func pid=5899) Checkpoint successfully created at: Checkpoint(filesystem=local, path=/var/lib/ci-user/ray_results/train_cifar_2024-10-17_21-58-29/train_cifar_f1b1b_00006_6_batch_size=8,l1=16,l2=4,lr=0.0001_2024-10-17_21-58-29/checkpoint_000007) [repeated 2x across cluster]
(func pid=5904) [8, 4000] loss: 0.488 [repeated 3x across cluster]
Trial status: 7 TERMINATED | 3 RUNNING
Current time: 2024-10-17 22:04:00. Total running time: 5min 30s
Logical resource usage: 6.0/16 CPUs, 0/1 GPUs (0.0/1.0 accelerator_type:M60)
+------------------------------------------------------------------------------------------------------------------------------------+
| Trial name status l1 l2 lr batch_size iter total time (s) loss accuracy |
+------------------------------------------------------------------------------------------------------------------------------------+
| train_cifar_f1b1b_00005 RUNNING 8 64 0.000353097 4 4 297.283 1.35613 0.5025 |
| train_cifar_f1b1b_00006 RUNNING 16 4 0.000147684 8 8 321.166 1.5959 0.4042 |
| train_cifar_f1b1b_00007 RUNNING 256 256 0.00477469 8 7 299.837 1.28765 0.5743 |
| train_cifar_f1b1b_00000 TERMINATED 16 1 0.00213327 2 1 173.299 2.30543 0.0993 |
| train_cifar_f1b1b_00001 TERMINATED 1 2 0.013416 4 1 96.7157 2.30901 0.097 |
| train_cifar_f1b1b_00002 TERMINATED 256 64 0.0113784 2 1 190.342 2.32163 0.0957 |
| train_cifar_f1b1b_00003 TERMINATED 64 256 0.0274071 8 2 111.555 2.2556 0.155 |
| train_cifar_f1b1b_00004 TERMINATED 16 2 0.056666 4 1 98.8246 2.31015 0.0961 |
| train_cifar_f1b1b_00008 TERMINATED 128 256 0.0306227 8 2 97.2005 2.06562 0.206 |
| train_cifar_f1b1b_00009 TERMINATED 2 16 0.0286986 2 1 139.436 2.33758 0.098 |
+------------------------------------------------------------------------------------------------------------------------------------+
(func pid=5899) [9, 2000] loss: 1.560 [repeated 2x across cluster]
Trial train_cifar_f1b1b_00007 finished iteration 8 at 2024-10-17 22:04:08. Total running time: 5min 39s
+------------------------------------------------------------+
| Trial train_cifar_f1b1b_00007 result |
+------------------------------------------------------------+
| checkpoint_dir_name checkpoint_000007 |
| time_this_iter_s 32.91228 |
| time_total_s 332.74941 |
| training_iteration 8 |
| accuracy 0.5623 |
| loss 1.40129 |
+------------------------------------------------------------+
Trial train_cifar_f1b1b_00007 saved a checkpoint for iteration 8 at: (local)/var/lib/ci-user/ray_results/train_cifar_2024-10-17_21-58-29/train_cifar_f1b1b_00007_7_batch_size=8,l1=256,l2=256,lr=0.0048_2024-10-17_21-58-29/checkpoint_000007
(func pid=5904) Checkpoint successfully created at: Checkpoint(filesystem=local, path=/var/lib/ci-user/ray_results/train_cifar_2024-10-17_21-58-29/train_cifar_f1b1b_00007_7_batch_size=8,l1=256,l2=256,lr=0.0048_2024-10-17_21-58-29/checkpoint_000007)
(func pid=5899) [9, 4000] loss: 0.777 [repeated 2x across cluster]
Trial train_cifar_f1b1b_00006 finished iteration 9 at 2024-10-17 22:04:27. Total running time: 5min 58s
+------------------------------------------------------------+
| Trial train_cifar_f1b1b_00006 result |
+------------------------------------------------------------+
| checkpoint_dir_name checkpoint_000008 |
| time_this_iter_s 31.09195 |
| time_total_s 352.25781 |
| training_iteration 9 |
| accuracy 0.4044 |
| loss 1.56051 |
+------------------------------------------------------------+
Trial train_cifar_f1b1b_00006 saved a checkpoint for iteration 9 at: (local)/var/lib/ci-user/ray_results/train_cifar_2024-10-17_21-58-29/train_cifar_f1b1b_00006_6_batch_size=8,l1=16,l2=4,lr=0.0001_2024-10-17_21-58-29/checkpoint_000008
(func pid=5899) Checkpoint successfully created at: Checkpoint(filesystem=local, path=/var/lib/ci-user/ray_results/train_cifar_2024-10-17_21-58-29/train_cifar_f1b1b_00006_6_batch_size=8,l1=16,l2=4,lr=0.0001_2024-10-17_21-58-29/checkpoint_000008)
Trial train_cifar_f1b1b_00005 finished iteration 5 at 2024-10-17 22:04:28. Total running time: 5min 59s
+------------------------------------------------------------+
| Trial train_cifar_f1b1b_00005 result |
+------------------------------------------------------------+
| checkpoint_dir_name checkpoint_000004 |
| time_this_iter_s 55.47278 |
| time_total_s 352.75564 |
| training_iteration 5 |
| accuracy 0.5099 |
| loss 1.34254 |
+------------------------------------------------------------+
Trial train_cifar_f1b1b_00005 saved a checkpoint for iteration 5 at: (local)/var/lib/ci-user/ray_results/train_cifar_2024-10-17_21-58-29/train_cifar_f1b1b_00005_5_batch_size=4,l1=8,l2=64,lr=0.0004_2024-10-17_21-58-29/checkpoint_000004
Trial status: 7 TERMINATED | 3 RUNNING
Current time: 2024-10-17 22:04:30. Total running time: 6min 0s
Logical resource usage: 6.0/16 CPUs, 0/1 GPUs (0.0/1.0 accelerator_type:M60)
+------------------------------------------------------------------------------------------------------------------------------------+
| Trial name status l1 l2 lr batch_size iter total time (s) loss accuracy |
+------------------------------------------------------------------------------------------------------------------------------------+
| train_cifar_f1b1b_00005 RUNNING 8 64 0.000353097 4 5 352.756 1.34254 0.5099 |
| train_cifar_f1b1b_00006 RUNNING 16 4 0.000147684 8 9 352.258 1.56051 0.4044 |
| train_cifar_f1b1b_00007 RUNNING 256 256 0.00477469 8 8 332.749 1.40129 0.5623 |
| train_cifar_f1b1b_00000 TERMINATED 16 1 0.00213327 2 1 173.299 2.30543 0.0993 |
| train_cifar_f1b1b_00001 TERMINATED 1 2 0.013416 4 1 96.7157 2.30901 0.097 |
| train_cifar_f1b1b_00002 TERMINATED 256 64 0.0113784 2 1 190.342 2.32163 0.0957 |
| train_cifar_f1b1b_00003 TERMINATED 64 256 0.0274071 8 2 111.555 2.2556 0.155 |
| train_cifar_f1b1b_00004 TERMINATED 16 2 0.056666 4 1 98.8246 2.31015 0.0961 |
| train_cifar_f1b1b_00008 TERMINATED 128 256 0.0306227 8 2 97.2005 2.06562 0.206 |
| train_cifar_f1b1b_00009 TERMINATED 2 16 0.0286986 2 1 139.436 2.33758 0.098 |
+------------------------------------------------------------------------------------------------------------------------------------+
(func pid=5904) [9, 4000] loss: 0.491 [repeated 3x across cluster]
(func pid=5899) [10, 2000] loss: 1.530
(func pid=5898) [6, 2000] loss: 1.281
Trial train_cifar_f1b1b_00007 finished iteration 9 at 2024-10-17 22:04:42. Total running time: 6min 12s
+------------------------------------------------------------+
| Trial train_cifar_f1b1b_00007 result |
+------------------------------------------------------------+
| checkpoint_dir_name checkpoint_000008 |
| time_this_iter_s 33.33806 |
| time_total_s 366.08747 |
| training_iteration 9 |
| accuracy 0.5661 |
| loss 1.37246 |
+------------------------------------------------------------+
Trial train_cifar_f1b1b_00007 saved a checkpoint for iteration 9 at: (local)/var/lib/ci-user/ray_results/train_cifar_2024-10-17_21-58-29/train_cifar_f1b1b_00007_7_batch_size=8,l1=256,l2=256,lr=0.0048_2024-10-17_21-58-29/checkpoint_000008
(func pid=5904) Checkpoint successfully created at: Checkpoint(filesystem=local, path=/var/lib/ci-user/ray_results/train_cifar_2024-10-17_21-58-29/train_cifar_f1b1b_00007_7_batch_size=8,l1=256,l2=256,lr=0.0048_2024-10-17_21-58-29/checkpoint_000008) [repeated 2x across cluster]
(func pid=5898) [6, 4000] loss: 0.646
(func pid=5899) [10, 4000] loss: 0.758
(func pid=5898) [6, 6000] loss: 0.431 [repeated 2x across cluster]
Trial train_cifar_f1b1b_00006 finished iteration 10 at 2024-10-17 22:04:58. Total running time: 6min 28s
+------------------------------------------------------------+
| Trial train_cifar_f1b1b_00006 result |
+------------------------------------------------------------+
| checkpoint_dir_name checkpoint_000009 |
| time_this_iter_s 30.85066 |
| time_total_s 383.10847 |
| training_iteration 10 |
| accuracy 0.4347 |
| loss 1.50765 |
+------------------------------------------------------------+
Trial train_cifar_f1b1b_00006 saved a checkpoint for iteration 10 at: (local)/var/lib/ci-user/ray_results/train_cifar_2024-10-17_21-58-29/train_cifar_f1b1b_00006_6_batch_size=8,l1=16,l2=4,lr=0.0001_2024-10-17_21-58-29/checkpoint_000009
Trial train_cifar_f1b1b_00006 completed after 10 iterations at 2024-10-17 22:04:58. Total running time: 6min 28s
(func pid=5899) Checkpoint successfully created at: Checkpoint(filesystem=local, path=/var/lib/ci-user/ray_results/train_cifar_2024-10-17_21-58-29/train_cifar_f1b1b_00006_6_batch_size=8,l1=16,l2=4,lr=0.0001_2024-10-17_21-58-29/checkpoint_000009)
Trial status: 8 TERMINATED | 2 RUNNING
Current time: 2024-10-17 22:05:00. Total running time: 6min 30s
Logical resource usage: 4.0/16 CPUs, 0/1 GPUs (0.0/1.0 accelerator_type:M60)
+------------------------------------------------------------------------------------------------------------------------------------+
| Trial name status l1 l2 lr batch_size iter total time (s) loss accuracy |
+------------------------------------------------------------------------------------------------------------------------------------+
| train_cifar_f1b1b_00005 RUNNING 8 64 0.000353097 4 5 352.756 1.34254 0.5099 |
| train_cifar_f1b1b_00007 RUNNING 256 256 0.00477469 8 9 366.087 1.37246 0.5661 |
| train_cifar_f1b1b_00000 TERMINATED 16 1 0.00213327 2 1 173.299 2.30543 0.0993 |
| train_cifar_f1b1b_00001 TERMINATED 1 2 0.013416 4 1 96.7157 2.30901 0.097 |
| train_cifar_f1b1b_00002 TERMINATED 256 64 0.0113784 2 1 190.342 2.32163 0.0957 |
| train_cifar_f1b1b_00003 TERMINATED 64 256 0.0274071 8 2 111.555 2.2556 0.155 |
| train_cifar_f1b1b_00004 TERMINATED 16 2 0.056666 4 1 98.8246 2.31015 0.0961 |
| train_cifar_f1b1b_00006 TERMINATED 16 4 0.000147684 8 10 383.108 1.50765 0.4347 |
| train_cifar_f1b1b_00008 TERMINATED 128 256 0.0306227 8 2 97.2005 2.06562 0.206 |
| train_cifar_f1b1b_00009 TERMINATED 2 16 0.0286986 2 1 139.436 2.33758 0.098 |
+------------------------------------------------------------------------------------------------------------------------------------+
(func pid=5904) [10, 4000] loss: 0.472
(func pid=5898) [6, 8000] loss: 0.322
Trial train_cifar_f1b1b_00007 finished iteration 10 at 2024-10-17 22:05:12. Total running time: 6min 42s
+------------------------------------------------------------+
| Trial train_cifar_f1b1b_00007 result |
+------------------------------------------------------------+
| checkpoint_dir_name checkpoint_000009 |
| time_this_iter_s 30.5672 |
| time_total_s 396.65467 |
| training_iteration 10 |
| accuracy 0.5751 |
| loss 1.35308 |
+------------------------------------------------------------+
Trial train_cifar_f1b1b_00007 saved a checkpoint for iteration 10 at: (local)/var/lib/ci-user/ray_results/train_cifar_2024-10-17_21-58-29/train_cifar_f1b1b_00007_7_batch_size=8,l1=256,l2=256,lr=0.0048_2024-10-17_21-58-29/checkpoint_000009
Trial train_cifar_f1b1b_00007 completed after 10 iterations at 2024-10-17 22:05:12. Total running time: 6min 42s
(func pid=5904) Checkpoint successfully created at: Checkpoint(filesystem=local, path=/var/lib/ci-user/ray_results/train_cifar_2024-10-17_21-58-29/train_cifar_f1b1b_00007_7_batch_size=8,l1=256,l2=256,lr=0.0048_2024-10-17_21-58-29/checkpoint_000009)
(func pid=5898) [6, 10000] loss: 0.249
Trial train_cifar_f1b1b_00005 finished iteration 6 at 2024-10-17 22:05:20. Total running time: 6min 50s
+------------------------------------------------------------+
| Trial train_cifar_f1b1b_00005 result |
+------------------------------------------------------------+
| checkpoint_dir_name checkpoint_000005 |
| time_this_iter_s 51.96922 |
| time_total_s 404.72486 |
| training_iteration 6 |
| accuracy 0.5364 |
| loss 1.29211 |
+------------------------------------------------------------+
Trial train_cifar_f1b1b_00005 saved a checkpoint for iteration 6 at: (local)/var/lib/ci-user/ray_results/train_cifar_2024-10-17_21-58-29/train_cifar_f1b1b_00005_5_batch_size=4,l1=8,l2=64,lr=0.0004_2024-10-17_21-58-29/checkpoint_000005
(func pid=5898) Checkpoint successfully created at: Checkpoint(filesystem=local, path=/var/lib/ci-user/ray_results/train_cifar_2024-10-17_21-58-29/train_cifar_f1b1b_00005_5_batch_size=4,l1=8,l2=64,lr=0.0004_2024-10-17_21-58-29/checkpoint_000005)
(func pid=5898) [7, 2000] loss: 1.251
Trial status: 9 TERMINATED | 1 RUNNING
Current time: 2024-10-17 22:05:30. Total running time: 7min 0s
Logical resource usage: 2.0/16 CPUs, 0/1 GPUs (0.0/1.0 accelerator_type:M60)
+------------------------------------------------------------------------------------------------------------------------------------+
| Trial name status l1 l2 lr batch_size iter total time (s) loss accuracy |
+------------------------------------------------------------------------------------------------------------------------------------+
| train_cifar_f1b1b_00005 RUNNING 8 64 0.000353097 4 6 404.725 1.29211 0.5364 |
| train_cifar_f1b1b_00000 TERMINATED 16 1 0.00213327 2 1 173.299 2.30543 0.0993 |
| train_cifar_f1b1b_00001 TERMINATED 1 2 0.013416 4 1 96.7157 2.30901 0.097 |
| train_cifar_f1b1b_00002 TERMINATED 256 64 0.0113784 2 1 190.342 2.32163 0.0957 |
| train_cifar_f1b1b_00003 TERMINATED 64 256 0.0274071 8 2 111.555 2.2556 0.155 |
| train_cifar_f1b1b_00004 TERMINATED 16 2 0.056666 4 1 98.8246 2.31015 0.0961 |
| train_cifar_f1b1b_00006 TERMINATED 16 4 0.000147684 8 10 383.108 1.50765 0.4347 |
| train_cifar_f1b1b_00007 TERMINATED 256 256 0.00477469 8 10 396.655 1.35308 0.5751 |
| train_cifar_f1b1b_00008 TERMINATED 128 256 0.0306227 8 2 97.2005 2.06562 0.206 |
| train_cifar_f1b1b_00009 TERMINATED 2 16 0.0286986 2 1 139.436 2.33758 0.098 |
+------------------------------------------------------------------------------------------------------------------------------------+
(func pid=5898) [7, 4000] loss: 0.619
(func pid=5898) [7, 6000] loss: 0.415
(func pid=5898) [7, 8000] loss: 0.308
(func pid=5898) [7, 10000] loss: 0.245
Trial status: 9 TERMINATED | 1 RUNNING
Current time: 2024-10-17 22:06:00. Total running time: 7min 30s
Logical resource usage: 2.0/16 CPUs, 0/1 GPUs (0.0/1.0 accelerator_type:M60)
+------------------------------------------------------------------------------------------------------------------------------------+
| Trial name status l1 l2 lr batch_size iter total time (s) loss accuracy |
+------------------------------------------------------------------------------------------------------------------------------------+
| train_cifar_f1b1b_00005 RUNNING 8 64 0.000353097 4 6 404.725 1.29211 0.5364 |
| train_cifar_f1b1b_00000 TERMINATED 16 1 0.00213327 2 1 173.299 2.30543 0.0993 |
| train_cifar_f1b1b_00001 TERMINATED 1 2 0.013416 4 1 96.7157 2.30901 0.097 |
| train_cifar_f1b1b_00002 TERMINATED 256 64 0.0113784 2 1 190.342 2.32163 0.0957 |
| train_cifar_f1b1b_00003 TERMINATED 64 256 0.0274071 8 2 111.555 2.2556 0.155 |
| train_cifar_f1b1b_00004 TERMINATED 16 2 0.056666 4 1 98.8246 2.31015 0.0961 |
| train_cifar_f1b1b_00006 TERMINATED 16 4 0.000147684 8 10 383.108 1.50765 0.4347 |
| train_cifar_f1b1b_00007 TERMINATED 256 256 0.00477469 8 10 396.655 1.35308 0.5751 |
| train_cifar_f1b1b_00008 TERMINATED 128 256 0.0306227 8 2 97.2005 2.06562 0.206 |
| train_cifar_f1b1b_00009 TERMINATED 2 16 0.0286986 2 1 139.436 2.33758 0.098 |
+------------------------------------------------------------------------------------------------------------------------------------+
Trial train_cifar_f1b1b_00005 finished iteration 7 at 2024-10-17 22:06:06. Total running time: 7min 36s
+------------------------------------------------------------+
| Trial train_cifar_f1b1b_00005 result |
+------------------------------------------------------------+
| checkpoint_dir_name checkpoint_000006 |
| time_this_iter_s 45.67289 |
| time_total_s 450.39776 |
| training_iteration 7 |
| accuracy 0.5543 |
| loss 1.23163 |
+------------------------------------------------------------+
Trial train_cifar_f1b1b_00005 saved a checkpoint for iteration 7 at: (local)/var/lib/ci-user/ray_results/train_cifar_2024-10-17_21-58-29/train_cifar_f1b1b_00005_5_batch_size=4,l1=8,l2=64,lr=0.0004_2024-10-17_21-58-29/checkpoint_000006
(func pid=5898) Checkpoint successfully created at: Checkpoint(filesystem=local, path=/var/lib/ci-user/ray_results/train_cifar_2024-10-17_21-58-29/train_cifar_f1b1b_00005_5_batch_size=4,l1=8,l2=64,lr=0.0004_2024-10-17_21-58-29/checkpoint_000006)
(func pid=5898) [8, 2000] loss: 1.196
(func pid=5898) [8, 4000] loss: 0.611
(func pid=5898) [8, 6000] loss: 0.394
Trial status: 9 TERMINATED | 1 RUNNING
Current time: 2024-10-17 22:06:30. Total running time: 8min 0s
Logical resource usage: 2.0/16 CPUs, 0/1 GPUs (0.0/1.0 accelerator_type:M60)
+------------------------------------------------------------------------------------------------------------------------------------+
| Trial name status l1 l2 lr batch_size iter total time (s) loss accuracy |
+------------------------------------------------------------------------------------------------------------------------------------+
| train_cifar_f1b1b_00005 RUNNING 8 64 0.000353097 4 7 450.398 1.23163 0.5543 |
| train_cifar_f1b1b_00000 TERMINATED 16 1 0.00213327 2 1 173.299 2.30543 0.0993 |
| train_cifar_f1b1b_00001 TERMINATED 1 2 0.013416 4 1 96.7157 2.30901 0.097 |
| train_cifar_f1b1b_00002 TERMINATED 256 64 0.0113784 2 1 190.342 2.32163 0.0957 |
| train_cifar_f1b1b_00003 TERMINATED 64 256 0.0274071 8 2 111.555 2.2556 0.155 |
| train_cifar_f1b1b_00004 TERMINATED 16 2 0.056666 4 1 98.8246 2.31015 0.0961 |
| train_cifar_f1b1b_00006 TERMINATED 16 4 0.000147684 8 10 383.108 1.50765 0.4347 |
| train_cifar_f1b1b_00007 TERMINATED 256 256 0.00477469 8 10 396.655 1.35308 0.5751 |
| train_cifar_f1b1b_00008 TERMINATED 128 256 0.0306227 8 2 97.2005 2.06562 0.206 |
| train_cifar_f1b1b_00009 TERMINATED 2 16 0.0286986 2 1 139.436 2.33758 0.098 |
+------------------------------------------------------------------------------------------------------------------------------------+
(func pid=5898) [8, 8000] loss: 0.298
(func pid=5898) [8, 10000] loss: 0.240
Trial train_cifar_f1b1b_00005 finished iteration 8 at 2024-10-17 22:06:51. Total running time: 8min 21s
+------------------------------------------------------------+
| Trial train_cifar_f1b1b_00005 result |
+------------------------------------------------------------+
| checkpoint_dir_name checkpoint_000007 |
| time_this_iter_s 45.03539 |
| time_total_s 495.43314 |
| training_iteration 8 |
| accuracy 0.5636 |
| loss 1.22532 |
+------------------------------------------------------------+
Trial train_cifar_f1b1b_00005 saved a checkpoint for iteration 8 at: (local)/var/lib/ci-user/ray_results/train_cifar_2024-10-17_21-58-29/train_cifar_f1b1b_00005_5_batch_size=4,l1=8,l2=64,lr=0.0004_2024-10-17_21-58-29/checkpoint_000007
(func pid=5898) Checkpoint successfully created at: Checkpoint(filesystem=local, path=/var/lib/ci-user/ray_results/train_cifar_2024-10-17_21-58-29/train_cifar_f1b1b_00005_5_batch_size=4,l1=8,l2=64,lr=0.0004_2024-10-17_21-58-29/checkpoint_000007)
(func pid=5898) [9, 2000] loss: 1.148
Trial status: 9 TERMINATED | 1 RUNNING
Current time: 2024-10-17 22:07:00. Total running time: 8min 31s
Logical resource usage: 2.0/16 CPUs, 0/1 GPUs (0.0/1.0 accelerator_type:M60)
+------------------------------------------------------------------------------------------------------------------------------------+
| Trial name status l1 l2 lr batch_size iter total time (s) loss accuracy |
+------------------------------------------------------------------------------------------------------------------------------------+
| train_cifar_f1b1b_00005 RUNNING 8 64 0.000353097 4 8 495.433 1.22532 0.5636 |
| train_cifar_f1b1b_00000 TERMINATED 16 1 0.00213327 2 1 173.299 2.30543 0.0993 |
| train_cifar_f1b1b_00001 TERMINATED 1 2 0.013416 4 1 96.7157 2.30901 0.097 |
| train_cifar_f1b1b_00002 TERMINATED 256 64 0.0113784 2 1 190.342 2.32163 0.0957 |
| train_cifar_f1b1b_00003 TERMINATED 64 256 0.0274071 8 2 111.555 2.2556 0.155 |
| train_cifar_f1b1b_00004 TERMINATED 16 2 0.056666 4 1 98.8246 2.31015 0.0961 |
| train_cifar_f1b1b_00006 TERMINATED 16 4 0.000147684 8 10 383.108 1.50765 0.4347 |
| train_cifar_f1b1b_00007 TERMINATED 256 256 0.00477469 8 10 396.655 1.35308 0.5751 |
| train_cifar_f1b1b_00008 TERMINATED 128 256 0.0306227 8 2 97.2005 2.06562 0.206 |
| train_cifar_f1b1b_00009 TERMINATED 2 16 0.0286986 2 1 139.436 2.33758 0.098 |
+------------------------------------------------------------------------------------------------------------------------------------+
(func pid=5898) [9, 4000] loss: 0.588
(func pid=5898) [9, 6000] loss: 0.391
(func pid=5898) [9, 8000] loss: 0.294
Trial status: 9 TERMINATED | 1 RUNNING
Current time: 2024-10-17 22:07:30. Total running time: 9min 1s
Logical resource usage: 2.0/16 CPUs, 0/1 GPUs (0.0/1.0 accelerator_type:M60)
+------------------------------------------------------------------------------------------------------------------------------------+
| Trial name status l1 l2 lr batch_size iter total time (s) loss accuracy |
+------------------------------------------------------------------------------------------------------------------------------------+
| train_cifar_f1b1b_00005 RUNNING 8 64 0.000353097 4 8 495.433 1.22532 0.5636 |
| train_cifar_f1b1b_00000 TERMINATED 16 1 0.00213327 2 1 173.299 2.30543 0.0993 |
| train_cifar_f1b1b_00001 TERMINATED 1 2 0.013416 4 1 96.7157 2.30901 0.097 |
| train_cifar_f1b1b_00002 TERMINATED 256 64 0.0113784 2 1 190.342 2.32163 0.0957 |
| train_cifar_f1b1b_00003 TERMINATED 64 256 0.0274071 8 2 111.555 2.2556 0.155 |
| train_cifar_f1b1b_00004 TERMINATED 16 2 0.056666 4 1 98.8246 2.31015 0.0961 |
| train_cifar_f1b1b_00006 TERMINATED 16 4 0.000147684 8 10 383.108 1.50765 0.4347 |
| train_cifar_f1b1b_00007 TERMINATED 256 256 0.00477469 8 10 396.655 1.35308 0.5751 |
| train_cifar_f1b1b_00008 TERMINATED 128 256 0.0306227 8 2 97.2005 2.06562 0.206 |
| train_cifar_f1b1b_00009 TERMINATED 2 16 0.0286986 2 1 139.436 2.33758 0.098 |
+------------------------------------------------------------------------------------------------------------------------------------+
(func pid=5898) [9, 10000] loss: 0.234
Trial train_cifar_f1b1b_00005 finished iteration 9 at 2024-10-17 22:07:37. Total running time: 9min 7s
+------------------------------------------------------------+
| Trial train_cifar_f1b1b_00005 result |
+------------------------------------------------------------+
| checkpoint_dir_name checkpoint_000008 |
| time_this_iter_s 45.61971 |
| time_total_s 541.05286 |
| training_iteration 9 |
| accuracy 0.565 |
| loss 1.21076 |
+------------------------------------------------------------+
Trial train_cifar_f1b1b_00005 saved a checkpoint for iteration 9 at: (local)/var/lib/ci-user/ray_results/train_cifar_2024-10-17_21-58-29/train_cifar_f1b1b_00005_5_batch_size=4,l1=8,l2=64,lr=0.0004_2024-10-17_21-58-29/checkpoint_000008
(func pid=5898) Checkpoint successfully created at: Checkpoint(filesystem=local, path=/var/lib/ci-user/ray_results/train_cifar_2024-10-17_21-58-29/train_cifar_f1b1b_00005_5_batch_size=4,l1=8,l2=64,lr=0.0004_2024-10-17_21-58-29/checkpoint_000008)
(func pid=5898) [10, 2000] loss: 1.139
(func pid=5898) [10, 4000] loss: 0.559
Trial status: 9 TERMINATED | 1 RUNNING
Current time: 2024-10-17 22:08:00. Total running time: 9min 31s
Logical resource usage: 2.0/16 CPUs, 0/1 GPUs (0.0/1.0 accelerator_type:M60)
+------------------------------------------------------------------------------------------------------------------------------------+
| Trial name status l1 l2 lr batch_size iter total time (s) loss accuracy |
+------------------------------------------------------------------------------------------------------------------------------------+
| train_cifar_f1b1b_00005 RUNNING 8 64 0.000353097 4 9 541.053 1.21076 0.565 |
| train_cifar_f1b1b_00000 TERMINATED 16 1 0.00213327 2 1 173.299 2.30543 0.0993 |
| train_cifar_f1b1b_00001 TERMINATED 1 2 0.013416 4 1 96.7157 2.30901 0.097 |
| train_cifar_f1b1b_00002 TERMINATED 256 64 0.0113784 2 1 190.342 2.32163 0.0957 |
| train_cifar_f1b1b_00003 TERMINATED 64 256 0.0274071 8 2 111.555 2.2556 0.155 |
| train_cifar_f1b1b_00004 TERMINATED 16 2 0.056666 4 1 98.8246 2.31015 0.0961 |
| train_cifar_f1b1b_00006 TERMINATED 16 4 0.000147684 8 10 383.108 1.50765 0.4347 |
| train_cifar_f1b1b_00007 TERMINATED 256 256 0.00477469 8 10 396.655 1.35308 0.5751 |
| train_cifar_f1b1b_00008 TERMINATED 128 256 0.0306227 8 2 97.2005 2.06562 0.206 |
| train_cifar_f1b1b_00009 TERMINATED 2 16 0.0286986 2 1 139.436 2.33758 0.098 |
+------------------------------------------------------------------------------------------------------------------------------------+
(func pid=5898) [10, 6000] loss: 0.385
(func pid=5898) [10, 8000] loss: 0.288
(func pid=5898) [10, 10000] loss: 0.228
Trial train_cifar_f1b1b_00005 finished iteration 10 at 2024-10-17 22:08:22. Total running time: 9min 52s
+------------------------------------------------------------+
| Trial train_cifar_f1b1b_00005 result |
+------------------------------------------------------------+
| checkpoint_dir_name checkpoint_000009 |
| time_this_iter_s 45.12275 |
| time_total_s 586.17561 |
| training_iteration 10 |
| accuracy 0.5756 |
| loss 1.17761 |
+------------------------------------------------------------+
Trial train_cifar_f1b1b_00005 saved a checkpoint for iteration 10 at: (local)/var/lib/ci-user/ray_results/train_cifar_2024-10-17_21-58-29/train_cifar_f1b1b_00005_5_batch_size=4,l1=8,l2=64,lr=0.0004_2024-10-17_21-58-29/checkpoint_000009
Trial train_cifar_f1b1b_00005 completed after 10 iterations at 2024-10-17 22:08:22. Total running time: 9min 52s
Trial status: 10 TERMINATED
Current time: 2024-10-17 22:08:22. Total running time: 9min 52s
Logical resource usage: 2.0/16 CPUs, 0/1 GPUs (0.0/1.0 accelerator_type:M60)
+------------------------------------------------------------------------------------------------------------------------------------+
| Trial name status l1 l2 lr batch_size iter total time (s) loss accuracy |
+------------------------------------------------------------------------------------------------------------------------------------+
| train_cifar_f1b1b_00000 TERMINATED 16 1 0.00213327 2 1 173.299 2.30543 0.0993 |
| train_cifar_f1b1b_00001 TERMINATED 1 2 0.013416 4 1 96.7157 2.30901 0.097 |
| train_cifar_f1b1b_00002 TERMINATED 256 64 0.0113784 2 1 190.342 2.32163 0.0957 |
| train_cifar_f1b1b_00003 TERMINATED 64 256 0.0274071 8 2 111.555 2.2556 0.155 |
| train_cifar_f1b1b_00004 TERMINATED 16 2 0.056666 4 1 98.8246 2.31015 0.0961 |
| train_cifar_f1b1b_00005 TERMINATED 8 64 0.000353097 4 10 586.176 1.17761 0.5756 |
| train_cifar_f1b1b_00006 TERMINATED 16 4 0.000147684 8 10 383.108 1.50765 0.4347 |
| train_cifar_f1b1b_00007 TERMINATED 256 256 0.00477469 8 10 396.655 1.35308 0.5751 |
| train_cifar_f1b1b_00008 TERMINATED 128 256 0.0306227 8 2 97.2005 2.06562 0.206 |
| train_cifar_f1b1b_00009 TERMINATED 2 16 0.0286986 2 1 139.436 2.33758 0.098 |
+------------------------------------------------------------------------------------------------------------------------------------+
Best trial config: {'l1': 8, 'l2': 64, 'lr': 0.0003530972286268149, 'batch_size': 4}
Best trial final validation loss: 1.1776135039269924
Best trial final validation accuracy: 0.5756
(func pid=5898) Checkpoint successfully created at: Checkpoint(filesystem=local, path=/var/lib/ci-user/ray_results/train_cifar_2024-10-17_21-58-29/train_cifar_f1b1b_00005_5_batch_size=4,l1=8,l2=64,lr=0.0004_2024-10-17_21-58-29/checkpoint_000009)
Files already downloaded and verified
Files already downloaded and verified
Best trial test set accuracy: 0.5846
If you run the code, an example output could look like this:
Number of trials: 10/10 (10 TERMINATED)
+-----+--------------+------+------+-------------+--------+---------+------------+
| ... | batch_size | l1 | l2 | lr | iter | loss | accuracy |
|-----+--------------+------+------+-------------+--------+---------+------------|
| ... | 2 | 1 | 256 | 0.000668163 | 1 | 2.31479 | 0.0977 |
| ... | 4 | 64 | 8 | 0.0331514 | 1 | 2.31605 | 0.0983 |
| ... | 4 | 2 | 1 | 0.000150295 | 1 | 2.30755 | 0.1023 |
| ... | 16 | 32 | 32 | 0.0128248 | 10 | 1.66912 | 0.4391 |
| ... | 4 | 8 | 128 | 0.00464561 | 2 | 1.7316 | 0.3463 |
| ... | 8 | 256 | 8 | 0.00031556 | 1 | 2.19409 | 0.1736 |
| ... | 4 | 16 | 256 | 0.00574329 | 2 | 1.85679 | 0.3368 |
| ... | 8 | 2 | 2 | 0.00325652 | 1 | 2.30272 | 0.0984 |
| ... | 2 | 2 | 2 | 0.000342987 | 2 | 1.76044 | 0.292 |
| ... | 4 | 64 | 32 | 0.003734 | 8 | 1.53101 | 0.4761 |
+-----+--------------+------+------+-------------+--------+---------+------------+
Best trial config: {'l1': 64, 'l2': 32, 'lr': 0.0037339984519545164, 'batch_size': 4}
Best trial final validation loss: 1.5310075663924216
Best trial final validation accuracy: 0.4761
Best trial test set accuracy: 0.4737
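In this example output, most trials were stopped early to avoid wasting resources. The best-performing trial reached a validation accuracy of about 47%, which the test set accuracy (also about 47%) confirms.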
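To make the selection of the best trial concrete, here is a minimal sketch in plain Python that reproduces it from the example table. It does not call Ray Tune: the rows are copied by hand from the example output, and the names `columns`, `rows`, `trials`, `best`, `longest`, and `early` are illustrative, not part of the tutorial's script. It assumes that "best" means lowest final validation loss, which matches the reported best trial config, and it treats a trial as stopped early if it ran fewer iterations than the longest-running trial.

```python
# Rows copied from the example output table (trial names are elided there).
columns = ("batch_size", "l1", "l2", "lr", "iter", "loss", "accuracy")
rows = [
    (2, 1, 256, 0.000668163, 1, 2.31479, 0.0977),
    (4, 64, 8, 0.0331514, 1, 2.31605, 0.0983),
    (4, 2, 1, 0.000150295, 1, 2.30755, 0.1023),
    (16, 32, 32, 0.0128248, 10, 1.66912, 0.4391),
    (4, 8, 128, 0.00464561, 2, 1.7316, 0.3463),
    (8, 256, 8, 0.00031556, 1, 2.19409, 0.1736),
    (4, 16, 256, 0.00574329, 2, 1.85679, 0.3368),
    (8, 2, 2, 0.00325652, 1, 2.30272, 0.0984),
    (2, 2, 2, 0.000342987, 2, 1.76044, 0.292),
    (4, 64, 32, 0.003734, 8, 1.53101, 0.4761),
]
trials = [dict(zip(columns, row)) for row in rows]

# Assumption: the best trial is the one with the lowest final validation loss.
best = min(trials, key=lambda t: t["loss"])

# Assumption: a trial counts as "stopped early" if it ran fewer iterations
# than the longest-running trial in the table.
longest = max(t["iter"] for t in trials)
early = [t for t in trials if t["iter"] < longest]

print("Best trial config:", {k: best[k] for k in ("l1", "l2", "lr", "batch_size")})
print("Best trial final validation loss:", best["loss"])
print("Best trial final validation accuracy:", best["accuracy"])
print(f"{len(early)} of {len(trials)} trials ran fewer than {longest} iterations")
```

The `lr` value printed here is the rounded one shown in the table, so it differs slightly from the full-precision value in the "Best trial config" line of the example output.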
That's it! You can now tune the parameters of your PyTorch models.
Total running time of the script: (10 minutes 10.049 seconds)