Quickstart¶
This is a self-contained guide on how to write a simple application and start launching distributed jobs on local and remote clusters.
Installation¶
First, we need to install the TorchX Python package, which includes the CLI and the library.
# install torchx with all dependencies
$ pip install "torchx[dev]"
See the README for more information on installation.
[1]:
%%sh
torchx --help
usage: torchx [-h] [--log_level LOG_LEVEL] [--version]
{builtins,cancel,configure,describe,list,log,run,runopts,status,tracker}
...
torchx CLI
optional arguments:
-h, --help show this help message and exit
--log_level LOG_LEVEL
Python logging log level
--version show program's version number and exit
sub-commands:
Use the following commands to run operations, e.g.: torchx run ${JOB_NAME}
{builtins,cancel,configure,describe,list,log,run,runopts,status,tracker}
Hello World¶
Let's start by writing a simple "Hello World" Python application. This is just a normal Python program and can contain anything you'd like.
Note
This example uses the Jupyter Notebook %%writefile magic to create local files for demonstration purposes. In normal usage you would keep these as standalone files.
[2]:
%%writefile my_app.py
import sys
print(f"Hello, {sys.argv[1]}!")
Overwriting my_app.py
Launching¶
We can execute our app via torchx run. The local_cwd scheduler executes the app relative to the current directory.
For this we'll use the utils.python component:
[3]:
%%sh
torchx run --scheduler local_cwd utils.python --help
usage: torchx run <run args...> python [--help] [-m str] [-c str]
[--script str] [--image str]
[--name str] [--cpu int] [--gpu int]
[--memMB int] [-h str]
[--num_replicas int]
...
Runs ``python`` with the specified module, command or script on the specified
image and host. Use ``--`` to separate component args and program args
(e.g. ``torchx run utils.python --m foo.main -- --args to --main``)
Note: (cpu, gpu, memMB) parameters are mutually exclusive with ``h`` (named resource) where
``h`` takes precedence if specified for setting resource requirements.
See `registering named resources <https://pytorch.org/torchx/latest/advanced.html#registering-named-resources>`_.
positional arguments:
str arguments passed to the program in sys.argv[1:] (ignored
with `--c`) (required)
optional arguments:
--help show this help message and exit
-m str, --m str run library module as a script (default: None)
-c str, --c str program passed as string (may error if scheduler has a
length limit on args) (default: None)
--script str .py script to run (default: None)
--image str image to run on (default: ghcr.io/pytorch/torchx:0.7.0)
--name str name of the job (default: torchx_utils_python)
--cpu int number of cpus per replica (default: 1)
--gpu int number of gpus per replica (default: 0)
--memMB int cpu memory in MB per replica (default: 1024)
-h str, --h str a registered named resource (if specified takes
precedence over cpu, gpu, memMB) (default: None)
--num_replicas int number of copies to run (each on its own container)
(default: 1)
The component takes the script name; any additional arguments are passed through to the script itself.
[4]:
%%sh
torchx run --scheduler local_cwd utils.python --script my_app.py "your name"
torchx 2024-07-17 02:04:22 INFO Tracker configurations: {}
torchx 2024-07-17 02:04:22 INFO Log directory not set in scheduler cfg. Creating a temporary log dir that will be deleted on exit. To preserve log directory set the `log_dir` cfg option
torchx 2024-07-17 02:04:22 INFO Log directory is: /tmp/torchx_4jm6n5c0
torchx 2024-07-17 02:04:22 INFO Waiting for the app to finish...
python/0 Hello, your name!
torchx 2024-07-17 02:04:23 INFO Job finished: SUCCEEDED
local_cwd://torchx/torchx_utils_python-x3pqtf6102th4
We can run the exact same app via the local_docker scheduler. This scheduler packages the local workspace as a layer on top of the specified image, which provides an environment very similar to the container-based remote schedulers.
Note
This requires Docker to be installed and won't work in environments such as Google Colab. See the Docker install instructions: https://docs.docker.com/get-docker/
[5]:
%%sh
torchx run --scheduler local_docker utils.python --script my_app.py "your name"
torchx 2024-07-17 02:04:24 INFO Tracker configurations: {}
torchx 2024-07-17 02:04:24 INFO Checking for changes in workspace `file:///home/ec2-user/torchx/docs/source`...
torchx 2024-07-17 02:04:24 INFO To disable workspaces pass: --workspace="" from CLI or workspace=None programmatically.
torchx 2024-07-17 02:04:24 INFO Workspace `file:///home/ec2-user/torchx/docs/source` resolved to filesystem path `/home/ec2-user/torchx/docs/source`
torchx 2024-07-17 02:04:25 INFO Building workspace docker image (this may take a while)...
torchx 2024-07-17 02:04:25 INFO Step 1/4 : ARG IMAGE
torchx 2024-07-17 02:04:25 INFO Step 2/4 : FROM $IMAGE
torchx 2024-07-17 02:04:25 INFO ---> 2fd60971a176
torchx 2024-07-17 02:04:25 INFO Step 3/4 : COPY . .
torchx 2024-07-17 02:04:25 INFO ---> 5c6b0e85d941
torchx 2024-07-17 02:04:25 INFO Step 4/4 : LABEL torchx.pytorch.org/version=0.7.0
torchx 2024-07-17 02:04:25 INFO ---> Running in f1950ba54083
torchx 2024-07-17 02:04:25 INFO ---> Removed intermediate container f1950ba54083
torchx 2024-07-17 02:04:25 INFO ---> 6f6b5878c3d4
torchx 2024-07-17 02:04:25 INFO [Warning] One or more build-args [WORKSPACE] were not consumed
torchx 2024-07-17 02:04:25 INFO Successfully built 6f6b5878c3d4
torchx 2024-07-17 02:04:25 INFO Built new image `sha256:6f6b5878c3d4542f486e423ae4252825f693cd34fb256bebafb539c8899712e0` based on original image `ghcr.io/pytorch/torchx:0.7.0` and changes in workspace `file:///home/ec2-user/torchx/docs/source` for role[0]=python.
torchx 2024-07-17 02:04:26 INFO Waiting for the app to finish...
python/0 Hello, your name!
torchx 2024-07-17 02:04:27 INFO Job finished: SUCCEEDED
local_docker://torchx/torchx_utils_python-mpqj0h6hdpzg4
TorchX defaults to the ghcr.io/pytorch/torchx Docker container image, which contains the PyTorch libraries, TorchX, and related dependencies.
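If you need a different base image, the --image flag shown in the help output above lets you run on another container. A minimal sketch (my.registry/my-image:latest is a placeholder for an image of your own):
$ torchx run --scheduler local_docker utils.python --image my.registry/my-image:latest --script my_app.py "your name"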
Distributed¶
TorchX's dist.ddp component uses TorchElastic to manage the worker processes. This means you can launch multi-worker and multi-host jobs out of the box on all supported schedulers.
[6]:
%%sh
torchx run --scheduler local_docker dist.ddp --help
usage: torchx run <run args...> ddp [--help] [--script str] [-m str]
[--image str] [--name str] [-h str]
[--cpu int] [--gpu int] [--memMB int]
[-j str] [--env str] [--max_retries int]
[--rdzv_port int] [--rdzv_backend str]
[--mounts str] [--debug str] [--tee int]
...
Distributed data parallel style application (one role, multi-replica).
Uses `torch.distributed.run <https://pytorch.org/docs/stable/distributed.elastic.html>`_
to launch and coordinate PyTorch worker processes. Defaults to using ``c10d`` rendezvous backend
on rendezvous_endpoint ``$rank_0_host:$rdzv_port``. Note that ``rdzv_port`` parameter is ignored
when running on single node, and instead we use port 0 which instructs torchelastic to chose
a free random port on the host.
Note: (cpu, gpu, memMB) parameters are mutually exclusive with ``h`` (named resource) where
``h`` takes precedence if specified for setting resource requirements.
See `registering named resources <https://pytorch.org/torchx/latest/advanced.html#registering-named-resources>`_.
positional arguments:
str arguments to the main module (required)
optional arguments:
--help show this help message and exit
--script str script or binary to run within the image (default: None)
-m str, --m str the python module path to run (default: None)
--image str image (e.g. docker) (default:
ghcr.io/pytorch/torchx:0.7.0)
--name str job name override in the following format:
``{experimentname}/{runname}`` or ``{experimentname}/``
or ``/{runname}`` or ``{runname}``. Uses the script or
module name if ``{runname}`` not specified. (default: /)
-h str, --h str a registered named resource (if specified takes
precedence over cpu, gpu, memMB) (default: None)
--cpu int number of cpus per replica (default: 2)
--gpu int number of gpus per replica (default: 0)
--memMB int cpu memory in MB per replica (default: 1024)
-j str, --j str [{min_nnodes}:]{nnodes}x{nproc_per_node}, for gpu hosts,
nproc_per_node must not exceed num gpus (default: 1x2)
--env str environment varibles to be passed to the run (e.g.
ENV1=v1,ENV2=v2,ENV3=v3) (default: None)
--max_retries int the number of scheduler retries allowed (default: 0)
--rdzv_port int the port on rank0's host to use for hosting the c10d
store used for rendezvous. Only takes effect when
running multi-node. When running single node, this
parameter is ignored and a random free port is chosen.
(default: 29500)
--rdzv_backend str the rendezvous backend to use. Only takes effect when
running multi-node. (default: c10d)
--mounts str mounts to mount into the worker environment/container
(ex. type=<bind/volume>,src=/host,dst=/job[,readonly]).
See scheduler documentation for more info. (default:
None)
--debug str whether to run with preset debug flags enabled (default:
False)
--tee int tees the specified std stream(s) to console + file. 0:
none, 1: stdout, 2: stderr, 3: both (default: 3)
Let's create a slightly more interesting application to take advantage of TorchX's distributed support.
[7]:
%%writefile dist_app.py
import torch
import torch.distributed as dist

# torchelastic (launched by dist.ddp) sets RANK, WORLD_SIZE, MASTER_ADDR and
# MASTER_PORT in the environment, so only the backend needs to be specified
dist.init_process_group(backend="gloo")
print(f"I am worker {dist.get_rank()} of {dist.get_world_size()}!")

# sum every worker's rank across the job: 0+1+2+3 = 6 for a 2x2 job
a = torch.tensor([dist.get_rank()])
dist.all_reduce(a)
print(f"all_reduce output = {a}")
Writing dist_app.py
Let's launch a small job with 2 nodes and 2 worker processes per node:
[8]:
%%sh
torchx run --scheduler local_docker dist.ddp -j 2x2 --script dist_app.py
torchx 2024-07-17 02:04:29 INFO Tracker configurations: {}
torchx 2024-07-17 02:04:29 INFO Checking for changes in workspace `file:///home/ec2-user/torchx/docs/source`...
torchx 2024-07-17 02:04:29 INFO To disable workspaces pass: --workspace="" from CLI or workspace=None programmatically.
torchx 2024-07-17 02:04:29 INFO Workspace `file:///home/ec2-user/torchx/docs/source` resolved to filesystem path `/home/ec2-user/torchx/docs/source`
torchx 2024-07-17 02:04:29 INFO Building workspace docker image (this may take a while)...
torchx 2024-07-17 02:04:29 INFO Step 1/4 : ARG IMAGE
torchx 2024-07-17 02:04:29 INFO Step 2/4 : FROM $IMAGE
torchx 2024-07-17 02:04:29 INFO ---> 2fd60971a176
torchx 2024-07-17 02:04:29 INFO Step 3/4 : COPY . .
torchx 2024-07-17 02:04:30 INFO ---> 3952ed3cf724
torchx 2024-07-17 02:04:30 INFO Step 4/4 : LABEL torchx.pytorch.org/version=0.7.0
torchx 2024-07-17 02:04:30 INFO ---> Running in 69990dd849d4
torchx 2024-07-17 02:04:30 INFO ---> Removed intermediate container 69990dd849d4
torchx 2024-07-17 02:04:30 INFO ---> ca8f6c64499f
torchx 2024-07-17 02:04:30 INFO [Warning] One or more build-args [WORKSPACE] were not consumed
torchx 2024-07-17 02:04:30 INFO Successfully built ca8f6c64499f
torchx 2024-07-17 02:04:30 INFO Built new image `sha256:ca8f6c64499fd31e8730e7195d808da9eba3824de7949b2e83154cde5a034f2d` based on original image `ghcr.io/pytorch/torchx:0.7.0` and changes in workspace `file:///home/ec2-user/torchx/docs/source` for role[0]=dist_app.
torchx 2024-07-17 02:04:31 INFO Waiting for the app to finish...
dist_app/0 [2024-07-17 02:04:31,764] torch.distributed.run: [WARNING]
dist_app/0 [2024-07-17 02:04:31,764] torch.distributed.run: [WARNING] *****************************************
dist_app/0 [2024-07-17 02:04:31,764] torch.distributed.run: [WARNING] Setting OMP_NUM_THREADS environment variable for each process to be 1 in default, to avoid your system being overloaded, please further tune the variable for optimal performance in your application as needed.
dist_app/0 [2024-07-17 02:04:31,764] torch.distributed.run: [WARNING] *****************************************
dist_app/1 [2024-07-17 02:04:32,251] torch.distributed.run: [WARNING]
dist_app/1 [2024-07-17 02:04:32,251] torch.distributed.run: [WARNING] *****************************************
dist_app/1 [2024-07-17 02:04:32,251] torch.distributed.run: [WARNING] Setting OMP_NUM_THREADS environment variable for each process to be 1 in default, to avoid your system being overloaded, please further tune the variable for optimal performance in your application as needed.
dist_app/1 [2024-07-17 02:04:32,251] torch.distributed.run: [WARNING] *****************************************
dist_app/1 [0]:I am worker 2 of 4!
dist_app/1 [0]:all_reduce output = tensor([6])
dist_app/1 [1]:I am worker 3 of 4!
dist_app/1 [1]:all_reduce output = tensor([6])
dist_app/0 [0]:I am worker 0 of 4!
dist_app/0 [0]:all_reduce output = tensor([6])
dist_app/0 [1]:I am worker 1 of 4!
dist_app/0 [1]:all_reduce output = tensor([6])
torchx 2024-07-17 02:04:39 INFO Job finished: SUCCEEDED
local_docker://torchx/dist_app-c206q7kmh3bl4c
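As the help output above notes, dist.ddp delegates to torch.distributed.run. The -j 2x2 job we just ran corresponds roughly to launching the following on each of the two nodes (a sketch for intuition only; $RANK0_HOST is a placeholder for the first node's hostname, and the rendezvous flags mirror the c10d defaults described in the help output):
$ torchrun --nnodes 2 --nproc_per_node 2 --rdzv_backend c10d --rdzv_endpoint $RANK0_HOST:29500 dist_app.py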
Workspaces / Patching¶
For each scheduler there is a concept of an image. For local_cwd and slurm it uses the current working directory. For container-based schedulers such as local_docker, kubernetes and aws_batch it uses a Docker container.
To provide the same environment between local and remote jobs, the TorchX CLI uses workspaces to automatically patch the image for remote jobs on a per-scheduler basis.
When you launch a job via torchx run, it overlays the current directory on top of the provided image so your code is available in the launched job.
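Conceptually, this patching step amounts to a small Dockerfile like the sketch below, reconstructed from the build steps visible in the logs above (TorchX generates it for you; it is not a file you write yourself):
ARG IMAGE
FROM $IMAGE
COPY . .
The patched image is what actually runs, which is why locally edited files show up inside the container.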
For Docker-based schedulers, you'll need a local Docker daemon to build the image and push it to your remote Docker repository.
.torchxconfig¶
Arguments to schedulers can be specified either via a command-line flag, torchx run -s <scheduler> -c <args>, or on a per-scheduler basis via a .torchxconfig file.
[9]:
%%writefile .torchxconfig
[kubernetes]
queue=torchx
image_repo=<your docker image repository>
[slurm]
partition=torchx
Writing .torchxconfig
Remote Schedulers¶
TorchX supports a large number of schedulers. Don't see yours? Request it!
Remote schedulers work exactly the same way local schedulers do. The same run command you use locally works out of the box on remote schedulers.
$ torchx run --scheduler slurm dist.ddp -j 2x2 --script dist_app.py
$ torchx run --scheduler kubernetes dist.ddp -j 2x2 --script dist_app.py
$ torchx run --scheduler aws_batch dist.ddp -j 2x2 --script dist_app.py
$ torchx run --scheduler ray dist.ddp -j 2x2 --script dist_app.py
Depending on the scheduler, a few extra configuration parameters may be needed so that TorchX knows where to run the job and where to upload built images. These can be set either via -c or in the .torchxconfig file.
All of the config options:
[10]:
%%sh
torchx runopts
local_docker:
usage:
[copy_env=COPY_ENV],[env=ENV],[privileged=PRIVILEGED],[image_repo=IMAGE_REPO],[quiet=QUIET]
optional arguments:
copy_env=COPY_ENV (typing.List[str], None)
list of glob patterns of environment variables to copy if not set in AppDef. Ex: FOO_*
env=ENV (typing.Dict[str, str], None)
environment variables to be passed to the run. The separator sign can be eiher comma or semicolon
(e.g. ENV1:v1,ENV2:v2,ENV3:v3 or ENV1:V1;ENV2:V2). Environment variables from env will be applied on top
of the ones from copy_env
privileged=PRIVILEGED (bool, False)
If true runs the container with elevated permissions. Equivalent to running with `docker run --privileged`.
image_repo=IMAGE_REPO (str, None)
(remote jobs) the image repository to use when pushing patched images, must have push access. Ex: example.com/your/container
quiet=QUIET (bool, False)
whether to suppress verbose output for image building. Defaults to ``False``.
local_cwd:
usage:
[log_dir=LOG_DIR],[prepend_cwd=PREPEND_CWD],[auto_set_cuda_visible_devices=AUTO_SET_CUDA_VISIBLE_DEVICES]
optional arguments:
log_dir=LOG_DIR (str, None)
dir to write stdout/stderr log files of replicas
prepend_cwd=PREPEND_CWD (bool, False)
if set, prepends CWD to replica's PATH env var making any binaries in CWD take precedence over those in PATH
auto_set_cuda_visible_devices=AUTO_SET_CUDA_VISIBLE_DEVICES (bool, False)
sets the `CUDA_AVAILABLE_DEVICES` for roles that request GPU resources. Each role replica will be assigned one GPU. Does nothing if the device count is less than replicas.
slurm:
usage:
[partition=PARTITION],[time=TIME],[comment=COMMENT],[constraint=CONSTRAINT],[mail-user=MAIL-USER],[mail-type=MAIL-TYPE],[job_dir=JOB_DIR]
optional arguments:
partition=PARTITION (str, None)
The partition to run the job in.
time=TIME (str, None)
The maximum time the job is allowed to run for. Formats: "minutes", "minutes:seconds", "hours:minutes:seconds", "days-hours", "days-hours:minutes" or "days-hours:minutes:seconds"
comment=COMMENT (str, None)
Comment to set on the slurm job.
constraint=CONSTRAINT (str, None)
Constraint to use for the slurm job.
mail-user=MAIL-USER (str, None)
User to mail on job end.
mail-type=MAIL-TYPE (str, None)
What events to mail users on.
job_dir=JOB_DIR (str, None)
The directory to place the job code and outputs. The
directory must not exist and will be created. To enable log
iteration, jobs will be tracked in ``.torchxslurmjobdirs``.
kubernetes:
usage:
queue=QUEUE,[namespace=NAMESPACE],[service_account=SERVICE_ACCOUNT],[priority_class=PRIORITY_CLASS],[image_repo=IMAGE_REPO],[quiet=QUIET]
required arguments:
queue=QUEUE (str)
Volcano queue to schedule job in
optional arguments:
namespace=NAMESPACE (str, default)
Kubernetes namespace to schedule job in
service_account=SERVICE_ACCOUNT (str, None)
The service account name to set on the pod specs
priority_class=PRIORITY_CLASS (str, None)
The name of the PriorityClass to set on the job specs
image_repo=IMAGE_REPO (str, None)
(remote jobs) the image repository to use when pushing patched images, must have push access. Ex: example.com/your/container
quiet=QUIET (bool, False)
whether to suppress verbose output for image building. Defaults to ``False``.
kubernetes_mcad:
usage:
[namespace=NAMESPACE],[image_repo=IMAGE_REPO],[service_account=SERVICE_ACCOUNT],[priority=PRIORITY],[priority_class_name=PRIORITY_CLASS_NAME],[image_secret=IMAGE_SECRET],[coscheduler_name=COSCHEDULER_NAME],[network=NETWORK]
optional arguments:
namespace=NAMESPACE (str, default)
Kubernetes namespace to schedule job in
image_repo=IMAGE_REPO (str, None)
The image repository to use when pushing patched images, must have push access. Ex: example.com/your/container
service_account=SERVICE_ACCOUNT (str, None)
The service account name to set on the pod specs
priority=PRIORITY (int, None)
The priority level to set on the job specs. Higher integer value means higher priority
priority_class_name=PRIORITY_CLASS_NAME (str, None)
Pod specific priority level. Check with your Kubernetes cluster admin if Priority classes are defined on your system
image_secret=IMAGE_SECRET (str, None)
The name of the Kubernetes/OpenShift secret set up for private images
coscheduler_name=COSCHEDULER_NAME (str, None)
Option to run TorchX-MCAD with a co-scheduler. User must provide the co-scheduler name.
network=NETWORK (str, None)
Name of additional pod-to-pod network beyond default Kubernetes network
aws_batch:
usage:
queue=QUEUE,[user=USER],[privileged=PRIVILEGED],[share_id=SHARE_ID],[priority=PRIORITY],[job_role_arn=JOB_ROLE_ARN],[execution_role_arn=EXECUTION_ROLE_ARN],[image_repo=IMAGE_REPO],[quiet=QUIET]
required arguments:
queue=QUEUE (str)
queue to schedule job in
optional arguments:
user=USER (str, ec2-user)
The username to tag the job with. `getpass.getuser()` if not specified.
privileged=PRIVILEGED (bool, False)
If true runs the container with elevated permissions. Equivalent to running with `docker run --privileged`.
share_id=SHARE_ID (str, None)
The share identifier for the job. This must be set if and only if the job queue has a scheduling policy.
priority=PRIORITY (int, 0)
The scheduling priority for the job within the context of share_id. Higher number (between 0 and 9999) means higher priority. This will only take effect if the job queue has a scheduling policy.
job_role_arn=JOB_ROLE_ARN (str, None)
The Amazon Resource Name (ARN) of the IAM role that the container can assume for AWS permissions.
execution_role_arn=EXECUTION_ROLE_ARN (str, None)
The Amazon Resource Name (ARN) of the IAM role that the ECS agent can assume for AWS permissions.
image_repo=IMAGE_REPO (str, None)
(remote jobs) the image repository to use when pushing patched images, must have push access. Ex: example.com/your/container
quiet=QUIET (bool, False)
whether to suppress verbose output for image building. Defaults to ``False``.
aws_sagemaker:
usage:
role=ROLE,instance_type=INSTANCE_TYPE,[instance_count=INSTANCE_COUNT],[user=USER],[keep_alive_period_in_seconds=KEEP_ALIVE_PERIOD_IN_SECONDS],[volume_size=VOLUME_SIZE],[volume_kms_key=VOLUME_KMS_KEY],[max_run=MAX_RUN],[input_mode=INPUT_MODE],[output_path=OUTPUT_PATH],[output_kms_key=OUTPUT_KMS_KEY],[base_job_name=BASE_JOB_NAME],[tags=TAGS],[subnets=SUBNETS],[security_group_ids=SECURITY_GROUP_IDS],[model_uri=MODEL_URI],[model_channel_name=MODEL_CHANNEL_NAME],[metric_definitions=METRIC_DEFINITIONS],[encrypt_inter_container_traffic=ENCRYPT_INTER_CONTAINER_TRAFFIC],[use_spot_instances=USE_SPOT_INSTANCES],[max_wait=MAX_WAIT],[checkpoint_s3_uri=CHECKPOINT_S3_URI],[checkpoint_local_path=CHECKPOINT_LOCAL_PATH],[debugger_hook_config=DEBUGGER_HOOK_CONFIG],[enable_sagemaker_metrics=ENABLE_SAGEMAKER_METRICS],[enable_network_isolation=ENABLE_NETWORK_ISOLATION],[disable_profiler=DISABLE_PROFILER],[environment=ENVIRONMENT],[max_retry_attempts=MAX_RETRY_ATTEMPTS],[source_dir=SOURCE_DIR],[git_config=GIT_CONFIG],[hyperparameters=HYPERPARAMETERS],[container_log_level=CONTAINER_LOG_LEVEL],[code_location=CODE_LOCATION],[dependencies=DEPENDENCIES],[training_repository_access_mode=TRAINING_REPOSITORY_ACCESS_MODE],[training_repository_credentials_provider_arn=TRAINING_REPOSITORY_CREDENTIALS_PROVIDER_ARN],[disable_output_compression=DISABLE_OUTPUT_COMPRESSION],[enable_infra_check=ENABLE_INFRA_CHECK],[image_repo=IMAGE_REPO],[quiet=QUIET]
required arguments:
role=ROLE (str)
an AWS IAM role (either name or full ARN). The Amazon SageMaker training jobs and APIs that create Amazon SageMaker endpoints use this role to access training data and model artifacts. After the endpoint is created, the inference code might use the IAM role, if it needs to access an AWS resource.
instance_type=INSTANCE_TYPE (str)
type of EC2 instance to use for training, for example, 'ml.c4.xlarge'
optional arguments:
instance_count=INSTANCE_COUNT (int, 1)
number of Amazon EC2 instances to use for training. Required if instance_groups is not set.
user=USER (str, ec2-user)
the username to tag the job with. `getpass.getuser()` if not specified.
keep_alive_period_in_seconds=KEEP_ALIVE_PERIOD_IN_SECONDS (int, None)
the duration of time in seconds to retain configured resources in a warm pool for subsequent training jobs.
volume_size=VOLUME_SIZE (int, None)
size in GB of the storage volume to use for storing input and output data during training (default: 30).
volume_kms_key=VOLUME_KMS_KEY (str, None)
KMS key ID for encrypting EBS volume attached to the training instance.
max_run=MAX_RUN (int, None)
timeout in seconds for training (default: 24 * 60 * 60).
input_mode=INPUT_MODE (str, None)
the input mode that the algorithm supports (default: ‘File’).
output_path=OUTPUT_PATH (str, None)
S3 location for saving the training result (model artifacts and output files). If not specified, results are stored to a default bucket. If the bucket with the specific name does not exist, the estimator creates the bucket during the fit() method execution.
output_kms_key=OUTPUT_KMS_KEY (str, None)
KMS key ID for encrypting the training output (default: Your IAM role’s KMS key for Amazon S3).
base_job_name=BASE_JOB_NAME (str, None)
prefix for training job name when the fit() method launches. If not specified, the estimator generates a default job name based on the training image name and current timestamp.
tags=TAGS (typing.List[typing.Dict[str, str]], None)
list of tags for labeling a training job.
subnets=SUBNETS (typing.List[str], None)
list of subnet ids. If not specified training job will be created without VPC config.
security_group_ids=SECURITY_GROUP_IDS (typing.List[str], None)
list of security group ids. If not specified training job will be created without VPC config.
model_uri=MODEL_URI (str, None)
URI where a pre-trained model is stored, either locally or in S3.
model_channel_name=MODEL_CHANNEL_NAME (str, None)
name of the channel where ‘model_uri’ will be downloaded (default: ‘model’).
metric_definitions=METRIC_DEFINITIONS (typing.List[typing.Dict[str, str]], None)
list of dictionaries that defines the metric(s) used to evaluate the training jobs. Each dictionary contains two keys: ‘Name’ for the name of the metric, and ‘Regex’ for the regular expression used to extract the metric from the logs.
encrypt_inter_container_traffic=ENCRYPT_INTER_CONTAINER_TRAFFIC (bool, None)
specifies whether traffic between training containers is encrypted for the training job (default: False).
use_spot_instances=USE_SPOT_INSTANCES (bool, None)
specifies whether to use SageMaker Managed Spot instances for training. If enabled then the max_wait arg should also be set.
max_wait=MAX_WAIT (int, None)
timeout in seconds waiting for spot training job.
checkpoint_s3_uri=CHECKPOINT_S3_URI (str, None)
S3 URI in which to persist checkpoints that the algorithm persists (if any) during training.
checkpoint_local_path=CHECKPOINT_LOCAL_PATH (str, None)
local path that the algorithm writes its checkpoints to.
debugger_hook_config=DEBUGGER_HOOK_CONFIG (bool, None)
configuration for how debugging information is emitted with SageMaker Debugger. If not specified, a default one is created using the estimator’s output_path, unless the region does not support SageMaker Debugger. To disable SageMaker Debugger, set this parameter to False.
enable_sagemaker_metrics=ENABLE_SAGEMAKER_METRICS (bool, None)
enable SageMaker Metrics Time Series.
enable_network_isolation=ENABLE_NETWORK_ISOLATION (bool, None)
specifies whether container will run in network isolation mode (default: False).
disable_profiler=DISABLE_PROFILER (bool, None)
specifies whether Debugger monitoring and profiling will be disabled (default: False).
environment=ENVIRONMENT (typing.Dict[str, str], None)
environment variables to be set for use during training job
max_retry_attempts=MAX_RETRY_ATTEMPTS (int, None)
number of times to move a job to the STARTING status. You can specify between 1 and 30 attempts.
source_dir=SOURCE_DIR (str, None)
absolute, relative, or S3 URI Path to a directory with any other training source code dependencies aside from the entry point file (default: current working directory)
git_config=GIT_CONFIG (typing.Dict[str, str], None)
git configurations used for cloning files, including repo, branch, commit, 2FA_enabled, username, password, and token.
hyperparameters=HYPERPARAMETERS (typing.Dict[str, str], None)
dictionary containing the hyperparameters to initialize this estimator with.
container_log_level=CONTAINER_LOG_LEVEL (int, None)
log level to use within the container (default: logging.INFO).
code_location=CODE_LOCATION (str, None)
S3 prefix URI where custom code is uploaded.
dependencies=DEPENDENCIES (typing.List[str], None)
list of absolute or relative paths to directories with any additional libraries that should be exported to the container.
training_repository_access_mode=TRAINING_REPOSITORY_ACCESS_MODE (str, None)
specifies how SageMaker accesses the Docker image that contains the training algorithm.
training_repository_credentials_provider_arn=TRAINING_REPOSITORY_CREDENTIALS_PROVIDER_ARN (str, None)
Amazon Resource Name (ARN) of an AWS Lambda function that provides credentials to authenticate to the private Docker registry where your training image is hosted.
disable_output_compression=DISABLE_OUTPUT_COMPRESSION (bool, None)
when set to true, Model is uploaded to Amazon S3 without compression after training finishes.
enable_infra_check=ENABLE_INFRA_CHECK (bool, None)
specifies whether it is running Sagemaker built-in infra check jobs.
image_repo=IMAGE_REPO (str, None)
(remote jobs) the image repository to use when pushing patched images, must have push access. Ex: example.com/your/container
quiet=QUIET (bool, False)
whether to suppress verbose output for image building. Defaults to ``False``.
gcp_batch:
usage:
[project=PROJECT],[location=LOCATION]
optional arguments:
project=PROJECT (str, None)
Name of the GCP project. Defaults to the configured GCP project in the environment
location=LOCATION (str, us-central1)
Name of the location to schedule the job in. Defaults to us-central1
ray:
usage:
[cluster_config_file=CLUSTER_CONFIG_FILE],[cluster_name=CLUSTER_NAME],[dashboard_address=DASHBOARD_ADDRESS],[requirements=REQUIREMENTS]
optional arguments:
cluster_config_file=CLUSTER_CONFIG_FILE (str, None)
Use CLUSTER_CONFIG_FILE to access or create the Ray cluster.
cluster_name=CLUSTER_NAME (str, None)
Override the configured cluster name.
dashboard_address=DASHBOARD_ADDRESS (str, 127.0.0.1:8265)
Use ray status to get the dashboard address you will submit jobs against
requirements=REQUIREMENTS (str, None)
Path to requirements.txt
lsf:
usage:
[lsf_queue=LSF_QUEUE],[jobdir=JOBDIR],[container_workdir=CONTAINER_WORKDIR],[host_network=HOST_NETWORK],[shm_size=SHM_SIZE]
optional arguments:
lsf_queue=LSF_QUEUE (str, None)
queue name to submit jobs
jobdir=JOBDIR (str, None)
The directory to place the job code and outputs. The directory must not exist and will be created.
container_workdir=CONTAINER_WORKDIR (str, None)
working directory in container jobs
host_network=HOST_NETWORK (bool, False)
True if using the host network for jobs
shm_size=SHM_SIZE (str, 64m)
size of shared memory (/dev/shm) for jobs
Custom Images¶
Docker-based Schedulers¶
If you need more than the standard PyTorch libraries, you can add a custom Dockerfile or build your own Docker container to use as the base image for your TorchX jobs.
[11]:
%%writefile timm_app.py
import timm
print(timm.models.resnet18())
Writing timm_app.py
[12]:
%%writefile Dockerfile.torchx
FROM pytorch/pytorch:1.10.0-cuda11.3-cudnn8-runtime
RUN pip install timm
COPY . .
Writing Dockerfile.torchx
Once the Dockerfile is created, we can launch as normal and TorchX will automatically build the image using the newly provided Dockerfile instead of the default one.
[13]:
%%sh
torchx run --scheduler local_docker utils.python --script timm_app.py
torchx 2024-07-17 02:04:41 INFO loaded configs from /home/ec2-user/torchx/docs/source/.torchxconfig
torchx 2024-07-17 02:04:42 INFO Tracker configurations: {}
torchx 2024-07-17 02:04:42 INFO Checking for changes in workspace `file:///home/ec2-user/torchx/docs/source`...
torchx 2024-07-17 02:04:42 INFO To disable workspaces pass: --workspace="" from CLI or workspace=None programmatically.
torchx 2024-07-17 02:04:42 INFO Workspace `file:///home/ec2-user/torchx/docs/source` resolved to filesystem path `/home/ec2-user/torchx/docs/source`
torchx 2024-07-17 02:04:42 INFO Building workspace docker image (this may take a while)...
torchx 2024-07-17 02:04:42 INFO Step 1/4 : FROM pytorch/pytorch:1.10.0-cuda11.3-cudnn8-runtime
torchx 2024-07-17 02:05:35 INFO ---> c3f17e5ac010
torchx 2024-07-17 02:05:35 INFO Step 2/4 : RUN pip install timm
torchx 2024-07-17 02:05:56 INFO ---> Running in 41aee588382a
torchx 2024-07-17 02:05:56 INFO Collecting timm
torchx 2024-07-17 02:05:56 INFO Downloading timm-0.9.12-py3-none-any.whl (2.2 MB)
torchx 2024-07-17 02:05:56 INFO Collecting huggingface-hub
torchx 2024-07-17 02:05:56 INFO Downloading huggingface_hub-0.16.4-py3-none-any.whl (268 kB)
torchx 2024-07-17 02:05:57 INFO Collecting safetensors
torchx 2024-07-17 02:05:57 INFO Downloading safetensors-0.4.3-cp37-cp37m-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (1.2 MB)
torchx 2024-07-17 02:05:57 INFO Requirement already satisfied: torch>=1.7 in /opt/conda/lib/python3.7/site-packages (from timm) (1.10.0)
torchx 2024-07-17 02:05:57 INFO Requirement already satisfied: pyyaml in /opt/conda/lib/python3.7/site-packages (from timm) (5.4.1)
torchx 2024-07-17 02:05:57 INFO Requirement already satisfied: torchvision in /opt/conda/lib/python3.7/site-packages (from timm) (0.11.0)
torchx 2024-07-17 02:05:57 INFO Requirement already satisfied: typing_extensions in /opt/conda/lib/python3.7/site-packages (from torch>=1.7->timm) (3.10.0.2)
torchx 2024-07-17 02:05:57 INFO Collecting importlib-metadata
torchx 2024-07-17 02:05:57 INFO Downloading importlib_metadata-6.7.0-py3-none-any.whl (22 kB)
torchx 2024-07-17 02:05:57 INFO Requirement already satisfied: filelock in /opt/conda/lib/python3.7/site-packages (from huggingface-hub->timm) (3.0.12)
torchx 2024-07-17 02:05:57 INFO Collecting packaging>=20.9
torchx 2024-07-17 02:05:57 INFO Downloading packaging-24.0-py3-none-any.whl (53 kB)
torchx 2024-07-17 02:05:57 INFO Requirement already satisfied: tqdm>=4.42.1 in /opt/conda/lib/python3.7/site-packages (from huggingface-hub->timm) (4.61.2)
torchx 2024-07-17 02:05:57 INFO Collecting fsspec
torchx 2024-07-17 02:05:57 INFO Downloading fsspec-2023.1.0-py3-none-any.whl (143 kB)
torchx 2024-07-17 02:05:57 INFO Requirement already satisfied: requests in /opt/conda/lib/python3.7/site-packages (from huggingface-hub->timm) (2.25.1)
torchx 2024-07-17 02:05:57 INFO Collecting zipp>=0.5
torchx 2024-07-17 02:05:57 INFO Downloading zipp-3.15.0-py3-none-any.whl (6.8 kB)
torchx 2024-07-17 02:05:57 INFO Requirement already satisfied: certifi>=2017.4.17 in /opt/conda/lib/python3.7/site-packages (from requests->huggingface-hub->timm) (2021.10.8)
torchx 2024-07-17 02:05:57 INFO Requirement already satisfied: urllib3<1.27,>=1.21.1 in /opt/conda/lib/python3.7/site-packages (from requests->huggingface-hub->timm) (1.26.6)
torchx 2024-07-17 02:05:57 INFO Requirement already satisfied: chardet<5,>=3.0.2 in /opt/conda/lib/python3.7/site-packages (from requests->huggingface-hub->timm) (4.0.0)
torchx 2024-07-17 02:05:57 INFO Requirement already satisfied: idna<3,>=2.5 in /opt/conda/lib/python3.7/site-packages (from requests->huggingface-hub->timm) (2.10)
torchx 2024-07-17 02:05:57 INFO Requirement already satisfied: numpy in /opt/conda/lib/python3.7/site-packages (from torchvision->timm) (1.21.2)
torchx 2024-07-17 02:05:57 INFO Requirement already satisfied: pillow!=8.3.0,>=5.3.0 in /opt/conda/lib/python3.7/site-packages (from torchvision->timm) (8.4.0)
torchx 2024-07-17 02:05:58 INFO Installing collected packages: zipp, packaging, importlib-metadata, fsspec, safetensors, huggingface-hub, timm
torchx 2024-07-17 02:05:58 INFO Successfully installed fsspec-2023.1.0 huggingface-hub-0.16.4 importlib-metadata-6.7.0 packaging-24.0 safetensors-0.4.3 timm-0.9.12 zipp-3.15.0
torchx 2024-07-17 02:05:59 INFO ---> Removed intermediate container 41aee588382a
torchx 2024-07-17 02:05:59 INFO ---> b4cc51f902f4
torchx 2024-07-17 02:05:59 INFO Step 3/4 : COPY . .
torchx 2024-07-17 02:05:59 INFO ---> 61cad09424f0
torchx 2024-07-17 02:05:59 INFO Step 4/4 : LABEL torchx.pytorch.org/version=0.7.0
torchx 2024-07-17 02:05:59 INFO ---> Running in 4868dbc915a3
torchx 2024-07-17 02:05:59 INFO ---> Removed intermediate container 4868dbc915a3
torchx 2024-07-17 02:05:59 INFO ---> 2465db272966
torchx 2024-07-17 02:05:59 INFO [Warning] One or more build-args [IMAGE WORKSPACE] were not consumed
torchx 2024-07-17 02:05:59 INFO Successfully built 2465db272966
torchx 2024-07-17 02:05:59 INFO Built new image `sha256:2465db272966c79ad5c698451a676e7cf35ae447dc37dc2c34d83ea5dd07eb9f` based on original image `ghcr.io/pytorch/torchx:0.7.0` and changes in workspace `file:///home/ec2-user/torchx/docs/source` for role[0]=python.
torchx 2024-07-17 02:06:00 INFO Waiting for the app to finish...
python/0 ResNet(
python/0 (conv1): Conv2d(3, 64, kernel_size=(7, 7), stride=(2, 2), padding=(3, 3), bias=False)
python/0 (bn1): BatchNorm2d(64, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
python/0 (act1): ReLU(inplace=True)
python/0 (maxpool): MaxPool2d(kernel_size=3, stride=2, padding=1, dilation=1, ceil_mode=False)
python/0 (layer1): Sequential(
python/0 (0): BasicBlock(
python/0 (conv1): Conv2d(64, 64, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1), bias=False)
python/0 (bn1): BatchNorm2d(64, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
python/0 (drop_block): Identity()
python/0 (act1): ReLU(inplace=True)
python/0 (aa): Identity()
python/0 (conv2): Conv2d(64, 64, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1), bias=False)
python/0 (bn2): BatchNorm2d(64, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
python/0 (act2): ReLU(inplace=True)
python/0 )
python/0 (1): BasicBlock(
python/0 (conv1): Conv2d(64, 64, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1), bias=False)
python/0 (bn1): BatchNorm2d(64, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
python/0 (drop_block): Identity()
python/0 (act1): ReLU(inplace=True)
python/0 (aa): Identity()
python/0 (conv2): Conv2d(64, 64, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1), bias=False)
python/0 (bn2): BatchNorm2d(64, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
python/0 (act2): ReLU(inplace=True)
python/0 )
python/0 )
python/0 (layer2): Sequential(
python/0 (0): BasicBlock(
python/0 (conv1): Conv2d(64, 128, kernel_size=(3, 3), stride=(2, 2), padding=(1, 1), bias=False)
python/0 (bn1): BatchNorm2d(128, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
python/0 (drop_block): Identity()
python/0 (act1): ReLU(inplace=True)
python/0 (aa): Identity()
python/0 (conv2): Conv2d(128, 128, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1), bias=False)
python/0 (bn2): BatchNorm2d(128, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
python/0 (act2): ReLU(inplace=True)
python/0 (downsample): Sequential(
python/0 (0): Conv2d(64, 128, kernel_size=(1, 1), stride=(2, 2), bias=False)
python/0 (1): BatchNorm2d(128, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
python/0 )
python/0 )
python/0 (1): BasicBlock(
python/0 (conv1): Conv2d(128, 128, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1), bias=False)
python/0 (bn1): BatchNorm2d(128, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
python/0 (drop_block): Identity()
python/0 (act1): ReLU(inplace=True)
python/0 (aa): Identity()
python/0 (conv2): Conv2d(128, 128, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1), bias=False)
python/0 (bn2): BatchNorm2d(128, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
python/0 (act2): ReLU(inplace=True)
python/0 )
python/0 )
python/0 (layer3): Sequential(
python/0 (0): BasicBlock(
python/0 (conv1): Conv2d(128, 256, kernel_size=(3, 3), stride=(2, 2), padding=(1, 1), bias=False)
python/0 (bn1): BatchNorm2d(256, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
python/0 (drop_block): Identity()
python/0 (act1): ReLU(inplace=True)
python/0 (aa): Identity()
python/0 (conv2): Conv2d(256, 256, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1), bias=False)
python/0 (bn2): BatchNorm2d(256, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
python/0 (act2): ReLU(inplace=True)
python/0 (downsample): Sequential(
python/0 (0): Conv2d(128, 256, kernel_size=(1, 1), stride=(2, 2), bias=False)
python/0 (1): BatchNorm2d(256, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
python/0 )
python/0 )
python/0 (1): BasicBlock(
python/0 (conv1): Conv2d(256, 256, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1), bias=False)
python/0 (bn1): BatchNorm2d(256, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
python/0 (drop_block): Identity()
python/0 (act1): ReLU(inplace=True)
python/0 (aa): Identity()
python/0 (conv2): Conv2d(256, 256, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1), bias=False)
python/0 (bn2): BatchNorm2d(256, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
python/0 (act2): ReLU(inplace=True)
python/0 )
python/0 )
python/0 (layer4): Sequential(
python/0 (0): BasicBlock(
python/0 (conv1): Conv2d(256, 512, kernel_size=(3, 3), stride=(2, 2), padding=(1, 1), bias=False)
python/0 (bn1): BatchNorm2d(512, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
python/0 (drop_block): Identity()
python/0 (act1): ReLU(inplace=True)
python/0 (aa): Identity()
python/0 (conv2): Conv2d(512, 512, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1), bias=False)
python/0 (bn2): BatchNorm2d(512, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
python/0 (act2): ReLU(inplace=True)
python/0 (downsample): Sequential(
python/0 (0): Conv2d(256, 512, kernel_size=(1, 1), stride=(2, 2), bias=False)
python/0 (1): BatchNorm2d(512, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
python/0 )
python/0 )
python/0 (1): BasicBlock(
python/0 (conv1): Conv2d(512, 512, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1), bias=False)
python/0 (bn1): BatchNorm2d(512, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
python/0 (drop_block): Identity()
python/0 (act1): ReLU(inplace=True)
python/0 (aa): Identity()
python/0 (conv2): Conv2d(512, 512, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1), bias=False)
python/0 (bn2): BatchNorm2d(512, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
python/0 (act2): ReLU(inplace=True)
python/0 )
python/0 )
python/0 (global_pool): SelectAdaptivePool2d(pool_type=avg, flatten=Flatten(start_dim=1, end_dim=-1))
python/0 (fc): Linear(in_features=512, out_features=1000, bias=True)
python/0 )
torchx 2024-07-17 02:06:02 INFO Job finished: SUCCEEDED
local_docker://torchx/torchx_utils_python-g46zjxvt7zg0f
Slurm¶
The slurm and local_cwd schedulers use the current environment, so you can use pip and conda as you normally would.
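Since the job inherits whatever is on your PATH, a typical flow is to prepare the environment first and then launch (a sketch; my_env is a placeholder for your own conda environment):
$ conda activate my_env
$ pip install torchx
$ torchx run --scheduler slurm dist.ddp -j 2x2 --script dist_app.py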
Next Steps¶
Check out other features of the torchx CLI
Take a look at the list of schedulers supported by the runner
Browse through the collection of builtin components
See which ML pipeline platforms you can run components on
See a training app example