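Build Instructions¶

Note: The most up-to-date build instructions are embedded in the set of scripts under setup_env.bash in the FBGEMM repo.

The currently available FBGEMM_GPU build variants are:

CPU-only
CUDA
GenAI (experimental)
ROCm

The general steps for building FBGEMM_GPU are as follows:

Set up an isolated build environment.
Set up the toolchain for a CPU-only, CUDA, or ROCm build.
Install PyTorch.
Run the build script.

Set Up an Isolated Build Environment¶

Install Miniconda¶

Setting up a Miniconda environment is recommended for reproducible builds: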

export PLATFORM_NAME="$(uname -s)-$(uname -m)"

# Set the Miniconda prefix directory
miniconda_prefix=$HOME/miniconda

# Download the Miniconda installer
wget -q "https://repo.anaconda.com/miniconda/Miniconda3-latest-${PLATFORM_NAME}.sh" -O miniconda.sh

# Run the installer
bash miniconda.sh -b -p "$miniconda_prefix" -u

# Load the shortcuts
. ~/.bashrc

# Run updates
conda update -n base -c defaults -y conda
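
The PLATFORM_NAME value above is built directly from uname. As a minimal sketch, assuming the Miniconda installers for macOS are published with a MacOSX prefix rather than Darwin, a small helper can normalize the name first. The function name miniconda_platform_name is hypothetical and not part of the FBGEMM scripts.

```shell
# Map `uname -s` / `uname -m` values to the platform suffix used in
# Miniconda installer file names (assumed mapping; check the download listing).
miniconda_platform_name() {
  local kernel="$1" arch="$2"
  case "${kernel}" in
    Darwin) kernel="MacOSX" ;;  # assumed: installers use "MacOSX", not "Darwin"
    Linux)  ;;                  # already matches the installer naming
    *) echo "unsupported kernel: ${kernel}" >&2; return 1 ;;
  esac
  echo "${kernel}-${arch}"
}

# Example usage on the current machine:
# PLATFORM_NAME="$(miniconda_platform_name "$(uname -s)" "$(uname -m)")"
```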

From here on, all installation commands are run inside or against a Conda environment.

Set Up the Conda Environment¶

Create a Conda environment with the specified Python version:

env_name=<ENV NAME>
python_version=3.13

# Create the environment
conda create -y --name ${env_name} python="${python_version}"

# Upgrade PIP and the pyOpenSSL package
conda run -n ${env_name} pip install --upgrade pip
# Quote the requirement so the shell does not treat '>' as a redirection
conda run -n ${env_name} python -m pip install "pyOpenSSL>22.1.0"

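Set Up for CPU-Only Build¶

Follow the instructions in Set Up an Isolated Build Environment, followed by Install the Build Tools, to set up the Conda environment.

Set Up for CUDA / GenAI-Only Build¶

The CUDA build of FBGEMM_GPU requires a recent version of nvcc that supports compute capability 3.5+. The CUDA build environment can be set up either through pre-built Docker images or through a Conda installation on bare metal. Note that neither a GPU nor the NVIDIA drivers are needed at build time, since they are only used at runtime.

CUDA Docker Image¶

For setups through Docker, simply pull the pre-installed CUDA Docker image for the desired Linux distribution and CUDA version: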

# Run for Ubuntu 22.04, CUDA 11.8
docker run -it --entrypoint "/bin/bash" nvidia/cuda:11.8.0-devel-ubuntu22.04

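From here, the rest of the build environment can be constructed through Conda, as it is still the recommended mechanism for creating an isolated and reproducible build environment.

Install CUDA¶

Install the full CUDA package through Conda, which includes NVML: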

# See https://anaconda.org/nvidia/cuda for all available versions of CUDA
cuda_version=12.4.1

# Install the full CUDA package
conda install -n ${env_name} -y cuda -c "nvidia/label/cuda-${cuda_version}"

Verify that cuda_runtime.h, libnvidia-ml.so, and libnccl.so* are found:

conda_prefix=$(conda run -n ${env_name} printenv CONDA_PREFIX)

find "${conda_prefix}" -name cuda_runtime.h
find "${conda_prefix}" -name libnvidia-ml.so
find "${conda_prefix}" -name "libnccl.so*"
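
The find commands above print nothing when a file is missing, which is easy to overlook in a script. The following is a minimal sketch of a wrapper that fails loudly instead; require_file is a hypothetical helper name, not something provided by FBGEMM.

```shell
# Fail with a clear message when a required file is missing under a prefix.
# Usage: require_file <prefix> <name-pattern>
require_file() {
  local prefix="$1" pattern="$2" match
  # Keep only the first match; the pattern is quoted so find expands it, not the shell
  match="$(find "${prefix}" -name "${pattern}" 2>/dev/null | head -n 1)"
  if [ -z "${match}" ]; then
    echo "missing: ${pattern} (searched ${prefix})" >&2
    return 1
  fi
  echo "${match}"
}

# Example usage:
# require_file "${conda_prefix}" cuda_runtime.h
# require_file "${conda_prefix}" 'libnccl.so*'
```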

Install cuDNN¶

cuDNN is a build-time dependency for the CUDA variant of FBGEMM_GPU. Download and extract the cuDNN package for the given CUDA version:

# cuDNN package URLs for each platform and CUDA version can be found in:
# https://github.com/pytorch/builder/blob/main/common/install_cuda.sh
cudnn_url=https://developer.download.nvidia.com/compute/cudnn/redist/cudnn/linux-x86_64/cudnn-linux-x86_64-8.9.2.26_cuda12-archive.tar.xz

# Download and unpack cuDNN
wget -q "${cudnn_url}" -O cudnn.tar.xz
tar -xvf cudnn.tar.xz
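
The CUDA build step later expects CUDNN_INCLUDE_DIR and CUDNN_LIBRARY to point at the unpacked files. As a sketch, assuming the tarball extracts into a top-level directory named after the archive file (an assumption worth confirming with ls after unpacking), the paths can be derived from the URL. The helper name cudnn_dir_from_url is hypothetical.

```shell
# Derive the directory that the cuDNN archive unpacks into from its URL.
# Assumption: the top-level directory is the file name without ".tar.xz".
cudnn_dir_from_url() {
  basename "$1" .tar.xz
}

cudnn_url=https://developer.download.nvidia.com/compute/cudnn/redist/cudnn/linux-x86_64/cudnn-linux-x86_64-8.9.2.26_cuda12-archive.tar.xz
cudnn_dir="$(cudnn_dir_from_url "${cudnn_url}")"

# Variables consumed by the CUDA build; the include/ and lib/ layout is assumed
export CUDNN_INCLUDE_DIR="${PWD}/${cudnn_dir}/include"
export CUDNN_LIBRARY="${PWD}/${cudnn_dir}/lib"
```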

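Install CUTLASS¶

This section applies only to building the experimental FBGEMM_GPU GenAI module. CUTLASS should already be available in the repository as a git submodule (see Preparing the Build). The required include paths are already added to the CMake configuration.

Set Up for ROCm Build¶

FBGEMM_GPU supports running on AMD (ROCm) devices. The ROCm build environment can be set up either through pre-built Docker images or on bare metal.

ROCm Docker Image¶

For setups through Docker, simply pull the pre-installed minimal ROCm Docker image for the desired ROCm version: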

# Run for ROCm 6.2.0
docker run -it --entrypoint "/bin/bash" rocm/rocm-terminal:6.2.0

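The full ROCm Docker image comes with all ROCm packages pre-installed, but it results in a very large Docker container. For this reason, the minimal image is recommended for building and running FBGEMM_GPU.

From here, the rest of the build environment can be constructed through Conda, as it is still the recommended mechanism for creating an isolated and reproducible build environment.

Install ROCm¶

Install the full ROCm package through the operating system's package manager. The full instructions can be found in the ROCm installation guide: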

# [OPTIONAL] Disable apt installation prompts
export DEBIAN_FRONTEND=noninteractive

# Update the repo DB
apt update

# Download the installer
wget -q https://repo.radeon.com/amdgpu-install/6.3.1/ubuntu/focal/amdgpu-install_6.3.60301-1_all.deb -O amdgpu-install.deb

# Run the installer
apt install ./amdgpu-install.deb

# Install ROCm
amdgpu-install -y --usecase=hiplibsdk,rocm --no-dkms

Install MIOpen¶

MIOpen is a dependency of the ROCm variant of FBGEMM_GPU and needs to be installed:

apt install hipify-clang miopen-hip miopen-hip-dev

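Install the Build Tools¶

The instructions in this section apply to builds of all FBGEMM_GPU variants.

C/C++ Compiler (GCC)¶

Install a version of the GCC toolchain that supports C++20. The sysroot package also needs to be installed to avoid missing GLIBCXX versioned symbols when compiling FBGEMM_CPU: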

# Set GCC to 10.4.0 to keep compatibility with older versions of GLIBCXX
#
# A newer version of GCC also works, but it will need to be accompanied by an
# appropriately updated version of the sysroot_linux package.
gcc_version=10.4.0

conda install -n ${env_name} -c conda-forge --override-channels -y \
  gxx_linux-64=${gcc_version} \
  sysroot_linux-64=2.17

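While newer versions of GCC can be used, binaries compiled with them will not be compatible with older systems such as Ubuntu 20.04 or CentOS Stream 8, because the compiled library will reference symbols from GLIBCXX versions that the system's libstdc++.so.6 does not support. To see which versions of GLIBC and GLIBCXX the available libstdc++.so.6 supports: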

libcxx_path=/path/to/libstdc++.so.6

# Print the supported GLIBC versions
objdump -TC "${libcxx_path}" | grep GLIBC_ | sed 's/.*GLIBC_\([.0-9]*\).*/GLIBC_\1/g' | sort -Vu | cat

# Print the supported GLIBCXX versions
objdump -TC "${libcxx_path}" | grep GLIBCXX_ | sed 's/.*GLIBCXX_\([.0-9]*\).*/GLIBCXX_\1/g' | sort -Vu | cat
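
When comparing a build against a target system, usually only the highest supported version matters. The sketch below reduces objdump output to that single value and compares two version strings using the same sort -V ordering as the commands above. Both function names are hypothetical.

```shell
# Read `objdump -TC` output on stdin and print the highest GLIBCXX version.
max_glibcxx_version() {
  grep -o 'GLIBCXX_[0-9][.0-9]*' | sort -Vu | tail -n 1
}

# Succeed when version $1 is less than or equal to version $2.
version_le() {
  [ "$(printf '%s\n%s\n' "$1" "$2" | sort -V | head -n 1)" = "$1" ]
}

# Example usage:
# objdump -TC "${libcxx_path}" | max_glibcxx_version
# version_le 3.4.28 3.4.29 && echo "the target system is new enough"
```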

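C/C++ Compiler (Clang)¶

It is possible to build FBGEMM and FBGEMM_GPU (CPU-only and CUDA variants only) using Clang as the host compiler. To do so, install a version of the Clang toolchain that supports C++20: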

# Minimum LLVM+Clang version required for FBGEMM_GPU
llvm_version=16.0.6

# NOTE: libcxx from conda-forge is outdated for linux-aarch64, so we cannot
# explicitly specify the version number
conda install -n ${env_name} -c conda-forge --override-channels -y \
    clangxx=${llvm_version} \
    libcxx \
    llvm-openmp=${llvm_version} \
    compiler-rt=${llvm_version}

# Append $CONDA_PREFIX/lib to $LD_LIBRARY_PATH in the Conda environment
ld_library_path=$(conda run -n ${env_name} printenv LD_LIBRARY_PATH)
conda_prefix=$(conda run -n ${env_name} printenv CONDA_PREFIX)
conda env config vars set -n ${env_name} LD_LIBRARY_PATH="${ld_library_path}:${conda_prefix}/lib"

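# Path to the Clang C++ compiler; assumed here to be the clang++ binary that
# the clangxx package installs into the Conda environment
clangxx_path="${conda_prefix}/bin/clang++"

# Set NVCC_PREPEND_FLAGS in the Conda environment for Clang to work correctly as the host compiler
# The whole KEY=VALUE pair is quoted so that it is passed to conda as a single argument
conda env config vars set -n ${env_name} NVCC_PREPEND_FLAGS="-std=c++20 -Xcompiler -std=c++20 -Xcompiler -stdlib=libstdc++ -ccbin ${clangxx_path} -allow-unsupported-compiler"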
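
One detail of the LD_LIBRARY_PATH step above: if the variable is empty in the environment, the concatenation yields a value with a leading colon, and the dynamic linker treats an empty entry as the current directory. A minimal sketch of a helper that avoids this follows; append_path is a hypothetical name.

```shell
# Append a directory to a colon-separated path list without producing a
# leading ':' when the existing list is empty.
append_path() {
  local current="$1" extra="$2"
  # ${current:+...} expands to "<current>:" only when current is non-empty
  echo "${current:+${current}:}${extra}"
}

# Example usage:
# conda env config vars set -n ${env_name} \
#   LD_LIBRARY_PATH="$(append_path "${ld_library_path}" "${conda_prefix}/lib")"
```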

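Note that for CUDA code compilation, even though nvcc supports Clang as the host compiler, only libstdc++ (GCC's implementation of the C++ standard library) is supported for any host compiler used by nvcc.

This means that GCC is a required dependency of the CUDA variant of FBGEMM_GPU, regardless of whether it is built with Clang. In this case, it is recommended to install the GCC toolchain first and the Clang toolchain second; see C/C++ Compiler (GCC) for instructions.

Compiler Symlinks¶

After installing the compiler toolchains, symlink the C and C++ compilers to the binpath (overriding existing symlinks as needed). In a Conda environment, the binpath is located at $CONDA_PREFIX/bin: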

conda_prefix=$(conda run -n ${env_name} printenv CONDA_PREFIX)

# Use the C compiler path for cc and the C++ compiler path for c++
ln -sf "${path_to_either_gcc_or_clang}" "${conda_prefix}/bin/cc"
ln -sf "${path_to_either_gcc_or_clang}" "${conda_prefix}/bin/c++"
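
Since the build configuration stage depends on these symlinks, it can help to verify them right after creating them. The following sketch wraps both steps in one function; link_compilers and its arguments are hypothetical names used only for illustration.

```shell
# Symlink a C and a C++ compiler into <bindir> as cc and c++, then verify
# that each link points at the intended target.
# Usage: link_compilers <c-compiler> <cxx-compiler> <bindir>
link_compilers() {
  local c_compiler="$1" cxx_compiler="$2" bindir="$3"
  ln -sf "${c_compiler}" "${bindir}/cc"
  ln -sf "${cxx_compiler}" "${bindir}/c++"
  # readlink prints the target that each symlink points to
  [ "$(readlink "${bindir}/cc")" = "${c_compiler}" ] &&
    [ "$(readlink "${bindir}/c++")" = "${cxx_compiler}" ]
}

# Example usage:
# link_compilers /path/to/c-compiler /path/to/cxx-compiler "${conda_prefix}/bin"
```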

These symlinks are used later, in the FBGEMM_GPU build configuration stage.

Other Build Tools¶

Install the other necessary build tools, such as ninja and cmake:

conda install -n ${env_name} -c conda-forge --override-channels -y \
    click \
    cmake \
    hypothesis \
    jinja2 \
    make \
    ncurses \
    ninja \
    numpy \
    scikit-build \
    wheel

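Install PyTorch¶

The official PyTorch homepage contains the most authoritative instructions on how to install PyTorch, either through Conda or through PIP.

Installation Through Conda¶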

# Install the latest nightly
conda install -n ${env_name} -y pytorch -c pytorch-nightly

# Install the latest test (RC)
conda install -n ${env_name} -y pytorch -c pytorch-test

# Install a specific version
conda install -n ${env_name} -y pytorch==2.0.0 -c pytorch

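Note that installing PyTorch through Conda without specifying a version (as in the case of nightly builds) is not always reliable. For example, the GPU variant of the PyTorch nightly build is known to arrive in Conda 2 hours later than the CPU-only variant. A Conda installation of pytorch-nightly within that time window will therefore silently fall back to the CPU-only variant.

Also note that, because the GPU and CPU-only variants of PyTorch are placed in the same artifact bucket, the variant selected during installation depends on whether CUDA is installed on the system. For GPU builds, it is therefore important to install CUDA / ROCm before PyTorch.

Installation Through PyTorch PIP¶

Installing PyTorch through PyTorch PIP is recommended over Conda, as it is more deterministic and thus more reliable: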

# Install the latest nightly, CPU variant
conda run -n ${env_name} pip install --pre torch --index-url https://download.pytorch.org/whl/nightly/cpu/

# Install the latest test (RC), CUDA variant
conda run -n ${env_name} pip install --pre torch --index-url https://download.pytorch.org/whl/test/cu126/

# Install a specific version, CUDA variant
conda run -n ${env_name} pip install torch==2.6.0+cu126 --index-url https://download.pytorch.org/whl/cu126/

# Install the latest nightly, ROCm variant
conda run -n ${env_name} pip install --pre torch --index-url https://download.pytorch.org/whl/nightly/rocm6.3/

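For installing the ROCm variant of PyTorch, PyTorch PIP is the only available channel as of the time of writing.

Post-Install Checks¶

Verify the PyTorch installation (version and variant) with an import test: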

# Ensure that the package loads properly
conda run -n ${env_name} python -c "import torch.distributed"

# Verify the version and variant of the installation
conda run -n ${env_name} python -c "import torch; print(torch.__version__)"
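
Because an installation can silently fall back to the CPU-only variant, a script may want to check the variant programmatically rather than by eye. The sketch below classifies a version string by its local version suffix, assuming the suffix convention seen in the PIP index URLs above (for example 2.6.0+cu126). The function name torch_variant is hypothetical.

```shell
# Classify a PyTorch version string by its local version suffix.
torch_variant() {
  case "$1" in
    *+cu[0-9]*) echo "cuda" ;;
    *+rocm*)    echo "rocm" ;;
    *+cpu)      echo "cpu" ;;
    *)          echo "unknown" ;;  # no recognizable suffix
  esac
}

# Example usage:
# torch_variant "$(conda run -n ${env_name} python -c 'import torch; print(torch.__version__)')"
```

Builds whose version string carries no suffix are reported as unknown; in that case, inspecting torch.version.cuda or torch.version.hip from Python is one alternative.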

For the CUDA variant of PyTorch, verify that at least cuda_cmake_macros.h is found:

conda_prefix=$(conda run -n ${env_name} printenv CONDA_PREFIX)
find "${conda_prefix}" -name cuda_cmake_macros.h

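Install PyTorch-Triton¶

This section applies only to building the experimental FBGEMM_GPU Triton-GEMM module. Triton should be installed through pytorch-triton, which generally comes installed with torch, but can also be installed manually: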

# pytorch-triton repos:
# https://download.pytorch.org/whl/nightly/pytorch-triton/
# https://download.pytorch.org/whl/nightly/pytorch-triton-rocm/

# The version SHA should follow the one pinned in PyTorch
# https://github.com/pytorch/pytorch/blob/main/.ci/docker/ci_commit_pins/triton.txt
conda run -n ${env_name} pip install --pre pytorch-triton==3.0.0+dedb7bdf33 --index-url https://download.pytorch.org/whl/nightly/

Verify the PyTorch-Triton installation with an import test:

# Ensure that the package loads properly
conda run -n ${env_name} python -c "import triton"

Other Pre-Build Setup¶

Preparing the Build¶

Clone the repo along with its submodules, and install requirements.txt:

# !! Run inside the Conda environment !!

# Select a version tag
FBGEMM_VERSION=v1.0.0

# Clone the repo along with its submodules
git clone --recursive -b ${FBGEMM_VERSION} https://github.com/pytorch/FBGEMM.git fbgemm_${FBGEMM_VERSION}

# Install additional required packages for building and testing
cd fbgemm_${FBGEMM_VERSION}/fbgemm_gpu
pip install -r requirements.txt

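The Build Process¶

The FBGEMM_GPU build process uses a scikit-build CMake-based build flow, and it keeps state across install runs. As a result, builds can become stale, which can cause problems when a build is re-run after a failure caused by, for example, missing dependencies. To address this, simply clear the build cache: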

# !! Run in fbgemm_gpu/ directory inside the Conda environment !!

python setup.py clean

Set Wheel Build Variables¶

When building the Python wheel, the package name, Python version tag, and Python platform name must first be set correctly:

# Set the package name depending on the build variant:
# one of fbgemm_gpu_cpu, fbgemm_gpu_cuda, or fbgemm_gpu_rocm
export package_name=fbgemm_gpu_cpu

# Set the Python version tag.  It should follow the convention `py<major><minor>`,
# e.g. Python 3.13 --> py313
export python_tag=py313

# Determine the processor architecture
export ARCH=$(uname -m)

# Set the Python platform name; enable only the line that matches the target platform
# For the Linux case
export python_plat_name="manylinux_2_28_${ARCH}"
# For the macOS (x86_64) case
# export python_plat_name="macosx_10_9_${ARCH}"
# For the macOS (arm64) case
# export python_plat_name="macosx_11_0_${ARCH}"
# For the Windows case
# export python_plat_name="win_${ARCH}"
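
The tag and platform name can also be derived instead of typed by hand. The following sketch uses the py<major><minor> convention and the Linux and macOS platform strings from the block above; the Windows case is omitted. Both function names are hypothetical.

```shell
# Convert a Python version such as "3.13" or "3.13.1" into a tag such as "py313".
python_tag_for() {
  local major minor
  major="$(echo "$1" | cut -d. -f1)"
  minor="$(echo "$1" | cut -d. -f2)"
  echo "py${major}${minor}"
}

# Pick the wheel platform name from the kernel name and the architecture.
python_plat_name_for() {
  local kernel="$1" arch="$2"
  case "${kernel}-${arch}" in
    Linux-*)       echo "manylinux_2_28_${arch}" ;;
    Darwin-x86_64) echo "macosx_10_9_${arch}" ;;
    Darwin-arm64)  echo "macosx_11_0_${arch}" ;;
    *) echo "unsupported platform: ${kernel}-${arch}" >&2; return 1 ;;
  esac
}

# Example usage:
# export python_tag="$(python_tag_for "${python_version}")"
# export python_plat_name="$(python_plat_name_for "$(uname -s)" "$(uname -m)")"
```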

CPU-Only Build¶

For CPU-only builds, the --package_variant=cpu flag needs to be specified.

# !! Run in fbgemm_gpu/ directory inside the Conda environment !!

# Build the wheel artifact only
python setup.py bdist_wheel \
    --package_variant=cpu \
    --python-tag="${python_tag}" \
    --plat-name="${python_plat_name}"

# Build and install the library into the Conda environment (GCC)
python setup.py install \
    --package_variant=cpu

To build using Clang + libstdc++ instead of GCC, simply append the --cxxprefix flag:

# !! Run in fbgemm_gpu/ directory inside the Conda environment !!

# Build the wheel artifact only
python setup.py bdist_wheel \
    --package_variant=cpu \
    --python-tag="${python_tag}" \
    --plat-name="${python_plat_name}" \
    --cxxprefix=$CONDA_PREFIX

# Build and install the library into the Conda environment (Clang)
python setup.py install \
    --package_variant=cpu \
    --cxxprefix=$CONDA_PREFIX

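Note that this presumes the Clang toolchain is properly installed along with the GCC toolchain, and is available as ${cxxprefix}/bin/cc and ${cxxprefix}/bin/c++.

To enable runtime debug features, such as device-side assertions in CUDA and HIP, simply append the --debug flag when invoking setup.py.

CUDA Build¶

Building FBGEMM_GPU for CUDA requires both NVML and cuDNN to be installed and made available to the build through environment variables. A CUDA device, however, is not required for building the package.

Similar to CPU-only builds, building with Clang + libstdc++ can be enabled by appending --cxxprefix=$CONDA_PREFIX to the build command, provided the toolchains have been properly installed.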

# !! Run in fbgemm_gpu/ directory inside the Conda environment !!

# [OPTIONAL] Specify the CUDA installation paths
# This may be required if CMake is unable to find nvcc
export CUDACXX=/path/to/nvcc
export CUDA_BIN_PATH=/path/to/cuda/installation

# [OPTIONAL] Provide the CUB installation directory (applicable only to CUDA versions prior to 11.1)
export CUB_DIR=/path/to/cub

# [OPTIONAL] Allow NVCC to use host compilers that are newer than what NVCC officially supports
nvcc_prepend_flags=(
  -allow-unsupported-compiler
)

# [OPTIONAL] If clang is the host compiler, set NVCC to use libstdc++ since libc++ is not supported
nvcc_prepend_flags+=(
  -Xcompiler -stdlib=libstdc++
  -ccbin "/path/to/clang++"
)

# [OPTIONAL] Set NVCC_PREPEND_FLAGS as needed
export NVCC_PREPEND_FLAGS="${nvcc_prepend_flags[@]}"

# [OPTIONAL] Enable verbose NVCC logs
export NVCC_VERBOSE=1

# Specify cuDNN header and library paths
export CUDNN_INCLUDE_DIR=/path/to/cudnn/include
export CUDNN_LIBRARY=/path/to/cudnn/lib

# Specify NVML filepath
export NVML_LIB_PATH=/path/to/libnvidia-ml.so

# Specify NCCL filepath
export NCCL_LIB_PATH=/path/to/libnccl.so.2

# Build for SM70/80 (V100/A100 GPU); update as needed
# If not specified, only the CUDA architecture supported by the current system will be targeted
# If not specified and no CUDA device is present either, all CUDA architectures will be targeted
# The value is quoted because an unquoted ';' would end the shell command
cuda_arch_list="7.0;8.0"

# Unset TORCH_CUDA_ARCH_LIST if it exists, because it takes precedence over
# -DTORCH_CUDA_ARCH_LIST during the invocation of setup.py
unset TORCH_CUDA_ARCH_LIST

# Build the wheel artifact only
python setup.py bdist_wheel \
    --package_variant=cuda \
    --python-tag="${python_tag}" \
    --plat-name="${python_plat_name}" \
    --nvml_lib_path=${NVML_LIB_PATH} \
    --nccl_lib_path=${NCCL_LIB_PATH} \
    -DTORCH_CUDA_ARCH_LIST="${cuda_arch_list}"

# Build and install the library into the Conda environment
python setup.py install \
    --package_variant=cuda \
    --nvml_lib_path=${NVML_LIB_PATH} \
    --nccl_lib_path=${NCCL_LIB_PATH} \
    -DTORCH_CUDA_ARCH_LIST="${cuda_arch_list}"
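
The architecture list is a semicolon-separated string, and because ';' is also the shell's command separator, a malformed or unquoted value is an easy mistake to make. The sketch below checks the value before it is passed to setup.py. It accepts only numeric <major>.<minor> entries with an optional +PTX suffix, which is an assumption of this sketch and may be stricter than what the build accepts. The function name validate_cuda_arch_list is hypothetical.

```shell
# Validate a semicolon-separated architecture list such as "7.0;8.0".
# Assumption: entries are <major>.<minor>, optionally followed by "+PTX".
validate_cuda_arch_list() {
  local list="$1" entry old_ifs
  [ -n "${list}" ] || return 1
  # Split on ';' by temporarily changing IFS
  old_ifs="${IFS}"
  IFS=';'
  set -- ${list}
  IFS="${old_ifs}"
  for entry in "$@"; do
    echo "${entry}" | grep -Eq '^[0-9]+\.[0-9]+(\+PTX)?$' || return 1
  done
}

# Example usage:
# validate_cuda_arch_list "${cuda_arch_list}" || echo "bad arch list" >&2
```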

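GenAI-Only Build¶

By default, the CUDA build of FBGEMM_GPU includes all experimental modules used for GenAI applications. The instructions for building only the experimental modules are the same as those for a CUDA build, except that --package_variant=genai is specified in the build invocation: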

# Build the wheel artifact only
python setup.py bdist_wheel \
    --package_variant=genai \
    --python-tag="${python_tag}" \
    --plat-name="${python_plat_name}" \
    --nvml_lib_path=${NVML_LIB_PATH} \
    --nccl_lib_path=${NCCL_LIB_PATH} \
    -DTORCH_CUDA_ARCH_LIST="${cuda_arch_list}"

# Build and install the library into the Conda environment
python setup.py install \
    --package_variant=genai \
    --nvml_lib_path=${NVML_LIB_PATH} \
    --nccl_lib_path=${NCCL_LIB_PATH} \
    -DTORCH_CUDA_ARCH_LIST="${cuda_arch_list}"

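Note that the experimental modules currently support CUDA only.

ROCm Build¶

For ROCm builds, ROCM_PATH and PYTORCH_ROCM_ARCH need to be specified. A ROCm device, however, is not required for building the package.

Similar to CPU-only and CUDA builds, building with Clang + libstdc++ can be enabled by appending --cxxprefix=$CONDA_PREFIX to the build command, provided the toolchains have been properly installed.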

# !! Run in fbgemm_gpu/ directory inside the Conda environment !!

export ROCM_PATH=/path/to/rocm

# [OPTIONAL] Enable verbose HIPCC logs
export HIPCC_VERBOSE=1

# Build for the target architecture of the ROCm device installed on the machine (e.g. 'gfx908,gfx90a,gfx942')
# See https://rocm.docs.amd.com/en/latest/reference/gpu-arch-specs.html for list
export PYTORCH_ROCM_ARCH=$(${ROCM_PATH}/bin/rocminfo | grep -o -m 1 'gfx.*')

# Build the wheel artifact only
python setup.py bdist_wheel \
    --package_variant=rocm \
    --python-tag="${python_tag}" \
    --plat-name="${python_plat_name}" \
    -DAMDGPU_TARGETS="${PYTORCH_ROCM_ARCH}" \
    -DHIP_ROOT_DIR="${ROCM_PATH}" \
    -DCMAKE_C_FLAGS="-DTORCH_USE_HIP_DSA" \
    -DCMAKE_CXX_FLAGS="-DTORCH_USE_HIP_DSA"

# Build and install the library into the Conda environment
python setup.py install \
    --package_variant=rocm \
    -DAMDGPU_TARGETS="${PYTORCH_ROCM_ARCH}" \
    -DHIP_ROOT_DIR="${ROCM_PATH}" \
    -DCMAKE_C_FLAGS="-DTORCH_USE_HIP_DSA" \
    -DCMAKE_CXX_FLAGS="-DTORCH_USE_HIP_DSA"
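
The PYTORCH_ROCM_ARCH line above keeps only the first gfx match reported by rocminfo. On a machine with several different devices, a list of all unique targets may be preferable. The following is a sketch under the assumption that a comma-separated list, as in the example in the comment above, is accepted. The function name rocm_arch_list is hypothetical.

```shell
# Read `rocminfo` output on stdin and print the unique gfx targets,
# comma-separated (for example "gfx90a,gfx942").
rocm_arch_list() {
  grep -oE 'gfx[0-9a-f]+' | LC_ALL=C sort -u | paste -sd, -
}

# Example usage:
# export PYTORCH_ROCM_ARCH="$("${ROCM_PATH}/bin/rocminfo" | rocm_arch_list)"
```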

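Post-Build Checks (For Developers)¶

After the build completes, it is useful to run some checks to verify that the build is actually correct.

Undefined Symbols Check¶

Because FBGEMM_GPU contains a large number of Jinja and C++ template instantiations, it is important to make sure that no undefined symbols are accidentally generated during development: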

# !! Run in fbgemm_gpu/ directory inside the Conda environment !!

# Locate the built .SO file
fbgemm_gpu_lib_path=$(find . -name fbgemm_gpu_py.so)

# Check that the undefined symbols don't include fbgemm_gpu-defined functions
nm -gDCu "${fbgemm_gpu_lib_path}" | sort
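
The nm output above still has to be read by hand. As a minimal sketch, the check can be automated by failing when any undefined symbol mentions the fbgemm_gpu namespace. This is deliberately conservative: it also flags foreign symbols that merely take fbgemm_gpu types as arguments. The function name is hypothetical.

```shell
# Read `nm -gDCu` output on stdin; succeed only if no undefined symbol
# mentions the fbgemm_gpu namespace.
check_no_undefined_fbgemm_symbols() {
  local offending
  # "|| true" keeps a no-match result from being treated as an error
  offending="$(grep -E '^[[:space:]]*U[[:space:]].*fbgemm_gpu::' || true)"
  if [ -n "${offending}" ]; then
    echo "undefined fbgemm_gpu symbols:" >&2
    echo "${offending}" >&2
    return 1
  fi
}

# Example usage:
# nm -gDCu "${fbgemm_gpu_lib_path}" | check_no_undefined_fbgemm_symbols
```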

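GLIBC Version Compatibility Check¶

It is also useful to verify the GLIBCXX version numbers referenced by the library, as well as the availability of certain function symbols: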

# !! Run in fbgemm_gpu/ directory inside the Conda environment !!

# Locate the built .SO file
fbgemm_gpu_lib_path=$(find . -name fbgemm_gpu_py.so)

# Note the versions of GLIBCXX referenced by the .SO
# The libstdc++.so.6 available on the install target must support these versions
objdump -TC "${fbgemm_gpu_lib_path}" | grep GLIBCXX | sed 's/.*GLIBCXX_\([.0-9]*\).*/GLIBCXX_\1/g' | sort -Vu | cat

# Test for the existence of a given function symbol in the .SO
nm -gDC "${fbgemm_gpu_lib_path}" | grep " fbgemm_gpu::merge_pooled_embeddings("
nm -gDC "${fbgemm_gpu_lib_path}" | grep " fbgemm_gpu::jagged_2d_to_dense("