Quantization Utilities

Reference Implementation Methods

template<typename T, layout_t LAYOUT = layout_t::KCX> void QuantizeGroupwise(const float *src, int K, int C, int X, int G, const float *scales, const std::int32_t *zero_points, T *dst)

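Quantizes the floating-point data in src to type T.

Template parameters:

T – output quantized data type (int8_t, uint8_t, and int32_t are supported)
LAYOUT – layout of the input tensor in src (KCX and KXC are supported). KCX corresponds to KCRS, or KCTRS for weight tensors with a time dimension. KXC corresponds to KRSC, or KTRSC for weight tensors with a time dimension.

Parameters:

K – number of output channels of the weight tensor
C – number of channels
X – R*S or T*R*S
G – number of groups (if G == C, the function performs channelwise quantization; if 1 < G < C, it performs groupwise quantization; if G == 1, it performs per-tensor quantization)
scales – floating-point scales. The size should equal G
zero_points – zero points (each should be representable in type T). The size should equal G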
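
The relationship between G, scales, and zero_points can be illustrated with a minimal scalar sketch for the KCX layout. This is not the library's implementation: the names QuantizeOne and QuantizeGroupwiseKCX are hypothetical, and the sketch assumes the usual affine formula q = clamp(round(x / scale) + zero_point), with C divisible by G and consecutive channels forming a group.

```cpp
#include <algorithm>
#include <cassert>
#include <cmath>
#include <cstdint>
#include <limits>

// Hypothetical helper: affine quantization of a single value.
// Assumes q = clamp(round(x / scale) + zero_point) to the range of T.
template <typename T>
T QuantizeOne(float x, float scale, std::int32_t zero_point) {
  const double lo = static_cast<double>(std::numeric_limits<T>::min());
  const double hi = static_cast<double>(std::numeric_limits<T>::max());
  const double q = std::nearbyint(static_cast<double>(x) / scale) + zero_point;
  return static_cast<T>(std::min(hi, std::max(lo, q)));
}

// Hypothetical sketch of groupwise quantization for a K x C x X tensor.
// Channel c belongs to group c / (C / G), so all channels in a group
// share one scale and one zero point.
template <typename T>
void QuantizeGroupwiseKCX(const float* src, int K, int C, int X, int G,
                          const float* scales,
                          const std::int32_t* zero_points, T* dst) {
  assert(G > 0 && C % G == 0);
  const int c_per_g = C / G;
  for (int k = 0; k < K; ++k) {
    for (int c = 0; c < C; ++c) {
      const int g = c / c_per_g;
      for (int x = 0; x < X; ++x) {
        const int idx = (k * C + c) * X + x;
        dst[idx] = QuantizeOne<T>(src[idx], scales[g], zero_points[g]);
      }
    }
  }
}
```

With G == 1 every element uses scales[0] and zero_points[0] (per-tensor), and with G == C each channel gets its own pair (channelwise).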

template<typename T> void FusedQuantizeDequantize(const float *src, float *dst, std::int64_t len, const TensorQuantizationParams &qparams, int thread_id = 0, int num_threads = 1, float noise_ratio = 0.0f)

Fused integer quantize-dequantize kernel for accelerating quantization-aware training. Quantizes the fp32 values in src to (u)int8 using the provided qparams, then dequantizes the quantized integer values back to fp32.
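
A minimal scalar sketch of the quantize-then-dequantize round trip for the uint8 case follows. It is an illustration only: SketchQParams and FusedQuantizeDequantizeU8 are hypothetical names, the struct stands in for TensorQuantizationParams with just a scale and a zero point, and the threading and noise_ratio arguments are omitted.

```cpp
#include <algorithm>
#include <cassert>
#include <cmath>
#include <cstdint>

// Hypothetical stand-in for the quantization parameters (scale and zero point only).
struct SketchQParams {
  float scale;
  std::int32_t zero_point;
};

// Quantize each value to the uint8 range [0, 255], then map it straight back to fp32.
// The output shows the rounding and clamping error a quantized model would see.
void FusedQuantizeDequantizeU8(const float* src, float* dst, std::int64_t len,
                               const SketchQParams& qp) {
  assert(qp.scale > 0.0f);
  for (std::int64_t i = 0; i < len; ++i) {
    float q = std::nearbyint(src[i] / qp.scale) + static_cast<float>(qp.zero_point);
    q = std::min(255.0f, std::max(0.0f, q));                      // clamp to uint8 range
    dst[i] = (q - static_cast<float>(qp.zero_point)) * qp.scale;  // dequantize
  }
}
```

Fusing the two steps avoids materializing the intermediate integer tensor, which is the likely reason such a kernel helps in training loops.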

template<typename InputType> void FloatOrHalfToFusedNBitRowwiseQuantizedSBHalf(int bit_rate, const InputType *input, size_t input_rows, int input_columns, std::uint8_t *output)

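Converts floating-point (fp32 or fp16) input to rowwise-quantized output. bit_rate specifies the number of bits per element in the quantized output. The scale and bias are stored as fp16. Each row's scale and bias are stored at the end of that row itself (fused).

Parameters:

bit_rate – can be 2, 4, or 8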
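
The fused row layout described above can be sketched as follows. This is a simplified illustration under stated assumptions, not the library's code: FusedNBitRowBytes and QuantizeRowPackedSketch are hypothetical names, the first element of each byte is assumed to occupy the low bits, the scale is assumed to be (max - min) / (2^bit_rate - 1) with the row minimum as bias, and the scale and bias are returned as fp32 rather than converted to fp16 and appended to the row.

```cpp
#include <algorithm>
#include <cassert>
#include <cmath>
#include <cstddef>
#include <cstdint>

// Bytes per output row: packed values followed by an fp16 scale and an fp16 bias.
std::size_t FusedNBitRowBytes(int bit_rate, int input_columns) {
  assert(bit_rate == 2 || bit_rate == 4 || bit_rate == 8);
  const int elems_per_byte = 8 / bit_rate;
  const std::size_t packed = static_cast<std::size_t>(
      (input_columns + elems_per_byte - 1) / elems_per_byte);
  return packed + 2 * sizeof(std::uint16_t);  // two 16-bit slots: scale, bias
}

// Quantize one row to bit_rate bits per element and pack it into bytes.
// Assumption: element i goes into byte i / elems_per_byte, low bits first.
void QuantizeRowPackedSketch(int bit_rate, const float* row, int n,
                             std::uint8_t* out, float* scale, float* bias) {
  assert(n > 0);
  const int elems_per_byte = 8 / bit_rate;
  const int levels = (1 << bit_rate) - 1;
  const float minv = *std::min_element(row, row + n);
  const float maxv = *std::max_element(row, row + n);
  *bias = minv;
  *scale = (maxv > minv) ? (maxv - minv) / static_cast<float>(levels) : 1.0f;
  std::fill(out, out + (n + elems_per_byte - 1) / elems_per_byte, std::uint8_t{0});
  for (int i = 0; i < n; ++i) {
    const float q = std::nearbyint((row[i] - minv) / *scale);
    const int qi = static_cast<int>(
        std::min(static_cast<float>(levels), std::max(0.0f, q)));
    out[i / elems_per_byte] |=
        static_cast<std::uint8_t>(qi << ((i % elems_per_byte) * bit_rate));
  }
}
```

Under these assumptions, a row of 8 columns at bit_rate 4 needs 4 packed bytes plus 4 bytes for the scale and bias.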

AVX-2 Implementation Methods

uint32_t Xor128(void)

Random number generator in the range [0, 9], based on a paper linked in the original documentation.
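
For orientation, a generator of this kind might look like the sketch below. It assumes, based only on the function name, that the method is the classic xorshift128 scheme; the seed constants are the conventional xorshift ones and Xor128Sketch is a hypothetical name, so none of this should be read as the library's actual code.

```cpp
#include <cstdint>

// Sketch of a xorshift128 generator whose output is reduced to [0, 9].
// Assumption: conventional xorshift128 seeds and shift amounts (11, 19, 8).
// The static state makes this sketch unsuitable for concurrent use.
std::uint32_t Xor128Sketch() {
  static std::uint32_t x = 123456789u, y = 362436069u,
                       z = 521288629u, w = 88675123u;
  const std::uint32_t t = x ^ (x << 11);
  x = y;
  y = z;
  z = w;
  w = w ^ (w >> 19) ^ (t ^ (t >> 8));
  return w % 10;  // reduce to the range [0, 9]
}
```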

void FindMinMax(const float *m, float *min, float *max, int64_t len)

Finds the minimum and maximum values in a float matrix.
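
A scalar equivalent of this operation, useful as a correctness baseline for a vectorized version, could look like the following. FindMinMaxScalar is a hypothetical name, and returning 0 for both outputs on an empty input is a choice made for this sketch.

```cpp
#include <algorithm>
#include <cstdint>

// Scalar min/max scan over len contiguous floats.
// For len <= 0 this sketch reports min = max = 0 (an assumption, not documented behavior).
void FindMinMaxScalar(const float* m, float* min, float* max, std::int64_t len) {
  if (len <= 0) {
    *min = 0.0f;
    *max = 0.0f;
    return;
  }
  float lo = m[0];
  float hi = m[0];
  for (std::int64_t i = 1; i < len; ++i) {
    lo = std::min(lo, m[i]);
    hi = std::max(hi, m[i]);
  }
  *min = lo;
  *max = hi;
}
```

The resulting range is typically what feeds the choice of scale and zero point for a tensor.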

template<bool A_SYMMETRIC, bool B_SYMMETRIC, QuantizationGranularity Q_GRAN, bool HAS_BIAS, bool FUSE_RELU, typename BIAS_TYPE = std::int32_t, bool DIRECT = false> void requantizeOutputProcessingAvx2(std::uint8_t *out, const std::int32_t *inp, const block_type_t &block, int ld_out, int ld_in, const requantizationParams_t<BIAS_TYPE> &r)

Requantizes using AVX2, with the bias addition fused.
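
What requantization with a fused bias (and optional ReLU) means per element can be shown with a scalar sketch. This is a simplification under stated assumptions: SketchRequantParams and RequantizeRowSketch are hypothetical names, a single multiplier is used for the whole row (per-tensor granularity), and the row and column offset corrections needed for asymmetric inputs are left out.

```cpp
#include <algorithm>
#include <cmath>
#include <cstdint>

// Hypothetical, reduced stand-in for the requantization parameters.
struct SketchRequantParams {
  float multiplier;          // assumed: combined input scales divided by output scale
  std::int32_t zero_point;   // output zero point
  const std::int32_t* bias;  // per-column bias, or nullptr when there is none
  bool fuse_relu;            // clamp at the zero point instead of at 0
};

// Requantize one row of int32 accumulators to uint8:
// out = clamp(round((acc + bias) * multiplier) + zero_point).
void RequantizeRowSketch(std::uint8_t* out, const std::int32_t* inp, int n,
                         const SketchRequantParams& r) {
  const long lo = r.fuse_relu ? static_cast<long>(r.zero_point) : 0L;
  for (int j = 0; j < n; ++j) {
    std::int32_t acc = inp[j];
    if (r.bias != nullptr) {
      acc += r.bias[j];  // fused bias addition
    }
    const long rounded =
        std::lrint(static_cast<float>(acc) * r.multiplier) + r.zero_point;
    out[j] = static_cast<std::uint8_t>(std::min(255L, std::max(lo, rounded)));
  }
}
```

With fuse_relu set, the lower clamp moves from 0 to the zero point, which corresponds to clamping the dequantized value at 0.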

AVX-512 Implementation Methods

template<bool A_SYMMETRIC, bool B_SYMMETRIC, QuantizationGranularity Q_GRAN, bool HAS_BIAS, bool FUSE_RELU, int C_PER_G, typename BIAS_TYPE = std::int32_t> void requantizeOutputProcessingGConvAvx512(std::uint8_t *out, const std::int32_t *inp, const block_type_t &block, int ld_out, int ld_in, const requantizationParams_t<BIAS_TYPE> &r)

Requantizes using AVX-512.
