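ASR Inference with CUDA CTC Decoder
Author: Yuekai Zhang
This tutorial shows how to perform speech recognition inference using a CUDA-based CTC beam search decoder. We demonstrate this on a pretrained Zipformer model from the Next-gen Kaldi project.
Overview
Beam search decoding works by iteratively expanding text hypotheses (beams) with the next possible characters, keeping only the highest-scoring hypotheses at each time step.
- The underlying implementation uses CUDA to accelerate the whole decoding process
The mathematical formulation of the decoder can be ...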
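To make the beam search idea concrete, here is a minimal pure-Python sketch of CTC prefix beam search. It is an illustration of the general technique only, not the CUDA kernel used by torchaudio; the function names, the three-token vocabulary, and the toy probabilities are assumptions made for this example.

```python
import math
from collections import defaultdict

NEG_INF = float("-inf")


def logsumexp(*values):
    """Numerically stable log(sum(exp(v))) that tolerates -inf entries."""
    top = max(values)
    if top == NEG_INF:
        return NEG_INF
    return top + math.log(sum(math.exp(v - top) for v in values))


def ctc_prefix_beam_search(log_probs, beam_size=10, blank=0):
    """log_probs: list of frames, each a list of per-token log-probabilities.

    Returns (token_ids, log_score) pairs, best first.
    """
    # For every prefix keep two scores: paths ending in blank / in a non-blank.
    beams = {(): (0.0, NEG_INF)}
    for frame in log_probs:
        candidates = defaultdict(lambda: (NEG_INF, NEG_INF))
        for prefix, (p_blank, p_non_blank) in beams.items():
            for token, lp in enumerate(frame):
                if token == blank:
                    # A blank never changes the prefix.
                    b, nb = candidates[prefix]
                    candidates[prefix] = (logsumexp(b, p_blank + lp, p_non_blank + lp), nb)
                elif prefix and token == prefix[-1]:
                    # Repeat without a blank in between collapses into the same prefix.
                    b, nb = candidates[prefix]
                    candidates[prefix] = (b, logsumexp(nb, p_non_blank + lp))
                    # Repeat after a blank really extends the prefix.
                    extended = prefix + (token,)
                    b, nb = candidates[extended]
                    candidates[extended] = (b, logsumexp(nb, p_blank + lp))
                else:
                    extended = prefix + (token,)
                    b, nb = candidates[extended]
                    candidates[extended] = (b, logsumexp(nb, p_blank + lp, p_non_blank + lp))
        # Prune: keep only the beam_size best prefixes for the next frame.
        ranked = sorted(candidates.items(), key=lambda kv: logsumexp(*kv[1]), reverse=True)
        beams = dict(ranked[:beam_size])
    return [(list(prefix), logsumexp(*scores)) for prefix, scores in beams.items()]


# Toy input: 3 frames over the hypothetical vocabulary {0: blank, 1: "a", 2: "b"}.
toy_probs = ([0.1, 0.8, 0.1], [0.8, 0.1, 0.1], [0.1, 0.1, 0.8])
toy_frames = [[math.log(p) for p in row] for row in toy_probs]
best_tokens, best_score = ctc_prefix_beam_search(toy_frames, beam_size=10)[0]
print(best_tokens, math.exp(best_score))
```

The two scores per prefix are what lets the search distinguish "a a" (collapsed to one symbol) from "a blank a" (two symbols), which is the CTC rule the real decoder also has to respect.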
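Running ASR inference with a CUDA CTC beam search decoder requires the following components:
- Acoustic model: a model that predicts modeling units (BPE in this tutorial) from acoustic features
- BPE model: the byte-pair encoding (BPE) tokenizer file
Acoustic Model and Set Up
First, we import the necessary utilities and fetch the data that we are working with.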
import torch
import torchaudio
print(torch.__version__)
print(torchaudio.__version__)
2.5.0
2.5.0
import time
from pathlib import Path
import IPython
import sentencepiece as spm
from torchaudio.models.decoder import cuda_ctc_decoder
from torchaudio.utils import download_asset
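We use a pretrained Zipformer model trained on the LibriSpeech dataset. The model is jointly trained with CTC and Transducer loss functions. In this tutorial, we only use the CTC head of the model.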
def download_asset_external(url, key):
    path = Path(torch.hub.get_dir()) / "torchaudio" / Path(key)
    if not path.exists():
        path.parent.mkdir(parents=True, exist_ok=True)
        torch.hub.download_url_to_file(url, path)
    return str(path)
url_prefix = "https://huggingface.co/Zengwei/icefall-asr-librispeech-pruned-transducer-stateless7-ctc-2022-12-01"
model_link = f"{url_prefix}/resolve/main/exp/cpu_jit.pt"
model_path = download_asset_external(model_link, "cuda_ctc_decoder/cpu_jit.pt")
We will load a sample from the LibriSpeech test-other dataset.
speech_file = download_asset("tutorial-assets/ctc-decoding/1688-142285-0007.wav")
waveform, sample_rate = torchaudio.load(speech_file)
assert sample_rate == 16000
IPython.display.Audio(speech_file)
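The transcript corresponding to this audio file is:
i really was very much afraid of showing him how much shocked i was at some parts of what he said
Files and Data for Decoder
Next, we load our tokens from the BPE model, which is the tokenizer used for decoding.
Tokens
The tokens are the possible symbols that the acoustic model can predict, including the blank symbol in CTC. In this tutorial, there are 500 BPE tokens. They can be passed in either as a file, where each line holds the token for the corresponding index, or as a list of tokens, each mapping to a unique index.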
# tokens
<blk>
<sos/eos>
<unk>
S
_THE
_A
T
_AND
...
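The two ways of passing tokens described above can be sketched as follows. This is a minimal illustration using only the first few tokens from the listing; the helper names `write_tokens_file` and `read_tokens_file` and the file name `tokens.txt` are hypothetical and not part of torchaudio.

```python
import tempfile
from pathlib import Path

# First few entries of the token inventory, in index order (illustrative subset).
example_tokens = ["<blk>", "<sos/eos>", "<unk>", "S", "▁THE", "▁A", "T", "▁AND"]

# List form: the position in the list is the token ID.
token_to_id = {token: index for index, token in enumerate(example_tokens)}


def write_tokens_file(tokens, path):
    """File form: one token per line, line number (0-based) is the token ID."""
    Path(path).write_text("\n".join(tokens) + "\n", encoding="utf-8")


def read_tokens_file(path):
    """Read the file form back into the list form."""
    return Path(path).read_text(encoding="utf-8").splitlines()


with tempfile.TemporaryDirectory() as tmp_dir:
    tokens_path = Path(tmp_dir) / "tokens.txt"
    write_tokens_file(example_tokens, tokens_path)
    round_trip = read_tokens_file(tokens_path)

print(token_to_id["<blk>"], token_to_id["▁THE"])
```

Either form carries the same information; what matters is that index 0 is the CTC blank and that the order matches the output dimension of the acoustic model.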
bpe_link = f"{url_prefix}/resolve/main/data/lang_bpe_500/bpe.model"
bpe_path = download_asset_external(bpe_link, "cuda_ctc_decoder/bpe.model")
bpe_model = spm.SentencePieceProcessor()
bpe_model.load(bpe_path)
tokens = [bpe_model.id_to_piece(id) for id in range(bpe_model.get_piece_size())]
print(tokens)
['<blk>', '<sos/eos>', '<unk>', 'S', '▁THE', '▁A', 'T', '▁AND', 'ED', '▁OF', '▁TO', 'E', 'D', 'N', 'ING', '▁IN', 'Y', 'M', 'C', '▁I', 'A', 'P', '▁HE', 'R', 'O', 'L', 'RE', 'I', 'U', 'ER', '▁IT', 'LY', '▁THAT', '▁WAS', '▁', '▁S', 'AR', '▁BE', 'F', '▁C', 'IN', 'B', '▁FOR', 'OR', 'LE', "'", '▁HIS', '▁YOU', 'AL', '▁RE', 'V', '▁B', 'G', 'RI', '▁E', '▁WITH', '▁T', '▁AS', 'LL', '▁P', '▁HER', 'ST', '▁HAD', '▁SO', '▁F', 'W', 'CE', '▁IS', 'ND', '▁NOT', 'TH', '▁BUT', 'EN', '▁SHE', '▁ON', 'VE', 'ON', 'SE', '▁DE', 'UR', '▁G', 'CH', 'K', 'TER', '▁AT', 'IT', '▁ME', 'RO', 'NE', 'RA', 'ES', 'IL', 'NG', 'IC', '▁NO', '▁HIM', 'ENT', 'IR', '▁WE', 'H', '▁DO', '▁ALL', '▁HAVE', 'LO', '▁BY', '▁MY', '▁MO', '▁THIS', 'LA', '▁ST', '▁WHICH', '▁CON', '▁THEY', 'CK', 'TE', '▁SAID', '▁FROM', '▁GO', '▁WHO', '▁TH', '▁OR', '▁D', '▁W', 'VER', 'LI', '▁SE', '▁ONE', '▁CA', '▁AN', '▁LA', '▁WERE', 'EL', '▁HA', '▁MAN', '▁FA', '▁EX', 'AD', '▁SU', 'RY', '▁MI', 'AT', '▁BO', '▁WHEN', 'AN', 'THER', 'PP', 'ATION', '▁FI', '▁WOULD', '▁PRO', 'OW', 'ET', '▁O', '▁THERE', '▁HO', 'ION', '▁WHAT', '▁FE', '▁PA', 'US', 'MENT', '▁MA', 'UT', '▁OUT', '▁THEIR', '▁IF', '▁LI', '▁K', '▁WILL', '▁ARE', 'ID', '▁RO', 'DE', 'TION', '▁WA', 'PE', '▁UP', '▁SP', '▁PO', 'IGHT', '▁UN', 'RU', '▁LO', 'AS', 'OL', '▁LE', '▁BEEN', '▁SH', '▁RA', '▁SEE', 'KE', 'UL', 'TED', '▁SA', 'UN', 'UND', 'ANT', '▁NE', 'IS', '▁THEM', 'CI', 'GE', '▁COULD', '▁DIS', 'OM', 'ISH', 'HE', 'EST', '▁SOME', 'ENCE', 'ITY', 'IVE', '▁US', '▁MORE', '▁EN', 'ARD', 'ATE', '▁YOUR', '▁INTO', '▁KNOW', '▁CO', 'ANCE', '▁TIME', '▁WI', '▁YE', 'AGE', '▁NOW', 'TI', 'FF', 'ABLE', '▁VERY', '▁LIKE', 'AM', 'HI', 'Z', '▁OTHER', '▁THAN', '▁LITTLE', '▁DID', '▁LOOK', 'TY', 'ERS', '▁CAN', '▁CHA', '▁AR', 'X', 'FUL', 'UGH', '▁BA', '▁DAY', '▁ABOUT', 'TEN', 'IM', '▁ANY', '▁PRE', '▁OVER', 'IES', 'NESS', 'ME', 'BLE', '▁M', 'ROW', '▁HAS', '▁GREAT', '▁VI', 'TA', '▁AFTER', 'PER', '▁AGAIN', 'HO', 'SH', '▁UPON', '▁DI', '▁HAND', '▁COM', 'IST', 'TURE', '▁STA', '▁THEN', '▁SHOULD', '▁GA', 'OUS', 'OUR', '▁WELL', 
'▁ONLY', 'MAN', '▁GOOD', '▁TWO', '▁MAR', '▁SAY', '▁HU', 'TING', '▁OUR', 'RESS', '▁DOWN', 'IOUS', '▁BEFORE', '▁DA', '▁NA', 'QUI', '▁MADE', '▁EVERY', '▁OLD', '▁EVEN', 'IG', '▁COME', '▁GRA', '▁RI', '▁LONG', 'OT', 'SIDE', 'WARD', '▁FO', '▁WHERE', 'MO', 'LESS', '▁SC', '▁MUST', '▁NEVER', '▁HOW', '▁CAME', '▁SUCH', '▁RU', '▁TAKE', '▁WO', '▁CAR', 'UM', 'AK', '▁THINK', '▁MUCH', '▁MISTER', '▁MAY', '▁JO', '▁WAY', '▁COMP', '▁THOUGHT', '▁STO', '▁MEN', '▁BACK', '▁DON', 'J', '▁LET', '▁TRA', '▁FIRST', '▁JUST', '▁VA', '▁OWN', '▁PLA', '▁MAKE', 'ATED', '▁HIMSELF', '▁WENT', '▁PI', 'GG', 'RING', '▁DU', '▁MIGHT', '▁PART', '▁GIVE', '▁IMP', '▁BU', '▁PER', '▁PLACE', '▁HOUSE', '▁THROUGH', 'IAN', '▁SW', '▁UNDER', 'QUE', '▁AWAY', '▁LOVE', 'QUA', '▁LIFE', '▁GET', '▁WITHOUT', '▁PASS', '▁TURN', 'IGN', '▁HEAD', '▁MOST', '▁THOSE', '▁SHALL', '▁EYES', '▁COL', '▁STILL', '▁NIGHT', '▁NOTHING', 'ITION', 'HA', '▁TELL', '▁WORK', '▁LAST', '▁NEW', '▁FACE', '▁HI', '▁WORD', '▁FOUND', '▁COUNT', '▁OB', '▁WHILE', '▁SHA', '▁MEAN', '▁SAW', '▁PEOPLE', '▁FRIEND', '▁THREE', '▁ROOM', '▁SAME', '▁THOUGH', '▁RIGHT', '▁CHILD', '▁FATHER', '▁ANOTHER', '▁HEART', '▁WANT', '▁TOOK', 'OOK', '▁LIGHT', '▁MISSUS', '▁OPEN', '▁JU', '▁ASKED', 'PORT', '▁LEFT', '▁JA', '▁WORLD', '▁HOME', '▁WHY', '▁ALWAYS', '▁ANSWER', '▁SEEMED', '▁SOMETHING', '▁GIRL', '▁BECAUSE', '▁NAME', '▁TOLD', '▁NI', '▁HIGH', 'IZE', '▁WOMAN', '▁FOLLOW', '▁RETURN', '▁KNEW', '▁EACH', '▁KIND', '▁JE', '▁ACT', '▁LU', '▁CERTAIN', '▁YEARS', '▁QUITE', '▁APPEAR', '▁BETTER', '▁HALF', '▁PRESENT', '▁PRINCE', 'SHIP', '▁ALSO', '▁BEGAN', '▁HAVING', '▁ENOUGH', '▁PERSON', '▁LADY', '▁WHITE', '▁COURSE', '▁VOICE', '▁SPEAK', '▁POWER', '▁MORNING', '▁BETWEEN', '▁AMONG', '▁KEEP', '▁WALK', '▁MATTER', '▁TEA', '▁BELIEVE', '▁SMALL', '▁TALK', '▁FELT', '▁HORSE', '▁MYSELF', '▁SIX', '▁HOWEVER', '▁FULL', '▁HERSELF', '▁POINT', '▁STOOD', '▁HUNDRED', '▁ALMOST', '▁SINCE', '▁LARGE', '▁LEAVE', '▁PERHAPS', '▁DARK', '▁SUDDEN', '▁REPLIED', '▁ANYTHING', '▁WONDER', '▁UNTIL', 'Q']
Construct CUDA Decoder
In this tutorial, we will construct a CUDA beam search decoder. The decoder can be constructed using the factory function cuda_ctc_decoder().
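The construction call itself is not shown here. A minimal sketch, assuming the settings this tutorial mentions later (nbest=10, beam_size=10, blank_skip_threshold=0.95); the names `decoder_settings` and `make_cuda_decoder` are hypothetical, and the import is deferred so the snippet can be loaded on a machine without a CUDA-enabled torchaudio build.

```python
# Assumed settings: the values this tutorial refers to in the parameter sections.
decoder_settings = {"nbest": 10, "beam_size": 10, "blank_skip_threshold": 0.95}


def make_cuda_decoder(tokens, settings=decoder_settings):
    """Build a CUDA CTC beam search decoder from a token list or tokens file path."""
    # Imported lazily: this requires torchaudio built with CUDA support.
    from torchaudio.models.decoder import cuda_ctc_decoder

    return cuda_ctc_decoder(tokens, **settings)
```

With the tokens list loaded above, `cuda_decoder = make_cuda_decoder(tokens)` would give the `cuda_decoder` object used in the following sections.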
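Run Inference
Now that we have the data, acoustic model, and decoder, we can perform inference. The output of the beam search decoder is of type CUCTCHypothesis, consisting of the predicted token IDs, words (the symbols corresponding to the token IDs), and hypothesis scores. Recall that the transcript corresponding to the waveform is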
actual_transcript = "i really was very much afraid of showing him how much shocked i was at some parts of what he said"
actual_transcript = actual_transcript.split()
device = torch.device("cuda", 0)
acoustic_model = torch.jit.load(model_path)
acoustic_model.to(device)
acoustic_model.eval()
waveform = waveform.to(device)
feat = torchaudio.compliance.kaldi.fbank(waveform, num_mel_bins=80, snip_edges=False)
feat = feat.unsqueeze(0)
feat_lens = torch.tensor(feat.size(1), device=device).unsqueeze(0)
encoder_out, encoder_out_lens = acoustic_model.encoder(feat, feat_lens)
nnet_output = acoustic_model.ctc_output(encoder_out)
log_prob = torch.nn.functional.log_softmax(nnet_output, -1)
print(f"The shape of log_prob: {log_prob.shape}, the shape of encoder_out_lens: {encoder_out_lens.shape}")
The shape of log_prob: torch.Size([1, 175, 500]), the shape of encoder_out_lens: torch.Size([1])
The CUDA CTC decoder gives the following result.
results = cuda_decoder(log_prob, encoder_out_lens.to(torch.int32))
beam_search_transcript = bpe_model.decode(results[0][0].tokens).lower()
beam_search_wer = torchaudio.functional.edit_distance(actual_transcript, beam_search_transcript.split()) / len(
    actual_transcript
)
print(f"Transcript: {beam_search_transcript}")
print(f"WER: {beam_search_wer}")
Transcript: i really was very much afraid of showing him how much shocked i was at some parts of what he said
WER: 0.0
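The WER printed above is the word-level edit distance between reference and hypothesis, divided by the number of reference words. The following pure-Python sketch shows that computation without torchaudio; the function names are chosen for this example, and the second hypothesis is the runner-up transcript that appears in the n-best listing later in this tutorial.

```python
def edit_distance(ref, hyp):
    """Levenshtein distance between two sequences (here: lists of words)."""
    previous = list(range(len(hyp) + 1))
    for i, ref_item in enumerate(ref, start=1):
        current = [i]
        for j, hyp_item in enumerate(hyp, start=1):
            substitution = previous[j - 1] + (0 if ref_item == hyp_item else 1)
            current.append(min(previous[j] + 1, current[j - 1] + 1, substitution))
        previous = current
    return previous[-1]


def word_error_rate(reference, hypothesis):
    """Word errors divided by the number of words in the reference."""
    ref_words = reference.split()
    return edit_distance(ref_words, hypothesis.split()) / len(ref_words)


reference = "i really was very much afraid of showing him how much shocked i was at some parts of what he said"
runner_up = "i really was very much afraid of showing him how much shocked i was at some part of what he said"

wer_exact = word_error_rate(reference, reference)
wer_runner_up = word_error_rate(reference, runner_up)
print(wer_exact, wer_runner_up)
```

A single substituted word ("part" for "parts") in a 21-word reference already costs about 4.8% WER, which gives a sense of scale for the benchmark numbers at the end of the tutorial.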
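Beam Search Decoder Parameters
In this section, we look in more depth at some of the parameters and their tradeoffs. For the full list of customizable parameters, please refer to the documentation.
Helper Function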
def print_decoded(cuda_decoder, bpe_model, log_prob, encoder_out_lens, param, param_value):
    start_time = time.monotonic()
    results = cuda_decoder(log_prob, encoder_out_lens.to(torch.int32))
    decode_time = time.monotonic() - start_time
    transcript = bpe_model.decode(results[0][0].tokens).lower()
    score = results[0][0].score
    print(f"{param} {param_value:<3}: {transcript} (score: {score:.2f}; {decode_time:.4f} secs)")
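nbest
This parameter indicates the number of best hypotheses to return. For instance, by setting nbest=10 when constructing the beam search decoder earlier, we can now access the 10 highest-scoring hypotheses.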
i really was very much afraid of showing him how much shocked i was at some parts of what he said (score: -0.20280733704566956)
i really was very much afraid of showing him how much shocked i was at some part of what he said (score: -1.7408883571624756)
i really was very much afraid of sheowing him how much shocked i was at some parts of what he said (score: -6.67951774597168)
i reallyly very much afraid of showing him how much shocked i was at some parts of what he said (score: -7.597038745880127)
i really was very much afraid of sheowing him how much shocked i was at some part of what he said (score: -8.224080085754395)
i really was very much afraid of shwing him how much shocked i was at some parts of what he said (score: -8.439373970031738)
i really was very much afraid of showing him how much shocked i was in some parts of what he said (score: -8.781461715698242)
i really was very much afraid of showing him how much shocked i was at some parts of what said (score: -8.883706092834473)
i really was very much afraid of showing him how much shocked i was at some partes of what he said (score: -8.999059677124023)
i really was very much afraid of showing him how much shocked i was at some parts of what he say (score: -9.138861656188965)
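The loop that produced the listing above is not shown. A minimal sketch, assuming that the first element of the decoder output is the list of hypotheses for the first utterance and that each hypothesis exposes `.tokens` and `.score` (the two attributes used in print_decoded); `format_nbest` and the stand-in data are hypothetical and only there so the snippet runs without a GPU.

```python
from collections import namedtuple


def format_nbest(hypotheses, decode):
    """Turn a list of hypotheses into 'transcript (score: x)' lines, best first."""
    lines = []
    for hypothesis in hypotheses:
        transcript = decode(hypothesis.tokens).lower()
        lines.append(f"{transcript} (score: {hypothesis.score})")
    return lines


# Stand-in hypotheses and vocabulary (hypothetical, for illustration only).
Hypothesis = namedtuple("Hypothesis", ["tokens", "score"])
demo_hypotheses = [Hypothesis([1, 2], -0.2), Hypothesis([1], -1.7)]
demo_vocab = {1: "HELLO", 2: "WORLD"}
demo_lines = format_nbest(demo_hypotheses, lambda ids: " ".join(demo_vocab[i] for i in ids))
print("\n".join(demo_lines))
```

In this tutorial's setting the call would look like `format_nbest(results[0], bpe_model.decode)`, with `results` being the decoder output from the inference step.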
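beam size
The beam_size parameter determines the maximum number of best hypotheses to hold after each decoding step. Larger beam sizes explore a wider range of possible hypotheses, which can produce hypotheses with higher scores, but beyond a certain point they provide no additional gain. We recommend setting beam_size=10 for the CUDA beam search decoder.
In the example below, the transcript is already correct at beam size 1, while the hypothesis score improves as the beam size grows from 1 to 3 (from -1.35 to -0.20). Note that a beam size of 3 gives the same output and score as a beam size of 10.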
beam_sizes = [1, 2, 3, 10]
for beam_size in beam_sizes:
    beam_search_decoder = cuda_ctc_decoder(
        tokens,
        nbest=1,
        beam_size=beam_size,
        blank_skip_threshold=0.95,
    )
    print_decoded(beam_search_decoder, bpe_model, log_prob, encoder_out_lens, "beam size", beam_size)
beam size 1 : i really was very much afraid of showing him how much shocked i was at some parts of what he said (score: -1.35; 0.0010 secs)
beam size 2 : i really was very much afraid of showing him how much shocked i was at some parts of what he said (score: -0.21; 0.0009 secs)
beam size 3 : i really was very much afraid of showing him how much shocked i was at some parts of what he said (score: -0.20; 0.0009 secs)
beam size 10 : i really was very much afraid of showing him how much shocked i was at some parts of what he said (score: -0.20; 0.0010 secs)
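blank skip threshold
The blank_skip_threshold parameter is used to prune frames that have a large blank probability. Pruning these frames with a well-chosen blank_skip_threshold can speed up decoding considerably without any loss of accuracy. Following the rules of CTC, at least one blank frame is kept between two non-blank frames, so that two consecutive identical symbols are not merged by mistake. We recommend setting blank_skip_threshold=0.95 for the CUDA beam search decoder.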
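The pruning rule described above can be sketched in a few lines of pure Python. This is an assumption about the general idea, not the library's actual kernel: a frame counts as skippable when its blank probability exceeds the threshold, and only the first frame of each run of skippable frames is kept, so a blank always remains between the surrounding non-blank frames. The function name and the toy blank probabilities are made up for this example.

```python
import math


def frames_to_keep(log_probs, blank=0, threshold=0.95):
    """Return indices of frames that survive blank skipping.

    log_probs: list of frames, each a list of per-token log-probabilities.
    """
    log_threshold = math.log(threshold)
    kept = []
    previous_skippable = False
    for index, frame in enumerate(log_probs):
        skippable = frame[blank] > log_threshold
        if skippable and previous_skippable:
            # Later frame in a run of blank-dominated frames: drop it.
            continue
        kept.append(index)
        previous_skippable = skippable
    return kept


# Toy two-token frames built from hypothetical blank probabilities.
blank_probs = [0.99, 0.99, 0.10, 0.99, 0.99, 0.99, 0.20]
demo_frames = [[math.log(p), math.log(1.0 - p)] for p in blank_probs]
kept_default = frames_to_keep(demo_frames, threshold=0.95)
kept_all = frames_to_keep(demo_frames, threshold=1.0)
print(kept_default, kept_all)
```

With a threshold of 1.0 no frame can exceed the threshold, so nothing is skipped; this corresponds to the "no frame-skip" setting in the benchmark.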
blank_skip_probs = [0.25, 0.95, 1.0]
for blank_skip_prob in blank_skip_probs:
    beam_search_decoder = cuda_ctc_decoder(
        tokens,
        nbest=10,
        beam_size=10,
        blank_skip_threshold=blank_skip_prob,
    )
    print_decoded(beam_search_decoder, bpe_model, log_prob, encoder_out_lens, "blank_skip_threshold", blank_skip_prob)
del cuda_decoder
blank_skip_threshold 0.25: i really was very much afraid of showing him how much shocked i was at some part of what he said (score: -0.01; 0.0009 secs)
blank_skip_threshold 0.95: i really was very much afraid of showing him how much shocked i was at some parts of what he said (score: -0.20; 0.0010 secs)
blank_skip_threshold 1.0: i really was very much afraid of showing him how much shocked i was at some parts of what he said (score: -0.21; 0.0043 secs)
Benchmark with Flashlight CPU Decoder
We benchmark throughput and accuracy of the CUDA decoder against the CPU decoder on the LibriSpeech test_other set. To reproduce the benchmark results below, you may refer here.
Decoder | Setting | WER (%) | N-Best Oracle WER (%) | Decoder Cost Time (seconds) |
---|---|---|---|---|
CUDA decoder | blank_skip_threshold 0.95 | 5.81 | 4.11 | 2.57 |
CUDA decoder | blank_skip_threshold 1.0 (no frame-skip) | 5.81 | 4.09 | 6.24 |
CPU decoder | beam_size_token 10 | 5.86 | 4.30 | 28.61 |
CPU decoder | beam_size_token 500 | 5.86 | 4.30 | 791.80 |
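From the table above, the CUDA decoder gives a slight improvement in WER (5.81% versus 5.86%) and a large increase in throughput (2.57 seconds of decoding time versus 28.61 seconds for the fastest CPU setting).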
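To put the throughput difference in numbers, the decoding times from the table can be compared directly. The short sketch below only does arithmetic on the reported values; the dictionary keys are labels chosen for this example.

```python
# Decoder cost times in seconds, copied from the benchmark table.
decode_seconds = {
    "CUDA, blank_skip_threshold 0.95": 2.57,
    "CUDA, blank_skip_threshold 1.0": 6.24,
    "CPU, beam_size_token 10": 28.61,
    "CPU, beam_size_token 500": 791.80,
}

# Express every setting as a multiple of the fastest CUDA configuration.
baseline = decode_seconds["CUDA, blank_skip_threshold 0.95"]
relative_cost = {name: seconds / baseline for name, seconds in decode_seconds.items()}

for name, ratio in relative_cost.items():
    print(f"{name}: {ratio:.1f}x the decoding time of the fastest CUDA setting")
```

The ratios show two separate effects: frame skipping alone more than halves the CUDA decoding time, and moving from the CPU decoder to the CUDA decoder accounts for roughly another order of magnitude.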