Music Source Separation with Hybrid Demucs
Author: Sean Kim
This tutorial shows how to use the Hybrid Demucs model to perform music separation.
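1. Overview
Performing music separation consists of the following steps:
Build the Hybrid Demucs pipeline.
Format the waveform into chunks of the expected size, then loop over the chunks (with overlap) and feed each one into the pipeline.
Collect the output chunks and combine them according to how they overlap.
The Hybrid Demucs [Défossez, 2021] model is an improved version of the Demucs model, a waveform-based model that separates music into its individual sources, such as vocals, bass, and drums.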
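The chunking step in the list above can be sketched in plain Python. This is only an illustration of overlapping index ranges: the helper name `chunk_bounds` and its parameters are hypothetical and are not part of torchaudio or of the implementation used later in this tutorial.

```python
def chunk_bounds(length, chunk_len, overlap):
    """Return (start, end) index pairs covering [0, length).

    Consecutive chunks share `overlap` frames. Hypothetical helper
    for illustration only.
    """
    if not 0 <= overlap < chunk_len:
        raise ValueError("overlap must be non-negative and smaller than chunk_len")
    step = chunk_len - overlap
    bounds = []
    start = 0
    while True:
        end = min(start + chunk_len, length)  # the last chunk may be shorter
        bounds.append((start, end))
        if end >= length:
            break
        start += step
    return bounds


print(chunk_bounds(length=25, chunk_len=10, overlap=2))
# [(0, 10), (8, 18), (16, 25)]
```

Each chunk starts `overlap` frames before the previous one ends, so every boundary is covered twice and can later be blended.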
2. Preparation
First, we install the necessary dependencies. The first requirements are torchaudio and torch.
import torch
import torchaudio
print(torch.__version__)
print(torchaudio.__version__)
import matplotlib.pyplot as plt
2.5.0
2.5.0
In addition to torchaudio, mir_eval is required to perform signal-to-distortion ratio (SDR) calculations. To install mir_eval, use pip3 install mir_eval.
from IPython.display import Audio
from mir_eval import separation
from torchaudio.pipelines import HDEMUCS_HIGH_MUSDB_PLUS
from torchaudio.utils import download_asset
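For orientation, SDR (signal-to-distortion ratio) in its simplest form compares the energy of a reference signal with the energy of the error between reference and estimate, expressed in decibels. The sketch below implements only that basic ratio with a hypothetical helper named `simple_sdr`. It is not the computation performed by `mir_eval.separation.bss_eval_sources`, which uses a more elaborate decomposition, so its values will generally differ from the scores reported later.

```python
import math


def simple_sdr(reference, estimate):
    """Energy ratio, in dB, between the reference and the residual error.

    Hypothetical helper: the basic definition, not mir_eval's BSS Eval.
    """
    if len(reference) != len(estimate):
        raise ValueError("reference and estimate must have the same length")
    signal_energy = sum(r * r for r in reference)
    error_energy = sum((r - e) ** 2 for r, e in zip(reference, estimate))
    if error_energy == 0:
        return math.inf  # perfect reconstruction
    if signal_energy == 0:
        return -math.inf  # silent reference, non-silent error
    return 10.0 * math.log10(signal_energy / error_energy)


reference = [1.0, 0.0, -1.0, 0.0]
estimate = [0.9, 0.0, -0.9, 0.0]
print(round(simple_sdr(reference, estimate), 2))  # 20.0
```

A higher value means the estimate is closer to the reference. Here the error carries one hundredth of the reference energy, which corresponds to 20 dB.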
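3. Construct the pipeline
Pre-trained model weights and related pipeline components are bundled as torchaudio.pipelines.HDEMUCS_HIGH_MUSDB_PLUS. This is a torchaudio.models.HDemucs model trained on MUSDB18-HQ and additional internal training data. This specific model is suited to higher sample rates, around 44.1 kHz, and has an nfft value of 4096 and a depth of 6 in the model implementation.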
bundle = HDEMUCS_HIGH_MUSDB_PLUS
model = bundle.get_model()
device = torch.device("cuda:0" if torch.cuda.is_available() else "cpu")
model.to(device)
sample_rate = bundle.sample_rate
print(f"Sample rate: {sample_rate}")
Sample rate: 44100
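4. Configure the application function
Because HDemucs is a large and memory-consuming model, it is very difficult to have enough memory to apply the model to an entire song at once. To work around this limitation, the separated sources of a full song are obtained by chunking the song into smaller segments, running each segment through the model, and then rearranging the outputs back together.
When doing this, it is important to ensure some overlap between consecutive chunks to accommodate artifacts at the edges. Due to the nature of the model, the edges sometimes contain inaccurate or undesired sounds.
We provide a sample implementation of chunking and arrangement below. This implementation overlaps neighboring chunks and applies a linear fade-in and fade-out on each side. The overlap is nominally 1 second (segment × overlap with the default values), although the code as written crossfades over overlap * sample_rate frames, which is 0.1 seconds at the default settings. The faded, overlapping segments are summed so that the volume stays consistent throughout. This accommodates the artifacts by using less of the edges of the model outputs.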
from torchaudio.transforms import Fade


def separate_sources(
    model,
    mix,
    segment=10.0,
    overlap=0.1,
    device=None,
):
    """
    Apply the model to a mixture segment by segment, fading and summing the overlapping outputs.

    Args:
        segment (float): segment length in seconds
        overlap (float): overlap factor; the fade length is `overlap * sample_rate` frames
        device (torch.device, str, or None): if provided, device on which to
            execute the computation, otherwise `mix.device` is assumed.
            When `device` is different from `mix.device`, only local computations will
            be on `device`, while the entire tracks will be stored on `mix.device`.
    """
    if device is None:
        device = mix.device
    else:
        device = torch.device(device)

    batch, channels, length = mix.shape

    # Note: relies on the module-level `sample_rate` taken from the bundle
    chunk_len = int(sample_rate * segment * (1 + overlap))
    start = 0
    end = chunk_len
    overlap_frames = overlap * sample_rate
    fade = Fade(fade_in_len=0, fade_out_len=int(overlap_frames), fade_shape="linear")

    # Accumulator for the faded chunk outputs, one slot per separated source
    final = torch.zeros(batch, len(model.sources), channels, length, device=device)

    while start < length - overlap_frames:
        chunk = mix[:, :, start:end]
        with torch.no_grad():
            out = model.forward(chunk)
        out = fade(out)
        final[:, :, :, start:end] += out
        if start == 0:
            # After the first chunk, enable the fade-in and step back by the overlap
            fade.fade_in_len = int(overlap_frames)
            start += int(chunk_len - overlap_frames)
        else:
            start += chunk_len
        end += chunk_len
        if end >= length:
            # Last chunk: do not fade out at the end of the track
            fade.fade_out_len = 0
    return final


def plot_spectrogram(stft, title="Spectrogram"):
    magnitude = stft.abs()
    # Convert magnitude to decibels; the small constant avoids log10(0)
    spectrogram = 20 * torch.log10(magnitude + 1e-8).numpy()
    _, axis = plt.subplots(1, 1)
    axis.imshow(spectrogram, cmap="viridis", vmin=-60, vmax=0, origin="lower", aspect="auto")
    axis.set_title(title)
    plt.tight_layout()
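The reason a linear fade-out summed with a linear fade-in keeps the volume constant can be checked with a small standalone sketch. The helpers below (`linear_fade_in`, `linear_fade_out`, `overlap_add`) are hypothetical pure-Python stand-ins written for this illustration. They do not call torchaudio's `Fade` and only mimic the idea of complementary linear ramps.

```python
def linear_fade_in(n):
    """Weights rising linearly from 0 to 1 over n frames (n >= 2)."""
    return [i / (n - 1) for i in range(n)]


def linear_fade_out(n):
    """Complement of the fade-in, falling linearly from 1 to 0."""
    return [1.0 - w for w in linear_fade_in(n)]


def overlap_add(left, right, overlap):
    """Blend the last `overlap` frames of `left` with the first `overlap` frames of `right`."""
    if overlap < 2 or overlap > min(len(left), len(right)):
        raise ValueError("overlap must be between 2 and the shorter chunk length")
    fade_out = linear_fade_out(overlap)
    fade_in = linear_fade_in(overlap)
    mixed = [
        l * fo + r * fi
        for l, r, fo, fi in zip(left[-overlap:], right[:overlap], fade_out, fade_in)
    ]
    return left[:-overlap] + mixed + right[overlap:]


# A constant signal cut into two overlapping chunks is rebuilt at the same level
left = [1.0] * 6
right = [1.0] * 6
joined = overlap_add(left, right, overlap=4)
print(len(joined))  # 8
print(max(abs(v - 1.0) for v in joined) < 1e-12)  # True
```

Because the two ramps add up to 1 at every frame of the overlap, the blended region carries the same level as the unblended parts, while the outermost frames of each chunk, where artifacts are most likely, contribute the least.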
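5. Run the model
Finally, we run the model and store the separated sources in a dictionary.
As a test song, we will use A Classic Education by NightOwl from MedleyDB (Creative Commons BY-NC-SA 4.0). It is also located in the MUSDB18-HQ dataset, within the train sources.
To test with a different song, change the variable names and URLs below, along with the parameters, to exercise the song separator in different ways.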
# We download the audio file from our storage. Feel free to download another file and use audio from a specific path
SAMPLE_SONG = download_asset("tutorial-assets/hdemucs_mix.wav")
waveform, sample_rate = torchaudio.load(SAMPLE_SONG) # replace SAMPLE_SONG with desired path for different song
waveform = waveform.to(device)
mixture = waveform
# parameters
segment: int = 10
overlap = 0.1
print("Separating track")
ref = waveform.mean(0)
waveform = (waveform - ref.mean()) / ref.std() # normalization
sources = separate_sources(
model,
waveform[None],
device=device,
segment=segment,
overlap=overlap,
)[0]
sources = sources * ref.std() + ref.mean()
sources_list = model.sources
sources = list(sources)
audios = dict(zip(sources_list, sources))
Separating track
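The cell above standardizes the waveform with the mean and standard deviation of a mono reference before separation, and applies the inverse transform to the model output. A minimal pure-Python sketch of that round trip follows. The helpers `normalize` and `denormalize` are hypothetical, and plain lists stand in for tensors. `statistics.stdev` uses the n - 1 denominator, which is assumed here to match the default behavior of `torch.Tensor.std`.

```python
import statistics


def normalize(samples, ref):
    """Shift and scale `samples` by the mean and sample standard deviation of `ref`."""
    mean = statistics.fmean(ref)
    std = statistics.stdev(ref)  # n - 1 in the denominator
    return [(x - mean) / std for x in samples], mean, std


def denormalize(samples, mean, std):
    """Undo `normalize`, restoring the original offset and scale."""
    return [x * std + mean for x in samples]


mono_ref = [0.2, -0.4, 0.6, -0.8]
normalized, mean, std = normalize(mono_ref, mono_ref)
restored = denormalize(normalized, mean, std)
print(all(abs(a - b) < 1e-12 for a, b in zip(restored, mono_ref)))  # True
```

Standardizing the input puts it on a consistent scale regardless of how loud the recording is, and the inverse step returns the separated sources to the original level.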
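5.1 Separate Track
The default set of pretrained weights that has been loaded separates the mixture into 4 sources: drums, bass, other, and vocals, in that order. They have been stored in the dict "audios" and can be accessed from there. For each of the four sources there is a separate cell that creates the audio, plots the spectrogram, and calculates the SDR score. SDR is the signal-to-distortion ratio, essentially a representation of the "quality" of an audio track.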
N_FFT = 4096
N_HOP = 4
stft = torchaudio.transforms.Spectrogram(
n_fft=N_FFT,
hop_length=N_HOP,
power=None,
)
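As a rough size check for the settings above (N_FFT = 4096, N_HOP = 4), the output shape of the transform can be estimated by hand. The sketch assumes a one-sided, centered STFT, which is believed to be the default behavior of `torchaudio.transforms.Spectrogram`. The helper `stft_shape` is hypothetical, and the clip length of 5 seconds at 44100 Hz matches the segments used later in this tutorial.

```python
def stft_shape(num_samples, n_fft, hop_length):
    """(frequency bins, time frames) of a one-sided, centered STFT.

    Hypothetical helper; assumes centered framing with padding at both ends.
    """
    return n_fft // 2 + 1, 1 + num_samples // hop_length


sample_rate = 44100
clip_seconds = 5
bins, frames = stft_shape(sample_rate * clip_seconds, n_fft=4096, hop_length=4)
print(bins, frames)  # 2049 55126

# Approximate memory per channel, at 8 bytes per complex64 value
print(f"{bins * frames * 8 / 1e6:.0f} MB")
```

With a hop of only 4 samples, a 5-second clip already produces roughly 900 MB of complex values per channel under these assumptions, which is one reason to plot short segments instead of the full song.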
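5.2 Audio Segmenting and Processing
Below are the processing steps, which take a 5-second segment of each track to feed into the spectrogram and to calculate the respective SDR scores.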
def output_results(original_source: torch.Tensor, predicted_source: torch.Tensor, source: str):
    print(
        "SDR score is:",
        separation.bss_eval_sources(original_source.detach().numpy(), predicted_source.detach().numpy())[0].mean(),
    )
    plot_spectrogram(stft(predicted_source)[0], f"Spectrogram - {source}")
    return Audio(predicted_source, rate=sample_rate)
segment_start = 150
segment_end = 155
frame_start = segment_start * sample_rate
frame_end = segment_end * sample_rate
drums_original = download_asset("tutorial-assets/hdemucs_drums_segment.wav")
bass_original = download_asset("tutorial-assets/hdemucs_bass_segment.wav")
vocals_original = download_asset("tutorial-assets/hdemucs_vocals_segment.wav")
other_original = download_asset("tutorial-assets/hdemucs_other_segment.wav")
drums_spec = audios["drums"][:, frame_start:frame_end].cpu()
drums, sample_rate = torchaudio.load(drums_original)
bass_spec = audios["bass"][:, frame_start:frame_end].cpu()
bass, sample_rate = torchaudio.load(bass_original)
vocals_spec = audios["vocals"][:, frame_start:frame_end].cpu()
vocals, sample_rate = torchaudio.load(vocals_original)
other_spec = audios["other"][:, frame_start:frame_end].cpu()
other, sample_rate = torchaudio.load(other_original)
mix_spec = mixture[:, frame_start:frame_end].cpu()
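5.3 Spectrograms and Audio
In the next 5 cells, you can see the spectrograms alongside the respective audio. The spectrogram gives a clear visual representation of each audio clip.
The mixture clip comes from the original track, and the remaining tracks are the model output.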
# Mixture Clip
plot_spectrogram(stft(mix_spec)[0], "Spectrogram - Mixture")
Audio(mix_spec, rate=sample_rate)
Drums SDR, Spectrogram, and Audio
# Drums Clip
output_results(drums, drums_spec, "drums")
SDR score is: 4.964477475897244
Bass SDR, Spectrogram, and Audio
# Bass Clip
output_results(bass, bass_spec, "bass")
SDR score is: 18.90589959575034
Vocals SDR, Spectrogram, and Audio
# Vocals Audio
output_results(vocals, vocals_spec, "vocals")
SDR score is: 8.792372276328596
Other SDR, Spectrogram, and Audio
# Other Clip
output_results(other, other_spec, "other")
SDR score is: 8.866964245665635
# Optionally, the full audio tracks can be heard by running the next 5
# cells. They will take a bit longer to load, so to run them simply uncomment
# the ``Audio`` cells for the respective track to produce the audio
# for the full song.
#
# Full Audio
# Audio(mixture, rate=sample_rate)
# Drums Audio
# Audio(audios["drums"], rate=sample_rate)
# Bass Audio
# Audio(audios["bass"], rate=sample_rate)
# Vocals Audio
# Audio(audios["vocals"], rate=sample_rate)
# Other Audio
# Audio(audios["other"], rate=sample_rate)
Total running time of the script: (0 minutes 25.315 seconds)