Word Embeddings: Encoding Lexical Semantics

Created On: Apr 08, 2017 | Last Updated: Sep 14, 2021 | Last Verified: Nov 05, 2024

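Word embeddings are dense vectors of real numbers, one per word in your vocabulary. In NLP, it is almost always the case that your features are words! But how should you represent a word in a computer? You could store its ASCII character representation, but that only tells you what the word is; it says little about what the word means (you might be able to derive its part of speech from its affixes, or some properties from its capitalization, but not much more). More importantly, in what sense could you combine these representations? We often want dense outputs from our neural networks, where the inputs are \(|V|\) dimensional (\(V\) being our vocabulary) but the outputs have only a few dimensions (if we are only predicting a handful of labels, for instance). How do we get from a massive dimensional space to a smaller dimensional space?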

How about, instead of ASCII representations, we use a one-hot encoding? That is, we represent the word \(w\) by

\[\overbrace{\left[ 0, 0, \dots, 1, \dots, 0, 0 \right]}^\text{|V| elements} \]

where the 1 is in a location unique to \(w\). Any other word will have a 1 in some other location and a 0 everywhere else.
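As a minimal sketch of this encoding, the vectors can be built in plain Python. The three-word vocabulary and the helper names `one_hot` and `dot` are assumptions made for illustration and are not part of the tutorial's code. Note that the dot product of two different one-hot vectors is always 0.

```python
# Hypothetical toy vocabulary; the words and their order are assumptions.
vocab = ["mathematician", "physicist", "store"]
word_to_ix = {word: i for i, word in enumerate(vocab)}


def one_hot(word):
    """Return a |V|-element list with a 1 at the word's index and 0 elsewhere."""
    vec = [0] * len(vocab)
    vec[word_to_ix[word]] = 1
    return vec


def dot(u, v):
    """Plain dot product of two equal-length lists."""
    return sum(a * b for a, b in zip(u, v))


print(one_hot("physicist"))                                 # [0, 1, 0]
print(dot(one_hot("physicist"), one_hot("mathematician")))  # 0
print(dot(one_hot("physicist"), one_hot("physicist")))      # 1
```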

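This representation has a huge drawback besides being just plain huge. It basically treats all words as independent entities with no relation to each other. What we really want is some notion of similarity between words. Why? Let's see an example.

Suppose we are building a language model. Suppose we have seen the sentences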

The mathematician ran to the store.
The physicist ran to the store.
The mathematician solved the open problem.

in our training data. Now suppose we get a new sentence never before seen in our training data:

The physicist solved the open problem.

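Our language model might do OK on this sentence, but wouldn't it be much better if we could use the following two facts:

We have seen mathematician and physicist in the same role in a sentence. Somehow they have a semantic relation.
We have seen mathematician in the same role that physicist now plays in this new, unseen sentence.

and then infer that physicist is actually a good fit in the new, unseen sentence? This is what we mean by a notion of similarity: we mean semantic similarity, not simply having similar orthographic representations. It is a technique to combat the sparsity of linguistic data, by connecting the dots between what we have seen and what we haven't. This example of course relies on a fundamental linguistic assumption: that words appearing in similar contexts are related to each other semantically. This is called the distributional hypothesis.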

Getting Dense Word Embeddings

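How can we solve this problem? That is, how could we actually encode semantic similarity in words? Maybe we think up some semantic attributes. For example, we see that both mathematicians and physicists can run, so maybe we give these words a high score for the "is able to run" semantic attribute. Think of some other attributes, and imagine what you might score some common words on those attributes.

If each attribute is a dimension, then we might give each word a vector, like this: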

\[ q_\text{mathematician} = \left[ \overbrace{2.3}^\text{can run}, \overbrace{9.4}^\text{likes coffee}, \overbrace{-5.5}^\text{majored in Physics}, \dots \right]\]

\[ q_\text{physicist} = \left[ \overbrace{2.5}^\text{can run}, \overbrace{9.1}^\text{likes coffee}, \overbrace{6.4}^\text{majored in Physics}, \dots \right]\]

Then we can get a measure of similarity between these words by doing:

\[\text{Similarity}(\text{physicist}, \text{mathematician}) = q_\text{physicist} \cdot q_\text{mathematician} \]

Although it is more common to normalize by the lengths:

\[ \text{Similarity}(\text{physicist}, \text{mathematician}) = \frac{q_\text{physicist} \cdot q_\text{mathematician}} {\| q_\text{physicist} \| \| q_\text{mathematician} \|} = \cos (\phi)\]

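where \(\phi\) is the angle between the two vectors. That way, extremely similar words (words whose embeddings point in the same direction) will have similarity 1. Extremely dissimilar words should have similarity -1.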
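The two similarity measures can be sketched in plain Python. This is only an illustration: it uses the three entries of each hand-crafted vector that are written out above and ignores the elided dimensions, and the helper names `dot` and `cosine_similarity` are assumptions, not part of the tutorial's code.

```python
import math

# First three entries of the hand-crafted vectors; the elided dimensions
# ("...") are left out here, which is an assumption made for illustration.
q_mathematician = [2.3, 9.4, -5.5]
q_physicist = [2.5, 9.1, 6.4]


def dot(u, v):
    """Unnormalized similarity: the dot product."""
    return sum(a * b for a, b in zip(u, v))


def cosine_similarity(u, v):
    """Dot product divided by the product of the vector lengths."""
    return dot(u, v) / (math.sqrt(dot(u, u)) * math.sqrt(dot(v, v)))


print(dot(q_physicist, q_mathematician))                # about 56.09
print(cosine_similarity(q_physicist, q_mathematician))  # about 0.44
print(cosine_similarity(q_physicist, q_physicist))      # 1.0 up to rounding
```

The large disagreement on the third attribute pulls the cosine well below 1 even though the first two attributes nearly match.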

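You can think of the sparse one-hot vectors from the beginning of this section as a special case of these new vectors we have defined, where any two distinct words have similarity 0 and each word is given its own unique semantic attribute. These new vectors are dense, which is to say their entries are (typically) non-zero.

But these new vectors are a big pain: you could think of thousands of different semantic attributes that might be relevant to determining similarity, and how on earth would you set the values of the different attributes? Central to the idea of deep learning is that the neural network learns representations of the features, rather than requiring the programmer to design them. So why not just let the word embeddings be parameters in our model, and then be updated during training? This is exactly what we will do. We will have some latent semantic attributes that the network can, in principle, learn. Note that the word embeddings will probably not be interpretable. That is, although with our hand-crafted vectors above we can see that mathematicians and physicists are similar in that they both like coffee, if we allow a neural network to learn the embeddings and see that both mathematicians and physicists have a large value in the second dimension, it is not clear what that means. They are similar in some latent semantic dimension, but this probably has no interpretation to us.

In summary, word embeddings are a representation of the semantics of a word, efficiently encoding semantic information that might be relevant to the task at hand. You can embed other things too: part-of-speech tags, parse trees, anything! The idea of feature embeddings is central to the field.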

Word Embeddings in PyTorch

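Before we get to a worked example and an exercise, a few quick notes about how to use embeddings in PyTorch and in deep learning programming in general. Similar to how we defined a unique index for each word when making one-hot vectors, we also need to define an index for each word when using embeddings. These indices will be the keys into a lookup table. That is, embeddings are stored as a \(|V| \times D\) matrix, where \(D\) is the dimensionality of the embeddings, such that the word assigned index \(i\) has its embedding stored in the \(i\)'th row of the matrix. In all of my code, the mapping from words to indices is a dictionary named word_to_ix.

The module that allows you to use embeddings is torch.nn.Embedding, which takes two arguments: the vocabulary size and the dimensionality of the embeddings.

To index into this table, you must use an integer index tensor, typically a torch.LongTensor (since the indices are integers, not floats).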

# Author: Robert Guthrie

import torch
import torch.nn as nn
import torch.nn.functional as F
import torch.optim as optim

torch.manual_seed(1)

<torch._C.Generator object at 0x7f72ce596470>

word_to_ix = {"hello": 0, "world": 1}
embeds = nn.Embedding(2, 5)  # 2 words in vocab, 5 dimensional embeddings
lookup_tensor = torch.tensor([word_to_ix["hello"]], dtype=torch.long)
hello_embed = embeds(lookup_tensor)
print(hello_embed)

tensor([[ 0.6614,  0.2669,  0.0617,  0.6213, -0.4519]],
       grad_fn=<EmbeddingBackward0>)
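The lookup-table description can be checked directly. The following is a small sketch that reuses the two-word vocabulary from the snippet above and compares a lookup against the corresponding row of `embeds.weight`; the variable names `looked_up` and `both` are introduced here only for illustration.

```python
import torch
import torch.nn as nn

torch.manual_seed(1)

word_to_ix = {"hello": 0, "world": 1}
embeds = nn.Embedding(2, 5)  # |V| = 2, D = 5

# The parameters live in a single |V| x D matrix.
print(embeds.weight.shape)  # torch.Size([2, 5])

# Looking up index i returns row i of that matrix.
ix = word_to_ix["world"]
looked_up = embeds(torch.tensor([ix], dtype=torch.long))
print(torch.equal(looked_up[0], embeds.weight[ix]))  # True

# Several indices can be looked up at once, one row per index.
both = embeds(torch.tensor([0, 1], dtype=torch.long))
print(both.shape)  # torch.Size([2, 5])
```

Because `embeds.weight` is an `nn.Parameter`, these rows are what an optimizer updates during training.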

An Example: N-Gram Language Modeling

Recall that in an n-gram language model, given a sequence of words \(w\), we want to compute

\[P(w_i | w_{i-1}, w_{i-2}, \dots, w_{i-n+1} ) \]

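where \(w_i\) is the \(i\)th word of the sequence.

In this example, we will compute the loss function on some training examples and update the parameters with backpropagation.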

CONTEXT_SIZE = 2
EMBEDDING_DIM = 10
# We will use Shakespeare Sonnet 2
test_sentence = """When forty winters shall besiege thy brow,
And dig deep trenches in thy beauty's field,
Thy youth's proud livery so gazed on now,
Will be a totter'd weed of small worth held:
Then being asked, where all thy beauty lies,
Where all the treasure of thy lusty days;
To say, within thine own deep sunken eyes,
Were an all-eating shame, and thriftless praise.
How much more praise deserv'd thy beauty's use,
If thou couldst answer 'This fair child of mine
Shall sum my count, and make my old excuse,'
Proving his beauty by succession thine!
This were to be new made when thou art old,
And see thy blood warm when thou feel'st it cold.""".split()
# we should tokenize the input, but we will ignore that for now
# build a list of tuples.
# Each tuple is ([ word_i-CONTEXT_SIZE, ..., word_i-1 ], target word)
ngrams = [
    (
        [test_sentence[i - j - 1] for j in range(CONTEXT_SIZE)],
        test_sentence[i]
    )
    for i in range(CONTEXT_SIZE, len(test_sentence))
]
# Print the first 3, just so you can see what they look like.
print(ngrams[:3])

vocab = set(test_sentence)
word_to_ix = {word: i for i, word in enumerate(vocab)}


class NGramLanguageModeler(nn.Module):

    def __init__(self, vocab_size, embedding_dim, context_size):
        super(NGramLanguageModeler, self).__init__()
        self.embeddings = nn.Embedding(vocab_size, embedding_dim)
        self.linear1 = nn.Linear(context_size * embedding_dim, 128)
        self.linear2 = nn.Linear(128, vocab_size)

    def forward(self, inputs):
        embeds = self.embeddings(inputs).view((1, -1))
        out = F.relu(self.linear1(embeds))
        out = self.linear2(out)
        log_probs = F.log_softmax(out, dim=1)
        return log_probs


losses = []
loss_function = nn.NLLLoss()
model = NGramLanguageModeler(len(vocab), EMBEDDING_DIM, CONTEXT_SIZE)
optimizer = optim.SGD(model.parameters(), lr=0.001)

for epoch in range(10):
    total_loss = 0
    for context, target in ngrams:

        # Step 1. Prepare the inputs to be passed to the model (i.e., turn the words
        # into integer indices and wrap them in tensors)
        context_idxs = torch.tensor([word_to_ix[w] for w in context], dtype=torch.long)

        # Step 2. Recall that torch *accumulates* gradients. Before passing in a
        # new instance, you need to zero out the gradients from the old
        # instance
        model.zero_grad()

        # Step 3. Run the forward pass, getting log probabilities over next
        # words
        log_probs = model(context_idxs)

        # Step 4. Compute your loss function. (Again, Torch wants the target
        # word wrapped in a tensor)
        loss = loss_function(log_probs, torch.tensor([word_to_ix[target]], dtype=torch.long))

        # Step 5. Do the backward pass and update the parameters
        loss.backward()
        optimizer.step()

        # Get the Python number from a 1-element Tensor by calling tensor.item()
        total_loss += loss.item()
    losses.append(total_loss)
print(losses)  # The loss decreased every iteration over the training data!

# To get the embedding of a particular word, e.g. "beauty"
print(model.embeddings.weight[word_to_ix["beauty"]])

[(['forty', 'When'], 'winters'), (['winters', 'forty'], 'shall'), (['shall', 'winters'], 'besiege')]
[521.44149518013, 518.8340816497803, 516.2432668209076, 513.668018579483, 511.10753536224365, 508.561292886734, 506.02885699272156, 503.50800943374634, 500.99908232688904, 498.4997034072876]
tensor([-1.8804, -0.7788,  2.0251, -0.0871,  2.3550, -1.0376,  1.5748, -0.6295,
         2.4065,  0.2789], grad_fn=<SelectBackward0>)
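To turn the log probabilities into a predicted next word, one option is to take the argmax and map the index back to a word. The sketch below is an illustration under stated assumptions: it uses a hypothetical five-word vocabulary, an assumed inverse mapping named `ix_to_word`, and untrained layers with the same shapes as NGramLanguageModeler, so the word it prints is arbitrary.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

torch.manual_seed(1)

# Hypothetical toy vocabulary, sorted so the indices are reproducible.
vocab = sorted({"when", "forty", "winters", "shall", "besiege"})
word_to_ix = {word: i for i, word in enumerate(vocab)}
ix_to_word = {i: word for word, i in word_to_ix.items()}  # assumed helper

CONTEXT_SIZE = 2
EMBEDDING_DIM = 10

# Stand-in for a trained model: same layer shapes as NGramLanguageModeler,
# but untrained, so the predicted word carries no meaning.
embeddings = nn.Embedding(len(vocab), EMBEDDING_DIM)
linear1 = nn.Linear(CONTEXT_SIZE * EMBEDDING_DIM, 128)
linear2 = nn.Linear(128, len(vocab))

context = ["forty", "when"]
context_idxs = torch.tensor([word_to_ix[w] for w in context], dtype=torch.long)

with torch.no_grad():  # no gradients are needed for inference
    embeds = embeddings(context_idxs).view((1, -1))
    log_probs = F.log_softmax(linear2(F.relu(linear1(embeds))), dim=1)

predicted_ix = log_probs.argmax(dim=1).item()
print(ix_to_word[predicted_ix])
print(log_probs.exp().sum().item())  # probabilities sum to about 1.0
```

With the trained `model` from the example, the same decoding step would apply to `model(context_idxs)`.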

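Exercise: Computing Word Embeddings: Continuous Bag-of-Words

The Continuous Bag-of-Words model (CBOW) is frequently used in NLP deep learning. It is a model that tries to predict a word given the context of a few words before and a few words after the target word. This is distinct from language modeling, since CBOW is not sequential and does not have to be probabilistic. Typically, CBOW is used to quickly train word embeddings, and these embeddings are used to initialize the embeddings of some more complicated model. Usually, this is referred to as pretraining embeddings. It almost always helps performance by a couple of percent.

The CBOW model is as follows. Given a target word \(w_i\) and a context window of \(N\) words on each side, \(w_{i-1}, \dots, w_{i-N}\) and \(w_{i+1}, \dots, w_{i+N}\), referring to all context words collectively as \(C\), CBOW tries to minimize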

\[-\log p(w_i | C) = -\log \text{Softmax}\left(A(\sum_{w \in C} q_w) + b\right) \]

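where \(q_w\) is the embedding for word \(w\).

Implement this model in PyTorch by filling in the class below. Some tips:

Think about which parameters you need to define.
Make sure you know what shape each operation expects. Use .view() if you need to reshape.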
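As a hedged illustration of the shape tip, the snippet below traces the shapes of an embedding lookup, a sum over context words, and a `.view()` reshape. It is not a solution to the exercise: the vocabulary size, embedding dimension, and variable names are assumptions, and the indices are simply example values of the kind make_context_vector (defined later in this section) returns.

```python
import torch
import torch.nn as nn

torch.manual_seed(1)

VOCAB_SIZE = 49      # hypothetical vocabulary size
EMBEDDING_DIM = 10   # hypothetical embedding dimension

embeddings = nn.Embedding(VOCAB_SIZE, EMBEDDING_DIM)

# Four context indices (2 to the left, 2 to the right of the target).
context_idxs = torch.tensor([41, 21, 13, 46], dtype=torch.long)

looked_up = embeddings(context_idxs)
print(looked_up.shape)  # torch.Size([4, 10]): one row per context word

summed = looked_up.sum(dim=0)
print(summed.shape)  # torch.Size([10]): the sum of q_w over w in C

# The n-gram example feeds a (1, features) tensor into nn.Linear and applies
# log_softmax over dim=1; mirroring that layout here is an assumption.
batched = summed.view(1, -1)
print(batched.shape)  # torch.Size([1, 10])
```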

CONTEXT_SIZE = 2  # 2 words to the left, 2 to the right
raw_text = """We are about to study the idea of a computational process.
Computational processes are abstract beings that inhabit computers.
As they evolve, processes manipulate other abstract things called data.
The evolution of a process is directed by a pattern of rules
called a program. People create programs to direct processes. In effect,
we conjure the spirits of the computer with our spells.""".split()

# By deriving a set from `raw_text`, we deduplicate the array
vocab = set(raw_text)
vocab_size = len(vocab)

word_to_ix = {word: i for i, word in enumerate(vocab)}
data = []
for i in range(CONTEXT_SIZE, len(raw_text) - CONTEXT_SIZE):
    context = (
        [raw_text[i - j - 1] for j in range(CONTEXT_SIZE)]
        + [raw_text[i + j + 1] for j in range(CONTEXT_SIZE)]
    )
    target = raw_text[i]
    data.append((context, target))
print(data[:5])


class CBOW(nn.Module):

    def __init__(self):
        pass

    def forward(self, inputs):
        pass

# Create your model and train. Here are some functions to help you make
# the data ready for use by your module.


def make_context_vector(context, word_to_ix):
    idxs = [word_to_ix[w] for w in context]
    return torch.tensor(idxs, dtype=torch.long)


make_context_vector(data[0][0], word_to_ix)  # example

[(['are', 'We', 'to', 'study'], 'about'), (['about', 'are', 'study', 'the'], 'to'), (['to', 'about', 'the', 'idea'], 'study'), (['study', 'to', 'idea', 'of'], 'the'), (['the', 'study', 'of', 'a'], 'idea')]

tensor([41, 21, 13, 46])
