torchtext.vocab

Vocab

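class torchtext.vocab.Vocab(vocab)

Creates a vocab object that maps tokens to indices.

Parameters: vocab (torch.classes.torchtext.Vocab or torchtext._torchtext.Vocab) – a C++ vocab object.

__contains__(token: str) → bool

Parameters: token – the token whose membership is checked.
Returns: whether the token is a member of the vocab.

__getitem__(token: str) → int

Parameters: token – the token used to look up the corresponding index.
Returns: the index corresponding to the token. For a token that is not in the vocab, the default index is returned if one is set; otherwise a RuntimeError is raised.

__init__(vocab) → None: Initializes internal Module state, shared by both nn.Module and ScriptModule.

__jit_unused_properties__ = ['is_jitable']

__len__() → int

Returns: the length of the vocab.

__prepare_scriptable__(): Returns a JIT-able Vocab.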

append_token(token: str) → None

Parameters: token – the token to append to the vocab.
Raises: RuntimeError – if token already exists in the vocab.

forward(tokens: List[str]) → List[int]

Calls the lookup_indices method.

Parameters: tokens – the list of tokens whose indices are looked up.
Returns: the indices associated with the list of tokens.

get_default_index() → Optional[int]

Returns: the value of the default index, if one is set.

get_itos() → List[str]

Returns: a list mapping indices to tokens.

get_stoi() → Dict[str, int]

Returns: a dictionary mapping tokens to indices.

insert_token(token: str, index: int) → None

Parameters:

token – the token to insert into the vocab.
index – the index at which to insert the token.

Raises:

RuntimeError – if index is not in the range [0, Vocab.size()], or if token already exists in the vocab.
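
The two error cases of insert_token can be pictured with a plain Python list standing in for the index-to-token table. This is an illustrative sketch only, not torchtext's implementation: the helper insert_token_sketch is hypothetical, and the shifting of later tokens is an assumption that follows from ordinary list insertion rather than from the description above.

```python
from typing import List


def insert_token_sketch(itos: List[str], token: str, index: int) -> None:
    """Insert token at index in place, mirroring the documented error cases."""
    if token in itos:
        raise RuntimeError(f"token {token!r} already exists in the vocab")
    # The documented range is closed on both ends, so index == len(itos) appends.
    if not 0 <= index <= len(itos):
        raise RuntimeError(f"index {index} is not in range [0, {len(itos)}]")
    # Assumption: tokens at or after index move up by one position.
    itos.insert(index, token)


itos = ["<unk>", "a", "b"]
insert_token_sketch(itos, "<pad>", 1)
stoi = {tok: i for i, tok in enumerate(itos)}  # rebuild the token-to-index map
print(itos)       # ['<unk>', '<pad>', 'a', 'b']
print(stoi["a"])  # 2
```

In this sketch, inserting at index len(itos) behaves like append_token.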

lookup_indices(tokens: List[str]) → List[int]

Parameters: tokens – the tokens whose indices are looked up.
Returns: the indices associated with tokens.

lookup_token(index: int) → str

Parameters: index – the index whose token is looked up.
Returns: the token associated with index.
Return type: str
Raises: RuntimeError – if index is not in the range [0, itos.size()).

lookup_tokens(indices: List[int]) → List[str]

Parameters: indices – the indices whose tokens are looked up.
Returns: the tokens associated with indices.
Raises: RuntimeError – if any index in indices is not in the range [0, itos.size()).

set_default_index(index: Optional[int]) → None

Parameters: index – the value of the default index. This index is returned when an out-of-vocabulary (OOV) token is queried.
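
How the default index interacts with lookups may be easier to see in a small pure-Python stand-in. The class TinyVocab below is hypothetical and dict-backed; it only mimics the behaviour documented above (OOV lookups return the default index when one is set and raise RuntimeError otherwise) and is not how torchtext implements Vocab.

```python
from typing import Dict, List, Optional


class TinyVocab:
    """Hypothetical dict-backed stand-in for the documented lookup behaviour."""

    def __init__(self, tokens: List[str]) -> None:
        self.itos: List[str] = list(tokens)
        self.stoi: Dict[str, int] = {tok: i for i, tok in enumerate(self.itos)}
        self.default_index: Optional[int] = None

    def set_default_index(self, index: Optional[int]) -> None:
        self.default_index = index

    def __getitem__(self, token: str) -> int:
        if token in self.stoi:
            return self.stoi[token]
        if self.default_index is None:
            # No default index set: an OOV lookup is an error.
            raise RuntimeError(f"token {token!r} not found and no default index is set")
        return self.default_index

    def lookup_indices(self, tokens: List[str]) -> List[int]:
        return [self[tok] for tok in tokens]

    def lookup_token(self, index: int) -> str:
        if not 0 <= index < len(self.itos):
            raise RuntimeError(f"index {index} is not in range [0, {len(self.itos)})")
        return self.itos[index]


v = TinyVocab(["<unk>", "a", "b"])
v.set_default_index(v["<unk>"])
print(v.lookup_indices(["a", "zzz", "b"]))  # [1, 0, 2]
print(v.lookup_token(2))                    # b
```

Pointing the default index at the index of an unknown-word token, as done here, is a common choice because every OOV token then maps to one known entry.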

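vocab

torchtext.vocab.vocab(ordered_dict: Dict, min_freq: int = 1, specials: Optional[List[str]] = None, special_first: bool = True) → Vocab

Factory method for creating a vocab object that maps tokens to indices.

Note that the order in which key-value pairs were inserted into ordered_dict is respected when building the vocab. If ordering by token frequency matters, create ordered_dict so that it reflects that ordering.

Parameters:

ordered_dict – ordered dictionary mapping tokens to their occurrence frequencies.
min_freq – the minimum frequency needed to include a token in the vocab.
specials – special symbols to add. The order of the supplied tokens is preserved.
special_first – whether to insert the special symbols at the beginning (True) or at the end (False).

Returns:

A Vocab object

Return type:

torchtext.vocab.Vocab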

Examples

>>> from torchtext.vocab import vocab
>>> from collections import Counter, OrderedDict
>>> counter = Counter(["a", "a", "b", "b", "b"])
>>> sorted_by_freq_tuples = sorted(counter.items(), key=lambda x: x[1], reverse=True)
>>> ordered_dict = OrderedDict(sorted_by_freq_tuples)
>>> v1 = vocab(ordered_dict)
>>> print(v1['a'])  # prints 1
>>> print(v1['out of vocab'])  # raises RuntimeError since no default index is set
>>> tokens = ['e', 'd', 'c', 'b', 'a']
>>> # add the <unk> token and a default index
>>> unk_token = '<unk>'
>>> default_index = -1
>>> v2 = vocab(OrderedDict([(token, 1) for token in tokens]), specials=[unk_token])
>>> v2.set_default_index(default_index)
>>> print(v2['<unk>'])  # prints 0
>>> print(v2['out of vocab'])  # prints -1
>>> # make the default index the same as the index of unk_token
>>> v2.set_default_index(v2[unk_token])
>>> v2['out of vocab'] == v2[unk_token]  # True
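
The way min_freq, specials and special_first combine to fix the final token order can be sketched without torchtext. The function vocab_order_sketch below is hypothetical and returns only the index-to-token list; dropping a special symbol that also appears in ordered_dict is an assumption of this sketch, not something stated above.

```python
from collections import OrderedDict
from typing import List, Optional


def vocab_order_sketch(
    ordered_dict: "OrderedDict[str, int]",
    min_freq: int = 1,
    specials: Optional[List[str]] = None,
    special_first: bool = True,
) -> List[str]:
    """Return the index-to-token list implied by the documented parameters."""
    specials = list(specials or [])
    # Keep insertion order; filter by frequency; avoid duplicating specials (assumption).
    tokens = [
        tok
        for tok, freq in ordered_dict.items()
        if freq >= min_freq and tok not in specials
    ]
    return specials + tokens if special_first else tokens + specials


od = OrderedDict([("b", 3), ("a", 2), ("c", 1)])
print(vocab_order_sketch(od, min_freq=2, specials=["<unk>"]))
# ['<unk>', 'b', 'a']
print(vocab_order_sketch(od, min_freq=2, specials=["<unk>"], special_first=False))
# ['b', 'a', '<unk>']
```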

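build_vocab_from_iterator

torchtext.vocab.build_vocab_from_iterator(iterator: Iterable, min_freq: int = 1, specials: Optional[List[str]] = None, special_first: bool = True, max_tokens: Optional[int] = None) → Vocab

Builds a Vocab from an iterator.

Parameters:

iterator – the iterator used to build the Vocab. It must yield lists or iterators of tokens.
min_freq – the minimum frequency needed to include a token in the vocab.
specials – special symbols to add. The order of the supplied tokens is preserved.
special_first – whether to insert the special symbols at the beginning (True) or at the end (False).
max_tokens – if provided, creates the vocab from the max_tokens - len(specials) most frequent tokens.

Returns:

A Vocab object

Return type:

torchtext.vocab.Vocab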

Examples

>>> # generating a vocab from a text file
>>> # file_path is assumed to hold the path to a UTF-8 text file
>>> import io
>>> from torchtext.vocab import build_vocab_from_iterator
>>> def yield_tokens(file_path):
...     with io.open(file_path, encoding='utf-8') as f:
...         for line in f:
...             yield line.strip().split()
...
>>> vocab = build_vocab_from_iterator(yield_tokens(file_path), specials=["<unk>"])
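
The counting step and the max_tokens budget can be illustrated with collections.Counter alone. The helper count_and_rank below is hypothetical and always places the special symbols first; the alphabetical tie-break between equally frequent tokens is an assumption of this sketch, and the function is not torchtext's implementation.

```python
from collections import Counter
from typing import Iterable, List, Optional


def count_and_rank(
    iterator: Iterable[Iterable[str]],
    min_freq: int = 1,
    specials: Optional[List[str]] = None,
    max_tokens: Optional[int] = None,
) -> List[str]:
    """Count tokens, rank them by frequency, and apply the max_tokens budget."""
    specials = list(specials or [])
    counter: Counter = Counter()
    for tokens in iterator:
        counter.update(tokens)
    # Most frequent first; ties broken alphabetically (assumption of this sketch).
    ranked = sorted(counter.items(), key=lambda kv: (-kv[1], kv[0]))
    if max_tokens is not None:
        # The special symbols count against the budget.
        ranked = ranked[: max_tokens - len(specials)]
    kept = [tok for tok, freq in ranked if freq >= min_freq and tok not in specials]
    return specials + kept


lines = ["the cat sat", "the cat ran", "the dog"]
itos = count_and_rank((line.split() for line in lines), specials=["<unk>"], max_tokens=3)
print(itos)  # ['<unk>', 'the', 'cat']
```

With max_tokens=3 and one special symbol, only the two most frequent tokens survive, which matches the max_tokens - len(specials) rule described for the max_tokens parameter.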

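Vectors

class torchtext.vocab.Vectors(name, cache=None, url=None, unk_init=None, max_vectors=None)

__init__(name, cache=None, url=None, unk_init=None, max_vectors=None) → None

Parameters:

name – name of the file that contains the vectors.
cache – directory for cached vectors.
url – URL to download from if the vectors are not found in the cache.
unk_init (callback) – by default, out-of-vocabulary word vectors are initialized to zero vectors; can be any function that takes a Tensor and returns a Tensor of the same size.
max_vectors (int) – can be used to limit the number of pre-trained vectors loaded. Most pre-trained vector sets are sorted in descending order of word frequency, so when the entire set does not fit in memory or is not needed for another reason, passing max_vectors limits the size of the loaded set.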

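get_vecs_by_tokens(tokens, lower_case_backup=False)

Looks up the embedding vectors of tokens.

Parameters:

tokens – a token or a list of tokens. If tokens is a string, returns a 1-D tensor of shape self.dim; if tokens is a list of strings, returns a 2-D tensor of shape (len(tokens), self.dim).
lower_case_backup – whether to fall back to the lower-cased token. If False, each token is looked up in its original case only; if True, each token is first looked up in its original case and, if it is not found among the keys of the stoi attribute, looked up in lower case. Default: False.

Examples

>>> from torchtext.vocab import GloVe
>>> examples = ['chip', 'baby', 'Beautiful']
>>> vec = GloVe(name='6B', dim=50)
>>> ret = vec.get_vecs_by_tokens(examples, lower_case_backup=True)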
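
The fallback order used by lower_case_backup can be sketched with plain lists in place of tensors. The function get_vecs_sketch and the toy stoi and vectors tables are hypothetical; unknown tokens get a zero vector here, mirroring the default unk_init behaviour described for Vectors.

```python
from typing import Dict, List, Union


def get_vecs_sketch(
    stoi: Dict[str, int],
    vectors: List[List[float]],
    dim: int,
    tokens: Union[str, List[str]],
    lower_case_backup: bool = False,
) -> Union[List[float], List[List[float]]]:
    """Look up one token or a list of tokens, optionally retrying in lower case."""

    def lookup(token: str) -> List[float]:
        if token in stoi:
            return vectors[stoi[token]]
        if lower_case_backup and token.lower() in stoi:
            return vectors[stoi[token.lower()]]
        return [0.0] * dim  # zero vector for tokens that are not found

    if isinstance(tokens, str):
        return lookup(tokens)  # single token: one vector
    return [lookup(tok) for tok in tokens]  # list of tokens: one vector per token


stoi = {"chip": 0, "beautiful": 1}
vectors = [[0.1, 0.2], [0.3, 0.4]]
print(get_vecs_sketch(stoi, vectors, 2, ["chip", "Beautiful"]))
# [[0.1, 0.2], [0.0, 0.0]]
print(get_vecs_sketch(stoi, vectors, 2, ["chip", "Beautiful"], lower_case_backup=True))
# [[0.1, 0.2], [0.3, 0.4]]
```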

Pretrained Word Embeddings

GloVe

class torchtext.vocab.GloVe(name='840B', dim=300, **kwargs)

FastText

class torchtext.vocab.FastText(language='en', **kwargs)

CharNGram

class torchtext.vocab.CharNGram(**kwargs)
