Model Description

PyTorch-Transformers (formerly known as pytorch-pretrained-bert) is a library of state-of-the-art pre-trained models for Natural Language Processing (NLP).

The library currently contains PyTorch implementations, pre-trained model weights, usage scripts, and conversion utilities for the following models:

  1. BERT (from Google) released with the paper BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding, by Jacob Devlin, Ming-Wei Chang, Kenton Lee and Kristina Toutanova.
  2. GPT (from OpenAI) released with the paper Improving Language Understanding by Generative Pre-Training, by Alec Radford, Karthik Narasimhan, Tim Salimans and Ilya Sutskever.
  3. GPT-2 (from OpenAI) released with the paper Language Models are Unsupervised Multitask Learners, by Alec Radford*, Jeffrey Wu*, Rewon Child, David Luan, Dario Amodei** and Ilya Sutskever**.
  4. Transformer-XL (from Google/CMU) released with the paper Transformer-XL: Attentive Language Models Beyond a Fixed-Length Context, by Zihang Dai*, Zhilin Yang*, Yiming Yang, Jaime Carbonell, Quoc V. Le and Ruslan Salakhutdinov.
  5. XLNet (from Google/CMU) released with the paper XLNet: Generalized Autoregressive Pretraining for Language Understanding, by Zhilin Yang*, Zihang Dai*, Yiming Yang, Jaime Carbonell, Ruslan Salakhutdinov and Quoc V. Le.
  6. XLM (from Facebook) released with the paper Cross-lingual Language Model Pretraining, by Guillaume Lample and Alexis Conneau.
  7. RoBERTa (from Facebook) released with the paper RoBERTa: A Robustly Optimized BERT Pretraining Approach, by Yinhan Liu, Myle Ott, Naman Goyal, Jingfei Du, Mandar Joshi, Danqi Chen, Omer Levy, Mike Lewis, Luke Zettlemoyer and Veselin Stoyanov.
  8. DistilBERT (from HuggingFace) released with the blog post Smaller, faster, cheaper, lighter: Introducing DistilBERT, a distilled version of BERT, by Victor Sanh, Lysandre Debut and Thomas Wolf.

The components available here are based on the AutoModel and AutoTokenizer classes of the pytorch-transformers library.

Requirements

Unlike most other PyTorch Hub models, BERT requires a few additional Python packages to be installed.

pip install tqdm boto3 requests regex sentencepiece sacremoses

Usage

The available methods are the following:

  • config: returns a configuration item corresponding to the specified model or path.
  • tokenizer: returns a tokenizer corresponding to the specified model or path.
  • model: returns a model corresponding to the specified model or path.
  • modelForCausalLM: returns a model with a language modeling head corresponding to the specified model or path.
  • modelForMaskedLM: returns a model with a masked language modeling head corresponding to the specified model or path (used in the masked-token example below).
  • modelForSequenceClassification: returns a model with a sequence classifier corresponding to the specified model or path.
  • modelForQuestionAnswering: returns a model with a question answering head corresponding to the specified model or path.

All these methods share the following argument: pretrained_model_or_path, a string identifying the pre-trained model or path from which the instance will be returned. Several checkpoints are available for each model, as detailed below.

The available models are listed on the model page of the transformers documentation.
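
The exact set of entry points exposed by the repo can also be queried programmatically. A minimal sketch using torch.hub.list (the names returned depend on the hubconf revision that gets checked out):

import torch
# Print the entry points ('config', 'tokenizer', 'model', ...) defined by the repo's hubconf
print(torch.hub.list('huggingface/pytorch-transformers'))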

Documentation

Here are a few examples detailing the usage of each available method.

Tokenizer

The tokenizer object allows the conversion from character strings to tokens understood by the different models. Each model has its own tokenizer, and some tokenization methods differ between tokenizers. The complete documentation can be found here.

import torch
tokenizer = torch.hub.load('huggingface/pytorch-transformers', 'tokenizer', 'bert-base-uncased')    # Download vocabulary from S3 and cache.
tokenizer = torch.hub.load('huggingface/pytorch-transformers', 'tokenizer', './test/bert_saved_model/')  # E.g. tokenizer was saved using `save_pretrained('./test/bert_saved_model/')`
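
Once loaded, the tokenizer can be used directly. A minimal sketch of the round trip from text to token ids and back, using the standard tokenize/encode/decode methods (the exact sub-word split depends on the checkpoint's vocabulary):

tokens = tokenizer.tokenize("Who was Jim Henson ?")                             # Sub-word tokens, no special tokens
token_ids = tokenizer.encode("Who was Jim Henson ?", add_special_tokens=True)   # Token ids with [CLS]/[SEP] added
text = tokenizer.decode(token_ids)                                              # Back to a string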

Model

The model object is a model instance inheriting from nn.Module. Each model ships with its own save/load methods: it can be loaded from a local file or directory, or from a pre-trained configuration (see config, described previously). Each model works differently; a complete overview of the different models can be found in the documentation.

import torch
model = torch.hub.load('huggingface/pytorch-transformers', 'model', 'bert-base-uncased')    # Download model and configuration from S3 and cache.
model = torch.hub.load('huggingface/pytorch-transformers', 'model', './test/bert_model/')  # E.g. model was saved using `save_pretrained('./test/bert_model/')`
model = torch.hub.load('huggingface/pytorch-transformers', 'model', 'bert-base-uncased', output_attentions=True)  # Update configuration during loading
assert model.config.output_attentions == True
# Loading from a TF checkpoint file instead of a PyTorch model (slower)
config = torch.hub.load('huggingface/pytorch-transformers', 'config', './tf_model/bert_tf_model_config.json')
model = torch.hub.load('huggingface/pytorch-transformers', 'model', './tf_model/bert_tf_checkpoint.ckpt.index', from_tf=True, config=config)
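
A loaded model can then encode text in combination with the matching tokenizer. A minimal sketch, assuming a recent transformers version in which the forward pass returns an output object with a last_hidden_state attribute:

import torch
tokenizer = torch.hub.load('huggingface/pytorch-transformers', 'tokenizer', 'bert-base-uncased')
model = torch.hub.load('huggingface/pytorch-transformers', 'model', 'bert-base-uncased')
model.eval()  # Disable dropout for deterministic inference

inputs = tokenizer("Hello, world!", return_tensors='pt')  # input_ids, token_type_ids, attention_mask
with torch.no_grad():
    outputs = model(**inputs)
print(outputs.last_hidden_state.shape)  # (batch_size, sequence_length, hidden_size)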

Models with a language modeling head

The previously mentioned model instance with an additional language modeling head.

import torch
model = torch.hub.load('huggingface/transformers', 'modelForCausalLM', 'gpt2')    # Download model and configuration from huggingface.co and cache.
model = torch.hub.load('huggingface/transformers', 'modelForCausalLM', './test/saved_model/')  # E.g. model was saved using `save_pretrained('./test/saved_model/')`
model = torch.hub.load('huggingface/transformers', 'modelForCausalLM', 'gpt2', output_attentions=True)  # Update configuration during loading
assert model.config.output_attentions == True
# Loading from a TF checkpoint file instead of a PyTorch model (slower)
config = torch.hub.load('huggingface/transformers', 'config', './tf_model/gpt_tf_model_config.json')
model = torch.hub.load('huggingface/transformers', 'modelForCausalLM', './tf_model/gpt_tf_checkpoint.ckpt.index', from_tf=True, config=config)
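
The language modeling head maps hidden states to vocabulary logits, so the model can score the next token. A minimal greedy next-token sketch, again assuming the output object API of recent transformers versions:

import torch
tokenizer = torch.hub.load('huggingface/transformers', 'tokenizer', 'gpt2')
model = torch.hub.load('huggingface/transformers', 'modelForCausalLM', 'gpt2')
model.eval()

input_ids = torch.tensor([tokenizer.encode("The Manhattan Bridge is")])
with torch.no_grad():
    logits = model(input_ids).logits                # (batch, seq_len, vocab_size)
next_token_id = torch.argmax(logits[0, -1]).item()  # Greedy choice for the next token
print(tokenizer.decode([next_token_id]))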

Models with a sequence classification head

The previously mentioned model instance with an additional sequence classification head.

import torch
model = torch.hub.load('huggingface/pytorch-transformers', 'modelForSequenceClassification', 'bert-base-uncased')    # Download model and configuration from S3 and cache.
model = torch.hub.load('huggingface/pytorch-transformers', 'modelForSequenceClassification', './test/bert_model/')  # E.g. model was saved using `save_pretrained('./test/bert_model/')`
model = torch.hub.load('huggingface/pytorch-transformers', 'modelForSequenceClassification', 'bert-base-uncased', output_attentions=True)  # Update configuration during loading
assert model.config.output_attentions == True
# Loading from a TF checkpoint file instead of a PyTorch model (slower)
config = torch.hub.load('huggingface/pytorch-transformers', 'config', './tf_model/bert_tf_model_config.json')
model = torch.hub.load('huggingface/pytorch-transformers', 'modelForSequenceClassification', './tf_model/bert_tf_checkpoint.ckpt.index', from_tf=True, config=config)

Models with a question answering head

The previously mentioned model instance with an additional question answering head.

import torch
model = torch.hub.load('huggingface/pytorch-transformers', 'modelForQuestionAnswering', 'bert-base-uncased')    # Download model and configuration from S3 and cache.
model = torch.hub.load('huggingface/pytorch-transformers', 'modelForQuestionAnswering', './test/bert_model/')  # E.g. model was saved using `save_pretrained('./test/bert_model/')`
model = torch.hub.load('huggingface/pytorch-transformers', 'modelForQuestionAnswering', 'bert-base-uncased', output_attentions=True)  # Update configuration during loading
assert model.config.output_attentions == True
# Loading from a TF checkpoint file instead of a PyTorch model (slower)
config = torch.hub.load('huggingface/pytorch-transformers', 'config', './tf_model/bert_tf_model_config.json')
model = torch.hub.load('huggingface/pytorch-transformers', 'modelForQuestionAnswering', './tf_model/bert_tf_checkpoint.ckpt.index', from_tf=True, config=config)

Configuration

The configuration is optional. The configuration object holds information concerning the model, such as the number of heads/layers, whether the model should output attentions or hidden states, or whether it should be adapted for TorchScript. Many parameters are available, some specific to each model. The complete documentation can be found here.

import torch
config = torch.hub.load('huggingface/pytorch-transformers', 'config', 'bert-base-uncased')  # Download configuration from S3 and cache.
config = torch.hub.load('huggingface/pytorch-transformers', 'config', './test/bert_saved_model/')  # E.g. config (or model) was saved using `save_pretrained('./test/saved_model/')`
config = torch.hub.load('huggingface/pytorch-transformers', 'config', './test/bert_saved_model/my_configuration.json')
config = torch.hub.load('huggingface/pytorch-transformers', 'config', 'bert-base-uncased', output_attentions=True, foo=False)
assert config.output_attentions == True
config, unused_kwargs = torch.hub.load('huggingface/pytorch-transformers', 'config', 'bert-base-uncased', output_attentions=True, foo=False, return_unused_kwargs=True)
assert config.output_attentions == True
assert unused_kwargs == {'foo': False}

# Using the configuration with a model
config = torch.hub.load('huggingface/pytorch-transformers', 'config', 'bert-base-uncased')
config.output_attentions = True
config.output_hidden_states = True
model = torch.hub.load('huggingface/pytorch-transformers', 'model', 'bert-base-uncased', config=config)
# Model will now output attentions and hidden states as well

Example Usage

Here is an example of how to tokenize input text to feed it as input to a BERT model, then get the hidden states computed by such a model, or predict masked tokens using a language modeling BERT model.

First, tokenize the input:

import torch
tokenizer = torch.hub.load('huggingface/pytorch-transformers', 'tokenizer', 'bert-base-cased')

text_1 = "Who was Jim Henson ?"
text_2 = "Jim Henson was a puppeteer"

# Tokenized input with special tokens around it (for BERT: [CLS] at the beginning and [SEP] at the end)
indexed_tokens = tokenizer.encode(text_1, text_2, add_special_tokens=True)

Use BertModel to encode the input sentence in a sequence of last-layer hidden states:

# Define sentence A and B indices associated to 1st and 2nd sentences (see paper)
segments_ids = [0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 1, 1, 1, 1, 1, 1]

# Convert inputs to PyTorch tensors
segments_tensors = torch.tensor([segments_ids])
tokens_tensor = torch.tensor([indexed_tokens])
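
# Alternatively, the tokenizer can build both tensors in a single call (a sketch
# assuming the tokenizer __call__ API of recent transformers versions):
# encoded = tokenizer(text_1, text_2, return_tensors='pt')
# tokens_tensor, segments_tensors = encoded['input_ids'], encoded['token_type_ids']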

model = torch.hub.load('huggingface/pytorch-transformers', 'model', 'bert-base-cased')

with torch.no_grad():
    # Recent versions of transformers return a model output object; the
    # encoded sequence is its `last_hidden_state` attribute.
    encoded_layers = model(tokens_tensor, token_type_ids=segments_tensors).last_hidden_state

Use modelForMaskedLM to predict a masked token with BERT:

# Mask a token that we will try to predict back with `BertForMaskedLM`
masked_index = 8
indexed_tokens[masked_index] = tokenizer.mask_token_id
tokens_tensor = torch.tensor([indexed_tokens])

masked_lm_model = torch.hub.load('huggingface/pytorch-transformers', 'modelForMaskedLM', 'bert-base-cased')

with torch.no_grad():
    predictions = masked_lm_model(tokens_tensor, token_type_ids=segments_tensors)

# Get the predicted token
predicted_index = torch.argmax(predictions[0][0], dim=1)[masked_index].item()
predicted_token = tokenizer.convert_ids_to_tokens([predicted_index])[0]
assert predicted_token == 'Jim'

Use modelForQuestionAnswering to do question answering with BERT:

question_answering_model = torch.hub.load('huggingface/pytorch-transformers', 'modelForQuestionAnswering', 'bert-large-uncased-whole-word-masking-finetuned-squad')
question_answering_tokenizer = torch.hub.load('huggingface/pytorch-transformers', 'tokenizer', 'bert-large-uncased-whole-word-masking-finetuned-squad')

# The format is paragraph first and then question
text_1 = "Jim Henson was a puppeteer"
text_2 = "Who was Jim Henson ?"
indexed_tokens = question_answering_tokenizer.encode(text_1, text_2, add_special_tokens=True)
segments_ids = [0, 0, 0, 0, 0, 0, 0, 1, 1, 1, 1, 1, 1, 1]
segments_tensors = torch.tensor([segments_ids])
tokens_tensor = torch.tensor([indexed_tokens])

# Predict the start and end positions logits
with torch.no_grad():
    out = question_answering_model(tokens_tensor, token_type_ids=segments_tensors)

# get the highest prediction
answer = question_answering_tokenizer.decode(indexed_tokens[torch.argmax(out.start_logits):torch.argmax(out.end_logits)+1])
assert answer == "puppeteer"

# Or get the total loss, which is the sum of the CrossEntropy losses for the start and end token positions (set the model to train mode first if used for training)
start_positions, end_positions = torch.tensor([12]), torch.tensor([14])
outputs = question_answering_model(tokens_tensor, token_type_ids=segments_tensors, start_positions=start_positions, end_positions=end_positions)
total_loss = outputs.loss

Use modelForSequenceClassification to do paraphrase classification with BERT:

sequence_classification_model = torch.hub.load('huggingface/pytorch-transformers', 'modelForSequenceClassification', 'bert-base-cased-finetuned-mrpc')
sequence_classification_tokenizer = torch.hub.load('huggingface/pytorch-transformers', 'tokenizer', 'bert-base-cased-finetuned-mrpc')

text_1 = "Jim Henson was a puppeteer"
text_2 = "Who was Jim Henson ?"
indexed_tokens = sequence_classification_tokenizer.encode(text_1, text_2, add_special_tokens=True)
segments_ids = [0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 1, 1, 1, 1, 1, 1]
segments_tensors = torch.tensor([segments_ids])
tokens_tensor = torch.tensor([indexed_tokens])

# Predict the sequence classification logits
with torch.no_grad():
    seq_classif_logits = sequence_classification_model(tokens_tensor, token_type_ids=segments_tensors)

predicted_labels = torch.argmax(seq_classif_logits[0]).item()

assert predicted_labels == 0  # In MRPC dataset this means the two sentences are not paraphrasing each other

# Or get the sequence classification loss (set model to train mode before if used for training)
labels = torch.tensor([1])
seq_classif_loss = sequence_classification_model(tokens_tensor, token_type_ids=segments_tensors, labels=labels).loss