PyTorch-Transformers

Model Description

PyTorch-Transformers (formerly known as pytorch-pretrained-bert) is a library of state-of-the-art pre-trained models for Natural Language Processing (NLP).

The library currently contains PyTorch implementations, pre-trained model weights, usage scripts and conversion utilities for the following models:

  1. BERT (from Google), released together with the paper BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding by Jacob Devlin, Ming-Wei Chang, Kenton Lee and Kristina Toutanova.
  2. GPT (from OpenAI), released together with the paper Improving Language Understanding by Generative Pre-Training by Alec Radford, Karthik Narasimhan, Tim Salimans and Ilya Sutskever.
  3. GPT-2 (from OpenAI), released together with the paper Language Models are Unsupervised Multitask Learners by Alec Radford*, Jeffrey Wu*, Rewon Child, David Luan, Dario Amodei** and Ilya Sutskever**.
  4. Transformer-XL (from Google/CMU), released together with the paper Transformer-XL: Attentive Language Models Beyond a Fixed-Length Context by Zihang Dai*, Zhilin Yang*, Yiming Yang, Jaime Carbonell, Quoc V. Le and Ruslan Salakhutdinov.
  5. XLNet (from Google/CMU), released together with the paper XLNet: Generalized Autoregressive Pretraining for Language Understanding by Zhilin Yang*, Zihang Dai*, Yiming Yang, Jaime Carbonell, Ruslan Salakhutdinov and Quoc V. Le.
  6. XLM (from Facebook), released together with the paper Cross-lingual Language Model Pretraining by Guillaume Lample and Alexis Conneau.
  7. RoBERTa (from Facebook), released together with the paper RoBERTa: A Robustly Optimized BERT Pretraining Approach by Yinhan Liu, Myle Ott, Naman Goyal, Jingfei Du, Mandar Joshi, Danqi Chen, Omer Levy, Mike Lewis, Luke Zettlemoyer and Veselin Stoyanov.
  8. DistilBERT (from HuggingFace), released together with the blog post Smaller, faster, cheaper, lighter: Introducing DistilBERT, a distilled version of BERT by Victor Sanh, Lysandre Debut and Thomas Wolf.

The components available here are based on the AutoModel and AutoTokenizer classes of the pytorch-transformers library.

Requirements

Unlike most other PyTorch Hub models, BERT requires a few additional Python packages to be installed.

pip install tqdm boto3 requests regex sentencepiece sacremoses

Usage

The available methods are the following:

  • config: returns a configuration corresponding to the specified model or path.
  • tokenizer: returns a tokenizer corresponding to the specified model or path.
  • model: returns a model corresponding to the specified model or path.
  • modelForCausalLM: returns a model with a language modeling head corresponding to the specified model or path.
  • modelForSequenceClassification: returns a model with a sequence classification head corresponding to the specified model or path.
  • modelForQuestionAnswering: returns a model with a question answering head corresponding to the specified model or path.

All these methods share the following argument: pretrained_model_or_path, a string identifying the pre-trained model or path from which the returned instance will be built. Several checkpoints are available for each model, as detailed below.

The available models are listed on the models page of the transformers documentation.

Documentation

Here are a few examples detailing the usage of each available method.

Tokenizer

The tokenizer object allows the conversion from character strings to tokens understood by the different models. Each model has its own tokenizer, and some tokenizing methods differ between tokenizers. The complete documentation can be found here.

import torch
tokenizer = torch.hub.load('huggingface/pytorch-transformers', 'tokenizer', 'bert-base-uncased')    # Download vocabulary from S3 and cache.
tokenizer = torch.hub.load('huggingface/pytorch-transformers', 'tokenizer', './test/bert_saved_model/')  # E.g. tokenizer was saved using `save_pretrained('./test/saved_model/')`
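
Once loaded, the tokenizer can be used directly. A minimal sketch (the exact sub-tokens and IDs depend on the checkpoint's vocabulary):

tokens = tokenizer.tokenize("Who was Jim Henson ?")                             # list of sub-word strings
token_ids = tokenizer.encode("Who was Jim Henson ?", add_special_tokens=True)   # list of vocabulary IDs, with [CLS]/[SEP] for BERT
text = tokenizer.decode(token_ids)                                              # back to a string, special tokens included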

Models

The model object is a model instance inheriting from nn.Module. Each model comes with its own saving/loading method, either from a local file or directory, or from a pre-trained configuration (see the previously described config). Each model works differently; a complete overview of the different models can be found in the documentation.

import torch
model = torch.hub.load('huggingface/pytorch-transformers', 'model', 'bert-base-uncased')    # Download model and configuration from S3 and cache.
model = torch.hub.load('huggingface/pytorch-transformers', 'model', './test/bert_model/')  # E.g. model was saved using `save_pretrained('./test/saved_model/')`
model = torch.hub.load('huggingface/pytorch-transformers', 'model', 'bert-base-uncased', output_attentions=True)  # Update configuration during loading
assert model.config.output_attentions == True
# Loading from a TF checkpoint file instead of a PyTorch model (slower)
from transformers import AutoConfig  # AutoConfig is provided by the transformers library
config = AutoConfig.from_pretrained('./tf_model/bert_tf_model_config.json')
model = torch.hub.load('huggingface/pytorch-transformers', 'model', './tf_model/bert_tf_checkpoint.ckpt.index', from_tf=True, config=config)
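
As a quick check that the loaded model runs, here is a minimal forward-pass sketch; it assumes the tokenizer loaded in the previous section and relies on the first element of the model output being the last hidden state:

input_ids = torch.tensor([tokenizer.encode("Who was Jim Henson ?", add_special_tokens=True)])
with torch.no_grad():
    last_hidden_state = model(input_ids)[0]   # shape: (batch_size, sequence_length, hidden_size)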

Models with a language modeling head

The previously mentioned model instance, with an additional language modeling head.

import torch
model = torch.hub.load('huggingface/transformers', 'modelForCausalLM', 'gpt2')    # Download model and configuration from huggingface.co and cache.
model = torch.hub.load('huggingface/transformers', 'modelForCausalLM', './test/saved_model/')  # E.g. model was saved using `save_pretrained('./test/saved_model/')`
model = torch.hub.load('huggingface/transformers', 'modelForCausalLM', 'gpt2', output_attentions=True)  # Update configuration during loading
assert model.config.output_attentions == True
# Loading from a TF checkpoint file instead of a PyTorch model (slower)
from transformers import AutoConfig
config = AutoConfig.from_pretrained('./tf_model/gpt_tf_model_config.json')
model = torch.hub.load('huggingface/transformers', 'modelForCausalLM', './tf_model/gpt_tf_checkpoint.ckpt.index', from_tf=True, config=config)
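
A minimal sketch of greedy next-token prediction with the causal LM head; it assumes the GPT-2 tokenizer is loaded through the same hub entry point and that the first element of the model output is the logits tensor:

gpt2_tokenizer = torch.hub.load('huggingface/transformers', 'tokenizer', 'gpt2')
input_ids = torch.tensor([gpt2_tokenizer.encode("The Manhattan Bridge is a")])
with torch.no_grad():
    logits = model(input_ids)[0]                        # shape: (batch_size, sequence_length, vocab_size)
next_token_id = torch.argmax(logits[0, -1]).item()      # greedy choice for the next token
print(gpt2_tokenizer.decode([next_token_id]))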

Models with a sequence classification head

The previously mentioned model instance, with an additional sequence classification head.

import torch
model = torch.hub.load('huggingface/pytorch-transformers', 'modelForSequenceClassification', 'bert-base-uncased')    # Download model and configuration from S3 and cache.
model = torch.hub.load('huggingface/pytorch-transformers', 'modelForSequenceClassification', './test/bert_model/')  # E.g. model was saved using `save_pretrained('./test/saved_model/')`
model = torch.hub.load('huggingface/pytorch-transformers', 'modelForSequenceClassification', 'bert-base-uncased', output_attentions=True)  # Update configuration during loading
assert model.config.output_attentions == True
# Loading from a TF checkpoint file instead of a PyTorch model (slower)
from transformers import AutoConfig
config = AutoConfig.from_pretrained('./tf_model/bert_tf_model_config.json')
model = torch.hub.load('huggingface/pytorch-transformers', 'modelForSequenceClassification', './tf_model/bert_tf_checkpoint.ckpt.index', from_tf=True, config=config)

Models with a question answering head

The previously mentioned model instance, with an additional question answering head.

import torch
model = torch.hub.load('huggingface/pytorch-transformers', 'modelForQuestionAnswering', 'bert-base-uncased')    # Download model and configuration from S3 and cache.
model = torch.hub.load('huggingface/pytorch-transformers', 'modelForQuestionAnswering', './test/bert_model/')  # E.g. model was saved using `save_pretrained('./test/saved_model/')`
model = torch.hub.load('huggingface/pytorch-transformers', 'modelForQuestionAnswering', 'bert-base-uncased', output_attentions=True)  # Update configuration during loading
assert model.config.output_attentions == True
# Loading from a TF checkpoint file instead of a PyTorch model (slower)
from transformers import AutoConfig
config = AutoConfig.from_pretrained('./tf_model/bert_tf_model_config.json')
model = torch.hub.load('huggingface/pytorch-transformers', 'modelForQuestionAnswering', './tf_model/bert_tf_checkpoint.ckpt.index', from_tf=True, config=config)

Configuration

The configuration is optional. The configuration object holds information about the model, such as the number of heads/layers, whether the model should output attentions or hidden states, or whether it should be adapted for TorchScript. Many parameters are available, some specific to each model. The complete documentation can be found here.

import torch
config = torch.hub.load('huggingface/pytorch-transformers', 'config', 'bert-base-uncased')  # Download configuration from S3 and cache.
config = torch.hub.load('huggingface/pytorch-transformers', 'config', './test/bert_saved_model/')  # E.g. config (or model) was saved using `save_pretrained('./test/saved_model/')`
config = torch.hub.load('huggingface/pytorch-transformers', 'config', './test/bert_saved_model/my_configuration.json')
config = torch.hub.load('huggingface/pytorch-transformers', 'config', 'bert-base-uncased', output_attentions=True, foo=False)
assert config.output_attentions == True
config, unused_kwargs = torch.hub.load('huggingface/pytorch-transformers', 'config', 'bert-base-uncased', output_attentions=True, foo=False, return_unused_kwargs=True)
assert config.output_attentions == True
assert unused_kwargs == {'foo': False}

# Using the configuration with a model
config = torch.hub.load('huggingface/pytorch-transformers', 'config', 'bert-base-uncased')
config.output_attentions = True
config.output_hidden_states = True
model = torch.hub.load('huggingface/pytorch-transformers', 'model', 'bert-base-uncased', config=config)
# Model will now output attentions and hidden states as well
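
The configuration object exposes the hyper-parameters mentioned above as plain attributes. A minimal sketch (the values shown are those of bert-base-uncased):

print(config.num_hidden_layers)     # 12
print(config.num_attention_heads)   # 12
print(config.hidden_size)           # 768
print(config.output_attentions)     # True, set above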

Example Usage

Here is an example of how to tokenize input text to be fed to a BERT model, then get the hidden states computed by that model, or predict masked tokens using a language modeling BERT model.

First, tokenize the input:

import torch
tokenizer = torch.hub.load('huggingface/pytorch-transformers', 'tokenizer', 'bert-base-cased')

text_1 = "Who was Jim Henson ?"
text_2 = "Jim Henson was a puppeteer"

# Tokenized input with special tokens around it (for BERT: [CLS] at the beginning and [SEP] at the end)
indexed_tokens = tokenizer.encode(text_1, text_2, add_special_tokens=True)
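
The segment (token type) IDs written out by hand in the next step can also be produced by the tokenizer itself; a minimal sketch, assuming a recent version of the library in which the tokenizer is callable:

encoded = tokenizer(text_1, text_2)
# encoded['input_ids'] matches indexed_tokens above; encoded['token_type_ids'] holds the segment IDs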

Use BertModel to encode the input sentence into a sequence of last-layer hidden states:

# Define sentence A and B indices associated to 1st and 2nd sentences (see paper)
segments_ids = [0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 1, 1, 1, 1, 1, 1]

# Convert inputs to PyTorch tensors
segments_tensors = torch.tensor([segments_ids])
tokens_tensor = torch.tensor([indexed_tokens])

model = torch.hub.load('huggingface/pytorch-transformers', 'model', 'bert-base-cased')

with torch.no_grad():
    encoded_layers, _ = model(tokens_tensor, token_type_ids=segments_tensors)

Use modelForMaskedLM to predict a masked token with BERT:

# Mask a token that we will try to predict back with `BertForMaskedLM`
masked_index = 8
indexed_tokens[masked_index] = tokenizer.mask_token_id
tokens_tensor = torch.tensor([indexed_tokens])

masked_lm_model = torch.hub.load('huggingface/pytorch-transformers', 'modelForMaskedLM', 'bert-base-cased')

with torch.no_grad():
    predictions = masked_lm_model(tokens_tensor, token_type_ids=segments_tensors)

# Get the predicted token
predicted_index = torch.argmax(predictions[0][0], dim=1)[masked_index].item()
predicted_token = tokenizer.convert_ids_to_tokens([predicted_index])[0]
assert predicted_token == 'Jim'

Use modelForQuestionAnswering to do question answering with BERT:

question_answering_model = torch.hub.load('huggingface/pytorch-transformers', 'modelForQuestionAnswering', 'bert-large-uncased-whole-word-masking-finetuned-squad')
question_answering_tokenizer = torch.hub.load('huggingface/pytorch-transformers', 'tokenizer', 'bert-large-uncased-whole-word-masking-finetuned-squad')

# The format is paragraph first and then question
text_1 = "Jim Henson was a puppeteer"
text_2 = "Who was Jim Henson ?"
indexed_tokens = question_answering_tokenizer.encode(text_1, text_2, add_special_tokens=True)
segments_ids = [0, 0, 0, 0, 0, 0, 0, 1, 1, 1, 1, 1, 1, 1]
segments_tensors = torch.tensor([segments_ids])
tokens_tensor = torch.tensor([indexed_tokens])

# Predict the start and end positions logits
with torch.no_grad():
    out = question_answering_model(tokens_tensor, token_type_ids=segments_tensors)

# get the highest prediction
answer = question_answering_tokenizer.decode(indexed_tokens[torch.argmax(out.start_logits):torch.argmax(out.end_logits)+1])
assert answer == "puppeteer"

# Or get the total loss which is the sum of the CrossEntropy loss for the start and end token positions (set model to train mode before if used for training)
start_positions, end_positions = torch.tensor([12]), torch.tensor([14])
total_loss = question_answering_model(tokens_tensor, token_type_ids=segments_tensors, start_positions=start_positions, end_positions=end_positions)

Use modelForSequenceClassification to do paraphrase classification with BERT:

sequence_classification_model = torch.hub.load('huggingface/pytorch-transformers', 'modelForSequenceClassification', 'bert-base-cased-finetuned-mrpc')
sequence_classification_tokenizer = torch.hub.load('huggingface/pytorch-transformers', 'tokenizer', 'bert-base-cased-finetuned-mrpc')

text_1 = "Jim Henson was a puppeteer"
text_2 = "Who was Jim Henson ?"
indexed_tokens = sequence_classification_tokenizer.encode(text_1, text_2, add_special_tokens=True)
segments_ids = [0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 1, 1, 1, 1, 1, 1]
segments_tensors = torch.tensor([segments_ids])
tokens_tensor = torch.tensor([indexed_tokens])

# Predict the sequence classification logits
with torch.no_grad():
    seq_classif_logits = sequence_classification_model(tokens_tensor, token_type_ids=segments_tensors)

predicted_labels = torch.argmax(seq_classif_logits[0]).item()

assert predicted_labels == 0  # In MRPC dataset this means the two sentences are not paraphrasing each other

# Or get the sequence classification loss (set model to train mode before if used for training)
labels = torch.tensor([1])
seq_classif_loss = sequence_classification_model(tokens_tensor, token_type_ids=segments_tensors, labels=labels)

PyTorch implementations of popular NLP Transformers

Model type: NLP
Submitted by: HuggingFace Team