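This post was written while taking part in a Datawhale group-study program. For the full details, see Datawhale's official open-source course 基于transformers的自然语言处理(NLP)入门 (Introduction to NLP with transformers) at datawhalechina.github.io.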
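Text classification
The GLUE benchmark contains 9 sentence-level classification tasks:
CoLA (Corpus of Linguistic Acceptability): decide whether a sentence is grammatically correct.
MNLI (Multi-Genre Natural Language Inference): given a hypothesis, decide how another sentence relates to it: entails, contradicts, or unrelated.
MRPC (Microsoft Research Paraphrase Corpus): decide whether two sentences are paraphrases of each other.
QNLI (Question-answering Natural Language Inference): decide whether the second sentence contains the answer to the question in the first sentence.
QQP (Quora Question Pairs): decide whether two questions are semantically equivalent.
RTE (Recognizing Textual Entailment): decide whether a sentence entails a given hypothesis.
SST-2 (Stanford Sentiment Treebank): decide whether the sentiment of a sentence is positive or negative.
STS-B (Semantic Textual Similarity Benchmark): score the similarity of two sentences (on a scale of 1 to 5).
WNLI (Winograd Natural Language Inference): determine if a sentence with an anonymous pronoun and a sentence with this pronoun replaced are entailed or not.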
For all of these tasks, we show how to load the dataset with the Datasets library and how to fine-tune a pretrained model with the Trainer API from transformers.
    GLUE_TASKS = ["cola", "mnli", "mnli-mm", "mrpc", "qnli", "qqp", "rte", "sst2", "stsb", "wnli"]

    task = "cola"
    model_checkpoint = "distilbert-base-uncased"
    batch_size = 16
Loading the data
    from datasets import load_dataset, load_metric

    # "mnli-mm" (mismatched) shares the dataset and metric configuration of "mnli".
    actual_task = "mnli" if task == "mnli-mm" else task
    dataset = load_dataset("glue", actual_task)
    # Note: newer releases of datasets deprecate load_metric in favor of the separate evaluate library.
    metric = load_metric('glue', actual_task)

    DatasetDict({
        train: Dataset({
            features: ['sentence', 'label', 'idx'],
            num_rows: 8551
        })
        validation: Dataset({
            features: ['sentence', 'label', 'idx'],
            num_rows: 1043
        })
        test: Dataset({
            features: ['sentence', 'label', 'idx'],
            num_rows: 1063
        })
    })
Pick a few random examples from the dataset:
    import datasets
    import random
    import pandas as pd
    from IPython.display import display, HTML

    def show_random_elements(dataset, num_examples=10):
        assert num_examples <= len(dataset), "Can't pick more elements than there are in the dataset."
        picks = []
        for _ in range(num_examples):
            pick = random.randint(0, len(dataset)-1)
            while pick in picks:
                pick = random.randint(0, len(dataset)-1)
            picks.append(pick)

        df = pd.DataFrame(dataset[picks])
        for column, typ in dataset.features.items():
            if isinstance(typ, datasets.ClassLabel):
                df[column] = df[column].transform(lambda i: typ.names[i])
        display(HTML(df.to_html()))

    show_random_elements(dataset["train"])
The evaluation metric is an instance of datasets.Metric:
    Metric(name: "glue", features: {'predictions': Value(dtype='int64', id=None), 'references': Value(dtype='int64', id=None)}, usage: """
    Compute GLUE evaluation metric associated to each GLUE dataset.
    Args:
        predictions: list of predictions to score.
            Each translation should be tokenized into a list of tokens.
        references: list of lists of references for each translation.
            Each reference should be tokenized into a list of tokens.
    Returns: depending on the GLUE subset, one or several of:
        "accuracy": Accuracy
        "f1": F1 score
        "pearson": Pearson Correlation
        "spearmanr": Spearman Correlation
        "matthews_correlation": Matthew Correlation
    Examples:

        >>> glue_metric = datasets.load_metric('glue', 'sst2')  # 'sst2' or any of ["mnli", "mnli_mismatched", "mnli_matched", "qnli", "rte", "wnli", "hans"]
        >>> references = [0, 1]
        >>> predictions = [0, 1]
        >>> results = glue_metric.compute(predictions=predictions, references=references)
        >>> print(results)
        {'accuracy': 1.0}

        >>> glue_metric = datasets.load_metric('glue', 'mrpc')  # 'mrpc' or 'qqp'
        >>> references = [0, 1]
        >>> predictions = [0, 1]
        >>> results = glue_metric.compute(predictions=predictions, references=references)
        >>> print(results)
        {'accuracy': 1.0, 'f1': 1.0}

        >>> glue_metric = datasets.load_metric('glue', 'stsb')
        >>> references = [0., 1., 2., 3., 4., 5.]
        >>> predictions = [0., 1., 2., 3., 4., 5.]
        >>> results = glue_metric.compute(predictions=predictions, references=references)
        >>> print({"pearson": round(results["pearson"], 2), "spearmanr": round(results["spearmanr"], 2)})
        {'pearson': 1.0, 'spearmanr': 1.0}

        >>> glue_metric = datasets.load_metric('glue', 'cola')
        >>> references = [0, 1]
        >>> predictions = [0, 1]
        >>> results = glue_metric.compute(predictions=predictions, references=references)
        >>> print(results)
        {'matthews_correlation': 1.0}
    """, stored examples: 0)
    import numpy as np

    fake_preds = np.random.randint(0, 2, size=(64,))
    fake_labels = np.random.randint(0, 2, size=(64,))
    metric.compute(predictions=fake_preds, references=fake_labels)
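Each text classification task uses its own metric. Going by the usage examples in the metric docstring above:
CoLA: Matthews correlation
MNLI, QNLI, RTE, SST-2, WNLI: accuracy
MRPC, QQP: accuracy and F1
STS-B: Pearson and Spearman correlation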
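CoLA is scored with the Matthews correlation coefficient (the "matthews_correlation" entry in the metric docstring above). As a rough illustration of what that number measures for binary labels, here is a minimal pure-Python sketch. The function name matthews_corrcoef_binary and the choice to return 0.0 when the denominator is zero are assumptions made for this example; this is not the code the glue metric itself runs.

```python
import math

def matthews_corrcoef_binary(predictions, references):
    """Matthews correlation for 0/1 labels, computed from the confusion matrix."""
    pairs = list(zip(predictions, references))
    tp = sum(1 for p, r in pairs if p == 1 and r == 1)
    tn = sum(1 for p, r in pairs if p == 0 and r == 0)
    fp = sum(1 for p, r in pairs if p == 1 and r == 0)
    fn = sum(1 for p, r in pairs if p == 0 and r == 1)
    denominator = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    if denominator == 0:
        # Assumption for this sketch: an undefined score is reported as 0.0.
        return 0.0
    return (tp * tn - fp * fn) / denominator

references = [0, 1, 1, 0]
perfect = matthews_corrcoef_binary([0, 1, 1, 0], references)
inverted = matthews_corrcoef_binary([1, 0, 0, 1], references)
constant = matthews_corrcoef_binary([1, 1, 1, 1], references)
print(perfect, inverted, constant)
```

Unlike accuracy, the score ranges from -1 to 1 and does not reward a model that always predicts the same class.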
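Data preprocessing
The preprocessing tool is called a Tokenizer. It first tokenizes the input, then converts the tokens into the token IDs the pretrained model expects, and finally arranges them in the input format the model requires.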
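The three steps can be sketched with a toy tokenizer. Everything here (TOY_VOCAB, toy_encode, the regex split) is made up for illustration; the tokenizers that ship with pretrained checkpoints use subword algorithms and a much larger vocabulary.

```python
import re

# Toy vocabulary, invented for this sketch only.
TOY_VOCAB = {"[PAD]": 0, "[UNK]": 1, "[CLS]": 2, "[SEP]": 3, "hello": 4, "world": 5, "!": 6}

def toy_encode(text, vocab=TOY_VOCAB):
    # Step 1: tokenize (lowercase, split words from punctuation).
    tokens = re.findall(r"\w+|[^\w\s]", text.lower())
    # Step 2: map each token to its ID, falling back to [UNK].
    ids = [vocab.get(token, vocab["[UNK]"]) for token in tokens]
    # Step 3: build the model input: special tokens plus an attention mask.
    input_ids = [vocab["[CLS]"]] + ids + [vocab["[SEP]"]]
    return {"input_ids": input_ids, "attention_mask": [1] * len(input_ids)}

encoded = toy_encode("Hello world!")
print(encoded)
```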
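To do the preprocessing, we instantiate our tokenizer with the AutoTokenizer.from_pretrained method, which ensures that:
we get the tokenizer that matches the pretrained model;
when we use the tokenizer for the given model checkpoint, the vocabulary the model needs (more precisely, the token vocabulary) is downloaded as well.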
    from transformers import AutoTokenizer

    tokenizer = AutoTokenizer.from_pretrained(model_checkpoint, use_fast=True)
The tokenizer can preprocess a single text or a pair of texts, and its output matches the input format the pretrained model expects:
    tokenizer("Hello, this one sentence!", "And this sentence goes with it.")
Each task stores its text under different column names:
    task_to_keys = {
        "cola": ("sentence", None),
        "mnli": ("premise", "hypothesis"),
        "mnli-mm": ("premise", "hypothesis"),
        "mrpc": ("sentence1", "sentence2"),
        "qnli": ("question", "sentence"),
        "qqp": ("question1", "question2"),
        "rte": ("sentence1", "sentence2"),
        "sst2": ("sentence", None),
        "stsb": ("sentence1", "sentence2"),
        "wnli": ("sentence1", "sentence2"),
    }

    sentence1_key, sentence2_key = task_to_keys[task]
    if sentence2_key is None:
        print(f"Sentence: {dataset['train'][0][sentence1_key]}")
    else:
        print(f"Sentence 1: {dataset['train'][0][sentence1_key]}")
        print(f"Sentence 2: {dataset['train'][0][sentence2_key]}")
The preprocessing function:
    def preprocess_function(examples):
        if sentence2_key is None:
            return tokenizer(examples[sentence1_key], truncation=True)
        return tokenizer(examples[sentence1_key], examples[sentence2_key], truncation=True)
Next we preprocess every sample in the dataset by using the map function to apply preprocess_function to all of them.
    encoded_dataset = dataset.map(preprocess_function, batched=True)
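The results are cached automatically so that they are not recomputed next time (but note that a stale cache can get in the way if your input has changed). The datasets library inspects the arguments passed to map: if nothing has changed it reuses the cached data, and if something has changed it reprocesses. If the arguments are unchanged but you still want the preprocessing to run again, bypass the cache by passing load_from_cache_file=False to map. In addition, batched=True above passes the examples to the preprocessing function in batches, which lets the fast tokenizer use multi-threading to process the texts of a batch in parallel.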
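To make the batched=True behavior concrete: with batching, the mapped function receives a dict of column lists rather than a single example. The helper batched_map below is a hypothetical, simplified stand-in written for this illustration; it is not how the datasets library implements map, and it does no caching.

```python
def batched_map(rows, function, batch_size=2):
    """Apply `function` to column-wise batches and merge the new columns into each row."""
    output = []
    for start in range(0, len(rows), batch_size):
        chunk = rows[start:start + batch_size]
        # Turn a list of row dicts into a dict of column lists.
        batch = {key: [row[key] for row in chunk] for key in chunk[0]}
        new_columns = function(batch)
        for i, row in enumerate(chunk):
            merged = dict(row)
            merged.update({key: values[i] for key, values in new_columns.items()})
            output.append(merged)
    return output

def count_words(batch):
    # Receives a whole batch: batch["sentence"] is a list of strings.
    return {"num_words": [len(sentence.split()) for sentence in batch["sentence"]]}

rows = [{"sentence": "a b c"}, {"sentence": "d e"}, {"sentence": "f"}]
result = batched_map(rows, count_words)
print(result)
```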
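Fine-tuning the pretrained model
Since all of our tasks are sentence-level classification, we need a model class built for that: AutoModelForSequenceClassification. As with the tokenizer, the from_pretrained method downloads and loads the model and also caches it, so it is not downloaded again.
STS-B is a regression problem (a single output) and MNLI is a 3-class problem; the remaining tasks are binary classification.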
    from transformers import AutoModelForSequenceClassification, TrainingArguments, Trainer

    num_labels = 3 if task.startswith("mnli") else 1 if task == "stsb" else 2
    model = AutoModelForSequenceClassification.from_pretrained(model_checkpoint, num_labels=num_labels)
To build a Trainer we need three more things, the most important being the training settings, TrainingArguments. This object holds all the attributes that define the training process.
    metric_name = "pearson" if task == "stsb" else "matthews_correlation" if task == "cola" else "accuracy"

    args = TrainingArguments(
        "test-glue",
        evaluation_strategy="epoch",
        save_strategy="epoch",
        learning_rate=2e-5,
        per_device_train_batch_size=batch_size,
        per_device_eval_batch_size=batch_size,
        num_train_epochs=5,
        weight_decay=0.01,
        load_best_model_at_end=True,
        metric_for_best_model=metric_name,
    )
Because different tasks need different evaluation metrics, we define a function that computes the metric according to the task name:
    def compute_metrics(eval_pred):
        predictions, labels = eval_pred
        if task != "stsb":
            predictions = np.argmax(predictions, axis=1)
        else:
            predictions = predictions[:, 0]
        return metric.compute(predictions=predictions, references=labels)
Pass everything to the Trainer:
    validation_key = "validation_mismatched" if task == "mnli-mm" else "validation_matched" if task == "mnli" else "validation"
    trainer = Trainer(
        model,
        args,
        train_dataset=encoded_dataset["train"],
        eval_dataset=encoded_dataset[validation_key],
        tokenizer=tokenizer,
        compute_metrics=compute_metrics
    )
Hyperparameter search
The Trainer also supports hyperparameter search, using either the optuna or the Ray Tune library.
    # Install the dependencies
    ! pip install optuna
    ! pip install ray[tune]
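During hyperparameter search the Trainer runs several trainings, so instead of a model instance we pass a function that builds the model, which lets the Trainer reinitialize the model at the start of each run.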
    def model_init():
        return AutoModelForSequenceClassification.from_pretrained(model_checkpoint, num_labels=num_labels)

    trainer = Trainer(
        model_init=model_init,
        args=args,
        train_dataset=encoded_dataset["train"],
        eval_dataset=encoded_dataset[validation_key],
        tokenizer=tokenizer,
        compute_metrics=compute_metrics
    )

    best_run = trainer.hyperparameter_search(n_trials=10, direction="maximize")

    # Copy the best hyperparameters onto the training arguments, then train with them.
    for n, v in best_run.hyperparameters.items():
        setattr(trainer.args, n, v)

    trainer.train()
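The search call above does not spell out a search space. As a hedged sketch of how one might be written for the Optuna backend: the function name hp_space, the three hyperparameters, and their ranges are illustrative assumptions rather than recommended values, and StubTrial is a hypothetical stand-in so the snippet runs without Optuna installed.

```python
def hp_space(trial):
    """Hypothetical Optuna-style search space; the ranges are illustrative only."""
    return {
        "learning_rate": trial.suggest_float("learning_rate", 1e-5, 5e-5, log=True),
        "num_train_epochs": trial.suggest_int("num_train_epochs", 2, 5),
        "per_device_train_batch_size": trial.suggest_categorical(
            "per_device_train_batch_size", [16, 32]
        ),
    }

class StubTrial:
    """Stand-in for an Optuna trial: always returns the first or lowest option."""
    def suggest_float(self, name, low, high, log=False):
        return low
    def suggest_int(self, name, low, high):
        return low
    def suggest_categorical(self, name, choices):
        return choices[0]

params = hp_space(StubTrial())
print(params)
```

With Optuna installed, a function like this could be passed as trainer.hyperparameter_search(hp_space=hp_space, n_trials=10, direction="maximize"); the keys must be names of TrainingArguments attributes for the setattr loop above to apply them.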
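Sequence labeling
Sequence labeling can usually be viewed as token-level classification: every token is assigned a class. In this notebook we show how to use a transformer model from 🤗 Transformers for token-level classification.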
The most common token-level classification tasks are:
NER (named-entity recognition): identify the named entities in the text (person, organization, location...).
POS (part-of-speech tagging): label each token with its grammatical part of speech (noun, verb, adjective...).
Chunk (chunking): group the tokens that belong to the same phrase.
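For these tasks, we show how to load the dataset with the Datasets library and how to fine-tune a pretrained model with the Trainer API from transformers.
As long as the pretrained transformer model has a token-classification head on top (such as BertForTokenClassification, mentioned in the previous chapter), this notebook can in theory use any of the many transformer models (see the Model Hub) to solve any token-level classification task. Because the notebook relies on a newer tokenizer feature of the transformers library, the model may also need to come with a fast tokenizer (see the table of supported models in the documentation).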
    task = "ner"
    model_checkpoint = "distilbert-base-uncased"
    batch_size = 16
Loading the data
    from datasets import load_dataset, load_metric
    datasets = load_dataset('conll2003')

    DatasetDict({
        train: Dataset({
            features: ['id', 'tokens', 'pos_tags', 'chunk_tags', 'ner_tags'],
            num_rows: 14041
        })
        validation: Dataset({
            features: ['id', 'tokens', 'pos_tags', 'chunk_tags', 'ner_tags'],
            num_rows: 3250
        })
        test: Dataset({
            features: ['id', 'tokens', 'pos_tags', 'chunk_tags', 'ner_tags'],
            num_rows: 3453
        })
    })
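In the training, validation, and test sets alike, the dataset has a column named tokens (generally the text split into words) and label columns (pos_tags, chunk_tags, and ner_tags) that hold the annotations for those tokens. The first training example looks like this: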
    {'chunk_tags': [11, 21, 11, 12, 21, 22, 11, 12, 0],
     'id': '0',
     'ner_tags': [3, 0, 7, 0, 0, 0, 7, 0, 0],
     'pos_tags': [22, 42, 16, 21, 35, 37, 16, 21, 7],
     'tokens': ['EU', 'rejects', 'German', 'call', 'to', 'boycott', 'British', 'lamb', '.']}
All labels are already encoded as integers, so the pretrained transformer model can use them directly. The actual classes behind these integers are stored in features.
    datasets["train"].features[f"ner_tags"]

    Sequence(feature=ClassLabel(num_classes=9, names=['O', 'B-PER', 'I-PER', 'B-ORG', 'I-ORG', 'B-LOC', 'I-LOC', 'B-MISC', 'I-MISC'], names_file=None, id=None), length=-1, id=None)
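Taking NER as an example, 0 corresponds to the label "O", 1 corresponds to "B-PER", and so on. "O" means no special entity. This dataset has 4 entity types (PER, ORG, LOC, MISC), and each type comes with a B- prefix (the token that begins an entity) and an I- prefix (a token inside an entity).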
'PER' for person
'ORG' for organization
'LOC' for location
'MISC' for miscellaneous
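To see how the B-/I- prefixes turn tag ids back into entities, here is a small pure-Python sketch that decodes the first training example shown above. The helper extract_entities is hypothetical, and treating a stray I- tag (one that does not continue the current entity) as the start of a new entity is an assumption of this sketch.

```python
LABELS = ['O', 'B-PER', 'I-PER', 'B-ORG', 'I-ORG', 'B-LOC', 'I-LOC', 'B-MISC', 'I-MISC']

def extract_entities(tokens, tag_ids, label_list=LABELS):
    """Group tokens into (entity_type, text) pairs following the B-/I- scheme."""
    entities = []
    current_type, current_tokens = None, []
    for token, tag_id in zip(tokens, tag_ids):
        tag = label_list[tag_id]
        continues = tag.startswith("I-") and tag[2:] == current_type
        if continues:
            current_tokens.append(token)
            continue
        # Any other tag closes the entity that was open, if there was one.
        if current_tokens:
            entities.append((current_type, " ".join(current_tokens)))
        if tag == "O":
            current_type, current_tokens = None, []
        else:
            # B-XXX (or a stray I-XXX) opens a new entity of type XXX.
            current_type, current_tokens = tag[2:], [token]
    if current_tokens:
        entities.append((current_type, " ".join(current_tokens)))
    return entities

tokens = ['EU', 'rejects', 'German', 'call', 'to', 'boycott', 'British', 'lamb', '.']
ner_tags = [3, 0, 7, 0, 0, 0, 7, 0, 0]
entities = extract_entities(tokens, ner_tags)
print(entities)
```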
    label_list = datasets["train"].features[f"{task}_tags"].feature.names
    label_list

    ['O', 'B-PER', 'I-PER', 'B-ORG', 'I-ORG', 'B-LOC', 'I-LOC', 'B-MISC', 'I-MISC']
Pick a few random examples from the dataset to display:
    from datasets import ClassLabel, Sequence
    import random
    import pandas as pd
    from IPython.display import display, HTML

    def show_random_elements(dataset, num_examples=10):
        assert num_examples <= len(dataset), "Can't pick more elements than there are in the dataset."
        picks = []
        for _ in range(num_examples):
            pick = random.randint(0, len(dataset)-1)
            while pick in picks:
                pick = random.randint(0, len(dataset)-1)
            picks.append(pick)

        df = pd.DataFrame(dataset[picks])
        for column, typ in dataset.features.items():
            if isinstance(typ, ClassLabel):
                df[column] = df[column].transform(lambda i: typ.names[i])
            elif isinstance(typ, Sequence) and isinstance(typ.feature, ClassLabel):
                df[column] = df[column].transform(lambda x: [typ.feature.names[i] for i in x])
        display(HTML(df.to_html()))
Preprocessing the data
    from transformers import AutoTokenizer

    tokenizer = AutoTokenizer.from_pretrained(model_checkpoint)

    import transformers
    assert isinstance(tokenizer, transformers.PreTrainedTokenizerFast)

    tokenizer("Hello, this is one sentence!")

    {'input_ids': [101, 7592, 1010, 2023, 2003, 2028, 6251, 999, 102], 'attention_mask': [1, 1, 1, 1, 1, 1, 1, 1, 1]}
Pretrained transformer models are usually pretrained on subwords, so even if our input text has already been split into words, the tokenizer may split those words further.
    example = datasets["train"][4]
    print(example["tokens"])

    ['Germany', "'s", 'representative', 'to', 'the', 'European', 'Union', "'s", 'veterinary', 'committee', 'Werner', 'Zwingmann', 'said', 'on', 'Wednesday', 'consumers', 'should', 'buy', 'sheepmeat', 'from', 'countries', 'other', 'than', 'Britain', 'until', 'the', 'scientific', 'advice', 'was', 'clearer', '.']
    tokenized_input = tokenizer(example["tokens"], is_split_into_words=True)
    tokens = tokenizer.convert_ids_to_tokens(tokenized_input["input_ids"])
    print(tokens)

    ['[CLS]', 'germany', "'", 's', 'representative', 'to', 'the', 'european', 'union', "'", 's', 'veterinary', 'committee', 'werner', 'z', '##wing', '##mann', 'said', 'on', 'wednesday', 'consumers', 'should', 'buy', 'sheep', '##me', '##at', 'from', 'countries', 'other', 'than', 'britain', 'until', 'the', 'scientific', 'advice', 'was', 'clearer', '.', '[SEP]']

The words "Zwingmann" and "sheepmeat" were each split further into 3 subtokens.
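Annotations are usually made at the word level. Since words get split into subtokens, we also need to align the labels with the subtokens. In addition, the input format of the pretrained model often requires special tokens such as [CLS] and [SEP], so the number of labels no longer matches the number of input ids: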
    len(example[f"{task}_tags"]), len(tokenized_input["input_ids"])

    (31, 39)
The output of the tokenizer has a word_ids method that helps us solve this problem.
    print(tokenized_input.word_ids())

    [None, 0, 1, 1, 2, 3, 4, 5, 6, 7, 7, 8, 9, 10, 11, 11, 11, 12, 13, 14, 15, 16, 17, 18, 18, 18, 19, 20, 21, 22, 23, 24, 25, 26, 27, 28, 29, 30, None]
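As we can see, word_ids maps each subtoken position to the index of the word it came from. For example, position 1 maps to word 0, and positions 2 and 3 both map to word 1. Special tokens map to None. With this list we can align subtokens, words, and labels.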
    word_ids = tokenized_input.word_ids()
    aligned_labels = [-100 if i is None else example[f"{task}_tags"][i] for i in word_ids]
    print(len(aligned_labels), len(tokenized_input["input_ids"]))
We usually set the label of special tokens to -100; positions labeled -100 are normally ignored by the model when the loss is computed.
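PyTorch's cross-entropy loss uses ignore_index=-100 by default, which is why -100 works as the "skip this position" label. A minimal pure-Python sketch of the idea follows; masked_cross_entropy is a hypothetical helper written for illustration, and returning 0.0 when every position is ignored is a choice made for this sketch.

```python
import math

IGNORE_INDEX = -100

def masked_cross_entropy(logits, labels, ignore_index=IGNORE_INDEX):
    """Mean negative log-likelihood over the positions whose label is not ignored."""
    total, count = 0.0, 0
    for token_logits, label in zip(logits, labels):
        if label == ignore_index:
            continue  # special tokens and skipped subtokens contribute nothing
        # log-sum-exp with the maximum subtracted for numerical stability
        peak = max(token_logits)
        log_norm = peak + math.log(sum(math.exp(x - peak) for x in token_logits))
        total += log_norm - token_logits[label]
        count += 1
    return total / count if count else 0.0

logits = [[0.0, 0.0], [5.0, -5.0], [0.0, 0.0]]
labels = [-100, 0, 1]  # the first position is ignored
loss = masked_cross_entropy(logits, labels)
print(round(loss, 4))
```

Changing the logits at an ignored position leaves the loss unchanged, which is the behavior the alignment step relies on.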
There are two ways to align the labels:
Give every subtoken of a word that word's label.
Give only the first subtoken of a word the word's label, and assign -100 to the remaining subtokens.
    label_all_tokens = True
The preprocessing function:
    def tokenize_and_align_labels(examples):
        tokenized_inputs = tokenizer(examples["tokens"], truncation=True, is_split_into_words=True)

        labels = []
        for i, label in enumerate(examples[f"{task}_tags"]):
            word_ids = tokenized_inputs.word_ids(batch_index=i)
            previous_word_idx = None
            label_ids = []
            for word_idx in word_ids:
                # Special tokens have no word id; label them -100 so the loss ignores them.
                if word_idx is None:
                    label_ids.append(-100)
                # The first subtoken of each word gets the word's label.
                elif word_idx != previous_word_idx:
                    label_ids.append(label[word_idx])
                # Remaining subtokens get the word's label or -100, depending on label_all_tokens.
                else:
                    label_ids.append(label[word_idx] if label_all_tokens else -100)
                previous_word_idx = word_idx

            labels.append(label_ids)

        tokenized_inputs["labels"] = labels
        return tokenized_inputs
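To compare the two alignment strategies side by side, the inner loop can be run on a toy input without any tokenizer. The word_ids list below mirrors how "Werner Zwingmann said" was split earlier ('werner', 'z', '##wing', '##mann', 'said', plus the special tokens); the word labels [1, 2, 0] (B-PER, I-PER, O) are hypothetical, and align_labels is a helper name invented for this sketch.

```python
def align_labels(word_ids, word_labels, label_all_tokens):
    """Spread word-level labels over subtokens; special tokens get -100."""
    aligned, previous_word_idx = [], None
    for word_idx in word_ids:
        if word_idx is None:
            aligned.append(-100)
        elif word_idx != previous_word_idx:
            aligned.append(word_labels[word_idx])
        else:
            aligned.append(word_labels[word_idx] if label_all_tokens else -100)
        previous_word_idx = word_idx
    return aligned

word_ids = [None, 0, 1, 1, 1, 2, None]  # [CLS] werner z ##wing ##mann said [SEP]
word_labels = [1, 2, 0]                 # hypothetical: B-PER, I-PER, O

all_subtokens = align_labels(word_ids, word_labels, label_all_tokens=True)
first_only = align_labels(word_ids, word_labels, label_all_tokens=False)
print(all_subtokens)
print(first_only)
```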
Next we preprocess every sample in datasets by using the map function to apply tokenize_and_align_labels to all of them.
    tokenized_datasets = datasets.map(tokenize_and_align_labels, batched=True)
Fine-tuning the pretrained model
    from transformers import AutoModelForTokenClassification, TrainingArguments, Trainer

    model = AutoModelForTokenClassification.from_pretrained(model_checkpoint, num_labels=len(label_list))

    args = TrainingArguments(
        f"test-{task}",
        evaluation_strategy="epoch",
        learning_rate=2e-5,
        per_device_train_batch_size=batch_size,
        per_device_eval_batch_size=batch_size,
        num_train_epochs=3,
        weight_decay=0.01,
    )
Finally we need a data collator, which batches our processed inputs and feeds them to the model.
    from transformers import DataCollatorForTokenClassification

    data_collator = DataCollatorForTokenClassification(tokenizer)
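We use the seqeval metric for evaluation. It takes lists of label strings rather than integer ids, so the model predictions need some post-processing before they are passed in. As a quick check, we can score the labels of one example against themselves: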
    metric = load_metric("seqeval")

    labels = [label_list[i] for i in example[f"{task}_tags"]]
    metric.compute(predictions=[labels], references=[labels])
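seqeval scores predictions at the entity level: a predicted entity counts as correct only when both its type and its span match a reference entity. The following is a simplified pure-Python sketch of that idea, not seqeval's implementation; get_spans and entity_f1 are hypothetical names, and treating a stray I- tag as the start of a new entity is an assumption of this sketch.

```python
def get_spans(tags):
    """Return the set of (entity_type, start, end) spans in a B-/I-/O tag sequence."""
    spans, start, current = set(), None, None
    for i, tag in enumerate(list(tags) + ["O"]):  # the sentinel "O" closes a trailing span
        if tag.startswith("I-") and tag[2:] == current:
            continue  # still inside the current entity
        if current is not None:
            spans.add((current, start, i))
        if tag == "O":
            start, current = None, None
        else:
            start, current = i, tag[2:]  # B-XXX or a stray I-XXX opens a span
    return spans

def entity_f1(predictions, references):
    """Micro-averaged entity-level precision, recall, and F1 over all sentences."""
    true_positives = num_predicted = num_reference = 0
    for predicted_tags, reference_tags in zip(predictions, references):
        predicted, reference = get_spans(predicted_tags), get_spans(reference_tags)
        true_positives += len(predicted & reference)
        num_predicted += len(predicted)
        num_reference += len(reference)
    precision = true_positives / num_predicted if num_predicted else 0.0
    recall = true_positives / num_reference if num_reference else 0.0
    total = precision + recall
    f1 = 2 * precision * recall / total if total else 0.0
    return {"precision": precision, "recall": recall, "f1": f1}

reference = [["B-PER", "I-PER", "O", "B-LOC"]]
prediction = [["B-PER", "O", "O", "B-LOC"]]  # PER span cut short, LOC correct
scores = entity_f1(prediction, reference)
print(scores)
```

Here the truncated PER span earns no credit even though its first token is tagged correctly, which is why entity-level scores are typically lower than token-level accuracy.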
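The model predictions are post-processed as follows:
Take the index of the highest-scoring class for each token.
Convert that index to its label.
Skip the positions labeled -100.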
    import numpy as np

    def compute_metrics(p):
        predictions, labels = p
        predictions = np.argmax(predictions, axis=2)

        # Drop the positions labeled -100 and convert ids to label strings.
        true_predictions = [
            [label_list[p] for (p, l) in zip(prediction, label) if l != -100]
            for prediction, label in zip(predictions, labels)
        ]
        true_labels = [
            [label_list[l] for (p, l) in zip(prediction, label) if l != -100]
            for prediction, label in zip(predictions, labels)
        ]

        results = metric.compute(predictions=true_predictions, references=true_labels)
        return {
            "precision": results["overall_precision"],
            "recall": results["overall_recall"],
            "f1": results["overall_f1"],
            "accuracy": results["overall_accuracy"],
        }
Pass the data, model, and arguments to the Trainer:
    trainer = Trainer(
        model,
        args,
        train_dataset=tokenized_datasets["train"],
        eval_dataset=tokenized_datasets["validation"],
        data_collator=data_collator,
        tokenizer=tokenizer,
        compute_metrics=compute_metrics
    )
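Start training
Training is started by calling trainer.train(). Afterwards we can call the evaluate method again to evaluate the model; it can also be used to evaluate on other datasets.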