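This article was written while taking part in a Datawhale team study. For the full details, see the official Datawhale open-source course "Introduction to Transformer-based Natural Language Processing (NLP)" (datawhalechina.github.io).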
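Text classification
The GLUE benchmark consists of 9 sentence-level classification tasks:
CoLA (Corpus of Linguistic Acceptability): decide whether a sentence is grammatically correct.
MNLI (Multi-Genre Natural Language Inference): given a hypothesis, decide how another sentence relates to it: entails, contradicts, or unrelated.
MRPC (Microsoft Research Paraphrase Corpus): decide whether two sentences are paraphrases of each other.
QNLI (Question-answering Natural Language Inference): decide whether the second sentence contains the answer to the question in the first sentence.
QQP (Quora Question Pairs2): decide whether two questions are semantically equivalent.
RTE (Recognizing Textual Entailment): decide whether a sentence entails a hypothesis.
SST-2 (Stanford Sentiment Treebank): decide whether the sentiment of a sentence is positive or negative.
STS-B (Semantic Textual Similarity Benchmark): score how similar two sentences are (on a scale from 1 to 5).
WNLI (Winograd Natural Language Inference): determine whether a sentence with an anonymous pronoun and a sentence with this pronoun replaced are entailed or not.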
 
 
For the tasks above, we show how to load the dataset with the Datasets library and how to fine-tune a pretrained model with the Trainer API from Transformers.

GLUE_TASKS = ["cola", "mnli", "mnli-mm", "mrpc", "qnli", "qqp", "rte", "sst2", "stsb", "wnli"]

task = "cola"
model_checkpoint = "distilbert-base-uncased"
batch_size = 16
 
Loading the data

# Note: newer datasets releases drop load_metric; evaluate.load from the
# evaluate package is its replacement.
from datasets import load_dataset, load_metric

# "mnli-mm" is the mismatched validation split of the "mnli" dataset.
actual_task = "mnli" if task == "mnli-mm" else task
dataset = load_dataset("glue", actual_task)
metric = load_metric("glue", actual_task)

DatasetDict({
    train: Dataset({
        features: ['sentence', 'label', 'idx'],
        num_rows: 8551
    })
    validation: Dataset({
        features: ['sentence', 'label', 'idx'],
        num_rows: 1043
    })
    test: Dataset({
        features: ['sentence', 'label', 'idx'],
        num_rows: 1063
    })
})
 
Pick a few random examples from the dataset:

import datasets
import random
import pandas as pd
from IPython.display import display, HTML

def show_random_elements(dataset, num_examples=10):
    assert num_examples <= len(dataset), "Can't pick more elements than there are in the dataset."
    # Draw distinct random row indices.
    picks = []
    for _ in range(num_examples):
        pick = random.randint(0, len(dataset)-1)
        while pick in picks:
            pick = random.randint(0, len(dataset)-1)
        picks.append(pick)

    df = pd.DataFrame(dataset[picks])
    # Show class names instead of integer label IDs.
    for column, typ in dataset.features.items():
        if isinstance(typ, datasets.ClassLabel):
            df[column] = df[column].transform(lambda i: typ.names[i])
    display(HTML(df.to_html()))

show_random_elements(dataset["train"])
 
 
The evaluation metric is an instance of datasets.Metric:
Metric(name: "glue", features: {'predictions': Value(dtype='int64', id=None), 'references': Value(dtype='int64', id=None)}, usage: """
Compute GLUE evaluation metric associated to each GLUE dataset.
Args:
    predictions: list of predictions to score.
        Each translation should be tokenized into a list of tokens.
    references: list of lists of references for each translation.
        Each reference should be tokenized into a list of tokens.
Returns: depending on the GLUE subset, one or several of:
    "accuracy": Accuracy
    "f1": F1 score
    "pearson": Pearson Correlation
    "spearmanr": Spearman Correlation
    "matthews_correlation": Matthew Correlation
Examples:

    >>> glue_metric = datasets.load_metric('glue', 'sst2')  # 'sst2' or any of ["mnli", "mnli_mismatched", "mnli_matched", "qnli", "rte", "wnli", "hans"]
    >>> references = [0, 1]
    >>> predictions = [0, 1]
    >>> results = glue_metric.compute(predictions=predictions, references=references)
    >>> print(results)
    {'accuracy': 1.0}

    >>> glue_metric = datasets.load_metric('glue', 'mrpc')  # 'mrpc' or 'qqp'
    >>> references = [0, 1]
    >>> predictions = [0, 1]
    >>> results = glue_metric.compute(predictions=predictions, references=references)
    >>> print(results)
    {'accuracy': 1.0, 'f1': 1.0}

    >>> glue_metric = datasets.load_metric('glue', 'stsb')
    >>> references = [0., 1., 2., 3., 4., 5.]
    >>> predictions = [0., 1., 2., 3., 4., 5.]
    >>> results = glue_metric.compute(predictions=predictions, references=references)
    >>> print({"pearson": round(results["pearson"], 2), "spearmanr": round(results["spearmanr"], 2)})
    {'pearson': 1.0, 'spearmanr': 1.0}

    >>> glue_metric = datasets.load_metric('glue', 'cola')
    >>> references = [0, 1]
    >>> predictions = [0, 1]
    >>> results = glue_metric.compute(predictions=predictions, references=references)
    >>> print(results)
    {'matthews_correlation': 1.0}
""", stored examples: 0)
 
import numpy as np

# Random predictions and labels, just to check that the metric can be called.
fake_preds = np.random.randint(0, 2, size=(64,))
fake_labels = np.random.randint(0, 2, size=(64,))
metric.compute(predictions=fake_preds, references=fake_labels)
 
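Each text classification task uses its own metric. According to the usage text of the metric shown above:
cola: Matthews correlation
stsb: Pearson and Spearman correlation
mrpc, qqp: accuracy and F1 score
mnli, mnli-mm, qnli, rte, sst2, wnli: accuracy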
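To make the CoLA metric less of a black box, here is a minimal sketch of the Matthews correlation for binary labels, computed from the confusion matrix. The function name matthews_corrcoef and the sample data are chosen for illustration; this is not the code the glue metric runs internally.

```python
import math

def matthews_corrcoef(predictions, references):
    """Matthews correlation for binary labels (0/1), from the confusion matrix."""
    pairs = list(zip(predictions, references))
    tp = sum(1 for p, r in pairs if p == 1 and r == 1)
    tn = sum(1 for p, r in pairs if p == 0 and r == 0)
    fp = sum(1 for p, r in pairs if p == 1 and r == 0)
    fn = sum(1 for p, r in pairs if p == 0 and r == 1)
    denominator = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    # By convention the score is 0 when a row or column of the confusion matrix is empty.
    if denominator == 0:
        return 0.0
    return (tp * tn - fp * fn) / denominator

references = [0, 1, 1, 0]
print(matthews_corrcoef([0, 1, 1, 0], references))  # perfect agreement
print(matthews_corrcoef([1, 0, 0, 1], references))  # every prediction flipped
```

Unlike accuracy, the score ranges from -1 to 1 and stays near 0 for a model that always predicts the same class, which is useful when the classes are unbalanced.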
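Data preprocessing
The preprocessing tool is called a Tokenizer. A Tokenizer first tokenizes the input, then converts the tokens into the token IDs expected by the pretrained model, and finally puts them into the input format the model requires.
To do this, we instantiate our tokenizer with the AutoTokenizer.from_pretrained method, which ensures that:
We get the tokenizer that matches the pretrained model.
When we use the tokenizer of a given model checkpoint, the vocabulary the model needs (more precisely, the tokens vocabulary) is downloaded as well.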
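The three steps (tokenize, map tokens to IDs, build the model input) can be sketched with a toy tokenizer. Everything here is hypothetical: the tiny vocabulary, the whitespace splitting, and the encode helper only mimic the shape of what a real tokenizer returns.

```python
# Hypothetical toy vocabulary; a real checkpoint ships its own, much larger one.
vocab = {"[PAD]": 0, "[UNK]": 1, "[CLS]": 2, "[SEP]": 3, "hello": 4, "world": 5, "again": 6}

def encode(first, second=None):
    """Whitespace-tokenize, map tokens to IDs, and add special tokens."""
    def to_ids(text):
        # Tokens missing from the vocabulary fall back to [UNK].
        return [vocab.get(token, vocab["[UNK]"]) for token in text.lower().split()]

    input_ids = [vocab["[CLS]"]] + to_ids(first) + [vocab["[SEP]"]]
    if second is not None:
        # A sentence pair is joined into one sequence, separated by [SEP].
        input_ids += to_ids(second) + [vocab["[SEP]"]]
    return {"input_ids": input_ids, "attention_mask": [1] * len(input_ids)}

print(encode("hello world"))
print(encode("hello world", "hello again friend"))
```

Because the IDs only make sense relative to one vocabulary, the tokenizer has to come from the same checkpoint as the model.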
 
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained(model_checkpoint, use_fast=True)
 
The tokenizer can preprocess a single text or a pair of texts, and its output matches the input format the pretrained model expects:

tokenizer("Hello, this one sentence!", "And this sentence goes with it.")
 
Each dataset stores its sentences under different column names:

task_to_keys = {
    "cola": ("sentence", None),
    "mnli": ("premise", "hypothesis"),
    "mnli-mm": ("premise", "hypothesis"),
    "mrpc": ("sentence1", "sentence2"),
    "qnli": ("question", "sentence"),
    "qqp": ("question1", "question2"),
    "rte": ("sentence1", "sentence2"),
    "sst2": ("sentence", None),
    "stsb": ("sentence1", "sentence2"),
    "wnli": ("sentence1", "sentence2"),
}

sentence1_key, sentence2_key = task_to_keys[task]
if sentence2_key is None:
    print(f"Sentence: {dataset['train'][0][sentence1_key]}")
else:
    print(f"Sentence 1: {dataset['train'][0][sentence1_key]}")
    print(f"Sentence 2: {dataset['train'][0][sentence2_key]}")
 
The preprocessing function

def preprocess_function(examples):
    # Single-sentence tasks have no second column.
    if sentence2_key is None:
        return tokenizer(examples[sentence1_key], truncation=True)
    return tokenizer(examples[sentence1_key], examples[sentence2_key], truncation=True)
 
Next we preprocess every example in the dataset by using the map function to apply the preprocessing function preprocess_function to all of them.

encoded_dataset = dataset.map(preprocess_function, batched=True)
 
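The results are cached automatically so that they are not recomputed next time (but be aware that a stale cache can get in the way if your inputs have changed). The Datasets library inspects the arguments passed to map: if nothing has changed it reuses the cached data, and if something has changed it processes the data again. If you want to change the input while the arguments stay the same, it is best to bypass the cache by passing load_from_cache_file=False. Also, batched=True makes map hand the examples to the preprocessing function in batches. This suits the fast tokenizer loaded above, which uses multiple threads to process the texts of a batch in parallel.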
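What batching changes for the preprocessing function is the shape of its argument: it receives one dict of columns, each holding a list of values, rather than a single row. The sketch below imitates that with plain Python lists. batched_map and count_words are hypothetical stand-ins, not the Datasets implementation, and they leave out caching entirely.

```python
def batched_map(rows, function, batch_size=2):
    """Apply `function` to column-wise batches and merge its output into the rows."""
    processed = []
    for start in range(0, len(rows), batch_size):
        batch_rows = rows[start:start + batch_size]
        # With batching, the function sees {column: [values...]}, not a single row.
        batch = {key: [row[key] for row in batch_rows] for key in batch_rows[0]}
        new_columns = function(batch)
        for index, row in enumerate(batch_rows):
            extra = {key: values[index] for key, values in new_columns.items()}
            processed.append({**row, **extra})
    return processed

def count_words(examples):
    # examples["sentence"] is a list of strings, one per row in the batch.
    return {"num_words": [len(sentence.split()) for sentence in examples["sentence"]]}

rows = [{"sentence": "a b c"}, {"sentence": "d e"}, {"sentence": "f"}]
print(batched_map(rows, count_words))
```

This is why preprocess_function can pass examples[sentence1_key] straight to the tokenizer: with batched=True that value is already a list of texts.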
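Fine-tuning the pretrained model
Since we are doing sequence classification, we need a model class that can handle this task. We use the AutoModelForSequenceClassification class. As with the tokenizer, the from_pretrained method downloads and loads the model for us and also caches it, so the model is not downloaded again.
The number of labels depends on the task: STS-B is a regression problem (one output), MNLI is a 3-class problem, and the remaining tasks are binary classification.

from transformers import AutoModelForSequenceClassification, TrainingArguments, Trainer

num_labels = 3 if task.startswith("mnli") else 1 if task == "stsb" else 2
model = AutoModelForSequenceClassification.from_pretrained(model_checkpoint, num_labels=num_labels)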
 
To build a Trainer we need three more things, the most important of which is the training configuration, TrainingArguments. It holds all the attributes that define the training process.

metric_name = "pearson" if task == "stsb" else "matthews_correlation" if task == "cola" else "accuracy"

args = TrainingArguments(
    "test-glue",                     # output directory
    evaluation_strategy="epoch",     # named eval_strategy in recent transformers releases
    save_strategy="epoch",
    learning_rate=2e-5,
    per_device_train_batch_size=batch_size,
    per_device_eval_batch_size=batch_size,
    num_train_epochs=5,
    weight_decay=0.01,
    load_best_model_at_end=True,
    metric_for_best_model=metric_name,
)
 
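Since different tasks need different evaluation metrics, we define a function that computes the metric appropriate for the current task:

def compute_metrics(eval_pred):
    predictions, labels = eval_pred
    if task != "stsb":
        # Classification: take the class with the highest score.
        predictions = np.argmax(predictions, axis=1)
    else:
        # Regression: the single output is the predicted score.
        predictions = predictions[:, 0]
    return metric.compute(predictions=predictions, references=labels)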
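The two branches of compute_metrics can be traced on a few hand-written model outputs. This is a small sketch with plain lists in place of NumPy arrays. The helper logits_to_predictions and the numbers are made up for illustration.

```python
def logits_to_predictions(logits, is_regression):
    """Turn per-example model outputs into the values the metric expects."""
    if is_regression:
        # One output per example: use it directly as the predicted score.
        return [row[0] for row in logits]
    # One score per class: pick the index of the largest one.
    return [max(range(len(row)), key=row.__getitem__) for row in logits]

classification_logits = [[0.1, 2.3, -1.0], [1.5, 0.2, 0.3]]  # 2 examples, 3 classes
regression_logits = [[3.7], [0.4]]                           # 2 examples, 1 output

print(logits_to_predictions(classification_logits, is_regression=False))  # [1, 0]
print(logits_to_predictions(regression_logits, is_regression=True))       # [3.7, 0.4]
```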
 
Pass everything to the Trainer:

validation_key = "validation_mismatched" if task == "mnli-mm" else "validation_matched" if task == "mnli" else "validation"

trainer = Trainer(
    model,
    args,
    train_dataset=encoded_dataset["train"],
    eval_dataset=encoded_dataset[validation_key],
    tokenizer=tokenizer,
    compute_metrics=compute_metrics
)
 
 
 
Hyperparameter search
The Trainer also supports hyperparameter search, using the optuna or Ray Tune libraries.

# Install the required dependencies
! pip install optuna
! pip install ray[tune]
 
During hyperparameter search the Trainer runs several trainings, so instead of a ready-made model it needs a function that defines the model, which lets the Trainer reinitialize the model at the start of every run.

def model_init():
    return AutoModelForSequenceClassification.from_pretrained(model_checkpoint, num_labels=num_labels)

trainer = Trainer(
    model_init=model_init,
    args=args,
    train_dataset=encoded_dataset["train"],
    eval_dataset=encoded_dataset[validation_key],
    tokenizer=tokenizer,
    compute_metrics=compute_metrics
)
 
best_run = trainer.hyperparameter_search(n_trials=10, direction="maximize")

# Copy the best hyperparameters into the training arguments, then train with them.
for n, v in best_run.hyperparameters.items():
    setattr(trainer.args, n, v)

trainer.train()
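The role of model_init is easier to see in a stripped-down search loop. The sketch below uses plain random search, which is a stand-in and not the algorithm optuna or Ray Tune use, and the scoring function is invented for illustration. The point is only that each trial gets a fresh model and that the best-scoring hyperparameters are kept.

```python
import random

def model_init():
    # Stand-in for loading fresh pretrained weights; returns a new object on every call.
    return {"trained": False}

def train_and_evaluate(model, learning_rate):
    # Hypothetical objective: pretend the metric peaks at learning_rate = 3e-5.
    model["trained"] = True
    return 1.0 - abs(learning_rate - 3e-5) * 1e4

def random_search(n_trials, seed=0):
    rng = random.Random(seed)
    best = None
    for _ in range(n_trials):
        hyperparameters = {"learning_rate": rng.uniform(1e-5, 5e-5)}
        model = model_init()  # every trial starts from an untrained model
        score = train_and_evaluate(model, **hyperparameters)
        if best is None or score > best["objective"]:  # corresponds to direction="maximize"
            best = {"objective": score, "hyperparameters": hyperparameters}
    return best

best_run = random_search(n_trials=10)
print(best_run["hyperparameters"], best_run["objective"])
```

If the same model object were reused across trials, later trials would start from already fine-tuned weights and their scores would not be comparable.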
 
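Sequence labeling
Sequence labeling can usually be treated as a token-level classification problem: every token is assigned a class. In this notebook, we show how to use transformer models from 🤗 Transformers for token-level classification.
The most common token-level classification tasks are:
NER (named-entity recognition): identify the entities in a text (person, organization, location, ...).
POS (part-of-speech tagging): tag each token with its grammatical part of speech (noun, verb, adjective, ...).
Chunk (chunking): group the tokens that belong to the same phrase.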
 
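For the tasks above, we show how to load the dataset with the Datasets library and how to fine-tune a pretrained model with the Trainer API from Transformers.
As long as the pretrained transformer model has a token classification layer on top (for example BertForTokenClassification, mentioned in the previous chapter), this notebook can in principle use any transformer model (see the model hub) to solve any token-level classification task. Because of newer features of the tokenizers in the Transformers library, the chosen model may also need to have a fast tokenizer (see the table of models that provide one).

task = "ner"
model_checkpoint = "distilbert-base-uncased"
batch_size = 16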
 
Loading the data

from datasets import load_dataset, load_metric

datasets = load_dataset("conll2003")

DatasetDict({
    train: Dataset({
        features: ['id', 'tokens', 'pos_tags', 'chunk_tags', 'ner_tags'],
        num_rows: 14041
    })
    validation: Dataset({
        features: ['id', 'tokens', 'pos_tags', 'chunk_tags', 'ner_tags'],
        num_rows: 3250
    })
    test: Dataset({
        features: ['id', 'tokens', 'pos_tags', 'chunk_tags', 'ner_tags'],
        num_rows: 3453
    })
})
 
In the training, validation and test sets alike, datasets contains a column named tokens (the text, usually already split into words) and one label column per task (pos_tags, chunk_tags, ner_tags) holding the annotations for those tokens.

datasets["train"][0]

{'chunk_tags': [11, 21, 11, 12, 21, 22, 11, 12, 0],
 'id': '0',
 'ner_tags': [3, 0, 7, 0, 0, 0, 7, 0, 0],
 'pos_tags': [22, 42, 16, 21, 35, 37, 16, 21, 7],
 'tokens': ['EU',
  'rejects',
  'German',
  'call',
  'to',
  'boycott',
  'British',
  'lamb',
  '.']}
 
All labels are already encoded as integers, so they can be used directly by a pretrained transformer model. The actual classes these integers stand for are stored in features.

datasets["train"].features["ner_tags"]

Sequence(feature=ClassLabel(num_classes=9, names=['O', 'B-PER', 'I-PER', 'B-ORG', 'I-ORG', 'B-LOC', 'I-LOC', 'B-MISC', 'I-MISC'], names_file=None, id=None), length=-1, id=None)
 
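Taking NER as an example, 0 corresponds to the label "O", 1 corresponds to "B-PER", and so on. "O" means no special entity. This dataset has 4 entity types (PER, ORG, LOC, MISC), and each type comes with a B- prefix (the token that begins an entity) and an I- prefix (a token inside an entity).
'PER' for person
'ORG' for organization
'LOC' for location
'MISC' for miscellaneous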
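To see how the B- and I- prefixes delimit entities, the following sketch groups tagged tokens back into entity spans. extract_entities is a hypothetical helper written for this illustration. The label list and the sample sentence are the ones shown above.

```python
label_list = ["O", "B-PER", "I-PER", "B-ORG", "I-ORG", "B-LOC", "I-LOC", "B-MISC", "I-MISC"]

def extract_entities(tokens, tag_ids):
    """Group tokens into (entity_type, text) spans using the B-/I- prefixes."""
    entities = []
    current_type, current_tokens = None, []
    for token, tag_id in zip(tokens, tag_ids):
        tag = label_list[tag_id]
        if tag.startswith("I-") and current_type == tag[2:]:
            current_tokens.append(token)  # continue the open entity
            continue
        if current_type is not None:
            entities.append((current_type, " ".join(current_tokens)))  # close it
        if tag == "O":
            current_type, current_tokens = None, []
        else:
            # A B- tag (or a stray I- tag) opens a new entity.
            current_type, current_tokens = tag[2:], [token]
    if current_type is not None:
        entities.append((current_type, " ".join(current_tokens)))
    return entities

tokens = ["EU", "rejects", "German", "call", "to", "boycott", "British", "lamb", "."]
ner_tags = [3, 0, 7, 0, 0, 0, 7, 0, 0]
print(extract_entities(tokens, ner_tags))
# [('ORG', 'EU'), ('MISC', 'German'), ('MISC', 'British')]
```

The B- prefix is what lets two adjacent entities of the same type be told apart: a second B- tag starts a new span instead of extending the previous one.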
 
label_list = datasets["train"].features[f"{task}_tags"].feature.names
label_list

['O', 'B-PER', 'I-PER', 'B-ORG', 'I-ORG', 'B-LOC', 'I-LOC', 'B-MISC', 'I-MISC']
 
Pick a few random examples from the dataset and display them.

from datasets import ClassLabel, Sequence
import random
import pandas as pd
from IPython.display import display, HTML

def show_random_elements(dataset, num_examples=10):
    assert num_examples <= len(dataset), "Can't pick more elements than there are in the dataset."
    # Draw distinct random row indices.
    picks = []
    for _ in range(num_examples):
        pick = random.randint(0, len(dataset)-1)
        while pick in picks:
            pick = random.randint(0, len(dataset)-1)
        picks.append(pick)

    df = pd.DataFrame(dataset[picks])
    # Show label names instead of integer IDs, for single labels and label sequences.
    for column, typ in dataset.features.items():
        if isinstance(typ, ClassLabel):
            df[column] = df[column].transform(lambda i: typ.names[i])
        elif isinstance(typ, Sequence) and isinstance(typ.feature, ClassLabel):
            df[column] = df[column].transform(lambda x: [typ.feature.names[i] for i in x])
    display(HTML(df.to_html()))
 
 
Preprocessing the data

from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained(model_checkpoint)

import transformers

# The label alignment below relies on features that only fast tokenizers provide.
assert isinstance(tokenizer, transformers.PreTrainedTokenizerFast)

tokenizer("Hello, this is one sentence!")

{'input_ids': [101, 7592, 1010, 2023, 2003, 2028, 6251, 999, 102], 'attention_mask': [1, 1, 1, 1, 1, 1, 1, 1, 1]}
 
Pretrained transformer models usually work on subwords, so even if our input text has already been split into words, the tokenizer will split those words further.

example = datasets["train"][4]
print(example["tokens"])

['Germany', "'s", 'representative', 'to', 'the', 'European', 'Union', "'s", 'veterinary', 'committee', 'Werner', 'Zwingmann', 'said', 'on', 'Wednesday', 'consumers', 'should', 'buy', 'sheepmeat', 'from', 'countries', 'other', 'than', 'Britain', 'until', 'the', 'scientific', 'advice', 'was', 'clearer', '.']

tokenized_input = tokenizer(example["tokens"], is_split_into_words=True)
tokens = tokenizer.convert_ids_to_tokens(tokenized_input["input_ids"])
print(tokens)
 
The words "Zwingmann" and "sheepmeat" are each split further into 3 subtokens.

['[CLS]', 'germany', "'", 's', 'representative', 'to', 'the', 'european', 'union', "'", 's', 'veterinary', 'committee', 'werner', 'z', '##wing', '##mann', 'said', 'on', 'wednesday', 'consumers', 'should', 'buy', 'sheep', '##me', '##at', 'from', 'countries', 'other', 'than', 'britain', 'until', 'the', 'scientific', 'advice', 'was', 'clearer', '.', '[SEP]']
 
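Annotations are usually made at the word level. Since words get split into subtokens, we have to align the labels with the subtokens. In addition, the input format of the pretrained model often requires special tokens such as [CLS] and [SEP], so the number of labels no longer matches the number of input IDs.

len(example[f"{task}_tags"]), len(tokenized_input["input_ids"])

(31, 39)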
 
The output of the tokenizer has a word_ids method that helps us solve this problem.

print(tokenized_input.word_ids())

[None, 0, 1, 1, 2, 3, 4, 5, 6, 7, 7, 8, 9, 10, 11, 11, 11, 12, 13, 14, 15, 16, 17, 18, 18, 18, 19, 20, 21, 22, 23, 24, 25, 26, 27, 28, 29, 30, None]
 
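As we can see, word_ids maps the position of every subtoken to the index of the word it came from. For example, position 1 maps to word 0, and positions 2 and 3 both map to word 1. Special tokens map to None. With this list we can align the subtokens with the words and with the annotated labels.

word_ids = tokenized_input.word_ids()
# Special tokens get -100; every other subtoken gets the label of its word.
aligned_labels = [-100 if i is None else example[f"{task}_tags"][i] for i in word_ids]
print(len(aligned_labels), len(tokenized_input["input_ids"]))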
 
 
We usually set the label of special tokens to -100, because positions labelled -100 are normally ignored by the model when the loss is computed.
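The effect of the -100 label can be reproduced with a hand-written cross-entropy that skips those positions. -100 is the default ignore_index of PyTorch's cross-entropy loss. The function below is a plain-Python sketch of that behaviour with made-up logits, not the loss code the model actually runs.

```python
import math

IGNORE_INDEX = -100

def token_cross_entropy(logits, labels):
    """Mean cross-entropy over tokens, skipping positions labelled -100."""
    losses = []
    for token_logits, label in zip(logits, labels):
        if label == IGNORE_INDEX:
            continue  # special tokens contribute nothing to the loss
        log_normalizer = math.log(sum(math.exp(value) for value in token_logits))
        losses.append(log_normalizer - token_logits[label])
    # Assumption for this sketch: return 0.0 when every position is ignored.
    return sum(losses) / len(losses) if losses else 0.0

logits = [[5.0, 0.0], [2.0, 1.0], [0.0, 3.0], [9.0, 9.0]]
labels = [-100, 0, 1, -100]

# The first and last positions are ignored, so only the middle two matter.
print(token_cross_entropy(logits, labels))
print(token_cross_entropy(logits[1:3], labels[1:3]))  # same value
```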
There are two ways to align the labels:
Every subtoken of a word gets that word's label.
Only the first subtoken of a word gets the word's label; the other subtokens are set to -100.
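The difference between the two strategies shows up on a single short example. The sketch below assumes a hypothetical six-token input and hypothetical word labels. align_labels is an illustrative helper that takes the word index of each subtoken as a plain list.

```python
def align_labels(word_ids, word_labels, label_all_tokens):
    """Expand word-level labels to subtoken level using each subtoken's word index."""
    aligned = []
    previous_word_idx = None
    for word_idx in word_ids:
        if word_idx is None:
            aligned.append(-100)  # special token such as [CLS] or [SEP]
        elif word_idx != previous_word_idx:
            aligned.append(word_labels[word_idx])  # first subtoken of a word
        else:
            # Later subtokens: copy the label or mask it, depending on the strategy.
            aligned.append(word_labels[word_idx] if label_all_tokens else -100)
        previous_word_idx = word_idx
    return aligned

# Hypothetical input "[CLS] werner z ##wing ##mann said [SEP]"
# with assumed word labels 1 (B-PER), 2 (I-PER), 0 (O).
word_ids = [None, 0, 1, 1, 1, 2, None]
word_labels = [1, 2, 0]

print(align_labels(word_ids, word_labels, label_all_tokens=True))
# [-100, 1, 2, 2, 2, 0, -100]
print(align_labels(word_ids, word_labels, label_all_tokens=False))
# [-100, 1, 2, -100, -100, 0, -100]
```

With the first strategy a long word contributes several times to the loss; with the second, each word contributes exactly once.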
 
label_all_tokens = True

The preprocessing function

def tokenize_and_align_labels(examples):
    tokenized_inputs = tokenizer(examples["tokens"], truncation=True, is_split_into_words=True)

    labels = []
    for i, label in enumerate(examples[f"{task}_tags"]):
        word_ids = tokenized_inputs.word_ids(batch_index=i)
        previous_word_idx = None
        label_ids = []
        for word_idx in word_ids:
            # Special tokens have no word index; -100 makes the loss ignore them.
            if word_idx is None:
                label_ids.append(-100)
            # The first subtoken of each word gets the word's label.
            elif word_idx != previous_word_idx:
                label_ids.append(label[word_idx])
            # Remaining subtokens get the word's label or -100, depending on label_all_tokens.
            else:
                label_ids.append(label[word_idx] if label_all_tokens else -100)
            previous_word_idx = word_idx

        labels.append(label_ids)

    tokenized_inputs["labels"] = labels
    return tokenized_inputs
 
Next we preprocess every example in datasets by using the map function to apply the preprocessing function tokenize_and_align_labels to all of them.

tokenized_datasets = datasets.map(tokenize_and_align_labels, batched=True)
 
Fine-tuning the pretrained model

from transformers import AutoModelForTokenClassification, TrainingArguments, Trainer

model = AutoModelForTokenClassification.from_pretrained(model_checkpoint, num_labels=len(label_list))

args = TrainingArguments(
    f"test-{task}",                  # output directory
    evaluation_strategy="epoch",     # named eval_strategy in recent transformers releases
    learning_rate=2e-5,
    per_device_train_batch_size=batch_size,
    per_device_eval_batch_size=batch_size,
    num_train_epochs=3,
    weight_decay=0.01,
)
 
Finally we need a data collator, which batches our processed inputs and feeds them to the model.

from transformers import DataCollatorForTokenClassification

data_collator = DataCollatorForTokenClassification(tokenizer)
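A collator is needed because the tokenized sentences in a batch have different lengths. The sketch below shows the kind of padding involved, assuming padding on the right with pad token ID 0 and label padding of -100. pad_batch is a hypothetical plain-Python helper, not the library class, and it returns lists rather than tensors.

```python
def pad_batch(features, pad_token_id=0, label_pad_id=-100):
    """Pad every sequence in the batch to the length of the longest one."""
    max_length = max(len(feature["input_ids"]) for feature in features)
    batch = {"input_ids": [], "attention_mask": [], "labels": []}
    for feature in features:
        length = len(feature["input_ids"])
        padding = max_length - length
        batch["input_ids"].append(feature["input_ids"] + [pad_token_id] * padding)
        # Padding positions get attention 0 so the model does not attend to them.
        batch["attention_mask"].append([1] * length + [0] * padding)
        # Padding labels are -100 so they are skipped by the loss.
        batch["labels"].append(feature["labels"] + [label_pad_id] * padding)
    return batch

features = [
    {"input_ids": [101, 7592, 102], "labels": [-100, 0, -100]},
    {"input_ids": [101, 7592, 1010, 2023, 102], "labels": [-100, 0, 0, 0, -100]},
]
print(pad_batch(features))
```

Padding per batch rather than to one global length keeps batches of short sentences small.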
 
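We use the seqeval metric for evaluation. It expects label names rather than integer IDs, so some post-processing is needed before the model predictions can be passed to it:

metric = load_metric("seqeval")
# Convert the integer tags of one example to label names and score them against themselves.
labels = [label_list[i] for i in example[f"{task}_tags"]]
metric.compute(predictions=[labels], references=[labels])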
 
The model predictions are post-processed as follows:
Take the index of the class with the highest predicted score.
Convert the index to a label.
Skip the positions where the label is -100.

import numpy as np

def compute_metrics(p):
    predictions, labels = p
    predictions = np.argmax(predictions, axis=2)

    # Remove ignored index (special tokens)
    true_predictions = [
        [label_list[p] for (p, l) in zip(prediction, label) if l != -100]
        for prediction, label in zip(predictions, labels)
    ]
    true_labels = [
        [label_list[l] for (p, l) in zip(prediction, label) if l != -100]
        for prediction, label in zip(predictions, labels)
    ]

    results = metric.compute(predictions=true_predictions, references=true_labels)
    return {
        "precision": results["overall_precision"],
        "recall": results["overall_recall"],
        "f1": results["overall_f1"],
        "accuracy": results["overall_accuracy"],
    }
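The three post-processing steps can be followed on one tiny sentence. This sketch uses a shortened, hypothetical label set and hand-written scores, with plain lists in place of NumPy arrays. decode_predictions is an illustrative helper.

```python
label_list = ["O", "B-PER", "I-PER"]  # shortened, hypothetical label set

def decode_predictions(logits, labels):
    """Argmax each token's scores, map to label names, drop positions labelled -100."""
    true_predictions, true_labels = [], []
    for sentence_logits, sentence_labels in zip(logits, labels):
        predicted_ids = [max(range(len(scores)), key=scores.__getitem__) for scores in sentence_logits]
        kept = [(p, l) for p, l in zip(predicted_ids, sentence_labels) if l != -100]
        true_predictions.append([label_list[p] for p, _ in kept])
        true_labels.append([label_list[l] for _, l in kept])
    return true_predictions, true_labels

# One sentence with 4 positions; the first and last are special tokens.
logits = [[[9.0, 0.0, 0.0], [0.1, 2.0, 0.3], [0.2, 0.1, 1.5], [3.0, 0.0, 0.0]]]
labels = [[-100, 1, 2, -100]]

print(decode_predictions(logits, labels))
# ([['B-PER', 'I-PER']], [['B-PER', 'I-PER']])
```

Whatever the model predicts at the special-token positions never reaches the metric.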
 
Pass the data, the model and the arguments to the Trainer:

trainer = Trainer(
    model,
    args,
    train_dataset=tokenized_datasets["train"],
    eval_dataset=tokenized_datasets["validation"],
    data_collator=data_collator,
    tokenizer=tokenizer,
    compute_metrics=compute_metrics
)
 
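Start training by calling trainer.train().

After training, we can call the evaluate method again to evaluate the model, including on other datasets.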