Cannot find the NER Model #10

kemalaraz opened this issue Mar 18, 2020 · 22 comments


It seems the Hugging Face repository contains only the base model; I couldn't find a model and tokenizer for named entity recognition. Where can I find the trained NER model, and, if it is not too much to ask, how can I load and use it easily?

@kemalaraz kemalaraz changed the title NER Model Cannot find the NER Model Mar 18, 2020
I am trying to replicate your NER results with Hugging Face Transformers: I take your base model as the pre-trained model, use BertForTokenClassification, and then train it for NER, but it is not converging. Can you elaborate on that?

Hi @kemalaraz ,

yes, there are no fine-tuned models stored on the model hub at the moment, only the ones with a normal lm head.

For fine-tuning I used FARM, with corresponding configuration files in the ./config folder of this repo.

Could you just paste the fine-tuning command that you've used with the script? Maybe you should adjust the number of epochs (for NER I used 10 epochs) :)

kemalaraz commented Mar 24, 2020

Hello again @stefan-it ,
Thanks for the quick response. I wrote a fairly large script for it, which is below:

# -*- coding: utf-8 -*-
import torch
from torch.optim import Adam
from torch.utils.data import TensorDataset, DataLoader, RandomSampler, SequentialSampler
from transformers import BertTokenizer, BertConfig, AdamW
from transformers import BertForTokenClassification
from preprocess import create_bert_data
import argparse
from seqeval.metrics import f1_score
from tqdm import tqdm, trange
import numpy as np
from transformers import get_linear_schedule_with_warmup


parser = argparse.ArgumentParser()
parser.add_argument("-t", "--trainpath", required = True, help = "Path to the train raw data")
parser.add_argument("-v", "--valpath", required = True, help = "Path to the val raw data")
parser.add_argument("-f", "--finetune", required = True, help = "Fine tuning option if false just the classifier layer is trained")
args = vars(parser.parse_args())

if args["finetune"] == "True":
    fine_tune = True
if args["finetune"] == "False":
    fine_tune = False

assert fine_tune is True or fine_tune is False, "Fine-tune must be a boolean value but given {}".format(args["finetune"])

tokenizer = BertTokenizer.from_pretrained("dbmdz/bert-base-turkish-cased")

#Accuracy on a token level (balanced accuracy)
def flat_accuracy(preds, labels):
    pred_flat = np.argmax(preds, axis=2).flatten()
    labels_flat = labels.flatten()
    return np.sum(pred_flat == labels_flat) / len(labels_flat)

#Set up if the script will run on cpu or gpu
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
n_gpu = torch.cuda.device_count()

train_input, train_labels, uniq_ent, ent_idx, train_attention_masks = create_bert_data("train.txt", train = True)
val_input, val_labels, val_attention_masks = create_bert_data(args["valpath"], train = False)
one_training_step = len(train_input) / BATCH_SIZE
train_input = torch.tensor(train_input)
train_labels = torch.tensor(train_labels)
train_attention_masks = torch.tensor(train_attention_masks)
val_input = torch.tensor(val_input)
val_labels = torch.tensor(val_labels)
val_attention_masks = torch.tensor(val_attention_masks)

#Define the DataLoader and shuffle the data at training time and test time pass them to SequentialSampler
train_dataset = TensorDataset(train_input, train_attention_masks, train_labels)
train_sampler = RandomSampler(train_dataset)
train_dataloader = DataLoader(train_dataset, sampler = train_sampler, batch_size = BATCH_SIZE)

val_dataset = TensorDataset(val_input, val_attention_masks, val_labels)
val_sampler = RandomSampler(val_dataset)
val_dataloader = DataLoader(val_dataset, sampler = val_sampler, batch_size = BATCH_SIZE)

#Write unique entities and their ids into a file
with open("entity_idx.txt", "w") as foo:


#Load the model
model = BertForTokenClassification.from_pretrained("dbmdz/bert-base-turkish-cased", num_labels = len(ent_idx))

#For gpu
model.to(device)
num_training_steps = int(one_training_step/BATCH_SIZE)*EPOCHS
num_warmup_steps = int(num_training_steps * 0.4)
#Initilize the optimizer
optimizer = AdamW(model.parameters(), lr = 5e-5, correct_bias=False)  #To reproduce BertAdam specific behavior set correct_bias=False
scheduler = get_linear_schedule_with_warmup(optimizer, num_warmup_steps=num_warmup_steps, num_training_steps=num_training_steps)  #PyTorch schedule

#For printing under progress bars
description_train = tqdm(total=0, position=1, bar_format='{desc}')
description_val = tqdm(total=0, position=1, bar_format='{desc}')

train_losses = []

for epoch in range(EPOCHS):
    print("{:d}/{:d} EPOCH".format(epoch, EPOCHS))
    tr_loss = 0
    nb_tr_examples, nb_tr_steps = 0, 0
    # Training progress bar
    pbar_train = tqdm(total = len(train_input), leave = False)

    for step, batch in enumerate(train_dataloader):

        # add batch to gpu
        input_batch, input_mask_batch, labels_batch = [b.to(device) for b in batch]
        # forward pass
        loss,_ = model(input_batch, token_type_ids = None,
                    attention_mask = input_mask_batch, labels = labels_batch)
        # backprop
        loss.backward()
        # track train loss
        tr_loss += loss.item()
        nb_tr_examples += input_batch.size(0)
        nb_tr_steps += 1
        # gradient clipping to avoid exploding gradients
        torch.nn.utils.clip_grad_norm_(parameters=model.parameters(), max_norm=MAX_GRAD_NORM)
        # update parameters
        optimizer.step()
        scheduler.step()
        model.zero_grad()
        # Increment progress bar
        pbar_train.update(input_batch.size(0))
        train_loss = tr_loss/nb_tr_steps
        description_train.set_description_str(f"Epoch: {epoch}/{EPOCHS} - Loss: {train_loss}")
    #Print train loss per epoch
    print("Train loss: {:.5f}".format(tr_loss/nb_tr_steps))

    eval_loss, eval_accuracy = 0, 0
    nb_eval_steps, nb_eval_examples = 0, 0
    predictions, true_labels = [], []
    # Validation progress bar
    pbar_val = tqdm(total = len(val_input), leave = True)

    for batch in val_dataloader:
        # add batch to gpu
        batch = tuple(t.to(device) for t in batch)
        input_batch, input_mask_batch, labels_batch = batch

        with torch.no_grad():
            outputs = model(input_batch, token_type_ids = None,
                                attention_mask = input_mask_batch, labels = labels_batch)
            temp_eval_loss, logits = outputs[:2]
            logits = logits.detach().cpu().numpy()
            label_ids = labels_batch.to("cpu").numpy()
            true_labels.append(label_ids)
            predictions.extend([list(p) for p in np.argmax(logits, axis = 2)])
            temp_eval_accuracy = flat_accuracy(logits, label_ids)
            eval_loss += temp_eval_loss.mean().item()
            eval_accuracy += temp_eval_accuracy

            nb_eval_examples += input_batch.size(0)
            nb_eval_steps += 1
        eval_loss = eval_loss/nb_eval_steps
        val_accuracy = eval_accuracy/nb_eval_steps
        predicted_entities = [uniq_ent[p_i] for p in predictions for p_i in p]
        valid_entities = [uniq_ent[l_ii] for l in true_labels for l_i in l for l_ii in l_i]
        f1_score_val = f1_score(valid_entities, predicted_entities)
        description_val.set_description_str(f"Validation Scores -> Loss : {eval_loss} , Accuracy : {val_accuracy} , F1-Score : {f1_score_val}")

The code above didn't converge: I am stuck at a loss of 0.020... and an F1 score of around 0.40.
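One thing that may be worth checking in the script above is the scheduler arithmetic: `one_training_step` is already `len(train_input) / BATCH_SIZE`, and it is divided by `BATCH_SIZE` a second time when `num_training_steps` is computed, so the linear schedule may reach a learning rate of zero long before training ends. A minimal sketch of the usual calculation follows. The function names and the example numbers (20,000 examples, batch size 32, 10 epochs) are illustrative assumptions, not values taken from the post.

```python
import math


def schedule_steps(num_examples, batch_size, epochs, warmup_fraction=0.1):
    """Return (total optimizer steps, warmup steps) for a linear warmup schedule."""
    # One optimizer step per batch; the last, smaller batch still counts.
    steps_per_epoch = math.ceil(num_examples / batch_size)
    total_steps = steps_per_epoch * epochs
    warmup_steps = int(total_steps * warmup_fraction)
    return total_steps, warmup_steps


def steps_as_in_script(num_examples, batch_size, epochs):
    """Reproduce the posted formula, which divides by the batch size twice."""
    one_training_step = num_examples / batch_size
    return int(one_training_step / batch_size) * epochs


if __name__ == "__main__":
    # Hypothetical dataset size, used only to compare the two formulas.
    print(schedule_steps(20000, 32, 10))
    print(steps_as_in_script(20000, 32, 10))
```

With these assumed numbers the two formulas differ by a factor of roughly the batch size, which would make the scheduler finish its decay after a small fraction of the first epoch.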
I also tried FARM with your berturk.json file, using the experiment code shown below:

from farm.experiment import run_experiment, load_experiments

experiments = load_experiments("/home/karaz/Desktop/BERT_NER/FARM_TransferLearning/berturk.json")
for experiment in experiments:
    run_experiment(experiment)

I can go with FARM as well, but when I ran the FARM code above the F1 score was 55%. What am I missing here? :)

Sorry to bother you this much, but I also tried different datasets and still had no luck.
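Another possible factor, and this is only a guess from reading the training script, is that predictions and labels are flattened over every position, including padding, before they reach seqeval. seqeval's `f1_score` takes lists of tag sequences, one list per sentence, so a small helper that drops padded positions first might look like the sketch below. The helper name `unpad_tags` and the `id2tag` mapping are hypothetical.

```python
def unpad_tags(pred_ids, label_ids, attention_mask, id2tag):
    """Turn padded id matrices into per-sentence tag lists without padding."""
    true_tags, pred_tags = [], []
    for preds, labels, mask in zip(pred_ids, label_ids, attention_mask):
        # Keep only positions that the attention mask marks as real tokens.
        keep = [i for i, m in enumerate(mask) if m == 1]
        true_tags.append([id2tag[labels[i]] for i in keep])
        pred_tags.append([id2tag[preds[i]] for i in keep])
    return true_tags, pred_tags


if __name__ == "__main__":
    id2tag = {0: "B-PER", 1: "I-PER", 2: "O"}  # hypothetical label mapping
    true_tags, pred_tags = unpad_tags(
        pred_ids=[[0, 1, 2, 0]],
        label_ids=[[0, 1, 2, 2]],
        attention_mask=[[1, 1, 1, 0]],
        id2tag=id2tag,
    )
    print(true_tags, pred_tags)
```

The two returned lists can then be passed as `f1_score(true_tags, pred_tags)`.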

Hi @kemalaraz ,

I'm currently working on an evaluation for the WikiANN (balanced) dataset, so you could use this dataset as well to test the implementation:

Dataset can be retrieved from here and the dataset needs to be pre-processed, e.g. with:

import sys

filename = sys.argv[1]

with open(filename, "rt") as f_p:
    for line in f_p:
        line = line.rstrip()

        if not line:
            # An empty line separates sentences; keep it in the output
            print("")
            continue

        token, label = line.split("\t")

        assert token.startswith("tr:")

        print(token[3:], label)

Just use python3 train > train.txt and so on for train, dev and test.

It's important that the final dataset format is:

Büyük B-ORG
Ermenistan I-ORG
kurma O
girişimleri O
sona O
ermiştir O
. O


There's a token/label pair on each line, delimited by a space. An empty line denotes a sentence boundary.

Just make sure, that the dataset format is ok.
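A quick way to check the format is to parse the file and fail loudly on any line that is not a `token label` pair. The sketch below assumes exactly one space between token and label, as in the example above; the function name `read_token_label_file` is hypothetical.

```python
def read_token_label_file(lines):
    """Group 'token label' lines into sentences; blank lines separate sentences."""
    sentences, current = [], []
    for line_no, raw in enumerate(lines, start=1):
        line = raw.rstrip("\n")
        if not line.strip():
            # Blank line: close the current sentence, if any.
            if current:
                sentences.append(current)
                current = []
            continue
        parts = line.split(" ")
        if len(parts) != 2:
            raise ValueError(f"line {line_no}: expected 'token label', got {line!r}")
        current.append((parts[0], parts[1]))
    if current:
        sentences.append(current)
    return sentences


if __name__ == "__main__":
    sample = ["Büyük B-ORG\n", "Ermenistan I-ORG\n", "kurma O\n", "\n", ". O\n"]
    print(read_token_label_file(sample))
```

For a real file, `read_token_label_file(open("train.txt", encoding="utf-8"))` would return one list of pairs per sentence.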

The 55% depends on the dataset: if your dataset is not balanced or is very noisy (like the WNUT NER datasets), that would explain the bad result. However, you should try training on bert-base-multilingual-cased as well :)

Thank you so much for your interest; I'll try your suggestions and get back to you. By the way, I got the 55% with the dataset you mentioned in another issue, where you said you plan to evaluate the model for NER on it. I used the same set and parsed it one token at a time, but got bad results. Also, the dataset file should look like "Büyük B-ORG", but when preparing the input for BERT it should be "This is a sentence" with the output " 'O' ,'O' ...", right?
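On the question of what BERT should receive: the model sees sub-word tokens, so word-level labels have to be spread over the pieces of each word. One common convention is to give the label to the first piece and an ignore index to the remaining pieces. The sketch below illustrates that convention; `align_labels` and `toy_tokenize` are hypothetical names, and the toy tokenizer only stands in for a real `tokenizer.tokenize`.

```python
IGNORE_INDEX = -100  # default ignore_index of PyTorch's CrossEntropyLoss


def align_labels(words, labels, tokenize, label2id):
    """Spread word-level labels over sub-word tokens (first piece keeps the label)."""
    tokens, label_ids = [], []
    for word, label in zip(words, labels):
        pieces = tokenize(word)
        if not pieces:
            # Some tokenizers return nothing for unusual characters; skip the word.
            continue
        tokens.extend(pieces)
        label_ids.extend([label2id[label]] + [IGNORE_INDEX] * (len(pieces) - 1))
    return tokens, label_ids


def toy_tokenize(word):
    """Hypothetical stand-in for a WordPiece tokenizer: split words longer than 5 chars."""
    return [word] if len(word) <= 5 else [word[:5], "##" + word[5:]]


if __name__ == "__main__":
    label2id = {"B-ORG": 1, "I-ORG": 4, "O": 6}
    print(align_labels(["Büyük", "Ermenistan", "kurma"],
                       ["B-ORG", "I-ORG", "O"], toy_tokenize, label2id))
```

The special tokens ([CLS], [SEP]) and padding would also need the ignore index once they are added around the sequence.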

I am sorry, but that dataset didn't work either. I hope you add your trained NER model to this repo; otherwise it seems impossible to replicate your results.


stefan-it commented Apr 13, 2020

Hi @kemalaraz,

could you try to reproduce the NER results with the following commands:

Download WikiANN data

Just clone the latest version of Transformers, install it via pip install -e . and put the following bash script in the examples/ner folder:

mkdir tr-data

cd tr-data

for file in train.txt dev.txt test.txt labels.txt

cd ..

It will download the pre-processed datasets with training, dev and test splits and put them in a tr-data folder.

Run fine-tuning

After downloading the dataset, fine-tuning can be started. Just set the following environment variables:

export MAX_LENGTH=128
export BERT_MODEL=dbmdz/bert-base-turkish-cased

export OUTPUT_DIR=tr-model
export BATCH_SIZE=32
export NUM_EPOCHS=3
export SAVE_STEPS=625
export SEED=1

Then run fine-tuning:

python3 run_ner.py --data_dir ./tr-data \
--model_type bert \
--labels ./tr-data/labels.txt \
--model_name_or_path $BERT_MODEL \
--output_dir $OUTPUT_DIR-$SEED \
--max_seq_length $MAX_LENGTH \
--num_train_epochs $NUM_EPOCHS \
--per_gpu_train_batch_size $BATCH_SIZE \
--save_steps $SAVE_STEPS \
--seed $SEED \
--do_train \
--do_eval \
--do_predict \

This will fine-tune the Turkish BERT model for 3 epochs. Results should be around:

cat tr-model-1/eval_results.txt
cat tr-model-1/test_results.txt

In my experiment I got 92.60% on the development set and 92.19% on the test set.

I hope I can upload fine-tuned models this week.

savasy commented Apr 28, 2020

I ran the same experiment as Stefan to reproduce the results. However, I got an error:
ImportError: cannot import name 'EvalPrediction' from 'transformers'
I will keep going and share the results soon.

Works like a charm, cheers bud. However, I still wonder which dataset you used to get 95% on the NER task?

savasy commented Apr 29, 2020

In my experiments, I got results similar to @stefan-it's, as follows:

Eval Results:

precision = 0.916400580551524
recall = 0.9342309684101502
f1 = 0.9252298787412536
loss = 0.11335893666411284

Test Results:
precision = 0.9192058759362955
recall = 0.9303010230367262
f1 = 0.9247201697271198
loss = 0.11182546521618497
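The `key = value` layout of these result files is easy to read back into a dictionary, for example to compare runs or to sanity-check that the reported F1 is the harmonic mean of precision and recall. The sketch below assumes only that layout; `parse_results` and `f1_from` are hypothetical helper names.

```python
def parse_results(text):
    """Parse 'key = value' lines into a dict of floats, skipping other lines."""
    results = {}
    for line in text.splitlines():
        if "=" not in line:
            continue
        key, value = line.split("=", 1)
        results[key.strip()] = float(value)
    return results


def f1_from(precision, recall):
    """F1 is the harmonic mean of precision and recall."""
    return 2 * precision * recall / (precision + recall)


if __name__ == "__main__":
    eval_text = """precision = 0.916400580551524
recall = 0.9342309684101502
f1 = 0.9252298787412536
loss = 0.11335893666411284"""
    scores = parse_results(eval_text)
    print(scores["f1"], f1_from(scores["precision"], scores["recall"]))
```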

kemalaraz commented Apr 30, 2020

When I load the trained model with:

import torch
from transformers import AutoModelForTokenClassification, AutoTokenizer

label_list = ["B-LOC", "B-ORG", "B-PER", "I-LOC", "I-ORG", "I-PER", "O"]
model = AutoModelForTokenClassification.from_pretrained("./transformers/examples/ner/tr-model-1/checkpoint-1875")
tokenizer = AutoTokenizer.from_pretrained("dbmdz/bert-base-turkish-cased")
# Tokenize and encode the same sentence so that tokens and predictions line up
sentence = sentences[4]
tokens = tokenizer.tokenize(tokenizer.decode(tokenizer.encode(sentence)))
inputs = tokenizer.encode(sentence, return_tensors = "pt")
outputs = model(inputs)[0]
predictions = torch.argmax(outputs, dim = 2)

print([(token, label_list[prediction]) for token, prediction in zip(tokens, predictions[0].tolist())])

### In the config file, label2id is:

  "label2id": {
    "B-LOC": 0,
    "B-ORG": 1,
    "B-PER": 2,
    "I-LOC": 3,
    "I-ORG": 4,
    "I-PER": 5,
    "O": 6
  }

When I test this model on another dataset, I get really bad results. I can give you more information, or share the model and related files, if you want. Can you please help me with that?
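When decoding predictions by hand, it can help to derive the id-to-label mapping from the saved `label2id` rather than from a separately typed list, and to check that tokens and predictions have the same length. The sketch below shows only that bookkeeping with plain Python data; `label_tokens` is a hypothetical helper, and the token and id lists stand in for real tokenizer and model output.

```python
def label_tokens(tokens, prediction_ids, label2id):
    """Pair each token with its predicted label, using the inverted label2id map."""
    id2label = {idx: label for label, idx in label2id.items()}
    if len(tokens) != len(prediction_ids):
        # A mismatch usually means tokens and predictions came from different inputs.
        raise ValueError(
            f"{len(tokens)} tokens but {len(prediction_ids)} predictions"
        )
    return [(tok, id2label[pred]) for tok, pred in zip(tokens, prediction_ids)]


if __name__ == "__main__":
    label2id = {"B-LOC": 0, "B-ORG": 1, "B-PER": 2,
                "I-LOC": 3, "I-ORG": 4, "I-PER": 5, "O": 6}
    # Hypothetical tokens and predicted ids, for illustration only.
    print(label_tokens(["[CLS]", "Samsun", "[SEP]"], [6, 0, 6], label2id))
```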

Could you post an example sentence or more sentences from your dataset, so that I can test it 🤔

savasy commented Apr 30, 2020

I trained a fine-tuned model and uploaded it to the transformers repo as savasy/bert-base-turkish-ner-cased

Please check the following code:

from transformers import pipeline, AutoModelForTokenClassification, AutoTokenizer

model = AutoModelForTokenClassification.from_pretrained("savasy/bert-base-turkish-ner-cased")
tokenizer = AutoTokenizer.from_pretrained("savasy/bert-base-turkish-ner-cased")
ner=pipeline('ner', model=model, tokenizer=tokenizer)
ner("Mustafa Kemal Atatürk 19 Mayıs 1919'da Samsun'a ayak bastı.")

output looks like this
[{'word': 'Mustafa', 'score': 0.9938516616821289, 'entity': 'B-PER'}, {'word': 'Kemal', 'score': 0.9881671071052551, 'entity': 'I-PER'}, {'word': 'Atatürk', 'score': 0.9957979321479797, 'entity': 'I-PER'}, {'word': 'Samsun', 'score': 0.9059973359107971, 'entity': 'B-LOC'}]
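Output in this shape can be merged into whole entities by following the B-/I- prefixes. The sketch below is one simple way to do that, assuming the list-of-dicts format shown above; `group_entities` is a hypothetical helper. Because tokens tagged O do not appear in the output, it cannot tell whether two tagged words were adjacent in the sentence, and it does not rejoin sub-word pieces.

```python
def group_entities(predictions):
    """Merge consecutive B-/I- tagged words of the same type into entities."""
    entities = []
    for item in predictions:
        if item["entity"] == "O":
            continue
        prefix, _, entity_type = item["entity"].partition("-")
        if prefix == "I" and entities and entities[-1][1] == entity_type:
            # Continue the entity that is currently open.
            words, _ = entities[-1]
            entities[-1] = (words + " " + item["word"], entity_type)
        else:
            # A B- tag, or an I- tag with nothing to attach to, starts a new entity.
            entities.append((item["word"], entity_type))
    return entities


if __name__ == "__main__":
    output = [
        {"word": "Mustafa", "score": 0.9938516616821289, "entity": "B-PER"},
        {"word": "Kemal", "score": 0.9881671071052551, "entity": "I-PER"},
        {"word": "Atatürk", "score": 0.9957979321479797, "entity": "I-PER"},
        {"word": "Samsun", "score": 0.9059973359107971, "entity": "B-LOC"},
    ]
    print(group_entities(output))
```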

I got the dataset from and I am getting bad results. @savasy, thank you for your interest and for the example :) Training the model is not the problem; I am trying the model on various datasets it has not seen and not getting good results, but I will double-check everything again when I have time. Again, thank you :)

stefan-it commented Apr 30, 2020

The NER dataset in the linspector repo was automatically annotated - just found the paper:

When looking at the test set, I could find some tagging errors:

ve X B-X O
Vorarlberg'i X B-X O
Avusturya X B-X B-LOCATION
İmparatorluğu'na X B-X O
bırakan X B-X O
, X B-X O
Aschaffenburg X B-X B-MISC
ile X B-X O
Darmstadt'ın X B-X O
bir X B-X O
kısmını X B-X O
elde X B-X O
etti X B-X O
. X B-X O

So e.g. "Aschaffenburg" is clearly a location, "Darmstadt" and "Vorarlberg" as well.

@savasy Thanks for uploading the model 👍 I just used it to tag the sentence, and "Aschaffenburg", "Darmstadt" and "Vorarlberg" are tagged as locations.

savasy commented Apr 30, 2020

You are welcome @stefan-it @kemalaraz
And "İmparatorluğu" should be I-Location as well.
It seems the linspector dataset could be misleading. Can you share another dataset here, @kemalaraz and @stefan-it, so that we can train/test on it?

You can get that from this link; however, I haven't tried the model on that dataset yet. I am working on a different project right now. I will write a parser for the ENAMEX format and then try it.
nerdata.txt
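A parser of the kind mentioned here could look roughly like the sketch below. It assumes MUC-style inline markup of the form `<ENAMEX TYPE="PERSON">...</ENAMEX>` and whitespace tokenization; the actual tag layout in nerdata.txt is not shown in this thread and may differ, so the regular expression is an assumption, and `enamex_to_bio` is a hypothetical name.

```python
import re

# Assumed markup: <ENAMEX TYPE="PERSON">Mustafa Kemal</ENAMEX>
ENAMEX_RE = re.compile(r'<ENAMEX TYPE="([^"]+)">(.*?)</ENAMEX>', re.IGNORECASE)


def enamex_to_bio(sentence):
    """Convert one ENAMEX-annotated sentence into (token, BIO label) pairs."""
    pairs = []
    pos = 0
    for match in ENAMEX_RE.finditer(sentence):
        # Text before the entity is tagged O.
        pairs.extend((tok, "O") for tok in sentence[pos:match.start()].split())
        entity_type = match.group(1)
        for i, tok in enumerate(match.group(2).split()):
            pairs.append((tok, ("B-" if i == 0 else "I-") + entity_type))
        pos = match.end()
    # Remaining text after the last entity is tagged O.
    pairs.extend((tok, "O") for tok in sentence[pos:].split())
    return pairs


if __name__ == "__main__":
    example = '<ENAMEX TYPE="PERSON">Mustafa Kemal</ENAMEX> Samsun\'a gitti .'
    for token, label in enamex_to_bio(example):
        print(token, label)
```

Printing one pair per line, with an empty line after each sentence, would give the token/label format used for training earlier in the thread.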

savasy commented Apr 30, 2020

Thank you, that is plenty of data.

I will give it a try

savasy commented May 3, 2020

The performance for the data given by @kemalaraz is as follows

savas@savas-lenova:~/Desktop/trans/tr-new-model-1$ cat eval_results.txt
precision = 0.9461980692049029
recall = 0.959309358847465
f1 = 0.9527086063783312
loss = 0.037054269206847804

savas@savas-lenova:~/Desktop/trans/tr-new-model-1$ cat test_results.txt
precision = 0.9458370635631155
recall = 0.9588201928530913
f1 = 0.952284378344882
loss = 0.035431676572445225

savasy commented Jun 2, 2020

Hi @kemalaraz and @stefan-it
Where did you get the NER dataset that Kemal shared, as follows:

Is it from the paper below?

Hi @savasy , I checked that dataset and it should be identical to the one that I've used for the NER experiments :)

