Revisiting CoNLL-2003 with Classical Machine Learning

Highlights

Contributors: Matthew A Hernandez

Note: Access to our paper here.

We present a brief survey of classical machine learning algorithms in the context of the CoNLL-2003 Shared Task. A named entity recognition (NER) system is built to recognize objects in a body of text and classify them into predefined categories. We include a principled framework that motivates the use of machine learning by creating two baseline systems. Finally, the paper includes a comparison of generative and discriminative machine learning algorithms.

Directory

.
├── corpora/                # Contains datasets used for training and evaluation
│   ├── test/               
│   ├── train/          
│   └── val/                
├── implementation/         # Source code for the project
│   ├── gazetteers/         # Folder for text files of named entities
│   ├── models/             # SGDClassifier (scratch)
│   ├── utils/              # Utilities for reporting accuracy and exact-entity eval
│   ├── corpus.py           # Script to read the CoNLL data
│   ├── ner_main.py         # The full NER system 
│   ├── ner_main_memm.py    # Script to train MEMM and report results
│   ├── ner_memm_grid.py    # Script to use grid search for tuning
│   └── ner_system.py       # Script to train HMM and report results
├── reports/                # Text files of various reports
│   ├── benchmarks/
│   └── grid-search/
├── README.md
└── requirements.txt

Virtual Environment

The environment can be replicated with a virtual environment. Please follow the directions below to run the experiments from the paper.

$ git clone YOUR_REPO
$ cd NER-with-Classical-Machine-Learning
$ python3 -m venv .env
$ source .env/bin/activate
$ pip install -r requirements.txt

Data

We used both the Spanish (CoNLL-2002) and English (CoNLL-2003) datasets.

More information is in the corpora/ subdirectory.
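
For orientation, the CoNLL shared-task files are plain text with one token per line, whitespace-separated columns, the NER tag in the last column, and a blank line between sentences. The sketch below reads that layout. `read_conll` is a hypothetical helper written for illustration and is not necessarily how implementation/corpus.py works.

```python
from typing import Iterable, List, Tuple

Sentence = List[Tuple[str, str]]  # (token, NER tag) pairs


def read_conll(lines: Iterable[str]) -> List[Sentence]:
    """Group token lines into sentences, keeping the first and last columns."""
    sentences: List[Sentence] = []
    current: Sentence = []
    for raw in lines:
        line = raw.strip()
        # A blank line or a -DOCSTART- marker ends the current sentence.
        if not line or line.startswith("-DOCSTART-"):
            if current:
                sentences.append(current)
                current = []
            continue
        columns = line.split()
        # First column is the token, last column is the NER tag.
        current.append((columns[0], columns[-1]))
    if current:
        sentences.append(current)
    return sentences


# Small inline sample in the English four-column layout (token, POS, chunk, NER).
sample = """-DOCSTART- -X- O O

EU NNP I-NP I-ORG
rejects VBZ I-VP O
German JJ I-NP I-MISC
call NN I-NP O

Peter NNP I-NP I-PER
Blackburn NNP I-NP I-PER
""".splitlines()

sentences = read_conll(sample)
```

Because only the first and last columns are kept, the same function would also accept files with fewer columns, as long as the tag comes last.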

Hyperparameters

The hyperparameters were tuned with grid search to find the optimal values for regularization, epochs, and learning rate.

Parameter	Value
$\lambda$	0.1
$\eta$	0.1
epochs	15
$\alpha$	100
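
As a rough illustration of the tuning procedure, the sketch below runs an exhaustive grid search in plain Python. The candidate values, the `grid_search` helper, and the `toy_score` function are all assumptions made for this example; in practice the scoring function would train a model and return a validation metric, and implementation/ner_memm_grid.py may be organized differently.

```python
from itertools import product
from typing import Callable, Dict, List, Tuple


def grid_search(
    grid: Dict[str, List[float]],
    evaluate: Callable[[Dict[str, float]], float],
) -> Tuple[Dict[str, float], float]:
    """Score every combination in `grid` and return the best one."""
    names = list(grid)
    best_params: Dict[str, float] = {}
    best_score = float("-inf")
    for values in product(*(grid[name] for name in names)):
        params = dict(zip(names, values))
        score = evaluate(params)
        if score > best_score:
            best_params, best_score = params, score
    return best_params, best_score


# Illustrative grid; these candidate values are assumptions, not the paper's.
grid = {"lambda": [0.01, 0.1, 1.0], "eta": [0.01, 0.1], "epochs": [5, 15]}


def toy_score(params: Dict[str, float]) -> float:
    """Stand-in for 'train on the training set, return validation F1'.

    It peaks at the values in the table above purely for demonstration.
    """
    return (
        -abs(params["lambda"] - 0.1)
        - abs(params["eta"] - 0.1)
        - abs(params["epochs"] - 15) / 100
    )


best_params, best_score = grid_search(grid, toy_score)
```

The cost grows with the product of the list lengths (here 3 x 2 x 2 = 12 evaluations), which is why the grid is usually kept coarse.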

Main System

The main NER system is located in implementation/ner_main.py and reports the results for the two baselines and the Hidden Markov Model (HMM) on the English and Spanish datasets.

python implementation/ner_main.py corpora/train/eng/eng.train corpora/val/eng/eng.testa corpora/test/eng/eng.testb corpora/train/esp/esp.train corpora/test/esp/esp.testb
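
HMM taggers are commonly decoded with the Viterbi algorithm. The sketch below shows that decoding step for a first-order HMM in log space. It is a generic illustration, not the repository's code: the `viterbi` function, the `unk` fallback probability, and the toy parameters are all assumptions, and it expects smoothed (nonzero) probabilities.

```python
import math
from typing import Dict, List


def viterbi(
    tokens: List[str],
    tags: List[str],
    start: Dict[str, float],
    trans: Dict[str, Dict[str, float]],
    emit: Dict[str, Dict[str, float]],
    unk: float = 1e-6,
) -> List[str]:
    """Return the most probable tag sequence under a first-order HMM."""
    if not tokens:
        return []
    # scores[i][t]: best log-probability of any path ending in tag t at position i.
    scores = [
        {t: math.log(start[t]) + math.log(emit[t].get(tokens[0], unk)) for t in tags}
    ]
    back: List[Dict[str, str]] = []
    for token in tokens[1:]:
        prev = scores[-1]
        column: Dict[str, float] = {}
        pointers: Dict[str, str] = {}
        for t in tags:
            # Pick the previous tag that maximizes the path score into t.
            best_prev = max(tags, key=lambda p: prev[p] + math.log(trans[p][t]))
            column[t] = (
                prev[best_prev]
                + math.log(trans[best_prev][t])
                + math.log(emit[t].get(token, unk))
            )
            pointers[t] = best_prev
        scores.append(column)
        back.append(pointers)
    # Follow the back-pointers from the best final tag.
    last = max(tags, key=lambda t: scores[-1][t])
    path = [last]
    for pointers in reversed(back):
        path.append(pointers[path[-1]])
    return path[::-1]


# Toy parameters, invented for this example only.
tags = ["O", "PER"]
start = {"O": 0.8, "PER": 0.2}
trans = {"O": {"O": 0.8, "PER": 0.2}, "PER": {"O": 0.6, "PER": 0.4}}
emit = {
    "O": {"said": 0.5, "Peter": 0.001},
    "PER": {"Peter": 0.5, "Blackburn": 0.4},
}

decoded = viterbi(["Peter", "Blackburn", "said"], tags, start, trans, emit)
```

Working in log space avoids numeric underflow on long sentences, since the products of many small probabilities become sums.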

Improved System

The improved NER system is located in implementation/ner_main_memm.py and reports the results of the maximum-entropy (ME) model on the English dataset. The default ME model is the SGDClassifier from scikit-learn, and the expected run time is around 3 minutes.

We do not recommend switching the model to the MEMM because its training time is significant.

Validation set

python implementation/ner_main_memm.py corpora/train/eng/eng.train corpora/val/eng/eng.testa

Test set

python implementation/ner_main_memm.py corpora/train/eng/eng.train corpora/test/eng/eng.testb
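
The utilities in implementation/utils/ are described as reporting accuracy and exact-entity evaluation. As a minimal sketch of what exact-entity scoring usually means, the code below counts a predicted entity as correct only when both its boundaries and its type match the gold entity. It assumes BIO-style tags (`B-`, `I-`, `O`), and `extract_spans` and `exact_match_f1` are hypothetical helpers, not the repository's functions.

```python
from typing import List, Set, Tuple

Span = Tuple[int, int, str]  # (start index, end index exclusive, entity type)


def extract_spans(tags: List[str]) -> Set[Span]:
    """Collect entity spans from a BIO-tagged sequence."""
    spans: Set[Span] = set()
    start, kind = None, None
    for i, tag in enumerate(tags + ["O"]):  # sentinel closes a trailing entity
        inside = tag.startswith("I-") and kind == tag[2:]
        if not inside:
            # Close the open entity, if any, then open a new one unless tag is O.
            if kind is not None:
                spans.add((start, i, kind))
            start, kind = (i, tag[2:]) if tag != "O" else (None, None)
    return spans


def exact_match_f1(gold: List[str], pred: List[str]) -> Tuple[float, float, float]:
    """Return (precision, recall, F1) over exactly matching entity spans."""
    gold_spans, pred_spans = extract_spans(gold), extract_spans(pred)
    correct = len(gold_spans & pred_spans)
    precision = correct / len(pred_spans) if pred_spans else 0.0
    recall = correct / len(gold_spans) if gold_spans else 0.0
    f1 = 2 * precision * recall / (precision + recall) if correct else 0.0
    return precision, recall, f1


# Invented example: the second entity has the right boundaries but the wrong type.
gold = ["B-PER", "I-PER", "O", "B-LOC", "O", "B-ORG"]
pred = ["B-PER", "I-PER", "O", "B-ORG", "O", "B-ORG"]
precision, recall, f1 = exact_match_f1(gold, pred)
```

This scores a single sequence. A corpus-level score would sum the correct, predicted, and gold counts over all sentences before computing precision and recall.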
