Skip to content

Latest commit

 

History

History
305 lines (219 loc) · 23.6 KB

README.md

File metadata and controls

305 lines (219 loc) · 23.6 KB

A state-of-the-art Hidden Markov Model Framework

Code for the papers:

Sentiment Analysis using Novel and Interpretable Architectures of Hidden Markov Models (Elsevier 2021).

Hidden Markov Models for Sentiment Analysis in Social Media (BCD 2019).

Machine Learning Techniques for Sentiment Analysis and Emotion Recognition in Natural Language (Thesis).

Getting Started

Introduction

A framework that implements a wide variety of Hidden Markov Models. The main focus of the performed experiments is mainly on natural language processing, but the proposed approach can be applied to any classification task in general.
The number of implemented models, whether that is different architectures, algorithms or high-order Hidden Markov Model variants, is unprecedented. Some major examples are provided below.

A traditional architecture, denoted as "Approach A".

Approach A

An architecture denoted as "Approach B", where a single supervised HMM is trained for every class y in the data, by utilizing a set of labels and observations. These labels can take any form such as labels of sentences, labels of phrases or even labels of words. For instance, a set of data contains n documents, each one consists of a set of sentences S, which form a sequential vector: d = [s1, s2, ..., sn]. Every one of the aforementioned sentences is annotated with a label x.

Approach B

An ensemble of Models.

Ensemble

Interpretability

The proposed models are not only an effective and potent tool for classification, but they also possess extremely high interpretability.
The HMMs can not only indicate the sentiment of parts of a sentence but they can also showcase the way that sentiment evolves from the start to the end of a sentence. Unlike neural networks and most machine learning approaches that operate as a black box, our approach has extremely high interpretability. Interpretability is important and desired in learning systems for a wide variety of reasons. First and foremost, in most cases, it is not enough to merely get a prediction, but we also want to know the reasoning behind that prediction, especially in cases of incompleteness in problem formalization (Arrieta et al., Doshi-Velez et al.).
Furthermore, the original goal of science was to gain knowledge, not solve problems with big datasets and black box machine learning models (Molnar) that achieve low, or even no interpretability. Thus, the proposed HMMs are reliable, trust-worthy, highly interpretable predictive models.

Some raw in-practice examples of the model operating on random sentences are provided below.

Example #1

Example #2

Dependencies

Python Version:

Python >= 3.6

Required for Baum-Welch, Maximum a Posteriori, Viterbi algorithms and other baseline components:

pomegranate >= 0.11.0

Required for Labeled algorithm and an alternative high-order implementation different to my own:

SimpleHOHMM >= 0.3.0

Required for k-fold Cross-Validation module and performance metrics:

scikit-learn >= 0.20.3

Required for plotting the performance of models:

matplotlib

Required for n-grams and high-order implementations:

nltk

Running the Model

The preprocessing as well as the model training are executed by running a single .py script at a time.

Important Framework Functions:

HMM_Framework.build()

def build(self, architecture, model, framework, k_fold, boosting=False,
          state_labels_pandas=[], observations_pandas=[], golden_truth_pandas=[],
          text_instead_of_sequences=[], text_enable=False, 
          n_grams=1, n_target="", n_prev_flag=False, n_dummy_flag=False, 
          pome_algorithm="baum-welch", pome_verbose=False, pome_njobs=1, pome_smoothing_trans=0.0, pome_smoothing_obs=0.0,
          pome_algorithm_t="map",
          hohmm_high_order=1, hohmm_smoothing=0.0, hohmm_synthesize=False,
          architecture_b_algorithm="formula", formula_magic_smoothing=0.0
          ):

Ensemble_Framework.ensemble_run()

def ensemble_run(cross_val_prediction_matrix, mapping, golden_truth, mode, weights=None, 
                 use_log_prob=True, detailed=False):

Implementation Details

HMM Algorithms

Training Architecture Required Data Sequences
Baum-Welch A observations [✓], states [X], golden truth [X] / not only supervised
Baum-Welch B observations [✓], states [X], golden truth [✓] / supervised
Viterbi A observations [✓], states [X], golden truth [X] / not only supervised
Viterbi B observations [✓], states [X], golden truth [✓] / supervised
Labeled A observations [✓], states [✓], golden truth [X] / not only supervised
Labeled B observations [✓], states [✓], golden truth [✓] / supervised
Prediction Architecture Required Data Sequences
Maximum a Posteriori A observations [✓], states [X], golden truth [X] / not only supervised
Viterbi A observations [✓], states [X], golden truth [X] / not only supervised
Forward B observations [✓], states [X], golden truth [✓] / supervised
Math Formula B observations [✓], states [✓], golden truth [✓] / supervised

Categorization of HMM Models with emphasis on Classification Tasks

(number) *Name* [on what type of sequential data it operates on] [difficulty] [approach A/B] [is it implemented] [references relevant to NLP]

(1) Classic HMM [any-based] [★] [A&B] [yes] [Rabiner]: see Section 3 of paper.

(2) State-emission HMM [mainly sentence-based] [★] [A&B] [yes] [Manning et al.] [Mathew]: see Section 3 of paper.

  • For example, imagine a crazy soft drink machine that prefers to output a certain drink and after each output changes state randomly.
  • states: sentence polarity labels, observations: sentence polarity labels

(3) General Mixture HMM [mainly sentence-based] [★] [A&B] [yes]: see Section 3 of paper.

  • states: constant document labels, observations: sentence polarity labels

(4) Stationary HMM [any-based] [★★★] [A&B] [semi] [Liu et al.] Iglesias et al.] [Yi] [Yi et al.]: see Section 3 of my paper.

  • Liu et al. - states: 4 states, tf-idf etc., observations: the values of the features (there is no sequence)

(5) Multivariate HMM [any-based] [★★★★] [A&B] [semi] [[Tune et al.] [Li et al.]: see Section 3 of paper.

  • Can be either Continuous (Method A, new formula would need to be invented for Method B) or Discrete (Method A/B).

  • Note: passing more than 100 words/distributions makes it crash, so it is pointless.

  • Example of Discrete: pass multiple words on each sentence (useless since too many variables and each Discrete distribution is independent from other words); a very good idea would be to pass both the word and its Part-of-Speech tag on a Spyros HMM; another idea would be to pass the 1st most relevant word on the 1st Discrete distribution, the 2nd most relevant on the 2nd and so on.

  • Example of Continuous: pass the tfidf value of the entire sentence (but how would we do that).

(6) High-order HMM [any-based] [★★★★★] [B] [yes] [Quan et al.] [Preez] [Kochanski] [Ching et al.] [Lee et al.] : Lifts a major restriction of HMMs and allows the states to also depend on the observation/state preceding the directly previous one. The implementation might work through Kochanski's transformation, Preez's transformation and miniHMM's dummy states and initial probabilities.

  • Quan et al. (+ Multivariate) - states: emotions, observations: custom encoding with 150 possible observations per state

(7) Clustering then HMM [any-based] [★★★★★] [B] [yes] [Kang et al.]

  • Kang et al. - states: clusters, obervations: words - We can either (1) form some sort of clustering on the SVD term-to-term dot matrix like Kang did, (2) form a clustering of the documents and then predict on each word, (3) form a clustering on word2vec (better not).
  • Other ideas - Combine (5) and (6) or something, states: labels, observations: clusters

(8) Artificial labels then HMM - Kardakis-HMM [any-based] [★★★] [B] [yes] [Me] : We want to perform classification but in text-related tasks we have a single label instead of a sequence of labels. In order to tackle this, take any Machine Learning classifier or tool (e.g. Naive Bayes), train it on the data and then have it predict the class label of every single word - we create artificial labels. Even though we are performing predictions on the data it was trained on, the resulting labels are VERY noisy since we are doing it on a single-word basis (e.g. get me the sentiment label of "movie" and get me the label of "review"). Then, the proposed HMM is trained on the artificial data and actually performs better than the original classifiers. For even higher performance, have 10 state-of-the-art classifiers and tools (LSTM, lexicon etc.) predict the (artificial) label of each word and perform a majority vote.

  • S. Kardakis - states: class labels, obervations: words

(9) Autoregressive HMM [ ] [ ] [ ] [ ] Ephraim et al.] : Tackles the problem of capturing correlations between observed variables that are far away from each other in terms of time steps.

(10) Input-output HMM [ ] [ ] [ ] [ ] [Bengio et al.] : Introduces observed variables that can influence either the hidden state variables, output variables or both. This technique can be particularly helpful in the domain of supervised learning for sequential data. However, I think it requires data/information in a specific "previous-current" form.

(11) Bidirectional HMM [ ] [ ] [ ] [ ] [Zacher et al.] [Arani et al.]

(12) Hierarchical HMM [ ] [ ] [ ] [ ] [Fine et al.]

(13) The remaining HMM models that alter assumptions about time etc. (e.g. Semi-Markov)

Coding Notes - Overall

  • On architecture B's 'formula' algorithm, to get log probabilities we have to normalize the result of the math formula. Kang et al. divide the multiplied probability score by the length of the sequence.
  • On architecture B's 'formula' algorithm, the ideal magic smoothing factor is around half of the smallest possible probability of observations, code is on "print_probability_parameters()".
  • Sample_weights parameter only works when n_jobs=1

Coding Notes - Pomegranate

  • The states in Pomegranate, represented as strings si are mapped to the input state labels in an alphabetical order, e.g. ['bbb', 'aaa', 'ccc'] means: s0='aaa', s1='bbb', s2='ccc', s3='None-start', s4='None-end'.
  ...
  normal_states = list(sorted(normal_states, key=attrgetter('name')))
  silent_states = list(sorted(silent_states, key=attrgetter('name')))
  ...
  • For anything that isn't algorithm='labeled': just use the state_names parameter to avoid some bugs, e.g.1. state_labels:["bbb", "aaa", "bbb"] then state_names=["aaa", "bbb"]. EVEN better, convert all state labels to s0, s1, ... sn, e.g.2. state_labels:["s2", "s1", "s0"] then state_names=["s0", "s1", "s2"].

  • For algorithm='labeled': extremely bad implementation riddled with bugs, see .py. The '_labeled_summarize' function is a mess.
    Observation Probabilities Bug: Instead of taking into consideration the current observation, it takes the previous observation for each time step. In detail: (1) if the label names are "s0", "s1" etc. even if 'state_names' parameter is not used, the bug happens, (2) if the label names are anything else and 'state_names' is used to assist the function, the bug happens, (3) however if the label names are anything else and 'state_names' is not used, the bug doesn't happen. The bug also doesn't happen in the opposite scenario, where "s0" etc. are used, but irelevant 'state_names' are also given.
    State transition Probabilities Bug: All the transitions end up being the exact same (the initial distribution). This bug happens when the previous one doesn't; it happens on (3) and doesn't on (1) and (2).
    Initial transition Probabilies Bug: All the transitions end up being the exact same (the initial distribution).
    At first I thought our best bet was (3), but we stil have the last 2 bugs. Just don't use "labeled" and use "baum-welch" with 'max_iterations' set to 1.

  ...
  for i in range(label_ndarray.shape[0]):
      if isinstance(label_ndarray[i], State):
          labels[i] = self.states.index(label_ndarray[i])
      else:
          labels[i] = self.state_name_mapping[label_ndarray[i]]
  ...
  • Conclusion: Labeled is actually not implemented. Whatever is executed inside viterbi and baum-welch is missing. If I make a call to _viterbi, labeled gets fixed. Even though in the docs it explicitly says that the labels parameter only work for "lableed", all algorithms take them into consideration (at the start of from_samples) and work perfectly well.

  • The emission-pseudocount does not work for algorithm='labeled'.

  • The emission-pseudocount is not added to states that only occur at the start of sequences, e.g. observations:[["234", "123", "234"], ["651", "1"]] and state_labels:[["s234", "s123", "s234"], ["s651", "s1"]] means that "state651" will have probability of 1 for 651 and 0 for everything else.

Coding Notes - Matlab

Installing the Matlab Engine API for Python
Path of execution: /usr/local/MATLAB/R2018b/extern/engines/python
Path of installation: ~/anaconda3/envs/matlabCompatiblePython/lib/python3.6/site-packages
Successful setup with: sudo /home/s/anaconda3/envs/matlabCompatiblePython/bin/python setup.py install
(no need for --prefix="installdir" parameter)

Passing data to Matlab
Passing arrays to Matlab
Call Matlab Functions from Python Function to be used

Coding Notes - HOHMM

  • The states in HOHMM are mapped to the input state labels in the order that it encountered them.

  • I suspect that the reason it doesn't use dummy states because it operates in the following way: for order n ignore the first n elements of the sequence and instead use them for the initial probability matrix, e.g. for 3rd-order ignore the first 3 "pos"/"neg" or whatever they are and built the following initial matrix: ['pos', 'neg'] ['posneg, 'pospos' ...] ['posposneg', 'pospospos' ...], the traditional matrix is on the 3rd list while the other 2 are reserved for the elements it ignores at the start of the sequqence.

  • Documentation

Experiment Notes

(all) Utilizing n-grams on the observations heavily increases accuracy, see console logs and graphs.
(all) Using a higher n-gram means more new unseen transitions, see graphs.
(all) Adding the smoothing factor doesn't affect performance that much, at least on Pomegranate.
(all) njobs = -1 (enabling parallelization with batches) has a peculiar effect of sometimes increasing or decreasing accuracy (by around 20% in both). This might be happening because the model does not overfit/train on the training data which leads to better performance on the test data. (all) Training on the "neg" subset of the IMDb dataset on Pomegranate completely bugs when using emission_pseudocount or higher-order represented as first-order; possibly semi-supervised learning gets enabled; sequence are slightly longer than "pos" ones. Temporary fix is to shorten the "neg" sequences by 1.
(all) Windows Operating Systems require __name__ == "__main__": to run multiprocessing. Windows Operating Systems also don't support SenticNet package.
(1) Can be run at any time.
(2) Experimental Results of State-emission HMM.txt
(3) Can be run at any time.
(4) It is possible to have HMMs that don't utilize sequential data at all.
(5) Didn't work
(6) Experimental Results of State-emission HMM.txt. Also, as the order increases we have transitions that have too low of a probability; the problematic count will help with this. As the order increases we also have more empty sequences; ['neg', 'neg-pos', 'neg-pos-pos'] is an empty sequence since HOHMM doesn't use dummy states.
(7) Idea 2 didn't work, idea 3 didn't work Experimental Results of Clustering Solver HMM.txt.
(8) A HMM can increase the performance of any bag-of-words-based Machine Learning classifier or tool by utilizing the sequential information of text. This is done by producing artificial labels. Experimental Results of State-emission HMM.txt and Experimental Results on Big Dataset.txt

Random Notes

  • A weird implementation where all words are one-hot encoded and used as observations; since it is built like a classifier, clueless people run it with (n_sequences, seq_length=1) by throwing a tf-idf matrix as the sequence of length 1 and it kind of works. states: Part-of-Speech tags, observations: one-hot encoded words.

  • A Multivariate HMM on 6 Emotions with 2 Discrete Distributions, one for the words and one for Part-of-Speech tags; or one for the words and one for the emotion intensity. This won't work across entire documents but only on sentences. states: Artificial Labels, observations: 2 sets of Discrete Distributions.

To Do - Future Work

  • Take a look at the weighting function from Quan et al.
  • Use Stanford Sentiment Treebank and its tree structure to split it into sequences (alternative source) leading to a HMM with 2-3 sentences per document.
  • Attempt different Ensembles (see Arani paper) instead of average, e.g. multiply.
  • Attempt different smoothing techniques from Thede et al.
  • Implement 'text_instead_of_sequences' on build().
  • Make weights on the Ensemble be relative to the exact accuracy of each base classifier.
  • Implement BIRCH before the Spherical k-Means for a better initialization

Citation

Please cite this work using the first paper below:

@article{kardakis2021sentiment,
  title={Sentiment analysis using novel and interpretable architectures of Hidden Markov Models},
  author={Perikos, Isidoros and Kardakis, Spyridon and Hatzilygeroudis, Ioannis},
  journal={Knowledge-Based Systems},
  volume={229},
  pages={107332},
  year={2021},
  publisher={Elsevier}
}

@inproceedings{kardakis2019hidden,
  title={Hidden Markov Models for Sentiment Analysis in Social Media},
  author={Perikos, Isidoros and Kardakis, Spyridon and Paraskevas, Michael and Hatzilygeroudis, Ioannis},
  booktitle={2019 IEEE International Conference on Big Data, Cloud Computing, Data Science \& Engineering (BCD)},
  pages={130--135},
  year={2019},
  organization={IEEE}
}

Credits

Spyridon Kardakis