A TensorFlow implementation of word2vec applied to the Stanford Encyclopedia of Philosophy. The implementation supports both CBOW and Skip-gram.
For more background, please have a look at these papers:
- Distributed Representations of Words and Phrases and their Compositionality
- word2vec Parameter Learning Explained
- Explained: Deriving Mikolov et al.’s Negative-Sampling Word-Embedding Method
After training, the model returns some interesting results; see the interesting results below.
Evaluating `hume - empiricist + rationalist`:
- descartes
- malebranche
- spinoza
- hobbes
- herder
Similar words to `death`:
- untimely
- ravages
- grief
- torment

Similar words to `god`:
- divine
- De Providentia
- christ
- Hesiod

Similar words to `love`:
- friendship
- affection
- christ
- reverence

Similar words to `life`:
- career
- live
- lifetime
- community
- society

Similar words to `brain`:
- neurological
- senile
- nerve
- nervous
Evaluating `hume - empiricist + rationalist`:
- descartes
- malebranche
- spinoza
- hobbes
- herder

Evaluating `ethics - rational`:
- hiroshima

Evaluating `ethic - reason`:
- inegalitarian
- anti-naturalist
- austere

Evaluating `moral - rational`:
- commonsense

Evaluating `life - death + love`:
- self-positing
- friendship
- care
- harmony

Evaluating `death + choice`:
- regret
- agony
- misfortune
- impending

Evaluating `god + human`:
- divine
- inviolable
- yahweh
- god-like
- man

Evaluating `god + religion`:
- amida
- torah
- scripture
- buddha
- sokushinbutsu

Evaluating `politic + moral`:
- rights-oriented
- normative
- ethics
- integrity
The package is composed of three main objects:
- `PlatoData`: an object that crawls the data from the Stanford Encyclopedia of Philosophy.
- `VocabBuilder`: an object that builds the vocabulary based on the crawled data.
- `Philo2Vec`: the model that computes the continuous distributed representations of words.
The dependencies used for this module can be easily installed with pip:
> pip install -r requirements.txt
The `Philo2Vec` object accepts the following parameters:
- min_frequency: the minimum frequency a word must have to be used in the model.
- size: the vocabulary size; the model keeps the `size` most frequent words.
- optimizer: an instance of a TensorFlow `Optimizer`, such as `GradientDescentOptimizer`, `AdagradOptimizer`, or `MomentumOptimizer`.
- model: the model used to create the vectorized representation; possible values: `CBOW`, `SKIP_GRAM`.
- loss_fct: the loss function used to calculate the error; possible values: `SOFTMAX`, `NCE`.
- embedding_size: the dimensionality of the word embeddings.
- neg_sample_size: the number of negative samples for each positive sample.
- num_skips: the number of skips for a `SKIP_GRAM` model.
- context_window: the window size used to create the context around each target word, i.e. [ window target window ] (see the pair-generation sketch below).
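To make the `context_window`, `CBOW`, and `SKIP_GRAM` options concrete, here is a minimal, illustrative sketch of how a window around each target word is typically turned into training pairs. The helper names and the example sentence are hypothetical and are not part of the project's API.

```python
# Illustrative only: turning a token stream into (context, target) training pairs.
# These helpers are hypothetical, not part of the project's API.

def cbow_pairs(tokens, window=2):
    """CBOW: the surrounding words predict the center word."""
    pairs = []
    for i, target in enumerate(tokens):
        context = tokens[max(0, i - window):i] + tokens[i + 1:i + 1 + window]
        pairs.append((context, target))
    return pairs

def skip_gram_pairs(tokens, window=2):
    """Skip-gram: the center word predicts each surrounding word."""
    pairs = []
    for i, target in enumerate(tokens):
        for context_word in tokens[max(0, i - window):i] + tokens[i + 1:i + 1 + window]:
            pairs.append((target, context_word))
    return pairs

tokens = ['hume', 'was', 'an', 'empiricist', 'philosopher']
print(cbow_pairs(tokens))       # e.g. (['hume', 'an', 'empiricist'], 'was'), ...
print(skip_gram_pairs(tokens))  # e.g. ('was', 'hume'), ('was', 'an'), ...
```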
Instantiating and training a CBOW model:

```python
# Philo2Vec, VocabBuilder, StemmingLookup and get_data are provided by this project.
params = {
    'model': Philo2Vec.CBOW,
    'loss_fct': Philo2Vec.NCE,
    'context_window': 5,
}

x_train = get_data()

# Validation words are stemmed, since the vocabulary is built from stemmed tokens.
validation_words = ['kant', 'descartes', 'human', 'natural']
x_validation = [StemmingLookup.stem(w) for w in validation_words]

vb = VocabBuilder(x_train, min_frequency=5)
pv = Philo2Vec(vb, **params)
pv.fit(epochs=30, validation_data=x_validation)
```
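For reference, this is roughly what a CBOW objective with NCE looks like when written directly in TensorFlow 1.x. It is a simplified sketch under assumed shapes and hyperparameters, not the actual graph built by `Philo2Vec`.

```python
import tensorflow as tf  # TensorFlow 1.x style API

vocab_size, embedding_size, neg_sample_size, context_window = 10000, 128, 64, 5

# Each training example: 2 * context_window context word ids and one target id.
context = tf.placeholder(tf.int32, shape=[None, 2 * context_window])
target = tf.placeholder(tf.int32, shape=[None, 1])

embeddings = tf.Variable(tf.random_uniform([vocab_size, embedding_size], -1.0, 1.0))
nce_weights = tf.Variable(tf.truncated_normal([vocab_size, embedding_size], stddev=0.05))
nce_biases = tf.Variable(tf.zeros([vocab_size]))

# CBOW: the input vector is the mean of the context word embeddings.
context_vectors = tf.reduce_mean(tf.nn.embedding_lookup(embeddings, context), axis=1)

# NCE scores the true target against neg_sample_size sampled negative words.
loss = tf.reduce_mean(
    tf.nn.nce_loss(weights=nce_weights, biases=nce_biases,
                   labels=target, inputs=context_vectors,
                   num_sampled=neg_sample_size, num_classes=vocab_size))
train_op = tf.train.GradientDescentOptimizer(1.0).minimize(loss)
```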
Instantiating and training a Skip-gram model:

```python
params = {
    'model': Philo2Vec.SKIP_GRAM,
    'loss_fct': Philo2Vec.SOFTMAX,
    'context_window': 2,
    'num_skips': 4,
    'neg_sample_size': 2,
}

x_train = get_data()

validation_words = ['kant', 'descartes', 'human', 'natural']
x_validation = [StemmingLookup.stem(w) for w in validation_words]

vb = VocabBuilder(x_train, min_frequency=5)
pv = Philo2Vec(vb, **params)
pv.fit(epochs=30, validation_data=x_validation)
```
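Analogously, a bare-bones skip-gram graph with a sampled softmax loss might look like the following in TensorFlow 1.x; again, this is a sketch with assumed shapes, not the project's implementation.

```python
import tensorflow as tf  # TensorFlow 1.x style API

vocab_size, embedding_size, neg_sample_size = 10000, 128, 2

center = tf.placeholder(tf.int32, shape=[None])       # center word ids
context = tf.placeholder(tf.int32, shape=[None, 1])   # one context word per example

embeddings = tf.Variable(tf.random_uniform([vocab_size, embedding_size], -1.0, 1.0))
softmax_weights = tf.Variable(tf.truncated_normal([vocab_size, embedding_size], stddev=0.05))
softmax_biases = tf.Variable(tf.zeros([vocab_size]))

# Skip-gram: the center word embedding predicts each of its context words.
center_vectors = tf.nn.embedding_lookup(embeddings, center)

loss = tf.reduce_mean(
    tf.nn.sampled_softmax_loss(weights=softmax_weights, biases=softmax_biases,
                               labels=context, inputs=center_vectors,
                               num_sampled=neg_sample_size, num_classes=vocab_size))
train_op = tf.train.AdagradOptimizer(0.5).minimize(loss)
```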
Since the words are stemmed as part of the preprocessing, some stemming operations are sometimes necessary:

```python
StemmingLookup.stem('religious')        # returns "religi"
StemmingLookup.original_form('religi')  # returns "religion"
```
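For intuition, a lookup of this kind can be approximated with a standard stemmer plus a reverse map from stems back to the original words seen during preprocessing. The sketch below uses NLTK's PorterStemmer and is only an assumption about how such a helper might work, not the project's `StemmingLookup`.

```python
from collections import Counter
from nltk.stem import PorterStemmer

class SimpleStemmingLookup:
    """Hypothetical helper: stems words and remembers their original forms."""

    def __init__(self):
        self._stemmer = PorterStemmer()
        self._originals = {}  # stem -> Counter of original words seen

    def stem(self, word):
        stemmed = self._stemmer.stem(word)
        self._originals.setdefault(stemmed, Counter())[word] += 1
        return stemmed

    def original_form(self, stemmed):
        # Return the most frequent original word for this stem, or the stem itself.
        counts = self._originals.get(stemmed)
        return counts.most_common(1)[0][0] if counts else stemmed
```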
```python
# Get words similar to the given (stemmed) words.
pv.get_similar_words(['rationalist', 'empirist'])

# Evaluate a vector arithmetic operation on the embeddings.
pv.evaluate_operation('moral - rational')

# Plot the given words based on their vector representations.
pv.plot(['hume', 'empiricist', 'descart', 'rationalist'])
```
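Conceptually, `get_similar_words` and `evaluate_operation` boil down to vector arithmetic on the learned embeddings followed by a nearest-neighbour search under cosine similarity. The sketch below illustrates that idea with NumPy; the function and argument names are hypothetical and do not mirror the library's internals.

```python
import numpy as np

def nearest_words(embeddings, word_to_id, id_to_word, expression, top_k=5):
    """Evaluate an expression like 'hume - empiricist + rationalist'."""
    # embeddings: (vocab_size, embedding_size) array, assumed L2-normalized per row.
    vector = np.zeros(embeddings.shape[1])
    sign = 1.0
    for token in expression.split():
        if token == '+':
            sign = 1.0
        elif token == '-':
            sign = -1.0
        else:
            vector += sign * embeddings[word_to_id[token]]
    vector /= (np.linalg.norm(vector) or 1.0)
    scores = embeddings @ vector                 # cosine similarity per word
    best = np.argsort(-scores)[:top_k]
    return [id_to_word[i] for i in best]
```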