This chapter uses unsupervised machine learning to extract latent topics from documents. These themes can produce detailed insights into a large body of documents in an automated way. They are very useful to understand the haystack itself and permit the concise tagging of documents because using the degree of association of topics and documents.
Topic models permit the extraction of sophisticated, interpretable text features that can be used in various ways to extract trading signals from large collections of documents. They speed up the review of documents, help identify and cluster similar documents, and can be annotated as a basis for predictive modeling. Applications include the identification of key themes in company disclosures or earnings call transcripts, customer reviews or contracts, annotated using, e.g., sentiment analysis or direct labeling with subsequent asset returns. More specifically, In this chapter, we will cover:
- What topic modeling achieves, why it matters and how it has evolved
- How Latent Semantic Indexing (LSI) reduces the dimensionality of the DTM
- How probabilistic Latent Semantic Analysis (pLSA) uses a generative model to extract topics
- How Latent Dirichlet Allocation (LDA) refines pLSA and why it is the most popular topic model
- How to visualize and evaluate topic modeling results
- How to implement LDA using sklearn and gensim
- How to apply topic modeling to collections of earnings calls and Yelp business reviews
Initial attempts by topic models to improve on the vector space model (developed in the mid-1970s) applied linear algebra to reduce the dimensionality of the document-term matrix. This approach is similar to the algorithm we discussed as principal component analysis in chapter 12 on unsupervised learning. While effective, it is difficult to evaluate the results of these models absent a benchmark model.
In response, probabilistic models emerged that assume an explicit document generation process and provide algorithms to reverse engineer this process and recover the underlying topics.
The below table highlights key milestones in the model evolution that we will address in more detail in the following sections.
Model | Year | Description |
---|---|---|
Latent Semantic Indexing (LSI) | 1988 | Capture semantic document-term relationship by reducing the dimensionality of the word space |
Probabilistic Latent Semantic Analysis (pLSA) | 1999 | Reverse-engineers a generative a process that assumes words generate a topic and documents as a mix of topics |
Latent Dirichlet Allocation (LDA) | 2003 | Adds a generative process for documents: three-level hierarchical, Bayesian model |
Latent Semantic Analysis set out to improve the results of queries that omitted relevant documents containing synonyms of query terms. Its aimed to model the relationships between documents and terms to be able to predict that a term should be associated with a document, even though, because of variability in word use, no such association was observed.
LSI uses linear algebra to find a given number k of latent topics by decomposing the DTM. More specifically, it uses the Singular Value Decomposition (SVD) to find the best lower-rank DTM approximation using k singular values & vectors. In other words, LSI is an application of the unsupervised learning techniques of dimensionality reduction we encountered in chapter 12 (with some additional detail). The authors experimented with hierarchical clustering but found it too restrictive to explicitly model the document-topic and topic-term relationships or capture associations of documents or terms with several topics.
The notebook latent_semantic_indexing demonstrates how to apply LSI to the BBC new articles we used in the last chapter.
Probabilistic Latent Semantic Analysis (pLSA) takes a statistical perspective on LSA and creates a generative model to address the lack of theoretical underpinnings of LSA.
pLSA explicitly models the probability each co-occurrence of documents d and words w described by the DTM as a mixture of conditionally independent multinomial distributions that involve topics t. The number of topics is a hyperparameter chosen prior to training and is not learned from the data.
The notebook probabilistic_latent_analysis demonstrates how to apply LSI to the BBC new articles we used in the last chapter.
- Relation between PLSA and NMF and Implications, Gaussier, Goutte, 2005
Latent Dirichlet Allocation extends pLSA by adding a generative process for topics. It is the most popular topic model because it tends to produce meaningful topics that humans, can relate to, can assign topics to new documents, and is extensible. Variants of LDA models can include metadata like authors, or include image data, or learn hierarchical topics.
LDA is a hierarchical Bayesian model that assumes topics are probability distributions over words, and documents are distributions over topics. More specifically, the model assumes that topics follow a sparse Dirichlet distribution, which implies that documents cover only a small set of topics, and topics use only a small set of words frequently.
The Dirichlet distribution produces probability vectors that can be used with discrete distributions. That is, it randomly generates a given number of values that are positive and sum to one. It has a parameter 𝜶 of positive real value that controls the concentration of the probabilities.
The notebook dirichlet_distribution contains a simulation so you can experiment with different parameter values.
Unsupervised topic models do not provide a guarantee that the result will be meaningful or interpretable, and there is no objective metric to assess the result as in supervised learning. Human topic evaluation is considered the ‘gold standard’ but is potentially expensive and not readily available at scale.
Two options to evaluate results more objectively include perplexity that evaluates the model on unseen documents and topic coherence metrics that aim to evaluate the semantic quality of the uncovered patterns.
- Exploring Topic Coherence over many models and many topics
- Paper on various Methods
- Blog Post - Overview
The notebook lda_with_sklearn shows how to apply LDA to the BBC news articles. We use sklearn.decomposition.LatentDirichletAllocation to train an LDA model with five topics.
The notebook lda_with_sklearn
contains the code examples used in this section.
Topic visualization facilitates the evaluation of topic quality using human judgment. pyLDAvis is a python port of LDAvis, developed in R and D3.js. We will introduce the key concepts; each LDA implementation notebook contains examples.
pyLDAvis displays the global relationships among topics while also facilitating their semantic evaluation by inspecting the terms most closely associated with each individual topic and, inversely, the topics associated with each term. It also addresses the challenge that terms that are frequent in a corpus tend to dominate the multinomial distribution over words that define a topic. LDAVis introduces the relevance r of term w to topic t to produce a flexible ranking of key terms using a weight parameter 0<=ƛ<=1.
Gensim is a specialized NLP library with a fast LDA implementation and many additional features. We will also use it in the next chapter on word vectors (see the notebook lda_with_gensim for details.
In Chapter 3 on Alternative Data, we learned how to scrape earnings call data from the SeekingAlpha site. In this section, we will illustrate topic modeling on the harvested data.
We're using a (small) sample of some 1,000 earnings call transcripts from the second half of 2018. For a practical application, a larger dataset would be highly desirable.
The directory earnings_calls contains several files with examples mentioned below. See the notebook lda_earnings_calls for details on loading, exploring, and preprocessing the data, as well as training and evaluating individual models, and the run_experiments.py) file for the experiments evaluated in the notebook.
The notebook lda_yelp_reviews contains an example of LDA applied to six million business review on yelp. Reviews are a more uniform in length than the statements extracted from the earnings call transcript. After cleaning as above, the 10th and 90th percentile range from 14 to 90 tokens.
-
Applications of Topic Models, Jordan Boyd-Graber, Yuening Hu, David Minmo, 2017
-
High Quality Topic Extraction from Business News Explains Abnormal Financial Market Volatility
-
What are You Saying? Using Topic to Detect Financial Misreporting