15 Aug 2019

Guide to Build Best LDA model using Gensim Python



In recent years, the amount of data (mostly unstructured) has been growing rapidly, and it is difficult to extract relevant information from it. In text mining (a field of Natural Language Processing), topic modeling is a technique for extracting the hidden topics from a large amount of text.

There are many algorithms for topic modeling, and Latent Dirichlet Allocation (LDA) is one of the most popular. In previous tutorials I explained how LDA works. In this tutorial I am going to implement LDA with Python’s Gensim package.

Prerequisites to implement LDA with Gensim Python

You need two additional resources to follow this tutorial:
  • NLTK stopwords: Gensim has its own stopword list, but we will also use the NLTK stopwords to enlarge it.
  • spaCy model: We will use a spaCy model for lemmatization only.

Run the following commands in cmd to download and install spaCy and the (small) English model.

pip install -U spacy
python -m spacy download en_core_web_sm


## Download the NLTK stopwords in case you don't have them already
import nltk
nltk.download('stopwords')

Import packages for LDA


import gensim, spacy
import gensim.corpora as corpora
from nltk.corpus import stopwords

import pandas as pd
import re
from tqdm import tqdm
import time


import pyLDAvis
import pyLDAvis.gensim  # don't skip this (newer pyLDAvis releases name this module pyLDAvis.gensim_models)
# import matplotlib.pyplot as plt
# %matplotlib inline

## Setup nlp for spacy
nlp = spacy.load("en_core_web_sm")

# Load NLTK stopwords
stop_words = stopwords.words('english')
# Add some extra words in it if required
stop_words.extend(['from', 'subject', 'use','pron'])


Newsgroup Data for LDA Topic Modeling

We will be using the 20-Newsgroups dataset for this tutorial. The dataset contains about 11k newsgroup posts. It is available as newsgroups.json.

# Import Dataset
df = pd.read_json('https://raw.githubusercontent.com/selva86/datasets/master/newsgroups.json')
## Or you can download the data from that link and load it by using same function
# View data
df.head()

Cleaning and Pre-processing for LDA

Cleaning and pre-processing is a common step in any kind of text analysis. There are many ways to do it, depending on your data and the type of analysis. For our data I have divided this stage into the following steps:

  • Remove emails: email addresses are not important for our analysis
  • Remove newline characters and extra spaces
  • Remove quotation marks
  • Lemmatization: using spaCy
  • Tokenization: split the text into words (including Gensim stopword removal)
  • Stopword removal: final stopword removal using the NLTK stopwords

# Convert into list
data = df.content.values.tolist()

### Cleaning data

# Remove Emails (raw strings avoid invalid-escape warnings in newer Python versions)
data = [re.sub(r'\S*@\S*\s?', '', sent) for sent in data]
# Remove new line characters and extra space
data = [re.sub(r'\s+', ' ', sent) for sent in data]
# Remove single quotes
data = [re.sub(r"'", "", sent) for sent in data]

### Lemmatization
data_lemma = []
for txt in tqdm(data):
    lis = []
    doc = nlp(txt)
    for token in doc:
        lis.append(token.lemma_)
    data_lemma.append(' '.join(lis))

### Tokenization and gensim stopword removal

# You can look for all gensim stopwords by running -> 'gensim.parsing.preprocessing.STOPWORDS'

# Function to tokenize
# Also drops words of 3 characters or fewer (you can change the threshold)
def tokenization_with_gen_stop(text):
    result=[]
    for token in gensim.utils.simple_preprocess(text) :
        if token not in gensim.parsing.preprocessing.STOPWORDS and len(token) > 3:
            result.append(token)

    return result

## Apply tokenization function
data_words = []
for txt in tqdm(data_lemma):
    data_words.append(tokenization_with_gen_stop(txt))

### NLTK Stopword removal (extra stopwords)

data_words_clean = []
for word in tqdm(data_words):
    wrd = []
    for w in word:
        if w not in stop_words:
            wrd.append(w)
    data_words_clean.append(wrd)
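
To make the effect of the three cleaning regexes concrete, here is a minimal sketch that applies them to a single made-up string. The helper name clean_text and the sample text are assumptions for illustration; they are not part of the dataset or of the code above.

```python
import re

def clean_text(text):
    """Apply the three cleaning steps from this section to one string."""
    text = re.sub(r'\S*@\S*\s?', '', text)  # drop anything that looks like an email address
    text = re.sub(r'\s+', ' ', text)        # collapse newlines and repeated spaces
    text = re.sub(r"'", "", text)           # remove single quotes
    return text

# Hypothetical sample, loosely shaped like a newsgroup header
sample = "From: someone@example.com\nSubject: Don't   panic\n"
cleaned = clean_text(sample)
print(repr(cleaned))
```

Note that removing the quote turns "Don't" into "Dont", which is the same behaviour as the list-based version above.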


Prepare Dictionary and Corpus for Topic Modeling

Like any other algorithm, LDA can only work with numeric values, so we need to convert the cleaned text into numbers.

In this tutorial we will convert the text (cleaned and tokenized words) into a bag of words. You can think of it as a dictionary for each document, where the key is the word and the value is the number of times that word occurs in the document.

The two main inputs of the LDA topic model are:

  • Dictionary: a unique id for each unique word
  • Corpus: for each document, the number of times each word appears

# Create Dictionary
dictionary = corpora.Dictionary(data_words_clean)
# Print dictionary
print(dictionary.token2id)

## Create Term document frequency (corpus)
# Term Document Frequency
corpus = [dictionary.doc2bow(text) for text in data_words_clean]
# Print corpus for first document
print(corpus[0])


Dictionary:   
{'able': 0, 'add': 1, 'addison_reed': 2, 'afloat': 3, 'alejandro_de': 4, 'allow': 5 .....

For example id for word ‘able’ is 0, id for word ‘add’ is 1 and so on.

Corpus:
[(0, 1), (1, 2), (2, 1), (3, 1), (4, 1), (5, 1), (6, 1), (7, 1), (8, 1),...

For example, (0, 1) above means that in the first document, word id 0 (word: ‘able’) occurs once. Likewise, word id 1 (word: ‘add’) occurs twice, and so on.
If the word ids make the corpus hard to read, you can print a human-readable form of it with the following script.

# Easy to observe format of corpus
[[(dictionary[id], freq) for id, freq in cp] for cp in corpus[:1]]
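
If it helps to see what the dictionary and corpus contain, the following plain-Python sketch builds the same kind of structures by hand for two toy documents. It only illustrates the idea and is not Gensim's implementation; the names token2id and to_bow and the toy documents are assumptions made for this example.

```python
from collections import Counter

# Two toy documents, already cleaned and tokenized (hypothetical data)
docs = [["game", "team", "game"], ["team", "player"]]

# "Dictionary": assign each unique word an integer id in order of first appearance
token2id = {}
for doc in docs:
    for tok in doc:
        if tok not in token2id:
            token2id[tok] = len(token2id)

def to_bow(doc):
    """Return a sorted list of (word_id, count) pairs for one document."""
    counts = Counter(token2id[tok] for tok in doc if tok in token2id)
    return sorted(counts.items())

# "Corpus": one bag-of-words list per document
bow_corpus = [to_bow(doc) for doc in docs]
print(token2id)
print(bow_corpus)
```

The counts are per document: "game" appears twice in the first document and not at all in the second.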

Train LDA Topic Model with Gensim

We now have everything required to train the LDA model.
For this tutorial I will pass the following parameters to the LDA model:

  • corpus: corpus data
  • num_topics: for this tutorial, the number of topics is 8
  • id2word: dictionary data
  • random_state: controls the randomness of the training process
  • passes: number of passes through the corpus during training
Apart from these, there are many other parameters you should consider when tuning your LDA model for best performance. They are listed in the Gensim LdaModel documentation.


start_time = time.time()
##
NUM_TOPICS = 8
ldamodel = gensim.models.ldamodel.LdaModel(corpus, num_topics = NUM_TOPICS, id2word=dictionary,random_state=100,passes=10)
# Saving trained model
ldamodel.save('LDA_NYT')
# Loading trained model
ldamodel = gensim.models.ldamodel.LdaModel.load('LDA_NYT')
## Print time taken to train the model
print("--- %s seconds ---" % (time.time() - start_time))


The code above runs as a single-core process, so it takes time. If you want a faster implementation of LDA, use LdaMulticore, which is parallelized for multicore machines using multiprocessing. I tested it on my i7 system and it took about half the time of single-core LDA.

start_time = time.time()
##
## Multicore LDA
NUM_TOPICS = 8
lda_multicore_model = gensim.models.ldamulticore.LdaMulticore(corpus, num_topics = NUM_TOPICS, id2word=dictionary,random_state=100,passes=10)
# Saving trained model
lda_multicore_model.save('LDA_NYT_multicore')
# Loading trained model (load with the same class that was used to train it)
lda_multicore_model = gensim.models.ldamulticore.LdaMulticore.load('LDA_NYT_multicore')
## Print time taken to train the model
print("--- %s seconds ---" % (time.time() - start_time))
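
LdaMulticore also accepts a workers argument that sets the number of worker processes. The sketch below assumes you want to leave one core free for the rest of the system. The helper names suggested_workers and train_multicore are hypothetical, and Gensim is imported inside the function so the helpers can be defined even where Gensim is not installed.

```python
import multiprocessing

def suggested_workers():
    """One worker per available core, minus one, but never fewer than one."""
    return max(1, multiprocessing.cpu_count() - 1)

def train_multicore(corpus, dictionary, num_topics=8):
    """Train LdaMulticore with an explicit worker count (assumed settings)."""
    from gensim.models import LdaMulticore  # lazy import: only needed when training
    return LdaMulticore(
        corpus,
        num_topics=num_topics,
        id2word=dictionary,
        workers=suggested_workers(),
        random_state=100,
        passes=10,
    )

print("workers:", suggested_workers())
```

With the objects built earlier, the call would be train_multicore(corpus, dictionary). How much faster it runs depends on your machine and corpus size.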

View topics of LDA model

The LDA model above is built with 8 topics, where each topic is a combination of keywords and each keyword contributes a certain weight to the topic.

# See the topics
ldamodel.print_topics(-1)


[(0,
  '0.008*"israel" + 0.007*"armenian" + 0.007*"turkish" + 0.007*"israeli" + 0.006*"armenians" + 0.004*"jews" + 0.004*"kill" + 0.003*"armenia" + 0.003*"play" + 0.003*"arab"'),
 (1,
  '0.013*"year" + 0.012*"game" + 0.011*"team" + 0.010*"organization" + 0.008*"write" + 0.008*"good" + 0.007*"article" + 0.007*"player" + 0.007*"think" + 0.007*"university"'),
 (2,
  '0.019*"organization" + 0.013*"line" + 0.010*"posting" + 0.010*"host" + 0.010*"nntp" + 0.010*"university" + 0.009*"write" + 0.009*"lines" + 0.008*"know" + 0.007*"drive"'),
 (3,
  '0.013*"people" + 0.010*"write" + 0.009*"know" + 0.009*"think" + 0.006*"article" + 0.006*"organization" + 0.005*"believe" + 0.005*"like" + 0.005*"thing" + 0.005*"time"'),
 (4,
  '0.010*"write" + 0.010*"organization" + 0.009*"article" + 0.007*"like" + 0.006*"line" + 0.005*"time" + 0.005*"good" + 0.005*"nntp" + 0.005*"posting" + 0.005*"host"'),
 (5,
  '0.011*"space" + 0.007*"information" + 0.006*"government" + 0.006*"chip" + 0.006*"encryption" + 0.005*"clipper" + 0.005*"public" + 0.004*"technology" + 0.004*"nasa" + 0.004*"datum"'),
 (6,
  '0.010*"gordon" + 0.009*"health" + 0.008*"medical" + 0.008*"banks" + 0.007*"doctor" + 0.007*"disease" + 0.007*"patient" + 0.005*"insurance" + 0.004*"treatment" + 0.004*"reply"'),
 (7,
  '0.022*"file" + 0.011*"window" + 0.010*"program" + 0.009*"image" + 0.007*"server" + 0.006*"line" + 0.006*"available" + 0.006*"include" + 0.006*"display" + 0.006*"application"')]



Interpret LDA Gensim result

Topic 0 is represented as '0.008*"israel" + 0.007*"armenian" + 0.007*"turkish" + 0.007*"israeli" + 0.006*"armenians" + 0.004*"jews" + 0.004*"kill" + 0.003*"armenia" + 0.003*"play" + 0.003*"arab"'

It means the top 10 keywords that contribute to this topic are ‘israel’, ‘armenian’, ‘turkish’ and so on, and the weight of ‘israel’ for topic 0 is 0.008.
The weights show how important a keyword is to that topic.
Looking at these keywords, can you guess what this topic could be? You might summarise topic 0 as “country” or “location”.
Similarly, topic 6 represents “healthcare” and topic 7 represents “computer programming/graphics”.
The other topics may not be distinct enough yet to name. This is where tuning the LDA model comes in, which I will cover later in this tutorial.
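
If you want the keywords and weights as data rather than as one long string, a small parser is enough. This is a minimal sketch that assumes the weight*"word" format shown in the printed topics; the function name parse_topic is hypothetical, and the sample string is shortened to the first three terms of topic 0.

```python
import re

# First three terms of topic 0, copied from the printed output
topic_0 = '0.008*"israel" + 0.007*"armenian" + 0.007*"turkish"'

def parse_topic(topic_string):
    """Split a printed topic into a list of (word, weight) pairs."""
    pairs = re.findall(r'([0-9.]+)\*"([^"]+)"', topic_string)
    return [(word, float(weight)) for weight, word in pairs]

keywords = parse_topic(topic_0)
print(keywords)
```

Gensim can also return these pairs directly through ldamodel.show_topic(topic_id), which avoids parsing strings altogether.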

Evaluate LDA model

Like any model, a trained LDA model needs to be evaluated to judge how good it is. There are two convenient measures:

  • Perplexity Score: Lower is better
  • Coherence Score: Higher is better
In my experience, the coherence score is more helpful. Note that Gensim’s log_perplexity returns the per-word likelihood bound rather than the perplexity itself; the perplexity is 2^(-bound).

# Compute Perplexity Score
print('\nPerplexity Score: ', ldamodel.log_perplexity(corpus))

# Compute Coherence Score
coherence_model_lda = gensim.models.CoherenceModel(model=ldamodel, texts=data_words_clean, dictionary=dictionary, coherence='c_v')
coherence_lda = coherence_model_lda.get_coherence()
print('\nCoherence Score: ', coherence_lda)


Perplexity Score:  -8.483322129214947
Coherence Score:  0.5751529939463009
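
The value printed as the perplexity score comes from log_perplexity, which returns a per-word likelihood bound. As a small sketch, the bound can be converted to a perplexity with 2^(-bound); the function name perplexity_from_bound is an assumption for this example, and the number is the one printed above.

```python
def perplexity_from_bound(per_word_bound):
    """Convert a per-word likelihood bound into a perplexity value."""
    return 2 ** (-per_word_bound)

bound = -8.483322129214947  # value printed by ldamodel.log_perplexity(corpus) above
perplexity = perplexity_from_bound(bound)
print("Perplexity:", perplexity)
```

A bound closer to zero gives a lower perplexity, which is why "lower is better" applies to the converted value and not to the raw negative number.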

Visualize topics-keywords of LDA

Now that the LDA model is built, the next step is to examine the produced topics and their associated keywords. Python’s pyLDAvis package is well suited to this: it produces an interactive chart and is designed to work well with Jupyter notebooks.

# To plot at Jupyter notebook
pyLDAvis.enable_notebook()
plot = pyLDAvis.gensim.prepare(ldamodel, corpus, dictionary)
# Save pyLDA plot as html file
pyLDAvis.save_html(plot, 'LDA_NYT.html')
plot



Each bubble in the left-hand plot represents one topic. The larger the bubble, the more prevalent that topic is.

A good topic model will have fairly big, non-overlapping bubbles scattered throughout the chart instead of being clustered in one region.

In my case topic6, topic7 and topic8 are big and non-overlapping, but the rest of the topics overlap each other. You have already seen this when printing the topics: topic6 represents “healthcare”, topic7 represents “computer programming/graphics” and topic0 represents “country/location”. The rest do not describe any particular topic.

Note: One important point I observed is that the topic numbers shown when printing topics and when plotting them may not be the same.

Train LDA with mallet

So far you have seen Gensim’s built-in version of the LDA algorithm. There is another package called Mallet which often gives better-quality topics. The difference between Mallet and Gensim’s standard LDA is that Gensim uses variational Bayes inference, which is faster but less precise than Mallet’s Gibbs sampling.

MALLET is a Java-based package, but Gensim provides a Python wrapper for Latent Dirichlet Allocation via Mallet.

Setup Mallet for LDA:

In order to use Mallet for LDA, you need to download the zip file of the Mallet Java package from http://mallet.cs.umass.edu/dist/mallet-2.0.8.zip
Unzip the archive and place the mallet-2.0.8 folder on any drive. I placed it on my C: drive.

Note: Make sure that Java is installed and that the Java environment variable is set on your system.

import os
## Setup mallet environment change it according to your drive
os.environ.update({'MALLET_HOME':r'C:/mallet-2.0.8'})
## Setup mallet path change it according to your drive
mallet_path = 'C:/mallet-2.0.8/bin/mallet'

start_time = time.time()
##
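## Train LDA with mallet
## Note: the gensim.models.wrappers module was removed in Gensim 4.0, so this call needs Gensim 3.x
ldamallet = gensim.models.wrappers.LdaMallet(mallet_path, corpus=corpus, num_topics=20, id2word=dictionary)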
## Print time taken to train the model
print("--- %s seconds ---" % (time.time() - start_time))


Evaluate Mallet LDA with Gensim LDA

Now it is time to evaluate this model, to see whether Mallet’s LDA gives a better result than Gensim’s built-in LDA.

# Compute Coherence Score for mallet
coherence_model_lda = gensim.models.CoherenceModel(model=ldamallet, texts=data_words_clean, dictionary=dictionary, coherence='c_v')
coherence_lda = coherence_model_lda.get_coherence()
print('\nCoherence Score: ', coherence_lda)


Coherence Score:  0.6151392292265527

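You can clearly observe the difference: the coherence score increased from 0.57 to 0.61. Keep in mind that the Mallet model above was trained with 20 topics while the Gensim model used 8, so the improvement reflects the change in topic number as well as the change in algorithm.

Coherence Score for:
  • Gensim’s built-in LDA: 0.5751529939463009
  • Mallet’s LDA: 0.6151392292265527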

Predict topic and keyword for new document with LDA model

Let’s try to predict the topic and keywords for a new document using our trained LDA model.

To do that, the new document needs to pass through the same data preparation steps.

## Keeping first content of dataframe as our new document
new_doc = df['content'][0]

### Cleaning data

# Remove Emails (raw strings avoid invalid-escape warnings in newer Python versions)
data = re.sub(r'\S*@\S*\s?', '', new_doc)
# Remove new line characters and extra space
data = re.sub(r'\s+', ' ', data)
# Remove single quotes
data = re.sub(r"'", "", data)

### Lemmatization
data_lemma = []
lis = []
doc = nlp(data)
for token in doc:
    lis.append(token.lemma_)
data_lemma.append(' '.join(lis))
    
### Tokenization and gensim stopword removal

# You can look for all gensim stopwords by running -> 'gensim.parsing.preprocessing.STOPWORDS'

# Function to tokenize
# Also drops words of 3 characters or fewer (you can change the threshold)
def tokenization_with_gen_stop(text):
    result=[]
    for token in gensim.utils.simple_preprocess(text) :
        if token not in gensim.parsing.preprocessing.STOPWORDS and len(token) > 3:
            result.append(token)
            
    return result

## Apply tokenization function
data_words = []
for txt in tqdm(data_lemma):
    data_words.append(tokenization_with_gen_stop(txt))
    
### NLTK Stopword removal (extra stopwords)

data_words_clean_new = []
for word in tqdm(data_words):
    for w in word:
        if w not in stop_words:
            data_words_clean_new.append(w)


After cleaning and pre-processing the data, we need to create the corpus for the new document using the main dictionary.

# Create corpus for new document
corpus_new = dictionary.doc2bow(data_words_clean_new)
corpus_new


Finally, we can print the topics for the new document.

print(ldamodel.get_document_topics(corpus_new))

[(2, 0.30002788), (3, 0.12005065), (4, 0.564969)]

LDA output shows that topic 4 has the highest probability assigned, and topic 2 has the second highest probability assigned.

Note: get_document_topics only returns topics whose probability is above a minimum threshold, so only the dominant topics are listed.
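
Rather than reading the dominant topic off the printed list by eye, you can pick it programmatically. A minimal sketch using the output shown above; the variable names are assumptions for this example.

```python
# (topic_id, probability) pairs, copied from the output of get_document_topics above
doc_topics = [(2, 0.30002788), (3, 0.12005065), (4, 0.564969)]

# Sort by probability, highest first
ranked = sorted(doc_topics, key=lambda pair: pair[1], reverse=True)
dominant_topic, dominant_prob = ranked[0]

print("Dominant topic:", dominant_topic, "with probability", dominant_prob)
print("Ranking:", [topic_id for topic_id, _ in ranked])
```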

Now you can find keywords for topic 4.


topic_prob = ldamodel.get_topic_terms(topicid=4)
for topic in topic_prob:
    print('word:',dictionary[topic[0]],'->','probability:',topic[1])


How to find the optimal number of topics for LDA?

At this point you know how to do topic modeling (Latent Dirichlet Allocation) using Gensim’s built-in model and using Mallet. But everywhere you had to specify the number of topics to train the LDA model. Is there any way to find the optimal number of topics for LDA?

I prefer to find the optimal number of topics by building many LDA models with different numbers of topics (k) and picking the one that gives the highest coherence value.

If the same keywords repeat in multiple topics, it’s probably a sign that ‘k’ (the number of topics) is too large.
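
The search over k can be sketched as a small loop. This is one possible way to organise it, not a definitive recipe: pick_best_k and make_gensim_scorer are hypothetical helper names, the candidate values of k in the usage comment are arbitrary, and the training settings simply reuse those from earlier in this tutorial. Gensim is imported inside the scorer so the generic part runs on its own.

```python
def pick_best_k(k_values, score_fn):
    """Score every candidate k and return (best_k, {k: score})."""
    scores = {k: score_fn(k) for k in k_values}
    best_k = max(scores, key=scores.get)  # higher coherence is better
    return best_k, scores

def make_gensim_scorer(corpus, dictionary, texts):
    """Build a score function that trains an LDA model for k topics and returns its c_v coherence."""
    import gensim  # lazy import: only needed when the scorer is actually built

    def score(k):
        model = gensim.models.ldamodel.LdaModel(
            corpus, num_topics=k, id2word=dictionary, random_state=100, passes=10
        )
        coherence_model = gensim.models.CoherenceModel(
            model=model, texts=texts, dictionary=dictionary, coherence='c_v'
        )
        return coherence_model.get_coherence()

    return score

# Usage with the objects built earlier in this tutorial (training several models can take a long time):
# scorer = make_gensim_scorer(corpus, dictionary, data_words_clean)
# best_k, scores = pick_best_k(range(4, 21, 4), scorer)
```

Plotting the scores against k is a useful sanity check, since the highest value is sometimes only marginally better than a smaller, easier to interpret k.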

Tuning LDA model

Like every algorithm, LDA needs to be tuned to get the best result. To tune it you can:

  • Tune parameter values like alpha, eta, gamma_threshold, minimum_phi_value etc., and check the coherence score (remember, higher is better).
  • Keep only words with particular parts of speech (POS), such as nouns, adjectives and adverbs, in the corpus/dictionary.
  • Use stemming instead of, or along with, lemmatization.
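
The part-of-speech idea can be sketched as a simple filter over (lemma, POS) pairs. With spaCy these pairs would come from token.lemma_ and token.pos_; here they are written by hand so the example runs without a model. The function name keep_allowed_pos, the allowed tag set and the toy tokens are all assumptions for illustration.

```python
# Coarse POS tags to keep (nouns, adjectives, adverbs); adjust to your needs
ALLOWED_POS = {"NOUN", "ADJ", "ADV"}

def keep_allowed_pos(tagged_tokens, allowed=ALLOWED_POS):
    """Keep only the lemmas whose POS tag is in the allowed set."""
    return [lemma for lemma, pos in tagged_tokens if pos in allowed]

# With spaCy, the pairs could be built as: [(t.lemma_, t.pos_) for t in nlp(text)]
tagged = [
    ("the", "DET"),
    ("doctor", "NOUN"),
    ("treat", "VERB"),
    ("patient", "NOUN"),
    ("quickly", "ADV"),
]
filtered = keep_allowed_pos(tagged)
print(filtered)
```

After filtering, rebuild the dictionary and corpus and compare coherence scores to see whether the restriction actually helps on your data.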

Conclusion

In this tutorial I have covered:
  • Prerequisites for LDA modeling
  • Packages required for LDA model
  • Cleaning and Pre-processing for LDA
  • Prepare Dictionary and Corpus for Topic Modeling
  • Train LDA Topic Model with Gensim
  • View topics in LDA model
  • Evaluate LDA model
  • Visualize topics-keywords of LDA
  • Train Topic model with Mallet
  • Difference between Gensim LDA with Mallet LDA
  • Predict topic and keyword for new document with LDA model
  • How to find the optimal number of topics for LDA?
  • How to tune LDA model

If you have any questions or suggestions regarding this topic, see you in the comment section. I will try my best to answer.