Latent Dirichlet Allocation is a generative probabilistic model, which means it models a distribution over the observed data (words) in terms of latent variables (topics).

In this post I will show you how Latent Dirichlet Allocation works under the hood.

Let's say we have some comments (listed below) and we want to cluster them based on the topics they cover.

__Also Read:__
- Latent Dirichlet Allocation for beginners: A HIGH LEVEL OVERVIEW
- Guide to build best LDA MODEL using GENSIM PYTHON

- 'play football on holiday'
- 'I like to watch football on holiday'
- 'la liga match is on this holiday'
- 'the vehicle is being rightly advertised as smooth and silent'
- 'the vehicle has good pickup and is very comfortable to drive'
- 'mileage of this vehicle is around 14kmpl'
- 'the vehicle is a 7 seater MPV and it generates 300 Nm torque with a diesel engine'

Before applying LDA, all documents need to pass through a few pre-processing steps.

**1. Tokenize text**

**2. Assign word IDs to each unique word**

**3. Replace words from documents with word IDs**
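The three pre-processing steps above can be sketched in a few lines of Python (a minimal sketch with plain `str.split` tokenization; variable names are illustrative, and a real pipeline would also lowercase, strip punctuation, and remove stop words):

```python
# Toy corpus: two of the comments from the post
docs = [
    'play football on holiday',
    'the vehicle has good pickup and is very comfortable to drive',
]

# 1. Tokenize text
tokenized = [doc.lower().split() for doc in docs]

# 2. Assign word IDs to each unique word
vocab = {}
for tokens in tokenized:
    for w in tokens:
        if w not in vocab:
            vocab[w] = len(vocab)

# 3. Replace words in each document with their word IDs
corpus = [[vocab[w] for w in tokens] for tokens in tokenized]
```

After this, `corpus` contains only integer IDs, which is the form the count matrices below are built from.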

### **4. Count Matrices Calculation:**

After completing all the pre-processing steps above, we need to calculate the count matrices. To do that, we first **randomly assign a topic to each word/token** in each document, and then calculate a word-topic count matrix and a document-topic count matrix.

### **Word-topic Count Matrix:**

Let's look at the total counts of Topic 1 and Topic 2 across all words. We can calculate them by summing the corresponding columns, which gives 31 for Topic 1 and 32 for Topic 2. This means that across all documents, Topic 1 appears 31 times and Topic 2 appears 32 times.

We will use this count later.

Now we will generate a document-topic
count matrix, where the counts correspond to the number of tokens assigned to
each topic for each document.
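The random initialization and both count matrices can be sketched as follows (a minimal sketch on a toy two-document, two-topic corpus; variable names are illustrative):

```python
import random

random.seed(0)  # fixed seed so the random assignment is reproducible

# Documents as lists of word IDs (toy example), vocabulary size, topic count
corpus = [[0, 1, 2, 3], [4, 1, 2, 3]]
W, T = 5, 2

# Randomly assign a topic to every token in every document
assignments = [[random.randrange(T) for _ in doc] for doc in corpus]

# Word-topic count matrix C^WT (W rows x T columns)
CWT = [[0] * T for _ in range(W)]
# Document-topic count matrix C^DT (one row per document)
CDT = [[0] * T for _ in corpus]

for d, doc in enumerate(corpus):
    for i, w in enumerate(doc):
        t = assignments[d][i]
        CWT[w][t] += 1  # word w seen once more under topic t
        CDT[d][t] += 1  # document d has one more token under topic t
```

Summing a column of `CWT` gives the total count of that topic across all documents (the 31 and 32 mentioned above, for the post's full corpus), and each row of `CDT` sums to that document's token count.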

### **Document-topic Count Matrix:**

Now it's time for the main part of LDA, which is **Collapsed Gibbs Sampling**.

### **Formula for Collapsed Gibbs Sampling:**
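The formula appears as an image in the original post; in standard notation, the collapsed Gibbs sampling update (consistent with the parameters explained below) is:

```latex
P(z_i = j \mid z_{-i}, w_i, d_i) \;\propto\;
\frac{C_{w_i,j}^{WT} + \beta}{\sum_{w=1}^{W} C_{w,j}^{WT} + W\beta}
\;\times\;
\frac{C_{d_i,j}^{DT} + \alpha}{\sum_{t=1}^{T} C_{d_i,t}^{DT} + T\alpha}
```

The first factor asks "how much does topic j like word w?" and the second asks "how much does document d like topic j?".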

Let me explain the parameters of the above formula one by one.

**Parameter Explanation:**

a. C_{w,j}^{WT} = at the start of an iteration, the number of times a word has been assigned to Topic 1 and Topic 2. This comes from the word-topic count matrix. For example, before iteration 1, we can see from the word-topic matrix that across all documents/comments the word "holiday" is assigned to Topic 1 three times and to Topic 2 zero times.

b. β = per-topic word distribution (concentration parameter)

c. W = length of the vocabulary (number of unique tokens/words across all documents)

d. C_{d,j}^{DT} = at the start of an iteration, the number of tokens in a document assigned to Topic 1 and Topic 2. For example, before iteration 1, we can see from the document-topic matrix that document 2 has four tokens assigned to Topic 1 and three tokens assigned to Topic 2.

e. α = per-document topic distribution.

f. T = number of topics (here T = 2).

### **Latent Dirichlet Allocation under the hood (LDA Steps):**
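The sampling equation with these parameters can be sketched as a small function (a sketch with illustrative names; `CWT` and `CDT` are the two count matrices as lists of lists):

```python
def topic_probability(w, d, j, CWT, CDT, alpha, beta, W, T):
    """Unnormalized P(topic = j) for word-ID w in document d,
    following the collapsed Gibbs sampling equation above."""
    # How much does topic j like word w?
    word_part = (CWT[w][j] + beta) / (sum(CWT[v][j] for v in range(W)) + W * beta)
    # How much does document d like topic j?
    doc_part = (CDT[d][j] + alpha) / (sum(CDT[d]) + T * alpha)
    return word_part * doc_part
```

Evaluating this for each topic and comparing the results is exactly the per-word calculation walked through below.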

Gibbs sampling needs to go through many iterations to converge to a good result.

**Probability Calculation:**

Calculate the probability of each word in each document using the Gibbs sampling equation shown above. After calculating, we will have a table like this.

Let's calculate the probability of **Topic 1** for the very first word, **"play"**, in document 1:

*General parameter initialization:* first, let's initialize **α = 1 and β = 0.001**. And we already know the total number of unique words: **(W) = 44**.

*Parameters from the word-topic count matrix:* at the start of the iteration, the number of times the word "play" is assigned to Topic 1 is **C_{w,j}^{WT} = 0**. And from the word-topic matrix we know that across all documents Topic 1 appears 31 times in total.

So the word-topic part of the equation is (0 + 0.001) / (31 + 44 × 0.001).

Please refer to the word-topic count matrix if anything above is confusing.

*Parameters from the document-topic count matrix:* at the start of the iteration, the number of tokens in document 1 assigned to Topic 1 is **C_{d,j}^{DT} = 2**. And the total number of tokens in document 1 assigned to Topic 1 and Topic 2 together is 4 ('play football on holiday' has four tokens), so the document-topic part is (2 + 1) / (4 + 2 × 1).

Please refer to the document-topic count matrix if anything above is confusing.

And finally, the total number of topics: **(T) = 2**.

So let's recall the sampling equation again:
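Plugging the values above into the collapsed Gibbs sampling formula (the original post shows this step as an image; document 1's four tokens are assumed to split 2/2 between the topics, per the document-topic matrix):

```latex
P(\text{Topic 1} \mid \text{"play"}, d_1)
\;\propto\;
\frac{0 + 0.001}{31 + 44 \times 0.001}
\times
\frac{2 + 1}{4 + 2 \times 1}
= \frac{0.001}{31.044} \times \frac{3}{6}
\approx 1.61 \times 10^{-5}
```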

Similarly, you can calculate the Topic 2 probability for the word "play" in document 1.

Now let's see the topic probabilities for all tokens (calculated in the same way as shown above).

In this table, the last two columns are the output of the first stage (probability calculation).

### **Final Topic Calculation:**

This stage is quite easy. Based on the higher of the two topic probabilities for a word, LDA assigns the final topic to that particular word in that particular document.
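This argmax step can be sketched as follows (the probability values here are purely illustrative, not taken from the post's table):

```python
# Hypothetical (Topic 1, Topic 2) probabilities for two tokens from document 1
probs = {
    'play':    (1.61e-5, 2.05e-5),
    'holiday': (9.70e-5, 1.50e-5),
}

# Final topic for each word = the topic with the higher probability
final_topic = {word: 1 + p.index(max(p)) for word, p in probs.items()}
```

Here `'play'` would flip to Topic 2 while `'holiday'` stays with Topic 1, which is how a word's topic can change between iterations.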

Here you can see that the topic of some words has changed after iteration one.

This is it. The last column shows the final topic of each word in each document at the end of one iteration. With more iterations, this output is fed into the next iteration as input, and the probabilities are again computed with the **collapsed Gibbs sampling equation** I have already shown. It goes on like this for many more iterations.
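Putting the pieces together, a full multi-iteration loop looks roughly like this (a sketch on a toy corpus; note that standard collapsed Gibbs sampling removes the current token's own assignment from the counts before sampling, a detail the step-by-step walkthrough above glosses over, and samples the new topic rather than taking the argmax):

```python
import random

random.seed(1)

corpus = [[0, 1, 2, 3], [4, 1, 2, 3]]  # documents as word-ID lists (toy data)
W, T, alpha, beta = 5, 2, 1.0, 0.001

# Random initialization of topic assignments and count matrices
z = [[random.randrange(T) for _ in doc] for doc in corpus]
CWT = [[0] * T for _ in range(W)]
CDT = [[0] * T for _ in corpus]
for d, doc in enumerate(corpus):
    for i, w in enumerate(doc):
        CWT[w][z[d][i]] += 1
        CDT[d][z[d][i]] += 1

for iteration in range(50):
    for d, doc in enumerate(corpus):
        for i, w in enumerate(doc):
            # Remove this token's current assignment from the counts
            t_old = z[d][i]
            CWT[w][t_old] -= 1
            CDT[d][t_old] -= 1
            # Unnormalized conditional probability of each topic
            weights = [
                (CWT[w][j] + beta) / (sum(CWT[v][j] for v in range(W)) + W * beta)
                * (CDT[d][j] + alpha) / (sum(CDT[d]) + T * alpha)
                for j in range(T)
            ]
            # Sample the new topic and put the token back into the counts
            t_new = random.choices(range(T), weights=weights)[0]
            z[d][i] = t_new
            CWT[w][t_new] += 1
            CDT[d][t_new] += 1
```

After enough sweeps, the assignments stabilize and the count matrices can be normalized to estimate the topic-word and document-topic distributions.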


**Conclusion:**

In this post I have discussed:

- What Latent Dirichlet Allocation is
- How Latent Dirichlet Allocation works (LDA under the hood), from scratch
- The steps of Latent Dirichlet Allocation
- A theoretical explanation of Latent Dirichlet Allocation

If you have any question regarding this topic, please let me know in the comment section; I will try my best to answer.