30 Jan 2019

Latent Dirichlet Allocation explained


Latent Dirichlet Allocation (LDA) is a generative probabilistic model: it describes how the observed documents could have been generated from a set of latent (hidden) topic variables.

In this post I will show you how Latent Dirichlet Allocation works, the inner view.

Let’s say we have some comments (listed below) and we want to cluster them based on the topics they cover.

'play football on holiday'
'I like to watch football on holiday'
'la liga match is on this holiday'
'the vehicle is being rightly advertised as smooth and silent'
'the vehicle has good pickup and is very comfortable to drive'
'mileage of this vehicle is around 14kmpl'
'the vehicle is a 7 seater MPV and it generates 300 Nm torque with a diesel engine'


Before applying LDA, every document needs to pass through a few pre-processing steps:



1. Tokenize text

2. Assign word IDs to each unique word

3. Replace words in the documents with their word IDs

4. Count matrices calculation:

After the pre-processing steps above, we need to calculate the count matrices. To do that, we first randomly assign a topic to each word/token in each document, and then compute a word-topic count matrix and a document-topic count matrix.

For this example we will keep the maximum number of topics equal to 2. That means that while topics are randomly assigned, each word falls under either Topic 1 or Topic 2.
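Here is a minimal Python sketch of the four steps above (tokenization, word-ID assignment, ID replacement, and random topic assignment with the two count matrices). The variable names, the simple whitespace tokenizer, and the random seed are my own choices, not part of the original post; a different seed will give a different initial assignment than the tables below.

```python
import numpy as np

documents = [
    "play football on holiday",
    "I like to watch football on holiday",
    "la liga match is on this holiday",
    "the vehicle is being rightly advertised as smooth and silent",
    "the vehicle has good pickup and is very comfortable to drive",
    "mileage of this vehicle is around 14kmpl",
    "the vehicle is a 7 seater MPV and it generates 300 Nm torque with a diesel engine",
]

T = 2  # number of topics

# 1. Tokenize the text (simple whitespace split)
tokenized = [doc.split() for doc in documents]

# 2. Assign a word ID to each unique word
vocab = {}
for doc in tokenized:
    for word in doc:
        if word not in vocab:
            vocab[word] = len(vocab)
W = len(vocab)  # vocabulary size (44 for this corpus)

# 3. Replace words with their word IDs
docs_as_ids = [[vocab[w] for w in doc] for doc in tokenized]

# 4. Randomly assign a topic to every token and build the count matrices
rng = np.random.default_rng(0)
topic_assignments = [rng.integers(0, T, size=len(doc)) for doc in docs_as_ids]

word_topic = np.zeros((W, T), dtype=int)               # word-topic count matrix
doc_topic = np.zeros((len(documents), T), dtype=int)   # document-topic count matrix

for d, (doc, topics) in enumerate(zip(docs_as_ids, topic_assignments)):
    for word_id, topic in zip(doc, topics):
        word_topic[word_id, topic] += 1
        doc_topic[d, topic] += 1
```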




Word-topic count matrix:


Token/Word | Topic 1 Count | Topic 2 Count
play | 0 | 1
football | 1 | 1
on | 1 | 2
holiday | 3 | 0
I | 1 | 0
like | 0 | 1
to | 0 | 2
watch | 1 | 0
la | 0 | 1
liga | 1 | 0
match | 1 | 0
is | 2 | 3
this | 1 | 1
the | 1 | 2
vehicle | 2 | 2
being | 0 | 1
rightly | 0 | 1
advertised | 0 | 1
as | 0 | 1
smooth | 0 | 1
and | 1 | 2
silent | 1 | 0
has | 0 | 1
good | 1 | 0
pickup | 1 | 0
very | 1 | 0
comfortable | 1 | 0
drive | 0 | 1
mileage | 1 | 0
of | 0 | 1
around | 0 | 1
14kmpl | 1 | 0
a | 1 | 1
7 | 1 | 0
seater | 0 | 1
MPV | 0 | 1
it | 1 | 0
generates | 0 | 1
300 | 1 | 0
Nm | 0 | 1
torque | 1 | 0
with | 1 | 0
diesel | 1 | 0
engine | 1 | 0

Let's look at the total counts of Topic 1 and Topic 2 across all words. Summing the corresponding columns gives 31 for Topic 1 and 32 for Topic 2, which means that across the whole corpus Topic 1 is assigned 31 times and Topic 2 is assigned 32 times.

We will use this count later.
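In the sketch above, these totals are simply the column sums of the word-topic count matrix (variable names continue from the earlier snippet):

```python
# Total number of tokens currently assigned to each topic
# (column sums of the word-topic count matrix).
topic_totals = word_topic.sum(axis=0)
# With the assignment shown in this post the result would be [31, 32];
# a different random initialization gives different numbers.
```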

Now we will generate a document-topic count matrix, where the counts correspond to the number of tokens assigned to each topic for each document.


Document-topic count matrix:




Document | Topic 1 Count | Topic 2 Count
1 | 2 | 2
2 | 4 | 3
3 | 5 | 2
4 | 1 | 9
5 | 7 | 4
6 | 2 | 5
7 | 10 | 7



Now it’s time for the main part of LDA, which is collapsed Gibbs sampling.

Formula for collapsed Gibbs sampling:

P(z_i = j | z_{-i}) ∝ [ (C^{WT}_{w_i,j} + β) / (Σ_{w=1}^{W} C^{WT}_{w,j} + W·β) ] × [ (C^{DT}_{d_i,j} + α) / (Σ_{t=1}^{T} C^{DT}_{d_i,t} + T·α) ]

Here z_i = j means that token i is assigned to topic j, w_i is the word at position i, d_i is the document it belongs to, and z_{-i} denotes the topic assignments of all the other tokens.

Let me write the formula out for each individual topic.

For Topic 1 (j = 1):

P(z_i = 1 | z_{-i}) ∝ [ (C^{WT}_{w_i,1} + β) / (Σ_w C^{WT}_{w,1} + W·β) ] × [ (C^{DT}_{d_i,1} + α) / (Σ_t C^{DT}_{d_i,t} + T·α) ]

For Topic 2 (j = 2):

P(z_i = 2 | z_{-i}) ∝ [ (C^{WT}_{w_i,2} + β) / (Σ_w C^{WT}_{w,2} + W·β) ] × [ (C^{DT}_{d_i,2} + α) / (Σ_t C^{DT}_{d_i,t} + T·α) ]

Parameter Explanation:

a.      C^{WT}_{w,j} = the number of times word w is assigned to topic j at the start of the iteration. This comes from the word-topic count matrix.

For example, before iteration 1 starts we can see from the word-topic count matrix that, across all documents/comments, the word “holiday” is assigned to Topic 1 three times and to Topic 2 zero times.

Similarly, the word “football” is assigned to Topic 1 one time and to Topic 2 one time.


Token/Word | Topic 1 Count | Topic 2 Count
play | 0 | 1
football | 1 | 1
on | 1 | 2
holiday | 3 | 0


b.      β = per-topic word distribution prior (concentration parameter)
c.      W = length of the vocabulary (number of unique tokens/words across all documents)
d.      C^{DT}_{d,j} = the number of tokens in document d assigned to topic j at the start of the iteration. This comes from the document-topic count matrix.

For example, before iteration 1 starts we can see from the document-topic count matrix that document 2 has four tokens assigned to Topic 1 and three tokens assigned to Topic 2.

Similarly, document 4 has one token assigned to Topic 1 and nine tokens assigned to Topic 2.


Document | Topic 1 Count | Topic 2 Count
1 | 2 | 2
2 | 4 | 3
3 | 5 | 2
4 | 1 | 9


e.      α = per-document topic distribution prior.
f.      T = number of topics (here T = 2).
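To make the formula concrete, here is a small Python function, a sketch with names of my own choosing, that computes the unnormalized probability of topic j for a given word and document from the two count matrices. To match the worked example below it uses the counts exactly as they stand at the start of the iteration; a textbook collapsed Gibbs sampler would first subtract the current token from the counts.

```python
def topic_probability(word_id, doc_id, j, word_topic, doc_topic, alpha, beta):
    """Unnormalized P(topic = j | word, document) from the collapsed Gibbs formula."""
    W = word_topic.shape[0]  # vocabulary size
    T = doc_topic.shape[1]   # number of topics

    # Word-topic part: (C^{WT}_{w,j} + beta) / (sum_w C^{WT}_{w,j} + W*beta)
    word_part = (word_topic[word_id, j] + beta) / (word_topic[:, j].sum() + W * beta)

    # Document-topic part: (C^{DT}_{d,j} + alpha) / (sum_t C^{DT}_{d,t} + T*alpha)
    doc_part = (doc_topic[doc_id, j] + alpha) / (doc_topic[doc_id, :].sum() + T * alpha)

    return word_part * doc_part
```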



Latent Dirichlet Allocation under the hood (LDA Steps):


Gibbs sampling has to run for many iterations to arrive at a good result.

Let's observe one iteration closely to understand what Gibbs sampling is doing in LDA and what kind of output we get from it. The steps are listed below.


Probability Calculation:

Using the Gibbs sampling equation shown above, calculate the topic probabilities for every word in every document. After the calculation we will have a table like the one below.

Let's work through the probability of Topic 1 for the very first word, "play", in document 1.

General parameter initialization:

First, let's initialize α = 1 and β = 0.001.
And we already know the total number of unique words: W = 44.

Parameters from word-topic count matrix:

At the start of the iteration, the number of times the word "play" is assigned to Topic 1 is C^{WT}_{play,1} = 0.

From the word-topic count matrix we also know that, across the whole corpus, Topic 1 is assigned 31 times in total.

So the word-topic part of the formula is:

(C^{WT}_{play,1} + β) / (Σ_w C^{WT}_{w,1} + W·β) = (0 + 0.001) / (31 + 44 × 0.001) = 0.001 / 31.044 ≈ 0.0000322

Please refer to the word-topic count matrix if anything above is unclear.

Parameters from document-topic count matrix:

At the start of the iteration, the number of tokens in document 1 assigned to Topic 1 is C^{DT}_{1,1} = 2.

The total number of tokens in document 1 assigned to Topic 1 and Topic 2 together is 2 + 2 = 4.

So the document-topic part of the formula is:

(C^{DT}_{1,1} + α) / (Σ_t C^{DT}_{1,t} + T·α) = (2 + 1) / (4 + 2 × 1) = 3 / 6 = 0.5

Please refer to the document-topic count matrix if anything in the above calculation is unclear.

And finally, the total number of topics is T = 2.

Putting everything back into the sampling equation:

P(Topic 1 | "play", document 1) ∝ (0.001 / 31.044) × (3 / 6) ≈ 0.0000322 × 0.5 ≈ 0.0000161062

Similarly, you can calculate the Topic 2 probability for the word "play" in document 1:

P(Topic 2 | "play", document 1) ∝ ((1 + 0.001) / (32 + 44 × 0.001)) × ((2 + 1) / (4 + 2 × 1)) ≈ 0.0312383 × 0.5 ≈ 0.0156191487
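As a quick check, plugging the same counts into a few lines of Python (the variable names are mine; the counts come straight from the matrices above) reproduces both numbers:

```python
import numpy as np

alpha, beta = 1.0, 0.001
W, T = 44, 2

C_play = np.array([0, 1])          # times "play" is assigned to Topic 1 / Topic 2
topic_totals = np.array([31, 32])  # total tokens assigned to Topic 1 / Topic 2
C_doc1 = np.array([2, 2])          # tokens of document 1 assigned to Topic 1 / Topic 2

for j in range(T):
    word_part = (C_play[j] + beta) / (topic_totals[j] + W * beta)
    doc_part = (C_doc1[j] + alpha) / (C_doc1.sum() + T * alpha)
    print(f"Topic {j + 1}: {word_part * doc_part:.10f}")

# Topic 1: 0.0000161062
# Topic 2: 0.0156191487
```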

Now let's look at the topic probabilities for all tokens (calculated in the same way as shown above).


Document | Token Position in Doc | Word/Token | Previous Iteration Topic | Probability for Topic 1 | Probability for Topic 2
1 | 1 | play | Topic 2 | 0.0000161062 | 0.0156191487
1 | 2 | football | Topic 2 | 0.0161222781 | 0.0156191487
1 | 3 | on | Topic 1 | 0.0161222781 | 0.0312226938
1 | 4 | holiday | Topic 1 | 0.0483346218 | 0.0000156035
2 | 1 | I | Topic 1 | 0.0179136423 | 0.0000138698
2 | 2 | like | Topic 2 | 0.0000178957 | 0.0138836877
2 | 3 | to | Topic 2 | 0.0000178957 | 0.0277535056
2 | 4 | watch | Topic 1 | 0.0179136423 | 0.0000138698
2 | 5 | football | Topic 1 | 0.0179136423 | 0.0138836877
2 | 6 | on | Topic 2 | 0.0179136423 | 0.0277535056
2 | 7 | holiday | Topic 1 | 0.0537051354 | 0.0000138698
3 | 1 | la | Topic 2 | 0.0000214749 | 0.0104127658
3 | 2 | liga | Topic 1 | 0.0214963707 | 0.0000104024
3 | 3 | match | Topic 1 | 0.0214963707 | 0.0000104024
3 | 4 | is | Topic 1 | 0.0429712666 | 0.0312174926
3 | 5 | on | Topic 2 | 0.0214963707 | 0.0208151292
3 | 6 | this | Topic 1 | 0.0214963707 | 0.0104127658
3 | 7 | holiday | Topic 1 | 0.0644461624 | 0.0000104024
4 | 1 | the | Topic 2 | 0.0053740927 | 0.0520378230
4 | 2 | vehicle | Topic 2 | 0.0107428166 | 0.0520378230
4 | 3 | is | Topic 2 | 0.0107428166 | 0.0780437315
4 | 4 | being | Topic 2 | 0.0000053687 | 0.0260319145
4 | 5 | rightly | Topic 2 | 0.0000053687 | 0.0260319145
4 | 6 | advertised | Topic 2 | 0.0000053687 | 0.0260319145
4 | 7 | as | Topic 2 | 0.0000053687 | 0.0260319145
4 | 8 | smooth | Topic 2 | 0.0000053687 | 0.0260319145
4 | 9 | and | Topic 2 | 0.0053740927 | 0.0520378230
4 | 10 | silent | Topic 1 | 0.0053740927 | 0.0000260059
5 | 1 | the | Topic 1 | 0.0198428038 | 0.0240174568
5 | 2 | vehicle | Topic 1 | 0.0396657845 | 0.0240174568
5 | 3 | has | Topic 2 | 0.0000198230 | 0.0120147297
5 | 4 | good | Topic 1 | 0.0198428038 | 0.0000120027
5 | 5 | pickup | Topic 1 | 0.0198428038 | 0.0000120027
5 | 6 | and | Topic 2 | 0.0198428038 | 0.0240174568
5 | 7 | is | Topic 1 | 0.0396657845 | 0.0360201838
5 | 8 | very | Topic 1 | 0.0198428038 | 0.0000120027
5 | 9 | comfortable | Topic 1 | 0.0198428038 | 0.0000120027
5 | 10 | to | Topic 2 | 0.0000198230 | 0.0240174568
5 | 11 | drive | Topic 2 | 0.0000198230 | 0.0120147297
6 | 1 | mileage | Topic 1 | 0.0107481854 | 0.0000208047
6 | 2 | of | Topic 2 | 0.0000107374 | 0.0208255316
6 | 3 | this | Topic 2 | 0.0107481854 | 0.0208255316
6 | 4 | vehicle | Topic 2 | 0.0214856333 | 0.0416302584
6 | 5 | is | Topic 2 | 0.0214856333 | 0.0624349852
6 | 6 | around | Topic 2 | 0.0000107374 | 0.0208255316
6 | 7 | 14kmpl | Topic 1 | 0.0107481854 | 0.0000208047
7 | 1 | the | Topic 2 | 0.0186679009 | 0.0262927948
7 | 2 | vehicle | Topic 1 | 0.0373171526 | 0.0262927948
7 | 3 | is | Topic 2 | 0.0373171526 | 0.0394326222
7 | 4 | a | Topic 2 | 0.0186679009 | 0.0131529673
7 | 5 | 7 | Topic 1 | 0.0186679009 | 0.0000131398
7 | 6 | seater | Topic 2 | 0.0000186493 | 0.0131529673
7 | 7 | MPV | Topic 2 | 0.0000186493 | 0.0131529673
7 | 8 | and | Topic 1 | 0.0186679009 | 0.0262927948
7 | 9 | it | Topic 1 | 0.0186679009 | 0.0000131398
7 | 10 | generates | Topic 2 | 0.0000186493 | 0.0131529673
7 | 11 | 300 | Topic 1 | 0.0186679009 | 0.0000131398
7 | 12 | Nm | Topic 2 | 0.0000186493 | 0.0131529673
7 | 13 | torque | Topic 1 | 0.0186679009 | 0.0000131398
7 | 14 | with | Topic 1 | 0.0186679009 | 0.0000131398
7 | 15 | a | Topic 1 | 0.0186679009 | 0.0131529673
7 | 16 | diesel | Topic 1 | 0.0186679009 | 0.0000131398
7 | 17 | engine | Topic 1 | 0.0186679009 | 0.0000131398


In this table, the last two columns are the output of the first stage (probability calculation).


Final Topic Calculation:

This stage is quite easy: for each word in each document, LDA assigns as the final topic whichever of the two topics has the higher probability.
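As a one-line sketch (probs is a hypothetical array of shape (number of tokens, 2) holding the two probability columns computed in the previous stage):

```python
# Pick the higher-probability topic for every token; +1 so topics are numbered 1 and 2.
final_topics = probs.argmax(axis=1) + 1
```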

Let’s find the output of this stage:


Document | Token Position in Doc | Word/Token | Previous Iteration Topic | Probability for Topic 1 | Probability for Topic 2 | Final Topic
1 | 1 | play | Topic 2 | 0.0000161062 | 0.0156191487 | Topic 2
1 | 2 | football | Topic 2 | 0.0161222781 | 0.0156191487 | Topic 1
1 | 3 | on | Topic 1 | 0.0161222781 | 0.0312226938 | Topic 2
1 | 4 | holiday | Topic 1 | 0.0483346218 | 0.0000156035 | Topic 1
2 | 1 | I | Topic 1 | 0.0179136423 | 0.0000138698 | Topic 1
2 | 2 | like | Topic 2 | 0.0000178957 | 0.0138836877 | Topic 2
2 | 3 | to | Topic 2 | 0.0000178957 | 0.0277535056 | Topic 2
2 | 4 | watch | Topic 1 | 0.0179136423 | 0.0000138698 | Topic 1
2 | 5 | football | Topic 1 | 0.0179136423 | 0.0138836877 | Topic 1
2 | 6 | on | Topic 2 | 0.0179136423 | 0.0277535056 | Topic 2
2 | 7 | holiday | Topic 1 | 0.0537051354 | 0.0000138698 | Topic 1
3 | 1 | la | Topic 2 | 0.0000214749 | 0.0104127658 | Topic 2
3 | 2 | liga | Topic 1 | 0.0214963707 | 0.0000104024 | Topic 1
3 | 3 | match | Topic 1 | 0.0214963707 | 0.0000104024 | Topic 1
3 | 4 | is | Topic 1 | 0.0429712666 | 0.0312174926 | Topic 1
3 | 5 | on | Topic 2 | 0.0214963707 | 0.0208151292 | Topic 1
3 | 6 | this | Topic 1 | 0.0214963707 | 0.0104127658 | Topic 1
3 | 7 | holiday | Topic 1 | 0.0644461624 | 0.0000104024 | Topic 1
4 | 1 | the | Topic 2 | 0.0053740927 | 0.0520378230 | Topic 2
4 | 2 | vehicle | Topic 2 | 0.0107428166 | 0.0520378230 | Topic 2
4 | 3 | is | Topic 2 | 0.0107428166 | 0.0780437315 | Topic 2
4 | 4 | being | Topic 2 | 0.0000053687 | 0.0260319145 | Topic 2
4 | 5 | rightly | Topic 2 | 0.0000053687 | 0.0260319145 | Topic 2
4 | 6 | advertised | Topic 2 | 0.0000053687 | 0.0260319145 | Topic 2
4 | 7 | as | Topic 2 | 0.0000053687 | 0.0260319145 | Topic 2
4 | 8 | smooth | Topic 2 | 0.0000053687 | 0.0260319145 | Topic 2
4 | 9 | and | Topic 2 | 0.0053740927 | 0.0520378230 | Topic 2
4 | 10 | silent | Topic 1 | 0.0053740927 | 0.0000260059 | Topic 1
5 | 1 | the | Topic 1 | 0.0198428038 | 0.0240174568 | Topic 2
5 | 2 | vehicle | Topic 1 | 0.0396657845 | 0.0240174568 | Topic 1
5 | 3 | has | Topic 2 | 0.0000198230 | 0.0120147297 | Topic 2
5 | 4 | good | Topic 1 | 0.0198428038 | 0.0000120027 | Topic 1
5 | 5 | pickup | Topic 1 | 0.0198428038 | 0.0000120027 | Topic 1
5 | 6 | and | Topic 2 | 0.0198428038 | 0.0240174568 | Topic 2
5 | 7 | is | Topic 1 | 0.0396657845 | 0.0360201838 | Topic 1
5 | 8 | very | Topic 1 | 0.0198428038 | 0.0000120027 | Topic 1
5 | 9 | comfortable | Topic 1 | 0.0198428038 | 0.0000120027 | Topic 1
5 | 10 | to | Topic 2 | 0.0000198230 | 0.0240174568 | Topic 2
5 | 11 | drive | Topic 2 | 0.0000198230 | 0.0120147297 | Topic 2
6 | 1 | mileage | Topic 1 | 0.0107481854 | 0.0000208047 | Topic 1
6 | 2 | of | Topic 2 | 0.0000107374 | 0.0208255316 | Topic 2
6 | 3 | this | Topic 2 | 0.0107481854 | 0.0208255316 | Topic 2
6 | 4 | vehicle | Topic 2 | 0.0214856333 | 0.0416302584 | Topic 2
6 | 5 | is | Topic 2 | 0.0214856333 | 0.0624349852 | Topic 2
6 | 6 | around | Topic 2 | 0.0000107374 | 0.0208255316 | Topic 2
6 | 7 | 14kmpl | Topic 1 | 0.0107481854 | 0.0000208047 | Topic 1
7 | 1 | the | Topic 2 | 0.0186679009 | 0.0262927948 | Topic 2
7 | 2 | vehicle | Topic 1 | 0.0373171526 | 0.0262927948 | Topic 1
7 | 3 | is | Topic 2 | 0.0373171526 | 0.0394326222 | Topic 2
7 | 4 | a | Topic 2 | 0.0186679009 | 0.0131529673 | Topic 1
7 | 5 | 7 | Topic 1 | 0.0186679009 | 0.0000131398 | Topic 1
7 | 6 | seater | Topic 2 | 0.0000186493 | 0.0131529673 | Topic 2
7 | 7 | MPV | Topic 2 | 0.0000186493 | 0.0131529673 | Topic 2
7 | 8 | and | Topic 1 | 0.0186679009 | 0.0262927948 | Topic 2
7 | 9 | it | Topic 1 | 0.0186679009 | 0.0000131398 | Topic 1
7 | 10 | generates | Topic 2 | 0.0000186493 | 0.0131529673 | Topic 2
7 | 11 | 300 | Topic 1 | 0.0186679009 | 0.0000131398 | Topic 1
7 | 12 | Nm | Topic 2 | 0.0000186493 | 0.0131529673 | Topic 2
7 | 13 | torque | Topic 1 | 0.0186679009 | 0.0000131398 | Topic 1
7 | 14 | with | Topic 1 | 0.0186679009 | 0.0000131398 | Topic 1
7 | 15 | a | Topic 1 | 0.0186679009 | 0.0131529673 | Topic 1
7 | 16 | diesel | Topic 1 | 0.0186679009 | 0.0000131398 | Topic 1
7 | 17 | engine | Topic 1 | 0.0186679009 | 0.0000131398 | Topic 1


Here you can see that the topic of some words has changed after iteration one.

That's it. The last column shows the final topic of each word in each document at the end of one iteration. If we run more iterations, this output is fed into the next iteration as its input, and the process continues like this for many more iterations.
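Putting the pieces together, one full pass of collapsed Gibbs sampling over the corpus might look like the sketch below. It reuses the variables from the earlier snippets (my own names, not from the original post) and, unlike the simplified worked example above, removes the current token from the counts before resampling it, as a standard collapsed Gibbs sampler does; it also samples the new topic from the probabilities rather than taking the argmax.

```python
alpha, beta = 1.0, 0.001
n_iterations = 50  # in practice LDA runs many such iterations

for _ in range(n_iterations):
    for d, doc in enumerate(docs_as_ids):
        for pos, word_id in enumerate(doc):
            old_topic = topic_assignments[d][pos]

            # Remove the current token from both count matrices
            word_topic[word_id, old_topic] -= 1
            doc_topic[d, old_topic] -= 1

            # Unnormalized topic probabilities from the collapsed Gibbs formula
            probs = np.array([
                topic_probability(word_id, d, j, word_topic, doc_topic, alpha, beta)
                for j in range(T)
            ])
            probs /= probs.sum()

            # Draw the new topic and add the token back to the counts
            new_topic = rng.choice(T, p=probs)
            topic_assignments[d][pos] = new_topic
            word_topic[word_id, new_topic] += 1
            doc_topic[d, new_topic] += 1
```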


And if you are still confused about how the probability is calculated, please have another look at the collapsed Gibbs sampling equation shown earlier.



Conclusion:

In this post I have discussed:
  • What Latent Dirichlet Allocation is
  • How Latent Dirichlet Allocation works (LDA under the hood) from scratch
  • The steps of Latent Dirichlet Allocation
  • A theoretical explanation of Latent Dirichlet Allocation


If you have any questions about this topic, please let me know in the comments section and I will try my best to answer.