## 11 Jun 2019

To implement Word2Vec, there are two flavors which are -- Continuous Bag-Of-Words (CBOW) and continuous Skip-gram (SG)

In this post I will explain only Continuous Bag of Word (CBOW) model with a one-word window to understand continuous bag of word (CBOW) clearly. If you can understand CBOW with single word model then multiword CBOW model will be so easy to you.

While explaining, I will present a few small examples with a text containing a few words. However, keep in mind that word2vec is typically trained with billions of words.

### Continuous Bag of Words (CBOW):

It attempts to guess the output (target word) from its neighboring words (context words). You can think of it like fill in the blank task, where you need to guess word in place of blank by observing nearby words.

### Continuous Bag of Words (CBOW) single-word model:

In this section we will be implementing the CBOW for single-word architecture of Word2Vec. The content is broken down into the following steps:

Data Preparation: Defining corpus by tokenizing text.

Generate Training Data: Build vocabulary of words, one-hot encoding for words, word index.

Train Model: Pass one hot encoded words through forward pass, calculate error rate by computing loss, and adjust weights using back propagation.

Output: By using trained model calculate word vector and find similar words.
I will explain CBOW steps without code but if you want full working code of CBOW with numpy from scratch, I have separate post for that, you can always jump into that.

### 1. Data Preparation:

Let’s say we have a text like below:

i like natural language processing

To make it simple I have chosen a sentence without capitalization and punctuation. Also I will not remove any stop words (“and”, “the” etc.) but for real world implementation you should do lots of cleaning task like stop word removal, replacing digits, remove punctuation etc.

After pre-processing we will convert the text to list of tokenized word.

[“i”, “like”, “natural”, “language”, “processing”]

### 2. Generate training data:

Unique vocabulary: Find unique vocabulary list. As we don’t have any duplicate word in our example text, so unique vocabulary will be:

[“i”, “like”, “natural”, “language”, “processing”]

Now to prepare training data for single word CBOW model, we define “target word” as the word which follows a given word in the text (which will be our “context word”). That means we will be predicting next word for a given word.

Now let’s construct our training examples, scanning through the text with a window will prepare a context word and a target word, like so:

For example, for context word “i” the target word will be “like”. For our example text full training data will looks like:

One-hot encoding: We need to convert text to one-hot encoding as algorithm can only understand numeric values.

For example encoded value of the word “i”, which appears first in the vocabulary, will be as the vector [1, 0, 0, 0, 0]. The word “like”, which appears second in the vocabulary, will be encoded as the vector [0, 1, 0, 0, 0]

So let’s see overall set of context-target words in one hot encoded form:

So as you can see above table is our final training data, where encoded target word is Y variable for our model and encoded context word is X variable for our model.

Now we will move on to train our model.

### 3. Training Model:

So far, so good right? Now we need to pass this data into the basic neural network with one hidden layer and train it. Only one thing to note is that the desire vector dimension of any word will be the number of hidden nodes.

For this tutorial and demo purpose my desired vector dimension is 3. For example:

“i” => [0.001, 0.896, 0.763]  so number of hidden layer node will be 3.

Dimension (n): It is dimension of word embedding you can treat embedding as number of features or entity like organization, name, gender etc. It can be 10, 20, 100 etc. Increasing number of embedding layer will explain a word or token more deeply. Just for an example Google pre-trained word2vec have dimension of 300.

Now as you know a basic neural network training is divided into some steps:

1. Create model Architecture
2. Forward Propagation
3. Error Calculation
4. Weight tuning using backward pass

Before going into forward propagation we need to understand model architecture in vectorized form.

### 3.1 Model Architecture:

Let’s look back to our example text “I like natural language processing”. Suppose we are training the model against the first training data point where the context word is “i” and the target word is “like”.

Ideally, the values of the weights should be such that when the input x=(1,0,0,0) - which is for the word “i” is given to the model –it will produce output which should close to y=(0,1,0,0) - which is the word “like”.

So now let’s create weight matrix for input to hidden layer (w).

Above picture shows how I have created weighted matrix for first two input nodes. So in the similar manner if we are creating weight matrix for all input nodes it should looks like below, whose dimension will be [3 X 5], in different way it’s

[N x V]

Where:

N: Number of embedding layer/ hidden unit
V: Number of unique vocabulary size

$w=\begin{bmatrix} w_{11} & w_{12} & w_{13}\\ w_{21} & w_{22} & w_{23}\\ w_{31} & w_{32} & w_{33}\\ w_{41} & w_{42} & w_{43}\\ w_{51} & w_{52} & w_{53} \end{bmatrix}$

Now let’s create weight matrix for hidden to output layer (w/).

So in the similar manner (shown in the above picture) if we are creating weight matrix for all hidden nodes it should looks like below, whose dimension will be [5 X 3], in different way it’s [V X N]

$w=\begin{bmatrix} w_{11}^{'} & w_{12}^{'} & w_{13}^{'} & w_{14}^{'} & w_{15}^{'}\\ w_{21}^{'} & w_{22}^{'} & w_{23}^{'} & w_{24}^{'} & w_{25}^{'}\\ w_{31}^{'} & w_{32}^{'} & w_{33}^{'} & w_{34}^{'} & w_{35}^{'} \end{bmatrix}$

So final vectorized form of CBOW model should look like below:

### CBOW Vectorized form:

Where V is number of unique vocabulary and N is number of embedding layers (number of hidden units)
Now we can start forward propagation.

### 3.2 Forward Propagation:

Please refer to picture of CBOW vectorized form, if you have any confusion in below calculations.

### Calculate hidden layer matrix (h):

So now:

$\\h_1=w_{11} x_1+w_{21} x_2+w_{31} x_3+w_{41} x_4+w_{51} x_5\\ h_2=w_{12} x_1+w_{22} x_2+w_{32} x_3+w_{42} x_4+w_{52} x_5\\ h_3=w_{13} x_1+w_{23} x_2+w_{33} x_3+w_{43} x_4+w_{53} x_5$

### Calculate output layer matrix (u):

So now:

$\\u_1=w_{11}^{'} h_1+w_{21}^{'} h_2+w_{31}^{'} h_3\\ u_2=w_{12}^{'} h_1+w_{22}^{'} h_2+w_{32}^{'} h_3\\ u_3=w_{13}^{'} h_1+w_{23}^{'} h_2+w_{33}^{'} h_3\\ u_4=w_{14}^{'} h_1+w_{24}^{'} h_2+w_{34}^{'} h_3\\ u_5=w_{15}^{'} h_1+w_{25}^{'} h_2+w_{35}^{'} h_3$

### Calculate final Softmax output (y):

So:

y1 = Softmax (u1)
y2 = Softmax (u2)
y3 = Softmax (u3)
y4 = Softmax (u4)
y5 = Softmax (u5)

Now as you know softmax calculate probability for every possible class. Softmax function uses exponential as we want to get output from softmax in a range between 0 to 1.

Let’s have a look for only one (first) output from softmax function:

$y_1 =\frac{e^{u_1 }}{(e^{u_1 }+e^{u_2 }+e^{u_3 }+e^{u_4 }+e^{u_5 } )}$

So now if we want to generalize above equation (as the above equation for first output from softmax) we can write:

$y_{j}=\frac{e^{j}}{\sum_{j=1}^{V}e^{j}}$

### 3.3 Error Calculation:

As we have done with forward propagation, now we need to calculate error to know how accurately our model is performing and update weights (w and w/) accordingly.

As you know to calculate error we must have an actual value. That actual value needs to compare with predicted value.

For single word model next word is the target word for previous word.

Here we will be using log loss function to calculate error. First let’s have a look at generalized equation to minimize error/ log loss:

$E=-log(w_t|w_c)$

Where

wt = Target word
wc = Context word

So now let’s calculate error / loss for first iteration only. For the first iteration “like” will be the target word and its position is 2.

So,

$E(y_{2} )= -log(w_{y_{2}}|w_{x_{1}})\\\\\\ =-log\frac{e^{u_2 }}{e^{u_1 }+e^{u_2 }+ e^{u_3 }+ e^{u_4 }+ e^{u_5 }}\\\\\\ =-log(e^{u_2 } )+ log(e^{u_1 }+e^{u_2 }+ e^{u_3 }+ e^{u_4} + e^{u_5} )\\\\\\ = -u_2+ log(e^{u_1 }+e^{u_2 }+ e^{u_3} + e^{u_4} + e^{u_5} )$

So now if we want to generalize above equation we can write like this:
$E=-u_{j^{*}}+log\sum_{j=1}^{V}e^{u_{j}}$

Where j* is the index of the actual word (target word) in the output layer. For first iteration index of target word is 2. So for first iteration j* will be 2 as position of word "like" is 2.

### 3.5 Back Propagation:

As we have done with forward propagation and error calculation now we need to tune weight matrices through back propagation.
Back propagation is required to update weight matrices (w and w/). To update weight we need to calculate derivative of loss with respect to each weight and subtract those derivatives with their corresponding weight. It’s called gradient descent technique to tune weight.
Let’s take a look for only one example to update second weight (w/).
In this back propagation phase we will update weight of all neurons from hidden layer to output (w/11 , w/12, w/13 …. w/15)
Step1: Gradient of E with respect to w/11:

$\frac{dE(y_1)}{dw^{'}_{11}}=\frac{dE(y_1)}{du_1}\, \frac{du_1}{dw^{'}_{11}}$

Now as we know

$E(y_1 )= -u_1+ log(e^{u_1 }+e^{u_2} + e^{u_3} + e^{u_4} + e^{u_5} )$

So,
$\frac{dE(y_1)}{du_1 }= -\frac{du_1}{du_1}+\frac{d[log(e^{u_1} +e^{u_2} + e^{u_3} + e^{u_4} + e^{u_5} )])}{du_1 }\\\\\\\\ =-1+\frac{d[log(e^{u_1 }+e^{u_2} + e^{u_3} + e^{u_4} + e^{u_5} ) ]}{d(e^{u_1 }+e^{u_2} + e^{u_3} + e^{u_4} + e^{u_5} )}\,\frac{d(e^{u_1 }+e^{u_2} + e^{u_3} + e^{u_4} + e^{u_5} )}{du_1}$

Derivative of log function with chain rule.

$=-1+\frac{1}{e^{u_1 }+e^{u_2} + e^{u_3} + e^{u_4} + e^{u_5} }*u_{1}\\\\ =-1+\frac{u_{1}}{e^{u_1 }+e^{u_2} + e^{u_3} + e^{u_4} + e^{u_5} }\\\\ =-1+y_1$

We can generalize above derivative as:
Taking derivative of E with respect to uj:

$\frac{dE}{du_j}=-\frac{d(u_{j^{*}})}{du_j}+\frac{d(log\sum_{j=1}^{V}e^{u_j})}{du_j}\\\\ =(-t_j+y_j)\\\\ =(y_j-t_j)\\\\ =e_j$

Note: tj =1 if tj = tj* else tj = 0

That means tj is the actual output and yj is the predicted output. So ej is the error.
So for first iteration,

$\frac{dE(y_1)}{du_1}=e_1$

And,
$\frac{du_1}{dw^{'}_{11}}=\frac{d(w^{'}_{11}h_1+w^{'}_{21}h_2+w^{'}_{31}h_3)}{dw^{'}_{11}}=h_1$

Now finally coming back to main derivative which we were trying to solve:

$\frac{dE(y_1)}{dw^{'}_{11}}=\frac{dE(y_1)}{du_1}\,\frac{du_1}{dw^{'}_{11}}=e_1h_1$

So generalized form will looks like below:

$\frac{dE}{dw^{'}}==e*h$

Step2: Updating the weight of w/11:

$new(w^{'}_{11})=w^{'}_{11}-\frac{dE(y_1)}{w^{'}_{11}}=(w^{'}_{11}-e_1h_1)$

Note: To keep it simple I am not considering any learning rate here.

Similarly we can update weight of w/12, w/13 …. w/35.
Now let’s update 1stweight which (w).

Step1: Gradient of E with respect to w11:

In this back propagation phase we will update weight of all neurons from input layer to hidden layer (w11, w12, …..w53)

Again I will take you through only first weigh (w11)

$\frac{dE}{dw_{11}}=\frac{dE}{dh_{1}}\,\frac{dh_1}{dw_{11}}$

As to change E with respect to h1, u1to u5 all are involved.

So,

$\frac{dE}{dh_{1}}=(\frac{dE}{du_1}\,\frac{du_1}{dh_1})+(\frac{dE}{du_2}\,\frac{du_2}{dh_1})+(\frac{dE}{du_3}\,\frac{du_3}{dh_1})+(\frac{dE}{du_4}\,\frac{du_4}{dh_1})+(\frac{dE}{du_5}\,\frac{du_5}{dh_1})\\\\ =ew^{'}_{11}+ew^{'}_{12}+ew^{'}_{13}+ew^{'}_{14}+ew^{'}_{15}$

As for u1 and h1,

$\frac{du_1}{dh_1}=\frac{d(w^{'}_{11}h_1+w^{'}_{21}h_2+w^{'}_{31}h_3)}{dh_1}=w^{'}_{11}\\\\ As\, h_2\, and\, h_3\,are \, constant \, with \, respect \, to \, h_1 \\\\ Similarly \, we\, can\, calculate\,:\,\frac{du_2}{dh_1},\frac{du_3}{dh_1},\frac{du_4}{dh_1},\frac{du_5}{dh_1}$

And,
$\frac{dh_1}{dw_{11}}=\frac{d(w_{11} x_1+w_{21} x_2+w_{31} x_3+w_{41} x_4+w_{51} x_5)}{dw_{11}}$

Now finally:

$\frac{dE}{dw_{11}}=\frac{dE}{dh_1}\,\frac{dh_1}{dw_{11}}\\\\ =(ew^{'}_{11}+ew^{'}_{12}+ew^{'}_{13}+ew^{'}_{14}+ew^{'}_{15})*x$

Step2: Updating the weight of w11:

$new(w_{11})=w_{11}-\frac{dE}{dw_{11}}\\\\ =w_{11}-(ew^{'}_{11}+ew^{'}_{12}+ew^{'}_{13}+ew^{'}_{14}+ew^{'}_{15})*x$

Note: Again to keep it simple I am not considering any learning rate here.

Similarly we can update weight of w12, w13 ... W53.

### Final Word2Vec:

After doing above training process for many iteration final trained model comes with tuned and proper weights. At the end of entire training process we will consider first weight matrix (w) as our vector for our given sentence.

For example:

Our sentence or document wasi like natural language processing
And let’s say first weight of our trained model is:

$w=\begin{bmatrix} w_{11} & w_{12} & w_{13}\\ w_{21} & w_{22} & w_{23}\\ w_{31} & w_{32} & w_{33}\\ w_{41} & w_{42} & w_{43}\\ w_{51} & w_{52} & w_{53} \end{bmatrix}$

So word2vce of word “i” will be [w11, w12, w13]
Similarly word2vec for all word will be as follows:

### Conclusion:

Now to conclude let’s recall our generalized equations:

Forward Propagation:

$\\h=wx\\ u=w^{'}h\\ y_j=softmax(u_j)=\frac{e^{j}}{\sum_{j=1}^{V}e^{j}}\\\\ E=-log(w_t|w_c)=-u_{j^{*}}+log\sum_{j=1}^{V}e^{u_j}$

Back Propagation:
1. Update second weight (w/11):

$\frac{dE}{dw^{'}}=e*h\\\\ So,\\ new(dw^{'}_{11})=dw^{'}_{11}-\frac{dE(y_1)}{dw^{'}_{11}}\\\\ =w^{'}_{11}-e_1h_1$

General Vector Notation:

$\frac{dE}{dw^{'}}=(wx)\otimes e\\\\ new(w^{'})=w^{'}_{old}-\frac{dE}{dw^{'}}$

Note:Symbol  denotes the outer product. In Python this can be obtained using the numpy.outer method

2. Update second weight (w11):

$\frac{dE}{dw_{11}}=(ew^{'}_{11}+ew^{'}_{12}+ew^{'}_{13}+ew^{'}_{14}+ew^{'}_{15})*x$

General Vector Notation:

$\frac{dE}{dw}=x\otimes (w^{'}e)\\\\ new(w)=w_{old}-\frac{dE}{dw}$

Note:Symbol  denotes the outer product. In Python this can be obtained using the numpy.outer method

If you have any question or suggestion regarding this topic see you in comment section. I will try my best to answer.