Word2Vec comes in two flavors:

**Continuous Bag-of-Words (CBOW)** and **continuous Skip-gram (SG)**. In this post I will explain only the Continuous Bag-of-Words (CBOW) model, with a one-word window, so that you can understand CBOW clearly. Once you understand CBOW with the single-word model, the multi-word CBOW model will be easy for you.

While explaining, I will
present a few small examples with a text containing a few words. However, keep
in mind that word2vec is typically trained with billions of words.

### Continuous Bag of Words (CBOW):

CBOW attempts to guess the output (target word) from its neighboring words (context words). You can think of it as a fill-in-the-blank task, where you need to guess the missing word by observing the nearby words. In this section we will implement the single-word architecture of CBOW. The content is broken down into the following steps:

**Data Preparation:** Define the corpus by tokenizing the text.

**Generate Training Data:** Build the vocabulary, one-hot encode the words, and create the word index.

**Train Model:** Pass the one-hot encoded words through a **forward pass**, calculate the error rate by computing the loss, and adjust the weights using **back propagation**.

**Output:** Use the trained model to calculate word vectors and find similar words.

If you want to implement **CBOW** with NumPy from scratch, I have a separate post for that; you can always jump into it.

### 1. Data Preparation:

Let’s say we have a text like below:

“i like natural
language processing”

To keep it simple I have chosen a sentence without capitalization or punctuation. Also, I will not remove any stop words (“and”, “the”, etc.), but for a real-world implementation you should do lots of cleaning tasks, like stop word removal, replacing digits, and removing punctuation.

After pre-processing we will convert the text to a list of tokenized words.

[“i”, “like”,
“natural”, “language”, “processing”]
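In Python this pre-processing step is just a couple of lines (the variable names are my own):

```python
# Minimal pre-processing: lowercase the text and split it into tokens.
# Real-world data would also need punctuation, digit and stop-word cleanup.
text = "i like natural language processing"
tokens = text.lower().split()
print(tokens)  # ['i', 'like', 'natural', 'language', 'processing']
```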

### 2. Generate Training Data:

**Unique vocabulary:** Find the list of unique words. As we don’t have any duplicate words in our example text, the unique vocabulary will be:

[“i”, “like”, “natural”, “language”, “processing”]

Now, to prepare training data for the single-word CBOW model, we define the “target word” as the word which follows a given word in the text (the given word being our “context word”). That means we will be predicting the next word for a given word.

For example, for the context word “**i**” the target word will be “**like**”. For our example text, the full training data will look like:

| Context word (X) | Target word (Y) |
|---|---|
| i | like |
| like | natural |
| natural | language |
| language | processing |

**One-hot encoding:** We need to convert the text to one-hot encoding, as the algorithm can only understand numeric values.

For example, the encoded value of the word “**i**”, which appears first in the vocabulary, will be the vector [1, 0, 0, 0, 0]. The word “like”, which appears second in the vocabulary, will be encoded as the vector [0, 1, 0, 0, 0].

So the overall set of context–target words in one-hot encoded form is:

| Context word (X) | Target word (Y) |
|---|---|
| [1, 0, 0, 0, 0] (i) | [0, 1, 0, 0, 0] (like) |
| [0, 1, 0, 0, 0] (like) | [0, 0, 1, 0, 0] (natural) |
| [0, 0, 1, 0, 0] (natural) | [0, 0, 0, 1, 0] (language) |
| [0, 0, 0, 1, 0] (language) | [0, 0, 0, 0, 1] (processing) |

As you can see, the table above is our final training data, where the encoded target word is the **Y variable** for our model and the encoded context word is the **X variable**.
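The vocabulary, one-hot encoding, and context–target pairs above can be sketched in NumPy (helper names like `one_hot` are my own):

```python
import numpy as np

tokens = ["i", "like", "natural", "language", "processing"]

# Unique vocabulary and a word -> index lookup
vocab = list(dict.fromkeys(tokens))   # preserves order, drops duplicates
word_to_index = {w: i for i, w in enumerate(vocab)}
V = len(vocab)                        # vocabulary size, 5 here

def one_hot(word):
    """One-hot encode a word as a length-V vector."""
    vec = np.zeros(V)
    vec[word_to_index[word]] = 1.0
    return vec

# Context -> target pairs: each word predicts the word that follows it
X = np.array([one_hot(w) for w in tokens[:-1]])  # encoded context words
Y = np.array([one_hot(w) for w in tokens[1:]])   # encoded target words
```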
Now we will move on to
train our model.

### 3. Training Model:

So far so good, right? Now we need to pass this data into a basic neural network with one hidden layer and train it. The only thing to note is that the desired vector dimension of a word equals the number of hidden nodes.

For this tutorial and demo purpose, my desired vector dimension is 3. For example:

“i” => [0.001, 0.896, 0.763] so number of hidden layer node will be 3.

**Dimension (n):** It is the dimension of the word embedding. You can treat each embedding dimension as a feature or attribute, like organization, name, gender, etc. It can be 10, 20, 100, and so on. Increasing the embedding dimension describes a word or token more richly. For example, Google's pre-trained word2vec has a dimension of 300.

Now, as you know, basic neural network training is divided into a few steps:

1. Create model Architecture

2. Forward Propagation

3. Error Calculation

4. Weight tuning using backward pass

Before going into forward
propagation we need to understand model architecture in vectorized form.

### 3.1 Model Architecture:

Let’s look back at our example text “i like natural language processing”. Suppose we are training the model on the first training data point, where the context word is “**i**” and the target word is “**like**”.

Ideally, the values of the weights should be such that when the input x = (1, 0, 0, 0, 0), which is the one-hot vector for the word “**i**”, is given to the model, it will produce an output close to y = (0, 1, 0, 0, 0), which corresponds to the word “like”.
So now let’s create the weight matrix for the input-to-hidden layer (**w**).

The picture above shows how I have created the weight matrix for the first two input nodes. If we create the weight matrix for all input nodes in the same manner, it will look like below, with dimension [5 X 3], or in other words **[V x N]**.

Where:

**N:** number of hidden units (the embedding dimension)

**V:** size of the unique vocabulary

Now let’s create the weight matrix for the hidden-to-output layer (**w′**).

In a similar manner (shown in the picture above), if we create the weight matrix for all hidden nodes, it will look like below, with dimension [3 X 5], or in other words **[N x V]**. So the final vectorized form of the CBOW model should look like below:

### CBOW Vectorized form:

Where **V** is the size of the unique vocabulary and **N** is the embedding dimension (the number of hidden units).

Now we can start forward propagation.
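With these shapes fixed, the two weight matrices might be initialized like below (the random range and seed are my own choices):

```python
import numpy as np

V, N = 5, 3  # vocabulary size and embedding dimension from our example

rng = np.random.default_rng(0)
w = rng.uniform(-0.8, 0.8, (V, N))        # input -> hidden weights,  [V x N]
w_prime = rng.uniform(-0.8, 0.8, (N, V))  # hidden -> output weights, [N x V]
```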

### 3.2 Forward Propagation:

Please refer to the picture of the CBOW vectorized form if you have any confusion with the calculations below.

### Calculate hidden layer matrix (h):

So now:

\[h_1=w_{11}x_1+w_{21}x_2+w_{31}x_3+w_{41}x_4+w_{51}x_5\\ h_2=w_{12}x_1+w_{22}x_2+w_{32}x_3+w_{42}x_4+w_{52}x_5\\ h_3=w_{13}x_1+w_{23}x_2+w_{33}x_3+w_{43}x_4+w_{53}x_5\]

### Calculate output layer matrix (u):

So now:

\[\\u_1=w_{11}^{'} h_1+w_{21}^{'} h_2+w_{31}^{'} h_3\\ u_2=w_{12}^{'} h_1+w_{22}^{'} h_2+w_{32}^{'} h_3\\ u_3=w_{13}^{'} h_1+w_{23}^{'} h_2+w_{33}^{'} h_3\\ u_4=w_{14}^{'} h_1+w_{24}^{'} h_2+w_{34}^{'} h_3\\ u_5=w_{15}^{'} h_1+w_{25}^{'} h_2+w_{35}^{'} h_3\]

### Calculate final Softmax output (y):

\[y_1=Softmax(u_1)\\ y_2=Softmax(u_2)\\ y_3=Softmax(u_3)\\ y_4=Softmax(u_4)\\ y_5=Softmax(u_5)\]

Now, as you know, softmax calculates a probability for every possible class. The softmax function uses the exponential because we want each softmax output to be in the range 0 to 1.

Let’s have a look at only one (the first) output from the softmax function:

\[y_{1}=\frac{e^{u_1}}{e^{u_1}+e^{u_2}+e^{u_3}+e^{u_4}+e^{u_5}}\]

So now if we want to generalize the above equation (the equation for the first softmax output), we can write:

\[y_{j}=\frac{e^{u_j}}{\sum_{j=1}^{V}e^{u_j}}\]
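The whole forward pass, hidden layer, output layer, and softmax, can be sketched in NumPy (function names are mine; the weight shapes follow the [V x N] and [N x V] convention above):

```python
import numpy as np

def softmax(u):
    # Subtracting max(u) keeps the exponentials numerically stable
    e = np.exp(u - np.max(u))
    return e / e.sum()

def forward(x, w, w_prime):
    """Forward pass of the single-word CBOW model."""
    h = w.T @ x        # hidden layer activations, shape (N,)
    u = w_prime.T @ h  # raw output scores, shape (V,)
    y = softmax(u)     # probability for every vocabulary word
    return h, u, y

# Example with the one-hot vector for "i" and random weights
rng = np.random.default_rng(0)
w, w_prime = rng.standard_normal((5, 3)), rng.standard_normal((3, 5))
x = np.array([1., 0., 0., 0., 0.])
h, u, y = forward(x, w, w_prime)
print(np.round(y, 3))  # five probabilities summing to 1
```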

### 3.3 Error Calculation:

Now that we are done with forward propagation, we need to calculate the error, to know how accurately our model is performing, and update the weights (**w** and **w′**) accordingly.

As you know, to calculate an error we must have an actual value to compare the predicted value against.

For the single-word model, the next word is the target word for the previous word.

Here we will be using the log loss function to calculate the error. First let’s have a look at the generalized equation of the error / log loss we want to minimize:

\[E=-log\,p(w_t|w_c)\]

Where w_{t} is the target word and w_{c} is the context word.
So now let’s calculate the error / loss **for the first iteration** only. For the first iteration, “**like**” will be the target word, and its position is 2.
So,

\[E(y_{2} )= -log\,p(w_{y_{2}}|w_{x_{1}})\\\\\\ =-log\frac{e^{u_2 }}{e^{u_1 }+e^{u_2 }+ e^{u_3 }+ e^{u_4 }+ e^{u_5 }}\\\\\\ =-log(e^{u_2 } )+ log(e^{u_1 }+e^{u_2 }+ e^{u_3 }+ e^{u_4} + e^{u_5} )\\\\\\ = -u_2+ log(e^{u_1 }+e^{u_2 }+ e^{u_3} + e^{u_4} + e^{u_5} )\]

So now if we want to generalize the above equation, we can write:

\[E=-u_{j^{*}}+log\sum_{j=1}^{V}e^{u_j}\]

Where j* is the index of the actual word (the target word) in the output layer. For the first iteration, j* will be 2, as the position of the word “**like**” is 2.

### 3.4 Back Propagation:

Now that we are done with forward propagation and error calculation, we need to tune the weight matrices through back propagation.

Back propagation is required to update the weight matrices (**w** and **w′**). To update a weight, we calculate the derivative of the loss with respect to that weight and subtract the derivative from the weight. This technique for tuning weights is called **gradient descent**.
Let’s take a look at just one example of updating the second weight matrix (**w′**).

In this back propagation phase we will update the weights of all connections from the hidden layer to the output layer (w′_{11}, w′_{12}, w′_{13}, …, w′_{35}).

**Step 1: Gradient of E with respect to w′_{11}:**

\[\frac{dE(y_1)}{dw^{'}_{11}}=\frac{dE(y_1)}{du_1}\, \frac{du_1}{dw^{'}_{11}}\]

Now as we know

\[E(y_1 )= -u_1+ log(e^{u_1 }+e^{u_2} + e^{u_3} + e^{u_4} + e^{u_5} )\]

So,

\[\frac{dE(y_1)}{du_1 }= -\frac{du_1}{du_1}+\frac{d[log(e^{u_1} +e^{u_2} + e^{u_3} + e^{u_4} + e^{u_5} )]}{du_1 }\\\\\\\\ =-1+\frac{d[log(e^{u_1 }+e^{u_2} + e^{u_3} + e^{u_4} + e^{u_5} ) ]}{d(e^{u_1 }+e^{u_2} + e^{u_3} + e^{u_4} + e^{u_5} )}\,\frac{d(e^{u_1 }+e^{u_2} + e^{u_3} + e^{u_4} + e^{u_5} )}{du_1}\]

Taking the derivative of the log function with the chain rule:

\[=-1+\frac{1}{e^{u_1 }+e^{u_2} + e^{u_3} + e^{u_4} + e^{u_5} }*e^{u_{1}}\\\\ =-1+\frac{e^{u_{1}}}{e^{u_1 }+e^{u_2} + e^{u_3} + e^{u_4} + e^{u_5} }\\\\ =-1+y_1\]

We can generalize the above derivative by taking the derivative of **E** with respect to **u**_{j}:

\[\frac{dE}{du_j}=-\frac{d(u_{j^{*}})}{du_j}+\frac{d(log\sum_{j=1}^{V}e^{u_j})}{du_j}\\\\ =(-t_j+y_j)\\\\ =(y_j-t_j)\\\\ =e_j\]

**Note:** t_{j} = 1 if j = j*, else t_{j} = 0.

That means t_{j} is the actual output and y_{j} is the predicted output, so e_{j} is the error.
So for the first iteration, where the target index j* is 2, we get e_{2} = y_{2} - 1 and e_{j} = y_{j} for every other j.

And,

\[\frac{du_1}{dw^{'}_{11}}=\frac{d(w^{'}_{11}h_1+w^{'}_{21}h_2+w^{'}_{31}h_3)}{dw^{'}_{11}}=h_1\]

Now finally coming back to main
derivative which we were trying to solve:

\[\frac{dE(y_1)}{dw^{'}_{11}}=\frac{dE(y_1)}{du_1}\,\frac{du_1}{dw^{'}_{11}}=e_1h_1\]

So the generalized form will look like below:

\[\frac{dE}{dw^{'}}=e*h\]

**Step 2: Updating the weight w′_{11}:**

\[new(w^{'}_{11})=w^{'}_{11}-\frac{dE(y_1)}{dw^{'}_{11}}=(w^{'}_{11}-e_1h_1)\]

**Note:**To keep it simple I am not considering any learning rate here.

Similarly, we can update the weights w′_{12}, w′_{13}, …, w′_{35}.
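In vectorized form, this hidden-to-output update can be sketched with `numpy.outer` (the function name is mine; the learning rate is omitted, matching the post's simplification):

```python
import numpy as np

def update_w_prime(w_prime, h, y, t):
    """One gradient step on the hidden->output weights w'
    (no learning rate, matching the post)."""
    e = y - t                        # error vector, e_j = y_j - t_j
    return w_prime - np.outer(h, e)  # dE/dw'_{kj} = e_j * h_k

# Tiny check against the scalar rule new(w'_11) = w'_11 - e_1*h_1
h = np.array([0.1, 0.2, 0.3])
y = np.array([0.2, 0.5, 0.1, 0.1, 0.1])
t = np.array([0.0, 1.0, 0.0, 0.0, 0.0])
w_prime = np.zeros((3, 5))
new_wp = update_w_prime(w_prime, h, y, t)
```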
Now let’s update the first weight matrix (**w**).

**Step 1: Gradient of E with respect to w_{11}:**
In this back propagation phase we will update the weights of all connections from the input layer to the hidden layer (w_{11}, w_{12}, …, w_{53}).

Again I will take you through only the first weight (w_{11}):

\[\frac{dE}{dw_{11}}=\frac{dE}{dh_{1}}\,\frac{dh_1}{dw_{11}}\]

To see how E changes with respect to h_{1}, note that u_{1} to u_{5} are all involved. So,

\[\frac{dE}{dh_{1}}=(\frac{dE}{du_1}\,\frac{du_1}{dh_1})+(\frac{dE}{du_2}\,\frac{du_2}{dh_1})+(\frac{dE}{du_3}\,\frac{du_3}{dh_1})+(\frac{dE}{du_4}\,\frac{du_4}{dh_1})+(\frac{dE}{du_5}\,\frac{du_5}{dh_1})\\\\ =e_1w^{'}_{11}+e_2w^{'}_{12}+e_3w^{'}_{13}+e_4w^{'}_{14}+e_5w^{'}_{15}\]

As for u_{1} and h_{1},

\[\frac{du_1}{dh_1}=\frac{d(w^{'}_{11}h_1+w^{'}_{21}h_2+w^{'}_{31}h_3)}{dh_1}=w^{'}_{11}\\\\ since\, h_2\, and\, h_3\, are \, constant \, with \, respect \, to \, h_1\\\\ Similarly \, we\, can\, calculate:\,\frac{du_2}{dh_1},\frac{du_3}{dh_1},\frac{du_4}{dh_1},\frac{du_5}{dh_1}\]

And,

\[\frac{dh_1}{dw_{11}}=\frac{d(w_{11} x_1+w_{21} x_2+w_{31} x_3+w_{41} x_4+w_{51} x_5)}{dw_{11}}=x_1\]

Now finally:

\[\frac{dE}{dw_{11}}=\frac{dE}{dh_1}\,\frac{dh_1}{dw_{11}}\\\\ =(e_1w^{'}_{11}+e_2w^{'}_{12}+e_3w^{'}_{13}+e_4w^{'}_{14}+e_5w^{'}_{15})*x_1\]

**Step 2: Updating the weight w_{11}:**

\[new(w_{11})=w_{11}-\frac{dE}{dw_{11}}\\\\ =w_{11}-(e_1w^{'}_{11}+e_2w^{'}_{12}+e_3w^{'}_{13}+e_4w^{'}_{14}+e_5w^{'}_{15})*x_1\]

**Note:**Again to keep it simple I am not considering any learning rate here.

Similarly, we can update the weights w_{12}, w_{13}, …, w_{53}.

### Final Word2Vec:

After running the above training process for many iterations, the final trained model ends up with properly tuned weights. At the end of the entire training process, we take the first weight matrix (w) as the word vectors for our given sentence.

For example:

Our sentence or document was “i like natural language processing”.

And let’s say the first weight matrix of our trained model is:

\[w=\begin{bmatrix} w_{11} & w_{12} & w_{13}\\ w_{21} & w_{22} & w_{23}\\ w_{31} & w_{32} & w_{33}\\ w_{41} & w_{42} & w_{43}\\ w_{51} & w_{52} & w_{53} \end{bmatrix}\]

So the word2vec vector of the word “i” will be [w_{11}, w_{12}, w_{13}].

Similarly, the word2vec vectors for all words will be the corresponding rows of w:

- “i” → [w_{11}, w_{12}, w_{13}]
- “like” → [w_{21}, w_{22}, w_{23}]
- “natural” → [w_{31}, w_{32}, w_{33}]
- “language” → [w_{41}, w_{42}, w_{43}]
- “processing” → [w_{51}, w_{52}, w_{53}]
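Putting every step together, a minimal sketch of the full single-word CBOW training loop might look like this (hyperparameters such as the epoch count, learning rate, and seed are my own choices, not from the post):

```python
import numpy as np

def train_cbow(tokens, N=3, epochs=500, lr=0.1, seed=0):
    """Train the single-word CBOW model described above.

    Returns the vocabulary and both weight matrices; the rows of w
    are the learned word vectors.
    """
    vocab = list(dict.fromkeys(tokens))
    idx = {word: i for i, word in enumerate(vocab)}
    V = len(vocab)

    rng = np.random.default_rng(seed)
    w = rng.uniform(-0.8, 0.8, (V, N))        # input -> hidden,  [V x N]
    w_prime = rng.uniform(-0.8, 0.8, (N, V))  # hidden -> output, [N x V]

    # (context index, target index) pairs: each word predicts the next one
    pairs = [(idx[c], idx[t]) for c, t in zip(tokens[:-1], tokens[1:])]

    for _ in range(epochs):
        for ci, ti in pairs:
            x = np.zeros(V); x[ci] = 1.0       # one-hot context word
            t = np.zeros(V); t[ti] = 1.0       # one-hot target word

            h = w.T @ x                        # forward pass
            u = w_prime.T @ h
            y = np.exp(u - u.max()) / np.exp(u - u.max()).sum()

            e = y - t                          # error vector
            grad_w = np.outer(x, w_prime @ e)  # dE/dw  = x ⊗ (w'e)
            grad_wp = np.outer(h, e)           # dE/dw' = h ⊗ e
            w -= lr * grad_w
            w_prime -= lr * grad_wp
    return vocab, w, w_prime

vocab, w, w_prime = train_cbow("i like natural language processing".split())
print({word: np.round(vec, 3).tolist() for word, vec in zip(vocab, w)})
```

After training, each row of `w` is the embedding of the corresponding vocabulary word, and similar words can be compared with cosine similarity.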

### Conclusion:

Now to conclude let’s recall our generalized
equations:

**Forward Propagation:**

\[h=w^{T}x\\ u=w^{'T}h\\ y_j=softmax(u_j)=\frac{e^{u_j}}{\sum_{j=1}^{V}e^{u_j}}\\\\ E=-log\,p(w_t|w_c)=-u_{j^{*}}+log\sum_{j=1}^{V}e^{u_j}\]

**Back Propagation:**

**1. Update the second weight matrix (w′_{11}):**

\[\frac{dE}{dw^{'}}=e*h\\\\ So,\\ new(w^{'}_{11})=w^{'}_{11}-\frac{dE(y_1)}{dw^{'}_{11}}\\\\ =w^{'}_{11}-e_1h_1\]

General Vector Notation:

\[\frac{dE}{dw^{'}}=(w^{T}x)\otimes e\\\\ new(w^{'})=w^{'}_{old}-\frac{dE}{dw^{'}}\]

**Note:** The symbol ⊗ denotes the outer product. In Python this can be obtained using the **numpy.outer** method.

**2. Update the first weight matrix (w_{11}):**

\[\frac{dE}{dw_{11}}=(e_1w^{'}_{11}+e_2w^{'}_{12}+e_3w^{'}_{13}+e_4w^{'}_{14}+e_5w^{'}_{15})*x_1\]

**General Vector Notation:**

\[\frac{dE}{dw}=x\otimes (w^{'}e)\\\\ new(w)=w_{old}-\frac{dE}{dw}\]

**Note:** The symbol ⊗ denotes the outer product. In Python this can be obtained using the **numpy.outer** method.
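As a compact check of both vectorized rules (using the [V x N] and [N x V] shapes assumed throughout, and `numpy.outer` as the notes suggest):

```python
import numpy as np

# Toy shapes from the post: V = 5 vocabulary words, N = 3 hidden units
rng = np.random.default_rng(1)
w = rng.standard_normal((5, 3))        # input  -> hidden, [V x N]
w_prime = rng.standard_normal((3, 5))  # hidden -> output, [N x V]
x = np.array([1., 0., 0., 0., 0.])     # one-hot context word ("i")
t = np.array([0., 1., 0., 0., 0.])     # one-hot target word ("like")

# Forward pass and error vector
h = w.T @ x
u = w_prime.T @ h
y = np.exp(u - u.max()) / np.exp(u - u.max()).sum()
e = y - t

# Both update rules from the conclusion, via outer products
w_prime_new = w_prime - np.outer(h, e)  # dE/dw' = h ⊗ e
w_new = w - np.outer(x, w_prime @ e)    # dE/dw  = x ⊗ (w'e)
```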

If you have any questions or suggestions regarding this topic, see you in the comment section. I will try my best to answer.