15 Oct 2019

Doc2Vec implementation in Python using Gensim



Doc2vec (also known as paragraph2vec or sentence embedding) is a modified version of word2vec. The main objective of doc2vec is to convert a sentence or paragraph into a vector (numeric) form.

In this article I will walk you through a simple implementation of doc2vec using Python and Gensim.

I have a separate article that explains how doc2vec works; I recommend reading it before this one.

Must Read:

Data Pre-Processing for doc2vec Python

# Import packages
import nltk
from gensim.models.doc2vec import Doc2Vec, TaggedDocument
from nltk.tokenize import word_tokenize
nltk.download('punkt')  # word_tokenize needs the punkt tokenizer data (one-time download)

## Example documents (list of sentences)
doc = ["I love data science",
        "I love coding in python",
        "I love building NLP tool",
        "This is a good phone",
        "This is a good TV",
        "This is a good laptop"]

# Tokenization of each document
tokenized_doc = []
for d in doc:
    tokenized_doc.append(word_tokenize(d.lower()))
tokenized_doc


Output:
[['i', 'love', 'data', 'science'],
 ['i', 'love', 'coding', 'in', 'python'],
 ['i', 'love', 'building', 'nlp', 'tool'],
 ['this', 'is', 'a', 'good', 'phone'],
 ['this', 'is', 'a', 'good', 'tv'],
 ['this', 'is', 'a', 'good', 'laptop']]

# Convert tokenized documents into gensim-formatted tagged data
tagged_data = [TaggedDocument(d, [i]) for i, d in enumerate(tokenized_doc)]
tagged_data

Output:
[TaggedDocument(words=['i', 'love', 'data', 'science'], tags=[0]),
 TaggedDocument(words=['i', 'love', 'coding', 'in', 'python'], tags=[1]),
 TaggedDocument(words=['i', 'love', 'building', 'nlp', 'tool'], tags=[2]),
 TaggedDocument(words=['this', 'is', 'a', 'good', 'phone'], tags=[3]),
 TaggedDocument(words=['this', 'is', 'a', 'good', 'tv'], tags=[4]),
 TaggedDocument(words=['this', 'is', 'a', 'good', 'laptop'], tags=[5])]

The above steps are just basic data pre-processing. In a real-world application, pre-processing is rarely this simple: you would typically also apply steps like stop word removal, stemming or lemmatization, and n-grams. To keep this tutorial simple I am skipping those steps.
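For illustration, here is a minimal sketch of what a richer pipeline might look like, using NLTK's English stop word list and WordNet lemmatizer (the helper name clean_tokens is just for this example):

import nltk
from nltk.corpus import stopwords
from nltk.stem import WordNetLemmatizer
from nltk.tokenize import word_tokenize

nltk.download('stopwords')  # one-time downloads
nltk.download('wordnet')

stop_words = set(stopwords.words('english'))
lemmatizer = WordNetLemmatizer()

def clean_tokens(text):
    # Lowercase, tokenize, drop stop words and non-alphabetic tokens, lemmatize
    return [lemmatizer.lemmatize(t) for t in word_tokenize(text.lower())
            if t.isalpha() and t not in stop_words]

clean_tokens("This is a good laptop")  # ['good', 'laptop']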

Now we are ready to train our doc2vec model.

Also Read:

Train save and load doc2vec model

Here I am training a distributed memory paragraph vector (PV-DM) model, which is the default doc2vec mode in Gensim.

Note: dm=1 means ‘distributed memory’ (PV-DM) and dm=0 means ‘distributed bag of words’ (PV-DBOW)
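For comparison, a PV-DBOW model could be trained by passing dm=0 explicitly; a minimal sketch with the same hyperparameters as the PV-DM call below:

# PV-DBOW variant (dm=0) of the same training call
model_dbow = Doc2Vec(tagged_data, dm=0, vector_size=20, window=2,
                     min_count=1, workers=4, epochs=100)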

## Train doc2vec model (PV-DM)
model = Doc2Vec(tagged_data, dm=1, vector_size=20, window=2, min_count=1, workers=4, epochs=100)
# Save trained doc2vec model
model.save("test_doc2vec.model")
## Load saved doc2vec model
model= Doc2Vec.load("test_doc2vec.model")
## Print model vocabulary
model.wv.vocab



Output:
{'a': <gensim.models.keyedvectors.Vocab at 0xc45edbb710>,
 'building': <gensim.models.keyedvectors.Vocab at 0xc45edbb518>,
 'coding': <gensim.models.keyedvectors.Vocab at 0xc45edbb400>,
 'data': <gensim.models.keyedvectors.Vocab at 0xc45edbb320>,
 'good': <gensim.models.keyedvectors.Vocab at 0xc45edbb780>,
 'i': <gensim.models.keyedvectors.Vocab at 0xc45edbb048>,
 'in': <gensim.models.keyedvectors.Vocab at 0xc45edbb470>,
 'is': <gensim.models.keyedvectors.Vocab at 0xc45edbb6d8>,
 'laptop': <gensim.models.keyedvectors.Vocab at 0xc45edbb8d0>,
 'love': <gensim.models.keyedvectors.Vocab at 0xc45edbb2b0>,
 'nlp': <gensim.models.keyedvectors.Vocab at 0xc45edbb588>,
 'phone': <gensim.models.keyedvectors.Vocab at 0xc45edbb7f0>,
 'python': <gensim.models.keyedvectors.Vocab at 0xc45edbb4e0>,
 'science': <gensim.models.keyedvectors.Vocab at 0xc45edbb390>,
 'this': <gensim.models.keyedvectors.Vocab at 0xc45edbb668>,
 'tool': <gensim.models.keyedvectors.Vocab at 0xc45edbb5f8>,
 'tv': <gensim.models.keyedvectors.Vocab at 0xc45edbb860>}
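
Note: this article uses the Gensim 3.x API. If you are running Gensim 4.x, model.wv.vocab was removed; the rough equivalents are:

# Gensim 4.x equivalents of the 3.x calls used in this article
model.wv.key_to_index   # vocabulary mapping (replaces model.wv.vocab)
model.dv                # document vectors (replaces model.docvecs)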

Document Similarity using doc2vec 

# Find documents most similar to an unseen test document
test_doc = word_tokenize("That is a good device".lower())
model.docvecs.most_similar(positive=[model.infer_vector(test_doc)], topn=5)

Output:

[(5, 0.28079578280448914),
 (0, 0.1330653727054596),
 (3, 0.12503036856651306),
 (4, 0.05355849117040634),
 (2, 0.05051974207162857)]

Here (5, 0.28079578280448914) means our test_doc is most similar to document 5 of the training set, with a cosine similarity of about 0.28. Note that this score is a cosine similarity, not a probability.

Note: document 5 of the training data is: “This is a good laptop”
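
If you want the vectors themselves, you can look up a trained document vector by its tag or infer one for unseen text (Gensim 3.x API; both return a numpy array of length vector_size):

# Trained vector for document 5 (“This is a good laptop”)
model.docvecs[5]
# Inferred vector for the unseen test document
model.infer_vector(test_doc)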

Conclusion:

In this tutorial I have discussed:

  • How to implement doc2vec in Python using Gensim
  • Data pre-processing for doc2vec
  • Training a doc2vec model
  • Saving a trained doc2vec model
  • Loading a saved doc2vec model
  • Finding the doc2vec model vocabulary
  • Finding the vector representation of a document using a trained doc2vec model
  • Finding the similarity between two documents/sentences using doc2vec

If you have any questions or suggestions regarding this topic, please let me know in the comment section; I will try my best to answer.