4 Jun 2020

Complete Guide for Natural Language Processing in Python


What is Natural Language Processing

Natural Language Processing (NLP for short) is a branch of Data Science and Artificial Intelligence (AI) in which computers understand, analyze, and extract meaning from natural (human) language, such as chat or speech.

Use of Natural Language Processing

Natural language comes from many sources: emails, search queries, product reviews, customer feedback, online surveys, social media posts, tweets, comments, and customer support conversations. NLP helps you analyze this unstructured text data so that your company or product can grow.



Many companies use NLP to improve their business, and the use cases are numerous. Here are some of the most popular ones.
·       Analyze Customer Feedback: Customer feedback tells you what your customers or clients think about your product or service. You can collect it from different sources, such as your own portal or system, social media, product reviews, or online surveys.
By applying NLP techniques like Topic Modelling, Text Classification, and Sentiment Analysis, you can find out exactly where (pricing, delivery time, customer support, etc.) you are serving customers well and where you are falling short.
This kind of analysis is typically called NPS (Net Promoter Score) feedback analysis.
·       Email Classification: If you have used Gmail, you may have seen that it automatically filters spam and moves those emails to the spam folder. This happens because the text of each email passes through a classification algorithm before it reaches your inbox. This is one of the standard use cases of NLP.

·       Customer Query Analysis: Every day customers raise thousands of queries in chat. A company can mine this huge volume of unstructured data to find relevant business problems, increase revenue, reduce revenue leakage, and derive additional intelligence.
With this analysis you can reduce the manual effort of agents, identify the training needs and performance of a particular agent, find upsell opportunities, spot customers who are likely to churn, and detect product or service failures.
This is one of the most popular use cases of Natural Language Processing nowadays.
·       Human Resources: Organizations hold (or receive) a large number of resumes. These resumes can be passed through various NLP techniques to help the Talent Recruitment team select the best candidate for a particular project, or to help the Talent Development team upskill employees in specific areas.

·       Chatbot: A chatbot is a computer program that can communicate with a human the way a human would. Chatbots are not yet good enough to completely replace a human, but they can take over roughly 80% of routine human tasks. Nowadays most companies use chatbots for customer support, and they can automate parts of customer service as well.

·       Personal Assistant: Companies like Google (Google Assistant), Apple (Siri), and Amazon (Alexa) built their personal assistant software and devices using AI-based NLP tools. This kind of NLP application makes our daily life easier.

·       Text Summarization: An automatic text summarizer can be used as a personal or specialized assistant. This kind of NLP application saves a lot of the time otherwise spent reading large volumes of text.
Resoomer is one example of a text summarization application; it generates a summary of a text or article.
·       Text Generator: A text generator can write in the style of a particular author. For example, you can generate a poem that looks as if it were written by Shakespeare.
·       Machine Translation: Companies like Google use AI-based NLP approaches to translate between languages. This is one of the most heavily used everyday applications of Natural Language Processing.

How NLP works

When we read a text, our brain decodes the words and connects them to something in the world. For example, if we read "a black cow", we immediately picture a cow that is black.
The main objective of Natural Language Processing is to break text down into its simplest units in order to identify rules, patterns, or relationships between them.
NLP is a combination of linguistics and computer science.
Linguistics contributes knowledge about a text through the study of syntax, semantics, morphology, and pragmatics.
Computer science then converts this linguistic knowledge into rule-based or machine learning algorithms that can solve specific business problems or perform desired tasks.

I hope you now have a good idea of what NLP is. Next, let's look at the different NLP techniques and apply them in practice. To do so we need some tools or libraries, so let's start with some popular NLP libraries in Python.



Open Source NLP Libraries for python

There are many NLP libraries in Python. I am listing some that I have used and that look promising to me.

spaCy: spaCy is my first choice for general-purpose NLP techniques like POS (part-of-speech) tagging, dependency parsing, and NER, because it generates its output from pre-trained statistical models.

It is faster and easier to use than most other NLP libraries in Python, and it is suitable for production-level NLP projects.

spaCy does not cover every NLP technique (for example, sentiment analysis and language translation are not available out of the box); instead, it focuses on providing the best algorithm for each task it does support.

spaCy supports multiple languages. You can also train a custom NER (Named Entity Recognition) model or a custom dependency parser model on your business-specific data.

Stanford CoreNLP: Stanford CoreNLP is released by the NLP research group at Stanford University. With this library you can do basic NLP tasks like POS tagging, NER, and dependency parsing.
You can also do sentiment analysis with Stanford CoreNLP. In my experience the accuracy is pretty good.

TextBlob: It provides a simple API for common natural language processing (NLP) tasks like part-of-speech tagging, noun phrase extraction, sentiment analysis, classification, translation, spelling correction, and more.

TextBlob is built on top of NLTK and Pattern. It is easy to learn and offers a lot of features.

NLTK: Natural Language Toolkit (NLTK) is the most popular Python library for learning basic NLP techniques if you are a beginner. For development or production-level NLP work, however, NLTK is not up to the mark.

Gensim: This Python NLP library is not meant for basic or common tasks like POS tagging, dependency parsing, or Named Entity Recognition. You can use Gensim for NLP tasks like topic modelling (LDA, LSA) and word embeddings (word2vec).




Techniques of Natural Language Processing


There are two main NLP techniques:
1.    Syntax Analysis
2.    Semantic Analysis

1. Syntax Analysis

It is also known as syntactic analysis. Syntax is the arrangement of words that makes a sentence grammatically correct and sensible. There are many different techniques for syntax analysis, such as:
·       Tokenization
·       Dependency Parsing
·       Stemming & Lemmatization
·       Stop word removal
·       TF-IDF
·       N-Grams

2. Semantic Analysis

Semantic Analysis is used to identify the meaning of text. It uses various NLP algorithms and techniques to understand the meaning and structure of a sentence (or the meaning of a word based on its context).
For example, the meaning of "apple" is completely different in the two sentences below.
iPhone is a product of Apple
I eat apple every day.
You can also identify the topic of a text through semantic analysis.
For example, an article containing words like mobile, battery, touch screen, and charging can be labelled as Mobile Phone.
Semantic analysis involves many techniques and tasks, such as:
·       Keyword Extraction
·       Topic Modelling
·       Word Relationship Extraction
·       Word Embedding
·       Text Similarity Matching
·       Word sense disambiguation

Basic Techniques of NLP in Python

At this point I hope you have a better understanding of what NLP is and of the different techniques of natural language processing.
Now it's time to implement those NLP techniques in Python and reinforce your learning with hands-on practice.

Tokenization

Tokenization is the process of converting a string of text (a sentence or a list of sentences) into tokens.
Splitting a sentence into words is called word tokenization.
Splitting a paragraph into sentences is called sentence tokenization.

# Word tokenization using spacy
import spacy
nlp = spacy.load('en_core_web_sm')
sentence = nlp('I like Natural Language Processing')

# Print all tokens for above sentence
for word in sentence:
    print(word.text)

# Sentence tokenization using spacy
import spacy
nlp = spacy.load('en_core_web_sm')
list_of_sentences = nlp('I am Anindya. I am author of this post. To get latest article, subscribe my blog: https://www.thinkinfi.com/')
# Print all sentences for above list of sentences
for sen in list_of_sentences.sents:
    print(sen)


Output of word tokenization:
I
like
Natural
Language
Processing

Output of sentence tokenization:
I am Anindya.
I am author of this post.
To get latest article, subscribe my blog: https://www.thinkinfi.com/

I have written a detailed article about tokenization and parts of speech, which is worth reading.

Parts of speech tagging

Every word in a sentence has a grammatical role, which is called its part of speech (POS) tag. Some common POS tags are noun, adjective, and preposition.

# Parts-of-speech tagging using spacy
import spacy
nlp = spacy.load('en_core_web_sm')
sentence = nlp('I like Natural Language Processing')

# Print all tokens with POS for above sentence
for word in sentence:
    print(word.text, word.pos_)

Output:
I PRON
like VERB
Natural PROPN
Language PROPN
Processing PROPN


Dependency Parsing

Dependency grammar describes how the words in a sentence are connected to each other. A dependency parser extracts those relationships.


# Dependency parsing using spacy
import spacy
nlp = spacy.load('en_core_web_sm')
sentence = nlp('I like Natural Language Processing')

# Print all tokens with dependency for above sentence
for word in sentence:
    print(word.text, word.dep_)

# Visualizing dependency using spacy
import spacy
from spacy import displacy

nlp = spacy.load('en_core_web_sm')
sentence = nlp('I like Natural Language Processing')

# Visualize dependency in jupyter notebook
displacy.render(sentence, style="dep", jupyter=True, options={'distance': 140})


Stemming & Lemmatization

Stemming: A rule-based approach to text normalization. It reduces inflected words to a stem by stripping common prefixes and suffixes, so the result is not always a real word.
Lemmatization: It converts a word into its lemma (dictionary root form), which is always a valid word.
I have written a detailed article about stemming and lemmatization, which is worth reading.
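To make the difference concrete, here is a minimal, dependency-free sketch. It is only a toy: the suffix list and the lemma table are hypothetical and deliberately tiny, and the names toy_stem and toy_lemmatize are my own. A real project would use a library stemmer or lemmatizer instead.

```python
# Toy illustration only: the rules and the lookup table below are made up
# for this example and are far smaller than what a real library uses.

# Suffixes to strip, checked in this order (hypothetical rule set)
SUFFIXES = ("ing", "ed", "es", "s")

# Hypothetical lookup table mapping inflected forms to their lemma
LEMMA_TABLE = {
    "studies": "study",
    "studying": "study",
    "better": "good",
    "mice": "mouse",
}


def toy_stem(word):
    """Strip the first matching suffix, keeping at least three characters."""
    word = word.lower()
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word


def toy_lemmatize(word):
    """Look the word up in the lemma table; fall back to the word itself."""
    word = word.lower()
    return LEMMA_TABLE.get(word, word)


for w in ["studies", "studying", "better", "mice"]:
    print(w, "-> stem:", toy_stem(w), "| lemma:", toy_lemmatize(w))
```

Notice that the stemmer turns "studies" into "studi", which is not a real word, while the lemmatizer returns "study". The stemmer also cannot relate "better" to "good", because that link needs dictionary knowledge rather than suffix rules.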

Stopword Removal

Stopwords are the most commonly used words in a natural language. When analyzing text data, these words usually add little to the meaning of the document.
So while analyzing text data (especially with rule-based models), we usually remove stop words to normalize the text.
Some of the most common stop words in English: am, is, are, the, for, in, on, where, when, to, at, etc.

# English Stopword removal using spacy
import spacy
nlp = spacy.load('en_core_web_sm')

sentence = nlp('I am Anindya. I am author of this post. To get latest article, subscribe my blog: https://www.thinkinfi.com/')

# Create list of word tokens
token_list = [token.text for token in sentence]

# Create list of word tokens after removing stopwords
# (token.is_stop is True for words in spaCy's English stop word list)
normalized_sentence = [token.text for token in sentence if not token.is_stop]

# Before stopword removal
print('### Before stopword removal ###')
print(' '.join(token_list))
print('\n')
# After stopword removal
print('### After stopword removal ###')
print(' '.join(normalized_sentence))


TF-IDF

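TF-IDF is a technique to measure how important a word is to a document within a collection of documents.

TF (Term Frequency): Number of times a word occurs in the text document / Total number of words in the text document

IDF (Inverse Document Frequency): log(Total number of documents / Number of documents with the specific word in it). The logarithm dampens the ratio. Libraries such as scikit-learn use a smoothed variant of this formula, so their exact values differ slightly.
Thus,
TF-IDF = TF * IDF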
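The formulas above can be worked through by hand in a few lines of plain Python. This is a minimal sketch rather than a reference implementation: it assumes the common log-scaled form of IDF, splits on whitespace only, and the helper names (term_frequency, inverse_document_frequency, tf_idf) are my own. Its numbers will not match scikit-learn's, which smooths and normalizes the weights.

```python
import math

# Small example corpus, already lowercased and without punctuation
documents = [
    "this is the first document",
    "this document is the second document",
    "and this is the third one",
    "is this the first document",
]
tokenized = [doc.split() for doc in documents]


def term_frequency(term, tokens):
    """Occurrences of the term divided by the number of words in the document."""
    return tokens.count(term) / len(tokens)


def inverse_document_frequency(term, corpus):
    """log(total documents / documents containing the term).

    Assumes the term occurs in at least one document of the corpus.
    """
    containing = sum(1 for tokens in corpus if term in tokens)
    return math.log(len(corpus) / containing)


def tf_idf(term, tokens, corpus):
    """TF-IDF = TF * IDF for one term in one document."""
    return term_frequency(term, tokens) * inverse_document_frequency(term, corpus)


# "this" occurs in every document, so its IDF (and TF-IDF) is zero
print("this  :", tf_idf("this", tokenized[1], tokenized))
# "second" occurs in only one document, so it gets a positive weight
print("second:", tf_idf("second", tokenized[1], tokenized))
```

A word that appears in every document carries no information for telling documents apart, which is why "this" scores zero while "second" does not.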


# Calculate TF-IDF in Python using Sklearn
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.feature_extraction.text import TfidfTransformer
import numpy as np
import pandas as pd

transformer = TfidfTransformer()

document = [
'This is the first document.',
'This document is the second document.',
'And this is the third one.',
'Is this the first document?',
]

# Configure count vectorizer
cnt_vec = CountVectorizer()

# Apply count vectorizer to our documents to get raw term counts
transformed_document = cnt_vec.fit_transform(document)

# Print the vocabulary learned by the count vectorizer
# (get_feature_names_out() needs scikit-learn 1.0+; older versions use get_feature_names())
terms = cnt_vec.get_feature_names_out()
print(terms)

# Convert raw counts into TF-IDF weights
transformed_weights = transformer.fit_transform(transformed_document)

# Average TF-IDF weight of each term across all documents
tf_idf_weights = np.asarray(transformed_weights.mean(axis=0)).ravel().tolist()
tf_idf_weights_df = pd.DataFrame({'term': terms, 'tf_idf_weight': tf_idf_weights})
# Print the 4 terms with the highest average TF-IDF score
print(tf_idf_weights_df.sort_values(by='tf_idf_weight', ascending=False).head(4))


N-Grams

An N-gram is a sequence of N consecutive words. So:
Bigram: a sequence of 2 words
Trigram: a sequence of 3 words, and so on
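To see that there is nothing magical behind the definition, here is a minimal sketch that builds N-grams with plain Python slicing. The function name ngrams is my own choice for this illustration and is not taken from any library.

```python
def ngrams(tokens, n):
    """Return every run of n consecutive tokens as a tuple."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


tokens = "I love Natural Language Processing".split()

# Bigrams: sequences of 2 words
print(ngrams(tokens, 2))
# Trigrams: sequences of 3 words
print(ngrams(tokens, 3))
```

A sentence with 5 words yields 4 bigrams and 3 trigrams; asking for more words than the sentence contains simply returns an empty list.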


# Python NLTK: Bigrams trigrams fourgrams
from nltk import everygrams

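# Print unigrams up to 5-grams. Sorting by length groups them by size,
# because the order in which everygrams yields them differs between NLTK versions
sorted(everygrams('I love Natural Language Processing'.split(), 1, 5), key=len)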

[('I',),
 ('love',),
 ('Natural',),
 ('Language',),
 ('Processing',),
 ('I', 'love'),
 ('love', 'Natural'),
 ('Natural', 'Language'),
 ('Language', 'Processing'),
 ('I', 'love', 'Natural'),
 ('love', 'Natural', 'Language'),
 ('Natural', 'Language', 'Processing'),
 ('I', 'love', 'Natural', 'Language'),
 ('love', 'Natural', 'Language', 'Processing'),
 ('I', 'love', 'Natural', 'Language', 'Processing')]

Named Entity Recognition

A named entity is a real-world object that is assigned a name, for example a person, a country, a product, or a book title.
The task of Named Entity Recognition (NER) is to extract the names of important real-world objects from text, for example person names, country names, and dates.


# Named entity recognition in spacy
import spacy

nlp = spacy.load("en_core_web_sm")
doc = nlp("Microsoft India Private Limited is a subsidiary of Microsoft Corporation, headquartered in Hyderabad, India, Set up in 1998")

for ent in doc.ents:
    print(ent.text, ent.start_char, ent.end_char, ent.label_)


Keyword extraction

I have explained different types of keyword extraction techniques in detail in a separate article, which you may want to look into.

Topic Modeling

Read my previous articles on topic modelling for a complete understanding of the technique, with code.




Methods of NLP

The techniques listed above are more or less the complete set involved in basic NLP tasks.
To implement those techniques, there are three main approaches:
1.    Rule-Based Approach: The earliest approach to NLP, which solves problems with handcrafted linguistic rules. One advantage of the rule-based method is that no training data is required. This is why rule-based approaches are still used in NLP projects today.

2.    Machine Learning in NLP: In recent times, rule-based approaches are increasingly being replaced by machine learning algorithms, because we now have plenty of training data and computers with more computational power.

3.    Deep Learning for NLP: More recently, classical machine learning algorithms are in turn being replaced by deep learning algorithms to increase accuracy.
Let me walk through one example to make the difference between these methods clear.
Let's say you are doing sentiment analysis to predict positive and negative opinions from product reviews.
If you have training data that contains the text of each product review along with its sentiment (1 or 0, positive or negative), you can apply any classification algorithm (Naive Bayes, SVM, etc.) to predict the opinion of an unseen product review.
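As a rough sketch of that supervised route, the snippet below trains a Naive Bayes classifier with scikit-learn. The six training reviews and their labels are hypothetical, invented only for this illustration; a real model would need far more data, and the variable names are my own.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline

# Hypothetical, hand-made training reviews (1 = positive, 0 = negative)
train_reviews = [
    "excellent display and best camera",
    "good battery, better than expected",
    "best phone, excellent value",
    "bad battery and worst camera",
    "frustrating software, bad support",
    "worst phone, frustrating experience",
]
train_labels = [1, 1, 1, 0, 0, 0]

# Bag-of-words counts feed a multinomial Naive Bayes classifier
model = make_pipeline(CountVectorizer(), MultinomialNB())
model.fit(train_reviews, train_labels)

# Predict the sentiment of reviews the model has not seen before
new_reviews = ["excellent camera, good value", "bad and frustrating support"]
predictions = model.predict(new_reviews)
for review, label in zip(new_reviews, predictions):
    print(review, "->", "positive" if label == 1 else "negative")
```

The classifier learns which words go with which label from the examples alone, so no word lists have to be written by hand.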
If you don't have that training data, you have to use a rule-based approach to perform the same task. In this case, you can build two dictionaries:
1.    positive words dictionary (good, better, best, excellent, etc.)
2.    negative words dictionary (frustrating, bad, worst, etc.)
Then you can pass each word of your product review through those two dictionaries and count the number of positive and negative words in the text. Based on those counts you can determine whether the review is positive or negative.
For example:
Display of my phone is excellent, but battery life is bad.
Score: 1 (excellent) + (-1) (bad) = 0, so this review comes out as neutral.
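The dictionary-based scoring above can be sketched in a few lines of Python. This is a minimal illustration, assuming only the handful of example words listed earlier; the function names sentiment_score and sentiment_label are my own, and a usable system would need much larger word lists plus handling for negation ("not good").

```python
import re

# Word lists taken from the examples above; real lexicons are much larger
POSITIVE_WORDS = {"good", "better", "best", "excellent"}
NEGATIVE_WORDS = {"frustrating", "bad", "worst"}


def sentiment_score(review):
    """Count +1 for each positive word and -1 for each negative word."""
    words = re.findall(r"[a-z']+", review.lower())
    positives = sum(word in POSITIVE_WORDS for word in words)
    negatives = sum(word in NEGATIVE_WORDS for word in words)
    return positives - negatives


def sentiment_label(review):
    """Map the numeric score to positive, negative, or neutral."""
    score = sentiment_score(review)
    if score > 0:
        return "positive"
    if score < 0:
        return "negative"
    return "neutral"


review = "Display of my phone is excellent, but battery life is bad."
print(sentiment_score(review), sentiment_label(review))
```

For the example review the one positive word and the one negative word cancel out, giving a score of 0 and the label "neutral".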
In this short tutorial I have tried to cover the main angles of basic Natural Language Processing, but this is not the end. If you want to build your skills in this vast and complex subject, you have to keep learning.
Here are some resources for learning NLP.



Online NLP Courses

·       Natural Language Processing with Python by nltk.org
If you are looking for an NLP course for beginners, start here. It covers the basics of NLP in Python using the NLTK library.
·       Advanced NLP with Spacy by spacy.io
In this course, you'll learn how to use spaCy to build basic to advanced natural language understanding systems, using both rule-based and machine learning approaches.
·       Natural Language Processing by National Research University
In this course you will learn Natural Language Processing from basic to advanced level, covering topics like sentiment analysis, summarization, and dialogue state tracking.
·       Applied Text Mining in Python by University of Michigan
In this course you will learn basic text mining and text manipulation with NLTK in Python.
·       Natural Language Processing in TensorFlow by deeplearning.ai
This advanced NLP course is for those who already have a basic understanding of NLP and Python. You will learn to solve NLP tasks like sentiment analysis and word similarity by applying deep learning algorithms.
·       Natural Language Processing With Deep Learning by Stanford University

Conclusion

Natural Language Processing plays a crucial role in enabling interaction between machines and humans. There are many applications of NLP, some of which I have covered in this tutorial.

NLP is not yet mature enough to completely replace human effort.

Natural Language Processing is a huge and complex subject that cannot be covered in a single tutorial. Follow this website to explore more Natural Language Processing projects.


If you have any questions or suggestions about this topic, leave them in the comment section. I will try my best to answer.