30 Sep 2019

Natural Language Processing Using TextBlob



Interest in Natural Language Processing (NLP) is growing due to an increasing number of interesting applications like machine translation, chatbots, image captioning, etc.

There are lots of tools for working with NLP. Some popular ones are:
  • NLTK
  • Spacy
  • Stanford Core NLP
  • TextBlob
In this article I will show you how to use TextBlob in Python.

What is TextBlob?

TextBlob is a Python (2 and 3) library for processing textual data. It provides a simple API for diving into common natural language processing (NLP) tasks such as part-of-speech tagging, noun phrase extraction, sentiment analysis, classification, translation, and more.

TextBlob is built on the shoulders of NLTK and Pattern, and combines them with other services such as Google Translate. A big advantage of this is that it is easy to learn and offers many features in one place. It has now become my go-to library for performing NLP tasks.

Features:
  • Part-of-speech tagging
  • Noun phrase extraction
  • Sentiment analysis
  • Classification (using Naive Bayes or Decision Tree classifiers)
  • Language translation and detection powered by Google Translate
  • Spelling correction
  • Tokenization (splitting text into words and sentences)
  • Word and phrase frequencies
  • Parsing
  • n-grams
  • Word inflection (pluralization and singularization) and lemmatization
  • Add new models or languages through extensions
  • WordNet integration

Install and setup TextBlob for Python

Since TextBlob is built on the shoulders of NLTK and Pattern, we need to download the necessary NLTK corpora along with TextBlob itself.

$ pip install -U textblob
$ python -m textblob.download_corpora

Now let’s explore some key features of TextBlob and implement them in Python.
To do any kind of text processing using TextBlob, we need to follow the two steps listed below:
  • Convert the string to a TextBlob object
  • Call methods of the TextBlob object to perform a specific task

Tokenization with TextBlob

Tokenization refers to dividing text into a sequence of tokens, such as words and sentences.

from textblob import TextBlob
text = '''
TextBlob is a Python (2 and 3) library for processing textual data. 
It provides API to do natural language processing (NLP) 
such as part-of-speech tagging, noun phrase extraction, sentiment analysis, etc. 
'''
blob_obj = TextBlob(text)
# Split into sentences
blob_obj.sentences
Output:
[Sentence("
 TextBlob is a Python (2 and 3) library for processing textual data."),
 Sentence("It provides API to do natural language processing (NLP)
 such as part-of-speech tagging, noun phrase extraction, sentiment analysis, etc.")]
 
# Print tokens/words
blob_obj.tokens

Output:
WordList(['TextBlob', 'is', 'a', 'Python', '(', '2', 'and', '3', ')', 'library', 'for', 'processing', 'textual', 'data', '.', 'It', 'provides', 'API', 'to', 'do', 'natural', 'language', 'processing', '(', 'NLP', ')', 'such', 'as', 'part-of-speech', 'tagging', ',', 'noun', 'phrase', 'extraction', ',', 'sentiment', 'analysis', ',', 'etc', '.'])

POS tagging with TextBlob

TextBlob has two types of POS taggers:
  • PatternTagger (uses the same implementation as the Pattern library)
  • NLTKTagger (uses NLTK’s TreeBank tagger)
By default TextBlob uses the PatternTagger. If you want to use the NLTK TreeBank tagger, you can specify it explicitly.
 
# By using TreeBank tagger
from textblob.taggers import NLTKTagger
nltk_tagger = NLTKTagger()
blob_obj = TextBlob(text, pos_tagger=nltk_tagger)
blob_obj.pos_tags

# By using Pattern Tagger
from textblob.taggers import PatternTagger
pattern_tagger = PatternTagger()
blob_obj = TextBlob(text, pos_tagger=pattern_tagger)
blob_obj.pos_tags

Output:
[('TextBlob', 'NN'),
 ('is', 'VBZ'),
 ('a', 'DT'),
 ('Python', 'NNP'),
 ('2', 'IN'),
 ('and', 'CC'),
 ('3', 'CD'),
 ('library', 'NN'),
 ('for', 'IN'),
 ('processing', 'NN'),
 ('textual', 'JJ'),
 ('data', 'NNS'),
 ('It', 'PRP'),
 ('provides', 'VBZ'),
 ('API', 'NNP'),
 ('to', 'TO'),
 ('do', 'VBP'),
 ('natural', 'JJ'),
 ('language', 'NN'),
 ('processing', 'NN'),
 ('NLP', 'NN'),
 ('such', 'JJ'),
 ('as', 'IN'),
 ('part-of-speech', 'JJ'),
 ('tagging', 'VBG'),
 ('noun', 'NN'),
 ('phrase', 'NN'),
 ('extraction', 'NN'),
 ('sentiment', 'NN'),
 ('analysis', 'NN'),
 ('etc.', 'FW')] 
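Because pos_tags is an ordinary list of (word, tag) tuples, it can be post-processed with plain Python, for example to keep only the nouns (Penn Treebank tags starting with 'NN'). The tuples below are a subset copied from the output above:

```python
# (word, tag) pairs taken from the POS-tagging output above
pos_tags = [('TextBlob', 'NN'), ('is', 'VBZ'), ('a', 'DT'),
            ('Python', 'NNP'), ('library', 'NN'), ('for', 'IN'),
            ('processing', 'NN'), ('textual', 'JJ'), ('data', 'NNS')]

# Keep only tokens whose Penn Treebank tag starts with 'NN' (noun tags)
nouns = [word for word, tag in pos_tags if tag.startswith('NN')]
print(nouns)  # ['TextBlob', 'Python', 'library', 'processing', 'data']
```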


Noun Phrase Extraction using TextBlob

Noun phrase extraction is important in NLP when you want to analyze the “who” or “what” in a sentence. TextBlob uses NLTK data to do this job. Let’s see an example below.
for np in blob_obj.noun_phrases:
    print (np)
Output:
textblob
python
processing textual data
api
natural language processing
nlp
noun phrase extraction
sentiment analysis

Word Inflection and Lemmatization using TextBlob

Lemmatization converts a word into its base form. For this, too, TextBlob uses NLTK (WordNet) data.
# Singularize form
print('Previous: ', blob_obj.words[13], ' After: ', blob_obj.words[13].singularize())
# Pluralize form
print('Previous: ', blob_obj.words[7], ' After: ', blob_obj.words[7].pluralize())

Output:
Previous:  provides  After:  provide
Previous:  library  After:  libraries

N-grams using TextBlob


An n-gram is a contiguous sequence of n words. N-grams can be used as features for language modelling.

By using the “ngrams” function we can easily generate n-grams.
## 3-gram
blob_obj.ngrams(n=3)

Output:
[WordList(['TextBlob', 'is', 'a']),
 WordList(['is', 'a', 'Python']),
 WordList(['a', 'Python', '2']),
 WordList(['Python', '2', 'and']),
 WordList(['2', 'and', '3']),
 WordList(['and', '3', 'library']),
 WordList(['3', 'library', 'for']),
 WordList(['library', 'for', 'processing']),
 WordList(['for', 'processing', 'textual']),
 WordList(['processing', 'textual', 'data']),
 WordList(['textual', 'data', 'It']),
 WordList(['data', 'It', 'provides']),
 WordList(['It', 'provides', 'API']),
 WordList(['provides', 'API', 'to']),
 WordList(['API', 'to', 'do']),
 WordList(['to', 'do', 'natural']),
 WordList(['do', 'natural', 'language']),
 WordList(['natural', 'language', 'processing']),
 WordList(['language', 'processing', 'NLP']),
 WordList(['processing', 'NLP', 'such']),
 WordList(['NLP', 'such', 'as']),
 WordList(['such', 'as', 'part-of-speech']),
 WordList(['as', 'part-of-speech', 'tagging']),
 WordList(['part-of-speech', 'tagging', 'noun']),
 WordList(['tagging', 'noun', 'phrase']),
 WordList(['noun', 'phrase', 'extraction']),
 WordList(['phrase', 'extraction', 'sentiment']),
 WordList(['extraction', 'sentiment', 'analysis']),
 WordList(['sentiment', 'analysis', 'etc'])]

Sentiment Analysis using TextBlob

Sentiment analysis is the process of determining the emotion (positive, negative, or neutral) of a text.
The sentiment property of TextBlob returns two values:

  • Polarity (range -1 to 1) 
  • Subjectivity (range 0 to 1)
By default TextBlob uses a lexicon-based analyzer from the Pattern library. It also ships with a NaiveBayesAnalyzer trained on a corpus of pre-classified movie reviews, which classifies new text into positive and negative probabilities.
text = "I hate this phone"
blob_obj = TextBlob(text)
blob_obj.sentiment

Output:
Sentiment(polarity=-0.8, subjectivity=0.9)

text = "I love this phone"
blob_obj = TextBlob(text)
blob_obj.sentiment

Output:
Sentiment(polarity=0.5, subjectivity=0.6)

Note: subjectivity = 0.6 indicates that the text expresses a personal opinion rather than factual information.

Spelling Correction using TextBlob

In NLP, spelling correction is often required to normalize text data. TextBlob offers a spelling corrector with 80-90% accuracy at a processing speed of at least 10 words per second.

The spelling corrector is based on Peter Norvig’s “How to Write a Spelling Corrector”
(http://norvig.com/spell-correct.html) as implemented in the Pattern library.
blob_obj = TextBlob("speling")
blob_obj.words[0].spellcheck()

Output:
[('spelling', 1.0)]

So the corrected word is ‘spelling’ with a probability of 100%.

How spelling corrector works in TextBlob

Step 1 => From a big text file, count the occurrences of each word.
Step 2 => Calculate the probability of each word as the number of times it appears divided by the total number of words.
# def P(word, N=sum(WORDS.values())): 
#     "Probability of `word`."
#     return WORDS[word] / N

Step 3 => Generate candidate variations (edits) of the given incorrect word.

In our example, for the word 'speling' (a partial list of the generated candidates):

#  'spsling',
#  'spteling',
#  'sptling',
#  'spueling',
#  'spuling',
#  'spveling',
#  'spvling',
#  'spweling',
#  'spwling',
#  'spxeling',
#  'spxling',
#  'spyeling',
#  'spyling',
#  'spzeling',
#  'spzling',
#  'sqeling',
#  'sqpeling',
#  'sreling',
#  'srpeling',
#  'sseling',
#  'sspeling',
#  'steling',
#  'stpeling',
#  'sueling',
#  'supeling',
#  'sveling',
#  'svpeling',
#  'sweling',
#  'swpeling',
#  'sxeling',
#  'sxpeling',
#  'syeling',
#  'sypeling',
#  'szeling',
#  'szpeling',
#  'tpeling',
#  'tspeling',
#  'upeling',
#  'uspeling',
#  'vpeling',
#  'vspeling',
#  'wpeling',
#  'wspeling',
#  'xpeling',
#  'xspeling',
#  'ypeling',
#  'yspeling',
#  'zpeling',
#  'zspeling'

Step 4 => Look up each candidate in the word counts built in Step 1.
Step 5 => The correct word is the candidate with the highest probability (from Step 2).
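The five steps above can be sketched in plain Python, condensed from Norvig's article; the WORDS counter here is a tiny hand-made stand-in for the word counts of a big text file:

```python
from collections import Counter

# Steps 1-2: word counts from a corpus (a tiny stand-in corpus here)
WORDS = Counter(["spelling", "spelling", "spell", "speaking"])

def P(word, N=sum(WORDS.values())):
    "Probability of `word` in the corpus."
    return WORDS[word] / N

def edits1(word):
    "Step 3: all strings one edit (delete/transpose/replace/insert) away."
    letters = 'abcdefghijklmnopqrstuvwxyz'
    splits = [(word[:i], word[i:]) for i in range(len(word) + 1)]
    deletes = [L + R[1:] for L, R in splits if R]
    transposes = [L + R[1] + R[0] + R[2:] for L, R in splits if len(R) > 1]
    replaces = [L + c + R[1:] for L, R in splits if R for c in letters]
    inserts = [L + c + R for L, R in splits for c in letters]
    return set(deletes + transposes + replaces + inserts)

def correction(word):
    "Steps 4-5: the known candidate with the highest probability."
    candidates = [w for w in edits1(word) if w in WORDS] or [word]
    return max(candidates, key=P)

print(correction("speling"))  # spelling
```

Norvig's full version also considers words two edits away and the word itself; the condensed sketch above keeps only the parts needed to illustrate the five steps.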

Language detection and Translation using TextBlob

Language translation and detection are powered by the Google Translate API.
## Detect Language
text = "I hate this phone"
blob_obj = TextBlob(text)
blob_obj.detect_language()

Output:
'en'
If you run this code on an office computer (behind a proxy), you may get a timeout error like:

URLError: <urlopen error [WinError 10060] A connection attempt failed because the connected party did not properly respond after a period of time or established connection failed because connected host has failed to respond>  

In this case you have to set your proxy address before running the code above (since it fetches results from a web API):
## Detect Language with proxy
import nltk
# Set up your proxy address
nltk.set_proxy('http://111.199.236.103:8080') 
text = "I hate this phone"
blob_obj = TextBlob(text)
blob_obj.detect_language()

## Translate to Bengali
blob_obj.translate(to="bn")

Output:
TextBlob("আমি এই ফোন ঘৃণা করি")


Here ‘bn’ is the language code you need to provide to TextBlob. The table below lists all supported language codes.

 
Language Name         Code     Language Name     Code
Afrikaans             af       Irish             ga
Albanian              sq       Italian           it
Arabic                ar       Japanese          ja
Azerbaijani           az       Kannada           kn
Basque                eu       Korean            ko
Bengali               bn       Latin             la
Belarusian            be       Latvian           lv
Bulgarian             bg       Lithuanian        lt
Catalan               ca       Macedonian        mk
Chinese Simplified    zh-CN    Malay             ms
Chinese Traditional   zh-TW    Maltese           mt
Croatian              hr       Norwegian         no
Czech                 cs       Persian           fa
Danish                da       Polish            pl
Dutch                 nl       Portuguese        pt
English               en       Romanian          ro
Esperanto             eo       Russian           ru
Estonian              et       Serbian           sr
Filipino              tl       Slovak            sk
Finnish               fi       Slovenian         sl
French                fr       Spanish           es
Galician              gl       Swahili           sw
Georgian              ka       Swedish           sv
German                de       Tamil             ta
Greek                 el       Telugu            te
Gujarati              gu       Thai              th
Haitian Creole        ht       Turkish           tr
Hebrew                iw       Ukrainian         uk
Hindi                 hi       Urdu              ur
Hungarian             hu       Vietnamese        vi
Icelandic             is       Welsh             cy
Indonesian            id       Yiddish           yi

Conclusion

TextBlob is built using various NLP tools like NLTK, Pattern, Google Translate, etc.

There is nothing fundamentally new in this package, but if you want many important NLP functions together in one place, it is a good choice.

In this tutorial I have discussed:

  • What is TextBlob?

  • Install and setup TextBlob for Python

  • Tokenization with TextBlob

  • POS tagging with TextBlob

  • Noun Phrase Extraction using TextBlob

  • Word Inflection and Lemmatization using TextBlob

  • N-grams using TextBlob

  • Sentiment Analysis using TextBlob

  • Spelling Correction using TextBlob

  • How spelling corrector works in TextBlob

  • Language detection and Translation using TextBlob 

If you have any questions or suggestions regarding this topic, please let me know in the comment section; I will try my best to answer.