20 Sep 2018

How to download NLTK corpus manually

NLTK is one of the most popular NLP packages available for Python. It can be used for NLP tasks ranging from basic to advanced.

Importantly, NLTK needs data (corpora, trained models and similar resources) for most NLP tasks. Without that data, many NLTK functions cannot run.

For example, suppose you try POS tagging (one NLP task) with the following code:
from nltk.tokenize import word_tokenize
from nltk import pos_tag

s = 'There is a problem with Traffic Light'
tokens = word_tokenize(s) # Generate list of tokens
tokens_pos = pos_tag(tokens)
print(tokens_pos)

If you do not have the required corpus (data), you will get a LookupError like:



LookupError:
**********************************************************************
  Resource u'taggers/averaged_perceptron_tagger/averaged_perceptron_tagger.pickle' not found.
  Please use the NLTK Downloader to obtain the resource: >>> nltk.download()
  Searched in:
    - 'C:\\Users\\anindya/nltk_data'
    - 'C:\\nltk_data'
    - 'D:\\nltk_data'
    - 'E:\\nltk_data'
    - 'C:\\Users\\anindya\\Anaconda2\\nltk_data'
    - 'C:\\Users\\anindya\\Anaconda2\\lib\\nltk_data'
    - 'C:\\Users\\anindya\\AppData\\Roaming\\nltk_data'
**********************************************************************


How to find the required corpus name from the error?


The first line of the LookupError has that information. Let’s look at it again.

Resource 'taggers/averaged_perceptron_tagger/averaged_perceptron_tagger.pickle' not found.

It says that the “averaged_perceptron_tagger.pickle” corpus is required to run my script, and that it should be inside the taggers/averaged_perceptron_tagger folder (folder “averaged_perceptron_tagger” inside folder “taggers”).
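
Reading the resource path by eye works, but the same breakdown can be sketched in code. The snippet below is only an illustration: resource_parts is a hypothetical helper (not part of NLTK), and it assumes the message has the format shown in this post. Other NLTK versions may word the error differently.

```python
import re

# First line of the error, copied from the traceback shown in this post.
message = ("Resource u'taggers/averaged_perceptron_tagger/"
           "averaged_perceptron_tagger.pickle' not found.")


def resource_parts(message):
    """Split the quoted resource path into (category, package, filename).

    Hypothetical helper for illustration; it is not an NLTK function.
    """
    # Match the quoted path after "Resource", with or without the u prefix.
    match = re.search(r"Resource\s+u?'([^']+)'", message)
    if match is None:
        raise ValueError("no quoted resource path found in the message")
    parts = match.group(1).split("/")
    if len(parts) < 2:
        raise ValueError("expected a path like 'category/package/...'")
    # First part is the folder, second is the package, last is the file.
    return parts[0], parts[1], parts[-1]


category, package, filename = resource_parts(message)
print(category)  # taggers
print(package)   # averaged_perceptron_tagger
print(filename)  # averaged_perceptron_tagger.pickle
```

The second value (the package name) is the one to search for when downloading, and the first value is the folder it has to live in.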

How to download NLTK corpus from Python?


There are three ways to download an NLTK corpus automatically:

  •     By GUI (select the corpus name in the GUI to download it)
  •     By corpus name
  •     Download all corpora

By GUI

Run this code in Python:

import nltk
nltk.download()

A window called “NLTK Downloader” should pop up.

Click on Corpora...

Download by NLTK corpus name:


import nltk
nltk.download('averaged_perceptron_tagger')  # package name, without the .pickle extension

Download all NLTK corpora:


import nltk
nltk.download('all')

Sometimes, if you are doing all this on your office computer, you might get an error like:

[nltk_data] Error loading all: <urlopen error [WinError 10060] A
[nltk_data]     connection attempt failed because the connected party
[nltk_data]     did not properly respond after a period of time or
[nltk_data]     established connection failed because connected host
[nltk_data]     has failed to respond>

False

This is usually caused by proxy settings that block the download from Python. In this situation you have to download the NLTK corpus manually.

How to download NLTK corpus manually?


Step 1:

Go to http://www.nltk.org/nltk_data/, search for “tagger” and download “averaged_perceptron_tagger”.





Now if you unzip the downloaded file, you can see that the required “averaged_perceptron_tagger.pickle” corpus is inside the “averaged_perceptron_tagger” folder.
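
If you would rather not unzip the file just to check, Python's standard zipfile module can list what is inside. This is a minimal sketch: zip_has_member is a hypothetical helper name, and the zip is assumed to sit in the current folder under the name shown.

```python
import os
import zipfile


def zip_has_member(zip_path, member):
    """Return True if the archive at zip_path contains the entry `member`."""
    with zipfile.ZipFile(zip_path) as archive:
        return member in archive.namelist()


# Assumed location of the downloaded file; change it to wherever you saved it.
downloaded = "averaged_perceptron_tagger.zip"
wanted = "averaged_perceptron_tagger/averaged_perceptron_tagger.pickle"

if os.path.exists(downloaded):
    print(zip_has_member(downloaded, wanted))
```

Entry names inside a zip use forward slashes on every platform, so the path is written the same way on Windows.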




Step 2 (Find folder where to move):

Recall our error once again.

LookupError:
**********************************************************************
  Resource u'taggers/averaged_perceptron_tagger/averaged_perceptron_tagger.pickle' not found.
  Please use the NLTK Downloader to obtain the resource: >>> nltk.download()

  Searched in:
    - 'C:\\Users\\anindya/nltk_data'
    - 'C:\\nltk_data'
    - 'D:\\nltk_data'
    - 'E:\\nltk_data'
    - 'C:\\Users\\anindya\\Anaconda2\\nltk_data'
    - 'C:\\Users\\anindya\\Anaconda2\\lib\\nltk_data'
    - 'C:\\Users\\anindya\\AppData\\Roaming\\nltk_data'
**********************************************************************


Choose any of the above folders. I’m choosing 'C:\nltk_data'.

So let’s create a folder called nltk_data on the C drive.

Then create a folder called “taggers” inside the nltk_data folder, as the first line of the error says.

Note:
The “averaged_perceptron_tagger.pickle” corpus is required to run the above script. It should be inside the taggers/averaged_perceptron_tagger folder (folder “averaged_perceptron_tagger” inside folder “taggers”).

Now just copy and paste “averaged_perceptron_tagger.zip” (without unzipping) into the “taggers” folder.

Full path:

nltk_data → taggers → averaged_perceptron_tagger.zip
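
The folder creation and the copy can also be done from Python with the standard library only, so no network access is involved. This is a rough sketch: install_zip is a hypothetical helper, and both the download location and the target folder 'C:\nltk_data' are assumptions taken from the choices made in this post.

```python
import shutil
from pathlib import Path


def install_zip(zip_path, data_dir, category):
    """Copy a downloaded NLTK zip, still zipped, into data_dir/category.

    Hypothetical helper for illustration; it is not part of NLTK.
    """
    zip_path = Path(zip_path)
    target_dir = Path(data_dir) / category
    # Creates nltk_data and the category folder (e.g. taggers) if missing.
    target_dir.mkdir(parents=True, exist_ok=True)
    destination = target_dir / zip_path.name
    shutil.copy2(str(zip_path), str(destination))
    return destination


# Assumed download location; adjust to where the file was saved.
source = Path("averaged_perceptron_tagger.zip")
if source.exists():
    print(install_zip(source, r"C:\nltk_data", "taggers"))
```

For a different resource, pass the folder name from the first part of the resource path in the error (for example the part before the first slash) as the category.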


That’s all!!

This is how you can deal with any other NLTK data issue.

Now you can test with the same code we tried at the beginning of this page:
from nltk.tokenize import word_tokenize
from nltk import pos_tag

s = 'There is a problem with Traffic Light'
tokens = word_tokenize(s) # Generate list of tokens
tokens_pos = pos_tag(tokens)
print(tokens_pos)


This time it should work as expected.

[('There', 'EX'), ('is', 'VBZ'), ('a', 'DT'), ('problem', 'NN'), ('with', 'IN'), ('Traffic', 'NNP'), ('Light', 'NNP')]

Do you have any questions?

Ask your question in the comments below and I will do my best to answer.