Basic Language Processing with NLTK
In this post, we explore some basic text processing using the Natural Language Toolkit (NLTK).
We will extract the most frequent nouns from a set of text documents.
We start by downloading two NLTK packages for language processing.
punkt is used for tokenising sentences and averaged_perceptron_tagger is used for tagging words with their parts of speech (POS). We also need to add the download directory to the NLTK data path.
import os
import nltk
# Create NLTK data directory
NLTK_DATA_DIR = './nltk_data'
if not os.path.exists(NLTK_DATA_DIR):
    os.makedirs(NLTK_DATA_DIR)
nltk.data.path.append(NLTK_DATA_DIR)
# Download packages and store in directory above
nltk.download('punkt', download_dir=NLTK_DATA_DIR)
nltk.download('averaged_perceptron_tagger', download_dir=NLTK_DATA_DIR)
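One caveat: on newer NLTK releases (roughly 3.9 onwards, as far as I can tell) these resources were renamed, so if tokenisation or tagging later complains about a missing resource, downloading the renamed packages should fix it:
# Fallback for newer NLTK versions (assumption: applies around NLTK 3.9+;
# the error NLTK raises will name the exact resource it wants)
nltk.download('punkt_tab', download_dir=NLTK_DATA_DIR)
nltk.download('averaged_perceptron_tagger_eng', download_dir=NLTK_DATA_DIR)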
The documents to be analysed are in the docs folder; listing it shows six text documents.
DOCS_DIR = './docs'
files = os.listdir(DOCS_DIR)
files
For now we will analyse the first document, doc1.txt.
import io
path = os.path.join(DOCS_DIR, 'doc1.txt')
with io.open(path, encoding='utf-8') as f:
    text_from_file = ' '.join(f.read().splitlines())
text_from_file
text_from_file now holds the full text of doc1.txt.
If we tokenise this block of text by sentences, we are left with a list of sentences.
# 5th sentence
sentences = nltk.sent_tokenize(text_from_file)
fifth_sentence = sentences[4]
print(fifth_sentence)
We can also tokenise the words within a sentence.
# 3rd word from 5th sentence
words_in_fifth_sentence = nltk.word_tokenize(fifth_sentence)
third_word_fifth_sentence = words_in_fifth_sentence[2]
third_word_fifth_sentence
Now we will tag each word in this sentence with its part of speech.
# Part of speech tagging
nltk.pos_tag(words_in_fifth_sentence)
We see that each word is now a tuple of the word itself and a tag for its part of speech.
In the output above, the following tag abbreviations appear:
- IN – Preposition
- DT – Determiner
- NN – Noun
- PRP – Personal pronoun
- EX – Existential there
- MD – Modal
- VB – Verb base form
- VBN – Verb past participle
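If you want the full definition of any tag rather than relying on this list, NLTK can print it for you. Note this needs one extra resource download; a small sketch:
# Look up a Penn Treebank tag's definition and example words
# (requires the extra 'tagsets' resource)
nltk.download('tagsets', download_dir=NLTK_DATA_DIR)
nltk.help.upenn_tagset('NN')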
Our aim is to extract the most popular nouns from all the sentences across all the documents.
# Get all sentences from all documents
all_text = ''
for file_name in files:
    path = os.path.join(DOCS_DIR, file_name)
    with io.open(path, encoding='utf-8') as f:
        text_from_file = ' '.join(f.read().splitlines())
    all_text += text_from_file + ' '  # trailing space avoids gluing documents together
sentences = nltk.sent_tokenize(all_text)
# Tokenise each sentence into words and tag parts of speech
words = []
for sentence in sentences:
    words.extend(nltk.word_tokenize(sentence))
tagged_words = nltk.pos_tag(words)
# Count singular common nouns (tag NN)
from collections import defaultdict
freqs = defaultdict(int)
nouns = [word.lower() for word, tag in tagged_words if tag == 'NN']
for noun in nouns:
    freqs[noun] += 1
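A note on the filter above: the NN tag only covers singular common nouns. If plurals and proper nouns should count too, matching on the tag prefix is a simple extension (the Penn Treebank noun tags are NN, NNS, NNP and NNPS); a sketch:
# Broader filter: any tag beginning with 'NN' catches all four noun tags
all_nouns = [word.lower() for word, tag in tagged_words if tag.startswith('NN')]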
# Get the 25 most common nouns
import operator
sorted_by_most_common = sorted(freqs.items(), key=operator.itemgetter(1), reverse=True)
for word, count in sorted_by_most_common[:25]:
    print("{} was found {} times.".format(word.capitalize(), count))