Basic Language Processing with NLTK
In this post, we explore some basic text processing using the Natural Language Toolkit (NLTK).
We will extract the most frequent nouns from a set of text documents.
We start by downloading two NLTK packages for language processing.
punkt is used for tokenising sentences and averaged_perceptron_tagger is used for tagging words with their parts of speech (POS). We also need to add the download directory to the NLTK data path.
import os
import nltk
# Create NLTK data directory
NLTK_DATA_DIR = './nltk_data'
if not os.path.exists(NLTK_DATA_DIR):
    os.makedirs(NLTK_DATA_DIR)
nltk.data.path.append(NLTK_DATA_DIR)
# Download packages and store in directory above
nltk.download('punkt', download_dir=NLTK_DATA_DIR)
nltk.download('averaged_perceptron_tagger', download_dir=NLTK_DATA_DIR)
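One caveat: on newer NLTK releases (roughly 3.9 onwards, as far as I can tell) these resources were renamed, so if tokenisation or tagging later complains about a missing resource, downloading the renamed packages should fix it:
# Fallback for newer NLTK versions (assumption: applies around NLTK 3.9+;
# the error NLTK raises will name the exact resource it wants)
nltk.download('punkt_tab', download_dir=NLTK_DATA_DIR)
nltk.download('averaged_perceptron_tagger_eng', download_dir=NLTK_DATA_DIR)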
The documents to be analysed are in the docs folder; listing it shows six text documents.
DOCS_DIR = './docs'
files = os.listdir(DOCS_DIR)
files
For now we will analyse the first document, doc1.txt.
import io
path = os.path.join(DOCS_DIR, 'doc1.txt')
with io.open(path, encoding='utf-8') as f:
    text_from_file = ' '.join(f.read().splitlines())
text_from_file
text_from_file now holds the full text of doc1.txt.
If we tokenise this block of text by sentences, we are left with a list of sentences.
# 5th sentence
sentences = nltk.sent_tokenize(text_from_file)
fifth_sentence = sentences[4]
print(fifth_sentence)
We can also tokenise the words within a sentence.
# 3rd word from 5th sentence
words_in_fifth_sentence = nltk.word_tokenize(fifth_sentence)
third_word_fifth_sentence = words_in_fifth_sentence[2]
third_word_fifth_sentence
Now we will tag each word in this sentence with its part of speech.
# Part of speech tagging
nltk.pos_tag(words_in_fifth_sentence)
We see that each word is now a tuple of the word itself and a tag for its part of speech.
In the output above, the following tag abbreviations appear:
- IN – Preposition
- DT – Determiner
- NN – Noun
- PRP – Personal pronoun
- EX – Existential there
- MD – Modal
- VB – Verb base form
- VBN – Verb past participle
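If you want the full definition of any tag rather than relying on this list, NLTK can print it for you. Note this needs one extra resource download; a small sketch:
# Look up a Penn Treebank tag's definition and example words
# (requires the extra 'tagsets' resource)
nltk.download('tagsets', download_dir=NLTK_DATA_DIR)
nltk.help.upenn_tagset('NN')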
Our aim is to extract the most popular nouns from all the sentences across all the documents.
# Get all sentences from all documents
all_text = ''
for file_name in files:
    path = os.path.join(DOCS_DIR, file_name)
    with io.open(path, encoding='utf-8') as f:
        text_from_file = ' '.join(f.read().splitlines())
    all_text += text_from_file + ' '  # trailing space avoids gluing documents together
sentences = nltk.sent_tokenize(all_text)
# Tokenise each sentence into words and tag parts of speech
words = []
for sentence in sentences:
    words.extend(nltk.word_tokenize(sentence))
tagged_words = nltk.pos_tag(words)
# Count singular common nouns (tag NN)
from collections import defaultdict
freqs = defaultdict(int)
nouns = [word.lower() for word, tag in tagged_words if tag == 'NN']
for noun in nouns:
    freqs[noun] += 1
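A note on the filter above: the NN tag only covers singular common nouns. If plurals and proper nouns should count too, matching on the tag prefix is a simple extension (the Penn Treebank noun tags are NN, NNS, NNP and NNPS); a sketch:
# Broader filter: any tag beginning with 'NN' catches all four noun tags
all_nouns = [word.lower() for word, tag in tagged_words if tag.startswith('NN')]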
# Get the 25 most common nouns
import operator
sorted_by_most_common = sorted(freqs.items(), key=operator.itemgetter(1), reverse=True)
for word, count in sorted_by_most_common[:25]:
    print("{} was found {} times.".format(word.capitalize(), count))