Question 1. (45 points) Write a Python program to process an input file called 1865-Lincoln.txt, which is attached in the dropbox. You can use NLTK data and functions for this assignment.
1.1 (10 points) Calculate the frequency distribution of the words in the file. Plot a histogram of the top 20 most commonly used words in the text file.
1.2 (10 points) Open the file and read its contents to generate an output file containing the lemmatized words of the original content. The output file must be created by replacing each word with its lemmatized form.
1.3 (10 points) Tokenize the text file into sentences and calculate the number of words in each sentence and its entropy. Output the results in the following format:
Sentence (The first 5 words with …) #of Words Entropy
At this second appearing to … 6 7.233
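The question does not pin down the entropy formula; one common reading is the Shannon entropy (base 2) of each sentence's word tokens. A minimal sketch under that assumption (the complete program is in the Answers section below):

import math

import nltk
from nltk import FreqDist
from nltk.tokenize import sent_tokenize, word_tokenize

nltk.download('punkt')  # tokenizer models used by sent_tokenize/word_tokenize


def entropy(tokens):
    """Shannon entropy (base 2) of a sequence of tokens."""
    freqdist = FreqDist(tokens)
    return -sum(freqdist.freq(t) * math.log(freqdist.freq(t), 2)
                for t in freqdist)


text = open('1865-Lincoln.txt').read()
for sentence in sent_tokenize(text):
    tokens = word_tokenize(sentence)
    print(" ".join(tokens[:5]) + " ...", len(tokens), round(entropy(tokens), 3))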
1.4 (15 points) For the text file 1865-Lincoln.txt, conduct part-of-speech tagging using one of the taggers and one of the tagged corpora in the NLTK toolkit. The program should output the tagged text to a text file named by adding "tagged_" before the original text filename.
Question 2. (55 points)
I am interested in knowing how the climate changes in terms of temperature and precipitation. The U.S. climate data site contains climate data for Denton, Texas since 2009. I would like you to do some calculations and comparisons between the data from 2010 and 2017 in order to answer the following two research questions:
RQ1: Is 2017 significantly different from 2010 on temperature and precipitation in the months January-June?
RQ2: Is June 2017 significantly hotter than June 2010?
2.1 (10 points) Create two files based on data published at U.S. climate data (http://www.usclimatedata.com/climate/denton/texas/united-states/ustx0353):
- File A (should be called 2010-Jan-June.txt, or 2010-Jan-June.csv) contains daily weather data from January 1, 2010 to June 30, 2010;
- File B (called 2017-Jan-June.txt, or 2017-Jan-June.csv) contains daily weather data from January 1, 2017 to June 30, 2017;
To find the data, go to the “History” tab of the page above and select the right year and month. You will see the data presented to you.
The final format of each result file should look like the following
<Date-month>,<High>,<Low>,<Precip>
1-Jan,55,33,0.08
2-Jan,55,33,0.12
……
1-June,80,56,0.15
…
The delimiter can be a comma (,) or whitespace. Make sure you round the temperature values so there are no decimal points.
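For the later calculations it helps to parse these files into numeric lists. A small helper sketch that assumes the comma-delimited layout shown above (the name read_weather is just an illustrative choice):

def read_weather(filename):
    """Parse a <Date-month>,<High>,<Low>,<Precip> file into three lists."""
    highs, lows, precip = [], [], []
    with open(filename) as f:
        for line in f:
            parts = line.strip().split(',')
            if len(parts) != 4:
                continue  # skip blank or malformed lines
            highs.append(int(parts[1]))
            lows.append(int(parts[2]))
            precip.append(float(parts[3]))
    return highs, lows, precip

highs_2010, lows_2010, precip_2010 = read_weather('2010-Jan-June.txt')
highs_2017, lows_2017, precip_2017 = read_weather('2017-Jan-June.txt')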
2.2 (15 points) Write a program to calculate the mean, median, and standard deviation of the high temperature, low temperature, and precipitation in each file, and output the results in the following format:
File name            mean     median   standard deviation
2010-Jan-June.txt    -----    -----    -----
2017-Jan-June.txt    -----    -----    -----
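A minimal sketch for 2.2, reusing the read_weather helper sketched under 2.1 and Python's built-in statistics module (the exact output layout is an assumption):

import statistics


def describe(filename):
    """Print mean, median, and standard deviation for one weather file."""
    highs, lows, precip = read_weather(filename)
    print(filename)
    for label, values in (('High', highs), ('Low', lows), ('Precip', precip)):
        print("  {:<7} mean={:.2f}  median={:.2f}  stdev={:.2f}".format(
            label, statistics.mean(values), statistics.median(values),
            statistics.stdev(values)))


describe('2010-Jan-June.txt')
describe('2017-Jan-June.txt')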
2.3 (20 points) In order to answer the first research question, we would like to conduct some statistical tests. Take File A and File B and conduct a t-test on TWO RELATED samples (you can use scipy.stats.ttest_rel: http://docs.scipy.org/doc/scipy-0.14.0/reference/generated/scipy.stats.ttest_rel.html) on Temperature High, Temperature Low, and Precipitation to find out whether there is a significant difference between these scores. Report your results in statements after the program using a docstring.
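A sketch of the paired t-test for 2.3, again assuming the read_weather helper from 2.1; ttest_rel requires both files to cover the same number of days, which holds here since neither 2010 nor 2017 is a leap year:

from scipy.stats import ttest_rel

highs_2010, lows_2010, precip_2010 = read_weather('2010-Jan-June.txt')
highs_2017, lows_2017, precip_2017 = read_weather('2017-Jan-June.txt')

for label, a, b in (('High temperature', highs_2010, highs_2017),
                    ('Low temperature', lows_2010, lows_2017),
                    ('Precipitation', precip_2010, precip_2017)):
    t_stat, p_value = ttest_rel(a, b)
    print("{}: t = {:.3f}, p = {:.3f}".format(label, t_stat, p_value))

"""Report: a p-value below 0.05 for a variable indicates a significant
difference between 2010 and 2017 on that variable; otherwise the
difference is not statistically significant."""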
2.4 (10 points) Describe how you can answer research question 2 and what your answer is.
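One way to answer RQ2 is a one-tailed paired t-test on the June daily highs: scipy's ttest_rel reports a two-sided p-value, which can be halved when the mean difference points in the hypothesized direction. A sketch under the assumption that the last 30 rows of each file are the June days:

from scipy.stats import ttest_rel

june_2010 = read_weather('2010-Jan-June.txt')[0][-30:]  # June 2010 daily highs
june_2017 = read_weather('2017-Jan-June.txt')[0][-30:]  # June 2017 daily highs

t_stat, p_two_sided = ttest_rel(june_2017, june_2010)
# One-tailed test: H1 is "June 2017 is hotter", so t must be positive
# and the two-sided p-value is halved.
p_one_sided = p_two_sided / 2
if t_stat > 0 and p_one_sided < 0.05:
    print("June 2017 is significantly hotter than June 2010 (p = {:.3f})".format(p_one_sided))
else:
    print("No significant evidence that June 2017 is hotter (p = {:.3f})".format(p_one_sided))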
Bonus Points (10 points) Students will receive 10 extra credit points if their homework is PEP8 compliant (a score of 8 or above)
Submission Instructions
Please submit your answers in a zip file named <LastName>_Assign4.zip to the Assignment 4 drop box in Learn.unt.edu. The zip file should contain all of your Python programs and the output files. Add comments to your programs to facilitate reading.
Answers:
Q1
import math

import nltk
from nltk.corpus import stopwords
from nltk.tokenize import sent_tokenize, word_tokenize

# Open and read 1865-Lincoln.txt
file1 = open('1865-Lincoln.txt').read()

# The English stopword list
stop_words = set(stopwords.words('english'))


# Define the entropy function (Shannon entropy, base 2)
def entropy(labels):
    freqdist = nltk.FreqDist(labels)
    probs = [freqdist.freq(label) for label in freqdist]
    return -sum(p * math.log(p, 2) for p in probs)


# Sentence tokenization of the original text
tokenized_sentences = sent_tokenize(file1)

numofwords_list = []
entropy_list = []
# For each sentence: count its non-stopwords and compute the entropy of those word tokens
for sentence in tokenized_sentences:
    words = word_tokenize(sentence)
    non_stopwords = [w for w in words if w.lower() not in stop_words]
    numofwords_list.append(len(non_stopwords))
    entropy_list.append(entropy(non_stopwords))

print("Sentence (The first 5 words with ...)\t\t#of Words\t\tEntropy")
# Print each sentence's first five words, its non-stopword count, and its entropy
for i in range(len(tokenized_sentences)):
    first_five = " ".join(tokenized_sentences[i].split()[:5]) + " ..."
    print("{}\t\t\t{}\t\t\t{:.3f}".format(first_five,
                                          numofwords_list[i],
                                          entropy_list[i]))
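The program above covers 1.3. For 1.2 (the lemmatized output file) a minimal sketch using NLTK's WordNetLemmatizer could look like the following; the output filename is a hypothetical choice, since the assignment does not specify one:

import nltk
from nltk.stem import WordNetLemmatizer
from nltk.tokenize import word_tokenize

nltk.download('wordnet')  # lexical data used by the lemmatizer
lemmatizer = WordNetLemmatizer()

text = open('1865-Lincoln.txt').read()
# Replace every word with its lemma and write the result to a new file
lemmas = [lemmatizer.lemmatize(token) for token in word_tokenize(text)]
with open('lemmatized_1865-Lincoln.txt', 'w') as out:  # hypothetical output name
    out.write(' '.join(lemmas))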
Q1 (continued): frequency distribution (1.1) and part-of-speech tagging (1.4)
import string
from collections import Counter

import nltk
# nltk.download()  # download the NLTK data if you do not have it installed
nltk.download('punkt')      # needed by sent_tokenize and word_tokenize
nltk.download('stopwords')
nltk.download('brown')
from nltk.tokenize import sent_tokenize, word_tokenize
from nltk.corpus import stopwords
from nltk.corpus import brown
from nltk.tag import UnigramTagger

file = open('1865-Lincoln.txt', 'r')  # open the 1865-Lincoln.txt file for reading
a = file.read()  # read the content of the file
words = 0      # word count for 1865-Lincoln.txt
sentences = 0  # sentence count for 1865-Lincoln.txt
print("The number of sentences and words in text file 1865-Lincoln.txt are:")
# Tokenization
tokenized_sentences = sent_tokenize(a)  # sentence tokenization for the Lincoln file
for i in tokenized_sentences:
    sentences = sentences + 1  # keep count of the number of sentences in the file
print("Sentences = " + str(sentences))
tokenized_words = word_tokenize(a)  # word tokenization
for i in tokenized_words:
    words = words + 1  # keep count of the number of words in the file
print("Words = " + str(words))
stop_words = set(stopwords.words("english"))  # the English stopword list
d = []
word = word_tokenize(a)  # word tokenization
for i in word:
    if i.lower() not in stop_words:
        d.append(i)  # store all the non-stopwords in a list named 'd'
print("Number of non-stop words in 1865-Lincoln.txt is: " + str(len(d)))
final_text = []
# Add punctuation symbols (and a few tokens to skip) to the set punctuationSet
punctuationSet = set(string.punctuation)
punctuationSet.add("--")
punctuationSet.add("\u201c")  # left curly double quote
punctuationSet.add("\u201d")  # right curly double quote
punctuationSet.add("Fellow-Countrymen")
punctuationSet.add("One-eighth")
for i in d:
    if i not in punctuationSet:  # omit punctuation tokens and the listed hyphenated words
        final_text.append(i)
print("Number of non-stop words (with no punctuations) in 1865-Lincoln.txt is " + str(len(final_text)))
print("\nThe frequency distribution of unique words in 1865-Lincoln.txt file is as follows:\n")
LFD = nltk.FreqDist(final_text)  # frequency distribution of the unique words
print(LFD)
print('\n')
# Count the frequency of each unique word
frequency_counting = Counter(final_text)
print(frequency_counting)
print("\n")
# Strip punctuation from a copy of the raw text before counting the most used words
a_nopunct = a
for i in string.punctuation:
    a_nopunct = a_nopunct.replace(i, "")
s = stopwords.words('english')
w = nltk.word_tokenize(a_nopunct)
filtered_words = [i for i in w if i not in s]
fdist = nltk.FreqDist(filtered_words)  # frequency distribution of the most used words
print("Top 20 unique words in the 1865-Lincoln.txt file are:\n")
list1 = []
for (ww, frequency) in fdist.most_common(20):
    list1.append(ww)
print(list1)  # print the list of the top 20 most used words
# Conduct part-of-speech tagging of the 1865-Lincoln speech using a Unigram tagger
print("\nConducting part-of-speech tagging on 1865-Lincoln.txt, please be patient...")
unigram_tagger = UnigramTagger(brown.tagged_sents(categories='news'))
tokenized_sentences = sent_tokenize(a)  # sentence tokenization of the original text
llist1 = []
for i in tokenized_sentences:
    tokenized_words = word_tokenize(i)
    tagged_words = unigram_tagger.tag(tokenized_words)
    llist1.append(tagged_words)
tagged_file = " ".join(str(i) for i in llist1)
files = open("tagged_1865-Lincoln.txt", "w")  # output to a file named with the "tagged_" prefix
files.write(tagged_file)
files.close()
print("\ntagged_1865-Lincoln.txt created successfully")
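1.1 also asks for a histogram of the top 20 words; the program above only prints the list. A minimal bar-chart sketch reusing the fdist computed above (matplotlib is an assumption, not part of the original program):

import matplotlib.pyplot as plt

top20 = fdist.most_common(20)
plt.bar([w for w, _ in top20], [c for _, c in top20])
plt.xticks(rotation=45, ha='right')
plt.xlabel('Word')
plt.ylabel('Frequency')
plt.title('Top 20 most common words in 1865-Lincoln.txt')
plt.tight_layout()
plt.show()

NLTK's fdist.plot(20) would also draw a quick frequency plot as an alternative.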