Computational Methods for Information Systems: 705049

Question 1. (45 points) Write a Python program to process an input file called 1865-Lincoln.txt, which is attached in the dropbox. You may use NLTK data and functions for this assignment.

1.1 (10 points) Calculate the frequency distribution of the words in the file. Plot a histogram of the top 20 most commonly used words in the text file.

1.2 (10 points) Open the file and read the contents to generate an output file containing the lemmatized words of the original content; that is, the output file is created by replacing every word with its lemmatized form.
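No lemmatization code appears in the answer section below, so here is a minimal sketch, assuming NLTK's WordNetLemmatizer and an output filename of lemmatized_1865-Lincoln.txt (the output name is an assumption, since the assignment does not specify one):

import nltk
from nltk.stem import WordNetLemmatizer
from nltk.tokenize import word_tokenize

nltk.download('wordnet')  # WordNet data is needed by the lemmatizer
nltk.download('punkt')    # Punkt data is needed by word_tokenize

lemmatizer = WordNetLemmatizer()

with open('1865-Lincoln.txt') as fin:
    original = fin.read()

# replace every token with its lemmatized form and join the results back together
lemmatized = ' '.join(lemmatizer.lemmatize(w) for w in word_tokenize(original))

with open('lemmatized_1865-Lincoln.txt', 'w') as fout:  # hypothetical output name
    fout.write(lemmatized)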

1.3 (10 points) Tokenize the text file into sentences, and calculate the number of words in each sentence and its entropy. Output the results in the following format:

Sentence (The first 5 words with …)              #of Words                Entropy

At this second appearing to …                        6                                  7.233
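Here "entropy" is read as the Shannon entropy of the distribution of items in the sentence, H = -sum(p_i * log2(p_i)), where p_i is the relative frequency of the i-th distinct item; this is the formula implemented by the entropy() helper in the answer code below.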

1.4 (15 points) For the text file 1865-Lincoln.txt, conduct part-of-speech tagging using one of the taggers and one of the tagged corpora in the NLTK toolkit. The program outputs the tagged text into a text file and names it by adding "tagged_" before the original text filename.

Question 2. (55 points)

I am interested in knowing how the climate changes in terms of temperature and precipitation. The U.S. climate data site contains climate data for Denton, Texas since 2009. I would like you to do some calculations and comparisons between the data in 2010 and 2017 in order to answer the following two research questions:

RQ1: Is 2017 significantly different from 2010 on temperature and precipitation in the months January-June?

RQ2:  Is June 2017 significantly hotter than June 2010?

2.1 (10 points) Create two files based on data published at U.S. climate data (http://www.usclimatedata.com/climate/denton/texas/united-states/ustx0353):

  • File A (should be called 2010-Jan-June.txt, or 2010-Jan-June.csv) contains daily weather data from January 1, 2010 to June 30, 2010;
  • File B (called 2017-Jan-June.txt, or 2017-Jan-June.csv) contains daily weather data from January 1, 2017 to June 30, 2017;

To find the data, go to the "History" tab of the above page and select the right year and month. You will see the data presented to you.

The final format of each result file should look like the following:

<Date-month>,<High>,<Low>,<Precip>

1-Jan,55,33,0.08

2-Jan,55,33,0.12

……

1-June,80,56,0.15

The delimiter can be a comma (,) or whitespace. Make sure you round the temperature values so there are no decimal points.

2.2 (15 points) Write a program to calculate the mean, median, and standard deviation of the high temperature, low temperature, and precipitation in each file, and output the results in the following format:

File name               mean        median      standard deviation

2010-Jan-June.txt       -----       -----       -----

2017-Jan-June.txt       -----       -----       -----
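No code for this part appears in the answer section, so here is a minimal sketch using Python's statistics module, assuming the comma-delimited file format and filenames from 2.1:

import csv
import statistics


def summarize(filename):
    """Print the mean, median, and standard deviation of High, Low, and Precip."""
    highs, lows, precips = [], [], []
    with open(filename, newline='') as f:
        for date, high, low, precip in csv.reader(f):
            highs.append(float(high))
            lows.append(float(low))
            precips.append(float(precip))
    for label, values in (('High', highs), ('Low', lows), ('Precip', precips)):
        print(filename, label,
              statistics.mean(values),
              statistics.median(values),
              statistics.stdev(values))


summarize('2010-Jan-June.txt')
summarize('2017-Jan-June.txt')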

2.3 (20 points) In order to answer the first research question, we would like to conduct some statistical tests. Take File A and File B and conduct a t-test on TWO RELATED samples (you can use scipy.stats.ttest_rel: http://docs.scipy.org/doc/scipy-0.14.0/reference/generated/scipy.stats.ttest_rel.html) on Temperature High, Temperature Low, and Precipitation to find out whether there is a significant difference between these scores. Report your results in statements after the program using a docstring.
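No code for this part appears in the answer section either; a minimal sketch of the paired t-test, assuming both files contain the same number of daily rows in the same date order:

import csv

from scipy import stats


def load_columns(filename):
    # return the High, Low, and Precip columns of a file in the 2.1 format
    highs, lows, precips = [], [], []
    with open(filename, newline='') as f:
        for date, high, low, precip in csv.reader(f):
            highs.append(float(high))
            lows.append(float(low))
            precips.append(float(precip))
    return highs, lows, precips


data_2010 = load_columns('2010-Jan-June.txt')
data_2017 = load_columns('2017-Jan-June.txt')

for label, x, y in zip(('High', 'Low', 'Precip'), data_2010, data_2017):
    t, p = stats.ttest_rel(x, y)      # t-test on two related (paired) samples
    print(label, 't =', t, 'p =', p)  # p < 0.05 suggests a significant difference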

2.4 (10 points) Describe how you can answer research question 2 and what your answer is. 
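One possible approach (a sketch, not part of the original answer set): restrict both files to the June rows, compare the daily highs with the same paired t-test, and make it one-tailed by checking the sign of t and halving the two-sided p-value that scipy.stats.ttest_rel reports:

import csv

from scipy import stats


def june_highs(filename):
    # return the daily high temperatures of the June rows in a 2.1-format file
    highs = []
    with open(filename, newline='') as f:
        for date, high, low, precip in csv.reader(f):
            if 'Jun' in date:  # matches rows such as "1-June"
                highs.append(float(high))
    return highs


t, p_two_sided = stats.ttest_rel(june_highs('2017-Jan-June.txt'),
                                 june_highs('2010-Jan-June.txt'))

# one-tailed test: "hotter" means t > 0 and half the two-sided p-value below 0.05
if t > 0 and p_two_sided / 2 < 0.05:
    print('June 2017 is significantly hotter than June 2010')
else:
    print('No significant evidence that June 2017 is hotter than June 2010')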

Bonus Points (10 points) Students will receive 10 extra credit points if their homework is PEP8 compliant (a score of 8 or above) 

Submission Instructions

Please submit your answers in a zip file named <LastName>_Assign4.zip to the Assignment 4 drop box in Learn.unt.edu. The zip file should contain all of your Python programs and the output files. Add comments to your programs to facilitate reading.

Answers:

Q1

import math

import nltk
from nltk.corpus import stopwords
from nltk.tokenize import sent_tokenize

nltk.download('stopwords')  # make sure the required NLTK data is available
nltk.download('punkt')

# open the 1865-Lincoln.txt file and read its content
file1 = open('1865-Lincoln.txt').read()

# omit stopwords from the text
stop_words = stopwords.words("english")
nonstopwords = []
for i in file1.lower().split():
    if i not in stop_words:
        nonstopwords.append(i)

# join the remaining words back into a single string
text = " ".join(str(i) for i in nonstopwords)

# sentence tokenization
tokenized_sentences = sent_tokenize(text)

numofwords_list = []
entropy_list = []


# define the entropy function (Shannon entropy of the items in 'labels')
def entropy(labels):
    freqdist = nltk.FreqDist(labels)
    probs = [freqdist.freq(l) for l in freqdist]
    return -sum([p * math.log(p, 2) for p in probs])


# count the non-stop words in each sentence
for i in tokenized_sentences:
    numofwords_list.append(len(i.split()))

# calculate the entropy of each sentence (passing the sentence string, so the
# distribution is over its characters)
for i in tokenized_sentences:
    entropy_list.append(entropy(i))

print("Sentence (The first 5 words with ...)" + "\t\t" + "#of Words" + "\t\t" + "Entropy")

# print the start of each sentence with its word count and entropy
for i in range(len(tokenized_sentences)):
    x = tokenized_sentences[i]
    wl = numofwords_list[i]
    el = entropy_list[i]
    print("{}{}{}{}{}".format(x[:30], "\t\t\t", wl, "\t\t\t", el))


Q2

import string
from collections import Counter

import nltk
from nltk.corpus import brown, stopwords
from nltk.tag import UnigramTagger
from nltk.tokenize import sent_tokenize, word_tokenize

# nltk.download()  # download the NLTK data if you don't have it installed
nltk.download('stopwords')
nltk.download('brown')
nltk.download('punkt')  # needed by sent_tokenize and word_tokenize

file = open('1865-Lincoln.txt', 'r')  # open the 1865-Lincoln.txt file for reading
a = file.read()                       # read the content of the file

words = 0      # word count for 1865-Lincoln.txt
sentences = 0  # sentence count for 1865-Lincoln.txt

print("The number of sentences and words in the text file 1865-Lincoln.txt are:")

# Tokenization
tokenized_sentences = sent_tokenize(a)  # sentence tokenization of the Lincoln text

for i in tokenized_sentences:
    sentences = sentences + 1  # count the sentences

print("Sentences = " + str(sentences))

tokenized_words = word_tokenize(a)  # word tokenization

for i in tokenized_words:
    words = words + 1  # count the words

print("Words = " + str(words))

stop_words = set(stopwords.words("english"))  # the English stopword list

d = []
word = word_tokenize(a)  # word tokenization

for i in word:
    if i not in stop_words:
        d.append(i)  # store all the non-stopwords in a list named 'd'

print("Number of non-stop words in 1865-Lincoln.txt is: " + str(len(d)))

final_text = []

# Build the set of punctuation symbols (plus a few tokens the original answer drops)
punctuationSet = set(string.punctuation)
punctuationSet.add("--")        # double hyphen used as a dash in the speech
punctuationSet.add("\u201c")    # left curly double quote
punctuationSet.add("\u201d")    # right curly double quote
punctuationSet.add("Fellow-Countrymen")
punctuationSet.add("One-eighth")

for i in d:
    if i not in punctuationSet:  # omit the punctuation tokens
        final_text.append(i)

print("Number of non-stop words (with no punctuation) in 1865-Lincoln.txt is: " + str(len(final_text)))

print("\nThe frequency distribution of unique words in the 1865-Lincoln.txt file is as follows:\n")

LFD = nltk.FreqDist(final_text)  # frequency distribution of the unique words
print(LFD)
print('\n')

# Count the frequency of each unique word
frequency_counting = Counter(final_text)
print(frequency_counting)
print("\n")

# Strip punctuation characters from a copy of the raw text, then re-tokenize
# and drop stopwords (the original text in 'a' is kept for the tagging step below)
a_nopunct = a
for i in string.punctuation:
    a_nopunct = a_nopunct.replace(i, "")

s = stopwords.words('english')
w = nltk.word_tokenize(a_nopunct)
filtered_words = [i for i in w if i not in s]

fdist = nltk.FreqDist(filtered_words)  # frequency distribution of the most used words

print("Top 20 unique words in the 1865-Lincoln.txt file are:\n")

list1 = []
for (ww, frequency) in fdist.most_common(20):
    list1.append(ww)

print(list1)  # print the list of the top 20 most used words
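# Question 1.1 also asks for a histogram of the top 20 words; the code above only
# prints them.  A minimal sketch using matplotlib (assuming it is installed) that
# reuses fdist from above; NLTK's fdist.plot(20) would give a quick line plot instead.
import matplotlib.pyplot as plt

top_words, counts = zip(*fdist.most_common(20))  # the 20 most common words and counts

plt.bar(range(len(top_words)), counts)
plt.xticks(range(len(top_words)), top_words, rotation=45, ha='right')
plt.xlabel('Word')
plt.ylabel('Frequency')
plt.title('Top 20 most common words in 1865-Lincoln.txt')
plt.tight_layout()
plt.show()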

# Part-of-speech tagging of the 1865-Lincoln speech using a UnigramTagger
print("\nConducting part-of-speech tagging on 1865-Lincoln.txt, please be patient...")

unigram_tagger = UnigramTagger(brown.tagged_sents(categories='news'))

tokenized_sentences = sent_tokenize(a)  # sentence tokenization of the original text
llist1 = []

for i in tokenized_sentences:
    tokenized_words = word_tokenize(i)
    tagged_words = unigram_tagger.tag(tokenized_words)
    llist1.append(tagged_words)

tagged_file = " ".join(str(i) for i in llist1)

files = open("tagged_1865-Lincoln.txt", "w")  # output the tagged text to a file
files.write(tagged_file)
files.close()

print("\ntagged_1865-Lincoln.txt created successfully")