Task
This project uses the Naive Bayes algorithm to perform sentiment analysis on the Yelp review dataset. I built the model with scikit-learn, NLTK, and Python's pandas library. My aim is to produce a sentiment analysis model that predicts whether or not a given user liked a restaurant, by predicting the star rating of the review they posted for that restaurant on the Yelp platform.
Implementation
- The review_meta_train.csv dataset that I used contains 28068 reviews, each with the following fields:
- reviewer ID (unique ID of the reviewer who posted the review)
- date (day of the week when the review was posted)
- review ID (ID that uniquely identifies the posted review)
- business ID (unique ID of the business being reviewed)
- vote_funny (number of "funny" votes on the review, given by other users)
- vote_cool (number of "cool" votes on the review, given by other users)
- vote_useful (number of "useful" votes on the review, given by other users)
- rating (1–5 star rating for the restaurant)
- The review_text_train.csv dataset contains the text reviews; it has 28068 rows and one column. I extracted the review column and used it for plotting and visualizing the data to check for correlations.
- The review_text_train.npz file served as my training data and review_text_test.npz as my test data; both files contain sparse matrices. Loading all of these is sketched below.
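To make the data description concrete, here is a minimal loading sketch. It assumes the file names listed above and that the .npz files were written with scipy.sparse.save_npz; the variable names are mine.

```python
import pandas as pd
from scipy.sparse import load_npz

# Review metadata: 28068 rows x 8 columns (file names as listed above)
yelp = pd.read_csv('review_meta_train.csv')

# Raw review text: one column of plain-text reviews
yelp_text = pd.read_csv('review_text_train.csv')

# Pre-computed sparse matrices used as training and test features
# (assumes they were saved with scipy.sparse.save_npz)
x_train = load_npz('review_text_train.npz')
x_test = load_npz('review_text_test.npz')
```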
I started by importing the required libraries, including NLTK, which I used for stopword removal. I then loaded the provided dataset into a pandas DataFrame called yelp. I used the .shape attribute to find the number of rows and columns in the DataFrame, which was (28068, 8), the .head() method to show the first five rows, the .info() method to get general information about the dataset, and the .describe() method to show summary statistics. Next I imported the file containing the text reviews. To capture the length of every review, I created a new column on yelp called 'review length', which stores the number of characters in each review.

I visualized the data with seaborn's FacetGrid, which let me build a grid of histograms placed side by side, to check whether there is any correlation between the rating and the new review length feature. The plots showed that the review length distribution is similar across all five ratings; however, the count of text reviews is heavily skewed towards the 4-star and 5-star ratings. I then created a box plot of review length for each rating. From the box plot I noted that 1-star and 2-star reviews tend to have much longer text than 4-star and 5-star reviews, although there are many outliers. I went on to group the data by rating to see whether there is a link between features such as vote_cool, vote_useful, and vote_funny. I used pandas' .corr() method to find correlations in the DataFrame and seaborn's heatmap to visualize them. From the plot I noted that vote_funny is strongly correlated with vote_useful, that there is also a strong correlation between vote_useful and review length, and that vote_cool is negatively correlated with the other three features.

To decide whether a review is good or bad, I took the reviews in the yelp DataFrame that were rated either 1 star or 5 stars and stored them in a new DataFrame called yelp_class. The .shape attribute on yelp_class gave 21624 rows and nine columns; the row count dropped because the 2-, 3-, and 4-star reviews were excluded. I then created the x and y variables for classification, where x is the review column and y is the rating column. The reviews were in plain-text format, but the classifier needs a feature vector, so I used the bag-of-words approach to convert the corpus to vector form, in which every unique word in each text is represented by a number. To tokenize the reviews I wrote a function that splits the text into individual words, stores them in a list, and returns the list, removing common stopwords and punctuation along the way. With each review reduced to a list of tokens, I converted every review to a vector so that the scikit-learn algorithm could work on the text.
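The exploration described above could look roughly like the sketch below. The column names 'review' and 'rating' are assumptions based on the data description, and yelp/yelp_text are the DataFrames from the loading sketch.

```python
import seaborn as sns
import matplotlib.pyplot as plt

# Character count of every review, stored as a new column on the metadata frame
yelp['review length'] = yelp_text['review'].str.len()

# Grid of review-length histograms, one panel per star rating
g = sns.FacetGrid(yelp, col='rating')
g.map(plt.hist, 'review length', bins=50)

# Box plot of review length for each rating
plt.figure()
sns.boxplot(x='rating', y='review length', data=yelp)

# Correlations between the per-rating feature means, visualised as a heatmap
plt.figure()
rating_means = yelp.groupby('rating').mean(numeric_only=True)
sns.heatmap(rating_means.corr(), cmap='coolwarm', annot=True)
plt.show()
```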
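A sketch of the 1-star/5-star filtering and the bag-of-words conversion follows. It assumes the review text has been joined onto the yelp DataFrame as a 'review' column; the helper name text_process is mine.

```python
import string
import nltk
from nltk.corpus import stopwords
from sklearn.feature_extraction.text import CountVectorizer

nltk.download('stopwords')  # needed once for the NLTK stopword list

# Keep only the clearly negative (1-star) and clearly positive (5-star) reviews
yelp_class = yelp[(yelp['rating'] == 1) | (yelp['rating'] == 5)]
x = yelp_class['review']   # plain-text reviews (column name assumed)
y = yelp_class['rating']   # target labels

def text_process(text):
    """Strip punctuation, drop English stopwords, return the remaining words as a list."""
    no_punct = ''.join(ch for ch in text if ch not in string.punctuation)
    return [word for word in no_punct.split()
            if word.lower() not in stopwords.words('english')]

# Bag of words: every review becomes a sparse vector of token counts
bow_transformer = CountVectorizer(analyzer=text_process)
x_vectors = bow_transformer.fit_transform(x)
```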
I used the sparse matrix from the review_text_train.npz file as my training data, assigned to the variable x_train, and the sparse matrix from review_text_test.npz as my test data, assigned to x_test. To train my model I used Multinomial Naive Bayes, a variant of Naive Bayes designed largely for text documents, and fitted it on my training set. I then predicted labels for the test data, printed the first fifteen predictions, and saved the outputs to a file.
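A minimal training and prediction sketch under those assumptions; y_train here stands for the rating column of the training metadata (alignment with x_train is assumed), and the output file name is mine.

```python
import pandas as pd
from sklearn.naive_bayes import MultinomialNB

y_train = yelp['rating']   # labels aligned with the rows of x_train (assumed)

# Fit Multinomial Naive Bayes on the pre-vectorised training matrix
nb = MultinomialNB()
nb.fit(x_train, y_train)

# Predict ratings for the test matrix and inspect the first fifteen predictions
predictions = nb.predict(x_test)
print(predictions[:15])

# Save all predictions to a file
pd.DataFrame({'predicted_rating': predictions}).to_csv('predictions.csv', index=False)
```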
Error analysis
From this model we can predict whether a given user, based on what they typed, liked a given restaurant or not. I was able to convert the reviews from plain text to vectors and use the Naive Bayes algorithm to train, fit, and predict on the dataset. Although the model achieves quite a high accuracy, there are some bias-related issues caused by the dataset. After testing with some individual reviews, I noted what rating the model predicts for each one: it predicted a rating of 5 for a positive review, but it also gave an output of 5 for some negative reviews. The model is somewhat biased because the data contained far more 5-star ratings, so it tends to favour positive reviews over negative ones.
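The single-review check could look like the sketch below. It assumes the fitted vectorizer (bow_transformer) shares its vocabulary with the matrix the model was trained on, and the two example reviews are made up for illustration.

```python
# Hand-written example reviews (made up for illustration)
positive_review = "The food was amazing and the staff were very friendly."
negative_review = "Terrible service and the food was cold; I will not come back."

for review in (positive_review, negative_review):
    vector = bow_transformer.transform([review])
    print(review, '->', nb.predict(vector)[0])
```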