Yelp Academic Dataset


This project was done as part of a class on Coursera for learning SQL. I learned a lot about the data in this dataset using these methods and became much more comfortable using the SQL query language after having finished this project. The SQL code for creating the original tables, the Python code for importing the data into the tables, and the SQL code for querying the database are all included on this website.


By inspection of the JSON files that can be downloaded here: https://business.yelp.com/data/resources/open-dataset/ it can be determined what tables are needed and what columns are needed in each of those tables. Upon further inspection it can be determined what are good choices for the primary keys and the foreign keys to ensure data integrity. For example, since each business has one and only one identification for it, that is a good primary key for the business table (that's the definition of a primary key). Since friends also have to be established users, a foreign key for the friends table is the friend ID that refers back to a specific user ID in the users table. Having plenty of interconnections like this ensures that there aren't ghost entries when the data is imported and that if there are, they are identified and can be removed before data analysis begins. Here's a snippet of a JSON file from the original dataset:


{"user_id":"2WnXYQFK0hXEoTxPtV2zvg","name":"anonymized","review_count":665,"yelping_since":"2008-07-25 10:41:00","useful":2086,"funny":1010,"cool":1003,"elite":"2009,2010,2011,2012,2013","friends":"LuO3Bn4f3rlhyHIaNfTlnA, j9B4XdHUhDfTKVecyWQgyA, pypZb3V5TXHOnlTj-qLSrw,"fans":52,"average_stars":3.32,"compliment_hot":89,"compliment_more":13,"compliment_profile":10,"compliment_cute":17,"compliment_list":3,"compliment_note":66,"compliment_plain":96,"compliment_cool":119,"compliment_funny":119,"compliment_writer":35,"compliment_photos":18}

As can be seen, this JSON file "users.JSON" is a good candidate for a table, and each one of these variables in quotes followed by a colon is a good candidate for a column. So there is an intuitive data structure built into this particular dataset that allows for smooth transitioning into a SQLite database. This is not guaranteed by any means in a given dataset. This is a very "nice" dataset for SQL analysis.


After careful inspection of the files and careful coding for the SQL tables, I had a good sense of the data and the connections between the tables, and so I was in a good position to create an ERD (Entity-Relationship Diagram) using Microsoft Paint. It looks like this:


ERD for Yelp academic dataset

The primary purpose of the project was to get a sense of what it meant to be a 5 star business and what it meant to be a 1 star business. To boil down the findings into their essence, I will simply say the best businesses had the highest frequency of positive words in the reviews and the worst businesses had the highest frequency of negative words in their reviews. These words that were analyzed are linked to instrinsic properties of the object to which they are applied. Words like excellence, quality, good, and kind, and words like lousy, poor, nasty, and rude are simple but they are strong differentiators between good businesses and bad businesses.


Statistical significance testing was done to confirm that there was indeed a significant difference in the number of occurrences of these words in the reviews of the best businesses versus the worst businesses based on the total number of reviews (Chi-squared testing). When the negative words appeared in the best business reviews it was virtually always a "false positive" or perhaps better stated to be a false occurrence of a negative word (in other words, they weren't being accused of being the negative word, the negative word was being used in a different context to distinguish the good business from a bad business, so it was not being used in the normal context that it would be applied to an actually bad business). When the positive words were found in the worst business reviews they were also almost always "false positives" or in other words the word "kind" showed up but it was in the context of, for example, "The customer service team was not kind to me." So there was some noise in the data but even with the noise there was a very strongly significant difference in the frequency of the positive words when the best businesses were compared with the worst businesses as determined by their average rating on Yelp.


When the users were analyzed that essentially made the best business on Yelp the best business by writing all the reviews for it, as predicted there was no particular characteristic that was common to all the users. They were all different when the various parameters describing the users were compared, there was no common feature that stood out amongst the different users. This was predictable because a good business is a good business to a person who yelps a lot, and they are a good business to a person who yelps only a little. A good business is a good business to a very popular user, and it is a good business to a user with few friends. A good business is a good business to long-standing Yelp users, and a good business is a good business to very new Yelp users. This study lends credence to the Yelp construct as it confirms that there are instrinsic properties to the highest rated businesses that transcend any distorting features of the user groups who are leaving the reviews. For example it is possible that the best business is only the best business because the people who frequent that business are users on Yelp whose average stars per review is always 5. In other words, they never leave bad reviews. That was not the case though. So there is some definite legitimacy to the best and worst businesses on Yelp. There are some predictable trends and characteristics of these businesses. A visual example of this is the number of occurrences of negative words like lousy, nasty, rude etc. versus the number of stars on Yelp, seen below:

Graph of negative words in reviews of various ratings

It's amazing how sharp the drop is in negative word occurrences (y-axis) as the rating on Yelp (x-axis) increases. It requires an exponent in the regression model to capture the behavior of this relationship.


One of the most interesting findings during my analysis was that there was a very wide distribution of average rating per business based on zip code (it could be the case that all zip codes had an average rating that was about the same, but this was far from the reality in the data). This confirms a suspicion that the location of a business can be somehow related to the number of stars that business receives for its ratings. The extent to which this is true is astounding. Although deep investigation into this phenomenon was not made, it is something worth further study (for example, determining whether there was a statistically adequate number of businesses per zip code to be making distinctions in the average rating for those zip codes is a question that remains unanswered at this time). Perhaps this finding ties back to the old saying "Location! Location! Location!" in business circles.


The python script "import_data.py" found at the hyperlink below imports the data into SQLite after a new database is created using the menus in SQLite and the structure of the database has been formed by "creating_tables_for_database.sql".


This is the Python code I used to import the data into the SQLite database tables that I created. As is evident the SQLite3 library was imported into the script first. The script is carefully designed to weed out any ghost entries in the tables, including reviews/tips/checkins by users that are not in the users table, reviews/tips/checkins of businesses that are not in the business table, etc. This ensures data integrity.


Scripts used for this analysis



Thank you for considering this data analysis on the Yelp academic data set. I hope it was informative and educating to you. If you have any feedback or comments please send me an email at andyhayles@gmail.com.