Max Garber Metis Data Science Blog:

Reddit Discussion of TV - Natural Language Processing

Reddit Discussion of TV episodes

Max Garber

Project Description Backstory:

Using data we scrape from the web, what can we learn about our data via unsupervised learning techniques? Extend your analysis by combining unsupervised and supervised learning methods, or by developing a recommender system, etc.

Data:

type: text data

acquisition: APIs, scraping, etc.

storage: mongodb

Skills & Tools:

flask

mongodb

nlp

unsupervised learning

dimensionality reduction

topic modeling

recommender systems

Analysis:

Unsupervised learning (clustering and/or dimensionality reduction) is required; other types of modeling (listed above) are encouraged.



I looked at the discussion on a TV show's subreddit following the premiere of each episode of the show. I wanted to determine whether there was a relationship between the discussion of an episode and the rating of that episode.

The Data

Using the third-party pushshift.io API for Reddit, I collected the top 100 posts (by score) from the show's subreddit that were made between 1 and 3 days after each episode premiered.
I also got information about each episode from The Movie Database and scraped ratings from Rotten Tomatoes.
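As a rough sketch of the collection step, the query window for one episode could be built like this. The endpoint URL, the parameter names (`after`, `before`, `sort_type`, `sort`, `size`), the subreddit name, and the premiere date are all assumptions for illustration; Pushshift's public API has changed over time, so check them before relying on this. No request is sent here.

```python
from datetime import datetime, timedelta, timezone

# Assumed endpoint; not confirmed by the original write-up.
PUSHSHIFT_URL = "https://api.pushshift.io/reddit/search/submission/"


def build_query(subreddit, premiere, limit=100):
    """Build query parameters for the top posts made 1 to 3 days after `premiere`."""
    start = premiere + timedelta(days=1)
    end = premiere + timedelta(days=3)
    return {
        "subreddit": subreddit,
        "after": int(start.timestamp()),   # epoch seconds, start of window
        "before": int(end.timestamp()),    # epoch seconds, end of window
        "sort_type": "score",              # rank posts by score...
        "sort": "desc",                    # ...highest first
        "size": limit,
    }


# Placeholder premiere date and subreddit name, for illustration only.
premiere = datetime(2020, 3, 15, tzinfo=timezone.utc)
params = build_query("westworld", premiere)

# Fetching would then look something like:
#   requests.get(PUSHSHIFT_URL, params=params).json()["data"]
```

Keeping the window calculation in its own function makes it easy to loop over every episode's premiere date.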

Text Processing

First, the title and text of each post were combined. Then the text was cleaned, tokenized, and stemmed, stopwords were removed, and the tokens were tagged with parts of speech. Finally, the corpus was converted to a matrix with a Count Vectorizer and TF-IDF.

Text processing
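A minimal sketch of this pipeline with scikit-learn might look like the following. The sample posts, the `title` and `selftext` field names, and the cleaning rules are assumptions for illustration. Stemming and part-of-speech tagging are left out to keep the example self-contained; they could be added with a library such as NLTK.

```python
import re

from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer

# Made-up posts; the field names are assumed, not taken from the project.
posts = [
    {"title": "Episode discussion", "selftext": "The park hosts were amazing!"},
    {"title": "Theory about the maze", "selftext": "What is the maze? http://example.com"},
    {"title": "The hosts and the maze", "selftext": ""},
]


def clean(post):
    """Combine title and body, lowercase, and strip URLs and non-letters."""
    text = f"{post['title']} {post['selftext']}".lower()
    text = re.sub(r"http\S+", " ", text)    # drop URLs
    text = re.sub(r"[^a-z\s]", " ", text)   # keep letters and whitespace only
    return re.sub(r"\s+", " ", text).strip()


docs = [clean(p) for p in posts]

# Token counts with English stopwords removed, then TF-IDF weighting.
counter = CountVectorizer(stop_words="english")
counts = counter.fit_transform(docs)
tfidf = TfidfTransformer().fit_transform(counts)

print(tfidf.shape)  # (number of posts, vocabulary size)
```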

Topic Modeling : Westworld

I decided to use Westworld as my test case since season 3 had recently ended and it was fresh in my memory.
I used Latent Semantic Analysis to extract topics from the data. The topics corresponded to different story lines of the show.
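Latent Semantic Analysis is commonly implemented as a truncated SVD of the TF-IDF matrix. The sketch below assumes scikit-learn's `TruncatedSVD` and uses a tiny made-up corpus with two components; the actual corpus, number of topics, and settings used in the project are not stated here.

```python
import numpy as np
from sklearn.decomposition import TruncatedSVD
from sklearn.feature_extraction.text import TfidfVectorizer

# Tiny made-up corpus standing in for the processed Reddit posts.
docs = [
    "maze host park loop memory",
    "host park maze narrative",
    "city future data system",
    "data system city prediction future",
    "host city data",
]

vectorizer = TfidfVectorizer()
X = vectorizer.fit_transform(docs)

# LSA: project the TF-IDF matrix onto its top singular directions.
svd = TruncatedSVD(n_components=2, random_state=0)
doc_topics = svd.fit_transform(X)  # one row per document, one column per topic

# Vocabulary terms ordered by their column index in X.
terms = sorted(vectorizer.vocabulary_, key=vectorizer.vocabulary_.get)

# For each topic, list the terms with the largest absolute weight.
top_terms = []
for component in svd.components_:
    top = np.argsort(-np.abs(component))[:3]
    top_terms.append([terms[i] for i in top])

for i, words in enumerate(top_terms):
    print(f"topic {i}: {', '.join(words)}")
```

The top terms per component are what a word cloud for each topic would be drawn from.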

4 topic Clouds

Sentiment Analysis : Westworld

I then did sentiment analysis on the data using VADER. There does appear to be a loose correlation between the sentiment of each episode's discussion and the episode's rating, though it was not as strong as I had hoped.
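One way to make that comparison concrete, assuming each post already has a VADER compound score (which ranges from -1 to 1), is to average the scores per episode and compute a Pearson correlation against the ratings. Both the averaging and the choice of Pearson correlation are assumptions here, and every number below is a made-up placeholder, not a result from this project.

```python
import numpy as np

# Placeholder compound scores per post, grouped by episode (made-up values).
post_scores = {
    1: [0.40, 0.10, -0.20, 0.30],
    2: [0.05, -0.10, 0.20],
    3: [0.50, 0.35, 0.10, 0.25],
    4: [-0.30, 0.00, 0.10],
}

# Placeholder episode ratings on an arbitrary scale (made-up values).
ratings = {1: 8.1, 2: 7.4, 3: 8.8, 4: 7.0}

episodes = sorted(post_scores)
mean_sentiment = np.array([np.mean(post_scores[e]) for e in episodes])
episode_rating = np.array([ratings[e] for e in episodes])

# Pearson correlation between mean sentiment and rating across episodes.
r = np.corrcoef(mean_sentiment, episode_rating)[0, 1]
print(f"Pearson r = {r:.2f}")
```

With only a handful of episodes per season, a coefficient like this is noisy, which fits with the correlation being loose.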

Sentiment

Board Game Geek - Linear Regression

Board Game Geek Ratings

Max Garber

Project Description Backstory:

Using information we scrape from the web, build linear regression models from which we can learn about movies, sports, or categories.

Data:

acquisition: web scraping

storage: flat files

sources: (as listed below or any other publicly available information)

movie: boxofficemojo.com, imdb.com

sports: sports-reference.com

Skills:

basics of the web (requests, HTML, CSS, JavaScript)

web scraping

numpy and pandas

statsmodels, scikit-learn

Analysis:

linear regression is required, other regression methods are optional



The board game website BoardGameGeek (BGG) has ratings for board games based on user reviews. I created a model relating different features of a board game to its BGG rating.

Web Scraping

The top games by number of votes were scraped from BoardGameGeek using Beautiful Soup. Additional information for each game was scraped from its individual BGG page with the help of Selenium. Game expansions were ignored, and games published before 1900 were discarded.
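The filtering rule can be sketched as below. The record layout (`name`, `year`, `is_expansion`) is hypothetical, and the sample games are invented; only the rule itself (drop expansions, drop anything published before 1900) comes from the description above.

```python
# Invented records; the field names are hypothetical.
games = [
    {"name": "Game A", "year": 2017, "is_expansion": False},
    {"name": "Game A: Expansion", "year": 2018, "is_expansion": True},
    {"name": "Old Game", "year": 1850, "is_expansion": False},
    {"name": "Borderline Game", "year": 1900, "is_expansion": False},
]


def keep(game, min_year=1900):
    """Keep base games published in `min_year` or later."""
    return not game["is_expansion"] and game["year"] >= min_year


kept = [g for g in games if keep(g)]
print([g["name"] for g in kept])
```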

Geek Ratings v features

Fit

The data was fit using linear regression with L1 and L2 regularization, with the regularization values found via a grid search.

BGG fit
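A minimal sketch of such a fit is shown here, assuming scikit-learn's `ElasticNet` (which combines L1 and L2 penalties) tuned with `GridSearchCV`. Whether the project used elastic net or separate lasso and ridge models is not stated, and the synthetic data, grid values, and scaling step are all assumptions for illustration.

```python
import numpy as np
from sklearn.linear_model import ElasticNet
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for the game features and ratings.
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 5))
true_coef = np.array([0.8, 0.0, -0.5, 0.0, 0.3])
y = 6.5 + X @ true_coef + rng.normal(scale=0.2, size=200)

# Scale features so the penalty treats them evenly, then fit elastic net.
pipeline = make_pipeline(StandardScaler(), ElasticNet(max_iter=10_000))

# alpha sets the overall penalty strength; l1_ratio sets the L1/L2 mix.
grid = {
    "elasticnet__alpha": [0.001, 0.01, 0.1, 1.0],
    "elasticnet__l1_ratio": [0.1, 0.5, 0.9],
}

search = GridSearchCV(pipeline, grid, cv=5, scoring="r2")
search.fit(X, y)

best = search.best_params_
coefs = search.best_estimator_.named_steps["elasticnet"].coef_
print(best)
print(np.round(coefs, 2))
```

The fitted coefficients are what the feature-by-feature comparisons (basic information, mechanics, categories, publishers) would be read from.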

Basic Game Information

Fitted relationships between basic game information and the game score.

Basic Game Information

Mechanics

Effects of a game's mechanics on the score of the game.

Number of Mechanics

Mechanics

Categories

Effects of a game's categories on the score of the game.

Categories

Publishers

Effects of a game's publisher on the score of the game.

Publishers