{ "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "### Today we are going to perform a simple sentiment classification of Amazon reviews.\n", "\n", "### Please download the dataset amazon_baby.csv." ] }, { "cell_type": "code", "execution_count": 1, "metadata": { "collapsed": false }, "outputs": [ { "data": { "text/html": [ "
<div>\n", "<table>\n", "<thead>\n", "<tr><th></th><th>name</th><th>review</th><th>rating</th></tr>\n", "</thead>\n", "<tbody>\n", "<tr><th>0</th><td>Planetwise Flannel Wipes</td><td>These flannel wipes are OK, but in my opinion ...</td><td>3</td></tr>\n", "<tr><th>1</th><td>Planetwise Wipe Pouch</td><td>it came early and was not disappointed. i love...</td><td>5</td></tr>\n", "<tr><th>2</th><td>Annas Dream Full Quilt with 2 Shams</td><td>Very soft and comfortable and warmer than it l...</td><td>5</td></tr>\n", "<tr><th>3</th><td>Stop Pacifier Sucking without tears with Thumb...</td><td>This is a product well worth the purchase. I ...</td><td>5</td></tr>\n", "<tr><th>4</th><td>Stop Pacifier Sucking without tears with Thumb...</td><td>All of my kids have cried non-stop when I trie...</td><td>5</td></tr>\n", "</tbody>\n", "</table>\n", "</div>" ], "text/plain": [ " name \\\n", "0 Planetwise Flannel Wipes \n", "1 Planetwise Wipe Pouch \n", "2 Annas Dream Full Quilt with 2 Shams \n", "3 Stop Pacifier Sucking without tears with Thumb... \n", "4 Stop Pacifier Sucking without tears with Thumb... \n", "\n", " review rating \n", "0 These flannel wipes are OK, but in my opinion ... 3 \n", "1 it came early and was not disappointed. i love... 5 \n", "2 Very soft and comfortable and warmer than it l... 5 \n", "3 This is a product well worth the purchase. I ... 5 \n", "4 All of my kids have cried non-stop when I trie... 5 " ] }, "execution_count": 1, "metadata": {}, "output_type": "execute_result" } ], "source": [ "import pandas as pd\n", "import numpy as np\n", "import matplotlib.pyplot as plt\n", "import string\n", "from sklearn.linear_model import LogisticRegression\n", "\n", "def remove_punctuation(text):\n", "    # Delete every character listed in string.punctuation.\n", "    translator = str.maketrans('', '', string.punctuation)\n", "    return text.translate(translator)\n", "\n", "baby_df = pd.read_csv('amazon_baby.csv')\n", "baby_df.head()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Exercise 1 (data preparation)\n", "a) Remove punctuation from reviews using the given function.  \n", "b) Replace all missing (NaN) reviews with the empty string \"\".  \n", "c) Drop all the entries with rating = 3, as they have neutral sentiment.  \n", "d) Set all positive ($\\geq$4) ratings to 1 and negative ($\\leq$2) ratings to -1.  \n", "\n", "Note: missing reviews are read as NaN (a float), so `remove_punctuation` raises an error on them; you may want to do b) before a)." 
] }, { "cell_type": "code", "execution_count": 5, "metadata": { "collapsed": false }, "outputs": [ { "data": { "text/plain": [ "True" ] }, "execution_count": 5, "metadata": {}, "output_type": "execute_result" } ], "source": [ "#a)\n", "\n", "#short test: \n", "baby_df[\"review\"][4] == 'All of my kids have cried nonstop when I tried to ween them off their pacifier until I found Thumbuddy To Loves Binky Fairy Puppet It is an easy way to work with your kids to allow them to understand where their pacifier is going and help them part from itThis is a must buy book and a great gift for expecting parents You will save them soo many headachesThanks for this book You all rock'\n", "remove_punctuation(baby_df[\"review\"][4]) == 'All of my kids have cried nonstop when I tried to ween them off their pacifier until I found Thumbuddy To Loves Binky Fairy Puppet It is an easy way to work with your kids to allow them to understand where their pacifier is going and help them part from itThis is a must buy book and a great gift for expecting parents You will save them soo many headachesThanks for this book You all rock'" ] }, { "cell_type": "code", "execution_count": 3, "metadata": { "collapsed": false }, "outputs": [ { "data": { "text/plain": [ "False" ] }, "execution_count": 3, "metadata": {}, "output_type": "execute_result" } ], "source": [ "#b)\n", "\n", "#short test:\n", "baby_df[\"review\"][38] == baby_df[\"review\"][38]" ] }, { "cell_type": "code", "execution_count": 4, "metadata": { "collapsed": false }, "outputs": [ { "data": { "text/plain": [ "16779" ] }, "execution_count": 4, "metadata": {}, "output_type": "execute_result" } ], "source": [ "#c)\n", "\n", "#short test:\n", "sum(baby_df[\"rating\"] == 3)" ] }, { "cell_type": "code", "execution_count": 5, "metadata": { "collapsed": false }, "outputs": [ { "data": { "text/plain": [ "168348" ] }, "execution_count": 5, "metadata": {}, "output_type": "execute_result" } ], "source": [ "#d) \n", "\n", "#short test:\n", "sum(baby_df[\"rating\"]**2 != 1)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## CountVectorizer\n", "To analyze strings, we need to represent them numerically. We will use one of the simplest string representations (often called bag-of-words), which transforms each string into an $n$-dimensional vector. The number of dimensions $n$ is the size of our dictionary, and each component of the vector counts the appearances of the corresponding word in the string." ] }, { "cell_type": "code", "execution_count": 6, "metadata": { "collapsed": false }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "['adore', 'and', 'apples', 'bananas', 'dislike', 'hate', 'like', 'oranges', 'they', 'we']\n", "[[0 0 1 0 0 0 1 0 0 1]\n", " [0 0 0 0 0 1 0 1 0 1]\n", " [1 0 0 1 0 0 0 0 0 0]\n", " [0 1 1 0 0 0 2 1 0 1]\n", " [0 0 0 1 1 0 0 0 1 0]]\n" ] } ], "source": [ "from sklearn.feature_extraction.text import CountVectorizer\n", "\n", "vectorizer = CountVectorizer()\n", "reviews_train_example = [\"We like apples\",\n", "                         \"We hate oranges\",\n", "                         \"I adore bananas\",\n", "                         \"We like like apples and oranges\",\n", "                         \"They dislike bananas\"]\n", "\n", "# Learn the dictionary from the training sentences and count the words.\n", "X_train_example = vectorizer.fit_transform(reviews_train_example)\n", "\n", "# get_feature_names_out() requires scikit-learn >= 1.0; older versions use get_feature_names().\n", "print(vectorizer.get_feature_names_out().tolist())\n", "print(X_train_example.todense())" ] }, { "cell_type": "code", "execution_count": 7, "metadata": { "collapsed": false }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "[[0 0 0 1 0 0 1 0 1 0]\n", " [0 1 1 1 0 1 0 1 0 1]\n", " [0 0 0 1 0 0 0 0 0 1]]\n" ] } ], "source": [ "reviews_test_example = [\"They like bananas\",\n", "                        \"We hate oranges bananas and apples\",\n", "                        \"We love bananas\"] #New word!\n", "\n", "# Reuse the dictionary learned above; do not fit again on test data.\n", "X_test_example = vectorizer.transform(reviews_test_example)\n", "\n", "print(X_test_example.todense())" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "We should acknowledge a few facts. Firstly, CountVectorizer does not take word order into account. 
Secondly, it ignores one-letter words (the default token_pattern only matches tokens of two or more characters; this can be changed during initialization). Finally, when transforming test data, CountVectorizer ignores words that are not in its dictionary." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Exercise 2 \n", "a) Split the dataset into training and test sets.  \n", "b) Transform the reviews into vectors using CountVectorizer. " ] }, { "cell_type": "code", "execution_count": 8, "metadata": { "collapsed": true }, "outputs": [], "source": [ "#a)\n" ] }, { "cell_type": "code", "execution_count": 9, "metadata": { "collapsed": true }, "outputs": [], "source": [ "#b)\n" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Exercise 3 \n", "a) Train a LogisticRegression model on the training data (reviews processed with CountVectorizer, ratings left as they are and used as labels).  \n", "b) Print the 10 most positive and the 10 most negative words." ] }, { "cell_type": "code", "execution_count": 10, "metadata": { "collapsed": true }, "outputs": [], "source": [ "#a)\n", "model = LogisticRegression()\n" ] }, { "cell_type": "code", "execution_count": 11, "metadata": { "collapsed": true }, "outputs": [], "source": [ "#b)\n", "\n", "#hint: model.coef_, vectorizer.get_feature_names_out() (get_feature_names() in scikit-learn < 1.0)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Exercise 4 \n", "a) Predict the sentiment of the test data reviews.  \n", "b) Predict the sentiment of the test data reviews in terms of probability.  \n", "c) Find the five most positive and the five most negative reviews.  \n", "d) Calculate the accuracy of the predictions." 
] }, { "cell_type": "code", "execution_count": 12, "metadata": { "collapsed": true }, "outputs": [], "source": [ "#a)\n" ] }, { "cell_type": "code", "execution_count": 13, "metadata": { "collapsed": true }, "outputs": [], "source": [ "#b)\n", "\n", "#hint: model.predict_proba()" ] }, { "cell_type": "code", "execution_count": 14, "metadata": { "collapsed": true }, "outputs": [], "source": [ "#c) \n", "\n", "#hint: use the results of b)" ] }, { "cell_type": "code", "execution_count": 15, "metadata": { "collapsed": true }, "outputs": [], "source": [ "#d) \n" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Exercise 5\n", "In this exercise we will limit the dictionary of CountVectorizer to the set of significant words defined below.\n", "\n", "\n", "a) Redo exercises 2-4 using the limited dictionary.  \n", "b) Check the impact of each word from the dictionary (e.g. its model coefficient).  \n", "c) Compare the accuracy of predictions and the evaluation time with those of the full-dictionary model." ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": true }, "outputs": [], "source": [ "significant_words = ['love','great','easy','old','little','perfect','loves','well','able','car','broke','less','even','waste','disappointed','work','product','money','would','return']" ] }, { "cell_type": "code", "execution_count": 16, "metadata": { "collapsed": true }, "outputs": [], "source": [ "#a)\n" ] }, { "cell_type": "code", "execution_count": 17, "metadata": { "collapsed": true }, "outputs": [], "source": [ "#b)\n" ] }, { "cell_type": "code", "execution_count": 18, "metadata": { "collapsed": true }, "outputs": [], "source": [ "#c)\n", "\n", "#hint: %time, %timeit" ] } ], "metadata": { "anaconda-cloud": {}, "kernelspec": { "display_name": "Python [conda env:anaconda2]", "language": "python", "name": "conda-env-anaconda2-py" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 2 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", 
"pygments_lexer": "ipython2", "version": "2.7.12" } }, "nbformat": 4, "nbformat_minor": 4 }