{
"cells": [
{
"cell_type": "markdown",
"metadata": {},
"source": [
"When exploring a large set of documents -- such as Wikipedia, news articles, or StackOverflow -- it can be useful to get a list of related material. To find relevant documents you typically:\n",
"* Decide on a notion of similarity\n",
"* Find the documents that are most similar\n",
"\n",
"In this assignment you will:\n",
"* Gain intuition for different notions of similarity and practice finding similar documents.\n",
"* Explore the tradeoffs of representing documents using raw word counts and TF-IDF.\n",
"* Explore the behavior of different distance metrics by looking at the Wikipedia pages most similar to President Obama’s page."
]
},
{
"cell_type": "code",
"execution_count": 109,
"metadata": {
"collapsed": false,
"jupyter": {
"outputs_hidden": false
}
},
"outputs": [],
"source": [
"import matplotlib.pyplot as plt\n",
"import numpy as np\n",
"import pandas as pd\n",
"%matplotlib inline"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Load Wikipedia dataset"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"We will be using a dataset of abridged Wikipedia pages. Each element of the dataset consists of a link to the Wikipedia article, the name of the person, and the text of the article (in lowercase)."
]
},
{
"cell_type": "code",
"execution_count": 110,
"metadata": {
"collapsed": false,
"jupyter": {
"outputs_hidden": false
}
},
"outputs": [
{
"data": {
"text/plain": [
"                                                 URI                 name  \\\n",
"0        <http://dbpedia.org/resource/Digby_Morrell>        Digby Morrell   \n",
"1       <http://dbpedia.org/resource/Alfred_J._Lewy>       Alfred J. Lewy   \n",
"2        <http://dbpedia.org/resource/Harpdog_Brown>        Harpdog Brown   \n",
"3  <http://dbpedia.org/resource/Franz_Rottensteiner>  Franz Rottensteiner   \n",
"4               <http://dbpedia.org/resource/G-Enka>               G-Enka   \n",
"\n",
"                                                text  \n",
"0  digby morrell born 10 october 1979 is a former...  \n",
"1  alfred j lewy aka sandy lewy graduated from un...  \n",
"2  harpdog brown is a singer and harmonica player...  \n",
"3  franz rottensteiner born in waidmannsfeld lowe...  \n",
"4  henry krvits born 30 december 1974 in tallinn ...  "
]
},
"execution_count": 110,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"wiki = pd.read_csv('people_wiki.csv')\n",
"wiki.head()"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"If you want to check whether the text on the webpage agrees with the one here, you can display it with the following code:"
]
},
{
"cell_type": "code",
"execution_count": 7,
"metadata": {
"collapsed": true
},
"outputs": [],
"source": [
"# from IPython.display import HTML\n",
"# print(wiki['text'][0])\n",
"# HTML(url=wiki['URI'][0])"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Ex. 1: Extract word count vectors"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"As we have seen in Assignment 4, we can extract word count vectors using the `CountVectorizer` class.\n",
"- make sure you include single-character words by using the parameter: `token_pattern=r\"(?u)\\b\\w+\\b\"`\n",
"- do not use any stopwords\n",
"- take the 10000 most frequent words in the corpus\n",
"- explicitly keep all words, regardless of how many documents they occur in\n",
"- obtain the matrix of word counts"
]
},
{
"cell_type": "code",
"execution_count": 9,
"metadata": {
"collapsed": true
},
"outputs": [],
"source": [
"from sklearn.feature_extraction.text import CountVectorizer\n",
"\n",
"vectorizer = CountVectorizer(# Your code goes here\n",
" )\n",
"WCmatrix = vectorizer.# Your code goes here"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Ex. 2: Find nearest neighbors"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"**a)** Start by finding the nearest neighbors of the Barack Obama page using the above word count matrix to represent the articles and **Euclidean** distance to measure distance.\n",
"Save the distances in `wiki['BO-eucl']` and look at the top 10 nearest neighbors."
]
},
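{
"cell_type": "markdown",
"metadata": {},
"source": [
"As a reminder, and only as a sketch with made-up numbers: if $x$ and $y$ are the word count vectors of two articles over a vocabulary of $V$ words, the Euclidean distance between them is\n",
"\n",
"$$d(x, y) = \\lVert x - y \\rVert_2 = \\sqrt{\\sum_{i=1}^{V} (x_i - y_i)^2}.$$\n",
"\n",
"For a hypothetical three-word vocabulary with $x = (2, 0, 1)$ and $y = (1, 1, 1)$, this gives $d(x, y) = \\sqrt{1^2 + (-1)^2 + 0^2} = \\sqrt{2} \\approx 1.41$. Every word contributes on the same scale, so differences in the counts of very frequent words can dominate the distance."
]
},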
{
"cell_type": "code",
"execution_count": 11,
"metadata": {
"collapsed": true
},
"outputs": [],
"source": [
"# One can use the following:\n",
" # from sklearn.neighbors import NearestNeighbors\n",
" # nbrs = NearestNeighbors(n_neighbors=3, algorithm='brute',metric='euclidean').fit(X.toarray())\n",
" # distances, indices = nbrs.kneighbors(X.toarray())\n",
"# but here let's use:\n",
"from sklearn.metrics import pairwise_distances\n",
"\n",
"dist = pairwise_distances(# Your code goes here\n",
" )\n",
"# Your code goes here\n"
]
},
{
"cell_type": "code",
"execution_count": 12,
"metadata": {
"collapsed": false
},
"outputs": [
{
"data": {
"text/plain": [
"                                                   URI               name  \\\n",
"35817       <http://dbpedia.org/resource/Barack_Obama>       Barack Obama   \n",
"24478          <http://dbpedia.org/resource/Joe_Biden>          Joe Biden   \n",
"28447     <http://dbpedia.org/resource/George_W._Bush>     George W. Bush   \n",
"48202       <http://dbpedia.org/resource/Tony_Vaccaro>       Tony Vaccaro   \n",
"14754        <http://dbpedia.org/resource/Mitt_Romney>        Mitt Romney   \n",
"31423     <http://dbpedia.org/resource/Walter_Mondale>     Walter Mondale   \n",
"36364         <http://dbpedia.org/resource/Don_Bonker>         Don Bonker   \n",
"13229   <http://dbpedia.org/resource/Francisco_Barrio>   Francisco Barrio   \n",
"35357   <http://dbpedia.org/resource/Lawrence_Summers>   Lawrence Summers   \n",
"25258  <http://dbpedia.org/resource/Marc_Ravalomanana>  Marc Ravalomanana   \n",
"\n",
"                                                    text    BO-eucl  \n",
"35817  barack hussein obama ii brk husen bm born augu...   0.000000  \n",
"24478  joseph robinette joe biden jr dosf rbnt badn b...  31.336879  \n",
"28447  george walker bush born july 6 1946 is an amer...  33.645208  \n",
"48202  michelantonio celestino onofrio vaccaro born d...  33.734256  \n",
"14754  willard mitt romney born march 12 1947 is an a...  34.351128  \n",
"31423  walter frederick fritz mondale born january 5 ...  34.423829  \n",
"36364  don leroy bonker born march 7 1937 in denver c...  34.597688  \n",
"13229  francisco javier barrio terrazas born november...  34.669872  \n",
"35357  lawrence henry larry summers born november 30 ...  35.383612  \n",
"25258  marc ravalomanana malagasy ravalumanan born 12...  35.440090  "
]
},
"execution_count": 12,
"metadata": {},
"output_type": "execute_result"
}
],
"source": []
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"**b)** Measure the pairwise distance between the Wikipedia pages of Barack Obama, George W. Bush, and Joe Biden. Which of the three pairs has the smallest distance?"
]
},
{
"cell_type": "code",
"execution_count": 13,
"metadata": {
"collapsed": false
},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"31.336879231984796\n",
"33.645207682521445\n"
]
},
{
"data": {
"text/plain": [
"30.919249667480614"
]
},
"execution_count": 13,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"# Your code goes here"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Most of the 10 people from **a)** are politicians, but about half of them have rather tenuous connections with Obama, other than the fact that they are politicians, e.g.,\n",
"\n",
"* Francisco Barrio is a Mexican politician, and a former governor of Chihuahua.\n",
"* Walter Mondale and Don Bonker are Democrats who made their careers in the late 1970s.\n",
"\n",
"Nearest neighbors with raw word counts got some things right, returning mostly politicians in the query result, but missed finer and important details."
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"**c)** Let's find out why Francisco Barrio was considered a close neighbor of Obama.\n",
"To do this, look at the most frequently used words on each of Barack Obama's and Francisco Barrio's pages."
]
},
{
"cell_type": "code",
"execution_count": 15,
"metadata": {
"collapsed": true
},
"outputs": [],
"source": [
"def top_words(name):\n",
"    \"\"\"\n",
"    Get a table of the most frequent words in the given person's Wikipedia page.\n",
"    \"\"\"\n",
"    # Your code goes here\n",
"\n",
"    return df.sort_values(by='count', ascending=False)"
]
},
{
"cell_type": "code",
"execution_count": 48,
"metadata": {
"collapsed": false,
"jupyter": {
"outputs_hidden": false
}
},
"outputs": [
{
"data": {
"text/plain": [
" count\n",
"the 36\n",
"of 24\n",
"and 18\n",
"in 17\n",
"he 10\n",
"... ...\n",
"governance 1\n",
"governors 1\n",
"has 1\n",
"headed 1\n",
"joining 1\n",
"\n",
"[195 rows x 1 columns]"
]
},
"execution_count": 49,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"barrio_words = top_words('Francisco Barrio')\n",
"barrio_words"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"**d)** Extract the list of most frequent **common** words that appear in both Obama's and Barrio's documents and display the five words that appear most often in Barrio's article.\n",
"\n",
"Use a dataframe operation known as **join**. The **join** operation is very useful when it comes to playing around with data: it lets you combine the content of two tables using a shared column (in this case, the index column of words). See [the documentation](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.join.html) for more details."
]
},
{
"cell_type": "code",
"execution_count": 56,
"metadata": {
"collapsed": false,
"jupyter": {
"outputs_hidden": false
}
},
"outputs": [
{
"data": {
"text/plain": [
" count_Obama count_Barrio\n",
"the 40 36\n",
"of 18 24\n",
"and 21 18\n",
"in 30 17\n",
"he 7 10"
]
},
"execution_count": 56,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"# Modify the code to avoid error.\n",
"\n",
"common_words = obama_words.join(barrio_words)\n",
"common_words.sort_values(by='count_Barrio', ascending=False).head(5)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Collect all words that appear in both the Barack Obama and George W. Bush pages. Out of those words, find the 10 words that show up most often in Obama's page."
]
},
{
"cell_type": "code",
"execution_count": 37,
"metadata": {
"collapsed": false
},
"outputs": [
{
"data": {
"text/plain": [
" count_Obama count_Bush\n",
"the 40 39\n",
"in 30 22\n",
"and 21 14\n",
"of 18 14\n",
"to 14 11\n",
"his 11 6\n",
"act 8 3\n",
"he 7 8\n",
"a 7 6\n",
"as 6 6"
]
},
"execution_count": 37,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"bush_words = top_words('George W. Bush')\n",
"# Modify the code to avoid error.\n",
"obama_words.join(bush_words).head(10)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"**Note.** Even though common words are swamping out important subtle differences, commonalities in rarer political words still matter on the margin. This is why politicians are being listed in the query result instead of musicians, for example. In the next subsection, we will introduce a different metric that will place greater emphasis on those rarer words."
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"**e)** Among the words that appear in both Barack Obama's and Francisco Barrio's articles, take the 15 that appear most frequently in Obama's. How many of the articles in the Wikipedia dataset contain all of those 15 words? Which are they?"
]
},
{
"cell_type": "code",
"execution_count": 86,
"metadata": {
"collapsed": false
},
"outputs": [
{
"data": {
"text/plain": [
"30"
]
},
"execution_count": 86,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"# It might be helpful to use:\n",
"word_to_ind = {v: i for i, v in enumerate(vectorizer.get_feature_names_out())}  # a dictionary with words as keys and indices as values\n",
"\n",
"# Your code goes here\n",
"\n",
"articles.sum()"
]
},
{
"cell_type": "code",
"execution_count": 89,
"metadata": {
"collapsed": false
},
"outputs": [
{
"data": {
"text/plain": [
"1177 Donald Fowler\n",
"1413 Chris Redfern\n",
"3400 James Bilbray\n",
"4004 Paul Kagame\n",
"4874 Bernard Kenny\n",
"6617 Paul Sarlo\n",
"11316 Gy%C3%B6rgy Sur%C3%A1nyi\n",
"12371 Morley Winograd\n",
"12743 David Ibarra Mu%C3%B1oz\n",
"13229 Francisco Barrio\n",
"16095 Charles Taylor (Liberian politician)\n",
"24417 Jesse Ventura\n",
"24478 Joe Biden\n",
"28447 George W. Bush\n",
"29505 Arturo Vallarino\n",
"33744 John O. Agwunobi\n",
"35541 Jimmy Carter\n",
"35817 Barack Obama\n",
"36452 Bill Clinton\n",
"38081 John Garamendi\n",
"39489 Helmut Anheier\n",
"40229 Edward Rowny\n",
"42934 Henry Sanders (Alabama politician)\n",
"48253 Saber Hossain Chowdhury\n",
"50868 Russell Trood\n",
"52229 Robert Lewis Morgan\n",
"53102 Ewart Brown\n",
"54765 Chuck Wolfe (executive)\n",
"55495 Lokman Singh Karki\n",
"56172 Hu Jintao\n",
"Name: name, dtype: object"
]
},
"execution_count": 89,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"wiki[articles]['name']"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Ex. 3: TF-IDF to the rescue"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Many of the perceived commonalities between Obama and Barrio were due to occurrences of extremely frequent words, such as \"the\", \"and\", and \"his\". So nearest neighbors sometimes recommends plausible results for the wrong reasons.\n",
"\n",
"To retrieve articles that are more relevant, we should focus more on rare words that don't occur in every article. **TF-IDF** (term frequency–inverse document frequency) is a feature representation that penalizes words that are too common."
]
},
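{
"cell_type": "markdown",
"metadata": {},
"source": [
"For reference, a sketch of the weighting used in this exercise, assuming scikit-learn's `TfidfTransformer` with `smooth_idf=False` and `norm=None` (the symbols $n$, $\\mathrm{df}$ and $\\mathrm{tf}$ are our own notation): for a corpus of $n$ documents, a word $t$ that occurs in $\\mathrm{df}(t)$ of them and $\\mathrm{tf}(t, d)$ times in document $d$ gets the weight\n",
"\n",
"$$\\mathrm{tfidf}(t, d) = \\mathrm{tf}(t, d) \\cdot \\left( \\ln \\frac{n}{\\mathrm{df}(t)} + 1 \\right).$$\n",
"\n",
"As a made-up example with $n = 1000$: a word that occurs in every document has $\\ln(1000/1000) + 1 = 1$, so its weight is just its raw count, while a word that occurs in only 10 documents has $\\ln(100) + 1 \\approx 5.61$, so each occurrence counts more than five times as much."
]
},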
{
"cell_type": "markdown",
"metadata": {},
"source": [
"**a)** Repeat the search for the 10 nearest neighbors of Barack Obama, this time using Euclidean distance on TF-IDF vectors. Do not limit the vocabulary to the 10000 most frequent words; use all of them."
]
},
{
"cell_type": "code",
"execution_count": 111,
"metadata": {
"collapsed": true
},
"outputs": [],
"source": [
"# We could use:\n",
" # from sklearn.feature_extraction.text import TfidfVectorizer\n",
"# but since we already know how to compute CountVectorizer, let's use:\n",
"from sklearn.feature_extraction.text import TfidfTransformer\n",
"\n",
"vectorizer = CountVectorizer(# Your code goes here\n",
" )\n",
"WCmatrix=vectorizer.# Your code goes here\n",
"\n",
"tfidf=TfidfTransformer# Your code goes here; use smooth_idf=False, norm=None\n",
"TFIDFmatrix = # Your code goes here"
]
},
{
"cell_type": "code",
"execution_count": 113,
"metadata": {
"collapsed": false
},
"outputs": [
{
"data": {
"text/plain": [
" name BO-eucl-TF-IDF\n",
"35817 Barack Obama 0.000000\n",
"38376 Samantha Power 139.364493\n",
"46811 Jeff Sessions 139.757740\n",
"7914 Phil Schiliro 139.812175\n",
"38714 Eric Stern (politician) 140.450064\n",
"6507 Bob Menendez 141.661111\n",
"44681 Jesse Lee (politician) 142.342440\n",
"6796 Eric Holder 142.490179\n",
"38495 Barney Frank 142.581337\n",
"56008 Nathan Cullen 142.751073"
]
},
"execution_count": 113,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"# now recompute the distances as before but for TF-IDF\n",
"dist = pairwise_distances(# Your code goes here\n",
" )\n",
"# add the distances as a column in the wiki dataframe\n",
"wiki['BO-eucl-TF-IDF'] = # Your code goes here\n"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Let's determine whether this list makes sense.\n",
"* With the notable exception of Nathan Cullen, the other 8 are all American politicians who are contemporaries of Barack Obama.\n",
"* Phil Schiliro, Jesse Lee, Samantha Power, Eric Stern, and Eric Holder worked for Obama.\n",
"\n",
"Clearly, the results are more plausible with the use of TF-IDF. Let's take a look at the word vectors for Obama's and Schiliro's pages. Notice that the TF-IDF representation assigns a weight to each word. This weight captures the relative importance of that word in the document."
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"**b)** Sort the words in Obama's article by their TF-IDF weights; do the same for Schiliro's article as well.\n",
"Using the **join** operation we learned earlier, compute the common words shared by Obama's and Schiliro's articles.\n",
"Sort the common words by their TF-IDF weights in Obama's document."
]
},
{
"cell_type": "code",
"execution_count": 145,
"metadata": {
"collapsed": true
},
"outputs": [],
"source": [
"def top_words_tf_idf(name):\n",
"    \"\"\"\n",
"    Get a table of the words with the largest TF-IDF weights in the given person's Wikipedia page.\n",
"    \"\"\"\n",
"    # Your code goes here\n",
"\n",
"    return df.sort_values(by='tf-idf', ascending=False)"
]
},
{
"cell_type": "code",
"execution_count": 116,
"metadata": {
"collapsed": false,
"jupyter": {
"outputs_hidden": false
}
},
"outputs": [
{
"data": {
"text/plain": [
" tf-idf_Obama tf-idf_Schiliro\n",
"obama 52.295653 11.621256\n",
"the 40.004063 10.001016\n",
"in 30.028962 5.004827\n",
"and 21.015648 6.004471\n",
"law 20.722936 10.361468\n",
"of 18.074811 9.037406\n",
"democratic 16.410689 8.205344\n",
"to 14.657229 7.328615\n",
"his 13.888726 1.262611\n",
"senate 13.164288 4.388096\n",
"president 11.226869 14.033587\n",
"presidential 9.386955 4.693478\n",
"he 8.493580 13.347054\n",
"states 8.473201 2.824400\n",
"2011 8.107041 5.404694"
]
},
"execution_count": 116,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"obama_tf_idf = top_words_tf_idf('Barack Obama')\n",
"schiliro_tf_idf = top_words_tf_idf('Phil Schiliro')\n",
"common_words = # Your code goes here"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"**c)** Among the words that appear in both Barack Obama's and Phil Schiliro's articles, take the 15 that have the largest weights in Obama's. How many of the articles in the Wikipedia dataset contain all of those 15 words? Which are they?"
]
},
{
"cell_type": "code",
"execution_count": 117,
"metadata": {
"collapsed": false
},
"outputs": [
{
"data": {
"text/plain": [
"3"
]
},
"execution_count": 117,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"# It might be helpful to use:\n",
"word_to_ind = {v: i for i, v in enumerate(vectorizer.get_feature_names_out())}  # a dictionary with words as keys and indices as values\n",
"\n",
"# Your code goes here\n",
"\n",
"articles.sum()"
]
},
{
"cell_type": "code",
"execution_count": 118,
"metadata": {
"collapsed": false
},
"outputs": [
{
"data": {
"text/plain": [
"7914 Phil Schiliro\n",
"24478 Joe Biden\n",
"35817 Barack Obama\n",
"Name: name, dtype: object"
]
},
"execution_count": 118,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"wiki[articles]['name']"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Notice the huge difference in this calculation using TF-IDF scores instead of raw word counts. We've eliminated noise arising from extremely common words."
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Ex. 4: Choosing metrics"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"**a)** Compute the Euclidean distance between TF-IDF features of Obama and Biden."
]
},
{
"cell_type": "code",
"execution_count": 122,
"metadata": {
"collapsed": false
},
"outputs": [
{
"data": {
"text/plain": [
"148.7784541307789"
]
},
"execution_count": 122,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"dist = # Your code goes here"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"The distance is larger than the distances we found for the 10 nearest neighbors, which we repeat here for readability:"
]
},
{
"cell_type": "code",
"execution_count": 121,
"metadata": {
"collapsed": false,
"jupyter": {
"outputs_hidden": false
}
},
"outputs": [
{
"data": {
"text/plain": [
" name BO-eucl-TF-IDF\n",
"35817 Barack Obama 0.000000\n",
"38376 Samantha Power 139.364493\n",
"46811 Jeff Sessions 139.757740\n",
"7914 Phil Schiliro 139.812175\n",
"38714 Eric Stern (politician) 140.450064\n",
"6507 Bob Menendez 141.661111\n",
"44681 Jesse Lee (politician) 142.342440\n",
"6796 Eric Holder 142.490179\n",
"38495 Barney Frank 142.581337\n",
"56008 Nathan Cullen 142.751073"
]
},
"execution_count": 121,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"wiki.sort_values(by='BO-eucl-TF-IDF',ascending=True)[['name','BO-eucl-TF-IDF']][0:10]"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"But one may wonder: is Biden's article really that different from Obama's, more so than, say, Schiliro's? It turns out that, when we compute nearest neighbors using Euclidean distance, we unwittingly favor short articles over long ones."
]
},
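{
"cell_type": "markdown",
"metadata": {},
"source": [
"One way to see this, as a rough sketch: write $x$ for Obama's TF-IDF vector and $y$ for the vector of any other article. Expanding the squared Euclidean distance gives\n",
"\n",
"$$\\lVert x - y \\rVert_2^2 = \\lVert x \\rVert_2^2 + \\lVert y \\rVert_2^2 - 2\\, x \\cdot y.$$\n",
"\n",
"The term $\\lVert x \\rVert_2^2$ is the same for every candidate, and all entries are non-negative, so $x \\cdot y \\ge 0$. A long article tends to have a large norm $\\lVert y \\rVert_2$, which it must offset with a large overlap $x \\cdot y$ to be ranked as close; a short article has a small norm and can end up nearby even when it shares few words with Obama's page."
]
},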
{
"cell_type": "markdown",
"metadata": {},
"source": [
"**b)** Let us compute the length of each Wikipedia document, and examine the document lengths for the 100 nearest neighbors to Obama's page. To compute text length use the same splitting rules you used in `vectorizer`."
]
},
{
"cell_type": "code",
"execution_count": 133,
"metadata": {
"collapsed": false,
"jupyter": {
"outputs_hidden": false
}
},
"outputs": [],
"source": [
"tokenizer = # Your code goes here\n",
"\n",
"def compute_length(row):\n",
"    # Here we could simply use:\n",
"    #     return len(row['text'].split(' '))\n",
"    return len(tokenizer(row['text']))\n",
"\n",
"wiki['length'] = # Your code goes here"
]
},
{
"cell_type": "code",
"execution_count": 134,
"metadata": {
"collapsed": false,
"jupyter": {
"outputs_hidden": false
}
},
"outputs": [
{
"data": {
"text/plain": [
" name length BO-eucl-TF-IDF\n",
"35817 Barack Obama 540 0.000000\n",
"38376 Samantha Power 310 139.364493\n",
"46811 Jeff Sessions 230 139.757740\n",
"7914 Phil Schiliro 208 139.812175\n",
"38714 Eric Stern (politician) 255 140.450064\n",
"... ... ... ...\n",
"12834 Mark Waller (judge) 211 146.796202\n",
"11303 Steven Weinberg 227 146.815087\n",
"8277 John M. Facciola 207 146.823495\n",
"11996 Thomas H. Jackson 216 146.836489\n",
"50366 Patrick Lipton Robinson 201 146.849274\n",
"\n",
"[100 rows x 3 columns]"
]
},
"execution_count": 134,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"nearest_neighbors_euclidean = # Your code goes here"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"**c)** To see how these document lengths compare to the lengths of other documents in the corpus, make a histogram of the document lengths of Obama's 100 nearest neighbors and compare to a histogram of document lengths for all documents."
]
},
{
"cell_type": "code",
"execution_count": 135,
"metadata": {
"collapsed": false,
"jupyter": {
"outputs_hidden": false
}
},
"outputs": [
{
"data": {
"image/png": "iVBORw0KGgoAAAANSUhEUgAAAt8AAAEvCAYAAACdcK1AAAAABHNCSVQICAgIfAhkiAAAAAlwSFlzAAALEgAACxIB0t1+/AAAADh0RVh0U29mdHdhcmUAbWF0cGxvdGxpYiB2ZXJzaW9uMy4xLjEsIGh0dHA6Ly9tYXRwbG90bGliLm9yZy8QZhcZAAAgAElEQVR4nOzdeXxMV//A8c9BSEKCJBIhKpbal5LY9yXWxxK1Vh+0D/Fr1aNqLWljeahqFaXahuKpahG1tQRt0VYTQrraH7WVJvY1kpA4vz/uZMxkJqsslu/79bqv9J57tntnps6c+d5zldYaIYQQQgghRO4rkN8dEEIIIYQQ4kkhg28hhBBCCCHyiAy+hRBCCCGEyCMy+BZCCCGEECKPyOBbCCGEEEKIPCKDbyGEEEIIIfKIDL6FeIIopaYopbRSqnU+tL1LKaVTpeVbf9Lq06NEKTVYKfW7UirOdB1fzWY9vqbyy3O4iyIT8vtzkBMe9c+SEHlJBt9CPGIsBkqWW5xS6pxS6hulVLBSyicX2m1tamtKTtedWx7FPmeWUqo5sBwoAiwApgJ78rNPT7rH+UvM4/xZEiKvFcrvDgghsu0osMr0345AaaAJMB0IVkq9rrWem6rMQlOZM3nWy/sGAc750G56HsY+ZVZn09/BWmsZdAshxCNCBt9CPLqOaK2npE5USnUFlgHvKaVuaa0XpxzTWl8CLuVdF+/TWufHgD9dD2OfssDb9Dc2X3shhBAiSyTsRIjHjNZ6M9DLtPuWUqpoyrG0YkuVUn2VUruVUpeUUvFKqdNKqQ2m0AZMPzXvNGUPsQx5sahjlynNSSk1Syl1SimVpJQaYnk8rX4rpV5SSh1WSiUopU6awmcKpcqTZmxs6mNZ6bOdukoppRaYrsMdpVSMUmq5UsrXTt5Tps1FKfW+KW+CUmqfUqp9WuebxjXwNbUTY2r3tKlOD4s8rU19fsGUdDL1eWXQxssW1/mEUmoyUDCd/C2VUluVUldN740/lFJjU782FvmfVUp9Z5H/uFIqVCn1lEWeU0qpU2mUtzlmuiZaKVVRKTVBKfWnqe5flVKdTHlclVIfmK5dvFJqh1Kqahpt1FNKhSmlziulEk31vaWUKpYqnznUQinVQCn1rVLqllLqilJqpVKqlEXeIcBJ0+5gZR0W5pvW9c2I6XruUkpdN53XL0qpIDv5zO9/pdRApdRvptf4rFLqP0opm9dYKeWplFqmjM99nFLqJ6VUm+x8lizqdFBKTTW9dxOVUgeVUgOye/5CPI5k5luIx5DWerdS6nugFdAO2JRWXqXUCIxwlD8xQlJuAWVMZVsDu4FdgC8wGPjetJ+WdUB1YBsQD5zPRJfHAi1M7X8NdMcIn3na1GZ27CLzfTYzDaj2ABWBb4DPgSoYISpdlVLNtNbHUhVzALYDrsAaoCQwANiilPLXWv+eiXarYlxrd2ADRlhRfWCkqd3GWuuLwCmM+O6eQF1gPnAtk+c2DXgDOAd8hDHo/jdGuJK9/H1N538bWA1cBboA7wDNlVKBWmvLLzPzTfVdAMKAK0AFoA+whQcPd5oL+AFfYfz7NRDYpJRqBiwypa3CeN17Al8rpapprZMt+hhoynMH4zrHYlzniUAbpVRLrfWdVO02AMYD32Jct2bAc0BFpVRT0zX4FeO1GAX8Zqo7RaZen9SUUu9gfDZOY1z/20AA8LFSqrrWerSdYiNNeTZiDJi7A5Mxrs1Ei7pdgB+AqsAOYC9QGdjK/YF2il1k/rO0CuN6bja1OQD4XCl1TWsdnqkTF+Jxp7WWTTbZHqEN4x9BDWzIIN9UU75pFmlTTGmtLdJ+Bs4CzqnKK8DNYr+1qeyUNNrbZTq+Dyie1vFUaSn9uQ1Us0h3BKJMx9qm1/8Mzi1TfU6VtsxU5o1U6YNM6TtSpZ8ypX8JOFikDzalf5zJ13WnKf8/U6W/aUpfmip9uSndN5P1Pw0kYc
zOWr6u3hhfkDSw3CLdFWPQeAuobpFeCOOLlQYGWaR3N6VFAa6p2nZK1eYp4FQa/bQ5ZnGuhwF3i/RnTelXgS+AghbHFpiO9bJI8wBuACeAMqnaGGfKP9bO+0cDz1qkFwC+M6U3sfPZXG7v3NJ5bey9dzua0jYAjhbpDqY0DTSwU8cVoLJFuhtwGbgJFLZIn2HK/06qvjxvcc5Z/iwBEUAxi/RWpvRtWbkmssn2OG8SdiLE4yvG9Ncj3VyGOxgDMzNtuJKNdqdora9nscynWusjFm0nYAwmwBgM5AmlVGGgP8Zs6GzLY1rrTzFmN9sopcrZKT5Ga33XYn8lxjX1z0S7T2EMbn7RWq9IdXg2xuB4gKl/2TUAY6b7HcvXVWsdgzFjm1pPoDgQqrU+bJE/CZhg2rX8VeIl099/a61vWFaktY7P5nsptZla68sW++sx3rslgPHaYoYbY6YYjF8HUgwCXEx5/05V9xzgIsbrn9r3WusvU3a01veAT027Gb6+2TQCY9AaZPo8pLR9Fwg27fazU+59rfVxi/xXMH75KoYxy51iIMaX3pmpyq8EDj1AvydprW9ZtP89xheq3LpOQjxyJOxEiMeXymS+1cAs4IBSajXGz8qRWuu4bLa7PxtldqeTVtfOsdxSDWPWPVJrnWjn+PfAM6btL4v0a1rrU5YZtdZJSqnzGAPDjDxj+rsr9QGtdYJSag/QA2Pw9Ecm6rMn5Tr+aOeYveufXp9+VUpdt8gDRmjGTZ27K6/8lqof95RSFzF+tfkrVd6UG1HLWKQ1Mv1trpSqZaf+uxjvgdR+sZN2zvQ3M69vdjTCmKV/WSmbj7KD6W+2+qqUKg6UB6K11lctM2qtten9ViOb/U6r/afspAvxRJLBtxCPr5TVMC5mkG82xs/2L2HMqAUDCUqpVcBrqf9xzoQLWcwPdvqotb6hlErACH/IKyltpRWnHpsqX4q0ZvqTSOdmxhxoNyuKm/7ae33stZuZPlVKVf+f2etapt2wk5aUTjrcH6iCEYIBRlx2Vth7fVPqz8zrmx1uGP9Gh6STp6idtMz01cX0N63/N2TnMwxAGr96JSELPAhhJh8GIR5frUx/052JNoWXhGqt62GsFd4PY4Z3CEasbZZorbPzlLtSqRNMN4Q5Yj2wumf6a2/AkxOD9JS2vNI47pUqX07Ji3ZTBkWe6dSf1T5Z9uca1rPM6blH2oPW3PyyldLfp7XWKq0tF9vPihvAufT6qbVuk826b5r+2nzuTOy9R4QQOUQG30I8hpRSLYCWGGt678hsOa31ea31GowVLf4HdLZYUi4lnjY3Zvqa20lrYfprGWqQsmpEWTv569lJy2qfjwIJQOM04qtb2ulTTvg1Vf1mSqkiGCEICab+ZVdKn1vYOWbv+qfXpzoYIQy/WiTvA1yUUo0z0ZdrgFfq5e+UUuUxVorJLVGmv5npY3bk5GckCiibxv0FD8Q0O30aqK6UsgqbUUaMi73rk5uffyGeKDL4FuIxo5TqgrHyBhg3P6Ubu62U6mBnDWBnjBu07nB/tjnlhjl7A98HNUgpZY5fVUo5cv+Gy88s8kWb/j6vlCpgkb8nxg2LqWWpz6Y479UYITtWy7gppQZiLKG2S+fww3lM9X0P+CmlUt9EN9bUn1Xadgm8rFiFMYAap5RKCb9AKeWN/TCMjRizr0FKqcoW+QsCb5t2P7XI/6Hp7/tKKavZa6WUo2WbGK+jA8ZNoCl5HIB3s3pSWbQMY/WWt5VST6c+qJQqoZSy9yUus1JCtHLiM7LA9PcTU4y2FaVUhQdZPxxjdRhnjGUILT2H/Xjv3Pz8C/FEkZhvIR5d1UwPvwAoghEy0hRjTepEjHjtxWmUtbQGuKWU2o0xG+YMdMUY8P3HtLIDGLOuMUB/pVQcppu4tNazcuBcdgB7TXHmNzCWrauCsQqK5cx9BMYMawCwWykVgbGEXgDGOtJdUtWbnT6PxwjZmaWUaoOxFGMVIBBjybaX0in7IF
7CuPHxc6VUH+AYxmC/I8bygBPSKZshrfUxpdRMjHW+f1dKhWHMYvbDuKZdU+W/rpT6P4wvP9Gm1+YaxjWuhbEe+6cW+b9SSi3AWGf6mFJqI8aA7SmgE/Av7q99/QFGWNNSpVSAqd52GOEQKav05Dit9QXTl6jVwEGl1BaMX3iKYqzr3gr4L/B/2az/llJqH9BKKbUEIwZeAx9mdQUgrfUWpdRbwOvAcaXUNowlQUthrKPfGGOgfCo7fQXewliqcazpC0cUxjrf3TGWkuzI/S/ekLuffyGeLPm91qFsssmWtY37awlbbrcx/jH8BmMmyyeNslOwXb/3JYyHlpzGCG24gPHwjX52yjfFWC3jVkrbFsd2We7bKWtz3LI/wMvAEYwvDqcwBomF7NTjifHgl6tAHMb62A3tnVt2+2xqYyHGQ2HuYNxc+ClQwU7eU2RhzeoMXtuKpnZiTe2eMfXD007e5WRhnW+LcpbX+STGDbaVSGN9atNrsw1jgJwAHMT4IuCQRv39Te+fG6b35f8wHkxTLlW+Dhj3IyRi3NS5EONGQJtrlt65pnWNSWfNbYyZ3eUYK9bcwQjP+hlj1Z9qqc7d7trWaR3DWIFkG0aMfcrnM93XKK33rulYZ4wH1lwy9fUcxq8kYwCPTNZh9xhG3P5yjC+VccBPQBvur5FeLyc+/+kdk022J3FTWmfn3qjsM8WvzcWYqVIYTwx7VWfiZ1zTT9HTMdb9TYk3nKC1/iGdMgMw/qE+p7X2efAzEEIIIR5fSqkfMQbaxbXFmt1CiJyRpzHfSilnjJ+Xq2E8nOGfGD8Z71RK2VsyKbVPgGEYT3z7B8ZPYNuUUs/Yy2y6kWQu95fpEkIIIQTmeP/Uaf0xbsDdIQNvIXJHns58K6VGAe8BVbXpCVxKqQoYP0uO11q/l07Zuhgz3S9qrZeZ0gph/AR6VGvd3U6ZUIwHCcQA7WXmWwghhDAopf7ACCf6DSOkpQ5G7P0toIXW+td0igshsimvVzvpDuzR1o++PYkRZ9YjE2Xvcv+RwWjjMcergI6m5bjMlFLNMMJTRuRM14UQQojHynKMG6wHYtwoWwvj39jGMvAWIvfk9WonNTGWr0rtINAnE2VPaq1v2ylbGOMu7YNgXrIqFHhHa33czqN5hRBCiCea1noOMCe/+yHEkyavB99u3F8H1dIVMn6wQnplU46nmICx9Npbme2YUioICAIoWrSoX7Vq1TIoIYQQQgghHgfR0dGXtNZpPfU1R+XHOt/2gswzMzWtMlPW9DCIyUCg1joh053SOhRjthx/f3+9f3+6T+QWQjyB/EL9rPajg6LTyCns8fNLdf2i5foJIR4OSqnTedVWXg++r2I9Q52iJPZntS2lPKzBXtmU4wDvY6yossfisbmFMZ6aWwJI1FrHZ6nXQggB/Bzzc3534ZH2889y/YQQIq8H3wcxYrdTqwEcykTZQKWUc6q47xoYd2kft9gvj/3B/FVgPvBqVjothBBCCCFETsjr1U42AY2VUhVTEpRSvkAz07GMyjpgcWOmaanBfsB2rXWiKbk/xhO6LLdtGE8Ha4PxFDUhhBBCCCHyXF7PfC8GXgE2KqWCMWK4p2M84vfjlExKqfLAn8A0rfU0AK31r0qp1cA802omJzEei10BY5kkTPn2pG5UKTUEI9xkV+6clhBCCCGEEBnL05lvrXUc0BY4BqwAVmIMotumepKWAgra6d8LwDLgP8BmoBzQSWstgYRCCCGEEOKhl+ernWitzwDPZpDnFHZWQDHdKPmaactKm0Oykl8IIYQQQojckB9LDQohhBDcuHGDCxcucPfu3fzuihDiMebg4ICnpyeurq753RVABt9CCCHyQePGjTl//jxly5bFyckJeRKxECI3aK2Jj4/n3LlzAA/FAFwG30Kk5u9vmyYPXRIiR7388suULVsWZ2fn/O6KEOIxppTC2dmZsmXL8vfffz8Ug++8XmpQCCGEwN3dHScnp/zuhhDiCeHk5P
TQhLjJ4FsIIUSeK1CggISaCCHyzMP0/xsZfAshhBBCCJFHZPAthBBCCCFEHpHBtxBCCJENU6ZMwcPDI7+7YSU0NJQNGzbYpPv6+jJ27Nhca/enn36ifv36ODo6pvvzvlLKvBUoUIAyZcrQr18/Tp48mWt9y67WrVvTu3fvbJU9ePAg/fr1w9PTE0dHR6pUqcKbb75JXFycVb7ly5ejlOLWrVtp1CQeR7LaiRBCZJIO0fndhUea1vev3+HDh/OxJ4+v0NBQatWqRc+ePfO03eHDh+Pp6cm2bdsoUqRIunnHjBlD79690Vpz8uRJQkJC6Nq1K7///juFCj36w5KdO3fStWtXnnnmGRYsWEDp0qXZv38/M2fOJDw8nJ07d1KsWLH87qbIR4/+u1wIIYQQ+erIkSMEBQXRqlWrDPP6+vrSuHFjAJo0aUKJEiXo2rUrx44do0aNGtnug9aaxMREHB0ds13Hg7p9+zYDBw7Ez8+PHTt24ODgAECrVq0ICAjA39+f4OBg5s2bl299FPlPwk6EEEKIXHLlyhWGDx+Ol5cXjo6ONG3alL1791rlUUoxf/58Jk2aRKlSpfD09GTEiBEkJiZa5du1axd16tTB0dGRBg0aEBUVhYeHB1OmTAGMMIno6Gj++9//mkM7li9fblXH3Llz8fHxoWTJkvTv359r165leA47duygUaNGODo64uXlxcsvv2wOk9i1axdKKZKTkxk1ahRKKYYMGZKla+Ti4gJgtQzc5s2bCQgIMD+VsHHjxmzfvt2qXErYz+7du2nQoAGOjo6EhYURFxfHK6+8QtWqVXF2dqZChQqMGDGCGzduWJVPTk7mrbfeokqVKhQpUgQfH590+379+nWaNWtG3bp1uXjxot08YWFhxMTEMGPGDPPAO0WdOnUYOHAgS5Ys4fbt21bHDh8+TIsWLXBycqJKlSqsX7/e6nhWrsfevXvx9/fHycmJ5s2bc/LkSS5cuEDPnj0pVqwY1atXZ8eOHVZlP/30U5o3b46bmxslS5akTZs27JfnW+QaGXwLIYR4aFjGBGdl8/PzS7NOPz+/DMvnhsTERNq3b88333zDO++8w4YNGyhVqhTt27cnNjbWKu+cOXP4+++/+eyzzxg3bhwff/wx8+fPNx8/d+4cXbp0wdPTk7Vr1zJ8+HAGDhxIfHy8Oc+iRYuoVq0aXbp0ITIyksjISLp27Wo+vmbNGr777jtCQ0N5++23+frrr5k0aVK653Do0CE6deqEh4cHX375JVOnTuXzzz83x0LXr1+fyMhIwAgniYyM5I033ki3znv37pGUlMTdu3c5duwYISEhPP3009SqVcuc5+TJk3Tr1o0VK1bw5Zdf0rRpUzp37sxPP/1kVdft27cZPHgwQ4cOZevWrTRs2JDbt2+TnJzMjBkzCA8PZ/r06ezYsYM+ffpYlR0+fDghISH07duXr7/+mjlz5tjEZKe4cuUK7du3586dO+zcuZNSpUrZzffDDz9QsmRJWrZsafd4z549iYuL4+eff7ZK79evHz169GDdunXUrl2bPn368Ntvv2XregQFBTF69Gi++OILzpw5wz//+U8GDBhA8+bNWbduHWXLlqVPnz5WXwBOnTrFoEGDCAsL4/PPP8fHx4eWLVty4sQJu+chHpDWWrZUm5+fnxZPMD8/200IkaMOHTpkNx3I1la/fv0026pfv36G5bMjJCREu7u7p3l8yZIl2sHBQR87dsycdvfuXV2xYkU9duxYq3Nu0aKFVdkePXroRo0amffHjh2r3d3d9e3bt81pq1ev1oAOCQkxp/n5+enBgwfb9KV8+fK6YsWK+u7du+a0UaNGaS8vr3TPsV+/frpy5co6KSnJpt2IiAirc1iwYEG6daXkS735+Pjo33//Pc0yycnJ+u7du7pDhw76hRdeMKeHhIRoQG/YsCHdNu/evat3796tAX369GmttdaHDx/WgJ4/f36a5Vq1aqWfffZZfeHCBV2nTh3dtGlTff
369XTb6tixo37mmWfSPP7LL79oQK9atUprrfWyZcs0oGfMmGF1vlWrVtX9+vWzW0dG12PXrl3mtA8++EADeurUqea0gwcPakBv2bIl3fqrVq1qVe5xkNb/d7TWGtiv82icKTPfQgghRC749ttv8fPzo0KFCiQlJZGUlAQY8b+pf9Lv0KGD1X6NGjU4e/aseX/fvn0EBARYPRW0e/fuWepPmzZtrG5orFGjBhcuXODOnTtplomKiiIwMJCCBQua05599lkKFSrE7t27s9R+inHjxrFv3z727dvH5s2bqVOnDl26dOHcuXPmPGfPnmXw4MGULVuWQoUK4eDgwPbt2zl27JhVXUopOnfubNPGihUrqFevHsWKFcPBwYHmzZsDmMvv3LkTIMMQmfPnz9OqVSvc3d3Zvn17rj2aPDAw0PzfBQoUoEePHkRFRZnTMns9ChcuTIsWLcz7lStXBqBt27Y2aZbX+/DhwwQGBuLl5UXBggVxcHDg6NGjNvWLnCE3XAohRCYFfRVktR/aLTSfevJoCgq6f/0GDRqUjz3JG5cuXWLPnj02sb8AlSpVstovUaKE1X7hwoVJSEgw78fGxlKnTh2rPI6OjllaNcNeG1pr7ty5Q+HChe2WiYmJwcvLyyqtYMGCuLu7c+XKlUy3bempp57C39/fvN+uXTt8fHyYO3cu7777Lvfu3aN79+7cvHmTadOmUblyZYoWLcqbb77JhQsXrOoqWbKkTd/Xr1/PoEGDeOmll5g5cyZubm7ExMQQGBhovqaXL1+maNGiGQ6mDx06xJUrVxg3bhxFixbN8NzKli1rNWhO7fTp0+Z8ljw9PW32Y2JiALJ0PVxcXChQ4P68asq1sXztU9JSrsXNmzfp0KEDXl5evPfee5QvXx5HR0eGDh1q9R4UOUcG30IIkUmLf15stS+D76xZvPj+9evVq5fdPNpiOcKcEh0dneN1Zoabmxv+/v58+OGHNscyWo4vtdKlS9vc5JeQkJDr60N7e3vbDPCSk5O5fPkybm5uOdJGkSJFqFixonn5yePHj/PLL78QHh5Op06dzPks49tT2IvXDwsLo1GjRixatMic9v3331vlcXd3Jy4ujhs3bqQ7AG/Tpg316tUjKCgIDw8PunXrlu65tGzZkqVLl7J7927zbLulTZs2UbRoUZt7FC5cuIC7u7vVvre3N5C165EdkZGRnD17lm+++YZq1aqZ069fv54j9QtbEnYihBBC5IJ27dpx/Phx80yv5Va7du0s1dWgQQO++eYbqwHXpk2bbPKlnjF/UI0aNWL9+vUkJyeb09atW0dSUpLdwWV2JCQk8Oeff1KuXDng/qDS8gvK6dOnbW4uTEt8fLzNl5uVK1da7aeEYXz66acZ1jd58mTGjBlDnz59bFYJSa1Pnz54e3szefJkc5hRigMHDrBixQqGDRtmFT4EWK1ucu/ePTZu3EjDhg3N5wPZvx4ZsVd/REQEp06dypH6hS2Z+RZCCCGy6c6dO6xdu9YmvVWrVgwaNIiPPvqI1q1bM3bsWCpWrMjly5eJioqidOnSjB49OtPtvPrqq3zwwQd069aN0aNHExsby6xZs3B2drYKM6hWrRrbtm1j27ZtuLu7U6FCBasZ1awKDg6mXr169OzZk5deeomzZ88yYcIEOnbsSJMmTbJV56lTp9izZw8AFy9eZNGiRVy/fp1//etf5nPw8fFhzJgxTJ8+nZs3bxISEmITqpGWgIAARowYwYwZM2jUqBFbtmzhu+++s8pTtWpVgoKCGDNmDBcuXKBly5Zcu3aNtWvXsmrVKps6Z82axc2bN+nRowfffPONeZ3y1JydnVm5ciVdu3aldevW/Pvf/8bLy4vo6GhmzpxJ3bp1mT59uk25JUuWULhwYWrVqsXixYs5fvw4X3zxRY5cj4w0btyYYsWKMWzYMMaPH8/Zs2eZMmVKjtUvbMngWwghhMimmzdv2ixhB8YNfa1bt2bnzp28+eabhISEcP78eT
w9PWnYsGGWb5YsW7YsmzdvZtSoUfTq1Yvq1auzdOlSAgICrMImgoODOXPmDH379uXGjRssW7Ysy+tuW6pZsybh4eFMmjSJXr164erqyoABA5g9e3a265wzZw5z5swBjPCP2rVrs337dho0aAAYM7Dr1q1jxIgR9O7dGx8fHyZPnsyuXbs4cOBAhvUPHz6cEydOMH/+fBISEggICODzzz+3GTAvWrSI8uXLs2TJEmbNmoWnpycBAQFp1rtw4ULi4uLo3Lkzu3btom7dunbztWnThqioKKZNm2ZeX7x8+fK8/PLLTJgwwW7s+KpVqxg9ejTBwcH4+PiwevVq6tWrlyPXIyNeXl6EhYUxduxYevTowdNPP81HH330QK+xSJ/Kjfi6R52/v7+WxeWfYBY3ApnJ+0EAaqp1fKk8bj5rLONzU8eviqzbvXs3LVq0YMeOHbRp0ya/uyPEQ+/w4cNUr17d7jGlVLTW2s4AIOfJzLcQQgjxCJgwYQL16tWjdOnSHD16lOnTp1OnTp1MPdJdCPHwkMG3EEII8QhITExk3LhxnD9/HhcXFzp06MB7771nFfMthHj4yeBbCCGEeATMmzePefPm5Xc3hBAPSL4uCyGEEEIIkUdk8C2EEEIIIUQekcG3EEIIIYQQeUQG30IIIYQQQuQRGXwLIYQQQgiRR2TwLYQQQgghRB6RwbcQQmTS/mH7rTaRNfv37zdv3t7e+d2dBzZlyhQ8PDzyuxtWQkND2bBhg026r68vY8eOzbV2f/rpJ+rXr4+jo6PVk0xTU0qxcOHCXOuHpeXLl6OUMm9FihShatWqzJw5k+TkZHO+U6dOoZTi66+/Tre+hQsXpntuQmSWrPMthBCZ5FfGL7+78Ejz87t//Q4fPpyPPXl8hYaGUqtWLXr27Jmn7Q4fPhxPT0+2bdtGkSJF8rTtjOzYsQMnJycSEhL48ccfeeONNwCYNGkSAN7e3kRGRlKtWrX87KZ4gsjgWwghhBAP5MiRIwQFBT2Uj7pv0KABxYoVA6B169b88ccfbNiwwTz4LlKkCI0bN87PLoonjISdCCGEELnkypUrDB8+HC8vLxwdHWnatCl79+61yqOUYv78+UyaNIlSpUrh6enJiBEjSExMtKwSK44AACAASURBVMq3a9cu6tSpg6OjIw0aNCAqKgoPDw+mTJkCGAPL6Oho/vvf/5pDLZYvX25Vx9y5c/Hx8aFkyZL079+fa9euZXgOO3bsoFGjRjg6OuLl5cXLL7/MrVu3zH1SSpGcnMyoUaNQSjFkyJAsXaOFCxfy9NNPU6RIESpXrszcuXNt8hw4cICuXbvi4uKCi4sLffr0ITY2NkvtpHBxceHu3bvmfXthJ4mJibzyyiuUKFECNzc3Ro8ebVUmRU6+vuLJITPfQgghHhpqavZiaut71yc6KNruMb9QP36O+Tnd8jpEZ6vd9CQmJtK+fXuuXbvGO++8g6enJx9++CHt27fnf//7H6VLlzbnnTNnDm3btuWzzz7j999/5/XXX6d8+fKMHz8egHPnztGlSxeaNm3KzJkziY2NZeDAgcTHx5vrWLRoEc8++ywVK1Y0h1ZUqlTJfHzNmjXUqVOH0NBQzp49y2uvvcakSZNYtGhRmudw6NAhOnXqREBAAF9++SV//fUXEydO5MSJE2zdupX69esTGRlJkyZNGDNmDL1796ZUqVKZvkaLFy9m5MiRvPbaa3Ts2JGdO3cyZswYEhMTmThxIgDHjx+nWbNm+Pv7s2LFCpKTk3njjTfo1q0bUVFRGcZhJycnk5SURGJiIj/88ANr1qxh3Lhx6ZaZOHEiS5YsYcaMGdSoUYPFixcTFhZmlScnX1/xZJHBtxBCCJELPvvsMw4cOMDBgwd5+umnAWjfvj1Vq1Zlzpw5vPPOO+a8vr6+5lnqjh078tNPP7Fu3Trz4GzevHk4Ozvz1Vdf4eTkBICrqyv9+vUz11GjRg2KFi1KqVKl7IZRODg4sGHDBgoVMv7pP3
ToEKtWrUp38D1t2jTKly/Ppk2bKFiwIABubm7069fPPOhOacvX1zdL4Rv37t1jypQpDBkyhDlz5gDQoUMHrl+/zltvvcWrr76Ko6MjU6dOpXTp0oSHh1O4cGEA6tSpQ7Vq1diyZQtdu3ZNt50SJUpY7ffq1csccmLP5cuX+eijj5g6dSpjxowBjNekRo0aVvly8vUVTxYJOxFCiEwKjQ612kTWhIaGmrebN2/md3dy3bfffoufnx8VKlQgKSmJpKQkAFq1asX+/dar5XTo0MFqv0aNGpw9e9a8v2/fPgICAswDb4Du3btnqT9t2rQxD7xT2rhw4QJ37txJs0xUVBSBgYHmgTfAs88+S6FChdi9e3eW2k/t7Nmz/P333/Tp08cqvV+/fty4cYM//vgDMK5jYGAgBQoUMF/HChUq4Ovra3Md7fnhhx/Yt28fkZGRfPLJJ+zZs4dhw4almf+PP/4gISGBHj16mNMKFChgtZ/Sr5x6fcWTRWa+hRAik4Z/PdxqP8gvKJ968mgaPvz+9QsPD8/HnuSNS5cusWfPHhwcHGyOWYaDgO3sbOHChUlISDDvx8bGUqdOHas8jo6O5hsJM8NeG1pr7ty5Y55RTi0mJgYvLy+rtIIFC+Lu7s6VK1cy3XZadQM29afsp9R/6dIl3n77bd5++22bOv76668M26lXr575OjVu3JjixYvTu3dvxowZQ61atWzyp8SSe3p6WqWn3s/J11c8WWTwLYQQ4qGRG7HXacWC5zY3Nzf8/f358MMPbY5ldTm+0qVLc/HiRau0hIQE842PucXb25sLFy5YpSUnJ3P58mXc3NweuG7Apv7z588DmOt3c3MjMDCQoUOH2tSRnXXWU8JHDh8+bHfwnRKrfeHCBatzTN3PnHx9xZNFBt9CCCFELmjXrh3bt2/nqaeespk1zaoGDRqwbNky4uPjzaEnmzZtssmX0zOqjRo1Yv369cycOdMcerJu3TqSkpJo3rz5A9Xt4+NDmTJlCAsLo3Pnzub0NWvW4OrqSu3atQHjOh44cAA/P78cecjNgQMHAChXrpzd47Vr18bR0ZGNGzea1/6+d+8eGzdutMqXk6+veLLI4FsIIYTIpjt37rB27Vqb9FatWjFo0CA++ugjWrduzdixY6lYsSKXL18mKiqK0qVLM3r06Ey38+qrr/LBBx/QrVs3Ro8eTWxsLLNmzcLZ2ZkCBe7fvlWtWjW2bdvGtm3bcHd3p0KFCri7u2f7/IKDg6lXrx49e/bkpZde4uzZs0yYMIGOHTvSpEmTbNWZMoAuUKAAU6ZMYfjw4bi7uxMQEMD333/Phx9+yMyZM3F0dASMJ4k2bNiQrl278uKLL+Lh4cG5c+f45ptvGDJkCK1bt063vX379uHk5ERSUhKHDx8mJCQEf39//P397eZ3d3cnKCiIkJAQChUqRM2aNVm8eLHNrww5+fqKJ4sMvoUQQohsunnzps0NgwA7d+6kdevW7Ny5kzfffJOQkBDOnz+Pp6cnDRs2zPLNkmXLlmXz5s2MGjWKXr16Ub16dZYuXUpAQACurq7mfMHBwZw5c4a+ffty48YNli1bluV1ty3VrFmT8PBwJk2aRK9evXB1dWXAgAHMnj07y3WlLItoGV8+bNgwEhMTmTdvHvPnz8fHx4c5c+ZYDVyrVKnCnj17CA4OJigoiPj4eMqWLUu7du2oXLlyhu22bdsWMGLVfXx86NatG9OmTbO6+TS12bNnc/fuXaZNm0aBAgV4/vnnee2118yrn4ARc59Tr694siitcz6+7lHn7++vM3MHtXhM2ZsNkfeDwHYN6tyIT36cWYYMhIeH06lTp3zszaNv9+7dtGjRgh07dtCmTZv87k6GDhw4QO3atdm6dSsdO3bM7+6IJ9Dhw4epXr263WNKqWittf2fQ3KYzHwLIYQQj4AJEyZQr149SpcuzdGjR5k+fTp16tR5KB/pbunGjRvs3buXWbNm4e7uTosWLfK7S0LkKxl8Cy
GEEI+AxMRExo0bx/nz53FxcaFDhw689957VjHfD6Off/6Znj17UrduXbZt24azs3N+d0mIfJXng2+lVDlgLhAAKOBb4FWt9ZlMlHUEpgPPAyWAX4EJWusfLPK4AJ8A9QFv4C5wFFigtf4sZ89GCCGEyBvz5s1j3rx5+d2NLGvdujVxcXH53Q0hHhp5+nVZKeUM7ACqAYOBfwJPAzuVUkUzUcUnwDDgTeAfQAywTSn1jEWewkAS8BbQHXgOOAKsUErJrcdCCCGEECLf5PXM9zCgIlBVa30cQCn1O/A/YDjwXloFlVJ1MQbSL2qtl5nSvgcOAtMwBtporS+b8lnaopSqAryIMesuhBBCCCFEnsvrQLHuwJ6UgTeA1vok8BPQIxNl7wKrLcomAauAjkqpjB4nddlUXgghhBBCiHyR14PvmsABO+kHgRqZKHtSa33bTtnCgNVin8pQSCnlrpQKAjoCj16wnBBCCCGEeGzk9eDbDbhqJ/0KUPIByqYctzQCY6b7ErAQGKW1/jStypVSQUqp/Uqp/RcvXsygK0IIIYQQQmRdfiw1aO+pFMpOmr08WSm7GtgDeGCErCxQSiVrrT+22ymtQ4FQMB6yk4n+CCGeMB//w+7/PkQmffzx/evn5pZ6vkQIIZ4MeT34vortDDUYs972ZrUtXQGeSqNsynEzrfVFIGUKe6tppZV3lVJLtdYS+y2EyLIgv6D87sIjLSjo/vU7fPhwPvZECCHyT14Pvg9ixG6nVgM4lImygUop51Rx3zWAO8Bx+8XM9mMsb+gFnM1cd4UQQuQlf/88ebqzjf3792e5zJQpU5g6dardYytWrOD555/PdF3bt2/n0KFDvPrqq1bpQ4YM4cCBA9nqX2aUK1eOjh07smTJEnParVu3KFGiBL6+vhw/bv1Pa7NmzShatCjbt29n165dtGnThj/++INatWoBoJRiwYIFvPLKK3bbs1cmt/Xu3ZtLly6xa9cuwHjdFi5cyKVLl/KkfSFSy+vB9yaM2eeKWusTAEopX6AZMDETZacCfYD/msoWAvoB27XWiRmUbwXcAi5kt/NCCCGEpeLFi7N161ab9MqVK9vJnbbt27ezdu1am8H3G2+8QXx8/AP1MT1NmzYlIiLCKm3v3r0UKVKEP//8kwsXLuDp6QnAnTt3iI6O5vXXXwegfv36REZGUqlSpUy3l50yOW3o0KF069Yt39oXIq8H34uBV4CNSqlgjBju6cBfgDkYUClVHvgTmKa1ngagtf5VKbUamKeUcgBOAi8BFYCBFmWHA40xnpx5FnAH+gK9gYla6zu5fZJCCCGeDIUKFaJx48a5Vn9mBqkJCQk4Ojpmq/6mTZsSFhbG1atXKVnSiOKMjIykVatWHDp0iIiICHr27AlAdHQ0iYmJNGvWDABXV9csn3t2yuQ0Hx8ffHx88rUP4smWp6udaK3jgLbAMWAFsBJjEN1Wa33LIqsCCtrp3wvAMuA/wGagHNBJa/2zRZ4/MEJL3gW2Awswbrr8h9b67Zw+JyGEECItp06dQinFmjVrGD58OMWLF8fHx4eQkBDu3bsHGGEQc+bM4fTp0yilUEoxZMgQwAg7sQzFWb58OUopoqKiaN26NU5OTrzzzjuAMQgfP3485cqVo0iRItStW5ctW7ak279mzZqhtSYyMtKcFhERQZMmTWjSpInVrHhERAQFCxakUaNGgBFCopTiwAF7KwgbDhw4QOnSpfnnP/9JcnKy3TJKKd577z1GjRqFm5sbJUqUYOTIkdy5Yz1XdubMGfr374+bmxvOzs507NiRo0ePWuX566+/6NKlC05OTvj6+lqF06SYMmUKHh4e5v24uDheeeUVqlatirOzMxUqVGDEiBHcuHEj3WsnRHbl+WonWuszwLMZ5DmFnVVMtNbxwGumLa2yEUCXB+ulEELYiv472mrfr4xfPvXk0RQdff/6FSqUH4tt5Y6kpCSbtNTnN378eJ599lnWrl3Ld999x7Rp06hZsy
Z9+/Zl6NCh/O9//2PHjh2sX78egFKlSqXb5oABA3jppZcICQmhRIkSgBHbHBUVxdSpU6lUqRJr1qyhe/fu7N+/n2eeecZuPc888wzOzs5ERETQpUsXtNbs2bOH1157jeLFi7NmzRpz3oiICGrXro2Li0umrssvv/xCQEAAgYGBfPzxxxQokPZ835w5c2jcuDErV67k4MGDTJ48GUdHR/MXiytXrtC8eXPc3d356KOPcHZ2ZtasWbRv355jx47h5OSE1poePXpw6dIlPvnkExwdHQkJCeHKlSs8/fTTabZ9+/ZtkpOTmTFjBqVKleKvv/5ixowZ9OnTh23btmXqXIXIisfn/35CCJHL/Bdb3wyoQ2RV0qywnMENDw+nbt26+dibnHH58mUcHBxs0k+ePImvr695v2XLlsyZMweAgIAAtm7dyrp16+jbty8+Pj54e3tTpEiRTIdk/Pvf/2bUqFHm/e+++47Nmzeza9cuWrVqBUCHDh04duwYM2bMICwszG49hQoVokGDBuYZ7iNHjnD9+nUaNmxI8eLFmTBhAnfu3KFw4cJERkbSq1evTPVv7969dOrUieeff573338fpdJfUdjFxYWwsDAKFChA586dSUxMZMaMGbz++uu4ubkxd+5c4uLi+PXXX83LVDZr1gxfX1+WLl3KiBEjCA8P55dffmHPnj3m2Xk/Pz8qVaqU7uC7VKlSfPjhh+b9pKQkKlSoQPPmzTlz5gxPPWVvoTUhsi+vH7IjhBBCPDaKFy/Ovn37bLYyZcpY5evQoYPVfo0aNTh7NvsLb3Xt2tVq/9tvv6V06dI0a9aMpKQk89auXbsMV0pp1qwZUVFRJCcnExERQc2aNXF1dTXPlv/888+cPHmSmJgYmjZtmmHffvrpJwICAggKCmLBggUZDrwBevToYTUz3qtXL+Lj483hKd9++y0BAQG4urqaz83FxQU/Pz/z+UVFReHl5WUeeAOUL18eP7+Mf6FasWIF9erVo1ixYjg4ONC8eXMAjh07lmFZIbJKZr6FEEKIbCpUqFCmlkdMCQ1JUbhwYRISErLdrpeXl9X+pUuXiI2NtTsLX7BgwXTratq0KTNnzuS3334jMjLSPMB2cHDAz8+PiIgIc3spN1umZ/v27SQlJTFo0KDMno55RZXU+zExMYBxfnv27GH16tU2Zdu1awdAbGysTT0pdd28eTPNttevX8+gQYN46aWXmDlzJm5ubsTExBAYGPhAr5EQaZHBtxCZkfof11xac1cIITIj9Wyym5sbZcuWZcOGDVmuq2nTpiiliIiIICIiggkTJpiPpdx06eXlRdmyZSlfvnyG9QUHB5tnqn/88cdMrdhy4cIFu/ve3t6AcX7du3fnjTfesCmbEoNeunRpm3pS6nJyckqz7bCwMBo1asSiRYvMad9//32GfRYiuyTsRAghhMhnDzoT3q5dO2JjYylWrBj+/v42W3pKlixJtWrV2LJlC0eOHKFJkybmYymD74iIiEyFnIAxY7527VqqVq1K+/btOXfuXIZlNm7caF79BWDdunU4OTmZH8TTrl07Dh48SM2aNW3OrWrVqgA0aNCA8+fPs3fvXnM9Z86c4eeffyY98fHxFClSxCpt5cqVmTpXIbJDZr6FEEKIbEpKSmLPnj026eXKlaNs2bKZrqdatWqcP3+e5cuXU6tWLTw8PKxu2MxIQEAAHTt2JCAggAkTJlCzZk1u3LjBr7/+SkJCAm+99Va65Zs2bcrSpUtxc3OjSpUq5vQmTZoQExNDbGysefnDzHBycuKrr76iffv2tG/fnh9++CHdFVxu3rxJnz59GDZsGAcPHmTatGm88sor5psrX3vtNT777DPatm3LyJEjKVu2LOfPn+f777+nefPmDBgwgC5dulC3bl369OnD22+/jaOjI2+++abdUJTU127EiBHMmDGDRo0asWXLFr777rtMn6sQWSWDbyGEEA+N3HqMem65fv261UxxiunTpxMcHJzpevr27c
vOnTsZP348Fy9eZPDgwSxfvjzT5ZVSrFu3jpkzZzJv3jzOnDmDm5sbzzzzDCNHjsywfLNmzfjkk09sVlvx9vamfPnynD59OtMz3ymKFStGeHg4bdq0oWPHjuzcuTPNvGPGjOHEiRMMGDCAe/fuMXToUGbOnGk+7uHhwZ49e5g8eTKjR4/m2rVreHt707x5c+rUqWO+Bps2bSIoKIgXX3wRT09PJk2axDfffJPuo+SHDx/OiRMnmD9/PgkJCQQEBPD555/n+8OAxONLaS1LZaXm7++vH7V/AEQOysTNUxLz/WRSU63jbGWpwayxjFMODw+nU6dO+dgb8bBQSrFgwQJeeeWV/O6KeMwdPnyY6tWr2z2mlIrWWmdiAPDgsjXzrZSqA7TEeHT7x1rrWKVUZeC81jrtW4qFEEIIIYR4gmVp8K2UKgJ8BvTCeAKlBr4CYoHZGI+Nn5jDfRRCCCGEEOKxkNWZ7xlAe+CfwDfAeYtj4cDLyOBbCCGEEJkk4a/iSZPVwfcAIFhr/blSKvWq/ScB3xzplRBCCCGEEI+hrK7z7Q4cTqeuImkcE0IIIYQQ4omX1cH3ScB2TSVDQ+Dog3VHCCGEEEKIx1dWw04+BSYppU4B60xpWinVBhgNTMm5rgkhxMNlWP1h+d2FR9qwYfevX7FixfKxJ0IIkX+yOvieDdQFVgBLTGm7AUdgldZ6QQ72TQghHiqh3ULzuwuPtNDQ+9fv8OG0IhiFEOLxlqXBt9Y6GeivlPoA6Ah4ApeBrVrr73Ohf0IIIYQQQjw2svWQHa31j8CPOdwXIYQQQgghHmtZveFSCCGEyD3+/vmzZcPx48cZPnw4devWpWDBgrRu3dpuPq01M2fOpFy5cjg5OdGyZUt+/fVXm3yHDh2iXbt2ODs7U6ZMGd58802Sk5PT7cOuXbtQSuHh4cGtW7esji1cuBCllFWaUgqlFJGRkVbpBw4cQCnFrl27Mj7xBxQTE0OXLl0oXrx4hm3GxcUxZcoUqlatiqOjI6VKlaJPnz4cOHDAJq9SioULF+Ziz3Nft27dmDp1qnl/yJAh5tcs9bZ79+4cbXvKlCl4eHiY91PeW/autaWxY8fi6+ubo33Jiv379+Pu7s7169fzrQ9ZlaXBt1LqnlIqOY0tSSl1WSn1jVKqQ251WAghhHgYHDx4kC1btlClShWqVKmSZr5Zs2Yxffp0JkyYwFdffUWxYsVo3749sbGx5jxXr16lffv2KKXYuHEjb775JnPmzCEkJCRTfbl8+TIffvhhpvv+n//8J9N5c9qMGTP47bff+OKLL4iMjKR+/fp28926dYvWrVvz/vvvM3ToULZu3coHH3xATEwMDRs2ZOfOnXnc89y1d+9edu7cyciRI63Sq1WrRmRkpM1Wt27dXO1P/fr1iYyMpFKlSrnazoPy9/fnmWeeYe7cufndlUzLatjJdGAwxg2WmzGecFka6AIkABuA1kC4UqqH1vrrnOuqEEII8fDo1q0bPXr0AKB3795cunTJJk9CQgKzZs3i9ddf55VXXgGgSZMm+Pr6snDhQvMg+KOPPiI+Pp5169bh6upKQEAAN27cYMqUKYwfPx5XV9d0+9K6dWvmzJnDyJEjcXR0zDDvli1b+OWXX6hXr152Tv2BHDlyhEaNGtGlS5d08wUHB/Pbb78RHR1N7dq1zemBgYG0bduWgQMH8ueff+Lk5JTbXc4T77//Pj169MDNzc0qvWjRojRu3DjP++Pq6pov7WbHCy+8wNixYwkODqZQoWxFVOeprIadJGB6kqXW+l9a60la6xeBCsAp4CJQH9gOTMrJjgohRH5TU5XVJrLG8ifz06dP53d3HliBAhn/ExoREcGNGzfo27evOa1o0aJ069aN8PBwc1p4eDgdO3a0GmT379+f+Ph4vv8+4/UMxo8fz9WrV1myZEmGeXv16kWNGjWYMWNGuvk2bdqEn58fRYsWpWTJkjRq1CjDvpw8eZKePXvi6uqKi4sL3bp14/
jx4+bjSim+++471q9fj1IqzXCF27dvs2TJEp5//nmrgTeAg4MD//nPf4iJiSEsLMzq2J07dxg1ahRubm6UKFGCkSNHcufOHfPxmJgYXnzxRSpWrIiTkxNVqlQhODjYKs+pU6dQSrFq1SpeeOEFXF1d8fHx4bPPPgNg9uzZlClThlKlSjFhwgTu3btnLnvkyBH69+9PuXLlcHZ2pmbNmsybN88qjz03b95k/fr19O7dO9189ixfvhyllE3Yka+vL2PHjrVKW79+PQ0bNsTJyQl3d3e6dOmS5mfRXtjJtWvXeO655yhatCje3t5pvofOnDlD//79cXNzw9nZmY4dO3L0qPWjYCZOnEjt2rUpVqwYPj4+DBw40OrXIMtzmDt3Lj4+PpQsWZL+/ftz7do1q3zdu3fnypUrbNu2Lf2L9ZDI6uD7/4C5WusEy0StdTwwF/g/rfU9jGUI6+RMF4UQQohH05EjRyhYsCBPP/20VXr16tU5cuSIVb5q1apZ5Xnqqadwdna2ypeWcuXKMWjQIGbPns3du3fTzauUYtKkSaxbt45Dhw7ZzfPnn3/Su3dv2rZty1dffcXKlSv5xz/+wZUrV9KsNzExkXbt2nH48GEWL17M8uXLOXnyJK1atTKXi4yMpF69erRp04bIyEjWr19vt67o6Gji4uLo2bOn3eOtWrWiRIkS/PDDD1bpc+bM4ezZs6xcuZLg4GBCQ0OZPHmy+filS5dwc3PjvffeY+vWrYwbN45ly5bZhHoATJgwAW9vb7788ktatGjB4MGDGTNmDFFRUSxdupRXX32V2bNns2bNGnOZc+fOUbVqVRYtWsSWLVsYNmwYISEhvP3222leNzC+pMXHx9O0aVO7x5OSkmy2rFqxYgW9evWiUqVKrFmzhmXLllGlShUuXryY6TpeeOEFwsPDmTdvHqGhoWzfvp1Vq1ZZ5bly5QrNmzfn6NGjfPTRR6xZs4a4uDjat29PfHy8Od+FCxeYNGkSmzdvZt68eZw4cYK2bdva3OewZs0avvvuO0JDQ3n77bf5+uuvmTTJen7X1dWVmjVr8u2332b5uuSHrM7NewIOaRwrjPH4eYBLgEwLCSGEeKJdvXqVYsWKUbBgQav0kiVLcvv2be7cuUPhwoW5evUqJUqUsClfsmRJrl69mqm2Jk6cyLJly/j000/517/+lW7e/v37ExISwltvvcWKFStsjv/yyy+4uLjwzjvvmNMyChNZtmwZZ86c4dixY1SsWBGARo0aUbFiRT7++GNef/11GjdujKurK25ubumGNJw7dw6A8uXLp5mnfPny5nwpXFxcCAsLo0CBAnTu3JnExERmzJjB66+/jpubG7Vr1+bdd98152/WrBlFixblxRdfZMGCBRQuXNh8rG3btsycOdN8HmvXrmXTpk3mL1SdOnVi48aNrF+/nv79+wPQrl072rVrBxg32jZv3pzbt2+zePFiXn/99TTPJTo6Gg8PD7y8vOwec3CwHXpprdOsL7V79+4xceJEAgMD+eKLL8zp3bt3z3QdBw8eZMOGDaxatYp+/foB0KZNG5566imrX2zmzp1LXFwcv/76qzmEplmzZvj6+rJ06VJGjBgBwNKlS81lkpOTadKkCT4+Pvz000+0bNnSfMzBwYENGzaYw0kOHTrEqlWrWLRokVX/6tatS1RUVKbPJz9ldeZ7PzBFKeVtmaiUKgOEmI4DlAf+fvDuCSGEEI+21CuOwP2Bk+WxtPLZS7enUqVK9O/fn1mzZmW4SkrBggWZOHEiX3zxBX/++afN8dq1a3P9+nUGDx7M9u3biYuLy7D9qKgo6tevbx54A/j4+NCsWbMcX5kjLT169LAKB+rVqxfx8fHm0AmtNfPmzaNGjRo4OTnh4ODAwIEDSUxM5MyZM1Z1pQyiwZhZLVWqFK1atbL6IlW5cmWrLwAJCQmEhIRQuXJlihQpgoODA5MnT+bkyZPpzlbHxsZarTRiqXr16uzbt89my4qjR4/y99
9/88ILL2SpnKWUNi0H7MWKFSMgIMAq37fffktAQACurq7mWXoXFxf8/PzYv3+/OV94eDhNG1hRkwAAIABJREFUmzalePHiFCpUCB8fHwCOHTtmVV+bNm2s4rhr1KjBhQsXrEKFADw8PGzCVh5WWR18jwJ8gJNKqZ1KqdVKqZ3ACaAM8G9TvsrA5znXTSGEEOLRU7JkSW7evGkzGL527RrOzs7mGc2SJUvaxLECXL9+3e6MeFomTZrEn3/+yerVqzPMO2jQIMqUKWM3JKJq1aps3LiREydO0KVLFzw8PHjuuefSDVGIiYmxO3Pr5eWVbriKPWXLlgVI996A06dPm/Ol8PT0tLsfExMDwLx58xgzZgyBgYFs3LiRqKgoPvjgA8AYOFtKfd0LFy5sN82y3IQJE3j33XcJCgpiy5Yt7Nu3j+DgYLv1W0pISKBIkSJ2jzk7O+Pv72+zZcXly5cB8Pb2ziBn2mJjY3FxcbG5wTX1Nb906RKrV6/GwcHBatu5cyd//fUXYAzku3fvjo+PDytWrCAyMpI9e/YAmXsdtNY2g+8iRYqke40fJll9wuXPSqnKwBigEVAbiAHmAO9prS+b8r2Z0x0VQgghHjXVqlUjOTmZ48ePU7VqVXN66hjvatWq2cR2//XXX8TFxdnEgqenRo0aBAYGMnPmTIYPH55u3sKFCzNu3DjGjh1Lr169bI537dqVrl27cv36dTZv3syrr77KyJEjbWJ8U3h7e3Pw4EGb9PPnz9us4JGRlBs9N23aZDc04scff+TatWtW4QlgxBHb208ZdIaFhdGnTx+rGwXTinvPjrCwMEaOHMn48ePNaZs3b86wnJubm90vX5mRsrpN6sGoZbiSu7sRFZzyJSQ7Spcuzc2bN4mPj7cagKe+5m5ubnTv3p033njDpg4XFxfAuPGzVKlSrF692vzLzoPehH3t2rUsv8/yS5YfsqO1vmxa5aSd1rqG6e/klIG3EEIIIQxNmzbF1dXValWO27dv89VXX9G5c2dzWufOndm2bRs3b940p61evRonJydatWqVpTaDg4M5ePBgmjczWho2bBglS5Zk9uzZaeYpXrw4zz33HIGBgekOVBs1akR0dDQnT540p507d46IiAiaN2+epXNwdnZm6NChfPrppzYPeUlKSiI4OJgyZcrQp08fq2MbN260Wllk3bp1ODk5UatWLQDi4+NtZphXrlyZpb6lJ3X9ycnJaX5ZsVS1alX+/vtvEhMTs9xmSrjG4cOHzWl79+7lxo0b/9/encfJVZUJH/89BJIQliGRRYgQQFwgLpg0LgOvYXlRQAUVGAajEHASQNHhDSqoyBKio6DooOgQZJABFQdwRqKoLEEcWZQksgWGNYhsAgZJAmHN8/5xb5PqSnV3Vbq6urv69/187qe7zj331rl1KtVPTp37nC7nHz9+POeff37D5++00047AUUWnE7Lly/nyiuv7FJvjz32YNGiRUycOHG10frO/4CuWLGCddZZp8uUqr72wwMPPNBjvv3BZPAnQ5QkDR8Vc0IHu2effZbLL78cKILMpUuXcskllwDFzYljxoxh9OjRHH/88Zx66qmMHTuWN77xjZxxxhmsXLmyS4aNI488kjPPPJMPf/jDHHfccdx///2cfPLJzJw5s9cc39Xe9ra3sffee3dJZdid0aNHM3PmTI477rgu5WeffTY33HADe+21F1tssQX33HMPF198MYcccki355o2bRpf+9rX2HvvvZk1axYjRox4ZdXE3kbha5k9ezbXXXcdU6ZM4Qtf+AIdHR08/vjjnHnmmSxYsIBf/OIXq02BWLZsGQceeCDTp09n0aJFzJo1i6OPPvqVEdE999yTM888k3e84x289rWv5Yc//GGXVIh9teeee3LWWWex3XbbMW7cOM4666y6Auqdd96ZF198kdtuu221KSXPPPPMK1MyKm233XZsvPHGvP3tb2f8+PF8+tOf5tRTT2XJkiWcdtppXd43a621Fq
eddhpTp05l6tSpHHzwwUQE8+bN4+CDD65rGsvEiRPZd999Oeqoo1i6dCmbb745p59+OmPGjOlSb+bMmVx44YXsvvvufOpTn2L8+PH85S9/4dprr2WXXXbh4IMPZs899+Rb3/oWxxxzDB/4wAe4/vrrX0nluKbmz5+/2vt40MrMhjbgTRRpBS8H5lVtVzd6vsG4TZ48OTWMTZ7c+6ZhiZPpsqkxwCvbL3/5y4FuTp8tXry4yzVVbosXL36l3sqVK3P27Nk5fvz4HD16dO6yyy65cOHC1c63aNGi3G233XL06NH56le/Ok844YR86aWXemzDNddck0DedtttXcqvu+66V9pSCchvf/vbXcqWLVuW48aNSyCvueaazMy8/vrrc5999snNN988R40alVtvvXV+7nOfy+eee67H9tx3332533775frrr5/rrbdevu9978u77767S50pU6bk/vvv3+N5Oi1fvjxPPPHEfP3rX58jR47MjTfeOA844IC89dZbV6sL5De+8Y385Cc/mRtttFFuuOGG+YlPfKJLm5ctW5bTpk3LsWPH5tixY/PjH/94zp07t8tr2Nmvc+fO7XL+CRMm5LHHHtul7NBDD83KmOGxxx7LD37wg7nBBhvkpptump/97Gdzzpw5CeSyZct6vNY3velNOWvWrNXO39177IILLnil3h/+8Ifs6OjIddddN3fcccf83e9+V7O9l156aU6aNClHjRqV48aNy3322ScfeOCBzMw86aST8lWvetUrdWu9t5YsWZIHHXRQjhkzJjfddNM85ZRT8thjj80JEyZ0eZ6HH344p02blptuummOHDkyJ0yYkFOnTs3bb7/9lTpf+9rX8jWveU2OGTMm99hjj7z77rtXe3/Wuobzzjtvtddz4cKFGRFd/t3Vcscdd3S7D5ifLYozIxtIVRMR7wCupVhQ53XArcBYYCvgIeDezNy9oeh/EOro6Mj5Q2j0RU1Wz40svj+GpeqFdfKk+j8/1TWbxy9/+Uv22muvAWyNNLh885vf5Nxzz11tmo169/nPf56bbrqp1zzfd955J9tvv33NfRGxIDMbu5N1DTU65/srwE+BiRR5vD+emVsD/xcYAcxuauskSZKGgRkzZvDEE08MmYViBotnnnmGc84555WsMkNBo8H3W4ALKb7ygCLgJjPnUQTe/9K8pkmSJA0P6623Hueff35dOdW1yoMPPsiJJ57IrrvuOtBNqVujN1yuAzyTmSsjYglQmTDyLor54JIkSWqQU7Eat/3223c7lWSwanTk+z6gM6P9rcDhEbFWRKwFHAYMjaWFJEkDauXKlQ0tjy1JfTGYPm8aHfmeC+xKsXrlV4BfAEuBl4H1WbXCpSS1nUmbTxroJgxpkyatev2ee+45VqxYsVqaMknqD525xQeDRle4PLni96si4p3A/sAY4FeZeUVzmydJg8eCGQsGuglD2oIFq16/pUuX8vDDDzN+/HjWXXfdLplQJKlZMpMVK1bw8MMPs9lmmw10c4A+LrKTmX8E/tiktkiShonOBUAeeeQRXnzxxQFujaR2ts4667DZZps1vGBVf2ko+I6Il4F3ZeYfauybDPwhM0c0q3GSpPa14YYbDpo/hpLUKo3ecNnT94IjWJWCUJIkSVKVuka+y2wmnYF3Z3aTSusCewNPNrFtkiRJUlvpNfiOiJOAE8uHCVzXQ/XvNqNRkiRJUjuqZ+T7N+XPoAjCzwUeqqrzPHAH8POmtUySBpnJcyZ3eWz2k8ZMnlz1+i3w9ZM0/PQafGfmtcC1ABGRwDmZ+Uh/N0ySBpuFjy4c6CYMaQsX+vpJUkM3XGbmKX0NvCNiy4i4JCKejoilEfHTiNiqzmNHR8TpEfFoRKyIiBsi4t1VdV4fEf8aEbdGxPKy7mUR8da+tFuSJEnqq4bzfEfEFOBgYCtgdNXuzMw9ejh2DDCPYprKoRRzyGcD10TEWzLzmV6e/lzgfcBngfuBTwK/joh3ZebNZZ33AL
sB5wMLgY2AzwG/j4idM9PvOSVJkjQgGs3zfQTwPeCvwD0UQXSXKr2cYjqwLfCGzLy3POet5bmOAM7o4bnfCnwEODwzzyvLrgUWAbOAfcuqFwFnZWZWHDsPeAD4Z+CQ3q5TkiRJ6g+NjnwfC/yIIgB+YQ2eb1/gxs7AGyAzF0fEdcB+9BB8l8e+CPyk4tiXIuIi4PiIGJWZz2fmaukOM/PpiLgbGL8GbZYkSZKaotFFdsYD561h4A0wEbi9RvkiYIc6jl2cmc/WOHYksF13B0bEOOBNwJ31N1WSJElqrkaD7wUU00bW1DjgqRrlS4CxfTi2c393vk0xJeZb3VWIiBkRMT8i5j/xxBO9NEWSJElqXKPB96eBY6ozjDSo1hL0vc0V76zT8LER8XmKueJHV053Wa1RmXMysyMzOzbZZJM6miNJkiQ1ptE533OBDSmykzzL6iPRmZkTejj+KWqPUI+tca5qSygyrNQ6tnN/FxFxJPAV4ITM/Pdezi9JkiT1q0aD76upPfpcr0UUc7er7UCxQmZvx34oIsZUzfveAXgB6DKqHREfo1ju/huZ+eU1b7IkSZLUHA0F35k5rY/Pdxnw9YjYNjPvB4iIrYGdgePrOPYU4ECKHN5ExNrAQcAVmflK2sOI+BBwHvD9zPxMH9ssSZIkNUXDi+z00TnA0cDPIuIEilH0U4E/A2d3VoqICcB9wKzMnAWQmTdHxE+Ab0XEOsBi4ChgG2BqxbHvBn4M3Ar8ICLeWfH8z2fmH/vx+jRcdXSsXjZ/fuvbIUmSBrVGb7gkIt5WLgn/ZES8FBGTyvKvRMRePR1brmC5O3A3cAHwQ4ogevfMXF75NMCIGu07jGJEezbwC2BLYK/MXFhRZ3dgFPA24Drghortvxq9XkmSJKlZGl3hchfgKoql3X9EMYrdaSVwJPCrns6RmQ8C+/dS5wFqZDHJzBXAzHLr7tiTgZN7Or8kSZI0EBqddvJV4NfABylGpiuD74W4dLukNpYn9eV+c2X6+klSo8H3JODDmZkRUf0p+iRggmxJkiSpG43O+X4OGNPNvs2Bp/vWHEmSJKl9NRp8/45ihcsRFWWdI+AfB+Y1pVWSJElSG2p02smXKDKI3AJcQhF4HxoRZwCTgZ2a2zxJkiSpfTQ08p2ZtwDvBv4CfJEiI0nnTZdTMvOu5jZPkiRJah8NL7JT5tTeIyJGA+OAv1Ut9y5JbWnG3BldHs/5wJwBasnQNGNG1es3x9dP0vATjaR+KleWHFkullO9bz3ghcx8sYntGxAdHR0539UJh69aq1VWq35/uMLlsBCndF1+wNSDjYmoev1MPShpkIiIBZlZRwDQd42OfH8fWAf4SI19ZwMvAIf3tVGSJElSO2o028luwM+62XcZsEffmiNJkiS1r0aD702Bx7vZ9wSwWd+aI0mSJLWvRoPvx4E3d7PvzcBf+9YcSZIkqX01Gnz/HPhSRLylsjAi3kyRenBusxomSZIktZtGb7g8EdgTWBARNwEPAeOBtwOLgROa2zxJkiSpfTQUfGfmkxGxEzCTIgjfEXgS+DLwzcx8uvlNlAahetIRSpIkVak7+I6IEcCbgEcy80SKUXBJkiRJdWpkzncC84G39VNbJEmSpLZWd/CdmSuBPwPr9V9zJEmSpPbVaLaTs4FjImJkfzRGkiRJameNZjvZAHgtcH9E/Ap4lGI6SqfMzJOa1ThJkiSpnTQafH+h4vfDa+xPwOBbkiRJqqHRVIONTlORpLYxf/r8gW7CkDZ/vq+fJDU68i1Jw9bkLSYPdBOGtMmTff0kqeGR7CjsGxFfj4jzImJCWT4lIrZofhMlSZKk9tDQyHdEjAUuB94BLKW4AfPbwJ+A6cAS4NNNbqMkSZLUFhod+T4d2BLYGdgYiIp9VwF7NKldkiRJUttpdM73fsBnMvOGcrn5Sg9SBOaSJEmSamg0+F4feLibfaPpOhIuSW1lzoI5XR7PmDxjgFoyNM2ZU/X6zfD1kzT8NB
p83wW8h2KKSbUpwG19bpEkDVJH/PyILo8NvhtzxBFVr5/Bt6RhqNHg+yzgrIh4GvhRWbZRRBwGHA34SSpJkiR1o9FFds6JiNcCpwCzyuIrgZXAaZn5wya3T5IkSWobjaYa3Bg4GfgexfSTTYC/Aldm5v1Nb50kSZLURnoNvsusJl8CjqHI6/0yMBf4eGb+rX+bJ0mSJLWPeka+jwROBH4D3ARsC3yIYpGdw/qtZZIkSVKbqSf4ng6ck5mv3KYeEUcA34mIIzLzhX5rnSRJktRG6lnhclvg4qqynwAjgAlNb5EkSZLUpuoJvtenmGJSaVn5c4PmNkeSJElqX/VmOxkfEdtWPB5RUd7lpkuznkiSJEm11Rt8X9JN+X/XKBtRo0ySJEka9uoJvs1oIkmSJDVBr8F3Zp7fioZIkiRJ7a6eGy4lSZIkNUFDy8tL0nB29vvPHugmDGlnn+3rJ0kG35JUpxmTZwx0E4a0GTN8/SSp5dNOImLLiLgkIp6OiKUR8dOI2KrOY0dHxOkR8WhErIiIGyLi3TXqzYyIuWW9jIiTm34hkiRJUoNaGnxHxBhgHvBG4FDgY8DrgGsiYr06TnEuxXL3JwLvBx4Ffh0RO1bVmw5sSu1UiJIkSdKAaPW0k+kUy9W/ITPvBYiIW4F7gCOAM7o7MCLeCnwEODwzzyvLrgUWAbOAfSuqT8zMlRGxNnBkf1yIJEmS1KhWB9/7Ajd2Bt4Ambk4Iq4D9qOH4Ls89kXgJxXHvhQRFwHHR8SozHy+LF/ZL62XGtHR0fXx/PkD0w5JkjRotDr4ngj8rEb5IuDAOo5dnJnP1jh2JLBd+bsk9YsFjyzo8njyFpMHqCVD04IFVa/fZF8/ScNPq4PvccBTNcqXAGP7cGzn/jUWETOAGQBbbVXX/Z+ShpmOc7p+m5En5QC1ZGjqqPo2KNPXT9LwMxCL7NT6tI06jos+HNurzJyTmR2Z2bHJJps045SSJElSF60Ovp+i9gj1WGqPalda0sOxnfslSZKkQavVwfciirnb1XYA7qjj2G3KdIXVx74A3Lv6IZIkSdLg0erg+zLgnRGxbWdBRGwN7Fzu6+3Ydai4MbNMJXgQcEVnphNJkiRpsGr1DZfnAEcDP4uIEyjmcJ8K/Bk4u7NSREwA7gNmZeYsgMy8OSJ+AnwrItYBFgNHAdsAUyufJCI6gK1Z9Z+LHSLigPL3y2tkTJEkSZL6XUuD78x8JiJ2B74JXEBxs+TVwDGZubyiagAjWH1k/jDgy8BsYCPgFmCvzFxYVe9oihU0Ox3IqhHzbYAH+nwxkiRJUoNaPfJNZj4I7N9LnQeokcUkM1cAM8utp+OnAdPWtI2SJElSfxiIVIOSJEnSsGTwLUmSJLWIwbckSZLUIgbfkiRJUosYfEuSJEktYvAtSZIktUjLUw1K0lA1fdL0gW7CkDZ9uq+fJEVmDnQbBp2Ojo6cP3/+QDdDA6WjozXP43tMkqRBISIWZGZLAgCnnUiSJEktYvAtSZIktYjBtyRJktQiBt+SJElSixh8S5IkSS1iqkFJqlOcEl0e50lmi2pERNXrZ7YtScOQI9+SJElSizjyLVW54847W/I8h9TIJ/4fVc99yPbbN3zeZpwDwFz3kiQ1nyPfkiRJUosYfEuSJEktYvAtSZIktYjBtyRJktQiBt+SJElSixh8S5IkSS1i8C1JkiS1iMG3JEmS1CIusiMNkOrFcNb0mDVdREeSJLWeI9+SJElSixh8S5IkSS3itBNJqtOkzScNdBOGtEmTfP0kyeBbkuq0YMaCgW7CkLZgga+fJDntRJIkSWoRg29JkiSpRQy+JUmSpBZxzrek+nR0dH08f/7AtEOSpCHMkW9JkiSpRRz5lqQ6TZ4zuctjs580ZvLkqtfP7CeShiGDb0mq08JHFw50E4a0hQt9/STJ4FtSTR1Vc7z/4847uzw+pHoOeJua79x2SVITOedbkiRJahGDb0mSJKlFDL4lSZKkFnHOtzSIVc
+zbtXzHLL99i15Xg0T5f0BN5UPdxq4lkjSgHPkW5IkSWoRg29JkiSpRQy+JUmSpBZp+ZzviNgS+CawJxDAVcAxmflgHceOBk4FPgpsBNwMHJeZv62qtxZwHHAE8GrgLmBWZl7axEuRNAx0yXf+SNW+ucMj1zmY71ySmqWlI98RMQaYB7wROBT4GPA64JqIWK+OU5wLTAdOBN4PPAr8OiJ2rKp3KnAy8B1gb+BG4OKI2KcJlyFJkiStkVaPfE8HtgXekJn3AkTErcA9FKPUZ3R3YES8FfgIcHhmnleWXQssAmYB+5ZlmwKfAb6amV8vD78mIrYDvgpc3g/XJUmSJPWq1cH3vsCNnYE3QGYujojrgP3oIfguj30R+EnFsS9FxEXA8RExKjOfB94LjAQurDr+QuDfI2KbzFzcnMuRpOGhy/SbBnWXMrMv5xwoTr+R1FetDr4nAj+rUb4IOLCOYxdn5rM1jh0JbFf+PhF4Hri3Rj2AHQCDb0lSw4bifxjUf/zPmNZEq4PvccBTNcqXAGP7cGzn/s6ff8vM7KVeFxExA5hRPnw+Im7vpT1qXxsDTw50I+q2YEGXhxN72V9LPcesyXmHqLr7f8GjbfsaNNVq753SgsH5Hhpa//7VTA33fUT0U1M0AN7QqicaiBUuq4NiKLKe9CbqPLbeel0blTkHmAMQEfMz0+GNYcr+H97s/+HN/h++7PvhLSJa9jVGq/N8P0Xtkeex1B7VrrSkh2M793f+HBur/3e0up4kSZLUUq0OvjvnZFfbAbijjmO3KdMVVh/7AqvmeC8CRgGvrVGPOp5HkiRJ6hetDr4vA94ZEdt2FkTE1sDO5b7ejl2HihszI2Jt4CDgijLTCcCvKILxqVXHfxS4vc5MJ3PqqKP2Zf8Pb/b/8Gb/D1/2/fDWsv6P1e9L7McnKxbSuQVYAZxAMTf7VGAD4C2ZubysNwG4j2JVylkVx19EkUrwsxQZS46iWGzn7zNzYUW9rwLHAF8AFlIE6EcA+2Xm3H6+TEmSJKmmlt5wmZnPRMTuFMvLX0BxE+TVFMvLL6+oGsAIVh+ZPwz4MjCbYnn5W4C9KgPv0heB5cA/s2p5+X8w8JYkSdJAaunItyRJkjSctXrO96AVEVtGxCUR8XRELI2In0bEVgPdLq25iDggIi6NiD9FxIqIuCsi/iUiNqiqNzYivh8RT0bEMxFxVUS8ucb5RkfE6RHxaHm+GyLi3a27IvVFRPwqIjIiZleV2/9tKiL2iYjfRsTy8nN9fvnta+d++75NRcTOEXFFRDxe9v3CiDi8qk5d/RoRa0XE5yPigYh4LiJuiYj9W3c16klEvCYivl3237Pl5/zWNeo1vb8jYnpE/G9EPF/GGEfW02aDb6DMoDIPeCNwKPAx4HXANeU8dQ1NnwFeppj7vxfwPYr7BK6MiLUAypSUl5X7PwXsT3Fj7zUR8Zqq850LTAdOpLjX4FHg1xGxY/9fivoiIg4G3lqj3P5vUxFxBMWKyguAD1HcrH8xMKbcb9+3qYh4C3AVRX9Op+jbm4BzI+Koiqr19uupwMnAd4C9gRuBiyNin368DNVvO+AfKFJW/08P9Zra3xExHTgbuJTic+Ri4LtV77HaMnPYbxRzw18Gtqso2wZ4CZg50O1zW+N+3aRG2SEUN/ruXj7er3y8W0Wdv6PIB39mRdlby3qHVZStTXE/wWUDfa1uPb4PNgIeAw4u+3B2xT77vw03YGuKG/uP6aGOfd+mG/AViqxn61eV3wjc0Ei/ApsCzwOnVJ3rauDWgb5WtwRYq+L3fyr7deuqOk3t7/LYx4Hzq+r9O8Uqqev01GZHvgv7AjdmZmeucLJISXgdxQe0hqDMfKJG8U3lz/Hlz32BRzLzmorjngbm0rXv9wVeBH5SUe8l4CLgvRExqolNV3OdBizKzB/X2Gf/t6fDgZXAv/VQx75vXyMp+mxFVfnfWPWNf739+t7yfBdWnetC4M0RsU1zm65GZebKOqo1u7
/fBWxSo94FwKuAXXpqjMF3YSJwe43yRaxanEftYUr5887yZ099v1VErF9Rb3FmPluj3kiKr700yETELhTfdnyimyr2f3vaBfhf4B8j4r6IeCki7o2IT1bUse/b1w/Kn2dGxBYRsVE5RWAPimxrUH+/TqQYCb23Rj0wRhgqmt3fnQtGVn+G1PW+MPgujKP28vZLWLUsvYa4iBgPzAKuysz5ZXFPfQ+r+r+3euOa1U41R0SsQzEf7+uZeVc31ez/9rQFxX07pwNfBd4DXAl8JyL+uaxj37epzLwd2JXiG4yHKfrvLODIzLyorFZvv44D/pblnIIe6mlwa3Z/d/6sPmdd74uW5vke5GrlXIyWt0L9ohzF+hnFPP7DKndRX9/XW0+Dx3HAuhRrA3TH/m9Pa1Es3jYtM39als0rMyB8PiLOxL5vWxHxOoqb4BYBR1JMP9kP+LeIeC4zf4j9P9w0u787H69Rvm6D78JT1P5fylhq/09JQ0hEjKbIarAtMCUzH6rYvYTu+x5W9f8SoFbqybEV+zVIRJEm9IsUN9+MqpqXOyoiNgKWYf+3q79SjHxfWVV+BUVWgs2x79vZVyjm974/M18sy66OiFcB/xoRP6b+fl0CjI2IqBoNtf+Hlmb3d+UI96MV9cZV7a/JaSeFRayav1NpB+COFrdFTVROPbgUeDuwT2beVlWlp75/MFetvLoI2KZMS1ld7wVWnx+mgbUtMJriZpinKjYoUlA+BbwZ+79dLeqmvHO0aiX2fTt7M3BLReDd6Q8UN8NtSv39uggYBby2Rj0wRhgqmt3fnZ8x1Z8hdb0vDL4LlwHvjIhtOwvKryd3LvdpCCpzef+Q4iab/TLzxhrVLgPGR8SUiuM2BD5A176/jCJn7IEV9dYGDgKuyMznm38F6oObgd1qbFAE5LtRfNja/+3pv8qf760qfy/wUGY+hn3fzh4DdoyIkVXl7wCeoxiVrLdff0URnE2tOtdHgdvLzGga/Jrd3zdQpBSsVW/YYdCCAAAGlElEQVQJRba87g10fsbBsAHrUfwhvo1iXti+wC3A/VTlCXUbOhvFojoJzAbeWbW9pqyzFnA98GfgHyn+OP+m/MezZdX5LqIYMf0nioD+EooP8kkDfa1udb8nqvN82/9tuFGMcM+jmH5yJMUNl3PK/p9m37f3BhxQ9vWvy7/p76FYMCWBMxrtV4qbdp8DZlLcyPk9im9PPjDQ1+rWpc8PqPi7f1T5eEp/9Xf52bKyjDF2pUjosBL4ZK/tHegXbLBsFHOBLgWWUswF/W+qkrS7Da0NeKD8R1hrO7mi3jiKxPhLgGcpkum/tcb51gXOoBhVeQ74PbDrQF+nW0PviS7Bt/3fvhuwIUWGi79QjGTdCnzEvh8eG8XKhL8Bnij/pt9MkXJ0RKP9CowATgD+RJGG7lbggIG+RrcufdTd3/rf9Gd/A0cAd5f17gE+UU97ozxYkiRJUj9zzrckSZLUIgbfkiRJUosYfEuSJEktYvAtSZIktYjBtyRJktQiBt+SJElSixh8S1KLRcQhEfGnisd3RsRRTX6Od0XE7yPimYjIiNixmefvDxGxddnWaQPdFknqL2sPdAMkaRiaDCwAiIj1gdd3Pm6ic4EVFMulP0uxEIQkaYA58i1JrfdK8F3+vpJiFbWmiIi1gDcAv8jMeZl5Y2Y+26zz90VEjBroNkjSQDL4lqQWKgPjHYGFZdFk4I7MfK7O4zeMiO9ExCMR8XxE3BUR/y8iotw/DXiZ4vP9S+U0jge6OVdHuX+XirJPlWWzK8peV5btU1H29oi4KiKWl1Nbro6It1ed/wcR8VA5Beb6iFgBnFbuGxMR342Iv5bnuAx4TY027hQRV5b1no2I+yPiu/W8VpI0GDntRJJaoAyAJ1QUXV7Gy537s/x1m8x8oJtzrAX8ApgEnAjcBrwPOAPYBPhCuX8X4HcUU0++DzzfTbMWAn8Ddi/rU/6+ovxJRdnLwP+U7XgLcC1wBzANSO
B44NqIeGdm3lJx7N8BFwFfL9u3oiw/GzgIOAW4CdgT+FHV9a4P/Br4Q/k8y4Ctgb/v5nokadAz+Jak1tgHGAkcArwXmFqW/xY4CbimfPxIL+fYBTgsM39Qll0REesBx0bEGZn5REQ8Ve57KDNv7O5kmbkyIn4L7AbMKoP7KcD3gE9HxPqZubzcPz8zl5WHnkgR0O+RmX8DiIgrgQfKa/lwxdOsD3w0M3/WWRARbwA+AnwxM79acR3rA0dWHPtGYCzwucysnJbzAyRpiHLaiSS1QGbekZk3A1sCvyl/fwbYALg4M28utxd6OM27KeaH/7iq/EKKwP5da9C0a4B3RcRoiukwG1FMDXke+D9lnV2BeVXt+Hln4F1e31LgMorgvdJLwM+ryt5B8ffnP6vKL6p6fA/FyPzZEfHRiNiy/suSpMHJ4FuS+llEjIiItSNibWBn4Iby9/8DPAw8Vu6PHk8E44AlmVk9jeSxiv2NmgeMopjKsRtwS2b+hWIaym4RMRHYjFUj853P82iNcz1GMVJd6fHMfLmqbPPy51+qyrs8zsynyzY9AnwXeDAibo+I/eu5MEkajAy+Jan/XQ28WG6bAxeUv58LjK/YVz1qXG0JMC4iRlaVv7r8+dc1aNttwJMU87p3Z9UI97yKsheA66ra8WpW9+pyX6WsUa8zcN+sqrz6MeW3AftTBPzvAu4D/jMi3tTN9UjSoGbwLUn97whgJ4qbDu8tf98JeAI4oeJxb7m+r6X43D6wqnwqRYDc7fzu7mRmlufdk2IkvjL4fhvwIeD3VakKrwXeFxEbdBaUv3+g3Neb31NMn/mHqvJ/7KGdL5Xz179E8RpsX8fzSNKg4w2XktTPMvMugIj4EkXu7fnlTYcbA+dm5mM9nmCVX1JMB/m3iNgEWERxE+Y/Af+SmU+uYRPnAWdRkdGEIhPKUsqbMavqnwq8H7g6Ir5GMbp9HDCmRt3VZOZdEfEjVt3k2ZntZJ/KehHxfmAG8N/AYmA94NMUWU9uaPgqJWkQMPiWpBYop4rsARxQFu0N/LGBwLszO8n7gK9QBLuvosgwMhP4Vh+a1zmfe35542RlJpR96Trfm8y8NSJ2Bb4MnA8Exaj7lKo0gz05AlgOfIbiZtF5FBlQfldR5x6K1IRfopius4wyUM/Mhxq8RkkaFKL4xlGSJElSf3POtyRJktQiBt+SJElSixh8S5IkSS1i8C1JkiS1iMG3JEmS1CIG35IkSVKLGHxLkiRJLWLwLUmSJLXI/wddVESCiofWugAAAABJRU5ErkJggg==\n",
"text/plain": [
"
"
]
},
"metadata": {
"needs_background": "light"
},
"output_type": "display_data"
}
],
"source": [
"plt.figure(figsize=(10.5,4.5))\n",
"plt.hist(# Your code goes here\n",
" )\n",
"plt.axvline(# Your code goes here\n",
" )\n",
"\n",
"# Your code goes here\n",
"\n",
"plt.tight_layout()"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Relative to the rest of Wikipedia, the nearest neighbors of Obama are overwhelmingly short; most of them are shorter than 300 words. This bias towards short articles is not appropriate in this application, as there is no reason to favor short articles over long ones (they are all Wikipedia articles, after all). Many Wikipedia articles are 300 words or longer, and both Obama's and Biden's articles are over 300 words long.\n",
"\n",
"**Note**: For the interest of computation time, the dataset given here contains _excerpts_ of the articles rather than full text. For instance, the actual Wikipedia article about Obama is around 25000 words. Do not be surprised by the low numbers shown in the histogram."
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"**Note:** Both word-count features and TF-IDF are proportional to word frequencies. While TF-IDF penalizes very common words, longer articles tend to have longer TF-IDF vectors simply because they have more words in them."
]
},
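{
"cell_type": "markdown",
"metadata": {},
"source": [
"As a minimal sketch of the note above, assume unnormalized TF-IDF (raw count times IDF weight). The four-word vocabulary, the IDF weights, and the counts below are made up for illustration. Repeating an article three times leaves its word distribution unchanged but triples the norm of its TF-IDF vector, so Euclidean distance treats the two versions as far apart.\n",
"\n",
"```python\n",
"import numpy as np\n",
"\n",
"# Hypothetical IDF weights for a toy four-word vocabulary (not from the Wikipedia data).\n",
"idf = np.array([1.0, 2.5, 4.0, 0.5])\n",
"\n",
"# Word counts for a short article and for the same text repeated three times.\n",
"short_counts = np.array([2, 1, 0, 5])\n",
"long_counts = 3 * short_counts\n",
"\n",
"# Unnormalized TF-IDF: count times IDF weight (an assumption of this sketch).\n",
"short_tfidf = short_counts * idf\n",
"long_tfidf = long_counts * idf\n",
"\n",
"short_norm = np.linalg.norm(short_tfidf)\n",
"long_norm = np.linalg.norm(long_tfidf)\n",
"\n",
"# The ratio is 3 (up to rounding): same content, three times the vector length.\n",
"print(long_norm / short_norm)\n",
"```"
]
},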
{
"cell_type": "markdown",
"metadata": {},
"source": [
"To remove this bias, we turn to **cosine distances**:\n",
"$$\n",
"d(\\mathbf{x},\\mathbf{y}) = 1 - \\frac{\\mathbf{x}^T\\mathbf{y}}{\\|\\mathbf{x}\\| \\|\\mathbf{y}\\|}\n",
"$$\n",
"Cosine distances let us compare word distributions of two articles of varying lengths."
]
},
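{
"cell_type": "markdown",
"metadata": {},
"source": [
"The formula above can be sketched directly in NumPy. The helper name `cosine_distance` and the toy vectors are hypothetical and assume non-zero inputs. The point of the example is that a vector and a scaled copy of it have cosine distance zero, even though their Euclidean distance is large.\n",
"\n",
"```python\n",
"import numpy as np\n",
"\n",
"def cosine_distance(x, y):\n",
"    # 1 - cos(angle between x and y); assumes neither vector is all zeros.\n",
"    x = np.asarray(x, dtype=float)\n",
"    y = np.asarray(y, dtype=float)\n",
"    return 1.0 - (x @ y) / (np.linalg.norm(x) * np.linalg.norm(y))\n",
"\n",
"x = np.array([1.0, 2.0, 0.0])\n",
"y = 10 * x  # same direction, ten times longer\n",
"\n",
"d_cos = cosine_distance(x, y)  # zero up to floating-point rounding\n",
"d_euc = np.linalg.norm(x - y)  # grows with the difference in length\n",
"print(d_cos, d_euc)\n",
"```"
]
},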
{
"cell_type": "markdown",
"metadata": {},
"source": [
"**d)** Train a new nearest neighbor model, this time with cosine distances. Then repeat the search for Obama's 100 nearest neighbors and make a plot to better visualize the effect of having used cosine distance in place of Euclidean on our TF-IDF vectors."
]
},
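{
"cell_type": "markdown",
"metadata": {},
"source": [
"To build intuition for what a cosine nearest neighbor search does, here is a brute-force sketch in NumPy on a toy matrix. It is not scikit-learn's implementation; the helper `cosine_nearest_neighbors` and the four toy rows are hypothetical. Note that the row with the same word mix but ten times the length ties with the query itself, while the row sharing no words lands at distance 1.\n",
"\n",
"```python\n",
"import numpy as np\n",
"\n",
"def cosine_nearest_neighbors(X, query, k):\n",
"    # Brute force: cosine distance from the query to every row, then keep the k smallest.\n",
"    X = np.asarray(X, dtype=float)\n",
"    query = np.asarray(query, dtype=float)\n",
"    sims = (X @ query) / (np.linalg.norm(X, axis=1) * np.linalg.norm(query))\n",
"    dists = 1.0 - sims\n",
"    order = np.argsort(dists, kind='stable')[:k]\n",
"    return order, dists[order]\n",
"\n",
"docs = np.array([\n",
"    [3.0, 0.0, 1.0],    # row 0: the query document itself\n",
"    [30.0, 0.0, 10.0],  # row 1: same word mix, ten times longer\n",
"    [0.0, 2.0, 0.0],    # row 2: shares no words with the query\n",
"    [2.0, 1.0, 1.0],    # row 3: partial overlap\n",
"])\n",
"\n",
"idx, dist = cosine_nearest_neighbors(docs, docs[0], k=3)\n",
"print(idx, dist)\n",
"```"
]
},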
{
"cell_type": "code",
"execution_count": 139,
"metadata": {
"collapsed": false,
"jupyter": {
"outputs_hidden": false
}
},
"outputs": [
{
"data": {
"text/html": [
"
\n",
"\n",
"
\n",
" \n",
"
\n",
"
\n",
"
name
\n",
"
length
\n",
"
BO-cos-TF-IDF
\n",
"
\n",
" \n",
" \n",
"
\n",
"
35817
\n",
"
Barack Obama
\n",
"
540
\n",
"
0.000000
\n",
"
\n",
"
\n",
"
24478
\n",
"
Joe Biden
\n",
"
414
\n",
"
0.572725
\n",
"
\n",
"
\n",
"
57108
\n",
"
Hillary Rodham Clinton
\n",
"
580
\n",
"
0.616149
\n",
"
\n",
"
\n",
"
38376
\n",
"
Samantha Power
\n",
"
310
\n",
"
0.625797
\n",
"
\n",
"
\n",
"
38714
\n",
"
Eric Stern (politician)
\n",
"
255
\n",
"
0.651475
\n",
"
\n",
"
\n",
"
...
\n",
"
...
\n",
"
...
\n",
"
...
\n",
"
\n",
"
\n",
"
2045
\n",
"
Allan Ryan (attorney)
\n",
"
291
\n",
"
0.731376
\n",
"
\n",
"
\n",
"
47085
\n",
"
Ray Thornton
\n",
"
326
\n",
"
0.731908
\n",
"
\n",
"
\n",
"
16392
\n",
"
P%C3%A9ter Kov%C3%A1cs (lawyer)
\n",
"
365
\n",
"
0.732172
\n",
"
\n",
"
\n",
"
55495
\n",
"
Lokman Singh Karki
\n",
"
2486
\n",
"
0.732608
\n",
"
\n",
"
\n",
"
22304
\n",
"
Chung Dong-young
\n",
"
886
\n",
"
0.732785
\n",
"
\n",
" \n",
"
\n",
"
100 rows × 3 columns
\n",
"
"
],
"text/plain": [
" name length BO-cos-TF-IDF\n",
"35817 Barack Obama 540 0.000000\n",
"24478 Joe Biden 414 0.572725\n",
"57108 Hillary Rodham Clinton 580 0.616149\n",
"38376 Samantha Power 310 0.625797\n",
"38714 Eric Stern (politician) 255 0.651475\n",
"... ... ... ...\n",
"2045 Allan Ryan (attorney) 291 0.731376\n",
"47085 Ray Thornton 326 0.731908\n",
"16392 P%C3%A9ter Kov%C3%A1cs (lawyer) 365 0.732172\n",
"55495 Lokman Singh Karki 2486 0.732608\n",
"22304 Chung Dong-young 886 0.732785\n",
"\n",
"[100 rows x 3 columns]"
]
},
"execution_count": 139,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"# Your code goes here\n",
"nearest_neighbors_cosine"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"From a glance at the above table, things look better. For example, we now see Joe Biden as Barack Obama's nearest neighbor! We also see Hillary Clinton on the list. This list looks even more plausible as nearest neighbors of Barack Obama."
]
},
{
"cell_type": "code",
"execution_count": 140,
"metadata": {
"collapsed": false,
"jupyter": {
"outputs_hidden": false
}
},
"outputs": [
{
"data": {
"text/plain": [
"
"
]
},
"metadata": {
"needs_background": "light"
},
"output_type": "display_data"
}
],
"source": [
"plt.figure(figsize=(10.5,4.5))\n",
"plt.hist(# Your code goes here\n",
" )\n",
"plt.axvline(# Your code goes here\n",
" )\n",
"\n",
"# Your code goes here\n",
"\n",
"plt.tight_layout()"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Indeed, the 100 nearest neighbors using cosine distance provide a sampling across the range of document lengths, rather than just short articles like Euclidean distance provided."
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"**Moral of the story**: In deciding the features and distance measures, check if they produce results that make sense for your particular application."
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Ex. 5: Problem with cosine distances: tweets vs. long articles"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Happily ever after? Not so fast. Cosine distances ignore all document lengths, which may be great in certain situations but not in others. For instance, consider the following (admittedly contrived) example."
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"```\n",
"+--------------------------------------------------------+\n",
"| +--------+ |\n",
"| One that shall not be named | Follow | |\n",
"| @username +--------+ |\n",
"| |\n",
"| Democratic governments control law in response to |\n",
"| popular act. |\n",
"| |\n",
"| 8:05 AM - 16 May 2016 |\n",
"| |\n",
"| Reply Retweet (1,332) Like (300) |\n",
"| |\n",
"+--------------------------------------------------------+\n",
"```"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"**a)** Transform the tweet into TF-IDF features, using the fit to the Wikipedia dataset. (That is, let's treat this tweet as an article in our Wikipedia dataset and see what happens.) How similar is this tweet to Barack Obama's Wikipedia article? "
]
},
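{
"cell_type": "markdown",
"metadata": {},
"source": [
"The idea of reusing weights learned from the corpus can be sketched without any library. The sketch assumes unnormalized TF-IDF (count times IDF); the small `idf` dictionary is a hypothetical stand-in for the weights learned from the full dataset, and the helper `tf_idf` is illustrative only. Words that never appeared in the corpus have no weight and are dropped.\n",
"\n",
"```python\n",
"from collections import Counter\n",
"\n",
"# Hypothetical stand-in for IDF weights learned from the corpus.\n",
"idf = {'democratic': 4.102672, 'law': 3.453823, 'in': 1.000965, 'act': 4.459778}\n",
"\n",
"def tf_idf(text, idf):\n",
"    # Map a lowercase text to {word: count * idf}, skipping out-of-vocabulary words.\n",
"    counts = Counter(text.split())\n",
"    return {word: n * idf[word] for word, n in counts.items() if word in idf}\n",
"\n",
"# 'in' occurs twice, and 'unseenword' is not in the vocabulary.\n",
"weights = tf_idf('democratic law in act in unseenword', idf)\n",
"print(weights)\n",
"```"
]
},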
{
"cell_type": "code",
"execution_count": 142,
"metadata": {
"collapsed": false
},
"outputs": [
{
"data": {
"text/html": [
"
\n",
"\n",
"
\n",
" \n",
"
\n",
"
\n",
"
word count
\n",
"
tf_idf
\n",
"
\n",
" \n",
" \n",
"
\n",
"
democratic
\n",
"
1
\n",
"
4.102672
\n",
"
\n",
"
\n",
"
governments
\n",
"
1
\n",
"
5.167571
\n",
"
\n",
"
\n",
"
control
\n",
"
1
\n",
"
4.721765
\n",
"
\n",
"
\n",
"
law
\n",
"
1
\n",
"
3.453823
\n",
"
\n",
"
\n",
"
in
\n",
"
1
\n",
"
1.000965
\n",
"
\n",
"
\n",
"
response
\n",
"
1
\n",
"
5.261462
\n",
"
\n",
"
\n",
"
to
\n",
"
1
\n",
"
1.046945
\n",
"
\n",
"
\n",
"
popular
\n",
"
1
\n",
"
3.764479
\n",
"
\n",
"
\n",
"
act
\n",
"
1
\n",
"
4.459778
\n",
"
\n",
" \n",
"
\n",
"
"
],
"text/plain": [
" word count tf_idf\n",
"democratic 1 4.102672\n",
"governments 1 5.167571\n",
"control 1 4.721765\n",
"law 1 3.453823\n",
"in 1 1.000965\n",
"response 1 5.261462\n",
"to 1 1.046945\n",
"popular 1 3.764479\n",
"act 1 4.459778"
]
},
"execution_count": 142,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"df = pd.DataFrame({'text': ['democratic governments control law in response to popular act']})\n",
"\n",
"# Your code goes here"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Let's compare this tweet's TF-IDF vectors to Barack Obama's Wikipedia entry."
]
},
{
"cell_type": "code",
"execution_count": 144,
"metadata": {
"collapsed": false,
"jupyter": {
"outputs_hidden": false
}
},
"outputs": [
{
"data": {
"text/html": [
"
\n",
"\n",
"
\n",
" \n",
"
\n",
"
\n",
"
tf-idf
\n",
"
\n",
" \n",
" \n",
"
\n",
"
obama
\n",
"
52.295653
\n",
"
\n",
"
\n",
"
the
\n",
"
40.004063
\n",
"
\n",
"
\n",
"
act
\n",
"
35.678223
\n",
"
\n",
"
\n",
"
in
\n",
"
30.028962
\n",
"
\n",
"
\n",
"
iraq
\n",
"
21.747379
\n",
"
\n",
"
\n",
"
...
\n",
"
...
\n",
"
\n",
"
\n",
"
is
\n",
"
2.055233
\n",
"
\n",
"
\n",
"
new
\n",
"
1.887235
\n",
"
\n",
"
\n",
"
which
\n",
"
1.767431
\n",
"
\n",
"
\n",
"
that
\n",
"
1.661407
\n",
"
\n",
"
\n",
"
by
\n",
"
1.374553
\n",
"
\n",
" \n",
"
\n",
"
273 rows × 1 columns
\n",
"
"
],
"text/plain": [
" tf-idf\n",
"obama 52.295653\n",
"the 40.004063\n",
"act 35.678223\n",
"in 30.028962\n",
"iraq 21.747379\n",
"... ...\n",
"is 2.055233\n",
"new 1.887235\n",
"which 1.767431\n",
"that 1.661407\n",
"by 1.374553\n",
"\n",
"[273 rows x 1 columns]"
]
},
"execution_count": 144,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"obama_tf_idf"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"**b)** Now, compute the cosine distance between the Barack Obama article and this tweet:"
]
},
{
"cell_type": "code",
"execution_count": 418,
"metadata": {
"collapsed": false,
"jupyter": {
"outputs_hidden": false
}
},
"outputs": [
{
"data": {
"text/plain": [
"array([[0.69866453]])"
]
},
"execution_count": 418,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"from sklearn.metrics.pairwise import cosine_distances # for one pair of samples we can just use this function\n",
"\n",
"# Your code goes here"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Let's compare this distance to the distance between the Barack Obama article and all of its Wikipedia nearest neighbors:"
]
},
{
"cell_type": "code",
"execution_count": 427,
"metadata": {
"collapsed": false,
"jupyter": {
"outputs_hidden": false
}
},
"outputs": [
{
"data": {
"text/html": [
"
\n",
"\n",
"
\n",
" \n",
"
\n",
"
\n",
"
name
\n",
"
length
\n",
"
BO TF-IDF cos
\n",
"
\n",
" \n",
" \n",
"
\n",
"
35817
\n",
"
Barack Obama
\n",
"
540
\n",
"
0.000000
\n",
"
\n",
"
\n",
"
24478
\n",
"
Joe Biden
\n",
"
414
\n",
"
0.572725
\n",
"
\n",
"
\n",
"
57108
\n",
"
Hillary Rodham Clinton
\n",
"
580
\n",
"
0.616149
\n",
"
\n",
"
\n",
"
38376
\n",
"
Samantha Power
\n",
"
310
\n",
"
0.625797
\n",
"
\n",
"
\n",
"
38714
\n",
"
Eric Stern (politician)
\n",
"
255
\n",
"
0.651475
\n",
"
\n",
"
\n",
"
28447
\n",
"
George W. Bush
\n",
"
505
\n",
"
0.659478
\n",
"
\n",
"
\n",
"
39357
\n",
"
John McCain
\n",
"
410
\n",
"
0.661645
\n",
"
\n",
"
\n",
"
48693
\n",
"
Artur Davis
\n",
"
371
\n",
"
0.666690
\n",
"
\n",
"
\n",
"
18827
\n",
"
Henry Waxman
\n",
"
279
\n",
"
0.671226
\n",
"
\n",
"
\n",
"
37199
\n",
"
Barry Sullivan (lawyer)
\n",
"
893
\n",
"
0.673300
\n",
"
\n",
"
\n",
"
46811
\n",
"
Jeff Sessions
\n",
"
230
\n",
"
0.673581
\n",
"
\n",
"
\n",
"
36452
\n",
"
Bill Clinton
\n",
"
524
\n",
"
0.675260
\n",
"
\n",
"
\n",
"
6796
\n",
"
Eric Holder
\n",
"
232
\n",
"
0.677451
\n",
"
\n",
"
\n",
"
24848
\n",
"
John C. Eastman
\n",
"
366
\n",
"
0.679724
\n",
"
\n",
"
\n",
"
36425
\n",
"
Edward B. Montgomery
\n",
"
331
\n",
"
0.681387
\n",
"
\n",
"
\n",
"
14754
\n",
"
Mitt Romney
\n",
"
502
\n",
"
0.681761
\n",
"
\n",
"
\n",
"
35357
\n",
"
Lawrence Summers
\n",
"
413
\n",
"
0.687272
\n",
"
\n",
"
\n",
"
47303
\n",
"
John Kerry
\n",
"
410
\n",
"
0.692701
\n",
"
\n",
"
\n",
"
34344
\n",
"
Mary Dawson (civil servant)
\n",
"
434
\n",
"
0.696581
\n",
"
\n",
"
\n",
"
55181
\n",
"
Ron Paul
\n",
"
427
\n",
"
0.696642
\n",
"
\n",
"
\n",
"
4565
\n",
"
Robinson O. Everett
\n",
"
764
\n",
"
0.698133
\n",
"
\n",
"
\n",
"
46140
\n",
"
Robert Gibbs
\n",
"
257
\n",
"
0.698549
\n",
"
\n",
"
\n",
"
52859
\n",
"
Ann Lewis
\n",
"
634
\n",
"
0.698799
\n",
"
\n",
" \n",
"
\n",
"
"
],
"text/plain": [
" name length BO TF-IDF cos\n",
"35817 Barack Obama 540 0.000000\n",
"24478 Joe Biden 414 0.572725\n",
"57108 Hillary Rodham Clinton 580 0.616149\n",
"38376 Samantha Power 310 0.625797\n",
"38714 Eric Stern (politician) 255 0.651475\n",
"28447 George W. Bush 505 0.659478\n",
"39357 John McCain 410 0.661645\n",
"48693 Artur Davis 371 0.666690\n",
"18827 Henry Waxman 279 0.671226\n",
"37199 Barry Sullivan (lawyer) 893 0.673300\n",
"46811 Jeff Sessions 230 0.673581\n",
"36452 Bill Clinton 524 0.675260\n",
"6796 Eric Holder 232 0.677451\n",
"24848 John C. Eastman 366 0.679724\n",
"36425 Edward B. Montgomery 331 0.681387\n",
"14754 Mitt Romney 502 0.681761\n",
"35357 Lawrence Summers 413 0.687272\n",
"47303 John Kerry 410 0.692701\n",
"34344 Mary Dawson (civil servant) 434 0.696581\n",
"55181 Ron Paul 427 0.696642\n",
"4565 Robinson O. Everett 764 0.698133\n",
"46140 Robert Gibbs 257 0.698549\n",
"52859 Ann Lewis 634 0.698799"
]
},
"execution_count": 427,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"nearest_neighbors_cosine[0:23]"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"With cosine distances, the tweet is \"nearer\" to Barack Obama than most people! If someone is reading the Barack Obama Wikipedia page, would you want to recommend they read this tweet?\n",
"In practice, it is common to enforce maximum or minimum document lengths. After all, when someone is reading a long article from _The Atlantic_, you wouldn't recommend him/her a tweet."
]
}
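,
{
"cell_type": "markdown",
"metadata": {},
"source": [
"One way to picture a minimum-length rule is a simple filter applied before ranking. In the sketch below, the three article rows are copied from the table above and the tweet row uses the distance computed in part b); the 100-word threshold `MIN_LENGTH` is a hypothetical choice, not a recommended value.\n",
"\n",
"```python\n",
"import pandas as pd\n",
"\n",
"# Three neighbors from the table above, plus the nine-word tweet.\n",
"candidates = pd.DataFrame({\n",
"    'name': ['Joe Biden', 'Robert Gibbs', 'tweet', 'Ann Lewis'],\n",
"    'length': [414, 257, 9, 634],\n",
"    'distance': [0.572725, 0.698549, 0.698665, 0.698799],\n",
"})\n",
"\n",
"MIN_LENGTH = 100  # hypothetical threshold, in words\n",
"\n",
"# Drop documents that are too short, then rank the rest by cosine distance.\n",
"recommended = (\n",
"    candidates[candidates['length'] >= MIN_LENGTH]\n",
"    .sort_values('distance')\n",
"    .reset_index(drop=True)\n",
")\n",
"print(recommended)\n",
"```"
]
}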
],
"metadata": {
"anaconda-cloud": {},
"kernelspec": {
"display_name": "Python [conda env:anaconda2]",
"language": "python",
"name": "conda-env-anaconda2-py"
},
"language_info": {
"codemirror_mode": {
"name": "ipython",
"version": 2
},
"file_extension": ".py",
"mimetype": "text/x-python",
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython2",
"version": "2.7.12"
}
},
"nbformat": 4,
"nbformat_minor": 4
}