111 changes: 111 additions & 0 deletions AW_MiniProject1_TextMining.ipynb
@@ -0,0 +1,111 @@
{

Overall, good job!

A few comments:

  • In the future, please upload a .py file instead of a Jupyter notebook, as it's much easier for us to read and critique.
  • Small thing: try to be consistent with your spacing. Between top-level functions, you should generally leave two blank lines.
  • When you do submit a Jupyter notebook, be sure to run all cells before pushing to GitHub so that we can see all of your output. (I can't see the plot you made, but it looks like you attached it elsewhere, so you're all good.)

"cells": [
{
"cell_type": "code",
"execution_count": 7,
"metadata": {
"collapsed": false
},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"The last 100 tweets about Donald Trump has an average polarity of 0.0383908730159\n",
"The last 100 tweets about Hilary Clinton has an average polarity of -0.00256539101331\n",
"\n",
"The last 100 tweets about each of the presidential candidates show that Donald Trump is the more favorable presidential candidate\n"
]
}
],
"source": [
"from pattern.web import Twitter\n",
"from pattern.en import sentiment\n",
"\n",
"import matplotlib.pyplot as plt\n",
"\n",
"t = Twitter() #twitter object created\n",
"i = None\n",
"\n",
"def draw_histogram(polarity_list, name):\n",
" # the histogram of the data\n",
" plt.figure() #opens a new window\n",
" n, bins, patches = plt.hist(polarity_list, 50, normed=1, facecolor='green', alpha=0.75)\n",
" plt.xlabel('Polarity')\n",
" plt.ylabel('Frequency')\n",
" plt.title('Histogram of '+name)\n",
" plt.axis([-1,1,0,40])\n",
" plt.grid(True)\n",
" \n",
"\n",
"def get_polarity_list(phrase):\n",
" polarity_list = []\n",
" polarity = 0\n",
" count = 0\n",
" for i in range(10):\n",
" for tweet in t.search(phrase, start=i, count=10): #search method inside Twitter object; t.searh = list\n",
" #print tweet.text\n",
" sentiment_tuple = sentiment(tweet.text)\n",
" polarity += sentiment_tuple[0]\n",
" count += 1\n",
" polarity_list.append(sentiment_tuple[0])\n",
" i = tweet.id\n",
" return polarity_list\n",
"\n",
"dt_pol_list = get_polarity_list('donald trump')\n",
"hc_pol_list = get_polarity_list('hilary clinton')\n",
"\n",
"def find_avg_polarity(new_list):\n",
" length = len(new_list)\n",
" sum = 0\n",
" for i in new_list:\n",
" sum += i\n",
" return sum/length\n",
" \n",
"print \"The last 100 tweets about Donald Trump has an average polarity of\", find_avg_polarity(dt_pol_list)\n",
"print \"The last 100 tweets about Hilary Clinton has an average polarity of\", find_avg_polarity(hc_pol_list)\n",
"print\n",
"\n",
"if(find_avg_polarity(dt_pol_list) > find_avg_polarity(hc_pol_list)):\n",
" print \"The last 100 tweets about each of the presidential candidates show that Donald Trump is the more favorable presidential candidate\"\n",
"else:\n",
" print \"The last 100 tweets about each of the presidential candidates show that Hilary Clinton is the more favorable presidential candidate\"\n",
" \n",
"draw_histogram(dt_pol_list, 'Donald Trump') \n",
"draw_histogram(hc_pol_list, 'Hilary Clinton')\n",
"\n",
"plt.show()"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"collapsed": true
},
"outputs": [],
"source": []
}
],
"metadata": {
"kernelspec": {
"display_name": "Python 2",
"language": "python",
"name": "python2"
},
"language_info": {
"codemirror_mode": {
"name": "ipython",
"version": 2
},
"file_extension": ".py",
"mimetype": "text/x-python",
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython2",
"version": "2.7.6"
}
},
"nbformat": 4,
"nbformat_minor": 1
}
32 changes: 32 additions & 0 deletions MiniProject1_WriteUp
@@ -0,0 +1,32 @@
Project Overview [Maximum 100 words]
What data source(s) did you use and what technique(s) did you use to analyze/process them? What did you hope to learn/create?

I used Twitter as my data source and performed sentiment analysis on recent tweets about each of the presidential candidates, Donald Trump and Hillary Clinton. I wanted to see which candidate tweeters viewed more favorably, and whether this might align with who wins the election in November. I also used histograms to compare the distributions of negative, neutral, and positive polarities of the tweets about each candidate.



Implementation [~2-3 paragraphs]
Describe your implementation at a system architecture level. You should NOT walk through your code line by line, or explain every function (we can get that from your docstrings). Instead, talk about the major components, algorithms, data structures and how they fit together. You should also discuss at least one design decision where you had to choose between multiple alternatives, and explain why you made the choice you did.

The goal of my code was to (1) compare the average polarity of tweets about each presidential candidate and (2) compare the distributions of those polarities in histograms. To achieve goal #1, I needed a list of polarities for each candidate in order to find the average. I used the Twitter search function and the sentiment function from the pattern module. The search function retrieves the tweets, and the sentiment function returns the polarity and subjectivity of each tweet as a tuple. I used only the polarity, which is the first element of the tuple. I placed this code inside a function called get_polarity_list so that it can be called for both candidates without repeating code. I then wrote the find_avg_polarity function, which takes the list returned by get_polarity_list and computes the average polarity.
To achieve goal #2, I wrote a function called draw_histogram. I reused some of the starter code from the Machine Learning toolbox, specifically the matplotlib.pyplot module. I pass in the polarity list generated by get_polarity_list, as explained above, and draw a histogram using functions such as xlabel, ylabel, title, axis, grid, and show. I call the figure function so that each candidate's histogram opens in its own window and the two can be compared. With these histograms, we can compare the distributions of polarities.
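
The data flow for goal #1 can be sketched without a live Twitter connection. This is an illustration, not the notebook's code: the (polarity, subjectivity) pairs are made-up stand-ins for what sentiment(tweet.text) returns, and the helper names polarities_from_scores and average are hypothetical.

```python
def polarities_from_scores(scores):
    """Keep only the polarity (first element) of each (polarity, subjectivity) pair."""
    return [score[0] for score in scores]


def average(values):
    """Return the arithmetic mean, or 0.0 for an empty list."""
    if not values:
        return 0.0
    return sum(values) / float(len(values))


# Made-up (polarity, subjectivity) pairs standing in for sentiment(tweet.text).
sample_scores = [(0.5, 0.6), (0.0, 0.0), (-0.25, 0.9), (0.25, 0.4)]

polarities = polarities_from_scores(sample_scores)
print(average(polarities))
```

Returning 0.0 for an empty list is an assumption made here so that a search with no results does not raise a division error.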



Results [~2-3 paragraphs + figures/examples]
Present what you accomplished:

For one of the times I ran the code, I got the output below (the two histograms are attached as exhibits in the folder, named 'donald trump' and 'hilary clinton'):

The last 100 tweets about Donald Trump has an average polarity of 0.0383908730159
The last 100 tweets about Hilary Clinton has an average polarity of -0.00256539101331
The last 100 tweets about each of the presidential candidates show that Donald Trump is the more favorable presidential candidate

Here Donald Trump appears to be the more favorable presidential candidate, because the average polarity of the last 100 tweets about him is 0.038, which is higher than Hillary Clinton's average of -0.0026. Both averages are close to zero, so the gap between them is small. Comparing the two histograms, both candidates have a large number of neutral tweets. Clinton has more neutral tweets than Trump, and her remaining tweets are spread evenly across the negative and positive sides, over the full range from -1.0 to 1.0. Trump has fewer neutral tweets, and the rest are spread more evenly between -0.5 and 0.5. This may indicate that while ~15-20% of the 100 sampled tweets view the candidates neutrally, the remaining tweets express more extreme views of Clinton than of Trump.
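
The neutral share discussed above could also be computed directly instead of being read off the histograms. The following is a minimal sketch under stated assumptions: polarity_shares is a hypothetical helper, the neutral cut-off of 0.05 is an arbitrary choice, and the sample values are made up rather than taken from the project's tweets.

```python
def polarity_shares(polarities, neutral_band=0.05):
    """Return the fraction of negative, neutral, and positive polarities.

    A polarity counts as neutral when its absolute value is at most
    neutral_band (an assumed cut-off, not taken from the project).
    """
    counts = {"negative": 0, "neutral": 0, "positive": 0}
    for p in polarities:
        if abs(p) <= neutral_band:
            counts["neutral"] += 1
        elif p < 0:
            counts["negative"] += 1
        else:
            counts["positive"] += 1
    total = len(polarities)
    if total == 0:
        return dict((label, 0.0) for label in counts)
    return dict((label, count / float(total)) for label, count in counts.items())


# Made-up polarities for illustration only.
sample = [0.0, 0.0, 0.8, -0.6, 0.3, -0.1, 0.0, 0.5]
shares = polarity_shares(sample)
print(shares)
```

Calling the helper once per candidate would give two sets of shares that can be compared side by side.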



Reflection [~1 paragraph]
From a process point of view, what went well? What could you improve? Other possible reflection topics: Was your project appropriately scoped? Did you have a good plan for unit testing? How will you use what you learned going forward? What do you wish you knew before you started that would have helped you succeed?

Throughout this mini project, I learned to use the online Python community to help me write code and to use modules and functions that I had not learned yet. I was able to test my code piece by piece before adding more to it. For example, I printed polarity_list to check that it worked before using it in another part of my code. I will use the functions and modules I learned to help me complete the toolboxes, and I will continue to draw on the Python community as well. I have also learned to write more concise code, which still needs work, but overall there was improvement; for example, iterating over an object directly instead of coding with indices.
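
The piece-by-piece checks described above could be recorded as doctests so that they can be rerun after each change. This is a minimal sketch, assuming a find_avg_polarity that takes a non-empty list of floats; the example values are made up.

```python
def find_avg_polarity(polarities):
    """Return the mean of a non-empty list of polarities.

    >>> find_avg_polarity([0.5, -0.5, 0.25, 0.75])
    0.25
    >>> find_avg_polarity([1.0])
    1.0
    """
    return sum(polarities) / float(len(polarities))


if __name__ == "__main__":
    import doctest
    doctest.testmod()  # prints nothing when every example passes
```

Running the file directly executes the examples in the docstring and reports any that do not match.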
Binary file added donald trump.png
Binary file added hilary clinton.png