Saturday, February 15, 2014

Mining Twitter Data using Python: Getting Started

Data mining is a hot topic these days, and Twitter is heavily used as a data source in data mining applications. In this post I will show you how to start mining Twitter data with Python using the Tweepy module.
(I will not cover the scientific modules for mining and analysis here. This is a basic guide to getting the Twitter API set up.)

Environment Setup

1. Install Python (Mac OS comes with Python installed)

2. Get a Twitter API key
    Go to https://dev.twitter.com/ and sign in to Twitter (create an account if you don't already have one).
    Click the profile icon (top left) -> My Applications -> Create New App.
    Provide the necessary details to create the application.
    Open the application and click the API Keys tab.

    This shows the keys you need to authenticate your application using OAuth.

3. Install Tweepy
   Tweepy is a Python library for the Twitter API.

   On Mac:
       pip install tweepy
   On Ubuntu:
       sudo apt-get install python-tweepy

   Here's the GitHub project: https://github.com/tweepy/tweepy
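
The keys from step 2 do not have to be pasted into the script itself. Here is a minimal sketch, assuming you have exported them as environment variables. The four variable names and the load_keys helper are hypothetical, not something Twitter or Tweepy defines. It reads the keys at startup and fails early if one is missing:

```python
import os

# Hypothetical variable names: use whatever names you exported in your shell.
KEY_NAMES = (
    "TWITTER_CONSUMER_KEY",
    "TWITTER_CONSUMER_SECRET",
    "TWITTER_ACCESS_TOKEN",
    "TWITTER_ACCESS_SECRET",
)


def load_keys(environ=os.environ):
    """Return the four OAuth values as a tuple, or raise if any is missing."""
    missing = [name for name in KEY_NAMES if not environ.get(name)]
    if missing:
        raise RuntimeError("Missing environment variables: " + ", ".join(missing))
    return tuple(environ[name] for name in KEY_NAMES)
```

With this in place, the four key assignments in a script could become a single line: consumer_key, consumer_secret, access_token, access_secret = load_keys()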

Now you are ready to read some tweets!!

Here is the code to read the Twitter stream (insert your keys into this file):

#imports
from tweepy import Stream
from tweepy import OAuthHandler
from tweepy.streaming import StreamListener

#setting up the keys
consumer_key = ''
consumer_secret = ''
access_token = ''
access_secret = ''

class TweetListener(StreamListener):
    # A listener handles the tweets received from the stream.
    # This basic listener just prints each received tweet to standard output.

    def on_data(self, data):
        # data is the raw JSON string for one stream message
        print(data)
        return True  # keep the stream open

    def on_error(self, status):
        # status is the HTTP status code returned by Twitter
        print(status)

# authenticate, then print every tweet matching the filter to standard output
auth = OAuthHandler(consumer_key, consumer_secret)
auth.set_access_token(access_token, access_secret)

stream = Stream(auth, TweetListener())
stream.filter(track=['nba'])

This prints every tweet from the Twitter stream that matches the keyword "nba". Each tweet is printed as a raw JSON string.
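
The raw output is hard to read, so a natural next step is to parse each message before printing it. The sketch below is one way to do that, assuming the message is a JSON object with a "text" field and a "user" object containing "screen_name". The summarize_tweet helper and the sample message are made up for illustration:

```python
import json


def summarize_tweet(data):
    """Parse one raw JSON message and return (screen_name, text), or None."""
    try:
        message = json.loads(data)
    except ValueError:
        return None  # not valid JSON, for example a blank keep-alive line
    # The stream can also carry messages that are not tweets (such as
    # delete notices); those have no "text" field, so skip them.
    if not isinstance(message, dict) or "text" not in message:
        return None
    user = message.get("user") or {}
    return user.get("screen_name"), message["text"]


# A made-up message with only the two fields used above.
sample = '{"text": "Great game tonight #nba", "user": {"screen_name": "example_fan"}}'
print(summarize_tweet(sample))  # ('example_fan', 'Great game tonight #nba')
```

A function like this could be called from on_data in place of printing the raw string.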

Getting user info:

import tweepy

# reuses the keys defined in the previous script
auth = tweepy.OAuthHandler(consumer_key, consumer_secret)
auth.set_access_token(access_token, access_secret)
api = tweepy.API(auth)

# look up a user by screen name and print it
user = api.get_user('sachithwithana')
print(user.screen_name)


This is a basic example to get you set up. Now you are ready to explore the Twitter API.

I would recommend using the scikit-learn library for Machine Learning with Python.
http://scikit-learn.org/stable/

Here's the Tweepy Documentation:
http://pythonhosted.org/tweepy/html/




5 comments:

  1. Sachith,

    I just started harvesting a Twitter stream. Thank you! I am still learning about computation on graphs, and also considering what kind of statistical models might be cool to implement. Will let you know if/when something comes of it.

    Again, thank you.

    Chris

  2. Thanks mate!
    Yeah, try them out and please let me know if you can :)
    You can use scikit-learn if you are going to do any machine learning :)

  3. Nice guidance, thank you very much for saving lots of time

  4. Hi, I was wondering if tweepy could be installed on chrome OS through the python app?

    Thank you for this guide, I will use it on my ubuntu box.

  5. Hi,
    I have downloaded twitter data and saved them as json in a .txt file. Just wondering if there is any online help to understand how to clean it up, convert it to a database and use it in R for data mining. I am new to python.
    Magesh
