Fun way to Understand Data Analysis by Scrapping Website.

By Promode

About Me

Pramod Dutta

4+ Year in Software Industry.

Blogger -

Many Featured Article in Python Weekly.

Things we are using

  • Scrapping data from Website using Beautifulsoup
  • Data Analysis using Pandas & Data Visualization using MatplotLib

What is Web-Scrapping ?

Introduction to BeautifulSoup

Beautiful Soup is a Python library for pulling data out of HTML and XML files.

html_doc = '<html><title>Hi</title><body><p>Awesome BS4</p><a href='1'>First</a>
<a href='2'>Second</a></body></html>

from bs4 import BeautifulSoup
soup = BeautifulSoup(html_doc, 'html.parser')

# Hi

# <p>Awesome BS4</p>

# 1,2

Which One to Watch First -
Let see some movies Data From 2000-2017 From - IMDB

  • Objective is to see the Average Rating
  • Maximum Rating
  • Average Runtime of the Movie
  • Year which has most High Rated Movies(Trend)
  • .....

Parsing Code

from bs4 import BeautifulSoup
import urllib2
def main():
    print("**======  Data Extracting from imdb -- by Promode  =====**")
    testUrl = "\
    pageSource = urllib2.urlopen(testUrl).read()
    soupPKG = BeautifulSoup(pageSource, 'lxml')
    titles = soupPKG.findAll("div",class_='lister-item mode-advanced')
    mymovieslist = []
    mymovies = {}
    for t in titles:
        mymovies = {}
        mymovies['name'] = t.findAll("a")[1].text
        mymovies['year'] = str(t.find("span", "lister-item-year").text).replace('','')
        mymovies['rating'] = float(str(t.find("span", "rating-rating").text)\
        mymovies['runtime'] = t.find("span", "runtime").text
    print mymovieslist
if __name__=="__main__":

Data looks Like

'rating': 8.1, 
'runtime': '136 min', 
'name': u'Guardians of the Galaxy Vol. 2', 
'year': '(2017)'

{'rating': 9.0, 'runtime': '167 min', 'name': u'Bahubali 2: The Conclusion', 'year': '(2017)'},
{'rating': 8.0, 'runtime': '104 min', 'name': u'Get Out', 'year': '(I) (2017)'},
{'rating': 8.1, 'runtime': '121 min', 'name': u'Guardians of the Galaxy', 'year': '(2014)'},
{'rating': 7.7, 'runtime': '129 min', 'name': u'Beauty and the Beast', 'year': '(2017)'}, 
{'rating': 8.4, 'runtime': '137 min', 'name': u'Logan', 'year': '(2017)'}, 
{'rating': 7.9, 'runtime': '133 min', 'name': u'Rogue One', 'year': '(2016)'} .....]

With Pandas & Metaplotlib

from bs4 import BeautifulSoup
import urllib2
def main():
    df = pd.DataFrame.from_dict(mymovieslist)
    df =df.set_index('rating')
    print df

if __name__=="__main__":

DataFrame : Rating is Set as Index 

Maximum Rating - Sorted by Rating

Year Vs Rating Trend

Average Rating

print "Avg Rating of Movies From 2000-2017 ON IMDB : "+ str(df.mean())

Now We have a List to Watch Movies..


What is Pandas

High-performance, easy-to-use data structures and data analysis tools for the Python programming language.

Pandas .. Continues.

  • Made by Panel Data System.

  • Used by Lots of Companies(Prod ready lib).

  • Built on the top of numpy.

  • Supports , Sorting, Cleaning, Munging, Analysing and Modeling the data.

What is DataFrame in Pandas

Demo and Basics Commands

Install the Requirments

  • virtualenv .

  • source /bin/activate

  • pip install jupyter

  • pip install pandas

  • pip install matplotlib

What is MatplotLib

Matplotlib is a Python 2D plotting library which produces publication quality figures in a variety of hardcopy formats and interactive environments across platforms.