Fun way to Understand Data Analysis by Scrapping Website.

By Promode

About Me

Pramod Dutta

4+ Year in Software Industry.

Blogger - http://scrolltest.com

Many Featured Article in Python Weekly.

Things we are using

Scrapping data from Website using Beautifulsoup
Data Analysis using Pandas & Data Visualization using MatplotLib

What is Web-Scrapping ?

Introduction to BeautifulSoup

Beautiful Soup is a Python library for pulling data out of HTML and XML files.

html_doc = '<html><title>Hi</title><body><p>Awesome BS4</p><a href='1'>First</a>
<a href='2'>Second</a></body></html>

from bs4 import BeautifulSoup
soup = BeautifulSoup(html_doc, 'html.parser')




soup.title
# Hi

soup.p
# <p>Awesome BS4</p>

soup.find_all('a')
# 1,2

Which One to Watch First -
Let see some movies Data From 2000-2017 From - IMDB

Objective is to see the Average Rating

Maximum Rating

Average Runtime of the Movie

Year which has most High Rated Movies(Trend)
.....

Parsing Code

from bs4 import BeautifulSoup
import urllib2
def main():
    print("**======  Data Extracting from imdb -- by Promode  =====**")
    testUrl = "http://www.imdb.com/search/title?at=0&count=100&\
    groups=top_1000&release_date=2000,2017&sort=moviemeter"
    pageSource = urllib2.urlopen(testUrl).read()
    soupPKG = BeautifulSoup(pageSource, 'lxml')
    titles = soupPKG.findAll("div",class_='lister-item mode-advanced')
    mymovieslist = []
    mymovies = {}
    for t in titles:
        mymovies = {}
        mymovies['name'] = t.findAll("a")[1].text
        mymovies['year'] = str(t.find("span", "lister-item-year").text).replace('','')
        mymovies['rating'] = float(str(t.find("span", "rating-rating").text)\
        .replace('','')[0:-3])
        mymovies['runtime'] = t.find("span", "runtime").text
        mymovieslist.append(mymovies)
    print mymovieslist
if __name__=="__main__":
    main()

Data looks Like

[
 {
'rating': 8.1, 
'runtime': '136 min', 
'name': u'Guardians of the Galaxy Vol. 2', 
'year': '(2017)'
 },

{'rating': 9.0, 'runtime': '167 min', 'name': u'Bahubali 2: The Conclusion', 'year': '(2017)'},
{'rating': 8.0, 'runtime': '104 min', 'name': u'Get Out', 'year': '(I) (2017)'},
{'rating': 8.1, 'runtime': '121 min', 'name': u'Guardians of the Galaxy', 'year': '(2014)'},
{'rating': 7.7, 'runtime': '129 min', 'name': u'Beauty and the Beast', 'year': '(2017)'}, 
{'rating': 8.4, 'runtime': '137 min', 'name': u'Logan', 'year': '(2017)'}, 
{'rating': 7.9, 'runtime': '133 min', 'name': u'Rogue One', 'year': '(2016)'} .....]

With Pandas & Metaplotlib

from bs4 import BeautifulSoup
import urllib2
def main():
    
    .....
    df = pd.DataFrame.from_dict(mymovieslist)
    df.plot()
    plt.show()
    df =df.set_index('rating')
    print df

    
if __name__=="__main__":
    main()

DataFrame : Rating is Set as Index

Maximum Rating - Sorted by Rating

Year Vs Rating Trend

Average Rating

print "Avg Rating of Movies From 2000-2017 ON IMDB : "+ str(df.mean())

Now We have a List to Watch Movies..

Thanks

http://slides.com/pramoddutta1/deck/fullscreen

What is Pandas

High-performance, easy-to-use data structures and data analysis tools for the Python programming language.

Pandas .. Continues.

Made by Panel Data System.
Used by Lots of Companies(Prod ready lib).
Built on the top of numpy.
Supports , Sorting, Cleaning, Munging, Analysing and Modeling the data.

What is DataFrame in Pandas

Demo and Basics Commands

Install the Requirments

virtualenv .
source /bin/activate
pip install jupyter
pip install pandas
pip install matplotlib

What is MatplotLib

Matplotlib is a Python 2D plotting library which produces publication quality figures in a variety of hardcopy formats and interactive environments across platforms.

Fun way to Understand Data Analysis by Scrapping Website.

About Me

Things we are using

What is Web-Scrapping ?

Introduction to BeautifulSoup

Which One to Watch First - Let see some movies Data From 2000-2017 From - IMDB

Parsing Code

Data looks Like

With Pandas & Metaplotlib

DataFrame : Rating is Set as Index

Maximum Rating - Sorted by Rating

Year Vs Rating Trend

Average Rating

Now We have a List to Watch Movies..

Thanks http://slides.com/pramoddutta1/deck/fullscreen

What is Pandas

High-performance, easy-to-use data structures and data analysis tools for the Python programming language.

Pandas .. Continues.

Made by Panel Data System.

Used by Lots of Companies(Prod ready lib).

Built on the top of numpy.

Supports , Sorting, Cleaning, Munging, Analysing and Modeling the data.

What is DataFrame in Pandas

Demo and Basics Commands

Install the Requirments

What is MatplotLib

Matplotlib is a Python 2D plotting library which produces publication quality figures in a variety of hardcopy formats and interactive environments across platforms.

Which One to Watch First -
Let see some movies Data From 2000-2017 From - IMDB

Thanks

http://slides.com/pramoddutta1/deck/fullscreen