A Fun Way to Understand Data Analysis by Scraping a Website
By Promode
About Me
Pramod Dutta
4+ years in the software industry.
Blogger - http://scrolltest.com
Several articles featured in Python Weekly.
Things we are using
- Scraping data from a website using BeautifulSoup
- Data analysis using Pandas and data visualization using Matplotlib
What is Web Scraping?
Introduction to BeautifulSoup
Beautiful Soup is a Python library for pulling data out of HTML and XML files.
from bs4 import BeautifulSoup

html_doc = """<html><title>Hi</title><body><p>Awesome BS4</p><a href="1">First</a>
<a href="2">Second</a></body></html>"""
soup = BeautifulSoup(html_doc, 'html.parser')
soup.title.string
# 'Hi'
soup.p
# <p>Awesome BS4</p>
soup.find_all('a')
# [<a href="1">First</a>, <a href="2">Second</a>]
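Beyond returning whole tags, each tag can also give up its text and attributes. The following is a minimal sketch, assuming the same small HTML document as above; the variable name `links` is only illustrative.

```python
from bs4 import BeautifulSoup

html_doc = """<html><title>Hi</title><body><p>Awesome BS4</p>
<a href="1">First</a><a href="2">Second</a></body></html>"""

soup = BeautifulSoup(html_doc, "html.parser")

# Collect (link text, href) pairs from every <a> tag in the document.
links = [(a.get_text(), a.get("href")) for a in soup.find_all("a")]
print(links)  # [('First', '1'), ('Second', '2')]
```

`get_text()` returns the text inside a tag, and `get("href")` reads an attribute, returning `None` when the attribute is missing.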
Which One to Watch First?
Let's look at movie data from 2000-2017 from IMDb. The objectives are to find:
- Average rating
- Maximum rating
- Average runtime of the movies
- Year with the most highly rated movies (trend)
- .....
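The objectives above could be computed with Pandas roughly as follows. This is only a sketch: it uses five sample rows with year and runtime already converted to integers, and the 8.0 cut-off for "highly rated" is a hypothetical threshold chosen for illustration.

```python
import pandas as pd

# Sample rows; year and runtime are assumed to be cleaned into integers already.
movies = pd.DataFrame([
    {"name": "Guardians of the Galaxy Vol. 2", "year": 2017, "rating": 8.1, "runtime": 136},
    {"name": "Bahubali 2: The Conclusion", "year": 2017, "rating": 9.0, "runtime": 167},
    {"name": "Get Out", "year": 2017, "rating": 8.0, "runtime": 104},
    {"name": "Guardians of the Galaxy", "year": 2014, "rating": 8.1, "runtime": 121},
    {"name": "Rogue One", "year": 2016, "rating": 7.9, "runtime": 133},
])

average_rating = movies["rating"].mean()
maximum_rating = movies["rating"].max()
average_runtime = movies["runtime"].mean()

# Hypothetical threshold: treat a rating of 8.0 or more as "highly rated".
high_rated = movies[movies["rating"] >= 8.0]
# Year that contributes the most highly rated movies.
best_year = high_rated["year"].value_counts().idxmax()

print(average_rating, maximum_rating, average_runtime, best_year)
```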
Parsing Code
from urllib.request import urlopen

from bs4 import BeautifulSoup


def main():
    print("**====== Data Extracting from imdb -- by Promode =====**")
    testUrl = ("http://www.imdb.com/search/title?at=0&count=100&"
               "groups=top_1000&release_date=2000,2017&sort=moviemeter")
    pageSource = urlopen(testUrl).read()
    soupPKG = BeautifulSoup(pageSource, 'lxml')
    # Each movie on the results page sits in its own "lister-item" div.
    titles = soupPKG.find_all("div", class_='lister-item mode-advanced')
    mymovieslist = []
    for t in titles:
        mymovies = {}
        mymovies['name'] = t.find_all("a")[1].text
        mymovies['year'] = t.find("span", "lister-item-year").text.strip()
        # Drop the last three characters of the rating text before converting to float.
        mymovies['rating'] = float(t.find("span", "rating-rating").text.strip()[0:-3])
        mymovies['runtime'] = t.find("span", "runtime").text
        mymovieslist.append(mymovies)
    print(mymovieslist)


if __name__ == "__main__":
    main()
What the Data Looks Like
[
{
'rating': 8.1,
'runtime': '136 min',
'name': u'Guardians of the Galaxy Vol. 2',
'year': '(2017)'
},
{'rating': 9.0, 'runtime': '167 min', 'name': u'Bahubali 2: The Conclusion', 'year': '(2017)'},
{'rating': 8.0, 'runtime': '104 min', 'name': u'Get Out', 'year': '(I) (2017)'},
{'rating': 8.1, 'runtime': '121 min', 'name': u'Guardians of the Galaxy', 'year': '(2014)'},
{'rating': 7.7, 'runtime': '129 min', 'name': u'Beauty and the Beast', 'year': '(2017)'},
{'rating': 8.4, 'runtime': '137 min', 'name': u'Logan', 'year': '(2017)'},
{'rating': 7.9, 'runtime': '133 min', 'name': u'Rogue One', 'year': '(2016)'} .....]
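Note that `year` and `runtime` are still strings such as '(I) (2017)' and '136 min', so they need cleaning before any numeric analysis. A minimal sketch of one way to do that follows; the helper names `parse_year` and `parse_runtime` are hypothetical and not part of the original script.

```python
import re


def parse_year(raw):
    """Return the four-digit year from strings such as '(I) (2017)'."""
    match = re.search(r"\((\d{4})\)", raw)
    return int(match.group(1)) if match else None


def parse_runtime(raw):
    """Return the number of minutes from strings such as '136 min'."""
    return int(raw.split()[0])


print(parse_year("(I) (2017)"))  # 2017
print(parse_runtime("104 min"))  # 104
```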
With Pandas & Matplotlib
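from urllib.request import urlopen

import matplotlib.pyplot as plt
import pandas as pd
from bs4 import BeautifulSoup


def main():
    .....
    # mymovieslist is the list of dicts built by the parsing code.
    df = pd.DataFrame.from_dict(mymovieslist)
    df.plot()
    plt.show()
    df = df.set_index('rating')
    print(df)


if __name__ == "__main__":
    main()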
DataFrame: Rating Set as Index
Maximum Rating - Sorted by Rating
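Sorting by rating can be sketched with `sort_values`. The rows below are a small sample, and the variable names `by_rating` and `top_movie` are illustrative assumptions rather than names from the original script.

```python
import pandas as pd

df = pd.DataFrame([
    {"name": "Get Out", "rating": 8.0},
    {"name": "Bahubali 2: The Conclusion", "rating": 9.0},
    {"name": "Logan", "rating": 8.4},
    {"name": "Beauty and the Beast", "rating": 7.7},
])

# Highest-rated movies first; reset the index so row 0 is the top movie.
by_rating = df.sort_values("rating", ascending=False).reset_index(drop=True)
top_movie = by_rating.loc[0, "name"]
print(by_rating)
print(top_movie)
```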
Year Vs Rating Trend
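One way to get a year-versus-rating trend is to group by year and plot the mean rating. This sketch assumes the year has already been cleaned into an integer column, and it uses a handful of sample ratings rather than the full scraped list.

```python
import matplotlib
matplotlib.use("Agg")  # render off-screen so the sketch also runs without a display
import pandas as pd

df = pd.DataFrame({
    "year": [2017, 2017, 2017, 2014, 2017, 2017, 2016],
    "rating": [8.1, 9.0, 8.0, 8.1, 7.7, 8.4, 7.9],
})

# Mean rating per release year, ordered by year.
trend = df.groupby("year")["rating"].mean().sort_index()

ax = trend.plot(marker="o")
ax.set_xlabel("Year")
ax.set_ylabel("Average rating")
print(trend)
```

In an interactive session, drop the `matplotlib.use("Agg")` line and call `matplotlib.pyplot.show()` to display the chart.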
Average Rating
# Assumes 'rating' is still a column, i.e. this runs before set_index('rating').
print("Average rating of movies from 2000-2017 on IMDb: " + str(df['rating'].mean()))
Now we have a list of movies to watch.
Thanks
http://slides.com/pramoddutta1/deck/fullscreen
What is Pandas
High-performance, easy-to-use data structures and data analysis tools for the Python programming language.
Pandas (continued)
- The name comes from "panel data"; the library was started by Wes McKinney.
- Used by many companies (a production-ready library).
- Built on top of NumPy.
- Supports sorting, cleaning, munging, analysing, and modeling data.
What is a DataFrame in Pandas
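A DataFrame is a two-dimensional labeled table: every column has a name and every row has an index label. The sketch below builds one from two of the sample movies; using the movie name as the index is an assumption made only for this illustration.

```python
import pandas as pd

df = pd.DataFrame(
    {
        "rating": [8.4, 7.9],
        "runtime": ["137 min", "133 min"],
        "year": ["(2017)", "(2016)"],
    },
    index=["Logan", "Rogue One"],
)

print(df.shape)                   # (2, 3)
print(list(df.columns))           # ['rating', 'runtime', 'year']
print(df.loc["Logan", "rating"])  # 8.4
```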
Demo and Basic Commands
Install the Requirements
- virtualenv .
- source bin/activate
- pip install jupyter
- pip install pandas
- pip install matplotlib
What is Matplotlib
Matplotlib is a Python 2D plotting library which produces publication-quality figures in a variety of hardcopy formats and interactive environments across platforms.
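As a small illustration of the "hardcopy" side, the sketch below draws a histogram of a few sample ratings and writes it to a PNG file. The file name `ratings_hist.png` and the choice of five bins are assumptions made for this example.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend: write to a file instead of a window
import matplotlib.pyplot as plt

ratings = [8.1, 9.0, 8.0, 8.1, 7.7, 8.4, 7.9]

fig, ax = plt.subplots()
ax.hist(ratings, bins=5)
ax.set_xlabel("Rating")
ax.set_ylabel("Number of movies")
ax.set_title("Rating distribution (sample)")

# Hypothetical output file name.
fig.savefig("ratings_hist.png")
```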