~ Crash Course ~
UNIFEI - June, 2018
Prof. Maurilio
Hanneli
Prof. Bonatto
Moodle
<3
This is NOT a Python tutorial
(Note: This list can change!)
Python is already there :)
https://www.python.org/downloads/windows/
Python 3, please
Let's try it out!
You can use it online or in your computer
It is time for the black screen of h4ck3rs
pip install jupyter
jupyter notebook
Tips:
- Look for Open Source tools
- Check the source and see if the library is stable and has frequent updates
What do we do now?
pip install scrapy
1. Obtain the book title and price of all the items of the wishlist
2. Save all this information in a local file
This is what we call HTML
image from amazon.com.br
For every book, get the title and price; move to the next book; stop when you reach the end of the wishlist
image from amazon.com.br
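In plain-Python pseudocode the plan looks like this (wishlist, title and price are hypothetical names here, not a real API):

# a sketch of the plan, not real code yet
for book in wishlist:
    print(book.title, book.price)
# the loop stops by itself at the end of the wishlist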
Can the library (scrapy) do that?
scrapy shell 'https://www.amazon.com.br/gp/registry/wishlist/3DA4I0ZLH8ADW'
response.text
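response.text gives you the whole page as one string. Still inside the shell, you can already try small queries; for example (a hypothetical query using Scrapy's CSS selectors):

In [2]: response.css('title::text').extract_first()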
Using the 'scrapy shell' command was unfair
import scrapy

class ShortParser(scrapy.Spider):
    name = 'shortamazonspider'
    start_urls = ['https://www.amazon.com.br/gp/registry/wishlist/3DA4I0ZLH8ADW']

    def parse(self, response):
        print(response.text)
short_parser.py
C:\> scrapy runspider short_parser.py
C:\> scrapy runspider short_parser.py > page.txt
import scrapy
from scrapy.crawler import CrawlerProcess

class ShortParser(scrapy.Spider):
    name = 'shortamazonspider'
    start_urls = ['https://www.amazon.com.br/gp/registry/wishlist/3DA4I0ZLH8ADW']

    def parse(self, response):
        print(response.text)

process = CrawlerProcess({
    'USER_AGENT': 'Mozilla/4.0 (compatible; MSIE 7.0; Windows NT 5.1)'
})
process.crawl(ShortParser)
process.start()  # the script will block here until the crawling is finished
short_parser.py
C:\> python short_parser.py
TOO MANY THINGS GOING ON!!!111
1. Obtain the book title and price of all the items of the wishlist
2. Save all this information in a local file
- Python basics
- Installation
- Using the tools and the language
- Extra: Where to learn more (Coursera)
import scrapy

class ShortParser(scrapy.Spider):
    name = 'shortamazonspider'
    start_urls = ['https://www.amazon.com.br/gp/registry/wishlist/3DA4I0ZLH8ADW']

    def parse(self, response):
        print(response.text)
short_parser.py
Read the Docs! https://doc.scrapy.org/
1. Obtain the book title and price of all the items of the wishlist
2. Save all this information in a local file
We still need to clean up the HTML and collect only a few pieces of information about the books
All the books are inside this element 'div#item-page-wrapper'
import scrapy

class ShortParser(scrapy.Spider):
    name = 'shortamazonspider'
    start_urls = ['https://www.amazon.com.br/gp/registry/wishlist/3DA4I0ZLH8ADW']

    def parse(self, response):
        books_root = response.css("#item-page-wrapper")
amazon_spider.py
What is the output of this method call?
scrapy shell 'https://www.amazon.com.br/gp/registry/wishlist/3DA4I0ZLH8ADW'
In [4]: books_root = response.css("#item-page-wrapper")
In [5]: books_root
Out[5]: [<Selector xpath="descendant-or-self::*[@id = 'item-page-wrapper']"
data='<div id="item-page-wrapper" class="a-sec'>]
The information for every book is inside a 'div#itemMain_[WeirdCode]'
We know this id always starts with itemMain_, followed by anything
We can use a Regular Expression: we tell the algorithm 'Select the items whose id looks like #itemMain_[I_dont_care]'
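Here is the same idea as a minimal sketch with plain Python's re module (the id values are examples in the formats seen on this page):

import re

ids = ['itemMain_I2Y87PJXIXDI9X', 'item-page-wrapper', 'itemInfo_ABC123']
# re.search looks for the pattern anywhere in the string,
# so 'itemMain_' is enough to keep only the book divs
print([i for i in ids if re.search('itemMain_', i)])
# ['itemMain_I2Y87PJXIXDI9X']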
import scrapy

class ShortParser(scrapy.Spider):
    name = 'shortamazonspider'
    start_urls = ['https://www.amazon.com.br/gp/registry/wishlist/3DA4I0ZLH8ADW']

    def parse(self, response):
        books_root = response.css("#item-page-wrapper")
        books = books_root.xpath('//div[re:test(@id, "itemMain_*")]')
amazon_spider.py
What is the output of this method call?
scrapy shell 'https://www.amazon.com.br/gp/registry/wishlist/3DA4I0ZLH8ADW'
In [4]: books_root = response.css("#item-page-wrapper")
In [5]: books_root
Out[5]: [<Selector xpath="descendant-or-self::*[@id = 'item-page-wrapper']"
data='<div id="item-page-wrapper" class="a-sec'>]
In [6]: books = books_root.xpath('//div[re:test(@id, "itemMain_*")]')
In [7]: books
Out[7]:
[<Selector xpath='//div[re:test(@id, "itemMain_*")]' data='<div id="itemMain_I2Y87PJXIXDI9X" class='>,
<Selector xpath='//div[re:test(@id, "itemMain_*")]' data='<div id="itemMain_I1B5CM41FDBQQU" class='>,
<Selector xpath='//div[re:test(@id, "itemMain_*")]' data='<div id="itemMain_I3JPI776YSPWGL" class='>,
<Selector xpath='//div[re:test(@id, "itemMain_*")]' data='<div id="itemMain_I2HEQ767OEDQND" class='>,
<Selector xpath='//div[re:test(@id, "itemMain_*")]' data='<div id="itemMain_I12P3TV8OT8VUQ" class='>,
<Selector xpath='//div[re:test(@id, "itemMain_*")]' data='<div id="itemMain_I2PUS0RW7H14QK" class='>,
<Selector xpath='//div[re:test(@id, "itemMain_*")]' data='<div id="itemMain_I2Z1XDSQWY482G" class='>,
<Selector xpath='//div[re:test(@id, "itemMain_*")]' data='<div id="itemMain_I1UIZOCDVDCJCN" class='>,
<Selector xpath='//div[re:test(@id, "itemMain_*")]' data='<div id="itemMain_IVVEST5SABEDL" class="'>,
<Selector xpath='//div[re:test(@id, "itemMain_*")]' data='<div id="itemMain_IO58MH5C88JMF" class="'>]
import scrapy

class ShortParser(scrapy.Spider):
    name = 'shortamazonspider'
    start_urls = ['https://www.amazon.com.br/gp/registry/wishlist/3DA4I0ZLH8ADW']

    def parse(self, response):
        books_root = response.css("#item-page-wrapper")
        books = books_root.xpath('//div[re:test(@id, "itemMain_*")]')
        for book in books:
            pass  # we will fill in the loop body next
amazon_spider.py
import scrapy

class ShortParser(scrapy.Spider):
    name = 'shortamazonspider'
    start_urls = ['https://www.amazon.com.br/gp/registry/wishlist/3DA4I0ZLH8ADW']

    def parse(self, response):
        books_root = response.css("#item-page-wrapper")
        books = books_root.xpath('//div[re:test(@id, "itemMain_*")]')
        for book in books:
            book_info = book.xpath('.//div[re:test(@id, "itemInfo_*")]')
            title = book_info.xpath('.//a[re:test(@id, "itemName_*")]/text()').extract_first()
amazon_spider.py
The price comes out the same way, from 'itemPrice_*'. Now we have the values of title and price for every iteration
yield
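yield turns parse into a generator: instead of returning once, it hands Scrapy one item at a time. A minimal sketch of the idea in plain Python (book_data and the values are hypothetical):

def book_data():
    # each yield produces one item; the function resumes here the next time
    yield {'Title': 'Book A', 'Last price': 'R$ 10,00'}
    yield {'Title': 'Book B', 'Last price': 'R$ 20,00'}

for item in book_data():
    print(item)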
import scrapy

class ShortParser(scrapy.Spider):
    name = 'shortamazonspider'
    start_urls = ['https://www.amazon.com.br/gp/registry/wishlist/3DA4I0ZLH8ADW']

    def parse(self, response):
        books_root = response.css("#item-page-wrapper")
        books = books_root.xpath('//div[re:test(@id, "itemMain_*")]')
        for book in books:
            book_info = book.xpath('.//div[re:test(@id, "itemInfo_*")]')
            title = book_info.xpath('.//a[re:test(@id, "itemName_*")]/text()').extract_first()
            price = book_info.xpath('.//span[re:test(@id, "itemPrice_*")]//span/text()').extract_first()
            yield {'Title': title, 'Last price': price}
amazon_spider.py
1. Obtain the book title and price of all the items of the wishlist
2. Save all this information in a local file (one way is shown below)
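One way to do step 2 is Scrapy's built-in feed export: the -o flag writes every item the spider yields to a file.

C:\> scrapy runspider amazon_spider.py -o books.json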
julianafandrade@hotmail.com,Juliana Figueiredo de Andrade,34293,ECO
jianthomaz1994@gmail.com,Jian Thomaz De Souza ,2018013906,ECO
ranierefr@hotmail.com,Frederico Ranieri Rosa,2017020601,ECO
pievetp@gmail.com,Dilson Gabriel Pieve,2017006335,ECO
henrqueolvera@gmail.com,Henrique Castro Oliveira,2018003703,CCO
fabio.eco15@gmail.com,Fábio Rocha da Silva,31171,ECO
ftdneves@gmail.com,Frederico Tavares Direne Neves,2016014744,ECO
gabrielocarvalho@hotmail.com,Gabriel Oraboni Carvalho,33549,CCO
laizapaulino1@gmail.com,Laiza Aparecida Paulino da Silva,2016001209,SIN
rodrigogoncalvesroque@gmail.com,Rodrigo Gonçalves Roque,2017004822,CCO
jlucasberlinck@hotmail.com,João Lucas Berlinck Campos ,2016017450,ECO
sglauber26@gmail.com,Glauber Gomes de Souza,2016014127,CCO
ricardoweasley@hotmail.com,Ricardo Dalarme de Oliveira Filho,2018002475,CCO
caikepiza@live.com,Caike De Souza Piza,2016005404,ECO
henriquempossatto@gmail.com,Henrique Marcelino Possatto ,2017007539,ECO
souza.isa96@gmail.com,Isabela de Souza Silva,2017000654,SIN
felipetoshiohikita@gmail.com,FELIPE TOSHIO HIKITA DA SILVA,34948,ECO
'Load CSV file to Python'
Pandas Library
Install it with pip
pip install pandas
import pandas as pd
df = pd.read_csv('python_ext.csv')
print(df)
1. Save the data above as `python_ext.csv`
2. Save the code above as `students.py`
3. On the terminal, run `python students.py`
Send emails
Generate QRCodes
Statistics
df.groupby('curso').count()['matricula']
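This counts how many 'matricula' values there are per 'curso'. For it to work, the CSV needs a header row naming the columns; a minimal sketch assuming the header line email,nome,matricula,curso:

import pandas as pd

# assumption: python_ext.csv starts with the header line email,nome,matricula,curso
df = pd.read_csv('python_ext.csv')
print(df.groupby('curso').count()['matricula'])  # students per course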
A complete Scraper
Pandas and Datasets
Exercises
Moodle
<3
'Array library in Python'
NumPy Library
Install it with pip
pip install numpy
(note: you can skip the first part where they suggest installing Anaconda)
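A minimal taste of what NumPy arrays do (whole-array arithmetic, no loops needed):

import numpy as np

a = np.array([1, 2, 3])
print(a * 2)     # element-wise: [2 4 6]
print(a.mean())  # 2.0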
Questions?
hannelitaa@gmail.com
@hannelita