Python

~ Crash Course ~

UNIFEI - June, 2018

Hello! 

Prof. Maurilio

Hanneli

Prof. Bonatto

Moodle

<3

Questions

  • What is this course? 
    • A: An attempt to bring interesting subjects to the courses
  • Why are the slides in English?
    • I am reusing some material that I prepared before
  • What should I expect?
    • This is the first time we are giving this course. We look forward to hearing your feedback

Rules and Notes

  • Don't skip classes
  • The final assignment is optional
  • Feel free to ask questions
  • We plan to record the classes and upload the videos

DISCLAIMER

This is NOT a Python tutorial

[GIF time]

Goals

  • First contact with Python
  • Tools for science and engineering
  • How to ask Google the right questions
  • (Try to) reuse code that has already been written
  • Some recipes for turning a problem into an automated task

Agenda

  1. Why Python?
  2. Getting Started (installation, important tools)
  3. Hello, Python! (super basic, boring and necessary intro)
  4. Our first problem: Getting online data
  5. Data and Numpy
  6. More interesting libraries

(Note: This list can change!)

The benefits of Python

  • Easy to learn
  • Lots of libraries
  • Lots of libraries for science and engineering
  • Good enough performance for most tasks
  • Good for web development
  • Good for automating tasks

Agenda

  1. Why Python?
  2. Getting Started (installation, important tools)
  3. Hello, Python! (super basic, boring and necessary intro)
  4. Our first problem: Getting online data
  5. Data and Numpy
  6. More interesting libraries

Linux, MacOS

Python is already there :)

Windows

https://www.python.org/downloads/windows/

Python 3 or Python 2?

Python 3, please

We don't even need to install Python to use it (there are online interpreters)

Let's try it out!

There is a good place for writing and testing your Python code

Jupyter Notebooks

Jupyter Notebooks

  • Easy to see the results
  • Easy to test
  • Good for documenting your research
  • Easy to share

You can use it online or on your computer

How can we get all the cool libraries written in Python?

It is time for the black screen of h4ck3rs

You need to use your terminal (Linux, MacOS) or the Windows Prompt

pip

Once you install pip, you can have all the cool libraries on your machine

Ex: Jupyter Notebooks

pip install jupyter
jupyter notebook

Question:

Which libraries can I download?

(pip installs packages from the Python Package Index: https://pypi.org)

In practice

Tips:

- Look for Open Source tools

- Check the source and see if the library is stable and has frequent updates

Agenda

  1. Why Python?
  2. Getting Started (installation, important tools)
  3. Hello, Python! (super basic, boring and necessary intro)
  4. Our first problem: Getting online data
  5. Data and Numpy
  6. More interesting libraries

We often need to collect online data (for research, personal projects, etc.)

Scraping my wishlists at Amazon

What do we do now?

scrapy (install it with pip)

pip install scrapy

Goal

1. Obtain the book title and price of all the items of the wishlist

2. Save all this information in a local file

Goal

1. Obtain the book title and price of all the items of the wishlist

2. Save all this information in a local file

Everything that you see in a webpage is...

Text

It is all a bunch of text with markups to indicate style, position of the elements,...

This is what we call HTML

image from amazon.com.br

We must dig into this text and find the information we want

The idea of the algorithm

For every book, get the title and price; move to the next book; stop when you reach the end of the wishlist
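In plain Python, the idea looks roughly like this (just a sketch - wishlist_books, get_title and get_price are hypothetical placeholders, not real scrapy API):

for book in wishlist_books:      # stops when we reach the end of the wishlist
    title = get_title(book)      # hypothetical helper
    price = get_price(book)      # hypothetical helper
    print(title, price)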

image from amazon.com.br

Can the library (scrapy) do that?

Smart move

scrapy shell 'https://www.amazon.com.br/gp/registry/wishlist/3DA4I0ZLH8ADW'
response.text

How do I reproduce this action in a Jupyter Notebook or a Python file?

(Using the 'scrapy shell' command was a bit of a cheat)

import scrapy

class ShortParser(scrapy.Spider):
    name = 'shortamazonspider'
    start_urls = ['https://www.amazon.com.br/gp/registry/wishlist/3DA4I0ZLH8ADW']

    def parse(self, response):
        print(response.text)

short_parser.py

C:\> scrapy runspider short_parser.py
C:\> scrapy runspider short_parser.py > page.txt
import scrapy
from scrapy.crawler import CrawlerProcess

class ShortParser(scrapy.Spider):
    name = 'shortamazonspider'
    start_urls = ['https://www.amazon.com.br/gp/registry/wishlist/3DA4I0ZLH8ADW']

    def parse(self, response):
        print(response.text)

process = CrawlerProcess({
    'USER_AGENT': 'Mozilla/4.0 (compatible; MSIE 7.0; Windows NT 5.1)'
})

process.crawl(ShortParser)
process.start()  # the script will block here until the crawling is finished

short_parser.py

C:\> python short_parser.py

TOO MANY THINGS GOING ON!!!111

Goal

1. Obtain the book title and price of all the items of the wishlist

2. Save all this information in a local file

Questions:

  • What is 'class'? (see the sketch below)
  • How did you know about that method parse?
  • I have a big chunk of code. Now what? How do I extract the information I need?
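About 'class': a class bundles data and behavior together, and ShortParser inherits its behavior from scrapy.Spider. A minimal, scrapy-free sketch (the Book class is invented for illustration):

class Book:
    def __init__(self, title, price):
        self.title = title        # data attached to each instance
        self.price = price

    def summary(self):            # behavior shared by all instances
        return self.title + ': ' + self.price

b = Book('Fluent Python', 'R$ 150,00')
print(b.summary())                # Fluent Python: R$ 150,00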

Wrap Up

- Python basics

- Installation

- Using the tools and the language

- Extra: Where to learn more (Coursera)

import scrapy

class ShortParser(scrapy.Spider):
    name = 'shortamazonspider'
    start_urls = ['https://www.amazon.com.br/gp/registry/wishlist/3DA4I0ZLH8ADW']

    def parse(self, response):
        print(response.text)

short_parser.py

Read the Docs! https://doc.scrapy.org/

The docs will help you use the library. They usually include examples

Going back to our problem

1. Obtain the book title and price of all the items of the wishlist

2. Save all this information in a local file

We still need to clean up the information inside the HTML and collect only a few pieces of information about each book

Since HTML has "markups", we can fetch the information we want

(whiteboard time - how does HTML represent the content?)

All the books are inside this element 'div#item-page-wrapper'
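To see how markup lets us grab content, here is a small sketch using scrapy's Selector on an inline HTML string (the HTML below is invented, not Amazon's real markup):

from scrapy.selector import Selector

html = '<div id="item-page-wrapper"><a id="itemName_1">Fluent Python</a></div>'
sel = Selector(text=html)

# CSS: '#some-id' matches the element with that id attribute
print(sel.css('#item-page-wrapper').extract_first())

# XPath can query the same document
print(sel.xpath('//a[@id="itemName_1"]/text()').extract_first())  # Fluent Python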

import scrapy

class ShortParser(scrapy.Spider):
    name = 'shortamazonspider'
    start_urls = ['https://www.amazon.com.br/gp/registry/wishlist/3DA4I0ZLH8ADW']

    def parse(self, response):
        books_root = response.css("#item-page-wrapper")

amazon_spider.py

Note: Python relies on indentation.

import scrapy

class ShortParser(scrapy.Spider):
    name = 'shortamazonspider'
    start_urls = ['https://www.amazon.com.br/gp/registry/wishlist/3DA4I0ZLH8ADW']

    def parse(self, response):
        books_root = response.css("#item-page-wrapper")

amazon_spider.py

What is the output of this method call?

Scrapy shell is useful here!

scrapy shell 'https://www.amazon.com.br/gp/registry/wishlist/3DA4I0ZLH8ADW'
In [4]: books_root = response.css("#item-page-wrapper")

In [5]: books_root
Out[5]: [<Selector xpath="descendant-or-self::*[@id = 'item-page-wrapper']" 
data='<div id="item-page-wrapper" class="a-sec'>]

Let's see the examples again. We have something called 'Selector'

In [4]: books_root = response.css("#item-page-wrapper")

In [5]: books_root
Out[5]: [<Selector xpath="descendant-or-self::*[@id = 'item-page-wrapper']" 
data='<div id="item-page-wrapper" class="a-sec'>]


The information for each book is inside a 'div#itemMain_[WeirdCode]' element

How can we select

'div#itemMain_[WeirdCode]' ?

We know this id always starts with itemMain_, followed by anything

We can use a Regular Expression - we tell the algorithm: 'Select the items whose id looks like #itemMain_[I_dont_care]'
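A quick regular-expression warm-up in plain Python (the ids below are invented; '.*' means "followed by anything"):

import re

ids = ['itemMain_I2Y87PJXIXDI9X', 'itemInfo_I2Y87PJXIXDI9X', 'itemMain_ABC123']

# re.match anchors the pattern at the start of the string
books = [i for i in ids if re.match(r'itemMain_.*', i)]
print(books)  # ['itemMain_I2Y87PJXIXDI9X', 'itemMain_ABC123']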

import scrapy

class ShortParser(scrapy.Spider):
    name = 'shortamazonspider'
    start_urls = ['https://www.amazon.com.br/gp/registry/wishlist/3DA4I0ZLH8ADW']

    def parse(self, response):
        books_root = response.css("#item-page-wrapper")
        books = books_root.xpath('//div[re:test(@id, "itemMain_*")]')

amazon_spider.py

What is the output of this method call?

Back to Scrapy shell!

scrapy shell 'https://www.amazon.com.br/gp/registry/wishlist/3DA4I0ZLH8ADW'
In [4]: books_root = response.css("#item-page-wrapper")

In [5]: books_root
Out[5]: [<Selector xpath="descendant-or-self::*[@id = 'item-page-wrapper']" 
data='<div id="item-page-wrapper" class="a-sec'>]
In [6]: books = books_root.xpath('//div[re:test(@id, "itemMain_*")]')

In [7]: books
Out[7]:
[<Selector xpath='//div[re:test(@id, "itemMain_*")]' data='<div id="itemMain_I2Y87PJXIXDI9X" class='>,
 <Selector xpath='//div[re:test(@id, "itemMain_*")]' data='<div id="itemMain_I1B5CM41FDBQQU" class='>,
 <Selector xpath='//div[re:test(@id, "itemMain_*")]' data='<div id="itemMain_I3JPI776YSPWGL" class='>,
 <Selector xpath='//div[re:test(@id, "itemMain_*")]' data='<div id="itemMain_I2HEQ767OEDQND" class='>,
 <Selector xpath='//div[re:test(@id, "itemMain_*")]' data='<div id="itemMain_I12P3TV8OT8VUQ" class='>,
 <Selector xpath='//div[re:test(@id, "itemMain_*")]' data='<div id="itemMain_I2PUS0RW7H14QK" class='>,
 <Selector xpath='//div[re:test(@id, "itemMain_*")]' data='<div id="itemMain_I2Z1XDSQWY482G" class='>,
 <Selector xpath='//div[re:test(@id, "itemMain_*")]' data='<div id="itemMain_I1UIZOCDVDCJCN" class='>,
 <Selector xpath='//div[re:test(@id, "itemMain_*")]' data='<div id="itemMain_IVVEST5SABEDL" class="'>,
 <Selector xpath='//div[re:test(@id, "itemMain_*")]' data='<div id="itemMain_IO58MH5C88JMF" class="'>]

Those are the books! Now we can obtain the title and the price

import scrapy

class ShortParser(scrapy.Spider):
    name = 'shortamazonspider'
    start_urls = ['https://www.amazon.com.br/gp/registry/wishlist/3DA4I0ZLH8ADW']

    def parse(self, response):
        books_root = response.css("#item-page-wrapper")
        books = books_root.xpath('//div[re:test(@id, "itemMain_*")]')

amazon_spider.py

import scrapy

class ShortParser(scrapy.Spider):
    name = 'shortamazonspider'
    start_urls = ['https://www.amazon.com.br/gp/registry/wishlist/3DA4I0ZLH8ADW']

    def parse(self, response):
        books_root = response.css("#item-page-wrapper")
        books = books_root.xpath('//div[re:test(@id, "itemMain_*")]')
        for book in books:
            pass  # the loop body comes in the next slide

amazon_spider.py

import scrapy

class ShortParser(scrapy.Spider):
    name = 'shortamazonspider'
    start_urls = ['https://www.amazon.com.br/gp/registry/wishlist/3DA4I0ZLH8ADW']

    def parse(self, response):
        books_root = response.css("#item-page-wrapper")
        books = books_root.xpath('//div[re:test(@id, "itemMain_*")]')
        for book in books:
            book_info = book.xpath('.//div[re:test(@id, "itemInfo_*")]')
            title = book_info.xpath('.//a[re:test(@id, "itemName_*")]/text()').extract_first()

amazon_spider.py

Now we have the title and the price on every iteration

We want to hand each result back as soon as it is produced, while the code runs

yield
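'yield' turns a function into a generator: it hands back one value at a time instead of building the whole list first. A tiny sketch, unrelated to scrapy:

def squares(n):
    for i in range(n):
        yield i * i       # produce one value, pause until asked for the next

for s in squares(4):
    print(s)              # 0, 1, 4, 9 - one at a time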

import scrapy

class ShortParser(scrapy.Spider):
    name = 'shortamazonspider'
    start_urls = ['https://www.amazon.com.br/gp/registry/wishlist/3DA4I0ZLH8ADW']

    def parse(self, response):
        books_root = response.css("#item-page-wrapper")
        books = books_root.xpath('//div[re:test(@id, "itemMain_*")]')
        for book in books:
            book_info = book.xpath('.//div[re:test(@id, "itemInfo_*")]')
            title = book_info.xpath('.//a[re:test(@id, "itemName_*")]/text()').extract_first()
            price = book_info.xpath('.//span[re:test(@id, "itemPrice_*")]//span/text()').extract_first()
            yield {'Title': title, 'Last price': price}

amazon_spider.py

Going back to our problem

1. Obtain the book title and price of all the items of the wishlist

2. Save all this information in a local file

Problem: this code only captures the first books on the page, and the wishlist has many more. How can we fetch info from all of them?

(This is an optional homework!)
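One possible direction for the homework: find the link that loads the next chunk of the wishlist and follow it. The CSS selector below is hypothetical - inspect the real page to find the right one:

def parse(self, response):
    # ... extract titles and prices as before ...

    # hypothetical selector for the "see more" / next-page link
    next_page = response.css('a.wl-see-more::attr(href)').extract_first()
    if next_page:
        # response.follow resolves relative URLs and schedules a new request
        yield response.follow(next_page, callback=self.parse)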

Note about scraping data: Digging into the HTML structure can be tricky

There may be many ways to extract the data, and the code can become obsolete with any minor change to the webpage

Agenda

  1. Why Python?
  2. Getting Started (installation, important tools)
  3. Hello, Python! (super basic, boring and necessary intro)
  4. Our first problem: Getting online data
  5. Data and Numpy
  6. More interesting libraries

New problem: We don't need to scrape the data. It comes from a .csv file

For example, the students of this course

From the form, I obtained a python_ext.csv file

julianafandrade@hotmail.com,Juliana Figueiredo de Andrade,34293,ECO
jianthomaz1994@gmail.com,Jian Thomaz De Souza ,2018013906,ECO
ranierefr@hotmail.com,Frederico Ranieri Rosa,2017020601,ECO
pievetp@gmail.com,Dilson Gabriel Pieve,2017006335,ECO
henrqueolvera@gmail.com,Henrique Castro Oliveira,2018003703,CCO
fabio.eco15@gmail.com,Fábio Rocha da Silva,31171,ECO
ftdneves@gmail.com,Frederico Tavares Direne Neves,2016014744,ECO
gabrielocarvalho@hotmail.com,Gabriel Oraboni Carvalho,33549,CCO
laizapaulino1@gmail.com,Laiza Aparecida Paulino da Silva,2016001209,SIN
rodrigogoncalvesroque@gmail.com,Rodrigo Gonçalves Roque,2017004822,CCO
jlucasberlinck@hotmail.com,João Lucas Berlinck Campos ,2016017450,ECO
sglauber26@gmail.com,Glauber Gomes de Souza,2016014127,CCO
ricardoweasley@hotmail.com,Ricardo Dalarme de Oliveira Filho,2018002475,CCO
caikepiza@live.com,Caike De Souza Piza,2016005404,ECO
henriquempossatto@gmail.com,Henrique Marcelino Possatto ,2017007539,ECO
souza.isa96@gmail.com,Isabela de Souza Silva,2017000654,SIN
felipetoshiohikita@gmail.com,FELIPE TOSHIO HIKITA DA SILVA,34948,ECO

We want to get this information into Python and analyse it

LMGTFY

'Load CSV file to Python'

Pandas Library

Install it with pip

pip install pandas

Recipe

  1. Add the python_ext.csv file to a directory
  2. Create a file students.py at the same directory
import pandas as pd

df = pd.read_csv('python_ext.csv')
print(df)

3. On the terminal, run `python students.py`

Now we can manipulate this data

Send emails

Generate QRCodes

Statistics

df.groupby('curso').count()['matricula']
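Note: python_ext.csv has no header row, so read_csv would otherwise treat the first student as column names. A sketch that names the columns explicitly (email/nome/matricula/curso is my guess at what the four fields mean):

import pandas as pd

# header=None: the file has no header row; names= labels the columns
df = pd.read_csv('python_ext.csv', header=None,
                 names=['email', 'nome', 'matricula', 'curso'])

# how many students ('matricula') per course ('curso')?
print(df.groupby('curso').count()['matricula'])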

Tip: Obtaining data

Agenda

  1. Why Python?
  2. Getting Started (installation, important tools)
  3. Hello, Python! (super basic, boring and necessary intro)
  4. Our first problem: Getting online data
  5. Data and Numpy
  6. More interesting libraries

Wrap Up

A complete Scraper

Pandas and Datasets

Moodle

<3

Working with Arrays

LMGTFY

'library arrays in Python'

NumPy Library

Install it with pip

pip install numpy
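A quick taste of what NumPy gives you (a minimal sketch):

import numpy as np

a = np.array([1.0, 2.0, 3.0, 4.0])

print(a * 2)            # vectorized math, no explicit loop: [2. 4. 6. 8.]
print(a.mean())         # 2.5
print(a.reshape(2, 2))  # reshape a flat array into a 2x2 matrix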

Try these examples from DataCamp

(note: you can skip the first part where they suggest installing Anaconda)

Extra - challenge from Kaggle (Jun 11th - Jun 17th)

Feedback

Thank you :)

Questions?

 

hannelitaa@gmail.com

@hannelita