Web Scraping with Python 🐍

What happens when you visit a website?

HTTP Request

HTML File

👨‍💻

HTML Basic Syntax

HTML Skeleton

Most Common HTML Elements

Most Common HTML Attributes

Beautiful Soup

Beautiful Soup will parse the HTML for us!

Websites are a mess

Python Basics

Before we start, we'll need to download Anaconda

Strings

Anything inside double or single quotes is a string

Any text value that you use in a program needs to go inside quotes. Otherwise, Python will look for a variable with that name and raise an error if it doesn't find one.
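A minimal sketch of the point above. The names title, author and message are only illustrative; they do not come from the slides.

```python
# Single and double quotes both create a string.
title = "War and Peace"
author = 'Leo Tolstoy'

print(title)   # War and Peace
print(author)  # Leo Tolstoy

# Without quotes, Python treats the word as a variable name.
# No variable called Tolstoy exists, so Python raises a NameError.
try:
    print(Tolstoy)
except NameError as error:
    message = str(error)
    print(message)
```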

Numbers

Numbers can be either integers (whole numbers) or floats (numbers with a fractional part)

A number between quotes is a string, not an integer or float.
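A short sketch of the difference, using made-up values. type() tells us what kind of value we have, and int() turns a string of digits into a number.

```python
whole = 42        # integer
fraction = 3.14   # float
quoted = "42"     # string, because of the quotes

print(type(whole))     # <class 'int'>
print(type(fraction))  # <class 'float'>
print(type(quoted))    # <class 'str'>

# A quoted number behaves like text until we convert it.
converted = int(quoted)
print(converted + 1)   # 43
print(quoted + "1")    # 421
```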

Variables

Variables are just a way programmers name values, like in math class!

Variables can be re-assigned
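For example, with a hypothetical variable called price:

```python
price = 10          # give the value 10 the name "price"
print(price)        # 10

price = 12          # re-assign: the same name now holds a new value
print(price)        # 12

price = price + 1   # the new value can be built from the old one
print(price)        # 13
```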

Boolean & Logic

Boolean values: True, False

If and Else condition

Logic Operations (or, and)

While and For loop
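A small sketch that touches each item on this slide. The variable names and values are invented for illustration.

```python
price = 20
in_stock = True

# A comparison produces a Boolean value.
is_cheap = price < 25
print(is_cheap)  # True

# if / else with a logic operation
if is_cheap and in_stock:
    decision = "buy"
else:
    decision = "skip"
print(decision)  # buy

# "or" is True when at least one side is True
print(is_cheap or False)  # True

# A while loop repeats as long as its condition is True.
count = 0
while count < 3:
    count = count + 1
print(count)  # 3

# A for loop runs once for each value in a collection.
total = 0
for number in [1, 2, 3]:
    total = total + number
print(total)  # 6
```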

Lists

Python has a built-in List type for storing a collection of values.

Generally we don't know what we're collecting, so we start with an empty list to which we add values
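A minimal sketch of that pattern. The list name titles and the book titles are just examples.

```python
titles = []                       # start with an empty list

titles.append("War and Peace")    # add values one at a time
titles.append("Anna Karenina")

print(titles)       # ['War and Peace', 'Anna Karenina']
print(len(titles))  # 2
print(titles[0])    # War and Peace (positions start at 0)
```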

Lists #2

To do something with each element of the list, we iterate over it.

(It does not matter what we name the loop variable, the first name after for. The second name must be right, because it refers to the list we are iterating over.)
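A sketch of iterating over a list, assuming a hypothetical list of prices. The loop variable is called price in the first loop and x in the second, and both loops give the same result.

```python
prices = [51.77, 53.74, 50.10]

rounded = []
for price in prices:            # "price" is a name we chose
    rounded.append(round(price))
print(rounded)                  # [52, 54, 50]

rounded_again = []
for x in prices:                # any other name works just as well
    rounded_again.append(round(x))
print(rounded_again)            # [52, 54, 50]
```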

Dictionaries

Another common built-in data structure is called a Dictionary

Dictionaries are used to store “key-value” information. Here, “War and Peace” is the key, and the description is the value. So it works like a real… dictionary! 📖
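A sketch of such a dictionary. The descriptions are placeholders written for this example, not the text from the original slide.

```python
# key: the book title, value: a short description
books = {"War and Peace": "A novel by Leo Tolstoy"}

# Look up a value by its key.
print(books["War and Peace"])   # A novel by Leo Tolstoy

# Add a new key-value pair.
books["Anna Karenina"] = "Another novel by Leo Tolstoy"

# Go through every key and its value.
for title, description in books.items():
    print(title, "->", description)
```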

Functions

Built-in functions: type(), str()...

Import functions from downloaded packages

pandas, numpy, matplotlib...

Defining your own with def
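A sketch of all three kinds of function. It imports math from the standard library so that it runs without installing anything; packages such as pandas are imported the same way once they are installed. The function shout is a made-up example.

```python
import math  # import works the same way for installed packages

# Built-in functions are always available.
print(type(3.5))       # <class 'float'>
print(str(42) + "!")   # 42!

# A function that comes from an imported module.
print(math.sqrt(16))   # 4.0

# Defining our own function with def.
def shout(text):
    """Return the text in capital letters with an exclamation mark."""
    return text.upper() + "!"

print(shout("books"))  # BOOKS!
```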

Let's Get to Work!

Our Target

https://books.toscrape.com

Basic Code

import requests
from bs4 import BeautifulSoup
response = requests.get("https://books.toscrape.com/")
soup = BeautifulSoup(response.content, "html.parser")

Importing our Libraries

Generic way to use BeautifulSoup

Basic Code #2

Finding the first h3 element

soup.find("h3")

Finding all h3 elements

soup.find_all("h3")

Finding a product_price class

soup.find(class_="product_price")

CSS Selectors

soup.select("h3 p")
soup.select(".product_pod")
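To try these calls without going online, here is a sketch that parses a small hand-written HTML snippet. The snippet only imitates the class names used above; it is not the real page, and the titles, links and prices are invented. It assumes the beautifulsoup4 package is installed.

```python
from bs4 import BeautifulSoup

# Hand-written HTML for practice (not copied from the real site).
html = """
<article class="product_pod">
  <h3><a href="book-1.html">First Book</a></h3>
  <div class="product_price"><p class="price_color">£10.00</p></div>
</article>
<article class="product_pod">
  <h3><a href="book-2.html">Second Book</a></h3>
  <div class="product_price"><p class="price_color">£12.50</p></div>
</article>
"""
soup = BeautifulSoup(html, "html.parser")

first_h3 = soup.find("h3")                       # first match only
all_h3 = soup.find_all("h3")                     # list of every match
first_price = soup.find(class_="product_price")  # search by class
pods = soup.select(".product_pod")               # CSS selector: class
links = soup.select("h3 a")                      # CSS selector: a inside h3

print(first_h3.get_text())                # First Book
print(len(all_h3))                        # 2
print(first_price.get_text(strip=True))   # £10.00
print(len(pods))                          # 2
print([a["href"] for a in links])         # ['book-1.html', 'book-2.html']
```

find() returns a single element (or None if nothing matches), while find_all() and select() return lists, so their results are usually used in a for loop.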

Time for Live Code!

10% Workshop

By simonpastor
