Web Scraping with Python 🐍
What happens when you visit a website?
HTTP Request
HTML File
👨💻
HTML Basic Syntax
HTML Skeleton
Most Common HTML Elements
Most Common HTML Attributes
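The skeleton, elements and attributes above can be illustrated with a small sketch. The page below is a made-up example (not taken from the slides), and `TagLister` is a hypothetical helper built on Python's standard `html.parser` module. It simply lists each opening tag with its attributes.

```python
from html.parser import HTMLParser

# A minimal HTML skeleton (illustrative, written for this example).
PAGE = """<!DOCTYPE html>
<html>
  <head>
    <title>My first page</title>
  </head>
  <body>
    <h1 id="main-title">Hello</h1>
    <p class="intro">Some text with a <a href="https://books.toscrape.com">link</a>.</p>
  </body>
</html>"""


class TagLister(HTMLParser):
    """Record every opening tag together with its attributes."""

    def __init__(self):
        super().__init__()
        self.tags = []

    def handle_starttag(self, tag, attrs):
        # attrs arrives as a list of (name, value) pairs
        self.tags.append((tag, dict(attrs)))


lister = TagLister()
lister.feed(PAGE)
for tag, attrs in lister.tags:
    print(tag, attrs)
```

Elements such as `h1`, `p` and `a` carry the content. Attributes such as `id`, `class` and `href` describe it, and they are what we will later use to locate things.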
Beautiful Soup
Beautiful Soup will parse the HTML for us!
Websites are a mess
Python Basics
Before we start, we'll need to download Anaconda
Strings
Anything inside double or single quotes is a string
Any text value that you use in a program needs to be wrapped in quotes. Otherwise Python will look for a variable with that name and raise an error if it doesn't find one.
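A minimal sketch of both points. The variable names are made up for illustration.

```python
greeting = "Hello"   # double quotes
name = 'world'       # single quotes work the same way
message = greeting + ", " + name + "!"
print(message)

# Without quotes, Python treats the word as a variable name.
try:
    print(hello)
except NameError as error:
    error_text = str(error)
    print("Python looked for a variable:", error_text)
```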
Numbers
Numbers can be either integers (whole numbers) or floats (numbers with a fractional part)
A number between quotes is a string, not an integer or a float
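A short sketch of the difference, with example values chosen for illustration:

```python
pages = 1225        # integer
price = 51.77       # float
as_text = "42"      # string, even though it looks like a number

print(type(pages), type(price), type(as_text))

doubled_text = as_text + as_text             # strings are glued together
doubled_number = int(as_text) + int(as_text) # numbers are added
print(doubled_text, doubled_number)

half = 7 / 2        # dividing two integers gives a float
print(half)
```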
Variables
Variables are just a way programmers name values, like in math class!
Variables can be re-assigned
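For instance (names and numbers are illustrative only):

```python
books_per_page = 20
pages_to_scrape = 3
total_books = books_per_page * pages_to_scrape
print(total_books)

# Re-assigning gives the name a new value.
books_per_page = 25
print(books_per_page)

# total_books was computed earlier and does not update on its own.
print(total_books)
```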
Boolean & Logic
Boolean values: True, False
If and Else condition
Logic Operations (or, and)
While and For loop
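The four ideas above can be sketched in a few lines. All names and values here are made up for illustration.

```python
price = 12.5
in_stock = True

# if / else with "and": both conditions must be True
if price < 20 and in_stock:
    verdict = "buy"
else:
    verdict = "skip"
print(verdict)

# "or": at least one condition must be True
is_weekend = False
is_holiday = True
day_off = is_weekend or is_holiday
print(day_off)

# while: repeat as long as the condition holds
countdown = 3
loops = 0
while countdown > 0:
    countdown = countdown - 1
    loops = loops + 1
print(loops)

# for: repeat once per value (here 1, 2, 3)
total = 0
for number in range(1, 4):
    total = total + number
print(total)
```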
Lists
Python has a built-in List type for storing a collection of values.
Generally we don't know in advance what we'll collect, so we start with an empty list and add values to it as we go
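A minimal sketch of that pattern, using made-up prices:

```python
prices = []              # start empty
prices.append(51.77)     # add values one at a time
prices.append(53.74)
prices.append(50.10)

print(prices)
print(len(prices))       # how many values we collected
print(prices[0])         # the first value (counting starts at 0)
```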
Lists #2
To do something with each element of the list, we iterate over it.
(It does not matter what we name the first variable in the for statement, the loop variable. However, the second variable must be the one that contains the list.)
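As a sketch, with sample titles chosen for illustration:

```python
titles = ["A Light in the Attic", "Tipping the Velvet", "Soumission"]

shouted = []
for title in titles:          # "title" is our choice; "titles" must be the list
    shouted.append(title.upper())
print(shouted)

# Any loop variable name works, as long as the list name is right.
lengths = []
for x in titles:
    lengths.append(len(x))
print(lengths)
```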
Dictionaries
Another common built-in data structure is called a Dictionary
Dictionaries are used to store “key-value” information. Here, “War and Peace” is the key, and the description is the value. So it works like a real… dictionary! 📖
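A small sketch of the idea. The descriptions are written for this example and are not the ones shown on the original slide.

```python
books = {
    "War and Peace": "A novel by Leo Tolstoy",   # key: value
}

# Add another key-value pair.
books["Hamlet"] = "A play by William Shakespeare"

# Look up a value by its key.
print(books["War and Peace"])

# Iterate over keys and values together.
for title, description in books.items():
    print(title, "->", description)
```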
Functions
Pre-built functions: type(), str()...
Import functions from downloaded packages
pandas, numpy, matplotlib...
Defining your own with def
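A sketch of the three kinds of functions. To keep it runnable without extra installs, it imports the standard `math` module rather than pandas or numpy. `price_with_tax` and its 20% default rate are hypothetical.

```python
import math

# Pre-built functions
print(type(3.5))
label = str(42) + " books"
print(label)

# A function imported from a module
root = math.sqrt(16)
print(root)


# Our own function, defined with the def keyword
def price_with_tax(price, rate=0.2):
    """Return the price including tax, rounded to 2 decimals."""
    return round(price * (1 + rate), 2)


print(price_with_tax(10))
```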
Let's Get to Work!
Our Target
https://books.toscrape.com
Basic Code
import requests
from bs4 import BeautifulSoup
response = requests.get("https://books.toscrape.com/")
soup = BeautifulSoup(response.content, "html.parser")
Importing our Libraries
Generic way to use BeautifulSoup
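One possible way to package these lines, as a sketch: `fetch_soup` is a hypothetical helper name, and the 10-second timeout is an arbitrary choice. It adds a check that the request succeeded before parsing.

```python
import requests
from bs4 import BeautifulSoup


def fetch_soup(url, timeout=10):
    """Download a page and return it parsed (hypothetical helper)."""
    response = requests.get(url, timeout=timeout)
    # Raise an exception on 4xx/5xx answers instead of parsing an error page
    response.raise_for_status()
    return BeautifulSoup(response.content, "html.parser")
```

Calling `fetch_soup("https://books.toscrape.com/")` should then give the same kind of `soup` object as the lines above.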
Basic Code #2
Finding an element h3
soup.find("h3")
Finding all h3 elements
soup.find_all("h3")
Finding a product_price class
soup.find(class_="product_price")
CSS Selectors
soup.select("h3 p")
soup.select(".product_pod")
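To try these calls without hitting the network, here is a sketch on a hand-written snippet. The markup is an assumption, loosely modeled on a book listing, and the real page may differ. In this sample each `h3` holds a link, so the descendant selector used is `"h3 a"`.

```python
from bs4 import BeautifulSoup

# Hand-written sample markup (an assumption, not copied from the site).
SAMPLE_HTML = """
<section>
  <article class="product_pod">
    <h3><a href="book-1.html" title="Book One">Book One</a></h3>
    <div class="product_price"><p class="price_color">£10.00</p></div>
  </article>
  <article class="product_pod">
    <h3><a href="book-2.html" title="Book Two">Book Two</a></h3>
    <div class="product_price"><p class="price_color">£20.00</p></div>
  </article>
</section>
"""

soup = BeautifulSoup(SAMPLE_HTML, "html.parser")

first_h3 = soup.find("h3")                         # first match only
all_h3 = soup.find_all("h3")                       # list of every match
price_box = soup.find(class_="product_price")      # match by class
books = soup.select(".product_pod")                # CSS selector: class
links = soup.select("h3 a")                        # CSS selector: a inside h3

first_title = first_h3.get_text(strip=True)
first_price = price_box.get_text(strip=True)
titles = [link["title"] for link in links]

print(first_title, first_price)
print(len(all_h3), len(books))
print(titles)
```

`find` returns a single element (or `None`), while `find_all` and `select` return lists, so the latter are usually followed by a loop.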
Time for Live Code!
10% Workshop
By simonpastor