Web Scraping with Python 🐍
What happens when you visit a website?
HTTP Request
HTML File
👨‍💻
HTML Basic Syntax
HTML Skeleton
Most Common HTML Elements
Most Common HTML Attributes
Beautiful Soup
Beautiful Soup will parse the HTML for us!
Websites are a mess
Python Basics
Before we start, we'll need to download Anaconda
Strings
Anything inside double or single quotes is a string
Any text value that you use in a program needs to be in quotes. Otherwise, Python will look for a variable with that name and raise an error if it doesn't find one.
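A minimal sketch of this rule, using made-up values. Both quote styles produce a string, while an unquoted word is treated as a variable name. The names `title`, `author`, and `error_name` are illustrative only.

```python
# Single and double quotes both create a string.
title = "War and Peace"
author = 'Leo Tolstoy'
print(type(title))   # <class 'str'>
print(type(author))  # <class 'str'>

# Without quotes, Python looks for a variable called Tolstoy.
# No such variable exists here, so Python raises a NameError.
try:
    author = Tolstoy
except NameError as error:
    error_name = type(error).__name__
    print(error_name)  # NameError
```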
Numbers
Numbers can be either integers (whole numbers) or floats (numbers with a fractional part)
A number between quotes is a string, not an integer or float
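The difference matters as soon as you do arithmetic. Here is a small sketch with illustrative variable names (none of them come from the slides):

```python
whole = 42        # integer
fraction = 3.14   # float
quoted = "42"     # string, because of the quotes

print(type(whole))     # <class 'int'>
print(type(fraction))  # <class 'float'>
print(type(quoted))    # <class 'str'>

# Adding numbers does arithmetic; adding strings glues them together.
number_sum = whole + 1      # 43
string_sum = quoted + "1"   # "421"

# int() turns a string of digits into a real number.
converted = int(quoted) + 1  # 43
```

This comes up often in scraping, since text pulled out of a page is a string even when it looks like a number.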
Variables
Variables are just a way programmers name values, like in math class!
Variables can be re-assigned
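A short sketch of naming a value and then re-assigning it. The names `price` and `books_scraped` are hypothetical.

```python
# Give a value a name.
price = 10
print(price)  # 10

# Re-assign: the same name now refers to a new value.
price = 12.5
print(price)  # 12.5

# A variable can also be updated from its own current value.
books_scraped = 0
books_scraped = books_scraped + 1
print(books_scraped)  # 1
```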
Boolean & Logic
Boolean values: True, False
if and else conditions
Logical operators (or, and)
while and for loops
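The four ideas above can be sketched together in a few lines. All values and names here are made up for illustration.

```python
price = 12.5
in_stock = True

# Comparisons produce Boolean values.
is_cheap = price < 20                          # True

# "and" needs both sides to be True; "or" needs at least one.
worth_buying = is_cheap and in_stock           # True
needs_check = (price > 100) or (not in_stock)  # False

# if / else picks one of two branches.
if worth_buying:
    decision = "buy"
else:
    decision = "skip"

# A for loop runs once per item in a collection.
total = 0
for number in [1, 2, 3]:
    total = total + number

# A while loop runs as long as its condition stays True.
countdown = 3
while countdown > 0:
    countdown = countdown - 1

print(decision, total, countdown)  # buy 6 0
```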
Lists
Python has a built-in List type for storing a collection of values.
Generally we don't know in advance what we'll collect, so we start with an empty list and add values to it
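A minimal sketch of that pattern, with illustrative book titles:

```python
# Start with an empty list.
titles = []

# Add values one at a time with append().
titles.append("War and Peace")
titles.append("Anna Karenina")

print(titles)       # ['War and Peace', 'Anna Karenina']
print(len(titles))  # 2
print(titles[0])    # War and Peace (positions start at 0)
```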
Lists #2
To do something with each element of the list, we iterate over it.
(It does not matter what we name the first variable in the for loop. The second one must be right, however, since it is the variable that holds the list.)
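A sketch of iterating over a list, assuming a hypothetical list called `titles`. The loop variable is named `title` in the first loop and `t` in the second, and both loops give the same result.

```python
titles = ["War and Peace", "Anna Karenina"]

# "title" is the loop variable; "titles" is the list being iterated.
shouted = []
for title in titles:
    shouted.append(title.upper())

# Renaming the loop variable changes nothing.
shouted_again = []
for t in titles:
    shouted_again.append(t.upper())

print(shouted)  # ['WAR AND PEACE', 'ANNA KARENINA']
```

Misspelling the list name (for example `title` instead of `titles` after `in`) would either raise an error or loop over the wrong thing.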
Dictionaries
Another common built-in data structure is called a Dictionary
Dictionaries are used to store “key-value” information. Here, “War and Peace” is the key, and the description is the value. So it works like a real… dictionary! 📖
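A small sketch of a dictionary of book descriptions. The description texts are placeholders written for this example, not the ones from the original slide.

```python
# Keys are on the left of each colon, values on the right.
books = {
    "War and Peace": "A novel by Leo Tolstoy",
}

# Look up a value by its key.
description = books["War and Peace"]
print(description)  # A novel by Leo Tolstoy

# Add a new key-value pair.
books["Anna Karenina"] = "Another novel by Leo Tolstoy"

# Iterate over keys and values together.
for title, text in books.items():
    print(title, "->", text)
```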
Functions
Pre-built functions: type(), str()...
Import functions from downloaded packages
pandas, numpy, matplotlib...
Defining your own with def
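The three kinds of functions can be sketched as follows. To keep the example runnable without installing anything, it imports from the standard-library `math` module as a stand-in for a downloaded package such as pandas or numpy. The function `clean_price` is hypothetical.

```python
import math

# Pre-built functions are always available.
print(type(3.5))   # <class 'float'>
as_text = str(3.5)  # "3.5"

# Imported functions come from a module or package.
root = math.sqrt(16)  # 4.0

# Your own functions are defined with the def keyword.
def clean_price(text):
    """Turn a price string such as '£51.77' into a float."""
    return float(text.replace("£", ""))

price = clean_price("£51.77")
print(price)  # 51.77
```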
Let's Get to Work!
Our Target
https://books.toscrape.com
Basic Code
# Importing our libraries
import requests
from bs4 import BeautifulSoup

# Generic way to use BeautifulSoup
response = requests.get("https://books.toscrape.com/")
soup = BeautifulSoup(response.content, "html.parser")
Basic Code #2
# Finding the first h3 element
soup.find("h3")

# Finding all h3 elements
soup.find_all("h3")

# Finding the first element with the product_price class
soup.find(class_="product_price")

# CSS selectors
soup.select("h3 p")
soup.select(".product_pod")
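To try these calls without a network connection, here is a sketch that runs them on a small hand-written HTML snippet. The snippet is an assumption: it only imitates a product listing and is not the real markup of the target site. In this made-up snippet the titles sit inside `a` tags, so the CSS selector used is `h3 a`.

```python
from bs4 import BeautifulSoup

# Hypothetical HTML, written for this example only.
html = """
<article class="product_pod">
  <h3><a href="book-1.html">First Book</a></h3>
  <div class="product_price"><p class="price_color">£10.00</p></div>
</article>
<article class="product_pod">
  <h3><a href="book-2.html">Second Book</a></h3>
  <div class="product_price"><p class="price_color">£12.50</p></div>
</article>
"""

soup = BeautifulSoup(html, "html.parser")

# find() returns the first match only.
first_title = soup.find("h3").get_text(strip=True)

# find_all() returns every match as a list.
all_titles = [h3.get_text(strip=True) for h3 in soup.find_all("h3")]

# class_ (with an underscore) searches by CSS class.
first_price = soup.find(class_="product_price").get_text(strip=True)

# select() takes a CSS selector and always returns a list.
pods = soup.select(".product_pod")
links = [a["href"] for a in soup.select("h3 a")]

print(first_title)  # First Book
print(all_titles)   # ['First Book', 'Second Book']
print(first_price)  # £10.00
print(len(pods))    # 2
print(links)        # ['book-1.html', 'book-2.html']
```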
Time for Live Code!
10% Workshop
By simonpastor