PyEugene

Python 101 - Web Scraping

WifiPassword: LocKEtlYiNIG

Thanks to IDX Broker for hosting us.

https://www.idxbroker.com/

Seth Dudenhofer

sdudenhofer@gmail.com

t/f/i/l sdudenhofer

A word of warning.

Other Ways to do this
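
One other way, sketched here for comparison, is to use only Python's standard library. The class name LinkCollector and the sample markup are hypothetical and chosen for illustration; the sketch collects link targets with html.parser and needs no third-party packages.

```python
from html.parser import HTMLParser


class LinkCollector(HTMLParser):
    """Collect the href value of every <a> tag fed to the parser."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        # attrs is a list of (name, value) pairs for the tag
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.links.append(value)


# Hypothetical sample markup so the sketch runs without a network call
sample_html = """
<html><body>
  <a href="https://www.python.org/">Python</a>
  <a name="top">No link here</a>
  <a href="/feeds/rss.xml">RSS</a>
</body></html>
"""

collector = LinkCollector()
collector.feed(sample_html)
print(collector.links)
```

This takes more code than a dedicated parsing library, which is a reasonable argument for the packages listed next.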

Needed Packages

  • BeautifulSoup4
  • Requests
  • lxml
  • pandas - used to save the results to a CSV file (it can also write Excel files)

import bs4
import requests
import pandas as pd

Import your libraries

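# Download the Planet Python front page
url = "http://planetpython.org"
source = requests.get(url, timeout=10)
# Printing the response shows its status, e.g. <Response [200]> on success
print(source)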

Use Requests to fetch the HTML page.
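
The request step can be made a little more defensive. The following is a minimal sketch, not the code from the talk: the helper name fetch_html, its get parameter, and the 10-second timeout are all assumptions made for illustration. It raises an error on a failed request instead of silently handing an error page to the parser.

```python
import requests


def fetch_html(url, get=requests.get, timeout=10):
    """Return the page body as text, raising if the server reports an error.

    fetch_html, the get parameter and the 10-second timeout are illustrative
    choices, not part of the original talk.
    """
    response = get(url, timeout=timeout)
    # raise_for_status() raises requests.HTTPError for 4xx and 5xx responses
    response.raise_for_status()
    return response.text
```

Calling fetch_html("http://planetpython.org") would return the page as a string. The get parameter exists only so the function can be exercised with a stand-in during testing.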

We are going to use Planet Python

We want to display only the links

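# Parse the downloaded HTML with the lxml parser
soup = bs4.BeautifulSoup(source.content, 'lxml')

# Collect the href of every <a> tag on the page
soup_array = []

for link in soup.find_all('a'):
    # get() returns None when a tag has no href attribute
    data = link.get('href')
    soup_array.append(data)
print(soup_array)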
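
The list built above can contain None (for tags without an href) and relative paths. As a hedged sketch, the sample markup and the names base_url and links below are hypothetical; it keeps only tags that carry an href and resolves relative links against the site address. It uses the built-in html.parser so that it runs without lxml.

```python
from urllib.parse import urljoin

import bs4

# Hypothetical sample markup standing in for the downloaded page
sample_html = """
<a href="https://www.python.org/">Python</a>
<a name="top">Anchor without an href</a>
<a href="/feeds/rss.xml">RSS</a>
"""
base_url = "http://planetpython.org"

# html.parser ships with Python; the 'lxml' parser from the talk could be used instead
soup = bs4.BeautifulSoup(sample_html, "html.parser")

links = []
# href=True matches only <a> tags that actually carry an href attribute
for link in soup.find_all("a", href=True):
    # urljoin leaves absolute URLs alone and resolves relative ones
    links.append(urljoin(base_url, link["href"]))

print(links)
```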

How can we view this in a better way?

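# One link per row; pandas adds a numeric index column
df = pd.DataFrame(soup_array)
print(df)
# Write the table to test.csv in the current directory
df.to_csv('test.csv')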

Other ways to format this?
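
One possible answer, offered as a sketch rather than the approach from the talk, is to summarize the links by host instead of listing them all. The sample_links list is hypothetical data standing in for soup_array.

```python
from collections import Counter
from urllib.parse import urlparse

# Hypothetical links standing in for soup_array from the scraping step
sample_links = [
    "https://www.python.org/",
    "https://www.python.org/psf/",
    "http://planetpython.org/rss20.xml",
    None,  # link.get('href') returns None when a tag has no href
]

# Count links per host, skipping missing hrefs
hosts = Counter(urlparse(link).netloc for link in sample_links if link)

# most_common() yields (host, count) pairs, largest count first
for host, count in hosts.most_common():
    print(f"{host}: {count}")
```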

Resources

  • Beautiful Soup - https://www.crummy.com/software/BeautifulSoup/bs4/doc/
  • Pandas - https://pandas.pydata.org/
  • Requests - http://docs.python-requests.org/en/master/
  • Code hosted at www.github.com/pyeugene under meetups

By sdudenhofer

An introduction to web scraping in Python