Scraping with
lxml
Extracting information from
HTML/XML web pages
using the lxml
library
by Allan Daemon
Agenda
HTML / XML
SGML
Standard Generalized Markup Language
HTML
Hypertext Markup Language
(1993 @ CERN)
XML
Extensible Markup Language
<!DOCTYPE html>
<html>
<head>
<title>This is a title</title>
</head>
<body>
<p>Hello world!</p>
</body>
</html>
<tag attribute1="value1" attribute2="value2">
  <subtag1> value </subtag1>
  <subtag2 attribute1="value1" attribute2="value2">
    <subsubtag>
      Some data <single-tag />.
    </subsubtag>
  </subtag2>
</tag>
DOM
Document Object Model
## Show Chrome's DOM
Handling HTML / XML
Handling
HTML
& XML
SAX Style
html.parser (py2: HTMLParser)
from html.parser import HTMLParser

class MyHTMLParser(HTMLParser):
    def handle_starttag(self, tag, attrs):
        print("Encountered a start tag:", tag)

    def handle_endtag(self, tag):
        print("Encountered an end tag :", tag)

    def handle_data(self, data):
        print("Encountered some data :", data)

parser = MyHTMLParser()
parser.feed('<html><head><title>Test</title></head>'
            '<body><h1>Parse me!</h1></body></html>')
Output:
Encountered a start tag: html
Encountered a start tag: head
Encountered a start tag: title
Encountered some data : Test
Encountered an end tag : title
Encountered an end tag : head
Encountered a start tag: body
Encountered a start tag: h1
Encountered some data : Parse me!
Encountered an end tag : h1
Encountered an end tag : body
Encountered an end tag : html
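The event-driven style above becomes useful once the handlers keep state. A minimal sketch, assuming a hypothetical `LinkCollector` class (not part of the original slides) that records the `href` of every `<a>` start tag:

```python
from html.parser import HTMLParser


class LinkCollector(HTMLParser):
    """Hypothetical helper: collect the href of every <a> start tag."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        # attrs arrives as a list of (name, value) pairs
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value is not None:
                    self.links.append(value)


collector = LinkCollector()
collector.feed('<p><a href="/one">1</a> and <a href="/two">2</a></p>')
print(collector.links)
```

The parser never builds a tree, so anything you want to keep has to be stored by the handlers themselves.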
Element Tree
xml.etree.ElementTree
<?xml version="1.0"?>
<data>
    <country name="Liechtenstein">
        <rank>1</rank>
        <year>2008</year>
        <gdppc>141100</gdppc>
        <neighbor name="Austria" direction="E"/>
        <neighbor name="Switzerland" direction="W"/>
    </country>
    <country name="Singapore">
        <rank>4</rank>
        <year>2011</year>
        <gdppc>59900</gdppc>
        <neighbor name="Malaysia" direction="N"/>
    </country>
    <country name="Panama">
        <rank>68</rank>
        <year>2011</year>
        <gdppc>13600</gdppc>
        <neighbor name="Costa Rica" direction="W"/>
        <neighbor name="Colombia" direction="E"/>
    </country>
</data>
xml.etree.ElementTree
import xml.etree.ElementTree as ET
# Parse from a file...
tree = ET.parse('country_data.xml')
root = tree.getroot()
# ...or directly from a string
root = ET.fromstring(country_data_as_string)
>>> root.tag
'data'
>>> root.attrib
{}
>>> for child in root:
...     print(child.tag, child.attrib)
...
country {'name': 'Liechtenstein'}
country {'name': 'Singapore'}
country {'name': 'Panama'}
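Beyond iterating over direct children, ElementTree can search the tree. A minimal sketch, assuming a shortened copy of the country data above held in `country_data_as_string`; `find`, `findall` and `iter` are standard ElementTree methods, and `findall` accepts only a limited XPath subset:

```python
import xml.etree.ElementTree as ET

# Shortened copy of the country data, kept inline so the example is self-contained
country_data_as_string = """<?xml version="1.0"?>
<data>
    <country name="Liechtenstein">
        <rank>1</rank>
        <year>2008</year>
        <neighbor name="Austria" direction="E"/>
        <neighbor name="Switzerland" direction="W"/>
    </country>
    <country name="Singapore">
        <rank>4</rank>
        <year>2011</year>
        <neighbor name="Malaysia" direction="N"/>
    </country>
</data>"""

root = ET.fromstring(country_data_as_string)

# Direct children named "country", mapped to the integer in their <rank> child
ranks = {c.get("name"): int(c.find("rank").text) for c in root.findall("country")}

# Every <neighbor> element at any depth, in document order
neighbors = [n.get("name") for n in root.iter("neighbor")]

# Attribute predicate from the XPath subset that ElementTree supports
west = [n.get("name") for n in root.findall(".//neighbor[@direction='W']")]

print(ranks, neighbors, west)
```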
There are alternatives
BeautifulSoup Parsers
Parser | Typical usage | Advantages | Disadvantages |
---|---|---|---|
Python’s html.parser | BeautifulSoup(markup, "html.parser") | * Batteries included * Decent speed * Lenient (as of Python 2.7.3 and 3.2.2) | * Not very lenient (before Python 2.7.3 or 3.2.2) |
lxml’s HTML parser | BeautifulSoup(markup, "lxml") | * Very fast * Lenient | * External C dependency |
lxml’s XML parser | BeautifulSoup(markup, "lxml-xml") BeautifulSoup(markup, "xml") | * Very fast * The only currently supported XML parser | * External C dependency |
html5lib | BeautifulSoup(markup, "html5lib") | * Extremely lenient * Parses pages the same way a web browser does * Creates valid HTML5 | * Very slow * External Python dependency |
BeautifulSoup example
<html><head><title>The Dormouse's story</title></head>
<body>
<p class="title"><b>The Dormouse's story</b></p>
<p class="story">Once upon a time there were three little sisters; and their names were
<a href="http://example.com/elsie" class="sister" id="link1">Elsie</a>,
<a href="http://example.com/lacie" class="sister" id="link2">Lacie</a> and
<a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>;
and they lived at the bottom of a well.</p>
<p class="story">...</p>
from bs4 import BeautifulSoup
soup = BeautifulSoup(html_doc, 'html.parser')
soup.title
# <title>The Dormouse's story</title>
soup.title.name
# 'title'
soup.title.string
# "The Dormouse's story"
soup.title.parent.name
# 'head'
soup.p
# <p class="title"><b>The Dormouse's story</b></p>
soup.p['class']
# ['title']
soup.a
# <a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>
soup.find_all('a')
# [<a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>,
# <a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>,
# <a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>]
soup.find(id="link3")
# <a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>
for link in soup.find_all('a'):
    print(link.get('href'))
# http://example.com/elsie
# http://example.com/lacie
# http://example.com/tillie
lxml
from lxml import html
dom = html.fromstring(page)
## code at ipython...
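The live IPython session is not part of these slides, so here is a minimal sketch of what such a session might contain, assuming `page` holds a small inline HTML string rather than a downloaded page. `html.fromstring`, `xpath` and `get_element_by_id` are lxml calls; the markup is borrowed from the BeautifulSoup example.

```python
from lxml import html

# Inline stand-in for a downloaded page (an assumption for this sketch)
page = """
<html><body>
<p class="story">
<a href="http://example.com/elsie" class="sister" id="link1">Elsie</a>,
<a href="http://example.com/lacie" class="sister" id="link2">Lacie</a>
</p>
</body></html>
"""

dom = html.fromstring(page)

# XPath can return attribute values directly as strings
hrefs = dom.xpath("//a/@href")

# ...or elements, whose text and attributes are then read in Python
names = [a.text for a in dom.xpath('//a[@class="sister"]')]

# Lookup by id attribute
link2 = dom.get_element_by_id("link2")

print(hrefs, names, link2.get("href"))
```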
XML
Query
Languages
CSS Selectors
* {
color: green;
font-size: 20px;
line-height: 25px;
}
ul {
list-style: none;
border: solid 1px #ccc;
}
body {
background-color: black;
border: solid 1px #ccc;
}
div {
display: none;
}
#container {
width: 960px;
margin: 0 auto;
}
#my-field {
display: none;
}
.box {
padding: 20px;
margin: 10px;
width: 240px;
}
<div class="box"></div>
<div class="box box-more box-extended"></div>
#container .box {
float: left;
padding-bottom: 15px;
}
<div id="container">
<div class="box"></div>
<div class="box-2"></div>
</div>
<div class="box"></div>
#container > .box {
float: left;
padding-bottom: 15px;
}
<div id="container">
<div class="box"></div>
<div>
<div class="box"></div>
</div>
</div>
input[type="text"] {
background-color: #444;
width: 200px;
}
<input type="text">
<input type="submit">
XPath
<?xml version="1.0" encoding="utf-8"?>
<Wikimedia>
  <projects>
    <project name="Wikipedia" launch="2001-01-05">
      <editions>
        <edition language="English">en.wikipedia.org</edition>
        <edition language="German">de.wikipedia.org</edition>
        <edition language="French">fr.wikipedia.org</edition>
        <edition language="Polish">pl.wikipedia.org</edition>
        <edition language="Spanish">es.wikipedia.org</edition>
      </editions>
    </project>
    <project name="Wiktionary" launch="2002-12-12">
      <editions>
        <edition language="English">en.wiktionary.org</edition>
        <edition language="French">fr.wiktionary.org</edition>
        <edition language="Vietnamese">vi.wiktionary.org</edition>
        <edition language="Turkish">tr.wiktionary.org</edition>
        <edition language="Spanish">es.wiktionary.org</edition>
      </editions>
    </project>
  </projects>
</Wikimedia>
/Wikimedia/projects/project/@name
/Wikimedia//editions
/Wikimedia/projects/project/editions/edition[@language='English']/text()
/Wikimedia/projects/project[@name='Wikipedia']/editions/edition/text()
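The four expressions above can be tried out with lxml. A minimal sketch, assuming a shortened copy of the Wikimedia document (two editions per project) kept inline as `xml_doc`:

```python
from lxml import etree

# Passed as bytes because the document carries an encoding declaration,
# which lxml rejects in a str input
xml_doc = b"""<?xml version="1.0" encoding="utf-8"?>
<Wikimedia>
  <projects>
    <project name="Wikipedia" launch="2001-01-05">
      <editions>
        <edition language="English">en.wikipedia.org</edition>
        <edition language="German">de.wikipedia.org</edition>
      </editions>
    </project>
    <project name="Wiktionary" launch="2002-12-12">
      <editions>
        <edition language="English">en.wiktionary.org</edition>
        <edition language="French">fr.wiktionary.org</edition>
      </editions>
    </project>
  </projects>
</Wikimedia>"""

root = etree.fromstring(xml_doc)

# Attribute values of every project
project_names = root.xpath("/Wikimedia/projects/project/@name")

# All <editions> elements at any depth below the root
editions = root.xpath("/Wikimedia//editions")

# Text of the English edition of every project
english = root.xpath(
    "/Wikimedia/projects/project/editions/edition[@language='English']/text()"
)

# Text of every edition of the Wikipedia project only
wikipedia = root.xpath(
    "/Wikimedia/projects/project[@name='Wikipedia']/editions/edition/text()"
)

print(project_names, len(editions), english, wikipedia)
```

Expressions ending in `@name` or `text()` return strings, while the others return element objects.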
* {
color: green;
font-size: 20px;
line-height: 25px;
}
CSS Selector | XPath |
---|---|
* | //* |
ul {
list-style: none;
border: solid 1px #ccc;
}
body {
background-color: black;
border: solid 1px #ccc;
}
div {
display: none;
}
CSS Selector | XPath |
---|---|
ul | //ul |
body | //body |
div | //div |
#container {
width: 960px;
margin: 0 auto;
}
#my-field {
display: none;
}
CSS Selector | XPath |
---|---|
#container | //*[@id="container"] |
#my-field | //*[@id="my-field"] |
.box {
padding: 20px;
margin: 10px;
width: 240px;
}
<div class="box"></div>
<div class="box box-more box-extended"></div>
CSS Selector | XPath |
---|---|
.box | //*[@class="box"] (exact attribute value only) |
.box | //*[contains(@class, "box")] (substring match) |
#container .box {
float: left;
padding-bottom: 15px;
}
<div id="container">
<div class="box"></div>
<div class="box-2"></div>
</div>
<div class="box"></div>
CSS Selector | XPath |
---|---|
#container .box | //*[@id="container"]//*[@class="box"] |
#container .box | //*[@id="container"]//*[contains(@class, "box")] |
#container > .box {
float: left;
padding-bottom: 15px;
}
<div id="container">
<div class="box"></div>
<div>
<div class="box"></div>
</div>
</div>
CSS Selector | XPath |
---|---|
#container > .box | //*[@id="container"]/*[@class="box"] |
#container > .box | //*[@id="container"]/*[contains(@class, "box")] |
input[type="text"] {
background-color: #444;
width: 200px;
}
<input type="text">
<input type="submit">
CSS Selector | XPath |
---|---|
input[type="text"] | //input[@type="text"] |
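The class-based translations are only approximate: `@class="box"` misses elements with several classes, while `contains(@class, "box")` also matches `box-2`. A minimal sketch comparing them in lxml, assuming a small inline document. The `concat`/`normalize-space` idiom is a common XPath 1.0 workaround for matching a whole class token, added here for illustration and not taken from the slides.

```python
from lxml import html

doc = html.fromstring("""
<html><body>
<div id="container">
  <div class="box"></div>
  <div class="box-2"></div>
  <div><div class="box box-more"></div></div>
</div>
</body></html>
""")

# Matches only when the whole attribute value is exactly "box"
exact = doc.xpath('//*[@class="box"]')

# Substring test: also matches "box-2" and "box box-more"
loose = doc.xpath('//*[contains(@class, "box")]')

# Whole-token test, closest to what the CSS selector .box means
TOKEN = 'contains(concat(" ", normalize-space(@class), " "), " box ")'
token = doc.xpath("//*[" + TOKEN + "]")

# "#container > .box": direct children only
children = doc.xpath('//*[@id="container"]/*[' + TOKEN + "]")

# "#container .box": any descendant
descendants = doc.xpath('//*[@id="container"]//*[' + TOKEN + "]")

print(len(exact), len(loose), len(token), len(children), len(descendants))
```

lxml elements also offer a `cssselect()` method that does this translation for you, but it depends on the separate `cssselect` package being installed.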
Practical
Example:
Correios
How to fetch an HTML or XML document (such as a web page), parse it, and extract information from it.