Scraping with lxml

Extracting information from HTML/XML web pages using the lxml library

by Allan Daemon

Agenda

HTML / XML

SGML

Standard Generalized Markup Language

HTML

Hypertext Markup Language

(1993 @ CERN)

XML

Extensible Markup Language

<!DOCTYPE html>
<html>
  <head>
    <title>This is a title</title>
  </head>
  <body>
    <p>Hello world!</p>
  </body>
</html>
<tag attribute1="value1" attribute2="value2">

    <subtag1> value </subtag1>

    <subtag2 attribute1="value1" attribute2="value2">
        <subsubtag>
                Some data <single-tag />.
        </subsubtag>
    </subtag2>

</tag>

DOM

Document Object Model

## Show Chrome's DOM

Handling HTML / XML

SAX Style

html.parser (py2: HTMLParser)

from html.parser import HTMLParser

class MyHTMLParser(HTMLParser):
    def handle_starttag(self, tag, attrs):
        print("Encountered a start tag:", tag)

    def handle_endtag(self, tag):
        print("Encountered an end tag :", tag)

    def handle_data(self, data):
        print("Encountered some data  :", data)

parser = MyHTMLParser()
parser.feed('<html><head><title>Test</title></head>'
            '<body><h1>Parse me!</h1></body></html>')

Output:

Encountered a start tag: html
Encountered a start tag: head
Encountered a start tag: title
Encountered some data  : Test
Encountered an end tag : title
Encountered an end tag : head
Encountered a start tag: body
Encountered a start tag: h1
Encountered some data  : Parse me!
Encountered an end tag : h1
Encountered an end tag : body
Encountered an end tag : html
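
For scraping, the handlers usually collect data instead of printing it. As a minimal sketch (the class name LinkCollector and the sample markup are illustrative, not from the talk), the parser below gathers the href of every <a> start tag it encounters:

```python
from html.parser import HTMLParser


class LinkCollector(HTMLParser):
    """Collects the href attribute of every <a> start tag."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        # attrs is a list of (name, value) pairs
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value is not None:
                    self.links.append(value)


collector = LinkCollector()
collector.feed('<p><a href="/one">1</a> and <a href="/two">2</a></p>')
print(collector.links)
```

The parser keeps no tree in memory, so any state needed across events (here, the list of links) has to be stored on the parser object itself.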

Element Tree

 xml.etree.ElementTree

<?xml version="1.0"?>
<data>
    <country name="Liechtenstein">
        <rank>1</rank>
        <year>2008</year>
        <gdppc>141100</gdppc>
        <neighbor name="Austria" direction="E"/>
        <neighbor name="Switzerland" direction="W"/>
    </country>
    <country name="Singapore">
        <rank>4</rank>
        <year>2011</year>
        <gdppc>59900</gdppc>
        <neighbor name="Malaysia" direction="N"/>
    </country>
    <country name="Panama">
        <rank>68</rank>
        <year>2011</year>
        <gdppc>13600</gdppc>
        <neighbor name="Costa Rica" direction="W"/>
        <neighbor name="Colombia" direction="E"/>
    </country>
</data>

xml.etree.ElementTree

import xml.etree.ElementTree as ET

# Parse from a file...
tree = ET.parse('country_data.xml')
root = tree.getroot()

# ...or parse directly from a string
root = ET.fromstring(country_data_as_string)
>>> root.tag
'data'
>>> root.attrib
{}
>>> for child in root:
...     print(child.tag, child.attrib)
...
country {'name': 'Liechtenstein'}
country {'name': 'Singapore'}
country {'name': 'Panama'}
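
Beyond iterating over direct children, ElementTree can search the tree with find(), findall() and iter(). A small sketch, assuming a trimmed copy of the country data above is held in a string (the names gdp_by_country and neighbors are illustrative):

```python
import xml.etree.ElementTree as ET

# Trimmed copy of the country data shown above
country_data_as_string = """\
<data>
    <country name="Liechtenstein">
        <rank>1</rank>
        <gdppc>141100</gdppc>
        <neighbor name="Austria" direction="E"/>
        <neighbor name="Switzerland" direction="W"/>
    </country>
    <country name="Singapore">
        <rank>4</rank>
        <gdppc>59900</gdppc>
        <neighbor name="Malaysia" direction="N"/>
    </country>
</data>
"""

root = ET.fromstring(country_data_as_string)

# findall() with a plain tag name matches direct children only;
# find() returns the first matching child
gdp_by_country = {
    country.get("name"): int(country.find("gdppc").text)
    for country in root.findall("country")
}

# iter() walks the whole subtree, at any depth
neighbors = [n.get("name") for n in root.iter("neighbor")]

print(gdp_by_country)
print(neighbors)
```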

There are alternatives

BeautifulSoup Parsers

Python’s html.parser
Typical usage: BeautifulSoup(markup, "html.parser")
Advantages:
* Batteries included
* Decent speed
* Lenient (as of Python 2.7.3 and 3.2.)
Disadvantages:
* Not very lenient (before Python 2.7.3 or 3.2.2)

lxml’s HTML parser
Typical usage: BeautifulSoup(markup, "lxml")
Advantages:
* Very fast
* Lenient
Disadvantages:
* External C dependency

lxml’s XML parser
Typical usage: BeautifulSoup(markup, "lxml-xml") or BeautifulSoup(markup, "xml")
Advantages:
* Very fast
* The only currently supported XML parser
Disadvantages:
* External C dependency

html5lib
Typical usage: BeautifulSoup(markup, "html5lib")
Advantages:
* Extremely lenient
* Parses pages the same way a web browser does
* Creates valid HTML5
Disadvantages:
* Very slow
* External Python dependency

BeautifulSoup Example

<html><head><title>The Dormouse's story</title></head>
<body>
<p class="title"><b>The Dormouse's story</b></p>

<p class="story">Once upon a time there were three little sisters; and their names were
<a href="http://example.com/elsie" class="sister" id="link1">Elsie</a>,
<a href="http://example.com/lacie" class="sister" id="link2">Lacie</a> and
<a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>;
and they lived at the bottom of a well.</p>

<p class="story">...</p>

from bs4 import BeautifulSoup

# html_doc is a string holding the HTML document above
soup = BeautifulSoup(html_doc, 'html.parser')

soup.title
# <title>The Dormouse's story</title>

soup.title.name
# 'title'

soup.title.string
# "The Dormouse's story"

soup.title.parent.name
# 'head'

soup.p
# <p class="title"><b>The Dormouse's story</b></p>

# class is a multi-valued attribute, so bs4 returns a list
soup.p['class']
# ['title']

soup.a
# <a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>

soup.find_all('a')
# [<a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>,
#  <a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>,
#  <a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>]

soup.find(id="link3")
# <a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>

for link in soup.find_all('a'):
    print(link.get('href'))
# http://example.com/elsie
# http://example.com/lacie
# http://example.com/tillie

lxml

from lxml import html

# page holds the HTML source of the page (str or bytes)
dom = html.fromstring(page)

## code at ipython...
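
What such an interactive session might look like, as a sketch: the page string below is a made-up stand-in for a downloaded page, and the variable names are illustrative. html.fromstring(), .xpath(), .text_content() and .get() are part of lxml's API.

```python
from lxml import html

# Stand-in for the source of a downloaded page (illustrative only)
page = """
<html>
  <body>
    <h1>Tracking</h1>
    <ul>
      <li><a href="/a">First</a></li>
      <li><a href="/b">Second</a></li>
    </ul>
  </body>
</html>
"""

dom = html.fromstring(page)

# xpath() returns a list: elements for node queries,
# strings for text() and @attribute queries
title = dom.xpath("//h1/text()")[0]
hrefs = dom.xpath("//a/@href")
labels = [a.text_content() for a in dom.xpath("//li/a")]

print(title, hrefs, labels)
```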

XML Query Languages

CSS Selectors

* {
   color: green;
   font-size: 20px;
   line-height: 25px;
}
ul {
   list-style: none;
   border: solid 1px #ccc;
}

body {
   background-color: black;
   border: solid 1px #ccc;
}

div {
    display: none;
}

#container {
   width: 960px;
   margin: 0 auto;
}

#my-field {
    display: none;
}
.box {
   padding: 20px;
   margin: 10px;
   width: 240px;
}
<div class="box"></div>

<div class="box box-more box-extended"></div>
#container .box {
   float: left;
   padding-bottom: 15px;
}
<div id="container">
  <div class="box"></div>

  <div class="box-2"></div>
</div>

<div class="box"></div>
#container > .box {
   float: left;
   padding-bottom: 15px;
}
<div id="container">
  <div class="box"></div>

  <div>
    <div class="box"></div>
  </div>
</div>
input[type="text"] {
   background-color: #444;
   width: 200px;
}
<input type="text">

<input type="submit">

XPath

<?xml version="1.0" encoding="utf-8"?>
<Wikimedia>
  <projects>
    <project name="Wikipedia" launch="2001-01-05">
      <editions>
        <edition language="English">en.wikipedia.org</edition>
        <edition language="German">de.wikipedia.org</edition>
        <edition language="French">fr.wikipedia.org</edition>
        <edition language="Polish">pl.wikipedia.org</edition>
        <edition language="Spanish">es.wikipedia.org</edition>
      </editions>
    </project>
    <project name="Wiktionary" launch="2002-12-12">
      <editions>
        <edition language="English">en.wiktionary.org</edition>
        <edition language="French">fr.wiktionary.org</edition>
        <edition language="Vietnamese">vi.wiktionary.org</edition>
        <edition language="Turkish">tr.wiktionary.org</edition>
        <edition language="Spanish">es.wiktionary.org</edition>
      </editions>
    </project>
  </projects>
</Wikimedia>
/Wikimedia/projects/project/@name

/Wikimedia//editions

/Wikimedia/projects/project/editions/edition[@language='English']/text()

/Wikimedia/projects/project[@name='Wikipedia']/editions/edition/text()
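
These expressions can be tried with lxml's xpath() method. The sketch below assumes a trimmed copy of the Wikimedia document above, with two editions per project; the variable names are illustrative.

```python
from lxml import etree

# Trimmed copy of the document above. It is passed as bytes because
# lxml rejects str input that carries an encoding declaration.
xml = b"""<?xml version="1.0" encoding="utf-8"?>
<Wikimedia>
  <projects>
    <project name="Wikipedia" launch="2001-01-05">
      <editions>
        <edition language="English">en.wikipedia.org</edition>
        <edition language="German">de.wikipedia.org</edition>
      </editions>
    </project>
    <project name="Wiktionary" launch="2002-12-12">
      <editions>
        <edition language="English">en.wiktionary.org</edition>
        <edition language="French">fr.wiktionary.org</edition>
      </editions>
    </project>
  </projects>
</Wikimedia>
"""

root = etree.fromstring(xml)

# Attribute values of every project
names = root.xpath("/Wikimedia/projects/project/@name")

# Every <editions> element at any depth below <Wikimedia>
editions = root.xpath("/Wikimedia//editions")

# Text of the English edition of every project
english = root.xpath(
    "/Wikimedia/projects/project/editions/edition[@language='English']/text()"
)

# Text of every edition of the Wikipedia project
wikipedia = root.xpath(
    "/Wikimedia/projects/project[@name='Wikipedia']/editions/edition/text()"
)

print(names, len(editions), english, wikipedia)
```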
* {
   color: green;
   font-size: 20px;
   line-height: 25px;
}

CSS Selector
*

XPath
//*

ul {
   list-style: none;
   border: solid 1px #ccc;
}

body {
   background-color: black;
   border: solid 1px #ccc;
}

div {
    display: none;
}

CSS Selector
ul
body
div

XPath
//ul
//body
//div

#container {
   width: 960px;
   margin: 0 auto;
}

#my-field {
    display: none;
}

CSS Selector
#container
#my-field

XPath
//*[@id="container"]
//*[@id="my-field"]

.box {
   padding: 20px;
   margin: 10px;
   width: 240px;
}
<div class="box"></div>

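<div class="box box-more box-extended"></div>

CSS Selector
.box

XPath
//*[@class="box"]              (exact match: misses class="box box-more box-extended")
//*[contains(@class, "box")]   (substring match: would also match class="box-2")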
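
Neither a plain @class comparison nor contains() reproduces the .box selector exactly. A common workaround, shown here as a sketch and not as something from the talk, pads the class attribute with spaces and looks for the whole token. The id attributes in the sample markup are added only to make the results easy to read.

```python
from lxml import html

# Sample markup; the id values are illustrative labels
doc = html.fromstring("""
<div>
  <div id="a" class="box"></div>
  <div id="b" class="box box-more box-extended"></div>
  <div id="c" class="box-2"></div>
</div>
""")


def ids(expr):
    """Return the id of every element matched by the XPath expression."""
    return [el.get("id") for el in doc.xpath(expr)]


# Exact string comparison: only class="box"
exact = ids('//*[@class="box"]')

# Substring test: also catches class="box-2"
substring = ids('//*[contains(@class, "box")]')

# Whole-token test: behaves like the CSS selector .box
token = ids('//*[contains(concat(" ", normalize-space(@class), " "), " box ")]')

print(exact, substring, token)
```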

#container .box {
   float: left;
   padding-bottom: 15px;
}
<div id="container">
  <div class="box"></div>

  <div class="box-2"></div>
</div>

<div class="box"></div>

CSS Selector
#container .box

XPath
//*[@id="container"]//*[@class="box"]
//*[@id="container"]//*[contains(@class, "box")]   (would also match class="box-2")

#container > .box {
   float: left;
   padding-bottom: 15px;
}
<div id="container">
  <div class="box"></div>

  <div>
    <div class="box"></div>
  </div>
</div>

CSS Selector
#container > .box

XPath
//*[@id="container"]/*[@class="box"]
//*[@id="container"]/*[contains(@class, "box")]
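
The difference between the descendant step (//) and the child step (/) can be checked on the markup above. In this sketch the data-n attributes are hypothetical labels added to tell the two boxes apart.

```python
from lxml import html

# Same structure as the slide; data-n is an illustrative label
doc = html.fromstring("""
<div id="container">
  <div class="box" data-n="1"></div>
  <div>
    <div class="box" data-n="2"></div>
  </div>
</div>
""")

# '#container .box': any depth below the container
descendants = [
    el.get("data-n")
    for el in doc.xpath('//*[@id="container"]//*[@class="box"]')
]

# '#container > .box': direct children only
children = [
    el.get("data-n")
    for el in doc.xpath('//*[@id="container"]/*[@class="box"]')
]

print(descendants, children)
```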

input[type="text"] {
   background-color: #444;
   width: 200px;
}
<input type="text">

<input type="submit">

CSS Selector
input[type="text"]

XPath
//input[@type="text"]

Practical Example: Correios

How to get an HTML or XML document (such as a web page), parse it, and extract information from it.