Scraping with `lxml`

Extracting information from HTML/XML web pages using the lxml library

by Allan Daemon

Agenda

HTML / XML

SGML

Standard Generalized Markup Language

HTML

Hypertext Markup Language

(1993 @ CERN)

XML

Extensible Markup Language

<!DOCTYPE html>
<html>
  <head>
    <title>This is a title</title>
  </head>
  <body>
    <p>Hello world!</p>
  </body>
</html>

<tag attribute1="value1" attribute2="value2">

    <subtag1> value </subtag1>

    <subtag2 attribute1="value1" attribute2="value2">
        <subsubtag>
                Some data <single tag />.
        </subsubtag>
    </subtag2>

</tag>

DOM

Document Object Model

Source: https://en.wikipedia.org/wiki/Document_Object_Model#/media/File:DOM-model.svg

## Show Chrome's DOM
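
To make the tree idea concrete, here is a minimal sketch that parses the "Hello world!" page shown earlier (without its DOCTYPE line, so it is plain well-formed markup for the standard library's XML parser) and prints one indented line per node. The names `DOC` and `outline` are illustrative only and do not come from any library.

```python
import xml.etree.ElementTree as ET

# The "Hello world!" page from the HTML slide, minus the DOCTYPE line.
DOC = """<html>
  <head>
    <title>This is a title</title>
  </head>
  <body>
    <p>Hello world!</p>
  </body>
</html>"""


def outline(node, depth=0):
    """Return one indented line per element, visiting the tree depth-first."""
    text = (node.text or "").strip()
    lines = ["  " * depth + node.tag + (f": {text}" if text else "")]
    for child in node:  # iterating an element yields its direct children
        lines.extend(outline(child, depth + 1))
    return lines


tree_lines = outline(ET.fromstring(DOC))
print("\n".join(tree_lines))
```

Each level of indentation corresponds to one parent/child step in the tree, which is roughly what the browser's element inspector shows.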

Handling HTML / XML

Handling HTML & XML

SAX Style

html.parser (py2: HTMLParser)

Source: https://docs.python.org/3/library/html.parser.html

from html.parser import HTMLParser

class MyHTMLParser(HTMLParser):
    def handle_starttag(self, tag, attrs):
        print("Encountered a start tag:", tag)

    def handle_endtag(self, tag):
        print("Encountered an end tag :", tag)

    def handle_data(self, data):
        print("Encountered some data  :", data)

parser = MyHTMLParser()
parser.feed('<html><head><title>Test</title></head>'
            '<body><h1>Parse me!</h1></body></html>')

Output:

Encountered a start tag: html
Encountered a start tag: head
Encountered a start tag: title
Encountered some data  : Test
Encountered an end tag : title
Encountered an end tag : head
Encountered a start tag: body
Encountered a start tag: h1
Encountered some data  : Parse me!
Encountered an end tag : h1
Encountered an end tag : body
Encountered an end tag : html
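
In the SAX style the parser only fires callbacks, so any state you need has to be kept by your own handler. As a small sketch of that idea, the hypothetical `LinkCollector` class below (the class name and the sample markup are assumptions made for illustration) gathers every `href` it sees while the document streams through.

```python
from html.parser import HTMLParser


class LinkCollector(HTMLParser):
    """Collect the href of every <a> start tag seen while parsing."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        # attrs arrives as a list of (name, value) pairs
        if tag == "a":
            href = dict(attrs).get("href")
            if href is not None:
                self.links.append(href)


collector = LinkCollector()
collector.feed('<p><a href="http://example.com/elsie">Elsie</a> and '
               '<a href="http://example.com/lacie">Lacie</a></p>')
print(collector.links)
```

No tree is ever built here, which keeps memory use low but makes questions such as "which paragraph contains this link?" awkward to answer.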

Element Tree

xml.etree.ElementTree

Source: https://docs.python.org/3.7/library/xml.etree.elementtree.html

<?xml version="1.0"?>
<data>
    <country name="Liechtenstein">
        <rank>1</rank>
        <year>2008</year>
        <gdppc>141100</gdppc>
        <neighbor name="Austria" direction="E"/>
        <neighbor name="Switzerland" direction="W"/>
    </country>
    <country name="Singapore">
        <rank>4</rank>
        <year>2011</year>
        <gdppc>59900</gdppc>
        <neighbor name="Malaysia" direction="N"/>
    </country>
    <country name="Panama">
        <rank>68</rank>
        <year>2011</year>
        <gdppc>13600</gdppc>
        <neighbor name="Costa Rica" direction="W"/>
        <neighbor name="Colombia" direction="E"/>
    </country>
</data>

import xml.etree.ElementTree as ET

# Option 1: parse from a file
tree = ET.parse('country_data.xml')
root = tree.getroot()

# Option 2: parse from a string that already holds the XML
root = ET.fromstring(country_data_as_string)

>>> root.tag
'data'
>>> root.attrib
{}
>>> for child in root:
...     print(child.tag, child.attrib)
...
country {'name': 'Liechtenstein'}
country {'name': 'Singapore'}
country {'name': 'Panama'}
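
Beyond iterating over children, an element tree can also be searched. The sketch below uses a shortened copy of the country data and the limited XPath subset that `xml.etree.ElementTree` supports; the variable names `ranks` and `panama_neighbors` are illustrative.

```python
import xml.etree.ElementTree as ET

# Shortened copy of the country data from the slide above.
country_data_as_string = """<data>
    <country name="Liechtenstein">
        <rank>1</rank>
        <neighbor name="Austria" direction="E"/>
        <neighbor name="Switzerland" direction="W"/>
    </country>
    <country name="Singapore">
        <rank>4</rank>
        <neighbor name="Malaysia" direction="N"/>
    </country>
    <country name="Panama">
        <rank>68</rank>
        <neighbor name="Costa Rica" direction="W"/>
        <neighbor name="Colombia" direction="E"/>
    </country>
</data>"""

root = ET.fromstring(country_data_as_string)

# iter() walks all descendants with the given tag; find() returns the first match.
ranks = {c.get("name"): int(c.find("rank").text) for c in root.iter("country")}

# findall() accepts simple attribute predicates such as [@name='...'].
panama_neighbors = [
    n.get("name") for n in root.findall("./country[@name='Panama']/neighbor")
]

print(ranks)
print(panama_neighbors)
```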

There are alternatives

BeautifulSoup Parsers

| Parser | Typical usage | Advantages | Disadvantages |
| --- | --- | --- | --- |
| Python's html.parser | `BeautifulSoup(markup, "html.parser")` | Batteries included; decent speed; lenient (as of Python 2.7.3 and 3.2.2) | Not very lenient (before Python 2.7.3 or 3.2.2) |
| lxml's HTML parser | `BeautifulSoup(markup, "lxml")` | Very fast; lenient | External C dependency |
| lxml's XML parser | `BeautifulSoup(markup, "lxml-xml")` or `BeautifulSoup(markup, "xml")` | Very fast; the only currently supported XML parser | External C dependency |
| html5lib | `BeautifulSoup(markup, "html5lib")` | Extremely lenient; parses pages the same way a web browser does; creates valid HTML5 | Very slow; external Python dependency |

Source: https://www.crummy.com/software/BeautifulSoup/bs4/doc/#installing-a-parser

BeautifulSoup Eg.

<html><head><title>The Dormouse's story</title></head>
<body>
<p class="title"><b>The Dormouse's story</b></p>

<p class="story">Once upon a time there were three little sisters; and their names were
<a href="http://example.com/elsie" class="sister" id="link1">Elsie</a>,
<a href="http://example.com/lacie" class="sister" id="link2">Lacie</a> and
<a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>;
and they lived at the bottom of a well.</p>

<p class="story">...</p>

from bs4 import BeautifulSoup
soup = BeautifulSoup(html_doc, 'html.parser')

soup.title
# <title>The Dormouse's story</title>

soup.title.name
# 'title'

soup.title.string
# "The Dormouse's story"

soup.title.parent.name
# 'head'

soup.p
# <p class="title"><b>The Dormouse's story</b></p>

soup.p['class']
# ['title']  (class is multi-valued, so bs4 returns a list)

soup.a
# <a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>

soup.find_all('a')
# [<a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>,
#  <a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>,
#  <a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>]

soup.find(id="link3")
# <a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>

for link in soup.find_all('a'):
    print(link.get('href'))
# http://example.com/elsie
# http://example.com/lacie
# http://example.com/tillie

Source: https://www.crummy.com/software/BeautifulSoup/bs4/doc/

`lxml`

lxml

from lxml import html

# page: the HTML source of the document, as a string
dom = html.fromstring(page)

## code at ipython...
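
For readers without the live session, here is a minimal sketch of the kind of thing one might type at that point. It assumes `lxml` is installed; the `page` string is a shortened, made-up variant of the Dormouse document used in the BeautifulSoup example.

```python
from lxml import html

# Shortened sample page; in real scraping this would be the downloaded HTML.
page = """<html><head><title>The Dormouse's story</title></head>
<body>
<p class="story">
<a href="http://example.com/elsie" class="sister" id="link1">Elsie</a>,
<a href="http://example.com/lacie" class="sister" id="link2">Lacie</a>
</p>
</body></html>"""

dom = html.fromstring(page)

# ElementTree-style lookup: text of the first <title> element.
title = dom.findtext(".//title")

# XPath lookups: attribute values and elements.
hrefs = dom.xpath('//a[@class="sister"]/@href')
names = [a.text for a in dom.xpath("//a")]

print(title)
print(hrefs)
print(names)
```

The element returned by `html.fromstring` supports both the ElementTree API (`find`, `iter`, `get`) and full XPath through `.xpath()`.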

XML

Query Languages

CSS Selectors

* {
   color: green;
   font-size: 20px;
   line-height: 25px;
}

Source: https://www.sitepoint.com/web-foundations/css-selectors/

ul {
   list-style: none;
   border: solid 1px #ccc;
}

body {
   background-color: black;
   border: solid 1px #ccc;
}

div {
    display: none;
}

Source: https://www.sitepoint.com/web-foundations/css-selectors/

#container {
   width: 960px;
   margin: 0 auto;
}

#my-field {
    display: none;
}

Source: https://www.sitepoint.com/web-foundations/css-selectors/

.box {
   padding: 20px;
   margin: 10px;
   width: 240px;
}

<div class="box"></div>

<div class="box box-more box-extended"></div>

Source: https://www.sitepoint.com/web-foundations/css-selectors/

#container .box {
   float: left;
   padding-bottom: 15px;
}

<div id="container">
  <div class="box"></div>

  <div class="box-2"></div>
</div>

<div class="box"></div>

Source: https://www.sitepoint.com/web-foundations/css-selectors/

#container > .box {
   float: left;
   padding-bottom: 15px;
}

<div id="container">
  <div class="box"></div>

  <div>
    <div class="box"></div>
  </div>
</div>

Source: https://www.sitepoint.com/web-foundations/css-selectors/

input[type="text"] {
   background-color: #444;
   width: 200px;
}

<input type="text">

<input type="submit">

XPath

<?xml version="1.0" encoding="utf-8"?>
<Wikimedia>
  <projects>
    <project name="Wikipedia" launch="2001-01-05">
      <editions>
        <edition language="English">en.wikipedia.org</edition>
        <edition language="German">de.wikipedia.org</edition>
        <edition language="French">fr.wikipedia.org</edition>
        <edition language="Polish">pl.wikipedia.org</edition>
        <edition language="Spanish">es.wikipedia.org</edition>
      </editions>
    </project>
    <project name="Wiktionary" launch="2002-12-12">
      <editions>
        <edition language="English">en.wiktionary.org</edition>
        <edition language="French">fr.wiktionary.org</edition>
        <edition language="Vietnamese">vi.wiktionary.org</edition>
        <edition language="Turkish">tr.wiktionary.org</edition>
        <edition language="Spanish">es.wiktionary.org</edition>
      </editions>
    </project>
  </projects>
</Wikimedia>

Source: https://en.wikipedia.org/wiki/XPath

/Wikimedia/projects/project/@name

/Wikimedia//editions

/Wikimedia/projects/project/editions/edition[@language='English']/text()

/Wikimedia/projects/project[@name='Wikipedia']/editions/edition/text()

http://codebeautify.org/Xpath-Tester

http://ricostacruz.com/cheatsheets/xpath.html

Source: https://en.wikipedia.org/wiki/XPath

https://www.simple-talk.com/wp-content/uploads/imported/1269-Locators_table_1_0_2.pdf
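
The four expressions above can be tried directly from Python. The following is a sketch that assumes `lxml` is installed and uses a shortened copy of the Wikimedia document (two editions per project); the variable names are illustrative.

```python
from lxml import etree

# Shortened copy of the Wikimedia sample. It is passed as bytes because lxml
# rejects str input that carries an XML encoding declaration.
WIKIMEDIA = b"""<?xml version="1.0" encoding="utf-8"?>
<Wikimedia>
  <projects>
    <project name="Wikipedia" launch="2001-01-05">
      <editions>
        <edition language="English">en.wikipedia.org</edition>
        <edition language="German">de.wikipedia.org</edition>
      </editions>
    </project>
    <project name="Wiktionary" launch="2002-12-12">
      <editions>
        <edition language="English">en.wiktionary.org</edition>
        <edition language="French">fr.wiktionary.org</edition>
      </editions>
    </project>
  </projects>
</Wikimedia>"""

root = etree.fromstring(WIKIMEDIA)

# Attribute values of every project.
names = root.xpath("/Wikimedia/projects/project/@name")

# Every <editions> element anywhere below the root.
editions = root.xpath("/Wikimedia//editions")

# Text of the English edition of each project.
english = root.xpath(
    "/Wikimedia/projects/project/editions/edition[@language='English']/text()"
)

# Text of every edition of the Wikipedia project only.
wikipedia = root.xpath(
    "/Wikimedia/projects/project[@name='Wikipedia']/editions/edition/text()"
)

print(names, len(editions), english, wikipedia)
```

Note that `@name` and `text()` return plain strings, while a path ending in an element name returns element objects.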

* {
   color: green;
   font-size: 20px;
   line-height: 25px;
}

CSS Selector:
*

XPath:
//*

Source: https://www.sitepoint.com/web-foundations/css-selectors/

ul {
   list-style: none;
   border: solid 1px #ccc;
}

body {
   background-color: black;
   border: solid 1px #ccc;
}

div {
    display: none;
}

CSS Selector:
ul
body
div

XPath:
//ul
//body
//div

Source: https://www.sitepoint.com/web-foundations/css-selectors/

#container {
   width: 960px;
   margin: 0 auto;
}

#my-field {
    display: none;
}

CSS Selector:
#container
#my-field

XPath:
//*[@id="container"]
//*[@id="my-field"]

Source: https://www.sitepoint.com/web-foundations/css-selectors/

.box {
   padding: 20px;
   margin: 10px;
   width: 240px;
}

<div class="box"></div>

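<div class="box box-more box-extended"></div>

CSS Selector:
.box

XPath:
//*[@class="box"]
//*[contains(@class, "box")]

(@class="box" matches only when the attribute is exactly "box"; contains() also matches names such as "box-2".)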

Source: https://www.sitepoint.com/web-foundations/css-selectors/

#container .box {
   float: left;
   padding-bottom: 15px;
}

<div id="container">
  <div class="box"></div>

  <div class="box-2"></div>
</div>

<div class="box"></div>

CSS Selector:
#container .box

XPath:
//*[@id="container"]//*[@class="box"]
//*[@id="container"]//*[contains(@class, "box")]

Source: https://www.sitepoint.com/web-foundations/css-selectors/

#container > .box {
   float: left;
   padding-bottom: 15px;
}

<div id="container">
  <div class="box"></div>

  <div>
    <div class="box"></div>
  </div>
</div>

CSS Selector:
#container > .box

XPath:
//*[@id="container"]/*[@class="box"]
//*[@id="container"]/*[contains(@class, "box")]

Source: https://www.sitepoint.com/web-foundations/css-selectors/

input[type="text"] {
   background-color: #444;
   width: 200px;
}

<input type="text">

<input type="submit">

CSS Selector:
input[type="text"]

XPath:
//input[@type="text"]

Source: https://www.sitepoint.com/web-foundations/css-selectors/
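
The CSS-to-XPath pairs from the comparison slides can be checked with `lxml`. This is a sketch under a few assumptions: `lxml` is installed, the `PAGE` markup is a made-up combination of the slide snippets, and `has_class` is a hypothetical helper (not an lxml function) that builds the usual whole-word class test, since a plain `contains(@class, "box")` also matches `box-2`.

```python
from lxml import html

# Made-up page combining the snippets from the comparison slides.
PAGE = """<html><body>
<div id="container">
  <div class="box"></div>
  <div class="box-2"></div>
  <div><div class="box box-more"></div></div>
</div>
<div class="box"></div>
<input type="text">
<input type="submit">
</body></html>"""

dom = html.fromstring(PAGE)


def has_class(name):
    """XPath predicate body matching `name` as a whole word in @class."""
    return f'contains(concat(" ", normalize-space(@class), " "), " {name} ")'


by_id = dom.xpath('//*[@id="container"]')                               # #container
loose = dom.xpath('//*[contains(@class, "box")]')                       # also hits box-2
exact = dom.xpath(f'//*[{has_class("box")}]')                           # .box
descendants = dom.xpath(f'//*[@id="container"]//*[{has_class("box")}]')  # #container .box
children = dom.xpath(f'//*[@id="container"]/*[{has_class("box")}]')      # #container > .box
text_inputs = dom.xpath('//input[@type="text"]')                        # input[type="text"]

for label, found in [("by_id", by_id), ("loose", loose), ("exact", exact),
                     ("descendants", descendants), ("children", children),
                     ("text_inputs", text_inputs)]:
    print(label, len(found))
```

The difference between `loose` and `exact` is the reason the substring form of the class test should be used with care on real pages.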

Practical Example: Correios
