Simple Structural Scraping with Skyscraper
Enlive
Enlive is a selector-based (à la CSS) templating library for Clojure.
Why Enlive?
- Parses HTML into plain Clojure data (EDN)
- Has functions for CSS-like selection of data
- Also has a template story that's pretty cool
  - Not going into it here though
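Because the parsed page is plain data, ordinary Clojure functions can walk it. The sketch below uses a hand-written `sample-node` in the `:tag`/`:attrs`/`:content` shape (it is not real Enlive output), and `node-text` and `tags` are hypothetical helper names for illustration.

```clojure
;; Hand-written data in the {:tag :attrs :content} shape; not produced by Enlive.
(def sample-node
  {:tag :html
   :attrs {:lang "en"}
   :content [{:tag :head
              :attrs nil
              :content [{:tag :title
                         :attrs {:class "scrape-me"}
                         :content ["Interesting Thing"]}]}
             {:tag :body :attrs nil :content []}]})

(defn node-text
  "Concatenates every string found anywhere under node."
  [node]
  (->> (tree-seq map? :content node)
       (filter string?)
       (apply str)))

(defn tags
  "Lists the tags under node in depth-first order."
  [node]
  (->> (tree-seq map? :content node)
       (filter map?)
       (map :tag)))

(node-text sample-node) ;;=> "Interesting Thing"
(tags sample-node)      ;;=> (:html :head :title :body)
```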
Selectors:
CSS | Enlive |
---|---|
div | [:div] |
div.my-class | [:div.my-class] |
div#my-id | [:div#my-id] |
body script | [:body :script] |
div > * | [:div :> :*] |
(text children of div) | [:div :> text-node] |
nodes where attribute "href" starts with "item" | (attr-starts :href "item") |
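To make the keyword notation concrete, here is a toy matcher for a single selector step. It is only a sketch in plain Clojure, not how Enlive implements selection: `parse-step` and `matches-step?` are hypothetical names, and only the tag, `.class`, and `#id` parts are handled.

```clojure
(require '[clojure.string :as str])

(defn parse-step
  "Splits a step such as :div.my-class#my-id into tag, classes, and id."
  [step]
  (let [s       (name step)
        tag     (re-find #"^[^.#]+" s)
        classes (map second (re-seq #"\.([^.#]+)" s))
        id      (second (re-find #"#([^.#]+)" s))]
    {:tag     (when tag (keyword tag))
     :classes (set classes)
     :id      id}))

(defn matches-step?
  "True when node (a {:tag :attrs :content} map) satisfies the step."
  [node step]
  (let [{:keys [tag classes id]} (parse-step step)
        node-classes (set (str/split (get-in node [:attrs :class] "") #"\s+"))]
    (and (or (nil? tag) (= tag :*) (= tag (:tag node)))
         (every? node-classes classes)
         (or (nil? id) (= id (get-in node [:attrs :id]))))))

(parse-step :div.my-class#my-id)
;;=> {:tag :div, :classes #{"my-class"}, :id "my-id"}

(matches-step? {:tag :title :attrs {:class "scrape-me"}} :title.scrape-me)
;;=> true
```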
Enlive Example
index.html
<!DOCTYPE html>
<html lang="en">
  <head>
    <title class="scrape-me">Interesting Thing</title>
  </head>
  <body>
  </body>
</html>
(ns scraping-talk.scrape
  (:require [net.cgrand.enlive-html :as html]
            [clojure.pprint :refer [pprint]]))

;; Read the file, wrap it in a Reader, and parse it into Enlive nodes.
(let [resource (-> "resources/public/index.html"
                   slurp
                   java.io.StringReader.
                   html/html-resource)]
  ;; Select every <title> element that has the class "scrape-me".
  (html/select resource [:title.scrape-me]))
;;=> [{:tag :title,
;;     :attrs {:class "scrape-me"},
;;     :content ["Interesting Thing"]}]
Scraping HN
(ns scraping-talk.enlive-hn
  (:require [net.cgrand.enlive-html :as html]
            [clojure.pprint :refer [pprint]]))

(def ^:dynamic *base-url* "https://news.ycombinator.com/")

;; Fetch a URL and parse it into Enlive nodes.
(defn fetch-url [url]
  (html/html-resource (java.net.URL. url)))

;; Text of every link inside a td.title cell.
(defn hn-headlines []
  (->> (html/select (fetch-url *base-url*)
                    [:td.title :a])
       (map html/text)))

;; Text of the first child of every td.subtext cell (the score).
(defn hn-points []
  (->> (html/select (fetch-url *base-url*)
                    [:td.subtext html/first-child])
       (map html/text)))

;; Pair points with headlines by position.
(mapv (fn [p h] {:points p
                 :headline h})
      (hn-points)
      (hn-headlines))
;;=> [{:points "960 points", :headline "Clj-Syd voted top Meetup on meetup.com"}
;;    {:points "118 points", :headline "Why I chose Clojure over doing dishes"}
;;    ...]
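The scraped points come back as strings such as "960 points", so a small post-processing step is needed before sorting or filtering on them. A minimal sketch, assuming that string shape; `parse-points` is a hypothetical helper and `stories` reuses the sample output above.

```clojure
(defn parse-points
  "Extracts the first run of digits from a string like \"960 points\".
  Returns nil when s is nil or contains no digits."
  [s]
  (some->> s (re-find #"\d+") Long/parseLong))

;; Sample rows in the shape produced by the mapv call above.
(def stories
  [{:points "118 points" :headline "Why I chose Clojure over doing dishes"}
   {:points "960 points" :headline "Clj-Syd voted top Meetup on meetup.com"}])

;; Convert the scores to numbers, then sort highest first.
(def ranked
  (->> stories
       (map #(update % :points parse-points))
       (sort-by :points >)))

(map :points ranked) ;;=> (960 118)
```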
How about in the large?
- Not a great story for organizing mountains of data.
- How do we organize a huge list of pages?
- Usually there's a tree-like structure that we care about.
  - e.g. Articles -> Comments
- Caching (not refetching)? Concurrency?
- What is essential for writing this kind of crawler?
Skyscraper
A data-driven web crawler built on Enlive.
Key Features
- Automatically caches previously read pages
  - Can be switched off too.
- A simple way to generate sequences of pages of interest.
- Graph-based approach: "contexts" are maps of page data, and "processors" take one context to the next.
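One way to picture the context/processor graph and the page cache is a toy model in plain Clojure. This is not Skyscraper's implementation: `fake-site`, `fetch`, `processors`, and `toy-scrape` are hypothetical stand-ins, and the "pages" are in-memory maps rather than fetched HTML.

```clojure
;; Hypothetical stand-in for the network: URL -> already-parsed page data.
(def fake-site
  {"/index"  {:stories [{:title "A" :link "/item/1"}
                        {:title "B" :link "/item/2"}]}
   "/item/1" {:comments ["first!" "nice"]}
   "/item/2" {:comments []}})

(def fetch-count (atom 0)) ; counts simulated downloads
(def page-cache (atom {})) ; url -> page

(defn fetch
  "Returns the page for url, reusing the cached copy unless cache? is false."
  [url cache?]
  (if-let [hit (and cache? (get @page-cache url))]
    hit
    (let [page (get fake-site url)]
      (swap! fetch-count inc)
      (swap! page-cache assoc url page)
      page)))

;; A processor takes a page plus its context and returns child contexts.
(def processors
  {:index    (fn [page _ctx]
               (for [{:keys [title link]} (:stories page)]
                 {:title title :url link :processor :comments}))
   :comments (fn [page _ctx]
               [{:comment-count (count (:comments page))}])})

(defn toy-scrape
  "Follows every context that names a :processor; contexts without one are
  leaves. Children inherit their parent's keys (except :url and :processor)."
  [contexts & {:keys [cache?] :or {cache? true}}]
  (mapcat (fn [{:keys [url processor] :as ctx}]
            (if processor
              (let [page      (fetch url cache?)
                    children  ((processors processor) page ctx)
                    inherited (dissoc ctx :url :processor)]
                (toy-scrape (map #(merge inherited %) children) :cache? cache?))
              [ctx]))
          contexts))

(def results (vec (toy-scrape [{:url "/index" :processor :index}])))
;;=> [{:title "A", :comment-count 2} {:title "B", :comment-count 0}]
```

Running `toy-scrape` a second time with the default options leaves `fetch-count` unchanged, since every page is served from `page-cache`; passing `:cache? false` forces each page to be fetched again.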
Terms
[Diagram with arrows labeling the seed (root context), a processor, and a context]
Seed
(defn seed
  "Returns a seq of top-level contexts."
  [& _]
  ;; :processor names the defprocessor that handles this page.
  [{:url "https://news.ycombinator.com/"
    :processor :hn-index}])
Processor
(defprocessor hn-index
  :cache-template "hn/index"
  :process-fn
  (fn [resource ctx]
    ;; Each row is the fragment running from a tr.athing to the next tr.spacer.
    (for [row (select resource {[:tr.athing] [:tr.spacer]})]
      (let [a-thing (select row [:tr.athing])
            subtext (select row [:td.subtext])]
        {:score (first (select subtext
                               [:span.score text-node]))
         :title (first (select a-thing
                               [:td.title :> :a text-node]))}))))
Scrape
(scrape (seed) :processed-cache false)
;;=> [{:score "107 points", :title "Volkswagen’s Diesel..."}
;; {:score "402 points", :title "Chris Poole"}
;; {:score "85 points", :title "Don't expose the..."}
;; ...]
Going Deeper
(defprocessor hn-index
  :cache-template "hn/index"
  :process-fn
  (fn [resource ctx]
    (for [row (select resource {[:tr.athing] [:tr.spacer]})]
      (let [a-thing (select row [:tr.athing])
            subtext (select row [:td.subtext])]
        {:score (first (select subtext [:span.score text-node]))
         :title (first (select a-thing [:td.title :> :a text-node]))
         ;; Build the absolute URL of the story's comment page.
         :url (->> (select subtext [(attr-starts :href "item")])
                   first
                   :attrs
                   :href
                   (str "https://news.ycombinator.com/"))
         ;; Hand each comment page to the :hn-comment processor.
         :processor :hn-comment}))))
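The URL-building step can be tried on its own with plain data. In this sketch `sample-link` is a hand-written node with a made-up item id, and `comment-url` is a hypothetical helper. It uses `some->>` so that a row without a matching link yields nil; with plain `->>`, a missing href would produce the bare base URL and send the comment processor back to the front page.

```clojure
;; Hand-written node; the item id is invented for illustration.
(def sample-link
  {:tag :a
   :attrs {:href "item?id=12345"}
   :content ["42 comments"]})

(defn comment-url
  "Builds an absolute URL from the first node's :href, or nil if there is none."
  [nodes]
  (some->> (first nodes)
           :attrs
           :href
           (str "https://news.ycombinator.com/")))

(comment-url [sample-link]) ;;=> "https://news.ycombinator.com/item?id=12345"
(comment-url [])            ;;=> nil
```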
(scraping comment pages)
Next Steps
- Have seed return a seq of top-level pages (instead of just one)
- Parse down further:
  - Comment Pages
  - User Pages
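The first of these steps might look like the sketch below. It assumes that further front pages live at `news?p=N` and that the processor keyword matches the `hn-index` processor defined earlier; the `:pages` option is a hypothetical addition, not part of the original `seed`.

```clojure
(defn seed
  "Returns one top-level context per front-page number in :pages."
  [& {:keys [pages] :or {pages [1 2 3]}}]
  (vec (for [p pages]
         ;; Assumption: page N of the front page is served at news?p=N.
         {:url (str "https://news.ycombinator.com/news?p=" p)
          :processor :hn-index})))

(map :url (seed :pages [1 2]))
;;=> ("https://news.ycombinator.com/news?p=1"
;;    "https://news.ycombinator.com/news?p=2")
```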
What to be careful of
- Sketchy error messages
  - If you stay on the tracks you should be fine though.
- Still pretty early, and the API is subject to change
Thanks!
Questions?
Simple Structural Scraping with Skyscraper
By escherize