Simple Structural Scraping

Skyscraper

Enlive

Enlive is a selector-based (à la CSS) templating library for Clojure.

Why Enlive?

  • Parses HTML into EDN (plain Clojure data)
  • Has functions for CSS-like selection of data
  • Also has a templating story that's pretty cool
    • Not going into it here though
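For instance, parsing a fragment at the REPL (a quick sketch; the output shape is Enlive's node format, approximately):

(require '[net.cgrand.enlive-html :as html])

(html/html-snippet "<p class=\"a\">hi</p>")
;;=> ({:tag :p, :attrs {:class "a"}, :content ("hi")})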
CSS                                        Enlive
div                                        [:div]
div.my-class                               [:div.my-class]
div#my-id                                  [:div#my-id]
body script                                [:body :script]
div > *                                    [:div :> :*]
text children of div                       [:div :> text-node]
nodes whose "href" starts with "item"      (attr-starts :href "item")
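The last two rows in action (a sketch; page stands for an already-parsed Enlive resource):

(html/select page [:div :> html/text-node])          ; text children of divs
(html/select page [(html/attr-starts :href "item")]) ; href starts with "item"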


Enlive Example

index.html

<!DOCTYPE html>
<html lang="en">
  <head>
    <title class="scrape-me">
     Interesting Thing
    </title>
  </head>
  <body>
  </body>
</html>

src/scraping_talk/scrape.clj

(ns scraping-talk.scrape
  (:require [net.cgrand.enlive-html :as html]
            [clojure.pprint :refer [pprint]]))

(let [resource (-> "resources/public/index.html"
                   slurp
                   java.io.StringReader.
                   html/html-resource)]
  (html/select resource [:title.scrape-me]))

;;=> [{:tag :title,
;;     :attrs {:class "scrape-me"},
;;     :content ["Interesting Thing"]}]
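html/text flattens a node down to its text content, which the HN example below leans on. With resource bound as above:

(map html/text (html/select resource [:title.scrape-me]))
;;=> ("Interesting Thing")  ; modulo surrounding whitespace from the source HTML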

Scraping HN

(ns scraping-talk.enlive-hn
  (:require [net.cgrand.enlive-html :as html]
            [clojure.pprint :refer [pprint]]))

(def ^:dynamic *base-url* "https://news.ycombinator.com/")

(defn fetch-url [url] (html/html-resource (java.net.URL. url)))

(defn hn-headlines []
  (->> (html/select (fetch-url *base-url*) 
                    [:td.title :a])
       (map html/text)))

(defn hn-points []
  (->> (html/select (fetch-url *base-url*) 
                    [:td.subtext html/first-child])
       (map html/text)))

(mapv (fn [p h] {:points p
                 :headline h})
      (hn-points)
      (hn-headlines))

;;=> [{:points "960 points", :headline "Clj-Syd voted top Meetup on meetup.com"}
;;    {:points "118 points", :headline "Why I chose Clojure over doing dishes"}
;;    ...]
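One nit: this fetches the front page twice. A single-fetch sketch using the same selectors:

(let [page (fetch-url *base-url*)]
  (mapv (fn [p h] {:points p :headline h})
        (map html/text (html/select page [:td.subtext html/first-child]))
        (map html/text (html/select page [:td.title :a]))))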

How about in the large?

  • Not a great story for organizing mountains of data.
  • How do we organize a huge list of pages?
  • Usually there's a tree-like structure that we care about.
    • e.g. Articles -> Comments
  • Caching (not refetching)? Concurrency?
  • What is essential for writing this kind of crawler?

Skyscraper

A data-driven web crawler library built on Enlive.

Key Features

  • Automatically caches previously read pages
    • Can be switched off too.
  • A simple way to generate sequences of pages of interest.
  • Graph-based approach where so-called "contexts" hold page data and "processors" move from one context to the next.
[Diagram: seed -> processor -> context -> processor -> ...]

Terms

[Diagram: the seed produces the root context; processors turn each context into child contexts]

Seed

(defn seed
  "Returns a seq of top-level contexts."
  [& _]
  [{:url "https://news.ycombinator.com/"
    :processor :hn-index}])
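The slides omit the ns form. A plausible one for these Skyscraper examples (namespaces assumed; they have moved around between Skyscraper versions):

(ns scraping-talk.skyscraper-hn
  (:require [skyscraper :refer [defprocessor scrape]]
            [net.cgrand.enlive-html :refer [select text-node attr-starts]]))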

Processor

(defprocessor hn-index
  :cache-template "hn/index"

  :process-fn
  (fn [resource ctx]
    ;; each story is a tr.athing row followed by a subtext row;
    ;; the map selector grabs the fragment between tr.athing and tr.spacer
    (for [row (select resource {[:tr.athing] [:tr.spacer]})]
      (let [a-thing (select row [:tr.athing])
            subtext (select row [:td.subtext])]
        {:score (first (select subtext
                               [:span.score text-node]))
         :title (first (select a-thing
                               [:td.title :> :a text-node]))}))))

Scrape

(scrape (seed) :processed-cache false)

;;=> [{:score "107 points", :title "Volkswagen’s Diesel..."}
;;    {:score "402 points", :title "Chris Poole"}
;;    {:score "85 points", :title "Don't expose the..."}
;;    ...]
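With caching left on (the default, as I understand Skyscraper's behavior), a re-run reads from the on-disk caches instead of refetching:

(scrape (seed))  ; reuses the HTML and processed caches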

Going Deeper

(defprocessor hn-index
  :cache-template "hn/index"
  :process-fn
  (fn [resource ctx]
    (for [row (select resource {[:tr.athing] [:tr.spacer]})]
      (let [a-thing (select row [:tr.athing])
            subtext (select row [:td.subtext])]
        {:score (first (select subtext [:span.score text-node]))
         :title (first (select a-thing [:td.title :> :a text-node]))
         :url (->> (select subtext [(attr-starts :href "item")])
                   first
                   :attrs
                   :href
                   (str "https://news.ycombinator.com/"))
         :processor :hn-comment}))))

(With :processor :hn-comment on each context, Skyscraper follows the context's :url and hands the fetched comment page to an hn-comment processor.)
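A minimal hn-comment processor might look like this (HN's span.commtext markup and the cache-key scheme are my assumptions):

(defprocessor hn-comment
  :cache-template "hn/item/:title"
  :process-fn
  (fn [resource ctx]
    (for [c (select resource [:span.commtext])]
      ;; join the text nodes of each comment body
      {:comment (apply str (select c [text-node]))})))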

Next Steps

  • Have seed return a seq of top-level pages instead of just one (see the sketch after this list)
  • Parse down further:
    • Comment Pages
    • User Pages
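A sketch of a multi-page seed (HN's ?p= paging scheme assumed):

(defn seed
  "Returns contexts for the first few index pages."
  [& _]
  (for [p (range 1 4)]
    {:url (str "https://news.ycombinator.com/news?p=" p)
     :processor :hn-index}))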

What to be careful of

  • Sketchy error messages
    • If you stay on the tracks you should be fine though.
  • Still pretty early, and the API is subject to change

Thanks!

Questions?

Simple Structural Scraping with Skyscraper

By escherize
