Simple Structural Scraping

Skyscraper

Enlive

Enlive is a selector-based (à la CSS) templating library for Clojure.

Why Enlive?

  • Parses HTML into EDN (plain Clojure data)
  • Has functions for CSS-like selection of data
  • Also has a templating story that's pretty cool
    • Not going into it here though
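For instance, parsing a fragment at the REPL (a quick sketch; the output shape is Enlive's node format, approximately):

(require '[net.cgrand.enlive-html :as html])

(html/html-snippet "<p class=\"a\">hi</p>")
;;=> ({:tag :p, :attrs {:class "a"}, :content ("hi")})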
CSS                                        Enlive
div                                        [:div]
div.my-class                               [:div.my-class]
div#my-id                                  [:div#my-id]
body script                                [:body :script]
div > *                                    [:div :> :*]
text children of div                       [:div :> text-node]
nodes whose "href" starts with "item"      (attr-starts :href "item")
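The last two rows in action (a sketch; page stands for an already-parsed Enlive resource):

(html/select page [:div :> html/text-node])          ; text children of divs
(html/select page [(html/attr-starts :href "item")]) ; href starts with "item"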


Enlive Example

index.html

<!DOCTYPE html>
<html lang="en">
  <head>
    <title class="scrape-me">
     Interesting Thing
    </title>
  </head>
  <body>
  </body>
</html>

src/scraping_talk/scrape.clj

(ns scraping-talk.scrape
  (:require [net.cgrand.enlive-html :as html]
            [clojure.pprint :refer [pprint]]))

(let [resource (-> "resources/public/index.html"
                   slurp
                   java.io.StringReader.
                   html/html-resource)]
  (html/select resource [:title.scrape-me]))

;;=> [{:tag :title,
;;     :attrs {:class "scrape-me"},
;;     :content ["Interesting Thing"]}]
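html/text flattens a node down to its text content, which the HN example below leans on. With resource bound as above:

(map html/text (html/select resource [:title.scrape-me]))
;;=> ("Interesting Thing")  ; modulo surrounding whitespace from the source HTML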

Scraping HN

(ns scraping-talk.enlive-hn
  (:require [net.cgrand.enlive-html :as html]
            [clojure.pprint :refer [pprint]]))

(def ^:dynamic *base-url* "https://news.ycombinator.com/")

(defn fetch-url [url] (html/html-resource (java.net.URL. url)))

(defn hn-headlines []
  (->> (html/select (fetch-url *base-url*) 
                    [:td.title :a])
       (map html/text)))

(defn hn-points []
  (->> (html/select (fetch-url *base-url*) 
                    [:td.subtext html/first-child])
       (map html/text)))

(mapv (fn [p h] {:points p
                 :headline h})
      (hn-points)
      (hn-headlines))

;;=> [{:points "960 points", :headline "Clj-Syd voted top Meetup on meetup.com"}
;;    {:points "118 points", :headline "Why I chose Clojure over doing dishes"}
;;    ...]
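One nit: this fetches the front page twice. A single-fetch sketch using the same selectors:

(let [page (fetch-url *base-url*)]
  (mapv (fn [p h] {:points p :headline h})
        (map html/text (html/select page [:td.subtext html/first-child]))
        (map html/text (html/select page [:td.title :a]))))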

How about in the large?

  • Not a great story for organizing mountains of data.
  • How do we organize a huge list of pages?
  • Usually there's a tree-like structure that we care about.
    • e.g. Articles -> Comments
  • Caching (not refetching)? Concurrency?
  • What is essential for writing this kind of crawler?

Skyscraper

A data-driven web crawler library built on Enlive.

Key Features

  • Automatically caches previously read pages
    • Can be switched off too.
  • A simple way to generate sequences of pages of interest.
  • Graph-based approach where so-called "contexts" hold page data and "processors" move from one context to the next.
[Diagram: seed -> processor -> context -> processor -> ...]

Terms

[Diagram: the seed produces the root context; processors turn each context into child contexts]

Seed

(defn seed
  "Returns a seq of top-level contexts."
  [& _]
  [{:url "https://news.ycombinator.com/"
    :processor :hn-index}])
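The slides omit the ns form. A plausible one for these Skyscraper examples (namespaces assumed; they have moved around between Skyscraper versions):

(ns scraping-talk.skyscraper-hn
  (:require [skyscraper :refer [defprocessor scrape]]
            [net.cgrand.enlive-html :refer [select text-node attr-starts]]))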

Processor

(defprocessor hn-index
  :cache-template "hn/index"

  :process-fn
  (fn [resource ctx]
    ;; each story is a tr.athing row followed by a subtext row;
    ;; the map selector grabs the fragment between tr.athing and tr.spacer
    (for [row (select resource {[:tr.athing] [:tr.spacer]})]
      (let [a-thing (select row [:tr.athing])
            subtext (select row [:td.subtext])]
        {:score (first (select subtext
                               [:span.score text-node]))
         :title (first (select a-thing
                               [:td.title :> :a text-node]))}))))

Scrape

(scrape (seed) :processed-cache false)

;;=> [{:score "107 points", :title "Volkswagen’s Diesel..."}
;;    {:score "402 points", :title "Chris Poole"}
;;    {:score "85 points", :title "Don't expose the..."}
;;    ...]
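With caching left on (the default, as I understand Skyscraper's behavior), a re-run reads from the on-disk caches instead of refetching:

(scrape (seed))  ; reuses the HTML and processed caches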

Going Deeper

(defprocessor hn-index
  :cache-template "hn/index"
  :process-fn
  (fn [resource ctx]
    (for [row (select resource {[:tr.athing] [:tr.spacer]})]
      (let [a-thing (select row [:tr.athing])
            subtext (select row [:td.subtext])]
        {:score (first (select subtext [:span.score text-node]))
         :title (first (select a-thing [:td.title :> :a text-node]))
         :url (->> (select subtext [(attr-starts :href "item")])
                   first
                   :attrs
                   :href
                   (str "https://news.ycombinator.com/"))
         :processor :hn-comment}))))

(With :processor :hn-comment on each context, Skyscraper follows the context's :url and hands the fetched comment page to an hn-comment processor.)
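A minimal hn-comment processor might look like this (HN's span.commtext markup and the cache-key scheme are my assumptions):

(defprocessor hn-comment
  :cache-template "hn/item/:title"
  :process-fn
  (fn [resource ctx]
    (for [c (select resource [:span.commtext])]
      ;; join the text nodes of each comment body
      {:comment (apply str (select c [text-node]))})))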

Next Steps

  • Have seed return a seq of top-level pages instead of just one (see the sketch after this list)
  • Parse down further:
    • Comment Pages
    • User Pages
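A sketch of a multi-page seed (HN's ?p= paging scheme assumed):

(defn seed
  "Returns contexts for the first few index pages."
  [& _]
  (for [p (range 1 4)]
    {:url (str "https://news.ycombinator.com/news?p=" p)
     :processor :hn-index}))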

What to be careful of

  • Sketchy error messages
    • If you stay on the tracks you should be fine though.
  • Still pretty early, and the API is subject to change

Thanks!

Questions?

Simple Structural Scraping with Skyscraper

By escherize
