Skyscraper
Enlive is a selector-based (à la CSS) templating library for Clojure.
CSS | Enlive |
---|---|
div | [:div] |
div.my-class | [:div.my-class] |
div#my-id | [:div#my-id] |
body script | [:body :script] |
div > * | [:div :> :*] |
(text children of div) | [:div :> text-node] |
nodes where attribute "href" starts with "item" | (attr-starts :href "item") |
Selectors:
<!DOCTYPE html>
<html lang="en">
<head>
<title class="scrape-me">
Interesting Thing
</title>
</head>
<body>
</body>
</html>
Enlive Example
index.html
<!DOCTYPE html>
<html lang="en">
<head>
<title class="scrape-me">
Interesting Thing
</title>
</head>
<body>
</body>
</html>
index.html
(ns scraping-talk.scrape
(:require [net.cgrand.enlive-html :as html]
[clojure.pprint :refer [pprint]]))
(let [resource (-> "resources/public/index.html"
slurp
java.io.StringReader.
html/html-resource)]
(html/select resource [:title.scrape-me]))
;;=> [{:tag :title,
;; :attrs {:class "scrape-me"},
;; :content ["Interesting Thing"]}]
(ns scraping-talk.enlive-hn
(:require [net.cgrand.enlive-html :as html]
[clojure.pprint :refer [pprint]]))
(def ^:dynamic *base-url* "https://news.ycombinator.com/")
(defn fetch-url [url] (html/html-resource (java.net.URL. url)))
(defn hn-headlines []
(->> (html/select (fetch-url *base-url*)
[:td.title :a])
(map html/text)))
(defn hn-points []
(->> (html/select (fetch-url *base-url*)
[:td.subtext html/first-child])
(map html/text)))
(mapv (fn [p h] {:points p
:headline h})
(hn-points)
(hn-headlines))
;;=> [{:points "960 points", :headline "Clj-Syd voted top Meetup on meetup.com"}
;; {:points "118 points", :headline "Why I chose Clojure over doing dishes"}
;; ...]
Data driven web crawler based on enlive.
<- seed
<- processor
<- context
Terms
(root context)
(defn seed
"Returns a seq of top-level contexts."
[& _]
[{:url "https://news.ycombinator.com/"
:processor :root-page}])
(defprocessor hn-index
:cache-template "hn/index"
:process-fn
(fn [resource ctx]
(for [row (select resource {[:tr.athing] [:tr.spacer]})]
(let [a-thing (select row [:tr.athing])
subtext (select row [:td.subtext])]
{:score (first (select subtext
[:span.score text-node]))
:title (first (select a-thing
[:td.title :> :a text-node]))}))))
(scrape (seed) :processed-cache false)
;;=> [{:score "107 points", :title "Volkswagen’s Diesel..."}
;; {:score "402 points", :title "Chris Poole"}
;; {:score "85 points", :title "Don't expose the..."}
;; ...]
(defprocessor hn-index
:cache-template "hn/index"
:process-fn
(fn [resource ctx]
(for [row (select resource {[:tr.athing] [:tr.spacer]})]
(let [a-thing (select row [:tr.athing])
subtext (select row [:td.subtext])]
{:score (first (select subtext [:span.score text-node]))
:title (first (select a-thing [:td.title :> :a text-node]))
:url (->> (select subtext [(attr-starts :href "item")])
first
:attrs
:href
(str "https://news.ycombinator.com/"))
:processor :hn-comment}))))
(scraping comment pages)