Chrome Extensions for Page Scraping & Analysis

Baskin Tapkan

gdg-tc://devfestmn-2016

What we will cover

  • Overview of Chrome Extensions
  • Page Scraping Techniques
  • Node Packages for Scraping
  • MongoDB Overview
  • Firebase Overview
  • Sample Application Walkthrough
  • Unit and Integration Testing
  • Summary - Q & A

About me

Full Snack(*) Software Developer

User Group Junkie

@baskint

  • Angular-MN
  • Arduino-MN (IoT)
  • Elixir-MN
  • GDG-TC
  • JavaScript-MN
  • MongoDB-MN
  • Node-MN
  • Ruby-MN

BaskinTapkan

Hobbies & Crafts

Chrome Extensions

- not to be confused with chrome://plugins

What is it?

Small program running in the browser

Where do I run it?

Only in the Chrome browser. Extensions don't run in Firefox, IE or Safari, but they do run on Mac, Windows and Linux.

How do I write one? Is there a ChromeScript?

Nice try :) Extensions use the web languages we already know: JavaScript, HTML, and CSS

... and we will not be covering this kind!

Starting point

chrome://extensions

  • Blocking Ads
  • Readability
  • Form fillers
  • Customized Search
  • Social Media 
  • Productivity
  • Security

Useful for

Extension Review

Task Manager

  1. Click the Chrome hamburger menu in the top right corner
  2. More tools
  3. Task Manager

Create one...

Start with a manifest.json file

{
  "manifest_version": 2,
  "name": "Page Scrape and Analysis Tool",
  "description": "This extension scrapes a web page for analysis and reporting",
  "version": "0.0.1",
  "icons": {...},
  "background": {
    "scripts": ["eventpage.js"],
    "persistent": false
  },
  "browser_action": {
    "default_icon": "images/icon48.png"
  },
  "permissions": [...],
  "content_scripts": [
    {
      "matches": ["<all_urls>"],
      "css": ["psa/pagescrape.css"],
      "js": ["psa/pagescrape.js"]
    }
  ],
  "web_accessible_resources": [...]
}

Background Page

  • Load various configurations
  • Set event listeners
  • Access and use Chrome SDK
  • Create a new tab and initialize its state
  • Send messages to content scripts or other extensions
  • Control Workflow of an extension
  • Orchestrate events
  • Store state data

A few things you can do within a background page
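
The items above can be sketched as a small event page. This is only an illustration of what eventpage.js might contain, not the sample application's actual code: the message shape {type: 'links', links: [...]}, the handleMessage function and the state object are all assumptions.

```javascript
// eventpage.js: hypothetical sketch of a non-persistent background (event) page.
// The message shape {type: 'links', links: [...]} is an assumption for illustration.

// Pure handler, kept separate from the chrome.* wiring so it can be unit tested.
function handleMessage(message, state) {
  if (message && message.type === 'links' && Array.isArray(message.links)) {
    state.lastLinkCount = message.links.length;
    return { ok: true, received: message.links.length };
  }
  return { ok: false, error: 'unknown message type' };
}

var state = { lastLinkCount: 0 };

// Register listeners only when running inside Chrome.
if (typeof chrome !== 'undefined' && chrome.runtime && chrome.runtime.onMessage) {
  // Receive messages sent by content scripts.
  chrome.runtime.onMessage.addListener(function (message, sender, sendResponse) {
    sendResponse(handleMessage(message, state));
  });

  // Clicking the toolbar icon asks the content script in that tab to scrape.
  chrome.browserAction.onClicked.addListener(function (tab) {
    chrome.tabs.sendMessage(tab.id, { type: 'scrape' });
  });
}
```

Because the manifest sets "persistent": false, the page is unloaded when idle, so anything that must survive (like the counter here) would need to go into chrome.storage rather than a global variable.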

Content Scripts

  • Use when interaction with the target page is desired
  • Can read details of displayed web pages
  • Content scripts can communicate with the parent extension via messages
  • Can make cross-site XMLHttpRequests
  • Cannot use chrome.* APIs directly, with a few exceptions (extension, i18n, runtime and storage)

"content_scripts": [
  {
    "matches": ["http://www.google.com/*"],
    "css": ["psaStyles.css"],
    "js": ["jquery.js", "psaContent.js"]
  }
],
...
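
A minimal sketch of what a content script such as psaContent.js could do: read the links from the displayed page and pass them to the parent extension via a message. The collectLinks function and the message shape are assumptions made for illustration, not the sample application's code.

```javascript
// psaContent.js: hypothetical content script sketch.
// Collects the links on the page and hands them to the extension.

// Works on any object exposing querySelectorAll, such as the page's document.
function collectLinks(doc) {
  var anchors = doc.querySelectorAll('a[href]');
  var links = [];
  for (var i = 0; i < anchors.length; i++) {
    links.push({
      title: (anchors[i].textContent || '').trim(),
      href: anchors[i].href
    });
  }
  return links;
}

// Only wire up messaging when running inside a Chrome content script.
if (typeof chrome !== 'undefined' && chrome.runtime && typeof document !== 'undefined') {
  chrome.runtime.sendMessage(
    { type: 'links', links: collectLinks(document) }, // assumed message shape
    function (response) {
      // chrome.runtime.lastError is set if no listener answered.
      if (chrome.runtime.lastError) {
        console.warn(chrome.runtime.lastError.message);
      }
    }
  );
}
```

Keeping collectLinks free of chrome.* calls means it can be tested against a stub object outside the browser.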

chrome.* APIs

Mostly asynchronous: calls return immediately and pass their results to a callback

  • .browserAction
  • .cookies
  • .extension
  • .pageAction
  • .runtime
  • .storage
  • .tabs
  • .windows

There are over 40; here are a few of interest

Pro-tip: Use "Promises"

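// Hypothetical wrapper function, added so that `return p` is valid
function saveValue() {
  var p = new Promise(function (resolve, reject) {
    chrome.storage.local.set({'mykey': 'myvalue'}, function () {
      // chrome.runtime.lastError is set when the call failed
      if (chrome.runtime.lastError) {
        reject(chrome.runtime.lastError);
      } else {
        resolve('saved');
      }
    });
  });

  p.then(function () {
    // continue happy path
  }).catch(function () {
    // continue failed path
  });

  return p;
}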
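
The same idea can be generalized. Below is a sketch of a small helper that wraps any callback-style chrome.* call in a Promise. The name promisifyChrome and its second parameter are assumptions; the second parameter exists only so the helper can be exercised outside the browser.

```javascript
// Hypothetical helper: wraps a callback-style chrome.* function in a Promise.
// `lastErrorSource` defaults to chrome.runtime when available; it is a
// parameter only so the helper can be tried without a browser.
function promisifyChrome(fn, lastErrorSource) {
  return function () {
    var args = Array.prototype.slice.call(arguments);
    return new Promise(function (resolve, reject) {
      // Append the callback that chrome.* APIs expect as their last argument.
      args.push(function (result) {
        var source = lastErrorSource ||
          (typeof chrome !== 'undefined' ? chrome.runtime : {});
        if (source.lastError) {
          reject(new Error(source.lastError.message));
        } else {
          resolve(result);
        }
      });
      fn.apply(null, args);
    });
  };
}

// Usage inside an extension (not executed here):
// var storageGet = promisifyChrome(
//   chrome.storage.local.get.bind(chrome.storage.local));
// storageGet('mykey').then(function (items) { console.log(items.mykey); });
```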

Page Scraping Techniques

  • Human copy-and-paste
  • Text grep - regular expression matching
  • HTTP Programming
  • HTML Parser
  • DOM Parsing
  • Web-scraping software

source: Wikipedia
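
As an illustration of the "text grep" technique from the list above, here is a minimal sketch that pulls href values out of raw HTML with a regular expression. The grepLinks function is an assumption for illustration; it only handles double-quoted href attributes, which is exactly the kind of fragility that makes DOM parsing the more robust choice.

```javascript
// Text-grep sketch: extract href values from raw HTML with a regular expression.
// Fragile by design: it only handles double-quoted href attributes on <a> tags.
function grepLinks(html) {
  var pattern = /<a\s[^>]*href="([^"]*)"/gi;
  var links = [];
  var match;
  // exec() advances through the string because the pattern has the g flag.
  while ((match = pattern.exec(html)) !== null) {
    links.push(match[1]);
  }
  return links;
}

console.log(grepLinks('<a href="http://a.example/">A</a> <a class="x" href="/b">B</a>'));
```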

next up... two Chrome Extensions for Page Scraping in the Chrome Web Store

Scraper-I

Source is out-of-date (2010). Version 1.6 uses an older Chrome API and won't load in current Chrome.

Download the latest from Chrome Web Store

Scraper - II

Right-click on a link, then modify the XPath

Scraper - III

"Export to Google Docs" works! So does "Copy to clipboard".

DEMO

Web Scraper-I

Has a nice website. An enterprise data extraction service is available!

Better support. A tutorial introduction video is posted on the site, along with a link to the free version in the Chrome Web Store.

Access it under Chrome Developer Tools

Comprehensive tutorial and documentation

Pro-Tip: Do NOT use Chrome Developer Tools (CDT) undocked. Selectors don't work.

Web Scraper-II

Web Scraper-III

  • There is a learning curve
  • Lots of options
  • Good support and documentation including videos
  • Commercial product available

Web Scraper-IV

Scraped data is available as CSV and is shown on screen

{
  "startUrl": "https://news.ycombinator.com/",
  "selectors": [{
    "parentSelectors": ["_root"],
    "type": "SelectorLink",
    "multiple": true,
    "id": "hacks",
    "selector": "tr.athing td.title > a",
    "delay": ""
  }],
  "_id": "hacker-news"
}

The exported sitemap is simple JSON

Local storage is used for storing data. CouchDB can be configured as an alternative.
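
Since the exported sitemap is plain JSON, it can be inspected with a few lines of script. A small sketch, using the Hacker News sitemap from this slide; the describeSelectors helper is an assumption for illustration and not part of the Web Scraper extension.

```javascript
// The sitemap from the slide, as an exported JSON string.
var exported = '{"startUrl":"https://news.ycombinator.com/",' +
  '"selectors":[{"parentSelectors":["_root"],"type":"SelectorLink",' +
  '"multiple":true,"id":"hacks","selector":"tr.athing td.title > a","delay":""}],' +
  '"_id":"hacker-news"}';

// Hypothetical helper: one summary line per selector in a sitemap.
function describeSelectors(sitemap) {
  return sitemap.selectors.map(function (s) {
    return s.id + ' (' + s.type + '): ' + s.selector;
  });
}

var sitemap = JSON.parse(exported);
console.log(sitemap._id, describeSelectors(sitemap));
```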

DEMO

Node Packages for Scraping

  • Cheerio: DOM traverser, core jQuery designed for the server
  • X-Ray: depends on Cheerio; includes pagination, delay between requests, streaming to files

Also useful:

  • a high-level browser automation library
  • vo: control-flow library
  • a package for downloading files, images, etc.
  • request: simplest way to make HTTP calls

Cheerio

var request = require('request');
var cheerio = require('cheerio');
...
function scrape(url, json) {
  'use strict';
  json = json || false;
  return new Promise(function (resolve, reject) {
    request(url, function (error, response, html) {
      // Reject on transport errors and non-200 responses instead of
      // resolving with undefined
      if (error) {
        return reject(error);
      }
      if (response.statusCode !== 200) {
        return reject(new Error('Unexpected status code: ' + response.statusCode));
      }
      var $ = cheerio.load(html);
      var parsedResults = [];
      $('span.comhead').each(function (i, element) {
        // Select the previous element
        var a = $(this).prev();
        // Get the rank by parsing the element two levels above the "a" element
        var rank = a.parent().parent().text();
        // Parse the link title
        var title = a.text();
        // Parse the href attribute from the "a" element
        var link = a.attr('href');
        var metadata = {
          rank: parseInt(rank, 10),
          title: title,
          link: link
        };
        // Push meta-data into parsedResults array
        parsedResults.push(metadata);
      });
      resolve(parsedResults);  // resolve data
    });
  });
}

X-Ray

'use strict';
var XRay = require('x-ray');
var xray = new XRay();

exports.scrape = function (url) {
  return xray(url, '.athing', [{
      rank: '.rank',
      title: 'td:nth-child(3) a',
      link: 'td:nth-child(3) a@href'
    }])
    // pagination
    // .paginate('a[rel="nofollow"]:last-child@href')
    // limiting the number of pages visited
    // .limit(3)
    .write(); // without a path argument, write() returns a readable stream
};

MongoDB Overview

An open-source document database that provides high performance, high availability and automatic scaling

In the MEAN stack, works best with Mongoose

mongo CLI for the shell, or RoboMongo for a graphical interface

npm install mongoose --save

Firebase Overview

A realtime cloud database with an API that lets developers store and sync data across multiple clients

npm install firebase --save

var Firebase = require('firebase');
...
var pageScrapeFirebase = new Firebase(config.firebaseUrl);
var fb_scraped = pageScrapeFirebase.child("scrapes");
fb_scraped.push({
  url: scrape.url,
  created_with: scrape.created_with,
  combines: combineSet,
  scrapedAt: Firebase.ServerValue.TIMESTAMP
});

Sample Application Walkthrough

[Architecture diagram with components labeled UI, DA, MDB and FDB]

DEMO

Unit and Integration Testing

2 unit tests, 0 integration tests

Summary

  • Discussed the components that make a Chrome Extension
  • Demonstrated two page scrapers available in the Chrome Web Store and their usage
  • Introduced Node packages for page scraping
  • Overview of MongoDB and Firebase
  • Sample Application Walkthrough
  • Thoughts on Testing

Thank You


Presentation for DevFestMN 2016. Discusses Chrome Extensions and works through a sample application that scrapes links from web pages and stores the results in a MongoDB instance and a Firebase repository in the cloud.
