Building a news aggregator

Using streams and functional JS

@eiriklv

Streams

Content aggregation

is really fun!

Objectives

  • Collect articles from
    • RSS
    • Websites
  • Extending articles with
    • Content
    • Number of shares
  • Expose APIs for
    • Searching
    • Realtime updates

First attempt

Has to be a better way

New approach

  • Modular
  • Streams/Functional (ish..?)
  • Composable abstractions
  • Transforming data
  • Declarative

Collecting articles

  • Create module(s) for transforming
    • RSS to articles
    • Websites to articles
parseSiteIntoArticles(spec, function(err, articles) {
  // an array of articles
});

parseRssIntoArticles(spec, function(err, articles) {
  // an array of articles
});

Just functions

{
  "type": "site", // or "feed"
  "name": "New Yorker",
  "url": "http://www.newyorker.com/",
  "template": {
    "containers": [{
      "selector": "article",
      "elements": [{
        "name": "url",
        "type": "url",
        "occurence": "first",
        "required": true,
        "items": [{
          "selector": "section h2 a",
          "attribute": "href"
        }]
      }, {
        "name": "title",
        "required": true,
        "occurence": "first",
        "items": [{
          "selector": "section h2 a"
        }]
      }, {
        "name": "image",
        "type": "url",
        "occurence": "first",
        "fallback": null,
        "items": [{
          "selector": "figure a img",
          "attribute": "src"
        }, {
          "selector": "figure a img",
          "attribute": "data-lazy-src"
        }]
      }]
    }]
  }
}

Specification/template

spec -> [articles]

[{
    origin: 'http://www.newyorker.com/',
    url: 'http://www.newyorker.com/news/daily-comment/the-plot-against-trains',
    author: 'Adam Gopnik',
    title: 'The Plot Against Trains',
    description: 'The will to abandon American infrastructure projects is not some omission of shortsighted politicians. It is part of a coherent ideological project.',
    image: 'http://www.newyorker.com/wp-content/uploads/2015/05/Gopnik-Plot-Against-Trains2-290-150-14182024.jpg',
    page_position: 0
}, {
    origin: 'http://www.newyorker.com/',
    url: 'http://www.newyorker.com/news/john-cassidy/obamas-cognitive-dissonance-on-trade',
    author: 'John Cassidy',
    title: 'Cognitive Dissonance on Trade',
    description: 'A trade deal remains a huge issue for American workers even as President Obama seeks the power from Congress to complete the Trans-Pacific Partnership.',
    image: 'http://www.newyorker.com/wp-content/uploads/2015/05/Cassidy-Obamas-Cognitive-Dissonance-on-Trade-320-240-14154710.jpg',
    page_position: 1
}, {
    origin: 'http://www.newyorker.com/',
    url: 'http://www.newyorker.com/magazine/2015/05/18/distant-emotions',
    author: 'Anthony Lane',
    title: 'Ethan Hawke, Drone Pilot',
    description: 'Viewers of “Good Kill” will end up like its protagonist: sad, stunned, lonesome, and boxed in.',
    image: 'http://www.newyorker.com/wp-content/uploads/2015/05/150518_r26521-320-240-06151517.jpg',
    page_position: 2
}, ...]
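
The slides do not show how a template becomes article objects. A minimal sketch, under these assumptions: the entries in `items` are alternatives tried in order, `"occurence": "first"` means the first match wins, `fallback` applies when nothing matches, and a container missing a `required` element is dropped. `resolveElement`, `buildArticle` and `lookup` are hypothetical names; `lookup` stands in for a real DOM query, and the `"type": "url"` handling is left out.

```javascript
// Hypothetical sketch, not the actual parser from the talk.
// `lookup(selector, attribute)` stands in for a DOM query scoped to one
// container and returns an array of matched strings.

function resolveElement(element, lookup) {
  // Try each item in order; the first one that matches anything wins.
  for (const item of element.items) {
    const matches = lookup(item.selector, item.attribute);
    if (matches.length > 0) {
      return matches[0];
    }
  }
  // Nothing matched: use the declared fallback, if the element has one.
  return 'fallback' in element ? element.fallback : undefined;
}

function buildArticle(elements, lookup) {
  const article = {};
  for (const element of elements) {
    const value = resolveElement(element, lookup);
    if (value === undefined) {
      // A missing required element invalidates the whole container.
      if (element.required) return null;
      continue;
    }
    article[element.name] = value;
  }
  return article;
}

// Usage with a fake lookup table instead of a real DOM.
const fakeDom = {
  'section h2 a|href': ['/news/daily-comment/the-plot-against-trains'],
  'section h2 a|': ['The Plot Against Trains']
};
const lookup = (selector, attribute) =>
  fakeDom[selector + '|' + (attribute || '')] || [];

const elements = [
  { name: 'url', required: true, items: [{ selector: 'section h2 a', attribute: 'href' }] },
  { name: 'title', required: true, items: [{ selector: 'section h2 a' }] },
  { name: 'image', fallback: null, items: [{ selector: 'figure a img', attribute: 'src' }] }
];

const article = buildArticle(elements, lookup);
console.log(article);
```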

Fetching article content

getContentFromUrl(url, function(err, content) {
  // Article content
});

Just a function

url -> article content

content = "<p>One morning, my grandmother’s brother,
Avraham, decided to stop being religious. 
He shaved his beard, cut off his side 
curls, shed his yarmulke, packed his 
things and resolved to leave his hometown 
of Baranovichi and begin a new life..</p>"

Social shares

getSharesFromFacebook(url, function(err, result) {
  // facebook shares count
});

getSharesFromTwitter(url, function(err, result) {
  // twitter shares count
});

Just functions

url -> shares

shares = 243

Declarative object extension

Object extension

let article = {
  url: 'http://www.newyorker.com/news/daily-comment/the-plot-against-trains',
  title: 'The Plot Against Trains'
};

let extension = {
  content: '<p>Trains, trains, trains..</p>'
};

// Copy into a fresh object so neither `article` nor `extension` is mutated
let extendedArticle = Object.assign({}, article, extension);

// {
//   url: 'http://www.newyorker.com/news/daily-comment/the-plot-against-trains',
//   title: 'The Plot Against Trains',
//   content: '<p>Trains, trains, trains..</p>'
// };

But what about..

  • Extending an object based on an existing property?
  • Transforming an existing property?
  • Synchronous/asynchronous?

fp-object-transform

Just some methods for declaratively extending objects

Declarative

let article = {
  url: 'http://www.newyorker.com/magazine/2015/05/18/art-census',
  title: 'A Census at the Met'
};

let extensions = {
  content: ['url', getContentFromUrl]
};

let extendWithContent = extendWith(extensions);

extendWithContent(article, function(err, result) {
  // result = {
  //   url: 'http://www.newyorker.com/magazine/2015/05/18/art-census', 
  //   title: 'A Census at the Met',
  //   content: '<p>The content of the article</p>'
  // }
});
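
What might `extendWith` do internally? A minimal sketch, assuming each extension is a `[sourceProperty, asyncFunction]` pair (with nested objects of such pairs allowed) and Node-style callbacks throughout. This is a hypothetical re-implementation for illustration, not the fp-object-transform source; `resolveSpec` and `fakeGetContent` are made-up names.

```javascript
// Hypothetical re-implementation for illustration only.

// Resolve every key of `spec` against `source` and collect the values.
function resolveSpec(spec, source, callback) {
  const keys = Object.keys(spec);
  const resolved = {};
  let pending = keys.length;
  let finished = false;

  if (pending === 0) return callback(null, resolved);

  keys.forEach(function (key) {
    const entry = spec[key];

    function done(err, value) {
      if (finished) return; // ignore results after the first error
      if (err) {
        finished = true;
        return callback(err);
      }
      resolved[key] = value;
      pending -= 1;
      if (pending === 0) {
        finished = true;
        callback(null, resolved);
      }
    }

    if (Array.isArray(entry)) {
      // ['url', fn] means: call fn with source.url
      const sourceProperty = entry[0];
      const asyncFn = entry[1];
      asyncFn(source[sourceProperty], done);
    } else {
      // Nested spec: values still come from the same source object
      resolveSpec(entry, source, done);
    }
  });
}

function extendWith(extensions) {
  return function (obj, callback) {
    resolveSpec(extensions, obj, function (err, resolved) {
      if (err) return callback(err);
      // Return a new object; the input is left untouched
      callback(null, Object.assign({}, obj, resolved));
    });
  };
}

// Usage with a stub instead of a real content fetcher
const fakeGetContent = (url, cb) => cb(null, '<p>Content of ' + url + '</p>');
const extendWithContent = extendWith({ content: ['url', fakeGetContent] });

extendWithContent({ url: 'http://example.com/a', title: 'A' }, function (err, result) {
  console.log(result);
});
```

All extensions for one object start at once and the callback fires when the last one finishes, or as soon as one of them fails.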

Streams

  • Vanilla streams
    • low level
    • need to implement map/filter/etc yourself
    • tedious error handling
    • manual handling of split/merge
    • object-mode
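
To make "need to implement map/filter/etc yourself" concrete, here is a minimal sketch of a `map` built on Node's core `stream.Transform` in object mode. `mapStream` is a hypothetical helper name, not something from the talk.

```javascript
const { Transform } = require('stream');

// Build an object-mode Transform that applies `fn` to every chunk.
function mapStream(fn) {
  return new Transform({
    objectMode: true,
    transform(chunk, encoding, callback) {
      let result;
      try {
        result = fn(chunk);
      } catch (err) {
        // Errors have to be forwarded by hand
        return callback(err);
      }
      callback(null, result);
    }
  });
}

// Usage: uppercase article titles
const upperCaseTitles = mapStream((article) =>
  Object.assign({}, article, { title: article.title.toUpperCase() })
);

upperCaseTitles.on('data', (article) => console.log(article.title));
upperCaseTitles.on('error', (err) => console.error(err));
upperCaseTitles.write({ title: 'The Plot Against Trains' });
upperCaseTitles.end();
```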

(Are awesome!)

But

highland.js

  • Utility belt like lodash
  • But with streams
  • Higher level abstractions
  • Can use "anything" as source

(more awesomeness!)

lodash-fp

  • functional version of lodash
  • arguments flipped
  • meaningful partial application

(also helpful)
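
"Arguments flipped" and "meaningful partial application" can be shown without the library. A plain-JavaScript sketch with hypothetical stand-ins (`compose`, `isEqual`, `prop`); the real lodash-fp functions do considerably more.

```javascript
// Plain-JS stand-ins for illustration, not the lodash-fp implementations.
const compose = (...fns) => (value) =>
  fns.reduceRight((acc, fn) => fn(acc), value);

// Configuration first, data last: each call returns a reusable function.
const isEqual = (expected) => (actual) => actual === expected;
const prop = (key) => (obj) => obj[key];

// Reads right to left: take `type`, then compare it with 'site'.
const isSite = compose(isEqual('site'), prop('type'));

console.log(isSite({ type: 'site', name: 'New Yorker' })); // true
console.log(isSite({ type: 'feed' })); // false
```

Because the data argument comes last, `isSite` can be handed straight to a `filter` without a wrapper function.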

Let's try it out

// {...} , {...} , {...}
const specStream = highland([{...}, {...}, {...}]);

const isSite = lodash.compose(
  lodash.isEqual('site'),
  lodash.result('type')
);

const articlesFromHtmlStream = specStream
  .fork()
  .filter(isSite)
  .map(parseArticlesFromHtml).parallel(5)
  .errors(function(err) {
    console.log(err);
  })

Transforming specs to articles
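
`.map(...).parallel(5)` keeps at most five parsers running at once while preserving the order of results. As a rough plain-JavaScript analogy (not how highland implements it), bounded mapping over callback-style functions might look like this. `mapWithLimit` and `fakeParse` are hypothetical names.

```javascript
// Hypothetical analogy for `.map(fn).parallel(limit)` using plain callbacks.
function mapWithLimit(items, limit, worker, callback) {
  const results = new Array(items.length);
  let nextIndex = 0; // next item to start
  let active = 0;    // workers currently running
  let completed = 0;
  let failed = false;

  if (items.length === 0) return callback(null, results);

  function launch() {
    // Start workers until the limit is reached or no items remain
    while (!failed && active < limit && nextIndex < items.length) {
      const index = nextIndex++;
      active += 1;
      worker(items[index], function (err, value) {
        active -= 1;
        if (failed) return;
        if (err) {
          failed = true;
          return callback(err);
        }
        results[index] = value; // keep results in source order
        completed += 1;
        if (completed === items.length) return callback(null, results);
        launch();
      });
    }
  }

  launch();
}

// Usage with a stub parser
const fakeParse = (spec, cb) => cb(null, [spec.name + ' article']);

mapWithLimit([{ name: 'A' }, { name: 'B' }, { name: 'C' }], 2, fakeParse, (err, articles) => {
  console.log(articles);
});
```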

const isRssFeed = lodash.compose(
  lodash.isEqual('feed'),
  lodash.result('type')
);

const articlesFromRssStream = specStream
  .fork()
  .filter(isRssFeed)
  .map(parseArticlesFromRss).parallel(5)
  .errors(handleError)

Transforming specs to articles

What next..?

Merging and flattening

const articleStream = highland([
    articlesFromHtmlStream,
    articlesFromRssStream
  ])
  .merge()
  .flatten()
  .errors(handleError)

Yay!

A stream of articles we can do anything we want with

Extending the articles declaratively

const addSocialDataFromUrl = extendWith({
  shares: {
    facebook: ['url', getSharesFromFacebook],
    twitter: ['url', getSharesFromTwitter],
  }
});

const addContentFromUrl = extendWith({
  content: ['url', getContentFromUrl]
});

const extendedArticleStream = articleStream
  .fork()
  .map(addSocialDataFromUrl).parallel(10)
  .map(addContentFromUrl).parallel(10)
  .errors(handleError)

Yay!

A stream of articles that contains everything we want

{
  origin: 'http://www.newyorker.com/',
  url: 'http://www.newyorker.com/magazine/2015/05/18/art-census',
  title: 'A Census at the Met',
  content: '<p>The content of the article</p>',
  shares: {
    facebook: 245,
    twitter: 350,
    linkedin: 470
  }
}

Let's put it together from the top!

(With persistence to a database)

const specStream = highland(getSpecsFromDatabase)
  .ratelimit(1, 30000)
  .flatten()
  .errors(handleError)

Getting the input
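
Passing a function to `highland(...)` treats it as a generator that receives `push` and `next`. The slides do not show `getSpecsFromDatabase`, so this is a sketch assuming it is such a generator that pushes one array of specs per poll. `createSpecSource`, `loadSpecs` and `fakeLoadSpecs` are hypothetical names standing in for a real database query.

```javascript
// Hypothetical factory: `loadSpecs(callback)` stands in for a database query
// that yields an array of spec objects.
function createSpecSource(loadSpecs) {
  return function getSpecsFromDatabase(push, next) {
    loadSpecs(function (err, specs) {
      if (err) {
        push(err);         // surfaces in the stream's .errors() handler
      } else {
        push(null, specs); // one array per poll
      }
      next();              // ask to be called again for the next poll
    });
  };
}

// Usage with an in-memory stub instead of a database
const fakeLoadSpecs = (cb) => cb(null, [{ type: 'site' }, { type: 'feed' }]);
const getSpecsFromDatabase = createSpecSource(fakeLoadSpecs);

// Drive a single poll by hand, the way a stream consumer would
const emitted = [];
getSpecsFromDatabase(
  (err, value) => emitted.push(err || value),
  () => {}
);
console.log(emitted);
```

With a source like this, `.ratelimit(1, 30000)` lets one array of specs through every 30 seconds and `.flatten()` turns each array into individual specs.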

const articlesFromRssStream = specStream
  .fork()
  .filter(isRssFeed)
  .map(parseArticlesFromRss).parallel(5)
  .errors(handleError)


const articlesFromHtmlStream = specStream
  .fork()
  .filter(isSite)
  .map(parseArticlesFromHtml).parallel(5)
  .errors(handleError)

Transforming into articles

Merging and flattening

const articleStream = highland([
    articlesFromHtmlStream,
    articlesFromRssStream
  ])
  .merge()
  .flatten()
  .errors(handleError)

const newArticleStream = articleStream
  .fork()
  .filter(doesNotExistInDatabase)
  .map(addContentFromUrl)
  .map(saveToDatabase)
  .errors(handleError)

Saving new articles

const updatedArticleStream = articleStream
  .fork()
  .filter(existsInDatabase)
  .map(addSocialDataFromUrl)
  .map(updateInDatabase)
  .errors(handleError)

Updating existing articles
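
The slides do not show `existsInDatabase` or `doesNotExistInDatabase`. Since `.filter` takes a synchronous predicate, one possible sketch assumes an in-memory snapshot of already stored article URLs; a real version might query the database instead. `createExistenceFilters` and `knownUrls` are hypothetical names.

```javascript
// Hypothetical: `knownUrls` is a snapshot of article URLs already stored.
function createExistenceFilters(knownUrls) {
  const known = new Set(knownUrls);
  const existsInDatabase = (article) => known.has(article.url);
  const doesNotExistInDatabase = (article) => !existsInDatabase(article);
  return { existsInDatabase, doesNotExistInDatabase };
}

// Usage on a plain array; the same predicates fit a stream's .filter()
const filters = createExistenceFilters(['http://example.com/old']);
const incoming = [
  { url: 'http://example.com/old', title: 'Seen before' },
  { url: 'http://example.com/new', title: 'Fresh' }
];

console.log(incoming.filter(filters.doesNotExistInDatabase)); // the 'Fresh' article
console.log(incoming.filter(filters.existsInDatabase));       // the 'Seen before' article
```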

newArticleStream
  .doto(console.log)
  .resume()

updatedArticleStream
  .doto(console.log)
  .resume()

Starting the fun

Demo

Code

Tools

Building a news aggregator using streams (talk)

By Eirik Langholm Vullum

Talk for NodeConf ONEShot Oslo
