WEB CRAWLER 

WITH RUBY

Rifki Fauzi, ST
@kubid




WHAT IS A WEB CRAWLER?



AN INTERNET BOT THAT SYSTEMATICALLY BROWSES THE WORLD WIDE WEB, TYPICALLY FOR THE PURPOSE OF WEB INDEXING.




WHAT DO WE NEED?


RUBYGEMS

OPEN-URI

NOKOGIRI

OPEN-URI


OpenURI is an easy-to-use wrapper for net/http, net/https and net/ftp.

NOKOGIRI




Nokogiri (鋸) is an HTML, XML, SAX, and Reader parser


THE POWER 

OF 

NOKOGIRI



IT CAN PARSE 

HTML & XML (DOM, SAX OR READER STYLE)

  • parsing from string
  • parsing from file
  • parsing from internet / web url



PARSING FROM INTERNET



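require 'nokogiri'
require 'open-uri'
# URI.open is required on Ruby 3.0+, where open-uri no longer patches Kernel#open
doc = Nokogiri::HTML(URI.open("http://www.somewebsite.com/"))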


Searching an HTML / XML Document


Searching an HTML/XML document
is like searching the DOM with jQuery: you use CSS selectors.

characters = doc.css("body .sitcoms div.name")

# returns a Nokogiri::XML::NodeSet containing every <div class="name">
# found inside an element with class "sitcoms"



THAT'S IT
VERY EASY, RIGHT?

THANKS
