Rotto Link Web Crawler

Summer Internship Project 

At 

 

Technology Pvt. Ltd.

Akshay Pratap Singh (01)

Sunny Kumar (43)

 

Web Crawler

A Web crawler starts with a list of URLs to visit, called the seeds. As the crawler visits
these URLs, it identifies all the hyperlinks in the page and adds them to the list of URLs
to visit, called the crawl frontier. URLs from the frontier are recursively visited according
to a set of policies.

Rotto Link Web Crawler

Rotto (Italian for "broken") Link Web Crawler extracts the broken (i.e. dead) links within a complete website. The application takes a seed URL, the URL of the website to be crawled, visits every page of that website, and searches for broken links. As the crawler visits these URLs, it identifies all the hyperlinks in each web page and splits them into two groups: internal links (which point back to the same site) and external links (which point to outside websites).
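As a rough illustration of that split (a minimal sketch; the function name and comparison rule are assumptions, not the project's actual code), each hyperlink's host can be compared with the host of the seed URL:

```python
from urllib.parse import urljoin, urlparse

def classify_link(seed_url: str, href: str) -> str:
    """Label a hyperlink as 'internal' or 'external' relative to the seed site."""
    absolute = urljoin(seed_url, href)              # resolve relative links against the seed
    seed_host = urlparse(seed_url).netloc.lower()
    link_host = urlparse(absolute).netloc.lower()
    return "internal" if link_host == seed_host else "external"

# classify_link("http://example.com", "/about")              -> "internal"
# classify_link("http://example.com", "http://other.org/x")  -> "external"
```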

Crawler Backend

The back end of the application is built on the Flask Python microframework.

The back end consists of two parts: the REST web API and the crawler modules.

  • The Web API acts as an interface between the front end and the back end of the application.

  • The crawler modules carry out the main work: dispatching, scraping, storing data, and mailing results.

Web API

The application's web API conforms to the REST standard and has two main endpoints: one takes as input the request for a website to be crawled, and the other returns the result when queried with the job id. The Web API accepts only HTTP JSON requests and responds with a JSON object (a minimal sketch of both endpoints follows the list below).

  • /api/v1.0/crawl/
  • /api/v1.0/crawl/<job-id>
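Below is a minimal Flask sketch of these two endpoints, assuming JSON field names such as `url` and `job_id` and an in-memory job store for illustration; the real handlers and persistence layer may differ:

```python
from uuid import uuid4
from flask import Flask, jsonify, request

app = Flask(__name__)
jobs = {}  # hypothetical in-memory store; the real application keeps jobs in a database

@app.route("/api/v1.0/crawl/", methods=["POST"])
def submit_crawl():
    """Accept a JSON request describing the website to crawl and hand back a job id."""
    payload = request.get_json(force=True)
    job_id = str(uuid4())
    jobs[job_id] = {"url": payload.get("url"), "status": "queued"}
    return jsonify({"job_id": job_id}), 202

@app.route("/api/v1.0/crawl/<job_id>", methods=["GET"])
def crawl_result(job_id):
    """Return the crawl result (or current status) for the given job id."""
    job = jobs.get(job_id)
    if job is None:
        return jsonify({"error": "unknown job id"}), 404
    return jsonify(job)
```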

Crawler Module 

  • GRequests
  • BeautifulSoup
  • NLTK
  • SQLAlchemy
  • Redis (RQ)
  • Logbook
  • smtplib

The crawler module is the heart of this application. It performs several vital processes: dispatching a set of websites from the database to the worker queue, crawling web pages popped from the worker queue, storing data in the database, and mailing the result link back to the user.

Several Python packages are used for extracting and manipulating the web pages, as described below.

GRequests

GRequests allows you to use Requests with Gevent to make asynchronous HTTP requests easily.
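As a rough sketch (not the project's exact code), GRequests can check a batch of hyperlinks concurrently and flag the ones that fail or return an error status as broken:

```python
import grequests

def find_broken(urls):
    """Return the urls that look broken: the request failed or returned HTTP status >= 400."""
    pending = (grequests.head(u, timeout=10, allow_redirects=True) for u in urls)
    responses = grequests.map(pending)   # all requests are sent concurrently via gevent
    return [url for url, resp in zip(urls, responses)
            if resp is None or resp.status_code >= 400]

# find_broken(["http://example.com/", "http://example.com/missing-page"])
```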


 
BeautifulSoup

Beautiful Soup sits atop an HTML or XML parser, providing Pythonic idioms for iterating, searching, and modifying the parse tree.
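For example, the hyperlinks of a fetched page can be pulled out of the parse tree like this (a generic sketch, not the application's own parsing code):

```python
from bs4 import BeautifulSoup

html = """<html><body>
  <a href="/about">About</a>
  <a href="http://other.org/article">Article</a>
</body></html>"""

soup = BeautifulSoup(html, "html.parser")
# Collect the href attribute of every anchor tag that actually has one.
hrefs = [a["href"] for a in soup.find_all("a", href=True)]
print(hrefs)   # ['/about', 'http://other.org/article']
```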

 
 
NLTK

NLTK is a leading platform for building Python programs to work with human language data.
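NLTK presumably supports the keyword-search feature; a minimal sketch of matching the user's keywords against a page's text might look like this (the function and matching rule are illustrative assumptions):

```python
import nltk
from nltk.tokenize import word_tokenize

nltk.download("punkt", quiet=True)   # tokenizer model required by word_tokenize

def matching_keywords(page_text, keywords):
    """Return the user keywords that appear among the page's word tokens."""
    tokens = {t.lower() for t in word_tokenize(page_text)}
    return [k for k in keywords if k.lower() in tokens]

# matching_keywords("Flask makes crawling reports easy.", ["flask", "django"]) -> ["flask"]
```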


 
SQLAlchemy

SQLAlchemy is the Python SQL toolkit and Object Relational Mapper that gives application developers the full power and flexibility of SQL.
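As an illustration only, a table for recording broken links could be declared with the SQLAlchemy ORM like this (the schema is an assumption, not the project's actual model):

```python
from sqlalchemy import Column, Integer, String, create_engine
from sqlalchemy.orm import declarative_base, sessionmaker

Base = declarative_base()

class BrokenLink(Base):
    """One broken hyperlink found on a crawled page (illustrative schema)."""
    __tablename__ = "broken_links"
    id = Column(Integer, primary_key=True)
    job_id = Column(String, nullable=False)     # crawl job this row belongs to
    page_url = Column(String, nullable=False)   # page on which the dead link appears
    dead_url = Column(String, nullable=False)   # the hyperlink that is broken

engine = create_engine("sqlite:///rotto.db")    # placeholder database URL
Base.metadata.create_all(engine)
Session = sessionmaker(bind=engine)
```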


 
Redis (RQ)

RQ (Redis Queue) is a simple Python library for queueing jobs and processing them in the background with workers. The application uses two Redis workers:

 

  • DISPATCHER - a worker that pops five websites to be crawled from the database and pushes them into the worker queue.

 

  • CRAWLER - a worker that pops a web hyperlink from the worker queue and processes the page.
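A minimal RQ sketch of this two-worker setup might look like the following (the queue name and job functions are illustrative assumptions):

```python
from redis import Redis
from rq import Queue

redis_conn = Redis()
crawl_queue = Queue("crawl", connection=redis_conn)   # queue name is an assumption

def crawl_page(url):
    """Job run by the CRAWLER worker: fetch the page and record any broken links (stubbed here)."""
    ...

def dispatch(pending_sites):
    """DISPATCHER step: push a small batch of pending sites onto the worker queue."""
    for site in pending_sites[:5]:
        crawl_queue.enqueue(crawl_page, site)

# Each worker is then started from the shell, e.g.:
#   rq worker crawl
```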

Logbook

Logbook is based on the concept of loggers that are extensible by the application. Each logger and handler, as well as other parts of the system, may inject additional information into the logging record that improves the usefulness of log entries.
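A small Logbook sketch (the handler setup and messages are illustrative):

```python
import sys
from logbook import Logger, StreamHandler

StreamHandler(sys.stdout).push_application()   # route log records to stdout
log = Logger("rotto-crawler")

log.info("Crawling started for {}", "http://example.com")
log.warn("Broken link found: {}", "http://example.com/missing")
```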


SMTP

The smtplib module defines an SMTP client session object that can be used to send mail to any Internet machine with an SMTP listener daemon.
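A sketch of mailing the result link back to the user (the SMTP server, addresses, and message text are placeholders):

```python
import smtplib
from email.message import EmailMessage

def mail_result(user_email, result_url):
    """Mail the user a link to the crawl results (placeholder SMTP settings)."""
    msg = EmailMessage()
    msg["Subject"] = "Rotto crawl finished"
    msg["From"] = "crawler@example.com"
    msg["To"] = user_email
    msg.set_content("Your crawl has finished. Results: " + result_url)

    with smtplib.SMTP("localhost") as server:   # assumes a local SMTP listener daemon
        server.send_message(msg)
```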


Crawler Front End

To make the application's user interface more interactive, the AngularJS front-end framework is used. HTML is great for declaring static documents, but it falters when we try to use it for declaring dynamic views in web applications. AngularJS lets you extend HTML vocabulary for your application.

The UI of the application takes input in three stages (a sample request carrying these fields is sketched after the list):
1. Target Website URL: a valid hyperlink of the site to be crawled.
2. Keywords: keywords to be searched on the pages that contain dead links.
3. User Mail: the user's email address, to which the result link is mailed once crawling is done.
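Submitting these three fields to the Web API could look like the following (the field names and local server address are assumptions for illustration):

```python
import requests

payload = {
    "url": "http://example.com",            # target website URL (stage 1)
    "keywords": ["tutorial", "download"],   # keywords to search near dead links (stage 2)
    "email": "user@example.com",            # address that receives the result link (stage 3)
}
resp = requests.post("http://localhost:5000/api/v1.0/crawl/", json=payload)
print(resp.json())   # e.g. {"job_id": "..."}
```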

User Interface (UI)

1. Input field for the seed URL of the website to be crawled.

2. Input field for the set of keywords to be matched.

3. Input field for the user's email, to which the results are mailed after crawling.

Confirm Details and Submit Request Page 

The Result page shows the list of hyperlinks of the pages that contain broken links.

Scope

Rotto Web Crawler can be widely used in the web industry to search for links and content. Many organizations run heavy websites, such as news, blogging, educational, and government sites, and they add large numbers of pages and hyperlinks, pointing both to internal pages and to other websites, every day.

An application like this can be very useful for finding broken links on such websites, and it helps the site admin keep the content with fewer flaws. The application's keyword-search service helps the site owner locate the articles around which links are broken, which makes it easier to keep pages on a specific topic error-free.

 

This crawler enhances the overall user experience and the robustness of the web platform.

 

Crawling Done! 

 

;)
