Summer Internship Project
At
Technology Pvt. Ltd.
Akshay Pratap Singh (01)
Sunny Kumar (43)
Web API
The application's web API conforms to the REST standard and has two main endpoints: one takes as input a request for a website to be crawled, and the other returns the result when queried with the job ID. The web API accepts only HTTP JSON requests and responds with a JSON object as output.
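A minimal sketch of the two endpoints, assuming a Flask app; the route paths and the in-memory job store are hypothetical placeholders, not the project's actual code:

    import uuid
    from flask import Flask, request, jsonify

    app = Flask(__name__)
    JOBS = {}  # in-memory stand-in for the job store / database

    def enqueue_crawl(payload):
        """Hypothetical: register the request and hand it to the worker queue."""
        job_id = str(uuid.uuid4())
        JOBS[job_id] = None  # a worker would fill the result in later
        return job_id

    @app.route("/jobs", methods=["POST"])
    def submit_job():
        # Expected JSON body: {"url": ..., "keywords": ..., "email": ...}
        payload = request.get_json()
        return jsonify({"job_id": enqueue_crawl(payload)}), 202

    @app.route("/jobs/<job_id>", methods=["GET"])
    def job_result(job_id):
        result = JOBS.get(job_id)
        if result is None:
            return jsonify({"status": "pending"})
        return jsonify({"status": "done", "result": result})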
The crawler module is the heart of the application. It performs several vital processes: dispatching the set of websites from the database to the worker queue, crawling each webpage popped from the worker queue, storing the extracted data in the database, and mailing the result link back to the user.
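A minimal, self-contained sketch of the crawl step, written with synchronous requests for brevity (the actual module uses GRequests and RQ workers, described below); the crawl_site function and its limits are illustrative, not the project's code:

    import requests
    from bs4 import BeautifulSoup
    from urllib.parse import urljoin

    def crawl_site(seed_url, max_pages=50):
        """Walk a site from seed_url and collect links that return errors."""
        to_visit, seen, broken = [seed_url], set(), []
        while to_visit and len(seen) < max_pages:
            url = to_visit.pop()
            if url in seen:
                continue
            seen.add(url)
            try:
                resp = requests.get(url, timeout=5)
            except requests.RequestException:
                broken.append(url)  # unreachable host, timeout, etc.
                continue
            if resp.status_code >= 400:
                broken.append(url)  # dead link (404, 500, ...)
                continue
            soup = BeautifulSoup(resp.text, "html.parser")
            for a in soup.find_all("a", href=True):
                to_visit.append(urljoin(url, a["href"]))
        return broken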
Several Python packages are used to extract and manipulate the web pages.
GRequests allows you to use Requests with Gevent to make asynchronous HTTP requests easily.
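For example, a batch of link checks might be fired concurrently like this (a sketch; the URL list is illustrative):

    import grequests  # import before requests: gevent monkey-patching

    urls = ["https://example.com/a", "https://example.com/b"]  # illustrative
    pending = (grequests.get(u, timeout=5) for u in urls)
    responses = grequests.map(pending)  # issue all requests concurrently

    for url, resp in zip(urls, responses):
        if resp is None or resp.status_code >= 400:
            print("broken:", url)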
Beautiful Soup sits atop an HTML or XML parser, providing Pythonic idioms for iterating,
searching, and modifying the parse tree.
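For instance, collecting every hyperlink on a fetched page might look like this sketch:

    from bs4 import BeautifulSoup

    html = "<html><body><a href='/about'>About</a></body></html>"  # stand-in page
    soup = BeautifulSoup(html, "html.parser")

    # Collect the href of every anchor tag on the page.
    links = [a["href"] for a in soup.find_all("a", href=True)]
    print(links)  # ['/about']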
NLTK is a leading platform for building Python programs to work with human language data.
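In this project NLTK plausibly backs the keyword search over page text; a minimal sketch, assuming the 'punkt' tokenizer data has been downloaded:

    from nltk.tokenize import word_tokenize

    # import nltk; nltk.download("punkt")  # one-time download of tokenizer data
    text = "Page text surrounding a dead link."  # stand-in page content
    tokens = {t.lower() for t in word_tokenize(text)}
    keywords = {"dead", "link"}  # illustrative user-supplied keywords
    print(keywords & tokens)  # keywords found on this page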
SQLAlchemy is the Python SQL toolkit and Object Relational Mapper that gives application developers the full power and flexibility of SQL.
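A sketch of how a broken link might be mapped to a table, assuming SQLAlchemy 1.4+; the model and columns are hypothetical, not the project's actual schema:

    from sqlalchemy import create_engine, Column, Integer, String
    from sqlalchemy.orm import declarative_base, Session

    Base = declarative_base()

    class BrokenLink(Base):
        """Hypothetical table recording one broken link found during a crawl."""
        __tablename__ = "broken_links"
        id = Column(Integer, primary_key=True)
        job_id = Column(String)
        url = Column(String)
        status_code = Column(Integer)

    engine = create_engine("sqlite:///crawler.db")  # illustrative database URL
    Base.metadata.create_all(engine)

    with Session(engine) as session:
        session.add(BrokenLink(job_id="job-1", url="https://example.com/x",
                               status_code=404))
        session.commit()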
RQ (Redis Queue) is a simple Python library for queueing jobs and processing them in the background with workers. The application uses two Redis workers.
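Queueing a crawl job with RQ might look like this sketch (crawl_site is the illustrative function from the crawler-module section above, assumed importable as tasks.crawl_site):

    from redis import Redis
    from rq import Queue

    # Connect to the local Redis server and enqueue a crawl job.
    queue = Queue(connection=Redis())
    job = queue.enqueue("tasks.crawl_site", "https://example.com")
    print(job.id)  # the job id later used to poll the result endpoint

    # In two separate shells, two workers drain the queue:
    #   $ rq worker
    #   $ rq worker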
Logbook is based on the concept of loggers that are extensible by the application. Each logger and handler, as well as other parts of the system, may inject additional information into the logging record that improves the usefulness of log entries.
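A minimal Logbook sketch:

    import sys
    from logbook import Logger, StreamHandler

    # Route log records to stdout and emit one from a named logger.
    StreamHandler(sys.stdout).push_application()
    log = Logger("crawler")
    log.info("Crawled {} pages", 42)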
The smtplib module defines an SMTP client session object that can be used to send mail to any Internet machine with an SMTP listener daemon.
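Mailing the result link back might look like this sketch; the SMTP host, addresses, and result URL are placeholders:

    import smtplib
    from email.message import EmailMessage

    msg = EmailMessage()
    msg["Subject"] = "Your crawl results are ready"
    msg["From"] = "crawler@example.com"   # placeholder sender
    msg["To"] = "user@example.com"        # placeholder recipient
    msg.set_content("Results: https://example.com/results/job-1")  # illustrative link

    with smtplib.SMTP("localhost") as smtp:  # assumes a local SMTP daemon
        smtp.send_message(msg)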
To make the application's user interface more interactive, the AngularJS front-end framework is used. HTML is great for declaring static documents, but it falters when we try to use it for declaring dynamic views in web applications. AngularJS lets you extend HTML vocabulary for your application.
The application UI takes input in three stages:
● Target Website URL : a valid hyperlink to the site to be crawled.
● Keywords : keywords to be searched on the pages that contain dead links.
● User Mail : the user's email address, to which the result link is mailed once crawling is done.
Input field for the seed URL of the website to be crawled
Confirm Details and Submit Request Page
Rotto Web Crawler can be widely used in the web industry to check links and content. Many companies run content-heavy websites, such as news, blogging, educational, and government sites, and daily add a large number of pages and hyperlinks pointing to internal pages or to other websites.
An application like this is very useful for finding broken links on such websites, helping the site administrator maintain the content with fewer flaws.
The application's keyword-search service helps the site owner find the articles around which links are broken, making it easier to keep pages on a specific topic error-free.
This crawler enhances the overall user experience and the robustness of the web platform.
Source code of project