Rotto Link Web Crawler
Summer Internship Project
At
Technology Pvt. Ltd.
Akshay Pratap Singh (01)
Sunny Kumar (43)
Web Crawler
A Web crawler starts with a list of URLs to visit, called the seeds. As the crawler visits
these URLs, it identifies all the hyperlinks in the page and adds them to the list of URLs
to visit, called the crawl frontier. URLs from the frontier are recursively visited according
to a set of policies.
Rotto Link Web Crawler
Rotto (Italian for "broken") Link Web Crawler extracts the broken (i.e. dead) links within a complete website. The application takes a seed URL, the URL of the website to be crawled, visits every page of that website, and searches for broken (dead) links. As the crawler visits each page, it identifies all the hyperlinks on it and splits them into two groups: internal links (pointing within the same site) and external links (pointing to other websites).
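As a rough illustration (not the project's exact code), the internal/external split can be made by comparing each link's host with the seed URL's host:

    from urllib.parse import urljoin, urlparse

    def classify_links(seed_url, hrefs):
        """Split extracted hrefs into internal and external links (illustrative sketch)."""
        seed_host = urlparse(seed_url).netloc
        internal, external = [], []
        for href in hrefs:
            absolute = urljoin(seed_url, href)   # resolve relative links against the seed URL
            if urlparse(absolute).netloc == seed_host:
                internal.append(absolute)
            else:
                external.append(absolute)
        return internal, external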
Crawler Backend
The back end of the application is built on Flask, a Python microframework. It consists of two parts: a REST web API and the crawler modules.
- The web API acts as the interface between the front end and the back end of the application.
- The crawler modules carry out the core work: dispatching, scraping, storing data, and mailing results.
Web API
The application's web API conforms to REST conventions and exposes two main endpoints: one accepts a request for a website to be crawled, and the other returns the result for a given job id. The API accepts only HTTP JSON requests and responds with a JSON object (a sketch follows the endpoint list below).
- /api/v1.0/crawl/
- /api/v1.0/crawl/<website/jobid>
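A minimal sketch of how these two endpoints might be wired up in Flask; the handler bodies, field names, and the enqueue_job/fetch_result helpers are illustrative assumptions, not the project's actual code:

    from flask import Flask, request, jsonify

    app = Flask(__name__)

    @app.route('/api/v1.0/crawl/', methods=['POST'])
    def submit_crawl():
        # Accept a JSON request describing the crawl job and return a job id.
        payload = request.get_json()
        job_id = enqueue_job(payload['url'],            # hypothetical helper
                             payload.get('keywords', []),
                             payload.get('email'))
        return jsonify({'jobid': job_id}), 202

    @app.route('/api/v1.0/crawl/<jobid>', methods=['GET'])
    def crawl_result(jobid):
        # Return the stored result (broken links found) for the given job id.
        return jsonify(fetch_result(jobid))             # hypothetical helper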
Crawler Module
- GRequests
- BeautifulSoup
- NLTK
- SQLAlchemy
- Redis (RQ)
- Logbook
- smtplib (SMTP)
The crawler module is the heart of the application. It performs several vital processes: dispatching sets of websites from the database to the worker queue, crawling each page popped from the worker queue, storing data in the database, and mailing the result link back to the user.
Several Python packages are used for fetching, extracting, and manipulating web pages.
GRequests allows you to use Requests with Gevent to make asynchronous HTTP
Requests easily.
GRequests
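For example, a batch of pages could be fetched concurrently roughly like this (illustrative usage only):

    import grequests

    urls = ['http://example.com/page1', 'http://example.com/page2']
    pending = (grequests.get(u, timeout=10) for u in urls)
    responses = grequests.map(pending)    # fires the requests concurrently via gevent
    # requests that failed entirely come back as None; 4xx/5xx responses are dead links
    broken = [r.url for r in responses if r is not None and r.status_code >= 400]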
Beautiful Soup sits atop an HTML or XML parser, providing Pythonic idioms for iterating,
searching, and modifying the parse tree.
BeautifulSoup
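A small sketch of how hyperlinks might be pulled out of a fetched page with BeautifulSoup (assumed usage, not necessarily the project's exact code):

    from bs4 import BeautifulSoup

    def extract_links(html):
        """Return all href values found in anchor tags of the page."""
        soup = BeautifulSoup(html, 'html.parser')
        return [a['href'] for a in soup.find_all('a', href=True)]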
NLTK is a leading platform for building Python programs to work with human language data.
NLTK
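In this application NLTK could be used, for instance, to tokenize page text so the user's keywords can be matched against it; a rough sketch under that assumption:

    from nltk.tokenize import word_tokenize   # requires the 'punkt' tokenizer data

    def matched_keywords(page_text, keywords):
        """Return which of the given keywords appear in the page text."""
        tokens = {t.lower() for t in word_tokenize(page_text)}
        return [kw for kw in keywords if kw.lower() in tokens]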
SQLAlchemy is the Python SQL toolkit and Object Relational Mapper that gives application developers the full power and flexibility of SQL.
SQLAlchemy
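A possible, purely illustrative SQLAlchemy model for storing the broken links found during a crawl (table and column names are assumptions):

    from sqlalchemy import Column, Integer, String, create_engine
    from sqlalchemy.ext.declarative import declarative_base
    from sqlalchemy.orm import sessionmaker

    Base = declarative_base()

    class BrokenLink(Base):
        """One broken link found on a crawled page."""
        __tablename__ = 'broken_links'
        id = Column(Integer, primary_key=True)
        job_id = Column(String, index=True)   # crawl job this record belongs to
        page_url = Column(String)             # page on which the dead link appears
        dead_url = Column(String)             # the dead link itself

    engine = create_engine('sqlite:///rotto.db')
    Base.metadata.create_all(engine)
    Session = sessionmaker(bind=engine)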
RQ (Redis Queue) is a simple Python library for queueing jobs and processing them in the background with workers. The application uses two RQ workers (a sketch follows the worker list below).
Redis (RQ)
- DISPATCHER: a worker that pops five websites to be crawled from the database and pushes them into the worker queue.
- CRAWLER: a worker that pops a hyperlink from the worker queue and processes the page.
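A rough sketch of how the dispatcher could push jobs onto the worker queue with RQ (the queue name and the crawl_page function are assumptions):

    from redis import Redis
    from rq import Queue

    def crawl_page(url):
        """Worker function: fetch the page and record any dead links (body omitted here)."""
        ...

    queue = Queue('crawl', connection=Redis())

    def dispatch(pending_sites):
        """Dispatcher: push a batch of pending sites onto the worker queue."""
        for site in pending_sites:
            queue.enqueue(crawl_page, site)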
Logbook is based on the concept of loggers that are extensible by the application. Each logger and handler, as well as other parts of the system, may inject additional information into the logging record that improves the usefulness of log entries.
Logbook
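Logbook usage in the crawler could look roughly like this (logger name is an assumption):

    import sys
    from logbook import Logger, StreamHandler

    StreamHandler(sys.stdout).push_application()   # send log records to stdout
    log = Logger('rotto-crawler')
    log.info('Crawling started for http://example.com')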
The smtplib module defines an SMTP client session object that can be used to send mail to any Internet machine with an SMTP listener daemon.
SMTP
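Mailing the result link back to the user with smtplib might look roughly like this (server address and message fields are assumptions):

    import smtplib
    from email.message import EmailMessage

    def mail_result(to_address, result_url):
        """Mail the result link to the user after crawling finishes."""
        msg = EmailMessage()
        msg['Subject'] = 'Rotto crawl finished'
        msg['From'] = 'crawler@example.com'          # assumed sender address
        msg['To'] = to_address
        msg.set_content('Your crawl has finished. Results: ' + result_url)
        with smtplib.SMTP('localhost') as server:    # assumed local SMTP daemon
            server.send_message(msg)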
To make the application's user interface more interactive, the AngularJS front-end framework is used. HTML is great for declaring static documents, but it falters when we try to use it for declaring dynamic views in web applications. AngularJS lets you extend HTML vocabulary for your application.
Crawler Front End
The application UI takes input in 3 stages:
● Target website URL: a valid hyperlink of the site to be crawled.
● Keywords: keywords to be searched on pages that contain dead links.
● User mail: the user's email address, to which the result link is mailed once crawling is done.
User Interface (UI)
1. Input field for the seed URL of the website to be crawled
2. Input field for the set of keywords to be matched
3. Input field for the user's email, to which results are mailed after crawling
Confirm Details and Submit Request Page
The result page shows the list of hyperlinks to pages that contain broken links.
Scope
The Rotto web crawler can be widely used in the web industry to check links and content. Many organizations run content-heavy websites, such as news, blogging, educational, and government sites, and add large numbers of pages and hyperlinks (both internal and external) daily.
An application like this is very useful for finding broken links across such a website, helping the site administrator keep the content free of flaws.
The keyword-search feature helps the site owner find the articles around which links are broken, so that pages on a specific topic can be kept error-free.
This crawler enhances the overall user experience and the robustness of the web platform.
Crawling Done!
;)
Source code of project