Summer Internship Project
At
Technology Pvt. Ltd.
Akshay Pratap Singh (01)
Sunny Kumar (43)
Web API
The application's web API conforms to the REST standard and has two main endpoints: one takes as input a request for a website to be crawled, and the other returns the result when queried with the job ID. The web API accepts only HTTP JSON requests and responds with a JSON object as output.
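A minimal sketch of the two endpoints, assuming a Flask app; the route paths and the in-memory job store are hypothetical placeholders, not the project's actual code:

    import uuid
    from flask import Flask, request, jsonify

    app = Flask(__name__)
    JOBS = {}  # in-memory stand-in for the job store / database

    def enqueue_crawl(payload):
        """Hypothetical: register the request and hand it to the worker queue."""
        job_id = str(uuid.uuid4())
        JOBS[job_id] = None  # a worker would fill the result in later
        return job_id

    @app.route("/jobs", methods=["POST"])
    def submit_job():
        # Expected JSON body: {"url": ..., "keywords": ..., "email": ...}
        payload = request.get_json()
        return jsonify({"job_id": enqueue_crawl(payload)}), 202

    @app.route("/jobs/<job_id>", methods=["GET"])
    def job_result(job_id):
        result = JOBS.get(job_id)
        if result is None:
            return jsonify({"status": "pending"})
        return jsonify({"status": "done", "result": result})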
The crawler module is the heart of the application. It performs several vital processes: dispatching the set of websites from the database to the worker queue, crawling each webpage popped from the worker queue, storing the extracted data in the database, and mailing the result link back to the user.
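A minimal, self-contained sketch of the crawl step, written with synchronous requests for brevity (the actual module uses GRequests and RQ workers, described below); the crawl_site function and its limits are illustrative, not the project's code:

    import requests
    from bs4 import BeautifulSoup
    from urllib.parse import urljoin

    def crawl_site(seed_url, max_pages=50):
        """Walk a site from seed_url and collect links that return errors."""
        to_visit, seen, broken = [seed_url], set(), []
        while to_visit and len(seen) < max_pages:
            url = to_visit.pop()
            if url in seen:
                continue
            seen.add(url)
            try:
                resp = requests.get(url, timeout=5)
            except requests.RequestException:
                broken.append(url)  # unreachable host, timeout, etc.
                continue
            if resp.status_code >= 400:
                broken.append(url)  # dead link (404, 500, ...)
                continue
            soup = BeautifulSoup(resp.text, "html.parser")
            for a in soup.find_all("a", href=True):
                to_visit.append(urljoin(url, a["href"]))
        return broken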
Several Python packages are used to extract and manipulate the web pages.
GRequests allows you to use Requests with Gevent to make asynchronous HTTP requests easily.
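For example, a batch of link checks might be fired concurrently like this (a sketch; the URL list is illustrative):

    import grequests  # import before requests: gevent monkey-patching

    urls = ["https://example.com/a", "https://example.com/b"]  # illustrative
    pending = (grequests.get(u, timeout=5) for u in urls)
    responses = grequests.map(pending)  # issue all requests concurrently

    for url, resp in zip(urls, responses):
        if resp is None or resp.status_code >= 400:
            print("broken:", url)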
Beautiful Soup sits atop an HTML or XML parser, providing Pythonic idioms for iterating,
searching, and modifying the parse tree.
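For instance, collecting every hyperlink on a fetched page might look like this sketch:

    from bs4 import BeautifulSoup

    html = "<html><body><a href='/about'>About</a></body></html>"  # stand-in page
    soup = BeautifulSoup(html, "html.parser")

    # Collect the href of every anchor tag on the page.
    links = [a["href"] for a in soup.find_all("a", href=True)]
    print(links)  # ['/about']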
NLTK is a leading platform for building Python programs to work with human language data.
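In this project NLTK plausibly backs the keyword search over page text; a minimal sketch, assuming the 'punkt' tokenizer data has been downloaded:

    from nltk.tokenize import word_tokenize

    # import nltk; nltk.download("punkt")  # one-time download of tokenizer data
    text = "Page text surrounding a dead link."  # stand-in page content
    tokens = {t.lower() for t in word_tokenize(text)}
    keywords = {"dead", "link"}  # illustrative user-supplied keywords
    print(keywords & tokens)  # keywords found on this page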
SQLAlchemy is the Python SQL toolkit and Object Relational Mapper that gives application developers the full power and flexibility of SQL.
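A sketch of how a broken link might be mapped to a table, assuming SQLAlchemy 1.4+; the model and columns are hypothetical, not the project's actual schema:

    from sqlalchemy import create_engine, Column, Integer, String
    from sqlalchemy.orm import declarative_base, Session

    Base = declarative_base()

    class BrokenLink(Base):
        """Hypothetical table recording one broken link found during a crawl."""
        __tablename__ = "broken_links"
        id = Column(Integer, primary_key=True)
        job_id = Column(String)
        url = Column(String)
        status_code = Column(Integer)

    engine = create_engine("sqlite:///crawler.db")  # illustrative database URL
    Base.metadata.create_all(engine)

    with Session(engine) as session:
        session.add(BrokenLink(job_id="job-1", url="https://example.com/x",
                               status_code=404))
        session.commit()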
RQ (Redis Queue) is a simple Python library for queueing jobs and processing them in the background with workers. The application uses two Redis workers.
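Queueing a crawl job with RQ might look like this sketch (crawl_site is the illustrative function from the crawler-module section above, assumed importable as tasks.crawl_site):

    from redis import Redis
    from rq import Queue

    # Connect to the local Redis server and enqueue a crawl job.
    queue = Queue(connection=Redis())
    job = queue.enqueue("tasks.crawl_site", "https://example.com")
    print(job.id)  # the job id later used to poll the result endpoint

    # In two separate shells, two workers drain the queue:
    #   $ rq worker
    #   $ rq worker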
Logbook is based on the concept of loggers that are extensible by the application. Each logger and handler, as well as other parts of the system, may inject additional information into the logging record that improves the usefulness of log entries.
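A minimal Logbook sketch:

    import sys
    from logbook import Logger, StreamHandler

    # Route log records to stdout and emit one from a named logger.
    StreamHandler(sys.stdout).push_application()
    log = Logger("crawler")
    log.info("Crawled {} pages", 42)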
The smtplib module defines an SMTP client session object that can be used to send mail to any Internet machine with an SMTP listener daemon.
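Mailing the result link back might look like this sketch; the SMTP host, addresses, and result URL are placeholders:

    import smtplib
    from email.message import EmailMessage

    msg = EmailMessage()
    msg["Subject"] = "Your crawl results are ready"
    msg["From"] = "crawler@example.com"   # placeholder sender
    msg["To"] = "user@example.com"        # placeholder recipient
    msg.set_content("Results: https://example.com/results/job-1")  # illustrative link

    with smtplib.SMTP("localhost") as smtp:  # assumes a local SMTP daemon
        smtp.send_message(msg)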
To make the application's user interface more interactive, the AngularJS front-end framework is used. HTML is great for declaring static documents, but it falters when we try to use it for declaring dynamic views in web applications. AngularJS lets you extend HTML vocabulary for your application.
The application UI takes input in three stages:
● Target Website URL : a valid hyperlink to the site to be crawled.
● Keywords : keywords to be searched on the pages that contain dead links.
● User Mail : the user's email address, to which the result link is mailed once crawling is done.
Input field for the seed URL of the website to be crawled
Confirm Details and Submit Request Page
Rotto Web Crawler can be widely used in the web industry to check links and content. Many companies run content-heavy websites, such as news, blogging, educational, and government sites, and daily add a large number of pages and hyperlinks pointing to internal pages or to other websites.
An application like this is very useful for finding broken links on such websites, helping the site administrator maintain the content with fewer flaws.
The application's keyword-search service helps the site owner find the articles around which links are broken, making it easier to keep pages on a specific topic error-free.
This crawler enhances the overall user experience and the robustness of the web platform.
Source code of project