The DAAR Search Engine

An Internet and Web Systems Project

 

Razzi Abuissa

Project Requirements

  • As a team of 4, build a search engine
  • Run on AWS
  • Respect robots.txt
  • HTML UI

Crawler

  • Master-worker model
  • Worker gets URL from SQS
    • Extracts links
    • Saves text to S3
  • Master validates new URLs against a robots.txt cache

documentId  title                           url
1           Germany - Wikipedia, the free   https://en.wikipedia...
2           Movies | The Guardian           http://www.theguardi..
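
The master's robots.txt check could look roughly like the sketch below. It is not the project's code: `RobotsCache`, the `fetch_robots_txt` callable, and the `daar-crawler` user-agent string are hypothetical names chosen for illustration. Parsing relies on Python's standard `urllib.robotparser`.

```python
from urllib.parse import urlsplit
from urllib.robotparser import RobotFileParser


class RobotsCache:
    """Caches one parsed robots.txt per host (hypothetical helper)."""

    def __init__(self, fetch_robots_txt, user_agent="daar-crawler"):
        # fetch_robots_txt: callable taking a robots.txt URL and returning
        # its body as text. Injected so the sketch needs no network access.
        self._fetch = fetch_robots_txt
        self._agent = user_agent
        self._parsers = {}

    def allowed(self, url):
        parts = urlsplit(url)
        host = f"{parts.scheme}://{parts.netloc}"
        parser = self._parsers.get(host)
        if parser is None:
            # First URL seen for this host: download and parse its rules once.
            parser = RobotFileParser()
            parser.parse(self._fetch(host + "/robots.txt").splitlines())
            self._parsers[host] = parser
        return parser.can_fetch(self._agent, url)
```

Only URLs for which `allowed` returns a true value would be put back on the queue; later URLs from the same host reuse the cached rules.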

Indexer

 

<h2>Climate</h2>
<p>Most of Germany has a temperate seasonal climate</p>

-> Climate Most of Germany has a temperate seasonal climate

-> Climate Germany temperate seasonal climate

-> climat germani temper season climat

documentId  wordId  count
1           1       2
1           2       1

wordId  word
1       climat
2       germani
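
The indexing steps above (strip markup, drop stop words, stem, count) can be sketched in a few lines. This is an illustration under assumptions: the stop-word set is a tiny made-up sample, the function names are hypothetical, and the default "stemmer" only lowercases. A real Porter stemmer would have to be passed in as `stem` to get stems such as `climat` and `germani`.

```python
from collections import Counter
from html.parser import HTMLParser

# Illustrative sample only; a real indexer would use a full stop-word list.
STOP_WORDS = {"most", "of", "has", "a", "the", "and", "in", "to"}


class _TextExtractor(HTMLParser):
    """Collects the text between tags and ignores the tags themselves."""

    def __init__(self):
        super().__init__()
        self.chunks = []

    def handle_data(self, data):
        self.chunks.append(data)


def html_to_text(html):
    extractor = _TextExtractor()
    extractor.feed(html)
    extractor.close()
    # Collapse all runs of whitespace into single spaces.
    return " ".join(" ".join(extractor.chunks).split())


def index_terms(html, stem=str.lower):
    """Return a Counter mapping each stemmed, non-stop word to its count."""
    words = html_to_text(html).split()
    kept = [w for w in words if w.lower() not in STOP_WORDS]
    return Counter(stem(w) for w in kept)
```

Each (term, count) pair would become one row of the counts table, with the term replaced by its wordId. Punctuation handling is left out to keep the sketch short.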

PageRank

Iterative MapReduce implementation

 

documentId: 1, links: [2, 3], pageRank: 1
documentId: 2, links: [],     pageRank: 1
documentId: 3, links: [2],    pageRank: 1

->

documentId: 1, links: [2, 3], pageRank: .42
documentId: 2, links: [],     pageRank: 1.71
documentId: 3, links: [2],    pageRank: .87

->

documentId: 1, links: [2, 3], pageRank: .63
documentId: 2, links: [],     pageRank: 1.56
documentId: 3, links: [2],    pageRank: .81
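
One iteration of this kind can be sketched as a map step that emits rank contributions and a reduce step that sums them. The sketch makes two assumptions the slides do not state: a damping factor of 0.85, and that a page with no links spreads its rank evenly over all pages. With those choices the first iteration gives roughly 0.43, 1.71, and 0.86, close to the figures above but not identical, so the original implementation may have differed in detail.

```python
from collections import defaultdict

DAMPING = 0.85  # assumed value, not given in the slides


def map_phase(pages):
    """Yield (target documentId, rank contribution) pairs for every page."""
    all_ids = list(pages)
    for page in pages.values():
        # Assumption: a page without links shares its rank with every page.
        targets = page["links"] or all_ids
        share = page["pageRank"] / len(targets)
        for target in targets:
            yield target, share


def reduce_phase(pages, contributions):
    """Sum the contributions per document and apply the damping factor."""
    totals = defaultdict(float)
    for target, share in contributions:
        totals[target] += share
    return {
        doc_id: {
            "links": page["links"],
            "pageRank": (1 - DAMPING) + DAMPING * totals[doc_id],
        }
        for doc_id, page in pages.items()
    }


def iterate(pages, rounds=1):
    """Run the map and reduce steps the given number of times."""
    for _ in range(rounds):
        pages = reduce_phase(pages, map_phase(pages))
    return pages
```

Because dangling rank is redistributed, the ranks keep summing to the number of documents after every round, as they do in the example above.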

Backend

  • Crawler: SQS -> S3
  • Indexer: S3 -> RDS
  • PageRank: Update RDS
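
The first stage, Crawler: SQS -> S3, can be illustrated with local stand-ins. In this sketch a `queue.Queue` plays the role of SQS and a plain dict plays the role of S3; `worker_step` and the `fetch` callable are hypothetical names, and no AWS calls are made.

```python
import queue
from html.parser import HTMLParser
from urllib.parse import urljoin


class _PageParser(HTMLParser):
    """Collects href targets of <a> tags and the page's text."""

    def __init__(self):
        super().__init__()
        self.links = []
        self.text = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            href = dict(attrs).get("href")
            if href:
                self.links.append(href)

    def handle_data(self, data):
        self.text.append(data)


def worker_step(url_queue, store, fetch):
    """Take one URL from the queue, save its text, return the links found.

    url_queue stands in for SQS, store for S3, and fetch(url) returns HTML.
    """
    url = url_queue.get()
    parser = _PageParser()
    parser.feed(fetch(url))
    parser.close()
    store[url] = " ".join(" ".join(parser.text).split())
    # Resolve relative links against the page URL before handing them back.
    return [urljoin(url, href) for href in parser.links]
```

The returned links are what the master would validate against robots.txt before queueing them again.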

Querying the Database

 

Each document's total word count is needed to normalize term counts; otherwise longer documents would rank higher simply because they contain more words.

 

 

 

create materialized view total_word_counts
    (documentId, totalCount) as
    select documents.id, sum(counts.count)
    from documents inner join counts on documents.id = counts.documentId
    group by documents.id;
create index doc_idx on counts (documentid);

Index for faster queries

Query interface

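-- Assumes url, title and pageRank are columns of documents; the view
-- total_word_counts only carries documentId and totalCount.
select
    -- The cast guards against integer division if count is an integer column.
    (counts.count::float / totals.totalCount) + (documents.pageRank / 300)
        as rank,
    documents.url, documents.title
from total_word_counts as totals
inner join documents on documents.id = totals.documentId
inner join counts on counts.documentId = totals.documentId
where counts.wordId = (select id from words where word = ?)
order by rank desc
limit 10 offset ?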

Able to paginate using offset!
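
The ranking and pagination can be tried out locally with a small stand-in database. The sketch below uses SQLite from Python's standard library rather than RDS. The sample rows are made up, an ordinary view replaces the materialized view (SQLite has none), and it assumes that url, title, and pageRank live in the documents table. `build_demo_db` and `search` are hypothetical names.

```python
import sqlite3


def build_demo_db():
    """Create an in-memory database with made-up sample rows."""
    db = sqlite3.connect(":memory:")
    db.executescript("""
        create table documents (id integer primary key, title text,
                                url text, pageRank real);
        create table words (id integer primary key, word text unique);
        create table counts (documentId integer, wordId integer, count integer);
        insert into documents values
            (1, 'Doc one', 'http://example.com/1', 0.63),
            (2, 'Doc two', 'http://example.com/2', 1.56),
            (3, 'Doc three', 'http://example.com/3', 0.81);
        insert into words values (1, 'climat'), (2, 'germani');
        insert into counts values
            (1, 1, 2), (1, 2, 1), (2, 1, 1), (2, 2, 9), (3, 1, 1);
        -- Plain view standing in for the materialized view.
        create view total_word_counts as
            select documentId, sum(count) as totalCount
            from counts group by documentId;
    """)
    return db


SEARCH_SQL = """
    select (counts.count * 1.0 / totals.totalCount)
           + (documents.pageRank / 300) as rank,
           documents.url, documents.title
    from total_word_counts as totals
    inner join documents on documents.id = totals.documentId
    inner join counts on counts.documentId = totals.documentId
    where counts.wordId = (select id from words where word = ?)
    order by rank desc
    limit ? offset ?
"""


def search(db, stem, page=0, page_size=10):
    """Return (url, title) pairs for one page of results, best match first."""
    rows = db.execute(SEARCH_SQL, (stem, page_size, page * page_size))
    return [(url, title) for _rank, url, title in rows]
```

Page n of the results is simply `offset = n * page_size`. In the sample data, document 3 ranks first for `climat` because that word makes up its entire text, which shows the effect of dividing by document length.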
