The DAAR Search Engine
An Internet and Web Systems Project
Razzi Abuissa
Project Requirements
- As a team of 4, build a search engine
- Run on AWS
- Respect robots.txt
- HTML UI
Crawler
- Master-worker model
- Worker gets URL from SQS
- Extracts links
- Saves text to S3
- Master validates new URLs against a robots.txt cache
documentId | title | url |
---|---|---|
1 | Germany - Wikipedia, the free | https://en.wikipedia... |
2 | Movies \| The Guardian | http://www.theguardi.. |
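The master's robots.txt check from the crawler list above could be sketched roughly as follows. This is a minimal illustration, not the project's actual code: the `RobotsCache` class, the `validate_new_urls` helper, the `USER_AGENT` string, and the injected `fetch_robots_txt` callable are all hypothetical names, and the SQS and S3 steps are left out.

```python
from urllib.parse import urlsplit
from urllib.robotparser import RobotFileParser

USER_AGENT = "daar-crawler"  # hypothetical user-agent string


class RobotsCache:
    """Keeps one parsed robots.txt per origin so each site is fetched once."""

    def __init__(self, fetch_robots_txt):
        # fetch_robots_txt is an assumed callable: origin -> robots.txt text
        self._fetch = fetch_robots_txt
        self._parsers = {}

    def allowed(self, url):
        parts = urlsplit(url)
        origin = f"{parts.scheme}://{parts.netloc}"
        parser = self._parsers.get(origin)
        if parser is None:
            # Cache miss: fetch and parse this origin's rules once
            parser = RobotFileParser()
            parser.parse(self._fetch(origin).splitlines())
            self._parsers[origin] = parser
        return parser.can_fetch(USER_AGENT, url)


def validate_new_urls(urls, cache, seen):
    """Return URLs that are new and permitted; records them in `seen`."""
    accepted = []
    for url in urls:
        if url not in seen and cache.allowed(url):
            seen.add(url)
            accepted.append(url)
    return accepted
```

Passing the fetcher in as a callable keeps the sketch testable without network access; a real master would presumably download `/robots.txt` over HTTP and enqueue the accepted URLs for the workers.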
Indexer
<h2>Climate</h2>
<p>Most of Germany has a temperate seasonal climate</p>
-> Climate Most of Germany has a temperate seasonal climate
-> Climate Germany temperate seasonal climate
-> climat germani temper season climat
documentId | wordId | count |
---|---|---|
1 | 1 | 2 |
1 | 2 | 1 |
wordId | word |
---|---|
1 | climat |
2 | germani |
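The indexing steps above (strip markup, drop stop words, count terms) might look something like the sketch below. The stop-word list is a tiny hypothetical one chosen to cover the example sentence, and stemming is left as a pluggable `stem` argument that defaults to the identity function; the stems on the slide look like Porter-stemmer output, which this sketch does not implement.

```python
import re
from collections import Counter
from html.parser import HTMLParser

# Hypothetical stop-word list, just large enough for the example sentence
STOPWORDS = {"a", "an", "and", "has", "most", "of", "the"}


class TextExtractor(HTMLParser):
    """Collects the text nodes of an HTML document, ignoring the tags."""

    def __init__(self):
        super().__init__()
        self.chunks = []

    def handle_data(self, data):
        self.chunks.append(data)


def extract_text(html):
    parser = TextExtractor()
    parser.feed(html)
    parser.close()
    # Drop whitespace-only chunks and join the rest with single spaces
    return " ".join(c.strip() for c in parser.chunks if c.strip())


def tokenize(text):
    """Lowercase, split into alphanumeric words, and remove stop words."""
    words = re.findall(r"[a-z0-9]+", text.lower())
    return [w for w in words if w not in STOPWORDS]


def count_terms(html, stem=lambda word: word):
    """Per-document term counts; `stem` is where a real stemmer would plug in."""
    return Counter(stem(w) for w in tokenize(extract_text(html)))
```

With a real stemmer supplied as `stem`, the resulting counts would correspond to rows of the documentId, wordId, count table, once each stemmed word is mapped to its wordId.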
PageRank
Iterative MapReduce implementation
documentId: 1, links: [2, 3], pageRank: 1
documentId: 2, links: [], pageRank: 1
documentId: 3, links: [2], pageRank: 1
->
documentId: 1, links: [2, 3], pageRank: .42
documentId: 2, links: [], pageRank: 1.71
documentId: 3, links: [2], pageRank: .87
->
documentId: 1, links: [2, 3], pageRank: .63
documentId: 2, links: [], pageRank: 1.56
documentId: 3, links: [2], pageRank: .81
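One way to picture an iterative, MapReduce-style PageRank is the single-process sketch below. It is an illustration under assumptions rather than the project's implementation: the damping factor of 0.85 and the choice to spread a link-less page's rank over every page are assumed, so the numbers it produces need not match the figures above exactly. The function names are hypothetical.

```python
from collections import defaultdict

DAMPING = 0.85  # assumed; the slides do not state the damping factor


def map_phase(graph, ranks):
    """Emit (target, contribution) pairs, one per outgoing link."""
    for doc, links in graph.items():
        # Assumption: a page with no links shares its rank with every page
        targets = links if links else list(graph)
        share = ranks[doc] / len(targets)
        for target in targets:
            yield target, share


def reduce_phase(graph, contributions):
    """Sum the contributions per page and apply the damping formula."""
    totals = defaultdict(float)
    for target, share in contributions:
        totals[target] += share
    return {doc: (1 - DAMPING) + DAMPING * totals[doc] for doc in graph}


def pagerank(graph, iterations):
    """Start every page at rank 1 and run the map/reduce round repeatedly."""
    ranks = {doc: 1.0 for doc in graph}
    for _ in range(iterations):
        ranks = reduce_phase(graph, map_phase(graph, ranks))
    return ranks


graph = {1: [2, 3], 2: [], 3: [2]}  # the three-document example above
ranks = pagerank(graph, iterations=2)
```

Each call to `map_phase` plus `reduce_phase` corresponds to one arrow in the example; in a real MapReduce job the two phases would run as distributed tasks and the output of one round would be the input of the next.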
Backend
- Crawler: SQS -> S3
- Indexer: S3 -> RDS
- PageRank: Update RDS


Querying the Database
Ranking needs each document's total word count so that longer documents are not favored just for containing more words
create materialized view total_word_counts
(documentId, totalCount) as
select documents.id, sum(counts.count)
from documents inner join counts on documents.id = counts.documentId
group by documents.id;
create index doc_idx on counts (documentid);
Index for faster queries
Query interface
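-- url, title, and pageRank are assumed to be columns of documents,
-- since the view above only exposes documentId and totalCount
select
-- the cast avoids integer division truncating the term frequency to 0
(cast(counts.count as float) / totals.totalcount) + (documents.pageRank / 300)
as rank, documents.url, documents.title
from total_word_counts as totals
inner join documents on documents.id = totals.documentid
inner join counts on totals.documentid = counts.documentid
where counts.wordid = (select id from words where word = ?)
order by rank desc
limit 10 offset ?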
Able to paginate using offset!
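As a rough illustration of how a front end might bind the two `?` parameters, the sketch below turns a page number into an offset and runs a simplified version of the query. It uses an in-memory SQLite database purely as a stand-in for RDS, with a plain `total_word_counts` table that is assumed to carry `url`, `title`, and `pagerank`; the function names and `PAGE_SIZE` constant are hypothetical.

```python
import sqlite3

PAGE_SIZE = 10  # mirrors the "limit 10" in the query


def page_offset(page):
    """Translate a 1-based page number into the offset bound to the query."""
    if page < 1:
        raise ValueError("page numbers start at 1")
    return (page - 1) * PAGE_SIZE


def search(conn, stemmed_word, page=1):
    """Return one page of (rank, url, title) rows for an already-stemmed word."""
    sql = """
        select (1.0 * counts.count / document.totalcount)
               + (document.pagerank / 300.0) as rank,
               document.url, document.title
        from total_word_counts as document
        inner join counts on document.documentid = counts.documentid
        where counts.wordid = (select id from words where word = ?)
        order by rank desc
        limit ? offset ?
    """
    params = (stemmed_word, PAGE_SIZE, page_offset(page))
    return conn.execute(sql, params).fetchall()
```

Binding the word and the offset as parameters, rather than formatting them into the SQL string, keeps user input out of the query text. The query term would need to pass through the same stemming as the indexer before the lookup.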
