TEXT AND MULTIMEDIA SEARCH ENGINE (CLI TOOL)

shashankaryan

adarsh2k

SHASHANK ARYAN

ADARSH KUMAR

Search Engine

  • A software program or script that searches documents and files for keywords and returns the files that contain them.
  • Archie, used to find files stored on anonymous FTP sites, is generally considered the first search engine.
  • Veronica, a search tool for Gopher, is generally considered the first text-based search engine.

Web Search Engine

  • Software designed to search for information (web pages, images, and other types of files) on the World Wide Web.
  • It uses crawlers to visit pages and build a huge index, then compares each incoming search request against the index entries and returns the matching results.
  • E.g. Google, DuckDuckGo, Bing.

Search files/pattern in FS

1. find 

2. locate

3. grep 

4. find + xargs + grep
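
For comparison with the tool presented later, here is a minimal Python sketch of roughly what the `find` + `xargs` + `grep` combination does: walk a directory tree, select files by name pattern, and report the files that contain a word. The function names `find_files` and `grep_files` are hypothetical and chosen only for this illustration.

```python
import fnmatch
import os


def find_files(root, pattern="*.txt"):
    """Yield paths under root whose file name matches a shell-style pattern."""
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in fnmatch.filter(filenames, pattern):
            yield os.path.join(dirpath, name)


def grep_files(paths, word):
    """Return the paths whose contents contain word (similar to grep -l)."""
    matches = []
    for path in paths:
        try:
            with open(path, encoding="utf-8", errors="ignore") as handle:
                if any(word in line for line in handle):
                    matches.append(path)
        except OSError:
            continue  # unreadable file: skip it
    return matches
```

Like the shell commands, this rescans every file on every query, which is the cost an index is meant to avoid.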

Search Engine Programs

  • Whoosh: A fast, featureful full-text indexing and searching library implemented in pure Python.
  • Xapian: An open-source search engine library that lets developers add advanced indexing and search facilities to their own applications.

Text/Multimedia search engine for user content

  • A search engine for text, image, and audio files.
  • It searches image files according to date, month, year, locality, city, state, country, and postal code.
  • It searches text files for the given search word and returns the path of each matching file along with the number of occurrences of the word.
  • It searches audio files according to the artist, album, genre and year.

Text Files

  • Yielding all text files from the given directory.
  • Creating an index of the files along with the number of occurrences of each word.
  • Saving the index.
  • Looking up the search word in the saved index.
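
The steps above can be sketched as a small inverted index. This is a minimal illustration, not the tool's actual code: the function names, the `.txt` extension filter, and the `\w+` tokenizer are all assumptions.

```python
import os
import re
from collections import Counter, defaultdict


def iter_text_files(root, extensions=(".txt",)):
    """Yield text files under root (extensions must be a tuple)."""
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            if name.lower().endswith(extensions):
                yield os.path.join(dirpath, name)


def build_index(paths):
    """Map each lowercased word to {path: number of occurrences}."""
    index = defaultdict(dict)
    for path in paths:
        with open(path, encoding="utf-8", errors="ignore") as handle:
            counts = Counter(re.findall(r"\w+", handle.read().lower()))
        for word, count in counts.items():
            index[word][path] = count
    return index


def search_text(index, word):
    """Return (path, count) pairs for word, most occurrences first."""
    hits = index.get(word.lower(), {})
    return sorted(hits.items(), key=lambda item: (-item[1], item[0]))
```

Once the index is built, a lookup is a single dictionary access instead of a scan over every file.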

Audio Files

  • Yielding files ('.mp3', '.ogg', '.wav', '.flac', '.wma') from given directory.
  • Extracting metadata using pyexiftool from files.
  • Saving required metadata -- (file_type, file_size, artist, album, genre, year) 
  • Index metadata along with their respective files and save them.
  • Look for given artist, album, genre, year in saved files.
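
A minimal sketch of the indexing and lookup steps, assuming the metadata has already been extracted into plain dictionaries keyed by file path. The extraction call itself is left out because its exact form depends on the pyexiftool version. The names `index_metadata` and `search_audio` are hypothetical.

```python
from collections import defaultdict

FIELDS = ("artist", "album", "genre", "year")


def normalise(value):
    """Compare values case-insensitively and ignore surrounding spaces."""
    return str(value).strip().lower()


def index_metadata(records):
    """records: {path: {field: value}} -> {field: {value: set of paths}}."""
    index = {field: defaultdict(set) for field in FIELDS}
    for path, meta in records.items():
        for field in FIELDS:
            value = meta.get(field)
            if value is not None:
                index[field][normalise(value)].add(path)
    return index


def search_audio(index, **criteria):
    """Return the sorted paths that match every given criterion."""
    results = None
    for field, wanted in criteria.items():
        paths = index.get(field, {}).get(normalise(wanted), set())
        results = set(paths) if results is None else results & paths
    return sorted(results or [])
```

For example, `search_audio(index, artist="Queen", year=1975)` returns only the files that match both fields.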

Image Files

  • Yielding a list of files ('.png', '.tif', '.jpg', '.gif', '.JPEG') from given directory.
  • Extracting metadata using pyexiftool from files.
  • Saving required metadata -- (created-date, g_p_s_latitude, g_p_s_longitude)
  • Index metadata along with their respective files and save them.
  • Look for given date, month, year, locality, city, state, country, and postal code in saved files.
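
Searching by date, month, or year first requires splitting the creation timestamp. The sketch below assumes the EXIF-style format `YYYY:MM:DD HH:MM:SS` (optionally followed by a time zone or fractional seconds, which are ignored). The function names are hypothetical. Resolving GPS coordinates to a locality needs a reverse-geocoding service and is not shown.

```python
from datetime import datetime


def split_create_date(value):
    """Return (day, month, year) from an EXIF-style timestamp string."""
    stamp = datetime.strptime(value[:19], "%Y:%m:%d %H:%M:%S")
    return stamp.day, stamp.month, stamp.year


def matches_date(value, day=None, month=None, year=None):
    """True if the timestamp agrees with every criterion that was given."""
    try:
        actual = split_create_date(value)
    except (TypeError, ValueError):
        return False  # missing or malformed timestamp never matches
    wanted = (day, month, year)
    return all(w is None or w == a for w, a in zip(wanted, actual))
```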

Major Challenges Faced

  • Extracting date, month and year  from create_date metadata
  • Extracting village, city, state, country and postal code from gps_latitude and gps_longitude.
  • Searching for all keywords.
  • Exception handling  

Achieved efficiency of searching

  • Cleaning data
  • Using Python's collections module.
  • Index filenames are a hashlib digest of the directory path, to prevent duplicate indexes.
  • Saving & loading state of indexes in files using cPickle
  • Comparing last modified date-time of files and directories
  • Utility module
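
A minimal sketch of how the hashed filename, the pickled index, and the modification-time check can fit together. It uses `pickle`, which in Python 3 replaces Python 2's `cPickle`. The SHA-1 digest, the `.pkl` suffix, and all function names are assumptions, and the cache directory is assumed to live outside the indexed directory.

```python
import hashlib
import os
import pickle


def index_path(directory, cache_dir):
    """Derive a stable index filename from the directory's absolute path."""
    key = os.path.abspath(directory).encode("utf-8")
    return os.path.join(cache_dir, hashlib.sha1(key).hexdigest() + ".pkl")


def latest_mtime(directory):
    """Newest modification time of the directory and everything below it."""
    newest = os.path.getmtime(directory)
    for dirpath, dirnames, filenames in os.walk(directory):
        for name in dirnames + filenames:
            newest = max(newest, os.path.getmtime(os.path.join(dirpath, name)))
    return newest


def save_index(index, directory, cache_dir):
    """Pickle the index under its hashed filename and return that path."""
    os.makedirs(cache_dir, exist_ok=True)
    path = index_path(directory, cache_dir)
    with open(path, "wb") as handle:
        pickle.dump(index, handle, protocol=pickle.HIGHEST_PROTOCOL)
    return path


def load_index(directory, cache_dir):
    """Return the saved index, or None if it is missing or out of date."""
    path = index_path(directory, cache_dir)
    if not os.path.exists(path):
        return None
    if latest_mtime(directory) > os.path.getmtime(path):
        return None  # something changed after the index was written
    with open(path, "rb") as handle:
        return pickle.load(handle)
```

A caller rebuilds the index only when `load_index` returns `None`, so repeated searches over an unchanged directory skip the indexing step.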

{DEMO/HANDS-ON}

Making it more efficient

  • Concurrent processing of files
  • Using Cython
  • Using faster data structures such as DataFrames
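
As an illustration of the first point, here is a minimal sketch of counting words in several files concurrently with the standard library's `concurrent.futures`. A thread pool is assumed because reading files is mostly I/O-bound; a `ProcessPoolExecutor` may suit CPU-heavy parsing better. The function names are hypothetical.

```python
import re
from collections import Counter
from concurrent.futures import ThreadPoolExecutor


def count_words(path):
    """Return (path, Counter of lowercased words) for one file."""
    with open(path, encoding="utf-8", errors="ignore") as handle:
        return path, Counter(re.findall(r"\w+", handle.read().lower()))


def count_words_concurrently(paths, max_workers=4):
    """Process the files in a thread pool and collect {path: Counter}."""
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        return dict(pool.map(count_words, paths))
```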

Heartfelt thanks to our mentors, parents, and friends for their immense support and patience.

Thank You for listening!

Contact Us:

aryanshashank31@gmail.com

shashankaryan

adarsh2k

adarshkumar2k@gmail.com 
