Text & Multimedia Search Engine

https://cosmologist10.github.io

Shweta Suman

Search engine

  • A software program or script that searches documents and files for keywords and returns the files containing those keywords.
  • Veronica is considered the first text-based search engine; it was used as a search tool for Gopher.
  • Archie is considered the first search engine ever developed; it was used to find files stored on anonymous FTP sites.

Web search engine:

  • Software designed to search for information (web pages, images and other types of files) on the World Wide Web.
  • E.g. Google, DuckDuckGo, Bing.
  • It uses crawlers to go through pages, creates a huge index, receives the search request, compares it to the entries in the index, and returns the results.

Search files/pattern in file system

  1. find
  2. locate
  3. grep
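
As a rough illustration of what find and grep do, the sketch below walks a directory tree, matches filenames against a glob pattern, and scans the matching files for a regular expression. The function names find_files and grep_files are hypothetical helpers written for this example, not part of either tool.

```python
import fnmatch
import os
import re


def find_files(root, pattern="*"):
    """Yield paths under root whose filename matches a glob pattern (similar to `find -name`)."""
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            if fnmatch.fnmatch(name, pattern):
                yield os.path.join(dirpath, name)


def grep_files(paths, regex):
    """Yield (path, line_number, line) for every line matching regex (similar to `grep -n`)."""
    compiled = re.compile(regex)
    for path in paths:
        try:
            with open(path, encoding="utf-8", errors="replace") as handle:
                for number, line in enumerate(handle, start=1):
                    if compiled.search(line):
                        yield path, number, line.rstrip("\n")
        except OSError:
            continue  # unreadable file: skip it and keep going
```

Unlike locate, which answers from a prebuilt database of paths, this sketch scans the file system on every call.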

Search Engine Programs

  • Whoosh: A fast, featureful full-text indexing and searching library implemented in pure Python.
  • Xapian: An open-source search engine library that lets developers add advanced indexing and search facilities to their own applications.

Text/Multimedia search engine for user contents

  • A search engine for text, image, and audio files.
  • It searches audio files according to the artist, album, genre and year.
  • It searches image files according to date, month, year, locality, city, state, country, and postal code.
  • It searches text files for the given search word and returns the path of each file containing it, along with the number of occurrences of the word.

Text files

  • Yielding a list of all text files from the given directory.
  • Creating indexes of the files along with the number of repetitions of each word.
  • Saving the indexes.
  • Looking for the search word in the saved indexes.
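
The steps above can be sketched as a small inverted index that maps each word to the files containing it and the number of times it occurs. This is a minimal sketch, not the project's actual code: the function names, the ".txt" extension filter and the `\w+` tokenizer are assumptions made for illustration, and persisting the index is left out.

```python
import os
import re
from collections import Counter, defaultdict


def iter_text_files(root, extensions=(".txt",)):
    """Yield every file under root whose name ends with one of the extensions."""
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            if name.lower().endswith(extensions):
                yield os.path.join(dirpath, name)


def build_index(paths):
    """Return {word: {path: occurrence_count}} for the given files."""
    index = defaultdict(dict)
    for path in paths:
        with open(path, encoding="utf-8", errors="replace") as handle:
            counts = Counter(re.findall(r"\w+", handle.read().lower()))
        for word, count in counts.items():
            index[word][path] = count
    return index


def search(index, word):
    """Return (path, count) pairs for word, most occurrences first."""
    hits = index.get(word.lower(), {})
    return sorted(hits.items(), key=lambda item: (-item[1], item[0]))
```

A call such as `search(build_index(iter_text_files("docs")), "engine")` would return the matching paths with their counts; "docs" is a placeholder directory name.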

Audio files

  • Yielding files ('.mp3', '.ogg', '.wav', '.flac', '.wma') from the given directory.
  • Extracting metadata from the files using pyexiftool.
  • Saving the required metadata (file_type, file_size, artist, album, genre, year).
  • Indexing the metadata along with the respective files and saving the indexes.
  • Looking for the given artist, album, genre or year in the saved indexes.
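
A minimal sketch of the indexing and lookup steps follows. It deliberately does not call pyexiftool: the metadata reader is passed in as a function (`read_metadata`, a hypothetical parameter) that is assumed to return a dict with keys such as artist, album, genre and year. All function names here are illustrative.

```python
import os
from collections import defaultdict

AUDIO_EXTENSIONS = (".mp3", ".ogg", ".wav", ".flac", ".wma")
SEARCH_FIELDS = ("artist", "album", "genre", "year")


def iter_audio_files(root):
    """Yield audio files under root, matching extensions case-insensitively."""
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            if name.lower().endswith(AUDIO_EXTENSIONS):
                yield os.path.join(dirpath, name)


def build_metadata_index(paths, read_metadata):
    """Map (field, normalized value) to the set of files carrying that value.

    read_metadata(path) -> dict is supplied by the caller (assumption).
    """
    index = defaultdict(set)
    for path in paths:
        metadata = read_metadata(path)
        for field in SEARCH_FIELDS:
            value = metadata.get(field)
            if value is not None:
                index[(field, str(value).strip().lower())].add(path)
    return index


def search_metadata(index, **criteria):
    """Return the files that match every given criterion, e.g. artist=..., year=..."""
    results = None
    for field, value in criteria.items():
        matches = index.get((field, str(value).strip().lower()), set())
        results = matches if results is None else results & matches
    return sorted(results or [])
```

Normalizing values to lowercase strings keeps lookups insensitive to case and to whether the year was stored as a number or as text.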

Image files

  • Yielding a list of files ('.png', '.tif', '.jpg', '.gif', '.JPEG') from the given directory.
  • Extracting metadata from the files using pyexiftool.
  • Saving the required metadata (create_date, gps_latitude, gps_longitude).
  • Indexing the metadata along with the respective files and saving the indexes.
  • Looking for the given date, month, year, locality, city, state, country or postal code in the saved indexes.

Major challenges in Image files

  • Extracting date, month and year from the create_date metadata.
  • Exception handling.
  • Searching for all keywords.
  • Extracting village, city, state, country and postal code from gps_latitude and gps_longitude.
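
The first two challenges can be illustrated together. EXIF timestamps are commonly written as "YYYY:MM:DD HH:MM:SS", so the sketch below assumes that layout for create_date and returns None for missing or malformed values instead of raising. The function name split_create_date is hypothetical, and the actual string depends on the file and on how the metadata was extracted. Turning GPS coordinates into place names needs a reverse-geocoding service and is not shown.

```python
from datetime import datetime

# Assumed layout of the create_date value, e.g. "2017:03:25 14:07:09".
EXIF_DATE_FORMAT = "%Y:%m:%d %H:%M:%S"


def split_create_date(value):
    """Return (date, month, year) as integers, or None if the value cannot be parsed."""
    if not value:
        return None
    try:
        # Keep only the first 19 characters so a trailing time zone or
        # sub-second suffix does not break parsing.
        parsed = datetime.strptime(str(value)[:19], EXIF_DATE_FORMAT)
    except ValueError:
        return None
    return parsed.day, parsed.month, parsed.year
```

Returning None rather than raising lets the indexer skip images with absent or zeroed timestamps and carry on with the rest of the directory.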

Increasing efficiency of searching

  • Cleaning data.
  • Using the collections module.
  • Index filenames are a hashlib digest of the directory path, to prevent duplicate indexes.
  • Saving and loading the state of the indexes in files using cPickle.
  • Comparing the last-modified date-time of files and directories.
  • Utility module.
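
A minimal sketch of how the hashing, pickling and modification-time points could fit together is given below. It uses Python 3's pickle module (cPickle is its Python 2 counterpart), and the function names, the SHA-1 choice and the ".pickle" suffix are assumptions made for illustration.

```python
import hashlib
import os
import pickle


def index_path_for(directory, store):
    """Derive a stable index filename from the directory's absolute path."""
    digest = hashlib.sha1(os.path.abspath(directory).encode("utf-8")).hexdigest()
    return os.path.join(store, digest + ".pickle")


def save_index(index, directory, store):
    """Pickle the index into the store directory and return the file path."""
    os.makedirs(store, exist_ok=True)
    path = index_path_for(directory, store)
    with open(path, "wb") as handle:
        pickle.dump(index, handle, protocol=pickle.HIGHEST_PROTOCOL)
    return path


def is_stale(directory, index_file):
    """True if any file or directory under `directory` changed after the index was written."""
    saved_at = os.path.getmtime(index_file)
    for dirpath, _dirnames, filenames in os.walk(directory):
        candidates = [dirpath] + [os.path.join(dirpath, name) for name in filenames]
        if any(os.path.getmtime(path) > saved_at for path in candidates):
            return True
    return False


def load_index(directory, store):
    """Return the saved index, or None if it is missing or out of date."""
    path = index_path_for(directory, store)
    if not os.path.exists(path) or is_stale(directory, path):
        return None
    # Only unpickle files this program wrote itself; pickle is not safe
    # for untrusted data.
    with open(path, "rb") as handle:
        return pickle.load(handle)
```

Because the filename depends only on the directory path, indexing the same directory twice overwrites one file instead of accumulating copies. The store directory is assumed to live outside the indexed directory.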


Improve efficiency

  • Concurrent processing of files.
  • Using faster data structures such as a DataFrame.
  • Using Cython.
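
As one possible shape for the concurrent-processing idea, the sketch below counts words in several files at once with a thread pool from the standard library. It is an assumption about how this could be done, not the project's implementation; the function names and the default of four workers are illustrative.

```python
import re
from collections import Counter
from concurrent.futures import ThreadPoolExecutor


def count_words(path):
    """Return (path, Counter of lowercase words) for one file."""
    with open(path, encoding="utf-8", errors="replace") as handle:
        return path, Counter(re.findall(r"\w+", handle.read().lower()))


def count_words_concurrently(paths, max_workers=4):
    """Process files in parallel threads and return {path: Counter}."""
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        return dict(pool.map(count_words, paths))
```

Threads mainly help while the work is dominated by file reads; for CPU-heavy tokenizing, a process pool (ProcessPoolExecutor from the same module) would be the alternative to measure.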

Thanks

Anand B Pillai

Shashank Aryan

Thank You!

sumanshweta44@gmail.com

TELEGRAM: cosmologist10

GITHUB: github.com/cosmologist10
