Text & Multimedia
Search Engine
https://cosmologist10.github.io
Shweta Suman
Search engine
- A software program/script that searches documents and files for keywords and returns the files containing those keywords.
- The first text-based search engine is considered to be Veronica, used as a search tool in Gopher.
- The first search engine ever developed is considered to be Archie, used to find files stored on anonymous FTP sites.
Web search engine:
- Software designed to search for information (web pages, images, and other types of files) on the World Wide Web.
- E.g. Google, DuckDuckGo, Bing, etc.
- They use crawlers to go through pages, create a huge index, receive the search request, compare it to the entries in the index, and return results.
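The index-then-query flow described above can be illustrated with a toy inverted index. This is only a sketch: the `pages` dictionary stands in for crawled content, and the names `build_index` and `search` are hypothetical, not part of any real engine.

```python
from collections import defaultdict

def build_index(pages):
    """Map each lowercase word to the set of page URLs containing it."""
    index = defaultdict(set)
    for url, text in pages.items():
        for word in text.lower().split():
            index[word].add(url)
    return index

def search(index, query):
    """Return the pages that contain every word in the query."""
    words = query.lower().split()
    if not words:
        return set()
    results = set(index.get(words[0], set()))
    for word in words[1:]:
        results &= index.get(word, set())
    return results

# Made-up stand-in for pages fetched by a crawler.
pages = {
    "example.org/a": "gopher search tools",
    "example.org/b": "web search engine",
}
index = build_index(pages)
```

Calling `search(index, "web search")` compares the request against the index entries and returns only the pages that contain both words.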
Search files/pattern in file system
- find
- locate
- grep
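For comparison, a rough Python analogue of what `find -name` and `grep -n` do might look like the sketch below. The function names `find_files` and `grep` are chosen here for illustration only, and the real tools offer far more options.

```python
import fnmatch
import os
import re

def find_files(root, pattern="*"):
    """Yield paths under root whose file name matches a glob pattern."""
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in fnmatch.filter(filenames, pattern):
            yield os.path.join(dirpath, name)

def grep(path, regex):
    """Yield (line_number, line) for each line in path that matches regex."""
    compiled = re.compile(regex)
    with open(path, encoding="utf-8", errors="ignore") as handle:
        for number, line in enumerate(handle, start=1):
            if compiled.search(line):
                yield number, line.rstrip("\n")
```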
Search Engine Programs
- Whoosh: A fast, featureful full-text indexing and searching library implemented in pure Python.
- Xapian: Open source search engine library, which allows developers to easily add advanced indexing and search facilities to their own applications.
Text/Multimedia search engine for user content
- A search engine for text, image, and audio files.
- It searches audio files according to the artist, album, genre and year.
- It searches image files according to date, month, year, locality, city, state, country, and postal code.
- It searches text files for the given search word and returns each path/to/file containing it, along with the number of occurrences of the word.
Text files
- Yielding a list of all text files from the given directory.
- Creating indexes of files along with the number of repetitions of each word.
- Saving the indexes.
- Looking for the search word in the saved indexes.
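The text-file steps can be sketched roughly as follows. This is a minimal illustration, not the project's actual code: the `.txt` extension filter, the function names, and the use of `pickle` for storage are assumptions made for the example.

```python
import os
import pickle
import re
from collections import Counter

TEXT_EXTENSIONS = (".txt",)  # assumption: which extensions count as text

def iter_text_files(directory):
    """Yield the path of every text file under the directory."""
    for dirpath, _dirs, filenames in os.walk(directory):
        for name in filenames:
            if name.lower().endswith(TEXT_EXTENSIONS):
                yield os.path.join(dirpath, name)

def build_index(directory):
    """Return {path: Counter of lowercase words in that file}."""
    index = {}
    for path in iter_text_files(directory):
        with open(path, encoding="utf-8", errors="ignore") as handle:
            index[path] = Counter(re.findall(r"\w+", handle.read().lower()))
    return index

def save_index(index, index_path):
    """Write the index to disk so later searches can skip re-reading files."""
    with open(index_path, "wb") as handle:
        pickle.dump(index, handle)

def load_index(index_path):
    """Read a previously saved index back from disk."""
    with open(index_path, "rb") as handle:
        return pickle.load(handle)

def search(index, word):
    """Return {path: occurrence count} for files containing the word."""
    word = word.lower()
    return {path: counts[word] for path, counts in index.items() if counts[word] > 0}
```

A search then only touches the saved index, for example `search(load_index("index.pkl"), "engine")`, where `index.pkl` is a hypothetical file name.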
Audio files
- Yielding files ('.mp3', '.ogg', '.wav', '.flac', '.wma') from given directory.
- Extracting metadata using pyexiftool from files.
- Saving required metadata -- (file_type, file_size, artist, album, genre, year)
- Index metadata along with their respective files and save them.
- Look for given artist, album, genre, year in saved files.
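The indexing and lookup steps for audio metadata could be sketched as below. The sketch assumes the metadata has already been extracted (for instance with pyexiftool) into plain dictionaries keyed by `artist`, `album`, `genre`, and `year`; those key names, and the functions `build_audio_index` and `search_audio`, are hypothetical.

```python
from collections import defaultdict

AUDIO_EXTENSIONS = (".mp3", ".ogg", ".wav", ".flac", ".wma")
FIELDS = ("artist", "album", "genre", "year")

def build_audio_index(metadata_by_file):
    """Build {field: {value: set of paths}} from already-extracted metadata."""
    index = {field: defaultdict(set) for field in FIELDS}
    for path, metadata in metadata_by_file.items():
        if not path.lower().endswith(AUDIO_EXTENSIONS):
            continue  # skip anything that is not an audio file
        for field in FIELDS:
            value = metadata.get(field)
            if value is not None:
                index[field][str(value).lower()].add(path)
    return index

def search_audio(index, **criteria):
    """Return the paths that match every given field, e.g. genre="rock"."""
    matches = None
    for field, value in criteria.items():
        found = index[field].get(str(value).lower(), set())
        matches = set(found) if matches is None else matches & found
    return matches or set()
```

Storing values in lowercase makes the lookup case-insensitive, and combining criteria with set intersection lets a query such as `search_audio(index, artist="Artist A", year=1999)` narrow the results field by field.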
Image files
- Yielding a list of files ('.png', '.tif', '.jpg', '.gif', '.JPEG') from given directory.
- Extracting metadata using pyexiftool from files.
- Saving required metadata -- (created-date, g_p_s_latitude, g_p_s_longitude)
- Index metadata along with their respective files and save them.
- Look for given date, month, year, locality, city, state, country, and postal code in saved files.
Major challenges in Image files
- Extracting date, month and year from create_date metadata
- Exception handling
- Searching for all keywords.
- Extracting village, city, state, country and postal code from gps_latitude and gps_longitude.
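As an illustration of the date challenge, the sketch below splits an EXIF-style timestamp of the form `YYYY:MM:DD HH:MM:SS` into day, month, and year, and returns `None` instead of raising when the value is missing or malformed. The function name `parse_create_date` is hypothetical, and the slice to 19 characters is an assumption made to tolerate trailing time-zone or sub-second suffixes. Turning GPS coordinates into a city or postal code needs a reverse-geocoding service and is not shown.

```python
from datetime import datetime

EXIF_DATE_FORMAT = "%Y:%m:%d %H:%M:%S"

def parse_create_date(value):
    """Return (day, month, year) from an EXIF-style timestamp, or None if unparsable."""
    try:
        # Keep only the "YYYY:MM:DD HH:MM:SS" part; drop any trailing suffix.
        parsed = datetime.strptime(str(value)[:19], EXIF_DATE_FORMAT)
    except ValueError:
        return None
    return parsed.day, parsed.month, parsed.year
```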
Increasing efficiency of searching
- Cleaning data
- Using the collections module.
- Filenames are a hashlib digest of the directory, to prevent duplication.
- Saving & loading state of indexes in files using cPickle
- Comparing last modified date-time of files and directories
- Utility module
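A rough sketch of these bookkeeping ideas follows. It assumes Python 3, where `pickle` replaces `cPickle`; the choice of SHA-1, the `.pkl` suffix, and all function names are assumptions made for illustration.

```python
import hashlib
import os
import pickle

def index_filename(directory, index_dir):
    """Derive a stable index file name from the directory's absolute path."""
    digest = hashlib.sha1(os.path.abspath(directory).encode("utf-8")).hexdigest()
    return os.path.join(index_dir, digest + ".pkl")

def save_state(state, path):
    """Persist the index state to disk."""
    with open(path, "wb") as handle:
        pickle.dump(state, handle, protocol=pickle.HIGHEST_PROTOCOL)

def load_state(path):
    """Load a previously saved index state."""
    with open(path, "rb") as handle:
        return pickle.load(handle)

def is_stale(source_path, index_path):
    """True if no index exists yet or the source changed after it was written."""
    if not os.path.exists(index_path):
        return True
    return os.path.getmtime(source_path) > os.path.getmtime(index_path)
```

Because the file name is derived from the directory path, indexing the same directory twice reuses one index file, and `is_stale` lets the program skip re-indexing when nothing has changed since the last run.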
Improve efficiency
- Concurrent processing of files
- Using faster data structures like dataframe
- Using Cython
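Concurrent processing of files could look something like the sketch below, which counts words in several files at once with a thread pool. It is only one possible approach: threads mainly help while waiting on disk reads, and a `ProcessPoolExecutor` may suit CPU-heavy parsing better. The names `count_words` and `index_concurrently` are hypothetical.

```python
import re
from collections import Counter
from concurrent.futures import ThreadPoolExecutor

def count_words(path):
    """Return a Counter of the lowercase words in one file."""
    with open(path, encoding="utf-8", errors="ignore") as handle:
        return Counter(re.findall(r"\w+", handle.read().lower()))

def index_concurrently(paths, max_workers=4):
    """Index many files in parallel and return {path: Counter}."""
    paths = list(paths)
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        # pool.map preserves input order, so results line up with paths.
        return dict(zip(paths, pool.map(count_words, paths)))
```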
Thanks
Anand B Pillai
Shashank Aryan
Thank You!
sumanshweta44@gmail.com
TELEGRAM: cosmologist10
GITHUB: github.com/cosmologist10
Copy of Text and Multimedia search engine
By cocoa1231