Image Scraper in BASH

Moksh Jain, 16IT221

Nishanth Hebbar, 16IT234

Suyash Ghuge, 16IT114

Abhishek Kamal, 16IT202

Code at: github.com/MJ10/Unix-Project

Introduction

Growing popularity of Machine Learning.

 

Machine Learning models requiring a lot of Data.

 

Categorized datasets hard to find.

 

Creating new datasets is hard for beginners.

Functioning

Can be used by beginners to create toy datasets to test machine learning models.

 

Scrapes Google Search results for a particular category name, and downloads the specified number of images.

 

Scraped images can be resized to the desired dimensions and saved.

Commands Used

awk

convert

curl

cat

egrep

mkdir

rm

rmdir

wget

which

...

awk

  • The basic function of awk is to search files for lines (or other units of text) that contain certain patterns.
  • When a line matches one of the patterns, awk performs specified actions on that line. awk continues to process input lines in this way until it reaches the end of the input files.
awk -F <delimiter> '<pattern> { <action> }' <file>
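
As a minimal sketch of the pattern/action model described above, the snippet below filters comma-separated records and prints one field. The sample records and the variable names are hypothetical, made up for illustration; they are not taken from the project.

```shell
# Hypothetical sample input: one "category,url" record per line.
records='cat,http://example.com/a.jpg
dog,http://example.com/b.png
cat,http://example.com/c.jpg'

# -F ',' sets the field separator. The pattern $1 == "cat" selects lines,
# and the action { print $2 } prints the second field of each selected line.
cat_urls=$(printf '%s\n' "$records" | awk -F ',' '$1 == "cat" { print $2 }')
printf '%s\n' "$cat_urls"
```

Lines whose first field is not "cat" are skipped, because awk runs the action only when the pattern matches.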

convert (ImageMagick)

  • The convert command converts between image formats and can also resize, blur, crop, dither, draw on, flip, join, or resample images.
  • It is a part of the ImageMagick software suite.
convert "$input_file" -resize "${RESIZE_WIDTH}x${RESIZE_HEIGHT}!" "$output_file" &
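
The resize step could be wrapped in a loop along the following lines. This is a sketch under stated assumptions: resize_all, the .jpg filter, and the directory arguments are hypothetical and may differ from the project's script. The trailing "!" in the geometry tells ImageMagick to force the exact size and ignore the aspect ratio.

```shell
RESIZE_WIDTH=64
RESIZE_HEIGHT=64
# Geometry string passed to -resize; "!" forces the exact dimensions.
geometry="${RESIZE_WIDTH}x${RESIZE_HEIGHT}!"

# resize_all is a hypothetical helper: resize every .jpg in src_dir into out_dir.
resize_all() {
    src_dir=$1
    out_dir=$2
    mkdir -p "$out_dir"
    for input_file in "$src_dir"/*.jpg; do
        [ -e "$input_file" ] || continue   # skip when the glob matches nothing
        convert "$input_file" -resize "$geometry" \
            "$out_dir/$(basename "$input_file")" &
    done
    wait   # block until all background convert jobs have finished
}
```

Running each convert in the background lets several images be processed at once; the final wait keeps the script from moving on before the resized files exist.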

which

  • which returns the pathnames of the files (or links) that would be executed in the current environment.
  • It does this by searching the directories listed in the PATH environment variable for executable files matching the names of its arguments.
  • which does not follow symbolic links; it reports the path as it is found in PATH.
which [filename/command]
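
One plausible use of which in a script like this is checking that the required tools are installed before any work starts. The helper name require_commands is hypothetical, and the sketch is not necessarily how the project performs the check.

```shell
# require_commands is a hypothetical helper: it reports every command that
# which cannot find on PATH and returns non-zero if any is missing.
require_commands() {
    missing=0
    for cmd in "$@"; do
        if ! which "$cmd" > /dev/null 2>&1; then
            echo "Missing dependency: $cmd" >&2
            missing=1
        fi
    done
    return "$missing"
}

# Example call (left commented out so the snippet has no side effects):
# require_commands curl wget convert || exit 1
```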

curl

  • curl - Transfers data from or to a server, using one of the protocols: HTTP, HTTPS, FTP, FTPS, SCP, SFTP, TFTP, DICT, TELNET, LDAP or FILE.
  • curl can resume interrupted transfers and offers a large number of command-line options for various tasks.
curl [options] [URL...]
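
A hedged sketch of how a search page might be requested with curl. The helpers build_query_url and fetch_page are hypothetical, and the endpoint, the tbm=isch parameter, and the User-Agent value are assumptions about Google Image Search rather than details taken from the project.

```shell
# build_query_url is a hypothetical helper. The URL layout is an assumption.
build_query_url() {
    # Replace spaces with '+' so the category fits in a query string.
    query=$(printf '%s' "$1" | tr ' ' '+')
    printf 'https://www.google.com/search?q=%s&tbm=isch' "$query"
}

# fetch_page is a hypothetical helper: -s hides the progress meter,
# -L follows redirects, and -A sets the User-Agent header (placeholder value).
fetch_page() {
    curl -s -L -A 'Mozilla/5.0' "$1"
}

url=$(build_query_url "golden retriever")
echo "$url"
# fetch_page "$url" > results.html   # network call, left commented out
```

Only spaces are encoded here; a category containing other reserved characters would need full URL encoding.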

wget

  • wget stands for “web get”. It is a command-line utility for downloading files from the Internet.
  • It supports downloading multiple files, downloading in the background, resuming downloads, limiting the bandwidth used for downloads and viewing headers.
  • It can also be used for taking a mirror of a site.
wget [options] [url]
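
Downloading a fixed number of images from a list of URLs could look like the sketch below. download_images is a hypothetical helper, and the retry and timeout values are arbitrary assumptions.

```shell
# download_images is a hypothetical helper: it reads URLs on standard input
# and fetches at most "count" of them into out_dir.
download_images() {
    out_dir=$1
    count=$2
    mkdir -p "$out_dir"
    head -n "$count" | while IFS= read -r url; do
        # -q: quiet, -P: target directory,
        # --tries / --timeout: bound retries and waiting time per URL.
        wget -q --tries=2 --timeout=10 -P "$out_dir" "$url" \
            || echo "Failed: $url" >&2
    done
}

# Example call (urls.txt is a hypothetical file with one URL per line):
# download_images images/cat 20 < urls.txt
```

A failed download is reported and skipped, so one dead link does not stop the remaining downloads.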

egrep

  • The egrep command searches for a pattern written as an extended regular expression. It is essentially the same as running grep with the -E option.

 

  • With egrep, meta-characters such as +, ?, |, ( and ) keep their special meaning without being escaped; to match them literally as part of a string, escape them with a backslash.

 

egrep [option] pattern [file…]
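
To illustrate unescaped meta-characters, the sketch below pulls image URLs out of a fragment of HTML. The sample markup and the pattern are hypothetical and are not the project's actual expression. It uses grep -E, which is equivalent to egrep, together with -o, which is widely available in GNU and BSD grep but is not part of POSIX.

```shell
# Hypothetical sample markup containing two images and one ordinary link.
html='<img src="http://example.com/a.jpg"><a href="http://example.com/page.html">x</a><img src="https://example.com/b.png">'

# -o prints only the matched text, one match per line. The ? quantifier and
# the (jpg|png) alternation need no backslashes in an extended regular
# expression, while \. is escaped so that it matches a literal dot.
image_urls=$(printf '%s\n' "$html" | grep -E -o 'https?://[^"]+\.(jpg|png)')
printf '%s\n' "$image_urls"
```

The link to page.html is not matched because it does not end in one of the listed image extensions.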

 

Screenshots

References

https://www.gnu.org/software/wget/manual/

https://curl.haxx.se/docs/manpage.html

https://www.imagemagick.org/script/convert.php

https://linux.die.net/man/1/egrep

https://www.gnu.org/software/gawk/manual/gawk.html

https://images.google.com

Unix-Project

By Moksh Jain

Presentation for Unix(IT202) Course Project
