Best practices for using PHP to develop web crawlers!

Peter

Outline

  • About me

  • The motivation for writing this book

  • Guide to the book

    • Technical keywords introduction

    • Section introduction

  • Extended sections

    • Missed but important advanced crawling topics

  • Feedback on publishing the book

Slide

About me

  • Peter
  • GitHub
  • Active open source contributor
  • An associate engineer
    • DevOps
    • Back-end
    • System architecture research
    • Web Application Security
    • PHP, Python and JavaScript
  • Smart Grid Technology (2017~2021)
  • Database, Data platform architecture (2021~)

The Motivation

Joke

Thinking?

Back to 2014

My short story about learning web crawling

Do you know this book?

The original author

Reading book......

Writing an e-mail to ask the author?

Asking the author

Receiving a reply from the author

Screen Scraper Tricks: Extracting Data from Difficult Websites

From then on

After six years...

There has been no new book about PHP web crawlers

That's why I wrote a new one!

Guide to the book

Guide to the book's sections

  • Section 1 to 10

  • Appendix A

Sample code

Section 1

Fundamentals

Web crawler, spider and bot

Development Environment setup

Section 2

Lab 1-1: My university website

Analyze website behavior

Implementing RSS news fetching

Implementing RSS news parsing

  1. Google Chrome DevTools

  2. HTML/CSS

  3. RSS

  4. DOM
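The RSS parsing step can be sketched as follows. This is a minimal example, assuming a standard RSS 2.0 feed and PHP's SimpleXML extension; the helper name parseRssItems() and the sample feed are made up for illustration and are not taken from the book.

```php
<?php

/**
 * Parse an RSS 2.0 document into a list of items.
 * parseRssItems() is a hypothetical helper name used for illustration.
 *
 * @return array<int, array{title: string, link: string, pubDate: string}>
 */
function parseRssItems(string $xml): array
{
    $feed = simplexml_load_string($xml);
    if ($feed === false || !isset($feed->channel)) {
        return [];
    }

    $items = [];
    foreach ($feed->channel->item as $item) {
        $items[] = [
            'title'   => trim((string) $item->title),
            'link'    => trim((string) $item->link),
            'pubDate' => trim((string) $item->pubDate),
        ];
    }

    return $items;
}

// Made-up feed standing in for the body returned by file_get_contents($feedUrl).
$sampleFeed = <<<XML
<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0">
  <channel>
    <title>Campus news</title>
    <item>
      <title>Library hours</title>
      <link>https://example.com/news/1</link>
      <pubDate>Mon, 06 Sep 2021 09:00:00 +0800</pubDate>
    </item>
    <item>
      <title>Course registration</title>
      <link>https://example.com/news/2</link>
      <pubDate>Tue, 07 Sep 2021 09:00:00 +0800</pubDate>
    </item>
  </channel>
</rss>
XML;

foreach (parseRssItems($sampleFeed) as $item) {
    echo $item['title'], ' -> ', $item['link'], PHP_EOL;
}
```

Fetching and parsing stay separate here, so the parser can be exercised against a saved copy of the feed before any request is sent to the real site.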

Section 3

Lab 1-2: University website

Analyze how the website loads its content

  1. AJAX

  2. HTTP POST Method

  3. Google Chrome DevTools Network panel
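Content loaded through AJAX can often be fetched by replaying the POST request observed in the browser's network inspector. The sketch below only builds such a request with PHP's stream context options; the endpoint, the field names and the helper buildAjaxPostOptions() are placeholders, not the ones used in the lab.

```php
<?php

/**
 * Build stream-context options that replay a form-encoded AJAX POST.
 * The headers are a typical set for a browser XHR; adjust them to match
 * what the network inspector shows for the real site.
 */
function buildAjaxPostOptions(array $fields): array
{
    $body = http_build_query($fields);

    return [
        'http' => [
            'method'  => 'POST',
            'header'  => implode("\r\n", [
                'Content-Type: application/x-www-form-urlencoded',
                'X-Requested-With: XMLHttpRequest',
                'Content-Length: ' . strlen($body),
            ]),
            'content' => $body,
            'timeout' => 10,
        ],
    ];
}

// Hypothetical endpoint and parameters, for illustration only.
$endpoint = 'https://example.com/news/list';
$options  = buildAjaxPostOptions(['page' => 2, 'category' => 'announcement']);

// Uncomment to actually send the request:
// $response = file_get_contents($endpoint, false, stream_context_create($options));
echo $options['http']['content'], PHP_EOL;
```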

Section 4

Lab 2-1: Course Search System

Analyze the course outline website

  1. AJAX

  2. HTTP POST Method

  3. Google Chrome DevTools Network panel

  4. ASP.NET Forms
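ASP.NET Web Forms pages usually carry hidden state fields such as __VIEWSTATE and __EVENTVALIDATION that have to be sent back with each POST. A minimal sketch of collecting them with DOMDocument follows; the form markup, the field txtKeyword and the helper extractHiddenFields() are invented for illustration.

```php
<?php

/**
 * Collect every hidden <input> of a page into a name => value map.
 * extractHiddenFields() is an illustrative helper, not code from the book.
 */
function extractHiddenFields(string $html): array
{
    $dom = new DOMDocument();
    // Real-world HTML is rarely valid, so collect libxml warnings silently.
    $previous = libxml_use_internal_errors(true);
    $dom->loadHTML($html);
    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    $fields = [];
    $xpath  = new DOMXPath($dom);
    foreach ($xpath->query('//input[@type="hidden"][@name]') as $input) {
        $fields[$input->getAttribute('name')] = $input->getAttribute('value');
    }

    return $fields;
}

// Made-up markup in the shape ASP.NET Web Forms usually renders.
$html = <<<HTML
<form method="post" action="./Search.aspx">
  <input type="hidden" name="__VIEWSTATE" value="dDwtMTIzNDU2Nzg5Ozs+" />
  <input type="hidden" name="__EVENTVALIDATION" value="AAECAwQF" />
  <input type="text" name="txtKeyword" value="" />
</form>
HTML;

// Merge the hidden state with the search fields before posting the form back.
$postFields = extractHiddenFields($html) + ['txtKeyword' => 'database'];
print_r($postFields);
```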

Analyze and implement the course search system

Web crawler development troubleshooting

Section 5

Lab 3-1: Securities website

Analyze and implement the securities data website

  1. HTTP GET Method

  2. Google Chrome DevTools

  3. ASP.NET

Fetch and analyze securities web content
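A GET-based data page typically needs two steps: build the query URL, then pull the rows out of the returned HTML table. The sketch below assumes the data sits in the first table of the page; the URL, the query parameters and the sample markup are hypothetical.

```php
<?php

/**
 * Turn the rows of the first <table> in an HTML document into arrays of cell text.
 */
function parseTableRows(string $html): array
{
    $dom = new DOMDocument();
    $previous = libxml_use_internal_errors(true);
    $dom->loadHTML($html);
    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    $rows  = [];
    $xpath = new DOMXPath($dom);
    foreach ($xpath->query('(//table)[1]//tr') as $tr) {
        $cells = [];
        foreach ($xpath->query('./th|./td', $tr) as $cell) {
            $cells[] = trim($cell->textContent);
        }
        if ($cells !== []) {
            $rows[] = $cells;
        }
    }

    return $rows;
}

// Hypothetical endpoint and query parameters.
$url = 'https://example.com/quotes.aspx?' . http_build_query([
    'code' => '2330',
    'date' => '2021-10-01',
]);

// Made-up response body standing in for file_get_contents($url).
$html = '<table><tr><th>Code</th><th>Close</th></tr>'
      . '<tr><td>2330</td><td> 580.00 </td></tr></table>';

$rows = parseTableRows($html);
print_r($rows);
```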

Section 6

Lab 4-1: Convenience Store Cloud Printer

FamilyMart (part 1)

  1. QR code

  2. base64 encode/decode

  3. Google Chrome DevTools

  4. UUID

  5. ramsey/uuid

  6. ASP.NET
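The slides do not show how the lab combines these keywords, so the sketch below only illustrates two of the primitives: a base64 round trip and a random (version 4) UUID built with PHP built-ins. The payload shape and the helper uuidV4() are assumptions; with ramsey/uuid installed, Ramsey\Uuid\Uuid::uuid4() could be used instead.

```php
<?php

/**
 * Generate a random (version 4) UUID using only PHP built-ins.
 * uuidV4() is an illustrative helper, not part of any library.
 */
function uuidV4(): string
{
    $bytes = random_bytes(16);
    $bytes[6] = chr((ord($bytes[6]) & 0x0f) | 0x40); // set the version to 4
    $bytes[8] = chr((ord($bytes[8]) & 0x3f) | 0x80); // set the RFC 4122 variant

    return vsprintf('%s%s-%s-%s-%s-%s%s%s', str_split(bin2hex($bytes), 4));
}

// Stand-in bytes for a file that would be sent to a printing service.
$fileContents = '%PDF-1.4 sample';

$encoded = base64_encode($fileContents);
$decoded = base64_decode($encoded, true); // strict mode: false on invalid input

// Hypothetical payload shape, for illustration only.
$payload = ['id' => uuidV4(), 'file' => $encoded];
echo json_encode($payload), PHP_EOL;
```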

7-ELEVEN (part 2)

Section 7

Case studies integration

Cronjob integration

  1. Gandi SMTP

  2. MailGun

  3. Cronjob
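When a crawler is scheduled with cron, a slow run can overlap with the next one. One common guard is a lock file, sketched below under the assumption that the crawler is a PHP CLI script; the lock path, the helper runExclusively() and the crontab line in the comment are hypothetical.

```php
<?php

// Example crontab entry (paths are placeholders):
// */30 * * * * /usr/bin/php /path/to/crawl.php >> /tmp/crawl.log 2>&1

/**
 * Run $job only if no other instance holds the lock file.
 * Returns true when the job ran, false when another run is still active.
 */
function runExclusively(string $lockPath, callable $job): bool
{
    $handle = fopen($lockPath, 'c');
    if ($handle === false) {
        throw new RuntimeException("Cannot open lock file: $lockPath");
    }

    try {
        // LOCK_NB makes flock() return immediately instead of waiting.
        if (!flock($handle, LOCK_EX | LOCK_NB)) {
            return false;
        }
        $job();
        flock($handle, LOCK_UN);

        return true;
    } finally {
        fclose($handle);
    }
}

$ran = runExclusively(sys_get_temp_dir() . '/crawler.lock', function (): void {
    echo 'Crawling at ', date('c'), PHP_EOL;
});
```

The mail notification step (SMTP or MailGun) would go inside the job callback, after the crawl result is known.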

Section 8

Advanced web crawling techniques

Automated, headless web browsers

Anti-crawler measures: CAPTCHA codes

  1. Selenium WebDriver

  2. Headless Chrome

  3. Puppeteer

  4. Tesseract
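For simple CAPTCHA images, Tesseract can be called from PHP through its command-line interface. The sketch below only builds the command and cleans the OCR output; it assumes Tesseract 4.1 or later is on the PATH, and the helper names and the image path are illustrative.

```php
<?php

/**
 * Build a tesseract CLI command that prints the recognized text to stdout.
 * --psm 7 treats the image as a single line of text, a common choice for
 * one-line captchas (an assumption; other page segmentation modes may fit better).
 */
function buildTesseractCommand(string $imagePath, string $whitelist = ''): string
{
    $command = 'tesseract ' . escapeshellarg($imagePath) . ' stdout --psm 7';
    if ($whitelist !== '') {
        $command .= ' -c ' . escapeshellarg('tessedit_char_whitelist=' . $whitelist);
    }

    return $command;
}

/** Keep only letters and digits from raw OCR output. */
function cleanOcrText(string $raw): string
{
    return preg_replace('/[^A-Za-z0-9]/', '', $raw);
}

$command = buildTesseractCommand('./captcha1.png', '0123456789');
echo $command, PHP_EOL;

// With the tesseract binary installed, the captcha text could be read like this:
// $text = cleanOcrText((string) shell_exec($command));
```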

Section 9

Lab 5-1

  1. Tesseract

Automated login for a shopping website

Implement an automated login webbot

Implement a web crawler for the shopping history list

Section 10

Lab 5-2

  1. LocalStorage

  2. chrome-php/chrome

  3. nesk/puphpeteer

Radio program website

Analyze how the radio MP3 audio files are downloaded

Downloading a list of MP3 audio files (part 1)

Downloading a single audio file (part 2)
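Once the program page has been rendered (for example by a headless browser), collecting the audio links is plain HTML parsing. The sketch below assumes the MP3 files are exposed as ordinary links; the markup, the base URL and the helper extractMp3Links() are made up, and relative links are resolved in a deliberately naive way.

```php
<?php

/**
 * Return the absolute URLs of all links ending in .mp3 found in $html.
 * Relative links are simply appended to $baseUrl.
 */
function extractMp3Links(string $html, string $baseUrl): array
{
    $dom = new DOMDocument();
    $previous = libxml_use_internal_errors(true);
    $dom->loadHTML($html);
    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    $links = [];
    foreach ($dom->getElementsByTagName('a') as $anchor) {
        $href = $anchor->getAttribute('href');
        $path = (string) parse_url($href, PHP_URL_PATH);
        if (strtolower(substr($path, -4)) !== '.mp3') {
            continue;
        }
        if (parse_url($href, PHP_URL_SCHEME) === null) {
            $href = rtrim($baseUrl, '/') . '/' . ltrim($href, '/');
        }
        $links[] = $href;
    }

    return array_values(array_unique($links));
}

// Made-up page fragment and base URL.
$html = '<a href="/audio/ep1.mp3">Episode 1</a>'
      . '<a href="https://cdn.example.com/ep2.MP3?t=1">Episode 2</a>'
      . '<a href="/about">About</a>';

$links = extractMp3Links($html, 'https://radio.example.com/');
print_r($links);

// A single file could then be streamed to disk like this:
// file_put_contents('./ep1.mp3', fopen($links[0], 'r'));
```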

Appendix A

Providing an OVA file to import into VirtualBox

Development environment setup

Registering a MailGun account for Section 7

Additional materials

  1. Capturing HTTP requests from non-browser apps

  2. Advanced reCAPTCHA recognition

  3. Cloud computing provider integration

Capturing HTTP requests from non-browser apps

Desktop app: Lightshot

Lightshot

Lightshot screenshot image

Lightshot uploading image

Lightshot uploading image link

https://prnt.sc/1sejtdr

How is the picture uploaded?

How to upload a file to https://prnt.sc?

Capturing the desktop app's HTTP traffic

Operating system: Windows 10

Installation steps

Usage

Open Fiddler4

Configure Fiddler to capture non-browser traffic only

Trust the self-signed certificate

Filter for non-browser requests only

Proxy server setup for Lightshot

Configure proxy server for Lightshot

Use Lightshot to take a screenshot and upload it

Use Fiddler to find the HTTP requests

Develop an image upload program
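After the upload request has been captured, it can be replayed from PHP as a multipart/form-data POST. The sketch below builds such a body by hand; the field name "image", the file name and the upload URL are placeholders, since the real values have to be read from the captured request.

```php
<?php

/**
 * Build a multipart/form-data body carrying a single file field.
 */
function buildMultipartBody(string $field, string $filename, string $contents, string $boundary): string
{
    return "--{$boundary}\r\n"
        . "Content-Disposition: form-data; name=\"{$field}\"; filename=\"{$filename}\"\r\n"
        . "Content-Type: application/octet-stream\r\n\r\n"
        . $contents . "\r\n"
        . "--{$boundary}--\r\n";
}

$boundary = '----phpcrawler' . bin2hex(random_bytes(8));

// Placeholder bytes; a real run would use file_get_contents('./screenshot.png').
$body = buildMultipartBody('image', 'screenshot.png', 'PNG-BYTES-PLACEHOLDER', $boundary);

$context = stream_context_create([
    'http' => [
        'method'  => 'POST',
        'header'  => "Content-Type: multipart/form-data; boundary={$boundary}\r\n"
            . 'Content-Length: ' . strlen($body),
        'content' => $body,
    ],
]);

// $uploadUrl is a placeholder: use the URL shown for the captured upload request.
// $response = file_get_contents($uploadUrl, false, $context);
echo strlen($body), ' bytes ready to upload', PHP_EOL;
```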

Advanced captcha image processing

ImageMagick

Install ImageMagick

convert command usage

Grayscale image

Grayscale image with PHP

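<?php

// Convert the captcha to grayscale, then threshold it so every pixel becomes black or white
$captchaPath = './captcha1.jpg';
// The result is written as PNG, so the output file uses a .png extension
$solvedCaptchaPath = './captcha1_solved.png';

$imageMagick = new \Imagick($captchaPath);
// transformImageColorspace() converts the pixels of the already loaded image
$imageMagick->transformImageColorspace(\Imagick::COLORSPACE_GRAY);
$max = \Imagick::getQuantumRange();
// Use 50% of the quantum range as the threshold
$imageMagick->thresholdImage(0.5 * $max['quantumRangeLong']);
$imageMagick->setImageFormat('png');
file_put_contents($solvedCaptchaPath, $imageMagick->getImageBlob());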

OCR support on Google Cloud

Feedback on publishing the book

More additional materials

References

Contact me!

My e-mail address is available on my GitHub

Thanks!

By peter279k

PHP Conference Japan 2021
