Best practices for using PHP to develop web crawlers!

Peter

Outline

About me
The motivation for writing this book
Guide to the book
- Technical keywords introduction
- Section introduction
Extended sections
- Missed but important advanced crawling topics
Feedback on publishing the book

Slide

About me

Peter
GitHub
Active open source contributor
An associate engineer
- DevOps
- Back-end
- System architecture research
- Web Application Security
- PHP, Python and JavaScript
Smart Grid Technology (2017~2021)
Database, Data platform architecture (2021~)

The Motivation

Joke

Thinking?

Back to 2014

My short story about learning web crawling

Do you know this book?

The original author

Twitter

Ran into problems when implementing some of the code

Reading the book...

Write an e-mail to ask the author?

Asking the author

Received a reply from the author

Screen Scraper Tricks Extracting Data from Difficult Websites

Slide

From then on

After six years...

There has been no new book about PHP web crawlers

That's why I wrote a new one!

Guide for the book

Guide for book section

Section 1 to 10
Appendix A

Sample code

php_crawler_lab

Section 1

Fundamentals

Web crawler, spider and bot

Development Environment setup

Section 2

Lab 1-1: My university website

Analyze website behavior

Implementing RSS news fetching

Implementing RSS news parsing

Google Chrome DevTools
HTML/CSS
RSS
DOM
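
The RSS parsing step in this lab can be sketched roughly as follows. This is a minimal sketch that assumes a standard RSS 2.0 feed. The sample feed is made up, and `parseRssItems` is a hypothetical helper name, not code from the book.

```php
<?php

// Hypothetical helper: extract the title and link of each <item> in an RSS 2.0 feed.
function parseRssItems(string $xml): array
{
    $feed = simplexml_load_string($xml);
    if ($feed === false) {
        return [];
    }

    $items = [];
    foreach ($feed->channel->item as $item) {
        $items[] = [
            'title' => (string) $item->title,
            'link'  => (string) $item->link,
        ];
    }

    return $items;
}

// Made-up sample feed; a real crawler would first fetch the XML over HTTP.
$sampleFeed = <<<XML
<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0">
  <channel>
    <title>Campus News</title>
    <item><title>First post</title><link>https://example.com/1</link></item>
    <item><title>Second post</title><link>https://example.com/2</link></item>
  </channel>
</rss>
XML;

$newsItems = parseRssItems($sampleFeed);
foreach ($newsItems as $news) {
    echo $news['title'], ' => ', $news['link'], PHP_EOL;
}
```

Casting each SimpleXML node to a string keeps the result a plain array, which is easier to store or mail later than SimpleXMLElement objects.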

Section 3

Lab 1-2: University website

Analyze website loading contents

AJAX
HTTP POST Method
Google Chrome DevTools Network panel
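
Once the Network panel shows which POST request an AJAX page sends, that request can be replayed from PHP. The following is a minimal sketch under assumptions: the field names, the URL in the comment, and the helper `buildPostOptions` are all hypothetical placeholders, not taken from the lab.

```php
<?php

// Hypothetical helper: build stream-context options for a form-encoded POST request.
function buildPostOptions(array $fields): array
{
    return [
        'http' => [
            'method'  => 'POST',
            'header'  => "Content-Type: application/x-www-form-urlencoded\r\n"
                       . "X-Requested-With: XMLHttpRequest\r\n",
            // Pass '&' explicitly so the result does not depend on php.ini settings.
            'content' => http_build_query($fields, '', '&'),
            'timeout' => 10,
        ],
    ];
}

// Assumed form fields; copy the real ones from the request seen in DevTools.
$options = buildPostOptions(['page' => 2, 'keyword' => 'news']);
echo $options['http']['content'], PHP_EOL;

// To actually send the request (placeholder URL):
// $html = file_get_contents('https://example.com/ajax/list', false, stream_context_create($options));
```

The network call is left commented out so the sketch runs offline; cURL would work equally well for sending the same body.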

Section 4

Lab 2-1: Courses Search System

Analyze course outlines website

AJAX
HTTP POST Method
Google Chrome DevTools Network panel
ASP.NET Forms
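
ASP.NET Web Forms pages usually carry hidden fields such as `__VIEWSTATE` and `__EVENTVALIDATION` that must be sent back with each POST. A minimal sketch of collecting them is shown here, assuming a made-up sample page; `extractHiddenFields` and the `keyword` field are hypothetical names, not the book's code.

```php
<?php

// Hypothetical helper: collect name => value for every hidden <input> on a page.
function extractHiddenFields(string $html): array
{
    $doc = new DOMDocument();
    // Suppress parser warnings caused by imperfect real-world HTML.
    @$doc->loadHTML($html);

    $fields = [];
    foreach ($doc->getElementsByTagName('input') as $input) {
        if (strtolower($input->getAttribute('type')) === 'hidden') {
            $fields[$input->getAttribute('name')] = $input->getAttribute('value');
        }
    }

    return $fields;
}

// Made-up sample page standing in for a fetched search form.
$samplePage = '<html><body><form method="post" action="search.aspx">'
    . '<input type="hidden" name="__VIEWSTATE" value="dDwtMTIz" />'
    . '<input type="hidden" name="__EVENTVALIDATION" value="abc123" />'
    . '<input type="text" name="keyword" />'
    . '</form></body></html>';

$hidden = extractHiddenFields($samplePage);

// Merge the hidden fields with the search input before posting the form back.
$postBody = http_build_query($hidden + ['keyword' => 'PHP'], '', '&');
echo $postBody, PHP_EOL;
```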

Analyze & Implement courses search system

Web crawler development troubleshooting

Section 5

Lab 3-1: Securities website

Analyze and Implement Securities data website

HTTP GET Method
Google Chrome DevTools
ASP.NET

Fetch & analyze Securities web contents

Section 6

Lab 4-1: Convenience Store Cloud Printer

FamilyMart-part1

QRCode
base64 encode/decode
Google Chrome DevTools
uuid
ramsey/uuid
ASP.NET
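
Two of the keywords above, base64 encode/decode and uuid, can be illustrated with a short sketch. How the lab combines them is not shown in these slides, so this is only an assumption-laden illustration: the payload is made up, and `uuidV4` is a hypothetical stand-in written with core PHP functions rather than the ramsey/uuid API.

```php
<?php

// Hypothetical stand-in for a UUID library: generate a random (version 4) UUID.
function uuidV4(): string
{
    $bytes = random_bytes(16);
    // Set the version nibble to 4 and the variant bits to 10xx.
    $bytes[6] = chr((ord($bytes[6]) & 0x0f) | 0x40);
    $bytes[8] = chr((ord($bytes[8]) & 0x3f) | 0x80);

    return vsprintf('%s%s-%s-%s-%s-%s%s%s', str_split(bin2hex($bytes), 4));
}

// Made-up payload standing in for file contents to upload.
$payload = 'Hello, cloud printer!';

$encoded = base64_encode($payload);
// Strict mode returns false if the input contains characters outside the base64 alphabet.
$decoded = base64_decode($encoded, true);

$requestId = uuidV4();

echo $encoded, PHP_EOL;
echo $requestId, PHP_EOL;
```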

7-ELEVEN-part2

Section 7

Case studies integration

Cronjob integration

Gandi SMTP
MailGun
Cronjob

Section 8

Advanced web crawling techniques

Automated, headless web browsers

Anti-web-crawler measures: captcha codes

Selenium Web Driver
Headless Chrome
Puppeteer
Tesseract

Section 9

Lab 5-1

Tesseract

Automated login for a shopping website

Implement automated login webbot

Implement a web crawler for shopping history lists

Section 10

Lab 5-2

LocalStorage
chrome-php/chrome
nesk/puphpeteer

Radio program website

Analyze radio MP3 audio file downloads

Downloading a list of MP3 audio files (part 1)

Downloading a single audio file (part 2)

Appendix A

Providing an OVA file to import into VirtualBox

Development environment setup

Register a MailGun account for section 7

OVA download link

Additional materials

Capturing HTTP requests from non-browser apps
Advanced reCAPTCHA recognition
Cloud computing provider integration

Capturing HTTP requests from non-browser apps

Desktop app: Lightshot

Lightshot

Installation reference

Lightshot screenshot image

Lightshot uploading image

Lightshot uploading image link

https://prnt.sc/1sejtdr

How to upload picture?

How to upload file to https://prnt.sc?

Capturing desktop app HTTP packets

Fiddler

Download Fiddler

Operating system: Windows 10

Installation steps

Usage

Open Fiddler4

Configure Fiddler to capture non-browser traffic only

Trust the self-signed certificate

Filtering for non-browser requests only

Proxy server setup for Lightshot

Configure the proxy server for Lightshot

Using Lightshot to take and upload a screenshot

Using Fiddler to find the HTTP requests

Develop the image upload program

Uploaded image link

Develop the image upload program

Advanced captcha image processing

ImageMagick

Install ImageMagick

convert command usage

Grayscale image

Grayscale image with PHP

<?php

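// Threshold the captcha image: convert it to grayscale, then binarize it
$captchaPath = './captcha1.jpg';
// The output is written as PNG, so the file name uses a .png extension
$solvedCaptchaPath = './captcha1_solved.png';

$imageMagick = new \Imagick($captchaPath);
// Convert the pixels of the loaded image to grayscale
$imageMagick->transformImageColorspace(\Imagick::COLORSPACE_GRAY);
// Pixels brighter than 50% of the quantum range become white, the rest black
$max = $imageMagick->getQuantumRange();
$imageMagick->thresholdImage(0.5 * $max['quantumRangeLong']);
$imageMagick->setImageFormat('png');
// Write the encoded PNG bytes to disk
file_put_contents($solvedCaptchaPath, $imageMagick->getImageBlob());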

OCR support on Google Cloud

Feedback on publishing the book

More additional materials

My GitHub repositories

References

Contact me!
