Web Scrapping, Crawler and PHP
Slide link
Outline
-
Joke about web crawler and scrapping
-
My small story about learning web crawler
-
What’s web crawler, bot and scrapping?
-
Develop your first web crawler!
-
Advanced/Useful crawling skills for everyone
-
A funny story about crawling
-
Sharing
-
How to figure out your side project via crawling?
-
How to integrate existed project for your crawler?
-
About me
-
Peter
-
Active open source contributor
-
An associate engineer
-
3+ years for PHP development
-
PHP 5.3 → PHP 7+
-
No framework→Slim→Laravel
-
Working for ITRI currently
-
Smart Grid technology
Joke about web crawler and scrapping
Joke
Thinking?
Back to 2014
My small story about learning web crawler
My small story about learning web crawler
My small story about learning web crawler
Michael Schrenk
My first web crawler
-
wifi_login_php
-
curl_*
-
preg_match
Screen Scraper Tricks Extracting Data from Difficult Websites
iMarcos
iMarcos
#01 '##################################################
#02 ' Set maximum web page time out
#03 SET !TIMEOUT 240
#04 ' Tell iMacros to ignore error messages
#05 SET !ERRORIGNORE YES
#06 ' Clear ALL cookies
#07 CLEAR
#08 ' Initialize Browser tab 1, close all other tabs
#09 TAB T=1
#10 TAB CLOSEALLOTHERS
#11 ' Tell iMacros to ignore images (nice if using Tor)
#12 FILTER TYPE=IMAGES STATUS=ON
#13 ' Tell iMacros to ignore extract messages
#14 SET !EXTRACT_TEST_POPUP NO
#15 '##################################################
Back to 2015
First hackathon in 2015
First hackathon in 2015
Back nowadays!
Fundamentals
-
Web Bot
-
Web Crawler
-
Web Scarping
Develop your first web crawler!
We have...
We have...
-
Linux(Especially Ubuntu 16.04+)
-
Google Chrome
-
PHP 7.2+
-
cURL extension
-
-
Composer
Installation
curl -sS https://getcomposer.org/installer | php
php ~/composer.phar require guzzlehttp/guzzle:^6.2 -n
php ~/composer.phar require symfony/dom-crawler:^4.3 -n
php ~/composer.phar require symfony/css-selector:^4.3 -n
# php ~/composer.phar require fabpot/goutte:^4.0 -n
Installation with Docker
docker pull peter279k/crawler-lab-coscup:latest
Basic Case Study One
Inspection
HTTP Inspection
HTTP Inspection
HTTP Inspection
HTTP Inspection
{
"content":"\n\n\t<div class=\"row listBS\">\n\t\n\t\n\t\t\n\t\t<div class=\"d-item d-title col-sm-12\">\n<div class=\"mbox\">\n\t<div class=\"d-txt\">\n <div class=\"mtitle\">\n\t\t\t\n\t\t\t<a href=\"http:\/\/aa.nttu.edu.tw\/p\/404-1002-99926-1.php\">\n\t\t\t\t\u3010\u6559\u52d9\u8655\u8ab2\u52d9\u7d44\u3011109\u5b78\u5e74\u5ea6\u7b2c1\u5b78\u671f(\u9032\u4fee\u5b78\u5236)\u9078\u8ab2\u4f5c\u696d\u6642\u7a0b(\u7db2\u8def\u52a0\u9000\u9078\u8ab2\u6642\u9593:109\u5e749\u670814\u65e5(\u4e00)08:00~9\u670818\u65e5(\u4e94)24:00)\n\t\t\t<\/a>\n\t\t\t\n\t\t\t<span class=\"subsitename newline\"><\/span>\n\t\t<\/div>\n\t<\/div>\n\t\n<\/div>\n<\/div>\n\n\t\t<\/div><div class=\"row listBS\">\n\t\n\t\t\n\t\t<div class=\"d-item d-title col-sm-12\">\n<div class=\"mbox\">\n\t<div class=\"d-txt\">\n <div class=\"mtitle\">\n\t\t\t\n\t\t\t<a href=\"http:\/\/wdsa.nttu.edu.tw\/p\/404-1009-99907-1.php\">\n\t\t\t\t\u3010\u5b78\u52d9\u8655\u6821\u5b89\u4e2d\u5fc3\u3011\u8f49\u6559\u80b2\u90e8\u8acb\u5404\u7d1a\u5b78\u6821109\u5e74\u6691\u5047\u671f\u9593\u5b78\u751f\u6d3b\u52d5\u5b89\u5168\u6ce8\u610f\u4e8b\u9805\n\t\t\t<\/a>\n\t\t\t\n\t\t\t<span class=\"subsitename newline\"><\/span>\n\t\t<\/div>\n\t<\/div>\n\t\n<\/div>\n<\/div>\n\n\t\t<\/div><div class=\"row listBS\">\n\t\n\t\t\n\t\t<div class=\"d-item d-title col-sm-12\">\n<div class=\"mbox\">\n\t<div class=\"d-txt\">\n <div class=\"mtitle\">\n\t\t\t\n\t\t\t<a href=\"http:\/\/aa.nttu.edu.tw\/p\/404-1002-99911-1.php\">\n\t\t\t\t\u3010\u6559\u52d9\u8655\u3011109\u5b78\u5e74\u5ea6\u7b2c1\u5b78\u671f\u6559\u5e2b\u6559\u5b78\u5927\u7db1\u4e0a\u50b3\u53ca\u554f\u5377\u985e\u578b\u8a2d\u5b9a\u901a\u77e5-\u81f3(109\/8\/14\u622a\u6b62)\uff0c\u8acb\u5404\u4f4d\u5e2b\u9577\u7559\u610f\uff01\n\t\t\t<\/a>\n\t\t\t\n\t\t\t<span class=\"subsitename newline\"><\/span>\n\t\t<\/div>\n\t<\/div>\n\t\n<\/div>\n<\/div>\n\n\t\t\n\t\n\t<\/div>\n\n\n\n",
"stat":"over"
}
HTTP Inspection
Organize thoughts
<?php
require_once __DIR__ . '/vendor/autoload.php';
use GuzzleHttp\Client;
use Symfony\Component\DomCrawler\Crawler;
$latestNews = 'https://www.nttu.edu.tw/app/index.php?Action=mobileassocgmolist';
$client = new Client();
$formParams = [
'form_params' => [
'Cg' => '1009',
'IsTop' => '0',
'Op' => 'getpartlist',
'Page' => '1',
],
];
$response = $client->request('POST', $latestNews, $formParams);
$latestNewsString = (string)$response->getBody();
var_dump($latestNewsString);
Organize thoughts
<?php
require_once __DIR__ . '/vendor/autoload.php';
use GuzzleHttp\Client;
use Symfony\Component\DomCrawler\Crawler;
//......
$latestNewsString = json_decode($latestNewsString, true);
$crawler = new Crawler($content);
$crawler
->filter('a')
->reduce(function (Crawler $node, $i) {
global $titles;
global $links;
$titles[] = $node->text();
$links[] = $node->attr('href');
});
var_dump($links);
var_dump($titles);
Organize thoughts
array(4) {
[0]=>
string(45) "https://aa.nttu.edu.tw/p/404-1002-90907-1.php"
[1]=>
string(48) "https://enews.nttu.edu.tw/p/404-1045-90881-1.php"
[2]=>
string(48) "https://enews.nttu.edu.tw/p/404-1045-90876-1.php"
[3]=>
string(45) "https://aa.nttu.edu.tw/p/404-1002-90906-1.php"
}
array(4) {
[0]=>
string(93) "
【教務處】大一新生「運動、美術、音樂」績優獎學金申請公告
"
[1]=>
string(55) "
【秘書室】東大簡訊-13號刊(20190903)
"
[2]=>
string(75) "
【秘書室】恭賀!音樂學系何育真老師榮升副教授
"
[3]=>
string(74) "
【教務處】核發108-1舊生續領設籍臺東獎學金公告
"
}
DOM Tree
Basic Case Study Two
PM
I need to get these data sets
Intern
I cannot find these data sets
Element Inspection
Element Inspection
Target Data Fetching
-
reverse.csv (109)
-
reverse2.csv (108)
-
reverse3.csv (107)
Advanced/Useful crawling skills for everyone
ASP.NET
Subject System
Subject System
Invalid User Agent
<div>
<span id="lblMsg">The error message:</span><br />
<textarea name="txtMsg" rows="2" cols="20" id="txtMsg" class="input">
此頁面正在執行非同步回傳,
但 ScriptManager.SupportsPartialRendering 屬性卻是設定為 false。
請於非同步回傳時將此屬性設定為 true。</textarea><br />
<span id="lblStackTrace">The error stack trace:</span><br />
<textarea name="txtStackTrace" rows="2" cols="20" id="txtStackTrace" class="input">
於 System.Web.UI.ScriptManager.OnPageInitComplete(Object sender, EventArgs e)
於 System.Web.UI.Page.OnInitComplete(EventArgs e)
於 System.Web.UI.Page.ProcessRequestMain(Boolean includeStagesBeforeAsyncPoint, Boolean includeStagesAfterAsyncPoint)</textarea>
User-Agent
Subject System Crawling
<?php
require_once __DIR__ . '/vendor/autoload.php';
use GuzzleHttp\Client;
use Symfony\Component\DomCrawler\Crawler;
$publicCourses = 'https://infosys.nttu.edu.tw/n_CourseBase_Select/CourseListPublic.aspx';
$headers = [
'Host' => 'infosys.nttu.edu.tw',
'Connection' => 'keep-alive',
'Cache-Control' => 'max-age=0',
'Upgrade-Insecure-Requests' => '1',
'Sec-Fetch-Mode' => 'navigate',
'Sec-Fetch-User' => '?1',
'Accept' => 'text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,image/apng,*/*;q=0.8 application/signed-exchange;v=b3',
'Sec-Fetch-Site' => 'none',
'Referer' => 'https://infosys.nttu.edu.tw/',
'Accept-Encoding' => 'gzip, deflate, br',
'Accept-Language' => 'zh-TW,zh;q=0.9,en-US;q=0.8,en;q=0.7',
'User-Agent' => 'Mozilla/5.0 (Windows; U; Windows NT 5.1; en-US; rv:1.8.1.13) Gecko/20080311 Firefox/2.0.0.13',
];
$client = new Client(['cookies' => true]);
$response = $client->request('GET', $publicCourses, [
'debug' => true,
'headers' => $headers,
]);
Subject System Crawling
......
$publicCourseString = (string)$response->getBody();
$viewState = '__VIEWSTATE';
$eventValidation = '__EVENTVALIDATION';
$viewStateGenerator = '5D156DDA';
$crawler = new Crawler($publicCourseString);
$crawler
->filter('input[type="hidden"]')
->reduce(function (Crawler $node, $i) {
global $viewState;
global $eventValidation;
if ($node->attr('name') === $viewState) {
$viewState = $node->attr('value');
}
if ($node->attr('name') === $eventValidation) {
$eventValidation = $node->attr('value');
}
});
Subject System Crawling
......
$formParams = [
'form_params' => [
'ToolkitScriptManager1' => 'UpdatePanel1|Button3',
'ToolkitScriptManager1_HiddenField' => '',
'__EVENTTARGET' => '',
'__EVENTARGUMENT' => '',
'__LASTFOCUS' => '',
'__VIEWSTATE' => $viewState,
'__VIEWSTATEGENERATOR' => $viewStateGenerator,
'__SCROLLPOSITIONX' => '0',
'__SCROLLPOSITIONY' => '0',
'__VIEWSTATEENCRYPTED' => '',
'__EVENTVALIDATION' => $eventValidation,
'DropDownList1' => '1071',
'DropDownList6' => '1',
'DropDownList2' => '%',
'DropDownList3' => '%',
'DropDownList4' => '%',
'TextBox9' => '',
'DropDownList5' => '%',
'DropDownList7' => '%',
'TextBox1' => '',
'DropDownList8' => '%',
'TextBox6' => '0',
'TextBox7' => '14',
'__ASYNCPOST' => 'true',
'Button3' => '查詢',
],
Invalid Order Form Params
ue="/wEdAAbxmE99JhLisIrrBlSpleKvA3sa9CLAiY0NRgwF9EJQGh6kvJC1EopKW4ZDfj9Gj7oGHrYxvYrs5XDlrjyz+wVULvWz/wJ+1kADwg6S0w9SXo/Fg06KOWoBIRHuyh28DoVPLgf8rKyi7Ffc8EgW/ntaNx+wYA==" />
</div>
<div>
<span id="lblMsg">The error message:</span><br />
<textarea name="txtMsg" rows="2" cols="20" id="txtMsg" class="input">
無效的 Viewstate。
Client IP: 61.230.251.119
Port: 38320
Referer:
Path: /n_CourseBase_Select/CourseListPublic.aspx
User-Agent: Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/77.0.3865.90 Safari/537.36
Subject System Crawling
......
'headers' => [
'Sec-Fetch-Mode: cors',
'Origin: https://infosys.nttu.edu.tw',
'Accept-Encoding: gzip, deflate, br',
'Accept-Language: zh-TW,zh;q=0.9,en-US;q=0.8,en;q=0.7',
'X-Requested-With: XMLHttpRequest',
'Connection: keep-alive',
'X-MicrosoftAjax: Delta=true',
'Accept: */*',
'Cache-Control: no-cache',
'Referer: https://infosys.nttu.edu.tw/n_CourseBase_Select/CourseListPublic.aspx',
'Sec-Fetch-Site: same-origin',
'User-Agent' => 'Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/77.0.3865.90 Safari/537.36',
],
];
$response = $client->request('POST', $publicCourses, $formParams);
$coursesString = (string)$response->getBody();
var_dump($coursesString);
JavaScript
Web Page loading
-
Open Web browser and type URL
-
Web page loading
-
DOM Ready
-
DOMContentLoaded
-
Loading Assets (JS, CSS)
-
Execute JS
-
Page Ready
Web Page loading for Web Driver/Engine
Unmaintained
Headless Chrome
google-chrome-stable --headless --dump-dom "https://google.com"
# Install Command Wrapper for Headless Chrome
php ~/composer.phar require chrome-php/chrome:^0.8 -n
Case Study
A Podcast Site
A Podcast Site
A Podcast Site
Fetch Single Episode From Site
Fetch Single Episode From Site
sample.php
<?php
require_once './vendor/autoload.php';
use HeadlessChromium\BrowserFactory;
$url = 'https://baabao.com/single-episode/2792254?to=1596211873940&s=8TBkr';
$jsCode = "JSON.parse(JSON.parse(localStorage.getItem('localforage/listen_history/lastListenEpisode')))";
$browserFactory = new BrowserFactory('google-chrome-stable');
// starts headless chrome
$browser = $browserFactory->createBrowser();
// creates a new page and navigate to an url
$page = $browser->createPage();
$page->navigate($url)->waitForNavigation();
// get JSON with single episode info
$episodeInfo = $page->evaluate($jsCode)->getReturnValue();
var_dump($episodeInfo);
// bye
$browser->close();
sample.php result
/data/badoo-episode/sample.php:20:
array(26) {
'image' =>
string(95) "https://baabao-programs-images.s3.amazonaws.com/efee0e645c6c49bc87ef7972211be7c5--1_400_400.jpg"
'episode_data_url' =>
string(168) "https://d3hl6newtgi50f.cloudfront.net/0dd31152b9db415bbae239bcba2b61ba--1125+%E5%AF%B6%E8%B2%9D%E7%89%B9%E6%B4%BE%E5%93%A1+%E7%AC%AC40%E9%9B%86+45%E5%88%86%E9%90%98.mp3"
'emojis' =>
array(1) {
[0] =>
array(2) {
'description' =>
string(12) "給個鼓勵"
'count' =>
int(0)
}
}
'subscribed' =>
......
Anti-Crawler
Anti-Bot
Like A Human Reading
-
Sleep
-
Waiting
A funny story about crawling
Are you afraid of Captcha?
Don't be afraid firstly!
Captcha
Captcha
Captcha
OCR
Tesseract
Install Tesseract
-
Ubuntu 16.04
-
sudo apt-get install libtesseract3
-
sudo apt-get install tesseract-ocr
-
-
Ubuntu 18.04
-
sudo apt-get install libtesseract4
-
sudo apt-get install tesseract-ocr
-
Organize Thoughts
-
Request with GET method
-
Request with GET method
-
Request header
-
Add above response cookies
-
-
Captcha code Image
-
-
Request with POST Method
Organize Thoughts
<?php
require_once './vendor/autoload.php';
use GuzzleHttp\Client;
use Symfony\Component\DomCrawler\Crawler;
$loginUrl = 'https://www.leezen.com.tw/login.php';
$captchaUrl = 'https://www.leezen.com.tw/captcha/code.php';
$client = new Client(['cookies' => true]);
$response = $client->request('GET', $loginUrl);
$loginPageResponse = (string)$response->getBody();
$codeResponse = $client->request('GET', $captchaUrl);
Organize Thoughts
// ......
file_put_contents('./code.png', (string)$codeResponse->getBody());
exec('tesseract ./code.png code');
$code = file_get_contents('./code.txt');
preg_match('/(\d+)/', $code, $matched);
$code = $matched[0];
$crawler = new Crawler($loginPageResponse);
$token = '';
$crawler
->filter('input[type="hidden"]')
->reduce(function (Crawler $node, $i) {
global $token;
if ($node->attr('name') === 'token') {
$token = $node->attr('value');
}
});
Organize Thoughts
// ......
$formParams = [
'form_params' => [
'member' => 'email_or_phone_number',
'member_m' => 'email_or_phone_number',
'member_password' => 'password',
'Mode' => 'login',
'token' => $token,
'Turing2' => $code,
'login' => '登入',
],
];
// Do Login Action!
$postLoginUrl = 'https://www.leezen.com.tw/member_process.php';
$response = $client->request('POST', $postLoginUrl, $formParams);
$loginResponseString = (string)$response->getBody();
var_dump($loginResponseString);
Organize Thoughts
alert("登入成功,歡迎您回到天天里仁!");window.location.replace("index.php")
Command Wrapper Tesseract in PHP
More funny captchas
Sharing
-
How to figure out your side project via crawling?
- News Bot
- Automated Task
-
How to integrate existed project for your crawler?
- Cron job
- Automated job
- Logging for crawler
Summary
Any questions?
Web Scrapping, Crawler and PHP
By peter279k
Web Scrapping, Crawler and PHP
COSCUP 2020
- 1,421