Ujang Fahmi
Web scraping (also called data scraping or web extraction) is the process of extracting data from a website's pages using techniques such as copy-paste, HTML parsing, DOM parsing, and others.
HTTP
Instructions according to the API
Data is the new differentiator
Bernard Marr, Forbes
An essential skill to acquire in today’s digital world
Download the HTML using read_html()
Extract specific nodes using html_nodes()
Extract elements of a node using html_text() , html_attr() , html_table() , etc.
Pre-process
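The three steps above can be sketched on a toy page; the HTML snippet and the `story` class name below are made up for illustration, with `minimal_html()` standing in for `read_html(url)`:

```r
library(rvest)

# a toy page, so the pipeline can run without a live URL (illustrative only)
page <- minimal_html('
  <a class="story" href="/a.html">First story</a>
  <a class="story" href="/b.html">Second story</a>')

# 1. read the HTML (minimal_html() stands in for read_html(url))
# 2. select the nodes of interest
links <- page %>% html_nodes("a.story")

# 3. extract an element of each node
html_text(links)          # the link texts
html_attr(links, "href")  # the link targets
```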
<?xml version="1.0" encoding="UTF-8"?>
<bookstore>
<book>
<title lang="en">Harry Potter</title>
<price>29.99</price>
</book>
<book>
<title lang="en">Learning XML</title>
<price>39.95</price>
</book>
</bookstore>
| Expression | Result |
|---|---|
| `nodename` | Selects all nodes with the name "nodename" |
| `/` | Selects from the root node |
| `//` | Selects nodes in the document from the current node that match the selection, no matter where they are |
| `.` | Selects the current node |
| `..` | Selects the parent of the current node |
| `@` | Selects attributes |
| Path Expression | Result |
|---|---|
| `bookstore` | Selects all nodes with the name "bookstore" |
| `/bookstore` | Selects the root element bookstore. Note: if the path starts with a slash (`/`) it always represents an absolute path to an element! |
| `bookstore/book` | Selects all book elements that are children of bookstore |
| `//book` | Selects all book elements no matter where they are in the document |
| `bookstore//book` | Selects all book elements that are descendants of the bookstore element, no matter where they are under the bookstore element |
| `//@lang` | Selects all attributes that are named lang |
| `//book/title \| //book/price` | Selects all the title AND price elements of all book elements |
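Applied to the bookstore document above, these path expressions can be tried with the xml2 package (which rvest builds on); a minimal sketch:

```r
library(xml2)

doc <- read_xml('
<bookstore>
  <book><title lang="en">Harry Potter</title><price>29.99</price></book>
  <book><title lang="en">Learning XML</title><price>39.95</price></book>
</bookstore>')

# //book/title : all title elements that are children of a book
xml_text(xml_find_all(doc, "//book/title"))

# //@lang : all attributes named lang
xml_text(xml_find_all(doc, "//@lang"))

# union: the title AND price elements of every book (four nodes)
length(xml_find_all(doc, "//book/title | //book/price"))
```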
```r
library(tidyverse)
library(rvest)

page <- read_html("https://www.kdnuggets.com/news/top-stories.html")

# xpath: article titles
page %>%
  html_nodes(xpath = "//ol[1]/li/a/b") %>%
  html_text(trim = TRUE)

# xpath: article links, with blank hrefs turned into NA
page %>%
  html_nodes(xpath = "//ol[1]//li//a") %>%
  html_attr("href") %>%
  ifelse(. == " ", NA, .)
```
Goal: get the article titles of the "most shared" section from the page https://www.kdnuggets.com/news/top-stories.html
| Selector | Example | Example description |
|---|---|---|
| `.class` | `.intro` | Selects all elements with class="intro" |
| `.class1.class2` | `<div class="name1 name2">...</div>` | Selects all elements with both name1 and name2 set within its class attribute |
| `element>element` | `div > p` | Selects all `<p>` elements where the parent is a `<div>` element |
| `element+element` | `div + p` | Selects all `<p>` elements that are placed immediately after `<div>` elements |
| `:nth-child(n)` | `p:nth-child(2)` | Selects every `<p>` element that is the second child of its parent |
| `element1~element2` | `p ~ ul` | Selects every `<ul>` element that is preceded by a `<p>` element |
| `element,element` | `div, p` | Selects all `<div>` elements and all `<p>` elements |
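A few of these selectors can be tried on a small made-up page (the snippet below is illustrative, not from a real site):

```r
library(rvest)

page <- minimal_html('
  <div class="intro">Intro text</div>
  <div><p>Child of a div</p></div>
  <p>Plain paragraph</p>')

html_text(html_nodes(page, ".intro"))   # elements with class="intro"
html_text(html_nodes(page, "div > p"))  # <p> whose parent is a <div>
length(html_nodes(page, "div, p"))      # all <div> and all <p> elements
```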
```r
library(tidyverse)
library(rvest)

page <- read_html("https://www.kdnuggets.com/news/top-stories.html")

# css: article titles (tibble() replaces the deprecated data_frame())
page %>%
  html_nodes(css = "ol:nth-child(3) > li > a > b") %>%
  html_text(trim = TRUE) %>%
  tibble(title = .)

# css: article links
page %>%
  html_nodes(css = "ol:nth-child(3) > li > a") %>%
  html_attr("href") %>%
  tibble(url = .)
```
Using a CSS selector
Pagination: each page's URL is built from three parts — a base URL, a page number, and an end URL (page 1, page 2, …).
```r
library(rvest)

baseurls <- "https://jdih.dprd-diy.go.id/?pagenum="
page <- 0:2
endurls <- "&totalrow=27&cat=5"

# empty list
urls <- list()

# build the list of urls
for (i in seq_along(page)) {
  url <- paste0(baseurls, page[i], endurls)
  urls[[i]] <- url
}
```
Goal: build the list of URLs from which the data will be scraped
```r
# empty list to hold the results
undang2 <- list()

# loop over the urls and get the data from each page
for (i in seq_along(urls)) {
  pages <- read_html(urls[[i]])

  # the "tentang" (subject) text of each regulation
  tentang <- pages %>%
    html_nodes("tr:nth-child(3) td~ td+ td") %>%
    html_text(trim = TRUE) %>%
    tibble(tentang = .)

  # the download link, with blank hrefs turned into NA
  download <- pages %>%
    html_nodes("tr:nth-child(6) > td") %>%
    html_node("a") %>%
    html_attr("href") %>%
    ifelse(. == " ", NA, .) %>%
    tibble(download = .)

  # turn the relative links into absolute URLs
  download$download <- paste0("https://jdih.dprd-diy.go.id/", download$download)

  undang <- bind_cols(tentang, download)
  undang2[[i]] <- undang
}
```
Goal: extract the "tentang" (subject) text and the download link from each page
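The loop above leaves one tibble per page in a list; dplyr's bind_rows() stacks them into a single table. A sketch with stand-in data in place of the scraped pages:

```r
library(dplyr)

# stand-in for undang2: one small tibble per scraped page (illustrative data)
undang2 <- list(
  tibble(tentang = c("a", "b")),
  tibble(tentang = c("c", "d"))
)

# stack the per-page results into one table
semua <- bind_rows(undang2)
nrow(semua)  # one row per scraped row across all pages
```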
```r
library(rvest)
library(tidyverse)

pages <- read_html("http://www.dpr.go.id/jdih/pp")

# option 1: pick the first table node, then parse it
hasil <- html_nodes(pages, "table")[[1]] %>%
  html_table()

# option 2: parse all tables, then keep the first
hasil <- pages %>%
  html_nodes("table") %>%
  html_table()
hasil <- hasil[[1]]

# option 3: just inspect all the parsed tables
pages %>%
  html_nodes("table") %>%
  html_table()
```
Goal: extract a table and its contents from a web page