BI Dashboard Implementation

vincentchiang@kkbox.com

Senior Developer Advocate @

江品陞 @

About me

Developer Advocate @
Functional Programming Lover
Data Engineer
Cryptocurrency Trader

zaoldyeck @

當一間公司的產品

在市場上已經發展成熟

要如何繼續成長？

垂直整合

提供音樂串流服務

舉辦實體演唱會、售票

投資娛樂產業、負責音樂版權

水平整合

如何讓音樂在更多的載體

或者情境上被使用？

KKBOX 是一間軟體公司

我們並不擅長硬體設計製造

KKBOX Open API (developer.kkbox.com)

KKBOX Open API 提供什麼？

KKBOX 全曲庫資料
排行榜、分類歌單、個人推薦歌單
會員收藏歌曲
打造專屬於該平台的 KKBOX

透過 Open API 我們可以

降低合作夥伴與我們的串接門檻
從 Access Log 分析，使我們暸解
- 何種平台具有發展性
- 在其他平台上什麼數據熱門
- 何種面向的產品會吸引使用者
- 異於官方產品的使用者習慣
最終，幫助我們制定商業決策
BI Dashboard

2018-1 KKBOX Open API 武林大會

Questions

如何知道用戶喜不喜歡我們的產品？
從 Open API 上線以來的流量成長
如何知道哪一個 Client 用量最大？
用戶透過 Open API 搜尋哪些歌曲 / 歌手 / 專輯 / 歌單？
如何知道哪些 Resource 最受歡迎？

Today we focus on

Not Google Analytics
About API Access Log processing
for Business Intelligence

BI Dashboard Implemantaion

Log Storage
- facebookarchive/scribe, AWS S3
ETL（Extract-Transform-Load）
- Scala
Data Storage
- Google Cloud BigQuery
Data Visualization
- Google Data Studio

ETL Issues

資料處理邏輯複雜
中途易出錯，需要做例外處理
數據量大，需要高效高併發

我們期望的資料結構

client_id	url_path	header_key	herder_value	request_ parameter_key	request_ parameter_value	reponse_status	response_body	timestamp
"demo id"	"/v1.1/search"	"Authorization"	"Bearer access_token"	"q"	"五月天"	200	"派對動物"	"2017-11-21T15:00:03+08:00"
"demo id2"	"/v1.1/charts"	"Authorization"	"Bearer access_token"	"territory"	"TW"	200		"2017-11-21T15:00:03+09:00"

Local

Scribe Server

Central

Scribe Server

Web API Server

Load Balancer

Web API Server

Local

Scribe Server

Local

Scribe Server

Local

Scribe Server

Local

Scribe Server

Central

Scribe Server

Client API Request

Access Log

{
  "timestamp": "2017-11-21T15:00:03+08:00",
  "client_id": "demo_client_id",
  "user_id": null,
  "request": {
    "timestamp": "2017-11-21T15:00:03+08:00",
    "uri": "\/v1.1\/shared-playlists\/4nUZM-TY2aVxZ2xaA-\/tracks?territory=TW",
    "protocol": "HTTP\/1.1",
    "path": "\/v1.1\/shared-playlists\/4nUZM-TY2aVxZ2xaA-\/tracks",
    "headers": [
      {
        "key": "accept",
        "value": "application\/json, text\/plain, *\/*"
      },
      {
        "key": "user-agent",
        "value": "KKBOX JavaScript SDK"
      },
      {
        "key": "host",
        "value": "api.kkbox.com"
      },
      {
        "key": "connection",
        "value": "close"
      },
      {
        "key": "x-forwarded-for",
        "value": "220.130.164.2"
      },
      {
        "key": "x-forwarded-proto",
        "value": "https"
      }
    ],
    "query_parameters": [
      {
        "key": "territory",
        "value": "TW"
      }
    ],
    "body": ""
  },
  "response": {
    "status": 200,
    "version": "1.1",
    "headers": [
      {
        "key": "cache-control",
        "value": "no-cache"
      },
      {
        "key": "content-type",
        "value": "application\/json"
      },
      {
        "key": "date",
        "value": "Tue, 21 Nov 2017 07:00:03 GMT"
      }
    ],
    "body": "{\"data\":[{\"id\":\"SsJItLYfDXYxalPgtU\",\"name\":\"\\u9019\\u500b\\u4e16\\u754c\",\"duration\":240195,\"url\":\"https:\\\/\\\/event.kkbox.com\\\/conten"
  }
}

{
  "timestamp": "2017-11-21T15:00:03+08:00",
  "client_id": "demo_client_id",
  "user_id": null,
  "request": {
    "timestamp": "2017-11-21T15:00:03+08:00",
    "uri": "\/v1.1\/shared-playlists\/4nUZM-TY2aVxZ2xaA-\/tracks?territory=TW",
    "protocol": "HTTP\/1.1",
    "path": "\/v1.1\/shared-playlists\/4nUZM-TY2aVxZ2xaA-\/tracks",
    "headers": {
      "accept": "application\/json, text\/plain, *\/*",
      "user_agent": "KKBOX JavaScript SDK",
      "host": "api.kkbox.com",
      "connection": "close",
      "x_forwarded_for": "220.130.164.2",
      "x_forwarded_proto": "https"
    },
    "query_parameters": {
      "territory": "TW"
    },
    "body": ""
  },
  "response": {
    "status": 200,
    "version": "1.1",
    "headers": {
      "cache_control": "no-cache",
      "content_type": "application\/json",
      "date": "Tue, 21 Nov 2017 07:00:03 GMT"
    },
    "body": "{\"data\":[{\"id\":\"SsJItLYfDXYxalPgtU\",\"name\":\"\\u9019\\u500b\\u4e16\\u754c\",\"duration\":240195,\"url\":\"https:\\\/\\\/event.kkbox.com\\\/conten"
  }
}

Data Pipeline

Transforming
Cleaning
Merging

user_id	status	last_login

user_id	song_id	client_id	device_id	listen_at

動態資料

靜態資料

user_id	status	last_login	song_id	client_id	device_id	listen_at

Title Text

Running on JVM
Highly performance
Mixing object-oriented and functional programming
Easily to make program concurrency and parallelism

Programming Lang Choosing

> 1,000,000 or Not?

Benefits for data pipeline

Type safety
No side effect
Lambda Expression
Concurrency
all for Parallelism

Type safety

val optionWithInt: Option[Int] = Some(1)
val emptyOption: Option[Int] = None

if (optionWithInt.isDefined) println(optionWithInt.get) //1

emptyOption match {
  case Some(int) => println(int)   // 1
  case None => println("is empty") // is empty
}

val value = emptyOption.getOrElse(0) // 0

No Side Effect

/* Cause Side effect */
var value = 0

def addValue(): Unit = {
  value += 1
}
println(value) // 1

val values = mutable.Seq(1, 2, 3, 4, 5)
for (i <- values.indices) {
  values(i) = values(i) + 1
}
println(values) // ArrayBuffer(2, 3, 4, 5, 6)

/* No side effect */
val value = 0

def addValue(v: Int): Int = v + 1
val newValue = addValue(value)
println(value) // 0
println(newValue) // 1

val values = Seq(1, 2, 3, 4, 5)
val newValues = values.map(_ + 1)
println(values) // List(1, 2, 3, 4, 5)
println(newValues) // List(2, 3, 4, 5, 6)

Lambda Expression

// no Lambda Expression
import scala.collection.mutable

val seq: Seq[String] = Seq("1", "2", "3", "4", "5", "6", "7", "8", "9", "10")
var result: Seq[Int] = mutable.Seq()

for (i <- seq.indices) {
  val int = i.toInt
  if (int > 5) result = result :+ int * 10
}

println(result) // ArrayBuffer(60, 70, 80, 90)

// with Lambda Expression

val seq: Seq[String] = Seq("1", "2", "3", "4", "5", "6", "7", "8", "9", "10")

val result = seq
  .map(_.toInt)
  .filter(_ > 5)
  .map(_ * 10)



println(result) // ArrayBuffer(60, 70, 80, 90)

Concurrency

import scala.concurrent.ExecutionContext.Implicits.global
import scala.concurrent.duration.Duration
import scala.concurrent.{Await, Future}

def stringToInt(s: String) = Future {
  Thread.sleep(1000)
  s.toInt
}

def add(i: Int) = Future {
  Thread.sleep(1000)
  i + 1
}

def multiply(i: Int) = Future {
  Thread.sleep(1000)
  i * 10
}

val task = stringToInt("1") flatMap {
    i => add(i)
  } flatMap {
    i => multiply(i)
  }

Await.result(task, Duration.Inf)




















val task2: Future[Future[Int]] = for {
  int <- stringToInt("1")
  int2 <- add(int)
} yield multiply(int, int2)


Await.result(task2, Duration.Inf)

Parallelism

import scala.collection.parallel.ParSeq
import scala.concurrent.ExecutionContext.Implicits.global
import scala.concurrent.duration.Duration
import scala.concurrent.{Await, Future}

val data: Seq[String] = Seq("1", "2", "3", "4", "5", "6", "7", "8", "9", "10")
val parSeq: ParSeq[Int] = data
  .par
  .map(_.toInt)
  .filter(_ > 5)
  .map(_ + 1)


def stringToInt(s: String) = Future {
  Thread.sleep(1000)
  s.toInt
}

val seqOfFuture: ParSeq[Future[Int]] = data
  .par
  .map(stringToInt)
val futureOfSeq: Future[ParSeq[Int]] = Future.sequence(eventualInt)

Await.result(futureOfSeq, Duration.Inf)

KKBOX Open API Server

Scribe Server

Data Transforming

Data Cleaning

Data Merging

Data Upload

Google Cloud BigQuery

Google Data Studio

Access Log


@throws[IOException]
@throws[InterruptedException]
@throws[TimeoutException]
def writeFileToTable(path: Path, tableName: String = "open_api_log", datasetName: String = "production", isAutodetect: Boolean = false): Long = { // [START writeFileToTable]
  val yes = true
  val tableId: TableId = TableId.of(datasetName, tableName)
  val writeChannelConfiguration: WriteChannelConfiguration = WriteChannelConfiguration
    .newBuilder(tableId)
    .setCreateDisposition(CreateDisposition.CREATE_IF_NEEDED)
    .setFormatOptions(FormatOptions.json)
    .setAutodetect(isAutodetect)
    .setIgnoreUnknownValues(yes)
    .setWriteDisposition(WriteDisposition.WRITE_APPEND)
    .setMaxBadRecords(99999999)
    .build
  val writer: TableDataWriteChannel = bigQuery.writer(writeChannelConfiguration)
  // Write data to writer
  try {
    val stream: OutputStream = Channels.newOutputStream(writer)
    try
      Files.copy(path, stream)
    finally if (stream != null) stream.close()
  }
  // Get load job
  var job: Job = writer.getJob
  job = job.waitFor()
  val stats: LoadStatistics = job.getStatistics[LoadStatistics]
  stats.getEndTime
  // [END writeFileToTable]
}

Data Visualization Issues

資料載入速度慢
與 Standard SQL 不相容（例如不能 join）
資料不民主

BigQuery + Data Studio

BigQuery

優異的資料查詢速度
便宜的價格（資料記得要做 Auto Partition）
與 Standard SQL 相容

Data Studio

用 G suite 管理儀表板權限（類似 Google Doc）
視覺化方式建立圖表

成果

發現 IoT 用戶的使用習慣與 App 不同
- 用量高峰發生在晚間
- 有聲書類別意外擠進排行榜
從數據中設法勾勒出特定使用者的輪廓

分眾

水平發展策略

開放資料

分析數據用量與使用情境

與互補型合作夥伴串接

描繪特殊受眾輪廓

推出分眾產品

Thanks!

Data Visualization

Visualization Framework Choosing

Google Cloud BigQuery + Google Data Studio
Reasons
- Google Cloud BigQuery
  - 資料將不斷累積 => 不適合堆在單一伺服器
  - 性能優越
  - 可以用 SQL 做 Query
- Google Data Studio
  - 可直接拉 BigQuery 的資料做視覺化
  - G Suite
    - 內建權限管理
    - 易實現資料民主化

Weaknesses of BigQuery

Not schemaless

Questions

如何知道用戶喜不喜歡我們的產品？
從 Open API 上線以來的流量成長
如何知道哪一個 Client 用量最大？
用戶透過 Open API 搜尋哪些歌曲 / 歌手 / 專輯 / 歌單？
如何知道哪些 Resource 最受歡迎？

Questions

如何知道用戶喜不喜歡我們的產品？
- 如何定義喜歡？ => Domain Knowledge
- 用戶黏著度 => Custom SQL
從 Open API 上線以來的流量成長＝> 趨勢
如何知道哪一個 Client 用量最大？ => 比較
用戶透過 Open API 搜尋哪些歌曲 / 歌手 / 專輯 / 歌單？＝> 分佈
如何知道哪些 Resource 最受歡迎？ => 列表 / 分佈

如何知道用戶喜不喜歡我們的產品？

-- Custom SQL
select 
  DATE(TIMESTAMP_MICROS(timestamp*1000),"Asia/Taipei") as date,
  client_key,
  client_name,
  count(user_id)/count(distinct(user_id)) as listen_times_per_user
from production.open_api_log
where request.path="/v1.1/music_streaming" and
      developer_id is null and
      user_id is not null and
      response.status=200
group by date, client_id, client_key, client_name

從 Open API 上線以來的流量成長

如何知道哪一個 Client 用量最大？

用戶透過 Open API 搜尋哪些歌曲 / 歌手 / 專輯 / 歌單？

Resource 使用分佈

還有一種叫散佈圖

Summary

Business Logic
- 定義問題
- 釐清需求
Data Processing
- No side effect
- Type safe
- Concurrency
Data Visualization
- Domain Knowledge => Custom SQL
- 看趨勢 => 趨勢圖
- 比大小 => 長條圖
- 看分佈 => 分佈圖
- 關聯性 => 散佈圖