CALIL Inc.
Ryuuji Yosimoto
Code4Lib 2015
データ分析がしたい
お金がない
人がいない
面倒なことはやらない
小規模クラスタ(Hadoopごっこ)ではメリットが少ない
Google Big Query ・・・ データ従量課金
Amazon Redshift ・・・ インスタンス課金
Azure HDInsight ・・・ インスタンス課金 (Hadoop)
ほぼ無料
$5 / 処理データ 1TB
$0.020 / ストレージ 1GB
※2016年1月から最大400%の値上げ
負荷の高いクエリーは 1TB / $20
管理画面付き ・・・ プログラミングの知識がなくても使える
カーリルでの活用例
システムエラーの検出
スクレイピング結果の精度分析
(テキスト形式のログを統計する
簡易的なプログラムから移行)
HOLDING_SCHEMA = [{"name": "timestamp", "type": "TIMESTAMP", "description": u"処理日時", "mode": "REQUIRED"}, {"name": "systemid", "type": "STRING", "description": u"システムID", "mode": "REQUIRED"}, {"name": "caller", "type": "STRING", "description": u"呼び出し元", "mode": 'nullable'}, {"name": "isbn", "type": "STRING", "description": u"ISBN", "mode": "REQUIRED"}, {"name": "book", "type": "RECORD", "description": u"所蔵情報", "fields": [ {"name": "libid", "type": "INTEGER", "description": u"LIBID", 'mode': 'nullable'}, {"name": "libkey", "type": "STRING", "description": u"LIBKEY", 'mode': 'nullable'}, {"name": "status", "type": "STRING", "description": u"貸出状態", 'mode': 'nullable'}, ], "mode": "repeated"}, {"name": "success", "type": "BOOLEAN", "description": u"処理成功フラグ", 'mode': 'REQUIRED'}, {"name": "reason_of_error", "type": "STRING", "description": u"エラー情報", 'mode': 'nullable'}, {"name": "latency", "type": "FLOAT", "description": u"処理時間", 'mode': 'nullable'}]
{"reason_of_error": null, "latency": 2.0176000595092773, "book": [{"status": "\u8cb8\u51fa\u53ef", "libid": "103405", "libkey": "\u4e2d\u592e"}, {"status": "\u8cb8\u51fa\u4e2d", "libid": "103418", "libkey": "\u6771\u6d66\u548c"}, {"status": "\u8cb8\u51fa\u53ef", "libid": "103412", "libkey": "\u5927\u5bae\u897f\u90e8"}, {"status": "\u8cb8\u51fa\u53ef", "libid": "103406", "libkey": "\u5317\u56f3\u66f8\u9928"}, {"status": "\u8cb8\u51fa\u53ef", "libid": "103413", "libkey": "\u5bae\u539f"}], "systemid": "Saitama_Saitama", "success": "true", "isbn": "4796683550", "timestamp": "2015-09-05 02:53:51", "caller": "API:0"}
1件あたりのログデータ
api.calil.jp
Task worker
Task worker
Task worker
Task worker
Google Big Query
スクレイピングサーバーの
ログデータを集約
JSON (newline-delimited)
ファイルをGCEにアップロードして、Big Queryにロード
NDL
NII
180万タイトル
30館
カーリルAPI
OPAC
2015年1月~3月
5400万回、カーリルAPIを叩く
NDL OAI-PMH → JSONに変換
4カ月 → 4週間くらいになったらしい
NII OpenSearch → JSONに変換
↓
遊んでみたい人は
データセット共有できます
SELECT * FROM (
SELECT
IFNULL(ndl.isbn ,nii.nii_isbn) AS isbn ,
IFNULL(ndl.title ,nii.nii_title) AS title ,
IFNULL(ndl.creater ,nii.nii_creater) AS creater,
IFNULL(ndl.publisher ,nii.nii_publisher) AS publisher ,
LEFT(IFNULL(ndl.date ,nii.nii_date),4) AS date ,
ndl.isbn_count AS ndl_isbn_count,
nii.nii_isbn_count AS nii_isbn_count,
ndl.owner_count AS ndl_owner_count
FROM [caliljp.ndlbook] ndl
LEFT OUTER JOIN EACH (
SELECT isbn as nii_isbn,
title as nii_title,
creater as nii_creater,
publisher as nii_publisher,
owner_count as nii_owner_count,
isbn_count as nii_isbn_count,
date as nii_date FROM [caliljp.niibook]
) AS nii ON ndl.isbn=nii.nii_isbn
) WHERE (REGEXP_MATCH(isbn,'^4'))
SELECT systemid,
isbn,
LAST(timestamp) as timestamp,
(LAST(book.libkey) IS NOT NULL) as is_holding,
FROM [caliljp.holding]
WHERE systemid in
('Tokyo_Kokubunji',
'Tokyo_Nishitokyo',
'Tokyo_Hino',
'Tokyo_Hamura',
'Tokyo_Mizuho',
'Tokyo_Higashikurume',
'Tokyo_Tachikawa',
'Tokyo_Okutama',
'Tokyo_Fussa',
'Tokyo_Ome',
'Tokyo_Kodaira',
'Tokyo_Kiyose',
'Tokyo_Chofu',
'Tokyo_Hachioji',
'Tokyo_Musashino',
'Tokyo_Koganei',
'Tokyo_Akishima',
'Tokyo_Tama',
'Tokyo_Fuchu',
'Tokyo_Inagi',
'Tokyo_Komae',
'Tokyo_Hinode',
'Tokyo_Mitaka',
'Tokyo_Akiruno',
'Tokyo_Machida',
'Tokyo_Kunitachi',
'Tokyo_Higashiyamato',
'Tokyo_Higashimurayama',
'Tokyo_Musashimurayama') AND success=true
GROUP EACH BY systemid,isbn;
SELECT sum,count(*) as titles,sum*count(*) as books FROM
(
SELECT isbn,count(*) as count,sum(is_holding) as sum FROM [caliljp.holding_tama_distinct] GROUP BY isbn
) where count>=26 and sum>=1 group by sum order by sum
Text
Text
Text
所蔵自治体数ごとの集計
SELECT t1.d,t2.formal FROM
(SELECT book.libid,DATE(min(timestamp)) as d from [caliljp.holding] where isbn='4778314506' and book.libid is not NULL group by book.libid order by d)
AS t1
LEFT JOIN [caliljp.library] as t2 on t1.libid=t2.libid