2016-06-17 pixiv Inc. Study Session LT
@hakatashi
Warning:
...just say compressing “data.”
Recently I developed a Node.js module
that looks up the General_Category of a given Unicode character
> const category = require('general-category')
undefined
> category('Å')
'Lu'
> category(' ')
'Zs'
> category(' ', {long: true})
'Space_Separator'
> category('\u{1F600}', {version: '6.0.0'})
'Cn'
data.json
["Cc","Cc","Cc","Cc","Cc","Cc","Cc","Cc","Cc","Cc","Cc","Cc"...]
* and non-characters
"Xx",
Unicode versions
https://imgs.xkcd.com/comics/margin.png
{
"0": "Cc",
"32": "Zs",
"33": "Po",
...
}
["Cc","Cc","Cc",...,"Cc","Zs","Po","Po","Po",...]
Squashing consecutive categories down into an indexed value
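A minimal sketch of the squashing step, assuming the flat array holds one category per code point. The names `squashRuns` and `lookup` are hypothetical and are not the module's API; the lookup simply finds the last run start at or below the requested code point.

```javascript
// Hypothetical helper: collapse runs of identical categories into a
// { startCodePoint: category } map, like the object shown above.
function squashRuns(categories) {
  const index = {};
  let previous;
  categories.forEach((category, codePoint) => {
    if (category !== previous) {
      index[codePoint] = category; // a new run starts here
      previous = category;
    }
  });
  return index;
}

// Hypothetical lookup: binary search for the last run start <= codePoint.
function lookup(index, codePoint) {
  // Integer-like keys are enumerated in ascending numeric order.
  const starts = Object.keys(index).map(Number);
  let lo = 0;
  let hi = starts.length - 1;
  while (lo < hi) {
    const mid = (lo + hi + 1) >> 1;
    if (starts[mid] <= codePoint) lo = mid;
    else hi = mid - 1;
  }
  return index[starts[lo]];
}

// The first 36 code points: 32 controls, a space, then three punctuation marks.
const flat = [...Array(32).fill('Cc'), 'Zs', 'Po', 'Po', 'Po'];
const squashed = squashRuns(flat);
console.log(JSON.stringify(squashed)); // {"0":"Cc","32":"Zs","33":"Po"}
console.log(lookup(squashed, 34)); // Po
```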
$ du -h general-category@1.3/data/index.json
628K general-category@1.3/data/index.json
$ du -h `npm pack general-category@1.3`
183K general-category-1.3.0.tgz
(npm packages are distributed as tarballs)
http://msgpack.org/images/intro.png
$ du -h index.json
628K index.json
$ msgpack-cli index.json --out index.msgpack
$ du -h index.msgpack
420K index.msgpack
http://www.troll.me/images/the-most-interesting-man-in-the-world/no-not-at-all.jpg
$ ll
-rw-rwxr--+ 1 hakatashi hakatashi 626K May 23 18:21 index.json
-rwxrwxr-x+ 1 hakatashi hakatashi 86K Jun 17 11:24 index.json.gz
-rwxrwxr-x+ 1 hakatashi hakatashi 420K Jun 17 11:06 index.msgpack
-rwxrwxr-x+ 1 hakatashi hakatashi 68K Jun 17 11:25 index.msgpack.gz
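The same kind of comparison can be tried with Node's built-in `zlib`. This sketch uses a small synthetic array rather than the real `index.json`, so the byte counts it prints are not the ones in the listing above; it only illustrates that gzip already removes most of the repetition.

```javascript
const zlib = require('zlib');

// Synthetic stand-in for the real data: long runs of a few category names.
const sample = [
  ...Array(1000).fill('Cc'),
  ...Array(1000).fill('Lo'),
  ...Array(1000).fill('Cn'),
];
const json = Buffer.from(JSON.stringify(sample));
const gzipped = zlib.gzipSync(json);

console.log(`raw:  ${json.length} bytes`);
console.log(`gzip: ${gzipped.length} bytes`);

// Round-trip to confirm nothing was lost.
const restored = JSON.parse(zlib.gunzipSync(gzipped).toString());
console.log(restored.length === sample.length);
```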
Before:
const data = require('./index.json');
After:
const fs = require('fs');
const msgpack = require('msgpack-lite');
fs.readFile('index.msgpack', (err, buf) => {
  if (err) throw err;
  const data = msgpack.decode(buf);
});
...is it really an effective way to save disk space?
While I am pleased to see MessagePack's wider adoption, its pros and cons should be carefully considered, and there are many situations where it simply does not offer enough advantage to JSON.
My thoughts on MessagePack
{
"0": "Cc",
"32": "Zs",
"33": "Po",
"36": "Sc",
...
}
[
0, "Cc",
32, "Zs",
33, "Po",
36, "Sc",
...
]
[
0, "Cc",
32, "Zs",
1, "Po",
3, "Sc",
...
]
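The step from absolute positions (0, 32, 33, 36) to gaps (0, 32, 1, 3) is a delta encoding: each run start is stored as its distance from the previous one, which keeps the numbers small. A minimal sketch, with the hypothetical helper names `toDeltas` and `fromDeltas`:

```javascript
// Hypothetical helpers: turn absolute run starts into gaps and back.
function toDeltas(starts) {
  return starts.map((start, i) => (i === 0 ? start : start - starts[i - 1]));
}

function fromDeltas(deltas) {
  let position = 0;
  return deltas.map((delta) => (position += delta));
}

const starts = [0, 32, 33, 36];
const deltas = toDeltas(starts);
console.log(deltas); // [ 0, 32, 1, 3 ]
console.log(fromDeltas(deltas)); // [ 0, 32, 33, 36 ]
```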
[
0, 0,
32, 29,
1, 21,
3, 23,
...
]
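Replacing category names with small integers needs a fixed lookup table. The numbers on the slide (Cc=0, Po=21, Sc=23, Zs=29) are consistent with indexing into the 30 two-letter General_Category values sorted alphabetically; that the module builds its table this way is an assumption, and `CATEGORIES`, `toId` and `fromId` are hypothetical names.

```javascript
// The 30 two-letter General_Category values, sorted alphabetically.
// ASSUMPTION: the module numbers categories by position in a list like this;
// the slides only show that Cc=0, Po=21, Sc=23 and Zs=29.
const CATEGORIES = [
  'Cc', 'Cf', 'Cn', 'Co', 'Cs', 'Ll', 'Lm', 'Lo', 'Lt', 'Lu',
  'Mc', 'Me', 'Mn', 'Nd', 'Nl', 'No', 'Pc', 'Pd', 'Pe', 'Pf',
  'Pi', 'Po', 'Ps', 'Sc', 'Sk', 'Sm', 'So', 'Zl', 'Zp', 'Zs',
];

const toId = (category) => CATEGORIES.indexOf(category);
const fromId = (id) => CATEGORIES[id];

const pairs = [0, 'Cc', 32, 'Zs', 1, 'Po', 3, 'Sc'];
// Odd positions hold categories; even positions hold gaps and stay as-is.
const numeric = pairs.map((value, i) => (i % 2 === 1 ? toId(value) : value));
console.log(JSON.stringify(numeric)); // [0,0,32,29,1,21,3,23]
```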
export integer-array-to-buffer = ->
  it |> map ->
    it |> (+ 1) |> unfoldr ->
      if it is 0 then null
      else
        modulo = it %% 128
        base = it .>>. 7
        modulo .|.= 2~1000_0000 if base is 0
        [modulo, base]
  |> concat
  |> buffer-from
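For readers who do not use LiveScript, the function above can be sketched in plain JavaScript. Each integer is stored as n + 1 (so that 0 still produces a byte), split into 7-bit groups with the least significant group first, and the high bit is set on the last byte of each value. The sketch assumes `buffer-from` simply wraps the byte list in a Buffer; the name `integerArrayToBuffer` is a direct transliteration.

```javascript
// A JavaScript port of the LiveScript above (same logic).
// ASSUMPTION: buffer-from just turns the list of bytes into a Buffer.
function integerArrayToBuffer(integers) {
  const bytes = [];
  for (const n of integers) {
    let rest = n + 1; // shift by one so that 0 still produces a byte
    while (rest !== 0) {
      let group = rest % 128; // low 7 bits
      rest >>= 7;
      if (rest === 0) group |= 0b10000000; // flag the last byte of this value
      bytes.push(group);
    }
  }
  return Buffer.from(bytes);
}

console.log(integerArrayToBuffer([0, 0, 32, 29]).toString('hex')); // 8181a19e
console.log(integerArrayToBuffer([200]).toString('hex')); // 4981
```

Note that the flag marks the final byte of a value, which is the opposite of the LEB128 convention, where the high bit means "more bytes follow."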
[
0, 0,
32, 29,
1, 21,
3, 23,
...
]
"~~^a}i{g}i{h}l}i}e}i}m}i|qti|e{i|udh}i}l}f}n}f}..."
(7.0 KB gzipped)
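The sample string lines up with the numeric array if each character code is read as the bitwise complement of one byte of the variable-length encoding: `~` is 0x7E, the complement of 0x81, which is how 0 is stored. This is an inference from the sample shown on the slide, not something the slides state, and `decodeString` is a hypothetical name. A decoder under that assumption:

```javascript
// ASSUMPTION (inferred from the sample, not stated in the slides): each
// character code is the bitwise complement of one encoded byte.
function decodeString(text) {
  const integers = [];
  let value = 0;
  let shift = 0;
  for (let i = 0; i < text.length; i++) {
    const byte = ~text.charCodeAt(i) & 0xff; // undo the complement
    value |= (byte & 0x7f) << shift; // low 7 bits carry data
    shift += 7;
    if (byte & 0x80) {
      // High bit set: this was the last byte of the current value.
      integers.push(value - 1); // undo the +1 applied when encoding
      value = 0;
      shift = 0;
    }
  }
  return integers;
}

console.log(JSON.stringify(decodeString('~~^a}i{g'))); // [0,0,32,29,1,21,3,23]
```

Under this reading, every value below 127 becomes a single printable ASCII character, which would explain why the string looks like text.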
https://imgflip.com/s/meme/Satisfied-Seal.jpg
$ du -h `npm pack general-category@1.3`
184K general-category-1.3.0.tgz
$ du -h `npm pack general-category@1.4`
24K general-category-1.4.0.tgz