Compressing Data
2016-06-17 pixiv Inc. Study Session LT
@hakatashi
Warning:
These are NOT slides about DEFLATE etc.
...I just mean compressing “data.”
general-category
Recently I developed a Node.js module
that looks up the General_Category of a given Unicode character
Usage
> const category = require('general-category')
undefined
> category('Å')
'Lu'
> category(' ')
'Zs'
> category(' ', {long: true})
'Space_Separator'
> category('\u{1F600}', {version: '6.0.0'})
'Cn'
How should we build the data file to accomplish this?
Easiest Answer
data.json
["Cc","Cc","Cc","Cc","Cc","Cc","Cc","Cc","Cc","Cc","Cc","Cc"...]
“Wait.
Unicode™ is so huge that it contains 1,114,112 characters*!”
* and non-characters
Then the data file will consume at least
5 bytes × 1,114,112 × 22 = 123MB
(5 bytes for each "Xx", entry; 22 Unicode versions)
123MB
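The estimate can be checked with a few lines of JavaScript. This is only a back-of-the-envelope sketch; the constant names are illustrative and not taken from the module.

```javascript
// Rough size of the naive data file: one '"Xx",' entry per code point per version.
const BYTES_PER_ENTRY = '"Xx",'.length; // 5 ASCII characters, one byte each
const CODE_POINTS = 0x10FFFF + 1;       // 1,114,112 code points
const UNICODE_VERSIONS = 22;            // number of versions, as stated on the slide

const totalBytes = BYTES_PER_ENTRY * CODE_POINTS * UNICODE_VERSIONS;
console.log(totalBytes);                        // 122552320
console.log(Math.round(totalBytes / 1e6) + 'MB'); // 123MB
```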
So we have to compress the information considerably.
https://imgs.xkcd.com/comics/margin.png
My first attempt:
Squash each run of consecutive identical categories into a single entry, keyed by the first code point of the run
["Cc","Cc","Cc",...,"Cc","Zs","Po","Po","Po",...]
=>
{
"0": "Cc",
"32": "Zs",
"33": "Po",
...
}
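The squashing step, and the lookup it implies, could be sketched as follows. The function names `squash` and `lookup` are hypothetical, and the input is a shortened toy array rather than real Unicode data; the module's actual code is not shown in these slides.

```javascript
// Keep only the code points where the category changes.
function squash(categories) {
  const index = {};
  let previous;
  categories.forEach((category, codePoint) => {
    if (category !== previous) {
      index[codePoint] = category;
      previous = category;
    }
  });
  return index;
}

// Find the entry with the greatest start code point <= codePoint.
// Integer-like keys are enumerated in ascending numeric order.
function lookup(index, codePoint) {
  let found;
  for (const start of Object.keys(index)) {
    if (Number(start) > codePoint) break;
    found = index[start];
  }
  return found;
}

// Toy data, not the real table.
const flat = ['Cc', 'Cc', 'Cc', 'Zs', 'Po', 'Po', 'Po', 'Sc'];
const index = squash(flat);
console.log(JSON.stringify(index)); // {"0":"Cc","3":"Zs","4":"Po","7":"Sc"}
console.log(lookup(index, 5));      // Po
```

A linear scan keeps the sketch short; a binary search over the sorted start points would be the natural choice for a table of this size.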
This worked for me
$ du -h general-category@1.3/data/index.json
628K general-category@1.3/data/index.json
123MB => 628KB
(saved 99.5%)
Total size of the package
$ du -h `npm pack general-category@1.3`
183K general-category-1.3.0.tgz
(npm package is distributed as tarball)
Let's go further
“Why JSON?
If you care about the file size, you should consider using a binary format.”
Fair enough.
We can try msgpack.
msgpack
http://msgpack.org/images/intro.png
$ du -h index.json
628K index.json
$ msgpack-cli index.json --out index.msgpack
$ du -h index.msgpack
420K index.msgpack
Then we could easily save the disk space for good!
http://www.troll.me/images/the-most-interesting-man-in-the-world/no-not-at-all.jpg
One well-known fact about msgpack is that its size advantage over JSON shrinks once both files are gzipped.
$ ll
-rw-rwxr--+ 1 hakatashi hakatashi 626K May 23 18:21 index.json
-rwxrwxr-x+ 1 hakatashi hakatashi 86K Jun 17 11:24 index.json.gz
-rwxrwxr-x+ 1 hakatashi hakatashi 420K Jun 17 11:06 index.msgpack
-rwxrwxr-x+ 1 hakatashi hakatashi 68K Jun 17 11:25 index.msgpack.gz
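With the sizes listed above, msgpack saves about 33% uncompressed (626K to 420K) but only about 21% after gzip (86K to 68K). A comparison like this can be reproduced with Node's built-in `zlib`; the helper name `sizes` is hypothetical, and the exact numbers depend on the gzip level.

```javascript
const fs = require('fs');
const zlib = require('zlib');

// Return the raw and gzipped size of a file, in bytes.
function sizes(path) {
  const raw = fs.readFileSync(path);
  return { raw: raw.length, gzipped: zlib.gzipSync(raw).length };
}

// Usage, assuming both files exist in the current directory:
// console.log(sizes('index.json'), sizes('index.msgpack'));
```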
The advantage of JSON in Node.js is that you don't even have to decode it to load data.
Before:
const data = require('./index.json');
After:
const fs = require('fs');
const msgpack = require('msgpack-lite');
fs.readFile('index.msgpack', (err, buf) => {
  if (err) throw err;
  const data = msgpack.decode(buf);
});
To support Browserify, we would also have to bundle a pure-JS implementation of the msgpack decoder.
...is it really an effective way to save disk space?
msgpack is not a cure-all.
The author of msgpack also states:
“While I am pleased to see MessagePack's wider adoption, its pros and cons should be carefully considered, and there are many situations where it simply does not offer enough advantage to JSON.”
(from “My thoughts on MessagePack”)
<Going back to the story>
We are compressing this data
{
"0": "Cc",
"32": "Zs",
"33": "Po",
"36": "Sc",
...
}
628KB
No need for hash. Convert into array...
[
0, "Cc",
32, "Zs",
33, "Po",
36, "Sc",
...
]
396KB
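The conversion from the object to a flat array might look like the following sketch. The function name `toFlatArray` is hypothetical.

```javascript
// Flatten { start: category } into [start, category, start, category, ...].
function toFlatArray(index) {
  const flat = [];
  // Integer-like keys are enumerated in ascending numeric order.
  for (const start of Object.keys(index)) {
    flat.push(Number(start), index[start]);
  }
  return flat;
}

const index = { 0: 'Cc', 32: 'Zs', 33: 'Po', 36: 'Sc' };
console.log(JSON.stringify(toFlatArray(index)));
// [0,"Cc",32,"Zs",33,"Po",36,"Sc"]
```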
Record diff instead of actual codepoints...
[
0, "Cc",
32, "Zs",
1, "Po",
3, "Sc",
...
]
376KB
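Recording differences works because the start code points are sorted, so each difference is a small non-negative number. A minimal sketch, with hypothetical function names `toDeltas` and `fromDeltas`:

```javascript
// Replace each start code point with its distance from the previous one.
function toDeltas(flat) {
  const out = [];
  let previous = 0;
  for (let i = 0; i < flat.length; i += 2) {
    out.push(flat[i] - previous, flat[i + 1]);
    previous = flat[i];
  }
  return out;
}

// Inverse: accumulate the differences back into absolute code points.
function fromDeltas(deltas) {
  const out = [];
  let current = 0;
  for (let i = 0; i < deltas.length; i += 2) {
    current += deltas[i];
    out.push(current, deltas[i + 1]);
  }
  return out;
}

const flat = [0, 'Cc', 32, 'Zs', 33, 'Po', 36, 'Sc'];
console.log(JSON.stringify(toDeltas(flat)));
// [0,"Cc",32,"Zs",1,"Po",3,"Sc"]
```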
Use indexed numbers for the enumerated category names...
[
0, 0,
32, 29,
1, 21,
3, 23,
...
]
235KB
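Replacing category names with indices could be sketched like this. The `CATEGORIES` table below is a hypothetical, shortened list; the ordering the module really uses (the one that yields 29 for "Zs") is not shown in these slides.

```javascript
// Hypothetical, shortened category table (NOT the module's real ordering).
const CATEGORIES = ['Cc', 'Po', 'Sc', 'Zs'];

// Odd positions hold category names; swap each for its index in the table.
function toCategoryIndices(deltas, categories) {
  return deltas.map((value, i) => (i % 2 === 1 ? categories.indexOf(value) : value));
}

// Inverse: turn the indices at odd positions back into names.
function fromCategoryIndices(numbers, categories) {
  return numbers.map((value, i) => (i % 2 === 1 ? categories[value] : value));
}

const deltas = [0, 'Cc', 32, 'Zs', 1, 'Po', 3, 'Sc'];
console.log(JSON.stringify(toCategoryIndices(deltas, CATEGORIES)));
// [0,0,32,3,1,1,3,2]
```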
Now, this is just a serialized array of integers.
[
0, 0,
32, 29,
1, 21,
3, 23,
...
]
There is some prior research on compressing integer arrays.
- Lemire, Daniel, and Leonid Boytsov. "Decoding billions of integers per second through vectorization." Software: Practice and Experience 45.1 (2015): 1-29.
- Lemire, Daniel, Leonid Boytsov, and Nathan Kurz. "SIMD compression and the intersection of sorted integers." Software: Practice and Experience (2015).
But I'll take a much simpler approach.
An arbitrary integer can be represented as a variable-length array of bytes,
and a sequence of bytes can be stored as a string in JSON.
export integer-array-to-buffer = ->
  it |> map ->
    it |> (+ 1) |> unfoldr ->
      if it is 0 then null
      else
        modulo = it %% 128
        base = it .>>. 7
        modulo .|.= 2~1000_0000 if base is 0
        [modulo, base]
  |> concat
  |> buffer-from
The implementation takes only about 10 lines
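For readers who don't know LiveScript, here is a plain JavaScript port of the encoder above. Each integer is incremented by one (so that 0 still produces a byte), split into 7-bit groups from the low end, and the high bit is set on the last byte of each integer. The decoder `bufferToIntegerArray` is not in the slides; it is an assumed inverse added for illustration. The sketch assumes non-negative integers below 2^32 - 1.

```javascript
// Encode non-negative integers as variable-length bytes.
function integerArrayToBuffer(integers) {
  const bytes = [];
  for (const n of integers) {
    let rest = n + 1;               // shift by one so that 0 is not encoded as nothing
    while (rest !== 0) {
      let low = rest % 128;         // lowest 7 bits
      rest = rest >>> 7;            // remaining higher bits
      if (rest === 0) low |= 0x80;  // high bit marks the last byte of this integer
      bytes.push(low);
    }
  }
  return Buffer.from(bytes);
}

// Assumed inverse of the encoder (not from the slides).
function bufferToIntegerArray(buffer) {
  const integers = [];
  let value = 0;
  let shift = 0;
  for (const byte of buffer) {
    value += (byte & 0x7f) * 2 ** shift;
    if (byte & 0x80) {              // terminator byte: emit and reset
      integers.push(value - 1);
      value = 0;
      shift = 0;
    } else {
      shift += 7;
    }
  }
  return integers;
}

console.log(integerArrayToBuffer([0, 0, 32, 29])); // <Buffer 81 81 a1 9e>
```

Note that the terminator convention is the opposite of LEB128, where the high bit marks a continuation rather than the final byte.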
Now...
[
0, 0,
32, 29,
1, 21,
3, 23,
...
]
"~~^a}i{g}i{h}l}i}e}i}m}i|qti|e{i|udh}i}l}f}n}f}..."
107KB
(7.0KB gzipped)
Satisfied
https://imgflip.com/s/meme/Satisfied-Seal.jpg
Total size of the package
$ du -h `npm pack general-category@1.3`
184K general-category-1.3.0.tgz
$ du -h `npm pack general-category@1.4`
24K general-category-1.4.0.tgz
Lessons:
- Your module is downloaded every time it or one of its dependents is installed. Consider reducing the package size.
- MessagePack is not the only way to achieve binary packing. Use it carefully.
EOF
By Koki Takahashi
2016-06-17 Weekly Study Session in pixiv Inc. LT