Compressing Data

2016-06-17 pixiv Inc. Study Session LT

@hakatashi

Warning:

These are NOT slides about DEFLATE etc.

...just about compressing “data.”

general-category

Recently I developed a Node.js module

that looks up the General_Category of a given Unicode character

Usage

> const category = require('general-category')
undefined
> category('Å')
'Lu'
> category(' ')
'Zs'
> category(' ', {long: true})
'Space_Separator'
> category('\u{1F600}', {version: '6.0.0'})
'Cn'

How do we build the data file to accomplish this?

Easiest Answer

data.json

["Cc","Cc","Cc","Cc","Cc","Cc","Cc","Cc","Cc","Cc","Cc","Cc"...]

“Wait.

Unicode™ is so huge that it contains 1,114,112 characters*!”

* and non-characters

Then the data files will consume at least

5 bytes ("Xx",) × 1,114,112 characters × 22 Unicode versions

= 123MB

123MB
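The estimate can be checked with a few lines of JavaScript. This is only a back-of-the-envelope sketch: the constant names are made up for illustration, and the count of 22 Unicode versions is taken from the slide.

```javascript
// Rough size of the naive data.json files, one per Unicode version.
const BYTES_PER_ENTRY = '"Xx",'.length; // 5 bytes per code point
const CODE_POINTS = 0x10FFFF + 1;       // 1,114,112
const UNICODE_VERSIONS = 22;            // figure from the slide

const totalBytes = BYTES_PER_ENTRY * CODE_POINTS * UNICODE_VERSIONS;
console.log(totalBytes);                          // 122552320
console.log(Math.round(totalBytes / 1e6) + 'MB'); // 123MB
```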

So we have to compress this information considerably.

https://imgs.xkcd.com/comics/margin.png

My first attempt:

Squash runs of consecutive identical categories into a single indexed entry

["Cc","Cc","Cc",...,"Cc","Zs","Po","Po","Po",...]

{
  "0": "Cc",
  "32": "Zs",
  "33": "Po",
  ...
}
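A minimal sketch of the squashing step, assuming the input is the naive one-entry-per-code-point array. The function names `squash` and `lookup` are hypothetical, not the module's actual API.

```javascript
// Keep only the code points where the category changes.
function squash(categories) {
  const index = {};
  let previous;
  categories.forEach((category, codePoint) => {
    if (category !== previous) {
      index[codePoint] = category;
      previous = category;
    }
  });
  return index;
}

// Look up a code point: use the last run start that is <= codePoint.
function lookup(index, codePoint) {
  // Integer-like keys are enumerated in ascending numeric order.
  const starts = Object.keys(index).map(Number);
  let found = starts[0];
  for (const start of starts) {
    if (start > codePoint) break;
    found = start;
  }
  return index[found];
}

const sample = [...Array(32).fill('Cc'), 'Zs', 'Po', 'Po', 'Po', 'Sc'];
const index = squash(sample);
console.log(index);             // { '0': 'Cc', '32': 'Zs', '33': 'Po', '36': 'Sc' }
console.log(lookup(index, 34)); // Po
```

A real lookup would probably use binary search over the run starts; the linear scan keeps the sketch short.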

This worked for me

$ du -h general-category@1.3/data/index.json
628K    general-category@1.3/data/index.json

123MB => 628KB 

(saved 99.5%)

Total size of the package

$ du -h `npm pack general-category@1.3`
183K    general-category-1.3.0.tgz

(npm package is distributed as tarball)

Let's go further

“Why JSON?

If you care about the file size, you should consider using a binary format.”

Fair enough.

We can try msgpack.

msgpack

http://msgpack.org/images/intro.png

$ du -h index.json
628K    index.json
$ msgpack-cli index.json --out index.msgpack
$ du -h index.msgpack
420K    index.msgpack

So we can easily save disk space for good!

http://www.troll.me/images/the-most-interesting-man-in-the-world/no-not-at-all.jpg

A well-known fact about msgpack is that its size advantage over JSON shrinks once the files are gzipped.

$ ll
-rw-rwxr--+ 1 hakatashi hakatashi 626K May 23 18:21 index.json
-rwxrwxr-x+ 1 hakatashi hakatashi  86K Jun 17 11:24 index.json.gz
-rwxrwxr-x+ 1 hakatashi hakatashi 420K Jun 17 11:06 index.msgpack
-rwxrwxr-x+ 1 hakatashi hakatashi  68K Jun 17 11:25 index.msgpack.gz
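Since npm ships packages as gzipped tarballs, the gzipped size is the number worth measuring. A small sketch using Node's built-in zlib follows; the sample object is a synthetic stand-in, not the real index.json.

```javascript
const zlib = require('zlib');

// Synthetic stand-in for index.json (the real file is not reproduced here).
const sample = {};
const categories = ['Lu', 'Ll', 'Po', 'Zs'];
for (let i = 0; i < 5000; i++) {
  sample[i * 3] = categories[i % categories.length];
}

const json = Buffer.from(JSON.stringify(sample));
const gzipped = zlib.gzipSync(json);

console.log(`raw: ${json.length} bytes, gzipped: ${gzipped.length} bytes`);
```

Running the same measurement on each candidate format gives a fairer comparison than the raw file sizes.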

The advantage of JSON in Node.js is that you don't even have to decode it to load data.

Before:

const data = require('./index.json');

After:

const fs = require('fs');
const msgpack = require('msgpack-lite');

fs.readFile('index.msgpack', (err, buf) => {
    if (err) throw err;
    const data = msgpack.decode(buf);
});

To support Browserify, we also have to bundle a pure-JS implementation of the msgpack decoder.

...is this really an effective way to save disk space?

msgpack is not a cure-all.

The author of msgpack also states:

While I am pleased to see MessagePack's wider adoption, its pros and cons should be carefully considered, and there are many situations where it simply does not offer enough advantage to JSON.
My thoughts on MessagePack

<Going back to the story>

We are compressing this data

{
  "0": "Cc",
  "32": "Zs",
  "33": "Po",
  "36": "Sc",
  ...
}

628KB

No need for a hash. Convert it into an array...

[
  0, "Cc",
  32, "Zs",
  33, "Po",
  36, "Sc",
  ...
]

396KB
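The conversion can be sketched like this; `toFlatArray` is a hypothetical helper name, and the input is the small excerpt shown on the slides.

```javascript
// Flatten {start: category} into [start, category, start, category, ...].
function toFlatArray(index) {
  // Integer-like keys are enumerated in ascending numeric order.
  return Object.keys(index)
    .flatMap((start) => [Number(start), index[start]]);
}

const flat = toFlatArray({ 0: 'Cc', 32: 'Zs', 33: 'Po', 36: 'Sc' });
console.log(JSON.stringify(flat)); // [0,"Cc",32,"Zs",33,"Po",36,"Sc"]
```

The saving comes from dropping the quotes and colon around every key.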

Record diffs instead of the actual code points...

[
  0, "Cc",
  32, "Zs",
  1, "Po",
  3, "Sc",
  ...
]

376KB
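A sketch of this delta step and its inverse, assuming the flat [codePoint, category, ...] layout from the previous slide. The function names are hypothetical.

```javascript
// Replace each start code point with its distance from the previous one.
function toDeltas(flat) {
  const out = [];
  let previous = 0;
  for (let i = 0; i < flat.length; i += 2) {
    out.push(flat[i] - previous, flat[i + 1]);
    previous = flat[i];
  }
  return out;
}

// Inverse: a running sum restores the absolute code points.
function fromDeltas(deltas) {
  const out = [];
  let codePoint = 0;
  for (let i = 0; i < deltas.length; i += 2) {
    codePoint += deltas[i];
    out.push(codePoint, deltas[i + 1]);
  }
  return out;
}

const deltas = toDeltas([0, 'Cc', 32, 'Zs', 33, 'Po', 36, 'Sc']);
console.log(JSON.stringify(deltas)); // [0,"Cc",32,"Zs",1,"Po",3,"Sc"]
```

Most runs are short, so the deltas need fewer digits than the absolute code points.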

Use index numbers in place of the enumerated category names...

[
  0, 0,
  32, 29,
  1, 21,
  3, 23,
  ...
]

235KB
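A sketch of the name-to-index mapping. The slides do not say how the categories are numbered; sorting the 30 abbreviations alphabetically is an assumption, although it does reproduce the numbers shown above (Cc = 0, Po = 21, Sc = 23, Zs = 29).

```javascript
// The 30 General_Category abbreviations in alphabetical order (assumed order).
const CATEGORIES = [
  'Cc', 'Cf', 'Cn', 'Co', 'Cs', 'Ll', 'Lm', 'Lo', 'Lt', 'Lu',
  'Mc', 'Me', 'Mn', 'Nd', 'Nl', 'No', 'Pc', 'Pd', 'Pe', 'Pf',
  'Pi', 'Po', 'Ps', 'Sc', 'Sk', 'Sm', 'So', 'Zl', 'Zp', 'Zs',
];

// Odd positions hold category names; swap them for their index.
const toIndexed = (pairs) =>
  pairs.map((value, i) => (i % 2 === 1 ? CATEGORIES.indexOf(value) : value));

// Inverse: swap the indexes back for names.
const fromIndexed = (ints) =>
  ints.map((value, i) => (i % 2 === 1 ? CATEGORIES[value] : value));

const ints = toIndexed([0, 'Cc', 32, 'Zs', 1, 'Po', 3, 'Sc']);
console.log(JSON.stringify(ints)); // [0,0,32,29,1,21,3,23]
```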

Now, this is just a serialized array of integers.

[
  0, 0,
  32, 29,
  1, 21,
  3, 23,
  ...
]

There is some prior research on compressing integer arrays.

  • Lemire, Daniel, and Leonid Boytsov. "Decoding billions of integers per second through vectorization." Software: Practice and Experience 45.1 (2015): 1-29.
  • Lemire, Daniel, Leonid Boytsov, and Nathan Kurz. "SIMD compression and the intersection of sorted integers." Software: Practice and Experience (2015).

But I'll take a much simpler approach.

An arbitrary integer can be represented as a variable-length byte array

And bytes can be stored as a string in JSON.

export integer-array-to-buffer = ->
  it |> map ->
    it |> (+ 1) |> unfoldr ->
      if it is 0 then null
      else
        modulo = it %% 128
        base = it .>>. 7
        modulo .|.= 2~1000_0000 if base is 0
        [modulo, base]
  |> concat
  |> buffer-from

The implementation takes only about 10 lines
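For readers who don't know LiveScript, here is a plain JavaScript transliteration of the encoder as I read it: add 1 to each integer so that 0 still produces a byte, emit 7-bit groups least significant first, and set the high bit on the final byte. The decoder is my own addition for the round trip and is not taken from the slides.

```javascript
// Encode one non-negative integer into 7-bit groups, low group first.
// The high bit (0x80) marks the last byte of each integer.
function encodeInteger(n) {
  const bytes = [];
  let rest = n + 1; // +1 so that 0 still yields one byte
  while (rest !== 0) {
    let low = rest % 128;
    rest = rest >>> 7;
    if (rest === 0) low |= 0x80;
    bytes.push(low);
  }
  return bytes;
}

const integerArrayToBuffer = (integers) =>
  Buffer.from(integers.flatMap(encodeInteger));

// Decoder (not from the slides): accumulate groups until a terminator byte.
function bufferToIntegerArray(buffer) {
  const integers = [];
  let value = 0;
  let shift = 0;
  for (const byte of buffer) {
    value += (byte & 0x7f) * 2 ** shift;
    shift += 7;
    if (byte & 0x80) {
      integers.push(value - 1); // undo the +1 from encoding
      value = 0;
      shift = 0;
    }
  }
  return integers;
}

const encoded = integerArrayToBuffer([0, 0, 32, 29, 1, 21, 3, 23]);
console.log([...encoded]); // [129, 129, 161, 158, 130, 150, 132, 152]
```

Every value below 127 fits in a single byte, which covers most of the deltas and all of the category indexes. How the bytes are then mapped to a printable JSON string is not shown in the slides.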

Now...

[
  0, 0,
  32, 29,
  1, 21,
  3, 23,
  ...
]
"~~^a}i{g}i{h}l}i}e}i}m}i|qti|e{i|udh}i}l}f}n}f}..."

107KB

(7.0KB gzipped)

Satisfied

https://imgflip.com/s/meme/Satisfied-Seal.jpg

Total size of the package

$ du -h `npm pack general-category@1.3`
184K    general-category-1.3.0.tgz
$ du -h `npm pack general-category@1.4`
24K     general-category-1.4.0.tgz

Lessons:

  • Your module is downloaded every time it or one of its dependents is installed. Consider reducing the package size.
  • MessagePack is not the only way to achieve binary packing. Use it carefully.

EOF
