Compressing Data

2016-06-17 pixiv Inc. Study Session LT


These are NOT slides
about DEFLATE etc.


...just about compressing “data.”


Recently I developed a Node.js module

to look up the General_Category of a given Unicode character

> const category = require('general-category')
> category('Å')
> category(' ')
> category(' ', {long: true})
> category('\u{1F600}', {version: '6.0.0'})


How to build the data file to accomplish this?

Easiest Answer




“Unicode™ is so huge that it contains 1,114,112 characters*!”

* and non-characters

Then the data file will consume at least

5 bytes × 1,114,112 code points × 22 Unicode versions ≈ 123 MB
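
The estimate can be checked with a few lines of JavaScript. The 5 bytes per entry is taken from the slide as given; the constant names are made up for this sketch.

```javascript
// Naive layout: one fixed-size entry per code point, per Unicode version.
const BYTES_PER_ENTRY = 5;     // figure taken from the slide
const CODE_POINTS = 0x110000;  // U+0000..U+10FFFF = 1,114,112
const UNICODE_VERSIONS = 22;

const naiveBytes = BYTES_PER_ENTRY * CODE_POINTS * UNICODE_VERSIONS;

console.log(naiveBytes);                     // 122552320
console.log((naiveBytes / 1e6).toFixed(1));  // 122.6 (MB)
```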


So we have to compress the information considerably.

My first attempt:

  "0": "Cc",
  "32": "Zs",
  "33": "Po",

Squash each run of consecutive code points that share a category into a single entry, keyed by the first code point of the run
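
A minimal sketch of this idea, not the module's actual code. `categoryOf`, `squash`, `lookup`, and the toy classifier are hypothetical names, and the binary search is just one possible way to read the table back.

```javascript
// Keep a code point only where the category changes (start of a run).
// `categoryOf` is a hypothetical function: code point -> category name.
function squash(categoryOf, limit) {
    const table = {};
    let previous = null;
    for (let cp = 0; cp < limit; cp++) {
        const category = categoryOf(cp);
        if (category !== previous) {
            table[cp] = category;
            previous = category;
        }
    }
    return table;
}

// Find the last run start that is <= cp, using binary search.
function lookup(table, cp) {
    // Integer-like keys are enumerated in ascending numeric order.
    const starts = Object.keys(table).map(Number);
    let lo = 0;
    let hi = starts.length - 1;
    while (lo < hi) {
        const mid = (lo + hi + 1) >> 1;
        if (starts[mid] <= cp) lo = mid; else hi = mid - 1;
    }
    return table[starts[lo]];
}

// Toy classifier covering only U+0000..U+0023, for illustration.
const toy = (cp) => (cp < 32 ? 'Cc' : cp === 32 ? 'Zs' : 'Po');
const table = squash(toy, 0x24);

console.log(JSON.stringify(table)); // {"0":"Cc","32":"Zs","33":"Po"}
console.log(lookup(table, 10));     // Cc
console.log(lookup(table, 35));     // Po
```

In real use the sorted list of run starts would be computed once, not on every lookup.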

This worked for me

$ du -h general-category@1.3/data/index.json
628K    general-category@1.3/data/index.json

123MB => 628KB 

(saved 99.5%)

Total size of the package

$ du -h `npm pack general-category@1.3`
183K    general-category-1.3.0.tgz

(npm package is distributed as tarball)

Let's go further

“Why JSON?

If you care about the file size, you should consider using a binary format.”

Fair enough.

We can try msgpack.


$ du -h index.json
628K    index.json
$ msgpack-cli index.json --out index.msgpack
$ du -h index.msgpack
420K    index.msgpack

Then we could easily save disk space for good!

A well-known fact about msgpack is that its size advantage over JSON shrinks once both files are gzipped.

$ ll
-rw-rwxr--+ 1 hakatashi hakatashi 626K May 23 18:21 index.json
-rwxrwxr-x+ 1 hakatashi hakatashi  86K Jun 17 11:24 index.json.gz
-rwxrwxr-x+ 1 hakatashi hakatashi 420K Jun 17 11:06 index.msgpack
-rwxrwxr-x+ 1 hakatashi hakatashi  68K Jun 17 11:25 index.msgpack.gz
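
Since the package ships as a gzipped tarball, it may be worth measuring both sizes before choosing a format. A small sketch using Node's built-in `zlib`; the `sizes` helper and the sample data are hypothetical.

```javascript
const zlib = require('zlib');

// Report the raw and gzipped size of a payload, in bytes.
function sizes(payload) {
    const raw = Buffer.from(payload);
    return { raw: raw.length, gzipped: zlib.gzipSync(raw).length };
}

// Hypothetical sample: a highly repetitive JSON document.
const sample = JSON.stringify(Array.from({ length: 1000 }, (_, i) => [i, 'Lo']));
console.log(sizes(sample));
```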

The advantage of JSON in Node.js is that you don't even have to decode it yourself to load the data.

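// JSON: require() reads and parses the file in one call.
const data = require('./index.json');

// msgpack: read the file, then decode it yourself.
const fs = require('fs');
const msgpack = require('msgpack-lite');

fs.readFile('index.msgpack', (err, buf) => {
    if (err) throw err;
    const data = msgpack.decode(buf);
});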



To support Browserify, we also have to bundle a pure-JS implementation of a msgpack decoder. Is it really an effective way to save my disk?

msgpack is not a cure-all.

The author of msgpack also states:

While I am pleased to see MessagePack's wider adoption, its pros and cons should be carefully considered, and there are many situations where it simply does not offer enough advantage to JSON.
My thoughts on MessagePack

<Going back to the story>

We are compressing this data

  "0": "Cc",
  "32": "Zs",
  "33": "Po",
  "36": "Sc",


No need for a hash. Convert it into an array...

  0, "Cc",
  32, "Zs",
  33, "Po",
  36, "Sc",


Record the difference from the previous code point instead of the actual code point...

  0, "Cc",
  32, "Zs",
  1, "Po",
  3, "Sc",
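
The delta step could look roughly like this. `toDeltas` and `fromDeltas` are illustrative names, not taken from the module.

```javascript
// Input: flat array of [codePoint, category, codePoint, category, ...].
// Output: same layout, but each code point is replaced by its distance
// from the previous one.
function toDeltas(flat) {
    const out = [];
    let previous = 0;
    for (let i = 0; i < flat.length; i += 2) {
        out.push(flat[i] - previous, flat[i + 1]);
        previous = flat[i];
    }
    return out;
}

// Inverse: accumulate the deltas to recover the absolute code points.
function fromDeltas(deltas) {
    const out = [];
    let current = 0;
    for (let i = 0; i < deltas.length; i += 2) {
        current += deltas[i];
        out.push(current, deltas[i + 1]);
    }
    return out;
}

const flat = [0, 'Cc', 32, 'Zs', 33, 'Po', 36, 'Sc'];
const deltas = toDeltas(flat);
console.log(JSON.stringify(deltas)); // [0,"Cc",32,"Zs",1,"Po",3,"Sc"]
```

Small deltas matter because small integers need fewer bytes once they are packed.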


Use index numbers for the enumerated category names...

  0, 0,
  32, 29,
  1, 21,
  3, 23,
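
One way to get such indices is a fixed list of category names. The alphabetical ordering below is an assumption: it happens to reproduce the sample values on the slide (Cc = 0, Po = 21, Sc = 23, Zs = 29), but the module's real table may differ.

```javascript
// The 30 General_Category abbreviations, sorted alphabetically.
// Assumption: ordering inferred from the sample values on the slide.
const CATEGORIES = [
    'Cc', 'Cf', 'Cn', 'Co', 'Cs', 'Ll', 'Lm', 'Lo', 'Lt', 'Lu',
    'Mc', 'Me', 'Mn', 'Nd', 'Nl', 'No', 'Pc', 'Pd', 'Pe', 'Pf',
    'Pi', 'Po', 'Ps', 'Sc', 'Sk', 'Sm', 'So', 'Zl', 'Zp', 'Zs',
];

const toIndex = (name) => CATEGORIES.indexOf(name);
const toName = (index) => CATEGORIES[index];

console.log(['Cc', 'Zs', 'Po', 'Sc'].map(toIndex)); // [ 0, 29, 21, 23 ]
```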


Now, this is just a serialized array of integers.

  0, 0,
  32, 29,
  1, 21,
  3, 23,

There is some prior research on compressing integer arrays.

  • Lemire, Daniel, and Leonid Boytsov. "Decoding billions of integers per second through vectorization." Software: Practice and Experience 45.1 (2015): 1-29.
  • Lemire, Daniel, Leonid Boytsov, and Nathan Kurz. "SIMD compression and the intersection of sorted integers." Software: Practice and Experience (2015).

But I'll take a much simpler approach.

An arbitrary integer can be represented as a variable-length byte array.

And a byte sequence can be stored as a string in JSON.
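
A sketch of the bytes-as-string idea, assuming each byte is mapped to the character with the same code (Node's `latin1` encoding). Whether the module stores its data exactly this way is not stated in the slides.

```javascript
// Assumption: byte value N is stored as the character U+00NN (latin1).
const bytes = Buffer.from([0x81, 0xa1, 0x9e]);

// Bytes -> string -> JSON text, and back again.
const json = JSON.stringify(bytes.toString('latin1'));
const restored = Buffer.from(JSON.parse(json), 'latin1');

console.log(restored.equals(bytes)); // true
```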

export integer-array-to-buffer = ->
  it |> map ->
    it |> (+ 1) |> unfoldr ->
      return null if it is 0
      modulo = it %% 128
      base = it .>>. 7
      modulo .|.= 2~1000_0000 if base is 0
      [modulo, base]
  |> concat
  |> buffer-from

The implementation takes only 10 lines.
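
For readers who don't know LiveScript, here is a rough JavaScript port of the function above. The decoder is not shown in the talk; `bufferToIntegerArray` is a hypothetical inverse written for this sketch.

```javascript
// Encode each integer n as n + 1 in little-endian base-128 digits.
// The high bit (0x80) marks the last byte of each integer.
function integerArrayToBuffer(integers) {
    const bytes = [];
    for (const n of integers) {
        let rest = n + 1;
        while (rest !== 0) {
            let digit = rest % 128;
            rest = Math.floor(rest / 128);
            if (rest === 0) digit |= 0x80;
            bytes.push(digit);
        }
    }
    return Buffer.from(bytes);
}

// Hypothetical inverse: accumulate 7-bit digits until a byte with the
// high bit set closes the current integer.
function bufferToIntegerArray(buffer) {
    const integers = [];
    let value = 0;
    let scale = 1;
    for (const byte of buffer) {
        value += (byte & 0x7f) * scale;
        scale *= 128;
        if (byte & 0x80) {
            integers.push(value - 1);
            value = 0;
            scale = 1;
        }
    }
    return integers;
}

const encoded = integerArrayToBuffer([0, 0, 32, 29, 1, 21, 3, 23]);
console.log(encoded.toString('hex')); // 8181a19e82968498
```

Every value below 127 fits in a single byte, which is the common case after the delta step.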


  0, 0,
  32, 29,
  1, 21,
  3, 23,


(7.0KB gzipped)


Total size of the package

$ du -h `npm pack general-category@1.3`
184K    general-category-1.3.0.tgz
$ du -h `npm pack general-category@1.4`
24K     general-category-1.4.0.tgz


  • Your module is downloaded every time it or one of its dependents is installed. Consider reducing the package size.
  • MessagePack is not the only way to achieve binary packing. Use it carefully.

