vu128: Efficient variable-length integers

lifthrasiir

As with many other variable-length integer encodings, its efficiency really depends on the expected input distribution. One glaring issue in this regard is that its 2-byte encoding is rather wasted: only 6.25% out of all possible encodings are useful. This can be possibly fixed by making the 2-byte encoding represent 240..495 instead of 0..255. You can even make use of some bits from the first byte, so that for example the sequence goes like: 0=00 1=01 2=02 .. BF=BF C0=C0C0 C1=C0C1 .. 30FF=F0FF 3100=F13100 .. (all in hex).

A radically different design, used by for example 7z, is also possible if you put all continuation bits into the first byte. For example, instead of having `1xxxxxxx 1yyyyyyy 0zzzzzzz` for 3-byte encoding, use `110xxxxx yyyyyyyy zzzzzzzz`. This is essentially as compact as more common VLQ and LEB128 encodings, but you can determine the length of the whole sequence from the first byte alone! Unfortunately this is less suitable for integers larger than 64 bits, because then you run out of bits in the first byte. But it ends up with a rather elegant encoding for 64-bit integers, where `11111110` indicates 7 bytes and `11111111` indicates 8 bytes with no additional 0 bit needed.

> An afternoon of web searching has failed to uncover a prior name for this particular encoding, so I'm going to name it vu128 and describe it below. If any reader knows of prior art, please send it my way.

I don't know if it has other names, but a similar encoding can be seen from X.690 [1] and the "additional information" bit field of CBOR [2].

[1] https://en.wikipedia.org/wiki/X.690#Length_octets

[2] https://www.rfc-editor.org/rfc/rfc8949.html#section-3

terrelln

Variable-length integer schemes are great when you are encoding one integer. But when you are encoding a list of integers, you should really consider a different scheme.

Variable-length integer schemes generally interleave the control bits & data bits. This means you don't know where the next integer starts until you at least partially decode the current integer.

When encoding a list of integers, you would rather put all the control information together, and all the data together. This generally allows for significantly more efficient decoding using SIMD.

Const-me

When I need them I usually implement the codec used in MKV format: https://www.rfc-editor.org/rfc/rfc8794.html

Similar performance characteristics, i.e. just the first byte tells the encoded length, no need to test every byte. Luckily, most programming languages have intrinsic functions for BSWAP instruction.

kunley

There is also SQLite varint, https://sqlite.org/src4/doc/trunk/www/varint.wiki

IIRC different from both Leb128 and VLQ, and quite neat. It's for 64bit (u)ints, though

danking00

If I’m following, this presents an alternative format and implementation to LEB128 which encodes and decodes substantially faster. Notably, the implementation is quite simple. Cool! And agreed that modern CPUs really suffer from branches.

Should I interpret the plot to mean the average elapsed wall clock time per integer decoded/encoded? And can I conclude the throughput is the reciprocal? So about 100,000 integers per second or around a 1 MB/s of decompressed data.

I know this is a bit unfair because the implementation is much more complex, but my first thought is why I would use vu128 instead of Lemire’s Stream VByte: https://arxiv.org/abs/1709.08990

A slight tangent but I stumbled on this library which stores floats XOR’ed with the previous float in the stream: https://github.com/velvia/compressed-vec it seems really clever to me! They reference “Gorilla: A Fast, Scalable, In-Memory Time Series Database” which in turn references two 2006 papers: “Fast Lossless Compression of Scientific Floating-Point Data” and “Fast and Efficient Compression of Floating-Point Data”. Frustratingly, the FB paper doesn’t benchmark their XOR-based floating point encoding but the earlier two papers do.

chubot

FWIW the lobste.rs thread surfaced this 2016 comment thread about PrefixVarint vs. LEB128 varint in protobufs, which has lots of good info, and makes the UTF-8 analogy:

https://news.ycombinator.com/item?id=11263378

Also sqlite varint:

https://sqlite.org/src4/doc/trunk/www/varint.wiki

ixau

In terms of serializing JSON to a binary representation, which includes variable length integers, check CBOR

https://datatracker.ietf.org/doc/html/rfc8949

zigzag312

How does it compare to vint64[0] performance-wise? For values that fit the 64-bit space.

[0] https://crates.io/crates/vint64

vitiral

zig-zag encoding is brilliant. Basically you can determine positive or negative based on odd/even and divide by 2 to get the value. I tried to come up with a similar encoding and never thought of this