judofyr
I implemented this in Zig earlier: https://github.com/judofyr/minz

It’s quite a neat algorithm. I saw compression ratios in the 2-3x range. However, I remember that the algorithm for finding the dictionary was a bit unclear: I wasn’t convinced that what the paper described actually found the “optimal” dictionary, and with some slight tweaks I got wildly different results. I wonder if this implementation improves on that.
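For reference, my reading of the paper's search: it runs a handful of "generations", each time re-parsing a sample with the current symbol table, counting every emitted symbol plus every concatenation of two adjacent symbols, and keeping the 255 candidates with the highest gain (count times length). A simplified, self-contained sketch of that loop, leaving out details like input sampling and escape-byte overhead:

  use std::collections::HashMap;

  const MAX_SYMBOLS: usize = 255; // code 255 stays reserved as the escape marker
  const MAX_SYMBOL_LEN: usize = 8;

  // Longest table symbol that is a prefix of `input`, if any.
  fn longest_match(table: &[Vec<u8>], input: &[u8]) -> Option<Vec<u8>> {
      table.iter()
          .filter(|sym| input.starts_with(sym))
          .max_by_key(|sym| sym.len())
          .cloned()
  }

  fn build_table(sample: &[u8], generations: usize) -> Vec<Vec<u8>> {
      let mut table: Vec<Vec<u8>> = Vec::new();
      for _ in 0..generations {
          // Greedily parse the sample with the current table, counting each
          // emitted symbol and each concatenation of adjacent symbols.
          let mut counts: HashMap<Vec<u8>, usize> = HashMap::new();
          let mut pos = 0;
          let mut prev: Option<Vec<u8>> = None;
          while pos < sample.len() {
              let sym = longest_match(&table, &sample[pos..])
                  .unwrap_or_else(|| vec![sample[pos]]); // fall back to a literal
              pos += sym.len();
              *counts.entry(sym.clone()).or_default() += 1;
              if let Some(p) = prev {
                  let mut cat = p;
                  cat.extend_from_slice(&sym);
                  if cat.len() <= MAX_SYMBOL_LEN {
                      *counts.entry(cat).or_default() += 1;
                  }
              }
              prev = Some(sym);
          }
          // Keep the 255 candidates with the highest gain = count * length.
          let mut ranked: Vec<(Vec<u8>, usize)> = counts.into_iter().collect();
          ranked.sort_by_key(|(s, c)| std::cmp::Reverse(c * s.len()));
          table = ranked.into_iter().take(MAX_SYMBOLS).map(|(s, _)| s).collect();
      }
      table
  }

The ambiguity I mean lives in choices like the gain formula and how candidates survive between generations; small changes there reshuffle the whole table.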

Epicism
Super interesting! I’m curious how this differs from InfluxDB’s German strings implementation: https://www.influxdata.com/blog/faster-queries-with-stringvi...
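As far as I can tell they solve different problems: German strings (the Umbra-style layout that post describes) are a fixed 16-byte string header that inlines short strings and keeps a comparison prefix for long ones, while FSST compresses the actual bytes; the two compose nicely. A rough sketch of the layout, with field names of my own choosing:

  // 16 bytes total. len <= 12: the whole string lives in prefix + trailing.
  // len > 12: prefix caches the first 4 bytes, trailing.ptr points at the rest.
  #[repr(C)]
  struct GermanString {
      len: u32,
      prefix: [u8; 4],
      trailing: Trailing,
  }

  #[repr(C)]
  union Trailing {
      inline: [u8; 8],  // bytes 4..12 of a short string
      ptr: *const u8,   // heap pointer to the full string when len > 12
  }

  // Most comparisons resolve on (len, prefix) without chasing the pointer.
  fn definitely_not_equal(a: &GermanString, b: &GermanString) -> bool {
      a.len != b.len || a.prefix != b.prefix
  }
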
jcgrillo
I really like the look of vortex[1]! One of my industry pet peeves is all the useless UTF-8 server log bytes. I'd like to log data in a sane, schemaful, binary format, and this looks like it could be a good way to do that. Bonus points if we can wire this up as a physical layer for e.g. datafusion[2] so I can analyze my logs with the dataframe abstraction.

EDIT: A question about FSST. Let's say I build a strings table like:

  struct Strings {
      compressor: fsst::Compressor, // one symbol table shared by every string
      compressed: Vec<Vec<u8>>      // each inner Vec is one compressed string
  }
Is there some optimal length for compressed, given the 255-symbol limit?
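Probably no single magic number: one 255-entry table gets diluted across heterogeneous data, while many small tables pay per-table training and storage overhead, so the sweet spot depends on the corpus. One way to probe it is to train one compressor per fixed-size block of strings and sweep the block size. A sketch assuming an fsst-rs style train/compress API (signatures from memory, so check the crate docs):

  use fsst::Compressor;

  // Experiment harness: raw-to-compressed ratio when training one symbol
  // table per `block` strings. A real comparison should also charge each
  // block for its serialized symbol table.
  fn ratio_for_block_size(lines: &[&[u8]], block: usize) -> f64 {
      let (mut raw, mut packed) = (0usize, 0usize);
      for chunk in lines.chunks(block) {
          let compressor = Compressor::train(&chunk.to_vec()); // assumed signature
          for line in chunk {
              raw += line.len();
              packed += compressor.compress(line).len(); // assumed signature
          }
      }
      raw as f64 / packed as f64
  }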

[1] https://github.com/spiraldb/vortex [2] https://github.com/apache/datafusion

aidenn0
What is the meaning of "Arrow" in this context?
chgo1
A question regarding the second generation in the example: Why is the symbol "um" (0) only counted once?
scotty79
So this lets you compress a collection of strings and cheaply decompress any of them individually?
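That's exactly the selling point: decompression needs only the shared symbol table and the one string's bytes, so any element decodes independently. A sketch of the decode loop as the paper describes it (code 255 escapes a literal byte; every other code indexes a symbol of up to 8 bytes):

  const ESCAPE: u8 = 255;

  // `symbols` is the shared table (up to 255 entries); `compressed` holds
  // one string's codes. No other state is needed to decode it.
  fn decode(symbols: &[Vec<u8>], compressed: &[u8]) -> Vec<u8> {
      let mut out = Vec::new();
      let mut i = 0;
      while i < compressed.len() {
          if compressed[i] == ESCAPE {
              out.push(compressed[i + 1]); // escaped literal byte, copied as-is
              i += 2;
          } else {
              out.extend_from_slice(&symbols[compressed[i] as usize]);
              i += 1;
          }
      }
      out
  }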