pkhuong
I'd expect a comparison with compact (or even succinct) constructions for arbitrary functions, like MWHC. Section 3.2 of https://vigna.di.unimi.it/ftp/papers/TheoryPracticeMonotone.... has a good overview.

Given a set S of arbitrary hashable values, it's possible to represent a function from S to r bits in |S|r + o(|S|) bits (keys outside S are mapped to random r-bit values). More practical constructions hit ~1.23 |S|r bits, or even |S|(r + 1.23) bits. It should also be faster to evaluate than `r` bloom filter lookups for large datasets.
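For anyone who hasn't seen MWHC before, here's a rough sketch of just the lookup side (my own illustration in Rust, not code from the linked paper or from B-field; the hash functions and table layout are placeholder assumptions, and the hard part, building `cells` by hypergraph peeling so the XOR identity holds, is not shown):

    use std::collections::hash_map::DefaultHasher;
    use std::hash::{Hash, Hasher};

    // Table of m ~ 1.23*|S| cells of r <= 64 bits each, built offline so that
    // for every key k in S: cells[h0(k)] ^ cells[h1(k)] ^ cells[h2(k)] == f(k).
    // Keys outside S just get an arbitrary r-bit answer back.
    struct MwhcFn {
        cells: Vec<u64>, // r-bit values in the low bits of each u64
        r_mask: u64,     // (1 << r) - 1
    }

    impl MwhcFn {
        fn probe(&self, key: &[u8], seed: u64) -> usize {
            // Placeholder hash; real implementations use three independent
            // hashes, each mapping into its own third of the table.
            let mut h = DefaultHasher::new();
            seed.hash(&mut h);
            key.hash(&mut h);
            (h.finish() % self.cells.len() as u64) as usize
        }

        // Lookup is three probes and two XORs, independent of r.
        fn get(&self, key: &[u8]) -> u64 {
            let v = self.cells[self.probe(key, 0)]
                ^ self.cells[self.probe(key, 1)]
                ^ self.cells[self.probe(key, 2)];
            v & self.r_mask
        }
    }

Construction (solving for `cells`) is the expensive part; lookups stay three memory probes no matter how large r gets, which is where the comparison with r separate bloom filter probes comes from.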

I think the main advantage of the bloom filter (or compressed bitmap) approach is that it can be updated incrementally. MWHC-style representations are better suited to build-once / read-many workloads.

danking00
I think it might help readers to include a narrative about an example application. Perhaps I’m in the minority but I tend to think of Bloom filters as a way to reliably know something isn’t in a set (e.g. so as to not run an expensive disk read). This data structure seems to view them the dual way: “this is maybe the right value for this key”.

I’ve seen that view work for visualizations like approximate CDFs and medians where I have some statement like “with probability p, the value differs from truth by less than e”. Is this data structure used in a similar way? My instinct is that visualizations having a low rate of being wrong is OK because the human will follow up that visualization with more tests. In the end you have lots of evidence supporting the conclusion.

ComputerGuru
Curious idea. So it's for cases where you can have any key, but each key is associated with one of only a few (preferably) discrete values? I.e. your URL example is great with the URL as the key, but subpar if the URL were the value (padded to length n with trailing nulls and encoded as a fixed-width int array)?

With its interesting set of guarantees, I can't see a case where you could use this unless you are 100% positive all keys have previously been inserted into the set; otherwise you risk getting a wrong value back (instead of no value). A traditional bloom filter is similar, but there the worst case is that you throw away work because you go look up the determinative data/value; here it's a bit trickier.

Lots of applications tolerate missing results but significantly fewer can tolerate “unknowingly incorrect” results.

Question about the implementation: I would have expected the primary interface to be in-memory, with some API for disk spillover for large datasets, but while all the docs say “designed for in-memory lookups”, the Rust API shows that you need to provide it with a temp directory to create the structure? (Also, FYI, you use temp::temp_file() but never actually use the result, instead using the hard-coded /tmp path.)

judofyr
Very interesting and I'll have to read more to understand how it fully works, but _initially_ the space requirements don't seem too impressive? Am I missing something here? Is my calculation/assumption completely off? Maybe the solution here is more flexible?

One alternative approach for many of these problems is to start with a minimal perfect hash function, which hashes each of your keys to a unique number in [0, N), and then have a packed array of size N where each element has a fixed width. To look up a value, you first use the hash function to get an index, and then you look up that index in the packed array. There's also no error rate here: this is exact.

PTHash (https://arxiv.org/abs/2104.10402) needs roughly 4 bits per element to create a minimal perfect hash function.
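Roughly what I mean, as a Rust sketch (the MPHF itself is abstracted away as a closure, since the exact API depends on which construction you pick; this is my own illustration, not the B-field's code):

    // Fixed-width packed array: N values of `bits` bits each, stored in u64 words.
    struct PackedArray {
        bits: usize,
        words: Vec<u64>,
    }

    impl PackedArray {
        fn new(len: usize, bits: usize) -> Self {
            assert!(bits > 0 && bits <= 64);
            Self { bits, words: vec![0; (len * bits + 63) / 64] }
        }

        // Write-once set: assumes the slot is still zeroed from construction.
        fn set(&mut self, i: usize, value: u64) {
            let (bit, mask) = (i * self.bits, (!0u64) >> (64 - self.bits));
            let (w, off) = (bit / 64, bit % 64);
            self.words[w] |= (value & mask) << off;
            if off + self.bits > 64 {
                self.words[w + 1] |= (value & mask) >> (64 - off);
            }
        }

        fn get(&self, i: usize) -> u64 {
            let (bit, mask) = (i * self.bits, (!0u64) >> (64 - self.bits));
            let (w, off) = (bit / 64, bit % 64);
            let mut v = self.words[w] >> off;
            if off + self.bits > 64 {
                v |= self.words[w + 1] << (64 - off);
            }
            v & mask
        }
    }

    // Lookup: the MPHF (PTHash, BBHash, ...) gives a unique index in [0, N) for
    // every key that was in the build set, then one packed-array read returns the
    // exact value. Keys that were *not* in the build set silently land in some
    // real key's slot and return that key's value.
    fn lookup(mphf_index: impl Fn(&[u8]) -> usize, values: &PackedArray, key: &[u8]) -> u64 {
        values.get(mphf_index(key))
    }

Total space is then N * bits for the values plus whatever the MPHF costs (the ~4 bits per element above).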

> Store 1 billion web URLs and assign each of them one of a small number of categories values (n=8) in 2.22Gb (params include ν=8, κ=1, =0.1%; 19 bits per element)

Assuming that "n=8" here means "8 bits", we need 1 GB (8 bits * 1 billion) to represent all of the values, and then 500 MB for the hash function (4 bits * 1 billion).

I also don't quite understand what "2.22Gb" here refers to. 19 bits * 1 billion = 19 SI gigabits = 2.375 SI gigabytes = 2.212 gibibytes.

> Store 1 billion DNA or RNA k-mers ("ACGTA...") and associate them with any of the ~500k bacterial IDs current described by NCBI in 6.93Gb (ν=62, κ=4, =0.1%; 59 bits per element)

"~500k bacterial ID" can be represented with 19 bits. 1 billion of these take ~2.3GB, and then we have the additional 500MB for the perfect hash function.

Another data structure which is even more fine-tuned for this problem space is Bumped Ribbon Retrieval (https://arxiv.org/abs/2109.01892) where they have <1% overhead over just storing the plain bit values.

EDIT: Aha! One thing I forgot about: The alternatives I mentioned above all have a construction cost. I've been playing with them in the 100k-1M range and they've all been pretty instant (<1s), but I don't have any experience in the billion range. Maybe it's too slow there?

alexbowe
Interesting read, thanks for sharing!

If you have some benchmark results, it'd be great to see how it compares to traditional data structures in practice, for different datasets and varying k-mer lengths.

xvilka
There's another library, unrelated to this data structure but from the same field: an interval tree structure, Lapper [1][2]

[1] https://github.com/sstadick/rust-lapper

[2] https://docs.rs/rust-lapper

neutrinobro
Ughh... the term B-field already has a very strong association with magnetic fields. I'm sure it sounded like a good name given the context, but these types of name collisions generally make searching for a specific topic more and more painful each year.
vslira
Great work, thanks for sharing!

On a somewhat tangential note, does anyone have a good resource for designing probabilistic data structures? At a high level, I'm looking for something that helps me understand what is and isn't feasible and, given a problem and constraints, how I would go about designing a specific DS for that problem. It doesn't need to be all that general, but something that is more than an analysis of existing structures.

foota
I wonder... The comparison here is against a bloom filter, but is this actually more similar to a sketch?

Or... Actually this is sort of like a posting list (e.g., a list of the places where a given document appears: https://en.m.wikipedia.org/wiki/Inverted_index)

seffect
IIRC, with a bloom filter, if it returns false you can be sure the item is not in the set, but if it returns true it probably is in the set, though it might be a clash giving a false positive?

Is the same true with this data structure?

I guess you could mitigate this by storing an additional hash, or the original key in its entirety, as part of the value?
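Roughly what that mitigation could look like (a sketch only; the encode/decode wrapper is hypothetical, not the B-field crate's actual API): reserve a few bits of each stored value for a fingerprint of the key, and reject any lookup whose fingerprint doesn't match.

    use std::collections::hash_map::DefaultHasher;
    use std::hash::{Hash, Hasher};

    const FP_BITS: u32 = 8; // 8-bit fingerprint -> ~1/256 residual error

    fn fingerprint(key: &[u8]) -> u64 {
        let mut h = DefaultHasher::new();
        key.hash(&mut h);
        h.finish() & ((1 << FP_BITS) - 1)
    }

    // Pack fingerprint + value before inserting into the approximate map.
    fn encode(key: &[u8], value: u64) -> u64 {
        (fingerprint(key) << 56) | value // assumes the value fits in 56 bits
    }

    // On lookup, a value returned for a key that was never inserted is rejected
    // with probability 1 - 2^-FP_BITS, at the cost of FP_BITS extra bits per element.
    fn decode(key: &[u8], stored: u64) -> Option<u64> {
        if stored >> 56 == fingerprint(key) {
            Some(stored & ((1u64 << 56) - 1))
        } else {
            None
        }
    }

So it doesn't eliminate "unknowingly incorrect" results, it just trades extra bits per element for a smaller chance of them.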

hexo
Is there a non-Rust version out there somewhere, please?