coppsilgold
If you know that a source of randomness contains entropy (unpredictable bits) but you don't know how much (e.g. the output of a digital camera, unless heavily processed, will contain random sensor noise), the safest thing to do is to pipe it into a cryptographic construct such as a hash or a sponge.

Once you believe you have piped in enough, you use the state of the cryptographic primitive as the seed for further random bit generation. The Linux kernel uses a sponge (to accumulate), a hash function (to consolidate) and a stream cipher (to output) to 'convert' events with some degree of randomness into an 'infinite' stream of cryptographically secure random bits.

To acquire some intuition about this, imagine taking a raw 1 MP photo with a camera sensor and then feeding the lossless file to sha256sum. You get a 256-bit string, and the sensor noise in the photo will be sufficient to secure the result. To brute-force your SHA-256 result, an attacker would need to model all the degrees of freedom involved in taking photos in the world, plus the sensor's noise production, and build a simulator of your camera; the input will almost certainly (unless the sensor is compromised or the file isn't really raw) contain far more degrees of freedom than 256 bits.
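
A rough sketch of that intuition in Python (the file name is just a hypothetical raw capture): hashing the file condenses however much entropy it holds into a 256-bit seed.

    import hashlib

    # Condense an entropy-rich but unquantified source (a raw photo)
    # into a 256-bit seed. "photo.raw" is a hypothetical capture.
    def seed_from_file(path: str) -> bytes:
        h = hashlib.sha256()
        with open(path, "rb") as f:
            for chunk in iter(lambda: f.read(1 << 20), b""):
                h.update(chunk)
        return h.digest()  # 32 bytes = 256 bits

    print(seed_from_file("photo.raw").hex())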

yarg
I remember thinking about something similar to this at university - I was uncomfortable with the use of biases to assign non-zero probabilities to events that fail to occur after some number of trials.

If I flip a coin n times and it comes up heads every time, what's my best estimate of the likelihood of tails?

It came out as 1/2^(1/(n + 1)); and the chance of heads = (1 - that).

The calculus for results in between seemed intractable to me - or at least well beyond my abilities...

So I threw it into a Newton-Raphson solver and was happy to see that it came out pretty much linear (the most asymmetrical result is the one for three trials, and since that was basically linear, all results for greater n will be as well).

But I never went quite this far - for that you'd also need to calculate the standard deviation of the probability estimate (I don't think it would've been much harder than what I did, but it was outside my requirements at the time, so I never implemented it).
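
A quick way to sanity-check an estimate of that closed form numerically, assuming a uniform prior and taking the posterior median as the point estimate (that assumption is mine, so take it as a sketch):

    # Put a uniform prior on p, the probability of the outcome that came up
    # n times in a row. The posterior density is proportional to p**n, so its
    # CDF is F(p) = p**(n + 1), and the posterior median solves F(p) = 1/2,
    # i.e. p = 2**(-1/(n + 1)). Bisection just confirms the closed form.
    def posterior_median(n: int, tol: float = 1e-12) -> float:
        lo, hi = 0.0, 1.0
        while hi - lo > tol:
            mid = (lo + hi) / 2
            if mid ** (n + 1) < 0.5:
                lo = mid
            else:
                hi = mid
        return (lo + hi) / 2

    for n in (1, 3, 10):
        print(n, posterior_median(n), 0.5 ** (1 / (n + 1)))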

j7ake
The insight here is that you need to identify two signals that have equal probability, even if those two signals are rare in the full distribution. The full distribution is allowed to contain any number of other signals besides those two.

You then throw out all signals other than those two, and the conditional distribution renormalises itself to give you a fair coin toss.

You pay for this by throwing out many bits that are not these two signals. The less fair the coin, the more coin flips you throw away.

In the trivial case of a fair coin, you throw away nothing and keep every coin toss. With a biased coin, you throw away every HH or TT pair.

Independence is a major assumption underlying any of these models.
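
A minimal sketch of that pairing-and-discarding (the von Neumann extractor), assuming independent flips with a fixed bias:

    import random

    def von_neumann(bits):
        """Yield unbiased bits from an i.i.d. but possibly biased bit stream."""
        it = iter(bits)
        for a, b in zip(it, it):   # consume the stream two flips at a time
            if a != b:             # keep only the two equal-probability signals
                yield a            # P(10) == P(01), so the output is fair
            # 11 and 00 pairs are thrown away

    # Example: a coin that lands heads (1) 80% of the time.
    biased = (1 if random.random() < 0.8 else 0 for _ in range(100_000))
    out = list(von_neumann(biased))
    print(len(out), sum(out) / len(out))  # roughly half the kept bits are 1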

alphazard
Something maybe obvious but worth repeating: there are two kinds of error, predictable and unpredictable. Bias is the predictable error; it's the direction we are likely to be wrong in. In many practical applications, unbiased error is not what we want: if the cost of being wrong in different directions is asymmetric, then we want to be biased so that our mistakes are less costly. The unpredictable error is noise. In this example we are trying to create something maximally unpredictable, so the goal is to remove all bias, leaving pure noise.
bjornsing
> Interestingly, this is the best possible approach, whether we know the bias of the input bit stream or not.

Would love to see a proof of that. It feels a bit unintuitive to me.

(But on second thought that might be because I’m thinking of cases of correlated bits.)

leeoniya
made me think of, "bend entropy towards definable outcomes"

https://m.youtube.com/watch?v=2QJjMwBAORw

andrewla
I haven't dug deeper, but the claim that the von Neumann approach is optimal does not seem intuitively correct. It seems like you could squeeze a tiny bit more entropy from it -- basically, if you reject two pairs in a row, the nature of that rejection tells you something.

    HT xx -> H
    TH xx -> T
    HH TT -> H
    TT HH -> T
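
One way to read that table (my interpretation, a sketch rather than a worked-out extractor): run von Neumann on pairs as usual, and additionally run von Neumann over the sequence of rejected pairs, treating HH as one symbol and TT as the other.

    import random

    def cascaded_von_neumann(bits):
        """Von Neumann extraction plus a second pass over rejected pairs."""
        pending = None                # earlier rejected pair: 1 = HH, 0 = TT
        it = iter(bits)
        for a, b in zip(it, it):
            if a != b:
                yield a               # ordinary von Neumann output
            else:
                if pending is None:
                    pending = a       # hold the first rejection
                else:
                    if pending != a:  # HH then TT -> H, TT then HH -> T
                        yield pending
                    pending = None    # HH,HH and TT,TT yield nothing

    biased = [1 if random.random() < 0.8 else 0 for _ in range(100_000)]
    plain = sum(a != b for a, b in zip(biased[::2], biased[1::2]))
    extra = list(cascaded_von_neumann(biased))
    print(plain, "bits plain vs", len(extra), "bits cascaded")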
travisjungroth
It would be interesting to combine this with something that detects bias on a non-deterministic stream, so that in one shot it takes a stream of unknown bias and emits an unbiased stream. The closing paragraph says that's impossible, but the trade-off is that you are only sure the output is unbiased with some level of confidence. I think you'd also need a buffer to detect and then output.
clircle
This is a specific instance of a general class of algorithms/research area called Bernoulli factories, if anyone wants to go deep.
ur-whale
Does this work with a guaranteed random bit output bandwidth?

These techniques (e.g. von Neumann) all seem to suffer from the same problem: they crank out uniformly distributed random bits from biased sources, but with no guarantee of how long you may have to wait for a bit to come out.
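
To put a number on it for plain von Neumann (a back-of-the-envelope sketch, with p the probability of heads): each pair of flips produces an output bit with probability 2p(1-p), so the wait is geometric, unbounded in the worst case but with a known expectation and tail.

    def expected_flips_per_bit(p: float) -> float:
        return 2 / (2 * p * (1 - p))       # 1/(2p(1-p)) pairs, 2 flips each

    def prob_no_bit_after(pairs: int, p: float) -> float:
        return (1 - 2 * p * (1 - p)) ** pairs

    print(expected_flips_per_bit(0.5), prob_no_bit_after(10, 0.5))  # fair coin
    print(expected_flips_per_bit(0.9), prob_no_bit_after(10, 0.9))  # heavy bias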

sevenoftwelve
The article is interesting, but it misses the most practical and unambiguously safe way to generate streams of random data: Use cryptography.

To generate a stream of random data, use a hash function with arbitrary-length output (XOF) such as blake2x[^0] or shake256[^1]. Make sure your key contains at least 256 bits of entropy. Absolutely never use a key with less than 128 bits of entropy.
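
A minimal sketch of that, using shake256 from Python's hashlib as the XOF (os.urandom stands in for whatever seed source you actually trust):

    import hashlib, os

    seed = os.urandom(32)                              # >= 256 bits of entropy
    stream = hashlib.shake_256(seed).digest(1 << 20)   # 1 MiB of output
    print(stream[:16].hex())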

Since it's impossible to know how much entropy there is in a key, you probably want to use something like the Fortuna RNG[^2], substituting your XOF for the SHA-2/AES-based construction. Bruce Schneier designed Fortuna back when XOFs were harder to come by.

If you want more performance, you can use blake2 to compress your input seed into 256 bits and generate the random stream using chacha20[^3].
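
A sketch of that split, with hashlib's blake2b doing the compression and ChaCha20 from the pyca/cryptography package producing the stream (the raw seed material here is a placeholder):

    import hashlib, os
    from cryptography.hazmat.primitives.ciphers import Cipher, algorithms

    raw_seed = os.urandom(128)      # placeholder for your entropy input
    key = hashlib.blake2b(raw_seed, digest_size=32).digest()   # compress to 256 bits

    nonce = b"\x00" * 16            # fixed nonce is fine: the key is used once
    enc = Cipher(algorithms.ChaCha20(key, nonce), mode=None).encryptor()
    random_stream = enc.update(b"\x00" * (1 << 20))   # 1 MiB of keystream
    print(random_stream[:16].hex())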

All of this is usually handled by the Linux kernel[^4], so it's best to just use the getrandom(2)[^5] system call or read from /dev/urandom[^6]. If you are writing a Rust program, you can use the rand[^7] crate, which takes a mixed approach: it reads a seed from the operating system and expands it in-process using chacha[^8]. This is a valid strategy.
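
In code that usually reduces to a one-liner (Python shown; on Linux, os.getrandom wraps the getrandom(2) syscall and os.urandom has /dev/urandom semantics):

    import os, secrets

    key = os.urandom(32)               # kernel CSPRNG, /dev/urandom semantics
    token = secrets.token_bytes(32)    # same idea via the secrets module
    print(key.hex(), token.hex())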

I am omitting some subtleties[^10] about the mathematical definitions of randomness extractors as used by the author of the article. When you use a cryptographic approach, you are dealing with a complexity-theory-based security notion[^9], which does not precisely equate to creating a stream with a specific amount of entropy. Everywhere except in a physics or math paper dealing with information theory, I would call this a technicality. For most intents and purposes, cryptographic security notions are the most real-world-robust conceptions of randomness available.

[^0]: https://www.blake2.net/

[^1]: https://csrc.nist.gov/pubs/fips/202/final (shake256 is part of the SHA-3 standard)

[^2]: https://www.schneier.com/academic/fortuna/

[^3]: https://protonvpn.com/blog/chacha20

[^4]: https://lwn.net/Articles/884875/

[^5]: https://man.archlinux.org/man/getrandom.2

[^6]: https://man.archlinux.org/man/urandom.4

[^7]: https://crates.io/crates/rand

[^8]: https://docs.rs/rand/0.8.5/rand/rngs/struct.StdRng.html

[^9]: https://en.wikipedia.org/wiki/Security_of_cryptographic_hash...

[^10]: In particular, PRFs are not guaranteed to output tokens with a certain amount of entropy – if I recall correctly – because they can map two inputs to the same output.

---

I am the main author of the Rosenpass[^11] post-quantum secure key exchange for WireGuard. My expertise comes from developing that protocol, from a couple of years of engagement with the real-world cryptography community, and from my own scientific research on cryptography and its secure implementation.

[^11]: https://rosenpass.eu/