Raft: Understandable Distributed Consensus (2014)

_russross

I am in the minority who thinks Raft is overrated.

I tried teaching Raft one year instead of Paxos but ended up switching back. While it was much easier to understand how to implement Raft, I think my students gained deeper insight when focusing on single-decision Paxos. There is a lightbulb moment when they first understand that consensus is a property of the system that happens first (and they can point at the moment it happens) and then the nodes discover that it has been achieved later. Exploring various failure modes and coming to understand how Paxos is robust against them seems to work better in this setting as well.
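
To make the "consensus happens first, nodes discover it later" point concrete, here is a minimal single-decision Paxos sketch in Python. It is an illustration only, not taken from any course material: the `Acceptor` class, the `propose` and `chosen_value` helpers, and the three-acceptor scenario are all assumptions, and messages are modeled as direct method calls.

```python
from dataclasses import dataclass
from typing import List, Optional, Tuple


@dataclass
class Acceptor:
    promised: int = -1                           # highest ballot promised so far
    accepted: Optional[Tuple[int, str]] = None   # (ballot, value) last accepted

    def prepare(self, ballot: int):
        """Phase 1: promise to ignore lower ballots and report any prior accept."""
        if ballot > self.promised:
            self.promised = ballot
            return True, self.accepted
        return False, None

    def accept(self, ballot: int, value: str) -> bool:
        """Phase 2: accept unless a higher ballot has already been promised."""
        if ballot >= self.promised:
            self.promised = ballot
            self.accepted = (ballot, value)
            return True
        return False


def chosen_value(acceptors: List[Acceptor]) -> Optional[str]:
    """Omniscient view: a value is chosen once a majority accepted the same ballot."""
    counts = {}
    for a in acceptors:
        if a.accepted is not None:
            counts[a.accepted] = counts.get(a.accepted, 0) + 1
    for (_, value), n in counts.items():
        if n > len(acceptors) // 2:
            return value
    return None


def propose(acceptors, ballot, value, reachable):
    """Run both phases against the reachable acceptors; return what was decided."""
    majority = len(acceptors) // 2 + 1
    replies = [acceptors[i].prepare(ballot) for i in reachable]
    promises = [prev for ok, prev in replies if ok]
    if len(promises) < majority:
        return None
    prior = [p for p in promises if p is not None]
    if prior:
        # Must adopt the value of the highest-ballot proposal already accepted.
        value = max(prior, key=lambda p: p[0])[1]
    acks = sum(acceptors[i].accept(ballot, value) for i in reachable)
    return value if acks >= majority else None


acceptors = [Acceptor() for _ in range(3)]

# Proposer A (ballot 1, value "x") completes phase 1 everywhere.
for a in acceptors:
    a.prepare(1)
# A's accept messages reach acceptors 0 and 1, then A crashes without hearing back.
acceptors[0].accept(1, "x")
acceptors[1].accept(1, "x")
decided_before_anyone_knows = chosen_value(acceptors)

# Proposer B (ballot 2) wants "y" but can only reach acceptors 1 and 2.
learned_by_b = propose(acceptors, 2, "y", reachable=[1, 2])
```

In this run the value "x" is chosen the instant acceptors 0 and 1 accept ballot 1, even though proposer A never learns of it. Proposer B only discovers the decision later, because its prepare phase forces it to adopt "x" in place of "y".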

I think this paper by Heidi Howard and Richard Mortier is a great way to move on to Multipaxos:

https://arxiv.org/abs/2004.05074

They present Multipaxos in a similar style to how Raft is laid out and show that Multipaxos as it is commonly implemented and Raft are almost the same protocol.

Raft was a great contribution to the engineering community to make implementing consensus more approachable, but in the end I don't think the protocol itself is actually more understandable. It was presented better for implementers, but the implementation focus obscures some of the deep insights that plain Paxos exposes.

benbjohnson

Author here. I'm happy to answer any questions although this project was from 10+ years ago so I could be a little rusty.

Over the years I've been trying to find better ways to do this kind of visualization but for other CS topics. Moving to video is the most realistic option but using something like After Effects takes A LOT of time and energy for long-form visualizations. It also doesn't produce a readable output file format that could be shared, diff'd, & tweaked.

I spent some time on a project recently to build out an SVG-based video generation tool that can use a sidecar file for defining animations. It's still a work in progress but hopefully I can get it to a place where making this style of visualization isn't so time intensive.

MarkMarine

This is one of my favorite pieces of software engineering because it took something difficult and made being easy to understand a main criterion for success. The PhD thesis has a lot more info about this if anyone is curious; it is approachable and easy to read:

https://web.stanford.edu/~ouster/cgi-bin/papers/OngaroPhD.pd...

I think this was core to Raft’s success, and I strive to create systems like this with understandability as a first goal.

prydt

I've run a reading group for distributed systems for the last 2 years now and I do think that Raft is a better introduction to Consensus than any Paxos paper I have seen (I mean the Paxos Made Simple paper literally has bugs in it). But when I learned consensus in school, we used Paxos and Multi-Paxos and I do believe that there was a lot to be gained by learning both approaches.

Heidi Howard has several amazing papers about how the differences between Raft and Multi-Paxos are very surface level, and how Raft's key contribution is its presentation, which is also more "complete" than the many fragmented presentations of Multi-Paxos.

As a bonus, one of my favorite papers I have read recently is Compartmentalized Paxos: https://vldb.org/pvldb/vol14/p2203-whittaker.pdf , which is just a brilliant piece on how to scale Multi-Paxos.

mbivert

In case this is of interest, MIT's distributed systems course, 6.5840[0], has a series of labs on implementing Raft in Go. I haven't made it through the whole thing yet, but it's quite entertaining so far.

The teachers provide you with some code templates, a bunch of tests, and a progressive way to implement it all.

[0]: https://pdos.csail.mit.edu/6.824/index.html

dang

Raft Visualization - https://news.ycombinator.com/item?id=25326645 - Dec 2020 (35 comments)

Raft: Understandable Distributed Consensus - https://news.ycombinator.com/item?id=8271957 - Sept 2014 (79 comments)

skilning

I was asked to click "continue" after each of the first two sentences, and the fade-in of the text took longer than reading the text.

This may be a great article, but I'll never know because it's frustrating to try and read.

eatonphil

Ben's visualization here is great.

The other biggest help to me aside from the paper and the thesis was Ongaro's TLA+ spec: https://github.com/ongardie/raft.tla/blob/master/raft.tla. It's the only super concise "implementation" I found that is free of production-grade tricks, optimizations, and abstractions.

And for building an intuition, TigerBeetle's sim.tigerbeetle.com is great. What happens to consensus when there's high latency to disk or network? Or as processes crash more frequently? It demonstrates.

quelltext

In the log replication example, after healing the partition the uncommitted log changes in the minority group are rolled back and the leader's log is used.

However, it's not clear how that log is transmitted. Until this point only heartbeats via Append Entries were discussed, so it's not clear whether the followers somehow pull that information from the leader via a different mechanism, or whether it's the leader's responsibility to detect followers that are left behind and replay everything. That would seem rather error-prone and a lot of coordination effort. So how is it actually done?
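
As far as I understand the Raft paper, it is the second option: there is no separate mechanism, and heartbeats are just Append Entries calls that carry no entries. Each call names the entry that precedes the new ones; a follower rejects the call if its log does not match at that point, and the leader retries from one entry earlier until it finds where the logs agree. The Python sketch below is a simplified, hypothetical model of that check (in-memory logs, no leader term or commit index on the call); the names `Follower`, `append_entries`, and `replicate` are illustrative.

```python
from dataclasses import dataclass, field
from typing import List, Tuple

Entry = Tuple[int, str]  # (term, command)


@dataclass
class Follower:
    log: List[Entry] = field(default_factory=list)

    def append_entries(self, prev_index: int, prev_term: int,
                       entries: List[Entry]) -> bool:
        """Accept only if our log matches the leader's at prev_index.

        Indexes are 1-based; prev_index == 0 means "before the first entry".
        """
        if prev_index > len(self.log):
            return False                      # we are missing earlier entries
        if prev_index > 0 and self.log[prev_index - 1][0] != prev_term:
            return False                      # same slot, different term
        for offset, entry in enumerate(entries):
            pos = prev_index + offset         # 0-based slot for this entry
            if pos < len(self.log) and self.log[pos][0] != entry[0]:
                del self.log[pos:]            # conflict: drop it and all that follow
            if pos >= len(self.log):
                self.log.append(entry)
        return True


def replicate(leader_log: List[Entry], follower: Follower, next_index: int) -> int:
    """Leader side: back up next_index until the follower accepts, then catch it up."""
    while True:
        prev_index = next_index - 1
        prev_term = leader_log[prev_index - 1][0] if prev_index > 0 else 0
        if follower.append_entries(prev_index, prev_term, leader_log[prev_index:]):
            return len(leader_log) + 1        # next entry to send to this follower
        next_index -= 1


# The new leader's log, and a follower that took uncommitted term-2 entries
# while it was on the minority side of the partition.
leader_log = [(1, "a"), (1, "b"), (3, "c"), (3, "d")]
follower = Follower(log=[(1, "a"), (1, "b"), (2, "x"), (2, "y")])

# A new leader starts by assuming every follower is fully caught up.
new_next_index = replicate(leader_log, follower, next_index=len(leader_log) + 1)
```

Here the first two calls are rejected, the third finds agreement after entry 2, and the follower discards its term-2 entries and ends up with a copy of the leader's log. The leader only has to track one integer per follower, which keeps the coordination effort small.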

eatonphil

Here's a visualization of Paxos and MultiPaxos that seems to be based on Ben's work: https://visual.ofcoder.com/.

jhanschoo

Just earlier this month I was going through https://github.com/jepsen-io/maelstrom and following the demo implementing (a cut-corners version of) Raft, and I found it quite elucidating. The sample (Ruby) code contains some bugs, and I had to use some understanding to fix them. (The bugs were of the kind where a dummy implementation from an earlier step isn't correctly changed.)

ko_pivot

The writing and the visualizations are great. The ‘continue’ button is way too frequent.

wg0

Off topic - folks in the know, besides Paxos/Raft, what are some of the other most complex algorithms in computer science that are widely used or are a bedrock?

shiredude95

"Paxos Made Moderately Complex" by Robert van Renesse and Deniz Altinbuken: http://www.cs.cornell.edu/courses/cs7412/2011sp/paxos.pdf is a great starting point for implementing Multi-Paxos. The authors also provide a working Python implementation.

rapsey

While Raft is understandable, implementing it is still far from easy.

matthewfcarlson

This really teaches Raft well. Is there a good example of this but for Paxos?

cedws

Can't proof-of-work be used as a leader election algorithm? If the proof is hard enough to generate then one node should be able to generate one and broadcast it before the other nodes can, then that node becomes the leader.
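
A rough Python sketch of the idea, under stated assumptions: the round number, the 12-bit difficulty, and the helper names `find_proof` and `verify` are all hypothetical, and "finishing first" is approximated by counting hash attempts as if every node hashed at the same speed.

```python
import hashlib

DIFFICULTY_BITS = 12  # assumed; a real system would tune this to its hash rate
TARGET = 1 << (256 - DIFFICULTY_BITS)


def digest_value(round_id: int, node_id: str, nonce: int) -> int:
    """Hash the round, the node's identity, and a nonce into a 256-bit integer."""
    data = f"{round_id}:{node_id}:{nonce}".encode()
    return int.from_bytes(hashlib.sha256(data).digest(), "big")


def find_proof(round_id: int, node_id: str) -> int:
    """Search nonces until the hash falls below the difficulty target."""
    nonce = 0
    while digest_value(round_id, node_id, nonce) >= TARGET:
        nonce += 1
    return nonce


def verify(round_id: int, node_id: str, nonce: int) -> bool:
    """Any node can check a claimed proof with a single hash."""
    return digest_value(round_id, node_id, nonce) < TARGET


nodes = ["n1", "n2", "n3"]
proofs = {n: find_proof(7, n) for n in nodes}

# Stand-in for "who broadcasts first": fewest attempts wins at equal hash rates.
leader = min(nodes, key=lambda n: proofs[n])
```

The catch is the "before the other nodes can" part: two nodes can find proofs at nearly the same time, or a partition can hide one broadcast, so a design like this still needs a rule for choosing between competing leaders.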

kfrzcode

DLT discussions are entirely incomplete without consideration of Hedera Hashgraph [0], an aBFT, leaderless, fair and fast DLT using a gossip-about-gossip consensus mechanism. It's absolutely a more robust and scalable technology than Paxos or any other DLT for that matter. I'd love to know what the HN crowd thinks about Hedera as the trust layer of the internet, but nobody around here seems to have an opinion. It's like ignoring Linux while comparing Mac and Windows based computing.

[0]: https://www.swirlds.com/downloads/SWIRLDS-TR-2016-01.pdf
