#ifndef HEADER_fd_src_choreo_tower_fd_tower_h
#define HEADER_fd_src_choreo_tower_fd_tower_h

/* fd_tower presents an API for Solana's TowerBFT algorithm.

   What is TowerBFT? TowerBFT is an algorithm for converging a
   supermajority of stake in the validator cluster on the same fork.

            /-- 3-- 4 (A)
      1-- 2
            \-- 5 (B)

   Above is a diagram of a fork. The leader for slot 5 decided to build
   off slot 2, rather than slot 4. This can happen for various reasons,
   for example network propagation delay. We now have two possible forks
   labeled A and B. The consensus algorithm has to pick one of them.
   So, how does the consensus algorithm pick? As detailed in
   fd_ghost.h, we pick the fork based on the most stake from votes,
   called the "heaviest". Validators vote for blocks during replay, and
   simultaneously use other validators' votes to determine which block
   to vote for. This encourages convergence, because as one fork gathers
   more votes, more and more votes pile on, solidifying its position as
   the heaviest fork.

            /-- 3-- 4 (10%)
      1-- 2
            \-- 5 (9%)

   However, network propagation delay of votes can lead us to think one
   fork is heaviest, before observing new votes that indicate another
   fork is heavier. So our consensus algorithm also needs to support
   switching.

            /-- 3-- 4 (10%)
      1-- 2
            \-- 5 (15%)

   At the same time we don't want excessive switching. The more often
   validators switch, the more difficult it will be to achieve that
   pile-on effect I just described.

   Note that to switch forks, you need to rollback a given slot and its
   descendants on that fork. In the example above, to switch to 1, 2, 5,
   we need to rollback 3 and 4. The consensus algorithm makes it more
   costly the further you want to rollback a fork, via a mechanism
   called lockout (introduced in the tower tables below), which doubles
   for every additional slot you want to rollback.

   Eventually you have traversed far enough down a fork that the
   lockout is so great it is infeasible to imagine it ever rolling back
   in practice. So you can make that fork permanent or "commit" it.
   Once all validators do this, the blockchain now just has a single
   fork.

   Armed with some intuition, let's now begin defining some terminology.
   Here is a diagram of a validator's "vote tower":

   slot | confirmation count (conf)
   --------------------------------
    4   | 1
    3   | 2
    2   | 3
    1   | 4

   It is a stack structure in which each element is a vote. The vote
   slot column indicates which slots the validator has voted for,
   ordered from most to least recent.

   The confirmation count column indicates how many consecutive votes on
   the same fork have been pushed on top of that vote. You are
   confirming your own votes for a fork every time you vote on top of
   the same fork.

   Two related concepts to confirmation count are lockout and expiration
   slot. Lockout equals 2 to the power of confirmation count. Every
   time we "confirm" a vote by voting on top of it, we double the
   lockout. The expiration slot is the sum of the vote slot and the
   lockout, so it also increases when lockouts double. It is the slot
   at which the vote expires. When a vote expires, it is popped from
   the top of the tower. An important Tower rule is that a validator
   cannot vote for a different fork from a given vote slot until
   reaching the expiration slot for that vote slot. To summarize, the
   further a validator wants to rollback their fork (or vote slots), the
   longer the validator needs to wait without voting (in slot time).

   Here is the same tower, fully expanded to include all the fields:

   slot | conf | lockout | expiration
   ----------------------------------
    4   |  1   |    2    |     6
    3   |  2   |    4    |     7
    2   |  3   |    8    |    10
    1   |  4   |   16    |    17

   Based on this tower, the validator is locked out from voting for any
   slot <= 6 that is on a different fork than slot 4. I'd like to
   emphasize that the expiration is with respect to the vote slot, and
   is _not_ related to the Proof-of-History slot or what the "current
   slot" is. So even if the current slot is now 7, the validator can't
   go back and vote for slot 5, if slot 5 were on a different fork than
   4. The earliest valid vote slot this validator could submit for a
   different fork from 4 would be slot 7 or later.

   Next let's look at how the tower makes state transitions. Here we
   have the previous example tower, with a before-and-after view with
   respect to a vote for slot 9:

   (before) slot | conf
            -----------
             4   |  1
             3   |  2
             2   |  3
             1   |  4

   (after)  slot | conf
            -----------
             9   |  1
             2   |  3
             1   |  4

   As you can see, we added a vote for slot 9 to the top of the tower.
   But we also removed the votes for slot 4 and slot 3. What happened?
   This is an example of vote expiry in action. When we voted for slot
   9, this exceeded the expirations of vote slots 4 and 3, which were 6
   and 7 respectively. This action of voting triggered the popping of
   the expired votes from the top of the tower.

   Next, we add a vote for slot 10:

   (before) slot | conf
            -----------
             9   |  1
             2   |  3
             1   |  4

   (after)  slot | conf
            -----------
            10   |  1
             9   |  2
             2   |  3
             1   |  4

   The next vote for slot 10 doesn't involve expirations, so we just add
   it to the top of the tower. Also, here is an important property of
   lockouts. Note that the lockout for vote slot 9 doubled (ie. the
   confirmation count increased by 1) but the lockouts of vote slots 2
   and 1 remained unchanged.

   The reason for this is confirmation counts only increase when they
   are consecutive in the vote tower. Because 4 and 3 were expired
   previously by the vote for 9, that consecutive property was broken.
   In this case, the vote for slot 10 is only consecutive with slot 9,
   but not 2 and 1. Specifically, there is a gap in the before-tower at
   confirmation count 2.

   In the after-tower, all the votes are again consecutive (confirmation
   counts 1, 2, 3, 4 are all accounted for), so the next vote will
   result in all lockouts doubling as long as it doesn't result in more
   expirations.

   One other thing I'd like to point out about this vote for slot 10.
   Even though vote slot 2's expiration slot is only 10, even a later
   vote (say for slot 11) would not expire the vote for 2. This is
   because expiration happens top-down and contiguously. Because vote
   slot 9 (whose expiration slot is now 13) would not be expired, we
   do not proceed with expiring 2.

   In the Tower rules, once a vote reaches a conf count of 32, it is
   considered rooted and it is popped from the bottom of the tower.
   Here is an example where 1 got rooted and popped from the bottom:

   (before) slot | conf
            -----------
            50   |  1
            ...  | ...  (29 votes elided)
             1   | 31

   (after)  slot | conf
            -----------
            53   |  1
            ...  | ...  (29 votes elided)
             2   | 31

   So the tower is really a double-ended queue rather than a stack.

   Rooting has implications beyond the Tower. It's what we use to prune
   our state. Every time tower makes a new root slot, we prune any old
   state that does not originate from that new root slot. Our
   blockstore will discard blocks below that root, our forks structure
   will discard stale banks, funk (which is our accounts database) will
   discard stale transactions (which in turn track modifications to
   accounts), and ghost (which is our fork select tree) will discard
   stale nodes tracking stake percentages. We call this operation
   publishing.

   Note that the vote slots are not necessarily consecutive. Here I
   elided the votes sandwiched between the newest and oldest votes for
   brevity.

   Next, let's go over three additional tower checks. These three
   checks further reinforce the consensus algorithm we established with
   intuition, in this case getting a supermajority (ie. 2/3) of stake to
   converge on a fork.

   The first is the threshold check. The threshold check makes sure at
   least 2/3 of stake has voted for the same fork as the vote at depth 8
   in our tower. Essentially, this guards our tower from getting too
   out of sync with the rest of the cluster. If we get too out of sync,
   we can't vote for a long time, because we would have to rollback a
   vote we had already confirmed many times and that had a large
   lockout. This might otherwise happen as the result of a network
   partition where we can only communicate with a subset of stake.

   Next is the lockout check. We went into detail on this earlier when
   going through lockouts and expiration slots, and as before, the
   rule is we can only vote on a slot for a different fork from a
   previous vote after that vote's expiration slot.

   Given this fork and tower from earlier:

            /-- 3-- 4
      1-- 2
            \-- 5

   slot | conf
   -----------
    4   |  1
    3   |  2
    2   |  3
    1   |  4

   You're locked out from voting for 5 because it's on a different fork
   from 4 and the expiration slot of your previous vote for 4 is 6.

   However, if we introduce a new slot 9:

            /-- 3-- 4
      1-- 2
            \-- 5-- 9

   slot | conf
   -----------
    9   |  1
    2   |  3
    1   |  4

   Here the new slot 9 descends from 5 and, unlike 5, exceeds vote slot
   4's expiration slot of 6.

   After your lockout expires, the tower rules allow you to vote for
   descendants of the fork slot you wanted to switch to in the first
   place (here, 9 descending from 5). So we eventually switch to the
   fork we wanted, by voting for 9 and expiring 3 and 4.

   Importantly, notice that the fork slots and vote slots are not exactly
   1-to-1. While conceptually our tower is voting for the fork 1, 2, 5,
   9, the vote for 5 is only implied. Our tower votes themselves still
   can't include 5 due to lockout.

   Finally, the switch check. The switch check is used to safeguard
   optimistic confirmation. Optimistic confirmation is when a slot gets
   2/3 of stake-weighted votes. This is then treated as a signal that the
   slot will eventually get rooted. However, to actually guarantee this
   we need a rule that prevents validators from arbitrarily switching
   forks (even when their vote lockout has expired). This rule is the
   switch check.

   The switch check applies in addition to the lockout check. Before
   switching forks, we need to make sure at least 38% of stake has voted
   for a different fork than our own. "Different fork" is defined by
   finding the greatest common ancestor of our last voted fork slot and
   the slot we want to switch to. Any forks descending from the greatest
   common ancestor (which I will subsequently call the GCA) that are not
   our own fork are counted towards the switch check stake.

   Here we visualize the switch check:

                 /-- 7
            /-- 3-- 4
      1-- 2 -- 6
            \-- 5-- 9

   First, we find the GCA of 4 and 9 which is 2. Then we look at all the
   descendants of the GCA that do not share a fork with us, and make sure
   their stake sums to more than 38%.

   I'd like to highlight that 7 here is not counted towards the switch
   proof, even though it is on a different fork from 4. This is because
   it's on the same fork relative to the GCA.

   So that covers the checks. Next, there are two additional important
   concepts: "reset slot" and "vote slot". The reset slot is the slot you
   reset PoH to when it's your turn to be leader. Because you are
   responsible for producing a block, you need to decide which fork to
   build your block on. For example, if there are two competing slots 3
   and 4, you would decide whether to build 3 <- 5 or 4 <- 5. In general
   the reset slot is on the same fork as the vote slot, but not always.
   There is an important reason for this. Recall this fork graph from
   earlier:

            /-- 3-- 4 (10%)
      1-- 2
            \-- 5-- 6 (9%)

   In this diagram, 4 is the winner of fork choice. All future leaders
   now want to reset to slot 4. Naively, this makes sense because you
   maximize the chance of your block finalizing (and earning the rewards)
   if you greedily (in the algorithmic, and perhaps also literal sense)
   pick what's currently the heaviest.

   However, say most validators actually voted for fork 5, even though
   we currently observe fork 4 as the heavier. For whatever reason, those
   votes for 5 just didn't land (the leader for 6 missed the votes,
   network blip, etc.)

   All these validators that voted for 5 are now constrained by the
   switch check (38% of stake), and none of them can actually switch
   their vote to 4 (which only has 10%). Meanwhile, leaders all continue
   to build blocks on top of fork 4, which importantly implies that
   votes for 5 will not be able to propagate. This is because the
   validators that can't switch continue to refresh their votes for 5,
   but those votes never "land" because no one is building blocks on
   top of fork 5 anymore (everyone is building on 4 because that's
   currently the heaviest).

   Therefore, it is important to reset to the same fork as your last vote
   slot, which is usually also the heaviest fork, but not always.

   Now let's switch gears from theory back to practice. How does the
   literal mechanism of voting actually work?

   Validators don't send individual votes. Rather, they send their
   entire updated tower to the cluster every time. Essentially, the
   validator is continuously syncing their local tower with the cluster.
   That tower state is then stored inside a vote account, like any other
   state on Solana.

   On the flip side, validators also must stay in sync the other way,
   from cluster to local. If a validator has previously voted, then they
   have an on-chain vote account containing the cluster's latest view of
   the tower (as of a given replay slot). If this on-chain tower is
   incompatible with the local one, they must be reconciled
   (fd_tower_reconcile - also note the etymology of the "TowerSync" vote
   instruction).

   Finally, a note on the difference between the Vote Program and
   TowerBFT. The Vote Program runs during transaction (block) execution.
   It checks that certain invariants about the tower inside a vote
   transaction are upheld (recall a validator sends their entire tower as
   part of a "vote"): otherwise, it fails the transaction. For example,
   it checks that every vote contains a tower in which the vote slots are
   strictly monotonically increasing. As a consequence, only valid
   towers are committed to the ledger. Another important detail of the
   Vote Program is that it only has a view of the current fork on which
   it is executing. Specifically, it can't observe what state is on
   other forks, like what a validator's tower looks like on fork A vs.
   fork B.

   The TowerBFT rules, on the other hand, run after transaction
   execution. Also unlike the Vote Program, the TowerBFT rules do not
   take the vote transactions as inputs: rather the inputs are the towers
   that have already been written to the ledger by the Vote Program. As
   described above, the Vote Program validates every tower, and in this
   way, the TowerBFT rules leverage the validation already done by the
   Vote Program to (mostly) assume each tower is valid. Every validator
   runs TowerBFT to update their own tower with votes based on the
   algorithm documented above. Importantly, TowerBFT has a view of all
   forks, and the validator makes a voting decision based on all forks. */

#include "../fd_choreo_base.h"
#include "fd_tower_blocks.h"
#include "fd_tower_leaves.h"
#include "fd_tower_lockos.h"
#include "fd_tower_serdes.h"
#include "fd_tower_stakes.h"
#include "fd_tower_voters.h"
#include "../ghost/fd_ghost.h"
#include "../notar/fd_notar.h"
#include "../../disco/pack/fd_microblock.h"

#define FD_TOWER_VOTE_MAX (FD_TOWER_LOCKOS_MAX)

/* fd_tower is a representation of a validator's "vote tower" (described
   in detail in the preamble at the top of this file). The votes in the
   tower are stored in an fd_deque.c ordered from lowest to highest vote
   slot (highest to lowest confirmation count) relative to the head and
   tail. There can be at most FD_TOWER_VOTE_MAX votes in the tower. */

struct fd_tower_vote {
  ulong slot; /* vote slot */
  ulong conf; /* confirmation count */
};
typedef struct fd_tower_vote fd_tower_vote_t;

#define DEQUE_NAME fd_tower
#define DEQUE_T    fd_tower_vote_t
#define DEQUE_MAX  FD_TOWER_VOTE_MAX
#include "../../util/tmpl/fd_deque.c"

typedef fd_tower_vote_t fd_tower_t; /* typedef for semantic clarity */

/* FD_TOWER_{ALIGN,FOOTPRINT} provided for static declarations. */

#define FD_TOWER_ALIGN     (alignof(fd_tower_private_t))
#define FD_TOWER_FOOTPRINT (sizeof (fd_tower_private_t))
FD_STATIC_ASSERT( alignof(fd_tower_private_t)==8UL,   FD_TOWER_ALIGN     );
FD_STATIC_ASSERT( sizeof (fd_tower_private_t)==512UL, FD_TOWER_FOOTPRINT );

#define FD_TOWER_FLAG_ANCESTOR_ROLLBACK 0 /* rollback to an ancestor of our prev vote */
#define FD_TOWER_FLAG_SIBLING_CONFIRMED 1 /* our prev vote was a duplicate and its sibling got confirmed */
#define FD_TOWER_FLAG_SAME_FORK         2 /* prev vote is on the same fork */
#define FD_TOWER_FLAG_SWITCH_PASS       3 /* successfully switched to a different fork */
#define FD_TOWER_FLAG_SWITCH_FAIL       4 /* failed to switch to a different fork */
#define FD_TOWER_FLAG_LOCKOUT_FAIL      5 /* failed lockout check */
#define FD_TOWER_FLAG_THRESHOLD_FAIL    6 /* failed threshold check */
#define FD_TOWER_FLAG_PROPAGATED_FAIL   7 /* failed propagated check */

struct fd_tower_out {
  uchar     flags;          /* one of FD_TOWER_FLAG_{ANCESTOR_ROLLBACK,...} */
  ulong     reset_slot;     /* slot to reset PoH to */
  fd_hash_t reset_block_id; /* block ID to reset PoH to */
  ulong     vote_slot;      /* slot to vote for (ULONG_MAX if no vote) */
  fd_hash_t vote_block_id;  /* block ID to vote for */
  ulong     root_slot;      /* new tower root slot (ULONG_MAX if no new root) */
  fd_hash_t root_block_id;  /* new tower root block ID */
};
typedef struct fd_tower_out fd_tower_out_t;

#define FD_TOWER_CSTR_MIN (917UL) /* worst-case tower with 31 entries and 20-digit (ulong max width) slots */

/* fd_tower_vote_and_reset selects both a block to vote for and a block
   to reset to. Returns a struct with a reason code
   (FD_TOWER_FLAG_{ANCESTOR_ROLLBACK,...}) in addition to
   {reset,vote,root}_{slot,block_id}.

   We can't always vote, so vote_slot may be ULONG_MAX, which indicates
   no vote should be cast and the caller should ignore vote_block_id.
   New roots result from votes, so the same applies for root_slot (there
   is not always a new root). However there is always a reset block, so
   reset_slot and reset_block_id will always be populated on return. The
   implementation contains detailed documentation of the tower rules. */

fd_tower_out_t
fd_tower_vote_and_reset( fd_tower_t *        tower,
                         fd_tower_blocks_t * blocks,
                         fd_tower_leaves_t * leaves,
                         fd_tower_lockos_t * lockos,
                         fd_tower_stakes_t * stakes,
                         fd_tower_voters_t * voters,
                         fd_ghost_t *        ghost,
                         fd_notar_t *        notar );

/* Misc */

/* fd_tower_reconcile reconciles our local tower with the on-chain tower
   inside our vote account. Mirrors what Agave does. */

void
fd_tower_reconcile( fd_tower_t *  tower,
                    ulong         tower_root,
                    uchar const * vote_acc );

/* fd_tower_from_vote_acc deserializes the vote account into tower.
   Assumes tower is a valid local join and currently empty. */

void
fd_tower_from_vote_acc( fd_tower_t *  tower,
                        uchar const * vote_acc );

/* fd_tower_with_lat_from_vote_acc deserializes the vote account into
   tower, including slot latency (when available) for tower votes.
   Assumes tower points to a static array of length FD_TOWER_VOTE_MAX.

   Returns the number of copied elements. */

ulong
fd_tower_with_lat_from_vote_acc( fd_vote_acc_vote_t tower[ static FD_TOWER_VOTE_MAX ],
                                 uchar const *      vote_acc );

/* fd_tower_to_vote_txn writes tower into a fd_tower_sync_t vote
   instruction and serializes it into a Solana transaction. Assumes
   tower is a valid local join. */

void
fd_tower_to_vote_txn( fd_tower_t const *  tower,
                      ulong               root,
                      fd_hash_t const *   bank_hash,
                      fd_hash_t const *   block_id,
                      fd_hash_t const *   recent_blockhash,
                      fd_pubkey_t const * validator_identity,
                      fd_pubkey_t const * vote_authority,
                      fd_pubkey_t const * vote_account,
                      fd_txn_p_t *        vote_txn );

/* fd_tower_verify checks tower is in a valid state. Valid iff:
   - cnt < FD_TOWER_VOTE_MAX
   - vote slots are strictly monotonically increasing while their
     confirmation counts are strictly monotonically decreasing (per the
     deque ordering described above) */

int
fd_tower_verify( fd_tower_t const * tower );

/* fd_tower_to_cstr pretty-prints tower as a formatted table to the
   buffer cstr. Assumes cstr is a buffer of at least FD_TOWER_CSTR_MIN
   capacity.

   Sample output:

        slot | confirmation count
   --------- | ------------------
   279803931 | 1
   279803930 | 2
   ...
   279803901 | 31
   279803900 | root
*/

char *
fd_tower_to_cstr( fd_tower_t const * tower,
                  ulong              root,
                  char *             cstr );

#endif /* HEADER_fd_src_choreo_tower_fd_tower_h */