#ifndef HEADER_fd_src_disco_store_fd_store_h
#define HEADER_fd_src_disco_store_fd_store_h

/* fd_store is a high-performance in-memory storage engine for shreds as
   they are received from the network.

   The elements in the store themselves are not shreds, but FEC set
   payloads. Briefly, FEC sets are groupings of shreds that encode the
   same data but provide redundancy and security. While Firedancer
   receives individual shreds over Turbine / Repair (the relevant Solana
   protocols), much of Firedancer's validation of those shreds can only
   be done at the FEC set boundary. Also, there are potential future
   protocol changes to encode all shreds in a FEC set so that they can
   only be decoded once the entire FEC set is received ("all-coding").

   Therefore, the FEC set, rather than the individual shred, is the
   store's logical unit. Replay will only ever attempt replay of a FEC
   set, so there is no need to store at the granularity of individual
   shreds. The store is essentially a mapping of merkle root->FEC set
   payload, in which the merkle root is produced by feeding every shred
   in a FEC set into a merkle tree. This uniquely identifies a FEC set,
   because if any of the shreds change, the merkle root changes.

   Shreds are coalesced and inserted into the store as bytes. A FEC set
   payload is currently variable-length, capped at 63985 bytes. In the
   future this will be fixed at 31840 bytes, once FEC sets are enforced
   to always be 32 shreds.

   The shared memory used by a store instance is within a workspace such
   that it is also persistent and remotely inspectable. Store is
   designed to be used inter-process (allowing concurrent joins from
   multiple tiles), relocated in memory (via wksp operations), and
   accessed concurrently (managing conflicts with a lock).

   EQUIVOCATION

   There is a protocol violation called equivocation (also known as
   "duplicates") that results in two or more blocks for the same slot.
   The actual conflict occurs at the shred level: a leader produces two
   or more shreds for the same slot at the same index, with different
   data payloads. The result will be two different FEC sets for the
   same "logical" FEC set based on (slot, fec_set_idx).

   Unfortunately a naive keying scheme such as (slot, fec_set_idx) is
   insufficient as a result of equivocation. As mentioned earlier,
   Store instead uses the merkle root as the key for its FEC set
   elements, at the cost of some keying performance and complexity.

   ARCHITECTURE

   In the Firedancer topology, the Shred tile inserts to the store and
   the Replay tile queries from it. Replay only queries the store after
   Shred has notified Replay that a FEC set is ready. Shred's inserts
   are append-only, and Replay is responsible for removing once it is
   done consuming (signaled by a new Tower root).

   Shred (inserts) -> Replay (queries, removes)

   ORDERING

   In the above architecture, Shred delivers FEC sets to Replay in
   partial order. Any given fork will be delivered in-order, but
   concurrent forks can be delivered in arbitrary order. Another way to
   phrase this is that a parent FEC set will always be delivered before
   its child (see fd_reasm).

   CONCURRENCY

   It is possible to design Store access in a way that enables parallel
   inserts and minimizes lock contention between readers and writers.
   Store contains a fd_rwlock (read-write lock), but the name is a bit
   of a misnomer: inserts can actually be concurrent, and the only
   operation that truly needs the write lock is remove, which is done
   by Replay. It is more appropriate to describe it as an
   exclusive-shared access lock.

   In this design, writers (Shred tiles) hold the shared lock during
   their access. The reader (Replay tile) also holds the shared lock
   during its access, and given both Shred and Replay take out shared
   locks, they will not contend.

   For parallel inserts, the Store's hash function is carefully designed
   to partition the keyspace so that the same Shred tile always inserts
   to the same map slots. This ensures map collisions always happen on
   the same Shred tile and cannot happen across tiles. Specifically, if
   two different FEC sets hash to the same slot, it is guaranteed to be
   the same Shred tile processing both those FEC sets. This prevents a
   data race in which multiple Shred tiles (each with a handle to the
   shared lock) write to the same map slot.

   The hash function is defined as follows:
   ```
   #define MAP_KEY_HASH(key,seed) ((ulong)key->merkle_root.ul[0]%seed + (key)->part_idx*seed)
   ```
   where `key` is a key type that includes the merkle root (32 bytes)
   and the partition index (8 bytes), which is the index of the Shred
   tile doing the insertion. seed, on initialization, is the number of
   chains/buckets in the map_chain divided by the number of partitions.
   In effect, seed is the size of each partition. For example, if the
   map_chain is sized to 1024, and there are 4 shred tiles, then the
   seed is 1024/4 = 256. The map key hash then bounds the chain index
   of each partition: shred tile 0 will write to chains 0-255, shred
   tile 1 to chains 256-511, shred tile 2 to chains 512-767, and shred
   tile 3 to chains 768-1023, without overlap. The merkle root is a
   32-byte SHA-256 hash, so we can expect a fairly uniform distribution
   of hash values even after truncating to the first 8 bytes, without
   needing to introduce more randomness. Thus, rather than supplying
   randomness, the `seed` argument is repurposed to carry the partition
   size.
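
   As a quick sanity check of the partition bound, the following
   standalone sketch (illustrative only; plain C stand-ins rather than
   the actual fd types) applies the same formula with 1024 chains and
   4 partitions, i.e. seed = 256, and verifies that every chain index
   lands inside its tile's range:

   ```
   #include <assert.h>
   #include <stdio.h>

   // Same shape as MAP_KEY_HASH, with seed = chain_cnt / part_cnt
   static unsigned long
   key_hash( unsigned long mr_word, unsigned long part_idx, unsigned long seed ) {
     return mr_word % seed + part_idx*seed;
   }

   int
   main( void ) {
     unsigned long chain_cnt = 1024UL;
     unsigned long part_cnt  = 4UL;
     unsigned long seed      = chain_cnt / part_cnt; // 256
     for( unsigned long part = 0UL; part < part_cnt; part++ ) {
       for( unsigned long mr = 0UL; mr < 100000UL; mr += 97UL ) { // arbitrary sample words
         unsigned long chain = key_hash( mr, part, seed );
         assert( part*seed <= chain && chain < (part+1UL)*seed ); // stays in partition
       }
     }
     printf( "ok\n" );
     return 0;
   }
   ```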

   Essentially, this allows for limited single-producer single-consumer
   (SPSC) concurrency, where the producer is a given Shred tile and the
   consumer is Replay tile. The SPSC concurrency is limited in that the
   Store should 1. only be read by Replay after Shred has notified
   Replay it is time to read (ie. Shred has finished writing), and 2. be
   written to by Shred(s) in append-only fashion, so Shred never
   modifies or removes from the map. Store is backed by fd_map_chain,
   which is not thread-safe generally, but does support this particular
   SPSC concurrency model in cases where the consumer is guaranteed to
   be lagging the producer.

   Analyzing fd_map_chain in gory detail: in the case of a map collision
   where the Replay tile is reading an element and a Shred tile inserts
   a new element into the same map slot, the new element is prepended to
   the hash chain within that slot (which modifies what the head of the
   chain points to, as well as the now-previous head's `.next` field,
   but does not touch application data). With fencing enabled
   (MAP_INSERT_FENCE), it is guaranteed the consumer queries the head
   either before or after the update. If it queries before, that is
   safe: it simply checks the key (and if there is no match, iterates
   down the chain). If it queries after, it is also safe, because the
   new element is guaranteed to be before the old element in the chain,
   so it just does one more iteration. Note the consumer should always
   use fd_store_query_const to ensure the underlying fd_map_chain is not
   modified during querying.
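
   The before-or-after reasoning can be modeled with a toy prepend-only
   chain (a simplified, single-threaded stand-in for one fd_map_chain
   bucket; all names here are illustrative, not the real API):

   ```
   #include <assert.h>
   #include <stddef.h>
   #include <stdio.h>

   // Toy model of one hash chain bucket: prepend-only insertion
   typedef struct node { unsigned long key; struct node * next; } node_t;

   // Producer step: prepend a new element in front of the old head
   static node_t *
   prepend( node_t * head, node_t * n ) {
     n->next = head; // old head stays reachable behind the new element
     return n;       // new head
   }

   // Consumer step: walk the chain looking for key
   static node_t *
   find( node_t * head, unsigned long key ) {
     for( node_t * cur = head; cur; cur = cur->next )
       if( cur->key == key ) return cur;
     return NULL;
   }

   int
   main( void ) {
     node_t a = { 42UL, NULL };
     node_t * head = prepend( NULL, &a );

     node_t * snapshot = head;   // consumer read the head "before" the update

     node_t b = { 7UL, NULL };
     head = prepend( head, &b ); // producer inserts a colliding element

     // Either way, the consumer still reaches element 42: from the old
     // snapshot directly, or from the new head with one extra iteration.
     assert( find( snapshot, 42UL ) == &a );
     assert( find( head,     42UL ) == &a );
     printf( "ok\n" );
     return 0;
   }
   ```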

   The exception to the above is removing. Removing requires exclusive
   access because it involves removing from fd_map_chain, which is not
   safe for shared access. So the Replay tile should take out the
   exclusive lock. Removing happens at most once per slot, so it is a
   relatively infrequent Store access compared to FEC queries and
   inserts (which is good because it is also the most expensive). */

#include "../../flamenco/fd_rwlock.h"
#include "../../flamenco/types/fd_types_custom.h"
#include "../../util/hist/fd_histf.h"

/* FD_STORE_USE_HANDHOLDING: Define this to non-zero at compile time
   to turn on additional runtime checks and logging. */

#ifndef FD_STORE_USE_HANDHOLDING
#define FD_STORE_USE_HANDHOLDING 1
#endif

/* FD_STORE_ALIGN specifies the alignment needed for store. ALIGN is
   double x86 cache line to mitigate various kinds of false sharing (eg.
   ACLPF adjacent cache line prefetch). */

#define FD_STORE_ALIGN (128UL)

/* FD_STORE_MAGIC is a magic number for detecting store corruption. */

#define FD_STORE_MAGIC (0xf17eda2ce75702e0UL) /* firedancer store version 0 */

/* FD_STORE_DATA_MAX defines a constant for the maximum size of a FEC
   set payload. The value is computed from the maximum number of shreds
   in a FEC set * the payload bytes per shred:

   67 shreds per FEC set * 955 payload bytes per shred = 63985 bytes max. */

#define FD_STORE_DATA_MAX (63985UL) /* TODO fixed-32 */

/* fd_store_fec describes a store element (FEC set). The pointer fields
   implement a left-child, right-sibling n-ary tree. */

struct __attribute__((packed)) fd_store_key {
  fd_hash_t merkle_root;
  ulong     part_idx; /* partition index of the caller of fd_store_insert */
};
typedef struct fd_store_key fd_store_key_t;

struct __attribute__((aligned(FD_STORE_ALIGN))) fd_store_fec {
  fd_store_key_t key;            /* map key, merkle root of the FEC set + a partition index */
  ulong          next;           /* reserved for internal use by fd_pool, fd_map_chain */
  fd_hash_t      cmr;            /* parent's map key, chained merkle root of the FEC set */
  uint           block_offs[32]; /* TODO fixed-32. block_offs[ i ] is the total size of data shreds [0, i] */
  ulong          data_sz;        /* TODO fixed-32. sz of the FEC set payload, guaranteed < FD_STORE_DATA_MAX */
  uchar          data[FD_STORE_DATA_MAX];
};
typedef struct fd_store_fec fd_store_fec_t;

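/* Per the field comment above, block_offs[ i ] reads as an inclusive
   prefix sum of data shred payload sizes, so byte ranges can be sliced
   out of data without re-walking the shreds. A minimal sketch of that
   bookkeeping (the shred sizes and names below are illustrative, not
   the real shred layout):

   ```
   #include <assert.h>
   #include <stdio.h>

   int
   main( void ) {
     // Hypothetical payload sizes of 4 data shreds in one FEC set
     unsigned int shred_sz[ 4 ]    = { 955U, 955U, 955U, 100U };
     unsigned int block_offs[ 32 ] = { 0 };

     // block_offs[ i ] = total size of data shreds [0, i]
     unsigned int total = 0U;
     for( int i = 0; i < 4; i++ ) {
       total += shred_sz[ i ];
       block_offs[ i ] = total;
     }

     assert( block_offs[ 0 ] ==  955U );
     assert( block_offs[ 3 ] == 2965U ); // the last offset is the coalesced payload size
     printf( "ok\n" );
     return 0;
   }
   ```
*/
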
#define POOL_NAME  fd_store_pool
#define POOL_ELE_T fd_store_fec_t
#include "../../util/tmpl/fd_pool_para.c"

#define MAP_NAME               fd_store_map
#define MAP_ELE_T              fd_store_fec_t
#define MAP_KEY_T              fd_store_key_t
#define MAP_KEY                key
#define MAP_KEY_EQ(k0,k1)      (!memcmp((k0),(k1),sizeof(fd_hash_t)))
#define MAP_KEY_HASH(key,seed) ((ulong)key->merkle_root.ul[0]%seed + (key)->part_idx*seed) /* see top-level documentation of hash function */
#define MAP_INSERT_FENCE       1
#include "../../util/tmpl/fd_map_chain.c"

struct fd_store {
  ulong magic;          /* ==FD_STORE_MAGIC */
  ulong fec_max;        /* max number of FEC sets that can be stored */
  ulong part_cnt;       /* number of partitions, also the number of writers */
  ulong store_gaddr;    /* wksp gaddr of store in the backing wksp, non-zero gaddr */
  ulong map_gaddr;      /* wksp gaddr of map of fd_store_key->fd_store_fec */
  ulong pool_mem_gaddr; /* wksp gaddr of shmem_t object in pool_para */
  ulong pool_ele_gaddr; /* wksp gaddr of first ele_t object in pool_para */

  fd_rwlock_t lock;     /* shared-exclusive lock */
};
typedef struct fd_store fd_store_t;

FD_PROTOTYPES_BEGIN

/* Constructors */

/* fd_store_{align,footprint} return the required alignment and
   footprint of a memory region suitable for use as store with up to
   fec_max elements. fec_max is an integer power-of-two. */

FD_FN_CONST static inline ulong
fd_store_align( void ) {
  return alignof(fd_store_t);
}

FD_FN_CONST static inline ulong
fd_store_footprint( ulong fec_max ) {
  if( FD_UNLIKELY( !fd_ulong_is_pow2( fec_max ) ) ) return 0UL;
  return FD_LAYOUT_FINI(
         FD_LAYOUT_APPEND(
         FD_LAYOUT_APPEND(
         FD_LAYOUT_APPEND(
         FD_LAYOUT_APPEND(
         FD_LAYOUT_INIT,
           alignof(fd_store_t),     sizeof(fd_store_t)                ),
           fd_store_map_align(),    fd_store_map_footprint( fec_max ) ),
           fd_store_pool_align(),   fd_store_pool_footprint()         ),
           alignof(fd_store_fec_t), sizeof(fd_store_fec_t)*fec_max    ),
         fd_store_align() );
}

/* fd_store_new formats an unused memory region for use as a store.
   shmem is a non-NULL pointer to this region in the local address space
   with the required footprint and alignment. fec_max is an integer
   power-of-two. */

void *
fd_store_new( void * shmem,
              ulong  fec_max,
              ulong  part_cnt );

/* fd_store_join joins the caller to the store. store points to the
   first byte of the memory region backing the store in the caller's
   address space.

   Returns a pointer in the local address space to store on success. */

fd_store_t *
fd_store_join( void * store );

/* fd_store_leave leaves a current local join. Returns a pointer to the
   underlying shared memory region on success and NULL on failure (logs
   details). Reasons for failure include store is NULL. */

void *
fd_store_leave( fd_store_t const * store );

/* fd_store_delete unformats a memory region used as a store.
   Assumes nobody is joined to the region. Returns a pointer to the
   underlying shared memory region or NULL if used obviously in error
   (e.g. store is obviously not a store ... logs details). The
   ownership of the memory region is transferred to the caller. */

void *
fd_store_delete( void * shstore );

/* Accessors */

/* fd_store_wksp returns the local join to the wksp backing the store.
   The lifetime of the returned pointer is at least as long as the
   lifetime of the local join. Assumes store is a current local
   join. */

FD_FN_PURE static inline fd_wksp_t *
fd_store_wksp( fd_store_t const * store ) {
  return (fd_wksp_t *)( ( (ulong)store ) - store->store_gaddr );
}

/* fd_store_{s,x}lock_{acquire,release} interface the store's
   shared-exclusive lock. See also FD_STORE_{S,X}LOCK_{BEGIN,END}. */

static inline void fd_store_slock_acquire( fd_store_t * store ) { fd_rwlock_read   ( &store->lock ); }
static inline void fd_store_slock_release( fd_store_t * store ) { fd_rwlock_unread ( &store->lock ); }
static inline void fd_store_xlock_acquire( fd_store_t * store ) { fd_rwlock_write  ( &store->lock ); }
static inline void fd_store_xlock_release( fd_store_t * store ) { fd_rwlock_unwrite( &store->lock ); }

static inline void
fd_store_private_slock_end( fd_store_t ** _store ) {
  fd_store_t * store = *_store;
  fd_store_slock_release( store );
}

#define FD_STORE_SLOCK_BEGIN(store) {                                                 \
  fd_store_t * _store __attribute__((cleanup(fd_store_private_slock_end))) = (store); \
  fd_store_slock_acquire( _store );                                                   \
  {

#define FD_STORE_SLOCK_END }}

static inline void
fd_store_private_xlock_end( fd_store_t ** _store ) {
  fd_store_t * store = *_store;
  fd_store_xlock_release( store );
}

#define FD_STORE_XLOCK_BEGIN(store) {                                                 \
  fd_store_t * _store __attribute__((cleanup(fd_store_private_xlock_end))) = (store); \
  fd_store_xlock_acquire( _store );                                                   \
  {

#define FD_STORE_XLOCK_END }}

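/* The BEGIN/END macros above rely on the GCC/Clang cleanup variable
   attribute to release the lock on any scope exit. The standalone toy
   below (illustrative names; a plain counter standing in for
   fd_rwlock) demonstrates the same pattern:

   ```
   #include <assert.h>
   #include <stdio.h>

   typedef struct { int readers; } toy_lock_t;

   static void toy_rdlock  ( toy_lock_t * l ) { l->readers++; }
   static void toy_unrdlock( toy_lock_t * l ) { l->readers--; }

   // Called automatically when the scoped variable leaves scope
   static void
   toy_scope_end( toy_lock_t ** _l ) { toy_unrdlock( *_l ); }

   #define TOY_SLOCK_BEGIN(l) {                                          \
       toy_lock_t * _l __attribute__((cleanup(toy_scope_end))) = (l);    \
       toy_rdlock( _l );                                                 \
       {

   #define TOY_SLOCK_END }}

   int
   main( void ) {
     toy_lock_t lock = { 0 };
     TOY_SLOCK_BEGIN( &lock )
       assert( lock.readers == 1 ); // held inside the scope
     TOY_SLOCK_END
     assert( lock.readers == 0 );   // released automatically on scope exit
     printf( "ok\n" );
     return 0;
   }
   ```
*/
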
/* fd_store_query queries the FEC set keyed by merkle_root. Returns a
   pointer to the fd_store_fec_t if found, NULL otherwise. Assumes
   caller has acquired the shared lock.

   IMPORTANT SAFETY TIP! Caller should only release the shared lock
   when they no longer retain interest in the returned pointer. */

fd_store_fec_t *
fd_store_query( fd_store_t *      store,
                fd_hash_t const * merkle_root );

/* Operations */

/* fd_store_insert inserts a new FEC keyed by merkle_root into the
   store. Returns a pointer to the inserted pool ele (fd_store_fec_t *)
   on success. If the merkle root has previously been inserted, returns
   a pointer to the previous element with that key. Returns NULL if the
   store is full (see fd_store_evict).

   Each fd_store_fec_t can hold at most FD_STORE_DATA_MAX bytes of data,
   and caller is responsible for copying into the region.

   This is a blocking operation that acquires a shared lock as part of
   its implementation. Assumes caller has safely partitioned insertions
   if running in parallel (see top-level documentation in fd_store.h for
   details). */

fd_store_fec_t *
fd_store_insert( fd_store_t * store,
                 ulong        part_idx,
                 fd_hash_t *  merkle_root );

/* fd_store_remove removes the FEC set keyed by merkle_root. Logs a
   warning if merkle_root is not found.

   This is a blocking operation that acquires an exclusive lock as part
   of its implementation.

   IMPORTANT SAFETY TIP! Caller should only call fd_store_slock_release
   or fd_store_xlock_release when they no longer retain interest in
   previously returned pointers. */

void
fd_store_remove( fd_store_t *      store,
                 fd_hash_t const * merkle_root );

/* fd_store_verify returns 0 if the store is not obviously corrupt or -1
   otherwise (logs details). */

int
fd_store_verify( fd_store_t * store );

FD_PROTOTYPES_END

#endif /* HEADER_fd_src_disco_store_fd_store_h */