LCOV - code coverage report
Current view: top level - disco/store - fd_store.h (source / functions) Hit Total Coverage
Test: cov.lcov Lines: 0 48 0.0 %
Date: 2026-03-19 18:19:27 Functions: 0 261 0.0 %

          Line data    Source code
       1             : #ifndef HEADER_fd_src_disco_store_fd_store_h
       2             : #define HEADER_fd_src_disco_store_fd_store_h
       3             : 
       4             : /* fd_store is a high-performance in-memory storage engine for shreds as
       5             :    they are received from the network.
       6             : 
       7             :    The elements in the store themselves are not shreds, but FEC set
       8             :    payloads.  Briefly, FEC sets are groupings of shreds that encode the
       9             :    same data but provide redundancy and security.  While Firedancer
      10             :    receives individual shreds over Turbine / Repair (the relevant Solana
      11             :    protocols), a lot of Firedancer's validation of those shreds can only
      12             :    be done at the FEC set boundary.  Also, there are potential future
      13             :    protocol changes to encode all shreds in a FEC set so that they can
      14             :    only be decoded once the entire FEC set is received ("all-coding").
      15             : 
      16             :    Therefore, the FEC set becomes the logical unit for the store vs.
      17             :    individual shreds.  Replay will only ever attempt replay of a FEC set
      18             :    so there is no need to store at the granularity of individual shreds.
      19             :    The store is essentially a mapping of merkle root->FEC set payload,
      20             :    in which the merkle root was produced by feeding every shred in a FEC
      21             :    set into a merkle tree.  This uniquely identifies a FEC set, because
      22             :    if any of the shreds change, the merkle root changes.
      23             : 
       24             :    Shreds are coalesced and inserted into the store as bytes.  A FEC
       25             :    set payload is currently variable-length, capped at 63985 bytes.
      26             :    In the future this will be fixed to 31840 bytes, when FEC sets are
      27             :    enforced to always be 32 shreds.
      28             : 
      29             :    The shared memory used by a store instance is within a workspace such
      30             :    that it is also persistent and remotely inspectable.  Store is
      31             :    designed to be used inter-process (allowing concurrent joins from
      32             :    multiple tiles), relocated in memory (via wksp operations), and
      33             :    accessed concurrently (managing conflicts with a lock).
      34             : 
      35             :    EQUIVOCATION
      36             : 
      37             :    There is a protocol violation called equivocation (also known as
      38             :    "duplicates") that results in two or more blocks for the same slot.
      39             :    The actual conflict occurs at the shred level: a leader produces two
      40             :    or more shreds for the same slot at the same index, with different
      41             :    data payloads.  The result will be two different FEC sets for the
      42             :    same "logical" FEC set based on (slot, fec_set_idx).
      43             : 
      44             :    Unfortunately a naive keying scheme such as (slot, fec_set_idx) is
      45             :    insufficient as a result of equivocation.  As mentioned earlier,
      46             :    Store instead uses the merkle root as the key for its FEC set
      47             :    elements, at the cost of some keying performance and complexity.
      48             : 
      49             :    ARCHITECTURE
      50             : 
      51             :    In the Firedancer topology, Shred tile inserts to store and Replay
      52             :    tile queries from store.  Replay tile only queries from the store
      53             :    after Shred tile has notified Replay that a FEC set is ready. Shred's
      54             :    inserts are append-only and Replay is responsible for removing once
      55             :    it is done consuming (signaled by a new Tower root).
      56             : 
      57             :    Shred (inserts) -> Replay (queries, removes)
      58             : 
      59             :    ORDERING
      60             : 
      61             :    In the above architecture, Shred delivers FEC sets to Replay in
      62             :    partial order.  Any given fork will be delivered in-order, but
      63             :    concurrent forks can be delivered in arbitrary order.  Another way to
       64             :    phrase this is that a parent FEC set will always be delivered before the
      65             :    child (see fd_reasm).
      66             : 
      67             :    CONCURRENCY
      68             : 
      69             :    It is possible to design Store access in a way that enables parallel
      70             :    inserts and minimizes lock contention between readers and writers.
       71             :    Store contains a fd_rwlock (read-write lock), but the name is a
       72             :    bit of a misnomer: inserts can be concurrent, and the only
       73             :    operation that needs the write lock is remove, which is done by
       74             :    Replay.  It is more accurately described as an exclusive-shared
       75             :    access lock.
      76             : 
      77             :    In this design, writers (Shred tiles) hold the shared lock during
      78             :    their access.  The reader (Replay tile) also holds the shared lock
      79             :    during its access, and given both Shred and Replay will be taking out
      80             :    shared locks, they will not contend.
      81             : 
      82             :    For parallel inserts, the Store's hash function is carefully designed
      83             :    to partition the keyspace so that the same Shred tile always inserts
      84             :    to the same map slots.  This ensures map collisions always happen on
      85             :    the same Shred tile and cannot happen across tiles.  Specifically, if
       86             :    two different FEC sets hash to the same slot, it is guaranteed
       87             :    that the same Shred tile processes both of those FEC sets.  This
      88             :    prevents a data race in which multiple Shred tiles (each with a
      89             :    handle to the shared lock) write to the same map slot.
      90             : 
      91             :    The hash function is defined as follows:
      92             :    ```
       93             :    #define MAP_KEY_HASH(key,seed) ((ulong)(key)->merkle_root.ul[0]%seed + (key)->part_idx*seed)
      94             :    ```
       95             :    where `key` is a key type that includes the merkle root (32 bytes)
       96             :    and the partition index (8 bytes), which equals the index of the
       97             :    Shred tile doing the insertion.  seed, on initialization, is the
      98             :    number of chains/buckets in the map_chain divided by the number of
      99             :    partitions.  In effect, seed is the size of each partition.  For
     100             :    example, if the map_chain is sized to 1024, and there are 4 shred
     101             :    tiles, then the seed is 1024/4 = 256.  Then the map key hash can
     102             :    bound the chain index of each partition as such: shred tile 0 will
     103             :    write to chains 0-255, shred tile 1 will write to chains 256-511,
     104             :    shred tile 2 will write to chains 512-767, and shred tile 3 will
     105             :    write to chains 768-1023, without overlap.  The merkle root is a 32
     106             :    byte SHA-256 hash, so we can expect a fairly uniform distribution of
     107             :    hash values even after truncating to the first 8 bytes, without
      108             :    needing to introduce more randomness.  Thus we can repurpose the
      109             :    `seed` argument to carry the size of each partition instead.
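A minimal sketch of the partitioning arithmetic above, using the illustrative 1024-chain / 4-tile numbers (the `demo_` names are hypothetical stand-ins, not part of the store API):

```c
#include <assert.h>
#include <stddef.h>

/* Toy model of MAP_KEY_HASH: mr_word stands in for the first 8 bytes
   of the merkle root and part_idx for the inserting Shred tile. */
typedef struct {
  unsigned long mr_word;
  unsigned long part_idx;
} demo_key_t;

static unsigned long
demo_key_hash( demo_key_t const * key,
               unsigned long      seed ) {
  /* bound the truncated merkle root to the partition size (seed),
     then offset by the partition's base chain index */
  return key->mr_word % seed + key->part_idx*seed;
}
```

With seed = 1024/4 = 256, tile 3's keys always land in chains 768-1023, so a collision can only occur between keys inserted by the same tile.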
     110             : 
     111             :    Essentially, this allows for limited single-producer single-consumer
     112             :    (SPSC) concurrency, where the producer is a given Shred tile and the
     113             :    consumer is Replay tile.  The SPSC concurrency is limited in that the
     114             :    Store should 1. only be read by Replay after Shred has notified
      115             :    Replay it is time to read (i.e. Shred has finished writing), and 2. be
     116             :    written to by Shred(s) in append-only fashion, so Shred never
     117             :    modifies or removes from the map.  Store is backed by fd_map_chain,
     118             :    which is not thread-safe generally, but does support this particular
     119             :    SPSC concurrency model in cases where the consumer is guaranteed to
     120             :    be lagging the producer.
     121             : 
     122             :    Analyzing fd_map_chain in gory detail, in the case of a map collision
     123             :    where Replay tile is reading an element and Shred tile inserts a new
     124             :    element to the same map slot, that new element is prepended to the
     125             :    hash chain within that slot (which modifies what the head of the
     126             :    chain points to as well as the now-previous head in the hash chain's
     127             :    `.next` field, but does not touch application data).  With fencing
     128             :    enabled (MAP_INSERT_FENCE), it is guaranteed the consumer either
     129             :    queries the head before or after the update.  If it queries before,
      130             :    that is safe: it would just check the key (if no match, iterate down
     131             :    the chain etc.)  If it queries after, it is also safe because the new
     132             :    element is guaranteed to be before the old element in the chain, so
     133             :    it would just do one more iteration.  Note the consumer should always
     134             :    use fd_store_query_const to ensure the underlying fd_map_chain is not
     135             :    modified during querying.
     136             : 
     137             :    The exception to the above is removing.  Removing requires exclusive
     138             :    access because it involves removing from fd_map_chain, which is not
     139             :    safe for shared access.  So the Replay tile should take out the
     140             :    exclusive lock.  Removing happens at most once per slot, so it is a
     141             :    relatively infrequent Store access compared to FEC queries and
     142             :    inserts (which is good because it is also the most expensive). */
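The collision analysis above (writer prepends to the chain head, a lagging reader is unaffected) can be modeled with a self-contained sketch; this mimics only the prepend behavior the text describes, not fd_map_chain itself, and omits the fencing:

```c
#include <assert.h>
#include <stddef.h>

/* Minimal hash-chain model: insert prepends to the head, never
   touching existing elements' application data. */
typedef struct demo_ele demo_ele_t;
struct demo_ele {
  unsigned long key;
  demo_ele_t *  next;
};

static demo_ele_t *
demo_chain_prepend( demo_ele_t * head,
                    demo_ele_t * ele ) {
  ele->next = head; /* new element points at the old head ... */
  return ele;       /* ... and becomes the new head */
}

static demo_ele_t *
demo_chain_query( demo_ele_t *  head,
                  unsigned long key ) {
  for( demo_ele_t * e=head; e; e=e->next ) if( e->key==key ) return e;
  return NULL;
}
```

Whether the reader captured the head before or after a colliding insert, its traversal still reaches the element it wants; the post-insert view just costs one extra hop.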
     143             : 
     144             : #include "../../flamenco/fd_rwlock.h"
     145             : #include "../../flamenco/types/fd_types_custom.h"
     146             : #include "../../util/hist/fd_histf.h"
     147             : 
     148             : /* FD_STORE_USE_HANDHOLDING:  Define this to non-zero at compile time
     149             :    to turn on additional runtime checks and logging. */
     150             : 
     151             : #ifndef FD_STORE_USE_HANDHOLDING
     152             : #define FD_STORE_USE_HANDHOLDING 1
     153             : #endif
     154             : 
     155             : /* FD_STORE_ALIGN specifies the alignment needed for store.  ALIGN is
      156             :    double x86 cache line to mitigate various kinds of false sharing (e.g.
     157             :    ACLPF adjacent cache line prefetch). */
     158             : 
     159             : #define FD_STORE_ALIGN (128UL)
     160             : 
     161             : /* FD_STORE_MAGIC is a magic number for detecting store corruption. */
     162             : 
     163           0 : #define FD_STORE_MAGIC (0xf17eda2ce75702e0UL) /* firedancer store version 0 */
     164             : 
     165             : /* FD_STORE_DATA_MAX defines a constant for the maximum size of a FEC
     166             :    set payload.  The value is computed from the maximum number
     167             :    of shreds in a FEC set * the payload bytes per shred.
     168             : 
      169             :    67 shreds per FEC set * 955 payload bytes per shred = 63985 bytes max. */
     170             : 
     171           0 : #define FD_STORE_DATA_MAX (63985UL) /* TODO fixed-32 */
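The arithmetic behind these constants can be checked directly; note the future 31840-byte fixed-32 figure implies 995 payload bytes per shred, a number derived here rather than stated in this header:

```c
#include <assert.h>

/* max FEC payload = max shreds per set * payload bytes per shred */
static unsigned long
demo_fec_data_max( unsigned long shred_cnt,
                   unsigned long payload_sz ) {
  return shred_cnt*payload_sz;
}
```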
     172             : 
     173             : /* fd_store_fec describes a store element (FEC set).  The pointer fields
     174             :    implement a left-child, right-sibling n-ary tree. */
     175             : 
     176             : struct __attribute__((packed)) fd_store_key {
     177             :    fd_hash_t merkle_root;
     178             :    ulong     part_idx; /* partition index of the caller of fd_store_insert */
     179             : };
     180             : typedef struct fd_store_key fd_store_key_t;
     181             : 
     182             : struct __attribute__((aligned(FD_STORE_ALIGN))) fd_store_fec {
     183             :   fd_store_key_t key;            /* map key, merkle root of the FEC set + a partition index */
     184             :   ulong          next;           /* reserved for internal use by fd_pool, fd_map_chain */
     185             :   fd_hash_t      cmr;            /* parent's map key, chained merkle root of the FEC set */
     186             :   uint           block_offs[32]; /* TODO fixed-32. block_offs[ i ] is the total size of data shreds [0, i] */
     187             :   ulong          data_sz;        /* TODO fixed-32. sz of the FEC set payload, guaranteed < FD_STORE_DATA_MAX */
     188             :   uchar          data[FD_STORE_DATA_MAX];
     189             : };
     190             : typedef struct fd_store_fec fd_store_fec_t;
     191             : 
     192             : #define POOL_NAME  fd_store_pool
     193           0 : #define POOL_ELE_T fd_store_fec_t
     194             : #include "../../util/tmpl/fd_pool_para.c"
     195             : 
     196             : #define MAP_NAME               fd_store_map
     197             : #define MAP_ELE_T              fd_store_fec_t
     198             : #define MAP_KEY_T              fd_store_key_t
     199           0 : #define MAP_KEY                key
     200           0 : #define MAP_KEY_EQ(k0,k1)      (!memcmp((k0),(k1), sizeof(fd_hash_t)))
     201           0 : #define MAP_KEY_HASH(key,seed) ((ulong)key->merkle_root.ul[0]%seed + (key)->part_idx*seed) /* see top-level documentation of hash function */
     202             : #define MAP_INSERT_FENCE       1
     203             : #include "../../util/tmpl/fd_map_chain.c"
     204             : 
     205             : struct fd_store {
     206             :   ulong magic;          /* ==FD_STORE_MAGIC */
     207             :   ulong fec_max;        /* max number of FEC sets that can be stored */
     208             :   ulong part_cnt;       /* number of partitions, also the number of writers */
     209             :   ulong store_gaddr;    /* wksp gaddr of store in the backing wksp, non-zero gaddr */
     210             :   ulong map_gaddr;      /* wksp gaddr of map of fd_store_key->fd_store_fec */
     211             :   ulong pool_mem_gaddr; /* wksp gaddr of shmem_t object in pool_para */
     212             :   ulong pool_ele_gaddr; /* wksp gaddr of first ele_t object in pool_para */
     213             : 
     214             :   fd_rwlock_t lock; /* shared-exclusive lock */
     215             : };
     216             : typedef struct fd_store fd_store_t;
     217             : 
     218             : FD_PROTOTYPES_BEGIN
     219             : 
     220             : /* Constructors */
     221             : 
     222             : /* fd_store_{align,footprint} return the required alignment and
     223             :    footprint of a memory region suitable for use as store with up to
     224             :    fec_max elements.  fec_max is an integer power-of-two. */
     225             : 
     226             : FD_FN_CONST static inline ulong
     227           0 : fd_store_align( void ) {
     228           0 :   return alignof(fd_store_t);
     229           0 : }
     230             : 
     231             : FD_FN_CONST static inline ulong
     232           0 : fd_store_footprint( ulong fec_max ) {
     233           0 :   if( FD_UNLIKELY( !fd_ulong_is_pow2( fec_max ) ) ) return 0UL;
     234           0 :   return FD_LAYOUT_FINI(
     235           0 :     FD_LAYOUT_APPEND(
     236           0 :     FD_LAYOUT_APPEND(
     237           0 :     FD_LAYOUT_APPEND(
     238           0 :     FD_LAYOUT_APPEND(
     239           0 :     FD_LAYOUT_INIT,
     240           0 :       alignof(fd_store_t),     sizeof(fd_store_t)                ),
     241           0 :       fd_store_map_align(),    fd_store_map_footprint( fec_max ) ),
     242           0 :       fd_store_pool_align(),   fd_store_pool_footprint()         ),
     243           0 :       alignof(fd_store_fec_t), sizeof(fd_store_fec_t)*fec_max    ),
     244           0 :     fd_store_align() );
     245           0 : }
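The footprint above follows the align-up-then-append pattern of FD_LAYOUT; a standalone sketch with made-up region sizes (the real footprints come from the map/pool templates) shows the shape of the arithmetic:

```c
#include <assert.h>

/* align must be a power of two */
static unsigned long
demo_align_up( unsigned long off,
               unsigned long align ) {
  return (off + align - 1UL) & ~(align - 1UL);
}

/* Mirrors the FD_LAYOUT_{INIT,APPEND,FINI} shape: start at offset 0,
   align up before each region, add its size, then align the total to
   the structure's overall alignment. */
static unsigned long
demo_layout( unsigned long const * aligns,
             unsigned long const * szs,
             unsigned long         cnt,
             unsigned long         final_align ) {
  unsigned long off = 0UL;
  for( unsigned long i=0UL; i<cnt; i++ ) off = demo_align_up( off, aligns[i] ) + szs[i];
  return demo_align_up( off, final_align );
}
```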
     246             : 
     247             : /* fd_store_new formats an unused memory region for use as a store.
      248             :    shmem is a non-NULL pointer to this region in the local address space
      249             :    with the required footprint and alignment.  fec_max is an integer
      250             :    power-of-two.  part_cnt is the number of insert partitions (writers). */
     251             : 
     252             : void *
     253             : fd_store_new( void * shmem,
     254             :               ulong  fec_max,
     255             :               ulong  part_cnt );
     256             : 
     257             : /* fd_store_join joins the caller to the store.  store points to the
     258             :    first byte of the memory region backing the store in the caller's
     259             :    address space.
     260             : 
     261             :    Returns a pointer in the local address space to store on success. */
     262             : 
     263             : fd_store_t *
     264             : fd_store_join( void * store );
     265             : 
     266             : /* fd_store_leave leaves a current local join.  Returns a pointer to the
     267             :    underlying shared memory region on success and NULL on failure (logs
     268             :    details).  Reasons for failure include store is NULL. */
     269             : 
     270             : void *
     271             : fd_store_leave( fd_store_t const * store );
     272             : 
     273             : /* fd_store_delete unformats a memory region used as a store.
      274             :    Assumes nobody is joined to the region.  Returns a
     275             :    pointer to the underlying shared memory region or NULL if used
     276             :    obviously in error (e.g. store is obviously not a store ... logs
      277             :    obviously in error (e.g. shstore is obviously not a store ... logs
     278             :    caller. */
     279             : 
     280             : void *
     281             : fd_store_delete( void * shstore );
     282             : 
     283             : /* Accessors */
     284             : 
     285             : /* fd_store_wksp returns the local join to the wksp backing the store.
     286             :    The lifetime of the returned pointer is at least as long as the
     287             :    lifetime of the local join.  Assumes store is a current local
     288             :    join. */
     289             : 
     290             : FD_FN_PURE static inline fd_wksp_t *
     291           0 : fd_store_wksp( fd_store_t const * store ) {
     292           0 :   return (fd_wksp_t *)( ( (ulong)store ) - store->store_gaddr );
     293           0 : }
     294             : 
      295             : /* fd_store_{s,x}lock_{acquire,release} interface with store's
      296             :    shared-exclusive lock.  See also FD_STORE_{S,X}LOCK_{BEGIN,END}. */
     297             : 
     298           0 : static inline void fd_store_slock_acquire( fd_store_t * store ) { fd_rwlock_read   ( &store->lock ); }
     299           0 : static inline void fd_store_slock_release( fd_store_t * store ) { fd_rwlock_unread ( &store->lock ); }
     300           0 : static inline void fd_store_xlock_acquire( fd_store_t * store ) { fd_rwlock_write  ( &store->lock ); }
     301           0 : static inline void fd_store_xlock_release( fd_store_t * store ) { fd_rwlock_unwrite( &store->lock ); }
     302             : 
     303             : static inline void
     304           0 : fd_store_private_slock_end( fd_store_t ** _store ) {
     305           0 :    fd_store_t * store = *_store;
     306           0 :    fd_store_slock_release( store );
     307           0 : }
     308             : 
     309           0 : #define FD_STORE_SLOCK_BEGIN(store) {                                                 \
     310           0 :   fd_store_t * _store __attribute__((cleanup(fd_store_private_slock_end))) = (store); \
     311           0 :   fd_store_slock_acquire( _store );                                                   \
     312           0 :   {
     313             : 
     314           0 : #define FD_STORE_SLOCK_END }}
     315             : 
     316             : static inline void
     317           0 : fd_store_private_xlock_end( fd_store_t ** _store ) {
     318           0 :    fd_store_t * store = *_store;
     319           0 :    fd_store_xlock_release( store );
     320           0 : }
     321             : 
     322           0 : #define FD_STORE_XLOCK_BEGIN(store) {                                                 \
     323           0 :   fd_store_t * _store __attribute__((cleanup(fd_store_private_xlock_end))) = (store); \
     324           0 :   fd_store_xlock_acquire( _store );                                                   \
     325           0 :   {
     326             : 
     327           0 : #define FD_STORE_XLOCK_END }}
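The BEGIN/END macros above rely on the GCC/Clang `__attribute__((cleanup))` extension so the release fires automatically when the scope closes, including on early exits. A self-contained sketch of the same pattern, with a dummy depth counter standing in for fd_rwlock:

```c
#include <assert.h>

static int demo_lock_depth = 0; /* stands in for the rwlock state */

static void demo_lock_acquire( int *  l ) { (*l)++;  }
/* cleanup handlers receive a pointer to the annotated variable */
static void demo_lock_release( int ** l ) { (**l)--; }

/* scoped lock: release runs automatically when the scope closes */
#define DEMO_LOCK_BEGIN(l) {                                       \
  int * _l __attribute__((cleanup(demo_lock_release))) = (l);      \
  demo_lock_acquire( _l );                                         \
  {

#define DEMO_LOCK_END }}

static int
demo_critical_section( void ) {
  int inside;
  DEMO_LOCK_BEGIN( &demo_lock_depth )
    inside = demo_lock_depth; /* lock held: depth is 1 here */
  DEMO_LOCK_END
  return inside;
}
```

The double-brace structure is why usage must pair each BEGIN with its matching END, exactly as with FD_STORE_{S,X}LOCK_{BEGIN,END}.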
     328             : 
     329             : 
     330             : /* fd_store_query queries the FEC set keyed by merkle_root.  Returns a
     331             :    pointer to the fd_store_fec_t if found, NULL otherwise.  Assumes
     332             :    caller has acquired the shared lock.
     333             : 
      334             :    IMPORTANT SAFETY TIP!  Caller should only release the shared
      335             :    lock when they no longer retain interest in the returned pointer. */
     336             : 
     337             : fd_store_fec_t *
     338             : fd_store_query( fd_store_t      * store,
     339             :                 fd_hash_t const * merkle_root );
     340             : 
     341             : /* Operations */
     342             : 
     343             : /* fd_store_insert inserts a new FEC keyed by merkle_root into the
     344             :    store.  Returns a pointer to the inserted pool ele (fd_store_fec_t *)
     345             :    on success.  If the merkle root has previously been inserted, returns
     346             :    a pointer to the previous element with that key.  Returns NULL if the
     347             :    store is full (see fd_store_evict).
     348             : 
     349             :    Each fd_store_fec_t can hold at most FD_STORE_DATA_MAX bytes of data,
     350             :    and caller is responsible for copying into the region.
     351             : 
     352             :    This is a blocking operation that acquires a shared lock as part of
     353             :    its implementation.  Assumes caller has safely partitioned insertions
     354             :    if running in parallel (see top-level documentation in fd_store.h for
     355             :    details). */
     356             : 
     357             : fd_store_fec_t *
     358             : fd_store_insert( fd_store_t * store,
     359             :                  ulong        part_idx,
     360             :                  fd_hash_t  * merkle_root );
     361             : 
     362             : /* fd_store_remove removes the FEC set keyed by merkle_root.  Logs a
     363             :    warning if merkle_root is not found.
     364             : 
     365             :    This is a blocking operation that acquires an exclusive lock as part
     366             :    of its implementation.
     367             : 
      368             :    IMPORTANT SAFETY TIP!  Caller should only call fd_store_slock_release
      369             :    or fd_store_xlock_release when they no longer retain interest in
      370             :    previously returned pointers. */
     371             : 
     372             : void
     373             : fd_store_remove( fd_store_t      * store,
     374             :                  fd_hash_t const * merkle_root );
     375             : 
     376             : /* fd_store_verify returns 0 if the store is not obviously corrupt or -1
     377             :    otherwise (logs details). */
     378             : 
     379             : int
     380             : fd_store_verify( fd_store_t * store );
     381             : 
     382             : FD_PROTOTYPES_END
     383             : 
     384             : #endif /* HEADER_fd_src_disco_store_fd_store_h */

Generated by: LCOV version 1.14