12 11 | 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60 61 62 63 64 65 66 67 68 69 70 71 72 73 74 75 76 77 78 79 80 81 82 83 84 85 86 87 88 89 90 91 92 93 94 95 96 97 98 99 100 101 102 103 104 105 106 107 108 109 110 111 112 113 114 115 116 117 118 119 120 121 122 123 124 125 126 127 128 129 130 131 132 133 134 135 136 137 138 139 140 141 142 143 144 145 146 147 148 149 150 151 152 153 154 155 156 157 158 159 160 161 162 163 164 165 166 167 168 169 170 171 172 173 174 175 176 177 178 179 180 181 182 183 184 185 186 187 188 189 190 191 192 193 194 195 196 197 198 199 200 201 202 203 204 205 206 207 208 209 210 211 212 213 214 215 216 217 218 219 220 221 222 223 224 225 226 227 228 229 230 231 232 233 234 235 236 237 238 239 240 241 242 243 244 245 246 247 248 249 250 251 252 253 254 255 256 257 258 259 260 261 262 263 264 265 266 267 268 269 270 271 272 273 274 275 276 277 278 279 280 281 282 283 284 285 286 287 288 289 290 291 292 293 294 295 296 297 298 299 300 301 302 303 304 305 306 307 308 309 310 311 312 313 314 315 316 317 318 319 320 321 322 323 324 325 326 327 328 329 330 331 332 333 334 335 336 337 338 339 340 341 342 343 344 345 346 347 348 349 350 351 352 353 354 355 356 357 358 359 360 361 362 363 364 365 366 367 368 369 370 371 372 373 374 375 376 377 378 379 380 381 382 383 384 385 386 387 388 389 390 391 392 393 394 395 396 397 398 399 400 401 402 403 404 405 406 407 408 409 410 411 412 413 414 415 416 417 418 419 420 421 422 423 424 425 426 427 428 429 430 431 432 433 434 435 436 437 438 439 440 441 442 443 444 445 446 447 448 449 450 451 452 453 454 455 456 457 458 459 460 461 462 463 464 465 466 467 468 469 470 471 472 473 474 475 476 477 478 479 480 481 482 483 484 485 486 487 488 489 490 491 492 493 494 495 496 497 498 499 500 501 502 503 504 505 506 507 508 509 510 511 512 513 514 515 516 517 518 519 520 521 522 523 524 525 526 527 528 529 530 531 532 533 534 535 536 537 538 539 540 541 542 543 544 545 546 547 548 549 550 551 552 553 | /************************************************************************** * * Copyright 2006-2008 Tungsten Graphics, Inc., Cedar Park, TX. USA. * Copyright 2016 Intel Corporation * All Rights Reserved. * * Permission is hereby granted, free of charge, to any person obtaining a * copy of this software and associated documentation files (the * "Software"), to deal in the Software without restriction, including * without limitation the rights to use, copy, modify, merge, publish, * distribute, sub license, and/or sell copies of the Software, and to * permit persons to whom the Software is furnished to do so, subject to * the following conditions: * * The above copyright notice and this permission notice (including the * next paragraph) shall be included in all copies or substantial portions * of the Software. * * THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR * IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY, * FITNESS FOR A PARTICULAR PURPOSE AND NON-INFRINGEMENT. IN NO EVENT SHALL * THE COPYRIGHT HOLDERS, AUTHORS AND/OR ITS SUPPLIERS BE LIABLE FOR ANY CLAIM, * DAMAGES OR OTHER LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR * OTHERWISE, ARISING FROM, OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE * USE OR OTHER DEALINGS IN THE SOFTWARE. * * **************************************************************************/ /* * Authors: * Thomas Hellstrom <thomas-at-tungstengraphics-dot-com> */ #ifndef _DRM_MM_H_ #define _DRM_MM_H_ /* * Generic range manager structs */ #include <linux/bug.h> #include <linux/rbtree.h> #include <linux/limits.h> #include <linux/mm_types.h> #include <linux/list.h> #include <linux/spinlock.h> #ifdef CONFIG_DRM_DEBUG_MM #include <linux/stackdepot.h> #endif #include <linux/types.h> #include <drm/drm_print.h> #ifdef CONFIG_DRM_DEBUG_MM #define DRM_MM_BUG_ON(expr) BUG_ON(expr) #else #define DRM_MM_BUG_ON(expr) BUILD_BUG_ON_INVALID(expr) #endif /** * enum drm_mm_insert_mode - control search and allocation behaviour * * The &struct drm_mm range manager supports finding a suitable modes using * a number of search trees. These trees are oranised by size, by address and * in most recent eviction order. This allows the user to find either the * smallest hole to reuse, the lowest or highest address to reuse, or simply * reuse the most recent eviction that fits. When allocating the &drm_mm_node * from within the hole, the &drm_mm_insert_mode also dictate whether to * allocate the lowest matching address or the highest. */ enum drm_mm_insert_mode { /** * @DRM_MM_INSERT_BEST: * * Search for the smallest hole (within the search range) that fits * the desired node. * * Allocates the node from the bottom of the found hole. */ DRM_MM_INSERT_BEST = 0, /** * @DRM_MM_INSERT_LOW: * * Search for the lowest hole (address closest to 0, within the search * range) that fits the desired node. * * Allocates the node from the bottom of the found hole. */ DRM_MM_INSERT_LOW, /** * @DRM_MM_INSERT_HIGH: * * Search for the highest hole (address closest to U64_MAX, within the * search range) that fits the desired node. * * Allocates the node from the *top* of the found hole. The specified * alignment for the node is applied to the base of the node * (&drm_mm_node.start). */ DRM_MM_INSERT_HIGH, /** * @DRM_MM_INSERT_EVICT: * * Search for the most recently evicted hole (within the search range) * that fits the desired node. This is appropriate for use immediately * after performing an eviction scan (see drm_mm_scan_init()) and * removing the selected nodes to form a hole. * * Allocates the node from the bottom of the found hole. */ DRM_MM_INSERT_EVICT, /** * @DRM_MM_INSERT_ONCE: * * Only check the first hole for suitablity and report -ENOSPC * immediately otherwise, rather than check every hole until a * suitable one is found. Can only be used in conjunction with another * search method such as DRM_MM_INSERT_HIGH or DRM_MM_INSERT_LOW. */ DRM_MM_INSERT_ONCE = BIT(31), /** * @DRM_MM_INSERT_HIGHEST: * * Only check the highest hole (the hole with the largest address) and * insert the node at the top of the hole or report -ENOSPC if * unsuitable. * * Does not search all holes. */ DRM_MM_INSERT_HIGHEST = DRM_MM_INSERT_HIGH | DRM_MM_INSERT_ONCE, /** * @DRM_MM_INSERT_LOWEST: * * Only check the lowest hole (the hole with the smallest address) and * insert the node at the bottom of the hole or report -ENOSPC if * unsuitable. * * Does not search all holes. */ DRM_MM_INSERT_LOWEST = DRM_MM_INSERT_LOW | DRM_MM_INSERT_ONCE, }; /** * struct drm_mm_node - allocated block in the DRM allocator * * This represents an allocated block in a &drm_mm allocator. Except for * pre-reserved nodes inserted using drm_mm_reserve_node() the structure is * entirely opaque and should only be accessed through the provided funcions. * Since allocation of these nodes is entirely handled by the driver they can be * embedded. */ struct drm_mm_node { /** @color: Opaque driver-private tag. */ unsigned long color; /** @start: Start address of the allocated block. */ u64 start; /** @size: Size of the allocated block. */ u64 size; /* private: */ struct drm_mm *mm; struct list_head node_list; struct list_head hole_stack; struct rb_node rb; struct rb_node rb_hole_size; struct rb_node rb_hole_addr; u64 __subtree_last; u64 hole_size; u64 subtree_max_hole; unsigned long flags; #define DRM_MM_NODE_ALLOCATED_BIT 0 #define DRM_MM_NODE_SCANNED_BIT 1 #ifdef CONFIG_DRM_DEBUG_MM depot_stack_handle_t stack; #endif }; /** * struct drm_mm - DRM allocator * * DRM range allocator with a few special functions and features geared towards * managing GPU memory. Except for the @color_adjust callback the structure is * entirely opaque and should only be accessed through the provided functions * and macros. This structure can be embedded into larger driver structures. */ struct drm_mm { /** * @color_adjust: * * Optional driver callback to further apply restrictions on a hole. The * node argument points at the node containing the hole from which the * block would be allocated (see drm_mm_hole_follows() and friends). The * other arguments are the size of the block to be allocated. The driver * can adjust the start and end as needed to e.g. insert guard pages. */ void (*color_adjust)(const struct drm_mm_node *node, unsigned long color, u64 *start, u64 *end); /* private: */ /* List of all memory nodes that immediately precede a free hole. */ struct list_head hole_stack; /* head_node.node_list is the list of all memory nodes, ordered * according to the (increasing) start address of the memory node. */ struct drm_mm_node head_node; /* Keep an interval_tree for fast lookup of drm_mm_nodes by address. */ struct rb_root_cached interval_tree; struct rb_root_cached holes_size; struct rb_root holes_addr; unsigned long scan_active; }; /** * struct drm_mm_scan - DRM allocator eviction roaster data * * This structure tracks data needed for the eviction roaster set up using * drm_mm_scan_init(), and used with drm_mm_scan_add_block() and * drm_mm_scan_remove_block(). The structure is entirely opaque and should only * be accessed through the provided functions and macros. It is meant to be * allocated temporarily by the driver on the stack. */ struct drm_mm_scan { /* private: */ struct drm_mm *mm; u64 size; u64 alignment; u64 remainder_mask; u64 range_start; u64 range_end; u64 hit_start; u64 hit_end; unsigned long color; enum drm_mm_insert_mode mode; }; /** * drm_mm_node_allocated - checks whether a node is allocated * @node: drm_mm_node to check * * Drivers are required to clear a node prior to using it with the * drm_mm range manager. * * Drivers should use this helper for proper encapsulation of drm_mm * internals. * * Returns: * True if the @node is allocated. */ static inline bool drm_mm_node_allocated(const struct drm_mm_node *node) { return test_bit(DRM_MM_NODE_ALLOCATED_BIT, &node->flags); } /** * drm_mm_initialized - checks whether an allocator is initialized * @mm: drm_mm to check * * Drivers should clear the struct drm_mm prior to initialisation if they * want to use this function. * * Drivers should use this helper for proper encapsulation of drm_mm * internals. * * Returns: * True if the @mm is initialized. */ static inline bool drm_mm_initialized(const struct drm_mm *mm) { return READ_ONCE(mm->hole_stack.next); } /** * drm_mm_hole_follows - checks whether a hole follows this node * @node: drm_mm_node to check * * Holes are embedded into the drm_mm using the tail of a drm_mm_node. * If you wish to know whether a hole follows this particular node, * query this function. See also drm_mm_hole_node_start() and * drm_mm_hole_node_end(). * * Returns: * True if a hole follows the @node. */ static inline bool drm_mm_hole_follows(const struct drm_mm_node *node) { return node->hole_size; } static inline u64 __drm_mm_hole_node_start(const struct drm_mm_node *hole_node) { return hole_node->start + hole_node->size; } /** * drm_mm_hole_node_start - computes the start of the hole following @node * @hole_node: drm_mm_node which implicitly tracks the following hole * * This is useful for driver-specific debug dumpers. Otherwise drivers should * not inspect holes themselves. Drivers must check first whether a hole indeed * follows by looking at drm_mm_hole_follows() * * Returns: * Start of the subsequent hole. */ static inline u64 drm_mm_hole_node_start(const struct drm_mm_node *hole_node) { DRM_MM_BUG_ON(!drm_mm_hole_follows(hole_node)); return __drm_mm_hole_node_start(hole_node); } static inline u64 __drm_mm_hole_node_end(const struct drm_mm_node *hole_node) { return list_next_entry(hole_node, node_list)->start; } /** * drm_mm_hole_node_end - computes the end of the hole following @node * @hole_node: drm_mm_node which implicitly tracks the following hole * * This is useful for driver-specific debug dumpers. Otherwise drivers should * not inspect holes themselves. Drivers must check first whether a hole indeed * follows by looking at drm_mm_hole_follows(). * * Returns: * End of the subsequent hole. */ static inline u64 drm_mm_hole_node_end(const struct drm_mm_node *hole_node) { return __drm_mm_hole_node_end(hole_node); } /** * drm_mm_nodes - list of nodes under the drm_mm range manager * @mm: the struct drm_mm range manager * * As the drm_mm range manager hides its node_list deep with its * structure, extracting it looks painful and repetitive. This is * not expected to be used outside of the drm_mm_for_each_node() * macros and similar internal functions. * * Returns: * The node list, may be empty. */ #define drm_mm_nodes(mm) (&(mm)->head_node.node_list) /** * drm_mm_for_each_node - iterator to walk over all allocated nodes * @entry: &struct drm_mm_node to assign to in each iteration step * @mm: &drm_mm allocator to walk * * This iterator walks over all nodes in the range allocator. It is implemented * with list_for_each(), so not save against removal of elements. */ #define drm_mm_for_each_node(entry, mm) \ list_for_each_entry(entry, drm_mm_nodes(mm), node_list) /** * drm_mm_for_each_node_safe - iterator to walk over all allocated nodes * @entry: &struct drm_mm_node to assign to in each iteration step * @next: &struct drm_mm_node to store the next step * @mm: &drm_mm allocator to walk * * This iterator walks over all nodes in the range allocator. It is implemented * with list_for_each_safe(), so save against removal of elements. */ #define drm_mm_for_each_node_safe(entry, next, mm) \ list_for_each_entry_safe(entry, next, drm_mm_nodes(mm), node_list) /** * drm_mm_for_each_hole - iterator to walk over all holes * @pos: &drm_mm_node used internally to track progress * @mm: &drm_mm allocator to walk * @hole_start: ulong variable to assign the hole start to on each iteration * @hole_end: ulong variable to assign the hole end to on each iteration * * This iterator walks over all holes in the range allocator. It is implemented * with list_for_each(), so not save against removal of elements. @entry is used * internally and will not reflect a real drm_mm_node for the very first hole. * Hence users of this iterator may not access it. * * Implementation Note: * We need to inline list_for_each_entry in order to be able to set hole_start * and hole_end on each iteration while keeping the macro sane. */ #define drm_mm_for_each_hole(pos, mm, hole_start, hole_end) \ for (pos = list_first_entry(&(mm)->hole_stack, \ typeof(*pos), hole_stack); \ &pos->hole_stack != &(mm)->hole_stack ? \ hole_start = drm_mm_hole_node_start(pos), \ hole_end = hole_start + pos->hole_size, \ 1 : 0; \ pos = list_next_entry(pos, hole_stack)) /* * Basic range manager support (drm_mm.c) */ int drm_mm_reserve_node(struct drm_mm *mm, struct drm_mm_node *node); int drm_mm_insert_node_in_range(struct drm_mm *mm, struct drm_mm_node *node, u64 size, u64 alignment, unsigned long color, u64 start, u64 end, enum drm_mm_insert_mode mode); /** * drm_mm_insert_node_generic - search for space and insert @node * @mm: drm_mm to allocate from * @node: preallocate node to insert * @size: size of the allocation * @alignment: alignment of the allocation * @color: opaque tag value to use for this node * @mode: fine-tune the allocation search and placement * * This is a simplified version of drm_mm_insert_node_in_range() with no * range restrictions applied. * * The preallocated node must be cleared to 0. * * Returns: * 0 on success, -ENOSPC if there's no suitable hole. */ static inline int drm_mm_insert_node_generic(struct drm_mm *mm, struct drm_mm_node *node, u64 size, u64 alignment, unsigned long color, enum drm_mm_insert_mode mode) { return drm_mm_insert_node_in_range(mm, node, size, alignment, color, 0, U64_MAX, mode); } /** * drm_mm_insert_node - search for space and insert @node * @mm: drm_mm to allocate from * @node: preallocate node to insert * @size: size of the allocation * * This is a simplified version of drm_mm_insert_node_generic() with @color set * to 0. * * The preallocated node must be cleared to 0. * * Returns: * 0 on success, -ENOSPC if there's no suitable hole. */ static inline int drm_mm_insert_node(struct drm_mm *mm, struct drm_mm_node *node, u64 size) { return drm_mm_insert_node_generic(mm, node, size, 0, 0, 0); } void drm_mm_remove_node(struct drm_mm_node *node); void drm_mm_replace_node(struct drm_mm_node *old, struct drm_mm_node *new); void drm_mm_init(struct drm_mm *mm, u64 start, u64 size); void drm_mm_takedown(struct drm_mm *mm); /** * drm_mm_clean - checks whether an allocator is clean * @mm: drm_mm allocator to check * * Returns: * True if the allocator is completely free, false if there's still a node * allocated in it. */ static inline bool drm_mm_clean(const struct drm_mm *mm) { return list_empty(drm_mm_nodes(mm)); } struct drm_mm_node * __drm_mm_interval_first(const struct drm_mm *mm, u64 start, u64 last); /** * drm_mm_for_each_node_in_range - iterator to walk over a range of * allocated nodes * @node__: drm_mm_node structure to assign to in each iteration step * @mm__: drm_mm allocator to walk * @start__: starting offset, the first node will overlap this * @end__: ending offset, the last node will start before this (but may overlap) * * This iterator walks over all nodes in the range allocator that lie * between @start and @end. It is implemented similarly to list_for_each(), * but using the internal interval tree to accelerate the search for the * starting node, and so not safe against removal of elements. It assumes * that @end is within (or is the upper limit of) the drm_mm allocator. * If [@start, @end] are beyond the range of the drm_mm, the iterator may walk * over the special _unallocated_ &drm_mm.head_node, and may even continue * indefinitely. */ #define drm_mm_for_each_node_in_range(node__, mm__, start__, end__) \ for (node__ = __drm_mm_interval_first((mm__), (start__), (end__)-1); \ node__->start < (end__); \ node__ = list_next_entry(node__, node_list)) void drm_mm_scan_init_with_range(struct drm_mm_scan *scan, struct drm_mm *mm, u64 size, u64 alignment, unsigned long color, u64 start, u64 end, enum drm_mm_insert_mode mode); /** * drm_mm_scan_init - initialize lru scanning * @scan: scan state * @mm: drm_mm to scan * @size: size of the allocation * @alignment: alignment of the allocation * @color: opaque tag value to use for the allocation * @mode: fine-tune the allocation search and placement * * This is a simplified version of drm_mm_scan_init_with_range() with no range * restrictions applied. * * This simply sets up the scanning routines with the parameters for the desired * hole. * * Warning: * As long as the scan list is non-empty, no other operations than * adding/removing nodes to/from the scan list are allowed. */ static inline void drm_mm_scan_init(struct drm_mm_scan *scan, struct drm_mm *mm, u64 size, u64 alignment, unsigned long color, enum drm_mm_insert_mode mode) { drm_mm_scan_init_with_range(scan, mm, size, alignment, color, 0, U64_MAX, mode); } bool drm_mm_scan_add_block(struct drm_mm_scan *scan, struct drm_mm_node *node); bool drm_mm_scan_remove_block(struct drm_mm_scan *scan, struct drm_mm_node *node); struct drm_mm_node *drm_mm_scan_color_evict(struct drm_mm_scan *scan); void drm_mm_print(const struct drm_mm *mm, struct drm_printer *p); #endif |
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60 61 62 63 64 65 66 67 68 69 70 71 72 73 74 75 76 77 78 79 80 81 82 | /* SPDX-License-Identifier: GPL-2.0 */ #ifndef _LINUX_PART_STAT_H #define _LINUX_PART_STAT_H #include <linux/blkdev.h> #include <asm/local.h> struct disk_stats { u64 nsecs[NR_STAT_GROUPS]; unsigned long sectors[NR_STAT_GROUPS]; unsigned long ios[NR_STAT_GROUPS]; unsigned long merges[NR_STAT_GROUPS]; unsigned long io_ticks; local_t in_flight[2]; }; /* * Macros to operate on percpu disk statistics: * * {disk|part|all}_stat_{add|sub|inc|dec}() modify the stat counters and should * be called between disk_stat_lock() and disk_stat_unlock(). * * part_stat_read() can be called at any time. */ #define part_stat_lock() preempt_disable() #define part_stat_unlock() preempt_enable() #define part_stat_get_cpu(part, field, cpu) \ (per_cpu_ptr((part)->bd_stats, (cpu))->field) #define part_stat_get(part, field) \ part_stat_get_cpu(part, field, smp_processor_id()) #define part_stat_read(part, field) \ ({ \ typeof((part)->bd_stats->field) res = 0; \ unsigned int _cpu; \ for_each_possible_cpu(_cpu) \ res += per_cpu_ptr((part)->bd_stats, _cpu)->field; \ res; \ }) static inline void part_stat_set_all(struct block_device *part, int value) { int i; for_each_possible_cpu(i) memset(per_cpu_ptr(part->bd_stats, i), value, sizeof(struct disk_stats)); } #define part_stat_read_accum(part, field) \ (part_stat_read(part, field[STAT_READ]) + \ part_stat_read(part, field[STAT_WRITE]) + \ part_stat_read(part, field[STAT_DISCARD])) #define __part_stat_add(part, field, addnd) \ __this_cpu_add((part)->bd_stats->field, addnd) #define part_stat_add(part, field, addnd) do { \ __part_stat_add((part), field, addnd); \ if ((part)->bd_partno) \ __part_stat_add(bdev_whole(part), field, addnd); \ } while (0) #define part_stat_dec(part, field) \ part_stat_add(part, field, -1) #define part_stat_inc(part, field) \ part_stat_add(part, field, 1) #define part_stat_sub(part, field, subnd) \ part_stat_add(part, field, -subnd) #define part_stat_local_dec(part, field) \ local_dec(&(part_stat_get(part, field))) #define part_stat_local_inc(part, field) \ local_inc(&(part_stat_get(part, field))) #define part_stat_local_read(part, field) \ local_read(&(part_stat_get(part, field))) #define part_stat_local_read_cpu(part, field, cpu) \ local_read(&(part_stat_get_cpu(part, field, cpu))) #endif /* _LINUX_PART_STAT_H */ |
15 479 165 | 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60 61 62 63 64 65 66 67 68 69 70 71 72 73 74 75 76 77 78 79 80 81 82 83 84 85 86 87 88 89 90 91 92 93 94 95 96 97 98 99 100 101 102 103 104 105 106 107 108 109 110 111 112 113 114 115 116 117 118 119 120 121 122 123 124 125 126 127 128 129 130 131 132 133 134 135 136 137 138 139 140 141 142 143 144 145 146 147 148 149 150 151 152 153 154 155 156 157 158 159 160 161 162 163 164 165 166 167 168 169 170 171 172 173 174 175 176 177 178 179 180 181 182 183 184 185 186 187 188 189 190 191 192 193 194 195 196 197 198 199 200 201 202 203 204 205 206 207 208 209 210 211 212 213 214 215 216 217 218 219 220 221 222 223 224 225 226 227 228 229 230 231 232 233 234 235 236 237 238 239 240 241 242 243 244 245 246 247 248 249 250 251 252 253 254 255 256 257 258 259 260 261 262 263 264 265 266 267 268 269 | /* SPDX-License-Identifier: GPL-2.0 */ #ifndef __NET_LWTUNNEL_H #define __NET_LWTUNNEL_H 1 #include <linux/lwtunnel.h> #include <linux/netdevice.h> #include <linux/skbuff.h> #include <linux/types.h> #include <net/route.h> #define LWTUNNEL_HASH_BITS 7 #define LWTUNNEL_HASH_SIZE (1 << LWTUNNEL_HASH_BITS) /* lw tunnel state flags */ #define LWTUNNEL_STATE_OUTPUT_REDIRECT BIT(0) #define LWTUNNEL_STATE_INPUT_REDIRECT BIT(1) #define LWTUNNEL_STATE_XMIT_REDIRECT BIT(2) /* LWTUNNEL_XMIT_CONTINUE should be distinguishable from dst_output return * values (NET_XMIT_xxx and NETDEV_TX_xxx in linux/netdevice.h) for safety. */ enum { LWTUNNEL_XMIT_DONE, LWTUNNEL_XMIT_CONTINUE = 0x100, }; struct lwtunnel_state { __u16 type; __u16 flags; __u16 headroom; atomic_t refcnt; int (*orig_output)(struct net *net, struct sock *sk, struct sk_buff *skb); int (*orig_input)(struct sk_buff *); struct rcu_head rcu; __u8 data[]; }; struct lwtunnel_encap_ops { int (*build_state)(struct net *net, struct nlattr *encap, unsigned int family, const void *cfg, struct lwtunnel_state **ts, struct netlink_ext_ack *extack); void (*destroy_state)(struct lwtunnel_state *lws); int (*output)(struct net *net, struct sock *sk, struct sk_buff *skb); int (*input)(struct sk_buff *skb); int (*fill_encap)(struct sk_buff *skb, struct lwtunnel_state *lwtstate); int (*get_encap_size)(struct lwtunnel_state *lwtstate); int (*cmp_encap)(struct lwtunnel_state *a, struct lwtunnel_state *b); int (*xmit)(struct sk_buff *skb); struct module *owner; }; #ifdef CONFIG_LWTUNNEL DECLARE_STATIC_KEY_FALSE(nf_hooks_lwtunnel_enabled); void lwtstate_free(struct lwtunnel_state *lws); static inline struct lwtunnel_state * lwtstate_get(struct lwtunnel_state *lws) { if (lws) atomic_inc(&lws->refcnt); return lws; } static inline void lwtstate_put(struct lwtunnel_state *lws) { if (!lws) return; if (atomic_dec_and_test(&lws->refcnt)) lwtstate_free(lws); } static inline bool lwtunnel_output_redirect(struct lwtunnel_state *lwtstate) { if (lwtstate && (lwtstate->flags & LWTUNNEL_STATE_OUTPUT_REDIRECT)) return true; return false; } static inline bool lwtunnel_input_redirect(struct lwtunnel_state *lwtstate) { if (lwtstate && (lwtstate->flags & LWTUNNEL_STATE_INPUT_REDIRECT)) return true; return false; } static inline bool lwtunnel_xmit_redirect(struct lwtunnel_state *lwtstate) { if (lwtstate && (lwtstate->flags & LWTUNNEL_STATE_XMIT_REDIRECT)) return true; return false; } static inline unsigned int lwtunnel_headroom(struct lwtunnel_state *lwtstate, unsigned int mtu) { if ((lwtunnel_xmit_redirect(lwtstate) || lwtunnel_output_redirect(lwtstate)) && lwtstate->headroom < mtu) return lwtstate->headroom; return 0; } int lwtunnel_encap_add_ops(const struct lwtunnel_encap_ops *op, unsigned int num); int lwtunnel_encap_del_ops(const struct lwtunnel_encap_ops *op, unsigned int num); int lwtunnel_valid_encap_type(u16 encap_type, struct netlink_ext_ack *extack); int lwtunnel_valid_encap_type_attr(struct nlattr *attr, int len, struct netlink_ext_ack *extack); int lwtunnel_build_state(struct net *net, u16 encap_type, struct nlattr *encap, unsigned int family, const void *cfg, struct lwtunnel_state **lws, struct netlink_ext_ack *extack); int lwtunnel_fill_encap(struct sk_buff *skb, struct lwtunnel_state *lwtstate, int encap_attr, int encap_type_attr); int lwtunnel_get_encap_size(struct lwtunnel_state *lwtstate); struct lwtunnel_state *lwtunnel_state_alloc(int hdr_len); int lwtunnel_cmp_encap(struct lwtunnel_state *a, struct lwtunnel_state *b); int lwtunnel_output(struct net *net, struct sock *sk, struct sk_buff *skb); int lwtunnel_input(struct sk_buff *skb); int lwtunnel_xmit(struct sk_buff *skb); int bpf_lwt_push_ip_encap(struct sk_buff *skb, void *hdr, u32 len, bool ingress); static inline void lwtunnel_set_redirect(struct dst_entry *dst) { if (lwtunnel_output_redirect(dst->lwtstate)) { dst->lwtstate->orig_output = dst->output; dst->output = lwtunnel_output; } if (lwtunnel_input_redirect(dst->lwtstate)) { dst->lwtstate->orig_input = dst->input; dst->input = lwtunnel_input; } } #else static inline void lwtstate_free(struct lwtunnel_state *lws) { } static inline struct lwtunnel_state * lwtstate_get(struct lwtunnel_state *lws) { return lws; } static inline void lwtstate_put(struct lwtunnel_state *lws) { } static inline bool lwtunnel_output_redirect(struct lwtunnel_state *lwtstate) { return false; } static inline bool lwtunnel_input_redirect(struct lwtunnel_state *lwtstate) { return false; } static inline bool lwtunnel_xmit_redirect(struct lwtunnel_state *lwtstate) { return false; } static inline void lwtunnel_set_redirect(struct dst_entry *dst) { } static inline unsigned int lwtunnel_headroom(struct lwtunnel_state *lwtstate, unsigned int mtu) { return 0; } static inline int lwtunnel_encap_add_ops(const struct lwtunnel_encap_ops *op, unsigned int num) { return -EOPNOTSUPP; } static inline int lwtunnel_encap_del_ops(const struct lwtunnel_encap_ops *op, unsigned int num) { return -EOPNOTSUPP; } static inline int lwtunnel_valid_encap_type(u16 encap_type, struct netlink_ext_ack *extack) { NL_SET_ERR_MSG(extack, "CONFIG_LWTUNNEL is not enabled in this kernel"); return -EOPNOTSUPP; } static inline int lwtunnel_valid_encap_type_attr(struct nlattr *attr, int len, struct netlink_ext_ack *extack) { /* return 0 since we are not walking attr looking for * RTA_ENCAP_TYPE attribute on nexthops. */ return 0; } static inline int lwtunnel_build_state(struct net *net, u16 encap_type, struct nlattr *encap, unsigned int family, const void *cfg, struct lwtunnel_state **lws, struct netlink_ext_ack *extack) { return -EOPNOTSUPP; } static inline int lwtunnel_fill_encap(struct sk_buff *skb, struct lwtunnel_state *lwtstate, int encap_attr, int encap_type_attr) { return 0; } static inline int lwtunnel_get_encap_size(struct lwtunnel_state *lwtstate) { return 0; } static inline struct lwtunnel_state *lwtunnel_state_alloc(int hdr_len) { return NULL; } static inline int lwtunnel_cmp_encap(struct lwtunnel_state *a, struct lwtunnel_state *b) { return 0; } static inline int lwtunnel_output(struct net *net, struct sock *sk, struct sk_buff *skb) { return -EOPNOTSUPP; } static inline int lwtunnel_input(struct sk_buff *skb) { return -EOPNOTSUPP; } static inline int lwtunnel_xmit(struct sk_buff *skb) { return -EOPNOTSUPP; } #endif /* CONFIG_LWTUNNEL */ #define MODULE_ALIAS_RTNL_LWT(encap_type) MODULE_ALIAS("rtnl-lwt-" __stringify(encap_type)) #endif /* __NET_LWTUNNEL_H */ |
1 135 152 150 6 8 8 8 7 5 5 5 2 2 1 2 1 1 160 151 134 134 1 1 135 1 1 2 31 31 144 130 146 150 127 79 69 69 127 151 4 1 151 9 7 161 150 156 142 161 1 2 1 158 6 136 20 151 134 2 2 2 1 1 151 151 161 152 3 3 3 159 158 127 127 137 122 91 121 160 3 3 152 138 2 2 2 138 143 143 160 31 4 4 4 4 4 3 4 3 4 1 2 1 2 4 1 1 4 1 5 1 1 36 32 4 36 34 4 34 1 33 3 23 30 2 30 30 7 30 30 30 30 30 3 30 30 30 30 30 30 27 34 1 36 33 32 27 6 9 9 33 33 2 3 1 5 33 33 33 33 33 33 33 31 33 33 3 2 3 2 3 33 2 2 3 4 1 | 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60 61 62 63 64 65 66 67 68 69 70 71 72 73 74 75 76 77 78 79 80 81 82 83 84 85 86 87 88 89 90 91 92 93 94 95 96 97 98 99 100 101 102 103 104 105 106 107 108 109 110 111 112 113 114 115 116 117 118 119 120 121 122 123 124 125 126 127 128 129 130 131 132 133 134 135 136 137 138 139 140 141 142 143 144 145 146 147 148 149 150 151 152 153 154 155 156 157 158 159 160 161 162 163 164 165 166 167 168 169 170 171 172 173 174 175 176 177 178 179 180 181 182 183 184 185 186 187 188 189 190 191 192 193 194 195 196 197 198 199 200 201 202 203 204 205 206 207 208 209 210 211 212 213 214 215 216 217 218 219 220 221 222 223 224 225 226 227 228 229 230 231 232 233 234 235 236 237 238 239 240 241 242 243 244 245 246 247 248 249 250 251 252 253 254 255 256 257 258 259 260 261 262 263 264 265 266 267 268 269 270 271 272 273 274 275 276 277 278 279 280 281 282 283 284 285 286 287 288 289 290 291 292 293 294 295 296 297 298 299 300 301 302 303 304 305 306 307 308 309 310 311 312 313 314 315 316 317 318 319 320 321 322 323 324 325 326 327 328 329 330 331 332 333 334 335 336 337 338 339 340 341 342 343 344 345 346 347 348 349 350 351 352 353 354 355 356 357 358 359 360 361 362 363 364 365 366 367 368 369 370 371 372 373 374 375 376 377 378 379 380 381 382 383 384 385 386 387 388 389 390 391 392 393 394 395 396 397 398 399 400 401 402 403 404 405 406 407 408 409 410 411 412 413 414 415 416 417 418 419 420 421 422 423 424 425 426 427 428 429 430 431 432 433 434 435 436 437 438 439 440 441 442 443 444 445 446 447 448 449 450 451 452 453 454 455 456 457 458 459 460 461 462 463 464 465 466 467 468 469 470 471 472 473 474 475 476 477 478 479 480 481 482 483 484 485 486 487 488 489 490 491 492 493 494 495 496 497 498 499 500 501 502 503 504 505 506 507 508 509 510 511 512 513 514 515 516 517 518 519 520 521 522 523 524 525 526 527 528 529 530 531 532 533 534 535 536 537 538 539 540 541 542 543 544 545 546 547 548 549 550 551 552 553 554 555 556 557 558 559 560 561 562 563 564 565 566 567 568 569 570 571 572 573 574 575 576 577 578 579 580 581 582 583 584 585 586 587 588 589 590 591 592 593 594 595 596 597 598 599 600 601 602 603 604 605 606 607 608 609 610 611 612 613 614 615 616 617 618 619 620 621 622 623 624 625 626 627 628 629 630 631 632 633 634 635 636 637 638 639 640 641 642 643 644 645 646 647 648 649 650 651 652 653 654 655 656 657 658 659 660 661 662 663 664 665 666 667 668 669 670 671 672 673 674 675 676 677 678 679 680 681 682 683 684 685 686 687 688 689 690 691 692 693 694 695 696 697 698 699 700 701 702 703 704 705 706 707 708 709 710 711 712 713 714 715 716 717 718 719 720 721 722 723 724 725 726 727 728 729 730 731 732 733 734 735 736 737 738 739 740 741 742 743 744 745 746 747 748 749 750 751 752 753 754 755 756 757 758 759 760 761 762 763 764 765 766 767 768 769 770 771 772 773 774 775 776 777 778 779 780 781 782 783 784 785 786 787 788 789 790 791 792 793 794 795 796 797 798 799 800 801 802 803 804 805 806 807 808 809 810 811 812 813 814 815 816 817 818 819 820 821 822 823 824 825 826 827 828 829 830 831 832 833 834 835 836 837 838 839 840 841 842 843 844 845 846 847 848 849 850 851 852 853 854 855 856 857 858 859 860 861 862 863 864 865 866 867 868 869 870 871 872 873 874 875 876 877 878 879 880 881 882 883 884 885 886 887 888 889 890 891 892 893 894 895 896 897 898 899 900 901 902 903 904 905 906 907 908 909 910 911 912 913 914 915 916 917 918 919 920 921 922 923 924 925 926 927 928 929 930 931 932 933 934 935 936 937 938 939 940 941 942 943 944 945 946 947 948 949 950 951 952 953 954 955 956 957 958 959 960 961 962 963 964 965 966 967 968 969 970 971 972 973 974 975 976 977 978 979 980 981 982 983 984 985 986 987 988 989 990 991 992 993 994 995 996 997 998 999 1000 1001 1002 1003 1004 1005 1006 1007 1008 1009 1010 1011 1012 1013 1014 1015 1016 1017 1018 1019 1020 1021 1022 1023 1024 1025 1026 1027 1028 1029 1030 1031 1032 1033 1034 1035 1036 1037 1038 1039 1040 1041 1042 1043 1044 1045 1046 1047 1048 1049 1050 1051 1052 1053 1054 1055 1056 1057 1058 1059 1060 1061 1062 1063 1064 1065 1066 1067 1068 1069 1070 1071 1072 1073 1074 1075 1076 1077 1078 1079 1080 1081 1082 1083 1084 1085 1086 1087 1088 1089 1090 1091 1092 1093 1094 1095 1096 1097 1098 1099 1100 1101 1102 1103 1104 1105 1106 1107 1108 1109 1110 1111 1112 1113 1114 1115 1116 1117 1118 1119 1120 1121 1122 1123 1124 1125 1126 1127 1128 1129 1130 1131 1132 1133 1134 1135 1136 1137 1138 1139 1140 1141 1142 1143 1144 1145 1146 1147 1148 1149 1150 1151 1152 1153 1154 1155 1156 1157 1158 1159 1160 1161 1162 1163 1164 1165 1166 1167 1168 1169 1170 1171 1172 1173 1174 1175 1176 1177 1178 1179 1180 1181 1182 1183 1184 1185 1186 1187 1188 1189 1190 1191 1192 1193 1194 1195 1196 1197 1198 1199 1200 1201 1202 1203 1204 1205 1206 1207 1208 1209 1210 1211 1212 1213 1214 1215 1216 1217 1218 1219 1220 1221 1222 1223 1224 1225 1226 1227 1228 1229 1230 1231 1232 1233 1234 1235 1236 1237 1238 1239 1240 1241 1242 1243 1244 1245 1246 1247 1248 1249 1250 1251 1252 1253 1254 1255 1256 1257 1258 1259 1260 1261 1262 1263 1264 1265 1266 1267 1268 1269 1270 1271 1272 1273 1274 1275 1276 1277 1278 1279 1280 1281 1282 1283 1284 1285 1286 1287 1288 1289 1290 1291 1292 1293 1294 1295 1296 1297 1298 1299 1300 1301 1302 1303 1304 1305 1306 1307 1308 1309 | // SPDX-License-Identifier: GPL-2.0-only /* * net/sched/sch_netem.c Network emulator * * Many of the algorithms and ideas for this came from * NIST Net which is not copyrighted. * * Authors: Stephen Hemminger <shemminger@osdl.org> * Catalin(ux aka Dino) BOIE <catab at umbrella dot ro> */ #include <linux/mm.h> #include <linux/module.h> #include <linux/slab.h> #include <linux/types.h> #include <linux/kernel.h> #include <linux/errno.h> #include <linux/skbuff.h> #include <linux/vmalloc.h> #include <linux/rtnetlink.h> #include <linux/reciprocal_div.h> #include <linux/rbtree.h> #include <net/gso.h> #include <net/netlink.h> #include <net/pkt_sched.h> #include <net/inet_ecn.h> #define VERSION "1.3" /* Network Emulation Queuing algorithm. ==================================== Sources: [1] Mark Carson, Darrin Santay, "NIST Net - A Linux-based Network Emulation Tool [2] Luigi Rizzo, DummyNet for FreeBSD ---------------------------------------------------------------- This started out as a simple way to delay outgoing packets to test TCP but has grown to include most of the functionality of a full blown network emulator like NISTnet. It can delay packets and add random jitter (and correlation). The random distribution can be loaded from a table as well to provide normal, Pareto, or experimental curves. Packet loss, duplication, and reordering can also be emulated. This qdisc does not do classification that can be handled in layering other disciplines. It does not need to do bandwidth control either since that can be handled by using token bucket or other rate control. Correlated Loss Generator models Added generation of correlated loss according to the "Gilbert-Elliot" model, a 4-state markov model. References: [1] NetemCLG Home http://netgroup.uniroma2.it/NetemCLG [2] S. Salsano, F. Ludovici, A. Ordine, "Definition of a general and intuitive loss model for packet networks and its implementation in the Netem module in the Linux kernel", available in [1] Authors: Stefano Salsano <stefano.salsano at uniroma2.it Fabio Ludovici <fabio.ludovici at yahoo.it> */ struct disttable { u32 size; s16 table[]; }; struct netem_sched_data { /* internal t(ime)fifo qdisc uses t_root and sch->limit */ struct rb_root t_root; /* a linear queue; reduces rbtree rebalancing when jitter is low */ struct sk_buff *t_head; struct sk_buff *t_tail; /* optional qdisc for classful handling (NULL at netem init) */ struct Qdisc *qdisc; struct qdisc_watchdog watchdog; s64 latency; s64 jitter; u32 loss; u32 ecn; u32 limit; u32 counter; u32 gap; u32 duplicate; u32 reorder; u32 corrupt; u64 rate; s32 packet_overhead; u32 cell_size; struct reciprocal_value cell_size_reciprocal; s32 cell_overhead; struct crndstate { u32 last; u32 rho; } delay_cor, loss_cor, dup_cor, reorder_cor, corrupt_cor; struct prng { u64 seed; struct rnd_state prng_state; } prng; struct disttable *delay_dist; enum { CLG_RANDOM, CLG_4_STATES, CLG_GILB_ELL, } loss_model; enum { TX_IN_GAP_PERIOD = 1, TX_IN_BURST_PERIOD, LOST_IN_GAP_PERIOD, LOST_IN_BURST_PERIOD, } _4_state_model; enum { GOOD_STATE = 1, BAD_STATE, } GE_state_model; /* Correlated Loss Generation models */ struct clgstate { /* state of the Markov chain */ u8 state; /* 4-states and Gilbert-Elliot models */ u32 a1; /* p13 for 4-states or p for GE */ u32 a2; /* p31 for 4-states or r for GE */ u32 a3; /* p32 for 4-states or h for GE */ u32 a4; /* p14 for 4-states or 1-k for GE */ u32 a5; /* p23 used only in 4-states */ } clg; struct tc_netem_slot slot_config; struct slotstate { u64 slot_next; s32 packets_left; s32 bytes_left; } slot; struct disttable *slot_dist; }; /* Time stamp put into socket buffer control block * Only valid when skbs are in our internal t(ime)fifo queue. * * As skb->rbnode uses same storage than skb->next, skb->prev and skb->tstamp, * and skb->next & skb->prev are scratch space for a qdisc, * we save skb->tstamp value in skb->cb[] before destroying it. */ struct netem_skb_cb { u64 time_to_send; }; static inline struct netem_skb_cb *netem_skb_cb(struct sk_buff *skb) { /* we assume we can use skb next/prev/tstamp as storage for rb_node */ qdisc_cb_private_validate(skb, sizeof(struct netem_skb_cb)); return (struct netem_skb_cb *)qdisc_skb_cb(skb)->data; } /* init_crandom - initialize correlated random number generator * Use entropy source for initial seed. */ static void init_crandom(struct crndstate *state, unsigned long rho) { state->rho = rho; state->last = get_random_u32(); } /* get_crandom - correlated random number generator * Next number depends on last value. * rho is scaled to avoid floating point. */ static u32 get_crandom(struct crndstate *state, struct prng *p) { u64 value, rho; unsigned long answer; struct rnd_state *s = &p->prng_state; if (!state || state->rho == 0) /* no correlation */ return prandom_u32_state(s); value = prandom_u32_state(s); rho = (u64)state->rho + 1; answer = (value * ((1ull<<32) - rho) + state->last * rho) >> 32; state->last = answer; return answer; } /* loss_4state - 4-state model loss generator * Generates losses according to the 4-state Markov chain adopted in * the GI (General and Intuitive) loss model. */ static bool loss_4state(struct netem_sched_data *q) { struct clgstate *clg = &q->clg; u32 rnd = prandom_u32_state(&q->prng.prng_state); /* * Makes a comparison between rnd and the transition * probabilities outgoing from the current state, then decides the * next state and if the next packet has to be transmitted or lost. * The four states correspond to: * TX_IN_GAP_PERIOD => successfully transmitted packets within a gap period * LOST_IN_GAP_PERIOD => isolated losses within a gap period * LOST_IN_BURST_PERIOD => lost packets within a burst period * TX_IN_BURST_PERIOD => successfully transmitted packets within a burst period */ switch (clg->state) { case TX_IN_GAP_PERIOD: if (rnd < clg->a4) { clg->state = LOST_IN_GAP_PERIOD; return true; } else if (clg->a4 < rnd && rnd < clg->a1 + clg->a4) { clg->state = LOST_IN_BURST_PERIOD; return true; } else if (clg->a1 + clg->a4 < rnd) { clg->state = TX_IN_GAP_PERIOD; } break; case TX_IN_BURST_PERIOD: if (rnd < clg->a5) { clg->state = LOST_IN_BURST_PERIOD; return true; } else { clg->state = TX_IN_BURST_PERIOD; } break; case LOST_IN_BURST_PERIOD: if (rnd < clg->a3) clg->state = TX_IN_BURST_PERIOD; else if (clg->a3 < rnd && rnd < clg->a2 + clg->a3) { clg->state = TX_IN_GAP_PERIOD; } else if (clg->a2 + clg->a3 < rnd) { clg->state = LOST_IN_BURST_PERIOD; return true; } break; case LOST_IN_GAP_PERIOD: clg->state = TX_IN_GAP_PERIOD; break; } return false; } /* loss_gilb_ell - Gilbert-Elliot model loss generator * Generates losses according to the Gilbert-Elliot loss model or * its special cases (Gilbert or Simple Gilbert) * * Makes a comparison between random number and the transition * probabilities outgoing from the current state, then decides the * next state. A second random number is extracted and the comparison * with the loss probability of the current state decides if the next * packet will be transmitted or lost. */ static bool loss_gilb_ell(struct netem_sched_data *q) { struct clgstate *clg = &q->clg; struct rnd_state *s = &q->prng.prng_state; switch (clg->state) { case GOOD_STATE: if (prandom_u32_state(s) < clg->a1) clg->state = BAD_STATE; if (prandom_u32_state(s) < clg->a4) return true; break; case BAD_STATE: if (prandom_u32_state(s) < clg->a2) clg->state = GOOD_STATE; if (prandom_u32_state(s) > clg->a3) return true; } return false; } static bool loss_event(struct netem_sched_data *q) { switch (q->loss_model) { case CLG_RANDOM: /* Random packet drop 0 => none, ~0 => all */ return q->loss && q->loss >= get_crandom(&q->loss_cor, &q->prng); case CLG_4_STATES: /* 4state loss model algorithm (used also for GI model) * Extracts a value from the markov 4 state loss generator, * if it is 1 drops a packet and if needed writes the event in * the kernel logs */ return loss_4state(q); case CLG_GILB_ELL: /* Gilbert-Elliot loss model algorithm * Extracts a value from the Gilbert-Elliot loss generator, * if it is 1 drops a packet and if needed writes the event in * the kernel logs */ return loss_gilb_ell(q); } return false; /* not reached */ } /* tabledist - return a pseudo-randomly distributed value with mean mu and * std deviation sigma. Uses table lookup to approximate the desired * distribution, and a uniformly-distributed pseudo-random source. */ static s64 tabledist(s64 mu, s32 sigma, struct crndstate *state, struct prng *prng, const struct disttable *dist) { s64 x; long t; u32 rnd; if (sigma == 0) return mu; rnd = get_crandom(state, prng); /* default uniform distribution */ if (dist == NULL) return ((rnd % (2 * (u32)sigma)) + mu) - sigma; t = dist->table[rnd % dist->size]; x = (sigma % NETEM_DIST_SCALE) * t; if (x >= 0) x += NETEM_DIST_SCALE/2; else x -= NETEM_DIST_SCALE/2; return x / NETEM_DIST_SCALE + (sigma / NETEM_DIST_SCALE) * t + mu; } static u64 packet_time_ns(u64 len, const struct netem_sched_data *q) { len += q->packet_overhead; if (q->cell_size) { u32 cells = reciprocal_divide(len, q->cell_size_reciprocal); if (len > cells * q->cell_size) /* extra cell needed for remainder */ cells++; len = cells * (q->cell_size + q->cell_overhead); } return div64_u64(len * NSEC_PER_SEC, q->rate); } static void tfifo_reset(struct Qdisc *sch) { struct netem_sched_data *q = qdisc_priv(sch); struct rb_node *p = rb_first(&q->t_root); while (p) { struct sk_buff *skb = rb_to_skb(p); p = rb_next(p); rb_erase(&skb->rbnode, &q->t_root); rtnl_kfree_skbs(skb, skb); } rtnl_kfree_skbs(q->t_head, q->t_tail); q->t_head = NULL; q->t_tail = NULL; } static void tfifo_enqueue(struct sk_buff *nskb, struct Qdisc *sch) { struct netem_sched_data *q = qdisc_priv(sch); u64 tnext = netem_skb_cb(nskb)->time_to_send; if (!q->t_tail || tnext >= netem_skb_cb(q->t_tail)->time_to_send) { if (q->t_tail) q->t_tail->next = nskb; else q->t_head = nskb; q->t_tail = nskb; } else { struct rb_node **p = &q->t_root.rb_node, *parent = NULL; while (*p) { struct sk_buff *skb; parent = *p; skb = rb_to_skb(parent); if (tnext >= netem_skb_cb(skb)->time_to_send) p = &parent->rb_right; else p = &parent->rb_left; } rb_link_node(&nskb->rbnode, parent, p); rb_insert_color(&nskb->rbnode, &q->t_root); } sch->q.qlen++; } /* netem can't properly corrupt a megapacket (like we get from GSO), so instead * when we statistically choose to corrupt one, we instead segment it, returning * the first packet to be corrupted, and re-enqueue the remaining frames */ static struct sk_buff *netem_segment(struct sk_buff *skb, struct Qdisc *sch, struct sk_buff **to_free) { struct sk_buff *segs; netdev_features_t features = netif_skb_features(skb); segs = skb_gso_segment(skb, features & ~NETIF_F_GSO_MASK); if (IS_ERR_OR_NULL(segs)) { qdisc_drop(skb, sch, to_free); return NULL; } consume_skb(skb); return segs; } /* * Insert one skb into qdisc. * Note: parent depends on return value to account for queue length. * NET_XMIT_DROP: queue length didn't change. * NET_XMIT_SUCCESS: one skb was queued. */ static int netem_enqueue(struct sk_buff *skb, struct Qdisc *sch, struct sk_buff **to_free) { struct netem_sched_data *q = qdisc_priv(sch); /* We don't fill cb now as skb_unshare() may invalidate it */ struct netem_skb_cb *cb; struct sk_buff *skb2; struct sk_buff *segs = NULL; unsigned int prev_len = qdisc_pkt_len(skb); int count = 1; int rc = NET_XMIT_SUCCESS; int rc_drop = NET_XMIT_DROP; /* Do not fool qdisc_drop_all() */ skb->prev = NULL; /* Random duplication */ if (q->duplicate && q->duplicate >= get_crandom(&q->dup_cor, &q->prng)) ++count; /* Drop packet? */ if (loss_event(q)) { if (q->ecn && INET_ECN_set_ce(skb)) qdisc_qstats_drop(sch); /* mark packet */ else --count; } if (count == 0) { qdisc_qstats_drop(sch); __qdisc_drop(skb, to_free); return NET_XMIT_SUCCESS | __NET_XMIT_BYPASS; } /* If a delay is expected, orphan the skb. (orphaning usually takes * place at TX completion time, so _before_ the link transit delay) */ if (q->latency || q->jitter || q->rate) skb_orphan_partial(skb); /* * If we need to duplicate packet, then re-insert at top of the * qdisc tree, since parent queuer expects that only one * skb will be queued. */ if (count > 1 && (skb2 = skb_clone(skb, GFP_ATOMIC)) != NULL) { struct Qdisc *rootq = qdisc_root_bh(sch); u32 dupsave = q->duplicate; /* prevent duplicating a dup... */ q->duplicate = 0; rootq->enqueue(skb2, rootq, to_free); q->duplicate = dupsave; rc_drop = NET_XMIT_SUCCESS; } /* * Randomized packet corruption. * Make copy if needed since we are modifying * If packet is going to be hardware checksummed, then * do it now in software before we mangle it. */ if (q->corrupt && q->corrupt >= get_crandom(&q->corrupt_cor, &q->prng)) { if (skb_is_gso(skb)) { skb = netem_segment(skb, sch, to_free); if (!skb) return rc_drop; segs = skb->next; skb_mark_not_on_list(skb); qdisc_skb_cb(skb)->pkt_len = skb->len; } skb = skb_unshare(skb, GFP_ATOMIC); if (unlikely(!skb)) { qdisc_qstats_drop(sch); goto finish_segs; } if (skb->ip_summed == CHECKSUM_PARTIAL && skb_checksum_help(skb)) { qdisc_drop(skb, sch, to_free); skb = NULL; goto finish_segs; } skb->data[get_random_u32_below(skb_headlen(skb))] ^= 1<<get_random_u32_below(8); } if (unlikely(sch->q.qlen >= sch->limit)) { /* re-link segs, so that qdisc_drop_all() frees them all */ skb->next = segs; qdisc_drop_all(skb, sch, to_free); return rc_drop; } qdisc_qstats_backlog_inc(sch, skb); cb = netem_skb_cb(skb); if (q->gap == 0 || /* not doing reordering */ q->counter < q->gap - 1 || /* inside last reordering gap */ q->reorder < get_crandom(&q->reorder_cor, &q->prng)) { u64 now; s64 delay; delay = tabledist(q->latency, q->jitter, &q->delay_cor, &q->prng, q->delay_dist); now = ktime_get_ns(); if (q->rate) { struct netem_skb_cb *last = NULL; if (sch->q.tail) last = netem_skb_cb(sch->q.tail); if (q->t_root.rb_node) { struct sk_buff *t_skb; struct netem_skb_cb *t_last; t_skb = skb_rb_last(&q->t_root); t_last = netem_skb_cb(t_skb); if (!last || t_last->time_to_send > last->time_to_send) last = t_last; } if (q->t_tail) { struct netem_skb_cb *t_last = netem_skb_cb(q->t_tail); if (!last || t_last->time_to_send > last->time_to_send) last = t_last; } if (last) { /* * Last packet in queue is reference point (now), * calculate this time bonus and subtract * from delay. */ delay -= last->time_to_send - now; delay = max_t(s64, 0, delay); now = last->time_to_send; } delay += packet_time_ns(qdisc_pkt_len(skb), q); } cb->time_to_send = now + delay; ++q->counter; tfifo_enqueue(skb, sch); } else { /* * Do re-ordering by putting one out of N packets at the front * of the queue. */ cb->time_to_send = ktime_get_ns(); q->counter = 0; __qdisc_enqueue_head(skb, &sch->q); sch->qstats.requeues++; } finish_segs: if (segs) { unsigned int len, last_len; int nb; len = skb ? skb->len : 0; nb = skb ? 1 : 0; while (segs) { skb2 = segs->next; skb_mark_not_on_list(segs); qdisc_skb_cb(segs)->pkt_len = segs->len; last_len = segs->len; rc = qdisc_enqueue(segs, sch, to_free); if (rc != NET_XMIT_SUCCESS) { if (net_xmit_drop_count(rc)) qdisc_qstats_drop(sch); } else { nb++; len += last_len; } segs = skb2; } /* Parent qdiscs accounted for 1 skb of size @prev_len */ qdisc_tree_reduce_backlog(sch, -(nb - 1), -(len - prev_len)); } else if (!skb) { return NET_XMIT_DROP; } return NET_XMIT_SUCCESS; } /* Delay the next round with a new future slot with a * correct number of bytes and packets. */ static void get_slot_next(struct netem_sched_data *q, u64 now) { s64 next_delay; if (!q->slot_dist) next_delay = q->slot_config.min_delay + (get_random_u32() * (q->slot_config.max_delay - q->slot_config.min_delay) >> 32); else next_delay = tabledist(q->slot_config.dist_delay, (s32)(q->slot_config.dist_jitter), NULL, &q->prng, q->slot_dist); q->slot.slot_next = now + next_delay; q->slot.packets_left = q->slot_config.max_packets; q->slot.bytes_left = q->slot_config.max_bytes; } static struct sk_buff *netem_peek(struct netem_sched_data *q) { struct sk_buff *skb = skb_rb_first(&q->t_root); u64 t1, t2; if (!skb) return q->t_head; if (!q->t_head) return skb; t1 = netem_skb_cb(skb)->time_to_send; t2 = netem_skb_cb(q->t_head)->time_to_send; if (t1 < t2) return skb; return q->t_head; } static void netem_erase_head(struct netem_sched_data *q, struct sk_buff *skb) { if (skb == q->t_head) { q->t_head = skb->next; if (!q->t_head) q->t_tail = NULL; } else { rb_erase(&skb->rbnode, &q->t_root); } } static struct sk_buff *netem_dequeue(struct Qdisc *sch) { struct netem_sched_data *q = qdisc_priv(sch); struct sk_buff *skb; tfifo_dequeue: skb = __qdisc_dequeue_head(&sch->q); if (skb) { qdisc_qstats_backlog_dec(sch, skb); deliver: qdisc_bstats_update(sch, skb); return skb; } skb = netem_peek(q); if (skb) { u64 time_to_send; u64 now = ktime_get_ns(); /* if more time remaining? */ time_to_send = netem_skb_cb(skb)->time_to_send; if (q->slot.slot_next && q->slot.slot_next < time_to_send) get_slot_next(q, now); if (time_to_send <= now && q->slot.slot_next <= now) { netem_erase_head(q, skb); sch->q.qlen--; qdisc_qstats_backlog_dec(sch, skb); skb->next = NULL; skb->prev = NULL; /* skb->dev shares skb->rbnode area, * we need to restore its value. */ skb->dev = qdisc_dev(sch); if (q->slot.slot_next) { q->slot.packets_left--; q->slot.bytes_left -= qdisc_pkt_len(skb); if (q->slot.packets_left <= 0 || q->slot.bytes_left <= 0) get_slot_next(q, now); } if (q->qdisc) { unsigned int pkt_len = qdisc_pkt_len(skb); struct sk_buff *to_free = NULL; int err; err = qdisc_enqueue(skb, q->qdisc, &to_free); kfree_skb_list(to_free); if (err != NET_XMIT_SUCCESS && net_xmit_drop_count(err)) { qdisc_qstats_drop(sch); qdisc_tree_reduce_backlog(sch, 1, pkt_len); } goto tfifo_dequeue; } goto deliver; } if (q->qdisc) { skb = q->qdisc->ops->dequeue(q->qdisc); if (skb) goto deliver; } qdisc_watchdog_schedule_ns(&q->watchdog, max(time_to_send, q->slot.slot_next)); } if (q->qdisc) { skb = q->qdisc->ops->dequeue(q->qdisc); if (skb) goto deliver; } return NULL; } static void netem_reset(struct Qdisc *sch) { struct netem_sched_data *q = qdisc_priv(sch); qdisc_reset_queue(sch); tfifo_reset(sch); if (q->qdisc) qdisc_reset(q->qdisc); qdisc_watchdog_cancel(&q->watchdog); } static void dist_free(struct disttable *d) { kvfree(d); } /* * Distribution data is a variable size payload containing * signed 16 bit values. */ static int get_dist_table(struct disttable **tbl, const struct nlattr *attr) { size_t n = nla_len(attr)/sizeof(__s16); const __s16 *data = nla_data(attr); struct disttable *d; int i; if (!n || n > NETEM_DIST_MAX) return -EINVAL; d = kvmalloc(struct_size(d, table, n), GFP_KERNEL); if (!d) return -ENOMEM; d->size = n; for (i = 0; i < n; i++) d->table[i] = data[i]; *tbl = d; return 0; } static void get_slot(struct netem_sched_data *q, const struct nlattr *attr) { const struct tc_netem_slot *c = nla_data(attr); q->slot_config = *c; if (q->slot_config.max_packets == 0) q->slot_config.max_packets = INT_MAX; if (q->slot_config.max_bytes == 0) q->slot_config.max_bytes = INT_MAX; /* capping dist_jitter to the range acceptable by tabledist() */ q->slot_config.dist_jitter = min_t(__s64, INT_MAX, abs(q->slot_config.dist_jitter)); q->slot.packets_left = q->slot_config.max_packets; q->slot.bytes_left = q->slot_config.max_bytes; if (q->slot_config.min_delay | q->slot_config.max_delay | q->slot_config.dist_jitter) q->slot.slot_next = ktime_get_ns(); else q->slot.slot_next = 0; } static void get_correlation(struct netem_sched_data *q, const struct nlattr *attr) { const struct tc_netem_corr *c = nla_data(attr); init_crandom(&q->delay_cor, c->delay_corr); init_crandom(&q->loss_cor, c->loss_corr); init_crandom(&q->dup_cor, c->dup_corr); } static void get_reorder(struct netem_sched_data *q, const struct nlattr *attr) { const struct tc_netem_reorder *r = nla_data(attr); q->reorder = r->probability; init_crandom(&q->reorder_cor, r->correlation); } static void get_corrupt(struct netem_sched_data *q, const struct nlattr *attr) { const struct tc_netem_corrupt *r = nla_data(attr); q->corrupt = r->probability; init_crandom(&q->corrupt_cor, r->correlation); } static void get_rate(struct netem_sched_data *q, const struct nlattr *attr) { const struct tc_netem_rate *r = nla_data(attr); q->rate = r->rate; q->packet_overhead = r->packet_overhead; q->cell_size = r->cell_size; q->cell_overhead = r->cell_overhead; if (q->cell_size) q->cell_size_reciprocal = reciprocal_value(q->cell_size); else q->cell_size_reciprocal = (struct reciprocal_value) { 0 }; } static int get_loss_clg(struct netem_sched_data *q, const struct nlattr *attr) { const struct nlattr *la; int rem; nla_for_each_nested(la, attr, rem) { u16 type = nla_type(la); switch (type) { case NETEM_LOSS_GI: { const struct tc_netem_gimodel *gi = nla_data(la); if (nla_len(la) < sizeof(struct tc_netem_gimodel)) { pr_info("netem: incorrect gi model size\n"); return -EINVAL; } q->loss_model = CLG_4_STATES; q->clg.state = TX_IN_GAP_PERIOD; q->clg.a1 = gi->p13; q->clg.a2 = gi->p31; q->clg.a3 = gi->p32; q->clg.a4 = gi->p14; q->clg.a5 = gi->p23; break; } case NETEM_LOSS_GE: { const struct tc_netem_gemodel *ge = nla_data(la); if (nla_len(la) < sizeof(struct tc_netem_gemodel)) { pr_info("netem: incorrect ge model size\n"); return -EINVAL; } q->loss_model = CLG_GILB_ELL; q->clg.state = GOOD_STATE; q->clg.a1 = ge->p; q->clg.a2 = ge->r; q->clg.a3 = ge->h; q->clg.a4 = ge->k1; break; } default: pr_info("netem: unknown loss type %u\n", type); return -EINVAL; } } return 0; } static const struct nla_policy netem_policy[TCA_NETEM_MAX + 1] = { [TCA_NETEM_CORR] = { .len = sizeof(struct tc_netem_corr) }, [TCA_NETEM_REORDER] = { .len = sizeof(struct tc_netem_reorder) }, [TCA_NETEM_CORRUPT] = { .len = sizeof(struct tc_netem_corrupt) }, [TCA_NETEM_RATE] = { .len = sizeof(struct tc_netem_rate) }, [TCA_NETEM_LOSS] = { .type = NLA_NESTED }, [TCA_NETEM_ECN] = { .type = NLA_U32 }, [TCA_NETEM_RATE64] = { .type = NLA_U64 }, [TCA_NETEM_LATENCY64] = { .type = NLA_S64 }, [TCA_NETEM_JITTER64] = { .type = NLA_S64 }, [TCA_NETEM_SLOT] = { .len = sizeof(struct tc_netem_slot) }, [TCA_NETEM_PRNG_SEED] = { .type = NLA_U64 }, }; static int parse_attr(struct nlattr *tb[], int maxtype, struct nlattr *nla, const struct nla_policy *policy, int len) { int nested_len = nla_len(nla) - NLA_ALIGN(len); if (nested_len < 0) { pr_info("netem: invalid attributes len %d\n", nested_len); return -EINVAL; } if (nested_len >= nla_attr_size(0)) return nla_parse_deprecated(tb, maxtype, nla_data(nla) + NLA_ALIGN(len), nested_len, policy, NULL); memset(tb, 0, sizeof(struct nlattr *) * (maxtype + 1)); return 0; } /* Parse netlink message to set options */ static int netem_change(struct Qdisc *sch, struct nlattr *opt, struct netlink_ext_ack *extack) { struct netem_sched_data *q = qdisc_priv(sch); struct nlattr *tb[TCA_NETEM_MAX + 1]; struct disttable *delay_dist = NULL; struct disttable *slot_dist = NULL; struct tc_netem_qopt *qopt; struct clgstate old_clg; int old_loss_model = CLG_RANDOM; int ret; qopt = nla_data(opt); ret = parse_attr(tb, TCA_NETEM_MAX, opt, netem_policy, sizeof(*qopt)); if (ret < 0) return ret; if (tb[TCA_NETEM_DELAY_DIST]) { ret = get_dist_table(&delay_dist, tb[TCA_NETEM_DELAY_DIST]); if (ret) goto table_free; } if (tb[TCA_NETEM_SLOT_DIST]) { ret = get_dist_table(&slot_dist, tb[TCA_NETEM_SLOT_DIST]); if (ret) goto table_free; } sch_tree_lock(sch); /* backup q->clg and q->loss_model */ old_clg = q->clg; old_loss_model = q->loss_model; if (tb[TCA_NETEM_LOSS]) { ret = get_loss_clg(q, tb[TCA_NETEM_LOSS]); if (ret) { q->loss_model = old_loss_model; q->clg = old_clg; goto unlock; } } else { q->loss_model = CLG_RANDOM; } if (delay_dist) swap(q->delay_dist, delay_dist); if (slot_dist) swap(q->slot_dist, slot_dist); sch->limit = qopt->limit; q->latency = PSCHED_TICKS2NS(qopt->latency); q->jitter = PSCHED_TICKS2NS(qopt->jitter); q->limit = qopt->limit; q->gap = qopt->gap; q->counter = 0; q->loss = qopt->loss; q->duplicate = qopt->duplicate; /* for compatibility with earlier versions. * if gap is set, need to assume 100% probability */ if (q->gap) q->reorder = ~0; if (tb[TCA_NETEM_CORR]) get_correlation(q, tb[TCA_NETEM_CORR]); if (tb[TCA_NETEM_REORDER]) get_reorder(q, tb[TCA_NETEM_REORDER]); if (tb[TCA_NETEM_CORRUPT]) get_corrupt(q, tb[TCA_NETEM_CORRUPT]); if (tb[TCA_NETEM_RATE]) get_rate(q, tb[TCA_NETEM_RATE]); if (tb[TCA_NETEM_RATE64]) q->rate = max_t(u64, q->rate, nla_get_u64(tb[TCA_NETEM_RATE64])); if (tb[TCA_NETEM_LATENCY64]) q->latency = nla_get_s64(tb[TCA_NETEM_LATENCY64]); if (tb[TCA_NETEM_JITTER64]) q->jitter = nla_get_s64(tb[TCA_NETEM_JITTER64]); if (tb[TCA_NETEM_ECN]) q->ecn = nla_get_u32(tb[TCA_NETEM_ECN]); if (tb[TCA_NETEM_SLOT]) get_slot(q, tb[TCA_NETEM_SLOT]); /* capping jitter to the range acceptable by tabledist() */ q->jitter = min_t(s64, abs(q->jitter), INT_MAX); if (tb[TCA_NETEM_PRNG_SEED]) q->prng.seed = nla_get_u64(tb[TCA_NETEM_PRNG_SEED]); else q->prng.seed = get_random_u64(); prandom_seed_state(&q->prng.prng_state, q->prng.seed); unlock: sch_tree_unlock(sch); table_free: dist_free(delay_dist); dist_free(slot_dist); return ret; } static int netem_init(struct Qdisc *sch, struct nlattr *opt, struct netlink_ext_ack *extack) { struct netem_sched_data *q = qdisc_priv(sch); int ret; qdisc_watchdog_init(&q->watchdog, sch); if (!opt) return -EINVAL; q->loss_model = CLG_RANDOM; ret = netem_change(sch, opt, extack); if (ret) pr_info("netem: change failed\n"); return ret; } static void netem_destroy(struct Qdisc *sch) { struct netem_sched_data *q = qdisc_priv(sch); qdisc_watchdog_cancel(&q->watchdog); if (q->qdisc) qdisc_put(q->qdisc); dist_free(q->delay_dist); dist_free(q->slot_dist); } static int dump_loss_model(const struct netem_sched_data *q, struct sk_buff *skb) { struct nlattr *nest; nest = nla_nest_start_noflag(skb, TCA_NETEM_LOSS); if (nest == NULL) goto nla_put_failure; switch (q->loss_model) { case CLG_RANDOM: /* legacy loss model */ nla_nest_cancel(skb, nest); return 0; /* no data */ case CLG_4_STATES: { struct tc_netem_gimodel gi = { .p13 = q->clg.a1, .p31 = q->clg.a2, .p32 = q->clg.a3, .p14 = q->clg.a4, .p23 = q->clg.a5, }; if (nla_put(skb, NETEM_LOSS_GI, sizeof(gi), &gi)) goto nla_put_failure; break; } case CLG_GILB_ELL: { struct tc_netem_gemodel ge = { .p = q->clg.a1, .r = q->clg.a2, .h = q->clg.a3, .k1 = q->clg.a4, }; if (nla_put(skb, NETEM_LOSS_GE, sizeof(ge), &ge)) goto nla_put_failure; break; } } nla_nest_end(skb, nest); return 0; nla_put_failure: nla_nest_cancel(skb, nest); return -1; } static int netem_dump(struct Qdisc *sch, struct sk_buff *skb) { const struct netem_sched_data *q = qdisc_priv(sch); struct nlattr *nla = (struct nlattr *) skb_tail_pointer(skb); struct tc_netem_qopt qopt; struct tc_netem_corr cor; struct tc_netem_reorder reorder; struct tc_netem_corrupt corrupt; struct tc_netem_rate rate; struct tc_netem_slot slot; qopt.latency = min_t(psched_time_t, PSCHED_NS2TICKS(q->latency), UINT_MAX); qopt.jitter = min_t(psched_time_t, PSCHED_NS2TICKS(q->jitter), UINT_MAX); qopt.limit = q->limit; qopt.loss = q->loss; qopt.gap = q->gap; qopt.duplicate = q->duplicate; if (nla_put(skb, TCA_OPTIONS, sizeof(qopt), &qopt)) goto nla_put_failure; if (nla_put(skb, TCA_NETEM_LATENCY64, sizeof(q->latency), &q->latency)) goto nla_put_failure; if (nla_put(skb, TCA_NETEM_JITTER64, sizeof(q->jitter), &q->jitter)) goto nla_put_failure; cor.delay_corr = q->delay_cor.rho; cor.loss_corr = q->loss_cor.rho; cor.dup_corr = q->dup_cor.rho; if (nla_put(skb, TCA_NETEM_CORR, sizeof(cor), &cor)) goto nla_put_failure; reorder.probability = q->reorder; reorder.correlation = q->reorder_cor.rho; if (nla_put(skb, TCA_NETEM_REORDER, sizeof(reorder), &reorder)) goto nla_put_failure; corrupt.probability = q->corrupt; corrupt.correlation = q->corrupt_cor.rho; if (nla_put(skb, TCA_NETEM_CORRUPT, sizeof(corrupt), &corrupt)) goto nla_put_failure; if (q->rate >= (1ULL << 32)) { if (nla_put_u64_64bit(skb, TCA_NETEM_RATE64, q->rate, TCA_NETEM_PAD)) goto nla_put_failure; rate.rate = ~0U; } else { rate.rate = q->rate; } rate.packet_overhead = q->packet_overhead; rate.cell_size = q->cell_size; rate.cell_overhead = q->cell_overhead; if (nla_put(skb, TCA_NETEM_RATE, sizeof(rate), &rate)) goto nla_put_failure; if (q->ecn && nla_put_u32(skb, TCA_NETEM_ECN, q->ecn)) goto nla_put_failure; if (dump_loss_model(q, skb) != 0) goto nla_put_failure; if (q->slot_config.min_delay | q->slot_config.max_delay | q->slot_config.dist_jitter) { slot = q->slot_config; if (slot.max_packets == INT_MAX) slot.max_packets = 0; if (slot.max_bytes == INT_MAX) slot.max_bytes = 0; if (nla_put(skb, TCA_NETEM_SLOT, sizeof(slot), &slot)) goto nla_put_failure; } if (nla_put_u64_64bit(skb, TCA_NETEM_PRNG_SEED, q->prng.seed, TCA_NETEM_PAD)) goto nla_put_failure; return nla_nest_end(skb, nla); nla_put_failure: nlmsg_trim(skb, nla); return -1; } static int netem_dump_class(struct Qdisc *sch, unsigned long cl, struct sk_buff *skb, struct tcmsg *tcm) { struct netem_sched_data *q = qdisc_priv(sch); if (cl != 1 || !q->qdisc) /* only one class */ return -ENOENT; tcm->tcm_handle |= TC_H_MIN(1); tcm->tcm_info = q->qdisc->handle; return 0; } static int netem_graft(struct Qdisc *sch, unsigned long arg, struct Qdisc *new, struct Qdisc **old, struct netlink_ext_ack *extack) { struct netem_sched_data *q = qdisc_priv(sch); *old = qdisc_replace(sch, new, &q->qdisc); return 0; } static struct Qdisc *netem_leaf(struct Qdisc *sch, unsigned long arg) { struct netem_sched_data *q = qdisc_priv(sch); return q->qdisc; } static unsigned long netem_find(struct Qdisc *sch, u32 classid) { return 1; } static void netem_walk(struct Qdisc *sch, struct qdisc_walker *walker) { if (!walker->stop) { if (!tc_qdisc_stats_dump(sch, 1, walker)) return; } } static const struct Qdisc_class_ops netem_class_ops = { .graft = netem_graft, .leaf = netem_leaf, .find = netem_find, .walk = netem_walk, .dump = netem_dump_class, }; static struct Qdisc_ops netem_qdisc_ops __read_mostly = { .id = "netem", .cl_ops = &netem_class_ops, .priv_size = sizeof(struct netem_sched_data), .enqueue = netem_enqueue, .dequeue = netem_dequeue, .peek = qdisc_peek_dequeued, .init = netem_init, .reset = netem_reset, .destroy = netem_destroy, .change = netem_change, .dump = netem_dump, .owner = THIS_MODULE, }; static int __init netem_module_init(void) { pr_info("netem: version " VERSION "\n"); return register_qdisc(&netem_qdisc_ops); } static void __exit netem_module_exit(void) { unregister_qdisc(&netem_qdisc_ops); } module_init(netem_module_init) module_exit(netem_module_exit) MODULE_LICENSE("GPL"); |
5 3 2 1 1 1 3 5 2 1 1 2 1 2 2 | 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60 61 62 63 64 65 66 67 68 69 70 71 72 73 74 75 76 77 78 79 80 81 82 83 84 85 86 87 88 89 90 91 92 93 94 95 96 97 98 99 100 101 102 103 104 105 106 107 108 109 110 111 112 113 114 115 116 117 118 119 120 121 122 123 124 125 126 127 128 129 130 131 132 133 134 135 136 137 138 139 140 141 142 143 144 145 146 147 148 149 150 151 152 153 154 155 156 157 158 159 160 161 162 163 164 165 166 167 168 169 170 171 172 173 174 175 176 177 178 179 180 181 182 183 184 185 186 187 188 189 190 191 192 193 194 195 196 197 198 199 200 201 202 203 204 205 206 207 208 209 210 211 212 213 214 215 216 217 218 219 220 221 222 223 224 225 226 227 228 229 230 231 232 233 234 235 236 237 238 239 240 241 242 243 244 245 246 247 248 249 250 251 252 253 254 255 256 257 258 259 260 261 262 263 264 265 266 267 268 269 270 271 | // SPDX-License-Identifier: GPL-2.0-only /* * Copyright (c) 2014 Arturo Borrero Gonzalez <arturo@debian.org> */ #include <linux/kernel.h> #include <linux/init.h> #include <linux/module.h> #include <linux/netlink.h> #include <linux/netfilter.h> #include <linux/netfilter/nf_tables.h> #include <net/netfilter/nf_nat.h> #include <net/netfilter/nf_nat_redirect.h> #include <net/netfilter/nf_tables.h> struct nft_redir { u8 sreg_proto_min; u8 sreg_proto_max; u16 flags; }; static const struct nla_policy nft_redir_policy[NFTA_REDIR_MAX + 1] = { [NFTA_REDIR_REG_PROTO_MIN] = { .type = NLA_U32 }, [NFTA_REDIR_REG_PROTO_MAX] = { .type = NLA_U32 }, [NFTA_REDIR_FLAGS] = NLA_POLICY_MASK(NLA_BE32, NF_NAT_RANGE_MASK), }; static int nft_redir_validate(const struct nft_ctx *ctx, const struct nft_expr *expr, const struct nft_data **data) { int err; err = nft_chain_validate_dependency(ctx->chain, NFT_CHAIN_T_NAT); if (err < 0) return err; return nft_chain_validate_hooks(ctx->chain, (1 << NF_INET_PRE_ROUTING) | (1 << NF_INET_LOCAL_OUT)); } static int nft_redir_init(const struct nft_ctx *ctx, const struct nft_expr *expr, const struct nlattr * const tb[]) { struct nft_redir *priv = nft_expr_priv(expr); unsigned int plen; int err; plen = sizeof_field(struct nf_nat_range, min_proto.all); if (tb[NFTA_REDIR_REG_PROTO_MIN]) { err = nft_parse_register_load(tb[NFTA_REDIR_REG_PROTO_MIN], &priv->sreg_proto_min, plen); if (err < 0) return err; if (tb[NFTA_REDIR_REG_PROTO_MAX]) { err = nft_parse_register_load(tb[NFTA_REDIR_REG_PROTO_MAX], &priv->sreg_proto_max, plen); if (err < 0) return err; } else { priv->sreg_proto_max = priv->sreg_proto_min; } priv->flags |= NF_NAT_RANGE_PROTO_SPECIFIED; } if (tb[NFTA_REDIR_FLAGS]) priv->flags = ntohl(nla_get_be32(tb[NFTA_REDIR_FLAGS])); return nf_ct_netns_get(ctx->net, ctx->family); } static int nft_redir_dump(struct sk_buff *skb, const struct nft_expr *expr, bool reset) { const struct nft_redir *priv = nft_expr_priv(expr); if (priv->sreg_proto_min) { if (nft_dump_register(skb, NFTA_REDIR_REG_PROTO_MIN, priv->sreg_proto_min)) goto nla_put_failure; if (nft_dump_register(skb, NFTA_REDIR_REG_PROTO_MAX, priv->sreg_proto_max)) goto nla_put_failure; } if (priv->flags != 0 && nla_put_be32(skb, NFTA_REDIR_FLAGS, htonl(priv->flags))) goto nla_put_failure; return 0; nla_put_failure: return -1; } static void nft_redir_eval(const struct nft_expr *expr, struct nft_regs *regs, const struct nft_pktinfo *pkt) { const struct nft_redir *priv = nft_expr_priv(expr); struct nf_nat_range2 range; memset(&range, 0, sizeof(range)); range.flags = priv->flags; if (priv->sreg_proto_min) { range.min_proto.all = (__force __be16) nft_reg_load16(®s->data[priv->sreg_proto_min]); range.max_proto.all = (__force __be16) nft_reg_load16(®s->data[priv->sreg_proto_max]); } switch (nft_pf(pkt)) { case NFPROTO_IPV4: regs->verdict.code = nf_nat_redirect_ipv4(pkt->skb, &range, nft_hook(pkt)); break; #ifdef CONFIG_NF_TABLES_IPV6 case NFPROTO_IPV6: regs->verdict.code = nf_nat_redirect_ipv6(pkt->skb, &range, nft_hook(pkt)); break; #endif default: WARN_ON_ONCE(1); break; } } static void nft_redir_ipv4_destroy(const struct nft_ctx *ctx, const struct nft_expr *expr) { nf_ct_netns_put(ctx->net, NFPROTO_IPV4); } static struct nft_expr_type nft_redir_ipv4_type; static const struct nft_expr_ops nft_redir_ipv4_ops = { .type = &nft_redir_ipv4_type, .size = NFT_EXPR_SIZE(sizeof(struct nft_redir)), .eval = nft_redir_eval, .init = nft_redir_init, .destroy = nft_redir_ipv4_destroy, .dump = nft_redir_dump, .validate = nft_redir_validate, .reduce = NFT_REDUCE_READONLY, }; static struct nft_expr_type nft_redir_ipv4_type __read_mostly = { .family = NFPROTO_IPV4, .name = "redir", .ops = &nft_redir_ipv4_ops, .policy = nft_redir_policy, .maxattr = NFTA_REDIR_MAX, .owner = THIS_MODULE, }; #ifdef CONFIG_NF_TABLES_IPV6 static void nft_redir_ipv6_destroy(const struct nft_ctx *ctx, const struct nft_expr *expr) { nf_ct_netns_put(ctx->net, NFPROTO_IPV6); } static struct nft_expr_type nft_redir_ipv6_type; static const struct nft_expr_ops nft_redir_ipv6_ops = { .type = &nft_redir_ipv6_type, .size = NFT_EXPR_SIZE(sizeof(struct nft_redir)), .eval = nft_redir_eval, .init = nft_redir_init, .destroy = nft_redir_ipv6_destroy, .dump = nft_redir_dump, .validate = nft_redir_validate, .reduce = NFT_REDUCE_READONLY, }; static struct nft_expr_type nft_redir_ipv6_type __read_mostly = { .family = NFPROTO_IPV6, .name = "redir", .ops = &nft_redir_ipv6_ops, .policy = nft_redir_policy, .maxattr = NFTA_REDIR_MAX, .owner = THIS_MODULE, }; #endif #ifdef CONFIG_NF_TABLES_INET static void nft_redir_inet_destroy(const struct nft_ctx *ctx, const struct nft_expr *expr) { nf_ct_netns_put(ctx->net, NFPROTO_INET); } static struct nft_expr_type nft_redir_inet_type; static const struct nft_expr_ops nft_redir_inet_ops = { .type = &nft_redir_inet_type, .size = NFT_EXPR_SIZE(sizeof(struct nft_redir)), .eval = nft_redir_eval, .init = nft_redir_init, .destroy = nft_redir_inet_destroy, .dump = nft_redir_dump, .validate = nft_redir_validate, .reduce = NFT_REDUCE_READONLY, }; static struct nft_expr_type nft_redir_inet_type __read_mostly = { .family = NFPROTO_INET, .name = "redir", .ops = &nft_redir_inet_ops, .policy = nft_redir_policy, .maxattr = NFTA_REDIR_MAX, .owner = THIS_MODULE, }; static int __init nft_redir_module_init_inet(void) { return nft_register_expr(&nft_redir_inet_type); } #else static inline int nft_redir_module_init_inet(void) { return 0; } #endif static int __init nft_redir_module_init(void) { int ret = nft_register_expr(&nft_redir_ipv4_type); if (ret) return ret; #ifdef CONFIG_NF_TABLES_IPV6 ret = nft_register_expr(&nft_redir_ipv6_type); if (ret) { nft_unregister_expr(&nft_redir_ipv4_type); return ret; } #endif ret = nft_redir_module_init_inet(); if (ret < 0) { nft_unregister_expr(&nft_redir_ipv4_type); #ifdef CONFIG_NF_TABLES_IPV6 nft_unregister_expr(&nft_redir_ipv6_type); #endif return ret; } return ret; } static void __exit nft_redir_module_exit(void) { nft_unregister_expr(&nft_redir_ipv4_type); #ifdef CONFIG_NF_TABLES_IPV6 nft_unregister_expr(&nft_redir_ipv6_type); #endif #ifdef CONFIG_NF_TABLES_INET nft_unregister_expr(&nft_redir_inet_type); #endif } module_init(nft_redir_module_init); module_exit(nft_redir_module_exit); MODULE_LICENSE("GPL"); MODULE_AUTHOR("Arturo Borrero Gonzalez <arturo@debian.org>"); MODULE_ALIAS_NFT_EXPR("redir"); MODULE_DESCRIPTION("Netfilter nftables redirect support"); |
6906 135 7089 5 251 209 1 27 1 27 5 1679 373 1147 279 111 41 7351 4 25 34 1119 1984 3 27 | 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60 61 62 63 64 65 66 67 68 69 70 71 72 73 74 75 76 77 78 79 80 81 82 83 84 85 86 87 88 89 90 91 92 93 94 95 96 97 98 99 100 101 102 103 104 105 106 107 108 109 110 111 112 113 114 115 116 117 118 119 120 121 122 123 124 125 126 127 128 129 130 131 132 133 134 135 136 137 138 139 140 141 142 143 144 145 146 147 148 149 150 151 152 153 154 155 156 157 158 159 160 161 162 163 164 165 166 167 168 169 170 171 172 173 174 175 176 177 178 179 180 181 182 183 184 185 186 187 188 189 190 191 192 193 194 195 196 197 198 199 200 201 202 203 204 205 206 207 208 209 210 211 212 213 214 215 216 217 218 219 220 221 222 223 224 225 226 227 228 229 230 231 232 233 234 235 236 237 238 239 240 241 242 243 244 245 246 247 248 249 250 251 252 253 254 255 256 257 258 259 260 261 262 263 264 265 266 267 268 269 270 271 272 273 274 275 276 277 278 279 280 281 282 283 284 285 286 287 288 289 290 291 292 293 294 295 296 297 298 299 300 301 302 303 304 305 306 307 308 309 310 311 312 313 314 315 316 317 318 319 320 321 322 323 324 325 326 327 328 329 330 331 332 333 334 335 336 337 338 339 340 341 342 343 344 345 346 347 348 349 350 351 352 353 354 355 356 357 358 359 360 361 362 363 364 365 366 367 368 369 370 371 372 373 374 375 376 377 378 379 380 381 382 383 384 385 386 387 388 389 390 391 392 393 394 395 396 397 398 399 400 401 402 403 404 405 406 407 408 409 410 411 412 413 414 415 416 417 418 419 420 421 422 423 424 425 426 427 428 429 430 431 432 433 434 435 436 437 438 439 440 441 442 443 444 445 446 447 448 449 450 451 452 453 454 455 456 457 458 459 460 461 462 463 464 465 466 467 468 469 470 471 472 473 474 475 476 477 478 479 480 481 482 483 484 485 486 487 488 489 490 491 492 493 494 495 496 497 498 499 500 501 502 503 504 505 506 507 508 509 510 511 512 513 514 515 516 517 518 519 520 521 522 523 524 525 526 527 528 529 530 531 532 533 534 535 536 537 538 539 540 541 542 543 544 545 546 547 548 549 550 551 552 553 554 555 556 557 558 559 560 561 562 563 564 565 566 567 568 569 570 571 572 573 574 575 576 577 578 579 580 581 582 583 584 585 586 587 588 589 590 591 592 593 594 595 596 597 598 599 600 601 602 603 604 605 606 607 608 609 610 611 612 613 614 615 616 617 618 619 620 621 622 623 624 625 626 627 628 629 630 631 632 633 634 635 636 637 638 639 640 641 642 643 644 645 646 647 648 649 650 651 652 653 654 655 656 657 658 659 660 661 662 663 664 665 666 667 668 669 670 671 672 673 674 675 676 677 678 679 680 681 682 683 684 685 686 687 688 689 690 691 692 693 694 695 696 697 698 699 700 701 702 703 704 705 706 707 708 709 710 711 712 713 714 715 716 717 718 719 720 721 722 723 724 725 726 727 728 729 730 731 732 733 734 735 736 737 738 739 740 741 742 743 744 745 746 747 748 749 750 751 752 753 754 755 756 757 758 759 760 761 762 763 764 765 766 767 768 769 770 771 772 773 774 775 776 777 778 779 780 781 782 783 784 785 786 787 788 789 790 791 792 793 794 795 796 797 798 799 800 801 802 803 804 805 806 807 808 809 810 811 812 813 814 815 816 817 818 819 820 821 822 823 824 825 826 827 828 829 830 831 832 833 834 835 836 837 838 839 840 841 842 843 844 845 846 847 848 849 850 851 852 853 854 855 856 857 858 859 860 861 862 863 864 865 866 867 868 869 870 871 872 873 874 875 876 877 878 879 880 881 882 883 884 885 886 887 888 889 890 891 892 893 894 895 896 897 898 899 900 901 902 903 904 905 906 907 908 909 910 911 912 913 914 915 916 917 918 919 920 921 922 923 924 925 926 927 928 929 930 931 932 933 934 935 936 937 938 939 940 941 942 943 944 945 946 947 948 949 950 951 952 953 954 955 956 957 958 959 960 961 962 963 964 965 966 967 968 969 970 971 972 973 974 975 976 977 978 979 980 981 982 983 984 985 986 987 988 989 990 991 992 993 994 995 996 997 998 999 1000 1001 1002 1003 1004 1005 1006 1007 1008 1009 1010 1011 1012 1013 1014 1015 1016 1017 1018 1019 1020 1021 1022 1023 1024 1025 1026 1027 1028 1029 1030 1031 1032 1033 1034 1035 1036 1037 1038 1039 1040 1041 1042 1043 1044 1045 1046 1047 1048 1049 1050 1051 1052 1053 1054 1055 1056 1057 1058 1059 1060 1061 1062 1063 1064 1065 1066 1067 1068 1069 1070 1071 1072 1073 1074 1075 1076 1077 1078 1079 1080 1081 1082 1083 1084 1085 1086 1087 1088 1089 1090 1091 1092 1093 1094 1095 1096 1097 1098 1099 1100 1101 1102 1103 1104 1105 1106 1107 1108 1109 1110 1111 1112 1113 1114 1115 1116 1117 1118 1119 1120 1121 1122 1123 1124 1125 1126 1127 1128 1129 1130 1131 1132 1133 1134 1135 1136 1137 1138 1139 1140 1141 1142 1143 1144 1145 1146 1147 1148 1149 1150 1151 1152 1153 1154 1155 1156 1157 1158 1159 1160 1161 1162 1163 1164 1165 1166 1167 1168 1169 1170 1171 1172 1173 1174 1175 1176 1177 1178 1179 1180 1181 1182 1183 1184 1185 1186 1187 1188 1189 1190 1191 1192 1193 1194 1195 1196 1197 1198 1199 1200 1201 1202 1203 1204 1205 1206 1207 1208 1209 1210 1211 1212 1213 1214 1215 1216 1217 1218 1219 1220 1221 1222 1223 1224 1225 1226 1227 1228 1229 1230 1231 1232 1233 1234 1235 1236 1237 1238 1239 1240 1241 1242 1243 1244 1245 1246 1247 1248 1249 1250 1251 1252 1253 1254 1255 1256 1257 1258 1259 1260 1261 1262 1263 1264 1265 1266 1267 1268 1269 1270 1271 1272 1273 1274 1275 1276 1277 1278 1279 1280 1281 1282 1283 1284 1285 1286 1287 1288 1289 1290 1291 1292 1293 1294 1295 1296 1297 1298 1299 1300 1301 1302 1303 1304 1305 1306 1307 1308 1309 1310 1311 1312 1313 1314 1315 1316 1317 1318 1319 1320 1321 1322 1323 1324 1325 1326 1327 1328 1329 1330 1331 1332 1333 1334 1335 1336 1337 1338 1339 1340 1341 1342 1343 1344 1345 1346 1347 1348 1349 1350 1351 1352 1353 1354 1355 1356 1357 1358 1359 1360 1361 1362 1363 1364 1365 1366 1367 1368 1369 1370 1371 1372 1373 1374 1375 1376 1377 1378 1379 1380 1381 1382 1383 1384 1385 1386 1387 1388 1389 1390 1391 1392 1393 1394 1395 1396 1397 1398 1399 1400 1401 1402 1403 1404 1405 1406 1407 1408 1409 1410 1411 1412 1413 1414 1415 1416 1417 1418 1419 1420 1421 1422 1423 1424 1425 1426 1427 1428 1429 1430 1431 1432 1433 1434 1435 1436 1437 1438 1439 1440 1441 1442 1443 1444 1445 1446 1447 1448 1449 1450 1451 1452 1453 1454 1455 1456 1457 1458 1459 1460 1461 1462 1463 1464 1465 1466 1467 1468 1469 1470 1471 1472 1473 1474 1475 1476 1477 1478 1479 1480 1481 1482 1483 1484 1485 1486 1487 1488 1489 1490 1491 1492 1493 1494 1495 1496 1497 1498 1499 1500 1501 1502 1503 1504 1505 1506 1507 1508 1509 1510 1511 1512 1513 1514 1515 1516 1517 1518 1519 1520 1521 1522 1523 1524 1525 1526 1527 1528 1529 1530 1531 1532 1533 1534 1535 1536 1537 1538 1539 1540 1541 1542 1543 1544 1545 1546 1547 1548 1549 1550 1551 1552 1553 1554 1555 1556 1557 1558 1559 1560 1561 1562 1563 1564 1565 1566 1567 1568 1569 1570 1571 1572 1573 1574 1575 1576 1577 1578 1579 1580 1581 | /* SPDX-License-Identifier: GPL-2.0 */ /* * Linux Socket Filter Data Structures */ #ifndef __LINUX_FILTER_H__ #define __LINUX_FILTER_H__ #include <linux/atomic.h> #include <linux/bpf.h> #include <linux/refcount.h> #include <linux/compat.h> #include <linux/skbuff.h> #include <linux/linkage.h> #include <linux/printk.h> #include <linux/workqueue.h> #include <linux/sched.h> #include <linux/sched/clock.h> #include <linux/capability.h> #include <linux/set_memory.h> #include <linux/kallsyms.h> #include <linux/if_vlan.h> #include <linux/vmalloc.h> #include <linux/sockptr.h> #include <crypto/sha1.h> #include <linux/u64_stats_sync.h> #include <net/sch_generic.h> #include <asm/byteorder.h> #include <uapi/linux/filter.h> struct sk_buff; struct sock; struct seccomp_data; struct bpf_prog_aux; struct xdp_rxq_info; struct xdp_buff; struct sock_reuseport; struct ctl_table; struct ctl_table_header; /* ArgX, context and stack frame pointer register positions. Note, * Arg1, Arg2, Arg3, etc are used as argument mappings of function * calls in BPF_CALL instruction. */ #define BPF_REG_ARG1 BPF_REG_1 #define BPF_REG_ARG2 BPF_REG_2 #define BPF_REG_ARG3 BPF_REG_3 #define BPF_REG_ARG4 BPF_REG_4 #define BPF_REG_ARG5 BPF_REG_5 #define BPF_REG_CTX BPF_REG_6 #define BPF_REG_FP BPF_REG_10 /* Additional register mappings for converted user programs. */ #define BPF_REG_A BPF_REG_0 #define BPF_REG_X BPF_REG_7 #define BPF_REG_TMP BPF_REG_2 /* scratch reg */ #define BPF_REG_D BPF_REG_8 /* data, callee-saved */ #define BPF_REG_H BPF_REG_9 /* hlen, callee-saved */ /* Kernel hidden auxiliary/helper register. */ #define BPF_REG_AX MAX_BPF_REG #define MAX_BPF_EXT_REG (MAX_BPF_REG + 1) #define MAX_BPF_JIT_REG MAX_BPF_EXT_REG /* unused opcode to mark special call to bpf_tail_call() helper */ #define BPF_TAIL_CALL 0xf0 /* unused opcode to mark special load instruction. Same as BPF_ABS */ #define BPF_PROBE_MEM 0x20 /* unused opcode to mark special ldsx instruction. Same as BPF_IND */ #define BPF_PROBE_MEMSX 0x40 /* unused opcode to mark call to interpreter with arguments */ #define BPF_CALL_ARGS 0xe0 /* unused opcode to mark speculation barrier for mitigating * Speculative Store Bypass */ #define BPF_NOSPEC 0xc0 /* As per nm, we expose JITed images as text (code) section for * kallsyms. That way, tools like perf can find it to match * addresses. */ #define BPF_SYM_ELF_TYPE 't' /* BPF program can access up to 512 bytes of stack space. */ #define MAX_BPF_STACK 512 /* Helper macros for filter block array initializers. */ /* ALU ops on registers, bpf_add|sub|...: dst_reg += src_reg */ #define BPF_ALU64_REG_OFF(OP, DST, SRC, OFF) \ ((struct bpf_insn) { \ .code = BPF_ALU64 | BPF_OP(OP) | BPF_X, \ .dst_reg = DST, \ .src_reg = SRC, \ .off = OFF, \ .imm = 0 }) #define BPF_ALU64_REG(OP, DST, SRC) \ BPF_ALU64_REG_OFF(OP, DST, SRC, 0) #define BPF_ALU32_REG_OFF(OP, DST, SRC, OFF) \ ((struct bpf_insn) { \ .code = BPF_ALU | BPF_OP(OP) | BPF_X, \ .dst_reg = DST, \ .src_reg = SRC, \ .off = OFF, \ .imm = 0 }) #define BPF_ALU32_REG(OP, DST, SRC) \ BPF_ALU32_REG_OFF(OP, DST, SRC, 0) /* ALU ops on immediates, bpf_add|sub|...: dst_reg += imm32 */ #define BPF_ALU64_IMM(OP, DST, IMM) \ ((struct bpf_insn) { \ .code = BPF_ALU64 | BPF_OP(OP) | BPF_K, \ .dst_reg = DST, \ .src_reg = 0, \ .off = 0, \ .imm = IMM }) #define BPF_ALU32_IMM(OP, DST, IMM) \ ((struct bpf_insn) { \ .code = BPF_ALU | BPF_OP(OP) | BPF_K, \ .dst_reg = DST, \ .src_reg = 0, \ .off = 0, \ .imm = IMM }) /* Endianess conversion, cpu_to_{l,b}e(), {l,b}e_to_cpu() */ #define BPF_ENDIAN(TYPE, DST, LEN) \ ((struct bpf_insn) { \ .code = BPF_ALU | BPF_END | BPF_SRC(TYPE), \ .dst_reg = DST, \ .src_reg = 0, \ .off = 0, \ .imm = LEN }) /* Short form of mov, dst_reg = src_reg */ #define BPF_MOV64_REG(DST, SRC) \ ((struct bpf_insn) { \ .code = BPF_ALU64 | BPF_MOV | BPF_X, \ .dst_reg = DST, \ .src_reg = SRC, \ .off = 0, \ .imm = 0 }) #define BPF_MOV32_REG(DST, SRC) \ ((struct bpf_insn) { \ .code = BPF_ALU | BPF_MOV | BPF_X, \ .dst_reg = DST, \ .src_reg = SRC, \ .off = 0, \ .imm = 0 }) /* Short form of mov, dst_reg = imm32 */ #define BPF_MOV64_IMM(DST, IMM) \ ((struct bpf_insn) { \ .code = BPF_ALU64 | BPF_MOV | BPF_K, \ .dst_reg = DST, \ .src_reg = 0, \ .off = 0, \ .imm = IMM }) #define BPF_MOV32_IMM(DST, IMM) \ ((struct bpf_insn) { \ .code = BPF_ALU | BPF_MOV | BPF_K, \ .dst_reg = DST, \ .src_reg = 0, \ .off = 0, \ .imm = IMM }) /* Special form of mov32, used for doing explicit zero extension on dst. */ #define BPF_ZEXT_REG(DST) \ ((struct bpf_insn) { \ .code = BPF_ALU | BPF_MOV | BPF_X, \ .dst_reg = DST, \ .src_reg = DST, \ .off = 0, \ .imm = 1 }) static inline bool insn_is_zext(const struct bpf_insn *insn) { return insn->code == (BPF_ALU | BPF_MOV | BPF_X) && insn->imm == 1; } /* BPF_LD_IMM64 macro encodes single 'load 64-bit immediate' insn */ #define BPF_LD_IMM64(DST, IMM) \ BPF_LD_IMM64_RAW(DST, 0, IMM) #define BPF_LD_IMM64_RAW(DST, SRC, IMM) \ ((struct bpf_insn) { \ .code = BPF_LD | BPF_DW | BPF_IMM, \ .dst_reg = DST, \ .src_reg = SRC, \ .off = 0, \ .imm = (__u32) (IMM) }), \ ((struct bpf_insn) { \ .code = 0, /* zero is reserved opcode */ \ .dst_reg = 0, \ .src_reg = 0, \ .off = 0, \ .imm = ((__u64) (IMM)) >> 32 }) /* pseudo BPF_LD_IMM64 insn used to refer to process-local map_fd */ #define BPF_LD_MAP_FD(DST, MAP_FD) \ BPF_LD_IMM64_RAW(DST, BPF_PSEUDO_MAP_FD, MAP_FD) /* Short form of mov based on type, BPF_X: dst_reg = src_reg, BPF_K: dst_reg = imm32 */ #define BPF_MOV64_RAW(TYPE, DST, SRC, IMM) \ ((struct bpf_insn) { \ .code = BPF_ALU64 | BPF_MOV | BPF_SRC(TYPE), \ .dst_reg = DST, \ .src_reg = SRC, \ .off = 0, \ .imm = IMM }) #define BPF_MOV32_RAW(TYPE, DST, SRC, IMM) \ ((struct bpf_insn) { \ .code = BPF_ALU | BPF_MOV | BPF_SRC(TYPE), \ .dst_reg = DST, \ .src_reg = SRC, \ .off = 0, \ .imm = IMM }) /* Direct packet access, R0 = *(uint *) (skb->data + imm32) */ #define BPF_LD_ABS(SIZE, IMM) \ ((struct bpf_insn) { \ .code = BPF_LD | BPF_SIZE(SIZE) | BPF_ABS, \ .dst_reg = 0, \ .src_reg = 0, \ .off = 0, \ .imm = IMM }) /* Indirect packet access, R0 = *(uint *) (skb->data + src_reg + imm32) */ #define BPF_LD_IND(SIZE, SRC, IMM) \ ((struct bpf_insn) { \ .code = BPF_LD | BPF_SIZE(SIZE) | BPF_IND, \ .dst_reg = 0, \ .src_reg = SRC, \ .off = 0, \ .imm = IMM }) /* Memory load, dst_reg = *(uint *) (src_reg + off16) */ #define BPF_LDX_MEM(SIZE, DST, SRC, OFF) \ ((struct bpf_insn) { \ .code = BPF_LDX | BPF_SIZE(SIZE) | BPF_MEM, \ .dst_reg = DST, \ .src_reg = SRC, \ .off = OFF, \ .imm = 0 }) /* Memory store, *(uint *) (dst_reg + off16) = src_reg */ #define BPF_STX_MEM(SIZE, DST, SRC, OFF) \ ((struct bpf_insn) { \ .code = BPF_STX | BPF_SIZE(SIZE) | BPF_MEM, \ .dst_reg = DST, \ .src_reg = SRC, \ .off = OFF, \ .imm = 0 }) /* * Atomic operations: * * BPF_ADD *(uint *) (dst_reg + off16) += src_reg * BPF_AND *(uint *) (dst_reg + off16) &= src_reg * BPF_OR *(uint *) (dst_reg + off16) |= src_reg * BPF_XOR *(uint *) (dst_reg + off16) ^= src_reg * BPF_ADD | BPF_FETCH src_reg = atomic_fetch_add(dst_reg + off16, src_reg); * BPF_AND | BPF_FETCH src_reg = atomic_fetch_and(dst_reg + off16, src_reg); * BPF_OR | BPF_FETCH src_reg = atomic_fetch_or(dst_reg + off16, src_reg); * BPF_XOR | BPF_FETCH src_reg = atomic_fetch_xor(dst_reg + off16, src_reg); * BPF_XCHG src_reg = atomic_xchg(dst_reg + off16, src_reg) * BPF_CMPXCHG r0 = atomic_cmpxchg(dst_reg + off16, r0, src_reg) */ #define BPF_ATOMIC_OP(SIZE, OP, DST, SRC, OFF) \ ((struct bpf_insn) { \ .code = BPF_STX | BPF_SIZE(SIZE) | BPF_ATOMIC, \ .dst_reg = DST, \ .src_reg = SRC, \ .off = OFF, \ .imm = OP }) /* Legacy alias */ #define BPF_STX_XADD(SIZE, DST, SRC, OFF) BPF_ATOMIC_OP(SIZE, BPF_ADD, DST, SRC, OFF) /* Memory store, *(uint *) (dst_reg + off16) = imm32 */ #define BPF_ST_MEM(SIZE, DST, OFF, IMM) \ ((struct bpf_insn) { \ .code = BPF_ST | BPF_SIZE(SIZE) | BPF_MEM, \ .dst_reg = DST, \ .src_reg = 0, \ .off = OFF, \ .imm = IMM }) /* Conditional jumps against registers, if (dst_reg 'op' src_reg) goto pc + off16 */ #define BPF_JMP_REG(OP, DST, SRC, OFF) \ ((struct bpf_insn) { \ .code = BPF_JMP | BPF_OP(OP) | BPF_X, \ .dst_reg = DST, \ .src_reg = SRC, \ .off = OFF, \ .imm = 0 }) /* Conditional jumps against immediates, if (dst_reg 'op' imm32) goto pc + off16 */ #define BPF_JMP_IMM(OP, DST, IMM, OFF) \ ((struct bpf_insn) { \ .code = BPF_JMP | BPF_OP(OP) | BPF_K, \ .dst_reg = DST, \ .src_reg = 0, \ .off = OFF, \ .imm = IMM }) /* Like BPF_JMP_REG, but with 32-bit wide operands for comparison. */ #define BPF_JMP32_REG(OP, DST, SRC, OFF) \ ((struct bpf_insn) { \ .code = BPF_JMP32 | BPF_OP(OP) | BPF_X, \ .dst_reg = DST, \ .src_reg = SRC, \ .off = OFF, \ .imm = 0 }) /* Like BPF_JMP_IMM, but with 32-bit wide operands for comparison. */ #define BPF_JMP32_IMM(OP, DST, IMM, OFF) \ ((struct bpf_insn) { \ .code = BPF_JMP32 | BPF_OP(OP) | BPF_K, \ .dst_reg = DST, \ .src_reg = 0, \ .off = OFF, \ .imm = IMM }) /* Unconditional jumps, goto pc + off16 */ #define BPF_JMP_A(OFF) \ ((struct bpf_insn) { \ .code = BPF_JMP | BPF_JA, \ .dst_reg = 0, \ .src_reg = 0, \ .off = OFF, \ .imm = 0 }) /* Relative call */ #define BPF_CALL_REL(TGT) \ ((struct bpf_insn) { \ .code = BPF_JMP | BPF_CALL, \ .dst_reg = 0, \ .src_reg = BPF_PSEUDO_CALL, \ .off = 0, \ .imm = TGT }) /* Convert function address to BPF immediate */ #define BPF_CALL_IMM(x) ((void *)(x) - (void *)__bpf_call_base) #define BPF_EMIT_CALL(FUNC) \ ((struct bpf_insn) { \ .code = BPF_JMP | BPF_CALL, \ .dst_reg = 0, \ .src_reg = 0, \ .off = 0, \ .imm = BPF_CALL_IMM(FUNC) }) /* Raw code statement block */ #define BPF_RAW_INSN(CODE, DST, SRC, OFF, IMM) \ ((struct bpf_insn) { \ .code = CODE, \ .dst_reg = DST, \ .src_reg = SRC, \ .off = OFF, \ .imm = IMM }) /* Program exit */ #define BPF_EXIT_INSN() \ ((struct bpf_insn) { \ .code = BPF_JMP | BPF_EXIT, \ .dst_reg = 0, \ .src_reg = 0, \ .off = 0, \ .imm = 0 }) /* Speculation barrier */ #define BPF_ST_NOSPEC() \ ((struct bpf_insn) { \ .code = BPF_ST | BPF_NOSPEC, \ .dst_reg = 0, \ .src_reg = 0, \ .off = 0, \ .imm = 0 }) /* Internal classic blocks for direct assignment */ #define __BPF_STMT(CODE, K) \ ((struct sock_filter) BPF_STMT(CODE, K)) #define __BPF_JUMP(CODE, K, JT, JF) \ ((struct sock_filter) BPF_JUMP(CODE, K, JT, JF)) #define bytes_to_bpf_size(bytes) \ ({ \ int bpf_size = -EINVAL; \ \ if (bytes == sizeof(u8)) \ bpf_size = BPF_B; \ else if (bytes == sizeof(u16)) \ bpf_size = BPF_H; \ else if (bytes == sizeof(u32)) \ bpf_size = BPF_W; \ else if (bytes == sizeof(u64)) \ bpf_size = BPF_DW; \ \ bpf_size; \ }) #define bpf_size_to_bytes(bpf_size) \ ({ \ int bytes = -EINVAL; \ \ if (bpf_size == BPF_B) \ bytes = sizeof(u8); \ else if (bpf_size == BPF_H) \ bytes = sizeof(u16); \ else if (bpf_size == BPF_W) \ bytes = sizeof(u32); \ else if (bpf_size == BPF_DW) \ bytes = sizeof(u64); \ \ bytes; \ }) #define BPF_SIZEOF(type) \ ({ \ const int __size = bytes_to_bpf_size(sizeof(type)); \ BUILD_BUG_ON(__size < 0); \ __size; \ }) #define BPF_FIELD_SIZEOF(type, field) \ ({ \ const int __size = bytes_to_bpf_size(sizeof_field(type, field)); \ BUILD_BUG_ON(__size < 0); \ __size; \ }) #define BPF_LDST_BYTES(insn) \ ({ \ const int __size = bpf_size_to_bytes(BPF_SIZE((insn)->code)); \ WARN_ON(__size < 0); \ __size; \ }) #define __BPF_MAP_0(m, v, ...) v #define __BPF_MAP_1(m, v, t, a, ...) m(t, a) #define __BPF_MAP_2(m, v, t, a, ...) m(t, a), __BPF_MAP_1(m, v, __VA_ARGS__) #define __BPF_MAP_3(m, v, t, a, ...) m(t, a), __BPF_MAP_2(m, v, __VA_ARGS__) #define __BPF_MAP_4(m, v, t, a, ...) m(t, a), __BPF_MAP_3(m, v, __VA_ARGS__) #define __BPF_MAP_5(m, v, t, a, ...) m(t, a), __BPF_MAP_4(m, v, __VA_ARGS__) #define __BPF_REG_0(...) __BPF_PAD(5) #define __BPF_REG_1(...) __BPF_MAP(1, __VA_ARGS__), __BPF_PAD(4) #define __BPF_REG_2(...) __BPF_MAP(2, __VA_ARGS__), __BPF_PAD(3) #define __BPF_REG_3(...) __BPF_MAP(3, __VA_ARGS__), __BPF_PAD(2) #define __BPF_REG_4(...) __BPF_MAP(4, __VA_ARGS__), __BPF_PAD(1) #define __BPF_REG_5(...) __BPF_MAP(5, __VA_ARGS__) #define __BPF_MAP(n, ...) __BPF_MAP_##n(__VA_ARGS__) #define __BPF_REG(n, ...) __BPF_REG_##n(__VA_ARGS__) #define __BPF_CAST(t, a) \ (__force t) \ (__force \ typeof(__builtin_choose_expr(sizeof(t) == sizeof(unsigned long), \ (unsigned long)0, (t)0))) a #define __BPF_V void #define __BPF_N #define __BPF_DECL_ARGS(t, a) t a #define __BPF_DECL_REGS(t, a) u64 a #define __BPF_PAD(n) \ __BPF_MAP(n, __BPF_DECL_ARGS, __BPF_N, u64, __ur_1, u64, __ur_2, \ u64, __ur_3, u64, __ur_4, u64, __ur_5) #define BPF_CALL_x(x, name, ...) \ static __always_inline \ u64 ____##name(__BPF_MAP(x, __BPF_DECL_ARGS, __BPF_V, __VA_ARGS__)); \ typedef u64 (*btf_##name)(__BPF_MAP(x, __BPF_DECL_ARGS, __BPF_V, __VA_ARGS__)); \ u64 name(__BPF_REG(x, __BPF_DECL_REGS, __BPF_N, __VA_ARGS__)); \ u64 name(__BPF_REG(x, __BPF_DECL_REGS, __BPF_N, __VA_ARGS__)) \ { \ return ((btf_##name)____##name)(__BPF_MAP(x,__BPF_CAST,__BPF_N,__VA_ARGS__));\ } \ static __always_inline \ u64 ____##name(__BPF_MAP(x, __BPF_DECL_ARGS, __BPF_V, __VA_ARGS__)) #define BPF_CALL_0(name, ...) BPF_CALL_x(0, name, __VA_ARGS__) #define BPF_CALL_1(name, ...) BPF_CALL_x(1, name, __VA_ARGS__) #define BPF_CALL_2(name, ...) BPF_CALL_x(2, name, __VA_ARGS__) #define BPF_CALL_3(name, ...) BPF_CALL_x(3, name, __VA_ARGS__) #define BPF_CALL_4(name, ...) BPF_CALL_x(4, name, __VA_ARGS__) #define BPF_CALL_5(name, ...) BPF_CALL_x(5, name, __VA_ARGS__) #define bpf_ctx_range(TYPE, MEMBER) \ offsetof(TYPE, MEMBER) ... offsetofend(TYPE, MEMBER) - 1 #define bpf_ctx_range_till(TYPE, MEMBER1, MEMBER2) \ offsetof(TYPE, MEMBER1) ... offsetofend(TYPE, MEMBER2) - 1 #if BITS_PER_LONG == 64 # define bpf_ctx_range_ptr(TYPE, MEMBER) \ offsetof(TYPE, MEMBER) ... offsetofend(TYPE, MEMBER) - 1 #else # define bpf_ctx_range_ptr(TYPE, MEMBER) \ offsetof(TYPE, MEMBER) ... offsetof(TYPE, MEMBER) + 8 - 1 #endif /* BITS_PER_LONG == 64 */ #define bpf_target_off(TYPE, MEMBER, SIZE, PTR_SIZE) \ ({ \ BUILD_BUG_ON(sizeof_field(TYPE, MEMBER) != (SIZE)); \ *(PTR_SIZE) = (SIZE); \ offsetof(TYPE, MEMBER); \ }) /* A struct sock_filter is architecture independent. */ struct compat_sock_fprog { u16 len; compat_uptr_t filter; /* struct sock_filter * */ }; struct sock_fprog_kern { u16 len; struct sock_filter *filter; }; /* Some arches need doubleword alignment for their instructions and/or data */ #define BPF_IMAGE_ALIGNMENT 8 struct bpf_binary_header { u32 size; u8 image[] __aligned(BPF_IMAGE_ALIGNMENT); }; struct bpf_prog_stats { u64_stats_t cnt; u64_stats_t nsecs; u64_stats_t misses; struct u64_stats_sync syncp; } __aligned(2 * sizeof(u64)); struct sk_filter { refcount_t refcnt; struct rcu_head rcu; struct bpf_prog *prog; }; DECLARE_STATIC_KEY_FALSE(bpf_stats_enabled_key); extern struct mutex nf_conn_btf_access_lock; extern int (*nfct_btf_struct_access)(struct bpf_verifier_log *log, const struct bpf_reg_state *reg, int off, int size); typedef unsigned int (*bpf_dispatcher_fn)(const void *ctx, const struct bpf_insn *insnsi, unsigned int (*bpf_func)(const void *, const struct bpf_insn *)); static __always_inline u32 __bpf_prog_run(const struct bpf_prog *prog, const void *ctx, bpf_dispatcher_fn dfunc) { u32 ret; cant_migrate(); if (static_branch_unlikely(&bpf_stats_enabled_key)) { struct bpf_prog_stats *stats; u64 start = sched_clock(); unsigned long flags; ret = dfunc(ctx, prog->insnsi, prog->bpf_func); stats = this_cpu_ptr(prog->stats); flags = u64_stats_update_begin_irqsave(&stats->syncp); u64_stats_inc(&stats->cnt); u64_stats_add(&stats->nsecs, sched_clock() - start); u64_stats_update_end_irqrestore(&stats->syncp, flags); } else { ret = dfunc(ctx, prog->insnsi, prog->bpf_func); } return ret; } static __always_inline u32 bpf_prog_run(const struct bpf_prog *prog, const void *ctx) { return __bpf_prog_run(prog, ctx, bpf_dispatcher_nop_func); } /* * Use in preemptible and therefore migratable context to make sure that * the execution of the BPF program runs on one CPU. * * This uses migrate_disable/enable() explicitly to document that the * invocation of a BPF program does not require reentrancy protection * against a BPF program which is invoked from a preempting task. */ static inline u32 bpf_prog_run_pin_on_cpu(const struct bpf_prog *prog, const void *ctx) { u32 ret; migrate_disable(); ret = bpf_prog_run(prog, ctx); migrate_enable(); return ret; } #define BPF_SKB_CB_LEN QDISC_CB_PRIV_LEN struct bpf_skb_data_end { struct qdisc_skb_cb qdisc_cb; void *data_meta; void *data_end; }; struct bpf_nh_params { u32 nh_family; union { u32 ipv4_nh; struct in6_addr ipv6_nh; }; }; struct bpf_redirect_info { u64 tgt_index; void *tgt_value; struct bpf_map *map; u32 flags; u32 kern_flags; u32 map_id; enum bpf_map_type map_type; struct bpf_nh_params nh; }; DECLARE_PER_CPU(struct bpf_redirect_info, bpf_redirect_info); /* flags for bpf_redirect_info kern_flags */ #define BPF_RI_F_RF_NO_DIRECT BIT(0) /* no napi_direct on return_frame */ /* Compute the linear packet data range [data, data_end) which * will be accessed by various program types (cls_bpf, act_bpf, * lwt, ...). Subsystems allowing direct data access must (!) * ensure that cb[] area can be written to when BPF program is * invoked (otherwise cb[] save/restore is necessary). */ static inline void bpf_compute_data_pointers(struct sk_buff *skb) { struct bpf_skb_data_end *cb = (struct bpf_skb_data_end *)skb->cb; BUILD_BUG_ON(sizeof(*cb) > sizeof_field(struct sk_buff, cb)); cb->data_meta = skb->data - skb_metadata_len(skb); cb->data_end = skb->data + skb_headlen(skb); } /* Similar to bpf_compute_data_pointers(), except that save orginal * data in cb->data and cb->meta_data for restore. */ static inline void bpf_compute_and_save_data_end( struct sk_buff *skb, void **saved_data_end) { struct bpf_skb_data_end *cb = (struct bpf_skb_data_end *)skb->cb; *saved_data_end = cb->data_end; cb->data_end = skb->data + skb_headlen(skb); } /* Restore data saved by bpf_compute_data_pointers(). */ static inline void bpf_restore_data_end( struct sk_buff *skb, void *saved_data_end) { struct bpf_skb_data_end *cb = (struct bpf_skb_data_end *)skb->cb; cb->data_end = saved_data_end; } static inline u8 *bpf_skb_cb(const struct sk_buff *skb) { /* eBPF programs may read/write skb->cb[] area to transfer meta * data between tail calls. Since this also needs to work with * tc, that scratch memory is mapped to qdisc_skb_cb's data area. * * In some socket filter cases, the cb unfortunately needs to be * saved/restored so that protocol specific skb->cb[] data won't * be lost. In any case, due to unpriviledged eBPF programs * attached to sockets, we need to clear the bpf_skb_cb() area * to not leak previous contents to user space. */ BUILD_BUG_ON(sizeof_field(struct __sk_buff, cb) != BPF_SKB_CB_LEN); BUILD_BUG_ON(sizeof_field(struct __sk_buff, cb) != sizeof_field(struct qdisc_skb_cb, data)); return qdisc_skb_cb(skb)->data; } /* Must be invoked with migration disabled */ static inline u32 __bpf_prog_run_save_cb(const struct bpf_prog *prog, const void *ctx) { const struct sk_buff *skb = ctx; u8 *cb_data = bpf_skb_cb(skb); u8 cb_saved[BPF_SKB_CB_LEN]; u32 res; if (unlikely(prog->cb_access)) { memcpy(cb_saved, cb_data, sizeof(cb_saved)); memset(cb_data, 0, sizeof(cb_saved)); } res = bpf_prog_run(prog, skb); if (unlikely(prog->cb_access)) memcpy(cb_data, cb_saved, sizeof(cb_saved)); return res; } static inline u32 bpf_prog_run_save_cb(const struct bpf_prog *prog, struct sk_buff *skb) { u32 res; migrate_disable(); res = __bpf_prog_run_save_cb(prog, skb); migrate_enable(); return res; } static inline u32 bpf_prog_run_clear_cb(const struct bpf_prog *prog, struct sk_buff *skb) { u8 *cb_data = bpf_skb_cb(skb); u32 res; if (unlikely(prog->cb_access)) memset(cb_data, 0, BPF_SKB_CB_LEN); res = bpf_prog_run_pin_on_cpu(prog, skb); return res; } DECLARE_BPF_DISPATCHER(xdp) DECLARE_STATIC_KEY_FALSE(bpf_master_redirect_enabled_key); u32 xdp_master_redirect(struct xdp_buff *xdp); void bpf_prog_change_xdp(struct bpf_prog *prev_prog, struct bpf_prog *prog); static inline u32 bpf_prog_insn_size(const struct bpf_prog *prog) { return prog->len * sizeof(struct bpf_insn); } static inline u32 bpf_prog_tag_scratch_size(const struct bpf_prog *prog) { return round_up(bpf_prog_insn_size(prog) + sizeof(__be64) + 1, SHA1_BLOCK_SIZE); } static inline unsigned int bpf_prog_size(unsigned int proglen) { return max(sizeof(struct bpf_prog), offsetof(struct bpf_prog, insns[proglen])); } static inline bool bpf_prog_was_classic(const struct bpf_prog *prog) { /* When classic BPF programs have been loaded and the arch * does not have a classic BPF JIT (anymore), they have been * converted via bpf_migrate_filter() to eBPF and thus always * have an unspec program type. */ return prog->type == BPF_PROG_TYPE_UNSPEC; } static inline u32 bpf_ctx_off_adjust_machine(u32 size) { const u32 size_machine = sizeof(unsigned long); if (size > size_machine && size % size_machine == 0) size = size_machine; return size; } static inline bool bpf_ctx_narrow_access_ok(u32 off, u32 size, u32 size_default) { return size <= size_default && (size & (size - 1)) == 0; } static inline u8 bpf_ctx_narrow_access_offset(u32 off, u32 size, u32 size_default) { u8 access_off = off & (size_default - 1); #ifdef __LITTLE_ENDIAN return access_off; #else return size_default - (access_off + size); #endif } #define bpf_ctx_wide_access_ok(off, size, type, field) \ (size == sizeof(__u64) && \ off >= offsetof(type, field) && \ off + sizeof(__u64) <= offsetofend(type, field) && \ off % sizeof(__u64) == 0) #define bpf_classic_proglen(fprog) (fprog->len * sizeof(fprog->filter[0])) static inline void bpf_prog_lock_ro(struct bpf_prog *fp) { #ifndef CONFIG_BPF_JIT_ALWAYS_ON if (!fp->jited) { set_vm_flush_reset_perms(fp); set_memory_ro((unsigned long)fp, fp->pages); } #endif } static inline void bpf_jit_binary_lock_ro(struct bpf_binary_header *hdr) { set_vm_flush_reset_perms(hdr); set_memory_rox((unsigned long)hdr, hdr->size >> PAGE_SHIFT); } int sk_filter_trim_cap(struct sock *sk, struct sk_buff *skb, unsigned int cap); static inline int sk_filter(struct sock *sk, struct sk_buff *skb) { return sk_filter_trim_cap(sk, skb, 1); } struct bpf_prog *bpf_prog_select_runtime(struct bpf_prog *fp, int *err); void bpf_prog_free(struct bpf_prog *fp); bool bpf_opcode_in_insntable(u8 code); void bpf_prog_fill_jited_linfo(struct bpf_prog *prog, const u32 *insn_to_jit_off); int bpf_prog_alloc_jited_linfo(struct bpf_prog *prog); void bpf_prog_jit_attempt_done(struct bpf_prog *prog); struct bpf_prog *bpf_prog_alloc(unsigned int size, gfp_t gfp_extra_flags); struct bpf_prog *bpf_prog_alloc_no_stats(unsigned int size, gfp_t gfp_extra_flags); struct bpf_prog *bpf_prog_realloc(struct bpf_prog *fp_old, unsigned int size, gfp_t gfp_extra_flags); void __bpf_prog_free(struct bpf_prog *fp); static inline void bpf_prog_unlock_free(struct bpf_prog *fp) { __bpf_prog_free(fp); } typedef int (*bpf_aux_classic_check_t)(struct sock_filter *filter, unsigned int flen); int bpf_prog_create(struct bpf_prog **pfp, struct sock_fprog_kern *fprog); int bpf_prog_create_from_user(struct bpf_prog **pfp, struct sock_fprog *fprog, bpf_aux_classic_check_t trans, bool save_orig); void bpf_prog_destroy(struct bpf_prog *fp); int sk_attach_filter(struct sock_fprog *fprog, struct sock *sk); int sk_attach_bpf(u32 ufd, struct sock *sk); int sk_reuseport_attach_filter(struct sock_fprog *fprog, struct sock *sk); int sk_reuseport_attach_bpf(u32 ufd, struct sock *sk); void sk_reuseport_prog_free(struct bpf_prog *prog); int sk_detach_filter(struct sock *sk); int sk_get_filter(struct sock *sk, sockptr_t optval, unsigned int len); bool sk_filter_charge(struct sock *sk, struct sk_filter *fp); void sk_filter_uncharge(struct sock *sk, struct sk_filter *fp); u64 __bpf_call_base(u64 r1, u64 r2, u64 r3, u64 r4, u64 r5); #define __bpf_call_base_args \ ((u64 (*)(u64, u64, u64, u64, u64, const struct bpf_insn *)) \ (void *)__bpf_call_base) struct bpf_prog *bpf_int_jit_compile(struct bpf_prog *prog); void bpf_jit_compile(struct bpf_prog *prog); bool bpf_jit_needs_zext(void); bool bpf_jit_supports_subprog_tailcalls(void); bool bpf_jit_supports_kfunc_call(void); bool bpf_jit_supports_far_kfunc_call(void); bool bpf_helper_changes_pkt_data(void *func); static inline bool bpf_dump_raw_ok(const struct cred *cred) { /* Reconstruction of call-sites is dependent on kallsyms, * thus make dump the same restriction. */ return kallsyms_show_value(cred); } struct bpf_prog *bpf_patch_insn_single(struct bpf_prog *prog, u32 off, const struct bpf_insn *patch, u32 len); int bpf_remove_insns(struct bpf_prog *prog, u32 off, u32 cnt); void bpf_clear_redirect_map(struct bpf_map *map); static inline bool xdp_return_frame_no_direct(void) { struct bpf_redirect_info *ri = this_cpu_ptr(&bpf_redirect_info); return ri->kern_flags & BPF_RI_F_RF_NO_DIRECT; } static inline void xdp_set_return_frame_no_direct(void) { struct bpf_redirect_info *ri = this_cpu_ptr(&bpf_redirect_info); ri->kern_flags |= BPF_RI_F_RF_NO_DIRECT; } static inline void xdp_clear_return_frame_no_direct(void) { struct bpf_redirect_info *ri = this_cpu_ptr(&bpf_redirect_info); ri->kern_flags &= ~BPF_RI_F_RF_NO_DIRECT; } static inline int xdp_ok_fwd_dev(const struct net_device *fwd, unsigned int pktlen) { unsigned int len; if (unlikely(!(fwd->flags & IFF_UP))) return -ENETDOWN; len = fwd->mtu + fwd->hard_header_len + VLAN_HLEN; if (pktlen > len) return -EMSGSIZE; return 0; } /* The pair of xdp_do_redirect and xdp_do_flush MUST be called in the * same cpu context. Further for best results no more than a single map * for the do_redirect/do_flush pair should be used. This limitation is * because we only track one map and force a flush when the map changes. * This does not appear to be a real limitation for existing software. */ int xdp_do_generic_redirect(struct net_device *dev, struct sk_buff *skb, struct xdp_buff *xdp, struct bpf_prog *prog); int xdp_do_redirect(struct net_device *dev, struct xdp_buff *xdp, struct bpf_prog *prog); int xdp_do_redirect_frame(struct net_device *dev, struct xdp_buff *xdp, struct xdp_frame *xdpf, struct bpf_prog *prog); void xdp_do_flush(void); /* The xdp_do_flush_map() helper has been renamed to drop the _map suffix, as * it is no longer only flushing maps. Keep this define for compatibility * until all drivers are updated - do not use xdp_do_flush_map() in new code! */ #define xdp_do_flush_map xdp_do_flush void bpf_warn_invalid_xdp_action(struct net_device *dev, struct bpf_prog *prog, u32 act); #ifdef CONFIG_INET struct sock *bpf_run_sk_reuseport(struct sock_reuseport *reuse, struct sock *sk, struct bpf_prog *prog, struct sk_buff *skb, struct sock *migrating_sk, u32 hash); #else static inline struct sock * bpf_run_sk_reuseport(struct sock_reuseport *reuse, struct sock *sk, struct bpf_prog *prog, struct sk_buff *skb, struct sock *migrating_sk, u32 hash) { return NULL; } #endif #ifdef CONFIG_BPF_JIT extern int bpf_jit_enable; extern int bpf_jit_harden; extern int bpf_jit_kallsyms; extern long bpf_jit_limit; extern long bpf_jit_limit_max; typedef void (*bpf_jit_fill_hole_t)(void *area, unsigned int size); void bpf_jit_fill_hole_with_zero(void *area, unsigned int size); struct bpf_binary_header * bpf_jit_binary_alloc(unsigned int proglen, u8 **image_ptr, unsigned int alignment, bpf_jit_fill_hole_t bpf_fill_ill_insns); void bpf_jit_binary_free(struct bpf_binary_header *hdr); u64 bpf_jit_alloc_exec_limit(void); void *bpf_jit_alloc_exec(unsigned long size); void bpf_jit_free_exec(void *addr); void bpf_jit_free(struct bpf_prog *fp); struct bpf_binary_header * bpf_jit_binary_pack_hdr(const struct bpf_prog *fp); void *bpf_prog_pack_alloc(u32 size, bpf_jit_fill_hole_t bpf_fill_ill_insns); void bpf_prog_pack_free(struct bpf_binary_header *hdr); static inline bool bpf_prog_kallsyms_verify_off(const struct bpf_prog *fp) { return list_empty(&fp->aux->ksym.lnode) || fp->aux->ksym.lnode.prev == LIST_POISON2; } struct bpf_binary_header * bpf_jit_binary_pack_alloc(unsigned int proglen, u8 **ro_image, unsigned int alignment, struct bpf_binary_header **rw_hdr, u8 **rw_image, bpf_jit_fill_hole_t bpf_fill_ill_insns); int bpf_jit_binary_pack_finalize(struct bpf_prog *prog, struct bpf_binary_header *ro_header, struct bpf_binary_header *rw_header); void bpf_jit_binary_pack_free(struct bpf_binary_header *ro_header, struct bpf_binary_header *rw_header); int bpf_jit_add_poke_descriptor(struct bpf_prog *prog, struct bpf_jit_poke_descriptor *poke); int bpf_jit_get_func_addr(const struct bpf_prog *prog, const struct bpf_insn *insn, bool extra_pass, u64 *func_addr, bool *func_addr_fixed); struct bpf_prog *bpf_jit_blind_constants(struct bpf_prog *fp); void bpf_jit_prog_release_other(struct bpf_prog *fp, struct bpf_prog *fp_other); static inline void bpf_jit_dump(unsigned int flen, unsigned int proglen, u32 pass, void *image) { pr_err("flen=%u proglen=%u pass=%u image=%pK from=%s pid=%d\n", flen, proglen, pass, image, current->comm, task_pid_nr(current)); if (image) print_hex_dump(KERN_ERR, "JIT code: ", DUMP_PREFIX_OFFSET, 16, 1, image, proglen, false); } static inline bool bpf_jit_is_ebpf(void) { # ifdef CONFIG_HAVE_EBPF_JIT return true; # else return false; # endif } static inline bool ebpf_jit_enabled(void) { return bpf_jit_enable && bpf_jit_is_ebpf(); } static inline bool bpf_prog_ebpf_jited(const struct bpf_prog *fp) { return fp->jited && bpf_jit_is_ebpf(); } static inline bool bpf_jit_blinding_enabled(struct bpf_prog *prog) { /* These are the prerequisites, should someone ever have the * idea to call blinding outside of them, we make sure to * bail out. */ if (!bpf_jit_is_ebpf()) return false; if (!prog->jit_requested) return false; if (!bpf_jit_harden) return false; if (bpf_jit_harden == 1 && bpf_capable()) return false; return true; } static inline bool bpf_jit_kallsyms_enabled(void) { /* There are a couple of corner cases where kallsyms should * not be enabled f.e. on hardening. */ if (bpf_jit_harden) return false; if (!bpf_jit_kallsyms) return false; if (bpf_jit_kallsyms == 1) return true; return false; } const char *__bpf_address_lookup(unsigned long addr, unsigned long *size, unsigned long *off, char *sym); bool is_bpf_text_address(unsigned long addr); int bpf_get_kallsym(unsigned int symnum, unsigned long *value, char *type, char *sym); static inline const char * bpf_address_lookup(unsigned long addr, unsigned long *size, unsigned long *off, char **modname, char *sym) { const char *ret = __bpf_address_lookup(addr, size, off, sym); if (ret && modname) *modname = NULL; return ret; } void bpf_prog_kallsyms_add(struct bpf_prog *fp); void bpf_prog_kallsyms_del(struct bpf_prog *fp); #else /* CONFIG_BPF_JIT */ static inline bool ebpf_jit_enabled(void) { return false; } static inline bool bpf_jit_blinding_enabled(struct bpf_prog *prog) { return false; } static inline bool bpf_prog_ebpf_jited(const struct bpf_prog *fp) { return false; } static inline int bpf_jit_add_poke_descriptor(struct bpf_prog *prog, struct bpf_jit_poke_descriptor *poke) { return -ENOTSUPP; } static inline void bpf_jit_free(struct bpf_prog *fp) { bpf_prog_unlock_free(fp); } static inline bool bpf_jit_kallsyms_enabled(void) { return false; } static inline const char * __bpf_address_lookup(unsigned long addr, unsigned long *size, unsigned long *off, char *sym) { return NULL; } static inline bool is_bpf_text_address(unsigned long addr) { return false; } static inline int bpf_get_kallsym(unsigned int symnum, unsigned long *value, char *type, char *sym) { return -ERANGE; } static inline const char * bpf_address_lookup(unsigned long addr, unsigned long *size, unsigned long *off, char **modname, char *sym) { return NULL; } static inline void bpf_prog_kallsyms_add(struct bpf_prog *fp) { } static inline void bpf_prog_kallsyms_del(struct bpf_prog *fp) { } #endif /* CONFIG_BPF_JIT */ void bpf_prog_kallsyms_del_all(struct bpf_prog *fp); #define BPF_ANC BIT(15) static inline bool bpf_needs_clear_a(const struct sock_filter *first) { switch (first->code) { case BPF_RET | BPF_K: case BPF_LD | BPF_W | BPF_LEN: return false; case BPF_LD | BPF_W | BPF_ABS: case BPF_LD | BPF_H | BPF_ABS: case BPF_LD | BPF_B | BPF_ABS: if (first->k == SKF_AD_OFF + SKF_AD_ALU_XOR_X) return true; return false; default: return true; } } static inline u16 bpf_anc_helper(const struct sock_filter *ftest) { BUG_ON(ftest->code & BPF_ANC); switch (ftest->code) { case BPF_LD | BPF_W | BPF_ABS: case BPF_LD | BPF_H | BPF_ABS: case BPF_LD | BPF_B | BPF_ABS: #define BPF_ANCILLARY(CODE) case SKF_AD_OFF + SKF_AD_##CODE: \ return BPF_ANC | SKF_AD_##CODE switch (ftest->k) { BPF_ANCILLARY(PROTOCOL); BPF_ANCILLARY(PKTTYPE); BPF_ANCILLARY(IFINDEX); BPF_ANCILLARY(NLATTR); BPF_ANCILLARY(NLATTR_NEST); BPF_ANCILLARY(MARK); BPF_ANCILLARY(QUEUE); BPF_ANCILLARY(HATYPE); BPF_ANCILLARY(RXHASH); BPF_ANCILLARY(CPU); BPF_ANCILLARY(ALU_XOR_X); BPF_ANCILLARY(VLAN_TAG); BPF_ANCILLARY(VLAN_TAG_PRESENT); BPF_ANCILLARY(PAY_OFFSET); BPF_ANCILLARY(RANDOM); BPF_ANCILLARY(VLAN_TPID); } fallthrough; default: return ftest->code; } } void *bpf_internal_load_pointer_neg_helper(const struct sk_buff *skb, int k, unsigned int size); static inline int bpf_tell_extensions(void) { return SKF_AD_MAX; } struct bpf_sock_addr_kern { struct sock *sk; struct sockaddr *uaddr; /* Temporary "register" to make indirect stores to nested structures * defined above. We need three registers to make such a store, but * only two (src and dst) are available at convert_ctx_access time */ u64 tmp_reg; void *t_ctx; /* Attach type specific context. */ }; struct bpf_sock_ops_kern { struct sock *sk; union { u32 args[4]; u32 reply; u32 replylong[4]; }; struct sk_buff *syn_skb; struct sk_buff *skb; void *skb_data_end; u8 op; u8 is_fullsock; u8 remaining_opt_len; u64 temp; /* temp and everything after is not * initialized to 0 before calling * the BPF program. New fields that * should be initialized to 0 should * be inserted before temp. * temp is scratch storage used by * sock_ops_convert_ctx_access * as temporary storage of a register. */ }; struct bpf_sysctl_kern { struct ctl_table_header *head; struct ctl_table *table; void *cur_val; size_t cur_len; void *new_val; size_t new_len; int new_updated; int write; loff_t *ppos; /* Temporary "register" for indirect stores to ppos. */ u64 tmp_reg; }; #define BPF_SOCKOPT_KERN_BUF_SIZE 32 struct bpf_sockopt_buf { u8 data[BPF_SOCKOPT_KERN_BUF_SIZE]; }; struct bpf_sockopt_kern { struct sock *sk; u8 *optval; u8 *optval_end; s32 level; s32 optname; s32 optlen; /* for retval in struct bpf_cg_run_ctx */ struct task_struct *current_task; /* Temporary "register" for indirect stores to ppos. */ u64 tmp_reg; }; int copy_bpf_fprog_from_user(struct sock_fprog *dst, sockptr_t src, int len); struct bpf_sk_lookup_kern { u16 family; u16 protocol; __be16 sport; u16 dport; struct { __be32 saddr; __be32 daddr; } v4; struct { const struct in6_addr *saddr; const struct in6_addr *daddr; } v6; struct sock *selected_sk; u32 ingress_ifindex; bool no_reuseport; }; extern struct static_key_false bpf_sk_lookup_enabled; /* Runners for BPF_SK_LOOKUP programs to invoke on socket lookup. * * Allowed return values for a BPF SK_LOOKUP program are SK_PASS and * SK_DROP. Their meaning is as follows: * * SK_PASS && ctx.selected_sk != NULL: use selected_sk as lookup result * SK_PASS && ctx.selected_sk == NULL: continue to htable-based socket lookup * SK_DROP : terminate lookup with -ECONNREFUSED * * This macro aggregates return values and selected sockets from * multiple BPF programs according to following rules in order: * * 1. If any program returned SK_PASS and a non-NULL ctx.selected_sk, * macro result is SK_PASS and last ctx.selected_sk is used. * 2. If any program returned SK_DROP return value, * macro result is SK_DROP. * 3. Otherwise result is SK_PASS and ctx.selected_sk is NULL. * * Caller must ensure that the prog array is non-NULL, and that the * array as well as the programs it contains remain valid. */ #define BPF_PROG_SK_LOOKUP_RUN_ARRAY(array, ctx, func) \ ({ \ struct bpf_sk_lookup_kern *_ctx = &(ctx); \ struct bpf_prog_array_item *_item; \ struct sock *_selected_sk = NULL; \ bool _no_reuseport = false; \ struct bpf_prog *_prog; \ bool _all_pass = true; \ u32 _ret; \ \ migrate_disable(); \ _item = &(array)->items[0]; \ while ((_prog = READ_ONCE(_item->prog))) { \ /* restore most recent selection */ \ _ctx->selected_sk = _selected_sk; \ _ctx->no_reuseport = _no_reuseport; \ \ _ret = func(_prog, _ctx); \ if (_ret == SK_PASS && _ctx->selected_sk) { \ /* remember last non-NULL socket */ \ _selected_sk = _ctx->selected_sk; \ _no_reuseport = _ctx->no_reuseport; \ } else if (_ret == SK_DROP && _all_pass) { \ _all_pass = false; \ } \ _item++; \ } \ _ctx->selected_sk = _selected_sk; \ _ctx->no_reuseport = _no_reuseport; \ migrate_enable(); \ _all_pass || _selected_sk ? SK_PASS : SK_DROP; \ }) static inline bool bpf_sk_lookup_run_v4(struct net *net, int protocol, const __be32 saddr, const __be16 sport, const __be32 daddr, const u16 dport, const int ifindex, struct sock **psk) { struct bpf_prog_array *run_array; struct sock *selected_sk = NULL; bool no_reuseport = false; rcu_read_lock(); run_array = rcu_dereference(net->bpf.run_array[NETNS_BPF_SK_LOOKUP]); if (run_array) { struct bpf_sk_lookup_kern ctx = { .family = AF_INET, .protocol = protocol, .v4.saddr = saddr, .v4.daddr = daddr, .sport = sport, .dport = dport, .ingress_ifindex = ifindex, }; u32 act; act = BPF_PROG_SK_LOOKUP_RUN_ARRAY(run_array, ctx, bpf_prog_run); if (act == SK_PASS) { selected_sk = ctx.selected_sk; no_reuseport = ctx.no_reuseport; } else { selected_sk = ERR_PTR(-ECONNREFUSED); } } rcu_read_unlock(); *psk = selected_sk; return no_reuseport; } #if IS_ENABLED(CONFIG_IPV6) static inline bool bpf_sk_lookup_run_v6(struct net *net, int protocol, const struct in6_addr *saddr, const __be16 sport, const struct in6_addr *daddr, const u16 dport, const int ifindex, struct sock **psk) { struct bpf_prog_array *run_array; struct sock *selected_sk = NULL; bool no_reuseport = false; rcu_read_lock(); run_array = rcu_dereference(net->bpf.run_array[NETNS_BPF_SK_LOOKUP]); if (run_array) { struct bpf_sk_lookup_kern ctx = { .family = AF_INET6, .protocol = protocol, .v6.saddr = saddr, .v6.daddr = daddr, .sport = sport, .dport = dport, .ingress_ifindex = ifindex, }; u32 act; act = BPF_PROG_SK_LOOKUP_RUN_ARRAY(run_array, ctx, bpf_prog_run); if (act == SK_PASS) { selected_sk = ctx.selected_sk; no_reuseport = ctx.no_reuseport; } else { selected_sk = ERR_PTR(-ECONNREFUSED); } } rcu_read_unlock(); *psk = selected_sk; return no_reuseport; } #endif /* IS_ENABLED(CONFIG_IPV6) */ static __always_inline long __bpf_xdp_redirect_map(struct bpf_map *map, u64 index, u64 flags, const u64 flag_mask, void *lookup_elem(struct bpf_map *map, u32 key)) { struct bpf_redirect_info *ri = this_cpu_ptr(&bpf_redirect_info); const u64 action_mask = XDP_ABORTED | XDP_DROP | XDP_PASS | XDP_TX; /* Lower bits of the flags are used as return code on lookup failure */ if (unlikely(flags & ~(action_mask | flag_mask))) return XDP_ABORTED; ri->tgt_value = lookup_elem(map, index); if (unlikely(!ri->tgt_value) && !(flags & BPF_F_BROADCAST)) { /* If the lookup fails we want to clear out the state in the * redirect_info struct completely, so that if an eBPF program * performs multiple lookups, the last one always takes * precedence. */ ri->map_id = INT_MAX; /* Valid map id idr range: [1,INT_MAX[ */ ri->map_type = BPF_MAP_TYPE_UNSPEC; return flags & action_mask; } ri->tgt_index = index; ri->map_id = map->id; ri->map_type = map->map_type; if (flags & BPF_F_BROADCAST) { WRITE_ONCE(ri->map, map); ri->flags = flags; } else { WRITE_ONCE(ri->map, NULL); ri->flags = 0; } return XDP_REDIRECT; } #ifdef CONFIG_NET int __bpf_skb_load_bytes(const struct sk_buff *skb, u32 offset, void *to, u32 len); int __bpf_skb_store_bytes(struct sk_buff *skb, u32 offset, const void *from, u32 len, u64 flags); int __bpf_xdp_load_bytes(struct xdp_buff *xdp, u32 offset, void *buf, u32 len); int __bpf_xdp_store_bytes(struct xdp_buff *xdp, u32 offset, void *buf, u32 len); void *bpf_xdp_pointer(struct xdp_buff *xdp, u32 offset, u32 len); void bpf_xdp_copy_buf(struct xdp_buff *xdp, unsigned long off, void *buf, unsigned long len, bool flush); #else /* CONFIG_NET */ static inline int __bpf_skb_load_bytes(const struct sk_buff *skb, u32 offset, void *to, u32 len) { return -EOPNOTSUPP; } static inline int __bpf_skb_store_bytes(struct sk_buff *skb, u32 offset, const void *from, u32 len, u64 flags) { return -EOPNOTSUPP; } static inline int __bpf_xdp_load_bytes(struct xdp_buff *xdp, u32 offset, void *buf, u32 len) { return -EOPNOTSUPP; } static inline int __bpf_xdp_store_bytes(struct xdp_buff *xdp, u32 offset, void *buf, u32 len) { return -EOPNOTSUPP; } static inline void *bpf_xdp_pointer(struct xdp_buff *xdp, u32 offset, u32 len) { return NULL; } static inline void bpf_xdp_copy_buf(struct xdp_buff *xdp, unsigned long off, void *buf, unsigned long len, bool flush) { } #endif /* CONFIG_NET */ #endif /* __LINUX_FILTER_H__ */ |
22 5808 1470 | 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60 61 62 63 64 65 66 67 68 69 70 71 72 73 74 75 76 77 78 79 80 81 82 83 84 85 86 87 88 89 90 91 92 93 94 95 96 97 98 99 100 101 102 103 104 105 106 107 108 109 110 111 112 113 114 115 116 117 118 119 120 121 122 123 124 125 126 127 128 129 130 131 132 133 134 135 136 137 138 139 140 141 142 143 144 145 146 147 148 149 150 151 152 153 154 155 156 157 158 159 160 161 162 163 164 165 166 167 168 169 170 171 172 173 174 175 176 177 178 179 180 181 182 183 184 185 186 187 188 189 190 191 192 193 194 195 196 197 198 199 200 201 202 203 204 205 206 207 208 209 210 211 212 213 214 215 216 217 218 219 220 221 222 223 224 225 226 227 228 229 230 231 232 233 234 235 236 237 238 239 240 241 242 243 244 245 246 247 248 249 250 251 252 253 254 255 256 257 258 259 260 261 262 263 264 265 266 267 268 269 270 271 272 273 274 275 276 277 278 279 280 281 282 283 284 285 286 287 288 289 290 291 292 293 294 295 296 297 298 299 300 301 302 303 304 305 306 307 308 309 310 311 312 313 314 315 316 317 318 319 320 321 322 323 324 325 326 327 328 329 330 331 332 333 334 335 336 337 338 339 340 341 342 343 344 345 346 347 348 349 350 351 352 353 354 355 356 357 358 359 360 361 362 363 | /* SPDX-License-Identifier: GPL-2.0 */ #ifndef _LINUX_STRING_H_ #define _LINUX_STRING_H_ #include <linux/compiler.h> /* for inline */ #include <linux/types.h> /* for size_t */ #include <linux/stddef.h> /* for NULL */ #include <linux/errno.h> /* for E2BIG */ #include <linux/stdarg.h> #include <uapi/linux/string.h> extern char *strndup_user(const char __user *, long); extern void *memdup_user(const void __user *, size_t); extern void *vmemdup_user(const void __user *, size_t); extern void *memdup_user_nul(const void __user *, size_t); /* * Include machine specific inline routines */ #include <asm/string.h> #ifndef __HAVE_ARCH_STRCPY extern char * strcpy(char *,const char *); #endif #ifndef __HAVE_ARCH_STRNCPY extern char * strncpy(char *,const char *, __kernel_size_t); #endif #ifndef __HAVE_ARCH_STRLCPY size_t strlcpy(char *, const char *, size_t); #endif #ifndef __HAVE_ARCH_STRSCPY ssize_t strscpy(char *, const char *, size_t); #endif /* Wraps calls to strscpy()/memset(), no arch specific code required */ ssize_t strscpy_pad(char *dest, const char *src, size_t count); #ifndef __HAVE_ARCH_STRCAT extern char * strcat(char *, const char *); #endif #ifndef __HAVE_ARCH_STRNCAT extern char * strncat(char *, const char *, __kernel_size_t); #endif #ifndef __HAVE_ARCH_STRLCAT extern size_t strlcat(char *, const char *, __kernel_size_t); #endif #ifndef __HAVE_ARCH_STRCMP extern int strcmp(const char *,const char *); #endif #ifndef __HAVE_ARCH_STRNCMP extern int strncmp(const char *,const char *,__kernel_size_t); #endif #ifndef __HAVE_ARCH_STRCASECMP extern int strcasecmp(const char *s1, const char *s2); #endif #ifndef __HAVE_ARCH_STRNCASECMP extern int strncasecmp(const char *s1, const char *s2, size_t n); #endif #ifndef __HAVE_ARCH_STRCHR extern char * strchr(const char *,int); #endif #ifndef __HAVE_ARCH_STRCHRNUL extern char * strchrnul(const char *,int); #endif extern char * strnchrnul(const char *, size_t, int); #ifndef __HAVE_ARCH_STRNCHR extern char * strnchr(const char *, size_t, int); #endif #ifndef __HAVE_ARCH_STRRCHR extern char * strrchr(const char *,int); #endif extern char * __must_check skip_spaces(const char *); extern char *strim(char *); static inline __must_check char *strstrip(char *str) { return strim(str); } #ifndef __HAVE_ARCH_STRSTR extern char * strstr(const char *, const char *); #endif #ifndef __HAVE_ARCH_STRNSTR extern char * strnstr(const char *, const char *, size_t); #endif #ifndef __HAVE_ARCH_STRLEN extern __kernel_size_t strlen(const char *); #endif #ifndef __HAVE_ARCH_STRNLEN extern __kernel_size_t strnlen(const char *,__kernel_size_t); #endif #ifndef __HAVE_ARCH_STRPBRK extern char * strpbrk(const char *,const char *); #endif #ifndef __HAVE_ARCH_STRSEP extern char * strsep(char **,const char *); #endif #ifndef __HAVE_ARCH_STRSPN extern __kernel_size_t strspn(const char *,const char *); #endif #ifndef __HAVE_ARCH_STRCSPN extern __kernel_size_t strcspn(const char *,const char *); #endif #ifndef __HAVE_ARCH_MEMSET extern void * memset(void *,int,__kernel_size_t); #endif #ifndef __HAVE_ARCH_MEMSET16 extern void *memset16(uint16_t *, uint16_t, __kernel_size_t); #endif #ifndef __HAVE_ARCH_MEMSET32 extern void *memset32(uint32_t *, uint32_t, __kernel_size_t); #endif #ifndef __HAVE_ARCH_MEMSET64 extern void *memset64(uint64_t *, uint64_t, __kernel_size_t); #endif static inline void *memset_l(unsigned long *p, unsigned long v, __kernel_size_t n) { if (BITS_PER_LONG == 32) return memset32((uint32_t *)p, v, n); else return memset64((uint64_t *)p, v, n); } static inline void *memset_p(void **p, void *v, __kernel_size_t n) { if (BITS_PER_LONG == 32) return memset32((uint32_t *)p, (uintptr_t)v, n); else return memset64((uint64_t *)p, (uintptr_t)v, n); } extern void **__memcat_p(void **a, void **b); #define memcat_p(a, b) ({ \ BUILD_BUG_ON_MSG(!__same_type(*(a), *(b)), \ "type mismatch in memcat_p()"); \ (typeof(*a) *)__memcat_p((void **)(a), (void **)(b)); \ }) #ifndef __HAVE_ARCH_MEMCPY extern void * memcpy(void *,const void *,__kernel_size_t); #endif #ifndef __HAVE_ARCH_MEMMOVE extern void * memmove(void *,const void *,__kernel_size_t); #endif #ifndef __HAVE_ARCH_MEMSCAN extern void * memscan(void *,int,__kernel_size_t); #endif #ifndef __HAVE_ARCH_MEMCMP extern int memcmp(const void *,const void *,__kernel_size_t); #endif #ifndef __HAVE_ARCH_BCMP extern int bcmp(const void *,const void *,__kernel_size_t); #endif #ifndef __HAVE_ARCH_MEMCHR extern void * memchr(const void *,int,__kernel_size_t); #endif #ifndef __HAVE_ARCH_MEMCPY_FLUSHCACHE static inline void memcpy_flushcache(void *dst, const void *src, size_t cnt) { memcpy(dst, src, cnt); } #endif void *memchr_inv(const void *s, int c, size_t n); char *strreplace(char *str, char old, char new); extern void kfree_const(const void *x); extern char *kstrdup(const char *s, gfp_t gfp) __malloc; extern const char *kstrdup_const(const char *s, gfp_t gfp); extern char *kstrndup(const char *s, size_t len, gfp_t gfp); extern void *kmemdup(const void *src, size_t len, gfp_t gfp) __realloc_size(2); extern void *kvmemdup(const void *src, size_t len, gfp_t gfp) __realloc_size(2); extern char *kmemdup_nul(const char *s, size_t len, gfp_t gfp); extern char **argv_split(gfp_t gfp, const char *str, int *argcp); extern void argv_free(char **argv); extern bool sysfs_streq(const char *s1, const char *s2); int match_string(const char * const *array, size_t n, const char *string); int __sysfs_match_string(const char * const *array, size_t n, const char *s); /** * sysfs_match_string - matches given string in an array * @_a: array of strings * @_s: string to match with * * Helper for __sysfs_match_string(). Calculates the size of @a automatically. */ #define sysfs_match_string(_a, _s) __sysfs_match_string(_a, ARRAY_SIZE(_a), _s) #ifdef CONFIG_BINARY_PRINTF int vbin_printf(u32 *bin_buf, size_t size, const char *fmt, va_list args); int bstr_printf(char *buf, size_t size, const char *fmt, const u32 *bin_buf); int bprintf(u32 *bin_buf, size_t size, const char *fmt, ...) __printf(3, 4); #endif extern ssize_t memory_read_from_buffer(void *to, size_t count, loff_t *ppos, const void *from, size_t available); int ptr_to_hashval(const void *ptr, unsigned long *hashval_out); /** * strstarts - does @str start with @prefix? * @str: string to examine * @prefix: prefix to look for. */ static inline bool strstarts(const char *str, const char *prefix) { return strncmp(str, prefix, strlen(prefix)) == 0; } size_t memweight(const void *ptr, size_t bytes); /** * memzero_explicit - Fill a region of memory (e.g. sensitive * keying data) with 0s. * @s: Pointer to the start of the area. * @count: The size of the area. * * Note: usually using memset() is just fine (!), but in cases * where clearing out _local_ data at the end of a scope is * necessary, memzero_explicit() should be used instead in * order to prevent the compiler from optimising away zeroing. * * memzero_explicit() doesn't need an arch-specific version as * it just invokes the one of memset() implicitly. */ static inline void memzero_explicit(void *s, size_t count) { memset(s, 0, count); barrier_data(s); } /** * kbasename - return the last part of a pathname. * * @path: path to extract the filename from. */ static inline const char *kbasename(const char *path) { const char *tail = strrchr(path, '/'); return tail ? tail + 1 : path; } #if !defined(__NO_FORTIFY) && defined(__OPTIMIZE__) && defined(CONFIG_FORTIFY_SOURCE) #include <linux/fortify-string.h> #endif #ifndef unsafe_memcpy #define unsafe_memcpy(dst, src, bytes, justification) \ memcpy(dst, src, bytes) #endif void memcpy_and_pad(void *dest, size_t dest_len, const void *src, size_t count, int pad); /** * strtomem_pad - Copy NUL-terminated string to non-NUL-terminated buffer * * @dest: Pointer of destination character array (marked as __nonstring) * @src: Pointer to NUL-terminated string * @pad: Padding character to fill any remaining bytes of @dest after copy * * This is a replacement for strncpy() uses where the destination is not * a NUL-terminated string, but with bounds checking on the source size, and * an explicit padding character. If padding is not required, use strtomem(). * * Note that the size of @dest is not an argument, as the length of @dest * must be discoverable by the compiler. */ #define strtomem_pad(dest, src, pad) do { \ const size_t _dest_len = __builtin_object_size(dest, 1); \ \ BUILD_BUG_ON(!__builtin_constant_p(_dest_len) || \ _dest_len == (size_t)-1); \ memcpy_and_pad(dest, _dest_len, src, strnlen(src, _dest_len), pad); \ } while (0) /** * strtomem - Copy NUL-terminated string to non-NUL-terminated buffer * * @dest: Pointer of destination character array (marked as __nonstring) * @src: Pointer to NUL-terminated string * * This is a replacement for strncpy() uses where the destination is not * a NUL-terminated string, but with bounds checking on the source size, and * without trailing padding. If padding is required, use strtomem_pad(). * * Note that the size of @dest is not an argument, as the length of @dest * must be discoverable by the compiler. */ #define strtomem(dest, src) do { \ const size_t _dest_len = __builtin_object_size(dest, 1); \ \ BUILD_BUG_ON(!__builtin_constant_p(_dest_len) || \ _dest_len == (size_t)-1); \ memcpy(dest, src, min(_dest_len, strnlen(src, _dest_len))); \ } while (0) /** * memset_after - Set a value after a struct member to the end of a struct * * @obj: Address of target struct instance * @v: Byte value to repeatedly write * @member: after which struct member to start writing bytes * * This is good for clearing padding following the given member. */ #define memset_after(obj, v, member) \ ({ \ u8 *__ptr = (u8 *)(obj); \ typeof(v) __val = (v); \ memset(__ptr + offsetofend(typeof(*(obj)), member), __val, \ sizeof(*(obj)) - offsetofend(typeof(*(obj)), member)); \ }) /** * memset_startat - Set a value starting at a member to the end of a struct * * @obj: Address of target struct instance * @v: Byte value to repeatedly write * @member: struct member to start writing at * * Note that if there is padding between the prior member and the target * member, memset_after() should be used to clear the prior padding. */ #define memset_startat(obj, v, member) \ ({ \ u8 *__ptr = (u8 *)(obj); \ typeof(v) __val = (v); \ memset(__ptr + offsetof(typeof(*(obj)), member), __val, \ sizeof(*(obj)) - offsetof(typeof(*(obj)), member)); \ }) /** * str_has_prefix - Test if a string has a given prefix * @str: The string to test * @prefix: The string to see if @str starts with * * A common way to test a prefix of a string is to do: * strncmp(str, prefix, sizeof(prefix) - 1) * * But this can lead to bugs due to typos, or if prefix is a pointer * and not a constant. Instead use str_has_prefix(). * * Returns: * * strlen(@prefix) if @str starts with @prefix * * 0 if @str does not start with @prefix */ static __always_inline size_t str_has_prefix(const char *str, const char *prefix) { size_t len = strlen(prefix); return strncmp(str, prefix, len) == 0 ? len : 0; } #endif /* _LINUX_STRING_H_ */ |
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60 61 62 63 64 65 66 67 68 69 70 71 72 73 74 75 76 77 78 79 80 81 82 83 84 85 86 87 88 89 90 91 92 93 94 95 96 97 98 99 100 101 102 103 104 105 106 107 108 109 110 111 112 113 114 115 116 117 118 119 120 121 122 123 124 125 126 127 128 129 130 131 132 133 134 135 136 137 138 139 140 141 142 143 144 145 146 147 148 149 150 151 152 153 154 155 156 157 158 159 160 161 162 163 164 165 166 167 168 169 170 171 172 173 174 175 176 177 178 179 180 181 182 183 184 185 186 187 188 189 190 191 192 193 194 195 196 197 198 199 200 201 202 203 204 205 206 207 208 209 210 211 212 213 214 215 216 217 218 219 220 221 222 223 224 225 226 227 228 229 230 231 232 233 234 235 236 237 238 239 240 241 242 243 244 245 246 247 248 249 250 251 252 253 254 255 256 257 258 259 260 261 262 263 264 265 266 267 268 269 270 271 272 273 274 275 276 277 278 279 280 281 282 283 284 285 286 287 288 289 290 291 292 293 294 295 296 297 298 299 300 301 302 303 304 305 306 307 308 309 310 311 312 313 314 315 316 317 318 319 320 321 322 323 324 325 326 327 328 329 330 331 332 333 334 335 336 337 338 339 340 341 342 343 344 345 346 347 348 349 350 351 352 353 354 355 356 357 358 359 360 361 362 363 364 365 366 367 368 369 370 371 372 373 374 375 376 377 378 379 380 381 382 383 384 385 386 387 388 389 390 391 392 393 394 395 396 397 398 399 400 401 402 403 404 405 406 407 408 409 410 411 412 413 414 415 416 417 418 419 420 421 422 423 424 425 426 427 428 429 430 431 432 433 434 435 436 437 438 439 440 441 442 443 444 445 446 447 448 449 450 451 452 453 454 455 456 457 458 459 460 461 462 463 464 465 466 467 468 469 470 471 472 473 474 475 476 477 478 479 480 481 482 483 484 485 486 487 488 489 490 491 492 493 494 495 496 497 498 499 500 501 502 503 504 505 506 507 508 509 510 511 512 513 514 515 | // SPDX-License-Identifier: GPL-2.0-only /* Copyright (C) 2003-2013 Jozsef Kadlecsik <kadlec@netfilter.org> */ /* Kernel module implementing an IP set type: the hash:net,port type */ #include <linux/jhash.h> #include <linux/module.h> #include <linux/ip.h> #include <linux/skbuff.h> #include <linux/errno.h> #include <linux/random.h> #include <net/ip.h> #include <net/ipv6.h> #include <net/netlink.h> #include <linux/netfilter.h> #include <linux/netfilter/ipset/pfxlen.h> #include <linux/netfilter/ipset/ip_set.h> #include <linux/netfilter/ipset/ip_set_getport.h> #include <linux/netfilter/ipset/ip_set_hash.h> #define IPSET_TYPE_REV_MIN 0 /* 1 SCTP and UDPLITE support added */ /* 2 Range as input support for IPv4 added */ /* 3 nomatch flag support added */ /* 4 Counters support added */ /* 5 Comments support added */ /* 6 Forceadd support added */ /* 7 skbinfo support added */ #define IPSET_TYPE_REV_MAX 8 /* bucketsize, initval support added */ MODULE_LICENSE("GPL"); MODULE_AUTHOR("Jozsef Kadlecsik <kadlec@netfilter.org>"); IP_SET_MODULE_DESC("hash:net,port", IPSET_TYPE_REV_MIN, IPSET_TYPE_REV_MAX); MODULE_ALIAS("ip_set_hash:net,port"); /* Type specific function prefix */ #define HTYPE hash_netport #define IP_SET_HASH_WITH_PROTO #define IP_SET_HASH_WITH_NETS /* We squeeze the "nomatch" flag into cidr: we don't support cidr == 0 * However this way we have to store internally cidr - 1, * dancing back and forth. */ #define IP_SET_HASH_WITH_NETS_PACKED /* IPv4 variant */ /* Member elements */ struct hash_netport4_elem { __be32 ip; __be16 port; u8 proto; u8 cidr:7; u8 nomatch:1; }; /* Common functions */ static bool hash_netport4_data_equal(const struct hash_netport4_elem *ip1, const struct hash_netport4_elem *ip2, u32 *multi) { return ip1->ip == ip2->ip && ip1->port == ip2->port && ip1->proto == ip2->proto && ip1->cidr == ip2->cidr; } static int hash_netport4_do_data_match(const struct hash_netport4_elem *elem) { return elem->nomatch ? -ENOTEMPTY : 1; } static void hash_netport4_data_set_flags(struct hash_netport4_elem *elem, u32 flags) { elem->nomatch = !!((flags >> 16) & IPSET_FLAG_NOMATCH); } static void hash_netport4_data_reset_flags(struct hash_netport4_elem *elem, u8 *flags) { swap(*flags, elem->nomatch); } static void hash_netport4_data_netmask(struct hash_netport4_elem *elem, u8 cidr) { elem->ip &= ip_set_netmask(cidr); elem->cidr = cidr - 1; } static bool hash_netport4_data_list(struct sk_buff *skb, const struct hash_netport4_elem *data) { u32 flags = data->nomatch ? IPSET_FLAG_NOMATCH : 0; if (nla_put_ipaddr4(skb, IPSET_ATTR_IP, data->ip) || nla_put_net16(skb, IPSET_ATTR_PORT, data->port) || nla_put_u8(skb, IPSET_ATTR_CIDR, data->cidr + 1) || nla_put_u8(skb, IPSET_ATTR_PROTO, data->proto) || (flags && nla_put_net32(skb, IPSET_ATTR_CADT_FLAGS, htonl(flags)))) goto nla_put_failure; return false; nla_put_failure: return true; } static void hash_netport4_data_next(struct hash_netport4_elem *next, const struct hash_netport4_elem *d) { next->ip = d->ip; next->port = d->port; } #define MTYPE hash_netport4 #define HOST_MASK 32 #include "ip_set_hash_gen.h" static int hash_netport4_kadt(struct ip_set *set, const struct sk_buff *skb, const struct xt_action_param *par, enum ipset_adt adt, struct ip_set_adt_opt *opt) { const struct hash_netport4 *h = set->data; ipset_adtfn adtfn = set->variant->adt[adt]; struct hash_netport4_elem e = { .cidr = INIT_CIDR(h->nets[0].cidr[0], HOST_MASK), }; struct ip_set_ext ext = IP_SET_INIT_KEXT(skb, opt, set); if (adt == IPSET_TEST) e.cidr = HOST_MASK - 1; if (!ip_set_get_ip4_port(skb, opt->flags & IPSET_DIM_TWO_SRC, &e.port, &e.proto)) return -EINVAL; ip4addrptr(skb, opt->flags & IPSET_DIM_ONE_SRC, &e.ip); e.ip &= ip_set_netmask(e.cidr + 1); return adtfn(set, &e, &ext, &opt->ext, opt->cmdflags); } static int hash_netport4_uadt(struct ip_set *set, struct nlattr *tb[], enum ipset_adt adt, u32 *lineno, u32 flags, bool retried) { struct hash_netport4 *h = set->data; ipset_adtfn adtfn = set->variant->adt[adt]; struct hash_netport4_elem e = { .cidr = HOST_MASK - 1 }; struct ip_set_ext ext = IP_SET_INIT_UEXT(set); u32 port, port_to, p = 0, ip = 0, ip_to = 0, i = 0; bool with_ports = false; u8 cidr; int ret; if (tb[IPSET_ATTR_LINENO]) *lineno = nla_get_u32(tb[IPSET_ATTR_LINENO]); if (unlikely(!tb[IPSET_ATTR_IP] || !ip_set_attr_netorder(tb, IPSET_ATTR_PORT) || !ip_set_optattr_netorder(tb, IPSET_ATTR_PORT_TO) || !ip_set_optattr_netorder(tb, IPSET_ATTR_CADT_FLAGS))) return -IPSET_ERR_PROTOCOL; ret = ip_set_get_hostipaddr4(tb[IPSET_ATTR_IP], &ip); if (ret) return ret; ret = ip_set_get_extensions(set, tb, &ext); if (ret) return ret; if (tb[IPSET_ATTR_CIDR]) { cidr = nla_get_u8(tb[IPSET_ATTR_CIDR]); if (!cidr || cidr > HOST_MASK) return -IPSET_ERR_INVALID_CIDR; e.cidr = cidr - 1; } e.port = nla_get_be16(tb[IPSET_ATTR_PORT]); if (tb[IPSET_ATTR_PROTO]) { e.proto = nla_get_u8(tb[IPSET_ATTR_PROTO]); with_ports = ip_set_proto_with_ports(e.proto); if (e.proto == 0) return -IPSET_ERR_INVALID_PROTO; } else { return -IPSET_ERR_MISSING_PROTO; } if (!(with_ports || e.proto == IPPROTO_ICMP)) e.port = 0; with_ports = with_ports && tb[IPSET_ATTR_PORT_TO]; if (tb[IPSET_ATTR_CADT_FLAGS]) { u32 cadt_flags = ip_set_get_h32(tb[IPSET_ATTR_CADT_FLAGS]); if (cadt_flags & IPSET_FLAG_NOMATCH) flags |= (IPSET_FLAG_NOMATCH << 16); } if (adt == IPSET_TEST || !(with_ports || tb[IPSET_ATTR_IP_TO])) { e.ip = htonl(ip & ip_set_hostmask(e.cidr + 1)); ret = adtfn(set, &e, &ext, &ext, flags); return ip_set_enomatch(ret, flags, adt, set) ? -ret : ip_set_eexist(ret, flags) ? 0 : ret; } port = port_to = ntohs(e.port); if (tb[IPSET_ATTR_PORT_TO]) { port_to = ip_set_get_h16(tb[IPSET_ATTR_PORT_TO]); if (port_to < port) swap(port, port_to); } if (tb[IPSET_ATTR_IP_TO]) { ret = ip_set_get_hostipaddr4(tb[IPSET_ATTR_IP_TO], &ip_to); if (ret) return ret; if (ip_to < ip) swap(ip, ip_to); if (ip + UINT_MAX == ip_to) return -IPSET_ERR_HASH_RANGE; } else { ip_set_mask_from_to(ip, ip_to, e.cidr + 1); } if (retried) { ip = ntohl(h->next.ip); p = ntohs(h->next.port); } else { p = port; } do { e.ip = htonl(ip); ip = ip_set_range_to_cidr(ip, ip_to, &cidr); e.cidr = cidr - 1; for (; p <= port_to; p++, i++) { e.port = htons(p); if (i > IPSET_MAX_RANGE) { hash_netport4_data_next(&h->next, &e); return -ERANGE; } ret = adtfn(set, &e, &ext, &ext, flags); if (ret && !ip_set_eexist(ret, flags)) return ret; ret = 0; } p = port; } while (ip++ < ip_to); return ret; } /* IPv6 variant */ struct hash_netport6_elem { union nf_inet_addr ip; __be16 port; u8 proto; u8 cidr:7; u8 nomatch:1; }; /* Common functions */ static bool hash_netport6_data_equal(const struct hash_netport6_elem *ip1, const struct hash_netport6_elem *ip2, u32 *multi) { return ipv6_addr_equal(&ip1->ip.in6, &ip2->ip.in6) && ip1->port == ip2->port && ip1->proto == ip2->proto && ip1->cidr == ip2->cidr; } static int hash_netport6_do_data_match(const struct hash_netport6_elem *elem) { return elem->nomatch ? -ENOTEMPTY : 1; } static void hash_netport6_data_set_flags(struct hash_netport6_elem *elem, u32 flags) { elem->nomatch = !!((flags >> 16) & IPSET_FLAG_NOMATCH); } static void hash_netport6_data_reset_flags(struct hash_netport6_elem *elem, u8 *flags) { swap(*flags, elem->nomatch); } static void hash_netport6_data_netmask(struct hash_netport6_elem *elem, u8 cidr) { ip6_netmask(&elem->ip, cidr); elem->cidr = cidr - 1; } static bool hash_netport6_data_list(struct sk_buff *skb, const struct hash_netport6_elem *data) { u32 flags = data->nomatch ? IPSET_FLAG_NOMATCH : 0; if (nla_put_ipaddr6(skb, IPSET_ATTR_IP, &data->ip.in6) || nla_put_net16(skb, IPSET_ATTR_PORT, data->port) || nla_put_u8(skb, IPSET_ATTR_CIDR, data->cidr + 1) || nla_put_u8(skb, IPSET_ATTR_PROTO, data->proto) || (flags && nla_put_net32(skb, IPSET_ATTR_CADT_FLAGS, htonl(flags)))) goto nla_put_failure; return false; nla_put_failure: return true; } static void hash_netport6_data_next(struct hash_netport6_elem *next, const struct hash_netport6_elem *d) { next->port = d->port; } #undef MTYPE #undef HOST_MASK #define MTYPE hash_netport6 #define HOST_MASK 128 #define IP_SET_EMIT_CREATE #include "ip_set_hash_gen.h" static int hash_netport6_kadt(struct ip_set *set, const struct sk_buff *skb, const struct xt_action_param *par, enum ipset_adt adt, struct ip_set_adt_opt *opt) { const struct hash_netport6 *h = set->data; ipset_adtfn adtfn = set->variant->adt[adt]; struct hash_netport6_elem e = { .cidr = INIT_CIDR(h->nets[0].cidr[0], HOST_MASK), }; struct ip_set_ext ext = IP_SET_INIT_KEXT(skb, opt, set); if (adt == IPSET_TEST) e.cidr = HOST_MASK - 1; if (!ip_set_get_ip6_port(skb, opt->flags & IPSET_DIM_TWO_SRC, &e.port, &e.proto)) return -EINVAL; ip6addrptr(skb, opt->flags & IPSET_DIM_ONE_SRC, &e.ip.in6); ip6_netmask(&e.ip, e.cidr + 1); return adtfn(set, &e, &ext, &opt->ext, opt->cmdflags); } static int hash_netport6_uadt(struct ip_set *set, struct nlattr *tb[], enum ipset_adt adt, u32 *lineno, u32 flags, bool retried) { const struct hash_netport6 *h = set->data; ipset_adtfn adtfn = set->variant->adt[adt]; struct hash_netport6_elem e = { .cidr = HOST_MASK - 1 }; struct ip_set_ext ext = IP_SET_INIT_UEXT(set); u32 port, port_to; bool with_ports = false; u8 cidr; int ret; if (tb[IPSET_ATTR_LINENO]) *lineno = nla_get_u32(tb[IPSET_ATTR_LINENO]); if (unlikely(!tb[IPSET_ATTR_IP] || !ip_set_attr_netorder(tb, IPSET_ATTR_PORT) || !ip_set_optattr_netorder(tb, IPSET_ATTR_PORT_TO) || !ip_set_optattr_netorder(tb, IPSET_ATTR_CADT_FLAGS))) return -IPSET_ERR_PROTOCOL; if (unlikely(tb[IPSET_ATTR_IP_TO])) return -IPSET_ERR_HASH_RANGE_UNSUPPORTED; ret = ip_set_get_ipaddr6(tb[IPSET_ATTR_IP], &e.ip); if (ret) return ret; ret = ip_set_get_extensions(set, tb, &ext); if (ret) return ret; if (tb[IPSET_ATTR_CIDR]) { cidr = nla_get_u8(tb[IPSET_ATTR_CIDR]); if (!cidr || cidr > HOST_MASK) return -IPSET_ERR_INVALID_CIDR; e.cidr = cidr - 1; } ip6_netmask(&e.ip, e.cidr + 1); e.port = nla_get_be16(tb[IPSET_ATTR_PORT]); if (tb[IPSET_ATTR_PROTO]) { e.proto = nla_get_u8(tb[IPSET_ATTR_PROTO]); with_ports = ip_set_proto_with_ports(e.proto); if (e.proto == 0) return -IPSET_ERR_INVALID_PROTO; } else { return -IPSET_ERR_MISSING_PROTO; } if (!(with_ports || e.proto == IPPROTO_ICMPV6)) e.port = 0; if (tb[IPSET_ATTR_CADT_FLAGS]) { u32 cadt_flags = ip_set_get_h32(tb[IPSET_ATTR_CADT_FLAGS]); if (cadt_flags & IPSET_FLAG_NOMATCH) flags |= (IPSET_FLAG_NOMATCH << 16); } if (adt == IPSET_TEST || !with_ports || !tb[IPSET_ATTR_PORT_TO]) { ret = adtfn(set, &e, &ext, &ext, flags); return ip_set_enomatch(ret, flags, adt, set) ? -ret : ip_set_eexist(ret, flags) ? 0 : ret; } port = ntohs(e.port); port_to = ip_set_get_h16(tb[IPSET_ATTR_PORT_TO]); if (port > port_to) swap(port, port_to); if (retried) port = ntohs(h->next.port); for (; port <= port_to; port++) { e.port = htons(port); ret = adtfn(set, &e, &ext, &ext, flags); if (ret && !ip_set_eexist(ret, flags)) return ret; ret = 0; } return ret; } static struct ip_set_type hash_netport_type __read_mostly = { .name = "hash:net,port", .protocol = IPSET_PROTOCOL, .features = IPSET_TYPE_IP | IPSET_TYPE_PORT | IPSET_TYPE_NOMATCH, .dimension = IPSET_DIM_TWO, .family = NFPROTO_UNSPEC, .revision_min = IPSET_TYPE_REV_MIN, .revision_max = IPSET_TYPE_REV_MAX, .create_flags[IPSET_TYPE_REV_MAX] = IPSET_CREATE_FLAG_BUCKETSIZE, .create = hash_netport_create, .create_policy = { [IPSET_ATTR_HASHSIZE] = { .type = NLA_U32 }, [IPSET_ATTR_MAXELEM] = { .type = NLA_U32 }, [IPSET_ATTR_INITVAL] = { .type = NLA_U32 }, [IPSET_ATTR_BUCKETSIZE] = { .type = NLA_U8 }, [IPSET_ATTR_RESIZE] = { .type = NLA_U8 }, [IPSET_ATTR_PROTO] = { .type = NLA_U8 }, [IPSET_ATTR_TIMEOUT] = { .type = NLA_U32 }, [IPSET_ATTR_CADT_FLAGS] = { .type = NLA_U32 }, }, .adt_policy = { [IPSET_ATTR_IP] = { .type = NLA_NESTED }, [IPSET_ATTR_IP_TO] = { .type = NLA_NESTED }, [IPSET_ATTR_PORT] = { .type = NLA_U16 }, [IPSET_ATTR_PORT_TO] = { .type = NLA_U16 }, [IPSET_ATTR_PROTO] = { .type = NLA_U8 }, [IPSET_ATTR_CIDR] = { .type = NLA_U8 }, [IPSET_ATTR_TIMEOUT] = { .type = NLA_U32 }, [IPSET_ATTR_LINENO] = { .type = NLA_U32 }, [IPSET_ATTR_CADT_FLAGS] = { .type = NLA_U32 }, [IPSET_ATTR_BYTES] = { .type = NLA_U64 }, [IPSET_ATTR_PACKETS] = { .type = NLA_U64 }, [IPSET_ATTR_COMMENT] = { .type = NLA_NUL_STRING, .len = IPSET_MAX_COMMENT_SIZE }, [IPSET_ATTR_SKBMARK] = { .type = NLA_U64 }, [IPSET_ATTR_SKBPRIO] = { .type = NLA_U32 }, [IPSET_ATTR_SKBQUEUE] = { .type = NLA_U16 }, }, .me = THIS_MODULE, }; static int __init hash_netport_init(void) { return ip_set_type_register(&hash_netport_type); } static void __exit hash_netport_fini(void) { rcu_barrier(); ip_set_type_unregister(&hash_netport_type); } module_init(hash_netport_init); module_exit(hash_netport_fini); |
2412 | 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 | /* SPDX-License-Identifier: GPL-2.0 */ #ifndef _LINUX_HUGETLB_INLINE_H #define _LINUX_HUGETLB_INLINE_H #ifdef CONFIG_HUGETLB_PAGE #include <linux/mm.h> static inline bool is_vm_hugetlb_page(struct vm_area_struct *vma) { return !!(vma->vm_flags & VM_HUGETLB); } #else static inline bool is_vm_hugetlb_page(struct vm_area_struct *vma) { return false; } #endif #endif |
9 18 | 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60 61 62 63 64 65 66 67 68 69 70 71 72 73 74 75 76 77 78 79 80 81 82 83 84 85 86 87 88 89 90 91 92 93 94 95 96 97 98 99 100 101 102 103 104 105 106 107 108 109 110 111 112 113 114 115 116 117 118 119 120 121 122 123 124 125 126 127 128 129 130 131 132 133 134 135 136 137 138 139 140 141 142 143 144 145 146 147 148 149 150 151 152 153 154 155 156 157 158 159 160 161 162 163 164 165 166 167 168 169 170 171 172 173 174 175 176 177 178 179 180 181 182 183 184 185 186 187 188 189 190 191 192 193 194 195 196 197 198 199 | /* SPDX-License-Identifier: GPL-2.0 */ #ifndef __NET_RTNETLINK_H #define __NET_RTNETLINK_H #include <linux/rtnetlink.h> #include <net/netlink.h> typedef int (*rtnl_doit_func)(struct sk_buff *, struct nlmsghdr *, struct netlink_ext_ack *); typedef int (*rtnl_dumpit_func)(struct sk_buff *, struct netlink_callback *); enum rtnl_link_flags { RTNL_FLAG_DOIT_UNLOCKED = BIT(0), RTNL_FLAG_BULK_DEL_SUPPORTED = BIT(1), }; enum rtnl_kinds { RTNL_KIND_NEW, RTNL_KIND_DEL, RTNL_KIND_GET, RTNL_KIND_SET }; #define RTNL_KIND_MASK 0x3 static inline enum rtnl_kinds rtnl_msgtype_kind(int msgtype) { return msgtype & RTNL_KIND_MASK; } void rtnl_register(int protocol, int msgtype, rtnl_doit_func, rtnl_dumpit_func, unsigned int flags); int rtnl_register_module(struct module *owner, int protocol, int msgtype, rtnl_doit_func, rtnl_dumpit_func, unsigned int flags); int rtnl_unregister(int protocol, int msgtype); void rtnl_unregister_all(int protocol); static inline int rtnl_msg_family(const struct nlmsghdr *nlh) { if (nlmsg_len(nlh) >= sizeof(struct rtgenmsg)) return ((struct rtgenmsg *) nlmsg_data(nlh))->rtgen_family; else return AF_UNSPEC; } /** * struct rtnl_link_ops - rtnetlink link operations * * @list: Used internally * @kind: Identifier * @netns_refund: Physical device, move to init_net on netns exit * @maxtype: Highest device specific netlink attribute number * @policy: Netlink policy for device specific attribute validation * @validate: Optional validation function for netlink/changelink parameters * @alloc: netdev allocation function, can be %NULL and is then used * in place of alloc_netdev_mqs(), in this case @priv_size * and @setup are unused. Returns a netdev or ERR_PTR(). * @priv_size: sizeof net_device private space * @setup: net_device setup function * @newlink: Function for configuring and registering a new device * @changelink: Function for changing parameters of an existing device * @dellink: Function to remove a device * @get_size: Function to calculate required room for dumping device * specific netlink attributes * @fill_info: Function to dump device specific netlink attributes * @get_xstats_size: Function to calculate required room for dumping device * specific statistics * @fill_xstats: Function to dump device specific statistics * @get_num_tx_queues: Function to determine number of transmit queues * to create when creating a new device. * @get_num_rx_queues: Function to determine number of receive queues * to create when creating a new device. * @get_link_net: Function to get the i/o netns of the device * @get_linkxstats_size: Function to calculate the required room for * dumping device-specific extended link stats * @fill_linkxstats: Function to dump device-specific extended link stats */ struct rtnl_link_ops { struct list_head list; const char *kind; size_t priv_size; struct net_device *(*alloc)(struct nlattr *tb[], const char *ifname, unsigned char name_assign_type, unsigned int num_tx_queues, unsigned int num_rx_queues); void (*setup)(struct net_device *dev); bool netns_refund; unsigned int maxtype; const struct nla_policy *policy; int (*validate)(struct nlattr *tb[], struct nlattr *data[], struct netlink_ext_ack *extack); int (*newlink)(struct net *src_net, struct net_device *dev, struct nlattr *tb[], struct nlattr *data[], struct netlink_ext_ack *extack); int (*changelink)(struct net_device *dev, struct nlattr *tb[], struct nlattr *data[], struct netlink_ext_ack *extack); void (*dellink)(struct net_device *dev, struct list_head *head); size_t (*get_size)(const struct net_device *dev); int (*fill_info)(struct sk_buff *skb, const struct net_device *dev); size_t (*get_xstats_size)(const struct net_device *dev); int (*fill_xstats)(struct sk_buff *skb, const struct net_device *dev); unsigned int (*get_num_tx_queues)(void); unsigned int (*get_num_rx_queues)(void); unsigned int slave_maxtype; const struct nla_policy *slave_policy; int (*slave_changelink)(struct net_device *dev, struct net_device *slave_dev, struct nlattr *tb[], struct nlattr *data[], struct netlink_ext_ack *extack); size_t (*get_slave_size)(const struct net_device *dev, const struct net_device *slave_dev); int (*fill_slave_info)(struct sk_buff *skb, const struct net_device *dev, const struct net_device *slave_dev); struct net *(*get_link_net)(const struct net_device *dev); size_t (*get_linkxstats_size)(const struct net_device *dev, int attr); int (*fill_linkxstats)(struct sk_buff *skb, const struct net_device *dev, int *prividx, int attr); }; int __rtnl_link_register(struct rtnl_link_ops *ops); void __rtnl_link_unregister(struct rtnl_link_ops *ops); int rtnl_link_register(struct rtnl_link_ops *ops); void rtnl_link_unregister(struct rtnl_link_ops *ops); /** * struct rtnl_af_ops - rtnetlink address family operations * * @list: Used internally * @family: Address family * @fill_link_af: Function to fill IFLA_AF_SPEC with address family * specific netlink attributes. * @get_link_af_size: Function to calculate size of address family specific * netlink attributes. * @validate_link_af: Validate a IFLA_AF_SPEC attribute, must check attr * for invalid configuration settings. * @set_link_af: Function to parse a IFLA_AF_SPEC attribute and modify * net_device accordingly. */ struct rtnl_af_ops { struct list_head list; int family; int (*fill_link_af)(struct sk_buff *skb, const struct net_device *dev, u32 ext_filter_mask); size_t (*get_link_af_size)(const struct net_device *dev, u32 ext_filter_mask); int (*validate_link_af)(const struct net_device *dev, const struct nlattr *attr, struct netlink_ext_ack *extack); int (*set_link_af)(struct net_device *dev, const struct nlattr *attr, struct netlink_ext_ack *extack); int (*fill_stats_af)(struct sk_buff *skb, const struct net_device *dev); size_t (*get_stats_af_size)(const struct net_device *dev); }; void rtnl_af_register(struct rtnl_af_ops *ops); void rtnl_af_unregister(struct rtnl_af_ops *ops); struct net *rtnl_link_get_net(struct net *src_net, struct nlattr *tb[]); struct net_device *rtnl_create_link(struct net *net, const char *ifname, unsigned char name_assign_type, const struct rtnl_link_ops *ops, struct nlattr *tb[], struct netlink_ext_ack *extack); int rtnl_delete_link(struct net_device *dev, u32 portid, const struct nlmsghdr *nlh); int rtnl_configure_link(struct net_device *dev, const struct ifinfomsg *ifm, u32 portid, const struct nlmsghdr *nlh); int rtnl_nla_parse_ifinfomsg(struct nlattr **tb, const struct nlattr *nla_peer, struct netlink_ext_ack *exterr); struct net *rtnl_get_net_ns_capable(struct sock *sk, int netnsid); #define MODULE_ALIAS_RTNL_LINK(kind) MODULE_ALIAS("rtnl-link-" kind) #endif |
4 1 6 2 5 4 3 2 3 2 8 3 3 3 3 2 3 2 3 2 3 3 3 3 3 3 3 3 3 3 3 3 3 3 2 2 2 2 2 2 1 1 1 3 3 3 3 2 1 2 1 2 2 2 2 1 1 2 2 2 2 2 2 2 2 10 8 2 2 2 3 2 2 2 2 2 9 3 2 1 3 3 3 3 3 3 2 2 2 2 6 1 1 1 5 1 1 4 3 2 2 10 3 2 2 2 2 2 2 1 2 1 | 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60 61 62 63 64 65 66 67 68 69 70 71 72 73 74 75 76 77 78 79 80 81 82 83 84 85 86 87 88 89 90 91 92 93 94 95 96 97 98 99 100 101 102 103 104 105 106 107 108 109 110 111 112 113 114 115 116 117 118 119 120 121 122 123 124 125 126 127 128 129 130 131 132 133 134 135 136 137 138 139 140 141 142 143 144 145 146 147 148 149 150 151 152 153 154 155 156 157 158 159 160 161 162 163 164 165 166 167 168 169 170 171 172 173 174 175 176 177 178 179 180 181 182 183 184 185 186 187 188 189 190 191 192 193 194 195 196 197 198 199 200 201 202 203 204 205 206 207 208 209 210 211 212 213 214 215 216 217 218 219 220 221 222 223 224 225 226 227 228 229 230 231 232 233 234 235 236 237 238 239 240 241 242 243 244 245 246 247 248 249 250 251 252 253 254 255 256 257 258 259 260 261 262 263 264 265 266 267 268 269 270 271 272 273 274 275 276 277 278 279 280 281 282 283 284 285 286 287 288 289 290 291 292 293 294 295 296 297 298 299 300 301 302 303 304 305 306 307 308 309 310 311 312 313 314 315 316 317 318 319 320 321 322 323 324 325 326 327 328 329 330 331 332 333 334 335 336 337 338 339 340 341 342 343 344 345 346 347 348 349 350 351 352 353 354 355 356 357 358 359 360 361 362 363 364 365 366 367 368 369 370 371 372 373 374 375 376 377 378 379 380 381 382 383 384 385 386 387 388 389 390 391 392 393 394 395 396 397 398 399 400 401 402 403 404 405 406 407 408 409 410 411 412 413 414 415 416 417 418 419 420 421 422 423 424 425 426 427 428 429 430 431 432 433 434 435 436 437 438 439 440 441 442 443 444 445 446 447 448 449 450 451 452 453 454 455 456 457 458 459 460 461 462 463 464 465 466 467 468 469 470 471 472 473 474 475 476 477 478 479 480 481 482 483 484 485 486 487 488 489 490 491 492 493 494 495 496 497 498 499 500 501 502 503 504 505 506 507 508 509 510 511 512 513 514 515 516 517 518 519 520 521 522 523 524 525 526 527 528 529 530 531 532 533 534 535 536 537 538 539 540 541 542 543 544 545 546 547 548 549 550 551 552 553 554 555 556 557 558 559 560 561 562 563 564 565 566 567 568 569 570 571 572 573 574 575 576 577 578 579 580 581 582 583 584 585 586 587 588 589 590 591 592 593 594 595 596 597 598 599 600 601 602 603 604 605 606 607 608 609 610 611 612 613 614 615 616 617 618 619 620 621 622 623 624 625 626 627 628 629 630 631 632 633 634 635 636 637 638 639 640 641 642 643 644 645 646 647 648 649 650 651 652 653 654 655 656 657 658 659 660 661 662 663 664 665 666 667 668 669 670 671 672 673 674 675 676 677 678 679 680 681 682 683 684 685 686 687 688 689 690 691 692 693 694 695 696 697 698 699 700 701 702 703 704 705 706 707 708 709 710 711 712 713 714 715 716 717 718 719 720 721 722 723 724 725 726 727 728 729 730 731 732 733 734 735 736 737 738 739 740 741 742 743 744 745 746 747 748 749 750 751 752 753 754 755 756 757 758 759 760 761 762 763 764 765 766 767 768 769 770 771 772 773 774 775 776 777 778 779 780 781 782 783 784 785 786 787 788 789 790 791 792 793 794 795 796 797 798 799 800 801 802 803 804 805 806 807 808 809 810 811 812 813 814 815 816 817 818 819 820 821 822 823 824 825 826 827 828 829 830 831 832 833 834 835 836 837 838 839 840 841 842 843 844 845 846 847 848 849 850 851 852 853 854 855 856 857 858 859 860 861 862 863 864 865 866 867 868 869 870 871 872 873 874 875 876 877 878 879 880 881 882 883 884 885 886 887 888 889 890 891 892 893 894 895 896 897 898 899 900 901 902 903 904 905 906 907 908 909 910 911 912 913 914 915 916 917 918 919 920 921 922 923 924 925 926 927 928 929 930 931 932 933 934 935 936 937 938 939 940 941 942 943 944 945 946 947 948 949 950 951 952 953 954 955 956 957 958 959 960 961 962 963 964 965 966 967 968 969 970 971 972 973 974 975 976 977 978 979 980 981 982 983 984 985 986 987 988 989 990 991 992 993 994 995 996 997 998 999 1000 1001 1002 1003 1004 1005 1006 1007 1008 1009 1010 1011 1012 1013 1014 1015 1016 1017 1018 1019 1020 1021 1022 1023 1024 1025 1026 1027 1028 1029 1030 1031 1032 1033 1034 1035 1036 1037 1038 1039 1040 1041 1042 1043 1044 1045 1046 1047 1048 1049 1050 1051 1052 1053 1054 1055 1056 1057 1058 1059 1060 1061 1062 1063 1064 1065 1066 1067 1068 1069 1070 1071 1072 1073 1074 1075 1076 1077 1078 1079 1080 1081 1082 1083 1084 1085 1086 1087 1088 1089 1090 1091 1092 1093 1094 1095 1096 1097 1098 1099 1100 1101 1102 1103 1104 1105 1106 1107 1108 1109 1110 1111 1112 1113 1114 1115 1116 1117 1118 1119 1120 1121 1122 1123 1124 1125 1126 1127 1128 1129 1130 1131 1132 1133 1134 1135 1136 1137 1138 1139 1140 1141 1142 1143 1144 1145 1146 1147 1148 1149 1150 1151 1152 1153 1154 1155 1156 1157 1158 1159 1160 1161 1162 1163 1164 1165 1166 1167 1168 1169 1170 1171 1172 1173 1174 1175 1176 1177 1178 1179 1180 1181 1182 1183 1184 1185 1186 1187 1188 1189 1190 1191 1192 1193 1194 1195 1196 1197 1198 1199 1200 1201 1202 1203 1204 1205 1206 1207 1208 1209 1210 1211 1212 1213 1214 1215 1216 1217 1218 1219 1220 1221 1222 1223 1224 1225 1226 1227 1228 1229 1230 1231 1232 1233 1234 1235 1236 1237 1238 1239 1240 1241 1242 1243 1244 1245 1246 1247 1248 1249 1250 1251 1252 1253 1254 1255 1256 1257 1258 1259 1260 1261 1262 1263 1264 1265 1266 1267 1268 1269 1270 1271 1272 1273 1274 1275 1276 1277 1278 1279 1280 1281 1282 1283 1284 1285 1286 1287 1288 1289 1290 1291 1292 1293 1294 1295 1296 1297 1298 1299 1300 1301 1302 1303 1304 1305 1306 1307 1308 1309 1310 1311 1312 1313 1314 1315 1316 1317 1318 1319 1320 1321 1322 1323 1324 1325 1326 1327 1328 1329 1330 1331 1332 1333 1334 1335 1336 1337 1338 1339 1340 1341 1342 1343 1344 1345 1346 1347 1348 1349 1350 1351 1352 1353 1354 1355 1356 1357 1358 1359 1360 1361 1362 1363 1364 1365 1366 1367 1368 1369 1370 1371 1372 1373 1374 1375 1376 1377 1378 1379 1380 1381 1382 1383 1384 1385 1386 1387 1388 1389 1390 1391 1392 1393 1394 1395 1396 1397 1398 1399 1400 1401 1402 1403 1404 1405 1406 1407 1408 1409 1410 1411 1412 1413 1414 1415 1416 1417 1418 1419 1420 1421 1422 1423 1424 1425 1426 1427 1428 1429 1430 1431 1432 1433 1434 1435 1436 1437 1438 1439 1440 1441 1442 1443 1444 1445 1446 1447 1448 1449 1450 1451 1452 1453 1454 1455 1456 1457 1458 1459 1460 1461 1462 1463 1464 1465 1466 1467 1468 1469 1470 1471 1472 1473 1474 1475 1476 1477 1478 1479 1480 1481 1482 1483 1484 1485 1486 1487 1488 1489 1490 1491 1492 1493 1494 1495 1496 1497 1498 1499 1500 1501 1502 1503 1504 1505 1506 1507 1508 1509 1510 1511 1512 1513 1514 1515 1516 1517 1518 1519 1520 1521 1522 1523 1524 1525 1526 1527 1528 1529 1530 1531 1532 1533 1534 1535 1536 1537 1538 1539 1540 1541 1542 1543 1544 1545 1546 1547 1548 1549 1550 1551 1552 1553 1554 1555 1556 1557 1558 1559 | // SPDX-License-Identifier: MIT /* * vgaarb.c: Implements VGA arbitration. For details refer to * Documentation/gpu/vgaarbiter.rst * * (C) Copyright 2005 Benjamin Herrenschmidt <benh@kernel.crashing.org> * (C) Copyright 2007 Paulo R. Zanoni <przanoni@gmail.com> * (C) Copyright 2007, 2009 Tiago Vignatti <vignatti@freedesktop.org> */ #define pr_fmt(fmt) "vgaarb: " fmt #define vgaarb_dbg(dev, fmt, arg...) dev_dbg(dev, "vgaarb: " fmt, ##arg) #define vgaarb_info(dev, fmt, arg...) dev_info(dev, "vgaarb: " fmt, ##arg) #define vgaarb_err(dev, fmt, arg...) dev_err(dev, "vgaarb: " fmt, ##arg) #include <linux/module.h> #include <linux/kernel.h> #include <linux/pci.h> #include <linux/errno.h> #include <linux/init.h> #include <linux/list.h> #include <linux/sched/signal.h> #include <linux/wait.h> #include <linux/spinlock.h> #include <linux/poll.h> #include <linux/miscdevice.h> #include <linux/slab.h> #include <linux/screen_info.h> #include <linux/vt.h> #include <linux/console.h> #include <linux/acpi.h> #include <linux/uaccess.h> #include <linux/vgaarb.h> static void vga_arbiter_notify_clients(void); /* * We keep a list of all VGA devices in the system to speed * up the various operations of the arbiter */ struct vga_device { struct list_head list; struct pci_dev *pdev; unsigned int decodes; /* what it decodes */ unsigned int owns; /* what it owns */ unsigned int locks; /* what it locks */ unsigned int io_lock_cnt; /* legacy IO lock count */ unsigned int mem_lock_cnt; /* legacy MEM lock count */ unsigned int io_norm_cnt; /* normal IO count */ unsigned int mem_norm_cnt; /* normal MEM count */ bool bridge_has_one_vga; bool is_firmware_default; /* device selected by firmware */ unsigned int (*set_decode)(struct pci_dev *pdev, bool decode); }; static LIST_HEAD(vga_list); static int vga_count, vga_decode_count; static bool vga_arbiter_used; static DEFINE_SPINLOCK(vga_lock); static DECLARE_WAIT_QUEUE_HEAD(vga_wait_queue); static const char *vga_iostate_to_str(unsigned int iostate) { /* Ignore VGA_RSRC_IO and VGA_RSRC_MEM */ iostate &= VGA_RSRC_LEGACY_IO | VGA_RSRC_LEGACY_MEM; switch (iostate) { case VGA_RSRC_LEGACY_IO | VGA_RSRC_LEGACY_MEM: return "io+mem"; case VGA_RSRC_LEGACY_IO: return "io"; case VGA_RSRC_LEGACY_MEM: return "mem"; } return "none"; } static int vga_str_to_iostate(char *buf, int str_size, unsigned int *io_state) { /* * In theory, we could hand out locks on IO and MEM separately to * userspace, but this can cause deadlocks. */ if (strncmp(buf, "none", 4) == 0) { *io_state = VGA_RSRC_NONE; return 1; } /* XXX We're not checking the str_size! */ if (strncmp(buf, "io+mem", 6) == 0) goto both; else if (strncmp(buf, "io", 2) == 0) goto both; else if (strncmp(buf, "mem", 3) == 0) goto both; return 0; both: *io_state = VGA_RSRC_LEGACY_IO | VGA_RSRC_LEGACY_MEM; return 1; } /* This is only used as a cookie, it should not be dereferenced */ static struct pci_dev *vga_default; /* Find somebody in our list */ static struct vga_device *vgadev_find(struct pci_dev *pdev) { struct vga_device *vgadev; list_for_each_entry(vgadev, &vga_list, list) if (pdev == vgadev->pdev) return vgadev; return NULL; } /** * vga_default_device - return the default VGA device, for vgacon * * This can be defined by the platform. The default implementation is * rather dumb and will probably only work properly on single VGA card * setups and/or x86 platforms. * * If your VGA default device is not PCI, you'll have to return NULL here. * In this case, I assume it will not conflict with any PCI card. If this * is not true, I'll have to define two arch hooks for enabling/disabling * the VGA default device if that is possible. This may be a problem with * real _ISA_ VGA cards, in addition to a PCI one. I don't know at this * point how to deal with that card. Can their IOs be disabled at all? If * not, then I suppose it's a matter of having the proper arch hook telling * us about it, so we basically never allow anybody to succeed a vga_get(). */ struct pci_dev *vga_default_device(void) { return vga_default; } EXPORT_SYMBOL_GPL(vga_default_device); void vga_set_default_device(struct pci_dev *pdev) { if (vga_default == pdev) return; pci_dev_put(vga_default); vga_default = pci_dev_get(pdev); } /** * vga_remove_vgacon - deactivate VGA console * * Unbind and unregister vgacon in case pdev is the default VGA device. * Can be called by GPU drivers on initialization to make sure VGA register * access done by vgacon will not disturb the device. * * @pdev: PCI device. */ #if !defined(CONFIG_VGA_CONSOLE) int vga_remove_vgacon(struct pci_dev *pdev) { return 0; } #elif !defined(CONFIG_DUMMY_CONSOLE) int vga_remove_vgacon(struct pci_dev *pdev) { return -ENODEV; } #else int vga_remove_vgacon(struct pci_dev *pdev) { int ret = 0; if (pdev != vga_default) return 0; vgaarb_info(&pdev->dev, "deactivate vga console\n"); console_lock(); if (con_is_bound(&vga_con)) ret = do_take_over_console(&dummy_con, 0, MAX_NR_CONSOLES - 1, 1); if (ret == 0) { ret = do_unregister_con_driver(&vga_con); /* Ignore "already unregistered". */ if (ret == -ENODEV) ret = 0; } console_unlock(); return ret; } #endif EXPORT_SYMBOL(vga_remove_vgacon); /* * If we don't ever use VGA arbitration, we should avoid turning off * anything anywhere due to old X servers getting confused about the boot * device not being VGA. */ static void vga_check_first_use(void) { /* * Inform all GPUs in the system that VGA arbitration has occurred * so they can disable resources if possible. */ if (!vga_arbiter_used) { vga_arbiter_used = true; vga_arbiter_notify_clients(); } } static struct vga_device *__vga_tryget(struct vga_device *vgadev, unsigned int rsrc) { struct device *dev = &vgadev->pdev->dev; unsigned int wants, legacy_wants, match; struct vga_device *conflict; unsigned int pci_bits; u32 flags = 0; /* * Account for "normal" resources to lock. If we decode the legacy, * counterpart, we need to request it as well */ if ((rsrc & VGA_RSRC_NORMAL_IO) && (vgadev->decodes & VGA_RSRC_LEGACY_IO)) rsrc |= VGA_RSRC_LEGACY_IO; if ((rsrc & VGA_RSRC_NORMAL_MEM) && (vgadev->decodes & VGA_RSRC_LEGACY_MEM)) rsrc |= VGA_RSRC_LEGACY_MEM; vgaarb_dbg(dev, "%s: %d\n", __func__, rsrc); vgaarb_dbg(dev, "%s: owns: %d\n", __func__, vgadev->owns); /* Check what resources we need to acquire */ wants = rsrc & ~vgadev->owns; /* We already own everything, just mark locked & bye bye */ if (wants == 0) goto lock_them; /* * We don't need to request a legacy resource, we just enable * appropriate decoding and go. */ legacy_wants = wants & VGA_RSRC_LEGACY_MASK; if (legacy_wants == 0) goto enable_them; /* Ok, we don't, let's find out who we need to kick off */ list_for_each_entry(conflict, &vga_list, list) { unsigned int lwants = legacy_wants; unsigned int change_bridge = 0; /* Don't conflict with myself */ if (vgadev == conflict) continue; /* * We have a possible conflict. Before we go further, we must * check if we sit on the same bus as the conflicting device. * If we don't, then we must tie both IO and MEM resources * together since there is only a single bit controlling * VGA forwarding on P2P bridges. */ if (vgadev->pdev->bus != conflict->pdev->bus) { change_bridge = 1; lwants = VGA_RSRC_LEGACY_IO | VGA_RSRC_LEGACY_MEM; } /* * Check if the guy has a lock on the resource. If he does, * return the conflicting entry. */ if (conflict->locks & lwants) return conflict; /* * Ok, now check if it owns the resource we want. We can * lock resources that are not decoded; therefore a device * can own resources it doesn't decode. */ match = lwants & conflict->owns; if (!match) continue; /* * Looks like he doesn't have a lock, we can steal them * from him. */ flags = 0; pci_bits = 0; /* * If we can't control legacy resources via the bridge, we * also need to disable normal decoding. */ if (!conflict->bridge_has_one_vga) { if ((match & conflict->decodes) & VGA_RSRC_LEGACY_MEM) pci_bits |= PCI_COMMAND_MEMORY; if ((match & conflict->decodes) & VGA_RSRC_LEGACY_IO) pci_bits |= PCI_COMMAND_IO; if (pci_bits) flags |= PCI_VGA_STATE_CHANGE_DECODES; } if (change_bridge) flags |= PCI_VGA_STATE_CHANGE_BRIDGE; pci_set_vga_state(conflict->pdev, false, pci_bits, flags); conflict->owns &= ~match; /* If we disabled normal decoding, reflect it in owns */ if (pci_bits & PCI_COMMAND_MEMORY) conflict->owns &= ~VGA_RSRC_NORMAL_MEM; if (pci_bits & PCI_COMMAND_IO) conflict->owns &= ~VGA_RSRC_NORMAL_IO; } enable_them: /* * Ok, we got it, everybody conflicting has been disabled, let's * enable us. Mark any bits in "owns" regardless of whether we * decoded them. We can lock resources we don't decode, therefore * we must track them via "owns". */ flags = 0; pci_bits = 0; if (!vgadev->bridge_has_one_vga) { flags |= PCI_VGA_STATE_CHANGE_DECODES; if (wants & (VGA_RSRC_LEGACY_MEM|VGA_RSRC_NORMAL_MEM)) pci_bits |= PCI_COMMAND_MEMORY; if (wants & (VGA_RSRC_LEGACY_IO|VGA_RSRC_NORMAL_IO)) pci_bits |= PCI_COMMAND_IO; } if (wants & VGA_RSRC_LEGACY_MASK) flags |= PCI_VGA_STATE_CHANGE_BRIDGE; pci_set_vga_state(vgadev->pdev, true, pci_bits, flags); vgadev->owns |= wants; lock_them: vgadev->locks |= (rsrc & VGA_RSRC_LEGACY_MASK); if (rsrc & VGA_RSRC_LEGACY_IO) vgadev->io_lock_cnt++; if (rsrc & VGA_RSRC_LEGACY_MEM) vgadev->mem_lock_cnt++; if (rsrc & VGA_RSRC_NORMAL_IO) vgadev->io_norm_cnt++; if (rsrc & VGA_RSRC_NORMAL_MEM) vgadev->mem_norm_cnt++; return NULL; } static void __vga_put(struct vga_device *vgadev, unsigned int rsrc) { struct device *dev = &vgadev->pdev->dev; unsigned int old_locks = vgadev->locks; vgaarb_dbg(dev, "%s\n", __func__); /* * Update our counters and account for equivalent legacy resources * if we decode them. */ if ((rsrc & VGA_RSRC_NORMAL_IO) && vgadev->io_norm_cnt > 0) { vgadev->io_norm_cnt--; if (vgadev->decodes & VGA_RSRC_LEGACY_IO) rsrc |= VGA_RSRC_LEGACY_IO; } if ((rsrc & VGA_RSRC_NORMAL_MEM) && vgadev->mem_norm_cnt > 0) { vgadev->mem_norm_cnt--; if (vgadev->decodes & VGA_RSRC_LEGACY_MEM) rsrc |= VGA_RSRC_LEGACY_MEM; } if ((rsrc & VGA_RSRC_LEGACY_IO) && vgadev->io_lock_cnt > 0) vgadev->io_lock_cnt--; if ((rsrc & VGA_RSRC_LEGACY_MEM) && vgadev->mem_lock_cnt > 0) vgadev->mem_lock_cnt--; /* * Just clear lock bits, we do lazy operations so we don't really * have to bother about anything else at this point. */ if (vgadev->io_lock_cnt == 0) vgadev->locks &= ~VGA_RSRC_LEGACY_IO; if (vgadev->mem_lock_cnt == 0) vgadev->locks &= ~VGA_RSRC_LEGACY_MEM; /* * Kick the wait queue in case somebody was waiting if we actually * released something. */ if (old_locks != vgadev->locks) wake_up_all(&vga_wait_queue); } /** * vga_get - acquire & lock VGA resources * @pdev: PCI device of the VGA card or NULL for the system default * @rsrc: bit mask of resources to acquire and lock * @interruptible: blocking should be interruptible by signals ? * * Acquire VGA resources for the given card and mark those resources * locked. If the resources requested are "normal" (and not legacy) * resources, the arbiter will first check whether the card is doing legacy * decoding for that type of resource. If yes, the lock is "converted" into * a legacy resource lock. * * The arbiter will first look for all VGA cards that might conflict and disable * their IOs and/or Memory access, including VGA forwarding on P2P bridges if * necessary, so that the requested resources can be used. Then, the card is * marked as locking these resources and the IO and/or Memory accesses are * enabled on the card (including VGA forwarding on parent P2P bridges if any). * * This function will block if some conflicting card is already locking one of * the required resources (or any resource on a different bus segment, since P2P * bridges don't differentiate VGA memory and IO afaik). You can indicate * whether this blocking should be interruptible by a signal (for userland * interface) or not. * * Must not be called at interrupt time or in atomic context. If the card * already owns the resources, the function succeeds. Nested calls are * supported (a per-resource counter is maintained) * * On success, release the VGA resource again with vga_put(). * * Returns: * * 0 on success, negative error code on failure. */ int vga_get(struct pci_dev *pdev, unsigned int rsrc, int interruptible) { struct vga_device *vgadev, *conflict; unsigned long flags; wait_queue_entry_t wait; int rc = 0; vga_check_first_use(); /* The caller should check for this, but let's be sure */ if (pdev == NULL) pdev = vga_default_device(); if (pdev == NULL) return 0; for (;;) { spin_lock_irqsave(&vga_lock, flags); vgadev = vgadev_find(pdev); if (vgadev == NULL) { spin_unlock_irqrestore(&vga_lock, flags); rc = -ENODEV; break; } conflict = __vga_tryget(vgadev, rsrc); spin_unlock_irqrestore(&vga_lock, flags); if (conflict == NULL) break; /* * We have a conflict; we wait until somebody kicks the * work queue. Currently we have one work queue that we * kick each time some resources are released, but it would * be fairly easy to have a per-device one so that we only * need to attach to the conflicting device. */ init_waitqueue_entry(&wait, current); add_wait_queue(&vga_wait_queue, &wait); set_current_state(interruptible ? TASK_INTERRUPTIBLE : TASK_UNINTERRUPTIBLE); if (interruptible && signal_pending(current)) { __set_current_state(TASK_RUNNING); remove_wait_queue(&vga_wait_queue, &wait); rc = -ERESTARTSYS; break; } schedule(); remove_wait_queue(&vga_wait_queue, &wait); } return rc; } EXPORT_SYMBOL(vga_get); /** * vga_tryget - try to acquire & lock legacy VGA resources * @pdev: PCI device of VGA card or NULL for system default * @rsrc: bit mask of resources to acquire and lock * * Perform the same operation as vga_get(), but return an error (-EBUSY) * instead of blocking if the resources are already locked by another card. * Can be called in any context. * * On success, release the VGA resource again with vga_put(). * * Returns: * * 0 on success, negative error code on failure. */ static int vga_tryget(struct pci_dev *pdev, unsigned int rsrc) { struct vga_device *vgadev; unsigned long flags; int rc = 0; vga_check_first_use(); /* The caller should check for this, but let's be sure */ if (pdev == NULL) pdev = vga_default_device(); if (pdev == NULL) return 0; spin_lock_irqsave(&vga_lock, flags); vgadev = vgadev_find(pdev); if (vgadev == NULL) { rc = -ENODEV; goto bail; } if (__vga_tryget(vgadev, rsrc)) rc = -EBUSY; bail: spin_unlock_irqrestore(&vga_lock, flags); return rc; } /** * vga_put - release lock on legacy VGA resources * @pdev: PCI device of VGA card or NULL for system default * @rsrc: bit mask of resource to release * * Release resources previously locked by vga_get() or vga_tryget(). The * resources aren't disabled right away, so that a subsequent vga_get() on * the same card will succeed immediately. Resources have a counter, so * locks are only released if the counter reaches 0. */ void vga_put(struct pci_dev *pdev, unsigned int rsrc) { struct vga_device *vgadev; unsigned long flags; /* The caller should check for this, but let's be sure */ if (pdev == NULL) pdev = vga_default_device(); if (pdev == NULL) return; spin_lock_irqsave(&vga_lock, flags); vgadev = vgadev_find(pdev); if (vgadev == NULL) goto bail; __vga_put(vgadev, rsrc); bail: spin_unlock_irqrestore(&vga_lock, flags); } EXPORT_SYMBOL(vga_put); static bool vga_is_firmware_default(struct pci_dev *pdev) { #if defined(CONFIG_X86) || defined(CONFIG_IA64) u64 base = screen_info.lfb_base; u64 size = screen_info.lfb_size; struct resource *r; u64 limit; /* Select the device owning the boot framebuffer if there is one */ if (screen_info.capabilities & VIDEO_CAPABILITY_64BIT_BASE) base |= (u64)screen_info.ext_lfb_base << 32; limit = base + size; /* Does firmware framebuffer belong to us? */ pci_dev_for_each_resource(pdev, r) { if (resource_type(r) != IORESOURCE_MEM) continue; if (!r->start || !r->end) continue; if (base < r->start || limit >= r->end) continue; return true; } #endif return false; } static bool vga_arb_integrated_gpu(struct device *dev) { #if defined(CONFIG_ACPI) struct acpi_device *adev = ACPI_COMPANION(dev); return adev && !strcmp(acpi_device_hid(adev), ACPI_VIDEO_HID); #else return false; #endif } /* * Return true if vgadev is a better default VGA device than the best one * we've seen so far. */ static bool vga_is_boot_device(struct vga_device *vgadev) { struct vga_device *boot_vga = vgadev_find(vga_default_device()); struct pci_dev *pdev = vgadev->pdev; u16 cmd, boot_cmd; /* * We select the default VGA device in this order: * Firmware framebuffer (see vga_arb_select_default_device()) * Legacy VGA device (owns VGA_RSRC_LEGACY_MASK) * Non-legacy integrated device (see vga_arb_select_default_device()) * Non-legacy discrete device (see vga_arb_select_default_device()) * Other device (see vga_arb_select_default_device()) */ /* * We always prefer a firmware default device, so if we've already * found one, there's no need to consider vgadev. */ if (boot_vga && boot_vga->is_firmware_default) return false; if (vga_is_firmware_default(pdev)) { vgadev->is_firmware_default = true; return true; } /* * A legacy VGA device has MEM and IO enabled and any bridges * leading to it have PCI_BRIDGE_CTL_VGA enabled so the legacy * resources ([mem 0xa0000-0xbffff], [io 0x3b0-0x3bb], etc) are * routed to it. * * We use the first one we find, so if we've already found one, * vgadev is no better. */ if (boot_vga && (boot_vga->owns & VGA_RSRC_LEGACY_MASK) == VGA_RSRC_LEGACY_MASK) return false; if ((vgadev->owns & VGA_RSRC_LEGACY_MASK) == VGA_RSRC_LEGACY_MASK) return true; /* * If we haven't found a legacy VGA device, accept a non-legacy * device. It may have either IO or MEM enabled, and bridges may * not have PCI_BRIDGE_CTL_VGA enabled, so it may not be able to * use legacy VGA resources. Prefer an integrated GPU over others. */ pci_read_config_word(pdev, PCI_COMMAND, &cmd); if (cmd & (PCI_COMMAND_IO | PCI_COMMAND_MEMORY)) { /* * An integrated GPU overrides a previous non-legacy * device. We expect only a single integrated GPU, but if * there are more, we use the *last* because that was the * previous behavior. */ if (vga_arb_integrated_gpu(&pdev->dev)) return true; /* * We prefer the first non-legacy discrete device we find. * If we already found one, vgadev is no better. */ if (boot_vga) { pci_read_config_word(boot_vga->pdev, PCI_COMMAND, &boot_cmd); if (boot_cmd & (PCI_COMMAND_IO | PCI_COMMAND_MEMORY)) return false; } return true; } /* * Vgadev has neither IO nor MEM enabled. If we haven't found any * other VGA devices, it is the best candidate so far. */ if (!boot_vga) return true; return false; } /* * Rules for using a bridge to control a VGA descendant decoding: if a bridge * has only one VGA descendant then it can be used to control the VGA routing * for that device. It should always use the bridge closest to the device to * control it. If a bridge has a direct VGA descendant, but also have a sub- * bridge VGA descendant then we cannot use that bridge to control the direct * VGA descendant. So for every device we register, we need to iterate all * its parent bridges so we can invalidate any devices using them properly. */ static void vga_arbiter_check_bridge_sharing(struct vga_device *vgadev) { struct vga_device *same_bridge_vgadev; struct pci_bus *new_bus, *bus; struct pci_dev *new_bridge, *bridge; vgadev->bridge_has_one_vga = true; if (list_empty(&vga_list)) { vgaarb_info(&vgadev->pdev->dev, "bridge control possible\n"); return; } /* Iterate the new device's bridge hierarchy */ new_bus = vgadev->pdev->bus; while (new_bus) { new_bridge = new_bus->self; /* Go through list of devices already registered */ list_for_each_entry(same_bridge_vgadev, &vga_list, list) { bus = same_bridge_vgadev->pdev->bus; bridge = bus->self; /* See if it shares a bridge with this device */ if (new_bridge == bridge) { /* * If its direct parent bridge is the same * as any bridge of this device then it can't * be used for that device. */ same_bridge_vgadev->bridge_has_one_vga = false; } /* * Now iterate the previous device's bridge hierarchy. * If the new device's parent bridge is in the other * device's hierarchy, we can't use it to control this * device. */ while (bus) { bridge = bus->self; if (bridge && bridge == vgadev->pdev->bus->self) vgadev->bridge_has_one_vga = false; bus = bus->parent; } } new_bus = new_bus->parent; } if (vgadev->bridge_has_one_vga) vgaarb_info(&vgadev->pdev->dev, "bridge control possible\n"); else vgaarb_info(&vgadev->pdev->dev, "no bridge control possible\n"); } /* * Currently, we assume that the "initial" setup of the system is not sane, * that is, we come up with conflicting devices and let the arbiter's * client decide if devices decodes legacy things or not. */ static bool vga_arbiter_add_pci_device(struct pci_dev *pdev) { struct vga_device *vgadev; unsigned long flags; struct pci_bus *bus; struct pci_dev *bridge; u16 cmd; /* Only deal with VGA class devices */ if ((pdev->class >> 8) != PCI_CLASS_DISPLAY_VGA) return false; /* Allocate structure */ vgadev = kzalloc(sizeof(struct vga_device), GFP_KERNEL); if (vgadev == NULL) { vgaarb_err(&pdev->dev, "failed to allocate VGA arbiter data\n"); /* * What to do on allocation failure? For now, let's just do * nothing, I'm not sure there is anything saner to be done. */ return false; } /* Take lock & check for duplicates */ spin_lock_irqsave(&vga_lock, flags); if (vgadev_find(pdev) != NULL) { BUG_ON(1); goto fail; } vgadev->pdev = pdev; /* By default, assume we decode everything */ vgadev->decodes = VGA_RSRC_LEGACY_IO | VGA_RSRC_LEGACY_MEM | VGA_RSRC_NORMAL_IO | VGA_RSRC_NORMAL_MEM; /* By default, mark it as decoding */ vga_decode_count++; /* * Mark that we "own" resources based on our enables, we will * clear that below if the bridge isn't forwarding. */ pci_read_config_word(pdev, PCI_COMMAND, &cmd); if (cmd & PCI_COMMAND_IO) vgadev->owns |= VGA_RSRC_LEGACY_IO; if (cmd & PCI_COMMAND_MEMORY) vgadev->owns |= VGA_RSRC_LEGACY_MEM; /* Check if VGA cycles can get down to us */ bus = pdev->bus; while (bus) { bridge = bus->self; if (bridge) { u16 l; pci_read_config_word(bridge, PCI_BRIDGE_CONTROL, &l); if (!(l & PCI_BRIDGE_CTL_VGA)) { vgadev->owns = 0; break; } } bus = bus->parent; } if (vga_is_boot_device(vgadev)) { vgaarb_info(&pdev->dev, "setting as boot VGA device%s\n", vga_default_device() ? " (overriding previous)" : ""); vga_set_default_device(pdev); } vga_arbiter_check_bridge_sharing(vgadev); /* Add to the list */ list_add_tail(&vgadev->list, &vga_list); vga_count++; vgaarb_info(&pdev->dev, "VGA device added: decodes=%s,owns=%s,locks=%s\n", vga_iostate_to_str(vgadev->decodes), vga_iostate_to_str(vgadev->owns), vga_iostate_to_str(vgadev->locks)); spin_unlock_irqrestore(&vga_lock, flags); return true; fail: spin_unlock_irqrestore(&vga_lock, flags); kfree(vgadev); return false; } static bool vga_arbiter_del_pci_device(struct pci_dev *pdev) { struct vga_device *vgadev; unsigned long flags; bool ret = true; spin_lock_irqsave(&vga_lock, flags); vgadev = vgadev_find(pdev); if (vgadev == NULL) { ret = false; goto bail; } if (vga_default == pdev) vga_set_default_device(NULL); if (vgadev->decodes & (VGA_RSRC_LEGACY_IO | VGA_RSRC_LEGACY_MEM)) vga_decode_count--; /* Remove entry from list */ list_del(&vgadev->list); vga_count--; /* Wake up all possible waiters */ wake_up_all(&vga_wait_queue); bail: spin_unlock_irqrestore(&vga_lock, flags); kfree(vgadev); return ret; } /* Called with the lock */ static void vga_update_device_decodes(struct vga_device *vgadev, unsigned int new_decodes) { struct device *dev = &vgadev->pdev->dev; unsigned int old_decodes = vgadev->decodes; unsigned int decodes_removed = ~new_decodes & old_decodes; unsigned int decodes_unlocked = vgadev->locks & decodes_removed; vgadev->decodes = new_decodes; vgaarb_info(dev, "VGA decodes changed: olddecodes=%s,decodes=%s:owns=%s\n", vga_iostate_to_str(old_decodes), vga_iostate_to_str(vgadev->decodes), vga_iostate_to_str(vgadev->owns)); /* If we removed locked decodes, lock count goes to zero, and release */ if (decodes_unlocked) { if (decodes_unlocked & VGA_RSRC_LEGACY_IO) vgadev->io_lock_cnt = 0; if (decodes_unlocked & VGA_RSRC_LEGACY_MEM) vgadev->mem_lock_cnt = 0; __vga_put(vgadev, decodes_unlocked); } /* Change decodes counter */ if (old_decodes & VGA_RSRC_LEGACY_MASK && !(new_decodes & VGA_RSRC_LEGACY_MASK)) vga_decode_count--; if (!(old_decodes & VGA_RSRC_LEGACY_MASK) && new_decodes & VGA_RSRC_LEGACY_MASK) vga_decode_count++; vgaarb_dbg(dev, "decoding count now is: %d\n", vga_decode_count); } static void __vga_set_legacy_decoding(struct pci_dev *pdev, unsigned int decodes, bool userspace) { struct vga_device *vgadev; unsigned long flags; decodes &= VGA_RSRC_LEGACY_MASK; spin_lock_irqsave(&vga_lock, flags); vgadev = vgadev_find(pdev); if (vgadev == NULL) goto bail; /* Don't let userspace futz with kernel driver decodes */ if (userspace && vgadev->set_decode) goto bail; /* Update the device decodes + counter */ vga_update_device_decodes(vgadev, decodes); /* * XXX If somebody is going from "doesn't decode" to "decodes" * state here, additional care must be taken as we may have pending * ownership of non-legacy region. */ bail: spin_unlock_irqrestore(&vga_lock, flags); } /** * vga_set_legacy_decoding * @pdev: PCI device of the VGA card * @decodes: bit mask of what legacy regions the card decodes * * Indicate to the arbiter if the card decodes legacy VGA IOs, legacy VGA * Memory, both, or none. All cards default to both, the card driver (fbdev for * example) should tell the arbiter if it has disabled legacy decoding, so the * card can be left out of the arbitration process (and can be safe to take * interrupts at any time. */ void vga_set_legacy_decoding(struct pci_dev *pdev, unsigned int decodes) { __vga_set_legacy_decoding(pdev, decodes, false); } EXPORT_SYMBOL(vga_set_legacy_decoding); /** * vga_client_register - register or unregister a VGA arbitration client * @pdev: PCI device of the VGA client * @set_decode: VGA decode change callback * * Clients have two callback mechanisms they can use. * * @set_decode callback: If a client can disable its GPU VGA resource, it * will get a callback from this to set the encode/decode state. * * Rationale: we cannot disable VGA decode resources unconditionally * because some single GPU laptops seem to require ACPI or BIOS access to * the VGA registers to control things like backlights etc. Hopefully newer * multi-GPU laptops do something saner, and desktops won't have any * special ACPI for this. The driver will get a callback when VGA * arbitration is first used by userspace since some older X servers have * issues. * * Does not check whether a client for @pdev has been registered already. * * To unregister, call vga_client_unregister(). * * Returns: 0 on success, -ENODEV on failure */ int vga_client_register(struct pci_dev *pdev, unsigned int (*set_decode)(struct pci_dev *pdev, bool decode)) { unsigned long flags; struct vga_device *vgadev; spin_lock_irqsave(&vga_lock, flags); vgadev = vgadev_find(pdev); if (vgadev) vgadev->set_decode = set_decode; spin_unlock_irqrestore(&vga_lock, flags); if (!vgadev) return -ENODEV; return 0; } EXPORT_SYMBOL(vga_client_register); /* * Char driver implementation * * Semantics is: * * open : Open user instance of the arbiter. By default, it's * attached to the default VGA device of the system. * * close : Close user instance, release locks * * read : Return a string indicating the status of the target. * An IO state string is of the form {io,mem,io+mem,none}, * mc and ic are respectively mem and io lock counts (for * debugging/diagnostic only). "decodes" indicate what the * card currently decodes, "owns" indicates what is currently * enabled on it, and "locks" indicates what is locked by this * card. If the card is unplugged, we get "invalid" then for * card_ID and an -ENODEV error is returned for any command * until a new card is targeted * * "<card_ID>,decodes=<io_state>,owns=<io_state>,locks=<io_state> (ic,mc)" * * write : write a command to the arbiter. List of commands is: * * target <card_ID> : switch target to card <card_ID> (see below) * lock <io_state> : acquire locks on target ("none" is invalid io_state) * trylock <io_state> : non-blocking acquire locks on target * unlock <io_state> : release locks on target * unlock all : release all locks on target held by this user * decodes <io_state> : set the legacy decoding attributes for the card * * poll : event if something change on any card (not just the target) * * card_ID is of the form "PCI:domain:bus:dev.fn". It can be set to "default" * to go back to the system default card (TODO: not implemented yet). * Currently, only PCI is supported as a prefix, but the userland API may * support other bus types in the future, even if the current kernel * implementation doesn't. * * Note about locks: * * The driver keeps track of which user has what locks on which card. It * supports stacking, like the kernel one. This complicates the implementation * a bit, but makes the arbiter more tolerant to userspace problems and able * to properly cleanup in all cases when a process dies. * Currently, a max of 16 cards simultaneously can have locks issued from * userspace for a given user (file descriptor instance) of the arbiter. * * If the device is hot-unplugged, there is a hook inside the module to notify * it being added/removed in the system and automatically added/removed in * the arbiter. */ #define MAX_USER_CARDS CONFIG_VGA_ARB_MAX_GPUS #define PCI_INVALID_CARD ((struct pci_dev *)-1UL) /* Each user has an array of these, tracking which cards have locks */ struct vga_arb_user_card { struct pci_dev *pdev; unsigned int mem_cnt; unsigned int io_cnt; }; struct vga_arb_private { struct list_head list; struct pci_dev *target; struct vga_arb_user_card cards[MAX_USER_CARDS]; spinlock_t lock; }; static LIST_HEAD(vga_user_list); static DEFINE_SPINLOCK(vga_user_lock); /* * Take a string in the format: "PCI:domain:bus:dev.fn" and return the * respective values. If the string is not in this format, return 0. */ static int vga_pci_str_to_vars(char *buf, int count, unsigned int *domain, unsigned int *bus, unsigned int *devfn) { int n; unsigned int slot, func; n = sscanf(buf, "PCI:%x:%x:%x.%x", domain, bus, &slot, &func); if (n != 4) return 0; *devfn = PCI_DEVFN(slot, func); return 1; } static ssize_t vga_arb_read(struct file *file, char __user *buf, size_t count, loff_t *ppos) { struct vga_arb_private *priv = file->private_data; struct vga_device *vgadev; struct pci_dev *pdev; unsigned long flags; size_t len; int rc; char *lbuf; lbuf = kmalloc(1024, GFP_KERNEL); if (lbuf == NULL) return -ENOMEM; /* Protect vga_list */ spin_lock_irqsave(&vga_lock, flags); /* If we are targeting the default, use it */ pdev = priv->target; if (pdev == NULL || pdev == PCI_INVALID_CARD) { spin_unlock_irqrestore(&vga_lock, flags); len = sprintf(lbuf, "invalid"); goto done; } /* Find card vgadev structure */ vgadev = vgadev_find(pdev); if (vgadev == NULL) { /* * Wow, it's not in the list, that shouldn't happen, let's * fix us up and return invalid card. */ spin_unlock_irqrestore(&vga_lock, flags); len = sprintf(lbuf, "invalid"); goto done; } /* Fill the buffer with info */ len = snprintf(lbuf, 1024, "count:%d,PCI:%s,decodes=%s,owns=%s,locks=%s(%u:%u)\n", vga_decode_count, pci_name(pdev), vga_iostate_to_str(vgadev->decodes), vga_iostate_to_str(vgadev->owns), vga_iostate_to_str(vgadev->locks), vgadev->io_lock_cnt, vgadev->mem_lock_cnt); spin_unlock_irqrestore(&vga_lock, flags); done: /* Copy that to user */ if (len > count) len = count; rc = copy_to_user(buf, lbuf, len); kfree(lbuf); if (rc) return -EFAULT; return len; } /* * TODO: To avoid parsing inside kernel and to improve the speed we may * consider use ioctl here */ static ssize_t vga_arb_write(struct file *file, const char __user *buf, size_t count, loff_t *ppos) { struct vga_arb_private *priv = file->private_data; struct vga_arb_user_card *uc = NULL; struct pci_dev *pdev; unsigned int io_state; char kbuf[64], *curr_pos; size_t remaining = count; int ret_val; int i; if (count >= sizeof(kbuf)) return -EINVAL; if (copy_from_user(kbuf, buf, count)) return -EFAULT; curr_pos = kbuf; kbuf[count] = '\0'; if (strncmp(curr_pos, "lock ", 5) == 0) { curr_pos += 5; remaining -= 5; pr_debug("client 0x%p called 'lock'\n", priv); if (!vga_str_to_iostate(curr_pos, remaining, &io_state)) { ret_val = -EPROTO; goto done; } if (io_state == VGA_RSRC_NONE) { ret_val = -EPROTO; goto done; } pdev = priv->target; if (priv->target == NULL) { ret_val = -ENODEV; goto done; } vga_get_uninterruptible(pdev, io_state); /* Update the client's locks lists */ for (i = 0; i < MAX_USER_CARDS; i++) { if (priv->cards[i].pdev == pdev) { if (io_state & VGA_RSRC_LEGACY_IO) priv->cards[i].io_cnt++; if (io_state & VGA_RSRC_LEGACY_MEM) priv->cards[i].mem_cnt++; break; } } ret_val = count; goto done; } else if (strncmp(curr_pos, "unlock ", 7) == 0) { curr_pos += 7; remaining -= 7; pr_debug("client 0x%p called 'unlock'\n", priv); if (strncmp(curr_pos, "all", 3) == 0) io_state = VGA_RSRC_LEGACY_IO | VGA_RSRC_LEGACY_MEM; else { if (!vga_str_to_iostate (curr_pos, remaining, &io_state)) { ret_val = -EPROTO; goto done; } /* TODO: Add this? if (io_state == VGA_RSRC_NONE) { ret_val = -EPROTO; goto done; } */ } pdev = priv->target; if (priv->target == NULL) { ret_val = -ENODEV; goto done; } for (i = 0; i < MAX_USER_CARDS; i++) { if (priv->cards[i].pdev == pdev) uc = &priv->cards[i]; } if (!uc) { ret_val = -EINVAL; goto done; } if (io_state & VGA_RSRC_LEGACY_IO && uc->io_cnt == 0) { ret_val = -EINVAL; goto done; } if (io_state & VGA_RSRC_LEGACY_MEM && uc->mem_cnt == 0) { ret_val = -EINVAL; goto done; } vga_put(pdev, io_state); if (io_state & VGA_RSRC_LEGACY_IO) uc->io_cnt--; if (io_state & VGA_RSRC_LEGACY_MEM) uc->mem_cnt--; ret_val = count; goto done; } else if (strncmp(curr_pos, "trylock ", 8) == 0) { curr_pos += 8; remaining -= 8; pr_debug("client 0x%p called 'trylock'\n", priv); if (!vga_str_to_iostate(curr_pos, remaining, &io_state)) { ret_val = -EPROTO; goto done; } /* TODO: Add this? if (io_state == VGA_RSRC_NONE) { ret_val = -EPROTO; goto done; } */ pdev = priv->target; if (priv->target == NULL) { ret_val = -ENODEV; goto done; } if (vga_tryget(pdev, io_state)) { /* Update the client's locks lists... */ for (i = 0; i < MAX_USER_CARDS; i++) { if (priv->cards[i].pdev == pdev) { if (io_state & VGA_RSRC_LEGACY_IO) priv->cards[i].io_cnt++; if (io_state & VGA_RSRC_LEGACY_MEM) priv->cards[i].mem_cnt++; break; } } ret_val = count; goto done; } else { ret_val = -EBUSY; goto done; } } else if (strncmp(curr_pos, "target ", 7) == 0) { unsigned int domain, bus, devfn; struct vga_device *vgadev; curr_pos += 7; remaining -= 7; pr_debug("client 0x%p called 'target'\n", priv); /* If target is default */ if (!strncmp(curr_pos, "default", 7)) pdev = pci_dev_get(vga_default_device()); else { if (!vga_pci_str_to_vars(curr_pos, remaining, &domain, &bus, &devfn)) { ret_val = -EPROTO; goto done; } pdev = pci_get_domain_bus_and_slot(domain, bus, devfn); if (!pdev) { pr_debug("invalid PCI address %04x:%02x:%02x.%x\n", domain, bus, PCI_SLOT(devfn), PCI_FUNC(devfn)); ret_val = -ENODEV; goto done; } pr_debug("%s ==> %04x:%02x:%02x.%x pdev %p\n", curr_pos, domain, bus, PCI_SLOT(devfn), PCI_FUNC(devfn), pdev); } vgadev = vgadev_find(pdev); pr_debug("vgadev %p\n", vgadev); if (vgadev == NULL) { if (pdev) { vgaarb_dbg(&pdev->dev, "not a VGA device\n"); pci_dev_put(pdev); } ret_val = -ENODEV; goto done; } priv->target = pdev; for (i = 0; i < MAX_USER_CARDS; i++) { if (priv->cards[i].pdev == pdev) break; if (priv->cards[i].pdev == NULL) { priv->cards[i].pdev = pdev; priv->cards[i].io_cnt = 0; priv->cards[i].mem_cnt = 0; break; } } if (i == MAX_USER_CARDS) { vgaarb_dbg(&pdev->dev, "maximum user cards (%d) number reached, ignoring this one!\n", MAX_USER_CARDS); pci_dev_put(pdev); /* XXX: Which value to return? */ ret_val = -ENOMEM; goto done; } ret_val = count; pci_dev_put(pdev); goto done; } else if (strncmp(curr_pos, "decodes ", 8) == 0) { curr_pos += 8; remaining -= 8; pr_debug("client 0x%p called 'decodes'\n", priv); if (!vga_str_to_iostate(curr_pos, remaining, &io_state)) { ret_val = -EPROTO; goto done; } pdev = priv->target; if (priv->target == NULL) { ret_val = -ENODEV; goto done; } __vga_set_legacy_decoding(pdev, io_state, true); ret_val = count; goto done; } /* If we got here, the message written is not part of the protocol! */ return -EPROTO; done: return ret_val; } static __poll_t vga_arb_fpoll(struct file *file, poll_table *wait) { pr_debug("%s\n", __func__); poll_wait(file, &vga_wait_queue, wait); return EPOLLIN; } static int vga_arb_open(struct inode *inode, struct file *file) { struct vga_arb_private *priv; unsigned long flags; pr_debug("%s\n", __func__); priv = kzalloc(sizeof(*priv), GFP_KERNEL); if (priv == NULL) return -ENOMEM; spin_lock_init(&priv->lock); file->private_data = priv; spin_lock_irqsave(&vga_user_lock, flags); list_add(&priv->list, &vga_user_list); spin_unlock_irqrestore(&vga_user_lock, flags); /* Set the client's lists of locks */ priv->target = vga_default_device(); /* Maybe this is still null! */ priv->cards[0].pdev = priv->target; priv->cards[0].io_cnt = 0; priv->cards[0].mem_cnt = 0; return 0; } static int vga_arb_release(struct inode *inode, struct file *file) { struct vga_arb_private *priv = file->private_data; struct vga_arb_user_card *uc; unsigned long flags; int i; pr_debug("%s\n", __func__); spin_lock_irqsave(&vga_user_lock, flags); list_del(&priv->list); for (i = 0; i < MAX_USER_CARDS; i++) { uc = &priv->cards[i]; if (uc->pdev == NULL) continue; vgaarb_dbg(&uc->pdev->dev, "uc->io_cnt == %d, uc->mem_cnt == %d\n", uc->io_cnt, uc->mem_cnt); while (uc->io_cnt--) vga_put(uc->pdev, VGA_RSRC_LEGACY_IO); while (uc->mem_cnt--) vga_put(uc->pdev, VGA_RSRC_LEGACY_MEM); } spin_unlock_irqrestore(&vga_user_lock, flags); kfree(priv); return 0; } /* * Callback any registered clients to let them know we have a change in VGA * cards. */ static void vga_arbiter_notify_clients(void) { struct vga_device *vgadev; unsigned long flags; unsigned int new_decodes; bool new_state; if (!vga_arbiter_used) return; new_state = (vga_count > 1) ? false : true; spin_lock_irqsave(&vga_lock, flags); list_for_each_entry(vgadev, &vga_list, list) { if (vgadev->set_decode) { new_decodes = vgadev->set_decode(vgadev->pdev, new_state); vga_update_device_decodes(vgadev, new_decodes); } } spin_unlock_irqrestore(&vga_lock, flags); } static int pci_notify(struct notifier_block *nb, unsigned long action, void *data) { struct device *dev = data; struct pci_dev *pdev = to_pci_dev(dev); bool notify = false; vgaarb_dbg(dev, "%s\n", __func__); /* * For now, we're only interested in devices added and removed. * I didn't test this thing here, so someone needs to double check * for the cases of hot-pluggable VGA cards. */ if (action == BUS_NOTIFY_ADD_DEVICE) notify = vga_arbiter_add_pci_device(pdev); else if (action == BUS_NOTIFY_DEL_DEVICE) notify = vga_arbiter_del_pci_device(pdev); if (notify) vga_arbiter_notify_clients(); return 0; } static struct notifier_block pci_notifier = { .notifier_call = pci_notify, }; static const struct file_operations vga_arb_device_fops = { .read = vga_arb_read, .write = vga_arb_write, .poll = vga_arb_fpoll, .open = vga_arb_open, .release = vga_arb_release, .llseek = noop_llseek, }; static struct miscdevice vga_arb_device = { MISC_DYNAMIC_MINOR, "vga_arbiter", &vga_arb_device_fops }; static int __init vga_arb_device_init(void) { int rc; struct pci_dev *pdev; rc = misc_register(&vga_arb_device); if (rc < 0) pr_err("error %d registering device\n", rc); bus_register_notifier(&pci_bus_type, &pci_notifier); /* Add all VGA class PCI devices by default */ pdev = NULL; while ((pdev = pci_get_subsys(PCI_ANY_ID, PCI_ANY_ID, PCI_ANY_ID, PCI_ANY_ID, pdev)) != NULL) vga_arbiter_add_pci_device(pdev); pr_info("loaded\n"); return rc; } subsys_initcall_sync(vga_arb_device_init); |
23 23 23 | 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60 61 62 63 64 65 66 67 68 69 70 71 72 73 74 75 76 77 78 79 80 81 82 83 84 85 86 87 88 89 90 91 92 93 94 95 96 | // SPDX-License-Identifier: GPL-2.0-only /* * File: sysctl.c * * Phonet /proc/sys/net/phonet interface implementation * * Copyright (C) 2008 Nokia Corporation. * * Author: Rémi Denis-Courmont */ #include <linux/seqlock.h> #include <linux/sysctl.h> #include <linux/errno.h> #include <linux/init.h> #include <net/sock.h> #include <linux/phonet.h> #include <net/phonet/phonet.h> #define DYNAMIC_PORT_MIN 0x40 #define DYNAMIC_PORT_MAX 0x7f static DEFINE_SEQLOCK(local_port_range_lock); static int local_port_range_min[2] = {0, 0}; static int local_port_range_max[2] = {1023, 1023}; static int local_port_range[2] = {DYNAMIC_PORT_MIN, DYNAMIC_PORT_MAX}; static struct ctl_table_header *phonet_table_hrd; static void set_local_port_range(int range[2]) { write_seqlock(&local_port_range_lock); local_port_range[0] = range[0]; local_port_range[1] = range[1]; write_sequnlock(&local_port_range_lock); } void phonet_get_local_port_range(int *min, int *max) { unsigned int seq; do { seq = read_seqbegin(&local_port_range_lock); if (min) *min = local_port_range[0]; if (max) *max = local_port_range[1]; } while (read_seqretry(&local_port_range_lock, seq)); } static int proc_local_port_range(struct ctl_table *table, int write, void *buffer, size_t *lenp, loff_t *ppos) { int ret; int range[2] = {local_port_range[0], local_port_range[1]}; struct ctl_table tmp = { .data = &range, .maxlen = sizeof(range), .mode = table->mode, .extra1 = &local_port_range_min, .extra2 = &local_port_range_max, }; ret = proc_dointvec_minmax(&tmp, write, buffer, lenp, ppos); if (write && ret == 0) { if (range[1] < range[0]) ret = -EINVAL; else set_local_port_range(range); } return ret; } static struct ctl_table phonet_table[] = { { .procname = "local_port_range", .data = &local_port_range, .maxlen = sizeof(local_port_range), .mode = 0644, .proc_handler = proc_local_port_range, }, { } }; int __init phonet_sysctl_init(void) { phonet_table_hrd = register_net_sysctl(&init_net, "net/phonet", phonet_table); return phonet_table_hrd == NULL ? -ENOMEM : 0; } void phonet_sysctl_exit(void) { unregister_net_sysctl_table(phonet_table_hrd); } |
8 8 8 8 33 29 18 3 32 6 4 2 14 14 9 4 7 7 4 11 4 19 19 18 18 18 18 18 17 16 14 14 2 1392 1391 16103 16821 11538 505 99 6155 7160 7169 1367 1395 1394 | 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60 61 62 63 64 65 66 67 68 69 70 71 72 73 74 75 76 77 78 79 80 81 82 83 84 85 86 87 88 89 90 91 92 93 94 95 96 97 98 99 100 101 102 103 104 105 106 107 108 109 110 111 112 113 114 115 116 117 118 119 120 121 122 123 124 125 126 127 128 129 130 131 132 133 134 135 136 137 138 139 140 141 142 143 144 145 146 147 148 149 150 151 152 153 154 155 156 157 158 159 160 161 162 163 164 165 166 167 168 169 170 171 172 173 174 175 176 177 178 179 180 181 182 183 184 185 186 187 188 189 190 191 192 193 194 195 196 197 198 199 200 201 202 203 204 205 206 207 208 209 210 211 212 213 214 215 216 217 218 219 220 221 222 223 224 225 226 227 228 229 230 231 232 233 234 235 236 237 238 239 240 241 242 243 244 245 246 247 248 249 250 251 252 253 254 255 256 257 258 259 260 261 262 263 264 265 266 267 268 269 270 271 272 273 274 275 276 277 278 279 280 281 282 283 284 285 286 287 288 289 290 291 292 293 294 295 296 297 298 299 300 301 302 303 304 305 306 307 308 309 310 311 312 313 314 315 316 317 318 319 320 321 322 323 324 325 326 327 328 329 330 331 332 333 334 335 336 337 338 339 340 341 342 343 344 345 346 347 348 349 350 351 352 353 354 355 356 357 358 359 360 361 362 363 364 365 366 367 368 369 370 371 372 373 374 375 376 377 378 379 380 381 382 383 384 385 386 387 388 389 390 391 392 393 394 395 396 397 398 399 400 401 402 403 404 405 406 407 408 409 410 411 412 413 414 415 416 417 418 419 420 421 422 423 424 425 426 427 428 429 430 431 432 433 434 435 436 437 438 439 440 441 442 443 444 445 446 447 448 449 450 451 452 453 454 455 456 457 458 459 460 461 462 463 464 465 466 467 468 469 470 471 472 473 474 475 476 477 478 479 480 481 482 483 484 485 486 487 488 489 490 491 492 493 494 495 496 497 498 499 500 501 502 503 504 505 506 507 508 509 510 511 512 513 514 515 516 517 518 519 520 521 522 523 | // SPDX-License-Identifier: GPL-2.0 /* * linux/kernel/capability.c * * Copyright (C) 1997 Andrew Main <zefram@fysh.org> * * Integrated into 2.1.97+, Andrew G. Morgan <morgan@kernel.org> * 30 May 2002: Cleanup, Robert M. Love <rml@tech9.net> */ #define pr_fmt(fmt) KBUILD_MODNAME ": " fmt #include <linux/audit.h> #include <linux/capability.h> #include <linux/mm.h> #include <linux/export.h> #include <linux/security.h> #include <linux/syscalls.h> #include <linux/pid_namespace.h> #include <linux/user_namespace.h> #include <linux/uaccess.h> int file_caps_enabled = 1; static int __init file_caps_disable(char *str) { file_caps_enabled = 0; return 1; } __setup("no_file_caps", file_caps_disable); #ifdef CONFIG_MULTIUSER /* * More recent versions of libcap are available from: * * http://www.kernel.org/pub/linux/libs/security/linux-privs/ */ static void warn_legacy_capability_use(void) { char name[sizeof(current->comm)]; pr_info_once("warning: `%s' uses 32-bit capabilities (legacy support in use)\n", get_task_comm(name, current)); } /* * Version 2 capabilities worked fine, but the linux/capability.h file * that accompanied their introduction encouraged their use without * the necessary user-space source code changes. As such, we have * created a version 3 with equivalent functionality to version 2, but * with a header change to protect legacy source code from using * version 2 when it wanted to use version 1. If your system has code * that trips the following warning, it is using version 2 specific * capabilities and may be doing so insecurely. * * The remedy is to either upgrade your version of libcap (to 2.10+, * if the application is linked against it), or recompile your * application with modern kernel headers and this warning will go * away. */ static void warn_deprecated_v2(void) { char name[sizeof(current->comm)]; pr_info_once("warning: `%s' uses deprecated v2 capabilities in a way that may be insecure\n", get_task_comm(name, current)); } /* * Version check. Return the number of u32s in each capability flag * array, or a negative value on error. */ static int cap_validate_magic(cap_user_header_t header, unsigned *tocopy) { __u32 version; if (get_user(version, &header->version)) return -EFAULT; switch (version) { case _LINUX_CAPABILITY_VERSION_1: warn_legacy_capability_use(); *tocopy = _LINUX_CAPABILITY_U32S_1; break; case _LINUX_CAPABILITY_VERSION_2: warn_deprecated_v2(); fallthrough; /* v3 is otherwise equivalent to v2 */ case _LINUX_CAPABILITY_VERSION_3: *tocopy = _LINUX_CAPABILITY_U32S_3; break; default: if (put_user((u32)_KERNEL_CAPABILITY_VERSION, &header->version)) return -EFAULT; return -EINVAL; } return 0; } /* * The only thing that can change the capabilities of the current * process is the current process. As such, we can't be in this code * at the same time as we are in the process of setting capabilities * in this process. The net result is that we can limit our use of * locks to when we are reading the caps of another process. */ static inline int cap_get_target_pid(pid_t pid, kernel_cap_t *pEp, kernel_cap_t *pIp, kernel_cap_t *pPp) { int ret; if (pid && (pid != task_pid_vnr(current))) { const struct task_struct *target; rcu_read_lock(); target = find_task_by_vpid(pid); if (!target) ret = -ESRCH; else ret = security_capget(target, pEp, pIp, pPp); rcu_read_unlock(); } else ret = security_capget(current, pEp, pIp, pPp); return ret; } /** * sys_capget - get the capabilities of a given process. * @header: pointer to struct that contains capability version and * target pid data * @dataptr: pointer to struct that contains the effective, permitted, * and inheritable capabilities that are returned * * Returns 0 on success and < 0 on error. */ SYSCALL_DEFINE2(capget, cap_user_header_t, header, cap_user_data_t, dataptr) { int ret = 0; pid_t pid; unsigned tocopy; kernel_cap_t pE, pI, pP; struct __user_cap_data_struct kdata[2]; ret = cap_validate_magic(header, &tocopy); if ((dataptr == NULL) || (ret != 0)) return ((dataptr == NULL) && (ret == -EINVAL)) ? 0 : ret; if (get_user(pid, &header->pid)) return -EFAULT; if (pid < 0) return -EINVAL; ret = cap_get_target_pid(pid, &pE, &pI, &pP); if (ret) return ret; /* * Annoying legacy format with 64-bit capabilities exposed * as two sets of 32-bit fields, so we need to split the * capability values up. */ kdata[0].effective = pE.val; kdata[1].effective = pE.val >> 32; kdata[0].permitted = pP.val; kdata[1].permitted = pP.val >> 32; kdata[0].inheritable = pI.val; kdata[1].inheritable = pI.val >> 32; /* * Note, in the case, tocopy < _KERNEL_CAPABILITY_U32S, * we silently drop the upper capabilities here. This * has the effect of making older libcap * implementations implicitly drop upper capability * bits when they perform a: capget/modify/capset * sequence. * * This behavior is considered fail-safe * behavior. Upgrading the application to a newer * version of libcap will enable access to the newer * capabilities. * * An alternative would be to return an error here * (-ERANGE), but that causes legacy applications to * unexpectedly fail; the capget/modify/capset aborts * before modification is attempted and the application * fails. */ if (copy_to_user(dataptr, kdata, tocopy * sizeof(kdata[0]))) return -EFAULT; return 0; } static kernel_cap_t mk_kernel_cap(u32 low, u32 high) { return (kernel_cap_t) { (low | ((u64)high << 32)) & CAP_VALID_MASK }; } /** * sys_capset - set capabilities for a process or (*) a group of processes * @header: pointer to struct that contains capability version and * target pid data * @data: pointer to struct that contains the effective, permitted, * and inheritable capabilities * * Set capabilities for the current process only. The ability to any other * process(es) has been deprecated and removed. * * The restrictions on setting capabilities are specified as: * * I: any raised capabilities must be a subset of the old permitted * P: any raised capabilities must be a subset of the old permitted * E: must be set to a subset of new permitted * * Returns 0 on success and < 0 on error. */ SYSCALL_DEFINE2(capset, cap_user_header_t, header, const cap_user_data_t, data) { struct __user_cap_data_struct kdata[2] = { { 0, }, }; unsigned tocopy, copybytes; kernel_cap_t inheritable, permitted, effective; struct cred *new; int ret; pid_t pid; ret = cap_validate_magic(header, &tocopy); if (ret != 0) return ret; if (get_user(pid, &header->pid)) return -EFAULT; /* may only affect current now */ if (pid != 0 && pid != task_pid_vnr(current)) return -EPERM; copybytes = tocopy * sizeof(struct __user_cap_data_struct); if (copybytes > sizeof(kdata)) return -EFAULT; if (copy_from_user(&kdata, data, copybytes)) return -EFAULT; effective = mk_kernel_cap(kdata[0].effective, kdata[1].effective); permitted = mk_kernel_cap(kdata[0].permitted, kdata[1].permitted); inheritable = mk_kernel_cap(kdata[0].inheritable, kdata[1].inheritable); new = prepare_creds(); if (!new) return -ENOMEM; ret = security_capset(new, current_cred(), &effective, &inheritable, &permitted); if (ret < 0) goto error; audit_log_capset(new, current_cred()); return commit_creds(new); error: abort_creds(new); return ret; } /** * has_ns_capability - Does a task have a capability in a specific user ns * @t: The task in question * @ns: target user namespace * @cap: The capability to be tested for * * Return true if the specified task has the given superior capability * currently in effect to the specified user namespace, false if not. * * Note that this does not set PF_SUPERPRIV on the task. */ bool has_ns_capability(struct task_struct *t, struct user_namespace *ns, int cap) { int ret; rcu_read_lock(); ret = security_capable(__task_cred(t), ns, cap, CAP_OPT_NONE); rcu_read_unlock(); return (ret == 0); } /** * has_capability - Does a task have a capability in init_user_ns * @t: The task in question * @cap: The capability to be tested for * * Return true if the specified task has the given superior capability * currently in effect to the initial user namespace, false if not. * * Note that this does not set PF_SUPERPRIV on the task. */ bool has_capability(struct task_struct *t, int cap) { return has_ns_capability(t, &init_user_ns, cap); } EXPORT_SYMBOL(has_capability); /** * has_ns_capability_noaudit - Does a task have a capability (unaudited) * in a specific user ns. * @t: The task in question * @ns: target user namespace * @cap: The capability to be tested for * * Return true if the specified task has the given superior capability * currently in effect to the specified user namespace, false if not. * Do not write an audit message for the check. * * Note that this does not set PF_SUPERPRIV on the task. */ bool has_ns_capability_noaudit(struct task_struct *t, struct user_namespace *ns, int cap) { int ret; rcu_read_lock(); ret = security_capable(__task_cred(t), ns, cap, CAP_OPT_NOAUDIT); rcu_read_unlock(); return (ret == 0); } /** * has_capability_noaudit - Does a task have a capability (unaudited) in the * initial user ns * @t: The task in question * @cap: The capability to be tested for * * Return true if the specified task has the given superior capability * currently in effect to init_user_ns, false if not. Don't write an * audit message for the check. * * Note that this does not set PF_SUPERPRIV on the task. */ bool has_capability_noaudit(struct task_struct *t, int cap) { return has_ns_capability_noaudit(t, &init_user_ns, cap); } EXPORT_SYMBOL(has_capability_noaudit); static bool ns_capable_common(struct user_namespace *ns, int cap, unsigned int opts) { int capable; if (unlikely(!cap_valid(cap))) { pr_crit("capable() called with invalid cap=%u\n", cap); BUG(); } capable = security_capable(current_cred(), ns, cap, opts); if (capable == 0) { current->flags |= PF_SUPERPRIV; return true; } return false; } /** * ns_capable - Determine if the current task has a superior capability in effect * @ns: The usernamespace we want the capability in * @cap: The capability to be tested for * * Return true if the current task has the given superior capability currently * available for use, false if not. * * This sets PF_SUPERPRIV on the task if the capability is available on the * assumption that it's about to be used. */ bool ns_capable(struct user_namespace *ns, int cap) { return ns_capable_common(ns, cap, CAP_OPT_NONE); } EXPORT_SYMBOL(ns_capable); /** * ns_capable_noaudit - Determine if the current task has a superior capability * (unaudited) in effect * @ns: The usernamespace we want the capability in * @cap: The capability to be tested for * * Return true if the current task has the given superior capability currently * available for use, false if not. * * This sets PF_SUPERPRIV on the task if the capability is available on the * assumption that it's about to be used. */ bool ns_capable_noaudit(struct user_namespace *ns, int cap) { return ns_capable_common(ns, cap, CAP_OPT_NOAUDIT); } EXPORT_SYMBOL(ns_capable_noaudit); /** * ns_capable_setid - Determine if the current task has a superior capability * in effect, while signalling that this check is being done from within a * setid or setgroups syscall. * @ns: The usernamespace we want the capability in * @cap: The capability to be tested for * * Return true if the current task has the given superior capability currently * available for use, false if not. * * This sets PF_SUPERPRIV on the task if the capability is available on the * assumption that it's about to be used. */ bool ns_capable_setid(struct user_namespace *ns, int cap) { return ns_capable_common(ns, cap, CAP_OPT_INSETID); } EXPORT_SYMBOL(ns_capable_setid); /** * capable - Determine if the current task has a superior capability in effect * @cap: The capability to be tested for * * Return true if the current task has the given superior capability currently * available for use, false if not. * * This sets PF_SUPERPRIV on the task if the capability is available on the * assumption that it's about to be used. */ bool capable(int cap) { return ns_capable(&init_user_ns, cap); } EXPORT_SYMBOL(capable); #endif /* CONFIG_MULTIUSER */ /** * file_ns_capable - Determine if the file's opener had a capability in effect * @file: The file we want to check * @ns: The usernamespace we want the capability in * @cap: The capability to be tested for * * Return true if task that opened the file had a capability in effect * when the file was opened. * * This does not set PF_SUPERPRIV because the caller may not * actually be privileged. */ bool file_ns_capable(const struct file *file, struct user_namespace *ns, int cap) { if (WARN_ON_ONCE(!cap_valid(cap))) return false; if (security_capable(file->f_cred, ns, cap, CAP_OPT_NONE) == 0) return true; return false; } EXPORT_SYMBOL(file_ns_capable); /** * privileged_wrt_inode_uidgid - Do capabilities in the namespace work over the inode? * @ns: The user namespace in question * @idmap: idmap of the mount @inode was found from * @inode: The inode in question * * Return true if the inode uid and gid are within the namespace. */ bool privileged_wrt_inode_uidgid(struct user_namespace *ns, struct mnt_idmap *idmap, const struct inode *inode) { return vfsuid_has_mapping(ns, i_uid_into_vfsuid(idmap, inode)) && vfsgid_has_mapping(ns, i_gid_into_vfsgid(idmap, inode)); } /** * capable_wrt_inode_uidgid - Check nsown_capable and uid and gid mapped * @idmap: idmap of the mount @inode was found from * @inode: The inode in question * @cap: The capability in question * * Return true if the current task has the given capability targeted at * its own user namespace and that the given inode's uid and gid are * mapped into the current user namespace. */ bool capable_wrt_inode_uidgid(struct mnt_idmap *idmap, const struct inode *inode, int cap) { struct user_namespace *ns = current_user_ns(); return ns_capable(ns, cap) && privileged_wrt_inode_uidgid(ns, idmap, inode); } EXPORT_SYMBOL(capable_wrt_inode_uidgid); /** * ptracer_capable - Determine if the ptracer holds CAP_SYS_PTRACE in the namespace * @tsk: The task that may be ptraced * @ns: The user namespace to search for CAP_SYS_PTRACE in * * Return true if the task that is ptracing the current task had CAP_SYS_PTRACE * in the specified user namespace. */ bool ptracer_capable(struct task_struct *tsk, struct user_namespace *ns) { int ret = 0; /* An absent tracer adds no restrictions */ const struct cred *cred; rcu_read_lock(); cred = rcu_dereference(tsk->ptracer_cred); if (cred) ret = security_capable(cred, ns, CAP_SYS_PTRACE, CAP_OPT_NOAUDIT); rcu_read_unlock(); return (ret == 0); } |
3 1 1 5 11 28 9 9 5 4 2 3 6 21 | 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60 61 62 63 64 65 66 67 68 69 70 71 72 73 74 75 76 77 78 79 80 81 82 83 84 85 86 87 88 89 90 91 92 93 94 95 96 97 98 99 100 101 102 103 104 105 106 107 108 109 110 111 112 113 114 115 116 117 118 119 120 121 122 123 124 125 126 127 128 129 130 131 132 133 134 135 136 137 138 139 140 141 142 143 144 145 146 147 148 149 150 151 152 153 154 155 156 157 158 159 160 161 162 163 164 165 166 167 168 169 170 171 172 173 174 175 176 177 178 179 180 181 182 183 184 185 186 187 188 189 190 191 192 193 194 195 196 197 198 199 200 201 202 203 204 205 206 207 208 209 210 211 212 213 214 215 216 217 218 219 220 221 222 223 224 225 226 227 228 229 230 231 232 233 234 235 236 237 238 239 240 241 242 243 244 245 246 247 248 249 250 251 252 253 254 255 256 257 258 259 260 261 262 263 264 265 266 267 268 269 270 271 272 273 274 275 276 277 278 279 280 281 282 283 284 285 286 287 288 289 290 291 292 293 294 295 296 297 298 299 300 301 302 303 304 305 306 307 308 309 310 311 312 313 314 315 316 317 318 319 320 321 322 323 324 325 326 327 328 329 330 331 332 333 334 335 336 337 338 339 340 341 342 343 344 345 346 347 348 349 350 351 352 353 354 355 356 357 358 359 360 361 362 363 364 365 366 367 368 369 370 371 372 373 374 375 376 377 378 379 380 381 382 383 384 385 386 387 388 389 390 391 392 393 394 395 396 397 398 399 400 401 402 403 404 405 406 407 408 409 410 411 412 413 414 415 416 417 418 419 420 421 422 423 424 425 426 427 428 429 430 431 432 433 434 435 436 437 438 439 440 441 442 443 444 445 446 447 448 449 450 451 452 453 454 455 456 457 458 459 460 461 462 | /* SPDX-License-Identifier: GPL-2.0-only */ /* * An interface between IEEE802.15.4 device and rest of the kernel. * * Copyright (C) 2007-2012 Siemens AG * * Written by: * Pavel Smolenskiy <pavel.smolenskiy@gmail.com> * Maxim Gorbachyov <maxim.gorbachev@siemens.com> * Maxim Osipov <maxim.osipov@siemens.com> * Dmitry Eremin-Solenikov <dbaryshkov@gmail.com> * Alexander Smirnov <alex.bluesman.smirnov@gmail.com> */ #ifndef IEEE802154_NETDEVICE_H #define IEEE802154_NETDEVICE_H #define IEEE802154_REQUIRED_SIZE(struct_type, member) \ (offsetof(typeof(struct_type), member) + \ sizeof(((typeof(struct_type) *)(NULL))->member)) #define IEEE802154_ADDR_OFFSET \ offsetof(typeof(struct sockaddr_ieee802154), addr) #define IEEE802154_MIN_NAMELEN (IEEE802154_ADDR_OFFSET + \ IEEE802154_REQUIRED_SIZE(struct ieee802154_addr_sa, addr_type)) #define IEEE802154_NAMELEN_SHORT (IEEE802154_ADDR_OFFSET + \ IEEE802154_REQUIRED_SIZE(struct ieee802154_addr_sa, short_addr)) #define IEEE802154_NAMELEN_LONG (IEEE802154_ADDR_OFFSET + \ IEEE802154_REQUIRED_SIZE(struct ieee802154_addr_sa, hwaddr)) #include <net/af_ieee802154.h> #include <linux/netdevice.h> #include <linux/skbuff.h> #include <linux/ieee802154.h> #include <net/cfg802154.h> struct ieee802154_beacon_hdr { #if defined(__LITTLE_ENDIAN_BITFIELD) u16 beacon_order:4, superframe_order:4, final_cap_slot:4, battery_life_ext:1, reserved0:1, pan_coordinator:1, assoc_permit:1; u8 gts_count:3, gts_reserved:4, gts_permit:1; u8 pend_short_addr_count:3, reserved1:1, pend_ext_addr_count:3, reserved2:1; #elif defined(__BIG_ENDIAN_BITFIELD) u16 assoc_permit:1, pan_coordinator:1, reserved0:1, battery_life_ext:1, final_cap_slot:4, superframe_order:4, beacon_order:4; u8 gts_permit:1, gts_reserved:4, gts_count:3; u8 reserved2:1, pend_ext_addr_count:3, reserved1:1, pend_short_addr_count:3; #else #error "Please fix <asm/byteorder.h>" #endif } __packed; struct ieee802154_mac_cmd_pl { u8 cmd_id; } __packed; struct ieee802154_sechdr { #if defined(__LITTLE_ENDIAN_BITFIELD) u8 level:3, key_id_mode:2, reserved:3; #elif defined(__BIG_ENDIAN_BITFIELD) u8 reserved:3, key_id_mode:2, level:3; #else #error "Please fix <asm/byteorder.h>" #endif u8 key_id; __le32 frame_counter; union { __le32 short_src; __le64 extended_src; }; }; struct ieee802154_hdr_fc { #if defined(__LITTLE_ENDIAN_BITFIELD) u16 type:3, security_enabled:1, frame_pending:1, ack_request:1, intra_pan:1, reserved:3, dest_addr_mode:2, version:2, source_addr_mode:2; #elif defined(__BIG_ENDIAN_BITFIELD) u16 reserved:1, intra_pan:1, ack_request:1, frame_pending:1, security_enabled:1, type:3, source_addr_mode:2, version:2, dest_addr_mode:2, reserved2:2; #else #error "Please fix <asm/byteorder.h>" #endif }; enum ieee802154_frame_version { IEEE802154_2003_STD, IEEE802154_2006_STD, IEEE802154_STD, IEEE802154_RESERVED_STD, IEEE802154_MULTIPURPOSE_STD = IEEE802154_2003_STD, }; enum ieee802154_addressing_mode { IEEE802154_NO_ADDRESSING, IEEE802154_RESERVED, IEEE802154_SHORT_ADDRESSING, IEEE802154_EXTENDED_ADDRESSING, }; struct ieee802154_hdr { struct ieee802154_hdr_fc fc; u8 seq; struct ieee802154_addr source; struct ieee802154_addr dest; struct ieee802154_sechdr sec; }; struct ieee802154_beacon_frame { struct ieee802154_hdr mhr; struct ieee802154_beacon_hdr mac_pl; }; struct ieee802154_mac_cmd_frame { struct ieee802154_hdr mhr; struct ieee802154_mac_cmd_pl mac_pl; }; struct ieee802154_beacon_req_frame { struct ieee802154_hdr mhr; struct ieee802154_mac_cmd_pl mac_pl; }; /* pushes hdr onto the skb. fields of hdr->fc that can be calculated from * the contents of hdr will be, and the actual value of those bits in * hdr->fc will be ignored. this includes the INTRA_PAN bit and the frame * version, if SECEN is set. */ int ieee802154_hdr_push(struct sk_buff *skb, struct ieee802154_hdr *hdr); /* pulls the entire 802.15.4 header off of the skb, including the security * header, and performs pan id decompression */ int ieee802154_hdr_pull(struct sk_buff *skb, struct ieee802154_hdr *hdr); /* parses the frame control, sequence number of address fields in a given skb * and stores them into hdr, performing pan id decompression and length checks * to be suitable for use in header_ops.parse */ int ieee802154_hdr_peek_addrs(const struct sk_buff *skb, struct ieee802154_hdr *hdr); /* parses the full 802.15.4 header a given skb and stores them into hdr, * performing pan id decompression and length checks to be suitable for use in * header_ops.parse */ int ieee802154_hdr_peek(const struct sk_buff *skb, struct ieee802154_hdr *hdr); /* pushes/pulls various frame types into/from an skb */ int ieee802154_beacon_push(struct sk_buff *skb, struct ieee802154_beacon_frame *beacon); int ieee802154_mac_cmd_push(struct sk_buff *skb, void *frame, const void *pl, unsigned int pl_len); int ieee802154_mac_cmd_pl_pull(struct sk_buff *skb, struct ieee802154_mac_cmd_pl *mac_pl); int ieee802154_max_payload(const struct ieee802154_hdr *hdr); static inline int ieee802154_sechdr_authtag_len(const struct ieee802154_sechdr *sec) { switch (sec->level) { case IEEE802154_SCF_SECLEVEL_MIC32: case IEEE802154_SCF_SECLEVEL_ENC_MIC32: return 4; case IEEE802154_SCF_SECLEVEL_MIC64: case IEEE802154_SCF_SECLEVEL_ENC_MIC64: return 8; case IEEE802154_SCF_SECLEVEL_MIC128: case IEEE802154_SCF_SECLEVEL_ENC_MIC128: return 16; case IEEE802154_SCF_SECLEVEL_NONE: case IEEE802154_SCF_SECLEVEL_ENC: default: return 0; } } static inline int ieee802154_hdr_length(struct sk_buff *skb) { struct ieee802154_hdr hdr; int len = ieee802154_hdr_pull(skb, &hdr); if (len > 0) skb_push(skb, len); return len; } static inline bool ieee802154_addr_equal(const struct ieee802154_addr *a1, const struct ieee802154_addr *a2) { if (a1->pan_id != a2->pan_id || a1->mode != a2->mode) return false; if ((a1->mode == IEEE802154_ADDR_LONG && a1->extended_addr != a2->extended_addr) || (a1->mode == IEEE802154_ADDR_SHORT && a1->short_addr != a2->short_addr)) return false; return true; } static inline __le64 ieee802154_devaddr_from_raw(const void *raw) { u64 temp; memcpy(&temp, raw, IEEE802154_ADDR_LEN); return (__force __le64)swab64(temp); } static inline void ieee802154_devaddr_to_raw(void *raw, __le64 addr) { u64 temp = swab64((__force u64)addr); memcpy(raw, &temp, IEEE802154_ADDR_LEN); } static inline int ieee802154_sockaddr_check_size(struct sockaddr_ieee802154 *daddr, int len) { struct ieee802154_addr_sa *sa; int ret = 0; sa = &daddr->addr; if (len < IEEE802154_MIN_NAMELEN) return -EINVAL; switch (sa->addr_type) { case IEEE802154_ADDR_NONE: break; case IEEE802154_ADDR_SHORT: if (len < IEEE802154_NAMELEN_SHORT) ret = -EINVAL; break; case IEEE802154_ADDR_LONG: if (len < IEEE802154_NAMELEN_LONG) ret = -EINVAL; break; default: ret = -EINVAL; break; } return ret; } static inline void ieee802154_addr_from_sa(struct ieee802154_addr *a, const struct ieee802154_addr_sa *sa) { a->mode = sa->addr_type; a->pan_id = cpu_to_le16(sa->pan_id); switch (a->mode) { case IEEE802154_ADDR_SHORT: a->short_addr = cpu_to_le16(sa->short_addr); break; case IEEE802154_ADDR_LONG: a->extended_addr = ieee802154_devaddr_from_raw(sa->hwaddr); break; } } static inline void ieee802154_addr_to_sa(struct ieee802154_addr_sa *sa, const struct ieee802154_addr *a) { sa->addr_type = a->mode; sa->pan_id = le16_to_cpu(a->pan_id); switch (a->mode) { case IEEE802154_ADDR_SHORT: sa->short_addr = le16_to_cpu(a->short_addr); break; case IEEE802154_ADDR_LONG: ieee802154_devaddr_to_raw(sa->hwaddr, a->extended_addr); break; } } /* * A control block of skb passed between the ARPHRD_IEEE802154 device * and other stack parts. */ struct ieee802154_mac_cb { u8 lqi; u8 type; bool ackreq; bool secen; bool secen_override; u8 seclevel; bool seclevel_override; struct ieee802154_addr source; struct ieee802154_addr dest; }; static inline struct ieee802154_mac_cb *mac_cb(struct sk_buff *skb) { return (struct ieee802154_mac_cb *)skb->cb; } static inline struct ieee802154_mac_cb *mac_cb_init(struct sk_buff *skb) { BUILD_BUG_ON(sizeof(struct ieee802154_mac_cb) > sizeof(skb->cb)); memset(skb->cb, 0, sizeof(struct ieee802154_mac_cb)); return mac_cb(skb); } enum { IEEE802154_LLSEC_DEVKEY_IGNORE, IEEE802154_LLSEC_DEVKEY_RESTRICT, IEEE802154_LLSEC_DEVKEY_RECORD, __IEEE802154_LLSEC_DEVKEY_MAX, }; #define IEEE802154_MAC_SCAN_ED 0 #define IEEE802154_MAC_SCAN_ACTIVE 1 #define IEEE802154_MAC_SCAN_PASSIVE 2 #define IEEE802154_MAC_SCAN_ORPHAN 3 struct ieee802154_mac_params { s8 transmit_power; u8 min_be; u8 max_be; u8 csma_retries; s8 frame_retries; bool lbt; struct wpan_phy_cca cca; s32 cca_ed_level; }; struct wpan_phy; enum { IEEE802154_LLSEC_PARAM_ENABLED = BIT(0), IEEE802154_LLSEC_PARAM_FRAME_COUNTER = BIT(1), IEEE802154_LLSEC_PARAM_OUT_LEVEL = BIT(2), IEEE802154_LLSEC_PARAM_OUT_KEY = BIT(3), IEEE802154_LLSEC_PARAM_KEY_SOURCE = BIT(4), IEEE802154_LLSEC_PARAM_PAN_ID = BIT(5), IEEE802154_LLSEC_PARAM_HWADDR = BIT(6), IEEE802154_LLSEC_PARAM_COORD_HWADDR = BIT(7), IEEE802154_LLSEC_PARAM_COORD_SHORTADDR = BIT(8), }; struct ieee802154_llsec_ops { int (*get_params)(struct net_device *dev, struct ieee802154_llsec_params *params); int (*set_params)(struct net_device *dev, const struct ieee802154_llsec_params *params, int changed); int (*add_key)(struct net_device *dev, const struct ieee802154_llsec_key_id *id, const struct ieee802154_llsec_key *key); int (*del_key)(struct net_device *dev, const struct ieee802154_llsec_key_id *id); int (*add_dev)(struct net_device *dev, const struct ieee802154_llsec_device *llsec_dev); int (*del_dev)(struct net_device *dev, __le64 dev_addr); int (*add_devkey)(struct net_device *dev, __le64 device_addr, const struct ieee802154_llsec_device_key *key); int (*del_devkey)(struct net_device *dev, __le64 device_addr, const struct ieee802154_llsec_device_key *key); int (*add_seclevel)(struct net_device *dev, const struct ieee802154_llsec_seclevel *sl); int (*del_seclevel)(struct net_device *dev, const struct ieee802154_llsec_seclevel *sl); void (*lock_table)(struct net_device *dev); void (*get_table)(struct net_device *dev, struct ieee802154_llsec_table **t); void (*unlock_table)(struct net_device *dev); }; /* * This should be located at net_device->ml_priv * * get_phy should increment the reference counting on returned phy. * Use wpan_wpy_put to put that reference. */ struct ieee802154_mlme_ops { /* The following fields are optional (can be NULL). */ int (*assoc_req)(struct net_device *dev, struct ieee802154_addr *addr, u8 channel, u8 page, u8 cap); int (*assoc_resp)(struct net_device *dev, struct ieee802154_addr *addr, __le16 short_addr, u8 status); int (*disassoc_req)(struct net_device *dev, struct ieee802154_addr *addr, u8 reason); int (*start_req)(struct net_device *dev, struct ieee802154_addr *addr, u8 channel, u8 page, u8 bcn_ord, u8 sf_ord, u8 pan_coord, u8 blx, u8 coord_realign); int (*scan_req)(struct net_device *dev, u8 type, u32 channels, u8 page, u8 duration); int (*set_mac_params)(struct net_device *dev, const struct ieee802154_mac_params *params); void (*get_mac_params)(struct net_device *dev, struct ieee802154_mac_params *params); const struct ieee802154_llsec_ops *llsec; }; static inline struct ieee802154_mlme_ops * ieee802154_mlme_ops(const struct net_device *dev) { return dev->ml_priv; } #endif |
4980 4940 144 1995 1780 140 144 4691 302 140 140 140 303 276 277 4 4 4 4 4 4 4 297 298 297 295 294 4 4 4 4 2562 908 2558 2561 2562 2562 2560 20 2560 2066 2067 2067 1373 1781 1781 1781 2065 1759 1196 2542 2500 2543 1773 1779 2547 23 15 3 2529 350 51 51 51 51 | 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60 61 62 63 64 65 66 67 68 69 70 71 72 73 74 75 76 77 78 79 80 81 82 83 84 85 86 87 88 89 90 91 92 93 94 95 96 97 98 99 100 101 102 103 104 105 106 107 108 109 110 111 112 113 114 115 116 117 118 119 120 121 122 123 124 125 126 127 128 129 130 131 132 133 134 135 136 137 138 139 140 141 142 143 144 145 146 147 148 149 150 151 152 153 154 155 156 157 158 159 160 161 162 163 164 165 166 167 168 169 170 171 172 173 174 175 176 177 178 179 180 181 182 183 184 185 186 187 188 189 190 191 192 193 194 195 196 197 198 199 200 201 202 203 204 205 206 207 208 209 210 211 212 213 214 215 216 217 218 219 220 221 222 223 224 225 226 227 228 229 230 231 232 233 234 235 236 237 238 239 240 241 242 243 244 245 246 247 248 249 250 251 252 253 254 255 256 257 258 259 260 261 262 263 264 265 266 267 268 269 270 271 272 273 274 275 276 277 278 279 280 281 282 283 284 285 286 287 288 289 290 291 292 293 294 295 296 297 298 299 300 301 302 303 304 305 306 307 308 309 310 311 312 313 314 315 316 317 318 319 320 321 322 323 324 325 326 327 328 329 330 331 332 333 334 335 336 337 338 339 340 341 342 343 344 345 346 347 348 349 350 351 352 353 354 355 356 357 358 359 360 361 362 363 364 365 366 367 368 369 370 371 372 373 374 375 376 377 378 379 380 381 382 383 384 385 386 387 388 389 390 391 392 393 394 395 396 397 398 399 400 401 402 403 404 405 406 407 408 409 410 411 412 413 414 415 416 417 418 419 420 421 422 423 424 425 426 427 428 429 430 431 432 433 434 435 436 437 438 439 440 441 442 443 444 445 446 447 448 449 450 451 452 453 454 455 456 457 458 459 460 461 462 463 464 465 466 467 468 469 470 471 472 473 474 475 476 477 478 479 480 481 482 483 484 485 486 487 488 489 490 491 492 493 494 495 496 497 498 499 500 501 502 503 504 505 506 507 508 509 510 511 512 513 514 515 516 517 518 519 520 521 522 523 524 525 526 527 528 529 530 531 532 533 534 535 536 537 538 539 540 541 542 543 544 545 546 547 548 549 550 551 552 553 554 555 556 557 558 559 560 561 562 563 564 565 566 567 568 569 570 571 572 573 574 575 576 577 578 579 580 581 582 583 584 585 586 587 588 589 590 591 592 593 594 595 596 597 598 599 600 601 602 603 604 605 606 607 608 609 610 611 612 613 614 615 616 617 618 619 620 621 622 623 624 625 626 627 628 629 630 631 632 633 634 635 636 637 638 639 640 641 642 643 644 645 646 647 648 649 650 651 652 653 654 655 656 657 658 659 660 661 662 663 664 665 666 667 668 669 670 671 672 673 674 675 676 677 678 679 680 681 682 683 684 685 686 687 688 689 690 691 692 693 694 695 696 697 698 699 700 701 702 703 704 705 706 707 708 709 710 711 712 713 714 715 716 717 718 719 720 721 722 723 724 725 726 727 728 729 730 731 732 733 734 735 736 737 738 739 740 741 742 743 744 745 746 747 748 749 750 751 752 753 754 755 756 757 758 759 760 761 762 763 764 765 766 767 768 769 770 771 772 773 774 775 776 777 778 779 780 781 782 783 784 785 786 787 788 789 790 791 792 793 794 795 796 797 798 799 800 801 802 803 804 805 806 807 808 809 810 811 812 813 814 815 816 817 818 819 820 821 822 823 824 825 826 827 828 829 830 831 832 833 834 835 836 837 838 839 840 841 842 843 844 845 846 847 848 849 850 851 852 853 854 855 856 857 858 859 860 861 862 863 864 865 866 867 868 869 870 871 872 873 874 875 876 877 878 879 880 881 882 883 884 885 886 887 888 889 890 891 892 893 894 895 896 897 898 899 900 901 902 903 904 905 906 907 908 909 910 911 912 913 914 915 916 917 918 919 920 921 922 923 924 925 926 927 928 929 930 931 932 933 934 935 936 937 938 939 940 941 942 943 944 945 946 947 948 949 950 951 952 953 954 955 956 957 958 959 960 961 962 963 964 965 966 967 968 969 970 971 972 973 974 975 976 977 978 979 980 981 982 983 984 985 986 987 988 989 990 991 992 993 994 995 996 997 998 999 1000 1001 1002 1003 1004 1005 1006 1007 1008 1009 1010 1011 1012 1013 1014 1015 1016 1017 1018 1019 1020 1021 1022 1023 1024 1025 1026 1027 1028 1029 1030 1031 1032 1033 1034 1035 1036 1037 1038 1039 1040 1041 1042 1043 1044 1045 1046 1047 1048 1049 1050 1051 1052 1053 1054 1055 1056 1057 1058 1059 1060 1061 1062 1063 1064 1065 1066 1067 1068 1069 1070 1071 1072 1073 1074 1075 1076 1077 1078 1079 1080 1081 1082 1083 1084 1085 1086 1087 1088 1089 1090 1091 1092 1093 1094 1095 1096 1097 1098 1099 1100 1101 1102 1103 1104 1105 1106 1107 1108 1109 1110 1111 1112 1113 1114 | // SPDX-License-Identifier: GPL-2.0-only /* * Generic helpers for smp ipi calls * * (C) Jens Axboe <jens.axboe@oracle.com> 2008 */ #define pr_fmt(fmt) KBUILD_MODNAME ": " fmt #include <linux/irq_work.h> #include <linux/rcupdate.h> #include <linux/rculist.h> #include <linux/kernel.h> #include <linux/export.h> #include <linux/percpu.h> #include <linux/init.h> #include <linux/interrupt.h> #include <linux/gfp.h> #include <linux/smp.h> #include <linux/cpu.h> #include <linux/sched.h> #include <linux/sched/idle.h> #include <linux/hypervisor.h> #include <linux/sched/clock.h> #include <linux/nmi.h> #include <linux/sched/debug.h> #include <linux/jump_label.h> #include <trace/events/ipi.h> #define CREATE_TRACE_POINTS #include <trace/events/csd.h> #undef CREATE_TRACE_POINTS #include "smpboot.h" #include "sched/smp.h" #define CSD_TYPE(_csd) ((_csd)->node.u_flags & CSD_FLAG_TYPE_MASK) struct call_function_data { call_single_data_t __percpu *csd; cpumask_var_t cpumask; cpumask_var_t cpumask_ipi; }; static DEFINE_PER_CPU_ALIGNED(struct call_function_data, cfd_data); static DEFINE_PER_CPU_SHARED_ALIGNED(struct llist_head, call_single_queue); static DEFINE_PER_CPU(atomic_t, trigger_backtrace) = ATOMIC_INIT(1); static void __flush_smp_call_function_queue(bool warn_cpu_offline); int smpcfd_prepare_cpu(unsigned int cpu) { struct call_function_data *cfd = &per_cpu(cfd_data, cpu); if (!zalloc_cpumask_var_node(&cfd->cpumask, GFP_KERNEL, cpu_to_node(cpu))) return -ENOMEM; if (!zalloc_cpumask_var_node(&cfd->cpumask_ipi, GFP_KERNEL, cpu_to_node(cpu))) { free_cpumask_var(cfd->cpumask); return -ENOMEM; } cfd->csd = alloc_percpu(call_single_data_t); if (!cfd->csd) { free_cpumask_var(cfd->cpumask); free_cpumask_var(cfd->cpumask_ipi); return -ENOMEM; } return 0; } int smpcfd_dead_cpu(unsigned int cpu) { struct call_function_data *cfd = &per_cpu(cfd_data, cpu); free_cpumask_var(cfd->cpumask); free_cpumask_var(cfd->cpumask_ipi); free_percpu(cfd->csd); return 0; } int smpcfd_dying_cpu(unsigned int cpu) { /* * The IPIs for the smp-call-function callbacks queued by other * CPUs might arrive late, either due to hardware latencies or * because this CPU disabled interrupts (inside stop-machine) * before the IPIs were sent. So flush out any pending callbacks * explicitly (without waiting for the IPIs to arrive), to * ensure that the outgoing CPU doesn't go offline with work * still pending. */ __flush_smp_call_function_queue(false); irq_work_run(); return 0; } void __init call_function_init(void) { int i; for_each_possible_cpu(i) init_llist_head(&per_cpu(call_single_queue, i)); smpcfd_prepare_cpu(smp_processor_id()); } static __always_inline void send_call_function_single_ipi(int cpu) { if (call_function_single_prep_ipi(cpu)) { trace_ipi_send_cpu(cpu, _RET_IP_, generic_smp_call_function_single_interrupt); arch_send_call_function_single_ipi(cpu); } } static __always_inline void send_call_function_ipi_mask(struct cpumask *mask) { trace_ipi_send_cpumask(mask, _RET_IP_, generic_smp_call_function_single_interrupt); arch_send_call_function_ipi_mask(mask); } static __always_inline void csd_do_func(smp_call_func_t func, void *info, struct __call_single_data *csd) { trace_csd_function_entry(func, csd); func(info); trace_csd_function_exit(func, csd); } #ifdef CONFIG_CSD_LOCK_WAIT_DEBUG static DEFINE_STATIC_KEY_MAYBE(CONFIG_CSD_LOCK_WAIT_DEBUG_DEFAULT, csdlock_debug_enabled); /* * Parse the csdlock_debug= kernel boot parameter. * * If you need to restore the old "ext" value that once provided * additional debugging information, reapply the following commits: * * de7b09ef658d ("locking/csd_lock: Prepare more CSD lock debugging") * a5aabace5fb8 ("locking/csd_lock: Add more data to CSD lock debugging") */ static int __init csdlock_debug(char *str) { int ret; unsigned int val = 0; ret = get_option(&str, &val); if (ret) { if (val) static_branch_enable(&csdlock_debug_enabled); else static_branch_disable(&csdlock_debug_enabled); } return 1; } __setup("csdlock_debug=", csdlock_debug); static DEFINE_PER_CPU(call_single_data_t *, cur_csd); static DEFINE_PER_CPU(smp_call_func_t, cur_csd_func); static DEFINE_PER_CPU(void *, cur_csd_info); static ulong csd_lock_timeout = 5000; /* CSD lock timeout in milliseconds. */ module_param(csd_lock_timeout, ulong, 0444); static atomic_t csd_bug_count = ATOMIC_INIT(0); /* Record current CSD work for current CPU, NULL to erase. */ static void __csd_lock_record(struct __call_single_data *csd) { if (!csd) { smp_mb(); /* NULL cur_csd after unlock. */ __this_cpu_write(cur_csd, NULL); return; } __this_cpu_write(cur_csd_func, csd->func); __this_cpu_write(cur_csd_info, csd->info); smp_wmb(); /* func and info before csd. */ __this_cpu_write(cur_csd, csd); smp_mb(); /* Update cur_csd before function call. */ /* Or before unlock, as the case may be. */ } static __always_inline void csd_lock_record(struct __call_single_data *csd) { if (static_branch_unlikely(&csdlock_debug_enabled)) __csd_lock_record(csd); } static int csd_lock_wait_getcpu(struct __call_single_data *csd) { unsigned int csd_type; csd_type = CSD_TYPE(csd); if (csd_type == CSD_TYPE_ASYNC || csd_type == CSD_TYPE_SYNC) return csd->node.dst; /* Other CSD_TYPE_ values might not have ->dst. */ return -1; } /* * Complain if too much time spent waiting. Note that only * the CSD_TYPE_SYNC/ASYNC types provide the destination CPU, * so waiting on other types gets much less information. */ static bool csd_lock_wait_toolong(struct __call_single_data *csd, u64 ts0, u64 *ts1, int *bug_id) { int cpu = -1; int cpux; bool firsttime; u64 ts2, ts_delta; call_single_data_t *cpu_cur_csd; unsigned int flags = READ_ONCE(csd->node.u_flags); unsigned long long csd_lock_timeout_ns = csd_lock_timeout * NSEC_PER_MSEC; if (!(flags & CSD_FLAG_LOCK)) { if (!unlikely(*bug_id)) return true; cpu = csd_lock_wait_getcpu(csd); pr_alert("csd: CSD lock (#%d) got unstuck on CPU#%02d, CPU#%02d released the lock.\n", *bug_id, raw_smp_processor_id(), cpu); return true; } ts2 = sched_clock(); ts_delta = ts2 - *ts1; if (likely(ts_delta <= csd_lock_timeout_ns || csd_lock_timeout_ns == 0)) return false; firsttime = !*bug_id; if (firsttime) *bug_id = atomic_inc_return(&csd_bug_count); cpu = csd_lock_wait_getcpu(csd); if (WARN_ONCE(cpu < 0 || cpu >= nr_cpu_ids, "%s: cpu = %d\n", __func__, cpu)) cpux = 0; else cpux = cpu; cpu_cur_csd = smp_load_acquire(&per_cpu(cur_csd, cpux)); /* Before func and info. */ pr_alert("csd: %s non-responsive CSD lock (#%d) on CPU#%d, waiting %llu ns for CPU#%02d %pS(%ps).\n", firsttime ? "Detected" : "Continued", *bug_id, raw_smp_processor_id(), ts2 - ts0, cpu, csd->func, csd->info); if (cpu_cur_csd && csd != cpu_cur_csd) { pr_alert("\tcsd: CSD lock (#%d) handling prior %pS(%ps) request.\n", *bug_id, READ_ONCE(per_cpu(cur_csd_func, cpux)), READ_ONCE(per_cpu(cur_csd_info, cpux))); } else { pr_alert("\tcsd: CSD lock (#%d) %s.\n", *bug_id, !cpu_cur_csd ? "unresponsive" : "handling this request"); } if (cpu >= 0) { if (atomic_cmpxchg_acquire(&per_cpu(trigger_backtrace, cpu), 1, 0)) dump_cpu_task(cpu); if (!cpu_cur_csd) { pr_alert("csd: Re-sending CSD lock (#%d) IPI from CPU#%02d to CPU#%02d\n", *bug_id, raw_smp_processor_id(), cpu); arch_send_call_function_single_ipi(cpu); } } if (firsttime) dump_stack(); *ts1 = ts2; return false; } /* * csd_lock/csd_unlock used to serialize access to per-cpu csd resources * * For non-synchronous ipi calls the csd can still be in use by the * previous function call. For multi-cpu calls its even more interesting * as we'll have to ensure no other cpu is observing our csd. */ static void __csd_lock_wait(struct __call_single_data *csd) { int bug_id = 0; u64 ts0, ts1; ts1 = ts0 = sched_clock(); for (;;) { if (csd_lock_wait_toolong(csd, ts0, &ts1, &bug_id)) break; cpu_relax(); } smp_acquire__after_ctrl_dep(); } static __always_inline void csd_lock_wait(struct __call_single_data *csd) { if (static_branch_unlikely(&csdlock_debug_enabled)) { __csd_lock_wait(csd); return; } smp_cond_load_acquire(&csd->node.u_flags, !(VAL & CSD_FLAG_LOCK)); } #else static void csd_lock_record(struct __call_single_data *csd) { } static __always_inline void csd_lock_wait(struct __call_single_data *csd) { smp_cond_load_acquire(&csd->node.u_flags, !(VAL & CSD_FLAG_LOCK)); } #endif static __always_inline void csd_lock(struct __call_single_data *csd) { csd_lock_wait(csd); csd->node.u_flags |= CSD_FLAG_LOCK; /* * prevent CPU from reordering the above assignment * to ->flags with any subsequent assignments to other * fields of the specified call_single_data_t structure: */ smp_wmb(); } static __always_inline void csd_unlock(struct __call_single_data *csd) { WARN_ON(!(csd->node.u_flags & CSD_FLAG_LOCK)); /* * ensure we're all done before releasing data: */ smp_store_release(&csd->node.u_flags, 0); } static DEFINE_PER_CPU_SHARED_ALIGNED(call_single_data_t, csd_data); void __smp_call_single_queue(int cpu, struct llist_node *node) { /* * We have to check the type of the CSD before queueing it, because * once queued it can have its flags cleared by * flush_smp_call_function_queue() * even if we haven't sent the smp_call IPI yet (e.g. the stopper * executes migration_cpu_stop() on the remote CPU). */ if (trace_csd_queue_cpu_enabled()) { call_single_data_t *csd; smp_call_func_t func; csd = container_of(node, call_single_data_t, node.llist); func = CSD_TYPE(csd) == CSD_TYPE_TTWU ? sched_ttwu_pending : csd->func; trace_csd_queue_cpu(cpu, _RET_IP_, func, csd); } /* * The list addition should be visible to the target CPU when it pops * the head of the list to pull the entry off it in the IPI handler * because of normal cache coherency rules implied by the underlying * llist ops. * * If IPIs can go out of order to the cache coherency protocol * in an architecture, sufficient synchronisation should be added * to arch code to make it appear to obey cache coherency WRT * locking and barrier primitives. Generic code isn't really * equipped to do the right thing... */ if (llist_add(node, &per_cpu(call_single_queue, cpu))) send_call_function_single_ipi(cpu); } /* * Insert a previously allocated call_single_data_t element * for execution on the given CPU. data must already have * ->func, ->info, and ->flags set. */ static int generic_exec_single(int cpu, struct __call_single_data *csd) { if (cpu == smp_processor_id()) { smp_call_func_t func = csd->func; void *info = csd->info; unsigned long flags; /* * We can unlock early even for the synchronous on-stack case, * since we're doing this from the same CPU.. */ csd_lock_record(csd); csd_unlock(csd); local_irq_save(flags); csd_do_func(func, info, NULL); csd_lock_record(NULL); local_irq_restore(flags); return 0; } if ((unsigned)cpu >= nr_cpu_ids || !cpu_online(cpu)) { csd_unlock(csd); return -ENXIO; } __smp_call_single_queue(cpu, &csd->node.llist); return 0; } /** * generic_smp_call_function_single_interrupt - Execute SMP IPI callbacks * * Invoked by arch to handle an IPI for call function single. * Must be called with interrupts disabled. */ void generic_smp_call_function_single_interrupt(void) { __flush_smp_call_function_queue(true); } /** * __flush_smp_call_function_queue - Flush pending smp-call-function callbacks * * @warn_cpu_offline: If set to 'true', warn if callbacks were queued on an * offline CPU. Skip this check if set to 'false'. * * Flush any pending smp-call-function callbacks queued on this CPU. This is * invoked by the generic IPI handler, as well as by a CPU about to go offline, * to ensure that all pending IPI callbacks are run before it goes completely * offline. * * Loop through the call_single_queue and run all the queued callbacks. * Must be called with interrupts disabled. */ static void __flush_smp_call_function_queue(bool warn_cpu_offline) { call_single_data_t *csd, *csd_next; struct llist_node *entry, *prev; struct llist_head *head; static bool warned; atomic_t *tbt; lockdep_assert_irqs_disabled(); /* Allow waiters to send backtrace NMI from here onwards */ tbt = this_cpu_ptr(&trigger_backtrace); atomic_set_release(tbt, 1); head = this_cpu_ptr(&call_single_queue); entry = llist_del_all(head); entry = llist_reverse_order(entry); /* There shouldn't be any pending callbacks on an offline CPU. */ if (unlikely(warn_cpu_offline && !cpu_online(smp_processor_id()) && !warned && entry != NULL)) { warned = true; WARN(1, "IPI on offline CPU %d\n", smp_processor_id()); /* * We don't have to use the _safe() variant here * because we are not invoking the IPI handlers yet. */ llist_for_each_entry(csd, entry, node.llist) { switch (CSD_TYPE(csd)) { case CSD_TYPE_ASYNC: case CSD_TYPE_SYNC: case CSD_TYPE_IRQ_WORK: pr_warn("IPI callback %pS sent to offline CPU\n", csd->func); break; case CSD_TYPE_TTWU: pr_warn("IPI task-wakeup sent to offline CPU\n"); break; default: pr_warn("IPI callback, unknown type %d, sent to offline CPU\n", CSD_TYPE(csd)); break; } } } /* * First; run all SYNC callbacks, people are waiting for us. */ prev = NULL; llist_for_each_entry_safe(csd, csd_next, entry, node.llist) { /* Do we wait until *after* callback? */ if (CSD_TYPE(csd) == CSD_TYPE_SYNC) { smp_call_func_t func = csd->func; void *info = csd->info; if (prev) { prev->next = &csd_next->node.llist; } else { entry = &csd_next->node.llist; } csd_lock_record(csd); csd_do_func(func, info, csd); csd_unlock(csd); csd_lock_record(NULL); } else { prev = &csd->node.llist; } } if (!entry) return; /* * Second; run all !SYNC callbacks. */ prev = NULL; llist_for_each_entry_safe(csd, csd_next, entry, node.llist) { int type = CSD_TYPE(csd); if (type != CSD_TYPE_TTWU) { if (prev) { prev->next = &csd_next->node.llist; } else { entry = &csd_next->node.llist; } if (type == CSD_TYPE_ASYNC) { smp_call_func_t func = csd->func; void *info = csd->info; csd_lock_record(csd); csd_unlock(csd); csd_do_func(func, info, csd); csd_lock_record(NULL); } else if (type == CSD_TYPE_IRQ_WORK) { irq_work_single(csd); } } else { prev = &csd->node.llist; } } /* * Third; only CSD_TYPE_TTWU is left, issue those. */ if (entry) { csd = llist_entry(entry, typeof(*csd), node.llist); csd_do_func(sched_ttwu_pending, entry, csd); } } /** * flush_smp_call_function_queue - Flush pending smp-call-function callbacks * from task context (idle, migration thread) * * When TIF_POLLING_NRFLAG is supported and a CPU is in idle and has it * set, then remote CPUs can avoid sending IPIs and wake the idle CPU by * setting TIF_NEED_RESCHED. The idle task on the woken up CPU has to * handle queued SMP function calls before scheduling. * * The migration thread has to ensure that an eventually pending wakeup has * been handled before it migrates a task. */ void flush_smp_call_function_queue(void) { unsigned int was_pending; unsigned long flags; if (llist_empty(this_cpu_ptr(&call_single_queue))) return; local_irq_save(flags); /* Get the already pending soft interrupts for RT enabled kernels */ was_pending = local_softirq_pending(); __flush_smp_call_function_queue(true); if (local_softirq_pending()) do_softirq_post_smp_call_flush(was_pending); local_irq_restore(flags); } /* * smp_call_function_single - Run a function on a specific CPU * @func: The function to run. This must be fast and non-blocking. * @info: An arbitrary pointer to pass to the function. * @wait: If true, wait until function has completed on other CPUs. * * Returns 0 on success, else a negative status code. */ int smp_call_function_single(int cpu, smp_call_func_t func, void *info, int wait) { call_single_data_t *csd; call_single_data_t csd_stack = { .node = { .u_flags = CSD_FLAG_LOCK | CSD_TYPE_SYNC, }, }; int this_cpu; int err; /* * prevent preemption and reschedule on another processor, * as well as CPU removal */ this_cpu = get_cpu(); /* * Can deadlock when called with interrupts disabled. * We allow cpu's that are not yet online though, as no one else can * send smp call function interrupt to this cpu and as such deadlocks * can't happen. */ WARN_ON_ONCE(cpu_online(this_cpu) && irqs_disabled() && !oops_in_progress); /* * When @wait we can deadlock when we interrupt between llist_add() and * arch_send_call_function_ipi*(); when !@wait we can deadlock due to * csd_lock() on because the interrupt context uses the same csd * storage. */ WARN_ON_ONCE(!in_task()); csd = &csd_stack; if (!wait) { csd = this_cpu_ptr(&csd_data); csd_lock(csd); } csd->func = func; csd->info = info; #ifdef CONFIG_CSD_LOCK_WAIT_DEBUG csd->node.src = smp_processor_id(); csd->node.dst = cpu; #endif err = generic_exec_single(cpu, csd); if (wait) csd_lock_wait(csd); put_cpu(); return err; } EXPORT_SYMBOL(smp_call_function_single); /** * smp_call_function_single_async() - Run an asynchronous function on a * specific CPU. * @cpu: The CPU to run on. * @csd: Pre-allocated and setup data structure * * Like smp_call_function_single(), but the call is asynchonous and * can thus be done from contexts with disabled interrupts. * * The caller passes his own pre-allocated data structure * (ie: embedded in an object) and is responsible for synchronizing it * such that the IPIs performed on the @csd are strictly serialized. * * If the function is called with one csd which has not yet been * processed by previous call to smp_call_function_single_async(), the * function will return immediately with -EBUSY showing that the csd * object is still in progress. * * NOTE: Be careful, there is unfortunately no current debugging facility to * validate the correctness of this serialization. * * Return: %0 on success or negative errno value on error */ int smp_call_function_single_async(int cpu, struct __call_single_data *csd) { int err = 0; preempt_disable(); if (csd->node.u_flags & CSD_FLAG_LOCK) { err = -EBUSY; goto out; } csd->node.u_flags = CSD_FLAG_LOCK; smp_wmb(); err = generic_exec_single(cpu, csd); out: preempt_enable(); return err; } EXPORT_SYMBOL_GPL(smp_call_function_single_async); /* * smp_call_function_any - Run a function on any of the given cpus * @mask: The mask of cpus it can run on. * @func: The function to run. This must be fast and non-blocking. * @info: An arbitrary pointer to pass to the function. * @wait: If true, wait until function has completed. * * Returns 0 on success, else a negative status code (if no cpus were online). * * Selection preference: * 1) current cpu if in @mask * 2) any cpu of current node if in @mask * 3) any other online cpu in @mask */ int smp_call_function_any(const struct cpumask *mask, smp_call_func_t func, void *info, int wait) { unsigned int cpu; const struct cpumask *nodemask; int ret; /* Try for same CPU (cheapest) */ cpu = get_cpu(); if (cpumask_test_cpu(cpu, mask)) goto call; /* Try for same node. */ nodemask = cpumask_of_node(cpu_to_node(cpu)); for (cpu = cpumask_first_and(nodemask, mask); cpu < nr_cpu_ids; cpu = cpumask_next_and(cpu, nodemask, mask)) { if (cpu_online(cpu)) goto call; } /* Any online will do: smp_call_function_single handles nr_cpu_ids. */ cpu = cpumask_any_and(mask, cpu_online_mask); call: ret = smp_call_function_single(cpu, func, info, wait); put_cpu(); return ret; } EXPORT_SYMBOL_GPL(smp_call_function_any); /* * Flags to be used as scf_flags argument of smp_call_function_many_cond(). * * %SCF_WAIT: Wait until function execution is completed * %SCF_RUN_LOCAL: Run also locally if local cpu is set in cpumask */ #define SCF_WAIT (1U << 0) #define SCF_RUN_LOCAL (1U << 1) static void smp_call_function_many_cond(const struct cpumask *mask, smp_call_func_t func, void *info, unsigned int scf_flags, smp_cond_func_t cond_func) { int cpu, last_cpu, this_cpu = smp_processor_id(); struct call_function_data *cfd; bool wait = scf_flags & SCF_WAIT; int nr_cpus = 0; bool run_remote = false; bool run_local = false; lockdep_assert_preemption_disabled(); /* * Can deadlock when called with interrupts disabled. * We allow cpu's that are not yet online though, as no one else can * send smp call function interrupt to this cpu and as such deadlocks * can't happen. */ if (cpu_online(this_cpu) && !oops_in_progress && !early_boot_irqs_disabled) lockdep_assert_irqs_enabled(); /* * When @wait we can deadlock when we interrupt between llist_add() and * arch_send_call_function_ipi*(); when !@wait we can deadlock due to * csd_lock() on because the interrupt context uses the same csd * storage. */ WARN_ON_ONCE(!in_task()); /* Check if we need local execution. */ if ((scf_flags & SCF_RUN_LOCAL) && cpumask_test_cpu(this_cpu, mask)) run_local = true; /* Check if we need remote execution, i.e., any CPU excluding this one. */ cpu = cpumask_first_and(mask, cpu_online_mask); if (cpu == this_cpu) cpu = cpumask_next_and(cpu, mask, cpu_online_mask); if (cpu < nr_cpu_ids) run_remote = true; if (run_remote) { cfd = this_cpu_ptr(&cfd_data); cpumask_and(cfd->cpumask, mask, cpu_online_mask); __cpumask_clear_cpu(this_cpu, cfd->cpumask); cpumask_clear(cfd->cpumask_ipi); for_each_cpu(cpu, cfd->cpumask) { call_single_data_t *csd = per_cpu_ptr(cfd->csd, cpu); if (cond_func && !cond_func(cpu, info)) { __cpumask_clear_cpu(cpu, cfd->cpumask); continue; } csd_lock(csd); if (wait) csd->node.u_flags |= CSD_TYPE_SYNC; csd->func = func; csd->info = info; #ifdef CONFIG_CSD_LOCK_WAIT_DEBUG csd->node.src = smp_processor_id(); csd->node.dst = cpu; #endif trace_csd_queue_cpu(cpu, _RET_IP_, func, csd); if (llist_add(&csd->node.llist, &per_cpu(call_single_queue, cpu))) { __cpumask_set_cpu(cpu, cfd->cpumask_ipi); nr_cpus++; last_cpu = cpu; } } /* * Choose the most efficient way to send an IPI. Note that the * number of CPUs might be zero due to concurrent changes to the * provided mask. */ if (nr_cpus == 1) send_call_function_single_ipi(last_cpu); else if (likely(nr_cpus > 1)) send_call_function_ipi_mask(cfd->cpumask_ipi); } if (run_local && (!cond_func || cond_func(this_cpu, info))) { unsigned long flags; local_irq_save(flags); csd_do_func(func, info, NULL); local_irq_restore(flags); } if (run_remote && wait) { for_each_cpu(cpu, cfd->cpumask) { call_single_data_t *csd; csd = per_cpu_ptr(cfd->csd, cpu); csd_lock_wait(csd); } } } /** * smp_call_function_many(): Run a function on a set of CPUs. * @mask: The set of cpus to run on (only runs on online subset). * @func: The function to run. This must be fast and non-blocking. * @info: An arbitrary pointer to pass to the function. * @wait: Bitmask that controls the operation. If %SCF_WAIT is set, wait * (atomically) until function has completed on other CPUs. If * %SCF_RUN_LOCAL is set, the function will also be run locally * if the local CPU is set in the @cpumask. * * If @wait is true, then returns once @func has returned. * * You must not call this function with disabled interrupts or from a * hardware interrupt handler or from a bottom half handler. Preemption * must be disabled when calling this function. */ void smp_call_function_many(const struct cpumask *mask, smp_call_func_t func, void *info, bool wait) { smp_call_function_many_cond(mask, func, info, wait * SCF_WAIT, NULL); } EXPORT_SYMBOL(smp_call_function_many); /** * smp_call_function(): Run a function on all other CPUs. * @func: The function to run. This must be fast and non-blocking. * @info: An arbitrary pointer to pass to the function. * @wait: If true, wait (atomically) until function has completed * on other CPUs. * * Returns 0. * * If @wait is true, then returns once @func has returned; otherwise * it returns just before the target cpu calls @func. * * You must not call this function with disabled interrupts or from a * hardware interrupt handler or from a bottom half handler. */ void smp_call_function(smp_call_func_t func, void *info, int wait) { preempt_disable(); smp_call_function_many(cpu_online_mask, func, info, wait); preempt_enable(); } EXPORT_SYMBOL(smp_call_function); /* Setup configured maximum number of CPUs to activate */ unsigned int setup_max_cpus = NR_CPUS; EXPORT_SYMBOL(setup_max_cpus); /* * Setup routine for controlling SMP activation * * Command-line option of "nosmp" or "maxcpus=0" will disable SMP * activation entirely (the MPS table probe still happens, though). * * Command-line option of "maxcpus=<NUM>", where <NUM> is an integer * greater than 0, limits the maximum number of CPUs activated in * SMP mode to <NUM>. */ void __weak __init arch_disable_smp_support(void) { } static int __init nosmp(char *str) { setup_max_cpus = 0; arch_disable_smp_support(); return 0; } early_param("nosmp", nosmp); /* this is hard limit */ static int __init nrcpus(char *str) { int nr_cpus; if (get_option(&str, &nr_cpus) && nr_cpus > 0 && nr_cpus < nr_cpu_ids) set_nr_cpu_ids(nr_cpus); return 0; } early_param("nr_cpus", nrcpus); static int __init maxcpus(char *str) { get_option(&str, &setup_max_cpus); if (setup_max_cpus == 0) arch_disable_smp_support(); return 0; } early_param("maxcpus", maxcpus); #if (NR_CPUS > 1) && !defined(CONFIG_FORCE_NR_CPUS) /* Setup number of possible processor ids */ unsigned int nr_cpu_ids __read_mostly = NR_CPUS; EXPORT_SYMBOL(nr_cpu_ids); #endif /* An arch may set nr_cpu_ids earlier if needed, so this would be redundant */ void __init setup_nr_cpu_ids(void) { set_nr_cpu_ids(find_last_bit(cpumask_bits(cpu_possible_mask), NR_CPUS) + 1); } /* Called by boot processor to activate the rest. */ void __init smp_init(void) { int num_nodes, num_cpus; idle_threads_init(); cpuhp_threads_init(); pr_info("Bringing up secondary CPUs ...\n"); bringup_nonboot_cpus(setup_max_cpus); num_nodes = num_online_nodes(); num_cpus = num_online_cpus(); pr_info("Brought up %d node%s, %d CPU%s\n", num_nodes, (num_nodes > 1 ? "s" : ""), num_cpus, (num_cpus > 1 ? "s" : "")); /* Any cleanup work */ smp_cpus_done(setup_max_cpus); } /* * on_each_cpu_cond(): Call a function on each processor for which * the supplied function cond_func returns true, optionally waiting * for all the required CPUs to finish. This may include the local * processor. * @cond_func: A callback function that is passed a cpu id and * the info parameter. The function is called * with preemption disabled. The function should * return a blooean value indicating whether to IPI * the specified CPU. * @func: The function to run on all applicable CPUs. * This must be fast and non-blocking. * @info: An arbitrary pointer to pass to both functions. * @wait: If true, wait (atomically) until function has * completed on other CPUs. * * Preemption is disabled to protect against CPUs going offline but not online. * CPUs going online during the call will not be seen or sent an IPI. * * You must not call this function with disabled interrupts or * from a hardware interrupt handler or from a bottom half handler. */ void on_each_cpu_cond_mask(smp_cond_func_t cond_func, smp_call_func_t func, void *info, bool wait, const struct cpumask *mask) { unsigned int scf_flags = SCF_RUN_LOCAL; if (wait) scf_flags |= SCF_WAIT; preempt_disable(); smp_call_function_many_cond(mask, func, info, scf_flags, cond_func); preempt_enable(); } EXPORT_SYMBOL(on_each_cpu_cond_mask); static void do_nothing(void *unused) { } /** * kick_all_cpus_sync - Force all cpus out of idle * * Used to synchronize the update of pm_idle function pointer. It's * called after the pointer is updated and returns after the dummy * callback function has been executed on all cpus. The execution of * the function can only happen on the remote cpus after they have * left the idle function which had been called via pm_idle function * pointer. So it's guaranteed that nothing uses the previous pointer * anymore. */ void kick_all_cpus_sync(void) { /* Make sure the change is visible before we kick the cpus */ smp_mb(); smp_call_function(do_nothing, NULL, 1); } EXPORT_SYMBOL_GPL(kick_all_cpus_sync); /** * wake_up_all_idle_cpus - break all cpus out of idle * wake_up_all_idle_cpus try to break all cpus which is in idle state even * including idle polling cpus, for non-idle cpus, we will do nothing * for them. */ void wake_up_all_idle_cpus(void) { int cpu; for_each_possible_cpu(cpu) { preempt_disable(); if (cpu != smp_processor_id() && cpu_online(cpu)) wake_up_if_idle(cpu); preempt_enable(); } } EXPORT_SYMBOL_GPL(wake_up_all_idle_cpus); /** * struct smp_call_on_cpu_struct - Call a function on a specific CPU * @work: &work_struct * @done: &completion to signal * @func: function to call * @data: function's data argument * @ret: return value from @func * @cpu: target CPU (%-1 for any CPU) * * Used to call a function on a specific cpu and wait for it to return. * Optionally make sure the call is done on a specified physical cpu via vcpu * pinning in order to support virtualized environments. */ struct smp_call_on_cpu_struct { struct work_struct work; struct completion done; int (*func)(void *); void *data; int ret; int cpu; }; static void smp_call_on_cpu_callback(struct work_struct *work) { struct smp_call_on_cpu_struct *sscs; sscs = container_of(work, struct smp_call_on_cpu_struct, work); if (sscs->cpu >= 0) hypervisor_pin_vcpu(sscs->cpu); sscs->ret = sscs->func(sscs->data); if (sscs->cpu >= 0) hypervisor_pin_vcpu(-1); complete(&sscs->done); } int smp_call_on_cpu(unsigned int cpu, int (*func)(void *), void *par, bool phys) { struct smp_call_on_cpu_struct sscs = { .done = COMPLETION_INITIALIZER_ONSTACK(sscs.done), .func = func, .data = par, .cpu = phys ? cpu : -1, }; INIT_WORK_ONSTACK(&sscs.work, smp_call_on_cpu_callback); if (cpu >= nr_cpu_ids || !cpu_online(cpu)) return -ENXIO; queue_work_on(cpu, system_wq, &sscs.work); wait_for_completion(&sscs.done); return sscs.ret; } EXPORT_SYMBOL_GPL(smp_call_on_cpu); |
2 1 50 1 2 1 1 328 | 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60 61 62 63 64 65 66 67 68 69 70 71 72 73 74 75 76 77 78 79 80 81 82 83 84 85 86 87 88 89 90 91 92 93 94 95 96 97 98 99 100 101 102 103 104 105 106 107 108 109 110 111 112 113 114 115 116 117 118 119 120 121 122 123 124 125 126 127 128 129 130 131 132 133 134 135 136 137 138 139 140 141 142 143 144 145 146 147 148 149 150 151 152 153 154 155 156 157 158 159 160 161 162 163 164 165 166 167 168 169 170 171 172 173 174 175 176 177 178 179 180 181 182 183 184 185 186 187 188 189 190 191 192 193 194 195 196 197 198 199 200 201 202 203 204 205 206 207 208 209 210 211 | /* SPDX-License-Identifier: GPL-2.0-or-later */ /* * Copyright (c) 2014 Mahesh Bandewar <maheshb@google.com> */ #ifndef __IPVLAN_H #define __IPVLAN_H #include <linux/kernel.h> #include <linux/types.h> #include <linux/module.h> #include <linux/init.h> #include <linux/rculist.h> #include <linux/notifier.h> #include <linux/netdevice.h> #include <linux/etherdevice.h> #include <linux/if_arp.h> #include <linux/if_link.h> #include <linux/if_vlan.h> #include <linux/ip.h> #include <linux/inetdevice.h> #include <linux/netfilter.h> #include <net/ip.h> #include <net/ip6_route.h> #include <net/netns/generic.h> #include <net/rtnetlink.h> #include <net/route.h> #include <net/addrconf.h> #include <net/l3mdev.h> #define IPVLAN_DRV "ipvlan" #define IPV_DRV_VER "0.1" #define IPVLAN_HASH_SIZE (1 << BITS_PER_BYTE) #define IPVLAN_HASH_MASK (IPVLAN_HASH_SIZE - 1) #define IPVLAN_MAC_FILTER_BITS 8 #define IPVLAN_MAC_FILTER_SIZE (1 << IPVLAN_MAC_FILTER_BITS) #define IPVLAN_MAC_FILTER_MASK (IPVLAN_MAC_FILTER_SIZE - 1) #define IPVLAN_QBACKLOG_LIMIT 1000 typedef enum { IPVL_IPV6 = 0, IPVL_ICMPV6, IPVL_IPV4, IPVL_ARP, } ipvl_hdr_type; struct ipvl_pcpu_stats { u64_stats_t rx_pkts; u64_stats_t rx_bytes; u64_stats_t rx_mcast; u64_stats_t tx_pkts; u64_stats_t tx_bytes; struct u64_stats_sync syncp; u32 rx_errs; u32 tx_drps; }; struct ipvl_port; struct ipvl_dev { struct net_device *dev; struct list_head pnode; struct ipvl_port *port; struct net_device *phy_dev; struct list_head addrs; struct ipvl_pcpu_stats __percpu *pcpu_stats; DECLARE_BITMAP(mac_filters, IPVLAN_MAC_FILTER_SIZE); netdev_features_t sfeatures; u32 msg_enable; spinlock_t addrs_lock; }; struct ipvl_addr { struct ipvl_dev *master; /* Back pointer to master */ union { struct in6_addr ip6; /* IPv6 address on logical interface */ struct in_addr ip4; /* IPv4 address on logical interface */ } ipu; #define ip6addr ipu.ip6 #define ip4addr ipu.ip4 struct hlist_node hlnode; /* Hash-table linkage */ struct list_head anode; /* logical-interface linkage */ ipvl_hdr_type atype; struct rcu_head rcu; }; struct ipvl_port { struct net_device *dev; possible_net_t pnet; struct hlist_head hlhead[IPVLAN_HASH_SIZE]; struct list_head ipvlans; u16 mode; u16 flags; u16 dev_id_start; struct work_struct wq; struct sk_buff_head backlog; int count; struct ida ida; netdevice_tracker dev_tracker; }; struct ipvl_skb_cb { bool tx_pkt; }; #define IPVL_SKB_CB(_skb) ((struct ipvl_skb_cb *)&((_skb)->cb[0])) static inline struct ipvl_port *ipvlan_port_get_rcu(const struct net_device *d) { return rcu_dereference(d->rx_handler_data); } static inline struct ipvl_port *ipvlan_port_get_rcu_bh(const struct net_device *d) { return rcu_dereference_bh(d->rx_handler_data); } static inline struct ipvl_port *ipvlan_port_get_rtnl(const struct net_device *d) { return rtnl_dereference(d->rx_handler_data); } static inline bool ipvlan_is_private(const struct ipvl_port *port) { return !!(port->flags & IPVLAN_F_PRIVATE); } static inline void ipvlan_mark_private(struct ipvl_port *port) { port->flags |= IPVLAN_F_PRIVATE; } static inline void ipvlan_clear_private(struct ipvl_port *port) { port->flags &= ~IPVLAN_F_PRIVATE; } static inline bool ipvlan_is_vepa(const struct ipvl_port *port) { return !!(port->flags & IPVLAN_F_VEPA); } static inline void ipvlan_mark_vepa(struct ipvl_port *port) { port->flags |= IPVLAN_F_VEPA; } static inline void ipvlan_clear_vepa(struct ipvl_port *port) { port->flags &= ~IPVLAN_F_VEPA; } void ipvlan_init_secret(void); unsigned int ipvlan_mac_hash(const unsigned char *addr); rx_handler_result_t ipvlan_handle_frame(struct sk_buff **pskb); void ipvlan_process_multicast(struct work_struct *work); int ipvlan_queue_xmit(struct sk_buff *skb, struct net_device *dev); void ipvlan_ht_addr_add(struct ipvl_dev *ipvlan, struct ipvl_addr *addr); struct ipvl_addr *ipvlan_find_addr(const struct ipvl_dev *ipvlan, const void *iaddr, bool is_v6); bool ipvlan_addr_busy(struct ipvl_port *port, void *iaddr, bool is_v6); void ipvlan_ht_addr_del(struct ipvl_addr *addr); struct ipvl_addr *ipvlan_addr_lookup(struct ipvl_port *port, void *lyr3h, int addr_type, bool use_dest); void *ipvlan_get_L3_hdr(struct ipvl_port *port, struct sk_buff *skb, int *type); void ipvlan_count_rx(const struct ipvl_dev *ipvlan, unsigned int len, bool success, bool mcast); int ipvlan_link_new(struct net *src_net, struct net_device *dev, struct nlattr *tb[], struct nlattr *data[], struct netlink_ext_ack *extack); void ipvlan_link_delete(struct net_device *dev, struct list_head *head); void ipvlan_link_setup(struct net_device *dev); int ipvlan_link_register(struct rtnl_link_ops *ops); #ifdef CONFIG_IPVLAN_L3S int ipvlan_l3s_register(struct ipvl_port *port); void ipvlan_l3s_unregister(struct ipvl_port *port); void ipvlan_migrate_l3s_hook(struct net *oldnet, struct net *newnet); int ipvlan_l3s_init(void); void ipvlan_l3s_cleanup(void); #else static inline int ipvlan_l3s_register(struct ipvl_port *port) { return -ENOTSUPP; } static inline void ipvlan_l3s_unregister(struct ipvl_port *port) { } static inline void ipvlan_migrate_l3s_hook(struct net *oldnet, struct net *newnet) { } static inline int ipvlan_l3s_init(void) { return 0; } static inline void ipvlan_l3s_cleanup(void) { } #endif /* CONFIG_IPVLAN_L3S */ static inline bool netif_is_ipvlan_port(const struct net_device *dev) { return rcu_access_pointer(dev->rx_handler) == ipvlan_handle_frame; } #endif /* __IPVLAN_H */ |
1 1 1 | 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60 61 62 63 64 65 66 67 68 69 70 71 72 73 74 75 76 77 78 79 80 81 82 83 84 85 86 87 88 89 90 91 92 93 94 95 96 97 98 99 100 101 102 103 104 105 106 107 108 109 110 111 112 113 114 115 116 117 118 119 120 121 122 123 124 125 126 127 128 129 130 131 132 133 134 135 136 137 138 139 140 141 142 143 144 145 146 147 148 149 150 151 152 153 154 155 156 157 158 159 160 161 162 163 164 165 166 167 168 169 170 171 172 173 174 175 176 177 178 179 180 181 182 183 184 185 186 187 188 189 190 191 192 193 194 195 196 197 198 199 200 201 202 203 204 205 206 207 208 209 210 211 212 213 214 215 216 217 218 219 220 221 222 223 224 225 226 227 228 229 230 231 232 233 234 235 236 237 238 239 240 241 242 243 244 245 246 247 248 249 250 251 252 253 254 255 256 257 258 259 260 261 262 263 264 265 266 267 268 269 270 271 272 273 274 275 276 277 278 279 280 281 282 283 284 285 286 287 288 289 290 291 292 293 294 295 296 297 298 299 300 301 302 303 304 305 306 307 308 309 310 311 312 313 314 315 316 317 318 319 320 321 322 323 324 325 326 327 328 329 330 331 332 333 334 335 336 337 338 339 340 341 342 343 344 345 346 347 348 349 350 351 352 353 354 355 356 357 358 359 360 361 362 363 364 365 366 367 368 369 370 371 372 373 374 375 376 377 378 379 380 381 382 383 384 385 386 387 388 389 390 391 392 393 394 395 396 397 398 399 400 401 402 403 404 405 406 407 408 409 410 411 412 413 414 415 416 417 418 419 420 421 422 423 424 425 426 427 428 429 430 431 432 433 434 435 436 437 438 439 440 441 442 443 444 445 446 447 448 449 450 451 452 453 454 455 456 457 458 459 460 461 462 463 464 465 466 467 468 469 470 471 472 473 474 475 476 477 478 479 480 481 482 483 484 485 486 487 488 489 490 491 492 493 494 495 496 497 498 499 500 501 502 503 504 505 506 507 508 509 510 511 512 513 514 515 516 517 518 519 520 521 522 523 524 525 526 527 528 529 530 531 532 533 534 535 536 537 538 539 540 541 542 543 544 545 546 547 548 549 550 551 552 553 554 555 556 557 558 559 560 561 562 563 564 565 566 567 568 569 570 571 572 573 574 575 576 577 578 579 580 581 582 583 584 585 586 587 588 589 590 591 592 593 594 595 596 597 598 599 600 601 602 603 604 605 606 607 608 609 610 611 612 613 614 615 616 617 618 619 620 621 622 623 624 625 626 627 628 629 630 631 632 633 634 635 636 637 638 639 640 641 642 643 644 645 646 647 648 649 650 651 652 653 654 655 656 657 658 659 660 661 662 663 664 665 666 667 668 669 670 671 672 673 674 675 676 677 678 679 680 681 682 683 684 685 686 687 688 689 690 691 692 693 694 695 696 697 698 699 700 701 702 703 704 705 706 707 708 709 710 711 712 713 714 715 716 717 718 719 720 721 722 723 724 725 726 727 728 729 730 731 732 733 734 735 736 737 738 739 740 741 742 743 744 745 746 747 748 749 750 751 752 753 754 755 756 757 758 759 760 761 762 763 764 765 766 767 768 769 770 771 772 773 774 775 776 777 778 779 780 781 782 783 784 785 786 787 788 789 790 791 792 793 794 795 796 797 798 799 800 801 802 803 804 805 806 807 808 809 810 811 812 813 814 815 816 817 818 819 820 821 822 823 824 825 826 827 828 829 830 831 832 833 834 835 836 837 838 839 840 841 842 843 844 845 846 847 848 849 850 851 852 853 854 855 856 857 858 859 860 861 862 863 864 865 866 867 868 869 870 871 872 873 874 875 876 877 878 879 880 881 882 883 884 885 886 887 888 889 890 891 892 893 894 895 896 897 898 899 900 901 902 903 904 905 906 907 908 909 910 911 912 913 914 915 916 917 918 919 920 921 922 923 924 925 926 927 928 929 930 931 932 933 934 935 936 937 938 939 940 941 942 943 944 945 946 947 948 949 950 951 952 953 954 955 956 957 958 959 960 961 962 963 964 965 966 967 968 969 970 971 972 973 974 975 976 977 978 979 980 981 982 983 984 985 986 987 988 989 990 991 992 993 994 995 996 997 998 999 1000 1001 1002 1003 1004 1005 1006 1007 1008 1009 1010 1011 1012 1013 1014 1015 1016 1017 1018 1019 1020 1021 1022 1023 1024 1025 1026 1027 1028 1029 1030 1031 1032 1033 1034 1035 1036 1037 1038 1039 1040 1041 1042 1043 1044 1045 1046 1047 1048 1049 1050 1051 1052 1053 1054 1055 1056 1057 1058 1059 1060 1061 1062 1063 1064 1065 1066 1067 1068 1069 1070 1071 1072 1073 1074 1075 1076 1077 1078 1079 1080 1081 1082 1083 1084 1085 1086 1087 1088 1089 1090 1091 1092 1093 1094 1095 1096 1097 1098 1099 1100 1101 1102 1103 1104 1105 1106 1107 1108 1109 1110 1111 1112 1113 1114 1115 1116 1117 1118 1119 1120 1121 1122 1123 1124 1125 1126 1127 1128 1129 1130 1131 1132 1133 1134 1135 1136 1137 1138 1139 1140 1141 1142 1143 1144 1145 1146 1147 1148 1149 1150 1151 1152 1153 1154 1155 1156 1157 1158 1159 1160 1161 1162 1163 1164 1165 1166 1167 1168 1169 1170 1171 1172 1173 1174 1175 1176 1177 1178 1179 1180 1181 1182 1183 1184 1185 1186 1187 1188 1189 1190 1191 1192 1193 1194 1195 1196 1197 1198 1199 1200 1201 1202 1203 1204 1205 1206 1207 1208 1209 1210 1211 1212 1213 1214 1215 1216 1217 1218 1219 1220 1221 1222 1223 1224 1225 1226 1227 1228 1229 1230 1231 1232 1233 1234 1235 1236 1237 1238 1239 1240 1241 1242 1243 1244 1245 1246 1247 1248 1249 1250 1251 1252 1253 1254 1255 1256 1257 1258 1259 1260 1261 1262 1263 1264 1265 1266 1267 1268 1269 1270 1271 1272 1273 1274 1275 1276 1277 1278 1279 1280 1281 1282 1283 1284 1285 1286 1287 1288 1289 1290 1291 1292 1293 1294 1295 1296 1297 1298 1299 1300 1301 1302 1303 1304 1305 1306 1307 1308 1309 1310 1311 1312 1313 1314 1315 1316 1317 1318 1319 1320 1321 1322 1323 1324 1325 1326 1327 1328 1329 1330 1331 1332 1333 1334 1335 1336 1337 1338 1339 1340 1341 1342 1343 1344 1345 1346 1347 1348 1349 1350 1351 1352 1353 1354 1355 1356 1357 1358 1359 1360 1361 1362 1363 1364 1365 1366 1367 1368 1369 1370 1371 1372 1373 1374 1375 1376 1377 1378 1379 1380 1381 1382 1383 1384 1385 1386 1387 1388 1389 1390 1391 1392 1393 1394 1395 1396 1397 1398 1399 1400 1401 1402 1403 1404 1405 1406 1407 1408 1409 1410 1411 1412 1413 1414 1415 1416 1417 1418 1419 1420 1421 1422 1423 1424 1425 1426 1427 1428 1429 1430 1431 1432 1433 1434 1435 1436 1437 1438 1439 1440 1441 1442 1443 1444 1445 1446 1447 1448 1449 1450 1451 1452 1453 1454 1455 1456 1457 1458 1459 1460 1461 1462 1463 1464 1465 1466 1467 1468 1469 1470 1471 1472 1473 1474 1475 1476 1477 1478 1479 1480 1481 1482 1483 1484 1485 1486 1487 1488 1489 1490 1491 1492 1493 1494 1495 1496 1497 1498 1499 1500 1501 1502 1503 1504 1505 1506 1507 1508 1509 1510 1511 1512 1513 1514 1515 1516 1517 1518 1519 1520 1521 1522 1523 1524 1525 1526 1527 1528 1529 1530 1531 1532 1533 1534 1535 1536 1537 1538 1539 1540 1541 1542 1543 1544 1545 1546 1547 1548 1549 1550 1551 1552 1553 1554 1555 1556 1557 1558 1559 1560 1561 1562 1563 1564 1565 1566 1567 1568 1569 1570 1571 1572 1573 1574 1575 1576 1577 1578 1579 1580 1581 1582 1583 1584 1585 1586 1587 1588 1589 1590 1591 1592 1593 1594 1595 1596 1597 1598 1599 1600 1601 1602 1603 1604 1605 1606 1607 1608 1609 1610 1611 1612 1613 1614 1615 1616 1617 1618 1619 1620 1621 1622 1623 1624 1625 1626 1627 1628 1629 1630 1631 1632 1633 1634 1635 1636 1637 1638 1639 1640 1641 1642 1643 1644 1645 1646 1647 1648 1649 1650 1651 1652 1653 1654 1655 1656 1657 1658 1659 1660 1661 1662 1663 1664 1665 1666 1667 1668 1669 1670 1671 1672 1673 1674 1675 1676 1677 1678 1679 1680 1681 1682 1683 1684 1685 1686 1687 1688 1689 1690 1691 1692 1693 1694 1695 1696 1697 1698 1699 1700 1701 1702 1703 1704 1705 1706 1707 1708 1709 1710 1711 1712 1713 1714 1715 1716 1717 1718 1719 1720 1721 1722 1723 1724 1725 1726 1727 1728 1729 1730 1731 1732 1733 1734 1735 1736 1737 1738 1739 1740 1741 1742 1743 1744 1745 1746 1747 1748 1749 1750 1751 1752 1753 1754 1755 1756 1757 1758 1759 1760 1761 1762 1763 1764 1765 1766 1767 1768 1769 1770 1771 1772 1773 1774 1775 1776 1777 1778 1779 1780 1781 1782 1783 1784 1785 1786 1787 1788 1789 1790 1791 1792 1793 1794 1795 1796 1797 1798 1799 1800 1801 1802 1803 1804 1805 1806 1807 1808 1809 1810 1811 1812 1813 1814 1815 1816 1817 1818 1819 1820 1821 1822 1823 1824 1825 1826 1827 1828 1829 1830 1831 1832 1833 1834 1835 1836 1837 1838 1839 1840 1841 1842 1843 1844 1845 1846 1847 1848 1849 1850 1851 1852 1853 1854 1855 1856 1857 1858 1859 1860 1861 1862 1863 1864 1865 1866 1867 1868 1869 1870 1871 1872 1873 1874 1875 1876 1877 1878 1879 1880 1881 1882 1883 1884 1885 1886 1887 1888 1889 1890 1891 1892 1893 1894 1895 1896 1897 1898 1899 1900 1901 1902 1903 1904 1905 1906 1907 1908 1909 1910 1911 1912 1913 1914 1915 1916 1917 1918 1919 1920 1921 1922 1923 1924 1925 1926 1927 1928 1929 1930 1931 1932 1933 1934 1935 1936 1937 1938 1939 1940 1941 1942 1943 1944 1945 1946 1947 1948 1949 1950 1951 1952 1953 1954 1955 1956 1957 1958 1959 1960 1961 1962 1963 1964 1965 1966 1967 1968 1969 1970 1971 1972 1973 1974 1975 1976 1977 1978 1979 1980 1981 1982 1983 1984 1985 1986 1987 1988 1989 1990 1991 1992 1993 1994 1995 1996 1997 1998 1999 2000 2001 2002 2003 2004 2005 2006 2007 2008 2009 2010 2011 2012 2013 2014 2015 2016 2017 2018 2019 2020 2021 2022 2023 2024 2025 2026 2027 2028 2029 2030 | // SPDX-License-Identifier: GPL-2.0-only /* * vivid-ctrls.c - control support functions. * * Copyright 2014 Cisco Systems, Inc. and/or its affiliates. All rights reserved. */ #include <linux/errno.h> #include <linux/kernel.h> #include <linux/videodev2.h> #include <media/v4l2-event.h> #include <media/v4l2-common.h> #include "vivid-core.h" #include "vivid-vid-cap.h" #include "vivid-vid-out.h" #include "vivid-vid-common.h" #include "vivid-radio-common.h" #include "vivid-osd.h" #include "vivid-ctrls.h" #include "vivid-cec.h" #define VIVID_CID_CUSTOM_BASE (V4L2_CID_USER_BASE | 0xf000) #define VIVID_CID_BUTTON (VIVID_CID_CUSTOM_BASE + 0) #define VIVID_CID_BOOLEAN (VIVID_CID_CUSTOM_BASE + 1) #define VIVID_CID_INTEGER (VIVID_CID_CUSTOM_BASE + 2) #define VIVID_CID_INTEGER64 (VIVID_CID_CUSTOM_BASE + 3) #define VIVID_CID_MENU (VIVID_CID_CUSTOM_BASE + 4) #define VIVID_CID_STRING (VIVID_CID_CUSTOM_BASE + 5) #define VIVID_CID_BITMASK (VIVID_CID_CUSTOM_BASE + 6) #define VIVID_CID_INTMENU (VIVID_CID_CUSTOM_BASE + 7) #define VIVID_CID_U32_ARRAY (VIVID_CID_CUSTOM_BASE + 8) #define VIVID_CID_U16_MATRIX (VIVID_CID_CUSTOM_BASE + 9) #define VIVID_CID_U8_4D_ARRAY (VIVID_CID_CUSTOM_BASE + 10) #define VIVID_CID_AREA (VIVID_CID_CUSTOM_BASE + 11) #define VIVID_CID_RO_INTEGER (VIVID_CID_CUSTOM_BASE + 12) #define VIVID_CID_U32_DYN_ARRAY (VIVID_CID_CUSTOM_BASE + 13) #define VIVID_CID_U8_PIXEL_ARRAY (VIVID_CID_CUSTOM_BASE + 14) #define VIVID_CID_S32_ARRAY (VIVID_CID_CUSTOM_BASE + 15) #define VIVID_CID_S64_ARRAY (VIVID_CID_CUSTOM_BASE + 16) #define VIVID_CID_VIVID_BASE (0x00f00000 | 0xf000) #define VIVID_CID_VIVID_CLASS (0x00f00000 | 1) #define VIVID_CID_TEST_PATTERN (VIVID_CID_VIVID_BASE + 0) #define VIVID_CID_OSD_TEXT_MODE (VIVID_CID_VIVID_BASE + 1) #define VIVID_CID_HOR_MOVEMENT (VIVID_CID_VIVID_BASE + 2) #define VIVID_CID_VERT_MOVEMENT (VIVID_CID_VIVID_BASE + 3) #define VIVID_CID_SHOW_BORDER (VIVID_CID_VIVID_BASE + 4) #define VIVID_CID_SHOW_SQUARE (VIVID_CID_VIVID_BASE + 5) #define VIVID_CID_INSERT_SAV (VIVID_CID_VIVID_BASE + 6) #define VIVID_CID_INSERT_EAV (VIVID_CID_VIVID_BASE + 7) #define VIVID_CID_VBI_CAP_INTERLACED (VIVID_CID_VIVID_BASE + 8) #define VIVID_CID_INSERT_HDMI_VIDEO_GUARD_BAND (VIVID_CID_VIVID_BASE + 9) #define VIVID_CID_HFLIP (VIVID_CID_VIVID_BASE + 20) #define VIVID_CID_VFLIP (VIVID_CID_VIVID_BASE + 21) #define VIVID_CID_STD_ASPECT_RATIO (VIVID_CID_VIVID_BASE + 22) #define VIVID_CID_DV_TIMINGS_ASPECT_RATIO (VIVID_CID_VIVID_BASE + 23) #define VIVID_CID_TSTAMP_SRC (VIVID_CID_VIVID_BASE + 24) #define VIVID_CID_COLORSPACE (VIVID_CID_VIVID_BASE + 25) #define VIVID_CID_XFER_FUNC (VIVID_CID_VIVID_BASE + 26) #define VIVID_CID_YCBCR_ENC (VIVID_CID_VIVID_BASE + 27) #define VIVID_CID_QUANTIZATION (VIVID_CID_VIVID_BASE + 28) #define VIVID_CID_LIMITED_RGB_RANGE (VIVID_CID_VIVID_BASE + 29) #define VIVID_CID_ALPHA_MODE (VIVID_CID_VIVID_BASE + 30) #define VIVID_CID_HAS_CROP_CAP (VIVID_CID_VIVID_BASE + 31) #define VIVID_CID_HAS_COMPOSE_CAP (VIVID_CID_VIVID_BASE + 32) #define VIVID_CID_HAS_SCALER_CAP (VIVID_CID_VIVID_BASE + 33) #define VIVID_CID_HAS_CROP_OUT (VIVID_CID_VIVID_BASE + 34) #define VIVID_CID_HAS_COMPOSE_OUT (VIVID_CID_VIVID_BASE + 35) #define VIVID_CID_HAS_SCALER_OUT (VIVID_CID_VIVID_BASE + 36) #define VIVID_CID_LOOP_VIDEO (VIVID_CID_VIVID_BASE + 37) #define VIVID_CID_SEQ_WRAP (VIVID_CID_VIVID_BASE + 38) #define VIVID_CID_TIME_WRAP (VIVID_CID_VIVID_BASE + 39) #define VIVID_CID_MAX_EDID_BLOCKS (VIVID_CID_VIVID_BASE + 40) #define VIVID_CID_PERCENTAGE_FILL (VIVID_CID_VIVID_BASE + 41) #define VIVID_CID_REDUCED_FPS (VIVID_CID_VIVID_BASE + 42) #define VIVID_CID_HSV_ENC (VIVID_CID_VIVID_BASE + 43) #define VIVID_CID_DISPLAY_PRESENT (VIVID_CID_VIVID_BASE + 44) #define VIVID_CID_STD_SIGNAL_MODE (VIVID_CID_VIVID_BASE + 60) #define VIVID_CID_STANDARD (VIVID_CID_VIVID_BASE + 61) #define VIVID_CID_DV_TIMINGS_SIGNAL_MODE (VIVID_CID_VIVID_BASE + 62) #define VIVID_CID_DV_TIMINGS (VIVID_CID_VIVID_BASE + 63) #define VIVID_CID_PERC_DROPPED (VIVID_CID_VIVID_BASE + 64) #define VIVID_CID_DISCONNECT (VIVID_CID_VIVID_BASE + 65) #define VIVID_CID_DQBUF_ERROR (VIVID_CID_VIVID_BASE + 66) #define VIVID_CID_QUEUE_SETUP_ERROR (VIVID_CID_VIVID_BASE + 67) #define VIVID_CID_BUF_PREPARE_ERROR (VIVID_CID_VIVID_BASE + 68) #define VIVID_CID_START_STR_ERROR (VIVID_CID_VIVID_BASE + 69) #define VIVID_CID_QUEUE_ERROR (VIVID_CID_VIVID_BASE + 70) #define VIVID_CID_CLEAR_FB (VIVID_CID_VIVID_BASE + 71) #define VIVID_CID_REQ_VALIDATE_ERROR (VIVID_CID_VIVID_BASE + 72) #define VIVID_CID_RADIO_SEEK_MODE (VIVID_CID_VIVID_BASE + 90) #define VIVID_CID_RADIO_SEEK_PROG_LIM (VIVID_CID_VIVID_BASE + 91) #define VIVID_CID_RADIO_RX_RDS_RBDS (VIVID_CID_VIVID_BASE + 92) #define VIVID_CID_RADIO_RX_RDS_BLOCKIO (VIVID_CID_VIVID_BASE + 93) #define VIVID_CID_RADIO_TX_RDS_BLOCKIO (VIVID_CID_VIVID_BASE + 94) #define VIVID_CID_SDR_CAP_FM_DEVIATION (VIVID_CID_VIVID_BASE + 110) #define VIVID_CID_META_CAP_GENERATE_PTS (VIVID_CID_VIVID_BASE + 111) #define VIVID_CID_META_CAP_GENERATE_SCR (VIVID_CID_VIVID_BASE + 112) /* General User Controls */ static void vivid_unregister_dev(bool valid, struct video_device *vdev) { if (!valid) return; clear_bit(V4L2_FL_REGISTERED, &vdev->flags); v4l2_event_wake_all(vdev); } static int vivid_user_gen_s_ctrl(struct v4l2_ctrl *ctrl) { struct vivid_dev *dev = container_of(ctrl->handler, struct vivid_dev, ctrl_hdl_user_gen); switch (ctrl->id) { case VIVID_CID_DISCONNECT: v4l2_info(&dev->v4l2_dev, "disconnect\n"); dev->disconnect_error = true; vivid_unregister_dev(dev->has_vid_cap, &dev->vid_cap_dev); vivid_unregister_dev(dev->has_vid_out, &dev->vid_out_dev); vivid_unregister_dev(dev->has_vbi_cap, &dev->vbi_cap_dev); vivid_unregister_dev(dev->has_vbi_out, &dev->vbi_out_dev); vivid_unregister_dev(dev->has_radio_rx, &dev->radio_rx_dev); vivid_unregister_dev(dev->has_radio_tx, &dev->radio_tx_dev); vivid_unregister_dev(dev->has_sdr_cap, &dev->sdr_cap_dev); vivid_unregister_dev(dev->has_meta_cap, &dev->meta_cap_dev); vivid_unregister_dev(dev->has_meta_out, &dev->meta_out_dev); vivid_unregister_dev(dev->has_touch_cap, &dev->touch_cap_dev); break; case VIVID_CID_BUTTON: dev->button_pressed = 30; break; } return 0; } static const struct v4l2_ctrl_ops vivid_user_gen_ctrl_ops = { .s_ctrl = vivid_user_gen_s_ctrl, }; static const struct v4l2_ctrl_config vivid_ctrl_button = { .ops = &vivid_user_gen_ctrl_ops, .id = VIVID_CID_BUTTON, .name = "Button", .type = V4L2_CTRL_TYPE_BUTTON, }; static const struct v4l2_ctrl_config vivid_ctrl_boolean = { .ops = &vivid_user_gen_ctrl_ops, .id = VIVID_CID_BOOLEAN, .name = "Boolean", .type = V4L2_CTRL_TYPE_BOOLEAN, .min = 0, .max = 1, .step = 1, .def = 1, }; static const struct v4l2_ctrl_config vivid_ctrl_int32 = { .ops = &vivid_user_gen_ctrl_ops, .id = VIVID_CID_INTEGER, .name = "Integer 32 Bits", .type = V4L2_CTRL_TYPE_INTEGER, .min = 0xffffffff80000000ULL, .max = 0x7fffffff, .step = 1, }; static const struct v4l2_ctrl_config vivid_ctrl_int64 = { .ops = &vivid_user_gen_ctrl_ops, .id = VIVID_CID_INTEGER64, .name = "Integer 64 Bits", .type = V4L2_CTRL_TYPE_INTEGER64, .min = 0x8000000000000000ULL, .max = 0x7fffffffffffffffLL, .step = 1, }; static const struct v4l2_ctrl_config vivid_ctrl_u32_array = { .ops = &vivid_user_gen_ctrl_ops, .id = VIVID_CID_U32_ARRAY, .name = "U32 1 Element Array", .type = V4L2_CTRL_TYPE_U32, .def = 0x18, .min = 0x10, .max = 0x20000, .step = 1, .dims = { 1 }, }; static const struct v4l2_ctrl_config vivid_ctrl_u32_dyn_array = { .ops = &vivid_user_gen_ctrl_ops, .id = VIVID_CID_U32_DYN_ARRAY, .name = "U32 Dynamic Array", .type = V4L2_CTRL_TYPE_U32, .flags = V4L2_CTRL_FLAG_DYNAMIC_ARRAY, .def = 50, .min = 10, .max = 90, .step = 1, .dims = { 100 }, }; static const struct v4l2_ctrl_config vivid_ctrl_u16_matrix = { .ops = &vivid_user_gen_ctrl_ops, .id = VIVID_CID_U16_MATRIX, .name = "U16 8x16 Matrix", .type = V4L2_CTRL_TYPE_U16, .def = 0x18, .min = 0x10, .max = 0x2000, .step = 1, .dims = { 8, 16 }, }; static const struct v4l2_ctrl_config vivid_ctrl_u8_4d_array = { .ops = &vivid_user_gen_ctrl_ops, .id = VIVID_CID_U8_4D_ARRAY, .name = "U8 2x3x4x5 Array", .type = V4L2_CTRL_TYPE_U8, .def = 0x18, .min = 0x10, .max = 0x20, .step = 1, .dims = { 2, 3, 4, 5 }, }; static const struct v4l2_ctrl_config vivid_ctrl_u8_pixel_array = { .ops = &vivid_user_gen_ctrl_ops, .id = VIVID_CID_U8_PIXEL_ARRAY, .name = "U8 Pixel Array", .type = V4L2_CTRL_TYPE_U8, .def = 0x80, .min = 0x00, .max = 0xff, .step = 1, .dims = { 640 / PIXEL_ARRAY_DIV, 360 / PIXEL_ARRAY_DIV }, }; static const struct v4l2_ctrl_config vivid_ctrl_s32_array = { .ops = &vivid_user_gen_ctrl_ops, .id = VIVID_CID_S32_ARRAY, .name = "S32 2 Element Array", .type = V4L2_CTRL_TYPE_INTEGER, .def = 2, .min = -10, .max = 10, .step = 1, .dims = { 2 }, }; static const struct v4l2_ctrl_config vivid_ctrl_s64_array = { .ops = &vivid_user_gen_ctrl_ops, .id = VIVID_CID_S64_ARRAY, .name = "S64 5 Element Array", .type = V4L2_CTRL_TYPE_INTEGER64, .def = 4, .min = -10, .max = 10, .step = 1, .dims = { 5 }, }; static const char * const vivid_ctrl_menu_strings[] = { "Menu Item 0 (Skipped)", "Menu Item 1", "Menu Item 2 (Skipped)", "Menu Item 3", "Menu Item 4", "Menu Item 5 (Skipped)", NULL, }; static const struct v4l2_ctrl_config vivid_ctrl_menu = { .ops = &vivid_user_gen_ctrl_ops, .id = VIVID_CID_MENU, .name = "Menu", .type = V4L2_CTRL_TYPE_MENU, .min = 1, .max = 4, .def = 3, .menu_skip_mask = 0x04, .qmenu = vivid_ctrl_menu_strings, }; static const struct v4l2_ctrl_config vivid_ctrl_string = { .ops = &vivid_user_gen_ctrl_ops, .id = VIVID_CID_STRING, .name = "String", .type = V4L2_CTRL_TYPE_STRING, .min = 2, .max = 4, .step = 1, }; static const struct v4l2_ctrl_config vivid_ctrl_bitmask = { .ops = &vivid_user_gen_ctrl_ops, .id = VIVID_CID_BITMASK, .name = "Bitmask", .type = V4L2_CTRL_TYPE_BITMASK, .def = 0x80002000, .min = 0, .max = 0x80402010, .step = 0, }; static const s64 vivid_ctrl_int_menu_values[] = { 1, 1, 2, 3, 5, 8, 13, 21, 42, }; static const struct v4l2_ctrl_config vivid_ctrl_int_menu = { .ops = &vivid_user_gen_ctrl_ops, .id = VIVID_CID_INTMENU, .name = "Integer Menu", .type = V4L2_CTRL_TYPE_INTEGER_MENU, .min = 1, .max = 8, .def = 4, .menu_skip_mask = 0x02, .qmenu_int = vivid_ctrl_int_menu_values, }; static const struct v4l2_ctrl_config vivid_ctrl_disconnect = { .ops = &vivid_user_gen_ctrl_ops, .id = VIVID_CID_DISCONNECT, .name = "Disconnect", .type = V4L2_CTRL_TYPE_BUTTON, }; static const struct v4l2_area area = { .width = 1000, .height = 2000, }; static const struct v4l2_ctrl_config vivid_ctrl_area = { .ops = &vivid_user_gen_ctrl_ops, .id = VIVID_CID_AREA, .name = "Area", .type = V4L2_CTRL_TYPE_AREA, .p_def.p_const = &area, }; static const struct v4l2_ctrl_config vivid_ctrl_ro_int32 = { .ops = &vivid_user_gen_ctrl_ops, .id = VIVID_CID_RO_INTEGER, .name = "Read-Only Integer 32 Bits", .type = V4L2_CTRL_TYPE_INTEGER, .flags = V4L2_CTRL_FLAG_READ_ONLY, .min = 0, .max = 255, .step = 1, }; /* Framebuffer Controls */ static int vivid_fb_s_ctrl(struct v4l2_ctrl *ctrl) { struct vivid_dev *dev = container_of(ctrl->handler, struct vivid_dev, ctrl_hdl_fb); switch (ctrl->id) { case VIVID_CID_CLEAR_FB: vivid_clear_fb(dev); break; } return 0; } static const struct v4l2_ctrl_ops vivid_fb_ctrl_ops = { .s_ctrl = vivid_fb_s_ctrl, }; static const struct v4l2_ctrl_config vivid_ctrl_clear_fb = { .ops = &vivid_fb_ctrl_ops, .id = VIVID_CID_CLEAR_FB, .name = "Clear Framebuffer", .type = V4L2_CTRL_TYPE_BUTTON, }; /* Video User Controls */ static int vivid_user_vid_g_volatile_ctrl(struct v4l2_ctrl *ctrl) { struct vivid_dev *dev = container_of(ctrl->handler, struct vivid_dev, ctrl_hdl_user_vid); switch (ctrl->id) { case V4L2_CID_AUTOGAIN: dev->gain->val = (jiffies_to_msecs(jiffies) / 1000) & 0xff; break; } return 0; } static int vivid_user_vid_s_ctrl(struct v4l2_ctrl *ctrl) { struct vivid_dev *dev = container_of(ctrl->handler, struct vivid_dev, ctrl_hdl_user_vid); switch (ctrl->id) { case V4L2_CID_BRIGHTNESS: dev->input_brightness[dev->input] = ctrl->val - dev->input * 128; tpg_s_brightness(&dev->tpg, dev->input_brightness[dev->input]); break; case V4L2_CID_CONTRAST: tpg_s_contrast(&dev->tpg, ctrl->val); break; case V4L2_CID_SATURATION: tpg_s_saturation(&dev->tpg, ctrl->val); break; case V4L2_CID_HUE: tpg_s_hue(&dev->tpg, ctrl->val); break; case V4L2_CID_HFLIP: dev->hflip = ctrl->val; tpg_s_hflip(&dev->tpg, dev->sensor_hflip ^ dev->hflip); break; case V4L2_CID_VFLIP: dev->vflip = ctrl->val; tpg_s_vflip(&dev->tpg, dev->sensor_vflip ^ dev->vflip); break; case V4L2_CID_ALPHA_COMPONENT: tpg_s_alpha_component(&dev->tpg, ctrl->val); break; } return 0; } static const struct v4l2_ctrl_ops vivid_user_vid_ctrl_ops = { .g_volatile_ctrl = vivid_user_vid_g_volatile_ctrl, .s_ctrl = vivid_user_vid_s_ctrl, }; /* Video Capture Controls */ static int vivid_vid_cap_s_ctrl(struct v4l2_ctrl *ctrl) { static const u32 colorspaces[] = { V4L2_COLORSPACE_SMPTE170M, V4L2_COLORSPACE_REC709, V4L2_COLORSPACE_SRGB, V4L2_COLORSPACE_OPRGB, V4L2_COLORSPACE_BT2020, V4L2_COLORSPACE_DCI_P3, V4L2_COLORSPACE_SMPTE240M, V4L2_COLORSPACE_470_SYSTEM_M, V4L2_COLORSPACE_470_SYSTEM_BG, }; struct vivid_dev *dev = container_of(ctrl->handler, struct vivid_dev, ctrl_hdl_vid_cap); unsigned int i, j; switch (ctrl->id) { case VIVID_CID_TEST_PATTERN: vivid_update_quality(dev); tpg_s_pattern(&dev->tpg, ctrl->val); break; case VIVID_CID_COLORSPACE: tpg_s_colorspace(&dev->tpg, colorspaces[ctrl->val]); vivid_send_source_change(dev, TV); vivid_send_source_change(dev, SVID); vivid_send_source_change(dev, HDMI); vivid_send_source_change(dev, WEBCAM); break; case VIVID_CID_XFER_FUNC: tpg_s_xfer_func(&dev->tpg, ctrl->val); vivid_send_source_change(dev, TV); vivid_send_source_change(dev, SVID); vivid_send_source_change(dev, HDMI); vivid_send_source_change(dev, WEBCAM); break; case VIVID_CID_YCBCR_ENC: tpg_s_ycbcr_enc(&dev->tpg, ctrl->val); vivid_send_source_change(dev, TV); vivid_send_source_change(dev, SVID); vivid_send_source_change(dev, HDMI); vivid_send_source_change(dev, WEBCAM); break; case VIVID_CID_HSV_ENC: tpg_s_hsv_enc(&dev->tpg, ctrl->val ? V4L2_HSV_ENC_256 : V4L2_HSV_ENC_180); vivid_send_source_change(dev, TV); vivid_send_source_change(dev, SVID); vivid_send_source_change(dev, HDMI); vivid_send_source_change(dev, WEBCAM); break; case VIVID_CID_QUANTIZATION: tpg_s_quantization(&dev->tpg, ctrl->val); vivid_send_source_change(dev, TV); vivid_send_source_change(dev, SVID); vivid_send_source_change(dev, HDMI); vivid_send_source_change(dev, WEBCAM); break; case V4L2_CID_DV_RX_RGB_RANGE: if (!vivid_is_hdmi_cap(dev)) break; tpg_s_rgb_range(&dev->tpg, ctrl->val); break; case VIVID_CID_LIMITED_RGB_RANGE: tpg_s_real_rgb_range(&dev->tpg, ctrl->val ? V4L2_DV_RGB_RANGE_LIMITED : V4L2_DV_RGB_RANGE_FULL); break; case VIVID_CID_ALPHA_MODE: tpg_s_alpha_mode(&dev->tpg, ctrl->val); break; case VIVID_CID_HOR_MOVEMENT: tpg_s_mv_hor_mode(&dev->tpg, ctrl->val); break; case VIVID_CID_VERT_MOVEMENT: tpg_s_mv_vert_mode(&dev->tpg, ctrl->val); break; case VIVID_CID_OSD_TEXT_MODE: dev->osd_mode = ctrl->val; break; case VIVID_CID_PERCENTAGE_FILL: tpg_s_perc_fill(&dev->tpg, ctrl->val); for (i = 0; i < VIDEO_MAX_FRAME; i++) dev->must_blank[i] = ctrl->val < 100; break; case VIVID_CID_INSERT_SAV: tpg_s_insert_sav(&dev->tpg, ctrl->val); break; case VIVID_CID_INSERT_EAV: tpg_s_insert_eav(&dev->tpg, ctrl->val); break; case VIVID_CID_INSERT_HDMI_VIDEO_GUARD_BAND: tpg_s_insert_hdmi_video_guard_band(&dev->tpg, ctrl->val); break; case VIVID_CID_HFLIP: dev->sensor_hflip = ctrl->val; tpg_s_hflip(&dev->tpg, dev->sensor_hflip ^ dev->hflip); break; case VIVID_CID_VFLIP: dev->sensor_vflip = ctrl->val; tpg_s_vflip(&dev->tpg, dev->sensor_vflip ^ dev->vflip); break; case VIVID_CID_REDUCED_FPS: dev->reduced_fps = ctrl->val; vivid_update_format_cap(dev, true); break; case VIVID_CID_HAS_CROP_CAP: dev->has_crop_cap = ctrl->val; vivid_update_format_cap(dev, true); break; case VIVID_CID_HAS_COMPOSE_CAP: dev->has_compose_cap = ctrl->val; vivid_update_format_cap(dev, true); break; case VIVID_CID_HAS_SCALER_CAP: dev->has_scaler_cap = ctrl->val; vivid_update_format_cap(dev, true); break; case VIVID_CID_SHOW_BORDER: tpg_s_show_border(&dev->tpg, ctrl->val); break; case VIVID_CID_SHOW_SQUARE: tpg_s_show_square(&dev->tpg, ctrl->val); break; case VIVID_CID_STD_ASPECT_RATIO: dev->std_aspect_ratio[dev->input] = ctrl->val; tpg_s_video_aspect(&dev->tpg, vivid_get_video_aspect(dev)); break; case VIVID_CID_DV_TIMINGS_SIGNAL_MODE: dev->dv_timings_signal_mode[dev->input] = dev->ctrl_dv_timings_signal_mode->val; dev->query_dv_timings[dev->input] = dev->ctrl_dv_timings->val; dev->power_present = 0; for (i = 0, j = 0; i < ARRAY_SIZE(dev->dv_timings_signal_mode); i++) if (dev->input_type[i] == HDMI) { if (dev->dv_timings_signal_mode[i] != NO_SIGNAL) dev->power_present |= (1 << j); j++; } __v4l2_ctrl_s_ctrl(dev->ctrl_rx_power_present, dev->power_present); v4l2_ctrl_activate(dev->ctrl_dv_timings, dev->dv_timings_signal_mode[dev->input] == SELECTED_DV_TIMINGS); vivid_update_quality(dev); vivid_send_source_change(dev, HDMI); break; case VIVID_CID_DV_TIMINGS_ASPECT_RATIO: dev->dv_timings_aspect_ratio[dev->input] = ctrl->val; tpg_s_video_aspect(&dev->tpg, vivid_get_video_aspect(dev)); break; case VIVID_CID_TSTAMP_SRC: dev->tstamp_src_is_soe = ctrl->val; dev->vb_vid_cap_q.timestamp_flags &= ~V4L2_BUF_FLAG_TSTAMP_SRC_MASK; if (dev->tstamp_src_is_soe) dev->vb_vid_cap_q.timestamp_flags |= V4L2_BUF_FLAG_TSTAMP_SRC_SOE; break; case VIVID_CID_MAX_EDID_BLOCKS: dev->edid_max_blocks = ctrl->val; if (dev->edid_blocks > dev->edid_max_blocks) dev->edid_blocks = dev->edid_max_blocks; break; } return 0; } static const struct v4l2_ctrl_ops vivid_vid_cap_ctrl_ops = { .s_ctrl = vivid_vid_cap_s_ctrl, }; static const char * const vivid_ctrl_hor_movement_strings[] = { "Move Left Fast", "Move Left", "Move Left Slow", "No Movement", "Move Right Slow", "Move Right", "Move Right Fast", NULL, }; static const struct v4l2_ctrl_config vivid_ctrl_hor_movement = { .ops = &vivid_vid_cap_ctrl_ops, .id = VIVID_CID_HOR_MOVEMENT, .name = "Horizontal Movement", .type = V4L2_CTRL_TYPE_MENU, .max = TPG_MOVE_POS_FAST, .def = TPG_MOVE_NONE, .qmenu = vivid_ctrl_hor_movement_strings, }; static const char * const vivid_ctrl_vert_movement_strings[] = { "Move Up Fast", "Move Up", "Move Up Slow", "No Movement", "Move Down Slow", "Move Down", "Move Down Fast", NULL, }; static const struct v4l2_ctrl_config vivid_ctrl_vert_movement = { .ops = &vivid_vid_cap_ctrl_ops, .id = VIVID_CID_VERT_MOVEMENT, .name = "Vertical Movement", .type = V4L2_CTRL_TYPE_MENU, .max = TPG_MOVE_POS_FAST, .def = TPG_MOVE_NONE, .qmenu = vivid_ctrl_vert_movement_strings, }; static const struct v4l2_ctrl_config vivid_ctrl_show_border = { .ops = &vivid_vid_cap_ctrl_ops, .id = VIVID_CID_SHOW_BORDER, .name = "Show Border", .type = V4L2_CTRL_TYPE_BOOLEAN, .max = 1, .step = 1, }; static const struct v4l2_ctrl_config vivid_ctrl_show_square = { .ops = &vivid_vid_cap_ctrl_ops, .id = VIVID_CID_SHOW_SQUARE, .name = "Show Square", .type = V4L2_CTRL_TYPE_BOOLEAN, .max = 1, .step = 1, }; static const char * const vivid_ctrl_osd_mode_strings[] = { "All", "Counters Only", "None", NULL, }; static const struct v4l2_ctrl_config vivid_ctrl_osd_mode = { .ops = &vivid_vid_cap_ctrl_ops, .id = VIVID_CID_OSD_TEXT_MODE, .name = "OSD Text Mode", .type = V4L2_CTRL_TYPE_MENU, .max = ARRAY_SIZE(vivid_ctrl_osd_mode_strings) - 2, .qmenu = vivid_ctrl_osd_mode_strings, }; static const struct v4l2_ctrl_config vivid_ctrl_perc_fill = { .ops = &vivid_vid_cap_ctrl_ops, .id = VIVID_CID_PERCENTAGE_FILL, .name = "Fill Percentage of Frame", .type = V4L2_CTRL_TYPE_INTEGER, .min = 0, .max = 100, .def = 100, .step = 1, }; static const struct v4l2_ctrl_config vivid_ctrl_insert_sav = { .ops = &vivid_vid_cap_ctrl_ops, .id = VIVID_CID_INSERT_SAV, .name = "Insert SAV Code in Image", .type = V4L2_CTRL_TYPE_BOOLEAN, .max = 1, .step = 1, }; static const struct v4l2_ctrl_config vivid_ctrl_insert_eav = { .ops = &vivid_vid_cap_ctrl_ops, .id = VIVID_CID_INSERT_EAV, .name = "Insert EAV Code in Image", .type = V4L2_CTRL_TYPE_BOOLEAN, .max = 1, .step = 1, }; static const struct v4l2_ctrl_config vivid_ctrl_insert_hdmi_video_guard_band = { .ops = &vivid_vid_cap_ctrl_ops, .id = VIVID_CID_INSERT_HDMI_VIDEO_GUARD_BAND, .name = "Insert Video Guard Band", .type = V4L2_CTRL_TYPE_BOOLEAN, .max = 1, .step = 1, }; static const struct v4l2_ctrl_config vivid_ctrl_hflip = { .ops = &vivid_vid_cap_ctrl_ops, .id = VIVID_CID_HFLIP, .name = "Sensor Flipped Horizontally", .type = V4L2_CTRL_TYPE_BOOLEAN, .max = 1, .step = 1, }; static const struct v4l2_ctrl_config vivid_ctrl_vflip = { .ops = &vivid_vid_cap_ctrl_ops, .id = VIVID_CID_VFLIP, .name = "Sensor Flipped Vertically", .type = V4L2_CTRL_TYPE_BOOLEAN, .max = 1, .step = 1, }; static const struct v4l2_ctrl_config vivid_ctrl_reduced_fps = { .ops = &vivid_vid_cap_ctrl_ops, .id = VIVID_CID_REDUCED_FPS, .name = "Reduced Framerate", .type = V4L2_CTRL_TYPE_BOOLEAN, .max = 1, .step = 1, }; static const struct v4l2_ctrl_config vivid_ctrl_has_crop_cap = { .ops = &vivid_vid_cap_ctrl_ops, .id = VIVID_CID_HAS_CROP_CAP, .name = "Enable Capture Cropping", .type = V4L2_CTRL_TYPE_BOOLEAN, .max = 1, .def = 1, .step = 1, }; static const struct v4l2_ctrl_config vivid_ctrl_has_compose_cap = { .ops = &vivid_vid_cap_ctrl_ops, .id = VIVID_CID_HAS_COMPOSE_CAP, .name = "Enable Capture Composing", .type = V4L2_CTRL_TYPE_BOOLEAN, .max = 1, .def = 1, .step = 1, }; static const struct v4l2_ctrl_config vivid_ctrl_has_scaler_cap = { .ops = &vivid_vid_cap_ctrl_ops, .id = VIVID_CID_HAS_SCALER_CAP, .name = "Enable Capture Scaler", .type = V4L2_CTRL_TYPE_BOOLEAN, .max = 1, .def = 1, .step = 1, }; static const char * const vivid_ctrl_tstamp_src_strings[] = { "End of Frame", "Start of Exposure", NULL, }; static const struct v4l2_ctrl_config vivid_ctrl_tstamp_src = { .ops = &vivid_vid_cap_ctrl_ops, .id = VIVID_CID_TSTAMP_SRC, .name = "Timestamp Source", .type = V4L2_CTRL_TYPE_MENU, .max = ARRAY_SIZE(vivid_ctrl_tstamp_src_strings) - 2, .qmenu = vivid_ctrl_tstamp_src_strings, }; static const struct v4l2_ctrl_config vivid_ctrl_std_aspect_ratio = { .ops = &vivid_vid_cap_ctrl_ops, .id = VIVID_CID_STD_ASPECT_RATIO, .name = "Standard Aspect Ratio", .type = V4L2_CTRL_TYPE_MENU, .min = 1, .max = 4, .def = 1, .qmenu = tpg_aspect_strings, }; static const char * const vivid_ctrl_dv_timings_signal_mode_strings[] = { "Current DV Timings", "No Signal", "No Lock", "Out of Range", "Selected DV Timings", "Cycle Through All DV Timings", "Custom DV Timings", NULL, }; static const struct v4l2_ctrl_config vivid_ctrl_dv_timings_signal_mode = { .ops = &vivid_vid_cap_ctrl_ops, .id = VIVID_CID_DV_TIMINGS_SIGNAL_MODE, .name = "DV Timings Signal Mode", .type = V4L2_CTRL_TYPE_MENU, .max = 5, .qmenu = vivid_ctrl_dv_timings_signal_mode_strings, }; static const struct v4l2_ctrl_config vivid_ctrl_dv_timings_aspect_ratio = { .ops = &vivid_vid_cap_ctrl_ops, .id = VIVID_CID_DV_TIMINGS_ASPECT_RATIO, .name = "DV Timings Aspect Ratio", .type = V4L2_CTRL_TYPE_MENU, .max = 3, .qmenu = tpg_aspect_strings, }; static const struct v4l2_ctrl_config vivid_ctrl_max_edid_blocks = { .ops = &vivid_vid_cap_ctrl_ops, .id = VIVID_CID_MAX_EDID_BLOCKS, .name = "Maximum EDID Blocks", .type = V4L2_CTRL_TYPE_INTEGER, .min = 1, .max = 256, .def = 2, .step = 1, }; static const char * const vivid_ctrl_colorspace_strings[] = { "SMPTE 170M", "Rec. 709", "sRGB", "opRGB", "BT.2020", "DCI-P3", "SMPTE 240M", "470 System M", "470 System BG", NULL, }; static const struct v4l2_ctrl_config vivid_ctrl_colorspace = { .ops = &vivid_vid_cap_ctrl_ops, .id = VIVID_CID_COLORSPACE, .name = "Colorspace", .type = V4L2_CTRL_TYPE_MENU, .max = ARRAY_SIZE(vivid_ctrl_colorspace_strings) - 2, .def = 2, .qmenu = vivid_ctrl_colorspace_strings, }; static const char * const vivid_ctrl_xfer_func_strings[] = { "Default", "Rec. 709", "sRGB", "opRGB", "SMPTE 240M", "None", "DCI-P3", "SMPTE 2084", NULL, }; static const struct v4l2_ctrl_config vivid_ctrl_xfer_func = { .ops = &vivid_vid_cap_ctrl_ops, .id = VIVID_CID_XFER_FUNC, .name = "Transfer Function", .type = V4L2_CTRL_TYPE_MENU, .max = ARRAY_SIZE(vivid_ctrl_xfer_func_strings) - 2, .qmenu = vivid_ctrl_xfer_func_strings, }; static const char * const vivid_ctrl_ycbcr_enc_strings[] = { "Default", "ITU-R 601", "Rec. 709", "xvYCC 601", "xvYCC 709", "", "BT.2020", "BT.2020 Constant Luminance", "SMPTE 240M", NULL, }; static const struct v4l2_ctrl_config vivid_ctrl_ycbcr_enc = { .ops = &vivid_vid_cap_ctrl_ops, .id = VIVID_CID_YCBCR_ENC, .name = "Y'CbCr Encoding", .type = V4L2_CTRL_TYPE_MENU, .menu_skip_mask = 1 << 5, .max = ARRAY_SIZE(vivid_ctrl_ycbcr_enc_strings) - 2, .qmenu = vivid_ctrl_ycbcr_enc_strings, }; static const char * const vivid_ctrl_hsv_enc_strings[] = { "Hue 0-179", "Hue 0-256", NULL, }; static const struct v4l2_ctrl_config vivid_ctrl_hsv_enc = { .ops = &vivid_vid_cap_ctrl_ops, .id = VIVID_CID_HSV_ENC, .name = "HSV Encoding", .type = V4L2_CTRL_TYPE_MENU, .max = ARRAY_SIZE(vivid_ctrl_hsv_enc_strings) - 2, .qmenu = vivid_ctrl_hsv_enc_strings, }; static const char * const vivid_ctrl_quantization_strings[] = { "Default", "Full Range", "Limited Range", NULL, }; static const struct v4l2_ctrl_config vivid_ctrl_quantization = { .ops = &vivid_vid_cap_ctrl_ops, .id = VIVID_CID_QUANTIZATION, .name = "Quantization", .type = V4L2_CTRL_TYPE_MENU, .max = ARRAY_SIZE(vivid_ctrl_quantization_strings) - 2, .qmenu = vivid_ctrl_quantization_strings, }; static const struct v4l2_ctrl_config vivid_ctrl_alpha_mode = { .ops = &vivid_vid_cap_ctrl_ops, .id = VIVID_CID_ALPHA_MODE, .name = "Apply Alpha To Red Only", .type = V4L2_CTRL_TYPE_BOOLEAN, .max = 1, .step = 1, }; static const struct v4l2_ctrl_config vivid_ctrl_limited_rgb_range = { .ops = &vivid_vid_cap_ctrl_ops, .id = VIVID_CID_LIMITED_RGB_RANGE, .name = "Limited RGB Range (16-235)", .type = V4L2_CTRL_TYPE_BOOLEAN, .max = 1, .step = 1, }; /* Video Loop Control */ static int vivid_loop_cap_s_ctrl(struct v4l2_ctrl *ctrl) { struct vivid_dev *dev = container_of(ctrl->handler, struct vivid_dev, ctrl_hdl_loop_cap); switch (ctrl->id) { case VIVID_CID_LOOP_VIDEO: dev->loop_video = ctrl->val; vivid_update_quality(dev); vivid_send_source_change(dev, SVID); vivid_send_source_change(dev, HDMI); break; } return 0; } static const struct v4l2_ctrl_ops vivid_loop_cap_ctrl_ops = { .s_ctrl = vivid_loop_cap_s_ctrl, }; static const struct v4l2_ctrl_config vivid_ctrl_loop_video = { .ops = &vivid_loop_cap_ctrl_ops, .id = VIVID_CID_LOOP_VIDEO, .name = "Loop Video", .type = V4L2_CTRL_TYPE_BOOLEAN, .max = 1, .step = 1, }; /* VBI Capture Control */ static int vivid_vbi_cap_s_ctrl(struct v4l2_ctrl *ctrl) { struct vivid_dev *dev = container_of(ctrl->handler, struct vivid_dev, ctrl_hdl_vbi_cap); switch (ctrl->id) { case VIVID_CID_VBI_CAP_INTERLACED: dev->vbi_cap_interlaced = ctrl->val; break; } return 0; } static const struct v4l2_ctrl_ops vivid_vbi_cap_ctrl_ops = { .s_ctrl = vivid_vbi_cap_s_ctrl, }; static const struct v4l2_ctrl_config vivid_ctrl_vbi_cap_interlaced = { .ops = &vivid_vbi_cap_ctrl_ops, .id = VIVID_CID_VBI_CAP_INTERLACED, .name = "Interlaced VBI Format", .type = V4L2_CTRL_TYPE_BOOLEAN, .max = 1, .step = 1, }; /* Video Output Controls */ static int vivid_vid_out_s_ctrl(struct v4l2_ctrl *ctrl) { struct vivid_dev *dev = container_of(ctrl->handler, struct vivid_dev, ctrl_hdl_vid_out); struct v4l2_bt_timings *bt = &dev->dv_timings_out.bt; u32 display_present = 0; unsigned int i, j, bus_idx; switch (ctrl->id) { case VIVID_CID_HAS_CROP_OUT: dev->has_crop_out = ctrl->val; vivid_update_format_out(dev); break; case VIVID_CID_HAS_COMPOSE_OUT: dev->has_compose_out = ctrl->val; vivid_update_format_out(dev); break; case VIVID_CID_HAS_SCALER_OUT: dev->has_scaler_out = ctrl->val; vivid_update_format_out(dev); break; case V4L2_CID_DV_TX_MODE: dev->dvi_d_out = ctrl->val == V4L2_DV_TX_MODE_DVI_D; if (!vivid_is_hdmi_out(dev)) break; if (!dev->dvi_d_out && (bt->flags & V4L2_DV_FL_IS_CE_VIDEO)) { if (bt->width == 720 && bt->height <= 576) dev->colorspace_out = V4L2_COLORSPACE_SMPTE170M; else dev->colorspace_out = V4L2_COLORSPACE_REC709; dev->quantization_out = V4L2_QUANTIZATION_DEFAULT; } else { dev->colorspace_out = V4L2_COLORSPACE_SRGB; dev->quantization_out = dev->dvi_d_out ? V4L2_QUANTIZATION_LIM_RANGE : V4L2_QUANTIZATION_DEFAULT; } if (dev->loop_video) vivid_send_source_change(dev, HDMI); break; case VIVID_CID_DISPLAY_PRESENT: if (dev->output_type[dev->output] != HDMI) break; dev->display_present[dev->output] = ctrl->val; for (i = 0, j = 0; i < dev->num_outputs; i++) if (dev->output_type[i] == HDMI) display_present |= dev->display_present[i] << j++; __v4l2_ctrl_s_ctrl(dev->ctrl_tx_rxsense, display_present); if (dev->edid_blocks) { __v4l2_ctrl_s_ctrl(dev->ctrl_tx_edid_present, display_present); __v4l2_ctrl_s_ctrl(dev->ctrl_tx_hotplug, display_present); } bus_idx = dev->cec_output2bus_map[dev->output]; if (!dev->cec_tx_adap[bus_idx]) break; if (ctrl->val && dev->edid_blocks) cec_s_phys_addr(dev->cec_tx_adap[bus_idx], dev->cec_tx_adap[bus_idx]->phys_addr, false); else cec_phys_addr_invalidate(dev->cec_tx_adap[bus_idx]); break; } return 0; } static const struct v4l2_ctrl_ops vivid_vid_out_ctrl_ops = { .s_ctrl = vivid_vid_out_s_ctrl, }; static const struct v4l2_ctrl_config vivid_ctrl_has_crop_out = { .ops = &vivid_vid_out_ctrl_ops, .id = VIVID_CID_HAS_CROP_OUT, .name = "Enable Output Cropping", .type = V4L2_CTRL_TYPE_BOOLEAN, .max = 1, .def = 1, .step = 1, }; static const struct v4l2_ctrl_config vivid_ctrl_has_compose_out = { .ops = &vivid_vid_out_ctrl_ops, .id = VIVID_CID_HAS_COMPOSE_OUT, .name = "Enable Output Composing", .type = V4L2_CTRL_TYPE_BOOLEAN, .max = 1, .def = 1, .step = 1, }; static const struct v4l2_ctrl_config vivid_ctrl_has_scaler_out = { .ops = &vivid_vid_out_ctrl_ops, .id = VIVID_CID_HAS_SCALER_OUT, .name = "Enable Output Scaler", .type = V4L2_CTRL_TYPE_BOOLEAN, .max = 1, .def = 1, .step = 1, }; static const struct v4l2_ctrl_config vivid_ctrl_display_present = { .ops = &vivid_vid_out_ctrl_ops, .id = VIVID_CID_DISPLAY_PRESENT, .name = "Display Present", .type = V4L2_CTRL_TYPE_BOOLEAN, .max = 1, .def = 1, .step = 1, }; /* Streaming Controls */ static int vivid_streaming_s_ctrl(struct v4l2_ctrl *ctrl) { struct vivid_dev *dev = container_of(ctrl->handler, struct vivid_dev, ctrl_hdl_streaming); switch (ctrl->id) { case VIVID_CID_DQBUF_ERROR: dev->dqbuf_error = true; break; case VIVID_CID_PERC_DROPPED: dev->perc_dropped_buffers = ctrl->val; break; case VIVID_CID_QUEUE_SETUP_ERROR: dev->queue_setup_error = true; break; case VIVID_CID_BUF_PREPARE_ERROR: dev->buf_prepare_error = true; break; case VIVID_CID_START_STR_ERROR: dev->start_streaming_error = true; break; case VIVID_CID_REQ_VALIDATE_ERROR: dev->req_validate_error = true; break; case VIVID_CID_QUEUE_ERROR: if (vb2_start_streaming_called(&dev->vb_vid_cap_q)) vb2_queue_error(&dev->vb_vid_cap_q); if (vb2_start_streaming_called(&dev->vb_vbi_cap_q)) vb2_queue_error(&dev->vb_vbi_cap_q); if (vb2_start_streaming_called(&dev->vb_vid_out_q)) vb2_queue_error(&dev->vb_vid_out_q); if (vb2_start_streaming_called(&dev->vb_vbi_out_q)) vb2_queue_error(&dev->vb_vbi_out_q); if (vb2_start_streaming_called(&dev->vb_sdr_cap_q)) vb2_queue_error(&dev->vb_sdr_cap_q); break; case VIVID_CID_SEQ_WRAP: dev->seq_wrap = ctrl->val; break; case VIVID_CID_TIME_WRAP: dev->time_wrap = ctrl->val; if (dev->time_wrap == 1) dev->time_wrap = (1ULL << 63) - NSEC_PER_SEC * 16ULL; else if (dev->time_wrap == 2) dev->time_wrap = ((1ULL << 31) - 16) * NSEC_PER_SEC; break; } return 0; } static const struct v4l2_ctrl_ops vivid_streaming_ctrl_ops = { .s_ctrl = vivid_streaming_s_ctrl, }; static const struct v4l2_ctrl_config vivid_ctrl_dqbuf_error = { .ops = &vivid_streaming_ctrl_ops, .id = VIVID_CID_DQBUF_ERROR, .name = "Inject V4L2_BUF_FLAG_ERROR", .type = V4L2_CTRL_TYPE_BUTTON, }; static const struct v4l2_ctrl_config vivid_ctrl_perc_dropped = { .ops = &vivid_streaming_ctrl_ops, .id = VIVID_CID_PERC_DROPPED, .name = "Percentage of Dropped Buffers", .type = V4L2_CTRL_TYPE_INTEGER, .min = 0, .max = 100, .step = 1, }; static const struct v4l2_ctrl_config vivid_ctrl_queue_setup_error = { .ops = &vivid_streaming_ctrl_ops, .id = VIVID_CID_QUEUE_SETUP_ERROR, .name = "Inject VIDIOC_REQBUFS Error", .type = V4L2_CTRL_TYPE_BUTTON, }; static const struct v4l2_ctrl_config vivid_ctrl_buf_prepare_error = { .ops = &vivid_streaming_ctrl_ops, .id = VIVID_CID_BUF_PREPARE_ERROR, .name = "Inject VIDIOC_QBUF Error", .type = V4L2_CTRL_TYPE_BUTTON, }; static const struct v4l2_ctrl_config vivid_ctrl_start_streaming_error = { .ops = &vivid_streaming_ctrl_ops, .id = VIVID_CID_START_STR_ERROR, .name = "Inject VIDIOC_STREAMON Error", .type = V4L2_CTRL_TYPE_BUTTON, }; static const struct v4l2_ctrl_config vivid_ctrl_queue_error = { .ops = &vivid_streaming_ctrl_ops, .id = VIVID_CID_QUEUE_ERROR, .name = "Inject Fatal Streaming Error", .type = V4L2_CTRL_TYPE_BUTTON, }; #ifdef CONFIG_MEDIA_CONTROLLER static const struct v4l2_ctrl_config vivid_ctrl_req_validate_error = { .ops = &vivid_streaming_ctrl_ops, .id = VIVID_CID_REQ_VALIDATE_ERROR, .name = "Inject req_validate() Error", .type = V4L2_CTRL_TYPE_BUTTON, }; #endif static const struct v4l2_ctrl_config vivid_ctrl_seq_wrap = { .ops = &vivid_streaming_ctrl_ops, .id = VIVID_CID_SEQ_WRAP, .name = "Wrap Sequence Number", .type = V4L2_CTRL_TYPE_BOOLEAN, .max = 1, .step = 1, }; static const char * const vivid_ctrl_time_wrap_strings[] = { "None", "64 Bit", "32 Bit", NULL, }; static const struct v4l2_ctrl_config vivid_ctrl_time_wrap = { .ops = &vivid_streaming_ctrl_ops, .id = VIVID_CID_TIME_WRAP, .name = "Wrap Timestamp", .type = V4L2_CTRL_TYPE_MENU, .max = ARRAY_SIZE(vivid_ctrl_time_wrap_strings) - 2, .qmenu = vivid_ctrl_time_wrap_strings, }; /* SDTV Capture Controls */ static int vivid_sdtv_cap_s_ctrl(struct v4l2_ctrl *ctrl) { struct vivid_dev *dev = container_of(ctrl->handler, struct vivid_dev, ctrl_hdl_sdtv_cap); switch (ctrl->id) { case VIVID_CID_STD_SIGNAL_MODE: dev->std_signal_mode[dev->input] = dev->ctrl_std_signal_mode->val; if (dev->std_signal_mode[dev->input] == SELECTED_STD) dev->query_std[dev->input] = vivid_standard[dev->ctrl_standard->val]; v4l2_ctrl_activate(dev->ctrl_standard, dev->std_signal_mode[dev->input] == SELECTED_STD); vivid_update_quality(dev); vivid_send_source_change(dev, TV); vivid_send_source_change(dev, SVID); break; } return 0; } static const struct v4l2_ctrl_ops vivid_sdtv_cap_ctrl_ops = { .s_ctrl = vivid_sdtv_cap_s_ctrl, }; static const char * const vivid_ctrl_std_signal_mode_strings[] = { "Current Standard", "No Signal", "No Lock", "", "Selected Standard", "Cycle Through All Standards", NULL, }; static const struct v4l2_ctrl_config vivid_ctrl_std_signal_mode = { .ops = &vivid_sdtv_cap_ctrl_ops, .id = VIVID_CID_STD_SIGNAL_MODE, .name = "Standard Signal Mode", .type = V4L2_CTRL_TYPE_MENU, .max = ARRAY_SIZE(vivid_ctrl_std_signal_mode_strings) - 2, .menu_skip_mask = 1 << 3, .qmenu = vivid_ctrl_std_signal_mode_strings, }; static const struct v4l2_ctrl_config vivid_ctrl_standard = { .ops = &vivid_sdtv_cap_ctrl_ops, .id = VIVID_CID_STANDARD, .name = "Standard", .type = V4L2_CTRL_TYPE_MENU, .max = 14, .qmenu = vivid_ctrl_standard_strings, }; /* Radio Receiver Controls */ static int vivid_radio_rx_s_ctrl(struct v4l2_ctrl *ctrl) { struct vivid_dev *dev = container_of(ctrl->handler, struct vivid_dev, ctrl_hdl_radio_rx); switch (ctrl->id) { case VIVID_CID_RADIO_SEEK_MODE: dev->radio_rx_hw_seek_mode = ctrl->val; break; case VIVID_CID_RADIO_SEEK_PROG_LIM: dev->radio_rx_hw_seek_prog_lim = ctrl->val; break; case VIVID_CID_RADIO_RX_RDS_RBDS: dev->rds_gen.use_rbds = ctrl->val; break; case VIVID_CID_RADIO_RX_RDS_BLOCKIO: dev->radio_rx_rds_controls = ctrl->val; dev->radio_rx_caps &= ~V4L2_CAP_READWRITE; dev->radio_rx_rds_use_alternates = false; if (!dev->radio_rx_rds_controls) { dev->radio_rx_caps |= V4L2_CAP_READWRITE; __v4l2_ctrl_s_ctrl(dev->radio_rx_rds_pty, 0); __v4l2_ctrl_s_ctrl(dev->radio_rx_rds_ta, 0); __v4l2_ctrl_s_ctrl(dev->radio_rx_rds_tp, 0); __v4l2_ctrl_s_ctrl(dev->radio_rx_rds_ms, 0); __v4l2_ctrl_s_ctrl_string(dev->radio_rx_rds_psname, ""); __v4l2_ctrl_s_ctrl_string(dev->radio_rx_rds_radiotext, ""); } v4l2_ctrl_activate(dev->radio_rx_rds_pty, dev->radio_rx_rds_controls); v4l2_ctrl_activate(dev->radio_rx_rds_psname, dev->radio_rx_rds_controls); v4l2_ctrl_activate(dev->radio_rx_rds_radiotext, dev->radio_rx_rds_controls); v4l2_ctrl_activate(dev->radio_rx_rds_ta, dev->radio_rx_rds_controls); v4l2_ctrl_activate(dev->radio_rx_rds_tp, dev->radio_rx_rds_controls); v4l2_ctrl_activate(dev->radio_rx_rds_ms, dev->radio_rx_rds_controls); dev->radio_rx_dev.device_caps = dev->radio_rx_caps; break; case V4L2_CID_RDS_RECEPTION: dev->radio_rx_rds_enabled = ctrl->val; break; } return 0; } static const struct v4l2_ctrl_ops vivid_radio_rx_ctrl_ops = { .s_ctrl = vivid_radio_rx_s_ctrl, }; static const char * const vivid_ctrl_radio_rds_mode_strings[] = { "Block I/O", "Controls", NULL, }; static const struct v4l2_ctrl_config vivid_ctrl_radio_rx_rds_blockio = { .ops = &vivid_radio_rx_ctrl_ops, .id = VIVID_CID_RADIO_RX_RDS_BLOCKIO, .name = "RDS Rx I/O Mode", .type = V4L2_CTRL_TYPE_MENU, .qmenu = vivid_ctrl_radio_rds_mode_strings, .max = 1, }; static const struct v4l2_ctrl_config vivid_ctrl_radio_rx_rds_rbds = { .ops = &vivid_radio_rx_ctrl_ops, .id = VIVID_CID_RADIO_RX_RDS_RBDS, .name = "Generate RBDS Instead of RDS", .type = V4L2_CTRL_TYPE_BOOLEAN, .max = 1, .step = 1, }; static const char * const vivid_ctrl_radio_hw_seek_mode_strings[] = { "Bounded", "Wrap Around", "Both", NULL, }; static const struct v4l2_ctrl_config vivid_ctrl_radio_hw_seek_mode = { .ops = &vivid_radio_rx_ctrl_ops, .id = VIVID_CID_RADIO_SEEK_MODE, .name = "Radio HW Seek Mode", .type = V4L2_CTRL_TYPE_MENU, .max = 2, .qmenu = vivid_ctrl_radio_hw_seek_mode_strings, }; static const struct v4l2_ctrl_config vivid_ctrl_radio_hw_seek_prog_lim = { .ops = &vivid_radio_rx_ctrl_ops, .id = VIVID_CID_RADIO_SEEK_PROG_LIM, .name = "Radio Programmable HW Seek", .type = V4L2_CTRL_TYPE_BOOLEAN, .max = 1, .step = 1, }; /* Radio Transmitter Controls */ static int vivid_radio_tx_s_ctrl(struct v4l2_ctrl *ctrl) { struct vivid_dev *dev = container_of(ctrl->handler, struct vivid_dev, ctrl_hdl_radio_tx); switch (ctrl->id) { case VIVID_CID_RADIO_TX_RDS_BLOCKIO: dev->radio_tx_rds_controls = ctrl->val; dev->radio_tx_caps &= ~V4L2_CAP_READWRITE; if (!dev->radio_tx_rds_controls) dev->radio_tx_caps |= V4L2_CAP_READWRITE; dev->radio_tx_dev.device_caps = dev->radio_tx_caps; break; case V4L2_CID_RDS_TX_PTY: if (dev->radio_rx_rds_controls) v4l2_ctrl_s_ctrl(dev->radio_rx_rds_pty, ctrl->val); break; case V4L2_CID_RDS_TX_PS_NAME: if (dev->radio_rx_rds_controls) v4l2_ctrl_s_ctrl_string(dev->radio_rx_rds_psname, ctrl->p_new.p_char); break; case V4L2_CID_RDS_TX_RADIO_TEXT: if (dev->radio_rx_rds_controls) v4l2_ctrl_s_ctrl_string(dev->radio_rx_rds_radiotext, ctrl->p_new.p_char); break; case V4L2_CID_RDS_TX_TRAFFIC_ANNOUNCEMENT: if (dev->radio_rx_rds_controls) v4l2_ctrl_s_ctrl(dev->radio_rx_rds_ta, ctrl->val); break; case V4L2_CID_RDS_TX_TRAFFIC_PROGRAM: if (dev->radio_rx_rds_controls) v4l2_ctrl_s_ctrl(dev->radio_rx_rds_tp, ctrl->val); break; case V4L2_CID_RDS_TX_MUSIC_SPEECH: if (dev->radio_rx_rds_controls) v4l2_ctrl_s_ctrl(dev->radio_rx_rds_ms, ctrl->val); break; } return 0; } static const struct v4l2_ctrl_ops vivid_radio_tx_ctrl_ops = { .s_ctrl = vivid_radio_tx_s_ctrl, }; static const struct v4l2_ctrl_config vivid_ctrl_radio_tx_rds_blockio = { .ops = &vivid_radio_tx_ctrl_ops, .id = VIVID_CID_RADIO_TX_RDS_BLOCKIO, .name = "RDS Tx I/O Mode", .type = V4L2_CTRL_TYPE_MENU, .qmenu = vivid_ctrl_radio_rds_mode_strings, .max = 1, .def = 1, }; /* SDR Capture Controls */ static int vivid_sdr_cap_s_ctrl(struct v4l2_ctrl *ctrl) { struct vivid_dev *dev = container_of(ctrl->handler, struct vivid_dev, ctrl_hdl_sdr_cap); switch (ctrl->id) { case VIVID_CID_SDR_CAP_FM_DEVIATION: dev->sdr_fm_deviation = ctrl->val; break; } return 0; } static const struct v4l2_ctrl_ops vivid_sdr_cap_ctrl_ops = { .s_ctrl = vivid_sdr_cap_s_ctrl, }; static const struct v4l2_ctrl_config vivid_ctrl_sdr_cap_fm_deviation = { .ops = &vivid_sdr_cap_ctrl_ops, .id = VIVID_CID_SDR_CAP_FM_DEVIATION, .name = "FM Deviation", .type = V4L2_CTRL_TYPE_INTEGER, .min = 100, .max = 200000, .def = 75000, .step = 1, }; /* Metadata Capture Control */ static int vivid_meta_cap_s_ctrl(struct v4l2_ctrl *ctrl) { struct vivid_dev *dev = container_of(ctrl->handler, struct vivid_dev, ctrl_hdl_meta_cap); switch (ctrl->id) { case VIVID_CID_META_CAP_GENERATE_PTS: dev->meta_pts = ctrl->val; break; case VIVID_CID_META_CAP_GENERATE_SCR: dev->meta_scr = ctrl->val; break; } return 0; } static const struct v4l2_ctrl_ops vivid_meta_cap_ctrl_ops = { .s_ctrl = vivid_meta_cap_s_ctrl, }; static const struct v4l2_ctrl_config vivid_ctrl_meta_has_pts = { .ops = &vivid_meta_cap_ctrl_ops, .id = VIVID_CID_META_CAP_GENERATE_PTS, .name = "Generate PTS", .type = V4L2_CTRL_TYPE_BOOLEAN, .max = 1, .def = 1, .step = 1, }; static const struct v4l2_ctrl_config vivid_ctrl_meta_has_src_clk = { .ops = &vivid_meta_cap_ctrl_ops, .id = VIVID_CID_META_CAP_GENERATE_SCR, .name = "Generate SCR", .type = V4L2_CTRL_TYPE_BOOLEAN, .max = 1, .def = 1, .step = 1, }; static const struct v4l2_ctrl_config vivid_ctrl_class = { .ops = &vivid_user_gen_ctrl_ops, .flags = V4L2_CTRL_FLAG_READ_ONLY | V4L2_CTRL_FLAG_WRITE_ONLY, .id = VIVID_CID_VIVID_CLASS, .name = "Vivid Controls", .type = V4L2_CTRL_TYPE_CTRL_CLASS, }; int vivid_create_controls(struct vivid_dev *dev, bool show_ccs_cap, bool show_ccs_out, bool no_error_inj, bool has_sdtv, bool has_hdmi) { struct v4l2_ctrl_handler *hdl_user_gen = &dev->ctrl_hdl_user_gen; struct v4l2_ctrl_handler *hdl_user_vid = &dev->ctrl_hdl_user_vid; struct v4l2_ctrl_handler *hdl_user_aud = &dev->ctrl_hdl_user_aud; struct v4l2_ctrl_handler *hdl_streaming = &dev->ctrl_hdl_streaming; struct v4l2_ctrl_handler *hdl_sdtv_cap = &dev->ctrl_hdl_sdtv_cap; struct v4l2_ctrl_handler *hdl_loop_cap = &dev->ctrl_hdl_loop_cap; struct v4l2_ctrl_handler *hdl_fb = &dev->ctrl_hdl_fb; struct v4l2_ctrl_handler *hdl_vid_cap = &dev->ctrl_hdl_vid_cap; struct v4l2_ctrl_handler *hdl_vid_out = &dev->ctrl_hdl_vid_out; struct v4l2_ctrl_handler *hdl_vbi_cap = &dev->ctrl_hdl_vbi_cap; struct v4l2_ctrl_handler *hdl_vbi_out = &dev->ctrl_hdl_vbi_out; struct v4l2_ctrl_handler *hdl_radio_rx = &dev->ctrl_hdl_radio_rx; struct v4l2_ctrl_handler *hdl_radio_tx = &dev->ctrl_hdl_radio_tx; struct v4l2_ctrl_handler *hdl_sdr_cap = &dev->ctrl_hdl_sdr_cap; struct v4l2_ctrl_handler *hdl_meta_cap = &dev->ctrl_hdl_meta_cap; struct v4l2_ctrl_handler *hdl_meta_out = &dev->ctrl_hdl_meta_out; struct v4l2_ctrl_handler *hdl_tch_cap = &dev->ctrl_hdl_touch_cap; struct v4l2_ctrl_config vivid_ctrl_dv_timings = { .ops = &vivid_vid_cap_ctrl_ops, .id = VIVID_CID_DV_TIMINGS, .name = "DV Timings", .type = V4L2_CTRL_TYPE_MENU, }; int i; v4l2_ctrl_handler_init(hdl_user_gen, 10); v4l2_ctrl_new_custom(hdl_user_gen, &vivid_ctrl_class, NULL); v4l2_ctrl_handler_init(hdl_user_vid, 9); v4l2_ctrl_new_custom(hdl_user_vid, &vivid_ctrl_class, NULL); v4l2_ctrl_handler_init(hdl_user_aud, 2); v4l2_ctrl_new_custom(hdl_user_aud, &vivid_ctrl_class, NULL); v4l2_ctrl_handler_init(hdl_streaming, 8); v4l2_ctrl_new_custom(hdl_streaming, &vivid_ctrl_class, NULL); v4l2_ctrl_handler_init(hdl_sdtv_cap, 2); v4l2_ctrl_new_custom(hdl_sdtv_cap, &vivid_ctrl_class, NULL); v4l2_ctrl_handler_init(hdl_loop_cap, 1); v4l2_ctrl_new_custom(hdl_loop_cap, &vivid_ctrl_class, NULL); v4l2_ctrl_handler_init(hdl_fb, 1); v4l2_ctrl_new_custom(hdl_fb, &vivid_ctrl_class, NULL); v4l2_ctrl_handler_init(hdl_vid_cap, 55); v4l2_ctrl_new_custom(hdl_vid_cap, &vivid_ctrl_class, NULL); v4l2_ctrl_handler_init(hdl_vid_out, 26); if (!no_error_inj || dev->has_fb || dev->num_hdmi_outputs) v4l2_ctrl_new_custom(hdl_vid_out, &vivid_ctrl_class, NULL); v4l2_ctrl_handler_init(hdl_vbi_cap, 21); v4l2_ctrl_new_custom(hdl_vbi_cap, &vivid_ctrl_class, NULL); v4l2_ctrl_handler_init(hdl_vbi_out, 19); if (!no_error_inj) v4l2_ctrl_new_custom(hdl_vbi_out, &vivid_ctrl_class, NULL); v4l2_ctrl_handler_init(hdl_radio_rx, 17); v4l2_ctrl_new_custom(hdl_radio_rx, &vivid_ctrl_class, NULL); v4l2_ctrl_handler_init(hdl_radio_tx, 17); v4l2_ctrl_new_custom(hdl_radio_tx, &vivid_ctrl_class, NULL); v4l2_ctrl_handler_init(hdl_sdr_cap, 19); v4l2_ctrl_new_custom(hdl_sdr_cap, &vivid_ctrl_class, NULL); v4l2_ctrl_handler_init(hdl_meta_cap, 2); v4l2_ctrl_new_custom(hdl_meta_cap, &vivid_ctrl_class, NULL); v4l2_ctrl_handler_init(hdl_meta_out, 2); v4l2_ctrl_new_custom(hdl_meta_out, &vivid_ctrl_class, NULL); v4l2_ctrl_handler_init(hdl_tch_cap, 2); v4l2_ctrl_new_custom(hdl_tch_cap, &vivid_ctrl_class, NULL); /* User Controls */ dev->volume = v4l2_ctrl_new_std(hdl_user_aud, NULL, V4L2_CID_AUDIO_VOLUME, 0, 255, 1, 200); dev->mute = v4l2_ctrl_new_std(hdl_user_aud, NULL, V4L2_CID_AUDIO_MUTE, 0, 1, 1, 0); if (dev->has_vid_cap) { dev->brightness = v4l2_ctrl_new_std(hdl_user_vid, &vivid_user_vid_ctrl_ops, V4L2_CID_BRIGHTNESS, 0, 255, 1, 128); for (i = 0; i < MAX_INPUTS; i++) dev->input_brightness[i] = 128; dev->contrast = v4l2_ctrl_new_std(hdl_user_vid, &vivid_user_vid_ctrl_ops, V4L2_CID_CONTRAST, 0, 255, 1, 128); dev->saturation = v4l2_ctrl_new_std(hdl_user_vid, &vivid_user_vid_ctrl_ops, V4L2_CID_SATURATION, 0, 255, 1, 128); dev->hue = v4l2_ctrl_new_std(hdl_user_vid, &vivid_user_vid_ctrl_ops, V4L2_CID_HUE, -128, 128, 1, 0); v4l2_ctrl_new_std(hdl_user_vid, &vivid_user_vid_ctrl_ops, V4L2_CID_HFLIP, 0, 1, 1, 0); v4l2_ctrl_new_std(hdl_user_vid, &vivid_user_vid_ctrl_ops, V4L2_CID_VFLIP, 0, 1, 1, 0); dev->autogain = v4l2_ctrl_new_std(hdl_user_vid, &vivid_user_vid_ctrl_ops, V4L2_CID_AUTOGAIN, 0, 1, 1, 1); dev->gain = v4l2_ctrl_new_std(hdl_user_vid, &vivid_user_vid_ctrl_ops, V4L2_CID_GAIN, 0, 255, 1, 100); dev->alpha = v4l2_ctrl_new_std(hdl_user_vid, &vivid_user_vid_ctrl_ops, V4L2_CID_ALPHA_COMPONENT, 0, 255, 1, 0); } dev->button = v4l2_ctrl_new_custom(hdl_user_gen, &vivid_ctrl_button, NULL); dev->int32 = v4l2_ctrl_new_custom(hdl_user_gen, &vivid_ctrl_int32, NULL); dev->int64 = v4l2_ctrl_new_custom(hdl_user_gen, &vivid_ctrl_int64, NULL); dev->boolean = v4l2_ctrl_new_custom(hdl_user_gen, &vivid_ctrl_boolean, NULL); dev->menu = v4l2_ctrl_new_custom(hdl_user_gen, &vivid_ctrl_menu, NULL); dev->string = v4l2_ctrl_new_custom(hdl_user_gen, &vivid_ctrl_string, NULL); dev->bitmask = v4l2_ctrl_new_custom(hdl_user_gen, &vivid_ctrl_bitmask, NULL); dev->int_menu = v4l2_ctrl_new_custom(hdl_user_gen, &vivid_ctrl_int_menu, NULL); dev->ro_int32 = v4l2_ctrl_new_custom(hdl_user_gen, &vivid_ctrl_ro_int32, NULL); v4l2_ctrl_new_custom(hdl_user_gen, &vivid_ctrl_area, NULL); v4l2_ctrl_new_custom(hdl_user_gen, &vivid_ctrl_u32_array, NULL); v4l2_ctrl_new_custom(hdl_user_gen, &vivid_ctrl_u32_dyn_array, NULL); v4l2_ctrl_new_custom(hdl_user_gen, &vivid_ctrl_u16_matrix, NULL); v4l2_ctrl_new_custom(hdl_user_gen, &vivid_ctrl_u8_4d_array, NULL); dev->pixel_array = v4l2_ctrl_new_custom(hdl_user_gen, &vivid_ctrl_u8_pixel_array, NULL); v4l2_ctrl_new_custom(hdl_user_gen, &vivid_ctrl_s32_array, NULL); v4l2_ctrl_new_custom(hdl_user_gen, &vivid_ctrl_s64_array, NULL); if (dev->has_vid_cap) { /* Image Processing Controls */ struct v4l2_ctrl_config vivid_ctrl_test_pattern = { .ops = &vivid_vid_cap_ctrl_ops, .id = VIVID_CID_TEST_PATTERN, .name = "Test Pattern", .type = V4L2_CTRL_TYPE_MENU, .max = TPG_PAT_NOISE, .qmenu = tpg_pattern_strings, }; dev->test_pattern = v4l2_ctrl_new_custom(hdl_vid_cap, &vivid_ctrl_test_pattern, NULL); v4l2_ctrl_new_custom(hdl_vid_cap, &vivid_ctrl_perc_fill, NULL); v4l2_ctrl_new_custom(hdl_vid_cap, &vivid_ctrl_hor_movement, NULL); v4l2_ctrl_new_custom(hdl_vid_cap, &vivid_ctrl_vert_movement, NULL); v4l2_ctrl_new_custom(hdl_vid_cap, &vivid_ctrl_osd_mode, NULL); v4l2_ctrl_new_custom(hdl_vid_cap, &vivid_ctrl_show_border, NULL); v4l2_ctrl_new_custom(hdl_vid_cap, &vivid_ctrl_show_square, NULL); v4l2_ctrl_new_custom(hdl_vid_cap, &vivid_ctrl_hflip, NULL); v4l2_ctrl_new_custom(hdl_vid_cap, &vivid_ctrl_vflip, NULL); v4l2_ctrl_new_custom(hdl_vid_cap, &vivid_ctrl_insert_sav, NULL); v4l2_ctrl_new_custom(hdl_vid_cap, &vivid_ctrl_insert_eav, NULL); v4l2_ctrl_new_custom(hdl_vid_cap, &vivid_ctrl_insert_hdmi_video_guard_band, NULL); v4l2_ctrl_new_custom(hdl_vid_cap, &vivid_ctrl_reduced_fps, NULL); if (show_ccs_cap) { dev->ctrl_has_crop_cap = v4l2_ctrl_new_custom(hdl_vid_cap, &vivid_ctrl_has_crop_cap, NULL); dev->ctrl_has_compose_cap = v4l2_ctrl_new_custom(hdl_vid_cap, &vivid_ctrl_has_compose_cap, NULL); dev->ctrl_has_scaler_cap = v4l2_ctrl_new_custom(hdl_vid_cap, &vivid_ctrl_has_scaler_cap, NULL); } v4l2_ctrl_new_custom(hdl_vid_cap, &vivid_ctrl_tstamp_src, NULL); dev->colorspace = v4l2_ctrl_new_custom(hdl_vid_cap, &vivid_ctrl_colorspace, NULL); v4l2_ctrl_new_custom(hdl_vid_cap, &vivid_ctrl_xfer_func, NULL); v4l2_ctrl_new_custom(hdl_vid_cap, &vivid_ctrl_ycbcr_enc, NULL); v4l2_ctrl_new_custom(hdl_vid_cap, &vivid_ctrl_hsv_enc, NULL); v4l2_ctrl_new_custom(hdl_vid_cap, &vivid_ctrl_quantization, NULL); v4l2_ctrl_new_custom(hdl_vid_cap, &vivid_ctrl_alpha_mode, NULL); } if (dev->has_vid_out && show_ccs_out) { dev->ctrl_has_crop_out = v4l2_ctrl_new_custom(hdl_vid_out, &vivid_ctrl_has_crop_out, NULL); dev->ctrl_has_compose_out = v4l2_ctrl_new_custom(hdl_vid_out, &vivid_ctrl_has_compose_out, NULL); dev->ctrl_has_scaler_out = v4l2_ctrl_new_custom(hdl_vid_out, &vivid_ctrl_has_scaler_out, NULL); } /* * Testing this driver with v4l2-compliance will trigger the error * injection controls, and after that nothing will work as expected. * So we have a module option to drop these error injecting controls * allowing us to run v4l2_compliance again. */ if (!no_error_inj) { v4l2_ctrl_new_custom(hdl_user_gen, &vivid_ctrl_disconnect, NULL); v4l2_ctrl_new_custom(hdl_streaming, &vivid_ctrl_dqbuf_error, NULL); v4l2_ctrl_new_custom(hdl_streaming, &vivid_ctrl_perc_dropped, NULL); v4l2_ctrl_new_custom(hdl_streaming, &vivid_ctrl_queue_setup_error, NULL); v4l2_ctrl_new_custom(hdl_streaming, &vivid_ctrl_buf_prepare_error, NULL); v4l2_ctrl_new_custom(hdl_streaming, &vivid_ctrl_start_streaming_error, NULL); v4l2_ctrl_new_custom(hdl_streaming, &vivid_ctrl_queue_error, NULL); #ifdef CONFIG_MEDIA_CONTROLLER v4l2_ctrl_new_custom(hdl_streaming, &vivid_ctrl_req_validate_error, NULL); #endif v4l2_ctrl_new_custom(hdl_streaming, &vivid_ctrl_seq_wrap, NULL); v4l2_ctrl_new_custom(hdl_streaming, &vivid_ctrl_time_wrap, NULL); } if (has_sdtv && (dev->has_vid_cap || dev->has_vbi_cap)) { if (dev->has_vid_cap) v4l2_ctrl_new_custom(hdl_vid_cap, &vivid_ctrl_std_aspect_ratio, NULL); dev->ctrl_std_signal_mode = v4l2_ctrl_new_custom(hdl_sdtv_cap, &vivid_ctrl_std_signal_mode, NULL); dev->ctrl_standard = v4l2_ctrl_new_custom(hdl_sdtv_cap, &vivid_ctrl_standard, NULL); if (dev->ctrl_std_signal_mode) v4l2_ctrl_cluster(2, &dev->ctrl_std_signal_mode); if (dev->has_raw_vbi_cap) v4l2_ctrl_new_custom(hdl_vbi_cap, &vivid_ctrl_vbi_cap_interlaced, NULL); } if (dev->num_hdmi_inputs) { s64 hdmi_input_mask = GENMASK(dev->num_hdmi_inputs - 1, 0); dev->ctrl_dv_timings_signal_mode = v4l2_ctrl_new_custom(hdl_vid_cap, &vivid_ctrl_dv_timings_signal_mode, NULL); vivid_ctrl_dv_timings.max = dev->query_dv_timings_size - 1; vivid_ctrl_dv_timings.qmenu = (const char * const *)dev->query_dv_timings_qmenu; dev->ctrl_dv_timings = v4l2_ctrl_new_custom(hdl_vid_cap, &vivid_ctrl_dv_timings, NULL); if (dev->ctrl_dv_timings_signal_mode) v4l2_ctrl_cluster(2, &dev->ctrl_dv_timings_signal_mode); v4l2_ctrl_new_custom(hdl_vid_cap, &vivid_ctrl_dv_timings_aspect_ratio, NULL); v4l2_ctrl_new_custom(hdl_vid_cap, &vivid_ctrl_max_edid_blocks, NULL); dev->real_rgb_range_cap = v4l2_ctrl_new_custom(hdl_vid_cap, &vivid_ctrl_limited_rgb_range, NULL); dev->rgb_range_cap = v4l2_ctrl_new_std_menu(hdl_vid_cap, &vivid_vid_cap_ctrl_ops, V4L2_CID_DV_RX_RGB_RANGE, V4L2_DV_RGB_RANGE_FULL, 0, V4L2_DV_RGB_RANGE_AUTO); dev->ctrl_rx_power_present = v4l2_ctrl_new_std(hdl_vid_cap, NULL, V4L2_CID_DV_RX_POWER_PRESENT, 0, hdmi_input_mask, 0, hdmi_input_mask); } if (dev->num_hdmi_outputs) { s64 hdmi_output_mask = GENMASK(dev->num_hdmi_outputs - 1, 0); /* * We aren't doing anything with this at the moment, but * HDMI outputs typically have this controls. */ dev->ctrl_tx_rgb_range = v4l2_ctrl_new_std_menu(hdl_vid_out, NULL, V4L2_CID_DV_TX_RGB_RANGE, V4L2_DV_RGB_RANGE_FULL, 0, V4L2_DV_RGB_RANGE_AUTO); dev->ctrl_tx_mode = v4l2_ctrl_new_std_menu(hdl_vid_out, NULL, V4L2_CID_DV_TX_MODE, V4L2_DV_TX_MODE_HDMI, 0, V4L2_DV_TX_MODE_HDMI); dev->ctrl_display_present = v4l2_ctrl_new_custom(hdl_vid_out, &vivid_ctrl_display_present, NULL); dev->ctrl_tx_hotplug = v4l2_ctrl_new_std(hdl_vid_out, NULL, V4L2_CID_DV_TX_HOTPLUG, 0, hdmi_output_mask, 0, hdmi_output_mask); dev->ctrl_tx_rxsense = v4l2_ctrl_new_std(hdl_vid_out, NULL, V4L2_CID_DV_TX_RXSENSE, 0, hdmi_output_mask, 0, hdmi_output_mask); dev->ctrl_tx_edid_present = v4l2_ctrl_new_std(hdl_vid_out, NULL, V4L2_CID_DV_TX_EDID_PRESENT, 0, hdmi_output_mask, 0, hdmi_output_mask); } if ((dev->has_vid_cap && dev->has_vid_out) || (dev->has_vbi_cap && dev->has_vbi_out)) v4l2_ctrl_new_custom(hdl_loop_cap, &vivid_ctrl_loop_video, NULL); if (dev->has_fb) v4l2_ctrl_new_custom(hdl_fb, &vivid_ctrl_clear_fb, NULL); if (dev->has_radio_rx) { v4l2_ctrl_new_custom(hdl_radio_rx, &vivid_ctrl_radio_hw_seek_mode, NULL); v4l2_ctrl_new_custom(hdl_radio_rx, &vivid_ctrl_radio_hw_seek_prog_lim, NULL); v4l2_ctrl_new_custom(hdl_radio_rx, &vivid_ctrl_radio_rx_rds_blockio, NULL); v4l2_ctrl_new_custom(hdl_radio_rx, &vivid_ctrl_radio_rx_rds_rbds, NULL); v4l2_ctrl_new_std(hdl_radio_rx, &vivid_radio_rx_ctrl_ops, V4L2_CID_RDS_RECEPTION, 0, 1, 1, 1); dev->radio_rx_rds_pty = v4l2_ctrl_new_std(hdl_radio_rx, &vivid_radio_rx_ctrl_ops, V4L2_CID_RDS_RX_PTY, 0, 31, 1, 0); dev->radio_rx_rds_psname = v4l2_ctrl_new_std(hdl_radio_rx, &vivid_radio_rx_ctrl_ops, V4L2_CID_RDS_RX_PS_NAME, 0, 8, 8, 0); dev->radio_rx_rds_radiotext = v4l2_ctrl_new_std(hdl_radio_rx, &vivid_radio_rx_ctrl_ops, V4L2_CID_RDS_RX_RADIO_TEXT, 0, 64, 64, 0); dev->radio_rx_rds_ta = v4l2_ctrl_new_std(hdl_radio_rx, &vivid_radio_rx_ctrl_ops, V4L2_CID_RDS_RX_TRAFFIC_ANNOUNCEMENT, 0, 1, 1, 0); dev->radio_rx_rds_tp = v4l2_ctrl_new_std(hdl_radio_rx, &vivid_radio_rx_ctrl_ops, V4L2_CID_RDS_RX_TRAFFIC_PROGRAM, 0, 1, 1, 0); dev->radio_rx_rds_ms = v4l2_ctrl_new_std(hdl_radio_rx, &vivid_radio_rx_ctrl_ops, V4L2_CID_RDS_RX_MUSIC_SPEECH, 0, 1, 1, 1); } if (dev->has_radio_tx) { v4l2_ctrl_new_custom(hdl_radio_tx, &vivid_ctrl_radio_tx_rds_blockio, NULL); dev->radio_tx_rds_pi = v4l2_ctrl_new_std(hdl_radio_tx, &vivid_radio_tx_ctrl_ops, V4L2_CID_RDS_TX_PI, 0, 0xffff, 1, 0x8088); dev->radio_tx_rds_pty = v4l2_ctrl_new_std(hdl_radio_tx, &vivid_radio_tx_ctrl_ops, V4L2_CID_RDS_TX_PTY, 0, 31, 1, 3); dev->radio_tx_rds_psname = v4l2_ctrl_new_std(hdl_radio_tx, &vivid_radio_tx_ctrl_ops, V4L2_CID_RDS_TX_PS_NAME, 0, 8, 8, 0); if (dev->radio_tx_rds_psname) v4l2_ctrl_s_ctrl_string(dev->radio_tx_rds_psname, "VIVID-TX"); dev->radio_tx_rds_radiotext = v4l2_ctrl_new_std(hdl_radio_tx, &vivid_radio_tx_ctrl_ops, V4L2_CID_RDS_TX_RADIO_TEXT, 0, 64 * 2, 64, 0); if (dev->radio_tx_rds_radiotext) v4l2_ctrl_s_ctrl_string(dev->radio_tx_rds_radiotext, "This is a VIVID default Radio Text template text, change at will"); dev->radio_tx_rds_mono_stereo = v4l2_ctrl_new_std(hdl_radio_tx, &vivid_radio_tx_ctrl_ops, V4L2_CID_RDS_TX_MONO_STEREO, 0, 1, 1, 1); dev->radio_tx_rds_art_head = v4l2_ctrl_new_std(hdl_radio_tx, &vivid_radio_tx_ctrl_ops, V4L2_CID_RDS_TX_ARTIFICIAL_HEAD, 0, 1, 1, 0); dev->radio_tx_rds_compressed = v4l2_ctrl_new_std(hdl_radio_tx, &vivid_radio_tx_ctrl_ops, V4L2_CID_RDS_TX_COMPRESSED, 0, 1, 1, 0); dev->radio_tx_rds_dyn_pty = v4l2_ctrl_new_std(hdl_radio_tx, &vivid_radio_tx_ctrl_ops, V4L2_CID_RDS_TX_DYNAMIC_PTY, 0, 1, 1, 0); dev->radio_tx_rds_ta = v4l2_ctrl_new_std(hdl_radio_tx, &vivid_radio_tx_ctrl_ops, V4L2_CID_RDS_TX_TRAFFIC_ANNOUNCEMENT, 0, 1, 1, 0); dev->radio_tx_rds_tp = v4l2_ctrl_new_std(hdl_radio_tx, &vivid_radio_tx_ctrl_ops, V4L2_CID_RDS_TX_TRAFFIC_PROGRAM, 0, 1, 1, 1); dev->radio_tx_rds_ms = v4l2_ctrl_new_std(hdl_radio_tx, &vivid_radio_tx_ctrl_ops, V4L2_CID_RDS_TX_MUSIC_SPEECH, 0, 1, 1, 1); } if (dev->has_sdr_cap) { v4l2_ctrl_new_custom(hdl_sdr_cap, &vivid_ctrl_sdr_cap_fm_deviation, NULL); } if (dev->has_meta_cap) { v4l2_ctrl_new_custom(hdl_meta_cap, &vivid_ctrl_meta_has_pts, NULL); v4l2_ctrl_new_custom(hdl_meta_cap, &vivid_ctrl_meta_has_src_clk, NULL); } if (hdl_user_gen->error) return hdl_user_gen->error; if (hdl_user_vid->error) return hdl_user_vid->error; if (hdl_user_aud->error) return hdl_user_aud->error; if (hdl_streaming->error) return hdl_streaming->error; if (hdl_sdr_cap->error) return hdl_sdr_cap->error; if (hdl_loop_cap->error) return hdl_loop_cap->error; if (dev->autogain) v4l2_ctrl_auto_cluster(2, &dev->autogain, 0, true); if (dev->has_vid_cap) { v4l2_ctrl_add_handler(hdl_vid_cap, hdl_user_gen, NULL, false); v4l2_ctrl_add_handler(hdl_vid_cap, hdl_user_vid, NULL, false); v4l2_ctrl_add_handler(hdl_vid_cap, hdl_user_aud, NULL, false); v4l2_ctrl_add_handler(hdl_vid_cap, hdl_streaming, NULL, false); v4l2_ctrl_add_handler(hdl_vid_cap, hdl_sdtv_cap, NULL, false); v4l2_ctrl_add_handler(hdl_vid_cap, hdl_loop_cap, NULL, false); v4l2_ctrl_add_handler(hdl_vid_cap, hdl_fb, NULL, false); if (hdl_vid_cap->error) return hdl_vid_cap->error; dev->vid_cap_dev.ctrl_handler = hdl_vid_cap; } if (dev->has_vid_out) { v4l2_ctrl_add_handler(hdl_vid_out, hdl_user_gen, NULL, false); v4l2_ctrl_add_handler(hdl_vid_out, hdl_user_aud, NULL, false); v4l2_ctrl_add_handler(hdl_vid_out, hdl_streaming, NULL, false); v4l2_ctrl_add_handler(hdl_vid_out, hdl_fb, NULL, false); if (hdl_vid_out->error) return hdl_vid_out->error; dev->vid_out_dev.ctrl_handler = hdl_vid_out; } if (dev->has_vbi_cap) { v4l2_ctrl_add_handler(hdl_vbi_cap, hdl_user_gen, NULL, false); v4l2_ctrl_add_handler(hdl_vbi_cap, hdl_streaming, NULL, false); v4l2_ctrl_add_handler(hdl_vbi_cap, hdl_sdtv_cap, NULL, false); v4l2_ctrl_add_handler(hdl_vbi_cap, hdl_loop_cap, NULL, false); if (hdl_vbi_cap->error) return hdl_vbi_cap->error; dev->vbi_cap_dev.ctrl_handler = hdl_vbi_cap; } if (dev->has_vbi_out) { v4l2_ctrl_add_handler(hdl_vbi_out, hdl_user_gen, NULL, false); v4l2_ctrl_add_handler(hdl_vbi_out, hdl_streaming, NULL, false); if (hdl_vbi_out->error) return hdl_vbi_out->error; dev->vbi_out_dev.ctrl_handler = hdl_vbi_out; } if (dev->has_radio_rx) { v4l2_ctrl_add_handler(hdl_radio_rx, hdl_user_gen, NULL, false); v4l2_ctrl_add_handler(hdl_radio_rx, hdl_user_aud, NULL, false); if (hdl_radio_rx->error) return hdl_radio_rx->error; dev->radio_rx_dev.ctrl_handler = hdl_radio_rx; } if (dev->has_radio_tx) { v4l2_ctrl_add_handler(hdl_radio_tx, hdl_user_gen, NULL, false); v4l2_ctrl_add_handler(hdl_radio_tx, hdl_user_aud, NULL, false); if (hdl_radio_tx->error) return hdl_radio_tx->error; dev->radio_tx_dev.ctrl_handler = hdl_radio_tx; } if (dev->has_sdr_cap) { v4l2_ctrl_add_handler(hdl_sdr_cap, hdl_user_gen, NULL, false); v4l2_ctrl_add_handler(hdl_sdr_cap, hdl_streaming, NULL, false); if (hdl_sdr_cap->error) return hdl_sdr_cap->error; dev->sdr_cap_dev.ctrl_handler = hdl_sdr_cap; } if (dev->has_meta_cap) { v4l2_ctrl_add_handler(hdl_meta_cap, hdl_user_gen, NULL, false); v4l2_ctrl_add_handler(hdl_meta_cap, hdl_streaming, NULL, false); if (hdl_meta_cap->error) return hdl_meta_cap->error; dev->meta_cap_dev.ctrl_handler = hdl_meta_cap; } if (dev->has_meta_out) { v4l2_ctrl_add_handler(hdl_meta_out, hdl_user_gen, NULL, false); v4l2_ctrl_add_handler(hdl_meta_out, hdl_streaming, NULL, false); if (hdl_meta_out->error) return hdl_meta_out->error; dev->meta_out_dev.ctrl_handler = hdl_meta_out; } if (dev->has_touch_cap) { v4l2_ctrl_add_handler(hdl_tch_cap, hdl_user_gen, NULL, false); v4l2_ctrl_add_handler(hdl_tch_cap, hdl_streaming, NULL, false); if (hdl_tch_cap->error) return hdl_tch_cap->error; dev->touch_cap_dev.ctrl_handler = hdl_tch_cap; } return 0; } void vivid_free_controls(struct vivid_dev *dev) { v4l2_ctrl_handler_free(&dev->ctrl_hdl_vid_cap); v4l2_ctrl_handler_free(&dev->ctrl_hdl_vid_out); v4l2_ctrl_handler_free(&dev->ctrl_hdl_vbi_cap); v4l2_ctrl_handler_free(&dev->ctrl_hdl_vbi_out); v4l2_ctrl_handler_free(&dev->ctrl_hdl_radio_rx); v4l2_ctrl_handler_free(&dev->ctrl_hdl_radio_tx); v4l2_ctrl_handler_free(&dev->ctrl_hdl_sdr_cap); v4l2_ctrl_handler_free(&dev->ctrl_hdl_user_gen); v4l2_ctrl_handler_free(&dev->ctrl_hdl_user_vid); v4l2_ctrl_handler_free(&dev->ctrl_hdl_user_aud); v4l2_ctrl_handler_free(&dev->ctrl_hdl_streaming); v4l2_ctrl_handler_free(&dev->ctrl_hdl_sdtv_cap); v4l2_ctrl_handler_free(&dev->ctrl_hdl_loop_cap); v4l2_ctrl_handler_free(&dev->ctrl_hdl_fb); v4l2_ctrl_handler_free(&dev->ctrl_hdl_meta_cap); v4l2_ctrl_handler_free(&dev->ctrl_hdl_meta_out); v4l2_ctrl_handler_free(&dev->ctrl_hdl_touch_cap); } |
103 103 3 102 3 3 102 102 94 20 102 22 93 93 87 41 26 84 91 89 89 93 93 85 46 93 92 93 1 1 93 93 92 1 1 92 5 15 94 92 80 80 93 83 85 37 85 1 1 85 85 1 1 102 102 102 102 102 102 89 102 93 93 93 93 43 20 92 93 93 27 24 93 93 3 35 58 94 23 21 21 19 94 27 18 27 94 94 94 4 6 94 3 3 3 3 103 102 102 102 102 102 94 94 22 94 48 46 48 39 33 5 5 5 94 94 94 94 94 94 93 97 97 1 112 109 47 103 112 112 47 57 57 103 43 96 1 7 63 43 63 103 103 103 48 103 64 93 82 12 82 94 94 43 43 94 94 93 15 15 | 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60 61 62 63 64 65 66 67 68 69 70 71 72 73 74 75 76 77 78 79 80 81 82 83 84 85 86 87 88 89 90 91 92 93 94 95 96 97 98 99 100 101 102 103 104 105 106 107 108 109 110 111 112 113 114 115 116 117 118 119 120 121 122 123 124 125 126 127 128 129 130 131 132 133 134 135 136 137 138 139 140 141 142 143 144 145 146 147 148 149 150 151 152 153 154 155 156 157 158 159 160 161 162 163 164 165 166 167 168 169 170 171 172 173 174 175 176 177 178 179 180 181 182 183 184 185 186 187 188 189 190 191 192 193 194 195 196 197 198 199 200 201 202 203 204 205 206 207 208 209 210 211 212 213 214 215 216 217 218 219 220 221 222 223 224 225 226 227 228 229 230 231 232 233 234 235 236 237 238 239 240 241 242 243 244 245 246 247 248 249 250 251 252 253 254 255 256 257 258 259 260 261 262 263 264 265 266 267 268 269 270 271 272 273 274 275 276 277 278 279 280 281 282 283 284 285 286 287 288 289 290 291 292 293 294 295 296 297 298 299 300 301 302 303 304 305 306 307 308 309 310 311 312 313 314 315 316 317 318 319 320 321 322 323 324 325 326 327 328 329 330 331 332 333 334 335 336 337 338 339 340 341 342 343 344 345 346 347 348 349 350 351 352 353 354 355 356 357 358 359 360 361 362 363 364 365 366 367 368 369 370 371 372 373 374 375 376 377 378 379 380 381 382 383 384 385 386 387 388 389 390 391 392 393 394 395 396 397 398 399 400 401 402 403 404 405 406 407 408 409 410 411 412 413 414 415 416 417 418 419 420 421 422 423 424 425 426 427 428 429 430 431 432 433 434 435 436 437 438 439 440 441 442 443 444 445 446 447 448 449 450 451 452 453 454 455 456 457 458 459 460 461 462 463 464 465 466 467 468 469 470 471 472 473 474 475 476 477 478 479 480 481 482 483 484 485 486 487 488 489 490 491 492 493 494 495 496 497 498 499 500 501 502 503 504 505 506 507 508 509 510 511 512 513 514 515 516 517 518 519 520 521 522 523 524 525 526 527 528 529 530 531 532 533 534 535 536 537 538 539 540 541 542 543 544 545 546 547 548 549 550 551 552 553 554 555 556 557 558 559 560 561 562 563 564 565 566 567 568 569 570 571 572 573 574 575 576 577 578 579 580 581 582 583 584 585 586 587 588 589 590 591 592 593 594 595 596 597 598 599 600 601 602 603 604 605 606 607 608 609 610 611 612 613 614 615 616 617 618 619 620 621 622 623 624 625 626 627 628 629 630 631 632 633 634 635 636 637 638 639 640 641 642 643 644 645 646 647 648 649 650 651 652 653 654 655 656 657 658 659 660 661 662 663 664 665 666 667 668 669 670 671 672 673 674 675 676 677 678 679 680 681 682 683 684 685 686 687 688 689 690 691 692 693 694 695 696 697 698 699 700 701 702 703 704 705 706 707 708 709 710 711 712 713 714 715 716 717 718 719 720 721 722 723 724 725 726 727 728 729 730 731 732 733 734 735 736 737 738 739 740 741 742 743 744 745 746 747 748 749 750 751 752 753 754 755 756 757 758 759 760 761 762 763 764 765 766 767 768 769 770 771 772 773 774 775 776 777 778 779 780 781 782 783 784 785 786 787 788 789 790 791 792 793 794 795 796 797 798 799 800 801 802 803 804 805 806 807 808 809 810 811 812 813 814 815 816 817 818 819 820 821 822 823 824 825 826 827 828 829 830 831 832 833 834 835 836 837 838 839 840 841 842 843 844 845 846 847 848 849 850 851 852 853 854 855 856 857 858 859 860 861 862 863 864 865 866 867 868 869 870 871 872 873 874 875 876 877 878 879 880 881 882 883 884 885 886 887 888 889 890 891 892 893 894 895 896 897 898 899 900 901 902 903 904 905 906 907 908 909 910 911 912 913 914 915 916 917 918 919 920 921 922 923 924 925 926 927 928 929 930 931 932 933 934 935 936 937 938 939 940 941 942 943 944 945 946 947 948 949 950 951 952 953 954 955 956 957 958 959 960 961 962 963 964 965 966 967 968 969 970 971 972 973 974 975 976 977 978 979 980 981 982 983 984 985 986 987 988 989 990 991 992 993 994 995 996 997 998 999 1000 1001 1002 1003 1004 1005 1006 1007 1008 1009 1010 1011 1012 1013 1014 1015 1016 1017 1018 1019 1020 1021 1022 1023 1024 1025 1026 1027 1028 1029 1030 1031 1032 1033 1034 1035 1036 1037 1038 1039 1040 1041 1042 1043 1044 1045 1046 1047 1048 1049 1050 1051 1052 1053 1054 1055 1056 1057 1058 1059 1060 1061 1062 1063 1064 1065 1066 1067 1068 1069 1070 1071 1072 1073 1074 1075 1076 1077 1078 1079 1080 1081 1082 1083 1084 1085 1086 1087 1088 1089 1090 1091 1092 1093 1094 1095 1096 1097 1098 1099 1100 1101 1102 1103 1104 1105 1106 1107 1108 1109 1110 1111 1112 1113 1114 1115 1116 1117 1118 1119 1120 1121 1122 1123 1124 1125 1126 1127 1128 1129 1130 1131 1132 1133 1134 1135 1136 1137 1138 1139 1140 1141 1142 1143 1144 1145 1146 1147 1148 1149 1150 1151 1152 1153 1154 1155 1156 1157 1158 1159 1160 1161 1162 1163 1164 1165 1166 1167 1168 1169 1170 1171 1172 1173 1174 1175 1176 1177 1178 1179 1180 1181 1182 1183 1184 1185 1186 1187 1188 1189 1190 1191 1192 1193 1194 1195 1196 1197 1198 1199 1200 1201 1202 1203 1204 1205 1206 1207 1208 1209 1210 1211 1212 1213 1214 1215 1216 1217 1218 1219 1220 1221 1222 1223 1224 1225 1226 1227 1228 1229 1230 1231 1232 1233 1234 1235 1236 1237 1238 1239 1240 1241 1242 1243 1244 1245 1246 1247 1248 1249 1250 1251 1252 1253 1254 1255 1256 1257 1258 1259 1260 1261 1262 1263 1264 1265 1266 1267 1268 1269 1270 1271 1272 1273 1274 1275 1276 1277 1278 1279 1280 1281 1282 1283 1284 1285 1286 1287 1288 1289 1290 1291 1292 1293 1294 1295 1296 1297 1298 1299 1300 1301 1302 1303 1304 1305 1306 1307 1308 1309 1310 1311 1312 1313 1314 1315 1316 1317 1318 1319 1320 1321 1322 1323 1324 1325 1326 1327 1328 | // SPDX-License-Identifier: GPL-2.0-only /* * fs/direct-io.c * * Copyright (C) 2002, Linus Torvalds. * * O_DIRECT * * 04Jul2002 Andrew Morton * Initial version * 11Sep2002 janetinc@us.ibm.com * added readv/writev support. * 29Oct2002 Andrew Morton * rewrote bio_add_page() support. * 30Oct2002 pbadari@us.ibm.com * added support for non-aligned IO. * 06Nov2002 pbadari@us.ibm.com * added asynchronous IO support. * 21Jul2003 nathans@sgi.com * added IO completion notifier. */ #include <linux/kernel.h> #include <linux/module.h> #include <linux/types.h> #include <linux/fs.h> #include <linux/mm.h> #include <linux/slab.h> #include <linux/highmem.h> #include <linux/pagemap.h> #include <linux/task_io_accounting_ops.h> #include <linux/bio.h> #include <linux/wait.h> #include <linux/err.h> #include <linux/blkdev.h> #include <linux/buffer_head.h> #include <linux/rwsem.h> #include <linux/uio.h> #include <linux/atomic.h> #include <linux/prefetch.h> #include "internal.h" /* * How many user pages to map in one call to iov_iter_extract_pages(). This * determines the size of a structure in the slab cache */ #define DIO_PAGES 64 /* * Flags for dio_complete() */ #define DIO_COMPLETE_ASYNC 0x01 /* This is async IO */ #define DIO_COMPLETE_INVALIDATE 0x02 /* Can invalidate pages */ /* * This code generally works in units of "dio_blocks". A dio_block is * somewhere between the hard sector size and the filesystem block size. it * is determined on a per-invocation basis. When talking to the filesystem * we need to convert dio_blocks to fs_blocks by scaling the dio_block quantity * down by dio->blkfactor. Similarly, fs-blocksize quantities are converted * to bio_block quantities by shifting left by blkfactor. * * If blkfactor is zero then the user's request was aligned to the filesystem's * blocksize. */ /* dio_state only used in the submission path */ struct dio_submit { struct bio *bio; /* bio under assembly */ unsigned blkbits; /* doesn't change */ unsigned blkfactor; /* When we're using an alignment which is finer than the filesystem's soft blocksize, this specifies how much finer. blkfactor=2 means 1/4-block alignment. Does not change */ unsigned start_zero_done; /* flag: sub-blocksize zeroing has been performed at the start of a write */ int pages_in_io; /* approximate total IO pages */ sector_t block_in_file; /* Current offset into the underlying file in dio_block units. */ unsigned blocks_available; /* At block_in_file. changes */ int reap_counter; /* rate limit reaping */ sector_t final_block_in_request;/* doesn't change */ int boundary; /* prev block is at a boundary */ get_block_t *get_block; /* block mapping function */ loff_t logical_offset_in_bio; /* current first logical block in bio */ sector_t final_block_in_bio; /* current final block in bio + 1 */ sector_t next_block_for_io; /* next block to be put under IO, in dio_blocks units */ /* * Deferred addition of a page to the dio. These variables are * private to dio_send_cur_page(), submit_page_section() and * dio_bio_add_page(). */ struct page *cur_page; /* The page */ unsigned cur_page_offset; /* Offset into it, in bytes */ unsigned cur_page_len; /* Nr of bytes at cur_page_offset */ sector_t cur_page_block; /* Where it starts */ loff_t cur_page_fs_offset; /* Offset in file */ struct iov_iter *iter; /* * Page queue. These variables belong to dio_refill_pages() and * dio_get_page(). */ unsigned head; /* next page to process */ unsigned tail; /* last valid page + 1 */ size_t from, to; }; /* dio_state communicated between submission path and end_io */ struct dio { int flags; /* doesn't change */ blk_opf_t opf; /* request operation type and flags */ struct gendisk *bio_disk; struct inode *inode; loff_t i_size; /* i_size when submitted */ dio_iodone_t *end_io; /* IO completion function */ bool is_pinned; /* T if we have pins on the pages */ void *private; /* copy from map_bh.b_private */ /* BIO completion state */ spinlock_t bio_lock; /* protects BIO fields below */ int page_errors; /* err from iov_iter_extract_pages() */ int is_async; /* is IO async ? */ bool defer_completion; /* defer AIO completion to workqueue? */ bool should_dirty; /* if pages should be dirtied */ int io_error; /* IO error in completion path */ unsigned long refcount; /* direct_io_worker() and bios */ struct bio *bio_list; /* singly linked via bi_private */ struct task_struct *waiter; /* waiting task (NULL if none) */ /* AIO related stuff */ struct kiocb *iocb; /* kiocb */ ssize_t result; /* IO result */ /* * pages[] (and any fields placed after it) are not zeroed out at * allocation time. Don't add new fields after pages[] unless you * wish that they not be zeroed. */ union { struct page *pages[DIO_PAGES]; /* page buffer */ struct work_struct complete_work;/* deferred AIO completion */ }; } ____cacheline_aligned_in_smp; static struct kmem_cache *dio_cache __read_mostly; /* * How many pages are in the queue? */ static inline unsigned dio_pages_present(struct dio_submit *sdio) { return sdio->tail - sdio->head; } /* * Go grab and pin some userspace pages. Typically we'll get 64 at a time. */ static inline int dio_refill_pages(struct dio *dio, struct dio_submit *sdio) { struct page **pages = dio->pages; const enum req_op dio_op = dio->opf & REQ_OP_MASK; ssize_t ret; ret = iov_iter_extract_pages(sdio->iter, &pages, LONG_MAX, DIO_PAGES, 0, &sdio->from); if (ret < 0 && sdio->blocks_available && dio_op == REQ_OP_WRITE) { /* * A memory fault, but the filesystem has some outstanding * mapped blocks. We need to use those blocks up to avoid * leaking stale data in the file. */ if (dio->page_errors == 0) dio->page_errors = ret; dio->pages[0] = ZERO_PAGE(0); sdio->head = 0; sdio->tail = 1; sdio->from = 0; sdio->to = PAGE_SIZE; return 0; } if (ret >= 0) { ret += sdio->from; sdio->head = 0; sdio->tail = (ret + PAGE_SIZE - 1) / PAGE_SIZE; sdio->to = ((ret - 1) & (PAGE_SIZE - 1)) + 1; return 0; } return ret; } /* * Get another userspace page. Returns an ERR_PTR on error. Pages are * buffered inside the dio so that we can call iov_iter_extract_pages() * against a decent number of pages, less frequently. To provide nicer use of * the L1 cache. */ static inline struct page *dio_get_page(struct dio *dio, struct dio_submit *sdio) { if (dio_pages_present(sdio) == 0) { int ret; ret = dio_refill_pages(dio, sdio); if (ret) return ERR_PTR(ret); BUG_ON(dio_pages_present(sdio) == 0); } return dio->pages[sdio->head]; } static void dio_pin_page(struct dio *dio, struct page *page) { if (dio->is_pinned) folio_add_pin(page_folio(page)); } static void dio_unpin_page(struct dio *dio, struct page *page) { if (dio->is_pinned) unpin_user_page(page); } /* * dio_complete() - called when all DIO BIO I/O has been completed * * This drops i_dio_count, lets interested parties know that a DIO operation * has completed, and calculates the resulting return code for the operation. * * It lets the filesystem know if it registered an interest earlier via * get_block. Pass the private field of the map buffer_head so that * filesystems can use it to hold additional state between get_block calls and * dio_complete. */ static ssize_t dio_complete(struct dio *dio, ssize_t ret, unsigned int flags) { const enum req_op dio_op = dio->opf & REQ_OP_MASK; loff_t offset = dio->iocb->ki_pos; ssize_t transferred = 0; int err; /* * AIO submission can race with bio completion to get here while * expecting to have the last io completed by bio completion. * In that case -EIOCBQUEUED is in fact not an error we want * to preserve through this call. */ if (ret == -EIOCBQUEUED) ret = 0; if (dio->result) { transferred = dio->result; /* Check for short read case */ if (dio_op == REQ_OP_READ && ((offset + transferred) > dio->i_size)) transferred = dio->i_size - offset; /* ignore EFAULT if some IO has been done */ if (unlikely(ret == -EFAULT) && transferred) ret = 0; } if (ret == 0) ret = dio->page_errors; if (ret == 0) ret = dio->io_error; if (ret == 0) ret = transferred; if (dio->end_io) { // XXX: ki_pos?? err = dio->end_io(dio->iocb, offset, ret, dio->private); if (err) ret = err; } /* * Try again to invalidate clean pages which might have been cached by * non-direct readahead, or faulted in by get_user_pages() if the source * of the write was an mmap'ed region of the file we're writing. Either * one is a pretty crazy thing to do, so we don't support it 100%. If * this invalidation fails, tough, the write still worked... * * And this page cache invalidation has to be after dio->end_io(), as * some filesystems convert unwritten extents to real allocations in * end_io() when necessary, otherwise a racing buffer read would cache * zeros from unwritten extents. */ if (flags & DIO_COMPLETE_INVALIDATE && ret > 0 && dio_op == REQ_OP_WRITE) kiocb_invalidate_post_direct_write(dio->iocb, ret); inode_dio_end(dio->inode); if (flags & DIO_COMPLETE_ASYNC) { /* * generic_write_sync expects ki_pos to have been updated * already, but the submission path only does this for * synchronous I/O. */ dio->iocb->ki_pos += transferred; if (ret > 0 && dio_op == REQ_OP_WRITE) ret = generic_write_sync(dio->iocb, ret); dio->iocb->ki_complete(dio->iocb, ret); } kmem_cache_free(dio_cache, dio); return ret; } static void dio_aio_complete_work(struct work_struct *work) { struct dio *dio = container_of(work, struct dio, complete_work); dio_complete(dio, 0, DIO_COMPLETE_ASYNC | DIO_COMPLETE_INVALIDATE); } static blk_status_t dio_bio_complete(struct dio *dio, struct bio *bio); /* * Asynchronous IO callback. */ static void dio_bio_end_aio(struct bio *bio) { struct dio *dio = bio->bi_private; const enum req_op dio_op = dio->opf & REQ_OP_MASK; unsigned long remaining; unsigned long flags; bool defer_completion = false; /* cleanup the bio */ dio_bio_complete(dio, bio); spin_lock_irqsave(&dio->bio_lock, flags); remaining = --dio->refcount; if (remaining == 1 && dio->waiter) wake_up_process(dio->waiter); spin_unlock_irqrestore(&dio->bio_lock, flags); if (remaining == 0) { /* * Defer completion when defer_completion is set or * when the inode has pages mapped and this is AIO write. * We need to invalidate those pages because there is a * chance they contain stale data in the case buffered IO * went in between AIO submission and completion into the * same region. */ if (dio->result) defer_completion = dio->defer_completion || (dio_op == REQ_OP_WRITE && dio->inode->i_mapping->nrpages); if (defer_completion) { INIT_WORK(&dio->complete_work, dio_aio_complete_work); queue_work(dio->inode->i_sb->s_dio_done_wq, &dio->complete_work); } else { dio_complete(dio, 0, DIO_COMPLETE_ASYNC); } } } /* * The BIO completion handler simply queues the BIO up for the process-context * handler. * * During I/O bi_private points at the dio. After I/O, bi_private is used to * implement a singly-linked list of completed BIOs, at dio->bio_list. */ static void dio_bio_end_io(struct bio *bio) { struct dio *dio = bio->bi_private; unsigned long flags; spin_lock_irqsave(&dio->bio_lock, flags); bio->bi_private = dio->bio_list; dio->bio_list = bio; if (--dio->refcount == 1 && dio->waiter) wake_up_process(dio->waiter); spin_unlock_irqrestore(&dio->bio_lock, flags); } static inline void dio_bio_alloc(struct dio *dio, struct dio_submit *sdio, struct block_device *bdev, sector_t first_sector, int nr_vecs) { struct bio *bio; /* * bio_alloc() is guaranteed to return a bio when allowed to sleep and * we request a valid number of vectors. */ bio = bio_alloc(bdev, nr_vecs, dio->opf, GFP_KERNEL); bio->bi_iter.bi_sector = first_sector; if (dio->is_async) bio->bi_end_io = dio_bio_end_aio; else bio->bi_end_io = dio_bio_end_io; if (dio->is_pinned) bio_set_flag(bio, BIO_PAGE_PINNED); sdio->bio = bio; sdio->logical_offset_in_bio = sdio->cur_page_fs_offset; } /* * In the AIO read case we speculatively dirty the pages before starting IO. * During IO completion, any of these pages which happen to have been written * back will be redirtied by bio_check_pages_dirty(). * * bios hold a dio reference between submit_bio and ->end_io. */ static inline void dio_bio_submit(struct dio *dio, struct dio_submit *sdio) { const enum req_op dio_op = dio->opf & REQ_OP_MASK; struct bio *bio = sdio->bio; unsigned long flags; bio->bi_private = dio; spin_lock_irqsave(&dio->bio_lock, flags); dio->refcount++; spin_unlock_irqrestore(&dio->bio_lock, flags); if (dio->is_async && dio_op == REQ_OP_READ && dio->should_dirty) bio_set_pages_dirty(bio); dio->bio_disk = bio->bi_bdev->bd_disk; submit_bio(bio); sdio->bio = NULL; sdio->boundary = 0; sdio->logical_offset_in_bio = 0; } /* * Release any resources in case of a failure */ static inline void dio_cleanup(struct dio *dio, struct dio_submit *sdio) { if (dio->is_pinned) unpin_user_pages(dio->pages + sdio->head, sdio->tail - sdio->head); sdio->head = sdio->tail; } /* * Wait for the next BIO to complete. Remove it and return it. NULL is * returned once all BIOs have been completed. This must only be called once * all bios have been issued so that dio->refcount can only decrease. This * requires that the caller hold a reference on the dio. */ static struct bio *dio_await_one(struct dio *dio) { unsigned long flags; struct bio *bio = NULL; spin_lock_irqsave(&dio->bio_lock, flags); /* * Wait as long as the list is empty and there are bios in flight. bio * completion drops the count, maybe adds to the list, and wakes while * holding the bio_lock so we don't need set_current_state()'s barrier * and can call it after testing our condition. */ while (dio->refcount > 1 && dio->bio_list == NULL) { __set_current_state(TASK_UNINTERRUPTIBLE); dio->waiter = current; spin_unlock_irqrestore(&dio->bio_lock, flags); blk_io_schedule(); /* wake up sets us TASK_RUNNING */ spin_lock_irqsave(&dio->bio_lock, flags); dio->waiter = NULL; } if (dio->bio_list) { bio = dio->bio_list; dio->bio_list = bio->bi_private; } spin_unlock_irqrestore(&dio->bio_lock, flags); return bio; } /* * Process one completed BIO. No locks are held. */ static blk_status_t dio_bio_complete(struct dio *dio, struct bio *bio) { blk_status_t err = bio->bi_status; const enum req_op dio_op = dio->opf & REQ_OP_MASK; bool should_dirty = dio_op == REQ_OP_READ && dio->should_dirty; if (err) { if (err == BLK_STS_AGAIN && (bio->bi_opf & REQ_NOWAIT)) dio->io_error = -EAGAIN; else dio->io_error = -EIO; } if (dio->is_async && should_dirty) { bio_check_pages_dirty(bio); /* transfers ownership */ } else { bio_release_pages(bio, should_dirty); bio_put(bio); } return err; } /* * Wait on and process all in-flight BIOs. This must only be called once * all bios have been issued so that the refcount can only decrease. * This just waits for all bios to make it through dio_bio_complete. IO * errors are propagated through dio->io_error and should be propagated via * dio_complete(). */ static void dio_await_completion(struct dio *dio) { struct bio *bio; do { bio = dio_await_one(dio); if (bio) dio_bio_complete(dio, bio); } while (bio); } /* * A really large O_DIRECT read or write can generate a lot of BIOs. So * to keep the memory consumption sane we periodically reap any completed BIOs * during the BIO generation phase. * * This also helps to limit the peak amount of pinned userspace memory. */ static inline int dio_bio_reap(struct dio *dio, struct dio_submit *sdio) { int ret = 0; if (sdio->reap_counter++ >= 64) { while (dio->bio_list) { unsigned long flags; struct bio *bio; int ret2; spin_lock_irqsave(&dio->bio_lock, flags); bio = dio->bio_list; dio->bio_list = bio->bi_private; spin_unlock_irqrestore(&dio->bio_lock, flags); ret2 = blk_status_to_errno(dio_bio_complete(dio, bio)); if (ret == 0) ret = ret2; } sdio->reap_counter = 0; } return ret; } static int dio_set_defer_completion(struct dio *dio) { struct super_block *sb = dio->inode->i_sb; if (dio->defer_completion) return 0; dio->defer_completion = true; if (!sb->s_dio_done_wq) return sb_init_dio_done_wq(sb); return 0; } /* * Call into the fs to map some more disk blocks. We record the current number * of available blocks at sdio->blocks_available. These are in units of the * fs blocksize, i_blocksize(inode). * * The fs is allowed to map lots of blocks at once. If it wants to do that, * it uses the passed inode-relative block number as the file offset, as usual. * * get_block() is passed the number of i_blkbits-sized blocks which direct_io * has remaining to do. The fs should not map more than this number of blocks. * * If the fs has mapped a lot of blocks, it should populate bh->b_size to * indicate how much contiguous disk space has been made available at * bh->b_blocknr. * * If *any* of the mapped blocks are new, then the fs must set buffer_new(). * This isn't very efficient... * * In the case of filesystem holes: the fs may return an arbitrarily-large * hole by returning an appropriate value in b_size and by clearing * buffer_mapped(). However the direct-io code will only process holes one * block at a time - it will repeatedly call get_block() as it walks the hole. */ static int get_more_blocks(struct dio *dio, struct dio_submit *sdio, struct buffer_head *map_bh) { const enum req_op dio_op = dio->opf & REQ_OP_MASK; int ret; sector_t fs_startblk; /* Into file, in filesystem-sized blocks */ sector_t fs_endblk; /* Into file, in filesystem-sized blocks */ unsigned long fs_count; /* Number of filesystem-sized blocks */ int create; unsigned int i_blkbits = sdio->blkbits + sdio->blkfactor; loff_t i_size; /* * If there was a memory error and we've overwritten all the * mapped blocks then we can now return that memory error */ ret = dio->page_errors; if (ret == 0) { BUG_ON(sdio->block_in_file >= sdio->final_block_in_request); fs_startblk = sdio->block_in_file >> sdio->blkfactor; fs_endblk = (sdio->final_block_in_request - 1) >> sdio->blkfactor; fs_count = fs_endblk - fs_startblk + 1; map_bh->b_state = 0; map_bh->b_size = fs_count << i_blkbits; /* * For writes that could fill holes inside i_size on a * DIO_SKIP_HOLES filesystem we forbid block creations: only * overwrites are permitted. We will return early to the caller * once we see an unmapped buffer head returned, and the caller * will fall back to buffered I/O. * * Otherwise the decision is left to the get_blocks method, * which may decide to handle it or also return an unmapped * buffer head. */ create = dio_op == REQ_OP_WRITE; if (dio->flags & DIO_SKIP_HOLES) { i_size = i_size_read(dio->inode); if (i_size && fs_startblk <= (i_size - 1) >> i_blkbits) create = 0; } ret = (*sdio->get_block)(dio->inode, fs_startblk, map_bh, create); /* Store for completion */ dio->private = map_bh->b_private; if (ret == 0 && buffer_defer_completion(map_bh)) ret = dio_set_defer_completion(dio); } return ret; } /* * There is no bio. Make one now. */ static inline int dio_new_bio(struct dio *dio, struct dio_submit *sdio, sector_t start_sector, struct buffer_head *map_bh) { sector_t sector; int ret, nr_pages; ret = dio_bio_reap(dio, sdio); if (ret) goto out; sector = start_sector << (sdio->blkbits - 9); nr_pages = bio_max_segs(sdio->pages_in_io); BUG_ON(nr_pages <= 0); dio_bio_alloc(dio, sdio, map_bh->b_bdev, sector, nr_pages); sdio->boundary = 0; out: return ret; } /* * Attempt to put the current chunk of 'cur_page' into the current BIO. If * that was successful then update final_block_in_bio and take a ref against * the just-added page. * * Return zero on success. Non-zero means the caller needs to start a new BIO. */ static inline int dio_bio_add_page(struct dio *dio, struct dio_submit *sdio) { int ret; ret = bio_add_page(sdio->bio, sdio->cur_page, sdio->cur_page_len, sdio->cur_page_offset); if (ret == sdio->cur_page_len) { /* * Decrement count only, if we are done with this page */ if ((sdio->cur_page_len + sdio->cur_page_offset) == PAGE_SIZE) sdio->pages_in_io--; dio_pin_page(dio, sdio->cur_page); sdio->final_block_in_bio = sdio->cur_page_block + (sdio->cur_page_len >> sdio->blkbits); ret = 0; } else { ret = 1; } return ret; } /* * Put cur_page under IO. The section of cur_page which is described by * cur_page_offset,cur_page_len is put into a BIO. The section of cur_page * starts on-disk at cur_page_block. * * We take a ref against the page here (on behalf of its presence in the bio). * * The caller of this function is responsible for removing cur_page from the * dio, and for dropping the refcount which came from that presence. */ static inline int dio_send_cur_page(struct dio *dio, struct dio_submit *sdio, struct buffer_head *map_bh) { int ret = 0; if (sdio->bio) { loff_t cur_offset = sdio->cur_page_fs_offset; loff_t bio_next_offset = sdio->logical_offset_in_bio + sdio->bio->bi_iter.bi_size; /* * See whether this new request is contiguous with the old. * * Btrfs cannot handle having logically non-contiguous requests * submitted. For example if you have * * Logical: [0-4095][HOLE][8192-12287] * Physical: [0-4095] [4096-8191] * * We cannot submit those pages together as one BIO. So if our * current logical offset in the file does not equal what would * be the next logical offset in the bio, submit the bio we * have. */ if (sdio->final_block_in_bio != sdio->cur_page_block || cur_offset != bio_next_offset) dio_bio_submit(dio, sdio); } if (sdio->bio == NULL) { ret = dio_new_bio(dio, sdio, sdio->cur_page_block, map_bh); if (ret) goto out; } if (dio_bio_add_page(dio, sdio) != 0) { dio_bio_submit(dio, sdio); ret = dio_new_bio(dio, sdio, sdio->cur_page_block, map_bh); if (ret == 0) { ret = dio_bio_add_page(dio, sdio); BUG_ON(ret != 0); } } out: return ret; } /* * An autonomous function to put a chunk of a page under deferred IO. * * The caller doesn't actually know (or care) whether this piece of page is in * a BIO, or is under IO or whatever. We just take care of all possible * situations here. The separation between the logic of do_direct_IO() and * that of submit_page_section() is important for clarity. Please don't break. * * The chunk of page starts on-disk at blocknr. * * We perform deferred IO, by recording the last-submitted page inside our * private part of the dio structure. If possible, we just expand the IO * across that page here. * * If that doesn't work out then we put the old page into the bio and add this * page to the dio instead. */ static inline int submit_page_section(struct dio *dio, struct dio_submit *sdio, struct page *page, unsigned offset, unsigned len, sector_t blocknr, struct buffer_head *map_bh) { const enum req_op dio_op = dio->opf & REQ_OP_MASK; int ret = 0; int boundary = sdio->boundary; /* dio_send_cur_page may clear it */ if (dio_op == REQ_OP_WRITE) { /* * Read accounting is performed in submit_bio() */ task_io_account_write(len); } /* * Can we just grow the current page's presence in the dio? */ if (sdio->cur_page == page && sdio->cur_page_offset + sdio->cur_page_len == offset && sdio->cur_page_block + (sdio->cur_page_len >> sdio->blkbits) == blocknr) { sdio->cur_page_len += len; goto out; } /* * If there's a deferred page already there then send it. */ if (sdio->cur_page) { ret = dio_send_cur_page(dio, sdio, map_bh); dio_unpin_page(dio, sdio->cur_page); sdio->cur_page = NULL; if (ret) return ret; } dio_pin_page(dio, page); /* It is in dio */ sdio->cur_page = page; sdio->cur_page_offset = offset; sdio->cur_page_len = len; sdio->cur_page_block = blocknr; sdio->cur_page_fs_offset = sdio->block_in_file << sdio->blkbits; out: /* * If boundary then we want to schedule the IO now to * avoid metadata seeks. */ if (boundary) { ret = dio_send_cur_page(dio, sdio, map_bh); if (sdio->bio) dio_bio_submit(dio, sdio); dio_unpin_page(dio, sdio->cur_page); sdio->cur_page = NULL; } return ret; } /* * If we are not writing the entire block and get_block() allocated * the block for us, we need to fill-in the unused portion of the * block with zeros. This happens only if user-buffer, fileoffset or * io length is not filesystem block-size multiple. * * `end' is zero if we're doing the start of the IO, 1 at the end of the * IO. */ static inline void dio_zero_block(struct dio *dio, struct dio_submit *sdio, int end, struct buffer_head *map_bh) { unsigned dio_blocks_per_fs_block; unsigned this_chunk_blocks; /* In dio_blocks */ unsigned this_chunk_bytes; struct page *page; sdio->start_zero_done = 1; if (!sdio->blkfactor || !buffer_new(map_bh)) return; dio_blocks_per_fs_block = 1 << sdio->blkfactor; this_chunk_blocks = sdio->block_in_file & (dio_blocks_per_fs_block - 1); if (!this_chunk_blocks) return; /* * We need to zero out part of an fs block. It is either at the * beginning or the end of the fs block. */ if (end) this_chunk_blocks = dio_blocks_per_fs_block - this_chunk_blocks; this_chunk_bytes = this_chunk_blocks << sdio->blkbits; page = ZERO_PAGE(0); if (submit_page_section(dio, sdio, page, 0, this_chunk_bytes, sdio->next_block_for_io, map_bh)) return; sdio->next_block_for_io += this_chunk_blocks; } /* * Walk the user pages, and the file, mapping blocks to disk and generating * a sequence of (page,offset,len,block) mappings. These mappings are injected * into submit_page_section(), which takes care of the next stage of submission * * Direct IO against a blockdev is different from a file. Because we can * happily perform page-sized but 512-byte aligned IOs. It is important that * blockdev IO be able to have fine alignment and large sizes. * * So what we do is to permit the ->get_block function to populate bh.b_size * with the size of IO which is permitted at this offset and this i_blkbits. * * For best results, the blockdev should be set up with 512-byte i_blkbits and * it should set b_size to PAGE_SIZE or more inside get_block(). This gives * fine alignment but still allows this function to work in PAGE_SIZE units. */ static int do_direct_IO(struct dio *dio, struct dio_submit *sdio, struct buffer_head *map_bh) { const enum req_op dio_op = dio->opf & REQ_OP_MASK; const unsigned blkbits = sdio->blkbits; const unsigned i_blkbits = blkbits + sdio->blkfactor; int ret = 0; while (sdio->block_in_file < sdio->final_block_in_request) { struct page *page; size_t from, to; page = dio_get_page(dio, sdio); if (IS_ERR(page)) { ret = PTR_ERR(page); goto out; } from = sdio->head ? 0 : sdio->from; to = (sdio->head == sdio->tail - 1) ? sdio->to : PAGE_SIZE; sdio->head++; while (from < to) { unsigned this_chunk_bytes; /* # of bytes mapped */ unsigned this_chunk_blocks; /* # of blocks */ unsigned u; if (sdio->blocks_available == 0) { /* * Need to go and map some more disk */ unsigned long blkmask; unsigned long dio_remainder; ret = get_more_blocks(dio, sdio, map_bh); if (ret) { dio_unpin_page(dio, page); goto out; } if (!buffer_mapped(map_bh)) goto do_holes; sdio->blocks_available = map_bh->b_size >> blkbits; sdio->next_block_for_io = map_bh->b_blocknr << sdio->blkfactor; if (buffer_new(map_bh)) { clean_bdev_aliases( map_bh->b_bdev, map_bh->b_blocknr, map_bh->b_size >> i_blkbits); } if (!sdio->blkfactor) goto do_holes; blkmask = (1 << sdio->blkfactor) - 1; dio_remainder = (sdio->block_in_file & blkmask); /* * If we are at the start of IO and that IO * starts partway into a fs-block, * dio_remainder will be non-zero. If the IO * is a read then we can simply advance the IO * cursor to the first block which is to be * read. But if the IO is a write and the * block was newly allocated we cannot do that; * the start of the fs block must be zeroed out * on-disk */ if (!buffer_new(map_bh)) sdio->next_block_for_io += dio_remainder; sdio->blocks_available -= dio_remainder; } do_holes: /* Handle holes */ if (!buffer_mapped(map_bh)) { loff_t i_size_aligned; /* AKPM: eargh, -ENOTBLK is a hack */ if (dio_op == REQ_OP_WRITE) { dio_unpin_page(dio, page); return -ENOTBLK; } /* * Be sure to account for a partial block as the * last block in the file */ i_size_aligned = ALIGN(i_size_read(dio->inode), 1 << blkbits); if (sdio->block_in_file >= i_size_aligned >> blkbits) { /* We hit eof */ dio_unpin_page(dio, page); goto out; } zero_user(page, from, 1 << blkbits); sdio->block_in_file++; from += 1 << blkbits; dio->result += 1 << blkbits; goto next_block; } /* * If we're performing IO which has an alignment which * is finer than the underlying fs, go check to see if * we must zero out the start of this block. */ if (unlikely(sdio->blkfactor && !sdio->start_zero_done)) dio_zero_block(dio, sdio, 0, map_bh); /* * Work out, in this_chunk_blocks, how much disk we * can add to this page */ this_chunk_blocks = sdio->blocks_available; u = (to - from) >> blkbits; if (this_chunk_blocks > u) this_chunk_blocks = u; u = sdio->final_block_in_request - sdio->block_in_file; if (this_chunk_blocks > u) this_chunk_blocks = u; this_chunk_bytes = this_chunk_blocks << blkbits; BUG_ON(this_chunk_bytes == 0); if (this_chunk_blocks == sdio->blocks_available) sdio->boundary = buffer_boundary(map_bh); ret = submit_page_section(dio, sdio, page, from, this_chunk_bytes, sdio->next_block_for_io, map_bh); if (ret) { dio_unpin_page(dio, page); goto out; } sdio->next_block_for_io += this_chunk_blocks; sdio->block_in_file += this_chunk_blocks; from += this_chunk_bytes; dio->result += this_chunk_bytes; sdio->blocks_available -= this_chunk_blocks; next_block: BUG_ON(sdio->block_in_file > sdio->final_block_in_request); if (sdio->block_in_file == sdio->final_block_in_request) break; } /* Drop the pin which was taken in get_user_pages() */ dio_unpin_page(dio, page); } out: return ret; } static inline int drop_refcount(struct dio *dio) { int ret2; unsigned long flags; /* * Sync will always be dropping the final ref and completing the * operation. AIO can if it was a broken operation described above or * in fact if all the bios race to complete before we get here. In * that case dio_complete() translates the EIOCBQUEUED into the proper * return code that the caller will hand to ->complete(). * * This is managed by the bio_lock instead of being an atomic_t so that * completion paths can drop their ref and use the remaining count to * decide to wake the submission path atomically. */ spin_lock_irqsave(&dio->bio_lock, flags); ret2 = --dio->refcount; spin_unlock_irqrestore(&dio->bio_lock, flags); return ret2; } /* * This is a library function for use by filesystem drivers. * * The locking rules are governed by the flags parameter: * - if the flags value contains DIO_LOCKING we use a fancy locking * scheme for dumb filesystems. * For writes this function is called under i_mutex and returns with * i_mutex held, for reads, i_mutex is not held on entry, but it is * taken and dropped again before returning. * - if the flags value does NOT contain DIO_LOCKING we don't use any * internal locking but rather rely on the filesystem to synchronize * direct I/O reads/writes versus each other and truncate. * * To help with locking against truncate we incremented the i_dio_count * counter before starting direct I/O, and decrement it once we are done. * Truncate can wait for it to reach zero to provide exclusion. It is * expected that filesystem provide exclusion between new direct I/O * and truncates. For DIO_LOCKING filesystems this is done by i_mutex, * but other filesystems need to take care of this on their own. * * NOTE: if you pass "sdio" to anything by pointer make sure that function * is always inlined. Otherwise gcc is unable to split the structure into * individual fields and will generate much worse code. This is important * for the whole file. */ ssize_t __blockdev_direct_IO(struct kiocb *iocb, struct inode *inode, struct block_device *bdev, struct iov_iter *iter, get_block_t get_block, dio_iodone_t end_io, int flags) { unsigned i_blkbits = READ_ONCE(inode->i_blkbits); unsigned blkbits = i_blkbits; unsigned blocksize_mask = (1 << blkbits) - 1; ssize_t retval = -EINVAL; const size_t count = iov_iter_count(iter); loff_t offset = iocb->ki_pos; const loff_t end = offset + count; struct dio *dio; struct dio_submit sdio = { 0, }; struct buffer_head map_bh = { 0, }; struct blk_plug plug; unsigned long align = offset | iov_iter_alignment(iter); /* * Avoid references to bdev if not absolutely needed to give * the early prefetch in the caller enough time. */ /* watch out for a 0 len io from a tricksy fs */ if (iov_iter_rw(iter) == READ && !count) return 0; dio = kmem_cache_alloc(dio_cache, GFP_KERNEL); if (!dio) return -ENOMEM; /* * Believe it or not, zeroing out the page array caused a .5% * performance regression in a database benchmark. So, we take * care to only zero out what's needed. */ memset(dio, 0, offsetof(struct dio, pages)); dio->flags = flags; if (dio->flags & DIO_LOCKING && iov_iter_rw(iter) == READ) { /* will be released by direct_io_worker */ inode_lock(inode); } dio->is_pinned = iov_iter_extract_will_pin(iter); /* Once we sampled i_size check for reads beyond EOF */ dio->i_size = i_size_read(inode); if (iov_iter_rw(iter) == READ && offset >= dio->i_size) { retval = 0; goto fail_dio; } if (align & blocksize_mask) { if (bdev) blkbits = blksize_bits(bdev_logical_block_size(bdev)); blocksize_mask = (1 << blkbits) - 1; if (align & blocksize_mask) goto fail_dio; } if (dio->flags & DIO_LOCKING && iov_iter_rw(iter) == READ) { struct address_space *mapping = iocb->ki_filp->f_mapping; retval = filemap_write_and_wait_range(mapping, offset, end - 1); if (retval) goto fail_dio; } /* * For file extending writes updating i_size before data writeouts * complete can expose uninitialized blocks in dumb filesystems. * In that case we need to wait for I/O completion even if asked * for an asynchronous write. */ if (is_sync_kiocb(iocb)) dio->is_async = false; else if (iov_iter_rw(iter) == WRITE && end > i_size_read(inode)) dio->is_async = false; else dio->is_async = true; dio->inode = inode; if (iov_iter_rw(iter) == WRITE) { dio->opf = REQ_OP_WRITE | REQ_SYNC | REQ_IDLE; if (iocb->ki_flags & IOCB_NOWAIT) dio->opf |= REQ_NOWAIT; } else { dio->opf = REQ_OP_READ; } /* * For AIO O_(D)SYNC writes we need to defer completions to a workqueue * so that we can call ->fsync. */ if (dio->is_async && iov_iter_rw(iter) == WRITE) { retval = 0; if (iocb_is_dsync(iocb)) retval = dio_set_defer_completion(dio); else if (!dio->inode->i_sb->s_dio_done_wq) { /* * In case of AIO write racing with buffered read we * need to defer completion. We can't decide this now, * however the workqueue needs to be initialized here. */ retval = sb_init_dio_done_wq(dio->inode->i_sb); } if (retval) goto fail_dio; } /* * Will be decremented at I/O completion time. */ inode_dio_begin(inode); retval = 0; sdio.blkbits = blkbits; sdio.blkfactor = i_blkbits - blkbits; sdio.block_in_file = offset >> blkbits; sdio.get_block = get_block; dio->end_io = end_io; sdio.final_block_in_bio = -1; sdio.next_block_for_io = -1; dio->iocb = iocb; spin_lock_init(&dio->bio_lock); dio->refcount = 1; dio->should_dirty = user_backed_iter(iter) && iov_iter_rw(iter) == READ; sdio.iter = iter; sdio.final_block_in_request = end >> blkbits; /* * In case of non-aligned buffers, we may need 2 more * pages since we need to zero out first and last block. */ if (unlikely(sdio.blkfactor)) sdio.pages_in_io = 2; sdio.pages_in_io += iov_iter_npages(iter, INT_MAX); blk_start_plug(&plug); retval = do_direct_IO(dio, &sdio, &map_bh); if (retval) dio_cleanup(dio, &sdio); if (retval == -ENOTBLK) { /* * The remaining part of the request will be * handled by buffered I/O when we return */ retval = 0; } /* * There may be some unwritten disk at the end of a part-written * fs-block-sized block. Go zero that now. */ dio_zero_block(dio, &sdio, 1, &map_bh); if (sdio.cur_page) { ssize_t ret2; ret2 = dio_send_cur_page(dio, &sdio, &map_bh); if (retval == 0) retval = ret2; dio_unpin_page(dio, sdio.cur_page); sdio.cur_page = NULL; } if (sdio.bio) dio_bio_submit(dio, &sdio); blk_finish_plug(&plug); /* * It is possible that, we return short IO due to end of file. * In that case, we need to release all the pages we got hold on. */ dio_cleanup(dio, &sdio); /* * All block lookups have been performed. For READ requests * we can let i_mutex go now that its achieved its purpose * of protecting us from looking up uninitialized blocks. */ if (iov_iter_rw(iter) == READ && (dio->flags & DIO_LOCKING)) inode_unlock(dio->inode); /* * The only time we want to leave bios in flight is when a successful * partial aio read or full aio write have been setup. In that case * bio completion will call aio_complete. The only time it's safe to * call aio_complete is when we return -EIOCBQUEUED, so we key on that. * This had *better* be the only place that raises -EIOCBQUEUED. */ BUG_ON(retval == -EIOCBQUEUED); if (dio->is_async && retval == 0 && dio->result && (iov_iter_rw(iter) == READ || dio->result == count)) retval = -EIOCBQUEUED; else dio_await_completion(dio); if (drop_refcount(dio) == 0) { retval = dio_complete(dio, retval, DIO_COMPLETE_INVALIDATE); } else BUG_ON(retval != -EIOCBQUEUED); return retval; fail_dio: if (dio->flags & DIO_LOCKING && iov_iter_rw(iter) == READ) inode_unlock(inode); kmem_cache_free(dio_cache, dio); return retval; } EXPORT_SYMBOL(__blockdev_direct_IO); static __init int dio_init(void) { dio_cache = KMEM_CACHE(dio, SLAB_PANIC); return 0; } module_init(dio_init) |
748 748 17 687 13 7 3 3 3 3 3 13 6 6 4 3 3 2 12 13 4 9 17 5 5 3 3 88 88 88 688 690 691 690 690 691 688 1 1 1 691 689 1 1 691 689 691 1 1 690 689 690 691 690 689 8 8 8 8 686 686 686 686 686 8 685 103 103 102 103 103 103 682 680 683 683 103 103 103 1 1 1 59 59 688 687 | 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60 61 62 63 64 65 66 67 68 69 70 71 72 73 74 75 76 77 78 79 80 81 82 83 84 85 86 87 88 89 90 91 92 93 94 95 96 97 98 99 100 101 102 103 104 105 106 107 108 109 110 111 112 113 114 115 116 117 118 119 120 121 122 123 124 125 126 127 128 129 130 131 132 133 134 135 136 137 138 139 140 141 142 143 144 145 146 147 148 149 150 151 152 153 154 155 156 157 158 159 160 161 162 163 164 165 166 167 168 169 170 171 172 173 174 175 176 177 178 179 180 181 182 183 184 185 186 187 188 189 190 191 192 193 194 195 196 197 198 199 200 201 202 203 204 205 206 207 208 209 210 211 212 213 214 215 216 217 218 219 220 221 222 223 224 225 226 227 228 229 230 231 232 233 234 235 236 237 238 239 240 241 242 243 244 245 246 247 248 249 250 251 252 253 254 255 256 257 258 259 260 261 262 263 264 265 266 267 268 269 270 271 272 273 274 275 276 277 278 279 280 281 282 283 284 285 286 287 288 289 290 291 292 293 294 295 296 297 298 299 300 301 302 303 304 305 306 307 308 309 310 311 312 313 314 315 316 317 318 319 320 321 322 323 324 325 326 327 328 329 330 331 332 333 334 335 336 337 338 339 340 341 342 343 344 345 346 347 348 349 350 351 352 353 354 355 356 357 358 359 360 361 362 363 364 365 366 367 368 369 370 371 372 373 374 375 376 377 378 379 380 381 382 383 384 385 386 387 388 389 390 391 392 393 394 395 396 397 398 399 400 401 402 403 404 405 406 407 408 409 410 411 412 413 414 415 416 417 418 419 420 421 422 423 424 425 426 427 428 429 430 431 432 433 434 435 436 437 438 439 440 441 442 443 444 445 446 447 448 449 450 451 452 453 454 455 456 457 458 459 460 461 462 463 464 465 466 467 468 469 470 471 472 473 474 475 476 477 478 479 480 481 482 483 484 485 486 487 488 489 490 491 492 493 494 495 496 497 498 499 500 501 502 503 504 505 506 507 508 509 510 511 512 513 514 515 516 517 518 519 520 521 522 523 524 525 526 527 528 529 530 531 532 533 534 535 536 537 538 539 540 541 542 543 544 545 546 547 548 549 550 551 552 553 554 555 556 557 558 559 560 561 562 563 564 565 566 567 568 569 570 571 572 573 574 575 576 577 578 579 580 581 582 583 584 585 586 587 588 589 590 591 592 593 594 595 596 597 598 599 600 601 602 603 604 605 606 607 608 609 610 611 612 613 614 615 616 617 618 619 620 621 622 623 624 625 626 627 628 629 630 631 632 633 634 635 636 637 638 639 640 641 642 643 644 645 646 647 648 649 650 651 652 653 654 655 656 657 658 659 660 661 662 663 664 665 666 667 668 669 670 671 672 673 674 675 676 677 678 679 680 681 682 683 684 685 686 687 688 689 690 691 692 693 694 695 696 697 698 699 700 701 702 703 704 705 706 707 708 709 710 711 712 713 714 715 716 717 718 719 720 721 722 723 724 725 726 727 728 729 730 731 732 733 734 735 736 737 738 739 740 741 742 743 744 745 746 747 748 749 750 751 752 753 754 755 756 757 758 759 760 761 762 763 764 765 766 767 768 769 770 771 772 773 774 775 776 777 778 779 780 781 782 783 784 785 786 787 788 789 790 791 792 793 794 795 796 797 798 799 800 801 802 803 804 805 806 807 808 809 810 811 812 813 814 815 816 817 818 819 820 821 822 823 824 825 826 827 828 829 830 831 832 833 834 835 836 837 838 839 840 841 842 843 844 845 846 847 848 849 850 851 852 853 854 855 856 857 858 859 860 861 862 863 864 865 866 867 868 869 870 871 872 873 874 875 876 877 878 879 880 881 882 883 884 885 886 887 888 889 890 891 892 893 894 895 896 897 898 899 900 901 902 903 904 905 906 907 908 909 910 911 912 913 914 915 916 917 918 919 920 921 922 923 924 925 926 927 928 929 930 931 932 933 934 935 936 937 938 939 940 941 942 943 944 945 946 947 948 949 950 951 952 953 954 955 956 957 958 959 960 961 962 963 964 965 966 967 968 969 970 971 972 973 974 975 976 977 978 979 980 981 982 983 984 985 986 987 988 989 990 991 992 993 994 995 996 997 998 999 1000 1001 1002 1003 1004 1005 1006 1007 1008 1009 1010 1011 1012 1013 1014 1015 1016 1017 1018 1019 1020 1021 1022 1023 1024 1025 1026 1027 1028 1029 1030 1031 1032 1033 1034 1035 1036 1037 1038 1039 1040 1041 1042 1043 1044 1045 1046 1047 1048 1049 1050 1051 1052 1053 1054 1055 1056 1057 1058 1059 1060 1061 1062 1063 1064 1065 1066 1067 1068 1069 1070 1071 1072 1073 1074 1075 1076 1077 1078 1079 1080 1081 1082 1083 1084 1085 1086 1087 1088 1089 1090 1091 1092 1093 1094 1095 1096 1097 1098 1099 1100 1101 1102 1103 1104 1105 1106 1107 1108 1109 1110 1111 1112 1113 1114 1115 1116 1117 1118 1119 1120 1121 1122 1123 1124 1125 1126 1127 1128 1129 1130 1131 1132 1133 1134 1135 1136 1137 1138 1139 1140 1141 1142 1143 1144 1145 1146 1147 1148 1149 1150 1151 1152 1153 1154 1155 1156 1157 1158 1159 1160 1161 1162 1163 1164 1165 1166 1167 1168 1169 1170 1171 1172 1173 1174 1175 1176 1177 1178 1179 1180 1181 1182 1183 1184 1185 1186 1187 1188 1189 1190 1191 1192 1193 1194 1195 1196 1197 1198 1199 1200 1201 1202 1203 1204 1205 1206 1207 1208 1209 1210 1211 1212 1213 1214 1215 1216 1217 1218 1219 1220 1221 1222 1223 1224 1225 1226 1227 1228 1229 1230 1231 1232 1233 1234 1235 1236 1237 1238 1239 1240 1241 1242 1243 1244 1245 1246 1247 1248 1249 1250 1251 1252 1253 1254 1255 1256 1257 1258 1259 1260 1261 1262 1263 1264 1265 1266 1267 1268 1269 1270 1271 1272 1273 1274 1275 1276 1277 1278 1279 1280 1281 1282 1283 1284 1285 1286 1287 1288 1289 1290 1291 1292 1293 1294 1295 1296 1297 1298 1299 1300 1301 1302 1303 1304 1305 1306 1307 1308 1309 1310 1311 1312 1313 1314 1315 1316 1317 1318 1319 1320 1321 1322 1323 1324 1325 1326 1327 1328 1329 1330 1331 1332 1333 1334 1335 1336 1337 1338 1339 1340 1341 1342 1343 1344 1345 1346 1347 1348 1349 1350 1351 1352 1353 1354 1355 1356 1357 1358 1359 1360 1361 1362 1363 1364 1365 1366 1367 1368 1369 1370 1371 1372 1373 1374 1375 1376 1377 1378 1379 1380 1381 1382 1383 1384 1385 1386 1387 1388 1389 1390 1391 1392 1393 1394 1395 1396 1397 1398 1399 1400 1401 1402 1403 1404 1405 1406 1407 1408 1409 1410 1411 1412 1413 1414 1415 1416 1417 1418 1419 1420 1421 1422 1423 1424 1425 1426 1427 1428 1429 1430 1431 1432 1433 1434 1435 1436 1437 1438 1439 1440 1441 1442 1443 1444 1445 1446 1447 1448 1449 1450 1451 1452 1453 1454 1455 1456 1457 1458 1459 1460 1461 1462 1463 1464 1465 1466 1467 1468 1469 1470 1471 1472 1473 1474 1475 1476 1477 1478 1479 1480 1481 1482 1483 1484 1485 1486 1487 1488 1489 1490 1491 1492 1493 1494 1495 1496 1497 1498 1499 1500 1501 1502 1503 1504 1505 1506 1507 1508 1509 1510 1511 1512 1513 1514 1515 1516 1517 1518 1519 1520 1521 1522 1523 1524 1525 1526 1527 1528 1529 1530 1531 1532 1533 1534 1535 1536 1537 1538 1539 1540 1541 1542 1543 1544 1545 1546 1547 1548 1549 1550 1551 1552 1553 1554 1555 1556 1557 1558 1559 1560 1561 1562 1563 1564 1565 1566 1567 1568 1569 1570 1571 1572 1573 1574 1575 1576 1577 1578 1579 1580 1581 1582 1583 1584 1585 1586 1587 1588 1589 1590 1591 1592 1593 1594 1595 1596 1597 1598 1599 1600 1601 1602 1603 1604 1605 1606 1607 1608 1609 1610 1611 1612 1613 1614 1615 1616 1617 1618 1619 1620 1621 1622 1623 1624 1625 1626 1627 1628 1629 1630 1631 1632 1633 1634 1635 1636 1637 1638 1639 1640 1641 1642 1643 1644 1645 1646 1647 1648 1649 1650 1651 1652 1653 1654 1655 1656 1657 1658 1659 1660 1661 1662 1663 1664 1665 1666 1667 1668 1669 1670 1671 1672 1673 1674 1675 1676 1677 1678 1679 1680 1681 1682 1683 1684 1685 1686 1687 1688 1689 1690 1691 1692 1693 1694 1695 1696 1697 1698 1699 1700 1701 1702 1703 1704 1705 1706 1707 1708 1709 1710 1711 1712 1713 1714 1715 1716 1717 1718 1719 1720 1721 1722 1723 1724 1725 1726 1727 1728 1729 1730 1731 1732 1733 1734 1735 1736 1737 1738 1739 1740 1741 1742 1743 1744 1745 1746 1747 1748 1749 1750 1751 1752 1753 1754 1755 1756 1757 1758 1759 1760 1761 1762 1763 1764 1765 1766 1767 1768 1769 1770 1771 1772 1773 1774 1775 1776 1777 1778 1779 1780 1781 1782 1783 1784 1785 1786 1787 1788 1789 1790 1791 1792 1793 1794 1795 1796 1797 1798 1799 1800 1801 1802 1803 1804 1805 1806 1807 1808 1809 1810 1811 1812 1813 1814 1815 1816 1817 1818 1819 1820 1821 1822 1823 1824 1825 1826 1827 1828 1829 1830 1831 1832 1833 1834 1835 1836 1837 1838 1839 1840 1841 1842 1843 1844 1845 1846 1847 1848 1849 1850 1851 1852 1853 1854 1855 1856 1857 1858 1859 1860 1861 1862 1863 1864 1865 1866 1867 1868 1869 1870 1871 1872 1873 1874 1875 1876 1877 1878 1879 1880 1881 1882 1883 1884 1885 1886 1887 1888 1889 1890 1891 1892 1893 1894 1895 1896 1897 1898 1899 1900 1901 1902 1903 1904 1905 1906 1907 1908 1909 1910 1911 1912 1913 1914 1915 1916 1917 1918 1919 1920 1921 1922 1923 1924 1925 1926 1927 1928 1929 1930 1931 1932 1933 1934 1935 1936 1937 1938 1939 1940 1941 1942 1943 1944 1945 1946 1947 1948 1949 1950 1951 1952 1953 1954 1955 1956 1957 1958 1959 1960 1961 1962 1963 1964 1965 1966 1967 1968 1969 1970 1971 1972 1973 1974 1975 1976 1977 1978 1979 1980 1981 1982 1983 1984 1985 1986 1987 1988 1989 1990 1991 1992 1993 1994 1995 1996 1997 1998 1999 2000 2001 2002 2003 2004 2005 2006 2007 2008 2009 2010 2011 2012 2013 2014 2015 2016 2017 2018 2019 2020 2021 2022 2023 2024 2025 2026 2027 2028 2029 2030 2031 2032 2033 2034 2035 2036 2037 2038 2039 2040 2041 2042 2043 2044 2045 2046 2047 2048 2049 2050 2051 2052 2053 2054 2055 2056 2057 2058 2059 2060 2061 2062 2063 2064 2065 2066 2067 2068 2069 2070 2071 2072 2073 2074 2075 2076 2077 2078 2079 2080 2081 2082 2083 2084 2085 2086 2087 2088 2089 2090 2091 2092 2093 2094 2095 2096 2097 2098 2099 2100 2101 2102 2103 2104 2105 2106 2107 2108 2109 2110 2111 2112 2113 2114 2115 2116 2117 2118 2119 2120 2121 2122 2123 2124 2125 2126 2127 2128 2129 2130 2131 2132 2133 2134 2135 2136 2137 2138 2139 2140 2141 2142 2143 2144 2145 2146 2147 2148 2149 2150 2151 2152 2153 2154 2155 2156 2157 2158 2159 2160 2161 2162 2163 2164 2165 2166 2167 2168 2169 2170 2171 2172 2173 2174 2175 2176 2177 2178 2179 2180 2181 2182 2183 2184 2185 2186 2187 2188 2189 2190 2191 2192 2193 2194 2195 2196 2197 2198 2199 2200 2201 2202 2203 2204 2205 2206 2207 2208 2209 2210 2211 2212 2213 2214 2215 2216 2217 2218 2219 2220 2221 2222 2223 2224 2225 2226 2227 2228 2229 2230 2231 2232 2233 2234 2235 2236 2237 2238 2239 2240 2241 2242 2243 2244 2245 2246 2247 2248 2249 2250 2251 2252 2253 2254 2255 2256 2257 2258 2259 2260 2261 2262 2263 2264 2265 2266 2267 2268 2269 2270 2271 2272 2273 2274 2275 2276 2277 2278 2279 2280 2281 2282 2283 2284 2285 2286 2287 2288 2289 2290 2291 2292 2293 2294 2295 2296 2297 2298 2299 2300 2301 2302 2303 2304 2305 2306 2307 2308 2309 2310 2311 2312 2313 2314 2315 2316 2317 2318 2319 2320 2321 2322 2323 2324 2325 2326 2327 2328 2329 2330 2331 2332 2333 2334 2335 2336 2337 2338 2339 2340 2341 2342 2343 2344 2345 2346 2347 2348 2349 2350 2351 2352 2353 2354 2355 2356 2357 2358 2359 2360 2361 2362 2363 2364 2365 2366 2367 2368 2369 2370 2371 2372 2373 2374 2375 2376 2377 2378 2379 2380 2381 2382 2383 2384 2385 2386 2387 2388 2389 2390 2391 2392 2393 2394 2395 2396 2397 2398 2399 2400 2401 2402 2403 2404 2405 2406 2407 2408 2409 2410 2411 2412 2413 2414 2415 2416 2417 2418 2419 2420 2421 2422 2423 2424 2425 2426 2427 2428 2429 2430 2431 2432 2433 2434 2435 2436 2437 2438 2439 2440 2441 2442 2443 2444 2445 2446 2447 2448 2449 2450 2451 2452 2453 2454 2455 2456 2457 2458 2459 2460 2461 2462 2463 | // SPDX-License-Identifier: GPL-2.0-or-later /* audit.c -- Auditing support * Gateway between the kernel (e.g., selinux) and the user-space audit daemon. * System-call specific features have moved to auditsc.c * * Copyright 2003-2007 Red Hat Inc., Durham, North Carolina. * All Rights Reserved. * * Written by Rickard E. (Rik) Faith <faith@redhat.com> * * Goals: 1) Integrate fully with Security Modules. * 2) Minimal run-time overhead: * a) Minimal when syscall auditing is disabled (audit_enable=0). * b) Small when syscall auditing is enabled and no audit record * is generated (defer as much work as possible to record * generation time): * i) context is allocated, * ii) names from getname are stored without a copy, and * iii) inode information stored from path_lookup. * 3) Ability to disable syscall auditing at boot time (audit=0). * 4) Usable by other parts of the kernel (if audit_log* is called, * then a syscall record will be generated automatically for the * current syscall). * 5) Netlink interface to user-space. * 6) Support low-overhead kernel-based filtering to minimize the * information that must be passed to user-space. * * Audit userspace, documentation, tests, and bug/issue trackers: * https://github.com/linux-audit */ #define pr_fmt(fmt) KBUILD_MODNAME ": " fmt #include <linux/file.h> #include <linux/init.h> #include <linux/types.h> #include <linux/atomic.h> #include <linux/mm.h> #include <linux/export.h> #include <linux/slab.h> #include <linux/err.h> #include <linux/kthread.h> #include <linux/kernel.h> #include <linux/syscalls.h> #include <linux/spinlock.h> #include <linux/rcupdate.h> #include <linux/mutex.h> #include <linux/gfp.h> #include <linux/pid.h> #include <linux/audit.h> #include <net/sock.h> #include <net/netlink.h> #include <linux/skbuff.h> #include <linux/security.h> #include <linux/freezer.h> #include <linux/pid_namespace.h> #include <net/netns/generic.h> #include "audit.h" /* No auditing will take place until audit_initialized == AUDIT_INITIALIZED. * (Initialization happens after skb_init is called.) */ #define AUDIT_DISABLED -1 #define AUDIT_UNINITIALIZED 0 #define AUDIT_INITIALIZED 1 static int audit_initialized = AUDIT_UNINITIALIZED; u32 audit_enabled = AUDIT_OFF; bool audit_ever_enabled = !!AUDIT_OFF; EXPORT_SYMBOL_GPL(audit_enabled); /* Default state when kernel boots without any parameters. */ static u32 audit_default = AUDIT_OFF; /* If auditing cannot proceed, audit_failure selects what happens. */ static u32 audit_failure = AUDIT_FAIL_PRINTK; /* private audit network namespace index */ static unsigned int audit_net_id; /** * struct audit_net - audit private network namespace data * @sk: communication socket */ struct audit_net { struct sock *sk; }; /** * struct auditd_connection - kernel/auditd connection state * @pid: auditd PID * @portid: netlink portid * @net: the associated network namespace * @rcu: RCU head * * Description: * This struct is RCU protected; you must either hold the RCU lock for reading * or the associated spinlock for writing. */ struct auditd_connection { struct pid *pid; u32 portid; struct net *net; struct rcu_head rcu; }; static struct auditd_connection __rcu *auditd_conn; static DEFINE_SPINLOCK(auditd_conn_lock); /* If audit_rate_limit is non-zero, limit the rate of sending audit records * to that number per second. This prevents DoS attacks, but results in * audit records being dropped. */ static u32 audit_rate_limit; /* Number of outstanding audit_buffers allowed. * When set to zero, this means unlimited. */ static u32 audit_backlog_limit = 64; #define AUDIT_BACKLOG_WAIT_TIME (60 * HZ) static u32 audit_backlog_wait_time = AUDIT_BACKLOG_WAIT_TIME; /* The identity of the user shutting down the audit system. */ static kuid_t audit_sig_uid = INVALID_UID; static pid_t audit_sig_pid = -1; static u32 audit_sig_sid; /* Records can be lost in several ways: 0) [suppressed in audit_alloc] 1) out of memory in audit_log_start [kmalloc of struct audit_buffer] 2) out of memory in audit_log_move [alloc_skb] 3) suppressed due to audit_rate_limit 4) suppressed due to audit_backlog_limit */ static atomic_t audit_lost = ATOMIC_INIT(0); /* Monotonically increasing sum of time the kernel has spent * waiting while the backlog limit is exceeded. */ static atomic_t audit_backlog_wait_time_actual = ATOMIC_INIT(0); /* Hash for inode-based rules */ struct list_head audit_inode_hash[AUDIT_INODE_BUCKETS]; static struct kmem_cache *audit_buffer_cache; /* queue msgs to send via kauditd_task */ static struct sk_buff_head audit_queue; /* queue msgs due to temporary unicast send problems */ static struct sk_buff_head audit_retry_queue; /* queue msgs waiting for new auditd connection */ static struct sk_buff_head audit_hold_queue; /* queue servicing thread */ static struct task_struct *kauditd_task; static DECLARE_WAIT_QUEUE_HEAD(kauditd_wait); /* waitqueue for callers who are blocked on the audit backlog */ static DECLARE_WAIT_QUEUE_HEAD(audit_backlog_wait); static struct audit_features af = {.vers = AUDIT_FEATURE_VERSION, .mask = -1, .features = 0, .lock = 0,}; static char *audit_feature_names[2] = { "only_unset_loginuid", "loginuid_immutable", }; /** * struct audit_ctl_mutex - serialize requests from userspace * @lock: the mutex used for locking * @owner: the task which owns the lock * * Description: * This is the lock struct used to ensure we only process userspace requests * in an orderly fashion. We can't simply use a mutex/lock here because we * need to track lock ownership so we don't end up blocking the lock owner in * audit_log_start() or similar. */ static struct audit_ctl_mutex { struct mutex lock; void *owner; } audit_cmd_mutex; /* AUDIT_BUFSIZ is the size of the temporary buffer used for formatting * audit records. Since printk uses a 1024 byte buffer, this buffer * should be at least that large. */ #define AUDIT_BUFSIZ 1024 /* The audit_buffer is used when formatting an audit record. The caller * locks briefly to get the record off the freelist or to allocate the * buffer, and locks briefly to send the buffer to the netlink layer or * to place it on a transmit queue. Multiple audit_buffers can be in * use simultaneously. */ struct audit_buffer { struct sk_buff *skb; /* formatted skb ready to send */ struct audit_context *ctx; /* NULL or associated context */ gfp_t gfp_mask; }; struct audit_reply { __u32 portid; struct net *net; struct sk_buff *skb; }; /** * auditd_test_task - Check to see if a given task is an audit daemon * @task: the task to check * * Description: * Return 1 if the task is a registered audit daemon, 0 otherwise. */ int auditd_test_task(struct task_struct *task) { int rc; struct auditd_connection *ac; rcu_read_lock(); ac = rcu_dereference(auditd_conn); rc = (ac && ac->pid == task_tgid(task) ? 1 : 0); rcu_read_unlock(); return rc; } /** * audit_ctl_lock - Take the audit control lock */ void audit_ctl_lock(void) { mutex_lock(&audit_cmd_mutex.lock); audit_cmd_mutex.owner = current; } /** * audit_ctl_unlock - Drop the audit control lock */ void audit_ctl_unlock(void) { audit_cmd_mutex.owner = NULL; mutex_unlock(&audit_cmd_mutex.lock); } /** * audit_ctl_owner_current - Test to see if the current task owns the lock * * Description: * Return true if the current task owns the audit control lock, false if it * doesn't own the lock. */ static bool audit_ctl_owner_current(void) { return (current == audit_cmd_mutex.owner); } /** * auditd_pid_vnr - Return the auditd PID relative to the namespace * * Description: * Returns the PID in relation to the namespace, 0 on failure. */ static pid_t auditd_pid_vnr(void) { pid_t pid; const struct auditd_connection *ac; rcu_read_lock(); ac = rcu_dereference(auditd_conn); if (!ac || !ac->pid) pid = 0; else pid = pid_vnr(ac->pid); rcu_read_unlock(); return pid; } /** * audit_get_sk - Return the audit socket for the given network namespace * @net: the destination network namespace * * Description: * Returns the sock pointer if valid, NULL otherwise. The caller must ensure * that a reference is held for the network namespace while the sock is in use. */ static struct sock *audit_get_sk(const struct net *net) { struct audit_net *aunet; if (!net) return NULL; aunet = net_generic(net, audit_net_id); return aunet->sk; } void audit_panic(const char *message) { switch (audit_failure) { case AUDIT_FAIL_SILENT: break; case AUDIT_FAIL_PRINTK: if (printk_ratelimit()) pr_err("%s\n", message); break; case AUDIT_FAIL_PANIC: panic("audit: %s\n", message); break; } } static inline int audit_rate_check(void) { static unsigned long last_check = 0; static int messages = 0; static DEFINE_SPINLOCK(lock); unsigned long flags; unsigned long now; int retval = 0; if (!audit_rate_limit) return 1; spin_lock_irqsave(&lock, flags); if (++messages < audit_rate_limit) { retval = 1; } else { now = jiffies; if (time_after(now, last_check + HZ)) { last_check = now; messages = 0; retval = 1; } } spin_unlock_irqrestore(&lock, flags); return retval; } /** * audit_log_lost - conditionally log lost audit message event * @message: the message stating reason for lost audit message * * Emit at least 1 message per second, even if audit_rate_check is * throttling. * Always increment the lost messages counter. */ void audit_log_lost(const char *message) { static unsigned long last_msg = 0; static DEFINE_SPINLOCK(lock); unsigned long flags; unsigned long now; int print; atomic_inc(&audit_lost); print = (audit_failure == AUDIT_FAIL_PANIC || !audit_rate_limit); if (!print) { spin_lock_irqsave(&lock, flags); now = jiffies; if (time_after(now, last_msg + HZ)) { print = 1; last_msg = now; } spin_unlock_irqrestore(&lock, flags); } if (print) { if (printk_ratelimit()) pr_warn("audit_lost=%u audit_rate_limit=%u audit_backlog_limit=%u\n", atomic_read(&audit_lost), audit_rate_limit, audit_backlog_limit); audit_panic(message); } } static int audit_log_config_change(char *function_name, u32 new, u32 old, int allow_changes) { struct audit_buffer *ab; int rc = 0; ab = audit_log_start(audit_context(), GFP_KERNEL, AUDIT_CONFIG_CHANGE); if (unlikely(!ab)) return rc; audit_log_format(ab, "op=set %s=%u old=%u ", function_name, new, old); audit_log_session_info(ab); rc = audit_log_task_context(ab); if (rc) allow_changes = 0; /* Something weird, deny request */ audit_log_format(ab, " res=%d", allow_changes); audit_log_end(ab); return rc; } static int audit_do_config_change(char *function_name, u32 *to_change, u32 new) { int allow_changes, rc = 0; u32 old = *to_change; /* check if we are locked */ if (audit_enabled == AUDIT_LOCKED) allow_changes = 0; else allow_changes = 1; if (audit_enabled != AUDIT_OFF) { rc = audit_log_config_change(function_name, new, old, allow_changes); if (rc) allow_changes = 0; } /* If we are allowed, make the change */ if (allow_changes == 1) *to_change = new; /* Not allowed, update reason */ else if (rc == 0) rc = -EPERM; return rc; } static int audit_set_rate_limit(u32 limit) { return audit_do_config_change("audit_rate_limit", &audit_rate_limit, limit); } static int audit_set_backlog_limit(u32 limit) { return audit_do_config_change("audit_backlog_limit", &audit_backlog_limit, limit); } static int audit_set_backlog_wait_time(u32 timeout) { return audit_do_config_change("audit_backlog_wait_time", &audit_backlog_wait_time, timeout); } static int audit_set_enabled(u32 state) { int rc; if (state > AUDIT_LOCKED) return -EINVAL; rc = audit_do_config_change("audit_enabled", &audit_enabled, state); if (!rc) audit_ever_enabled |= !!state; return rc; } static int audit_set_failure(u32 state) { if (state != AUDIT_FAIL_SILENT && state != AUDIT_FAIL_PRINTK && state != AUDIT_FAIL_PANIC) return -EINVAL; return audit_do_config_change("audit_failure", &audit_failure, state); } /** * auditd_conn_free - RCU helper to release an auditd connection struct * @rcu: RCU head * * Description: * Drop any references inside the auditd connection tracking struct and free * the memory. */ static void auditd_conn_free(struct rcu_head *rcu) { struct auditd_connection *ac; ac = container_of(rcu, struct auditd_connection, rcu); put_pid(ac->pid); put_net(ac->net); kfree(ac); } /** * auditd_set - Set/Reset the auditd connection state * @pid: auditd PID * @portid: auditd netlink portid * @net: auditd network namespace pointer * * Description: * This function will obtain and drop network namespace references as * necessary. Returns zero on success, negative values on failure. */ static int auditd_set(struct pid *pid, u32 portid, struct net *net) { unsigned long flags; struct auditd_connection *ac_old, *ac_new; if (!pid || !net) return -EINVAL; ac_new = kzalloc(sizeof(*ac_new), GFP_KERNEL); if (!ac_new) return -ENOMEM; ac_new->pid = get_pid(pid); ac_new->portid = portid; ac_new->net = get_net(net); spin_lock_irqsave(&auditd_conn_lock, flags); ac_old = rcu_dereference_protected(auditd_conn, lockdep_is_held(&auditd_conn_lock)); rcu_assign_pointer(auditd_conn, ac_new); spin_unlock_irqrestore(&auditd_conn_lock, flags); if (ac_old) call_rcu(&ac_old->rcu, auditd_conn_free); return 0; } /** * kauditd_printk_skb - Print the audit record to the ring buffer * @skb: audit record * * Whatever the reason, this packet may not make it to the auditd connection * so write it via printk so the information isn't completely lost. */ static void kauditd_printk_skb(struct sk_buff *skb) { struct nlmsghdr *nlh = nlmsg_hdr(skb); char *data = nlmsg_data(nlh); if (nlh->nlmsg_type != AUDIT_EOE && printk_ratelimit()) pr_notice("type=%d %s\n", nlh->nlmsg_type, data); } /** * kauditd_rehold_skb - Handle a audit record send failure in the hold queue * @skb: audit record * @error: error code (unused) * * Description: * This should only be used by the kauditd_thread when it fails to flush the * hold queue. */ static void kauditd_rehold_skb(struct sk_buff *skb, __always_unused int error) { /* put the record back in the queue */ skb_queue_tail(&audit_hold_queue, skb); } /** * kauditd_hold_skb - Queue an audit record, waiting for auditd * @skb: audit record * @error: error code * * Description: * Queue the audit record, waiting for an instance of auditd. When this * function is called we haven't given up yet on sending the record, but things * are not looking good. The first thing we want to do is try to write the * record via printk and then see if we want to try and hold on to the record * and queue it, if we have room. If we want to hold on to the record, but we * don't have room, record a record lost message. */ static void kauditd_hold_skb(struct sk_buff *skb, int error) { /* at this point it is uncertain if we will ever send this to auditd so * try to send the message via printk before we go any further */ kauditd_printk_skb(skb); /* can we just silently drop the message? */ if (!audit_default) goto drop; /* the hold queue is only for when the daemon goes away completely, * not -EAGAIN failures; if we are in a -EAGAIN state requeue the * record on the retry queue unless it's full, in which case drop it */ if (error == -EAGAIN) { if (!audit_backlog_limit || skb_queue_len(&audit_retry_queue) < audit_backlog_limit) { skb_queue_tail(&audit_retry_queue, skb); return; } audit_log_lost("kauditd retry queue overflow"); goto drop; } /* if we have room in the hold queue, queue the message */ if (!audit_backlog_limit || skb_queue_len(&audit_hold_queue) < audit_backlog_limit) { skb_queue_tail(&audit_hold_queue, skb); return; } /* we have no other options - drop the message */ audit_log_lost("kauditd hold queue overflow"); drop: kfree_skb(skb); } /** * kauditd_retry_skb - Queue an audit record, attempt to send again to auditd * @skb: audit record * @error: error code (unused) * * Description: * Not as serious as kauditd_hold_skb() as we still have a connected auditd, * but for some reason we are having problems sending it audit records so * queue the given record and attempt to resend. */ static void kauditd_retry_skb(struct sk_buff *skb, __always_unused int error) { if (!audit_backlog_limit || skb_queue_len(&audit_retry_queue) < audit_backlog_limit) { skb_queue_tail(&audit_retry_queue, skb); return; } /* we have to drop the record, send it via printk as a last effort */ kauditd_printk_skb(skb); audit_log_lost("kauditd retry queue overflow"); kfree_skb(skb); } /** * auditd_reset - Disconnect the auditd connection * @ac: auditd connection state * * Description: * Break the auditd/kauditd connection and move all the queued records into the * hold queue in case auditd reconnects. It is important to note that the @ac * pointer should never be dereferenced inside this function as it may be NULL * or invalid, you can only compare the memory address! If @ac is NULL then * the connection will always be reset. */ static void auditd_reset(const struct auditd_connection *ac) { unsigned long flags; struct sk_buff *skb; struct auditd_connection *ac_old; /* if it isn't already broken, break the connection */ spin_lock_irqsave(&auditd_conn_lock, flags); ac_old = rcu_dereference_protected(auditd_conn, lockdep_is_held(&auditd_conn_lock)); if (ac && ac != ac_old) { /* someone already registered a new auditd connection */ spin_unlock_irqrestore(&auditd_conn_lock, flags); return; } rcu_assign_pointer(auditd_conn, NULL); spin_unlock_irqrestore(&auditd_conn_lock, flags); if (ac_old) call_rcu(&ac_old->rcu, auditd_conn_free); /* flush the retry queue to the hold queue, but don't touch the main * queue since we need to process that normally for multicast */ while ((skb = skb_dequeue(&audit_retry_queue))) kauditd_hold_skb(skb, -ECONNREFUSED); } /** * auditd_send_unicast_skb - Send a record via unicast to auditd * @skb: audit record * * Description: * Send a skb to the audit daemon, returns positive/zero values on success and * negative values on failure; in all cases the skb will be consumed by this * function. If the send results in -ECONNREFUSED the connection with auditd * will be reset. This function may sleep so callers should not hold any locks * where this would cause a problem. */ static int auditd_send_unicast_skb(struct sk_buff *skb) { int rc; u32 portid; struct net *net; struct sock *sk; struct auditd_connection *ac; /* NOTE: we can't call netlink_unicast while in the RCU section so * take a reference to the network namespace and grab local * copies of the namespace, the sock, and the portid; the * namespace and sock aren't going to go away while we hold a * reference and if the portid does become invalid after the RCU * section netlink_unicast() should safely return an error */ rcu_read_lock(); ac = rcu_dereference(auditd_conn); if (!ac) { rcu_read_unlock(); kfree_skb(skb); rc = -ECONNREFUSED; goto err; } net = get_net(ac->net); sk = audit_get_sk(net); portid = ac->portid; rcu_read_unlock(); rc = netlink_unicast(sk, skb, portid, 0); put_net(net); if (rc < 0) goto err; return rc; err: if (ac && rc == -ECONNREFUSED) auditd_reset(ac); return rc; } /** * kauditd_send_queue - Helper for kauditd_thread to flush skb queues * @sk: the sending sock * @portid: the netlink destination * @queue: the skb queue to process * @retry_limit: limit on number of netlink unicast failures * @skb_hook: per-skb hook for additional processing * @err_hook: hook called if the skb fails the netlink unicast send * * Description: * Run through the given queue and attempt to send the audit records to auditd, * returns zero on success, negative values on failure. It is up to the caller * to ensure that the @sk is valid for the duration of this function. * */ static int kauditd_send_queue(struct sock *sk, u32 portid, struct sk_buff_head *queue, unsigned int retry_limit, void (*skb_hook)(struct sk_buff *skb), void (*err_hook)(struct sk_buff *skb, int error)) { int rc = 0; struct sk_buff *skb = NULL; struct sk_buff *skb_tail; unsigned int failed = 0; /* NOTE: kauditd_thread takes care of all our locking, we just use * the netlink info passed to us (e.g. sk and portid) */ skb_tail = skb_peek_tail(queue); while ((skb != skb_tail) && (skb = skb_dequeue(queue))) { /* call the skb_hook for each skb we touch */ if (skb_hook) (*skb_hook)(skb); /* can we send to anyone via unicast? */ if (!sk) { if (err_hook) (*err_hook)(skb, -ECONNREFUSED); continue; } retry: /* grab an extra skb reference in case of error */ skb_get(skb); rc = netlink_unicast(sk, skb, portid, 0); if (rc < 0) { /* send failed - try a few times unless fatal error */ if (++failed >= retry_limit || rc == -ECONNREFUSED || rc == -EPERM) { sk = NULL; if (err_hook) (*err_hook)(skb, rc); if (rc == -EAGAIN) rc = 0; /* continue to drain the queue */ continue; } else goto retry; } else { /* skb sent - drop the extra reference and continue */ consume_skb(skb); failed = 0; } } return (rc >= 0 ? 0 : rc); } /* * kauditd_send_multicast_skb - Send a record to any multicast listeners * @skb: audit record * * Description: * Write a multicast message to anyone listening in the initial network * namespace. This function doesn't consume an skb as might be expected since * it has to copy it anyways. */ static void kauditd_send_multicast_skb(struct sk_buff *skb) { struct sk_buff *copy; struct sock *sock = audit_get_sk(&init_net); struct nlmsghdr *nlh; /* NOTE: we are not taking an additional reference for init_net since * we don't have to worry about it going away */ if (!netlink_has_listeners(sock, AUDIT_NLGRP_READLOG)) return; /* * The seemingly wasteful skb_copy() rather than bumping the refcount * using skb_get() is necessary because non-standard mods are made to * the skb by the original kaudit unicast socket send routine. The * existing auditd daemon assumes this breakage. Fixing this would * require co-ordinating a change in the established protocol between * the kaudit kernel subsystem and the auditd userspace code. There is * no reason for new multicast clients to continue with this * non-compliance. */ copy = skb_copy(skb, GFP_KERNEL); if (!copy) return; nlh = nlmsg_hdr(copy); nlh->nlmsg_len = skb->len; nlmsg_multicast(sock, copy, 0, AUDIT_NLGRP_READLOG, GFP_KERNEL); } /** * kauditd_thread - Worker thread to send audit records to userspace * @dummy: unused */ static int kauditd_thread(void *dummy) { int rc; u32 portid = 0; struct net *net = NULL; struct sock *sk = NULL; struct auditd_connection *ac; #define UNICAST_RETRIES 5 set_freezable(); while (!kthread_should_stop()) { /* NOTE: see the lock comments in auditd_send_unicast_skb() */ rcu_read_lock(); ac = rcu_dereference(auditd_conn); if (!ac) { rcu_read_unlock(); goto main_queue; } net = get_net(ac->net); sk = audit_get_sk(net); portid = ac->portid; rcu_read_unlock(); /* attempt to flush the hold queue */ rc = kauditd_send_queue(sk, portid, &audit_hold_queue, UNICAST_RETRIES, NULL, kauditd_rehold_skb); if (rc < 0) { sk = NULL; auditd_reset(ac); goto main_queue; } /* attempt to flush the retry queue */ rc = kauditd_send_queue(sk, portid, &audit_retry_queue, UNICAST_RETRIES, NULL, kauditd_hold_skb); if (rc < 0) { sk = NULL; auditd_reset(ac); goto main_queue; } main_queue: /* process the main queue - do the multicast send and attempt * unicast, dump failed record sends to the retry queue; if * sk == NULL due to previous failures we will just do the * multicast send and move the record to the hold queue */ rc = kauditd_send_queue(sk, portid, &audit_queue, 1, kauditd_send_multicast_skb, (sk ? kauditd_retry_skb : kauditd_hold_skb)); if (ac && rc < 0) auditd_reset(ac); sk = NULL; /* drop our netns reference, no auditd sends past this line */ if (net) { put_net(net); net = NULL; } /* we have processed all the queues so wake everyone */ wake_up(&audit_backlog_wait); /* NOTE: we want to wake up if there is anything on the queue, * regardless of if an auditd is connected, as we need to * do the multicast send and rotate records from the * main queue to the retry/hold queues */ wait_event_freezable(kauditd_wait, (skb_queue_len(&audit_queue) ? 1 : 0)); } return 0; } int audit_send_list_thread(void *_dest) { struct audit_netlink_list *dest = _dest; struct sk_buff *skb; struct sock *sk = audit_get_sk(dest->net); /* wait for parent to finish and send an ACK */ audit_ctl_lock(); audit_ctl_unlock(); while ((skb = __skb_dequeue(&dest->q)) != NULL) netlink_unicast(sk, skb, dest->portid, 0); put_net(dest->net); kfree(dest); return 0; } struct sk_buff *audit_make_reply(int seq, int type, int done, int multi, const void *payload, int size) { struct sk_buff *skb; struct nlmsghdr *nlh; void *data; int flags = multi ? NLM_F_MULTI : 0; int t = done ? NLMSG_DONE : type; skb = nlmsg_new(size, GFP_KERNEL); if (!skb) return NULL; nlh = nlmsg_put(skb, 0, seq, t, size, flags); if (!nlh) goto out_kfree_skb; data = nlmsg_data(nlh); memcpy(data, payload, size); return skb; out_kfree_skb: kfree_skb(skb); return NULL; } static void audit_free_reply(struct audit_reply *reply) { if (!reply) return; kfree_skb(reply->skb); if (reply->net) put_net(reply->net); kfree(reply); } static int audit_send_reply_thread(void *arg) { struct audit_reply *reply = (struct audit_reply *)arg; audit_ctl_lock(); audit_ctl_unlock(); /* Ignore failure. It'll only happen if the sender goes away, because our timeout is set to infinite. */ netlink_unicast(audit_get_sk(reply->net), reply->skb, reply->portid, 0); reply->skb = NULL; audit_free_reply(reply); return 0; } /** * audit_send_reply - send an audit reply message via netlink * @request_skb: skb of request we are replying to (used to target the reply) * @seq: sequence number * @type: audit message type * @done: done (last) flag * @multi: multi-part message flag * @payload: payload data * @size: payload size * * Allocates a skb, builds the netlink message, and sends it to the port id. */ static void audit_send_reply(struct sk_buff *request_skb, int seq, int type, int done, int multi, const void *payload, int size) { struct task_struct *tsk; struct audit_reply *reply; reply = kzalloc(sizeof(*reply), GFP_KERNEL); if (!reply) return; reply->skb = audit_make_reply(seq, type, done, multi, payload, size); if (!reply->skb) goto err; reply->net = get_net(sock_net(NETLINK_CB(request_skb).sk)); reply->portid = NETLINK_CB(request_skb).portid; tsk = kthread_run(audit_send_reply_thread, reply, "audit_send_reply"); if (IS_ERR(tsk)) goto err; return; err: audit_free_reply(reply); } /* * Check for appropriate CAP_AUDIT_ capabilities on incoming audit * control messages. */ static int audit_netlink_ok(struct sk_buff *skb, u16 msg_type) { int err = 0; /* Only support initial user namespace for now. */ /* * We return ECONNREFUSED because it tricks userspace into thinking * that audit was not configured into the kernel. Lots of users * configure their PAM stack (because that's what the distro does) * to reject login if unable to send messages to audit. If we return * ECONNREFUSED the PAM stack thinks the kernel does not have audit * configured in and will let login proceed. If we return EPERM * userspace will reject all logins. This should be removed when we * support non init namespaces!! */ if (current_user_ns() != &init_user_ns) return -ECONNREFUSED; switch (msg_type) { case AUDIT_LIST: case AUDIT_ADD: case AUDIT_DEL: return -EOPNOTSUPP; case AUDIT_GET: case AUDIT_SET: case AUDIT_GET_FEATURE: case AUDIT_SET_FEATURE: case AUDIT_LIST_RULES: case AUDIT_ADD_RULE: case AUDIT_DEL_RULE: case AUDIT_SIGNAL_INFO: case AUDIT_TTY_GET: case AUDIT_TTY_SET: case AUDIT_TRIM: case AUDIT_MAKE_EQUIV: /* Only support auditd and auditctl in initial pid namespace * for now. */ if (task_active_pid_ns(current) != &init_pid_ns) return -EPERM; if (!netlink_capable(skb, CAP_AUDIT_CONTROL)) err = -EPERM; break; case AUDIT_USER: case AUDIT_FIRST_USER_MSG ... AUDIT_LAST_USER_MSG: case AUDIT_FIRST_USER_MSG2 ... AUDIT_LAST_USER_MSG2: if (!netlink_capable(skb, CAP_AUDIT_WRITE)) err = -EPERM; break; default: /* bad msg */ err = -EINVAL; } return err; } static void audit_log_common_recv_msg(struct audit_context *context, struct audit_buffer **ab, u16 msg_type) { uid_t uid = from_kuid(&init_user_ns, current_uid()); pid_t pid = task_tgid_nr(current); if (!audit_enabled && msg_type != AUDIT_USER_AVC) { *ab = NULL; return; } *ab = audit_log_start(context, GFP_KERNEL, msg_type); if (unlikely(!*ab)) return; audit_log_format(*ab, "pid=%d uid=%u ", pid, uid); audit_log_session_info(*ab); audit_log_task_context(*ab); } static inline void audit_log_user_recv_msg(struct audit_buffer **ab, u16 msg_type) { audit_log_common_recv_msg(NULL, ab, msg_type); } static int is_audit_feature_set(int i) { return af.features & AUDIT_FEATURE_TO_MASK(i); } static int audit_get_feature(struct sk_buff *skb) { u32 seq; seq = nlmsg_hdr(skb)->nlmsg_seq; audit_send_reply(skb, seq, AUDIT_GET_FEATURE, 0, 0, &af, sizeof(af)); return 0; } static void audit_log_feature_change(int which, u32 old_feature, u32 new_feature, u32 old_lock, u32 new_lock, int res) { struct audit_buffer *ab; if (audit_enabled == AUDIT_OFF) return; ab = audit_log_start(audit_context(), GFP_KERNEL, AUDIT_FEATURE_CHANGE); if (!ab) return; audit_log_task_info(ab); audit_log_format(ab, " feature=%s old=%u new=%u old_lock=%u new_lock=%u res=%d", audit_feature_names[which], !!old_feature, !!new_feature, !!old_lock, !!new_lock, res); audit_log_end(ab); } static int audit_set_feature(struct audit_features *uaf) { int i; BUILD_BUG_ON(AUDIT_LAST_FEATURE + 1 > ARRAY_SIZE(audit_feature_names)); /* if there is ever a version 2 we should handle that here */ for (i = 0; i <= AUDIT_LAST_FEATURE; i++) { u32 feature = AUDIT_FEATURE_TO_MASK(i); u32 old_feature, new_feature, old_lock, new_lock; /* if we are not changing this feature, move along */ if (!(feature & uaf->mask)) continue; old_feature = af.features & feature; new_feature = uaf->features & feature; new_lock = (uaf->lock | af.lock) & feature; old_lock = af.lock & feature; /* are we changing a locked feature? */ if (old_lock && (new_feature != old_feature)) { audit_log_feature_change(i, old_feature, new_feature, old_lock, new_lock, 0); return -EPERM; } } /* nothing invalid, do the changes */ for (i = 0; i <= AUDIT_LAST_FEATURE; i++) { u32 feature = AUDIT_FEATURE_TO_MASK(i); u32 old_feature, new_feature, old_lock, new_lock; /* if we are not changing this feature, move along */ if (!(feature & uaf->mask)) continue; old_feature = af.features & feature; new_feature = uaf->features & feature; old_lock = af.lock & feature; new_lock = (uaf->lock | af.lock) & feature; if (new_feature != old_feature) audit_log_feature_change(i, old_feature, new_feature, old_lock, new_lock, 1); if (new_feature) af.features |= feature; else af.features &= ~feature; af.lock |= new_lock; } return 0; } static int audit_replace(struct pid *pid) { pid_t pvnr; struct sk_buff *skb; pvnr = pid_vnr(pid); skb = audit_make_reply(0, AUDIT_REPLACE, 0, 0, &pvnr, sizeof(pvnr)); if (!skb) return -ENOMEM; return auditd_send_unicast_skb(skb); } static int audit_receive_msg(struct sk_buff *skb, struct nlmsghdr *nlh) { u32 seq; void *data; int data_len; int err; struct audit_buffer *ab; u16 msg_type = nlh->nlmsg_type; struct audit_sig_info *sig_data; char *ctx = NULL; u32 len; err = audit_netlink_ok(skb, msg_type); if (err) return err; seq = nlh->nlmsg_seq; data = nlmsg_data(nlh); data_len = nlmsg_len(nlh); switch (msg_type) { case AUDIT_GET: { struct audit_status s; memset(&s, 0, sizeof(s)); s.enabled = audit_enabled; s.failure = audit_failure; /* NOTE: use pid_vnr() so the PID is relative to the current * namespace */ s.pid = auditd_pid_vnr(); s.rate_limit = audit_rate_limit; s.backlog_limit = audit_backlog_limit; s.lost = atomic_read(&audit_lost); s.backlog = skb_queue_len(&audit_queue); s.feature_bitmap = AUDIT_FEATURE_BITMAP_ALL; s.backlog_wait_time = audit_backlog_wait_time; s.backlog_wait_time_actual = atomic_read(&audit_backlog_wait_time_actual); audit_send_reply(skb, seq, AUDIT_GET, 0, 0, &s, sizeof(s)); break; } case AUDIT_SET: { struct audit_status s; memset(&s, 0, sizeof(s)); /* guard against past and future API changes */ memcpy(&s, data, min_t(size_t, sizeof(s), data_len)); if (s.mask & AUDIT_STATUS_ENABLED) { err = audit_set_enabled(s.enabled); if (err < 0) return err; } if (s.mask & AUDIT_STATUS_FAILURE) { err = audit_set_failure(s.failure); if (err < 0) return err; } if (s.mask & AUDIT_STATUS_PID) { /* NOTE: we are using the vnr PID functions below * because the s.pid value is relative to the * namespace of the caller; at present this * doesn't matter much since you can really only * run auditd from the initial pid namespace, but * something to keep in mind if this changes */ pid_t new_pid = s.pid; pid_t auditd_pid; struct pid *req_pid = task_tgid(current); /* Sanity check - PID values must match. Setting * pid to 0 is how auditd ends auditing. */ if (new_pid && (new_pid != pid_vnr(req_pid))) return -EINVAL; /* test the auditd connection */ audit_replace(req_pid); auditd_pid = auditd_pid_vnr(); if (auditd_pid) { /* replacing a healthy auditd is not allowed */ if (new_pid) { audit_log_config_change("audit_pid", new_pid, auditd_pid, 0); return -EEXIST; } /* only current auditd can unregister itself */ if (pid_vnr(req_pid) != auditd_pid) { audit_log_config_change("audit_pid", new_pid, auditd_pid, 0); return -EACCES; } } if (new_pid) { /* register a new auditd connection */ err = auditd_set(req_pid, NETLINK_CB(skb).portid, sock_net(NETLINK_CB(skb).sk)); if (audit_enabled != AUDIT_OFF) audit_log_config_change("audit_pid", new_pid, auditd_pid, err ? 0 : 1); if (err) return err; /* try to process any backlog */ wake_up_interruptible(&kauditd_wait); } else { if (audit_enabled != AUDIT_OFF) audit_log_config_change("audit_pid", new_pid, auditd_pid, 1); /* unregister the auditd connection */ auditd_reset(NULL); } } if (s.mask & AUDIT_STATUS_RATE_LIMIT) { err = audit_set_rate_limit(s.rate_limit); if (err < 0) return err; } if (s.mask & AUDIT_STATUS_BACKLOG_LIMIT) { err = audit_set_backlog_limit(s.backlog_limit); if (err < 0) return err; } if (s.mask & AUDIT_STATUS_BACKLOG_WAIT_TIME) { if (sizeof(s) > (size_t)nlh->nlmsg_len) return -EINVAL; if (s.backlog_wait_time > 10*AUDIT_BACKLOG_WAIT_TIME) return -EINVAL; err = audit_set_backlog_wait_time(s.backlog_wait_time); if (err < 0) return err; } if (s.mask == AUDIT_STATUS_LOST) { u32 lost = atomic_xchg(&audit_lost, 0); audit_log_config_change("lost", 0, lost, 1); return lost; } if (s.mask == AUDIT_STATUS_BACKLOG_WAIT_TIME_ACTUAL) { u32 actual = atomic_xchg(&audit_backlog_wait_time_actual, 0); audit_log_config_change("backlog_wait_time_actual", 0, actual, 1); return actual; } break; } case AUDIT_GET_FEATURE: err = audit_get_feature(skb); if (err) return err; break; case AUDIT_SET_FEATURE: if (data_len < sizeof(struct audit_features)) return -EINVAL; err = audit_set_feature(data); if (err) return err; break; case AUDIT_USER: case AUDIT_FIRST_USER_MSG ... AUDIT_LAST_USER_MSG: case AUDIT_FIRST_USER_MSG2 ... AUDIT_LAST_USER_MSG2: if (!audit_enabled && msg_type != AUDIT_USER_AVC) return 0; /* exit early if there isn't at least one character to print */ if (data_len < 2) return -EINVAL; err = audit_filter(msg_type, AUDIT_FILTER_USER); if (err == 1) { /* match or error */ char *str = data; err = 0; if (msg_type == AUDIT_USER_TTY) { err = tty_audit_push(); if (err) break; } audit_log_user_recv_msg(&ab, msg_type); if (msg_type != AUDIT_USER_TTY) { /* ensure NULL termination */ str[data_len - 1] = '\0'; audit_log_format(ab, " msg='%.*s'", AUDIT_MESSAGE_TEXT_MAX, str); } else { audit_log_format(ab, " data="); if (str[data_len - 1] == '\0') data_len--; audit_log_n_untrustedstring(ab, str, data_len); } audit_log_end(ab); } break; case AUDIT_ADD_RULE: case AUDIT_DEL_RULE: if (data_len < sizeof(struct audit_rule_data)) return -EINVAL; if (audit_enabled == AUDIT_LOCKED) { audit_log_common_recv_msg(audit_context(), &ab, AUDIT_CONFIG_CHANGE); audit_log_format(ab, " op=%s audit_enabled=%d res=0", msg_type == AUDIT_ADD_RULE ? "add_rule" : "remove_rule", audit_enabled); audit_log_end(ab); return -EPERM; } err = audit_rule_change(msg_type, seq, data, data_len); break; case AUDIT_LIST_RULES: err = audit_list_rules_send(skb, seq); break; case AUDIT_TRIM: audit_trim_trees(); audit_log_common_recv_msg(audit_context(), &ab, AUDIT_CONFIG_CHANGE); audit_log_format(ab, " op=trim res=1"); audit_log_end(ab); break; case AUDIT_MAKE_EQUIV: { void *bufp = data; u32 sizes[2]; size_t msglen = data_len; char *old, *new; err = -EINVAL; if (msglen < 2 * sizeof(u32)) break; memcpy(sizes, bufp, 2 * sizeof(u32)); bufp += 2 * sizeof(u32); msglen -= 2 * sizeof(u32); old = audit_unpack_string(&bufp, &msglen, sizes[0]); if (IS_ERR(old)) { err = PTR_ERR(old); break; } new = audit_unpack_string(&bufp, &msglen, sizes[1]); if (IS_ERR(new)) { err = PTR_ERR(new); kfree(old); break; } /* OK, here comes... */ err = audit_tag_tree(old, new); audit_log_common_recv_msg(audit_context(), &ab, AUDIT_CONFIG_CHANGE); audit_log_format(ab, " op=make_equiv old="); audit_log_untrustedstring(ab, old); audit_log_format(ab, " new="); audit_log_untrustedstring(ab, new); audit_log_format(ab, " res=%d", !err); audit_log_end(ab); kfree(old); kfree(new); break; } case AUDIT_SIGNAL_INFO: len = 0; if (audit_sig_sid) { err = security_secid_to_secctx(audit_sig_sid, &ctx, &len); if (err) return err; } sig_data = kmalloc(struct_size(sig_data, ctx, len), GFP_KERNEL); if (!sig_data) { if (audit_sig_sid) security_release_secctx(ctx, len); return -ENOMEM; } sig_data->uid = from_kuid(&init_user_ns, audit_sig_uid); sig_data->pid = audit_sig_pid; if (audit_sig_sid) { memcpy(sig_data->ctx, ctx, len); security_release_secctx(ctx, len); } audit_send_reply(skb, seq, AUDIT_SIGNAL_INFO, 0, 0, sig_data, struct_size(sig_data, ctx, len)); kfree(sig_data); break; case AUDIT_TTY_GET: { struct audit_tty_status s; unsigned int t; t = READ_ONCE(current->signal->audit_tty); s.enabled = t & AUDIT_TTY_ENABLE; s.log_passwd = !!(t & AUDIT_TTY_LOG_PASSWD); audit_send_reply(skb, seq, AUDIT_TTY_GET, 0, 0, &s, sizeof(s)); break; } case AUDIT_TTY_SET: { struct audit_tty_status s, old; struct audit_buffer *ab; unsigned int t; memset(&s, 0, sizeof(s)); /* guard against past and future API changes */ memcpy(&s, data, min_t(size_t, sizeof(s), data_len)); /* check if new data is valid */ if ((s.enabled != 0 && s.enabled != 1) || (s.log_passwd != 0 && s.log_passwd != 1)) err = -EINVAL; if (err) t = READ_ONCE(current->signal->audit_tty); else { t = s.enabled | (-s.log_passwd & AUDIT_TTY_LOG_PASSWD); t = xchg(¤t->signal->audit_tty, t); } old.enabled = t & AUDIT_TTY_ENABLE; old.log_passwd = !!(t & AUDIT_TTY_LOG_PASSWD); audit_log_common_recv_msg(audit_context(), &ab, AUDIT_CONFIG_CHANGE); audit_log_format(ab, " op=tty_set old-enabled=%d new-enabled=%d" " old-log_passwd=%d new-log_passwd=%d res=%d", old.enabled, s.enabled, old.log_passwd, s.log_passwd, !err); audit_log_end(ab); break; } default: err = -EINVAL; break; } return err < 0 ? err : 0; } /** * audit_receive - receive messages from a netlink control socket * @skb: the message buffer * * Parse the provided skb and deal with any messages that may be present, * malformed skbs are discarded. */ static void audit_receive(struct sk_buff *skb) { struct nlmsghdr *nlh; /* * len MUST be signed for nlmsg_next to be able to dec it below 0 * if the nlmsg_len was not aligned */ int len; int err; nlh = nlmsg_hdr(skb); len = skb->len; audit_ctl_lock(); while (nlmsg_ok(nlh, len)) { err = audit_receive_msg(skb, nlh); /* if err or if this message says it wants a response */ if (err || (nlh->nlmsg_flags & NLM_F_ACK)) netlink_ack(skb, nlh, err, NULL); nlh = nlmsg_next(nlh, &len); } audit_ctl_unlock(); /* can't block with the ctrl lock, so penalize the sender now */ if (audit_backlog_limit && (skb_queue_len(&audit_queue) > audit_backlog_limit)) { DECLARE_WAITQUEUE(wait, current); /* wake kauditd to try and flush the queue */ wake_up_interruptible(&kauditd_wait); add_wait_queue_exclusive(&audit_backlog_wait, &wait); set_current_state(TASK_UNINTERRUPTIBLE); schedule_timeout(audit_backlog_wait_time); remove_wait_queue(&audit_backlog_wait, &wait); } } /* Log information about who is connecting to the audit multicast socket */ static void audit_log_multicast(int group, const char *op, int err) { const struct cred *cred; struct tty_struct *tty; char comm[sizeof(current->comm)]; struct audit_buffer *ab; if (!audit_enabled) return; ab = audit_log_start(audit_context(), GFP_KERNEL, AUDIT_EVENT_LISTENER); if (!ab) return; cred = current_cred(); tty = audit_get_tty(); audit_log_format(ab, "pid=%u uid=%u auid=%u tty=%s ses=%u", task_pid_nr(current), from_kuid(&init_user_ns, cred->uid), from_kuid(&init_user_ns, audit_get_loginuid(current)), tty ? tty_name(tty) : "(none)", audit_get_sessionid(current)); audit_put_tty(tty); audit_log_task_context(ab); /* subj= */ audit_log_format(ab, " comm="); audit_log_untrustedstring(ab, get_task_comm(comm, current)); audit_log_d_path_exe(ab, current->mm); /* exe= */ audit_log_format(ab, " nl-mcgrp=%d op=%s res=%d", group, op, !err); audit_log_end(ab); } /* Run custom bind function on netlink socket group connect or bind requests. */ static int audit_multicast_bind(struct net *net, int group) { int err = 0; if (!capable(CAP_AUDIT_READ)) err = -EPERM; audit_log_multicast(group, "connect", err); return err; } static void audit_multicast_unbind(struct net *net, int group) { audit_log_multicast(group, "disconnect", 0); } static int __net_init audit_net_init(struct net *net) { struct netlink_kernel_cfg cfg = { .input = audit_receive, .bind = audit_multicast_bind, .unbind = audit_multicast_unbind, .flags = NL_CFG_F_NONROOT_RECV, .groups = AUDIT_NLGRP_MAX, }; struct audit_net *aunet = net_generic(net, audit_net_id); aunet->sk = netlink_kernel_create(net, NETLINK_AUDIT, &cfg); if (aunet->sk == NULL) { audit_panic("cannot initialize netlink socket in namespace"); return -ENOMEM; } /* limit the timeout in case auditd is blocked/stopped */ aunet->sk->sk_sndtimeo = HZ / 10; return 0; } static void __net_exit audit_net_exit(struct net *net) { struct audit_net *aunet = net_generic(net, audit_net_id); /* NOTE: you would think that we would want to check the auditd * connection and potentially reset it here if it lives in this * namespace, but since the auditd connection tracking struct holds a * reference to this namespace (see auditd_set()) we are only ever * going to get here after that connection has been released */ netlink_kernel_release(aunet->sk); } static struct pernet_operations audit_net_ops __net_initdata = { .init = audit_net_init, .exit = audit_net_exit, .id = &audit_net_id, .size = sizeof(struct audit_net), }; /* Initialize audit support at boot time. */ static int __init audit_init(void) { int i; if (audit_initialized == AUDIT_DISABLED) return 0; audit_buffer_cache = kmem_cache_create("audit_buffer", sizeof(struct audit_buffer), 0, SLAB_PANIC, NULL); skb_queue_head_init(&audit_queue); skb_queue_head_init(&audit_retry_queue); skb_queue_head_init(&audit_hold_queue); for (i = 0; i < AUDIT_INODE_BUCKETS; i++) INIT_LIST_HEAD(&audit_inode_hash[i]); mutex_init(&audit_cmd_mutex.lock); audit_cmd_mutex.owner = NULL; pr_info("initializing netlink subsys (%s)\n", audit_default ? "enabled" : "disabled"); register_pernet_subsys(&audit_net_ops); audit_initialized = AUDIT_INITIALIZED; kauditd_task = kthread_run(kauditd_thread, NULL, "kauditd"); if (IS_ERR(kauditd_task)) { int err = PTR_ERR(kauditd_task); panic("audit: failed to start the kauditd thread (%d)\n", err); } audit_log(NULL, GFP_KERNEL, AUDIT_KERNEL, "state=initialized audit_enabled=%u res=1", audit_enabled); return 0; } postcore_initcall(audit_init); /* * Process kernel command-line parameter at boot time. * audit={0|off} or audit={1|on}. */ static int __init audit_enable(char *str) { if (!strcasecmp(str, "off") || !strcmp(str, "0")) audit_default = AUDIT_OFF; else if (!strcasecmp(str, "on") || !strcmp(str, "1")) audit_default = AUDIT_ON; else { pr_err("audit: invalid 'audit' parameter value (%s)\n", str); audit_default = AUDIT_ON; } if (audit_default == AUDIT_OFF) audit_initialized = AUDIT_DISABLED; if (audit_set_enabled(audit_default)) pr_err("audit: error setting audit state (%d)\n", audit_default); pr_info("%s\n", audit_default ? "enabled (after initialization)" : "disabled (until reboot)"); return 1; } __setup("audit=", audit_enable); /* Process kernel command-line parameter at boot time. * audit_backlog_limit=<n> */ static int __init audit_backlog_limit_set(char *str) { u32 audit_backlog_limit_arg; pr_info("audit_backlog_limit: "); if (kstrtouint(str, 0, &audit_backlog_limit_arg)) { pr_cont("using default of %u, unable to parse %s\n", audit_backlog_limit, str); return 1; } audit_backlog_limit = audit_backlog_limit_arg; pr_cont("%d\n", audit_backlog_limit); return 1; } __setup("audit_backlog_limit=", audit_backlog_limit_set); static void audit_buffer_free(struct audit_buffer *ab) { if (!ab) return; kfree_skb(ab->skb); kmem_cache_free(audit_buffer_cache, ab); } static struct audit_buffer *audit_buffer_alloc(struct audit_context *ctx, gfp_t gfp_mask, int type) { struct audit_buffer *ab; ab = kmem_cache_alloc(audit_buffer_cache, gfp_mask); if (!ab) return NULL; ab->skb = nlmsg_new(AUDIT_BUFSIZ, gfp_mask); if (!ab->skb) goto err; if (!nlmsg_put(ab->skb, 0, 0, type, 0, 0)) goto err; ab->ctx = ctx; ab->gfp_mask = gfp_mask; return ab; err: audit_buffer_free(ab); return NULL; } /** * audit_serial - compute a serial number for the audit record * * Compute a serial number for the audit record. Audit records are * written to user-space as soon as they are generated, so a complete * audit record may be written in several pieces. The timestamp of the * record and this serial number are used by the user-space tools to * determine which pieces belong to the same audit record. The * (timestamp,serial) tuple is unique for each syscall and is live from * syscall entry to syscall exit. * * NOTE: Another possibility is to store the formatted records off the * audit context (for those records that have a context), and emit them * all at syscall exit. However, this could delay the reporting of * significant errors until syscall exit (or never, if the system * halts). */ unsigned int audit_serial(void) { static atomic_t serial = ATOMIC_INIT(0); return atomic_inc_return(&serial); } static inline void audit_get_stamp(struct audit_context *ctx, struct timespec64 *t, unsigned int *serial) { if (!ctx || !auditsc_get_stamp(ctx, t, serial)) { ktime_get_coarse_real_ts64(t); *serial = audit_serial(); } } /** * audit_log_start - obtain an audit buffer * @ctx: audit_context (may be NULL) * @gfp_mask: type of allocation * @type: audit message type * * Returns audit_buffer pointer on success or NULL on error. * * Obtain an audit buffer. This routine does locking to obtain the * audit buffer, but then no locking is required for calls to * audit_log_*format. If the task (ctx) is a task that is currently in a * syscall, then the syscall is marked as auditable and an audit record * will be written at syscall exit. If there is no associated task, then * task context (ctx) should be NULL. */ struct audit_buffer *audit_log_start(struct audit_context *ctx, gfp_t gfp_mask, int type) { struct audit_buffer *ab; struct timespec64 t; unsigned int serial; if (audit_initialized != AUDIT_INITIALIZED) return NULL; if (unlikely(!audit_filter(type, AUDIT_FILTER_EXCLUDE))) return NULL; /* NOTE: don't ever fail/sleep on these two conditions: * 1. auditd generated record - since we need auditd to drain the * queue; also, when we are checking for auditd, compare PIDs using * task_tgid_vnr() since auditd_pid is set in audit_receive_msg() * using a PID anchored in the caller's namespace * 2. generator holding the audit_cmd_mutex - we don't want to block * while holding the mutex, although we do penalize the sender * later in audit_receive() when it is safe to block */ if (!(auditd_test_task(current) || audit_ctl_owner_current())) { long stime = audit_backlog_wait_time; while (audit_backlog_limit && (skb_queue_len(&audit_queue) > audit_backlog_limit)) { /* wake kauditd to try and flush the queue */ wake_up_interruptible(&kauditd_wait); /* sleep if we are allowed and we haven't exhausted our * backlog wait limit */ if (gfpflags_allow_blocking(gfp_mask) && (stime > 0)) { long rtime = stime; DECLARE_WAITQUEUE(wait, current); add_wait_queue_exclusive(&audit_backlog_wait, &wait); set_current_state(TASK_UNINTERRUPTIBLE); stime = schedule_timeout(rtime); atomic_add(rtime - stime, &audit_backlog_wait_time_actual); remove_wait_queue(&audit_backlog_wait, &wait); } else { if (audit_rate_check() && printk_ratelimit()) pr_warn("audit_backlog=%d > audit_backlog_limit=%d\n", skb_queue_len(&audit_queue), audit_backlog_limit); audit_log_lost("backlog limit exceeded"); return NULL; } } } ab = audit_buffer_alloc(ctx, gfp_mask, type); if (!ab) { audit_log_lost("out of memory in audit_log_start"); return NULL; } audit_get_stamp(ab->ctx, &t, &serial); /* cancel dummy context to enable supporting records */ if (ctx) ctx->dummy = 0; audit_log_format(ab, "audit(%llu.%03lu:%u): ", (unsigned long long)t.tv_sec, t.tv_nsec/1000000, serial); return ab; } /** * audit_expand - expand skb in the audit buffer * @ab: audit_buffer * @extra: space to add at tail of the skb * * Returns 0 (no space) on failed expansion, or available space if * successful. */ static inline int audit_expand(struct audit_buffer *ab, int extra) { struct sk_buff *skb = ab->skb; int oldtail = skb_tailroom(skb); int ret = pskb_expand_head(skb, 0, extra, ab->gfp_mask); int newtail = skb_tailroom(skb); if (ret < 0) { audit_log_lost("out of memory in audit_expand"); return 0; } skb->truesize += newtail - oldtail; return newtail; } /* * Format an audit message into the audit buffer. If there isn't enough * room in the audit buffer, more room will be allocated and vsnprint * will be called a second time. Currently, we assume that a printk * can't format message larger than 1024 bytes, so we don't either. */ static void audit_log_vformat(struct audit_buffer *ab, const char *fmt, va_list args) { int len, avail; struct sk_buff *skb; va_list args2; if (!ab) return; BUG_ON(!ab->skb); skb = ab->skb; avail = skb_tailroom(skb); if (avail == 0) { avail = audit_expand(ab, AUDIT_BUFSIZ); if (!avail) goto out; } va_copy(args2, args); len = vsnprintf(skb_tail_pointer(skb), avail, fmt, args); if (len >= avail) { /* The printk buffer is 1024 bytes long, so if we get * here and AUDIT_BUFSIZ is at least 1024, then we can * log everything that printk could have logged. */ avail = audit_expand(ab, max_t(unsigned, AUDIT_BUFSIZ, 1+len-avail)); if (!avail) goto out_va_end; len = vsnprintf(skb_tail_pointer(skb), avail, fmt, args2); } if (len > 0) skb_put(skb, len); out_va_end: va_end(args2); out: return; } /** * audit_log_format - format a message into the audit buffer. * @ab: audit_buffer * @fmt: format string * @...: optional parameters matching @fmt string * * All the work is done in audit_log_vformat. */ void audit_log_format(struct audit_buffer *ab, const char *fmt, ...) { va_list args; if (!ab) return; va_start(args, fmt); audit_log_vformat(ab, fmt, args); va_end(args); } /** * audit_log_n_hex - convert a buffer to hex and append it to the audit skb * @ab: the audit_buffer * @buf: buffer to convert to hex * @len: length of @buf to be converted * * No return value; failure to expand is silently ignored. * * This function will take the passed buf and convert it into a string of * ascii hex digits. The new string is placed onto the skb. */ void audit_log_n_hex(struct audit_buffer *ab, const unsigned char *buf, size_t len) { int i, avail, new_len; unsigned char *ptr; struct sk_buff *skb; if (!ab) return; BUG_ON(!ab->skb); skb = ab->skb; avail = skb_tailroom(skb); new_len = len<<1; if (new_len >= avail) { /* Round the buffer request up to the next multiple */ new_len = AUDIT_BUFSIZ*(((new_len-avail)/AUDIT_BUFSIZ) + 1); avail = audit_expand(ab, new_len); if (!avail) return; } ptr = skb_tail_pointer(skb); for (i = 0; i < len; i++) ptr = hex_byte_pack_upper(ptr, buf[i]); *ptr = 0; skb_put(skb, len << 1); /* new string is twice the old string */ } /* * Format a string of no more than slen characters into the audit buffer, * enclosed in quote marks. */ void audit_log_n_string(struct audit_buffer *ab, const char *string, size_t slen) { int avail, new_len; unsigned char *ptr; struct sk_buff *skb; if (!ab) return; BUG_ON(!ab->skb); skb = ab->skb; avail = skb_tailroom(skb); new_len = slen + 3; /* enclosing quotes + null terminator */ if (new_len > avail) { avail = audit_expand(ab, new_len); if (!avail) return; } ptr = skb_tail_pointer(skb); *ptr++ = '"'; memcpy(ptr, string, slen); ptr += slen; *ptr++ = '"'; *ptr = 0; skb_put(skb, slen + 2); /* don't include null terminator */ } /** * audit_string_contains_control - does a string need to be logged in hex * @string: string to be checked * @len: max length of the string to check */ bool audit_string_contains_control(const char *string, size_t len) { const unsigned char *p; for (p = string; p < (const unsigned char *)string + len; p++) { if (*p == '"' || *p < 0x21 || *p > 0x7e) return true; } return false; } /** * audit_log_n_untrustedstring - log a string that may contain random characters * @ab: audit_buffer * @len: length of string (not including trailing null) * @string: string to be logged * * This code will escape a string that is passed to it if the string * contains a control character, unprintable character, double quote mark, * or a space. Unescaped strings will start and end with a double quote mark. * Strings that are escaped are printed in hex (2 digits per char). * * The caller specifies the number of characters in the string to log, which may * or may not be the entire string. */ void audit_log_n_untrustedstring(struct audit_buffer *ab, const char *string, size_t len) { if (audit_string_contains_control(string, len)) audit_log_n_hex(ab, string, len); else audit_log_n_string(ab, string, len); } /** * audit_log_untrustedstring - log a string that may contain random characters * @ab: audit_buffer * @string: string to be logged * * Same as audit_log_n_untrustedstring(), except that strlen is used to * determine string length. */ void audit_log_untrustedstring(struct audit_buffer *ab, const char *string) { audit_log_n_untrustedstring(ab, string, strlen(string)); } /* This is a helper-function to print the escaped d_path */ void audit_log_d_path(struct audit_buffer *ab, const char *prefix, const struct path *path) { char *p, *pathname; if (prefix) audit_log_format(ab, "%s", prefix); /* We will allow 11 spaces for ' (deleted)' to be appended */ pathname = kmalloc(PATH_MAX+11, ab->gfp_mask); if (!pathname) { audit_log_format(ab, "\"<no_memory>\""); return; } p = d_path(path, pathname, PATH_MAX+11); if (IS_ERR(p)) { /* Should never happen since we send PATH_MAX */ /* FIXME: can we save some information here? */ audit_log_format(ab, "\"<too_long>\""); } else audit_log_untrustedstring(ab, p); kfree(pathname); } void audit_log_session_info(struct audit_buffer *ab) { unsigned int sessionid = audit_get_sessionid(current); uid_t auid = from_kuid(&init_user_ns, audit_get_loginuid(current)); audit_log_format(ab, "auid=%u ses=%u", auid, sessionid); } void audit_log_key(struct audit_buffer *ab, char *key) { audit_log_format(ab, " key="); if (key) audit_log_untrustedstring(ab, key); else audit_log_format(ab, "(null)"); } int audit_log_task_context(struct audit_buffer *ab) { char *ctx = NULL; unsigned len; int error; u32 sid; security_current_getsecid_subj(&sid); if (!sid) return 0; error = security_secid_to_secctx(sid, &ctx, &len); if (error) { if (error != -EINVAL) goto error_path; return 0; } audit_log_format(ab, " subj=%s", ctx); security_release_secctx(ctx, len); return 0; error_path: audit_panic("error in audit_log_task_context"); return error; } EXPORT_SYMBOL(audit_log_task_context); void audit_log_d_path_exe(struct audit_buffer *ab, struct mm_struct *mm) { struct file *exe_file; if (!mm) goto out_null; exe_file = get_mm_exe_file(mm); if (!exe_file) goto out_null; audit_log_d_path(ab, " exe=", &exe_file->f_path); fput(exe_file); return; out_null: audit_log_format(ab, " exe=(null)"); } struct tty_struct *audit_get_tty(void) { struct tty_struct *tty = NULL; unsigned long flags; spin_lock_irqsave(¤t->sighand->siglock, flags); if (current->signal) tty = tty_kref_get(current->signal->tty); spin_unlock_irqrestore(¤t->sighand->siglock, flags); return tty; } void audit_put_tty(struct tty_struct *tty) { tty_kref_put(tty); } void audit_log_task_info(struct audit_buffer *ab) { const struct cred *cred; char comm[sizeof(current->comm)]; struct tty_struct *tty; if (!ab) return; cred = current_cred(); tty = audit_get_tty(); audit_log_format(ab, " ppid=%d pid=%d auid=%u uid=%u gid=%u" " euid=%u suid=%u fsuid=%u" " egid=%u sgid=%u fsgid=%u tty=%s ses=%u", task_ppid_nr(current), task_tgid_nr(current), from_kuid(&init_user_ns, audit_get_loginuid(current)), from_kuid(&init_user_ns, cred->uid), from_kgid(&init_user_ns, cred->gid), from_kuid(&init_user_ns, cred->euid), from_kuid(&init_user_ns, cred->suid), from_kuid(&init_user_ns, cred->fsuid), from_kgid(&init_user_ns, cred->egid), from_kgid(&init_user_ns, cred->sgid), from_kgid(&init_user_ns, cred->fsgid), tty ? tty_name(tty) : "(none)", audit_get_sessionid(current)); audit_put_tty(tty); audit_log_format(ab, " comm="); audit_log_untrustedstring(ab, get_task_comm(comm, current)); audit_log_d_path_exe(ab, current->mm); audit_log_task_context(ab); } EXPORT_SYMBOL(audit_log_task_info); /** * audit_log_path_denied - report a path restriction denial * @type: audit message type (AUDIT_ANOM_LINK, AUDIT_ANOM_CREAT, etc) * @operation: specific operation name */ void audit_log_path_denied(int type, const char *operation) { struct audit_buffer *ab; if (!audit_enabled || audit_dummy_context()) return; /* Generate log with subject, operation, outcome. */ ab = audit_log_start(audit_context(), GFP_KERNEL, type); if (!ab) return; audit_log_format(ab, "op=%s", operation); audit_log_task_info(ab); audit_log_format(ab, " res=0"); audit_log_end(ab); } /* global counter which is incremented every time something logs in */ static atomic_t session_id = ATOMIC_INIT(0); static int audit_set_loginuid_perm(kuid_t loginuid) { /* if we are unset, we don't need privs */ if (!audit_loginuid_set(current)) return 0; /* if AUDIT_FEATURE_LOGINUID_IMMUTABLE means never ever allow a change*/ if (is_audit_feature_set(AUDIT_FEATURE_LOGINUID_IMMUTABLE)) return -EPERM; /* it is set, you need permission */ if (!capable(CAP_AUDIT_CONTROL)) return -EPERM; /* reject if this is not an unset and we don't allow that */ if (is_audit_feature_set(AUDIT_FEATURE_ONLY_UNSET_LOGINUID) && uid_valid(loginuid)) return -EPERM; return 0; } static void audit_log_set_loginuid(kuid_t koldloginuid, kuid_t kloginuid, unsigned int oldsessionid, unsigned int sessionid, int rc) { struct audit_buffer *ab; uid_t uid, oldloginuid, loginuid; struct tty_struct *tty; if (!audit_enabled) return; ab = audit_log_start(audit_context(), GFP_KERNEL, AUDIT_LOGIN); if (!ab) return; uid = from_kuid(&init_user_ns, task_uid(current)); oldloginuid = from_kuid(&init_user_ns, koldloginuid); loginuid = from_kuid(&init_user_ns, kloginuid); tty = audit_get_tty(); audit_log_format(ab, "pid=%d uid=%u", task_tgid_nr(current), uid); audit_log_task_context(ab); audit_log_format(ab, " old-auid=%u auid=%u tty=%s old-ses=%u ses=%u res=%d", oldloginuid, loginuid, tty ? tty_name(tty) : "(none)", oldsessionid, sessionid, !rc); audit_put_tty(tty); audit_log_end(ab); } /** * audit_set_loginuid - set current task's loginuid * @loginuid: loginuid value * * Returns 0. * * Called (set) from fs/proc/base.c::proc_loginuid_write(). */ int audit_set_loginuid(kuid_t loginuid) { unsigned int oldsessionid, sessionid = AUDIT_SID_UNSET; kuid_t oldloginuid; int rc; oldloginuid = audit_get_loginuid(current); oldsessionid = audit_get_sessionid(current); rc = audit_set_loginuid_perm(loginuid); if (rc) goto out; /* are we setting or clearing? */ if (uid_valid(loginuid)) { sessionid = (unsigned int)atomic_inc_return(&session_id); if (unlikely(sessionid == AUDIT_SID_UNSET)) sessionid = (unsigned int)atomic_inc_return(&session_id); } current->sessionid = sessionid; current->loginuid = loginuid; out: audit_log_set_loginuid(oldloginuid, loginuid, oldsessionid, sessionid, rc); return rc; } /** * audit_signal_info - record signal info for shutting down audit subsystem * @sig: signal value * @t: task being signaled * * If the audit subsystem is being terminated, record the task (pid) * and uid that is doing that. */ int audit_signal_info(int sig, struct task_struct *t) { kuid_t uid = current_uid(), auid; if (auditd_test_task(t) && (sig == SIGTERM || sig == SIGHUP || sig == SIGUSR1 || sig == SIGUSR2)) { audit_sig_pid = task_tgid_nr(current); auid = audit_get_loginuid(current); if (uid_valid(auid)) audit_sig_uid = auid; else audit_sig_uid = uid; security_current_getsecid_subj(&audit_sig_sid); } return audit_signal_info_syscall(t); } /** * audit_log_end - end one audit record * @ab: the audit_buffer * * We can not do a netlink send inside an irq context because it blocks (last * arg, flags, is not set to MSG_DONTWAIT), so the audit buffer is placed on a * queue and a kthread is scheduled to remove them from the queue outside the * irq context. May be called in any context. */ void audit_log_end(struct audit_buffer *ab) { struct sk_buff *skb; struct nlmsghdr *nlh; if (!ab) return; if (audit_rate_check()) { skb = ab->skb; ab->skb = NULL; /* setup the netlink header, see the comments in * kauditd_send_multicast_skb() for length quirks */ nlh = nlmsg_hdr(skb); nlh->nlmsg_len = skb->len - NLMSG_HDRLEN; /* queue the netlink packet and poke the kauditd thread */ skb_queue_tail(&audit_queue, skb); wake_up_interruptible(&kauditd_wait); } else audit_log_lost("rate limit exceeded"); audit_buffer_free(ab); } /** * audit_log - Log an audit record * @ctx: audit context * @gfp_mask: type of allocation * @type: audit message type * @fmt: format string to use * @...: variable parameters matching the format string * * This is a convenience function that calls audit_log_start, * audit_log_vformat, and audit_log_end. It may be called * in any context. */ void audit_log(struct audit_context *ctx, gfp_t gfp_mask, int type, const char *fmt, ...) { struct audit_buffer *ab; va_list args; ab = audit_log_start(ctx, gfp_mask, type); if (ab) { va_start(args, fmt); audit_log_vformat(ab, fmt, args); va_end(args); audit_log_end(ab); } } EXPORT_SYMBOL(audit_log_start); EXPORT_SYMBOL(audit_log_end); EXPORT_SYMBOL(audit_log_format); EXPORT_SYMBOL(audit_log); |
58 52 36 32 31 31 31 13 31 10 31 31 13 13 2 67 8 67 58 8 58 58 54 54 58 58 8 7 7 7 8 40 37 34 4 37 10 5 1 8 6 8 8 84 12 1276 185 28 1275 177 24 1276 219 83 20 7 3 2 33 11 11 9 3 2 11 11 38 15 15 14 13 13 13 13 11 1269 79 78 71 64 32 2 2 1 1 1 1 2 2 2 42 41 17 17 1 1 35 36 | 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60 61 62 63 64 65 66 67 68 69 70 71 72 73 74 75 76 77 78 79 80 81 82 83 84 85 86 87 88 89 90 91 92 93 94 95 96 97 98 99 100 101 102 103 104 105 106 107 108 109 110 111 112 113 114 115 116 117 118 119 120 121 122 123 124 125 126 127 128 129 130 131 132 133 134 135 136 137 138 139 140 141 142 143 144 145 146 147 148 149 150 151 152 153 154 155 156 157 158 159 160 161 162 163 164 165 166 167 168 169 170 171 172 173 174 175 176 177 178 179 180 181 182 183 184 185 186 187 188 189 190 191 192 193 194 195 196 197 198 199 200 201 202 203 204 205 206 207 208 209 210 211 212 213 214 215 216 217 218 219 220 221 222 223 224 225 226 227 228 229 230 231 232 233 234 235 236 237 238 239 240 241 242 243 244 245 246 247 248 249 250 251 252 253 254 255 256 257 258 259 260 261 262 263 264 265 266 267 268 269 270 271 272 273 274 275 276 277 278 279 280 281 282 283 284 285 286 287 288 289 290 291 292 293 294 295 296 297 298 299 300 301 302 303 304 305 306 307 308 309 310 311 312 313 314 315 316 317 318 319 320 321 322 323 324 325 326 327 328 329 330 331 332 333 334 335 336 337 338 339 340 341 342 343 344 345 346 347 348 349 350 351 352 353 354 355 356 357 358 359 360 361 362 363 364 365 366 367 368 369 370 371 372 373 374 375 376 377 378 379 380 381 382 383 384 385 386 387 388 389 390 391 392 393 394 395 396 397 398 399 400 401 402 403 404 405 406 407 408 409 410 411 412 413 414 415 416 417 418 419 420 421 422 423 424 425 426 427 428 429 430 431 432 433 434 435 436 437 438 439 440 441 442 443 444 445 446 447 448 449 450 451 452 453 454 455 456 457 458 459 460 461 462 463 464 465 466 467 468 469 470 471 472 473 474 475 476 477 478 479 480 481 482 483 484 485 486 487 488 489 490 491 492 493 494 495 496 497 498 499 500 501 502 503 504 505 506 507 508 509 510 511 512 513 514 515 516 517 518 519 520 521 522 523 524 525 526 527 528 529 530 531 532 533 534 535 536 537 538 539 540 541 542 543 544 545 546 547 548 549 550 551 552 553 554 555 556 557 558 559 560 561 562 563 564 565 566 567 568 569 570 571 572 573 574 575 576 577 578 579 580 581 582 583 584 585 586 587 588 589 590 591 592 593 594 595 596 597 598 599 600 601 602 603 604 605 606 607 608 609 610 611 612 613 614 615 616 617 618 619 620 621 622 623 624 625 626 627 628 629 630 631 632 633 634 635 636 637 638 639 640 641 642 643 644 645 646 647 648 649 650 651 652 653 654 655 656 657 658 659 660 661 662 663 664 665 666 667 668 669 670 671 672 673 674 675 676 677 678 679 680 681 682 683 684 685 686 687 688 689 690 691 692 693 694 695 696 697 698 699 700 701 702 703 704 705 706 707 708 709 710 711 712 713 714 715 716 717 718 719 720 721 722 723 724 725 726 727 728 729 730 731 732 733 734 735 736 737 738 739 740 741 742 | // SPDX-License-Identifier: GPL-2.0-or-later /* * INET 802.1Q VLAN * Ethernet-type device handling. * * Authors: Ben Greear <greearb@candelatech.com> * Please send support related email to: netdev@vger.kernel.org * VLAN Home Page: http://www.candelatech.com/~greear/vlan.html * * Fixes: * Fix for packet capture - Nick Eggleston <nick@dccinc.com>; * Add HW acceleration hooks - David S. Miller <davem@redhat.com>; * Correct all the locking - David S. Miller <davem@redhat.com>; * Use hash table for VLAN groups - David S. Miller <davem@redhat.com> */ #define pr_fmt(fmt) KBUILD_MODNAME ": " fmt #include <linux/capability.h> #include <linux/module.h> #include <linux/netdevice.h> #include <linux/skbuff.h> #include <linux/slab.h> #include <linux/init.h> #include <linux/rculist.h> #include <net/p8022.h> #include <net/arp.h> #include <linux/rtnetlink.h> #include <linux/notifier.h> #include <net/rtnetlink.h> #include <net/net_namespace.h> #include <net/netns/generic.h> #include <linux/uaccess.h> #include <linux/if_vlan.h> #include "vlan.h" #include "vlanproc.h" #define DRV_VERSION "1.8" /* Global VLAN variables */ unsigned int vlan_net_id __read_mostly; const char vlan_fullname[] = "802.1Q VLAN Support"; const char vlan_version[] = DRV_VERSION; /* End of global variables definitions. */ static int vlan_group_prealloc_vid(struct vlan_group *vg, __be16 vlan_proto, u16 vlan_id) { struct net_device **array; unsigned int vidx; unsigned int size; int pidx; ASSERT_RTNL(); pidx = vlan_proto_idx(vlan_proto); if (pidx < 0) return -EINVAL; vidx = vlan_id / VLAN_GROUP_ARRAY_PART_LEN; array = vg->vlan_devices_arrays[pidx][vidx]; if (array != NULL) return 0; size = sizeof(struct net_device *) * VLAN_GROUP_ARRAY_PART_LEN; array = kzalloc(size, GFP_KERNEL_ACCOUNT); if (array == NULL) return -ENOBUFS; /* paired with smp_rmb() in __vlan_group_get_device() */ smp_wmb(); vg->vlan_devices_arrays[pidx][vidx] = array; return 0; } static void vlan_stacked_transfer_operstate(const struct net_device *rootdev, struct net_device *dev, struct vlan_dev_priv *vlan) { if (!(vlan->flags & VLAN_FLAG_BRIDGE_BINDING)) netif_stacked_transfer_operstate(rootdev, dev); } void unregister_vlan_dev(struct net_device *dev, struct list_head *head) { struct vlan_dev_priv *vlan = vlan_dev_priv(dev); struct net_device *real_dev = vlan->real_dev; struct vlan_info *vlan_info; struct vlan_group *grp; u16 vlan_id = vlan->vlan_id; ASSERT_RTNL(); vlan_info = rtnl_dereference(real_dev->vlan_info); BUG_ON(!vlan_info); grp = &vlan_info->grp; grp->nr_vlan_devs--; if (vlan->flags & VLAN_FLAG_MVRP) vlan_mvrp_request_leave(dev); if (vlan->flags & VLAN_FLAG_GVRP) vlan_gvrp_request_leave(dev); vlan_group_set_device(grp, vlan->vlan_proto, vlan_id, NULL); netdev_upper_dev_unlink(real_dev, dev); /* Because unregister_netdevice_queue() makes sure at least one rcu * grace period is respected before device freeing, * we dont need to call synchronize_net() here. */ unregister_netdevice_queue(dev, head); if (grp->nr_vlan_devs == 0) { vlan_mvrp_uninit_applicant(real_dev); vlan_gvrp_uninit_applicant(real_dev); } vlan_vid_del(real_dev, vlan->vlan_proto, vlan_id); } int vlan_check_real_dev(struct net_device *real_dev, __be16 protocol, u16 vlan_id, struct netlink_ext_ack *extack) { const char *name = real_dev->name; if (real_dev->features & NETIF_F_VLAN_CHALLENGED) { pr_info("VLANs not supported on %s\n", name); NL_SET_ERR_MSG_MOD(extack, "VLANs not supported on device"); return -EOPNOTSUPP; } if (vlan_find_dev(real_dev, protocol, vlan_id) != NULL) { NL_SET_ERR_MSG_MOD(extack, "VLAN device already exists"); return -EEXIST; } return 0; } int register_vlan_dev(struct net_device *dev, struct netlink_ext_ack *extack) { struct vlan_dev_priv *vlan = vlan_dev_priv(dev); struct net_device *real_dev = vlan->real_dev; u16 vlan_id = vlan->vlan_id; struct vlan_info *vlan_info; struct vlan_group *grp; int err; err = vlan_vid_add(real_dev, vlan->vlan_proto, vlan_id); if (err) return err; vlan_info = rtnl_dereference(real_dev->vlan_info); /* vlan_info should be there now. vlan_vid_add took care of it */ BUG_ON(!vlan_info); grp = &vlan_info->grp; if (grp->nr_vlan_devs == 0) { err = vlan_gvrp_init_applicant(real_dev); if (err < 0) goto out_vid_del; err = vlan_mvrp_init_applicant(real_dev); if (err < 0) goto out_uninit_gvrp; } err = vlan_group_prealloc_vid(grp, vlan->vlan_proto, vlan_id); if (err < 0) goto out_uninit_mvrp; err = register_netdevice(dev); if (err < 0) goto out_uninit_mvrp; err = netdev_upper_dev_link(real_dev, dev, extack); if (err) goto out_unregister_netdev; vlan_stacked_transfer_operstate(real_dev, dev, vlan); linkwatch_fire_event(dev); /* _MUST_ call rfc2863_policy() */ /* So, got the sucker initialized, now lets place * it into our local structure. */ vlan_group_set_device(grp, vlan->vlan_proto, vlan_id, dev); grp->nr_vlan_devs++; return 0; out_unregister_netdev: unregister_netdevice(dev); out_uninit_mvrp: if (grp->nr_vlan_devs == 0) vlan_mvrp_uninit_applicant(real_dev); out_uninit_gvrp: if (grp->nr_vlan_devs == 0) vlan_gvrp_uninit_applicant(real_dev); out_vid_del: vlan_vid_del(real_dev, vlan->vlan_proto, vlan_id); return err; } /* Attach a VLAN device to a mac address (ie Ethernet Card). * Returns 0 if the device was created or a negative error code otherwise. */ static int register_vlan_device(struct net_device *real_dev, u16 vlan_id) { struct net_device *new_dev; struct vlan_dev_priv *vlan; struct net *net = dev_net(real_dev); struct vlan_net *vn = net_generic(net, vlan_net_id); char name[IFNAMSIZ]; int err; if (vlan_id >= VLAN_VID_MASK) return -ERANGE; err = vlan_check_real_dev(real_dev, htons(ETH_P_8021Q), vlan_id, NULL); if (err < 0) return err; /* Gotta set up the fields for the device. */ switch (vn->name_type) { case VLAN_NAME_TYPE_RAW_PLUS_VID: /* name will look like: eth1.0005 */ snprintf(name, IFNAMSIZ, "%s.%.4i", real_dev->name, vlan_id); break; case VLAN_NAME_TYPE_PLUS_VID_NO_PAD: /* Put our vlan.VID in the name. * Name will look like: vlan5 */ snprintf(name, IFNAMSIZ, "vlan%i", vlan_id); break; case VLAN_NAME_TYPE_RAW_PLUS_VID_NO_PAD: /* Put our vlan.VID in the name. * Name will look like: eth0.5 */ snprintf(name, IFNAMSIZ, "%s.%i", real_dev->name, vlan_id); break; case VLAN_NAME_TYPE_PLUS_VID: /* Put our vlan.VID in the name. * Name will look like: vlan0005 */ default: snprintf(name, IFNAMSIZ, "vlan%.4i", vlan_id); } new_dev = alloc_netdev(sizeof(struct vlan_dev_priv), name, NET_NAME_UNKNOWN, vlan_setup); if (new_dev == NULL) return -ENOBUFS; dev_net_set(new_dev, net); /* need 4 bytes for extra VLAN header info, * hope the underlying device can handle it. */ new_dev->mtu = real_dev->mtu; vlan = vlan_dev_priv(new_dev); vlan->vlan_proto = htons(ETH_P_8021Q); vlan->vlan_id = vlan_id; vlan->real_dev = real_dev; vlan->dent = NULL; vlan->flags = VLAN_FLAG_REORDER_HDR; new_dev->rtnl_link_ops = &vlan_link_ops; err = register_vlan_dev(new_dev, NULL); if (err < 0) goto out_free_newdev; return 0; out_free_newdev: free_netdev(new_dev); return err; } static void vlan_sync_address(struct net_device *dev, struct net_device *vlandev) { struct vlan_dev_priv *vlan = vlan_dev_priv(vlandev); /* May be called without an actual change */ if (ether_addr_equal(vlan->real_dev_addr, dev->dev_addr)) return; /* vlan continues to inherit address of lower device */ if (vlan_dev_inherit_address(vlandev, dev)) goto out; /* vlan address was different from the old address and is equal to * the new address */ if (!ether_addr_equal(vlandev->dev_addr, vlan->real_dev_addr) && ether_addr_equal(vlandev->dev_addr, dev->dev_addr)) dev_uc_del(dev, vlandev->dev_addr); /* vlan address was equal to the old address and is different from * the new address */ if (ether_addr_equal(vlandev->dev_addr, vlan->real_dev_addr) && !ether_addr_equal(vlandev->dev_addr, dev->dev_addr)) dev_uc_add(dev, vlandev->dev_addr); out: ether_addr_copy(vlan->real_dev_addr, dev->dev_addr); } static void vlan_transfer_features(struct net_device *dev, struct net_device *vlandev) { struct vlan_dev_priv *vlan = vlan_dev_priv(vlandev); netif_inherit_tso_max(vlandev, dev); if (vlan_hw_offload_capable(dev->features, vlan->vlan_proto)) vlandev->hard_header_len = dev->hard_header_len; else vlandev->hard_header_len = dev->hard_header_len + VLAN_HLEN; #if IS_ENABLED(CONFIG_FCOE) vlandev->fcoe_ddp_xid = dev->fcoe_ddp_xid; #endif vlandev->priv_flags &= ~IFF_XMIT_DST_RELEASE; vlandev->priv_flags |= (vlan->real_dev->priv_flags & IFF_XMIT_DST_RELEASE); vlandev->hw_enc_features = vlan_tnl_features(vlan->real_dev); netdev_update_features(vlandev); } static int __vlan_device_event(struct net_device *dev, unsigned long event) { int err = 0; switch (event) { case NETDEV_CHANGENAME: vlan_proc_rem_dev(dev); err = vlan_proc_add_dev(dev); break; case NETDEV_REGISTER: err = vlan_proc_add_dev(dev); break; case NETDEV_UNREGISTER: vlan_proc_rem_dev(dev); break; } return err; } static int vlan_device_event(struct notifier_block *unused, unsigned long event, void *ptr) { struct netlink_ext_ack *extack = netdev_notifier_info_to_extack(ptr); struct net_device *dev = netdev_notifier_info_to_dev(ptr); struct vlan_group *grp; struct vlan_info *vlan_info; int i, flgs; struct net_device *vlandev; struct vlan_dev_priv *vlan; bool last = false; LIST_HEAD(list); int err; if (is_vlan_dev(dev)) { int err = __vlan_device_event(dev, event); if (err) return notifier_from_errno(err); } if ((event == NETDEV_UP) && (dev->features & NETIF_F_HW_VLAN_CTAG_FILTER)) { pr_info("adding VLAN 0 to HW filter on device %s\n", dev->name); vlan_vid_add(dev, htons(ETH_P_8021Q), 0); } if (event == NETDEV_DOWN && (dev->features & NETIF_F_HW_VLAN_CTAG_FILTER)) vlan_vid_del(dev, htons(ETH_P_8021Q), 0); vlan_info = rtnl_dereference(dev->vlan_info); if (!vlan_info) goto out; grp = &vlan_info->grp; /* It is OK that we do not hold the group lock right now, * as we run under the RTNL lock. */ switch (event) { case NETDEV_CHANGE: /* Propagate real device state to vlan devices */ vlan_group_for_each_dev(grp, i, vlandev) vlan_stacked_transfer_operstate(dev, vlandev, vlan_dev_priv(vlandev)); break; case NETDEV_CHANGEADDR: /* Adjust unicast filters on underlying device */ vlan_group_for_each_dev(grp, i, vlandev) { flgs = vlandev->flags; if (!(flgs & IFF_UP)) continue; vlan_sync_address(dev, vlandev); } break; case NETDEV_CHANGEMTU: vlan_group_for_each_dev(grp, i, vlandev) { if (vlandev->mtu <= dev->mtu) continue; dev_set_mtu(vlandev, dev->mtu); } break; case NETDEV_FEAT_CHANGE: /* Propagate device features to underlying device */ vlan_group_for_each_dev(grp, i, vlandev) vlan_transfer_features(dev, vlandev); break; case NETDEV_DOWN: { struct net_device *tmp; LIST_HEAD(close_list); /* Put all VLANs for this dev in the down state too. */ vlan_group_for_each_dev(grp, i, vlandev) { flgs = vlandev->flags; if (!(flgs & IFF_UP)) continue; vlan = vlan_dev_priv(vlandev); if (!(vlan->flags & VLAN_FLAG_LOOSE_BINDING)) list_add(&vlandev->close_list, &close_list); } dev_close_many(&close_list, false); list_for_each_entry_safe(vlandev, tmp, &close_list, close_list) { vlan_stacked_transfer_operstate(dev, vlandev, vlan_dev_priv(vlandev)); list_del_init(&vlandev->close_list); } list_del(&close_list); break; } case NETDEV_UP: /* Put all VLANs for this dev in the up state too. */ vlan_group_for_each_dev(grp, i, vlandev) { flgs = dev_get_flags(vlandev); if (flgs & IFF_UP) continue; vlan = vlan_dev_priv(vlandev); if (!(vlan->flags & VLAN_FLAG_LOOSE_BINDING)) dev_change_flags(vlandev, flgs | IFF_UP, extack); vlan_stacked_transfer_operstate(dev, vlandev, vlan); } break; case NETDEV_UNREGISTER: /* twiddle thumbs on netns device moves */ if (dev->reg_state != NETREG_UNREGISTERING) break; vlan_group_for_each_dev(grp, i, vlandev) { /* removal of last vid destroys vlan_info, abort * afterwards */ if (vlan_info->nr_vids == 1) last = true; unregister_vlan_dev(vlandev, &list); if (last) break; } unregister_netdevice_many(&list); break; case NETDEV_PRE_TYPE_CHANGE: /* Forbid underlaying device to change its type. */ if (vlan_uses_dev(dev)) return NOTIFY_BAD; break; case NETDEV_NOTIFY_PEERS: case NETDEV_BONDING_FAILOVER: case NETDEV_RESEND_IGMP: /* Propagate to vlan devices */ vlan_group_for_each_dev(grp, i, vlandev) call_netdevice_notifiers(event, vlandev); break; case NETDEV_CVLAN_FILTER_PUSH_INFO: err = vlan_filter_push_vids(vlan_info, htons(ETH_P_8021Q)); if (err) return notifier_from_errno(err); break; case NETDEV_CVLAN_FILTER_DROP_INFO: vlan_filter_drop_vids(vlan_info, htons(ETH_P_8021Q)); break; case NETDEV_SVLAN_FILTER_PUSH_INFO: err = vlan_filter_push_vids(vlan_info, htons(ETH_P_8021AD)); if (err) return notifier_from_errno(err); break; case NETDEV_SVLAN_FILTER_DROP_INFO: vlan_filter_drop_vids(vlan_info, htons(ETH_P_8021AD)); break; } out: return NOTIFY_DONE; } static struct notifier_block vlan_notifier_block __read_mostly = { .notifier_call = vlan_device_event, }; /* * VLAN IOCTL handler. * o execute requested action or pass command to the device driver * arg is really a struct vlan_ioctl_args __user *. */ static int vlan_ioctl_handler(struct net *net, void __user *arg) { int err; struct vlan_ioctl_args args; struct net_device *dev = NULL; if (copy_from_user(&args, arg, sizeof(struct vlan_ioctl_args))) return -EFAULT; /* Null terminate this sucker, just in case. */ args.device1[sizeof(args.device1) - 1] = 0; args.u.device2[sizeof(args.u.device2) - 1] = 0; rtnl_lock(); switch (args.cmd) { case SET_VLAN_INGRESS_PRIORITY_CMD: case SET_VLAN_EGRESS_PRIORITY_CMD: case SET_VLAN_FLAG_CMD: case ADD_VLAN_CMD: case DEL_VLAN_CMD: case GET_VLAN_REALDEV_NAME_CMD: case GET_VLAN_VID_CMD: err = -ENODEV; dev = __dev_get_by_name(net, args.device1); if (!dev) goto out; err = -EINVAL; if (args.cmd != ADD_VLAN_CMD && !is_vlan_dev(dev)) goto out; } switch (args.cmd) { case SET_VLAN_INGRESS_PRIORITY_CMD: err = -EPERM; if (!ns_capable(net->user_ns, CAP_NET_ADMIN)) break; vlan_dev_set_ingress_priority(dev, args.u.skb_priority, args.vlan_qos); err = 0; break; case SET_VLAN_EGRESS_PRIORITY_CMD: err = -EPERM; if (!ns_capable(net->user_ns, CAP_NET_ADMIN)) break; err = vlan_dev_set_egress_priority(dev, args.u.skb_priority, args.vlan_qos); break; case SET_VLAN_FLAG_CMD: err = -EPERM; if (!ns_capable(net->user_ns, CAP_NET_ADMIN)) break; err = vlan_dev_change_flags(dev, args.vlan_qos ? args.u.flag : 0, args.u.flag); break; case SET_VLAN_NAME_TYPE_CMD: err = -EPERM; if (!ns_capable(net->user_ns, CAP_NET_ADMIN)) break; if (args.u.name_type < VLAN_NAME_TYPE_HIGHEST) { struct vlan_net *vn; vn = net_generic(net, vlan_net_id); vn->name_type = args.u.name_type; err = 0; } else { err = -EINVAL; } break; case ADD_VLAN_CMD: err = -EPERM; if (!ns_capable(net->user_ns, CAP_NET_ADMIN)) break; err = register_vlan_device(dev, args.u.VID); break; case DEL_VLAN_CMD: err = -EPERM; if (!ns_capable(net->user_ns, CAP_NET_ADMIN)) break; unregister_vlan_dev(dev, NULL); err = 0; break; case GET_VLAN_REALDEV_NAME_CMD: err = 0; vlan_dev_get_realdev_name(dev, args.u.device2, sizeof(args.u.device2)); if (copy_to_user(arg, &args, sizeof(struct vlan_ioctl_args))) err = -EFAULT; break; case GET_VLAN_VID_CMD: err = 0; args.u.VID = vlan_dev_vlan_id(dev); if (copy_to_user(arg, &args, sizeof(struct vlan_ioctl_args))) err = -EFAULT; break; default: err = -EOPNOTSUPP; break; } out: rtnl_unlock(); return err; } static int __net_init vlan_init_net(struct net *net) { struct vlan_net *vn = net_generic(net, vlan_net_id); int err; vn->name_type = VLAN_NAME_TYPE_RAW_PLUS_VID_NO_PAD; err = vlan_proc_init(net); return err; } static void __net_exit vlan_exit_net(struct net *net) { vlan_proc_cleanup(net); } static struct pernet_operations vlan_net_ops = { .init = vlan_init_net, .exit = vlan_exit_net, .id = &vlan_net_id, .size = sizeof(struct vlan_net), }; static int __init vlan_proto_init(void) { int err; pr_info("%s v%s\n", vlan_fullname, vlan_version); err = register_pernet_subsys(&vlan_net_ops); if (err < 0) goto err0; err = register_netdevice_notifier(&vlan_notifier_block); if (err < 0) goto err2; err = vlan_gvrp_init(); if (err < 0) goto err3; err = vlan_mvrp_init(); if (err < 0) goto err4; err = vlan_netlink_init(); if (err < 0) goto err5; vlan_ioctl_set(vlan_ioctl_handler); return 0; err5: vlan_mvrp_uninit(); err4: vlan_gvrp_uninit(); err3: unregister_netdevice_notifier(&vlan_notifier_block); err2: unregister_pernet_subsys(&vlan_net_ops); err0: return err; } static void __exit vlan_cleanup_module(void) { vlan_ioctl_set(NULL); vlan_netlink_fini(); unregister_netdevice_notifier(&vlan_notifier_block); unregister_pernet_subsys(&vlan_net_ops); rcu_barrier(); /* Wait for completion of call_rcu()'s */ vlan_mvrp_uninit(); vlan_gvrp_uninit(); } module_init(vlan_proto_init); module_exit(vlan_cleanup_module); MODULE_LICENSE("GPL"); MODULE_VERSION(DRV_VERSION); |
525 | 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60 61 62 63 64 65 66 67 68 69 70 71 72 73 74 75 76 77 78 79 80 81 82 83 84 85 86 87 88 89 90 91 92 93 94 95 96 97 98 99 100 101 102 103 104 105 106 107 108 109 110 111 112 113 114 115 116 117 118 119 120 121 122 123 124 125 126 127 128 129 130 131 132 133 134 135 136 137 138 139 140 141 142 143 144 145 146 147 148 149 150 151 152 153 154 155 156 157 158 159 160 161 162 163 164 165 166 167 168 169 170 171 172 173 174 175 176 177 178 179 180 181 182 183 184 185 186 187 188 189 190 191 192 193 194 195 196 197 198 199 200 201 202 203 204 205 206 207 208 209 210 211 212 213 214 215 216 217 218 219 220 221 222 223 224 225 226 227 | /* SPDX-License-Identifier: GPL-2.0 */ /* linux/include/linux/clockchips.h * * This file contains the structure definitions for clockchips. * * If you are not a clockchip, or the time of day code, you should * not be including this file! */ #ifndef _LINUX_CLOCKCHIPS_H #define _LINUX_CLOCKCHIPS_H #ifdef CONFIG_GENERIC_CLOCKEVENTS # include <linux/clocksource.h> # include <linux/cpumask.h> # include <linux/ktime.h> # include <linux/notifier.h> struct clock_event_device; struct module; /* * Possible states of a clock event device. * * DETACHED: Device is not used by clockevents core. Initial state or can be * reached from SHUTDOWN. * SHUTDOWN: Device is powered-off. Can be reached from PERIODIC or ONESHOT. * PERIODIC: Device is programmed to generate events periodically. Can be * reached from DETACHED or SHUTDOWN. * ONESHOT: Device is programmed to generate event only once. Can be reached * from DETACHED or SHUTDOWN. * ONESHOT_STOPPED: Device was programmed in ONESHOT mode and is temporarily * stopped. */ enum clock_event_state { CLOCK_EVT_STATE_DETACHED, CLOCK_EVT_STATE_SHUTDOWN, CLOCK_EVT_STATE_PERIODIC, CLOCK_EVT_STATE_ONESHOT, CLOCK_EVT_STATE_ONESHOT_STOPPED, }; /* * Clock event features */ # define CLOCK_EVT_FEAT_PERIODIC 0x000001 # define CLOCK_EVT_FEAT_ONESHOT 0x000002 # define CLOCK_EVT_FEAT_KTIME 0x000004 /* * x86(64) specific (mis)features: * * - Clockevent source stops in C3 State and needs broadcast support. * - Local APIC timer is used as a dummy device. */ # define CLOCK_EVT_FEAT_C3STOP 0x000008 # define CLOCK_EVT_FEAT_DUMMY 0x000010 /* * Core shall set the interrupt affinity dynamically in broadcast mode */ # define CLOCK_EVT_FEAT_DYNIRQ 0x000020 # define CLOCK_EVT_FEAT_PERCPU 0x000040 /* * Clockevent device is based on a hrtimer for broadcast */ # define CLOCK_EVT_FEAT_HRTIMER 0x000080 /** * struct clock_event_device - clock event device descriptor * @event_handler: Assigned by the framework to be called by the low * level handler of the event source * @set_next_event: set next event function using a clocksource delta * @set_next_ktime: set next event function using a direct ktime value * @next_event: local storage for the next event in oneshot mode * @max_delta_ns: maximum delta value in ns * @min_delta_ns: minimum delta value in ns * @mult: nanosecond to cycles multiplier * @shift: nanoseconds to cycles divisor (power of two) * @state_use_accessors:current state of the device, assigned by the core code * @features: features * @retries: number of forced programming retries * @set_state_periodic: switch state to periodic * @set_state_oneshot: switch state to oneshot * @set_state_oneshot_stopped: switch state to oneshot_stopped * @set_state_shutdown: switch state to shutdown * @tick_resume: resume clkevt device * @broadcast: function to broadcast events * @min_delta_ticks: minimum delta value in ticks stored for reconfiguration * @max_delta_ticks: maximum delta value in ticks stored for reconfiguration * @name: ptr to clock event name * @rating: variable to rate clock event devices * @irq: IRQ number (only for non CPU local devices) * @bound_on: Bound on CPU * @cpumask: cpumask to indicate for which CPUs this device works * @list: list head for the management code * @owner: module reference */ struct clock_event_device { void (*event_handler)(struct clock_event_device *); int (*set_next_event)(unsigned long evt, struct clock_event_device *); int (*set_next_ktime)(ktime_t expires, struct clock_event_device *); ktime_t next_event; u64 max_delta_ns; u64 min_delta_ns; u32 mult; u32 shift; enum clock_event_state state_use_accessors; unsigned int features; unsigned long retries; int (*set_state_periodic)(struct clock_event_device *); int (*set_state_oneshot)(struct clock_event_device *); int (*set_state_oneshot_stopped)(struct clock_event_device *); int (*set_state_shutdown)(struct clock_event_device *); int (*tick_resume)(struct clock_event_device *); void (*broadcast)(const struct cpumask *mask); void (*suspend)(struct clock_event_device *); void (*resume)(struct clock_event_device *); unsigned long min_delta_ticks; unsigned long max_delta_ticks; const char *name; int rating; int irq; int bound_on; const struct cpumask *cpumask; struct list_head list; struct module *owner; } ____cacheline_aligned; /* Helpers to verify state of a clockevent device */ static inline bool clockevent_state_detached(struct clock_event_device *dev) { return dev->state_use_accessors == CLOCK_EVT_STATE_DETACHED; } static inline bool clockevent_state_shutdown(struct clock_event_device *dev) { return dev->state_use_accessors == CLOCK_EVT_STATE_SHUTDOWN; } static inline bool clockevent_state_periodic(struct clock_event_device *dev) { return dev->state_use_accessors == CLOCK_EVT_STATE_PERIODIC; } static inline bool clockevent_state_oneshot(struct clock_event_device *dev) { return dev->state_use_accessors == CLOCK_EVT_STATE_ONESHOT; } static inline bool clockevent_state_oneshot_stopped(struct clock_event_device *dev) { return dev->state_use_accessors == CLOCK_EVT_STATE_ONESHOT_STOPPED; } /* * Calculate a multiplication factor for scaled math, which is used to convert * nanoseconds based values to clock ticks: * * clock_ticks = (nanoseconds * factor) >> shift. * * div_sc is the rearranged equation to calculate a factor from a given clock * ticks / nanoseconds ratio: * * factor = (clock_ticks << shift) / nanoseconds */ static inline unsigned long div_sc(unsigned long ticks, unsigned long nsec, int shift) { u64 tmp = ((u64)ticks) << shift; do_div(tmp, nsec); return (unsigned long) tmp; } /* Clock event layer functions */ extern u64 clockevent_delta2ns(unsigned long latch, struct clock_event_device *evt); extern void clockevents_register_device(struct clock_event_device *dev); extern int clockevents_unbind_device(struct clock_event_device *ced, int cpu); extern void clockevents_config_and_register(struct clock_event_device *dev, u32 freq, unsigned long min_delta, unsigned long max_delta); extern int clockevents_update_freq(struct clock_event_device *ce, u32 freq); static inline void clockevents_calc_mult_shift(struct clock_event_device *ce, u32 freq, u32 maxsec) { return clocks_calc_mult_shift(&ce->mult, &ce->shift, NSEC_PER_SEC, freq, maxsec); } extern void clockevents_suspend(void); extern void clockevents_resume(void); # ifdef CONFIG_GENERIC_CLOCKEVENTS_BROADCAST # ifdef CONFIG_ARCH_HAS_TICK_BROADCAST extern void tick_broadcast(const struct cpumask *mask); # else # define tick_broadcast NULL # endif extern int tick_receive_broadcast(void); # endif # if defined(CONFIG_GENERIC_CLOCKEVENTS_BROADCAST) && defined(CONFIG_TICK_ONESHOT) extern void tick_setup_hrtimer_broadcast(void); extern int tick_check_broadcast_expired(void); # else static __always_inline int tick_check_broadcast_expired(void) { return 0; } static inline void tick_setup_hrtimer_broadcast(void) { } # endif #else /* !CONFIG_GENERIC_CLOCKEVENTS: */ static inline void clockevents_suspend(void) { } static inline void clockevents_resume(void) { } static __always_inline int tick_check_broadcast_expired(void) { return 0; } static inline void tick_setup_hrtimer_broadcast(void) { } #endif /* !CONFIG_GENERIC_CLOCKEVENTS */ #endif /* _LINUX_CLOCKCHIPS_H */ |
2 2 2 | 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60 61 62 63 64 65 66 67 68 69 70 71 72 73 | /* SPDX-License-Identifier: GPL-2.0-or-later */ #ifndef _TFRC_H_ #define _TFRC_H_ /* * Copyright (c) 2007 The University of Aberdeen, Scotland, UK * Copyright (c) 2005-6 The University of Waikato, Hamilton, New Zealand. * Copyright (c) 2005-6 Ian McDonald <ian.mcdonald@jandi.co.nz> * Copyright (c) 2005 Arnaldo Carvalho de Melo <acme@conectiva.com.br> * Copyright (c) 2003 Nils-Erik Mattsson, Joacim Haggmark, Magnus Erixzon */ #include <linux/types.h> #include <linux/math64.h> #include "../../dccp.h" /* internal includes that this library exports: */ #include "loss_interval.h" #include "packet_history.h" #ifdef CONFIG_IP_DCCP_TFRC_DEBUG extern bool tfrc_debug; #define tfrc_pr_debug(format, a...) DCCP_PR_DEBUG(tfrc_debug, format, ##a) #else #define tfrc_pr_debug(format, a...) #endif /* integer-arithmetic divisions of type (a * 1000000)/b */ static inline u64 scaled_div(u64 a, u64 b) { BUG_ON(b == 0); return div64_u64(a * 1000000, b); } static inline u32 scaled_div32(u64 a, u64 b) { u64 result = scaled_div(a, b); if (result > UINT_MAX) { DCCP_CRIT("Overflow: %llu/%llu > UINT_MAX", (unsigned long long)a, (unsigned long long)b); return UINT_MAX; } return result; } /** * tfrc_ewma - Exponentially weighted moving average * @weight: Weight to be used as damping factor, in units of 1/10 */ static inline u32 tfrc_ewma(const u32 avg, const u32 newval, const u8 weight) { return avg ? (weight * avg + (10 - weight) * newval) / 10 : newval; } u32 tfrc_calc_x(u16 s, u32 R, u32 p); u32 tfrc_calc_x_reverse_lookup(u32 fvalue); u32 tfrc_invert_loss_event_rate(u32 loss_event_rate); int tfrc_tx_packet_history_init(void); void tfrc_tx_packet_history_exit(void); int tfrc_rx_packet_history_init(void); void tfrc_rx_packet_history_exit(void); int tfrc_li_init(void); void tfrc_li_exit(void); #ifdef CONFIG_IP_DCCP_TFRC_LIB int tfrc_lib_init(void); void tfrc_lib_exit(void); #else #define tfrc_lib_init() (0) #define tfrc_lib_exit() #endif #endif /* _TFRC_H_ */ |
7 8 8 8 8 8 8 8 8 8 8 8 7 8 8 | 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60 61 62 63 64 65 66 67 68 69 70 71 72 73 74 75 76 77 78 79 80 81 82 83 84 85 86 87 88 89 90 91 92 93 94 95 96 97 98 99 100 101 102 103 104 105 106 107 108 109 110 111 112 113 114 115 116 117 118 119 120 121 122 123 124 125 126 127 128 129 130 131 132 133 134 135 136 137 138 139 140 141 142 143 144 145 146 147 148 149 150 151 152 153 154 155 156 157 158 159 160 161 162 163 164 165 166 167 168 169 170 171 172 173 174 175 176 177 178 179 180 181 182 183 184 185 186 187 188 189 190 191 192 193 194 195 196 197 198 199 200 201 202 203 204 205 206 207 208 209 210 211 212 213 214 215 216 217 218 219 220 221 222 223 224 225 226 227 228 229 230 231 232 233 234 235 236 237 238 239 240 241 242 243 244 245 246 247 248 249 250 251 252 253 254 255 256 257 258 259 260 261 262 263 264 265 266 267 268 269 270 271 272 273 274 275 276 277 278 279 280 281 282 283 284 285 286 287 288 289 290 291 292 293 294 295 296 297 298 299 300 301 302 303 304 305 306 307 308 309 310 311 312 313 314 315 316 317 318 319 320 321 322 323 324 325 326 327 328 329 330 331 332 333 334 335 336 337 338 339 340 341 342 343 344 345 346 347 348 349 350 351 352 353 354 355 356 357 358 359 360 361 362 363 364 365 366 367 368 369 370 371 372 373 374 375 376 377 378 379 380 381 382 383 384 385 386 387 388 389 390 391 392 393 394 395 396 397 398 399 400 401 402 403 404 405 406 407 408 409 410 411 412 413 414 415 416 417 418 419 420 421 422 423 424 425 426 427 428 429 430 431 432 433 434 435 436 437 438 439 440 441 442 443 444 445 446 447 448 449 450 451 452 453 454 455 456 457 458 459 460 461 462 463 464 465 466 467 468 469 470 471 472 473 474 475 476 477 478 479 480 481 482 483 484 485 486 487 488 489 490 491 492 493 494 495 496 497 498 499 500 501 502 503 504 505 506 507 508 509 510 511 512 513 514 515 516 517 518 519 520 521 522 523 524 525 526 527 528 529 530 531 532 533 534 535 536 537 538 539 540 541 542 543 544 545 546 547 548 549 550 551 552 553 554 555 556 557 558 559 560 561 562 563 564 565 566 567 568 569 570 571 572 573 574 575 576 577 578 579 580 581 582 583 584 585 586 587 588 589 590 591 592 593 594 595 596 597 598 599 600 601 602 603 604 605 606 607 608 609 610 611 612 613 614 615 616 617 618 619 620 621 622 623 624 625 626 627 628 629 630 631 632 633 634 635 636 637 638 639 640 641 642 643 644 645 646 647 648 649 650 651 652 653 654 655 656 657 658 659 660 661 662 663 664 665 666 667 668 669 670 671 672 673 674 675 676 677 678 679 680 681 682 683 684 685 686 687 688 689 690 691 692 693 694 695 696 697 698 699 700 701 702 703 704 705 706 707 708 709 710 711 712 713 714 715 716 717 718 719 720 721 722 723 724 725 726 727 728 729 730 731 732 733 734 735 736 737 738 739 740 741 742 743 744 745 746 747 748 749 750 751 752 753 754 755 756 757 758 759 760 761 762 763 764 765 766 767 768 769 770 771 772 773 774 775 776 777 778 779 780 781 782 783 784 785 786 787 788 789 790 791 792 793 794 795 796 797 798 799 800 801 802 803 804 805 806 807 808 809 810 811 812 813 814 815 816 817 818 819 820 821 822 823 824 825 826 827 828 829 830 831 832 833 834 835 836 837 838 839 840 841 842 843 844 845 846 847 848 849 850 851 852 853 854 855 856 857 858 859 860 861 862 863 864 865 866 867 868 869 870 871 872 873 874 875 876 877 878 879 880 881 882 | // SPDX-License-Identifier: GPL-2.0-or-later /* * Copyright (C) 2022 Oracle. All Rights Reserved. * Author: Allison Henderson <allison.henderson@oracle.com> */ #include "xfs.h" #include "xfs_fs.h" #include "xfs_format.h" #include "xfs_trans_resv.h" #include "xfs_shared.h" #include "xfs_mount.h" #include "xfs_defer.h" #include "xfs_log_format.h" #include "xfs_trans.h" #include "xfs_bmap_btree.h" #include "xfs_trans_priv.h" #include "xfs_log.h" #include "xfs_inode.h" #include "xfs_da_format.h" #include "xfs_da_btree.h" #include "xfs_attr.h" #include "xfs_attr_item.h" #include "xfs_trace.h" #include "xfs_trans_space.h" #include "xfs_errortag.h" #include "xfs_error.h" #include "xfs_log_priv.h" #include "xfs_log_recover.h" struct kmem_cache *xfs_attri_cache; struct kmem_cache *xfs_attrd_cache; static const struct xfs_item_ops xfs_attri_item_ops; static const struct xfs_item_ops xfs_attrd_item_ops; static struct xfs_attrd_log_item *xfs_trans_get_attrd(struct xfs_trans *tp, struct xfs_attri_log_item *attrip); static inline struct xfs_attri_log_item *ATTRI_ITEM(struct xfs_log_item *lip) { return container_of(lip, struct xfs_attri_log_item, attri_item); } /* * Shared xattr name/value buffers for logged extended attribute operations * * When logging updates to extended attributes, we can create quite a few * attribute log intent items for a single xattr update. To avoid cycling the * memory allocator and memcpy overhead, the name (and value, for setxattr) * are kept in a refcounted object that is shared across all related log items * and the upper-level deferred work state structure. The shared buffer has * a control structure, followed by the name, and then the value. */ static inline struct xfs_attri_log_nameval * xfs_attri_log_nameval_get( struct xfs_attri_log_nameval *nv) { if (!refcount_inc_not_zero(&nv->refcount)) return NULL; return nv; } static inline void xfs_attri_log_nameval_put( struct xfs_attri_log_nameval *nv) { if (!nv) return; if (refcount_dec_and_test(&nv->refcount)) kvfree(nv); } static inline struct xfs_attri_log_nameval * xfs_attri_log_nameval_alloc( const void *name, unsigned int name_len, const void *value, unsigned int value_len) { struct xfs_attri_log_nameval *nv; /* * This could be over 64kB in length, so we have to use kvmalloc() for * this. But kvmalloc() utterly sucks, so we use our own version. */ nv = xlog_kvmalloc(sizeof(struct xfs_attri_log_nameval) + name_len + value_len); nv->name.i_addr = nv + 1; nv->name.i_len = name_len; nv->name.i_type = XLOG_REG_TYPE_ATTR_NAME; memcpy(nv->name.i_addr, name, name_len); if (value_len) { nv->value.i_addr = nv->name.i_addr + name_len; nv->value.i_len = value_len; memcpy(nv->value.i_addr, value, value_len); } else { nv->value.i_addr = NULL; nv->value.i_len = 0; } nv->value.i_type = XLOG_REG_TYPE_ATTR_VALUE; refcount_set(&nv->refcount, 1); return nv; } STATIC void xfs_attri_item_free( struct xfs_attri_log_item *attrip) { kmem_free(attrip->attri_item.li_lv_shadow); xfs_attri_log_nameval_put(attrip->attri_nameval); kmem_cache_free(xfs_attri_cache, attrip); } /* * Freeing the attrip requires that we remove it from the AIL if it has already * been placed there. However, the ATTRI may not yet have been placed in the * AIL when called by xfs_attri_release() from ATTRD processing due to the * ordering of committed vs unpin operations in bulk insert operations. Hence * the reference count to ensure only the last caller frees the ATTRI. */ STATIC void xfs_attri_release( struct xfs_attri_log_item *attrip) { ASSERT(atomic_read(&attrip->attri_refcount) > 0); if (!atomic_dec_and_test(&attrip->attri_refcount)) return; xfs_trans_ail_delete(&attrip->attri_item, 0); xfs_attri_item_free(attrip); } STATIC void xfs_attri_item_size( struct xfs_log_item *lip, int *nvecs, int *nbytes) { struct xfs_attri_log_item *attrip = ATTRI_ITEM(lip); struct xfs_attri_log_nameval *nv = attrip->attri_nameval; *nvecs += 2; *nbytes += sizeof(struct xfs_attri_log_format) + xlog_calc_iovec_len(nv->name.i_len); if (!nv->value.i_len) return; *nvecs += 1; *nbytes += xlog_calc_iovec_len(nv->value.i_len); } /* * This is called to fill in the log iovecs for the given attri log * item. We use 1 iovec for the attri_format_item, 1 for the name, and * another for the value if it is present */ STATIC void xfs_attri_item_format( struct xfs_log_item *lip, struct xfs_log_vec *lv) { struct xfs_attri_log_item *attrip = ATTRI_ITEM(lip); struct xfs_log_iovec *vecp = NULL; struct xfs_attri_log_nameval *nv = attrip->attri_nameval; attrip->attri_format.alfi_type = XFS_LI_ATTRI; attrip->attri_format.alfi_size = 1; /* * This size accounting must be done before copying the attrip into the * iovec. If we do it after, the wrong size will be recorded to the log * and we trip across assertion checks for bad region sizes later during * the log recovery. */ ASSERT(nv->name.i_len > 0); attrip->attri_format.alfi_size++; if (nv->value.i_len > 0) attrip->attri_format.alfi_size++; xlog_copy_iovec(lv, &vecp, XLOG_REG_TYPE_ATTRI_FORMAT, &attrip->attri_format, sizeof(struct xfs_attri_log_format)); xlog_copy_from_iovec(lv, &vecp, &nv->name); if (nv->value.i_len > 0) xlog_copy_from_iovec(lv, &vecp, &nv->value); } /* * The unpin operation is the last place an ATTRI is manipulated in the log. It * is either inserted in the AIL or aborted in the event of a log I/O error. In * either case, the ATTRI transaction has been successfully committed to make * it this far. Therefore, we expect whoever committed the ATTRI to either * construct and commit the ATTRD or drop the ATTRD's reference in the event of * error. Simply drop the log's ATTRI reference now that the log is done with * it. */ STATIC void xfs_attri_item_unpin( struct xfs_log_item *lip, int remove) { xfs_attri_release(ATTRI_ITEM(lip)); } STATIC void xfs_attri_item_release( struct xfs_log_item *lip) { xfs_attri_release(ATTRI_ITEM(lip)); } /* * Allocate and initialize an attri item. Caller may allocate an additional * trailing buffer for name and value */ STATIC struct xfs_attri_log_item * xfs_attri_init( struct xfs_mount *mp, struct xfs_attri_log_nameval *nv) { struct xfs_attri_log_item *attrip; attrip = kmem_cache_zalloc(xfs_attri_cache, GFP_NOFS | __GFP_NOFAIL); /* * Grab an extra reference to the name/value buffer for this log item. * The caller retains its own reference! */ attrip->attri_nameval = xfs_attri_log_nameval_get(nv); ASSERT(attrip->attri_nameval); xfs_log_item_init(mp, &attrip->attri_item, XFS_LI_ATTRI, &xfs_attri_item_ops); attrip->attri_format.alfi_id = (uintptr_t)(void *)attrip; atomic_set(&attrip->attri_refcount, 2); return attrip; } static inline struct xfs_attrd_log_item *ATTRD_ITEM(struct xfs_log_item *lip) { return container_of(lip, struct xfs_attrd_log_item, attrd_item); } STATIC void xfs_attrd_item_free(struct xfs_attrd_log_item *attrdp) { kmem_free(attrdp->attrd_item.li_lv_shadow); kmem_cache_free(xfs_attrd_cache, attrdp); } STATIC void xfs_attrd_item_size( struct xfs_log_item *lip, int *nvecs, int *nbytes) { *nvecs += 1; *nbytes += sizeof(struct xfs_attrd_log_format); } /* * This is called to fill in the log iovecs for the given attrd log item. We use * only 1 iovec for the attrd_format, and we point that at the attr_log_format * structure embedded in the attrd item. */ STATIC void xfs_attrd_item_format( struct xfs_log_item *lip, struct xfs_log_vec *lv) { struct xfs_attrd_log_item *attrdp = ATTRD_ITEM(lip); struct xfs_log_iovec *vecp = NULL; attrdp->attrd_format.alfd_type = XFS_LI_ATTRD; attrdp->attrd_format.alfd_size = 1; xlog_copy_iovec(lv, &vecp, XLOG_REG_TYPE_ATTRD_FORMAT, &attrdp->attrd_format, sizeof(struct xfs_attrd_log_format)); } /* * The ATTRD is either committed or aborted if the transaction is canceled. If * the transaction is canceled, drop our reference to the ATTRI and free the * ATTRD. */ STATIC void xfs_attrd_item_release( struct xfs_log_item *lip) { struct xfs_attrd_log_item *attrdp = ATTRD_ITEM(lip); xfs_attri_release(attrdp->attrd_attrip); xfs_attrd_item_free(attrdp); } static struct xfs_log_item * xfs_attrd_item_intent( struct xfs_log_item *lip) { return &ATTRD_ITEM(lip)->attrd_attrip->attri_item; } /* * Performs one step of an attribute update intent and marks the attrd item * dirty.. An attr operation may be a set or a remove. Note that the * transaction is marked dirty regardless of whether the operation succeeds or * fails to support the ATTRI/ATTRD lifecycle rules. */ STATIC int xfs_xattri_finish_update( struct xfs_attr_intent *attr, struct xfs_attrd_log_item *attrdp) { struct xfs_da_args *args = attr->xattri_da_args; int error; if (XFS_TEST_ERROR(false, args->dp->i_mount, XFS_ERRTAG_LARP)) { error = -EIO; goto out; } error = xfs_attr_set_iter(attr); if (!error && attr->xattri_dela_state != XFS_DAS_DONE) error = -EAGAIN; out: /* * Mark the transaction dirty, even on error. This ensures the * transaction is aborted, which: * * 1.) releases the ATTRI and frees the ATTRD * 2.) shuts down the filesystem */ args->trans->t_flags |= XFS_TRANS_DIRTY | XFS_TRANS_HAS_INTENT_DONE; /* * attr intent/done items are null when logged attributes are disabled */ if (attrdp) set_bit(XFS_LI_DIRTY, &attrdp->attrd_item.li_flags); return error; } /* Log an attr to the intent item. */ STATIC void xfs_attr_log_item( struct xfs_trans *tp, struct xfs_attri_log_item *attrip, const struct xfs_attr_intent *attr) { struct xfs_attri_log_format *attrp; tp->t_flags |= XFS_TRANS_DIRTY; set_bit(XFS_LI_DIRTY, &attrip->attri_item.li_flags); /* * At this point the xfs_attr_intent has been constructed, and we've * created the log intent. Fill in the attri log item and log format * structure with fields from this xfs_attr_intent */ attrp = &attrip->attri_format; attrp->alfi_ino = attr->xattri_da_args->dp->i_ino; ASSERT(!(attr->xattri_op_flags & ~XFS_ATTRI_OP_FLAGS_TYPE_MASK)); attrp->alfi_op_flags = attr->xattri_op_flags; attrp->alfi_value_len = attr->xattri_nameval->value.i_len; attrp->alfi_name_len = attr->xattri_nameval->name.i_len; ASSERT(!(attr->xattri_da_args->attr_filter & ~XFS_ATTRI_FILTER_MASK)); attrp->alfi_attr_filter = attr->xattri_da_args->attr_filter; } /* Get an ATTRI. */ static struct xfs_log_item * xfs_attr_create_intent( struct xfs_trans *tp, struct list_head *items, unsigned int count, bool sort) { struct xfs_mount *mp = tp->t_mountp; struct xfs_attri_log_item *attrip; struct xfs_attr_intent *attr; struct xfs_da_args *args; ASSERT(count == 1); /* * Each attr item only performs one attribute operation at a time, so * this is a list of one */ attr = list_first_entry_or_null(items, struct xfs_attr_intent, xattri_list); args = attr->xattri_da_args; if (!(args->op_flags & XFS_DA_OP_LOGGED)) return NULL; /* * Create a buffer to store the attribute name and value. This buffer * will be shared between the higher level deferred xattr work state * and the lower level xattr log items. */ if (!attr->xattri_nameval) { /* * Transfer our reference to the name/value buffer to the * deferred work state structure. */ attr->xattri_nameval = xfs_attri_log_nameval_alloc(args->name, args->namelen, args->value, args->valuelen); } attrip = xfs_attri_init(mp, attr->xattri_nameval); xfs_trans_add_item(tp, &attrip->attri_item); xfs_attr_log_item(tp, attrip, attr); return &attrip->attri_item; } static inline void xfs_attr_free_item( struct xfs_attr_intent *attr) { if (attr->xattri_da_state) xfs_da_state_free(attr->xattri_da_state); xfs_attri_log_nameval_put(attr->xattri_nameval); if (attr->xattri_da_args->op_flags & XFS_DA_OP_RECOVERY) kmem_free(attr); else kmem_cache_free(xfs_attr_intent_cache, attr); } /* Process an attr. */ STATIC int xfs_attr_finish_item( struct xfs_trans *tp, struct xfs_log_item *done, struct list_head *item, struct xfs_btree_cur **state) { struct xfs_attr_intent *attr; struct xfs_attrd_log_item *done_item = NULL; int error; attr = container_of(item, struct xfs_attr_intent, xattri_list); if (done) done_item = ATTRD_ITEM(done); /* * Always reset trans after EAGAIN cycle * since the transaction is new */ attr->xattri_da_args->trans = tp; error = xfs_xattri_finish_update(attr, done_item); if (error != -EAGAIN) xfs_attr_free_item(attr); return error; } /* Abort all pending ATTRs. */ STATIC void xfs_attr_abort_intent( struct xfs_log_item *intent) { xfs_attri_release(ATTRI_ITEM(intent)); } /* Cancel an attr */ STATIC void xfs_attr_cancel_item( struct list_head *item) { struct xfs_attr_intent *attr; attr = container_of(item, struct xfs_attr_intent, xattri_list); xfs_attr_free_item(attr); } STATIC bool xfs_attri_item_match( struct xfs_log_item *lip, uint64_t intent_id) { return ATTRI_ITEM(lip)->attri_format.alfi_id == intent_id; } /* Is this recovered ATTRI format ok? */ static inline bool xfs_attri_validate( struct xfs_mount *mp, struct xfs_attri_log_format *attrp) { unsigned int op = attrp->alfi_op_flags & XFS_ATTRI_OP_FLAGS_TYPE_MASK; if (attrp->__pad != 0) return false; if (attrp->alfi_op_flags & ~XFS_ATTRI_OP_FLAGS_TYPE_MASK) return false; if (attrp->alfi_attr_filter & ~XFS_ATTRI_FILTER_MASK) return false; /* alfi_op_flags should be either a set or remove */ switch (op) { case XFS_ATTRI_OP_FLAGS_SET: case XFS_ATTRI_OP_FLAGS_REPLACE: case XFS_ATTRI_OP_FLAGS_REMOVE: break; default: return false; } if (attrp->alfi_value_len > XATTR_SIZE_MAX) return false; if ((attrp->alfi_name_len > XATTR_NAME_MAX) || (attrp->alfi_name_len == 0)) return false; return xfs_verify_ino(mp, attrp->alfi_ino); } /* * Process an attr intent item that was recovered from the log. We need to * delete the attr that it describes. */ STATIC int xfs_attri_item_recover( struct xfs_log_item *lip, struct list_head *capture_list) { struct xfs_attri_log_item *attrip = ATTRI_ITEM(lip); struct xfs_attr_intent *attr; struct xfs_mount *mp = lip->li_log->l_mp; struct xfs_inode *ip; struct xfs_da_args *args; struct xfs_trans *tp; struct xfs_trans_res resv; struct xfs_attri_log_format *attrp; struct xfs_attri_log_nameval *nv = attrip->attri_nameval; int error; int total; int local; struct xfs_attrd_log_item *done_item = NULL; /* * First check the validity of the attr described by the ATTRI. If any * are bad, then assume that all are bad and just toss the ATTRI. */ attrp = &attrip->attri_format; if (!xfs_attri_validate(mp, attrp) || !xfs_attr_namecheck(nv->name.i_addr, nv->name.i_len)) return -EFSCORRUPTED; error = xlog_recover_iget(mp, attrp->alfi_ino, &ip); if (error) return error; attr = kmem_zalloc(sizeof(struct xfs_attr_intent) + sizeof(struct xfs_da_args), KM_NOFS); args = (struct xfs_da_args *)(attr + 1); attr->xattri_da_args = args; attr->xattri_op_flags = attrp->alfi_op_flags & XFS_ATTRI_OP_FLAGS_TYPE_MASK; /* * We're reconstructing the deferred work state structure from the * recovered log item. Grab a reference to the name/value buffer and * attach it to the new work state. */ attr->xattri_nameval = xfs_attri_log_nameval_get(nv); ASSERT(attr->xattri_nameval); args->dp = ip; args->geo = mp->m_attr_geo; args->whichfork = XFS_ATTR_FORK; args->name = nv->name.i_addr; args->namelen = nv->name.i_len; args->hashval = xfs_da_hashname(args->name, args->namelen); args->attr_filter = attrp->alfi_attr_filter & XFS_ATTRI_FILTER_MASK; args->op_flags = XFS_DA_OP_RECOVERY | XFS_DA_OP_OKNOENT | XFS_DA_OP_LOGGED; ASSERT(xfs_sb_version_haslogxattrs(&mp->m_sb)); switch (attr->xattri_op_flags) { case XFS_ATTRI_OP_FLAGS_SET: case XFS_ATTRI_OP_FLAGS_REPLACE: args->value = nv->value.i_addr; args->valuelen = nv->value.i_len; args->total = xfs_attr_calc_size(args, &local); if (xfs_inode_hasattr(args->dp)) attr->xattri_dela_state = xfs_attr_init_replace_state(args); else attr->xattri_dela_state = xfs_attr_init_add_state(args); break; case XFS_ATTRI_OP_FLAGS_REMOVE: if (!xfs_inode_hasattr(args->dp)) goto out; attr->xattri_dela_state = xfs_attr_init_remove_state(args); break; default: ASSERT(0); error = -EFSCORRUPTED; goto out; } xfs_init_attr_trans(args, &resv, &total); resv = xlog_recover_resv(&resv); error = xfs_trans_alloc(mp, &resv, total, 0, XFS_TRANS_RESERVE, &tp); if (error) goto out; args->trans = tp; done_item = xfs_trans_get_attrd(tp, attrip); xfs_ilock(ip, XFS_ILOCK_EXCL); xfs_trans_ijoin(tp, ip, 0); error = xfs_xattri_finish_update(attr, done_item); if (error == -EAGAIN) { /* * There's more work to do, so add the intent item to this * transaction so that we can continue it later. */ xfs_defer_add(tp, XFS_DEFER_OPS_TYPE_ATTR, &attr->xattri_list); error = xfs_defer_ops_capture_and_commit(tp, capture_list); if (error) goto out_unlock; xfs_iunlock(ip, XFS_ILOCK_EXCL); xfs_irele(ip); return 0; } if (error) { xfs_trans_cancel(tp); goto out_unlock; } error = xfs_defer_ops_capture_and_commit(tp, capture_list); out_unlock: xfs_iunlock(ip, XFS_ILOCK_EXCL); xfs_irele(ip); out: xfs_attr_free_item(attr); return error; } /* Re-log an intent item to push the log tail forward. */ static struct xfs_log_item * xfs_attri_item_relog( struct xfs_log_item *intent, struct xfs_trans *tp) { struct xfs_attrd_log_item *attrdp; struct xfs_attri_log_item *old_attrip; struct xfs_attri_log_item *new_attrip; struct xfs_attri_log_format *new_attrp; struct xfs_attri_log_format *old_attrp; old_attrip = ATTRI_ITEM(intent); old_attrp = &old_attrip->attri_format; tp->t_flags |= XFS_TRANS_DIRTY; attrdp = xfs_trans_get_attrd(tp, old_attrip); set_bit(XFS_LI_DIRTY, &attrdp->attrd_item.li_flags); /* * Create a new log item that shares the same name/value buffer as the * old log item. */ new_attrip = xfs_attri_init(tp->t_mountp, old_attrip->attri_nameval); new_attrp = &new_attrip->attri_format; new_attrp->alfi_ino = old_attrp->alfi_ino; new_attrp->alfi_op_flags = old_attrp->alfi_op_flags; new_attrp->alfi_value_len = old_attrp->alfi_value_len; new_attrp->alfi_name_len = old_attrp->alfi_name_len; new_attrp->alfi_attr_filter = old_attrp->alfi_attr_filter; xfs_trans_add_item(tp, &new_attrip->attri_item); set_bit(XFS_LI_DIRTY, &new_attrip->attri_item.li_flags); return &new_attrip->attri_item; } STATIC int xlog_recover_attri_commit_pass2( struct xlog *log, struct list_head *buffer_list, struct xlog_recover_item *item, xfs_lsn_t lsn) { struct xfs_mount *mp = log->l_mp; struct xfs_attri_log_item *attrip; struct xfs_attri_log_format *attri_formatp; struct xfs_attri_log_nameval *nv; const void *attr_value = NULL; const void *attr_name; size_t len; attri_formatp = item->ri_buf[0].i_addr; attr_name = item->ri_buf[1].i_addr; /* Validate xfs_attri_log_format before the large memory allocation */ len = sizeof(struct xfs_attri_log_format); if (item->ri_buf[0].i_len != len) { XFS_CORRUPTION_ERROR(__func__, XFS_ERRLEVEL_LOW, mp, item->ri_buf[0].i_addr, item->ri_buf[0].i_len); return -EFSCORRUPTED; } if (!xfs_attri_validate(mp, attri_formatp)) { XFS_CORRUPTION_ERROR(__func__, XFS_ERRLEVEL_LOW, mp, item->ri_buf[0].i_addr, item->ri_buf[0].i_len); return -EFSCORRUPTED; } /* Validate the attr name */ if (item->ri_buf[1].i_len != xlog_calc_iovec_len(attri_formatp->alfi_name_len)) { XFS_CORRUPTION_ERROR(__func__, XFS_ERRLEVEL_LOW, mp, item->ri_buf[0].i_addr, item->ri_buf[0].i_len); return -EFSCORRUPTED; } if (!xfs_attr_namecheck(attr_name, attri_formatp->alfi_name_len)) { XFS_CORRUPTION_ERROR(__func__, XFS_ERRLEVEL_LOW, mp, item->ri_buf[1].i_addr, item->ri_buf[1].i_len); return -EFSCORRUPTED; } /* Validate the attr value, if present */ if (attri_formatp->alfi_value_len != 0) { if (item->ri_buf[2].i_len != xlog_calc_iovec_len(attri_formatp->alfi_value_len)) { XFS_CORRUPTION_ERROR(__func__, XFS_ERRLEVEL_LOW, mp, item->ri_buf[0].i_addr, item->ri_buf[0].i_len); return -EFSCORRUPTED; } attr_value = item->ri_buf[2].i_addr; } /* * Memory alloc failure will cause replay to abort. We attach the * name/value buffer to the recovered incore log item and drop our * reference. */ nv = xfs_attri_log_nameval_alloc(attr_name, attri_formatp->alfi_name_len, attr_value, attri_formatp->alfi_value_len); attrip = xfs_attri_init(mp, nv); memcpy(&attrip->attri_format, attri_formatp, len); /* * The ATTRI has two references. One for the ATTRD and one for ATTRI to * ensure it makes it into the AIL. Insert the ATTRI into the AIL * directly and drop the ATTRI reference. Note that * xfs_trans_ail_update() drops the AIL lock. */ xfs_trans_ail_insert(log->l_ailp, &attrip->attri_item, lsn); xfs_attri_release(attrip); xfs_attri_log_nameval_put(nv); return 0; } /* * This routine is called to allocate an "attr free done" log item. */ static struct xfs_attrd_log_item * xfs_trans_get_attrd(struct xfs_trans *tp, struct xfs_attri_log_item *attrip) { struct xfs_attrd_log_item *attrdp; ASSERT(tp != NULL); attrdp = kmem_cache_zalloc(xfs_attrd_cache, GFP_NOFS | __GFP_NOFAIL); xfs_log_item_init(tp->t_mountp, &attrdp->attrd_item, XFS_LI_ATTRD, &xfs_attrd_item_ops); attrdp->attrd_attrip = attrip; attrdp->attrd_format.alfd_alf_id = attrip->attri_format.alfi_id; xfs_trans_add_item(tp, &attrdp->attrd_item); return attrdp; } /* Get an ATTRD so we can process all the attrs. */ static struct xfs_log_item * xfs_attr_create_done( struct xfs_trans *tp, struct xfs_log_item *intent, unsigned int count) { if (!intent) return NULL; return &xfs_trans_get_attrd(tp, ATTRI_ITEM(intent))->attrd_item; } const struct xfs_defer_op_type xfs_attr_defer_type = { .max_items = 1, .create_intent = xfs_attr_create_intent, .abort_intent = xfs_attr_abort_intent, .create_done = xfs_attr_create_done, .finish_item = xfs_attr_finish_item, .cancel_item = xfs_attr_cancel_item, }; /* * This routine is called when an ATTRD format structure is found in a committed * transaction in the log. Its purpose is to cancel the corresponding ATTRI if * it was still in the log. To do this it searches the AIL for the ATTRI with * an id equal to that in the ATTRD format structure. If we find it we drop * the ATTRD reference, which removes the ATTRI from the AIL and frees it. */ STATIC int xlog_recover_attrd_commit_pass2( struct xlog *log, struct list_head *buffer_list, struct xlog_recover_item *item, xfs_lsn_t lsn) { struct xfs_attrd_log_format *attrd_formatp; attrd_formatp = item->ri_buf[0].i_addr; if (item->ri_buf[0].i_len != sizeof(struct xfs_attrd_log_format)) { XFS_CORRUPTION_ERROR(__func__, XFS_ERRLEVEL_LOW, log->l_mp, item->ri_buf[0].i_addr, item->ri_buf[0].i_len); return -EFSCORRUPTED; } xlog_recover_release_intent(log, XFS_LI_ATTRI, attrd_formatp->alfd_alf_id); return 0; } static const struct xfs_item_ops xfs_attri_item_ops = { .flags = XFS_ITEM_INTENT, .iop_size = xfs_attri_item_size, .iop_format = xfs_attri_item_format, .iop_unpin = xfs_attri_item_unpin, .iop_release = xfs_attri_item_release, .iop_recover = xfs_attri_item_recover, .iop_match = xfs_attri_item_match, .iop_relog = xfs_attri_item_relog, }; const struct xlog_recover_item_ops xlog_attri_item_ops = { .item_type = XFS_LI_ATTRI, .commit_pass2 = xlog_recover_attri_commit_pass2, }; static const struct xfs_item_ops xfs_attrd_item_ops = { .flags = XFS_ITEM_RELEASE_WHEN_COMMITTED | XFS_ITEM_INTENT_DONE, .iop_size = xfs_attrd_item_size, .iop_format = xfs_attrd_item_format, .iop_release = xfs_attrd_item_release, .iop_intent = xfs_attrd_item_intent, }; const struct xlog_recover_item_ops xlog_attrd_item_ops = { .item_type = XFS_LI_ATTRD, .commit_pass2 = xlog_recover_attrd_commit_pass2, }; |
3 2 2 2 1 1 3 1 3 1 3 | 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60 61 62 63 64 65 66 67 68 69 70 71 72 73 74 75 76 77 78 79 80 81 82 83 84 85 86 87 88 89 90 91 92 93 94 95 96 97 98 99 100 101 102 103 104 105 106 107 108 109 110 111 112 113 114 115 116 117 118 119 120 121 122 123 124 125 126 127 128 129 130 131 132 133 134 135 136 137 138 139 140 141 142 143 144 145 146 147 148 149 150 151 152 153 154 155 156 157 158 159 160 161 162 163 164 165 166 167 168 169 170 171 172 173 174 175 176 177 178 179 180 181 182 183 184 185 186 187 188 189 190 191 192 193 194 195 196 197 198 199 200 201 202 203 204 205 206 207 208 209 210 211 212 213 214 215 216 217 218 219 220 221 222 223 224 225 226 227 228 229 230 231 232 233 234 235 236 237 238 239 240 241 242 243 244 245 246 247 248 249 250 251 252 253 254 255 256 257 258 259 260 261 262 263 264 265 266 267 268 269 270 271 272 273 274 275 276 277 278 279 280 281 282 283 284 285 286 287 288 289 290 291 292 293 294 295 296 297 298 299 300 301 302 303 304 305 306 307 308 309 310 311 312 313 314 315 316 317 318 319 320 321 322 323 324 325 326 327 328 329 330 331 332 333 334 335 336 337 338 339 340 341 342 343 344 345 346 347 348 349 350 351 352 353 354 355 356 357 358 359 360 361 362 363 364 365 366 367 368 369 370 371 372 373 374 375 376 377 378 379 380 381 382 383 384 385 386 387 388 389 390 391 392 393 394 395 396 397 398 399 400 401 402 403 404 405 406 407 408 409 410 411 412 413 414 415 416 417 418 419 420 421 422 423 424 425 426 427 428 429 430 431 432 433 434 435 436 437 438 439 440 441 442 443 444 445 446 447 448 449 450 451 452 453 454 455 | // SPDX-License-Identifier: GPL-2.0-or-later /* * Syncookies implementation for the Linux kernel * * Copyright (C) 1997 Andi Kleen * Based on ideas by D.J.Bernstein and Eric Schenk. */ #include <linux/tcp.h> #include <linux/siphash.h> #include <linux/kernel.h> #include <linux/export.h> #include <net/secure_seq.h> #include <net/tcp.h> #include <net/route.h> static siphash_aligned_key_t syncookie_secret[2]; #define COOKIEBITS 24 /* Upper bits store count */ #define COOKIEMASK (((__u32)1 << COOKIEBITS) - 1) /* TCP Timestamp: 6 lowest bits of timestamp sent in the cookie SYN-ACK * stores TCP options: * * MSB LSB * | 31 ... 6 | 5 | 4 | 3 2 1 0 | * | Timestamp | ECN | SACK | WScale | * * When we receive a valid cookie-ACK, we look at the echoed tsval (if * any) to figure out which TCP options we should use for the rebuilt * connection. * * A WScale setting of '0xf' (which is an invalid scaling value) * means that original syn did not include the TCP window scaling option. */ #define TS_OPT_WSCALE_MASK 0xf #define TS_OPT_SACK BIT(4) #define TS_OPT_ECN BIT(5) /* There is no TS_OPT_TIMESTAMP: * if ACK contains timestamp option, we already know it was * requested/supported by the syn/synack exchange. */ #define TSBITS 6 #define TSMASK (((__u32)1 << TSBITS) - 1) static u32 cookie_hash(__be32 saddr, __be32 daddr, __be16 sport, __be16 dport, u32 count, int c) { net_get_random_once(syncookie_secret, sizeof(syncookie_secret)); return siphash_4u32((__force u32)saddr, (__force u32)daddr, (__force u32)sport << 16 | (__force u32)dport, count, &syncookie_secret[c]); } /* * when syncookies are in effect and tcp timestamps are enabled we encode * tcp options in the lower bits of the timestamp value that will be * sent in the syn-ack. * Since subsequent timestamps use the normal tcp_time_stamp value, we * must make sure that the resulting initial timestamp is <= tcp_time_stamp. */ u64 cookie_init_timestamp(struct request_sock *req, u64 now) { struct inet_request_sock *ireq; u32 ts, ts_now = tcp_ns_to_ts(now); u32 options = 0; ireq = inet_rsk(req); options = ireq->wscale_ok ? ireq->snd_wscale : TS_OPT_WSCALE_MASK; if (ireq->sack_ok) options |= TS_OPT_SACK; if (ireq->ecn_ok) options |= TS_OPT_ECN; ts = ts_now & ~TSMASK; ts |= options; if (ts > ts_now) { ts >>= TSBITS; ts--; ts <<= TSBITS; ts |= options; } return (u64)ts * (NSEC_PER_SEC / TCP_TS_HZ); } static __u32 secure_tcp_syn_cookie(__be32 saddr, __be32 daddr, __be16 sport, __be16 dport, __u32 sseq, __u32 data) { /* * Compute the secure sequence number. * The output should be: * HASH(sec1,saddr,sport,daddr,dport,sec1) + sseq + (count * 2^24) * + (HASH(sec2,saddr,sport,daddr,dport,count,sec2) % 2^24). * Where sseq is their sequence number and count increases every * minute by 1. * As an extra hack, we add a small "data" value that encodes the * MSS into the second hash value. */ u32 count = tcp_cookie_time(); return (cookie_hash(saddr, daddr, sport, dport, 0, 0) + sseq + (count << COOKIEBITS) + ((cookie_hash(saddr, daddr, sport, dport, count, 1) + data) & COOKIEMASK)); } /* * This retrieves the small "data" value from the syncookie. * If the syncookie is bad, the data returned will be out of * range. This must be checked by the caller. * * The count value used to generate the cookie must be less than * MAX_SYNCOOKIE_AGE minutes in the past. * The return value (__u32)-1 if this test fails. */ static __u32 check_tcp_syn_cookie(__u32 cookie, __be32 saddr, __be32 daddr, __be16 sport, __be16 dport, __u32 sseq) { u32 diff, count = tcp_cookie_time(); /* Strip away the layers from the cookie */ cookie -= cookie_hash(saddr, daddr, sport, dport, 0, 0) + sseq; /* Cookie is now reduced to (count * 2^24) ^ (hash % 2^24) */ diff = (count - (cookie >> COOKIEBITS)) & ((__u32) -1 >> COOKIEBITS); if (diff >= MAX_SYNCOOKIE_AGE) return (__u32)-1; return (cookie - cookie_hash(saddr, daddr, sport, dport, count - diff, 1)) & COOKIEMASK; /* Leaving the data behind */ } /* * MSS Values are chosen based on the 2011 paper * 'An Analysis of TCP Maximum Segement Sizes' by S. Alcock and R. Nelson. * Values .. * .. lower than 536 are rare (< 0.2%) * .. between 537 and 1299 account for less than < 1.5% of observed values * .. in the 1300-1349 range account for about 15 to 20% of observed mss values * .. exceeding 1460 are very rare (< 0.04%) * * 1460 is the single most frequently announced mss value (30 to 46% depending * on monitor location). Table must be sorted. */ static __u16 const msstab[] = { 536, 1300, 1440, /* 1440, 1452: PPPoE */ 1460, }; /* * Generate a syncookie. mssp points to the mss, which is returned * rounded down to the value encoded in the cookie. */ u32 __cookie_v4_init_sequence(const struct iphdr *iph, const struct tcphdr *th, u16 *mssp) { int mssind; const __u16 mss = *mssp; for (mssind = ARRAY_SIZE(msstab) - 1; mssind ; mssind--) if (mss >= msstab[mssind]) break; *mssp = msstab[mssind]; return secure_tcp_syn_cookie(iph->saddr, iph->daddr, th->source, th->dest, ntohl(th->seq), mssind); } EXPORT_SYMBOL_GPL(__cookie_v4_init_sequence); __u32 cookie_v4_init_sequence(const struct sk_buff *skb, __u16 *mssp) { const struct iphdr *iph = ip_hdr(skb); const struct tcphdr *th = tcp_hdr(skb); return __cookie_v4_init_sequence(iph, th, mssp); } /* * Check if a ack sequence number is a valid syncookie. * Return the decoded mss if it is, or 0 if not. */ int __cookie_v4_check(const struct iphdr *iph, const struct tcphdr *th, u32 cookie) { __u32 seq = ntohl(th->seq) - 1; __u32 mssind = check_tcp_syn_cookie(cookie, iph->saddr, iph->daddr, th->source, th->dest, seq); return mssind < ARRAY_SIZE(msstab) ? msstab[mssind] : 0; } EXPORT_SYMBOL_GPL(__cookie_v4_check); struct sock *tcp_get_cookie_sock(struct sock *sk, struct sk_buff *skb, struct request_sock *req, struct dst_entry *dst, u32 tsoff) { struct inet_connection_sock *icsk = inet_csk(sk); struct sock *child; bool own_req; child = icsk->icsk_af_ops->syn_recv_sock(sk, skb, req, dst, NULL, &own_req); if (child) { refcount_set(&req->rsk_refcnt, 1); tcp_sk(child)->tsoffset = tsoff; sock_rps_save_rxhash(child, skb); if (rsk_drop_req(req)) { reqsk_put(req); return child; } if (inet_csk_reqsk_queue_add(sk, req, child)) return child; bh_unlock_sock(child); sock_put(child); } __reqsk_free(req); return NULL; } EXPORT_SYMBOL(tcp_get_cookie_sock); /* * when syncookies are in effect and tcp timestamps are enabled we stored * additional tcp options in the timestamp. * This extracts these options from the timestamp echo. * * return false if we decode a tcp option that is disabled * on the host. */ bool cookie_timestamp_decode(const struct net *net, struct tcp_options_received *tcp_opt) { /* echoed timestamp, lowest bits contain options */ u32 options = tcp_opt->rcv_tsecr; if (!tcp_opt->saw_tstamp) { tcp_clear_options(tcp_opt); return true; } if (!READ_ONCE(net->ipv4.sysctl_tcp_timestamps)) return false; tcp_opt->sack_ok = (options & TS_OPT_SACK) ? TCP_SACK_SEEN : 0; if (tcp_opt->sack_ok && !READ_ONCE(net->ipv4.sysctl_tcp_sack)) return false; if ((options & TS_OPT_WSCALE_MASK) == TS_OPT_WSCALE_MASK) return true; /* no window scaling */ tcp_opt->wscale_ok = 1; tcp_opt->snd_wscale = options & TS_OPT_WSCALE_MASK; return READ_ONCE(net->ipv4.sysctl_tcp_window_scaling) != 0; } EXPORT_SYMBOL(cookie_timestamp_decode); bool cookie_ecn_ok(const struct tcp_options_received *tcp_opt, const struct net *net, const struct dst_entry *dst) { bool ecn_ok = tcp_opt->rcv_tsecr & TS_OPT_ECN; if (!ecn_ok) return false; if (READ_ONCE(net->ipv4.sysctl_tcp_ecn)) return true; return dst_feature(dst, RTAX_FEATURE_ECN); } EXPORT_SYMBOL(cookie_ecn_ok); struct request_sock *cookie_tcp_reqsk_alloc(const struct request_sock_ops *ops, const struct tcp_request_sock_ops *af_ops, struct sock *sk, struct sk_buff *skb) { struct tcp_request_sock *treq; struct request_sock *req; if (sk_is_mptcp(sk)) req = mptcp_subflow_reqsk_alloc(ops, sk, false); else req = inet_reqsk_alloc(ops, sk, false); if (!req) return NULL; treq = tcp_rsk(req); /* treq->af_specific might be used to perform TCP_MD5 lookup */ treq->af_specific = af_ops; treq->syn_tos = TCP_SKB_CB(skb)->ip_dsfield; #if IS_ENABLED(CONFIG_MPTCP) treq->is_mptcp = sk_is_mptcp(sk); if (treq->is_mptcp) { int err = mptcp_subflow_init_cookie_req(req, sk, skb); if (err) { reqsk_free(req); return NULL; } } #endif return req; } EXPORT_SYMBOL_GPL(cookie_tcp_reqsk_alloc); /* On input, sk is a listener. * Output is listener if incoming packet would not create a child * NULL if memory could not be allocated. */ struct sock *cookie_v4_check(struct sock *sk, struct sk_buff *skb) { struct ip_options *opt = &TCP_SKB_CB(skb)->header.h4.opt; struct tcp_options_received tcp_opt; struct inet_request_sock *ireq; struct tcp_request_sock *treq; struct tcp_sock *tp = tcp_sk(sk); const struct tcphdr *th = tcp_hdr(skb); __u32 cookie = ntohl(th->ack_seq) - 1; struct sock *ret = sk; struct request_sock *req; int full_space, mss; struct rtable *rt; __u8 rcv_wscale; struct flowi4 fl4; u32 tsoff = 0; if (!READ_ONCE(sock_net(sk)->ipv4.sysctl_tcp_syncookies) || !th->ack || th->rst) goto out; if (tcp_synq_no_recent_overflow(sk)) goto out; mss = __cookie_v4_check(ip_hdr(skb), th, cookie); if (mss == 0) { __NET_INC_STATS(sock_net(sk), LINUX_MIB_SYNCOOKIESFAILED); goto out; } __NET_INC_STATS(sock_net(sk), LINUX_MIB_SYNCOOKIESRECV); /* check for timestamp cookie support */ memset(&tcp_opt, 0, sizeof(tcp_opt)); tcp_parse_options(sock_net(sk), skb, &tcp_opt, 0, NULL); if (tcp_opt.saw_tstamp && tcp_opt.rcv_tsecr) { tsoff = secure_tcp_ts_off(sock_net(sk), ip_hdr(skb)->daddr, ip_hdr(skb)->saddr); tcp_opt.rcv_tsecr -= tsoff; } if (!cookie_timestamp_decode(sock_net(sk), &tcp_opt)) goto out; ret = NULL; req = cookie_tcp_reqsk_alloc(&tcp_request_sock_ops, &tcp_request_sock_ipv4_ops, sk, skb); if (!req) goto out; ireq = inet_rsk(req); treq = tcp_rsk(req); treq->rcv_isn = ntohl(th->seq) - 1; treq->snt_isn = cookie; treq->ts_off = 0; treq->txhash = net_tx_rndhash(); req->mss = mss; ireq->ir_num = ntohs(th->dest); ireq->ir_rmt_port = th->source; sk_rcv_saddr_set(req_to_sk(req), ip_hdr(skb)->daddr); sk_daddr_set(req_to_sk(req), ip_hdr(skb)->saddr); ireq->ir_mark = inet_request_mark(sk, skb); ireq->snd_wscale = tcp_opt.snd_wscale; ireq->sack_ok = tcp_opt.sack_ok; ireq->wscale_ok = tcp_opt.wscale_ok; ireq->tstamp_ok = tcp_opt.saw_tstamp; req->ts_recent = tcp_opt.saw_tstamp ? tcp_opt.rcv_tsval : 0; treq->snt_synack = 0; treq->tfo_listener = false; if (IS_ENABLED(CONFIG_SMC)) ireq->smc_ok = 0; ireq->ir_iif = inet_request_bound_dev_if(sk, skb); /* We throwed the options of the initial SYN away, so we hope * the ACK carries the same options again (see RFC1122 4.2.3.8) */ RCU_INIT_POINTER(ireq->ireq_opt, tcp_v4_save_options(sock_net(sk), skb)); if (security_inet_conn_request(sk, skb, req)) { reqsk_free(req); goto out; } req->num_retrans = 0; /* * We need to lookup the route here to get at the correct * window size. We should better make sure that the window size * hasn't changed since we received the original syn, but I see * no easy way to do this. */ flowi4_init_output(&fl4, ireq->ir_iif, ireq->ir_mark, ip_sock_rt_tos(sk), ip_sock_rt_scope(sk), IPPROTO_TCP, inet_sk_flowi_flags(sk), opt->srr ? opt->faddr : ireq->ir_rmt_addr, ireq->ir_loc_addr, th->source, th->dest, sk->sk_uid); security_req_classify_flow(req, flowi4_to_flowi_common(&fl4)); rt = ip_route_output_key(sock_net(sk), &fl4); if (IS_ERR(rt)) { reqsk_free(req); goto out; } /* Try to redo what tcp_v4_send_synack did. */ req->rsk_window_clamp = tp->window_clamp ? :dst_metric(&rt->dst, RTAX_WINDOW); /* limit the window selection if the user enforce a smaller rx buffer */ full_space = tcp_full_space(sk); if (sk->sk_userlocks & SOCK_RCVBUF_LOCK && (req->rsk_window_clamp > full_space || req->rsk_window_clamp == 0)) req->rsk_window_clamp = full_space; tcp_select_initial_window(sk, full_space, req->mss, &req->rsk_rcv_wnd, &req->rsk_window_clamp, ireq->wscale_ok, &rcv_wscale, dst_metric(&rt->dst, RTAX_INITRWND)); ireq->rcv_wscale = rcv_wscale; ireq->ecn_ok = cookie_ecn_ok(&tcp_opt, sock_net(sk), &rt->dst); ret = tcp_get_cookie_sock(sk, skb, req, &rt->dst, tsoff); /* ip_queue_xmit() depends on our flow being setup * Normal sockets get it right from inet_csk_route_child_sock() */ if (ret) inet_sk(ret)->cork.fl.u.ip4 = fl4; out: return ret; } |
9 6 9 1 9 65 65 21 65 65 65 65 67 20 1 20 20 20 20 19 20 20 20 45 45 45 45 45 4 45 1 1 45 45 5 45 45 44 45 45 45 2 2 2 2 2 2 2 10 10 10 10 10 1 2 7 1 | 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60 61 62 63 64 65 66 67 68 69 70 71 72 73 74 75 76 77 78 79 80 81 82 83 84 85 86 87 88 89 90 91 92 93 94 95 96 97 98 99 100 101 102 103 104 105 106 107 108 109 110 111 112 113 114 115 116 117 118 119 120 121 122 123 124 125 126 127 128 129 130 131 132 133 134 135 136 137 138 139 140 141 142 143 144 145 146 147 148 149 150 151 152 153 154 155 156 157 158 159 160 161 162 163 164 165 166 167 168 169 170 171 172 173 174 175 176 177 178 179 180 181 182 183 184 185 186 187 188 189 190 191 192 193 194 195 196 197 198 199 200 201 202 203 204 205 206 207 208 209 210 211 212 213 214 215 216 217 218 219 220 221 222 223 224 225 226 227 228 229 230 231 232 233 234 235 236 237 238 239 240 241 242 243 244 245 246 247 248 249 250 251 252 253 254 255 256 257 258 259 260 261 262 263 264 265 266 267 268 269 270 271 272 273 274 275 276 277 278 279 280 281 282 283 284 285 286 287 288 289 290 291 292 293 294 295 296 297 298 299 300 301 302 303 304 305 306 307 308 309 310 311 312 313 314 315 316 317 318 319 320 321 322 323 324 325 326 327 328 329 330 331 332 333 334 335 336 337 338 339 340 341 342 343 344 345 346 347 348 | // SPDX-License-Identifier: GPL-2.0 /* * fs/proc_namespace.c - handling of /proc/<pid>/{mounts,mountinfo,mountstats} * * In fact, that's a piece of procfs; it's *almost* isolated from * the rest of fs/proc, but has rather close relationships with * fs/namespace.c, thus here instead of fs/proc * */ #include <linux/mnt_namespace.h> #include <linux/nsproxy.h> #include <linux/security.h> #include <linux/fs_struct.h> #include <linux/sched/task.h> #include "proc/internal.h" /* only for get_proc_task() in ->open() */ #include "pnode.h" #include "internal.h" static __poll_t mounts_poll(struct file *file, poll_table *wait) { struct seq_file *m = file->private_data; struct proc_mounts *p = m->private; struct mnt_namespace *ns = p->ns; __poll_t res = EPOLLIN | EPOLLRDNORM; int event; poll_wait(file, &p->ns->poll, wait); event = READ_ONCE(ns->event); if (m->poll_event != event) { m->poll_event = event; res |= EPOLLERR | EPOLLPRI; } return res; } struct proc_fs_opts { int flag; const char *str; }; static int show_sb_opts(struct seq_file *m, struct super_block *sb) { static const struct proc_fs_opts fs_opts[] = { { SB_SYNCHRONOUS, ",sync" }, { SB_DIRSYNC, ",dirsync" }, { SB_MANDLOCK, ",mand" }, { SB_LAZYTIME, ",lazytime" }, { 0, NULL } }; const struct proc_fs_opts *fs_infop; for (fs_infop = fs_opts; fs_infop->flag; fs_infop++) { if (sb->s_flags & fs_infop->flag) seq_puts(m, fs_infop->str); } return security_sb_show_options(m, sb); } static void show_mnt_opts(struct seq_file *m, struct vfsmount *mnt) { static const struct proc_fs_opts mnt_opts[] = { { MNT_NOSUID, ",nosuid" }, { MNT_NODEV, ",nodev" }, { MNT_NOEXEC, ",noexec" }, { MNT_NOATIME, ",noatime" }, { MNT_NODIRATIME, ",nodiratime" }, { MNT_RELATIME, ",relatime" }, { MNT_NOSYMFOLLOW, ",nosymfollow" }, { 0, NULL } }; const struct proc_fs_opts *fs_infop; for (fs_infop = mnt_opts; fs_infop->flag; fs_infop++) { if (mnt->mnt_flags & fs_infop->flag) seq_puts(m, fs_infop->str); } if (is_idmapped_mnt(mnt)) seq_puts(m, ",idmapped"); } static inline void mangle(struct seq_file *m, const char *s) { seq_escape(m, s, " \t\n\\#"); } static void show_type(struct seq_file *m, struct super_block *sb) { mangle(m, sb->s_type->name); if (sb->s_subtype) { seq_putc(m, '.'); mangle(m, sb->s_subtype); } } static int show_vfsmnt(struct seq_file *m, struct vfsmount *mnt) { struct proc_mounts *p = m->private; struct mount *r = real_mount(mnt); struct path mnt_path = { .dentry = mnt->mnt_root, .mnt = mnt }; struct super_block *sb = mnt_path.dentry->d_sb; int err; if (sb->s_op->show_devname) { err = sb->s_op->show_devname(m, mnt_path.dentry); if (err) goto out; } else { mangle(m, r->mnt_devname ? r->mnt_devname : "none"); } seq_putc(m, ' '); /* mountpoints outside of chroot jail will give SEQ_SKIP on this */ err = seq_path_root(m, &mnt_path, &p->root, " \t\n\\"); if (err) goto out; seq_putc(m, ' '); show_type(m, sb); seq_puts(m, __mnt_is_readonly(mnt) ? " ro" : " rw"); err = show_sb_opts(m, sb); if (err) goto out; show_mnt_opts(m, mnt); if (sb->s_op->show_options) err = sb->s_op->show_options(m, mnt_path.dentry); seq_puts(m, " 0 0\n"); out: return err; } static int show_mountinfo(struct seq_file *m, struct vfsmount *mnt) { struct proc_mounts *p = m->private; struct mount *r = real_mount(mnt); struct super_block *sb = mnt->mnt_sb; struct path mnt_path = { .dentry = mnt->mnt_root, .mnt = mnt }; int err; seq_printf(m, "%i %i %u:%u ", r->mnt_id, r->mnt_parent->mnt_id, MAJOR(sb->s_dev), MINOR(sb->s_dev)); if (sb->s_op->show_path) { err = sb->s_op->show_path(m, mnt->mnt_root); if (err) goto out; } else { seq_dentry(m, mnt->mnt_root, " \t\n\\"); } seq_putc(m, ' '); /* mountpoints outside of chroot jail will give SEQ_SKIP on this */ err = seq_path_root(m, &mnt_path, &p->root, " \t\n\\"); if (err) goto out; seq_puts(m, mnt->mnt_flags & MNT_READONLY ? " ro" : " rw"); show_mnt_opts(m, mnt); /* Tagged fields ("foo:X" or "bar") */ if (IS_MNT_SHARED(r)) seq_printf(m, " shared:%i", r->mnt_group_id); if (IS_MNT_SLAVE(r)) { int master = r->mnt_master->mnt_group_id; int dom = get_dominating_id(r, &p->root); seq_printf(m, " master:%i", master); if (dom && dom != master) seq_printf(m, " propagate_from:%i", dom); } if (IS_MNT_UNBINDABLE(r)) seq_puts(m, " unbindable"); /* Filesystem specific data */ seq_puts(m, " - "); show_type(m, sb); seq_putc(m, ' '); if (sb->s_op->show_devname) { err = sb->s_op->show_devname(m, mnt->mnt_root); if (err) goto out; } else { mangle(m, r->mnt_devname ? r->mnt_devname : "none"); } seq_puts(m, sb_rdonly(sb) ? " ro" : " rw"); err = show_sb_opts(m, sb); if (err) goto out; if (sb->s_op->show_options) err = sb->s_op->show_options(m, mnt->mnt_root); seq_putc(m, '\n'); out: return err; } static int show_vfsstat(struct seq_file *m, struct vfsmount *mnt) { struct proc_mounts *p = m->private; struct mount *r = real_mount(mnt); struct path mnt_path = { .dentry = mnt->mnt_root, .mnt = mnt }; struct super_block *sb = mnt_path.dentry->d_sb; int err; /* device */ if (sb->s_op->show_devname) { seq_puts(m, "device "); err = sb->s_op->show_devname(m, mnt_path.dentry); if (err) goto out; } else { if (r->mnt_devname) { seq_puts(m, "device "); mangle(m, r->mnt_devname); } else seq_puts(m, "no device"); } /* mount point */ seq_puts(m, " mounted on "); /* mountpoints outside of chroot jail will give SEQ_SKIP on this */ err = seq_path_root(m, &mnt_path, &p->root, " \t\n\\"); if (err) goto out; seq_putc(m, ' '); /* file system type */ seq_puts(m, "with fstype "); show_type(m, sb); /* optional statistics */ if (sb->s_op->show_stats) { seq_putc(m, ' '); err = sb->s_op->show_stats(m, mnt_path.dentry); } seq_putc(m, '\n'); out: return err; } static int mounts_open_common(struct inode *inode, struct file *file, int (*show)(struct seq_file *, struct vfsmount *)) { struct task_struct *task = get_proc_task(inode); struct nsproxy *nsp; struct mnt_namespace *ns = NULL; struct path root; struct proc_mounts *p; struct seq_file *m; int ret = -EINVAL; if (!task) goto err; task_lock(task); nsp = task->nsproxy; if (!nsp || !nsp->mnt_ns) { task_unlock(task); put_task_struct(task); goto err; } ns = nsp->mnt_ns; get_mnt_ns(ns); if (!task->fs) { task_unlock(task); put_task_struct(task); ret = -ENOENT; goto err_put_ns; } get_fs_root(task->fs, &root); task_unlock(task); put_task_struct(task); ret = seq_open_private(file, &mounts_op, sizeof(struct proc_mounts)); if (ret) goto err_put_path; m = file->private_data; m->poll_event = ns->event; p = m->private; p->ns = ns; p->root = root; p->show = show; INIT_LIST_HEAD(&p->cursor.mnt_list); p->cursor.mnt.mnt_flags = MNT_CURSOR; return 0; err_put_path: path_put(&root); err_put_ns: put_mnt_ns(ns); err: return ret; } static int mounts_release(struct inode *inode, struct file *file) { struct seq_file *m = file->private_data; struct proc_mounts *p = m->private; path_put(&p->root); mnt_cursor_del(p->ns, &p->cursor); put_mnt_ns(p->ns); return seq_release_private(inode, file); } static int mounts_open(struct inode *inode, struct file *file) { return mounts_open_common(inode, file, show_vfsmnt); } static int mountinfo_open(struct inode *inode, struct file *file) { return mounts_open_common(inode, file, show_mountinfo); } static int mountstats_open(struct inode *inode, struct file *file) { return mounts_open_common(inode, file, show_vfsstat); } const struct file_operations proc_mounts_operations = { .open = mounts_open, .read_iter = seq_read_iter, .splice_read = copy_splice_read, .llseek = seq_lseek, .release = mounts_release, .poll = mounts_poll, }; const struct file_operations proc_mountinfo_operations = { .open = mountinfo_open, .read_iter = seq_read_iter, .splice_read = copy_splice_read, .llseek = seq_lseek, .release = mounts_release, .poll = mounts_poll, }; const struct file_operations proc_mountstats_operations = { .open = mountstats_open, .read_iter = seq_read_iter, .splice_read = copy_splice_read, .llseek = seq_lseek, .release = mounts_release, }; |
2 2 9 2 17 21 21 21 21 19 21 2 2 19 2 21 19 21 4 4 4 4 1 3 2 4 4 1 3 2 2 2 1 2 1 1 1 2 1 1 1 1 1 1 1 1 6 6 6 1 1 1 1 5 6 6 11 10 1 10 10 10 10 10 10 1 9 8 8 1 7 7 2 7 7 1 6 6 5 1 6 5 2 6 11 14 14 13 13 12 1 1 1 12 11 10 10 9 2 7 6 14 14 15 4 3 3 11 11 7 11 11 1 1 1 1 3 1 1 1 80 8 79 1 12 12 7 6 6 6 3 2 2 1 1 1 1 3 3 2 2 3 1 1 1 3 2 4 3 2 2 2 2 1 2 2 1 2 2 2 1 1 1 1 1 51 3 3 3 3 1 1 14 14 2 2 1 3 1 1 1 6 3 8 1 79 80 79 7 1 4 4 3 4 1 1 4 4 3 2 3 4 2 2 2 2 2 2 2 1 1 1 1 1 2 2 14 9 14 14 1 13 10 9 8 8 1 1 8 8 1 7 10 5 10 4 10 2 10 10 23 23 23 23 23 23 23 23 23 23 23 23 23 23 21 23 23 23 23 23 7 7 2 2 2 2 2 7 22 23 23 8 8 8 8 5 4 2 1 1 1 2 1 22 1 21 20 22 1 21 17 17 21 21 21 21 21 21 4 21 21 20 | 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60 61 62 63 64 65 66 67 68 69 70 71 72 73 74 75 76 77 78 79 80 81 82 83 84 85 86 87 88 89 90 91 92 93 94 95 96 97 98 99 100 101 102 103 104 105 106 107 108 109 110 111 112 113 114 115 116 117 118 119 120 121 122 123 124 125 126 127 128 129 130 131 132 133 134 135 136 137 138 139 140 141 142 143 144 145 146 147 148 149 150 151 152 153 154 155 156 157 158 159 160 161 162 163 164 165 166 167 168 169 170 171 172 173 174 175 176 177 178 179 180 181 182 183 184 185 186 187 188 189 190 191 192 193 194 195 196 197 198 199 200 201 202 203 204 205 206 207 208 209 210 211 212 213 214 215 216 217 218 219 220 221 222 223 224 225 226 227 228 229 230 231 232 233 234 235 236 237 238 239 240 241 242 243 244 245 246 247 248 249 250 251 252 253 254 255 256 257 258 259 260 261 262 263 264 265 266 267 268 269 270 271 272 273 274 275 276 277 278 279 280 281 282 283 284 285 286 287 288 289 290 291 292 293 294 295 296 297 298 299 300 301 302 303 304 305 306 307 308 309 310 311 312 313 314 315 316 317 318 319 320 321 322 323 324 325 326 327 328 329 330 331 332 333 334 335 336 337 338 339 340 341 342 343 344 345 346 347 348 349 350 351 352 353 354 355 356 357 358 359 360 361 362 363 364 365 366 367 368 369 370 371 372 373 374 375 376 377 378 379 380 381 382 383 384 385 386 387 388 389 390 391 392 393 394 395 396 397 398 399 400 401 402 403 404 405 406 407 408 409 410 411 412 413 414 415 416 417 418 419 420 421 422 423 424 425 426 427 428 429 430 431 432 433 434 435 436 437 438 439 440 441 442 443 444 445 446 447 448 449 450 451 452 453 454 455 456 457 458 459 460 461 462 463 464 465 466 467 468 469 470 471 472 473 474 475 476 477 478 479 480 481 482 483 484 485 486 487 488 489 490 491 492 493 494 495 496 497 498 499 500 501 502 503 504 505 506 507 508 509 510 511 512 513 514 515 516 517 518 519 520 521 522 523 524 525 526 527 528 529 530 531 532 533 534 535 536 537 538 539 540 541 542 543 544 545 546 547 548 549 550 551 552 553 554 555 556 557 558 559 560 561 562 563 564 565 566 567 568 569 570 571 572 573 574 575 576 577 578 579 580 581 582 583 584 585 586 587 588 589 590 591 592 593 594 595 596 597 598 599 600 601 602 603 604 605 606 607 608 609 610 611 612 613 614 615 616 617 618 619 620 621 622 623 624 625 626 627 628 629 630 631 632 633 634 635 636 637 638 639 640 641 642 643 644 645 646 647 648 649 650 651 652 653 654 655 656 657 658 659 660 661 662 663 664 665 666 667 668 669 670 671 672 673 674 675 676 677 678 679 680 681 682 683 684 685 686 687 688 689 690 691 692 693 694 695 696 697 698 699 700 701 702 703 704 705 706 707 708 709 710 711 712 713 714 715 716 717 718 719 720 721 722 723 724 725 726 727 728 729 730 731 732 733 734 735 736 737 738 739 740 741 742 743 744 745 746 747 748 749 750 751 752 753 754 755 756 757 758 759 760 761 762 763 764 765 766 767 768 769 770 771 772 773 774 775 776 777 778 779 780 781 782 783 784 785 786 787 788 789 790 791 792 793 794 795 796 797 798 799 800 801 802 803 804 805 806 807 808 809 810 811 812 813 814 815 816 817 818 819 820 821 822 823 824 825 826 827 828 829 830 831 832 833 834 835 836 837 838 839 840 841 842 843 844 845 846 847 848 849 850 851 852 853 854 855 856 857 858 859 860 861 862 863 864 865 866 867 868 869 870 871 872 873 874 875 876 877 878 879 880 881 882 883 884 885 886 887 888 889 890 891 892 893 894 895 896 897 898 899 900 901 902 903 904 905 906 907 908 909 910 911 912 913 914 915 916 917 918 919 920 921 922 923 924 925 926 927 928 929 930 931 932 933 934 935 936 937 938 939 940 941 942 943 944 945 946 947 948 949 950 951 952 953 954 955 956 957 958 959 960 961 962 963 964 965 966 967 968 969 970 971 972 973 974 975 976 977 978 979 980 981 982 983 984 985 986 987 988 989 990 991 992 993 994 995 996 997 998 999 1000 1001 1002 1003 1004 1005 1006 1007 1008 1009 1010 1011 1012 1013 1014 1015 1016 1017 1018 1019 1020 1021 1022 1023 1024 1025 1026 1027 1028 1029 1030 1031 1032 1033 1034 1035 1036 1037 1038 1039 1040 1041 1042 1043 1044 1045 1046 1047 1048 1049 1050 1051 1052 1053 1054 1055 1056 1057 1058 1059 1060 1061 1062 1063 1064 1065 1066 1067 1068 1069 1070 1071 1072 1073 1074 1075 1076 1077 1078 1079 1080 1081 1082 1083 1084 1085 1086 1087 1088 1089 1090 1091 1092 1093 1094 1095 1096 1097 1098 1099 1100 1101 1102 1103 1104 1105 1106 1107 1108 1109 1110 1111 1112 1113 1114 1115 1116 1117 1118 1119 1120 1121 1122 1123 1124 1125 1126 1127 1128 1129 1130 1131 1132 1133 1134 1135 1136 1137 1138 1139 1140 1141 1142 1143 1144 1145 1146 1147 1148 1149 1150 1151 1152 1153 1154 1155 1156 1157 1158 1159 1160 1161 1162 1163 1164 1165 1166 1167 1168 1169 1170 1171 1172 1173 1174 1175 1176 1177 1178 1179 1180 1181 1182 1183 1184 1185 1186 1187 1188 1189 1190 1191 1192 1193 1194 1195 1196 1197 1198 1199 1200 1201 1202 1203 1204 1205 1206 1207 1208 1209 1210 1211 1212 1213 1214 1215 1216 1217 1218 1219 1220 1221 1222 1223 1224 1225 1226 1227 1228 1229 1230 1231 1232 1233 1234 1235 1236 1237 1238 1239 1240 1241 1242 1243 1244 1245 1246 1247 1248 1249 1250 1251 1252 1253 1254 1255 1256 1257 1258 1259 1260 1261 1262 1263 1264 1265 1266 1267 1268 1269 1270 1271 1272 1273 1274 1275 1276 1277 1278 1279 1280 1281 1282 1283 1284 1285 1286 1287 1288 1289 1290 1291 1292 1293 1294 1295 1296 1297 1298 1299 1300 1301 1302 1303 1304 1305 1306 1307 1308 1309 1310 1311 1312 1313 1314 1315 1316 1317 1318 1319 1320 1321 1322 1323 1324 1325 1326 1327 1328 1329 1330 1331 1332 1333 1334 1335 1336 1337 1338 1339 1340 1341 1342 1343 1344 1345 1346 1347 1348 1349 1350 1351 1352 1353 1354 1355 1356 1357 1358 1359 1360 1361 1362 1363 1364 1365 1366 1367 1368 1369 1370 1371 1372 1373 1374 1375 1376 1377 1378 1379 1380 1381 1382 1383 1384 1385 1386 1387 1388 1389 1390 1391 1392 1393 1394 1395 1396 1397 1398 1399 1400 1401 1402 1403 1404 1405 1406 1407 1408 1409 1410 1411 1412 1413 1414 1415 1416 1417 1418 1419 1420 1421 1422 1423 1424 1425 1426 1427 1428 1429 1430 1431 1432 1433 1434 1435 1436 1437 1438 1439 1440 1441 1442 1443 1444 1445 1446 1447 1448 1449 1450 1451 1452 1453 1454 1455 1456 1457 1458 1459 1460 1461 1462 1463 1464 1465 1466 1467 1468 1469 1470 1471 1472 1473 1474 1475 1476 1477 1478 1479 1480 1481 1482 1483 1484 1485 1486 1487 1488 1489 1490 1491 1492 1493 1494 1495 1496 1497 1498 1499 1500 1501 1502 1503 1504 1505 1506 1507 1508 1509 1510 1511 1512 1513 1514 1515 1516 1517 1518 1519 1520 1521 1522 1523 1524 1525 1526 1527 1528 1529 1530 1531 1532 1533 1534 1535 1536 1537 1538 1539 1540 1541 1542 1543 1544 1545 1546 1547 1548 1549 1550 1551 1552 1553 1554 1555 1556 1557 1558 1559 1560 1561 1562 1563 1564 1565 1566 1567 1568 1569 1570 1571 1572 1573 1574 1575 1576 1577 1578 1579 1580 1581 1582 1583 1584 1585 1586 1587 1588 1589 1590 1591 1592 1593 1594 1595 1596 1597 1598 1599 1600 1601 1602 1603 1604 1605 1606 1607 1608 1609 1610 1611 1612 1613 1614 1615 1616 1617 1618 1619 1620 1621 1622 1623 1624 1625 1626 1627 1628 1629 1630 1631 1632 1633 1634 1635 1636 1637 1638 1639 1640 1641 1642 1643 1644 1645 1646 1647 1648 1649 1650 1651 1652 1653 1654 1655 1656 1657 1658 1659 1660 1661 1662 1663 1664 1665 1666 1667 1668 1669 1670 1671 1672 1673 1674 1675 1676 1677 1678 1679 1680 1681 1682 1683 1684 1685 1686 1687 1688 1689 1690 1691 1692 1693 1694 1695 1696 1697 1698 1699 1700 1701 1702 1703 1704 1705 1706 1707 1708 1709 1710 1711 1712 1713 1714 1715 1716 1717 1718 1719 1720 1721 1722 1723 1724 1725 1726 1727 1728 1729 1730 1731 1732 1733 1734 1735 1736 1737 1738 1739 1740 1741 1742 1743 1744 1745 1746 1747 1748 1749 1750 1751 1752 1753 1754 1755 1756 1757 1758 1759 1760 1761 1762 1763 1764 1765 1766 1767 1768 1769 1770 1771 1772 1773 1774 1775 1776 1777 1778 1779 1780 1781 1782 1783 1784 1785 1786 1787 1788 1789 1790 1791 1792 1793 1794 1795 1796 1797 1798 1799 1800 1801 1802 1803 1804 1805 1806 1807 1808 1809 1810 1811 1812 1813 1814 1815 1816 1817 1818 1819 1820 1821 1822 1823 1824 1825 1826 1827 1828 1829 1830 1831 1832 1833 1834 1835 1836 1837 1838 1839 1840 1841 1842 1843 1844 1845 1846 1847 1848 1849 1850 1851 1852 1853 1854 1855 1856 1857 1858 1859 1860 1861 1862 1863 1864 1865 1866 1867 1868 1869 1870 1871 1872 1873 1874 1875 1876 1877 1878 1879 1880 1881 1882 1883 1884 1885 1886 1887 1888 1889 1890 1891 1892 1893 1894 1895 1896 1897 1898 1899 1900 1901 1902 1903 1904 1905 1906 1907 1908 1909 1910 1911 1912 1913 1914 1915 1916 1917 1918 1919 1920 1921 1922 1923 1924 1925 1926 1927 1928 1929 1930 1931 1932 1933 1934 1935 1936 1937 1938 1939 1940 1941 1942 1943 1944 1945 1946 1947 1948 1949 1950 1951 1952 1953 1954 1955 1956 1957 1958 1959 1960 1961 1962 1963 1964 1965 1966 1967 1968 1969 1970 1971 1972 1973 1974 1975 1976 1977 1978 1979 1980 1981 1982 1983 1984 1985 1986 1987 1988 1989 1990 1991 1992 1993 1994 1995 1996 1997 1998 1999 2000 2001 2002 2003 2004 2005 2006 2007 2008 2009 2010 2011 2012 2013 2014 2015 2016 2017 2018 2019 2020 2021 2022 2023 2024 2025 2026 2027 2028 2029 2030 2031 2032 2033 2034 2035 2036 2037 2038 2039 2040 2041 2042 2043 2044 2045 2046 2047 2048 2049 2050 2051 2052 2053 2054 2055 2056 2057 2058 2059 2060 2061 2062 2063 2064 2065 2066 2067 2068 2069 2070 2071 2072 2073 2074 2075 2076 2077 2078 2079 2080 2081 2082 2083 2084 2085 2086 2087 2088 2089 2090 2091 2092 2093 2094 2095 2096 2097 2098 2099 2100 2101 2102 2103 2104 2105 2106 2107 2108 2109 2110 2111 2112 2113 2114 2115 2116 2117 2118 2119 2120 2121 2122 2123 2124 2125 2126 2127 2128 2129 2130 2131 2132 2133 2134 2135 2136 2137 2138 2139 2140 2141 2142 2143 2144 2145 2146 2147 2148 2149 2150 2151 2152 2153 2154 2155 2156 2157 2158 2159 2160 2161 2162 2163 2164 2165 2166 2167 2168 2169 2170 2171 2172 2173 2174 2175 2176 2177 2178 2179 2180 2181 2182 2183 2184 2185 2186 2187 2188 2189 2190 2191 2192 2193 2194 2195 2196 2197 2198 2199 2200 2201 2202 2203 2204 2205 2206 2207 2208 2209 2210 2211 2212 2213 2214 2215 2216 2217 2218 2219 2220 2221 2222 2223 2224 2225 2226 2227 2228 2229 2230 2231 2232 2233 2234 2235 2236 2237 2238 2239 2240 2241 2242 2243 2244 2245 2246 2247 2248 2249 2250 2251 2252 2253 2254 2255 2256 2257 2258 2259 2260 2261 2262 2263 2264 2265 2266 2267 2268 2269 2270 2271 2272 2273 2274 2275 2276 2277 2278 2279 2280 2281 2282 2283 2284 2285 2286 2287 2288 2289 2290 2291 2292 2293 2294 2295 2296 2297 2298 2299 2300 2301 2302 2303 2304 2305 2306 2307 2308 2309 2310 2311 2312 2313 2314 2315 2316 2317 2318 2319 2320 2321 2322 2323 2324 2325 2326 2327 2328 2329 2330 2331 2332 2333 2334 2335 2336 2337 2338 2339 2340 2341 2342 2343 2344 2345 2346 2347 2348 2349 2350 2351 2352 2353 2354 2355 2356 2357 2358 2359 2360 2361 2362 2363 2364 2365 2366 2367 2368 2369 2370 2371 2372 2373 2374 2375 2376 2377 2378 2379 2380 2381 2382 2383 2384 2385 2386 2387 2388 2389 2390 2391 2392 2393 2394 2395 2396 2397 2398 2399 2400 2401 2402 2403 2404 2405 2406 2407 2408 2409 2410 2411 2412 2413 2414 2415 2416 2417 2418 2419 2420 2421 2422 2423 2424 2425 2426 2427 2428 2429 2430 2431 2432 2433 2434 2435 2436 2437 2438 2439 2440 2441 2442 2443 2444 2445 2446 2447 2448 2449 2450 2451 2452 2453 2454 2455 2456 2457 2458 2459 2460 2461 2462 2463 2464 2465 2466 2467 2468 2469 2470 2471 2472 2473 2474 2475 2476 2477 2478 2479 2480 2481 2482 2483 2484 2485 2486 2487 2488 2489 2490 2491 2492 2493 2494 2495 2496 2497 2498 2499 2500 2501 2502 2503 2504 2505 2506 2507 2508 2509 2510 2511 2512 2513 2514 2515 2516 2517 2518 2519 2520 2521 2522 2523 2524 2525 2526 2527 2528 2529 2530 2531 2532 2533 2534 2535 2536 2537 2538 2539 2540 2541 2542 2543 2544 2545 2546 2547 2548 2549 2550 2551 2552 2553 2554 2555 2556 2557 2558 2559 2560 2561 2562 2563 2564 2565 2566 2567 2568 2569 2570 2571 2572 2573 2574 2575 2576 2577 2578 2579 2580 2581 2582 2583 2584 2585 2586 2587 2588 2589 2590 2591 2592 2593 2594 2595 2596 2597 2598 2599 2600 2601 2602 2603 2604 2605 2606 2607 2608 2609 2610 2611 2612 2613 2614 2615 2616 2617 2618 2619 2620 2621 2622 2623 2624 2625 2626 2627 2628 2629 2630 | // SPDX-License-Identifier: GPL-2.0-or-later /* * History: * Started: Aug 9 by Lawrence Foard (entropy@world.std.com), * to allow user process control of SCSI devices. * Development Sponsored by Killy Corp. NY NY * * Original driver (sg.c): * Copyright (C) 1992 Lawrence Foard * Version 2 and 3 extensions to driver: * Copyright (C) 1998 - 2014 Douglas Gilbert */ static int sg_version_num = 30536; /* 2 digits for each component */ #define SG_VERSION_STR "3.5.36" /* * D. P. Gilbert (dgilbert@interlog.com), notes: * - scsi logging is available via SCSI_LOG_TIMEOUT macros. First * the kernel/module needs to be built with CONFIG_SCSI_LOGGING * (otherwise the macros compile to empty statements). * */ #include <linux/module.h> #include <linux/fs.h> #include <linux/kernel.h> #include <linux/sched.h> #include <linux/string.h> #include <linux/mm.h> #include <linux/errno.h> #include <linux/mtio.h> #include <linux/ioctl.h> #include <linux/major.h> #include <linux/slab.h> #include <linux/fcntl.h> #include <linux/init.h> #include <linux/poll.h> #include <linux/moduleparam.h> #include <linux/cdev.h> #include <linux/idr.h> #include <linux/seq_file.h> #include <linux/blkdev.h> #include <linux/delay.h> #include <linux/blktrace_api.h> #include <linux/mutex.h> #include <linux/atomic.h> #include <linux/ratelimit.h> #include <linux/uio.h> #include <linux/cred.h> /* for sg_check_file_access() */ #include <scsi/scsi.h> #include <scsi/scsi_cmnd.h> #include <scsi/scsi_dbg.h> #include <scsi/scsi_device.h> #include <scsi/scsi_driver.h> #include <scsi/scsi_eh.h> #include <scsi/scsi_host.h> #include <scsi/scsi_ioctl.h> #include <scsi/scsi_tcq.h> #include <scsi/sg.h> #include "scsi_logging.h" #ifdef CONFIG_SCSI_PROC_FS #include <linux/proc_fs.h> static char *sg_version_date = "20140603"; static int sg_proc_init(void); #endif #define SG_ALLOW_DIO_DEF 0 #define SG_MAX_DEVS (1 << MINORBITS) /* SG_MAX_CDB_SIZE should be 260 (spc4r37 section 3.1.30) however the type * of sg_io_hdr::cmd_len can only represent 255. All SCSI commands greater * than 16 bytes are "variable length" whose length is a multiple of 4 */ #define SG_MAX_CDB_SIZE 252 #define SG_DEFAULT_TIMEOUT mult_frac(SG_DEFAULT_TIMEOUT_USER, HZ, USER_HZ) static int sg_big_buff = SG_DEF_RESERVED_SIZE; /* N.B. This variable is readable and writeable via /proc/scsi/sg/def_reserved_size . Each time sg_open() is called a buffer of this size (or less if there is not enough memory) will be reserved for use by this file descriptor. [Deprecated usage: this variable is also readable via /proc/sys/kernel/sg-big-buff if the sg driver is built into the kernel (i.e. it is not a module).] */ static int def_reserved_size = -1; /* picks up init parameter */ static int sg_allow_dio = SG_ALLOW_DIO_DEF; static int scatter_elem_sz = SG_SCATTER_SZ; static int scatter_elem_sz_prev = SG_SCATTER_SZ; #define SG_SECTOR_SZ 512 static int sg_add_device(struct device *); static void sg_remove_device(struct device *); static DEFINE_IDR(sg_index_idr); static DEFINE_RWLOCK(sg_index_lock); /* Also used to lock file descriptor list for device */ static struct class_interface sg_interface = { .add_dev = sg_add_device, .remove_dev = sg_remove_device, }; typedef struct sg_scatter_hold { /* holding area for scsi scatter gather info */ unsigned short k_use_sg; /* Count of kernel scatter-gather pieces */ unsigned sglist_len; /* size of malloc'd scatter-gather list ++ */ unsigned bufflen; /* Size of (aggregate) data buffer */ struct page **pages; int page_order; char dio_in_use; /* 0->indirect IO (or mmap), 1->dio */ unsigned char cmd_opcode; /* first byte of command */ } Sg_scatter_hold; struct sg_device; /* forward declarations */ struct sg_fd; typedef struct sg_request { /* SG_MAX_QUEUE requests outstanding per file */ struct list_head entry; /* list entry */ struct sg_fd *parentfp; /* NULL -> not in use */ Sg_scatter_hold data; /* hold buffer, perhaps scatter list */ sg_io_hdr_t header; /* scsi command+info, see <scsi/sg.h> */ unsigned char sense_b[SCSI_SENSE_BUFFERSIZE]; char res_used; /* 1 -> using reserve buffer, 0 -> not ... */ char orphan; /* 1 -> drop on sight, 0 -> normal */ char sg_io_owned; /* 1 -> packet belongs to SG_IO */ /* done protected by rq_list_lock */ char done; /* 0->before bh, 1->before read, 2->read */ struct request *rq; struct bio *bio; struct execute_work ew; } Sg_request; typedef struct sg_fd { /* holds the state of a file descriptor */ struct list_head sfd_siblings; /* protected by device's sfd_lock */ struct sg_device *parentdp; /* owning device */ wait_queue_head_t read_wait; /* queue read until command done */ rwlock_t rq_list_lock; /* protect access to list in req_arr */ struct mutex f_mutex; /* protect against changes in this fd */ int timeout; /* defaults to SG_DEFAULT_TIMEOUT */ int timeout_user; /* defaults to SG_DEFAULT_TIMEOUT_USER */ Sg_scatter_hold reserve; /* buffer held for this file descriptor */ struct list_head rq_list; /* head of request list */ struct fasync_struct *async_qp; /* used by asynchronous notification */ Sg_request req_arr[SG_MAX_QUEUE]; /* used as singly-linked list */ char force_packid; /* 1 -> pack_id input to read(), 0 -> ignored */ char cmd_q; /* 1 -> allow command queuing, 0 -> don't */ unsigned char next_cmd_len; /* 0: automatic, >0: use on next write() */ char keep_orphan; /* 0 -> drop orphan (def), 1 -> keep for read() */ char mmap_called; /* 0 -> mmap() never called on this fd */ char res_in_use; /* 1 -> 'reserve' array in use */ struct kref f_ref; struct execute_work ew; } Sg_fd; typedef struct sg_device { /* holds the state of each scsi generic device */ struct scsi_device *device; wait_queue_head_t open_wait; /* queue open() when O_EXCL present */ struct mutex open_rel_lock; /* held when in open() or release() */ int sg_tablesize; /* adapter's max scatter-gather table size */ u32 index; /* device index number */ struct list_head sfds; rwlock_t sfd_lock; /* protect access to sfd list */ atomic_t detaching; /* 0->device usable, 1->device detaching */ bool exclude; /* 1->open(O_EXCL) succeeded and is active */ int open_cnt; /* count of opens (perhaps < num(sfds) ) */ char sgdebug; /* 0->off, 1->sense, 9->dump dev, 10-> all devs */ char name[DISK_NAME_LEN]; struct cdev * cdev; /* char_dev [sysfs: /sys/cdev/major/sg<n>] */ struct kref d_ref; } Sg_device; /* tasklet or soft irq callback */ static enum rq_end_io_ret sg_rq_end_io(struct request *rq, blk_status_t status); static int sg_start_req(Sg_request *srp, unsigned char *cmd); static int sg_finish_rem_req(Sg_request * srp); static int sg_build_indirect(Sg_scatter_hold * schp, Sg_fd * sfp, int buff_size); static ssize_t sg_new_read(Sg_fd * sfp, char __user *buf, size_t count, Sg_request * srp); static ssize_t sg_new_write(Sg_fd *sfp, struct file *file, const char __user *buf, size_t count, int blocking, int read_only, int sg_io_owned, Sg_request **o_srp); static int sg_common_write(Sg_fd * sfp, Sg_request * srp, unsigned char *cmnd, int timeout, int blocking); static int sg_read_oxfer(Sg_request * srp, char __user *outp, int num_read_xfer); static void sg_remove_scat(Sg_fd * sfp, Sg_scatter_hold * schp); static void sg_build_reserve(Sg_fd * sfp, int req_size); static void sg_link_reserve(Sg_fd * sfp, Sg_request * srp, int size); static void sg_unlink_reserve(Sg_fd * sfp, Sg_request * srp); static Sg_fd *sg_add_sfp(Sg_device * sdp); static void sg_remove_sfp(struct kref *); static Sg_request *sg_get_rq_mark(Sg_fd * sfp, int pack_id, bool *busy); static Sg_request *sg_add_request(Sg_fd * sfp); static int sg_remove_request(Sg_fd * sfp, Sg_request * srp); static Sg_device *sg_get_dev(int dev); static void sg_device_destroy(struct kref *kref); #define SZ_SG_HEADER sizeof(struct sg_header) #define SZ_SG_IO_HDR sizeof(sg_io_hdr_t) #define SZ_SG_IOVEC sizeof(sg_iovec_t) #define SZ_SG_REQ_INFO sizeof(sg_req_info_t) #define sg_printk(prefix, sdp, fmt, a...) \ sdev_prefix_printk(prefix, (sdp)->device, (sdp)->name, fmt, ##a) /* * The SCSI interfaces that use read() and write() as an asynchronous variant of * ioctl(..., SG_IO, ...) are fundamentally unsafe, since there are lots of ways * to trigger read() and write() calls from various contexts with elevated * privileges. This can lead to kernel memory corruption (e.g. if these * interfaces are called through splice()) and privilege escalation inside * userspace (e.g. if a process with access to such a device passes a file * descriptor to a SUID binary as stdin/stdout/stderr). * * This function provides protection for the legacy API by restricting the * calling context. */ static int sg_check_file_access(struct file *filp, const char *caller) { if (filp->f_cred != current_real_cred()) { pr_err_once("%s: process %d (%s) changed security contexts after opening file descriptor, this is not allowed.\n", caller, task_tgid_vnr(current), current->comm); return -EPERM; } return 0; } static int sg_allow_access(struct file *filp, unsigned char *cmd) { struct sg_fd *sfp = filp->private_data; if (sfp->parentdp->device->type == TYPE_SCANNER) return 0; if (!scsi_cmd_allowed(cmd, filp->f_mode & FMODE_WRITE)) return -EPERM; return 0; } static int open_wait(Sg_device *sdp, int flags) { int retval = 0; if (flags & O_EXCL) { while (sdp->open_cnt > 0) { mutex_unlock(&sdp->open_rel_lock); retval = wait_event_interruptible(sdp->open_wait, (atomic_read(&sdp->detaching) || !sdp->open_cnt)); mutex_lock(&sdp->open_rel_lock); if (retval) /* -ERESTARTSYS */ return retval; if (atomic_read(&sdp->detaching)) return -ENODEV; } } else { while (sdp->exclude) { mutex_unlock(&sdp->open_rel_lock); retval = wait_event_interruptible(sdp->open_wait, (atomic_read(&sdp->detaching) || !sdp->exclude)); mutex_lock(&sdp->open_rel_lock); if (retval) /* -ERESTARTSYS */ return retval; if (atomic_read(&sdp->detaching)) return -ENODEV; } } return retval; } /* Returns 0 on success, else a negated errno value */ static int sg_open(struct inode *inode, struct file *filp) { int dev = iminor(inode); int flags = filp->f_flags; struct request_queue *q; Sg_device *sdp; Sg_fd *sfp; int retval; nonseekable_open(inode, filp); if ((flags & O_EXCL) && (O_RDONLY == (flags & O_ACCMODE))) return -EPERM; /* Can't lock it with read only access */ sdp = sg_get_dev(dev); if (IS_ERR(sdp)) return PTR_ERR(sdp); SCSI_LOG_TIMEOUT(3, sg_printk(KERN_INFO, sdp, "sg_open: flags=0x%x\n", flags)); /* This driver's module count bumped by fops_get in <linux/fs.h> */ /* Prevent the device driver from vanishing while we sleep */ retval = scsi_device_get(sdp->device); if (retval) goto sg_put; retval = scsi_autopm_get_device(sdp->device); if (retval) goto sdp_put; /* scsi_block_when_processing_errors() may block so bypass * check if O_NONBLOCK. Permits SCSI commands to be issued * during error recovery. Tread carefully. */ if (!((flags & O_NONBLOCK) || scsi_block_when_processing_errors(sdp->device))) { retval = -ENXIO; /* we are in error recovery for this device */ goto error_out; } mutex_lock(&sdp->open_rel_lock); if (flags & O_NONBLOCK) { if (flags & O_EXCL) { if (sdp->open_cnt > 0) { retval = -EBUSY; goto error_mutex_locked; } } else { if (sdp->exclude) { retval = -EBUSY; goto error_mutex_locked; } } } else { retval = open_wait(sdp, flags); if (retval) /* -ERESTARTSYS or -ENODEV */ goto error_mutex_locked; } /* N.B. at this point we are holding the open_rel_lock */ if (flags & O_EXCL) sdp->exclude = true; if (sdp->open_cnt < 1) { /* no existing opens */ sdp->sgdebug = 0; q = sdp->device->request_queue; sdp->sg_tablesize = queue_max_segments(q); } sfp = sg_add_sfp(sdp); if (IS_ERR(sfp)) { retval = PTR_ERR(sfp); goto out_undo; } filp->private_data = sfp; sdp->open_cnt++; mutex_unlock(&sdp->open_rel_lock); retval = 0; sg_put: kref_put(&sdp->d_ref, sg_device_destroy); return retval; out_undo: if (flags & O_EXCL) { sdp->exclude = false; /* undo if error */ wake_up_interruptible(&sdp->open_wait); } error_mutex_locked: mutex_unlock(&sdp->open_rel_lock); error_out: scsi_autopm_put_device(sdp->device); sdp_put: scsi_device_put(sdp->device); goto sg_put; } /* Release resources associated with a successful sg_open() * Returns 0 on success, else a negated errno value */ static int sg_release(struct inode *inode, struct file *filp) { Sg_device *sdp; Sg_fd *sfp; if ((!(sfp = (Sg_fd *) filp->private_data)) || (!(sdp = sfp->parentdp))) return -ENXIO; SCSI_LOG_TIMEOUT(3, sg_printk(KERN_INFO, sdp, "sg_release\n")); mutex_lock(&sdp->open_rel_lock); scsi_autopm_put_device(sdp->device); kref_put(&sfp->f_ref, sg_remove_sfp); sdp->open_cnt--; /* possibly many open()s waiting on exlude clearing, start many; * only open(O_EXCL)s wait on 0==open_cnt so only start one */ if (sdp->exclude) { sdp->exclude = false; wake_up_interruptible_all(&sdp->open_wait); } else if (0 == sdp->open_cnt) { wake_up_interruptible(&sdp->open_wait); } mutex_unlock(&sdp->open_rel_lock); return 0; } static int get_sg_io_pack_id(int *pack_id, void __user *buf, size_t count) { struct sg_header __user *old_hdr = buf; int reply_len; if (count >= SZ_SG_HEADER) { /* negative reply_len means v3 format, otherwise v1/v2 */ if (get_user(reply_len, &old_hdr->reply_len)) return -EFAULT; if (reply_len >= 0) return get_user(*pack_id, &old_hdr->pack_id); if (in_compat_syscall() && count >= sizeof(struct compat_sg_io_hdr)) { struct compat_sg_io_hdr __user *hp = buf; return get_user(*pack_id, &hp->pack_id); } if (count >= sizeof(struct sg_io_hdr)) { struct sg_io_hdr __user *hp = buf; return get_user(*pack_id, &hp->pack_id); } } /* no valid header was passed, so ignore the pack_id */ *pack_id = -1; return 0; } static ssize_t sg_read(struct file *filp, char __user *buf, size_t count, loff_t * ppos) { Sg_device *sdp; Sg_fd *sfp; Sg_request *srp; int req_pack_id = -1; bool busy; sg_io_hdr_t *hp; struct sg_header *old_hdr; int retval; /* * This could cause a response to be stranded. Close the associated * file descriptor to free up any resources being held. */ retval = sg_check_file_access(filp, __func__); if (retval) return retval; if ((!(sfp = (Sg_fd *) filp->private_data)) || (!(sdp = sfp->parentdp))) return -ENXIO; SCSI_LOG_TIMEOUT(3, sg_printk(KERN_INFO, sdp, "sg_read: count=%d\n", (int) count)); if (sfp->force_packid) retval = get_sg_io_pack_id(&req_pack_id, buf, count); if (retval) return retval; srp = sg_get_rq_mark(sfp, req_pack_id, &busy); if (!srp) { /* now wait on packet to arrive */ if (filp->f_flags & O_NONBLOCK) return -EAGAIN; retval = wait_event_interruptible(sfp->read_wait, ((srp = sg_get_rq_mark(sfp, req_pack_id, &busy)) || (!busy && atomic_read(&sdp->detaching)))); if (!srp) /* signal or detaching */ return retval ? retval : -ENODEV; } if (srp->header.interface_id != '\0') return sg_new_read(sfp, buf, count, srp); hp = &srp->header; old_hdr = kzalloc(SZ_SG_HEADER, GFP_KERNEL); if (!old_hdr) return -ENOMEM; old_hdr->reply_len = (int) hp->timeout; old_hdr->pack_len = old_hdr->reply_len; /* old, strange behaviour */ old_hdr->pack_id = hp->pack_id; old_hdr->twelve_byte = ((srp->data.cmd_opcode >= 0xc0) && (12 == hp->cmd_len)) ? 1 : 0; old_hdr->target_status = hp->masked_status; old_hdr->host_status = hp->host_status; old_hdr->driver_status = hp->driver_status; if ((CHECK_CONDITION & hp->masked_status) || (srp->sense_b[0] & 0x70) == 0x70) { old_hdr->driver_status = DRIVER_SENSE; memcpy(old_hdr->sense_buffer, srp->sense_b, sizeof (old_hdr->sense_buffer)); } switch (hp->host_status) { /* This setup of 'result' is for backward compatibility and is best ignored by the user who should use target, host + driver status */ case DID_OK: case DID_PASSTHROUGH: case DID_SOFT_ERROR: old_hdr->result = 0; break; case DID_NO_CONNECT: case DID_BUS_BUSY: case DID_TIME_OUT: old_hdr->result = EBUSY; break; case DID_BAD_TARGET: case DID_ABORT: case DID_PARITY: case DID_RESET: case DID_BAD_INTR: old_hdr->result = EIO; break; case DID_ERROR: old_hdr->result = (srp->sense_b[0] == 0 && hp->masked_status == GOOD) ? 0 : EIO; break; default: old_hdr->result = EIO; break; } /* Now copy the result back to the user buffer. */ if (count >= SZ_SG_HEADER) { if (copy_to_user(buf, old_hdr, SZ_SG_HEADER)) { retval = -EFAULT; goto free_old_hdr; } buf += SZ_SG_HEADER; if (count > old_hdr->reply_len) count = old_hdr->reply_len; if (count > SZ_SG_HEADER) { if (sg_read_oxfer(srp, buf, count - SZ_SG_HEADER)) { retval = -EFAULT; goto free_old_hdr; } } } else count = (old_hdr->result == 0) ? 0 : -EIO; sg_finish_rem_req(srp); sg_remove_request(sfp, srp); retval = count; free_old_hdr: kfree(old_hdr); return retval; } static ssize_t sg_new_read(Sg_fd * sfp, char __user *buf, size_t count, Sg_request * srp) { sg_io_hdr_t *hp = &srp->header; int err = 0, err2; int len; if (in_compat_syscall()) { if (count < sizeof(struct compat_sg_io_hdr)) { err = -EINVAL; goto err_out; } } else if (count < SZ_SG_IO_HDR) { err = -EINVAL; goto err_out; } hp->sb_len_wr = 0; if ((hp->mx_sb_len > 0) && hp->sbp) { if ((CHECK_CONDITION & hp->masked_status) || (srp->sense_b[0] & 0x70) == 0x70) { int sb_len = SCSI_SENSE_BUFFERSIZE; sb_len = (hp->mx_sb_len > sb_len) ? sb_len : hp->mx_sb_len; len = 8 + (int) srp->sense_b[7]; /* Additional sense length field */ len = (len > sb_len) ? sb_len : len; if (copy_to_user(hp->sbp, srp->sense_b, len)) { err = -EFAULT; goto err_out; } hp->driver_status = DRIVER_SENSE; hp->sb_len_wr = len; } } if (hp->masked_status || hp->host_status || hp->driver_status) hp->info |= SG_INFO_CHECK; err = put_sg_io_hdr(hp, buf); err_out: err2 = sg_finish_rem_req(srp); sg_remove_request(sfp, srp); return err ? : err2 ? : count; } static ssize_t sg_write(struct file *filp, const char __user *buf, size_t count, loff_t * ppos) { int mxsize, cmd_size, k; int input_size, blocking; unsigned char opcode; Sg_device *sdp; Sg_fd *sfp; Sg_request *srp; struct sg_header old_hdr; sg_io_hdr_t *hp; unsigned char cmnd[SG_MAX_CDB_SIZE]; int retval; retval = sg_check_file_access(filp, __func__); if (retval) return retval; if ((!(sfp = (Sg_fd *) filp->private_data)) || (!(sdp = sfp->parentdp))) return -ENXIO; SCSI_LOG_TIMEOUT(3, sg_printk(KERN_INFO, sdp, "sg_write: count=%d\n", (int) count)); if (atomic_read(&sdp->detaching)) return -ENODEV; if (!((filp->f_flags & O_NONBLOCK) || scsi_block_when_processing_errors(sdp->device))) return -ENXIO; if (count < SZ_SG_HEADER) return -EIO; if (copy_from_user(&old_hdr, buf, SZ_SG_HEADER)) return -EFAULT; blocking = !(filp->f_flags & O_NONBLOCK); if (old_hdr.reply_len < 0) return sg_new_write(sfp, filp, buf, count, blocking, 0, 0, NULL); if (count < (SZ_SG_HEADER + 6)) return -EIO; /* The minimum scsi command length is 6 bytes. */ buf += SZ_SG_HEADER; if (get_user(opcode, buf)) return -EFAULT; if (!(srp = sg_add_request(sfp))) { SCSI_LOG_TIMEOUT(1, sg_printk(KERN_INFO, sdp, "sg_write: queue full\n")); return -EDOM; } mutex_lock(&sfp->f_mutex); if (sfp->next_cmd_len > 0) { cmd_size = sfp->next_cmd_len; sfp->next_cmd_len = 0; /* reset so only this write() effected */ } else { cmd_size = COMMAND_SIZE(opcode); /* based on SCSI command group */ if ((opcode >= 0xc0) && old_hdr.twelve_byte) cmd_size = 12; } mutex_unlock(&sfp->f_mutex); SCSI_LOG_TIMEOUT(4, sg_printk(KERN_INFO, sdp, "sg_write: scsi opcode=0x%02x, cmd_size=%d\n", (int) opcode, cmd_size)); /* Determine buffer size. */ input_size = count - cmd_size; mxsize = (input_size > old_hdr.reply_len) ? input_size : old_hdr.reply_len; mxsize -= SZ_SG_HEADER; input_size -= SZ_SG_HEADER; if (input_size < 0) { sg_remove_request(sfp, srp); return -EIO; /* User did not pass enough bytes for this command. */ } hp = &srp->header; hp->interface_id = '\0'; /* indicator of old interface tunnelled */ hp->cmd_len = (unsigned char) cmd_size; hp->iovec_count = 0; hp->mx_sb_len = 0; if (input_size > 0) hp->dxfer_direction = (old_hdr.reply_len > SZ_SG_HEADER) ? SG_DXFER_TO_FROM_DEV : SG_DXFER_TO_DEV; else hp->dxfer_direction = (mxsize > 0) ? SG_DXFER_FROM_DEV : SG_DXFER_NONE; hp->dxfer_len = mxsize; if ((hp->dxfer_direction == SG_DXFER_TO_DEV) || (hp->dxfer_direction == SG_DXFER_TO_FROM_DEV)) hp->dxferp = (char __user *)buf + cmd_size; else hp->dxferp = NULL; hp->sbp = NULL; hp->timeout = old_hdr.reply_len; /* structure abuse ... */ hp->flags = input_size; /* structure abuse ... */ hp->pack_id = old_hdr.pack_id; hp->usr_ptr = NULL; if (copy_from_user(cmnd, buf, cmd_size)) { sg_remove_request(sfp, srp); return -EFAULT; } /* * SG_DXFER_TO_FROM_DEV is functionally equivalent to SG_DXFER_FROM_DEV, * but is is possible that the app intended SG_DXFER_TO_DEV, because there * is a non-zero input_size, so emit a warning. */ if (hp->dxfer_direction == SG_DXFER_TO_FROM_DEV) { printk_ratelimited(KERN_WARNING "sg_write: data in/out %d/%d bytes " "for SCSI command 0x%x-- guessing " "data in;\n program %s not setting " "count and/or reply_len properly\n", old_hdr.reply_len - (int)SZ_SG_HEADER, input_size, (unsigned int) cmnd[0], current->comm); } k = sg_common_write(sfp, srp, cmnd, sfp->timeout, blocking); return (k < 0) ? k : count; } static ssize_t sg_new_write(Sg_fd *sfp, struct file *file, const char __user *buf, size_t count, int blocking, int read_only, int sg_io_owned, Sg_request **o_srp) { int k; Sg_request *srp; sg_io_hdr_t *hp; unsigned char cmnd[SG_MAX_CDB_SIZE]; int timeout; unsigned long ul_timeout; if (count < SZ_SG_IO_HDR) return -EINVAL; sfp->cmd_q = 1; /* when sg_io_hdr seen, set command queuing on */ if (!(srp = sg_add_request(sfp))) { SCSI_LOG_TIMEOUT(1, sg_printk(KERN_INFO, sfp->parentdp, "sg_new_write: queue full\n")); return -EDOM; } srp->sg_io_owned = sg_io_owned; hp = &srp->header; if (get_sg_io_hdr(hp, buf)) { sg_remove_request(sfp, srp); return -EFAULT; } if (hp->interface_id != 'S') { sg_remove_request(sfp, srp); return -ENOSYS; } if (hp->flags & SG_FLAG_MMAP_IO) { if (hp->dxfer_len > sfp->reserve.bufflen) { sg_remove_request(sfp, srp); return -ENOMEM; /* MMAP_IO size must fit in reserve buffer */ } if (hp->flags & SG_FLAG_DIRECT_IO) { sg_remove_request(sfp, srp); return -EINVAL; /* either MMAP_IO or DIRECT_IO (not both) */ } if (sfp->res_in_use) { sg_remove_request(sfp, srp); return -EBUSY; /* reserve buffer already being used */ } } ul_timeout = msecs_to_jiffies(srp->header.timeout); timeout = (ul_timeout < INT_MAX) ? ul_timeout : INT_MAX; if ((!hp->cmdp) || (hp->cmd_len < 6) || (hp->cmd_len > sizeof (cmnd))) { sg_remove_request(sfp, srp); return -EMSGSIZE; } if (copy_from_user(cmnd, hp->cmdp, hp->cmd_len)) { sg_remove_request(sfp, srp); return -EFAULT; } if (read_only && sg_allow_access(file, cmnd)) { sg_remove_request(sfp, srp); return -EPERM; } k = sg_common_write(sfp, srp, cmnd, timeout, blocking); if (k < 0) return k; if (o_srp) *o_srp = srp; return count; } static int sg_common_write(Sg_fd * sfp, Sg_request * srp, unsigned char *cmnd, int timeout, int blocking) { int k, at_head; Sg_device *sdp = sfp->parentdp; sg_io_hdr_t *hp = &srp->header; srp->data.cmd_opcode = cmnd[0]; /* hold opcode of command */ hp->status = 0; hp->masked_status = 0; hp->msg_status = 0; hp->info = 0; hp->host_status = 0; hp->driver_status = 0; hp->resid = 0; SCSI_LOG_TIMEOUT(4, sg_printk(KERN_INFO, sfp->parentdp, "sg_common_write: scsi opcode=0x%02x, cmd_size=%d\n", (int) cmnd[0], (int) hp->cmd_len)); if (hp->dxfer_len >= SZ_256M) { sg_remove_request(sfp, srp); return -EINVAL; } k = sg_start_req(srp, cmnd); if (k) { SCSI_LOG_TIMEOUT(1, sg_printk(KERN_INFO, sfp->parentdp, "sg_common_write: start_req err=%d\n", k)); sg_finish_rem_req(srp); sg_remove_request(sfp, srp); return k; /* probably out of space --> ENOMEM */ } if (atomic_read(&sdp->detaching)) { if (srp->bio) { blk_mq_free_request(srp->rq); srp->rq = NULL; } sg_finish_rem_req(srp); sg_remove_request(sfp, srp); return -ENODEV; } hp->duration = jiffies_to_msecs(jiffies); if (hp->interface_id != '\0' && /* v3 (or later) interface */ (SG_FLAG_Q_AT_TAIL & hp->flags)) at_head = 0; else at_head = 1; srp->rq->timeout = timeout; kref_get(&sfp->f_ref); /* sg_rq_end_io() does kref_put(). */ srp->rq->end_io = sg_rq_end_io; blk_execute_rq_nowait(srp->rq, at_head); return 0; } static int srp_done(Sg_fd *sfp, Sg_request *srp) { unsigned long flags; int ret; read_lock_irqsave(&sfp->rq_list_lock, flags); ret = srp->done; read_unlock_irqrestore(&sfp->rq_list_lock, flags); return ret; } static int max_sectors_bytes(struct request_queue *q) { unsigned int max_sectors = queue_max_sectors(q); max_sectors = min_t(unsigned int, max_sectors, INT_MAX >> 9); return max_sectors << 9; } static void sg_fill_request_table(Sg_fd *sfp, sg_req_info_t *rinfo) { Sg_request *srp; int val; unsigned int ms; val = 0; list_for_each_entry(srp, &sfp->rq_list, entry) { if (val >= SG_MAX_QUEUE) break; rinfo[val].req_state = srp->done + 1; rinfo[val].problem = srp->header.masked_status & srp->header.host_status & srp->header.driver_status; if (srp->done) rinfo[val].duration = srp->header.duration; else { ms = jiffies_to_msecs(jiffies); rinfo[val].duration = (ms > srp->header.duration) ? (ms - srp->header.duration) : 0; } rinfo[val].orphan = srp->orphan; rinfo[val].sg_io_owned = srp->sg_io_owned; rinfo[val].pack_id = srp->header.pack_id; rinfo[val].usr_ptr = srp->header.usr_ptr; val++; } } #ifdef CONFIG_COMPAT struct compat_sg_req_info { /* used by SG_GET_REQUEST_TABLE ioctl() */ char req_state; char orphan; char sg_io_owned; char problem; int pack_id; compat_uptr_t usr_ptr; unsigned int duration; int unused; }; static int put_compat_request_table(struct compat_sg_req_info __user *o, struct sg_req_info *rinfo) { int i; for (i = 0; i < SG_MAX_QUEUE; i++) { if (copy_to_user(o + i, rinfo + i, offsetof(sg_req_info_t, usr_ptr)) || put_user((uintptr_t)rinfo[i].usr_ptr, &o[i].usr_ptr) || put_user(rinfo[i].duration, &o[i].duration) || put_user(rinfo[i].unused, &o[i].unused)) return -EFAULT; } return 0; } #endif static long sg_ioctl_common(struct file *filp, Sg_device *sdp, Sg_fd *sfp, unsigned int cmd_in, void __user *p) { int __user *ip = p; int result, val, read_only; Sg_request *srp; unsigned long iflags; SCSI_LOG_TIMEOUT(3, sg_printk(KERN_INFO, sdp, "sg_ioctl: cmd=0x%x\n", (int) cmd_in)); read_only = (O_RDWR != (filp->f_flags & O_ACCMODE)); switch (cmd_in) { case SG_IO: if (atomic_read(&sdp->detaching)) return -ENODEV; if (!scsi_block_when_processing_errors(sdp->device)) return -ENXIO; result = sg_new_write(sfp, filp, p, SZ_SG_IO_HDR, 1, read_only, 1, &srp); if (result < 0) return result; result = wait_event_interruptible(sfp->read_wait, srp_done(sfp, srp)); write_lock_irq(&sfp->rq_list_lock); if (srp->done) { srp->done = 2; write_unlock_irq(&sfp->rq_list_lock); result = sg_new_read(sfp, p, SZ_SG_IO_HDR, srp); return (result < 0) ? result : 0; } srp->orphan = 1; write_unlock_irq(&sfp->rq_list_lock); return result; /* -ERESTARTSYS because signal hit process */ case SG_SET_TIMEOUT: result = get_user(val, ip); if (result) return result; if (val < 0) return -EIO; if (val >= mult_frac((s64)INT_MAX, USER_HZ, HZ)) val = min_t(s64, mult_frac((s64)INT_MAX, USER_HZ, HZ), INT_MAX); sfp->timeout_user = val; sfp->timeout = mult_frac(val, HZ, USER_HZ); return 0; case SG_GET_TIMEOUT: /* N.B. User receives timeout as return value */ /* strange ..., for backward compatibility */ return sfp->timeout_user; case SG_SET_FORCE_LOW_DMA: /* * N.B. This ioctl never worked properly, but failed to * return an error value. So returning '0' to keep compability * with legacy applications. */ return 0; case SG_GET_LOW_DMA: return put_user(0, ip); case SG_GET_SCSI_ID: { sg_scsi_id_t v; if (atomic_read(&sdp->detaching)) return -ENODEV; memset(&v, 0, sizeof(v)); v.host_no = sdp->device->host->host_no; v.channel = sdp->device->channel; v.scsi_id = sdp->device->id; v.lun = sdp->device->lun; v.scsi_type = sdp->device->type; v.h_cmd_per_lun = sdp->device->host->cmd_per_lun; v.d_queue_depth = sdp->device->queue_depth; if (copy_to_user(p, &v, sizeof(sg_scsi_id_t))) return -EFAULT; return 0; } case SG_SET_FORCE_PACK_ID: result = get_user(val, ip); if (result) return result; sfp->force_packid = val ? 1 : 0; return 0; case SG_GET_PACK_ID: read_lock_irqsave(&sfp->rq_list_lock, iflags); list_for_each_entry(srp, &sfp->rq_list, entry) { if ((1 == srp->done) && (!srp->sg_io_owned)) { read_unlock_irqrestore(&sfp->rq_list_lock, iflags); return put_user(srp->header.pack_id, ip); } } read_unlock_irqrestore(&sfp->rq_list_lock, iflags); return put_user(-1, ip); case SG_GET_NUM_WAITING: read_lock_irqsave(&sfp->rq_list_lock, iflags); val = 0; list_for_each_entry(srp, &sfp->rq_list, entry) { if ((1 == srp->done) && (!srp->sg_io_owned)) ++val; } read_unlock_irqrestore(&sfp->rq_list_lock, iflags); return put_user(val, ip); case SG_GET_SG_TABLESIZE: return put_user(sdp->sg_tablesize, ip); case SG_SET_RESERVED_SIZE: result = get_user(val, ip); if (result) return result; if (val < 0) return -EINVAL; val = min_t(int, val, max_sectors_bytes(sdp->device->request_queue)); mutex_lock(&sfp->f_mutex); if (val != sfp->reserve.bufflen) { if (sfp->mmap_called || sfp->res_in_use) { mutex_unlock(&sfp->f_mutex); return -EBUSY; } sg_remove_scat(sfp, &sfp->reserve); sg_build_reserve(sfp, val); } mutex_unlock(&sfp->f_mutex); return 0; case SG_GET_RESERVED_SIZE: val = min_t(int, sfp->reserve.bufflen, max_sectors_bytes(sdp->device->request_queue)); return put_user(val, ip); case SG_SET_COMMAND_Q: result = get_user(val, ip); if (result) return result; sfp->cmd_q = val ? 1 : 0; return 0; case SG_GET_COMMAND_Q: return put_user((int) sfp->cmd_q, ip); case SG_SET_KEEP_ORPHAN: result = get_user(val, ip); if (result) return result; sfp->keep_orphan = val; return 0; case SG_GET_KEEP_ORPHAN: return put_user((int) sfp->keep_orphan, ip); case SG_NEXT_CMD_LEN: result = get_user(val, ip); if (result) return result; if (val > SG_MAX_CDB_SIZE) return -ENOMEM; sfp->next_cmd_len = (val > 0) ? val : 0; return 0; case SG_GET_VERSION_NUM: return put_user(sg_version_num, ip); case SG_GET_ACCESS_COUNT: /* faked - we don't have a real access count anymore */ val = (sdp->device ? 1 : 0); return put_user(val, ip); case SG_GET_REQUEST_TABLE: { sg_req_info_t *rinfo; rinfo = kcalloc(SG_MAX_QUEUE, SZ_SG_REQ_INFO, GFP_KERNEL); if (!rinfo) return -ENOMEM; read_lock_irqsave(&sfp->rq_list_lock, iflags); sg_fill_request_table(sfp, rinfo); read_unlock_irqrestore(&sfp->rq_list_lock, iflags); #ifdef CONFIG_COMPAT if (in_compat_syscall()) result = put_compat_request_table(p, rinfo); else #endif result = copy_to_user(p, rinfo, SZ_SG_REQ_INFO * SG_MAX_QUEUE); result = result ? -EFAULT : 0; kfree(rinfo); return result; } case SG_EMULATED_HOST: if (atomic_read(&sdp->detaching)) return -ENODEV; return put_user(sdp->device->host->hostt->emulated, ip); case SCSI_IOCTL_SEND_COMMAND: if (atomic_read(&sdp->detaching)) return -ENODEV; return scsi_ioctl(sdp->device, filp->f_mode & FMODE_WRITE, cmd_in, p); case SG_SET_DEBUG: result = get_user(val, ip); if (result) return result; sdp->sgdebug = (char) val; return 0; case BLKSECTGET: return put_user(max_sectors_bytes(sdp->device->request_queue), ip); case BLKTRACESETUP: return blk_trace_setup(sdp->device->request_queue, sdp->name, MKDEV(SCSI_GENERIC_MAJOR, sdp->index), NULL, p); case BLKTRACESTART: return blk_trace_startstop(sdp->device->request_queue, 1); case BLKTRACESTOP: return blk_trace_startstop(sdp->device->request_queue, 0); case BLKTRACETEARDOWN: return blk_trace_remove(sdp->device->request_queue); case SCSI_IOCTL_GET_IDLUN: case SCSI_IOCTL_GET_BUS_NUMBER: case SCSI_IOCTL_PROBE_HOST: case SG_GET_TRANSFORM: case SG_SCSI_RESET: if (atomic_read(&sdp->detaching)) return -ENODEV; break; default: if (read_only) return -EPERM; /* don't know so take safe approach */ break; } result = scsi_ioctl_block_when_processing_errors(sdp->device, cmd_in, filp->f_flags & O_NDELAY); if (result) return result; return -ENOIOCTLCMD; } static long sg_ioctl(struct file *filp, unsigned int cmd_in, unsigned long arg) { void __user *p = (void __user *)arg; Sg_device *sdp; Sg_fd *sfp; int ret; if ((!(sfp = (Sg_fd *) filp->private_data)) || (!(sdp = sfp->parentdp))) return -ENXIO; ret = sg_ioctl_common(filp, sdp, sfp, cmd_in, p); if (ret != -ENOIOCTLCMD) return ret; return scsi_ioctl(sdp->device, filp->f_mode & FMODE_WRITE, cmd_in, p); } static __poll_t sg_poll(struct file *filp, poll_table * wait) { __poll_t res = 0; Sg_device *sdp; Sg_fd *sfp; Sg_request *srp; int count = 0; unsigned long iflags; sfp = filp->private_data; if (!sfp) return EPOLLERR; sdp = sfp->parentdp; if (!sdp) return EPOLLERR; poll_wait(filp, &sfp->read_wait, wait); read_lock_irqsave(&sfp->rq_list_lock, iflags); list_for_each_entry(srp, &sfp->rq_list, entry) { /* if any read waiting, flag it */ if ((0 == res) && (1 == srp->done) && (!srp->sg_io_owned)) res = EPOLLIN | EPOLLRDNORM; ++count; } read_unlock_irqrestore(&sfp->rq_list_lock, iflags); if (atomic_read(&sdp->detaching)) res |= EPOLLHUP; else if (!sfp->cmd_q) { if (0 == count) res |= EPOLLOUT | EPOLLWRNORM; } else if (count < SG_MAX_QUEUE) res |= EPOLLOUT | EPOLLWRNORM; SCSI_LOG_TIMEOUT(3, sg_printk(KERN_INFO, sdp, "sg_poll: res=0x%x\n", (__force u32) res)); return res; } static int sg_fasync(int fd, struct file *filp, int mode) { Sg_device *sdp; Sg_fd *sfp; if ((!(sfp = (Sg_fd *) filp->private_data)) || (!(sdp = sfp->parentdp))) return -ENXIO; SCSI_LOG_TIMEOUT(3, sg_printk(KERN_INFO, sdp, "sg_fasync: mode=%d\n", mode)); return fasync_helper(fd, filp, mode, &sfp->async_qp); } static vm_fault_t sg_vma_fault(struct vm_fault *vmf) { struct vm_area_struct *vma = vmf->vma; Sg_fd *sfp; unsigned long offset, len, sa; Sg_scatter_hold *rsv_schp; int k, length; if ((NULL == vma) || (!(sfp = (Sg_fd *) vma->vm_private_data))) return VM_FAULT_SIGBUS; rsv_schp = &sfp->reserve; offset = vmf->pgoff << PAGE_SHIFT; if (offset >= rsv_schp->bufflen) return VM_FAULT_SIGBUS; SCSI_LOG_TIMEOUT(3, sg_printk(KERN_INFO, sfp->parentdp, "sg_vma_fault: offset=%lu, scatg=%d\n", offset, rsv_schp->k_use_sg)); sa = vma->vm_start; length = 1 << (PAGE_SHIFT + rsv_schp->page_order); for (k = 0; k < rsv_schp->k_use_sg && sa < vma->vm_end; k++) { len = vma->vm_end - sa; len = (len < length) ? len : length; if (offset < len) { struct page *page = nth_page(rsv_schp->pages[k], offset >> PAGE_SHIFT); get_page(page); /* increment page count */ vmf->page = page; return 0; /* success */ } sa += len; offset -= len; } return VM_FAULT_SIGBUS; } static const struct vm_operations_struct sg_mmap_vm_ops = { .fault = sg_vma_fault, }; static int sg_mmap(struct file *filp, struct vm_area_struct *vma) { Sg_fd *sfp; unsigned long req_sz, len, sa; Sg_scatter_hold *rsv_schp; int k, length; int ret = 0; if ((!filp) || (!vma) || (!(sfp = (Sg_fd *) filp->private_data))) return -ENXIO; req_sz = vma->vm_end - vma->vm_start; SCSI_LOG_TIMEOUT(3, sg_printk(KERN_INFO, sfp->parentdp, "sg_mmap starting, vm_start=%p, len=%d\n", (void *) vma->vm_start, (int) req_sz)); if (vma->vm_pgoff) return -EINVAL; /* want no offset */ rsv_schp = &sfp->reserve; mutex_lock(&sfp->f_mutex); if (req_sz > rsv_schp->bufflen) { ret = -ENOMEM; /* cannot map more than reserved buffer */ goto out; } sa = vma->vm_start; length = 1 << (PAGE_SHIFT + rsv_schp->page_order); for (k = 0; k < rsv_schp->k_use_sg && sa < vma->vm_end; k++) { len = vma->vm_end - sa; len = (len < length) ? len : length; sa += len; } sfp->mmap_called = 1; vm_flags_set(vma, VM_IO | VM_DONTEXPAND | VM_DONTDUMP); vma->vm_private_data = sfp; vma->vm_ops = &sg_mmap_vm_ops; out: mutex_unlock(&sfp->f_mutex); return ret; } static void sg_rq_end_io_usercontext(struct work_struct *work) { struct sg_request *srp = container_of(work, struct sg_request, ew.work); struct sg_fd *sfp = srp->parentfp; sg_finish_rem_req(srp); sg_remove_request(sfp, srp); kref_put(&sfp->f_ref, sg_remove_sfp); } /* * This function is a "bottom half" handler that is called by the mid * level when a command is completed (or has failed). */ static enum rq_end_io_ret sg_rq_end_io(struct request *rq, blk_status_t status) { struct scsi_cmnd *scmd = blk_mq_rq_to_pdu(rq); struct sg_request *srp = rq->end_io_data; Sg_device *sdp; Sg_fd *sfp; unsigned long iflags; unsigned int ms; char *sense; int result, resid, done = 1; if (WARN_ON(srp->done != 0)) return RQ_END_IO_NONE; sfp = srp->parentfp; if (WARN_ON(sfp == NULL)) return RQ_END_IO_NONE; sdp = sfp->parentdp; if (unlikely(atomic_read(&sdp->detaching))) pr_info("%s: device detaching\n", __func__); sense = scmd->sense_buffer; result = scmd->result; resid = scmd->resid_len; SCSI_LOG_TIMEOUT(4, sg_printk(KERN_INFO, sdp, "sg_cmd_done: pack_id=%d, res=0x%x\n", srp->header.pack_id, result)); srp->header.resid = resid; ms = jiffies_to_msecs(jiffies); srp->header.duration = (ms > srp->header.duration) ? (ms - srp->header.duration) : 0; if (0 != result) { struct scsi_sense_hdr sshdr; srp->header.status = 0xff & result; srp->header.masked_status = sg_status_byte(result); srp->header.msg_status = COMMAND_COMPLETE; srp->header.host_status = host_byte(result); srp->header.driver_status = driver_byte(result); if ((sdp->sgdebug > 0) && ((CHECK_CONDITION == srp->header.masked_status) || (COMMAND_TERMINATED == srp->header.masked_status))) __scsi_print_sense(sdp->device, __func__, sense, SCSI_SENSE_BUFFERSIZE); /* Following if statement is a patch supplied by Eric Youngdale */ if (driver_byte(result) != 0 && scsi_normalize_sense(sense, SCSI_SENSE_BUFFERSIZE, &sshdr) && !scsi_sense_is_deferred(&sshdr) && sshdr.sense_key == UNIT_ATTENTION && sdp->device->removable) { /* Detected possible disc change. Set the bit - this */ /* may be used if there are filesystems using this device */ sdp->device->changed = 1; } } if (scmd->sense_len) memcpy(srp->sense_b, scmd->sense_buffer, SCSI_SENSE_BUFFERSIZE); /* Rely on write phase to clean out srp status values, so no "else" */ /* * Free the request as soon as it is complete so that its resources * can be reused without waiting for userspace to read() the * result. But keep the associated bio (if any) around until * blk_rq_unmap_user() can be called from user context. */ srp->rq = NULL; blk_mq_free_request(rq); write_lock_irqsave(&sfp->rq_list_lock, iflags); if (unlikely(srp->orphan)) { if (sfp->keep_orphan) srp->sg_io_owned = 0; else done = 0; } srp->done = done; write_unlock_irqrestore(&sfp->rq_list_lock, iflags); if (likely(done)) { /* Now wake up any sg_read() that is waiting for this * packet. */ wake_up_interruptible(&sfp->read_wait); kill_fasync(&sfp->async_qp, SIGPOLL, POLL_IN); kref_put(&sfp->f_ref, sg_remove_sfp); } else { INIT_WORK(&srp->ew.work, sg_rq_end_io_usercontext); schedule_work(&srp->ew.work); } return RQ_END_IO_NONE; } static const struct file_operations sg_fops = { .owner = THIS_MODULE, .read = sg_read, .write = sg_write, .poll = sg_poll, .unlocked_ioctl = sg_ioctl, .compat_ioctl = compat_ptr_ioctl, .open = sg_open, .mmap = sg_mmap, .release = sg_release, .fasync = sg_fasync, .llseek = no_llseek, }; static struct class *sg_sysfs_class; static int sg_sysfs_valid = 0; static Sg_device * sg_alloc(struct scsi_device *scsidp) { struct request_queue *q = scsidp->request_queue; Sg_device *sdp; unsigned long iflags; int error; u32 k; sdp = kzalloc(sizeof(Sg_device), GFP_KERNEL); if (!sdp) { sdev_printk(KERN_WARNING, scsidp, "%s: kmalloc Sg_device " "failure\n", __func__); return ERR_PTR(-ENOMEM); } idr_preload(GFP_KERNEL); write_lock_irqsave(&sg_index_lock, iflags); error = idr_alloc(&sg_index_idr, sdp, 0, SG_MAX_DEVS, GFP_NOWAIT); if (error < 0) { if (error == -ENOSPC) { sdev_printk(KERN_WARNING, scsidp, "Unable to attach sg device type=%d, minor number exceeds %d\n", scsidp->type, SG_MAX_DEVS - 1); error = -ENODEV; } else { sdev_printk(KERN_WARNING, scsidp, "%s: idr " "allocation Sg_device failure: %d\n", __func__, error); } goto out_unlock; } k = error; SCSI_LOG_TIMEOUT(3, sdev_printk(KERN_INFO, scsidp, "sg_alloc: dev=%d \n", k)); sprintf(sdp->name, "sg%d", k); sdp->device = scsidp; mutex_init(&sdp->open_rel_lock); INIT_LIST_HEAD(&sdp->sfds); init_waitqueue_head(&sdp->open_wait); atomic_set(&sdp->detaching, 0); rwlock_init(&sdp->sfd_lock); sdp->sg_tablesize = queue_max_segments(q); sdp->index = k; kref_init(&sdp->d_ref); error = 0; out_unlock: write_unlock_irqrestore(&sg_index_lock, iflags); idr_preload_end(); if (error) { kfree(sdp); return ERR_PTR(error); } return sdp; } static int sg_add_device(struct device *cl_dev) { struct scsi_device *scsidp = to_scsi_device(cl_dev->parent); Sg_device *sdp = NULL; struct cdev * cdev = NULL; int error; unsigned long iflags; if (!blk_get_queue(scsidp->request_queue)) { pr_warn("%s: get scsi_device queue failed\n", __func__); return -ENODEV; } error = -ENOMEM; cdev = cdev_alloc(); if (!cdev) { pr_warn("%s: cdev_alloc failed\n", __func__); goto out; } cdev->owner = THIS_MODULE; cdev->ops = &sg_fops; sdp = sg_alloc(scsidp); if (IS_ERR(sdp)) { pr_warn("%s: sg_alloc failed\n", __func__); error = PTR_ERR(sdp); goto out; } error = cdev_add(cdev, MKDEV(SCSI_GENERIC_MAJOR, sdp->index), 1); if (error) goto cdev_add_err; sdp->cdev = cdev; if (sg_sysfs_valid) { struct device *sg_class_member; sg_class_member = device_create(sg_sysfs_class, cl_dev->parent, MKDEV(SCSI_GENERIC_MAJOR, sdp->index), sdp, "%s", sdp->name); if (IS_ERR(sg_class_member)) { pr_err("%s: device_create failed\n", __func__); error = PTR_ERR(sg_class_member); goto cdev_add_err; } error = sysfs_create_link(&scsidp->sdev_gendev.kobj, &sg_class_member->kobj, "generic"); if (error) pr_err("%s: unable to make symlink 'generic' back " "to sg%d\n", __func__, sdp->index); } else pr_warn("%s: sg_sys Invalid\n", __func__); sdev_printk(KERN_NOTICE, scsidp, "Attached scsi generic sg%d " "type %d\n", sdp->index, scsidp->type); dev_set_drvdata(cl_dev, sdp); return 0; cdev_add_err: write_lock_irqsave(&sg_index_lock, iflags); idr_remove(&sg_index_idr, sdp->index); write_unlock_irqrestore(&sg_index_lock, iflags); kfree(sdp); out: if (cdev) cdev_del(cdev); blk_put_queue(scsidp->request_queue); return error; } static void sg_device_destroy(struct kref *kref) { struct sg_device *sdp = container_of(kref, struct sg_device, d_ref); struct request_queue *q = sdp->device->request_queue; unsigned long flags; /* CAUTION! Note that the device can still be found via idr_find() * even though the refcount is 0. Therefore, do idr_remove() BEFORE * any other cleanup. */ blk_trace_remove(q); blk_put_queue(q); write_lock_irqsave(&sg_index_lock, flags); idr_remove(&sg_index_idr, sdp->index); write_unlock_irqrestore(&sg_index_lock, flags); SCSI_LOG_TIMEOUT(3, sg_printk(KERN_INFO, sdp, "sg_device_destroy\n")); kfree(sdp); } static void sg_remove_device(struct device *cl_dev) { struct scsi_device *scsidp = to_scsi_device(cl_dev->parent); Sg_device *sdp = dev_get_drvdata(cl_dev); unsigned long iflags; Sg_fd *sfp; int val; if (!sdp) return; /* want sdp->detaching non-zero as soon as possible */ val = atomic_inc_return(&sdp->detaching); if (val > 1) return; /* only want to do following once per device */ SCSI_LOG_TIMEOUT(3, sg_printk(KERN_INFO, sdp, "%s\n", __func__)); read_lock_irqsave(&sdp->sfd_lock, iflags); list_for_each_entry(sfp, &sdp->sfds, sfd_siblings) { wake_up_interruptible_all(&sfp->read_wait); kill_fasync(&sfp->async_qp, SIGPOLL, POLL_HUP); } wake_up_interruptible_all(&sdp->open_wait); read_unlock_irqrestore(&sdp->sfd_lock, iflags); sysfs_remove_link(&scsidp->sdev_gendev.kobj, "generic"); device_destroy(sg_sysfs_class, MKDEV(SCSI_GENERIC_MAJOR, sdp->index)); cdev_del(sdp->cdev); sdp->cdev = NULL; kref_put(&sdp->d_ref, sg_device_destroy); } module_param_named(scatter_elem_sz, scatter_elem_sz, int, S_IRUGO | S_IWUSR); module_param_named(def_reserved_size, def_reserved_size, int, S_IRUGO | S_IWUSR); module_param_named(allow_dio, sg_allow_dio, int, S_IRUGO | S_IWUSR); MODULE_AUTHOR("Douglas Gilbert"); MODULE_DESCRIPTION("SCSI generic (sg) driver"); MODULE_LICENSE("GPL"); MODULE_VERSION(SG_VERSION_STR); MODULE_ALIAS_CHARDEV_MAJOR(SCSI_GENERIC_MAJOR); MODULE_PARM_DESC(scatter_elem_sz, "scatter gather element " "size (default: max(SG_SCATTER_SZ, PAGE_SIZE))"); MODULE_PARM_DESC(def_reserved_size, "size of buffer reserved for each fd"); MODULE_PARM_DESC(allow_dio, "allow direct I/O (default: 0 (disallow))"); #ifdef CONFIG_SYSCTL #include <linux/sysctl.h> static struct ctl_table sg_sysctls[] = { { .procname = "sg-big-buff", .data = &sg_big_buff, .maxlen = sizeof(int), .mode = 0444, .proc_handler = proc_dointvec, }, {} }; static struct ctl_table_header *hdr; static void register_sg_sysctls(void) { if (!hdr) hdr = register_sysctl("kernel", sg_sysctls); } static void unregister_sg_sysctls(void) { if (hdr) unregister_sysctl_table(hdr); } #else #define register_sg_sysctls() do { } while (0) #define unregister_sg_sysctls() do { } while (0) #endif /* CONFIG_SYSCTL */ static int __init init_sg(void) { int rc; if (scatter_elem_sz < PAGE_SIZE) { scatter_elem_sz = PAGE_SIZE; scatter_elem_sz_prev = scatter_elem_sz; } if (def_reserved_size >= 0) sg_big_buff = def_reserved_size; else def_reserved_size = sg_big_buff; rc = register_chrdev_region(MKDEV(SCSI_GENERIC_MAJOR, 0), SG_MAX_DEVS, "sg"); if (rc) return rc; sg_sysfs_class = class_create("scsi_generic"); if ( IS_ERR(sg_sysfs_class) ) { rc = PTR_ERR(sg_sysfs_class); goto err_out; } sg_sysfs_valid = 1; rc = scsi_register_interface(&sg_interface); if (0 == rc) { #ifdef CONFIG_SCSI_PROC_FS sg_proc_init(); #endif /* CONFIG_SCSI_PROC_FS */ return 0; } class_destroy(sg_sysfs_class); register_sg_sysctls(); err_out: unregister_chrdev_region(MKDEV(SCSI_GENERIC_MAJOR, 0), SG_MAX_DEVS); return rc; } static void __exit exit_sg(void) { unregister_sg_sysctls(); #ifdef CONFIG_SCSI_PROC_FS remove_proc_subtree("scsi/sg", NULL); #endif /* CONFIG_SCSI_PROC_FS */ scsi_unregister_interface(&sg_interface); class_destroy(sg_sysfs_class); sg_sysfs_valid = 0; unregister_chrdev_region(MKDEV(SCSI_GENERIC_MAJOR, 0), SG_MAX_DEVS); idr_destroy(&sg_index_idr); } static int sg_start_req(Sg_request *srp, unsigned char *cmd) { int res; struct request *rq; Sg_fd *sfp = srp->parentfp; sg_io_hdr_t *hp = &srp->header; int dxfer_len = (int) hp->dxfer_len; int dxfer_dir = hp->dxfer_direction; unsigned int iov_count = hp->iovec_count; Sg_scatter_hold *req_schp = &srp->data; Sg_scatter_hold *rsv_schp = &sfp->reserve; struct request_queue *q = sfp->parentdp->device->request_queue; struct rq_map_data *md, map_data; int rw = hp->dxfer_direction == SG_DXFER_TO_DEV ? ITER_SOURCE : ITER_DEST; struct scsi_cmnd *scmd; SCSI_LOG_TIMEOUT(4, sg_printk(KERN_INFO, sfp->parentdp, "sg_start_req: dxfer_len=%d\n", dxfer_len)); /* * NOTE * * With scsi-mq enabled, there are a fixed number of preallocated * requests equal in number to shost->can_queue. If all of the * preallocated requests are already in use, then scsi_alloc_request() * will sleep until an active command completes, freeing up a request. * Although waiting in an asynchronous interface is less than ideal, we * do not want to use BLK_MQ_REQ_NOWAIT here because userspace might * not expect an EWOULDBLOCK from this condition. */ rq = scsi_alloc_request(q, hp->dxfer_direction == SG_DXFER_TO_DEV ? REQ_OP_DRV_OUT : REQ_OP_DRV_IN, 0); if (IS_ERR(rq)) return PTR_ERR(rq); scmd = blk_mq_rq_to_pdu(rq); if (hp->cmd_len > sizeof(scmd->cmnd)) { blk_mq_free_request(rq); return -EINVAL; } memcpy(scmd->cmnd, cmd, hp->cmd_len); scmd->cmd_len = hp->cmd_len; srp->rq = rq; rq->end_io_data = srp; scmd->allowed = SG_DEFAULT_RETRIES; if ((dxfer_len <= 0) || (dxfer_dir == SG_DXFER_NONE)) return 0; if (sg_allow_dio && hp->flags & SG_FLAG_DIRECT_IO && dxfer_dir != SG_DXFER_UNKNOWN && !iov_count && blk_rq_aligned(q, (unsigned long)hp->dxferp, dxfer_len)) md = NULL; else md = &map_data; if (md) { mutex_lock(&sfp->f_mutex); if (dxfer_len <= rsv_schp->bufflen && !sfp->res_in_use) { sfp->res_in_use = 1; sg_link_reserve(sfp, srp, dxfer_len); } else if (hp->flags & SG_FLAG_MMAP_IO) { res = -EBUSY; /* sfp->res_in_use == 1 */ if (dxfer_len > rsv_schp->bufflen) res = -ENOMEM; mutex_unlock(&sfp->f_mutex); return res; } else { res = sg_build_indirect(req_schp, sfp, dxfer_len); if (res) { mutex_unlock(&sfp->f_mutex); return res; } } mutex_unlock(&sfp->f_mutex); md->pages = req_schp->pages; md->page_order = req_schp->page_order; md->nr_entries = req_schp->k_use_sg; md->offset = 0; md->null_mapped = hp->dxferp ? 0 : 1; if (dxfer_dir == SG_DXFER_TO_FROM_DEV) md->from_user = 1; else md->from_user = 0; } res = blk_rq_map_user_io(rq, md, hp->dxferp, hp->dxfer_len, GFP_ATOMIC, iov_count, iov_count, 1, rw); if (!res) { srp->bio = rq->bio; if (!md) { req_schp->dio_in_use = 1; hp->info |= SG_INFO_DIRECT_IO; } } return res; } static int sg_finish_rem_req(Sg_request *srp) { int ret = 0; Sg_fd *sfp = srp->parentfp; Sg_scatter_hold *req_schp = &srp->data; SCSI_LOG_TIMEOUT(4, sg_printk(KERN_INFO, sfp->parentdp, "sg_finish_rem_req: res_used=%d\n", (int) srp->res_used)); if (srp->bio) ret = blk_rq_unmap_user(srp->bio); if (srp->rq) blk_mq_free_request(srp->rq); if (srp->res_used) sg_unlink_reserve(sfp, srp); else sg_remove_scat(sfp, req_schp); return ret; } static int sg_build_sgat(Sg_scatter_hold * schp, const Sg_fd * sfp, int tablesize) { int sg_bufflen = tablesize * sizeof(struct page *); gfp_t gfp_flags = GFP_ATOMIC | __GFP_NOWARN; schp->pages = kzalloc(sg_bufflen, gfp_flags); if (!schp->pages) return -ENOMEM; schp->sglist_len = sg_bufflen; return tablesize; /* number of scat_gath elements allocated */ } static int sg_build_indirect(Sg_scatter_hold * schp, Sg_fd * sfp, int buff_size) { int ret_sz = 0, i, k, rem_sz, num, mx_sc_elems; int sg_tablesize = sfp->parentdp->sg_tablesize; int blk_size = buff_size, order; gfp_t gfp_mask = GFP_ATOMIC | __GFP_COMP | __GFP_NOWARN | __GFP_ZERO; if (blk_size < 0) return -EFAULT; if (0 == blk_size) ++blk_size; /* don't know why */ /* round request up to next highest SG_SECTOR_SZ byte boundary */ blk_size = ALIGN(blk_size, SG_SECTOR_SZ); SCSI_LOG_TIMEOUT(4, sg_printk(KERN_INFO, sfp->parentdp, "sg_build_indirect: buff_size=%d, blk_size=%d\n", buff_size, blk_size)); /* N.B. ret_sz carried into this block ... */ mx_sc_elems = sg_build_sgat(schp, sfp, sg_tablesize); if (mx_sc_elems < 0) return mx_sc_elems; /* most likely -ENOMEM */ num = scatter_elem_sz; if (unlikely(num != scatter_elem_sz_prev)) { if (num < PAGE_SIZE) { scatter_elem_sz = PAGE_SIZE; scatter_elem_sz_prev = PAGE_SIZE; } else scatter_elem_sz_prev = num; } order = get_order(num); retry: ret_sz = 1 << (PAGE_SHIFT + order); for (k = 0, rem_sz = blk_size; rem_sz > 0 && k < mx_sc_elems; k++, rem_sz -= ret_sz) { num = (rem_sz > scatter_elem_sz_prev) ? scatter_elem_sz_prev : rem_sz; schp->pages[k] = alloc_pages(gfp_mask, order); if (!schp->pages[k]) goto out; if (num == scatter_elem_sz_prev) { if (unlikely(ret_sz > scatter_elem_sz_prev)) { scatter_elem_sz = ret_sz; scatter_elem_sz_prev = ret_sz; } } SCSI_LOG_TIMEOUT(5, sg_printk(KERN_INFO, sfp->parentdp, "sg_build_indirect: k=%d, num=%d, ret_sz=%d\n", k, num, ret_sz)); } /* end of for loop */ schp->page_order = order; schp->k_use_sg = k; SCSI_LOG_TIMEOUT(5, sg_printk(KERN_INFO, sfp->parentdp, "sg_build_indirect: k_use_sg=%d, rem_sz=%d\n", k, rem_sz)); schp->bufflen = blk_size; if (rem_sz > 0) /* must have failed */ return -ENOMEM; return 0; out: for (i = 0; i < k; i++) __free_pages(schp->pages[i], order); if (--order >= 0) goto retry; return -ENOMEM; } static void sg_remove_scat(Sg_fd * sfp, Sg_scatter_hold * schp) { SCSI_LOG_TIMEOUT(4, sg_printk(KERN_INFO, sfp->parentdp, "sg_remove_scat: k_use_sg=%d\n", schp->k_use_sg)); if (schp->pages && schp->sglist_len > 0) { if (!schp->dio_in_use) { int k; for (k = 0; k < schp->k_use_sg && schp->pages[k]; k++) { SCSI_LOG_TIMEOUT(5, sg_printk(KERN_INFO, sfp->parentdp, "sg_remove_scat: k=%d, pg=0x%p\n", k, schp->pages[k])); __free_pages(schp->pages[k], schp->page_order); } kfree(schp->pages); } } memset(schp, 0, sizeof (*schp)); } static int sg_read_oxfer(Sg_request * srp, char __user *outp, int num_read_xfer) { Sg_scatter_hold *schp = &srp->data; int k, num; SCSI_LOG_TIMEOUT(4, sg_printk(KERN_INFO, srp->parentfp->parentdp, "sg_read_oxfer: num_read_xfer=%d\n", num_read_xfer)); if ((!outp) || (num_read_xfer <= 0)) return 0; num = 1 << (PAGE_SHIFT + schp->page_order); for (k = 0; k < schp->k_use_sg && schp->pages[k]; k++) { if (num > num_read_xfer) { if (copy_to_user(outp, page_address(schp->pages[k]), num_read_xfer)) return -EFAULT; break; } else { if (copy_to_user(outp, page_address(schp->pages[k]), num)) return -EFAULT; num_read_xfer -= num; if (num_read_xfer <= 0) break; outp += num; } } return 0; } static void sg_build_reserve(Sg_fd * sfp, int req_size) { Sg_scatter_hold *schp = &sfp->reserve; SCSI_LOG_TIMEOUT(4, sg_printk(KERN_INFO, sfp->parentdp, "sg_build_reserve: req_size=%d\n", req_size)); do { if (req_size < PAGE_SIZE) req_size = PAGE_SIZE; if (0 == sg_build_indirect(schp, sfp, req_size)) return; else sg_remove_scat(sfp, schp); req_size >>= 1; /* divide by 2 */ } while (req_size > (PAGE_SIZE / 2)); } static void sg_link_reserve(Sg_fd * sfp, Sg_request * srp, int size) { Sg_scatter_hold *req_schp = &srp->data; Sg_scatter_hold *rsv_schp = &sfp->reserve; int k, num, rem; srp->res_used = 1; SCSI_LOG_TIMEOUT(4, sg_printk(KERN_INFO, sfp->parentdp, "sg_link_reserve: size=%d\n", size)); rem = size; num = 1 << (PAGE_SHIFT + rsv_schp->page_order); for (k = 0; k < rsv_schp->k_use_sg; k++) { if (rem <= num) { req_schp->k_use_sg = k + 1; req_schp->sglist_len = rsv_schp->sglist_len; req_schp->pages = rsv_schp->pages; req_schp->bufflen = size; req_schp->page_order = rsv_schp->page_order; break; } else rem -= num; } if (k >= rsv_schp->k_use_sg) SCSI_LOG_TIMEOUT(1, sg_printk(KERN_INFO, sfp->parentdp, "sg_link_reserve: BAD size\n")); } static void sg_unlink_reserve(Sg_fd * sfp, Sg_request * srp) { Sg_scatter_hold *req_schp = &srp->data; SCSI_LOG_TIMEOUT(4, sg_printk(KERN_INFO, srp->parentfp->parentdp, "sg_unlink_reserve: req->k_use_sg=%d\n", (int) req_schp->k_use_sg)); req_schp->k_use_sg = 0; req_schp->bufflen = 0; req_schp->pages = NULL; req_schp->page_order = 0; req_schp->sglist_len = 0; srp->res_used = 0; /* Called without mutex lock to avoid deadlock */ sfp->res_in_use = 0; } static Sg_request * sg_get_rq_mark(Sg_fd * sfp, int pack_id, bool *busy) { Sg_request *resp; unsigned long iflags; *busy = false; write_lock_irqsave(&sfp->rq_list_lock, iflags); list_for_each_entry(resp, &sfp->rq_list, entry) { /* look for requests that are not SG_IO owned */ if ((!resp->sg_io_owned) && ((-1 == pack_id) || (resp->header.pack_id == pack_id))) { switch (resp->done) { case 0: /* request active */ *busy = true; break; case 1: /* request done; response ready to return */ resp->done = 2; /* guard against other readers */ write_unlock_irqrestore(&sfp->rq_list_lock, iflags); return resp; case 2: /* response already being returned */ break; } } } write_unlock_irqrestore(&sfp->rq_list_lock, iflags); return NULL; } /* always adds to end of list */ static Sg_request * sg_add_request(Sg_fd * sfp) { int k; unsigned long iflags; Sg_request *rp = sfp->req_arr; write_lock_irqsave(&sfp->rq_list_lock, iflags); if (!list_empty(&sfp->rq_list)) { if (!sfp->cmd_q) goto out_unlock; for (k = 0; k < SG_MAX_QUEUE; ++k, ++rp) { if (!rp->parentfp) break; } if (k >= SG_MAX_QUEUE) goto out_unlock; } memset(rp, 0, sizeof (Sg_request)); rp->parentfp = sfp; rp->header.duration = jiffies_to_msecs(jiffies); list_add_tail(&rp->entry, &sfp->rq_list); write_unlock_irqrestore(&sfp->rq_list_lock, iflags); return rp; out_unlock: write_unlock_irqrestore(&sfp->rq_list_lock, iflags); return NULL; } /* Return of 1 for found; 0 for not found */ static int sg_remove_request(Sg_fd * sfp, Sg_request * srp) { unsigned long iflags; int res = 0; if (!sfp || !srp || list_empty(&sfp->rq_list)) return res; write_lock_irqsave(&sfp->rq_list_lock, iflags); if (!list_empty(&srp->entry)) { list_del(&srp->entry); srp->parentfp = NULL; res = 1; } write_unlock_irqrestore(&sfp->rq_list_lock, iflags); /* * If the device is detaching, wakeup any readers in case we just * removed the last response, which would leave nothing for them to * return other than -ENODEV. */ if (unlikely(atomic_read(&sfp->parentdp->detaching))) wake_up_interruptible_all(&sfp->read_wait); return res; } static Sg_fd * sg_add_sfp(Sg_device * sdp) { Sg_fd *sfp; unsigned long iflags; int bufflen; sfp = kzalloc(sizeof(*sfp), GFP_ATOMIC | __GFP_NOWARN); if (!sfp) return ERR_PTR(-ENOMEM); init_waitqueue_head(&sfp->read_wait); rwlock_init(&sfp->rq_list_lock); INIT_LIST_HEAD(&sfp->rq_list); kref_init(&sfp->f_ref); mutex_init(&sfp->f_mutex); sfp->timeout = SG_DEFAULT_TIMEOUT; sfp->timeout_user = SG_DEFAULT_TIMEOUT_USER; sfp->force_packid = SG_DEF_FORCE_PACK_ID; sfp->cmd_q = SG_DEF_COMMAND_Q; sfp->keep_orphan = SG_DEF_KEEP_ORPHAN; sfp->parentdp = sdp; write_lock_irqsave(&sdp->sfd_lock, iflags); if (atomic_read(&sdp->detaching)) { write_unlock_irqrestore(&sdp->sfd_lock, iflags); kfree(sfp); return ERR_PTR(-ENODEV); } list_add_tail(&sfp->sfd_siblings, &sdp->sfds); write_unlock_irqrestore(&sdp->sfd_lock, iflags); SCSI_LOG_TIMEOUT(3, sg_printk(KERN_INFO, sdp, "sg_add_sfp: sfp=0x%p\n", sfp)); if (unlikely(sg_big_buff != def_reserved_size)) sg_big_buff = def_reserved_size; bufflen = min_t(int, sg_big_buff, max_sectors_bytes(sdp->device->request_queue)); sg_build_reserve(sfp, bufflen); SCSI_LOG_TIMEOUT(3, sg_printk(KERN_INFO, sdp, "sg_add_sfp: bufflen=%d, k_use_sg=%d\n", sfp->reserve.bufflen, sfp->reserve.k_use_sg)); kref_get(&sdp->d_ref); __module_get(THIS_MODULE); return sfp; } static void sg_remove_sfp_usercontext(struct work_struct *work) { struct sg_fd *sfp = container_of(work, struct sg_fd, ew.work); struct sg_device *sdp = sfp->parentdp; Sg_request *srp; unsigned long iflags; /* Cleanup any responses which were never read(). */ write_lock_irqsave(&sfp->rq_list_lock, iflags); while (!list_empty(&sfp->rq_list)) { srp = list_first_entry(&sfp->rq_list, Sg_request, entry); sg_finish_rem_req(srp); list_del(&srp->entry); srp->parentfp = NULL; } write_unlock_irqrestore(&sfp->rq_list_lock, iflags); if (sfp->reserve.bufflen > 0) { SCSI_LOG_TIMEOUT(6, sg_printk(KERN_INFO, sdp, "sg_remove_sfp: bufflen=%d, k_use_sg=%d\n", (int) sfp->reserve.bufflen, (int) sfp->reserve.k_use_sg)); sg_remove_scat(sfp, &sfp->reserve); } SCSI_LOG_TIMEOUT(6, sg_printk(KERN_INFO, sdp, "sg_remove_sfp: sfp=0x%p\n", sfp)); kfree(sfp); scsi_device_put(sdp->device); kref_put(&sdp->d_ref, sg_device_destroy); module_put(THIS_MODULE); } static void sg_remove_sfp(struct kref *kref) { struct sg_fd *sfp = container_of(kref, struct sg_fd, f_ref); struct sg_device *sdp = sfp->parentdp; unsigned long iflags; write_lock_irqsave(&sdp->sfd_lock, iflags); list_del(&sfp->sfd_siblings); write_unlock_irqrestore(&sdp->sfd_lock, iflags); INIT_WORK(&sfp->ew.work, sg_remove_sfp_usercontext); schedule_work(&sfp->ew.work); } #ifdef CONFIG_SCSI_PROC_FS static int sg_idr_max_id(int id, void *p, void *data) { int *k = data; if (*k < id) *k = id; return 0; } static int sg_last_dev(void) { int k = -1; unsigned long iflags; read_lock_irqsave(&sg_index_lock, iflags); idr_for_each(&sg_index_idr, sg_idr_max_id, &k); read_unlock_irqrestore(&sg_index_lock, iflags); return k + 1; /* origin 1 */ } #endif /* must be called with sg_index_lock held */ static Sg_device *sg_lookup_dev(int dev) { return idr_find(&sg_index_idr, dev); } static Sg_device * sg_get_dev(int dev) { struct sg_device *sdp; unsigned long flags; read_lock_irqsave(&sg_index_lock, flags); sdp = sg_lookup_dev(dev); if (!sdp) sdp = ERR_PTR(-ENXIO); else if (atomic_read(&sdp->detaching)) { /* If sdp->detaching, then the refcount may already be 0, in * which case it would be a bug to do kref_get(). */ sdp = ERR_PTR(-ENODEV); } else kref_get(&sdp->d_ref); read_unlock_irqrestore(&sg_index_lock, flags); return sdp; } #ifdef CONFIG_SCSI_PROC_FS static int sg_proc_seq_show_int(struct seq_file *s, void *v); static int sg_proc_single_open_adio(struct inode *inode, struct file *file); static ssize_t sg_proc_write_adio(struct file *filp, const char __user *buffer, size_t count, loff_t *off); static const struct proc_ops adio_proc_ops = { .proc_open = sg_proc_single_open_adio, .proc_read = seq_read, .proc_lseek = seq_lseek, .proc_write = sg_proc_write_adio, .proc_release = single_release, }; static int sg_proc_single_open_dressz(struct inode *inode, struct file *file); static ssize_t sg_proc_write_dressz(struct file *filp, const char __user *buffer, size_t count, loff_t *off); static const struct proc_ops dressz_proc_ops = { .proc_open = sg_proc_single_open_dressz, .proc_read = seq_read, .proc_lseek = seq_lseek, .proc_write = sg_proc_write_dressz, .proc_release = single_release, }; static int sg_proc_seq_show_version(struct seq_file *s, void *v); static int sg_proc_seq_show_devhdr(struct seq_file *s, void *v); static int sg_proc_seq_show_dev(struct seq_file *s, void *v); static void * dev_seq_start(struct seq_file *s, loff_t *pos); static void * dev_seq_next(struct seq_file *s, void *v, loff_t *pos); static void dev_seq_stop(struct seq_file *s, void *v); static const struct seq_operations dev_seq_ops = { .start = dev_seq_start, .next = dev_seq_next, .stop = dev_seq_stop, .show = sg_proc_seq_show_dev, }; static int sg_proc_seq_show_devstrs(struct seq_file *s, void *v); static const struct seq_operations devstrs_seq_ops = { .start = dev_seq_start, .next = dev_seq_next, .stop = dev_seq_stop, .show = sg_proc_seq_show_devstrs, }; static int sg_proc_seq_show_debug(struct seq_file *s, void *v); static const struct seq_operations debug_seq_ops = { .start = dev_seq_start, .next = dev_seq_next, .stop = dev_seq_stop, .show = sg_proc_seq_show_debug, }; static int sg_proc_init(void) { struct proc_dir_entry *p; p = proc_mkdir("scsi/sg", NULL); if (!p) return 1; proc_create("allow_dio", S_IRUGO | S_IWUSR, p, &adio_proc_ops); proc_create_seq("debug", S_IRUGO, p, &debug_seq_ops); proc_create("def_reserved_size", S_IRUGO | S_IWUSR, p, &dressz_proc_ops); proc_create_single("device_hdr", S_IRUGO, p, sg_proc_seq_show_devhdr); proc_create_seq("devices", S_IRUGO, p, &dev_seq_ops); proc_create_seq("device_strs", S_IRUGO, p, &devstrs_seq_ops); proc_create_single("version", S_IRUGO, p, sg_proc_seq_show_version); return 0; } static int sg_proc_seq_show_int(struct seq_file *s, void *v) { seq_printf(s, "%d\n", *((int *)s->private)); return 0; } static int sg_proc_single_open_adio(struct inode *inode, struct file *file) { return single_open(file, sg_proc_seq_show_int, &sg_allow_dio); } static ssize_t sg_proc_write_adio(struct file *filp, const char __user *buffer, size_t count, loff_t *off) { int err; unsigned long num; if (!capable(CAP_SYS_ADMIN) || !capable(CAP_SYS_RAWIO)) return -EACCES; err = kstrtoul_from_user(buffer, count, 0, &num); if (err) return err; sg_allow_dio = num ? 1 : 0; return count; } static int sg_proc_single_open_dressz(struct inode *inode, struct file *file) { return single_open(file, sg_proc_seq_show_int, &sg_big_buff); } static ssize_t sg_proc_write_dressz(struct file *filp, const char __user *buffer, size_t count, loff_t *off) { int err; unsigned long k = ULONG_MAX; if (!capable(CAP_SYS_ADMIN) || !capable(CAP_SYS_RAWIO)) return -EACCES; err = kstrtoul_from_user(buffer, count, 0, &k); if (err) return err; if (k <= 1048576) { /* limit "big buff" to 1 MB */ sg_big_buff = k; return count; } return -ERANGE; } static int sg_proc_seq_show_version(struct seq_file *s, void *v) { seq_printf(s, "%d\t%s [%s]\n", sg_version_num, SG_VERSION_STR, sg_version_date); return 0; } static int sg_proc_seq_show_devhdr(struct seq_file *s, void *v) { seq_puts(s, "host\tchan\tid\tlun\ttype\topens\tqdepth\tbusy\tonline\n"); return 0; } struct sg_proc_deviter { loff_t index; size_t max; }; static void * dev_seq_start(struct seq_file *s, loff_t *pos) { struct sg_proc_deviter * it = kmalloc(sizeof(*it), GFP_KERNEL); s->private = it; if (! it) return NULL; it->index = *pos; it->max = sg_last_dev(); if (it->index >= it->max) return NULL; return it; } static void * dev_seq_next(struct seq_file *s, void *v, loff_t *pos) { struct sg_proc_deviter * it = s->private; *pos = ++it->index; return (it->index < it->max) ? it : NULL; } static void dev_seq_stop(struct seq_file *s, void *v) { kfree(s->private); } static int sg_proc_seq_show_dev(struct seq_file *s, void *v) { struct sg_proc_deviter * it = (struct sg_proc_deviter *) v; Sg_device *sdp; struct scsi_device *scsidp; unsigned long iflags; read_lock_irqsave(&sg_index_lock, iflags); sdp = it ? sg_lookup_dev(it->index) : NULL; if ((NULL == sdp) || (NULL == sdp->device) || (atomic_read(&sdp->detaching))) seq_puts(s, "-1\t-1\t-1\t-1\t-1\t-1\t-1\t-1\t-1\n"); else { scsidp = sdp->device; seq_printf(s, "%d\t%d\t%d\t%llu\t%d\t%d\t%d\t%d\t%d\n", scsidp->host->host_no, scsidp->channel, scsidp->id, scsidp->lun, (int) scsidp->type, 1, (int) scsidp->queue_depth, (int) scsi_device_busy(scsidp), (int) scsi_device_online(scsidp)); } read_unlock_irqrestore(&sg_index_lock, iflags); return 0; } static int sg_proc_seq_show_devstrs(struct seq_file *s, void *v) { struct sg_proc_deviter * it = (struct sg_proc_deviter *) v; Sg_device *sdp; struct scsi_device *scsidp; unsigned long iflags; read_lock_irqsave(&sg_index_lock, iflags); sdp = it ? sg_lookup_dev(it->index) : NULL; scsidp = sdp ? sdp->device : NULL; if (sdp && scsidp && (!atomic_read(&sdp->detaching))) seq_printf(s, "%8.8s\t%16.16s\t%4.4s\n", scsidp->vendor, scsidp->model, scsidp->rev); else seq_puts(s, "<no active device>\n"); read_unlock_irqrestore(&sg_index_lock, iflags); return 0; } /* must be called while holding sg_index_lock */ static void sg_proc_debug_helper(struct seq_file *s, Sg_device * sdp) { int k, new_interface, blen, usg; Sg_request *srp; Sg_fd *fp; const sg_io_hdr_t *hp; const char * cp; unsigned int ms; k = 0; list_for_each_entry(fp, &sdp->sfds, sfd_siblings) { k++; read_lock(&fp->rq_list_lock); /* irqs already disabled */ seq_printf(s, " FD(%d): timeout=%dms bufflen=%d " "(res)sgat=%d low_dma=%d\n", k, jiffies_to_msecs(fp->timeout), fp->reserve.bufflen, (int) fp->reserve.k_use_sg, 0); seq_printf(s, " cmd_q=%d f_packid=%d k_orphan=%d closed=0\n", (int) fp->cmd_q, (int) fp->force_packid, (int) fp->keep_orphan); list_for_each_entry(srp, &fp->rq_list, entry) { hp = &srp->header; new_interface = (hp->interface_id == '\0') ? 0 : 1; if (srp->res_used) { if (new_interface && (SG_FLAG_MMAP_IO & hp->flags)) cp = " mmap>> "; else cp = " rb>> "; } else { if (SG_INFO_DIRECT_IO_MASK & hp->info) cp = " dio>> "; else cp = " "; } seq_puts(s, cp); blen = srp->data.bufflen; usg = srp->data.k_use_sg; seq_puts(s, srp->done ? ((1 == srp->done) ? "rcv:" : "fin:") : "act:"); seq_printf(s, " id=%d blen=%d", srp->header.pack_id, blen); if (srp->done) seq_printf(s, " dur=%d", hp->duration); else { ms = jiffies_to_msecs(jiffies); seq_printf(s, " t_o/elap=%d/%d", (new_interface ? hp->timeout : jiffies_to_msecs(fp->timeout)), (ms > hp->duration ? ms - hp->duration : 0)); } seq_printf(s, "ms sgat=%d op=0x%02x\n", usg, (int) srp->data.cmd_opcode); } if (list_empty(&fp->rq_list)) seq_puts(s, " No requests active\n"); read_unlock(&fp->rq_list_lock); } } static int sg_proc_seq_show_debug(struct seq_file *s, void *v) { struct sg_proc_deviter * it = (struct sg_proc_deviter *) v; Sg_device *sdp; unsigned long iflags; if (it && (0 == it->index)) seq_printf(s, "max_active_device=%d def_reserved_size=%d\n", (int)it->max, sg_big_buff); read_lock_irqsave(&sg_index_lock, iflags); sdp = it ? sg_lookup_dev(it->index) : NULL; if (NULL == sdp) goto skip; read_lock(&sdp->sfd_lock); if (!list_empty(&sdp->sfds)) { seq_printf(s, " >>> device=%s ", sdp->name); if (atomic_read(&sdp->detaching)) seq_puts(s, "detaching pending close "); else if (sdp->device) { struct scsi_device *scsidp = sdp->device; seq_printf(s, "%d:%d:%d:%llu em=%d", scsidp->host->host_no, scsidp->channel, scsidp->id, scsidp->lun, scsidp->host->hostt->emulated); } seq_printf(s, " sg_tablesize=%d excl=%d open_cnt=%d\n", sdp->sg_tablesize, sdp->exclude, sdp->open_cnt); sg_proc_debug_helper(s, sdp); } read_unlock(&sdp->sfd_lock); skip: read_unlock_irqrestore(&sg_index_lock, iflags); return 0; } #endif /* CONFIG_SCSI_PROC_FS */ module_init(init_sg); module_exit(exit_sg); |
50 103 | 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60 61 62 63 64 65 66 67 68 69 70 71 72 73 74 75 76 77 78 79 80 81 82 83 84 85 86 87 88 89 90 91 92 93 94 95 96 97 98 99 100 101 102 103 104 105 106 107 108 109 110 111 112 113 114 115 116 117 118 119 120 121 122 123 124 125 126 127 128 129 130 131 132 133 134 135 136 137 138 139 140 141 142 143 144 145 146 147 148 149 150 151 152 153 154 155 156 157 158 159 160 161 162 163 164 165 166 167 168 169 170 171 172 173 174 175 176 177 178 179 180 181 182 183 184 185 186 187 188 189 190 191 192 193 194 195 196 197 198 199 200 201 202 203 204 205 206 207 208 209 210 211 212 213 214 215 216 217 218 219 220 221 222 223 224 225 226 227 228 229 230 231 232 233 234 235 236 237 238 239 240 241 242 243 244 245 246 247 248 249 250 251 252 253 254 255 256 257 258 259 260 261 262 263 264 265 266 267 268 269 270 271 272 273 274 275 276 277 278 279 280 281 282 283 284 285 286 287 288 289 290 291 292 293 294 295 296 297 298 299 300 301 302 303 304 305 306 307 308 309 310 311 312 313 314 315 316 317 318 319 320 321 322 323 324 325 326 327 328 329 330 331 332 333 334 335 336 337 338 339 340 341 342 343 344 345 346 347 348 349 350 351 352 353 354 355 356 357 358 359 360 361 362 363 364 365 366 367 368 369 370 371 372 373 374 375 376 377 378 379 380 381 382 383 384 385 386 387 388 389 390 391 392 393 394 395 396 397 398 399 400 401 402 403 404 405 406 407 408 409 410 411 412 413 414 415 416 417 418 419 420 421 422 423 424 425 426 427 428 429 430 431 432 433 434 435 436 437 438 439 440 441 442 443 444 445 446 447 448 449 450 451 452 453 454 455 456 457 458 459 460 461 462 463 464 465 466 467 468 469 470 471 472 473 474 475 476 477 478 479 480 481 482 483 484 485 486 487 488 489 490 491 492 493 494 495 496 497 498 499 500 501 502 503 504 505 506 507 508 509 510 511 512 513 514 515 516 517 518 519 520 521 522 523 524 525 526 527 528 529 530 531 532 533 534 535 536 537 538 539 540 541 542 543 544 545 546 547 548 549 550 551 552 553 554 555 556 557 558 559 560 561 562 563 564 565 566 567 568 569 570 571 572 573 574 575 576 577 578 579 580 581 582 583 584 585 586 587 588 589 590 591 592 593 594 595 596 597 598 599 600 601 602 603 604 605 606 607 608 609 610 611 612 613 614 615 616 617 618 619 620 621 622 623 624 625 626 627 628 629 630 631 632 633 634 635 636 637 638 639 640 641 642 643 644 645 646 647 648 649 650 651 652 653 654 655 656 657 658 659 660 661 662 663 664 665 666 667 668 669 670 671 672 673 674 675 676 677 678 679 680 681 682 683 684 685 686 687 688 689 690 691 692 693 694 695 696 697 698 699 700 701 702 703 704 705 706 707 708 709 710 711 712 713 714 715 716 717 718 719 720 721 722 723 724 725 726 727 728 729 730 731 732 733 734 735 736 737 738 739 740 741 742 743 744 745 746 747 748 749 750 751 752 753 754 755 756 757 758 759 760 761 762 763 764 765 766 767 768 769 770 771 772 773 774 775 776 777 778 779 780 781 782 783 784 785 786 787 788 789 790 791 792 793 794 795 796 797 798 799 800 801 802 803 804 805 806 807 808 809 810 811 812 813 814 815 816 817 818 819 820 821 822 823 824 825 826 827 828 829 830 831 832 833 834 835 836 837 838 839 840 841 842 843 844 845 846 847 848 849 850 851 852 853 854 855 856 857 858 859 860 861 862 863 864 865 866 867 868 869 870 871 872 873 874 875 876 877 878 879 880 881 882 883 884 885 886 887 888 889 890 891 892 893 894 895 896 897 898 899 900 901 902 903 904 905 906 907 908 909 910 911 912 913 914 915 916 917 918 919 920 921 922 923 924 925 926 927 928 929 930 931 932 933 934 935 936 937 938 939 940 941 942 943 944 945 946 947 948 949 950 951 952 953 954 955 956 957 958 959 960 961 962 963 964 965 966 967 968 969 970 971 972 973 974 975 976 977 978 979 980 981 982 983 984 985 986 987 988 989 990 991 992 993 994 995 996 997 998 999 1000 1001 1002 1003 1004 1005 1006 1007 1008 1009 1010 1011 1012 1013 1014 1015 1016 1017 1018 1019 1020 1021 1022 1023 1024 1025 1026 1027 1028 1029 1030 1031 1032 1033 1034 1035 1036 1037 1038 1039 1040 1041 1042 1043 1044 1045 1046 1047 1048 1049 1050 1051 1052 1053 1054 1055 1056 1057 1058 1059 1060 1061 1062 1063 1064 1065 1066 1067 1068 1069 1070 1071 1072 1073 1074 1075 1076 1077 1078 1079 1080 1081 1082 1083 1084 1085 1086 1087 1088 1089 1090 1091 1092 1093 1094 1095 1096 1097 1098 1099 1100 1101 1102 1103 1104 1105 1106 1107 1108 1109 1110 1111 1112 1113 1114 1115 1116 1117 1118 1119 1120 1121 1122 1123 1124 1125 1126 1127 1128 1129 1130 1131 1132 1133 1134 1135 1136 1137 1138 1139 1140 1141 1142 1143 1144 1145 1146 1147 1148 1149 1150 1151 1152 1153 1154 1155 1156 1157 1158 1159 1160 1161 1162 1163 1164 1165 1166 1167 1168 1169 1170 1171 1172 1173 1174 1175 1176 1177 1178 1179 1180 1181 1182 1183 1184 1185 1186 1187 1188 1189 1190 1191 1192 1193 1194 1195 1196 1197 1198 1199 1200 1201 1202 1203 1204 1205 1206 1207 1208 1209 1210 1211 1212 1213 1214 1215 1216 1217 1218 1219 1220 1221 1222 1223 1224 1225 1226 1227 1228 1229 1230 1231 1232 1233 1234 1235 1236 1237 1238 1239 1240 1241 1242 1243 1244 1245 1246 1247 1248 1249 1250 1251 1252 1253 1254 1255 1256 1257 1258 1259 1260 1261 1262 1263 1264 1265 1266 1267 1268 1269 1270 1271 1272 1273 1274 1275 1276 1277 1278 1279 1280 1281 1282 1283 1284 1285 1286 1287 1288 1289 1290 1291 1292 1293 1294 1295 1296 1297 1298 1299 1300 1301 1302 1303 1304 1305 1306 1307 1308 1309 1310 1311 1312 1313 1314 1315 1316 1317 1318 1319 1320 1321 1322 1323 1324 1325 1326 1327 1328 1329 1330 1331 1332 1333 1334 1335 1336 1337 1338 1339 1340 1341 1342 1343 1344 1345 1346 1347 1348 1349 1350 1351 1352 1353 1354 1355 1356 1357 1358 1359 1360 1361 1362 1363 1364 1365 1366 1367 1368 1369 1370 1371 1372 1373 1374 1375 1376 1377 1378 1379 1380 1381 1382 1383 1384 1385 1386 1387 1388 1389 1390 1391 1392 1393 1394 1395 1396 1397 1398 1399 1400 1401 1402 1403 1404 1405 1406 1407 1408 1409 1410 1411 1412 1413 1414 1415 1416 1417 1418 1419 1420 1421 1422 1423 1424 1425 1426 1427 1428 1429 1430 1431 1432 1433 1434 1435 1436 1437 1438 1439 1440 1441 1442 1443 1444 1445 1446 1447 1448 1449 1450 1451 1452 1453 1454 1455 1456 1457 1458 1459 1460 1461 1462 1463 1464 1465 1466 1467 1468 1469 1470 1471 1472 1473 1474 1475 1476 1477 1478 1479 1480 1481 1482 1483 1484 1485 1486 1487 1488 1489 1490 1491 1492 1493 1494 1495 1496 1497 1498 1499 1500 1501 1502 1503 1504 1505 1506 1507 1508 1509 1510 1511 1512 1513 1514 1515 1516 1517 1518 1519 1520 1521 1522 1523 1524 1525 1526 1527 1528 1529 1530 1531 1532 1533 1534 1535 1536 1537 1538 1539 1540 1541 1542 1543 1544 1545 1546 1547 1548 1549 1550 1551 1552 1553 1554 1555 1556 1557 1558 1559 1560 1561 1562 1563 1564 1565 1566 1567 1568 1569 1570 1571 1572 1573 1574 1575 1576 1577 1578 1579 1580 1581 1582 1583 1584 1585 1586 1587 1588 1589 1590 1591 1592 1593 1594 1595 1596 1597 1598 1599 1600 1601 1602 1603 1604 1605 1606 1607 1608 1609 1610 1611 1612 1613 1614 1615 1616 1617 1618 1619 1620 1621 1622 1623 1624 1625 1626 1627 1628 1629 1630 1631 1632 1633 1634 1635 1636 1637 1638 1639 1640 1641 1642 1643 1644 1645 1646 1647 1648 1649 1650 1651 1652 1653 1654 1655 1656 1657 1658 1659 1660 1661 1662 1663 1664 1665 1666 1667 1668 1669 1670 1671 1672 1673 1674 1675 1676 1677 1678 1679 1680 1681 1682 1683 1684 1685 1686 1687 1688 1689 1690 1691 1692 1693 1694 1695 1696 1697 1698 1699 1700 1701 1702 1703 1704 1705 1706 1707 1708 1709 1710 1711 1712 1713 1714 1715 1716 1717 1718 1719 1720 1721 1722 1723 1724 1725 1726 1727 1728 1729 1730 1731 1732 1733 1734 1735 1736 1737 1738 1739 1740 1741 1742 1743 1744 1745 1746 1747 1748 1749 1750 1751 1752 1753 1754 1755 1756 1757 1758 1759 1760 1761 1762 1763 1764 1765 1766 1767 1768 1769 1770 1771 1772 1773 1774 1775 1776 1777 1778 1779 1780 1781 1782 1783 1784 1785 1786 1787 1788 1789 1790 1791 1792 1793 1794 1795 1796 1797 1798 1799 1800 1801 1802 1803 1804 1805 1806 1807 1808 1809 1810 1811 1812 1813 1814 1815 1816 1817 1818 1819 1820 1821 1822 1823 1824 1825 1826 1827 1828 1829 1830 1831 1832 1833 1834 1835 1836 1837 1838 1839 1840 1841 1842 1843 1844 1845 1846 1847 1848 1849 1850 1851 1852 1853 1854 1855 1856 1857 1858 1859 1860 1861 1862 1863 1864 1865 1866 1867 1868 1869 1870 1871 1872 1873 1874 1875 1876 1877 1878 1879 1880 1881 1882 1883 1884 1885 1886 1887 1888 1889 1890 1891 1892 1893 1894 1895 1896 1897 1898 1899 1900 1901 1902 1903 1904 1905 1906 1907 1908 1909 1910 1911 1912 1913 1914 1915 1916 1917 1918 1919 1920 1921 1922 1923 1924 1925 1926 1927 1928 1929 1930 1931 1932 1933 1934 1935 1936 1937 1938 1939 1940 1941 1942 1943 1944 1945 1946 1947 1948 1949 1950 1951 1952 1953 1954 1955 1956 1957 1958 1959 1960 1961 1962 1963 1964 1965 1966 1967 1968 1969 1970 1971 1972 1973 1974 1975 1976 1977 1978 1979 1980 1981 1982 1983 1984 1985 1986 1987 1988 1989 1990 1991 1992 1993 1994 1995 1996 1997 1998 1999 2000 2001 2002 2003 2004 2005 2006 2007 2008 2009 2010 2011 2012 2013 2014 2015 2016 2017 2018 2019 2020 2021 2022 2023 2024 2025 2026 2027 2028 2029 2030 2031 2032 2033 2034 2035 2036 2037 2038 2039 2040 2041 2042 2043 2044 2045 2046 2047 2048 2049 2050 2051 2052 2053 2054 2055 2056 2057 2058 2059 | // SPDX-License-Identifier: GPL-2.0 #ifndef _LINUX_KERNEL_TRACE_H #define _LINUX_KERNEL_TRACE_H #include <linux/fs.h> #include <linux/atomic.h> #include <linux/sched.h> #include <linux/clocksource.h> #include <linux/ring_buffer.h> #include <linux/mmiotrace.h> #include <linux/tracepoint.h> #include <linux/ftrace.h> #include <linux/trace.h> #include <linux/hw_breakpoint.h> #include <linux/trace_seq.h> #include <linux/trace_events.h> #include <linux/compiler.h> #include <linux/glob.h> #include <linux/irq_work.h> #include <linux/workqueue.h> #include <linux/ctype.h> #include <linux/once_lite.h> #include "pid_list.h" #ifdef CONFIG_FTRACE_SYSCALLS #include <asm/unistd.h> /* For NR_syscalls */ #include <asm/syscall.h> /* some archs define it here */ #endif #define TRACE_MODE_WRITE 0640 #define TRACE_MODE_READ 0440 enum trace_type { __TRACE_FIRST_TYPE = 0, TRACE_FN, TRACE_CTX, TRACE_WAKE, TRACE_STACK, TRACE_PRINT, TRACE_BPRINT, TRACE_MMIO_RW, TRACE_MMIO_MAP, TRACE_BRANCH, TRACE_GRAPH_RET, TRACE_GRAPH_ENT, TRACE_USER_STACK, TRACE_BLK, TRACE_BPUTS, TRACE_HWLAT, TRACE_OSNOISE, TRACE_TIMERLAT, TRACE_RAW_DATA, TRACE_FUNC_REPEATS, __TRACE_LAST_TYPE, }; #undef __field #define __field(type, item) type item; #undef __field_fn #define __field_fn(type, item) type item; #undef __field_struct #define __field_struct(type, item) __field(type, item) #undef __field_desc #define __field_desc(type, container, item) #undef __field_packed #define __field_packed(type, container, item) #undef __array #define __array(type, item, size) type item[size]; /* * For backward compatibility, older user space expects to see the * kernel_stack event with a fixed size caller field. But today the fix * size is ignored by the kernel, and the real structure is dynamic. * Expose to user space: "unsigned long caller[8];" but the real structure * will be "unsigned long caller[] __counted_by(size)" */ #undef __stack_array #define __stack_array(type, item, size, field) type item[] __counted_by(field); #undef __array_desc #define __array_desc(type, container, item, size) #undef __dynamic_array #define __dynamic_array(type, item) type item[]; #undef __rel_dynamic_array #define __rel_dynamic_array(type, item) type item[]; #undef F_STRUCT #define F_STRUCT(args...) args #undef FTRACE_ENTRY #define FTRACE_ENTRY(name, struct_name, id, tstruct, print) \ struct struct_name { \ struct trace_entry ent; \ tstruct \ } #undef FTRACE_ENTRY_DUP #define FTRACE_ENTRY_DUP(name, name_struct, id, tstruct, printk) #undef FTRACE_ENTRY_REG #define FTRACE_ENTRY_REG(name, struct_name, id, tstruct, print, regfn) \ FTRACE_ENTRY(name, struct_name, id, PARAMS(tstruct), PARAMS(print)) #undef FTRACE_ENTRY_PACKED #define FTRACE_ENTRY_PACKED(name, struct_name, id, tstruct, print) \ FTRACE_ENTRY(name, struct_name, id, PARAMS(tstruct), PARAMS(print)) __packed #include "trace_entries.h" /* Use this for memory failure errors */ #define MEM_FAIL(condition, fmt, ...) \ DO_ONCE_LITE_IF(condition, pr_err, "ERROR: " fmt, ##__VA_ARGS__) #define FAULT_STRING "(fault)" #define HIST_STACKTRACE_DEPTH 16 #define HIST_STACKTRACE_SIZE (HIST_STACKTRACE_DEPTH * sizeof(unsigned long)) #define HIST_STACKTRACE_SKIP 5 /* * syscalls are special, and need special handling, this is why * they are not included in trace_entries.h */ struct syscall_trace_enter { struct trace_entry ent; int nr; unsigned long args[]; }; struct syscall_trace_exit { struct trace_entry ent; int nr; long ret; }; struct kprobe_trace_entry_head { struct trace_entry ent; unsigned long ip; }; struct eprobe_trace_entry_head { struct trace_entry ent; }; struct kretprobe_trace_entry_head { struct trace_entry ent; unsigned long func; unsigned long ret_ip; }; struct fentry_trace_entry_head { struct trace_entry ent; unsigned long ip; }; struct fexit_trace_entry_head { struct trace_entry ent; unsigned long func; unsigned long ret_ip; }; #define TRACE_BUF_SIZE 1024 struct trace_array; /* * The CPU trace array - it consists of thousands of trace entries * plus some other descriptor data: (for example which task started * the trace, etc.) */ struct trace_array_cpu { atomic_t disabled; void *buffer_page; /* ring buffer spare */ unsigned long entries; unsigned long saved_latency; unsigned long critical_start; unsigned long critical_end; unsigned long critical_sequence; unsigned long nice; unsigned long policy; unsigned long rt_priority; unsigned long skipped_entries; u64 preempt_timestamp; pid_t pid; kuid_t uid; char comm[TASK_COMM_LEN]; #ifdef CONFIG_FUNCTION_TRACER int ftrace_ignore_pid; #endif bool ignore_pid; }; struct tracer; struct trace_option_dentry; struct array_buffer { struct trace_array *tr; struct trace_buffer *buffer; struct trace_array_cpu __percpu *data; u64 time_start; int cpu; }; #define TRACE_FLAGS_MAX_SIZE 32 struct trace_options { struct tracer *tracer; struct trace_option_dentry *topts; }; struct trace_pid_list *trace_pid_list_alloc(void); void trace_pid_list_free(struct trace_pid_list *pid_list); bool trace_pid_list_is_set(struct trace_pid_list *pid_list, unsigned int pid); int trace_pid_list_set(struct trace_pid_list *pid_list, unsigned int pid); int trace_pid_list_clear(struct trace_pid_list *pid_list, unsigned int pid); int trace_pid_list_first(struct trace_pid_list *pid_list, unsigned int *pid); int trace_pid_list_next(struct trace_pid_list *pid_list, unsigned int pid, unsigned int *next); enum { TRACE_PIDS = BIT(0), TRACE_NO_PIDS = BIT(1), }; static inline bool pid_type_enabled(int type, struct trace_pid_list *pid_list, struct trace_pid_list *no_pid_list) { /* Return true if the pid list in type has pids */ return ((type & TRACE_PIDS) && pid_list) || ((type & TRACE_NO_PIDS) && no_pid_list); } static inline bool still_need_pid_events(int type, struct trace_pid_list *pid_list, struct trace_pid_list *no_pid_list) { /* * Turning off what is in @type, return true if the "other" * pid list, still has pids in it. */ return (!(type & TRACE_PIDS) && pid_list) || (!(type & TRACE_NO_PIDS) && no_pid_list); } typedef bool (*cond_update_fn_t)(struct trace_array *tr, void *cond_data); /** * struct cond_snapshot - conditional snapshot data and callback * * The cond_snapshot structure encapsulates a callback function and * data associated with the snapshot for a given tracing instance. * * When a snapshot is taken conditionally, by invoking * tracing_snapshot_cond(tr, cond_data), the cond_data passed in is * passed in turn to the cond_snapshot.update() function. That data * can be compared by the update() implementation with the cond_data * contained within the struct cond_snapshot instance associated with * the trace_array. Because the tr->max_lock is held throughout the * update() call, the update() function can directly retrieve the * cond_snapshot and cond_data associated with the per-instance * snapshot associated with the trace_array. * * The cond_snapshot.update() implementation can save data to be * associated with the snapshot if it decides to, and returns 'true' * in that case, or it returns 'false' if the conditional snapshot * shouldn't be taken. * * The cond_snapshot instance is created and associated with the * user-defined cond_data by tracing_cond_snapshot_enable(). * Likewise, the cond_snapshot instance is destroyed and is no longer * associated with the trace instance by * tracing_cond_snapshot_disable(). * * The method below is required. * * @update: When a conditional snapshot is invoked, the update() * callback function is invoked with the tr->max_lock held. The * update() implementation signals whether or not to actually * take the snapshot, by returning 'true' if so, 'false' if no * snapshot should be taken. Because the max_lock is held for * the duration of update(), the implementation is safe to * directly retrieved and save any implementation data it needs * to in association with the snapshot. */ struct cond_snapshot { void *cond_data; cond_update_fn_t update; }; /* * struct trace_func_repeats - used to keep track of the consecutive * (on the same CPU) calls of a single function. */ struct trace_func_repeats { unsigned long ip; unsigned long parent_ip; unsigned long count; u64 ts_last_call; }; /* * The trace array - an array of per-CPU trace arrays. This is the * highest level data structure that individual tracers deal with. * They have on/off state as well: */ struct trace_array { struct list_head list; char *name; struct array_buffer array_buffer; #ifdef CONFIG_TRACER_MAX_TRACE /* * The max_buffer is used to snapshot the trace when a maximum * latency is reached, or when the user initiates a snapshot. * Some tracers will use this to store a maximum trace while * it continues examining live traces. * * The buffers for the max_buffer are set up the same as the array_buffer * When a snapshot is taken, the buffer of the max_buffer is swapped * with the buffer of the array_buffer and the buffers are reset for * the array_buffer so the tracing can continue. */ struct array_buffer max_buffer; bool allocated_snapshot; #endif #ifdef CONFIG_TRACER_MAX_TRACE unsigned long max_latency; #ifdef CONFIG_FSNOTIFY struct dentry *d_max_latency; struct work_struct fsnotify_work; struct irq_work fsnotify_irqwork; #endif #endif struct trace_pid_list __rcu *filtered_pids; struct trace_pid_list __rcu *filtered_no_pids; /* * max_lock is used to protect the swapping of buffers * when taking a max snapshot. The buffers themselves are * protected by per_cpu spinlocks. But the action of the swap * needs its own lock. * * This is defined as a arch_spinlock_t in order to help * with performance when lockdep debugging is enabled. * * It is also used in other places outside the update_max_tr * so it needs to be defined outside of the * CONFIG_TRACER_MAX_TRACE. */ arch_spinlock_t max_lock; int buffer_disabled; #ifdef CONFIG_FTRACE_SYSCALLS int sys_refcount_enter; int sys_refcount_exit; struct trace_event_file __rcu *enter_syscall_files[NR_syscalls]; struct trace_event_file __rcu *exit_syscall_files[NR_syscalls]; #endif int stop_count; int clock_id; int nr_topts; bool clear_trace; int buffer_percent; unsigned int n_err_log_entries; struct tracer *current_trace; unsigned int trace_flags; unsigned char trace_flags_index[TRACE_FLAGS_MAX_SIZE]; unsigned int flags; raw_spinlock_t start_lock; struct list_head err_log; struct dentry *dir; struct dentry *options; struct dentry *percpu_dir; struct dentry *event_dir; struct trace_options *topts; struct list_head systems; struct list_head events; struct trace_event_file *trace_marker_file; cpumask_var_t tracing_cpumask; /* only trace on set CPUs */ /* one per_cpu trace_pipe can be opened by only one user */ cpumask_var_t pipe_cpumask; int ref; int trace_ref; #ifdef CONFIG_FUNCTION_TRACER struct ftrace_ops *ops; struct trace_pid_list __rcu *function_pids; struct trace_pid_list __rcu *function_no_pids; #ifdef CONFIG_DYNAMIC_FTRACE /* All of these are protected by the ftrace_lock */ struct list_head func_probes; struct list_head mod_trace; struct list_head mod_notrace; #endif /* function tracing enabled */ int function_enabled; #endif int no_filter_buffering_ref; struct list_head hist_vars; #ifdef CONFIG_TRACER_SNAPSHOT struct cond_snapshot *cond_snapshot; #endif struct trace_func_repeats __percpu *last_func_repeats; }; enum { TRACE_ARRAY_FL_GLOBAL = (1 << 0) }; extern struct list_head ftrace_trace_arrays; extern struct mutex trace_types_lock; extern int trace_array_get(struct trace_array *tr); extern int tracing_check_open_get_tr(struct trace_array *tr); extern struct trace_array *trace_array_find(const char *instance); extern struct trace_array *trace_array_find_get(const char *instance); extern u64 tracing_event_time_stamp(struct trace_buffer *buffer, struct ring_buffer_event *rbe); extern int tracing_set_filter_buffering(struct trace_array *tr, bool set); extern int tracing_set_clock(struct trace_array *tr, const char *clockstr); extern bool trace_clock_in_ns(struct trace_array *tr); /* * The global tracer (top) should be the first trace array added, * but we check the flag anyway. */ static inline struct trace_array *top_trace_array(void) { struct trace_array *tr; if (list_empty(&ftrace_trace_arrays)) return NULL; tr = list_entry(ftrace_trace_arrays.prev, typeof(*tr), list); WARN_ON(!(tr->flags & TRACE_ARRAY_FL_GLOBAL)); return tr; } #define FTRACE_CMP_TYPE(var, type) \ __builtin_types_compatible_p(typeof(var), type *) #undef IF_ASSIGN #define IF_ASSIGN(var, entry, etype, id) \ if (FTRACE_CMP_TYPE(var, etype)) { \ var = (typeof(var))(entry); \ WARN_ON(id != 0 && (entry)->type != id); \ break; \ } /* Will cause compile errors if type is not found. */ extern void __ftrace_bad_type(void); /* * The trace_assign_type is a verifier that the entry type is * the same as the type being assigned. To add new types simply * add a line with the following format: * * IF_ASSIGN(var, ent, type, id); * * Where "type" is the trace type that includes the trace_entry * as the "ent" item. And "id" is the trace identifier that is * used in the trace_type enum. * * If the type can have more than one id, then use zero. */ #define trace_assign_type(var, ent) \ do { \ IF_ASSIGN(var, ent, struct ftrace_entry, TRACE_FN); \ IF_ASSIGN(var, ent, struct ctx_switch_entry, 0); \ IF_ASSIGN(var, ent, struct stack_entry, TRACE_STACK); \ IF_ASSIGN(var, ent, struct userstack_entry, TRACE_USER_STACK);\ IF_ASSIGN(var, ent, struct print_entry, TRACE_PRINT); \ IF_ASSIGN(var, ent, struct bprint_entry, TRACE_BPRINT); \ IF_ASSIGN(var, ent, struct bputs_entry, TRACE_BPUTS); \ IF_ASSIGN(var, ent, struct hwlat_entry, TRACE_HWLAT); \ IF_ASSIGN(var, ent, struct osnoise_entry, TRACE_OSNOISE);\ IF_ASSIGN(var, ent, struct timerlat_entry, TRACE_TIMERLAT);\ IF_ASSIGN(var, ent, struct raw_data_entry, TRACE_RAW_DATA);\ IF_ASSIGN(var, ent, struct trace_mmiotrace_rw, \ TRACE_MMIO_RW); \ IF_ASSIGN(var, ent, struct trace_mmiotrace_map, \ TRACE_MMIO_MAP); \ IF_ASSIGN(var, ent, struct trace_branch, TRACE_BRANCH); \ IF_ASSIGN(var, ent, struct ftrace_graph_ent_entry, \ TRACE_GRAPH_ENT); \ IF_ASSIGN(var, ent, struct ftrace_graph_ret_entry, \ TRACE_GRAPH_RET); \ IF_ASSIGN(var, ent, struct func_repeats_entry, \ TRACE_FUNC_REPEATS); \ __ftrace_bad_type(); \ } while (0) /* * An option specific to a tracer. This is a boolean value. * The bit is the bit index that sets its value on the * flags value in struct tracer_flags. */ struct tracer_opt { const char *name; /* Will appear on the trace_options file */ u32 bit; /* Mask assigned in val field in tracer_flags */ }; /* * The set of specific options for a tracer. Your tracer * have to set the initial value of the flags val. */ struct tracer_flags { u32 val; struct tracer_opt *opts; struct tracer *trace; }; /* Makes more easy to define a tracer opt */ #define TRACER_OPT(s, b) .name = #s, .bit = b struct trace_option_dentry { struct tracer_opt *opt; struct tracer_flags *flags; struct trace_array *tr; struct dentry *entry; }; /** * struct tracer - a specific tracer and its callbacks to interact with tracefs * @name: the name chosen to select it on the available_tracers file * @init: called when one switches to this tracer (echo name > current_tracer) * @reset: called when one switches to another tracer * @start: called when tracing is unpaused (echo 1 > tracing_on) * @stop: called when tracing is paused (echo 0 > tracing_on) * @update_thresh: called when tracing_thresh is updated * @open: called when the trace file is opened * @pipe_open: called when the trace_pipe file is opened * @close: called when the trace file is released * @pipe_close: called when the trace_pipe file is released * @read: override the default read callback on trace_pipe * @splice_read: override the default splice_read callback on trace_pipe * @selftest: selftest to run on boot (see trace_selftest.c) * @print_headers: override the first lines that describe your columns * @print_line: callback that prints a trace * @set_flag: signals one of your private flags changed (trace_options file) * @flags: your private flags */ struct tracer { const char *name; int (*init)(struct trace_array *tr); void (*reset)(struct trace_array *tr); void (*start)(struct trace_array *tr); void (*stop)(struct trace_array *tr); int (*update_thresh)(struct trace_array *tr); void (*open)(struct trace_iterator *iter); void (*pipe_open)(struct trace_iterator *iter); void (*close)(struct trace_iterator *iter); void (*pipe_close)(struct trace_iterator *iter); ssize_t (*read)(struct trace_iterator *iter, struct file *filp, char __user *ubuf, size_t cnt, loff_t *ppos); ssize_t (*splice_read)(struct trace_iterator *iter, struct file *filp, loff_t *ppos, struct pipe_inode_info *pipe, size_t len, unsigned int flags); #ifdef CONFIG_FTRACE_STARTUP_TEST int (*selftest)(struct tracer *trace, struct trace_array *tr); #endif void (*print_header)(struct seq_file *m); enum print_line_t (*print_line)(struct trace_iterator *iter); /* If you handled the flag setting, return 0 */ int (*set_flag)(struct trace_array *tr, u32 old_flags, u32 bit, int set); /* Return 0 if OK with change, else return non-zero */ int (*flag_changed)(struct trace_array *tr, u32 mask, int set); struct tracer *next; struct tracer_flags *flags; int enabled; bool print_max; bool allow_instances; #ifdef CONFIG_TRACER_MAX_TRACE bool use_max_tr; #endif /* True if tracer cannot be enabled in kernel param */ bool noboot; }; static inline struct ring_buffer_iter * trace_buffer_iter(struct trace_iterator *iter, int cpu) { return iter->buffer_iter ? iter->buffer_iter[cpu] : NULL; } int tracer_init(struct tracer *t, struct trace_array *tr); int tracing_is_enabled(void); void tracing_reset_online_cpus(struct array_buffer *buf); void tracing_reset_all_online_cpus(void); void tracing_reset_all_online_cpus_unlocked(void); int tracing_open_generic(struct inode *inode, struct file *filp); int tracing_open_generic_tr(struct inode *inode, struct file *filp); int tracing_open_file_tr(struct inode *inode, struct file *filp); int tracing_release_file_tr(struct inode *inode, struct file *filp); bool tracing_is_disabled(void); bool tracer_tracing_is_on(struct trace_array *tr); void tracer_tracing_on(struct trace_array *tr); void tracer_tracing_off(struct trace_array *tr); struct dentry *trace_create_file(const char *name, umode_t mode, struct dentry *parent, void *data, const struct file_operations *fops); int tracing_init_dentry(void); struct ring_buffer_event; struct ring_buffer_event * trace_buffer_lock_reserve(struct trace_buffer *buffer, int type, unsigned long len, unsigned int trace_ctx); struct trace_entry *tracing_get_trace_entry(struct trace_array *tr, struct trace_array_cpu *data); struct trace_entry *trace_find_next_entry(struct trace_iterator *iter, int *ent_cpu, u64 *ent_ts); void trace_buffer_unlock_commit_nostack(struct trace_buffer *buffer, struct ring_buffer_event *event); bool trace_is_tracepoint_string(const char *str); const char *trace_event_format(struct trace_iterator *iter, const char *fmt); void trace_check_vprintf(struct trace_iterator *iter, const char *fmt, va_list ap) __printf(2, 0); char *trace_iter_expand_format(struct trace_iterator *iter); int trace_empty(struct trace_iterator *iter); void *trace_find_next_entry_inc(struct trace_iterator *iter); void trace_init_global_iter(struct trace_iterator *iter); void tracing_iter_reset(struct trace_iterator *iter, int cpu); unsigned long trace_total_entries_cpu(struct trace_array *tr, int cpu); unsigned long trace_total_entries(struct trace_array *tr); void trace_function(struct trace_array *tr, unsigned long ip, unsigned long parent_ip, unsigned int trace_ctx); void trace_graph_function(struct trace_array *tr, unsigned long ip, unsigned long parent_ip, unsigned int trace_ctx); void trace_latency_header(struct seq_file *m); void trace_default_header(struct seq_file *m); void print_trace_header(struct seq_file *m, struct trace_iterator *iter); void trace_graph_return(struct ftrace_graph_ret *trace); int trace_graph_entry(struct ftrace_graph_ent *trace); void set_graph_array(struct trace_array *tr); void tracing_start_cmdline_record(void); void tracing_stop_cmdline_record(void); void tracing_start_tgid_record(void); void tracing_stop_tgid_record(void); int register_tracer(struct tracer *type); int is_tracing_stopped(void); loff_t tracing_lseek(struct file *file, loff_t offset, int whence); extern cpumask_var_t __read_mostly tracing_buffer_mask; #define for_each_tracing_cpu(cpu) \ for_each_cpu(cpu, tracing_buffer_mask) extern unsigned long nsecs_to_usecs(unsigned long nsecs); extern unsigned long tracing_thresh; /* PID filtering */ extern int pid_max; bool trace_find_filtered_pid(struct trace_pid_list *filtered_pids, pid_t search_pid); bool trace_ignore_this_task(struct trace_pid_list *filtered_pids, struct trace_pid_list *filtered_no_pids, struct task_struct *task); void trace_filter_add_remove_task(struct trace_pid_list *pid_list, struct task_struct *self, struct task_struct *task); void *trace_pid_next(struct trace_pid_list *pid_list, void *v, loff_t *pos); void *trace_pid_start(struct trace_pid_list *pid_list, loff_t *pos); int trace_pid_show(struct seq_file *m, void *v); int trace_pid_write(struct trace_pid_list *filtered_pids, struct trace_pid_list **new_pid_list, const char __user *ubuf, size_t cnt); #ifdef CONFIG_TRACER_MAX_TRACE void update_max_tr(struct trace_array *tr, struct task_struct *tsk, int cpu, void *cond_data); void update_max_tr_single(struct trace_array *tr, struct task_struct *tsk, int cpu); #ifdef CONFIG_FSNOTIFY #define LATENCY_FS_NOTIFY #endif #endif /* CONFIG_TRACER_MAX_TRACE */ #ifdef LATENCY_FS_NOTIFY void latency_fsnotify(struct trace_array *tr); #else static inline void latency_fsnotify(struct trace_array *tr) { } #endif #ifdef CONFIG_STACKTRACE void __trace_stack(struct trace_array *tr, unsigned int trace_ctx, int skip); #else static inline void __trace_stack(struct trace_array *tr, unsigned int trace_ctx, int skip) { } #endif /* CONFIG_STACKTRACE */ void trace_last_func_repeats(struct trace_array *tr, struct trace_func_repeats *last_info, unsigned int trace_ctx); extern u64 ftrace_now(int cpu); extern void trace_find_cmdline(int pid, char comm[]); extern int trace_find_tgid(int pid); extern void trace_event_follow_fork(struct trace_array *tr, bool enable); #ifdef CONFIG_DYNAMIC_FTRACE extern unsigned long ftrace_update_tot_cnt; extern unsigned long ftrace_number_of_pages; extern unsigned long ftrace_number_of_groups; void ftrace_init_trace_array(struct trace_array *tr); #else static inline void ftrace_init_trace_array(struct trace_array *tr) { } #endif #define DYN_FTRACE_TEST_NAME trace_selftest_dynamic_test_func extern int DYN_FTRACE_TEST_NAME(void); #define DYN_FTRACE_TEST_NAME2 trace_selftest_dynamic_test_func2 extern int DYN_FTRACE_TEST_NAME2(void); extern bool ring_buffer_expanded; extern bool tracing_selftest_disabled; #ifdef CONFIG_FTRACE_STARTUP_TEST extern void __init disable_tracing_selftest(const char *reason); extern int trace_selftest_startup_function(struct tracer *trace, struct trace_array *tr); extern int trace_selftest_startup_function_graph(struct tracer *trace, struct trace_array *tr); extern int trace_selftest_startup_irqsoff(struct tracer *trace, struct trace_array *tr); extern int trace_selftest_startup_preemptoff(struct tracer *trace, struct trace_array *tr); extern int trace_selftest_startup_preemptirqsoff(struct tracer *trace, struct trace_array *tr); extern int trace_selftest_startup_wakeup(struct tracer *trace, struct trace_array *tr); extern int trace_selftest_startup_nop(struct tracer *trace, struct trace_array *tr); extern int trace_selftest_startup_branch(struct tracer *trace, struct trace_array *tr); /* * Tracer data references selftest functions that only occur * on boot up. These can be __init functions. Thus, when selftests * are enabled, then the tracers need to reference __init functions. */ #define __tracer_data __refdata #else static inline void __init disable_tracing_selftest(const char *reason) { } /* Tracers are seldom changed. Optimize when selftests are disabled. */ #define __tracer_data __read_mostly #endif /* CONFIG_FTRACE_STARTUP_TEST */ extern void *head_page(struct trace_array_cpu *data); extern unsigned long long ns2usecs(u64 nsec); extern int trace_vbprintk(unsigned long ip, const char *fmt, va_list args); extern int trace_vprintk(unsigned long ip, const char *fmt, va_list args); extern int trace_array_vprintk(struct trace_array *tr, unsigned long ip, const char *fmt, va_list args); int trace_array_printk_buf(struct trace_buffer *buffer, unsigned long ip, const char *fmt, ...); void trace_printk_seq(struct trace_seq *s); enum print_line_t print_trace_line(struct trace_iterator *iter); extern char trace_find_mark(unsigned long long duration); struct ftrace_hash; struct ftrace_mod_load { struct list_head list; char *func; char *module; int enable; }; enum { FTRACE_HASH_FL_MOD = (1 << 0), }; struct ftrace_hash { unsigned long size_bits; struct hlist_head *buckets; unsigned long count; unsigned long flags; struct rcu_head rcu; }; struct ftrace_func_entry * ftrace_lookup_ip(struct ftrace_hash *hash, unsigned long ip); static __always_inline bool ftrace_hash_empty(struct ftrace_hash *hash) { return !hash || !(hash->count || (hash->flags & FTRACE_HASH_FL_MOD)); } /* Standard output formatting function used for function return traces */ #ifdef CONFIG_FUNCTION_GRAPH_TRACER /* Flag options */ #define TRACE_GRAPH_PRINT_OVERRUN 0x1 #define TRACE_GRAPH_PRINT_CPU 0x2 #define TRACE_GRAPH_PRINT_OVERHEAD 0x4 #define TRACE_GRAPH_PRINT_PROC 0x8 #define TRACE_GRAPH_PRINT_DURATION 0x10 #define TRACE_GRAPH_PRINT_ABS_TIME 0x20 #define TRACE_GRAPH_PRINT_REL_TIME 0x40 #define TRACE_GRAPH_PRINT_IRQS 0x80 #define TRACE_GRAPH_PRINT_TAIL 0x100 #define TRACE_GRAPH_SLEEP_TIME 0x200 #define TRACE_GRAPH_GRAPH_TIME 0x400 #define TRACE_GRAPH_PRINT_RETVAL 0x800 #define TRACE_GRAPH_PRINT_RETVAL_HEX 0x1000 #define TRACE_GRAPH_PRINT_FILL_SHIFT 28 #define TRACE_GRAPH_PRINT_FILL_MASK (0x3 << TRACE_GRAPH_PRINT_FILL_SHIFT) extern void ftrace_graph_sleep_time_control(bool enable); #ifdef CONFIG_FUNCTION_PROFILER extern void ftrace_graph_graph_time_control(bool enable); #else static inline void ftrace_graph_graph_time_control(bool enable) { } #endif extern enum print_line_t print_graph_function_flags(struct trace_iterator *iter, u32 flags); extern void print_graph_headers_flags(struct seq_file *s, u32 flags); extern void trace_print_graph_duration(unsigned long long duration, struct trace_seq *s); extern void graph_trace_open(struct trace_iterator *iter); extern void graph_trace_close(struct trace_iterator *iter); extern int __trace_graph_entry(struct trace_array *tr, struct ftrace_graph_ent *trace, unsigned int trace_ctx); extern void __trace_graph_return(struct trace_array *tr, struct ftrace_graph_ret *trace, unsigned int trace_ctx); #ifdef CONFIG_DYNAMIC_FTRACE extern struct ftrace_hash __rcu *ftrace_graph_hash; extern struct ftrace_hash __rcu *ftrace_graph_notrace_hash; static inline int ftrace_graph_addr(struct ftrace_graph_ent *trace) { unsigned long addr = trace->func; int ret = 0; struct ftrace_hash *hash; preempt_disable_notrace(); /* * Have to open code "rcu_dereference_sched()" because the * function graph tracer can be called when RCU is not * "watching". * Protected with schedule_on_each_cpu(ftrace_sync) */ hash = rcu_dereference_protected(ftrace_graph_hash, !preemptible()); if (ftrace_hash_empty(hash)) { ret = 1; goto out; } if (ftrace_lookup_ip(hash, addr)) { /* * This needs to be cleared on the return functions * when the depth is zero. */ trace_recursion_set(TRACE_GRAPH_BIT); trace_recursion_set_depth(trace->depth); /* * If no irqs are to be traced, but a set_graph_function * is set, and called by an interrupt handler, we still * want to trace it. */ if (in_hardirq()) trace_recursion_set(TRACE_IRQ_BIT); else trace_recursion_clear(TRACE_IRQ_BIT); ret = 1; } out: preempt_enable_notrace(); return ret; } static inline void ftrace_graph_addr_finish(struct ftrace_graph_ret *trace) { if (trace_recursion_test(TRACE_GRAPH_BIT) && trace->depth == trace_recursion_depth()) trace_recursion_clear(TRACE_GRAPH_BIT); } static inline int ftrace_graph_notrace_addr(unsigned long addr) { int ret = 0; struct ftrace_hash *notrace_hash; preempt_disable_notrace(); /* * Have to open code "rcu_dereference_sched()" because the * function graph tracer can be called when RCU is not * "watching". * Protected with schedule_on_each_cpu(ftrace_sync) */ notrace_hash = rcu_dereference_protected(ftrace_graph_notrace_hash, !preemptible()); if (ftrace_lookup_ip(notrace_hash, addr)) ret = 1; preempt_enable_notrace(); return ret; } #else static inline int ftrace_graph_addr(struct ftrace_graph_ent *trace) { return 1; } static inline int ftrace_graph_notrace_addr(unsigned long addr) { return 0; } static inline void ftrace_graph_addr_finish(struct ftrace_graph_ret *trace) { } #endif /* CONFIG_DYNAMIC_FTRACE */ extern unsigned int fgraph_max_depth; static inline bool ftrace_graph_ignore_func(struct ftrace_graph_ent *trace) { /* trace it when it is-nested-in or is a function enabled. */ return !(trace_recursion_test(TRACE_GRAPH_BIT) || ftrace_graph_addr(trace)) || (trace->depth < 0) || (fgraph_max_depth && trace->depth >= fgraph_max_depth); } #else /* CONFIG_FUNCTION_GRAPH_TRACER */ static inline enum print_line_t print_graph_function_flags(struct trace_iterator *iter, u32 flags) { return TRACE_TYPE_UNHANDLED; } #endif /* CONFIG_FUNCTION_GRAPH_TRACER */ extern struct list_head ftrace_pids; #ifdef CONFIG_FUNCTION_TRACER #define FTRACE_PID_IGNORE -1 #define FTRACE_PID_TRACE -2 struct ftrace_func_command { struct list_head list; char *name; int (*func)(struct trace_array *tr, struct ftrace_hash *hash, char *func, char *cmd, char *params, int enable); }; extern bool ftrace_filter_param __initdata; static inline int ftrace_trace_task(struct trace_array *tr) { return this_cpu_read(tr->array_buffer.data->ftrace_ignore_pid) != FTRACE_PID_IGNORE; } extern int ftrace_is_dead(void); int ftrace_create_function_files(struct trace_array *tr, struct dentry *parent); void ftrace_destroy_function_files(struct trace_array *tr); int ftrace_allocate_ftrace_ops(struct trace_array *tr); void ftrace_free_ftrace_ops(struct trace_array *tr); void ftrace_init_global_array_ops(struct trace_array *tr); void ftrace_init_array_ops(struct trace_array *tr, ftrace_func_t func); void ftrace_reset_array_ops(struct trace_array *tr); void ftrace_init_tracefs(struct trace_array *tr, struct dentry *d_tracer); void ftrace_init_tracefs_toplevel(struct trace_array *tr, struct dentry *d_tracer); void ftrace_clear_pids(struct trace_array *tr); int init_function_trace(void); void ftrace_pid_follow_fork(struct trace_array *tr, bool enable); #else static inline int ftrace_trace_task(struct trace_array *tr) { return 1; } static inline int ftrace_is_dead(void) { return 0; } static inline int ftrace_create_function_files(struct trace_array *tr, struct dentry *parent) { return 0; } static inline int ftrace_allocate_ftrace_ops(struct trace_array *tr) { return 0; } static inline void ftrace_free_ftrace_ops(struct trace_array *tr) { } static inline void ftrace_destroy_function_files(struct trace_array *tr) { } static inline __init void ftrace_init_global_array_ops(struct trace_array *tr) { } static inline void ftrace_reset_array_ops(struct trace_array *tr) { } static inline void ftrace_init_tracefs(struct trace_array *tr, struct dentry *d) { } static inline void ftrace_init_tracefs_toplevel(struct trace_array *tr, struct dentry *d) { } static inline void ftrace_clear_pids(struct trace_array *tr) { } static inline int init_function_trace(void) { return 0; } static inline void ftrace_pid_follow_fork(struct trace_array *tr, bool enable) { } /* ftace_func_t type is not defined, use macro instead of static inline */ #define ftrace_init_array_ops(tr, func) do { } while (0) #endif /* CONFIG_FUNCTION_TRACER */ #if defined(CONFIG_FUNCTION_TRACER) && defined(CONFIG_DYNAMIC_FTRACE) struct ftrace_probe_ops { void (*func)(unsigned long ip, unsigned long parent_ip, struct trace_array *tr, struct ftrace_probe_ops *ops, void *data); int (*init)(struct ftrace_probe_ops *ops, struct trace_array *tr, unsigned long ip, void *init_data, void **data); void (*free)(struct ftrace_probe_ops *ops, struct trace_array *tr, unsigned long ip, void *data); int (*print)(struct seq_file *m, unsigned long ip, struct ftrace_probe_ops *ops, void *data); }; struct ftrace_func_mapper; typedef int (*ftrace_mapper_func)(void *data); struct ftrace_func_mapper *allocate_ftrace_func_mapper(void); void **ftrace_func_mapper_find_ip(struct ftrace_func_mapper *mapper, unsigned long ip); int ftrace_func_mapper_add_ip(struct ftrace_func_mapper *mapper, unsigned long ip, void *data); void *ftrace_func_mapper_remove_ip(struct ftrace_func_mapper *mapper, unsigned long ip); void free_ftrace_func_mapper(struct ftrace_func_mapper *mapper, ftrace_mapper_func free_func); extern int register_ftrace_function_probe(char *glob, struct trace_array *tr, struct ftrace_probe_ops *ops, void *data); extern int unregister_ftrace_function_probe_func(char *glob, struct trace_array *tr, struct ftrace_probe_ops *ops); extern void clear_ftrace_function_probes(struct trace_array *tr); int register_ftrace_command(struct ftrace_func_command *cmd); int unregister_ftrace_command(struct ftrace_func_command *cmd); void ftrace_create_filter_files(struct ftrace_ops *ops, struct dentry *parent); void ftrace_destroy_filter_files(struct ftrace_ops *ops); extern int ftrace_set_filter(struct ftrace_ops *ops, unsigned char *buf, int len, int reset); extern int ftrace_set_notrace(struct ftrace_ops *ops, unsigned char *buf, int len, int reset); #else struct ftrace_func_command; static inline __init int register_ftrace_command(struct ftrace_func_command *cmd) { return -EINVAL; } static inline __init int unregister_ftrace_command(char *cmd_name) { return -EINVAL; } static inline void clear_ftrace_function_probes(struct trace_array *tr) { } /* * The ops parameter passed in is usually undefined. * This must be a macro. */ #define ftrace_create_filter_files(ops, parent) do { } while (0) #define ftrace_destroy_filter_files(ops) do { } while (0) #endif /* CONFIG_FUNCTION_TRACER && CONFIG_DYNAMIC_FTRACE */ bool ftrace_event_is_function(struct trace_event_call *call); /* * struct trace_parser - servers for reading the user input separated by spaces * @cont: set if the input is not complete - no final space char was found * @buffer: holds the parsed user input * @idx: user input length * @size: buffer size */ struct trace_parser { bool cont; char *buffer; unsigned idx; unsigned size; }; static inline bool trace_parser_loaded(struct trace_parser *parser) { return (parser->idx != 0); } static inline bool trace_parser_cont(struct trace_parser *parser) { return parser->cont; } static inline void trace_parser_clear(struct trace_parser *parser) { parser->cont = false; parser->idx = 0; } extern int trace_parser_get_init(struct trace_parser *parser, int size); extern void trace_parser_put(struct trace_parser *parser); extern int trace_get_user(struct trace_parser *parser, const char __user *ubuf, size_t cnt, loff_t *ppos); /* * Only create function graph options if function graph is configured. */ #ifdef CONFIG_FUNCTION_GRAPH_TRACER # define FGRAPH_FLAGS \ C(DISPLAY_GRAPH, "display-graph"), #else # define FGRAPH_FLAGS #endif #ifdef CONFIG_BRANCH_TRACER # define BRANCH_FLAGS \ C(BRANCH, "branch"), #else # define BRANCH_FLAGS #endif #ifdef CONFIG_FUNCTION_TRACER # define FUNCTION_FLAGS \ C(FUNCTION, "function-trace"), \ C(FUNC_FORK, "function-fork"), # define FUNCTION_DEFAULT_FLAGS TRACE_ITER_FUNCTION #else # define FUNCTION_FLAGS # define FUNCTION_DEFAULT_FLAGS 0UL # define TRACE_ITER_FUNC_FORK 0UL #endif #ifdef CONFIG_STACKTRACE # define STACK_FLAGS \ C(STACKTRACE, "stacktrace"), #else # define STACK_FLAGS #endif /* * trace_iterator_flags is an enumeration that defines bit * positions into trace_flags that controls the output. * * NOTE: These bits must match the trace_options array in * trace.c (this macro guarantees it). */ #define TRACE_FLAGS \ C(PRINT_PARENT, "print-parent"), \ C(SYM_OFFSET, "sym-offset"), \ C(SYM_ADDR, "sym-addr"), \ C(VERBOSE, "verbose"), \ C(RAW, "raw"), \ C(HEX, "hex"), \ C(BIN, "bin"), \ C(BLOCK, "block"), \ C(FIELDS, "fields"), \ C(PRINTK, "trace_printk"), \ C(ANNOTATE, "annotate"), \ C(USERSTACKTRACE, "userstacktrace"), \ C(SYM_USEROBJ, "sym-userobj"), \ C(PRINTK_MSGONLY, "printk-msg-only"), \ C(CONTEXT_INFO, "context-info"), /* Print pid/cpu/time */ \ C(LATENCY_FMT, "latency-format"), \ C(RECORD_CMD, "record-cmd"), \ C(RECORD_TGID, "record-tgid"), \ C(OVERWRITE, "overwrite"), \ C(STOP_ON_FREE, "disable_on_free"), \ C(IRQ_INFO, "irq-info"), \ C(MARKERS, "markers"), \ C(EVENT_FORK, "event-fork"), \ C(PAUSE_ON_TRACE, "pause-on-trace"), \ C(HASH_PTR, "hash-ptr"), /* Print hashed pointer */ \ FUNCTION_FLAGS \ FGRAPH_FLAGS \ STACK_FLAGS \ BRANCH_FLAGS /* * By defining C, we can make TRACE_FLAGS a list of bit names * that will define the bits for the flag masks. */ #undef C #define C(a, b) TRACE_ITER_##a##_BIT enum trace_iterator_bits { TRACE_FLAGS /* Make sure we don't go more than we have bits for */ TRACE_ITER_LAST_BIT }; /* * By redefining C, we can make TRACE_FLAGS a list of masks that * use the bits as defined above. */ #undef C #define C(a, b) TRACE_ITER_##a = (1 << TRACE_ITER_##a##_BIT) enum trace_iterator_flags { TRACE_FLAGS }; /* * TRACE_ITER_SYM_MASK masks the options in trace_flags that * control the output of kernel symbols. */ #define TRACE_ITER_SYM_MASK \ (TRACE_ITER_PRINT_PARENT|TRACE_ITER_SYM_OFFSET|TRACE_ITER_SYM_ADDR) extern struct tracer nop_trace; #ifdef CONFIG_BRANCH_TRACER extern int enable_branch_tracing(struct trace_array *tr); extern void disable_branch_tracing(void); static inline int trace_branch_enable(struct trace_array *tr) { if (tr->trace_flags & TRACE_ITER_BRANCH) return enable_branch_tracing(tr); return 0; } static inline void trace_branch_disable(void) { /* due to races, always disable */ disable_branch_tracing(); } #else static inline int trace_branch_enable(struct trace_array *tr) { return 0; } static inline void trace_branch_disable(void) { } #endif /* CONFIG_BRANCH_TRACER */ /* set ring buffers to default size if not already done so */ int tracing_update_buffers(void); union trace_synth_field { u8 as_u8; u16 as_u16; u32 as_u32; u64 as_u64; struct trace_dynamic_info as_dynamic; }; struct ftrace_event_field { struct list_head link; const char *name; const char *type; int filter_type; int offset; int size; int is_signed; int len; }; struct prog_entry; struct event_filter { struct prog_entry __rcu *prog; char *filter_string; }; struct event_subsystem { struct list_head list; const char *name; struct event_filter *filter; int ref_count; }; struct trace_subsystem_dir { struct list_head list; struct event_subsystem *subsystem; struct trace_array *tr; struct eventfs_file *ef; int ref_count; int nr_events; }; extern int call_filter_check_discard(struct trace_event_call *call, void *rec, struct trace_buffer *buffer, struct ring_buffer_event *event); void trace_buffer_unlock_commit_regs(struct trace_array *tr, struct trace_buffer *buffer, struct ring_buffer_event *event, unsigned int trcace_ctx, struct pt_regs *regs); static inline void trace_buffer_unlock_commit(struct trace_array *tr, struct trace_buffer *buffer, struct ring_buffer_event *event, unsigned int trace_ctx) { trace_buffer_unlock_commit_regs(tr, buffer, event, trace_ctx, NULL); } DECLARE_PER_CPU(struct ring_buffer_event *, trace_buffered_event); DECLARE_PER_CPU(int, trace_buffered_event_cnt); void trace_buffered_event_disable(void); void trace_buffered_event_enable(void); void early_enable_events(struct trace_array *tr, char *buf, bool disable_first); static inline void __trace_event_discard_commit(struct trace_buffer *buffer, struct ring_buffer_event *event) { if (this_cpu_read(trace_buffered_event) == event) { /* Simply release the temp buffer and enable preemption */ this_cpu_dec(trace_buffered_event_cnt); preempt_enable_notrace(); return; } /* ring_buffer_discard_commit() enables preemption */ ring_buffer_discard_commit(buffer, event); } /* * Helper function for event_trigger_unlock_commit{_regs}(). * If there are event triggers attached to this event that requires * filtering against its fields, then they will be called as the * entry already holds the field information of the current event. * * It also checks if the event should be discarded or not. * It is to be discarded if the event is soft disabled and the * event was only recorded to process triggers, or if the event * filter is active and this event did not match the filters. * * Returns true if the event is discarded, false otherwise. */ static inline bool __event_trigger_test_discard(struct trace_event_file *file, struct trace_buffer *buffer, struct ring_buffer_event *event, void *entry, enum event_trigger_type *tt) { unsigned long eflags = file->flags; if (eflags & EVENT_FILE_FL_TRIGGER_COND) *tt = event_triggers_call(file, buffer, entry, event); if (likely(!(file->flags & (EVENT_FILE_FL_SOFT_DISABLED | EVENT_FILE_FL_FILTERED | EVENT_FILE_FL_PID_FILTER)))) return false; if (file->flags & EVENT_FILE_FL_SOFT_DISABLED) goto discard; if (file->flags & EVENT_FILE_FL_FILTERED && !filter_match_preds(file->filter, entry)) goto discard; if ((file->flags & EVENT_FILE_FL_PID_FILTER) && trace_event_ignore_this_pid(file)) goto discard; return false; discard: __trace_event_discard_commit(buffer, event); return true; } /** * event_trigger_unlock_commit - handle triggers and finish event commit * @file: The file pointer associated with the event * @buffer: The ring buffer that the event is being written to * @event: The event meta data in the ring buffer * @entry: The event itself * @trace_ctx: The tracing context flags. * * This is a helper function to handle triggers that require data * from the event itself. It also tests the event against filters and * if the event is soft disabled and should be discarded. */ static inline void event_trigger_unlock_commit(struct trace_event_file *file, struct trace_buffer *buffer, struct ring_buffer_event *event, void *entry, unsigned int trace_ctx) { enum event_trigger_type tt = ETT_NONE; if (!__event_trigger_test_discard(file, buffer, event, entry, &tt)) trace_buffer_unlock_commit(file->tr, buffer, event, trace_ctx); if (tt) event_triggers_post_call(file, tt); } #define FILTER_PRED_INVALID ((unsigned short)-1) #define FILTER_PRED_IS_RIGHT (1 << 15) #define FILTER_PRED_FOLD (1 << 15) /* * The max preds is the size of unsigned short with * two flags at the MSBs. One bit is used for both the IS_RIGHT * and FOLD flags. The other is reserved. * * 2^14 preds is way more than enough. */ #define MAX_FILTER_PRED 16384 struct filter_pred; struct regex; typedef int (*regex_match_func)(char *str, struct regex *r, int len); enum regex_type { MATCH_FULL = 0, MATCH_FRONT_ONLY, MATCH_MIDDLE_ONLY, MATCH_END_ONLY, MATCH_GLOB, MATCH_INDEX, }; struct regex { char pattern[MAX_FILTER_STR_VAL]; int len; int field_len; regex_match_func match; }; static inline bool is_string_field(struct ftrace_event_field *field) { return field->filter_type == FILTER_DYN_STRING || field->filter_type == FILTER_RDYN_STRING || field->filter_type == FILTER_STATIC_STRING || field->filter_type == FILTER_PTR_STRING || field->filter_type == FILTER_COMM; } static inline bool is_function_field(struct ftrace_event_field *field) { return field->filter_type == FILTER_TRACE_FN; } extern enum regex_type filter_parse_regex(char *buff, int len, char **search, int *not); extern void print_event_filter(struct trace_event_file *file, struct trace_seq *s); extern int apply_event_filter(struct trace_event_file *file, char *filter_string); extern int apply_subsystem_event_filter(struct trace_subsystem_dir *dir, char *filter_string); extern void print_subsystem_event_filter(struct event_subsystem *system, struct trace_seq *s); extern int filter_assign_type(const char *type); extern int create_event_filter(struct trace_array *tr, struct trace_event_call *call, char *filter_str, bool set_str, struct event_filter **filterp); extern void free_event_filter(struct event_filter *filter); struct ftrace_event_field * trace_find_event_field(struct trace_event_call *call, char *name); extern void trace_event_enable_cmd_record(bool enable); extern void trace_event_enable_tgid_record(bool enable); extern int event_trace_init(void); extern int init_events(void); extern int event_trace_add_tracer(struct dentry *parent, struct trace_array *tr); extern int event_trace_del_tracer(struct trace_array *tr); extern void __trace_early_add_events(struct trace_array *tr); extern struct trace_event_file *__find_event_file(struct trace_array *tr, const char *system, const char *event); extern struct trace_event_file *find_event_file(struct trace_array *tr, const char *system, const char *event); static inline void *event_file_data(struct file *filp) { return READ_ONCE(file_inode(filp)->i_private); } extern struct mutex event_mutex; extern struct list_head ftrace_events; extern const struct file_operations event_trigger_fops; extern const struct file_operations event_hist_fops; extern const struct file_operations event_hist_debug_fops; extern const struct file_operations event_inject_fops; #ifdef CONFIG_HIST_TRIGGERS extern int register_trigger_hist_cmd(void); extern int register_trigger_hist_enable_disable_cmds(void); #else static inline int register_trigger_hist_cmd(void) { return 0; } static inline int register_trigger_hist_enable_disable_cmds(void) { return 0; } #endif extern int register_trigger_cmds(void); extern void clear_event_triggers(struct trace_array *tr); enum { EVENT_TRIGGER_FL_PROBE = BIT(0), }; struct event_trigger_data { unsigned long count; int ref; int flags; struct event_trigger_ops *ops; struct event_command *cmd_ops; struct event_filter __rcu *filter; char *filter_str; void *private_data; bool paused; bool paused_tmp; struct list_head list; char *name; struct list_head named_list; struct event_trigger_data *named_data; }; /* Avoid typos */ #define ENABLE_EVENT_STR "enable_event" #define DISABLE_EVENT_STR "disable_event" #define ENABLE_HIST_STR "enable_hist" #define DISABLE_HIST_STR "disable_hist" struct enable_trigger_data { struct trace_event_file *file; bool enable; bool hist; }; extern int event_enable_trigger_print(struct seq_file *m, struct event_trigger_data *data); extern void event_enable_trigger_free(struct event_trigger_data *data); extern int event_enable_trigger_parse(struct event_command *cmd_ops, struct trace_event_file *file, char *glob, char *cmd, char *param_and_filter); extern int event_enable_register_trigger(char *glob, struct event_trigger_data *data, struct trace_event_file *file); extern void event_enable_unregister_trigger(char *glob, struct event_trigger_data *test, struct trace_event_file *file); extern void trigger_data_free(struct event_trigger_data *data); extern int event_trigger_init(struct event_trigger_data *data); extern int trace_event_trigger_enable_disable(struct trace_event_file *file, int trigger_enable); extern void update_cond_flag(struct trace_event_file *file); extern int set_trigger_filter(char *filter_str, struct event_trigger_data *trigger_data, struct trace_event_file *file); extern struct event_trigger_data *find_named_trigger(const char *name); extern bool is_named_trigger(struct event_trigger_data *test); extern int save_named_trigger(const char *name, struct event_trigger_data *data); extern void del_named_trigger(struct event_trigger_data *data); extern void pause_named_trigger(struct event_trigger_data *data); extern void unpause_named_trigger(struct event_trigger_data *data); extern void set_named_trigger_data(struct event_trigger_data *data, struct event_trigger_data *named_data); extern struct event_trigger_data * get_named_trigger_data(struct event_trigger_data *data); extern int register_event_command(struct event_command *cmd); extern int unregister_event_command(struct event_command *cmd); extern int register_trigger_hist_enable_disable_cmds(void); extern bool event_trigger_check_remove(const char *glob); extern bool event_trigger_empty_param(const char *param); extern int event_trigger_separate_filter(char *param_and_filter, char **param, char **filter, bool param_required); extern struct event_trigger_data * event_trigger_alloc(struct event_command *cmd_ops, char *cmd, char *param, void *private_data); extern int event_trigger_parse_num(char *trigger, struct event_trigger_data *trigger_data); extern int event_trigger_set_filter(struct event_command *cmd_ops, struct trace_event_file *file, char *param, struct event_trigger_data *trigger_data); extern void event_trigger_reset_filter(struct event_command *cmd_ops, struct event_trigger_data *trigger_data); extern int event_trigger_register(struct event_command *cmd_ops, struct trace_event_file *file, char *glob, struct event_trigger_data *trigger_data); extern void event_trigger_unregister(struct event_command *cmd_ops, struct trace_event_file *file, char *glob, struct event_trigger_data *trigger_data); /** * struct event_trigger_ops - callbacks for trace event triggers * * The methods in this structure provide per-event trigger hooks for * various trigger operations. * * The @init and @free methods are used during trigger setup and * teardown, typically called from an event_command's @parse() * function implementation. * * The @print method is used to print the trigger spec. * * The @trigger method is the function that actually implements the * trigger and is called in the context of the triggering event * whenever that event occurs. * * All the methods below, except for @init() and @free(), must be * implemented. * * @trigger: The trigger 'probe' function called when the triggering * event occurs. The data passed into this callback is the data * that was supplied to the event_command @reg() function that * registered the trigger (see struct event_command) along with * the trace record, rec. * * @init: An optional initialization function called for the trigger * when the trigger is registered (via the event_command reg() * function). This can be used to perform per-trigger * initialization such as incrementing a per-trigger reference * count, for instance. This is usually implemented by the * generic utility function @event_trigger_init() (see * trace_event_triggers.c). * * @free: An optional de-initialization function called for the * trigger when the trigger is unregistered (via the * event_command @reg() function). This can be used to perform * per-trigger de-initialization such as decrementing a * per-trigger reference count and freeing corresponding trigger * data, for instance. This is usually implemented by the * generic utility function @event_trigger_free() (see * trace_event_triggers.c). * * @print: The callback function invoked to have the trigger print * itself. This is usually implemented by a wrapper function * that calls the generic utility function @event_trigger_print() * (see trace_event_triggers.c). */ struct event_trigger_ops { void (*trigger)(struct event_trigger_data *data, struct trace_buffer *buffer, void *rec, struct ring_buffer_event *rbe); int (*init)(struct event_trigger_data *data); void (*free)(struct event_trigger_data *data); int (*print)(struct seq_file *m, struct event_trigger_data *data); }; /** * struct event_command - callbacks and data members for event commands * * Event commands are invoked by users by writing the command name * into the 'trigger' file associated with a trace event. The * parameters associated with a specific invocation of an event * command are used to create an event trigger instance, which is * added to the list of trigger instances associated with that trace * event. When the event is hit, the set of triggers associated with * that event is invoked. * * The data members in this structure provide per-event command data * for various event commands. * * All the data members below, except for @post_trigger, must be set * for each event command. * * @name: The unique name that identifies the event command. This is * the name used when setting triggers via trigger files. * * @trigger_type: A unique id that identifies the event command * 'type'. This value has two purposes, the first to ensure that * only one trigger of the same type can be set at a given time * for a particular event e.g. it doesn't make sense to have both * a traceon and traceoff trigger attached to a single event at * the same time, so traceon and traceoff have the same type * though they have different names. The @trigger_type value is * also used as a bit value for deferring the actual trigger * action until after the current event is finished. Some * commands need to do this if they themselves log to the trace * buffer (see the @post_trigger() member below). @trigger_type * values are defined by adding new values to the trigger_type * enum in include/linux/trace_events.h. * * @flags: See the enum event_command_flags below. * * All the methods below, except for @set_filter() and @unreg_all(), * must be implemented. * * @parse: The callback function responsible for parsing and * registering the trigger written to the 'trigger' file by the * user. It allocates the trigger instance and registers it with * the appropriate trace event. It makes use of the other * event_command callback functions to orchestrate this, and is * usually implemented by the generic utility function * @event_trigger_callback() (see trace_event_triggers.c). * * @reg: Adds the trigger to the list of triggers associated with the * event, and enables the event trigger itself, after * initializing it (via the event_trigger_ops @init() function). * This is also where commands can use the @trigger_type value to * make the decision as to whether or not multiple instances of * the trigger should be allowed. This is usually implemented by * the generic utility function @register_trigger() (see * trace_event_triggers.c). * * @unreg: Removes the trigger from the list of triggers associated * with the event, and disables the event trigger itself, after * initializing it (via the event_trigger_ops @free() function). * This is usually implemented by the generic utility function * @unregister_trigger() (see trace_event_triggers.c). * * @unreg_all: An optional function called to remove all the triggers * from the list of triggers associated with the event. Called * when a trigger file is opened in truncate mode. * * @set_filter: An optional function called to parse and set a filter * for the trigger. If no @set_filter() method is set for the * event command, filters set by the user for the command will be * ignored. This is usually implemented by the generic utility * function @set_trigger_filter() (see trace_event_triggers.c). * * @get_trigger_ops: The callback function invoked to retrieve the * event_trigger_ops implementation associated with the command. * This callback function allows a single event_command to * support multiple trigger implementations via different sets of * event_trigger_ops, depending on the value of the @param * string. */ struct event_command { struct list_head list; char *name; enum event_trigger_type trigger_type; int flags; int (*parse)(struct event_command *cmd_ops, struct trace_event_file *file, char *glob, char *cmd, char *param_and_filter); int (*reg)(char *glob, struct event_trigger_data *data, struct trace_event_file *file); void (*unreg)(char *glob, struct event_trigger_data *data, struct trace_event_file *file); void (*unreg_all)(struct trace_event_file *file); int (*set_filter)(char *filter_str, struct event_trigger_data *data, struct trace_event_file *file); struct event_trigger_ops *(*get_trigger_ops)(char *cmd, char *param); }; /** * enum event_command_flags - flags for struct event_command * * @POST_TRIGGER: A flag that says whether or not this command needs * to have its action delayed until after the current event has * been closed. Some triggers need to avoid being invoked while * an event is currently in the process of being logged, since * the trigger may itself log data into the trace buffer. Thus * we make sure the current event is committed before invoking * those triggers. To do that, the trigger invocation is split * in two - the first part checks the filter using the current * trace record; if a command has the @post_trigger flag set, it * sets a bit for itself in the return value, otherwise it * directly invokes the trigger. Once all commands have been * either invoked or set their return flag, the current record is * either committed or discarded. At that point, if any commands * have deferred their triggers, those commands are finally * invoked following the close of the current event. In other * words, if the event_trigger_ops @func() probe implementation * itself logs to the trace buffer, this flag should be set, * otherwise it can be left unspecified. * * @NEEDS_REC: A flag that says whether or not this command needs * access to the trace record in order to perform its function, * regardless of whether or not it has a filter associated with * it (filters make a trigger require access to the trace record * but are not always present). */ enum event_command_flags { EVENT_CMD_FL_POST_TRIGGER = 1, EVENT_CMD_FL_NEEDS_REC = 2, }; static inline bool event_command_post_trigger(struct event_command *cmd_ops) { return cmd_ops->flags & EVENT_CMD_FL_POST_TRIGGER; } static inline bool event_command_needs_rec(struct event_command *cmd_ops) { return cmd_ops->flags & EVENT_CMD_FL_NEEDS_REC; } extern int trace_event_enable_disable(struct trace_event_file *file, int enable, int soft_disable); extern int tracing_alloc_snapshot(void); extern void tracing_snapshot_cond(struct trace_array *tr, void *cond_data); extern int tracing_snapshot_cond_enable(struct trace_array *tr, void *cond_data, cond_update_fn_t update); extern int tracing_snapshot_cond_disable(struct trace_array *tr); extern void *tracing_cond_snapshot_data(struct trace_array *tr); extern const char *__start___trace_bprintk_fmt[]; extern const char *__stop___trace_bprintk_fmt[]; extern const char *__start___tracepoint_str[]; extern const char *__stop___tracepoint_str[]; void trace_printk_control(bool enabled); void trace_printk_start_comm(void); int trace_keep_overwrite(struct tracer *tracer, u32 mask, int set); int set_tracer_flag(struct trace_array *tr, unsigned int mask, int enabled); /* Used from boot time tracer */ extern int trace_set_options(struct trace_array *tr, char *option); extern int tracing_set_tracer(struct trace_array *tr, const char *buf); extern ssize_t tracing_resize_ring_buffer(struct trace_array *tr, unsigned long size, int cpu_id); extern int tracing_set_cpumask(struct trace_array *tr, cpumask_var_t tracing_cpumask_new); #define MAX_EVENT_NAME_LEN 64 extern ssize_t trace_parse_run_command(struct file *file, const char __user *buffer, size_t count, loff_t *ppos, int (*createfn)(const char *)); extern unsigned int err_pos(char *cmd, const char *str); extern void tracing_log_err(struct trace_array *tr, const char *loc, const char *cmd, const char **errs, u8 type, u16 pos); /* * Normal trace_printk() and friends allocates special buffers * to do the manipulation, as well as saves the print formats * into sections to display. But the trace infrastructure wants * to use these without the added overhead at the price of being * a bit slower (used mainly for warnings, where we don't care * about performance). The internal_trace_puts() is for such * a purpose. */ #define internal_trace_puts(str) __trace_puts(_THIS_IP_, str, strlen(str)) #undef FTRACE_ENTRY #define FTRACE_ENTRY(call, struct_name, id, tstruct, print) \ extern struct trace_event_call \ __aligned(4) event_##call; #undef FTRACE_ENTRY_DUP #define FTRACE_ENTRY_DUP(call, struct_name, id, tstruct, print) \ FTRACE_ENTRY(call, struct_name, id, PARAMS(tstruct), PARAMS(print)) #undef FTRACE_ENTRY_PACKED #define FTRACE_ENTRY_PACKED(call, struct_name, id, tstruct, print) \ FTRACE_ENTRY(call, struct_name, id, PARAMS(tstruct), PARAMS(print)) #include "trace_entries.h" #if defined(CONFIG_PERF_EVENTS) && defined(CONFIG_FUNCTION_TRACER) int perf_ftrace_event_register(struct trace_event_call *call, enum trace_reg type, void *data); #else #define perf_ftrace_event_register NULL #endif #ifdef CONFIG_FTRACE_SYSCALLS void init_ftrace_syscalls(void); const char *get_syscall_name(int syscall); #else static inline void init_ftrace_syscalls(void) { } static inline const char *get_syscall_name(int syscall) { return NULL; } #endif #ifdef CONFIG_EVENT_TRACING void trace_event_init(void); void trace_event_eval_update(struct trace_eval_map **map, int len); /* Used from boot time tracer */ extern int ftrace_set_clr_event(struct trace_array *tr, char *buf, int set); extern int trigger_process_regex(struct trace_event_file *file, char *buff); #else static inline void __init trace_event_init(void) { } static inline void trace_event_eval_update(struct trace_eval_map **map, int len) { } #endif #ifdef CONFIG_TRACER_SNAPSHOT void tracing_snapshot_instance(struct trace_array *tr); int tracing_alloc_snapshot_instance(struct trace_array *tr); #else static inline void tracing_snapshot_instance(struct trace_array *tr) { } static inline int tracing_alloc_snapshot_instance(struct trace_array *tr) { return 0; } #endif #ifdef CONFIG_PREEMPT_TRACER void tracer_preempt_on(unsigned long a0, unsigned long a1); void tracer_preempt_off(unsigned long a0, unsigned long a1); #else static inline void tracer_preempt_on(unsigned long a0, unsigned long a1) { } static inline void tracer_preempt_off(unsigned long a0, unsigned long a1) { } #endif #ifdef CONFIG_IRQSOFF_TRACER void tracer_hardirqs_on(unsigned long a0, unsigned long a1); void tracer_hardirqs_off(unsigned long a0, unsigned long a1); #else static inline void tracer_hardirqs_on(unsigned long a0, unsigned long a1) { } static inline void tracer_hardirqs_off(unsigned long a0, unsigned long a1) { } #endif /* * Reset the state of the trace_iterator so that it can read consumed data. * Normally, the trace_iterator is used for reading the data when it is not * consumed, and must retain state. */ static __always_inline void trace_iterator_reset(struct trace_iterator *iter) { memset_startat(iter, 0, seq); iter->pos = -1; } /* Check the name is good for event/group/fields */ static inline bool __is_good_name(const char *name, bool hash_ok) { if (!isalpha(*name) && *name != '_' && (!hash_ok || *name != '-')) return false; while (*++name != '\0') { if (!isalpha(*name) && !isdigit(*name) && *name != '_' && (!hash_ok || *name != '-')) return false; } return true; } /* Check the name is good for event/group/fields */ static inline bool is_good_name(const char *name) { return __is_good_name(name, false); } /* Check the name is good for system */ static inline bool is_good_system_name(const char *name) { return __is_good_name(name, true); } /* Convert certain expected symbols into '_' when generating event names */ static inline void sanitize_event_name(char *name) { while (*name++ != '\0') if (*name == ':' || *name == '.') *name = '_'; } /* * This is a generic way to read and write a u64 value from a file in tracefs. * * The value is stored on the variable pointed by *val. The value needs * to be at least *min and at most *max. The write is protected by an * existing *lock. */ struct trace_min_max_param { struct mutex *lock; u64 *val; u64 *min; u64 *max; }; #define U64_STR_SIZE 24 /* 20 digits max */ extern const struct file_operations trace_min_max_fops; #ifdef CONFIG_RV extern int rv_init_interface(void); #else static inline int rv_init_interface(void) { return 0; } #endif #endif /* _LINUX_KERNEL_TRACE_H */ |
5 3 3 4 3 3 2 2 2 3 3 3 3 3 2 3 2 1 | 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60 61 62 63 64 65 66 67 68 69 70 71 72 73 74 75 76 77 78 79 80 81 82 83 84 85 86 87 88 89 90 91 92 93 94 95 96 97 98 99 100 101 102 103 104 105 106 107 108 109 110 111 112 113 114 115 116 117 118 119 120 121 122 123 124 125 126 127 128 129 130 131 132 133 134 135 136 137 138 139 140 141 142 143 144 145 146 147 148 149 150 151 152 153 154 155 156 157 158 159 160 161 162 163 164 165 166 167 168 169 170 171 172 173 174 175 176 177 178 179 180 181 182 183 184 | // SPDX-License-Identifier: GPL-2.0-only /* * GHASH: hash function for GCM (Galois/Counter Mode). * * Copyright (c) 2007 Nokia Siemens Networks - Mikko Herranen <mh1@iki.fi> * Copyright (c) 2009 Intel Corp. * Author: Huang Ying <ying.huang@intel.com> */ /* * GHASH is a keyed hash function used in GCM authentication tag generation. * * The original GCM paper [1] presents GHASH as a function GHASH(H, A, C) which * takes a 16-byte hash key H, additional authenticated data A, and a ciphertext * C. It formats A and C into a single byte string X, interprets X as a * polynomial over GF(2^128), and evaluates this polynomial at the point H. * * However, the NIST standard for GCM [2] presents GHASH as GHASH(H, X) where X * is the already-formatted byte string containing both A and C. * * "ghash" in the Linux crypto API uses the 'X' (pre-formatted) convention, * since the API supports only a single data stream per hash. Thus, the * formatting of 'A' and 'C' is done in the "gcm" template, not in "ghash". * * The reason "ghash" is separate from "gcm" is to allow "gcm" to use an * accelerated "ghash" when a standalone accelerated "gcm(aes)" is unavailable. * It is generally inappropriate to use "ghash" for other purposes, since it is * an "ε-almost-XOR-universal hash function", not a cryptographic hash function. * It can only be used securely in crypto modes specially designed to use it. * * [1] The Galois/Counter Mode of Operation (GCM) * (http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.694.695&rep=rep1&type=pdf) * [2] Recommendation for Block Cipher Modes of Operation: Galois/Counter Mode (GCM) and GMAC * (https://csrc.nist.gov/publications/detail/sp/800-38d/final) */ #include <crypto/algapi.h> #include <crypto/gf128mul.h> #include <crypto/ghash.h> #include <crypto/internal/hash.h> #include <linux/crypto.h> #include <linux/init.h> #include <linux/kernel.h> #include <linux/module.h> static int ghash_init(struct shash_desc *desc) { struct ghash_desc_ctx *dctx = shash_desc_ctx(desc); memset(dctx, 0, sizeof(*dctx)); return 0; } static int ghash_setkey(struct crypto_shash *tfm, const u8 *key, unsigned int keylen) { struct ghash_ctx *ctx = crypto_shash_ctx(tfm); be128 k; if (keylen != GHASH_BLOCK_SIZE) return -EINVAL; if (ctx->gf128) gf128mul_free_4k(ctx->gf128); BUILD_BUG_ON(sizeof(k) != GHASH_BLOCK_SIZE); memcpy(&k, key, GHASH_BLOCK_SIZE); /* avoid violating alignment rules */ ctx->gf128 = gf128mul_init_4k_lle(&k); memzero_explicit(&k, GHASH_BLOCK_SIZE); if (!ctx->gf128) return -ENOMEM; return 0; } static int ghash_update(struct shash_desc *desc, const u8 *src, unsigned int srclen) { struct ghash_desc_ctx *dctx = shash_desc_ctx(desc); struct ghash_ctx *ctx = crypto_shash_ctx(desc->tfm); u8 *dst = dctx->buffer; if (dctx->bytes) { int n = min(srclen, dctx->bytes); u8 *pos = dst + (GHASH_BLOCK_SIZE - dctx->bytes); dctx->bytes -= n; srclen -= n; while (n--) *pos++ ^= *src++; if (!dctx->bytes) gf128mul_4k_lle((be128 *)dst, ctx->gf128); } while (srclen >= GHASH_BLOCK_SIZE) { crypto_xor(dst, src, GHASH_BLOCK_SIZE); gf128mul_4k_lle((be128 *)dst, ctx->gf128); src += GHASH_BLOCK_SIZE; srclen -= GHASH_BLOCK_SIZE; } if (srclen) { dctx->bytes = GHASH_BLOCK_SIZE - srclen; while (srclen--) *dst++ ^= *src++; } return 0; } static void ghash_flush(struct ghash_ctx *ctx, struct ghash_desc_ctx *dctx) { u8 *dst = dctx->buffer; if (dctx->bytes) { u8 *tmp = dst + (GHASH_BLOCK_SIZE - dctx->bytes); while (dctx->bytes--) *tmp++ ^= 0; gf128mul_4k_lle((be128 *)dst, ctx->gf128); } dctx->bytes = 0; } static int ghash_final(struct shash_desc *desc, u8 *dst) { struct ghash_desc_ctx *dctx = shash_desc_ctx(desc); struct ghash_ctx *ctx = crypto_shash_ctx(desc->tfm); u8 *buf = dctx->buffer; ghash_flush(ctx, dctx); memcpy(dst, buf, GHASH_BLOCK_SIZE); return 0; } static void ghash_exit_tfm(struct crypto_tfm *tfm) { struct ghash_ctx *ctx = crypto_tfm_ctx(tfm); if (ctx->gf128) gf128mul_free_4k(ctx->gf128); } static struct shash_alg ghash_alg = { .digestsize = GHASH_DIGEST_SIZE, .init = ghash_init, .update = ghash_update, .final = ghash_final, .setkey = ghash_setkey, .descsize = sizeof(struct ghash_desc_ctx), .base = { .cra_name = "ghash", .cra_driver_name = "ghash-generic", .cra_priority = 100, .cra_blocksize = GHASH_BLOCK_SIZE, .cra_ctxsize = sizeof(struct ghash_ctx), .cra_module = THIS_MODULE, .cra_exit = ghash_exit_tfm, }, }; static int __init ghash_mod_init(void) { return crypto_register_shash(&ghash_alg); } static void __exit ghash_mod_exit(void) { crypto_unregister_shash(&ghash_alg); } subsys_initcall(ghash_mod_init); module_exit(ghash_mod_exit); MODULE_LICENSE("GPL"); MODULE_DESCRIPTION("GHASH hash function"); MODULE_ALIAS_CRYPTO("ghash"); MODULE_ALIAS_CRYPTO("ghash-generic"); |
1 1 1 1 1 1 1 1 | 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60 61 62 63 64 65 66 67 68 69 70 71 72 73 74 75 76 77 78 79 80 81 82 83 84 85 86 87 88 89 90 91 92 93 94 95 96 97 98 99 100 101 102 103 104 105 106 107 108 109 110 111 112 113 114 115 116 117 118 119 120 121 122 123 124 125 126 127 128 129 130 131 132 133 134 135 136 137 138 139 140 141 142 143 144 145 146 147 148 149 150 151 152 153 154 155 156 157 158 159 160 161 162 163 164 165 166 167 168 169 170 171 172 173 174 175 176 177 178 179 180 181 182 183 184 185 186 187 188 189 190 191 192 193 194 195 196 197 198 199 200 201 202 203 204 205 206 207 208 209 210 211 212 213 214 215 216 217 218 219 220 221 222 223 224 225 226 227 228 229 230 231 232 233 234 235 236 237 238 239 240 241 242 243 244 245 246 247 248 249 250 251 252 253 254 255 256 257 258 259 260 261 262 263 264 265 266 267 268 269 270 271 272 273 274 275 276 277 278 279 280 281 282 283 284 285 286 287 288 289 290 291 292 293 294 295 296 297 298 299 300 301 302 303 304 305 306 307 308 309 310 311 312 313 314 315 316 317 318 319 320 321 322 323 324 325 326 327 328 329 330 331 332 333 334 335 336 337 338 339 340 341 342 343 344 345 346 347 348 349 350 351 352 353 354 355 356 357 358 359 360 361 362 363 364 365 366 367 368 369 370 371 372 373 374 375 376 377 378 379 380 381 382 383 384 385 386 387 388 389 390 391 392 393 394 395 396 397 398 399 400 401 402 403 404 405 406 407 408 409 410 411 412 413 414 415 416 417 418 419 420 421 422 423 424 425 426 427 428 429 430 431 432 433 434 435 436 437 438 439 440 441 442 443 444 445 446 447 448 449 450 451 452 453 454 455 456 457 458 459 460 461 462 463 464 465 466 467 468 469 470 471 472 473 474 475 476 477 478 479 480 481 482 483 484 485 486 487 488 489 490 491 492 493 494 495 496 497 498 499 500 501 502 503 504 505 506 507 508 509 510 511 512 513 514 515 516 517 518 519 520 521 522 523 524 525 526 527 528 529 530 531 532 533 534 535 536 537 538 539 540 541 542 543 544 545 546 547 548 549 550 551 552 553 554 555 556 557 558 559 560 561 562 563 564 565 566 567 568 569 570 571 572 573 574 575 576 577 578 579 580 581 582 583 584 585 586 587 588 589 590 591 592 593 594 595 596 597 598 599 600 601 602 603 604 605 606 607 608 609 610 611 612 613 614 615 616 617 618 619 620 621 622 623 624 625 626 627 628 629 630 631 632 633 634 635 636 637 638 639 640 641 642 643 644 645 646 647 648 649 650 651 652 653 654 655 656 657 658 659 660 661 662 663 664 665 666 667 668 669 670 671 672 673 674 675 676 677 678 679 680 681 682 683 684 685 686 687 688 689 690 691 692 693 694 695 696 697 698 699 700 701 702 703 704 705 706 707 708 709 710 711 712 713 714 715 716 717 718 719 720 721 722 723 724 725 726 727 728 729 730 731 732 733 734 735 736 737 738 739 740 741 742 743 744 745 746 747 748 749 750 751 752 753 754 755 756 757 758 759 760 761 762 763 764 765 766 767 768 769 770 771 772 773 774 775 776 777 778 779 780 781 782 783 784 785 786 787 788 789 790 791 792 793 794 795 796 797 798 799 800 801 802 803 804 805 806 807 808 809 810 811 812 813 814 815 816 817 818 819 820 821 822 823 824 825 826 827 828 829 830 831 832 833 834 835 836 837 838 839 840 841 842 843 844 845 846 847 848 849 850 851 852 853 854 855 856 857 858 859 860 861 862 863 864 865 866 867 868 869 870 871 872 873 874 875 876 877 878 879 880 881 882 883 884 885 886 887 888 889 890 891 892 893 894 895 896 897 898 899 900 901 902 903 904 905 906 907 908 909 910 911 912 913 914 915 916 917 918 919 920 921 922 923 924 925 926 927 928 929 930 931 932 933 934 935 936 937 938 939 940 941 942 943 944 945 946 947 948 949 950 951 952 953 954 955 956 957 958 959 960 961 962 963 964 965 966 967 968 969 970 971 972 973 974 975 976 977 978 979 980 981 982 983 984 985 986 987 988 989 990 991 992 993 994 995 996 997 998 999 1000 1001 1002 1003 1004 1005 1006 1007 1008 1009 1010 1011 1012 1013 1014 1015 1016 1017 1018 1019 1020 1021 1022 1023 1024 1025 1026 1027 1028 1029 1030 1031 1032 1033 1034 1035 1036 1037 1038 1039 1040 1041 1042 1043 1044 1045 1046 1047 1048 1049 1050 1051 1052 1053 1054 1055 1056 1057 1058 1059 1060 1061 1062 1063 1064 1065 1066 1067 1068 1069 1070 1071 1072 1073 1074 1075 1076 1077 1078 1079 1080 1081 1082 1083 1084 1085 1086 1087 1088 1089 1090 1091 1092 1093 1094 1095 1096 1097 1098 1099 1100 1101 1102 1103 1104 1105 1106 1107 1108 1109 1110 1111 1112 1113 1114 1115 1116 1117 1118 1119 1120 1121 1122 1123 1124 1125 1126 1127 1128 1129 1130 1131 1132 1133 1134 1135 1136 1137 1138 1139 1140 1141 1142 1143 1144 1145 1146 1147 1148 1149 1150 1151 1152 1153 1154 1155 1156 1157 1158 1159 1160 1161 1162 1163 1164 1165 1166 1167 1168 1169 1170 1171 1172 1173 1174 1175 1176 1177 1178 1179 1180 1181 1182 1183 1184 1185 1186 1187 1188 1189 1190 1191 1192 1193 1194 1195 1196 1197 1198 1199 1200 1201 1202 1203 1204 1205 1206 1207 1208 1209 1210 1211 1212 1213 1214 1215 1216 1217 1218 1219 1220 1221 1222 1223 1224 1225 1226 1227 1228 1229 1230 1231 1232 1233 1234 1235 1236 1237 1238 1239 1240 1241 1242 1243 1244 1245 1246 1247 1248 1249 1250 1251 1252 1253 1254 1255 1256 1257 1258 1259 1260 1261 1262 1263 1264 1265 1266 1267 1268 1269 1270 1271 1272 1273 1274 1275 1276 1277 1278 1279 1280 1281 1282 1283 1284 1285 1286 1287 1288 1289 1290 1291 1292 1293 1294 1295 1296 1297 1298 1299 1300 1301 1302 1303 1304 1305 1306 1307 1308 1309 1310 1311 1312 1313 1314 1315 1316 1317 1318 1319 1320 1321 1322 1323 1324 1325 1326 1327 1328 1329 1330 1331 1332 1333 1334 1335 1336 1337 1338 1339 1340 1341 1342 1343 1344 1345 1346 1347 1348 1349 1350 1351 1352 1353 1354 1355 1356 1357 1358 1359 1360 1361 1362 1363 1364 1365 1366 1367 1368 1369 1370 1371 1372 1373 1374 1375 1376 1377 1378 1379 1380 1381 1382 1383 1384 1385 1386 1387 1388 1389 1390 1391 1392 1393 1394 1395 1396 1397 1398 1399 1400 1401 1402 1403 1404 1405 1406 1407 1408 1409 1410 1411 1412 1413 1414 1415 1416 1417 1418 1419 1420 1421 1422 1423 1424 1425 1426 1427 1428 1429 1430 1431 1432 1433 1434 1435 1436 1437 1438 1439 1440 1441 1442 1443 1444 1445 1446 1447 1448 1449 1450 1451 1452 1453 1454 1455 1456 1457 1458 1459 1460 1461 1462 1463 1464 1465 1466 1467 1468 1469 1470 1471 1472 1473 1474 1475 1476 1477 1478 1479 1480 1481 1482 1483 1484 1485 1486 1487 1488 1489 1490 1491 1492 1493 1494 1495 1496 1497 1498 1499 1500 1501 1502 1503 1504 1505 1506 1507 1508 1509 1510 1511 1512 1513 1514 1515 1516 1517 1518 1519 1520 1521 1522 1523 1524 1525 1526 1527 1528 1529 1530 1531 1532 1533 1534 1535 1536 1537 1538 1539 1540 1541 1542 1543 1544 1545 1546 1547 1548 1549 1550 1551 1552 1553 1554 1555 1556 1557 1558 1559 1560 1561 1562 1563 1564 1565 1566 1567 1568 1569 1570 1571 1572 1573 1574 1575 1576 1577 1578 1579 1580 1581 1582 1583 1584 1585 1586 1587 1588 1589 1590 1591 1592 1593 1594 1595 1596 1597 1598 1599 1600 1601 1602 1603 1604 1605 1606 1607 1608 1609 1610 1611 1612 1613 1614 1615 1616 1617 1618 1619 1620 1621 1622 1623 1624 1625 1626 1627 1628 1629 1630 1631 1632 1633 1634 1635 1636 1637 1638 1639 1640 1641 1642 1643 1644 1645 1646 1647 1648 1649 1650 1651 1652 1653 1654 1655 1656 1657 1658 1659 1660 1661 1662 1663 1664 1665 1666 1667 1668 1669 1670 1671 1672 1673 1674 1675 1676 1677 1678 1679 1680 1681 1682 1683 1684 1685 1686 1687 1688 1689 1690 1691 1692 1693 1694 1695 1696 1697 1698 1699 1700 1701 1702 1703 1704 1705 1706 1707 1708 1709 1710 1711 1712 1713 1714 1715 1716 1717 1718 1719 1720 1721 1722 1723 1724 1725 1726 1727 1728 1729 1730 1731 1732 1733 1734 1735 1736 1737 1738 1739 1740 1741 1742 1743 1744 1745 1746 1747 1748 1749 1750 1751 1752 1753 1754 1755 1756 1757 1758 1759 1760 1761 1762 1763 1764 1765 1766 1767 1768 1769 1770 1771 1772 1773 1774 1775 1776 1777 1778 1779 1780 1781 1782 1783 1784 1785 1786 1787 1788 1789 1790 1791 1792 1793 1794 1795 1796 1797 1798 1799 1800 1801 1802 1803 1804 1805 1806 1807 1808 1809 1810 1811 1812 1813 1814 1815 1816 1817 1818 1819 1820 1821 1822 1823 1824 1825 1826 1827 1828 1829 1830 1831 1832 1833 1834 1835 1836 1837 1838 1839 1840 1841 1842 1843 1844 1845 1846 1847 1848 1849 1850 1851 1852 1853 1854 1855 1856 1857 1858 1859 1860 1861 1862 1863 1864 1865 1866 1867 1868 1869 1870 1871 1872 1873 1874 1875 1876 1877 1878 1879 1880 1881 1882 1883 1884 1885 1886 1887 1888 1889 1890 1891 1892 1893 1894 1895 1896 1897 1898 1899 1900 1901 1902 1903 1904 1905 1906 1907 1908 1909 1910 1911 1912 1913 1914 1915 1916 1917 1918 1919 1920 1921 1922 1923 1924 1925 1926 1927 1928 1929 1930 1931 1932 1933 1934 1935 1936 1937 1938 1939 1940 1941 1942 1943 1944 1945 1946 1947 1948 1949 1950 1951 1952 1953 1954 1955 1956 1957 1958 1959 1960 1961 1962 1963 1964 1965 1966 1967 1968 1969 1970 1971 1972 1973 1974 1975 1976 1977 1978 1979 1980 1981 1982 1983 1984 1985 1986 1987 1988 1989 1990 1991 1992 1993 1994 1995 1996 1997 1998 1999 2000 2001 2002 2003 2004 2005 2006 2007 2008 2009 2010 2011 2012 2013 2014 2015 2016 2017 2018 2019 2020 2021 2022 2023 2024 2025 2026 2027 2028 2029 2030 2031 2032 2033 2034 2035 2036 2037 2038 2039 2040 2041 2042 2043 2044 2045 2046 2047 2048 2049 2050 2051 2052 2053 2054 2055 2056 2057 2058 2059 2060 2061 2062 2063 2064 2065 2066 2067 2068 2069 2070 2071 2072 2073 2074 2075 2076 2077 2078 2079 2080 2081 2082 2083 2084 2085 2086 2087 2088 2089 2090 2091 2092 2093 2094 2095 2096 2097 2098 2099 2100 2101 2102 2103 2104 2105 2106 2107 2108 2109 2110 2111 2112 2113 2114 2115 2116 2117 2118 2119 2120 2121 2122 2123 2124 2125 2126 2127 2128 2129 2130 2131 2132 2133 2134 2135 2136 2137 2138 2139 2140 2141 2142 2143 2144 2145 2146 2147 2148 2149 2150 2151 2152 2153 2154 2155 2156 2157 2158 2159 2160 2161 2162 2163 2164 2165 2166 2167 2168 2169 2170 2171 2172 2173 2174 2175 2176 2177 2178 2179 2180 2181 2182 2183 2184 2185 2186 2187 2188 2189 2190 2191 2192 2193 2194 2195 2196 2197 2198 2199 2200 2201 2202 2203 2204 2205 2206 2207 2208 2209 2210 2211 2212 2213 2214 2215 2216 2217 2218 2219 2220 2221 2222 2223 2224 2225 2226 2227 2228 2229 2230 2231 2232 2233 2234 2235 2236 2237 2238 2239 2240 2241 2242 2243 2244 2245 2246 2247 2248 2249 2250 2251 2252 2253 2254 2255 2256 2257 2258 2259 2260 2261 2262 2263 2264 2265 2266 2267 2268 2269 2270 2271 2272 2273 2274 2275 2276 2277 2278 2279 2280 2281 2282 2283 2284 2285 2286 2287 2288 2289 2290 2291 2292 2293 2294 2295 2296 2297 2298 2299 2300 2301 2302 2303 2304 2305 2306 2307 2308 2309 2310 2311 2312 2313 2314 2315 2316 2317 2318 2319 2320 2321 2322 2323 2324 2325 2326 2327 2328 2329 2330 2331 2332 2333 2334 2335 2336 2337 2338 2339 2340 2341 2342 2343 2344 2345 2346 2347 2348 2349 2350 2351 2352 2353 2354 2355 2356 2357 2358 2359 2360 2361 2362 2363 2364 2365 2366 2367 2368 2369 2370 2371 2372 2373 2374 2375 2376 2377 2378 2379 2380 2381 2382 2383 2384 2385 2386 2387 2388 2389 2390 2391 2392 2393 2394 2395 2396 2397 2398 2399 2400 2401 2402 2403 2404 2405 2406 2407 2408 2409 2410 2411 2412 2413 2414 2415 2416 2417 2418 2419 2420 2421 2422 2423 2424 2425 2426 2427 2428 2429 2430 2431 2432 2433 2434 2435 2436 2437 2438 2439 2440 2441 2442 2443 2444 2445 2446 2447 2448 2449 2450 2451 2452 2453 2454 2455 2456 2457 2458 2459 2460 2461 2462 2463 2464 2465 2466 2467 2468 2469 2470 2471 2472 2473 2474 2475 2476 2477 2478 2479 2480 2481 2482 2483 2484 2485 2486 2487 2488 2489 2490 2491 2492 2493 2494 2495 2496 2497 2498 2499 2500 2501 2502 2503 2504 2505 2506 2507 2508 2509 2510 2511 2512 2513 2514 2515 2516 2517 2518 2519 2520 2521 2522 2523 2524 2525 2526 2527 2528 2529 2530 2531 2532 2533 2534 2535 2536 2537 2538 2539 2540 2541 2542 2543 2544 2545 2546 2547 2548 2549 2550 2551 2552 2553 2554 2555 2556 2557 2558 2559 2560 2561 2562 2563 2564 2565 2566 2567 2568 2569 2570 2571 2572 2573 2574 2575 2576 2577 2578 2579 2580 2581 2582 2583 2584 2585 2586 2587 2588 2589 2590 2591 2592 2593 2594 2595 2596 2597 2598 2599 2600 2601 2602 2603 2604 2605 2606 2607 2608 2609 2610 2611 2612 2613 2614 2615 2616 2617 2618 2619 2620 2621 2622 2623 2624 2625 2626 2627 2628 2629 2630 2631 2632 2633 2634 2635 2636 2637 2638 2639 2640 2641 2642 2643 2644 2645 2646 2647 2648 2649 2650 2651 2652 2653 2654 2655 2656 2657 2658 2659 2660 2661 2662 2663 2664 2665 2666 2667 2668 2669 2670 2671 2672 2673 2674 2675 2676 2677 2678 2679 2680 2681 2682 2683 2684 2685 2686 2687 2688 2689 2690 2691 2692 2693 2694 2695 2696 2697 2698 2699 2700 2701 2702 2703 2704 2705 2706 2707 2708 2709 2710 2711 2712 2713 2714 2715 2716 2717 2718 2719 2720 2721 2722 2723 2724 2725 2726 2727 2728 2729 2730 2731 2732 2733 2734 2735 2736 2737 2738 2739 2740 2741 2742 2743 2744 2745 2746 2747 2748 2749 2750 2751 2752 2753 2754 2755 2756 2757 2758 2759 2760 2761 2762 2763 2764 2765 2766 2767 2768 2769 2770 2771 2772 2773 2774 2775 2776 2777 2778 2779 2780 2781 2782 2783 2784 2785 2786 2787 2788 2789 2790 2791 2792 2793 2794 2795 2796 2797 2798 2799 2800 2801 2802 2803 2804 2805 2806 2807 2808 2809 2810 2811 2812 2813 2814 2815 2816 2817 2818 2819 2820 2821 2822 2823 2824 2825 2826 2827 2828 2829 2830 2831 2832 2833 2834 2835 2836 2837 2838 2839 2840 2841 2842 2843 2844 2845 2846 2847 2848 2849 2850 2851 2852 2853 2854 2855 2856 2857 2858 2859 2860 2861 2862 2863 2864 2865 2866 2867 2868 2869 2870 2871 2872 2873 2874 2875 2876 2877 2878 2879 2880 2881 2882 2883 2884 2885 2886 2887 2888 2889 2890 2891 2892 2893 2894 2895 2896 2897 2898 2899 2900 2901 2902 2903 2904 2905 2906 2907 2908 2909 2910 2911 2912 2913 2914 2915 2916 2917 2918 2919 2920 2921 2922 2923 2924 2925 2926 2927 2928 2929 2930 2931 2932 2933 2934 2935 2936 2937 2938 2939 2940 2941 2942 2943 2944 2945 2946 2947 2948 2949 2950 2951 2952 2953 2954 2955 2956 2957 2958 2959 2960 2961 2962 2963 2964 2965 2966 2967 2968 2969 2970 2971 2972 2973 2974 2975 2976 2977 2978 2979 2980 2981 2982 2983 2984 2985 2986 2987 2988 2989 2990 2991 2992 2993 2994 2995 2996 2997 2998 2999 3000 3001 3002 3003 3004 3005 3006 3007 3008 3009 3010 3011 3012 3013 3014 3015 3016 3017 3018 3019 3020 3021 3022 3023 3024 3025 3026 3027 3028 3029 3030 3031 3032 3033 3034 3035 3036 3037 3038 3039 3040 3041 3042 3043 3044 3045 3046 3047 3048 3049 3050 3051 3052 3053 3054 3055 3056 3057 3058 3059 3060 3061 3062 3063 3064 3065 3066 3067 3068 3069 3070 3071 3072 3073 3074 3075 3076 3077 3078 3079 3080 3081 3082 3083 3084 3085 3086 3087 3088 3089 3090 3091 3092 3093 3094 3095 3096 3097 3098 3099 3100 3101 3102 3103 3104 3105 3106 3107 3108 3109 3110 3111 3112 3113 3114 3115 3116 3117 3118 3119 3120 3121 3122 3123 3124 3125 3126 3127 3128 3129 3130 3131 3132 3133 3134 3135 3136 3137 3138 3139 3140 3141 3142 3143 3144 3145 3146 3147 3148 3149 3150 3151 3152 3153 3154 3155 3156 3157 3158 3159 3160 3161 3162 3163 3164 3165 3166 3167 3168 3169 3170 3171 3172 3173 3174 3175 3176 3177 3178 3179 3180 3181 3182 3183 3184 3185 3186 3187 3188 3189 3190 3191 3192 3193 3194 3195 3196 3197 3198 3199 3200 3201 3202 3203 3204 3205 3206 3207 3208 3209 3210 3211 3212 3213 3214 3215 3216 3217 3218 3219 3220 3221 3222 3223 3224 3225 3226 3227 3228 3229 3230 3231 3232 3233 3234 3235 3236 3237 3238 3239 3240 3241 3242 3243 3244 3245 3246 3247 3248 3249 3250 3251 3252 3253 3254 3255 3256 3257 3258 3259 3260 3261 3262 3263 3264 3265 3266 3267 3268 3269 3270 3271 3272 3273 3274 3275 3276 3277 3278 3279 3280 3281 3282 3283 3284 3285 3286 3287 3288 3289 3290 3291 3292 3293 3294 3295 3296 3297 3298 3299 3300 3301 3302 3303 3304 3305 3306 3307 3308 3309 3310 3311 3312 3313 3314 3315 3316 3317 3318 3319 3320 3321 3322 3323 3324 3325 3326 3327 3328 3329 3330 3331 3332 3333 3334 3335 3336 3337 3338 3339 3340 3341 3342 3343 3344 3345 3346 3347 3348 3349 3350 3351 3352 3353 3354 3355 3356 3357 3358 3359 3360 3361 3362 3363 3364 3365 3366 3367 3368 3369 3370 3371 3372 3373 3374 3375 3376 3377 3378 3379 3380 3381 3382 3383 3384 3385 3386 3387 3388 3389 3390 3391 3392 3393 3394 3395 3396 3397 3398 3399 3400 3401 3402 3403 3404 3405 3406 3407 3408 3409 3410 3411 3412 3413 3414 3415 3416 3417 3418 3419 3420 3421 3422 3423 3424 3425 3426 3427 3428 3429 3430 3431 3432 3433 3434 3435 3436 3437 3438 3439 3440 3441 3442 3443 3444 3445 3446 3447 3448 3449 3450 3451 3452 3453 3454 3455 3456 3457 3458 3459 3460 3461 3462 3463 3464 3465 3466 3467 3468 3469 3470 3471 3472 3473 3474 3475 3476 3477 3478 3479 3480 3481 3482 3483 3484 3485 3486 3487 3488 3489 3490 3491 3492 3493 3494 3495 3496 3497 3498 3499 3500 3501 3502 3503 3504 3505 3506 3507 3508 3509 3510 3511 3512 3513 3514 3515 3516 3517 3518 3519 3520 3521 3522 3523 3524 3525 3526 3527 3528 3529 3530 3531 3532 3533 3534 3535 3536 3537 3538 3539 3540 3541 3542 3543 3544 3545 3546 3547 3548 3549 3550 3551 3552 3553 3554 3555 3556 3557 3558 3559 3560 3561 3562 3563 3564 3565 3566 3567 3568 3569 3570 3571 3572 3573 3574 3575 3576 3577 3578 3579 3580 3581 3582 3583 3584 3585 3586 3587 3588 3589 3590 3591 3592 3593 3594 3595 3596 3597 3598 3599 3600 3601 3602 3603 3604 3605 3606 3607 3608 3609 3610 3611 3612 3613 3614 3615 3616 3617 3618 3619 3620 3621 3622 3623 3624 3625 3626 3627 3628 3629 3630 3631 3632 3633 3634 3635 3636 3637 3638 3639 3640 3641 3642 3643 3644 3645 3646 3647 3648 3649 3650 3651 3652 3653 3654 3655 3656 3657 3658 3659 3660 3661 3662 3663 3664 3665 3666 3667 3668 3669 3670 3671 3672 3673 3674 3675 3676 3677 3678 3679 3680 3681 3682 3683 3684 3685 3686 3687 3688 3689 3690 3691 3692 3693 3694 3695 3696 3697 3698 3699 3700 3701 3702 3703 3704 3705 3706 3707 3708 3709 3710 3711 3712 3713 3714 3715 3716 3717 3718 3719 3720 3721 3722 3723 3724 3725 3726 3727 3728 3729 3730 3731 3732 3733 3734 3735 3736 3737 3738 3739 3740 3741 3742 3743 3744 3745 3746 3747 3748 3749 3750 3751 3752 3753 3754 3755 3756 3757 3758 3759 3760 3761 3762 3763 3764 3765 3766 3767 3768 3769 3770 3771 3772 3773 3774 3775 3776 3777 3778 3779 3780 3781 3782 3783 3784 3785 3786 3787 3788 3789 3790 3791 3792 3793 3794 3795 3796 3797 3798 3799 3800 3801 3802 3803 3804 3805 3806 3807 3808 3809 3810 3811 3812 3813 3814 3815 3816 3817 3818 3819 3820 3821 3822 3823 3824 3825 3826 3827 3828 3829 3830 3831 3832 3833 3834 3835 3836 3837 3838 3839 3840 3841 3842 3843 3844 3845 3846 3847 3848 3849 3850 3851 3852 3853 3854 3855 3856 3857 3858 3859 3860 3861 3862 3863 3864 3865 3866 3867 3868 3869 3870 3871 3872 3873 3874 3875 3876 3877 3878 3879 3880 3881 3882 3883 3884 3885 3886 3887 3888 3889 3890 3891 3892 3893 3894 3895 3896 3897 3898 3899 3900 3901 3902 3903 3904 3905 3906 3907 3908 3909 3910 3911 3912 3913 3914 3915 3916 3917 3918 3919 3920 3921 3922 3923 3924 3925 3926 3927 3928 3929 3930 3931 3932 3933 3934 3935 3936 3937 3938 3939 3940 3941 3942 3943 3944 3945 3946 3947 3948 3949 3950 3951 3952 3953 3954 3955 3956 3957 3958 3959 3960 3961 3962 3963 3964 3965 3966 3967 3968 3969 3970 3971 3972 3973 3974 3975 3976 3977 3978 3979 3980 3981 3982 3983 3984 3985 3986 3987 3988 3989 3990 3991 3992 3993 3994 3995 3996 3997 3998 3999 4000 4001 4002 4003 4004 4005 4006 4007 4008 4009 4010 4011 4012 4013 4014 4015 4016 4017 4018 4019 4020 4021 4022 4023 4024 4025 4026 4027 4028 4029 4030 4031 4032 4033 4034 4035 4036 4037 4038 4039 4040 4041 4042 4043 4044 4045 4046 4047 4048 4049 4050 4051 4052 4053 4054 4055 4056 4057 4058 4059 4060 4061 4062 4063 4064 4065 4066 4067 4068 4069 4070 4071 4072 4073 4074 4075 4076 4077 4078 4079 4080 4081 4082 4083 4084 4085 4086 4087 4088 4089 4090 4091 4092 4093 4094 4095 4096 4097 4098 4099 4100 4101 4102 4103 4104 4105 4106 4107 4108 4109 4110 4111 4112 4113 4114 4115 4116 4117 4118 4119 4120 4121 4122 4123 4124 4125 4126 4127 4128 4129 4130 4131 4132 4133 4134 4135 4136 4137 4138 4139 4140 4141 4142 4143 4144 4145 4146 4147 4148 4149 4150 4151 4152 4153 4154 4155 4156 4157 4158 4159 4160 4161 4162 4163 4164 4165 4166 4167 4168 4169 4170 4171 4172 4173 4174 4175 4176 4177 4178 4179 4180 4181 4182 4183 4184 4185 4186 4187 4188 4189 4190 4191 4192 4193 4194 4195 4196 4197 4198 4199 4200 4201 4202 4203 4204 4205 4206 4207 4208 4209 4210 4211 4212 4213 4214 4215 4216 4217 4218 4219 4220 4221 4222 4223 4224 4225 4226 4227 4228 4229 4230 4231 4232 4233 4234 4235 4236 4237 4238 4239 4240 4241 4242 4243 4244 4245 4246 4247 4248 4249 4250 4251 4252 4253 4254 4255 4256 4257 4258 4259 4260 4261 4262 4263 4264 4265 4266 4267 4268 4269 4270 4271 4272 4273 4274 4275 4276 4277 4278 4279 4280 4281 4282 4283 4284 4285 4286 4287 4288 4289 4290 4291 4292 4293 4294 4295 4296 4297 4298 4299 4300 4301 4302 4303 4304 4305 4306 4307 4308 4309 4310 4311 4312 4313 4314 4315 4316 4317 4318 4319 4320 4321 4322 4323 4324 4325 4326 4327 4328 4329 4330 4331 4332 4333 4334 4335 4336 4337 4338 4339 4340 4341 4342 4343 4344 4345 4346 4347 4348 4349 4350 4351 4352 4353 4354 4355 4356 4357 4358 4359 4360 4361 4362 4363 4364 4365 4366 4367 4368 4369 4370 4371 4372 4373 4374 4375 4376 4377 4378 4379 4380 4381 4382 4383 4384 4385 4386 4387 4388 4389 4390 4391 4392 4393 4394 4395 4396 4397 4398 4399 4400 4401 4402 4403 4404 4405 4406 4407 4408 4409 4410 4411 4412 4413 4414 4415 4416 4417 4418 4419 4420 4421 4422 4423 4424 4425 4426 4427 4428 4429 4430 4431 4432 4433 4434 4435 4436 4437 4438 4439 4440 4441 4442 4443 4444 4445 4446 4447 4448 4449 4450 4451 4452 4453 4454 4455 4456 4457 4458 4459 4460 4461 4462 4463 4464 4465 4466 4467 4468 4469 4470 4471 4472 4473 4474 4475 4476 4477 4478 4479 4480 4481 4482 4483 4484 4485 4486 4487 4488 4489 4490 4491 4492 4493 4494 4495 4496 4497 4498 4499 4500 4501 4502 4503 4504 4505 4506 4507 4508 4509 4510 4511 4512 4513 4514 4515 4516 4517 4518 4519 4520 4521 4522 4523 4524 4525 4526 4527 4528 4529 4530 4531 4532 4533 4534 4535 4536 4537 4538 4539 4540 4541 4542 4543 4544 4545 4546 4547 4548 4549 4550 4551 4552 4553 4554 4555 4556 4557 4558 4559 4560 4561 4562 4563 4564 4565 4566 4567 4568 4569 4570 4571 4572 4573 4574 4575 4576 4577 4578 4579 4580 4581 4582 4583 4584 4585 4586 4587 4588 4589 4590 4591 4592 4593 4594 4595 4596 4597 4598 4599 4600 4601 4602 4603 4604 4605 4606 4607 4608 4609 4610 4611 4612 4613 4614 4615 4616 4617 4618 4619 4620 4621 4622 4623 4624 4625 4626 4627 4628 4629 4630 4631 4632 4633 4634 4635 4636 4637 4638 4639 4640 4641 4642 4643 4644 4645 4646 4647 4648 4649 4650 4651 4652 4653 4654 4655 4656 4657 4658 4659 4660 4661 4662 4663 4664 4665 4666 4667 4668 4669 4670 4671 4672 4673 4674 4675 4676 4677 4678 4679 4680 4681 4682 4683 4684 4685 4686 4687 4688 4689 4690 4691 4692 4693 4694 4695 4696 4697 4698 4699 4700 4701 4702 4703 4704 4705 4706 4707 4708 4709 4710 4711 4712 4713 4714 4715 4716 4717 4718 4719 4720 4721 4722 4723 4724 4725 4726 4727 4728 4729 4730 4731 4732 4733 4734 4735 4736 4737 4738 4739 4740 4741 4742 4743 4744 4745 4746 4747 4748 4749 4750 4751 4752 4753 4754 4755 4756 4757 4758 4759 4760 4761 4762 4763 4764 4765 4766 4767 4768 4769 4770 4771 4772 4773 4774 4775 4776 4777 4778 4779 4780 4781 4782 4783 4784 4785 4786 4787 4788 4789 4790 4791 4792 4793 4794 4795 4796 4797 4798 4799 4800 4801 4802 4803 4804 4805 4806 4807 4808 4809 4810 4811 4812 4813 4814 4815 4816 4817 4818 4819 4820 4821 4822 4823 4824 4825 4826 4827 4828 4829 4830 4831 4832 4833 4834 4835 4836 4837 4838 4839 4840 4841 4842 4843 4844 4845 4846 4847 4848 4849 4850 4851 4852 4853 4854 4855 4856 4857 4858 4859 4860 4861 4862 4863 4864 4865 4866 4867 4868 4869 4870 4871 4872 4873 4874 4875 4876 4877 4878 4879 4880 4881 4882 4883 4884 4885 4886 4887 4888 4889 4890 4891 4892 4893 4894 4895 4896 4897 4898 4899 4900 4901 4902 4903 4904 4905 4906 4907 4908 4909 4910 4911 4912 4913 4914 4915 4916 4917 4918 4919 4920 4921 4922 4923 4924 4925 4926 4927 4928 4929 4930 4931 4932 4933 4934 4935 4936 4937 4938 4939 4940 4941 4942 4943 4944 4945 4946 4947 4948 4949 4950 4951 4952 4953 4954 4955 4956 4957 4958 4959 4960 4961 4962 4963 4964 4965 4966 4967 4968 4969 4970 4971 4972 4973 4974 4975 4976 4977 4978 4979 4980 4981 4982 4983 4984 4985 4986 4987 4988 4989 4990 4991 4992 4993 4994 4995 4996 4997 4998 4999 5000 5001 5002 5003 5004 5005 5006 5007 5008 5009 5010 5011 5012 5013 5014 5015 5016 5017 5018 5019 5020 5021 5022 5023 5024 5025 5026 5027 5028 5029 5030 5031 5032 5033 5034 5035 5036 5037 5038 5039 5040 5041 5042 5043 5044 5045 5046 5047 5048 5049 5050 5051 5052 5053 5054 5055 5056 5057 5058 5059 5060 5061 5062 5063 5064 5065 5066 5067 5068 5069 5070 5071 5072 5073 5074 5075 5076 5077 5078 5079 5080 5081 5082 5083 5084 5085 5086 5087 5088 5089 5090 5091 5092 5093 5094 5095 5096 5097 5098 5099 5100 5101 5102 5103 5104 5105 5106 5107 5108 5109 5110 5111 5112 5113 5114 5115 5116 5117 5118 5119 5120 5121 5122 5123 5124 5125 5126 5127 5128 5129 5130 5131 5132 5133 5134 5135 5136 5137 5138 5139 5140 5141 5142 5143 5144 5145 5146 5147 5148 5149 5150 5151 5152 5153 5154 5155 5156 5157 5158 5159 5160 5161 5162 5163 5164 5165 5166 5167 5168 5169 5170 5171 5172 5173 5174 5175 5176 5177 5178 5179 5180 5181 5182 5183 5184 5185 5186 5187 5188 5189 5190 5191 5192 5193 5194 5195 5196 5197 5198 5199 5200 5201 5202 5203 5204 5205 5206 5207 5208 5209 5210 5211 5212 5213 5214 5215 5216 5217 5218 5219 5220 5221 5222 5223 5224 5225 5226 5227 5228 5229 5230 5231 5232 5233 5234 5235 5236 5237 5238 5239 5240 5241 5242 5243 5244 5245 5246 5247 5248 5249 5250 5251 5252 5253 5254 5255 5256 5257 5258 5259 5260 5261 5262 5263 5264 5265 5266 5267 5268 5269 5270 5271 5272 5273 5274 5275 5276 5277 5278 5279 5280 5281 5282 5283 5284 5285 5286 5287 5288 5289 5290 5291 5292 5293 5294 5295 5296 5297 5298 5299 5300 5301 5302 5303 5304 5305 5306 5307 5308 5309 5310 5311 5312 5313 5314 5315 5316 5317 5318 5319 5320 5321 5322 5323 5324 5325 5326 5327 5328 5329 5330 5331 5332 5333 5334 5335 5336 5337 5338 5339 5340 5341 5342 5343 5344 5345 5346 5347 5348 5349 5350 5351 5352 5353 5354 5355 5356 5357 5358 5359 5360 5361 5362 5363 5364 5365 5366 5367 5368 5369 5370 5371 5372 5373 5374 5375 5376 5377 5378 5379 5380 5381 5382 5383 5384 5385 5386 5387 5388 5389 5390 5391 5392 5393 5394 5395 5396 5397 5398 5399 5400 5401 5402 5403 5404 5405 5406 5407 5408 5409 5410 5411 5412 5413 5414 5415 5416 5417 5418 5419 5420 5421 5422 5423 5424 5425 5426 5427 5428 5429 5430 5431 5432 5433 5434 5435 5436 5437 5438 5439 5440 5441 5442 5443 5444 5445 5446 5447 5448 5449 5450 5451 5452 5453 5454 5455 5456 5457 5458 5459 5460 5461 5462 5463 5464 5465 5466 5467 5468 5469 5470 5471 5472 5473 5474 5475 5476 5477 5478 5479 5480 5481 5482 5483 5484 5485 5486 5487 5488 5489 5490 5491 5492 5493 5494 5495 5496 5497 5498 5499 5500 5501 5502 5503 5504 5505 5506 5507 5508 5509 5510 5511 5512 5513 5514 5515 5516 5517 5518 5519 5520 5521 5522 5523 5524 5525 5526 5527 5528 5529 5530 5531 5532 5533 5534 5535 5536 5537 5538 5539 5540 5541 5542 5543 5544 5545 5546 5547 5548 5549 5550 5551 5552 5553 5554 5555 5556 5557 5558 5559 5560 5561 5562 5563 5564 5565 5566 5567 5568 5569 5570 5571 5572 5573 5574 5575 5576 5577 5578 5579 5580 5581 5582 5583 5584 5585 5586 5587 5588 5589 5590 5591 5592 5593 5594 5595 5596 5597 5598 5599 5600 5601 5602 5603 5604 5605 5606 5607 5608 5609 5610 5611 5612 5613 5614 5615 5616 5617 5618 5619 5620 5621 5622 5623 5624 5625 5626 5627 5628 5629 5630 5631 5632 5633 5634 5635 5636 5637 5638 5639 5640 5641 5642 5643 5644 5645 5646 5647 5648 5649 5650 5651 5652 5653 5654 5655 5656 5657 5658 5659 5660 5661 5662 5663 5664 5665 5666 5667 5668 5669 5670 5671 5672 5673 5674 5675 5676 5677 5678 5679 5680 5681 5682 5683 5684 5685 5686 5687 5688 5689 5690 5691 5692 5693 5694 5695 5696 5697 5698 5699 5700 5701 5702 5703 5704 5705 5706 5707 5708 5709 5710 5711 5712 5713 5714 5715 5716 5717 5718 5719 5720 5721 5722 5723 5724 5725 5726 5727 5728 5729 5730 5731 5732 5733 5734 5735 5736 5737 5738 5739 5740 5741 5742 5743 5744 5745 5746 5747 5748 5749 5750 5751 5752 5753 5754 5755 5756 5757 5758 5759 5760 5761 5762 5763 5764 5765 5766 5767 5768 5769 5770 5771 5772 5773 5774 5775 5776 5777 5778 5779 5780 5781 5782 5783 5784 5785 5786 5787 5788 5789 5790 5791 5792 5793 5794 5795 5796 5797 5798 5799 5800 5801 5802 5803 5804 5805 5806 5807 5808 5809 5810 5811 5812 5813 5814 5815 5816 5817 5818 5819 5820 5821 5822 5823 5824 5825 5826 5827 5828 5829 5830 5831 5832 5833 5834 5835 5836 5837 5838 5839 5840 5841 5842 5843 5844 5845 5846 5847 5848 5849 5850 5851 5852 5853 5854 5855 5856 5857 5858 5859 5860 5861 5862 5863 5864 5865 5866 5867 5868 5869 5870 5871 5872 5873 5874 5875 5876 5877 5878 5879 5880 5881 5882 5883 5884 5885 5886 5887 5888 5889 5890 5891 5892 5893 5894 5895 5896 5897 5898 5899 5900 5901 5902 5903 5904 5905 5906 5907 5908 5909 5910 5911 5912 5913 5914 5915 5916 5917 5918 5919 5920 5921 5922 5923 5924 5925 5926 5927 5928 5929 5930 5931 5932 5933 5934 5935 5936 5937 5938 5939 5940 5941 5942 5943 5944 5945 5946 5947 5948 5949 5950 5951 5952 5953 5954 5955 5956 5957 5958 5959 5960 5961 5962 5963 5964 5965 5966 5967 5968 5969 5970 5971 5972 5973 5974 5975 5976 5977 5978 5979 5980 5981 5982 5983 5984 5985 5986 5987 5988 5989 5990 5991 5992 5993 5994 5995 5996 5997 5998 5999 6000 6001 6002 6003 6004 6005 6006 6007 6008 6009 6010 6011 6012 6013 6014 6015 6016 6017 6018 6019 6020 6021 6022 6023 6024 6025 6026 6027 6028 6029 6030 6031 6032 6033 6034 6035 6036 6037 6038 6039 6040 6041 6042 6043 6044 6045 6046 6047 6048 6049 6050 6051 6052 6053 6054 6055 6056 6057 6058 6059 6060 6061 6062 6063 6064 6065 6066 6067 6068 6069 6070 6071 6072 6073 6074 6075 6076 6077 6078 6079 6080 6081 6082 6083 6084 6085 6086 6087 6088 6089 6090 6091 6092 6093 6094 6095 6096 6097 6098 6099 6100 6101 6102 6103 6104 6105 6106 6107 6108 6109 6110 6111 6112 6113 6114 6115 6116 6117 6118 6119 6120 6121 6122 6123 6124 6125 6126 6127 6128 6129 6130 6131 6132 6133 6134 6135 6136 6137 6138 6139 6140 6141 6142 6143 6144 6145 6146 6147 6148 6149 6150 6151 6152 6153 6154 6155 6156 6157 6158 6159 6160 6161 6162 6163 6164 6165 6166 6167 6168 6169 6170 6171 6172 6173 6174 6175 6176 6177 6178 6179 6180 6181 6182 6183 6184 6185 6186 6187 6188 6189 6190 6191 6192 6193 6194 6195 6196 6197 6198 6199 6200 6201 6202 6203 6204 6205 6206 6207 6208 6209 6210 6211 6212 6213 6214 6215 6216 6217 6218 6219 6220 6221 6222 6223 6224 6225 6226 6227 6228 6229 6230 6231 6232 6233 6234 6235 6236 6237 6238 6239 6240 6241 6242 6243 6244 6245 6246 6247 6248 6249 6250 6251 6252 6253 6254 6255 6256 6257 6258 6259 6260 6261 6262 6263 6264 6265 6266 6267 6268 6269 6270 6271 6272 6273 6274 6275 6276 6277 6278 6279 6280 6281 6282 6283 6284 6285 6286 6287 6288 6289 6290 6291 6292 6293 6294 6295 6296 6297 6298 6299 6300 6301 6302 6303 6304 6305 6306 6307 6308 6309 6310 6311 6312 6313 6314 6315 6316 6317 6318 6319 6320 6321 6322 6323 6324 6325 6326 6327 6328 6329 6330 6331 6332 6333 6334 6335 6336 6337 6338 6339 6340 6341 6342 6343 6344 6345 6346 6347 6348 6349 6350 6351 6352 6353 6354 6355 6356 6357 6358 6359 6360 6361 6362 6363 6364 6365 6366 6367 6368 6369 6370 6371 6372 6373 6374 6375 6376 6377 6378 6379 6380 6381 6382 6383 6384 6385 6386 6387 6388 6389 6390 6391 6392 6393 6394 6395 6396 6397 6398 6399 6400 6401 6402 6403 6404 6405 6406 6407 6408 6409 6410 6411 6412 6413 6414 6415 6416 6417 6418 6419 6420 6421 6422 6423 6424 6425 6426 6427 6428 6429 6430 6431 6432 6433 6434 6435 6436 6437 6438 6439 6440 6441 6442 6443 6444 6445 6446 6447 6448 6449 6450 6451 6452 6453 6454 6455 6456 6457 6458 6459 6460 6461 6462 6463 6464 6465 6466 6467 6468 6469 6470 6471 6472 6473 6474 6475 6476 6477 6478 6479 6480 6481 6482 6483 6484 6485 6486 6487 6488 6489 6490 6491 6492 6493 6494 6495 6496 6497 6498 6499 6500 6501 6502 6503 6504 6505 6506 6507 6508 6509 6510 6511 6512 6513 6514 6515 6516 6517 6518 6519 6520 6521 6522 6523 6524 6525 6526 6527 6528 6529 6530 6531 6532 6533 6534 6535 6536 6537 6538 6539 6540 6541 6542 6543 6544 6545 6546 6547 6548 6549 6550 6551 6552 6553 6554 6555 6556 6557 6558 6559 6560 6561 6562 6563 6564 6565 6566 6567 6568 6569 6570 6571 6572 6573 6574 6575 6576 6577 6578 6579 6580 6581 6582 6583 6584 6585 6586 6587 6588 6589 6590 6591 6592 6593 6594 6595 6596 6597 6598 6599 6600 6601 6602 6603 6604 6605 6606 6607 6608 6609 6610 6611 6612 6613 6614 6615 6616 6617 6618 6619 6620 6621 6622 6623 6624 6625 6626 6627 6628 6629 6630 6631 6632 6633 6634 6635 6636 6637 6638 6639 6640 6641 6642 6643 6644 6645 6646 6647 6648 6649 6650 6651 6652 6653 6654 6655 6656 6657 6658 6659 6660 6661 6662 6663 6664 6665 6666 6667 6668 6669 6670 6671 6672 6673 6674 6675 6676 6677 6678 6679 6680 6681 6682 6683 6684 6685 6686 6687 6688 6689 6690 6691 6692 6693 6694 6695 6696 6697 6698 6699 6700 6701 6702 6703 6704 6705 6706 6707 6708 6709 6710 6711 6712 6713 6714 6715 6716 6717 6718 6719 6720 6721 6722 6723 6724 6725 6726 6727 6728 6729 6730 6731 6732 6733 6734 6735 6736 6737 6738 6739 6740 6741 6742 6743 6744 6745 6746 6747 6748 6749 6750 6751 6752 6753 6754 6755 6756 6757 6758 6759 6760 6761 6762 6763 6764 6765 6766 6767 6768 6769 6770 6771 6772 6773 6774 6775 6776 6777 6778 6779 6780 6781 6782 6783 6784 6785 6786 6787 6788 6789 6790 6791 6792 6793 6794 6795 6796 6797 6798 6799 6800 6801 6802 6803 6804 6805 6806 6807 6808 6809 6810 6811 6812 6813 6814 6815 6816 6817 6818 6819 6820 6821 6822 6823 6824 6825 6826 6827 6828 6829 6830 6831 6832 6833 6834 6835 6836 6837 6838 6839 6840 6841 6842 6843 6844 6845 6846 6847 6848 6849 6850 6851 6852 6853 6854 6855 6856 6857 6858 6859 6860 6861 6862 6863 6864 6865 6866 6867 6868 6869 6870 6871 6872 6873 6874 6875 6876 6877 6878 6879 6880 6881 6882 6883 6884 6885 6886 6887 6888 6889 6890 6891 6892 6893 6894 6895 6896 6897 6898 6899 6900 6901 6902 6903 6904 6905 6906 6907 6908 6909 6910 6911 6912 6913 6914 6915 6916 6917 6918 6919 6920 6921 6922 6923 6924 6925 6926 6927 6928 6929 6930 6931 6932 6933 6934 6935 6936 6937 6938 6939 6940 6941 6942 6943 6944 6945 6946 6947 6948 6949 6950 6951 6952 6953 6954 6955 6956 6957 6958 6959 6960 6961 6962 6963 6964 6965 6966 6967 6968 6969 6970 6971 6972 6973 6974 6975 6976 6977 6978 6979 6980 6981 6982 6983 6984 6985 6986 6987 6988 6989 6990 6991 6992 6993 6994 6995 6996 6997 6998 6999 7000 7001 7002 7003 7004 7005 7006 7007 7008 7009 7010 7011 7012 7013 7014 7015 7016 7017 7018 7019 7020 7021 7022 7023 7024 7025 7026 7027 7028 7029 7030 7031 7032 7033 7034 7035 7036 7037 7038 7039 7040 7041 7042 7043 7044 7045 7046 7047 7048 7049 7050 7051 7052 7053 7054 7055 7056 7057 7058 7059 7060 7061 7062 7063 7064 7065 7066 7067 7068 7069 7070 7071 7072 7073 7074 7075 7076 7077 7078 7079 7080 7081 7082 7083 7084 7085 7086 7087 7088 7089 7090 7091 7092 7093 7094 7095 7096 7097 7098 7099 7100 7101 7102 7103 7104 7105 7106 7107 7108 7109 7110 7111 7112 7113 7114 7115 7116 7117 7118 7119 7120 7121 7122 7123 7124 7125 7126 7127 7128 7129 7130 7131 7132 7133 7134 7135 7136 7137 7138 7139 7140 7141 7142 7143 7144 7145 7146 7147 7148 7149 7150 7151 7152 7153 7154 7155 7156 7157 7158 7159 7160 7161 7162 7163 7164 7165 7166 7167 7168 7169 7170 7171 7172 7173 7174 7175 7176 7177 7178 7179 7180 7181 7182 7183 7184 7185 7186 7187 7188 7189 7190 7191 7192 7193 7194 7195 7196 7197 7198 7199 7200 7201 7202 7203 7204 7205 7206 7207 7208 7209 7210 7211 7212 7213 7214 7215 7216 7217 7218 7219 7220 7221 7222 7223 7224 7225 7226 7227 7228 7229 7230 7231 7232 7233 7234 7235 7236 7237 7238 7239 7240 7241 7242 7243 7244 7245 7246 7247 7248 7249 7250 7251 7252 7253 7254 7255 7256 7257 7258 7259 7260 7261 7262 7263 7264 7265 7266 7267 7268 7269 7270 7271 7272 7273 7274 7275 7276 7277 7278 7279 7280 7281 7282 7283 7284 7285 7286 7287 7288 7289 7290 7291 7292 7293 7294 7295 7296 7297 7298 7299 7300 7301 7302 7303 7304 7305 7306 7307 7308 7309 7310 7311 7312 7313 7314 7315 7316 7317 7318 7319 7320 7321 7322 7323 7324 7325 7326 7327 7328 7329 7330 7331 7332 7333 7334 7335 7336 7337 7338 7339 7340 7341 7342 7343 7344 7345 7346 7347 7348 7349 7350 7351 7352 7353 7354 7355 7356 7357 7358 7359 7360 7361 7362 7363 7364 7365 7366 7367 7368 7369 7370 7371 7372 7373 7374 7375 7376 7377 7378 7379 7380 7381 7382 7383 7384 7385 7386 7387 7388 7389 7390 7391 7392 7393 7394 7395 7396 7397 7398 7399 7400 7401 7402 7403 7404 7405 7406 7407 7408 7409 7410 7411 7412 7413 7414 7415 7416 7417 7418 7419 7420 7421 7422 7423 7424 7425 7426 7427 7428 7429 7430 7431 7432 7433 7434 7435 7436 7437 7438 7439 7440 7441 7442 7443 7444 7445 7446 7447 7448 7449 7450 7451 7452 7453 7454 7455 7456 7457 7458 7459 7460 7461 7462 7463 7464 7465 7466 7467 7468 7469 7470 7471 7472 7473 7474 7475 7476 7477 7478 7479 7480 7481 7482 7483 7484 7485 7486 7487 7488 7489 7490 7491 7492 7493 7494 7495 7496 7497 7498 7499 7500 7501 7502 7503 7504 7505 7506 7507 7508 7509 7510 7511 7512 7513 7514 7515 7516 7517 7518 7519 7520 7521 7522 7523 7524 7525 7526 7527 7528 7529 7530 7531 7532 7533 7534 7535 7536 7537 7538 7539 7540 7541 7542 7543 7544 7545 7546 7547 7548 7549 7550 7551 7552 7553 7554 7555 7556 7557 7558 7559 7560 7561 7562 7563 7564 7565 7566 7567 7568 7569 7570 7571 7572 7573 7574 7575 7576 7577 7578 7579 7580 7581 7582 7583 7584 7585 7586 7587 7588 7589 7590 7591 7592 7593 7594 7595 7596 7597 7598 7599 7600 7601 7602 7603 7604 7605 7606 7607 7608 7609 7610 7611 7612 7613 7614 7615 7616 7617 7618 7619 7620 7621 7622 7623 7624 7625 7626 7627 7628 7629 7630 7631 7632 7633 7634 7635 7636 7637 7638 7639 7640 7641 7642 7643 7644 7645 7646 7647 7648 7649 7650 7651 7652 7653 7654 7655 7656 7657 7658 7659 7660 7661 7662 7663 7664 7665 7666 7667 7668 7669 7670 7671 7672 7673 7674 7675 7676 7677 7678 7679 7680 7681 7682 7683 7684 7685 7686 7687 7688 7689 7690 7691 7692 7693 7694 7695 7696 7697 7698 7699 7700 7701 7702 7703 7704 7705 7706 7707 7708 7709 7710 7711 7712 7713 7714 7715 7716 7717 7718 7719 7720 7721 7722 7723 7724 7725 7726 7727 7728 7729 7730 7731 7732 7733 7734 7735 7736 7737 7738 7739 7740 7741 7742 7743 7744 7745 7746 7747 7748 7749 7750 7751 7752 7753 7754 7755 7756 7757 7758 7759 7760 7761 7762 7763 7764 7765 7766 7767 7768 7769 7770 7771 7772 7773 7774 7775 7776 7777 7778 7779 7780 7781 7782 7783 7784 7785 7786 7787 7788 7789 7790 7791 7792 7793 7794 7795 7796 7797 7798 7799 7800 7801 7802 7803 7804 7805 7806 7807 7808 7809 7810 7811 7812 7813 7814 7815 7816 7817 7818 7819 7820 7821 7822 7823 7824 7825 7826 7827 7828 7829 7830 7831 7832 7833 7834 7835 7836 7837 7838 7839 7840 7841 7842 7843 7844 7845 7846 7847 7848 7849 7850 7851 7852 7853 7854 7855 7856 7857 7858 7859 7860 7861 7862 7863 7864 7865 7866 7867 7868 7869 7870 7871 7872 7873 7874 7875 7876 7877 7878 7879 7880 7881 7882 7883 7884 7885 7886 7887 7888 7889 7890 7891 7892 7893 7894 7895 7896 7897 7898 7899 7900 7901 7902 7903 7904 7905 7906 7907 7908 7909 7910 7911 7912 7913 7914 7915 7916 7917 7918 7919 7920 7921 7922 7923 7924 7925 7926 7927 7928 7929 7930 7931 7932 7933 7934 7935 7936 7937 7938 7939 7940 7941 7942 7943 7944 7945 7946 7947 7948 7949 7950 7951 7952 7953 7954 7955 7956 7957 7958 7959 7960 7961 7962 7963 7964 7965 7966 7967 7968 7969 7970 7971 7972 7973 7974 7975 7976 7977 7978 7979 7980 7981 7982 7983 7984 7985 7986 7987 7988 7989 7990 7991 7992 7993 7994 7995 7996 7997 7998 7999 8000 8001 8002 8003 8004 8005 8006 8007 8008 8009 8010 8011 8012 8013 8014 8015 8016 8017 8018 8019 8020 8021 8022 8023 8024 8025 8026 8027 8028 8029 8030 8031 8032 8033 8034 8035 8036 8037 8038 8039 8040 8041 8042 8043 8044 8045 8046 8047 8048 8049 8050 8051 8052 8053 8054 8055 8056 8057 8058 8059 8060 8061 8062 8063 8064 8065 8066 8067 8068 8069 8070 8071 8072 8073 8074 8075 8076 8077 8078 8079 8080 8081 8082 8083 8084 8085 8086 8087 8088 8089 8090 8091 8092 8093 8094 8095 8096 8097 8098 8099 8100 8101 8102 8103 8104 8105 8106 8107 8108 8109 8110 8111 8112 8113 8114 8115 8116 8117 8118 8119 8120 8121 8122 8123 8124 8125 8126 8127 8128 8129 8130 8131 8132 8133 8134 8135 8136 8137 8138 8139 8140 8141 8142 8143 8144 8145 8146 8147 8148 8149 8150 8151 8152 8153 8154 8155 8156 8157 8158 8159 8160 8161 8162 8163 8164 8165 8166 8167 8168 8169 8170 8171 8172 8173 8174 8175 8176 8177 8178 8179 8180 8181 8182 8183 8184 8185 8186 8187 8188 8189 8190 8191 8192 8193 8194 8195 8196 8197 8198 8199 8200 8201 8202 8203 8204 8205 8206 8207 8208 8209 8210 8211 8212 8213 8214 8215 8216 8217 8218 8219 8220 8221 8222 8223 8224 8225 8226 8227 8228 8229 8230 8231 8232 8233 8234 8235 8236 8237 8238 8239 8240 8241 8242 8243 8244 8245 8246 8247 8248 8249 8250 8251 8252 8253 8254 8255 8256 8257 8258 8259 8260 8261 8262 8263 8264 8265 8266 8267 8268 8269 8270 8271 8272 8273 8274 8275 8276 8277 8278 8279 8280 8281 8282 8283 8284 8285 8286 8287 8288 8289 8290 8291 8292 8293 8294 8295 8296 8297 8298 8299 8300 8301 8302 8303 8304 8305 8306 8307 8308 8309 8310 8311 8312 8313 8314 8315 8316 8317 8318 8319 8320 8321 8322 8323 8324 8325 8326 8327 8328 8329 8330 8331 8332 8333 8334 8335 8336 8337 8338 8339 8340 8341 8342 8343 8344 8345 8346 8347 8348 8349 8350 8351 8352 8353 8354 8355 8356 8357 8358 8359 8360 8361 8362 8363 8364 8365 8366 8367 8368 8369 8370 8371 8372 8373 8374 8375 8376 8377 8378 8379 8380 8381 8382 8383 8384 8385 8386 8387 8388 8389 8390 8391 8392 8393 8394 8395 8396 8397 8398 8399 8400 8401 8402 8403 8404 8405 8406 8407 8408 8409 8410 8411 8412 8413 | // SPDX-License-Identifier: GPL-2.0 /* * Copyright (C) 2012 Alexander Block. All rights reserved. */ #include <linux/bsearch.h> #include <linux/fs.h> #include <linux/file.h> #include <linux/sort.h> #include <linux/mount.h> #include <linux/xattr.h> #include <linux/posix_acl_xattr.h> #include <linux/radix-tree.h> #include <linux/vmalloc.h> #include <linux/string.h> #include <linux/compat.h> #include <linux/crc32c.h> #include <linux/fsverity.h> #include "send.h" #include "ctree.h" #include "backref.h" #include "locking.h" #include "disk-io.h" #include "btrfs_inode.h" #include "transaction.h" #include "compression.h" #include "xattr.h" #include "print-tree.h" #include "accessors.h" #include "dir-item.h" #include "file-item.h" #include "ioctl.h" #include "verity.h" #include "lru_cache.h" /* * Maximum number of references an extent can have in order for us to attempt to * issue clone operations instead of write operations. This currently exists to * avoid hitting limitations of the backreference walking code (taking a lot of * time and using too much memory for extents with large number of references). */ #define SEND_MAX_EXTENT_REFS 1024 /* * A fs_path is a helper to dynamically build path names with unknown size. * It reallocates the internal buffer on demand. * It allows fast adding of path elements on the right side (normal path) and * fast adding to the left side (reversed path). A reversed path can also be * unreversed if needed. */ struct fs_path { union { struct { char *start; char *end; char *buf; unsigned short buf_len:15; unsigned short reversed:1; char inline_buf[]; }; /* * Average path length does not exceed 200 bytes, we'll have * better packing in the slab and higher chance to satisfy * a allocation later during send. */ char pad[256]; }; }; #define FS_PATH_INLINE_SIZE \ (sizeof(struct fs_path) - offsetof(struct fs_path, inline_buf)) /* reused for each extent */ struct clone_root { struct btrfs_root *root; u64 ino; u64 offset; u64 num_bytes; bool found_ref; }; #define SEND_MAX_NAME_CACHE_SIZE 256 /* * Limit the root_ids array of struct backref_cache_entry to 17 elements. * This makes the size of a cache entry to be exactly 192 bytes on x86_64, which * can be satisfied from the kmalloc-192 slab, without wasting any space. * The most common case is to have a single root for cloning, which corresponds * to the send root. Having the user specify more than 16 clone roots is not * common, and in such rare cases we simply don't use caching if the number of * cloning roots that lead down to a leaf is more than 17. */ #define SEND_MAX_BACKREF_CACHE_ROOTS 17 /* * Max number of entries in the cache. * With SEND_MAX_BACKREF_CACHE_ROOTS as 17, the size in bytes, excluding * maple tree's internal nodes, is 24K. */ #define SEND_MAX_BACKREF_CACHE_SIZE 128 /* * A backref cache entry maps a leaf to a list of IDs of roots from which the * leaf is accessible and we can use for clone operations. * With SEND_MAX_BACKREF_CACHE_ROOTS as 12, each cache entry is 128 bytes (on * x86_64). */ struct backref_cache_entry { struct btrfs_lru_cache_entry entry; u64 root_ids[SEND_MAX_BACKREF_CACHE_ROOTS]; /* Number of valid elements in the root_ids array. */ int num_roots; }; /* See the comment at lru_cache.h about struct btrfs_lru_cache_entry. */ static_assert(offsetof(struct backref_cache_entry, entry) == 0); /* * Max number of entries in the cache that stores directories that were already * created. The cache uses raw struct btrfs_lru_cache_entry entries, so it uses * at most 4096 bytes - sizeof(struct btrfs_lru_cache_entry) is 48 bytes, but * the kmalloc-64 slab is used, so we get 4096 bytes (64 bytes * 64). */ #define SEND_MAX_DIR_CREATED_CACHE_SIZE 64 /* * Max number of entries in the cache that stores directories that were already * created. The cache uses raw struct btrfs_lru_cache_entry entries, so it uses * at most 4096 bytes - sizeof(struct btrfs_lru_cache_entry) is 48 bytes, but * the kmalloc-64 slab is used, so we get 4096 bytes (64 bytes * 64). */ #define SEND_MAX_DIR_UTIMES_CACHE_SIZE 64 struct send_ctx { struct file *send_filp; loff_t send_off; char *send_buf; u32 send_size; u32 send_max_size; /* * Whether BTRFS_SEND_A_DATA attribute was already added to current * command (since protocol v2, data must be the last attribute). */ bool put_data; struct page **send_buf_pages; u64 flags; /* 'flags' member of btrfs_ioctl_send_args is u64 */ /* Protocol version compatibility requested */ u32 proto; struct btrfs_root *send_root; struct btrfs_root *parent_root; struct clone_root *clone_roots; int clone_roots_cnt; /* current state of the compare_tree call */ struct btrfs_path *left_path; struct btrfs_path *right_path; struct btrfs_key *cmp_key; /* * Keep track of the generation of the last transaction that was used * for relocating a block group. This is periodically checked in order * to detect if a relocation happened since the last check, so that we * don't operate on stale extent buffers for nodes (level >= 1) or on * stale disk_bytenr values of file extent items. */ u64 last_reloc_trans; /* * infos of the currently processed inode. In case of deleted inodes, * these are the values from the deleted inode. */ u64 cur_ino; u64 cur_inode_gen; u64 cur_inode_size; u64 cur_inode_mode; u64 cur_inode_rdev; u64 cur_inode_last_extent; u64 cur_inode_next_write_offset; bool cur_inode_new; bool cur_inode_new_gen; bool cur_inode_deleted; bool ignore_cur_inode; bool cur_inode_needs_verity; void *verity_descriptor; u64 send_progress; struct list_head new_refs; struct list_head deleted_refs; struct btrfs_lru_cache name_cache; /* * The inode we are currently processing. It's not NULL only when we * need to issue write commands for data extents from this inode. */ struct inode *cur_inode; struct file_ra_state ra; u64 page_cache_clear_start; bool clean_page_cache; /* * We process inodes by their increasing order, so if before an * incremental send we reverse the parent/child relationship of * directories such that a directory with a lower inode number was * the parent of a directory with a higher inode number, and the one * becoming the new parent got renamed too, we can't rename/move the * directory with lower inode number when we finish processing it - we * must process the directory with higher inode number first, then * rename/move it and then rename/move the directory with lower inode * number. Example follows. * * Tree state when the first send was performed: * * . * |-- a (ino 257) * |-- b (ino 258) * | * | * |-- c (ino 259) * | |-- d (ino 260) * | * |-- c2 (ino 261) * * Tree state when the second (incremental) send is performed: * * . * |-- a (ino 257) * |-- b (ino 258) * |-- c2 (ino 261) * |-- d2 (ino 260) * |-- cc (ino 259) * * The sequence of steps that lead to the second state was: * * mv /a/b/c/d /a/b/c2/d2 * mv /a/b/c /a/b/c2/d2/cc * * "c" has lower inode number, but we can't move it (2nd mv operation) * before we move "d", which has higher inode number. * * So we just memorize which move/rename operations must be performed * later when their respective parent is processed and moved/renamed. */ /* Indexed by parent directory inode number. */ struct rb_root pending_dir_moves; /* * Reverse index, indexed by the inode number of a directory that * is waiting for the move/rename of its immediate parent before its * own move/rename can be performed. */ struct rb_root waiting_dir_moves; /* * A directory that is going to be rm'ed might have a child directory * which is in the pending directory moves index above. In this case, * the directory can only be removed after the move/rename of its child * is performed. Example: * * Parent snapshot: * * . (ino 256) * |-- a/ (ino 257) * |-- b/ (ino 258) * |-- c/ (ino 259) * | |-- x/ (ino 260) * | * |-- y/ (ino 261) * * Send snapshot: * * . (ino 256) * |-- a/ (ino 257) * |-- b/ (ino 258) * |-- YY/ (ino 261) * |-- x/ (ino 260) * * Sequence of steps that lead to the send snapshot: * rm -f /a/b/c/foo.txt * mv /a/b/y /a/b/YY * mv /a/b/c/x /a/b/YY * rmdir /a/b/c * * When the child is processed, its move/rename is delayed until its * parent is processed (as explained above), but all other operations * like update utimes, chown, chgrp, etc, are performed and the paths * that it uses for those operations must use the orphanized name of * its parent (the directory we're going to rm later), so we need to * memorize that name. * * Indexed by the inode number of the directory to be deleted. */ struct rb_root orphan_dirs; struct rb_root rbtree_new_refs; struct rb_root rbtree_deleted_refs; struct btrfs_lru_cache backref_cache; u64 backref_cache_last_reloc_trans; struct btrfs_lru_cache dir_created_cache; struct btrfs_lru_cache dir_utimes_cache; }; struct pending_dir_move { struct rb_node node; struct list_head list; u64 parent_ino; u64 ino; u64 gen; struct list_head update_refs; }; struct waiting_dir_move { struct rb_node node; u64 ino; /* * There might be some directory that could not be removed because it * was waiting for this directory inode to be moved first. Therefore * after this directory is moved, we can try to rmdir the ino rmdir_ino. */ u64 rmdir_ino; u64 rmdir_gen; bool orphanized; }; struct orphan_dir_info { struct rb_node node; u64 ino; u64 gen; u64 last_dir_index_offset; u64 dir_high_seq_ino; }; struct name_cache_entry { /* * The key in the entry is an inode number, and the generation matches * the inode's generation. */ struct btrfs_lru_cache_entry entry; u64 parent_ino; u64 parent_gen; int ret; int need_later_update; int name_len; char name[]; }; /* See the comment at lru_cache.h about struct btrfs_lru_cache_entry. */ static_assert(offsetof(struct name_cache_entry, entry) == 0); #define ADVANCE 1 #define ADVANCE_ONLY_NEXT -1 enum btrfs_compare_tree_result { BTRFS_COMPARE_TREE_NEW, BTRFS_COMPARE_TREE_DELETED, BTRFS_COMPARE_TREE_CHANGED, BTRFS_COMPARE_TREE_SAME, }; __cold static void inconsistent_snapshot_error(struct send_ctx *sctx, enum btrfs_compare_tree_result result, const char *what) { const char *result_string; switch (result) { case BTRFS_COMPARE_TREE_NEW: result_string = "new"; break; case BTRFS_COMPARE_TREE_DELETED: result_string = "deleted"; break; case BTRFS_COMPARE_TREE_CHANGED: result_string = "updated"; break; case BTRFS_COMPARE_TREE_SAME: ASSERT(0); result_string = "unchanged"; break; default: ASSERT(0); result_string = "unexpected"; } btrfs_err(sctx->send_root->fs_info, "Send: inconsistent snapshot, found %s %s for inode %llu without updated inode item, send root is %llu, parent root is %llu", result_string, what, sctx->cmp_key->objectid, sctx->send_root->root_key.objectid, (sctx->parent_root ? sctx->parent_root->root_key.objectid : 0)); } __maybe_unused static bool proto_cmd_ok(const struct send_ctx *sctx, int cmd) { switch (sctx->proto) { case 1: return cmd <= BTRFS_SEND_C_MAX_V1; case 2: return cmd <= BTRFS_SEND_C_MAX_V2; case 3: return cmd <= BTRFS_SEND_C_MAX_V3; default: return false; } } static int is_waiting_for_move(struct send_ctx *sctx, u64 ino); static struct waiting_dir_move * get_waiting_dir_move(struct send_ctx *sctx, u64 ino); static int is_waiting_for_rm(struct send_ctx *sctx, u64 dir_ino, u64 gen); static int need_send_hole(struct send_ctx *sctx) { return (sctx->parent_root && !sctx->cur_inode_new && !sctx->cur_inode_new_gen && !sctx->cur_inode_deleted && S_ISREG(sctx->cur_inode_mode)); } static void fs_path_reset(struct fs_path *p) { if (p->reversed) { p->start = p->buf + p->buf_len - 1; p->end = p->start; *p->start = 0; } else { p->start = p->buf; p->end = p->start; *p->start = 0; } } static struct fs_path *fs_path_alloc(void) { struct fs_path *p; p = kmalloc(sizeof(*p), GFP_KERNEL); if (!p) return NULL; p->reversed = 0; p->buf = p->inline_buf; p->buf_len = FS_PATH_INLINE_SIZE; fs_path_reset(p); return p; } static struct fs_path *fs_path_alloc_reversed(void) { struct fs_path *p; p = fs_path_alloc(); if (!p) return NULL; p->reversed = 1; fs_path_reset(p); return p; } static void fs_path_free(struct fs_path *p) { if (!p) return; if (p->buf != p->inline_buf) kfree(p->buf); kfree(p); } static int fs_path_len(struct fs_path *p) { return p->end - p->start; } static int fs_path_ensure_buf(struct fs_path *p, int len) { char *tmp_buf; int path_len; int old_buf_len; len++; if (p->buf_len >= len) return 0; if (len > PATH_MAX) { WARN_ON(1); return -ENOMEM; } path_len = p->end - p->start; old_buf_len = p->buf_len; /* * Allocate to the next largest kmalloc bucket size, to let * the fast path happen most of the time. */ len = kmalloc_size_roundup(len); /* * First time the inline_buf does not suffice */ if (p->buf == p->inline_buf) { tmp_buf = kmalloc(len, GFP_KERNEL); if (tmp_buf) memcpy(tmp_buf, p->buf, old_buf_len); } else { tmp_buf = krealloc(p->buf, len, GFP_KERNEL); } if (!tmp_buf) return -ENOMEM; p->buf = tmp_buf; p->buf_len = len; if (p->reversed) { tmp_buf = p->buf + old_buf_len - path_len - 1; p->end = p->buf + p->buf_len - 1; p->start = p->end - path_len; memmove(p->start, tmp_buf, path_len + 1); } else { p->start = p->buf; p->end = p->start + path_len; } return 0; } static int fs_path_prepare_for_add(struct fs_path *p, int name_len, char **prepared) { int ret; int new_len; new_len = p->end - p->start + name_len; if (p->start != p->end) new_len++; ret = fs_path_ensure_buf(p, new_len); if (ret < 0) goto out; if (p->reversed) { if (p->start != p->end) *--p->start = '/'; p->start -= name_len; *prepared = p->start; } else { if (p->start != p->end) *p->end++ = '/'; *prepared = p->end; p->end += name_len; *p->end = 0; } out: return ret; } static int fs_path_add(struct fs_path *p, const char *name, int name_len) { int ret; char *prepared; ret = fs_path_prepare_for_add(p, name_len, &prepared); if (ret < 0) goto out; memcpy(prepared, name, name_len); out: return ret; } static int fs_path_add_path(struct fs_path *p, struct fs_path *p2) { int ret; char *prepared; ret = fs_path_prepare_for_add(p, p2->end - p2->start, &prepared); if (ret < 0) goto out; memcpy(prepared, p2->start, p2->end - p2->start); out: return ret; } static int fs_path_add_from_extent_buffer(struct fs_path *p, struct extent_buffer *eb, unsigned long off, int len) { int ret; char *prepared; ret = fs_path_prepare_for_add(p, len, &prepared); if (ret < 0) goto out; read_extent_buffer(eb, prepared, off, len); out: return ret; } static int fs_path_copy(struct fs_path *p, struct fs_path *from) { p->reversed = from->reversed; fs_path_reset(p); return fs_path_add_path(p, from); } static void fs_path_unreverse(struct fs_path *p) { char *tmp; int len; if (!p->reversed) return; tmp = p->start; len = p->end - p->start; p->start = p->buf; p->end = p->start + len; memmove(p->start, tmp, len + 1); p->reversed = 0; } static struct btrfs_path *alloc_path_for_send(void) { struct btrfs_path *path; path = btrfs_alloc_path(); if (!path) return NULL; path->search_commit_root = 1; path->skip_locking = 1; path->need_commit_sem = 1; return path; } static int write_buf(struct file *filp, const void *buf, u32 len, loff_t *off) { int ret; u32 pos = 0; while (pos < len) { ret = kernel_write(filp, buf + pos, len - pos, off); if (ret < 0) return ret; if (ret == 0) return -EIO; pos += ret; } return 0; } static int tlv_put(struct send_ctx *sctx, u16 attr, const void *data, int len) { struct btrfs_tlv_header *hdr; int total_len = sizeof(*hdr) + len; int left = sctx->send_max_size - sctx->send_size; if (WARN_ON_ONCE(sctx->put_data)) return -EINVAL; if (unlikely(left < total_len)) return -EOVERFLOW; hdr = (struct btrfs_tlv_header *) (sctx->send_buf + sctx->send_size); put_unaligned_le16(attr, &hdr->tlv_type); put_unaligned_le16(len, &hdr->tlv_len); memcpy(hdr + 1, data, len); sctx->send_size += total_len; return 0; } #define TLV_PUT_DEFINE_INT(bits) \ static int tlv_put_u##bits(struct send_ctx *sctx, \ u##bits attr, u##bits value) \ { \ __le##bits __tmp = cpu_to_le##bits(value); \ return tlv_put(sctx, attr, &__tmp, sizeof(__tmp)); \ } TLV_PUT_DEFINE_INT(8) TLV_PUT_DEFINE_INT(32) TLV_PUT_DEFINE_INT(64) static int tlv_put_string(struct send_ctx *sctx, u16 attr, const char *str, int len) { if (len == -1) len = strlen(str); return tlv_put(sctx, attr, str, len); } static int tlv_put_uuid(struct send_ctx *sctx, u16 attr, const u8 *uuid) { return tlv_put(sctx, attr, uuid, BTRFS_UUID_SIZE); } static int tlv_put_btrfs_timespec(struct send_ctx *sctx, u16 attr, struct extent_buffer *eb, struct btrfs_timespec *ts) { struct btrfs_timespec bts; read_extent_buffer(eb, &bts, (unsigned long)ts, sizeof(bts)); return tlv_put(sctx, attr, &bts, sizeof(bts)); } #define TLV_PUT(sctx, attrtype, data, attrlen) \ do { \ ret = tlv_put(sctx, attrtype, data, attrlen); \ if (ret < 0) \ goto tlv_put_failure; \ } while (0) #define TLV_PUT_INT(sctx, attrtype, bits, value) \ do { \ ret = tlv_put_u##bits(sctx, attrtype, value); \ if (ret < 0) \ goto tlv_put_failure; \ } while (0) #define TLV_PUT_U8(sctx, attrtype, data) TLV_PUT_INT(sctx, attrtype, 8, data) #define TLV_PUT_U16(sctx, attrtype, data) TLV_PUT_INT(sctx, attrtype, 16, data) #define TLV_PUT_U32(sctx, attrtype, data) TLV_PUT_INT(sctx, attrtype, 32, data) #define TLV_PUT_U64(sctx, attrtype, data) TLV_PUT_INT(sctx, attrtype, 64, data) #define TLV_PUT_STRING(sctx, attrtype, str, len) \ do { \ ret = tlv_put_string(sctx, attrtype, str, len); \ if (ret < 0) \ goto tlv_put_failure; \ } while (0) #define TLV_PUT_PATH(sctx, attrtype, p) \ do { \ ret = tlv_put_string(sctx, attrtype, p->start, \ p->end - p->start); \ if (ret < 0) \ goto tlv_put_failure; \ } while(0) #define TLV_PUT_UUID(sctx, attrtype, uuid) \ do { \ ret = tlv_put_uuid(sctx, attrtype, uuid); \ if (ret < 0) \ goto tlv_put_failure; \ } while (0) #define TLV_PUT_BTRFS_TIMESPEC(sctx, attrtype, eb, ts) \ do { \ ret = tlv_put_btrfs_timespec(sctx, attrtype, eb, ts); \ if (ret < 0) \ goto tlv_put_failure; \ } while (0) static int send_header(struct send_ctx *sctx) { struct btrfs_stream_header hdr; strcpy(hdr.magic, BTRFS_SEND_STREAM_MAGIC); hdr.version = cpu_to_le32(sctx->proto); return write_buf(sctx->send_filp, &hdr, sizeof(hdr), &sctx->send_off); } /* * For each command/item we want to send to userspace, we call this function. */ static int begin_cmd(struct send_ctx *sctx, int cmd) { struct btrfs_cmd_header *hdr; if (WARN_ON(!sctx->send_buf)) return -EINVAL; BUG_ON(sctx->send_size); sctx->send_size += sizeof(*hdr); hdr = (struct btrfs_cmd_header *)sctx->send_buf; put_unaligned_le16(cmd, &hdr->cmd); return 0; } static int send_cmd(struct send_ctx *sctx) { int ret; struct btrfs_cmd_header *hdr; u32 crc; hdr = (struct btrfs_cmd_header *)sctx->send_buf; put_unaligned_le32(sctx->send_size - sizeof(*hdr), &hdr->len); put_unaligned_le32(0, &hdr->crc); crc = btrfs_crc32c(0, (unsigned char *)sctx->send_buf, sctx->send_size); put_unaligned_le32(crc, &hdr->crc); ret = write_buf(sctx->send_filp, sctx->send_buf, sctx->send_size, &sctx->send_off); sctx->send_size = 0; sctx->put_data = false; return ret; } /* * Sends a move instruction to user space */ static int send_rename(struct send_ctx *sctx, struct fs_path *from, struct fs_path *to) { struct btrfs_fs_info *fs_info = sctx->send_root->fs_info; int ret; btrfs_debug(fs_info, "send_rename %s -> %s", from->start, to->start); ret = begin_cmd(sctx, BTRFS_SEND_C_RENAME); if (ret < 0) goto out; TLV_PUT_PATH(sctx, BTRFS_SEND_A_PATH, from); TLV_PUT_PATH(sctx, BTRFS_SEND_A_PATH_TO, to); ret = send_cmd(sctx); tlv_put_failure: out: return ret; } /* * Sends a link instruction to user space */ static int send_link(struct send_ctx *sctx, struct fs_path *path, struct fs_path *lnk) { struct btrfs_fs_info *fs_info = sctx->send_root->fs_info; int ret; btrfs_debug(fs_info, "send_link %s -> %s", path->start, lnk->start); ret = begin_cmd(sctx, BTRFS_SEND_C_LINK); if (ret < 0) goto out; TLV_PUT_PATH(sctx, BTRFS_SEND_A_PATH, path); TLV_PUT_PATH(sctx, BTRFS_SEND_A_PATH_LINK, lnk); ret = send_cmd(sctx); tlv_put_failure: out: return ret; } /* * Sends an unlink instruction to user space */ static int send_unlink(struct send_ctx *sctx, struct fs_path *path) { struct btrfs_fs_info *fs_info = sctx->send_root->fs_info; int ret; btrfs_debug(fs_info, "send_unlink %s", path->start); ret = begin_cmd(sctx, BTRFS_SEND_C_UNLINK); if (ret < 0) goto out; TLV_PUT_PATH(sctx, BTRFS_SEND_A_PATH, path); ret = send_cmd(sctx); tlv_put_failure: out: return ret; } /* * Sends a rmdir instruction to user space */ static int send_rmdir(struct send_ctx *sctx, struct fs_path *path) { struct btrfs_fs_info *fs_info = sctx->send_root->fs_info; int ret; btrfs_debug(fs_info, "send_rmdir %s", path->start); ret = begin_cmd(sctx, BTRFS_SEND_C_RMDIR); if (ret < 0) goto out; TLV_PUT_PATH(sctx, BTRFS_SEND_A_PATH, path); ret = send_cmd(sctx); tlv_put_failure: out: return ret; } struct btrfs_inode_info { u64 size; u64 gen; u64 mode; u64 uid; u64 gid; u64 rdev; u64 fileattr; u64 nlink; }; /* * Helper function to retrieve some fields from an inode item. */ static int get_inode_info(struct btrfs_root *root, u64 ino, struct btrfs_inode_info *info) { int ret; struct btrfs_path *path; struct btrfs_inode_item *ii; struct btrfs_key key; path = alloc_path_for_send(); if (!path) return -ENOMEM; key.objectid = ino; key.type = BTRFS_INODE_ITEM_KEY; key.offset = 0; ret = btrfs_search_slot(NULL, root, &key, path, 0, 0); if (ret) { if (ret > 0) ret = -ENOENT; goto out; } if (!info) goto out; ii = btrfs_item_ptr(path->nodes[0], path->slots[0], struct btrfs_inode_item); info->size = btrfs_inode_size(path->nodes[0], ii); info->gen = btrfs_inode_generation(path->nodes[0], ii); info->mode = btrfs_inode_mode(path->nodes[0], ii); info->uid = btrfs_inode_uid(path->nodes[0], ii); info->gid = btrfs_inode_gid(path->nodes[0], ii); info->rdev = btrfs_inode_rdev(path->nodes[0], ii); info->nlink = btrfs_inode_nlink(path->nodes[0], ii); /* * Transfer the unchanged u64 value of btrfs_inode_item::flags, that's * otherwise logically split to 32/32 parts. */ info->fileattr = btrfs_inode_flags(path->nodes[0], ii); out: btrfs_free_path(path); return ret; } static int get_inode_gen(struct btrfs_root *root, u64 ino, u64 *gen) { int ret; struct btrfs_inode_info info = { 0 }; ASSERT(gen); ret = get_inode_info(root, ino, &info); *gen = info.gen; return ret; } typedef int (*iterate_inode_ref_t)(int num, u64 dir, int index, struct fs_path *p, void *ctx); /* * Helper function to iterate the entries in ONE btrfs_inode_ref or * btrfs_inode_extref. * The iterate callback may return a non zero value to stop iteration. This can * be a negative value for error codes or 1 to simply stop it. * * path must point to the INODE_REF or INODE_EXTREF when called. */ static int iterate_inode_ref(struct btrfs_root *root, struct btrfs_path *path, struct btrfs_key *found_key, int resolve, iterate_inode_ref_t iterate, void *ctx) { struct extent_buffer *eb = path->nodes[0]; struct btrfs_inode_ref *iref; struct btrfs_inode_extref *extref; struct btrfs_path *tmp_path; struct fs_path *p; u32 cur = 0; u32 total; int slot = path->slots[0]; u32 name_len; char *start; int ret = 0; int num = 0; int index; u64 dir; unsigned long name_off; unsigned long elem_size; unsigned long ptr; p = fs_path_alloc_reversed(); if (!p) return -ENOMEM; tmp_path = alloc_path_for_send(); if (!tmp_path) { fs_path_free(p); return -ENOMEM; } if (found_key->type == BTRFS_INODE_REF_KEY) { ptr = (unsigned long)btrfs_item_ptr(eb, slot, struct btrfs_inode_ref); total = btrfs_item_size(eb, slot); elem_size = sizeof(*iref); } else { ptr = btrfs_item_ptr_offset(eb, slot); total = btrfs_item_size(eb, slot); elem_size = sizeof(*extref); } while (cur < total) { fs_path_reset(p); if (found_key->type == BTRFS_INODE_REF_KEY) { iref = (struct btrfs_inode_ref *)(ptr + cur); name_len = btrfs_inode_ref_name_len(eb, iref); name_off = (unsigned long)(iref + 1); index = btrfs_inode_ref_index(eb, iref); dir = found_key->offset; } else { extref = (struct btrfs_inode_extref *)(ptr + cur); name_len = btrfs_inode_extref_name_len(eb, extref); name_off = (unsigned long)&extref->name; index = btrfs_inode_extref_index(eb, extref); dir = btrfs_inode_extref_parent(eb, extref); } if (resolve) { start = btrfs_ref_to_path(root, tmp_path, name_len, name_off, eb, dir, p->buf, p->buf_len); if (IS_ERR(start)) { ret = PTR_ERR(start); goto out; } if (start < p->buf) { /* overflow , try again with larger buffer */ ret = fs_path_ensure_buf(p, p->buf_len + p->buf - start); if (ret < 0) goto out; start = btrfs_ref_to_path(root, tmp_path, name_len, name_off, eb, dir, p->buf, p->buf_len); if (IS_ERR(start)) { ret = PTR_ERR(start); goto out; } BUG_ON(start < p->buf); } p->start = start; } else { ret = fs_path_add_from_extent_buffer(p, eb, name_off, name_len); if (ret < 0) goto out; } cur += elem_size + name_len; ret = iterate(num, dir, index, p, ctx); if (ret) goto out; num++; } out: btrfs_free_path(tmp_path); fs_path_free(p); return ret; } typedef int (*iterate_dir_item_t)(int num, struct btrfs_key *di_key, const char *name, int name_len, const char *data, int data_len, void *ctx); /* * Helper function to iterate the entries in ONE btrfs_dir_item. * The iterate callback may return a non zero value to stop iteration. This can * be a negative value for error codes or 1 to simply stop it. * * path must point to the dir item when called. */ static int iterate_dir_item(struct btrfs_root *root, struct btrfs_path *path, iterate_dir_item_t iterate, void *ctx) { int ret = 0; struct extent_buffer *eb; struct btrfs_dir_item *di; struct btrfs_key di_key; char *buf = NULL; int buf_len; u32 name_len; u32 data_len; u32 cur; u32 len; u32 total; int slot; int num; /* * Start with a small buffer (1 page). If later we end up needing more * space, which can happen for xattrs on a fs with a leaf size greater * then the page size, attempt to increase the buffer. Typically xattr * values are small. */ buf_len = PATH_MAX; buf = kmalloc(buf_len, GFP_KERNEL); if (!buf) { ret = -ENOMEM; goto out; } eb = path->nodes[0]; slot = path->slots[0]; di = btrfs_item_ptr(eb, slot, struct btrfs_dir_item); cur = 0; len = 0; total = btrfs_item_size(eb, slot); num = 0; while (cur < total) { name_len = btrfs_dir_name_len(eb, di); data_len = btrfs_dir_data_len(eb, di); btrfs_dir_item_key_to_cpu(eb, di, &di_key); if (btrfs_dir_ftype(eb, di) == BTRFS_FT_XATTR) { if (name_len > XATTR_NAME_MAX) { ret = -ENAMETOOLONG; goto out; } if (name_len + data_len > BTRFS_MAX_XATTR_SIZE(root->fs_info)) { ret = -E2BIG; goto out; } } else { /* * Path too long */ if (name_len + data_len > PATH_MAX) { ret = -ENAMETOOLONG; goto out; } } if (name_len + data_len > buf_len) { buf_len = name_len + data_len; if (is_vmalloc_addr(buf)) { vfree(buf); buf = NULL; } else { char *tmp = krealloc(buf, buf_len, GFP_KERNEL | __GFP_NOWARN); if (!tmp) kfree(buf); buf = tmp; } if (!buf) { buf = kvmalloc(buf_len, GFP_KERNEL); if (!buf) { ret = -ENOMEM; goto out; } } } read_extent_buffer(eb, buf, (unsigned long)(di + 1), name_len + data_len); len = sizeof(*di) + name_len + data_len; di = (struct btrfs_dir_item *)((char *)di + len); cur += len; ret = iterate(num, &di_key, buf, name_len, buf + name_len, data_len, ctx); if (ret < 0) goto out; if (ret) { ret = 0; goto out; } num++; } out: kvfree(buf); return ret; } static int __copy_first_ref(int num, u64 dir, int index, struct fs_path *p, void *ctx) { int ret; struct fs_path *pt = ctx; ret = fs_path_copy(pt, p); if (ret < 0) return ret; /* we want the first only */ return 1; } /* * Retrieve the first path of an inode. If an inode has more then one * ref/hardlink, this is ignored. */ static int get_inode_path(struct btrfs_root *root, u64 ino, struct fs_path *path) { int ret; struct btrfs_key key, found_key; struct btrfs_path *p; p = alloc_path_for_send(); if (!p) return -ENOMEM; fs_path_reset(path); key.objectid = ino; key.type = BTRFS_INODE_REF_KEY; key.offset = 0; ret = btrfs_search_slot_for_read(root, &key, p, 1, 0); if (ret < 0) goto out; if (ret) { ret = 1; goto out; } btrfs_item_key_to_cpu(p->nodes[0], &found_key, p->slots[0]); if (found_key.objectid != ino || (found_key.type != BTRFS_INODE_REF_KEY && found_key.type != BTRFS_INODE_EXTREF_KEY)) { ret = -ENOENT; goto out; } ret = iterate_inode_ref(root, p, &found_key, 1, __copy_first_ref, path); if (ret < 0) goto out; ret = 0; out: btrfs_free_path(p); return ret; } struct backref_ctx { struct send_ctx *sctx; /* number of total found references */ u64 found; /* * used for clones found in send_root. clones found behind cur_objectid * and cur_offset are not considered as allowed clones. */ u64 cur_objectid; u64 cur_offset; /* may be truncated in case it's the last extent in a file */ u64 extent_len; /* The bytenr the file extent item we are processing refers to. */ u64 bytenr; /* The owner (root id) of the data backref for the current extent. */ u64 backref_owner; /* The offset of the data backref for the current extent. */ u64 backref_offset; }; static int __clone_root_cmp_bsearch(const void *key, const void *elt) { u64 root = (u64)(uintptr_t)key; const struct clone_root *cr = elt; if (root < cr->root->root_key.objectid) return -1; if (root > cr->root->root_key.objectid) return 1; return 0; } static int __clone_root_cmp_sort(const void *e1, const void *e2) { const struct clone_root *cr1 = e1; const struct clone_root *cr2 = e2; if (cr1->root->root_key.objectid < cr2->root->root_key.objectid) return -1; if (cr1->root->root_key.objectid > cr2->root->root_key.objectid) return 1; return 0; } /* * Called for every backref that is found for the current extent. * Results are collected in sctx->clone_roots->ino/offset. */ static int iterate_backrefs(u64 ino, u64 offset, u64 num_bytes, u64 root_id, void *ctx_) { struct backref_ctx *bctx = ctx_; struct clone_root *clone_root; /* First check if the root is in the list of accepted clone sources */ clone_root = bsearch((void *)(uintptr_t)root_id, bctx->sctx->clone_roots, bctx->sctx->clone_roots_cnt, sizeof(struct clone_root), __clone_root_cmp_bsearch); if (!clone_root) return 0; /* This is our own reference, bail out as we can't clone from it. */ if (clone_root->root == bctx->sctx->send_root && ino == bctx->cur_objectid && offset == bctx->cur_offset) return 0; /* * Make sure we don't consider clones from send_root that are * behind the current inode/offset. */ if (clone_root->root == bctx->sctx->send_root) { /* * If the source inode was not yet processed we can't issue a * clone operation, as the source extent does not exist yet at * the destination of the stream. */ if (ino > bctx->cur_objectid) return 0; /* * We clone from the inode currently being sent as long as the * source extent is already processed, otherwise we could try * to clone from an extent that does not exist yet at the * destination of the stream. */ if (ino == bctx->cur_objectid && offset + bctx->extent_len > bctx->sctx->cur_inode_next_write_offset) return 0; } bctx->found++; clone_root->found_ref = true; /* * If the given backref refers to a file extent item with a larger * number of bytes than what we found before, use the new one so that * we clone more optimally and end up doing less writes and getting * less exclusive, non-shared extents at the destination. */ if (num_bytes > clone_root->num_bytes) { clone_root->ino = ino; clone_root->offset = offset; clone_root->num_bytes = num_bytes; /* * Found a perfect candidate, so there's no need to continue * backref walking. */ if (num_bytes >= bctx->extent_len) return BTRFS_ITERATE_EXTENT_INODES_STOP; } return 0; } static bool lookup_backref_cache(u64 leaf_bytenr, void *ctx, const u64 **root_ids_ret, int *root_count_ret) { struct backref_ctx *bctx = ctx; struct send_ctx *sctx = bctx->sctx; struct btrfs_fs_info *fs_info = sctx->send_root->fs_info; const u64 key = leaf_bytenr >> fs_info->sectorsize_bits; struct btrfs_lru_cache_entry *raw_entry; struct backref_cache_entry *entry; if (btrfs_lru_cache_size(&sctx->backref_cache) == 0) return false; /* * If relocation happened since we first filled the cache, then we must * empty the cache and can not use it, because even though we operate on * read-only roots, their leaves and nodes may have been reallocated and * now be used for different nodes/leaves of the same tree or some other * tree. * * We are called from iterate_extent_inodes() while either holding a * transaction handle or holding fs_info->commit_root_sem, so no need * to take any lock here. */ if (fs_info->last_reloc_trans > sctx->backref_cache_last_reloc_trans) { btrfs_lru_cache_clear(&sctx->backref_cache); return false; } raw_entry = btrfs_lru_cache_lookup(&sctx->backref_cache, key, 0); if (!raw_entry) return false; entry = container_of(raw_entry, struct backref_cache_entry, entry); *root_ids_ret = entry->root_ids; *root_count_ret = entry->num_roots; return true; } static void store_backref_cache(u64 leaf_bytenr, const struct ulist *root_ids, void *ctx) { struct backref_ctx *bctx = ctx; struct send_ctx *sctx = bctx->sctx; struct btrfs_fs_info *fs_info = sctx->send_root->fs_info; struct backref_cache_entry *new_entry; struct ulist_iterator uiter; struct ulist_node *node; int ret; /* * We're called while holding a transaction handle or while holding * fs_info->commit_root_sem (at iterate_extent_inodes()), so must do a * NOFS allocation. */ new_entry = kmalloc(sizeof(struct backref_cache_entry), GFP_NOFS); /* No worries, cache is optional. */ if (!new_entry) return; new_entry->entry.key = leaf_bytenr >> fs_info->sectorsize_bits; new_entry->entry.gen = 0; new_entry->num_roots = 0; ULIST_ITER_INIT(&uiter); while ((node = ulist_next(root_ids, &uiter)) != NULL) { const u64 root_id = node->val; struct clone_root *root; root = bsearch((void *)(uintptr_t)root_id, sctx->clone_roots, sctx->clone_roots_cnt, sizeof(struct clone_root), __clone_root_cmp_bsearch); if (!root) continue; /* Too many roots, just exit, no worries as caching is optional. */ if (new_entry->num_roots >= SEND_MAX_BACKREF_CACHE_ROOTS) { kfree(new_entry); return; } new_entry->root_ids[new_entry->num_roots] = root_id; new_entry->num_roots++; } /* * We may have not added any roots to the new cache entry, which means * none of the roots is part of the list of roots from which we are * allowed to clone. Cache the new entry as it's still useful to avoid * backref walking to determine which roots have a path to the leaf. * * Also use GFP_NOFS because we're called while holding a transaction * handle or while holding fs_info->commit_root_sem. */ ret = btrfs_lru_cache_store(&sctx->backref_cache, &new_entry->entry, GFP_NOFS); ASSERT(ret == 0 || ret == -ENOMEM); if (ret) { /* Caching is optional, no worries. */ kfree(new_entry); return; } /* * We are called from iterate_extent_inodes() while either holding a * transaction handle or holding fs_info->commit_root_sem, so no need * to take any lock here. */ if (btrfs_lru_cache_size(&sctx->backref_cache) == 1) sctx->backref_cache_last_reloc_trans = fs_info->last_reloc_trans; } static int check_extent_item(u64 bytenr, const struct btrfs_extent_item *ei, const struct extent_buffer *leaf, void *ctx) { const u64 refs = btrfs_extent_refs(leaf, ei); const struct backref_ctx *bctx = ctx; const struct send_ctx *sctx = bctx->sctx; if (bytenr == bctx->bytenr) { const u64 flags = btrfs_extent_flags(leaf, ei); if (WARN_ON(flags & BTRFS_EXTENT_FLAG_TREE_BLOCK)) return -EUCLEAN; /* * If we have only one reference and only the send root as a * clone source - meaning no clone roots were given in the * struct btrfs_ioctl_send_args passed to the send ioctl - then * it's our reference and there's no point in doing backref * walking which is expensive, so exit early. */ if (refs == 1 && sctx->clone_roots_cnt == 1) return -ENOENT; } /* * Backreference walking (iterate_extent_inodes() below) is currently * too expensive when an extent has a large number of references, both * in time spent and used memory. So for now just fallback to write * operations instead of clone operations when an extent has more than * a certain amount of references. */ if (refs > SEND_MAX_EXTENT_REFS) return -ENOENT; return 0; } static bool skip_self_data_ref(u64 root, u64 ino, u64 offset, void *ctx) { const struct backref_ctx *bctx = ctx; if (ino == bctx->cur_objectid && root == bctx->backref_owner && offset == bctx->backref_offset) return true; return false; } /* * Given an inode, offset and extent item, it finds a good clone for a clone * instruction. Returns -ENOENT when none could be found. The function makes * sure that the returned clone is usable at the point where sending is at the * moment. This means, that no clones are accepted which lie behind the current * inode+offset. * * path must point to the extent item when called. */ static int find_extent_clone(struct send_ctx *sctx, struct btrfs_path *path, u64 ino, u64 data_offset, u64 ino_size, struct clone_root **found) { struct btrfs_fs_info *fs_info = sctx->send_root->fs_info; int ret; int extent_type; u64 logical; u64 disk_byte; u64 num_bytes; struct btrfs_file_extent_item *fi; struct extent_buffer *eb = path->nodes[0]; struct backref_ctx backref_ctx = { 0 }; struct btrfs_backref_walk_ctx backref_walk_ctx = { 0 }; struct clone_root *cur_clone_root; int compressed; u32 i; /* * With fallocate we can get prealloc extents beyond the inode's i_size, * so we don't do anything here because clone operations can not clone * to a range beyond i_size without increasing the i_size of the * destination inode. */ if (data_offset >= ino_size) return 0; fi = btrfs_item_ptr(eb, path->slots[0], struct btrfs_file_extent_item); extent_type = btrfs_file_extent_type(eb, fi); if (extent_type == BTRFS_FILE_EXTENT_INLINE) return -ENOENT; disk_byte = btrfs_file_extent_disk_bytenr(eb, fi); if (disk_byte == 0) return -ENOENT; compressed = btrfs_file_extent_compression(eb, fi); num_bytes = btrfs_file_extent_num_bytes(eb, fi); logical = disk_byte + btrfs_file_extent_offset(eb, fi); /* * Setup the clone roots. */ for (i = 0; i < sctx->clone_roots_cnt; i++) { cur_clone_root = sctx->clone_roots + i; cur_clone_root->ino = (u64)-1; cur_clone_root->offset = 0; cur_clone_root->num_bytes = 0; cur_clone_root->found_ref = false; } backref_ctx.sctx = sctx; backref_ctx.cur_objectid = ino; backref_ctx.cur_offset = data_offset; backref_ctx.bytenr = disk_byte; /* * Use the header owner and not the send root's id, because in case of a * snapshot we can have shared subtrees. */ backref_ctx.backref_owner = btrfs_header_owner(eb); backref_ctx.backref_offset = data_offset - btrfs_file_extent_offset(eb, fi); /* * The last extent of a file may be too large due to page alignment. * We need to adjust extent_len in this case so that the checks in * iterate_backrefs() work. */ if (data_offset + num_bytes >= ino_size) backref_ctx.extent_len = ino_size - data_offset; else backref_ctx.extent_len = num_bytes; /* * Now collect all backrefs. */ backref_walk_ctx.bytenr = disk_byte; if (compressed == BTRFS_COMPRESS_NONE) backref_walk_ctx.extent_item_pos = btrfs_file_extent_offset(eb, fi); backref_walk_ctx.fs_info = fs_info; backref_walk_ctx.cache_lookup = lookup_backref_cache; backref_walk_ctx.cache_store = store_backref_cache; backref_walk_ctx.indirect_ref_iterator = iterate_backrefs; backref_walk_ctx.check_extent_item = check_extent_item; backref_walk_ctx.user_ctx = &backref_ctx; /* * If have a single clone root, then it's the send root and we can tell * the backref walking code to skip our own backref and not resolve it, * since we can not use it for cloning - the source and destination * ranges can't overlap and in case the leaf is shared through a subtree * due to snapshots, we can't use those other roots since they are not * in the list of clone roots. */ if (sctx->clone_roots_cnt == 1) backref_walk_ctx.skip_data_ref = skip_self_data_ref; ret = iterate_extent_inodes(&backref_walk_ctx, true, iterate_backrefs, &backref_ctx); if (ret < 0) return ret; down_read(&fs_info->commit_root_sem); if (fs_info->last_reloc_trans > sctx->last_reloc_trans) { /* * A transaction commit for a transaction in which block group * relocation was done just happened. * The disk_bytenr of the file extent item we processed is * possibly stale, referring to the extent's location before * relocation. So act as if we haven't found any clone sources * and fallback to write commands, which will read the correct * data from the new extent location. Otherwise we will fail * below because we haven't found our own back reference or we * could be getting incorrect sources in case the old extent * was already reallocated after the relocation. */ up_read(&fs_info->commit_root_sem); return -ENOENT; } up_read(&fs_info->commit_root_sem); btrfs_debug(fs_info, "find_extent_clone: data_offset=%llu, ino=%llu, num_bytes=%llu, logical=%llu", data_offset, ino, num_bytes, logical); if (!backref_ctx.found) { btrfs_debug(fs_info, "no clones found"); return -ENOENT; } cur_clone_root = NULL; for (i = 0; i < sctx->clone_roots_cnt; i++) { struct clone_root *clone_root = &sctx->clone_roots[i]; if (!clone_root->found_ref) continue; /* * Choose the root from which we can clone more bytes, to * minimize write operations and therefore have more extent * sharing at the destination (the same as in the source). */ if (!cur_clone_root || clone_root->num_bytes > cur_clone_root->num_bytes) { cur_clone_root = clone_root; /* * We found an optimal clone candidate (any inode from * any root is fine), so we're done. */ if (clone_root->num_bytes >= backref_ctx.extent_len) break; } } if (cur_clone_root) { *found = cur_clone_root; ret = 0; } else { ret = -ENOENT; } return ret; } static int read_symlink(struct btrfs_root *root, u64 ino, struct fs_path *dest) { int ret; struct btrfs_path *path; struct btrfs_key key; struct btrfs_file_extent_item *ei; u8 type; u8 compression; unsigned long off; int len; path = alloc_path_for_send(); if (!path) return -ENOMEM; key.objectid = ino; key.type = BTRFS_EXTENT_DATA_KEY; key.offset = 0; ret = btrfs_search_slot(NULL, root, &key, path, 0, 0); if (ret < 0) goto out; if (ret) { /* * An empty symlink inode. Can happen in rare error paths when * creating a symlink (transaction committed before the inode * eviction handler removed the symlink inode items and a crash * happened in between or the subvol was snapshoted in between). * Print an informative message to dmesg/syslog so that the user * can delete the symlink. */ btrfs_err(root->fs_info, "Found empty symlink inode %llu at root %llu", ino, root->root_key.objectid); ret = -EIO; goto out; } ei = btrfs_item_ptr(path->nodes[0], path->slots[0], struct btrfs_file_extent_item); type = btrfs_file_extent_type(path->nodes[0], ei); if (unlikely(type != BTRFS_FILE_EXTENT_INLINE)) { ret = -EUCLEAN; btrfs_crit(root->fs_info, "send: found symlink extent that is not inline, ino %llu root %llu extent type %d", ino, btrfs_root_id(root), type); goto out; } compression = btrfs_file_extent_compression(path->nodes[0], ei); if (unlikely(compression != BTRFS_COMPRESS_NONE)) { ret = -EUCLEAN; btrfs_crit(root->fs_info, "send: found symlink extent with compression, ino %llu root %llu compression type %d", ino, btrfs_root_id(root), compression); goto out; } off = btrfs_file_extent_inline_start(ei); len = btrfs_file_extent_ram_bytes(path->nodes[0], ei); ret = fs_path_add_from_extent_buffer(dest, path->nodes[0], off, len); out: btrfs_free_path(path); return ret; } /* * Helper function to generate a file name that is unique in the root of * send_root and parent_root. This is used to generate names for orphan inodes. */ static int gen_unique_name(struct send_ctx *sctx, u64 ino, u64 gen, struct fs_path *dest) { int ret = 0; struct btrfs_path *path; struct btrfs_dir_item *di; char tmp[64]; int len; u64 idx = 0; path = alloc_path_for_send(); if (!path) return -ENOMEM; while (1) { struct fscrypt_str tmp_name; len = snprintf(tmp, sizeof(tmp), "o%llu-%llu-%llu", ino, gen, idx); ASSERT(len < sizeof(tmp)); tmp_name.name = tmp; tmp_name.len = strlen(tmp); di = btrfs_lookup_dir_item(NULL, sctx->send_root, path, BTRFS_FIRST_FREE_OBJECTID, &tmp_name, 0); btrfs_release_path(path); if (IS_ERR(di)) { ret = PTR_ERR(di); goto out; } if (di) { /* not unique, try again */ idx++; continue; } if (!sctx->parent_root) { /* unique */ ret = 0; break; } di = btrfs_lookup_dir_item(NULL, sctx->parent_root, path, BTRFS_FIRST_FREE_OBJECTID, &tmp_name, 0); btrfs_release_path(path); if (IS_ERR(di)) { ret = PTR_ERR(di); goto out; } if (di) { /* not unique, try again */ idx++; continue; } /* unique */ break; } ret = fs_path_add(dest, tmp, strlen(tmp)); out: btrfs_free_path(path); return ret; } enum inode_state { inode_state_no_change, inode_state_will_create, inode_state_did_create, inode_state_will_delete, inode_state_did_delete, }; static int get_cur_inode_state(struct send_ctx *sctx, u64 ino, u64 gen, u64 *send_gen, u64 *parent_gen) { int ret; int left_ret; int right_ret; u64 left_gen; u64 right_gen = 0; struct btrfs_inode_info info; ret = get_inode_info(sctx->send_root, ino, &info); if (ret < 0 && ret != -ENOENT) goto out; left_ret = (info.nlink == 0) ? -ENOENT : ret; left_gen = info.gen; if (send_gen) *send_gen = ((left_ret == -ENOENT) ? 0 : info.gen); if (!sctx->parent_root) { right_ret = -ENOENT; } else { ret = get_inode_info(sctx->parent_root, ino, &info); if (ret < 0 && ret != -ENOENT) goto out; right_ret = (info.nlink == 0) ? -ENOENT : ret; right_gen = info.gen; if (parent_gen) *parent_gen = ((right_ret == -ENOENT) ? 0 : info.gen); } if (!left_ret && !right_ret) { if (left_gen == gen && right_gen == gen) { ret = inode_state_no_change; } else if (left_gen == gen) { if (ino < sctx->send_progress) ret = inode_state_did_create; else ret = inode_state_will_create; } else if (right_gen == gen) { if (ino < sctx->send_progress) ret = inode_state_did_delete; else ret = inode_state_will_delete; } else { ret = -ENOENT; } } else if (!left_ret) { if (left_gen == gen) { if (ino < sctx->send_progress) ret = inode_state_did_create; else ret = inode_state_will_create; } else { ret = -ENOENT; } } else if (!right_ret) { if (right_gen == gen) { if (ino < sctx->send_progress) ret = inode_state_did_delete; else ret = inode_state_will_delete; } else { ret = -ENOENT; } } else { ret = -ENOENT; } out: return ret; } static int is_inode_existent(struct send_ctx *sctx, u64 ino, u64 gen, u64 *send_gen, u64 *parent_gen) { int ret; if (ino == BTRFS_FIRST_FREE_OBJECTID) return 1; ret = get_cur_inode_state(sctx, ino, gen, send_gen, parent_gen); if (ret < 0) goto out; if (ret == inode_state_no_change || ret == inode_state_did_create || ret == inode_state_will_delete) ret = 1; else ret = 0; out: return ret; } /* * Helper function to lookup a dir item in a dir. */ static int lookup_dir_item_inode(struct btrfs_root *root, u64 dir, const char *name, int name_len, u64 *found_inode) { int ret = 0; struct btrfs_dir_item *di; struct btrfs_key key; struct btrfs_path *path; struct fscrypt_str name_str = FSTR_INIT((char *)name, name_len); path = alloc_path_for_send(); if (!path) return -ENOMEM; di = btrfs_lookup_dir_item(NULL, root, path, dir, &name_str, 0); if (IS_ERR_OR_NULL(di)) { ret = di ? PTR_ERR(di) : -ENOENT; goto out; } btrfs_dir_item_key_to_cpu(path->nodes[0], di, &key); if (key.type == BTRFS_ROOT_ITEM_KEY) { ret = -ENOENT; goto out; } *found_inode = key.objectid; out: btrfs_free_path(path); return ret; } /* * Looks up the first btrfs_inode_ref of a given ino. It returns the parent dir, * generation of the parent dir and the name of the dir entry. */ static int get_first_ref(struct btrfs_root *root, u64 ino, u64 *dir, u64 *dir_gen, struct fs_path *name) { int ret; struct btrfs_key key; struct btrfs_key found_key; struct btrfs_path *path; int len; u64 parent_dir; path = alloc_path_for_send(); if (!path) return -ENOMEM; key.objectid = ino; key.type = BTRFS_INODE_REF_KEY; key.offset = 0; ret = btrfs_search_slot_for_read(root, &key, path, 1, 0); if (ret < 0) goto out; if (!ret) btrfs_item_key_to_cpu(path->nodes[0], &found_key, path->slots[0]); if (ret || found_key.objectid != ino || (found_key.type != BTRFS_INODE_REF_KEY && found_key.type != BTRFS_INODE_EXTREF_KEY)) { ret = -ENOENT; goto out; } if (found_key.type == BTRFS_INODE_REF_KEY) { struct btrfs_inode_ref *iref; iref = btrfs_item_ptr(path->nodes[0], path->slots[0], struct btrfs_inode_ref); len = btrfs_inode_ref_name_len(path->nodes[0], iref); ret = fs_path_add_from_extent_buffer(name, path->nodes[0], (unsigned long)(iref + 1), len); parent_dir = found_key.offset; } else { struct btrfs_inode_extref *extref; extref = btrfs_item_ptr(path->nodes[0], path->slots[0], struct btrfs_inode_extref); len = btrfs_inode_extref_name_len(path->nodes[0], extref); ret = fs_path_add_from_extent_buffer(name, path->nodes[0], (unsigned long)&extref->name, len); parent_dir = btrfs_inode_extref_parent(path->nodes[0], extref); } if (ret < 0) goto out; btrfs_release_path(path); if (dir_gen) { ret = get_inode_gen(root, parent_dir, dir_gen); if (ret < 0) goto out; } *dir = parent_dir; out: btrfs_free_path(path); return ret; } static int is_first_ref(struct btrfs_root *root, u64 ino, u64 dir, const char *name, int name_len) { int ret; struct fs_path *tmp_name; u64 tmp_dir; tmp_name = fs_path_alloc(); if (!tmp_name) return -ENOMEM; ret = get_first_ref(root, ino, &tmp_dir, NULL, tmp_name); if (ret < 0) goto out; if (dir != tmp_dir || name_len != fs_path_len(tmp_name)) { ret = 0; goto out; } ret = !memcmp(tmp_name->start, name, name_len); out: fs_path_free(tmp_name); return ret; } /* * Used by process_recorded_refs to determine if a new ref would overwrite an * already existing ref. In case it detects an overwrite, it returns the * inode/gen in who_ino/who_gen. * When an overwrite is detected, process_recorded_refs does proper orphanizing * to make sure later references to the overwritten inode are possible. * Orphanizing is however only required for the first ref of an inode. * process_recorded_refs does an additional is_first_ref check to see if * orphanizing is really required. */ static int will_overwrite_ref(struct send_ctx *sctx, u64 dir, u64 dir_gen, const char *name, int name_len, u64 *who_ino, u64 *who_gen, u64 *who_mode) { int ret; u64 parent_root_dir_gen; u64 other_inode = 0; struct btrfs_inode_info info; if (!sctx->parent_root) return 0; ret = is_inode_existent(sctx, dir, dir_gen, NULL, &parent_root_dir_gen); if (ret <= 0) return 0; /* * If we have a parent root we need to verify that the parent dir was * not deleted and then re-created, if it was then we have no overwrite * and we can just unlink this entry. * * @parent_root_dir_gen was set to 0 if the inode does not exist in the * parent root. */ if (sctx->parent_root && dir != BTRFS_FIRST_FREE_OBJECTID && parent_root_dir_gen != dir_gen) return 0; ret = lookup_dir_item_inode(sctx->parent_root, dir, name, name_len, &other_inode); if (ret == -ENOENT) return 0; else if (ret < 0) return ret; /* * Check if the overwritten ref was already processed. If yes, the ref * was already unlinked/moved, so we can safely assume that we will not * overwrite anything at this point in time. */ if (other_inode > sctx->send_progress || is_waiting_for_move(sctx, other_inode)) { ret = get_inode_info(sctx->parent_root, other_inode, &info); if (ret < 0) return ret; *who_ino = other_inode; *who_gen = info.gen; *who_mode = info.mode; return 1; } return 0; } /* * Checks if the ref was overwritten by an already processed inode. This is * used by __get_cur_name_and_parent to find out if the ref was orphanized and * thus the orphan name needs be used. * process_recorded_refs also uses it to avoid unlinking of refs that were * overwritten. */ static int did_overwrite_ref(struct send_ctx *sctx, u64 dir, u64 dir_gen, u64 ino, u64 ino_gen, const char *name, int name_len) { int ret; u64 ow_inode; u64 ow_gen = 0; u64 send_root_dir_gen; if (!sctx->parent_root) return 0; ret = is_inode_existent(sctx, dir, dir_gen, &send_root_dir_gen, NULL); if (ret <= 0) return ret; /* * @send_root_dir_gen was set to 0 if the inode does not exist in the * send root. */ if (dir != BTRFS_FIRST_FREE_OBJECTID && send_root_dir_gen != dir_gen) return 0; /* check if the ref was overwritten by another ref */ ret = lookup_dir_item_inode(sctx->send_root, dir, name, name_len, &ow_inode); if (ret == -ENOENT) { /* was never and will never be overwritten */ return 0; } else if (ret < 0) { return ret; } if (ow_inode == ino) { ret = get_inode_gen(sctx->send_root, ow_inode, &ow_gen); if (ret < 0) return ret; /* It's the same inode, so no overwrite happened. */ if (ow_gen == ino_gen) return 0; } /* * We know that it is or will be overwritten. Check this now. * The current inode being processed might have been the one that caused * inode 'ino' to be orphanized, therefore check if ow_inode matches * the current inode being processed. */ if (ow_inode < sctx->send_progress) return 1; if (ino != sctx->cur_ino && ow_inode == sctx->cur_ino) { if (ow_gen == 0) { ret = get_inode_gen(sctx->send_root, ow_inode, &ow_gen); if (ret < 0) return ret; } if (ow_gen == sctx->cur_inode_gen) return 1; } return 0; } /* * Same as did_overwrite_ref, but also checks if it is the first ref of an inode * that got overwritten. This is used by process_recorded_refs to determine * if it has to use the path as returned by get_cur_path or the orphan name. */ static int did_overwrite_first_ref(struct send_ctx *sctx, u64 ino, u64 gen) { int ret = 0; struct fs_path *name = NULL; u64 dir; u64 dir_gen; if (!sctx->parent_root) goto out; name = fs_path_alloc(); if (!name) return -ENOMEM; ret = get_first_ref(sctx->parent_root, ino, &dir, &dir_gen, name); if (ret < 0) goto out; ret = did_overwrite_ref(sctx, dir, dir_gen, ino, gen, name->start, fs_path_len(name)); out: fs_path_free(name); return ret; } static inline struct name_cache_entry *name_cache_search(struct send_ctx *sctx, u64 ino, u64 gen) { struct btrfs_lru_cache_entry *entry; entry = btrfs_lru_cache_lookup(&sctx->name_cache, ino, gen); if (!entry) return NULL; return container_of(entry, struct name_cache_entry, entry); } /* * Used by get_cur_path for each ref up to the root. * Returns 0 if it succeeded. * Returns 1 if the inode is not existent or got overwritten. In that case, the * name is an orphan name. This instructs get_cur_path to stop iterating. If 1 * is returned, parent_ino/parent_gen are not guaranteed to be valid. * Returns <0 in case of error. */ static int __get_cur_name_and_parent(struct send_ctx *sctx, u64 ino, u64 gen, u64 *parent_ino, u64 *parent_gen, struct fs_path *dest) { int ret; int nce_ret; struct name_cache_entry *nce; /* * First check if we already did a call to this function with the same * ino/gen. If yes, check if the cache entry is still up-to-date. If yes * return the cached result. */ nce = name_cache_search(sctx, ino, gen); if (nce) { if (ino < sctx->send_progress && nce->need_later_update) { btrfs_lru_cache_remove(&sctx->name_cache, &nce->entry); nce = NULL; } else { *parent_ino = nce->parent_ino; *parent_gen = nce->parent_gen; ret = fs_path_add(dest, nce->name, nce->name_len); if (ret < 0) goto out; ret = nce->ret; goto out; } } /* * If the inode is not existent yet, add the orphan name and return 1. * This should only happen for the parent dir that we determine in * record_new_ref_if_needed(). */ ret = is_inode_existent(sctx, ino, gen, NULL, NULL); if (ret < 0) goto out; if (!ret) { ret = gen_unique_name(sctx, ino, gen, dest); if (ret < 0) goto out; ret = 1; goto out_cache; } /* * Depending on whether the inode was already processed or not, use * send_root or parent_root for ref lookup. */ if (ino < sctx->send_progress) ret = get_first_ref(sctx->send_root, ino, parent_ino, parent_gen, dest); else ret = get_first_ref(sctx->parent_root, ino, parent_ino, parent_gen, dest); if (ret < 0) goto out; /* * Check if the ref was overwritten by an inode's ref that was processed * earlier. If yes, treat as orphan and return 1. */ ret = did_overwrite_ref(sctx, *parent_ino, *parent_gen, ino, gen, dest->start, dest->end - dest->start); if (ret < 0) goto out; if (ret) { fs_path_reset(dest); ret = gen_unique_name(sctx, ino, gen, dest); if (ret < 0) goto out; ret = 1; } out_cache: /* * Store the result of the lookup in the name cache. */ nce = kmalloc(sizeof(*nce) + fs_path_len(dest) + 1, GFP_KERNEL); if (!nce) { ret = -ENOMEM; goto out; } nce->entry.key = ino; nce->entry.gen = gen; nce->parent_ino = *parent_ino; nce->parent_gen = *parent_gen; nce->name_len = fs_path_len(dest); nce->ret = ret; strcpy(nce->name, dest->start); if (ino < sctx->send_progress) nce->need_later_update = 0; else nce->need_later_update = 1; nce_ret = btrfs_lru_cache_store(&sctx->name_cache, &nce->entry, GFP_KERNEL); if (nce_ret < 0) { kfree(nce); ret = nce_ret; } out: return ret; } /* * Magic happens here. This function returns the first ref to an inode as it * would look like while receiving the stream at this point in time. * We walk the path up to the root. For every inode in between, we check if it * was already processed/sent. If yes, we continue with the parent as found * in send_root. If not, we continue with the parent as found in parent_root. * If we encounter an inode that was deleted at this point in time, we use the * inodes "orphan" name instead of the real name and stop. Same with new inodes * that were not created yet and overwritten inodes/refs. * * When do we have orphan inodes: * 1. When an inode is freshly created and thus no valid refs are available yet * 2. When a directory lost all it's refs (deleted) but still has dir items * inside which were not processed yet (pending for move/delete). If anyone * tried to get the path to the dir items, it would get a path inside that * orphan directory. * 3. When an inode is moved around or gets new links, it may overwrite the ref * of an unprocessed inode. If in that case the first ref would be * overwritten, the overwritten inode gets "orphanized". Later when we * process this overwritten inode, it is restored at a new place by moving * the orphan inode. * * sctx->send_progress tells this function at which point in time receiving * would be. */ static int get_cur_path(struct send_ctx *sctx, u64 ino, u64 gen, struct fs_path *dest) { int ret = 0; struct fs_path *name = NULL; u64 parent_inode = 0; u64 parent_gen = 0; int stop = 0; name = fs_path_alloc(); if (!name) { ret = -ENOMEM; goto out; } dest->reversed = 1; fs_path_reset(dest); while (!stop && ino != BTRFS_FIRST_FREE_OBJECTID) { struct waiting_dir_move *wdm; fs_path_reset(name); if (is_waiting_for_rm(sctx, ino, gen)) { ret = gen_unique_name(sctx, ino, gen, name); if (ret < 0) goto out; ret = fs_path_add_path(dest, name); break; } wdm = get_waiting_dir_move(sctx, ino); if (wdm && wdm->orphanized) { ret = gen_unique_name(sctx, ino, gen, name); stop = 1; } else if (wdm) { ret = get_first_ref(sctx->parent_root, ino, &parent_inode, &parent_gen, name); } else { ret = __get_cur_name_and_parent(sctx, ino, gen, &parent_inode, &parent_gen, name); if (ret) stop = 1; } if (ret < 0) goto out; ret = fs_path_add_path(dest, name); if (ret < 0) goto out; ino = parent_inode; gen = parent_gen; } out: fs_path_free(name); if (!ret) fs_path_unreverse(dest); return ret; } /* * Sends a BTRFS_SEND_C_SUBVOL command/item to userspace */ static int send_subvol_begin(struct send_ctx *sctx) { int ret; struct btrfs_root *send_root = sctx->send_root; struct btrfs_root *parent_root = sctx->parent_root; struct btrfs_path *path; struct btrfs_key key; struct btrfs_root_ref *ref; struct extent_buffer *leaf; char *name = NULL; int namelen; path = btrfs_alloc_path(); if (!path) return -ENOMEM; name = kmalloc(BTRFS_PATH_NAME_MAX, GFP_KERNEL); if (!name) { btrfs_free_path(path); return -ENOMEM; } key.objectid = send_root->root_key.objectid; key.type = BTRFS_ROOT_BACKREF_KEY; key.offset = 0; ret = btrfs_search_slot_for_read(send_root->fs_info->tree_root, &key, path, 1, 0); if (ret < 0) goto out; if (ret) { ret = -ENOENT; goto out; } leaf = path->nodes[0]; btrfs_item_key_to_cpu(leaf, &key, path->slots[0]); if (key.type != BTRFS_ROOT_BACKREF_KEY || key.objectid != send_root->root_key.objectid) { ret = -ENOENT; goto out; } ref = btrfs_item_ptr(leaf, path->slots[0], struct btrfs_root_ref); namelen = btrfs_root_ref_name_len(leaf, ref); read_extent_buffer(leaf, name, (unsigned long)(ref + 1), namelen); btrfs_release_path(path); if (parent_root) { ret = begin_cmd(sctx, BTRFS_SEND_C_SNAPSHOT); if (ret < 0) goto out; } else { ret = begin_cmd(sctx, BTRFS_SEND_C_SUBVOL); if (ret < 0) goto out; } TLV_PUT_STRING(sctx, BTRFS_SEND_A_PATH, name, namelen); if (!btrfs_is_empty_uuid(sctx->send_root->root_item.received_uuid)) TLV_PUT_UUID(sctx, BTRFS_SEND_A_UUID, sctx->send_root->root_item.received_uuid); else TLV_PUT_UUID(sctx, BTRFS_SEND_A_UUID, sctx->send_root->root_item.uuid); TLV_PUT_U64(sctx, BTRFS_SEND_A_CTRANSID, btrfs_root_ctransid(&sctx->send_root->root_item)); if (parent_root) { if (!btrfs_is_empty_uuid(parent_root->root_item.received_uuid)) TLV_PUT_UUID(sctx, BTRFS_SEND_A_CLONE_UUID, parent_root->root_item.received_uuid); else TLV_PUT_UUID(sctx, BTRFS_SEND_A_CLONE_UUID, parent_root->root_item.uuid); TLV_PUT_U64(sctx, BTRFS_SEND_A_CLONE_CTRANSID, btrfs_root_ctransid(&sctx->parent_root->root_item)); } ret = send_cmd(sctx); tlv_put_failure: out: btrfs_free_path(path); kfree(name); return ret; } static int send_truncate(struct send_ctx *sctx, u64 ino, u64 gen, u64 size) { struct btrfs_fs_info *fs_info = sctx->send_root->fs_info; int ret = 0; struct fs_path *p; btrfs_debug(fs_info, "send_truncate %llu size=%llu", ino, size); p = fs_path_alloc(); if (!p) return -ENOMEM; ret = begin_cmd(sctx, BTRFS_SEND_C_TRUNCATE); if (ret < 0) goto out; ret = get_cur_path(sctx, ino, gen, p); if (ret < 0) goto out; TLV_PUT_PATH(sctx, BTRFS_SEND_A_PATH, p); TLV_PUT_U64(sctx, BTRFS_SEND_A_SIZE, size); ret = send_cmd(sctx); tlv_put_failure: out: fs_path_free(p); return ret; } static int send_chmod(struct send_ctx *sctx, u64 ino, u64 gen, u64 mode) { struct btrfs_fs_info *fs_info = sctx->send_root->fs_info; int ret = 0; struct fs_path *p; btrfs_debug(fs_info, "send_chmod %llu mode=%llu", ino, mode); p = fs_path_alloc(); if (!p) return -ENOMEM; ret = begin_cmd(sctx, BTRFS_SEND_C_CHMOD); if (ret < 0) goto out; ret = get_cur_path(sctx, ino, gen, p); if (ret < 0) goto out; TLV_PUT_PATH(sctx, BTRFS_SEND_A_PATH, p); TLV_PUT_U64(sctx, BTRFS_SEND_A_MODE, mode & 07777); ret = send_cmd(sctx); tlv_put_failure: out: fs_path_free(p); return ret; } static int send_fileattr(struct send_ctx *sctx, u64 ino, u64 gen, u64 fileattr) { struct btrfs_fs_info *fs_info = sctx->send_root->fs_info; int ret = 0; struct fs_path *p; if (sctx->proto < 2) return 0; btrfs_debug(fs_info, "send_fileattr %llu fileattr=%llu", ino, fileattr); p = fs_path_alloc(); if (!p) return -ENOMEM; ret = begin_cmd(sctx, BTRFS_SEND_C_FILEATTR); if (ret < 0) goto out; ret = get_cur_path(sctx, ino, gen, p); if (ret < 0) goto out; TLV_PUT_PATH(sctx, BTRFS_SEND_A_PATH, p); TLV_PUT_U64(sctx, BTRFS_SEND_A_FILEATTR, fileattr); ret = send_cmd(sctx); tlv_put_failure: out: fs_path_free(p); return ret; } static int send_chown(struct send_ctx *sctx, u64 ino, u64 gen, u64 uid, u64 gid) { struct btrfs_fs_info *fs_info = sctx->send_root->fs_info; int ret = 0; struct fs_path *p; btrfs_debug(fs_info, "send_chown %llu uid=%llu, gid=%llu", ino, uid, gid); p = fs_path_alloc(); if (!p) return -ENOMEM; ret = begin_cmd(sctx, BTRFS_SEND_C_CHOWN); if (ret < 0) goto out; ret = get_cur_path(sctx, ino, gen, p); if (ret < 0) goto out; TLV_PUT_PATH(sctx, BTRFS_SEND_A_PATH, p); TLV_PUT_U64(sctx, BTRFS_SEND_A_UID, uid); TLV_PUT_U64(sctx, BTRFS_SEND_A_GID, gid); ret = send_cmd(sctx); tlv_put_failure: out: fs_path_free(p); return ret; } static int send_utimes(struct send_ctx *sctx, u64 ino, u64 gen) { struct btrfs_fs_info *fs_info = sctx->send_root->fs_info; int ret = 0; struct fs_path *p = NULL; struct btrfs_inode_item *ii; struct btrfs_path *path = NULL; struct extent_buffer *eb; struct btrfs_key key; int slot; btrfs_debug(fs_info, "send_utimes %llu", ino); p = fs_path_alloc(); if (!p) return -ENOMEM; path = alloc_path_for_send(); if (!path) { ret = -ENOMEM; goto out; } key.objectid = ino; key.type = BTRFS_INODE_ITEM_KEY; key.offset = 0; ret = btrfs_search_slot(NULL, sctx->send_root, &key, path, 0, 0); if (ret > 0) ret = -ENOENT; if (ret < 0) goto out; eb = path->nodes[0]; slot = path->slots[0]; ii = btrfs_item_ptr(eb, slot, struct btrfs_inode_item); ret = begin_cmd(sctx, BTRFS_SEND_C_UTIMES); if (ret < 0) goto out; ret = get_cur_path(sctx, ino, gen, p); if (ret < 0) goto out; TLV_PUT_PATH(sctx, BTRFS_SEND_A_PATH, p); TLV_PUT_BTRFS_TIMESPEC(sctx, BTRFS_SEND_A_ATIME, eb, &ii->atime); TLV_PUT_BTRFS_TIMESPEC(sctx, BTRFS_SEND_A_MTIME, eb, &ii->mtime); TLV_PUT_BTRFS_TIMESPEC(sctx, BTRFS_SEND_A_CTIME, eb, &ii->ctime); if (sctx->proto >= 2) TLV_PUT_BTRFS_TIMESPEC(sctx, BTRFS_SEND_A_OTIME, eb, &ii->otime); ret = send_cmd(sctx); tlv_put_failure: out: fs_path_free(p); btrfs_free_path(path); return ret; } /* * If the cache is full, we can't remove entries from it and do a call to * send_utimes() for each respective inode, because we might be finishing * processing an inode that is a directory and it just got renamed, and existing * entries in the cache may refer to inodes that have the directory in their * full path - in which case we would generate outdated paths (pre-rename) * for the inodes that the cache entries point to. Instead of prunning the * cache when inserting, do it after we finish processing each inode at * finish_inode_if_needed(). */ static int cache_dir_utimes(struct send_ctx *sctx, u64 dir, u64 gen) { struct btrfs_lru_cache_entry *entry; int ret; entry = btrfs_lru_cache_lookup(&sctx->dir_utimes_cache, dir, gen); if (entry != NULL) return 0; /* Caching is optional, don't fail if we can't allocate memory. */ entry = kmalloc(sizeof(*entry), GFP_KERNEL); if (!entry) return send_utimes(sctx, dir, gen); entry->key = dir; entry->gen = gen; ret = btrfs_lru_cache_store(&sctx->dir_utimes_cache, entry, GFP_KERNEL); ASSERT(ret != -EEXIST); if (ret) { kfree(entry); return send_utimes(sctx, dir, gen); } return 0; } static int trim_dir_utimes_cache(struct send_ctx *sctx) { while (btrfs_lru_cache_size(&sctx->dir_utimes_cache) > SEND_MAX_DIR_UTIMES_CACHE_SIZE) { struct btrfs_lru_cache_entry *lru; int ret; lru = btrfs_lru_cache_lru_entry(&sctx->dir_utimes_cache); ASSERT(lru != NULL); ret = send_utimes(sctx, lru->key, lru->gen); if (ret) return ret; btrfs_lru_cache_remove(&sctx->dir_utimes_cache, lru); } return 0; } /* * Sends a BTRFS_SEND_C_MKXXX or SYMLINK command to user space. We don't have * a valid path yet because we did not process the refs yet. So, the inode * is created as orphan. */ static int send_create_inode(struct send_ctx *sctx, u64 ino) { struct btrfs_fs_info *fs_info = sctx->send_root->fs_info; int ret = 0; struct fs_path *p; int cmd; struct btrfs_inode_info info; u64 gen; u64 mode; u64 rdev; btrfs_debug(fs_info, "send_create_inode %llu", ino); p = fs_path_alloc(); if (!p) return -ENOMEM; if (ino != sctx->cur_ino) { ret = get_inode_info(sctx->send_root, ino, &info); if (ret < 0) goto out; gen = info.gen; mode = info.mode; rdev = info.rdev; } else { gen = sctx->cur_inode_gen; mode = sctx->cur_inode_mode; rdev = sctx->cur_inode_rdev; } if (S_ISREG(mode)) { cmd = BTRFS_SEND_C_MKFILE; } else if (S_ISDIR(mode)) { cmd = BTRFS_SEND_C_MKDIR; } else if (S_ISLNK(mode)) { cmd = BTRFS_SEND_C_SYMLINK; } else if (S_ISCHR(mode) || S_ISBLK(mode)) { cmd = BTRFS_SEND_C_MKNOD; } else if (S_ISFIFO(mode)) { cmd = BTRFS_SEND_C_MKFIFO; } else if (S_ISSOCK(mode)) { cmd = BTRFS_SEND_C_MKSOCK; } else { btrfs_warn(sctx->send_root->fs_info, "unexpected inode type %o", (int)(mode & S_IFMT)); ret = -EOPNOTSUPP; goto out; } ret = begin_cmd(sctx, cmd); if (ret < 0) goto out; ret = gen_unique_name(sctx, ino, gen, p); if (ret < 0) goto out; TLV_PUT_PATH(sctx, BTRFS_SEND_A_PATH, p); TLV_PUT_U64(sctx, BTRFS_SEND_A_INO, ino); if (S_ISLNK(mode)) { fs_path_reset(p); ret = read_symlink(sctx->send_root, ino, p); if (ret < 0) goto out; TLV_PUT_PATH(sctx, BTRFS_SEND_A_PATH_LINK, p); } else if (S_ISCHR(mode) || S_ISBLK(mode) || S_ISFIFO(mode) || S_ISSOCK(mode)) { TLV_PUT_U64(sctx, BTRFS_SEND_A_RDEV, new_encode_dev(rdev)); TLV_PUT_U64(sctx, BTRFS_SEND_A_MODE, mode); } ret = send_cmd(sctx); if (ret < 0) goto out; tlv_put_failure: out: fs_path_free(p); return ret; } static void cache_dir_created(struct send_ctx *sctx, u64 dir) { struct btrfs_lru_cache_entry *entry; int ret; /* Caching is optional, ignore any failures. */ entry = kmalloc(sizeof(*entry), GFP_KERNEL); if (!entry) return; entry->key = dir; entry->gen = 0; ret = btrfs_lru_cache_store(&sctx->dir_created_cache, entry, GFP_KERNEL); if (ret < 0) kfree(entry); } /* * We need some special handling for inodes that get processed before the parent * directory got created. See process_recorded_refs for details. * This function does the check if we already created the dir out of order. */ static int did_create_dir(struct send_ctx *sctx, u64 dir) { int ret = 0; int iter_ret = 0; struct btrfs_path *path = NULL; struct btrfs_key key; struct btrfs_key found_key; struct btrfs_key di_key; struct btrfs_dir_item *di; if (btrfs_lru_cache_lookup(&sctx->dir_created_cache, dir, 0)) return 1; path = alloc_path_for_send(); if (!path) return -ENOMEM; key.objectid = dir; key.type = BTRFS_DIR_INDEX_KEY; key.offset = 0; btrfs_for_each_slot(sctx->send_root, &key, &found_key, path, iter_ret) { struct extent_buffer *eb = path->nodes[0]; if (found_key.objectid != key.objectid || found_key.type != key.type) { ret = 0; break; } di = btrfs_item_ptr(eb, path->slots[0], struct btrfs_dir_item); btrfs_dir_item_key_to_cpu(eb, di, &di_key); if (di_key.type != BTRFS_ROOT_ITEM_KEY && di_key.objectid < sctx->send_progress) { ret = 1; cache_dir_created(sctx, dir); break; } } /* Catch error found during iteration */ if (iter_ret < 0) ret = iter_ret; btrfs_free_path(path); return ret; } /* * Only creates the inode if it is: * 1. Not a directory * 2. Or a directory which was not created already due to out of order * directories. See did_create_dir and process_recorded_refs for details. */ static int send_create_inode_if_needed(struct send_ctx *sctx) { int ret; if (S_ISDIR(sctx->cur_inode_mode)) { ret = did_create_dir(sctx, sctx->cur_ino); if (ret < 0) return ret; else if (ret > 0) return 0; } ret = send_create_inode(sctx, sctx->cur_ino); if (ret == 0 && S_ISDIR(sctx->cur_inode_mode)) cache_dir_created(sctx, sctx->cur_ino); return ret; } struct recorded_ref { struct list_head list; char *name; struct fs_path *full_path; u64 dir; u64 dir_gen; int name_len; struct rb_node node; struct rb_root *root; }; static struct recorded_ref *recorded_ref_alloc(void) { struct recorded_ref *ref; ref = kzalloc(sizeof(*ref), GFP_KERNEL); if (!ref) return NULL; RB_CLEAR_NODE(&ref->node); INIT_LIST_HEAD(&ref->list); return ref; } static void recorded_ref_free(struct recorded_ref *ref) { if (!ref) return; if (!RB_EMPTY_NODE(&ref->node)) rb_erase(&ref->node, ref->root); list_del(&ref->list); fs_path_free(ref->full_path); kfree(ref); } static void set_ref_path(struct recorded_ref *ref, struct fs_path *path) { ref->full_path = path; ref->name = (char *)kbasename(ref->full_path->start); ref->name_len = ref->full_path->end - ref->name; } static int dup_ref(struct recorded_ref *ref, struct list_head *list) { struct recorded_ref *new; new = recorded_ref_alloc(); if (!new) return -ENOMEM; new->dir = ref->dir; new->dir_gen = ref->dir_gen; list_add_tail(&new->list, list); return 0; } static void __free_recorded_refs(struct list_head *head) { struct recorded_ref *cur; while (!list_empty(head)) { cur = list_entry(head->next, struct recorded_ref, list); recorded_ref_free(cur); } } static void free_recorded_refs(struct send_ctx *sctx) { __free_recorded_refs(&sctx->new_refs); __free_recorded_refs(&sctx->deleted_refs); } /* * Renames/moves a file/dir to its orphan name. Used when the first * ref of an unprocessed inode gets overwritten and for all non empty * directories. */ static int orphanize_inode(struct send_ctx *sctx, u64 ino, u64 gen, struct fs_path *path) { int ret; struct fs_path *orphan; orphan = fs_path_alloc(); if (!orphan) return -ENOMEM; ret = gen_unique_name(sctx, ino, gen, orphan); if (ret < 0) goto out; ret = send_rename(sctx, path, orphan); out: fs_path_free(orphan); return ret; } static struct orphan_dir_info *add_orphan_dir_info(struct send_ctx *sctx, u64 dir_ino, u64 dir_gen) { struct rb_node **p = &sctx->orphan_dirs.rb_node; struct rb_node *parent = NULL; struct orphan_dir_info *entry, *odi; while (*p) { parent = *p; entry = rb_entry(parent, struct orphan_dir_info, node); if (dir_ino < entry->ino) p = &(*p)->rb_left; else if (dir_ino > entry->ino) p = &(*p)->rb_right; else if (dir_gen < entry->gen) p = &(*p)->rb_left; else if (dir_gen > entry->gen) p = &(*p)->rb_right; else return entry; } odi = kmalloc(sizeof(*odi), GFP_KERNEL); if (!odi) return ERR_PTR(-ENOMEM); odi->ino = dir_ino; odi->gen = dir_gen; odi->last_dir_index_offset = 0; odi->dir_high_seq_ino = 0; rb_link_node(&odi->node, parent, p); rb_insert_color(&odi->node, &sctx->orphan_dirs); return odi; } static struct orphan_dir_info *get_orphan_dir_info(struct send_ctx *sctx, u64 dir_ino, u64 gen) { struct rb_node *n = sctx->orphan_dirs.rb_node; struct orphan_dir_info *entry; while (n) { entry = rb_entry(n, struct orphan_dir_info, node); if (dir_ino < entry->ino) n = n->rb_left; else if (dir_ino > entry->ino) n = n->rb_right; else if (gen < entry->gen) n = n->rb_left; else if (gen > entry->gen) n = n->rb_right; else return entry; } return NULL; } static int is_waiting_for_rm(struct send_ctx *sctx, u64 dir_ino, u64 gen) { struct orphan_dir_info *odi = get_orphan_dir_info(sctx, dir_ino, gen); return odi != NULL; } static void free_orphan_dir_info(struct send_ctx *sctx, struct orphan_dir_info *odi) { if (!odi) return; rb_erase(&odi->node, &sctx->orphan_dirs); kfree(odi); } /* * Returns 1 if a directory can be removed at this point in time. * We check this by iterating all dir items and checking if the inode behind * the dir item was already processed. */ static int can_rmdir(struct send_ctx *sctx, u64 dir, u64 dir_gen) { int ret = 0; int iter_ret = 0; struct btrfs_root *root = sctx->parent_root; struct btrfs_path *path; struct btrfs_key key; struct btrfs_key found_key; struct btrfs_key loc; struct btrfs_dir_item *di; struct orphan_dir_info *odi = NULL; u64 dir_high_seq_ino = 0; u64 last_dir_index_offset = 0; /* * Don't try to rmdir the top/root subvolume dir. */ if (dir == BTRFS_FIRST_FREE_OBJECTID) return 0; odi = get_orphan_dir_info(sctx, dir, dir_gen); if (odi && sctx->cur_ino < odi->dir_high_seq_ino) return 0; path = alloc_path_for_send(); if (!path) return -ENOMEM; if (!odi) { /* * Find the inode number associated with the last dir index * entry. This is very likely the inode with the highest number * of all inodes that have an entry in the directory. We can * then use it to avoid future calls to can_rmdir(), when * processing inodes with a lower number, from having to search * the parent root b+tree for dir index keys. */ key.objectid = dir; key.type = BTRFS_DIR_INDEX_KEY; key.offset = (u64)-1; ret = btrfs_search_slot(NULL, root, &key, path, 0, 0); if (ret < 0) { goto out; } else if (ret > 0) { /* Can't happen, the root is never empty. */ ASSERT(path->slots[0] > 0); if (WARN_ON(path->slots[0] == 0)) { ret = -EUCLEAN; goto out; } path->slots[0]--; } btrfs_item_key_to_cpu(path->nodes[0], &key, path->slots[0]); if (key.objectid != dir || key.type != BTRFS_DIR_INDEX_KEY) { /* No index keys, dir can be removed. */ ret = 1; goto out; } di = btrfs_item_ptr(path->nodes[0], path->slots[0], struct btrfs_dir_item); btrfs_dir_item_key_to_cpu(path->nodes[0], di, &loc); dir_high_seq_ino = loc.objectid; if (sctx->cur_ino < dir_high_seq_ino) { ret = 0; goto out; } btrfs_release_path(path); } key.objectid = dir; key.type = BTRFS_DIR_INDEX_KEY; key.offset = (odi ? odi->last_dir_index_offset : 0); btrfs_for_each_slot(root, &key, &found_key, path, iter_ret) { struct waiting_dir_move *dm; if (found_key.objectid != key.objectid || found_key.type != key.type) break; di = btrfs_item_ptr(path->nodes[0], path->slots[0], struct btrfs_dir_item); btrfs_dir_item_key_to_cpu(path->nodes[0], di, &loc); dir_high_seq_ino = max(dir_high_seq_ino, loc.objectid); last_dir_index_offset = found_key.offset; dm = get_waiting_dir_move(sctx, loc.objectid); if (dm) { dm->rmdir_ino = dir; dm->rmdir_gen = dir_gen; ret = 0; goto out; } if (loc.objectid > sctx->cur_ino) { ret = 0; goto out; } } if (iter_ret < 0) { ret = iter_ret; goto out; } free_orphan_dir_info(sctx, odi); ret = 1; out: btrfs_free_path(path); if (ret) return ret; if (!odi) { odi = add_orphan_dir_info(sctx, dir, dir_gen); if (IS_ERR(odi)) return PTR_ERR(odi); odi->gen = dir_gen; } odi->last_dir_index_offset = last_dir_index_offset; odi->dir_high_seq_ino = max(odi->dir_high_seq_ino, dir_high_seq_ino); return 0; } static int is_waiting_for_move(struct send_ctx *sctx, u64 ino) { struct waiting_dir_move *entry = get_waiting_dir_move(sctx, ino); return entry != NULL; } static int add_waiting_dir_move(struct send_ctx *sctx, u64 ino, bool orphanized) { struct rb_node **p = &sctx->waiting_dir_moves.rb_node; struct rb_node *parent = NULL; struct waiting_dir_move *entry, *dm; dm = kmalloc(sizeof(*dm), GFP_KERNEL); if (!dm) return -ENOMEM; dm->ino = ino; dm->rmdir_ino = 0; dm->rmdir_gen = 0; dm->orphanized = orphanized; while (*p) { parent = *p; entry = rb_entry(parent, struct waiting_dir_move, node); if (ino < entry->ino) { p = &(*p)->rb_left; } else if (ino > entry->ino) { p = &(*p)->rb_right; } else { kfree(dm); return -EEXIST; } } rb_link_node(&dm->node, parent, p); rb_insert_color(&dm->node, &sctx->waiting_dir_moves); return 0; } static struct waiting_dir_move * get_waiting_dir_move(struct send_ctx *sctx, u64 ino) { struct rb_node *n = sctx->waiting_dir_moves.rb_node; struct waiting_dir_move *entry; while (n) { entry = rb_entry(n, struct waiting_dir_move, node); if (ino < entry->ino) n = n->rb_left; else if (ino > entry->ino) n = n->rb_right; else return entry; } return NULL; } static void free_waiting_dir_move(struct send_ctx *sctx, struct waiting_dir_move *dm) { if (!dm) return; rb_erase(&dm->node, &sctx->waiting_dir_moves); kfree(dm); } static int add_pending_dir_move(struct send_ctx *sctx, u64 ino, u64 ino_gen, u64 parent_ino, struct list_head *new_refs, struct list_head *deleted_refs, const bool is_orphan) { struct rb_node **p = &sctx->pending_dir_moves.rb_node; struct rb_node *parent = NULL; struct pending_dir_move *entry = NULL, *pm; struct recorded_ref *cur; int exists = 0; int ret; pm = kmalloc(sizeof(*pm), GFP_KERNEL); if (!pm) return -ENOMEM; pm->parent_ino = parent_ino; pm->ino = ino; pm->gen = ino_gen; INIT_LIST_HEAD(&pm->list); INIT_LIST_HEAD(&pm->update_refs); RB_CLEAR_NODE(&pm->node); while (*p) { parent = *p; entry = rb_entry(parent, struct pending_dir_move, node); if (parent_ino < entry->parent_ino) { p = &(*p)->rb_left; } else if (parent_ino > entry->parent_ino) { p = &(*p)->rb_right; } else { exists = 1; break; } } list_for_each_entry(cur, deleted_refs, list) { ret = dup_ref(cur, &pm->update_refs); if (ret < 0) goto out; } list_for_each_entry(cur, new_refs, list) { ret = dup_ref(cur, &pm->update_refs); if (ret < 0) goto out; } ret = add_waiting_dir_move(sctx, pm->ino, is_orphan); if (ret) goto out; if (exists) { list_add_tail(&pm->list, &entry->list); } else { rb_link_node(&pm->node, parent, p); rb_insert_color(&pm->node, &sctx->pending_dir_moves); } ret = 0; out: if (ret) { __free_recorded_refs(&pm->update_refs); kfree(pm); } return ret; } static struct pending_dir_move *get_pending_dir_moves(struct send_ctx *sctx, u64 parent_ino) { struct rb_node *n = sctx->pending_dir_moves.rb_node; struct pending_dir_move *entry; while (n) { entry = rb_entry(n, struct pending_dir_move, node); if (parent_ino < entry->parent_ino) n = n->rb_left; else if (parent_ino > entry->parent_ino) n = n->rb_right; else return entry; } return NULL; } static int path_loop(struct send_ctx *sctx, struct fs_path *name, u64 ino, u64 gen, u64 *ancestor_ino) { int ret = 0; u64 parent_inode = 0; u64 parent_gen = 0; u64 start_ino = ino; *ancestor_ino = 0; while (ino != BTRFS_FIRST_FREE_OBJECTID) { fs_path_reset(name); if (is_waiting_for_rm(sctx, ino, gen)) break; if (is_waiting_for_move(sctx, ino)) { if (*ancestor_ino == 0) *ancestor_ino = ino; ret = get_first_ref(sctx->parent_root, ino, &parent_inode, &parent_gen, name); } else { ret = __get_cur_name_and_parent(sctx, ino, gen, &parent_inode, &parent_gen, name); if (ret > 0) { ret = 0; break; } } if (ret < 0) break; if (parent_inode == start_ino) { ret = 1; if (*ancestor_ino == 0) *ancestor_ino = ino; break; } ino = parent_inode; gen = parent_gen; } return ret; } static int apply_dir_move(struct send_ctx *sctx, struct pending_dir_move *pm) { struct fs_path *from_path = NULL; struct fs_path *to_path = NULL; struct fs_path *name = NULL; u64 orig_progress = sctx->send_progress; struct recorded_ref *cur; u64 parent_ino, parent_gen; struct waiting_dir_move *dm = NULL; u64 rmdir_ino = 0; u64 rmdir_gen; u64 ancestor; bool is_orphan; int ret; name = fs_path_alloc(); from_path = fs_path_alloc(); if (!name || !from_path) { ret = -ENOMEM; goto out; } dm = get_waiting_dir_move(sctx, pm->ino); ASSERT(dm); rmdir_ino = dm->rmdir_ino; rmdir_gen = dm->rmdir_gen; is_orphan = dm->orphanized; free_waiting_dir_move(sctx, dm); if (is_orphan) { ret = gen_unique_name(sctx, pm->ino, pm->gen, from_path); } else { ret = get_first_ref(sctx->parent_root, pm->ino, &parent_ino, &parent_gen, name); if (ret < 0) goto out; ret = get_cur_path(sctx, parent_ino, parent_gen, from_path); if (ret < 0) goto out; ret = fs_path_add_path(from_path, name); } if (ret < 0) goto out; sctx->send_progress = sctx->cur_ino + 1; ret = path_loop(sctx, name, pm->ino, pm->gen, &ancestor); if (ret < 0) goto out; if (ret) { LIST_HEAD(deleted_refs); ASSERT(ancestor > BTRFS_FIRST_FREE_OBJECTID); ret = add_pending_dir_move(sctx, pm->ino, pm->gen, ancestor, &pm->update_refs, &deleted_refs, is_orphan); if (ret < 0) goto out; if (rmdir_ino) { dm = get_waiting_dir_move(sctx, pm->ino); ASSERT(dm); dm->rmdir_ino = rmdir_ino; dm->rmdir_gen = rmdir_gen; } goto out; } fs_path_reset(name); to_path = name; name = NULL; ret = get_cur_path(sctx, pm->ino, pm->gen, to_path); if (ret < 0) goto out; ret = send_rename(sctx, from_path, to_path); if (ret < 0) goto out; if (rmdir_ino) { struct orphan_dir_info *odi; u64 gen; odi = get_orphan_dir_info(sctx, rmdir_ino, rmdir_gen); if (!odi) { /* already deleted */ goto finish; } gen = odi->gen; ret = can_rmdir(sctx, rmdir_ino, gen); if (ret < 0) goto out; if (!ret) goto finish; name = fs_path_alloc(); if (!name) { ret = -ENOMEM; goto out; } ret = get_cur_path(sctx, rmdir_ino, gen, name); if (ret < 0) goto out; ret = send_rmdir(sctx, name); if (ret < 0) goto out; } finish: ret = cache_dir_utimes(sctx, pm->ino, pm->gen); if (ret < 0) goto out; /* * After rename/move, need to update the utimes of both new parent(s) * and old parent(s). */ list_for_each_entry(cur, &pm->update_refs, list) { /* * The parent inode might have been deleted in the send snapshot */ ret = get_inode_info(sctx->send_root, cur->dir, NULL); if (ret == -ENOENT) { ret = 0; continue; } if (ret < 0) goto out; ret = cache_dir_utimes(sctx, cur->dir, cur->dir_gen); if (ret < 0) goto out; } out: fs_path_free(name); fs_path_free(from_path); fs_path_free(to_path); sctx->send_progress = orig_progress; return ret; } static void free_pending_move(struct send_ctx *sctx, struct pending_dir_move *m) { if (!list_empty(&m->list)) list_del(&m->list); if (!RB_EMPTY_NODE(&m->node)) rb_erase(&m->node, &sctx->pending_dir_moves); __free_recorded_refs(&m->update_refs); kfree(m); } static void tail_append_pending_moves(struct send_ctx *sctx, struct pending_dir_move *moves, struct list_head *stack) { if (list_empty(&moves->list)) { list_add_tail(&moves->list, stack); } else { LIST_HEAD(list); list_splice_init(&moves->list, &list); list_add_tail(&moves->list, stack); list_splice_tail(&list, stack); } if (!RB_EMPTY_NODE(&moves->node)) { rb_erase(&moves->node, &sctx->pending_dir_moves); RB_CLEAR_NODE(&moves->node); } } static int apply_children_dir_moves(struct send_ctx *sctx) { struct pending_dir_move *pm; LIST_HEAD(stack); u64 parent_ino = sctx->cur_ino; int ret = 0; pm = get_pending_dir_moves(sctx, parent_ino); if (!pm) return 0; tail_append_pending_moves(sctx, pm, &stack); while (!list_empty(&stack)) { pm = list_first_entry(&stack, struct pending_dir_move, list); parent_ino = pm->ino; ret = apply_dir_move(sctx, pm); free_pending_move(sctx, pm); if (ret) goto out; pm = get_pending_dir_moves(sctx, parent_ino); if (pm) tail_append_pending_moves(sctx, pm, &stack); } return 0; out: while (!list_empty(&stack)) { pm = list_first_entry(&stack, struct pending_dir_move, list); free_pending_move(sctx, pm); } return ret; } /* * We might need to delay a directory rename even when no ancestor directory * (in the send root) with a higher inode number than ours (sctx->cur_ino) was * renamed. This happens when we rename a directory to the old name (the name * in the parent root) of some other unrelated directory that got its rename * delayed due to some ancestor with higher number that got renamed. * * Example: * * Parent snapshot: * . (ino 256) * |---- a/ (ino 257) * | |---- file (ino 260) * | * |---- b/ (ino 258) * |---- c/ (ino 259) * * Send snapshot: * . (ino 256) * |---- a/ (ino 258) * |---- x/ (ino 259) * |---- y/ (ino 257) * |----- file (ino 260) * * Here we can not rename 258 from 'b' to 'a' without the rename of inode 257 * from 'a' to 'x/y' happening first, which in turn depends on the rename of * inode 259 from 'c' to 'x'. So the order of rename commands the send stream * must issue is: * * 1 - rename 259 from 'c' to 'x' * 2 - rename 257 from 'a' to 'x/y' * 3 - rename 258 from 'b' to 'a' * * Returns 1 if the rename of sctx->cur_ino needs to be delayed, 0 if it can * be done right away and < 0 on error. */ static int wait_for_dest_dir_move(struct send_ctx *sctx, struct recorded_ref *parent_ref, const bool is_orphan) { struct btrfs_fs_info *fs_info = sctx->parent_root->fs_info; struct btrfs_path *path; struct btrfs_key key; struct btrfs_key di_key; struct btrfs_dir_item *di; u64 left_gen; u64 right_gen; int ret = 0; struct waiting_dir_move *wdm; if (RB_EMPTY_ROOT(&sctx->waiting_dir_moves)) return 0; path = alloc_path_for_send(); if (!path) return -ENOMEM; key.objectid = parent_ref->dir; key.type = BTRFS_DIR_ITEM_KEY; key.offset = btrfs_name_hash(parent_ref->name, parent_ref->name_len); ret = btrfs_search_slot(NULL, sctx->parent_root, &key, path, 0, 0); if (ret < 0) { goto out; } else if (ret > 0) { ret = 0; goto out; } di = btrfs_match_dir_item_name(fs_info, path, parent_ref->name, parent_ref->name_len); if (!di) { ret = 0; goto out; } /* * di_key.objectid has the number of the inode that has a dentry in the * parent directory with the same name that sctx->cur_ino is being * renamed to. We need to check if that inode is in the send root as * well and if it is currently marked as an inode with a pending rename, * if it is, we need to delay the rename of sctx->cur_ino as well, so * that it happens after that other inode is renamed. */ btrfs_dir_item_key_to_cpu(path->nodes[0], di, &di_key); if (di_key.type != BTRFS_INODE_ITEM_KEY) { ret = 0; goto out; } ret = get_inode_gen(sctx->parent_root, di_key.objectid, &left_gen); if (ret < 0) goto out; ret = get_inode_gen(sctx->send_root, di_key.objectid, &right_gen); if (ret < 0) { if (ret == -ENOENT) ret = 0; goto out; } /* Different inode, no need to delay the rename of sctx->cur_ino */ if (right_gen != left_gen) { ret = 0; goto out; } wdm = get_waiting_dir_move(sctx, di_key.objectid); if (wdm && !wdm->orphanized) { ret = add_pending_dir_move(sctx, sctx->cur_ino, sctx->cur_inode_gen, di_key.objectid, &sctx->new_refs, &sctx->deleted_refs, is_orphan); if (!ret) ret = 1; } out: btrfs_free_path(path); return ret; } /* * Check if inode ino2, or any of its ancestors, is inode ino1. * Return 1 if true, 0 if false and < 0 on error. */ static int check_ino_in_path(struct btrfs_root *root, const u64 ino1, const u64 ino1_gen, const u64 ino2, const u64 ino2_gen, struct fs_path *fs_path) { u64 ino = ino2; if (ino1 == ino2) return ino1_gen == ino2_gen; while (ino > BTRFS_FIRST_FREE_OBJECTID) { u64 parent; u64 parent_gen; int ret; fs_path_reset(fs_path); ret = get_first_ref(root, ino, &parent, &parent_gen, fs_path); if (ret < 0) return ret; if (parent == ino1) return parent_gen == ino1_gen; ino = parent; } return 0; } /* * Check if inode ino1 is an ancestor of inode ino2 in the given root for any * possible path (in case ino2 is not a directory and has multiple hard links). * Return 1 if true, 0 if false and < 0 on error. */ static int is_ancestor(struct btrfs_root *root, const u64 ino1, const u64 ino1_gen, const u64 ino2, struct fs_path *fs_path) { bool free_fs_path = false; int ret = 0; int iter_ret = 0; struct btrfs_path *path = NULL; struct btrfs_key key; if (!fs_path) { fs_path = fs_path_alloc(); if (!fs_path) return -ENOMEM; free_fs_path = true; } path = alloc_path_for_send(); if (!path) { ret = -ENOMEM; goto out; } key.objectid = ino2; key.type = BTRFS_INODE_REF_KEY; key.offset = 0; btrfs_for_each_slot(root, &key, &key, path, iter_ret) { struct extent_buffer *leaf = path->nodes[0]; int slot = path->slots[0]; u32 cur_offset = 0; u32 item_size; if (key.objectid != ino2) break; if (key.type != BTRFS_INODE_REF_KEY && key.type != BTRFS_INODE_EXTREF_KEY) break; item_size = btrfs_item_size(leaf, slot); while (cur_offset < item_size) { u64 parent; u64 parent_gen; if (key.type == BTRFS_INODE_EXTREF_KEY) { unsigned long ptr; struct btrfs_inode_extref *extref; ptr = btrfs_item_ptr_offset(leaf, slot); extref = (struct btrfs_inode_extref *) (ptr + cur_offset); parent = btrfs_inode_extref_parent(leaf, extref); cur_offset += sizeof(*extref); cur_offset += btrfs_inode_extref_name_len(leaf, extref); } else { parent = key.offset; cur_offset = item_size; } ret = get_inode_gen(root, parent, &parent_gen); if (ret < 0) goto out; ret = check_ino_in_path(root, ino1, ino1_gen, parent, parent_gen, fs_path); if (ret) goto out; } } ret = 0; if (iter_ret < 0) ret = iter_ret; out: btrfs_free_path(path); if (free_fs_path) fs_path_free(fs_path); return ret; } static int wait_for_parent_move(struct send_ctx *sctx, struct recorded_ref *parent_ref, const bool is_orphan) { int ret = 0; u64 ino = parent_ref->dir; u64 ino_gen = parent_ref->dir_gen; u64 parent_ino_before, parent_ino_after; struct fs_path *path_before = NULL; struct fs_path *path_after = NULL; int len1, len2; path_after = fs_path_alloc(); path_before = fs_path_alloc(); if (!path_after || !path_before) { ret = -ENOMEM; goto out; } /* * Our current directory inode may not yet be renamed/moved because some * ancestor (immediate or not) has to be renamed/moved first. So find if * such ancestor exists and make sure our own rename/move happens after * that ancestor is processed to avoid path build infinite loops (done * at get_cur_path()). */ while (ino > BTRFS_FIRST_FREE_OBJECTID) { u64 parent_ino_after_gen; if (is_waiting_for_move(sctx, ino)) { /* * If the current inode is an ancestor of ino in the * parent root, we need to delay the rename of the * current inode, otherwise don't delayed the rename * because we can end up with a circular dependency * of renames, resulting in some directories never * getting the respective rename operations issued in * the send stream or getting into infinite path build * loops. */ ret = is_ancestor(sctx->parent_root, sctx->cur_ino, sctx->cur_inode_gen, ino, path_before); if (ret) break; } fs_path_reset(path_before); fs_path_reset(path_after); ret = get_first_ref(sctx->send_root, ino, &parent_ino_after, &parent_ino_after_gen, path_after); if (ret < 0) goto out; ret = get_first_ref(sctx->parent_root, ino, &parent_ino_before, NULL, path_before); if (ret < 0 && ret != -ENOENT) { goto out; } else if (ret == -ENOENT) { ret = 0; break; } len1 = fs_path_len(path_before); len2 = fs_path_len(path_after); if (ino > sctx->cur_ino && (parent_ino_before != parent_ino_after || len1 != len2 || memcmp(path_before->start, path_after->start, len1))) { u64 parent_ino_gen; ret = get_inode_gen(sctx->parent_root, ino, &parent_ino_gen); if (ret < 0) goto out; if (ino_gen == parent_ino_gen) { ret = 1; break; } } ino = parent_ino_after; ino_gen = parent_ino_after_gen; } out: fs_path_free(path_before); fs_path_free(path_after); if (ret == 1) { ret = add_pending_dir_move(sctx, sctx->cur_ino, sctx->cur_inode_gen, ino, &sctx->new_refs, &sctx->deleted_refs, is_orphan); if (!ret) ret = 1; } return ret; } static int update_ref_path(struct send_ctx *sctx, struct recorded_ref *ref) { int ret; struct fs_path *new_path; /* * Our reference's name member points to its full_path member string, so * we use here a new path. */ new_path = fs_path_alloc(); if (!new_path) return -ENOMEM; ret = get_cur_path(sctx, ref->dir, ref->dir_gen, new_path); if (ret < 0) { fs_path_free(new_path); return ret; } ret = fs_path_add(new_path, ref->name, ref->name_len); if (ret < 0) { fs_path_free(new_path); return ret; } fs_path_free(ref->full_path); set_ref_path(ref, new_path); return 0; } /* * When processing the new references for an inode we may orphanize an existing * directory inode because its old name conflicts with one of the new references * of the current inode. Later, when processing another new reference of our * inode, we might need to orphanize another inode, but the path we have in the * reference reflects the pre-orphanization name of the directory we previously * orphanized. For example: * * parent snapshot looks like: * * . (ino 256) * |----- f1 (ino 257) * |----- f2 (ino 258) * |----- d1/ (ino 259) * |----- d2/ (ino 260) * * send snapshot looks like: * * . (ino 256) * |----- d1 (ino 258) * |----- f2/ (ino 259) * |----- f2_link/ (ino 260) * | |----- f1 (ino 257) * | * |----- d2 (ino 258) * * When processing inode 257 we compute the name for inode 259 as "d1", and we * cache it in the name cache. Later when we start processing inode 258, when * collecting all its new references we set a full path of "d1/d2" for its new * reference with name "d2". When we start processing the new references we * start by processing the new reference with name "d1", and this results in * orphanizing inode 259, since its old reference causes a conflict. Then we * move on the next new reference, with name "d2", and we find out we must * orphanize inode 260, as its old reference conflicts with ours - but for the * orphanization we use a source path corresponding to the path we stored in the * new reference, which is "d1/d2" and not "o259-6-0/d2" - this makes the * receiver fail since the path component "d1/" no longer exists, it was renamed * to "o259-6-0/" when processing the previous new reference. So in this case we * must recompute the path in the new reference and use it for the new * orphanization operation. */ static int refresh_ref_path(struct send_ctx *sctx, struct recorded_ref *ref) { char *name; int ret; name = kmemdup(ref->name, ref->name_len, GFP_KERNEL); if (!name) return -ENOMEM; fs_path_reset(ref->full_path); ret = get_cur_path(sctx, ref->dir, ref->dir_gen, ref->full_path); if (ret < 0) goto out; ret = fs_path_add(ref->full_path, name, ref->name_len); if (ret < 0) goto out; /* Update the reference's base name pointer. */ set_ref_path(ref, ref->full_path); out: kfree(name); return ret; } /* * This does all the move/link/unlink/rmdir magic. */ static int process_recorded_refs(struct send_ctx *sctx, int *pending_move) { struct btrfs_fs_info *fs_info = sctx->send_root->fs_info; int ret = 0; struct recorded_ref *cur; struct recorded_ref *cur2; LIST_HEAD(check_dirs); struct fs_path *valid_path = NULL; u64 ow_inode = 0; u64 ow_gen; u64 ow_mode; int did_overwrite = 0; int is_orphan = 0; u64 last_dir_ino_rm = 0; bool can_rename = true; bool orphanized_dir = false; bool orphanized_ancestor = false; btrfs_debug(fs_info, "process_recorded_refs %llu", sctx->cur_ino); /* * This should never happen as the root dir always has the same ref * which is always '..' */ BUG_ON(sctx->cur_ino <= BTRFS_FIRST_FREE_OBJECTID); valid_path = fs_path_alloc(); if (!valid_path) { ret = -ENOMEM; goto out; } /* * First, check if the first ref of the current inode was overwritten * before. If yes, we know that the current inode was already orphanized * and thus use the orphan name. If not, we can use get_cur_path to * get the path of the first ref as it would like while receiving at * this point in time. * New inodes are always orphan at the beginning, so force to use the * orphan name in this case. * The first ref is stored in valid_path and will be updated if it * gets moved around. */ if (!sctx->cur_inode_new) { ret = did_overwrite_first_ref(sctx, sctx->cur_ino, sctx->cur_inode_gen); if (ret < 0) goto out; if (ret) did_overwrite = 1; } if (sctx->cur_inode_new || did_overwrite) { ret = gen_unique_name(sctx, sctx->cur_ino, sctx->cur_inode_gen, valid_path); if (ret < 0) goto out; is_orphan = 1; } else { ret = get_cur_path(sctx, sctx->cur_ino, sctx->cur_inode_gen, valid_path); if (ret < 0) goto out; } /* * Before doing any rename and link operations, do a first pass on the * new references to orphanize any unprocessed inodes that may have a * reference that conflicts with one of the new references of the current * inode. This needs to happen first because a new reference may conflict * with the old reference of a parent directory, so we must make sure * that the path used for link and rename commands don't use an * orphanized name when an ancestor was not yet orphanized. * * Example: * * Parent snapshot: * * . (ino 256) * |----- testdir/ (ino 259) * | |----- a (ino 257) * | * |----- b (ino 258) * * Send snapshot: * * . (ino 256) * |----- testdir_2/ (ino 259) * | |----- a (ino 260) * | * |----- testdir (ino 257) * |----- b (ino 257) * |----- b2 (ino 258) * * Processing the new reference for inode 257 with name "b" may happen * before processing the new reference with name "testdir". If so, we * must make sure that by the time we send a link command to create the * hard link "b", inode 259 was already orphanized, since the generated * path in "valid_path" already contains the orphanized name for 259. * We are processing inode 257, so only later when processing 259 we do * the rename operation to change its temporary (orphanized) name to * "testdir_2". */ list_for_each_entry(cur, &sctx->new_refs, list) { ret = get_cur_inode_state(sctx, cur->dir, cur->dir_gen, NULL, NULL); if (ret < 0) goto out; if (ret == inode_state_will_create) continue; /* * Check if this new ref would overwrite the first ref of another * unprocessed inode. If yes, orphanize the overwritten inode. * If we find an overwritten ref that is not the first ref, * simply unlink it. */ ret = will_overwrite_ref(sctx, cur->dir, cur->dir_gen, cur->name, cur->name_len, &ow_inode, &ow_gen, &ow_mode); if (ret < 0) goto out; if (ret) { ret = is_first_ref(sctx->parent_root, ow_inode, cur->dir, cur->name, cur->name_len); if (ret < 0) goto out; if (ret) { struct name_cache_entry *nce; struct waiting_dir_move *wdm; if (orphanized_dir) { ret = refresh_ref_path(sctx, cur); if (ret < 0) goto out; } ret = orphanize_inode(sctx, ow_inode, ow_gen, cur->full_path); if (ret < 0) goto out; if (S_ISDIR(ow_mode)) orphanized_dir = true; /* * If ow_inode has its rename operation delayed * make sure that its orphanized name is used in * the source path when performing its rename * operation. */ wdm = get_waiting_dir_move(sctx, ow_inode); if (wdm) wdm->orphanized = true; /* * Make sure we clear our orphanized inode's * name from the name cache. This is because the * inode ow_inode might be an ancestor of some * other inode that will be orphanized as well * later and has an inode number greater than * sctx->send_progress. We need to prevent * future name lookups from using the old name * and get instead the orphan name. */ nce = name_cache_search(sctx, ow_inode, ow_gen); if (nce) btrfs_lru_cache_remove(&sctx->name_cache, &nce->entry); /* * ow_inode might currently be an ancestor of * cur_ino, therefore compute valid_path (the * current path of cur_ino) again because it * might contain the pre-orphanization name of * ow_inode, which is no longer valid. */ ret = is_ancestor(sctx->parent_root, ow_inode, ow_gen, sctx->cur_ino, NULL); if (ret > 0) { orphanized_ancestor = true; fs_path_reset(valid_path); ret = get_cur_path(sctx, sctx->cur_ino, sctx->cur_inode_gen, valid_path); } if (ret < 0) goto out; } else { /* * If we previously orphanized a directory that * collided with a new reference that we already * processed, recompute the current path because * that directory may be part of the path. */ if (orphanized_dir) { ret = refresh_ref_path(sctx, cur); if (ret < 0) goto out; } ret = send_unlink(sctx, cur->full_path); if (ret < 0) goto out; } } } list_for_each_entry(cur, &sctx->new_refs, list) { /* * We may have refs where the parent directory does not exist * yet. This happens if the parent directories inum is higher * than the current inum. To handle this case, we create the * parent directory out of order. But we need to check if this * did already happen before due to other refs in the same dir. */ ret = get_cur_inode_state(sctx, cur->dir, cur->dir_gen, NULL, NULL); if (ret < 0) goto out; if (ret == inode_state_will_create) { ret = 0; /* * First check if any of the current inodes refs did * already create the dir. */ list_for_each_entry(cur2, &sctx->new_refs, list) { if (cur == cur2) break; if (cur2->dir == cur->dir) { ret = 1; break; } } /* * If that did not happen, check if a previous inode * did already create the dir. */ if (!ret) ret = did_create_dir(sctx, cur->dir); if (ret < 0) goto out; if (!ret) { ret = send_create_inode(sctx, cur->dir); if (ret < 0) goto out; cache_dir_created(sctx, cur->dir); } } if (S_ISDIR(sctx->cur_inode_mode) && sctx->parent_root) { ret = wait_for_dest_dir_move(sctx, cur, is_orphan); if (ret < 0) goto out; if (ret == 1) { can_rename = false; *pending_move = 1; } } if (S_ISDIR(sctx->cur_inode_mode) && sctx->parent_root && can_rename) { ret = wait_for_parent_move(sctx, cur, is_orphan); if (ret < 0) goto out; if (ret == 1) { can_rename = false; *pending_move = 1; } } /* * link/move the ref to the new place. If we have an orphan * inode, move it and update valid_path. If not, link or move * it depending on the inode mode. */ if (is_orphan && can_rename) { ret = send_rename(sctx, valid_path, cur->full_path); if (ret < 0) goto out; is_orphan = 0; ret = fs_path_copy(valid_path, cur->full_path); if (ret < 0) goto out; } else if (can_rename) { if (S_ISDIR(sctx->cur_inode_mode)) { /* * Dirs can't be linked, so move it. For moved * dirs, we always have one new and one deleted * ref. The deleted ref is ignored later. */ ret = send_rename(sctx, valid_path, cur->full_path); if (!ret) ret = fs_path_copy(valid_path, cur->full_path); if (ret < 0) goto out; } else { /* * We might have previously orphanized an inode * which is an ancestor of our current inode, * so our reference's full path, which was * computed before any such orphanizations, must * be updated. */ if (orphanized_dir) { ret = update_ref_path(sctx, cur); if (ret < 0) goto out; } ret = send_link(sctx, cur->full_path, valid_path); if (ret < 0) goto out; } } ret = dup_ref(cur, &check_dirs); if (ret < 0) goto out; } if (S_ISDIR(sctx->cur_inode_mode) && sctx->cur_inode_deleted) { /* * Check if we can already rmdir the directory. If not, * orphanize it. For every dir item inside that gets deleted * later, we do this check again and rmdir it then if possible. * See the use of check_dirs for more details. */ ret = can_rmdir(sctx, sctx->cur_ino, sctx->cur_inode_gen); if (ret < 0) goto out; if (ret) { ret = send_rmdir(sctx, valid_path); if (ret < 0) goto out; } else if (!is_orphan) { ret = orphanize_inode(sctx, sctx->cur_ino, sctx->cur_inode_gen, valid_path); if (ret < 0) goto out; is_orphan = 1; } list_for_each_entry(cur, &sctx->deleted_refs, list) { ret = dup_ref(cur, &check_dirs); if (ret < 0) goto out; } } else if (S_ISDIR(sctx->cur_inode_mode) && !list_empty(&sctx->deleted_refs)) { /* * We have a moved dir. Add the old parent to check_dirs */ cur = list_entry(sctx->deleted_refs.next, struct recorded_ref, list); ret = dup_ref(cur, &check_dirs); if (ret < 0) goto out; } else if (!S_ISDIR(sctx->cur_inode_mode)) { /* * We have a non dir inode. Go through all deleted refs and * unlink them if they were not already overwritten by other * inodes. */ list_for_each_entry(cur, &sctx->deleted_refs, list) { ret = did_overwrite_ref(sctx, cur->dir, cur->dir_gen, sctx->cur_ino, sctx->cur_inode_gen, cur->name, cur->name_len); if (ret < 0) goto out; if (!ret) { /* * If we orphanized any ancestor before, we need * to recompute the full path for deleted names, * since any such path was computed before we * processed any references and orphanized any * ancestor inode. */ if (orphanized_ancestor) { ret = update_ref_path(sctx, cur); if (ret < 0) goto out; } ret = send_unlink(sctx, cur->full_path); if (ret < 0) goto out; } ret = dup_ref(cur, &check_dirs); if (ret < 0) goto out; } /* * If the inode is still orphan, unlink the orphan. This may * happen when a previous inode did overwrite the first ref * of this inode and no new refs were added for the current * inode. Unlinking does not mean that the inode is deleted in * all cases. There may still be links to this inode in other * places. */ if (is_orphan) { ret = send_unlink(sctx, valid_path); if (ret < 0) goto out; } } /* * We did collect all parent dirs where cur_inode was once located. We * now go through all these dirs and check if they are pending for * deletion and if it's finally possible to perform the rmdir now. * We also update the inode stats of the parent dirs here. */ list_for_each_entry(cur, &check_dirs, list) { /* * In case we had refs into dirs that were not processed yet, * we don't need to do the utime and rmdir logic for these dirs. * The dir will be processed later. */ if (cur->dir > sctx->cur_ino) continue; ret = get_cur_inode_state(sctx, cur->dir, cur->dir_gen, NULL, NULL); if (ret < 0) goto out; if (ret == inode_state_did_create || ret == inode_state_no_change) { ret = cache_dir_utimes(sctx, cur->dir, cur->dir_gen); if (ret < 0) goto out; } else if (ret == inode_state_did_delete && cur->dir != last_dir_ino_rm) { ret = can_rmdir(sctx, cur->dir, cur->dir_gen); if (ret < 0) goto out; if (ret) { ret = get_cur_path(sctx, cur->dir, cur->dir_gen, valid_path); if (ret < 0) goto out; ret = send_rmdir(sctx, valid_path); if (ret < 0) goto out; last_dir_ino_rm = cur->dir; } } } ret = 0; out: __free_recorded_refs(&check_dirs); free_recorded_refs(sctx); fs_path_free(valid_path); return ret; } static int rbtree_ref_comp(const void *k, const struct rb_node *node) { const struct recorded_ref *data = k; const struct recorded_ref *ref = rb_entry(node, struct recorded_ref, node); int result; if (data->dir > ref->dir) return 1; if (data->dir < ref->dir) return -1; if (data->dir_gen > ref->dir_gen) return 1; if (data->dir_gen < ref->dir_gen) return -1; if (data->name_len > ref->name_len) return 1; if (data->name_len < ref->name_len) return -1; result = strcmp(data->name, ref->name); if (result > 0) return 1; if (result < 0) return -1; return 0; } static bool rbtree_ref_less(struct rb_node *node, const struct rb_node *parent) { const struct recorded_ref *entry = rb_entry(node, struct recorded_ref, node); return rbtree_ref_comp(entry, parent) < 0; } static int record_ref_in_tree(struct rb_root *root, struct list_head *refs, struct fs_path *name, u64 dir, u64 dir_gen, struct send_ctx *sctx) { int ret = 0; struct fs_path *path = NULL; struct recorded_ref *ref = NULL; path = fs_path_alloc(); if (!path) { ret = -ENOMEM; goto out; } ref = recorded_ref_alloc(); if (!ref) { ret = -ENOMEM; goto out; } ret = get_cur_path(sctx, dir, dir_gen, path); if (ret < 0) goto out; ret = fs_path_add_path(path, name); if (ret < 0) goto out; ref->dir = dir; ref->dir_gen = dir_gen; set_ref_path(ref, path); list_add_tail(&ref->list, refs); rb_add(&ref->node, root, rbtree_ref_less); ref->root = root; out: if (ret) { if (path && (!ref || !ref->full_path)) fs_path_free(path); recorded_ref_free(ref); } return ret; } static int record_new_ref_if_needed(int num, u64 dir, int index, struct fs_path *name, void *ctx) { int ret = 0; struct send_ctx *sctx = ctx; struct rb_node *node = NULL; struct recorded_ref data; struct recorded_ref *ref; u64 dir_gen; ret = get_inode_gen(sctx->send_root, dir, &dir_gen); if (ret < 0) goto out; data.dir = dir; data.dir_gen = dir_gen; set_ref_path(&data, name); node = rb_find(&data, &sctx->rbtree_deleted_refs, rbtree_ref_comp); if (node) { ref = rb_entry(node, struct recorded_ref, node); recorded_ref_free(ref); } else { ret = record_ref_in_tree(&sctx->rbtree_new_refs, &sctx->new_refs, name, dir, dir_gen, sctx); } out: return ret; } static int record_deleted_ref_if_needed(int num, u64 dir, int index, struct fs_path *name, void *ctx) { int ret = 0; struct send_ctx *sctx = ctx; struct rb_node *node = NULL; struct recorded_ref data; struct recorded_ref *ref; u64 dir_gen; ret = get_inode_gen(sctx->parent_root, dir, &dir_gen); if (ret < 0) goto out; data.dir = dir; data.dir_gen = dir_gen; set_ref_path(&data, name); node = rb_find(&data, &sctx->rbtree_new_refs, rbtree_ref_comp); if (node) { ref = rb_entry(node, struct recorded_ref, node); recorded_ref_free(ref); } else { ret = record_ref_in_tree(&sctx->rbtree_deleted_refs, &sctx->deleted_refs, name, dir, dir_gen, sctx); } out: return ret; } static int record_new_ref(struct send_ctx *sctx) { int ret; ret = iterate_inode_ref(sctx->send_root, sctx->left_path, sctx->cmp_key, 0, record_new_ref_if_needed, sctx); if (ret < 0) goto out; ret = 0; out: return ret; } static int record_deleted_ref(struct send_ctx *sctx) { int ret; ret = iterate_inode_ref(sctx->parent_root, sctx->right_path, sctx->cmp_key, 0, record_deleted_ref_if_needed, sctx); if (ret < 0) goto out; ret = 0; out: return ret; } static int record_changed_ref(struct send_ctx *sctx) { int ret = 0; ret = iterate_inode_ref(sctx->send_root, sctx->left_path, sctx->cmp_key, 0, record_new_ref_if_needed, sctx); if (ret < 0) goto out; ret = iterate_inode_ref(sctx->parent_root, sctx->right_path, sctx->cmp_key, 0, record_deleted_ref_if_needed, sctx); if (ret < 0) goto out; ret = 0; out: return ret; } /* * Record and process all refs at once. Needed when an inode changes the * generation number, which means that it was deleted and recreated. */ static int process_all_refs(struct send_ctx *sctx, enum btrfs_compare_tree_result cmd) { int ret = 0; int iter_ret = 0; struct btrfs_root *root; struct btrfs_path *path; struct btrfs_key key; struct btrfs_key found_key; iterate_inode_ref_t cb; int pending_move = 0; path = alloc_path_for_send(); if (!path) return -ENOMEM; if (cmd == BTRFS_COMPARE_TREE_NEW) { root = sctx->send_root; cb = record_new_ref_if_needed; } else if (cmd == BTRFS_COMPARE_TREE_DELETED) { root = sctx->parent_root; cb = record_deleted_ref_if_needed; } else { btrfs_err(sctx->send_root->fs_info, "Wrong command %d in process_all_refs", cmd); ret = -EINVAL; goto out; } key.objectid = sctx->cmp_key->objectid; key.type = BTRFS_INODE_REF_KEY; key.offset = 0; btrfs_for_each_slot(root, &key, &found_key, path, iter_ret) { if (found_key.objectid != key.objectid || (found_key.type != BTRFS_INODE_REF_KEY && found_key.type != BTRFS_INODE_EXTREF_KEY)) break; ret = iterate_inode_ref(root, path, &found_key, 0, cb, sctx); if (ret < 0) goto out; } /* Catch error found during iteration */ if (iter_ret < 0) { ret = iter_ret; goto out; } btrfs_release_path(path); /* * We don't actually care about pending_move as we are simply * re-creating this inode and will be rename'ing it into place once we * rename the parent directory. */ ret = process_recorded_refs(sctx, &pending_move); out: btrfs_free_path(path); return ret; } static int send_set_xattr(struct send_ctx *sctx, struct fs_path *path, const char *name, int name_len, const char *data, int data_len) { int ret = 0; ret = begin_cmd(sctx, BTRFS_SEND_C_SET_XATTR); if (ret < 0) goto out; TLV_PUT_PATH(sctx, BTRFS_SEND_A_PATH, path); TLV_PUT_STRING(sctx, BTRFS_SEND_A_XATTR_NAME, name, name_len); TLV_PUT(sctx, BTRFS_SEND_A_XATTR_DATA, data, data_len); ret = send_cmd(sctx); tlv_put_failure: out: return ret; } static int send_remove_xattr(struct send_ctx *sctx, struct fs_path *path, const char *name, int name_len) { int ret = 0; ret = begin_cmd(sctx, BTRFS_SEND_C_REMOVE_XATTR); if (ret < 0) goto out; TLV_PUT_PATH(sctx, BTRFS_SEND_A_PATH, path); TLV_PUT_STRING(sctx, BTRFS_SEND_A_XATTR_NAME, name, name_len); ret = send_cmd(sctx); tlv_put_failure: out: return ret; } static int __process_new_xattr(int num, struct btrfs_key *di_key, const char *name, int name_len, const char *data, int data_len, void *ctx) { int ret; struct send_ctx *sctx = ctx; struct fs_path *p; struct posix_acl_xattr_header dummy_acl; /* Capabilities are emitted by finish_inode_if_needed */ if (!strncmp(name, XATTR_NAME_CAPS, name_len)) return 0; p = fs_path_alloc(); if (!p) return -ENOMEM; /* * This hack is needed because empty acls are stored as zero byte * data in xattrs. Problem with that is, that receiving these zero byte * acls will fail later. To fix this, we send a dummy acl list that * only contains the version number and no entries. */ if (!strncmp(name, XATTR_NAME_POSIX_ACL_ACCESS, name_len) || !strncmp(name, XATTR_NAME_POSIX_ACL_DEFAULT, name_len)) { if (data_len == 0) { dummy_acl.a_version = cpu_to_le32(POSIX_ACL_XATTR_VERSION); data = (char *)&dummy_acl; data_len = sizeof(dummy_acl); } } ret = get_cur_path(sctx, sctx->cur_ino, sctx->cur_inode_gen, p); if (ret < 0) goto out; ret = send_set_xattr(sctx, p, name, name_len, data, data_len); out: fs_path_free(p); return ret; } static int __process_deleted_xattr(int num, struct btrfs_key *di_key, const char *name, int name_len, const char *data, int data_len, void *ctx) { int ret; struct send_ctx *sctx = ctx; struct fs_path *p; p = fs_path_alloc(); if (!p) return -ENOMEM; ret = get_cur_path(sctx, sctx->cur_ino, sctx->cur_inode_gen, p); if (ret < 0) goto out; ret = send_remove_xattr(sctx, p, name, name_len); out: fs_path_free(p); return ret; } static int process_new_xattr(struct send_ctx *sctx) { int ret = 0; ret = iterate_dir_item(sctx->send_root, sctx->left_path, __process_new_xattr, sctx); return ret; } static int process_deleted_xattr(struct send_ctx *sctx) { return iterate_dir_item(sctx->parent_root, sctx->right_path, __process_deleted_xattr, sctx); } struct find_xattr_ctx { const char *name; int name_len; int found_idx; char *found_data; int found_data_len; }; static int __find_xattr(int num, struct btrfs_key *di_key, const char *name, int name_len, const char *data, int data_len, void *vctx) { struct find_xattr_ctx *ctx = vctx; if (name_len == ctx->name_len && strncmp(name, ctx->name, name_len) == 0) { ctx->found_idx = num; ctx->found_data_len = data_len; ctx->found_data = kmemdup(data, data_len, GFP_KERNEL); if (!ctx->found_data) return -ENOMEM; return 1; } return 0; } static int find_xattr(struct btrfs_root *root, struct btrfs_path *path, struct btrfs_key *key, const char *name, int name_len, char **data, int *data_len) { int ret; struct find_xattr_ctx ctx; ctx.name = name; ctx.name_len = name_len; ctx.found_idx = -1; ctx.found_data = NULL; ctx.found_data_len = 0; ret = iterate_dir_item(root, path, __find_xattr, &ctx); if (ret < 0) return ret; if (ctx.found_idx == -1) return -ENOENT; if (data) { *data = ctx.found_data; *data_len = ctx.found_data_len; } else { kfree(ctx.found_data); } return ctx.found_idx; } static int __process_changed_new_xattr(int num, struct btrfs_key *di_key, const char *name, int name_len, const char *data, int data_len, void *ctx) { int ret; struct send_ctx *sctx = ctx; char *found_data = NULL; int found_data_len = 0; ret = find_xattr(sctx->parent_root, sctx->right_path, sctx->cmp_key, name, name_len, &found_data, &found_data_len); if (ret == -ENOENT) { ret = __process_new_xattr(num, di_key, name, name_len, data, data_len, ctx); } else if (ret >= 0) { if (data_len != found_data_len || memcmp(data, found_data, data_len)) { ret = __process_new_xattr(num, di_key, name, name_len, data, data_len, ctx); } else { ret = 0; } } kfree(found_data); return ret; } static int __process_changed_deleted_xattr(int num, struct btrfs_key *di_key, const char *name, int name_len, const char *data, int data_len, void *ctx) { int ret; struct send_ctx *sctx = ctx; ret = find_xattr(sctx->send_root, sctx->left_path, sctx->cmp_key, name, name_len, NULL, NULL); if (ret == -ENOENT) ret = __process_deleted_xattr(num, di_key, name, name_len, data, data_len, ctx); else if (ret >= 0) ret = 0; return ret; } static int process_changed_xattr(struct send_ctx *sctx) { int ret = 0; ret = iterate_dir_item(sctx->send_root, sctx->left_path, __process_changed_new_xattr, sctx); if (ret < 0) goto out; ret = iterate_dir_item(sctx->parent_root, sctx->right_path, __process_changed_deleted_xattr, sctx); out: return ret; } static int process_all_new_xattrs(struct send_ctx *sctx) { int ret = 0; int iter_ret = 0; struct btrfs_root *root; struct btrfs_path *path; struct btrfs_key key; struct btrfs_key found_key; path = alloc_path_for_send(); if (!path) return -ENOMEM; root = sctx->send_root; key.objectid = sctx->cmp_key->objectid; key.type = BTRFS_XATTR_ITEM_KEY; key.offset = 0; btrfs_for_each_slot(root, &key, &found_key, path, iter_ret) { if (found_key.objectid != key.objectid || found_key.type != key.type) { ret = 0; break; } ret = iterate_dir_item(root, path, __process_new_xattr, sctx); if (ret < 0) break; } /* Catch error found during iteration */ if (iter_ret < 0) ret = iter_ret; btrfs_free_path(path); return ret; } static int send_verity(struct send_ctx *sctx, struct fs_path *path, struct fsverity_descriptor *desc) { int ret; ret = begin_cmd(sctx, BTRFS_SEND_C_ENABLE_VERITY); if (ret < 0) goto out; TLV_PUT_PATH(sctx, BTRFS_SEND_A_PATH, path); TLV_PUT_U8(sctx, BTRFS_SEND_A_VERITY_ALGORITHM, le8_to_cpu(desc->hash_algorithm)); TLV_PUT_U32(sctx, BTRFS_SEND_A_VERITY_BLOCK_SIZE, 1U << le8_to_cpu(desc->log_blocksize)); TLV_PUT(sctx, BTRFS_SEND_A_VERITY_SALT_DATA, desc->salt, le8_to_cpu(desc->salt_size)); TLV_PUT(sctx, BTRFS_SEND_A_VERITY_SIG_DATA, desc->signature, le32_to_cpu(desc->sig_size)); ret = send_cmd(sctx); tlv_put_failure: out: return ret; } static int process_verity(struct send_ctx *sctx) { int ret = 0; struct btrfs_fs_info *fs_info = sctx->send_root->fs_info; struct inode *inode; struct fs_path *p; inode = btrfs_iget(fs_info->sb, sctx->cur_ino, sctx->send_root); if (IS_ERR(inode)) return PTR_ERR(inode); ret = btrfs_get_verity_descriptor(inode, NULL, 0); if (ret < 0) goto iput; if (ret > FS_VERITY_MAX_DESCRIPTOR_SIZE) { ret = -EMSGSIZE; goto iput; } if (!sctx->verity_descriptor) { sctx->verity_descriptor = kvmalloc(FS_VERITY_MAX_DESCRIPTOR_SIZE, GFP_KERNEL); if (!sctx->verity_descriptor) { ret = -ENOMEM; goto iput; } } ret = btrfs_get_verity_descriptor(inode, sctx->verity_descriptor, ret); if (ret < 0) goto iput; p = fs_path_alloc(); if (!p) { ret = -ENOMEM; goto iput; } ret = get_cur_path(sctx, sctx->cur_ino, sctx->cur_inode_gen, p); if (ret < 0) goto free_path; ret = send_verity(sctx, p, sctx->verity_descriptor); if (ret < 0) goto free_path; free_path: fs_path_free(p); iput: iput(inode); return ret; } static inline u64 max_send_read_size(const struct send_ctx *sctx) { return sctx->send_max_size - SZ_16K; } static int put_data_header(struct send_ctx *sctx, u32 len) { if (WARN_ON_ONCE(sctx->put_data)) return -EINVAL; sctx->put_data = true; if (sctx->proto >= 2) { /* * Since v2, the data attribute header doesn't include a length, * it is implicitly to the end of the command. */ if (sctx->send_max_size - sctx->send_size < sizeof(__le16) + len) return -EOVERFLOW; put_unaligned_le16(BTRFS_SEND_A_DATA, sctx->send_buf + sctx->send_size); sctx->send_size += sizeof(__le16); } else { struct btrfs_tlv_header *hdr; if (sctx->send_max_size - sctx->send_size < sizeof(*hdr) + len) return -EOVERFLOW; hdr = (struct btrfs_tlv_header *)(sctx->send_buf + sctx->send_size); put_unaligned_le16(BTRFS_SEND_A_DATA, &hdr->tlv_type); put_unaligned_le16(len, &hdr->tlv_len); sctx->send_size += sizeof(*hdr); } return 0; } static int put_file_data(struct send_ctx *sctx, u64 offset, u32 len) { struct btrfs_root *root = sctx->send_root; struct btrfs_fs_info *fs_info = root->fs_info; struct page *page; pgoff_t index = offset >> PAGE_SHIFT; pgoff_t last_index; unsigned pg_offset = offset_in_page(offset); int ret; ret = put_data_header(sctx, len); if (ret) return ret; last_index = (offset + len - 1) >> PAGE_SHIFT; while (index <= last_index) { unsigned cur_len = min_t(unsigned, len, PAGE_SIZE - pg_offset); page = find_lock_page(sctx->cur_inode->i_mapping, index); if (!page) { page_cache_sync_readahead(sctx->cur_inode->i_mapping, &sctx->ra, NULL, index, last_index + 1 - index); page = find_or_create_page(sctx->cur_inode->i_mapping, index, GFP_KERNEL); if (!page) { ret = -ENOMEM; break; } } if (PageReadahead(page)) page_cache_async_readahead(sctx->cur_inode->i_mapping, &sctx->ra, NULL, page_folio(page), index, last_index + 1 - index); if (!PageUptodate(page)) { btrfs_read_folio(NULL, page_folio(page)); lock_page(page); if (!PageUptodate(page)) { unlock_page(page); btrfs_err(fs_info, "send: IO error at offset %llu for inode %llu root %llu", page_offset(page), sctx->cur_ino, sctx->send_root->root_key.objectid); put_page(page); ret = -EIO; break; } } memcpy_from_page(sctx->send_buf + sctx->send_size, page, pg_offset, cur_len); unlock_page(page); put_page(page); index++; pg_offset = 0; len -= cur_len; sctx->send_size += cur_len; } return ret; } /* * Read some bytes from the current inode/file and send a write command to * user space. */ static int send_write(struct send_ctx *sctx, u64 offset, u32 len) { struct btrfs_fs_info *fs_info = sctx->send_root->fs_info; int ret = 0; struct fs_path *p; p = fs_path_alloc(); if (!p) return -ENOMEM; btrfs_debug(fs_info, "send_write offset=%llu, len=%d", offset, len); ret = begin_cmd(sctx, BTRFS_SEND_C_WRITE); if (ret < 0) goto out; ret = get_cur_path(sctx, sctx->cur_ino, sctx->cur_inode_gen, p); if (ret < 0) goto out; TLV_PUT_PATH(sctx, BTRFS_SEND_A_PATH, p); TLV_PUT_U64(sctx, BTRFS_SEND_A_FILE_OFFSET, offset); ret = put_file_data(sctx, offset, len); if (ret < 0) goto out; ret = send_cmd(sctx); tlv_put_failure: out: fs_path_free(p); return ret; } /* * Send a clone command to user space. */ static int send_clone(struct send_ctx *sctx, u64 offset, u32 len, struct clone_root *clone_root) { int ret = 0; struct fs_path *p; u64 gen; btrfs_debug(sctx->send_root->fs_info, "send_clone offset=%llu, len=%d, clone_root=%llu, clone_inode=%llu, clone_offset=%llu", offset, len, clone_root->root->root_key.objectid, clone_root->ino, clone_root->offset); p = fs_path_alloc(); if (!p) return -ENOMEM; ret = begin_cmd(sctx, BTRFS_SEND_C_CLONE); if (ret < 0) goto out; ret = get_cur_path(sctx, sctx->cur_ino, sctx->cur_inode_gen, p); if (ret < 0) goto out; TLV_PUT_U64(sctx, BTRFS_SEND_A_FILE_OFFSET, offset); TLV_PUT_U64(sctx, BTRFS_SEND_A_CLONE_LEN, len); TLV_PUT_PATH(sctx, BTRFS_SEND_A_PATH, p); if (clone_root->root == sctx->send_root) { ret = get_inode_gen(sctx->send_root, clone_root->ino, &gen); if (ret < 0) goto out; ret = get_cur_path(sctx, clone_root->ino, gen, p); } else { ret = get_inode_path(clone_root->root, clone_root->ino, p); } if (ret < 0) goto out; /* * If the parent we're using has a received_uuid set then use that as * our clone source as that is what we will look for when doing a * receive. * * This covers the case that we create a snapshot off of a received * subvolume and then use that as the parent and try to receive on a * different host. */ if (!btrfs_is_empty_uuid(clone_root->root->root_item.received_uuid)) TLV_PUT_UUID(sctx, BTRFS_SEND_A_CLONE_UUID, clone_root->root->root_item.received_uuid); else TLV_PUT_UUID(sctx, BTRFS_SEND_A_CLONE_UUID, clone_root->root->root_item.uuid); TLV_PUT_U64(sctx, BTRFS_SEND_A_CLONE_CTRANSID, btrfs_root_ctransid(&clone_root->root->root_item)); TLV_PUT_PATH(sctx, BTRFS_SEND_A_CLONE_PATH, p); TLV_PUT_U64(sctx, BTRFS_SEND_A_CLONE_OFFSET, clone_root->offset); ret = send_cmd(sctx); tlv_put_failure: out: fs_path_free(p); return ret; } /* * Send an update extent command to user space. */ static int send_update_extent(struct send_ctx *sctx, u64 offset, u32 len) { int ret = 0; struct fs_path *p; p = fs_path_alloc(); if (!p) return -ENOMEM; ret = begin_cmd(sctx, BTRFS_SEND_C_UPDATE_EXTENT); if (ret < 0) goto out; ret = get_cur_path(sctx, sctx->cur_ino, sctx->cur_inode_gen, p); if (ret < 0) goto out; TLV_PUT_PATH(sctx, BTRFS_SEND_A_PATH, p); TLV_PUT_U64(sctx, BTRFS_SEND_A_FILE_OFFSET, offset); TLV_PUT_U64(sctx, BTRFS_SEND_A_SIZE, len); ret = send_cmd(sctx); tlv_put_failure: out: fs_path_free(p); return ret; } static int send_hole(struct send_ctx *sctx, u64 end) { struct fs_path *p = NULL; u64 read_size = max_send_read_size(sctx); u64 offset = sctx->cur_inode_last_extent; int ret = 0; /* * A hole that starts at EOF or beyond it. Since we do not yet support * fallocate (for extent preallocation and hole punching), sending a * write of zeroes starting at EOF or beyond would later require issuing * a truncate operation which would undo the write and achieve nothing. */ if (offset >= sctx->cur_inode_size) return 0; /* * Don't go beyond the inode's i_size due to prealloc extents that start * after the i_size. */ end = min_t(u64, end, sctx->cur_inode_size); if (sctx->flags & BTRFS_SEND_FLAG_NO_FILE_DATA) return send_update_extent(sctx, offset, end - offset); p = fs_path_alloc(); if (!p) return -ENOMEM; ret = get_cur_path(sctx, sctx->cur_ino, sctx->cur_inode_gen, p); if (ret < 0) goto tlv_put_failure; while (offset < end) { u64 len = min(end - offset, read_size); ret = begin_cmd(sctx, BTRFS_SEND_C_WRITE); if (ret < 0) break; TLV_PUT_PATH(sctx, BTRFS_SEND_A_PATH, p); TLV_PUT_U64(sctx, BTRFS_SEND_A_FILE_OFFSET, offset); ret = put_data_header(sctx, len); if (ret < 0) break; memset(sctx->send_buf + sctx->send_size, 0, len); sctx->send_size += len; ret = send_cmd(sctx); if (ret < 0) break; offset += len; } sctx->cur_inode_next_write_offset = offset; tlv_put_failure: fs_path_free(p); return ret; } static int send_encoded_inline_extent(struct send_ctx *sctx, struct btrfs_path *path, u64 offset, u64 len) { struct btrfs_root *root = sctx->send_root; struct btrfs_fs_info *fs_info = root->fs_info; struct inode *inode; struct fs_path *fspath; struct extent_buffer *leaf = path->nodes[0]; struct btrfs_key key; struct btrfs_file_extent_item *ei; u64 ram_bytes; size_t inline_size; int ret; inode = btrfs_iget(fs_info->sb, sctx->cur_ino, root); if (IS_ERR(inode)) return PTR_ERR(inode); fspath = fs_path_alloc(); if (!fspath) { ret = -ENOMEM; goto out; } ret = begin_cmd(sctx, BTRFS_SEND_C_ENCODED_WRITE); if (ret < 0) goto out; ret = get_cur_path(sctx, sctx->cur_ino, sctx->cur_inode_gen, fspath); if (ret < 0) goto out; btrfs_item_key_to_cpu(leaf, &key, path->slots[0]); ei = btrfs_item_ptr(leaf, path->slots[0], struct btrfs_file_extent_item); ram_bytes = btrfs_file_extent_ram_bytes(leaf, ei); inline_size = btrfs_file_extent_inline_item_len(leaf, path->slots[0]); TLV_PUT_PATH(sctx, BTRFS_SEND_A_PATH, fspath); TLV_PUT_U64(sctx, BTRFS_SEND_A_FILE_OFFSET, offset); TLV_PUT_U64(sctx, BTRFS_SEND_A_UNENCODED_FILE_LEN, min(key.offset + ram_bytes - offset, len)); TLV_PUT_U64(sctx, BTRFS_SEND_A_UNENCODED_LEN, ram_bytes); TLV_PUT_U64(sctx, BTRFS_SEND_A_UNENCODED_OFFSET, offset - key.offset); ret = btrfs_encoded_io_compression_from_extent(fs_info, btrfs_file_extent_compression(leaf, ei)); if (ret < 0) goto out; TLV_PUT_U32(sctx, BTRFS_SEND_A_COMPRESSION, ret); ret = put_data_header(sctx, inline_size); if (ret < 0) goto out; read_extent_buffer(leaf, sctx->send_buf + sctx->send_size, btrfs_file_extent_inline_start(ei), inline_size); sctx->send_size += inline_size; ret = send_cmd(sctx); tlv_put_failure: out: fs_path_free(fspath); iput(inode); return ret; } static int send_encoded_extent(struct send_ctx *sctx, struct btrfs_path *path, u64 offset, u64 len) { struct btrfs_root *root = sctx->send_root; struct btrfs_fs_info *fs_info = root->fs_info; struct inode *inode; struct fs_path *fspath; struct extent_buffer *leaf = path->nodes[0]; struct btrfs_key key; struct btrfs_file_extent_item *ei; u64 disk_bytenr, disk_num_bytes; u32 data_offset; struct btrfs_cmd_header *hdr; u32 crc; int ret; inode = btrfs_iget(fs_info->sb, sctx->cur_ino, root); if (IS_ERR(inode)) return PTR_ERR(inode); fspath = fs_path_alloc(); if (!fspath) { ret = -ENOMEM; goto out; } ret = begin_cmd(sctx, BTRFS_SEND_C_ENCODED_WRITE); if (ret < 0) goto out; ret = get_cur_path(sctx, sctx->cur_ino, sctx->cur_inode_gen, fspath); if (ret < 0) goto out; btrfs_item_key_to_cpu(leaf, &key, path->slots[0]); ei = btrfs_item_ptr(leaf, path->slots[0], struct btrfs_file_extent_item); disk_bytenr = btrfs_file_extent_disk_bytenr(leaf, ei); disk_num_bytes = btrfs_file_extent_disk_num_bytes(leaf, ei); TLV_PUT_PATH(sctx, BTRFS_SEND_A_PATH, fspath); TLV_PUT_U64(sctx, BTRFS_SEND_A_FILE_OFFSET, offset); TLV_PUT_U64(sctx, BTRFS_SEND_A_UNENCODED_FILE_LEN, min(key.offset + btrfs_file_extent_num_bytes(leaf, ei) - offset, len)); TLV_PUT_U64(sctx, BTRFS_SEND_A_UNENCODED_LEN, btrfs_file_extent_ram_bytes(leaf, ei)); TLV_PUT_U64(sctx, BTRFS_SEND_A_UNENCODED_OFFSET, offset - key.offset + btrfs_file_extent_offset(leaf, ei)); ret = btrfs_encoded_io_compression_from_extent(fs_info, btrfs_file_extent_compression(leaf, ei)); if (ret < 0) goto out; TLV_PUT_U32(sctx, BTRFS_SEND_A_COMPRESSION, ret); TLV_PUT_U32(sctx, BTRFS_SEND_A_ENCRYPTION, 0); ret = put_data_header(sctx, disk_num_bytes); if (ret < 0) goto out; /* * We want to do I/O directly into the send buffer, so get the next page * boundary in the send buffer. This means that there may be a gap * between the beginning of the command and the file data. */ data_offset = PAGE_ALIGN(sctx->send_size); if (data_offset > sctx->send_max_size || sctx->send_max_size - data_offset < disk_num_bytes) { ret = -EOVERFLOW; goto out; } /* * Note that send_buf is a mapping of send_buf_pages, so this is really * reading into send_buf. */ ret = btrfs_encoded_read_regular_fill_pages(BTRFS_I(inode), offset, disk_bytenr, disk_num_bytes, sctx->send_buf_pages + (data_offset >> PAGE_SHIFT)); if (ret) goto out; hdr = (struct btrfs_cmd_header *)sctx->send_buf; hdr->len = cpu_to_le32(sctx->send_size + disk_num_bytes - sizeof(*hdr)); hdr->crc = 0; crc = btrfs_crc32c(0, sctx->send_buf, sctx->send_size); crc = btrfs_crc32c(crc, sctx->send_buf + data_offset, disk_num_bytes); hdr->crc = cpu_to_le32(crc); ret = write_buf(sctx->send_filp, sctx->send_buf, sctx->send_size, &sctx->send_off); if (!ret) { ret = write_buf(sctx->send_filp, sctx->send_buf + data_offset, disk_num_bytes, &sctx->send_off); } sctx->send_size = 0; sctx->put_data = false; tlv_put_failure: out: fs_path_free(fspath); iput(inode); return ret; } static int send_extent_data(struct send_ctx *sctx, struct btrfs_path *path, const u64 offset, const u64 len) { const u64 end = offset + len; struct extent_buffer *leaf = path->nodes[0]; struct btrfs_file_extent_item *ei; u64 read_size = max_send_read_size(sctx); u64 sent = 0; if (sctx->flags & BTRFS_SEND_FLAG_NO_FILE_DATA) return send_update_extent(sctx, offset, len); ei = btrfs_item_ptr(leaf, path->slots[0], struct btrfs_file_extent_item); if ((sctx->flags & BTRFS_SEND_FLAG_COMPRESSED) && btrfs_file_extent_compression(leaf, ei) != BTRFS_COMPRESS_NONE) { bool is_inline = (btrfs_file_extent_type(leaf, ei) == BTRFS_FILE_EXTENT_INLINE); /* * Send the compressed extent unless the compressed data is * larger than the decompressed data. This can happen if we're * not sending the entire extent, either because it has been * partially overwritten/truncated or because this is a part of * the extent that we couldn't clone in clone_range(). */ if (is_inline && btrfs_file_extent_inline_item_len(leaf, path->slots[0]) <= len) { return send_encoded_inline_extent(sctx, path, offset, len); } else if (!is_inline && btrfs_file_extent_disk_num_bytes(leaf, ei) <= len) { return send_encoded_extent(sctx, path, offset, len); } } if (sctx->cur_inode == NULL) { struct btrfs_root *root = sctx->send_root; sctx->cur_inode = btrfs_iget(root->fs_info->sb, sctx->cur_ino, root); if (IS_ERR(sctx->cur_inode)) { int err = PTR_ERR(sctx->cur_inode); sctx->cur_inode = NULL; return err; } memset(&sctx->ra, 0, sizeof(struct file_ra_state)); file_ra_state_init(&sctx->ra, sctx->cur_inode->i_mapping); /* * It's very likely there are no pages from this inode in the page * cache, so after reading extents and sending their data, we clean * the page cache to avoid trashing the page cache (adding pressure * to the page cache and forcing eviction of other data more useful * for applications). * * We decide if we should clean the page cache simply by checking * if the inode's mapping nrpages is 0 when we first open it, and * not by using something like filemap_range_has_page() before * reading an extent because when we ask the readahead code to * read a given file range, it may (and almost always does) read * pages from beyond that range (see the documentation for * page_cache_sync_readahead()), so it would not be reliable, * because after reading the first extent future calls to * filemap_range_has_page() would return true because the readahead * on the previous extent resulted in reading pages of the current * extent as well. */ sctx->clean_page_cache = (sctx->cur_inode->i_mapping->nrpages == 0); sctx->page_cache_clear_start = round_down(offset, PAGE_SIZE); } while (sent < len) { u64 size = min(len - sent, read_size); int ret; ret = send_write(sctx, offset + sent, size); if (ret < 0) return ret; sent += size; } if (sctx->clean_page_cache && PAGE_ALIGNED(end)) { /* * Always operate only on ranges that are a multiple of the page * size. This is not only to prevent zeroing parts of a page in * the case of subpage sector size, but also to guarantee we evict * pages, as passing a range that is smaller than page size does * not evict the respective page (only zeroes part of its content). * * Always start from the end offset of the last range cleared. * This is because the readahead code may (and very often does) * reads pages beyond the range we request for readahead. So if * we have an extent layout like this: * * [ extent A ] [ extent B ] [ extent C ] * * When we ask page_cache_sync_readahead() to read extent A, it * may also trigger reads for pages of extent B. If we are doing * an incremental send and extent B has not changed between the * parent and send snapshots, some or all of its pages may end * up being read and placed in the page cache. So when truncating * the page cache we always start from the end offset of the * previously processed extent up to the end of the current * extent. */ truncate_inode_pages_range(&sctx->cur_inode->i_data, sctx->page_cache_clear_start, end - 1); sctx->page_cache_clear_start = end; } return 0; } /* * Search for a capability xattr related to sctx->cur_ino. If the capability is * found, call send_set_xattr function to emit it. * * Return 0 if there isn't a capability, or when the capability was emitted * successfully, or < 0 if an error occurred. */ static int send_capabilities(struct send_ctx *sctx) { struct fs_path *fspath = NULL; struct btrfs_path *path; struct btrfs_dir_item *di; struct extent_buffer *leaf; unsigned long data_ptr; char *buf = NULL; int buf_len; int ret = 0; path = alloc_path_for_send(); if (!path) return -ENOMEM; di = btrfs_lookup_xattr(NULL, sctx->send_root, path, sctx->cur_ino, XATTR_NAME_CAPS, strlen(XATTR_NAME_CAPS), 0); if (!di) { /* There is no xattr for this inode */ goto out; } else if (IS_ERR(di)) { ret = PTR_ERR(di); goto out; } leaf = path->nodes[0]; buf_len = btrfs_dir_data_len(leaf, di); fspath = fs_path_alloc(); buf = kmalloc(buf_len, GFP_KERNEL); if (!fspath || !buf) { ret = -ENOMEM; goto out; } ret = get_cur_path(sctx, sctx->cur_ino, sctx->cur_inode_gen, fspath); if (ret < 0) goto out; data_ptr = (unsigned long)(di + 1) + btrfs_dir_name_len(leaf, di); read_extent_buffer(leaf, buf, data_ptr, buf_len); ret = send_set_xattr(sctx, fspath, XATTR_NAME_CAPS, strlen(XATTR_NAME_CAPS), buf, buf_len); out: kfree(buf); fs_path_free(fspath); btrfs_free_path(path); return ret; } static int clone_range(struct send_ctx *sctx, struct btrfs_path *dst_path, struct clone_root *clone_root, const u64 disk_byte, u64 data_offset, u64 offset, u64 len) { struct btrfs_path *path; struct btrfs_key key; int ret; struct btrfs_inode_info info; u64 clone_src_i_size = 0; /* * Prevent cloning from a zero offset with a length matching the sector * size because in some scenarios this will make the receiver fail. * * For example, if in the source filesystem the extent at offset 0 * has a length of sectorsize and it was written using direct IO, then * it can never be an inline extent (even if compression is enabled). * Then this extent can be cloned in the original filesystem to a non * zero file offset, but it may not be possible to clone in the * destination filesystem because it can be inlined due to compression * on the destination filesystem (as the receiver's write operations are * always done using buffered IO). The same happens when the original * filesystem does not have compression enabled but the destination * filesystem has. */ if (clone_root->offset == 0 && len == sctx->send_root->fs_info->sectorsize) return send_extent_data(sctx, dst_path, offset, len); path = alloc_path_for_send(); if (!path) return -ENOMEM; /* * There are inodes that have extents that lie behind its i_size. Don't * accept clones from these extents. */ ret = get_inode_info(clone_root->root, clone_root->ino, &info); btrfs_release_path(path); if (ret < 0) goto out; clone_src_i_size = info.size; /* * We can't send a clone operation for the entire range if we find * extent items in the respective range in the source file that * refer to different extents or if we find holes. * So check for that and do a mix of clone and regular write/copy * operations if needed. * * Example: * * mkfs.btrfs -f /dev/sda * mount /dev/sda /mnt * xfs_io -f -c "pwrite -S 0xaa 0K 100K" /mnt/foo * cp --reflink=always /mnt/foo /mnt/bar * xfs_io -c "pwrite -S 0xbb 50K 50K" /mnt/foo * btrfs subvolume snapshot -r /mnt /mnt/snap * * If when we send the snapshot and we are processing file bar (which * has a higher inode number than foo) we blindly send a clone operation * for the [0, 100K[ range from foo to bar, the receiver ends up getting * a file bar that matches the content of file foo - iow, doesn't match * the content from bar in the original filesystem. */ key.objectid = clone_root->ino; key.type = BTRFS_EXTENT_DATA_KEY; key.offset = clone_root->offset; ret = btrfs_search_slot(NULL, clone_root->root, &key, path, 0, 0); if (ret < 0) goto out; if (ret > 0 && path->slots[0] > 0) { btrfs_item_key_to_cpu(path->nodes[0], &key, path->slots[0] - 1); if (key.objectid == clone_root->ino && key.type == BTRFS_EXTENT_DATA_KEY) path->slots[0]--; } while (true) { struct extent_buffer *leaf = path->nodes[0]; int slot = path->slots[0]; struct btrfs_file_extent_item *ei; u8 type; u64 ext_len; u64 clone_len; u64 clone_data_offset; bool crossed_src_i_size = false; if (slot >= btrfs_header_nritems(leaf)) { ret = btrfs_next_leaf(clone_root->root, path); if (ret < 0) goto out; else if (ret > 0) break; continue; } btrfs_item_key_to_cpu(leaf, &key, slot); /* * We might have an implicit trailing hole (NO_HOLES feature * enabled). We deal with it after leaving this loop. */ if (key.objectid != clone_root->ino || key.type != BTRFS_EXTENT_DATA_KEY) break; ei = btrfs_item_ptr(leaf, slot, struct btrfs_file_extent_item); type = btrfs_file_extent_type(leaf, ei); if (type == BTRFS_FILE_EXTENT_INLINE) { ext_len = btrfs_file_extent_ram_bytes(leaf, ei); ext_len = PAGE_ALIGN(ext_len); } else { ext_len = btrfs_file_extent_num_bytes(leaf, ei); } if (key.offset + ext_len <= clone_root->offset) goto next; if (key.offset > clone_root->offset) { /* Implicit hole, NO_HOLES feature enabled. */ u64 hole_len = key.offset - clone_root->offset; if (hole_len > len) hole_len = len; ret = send_extent_data(sctx, dst_path, offset, hole_len); if (ret < 0) goto out; len -= hole_len; if (len == 0) break; offset += hole_len; clone_root->offset += hole_len; data_offset += hole_len; } if (key.offset >= clone_root->offset + len) break; if (key.offset >= clone_src_i_size) break; if (key.offset + ext_len > clone_src_i_size) { ext_len = clone_src_i_size - key.offset; crossed_src_i_size = true; } clone_data_offset = btrfs_file_extent_offset(leaf, ei); if (btrfs_file_extent_disk_bytenr(leaf, ei) == disk_byte) { clone_root->offset = key.offset; if (clone_data_offset < data_offset && clone_data_offset + ext_len > data_offset) { u64 extent_offset; extent_offset = data_offset - clone_data_offset; ext_len -= extent_offset; clone_data_offset += extent_offset; clone_root->offset += extent_offset; } } clone_len = min_t(u64, ext_len, len); if (btrfs_file_extent_disk_bytenr(leaf, ei) == disk_byte && clone_data_offset == data_offset) { const u64 src_end = clone_root->offset + clone_len; const u64 sectorsize = SZ_64K; /* * We can't clone the last block, when its size is not * sector size aligned, into the middle of a file. If we * do so, the receiver will get a failure (-EINVAL) when * trying to clone or will silently corrupt the data in * the destination file if it's on a kernel without the * fix introduced by commit ac765f83f1397646 * ("Btrfs: fix data corruption due to cloning of eof * block). * * So issue a clone of the aligned down range plus a * regular write for the eof block, if we hit that case. * * Also, we use the maximum possible sector size, 64K, * because we don't know what's the sector size of the * filesystem that receives the stream, so we have to * assume the largest possible sector size. */ if (src_end == clone_src_i_size && !IS_ALIGNED(src_end, sectorsize) && offset + clone_len < sctx->cur_inode_size) { u64 slen; slen = ALIGN_DOWN(src_end - clone_root->offset, sectorsize); if (slen > 0) { ret = send_clone(sctx, offset, slen, clone_root); if (ret < 0) goto out; } ret = send_extent_data(sctx, dst_path, offset + slen, clone_len - slen); } else { ret = send_clone(sctx, offset, clone_len, clone_root); } } else if (crossed_src_i_size && clone_len < len) { /* * If we are at i_size of the clone source inode and we * can not clone from it, terminate the loop. This is * to avoid sending two write operations, one with a * length matching clone_len and the final one after * this loop with a length of len - clone_len. * * When using encoded writes (BTRFS_SEND_FLAG_COMPRESSED * was passed to the send ioctl), this helps avoid * sending an encoded write for an offset that is not * sector size aligned, in case the i_size of the source * inode is not sector size aligned. That will make the * receiver fallback to decompression of the data and * writing it using regular buffered IO, therefore while * not incorrect, it's not optimal due decompression and * possible re-compression at the receiver. */ break; } else { ret = send_extent_data(sctx, dst_path, offset, clone_len); } if (ret < 0) goto out; len -= clone_len; if (len == 0) break; offset += clone_len; clone_root->offset += clone_len; /* * If we are cloning from the file we are currently processing, * and using the send root as the clone root, we must stop once * the current clone offset reaches the current eof of the file * at the receiver, otherwise we would issue an invalid clone * operation (source range going beyond eof) and cause the * receiver to fail. So if we reach the current eof, bail out * and fallback to a regular write. */ if (clone_root->root == sctx->send_root && clone_root->ino == sctx->cur_ino && clone_root->offset >= sctx->cur_inode_next_write_offset) break; data_offset += clone_len; next: path->slots[0]++; } if (len > 0) ret = send_extent_data(sctx, dst_path, offset, len); else ret = 0; out: btrfs_free_path(path); return ret; } static int send_write_or_clone(struct send_ctx *sctx, struct btrfs_path *path, struct btrfs_key *key, struct clone_root *clone_root) { int ret = 0; u64 offset = key->offset; u64 end; u64 bs = sctx->send_root->fs_info->sb->s_blocksize; end = min_t(u64, btrfs_file_extent_end(path), sctx->cur_inode_size); if (offset >= end) return 0; if (clone_root && IS_ALIGNED(end, bs)) { struct btrfs_file_extent_item *ei; u64 disk_byte; u64 data_offset; ei = btrfs_item_ptr(path->nodes[0], path->slots[0], struct btrfs_file_extent_item); disk_byte = btrfs_file_extent_disk_bytenr(path->nodes[0], ei); data_offset = btrfs_file_extent_offset(path->nodes[0], ei); ret = clone_range(sctx, path, clone_root, disk_byte, data_offset, offset, end - offset); } else { ret = send_extent_data(sctx, path, offset, end - offset); } sctx->cur_inode_next_write_offset = end; return ret; } static int is_extent_unchanged(struct send_ctx *sctx, struct btrfs_path *left_path, struct btrfs_key *ekey) { int ret = 0; struct btrfs_key key; struct btrfs_path *path = NULL; struct extent_buffer *eb; int slot; struct btrfs_key found_key; struct btrfs_file_extent_item *ei; u64 left_disknr; u64 right_disknr; u64 left_offset; u64 right_offset; u64 left_offset_fixed; u64 left_len; u64 right_len; u64 left_gen; u64 right_gen; u8 left_type; u8 right_type; path = alloc_path_for_send(); if (!path) return -ENOMEM; eb = left_path->nodes[0]; slot = left_path->slots[0]; ei = btrfs_item_ptr(eb, slot, struct btrfs_file_extent_item); left_type = btrfs_file_extent_type(eb, ei); if (left_type != BTRFS_FILE_EXTENT_REG) { ret = 0; goto out; } left_disknr = btrfs_file_extent_disk_bytenr(eb, ei); left_len = btrfs_file_extent_num_bytes(eb, ei); left_offset = btrfs_file_extent_offset(eb, ei); left_gen = btrfs_file_extent_generation(eb, ei); /* * Following comments will refer to these graphics. L is the left * extents which we are checking at the moment. 1-8 are the right * extents that we iterate. * * |-----L-----| * |-1-|-2a-|-3-|-4-|-5-|-6-| * * |-----L-----| * |--1--|-2b-|...(same as above) * * Alternative situation. Happens on files where extents got split. * |-----L-----| * |-----------7-----------|-6-| * * Alternative situation. Happens on files which got larger. * |-----L-----| * |-8-| * Nothing follows after 8. */ key.objectid = ekey->objectid; key.type = BTRFS_EXTENT_DATA_KEY; key.offset = ekey->offset; ret = btrfs_search_slot_for_read(sctx->parent_root, &key, path, 0, 0); if (ret < 0) goto out; if (ret) { ret = 0; goto out; } /* * Handle special case where the right side has no extents at all. */ eb = path->nodes[0]; slot = path->slots[0]; btrfs_item_key_to_cpu(eb, &found_key, slot); if (found_key.objectid != key.objectid || found_key.type != key.type) { /* If we're a hole then just pretend nothing changed */ ret = (left_disknr) ? 0 : 1; goto out; } /* * We're now on 2a, 2b or 7. */ key = found_key; while (key.offset < ekey->offset + left_len) { ei = btrfs_item_ptr(eb, slot, struct btrfs_file_extent_item); right_type = btrfs_file_extent_type(eb, ei); if (right_type != BTRFS_FILE_EXTENT_REG && right_type != BTRFS_FILE_EXTENT_INLINE) { ret = 0; goto out; } if (right_type == BTRFS_FILE_EXTENT_INLINE) { right_len = btrfs_file_extent_ram_bytes(eb, ei); right_len = PAGE_ALIGN(right_len); } else { right_len = btrfs_file_extent_num_bytes(eb, ei); } /* * Are we at extent 8? If yes, we know the extent is changed. * This may only happen on the first iteration. */ if (found_key.offset + right_len <= ekey->offset) { /* If we're a hole just pretend nothing changed */ ret = (left_disknr) ? 0 : 1; goto out; } /* * We just wanted to see if when we have an inline extent, what * follows it is a regular extent (wanted to check the above * condition for inline extents too). This should normally not * happen but it's possible for example when we have an inline * compressed extent representing data with a size matching * the page size (currently the same as sector size). */ if (right_type == BTRFS_FILE_EXTENT_INLINE) { ret = 0; goto out; } right_disknr = btrfs_file_extent_disk_bytenr(eb, ei); right_offset = btrfs_file_extent_offset(eb, ei); right_gen = btrfs_file_extent_generation(eb, ei); left_offset_fixed = left_offset; if (key.offset < ekey->offset) { /* Fix the right offset for 2a and 7. */ right_offset += ekey->offset - key.offset; } else { /* Fix the left offset for all behind 2a and 2b */ left_offset_fixed += key.offset - ekey->offset; } /* * Check if we have the same extent. */ if (left_disknr != right_disknr || left_offset_fixed != right_offset || left_gen != right_gen) { ret = 0; goto out; } /* * Go to the next extent. */ ret = btrfs_next_item(sctx->parent_root, path); if (ret < 0) goto out; if (!ret) { eb = path->nodes[0]; slot = path->slots[0]; btrfs_item_key_to_cpu(eb, &found_key, slot); } if (ret || found_key.objectid != key.objectid || found_key.type != key.type) { key.offset += right_len; break; } if (found_key.offset != key.offset + right_len) { ret = 0; goto out; } key = found_key; } /* * We're now behind the left extent (treat as unchanged) or at the end * of the right side (treat as changed). */ if (key.offset >= ekey->offset + left_len) ret = 1; else ret = 0; out: btrfs_free_path(path); return ret; } static int get_last_extent(struct send_ctx *sctx, u64 offset) { struct btrfs_path *path; struct btrfs_root *root = sctx->send_root; struct btrfs_key key; int ret; path = alloc_path_for_send(); if (!path) return -ENOMEM; sctx->cur_inode_last_extent = 0; key.objectid = sctx->cur_ino; key.type = BTRFS_EXTENT_DATA_KEY; key.offset = offset; ret = btrfs_search_slot_for_read(root, &key, path, 0, 1); if (ret < 0) goto out; ret = 0; btrfs_item_key_to_cpu(path->nodes[0], &key, path->slots[0]); if (key.objectid != sctx->cur_ino || key.type != BTRFS_EXTENT_DATA_KEY) goto out; sctx->cur_inode_last_extent = btrfs_file_extent_end(path); out: btrfs_free_path(path); return ret; } static int range_is_hole_in_parent(struct send_ctx *sctx, const u64 start, const u64 end) { struct btrfs_path *path; struct btrfs_key key; struct btrfs_root *root = sctx->parent_root; u64 search_start = start; int ret; path = alloc_path_for_send(); if (!path) return -ENOMEM; key.objectid = sctx->cur_ino; key.type = BTRFS_EXTENT_DATA_KEY; key.offset = search_start; ret = btrfs_search_slot(NULL, root, &key, path, 0, 0); if (ret < 0) goto out; if (ret > 0 && path->slots[0] > 0) path->slots[0]--; while (search_start < end) { struct extent_buffer *leaf = path->nodes[0]; int slot = path->slots[0]; struct btrfs_file_extent_item *fi; u64 extent_end; if (slot >= btrfs_header_nritems(leaf)) { ret = btrfs_next_leaf(root, path); if (ret < 0) goto out; else if (ret > 0) break; continue; } btrfs_item_key_to_cpu(leaf, &key, slot); if (key.objectid < sctx->cur_ino || key.type < BTRFS_EXTENT_DATA_KEY) goto next; if (key.objectid > sctx->cur_ino || key.type > BTRFS_EXTENT_DATA_KEY || key.offset >= end) break; fi = btrfs_item_ptr(leaf, slot, struct btrfs_file_extent_item); extent_end = btrfs_file_extent_end(path); if (extent_end <= start) goto next; if (btrfs_file_extent_disk_bytenr(leaf, fi) == 0) { search_start = extent_end; goto next; } ret = 0; goto out; next: path->slots[0]++; } ret = 1; out: btrfs_free_path(path); return ret; } static int maybe_send_hole(struct send_ctx *sctx, struct btrfs_path *path, struct btrfs_key *key) { int ret = 0; if (sctx->cur_ino != key->objectid || !need_send_hole(sctx)) return 0; if (sctx->cur_inode_last_extent == (u64)-1) { ret = get_last_extent(sctx, key->offset - 1); if (ret) return ret; } if (path->slots[0] == 0 && sctx->cur_inode_last_extent < key->offset) { /* * We might have skipped entire leafs that contained only * file extent items for our current inode. These leafs have * a generation number smaller (older) than the one in the * current leaf and the leaf our last extent came from, and * are located between these 2 leafs. */ ret = get_last_extent(sctx, key->offset - 1); if (ret) return ret; } if (sctx->cur_inode_last_extent < key->offset) { ret = range_is_hole_in_parent(sctx, sctx->cur_inode_last_extent, key->offset); if (ret < 0) return ret; else if (ret == 0) ret = send_hole(sctx, key->offset); else ret = 0; } sctx->cur_inode_last_extent = btrfs_file_extent_end(path); return ret; } static int process_extent(struct send_ctx *sctx, struct btrfs_path *path, struct btrfs_key *key) { struct clone_root *found_clone = NULL; int ret = 0; if (S_ISLNK(sctx->cur_inode_mode)) return 0; if (sctx->parent_root && !sctx->cur_inode_new) { ret = is_extent_unchanged(sctx, path, key); if (ret < 0) goto out; if (ret) { ret = 0; goto out_hole; } } else { struct btrfs_file_extent_item *ei; u8 type; ei = btrfs_item_ptr(path->nodes[0], path->slots[0], struct btrfs_file_extent_item); type = btrfs_file_extent_type(path->nodes[0], ei); if (type == BTRFS_FILE_EXTENT_PREALLOC || type == BTRFS_FILE_EXTENT_REG) { /* * The send spec does not have a prealloc command yet, * so just leave a hole for prealloc'ed extents until * we have enough commands queued up to justify rev'ing * the send spec. */ if (type == BTRFS_FILE_EXTENT_PREALLOC) { ret = 0; goto out; } /* Have a hole, just skip it. */ if (btrfs_file_extent_disk_bytenr(path->nodes[0], ei) == 0) { ret = 0; goto out; } } } ret = find_extent_clone(sctx, path, key->objectid, key->offset, sctx->cur_inode_size, &found_clone); if (ret != -ENOENT && ret < 0) goto out; ret = send_write_or_clone(sctx, path, key, found_clone); if (ret) goto out; out_hole: ret = maybe_send_hole(sctx, path, key); out: return ret; } static int process_all_extents(struct send_ctx *sctx) { int ret = 0; int iter_ret = 0; struct btrfs_root *root; struct btrfs_path *path; struct btrfs_key key; struct btrfs_key found_key; root = sctx->send_root; path = alloc_path_for_send(); if (!path) return -ENOMEM; key.objectid = sctx->cmp_key->objectid; key.type = BTRFS_EXTENT_DATA_KEY; key.offset = 0; btrfs_for_each_slot(root, &key, &found_key, path, iter_ret) { if (found_key.objectid != key.objectid || found_key.type != key.type) { ret = 0; break; } ret = process_extent(sctx, path, &found_key); if (ret < 0) break; } /* Catch error found during iteration */ if (iter_ret < 0) ret = iter_ret; btrfs_free_path(path); return ret; } static int process_recorded_refs_if_needed(struct send_ctx *sctx, int at_end, int *pending_move, int *refs_processed) { int ret = 0; if (sctx->cur_ino == 0) goto out; if (!at_end && sctx->cur_ino == sctx->cmp_key->objectid && sctx->cmp_key->type <= BTRFS_INODE_EXTREF_KEY) goto out; if (list_empty(&sctx->new_refs) && list_empty(&sctx->deleted_refs)) goto out; ret = process_recorded_refs(sctx, pending_move); if (ret < 0) goto out; *refs_processed = 1; out: return ret; } static int finish_inode_if_needed(struct send_ctx *sctx, int at_end) { int ret = 0; struct btrfs_inode_info info; u64 left_mode; u64 left_uid; u64 left_gid; u64 left_fileattr; u64 right_mode; u64 right_uid; u64 right_gid; u64 right_fileattr; int need_chmod = 0; int need_chown = 0; bool need_fileattr = false; int need_truncate = 1; int pending_move = 0; int refs_processed = 0; if (sctx->ignore_cur_inode) return 0; ret = process_recorded_refs_if_needed(sctx, at_end, &pending_move, &refs_processed); if (ret < 0) goto out; /* * We have processed the refs and thus need to advance send_progress. * Now, calls to get_cur_xxx will take the updated refs of the current * inode into account. * * On the other hand, if our current inode is a directory and couldn't * be moved/renamed because its parent was renamed/moved too and it has * a higher inode number, we can only move/rename our current inode * after we moved/renamed its parent. Therefore in this case operate on * the old path (pre move/rename) of our current inode, and the * move/rename will be performed later. */ if (refs_processed && !pending_move) sctx->send_progress = sctx->cur_ino + 1; if (sctx->cur_ino == 0 || sctx->cur_inode_deleted) goto out; if (!at_end && sctx->cmp_key->objectid == sctx->cur_ino) goto out; ret = get_inode_info(sctx->send_root, sctx->cur_ino, &info); if (ret < 0) goto out; left_mode = info.mode; left_uid = info.uid; left_gid = info.gid; left_fileattr = info.fileattr; if (!sctx->parent_root || sctx->cur_inode_new) { need_chown = 1; if (!S_ISLNK(sctx->cur_inode_mode)) need_chmod = 1; if (sctx->cur_inode_next_write_offset == sctx->cur_inode_size) need_truncate = 0; } else { u64 old_size; ret = get_inode_info(sctx->parent_root, sctx->cur_ino, &info); if (ret < 0) goto out; old_size = info.size; right_mode = info.mode; right_uid = info.uid; right_gid = info.gid; right_fileattr = info.fileattr; if (left_uid != right_uid || left_gid != right_gid) need_chown = 1; if (!S_ISLNK(sctx->cur_inode_mode) && left_mode != right_mode) need_chmod = 1; if (!S_ISLNK(sctx->cur_inode_mode) && left_fileattr != right_fileattr) need_fileattr = true; if ((old_size == sctx->cur_inode_size) || (sctx->cur_inode_size > old_size && sctx->cur_inode_next_write_offset == sctx->cur_inode_size)) need_truncate = 0; } if (S_ISREG(sctx->cur_inode_mode)) { if (need_send_hole(sctx)) { if (sctx->cur_inode_last_extent == (u64)-1 || sctx->cur_inode_last_extent < sctx->cur_inode_size) { ret = get_last_extent(sctx, (u64)-1); if (ret) goto out; } if (sctx->cur_inode_last_extent < sctx->cur_inode_size) { ret = send_hole(sctx, sctx->cur_inode_size); if (ret) goto out; } } if (need_truncate) { ret = send_truncate(sctx, sctx->cur_ino, sctx->cur_inode_gen, sctx->cur_inode_size); if (ret < 0) goto out; } } if (need_chown) { ret = send_chown(sctx, sctx->cur_ino, sctx->cur_inode_gen, left_uid, left_gid); if (ret < 0) goto out; } if (need_chmod) { ret = send_chmod(sctx, sctx->cur_ino, sctx->cur_inode_gen, left_mode); if (ret < 0) goto out; } if (need_fileattr) { ret = send_fileattr(sctx, sctx->cur_ino, sctx->cur_inode_gen, left_fileattr); if (ret < 0) goto out; } if (proto_cmd_ok(sctx, BTRFS_SEND_C_ENABLE_VERITY) && sctx->cur_inode_needs_verity) { ret = process_verity(sctx); if (ret < 0) goto out; } ret = send_capabilities(sctx); if (ret < 0) goto out; /* * If other directory inodes depended on our current directory * inode's move/rename, now do their move/rename operations. */ if (!is_waiting_for_move(sctx, sctx->cur_ino)) { ret = apply_children_dir_moves(sctx); if (ret) goto out; /* * Need to send that every time, no matter if it actually * changed between the two trees as we have done changes to * the inode before. If our inode is a directory and it's * waiting to be moved/renamed, we will send its utimes when * it's moved/renamed, therefore we don't need to do it here. */ sctx->send_progress = sctx->cur_ino + 1; /* * If the current inode is a non-empty directory, delay issuing * the utimes command for it, as it's very likely we have inodes * with an higher number inside it. We want to issue the utimes * command only after adding all dentries to it. */ if (S_ISDIR(sctx->cur_inode_mode) && sctx->cur_inode_size > 0) ret = cache_dir_utimes(sctx, sctx->cur_ino, sctx->cur_inode_gen); else ret = send_utimes(sctx, sctx->cur_ino, sctx->cur_inode_gen); if (ret < 0) goto out; } out: if (!ret) ret = trim_dir_utimes_cache(sctx); return ret; } static void close_current_inode(struct send_ctx *sctx) { u64 i_size; if (sctx->cur_inode == NULL) return; i_size = i_size_read(sctx->cur_inode); /* * If we are doing an incremental send, we may have extents between the * last processed extent and the i_size that have not been processed * because they haven't changed but we may have read some of their pages * through readahead, see the comments at send_extent_data(). */ if (sctx->clean_page_cache && sctx->page_cache_clear_start < i_size) truncate_inode_pages_range(&sctx->cur_inode->i_data, sctx->page_cache_clear_start, round_up(i_size, PAGE_SIZE) - 1); iput(sctx->cur_inode); sctx->cur_inode = NULL; } static int changed_inode(struct send_ctx *sctx, enum btrfs_compare_tree_result result) { int ret = 0; struct btrfs_key *key = sctx->cmp_key; struct btrfs_inode_item *left_ii = NULL; struct btrfs_inode_item *right_ii = NULL; u64 left_gen = 0; u64 right_gen = 0; close_current_inode(sctx); sctx->cur_ino = key->objectid; sctx->cur_inode_new_gen = false; sctx->cur_inode_last_extent = (u64)-1; sctx->cur_inode_next_write_offset = 0; sctx->ignore_cur_inode = false; /* * Set send_progress to current inode. This will tell all get_cur_xxx * functions that the current inode's refs are not updated yet. Later, * when process_recorded_refs is finished, it is set to cur_ino + 1. */ sctx->send_progress = sctx->cur_ino; if (result == BTRFS_COMPARE_TREE_NEW || result == BTRFS_COMPARE_TREE_CHANGED) { left_ii = btrfs_item_ptr(sctx->left_path->nodes[0], sctx->left_path->slots[0], struct btrfs_inode_item); left_gen = btrfs_inode_generation(sctx->left_path->nodes[0], left_ii); } else { right_ii = btrfs_item_ptr(sctx->right_path->nodes[0], sctx->right_path->slots[0], struct btrfs_inode_item); right_gen = btrfs_inode_generation(sctx->right_path->nodes[0], right_ii); } if (result == BTRFS_COMPARE_TREE_CHANGED) { right_ii = btrfs_item_ptr(sctx->right_path->nodes[0], sctx->right_path->slots[0], struct btrfs_inode_item); right_gen = btrfs_inode_generation(sctx->right_path->nodes[0], right_ii); /* * The cur_ino = root dir case is special here. We can't treat * the inode as deleted+reused because it would generate a * stream that tries to delete/mkdir the root dir. */ if (left_gen != right_gen && sctx->cur_ino != BTRFS_FIRST_FREE_OBJECTID) sctx->cur_inode_new_gen = true; } /* * Normally we do not find inodes with a link count of zero (orphans) * because the most common case is to create a snapshot and use it * for a send operation. However other less common use cases involve * using a subvolume and send it after turning it to RO mode just * after deleting all hard links of a file while holding an open * file descriptor against it or turning a RO snapshot into RW mode, * keep an open file descriptor against a file, delete it and then * turn the snapshot back to RO mode before using it for a send * operation. The former is what the receiver operation does. * Therefore, if we want to send these snapshots soon after they're * received, we need to handle orphan inodes as well. Moreover, orphans * can appear not only in the send snapshot but also in the parent * snapshot. Here are several cases: * * Case 1: BTRFS_COMPARE_TREE_NEW * | send snapshot | action * -------------------------------- * nlink | 0 | ignore * * Case 2: BTRFS_COMPARE_TREE_DELETED * | parent snapshot | action * ---------------------------------- * nlink | 0 | as usual * Note: No unlinks will be sent because there're no paths for it. * * Case 3: BTRFS_COMPARE_TREE_CHANGED * | | parent snapshot | send snapshot | action * ----------------------------------------------------------------------- * subcase 1 | nlink | 0 | 0 | ignore * subcase 2 | nlink | >0 | 0 | new_gen(deletion) * subcase 3 | nlink | 0 | >0 | new_gen(creation) * */ if (result == BTRFS_COMPARE_TREE_NEW) { if (btrfs_inode_nlink(sctx->left_path->nodes[0], left_ii) == 0) { sctx->ignore_cur_inode = true; goto out; } sctx->cur_inode_gen = left_gen; sctx->cur_inode_new = true; sctx->cur_inode_deleted = false; sctx->cur_inode_size = btrfs_inode_size( sctx->left_path->nodes[0], left_ii); sctx->cur_inode_mode = btrfs_inode_mode( sctx->left_path->nodes[0], left_ii); sctx->cur_inode_rdev = btrfs_inode_rdev( sctx->left_path->nodes[0], left_ii); if (sctx->cur_ino != BTRFS_FIRST_FREE_OBJECTID) ret = send_create_inode_if_needed(sctx); } else if (result == BTRFS_COMPARE_TREE_DELETED) { sctx->cur_inode_gen = right_gen; sctx->cur_inode_new = false; sctx->cur_inode_deleted = true; sctx->cur_inode_size = btrfs_inode_size( sctx->right_path->nodes[0], right_ii); sctx->cur_inode_mode = btrfs_inode_mode( sctx->right_path->nodes[0], right_ii); } else if (result == BTRFS_COMPARE_TREE_CHANGED) { u32 new_nlinks, old_nlinks; new_nlinks = btrfs_inode_nlink(sctx->left_path->nodes[0], left_ii); old_nlinks = btrfs_inode_nlink(sctx->right_path->nodes[0], right_ii); if (new_nlinks == 0 && old_nlinks == 0) { sctx->ignore_cur_inode = true; goto out; } else if (new_nlinks == 0 || old_nlinks == 0) { sctx->cur_inode_new_gen = 1; } /* * We need to do some special handling in case the inode was * reported as changed with a changed generation number. This * means that the original inode was deleted and new inode * reused the same inum. So we have to treat the old inode as * deleted and the new one as new. */ if (sctx->cur_inode_new_gen) { /* * First, process the inode as if it was deleted. */ if (old_nlinks > 0) { sctx->cur_inode_gen = right_gen; sctx->cur_inode_new = false; sctx->cur_inode_deleted = true; sctx->cur_inode_size = btrfs_inode_size( sctx->right_path->nodes[0], right_ii); sctx->cur_inode_mode = btrfs_inode_mode( sctx->right_path->nodes[0], right_ii); ret = process_all_refs(sctx, BTRFS_COMPARE_TREE_DELETED); if (ret < 0) goto out; } /* * Now process the inode as if it was new. */ if (new_nlinks > 0) { sctx->cur_inode_gen = left_gen; sctx->cur_inode_new = true; sctx->cur_inode_deleted = false; sctx->cur_inode_size = btrfs_inode_size( sctx->left_path->nodes[0], left_ii); sctx->cur_inode_mode = btrfs_inode_mode( sctx->left_path->nodes[0], left_ii); sctx->cur_inode_rdev = btrfs_inode_rdev( sctx->left_path->nodes[0], left_ii); ret = send_create_inode_if_needed(sctx); if (ret < 0) goto out; ret = process_all_refs(sctx, BTRFS_COMPARE_TREE_NEW); if (ret < 0) goto out; /* * Advance send_progress now as we did not get * into process_recorded_refs_if_needed in the * new_gen case. */ sctx->send_progress = sctx->cur_ino + 1; /* * Now process all extents and xattrs of the * inode as if they were all new. */ ret = process_all_extents(sctx); if (ret < 0) goto out; ret = process_all_new_xattrs(sctx); if (ret < 0) goto out; } } else { sctx->cur_inode_gen = left_gen; sctx->cur_inode_new = false; sctx->cur_inode_new_gen = false; sctx->cur_inode_deleted = false; sctx->cur_inode_size = btrfs_inode_size( sctx->left_path->nodes[0], left_ii); sctx->cur_inode_mode = btrfs_inode_mode( sctx->left_path->nodes[0], left_ii); } } out: return ret; } /* * We have to process new refs before deleted refs, but compare_trees gives us * the new and deleted refs mixed. To fix this, we record the new/deleted refs * first and later process them in process_recorded_refs. * For the cur_inode_new_gen case, we skip recording completely because * changed_inode did already initiate processing of refs. The reason for this is * that in this case, compare_tree actually compares the refs of 2 different * inodes. To fix this, process_all_refs is used in changed_inode to handle all * refs of the right tree as deleted and all refs of the left tree as new. */ static int changed_ref(struct send_ctx *sctx, enum btrfs_compare_tree_result result) { int ret = 0; if (sctx->cur_ino != sctx->cmp_key->objectid) { inconsistent_snapshot_error(sctx, result, "reference"); return -EIO; } if (!sctx->cur_inode_new_gen && sctx->cur_ino != BTRFS_FIRST_FREE_OBJECTID) { if (result == BTRFS_COMPARE_TREE_NEW) ret = record_new_ref(sctx); else if (result == BTRFS_COMPARE_TREE_DELETED) ret = record_deleted_ref(sctx); else if (result == BTRFS_COMPARE_TREE_CHANGED) ret = record_changed_ref(sctx); } return ret; } /* * Process new/deleted/changed xattrs. We skip processing in the * cur_inode_new_gen case because changed_inode did already initiate processing * of xattrs. The reason is the same as in changed_ref */ static int changed_xattr(struct send_ctx *sctx, enum btrfs_compare_tree_result result) { int ret = 0; if (sctx->cur_ino != sctx->cmp_key->objectid) { inconsistent_snapshot_error(sctx, result, "xattr"); return -EIO; } if (!sctx->cur_inode_new_gen && !sctx->cur_inode_deleted) { if (result == BTRFS_COMPARE_TREE_NEW) ret = process_new_xattr(sctx); else if (result == BTRFS_COMPARE_TREE_DELETED) ret = process_deleted_xattr(sctx); else if (result == BTRFS_COMPARE_TREE_CHANGED) ret = process_changed_xattr(sctx); } return ret; } /* * Process new/deleted/changed extents. We skip processing in the * cur_inode_new_gen case because changed_inode did already initiate processing * of extents. The reason is the same as in changed_ref */ static int changed_extent(struct send_ctx *sctx, enum btrfs_compare_tree_result result) { int ret = 0; /* * We have found an extent item that changed without the inode item * having changed. This can happen either after relocation (where the * disk_bytenr of an extent item is replaced at * relocation.c:replace_file_extents()) or after deduplication into a * file in both the parent and send snapshots (where an extent item can * get modified or replaced with a new one). Note that deduplication * updates the inode item, but it only changes the iversion (sequence * field in the inode item) of the inode, so if a file is deduplicated * the same amount of times in both the parent and send snapshots, its * iversion becomes the same in both snapshots, whence the inode item is * the same on both snapshots. */ if (sctx->cur_ino != sctx->cmp_key->objectid) return 0; if (!sctx->cur_inode_new_gen && !sctx->cur_inode_deleted) { if (result != BTRFS_COMPARE_TREE_DELETED) ret = process_extent(sctx, sctx->left_path, sctx->cmp_key); } return ret; } static int changed_verity(struct send_ctx *sctx, enum btrfs_compare_tree_result result) { int ret = 0; if (!sctx->cur_inode_new_gen && !sctx->cur_inode_deleted) { if (result == BTRFS_COMPARE_TREE_NEW) sctx->cur_inode_needs_verity = true; } return ret; } static int dir_changed(struct send_ctx *sctx, u64 dir) { u64 orig_gen, new_gen; int ret; ret = get_inode_gen(sctx->send_root, dir, &new_gen); if (ret) return ret; ret = get_inode_gen(sctx->parent_root, dir, &orig_gen); if (ret) return ret; return (orig_gen != new_gen) ? 1 : 0; } static int compare_refs(struct send_ctx *sctx, struct btrfs_path *path, struct btrfs_key *key) { struct btrfs_inode_extref *extref; struct extent_buffer *leaf; u64 dirid = 0, last_dirid = 0; unsigned long ptr; u32 item_size; u32 cur_offset = 0; int ref_name_len; int ret = 0; /* Easy case, just check this one dirid */ if (key->type == BTRFS_INODE_REF_KEY) { dirid = key->offset; ret = dir_changed(sctx, dirid); goto out; } leaf = path->nodes[0]; item_size = btrfs_item_size(leaf, path->slots[0]); ptr = btrfs_item_ptr_offset(leaf, path->slots[0]); while (cur_offset < item_size) { extref = (struct btrfs_inode_extref *)(ptr + cur_offset); dirid = btrfs_inode_extref_parent(leaf, extref); ref_name_len = btrfs_inode_extref_name_len(leaf, extref); cur_offset += ref_name_len + sizeof(*extref); if (dirid == last_dirid) continue; ret = dir_changed(sctx, dirid); if (ret) break; last_dirid = dirid; } out: return ret; } /* * Updates compare related fields in sctx and simply forwards to the actual * changed_xxx functions. */ static int changed_cb(struct btrfs_path *left_path, struct btrfs_path *right_path, struct btrfs_key *key, enum btrfs_compare_tree_result result, struct send_ctx *sctx) { int ret = 0; /* * We can not hold the commit root semaphore here. This is because in * the case of sending and receiving to the same filesystem, using a * pipe, could result in a deadlock: * * 1) The task running send blocks on the pipe because it's full; * * 2) The task running receive, which is the only consumer of the pipe, * is waiting for a transaction commit (for example due to a space * reservation when doing a write or triggering a transaction commit * when creating a subvolume); * * 3) The transaction is waiting to write lock the commit root semaphore, * but can not acquire it since it's being held at 1). * * Down this call chain we write to the pipe through kernel_write(). * The same type of problem can also happen when sending to a file that * is stored in the same filesystem - when reserving space for a write * into the file, we can trigger a transaction commit. * * Our caller has supplied us with clones of leaves from the send and * parent roots, so we're safe here from a concurrent relocation and * further reallocation of metadata extents while we are here. Below we * also assert that the leaves are clones. */ lockdep_assert_not_held(&sctx->send_root->fs_info->commit_root_sem); /* * We always have a send root, so left_path is never NULL. We will not * have a leaf when we have reached the end of the send root but have * not yet reached the end of the parent root. */ if (left_path->nodes[0]) ASSERT(test_bit(EXTENT_BUFFER_UNMAPPED, &left_path->nodes[0]->bflags)); /* * When doing a full send we don't have a parent root, so right_path is * NULL. When doing an incremental send, we may have reached the end of * the parent root already, so we don't have a leaf at right_path. */ if (right_path && right_path->nodes[0]) ASSERT(test_bit(EXTENT_BUFFER_UNMAPPED, &right_path->nodes[0]->bflags)); if (result == BTRFS_COMPARE_TREE_SAME) { if (key->type == BTRFS_INODE_REF_KEY || key->type == BTRFS_INODE_EXTREF_KEY) { ret = compare_refs(sctx, left_path, key); if (!ret) return 0; if (ret < 0) return ret; } else if (key->type == BTRFS_EXTENT_DATA_KEY) { return maybe_send_hole(sctx, left_path, key); } else { return 0; } result = BTRFS_COMPARE_TREE_CHANGED; ret = 0; } sctx->left_path = left_path; sctx->right_path = right_path; sctx->cmp_key = key; ret = finish_inode_if_needed(sctx, 0); if (ret < 0) goto out; /* Ignore non-FS objects */ if (key->objectid == BTRFS_FREE_INO_OBJECTID || key->objectid == BTRFS_FREE_SPACE_OBJECTID) goto out; if (key->type == BTRFS_INODE_ITEM_KEY) { ret = changed_inode(sctx, result); } else if (!sctx->ignore_cur_inode) { if (key->type == BTRFS_INODE_REF_KEY || key->type == BTRFS_INODE_EXTREF_KEY) ret = changed_ref(sctx, result); else if (key->type == BTRFS_XATTR_ITEM_KEY) ret = changed_xattr(sctx, result); else if (key->type == BTRFS_EXTENT_DATA_KEY) ret = changed_extent(sctx, result); else if (key->type == BTRFS_VERITY_DESC_ITEM_KEY && key->offset == 0) ret = changed_verity(sctx, result); } out: return ret; } static int search_key_again(const struct send_ctx *sctx, struct btrfs_root *root, struct btrfs_path *path, const struct btrfs_key *key) { int ret; if (!path->need_commit_sem) lockdep_assert_held_read(&root->fs_info->commit_root_sem); /* * Roots used for send operations are readonly and no one can add, * update or remove keys from them, so we should be able to find our * key again. The only exception is deduplication, which can operate on * readonly roots and add, update or remove keys to/from them - but at * the moment we don't allow it to run in parallel with send. */ ret = btrfs_search_slot(NULL, root, key, path, 0, 0); ASSERT(ret <= 0); if (ret > 0) { btrfs_print_tree(path->nodes[path->lowest_level], false); btrfs_err(root->fs_info, "send: key (%llu %u %llu) not found in %s root %llu, lowest_level %d, slot %d", key->objectid, key->type, key->offset, (root == sctx->parent_root ? "parent" : "send"), root->root_key.objectid, path->lowest_level, path->slots[path->lowest_level]); return -EUCLEAN; } return ret; } static int full_send_tree(struct send_ctx *sctx) { int ret; struct btrfs_root *send_root = sctx->send_root; struct btrfs_key key; struct btrfs_fs_info *fs_info = send_root->fs_info; struct btrfs_path *path; path = alloc_path_for_send(); if (!path) return -ENOMEM; path->reada = READA_FORWARD_ALWAYS; key.objectid = BTRFS_FIRST_FREE_OBJECTID; key.type = BTRFS_INODE_ITEM_KEY; key.offset = 0; down_read(&fs_info->commit_root_sem); sctx->last_reloc_trans = fs_info->last_reloc_trans; up_read(&fs_info->commit_root_sem); ret = btrfs_search_slot_for_read(send_root, &key, path, 1, 0); if (ret < 0) goto out; if (ret) goto out_finish; while (1) { btrfs_item_key_to_cpu(path->nodes[0], &key, path->slots[0]); ret = changed_cb(path, NULL, &key, BTRFS_COMPARE_TREE_NEW, sctx); if (ret < 0) goto out; down_read(&fs_info->commit_root_sem); if (fs_info->last_reloc_trans > sctx->last_reloc_trans) { sctx->last_reloc_trans = fs_info->last_reloc_trans; up_read(&fs_info->commit_root_sem); /* * A transaction used for relocating a block group was * committed or is about to finish its commit. Release * our path (leaf) and restart the search, so that we * avoid operating on any file extent items that are * stale, with a disk_bytenr that reflects a pre * relocation value. This way we avoid as much as * possible to fallback to regular writes when checking * if we can clone file ranges. */ btrfs_release_path(path); ret = search_key_again(sctx, send_root, path, &key); if (ret < 0) goto out; } else { up_read(&fs_info->commit_root_sem); } ret = btrfs_next_item(send_root, path); if (ret < 0) goto out; if (ret) { ret = 0; break; } } out_finish: ret = finish_inode_if_needed(sctx, 1); out: btrfs_free_path(path); return ret; } static int replace_node_with_clone(struct btrfs_path *path, int level) { struct extent_buffer *clone; clone = btrfs_clone_extent_buffer(path->nodes[level]); if (!clone) return -ENOMEM; free_extent_buffer(path->nodes[level]); path->nodes[level] = clone; return 0; } static int tree_move_down(struct btrfs_path *path, int *level, u64 reada_min_gen) { struct extent_buffer *eb; struct extent_buffer *parent = path->nodes[*level]; int slot = path->slots[*level]; const int nritems = btrfs_header_nritems(parent); u64 reada_max; u64 reada_done = 0; lockdep_assert_held_read(&parent->fs_info->commit_root_sem); BUG_ON(*level == 0); eb = btrfs_read_node_slot(parent, slot); if (IS_ERR(eb)) return PTR_ERR(eb); /* * Trigger readahead for the next leaves we will process, so that it is * very likely that when we need them they are already in memory and we * will not block on disk IO. For nodes we only do readahead for one, * since the time window between processing nodes is typically larger. */ reada_max = (*level == 1 ? SZ_128K : eb->fs_info->nodesize); for (slot++; slot < nritems && reada_done < reada_max; slot++) { if (btrfs_node_ptr_generation(parent, slot) > reada_min_gen) { btrfs_readahead_node_child(parent, slot); reada_done += eb->fs_info->nodesize; } } path->nodes[*level - 1] = eb; path->slots[*level - 1] = 0; (*level)--; if (*level == 0) return replace_node_with_clone(path, 0); return 0; } static int tree_move_next_or_upnext(struct btrfs_path *path, int *level, int root_level) { int ret = 0; int nritems; nritems = btrfs_header_nritems(path->nodes[*level]); path->slots[*level]++; while (path->slots[*level] >= nritems) { if (*level == root_level) { path->slots[*level] = nritems - 1; return -1; } /* move upnext */ path->slots[*level] = 0; free_extent_buffer(path->nodes[*level]); path->nodes[*level] = NULL; (*level)++; path->slots[*level]++; nritems = btrfs_header_nritems(path->nodes[*level]); ret = 1; } return ret; } /* * Returns 1 if it had to move up and next. 0 is returned if it moved only next * or down. */ static int tree_advance(struct btrfs_path *path, int *level, int root_level, int allow_down, struct btrfs_key *key, u64 reada_min_gen) { int ret; if (*level == 0 || !allow_down) { ret = tree_move_next_or_upnext(path, level, root_level); } else { ret = tree_move_down(path, level, reada_min_gen); } /* * Even if we have reached the end of a tree, ret is -1, update the key * anyway, so that in case we need to restart due to a block group * relocation, we can assert that the last key of the root node still * exists in the tree. */ if (*level == 0) btrfs_item_key_to_cpu(path->nodes[*level], key, path->slots[*level]); else btrfs_node_key_to_cpu(path->nodes[*level], key, path->slots[*level]); return ret; } static int tree_compare_item(struct btrfs_path *left_path, struct btrfs_path *right_path, char *tmp_buf) { int cmp; int len1, len2; unsigned long off1, off2; len1 = btrfs_item_size(left_path->nodes[0], left_path->slots[0]); len2 = btrfs_item_size(right_path->nodes[0], right_path->slots[0]); if (len1 != len2) return 1; off1 = btrfs_item_ptr_offset(left_path->nodes[0], left_path->slots[0]); off2 = btrfs_item_ptr_offset(right_path->nodes[0], right_path->slots[0]); read_extent_buffer(left_path->nodes[0], tmp_buf, off1, len1); cmp = memcmp_extent_buffer(right_path->nodes[0], tmp_buf, off2, len1); if (cmp) return 1; return 0; } /* * A transaction used for relocating a block group was committed or is about to * finish its commit. Release our paths and restart the search, so that we are * not using stale extent buffers: * * 1) For levels > 0, we are only holding references of extent buffers, without * any locks on them, which does not prevent them from having been relocated * and reallocated after the last time we released the commit root semaphore. * The exception are the root nodes, for which we always have a clone, see * the comment at btrfs_compare_trees(); * * 2) For leaves, level 0, we are holding copies (clones) of extent buffers, so * we are safe from the concurrent relocation and reallocation. However they * can have file extent items with a pre relocation disk_bytenr value, so we * restart the start from the current commit roots and clone the new leaves so * that we get the post relocation disk_bytenr values. Not doing so, could * make us clone the wrong data in case there are new extents using the old * disk_bytenr that happen to be shared. */ static int restart_after_relocation(struct btrfs_path *left_path, struct btrfs_path *right_path, const struct btrfs_key *left_key, const struct btrfs_key *right_key, int left_level, int right_level, const struct send_ctx *sctx) { int root_level; int ret; lockdep_assert_held_read(&sctx->send_root->fs_info->commit_root_sem); btrfs_release_path(left_path); btrfs_release_path(right_path); /* * Since keys can not be added or removed to/from our roots because they * are readonly and we do not allow deduplication to run in parallel * (which can add, remove or change keys), the layout of the trees should * not change. */ left_path->lowest_level = left_level; ret = search_key_again(sctx, sctx->send_root, left_path, left_key); if (ret < 0) return ret; right_path->lowest_level = right_level; ret = search_key_again(sctx, sctx->parent_root, right_path, right_key); if (ret < 0) return ret; /* * If the lowest level nodes are leaves, clone them so that they can be * safely used by changed_cb() while not under the protection of the * commit root semaphore, even if relocation and reallocation happens in * parallel. */ if (left_level == 0) { ret = replace_node_with_clone(left_path, 0); if (ret < 0) return ret; } if (right_level == 0) { ret = replace_node_with_clone(right_path, 0); if (ret < 0) return ret; } /* * Now clone the root nodes (unless they happen to be the leaves we have * already cloned). This is to protect against concurrent snapshotting of * the send and parent roots (see the comment at btrfs_compare_trees()). */ root_level = btrfs_header_level(sctx->send_root->commit_root); if (root_level > 0) { ret = replace_node_with_clone(left_path, root_level); if (ret < 0) return ret; } root_level = btrfs_header_level(sctx->parent_root->commit_root); if (root_level > 0) { ret = replace_node_with_clone(right_path, root_level); if (ret < 0) return ret; } return 0; } /* * This function compares two trees and calls the provided callback for * every changed/new/deleted item it finds. * If shared tree blocks are encountered, whole subtrees are skipped, making * the compare pretty fast on snapshotted subvolumes. * * This currently works on commit roots only. As commit roots are read only, * we don't do any locking. The commit roots are protected with transactions. * Transactions are ended and rejoined when a commit is tried in between. * * This function checks for modifications done to the trees while comparing. * If it detects a change, it aborts immediately. */ static int btrfs_compare_trees(struct btrfs_root *left_root, struct btrfs_root *right_root, struct send_ctx *sctx) { struct btrfs_fs_info *fs_info = left_root->fs_info; int ret; int cmp; struct btrfs_path *left_path = NULL; struct btrfs_path *right_path = NULL; struct btrfs_key left_key; struct btrfs_key right_key; char *tmp_buf = NULL; int left_root_level; int right_root_level; int left_level; int right_level; int left_end_reached = 0; int right_end_reached = 0; int advance_left = 0; int advance_right = 0; u64 left_blockptr; u64 right_blockptr; u64 left_gen; u64 right_gen; u64 reada_min_gen; left_path = btrfs_alloc_path(); if (!left_path) { ret = -ENOMEM; goto out; } right_path = btrfs_alloc_path(); if (!right_path) { ret = -ENOMEM; goto out; } tmp_buf = kvmalloc(fs_info->nodesize, GFP_KERNEL); if (!tmp_buf) { ret = -ENOMEM; goto out; } left_path->search_commit_root = 1; left_path->skip_locking = 1; right_path->search_commit_root = 1; right_path->skip_locking = 1; /* * Strategy: Go to the first items of both trees. Then do * * If both trees are at level 0 * Compare keys of current items * If left < right treat left item as new, advance left tree * and repeat * If left > right treat right item as deleted, advance right tree * and repeat * If left == right do deep compare of items, treat as changed if * needed, advance both trees and repeat * If both trees are at the same level but not at level 0 * Compare keys of current nodes/leafs * If left < right advance left tree and repeat * If left > right advance right tree and repeat * If left == right compare blockptrs of the next nodes/leafs * If they match advance both trees but stay at the same level * and repeat * If they don't match advance both trees while allowing to go * deeper and repeat * If tree levels are different * Advance the tree that needs it and repeat * * Advancing a tree means: * If we are at level 0, try to go to the next slot. If that's not * possible, go one level up and repeat. Stop when we found a level * where we could go to the next slot. We may at this point be on a * node or a leaf. * * If we are not at level 0 and not on shared tree blocks, go one * level deeper. * * If we are not at level 0 and on shared tree blocks, go one slot to * the right if possible or go up and right. */ down_read(&fs_info->commit_root_sem); left_level = btrfs_header_level(left_root->commit_root); left_root_level = left_level; /* * We clone the root node of the send and parent roots to prevent races * with snapshot creation of these roots. Snapshot creation COWs the * root node of a tree, so after the transaction is committed the old * extent can be reallocated while this send operation is still ongoing. * So we clone them, under the commit root semaphore, to be race free. */ left_path->nodes[left_level] = btrfs_clone_extent_buffer(left_root->commit_root); if (!left_path->nodes[left_level]) { ret = -ENOMEM; goto out_unlock; } right_level = btrfs_header_level(right_root->commit_root); right_root_level = right_level; right_path->nodes[right_level] = btrfs_clone_extent_buffer(right_root->commit_root); if (!right_path->nodes[right_level]) { ret = -ENOMEM; goto out_unlock; } /* * Our right root is the parent root, while the left root is the "send" * root. We know that all new nodes/leaves in the left root must have * a generation greater than the right root's generation, so we trigger * readahead for those nodes and leaves of the left root, as we know we * will need to read them at some point. */ reada_min_gen = btrfs_header_generation(right_root->commit_root); if (left_level == 0) btrfs_item_key_to_cpu(left_path->nodes[left_level], &left_key, left_path->slots[left_level]); else btrfs_node_key_to_cpu(left_path->nodes[left_level], &left_key, left_path->slots[left_level]); if (right_level == 0) btrfs_item_key_to_cpu(right_path->nodes[right_level], &right_key, right_path->slots[right_level]); else btrfs_node_key_to_cpu(right_path->nodes[right_level], &right_key, right_path->slots[right_level]); sctx->last_reloc_trans = fs_info->last_reloc_trans; while (1) { if (need_resched() || rwsem_is_contended(&fs_info->commit_root_sem)) { up_read(&fs_info->commit_root_sem); cond_resched(); down_read(&fs_info->commit_root_sem); } if (fs_info->last_reloc_trans > sctx->last_reloc_trans) { ret = restart_after_relocation(left_path, right_path, &left_key, &right_key, left_level, right_level, sctx); if (ret < 0) goto out_unlock; sctx->last_reloc_trans = fs_info->last_reloc_trans; } if (advance_left && !left_end_reached) { ret = tree_advance(left_path, &left_level, left_root_level, advance_left != ADVANCE_ONLY_NEXT, &left_key, reada_min_gen); if (ret == -1) left_end_reached = ADVANCE; else if (ret < 0) goto out_unlock; advance_left = 0; } if (advance_right && !right_end_reached) { ret = tree_advance(right_path, &right_level, right_root_level, advance_right != ADVANCE_ONLY_NEXT, &right_key, reada_min_gen); if (ret == -1) right_end_reached = ADVANCE; else if (ret < 0) goto out_unlock; advance_right = 0; } if (left_end_reached && right_end_reached) { ret = 0; goto out_unlock; } else if (left_end_reached) { if (right_level == 0) { up_read(&fs_info->commit_root_sem); ret = changed_cb(left_path, right_path, &right_key, BTRFS_COMPARE_TREE_DELETED, sctx); if (ret < 0) goto out; down_read(&fs_info->commit_root_sem); } advance_right = ADVANCE; continue; } else if (right_end_reached) { if (left_level == 0) { up_read(&fs_info->commit_root_sem); ret = changed_cb(left_path, right_path, &left_key, BTRFS_COMPARE_TREE_NEW, sctx); if (ret < 0) goto out; down_read(&fs_info->commit_root_sem); } advance_left = ADVANCE; continue; } if (left_level == 0 && right_level == 0) { up_read(&fs_info->commit_root_sem); cmp = btrfs_comp_cpu_keys(&left_key, &right_key); if (cmp < 0) { ret = changed_cb(left_path, right_path, &left_key, BTRFS_COMPARE_TREE_NEW, sctx); advance_left = ADVANCE; } else if (cmp > 0) { ret = changed_cb(left_path, right_path, &right_key, BTRFS_COMPARE_TREE_DELETED, sctx); advance_right = ADVANCE; } else { enum btrfs_compare_tree_result result; WARN_ON(!extent_buffer_uptodate(left_path->nodes[0])); ret = tree_compare_item(left_path, right_path, tmp_buf); if (ret) result = BTRFS_COMPARE_TREE_CHANGED; else result = BTRFS_COMPARE_TREE_SAME; ret = changed_cb(left_path, right_path, &left_key, result, sctx); advance_left = ADVANCE; advance_right = ADVANCE; } if (ret < 0) goto out; down_read(&fs_info->commit_root_sem); } else if (left_level == right_level) { cmp = btrfs_comp_cpu_keys(&left_key, &right_key); if (cmp < 0) { advance_left = ADVANCE; } else if (cmp > 0) { advance_right = ADVANCE; } else { left_blockptr = btrfs_node_blockptr( left_path->nodes[left_level], left_path->slots[left_level]); right_blockptr = btrfs_node_blockptr( right_path->nodes[right_level], right_path->slots[right_level]); left_gen = btrfs_node_ptr_generation( left_path->nodes[left_level], left_path->slots[left_level]); right_gen = btrfs_node_ptr_generation( right_path->nodes[right_level], right_path->slots[right_level]); if (left_blockptr == right_blockptr && left_gen == right_gen) { /* * As we're on a shared block, don't * allow to go deeper. */ advance_left = ADVANCE_ONLY_NEXT; advance_right = ADVANCE_ONLY_NEXT; } else { advance_left = ADVANCE; advance_right = ADVANCE; } } } else if (left_level < right_level) { advance_right = ADVANCE; } else { advance_left = ADVANCE; } } out_unlock: up_read(&fs_info->commit_root_sem); out: btrfs_free_path(left_path); btrfs_free_path(right_path); kvfree(tmp_buf); return ret; } static int send_subvol(struct send_ctx *sctx) { int ret; if (!(sctx->flags & BTRFS_SEND_FLAG_OMIT_STREAM_HEADER)) { ret = send_header(sctx); if (ret < 0) goto out; } ret = send_subvol_begin(sctx); if (ret < 0) goto out; if (sctx->parent_root) { ret = btrfs_compare_trees(sctx->send_root, sctx->parent_root, sctx); if (ret < 0) goto out; ret = finish_inode_if_needed(sctx, 1); if (ret < 0) goto out; } else { ret = full_send_tree(sctx); if (ret < 0) goto out; } out: free_recorded_refs(sctx); return ret; } /* * If orphan cleanup did remove any orphans from a root, it means the tree * was modified and therefore the commit root is not the same as the current * root anymore. This is a problem, because send uses the commit root and * therefore can see inode items that don't exist in the current root anymore, * and for example make calls to btrfs_iget, which will do tree lookups based * on the current root and not on the commit root. Those lookups will fail, * returning a -ESTALE error, and making send fail with that error. So make * sure a send does not see any orphans we have just removed, and that it will * see the same inodes regardless of whether a transaction commit happened * before it started (meaning that the commit root will be the same as the * current root) or not. */ static int ensure_commit_roots_uptodate(struct send_ctx *sctx) { int i; struct btrfs_trans_handle *trans = NULL; again: if (sctx->parent_root && sctx->parent_root->node != sctx->parent_root->commit_root) goto commit_trans; for (i = 0; i < sctx->clone_roots_cnt; i++) if (sctx->clone_roots[i].root->node != sctx->clone_roots[i].root->commit_root) goto commit_trans; if (trans) return btrfs_end_transaction(trans); return 0; commit_trans: /* Use any root, all fs roots will get their commit roots updated. */ if (!trans) { trans = btrfs_join_transaction(sctx->send_root); if (IS_ERR(trans)) return PTR_ERR(trans); goto again; } return btrfs_commit_transaction(trans); } /* * Make sure any existing dellaloc is flushed for any root used by a send * operation so that we do not miss any data and we do not race with writeback * finishing and changing a tree while send is using the tree. This could * happen if a subvolume is in RW mode, has delalloc, is turned to RO mode and * a send operation then uses the subvolume. * After flushing delalloc ensure_commit_roots_uptodate() must be called. */ static int flush_delalloc_roots(struct send_ctx *sctx) { struct btrfs_root *root = sctx->parent_root; int ret; int i; if (root) { ret = btrfs_start_delalloc_snapshot(root, false); if (ret) return ret; btrfs_wait_ordered_extents(root, U64_MAX, 0, U64_MAX); } for (i = 0; i < sctx->clone_roots_cnt; i++) { root = sctx->clone_roots[i].root; ret = btrfs_start_delalloc_snapshot(root, false); if (ret) return ret; btrfs_wait_ordered_extents(root, U64_MAX, 0, U64_MAX); } return 0; } static void btrfs_root_dec_send_in_progress(struct btrfs_root* root) { spin_lock(&root->root_item_lock); root->send_in_progress--; /* * Not much left to do, we don't know why it's unbalanced and * can't blindly reset it to 0. */ if (root->send_in_progress < 0) btrfs_err(root->fs_info, "send_in_progress unbalanced %d root %llu", root->send_in_progress, root->root_key.objectid); spin_unlock(&root->root_item_lock); } static void dedupe_in_progress_warn(const struct btrfs_root *root) { btrfs_warn_rl(root->fs_info, "cannot use root %llu for send while deduplications on it are in progress (%d in progress)", root->root_key.objectid, root->dedupe_in_progress); } long btrfs_ioctl_send(struct inode *inode, struct btrfs_ioctl_send_args *arg) { int ret = 0; struct btrfs_root *send_root = BTRFS_I(inode)->root; struct btrfs_fs_info *fs_info = send_root->fs_info; struct btrfs_root *clone_root; struct send_ctx *sctx = NULL; u32 i; u64 *clone_sources_tmp = NULL; int clone_sources_to_rollback = 0; size_t alloc_size; int sort_clone_roots = 0; struct btrfs_lru_cache_entry *entry; struct btrfs_lru_cache_entry *tmp; if (!capable(CAP_SYS_ADMIN)) return -EPERM; /* * The subvolume must remain read-only during send, protect against * making it RW. This also protects against deletion. */ spin_lock(&send_root->root_item_lock); if (btrfs_root_readonly(send_root) && send_root->dedupe_in_progress) { dedupe_in_progress_warn(send_root); spin_unlock(&send_root->root_item_lock); return -EAGAIN; } send_root->send_in_progress++; spin_unlock(&send_root->root_item_lock); /* * Userspace tools do the checks and warn the user if it's * not RO. */ if (!btrfs_root_readonly(send_root)) { ret = -EPERM; goto out; } /* * Check that we don't overflow at later allocations, we request * clone_sources_count + 1 items, and compare to unsigned long inside * access_ok. Also set an upper limit for allocation size so this can't * easily exhaust memory. Max number of clone sources is about 200K. */ if (arg->clone_sources_count > SZ_8M / sizeof(struct clone_root)) { ret = -EINVAL; goto out; } if (arg->flags & ~BTRFS_SEND_FLAG_MASK) { ret = -EINVAL; goto out; } sctx = kzalloc(sizeof(struct send_ctx), GFP_KERNEL); if (!sctx) { ret = -ENOMEM; goto out; } INIT_LIST_HEAD(&sctx->new_refs); INIT_LIST_HEAD(&sctx->deleted_refs); btrfs_lru_cache_init(&sctx->name_cache, SEND_MAX_NAME_CACHE_SIZE); btrfs_lru_cache_init(&sctx->backref_cache, SEND_MAX_BACKREF_CACHE_SIZE); btrfs_lru_cache_init(&sctx->dir_created_cache, SEND_MAX_DIR_CREATED_CACHE_SIZE); /* * This cache is periodically trimmed to a fixed size elsewhere, see * cache_dir_utimes() and trim_dir_utimes_cache(). */ btrfs_lru_cache_init(&sctx->dir_utimes_cache, 0); sctx->pending_dir_moves = RB_ROOT; sctx->waiting_dir_moves = RB_ROOT; sctx->orphan_dirs = RB_ROOT; sctx->rbtree_new_refs = RB_ROOT; sctx->rbtree_deleted_refs = RB_ROOT; sctx->flags = arg->flags; if (arg->flags & BTRFS_SEND_FLAG_VERSION) { if (arg->version > BTRFS_SEND_STREAM_VERSION) { ret = -EPROTO; goto out; } /* Zero means "use the highest version" */ sctx->proto = arg->version ?: BTRFS_SEND_STREAM_VERSION; } else { sctx->proto = 1; } if ((arg->flags & BTRFS_SEND_FLAG_COMPRESSED) && sctx->proto < 2) { ret = -EINVAL; goto out; } sctx->send_filp = fget(arg->send_fd); if (!sctx->send_filp) { ret = -EBADF; goto out; } sctx->send_root = send_root; /* * Unlikely but possible, if the subvolume is marked for deletion but * is slow to remove the directory entry, send can still be started */ if (btrfs_root_dead(sctx->send_root)) { ret = -EPERM; goto out; } sctx->clone_roots_cnt = arg->clone_sources_count; if (sctx->proto >= 2) { u32 send_buf_num_pages; sctx->send_max_size = BTRFS_SEND_BUF_SIZE_V2; sctx->send_buf = vmalloc(sctx->send_max_size); if (!sctx->send_buf) { ret = -ENOMEM; goto out; } send_buf_num_pages = sctx->send_max_size >> PAGE_SHIFT; sctx->send_buf_pages = kcalloc(send_buf_num_pages, sizeof(*sctx->send_buf_pages), GFP_KERNEL); if (!sctx->send_buf_pages) { ret = -ENOMEM; goto out; } for (i = 0; i < send_buf_num_pages; i++) { sctx->send_buf_pages[i] = vmalloc_to_page(sctx->send_buf + (i << PAGE_SHIFT)); } } else { sctx->send_max_size = BTRFS_SEND_BUF_SIZE_V1; sctx->send_buf = kvmalloc(sctx->send_max_size, GFP_KERNEL); } if (!sctx->send_buf) { ret = -ENOMEM; goto out; } sctx->clone_roots = kvcalloc(sizeof(*sctx->clone_roots), arg->clone_sources_count + 1, GFP_KERNEL); if (!sctx->clone_roots) { ret = -ENOMEM; goto out; } alloc_size = array_size(sizeof(*arg->clone_sources), arg->clone_sources_count); if (arg->clone_sources_count) { clone_sources_tmp = kvmalloc(alloc_size, GFP_KERNEL); if (!clone_sources_tmp) { ret = -ENOMEM; goto out; } ret = copy_from_user(clone_sources_tmp, arg->clone_sources, alloc_size); if (ret) { ret = -EFAULT; goto out; } for (i = 0; i < arg->clone_sources_count; i++) { clone_root = btrfs_get_fs_root(fs_info, clone_sources_tmp[i], true); if (IS_ERR(clone_root)) { ret = PTR_ERR(clone_root); goto out; } spin_lock(&clone_root->root_item_lock); if (!btrfs_root_readonly(clone_root) || btrfs_root_dead(clone_root)) { spin_unlock(&clone_root->root_item_lock); btrfs_put_root(clone_root); ret = -EPERM; goto out; } if (clone_root->dedupe_in_progress) { dedupe_in_progress_warn(clone_root); spin_unlock(&clone_root->root_item_lock); btrfs_put_root(clone_root); ret = -EAGAIN; goto out; } clone_root->send_in_progress++; spin_unlock(&clone_root->root_item_lock); sctx->clone_roots[i].root = clone_root; clone_sources_to_rollback = i + 1; } kvfree(clone_sources_tmp); clone_sources_tmp = NULL; } if (arg->parent_root) { sctx->parent_root = btrfs_get_fs_root(fs_info, arg->parent_root, true); if (IS_ERR(sctx->parent_root)) { ret = PTR_ERR(sctx->parent_root); goto out; } spin_lock(&sctx->parent_root->root_item_lock); sctx->parent_root->send_in_progress++; if (!btrfs_root_readonly(sctx->parent_root) || btrfs_root_dead(sctx->parent_root)) { spin_unlock(&sctx->parent_root->root_item_lock); ret = -EPERM; goto out; } if (sctx->parent_root->dedupe_in_progress) { dedupe_in_progress_warn(sctx->parent_root); spin_unlock(&sctx->parent_root->root_item_lock); ret = -EAGAIN; goto out; } spin_unlock(&sctx->parent_root->root_item_lock); } /* * Clones from send_root are allowed, but only if the clone source * is behind the current send position. This is checked while searching * for possible clone sources. */ sctx->clone_roots[sctx->clone_roots_cnt++].root = btrfs_grab_root(sctx->send_root); /* We do a bsearch later */ sort(sctx->clone_roots, sctx->clone_roots_cnt, sizeof(*sctx->clone_roots), __clone_root_cmp_sort, NULL); sort_clone_roots = 1; ret = flush_delalloc_roots(sctx); if (ret) goto out; ret = ensure_commit_roots_uptodate(sctx); if (ret) goto out; ret = send_subvol(sctx); if (ret < 0) goto out; btrfs_lru_cache_for_each_entry_safe(&sctx->dir_utimes_cache, entry, tmp) { ret = send_utimes(sctx, entry->key, entry->gen); if (ret < 0) goto out; btrfs_lru_cache_remove(&sctx->dir_utimes_cache, entry); } if (!(sctx->flags & BTRFS_SEND_FLAG_OMIT_END_CMD)) { ret = begin_cmd(sctx, BTRFS_SEND_C_END); if (ret < 0) goto out; ret = send_cmd(sctx); if (ret < 0) goto out; } out: WARN_ON(sctx && !ret && !RB_EMPTY_ROOT(&sctx->pending_dir_moves)); while (sctx && !RB_EMPTY_ROOT(&sctx->pending_dir_moves)) { struct rb_node *n; struct pending_dir_move *pm; n = rb_first(&sctx->pending_dir_moves); pm = rb_entry(n, struct pending_dir_move, node); while (!list_empty(&pm->list)) { struct pending_dir_move *pm2; pm2 = list_first_entry(&pm->list, struct pending_dir_move, list); free_pending_move(sctx, pm2); } free_pending_move(sctx, pm); } WARN_ON(sctx && !ret && !RB_EMPTY_ROOT(&sctx->waiting_dir_moves)); while (sctx && !RB_EMPTY_ROOT(&sctx->waiting_dir_moves)) { struct rb_node *n; struct waiting_dir_move *dm; n = rb_first(&sctx->waiting_dir_moves); dm = rb_entry(n, struct waiting_dir_move, node); rb_erase(&dm->node, &sctx->waiting_dir_moves); kfree(dm); } WARN_ON(sctx && !ret && !RB_EMPTY_ROOT(&sctx->orphan_dirs)); while (sctx && !RB_EMPTY_ROOT(&sctx->orphan_dirs)) { struct rb_node *n; struct orphan_dir_info *odi; n = rb_first(&sctx->orphan_dirs); odi = rb_entry(n, struct orphan_dir_info, node); free_orphan_dir_info(sctx, odi); } if (sort_clone_roots) { for (i = 0; i < sctx->clone_roots_cnt; i++) { btrfs_root_dec_send_in_progress( sctx->clone_roots[i].root); btrfs_put_root(sctx->clone_roots[i].root); } } else { for (i = 0; sctx && i < clone_sources_to_rollback; i++) { btrfs_root_dec_send_in_progress( sctx->clone_roots[i].root); btrfs_put_root(sctx->clone_roots[i].root); } btrfs_root_dec_send_in_progress(send_root); } if (sctx && !IS_ERR_OR_NULL(sctx->parent_root)) { btrfs_root_dec_send_in_progress(sctx->parent_root); btrfs_put_root(sctx->parent_root); } kvfree(clone_sources_tmp); if (sctx) { if (sctx->send_filp) fput(sctx->send_filp); kvfree(sctx->clone_roots); kfree(sctx->send_buf_pages); kvfree(sctx->send_buf); kvfree(sctx->verity_descriptor); close_current_inode(sctx); btrfs_lru_cache_clear(&sctx->name_cache); btrfs_lru_cache_clear(&sctx->backref_cache); btrfs_lru_cache_clear(&sctx->dir_created_cache); btrfs_lru_cache_clear(&sctx->dir_utimes_cache); kfree(sctx); } return ret; } |
4 2 30 16483 | 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60 61 62 63 64 65 66 67 68 69 70 71 72 73 74 75 76 77 78 79 80 81 | /* SPDX-License-Identifier: GPL-2.0 */ #ifndef _LINUX_CTYPE_H #define _LINUX_CTYPE_H #include <linux/compiler.h> /* * NOTE! This ctype does not handle EOF like the standard C * library is required to. */ #define _U 0x01 /* upper */ #define _L 0x02 /* lower */ #define _D 0x04 /* digit */ #define _C 0x08 /* cntrl */ #define _P 0x10 /* punct */ #define _S 0x20 /* white space (space/lf/tab) */ #define _X 0x40 /* hex digit */ #define _SP 0x80 /* hard space (0x20) */ extern const unsigned char _ctype[]; #define __ismask(x) (_ctype[(int)(unsigned char)(x)]) #define isalnum(c) ((__ismask(c)&(_U|_L|_D)) != 0) #define isalpha(c) ((__ismask(c)&(_U|_L)) != 0) #define iscntrl(c) ((__ismask(c)&(_C)) != 0) #define isgraph(c) ((__ismask(c)&(_P|_U|_L|_D)) != 0) #define islower(c) ((__ismask(c)&(_L)) != 0) #define isprint(c) ((__ismask(c)&(_P|_U|_L|_D|_SP)) != 0) #define ispunct(c) ((__ismask(c)&(_P)) != 0) /* Note: isspace() must return false for %NUL-terminator */ #define isspace(c) ((__ismask(c)&(_S)) != 0) #define isupper(c) ((__ismask(c)&(_U)) != 0) #define isxdigit(c) ((__ismask(c)&(_D|_X)) != 0) #define isascii(c) (((unsigned char)(c))<=0x7f) #define toascii(c) (((unsigned char)(c))&0x7f) #if __has_builtin(__builtin_isdigit) #define isdigit(c) __builtin_isdigit(c) #else static inline int isdigit(int c) { return '0' <= c && c <= '9'; } #endif static inline unsigned char __tolower(unsigned char c) { if (isupper(c)) c -= 'A'-'a'; return c; } static inline unsigned char __toupper(unsigned char c) { if (islower(c)) c -= 'a'-'A'; return c; } #define tolower(c) __tolower(c) #define toupper(c) __toupper(c) /* * Fast implementation of tolower() for internal usage. Do not use in your * code. */ static inline char _tolower(const char c) { return c | 0x20; } /* Fast check for octal digit */ static inline int isodigit(const char c) { return c >= '0' && c <= '7'; } #endif |
26 12490 12169 11558 3 12336 5905 19 | 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60 61 62 63 64 65 66 67 68 69 70 71 72 73 74 75 76 77 78 79 80 81 82 83 84 85 86 87 88 89 90 91 92 93 94 95 96 97 98 99 100 101 102 103 104 105 106 107 108 109 110 111 112 113 114 115 116 117 118 119 120 121 122 123 124 125 126 127 128 129 130 131 132 133 134 135 136 137 138 139 140 141 142 143 144 145 146 147 148 149 150 151 152 153 154 155 156 157 158 159 160 161 162 163 164 165 166 167 168 169 170 171 172 173 174 175 176 177 178 179 180 181 182 183 184 185 186 187 188 189 190 191 192 193 194 195 196 197 198 199 200 201 202 203 204 205 206 207 208 209 210 211 212 213 214 215 216 217 218 219 220 221 222 223 224 225 226 227 228 229 230 231 232 233 234 235 236 237 238 239 240 241 242 243 244 245 246 247 248 249 250 251 252 253 254 255 256 257 258 259 260 261 262 263 264 265 266 267 268 269 270 271 272 273 274 275 276 277 278 279 280 281 282 283 284 285 286 287 288 289 290 291 292 293 294 295 296 297 298 299 300 301 302 303 304 305 306 307 308 309 310 311 312 313 314 315 316 317 318 319 320 321 322 323 324 325 326 327 328 329 330 331 332 333 334 335 336 337 338 339 340 341 342 343 344 345 346 347 348 349 350 351 352 353 354 355 356 357 358 359 360 361 362 363 364 365 366 367 368 369 370 371 372 373 374 375 376 377 378 379 380 381 382 383 384 385 386 387 388 389 390 391 392 393 394 395 396 397 398 399 400 401 402 403 404 405 406 407 408 409 410 411 412 413 414 415 416 417 418 419 420 421 422 423 424 425 426 427 428 429 430 431 432 433 434 435 436 437 438 439 440 441 442 443 444 445 446 447 448 449 450 451 452 453 454 455 456 457 458 459 460 461 462 463 464 465 466 467 468 469 470 471 472 473 474 475 476 477 478 479 480 481 482 483 484 485 486 487 488 489 490 491 492 493 494 495 496 497 498 499 500 501 502 503 504 505 506 507 508 509 510 511 512 513 514 515 516 517 518 519 520 521 522 523 524 525 526 527 528 529 530 531 532 533 534 535 536 537 538 539 540 541 542 543 544 545 546 547 548 549 550 551 552 553 554 555 556 557 558 559 560 561 562 563 564 565 566 567 568 569 570 571 572 573 574 575 576 577 578 579 | /* SPDX-License-Identifier: GPL-2.0 */ #ifndef _ASM_X86_PGTABLE_DEFS_H #define _ASM_X86_PGTABLE_DEFS_H #include <linux/const.h> #include <linux/mem_encrypt.h> #include <asm/page_types.h> #define _PAGE_BIT_PRESENT 0 /* is present */ #define _PAGE_BIT_RW 1 /* writeable */ #define _PAGE_BIT_USER 2 /* userspace addressable */ #define _PAGE_BIT_PWT 3 /* page write through */ #define _PAGE_BIT_PCD 4 /* page cache disabled */ #define _PAGE_BIT_ACCESSED 5 /* was accessed (raised by CPU) */ #define _PAGE_BIT_DIRTY 6 /* was written to (raised by CPU) */ #define _PAGE_BIT_PSE 7 /* 4 MB (or 2MB) page */ #define _PAGE_BIT_PAT 7 /* on 4KB pages */ #define _PAGE_BIT_GLOBAL 8 /* Global TLB entry PPro+ */ #define _PAGE_BIT_SOFTW1 9 /* available for programmer */ #define _PAGE_BIT_SOFTW2 10 /* " */ #define _PAGE_BIT_SOFTW3 11 /* " */ #define _PAGE_BIT_PAT_LARGE 12 /* On 2MB or 1GB pages */ #define _PAGE_BIT_SOFTW4 57 /* available for programmer */ #define _PAGE_BIT_SOFTW5 58 /* available for programmer */ #define _PAGE_BIT_PKEY_BIT0 59 /* Protection Keys, bit 1/4 */ #define _PAGE_BIT_PKEY_BIT1 60 /* Protection Keys, bit 2/4 */ #define _PAGE_BIT_PKEY_BIT2 61 /* Protection Keys, bit 3/4 */ #define _PAGE_BIT_PKEY_BIT3 62 /* Protection Keys, bit 4/4 */ #define _PAGE_BIT_NX 63 /* No execute: only valid after cpuid check */ #define _PAGE_BIT_SPECIAL _PAGE_BIT_SOFTW1 #define _PAGE_BIT_CPA_TEST _PAGE_BIT_SOFTW1 #define _PAGE_BIT_UFFD_WP _PAGE_BIT_SOFTW2 /* userfaultfd wrprotected */ #define _PAGE_BIT_SOFT_DIRTY _PAGE_BIT_SOFTW3 /* software dirty tracking */ #define _PAGE_BIT_DEVMAP _PAGE_BIT_SOFTW4 #ifdef CONFIG_X86_64 #define _PAGE_BIT_SAVED_DIRTY _PAGE_BIT_SOFTW5 /* Saved Dirty bit */ #else /* Shared with _PAGE_BIT_UFFD_WP which is not supported on 32 bit */ #define _PAGE_BIT_SAVED_DIRTY _PAGE_BIT_SOFTW2 /* Saved Dirty bit */ #endif /* If _PAGE_BIT_PRESENT is clear, we use these: */ /* - if the user mapped it with PROT_NONE; pte_present gives true */ #define _PAGE_BIT_PROTNONE _PAGE_BIT_GLOBAL #define _PAGE_PRESENT (_AT(pteval_t, 1) << _PAGE_BIT_PRESENT) #define _PAGE_RW (_AT(pteval_t, 1) << _PAGE_BIT_RW) #define _PAGE_USER (_AT(pteval_t, 1) << _PAGE_BIT_USER) #define _PAGE_PWT (_AT(pteval_t, 1) << _PAGE_BIT_PWT) #define _PAGE_PCD (_AT(pteval_t, 1) << _PAGE_BIT_PCD) #define _PAGE_ACCESSED (_AT(pteval_t, 1) << _PAGE_BIT_ACCESSED) #define _PAGE_DIRTY (_AT(pteval_t, 1) << _PAGE_BIT_DIRTY) #define _PAGE_PSE (_AT(pteval_t, 1) << _PAGE_BIT_PSE) #define _PAGE_GLOBAL (_AT(pteval_t, 1) << _PAGE_BIT_GLOBAL) #define _PAGE_SOFTW1 (_AT(pteval_t, 1) << _PAGE_BIT_SOFTW1) #define _PAGE_SOFTW2 (_AT(pteval_t, 1) << _PAGE_BIT_SOFTW2) #define _PAGE_SOFTW3 (_AT(pteval_t, 1) << _PAGE_BIT_SOFTW3) #define _PAGE_PAT (_AT(pteval_t, 1) << _PAGE_BIT_PAT) #define _PAGE_PAT_LARGE (_AT(pteval_t, 1) << _PAGE_BIT_PAT_LARGE) #define _PAGE_SPECIAL (_AT(pteval_t, 1) << _PAGE_BIT_SPECIAL) #define _PAGE_CPA_TEST (_AT(pteval_t, 1) << _PAGE_BIT_CPA_TEST) #ifdef CONFIG_X86_INTEL_MEMORY_PROTECTION_KEYS #define _PAGE_PKEY_BIT0 (_AT(pteval_t, 1) << _PAGE_BIT_PKEY_BIT0) #define _PAGE_PKEY_BIT1 (_AT(pteval_t, 1) << _PAGE_BIT_PKEY_BIT1) #define _PAGE_PKEY_BIT2 (_AT(pteval_t, 1) << _PAGE_BIT_PKEY_BIT2) #define _PAGE_PKEY_BIT3 (_AT(pteval_t, 1) << _PAGE_BIT_PKEY_BIT3) #else #define _PAGE_PKEY_BIT0 (_AT(pteval_t, 0)) #define _PAGE_PKEY_BIT1 (_AT(pteval_t, 0)) #define _PAGE_PKEY_BIT2 (_AT(pteval_t, 0)) #define _PAGE_PKEY_BIT3 (_AT(pteval_t, 0)) #endif #define _PAGE_PKEY_MASK (_PAGE_PKEY_BIT0 | \ _PAGE_PKEY_BIT1 | \ _PAGE_PKEY_BIT2 | \ _PAGE_PKEY_BIT3) #if defined(CONFIG_X86_64) || defined(CONFIG_X86_PAE) #define _PAGE_KNL_ERRATUM_MASK (_PAGE_DIRTY | _PAGE_ACCESSED) #else #define _PAGE_KNL_ERRATUM_MASK 0 #endif #ifdef CONFIG_MEM_SOFT_DIRTY #define _PAGE_SOFT_DIRTY (_AT(pteval_t, 1) << _PAGE_BIT_SOFT_DIRTY) #else #define _PAGE_SOFT_DIRTY (_AT(pteval_t, 0)) #endif /* * Tracking soft dirty bit when a page goes to a swap is tricky. * We need a bit which can be stored in pte _and_ not conflict * with swap entry format. On x86 bits 1-4 are *not* involved * into swap entry computation, but bit 7 is used for thp migration, * so we borrow bit 1 for soft dirty tracking. * * Please note that this bit must be treated as swap dirty page * mark if and only if the PTE/PMD has present bit clear! */ #ifdef CONFIG_MEM_SOFT_DIRTY #define _PAGE_SWP_SOFT_DIRTY _PAGE_RW #else #define _PAGE_SWP_SOFT_DIRTY (_AT(pteval_t, 0)) #endif #ifdef CONFIG_HAVE_ARCH_USERFAULTFD_WP #define _PAGE_UFFD_WP (_AT(pteval_t, 1) << _PAGE_BIT_UFFD_WP) #define _PAGE_SWP_UFFD_WP _PAGE_USER #else #define _PAGE_UFFD_WP (_AT(pteval_t, 0)) #define _PAGE_SWP_UFFD_WP (_AT(pteval_t, 0)) #endif #if defined(CONFIG_X86_64) || defined(CONFIG_X86_PAE) #define _PAGE_NX (_AT(pteval_t, 1) << _PAGE_BIT_NX) #define _PAGE_DEVMAP (_AT(u64, 1) << _PAGE_BIT_DEVMAP) #define _PAGE_SOFTW4 (_AT(pteval_t, 1) << _PAGE_BIT_SOFTW4) #else #define _PAGE_NX (_AT(pteval_t, 0)) #define _PAGE_DEVMAP (_AT(pteval_t, 0)) #define _PAGE_SOFTW4 (_AT(pteval_t, 0)) #endif /* * The hardware requires shadow stack to be Write=0,Dirty=1. However, * there are valid cases where the kernel might create read-only PTEs that * are dirty (e.g., fork(), mprotect(), uffd-wp(), soft-dirty tracking). In * this case, the _PAGE_SAVED_DIRTY bit is used instead of the HW-dirty bit, * to avoid creating a wrong "shadow stack" PTEs. Such PTEs have * (Write=0,SavedDirty=1,Dirty=0) set. */ #define _PAGE_SAVED_DIRTY (_AT(pteval_t, 1) << _PAGE_BIT_SAVED_DIRTY) #define _PAGE_DIRTY_BITS (_PAGE_DIRTY | _PAGE_SAVED_DIRTY) #define _PAGE_PROTNONE (_AT(pteval_t, 1) << _PAGE_BIT_PROTNONE) /* * Set of bits not changed in pte_modify. The pte's * protection key is treated like _PAGE_RW, for * instance, and is *not* included in this mask since * pte_modify() does modify it. */ #define _COMMON_PAGE_CHG_MASK (PTE_PFN_MASK | _PAGE_PCD | _PAGE_PWT | \ _PAGE_SPECIAL | _PAGE_ACCESSED | \ _PAGE_DIRTY_BITS | _PAGE_SOFT_DIRTY | \ _PAGE_DEVMAP | _PAGE_ENC | _PAGE_UFFD_WP) #define _PAGE_CHG_MASK (_COMMON_PAGE_CHG_MASK | _PAGE_PAT) #define _HPAGE_CHG_MASK (_COMMON_PAGE_CHG_MASK | _PAGE_PSE | _PAGE_PAT_LARGE) /* * The cache modes defined here are used to translate between pure SW usage * and the HW defined cache mode bits and/or PAT entries. * * The resulting bits for PWT, PCD and PAT should be chosen in a way * to have the WB mode at index 0 (all bits clear). This is the default * right now and likely would break too much if changed. */ #ifndef __ASSEMBLY__ enum page_cache_mode { _PAGE_CACHE_MODE_WB = 0, _PAGE_CACHE_MODE_WC = 1, _PAGE_CACHE_MODE_UC_MINUS = 2, _PAGE_CACHE_MODE_UC = 3, _PAGE_CACHE_MODE_WT = 4, _PAGE_CACHE_MODE_WP = 5, _PAGE_CACHE_MODE_NUM = 8 }; #endif #define _PAGE_ENC (_AT(pteval_t, sme_me_mask)) #define _PAGE_CACHE_MASK (_PAGE_PWT | _PAGE_PCD | _PAGE_PAT) #define _PAGE_LARGE_CACHE_MASK (_PAGE_PWT | _PAGE_PCD | _PAGE_PAT_LARGE) #define _PAGE_NOCACHE (cachemode2protval(_PAGE_CACHE_MODE_UC)) #define _PAGE_CACHE_WP (cachemode2protval(_PAGE_CACHE_MODE_WP)) #define __PP _PAGE_PRESENT #define __RW _PAGE_RW #define _USR _PAGE_USER #define ___A _PAGE_ACCESSED #define ___D _PAGE_DIRTY #define ___G _PAGE_GLOBAL #define __NX _PAGE_NX #define _ENC _PAGE_ENC #define __WP _PAGE_CACHE_WP #define __NC _PAGE_NOCACHE #define _PSE _PAGE_PSE #define pgprot_val(x) ((x).pgprot) #define __pgprot(x) ((pgprot_t) { (x) } ) #define __pg(x) __pgprot(x) #define PAGE_NONE __pg( 0| 0| 0|___A| 0| 0| 0|___G) #define PAGE_SHARED __pg(__PP|__RW|_USR|___A|__NX| 0| 0| 0) #define PAGE_SHARED_EXEC __pg(__PP|__RW|_USR|___A| 0| 0| 0| 0) #define PAGE_COPY_NOEXEC __pg(__PP| 0|_USR|___A|__NX| 0| 0| 0) #define PAGE_COPY_EXEC __pg(__PP| 0|_USR|___A| 0| 0| 0| 0) #define PAGE_COPY __pg(__PP| 0|_USR|___A|__NX| 0| 0| 0) #define PAGE_READONLY __pg(__PP| 0|_USR|___A|__NX| 0| 0| 0) #define PAGE_READONLY_EXEC __pg(__PP| 0|_USR|___A| 0| 0| 0| 0) #define __PAGE_KERNEL (__PP|__RW| 0|___A|__NX|___D| 0|___G) #define __PAGE_KERNEL_EXEC (__PP|__RW| 0|___A| 0|___D| 0|___G) /* * Page tables needs to have Write=1 in order for any lower PTEs to be * writable. This includes shadow stack memory (Write=0, Dirty=1) */ #define _KERNPG_TABLE_NOENC (__PP|__RW| 0|___A| 0|___D| 0| 0) #define _KERNPG_TABLE (__PP|__RW| 0|___A| 0|___D| 0| 0| _ENC) #define _PAGE_TABLE_NOENC (__PP|__RW|_USR|___A| 0|___D| 0| 0) #define _PAGE_TABLE (__PP|__RW|_USR|___A| 0|___D| 0| 0| _ENC) #define __PAGE_KERNEL_RO (__PP| 0| 0|___A|__NX| 0| 0|___G) #define __PAGE_KERNEL_ROX (__PP| 0| 0|___A| 0| 0| 0|___G) #define __PAGE_KERNEL (__PP|__RW| 0|___A|__NX|___D| 0|___G) #define __PAGE_KERNEL_EXEC (__PP|__RW| 0|___A| 0|___D| 0|___G) #define __PAGE_KERNEL_NOCACHE (__PP|__RW| 0|___A|__NX|___D| 0|___G| __NC) #define __PAGE_KERNEL_VVAR (__PP| 0|_USR|___A|__NX| 0| 0|___G) #define __PAGE_KERNEL_LARGE (__PP|__RW| 0|___A|__NX|___D|_PSE|___G) #define __PAGE_KERNEL_LARGE_EXEC (__PP|__RW| 0|___A| 0|___D|_PSE|___G) #define __PAGE_KERNEL_WP (__PP|__RW| 0|___A|__NX|___D| 0|___G| __WP) #define __PAGE_KERNEL_IO __PAGE_KERNEL #define __PAGE_KERNEL_IO_NOCACHE __PAGE_KERNEL_NOCACHE #ifndef __ASSEMBLY__ #define __PAGE_KERNEL_ENC (__PAGE_KERNEL | _ENC) #define __PAGE_KERNEL_ENC_WP (__PAGE_KERNEL_WP | _ENC) #define __PAGE_KERNEL_NOENC (__PAGE_KERNEL | 0) #define __PAGE_KERNEL_NOENC_WP (__PAGE_KERNEL_WP | 0) #define __pgprot_mask(x) __pgprot((x) & __default_kernel_pte_mask) #define PAGE_KERNEL __pgprot_mask(__PAGE_KERNEL | _ENC) #define PAGE_KERNEL_NOENC __pgprot_mask(__PAGE_KERNEL | 0) #define PAGE_KERNEL_RO __pgprot_mask(__PAGE_KERNEL_RO | _ENC) #define PAGE_KERNEL_EXEC __pgprot_mask(__PAGE_KERNEL_EXEC | _ENC) #define PAGE_KERNEL_EXEC_NOENC __pgprot_mask(__PAGE_KERNEL_EXEC | 0) #define PAGE_KERNEL_ROX __pgprot_mask(__PAGE_KERNEL_ROX | _ENC) #define PAGE_KERNEL_NOCACHE __pgprot_mask(__PAGE_KERNEL_NOCACHE | _ENC) #define PAGE_KERNEL_LARGE __pgprot_mask(__PAGE_KERNEL_LARGE | _ENC) #define PAGE_KERNEL_LARGE_EXEC __pgprot_mask(__PAGE_KERNEL_LARGE_EXEC | _ENC) #define PAGE_KERNEL_VVAR __pgprot_mask(__PAGE_KERNEL_VVAR | _ENC) #define PAGE_KERNEL_IO __pgprot_mask(__PAGE_KERNEL_IO) #define PAGE_KERNEL_IO_NOCACHE __pgprot_mask(__PAGE_KERNEL_IO_NOCACHE) #endif /* __ASSEMBLY__ */ /* * early identity mapping pte attrib macros. */ #ifdef CONFIG_X86_64 #define __PAGE_KERNEL_IDENT_LARGE_EXEC __PAGE_KERNEL_LARGE_EXEC #else #define PTE_IDENT_ATTR 0x003 /* PRESENT+RW */ #define PDE_IDENT_ATTR 0x063 /* PRESENT+RW+DIRTY+ACCESSED */ #define PGD_IDENT_ATTR 0x001 /* PRESENT (no other attributes) */ #endif #ifdef CONFIG_X86_32 # include <asm/pgtable_32_types.h> #else # include <asm/pgtable_64_types.h> #endif #ifndef __ASSEMBLY__ #include <linux/types.h> /* Extracts the PFN from a (pte|pmd|pud|pgd)val_t of a 4KB page */ #define PTE_PFN_MASK ((pteval_t)PHYSICAL_PAGE_MASK) /* * Extracts the flags from a (pte|pmd|pud|pgd)val_t * This includes the protection key value. */ #define PTE_FLAGS_MASK (~PTE_PFN_MASK) typedef struct pgprot { pgprotval_t pgprot; } pgprot_t; typedef struct { pgdval_t pgd; } pgd_t; static inline pgprot_t pgprot_nx(pgprot_t prot) { return __pgprot(pgprot_val(prot) | _PAGE_NX); } #define pgprot_nx pgprot_nx #ifdef CONFIG_X86_PAE /* * PHYSICAL_PAGE_MASK might be non-constant when SME is compiled in, so we can't * use it here. */ #define PGD_PAE_PAGE_MASK ((signed long)PAGE_MASK) #define PGD_PAE_PHYS_MASK (((1ULL << __PHYSICAL_MASK_SHIFT)-1) & PGD_PAE_PAGE_MASK) /* * PAE allows Base Address, P, PWT, PCD and AVL bits to be set in PGD entries. * All other bits are Reserved MBZ */ #define PGD_ALLOWED_BITS (PGD_PAE_PHYS_MASK | _PAGE_PRESENT | \ _PAGE_PWT | _PAGE_PCD | \ _PAGE_SOFTW1 | _PAGE_SOFTW2 | _PAGE_SOFTW3) #else /* No need to mask any bits for !PAE */ #define PGD_ALLOWED_BITS (~0ULL) #endif static inline pgd_t native_make_pgd(pgdval_t val) { return (pgd_t) { val & PGD_ALLOWED_BITS }; } static inline pgdval_t native_pgd_val(pgd_t pgd) { return pgd.pgd & PGD_ALLOWED_BITS; } static inline pgdval_t pgd_flags(pgd_t pgd) { return native_pgd_val(pgd) & PTE_FLAGS_MASK; } #if CONFIG_PGTABLE_LEVELS > 4 typedef struct { p4dval_t p4d; } p4d_t; static inline p4d_t native_make_p4d(pudval_t val) { return (p4d_t) { val }; } static inline p4dval_t native_p4d_val(p4d_t p4d) { return p4d.p4d; } #else #include <asm-generic/pgtable-nop4d.h> static inline p4d_t native_make_p4d(pudval_t val) { return (p4d_t) { .pgd = native_make_pgd((pgdval_t)val) }; } static inline p4dval_t native_p4d_val(p4d_t p4d) { return native_pgd_val(p4d.pgd); } #endif #if CONFIG_PGTABLE_LEVELS > 3 typedef struct { pudval_t pud; } pud_t; static inline pud_t native_make_pud(pmdval_t val) { return (pud_t) { val }; } static inline pudval_t native_pud_val(pud_t pud) { return pud.pud; } #else #include <asm-generic/pgtable-nopud.h> static inline pud_t native_make_pud(pudval_t val) { return (pud_t) { .p4d.pgd = native_make_pgd(val) }; } static inline pudval_t native_pud_val(pud_t pud) { return native_pgd_val(pud.p4d.pgd); } #endif #if CONFIG_PGTABLE_LEVELS > 2 static inline pmd_t native_make_pmd(pmdval_t val) { return (pmd_t) { .pmd = val }; } static inline pmdval_t native_pmd_val(pmd_t pmd) { return pmd.pmd; } #else #include <asm-generic/pgtable-nopmd.h> static inline pmd_t native_make_pmd(pmdval_t val) { return (pmd_t) { .pud.p4d.pgd = native_make_pgd(val) }; } static inline pmdval_t native_pmd_val(pmd_t pmd) { return native_pgd_val(pmd.pud.p4d.pgd); } #endif static inline p4dval_t p4d_pfn_mask(p4d_t p4d) { /* No 512 GiB huge pages yet */ return PTE_PFN_MASK; } static inline p4dval_t p4d_flags_mask(p4d_t p4d) { return ~p4d_pfn_mask(p4d); } static inline p4dval_t p4d_flags(p4d_t p4d) { return native_p4d_val(p4d) & p4d_flags_mask(p4d); } static inline pudval_t pud_pfn_mask(pud_t pud) { if (native_pud_val(pud) & _PAGE_PSE) return PHYSICAL_PUD_PAGE_MASK; else return PTE_PFN_MASK; } static inline pudval_t pud_flags_mask(pud_t pud) { return ~pud_pfn_mask(pud); } static inline pudval_t pud_flags(pud_t pud) { return native_pud_val(pud) & pud_flags_mask(pud); } static inline pmdval_t pmd_pfn_mask(pmd_t pmd) { if (native_pmd_val(pmd) & _PAGE_PSE) return PHYSICAL_PMD_PAGE_MASK; else return PTE_PFN_MASK; } static inline pmdval_t pmd_flags_mask(pmd_t pmd) { return ~pmd_pfn_mask(pmd); } static inline pmdval_t pmd_flags(pmd_t pmd) { return native_pmd_val(pmd) & pmd_flags_mask(pmd); } static inline pte_t native_make_pte(pteval_t val) { return (pte_t) { .pte = val }; } static inline pteval_t native_pte_val(pte_t pte) { return pte.pte; } static inline pteval_t pte_flags(pte_t pte) { return native_pte_val(pte) & PTE_FLAGS_MASK; } #define __pte2cm_idx(cb) \ ((((cb) >> (_PAGE_BIT_PAT - 2)) & 4) | \ (((cb) >> (_PAGE_BIT_PCD - 1)) & 2) | \ (((cb) >> _PAGE_BIT_PWT) & 1)) #define __cm_idx2pte(i) \ ((((i) & 4) << (_PAGE_BIT_PAT - 2)) | \ (((i) & 2) << (_PAGE_BIT_PCD - 1)) | \ (((i) & 1) << _PAGE_BIT_PWT)) unsigned long cachemode2protval(enum page_cache_mode pcm); static inline pgprotval_t protval_4k_2_large(pgprotval_t val) { return (val & ~(_PAGE_PAT | _PAGE_PAT_LARGE)) | ((val & _PAGE_PAT) << (_PAGE_BIT_PAT_LARGE - _PAGE_BIT_PAT)); } static inline pgprot_t pgprot_4k_2_large(pgprot_t pgprot) { return __pgprot(protval_4k_2_large(pgprot_val(pgprot))); } static inline pgprotval_t protval_large_2_4k(pgprotval_t val) { return (val & ~(_PAGE_PAT | _PAGE_PAT_LARGE)) | ((val & _PAGE_PAT_LARGE) >> (_PAGE_BIT_PAT_LARGE - _PAGE_BIT_PAT)); } static inline pgprot_t pgprot_large_2_4k(pgprot_t pgprot) { return __pgprot(protval_large_2_4k(pgprot_val(pgprot))); } typedef struct page *pgtable_t; extern pteval_t __supported_pte_mask; extern pteval_t __default_kernel_pte_mask; extern void set_nx(void); extern int nx_enabled; #define pgprot_writecombine pgprot_writecombine extern pgprot_t pgprot_writecombine(pgprot_t prot); #define pgprot_writethrough pgprot_writethrough extern pgprot_t pgprot_writethrough(pgprot_t prot); /* Indicate that x86 has its own track and untrack pfn vma functions */ #define __HAVE_PFNMAP_TRACKING #define __HAVE_PHYS_MEM_ACCESS_PROT struct file; pgprot_t phys_mem_access_prot(struct file *file, unsigned long pfn, unsigned long size, pgprot_t vma_prot); /* Install a pte for a particular vaddr in kernel space. */ void set_pte_vaddr(unsigned long vaddr, pte_t pte); #ifdef CONFIG_X86_32 extern void native_pagetable_init(void); #else #define native_pagetable_init paging_init #endif enum pg_level { PG_LEVEL_NONE, PG_LEVEL_4K, PG_LEVEL_2M, PG_LEVEL_1G, PG_LEVEL_512G, PG_LEVEL_NUM }; #ifdef CONFIG_PROC_FS extern void update_page_count(int level, unsigned long pages); #else static inline void update_page_count(int level, unsigned long pages) { } #endif /* * Helper function that returns the kernel pagetable entry controlling * the virtual address 'address'. NULL means no pagetable entry present. * NOTE: the return type is pte_t but if the pmd is PSE then we return it * as a pte too. */ extern pte_t *lookup_address(unsigned long address, unsigned int *level); extern pte_t *lookup_address_in_pgd(pgd_t *pgd, unsigned long address, unsigned int *level); extern pmd_t *lookup_pmd_address(unsigned long address); extern phys_addr_t slow_virt_to_phys(void *__address); extern int __init kernel_map_pages_in_pgd(pgd_t *pgd, u64 pfn, unsigned long address, unsigned numpages, unsigned long page_flags); extern int __init kernel_unmap_pages_in_pgd(pgd_t *pgd, unsigned long address, unsigned long numpages); #endif /* !__ASSEMBLY__ */ #endif /* _ASM_X86_PGTABLE_DEFS_H */ |
64 64 64 | 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 | // SPDX-License-Identifier: GPL-2.0-or-later /* mpihelp-mul_1.c - MPI helper functions * Copyright (C) 1994, 1996, 1997, 1998, 2001 Free Software Foundation, Inc. * * This file is part of GnuPG. * * Note: This code is heavily based on the GNU MP Library. * Actually it's the same code with only minor changes in the * way the data is stored; this is to support the abstraction * of an optional secure memory allocation which may be used * to avoid revealing of sensitive data due to paging etc. * The GNU MP Library itself is published under the LGPL; * however I decided to publish this code under the plain GPL. */ #include "mpi-internal.h" #include "longlong.h" mpi_limb_t mpihelp_mul_1(mpi_ptr_t res_ptr, mpi_ptr_t s1_ptr, mpi_size_t s1_size, mpi_limb_t s2_limb) { mpi_limb_t cy_limb; mpi_size_t j; mpi_limb_t prod_high, prod_low; /* The loop counter and index J goes from -S1_SIZE to -1. This way * the loop becomes faster. */ j = -s1_size; /* Offset the base pointers to compensate for the negative indices. */ s1_ptr -= j; res_ptr -= j; cy_limb = 0; do { umul_ppmm(prod_high, prod_low, s1_ptr[j], s2_limb); prod_low += cy_limb; cy_limb = (prod_low < cy_limb ? 1 : 0) + prod_high; res_ptr[j] = prod_low; } while (++j); return cy_limb; } |
1688 1688 1687 1687 1687 1688 1688 1682 | 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60 61 62 63 64 65 66 67 68 69 70 71 72 73 74 75 76 77 78 79 80 81 82 83 84 85 86 87 88 89 90 91 92 93 94 95 96 97 98 99 100 101 102 103 104 105 106 107 108 109 110 111 112 113 114 115 116 117 118 119 120 121 122 123 124 125 126 127 128 129 130 131 132 133 134 135 136 137 138 139 140 | // SPDX-License-Identifier: GPL-2.0 /* * SHA1 routine optimized to do word accesses rather than byte accesses, * and to avoid unnecessary copies into the context array. * * This was based on the git SHA1 implementation. */ #include <linux/kernel.h> #include <linux/export.h> #include <linux/module.h> #include <linux/bitops.h> #include <linux/string.h> #include <crypto/sha1.h> #include <asm/unaligned.h> /* * If you have 32 registers or more, the compiler can (and should) * try to change the array[] accesses into registers. However, on * machines with less than ~25 registers, that won't really work, * and at least gcc will make an unholy mess of it. * * So to avoid that mess which just slows things down, we force * the stores to memory to actually happen (we might be better off * with a 'W(t)=(val);asm("":"+m" (W(t))' there instead, as * suggested by Artur Skawina - that will also make gcc unable to * try to do the silly "optimize away loads" part because it won't * see what the value will be). * * Ben Herrenschmidt reports that on PPC, the C version comes close * to the optimized asm with this (ie on PPC you don't want that * 'volatile', since there are lots of registers). * * On ARM we get the best code generation by forcing a full memory barrier * between each SHA_ROUND, otherwise gcc happily get wild with spilling and * the stack frame size simply explode and performance goes down the drain. */ #ifdef CONFIG_X86 #define setW(x, val) (*(volatile __u32 *)&W(x) = (val)) #elif defined(CONFIG_ARM) #define setW(x, val) do { W(x) = (val); __asm__("":::"memory"); } while (0) #else #define setW(x, val) (W(x) = (val)) #endif /* This "rolls" over the 512-bit array */ #define W(x) (array[(x)&15]) /* * Where do we get the source from? The first 16 iterations get it from * the input data, the next mix it from the 512-bit array. */ #define SHA_SRC(t) get_unaligned_be32((__u32 *)data + t) #define SHA_MIX(t) rol32(W(t+13) ^ W(t+8) ^ W(t+2) ^ W(t), 1) #define SHA_ROUND(t, input, fn, constant, A, B, C, D, E) do { \ __u32 TEMP = input(t); setW(t, TEMP); \ E += TEMP + rol32(A,5) + (fn) + (constant); \ B = ror32(B, 2); \ TEMP = E; E = D; D = C; C = B; B = A; A = TEMP; } while (0) #define T_0_15(t, A, B, C, D, E) SHA_ROUND(t, SHA_SRC, (((C^D)&B)^D) , 0x5a827999, A, B, C, D, E ) #define T_16_19(t, A, B, C, D, E) SHA_ROUND(t, SHA_MIX, (((C^D)&B)^D) , 0x5a827999, A, B, C, D, E ) #define T_20_39(t, A, B, C, D, E) SHA_ROUND(t, SHA_MIX, (B^C^D) , 0x6ed9eba1, A, B, C, D, E ) #define T_40_59(t, A, B, C, D, E) SHA_ROUND(t, SHA_MIX, ((B&C)+(D&(B^C))) , 0x8f1bbcdc, A, B, C, D, E ) #define T_60_79(t, A, B, C, D, E) SHA_ROUND(t, SHA_MIX, (B^C^D) , 0xca62c1d6, A, B, C, D, E ) /** * sha1_transform - single block SHA1 transform (deprecated) * * @digest: 160 bit digest to update * @data: 512 bits of data to hash * @array: 16 words of workspace (see note) * * This function executes SHA-1's internal compression function. It updates the * 160-bit internal state (@digest) with a single 512-bit data block (@data). * * Don't use this function. SHA-1 is no longer considered secure. And even if * you do have to use SHA-1, this isn't the correct way to hash something with * SHA-1 as this doesn't handle padding and finalization. * * Note: If the hash is security sensitive, the caller should be sure * to clear the workspace. This is left to the caller to avoid * unnecessary clears between chained hashing operations. */ void sha1_transform(__u32 *digest, const char *data, __u32 *array) { __u32 A, B, C, D, E; unsigned int i = 0; A = digest[0]; B = digest[1]; C = digest[2]; D = digest[3]; E = digest[4]; /* Round 1 - iterations 0-16 take their input from 'data' */ for (; i < 16; ++i) T_0_15(i, A, B, C, D, E); /* Round 1 - tail. Input from 512-bit mixing array */ for (; i < 20; ++i) T_16_19(i, A, B, C, D, E); /* Round 2 */ for (; i < 40; ++i) T_20_39(i, A, B, C, D, E); /* Round 3 */ for (; i < 60; ++i) T_40_59(i, A, B, C, D, E); /* Round 4 */ for (; i < 80; ++i) T_60_79(i, A, B, C, D, E); digest[0] += A; digest[1] += B; digest[2] += C; digest[3] += D; digest[4] += E; } EXPORT_SYMBOL(sha1_transform); /** * sha1_init - initialize the vectors for a SHA1 digest * @buf: vector to initialize */ void sha1_init(__u32 *buf) { buf[0] = 0x67452301; buf[1] = 0xefcdab89; buf[2] = 0x98badcfe; buf[3] = 0x10325476; buf[4] = 0xc3d2e1f0; } EXPORT_SYMBOL(sha1_init); MODULE_LICENSE("GPL"); |
1 1 1 1 1 1 1 1 1 1 1 1 1 | 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60 61 62 63 64 65 66 67 68 69 70 71 72 73 74 75 76 77 78 79 80 81 82 83 84 85 86 87 88 89 90 91 92 93 94 95 96 97 98 99 100 101 102 103 104 105 106 107 108 109 110 111 112 113 114 115 116 117 118 119 120 121 122 123 124 125 126 127 128 129 130 131 132 133 134 135 | // SPDX-License-Identifier: GPL-2.0-only /* * AppArmor security module * * This file contains AppArmor /proc/<pid>/attr/ interface functions * * Copyright (C) 1998-2008 Novell/SUSE * Copyright 2009-2010 Canonical Ltd. */ #include "include/apparmor.h" #include "include/cred.h" #include "include/policy.h" #include "include/policy_ns.h" #include "include/domain.h" #include "include/procattr.h" /** * aa_getprocattr - Return the label information for @label * @label: the label to print label info about (NOT NULL) * @string: Returns - string containing the label info (NOT NULL) * * Requires: label != NULL && string != NULL * * Creates a string containing the label information for @label. * * Returns: size of string placed in @string else error code on failure */ int aa_getprocattr(struct aa_label *label, char **string) { struct aa_ns *ns = labels_ns(label); struct aa_ns *current_ns = aa_get_current_ns(); int len; if (!aa_ns_visible(current_ns, ns, true)) { aa_put_ns(current_ns); return -EACCES; } len = aa_label_snxprint(NULL, 0, current_ns, label, FLAG_SHOW_MODE | FLAG_VIEW_SUBNS | FLAG_HIDDEN_UNCONFINED); AA_BUG(len < 0); *string = kmalloc(len + 2, GFP_KERNEL); if (!*string) { aa_put_ns(current_ns); return -ENOMEM; } len = aa_label_snxprint(*string, len + 2, current_ns, label, FLAG_SHOW_MODE | FLAG_VIEW_SUBNS | FLAG_HIDDEN_UNCONFINED); if (len < 0) { aa_put_ns(current_ns); return len; } (*string)[len] = '\n'; (*string)[len + 1] = 0; aa_put_ns(current_ns); return len + 1; } /** * split_token_from_name - separate a string of form <token>^<name> * @op: operation being checked * @args: string to parse (NOT NULL) * @token: stores returned parsed token value (NOT NULL) * * Returns: start position of name after token else NULL on failure */ static char *split_token_from_name(const char *op, char *args, u64 *token) { char *name; *token = simple_strtoull(args, &name, 16); if ((name == args) || *name != '^') { AA_ERROR("%s: Invalid input '%s'", op, args); return ERR_PTR(-EINVAL); } name++; /* skip ^ */ if (!*name) name = NULL; return name; } /** * aa_setprocattr_changehat - handle procattr interface to change_hat * @args: args received from writing to /proc/<pid>/attr/current (NOT NULL) * @size: size of the args * @flags: set of flags governing behavior * * Returns: %0 or error code if change_hat fails */ int aa_setprocattr_changehat(char *args, size_t size, int flags) { char *hat; u64 token; const char *hats[16]; /* current hard limit on # of names */ int count = 0; hat = split_token_from_name(OP_CHANGE_HAT, args, &token); if (IS_ERR(hat)) return PTR_ERR(hat); if (!hat && !token) { AA_ERROR("change_hat: Invalid input, NULL hat and NULL magic"); return -EINVAL; } if (hat) { /* set up hat name vector, args guaranteed null terminated * at args[size] by setprocattr. * * If there are multiple hat names in the buffer each is * separated by a \0. Ie. userspace writes them pre tokenized */ char *end = args + size; for (count = 0; (hat < end) && count < 16; ++count) { char *next = hat + strlen(hat) + 1; hats[count] = hat; AA_DEBUG("%s: (pid %d) Magic 0x%llx count %d hat '%s'\n" , __func__, current->pid, token, count, hat); hat = next; } } else AA_DEBUG("%s: (pid %d) Magic 0x%llx count %d Hat '%s'\n", __func__, current->pid, token, count, "<NULL>"); return aa_change_hat(hats, count, token, flags); } |
56 | 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60 61 62 63 64 65 66 67 68 69 70 71 72 73 74 75 76 77 78 79 80 81 82 83 84 85 86 87 88 89 90 91 92 93 94 95 96 97 98 99 100 101 102 103 104 105 106 107 108 109 110 111 112 113 114 115 116 117 118 119 120 121 122 123 124 125 126 127 128 129 130 131 132 133 134 135 136 137 138 139 140 141 142 143 144 145 146 147 148 149 150 151 152 153 154 155 156 157 158 159 160 161 162 163 164 165 166 167 168 169 170 171 172 173 174 175 176 177 178 179 180 181 182 183 184 185 186 187 188 189 190 191 192 193 194 195 196 197 198 199 200 201 202 203 204 205 206 207 208 209 210 211 212 213 214 215 216 | /* SPDX-License-Identifier: GPL-2.0 */ #include <linux/module.h> #include <linux/kernel.h> #include <linux/string.h> #include <linux/slab.h> #include <linux/parser.h> #include <linux/errno.h> #include <linux/stringhash.h> #include "utf8n.h" int utf8_validate(const struct unicode_map *um, const struct qstr *str) { if (utf8nlen(um, UTF8_NFDI, str->name, str->len) < 0) return -1; return 0; } EXPORT_SYMBOL(utf8_validate); int utf8_strncmp(const struct unicode_map *um, const struct qstr *s1, const struct qstr *s2) { struct utf8cursor cur1, cur2; int c1, c2; if (utf8ncursor(&cur1, um, UTF8_NFDI, s1->name, s1->len) < 0) return -EINVAL; if (utf8ncursor(&cur2, um, UTF8_NFDI, s2->name, s2->len) < 0) return -EINVAL; do { c1 = utf8byte(&cur1); c2 = utf8byte(&cur2); if (c1 < 0 || c2 < 0) return -EINVAL; if (c1 != c2) return 1; } while (c1); return 0; } EXPORT_SYMBOL(utf8_strncmp); int utf8_strncasecmp(const struct unicode_map *um, const struct qstr *s1, const struct qstr *s2) { struct utf8cursor cur1, cur2; int c1, c2; if (utf8ncursor(&cur1, um, UTF8_NFDICF, s1->name, s1->len) < 0) return -EINVAL; if (utf8ncursor(&cur2, um, UTF8_NFDICF, s2->name, s2->len) < 0) return -EINVAL; do { c1 = utf8byte(&cur1); c2 = utf8byte(&cur2); if (c1 < 0 || c2 < 0) return -EINVAL; if (c1 != c2) return 1; } while (c1); return 0; } EXPORT_SYMBOL(utf8_strncasecmp); /* String cf is expected to be a valid UTF-8 casefolded * string. */ int utf8_strncasecmp_folded(const struct unicode_map *um, const struct qstr *cf, const struct qstr *s1) { struct utf8cursor cur1; int c1, c2; int i = 0; if (utf8ncursor(&cur1, um, UTF8_NFDICF, s1->name, s1->len) < 0) return -EINVAL; do { c1 = utf8byte(&cur1); c2 = cf->name[i++]; if (c1 < 0) return -EINVAL; if (c1 != c2) return 1; } while (c1); return 0; } EXPORT_SYMBOL(utf8_strncasecmp_folded); int utf8_casefold(const struct unicode_map *um, const struct qstr *str, unsigned char *dest, size_t dlen) { struct utf8cursor cur; size_t nlen = 0; if (utf8ncursor(&cur, um, UTF8_NFDICF, str->name, str->len) < 0) return -EINVAL; for (nlen = 0; nlen < dlen; nlen++) { int c = utf8byte(&cur); dest[nlen] = c; if (!c) return nlen; if (c == -1) break; } return -EINVAL; } EXPORT_SYMBOL(utf8_casefold); int utf8_casefold_hash(const struct unicode_map *um, const void *salt, struct qstr *str) { struct utf8cursor cur; int c; unsigned long hash = init_name_hash(salt); if (utf8ncursor(&cur, um, UTF8_NFDICF, str->name, str->len) < 0) return -EINVAL; while ((c = utf8byte(&cur))) { if (c < 0) return -EINVAL; hash = partial_name_hash((unsigned char)c, hash); } str->hash = end_name_hash(hash); return 0; } EXPORT_SYMBOL(utf8_casefold_hash); int utf8_normalize(const struct unicode_map *um, const struct qstr *str, unsigned char *dest, size_t dlen) { struct utf8cursor cur; ssize_t nlen = 0; if (utf8ncursor(&cur, um, UTF8_NFDI, str->name, str->len) < 0) return -EINVAL; for (nlen = 0; nlen < dlen; nlen++) { int c = utf8byte(&cur); dest[nlen] = c; if (!c) return nlen; if (c == -1) break; } return -EINVAL; } EXPORT_SYMBOL(utf8_normalize); static const struct utf8data *find_table_version(const struct utf8data *table, size_t nr_entries, unsigned int version) { size_t i = nr_entries - 1; while (version < table[i].maxage) i--; if (version > table[i].maxage) return NULL; return &table[i]; } struct unicode_map *utf8_load(unsigned int version) { struct unicode_map *um; um = kzalloc(sizeof(struct unicode_map), GFP_KERNEL); if (!um) return ERR_PTR(-ENOMEM); um->version = version; um->tables = symbol_request(utf8_data_table); if (!um->tables) goto out_free_um; if (!utf8version_is_supported(um, version)) goto out_symbol_put; um->ntab[UTF8_NFDI] = find_table_version(um->tables->utf8nfdidata, um->tables->utf8nfdidata_size, um->version); if (!um->ntab[UTF8_NFDI]) goto out_symbol_put; um->ntab[UTF8_NFDICF] = find_table_version(um->tables->utf8nfdicfdata, um->tables->utf8nfdicfdata_size, um->version); if (!um->ntab[UTF8_NFDICF]) goto out_symbol_put; return um; out_symbol_put: symbol_put(um->tables); out_free_um: kfree(um); return ERR_PTR(-EINVAL); } EXPORT_SYMBOL(utf8_load); void utf8_unload(struct unicode_map *um) { if (um) { symbol_put(utf8_data_table); kfree(um); } } EXPORT_SYMBOL(utf8_unload); |
321 319 319 319 316 316 232 232 235 4 4 4 4 4 4 110 111 101 13 110 10 9 8 6 6 6 193 192 192 193 193 193 193 193 193 193 192 192 124 40 118 118 118 240 31 21 193 192 6 193 15 195 3 192 192 2 193 7 195 174 21 20 19 8 18 1 1 18 8 16 3 15 7 14 6 13 5 10 3 10 2 17 7 177 178 4 177 5 177 177 199 198 198 4 22 22 196 195 21 191 193 13 | 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60 61 62 63 64 65 66 67 68 69 70 71 72 73 74 75 76 77 78 79 80 81 82 83 84 85 86 87 88 89 90 91 92 93 94 95 96 97 98 99 100 101 102 103 104 105 106 107 108 109 110 111 112 113 114 115 116 117 118 119 120 121 122 123 124 125 126 127 128 129 130 131 132 133 134 135 136 137 138 139 140 141 142 143 144 145 146 147 148 149 150 151 152 153 154 155 156 157 158 159 160 161 162 163 164 165 166 167 168 169 170 171 172 173 174 175 176 177 178 179 180 181 182 183 184 185 186 187 188 189 190 191 192 193 194 195 196 197 198 199 200 201 202 203 204 205 206 207 208 209 210 211 212 213 214 215 216 217 218 219 220 221 222 223 224 225 226 227 228 229 230 231 232 233 234 235 236 237 238 239 240 241 242 243 244 245 246 247 248 249 250 251 252 253 254 255 256 257 258 259 260 261 262 263 264 265 266 267 268 269 270 271 272 273 274 275 276 277 278 279 280 281 282 283 284 285 286 287 288 289 290 291 292 293 294 295 296 297 298 299 300 301 302 303 304 305 306 307 308 309 310 311 312 313 314 315 316 317 318 319 320 321 322 323 324 325 326 327 328 329 330 331 332 333 334 335 336 337 338 339 340 341 342 343 344 345 346 347 348 349 350 351 352 353 354 355 356 357 358 359 360 361 362 363 364 365 366 367 368 369 370 371 372 373 374 375 376 377 378 379 380 381 382 383 384 385 386 387 388 389 390 391 392 393 394 395 396 397 398 399 400 401 402 403 404 405 406 407 408 409 410 411 412 413 414 415 416 417 418 419 420 421 422 423 424 425 426 427 428 429 430 431 432 433 434 435 436 437 438 439 440 441 442 443 444 445 446 447 448 449 450 451 452 453 454 455 456 457 458 459 460 461 462 463 464 465 466 467 468 469 470 471 472 473 474 475 476 477 478 479 480 481 482 483 484 485 486 487 488 489 490 491 492 493 494 495 496 497 498 499 500 501 502 503 504 505 506 507 508 509 510 511 512 513 514 515 516 517 518 519 520 521 522 523 524 525 526 527 528 529 530 531 532 533 534 535 536 537 538 539 540 541 542 543 544 545 546 547 548 549 550 551 552 553 554 555 556 557 558 559 560 561 562 563 564 565 566 567 568 569 570 571 572 573 574 575 576 577 578 579 580 581 582 583 584 585 586 587 588 589 590 591 | // SPDX-License-Identifier: GPL-2.0-only /* * Copyright (C) 2006 IBM Corporation * * Author: Serge Hallyn <serue@us.ibm.com> * * Jun 2006 - namespaces support * OpenVZ, SWsoft Inc. * Pavel Emelianov <xemul@openvz.org> */ #include <linux/slab.h> #include <linux/export.h> #include <linux/nsproxy.h> #include <linux/init_task.h> #include <linux/mnt_namespace.h> #include <linux/utsname.h> #include <linux/pid_namespace.h> #include <net/net_namespace.h> #include <linux/ipc_namespace.h> #include <linux/time_namespace.h> #include <linux/fs_struct.h> #include <linux/proc_fs.h> #include <linux/proc_ns.h> #include <linux/file.h> #include <linux/syscalls.h> #include <linux/cgroup.h> #include <linux/perf_event.h> static struct kmem_cache *nsproxy_cachep; struct nsproxy init_nsproxy = { .count = REFCOUNT_INIT(1), .uts_ns = &init_uts_ns, #if defined(CONFIG_POSIX_MQUEUE) || defined(CONFIG_SYSVIPC) .ipc_ns = &init_ipc_ns, #endif .mnt_ns = NULL, .pid_ns_for_children = &init_pid_ns, #ifdef CONFIG_NET .net_ns = &init_net, #endif #ifdef CONFIG_CGROUPS .cgroup_ns = &init_cgroup_ns, #endif #ifdef CONFIG_TIME_NS .time_ns = &init_time_ns, .time_ns_for_children = &init_time_ns, #endif }; static inline struct nsproxy *create_nsproxy(void) { struct nsproxy *nsproxy; nsproxy = kmem_cache_alloc(nsproxy_cachep, GFP_KERNEL); if (nsproxy) refcount_set(&nsproxy->count, 1); return nsproxy; } /* * Create new nsproxy and all of its the associated namespaces. * Return the newly created nsproxy. Do not attach this to the task, * leave it to the caller to do proper locking and attach it to task. */ static struct nsproxy *create_new_namespaces(unsigned long flags, struct task_struct *tsk, struct user_namespace *user_ns, struct fs_struct *new_fs) { struct nsproxy *new_nsp; int err; new_nsp = create_nsproxy(); if (!new_nsp) return ERR_PTR(-ENOMEM); new_nsp->mnt_ns = copy_mnt_ns(flags, tsk->nsproxy->mnt_ns, user_ns, new_fs); if (IS_ERR(new_nsp->mnt_ns)) { err = PTR_ERR(new_nsp->mnt_ns); goto out_ns; } new_nsp->uts_ns = copy_utsname(flags, user_ns, tsk->nsproxy->uts_ns); if (IS_ERR(new_nsp->uts_ns)) { err = PTR_ERR(new_nsp->uts_ns); goto out_uts; } new_nsp->ipc_ns = copy_ipcs(flags, user_ns, tsk->nsproxy->ipc_ns); if (IS_ERR(new_nsp->ipc_ns)) { err = PTR_ERR(new_nsp->ipc_ns); goto out_ipc; } new_nsp->pid_ns_for_children = copy_pid_ns(flags, user_ns, tsk->nsproxy->pid_ns_for_children); if (IS_ERR(new_nsp->pid_ns_for_children)) { err = PTR_ERR(new_nsp->pid_ns_for_children); goto out_pid; } new_nsp->cgroup_ns = copy_cgroup_ns(flags, user_ns, tsk->nsproxy->cgroup_ns); if (IS_ERR(new_nsp->cgroup_ns)) { err = PTR_ERR(new_nsp->cgroup_ns); goto out_cgroup; } new_nsp->net_ns = copy_net_ns(flags, user_ns, tsk->nsproxy->net_ns); if (IS_ERR(new_nsp->net_ns)) { err = PTR_ERR(new_nsp->net_ns); goto out_net; } new_nsp->time_ns_for_children = copy_time_ns(flags, user_ns, tsk->nsproxy->time_ns_for_children); if (IS_ERR(new_nsp->time_ns_for_children)) { err = PTR_ERR(new_nsp->time_ns_for_children); goto out_time; } new_nsp->time_ns = get_time_ns(tsk->nsproxy->time_ns); return new_nsp; out_time: put_net(new_nsp->net_ns); out_net: put_cgroup_ns(new_nsp->cgroup_ns); out_cgroup: if (new_nsp->pid_ns_for_children) put_pid_ns(new_nsp->pid_ns_for_children); out_pid: if (new_nsp->ipc_ns) put_ipc_ns(new_nsp->ipc_ns); out_ipc: if (new_nsp->uts_ns) put_uts_ns(new_nsp->uts_ns); out_uts: if (new_nsp->mnt_ns) put_mnt_ns(new_nsp->mnt_ns); out_ns: kmem_cache_free(nsproxy_cachep, new_nsp); return ERR_PTR(err); } /* * called from clone. This now handles copy for nsproxy and all * namespaces therein. */ int copy_namespaces(unsigned long flags, struct task_struct *tsk) { struct nsproxy *old_ns = tsk->nsproxy; struct user_namespace *user_ns = task_cred_xxx(tsk, user_ns); struct nsproxy *new_ns; if (likely(!(flags & (CLONE_NEWNS | CLONE_NEWUTS | CLONE_NEWIPC | CLONE_NEWPID | CLONE_NEWNET | CLONE_NEWCGROUP | CLONE_NEWTIME)))) { if ((flags & CLONE_VM) || likely(old_ns->time_ns_for_children == old_ns->time_ns)) { get_nsproxy(old_ns); return 0; } } else if (!ns_capable(user_ns, CAP_SYS_ADMIN)) return -EPERM; /* * CLONE_NEWIPC must detach from the undolist: after switching * to a new ipc namespace, the semaphore arrays from the old * namespace are unreachable. In clone parlance, CLONE_SYSVSEM * means share undolist with parent, so we must forbid using * it along with CLONE_NEWIPC. */ if ((flags & (CLONE_NEWIPC | CLONE_SYSVSEM)) == (CLONE_NEWIPC | CLONE_SYSVSEM)) return -EINVAL; new_ns = create_new_namespaces(flags, tsk, user_ns, tsk->fs); if (IS_ERR(new_ns)) return PTR_ERR(new_ns); if ((flags & CLONE_VM) == 0) timens_on_fork(new_ns, tsk); tsk->nsproxy = new_ns; return 0; } void free_nsproxy(struct nsproxy *ns) { if (ns->mnt_ns) put_mnt_ns(ns->mnt_ns); if (ns->uts_ns) put_uts_ns(ns->uts_ns); if (ns->ipc_ns) put_ipc_ns(ns->ipc_ns); if (ns->pid_ns_for_children) put_pid_ns(ns->pid_ns_for_children); if (ns->time_ns) put_time_ns(ns->time_ns); if (ns->time_ns_for_children) put_time_ns(ns->time_ns_for_children); put_cgroup_ns(ns->cgroup_ns); put_net(ns->net_ns); kmem_cache_free(nsproxy_cachep, ns); } /* * Called from unshare. Unshare all the namespaces part of nsproxy. * On success, returns the new nsproxy. */ int unshare_nsproxy_namespaces(unsigned long unshare_flags, struct nsproxy **new_nsp, struct cred *new_cred, struct fs_struct *new_fs) { struct user_namespace *user_ns; int err = 0; if (!(unshare_flags & (CLONE_NEWNS | CLONE_NEWUTS | CLONE_NEWIPC | CLONE_NEWNET | CLONE_NEWPID | CLONE_NEWCGROUP | CLONE_NEWTIME))) return 0; user_ns = new_cred ? new_cred->user_ns : current_user_ns(); if (!ns_capable(user_ns, CAP_SYS_ADMIN)) return -EPERM; *new_nsp = create_new_namespaces(unshare_flags, current, user_ns, new_fs ? new_fs : current->fs); if (IS_ERR(*new_nsp)) { err = PTR_ERR(*new_nsp); goto out; } out: return err; } void switch_task_namespaces(struct task_struct *p, struct nsproxy *new) { struct nsproxy *ns; might_sleep(); task_lock(p); ns = p->nsproxy; p->nsproxy = new; task_unlock(p); if (ns) put_nsproxy(ns); } void exit_task_namespaces(struct task_struct *p) { switch_task_namespaces(p, NULL); } int exec_task_namespaces(void) { struct task_struct *tsk = current; struct nsproxy *new; if (tsk->nsproxy->time_ns_for_children == tsk->nsproxy->time_ns) return 0; new = create_new_namespaces(0, tsk, current_user_ns(), tsk->fs); if (IS_ERR(new)) return PTR_ERR(new); timens_on_fork(new, tsk); switch_task_namespaces(tsk, new); return 0; } static int check_setns_flags(unsigned long flags) { if (!flags || (flags & ~(CLONE_NEWNS | CLONE_NEWUTS | CLONE_NEWIPC | CLONE_NEWNET | CLONE_NEWTIME | CLONE_NEWUSER | CLONE_NEWPID | CLONE_NEWCGROUP))) return -EINVAL; #ifndef CONFIG_USER_NS if (flags & CLONE_NEWUSER) return -EINVAL; #endif #ifndef CONFIG_PID_NS if (flags & CLONE_NEWPID) return -EINVAL; #endif #ifndef CONFIG_UTS_NS if (flags & CLONE_NEWUTS) return -EINVAL; #endif #ifndef CONFIG_IPC_NS if (flags & CLONE_NEWIPC) return -EINVAL; #endif #ifndef CONFIG_CGROUPS if (flags & CLONE_NEWCGROUP) return -EINVAL; #endif #ifndef CONFIG_NET_NS if (flags & CLONE_NEWNET) return -EINVAL; #endif #ifndef CONFIG_TIME_NS if (flags & CLONE_NEWTIME) return -EINVAL; #endif return 0; } static void put_nsset(struct nsset *nsset) { unsigned flags = nsset->flags; if (flags & CLONE_NEWUSER) put_cred(nsset_cred(nsset)); /* * We only created a temporary copy if we attached to more than just * the mount namespace. */ if (nsset->fs && (flags & CLONE_NEWNS) && (flags & ~CLONE_NEWNS)) free_fs_struct(nsset->fs); if (nsset->nsproxy) free_nsproxy(nsset->nsproxy); } static int prepare_nsset(unsigned flags, struct nsset *nsset) { struct task_struct *me = current; nsset->nsproxy = create_new_namespaces(0, me, current_user_ns(), me->fs); if (IS_ERR(nsset->nsproxy)) return PTR_ERR(nsset->nsproxy); if (flags & CLONE_NEWUSER) nsset->cred = prepare_creds(); else nsset->cred = current_cred(); if (!nsset->cred) goto out; /* Only create a temporary copy of fs_struct if we really need to. */ if (flags == CLONE_NEWNS) { nsset->fs = me->fs; } else if (flags & CLONE_NEWNS) { nsset->fs = copy_fs_struct(me->fs); if (!nsset->fs) goto out; } nsset->flags = flags; return 0; out: put_nsset(nsset); return -ENOMEM; } static inline int validate_ns(struct nsset *nsset, struct ns_common *ns) { return ns->ops->install(nsset, ns); } /* * This is the inverse operation to unshare(). * Ordering is equivalent to the standard ordering used everywhere else * during unshare and process creation. The switch to the new set of * namespaces occurs at the point of no return after installation of * all requested namespaces was successful in commit_nsset(). */ static int validate_nsset(struct nsset *nsset, struct pid *pid) { int ret = 0; unsigned flags = nsset->flags; struct user_namespace *user_ns = NULL; struct pid_namespace *pid_ns = NULL; struct nsproxy *nsp; struct task_struct *tsk; /* Take a "snapshot" of the target task's namespaces. */ rcu_read_lock(); tsk = pid_task(pid, PIDTYPE_PID); if (!tsk) { rcu_read_unlock(); return -ESRCH; } if (!ptrace_may_access(tsk, PTRACE_MODE_READ_REALCREDS)) { rcu_read_unlock(); return -EPERM; } task_lock(tsk); nsp = tsk->nsproxy; if (nsp) get_nsproxy(nsp); task_unlock(tsk); if (!nsp) { rcu_read_unlock(); return -ESRCH; } #ifdef CONFIG_PID_NS if (flags & CLONE_NEWPID) { pid_ns = task_active_pid_ns(tsk); if (unlikely(!pid_ns)) { rcu_read_unlock(); ret = -ESRCH; goto out; } get_pid_ns(pid_ns); } #endif #ifdef CONFIG_USER_NS if (flags & CLONE_NEWUSER) user_ns = get_user_ns(__task_cred(tsk)->user_ns); #endif rcu_read_unlock(); /* * Install requested namespaces. The caller will have * verified earlier that the requested namespaces are * supported on this kernel. We don't report errors here * if a namespace is requested that isn't supported. */ #ifdef CONFIG_USER_NS if (flags & CLONE_NEWUSER) { ret = validate_ns(nsset, &user_ns->ns); if (ret) goto out; } #endif if (flags & CLONE_NEWNS) { ret = validate_ns(nsset, from_mnt_ns(nsp->mnt_ns)); if (ret) goto out; } #ifdef CONFIG_UTS_NS if (flags & CLONE_NEWUTS) { ret = validate_ns(nsset, &nsp->uts_ns->ns); if (ret) goto out; } #endif #ifdef CONFIG_IPC_NS if (flags & CLONE_NEWIPC) { ret = validate_ns(nsset, &nsp->ipc_ns->ns); if (ret) goto out; } #endif #ifdef CONFIG_PID_NS if (flags & CLONE_NEWPID) { ret = validate_ns(nsset, &pid_ns->ns); if (ret) goto out; } #endif #ifdef CONFIG_CGROUPS if (flags & CLONE_NEWCGROUP) { ret = validate_ns(nsset, &nsp->cgroup_ns->ns); if (ret) goto out; } #endif #ifdef CONFIG_NET_NS if (flags & CLONE_NEWNET) { ret = validate_ns(nsset, &nsp->net_ns->ns); if (ret) goto out; } #endif #ifdef CONFIG_TIME_NS if (flags & CLONE_NEWTIME) { ret = validate_ns(nsset, &nsp->time_ns->ns); if (ret) goto out; } #endif out: if (pid_ns) put_pid_ns(pid_ns); if (nsp) put_nsproxy(nsp); put_user_ns(user_ns); return ret; } /* * This is the point of no return. There are just a few namespaces * that do some actual work here and it's sufficiently minimal that * a separate ns_common operation seems unnecessary for now. * Unshare is doing the same thing. If we'll end up needing to do * more in a given namespace or a helper here is ultimately not * exported anymore a simple commit handler for each namespace * should be added to ns_common. */ static void commit_nsset(struct nsset *nsset) { unsigned flags = nsset->flags; struct task_struct *me = current; #ifdef CONFIG_USER_NS if (flags & CLONE_NEWUSER) { /* transfer ownership */ commit_creds(nsset_cred(nsset)); nsset->cred = NULL; } #endif /* We only need to commit if we have used a temporary fs_struct. */ if ((flags & CLONE_NEWNS) && (flags & ~CLONE_NEWNS)) { set_fs_root(me->fs, &nsset->fs->root); set_fs_pwd(me->fs, &nsset->fs->pwd); } #ifdef CONFIG_IPC_NS if (flags & CLONE_NEWIPC) exit_sem(me); #endif #ifdef CONFIG_TIME_NS if (flags & CLONE_NEWTIME) timens_commit(me, nsset->nsproxy->time_ns); #endif /* transfer ownership */ switch_task_namespaces(me, nsset->nsproxy); nsset->nsproxy = NULL; } SYSCALL_DEFINE2(setns, int, fd, int, flags) { struct fd f = fdget(fd); struct ns_common *ns = NULL; struct nsset nsset = {}; int err = 0; if (!f.file) return -EBADF; if (proc_ns_file(f.file)) { ns = get_proc_ns(file_inode(f.file)); if (flags && (ns->ops->type != flags)) err = -EINVAL; flags = ns->ops->type; } else if (!IS_ERR(pidfd_pid(f.file))) { err = check_setns_flags(flags); } else { err = -EINVAL; } if (err) goto out; err = prepare_nsset(flags, &nsset); if (err) goto out; if (proc_ns_file(f.file)) err = validate_ns(&nsset, ns); else err = validate_nsset(&nsset, f.file->private_data); if (!err) { commit_nsset(&nsset); perf_event_namespaces(current); } put_nsset(&nsset); out: fdput(f); return err; } int __init nsproxy_cache_init(void) { nsproxy_cachep = KMEM_CACHE(nsproxy, SLAB_PANIC|SLAB_ACCOUNT); return 0; } |
59 103 103 8 8 7 2 81 80 81 81 80 81 59 59 59 59 59 59 60 60 59 3 3 60 31 31 31 31 31 72 71 67 21 72 62 21 71 21 72 74 68 24 24 74 107 107 107 107 44 1 107 107 107 107 107 1 107 39 39 26 38 39 107 107 107 39 107 107 29 107 10 107 107 107 1 107 107 107 107 107 35 107 107 35 4 4 107 1 1 57 57 57 57 57 57 57 57 57 57 129 128 136 136 77 21 19 19 6 6 6 44 43 1 44 112 113 112 136 3 3 58 58 58 57 46 57 3 3 3 1 3 3 58 58 4 3 3 79 78 79 78 78 79 79 79 79 3 3 78 79 79 79 79 79 79 79 79 79 79 3 79 79 58 58 79 78 79 79 121 117 32 4 57 57 57 55 57 57 57 57 57 2 2 2 2 101 47 47 11 12 | 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60 61 62 63 64 65 66 67 68 69 70 71 72 73 74 75 76 77 78 79 80 81 82 83 84 85 86 87 88 89 90 91 92 93 94 95 96 97 98 99 100 101 102 103 104 105 106 107 108 109 110 111 112 113 114 115 116 117 118 119 120 121 122 123 124 125 126 127 128 129 130 131 132 133 134 135 136 137 138 139 140 141 142 143 144 145 146 147 148 149 150 151 152 153 154 155 156 157 158 159 160 161 162 163 164 165 166 167 168 169 170 171 172 173 174 175 176 177 178 179 180 181 182 183 184 185 186 187 188 189 190 191 192 193 194 195 196 197 198 199 200 201 202 203 204 205 206 207 208 209 210 211 212 213 214 215 216 217 218 219 220 221 222 223 224 225 226 227 228 229 230 231 232 233 234 235 236 237 238 239 240 241 242 243 244 245 246 247 248 249 250 251 252 253 254 255 256 257 258 259 260 261 262 263 264 265 266 267 268 269 270 271 272 273 274 275 276 277 278 279 280 281 282 283 284 285 286 287 288 289 290 291 292 293 294 295 296 297 298 299 300 301 302 303 304 305 306 307 308 309 310 311 312 313 314 315 316 317 318 319 320 321 322 323 324 325 326 327 328 329 330 331 332 333 334 335 336 337 338 339 340 341 342 343 344 345 346 347 348 349 350 351 352 353 354 355 356 357 358 359 360 361 362 363 364 365 366 367 368 369 370 371 372 373 374 375 376 377 378 379 380 381 382 383 384 385 386 387 388 389 390 391 392 393 394 395 396 397 398 399 400 401 402 403 404 405 406 407 408 409 410 411 412 413 414 415 416 417 418 419 420 421 422 423 424 425 426 427 428 429 430 431 432 433 434 435 436 437 438 439 440 441 442 443 444 445 446 447 448 449 450 451 452 453 454 455 456 457 458 459 460 461 462 463 464 465 466 467 468 469 470 471 472 473 474 475 476 477 478 479 480 481 482 483 484 485 486 487 488 489 490 491 492 493 494 495 496 497 498 499 500 501 502 503 504 505 506 507 508 509 510 511 512 513 514 515 516 517 518 519 520 521 522 523 524 525 526 527 528 529 530 531 532 533 534 535 536 537 538 539 540 541 542 543 544 545 546 547 548 549 550 551 552 553 554 555 556 557 558 559 560 561 562 563 564 565 566 567 568 569 570 571 572 573 574 575 576 577 578 579 580 581 582 583 584 585 586 587 588 589 590 591 592 593 594 595 596 597 598 599 600 601 602 603 604 605 606 607 608 609 610 611 612 613 614 615 616 617 618 619 620 621 622 623 624 625 626 627 628 629 630 631 632 633 634 635 636 637 638 639 640 641 642 643 644 645 646 647 648 649 650 651 652 653 654 655 656 657 658 659 660 661 662 663 664 665 666 667 668 669 670 671 672 673 674 675 676 677 678 679 680 681 682 683 684 685 686 687 688 689 690 691 692 693 694 695 696 697 698 699 700 701 702 703 704 705 706 707 708 709 710 711 712 713 714 715 716 717 718 719 720 721 722 723 724 725 726 727 728 729 730 731 732 733 734 735 736 737 738 739 740 741 742 743 744 745 746 747 748 749 750 751 752 753 754 755 756 757 758 759 760 761 762 763 764 765 766 767 768 769 770 771 772 773 774 775 776 777 778 779 780 781 782 783 784 785 786 787 788 789 790 791 792 793 794 795 796 797 798 799 800 801 802 803 804 805 806 807 808 809 810 811 812 813 814 815 816 817 818 819 820 821 822 823 824 825 826 827 828 829 830 831 832 833 834 835 836 837 838 839 840 841 842 843 844 845 846 847 848 849 850 851 852 853 854 855 856 857 858 859 860 861 862 863 864 865 866 867 868 869 870 871 872 873 874 875 876 877 878 879 880 881 882 883 884 885 886 887 888 889 890 891 892 893 894 895 896 897 898 899 900 901 902 903 904 905 906 907 908 909 910 911 912 913 914 915 916 917 918 919 920 921 922 923 924 925 926 927 928 929 930 931 932 933 934 935 936 937 938 939 940 941 942 943 944 945 946 947 948 949 950 951 952 953 954 955 956 957 958 959 960 961 962 963 964 965 966 967 968 969 970 971 972 973 974 975 976 977 978 979 980 981 982 983 984 985 986 987 988 989 990 991 992 993 994 995 996 997 998 999 1000 1001 1002 1003 1004 1005 1006 1007 1008 1009 1010 1011 1012 1013 1014 1015 1016 1017 1018 1019 1020 1021 1022 1023 1024 1025 1026 1027 1028 1029 1030 1031 1032 1033 1034 1035 1036 1037 1038 1039 1040 1041 1042 1043 1044 1045 1046 1047 1048 1049 1050 1051 1052 1053 1054 1055 1056 1057 1058 1059 1060 1061 1062 1063 1064 1065 1066 1067 1068 1069 1070 1071 1072 1073 1074 1075 1076 1077 1078 1079 1080 1081 1082 1083 1084 1085 1086 1087 1088 1089 1090 1091 1092 1093 1094 1095 1096 1097 1098 1099 1100 1101 1102 1103 1104 1105 1106 1107 1108 1109 1110 1111 1112 1113 1114 1115 1116 1117 1118 1119 1120 1121 1122 1123 1124 1125 1126 1127 1128 1129 1130 1131 1132 1133 1134 1135 1136 1137 1138 1139 1140 1141 1142 1143 1144 1145 1146 1147 1148 1149 1150 1151 1152 1153 1154 1155 1156 1157 1158 1159 1160 1161 1162 1163 1164 1165 1166 1167 1168 1169 1170 1171 1172 1173 1174 1175 1176 1177 1178 1179 1180 1181 1182 1183 1184 1185 1186 1187 1188 1189 1190 1191 1192 1193 1194 1195 1196 1197 1198 1199 1200 1201 1202 1203 1204 1205 1206 1207 1208 1209 1210 1211 1212 1213 1214 1215 1216 1217 1218 1219 1220 1221 1222 1223 1224 1225 1226 1227 1228 1229 1230 1231 1232 1233 1234 1235 1236 1237 1238 1239 1240 1241 1242 1243 1244 1245 1246 1247 1248 1249 1250 1251 1252 1253 1254 1255 1256 1257 1258 1259 1260 1261 1262 1263 1264 1265 1266 1267 1268 1269 1270 1271 1272 1273 1274 1275 1276 1277 1278 1279 1280 1281 1282 1283 1284 1285 1286 1287 1288 1289 1290 1291 1292 1293 1294 1295 1296 1297 1298 1299 1300 1301 1302 1303 1304 1305 1306 1307 1308 1309 1310 1311 1312 1313 1314 1315 1316 1317 1318 1319 1320 1321 1322 1323 1324 1325 1326 1327 1328 1329 1330 1331 1332 1333 1334 1335 1336 1337 1338 1339 1340 1341 1342 1343 1344 1345 1346 1347 1348 1349 1350 1351 1352 1353 1354 1355 1356 1357 1358 1359 1360 1361 1362 1363 1364 1365 1366 1367 1368 1369 1370 1371 1372 1373 1374 1375 1376 1377 1378 1379 1380 1381 1382 1383 1384 1385 1386 1387 1388 1389 1390 1391 1392 1393 1394 1395 1396 1397 1398 1399 1400 1401 1402 1403 1404 1405 1406 1407 1408 1409 1410 1411 1412 1413 1414 1415 1416 1417 1418 1419 1420 1421 1422 1423 1424 1425 1426 1427 1428 1429 1430 1431 1432 1433 1434 1435 1436 1437 1438 1439 1440 1441 1442 1443 1444 1445 1446 1447 1448 1449 1450 1451 1452 1453 1454 1455 1456 1457 1458 1459 1460 1461 1462 1463 1464 1465 1466 1467 1468 1469 1470 1471 1472 1473 1474 1475 1476 1477 1478 1479 1480 1481 1482 1483 1484 1485 1486 1487 1488 1489 1490 1491 1492 1493 1494 1495 1496 1497 1498 1499 1500 1501 1502 1503 1504 1505 1506 1507 1508 1509 1510 1511 1512 1513 1514 1515 1516 1517 1518 1519 1520 1521 1522 1523 1524 1525 1526 1527 1528 1529 1530 1531 1532 1533 1534 1535 1536 1537 1538 1539 1540 1541 1542 1543 1544 1545 1546 1547 1548 1549 1550 1551 1552 1553 1554 1555 1556 1557 1558 1559 1560 1561 1562 1563 1564 1565 1566 1567 1568 1569 1570 1571 1572 1573 1574 1575 1576 1577 1578 1579 1580 1581 1582 1583 1584 1585 1586 1587 1588 1589 1590 1591 1592 1593 1594 1595 1596 1597 1598 1599 1600 1601 1602 1603 1604 1605 1606 1607 1608 1609 1610 1611 1612 1613 1614 1615 1616 1617 1618 1619 1620 1621 1622 1623 1624 1625 1626 1627 1628 1629 1630 1631 1632 1633 1634 1635 1636 1637 1638 1639 1640 1641 1642 1643 1644 1645 1646 1647 1648 1649 1650 1651 1652 1653 1654 1655 1656 1657 1658 1659 1660 1661 1662 1663 1664 1665 1666 1667 1668 1669 1670 1671 1672 1673 1674 1675 1676 1677 1678 1679 1680 1681 1682 1683 1684 1685 1686 1687 1688 1689 1690 1691 1692 1693 1694 1695 1696 1697 1698 1699 1700 1701 1702 1703 1704 1705 1706 1707 1708 1709 1710 1711 1712 1713 1714 1715 1716 1717 1718 1719 1720 1721 1722 1723 1724 1725 1726 1727 1728 1729 1730 1731 1732 1733 1734 1735 1736 1737 1738 1739 1740 1741 1742 1743 1744 1745 1746 1747 1748 1749 1750 1751 1752 1753 1754 1755 1756 1757 1758 1759 1760 1761 1762 1763 1764 1765 1766 1767 1768 1769 1770 1771 1772 1773 1774 1775 1776 1777 1778 1779 1780 1781 1782 1783 1784 1785 1786 1787 1788 1789 1790 1791 1792 1793 1794 1795 1796 1797 1798 1799 1800 1801 1802 1803 1804 1805 1806 1807 1808 1809 1810 1811 1812 1813 1814 1815 1816 1817 1818 1819 1820 1821 1822 1823 1824 1825 1826 1827 1828 1829 1830 1831 1832 1833 1834 1835 1836 1837 1838 1839 1840 1841 1842 1843 1844 1845 1846 1847 1848 1849 1850 1851 1852 1853 1854 1855 1856 1857 1858 1859 1860 1861 1862 1863 1864 1865 1866 1867 1868 1869 1870 1871 1872 1873 1874 1875 1876 1877 1878 1879 1880 1881 1882 1883 1884 1885 1886 1887 1888 1889 1890 1891 1892 1893 1894 1895 1896 1897 1898 1899 1900 1901 1902 1903 1904 1905 1906 1907 1908 1909 1910 1911 1912 1913 1914 1915 1916 1917 1918 1919 1920 1921 1922 1923 1924 1925 1926 1927 1928 1929 1930 1931 1932 1933 1934 1935 1936 1937 1938 1939 1940 1941 1942 1943 1944 1945 1946 1947 1948 1949 1950 1951 1952 1953 1954 1955 1956 1957 1958 1959 1960 1961 1962 1963 1964 1965 1966 1967 1968 1969 1970 1971 1972 1973 1974 1975 1976 1977 1978 1979 1980 1981 1982 1983 1984 1985 1986 1987 1988 1989 1990 1991 1992 1993 1994 1995 1996 1997 1998 1999 2000 2001 2002 2003 2004 2005 2006 2007 2008 2009 2010 2011 2012 2013 2014 2015 2016 2017 2018 2019 2020 2021 2022 2023 2024 2025 2026 2027 2028 2029 2030 2031 2032 2033 2034 2035 2036 2037 2038 2039 2040 2041 2042 2043 2044 2045 2046 2047 2048 2049 2050 2051 2052 2053 2054 2055 2056 2057 2058 2059 2060 2061 2062 2063 2064 2065 2066 2067 2068 2069 2070 2071 2072 2073 2074 2075 2076 2077 2078 2079 2080 2081 2082 2083 2084 2085 2086 2087 2088 2089 2090 2091 2092 2093 2094 2095 2096 2097 2098 2099 2100 2101 2102 2103 2104 2105 2106 2107 2108 2109 2110 2111 2112 2113 2114 2115 2116 2117 2118 2119 2120 2121 2122 2123 2124 2125 2126 2127 2128 2129 2130 2131 2132 2133 2134 2135 2136 2137 2138 2139 2140 2141 2142 2143 2144 2145 2146 2147 2148 2149 2150 2151 2152 2153 2154 2155 2156 2157 2158 2159 2160 2161 2162 2163 2164 2165 2166 2167 2168 2169 2170 2171 2172 2173 2174 2175 2176 2177 2178 2179 2180 2181 2182 2183 2184 2185 2186 2187 2188 2189 2190 2191 2192 2193 2194 2195 2196 2197 2198 2199 2200 2201 2202 2203 2204 2205 2206 2207 2208 2209 2210 2211 2212 2213 2214 2215 2216 2217 2218 2219 2220 2221 2222 2223 2224 2225 2226 2227 2228 2229 2230 2231 2232 2233 2234 2235 2236 2237 2238 2239 2240 2241 2242 2243 2244 2245 2246 2247 2248 2249 2250 2251 2252 2253 2254 2255 2256 2257 2258 2259 2260 2261 2262 2263 2264 2265 2266 2267 2268 2269 2270 2271 2272 2273 2274 2275 2276 2277 2278 2279 2280 2281 2282 2283 2284 2285 2286 2287 2288 2289 2290 2291 2292 2293 2294 2295 2296 2297 2298 2299 2300 2301 2302 2303 2304 2305 2306 2307 2308 2309 2310 2311 2312 2313 2314 2315 2316 2317 2318 2319 2320 2321 2322 2323 2324 2325 2326 2327 2328 2329 2330 2331 2332 2333 2334 2335 2336 2337 2338 2339 2340 2341 2342 2343 2344 2345 2346 2347 2348 2349 2350 2351 2352 2353 2354 2355 2356 2357 2358 2359 2360 2361 2362 2363 2364 2365 2366 2367 2368 2369 2370 2371 2372 2373 2374 2375 2376 | // SPDX-License-Identifier: GPL-2.0 /* * Copyright (c) 2000-2006 Silicon Graphics, Inc. * All Rights Reserved. */ #include "xfs.h" #include <linux/backing-dev.h> #include <linux/dax.h> #include "xfs_shared.h" #include "xfs_format.h" #include "xfs_log_format.h" #include "xfs_trans_resv.h" #include "xfs_mount.h" #include "xfs_trace.h" #include "xfs_log.h" #include "xfs_log_recover.h" #include "xfs_log_priv.h" #include "xfs_trans.h" #include "xfs_buf_item.h" #include "xfs_errortag.h" #include "xfs_error.h" #include "xfs_ag.h" struct kmem_cache *xfs_buf_cache; /* * Locking orders * * xfs_buf_ioacct_inc: * xfs_buf_ioacct_dec: * b_sema (caller holds) * b_lock * * xfs_buf_stale: * b_sema (caller holds) * b_lock * lru_lock * * xfs_buf_rele: * b_lock * pag_buf_lock * lru_lock * * xfs_buftarg_drain_rele * lru_lock * b_lock (trylock due to inversion) * * xfs_buftarg_isolate * lru_lock * b_lock (trylock due to inversion) */ static int __xfs_buf_submit(struct xfs_buf *bp, bool wait); static inline int xfs_buf_submit( struct xfs_buf *bp) { return __xfs_buf_submit(bp, !(bp->b_flags & XBF_ASYNC)); } static inline int xfs_buf_is_vmapped( struct xfs_buf *bp) { /* * Return true if the buffer is vmapped. * * b_addr is null if the buffer is not mapped, but the code is clever * enough to know it doesn't have to map a single page, so the check has * to be both for b_addr and bp->b_page_count > 1. */ return bp->b_addr && bp->b_page_count > 1; } static inline int xfs_buf_vmap_len( struct xfs_buf *bp) { return (bp->b_page_count * PAGE_SIZE); } /* * Bump the I/O in flight count on the buftarg if we haven't yet done so for * this buffer. The count is incremented once per buffer (per hold cycle) * because the corresponding decrement is deferred to buffer release. Buffers * can undergo I/O multiple times in a hold-release cycle and per buffer I/O * tracking adds unnecessary overhead. This is used for sychronization purposes * with unmount (see xfs_buftarg_drain()), so all we really need is a count of * in-flight buffers. * * Buffers that are never released (e.g., superblock, iclog buffers) must set * the XBF_NO_IOACCT flag before I/O submission. Otherwise, the buftarg count * never reaches zero and unmount hangs indefinitely. */ static inline void xfs_buf_ioacct_inc( struct xfs_buf *bp) { if (bp->b_flags & XBF_NO_IOACCT) return; ASSERT(bp->b_flags & XBF_ASYNC); spin_lock(&bp->b_lock); if (!(bp->b_state & XFS_BSTATE_IN_FLIGHT)) { bp->b_state |= XFS_BSTATE_IN_FLIGHT; percpu_counter_inc(&bp->b_target->bt_io_count); } spin_unlock(&bp->b_lock); } /* * Clear the in-flight state on a buffer about to be released to the LRU or * freed and unaccount from the buftarg. */ static inline void __xfs_buf_ioacct_dec( struct xfs_buf *bp) { lockdep_assert_held(&bp->b_lock); if (bp->b_state & XFS_BSTATE_IN_FLIGHT) { bp->b_state &= ~XFS_BSTATE_IN_FLIGHT; percpu_counter_dec(&bp->b_target->bt_io_count); } } static inline void xfs_buf_ioacct_dec( struct xfs_buf *bp) { spin_lock(&bp->b_lock); __xfs_buf_ioacct_dec(bp); spin_unlock(&bp->b_lock); } /* * When we mark a buffer stale, we remove the buffer from the LRU and clear the * b_lru_ref count so that the buffer is freed immediately when the buffer * reference count falls to zero. If the buffer is already on the LRU, we need * to remove the reference that LRU holds on the buffer. * * This prevents build-up of stale buffers on the LRU. */ void xfs_buf_stale( struct xfs_buf *bp) { ASSERT(xfs_buf_islocked(bp)); bp->b_flags |= XBF_STALE; /* * Clear the delwri status so that a delwri queue walker will not * flush this buffer to disk now that it is stale. The delwri queue has * a reference to the buffer, so this is safe to do. */ bp->b_flags &= ~_XBF_DELWRI_Q; /* * Once the buffer is marked stale and unlocked, a subsequent lookup * could reset b_flags. There is no guarantee that the buffer is * unaccounted (released to LRU) before that occurs. Drop in-flight * status now to preserve accounting consistency. */ spin_lock(&bp->b_lock); __xfs_buf_ioacct_dec(bp); atomic_set(&bp->b_lru_ref, 0); if (!(bp->b_state & XFS_BSTATE_DISPOSE) && (list_lru_del(&bp->b_target->bt_lru, &bp->b_lru))) atomic_dec(&bp->b_hold); ASSERT(atomic_read(&bp->b_hold) >= 1); spin_unlock(&bp->b_lock); } static int xfs_buf_get_maps( struct xfs_buf *bp, int map_count) { ASSERT(bp->b_maps == NULL); bp->b_map_count = map_count; if (map_count == 1) { bp->b_maps = &bp->__b_map; return 0; } bp->b_maps = kmem_zalloc(map_count * sizeof(struct xfs_buf_map), KM_NOFS); if (!bp->b_maps) return -ENOMEM; return 0; } /* * Frees b_pages if it was allocated. */ static void xfs_buf_free_maps( struct xfs_buf *bp) { if (bp->b_maps != &bp->__b_map) { kmem_free(bp->b_maps); bp->b_maps = NULL; } } static int _xfs_buf_alloc( struct xfs_buftarg *target, struct xfs_buf_map *map, int nmaps, xfs_buf_flags_t flags, struct xfs_buf **bpp) { struct xfs_buf *bp; int error; int i; *bpp = NULL; bp = kmem_cache_zalloc(xfs_buf_cache, GFP_NOFS | __GFP_NOFAIL); /* * We don't want certain flags to appear in b_flags unless they are * specifically set by later operations on the buffer. */ flags &= ~(XBF_UNMAPPED | XBF_TRYLOCK | XBF_ASYNC | XBF_READ_AHEAD); atomic_set(&bp->b_hold, 1); atomic_set(&bp->b_lru_ref, 1); init_completion(&bp->b_iowait); INIT_LIST_HEAD(&bp->b_lru); INIT_LIST_HEAD(&bp->b_list); INIT_LIST_HEAD(&bp->b_li_list); sema_init(&bp->b_sema, 0); /* held, no waiters */ spin_lock_init(&bp->b_lock); bp->b_target = target; bp->b_mount = target->bt_mount; bp->b_flags = flags; /* * Set length and io_length to the same value initially. * I/O routines should use io_length, which will be the same in * most cases but may be reset (e.g. XFS recovery). */ error = xfs_buf_get_maps(bp, nmaps); if (error) { kmem_cache_free(xfs_buf_cache, bp); return error; } bp->b_rhash_key = map[0].bm_bn; bp->b_length = 0; for (i = 0; i < nmaps; i++) { bp->b_maps[i].bm_bn = map[i].bm_bn; bp->b_maps[i].bm_len = map[i].bm_len; bp->b_length += map[i].bm_len; } atomic_set(&bp->b_pin_count, 0); init_waitqueue_head(&bp->b_waiters); XFS_STATS_INC(bp->b_mount, xb_create); trace_xfs_buf_init(bp, _RET_IP_); *bpp = bp; return 0; } static void xfs_buf_free_pages( struct xfs_buf *bp) { uint i; ASSERT(bp->b_flags & _XBF_PAGES); if (xfs_buf_is_vmapped(bp)) vm_unmap_ram(bp->b_addr, bp->b_page_count); for (i = 0; i < bp->b_page_count; i++) { if (bp->b_pages[i]) __free_page(bp->b_pages[i]); } mm_account_reclaimed_pages(bp->b_page_count); if (bp->b_pages != bp->b_page_array) kmem_free(bp->b_pages); bp->b_pages = NULL; bp->b_flags &= ~_XBF_PAGES; } static void xfs_buf_free_callback( struct callback_head *cb) { struct xfs_buf *bp = container_of(cb, struct xfs_buf, b_rcu); xfs_buf_free_maps(bp); kmem_cache_free(xfs_buf_cache, bp); } static void xfs_buf_free( struct xfs_buf *bp) { trace_xfs_buf_free(bp, _RET_IP_); ASSERT(list_empty(&bp->b_lru)); if (bp->b_flags & _XBF_PAGES) xfs_buf_free_pages(bp); else if (bp->b_flags & _XBF_KMEM) kmem_free(bp->b_addr); call_rcu(&bp->b_rcu, xfs_buf_free_callback); } static int xfs_buf_alloc_kmem( struct xfs_buf *bp, xfs_buf_flags_t flags) { xfs_km_flags_t kmflag_mask = KM_NOFS; size_t size = BBTOB(bp->b_length); /* Assure zeroed buffer for non-read cases. */ if (!(flags & XBF_READ)) kmflag_mask |= KM_ZERO; bp->b_addr = kmem_alloc(size, kmflag_mask); if (!bp->b_addr) return -ENOMEM; if (((unsigned long)(bp->b_addr + size - 1) & PAGE_MASK) != ((unsigned long)bp->b_addr & PAGE_MASK)) { /* b_addr spans two pages - use alloc_page instead */ kmem_free(bp->b_addr); bp->b_addr = NULL; return -ENOMEM; } bp->b_offset = offset_in_page(bp->b_addr); bp->b_pages = bp->b_page_array; bp->b_pages[0] = kmem_to_page(bp->b_addr); bp->b_page_count = 1; bp->b_flags |= _XBF_KMEM; return 0; } static int xfs_buf_alloc_pages( struct xfs_buf *bp, xfs_buf_flags_t flags) { gfp_t gfp_mask = __GFP_NOWARN; long filled = 0; if (flags & XBF_READ_AHEAD) gfp_mask |= __GFP_NORETRY; else gfp_mask |= GFP_NOFS; /* Make sure that we have a page list */ bp->b_page_count = DIV_ROUND_UP(BBTOB(bp->b_length), PAGE_SIZE); if (bp->b_page_count <= XB_PAGES) { bp->b_pages = bp->b_page_array; } else { bp->b_pages = kzalloc(sizeof(struct page *) * bp->b_page_count, gfp_mask); if (!bp->b_pages) return -ENOMEM; } bp->b_flags |= _XBF_PAGES; /* Assure zeroed buffer for non-read cases. */ if (!(flags & XBF_READ)) gfp_mask |= __GFP_ZERO; /* * Bulk filling of pages can take multiple calls. Not filling the entire * array is not an allocation failure, so don't back off if we get at * least one extra page. */ for (;;) { long last = filled; filled = alloc_pages_bulk_array(gfp_mask, bp->b_page_count, bp->b_pages); if (filled == bp->b_page_count) { XFS_STATS_INC(bp->b_mount, xb_page_found); break; } if (filled != last) continue; if (flags & XBF_READ_AHEAD) { xfs_buf_free_pages(bp); return -ENOMEM; } XFS_STATS_INC(bp->b_mount, xb_page_retries); memalloc_retry_wait(gfp_mask); } return 0; } /* * Map buffer into kernel address-space if necessary. */ STATIC int _xfs_buf_map_pages( struct xfs_buf *bp, xfs_buf_flags_t flags) { ASSERT(bp->b_flags & _XBF_PAGES); if (bp->b_page_count == 1) { /* A single page buffer is always mappable */ bp->b_addr = page_address(bp->b_pages[0]); } else if (flags & XBF_UNMAPPED) { bp->b_addr = NULL; } else { int retried = 0; unsigned nofs_flag; /* * vm_map_ram() will allocate auxiliary structures (e.g. * pagetables) with GFP_KERNEL, yet we are likely to be under * GFP_NOFS context here. Hence we need to tell memory reclaim * that we are in such a context via PF_MEMALLOC_NOFS to prevent * memory reclaim re-entering the filesystem here and * potentially deadlocking. */ nofs_flag = memalloc_nofs_save(); do { bp->b_addr = vm_map_ram(bp->b_pages, bp->b_page_count, -1); if (bp->b_addr) break; vm_unmap_aliases(); } while (retried++ <= 1); memalloc_nofs_restore(nofs_flag); if (!bp->b_addr) return -ENOMEM; } return 0; } /* * Finding and Reading Buffers */ static int _xfs_buf_obj_cmp( struct rhashtable_compare_arg *arg, const void *obj) { const struct xfs_buf_map *map = arg->key; const struct xfs_buf *bp = obj; /* * The key hashing in the lookup path depends on the key being the * first element of the compare_arg, make sure to assert this. */ BUILD_BUG_ON(offsetof(struct xfs_buf_map, bm_bn) != 0); if (bp->b_rhash_key != map->bm_bn) return 1; if (unlikely(bp->b_length != map->bm_len)) { /* * found a block number match. If the range doesn't * match, the only way this is allowed is if the buffer * in the cache is stale and the transaction that made * it stale has not yet committed. i.e. we are * reallocating a busy extent. Skip this buffer and * continue searching for an exact match. */ if (!(map->bm_flags & XBM_LIVESCAN)) ASSERT(bp->b_flags & XBF_STALE); return 1; } return 0; } static const struct rhashtable_params xfs_buf_hash_params = { .min_size = 32, /* empty AGs have minimal footprint */ .nelem_hint = 16, .key_len = sizeof(xfs_daddr_t), .key_offset = offsetof(struct xfs_buf, b_rhash_key), .head_offset = offsetof(struct xfs_buf, b_rhash_head), .automatic_shrinking = true, .obj_cmpfn = _xfs_buf_obj_cmp, }; int xfs_buf_hash_init( struct xfs_perag *pag) { spin_lock_init(&pag->pag_buf_lock); return rhashtable_init(&pag->pag_buf_hash, &xfs_buf_hash_params); } void xfs_buf_hash_destroy( struct xfs_perag *pag) { rhashtable_destroy(&pag->pag_buf_hash); } static int xfs_buf_map_verify( struct xfs_buftarg *btp, struct xfs_buf_map *map) { xfs_daddr_t eofs; /* Check for IOs smaller than the sector size / not sector aligned */ ASSERT(!(BBTOB(map->bm_len) < btp->bt_meta_sectorsize)); ASSERT(!(BBTOB(map->bm_bn) & (xfs_off_t)btp->bt_meta_sectormask)); /* * Corrupted block numbers can get through to here, unfortunately, so we * have to check that the buffer falls within the filesystem bounds. */ eofs = XFS_FSB_TO_BB(btp->bt_mount, btp->bt_mount->m_sb.sb_dblocks); if (map->bm_bn < 0 || map->bm_bn >= eofs) { xfs_alert(btp->bt_mount, "%s: daddr 0x%llx out of range, EOFS 0x%llx", __func__, map->bm_bn, eofs); WARN_ON(1); return -EFSCORRUPTED; } return 0; } static int xfs_buf_find_lock( struct xfs_buf *bp, xfs_buf_flags_t flags) { if (flags & XBF_TRYLOCK) { if (!xfs_buf_trylock(bp)) { XFS_STATS_INC(bp->b_mount, xb_busy_locked); return -EAGAIN; } } else { xfs_buf_lock(bp); XFS_STATS_INC(bp->b_mount, xb_get_locked_waited); } /* * if the buffer is stale, clear all the external state associated with * it. We need to keep flags such as how we allocated the buffer memory * intact here. */ if (bp->b_flags & XBF_STALE) { if (flags & XBF_LIVESCAN) { xfs_buf_unlock(bp); return -ENOENT; } ASSERT((bp->b_flags & _XBF_DELWRI_Q) == 0); bp->b_flags &= _XBF_KMEM | _XBF_PAGES; bp->b_ops = NULL; } return 0; } static inline int xfs_buf_lookup( struct xfs_perag *pag, struct xfs_buf_map *map, xfs_buf_flags_t flags, struct xfs_buf **bpp) { struct xfs_buf *bp; int error; rcu_read_lock(); bp = rhashtable_lookup(&pag->pag_buf_hash, map, xfs_buf_hash_params); if (!bp || !atomic_inc_not_zero(&bp->b_hold)) { rcu_read_unlock(); return -ENOENT; } rcu_read_unlock(); error = xfs_buf_find_lock(bp, flags); if (error) { xfs_buf_rele(bp); return error; } trace_xfs_buf_find(bp, flags, _RET_IP_); *bpp = bp; return 0; } /* * Insert the new_bp into the hash table. This consumes the perag reference * taken for the lookup regardless of the result of the insert. */ static int xfs_buf_find_insert( struct xfs_buftarg *btp, struct xfs_perag *pag, struct xfs_buf_map *cmap, struct xfs_buf_map *map, int nmaps, xfs_buf_flags_t flags, struct xfs_buf **bpp) { struct xfs_buf *new_bp; struct xfs_buf *bp; int error; error = _xfs_buf_alloc(btp, map, nmaps, flags, &new_bp); if (error) goto out_drop_pag; /* * For buffers that fit entirely within a single page, first attempt to * allocate the memory from the heap to minimise memory usage. If we * can't get heap memory for these small buffers, we fall back to using * the page allocator. */ if (BBTOB(new_bp->b_length) >= PAGE_SIZE || xfs_buf_alloc_kmem(new_bp, flags) < 0) { error = xfs_buf_alloc_pages(new_bp, flags); if (error) goto out_free_buf; } spin_lock(&pag->pag_buf_lock); bp = rhashtable_lookup_get_insert_fast(&pag->pag_buf_hash, &new_bp->b_rhash_head, xfs_buf_hash_params); if (IS_ERR(bp)) { error = PTR_ERR(bp); spin_unlock(&pag->pag_buf_lock); goto out_free_buf; } if (bp) { /* found an existing buffer */ atomic_inc(&bp->b_hold); spin_unlock(&pag->pag_buf_lock); error = xfs_buf_find_lock(bp, flags); if (error) xfs_buf_rele(bp); else *bpp = bp; goto out_free_buf; } /* The new buffer keeps the perag reference until it is freed. */ new_bp->b_pag = pag; spin_unlock(&pag->pag_buf_lock); *bpp = new_bp; return 0; out_free_buf: xfs_buf_free(new_bp); out_drop_pag: xfs_perag_put(pag); return error; } /* * Assembles a buffer covering the specified range. The code is optimised for * cache hits, as metadata intensive workloads will see 3 orders of magnitude * more hits than misses. */ int xfs_buf_get_map( struct xfs_buftarg *btp, struct xfs_buf_map *map, int nmaps, xfs_buf_flags_t flags, struct xfs_buf **bpp) { struct xfs_perag *pag; struct xfs_buf *bp = NULL; struct xfs_buf_map cmap = { .bm_bn = map[0].bm_bn }; int error; int i; if (flags & XBF_LIVESCAN) cmap.bm_flags |= XBM_LIVESCAN; for (i = 0; i < nmaps; i++) cmap.bm_len += map[i].bm_len; error = xfs_buf_map_verify(btp, &cmap); if (error) return error; pag = xfs_perag_get(btp->bt_mount, xfs_daddr_to_agno(btp->bt_mount, cmap.bm_bn)); error = xfs_buf_lookup(pag, &cmap, flags, &bp); if (error && error != -ENOENT) goto out_put_perag; /* cache hits always outnumber misses by at least 10:1 */ if (unlikely(!bp)) { XFS_STATS_INC(btp->bt_mount, xb_miss_locked); if (flags & XBF_INCORE) goto out_put_perag; /* xfs_buf_find_insert() consumes the perag reference. */ error = xfs_buf_find_insert(btp, pag, &cmap, map, nmaps, flags, &bp); if (error) return error; } else { XFS_STATS_INC(btp->bt_mount, xb_get_locked); xfs_perag_put(pag); } /* We do not hold a perag reference anymore. */ if (!bp->b_addr) { error = _xfs_buf_map_pages(bp, flags); if (unlikely(error)) { xfs_warn_ratelimited(btp->bt_mount, "%s: failed to map %u pages", __func__, bp->b_page_count); xfs_buf_relse(bp); return error; } } /* * Clear b_error if this is a lookup from a caller that doesn't expect * valid data to be found in the buffer. */ if (!(flags & XBF_READ)) xfs_buf_ioerror(bp, 0); XFS_STATS_INC(btp->bt_mount, xb_get); trace_xfs_buf_get(bp, flags, _RET_IP_); *bpp = bp; return 0; out_put_perag: xfs_perag_put(pag); return error; } int _xfs_buf_read( struct xfs_buf *bp, xfs_buf_flags_t flags) { ASSERT(!(flags & XBF_WRITE)); ASSERT(bp->b_maps[0].bm_bn != XFS_BUF_DADDR_NULL); bp->b_flags &= ~(XBF_WRITE | XBF_ASYNC | XBF_READ_AHEAD | XBF_DONE); bp->b_flags |= flags & (XBF_READ | XBF_ASYNC | XBF_READ_AHEAD); return xfs_buf_submit(bp); } /* * Reverify a buffer found in cache without an attached ->b_ops. * * If the caller passed an ops structure and the buffer doesn't have ops * assigned, set the ops and use it to verify the contents. If verification * fails, clear XBF_DONE. We assume the buffer has no recorded errors and is * already in XBF_DONE state on entry. * * Under normal operations, every in-core buffer is verified on read I/O * completion. There are two scenarios that can lead to in-core buffers without * an assigned ->b_ops. The first is during log recovery of buffers on a V4 * filesystem, though these buffers are purged at the end of recovery. The * other is online repair, which intentionally reads with a NULL buffer ops to * run several verifiers across an in-core buffer in order to establish buffer * type. If repair can't establish that, the buffer will be left in memory * with NULL buffer ops. */ int xfs_buf_reverify( struct xfs_buf *bp, const struct xfs_buf_ops *ops) { ASSERT(bp->b_flags & XBF_DONE); ASSERT(bp->b_error == 0); if (!ops || bp->b_ops) return 0; bp->b_ops = ops; bp->b_ops->verify_read(bp); if (bp->b_error) bp->b_flags &= ~XBF_DONE; return bp->b_error; } int xfs_buf_read_map( struct xfs_buftarg *target, struct xfs_buf_map *map, int nmaps, xfs_buf_flags_t flags, struct xfs_buf **bpp, const struct xfs_buf_ops *ops, xfs_failaddr_t fa) { struct xfs_buf *bp; int error; flags |= XBF_READ; *bpp = NULL; error = xfs_buf_get_map(target, map, nmaps, flags, &bp); if (error) return error; trace_xfs_buf_read(bp, flags, _RET_IP_); if (!(bp->b_flags & XBF_DONE)) { /* Initiate the buffer read and wait. */ XFS_STATS_INC(target->bt_mount, xb_get_read); bp->b_ops = ops; error = _xfs_buf_read(bp, flags); /* Readahead iodone already dropped the buffer, so exit. */ if (flags & XBF_ASYNC) return 0; } else { /* Buffer already read; all we need to do is check it. */ error = xfs_buf_reverify(bp, ops); /* Readahead already finished; drop the buffer and exit. */ if (flags & XBF_ASYNC) { xfs_buf_relse(bp); return 0; } /* We do not want read in the flags */ bp->b_flags &= ~XBF_READ; ASSERT(bp->b_ops != NULL || ops == NULL); } /* * If we've had a read error, then the contents of the buffer are * invalid and should not be used. To ensure that a followup read tries * to pull the buffer from disk again, we clear the XBF_DONE flag and * mark the buffer stale. This ensures that anyone who has a current * reference to the buffer will interpret it's contents correctly and * future cache lookups will also treat it as an empty, uninitialised * buffer. */ if (error) { /* * Check against log shutdown for error reporting because * metadata writeback may require a read first and we need to * report errors in metadata writeback until the log is shut * down. High level transaction read functions already check * against mount shutdown, anyway, so we only need to be * concerned about low level IO interactions here. */ if (!xlog_is_shutdown(target->bt_mount->m_log)) xfs_buf_ioerror_alert(bp, fa); bp->b_flags &= ~XBF_DONE; xfs_buf_stale(bp); xfs_buf_relse(bp); /* bad CRC means corrupted metadata */ if (error == -EFSBADCRC) error = -EFSCORRUPTED; return error; } *bpp = bp; return 0; } /* * If we are not low on memory then do the readahead in a deadlock * safe manner. */ void xfs_buf_readahead_map( struct xfs_buftarg *target, struct xfs_buf_map *map, int nmaps, const struct xfs_buf_ops *ops) { struct xfs_buf *bp; xfs_buf_read_map(target, map, nmaps, XBF_TRYLOCK | XBF_ASYNC | XBF_READ_AHEAD, &bp, ops, __this_address); } /* * Read an uncached buffer from disk. Allocates and returns a locked * buffer containing the disk contents or nothing. Uncached buffers always have * a cache index of XFS_BUF_DADDR_NULL so we can easily determine if the buffer * is cached or uncached during fault diagnosis. */ int xfs_buf_read_uncached( struct xfs_buftarg *target, xfs_daddr_t daddr, size_t numblks, xfs_buf_flags_t flags, struct xfs_buf **bpp, const struct xfs_buf_ops *ops) { struct xfs_buf *bp; int error; *bpp = NULL; error = xfs_buf_get_uncached(target, numblks, flags, &bp); if (error) return error; /* set up the buffer for a read IO */ ASSERT(bp->b_map_count == 1); bp->b_rhash_key = XFS_BUF_DADDR_NULL; bp->b_maps[0].bm_bn = daddr; bp->b_flags |= XBF_READ; bp->b_ops = ops; xfs_buf_submit(bp); if (bp->b_error) { error = bp->b_error; xfs_buf_relse(bp); return error; } *bpp = bp; return 0; } int xfs_buf_get_uncached( struct xfs_buftarg *target, size_t numblks, xfs_buf_flags_t flags, struct xfs_buf **bpp) { int error; struct xfs_buf *bp; DEFINE_SINGLE_BUF_MAP(map, XFS_BUF_DADDR_NULL, numblks); *bpp = NULL; /* flags might contain irrelevant bits, pass only what we care about */ error = _xfs_buf_alloc(target, &map, 1, flags & XBF_NO_IOACCT, &bp); if (error) return error; error = xfs_buf_alloc_pages(bp, flags); if (error) goto fail_free_buf; error = _xfs_buf_map_pages(bp, 0); if (unlikely(error)) { xfs_warn(target->bt_mount, "%s: failed to map pages", __func__); goto fail_free_buf; } trace_xfs_buf_get_uncached(bp, _RET_IP_); *bpp = bp; return 0; fail_free_buf: xfs_buf_free(bp); return error; } /* * Increment reference count on buffer, to hold the buffer concurrently * with another thread which may release (free) the buffer asynchronously. * Must hold the buffer already to call this function. */ void xfs_buf_hold( struct xfs_buf *bp) { trace_xfs_buf_hold(bp, _RET_IP_); atomic_inc(&bp->b_hold); } /* * Release a hold on the specified buffer. If the hold count is 1, the buffer is * placed on LRU or freed (depending on b_lru_ref). */ void xfs_buf_rele( struct xfs_buf *bp) { struct xfs_perag *pag = bp->b_pag; bool release; bool freebuf = false; trace_xfs_buf_rele(bp, _RET_IP_); if (!pag) { ASSERT(list_empty(&bp->b_lru)); if (atomic_dec_and_test(&bp->b_hold)) { xfs_buf_ioacct_dec(bp); xfs_buf_free(bp); } return; } ASSERT(atomic_read(&bp->b_hold) > 0); /* * We grab the b_lock here first to serialise racing xfs_buf_rele() * calls. The pag_buf_lock being taken on the last reference only * serialises against racing lookups in xfs_buf_find(). IOWs, the second * to last reference we drop here is not serialised against the last * reference until we take bp->b_lock. Hence if we don't grab b_lock * first, the last "release" reference can win the race to the lock and * free the buffer before the second-to-last reference is processed, * leading to a use-after-free scenario. */ spin_lock(&bp->b_lock); release = atomic_dec_and_lock(&bp->b_hold, &pag->pag_buf_lock); if (!release) { /* * Drop the in-flight state if the buffer is already on the LRU * and it holds the only reference. This is racy because we * haven't acquired the pag lock, but the use of _XBF_IN_FLIGHT * ensures the decrement occurs only once per-buf. */ if ((atomic_read(&bp->b_hold) == 1) && !list_empty(&bp->b_lru)) __xfs_buf_ioacct_dec(bp); goto out_unlock; } /* the last reference has been dropped ... */ __xfs_buf_ioacct_dec(bp); if (!(bp->b_flags & XBF_STALE) && atomic_read(&bp->b_lru_ref)) { /* * If the buffer is added to the LRU take a new reference to the * buffer for the LRU and clear the (now stale) dispose list * state flag */ if (list_lru_add(&bp->b_target->bt_lru, &bp->b_lru)) { bp->b_state &= ~XFS_BSTATE_DISPOSE; atomic_inc(&bp->b_hold); } spin_unlock(&pag->pag_buf_lock); } else { /* * most of the time buffers will already be removed from the * LRU, so optimise that case by checking for the * XFS_BSTATE_DISPOSE flag indicating the last list the buffer * was on was the disposal list */ if (!(bp->b_state & XFS_BSTATE_DISPOSE)) { list_lru_del(&bp->b_target->bt_lru, &bp->b_lru); } else { ASSERT(list_empty(&bp->b_lru)); } ASSERT(!(bp->b_flags & _XBF_DELWRI_Q)); rhashtable_remove_fast(&pag->pag_buf_hash, &bp->b_rhash_head, xfs_buf_hash_params); spin_unlock(&pag->pag_buf_lock); xfs_perag_put(pag); freebuf = true; } out_unlock: spin_unlock(&bp->b_lock); if (freebuf) xfs_buf_free(bp); } /* * Lock a buffer object, if it is not already locked. * * If we come across a stale, pinned, locked buffer, we know that we are * being asked to lock a buffer that has been reallocated. Because it is * pinned, we know that the log has not been pushed to disk and hence it * will still be locked. Rather than continuing to have trylock attempts * fail until someone else pushes the log, push it ourselves before * returning. This means that the xfsaild will not get stuck trying * to push on stale inode buffers. */ int xfs_buf_trylock( struct xfs_buf *bp) { int locked; locked = down_trylock(&bp->b_sema) == 0; if (locked) trace_xfs_buf_trylock(bp, _RET_IP_); else trace_xfs_buf_trylock_fail(bp, _RET_IP_); return locked; } /* * Lock a buffer object. * * If we come across a stale, pinned, locked buffer, we know that we * are being asked to lock a buffer that has been reallocated. Because * it is pinned, we know that the log has not been pushed to disk and * hence it will still be locked. Rather than sleeping until someone * else pushes the log, push it ourselves before trying to get the lock. */ void xfs_buf_lock( struct xfs_buf *bp) { trace_xfs_buf_lock(bp, _RET_IP_); if (atomic_read(&bp->b_pin_count) && (bp->b_flags & XBF_STALE)) xfs_log_force(bp->b_mount, 0); down(&bp->b_sema); trace_xfs_buf_lock_done(bp, _RET_IP_); } void xfs_buf_unlock( struct xfs_buf *bp) { ASSERT(xfs_buf_islocked(bp)); up(&bp->b_sema); trace_xfs_buf_unlock(bp, _RET_IP_); } STATIC void xfs_buf_wait_unpin( struct xfs_buf *bp) { DECLARE_WAITQUEUE (wait, current); if (atomic_read(&bp->b_pin_count) == 0) return; add_wait_queue(&bp->b_waiters, &wait); for (;;) { set_current_state(TASK_UNINTERRUPTIBLE); if (atomic_read(&bp->b_pin_count) == 0) break; io_schedule(); } remove_wait_queue(&bp->b_waiters, &wait); set_current_state(TASK_RUNNING); } static void xfs_buf_ioerror_alert_ratelimited( struct xfs_buf *bp) { static unsigned long lasttime; static struct xfs_buftarg *lasttarg; if (bp->b_target != lasttarg || time_after(jiffies, (lasttime + 5*HZ))) { lasttime = jiffies; xfs_buf_ioerror_alert(bp, __this_address); } lasttarg = bp->b_target; } /* * Account for this latest trip around the retry handler, and decide if * we've failed enough times to constitute a permanent failure. */ static bool xfs_buf_ioerror_permanent( struct xfs_buf *bp, struct xfs_error_cfg *cfg) { struct xfs_mount *mp = bp->b_mount; if (cfg->max_retries != XFS_ERR_RETRY_FOREVER && ++bp->b_retries > cfg->max_retries) return true; if (cfg->retry_timeout != XFS_ERR_RETRY_FOREVER && time_after(jiffies, cfg->retry_timeout + bp->b_first_retry_time)) return true; /* At unmount we may treat errors differently */ if (xfs_is_unmounting(mp) && mp->m_fail_unmount) return true; return false; } /* * On a sync write or shutdown we just want to stale the buffer and let the * caller handle the error in bp->b_error appropriately. * * If the write was asynchronous then no one will be looking for the error. If * this is the first failure of this type, clear the error state and write the * buffer out again. This means we always retry an async write failure at least * once, but we also need to set the buffer up to behave correctly now for * repeated failures. * * If we get repeated async write failures, then we take action according to the * error configuration we have been set up to use. * * Returns true if this function took care of error handling and the caller must * not touch the buffer again. Return false if the caller should proceed with * normal I/O completion handling. */ static bool xfs_buf_ioend_handle_error( struct xfs_buf *bp) { struct xfs_mount *mp = bp->b_mount; struct xfs_error_cfg *cfg; /* * If we've already shutdown the journal because of I/O errors, there's * no point in giving this a retry. */ if (xlog_is_shutdown(mp->m_log)) goto out_stale; xfs_buf_ioerror_alert_ratelimited(bp); /* * We're not going to bother about retrying this during recovery. * One strike! */ if (bp->b_flags & _XBF_LOGRECOVERY) { xfs_force_shutdown(mp, SHUTDOWN_META_IO_ERROR); return false; } /* * Synchronous writes will have callers process the error. */ if (!(bp->b_flags & XBF_ASYNC)) goto out_stale; trace_xfs_buf_iodone_async(bp, _RET_IP_); cfg = xfs_error_get_cfg(mp, XFS_ERR_METADATA, bp->b_error); if (bp->b_last_error != bp->b_error || !(bp->b_flags & (XBF_STALE | XBF_WRITE_FAIL))) { bp->b_last_error = bp->b_error; if (cfg->retry_timeout != XFS_ERR_RETRY_FOREVER && !bp->b_first_retry_time) bp->b_first_retry_time = jiffies; goto resubmit; } /* * Permanent error - we need to trigger a shutdown if we haven't already * to indicate that inconsistency will result from this action. */ if (xfs_buf_ioerror_permanent(bp, cfg)) { xfs_force_shutdown(mp, SHUTDOWN_META_IO_ERROR); goto out_stale; } /* Still considered a transient error. Caller will schedule retries. */ if (bp->b_flags & _XBF_INODES) xfs_buf_inode_io_fail(bp); else if (bp->b_flags & _XBF_DQUOTS) xfs_buf_dquot_io_fail(bp); else ASSERT(list_empty(&bp->b_li_list)); xfs_buf_ioerror(bp, 0); xfs_buf_relse(bp); return true; resubmit: xfs_buf_ioerror(bp, 0); bp->b_flags |= (XBF_DONE | XBF_WRITE_FAIL); xfs_buf_submit(bp); return true; out_stale: xfs_buf_stale(bp); bp->b_flags |= XBF_DONE; bp->b_flags &= ~XBF_WRITE; trace_xfs_buf_error_relse(bp, _RET_IP_); return false; } static void xfs_buf_ioend( struct xfs_buf *bp) { trace_xfs_buf_iodone(bp, _RET_IP_); /* * Pull in IO completion errors now. We are guaranteed to be running * single threaded, so we don't need the lock to read b_io_error. */ if (!bp->b_error && bp->b_io_error) xfs_buf_ioerror(bp, bp->b_io_error); if (bp->b_flags & XBF_READ) { if (!bp->b_error && bp->b_ops) bp->b_ops->verify_read(bp); if (!bp->b_error) bp->b_flags |= XBF_DONE; } else { if (!bp->b_error) { bp->b_flags &= ~XBF_WRITE_FAIL; bp->b_flags |= XBF_DONE; } if (unlikely(bp->b_error) && xfs_buf_ioend_handle_error(bp)) return; /* clear the retry state */ bp->b_last_error = 0; bp->b_retries = 0; bp->b_first_retry_time = 0; /* * Note that for things like remote attribute buffers, there may * not be a buffer log item here, so processing the buffer log * item must remain optional. */ if (bp->b_log_item) xfs_buf_item_done(bp); if (bp->b_flags & _XBF_INODES) xfs_buf_inode_iodone(bp); else if (bp->b_flags & _XBF_DQUOTS) xfs_buf_dquot_iodone(bp); } bp->b_flags &= ~(XBF_READ | XBF_WRITE | XBF_READ_AHEAD | _XBF_LOGRECOVERY); if (bp->b_flags & XBF_ASYNC) xfs_buf_relse(bp); else complete(&bp->b_iowait); } static void xfs_buf_ioend_work( struct work_struct *work) { struct xfs_buf *bp = container_of(work, struct xfs_buf, b_ioend_work); xfs_buf_ioend(bp); } static void xfs_buf_ioend_async( struct xfs_buf *bp) { INIT_WORK(&bp->b_ioend_work, xfs_buf_ioend_work); queue_work(bp->b_mount->m_buf_workqueue, &bp->b_ioend_work); } void __xfs_buf_ioerror( struct xfs_buf *bp, int error, xfs_failaddr_t failaddr) { ASSERT(error <= 0 && error >= -1000); bp->b_error = error; trace_xfs_buf_ioerror(bp, error, failaddr); } void xfs_buf_ioerror_alert( struct xfs_buf *bp, xfs_failaddr_t func) { xfs_buf_alert_ratelimited(bp, "XFS: metadata IO error", "metadata I/O error in \"%pS\" at daddr 0x%llx len %d error %d", func, (uint64_t)xfs_buf_daddr(bp), bp->b_length, -bp->b_error); } /* * To simulate an I/O failure, the buffer must be locked and held with at least * three references. The LRU reference is dropped by the stale call. The buf * item reference is dropped via ioend processing. The third reference is owned * by the caller and is dropped on I/O completion if the buffer is XBF_ASYNC. */ void xfs_buf_ioend_fail( struct xfs_buf *bp) { bp->b_flags &= ~XBF_DONE; xfs_buf_stale(bp); xfs_buf_ioerror(bp, -EIO); xfs_buf_ioend(bp); } int xfs_bwrite( struct xfs_buf *bp) { int error; ASSERT(xfs_buf_islocked(bp)); bp->b_flags |= XBF_WRITE; bp->b_flags &= ~(XBF_ASYNC | XBF_READ | _XBF_DELWRI_Q | XBF_DONE); error = xfs_buf_submit(bp); if (error) xfs_force_shutdown(bp->b_mount, SHUTDOWN_META_IO_ERROR); return error; } static void xfs_buf_bio_end_io( struct bio *bio) { struct xfs_buf *bp = (struct xfs_buf *)bio->bi_private; if (!bio->bi_status && (bp->b_flags & XBF_WRITE) && (bp->b_flags & XBF_ASYNC) && XFS_TEST_ERROR(false, bp->b_mount, XFS_ERRTAG_BUF_IOERROR)) bio->bi_status = BLK_STS_IOERR; /* * don't overwrite existing errors - otherwise we can lose errors on * buffers that require multiple bios to complete. */ if (bio->bi_status) { int error = blk_status_to_errno(bio->bi_status); cmpxchg(&bp->b_io_error, 0, error); } if (!bp->b_error && xfs_buf_is_vmapped(bp) && (bp->b_flags & XBF_READ)) invalidate_kernel_vmap_range(bp->b_addr, xfs_buf_vmap_len(bp)); if (atomic_dec_and_test(&bp->b_io_remaining) == 1) xfs_buf_ioend_async(bp); bio_put(bio); } static void xfs_buf_ioapply_map( struct xfs_buf *bp, int map, int *buf_offset, int *count, blk_opf_t op) { int page_index; unsigned int total_nr_pages = bp->b_page_count; int nr_pages; struct bio *bio; sector_t sector = bp->b_maps[map].bm_bn; int size; int offset; /* skip the pages in the buffer before the start offset */ page_index = 0; offset = *buf_offset; while (offset >= PAGE_SIZE) { page_index++; offset -= PAGE_SIZE; } /* * Limit the IO size to the length of the current vector, and update the * remaining IO count for the next time around. */ size = min_t(int, BBTOB(bp->b_maps[map].bm_len), *count); *count -= size; *buf_offset += size; next_chunk: atomic_inc(&bp->b_io_remaining); nr_pages = bio_max_segs(total_nr_pages); bio = bio_alloc(bp->b_target->bt_bdev, nr_pages, op, GFP_NOIO); bio->bi_iter.bi_sector = sector; bio->bi_end_io = xfs_buf_bio_end_io; bio->bi_private = bp; for (; size && nr_pages; nr_pages--, page_index++) { int rbytes, nbytes = PAGE_SIZE - offset; if (nbytes > size) nbytes = size; rbytes = bio_add_page(bio, bp->b_pages[page_index], nbytes, offset); if (rbytes < nbytes) break; offset = 0; sector += BTOBB(nbytes); size -= nbytes; total_nr_pages--; } if (likely(bio->bi_iter.bi_size)) { if (xfs_buf_is_vmapped(bp)) { flush_kernel_vmap_range(bp->b_addr, xfs_buf_vmap_len(bp)); } submit_bio(bio); if (size) goto next_chunk; } else { /* * This is guaranteed not to be the last io reference count * because the caller (xfs_buf_submit) holds a count itself. */ atomic_dec(&bp->b_io_remaining); xfs_buf_ioerror(bp, -EIO); bio_put(bio); } } STATIC void _xfs_buf_ioapply( struct xfs_buf *bp) { struct blk_plug plug; blk_opf_t op; int offset; int size; int i; /* * Make sure we capture only current IO errors rather than stale errors * left over from previous use of the buffer (e.g. failed readahead). */ bp->b_error = 0; if (bp->b_flags & XBF_WRITE) { op = REQ_OP_WRITE; /* * Run the write verifier callback function if it exists. If * this function fails it will mark the buffer with an error and * the IO should not be dispatched. */ if (bp->b_ops) { bp->b_ops->verify_write(bp); if (bp->b_error) { xfs_force_shutdown(bp->b_mount, SHUTDOWN_CORRUPT_INCORE); return; } } else if (bp->b_rhash_key != XFS_BUF_DADDR_NULL) { struct xfs_mount *mp = bp->b_mount; /* * non-crc filesystems don't attach verifiers during * log recovery, so don't warn for such filesystems. */ if (xfs_has_crc(mp)) { xfs_warn(mp, "%s: no buf ops on daddr 0x%llx len %d", __func__, xfs_buf_daddr(bp), bp->b_length); xfs_hex_dump(bp->b_addr, XFS_CORRUPTION_DUMP_LEN); dump_stack(); } } } else { op = REQ_OP_READ; if (bp->b_flags & XBF_READ_AHEAD) op |= REQ_RAHEAD; } /* we only use the buffer cache for meta-data */ op |= REQ_META; /* * Walk all the vectors issuing IO on them. Set up the initial offset * into the buffer and the desired IO size before we start - * _xfs_buf_ioapply_vec() will modify them appropriately for each * subsequent call. */ offset = bp->b_offset; size = BBTOB(bp->b_length); blk_start_plug(&plug); for (i = 0; i < bp->b_map_count; i++) { xfs_buf_ioapply_map(bp, i, &offset, &size, op); if (bp->b_error) break; if (size <= 0) break; /* all done */ } blk_finish_plug(&plug); } /* * Wait for I/O completion of a sync buffer and return the I/O error code. */ static int xfs_buf_iowait( struct xfs_buf *bp) { ASSERT(!(bp->b_flags & XBF_ASYNC)); trace_xfs_buf_iowait(bp, _RET_IP_); wait_for_completion(&bp->b_iowait); trace_xfs_buf_iowait_done(bp, _RET_IP_); return bp->b_error; } /* * Buffer I/O submission path, read or write. Asynchronous submission transfers * the buffer lock ownership and the current reference to the IO. It is not * safe to reference the buffer after a call to this function unless the caller * holds an additional reference itself. */ static int __xfs_buf_submit( struct xfs_buf *bp, bool wait) { int error = 0; trace_xfs_buf_submit(bp, _RET_IP_); ASSERT(!(bp->b_flags & _XBF_DELWRI_Q)); /* * On log shutdown we stale and complete the buffer immediately. We can * be called to read the superblock before the log has been set up, so * be careful checking the log state. * * Checking the mount shutdown state here can result in the log tail * moving inappropriately on disk as the log may not yet be shut down. * i.e. failing this buffer on mount shutdown can remove it from the AIL * and move the tail of the log forwards without having written this * buffer to disk. This corrupts the log tail state in memory, and * because the log may not be shut down yet, it can then be propagated * to disk before the log is shutdown. Hence we check log shutdown * state here rather than mount state to avoid corrupting the log tail * on shutdown. */ if (bp->b_mount->m_log && xlog_is_shutdown(bp->b_mount->m_log)) { xfs_buf_ioend_fail(bp); return -EIO; } /* * Grab a reference so the buffer does not go away underneath us. For * async buffers, I/O completion drops the callers reference, which * could occur before submission returns. */ xfs_buf_hold(bp); if (bp->b_flags & XBF_WRITE) xfs_buf_wait_unpin(bp); /* clear the internal error state to avoid spurious errors */ bp->b_io_error = 0; /* * Set the count to 1 initially, this will stop an I/O completion * callout which happens before we have started all the I/O from calling * xfs_buf_ioend too early. */ atomic_set(&bp->b_io_remaining, 1); if (bp->b_flags & XBF_ASYNC) xfs_buf_ioacct_inc(bp); _xfs_buf_ioapply(bp); /* * If _xfs_buf_ioapply failed, we can get back here with only the IO * reference we took above. If we drop it to zero, run completion so * that we don't return to the caller with completion still pending. */ if (atomic_dec_and_test(&bp->b_io_remaining) == 1) { if (bp->b_error || !(bp->b_flags & XBF_ASYNC)) xfs_buf_ioend(bp); else xfs_buf_ioend_async(bp); } if (wait) error = xfs_buf_iowait(bp); /* * Release the hold that keeps the buffer referenced for the entire * I/O. Note that if the buffer is async, it is not safe to reference * after this release. */ xfs_buf_rele(bp); return error; } void * xfs_buf_offset( struct xfs_buf *bp, size_t offset) { struct page *page; if (bp->b_addr) return bp->b_addr + offset; page = bp->b_pages[offset >> PAGE_SHIFT]; return page_address(page) + (offset & (PAGE_SIZE-1)); } void xfs_buf_zero( struct xfs_buf *bp, size_t boff, size_t bsize) { size_t bend; bend = boff + bsize; while (boff < bend) { struct page *page; int page_index, page_offset, csize; page_index = (boff + bp->b_offset) >> PAGE_SHIFT; page_offset = (boff + bp->b_offset) & ~PAGE_MASK; page = bp->b_pages[page_index]; csize = min_t(size_t, PAGE_SIZE - page_offset, BBTOB(bp->b_length) - boff); ASSERT((csize + page_offset) <= PAGE_SIZE); memset(page_address(page) + page_offset, 0, csize); boff += csize; } } /* * Log a message about and stale a buffer that a caller has decided is corrupt. * * This function should be called for the kinds of metadata corruption that * cannot be detect from a verifier, such as incorrect inter-block relationship * data. Do /not/ call this function from a verifier function. * * The buffer must be XBF_DONE prior to the call. Afterwards, the buffer will * be marked stale, but b_error will not be set. The caller is responsible for * releasing the buffer or fixing it. */ void __xfs_buf_mark_corrupt( struct xfs_buf *bp, xfs_failaddr_t fa) { ASSERT(bp->b_flags & XBF_DONE); xfs_buf_corruption_error(bp, fa); xfs_buf_stale(bp); } /* * Handling of buffer targets (buftargs). */ /* * Wait for any bufs with callbacks that have been submitted but have not yet * returned. These buffers will have an elevated hold count, so wait on those * while freeing all the buffers only held by the LRU. */ static enum lru_status xfs_buftarg_drain_rele( struct list_head *item, struct list_lru_one *lru, spinlock_t *lru_lock, void *arg) { struct xfs_buf *bp = container_of(item, struct xfs_buf, b_lru); struct list_head *dispose = arg; if (atomic_read(&bp->b_hold) > 1) { /* need to wait, so skip it this pass */ trace_xfs_buf_drain_buftarg(bp, _RET_IP_); return LRU_SKIP; } if (!spin_trylock(&bp->b_lock)) return LRU_SKIP; /* * clear the LRU reference count so the buffer doesn't get * ignored in xfs_buf_rele(). */ atomic_set(&bp->b_lru_ref, 0); bp->b_state |= XFS_BSTATE_DISPOSE; list_lru_isolate_move(lru, item, dispose); spin_unlock(&bp->b_lock); return LRU_REMOVED; } /* * Wait for outstanding I/O on the buftarg to complete. */ void xfs_buftarg_wait( struct xfs_buftarg *btp) { /* * First wait on the buftarg I/O count for all in-flight buffers to be * released. This is critical as new buffers do not make the LRU until * they are released. * * Next, flush the buffer workqueue to ensure all completion processing * has finished. Just waiting on buffer locks is not sufficient for * async IO as the reference count held over IO is not released until * after the buffer lock is dropped. Hence we need to ensure here that * all reference counts have been dropped before we start walking the * LRU list. */ while (percpu_counter_sum(&btp->bt_io_count)) delay(100); flush_workqueue(btp->bt_mount->m_buf_workqueue); } void xfs_buftarg_drain( struct xfs_buftarg *btp) { LIST_HEAD(dispose); int loop = 0; bool write_fail = false; xfs_buftarg_wait(btp); /* loop until there is nothing left on the lru list. */ while (list_lru_count(&btp->bt_lru)) { list_lru_walk(&btp->bt_lru, xfs_buftarg_drain_rele, &dispose, LONG_MAX); while (!list_empty(&dispose)) { struct xfs_buf *bp; bp = list_first_entry(&dispose, struct xfs_buf, b_lru); list_del_init(&bp->b_lru); if (bp->b_flags & XBF_WRITE_FAIL) { write_fail = true; xfs_buf_alert_ratelimited(bp, "XFS: Corruption Alert", "Corruption Alert: Buffer at daddr 0x%llx had permanent write failures!", (long long)xfs_buf_daddr(bp)); } xfs_buf_rele(bp); } if (loop++ != 0) delay(100); } /* * If one or more failed buffers were freed, that means dirty metadata * was thrown away. This should only ever happen after I/O completion * handling has elevated I/O error(s) to permanent failures and shuts * down the journal. */ if (write_fail) { ASSERT(xlog_is_shutdown(btp->bt_mount->m_log)); xfs_alert(btp->bt_mount, "Please run xfs_repair to determine the extent of the problem."); } } static enum lru_status xfs_buftarg_isolate( struct list_head *item, struct list_lru_one *lru, spinlock_t *lru_lock, void *arg) { struct xfs_buf *bp = container_of(item, struct xfs_buf, b_lru); struct list_head *dispose = arg; /* * we are inverting the lru lock/bp->b_lock here, so use a trylock. * If we fail to get the lock, just skip it. */ if (!spin_trylock(&bp->b_lock)) return LRU_SKIP; /* * Decrement the b_lru_ref count unless the value is already * zero. If the value is already zero, we need to reclaim the * buffer, otherwise it gets another trip through the LRU. */ if (atomic_add_unless(&bp->b_lru_ref, -1, 0)) { spin_unlock(&bp->b_lock); return LRU_ROTATE; } bp->b_state |= XFS_BSTATE_DISPOSE; list_lru_isolate_move(lru, item, dispose); spin_unlock(&bp->b_lock); return LRU_REMOVED; } static unsigned long xfs_buftarg_shrink_scan( struct shrinker *shrink, struct shrink_control *sc) { struct xfs_buftarg *btp = container_of(shrink, struct xfs_buftarg, bt_shrinker); LIST_HEAD(dispose); unsigned long freed; freed = list_lru_shrink_walk(&btp->bt_lru, sc, xfs_buftarg_isolate, &dispose); while (!list_empty(&dispose)) { struct xfs_buf *bp; bp = list_first_entry(&dispose, struct xfs_buf, b_lru); list_del_init(&bp->b_lru); xfs_buf_rele(bp); } return freed; } static unsigned long xfs_buftarg_shrink_count( struct shrinker *shrink, struct shrink_control *sc) { struct xfs_buftarg *btp = container_of(shrink, struct xfs_buftarg, bt_shrinker); return list_lru_shrink_count(&btp->bt_lru, sc); } void xfs_free_buftarg( struct xfs_buftarg *btp) { struct block_device *bdev = btp->bt_bdev; unregister_shrinker(&btp->bt_shrinker); ASSERT(percpu_counter_sum(&btp->bt_io_count) == 0); percpu_counter_destroy(&btp->bt_io_count); list_lru_destroy(&btp->bt_lru); fs_put_dax(btp->bt_daxdev, btp->bt_mount); /* the main block device is closed by kill_block_super */ if (bdev != btp->bt_mount->m_super->s_bdev) blkdev_put(bdev, btp->bt_mount->m_super); kmem_free(btp); } int xfs_setsize_buftarg( xfs_buftarg_t *btp, unsigned int sectorsize) { /* Set up metadata sector size info */ btp->bt_meta_sectorsize = sectorsize; btp->bt_meta_sectormask = sectorsize - 1; if (set_blocksize(btp->bt_bdev, sectorsize)) { xfs_warn(btp->bt_mount, "Cannot set_blocksize to %u on device %pg", sectorsize, btp->bt_bdev); return -EINVAL; } /* Set up device logical sector size mask */ btp->bt_logical_sectorsize = bdev_logical_block_size(btp->bt_bdev); btp->bt_logical_sectormask = bdev_logical_block_size(btp->bt_bdev) - 1; return 0; } /* * When allocating the initial buffer target we have not yet * read in the superblock, so don't know what sized sectors * are being used at this early stage. Play safe. */ STATIC int xfs_setsize_buftarg_early( xfs_buftarg_t *btp, struct block_device *bdev) { return xfs_setsize_buftarg(btp, bdev_logical_block_size(bdev)); } struct xfs_buftarg * xfs_alloc_buftarg( struct xfs_mount *mp, struct block_device *bdev) { xfs_buftarg_t *btp; const struct dax_holder_operations *ops = NULL; #if defined(CONFIG_FS_DAX) && defined(CONFIG_MEMORY_FAILURE) ops = &xfs_dax_holder_operations; #endif btp = kmem_zalloc(sizeof(*btp), KM_NOFS); btp->bt_mount = mp; btp->bt_dev = bdev->bd_dev; btp->bt_bdev = bdev; btp->bt_daxdev = fs_dax_get_by_bdev(bdev, &btp->bt_dax_part_off, mp, ops); /* * Buffer IO error rate limiting. Limit it to no more than 10 messages * per 30 seconds so as to not spam logs too much on repeated errors. */ ratelimit_state_init(&btp->bt_ioerror_rl, 30 * HZ, DEFAULT_RATELIMIT_BURST); if (xfs_setsize_buftarg_early(btp, bdev)) goto error_free; if (list_lru_init(&btp->bt_lru)) goto error_free; if (percpu_counter_init(&btp->bt_io_count, 0, GFP_KERNEL)) goto error_lru; btp->bt_shrinker.count_objects = xfs_buftarg_shrink_count; btp->bt_shrinker.scan_objects = xfs_buftarg_shrink_scan; btp->bt_shrinker.seeks = DEFAULT_SEEKS; btp->bt_shrinker.flags = SHRINKER_NUMA_AWARE; if (register_shrinker(&btp->bt_shrinker, "xfs-buf:%s", mp->m_super->s_id)) goto error_pcpu; return btp; error_pcpu: percpu_counter_destroy(&btp->bt_io_count); error_lru: list_lru_destroy(&btp->bt_lru); error_free: kmem_free(btp); return NULL; } /* * Cancel a delayed write list. * * Remove each buffer from the list, clear the delwri queue flag and drop the * associated buffer reference. */ void xfs_buf_delwri_cancel( struct list_head *list) { struct xfs_buf *bp; while (!list_empty(list)) { bp = list_first_entry(list, struct xfs_buf, b_list); xfs_buf_lock(bp); bp->b_flags &= ~_XBF_DELWRI_Q; list_del_init(&bp->b_list); xfs_buf_relse(bp); } } /* * Add a buffer to the delayed write list. * * This queues a buffer for writeout if it hasn't already been. Note that * neither this routine nor the buffer list submission functions perform * any internal synchronization. It is expected that the lists are thread-local * to the callers. * * Returns true if we queued up the buffer, or false if it already had * been on the buffer list. */ bool xfs_buf_delwri_queue( struct xfs_buf *bp, struct list_head *list) { ASSERT(xfs_buf_islocked(bp)); ASSERT(!(bp->b_flags & XBF_READ)); /* * If the buffer is already marked delwri it already is queued up * by someone else for imediate writeout. Just ignore it in that * case. */ if (bp->b_flags & _XBF_DELWRI_Q) { trace_xfs_buf_delwri_queued(bp, _RET_IP_); return false; } trace_xfs_buf_delwri_queue(bp, _RET_IP_); /* * If a buffer gets written out synchronously or marked stale while it * is on a delwri list we lazily remove it. To do this, the other party * clears the _XBF_DELWRI_Q flag but otherwise leaves the buffer alone. * It remains referenced and on the list. In a rare corner case it * might get readded to a delwri list after the synchronous writeout, in * which case we need just need to re-add the flag here. */ bp->b_flags |= _XBF_DELWRI_Q; if (list_empty(&bp->b_list)) { atomic_inc(&bp->b_hold); list_add_tail(&bp->b_list, list); } return true; } /* * Compare function is more complex than it needs to be because * the return value is only 32 bits and we are doing comparisons * on 64 bit values */ static int xfs_buf_cmp( void *priv, const struct list_head *a, const struct list_head *b) { struct xfs_buf *ap = container_of(a, struct xfs_buf, b_list); struct xfs_buf *bp = container_of(b, struct xfs_buf, b_list); xfs_daddr_t diff; diff = ap->b_maps[0].bm_bn - bp->b_maps[0].bm_bn; if (diff < 0) return -1; if (diff > 0) return 1; return 0; } /* * Submit buffers for write. If wait_list is specified, the buffers are * submitted using sync I/O and placed on the wait list such that the caller can * iowait each buffer. Otherwise async I/O is used and the buffers are released * at I/O completion time. In either case, buffers remain locked until I/O * completes and the buffer is released from the queue. */ static int xfs_buf_delwri_submit_buffers( struct list_head *buffer_list, struct list_head *wait_list) { struct xfs_buf *bp, *n; int pinned = 0; struct blk_plug plug; list_sort(NULL, buffer_list, xfs_buf_cmp); blk_start_plug(&plug); list_for_each_entry_safe(bp, n, buffer_list, b_list) { if (!wait_list) { if (!xfs_buf_trylock(bp)) continue; if (xfs_buf_ispinned(bp)) { xfs_buf_unlock(bp); pinned++; continue; } } else { xfs_buf_lock(bp); } /* * Someone else might have written the buffer synchronously or * marked it stale in the meantime. In that case only the * _XBF_DELWRI_Q flag got cleared, and we have to drop the * reference and remove it from the list here. */ if (!(bp->b_flags & _XBF_DELWRI_Q)) { list_del_init(&bp->b_list); xfs_buf_relse(bp); continue; } trace_xfs_buf_delwri_split(bp, _RET_IP_); /* * If we have a wait list, each buffer (and associated delwri * queue reference) transfers to it and is submitted * synchronously. Otherwise, drop the buffer from the delwri * queue and submit async. */ bp->b_flags &= ~_XBF_DELWRI_Q; bp->b_flags |= XBF_WRITE; if (wait_list) { bp->b_flags &= ~XBF_ASYNC; list_move_tail(&bp->b_list, wait_list); } else { bp->b_flags |= XBF_ASYNC; list_del_init(&bp->b_list); } __xfs_buf_submit(bp, false); } blk_finish_plug(&plug); return pinned; } /* * Write out a buffer list asynchronously. * * This will take the @buffer_list, write all non-locked and non-pinned buffers * out and not wait for I/O completion on any of the buffers. This interface * is only safely useable for callers that can track I/O completion by higher * level means, e.g. AIL pushing as the @buffer_list is consumed in this * function. * * Note: this function will skip buffers it would block on, and in doing so * leaves them on @buffer_list so they can be retried on a later pass. As such, * it is up to the caller to ensure that the buffer list is fully submitted or * cancelled appropriately when they are finished with the list. Failure to * cancel or resubmit the list until it is empty will result in leaked buffers * at unmount time. */ int xfs_buf_delwri_submit_nowait( struct list_head *buffer_list) { return xfs_buf_delwri_submit_buffers(buffer_list, NULL); } /* * Write out a buffer list synchronously. * * This will take the @buffer_list, write all buffers out and wait for I/O * completion on all of the buffers. @buffer_list is consumed by the function, * so callers must have some other way of tracking buffers if they require such * functionality. */ int xfs_buf_delwri_submit( struct list_head *buffer_list) { LIST_HEAD (wait_list); int error = 0, error2; struct xfs_buf *bp; xfs_buf_delwri_submit_buffers(buffer_list, &wait_list); /* Wait for IO to complete. */ while (!list_empty(&wait_list)) { bp = list_first_entry(&wait_list, struct xfs_buf, b_list); list_del_init(&bp->b_list); /* * Wait on the locked buffer, check for errors and unlock and * release the delwri queue reference. */ error2 = xfs_buf_iowait(bp); xfs_buf_relse(bp); if (!error) error = error2; } return error; } /* * Push a single buffer on a delwri queue. * * The purpose of this function is to submit a single buffer of a delwri queue * and return with the buffer still on the original queue. The waiting delwri * buffer submission infrastructure guarantees transfer of the delwri queue * buffer reference to a temporary wait list. We reuse this infrastructure to * transfer the buffer back to the original queue. * * Note the buffer transitions from the queued state, to the submitted and wait * listed state and back to the queued state during this call. The buffer * locking and queue management logic between _delwri_pushbuf() and * _delwri_queue() guarantee that the buffer cannot be queued to another list * before returning. */ int xfs_buf_delwri_pushbuf( struct xfs_buf *bp, struct list_head *buffer_list) { LIST_HEAD (submit_list); int error; ASSERT(bp->b_flags & _XBF_DELWRI_Q); trace_xfs_buf_delwri_pushbuf(bp, _RET_IP_); /* * Isolate the buffer to a new local list so we can submit it for I/O * independently from the rest of the original list. */ xfs_buf_lock(bp); list_move(&bp->b_list, &submit_list); xfs_buf_unlock(bp); /* * Delwri submission clears the DELWRI_Q buffer flag and returns with * the buffer on the wait list with the original reference. Rather than * bounce the buffer from a local wait list back to the original list * after I/O completion, reuse the original list as the wait list. */ xfs_buf_delwri_submit_buffers(&submit_list, buffer_list); /* * The buffer is now locked, under I/O and wait listed on the original * delwri queue. Wait for I/O completion, restore the DELWRI_Q flag and * return with the buffer unlocked and on the original queue. */ error = xfs_buf_iowait(bp); bp->b_flags |= _XBF_DELWRI_Q; xfs_buf_unlock(bp); return error; } void xfs_buf_set_ref(struct xfs_buf *bp, int lru_ref) { /* * Set the lru reference count to 0 based on the error injection tag. * This allows userspace to disrupt buffer caching for debug/testing * purposes. */ if (XFS_TEST_ERROR(false, bp->b_mount, XFS_ERRTAG_BUF_LRU_REF)) lru_ref = 0; atomic_set(&bp->b_lru_ref, lru_ref); } /* * Verify an on-disk magic value against the magic value specified in the * verifier structure. The verifier magic is in disk byte order so the caller is * expected to pass the value directly from disk. */ bool xfs_verify_magic( struct xfs_buf *bp, __be32 dmagic) { struct xfs_mount *mp = bp->b_mount; int idx; idx = xfs_has_crc(mp); if (WARN_ON(!bp->b_ops || !bp->b_ops->magic[idx])) return false; return dmagic == bp->b_ops->magic[idx]; } /* * Verify an on-disk magic value against the magic value specified in the * verifier structure. The verifier magic is in disk byte order so the caller is * expected to pass the value directly from disk. */ bool xfs_verify_magic16( struct xfs_buf *bp, __be16 dmagic) { struct xfs_mount *mp = bp->b_mount; int idx; idx = xfs_has_crc(mp); if (WARN_ON(!bp->b_ops || !bp->b_ops->magic16[idx])) return false; return dmagic == bp->b_ops->magic16[idx]; } |
1 2 2 2 2 2 2 2 6 5 11 6 5 8 6 2 1 1 1 1 1 1 1 1 1 1 1 1 14 14 14 14 2 1 1 1 13 12 12 12 12 12 2 2 1 1 12 12 12 12 12 12 12 12 12 2 3 1 1 2 2 1 1 2 2 7 6 1 7 5 15 1 1 14 2 14 3 2 13 2 2 13 4 13 2 14 6 6 2 4 2 4 4 6 27 27 26 27 3 2 2 3 4 3 3 3 4 3 4 4 4 4 9 1 1 1 1 1 1 1 1 1 1 1 1 1 1 2 2 2 1 1 1 1 1 1 1 1 1 1 1 1 2 2 15 15 15 11 11 11 25 25 25 1 18 18 26 26 26 26 2 25 4 26 1 26 1 26 31 31 31 31 31 31 31 31 31 31 31 30 31 31 31 30 30 31 26 26 3 1 | 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60 61 62 63 64 65 66 67 68 69 70 71 72 73 74 75 76 77 78 79 80 81 82 83 84 85 86 87 88 89 90 91 92 93 94 95 96 97 98 99 100 101 102 103 104 105 106 107 108 109 110 111 112 113 114 115 116 117 118 119 120 121 122 123 124 125 126 127 128 129 130 131 132 133 134 135 136 137 138 139 140 141 142 143 144 145 146 147 148 149 150 151 152 153 154 155 156 157 158 159 160 161 162 163 164 165 166 167 168 169 170 171 172 173 174 175 176 177 178 179 180 181 182 183 184 185 186 187 188 189 190 191 192 193 194 195 196 197 198 199 200 201 202 203 204 205 206 207 208 209 210 211 212 213 214 215 216 217 218 219 220 221 222 223 224 225 226 227 228 229 230 231 232 233 234 235 236 237 238 239 240 241 242 243 244 245 246 247 248 249 250 251 252 253 254 255 256 257 258 259 260 261 262 263 264 265 266 267 268 269 270 271 272 273 274 275 276 277 278 279 280 281 282 283 284 285 286 287 288 289 290 291 292 293 294 295 296 297 298 299 300 301 302 303 304 305 306 307 308 309 310 311 312 313 314 315 316 317 318 319 320 321 322 323 324 325 326 327 328 329 330 331 332 333 334 335 336 337 338 339 340 341 342 343 344 345 346 347 348 349 350 351 352 353 354 355 356 357 358 359 360 361 362 363 364 365 366 367 368 369 370 371 372 373 374 375 376 377 378 379 380 381 382 383 384 385 386 387 388 389 390 391 392 393 394 395 396 397 398 399 400 401 402 403 404 405 406 407 408 409 410 411 412 413 414 415 416 417 418 419 420 421 422 423 424 425 426 427 428 429 430 431 432 433 434 435 436 437 438 439 440 441 442 443 444 445 446 447 448 449 450 451 452 453 454 455 456 457 458 459 460 461 462 463 464 465 466 467 468 469 470 471 472 473 474 475 476 477 478 479 480 481 482 483 484 485 486 487 488 489 490 491 492 493 494 495 496 497 498 499 500 501 502 503 504 505 506 507 508 509 510 511 512 513 514 515 516 517 518 519 520 521 522 523 524 525 526 527 528 529 530 531 532 533 534 535 536 537 538 539 540 541 542 543 544 545 546 547 548 549 550 551 552 553 554 555 556 557 558 559 560 561 562 563 564 565 566 567 568 569 570 571 572 573 574 575 576 577 578 579 580 581 582 583 584 585 586 587 588 589 590 591 592 593 594 595 596 597 598 599 600 601 602 603 604 605 606 607 608 609 610 611 612 613 614 615 616 617 618 619 620 621 622 623 624 625 626 627 628 629 630 631 632 633 634 635 636 637 638 639 640 641 642 643 644 645 646 647 648 649 650 651 652 653 654 655 656 657 658 659 660 661 662 663 664 665 666 667 668 669 670 671 672 673 674 675 676 677 678 679 680 681 682 683 684 685 686 687 688 689 690 691 692 693 694 695 696 697 698 699 700 701 702 703 704 705 706 707 708 709 710 711 712 713 714 715 716 717 718 719 720 721 722 723 724 725 726 727 728 729 730 731 732 733 734 735 736 737 738 739 740 741 742 743 744 745 746 747 748 749 750 751 752 753 754 755 756 757 758 759 760 761 762 763 764 765 766 767 768 769 770 771 772 773 774 775 776 777 778 779 780 781 782 783 784 785 786 787 788 789 790 791 792 793 794 795 796 797 798 799 800 801 802 803 804 805 806 807 808 809 810 811 812 813 814 815 816 817 818 819 820 821 822 823 824 825 826 827 828 829 830 831 832 833 834 835 836 837 838 839 840 841 842 843 844 845 846 847 848 849 850 851 852 853 854 855 856 857 858 859 860 861 862 863 864 865 866 867 868 869 870 871 872 873 874 875 876 877 878 879 880 881 882 883 884 885 886 887 888 889 890 891 892 893 894 895 896 897 898 899 900 901 902 903 904 905 906 907 908 909 910 911 912 913 914 915 916 917 918 919 920 921 922 923 924 925 926 927 928 929 930 931 932 933 934 935 936 937 938 939 940 941 942 943 944 945 946 947 948 949 950 951 952 953 954 955 956 957 958 959 960 961 962 963 964 965 966 967 968 969 970 971 972 973 974 975 976 977 978 979 980 981 982 983 984 985 986 987 988 989 990 991 992 993 994 995 996 997 998 999 1000 1001 1002 1003 1004 1005 1006 1007 1008 1009 1010 1011 1012 1013 1014 1015 1016 1017 1018 1019 1020 1021 1022 1023 1024 1025 1026 1027 1028 1029 1030 1031 1032 1033 1034 1035 1036 1037 1038 1039 1040 1041 1042 1043 1044 1045 1046 1047 1048 1049 1050 1051 1052 1053 1054 1055 1056 1057 1058 1059 1060 1061 1062 1063 1064 1065 1066 1067 1068 1069 1070 1071 1072 1073 1074 1075 1076 1077 1078 1079 1080 1081 1082 1083 1084 1085 1086 1087 1088 1089 1090 1091 1092 1093 1094 1095 1096 1097 1098 1099 1100 1101 1102 1103 1104 1105 1106 1107 1108 1109 1110 1111 1112 1113 1114 1115 1116 1117 1118 1119 1120 1121 1122 1123 1124 1125 1126 1127 1128 1129 1130 1131 1132 1133 1134 1135 1136 1137 1138 1139 1140 1141 1142 1143 1144 1145 1146 1147 1148 1149 1150 1151 1152 1153 1154 1155 1156 1157 1158 1159 1160 1161 1162 1163 1164 1165 1166 1167 1168 1169 1170 1171 1172 1173 1174 1175 1176 1177 1178 1179 1180 1181 1182 1183 1184 1185 1186 1187 1188 1189 1190 1191 1192 1193 1194 1195 1196 1197 1198 1199 1200 1201 1202 1203 1204 1205 1206 1207 1208 1209 1210 1211 1212 1213 1214 1215 1216 1217 1218 1219 1220 1221 1222 1223 1224 1225 1226 1227 1228 1229 1230 1231 1232 1233 1234 1235 1236 1237 1238 1239 1240 1241 1242 1243 1244 1245 1246 1247 1248 1249 1250 1251 1252 1253 1254 1255 1256 1257 1258 1259 1260 1261 1262 1263 1264 1265 1266 1267 1268 1269 1270 1271 1272 1273 1274 1275 1276 1277 1278 1279 1280 1281 1282 1283 1284 1285 1286 1287 1288 1289 1290 1291 1292 1293 1294 1295 1296 1297 1298 1299 1300 1301 1302 1303 1304 1305 1306 1307 1308 1309 1310 1311 1312 1313 1314 1315 1316 1317 1318 1319 1320 1321 1322 1323 1324 1325 1326 1327 1328 1329 1330 1331 1332 1333 1334 1335 1336 1337 1338 1339 1340 1341 1342 1343 1344 1345 1346 1347 1348 1349 1350 1351 1352 1353 1354 1355 1356 1357 1358 1359 1360 1361 1362 1363 1364 1365 1366 1367 1368 1369 1370 1371 1372 1373 1374 1375 1376 1377 1378 1379 1380 1381 1382 1383 1384 1385 1386 1387 1388 1389 1390 1391 1392 1393 1394 1395 1396 1397 1398 1399 1400 1401 1402 1403 1404 1405 1406 1407 1408 1409 1410 1411 1412 1413 1414 1415 1416 1417 1418 1419 1420 1421 1422 1423 1424 1425 1426 1427 1428 1429 1430 1431 1432 1433 1434 1435 1436 1437 1438 1439 1440 1441 1442 1443 1444 1445 1446 1447 1448 1449 1450 1451 1452 1453 1454 1455 1456 1457 1458 1459 1460 1461 1462 1463 1464 1465 1466 1467 1468 1469 1470 1471 1472 1473 1474 1475 1476 1477 1478 1479 1480 1481 1482 1483 1484 1485 1486 1487 1488 1489 1490 1491 1492 1493 1494 1495 1496 1497 1498 1499 1500 1501 1502 1503 1504 1505 1506 1507 1508 1509 1510 1511 1512 1513 1514 1515 1516 1517 1518 1519 1520 1521 1522 1523 1524 1525 1526 1527 1528 1529 1530 1531 1532 1533 1534 1535 1536 1537 1538 1539 1540 1541 1542 1543 1544 1545 1546 1547 1548 1549 1550 1551 1552 1553 1554 1555 1556 1557 1558 1559 1560 1561 1562 1563 1564 1565 1566 1567 1568 1569 1570 1571 1572 1573 1574 1575 1576 1577 1578 1579 1580 1581 1582 1583 1584 1585 1586 1587 1588 1589 1590 1591 1592 1593 1594 1595 1596 1597 1598 1599 1600 1601 1602 1603 1604 1605 1606 1607 1608 1609 1610 1611 1612 1613 1614 1615 1616 1617 1618 1619 1620 1621 1622 1623 1624 1625 1626 1627 1628 1629 1630 1631 1632 1633 1634 1635 1636 1637 1638 1639 1640 1641 1642 1643 1644 1645 1646 1647 1648 1649 1650 1651 1652 1653 1654 1655 1656 1657 1658 1659 1660 1661 1662 1663 1664 1665 1666 1667 1668 1669 1670 1671 1672 1673 1674 1675 1676 1677 1678 1679 1680 1681 1682 1683 1684 1685 1686 1687 1688 1689 1690 1691 1692 1693 1694 1695 1696 1697 1698 1699 1700 1701 1702 1703 1704 1705 1706 1707 1708 1709 1710 1711 1712 1713 1714 1715 1716 1717 1718 1719 1720 1721 1722 1723 1724 1725 1726 1727 1728 1729 1730 1731 1732 1733 1734 1735 1736 1737 1738 1739 1740 1741 1742 1743 1744 1745 1746 1747 1748 1749 1750 1751 1752 1753 1754 1755 1756 1757 1758 1759 1760 1761 1762 1763 1764 1765 1766 1767 1768 1769 1770 1771 1772 1773 1774 1775 1776 1777 1778 1779 1780 1781 1782 1783 1784 1785 1786 1787 1788 1789 1790 1791 1792 1793 1794 1795 1796 1797 1798 1799 1800 1801 1802 1803 1804 1805 1806 1807 1808 1809 1810 1811 1812 1813 1814 1815 1816 1817 1818 1819 1820 1821 1822 1823 1824 1825 1826 1827 1828 1829 1830 1831 1832 1833 1834 1835 1836 1837 1838 1839 1840 1841 1842 1843 1844 1845 1846 1847 1848 1849 1850 1851 1852 1853 1854 1855 1856 1857 1858 1859 1860 1861 1862 1863 1864 1865 1866 1867 1868 1869 1870 1871 1872 1873 1874 1875 1876 1877 1878 1879 1880 1881 1882 1883 1884 1885 1886 1887 1888 1889 1890 1891 1892 1893 1894 1895 1896 1897 1898 1899 1900 1901 1902 1903 1904 1905 1906 1907 1908 1909 1910 1911 1912 1913 1914 1915 1916 1917 1918 1919 1920 1921 1922 1923 1924 1925 1926 1927 1928 1929 1930 1931 1932 1933 1934 1935 1936 1937 1938 1939 1940 1941 1942 1943 1944 1945 1946 1947 1948 1949 1950 1951 1952 1953 1954 1955 1956 1957 1958 1959 1960 1961 1962 1963 1964 1965 1966 1967 1968 1969 1970 1971 1972 1973 1974 1975 1976 1977 1978 1979 1980 1981 1982 1983 1984 1985 1986 1987 1988 1989 1990 1991 1992 1993 1994 1995 1996 1997 1998 1999 2000 2001 2002 2003 2004 2005 2006 2007 2008 2009 2010 2011 2012 2013 2014 2015 2016 2017 2018 2019 2020 2021 2022 2023 2024 2025 2026 2027 2028 2029 2030 2031 2032 2033 2034 2035 2036 2037 2038 2039 2040 2041 2042 2043 2044 2045 2046 2047 2048 2049 2050 2051 2052 2053 2054 2055 2056 2057 2058 2059 2060 2061 2062 2063 2064 2065 2066 2067 2068 2069 2070 2071 2072 2073 2074 2075 2076 2077 2078 2079 2080 2081 2082 2083 2084 2085 2086 2087 2088 2089 2090 2091 2092 2093 2094 2095 2096 2097 2098 2099 2100 2101 2102 2103 2104 2105 2106 2107 2108 2109 2110 2111 2112 2113 2114 2115 2116 2117 2118 2119 2120 2121 2122 2123 2124 2125 2126 2127 2128 2129 2130 2131 2132 2133 2134 2135 2136 2137 2138 2139 2140 2141 2142 2143 2144 2145 2146 2147 2148 2149 2150 2151 2152 2153 2154 2155 2156 2157 2158 2159 2160 2161 2162 2163 2164 2165 2166 2167 2168 2169 2170 2171 2172 2173 2174 2175 2176 2177 2178 2179 2180 2181 2182 2183 2184 2185 2186 2187 2188 2189 2190 2191 2192 2193 2194 2195 2196 2197 2198 2199 2200 2201 2202 2203 2204 2205 2206 2207 2208 2209 2210 2211 2212 2213 2214 2215 2216 2217 2218 2219 2220 2221 2222 2223 2224 2225 2226 2227 2228 2229 2230 2231 2232 2233 2234 2235 2236 2237 2238 2239 2240 2241 2242 2243 2244 2245 2246 2247 2248 2249 2250 2251 2252 2253 2254 2255 2256 2257 2258 2259 2260 2261 2262 2263 2264 2265 2266 2267 2268 2269 2270 2271 2272 2273 2274 2275 2276 2277 2278 2279 2280 2281 2282 2283 2284 2285 2286 2287 2288 2289 2290 2291 2292 2293 2294 2295 2296 2297 2298 2299 2300 2301 2302 2303 2304 2305 2306 2307 2308 2309 2310 2311 2312 2313 2314 2315 2316 2317 2318 2319 2320 2321 2322 2323 2324 2325 2326 2327 2328 2329 2330 2331 2332 2333 2334 2335 2336 2337 2338 2339 2340 2341 2342 2343 2344 2345 2346 2347 2348 2349 2350 2351 2352 2353 2354 2355 2356 2357 2358 2359 2360 2361 2362 2363 2364 2365 2366 2367 2368 2369 2370 2371 2372 2373 2374 2375 2376 2377 2378 2379 2380 2381 2382 2383 2384 2385 2386 2387 2388 2389 2390 2391 2392 2393 2394 2395 2396 2397 2398 2399 2400 2401 2402 2403 2404 2405 2406 2407 2408 2409 2410 2411 2412 2413 2414 2415 2416 2417 2418 2419 2420 2421 2422 2423 2424 2425 2426 2427 2428 2429 2430 2431 2432 2433 2434 2435 2436 2437 2438 2439 2440 2441 2442 2443 2444 2445 2446 2447 2448 2449 2450 2451 2452 2453 2454 2455 2456 2457 2458 2459 2460 2461 2462 2463 2464 2465 2466 2467 2468 2469 2470 2471 2472 2473 2474 2475 2476 2477 2478 2479 2480 2481 2482 2483 2484 2485 2486 2487 2488 2489 2490 2491 2492 2493 2494 2495 2496 2497 2498 2499 2500 2501 2502 2503 2504 2505 2506 2507 2508 2509 2510 2511 2512 2513 2514 2515 2516 2517 2518 2519 2520 2521 2522 2523 2524 2525 2526 2527 2528 2529 2530 2531 2532 2533 2534 2535 2536 2537 2538 2539 2540 2541 2542 2543 2544 2545 2546 2547 2548 2549 2550 2551 2552 2553 2554 2555 2556 2557 2558 2559 2560 2561 2562 2563 2564 2565 2566 2567 2568 2569 2570 2571 2572 2573 2574 2575 2576 2577 2578 2579 2580 2581 2582 2583 2584 2585 2586 2587 2588 2589 2590 2591 2592 2593 2594 2595 2596 2597 2598 2599 2600 2601 2602 2603 2604 2605 2606 2607 2608 2609 2610 2611 2612 2613 2614 2615 2616 2617 2618 2619 2620 2621 2622 2623 2624 2625 2626 2627 2628 2629 2630 2631 2632 2633 2634 2635 2636 2637 2638 2639 2640 2641 2642 2643 2644 2645 2646 2647 2648 2649 2650 2651 2652 2653 2654 2655 2656 2657 2658 2659 2660 2661 2662 2663 2664 2665 2666 2667 2668 2669 2670 2671 2672 2673 2674 2675 2676 2677 2678 2679 2680 2681 2682 2683 2684 2685 2686 2687 2688 2689 2690 2691 2692 2693 2694 2695 2696 2697 2698 2699 2700 2701 2702 2703 2704 2705 2706 2707 2708 2709 2710 2711 2712 2713 2714 2715 2716 2717 2718 2719 2720 2721 2722 2723 2724 2725 2726 2727 2728 2729 2730 2731 2732 2733 2734 2735 2736 2737 2738 2739 2740 2741 2742 2743 2744 2745 2746 2747 2748 2749 2750 2751 2752 2753 2754 2755 2756 2757 2758 2759 2760 2761 2762 2763 2764 2765 2766 2767 2768 2769 2770 2771 2772 2773 2774 2775 2776 2777 2778 2779 2780 2781 2782 2783 2784 2785 2786 2787 2788 2789 2790 2791 2792 2793 2794 2795 2796 2797 2798 2799 2800 2801 2802 2803 2804 2805 2806 2807 2808 2809 2810 2811 2812 2813 2814 2815 2816 2817 2818 2819 2820 2821 2822 2823 2824 2825 2826 2827 2828 2829 2830 2831 2832 2833 2834 2835 2836 2837 2838 2839 2840 2841 2842 2843 2844 2845 2846 2847 2848 2849 2850 2851 2852 2853 2854 2855 2856 2857 2858 2859 2860 2861 2862 2863 2864 2865 2866 2867 2868 2869 2870 2871 2872 2873 2874 2875 2876 2877 2878 2879 2880 2881 2882 2883 2884 2885 2886 2887 2888 2889 2890 2891 2892 2893 2894 2895 2896 2897 2898 2899 2900 2901 2902 2903 2904 2905 2906 2907 2908 2909 2910 2911 2912 2913 2914 2915 2916 2917 2918 2919 2920 2921 2922 2923 2924 2925 2926 2927 2928 2929 2930 2931 2932 2933 2934 2935 2936 2937 2938 2939 2940 2941 2942 2943 2944 2945 2946 2947 2948 2949 2950 2951 2952 2953 2954 2955 2956 2957 2958 2959 2960 2961 2962 2963 2964 2965 2966 2967 2968 2969 2970 2971 2972 2973 2974 2975 2976 2977 2978 2979 2980 2981 2982 2983 2984 2985 2986 2987 2988 2989 2990 2991 2992 2993 2994 2995 2996 2997 2998 2999 3000 3001 3002 3003 3004 3005 3006 3007 3008 3009 3010 3011 3012 3013 3014 3015 3016 3017 3018 3019 3020 3021 3022 3023 3024 3025 3026 3027 3028 3029 3030 3031 3032 3033 3034 3035 3036 3037 3038 3039 3040 3041 3042 3043 3044 3045 3046 3047 3048 3049 3050 3051 3052 3053 3054 3055 3056 3057 3058 3059 3060 3061 3062 3063 3064 3065 3066 3067 3068 3069 3070 3071 3072 3073 3074 3075 3076 3077 3078 3079 3080 3081 3082 3083 3084 3085 3086 3087 3088 3089 3090 3091 3092 3093 3094 3095 3096 3097 3098 3099 3100 3101 3102 3103 3104 3105 3106 3107 3108 3109 3110 3111 3112 3113 3114 3115 3116 3117 3118 3119 3120 3121 3122 3123 3124 3125 3126 3127 3128 3129 3130 3131 3132 3133 3134 3135 3136 3137 3138 3139 3140 3141 3142 3143 3144 3145 3146 3147 3148 3149 3150 3151 3152 3153 3154 3155 3156 3157 3158 3159 3160 3161 3162 3163 3164 3165 3166 3167 3168 3169 3170 3171 3172 3173 3174 3175 3176 3177 3178 3179 3180 3181 3182 3183 3184 3185 3186 3187 3188 3189 3190 3191 3192 3193 3194 3195 3196 3197 3198 3199 3200 3201 3202 3203 3204 3205 3206 3207 3208 3209 3210 3211 3212 3213 3214 3215 3216 3217 3218 3219 3220 3221 3222 3223 3224 3225 3226 3227 3228 3229 3230 3231 3232 3233 3234 3235 3236 3237 3238 3239 3240 3241 3242 3243 3244 3245 3246 3247 3248 3249 3250 3251 3252 3253 3254 3255 3256 3257 3258 3259 3260 3261 3262 3263 3264 3265 3266 3267 3268 3269 3270 3271 3272 3273 3274 3275 3276 3277 3278 3279 3280 3281 3282 3283 3284 3285 3286 3287 3288 3289 3290 3291 3292 3293 3294 3295 3296 3297 3298 3299 3300 3301 3302 3303 3304 3305 3306 3307 3308 3309 3310 3311 3312 3313 3314 3315 3316 3317 3318 3319 3320 3321 3322 3323 3324 3325 3326 3327 3328 3329 3330 3331 3332 3333 3334 3335 3336 3337 3338 3339 3340 3341 3342 3343 3344 3345 3346 3347 3348 3349 3350 3351 3352 3353 3354 3355 3356 3357 3358 3359 3360 3361 3362 3363 3364 3365 3366 3367 3368 3369 3370 3371 3372 3373 3374 3375 3376 3377 3378 3379 3380 3381 3382 3383 3384 3385 3386 3387 3388 3389 3390 3391 3392 3393 3394 3395 3396 3397 3398 3399 3400 3401 3402 3403 3404 3405 3406 3407 3408 3409 3410 3411 3412 3413 3414 3415 3416 3417 3418 3419 3420 3421 3422 3423 3424 3425 3426 3427 3428 3429 3430 3431 3432 3433 3434 3435 3436 3437 3438 3439 3440 3441 3442 3443 3444 3445 3446 3447 3448 3449 3450 3451 3452 3453 3454 3455 3456 3457 3458 3459 3460 3461 3462 3463 3464 3465 3466 3467 3468 3469 3470 3471 3472 3473 3474 3475 3476 3477 3478 3479 3480 3481 3482 3483 3484 3485 3486 3487 3488 3489 3490 3491 3492 3493 3494 3495 3496 3497 3498 3499 3500 3501 3502 3503 3504 3505 3506 3507 3508 3509 3510 3511 3512 3513 3514 3515 3516 3517 3518 3519 3520 3521 3522 3523 3524 3525 3526 3527 3528 3529 3530 3531 3532 3533 3534 3535 3536 3537 3538 3539 3540 3541 3542 3543 3544 3545 3546 3547 3548 3549 3550 3551 3552 3553 3554 3555 3556 3557 3558 3559 3560 3561 3562 3563 3564 3565 3566 3567 3568 3569 3570 3571 3572 3573 3574 3575 3576 3577 3578 3579 3580 3581 3582 3583 3584 3585 3586 3587 3588 3589 3590 3591 3592 3593 3594 3595 3596 3597 3598 3599 3600 3601 3602 3603 3604 3605 3606 3607 3608 3609 3610 3611 3612 3613 3614 3615 3616 3617 3618 3619 3620 3621 3622 3623 3624 3625 3626 3627 3628 3629 3630 3631 3632 3633 3634 3635 3636 3637 3638 3639 3640 3641 3642 3643 3644 3645 3646 3647 3648 3649 3650 3651 3652 3653 3654 3655 3656 3657 3658 3659 3660 3661 3662 3663 3664 3665 3666 3667 3668 3669 3670 3671 3672 3673 3674 3675 3676 3677 3678 3679 3680 3681 3682 3683 3684 3685 3686 3687 3688 3689 3690 3691 3692 3693 3694 3695 3696 3697 3698 3699 3700 3701 3702 3703 3704 3705 3706 3707 3708 3709 3710 3711 3712 3713 3714 3715 3716 3717 3718 3719 3720 3721 3722 3723 3724 3725 3726 3727 3728 3729 3730 3731 3732 3733 3734 3735 3736 3737 3738 3739 3740 3741 3742 3743 3744 3745 3746 3747 3748 3749 3750 3751 3752 3753 3754 3755 3756 3757 3758 3759 3760 3761 3762 3763 3764 3765 3766 3767 3768 3769 3770 3771 3772 3773 3774 3775 3776 3777 3778 3779 3780 3781 3782 3783 3784 3785 3786 3787 3788 3789 3790 3791 3792 3793 3794 3795 3796 3797 3798 3799 3800 3801 3802 3803 3804 3805 3806 3807 3808 3809 3810 3811 3812 3813 3814 3815 3816 3817 3818 3819 3820 3821 3822 3823 3824 3825 3826 3827 3828 3829 3830 3831 3832 3833 3834 3835 3836 3837 3838 3839 3840 3841 3842 3843 3844 3845 3846 3847 3848 3849 3850 3851 3852 3853 3854 3855 3856 3857 3858 3859 3860 3861 3862 3863 3864 3865 3866 3867 3868 3869 3870 3871 3872 3873 3874 3875 3876 3877 3878 3879 3880 3881 3882 3883 3884 3885 3886 3887 3888 3889 3890 3891 3892 3893 3894 3895 3896 3897 3898 3899 3900 3901 3902 3903 3904 3905 3906 3907 3908 3909 3910 3911 3912 3913 3914 3915 3916 3917 3918 3919 3920 3921 3922 3923 3924 3925 3926 3927 3928 3929 3930 3931 3932 3933 3934 3935 3936 3937 3938 3939 3940 3941 3942 3943 3944 3945 3946 3947 3948 3949 3950 3951 3952 3953 3954 3955 3956 3957 3958 3959 3960 3961 3962 3963 3964 3965 3966 3967 3968 3969 3970 3971 3972 3973 3974 3975 3976 3977 3978 3979 3980 3981 3982 3983 3984 3985 3986 3987 3988 3989 3990 3991 3992 3993 3994 3995 3996 3997 3998 3999 4000 4001 4002 4003 4004 4005 4006 4007 4008 4009 4010 4011 4012 4013 4014 4015 4016 4017 4018 4019 4020 4021 4022 4023 4024 4025 4026 4027 4028 4029 4030 4031 4032 4033 4034 4035 4036 4037 4038 4039 4040 4041 4042 4043 4044 4045 4046 4047 4048 4049 4050 4051 4052 4053 4054 4055 4056 4057 4058 4059 4060 4061 4062 4063 4064 4065 4066 4067 4068 4069 4070 4071 4072 4073 4074 4075 4076 4077 4078 4079 4080 4081 4082 4083 4084 4085 4086 4087 4088 4089 4090 4091 4092 4093 4094 4095 4096 4097 4098 4099 4100 4101 4102 4103 4104 4105 4106 4107 4108 4109 4110 4111 4112 4113 4114 4115 4116 4117 4118 4119 4120 4121 4122 4123 4124 4125 4126 4127 4128 4129 4130 4131 4132 4133 4134 4135 4136 4137 4138 4139 4140 4141 4142 4143 4144 4145 4146 4147 4148 4149 4150 4151 4152 4153 4154 4155 4156 4157 4158 4159 4160 4161 4162 4163 4164 4165 4166 4167 4168 4169 4170 4171 4172 4173 4174 4175 4176 4177 4178 4179 4180 4181 4182 4183 4184 4185 4186 4187 4188 4189 4190 4191 4192 4193 4194 4195 4196 4197 4198 4199 4200 4201 4202 4203 4204 4205 4206 4207 4208 4209 4210 4211 4212 4213 4214 4215 4216 4217 4218 4219 4220 4221 4222 4223 4224 4225 4226 4227 4228 4229 4230 4231 4232 4233 4234 4235 4236 4237 4238 4239 4240 4241 4242 4243 4244 4245 4246 4247 4248 4249 4250 4251 4252 4253 4254 4255 4256 4257 4258 4259 4260 4261 4262 4263 4264 4265 4266 4267 4268 4269 4270 4271 4272 4273 4274 4275 4276 4277 4278 4279 4280 4281 4282 4283 4284 4285 4286 4287 4288 4289 4290 4291 4292 4293 4294 4295 4296 4297 4298 4299 4300 4301 4302 4303 4304 4305 4306 4307 4308 4309 4310 4311 4312 4313 4314 4315 4316 4317 4318 4319 4320 4321 4322 4323 4324 4325 4326 4327 4328 4329 4330 4331 4332 4333 4334 4335 4336 4337 4338 4339 4340 4341 4342 4343 4344 4345 4346 4347 4348 4349 4350 4351 4352 4353 4354 4355 4356 4357 4358 4359 4360 4361 4362 4363 4364 4365 4366 4367 4368 4369 4370 4371 4372 4373 4374 4375 4376 4377 4378 4379 4380 4381 4382 4383 4384 4385 4386 4387 4388 4389 4390 4391 4392 4393 4394 4395 4396 4397 4398 4399 4400 4401 4402 4403 4404 4405 4406 4407 4408 4409 4410 4411 4412 4413 4414 4415 4416 4417 4418 4419 4420 4421 4422 4423 4424 4425 4426 4427 4428 4429 4430 4431 4432 4433 4434 4435 4436 4437 4438 4439 4440 4441 4442 4443 4444 4445 4446 4447 4448 4449 4450 4451 4452 4453 4454 4455 4456 4457 4458 4459 4460 4461 4462 4463 4464 4465 4466 4467 4468 4469 4470 4471 4472 4473 4474 4475 4476 4477 4478 4479 4480 4481 4482 4483 4484 4485 4486 4487 4488 4489 4490 4491 4492 4493 4494 4495 4496 4497 4498 4499 4500 4501 4502 4503 4504 4505 4506 4507 4508 4509 4510 4511 4512 4513 4514 4515 4516 4517 4518 4519 4520 4521 4522 4523 4524 4525 4526 4527 4528 4529 4530 4531 4532 4533 4534 4535 4536 4537 4538 4539 4540 4541 4542 4543 4544 4545 4546 4547 4548 4549 4550 4551 4552 4553 4554 4555 4556 4557 4558 4559 4560 4561 4562 4563 4564 4565 4566 4567 4568 4569 4570 4571 4572 4573 4574 4575 4576 4577 4578 4579 4580 4581 4582 4583 4584 4585 4586 4587 4588 4589 4590 4591 4592 4593 4594 4595 4596 4597 4598 4599 4600 4601 4602 4603 4604 4605 4606 4607 4608 4609 4610 4611 4612 4613 4614 4615 4616 4617 4618 4619 4620 4621 4622 4623 4624 4625 4626 4627 4628 4629 4630 4631 4632 4633 4634 4635 4636 4637 4638 4639 4640 4641 4642 4643 4644 4645 4646 4647 4648 4649 4650 4651 4652 4653 4654 4655 4656 4657 4658 4659 4660 4661 4662 4663 4664 4665 4666 4667 4668 4669 4670 4671 4672 4673 4674 4675 4676 4677 4678 4679 4680 4681 4682 4683 4684 4685 4686 4687 4688 4689 4690 4691 4692 4693 4694 4695 4696 4697 4698 4699 4700 4701 4702 4703 4704 4705 4706 4707 4708 4709 4710 4711 4712 4713 4714 4715 4716 4717 4718 4719 4720 4721 4722 4723 4724 4725 4726 4727 4728 4729 4730 4731 4732 4733 4734 4735 4736 4737 4738 4739 4740 4741 4742 4743 4744 4745 4746 4747 4748 4749 4750 4751 4752 4753 4754 4755 4756 4757 4758 4759 4760 4761 4762 4763 4764 4765 4766 4767 4768 4769 4770 4771 4772 4773 4774 4775 4776 4777 4778 4779 4780 4781 4782 4783 4784 4785 4786 4787 4788 4789 4790 4791 4792 4793 4794 4795 4796 4797 4798 4799 4800 4801 4802 4803 4804 4805 4806 4807 4808 4809 4810 4811 4812 4813 4814 4815 4816 4817 4818 4819 4820 4821 4822 4823 4824 4825 4826 4827 4828 4829 4830 4831 4832 4833 4834 4835 4836 4837 4838 4839 4840 4841 4842 4843 4844 4845 4846 4847 4848 4849 4850 4851 4852 4853 4854 4855 4856 4857 4858 4859 4860 4861 4862 4863 4864 4865 4866 4867 4868 4869 4870 4871 4872 4873 4874 4875 4876 4877 4878 4879 4880 4881 4882 4883 4884 4885 4886 4887 4888 4889 4890 4891 4892 4893 4894 4895 4896 4897 4898 4899 4900 4901 4902 4903 4904 4905 4906 4907 4908 4909 4910 4911 4912 4913 4914 4915 4916 4917 4918 4919 4920 4921 4922 4923 4924 4925 4926 4927 4928 4929 4930 4931 4932 4933 4934 4935 4936 4937 4938 4939 4940 4941 4942 4943 4944 4945 4946 4947 4948 4949 4950 4951 4952 4953 4954 4955 4956 4957 4958 4959 4960 4961 4962 4963 4964 4965 4966 4967 4968 4969 4970 4971 4972 4973 4974 4975 4976 4977 4978 4979 4980 4981 4982 4983 4984 4985 4986 4987 4988 4989 4990 4991 4992 4993 4994 4995 4996 4997 4998 4999 5000 5001 5002 5003 5004 5005 5006 5007 5008 5009 5010 5011 5012 5013 5014 5015 5016 5017 5018 5019 5020 5021 5022 5023 5024 5025 5026 5027 5028 5029 5030 5031 5032 5033 5034 5035 5036 5037 5038 5039 5040 5041 5042 5043 5044 5045 5046 5047 5048 5049 5050 5051 5052 5053 5054 5055 5056 5057 5058 5059 5060 5061 5062 5063 5064 5065 5066 5067 5068 5069 5070 5071 5072 5073 5074 5075 5076 5077 5078 5079 5080 5081 5082 5083 5084 5085 5086 5087 5088 5089 5090 5091 5092 5093 5094 5095 5096 5097 5098 5099 5100 5101 5102 5103 5104 5105 5106 5107 5108 5109 5110 5111 5112 5113 5114 5115 5116 5117 5118 5119 5120 5121 5122 5123 5124 5125 5126 5127 5128 5129 5130 5131 5132 5133 5134 5135 5136 5137 5138 5139 | // SPDX-License-Identifier: GPL-2.0-only /* * mac80211 configuration hooks for cfg80211 * * Copyright 2006-2010 Johannes Berg <johannes@sipsolutions.net> * Copyright 2013-2015 Intel Mobile Communications GmbH * Copyright (C) 2015-2017 Intel Deutschland GmbH * Copyright (C) 2018-2022 Intel Corporation */ #include <linux/ieee80211.h> #include <linux/nl80211.h> #include <linux/rtnetlink.h> #include <linux/slab.h> #include <net/net_namespace.h> #include <linux/rcupdate.h> #include <linux/fips.h> #include <linux/if_ether.h> #include <net/cfg80211.h> #include "ieee80211_i.h" #include "driver-ops.h" #include "rate.h" #include "mesh.h" #include "wme.h" static struct ieee80211_link_data * ieee80211_link_or_deflink(struct ieee80211_sub_if_data *sdata, int link_id, bool require_valid) { struct ieee80211_link_data *link; if (link_id < 0) { /* * For keys, if sdata is not an MLD, we might not use * the return value at all (if it's not a pairwise key), * so in that case (require_valid==false) don't error. */ if (require_valid && ieee80211_vif_is_mld(&sdata->vif)) return ERR_PTR(-EINVAL); return &sdata->deflink; } link = sdata_dereference(sdata->link[link_id], sdata); if (!link) return ERR_PTR(-ENOLINK); return link; } static void ieee80211_set_mu_mimo_follow(struct ieee80211_sub_if_data *sdata, struct vif_params *params) { bool mu_mimo_groups = false; bool mu_mimo_follow = false; if (params->vht_mumimo_groups) { u64 membership; BUILD_BUG_ON(sizeof(membership) != WLAN_MEMBERSHIP_LEN); memcpy(sdata->vif.bss_conf.mu_group.membership, params->vht_mumimo_groups, WLAN_MEMBERSHIP_LEN); memcpy(sdata->vif.bss_conf.mu_group.position, params->vht_mumimo_groups + WLAN_MEMBERSHIP_LEN, WLAN_USER_POSITION_LEN); ieee80211_link_info_change_notify(sdata, &sdata->deflink, BSS_CHANGED_MU_GROUPS); /* don't care about endianness - just check for 0 */ memcpy(&membership, params->vht_mumimo_groups, WLAN_MEMBERSHIP_LEN); mu_mimo_groups = membership != 0; } if (params->vht_mumimo_follow_addr) { mu_mimo_follow = is_valid_ether_addr(params->vht_mumimo_follow_addr); ether_addr_copy(sdata->u.mntr.mu_follow_addr, params->vht_mumimo_follow_addr); } sdata->vif.bss_conf.mu_mimo_owner = mu_mimo_groups || mu_mimo_follow; } static int ieee80211_set_mon_options(struct ieee80211_sub_if_data *sdata, struct vif_params *params) { struct ieee80211_local *local = sdata->local; struct ieee80211_sub_if_data *monitor_sdata; /* check flags first */ if (params->flags && ieee80211_sdata_running(sdata)) { u32 mask = MONITOR_FLAG_COOK_FRAMES | MONITOR_FLAG_ACTIVE; /* * Prohibit MONITOR_FLAG_COOK_FRAMES and * MONITOR_FLAG_ACTIVE to be changed while the * interface is up. * Else we would need to add a lot of cruft * to update everything: * cooked_mntrs, monitor and all fif_* counters * reconfigure hardware */ if ((params->flags & mask) != (sdata->u.mntr.flags & mask)) return -EBUSY; } /* also validate MU-MIMO change */ monitor_sdata = wiphy_dereference(local->hw.wiphy, local->monitor_sdata); if (!monitor_sdata && (params->vht_mumimo_groups || params->vht_mumimo_follow_addr)) return -EOPNOTSUPP; /* apply all changes now - no failures allowed */ if (monitor_sdata) ieee80211_set_mu_mimo_follow(monitor_sdata, params); if (params->flags) { if (ieee80211_sdata_running(sdata)) { ieee80211_adjust_monitor_flags(sdata, -1); sdata->u.mntr.flags = params->flags; ieee80211_adjust_monitor_flags(sdata, 1); ieee80211_configure_filter(local); } else { /* * Because the interface is down, ieee80211_do_stop * and ieee80211_do_open take care of "everything" * mentioned in the comment above. */ sdata->u.mntr.flags = params->flags; } } return 0; } static int ieee80211_set_ap_mbssid_options(struct ieee80211_sub_if_data *sdata, struct cfg80211_mbssid_config params, struct ieee80211_bss_conf *link_conf) { struct ieee80211_sub_if_data *tx_sdata; sdata->vif.mbssid_tx_vif = NULL; link_conf->bssid_index = 0; link_conf->nontransmitted = false; link_conf->ema_ap = false; link_conf->bssid_indicator = 0; if (sdata->vif.type != NL80211_IFTYPE_AP || !params.tx_wdev) return -EINVAL; tx_sdata = IEEE80211_WDEV_TO_SUB_IF(params.tx_wdev); if (!tx_sdata) return -EINVAL; if (tx_sdata == sdata) { sdata->vif.mbssid_tx_vif = &sdata->vif; } else { sdata->vif.mbssid_tx_vif = &tx_sdata->vif; link_conf->nontransmitted = true; link_conf->bssid_index = params.index; } if (params.ema) link_conf->ema_ap = true; return 0; } static struct wireless_dev *ieee80211_add_iface(struct wiphy *wiphy, const char *name, unsigned char name_assign_type, enum nl80211_iftype type, struct vif_params *params) { struct ieee80211_local *local = wiphy_priv(wiphy); struct wireless_dev *wdev; struct ieee80211_sub_if_data *sdata; int err; err = ieee80211_if_add(local, name, name_assign_type, &wdev, type, params); if (err) return ERR_PTR(err); sdata = IEEE80211_WDEV_TO_SUB_IF(wdev); if (type == NL80211_IFTYPE_MONITOR) { err = ieee80211_set_mon_options(sdata, params); if (err) { ieee80211_if_remove(sdata); return NULL; } } return wdev; } static int ieee80211_del_iface(struct wiphy *wiphy, struct wireless_dev *wdev) { ieee80211_if_remove(IEEE80211_WDEV_TO_SUB_IF(wdev)); return 0; } static int ieee80211_change_iface(struct wiphy *wiphy, struct net_device *dev, enum nl80211_iftype type, struct vif_params *params) { struct ieee80211_sub_if_data *sdata = IEEE80211_DEV_TO_SUB_IF(dev); struct ieee80211_local *local = sdata->local; struct sta_info *sta; int ret; ret = ieee80211_if_change_type(sdata, type); if (ret) return ret; if (type == NL80211_IFTYPE_AP_VLAN && params->use_4addr == 0) { RCU_INIT_POINTER(sdata->u.vlan.sta, NULL); ieee80211_check_fast_rx_iface(sdata); } else if (type == NL80211_IFTYPE_STATION && params->use_4addr >= 0) { struct ieee80211_if_managed *ifmgd = &sdata->u.mgd; if (params->use_4addr == ifmgd->use_4addr) return 0; /* FIXME: no support for 4-addr MLO yet */ if (ieee80211_vif_is_mld(&sdata->vif)) return -EOPNOTSUPP; sdata->u.mgd.use_4addr = params->use_4addr; if (!ifmgd->associated) return 0; mutex_lock(&local->sta_mtx); sta = sta_info_get(sdata, sdata->deflink.u.mgd.bssid); if (sta) drv_sta_set_4addr(local, sdata, &sta->sta, params->use_4addr); mutex_unlock(&local->sta_mtx); if (params->use_4addr) ieee80211_send_4addr_nullfunc(local, sdata); } if (sdata->vif.type == NL80211_IFTYPE_MONITOR) { ret = ieee80211_set_mon_options(sdata, params); if (ret) return ret; } return 0; } static int ieee80211_start_p2p_device(struct wiphy *wiphy, struct wireless_dev *wdev) { struct ieee80211_sub_if_data *sdata = IEEE80211_WDEV_TO_SUB_IF(wdev); int ret; mutex_lock(&sdata->local->chanctx_mtx); ret = ieee80211_check_combinations(sdata, NULL, 0, 0); mutex_unlock(&sdata->local->chanctx_mtx); if (ret < 0) return ret; return ieee80211_do_open(wdev, true); } static void ieee80211_stop_p2p_device(struct wiphy *wiphy, struct wireless_dev *wdev) { ieee80211_sdata_stop(IEEE80211_WDEV_TO_SUB_IF(wdev)); } static int ieee80211_start_nan(struct wiphy *wiphy, struct wireless_dev *wdev, struct cfg80211_nan_conf *conf) { struct ieee80211_sub_if_data *sdata = IEEE80211_WDEV_TO_SUB_IF(wdev); int ret; mutex_lock(&sdata->local->chanctx_mtx); ret = ieee80211_check_combinations(sdata, NULL, 0, 0); mutex_unlock(&sdata->local->chanctx_mtx); if (ret < 0) return ret; ret = ieee80211_do_open(wdev, true); if (ret) return ret; ret = drv_start_nan(sdata->local, sdata, conf); if (ret) ieee80211_sdata_stop(sdata); sdata->u.nan.conf = *conf; return ret; } static void ieee80211_stop_nan(struct wiphy *wiphy, struct wireless_dev *wdev) { struct ieee80211_sub_if_data *sdata = IEEE80211_WDEV_TO_SUB_IF(wdev); drv_stop_nan(sdata->local, sdata); ieee80211_sdata_stop(sdata); } static int ieee80211_nan_change_conf(struct wiphy *wiphy, struct wireless_dev *wdev, struct cfg80211_nan_conf *conf, u32 changes) { struct ieee80211_sub_if_data *sdata = IEEE80211_WDEV_TO_SUB_IF(wdev); struct cfg80211_nan_conf new_conf; int ret = 0; if (sdata->vif.type != NL80211_IFTYPE_NAN) return -EOPNOTSUPP; if (!ieee80211_sdata_running(sdata)) return -ENETDOWN; new_conf = sdata->u.nan.conf; if (changes & CFG80211_NAN_CONF_CHANGED_PREF) new_conf.master_pref = conf->master_pref; if (changes & CFG80211_NAN_CONF_CHANGED_BANDS) new_conf.bands = conf->bands; ret = drv_nan_change_conf(sdata->local, sdata, &new_conf, changes); if (!ret) sdata->u.nan.conf = new_conf; return ret; } static int ieee80211_add_nan_func(struct wiphy *wiphy, struct wireless_dev *wdev, struct cfg80211_nan_func *nan_func) { struct ieee80211_sub_if_data *sdata = IEEE80211_WDEV_TO_SUB_IF(wdev); int ret; if (sdata->vif.type != NL80211_IFTYPE_NAN) return -EOPNOTSUPP; if (!ieee80211_sdata_running(sdata)) return -ENETDOWN; spin_lock_bh(&sdata->u.nan.func_lock); ret = idr_alloc(&sdata->u.nan.function_inst_ids, nan_func, 1, sdata->local->hw.max_nan_de_entries + 1, GFP_ATOMIC); spin_unlock_bh(&sdata->u.nan.func_lock); if (ret < 0) return ret; nan_func->instance_id = ret; WARN_ON(nan_func->instance_id == 0); ret = drv_add_nan_func(sdata->local, sdata, nan_func); if (ret) { spin_lock_bh(&sdata->u.nan.func_lock); idr_remove(&sdata->u.nan.function_inst_ids, nan_func->instance_id); spin_unlock_bh(&sdata->u.nan.func_lock); } return ret; } static struct cfg80211_nan_func * ieee80211_find_nan_func_by_cookie(struct ieee80211_sub_if_data *sdata, u64 cookie) { struct cfg80211_nan_func *func; int id; lockdep_assert_held(&sdata->u.nan.func_lock); idr_for_each_entry(&sdata->u.nan.function_inst_ids, func, id) { if (func->cookie == cookie) return func; } return NULL; } static void ieee80211_del_nan_func(struct wiphy *wiphy, struct wireless_dev *wdev, u64 cookie) { struct ieee80211_sub_if_data *sdata = IEEE80211_WDEV_TO_SUB_IF(wdev); struct cfg80211_nan_func *func; u8 instance_id = 0; if (sdata->vif.type != NL80211_IFTYPE_NAN || !ieee80211_sdata_running(sdata)) return; spin_lock_bh(&sdata->u.nan.func_lock); func = ieee80211_find_nan_func_by_cookie(sdata, cookie); if (func) instance_id = func->instance_id; spin_unlock_bh(&sdata->u.nan.func_lock); if (instance_id) drv_del_nan_func(sdata->local, sdata, instance_id); } static int ieee80211_set_noack_map(struct wiphy *wiphy, struct net_device *dev, u16 noack_map) { struct ieee80211_sub_if_data *sdata = IEEE80211_DEV_TO_SUB_IF(dev); sdata->noack_map = noack_map; ieee80211_check_fast_xmit_iface(sdata); return 0; } static int ieee80211_set_tx(struct ieee80211_sub_if_data *sdata, const u8 *mac_addr, u8 key_idx) { struct ieee80211_local *local = sdata->local; struct ieee80211_key *key; struct sta_info *sta; int ret = -EINVAL; if (!wiphy_ext_feature_isset(local->hw.wiphy, NL80211_EXT_FEATURE_EXT_KEY_ID)) return -EINVAL; sta = sta_info_get_bss(sdata, mac_addr); if (!sta) return -EINVAL; if (sta->ptk_idx == key_idx) return 0; mutex_lock(&local->key_mtx); key = key_mtx_dereference(local, sta->ptk[key_idx]); if (key && key->conf.flags & IEEE80211_KEY_FLAG_NO_AUTO_TX) ret = ieee80211_set_tx_key(key); mutex_unlock(&local->key_mtx); return ret; } static int ieee80211_add_key(struct wiphy *wiphy, struct net_device *dev, int link_id, u8 key_idx, bool pairwise, const u8 *mac_addr, struct key_params *params) { struct ieee80211_sub_if_data *sdata = IEEE80211_DEV_TO_SUB_IF(dev); struct ieee80211_link_data *link = ieee80211_link_or_deflink(sdata, link_id, false); struct ieee80211_local *local = sdata->local; struct sta_info *sta = NULL; struct ieee80211_key *key; int err; if (!ieee80211_sdata_running(sdata)) return -ENETDOWN; if (IS_ERR(link)) return PTR_ERR(link); if (pairwise && params->mode == NL80211_KEY_SET_TX) return ieee80211_set_tx(sdata, mac_addr, key_idx); /* reject WEP and TKIP keys if WEP failed to initialize */ switch (params->cipher) { case WLAN_CIPHER_SUITE_WEP40: case WLAN_CIPHER_SUITE_TKIP: case WLAN_CIPHER_SUITE_WEP104: if (link_id >= 0) return -EINVAL; if (WARN_ON_ONCE(fips_enabled)) return -EINVAL; break; default: break; } key = ieee80211_key_alloc(params->cipher, key_idx, params->key_len, params->key, params->seq_len, params->seq); if (IS_ERR(key)) return PTR_ERR(key); key->conf.link_id = link_id; if (pairwise) key->conf.flags |= IEEE80211_KEY_FLAG_PAIRWISE; if (params->mode == NL80211_KEY_NO_TX) key->conf.flags |= IEEE80211_KEY_FLAG_NO_AUTO_TX; mutex_lock(&local->sta_mtx); if (mac_addr) { sta = sta_info_get_bss(sdata, mac_addr); /* * The ASSOC test makes sure the driver is ready to * receive the key. When wpa_supplicant has roamed * using FT, it attempts to set the key before * association has completed, this rejects that attempt * so it will set the key again after association. * * TODO: accept the key if we have a station entry and * add it to the device after the station. */ if (!sta || !test_sta_flag(sta, WLAN_STA_ASSOC)) { ieee80211_key_free_unused(key); err = -ENOENT; goto out_unlock; } } switch (sdata->vif.type) { case NL80211_IFTYPE_STATION: if (sdata->u.mgd.mfp != IEEE80211_MFP_DISABLED) key->conf.flags |= IEEE80211_KEY_FLAG_RX_MGMT; break; case NL80211_IFTYPE_AP: case NL80211_IFTYPE_AP_VLAN: /* Keys without a station are used for TX only */ if (sta && test_sta_flag(sta, WLAN_STA_MFP)) key->conf.flags |= IEEE80211_KEY_FLAG_RX_MGMT; break; case NL80211_IFTYPE_ADHOC: /* no MFP (yet) */ break; case NL80211_IFTYPE_MESH_POINT: #ifdef CONFIG_MAC80211_MESH if (sdata->u.mesh.security != IEEE80211_MESH_SEC_NONE) key->conf.flags |= IEEE80211_KEY_FLAG_RX_MGMT; break; #endif case NL80211_IFTYPE_WDS: case NL80211_IFTYPE_MONITOR: case NL80211_IFTYPE_P2P_DEVICE: case NL80211_IFTYPE_NAN: case NL80211_IFTYPE_UNSPECIFIED: case NUM_NL80211_IFTYPES: case NL80211_IFTYPE_P2P_CLIENT: case NL80211_IFTYPE_P2P_GO: case NL80211_IFTYPE_OCB: /* shouldn't happen */ WARN_ON_ONCE(1); break; } err = ieee80211_key_link(key, link, sta); /* KRACK protection, shouldn't happen but just silently accept key */ if (err == -EALREADY) err = 0; out_unlock: mutex_unlock(&local->sta_mtx); return err; } static struct ieee80211_key * ieee80211_lookup_key(struct ieee80211_sub_if_data *sdata, int link_id, u8 key_idx, bool pairwise, const u8 *mac_addr) { struct ieee80211_local *local __maybe_unused = sdata->local; struct ieee80211_link_data *link = &sdata->deflink; struct ieee80211_key *key; if (link_id >= 0) { link = rcu_dereference_check(sdata->link[link_id], lockdep_is_held(&sdata->wdev.mtx)); if (!link) return NULL; } if (mac_addr) { struct sta_info *sta; struct link_sta_info *link_sta; sta = sta_info_get_bss(sdata, mac_addr); if (!sta) return NULL; if (link_id >= 0) { link_sta = rcu_dereference_check(sta->link[link_id], lockdep_is_held(&local->sta_mtx)); if (!link_sta) return NULL; } else { link_sta = &sta->deflink; } if (pairwise && key_idx < NUM_DEFAULT_KEYS) return rcu_dereference_check_key_mtx(local, sta->ptk[key_idx]); if (!pairwise && key_idx < NUM_DEFAULT_KEYS + NUM_DEFAULT_MGMT_KEYS + NUM_DEFAULT_BEACON_KEYS) return rcu_dereference_check_key_mtx(local, link_sta->gtk[key_idx]); return NULL; } if (pairwise && key_idx < NUM_DEFAULT_KEYS) return rcu_dereference_check_key_mtx(local, sdata->keys[key_idx]); key = rcu_dereference_check_key_mtx(local, link->gtk[key_idx]); if (key) return key; /* or maybe it was a WEP key */ if (key_idx < NUM_DEFAULT_KEYS) return rcu_dereference_check_key_mtx(local, sdata->keys[key_idx]); return NULL; } static int ieee80211_del_key(struct wiphy *wiphy, struct net_device *dev, int link_id, u8 key_idx, bool pairwise, const u8 *mac_addr) { struct ieee80211_sub_if_data *sdata = IEEE80211_DEV_TO_SUB_IF(dev); struct ieee80211_local *local = sdata->local; struct ieee80211_key *key; int ret; mutex_lock(&local->sta_mtx); mutex_lock(&local->key_mtx); key = ieee80211_lookup_key(sdata, link_id, key_idx, pairwise, mac_addr); if (!key) { ret = -ENOENT; goto out_unlock; } ieee80211_key_free(key, sdata->vif.type == NL80211_IFTYPE_STATION); ret = 0; out_unlock: mutex_unlock(&local->key_mtx); mutex_unlock(&local->sta_mtx); return ret; } static int ieee80211_get_key(struct wiphy *wiphy, struct net_device *dev, int link_id, u8 key_idx, bool pairwise, const u8 *mac_addr, void *cookie, void (*callback)(void *cookie, struct key_params *params)) { struct ieee80211_sub_if_data *sdata; u8 seq[6] = {0}; struct key_params params; struct ieee80211_key *key; u64 pn64; u32 iv32; u16 iv16; int err = -ENOENT; struct ieee80211_key_seq kseq = {}; sdata = IEEE80211_DEV_TO_SUB_IF(dev); rcu_read_lock(); key = ieee80211_lookup_key(sdata, link_id, key_idx, pairwise, mac_addr); if (!key) goto out; memset(¶ms, 0, sizeof(params)); params.cipher = key->conf.cipher; switch (key->conf.cipher) { case WLAN_CIPHER_SUITE_TKIP: pn64 = atomic64_read(&key->conf.tx_pn); iv32 = TKIP_PN_TO_IV32(pn64); iv16 = TKIP_PN_TO_IV16(pn64); if (key->flags & KEY_FLAG_UPLOADED_TO_HARDWARE && !(key->conf.flags & IEEE80211_KEY_FLAG_GENERATE_IV)) { drv_get_key_seq(sdata->local, key, &kseq); iv32 = kseq.tkip.iv32; iv16 = kseq.tkip.iv16; } seq[0] = iv16 & 0xff; seq[1] = (iv16 >> 8) & 0xff; seq[2] = iv32 & 0xff; seq[3] = (iv32 >> 8) & 0xff; seq[4] = (iv32 >> 16) & 0xff; seq[5] = (iv32 >> 24) & 0xff; params.seq = seq; params.seq_len = 6; break; case WLAN_CIPHER_SUITE_CCMP: case WLAN_CIPHER_SUITE_CCMP_256: case WLAN_CIPHER_SUITE_AES_CMAC: case WLAN_CIPHER_SUITE_BIP_CMAC_256: BUILD_BUG_ON(offsetof(typeof(kseq), ccmp) != offsetof(typeof(kseq), aes_cmac)); fallthrough; case WLAN_CIPHER_SUITE_BIP_GMAC_128: case WLAN_CIPHER_SUITE_BIP_GMAC_256: BUILD_BUG_ON(offsetof(typeof(kseq), ccmp) != offsetof(typeof(kseq), aes_gmac)); fallthrough; case WLAN_CIPHER_SUITE_GCMP: case WLAN_CIPHER_SUITE_GCMP_256: BUILD_BUG_ON(offsetof(typeof(kseq), ccmp) != offsetof(typeof(kseq), gcmp)); if (key->flags & KEY_FLAG_UPLOADED_TO_HARDWARE && !(key->conf.flags & IEEE80211_KEY_FLAG_GENERATE_IV)) { drv_get_key_seq(sdata->local, key, &kseq); memcpy(seq, kseq.ccmp.pn, 6); } else { pn64 = atomic64_read(&key->conf.tx_pn); seq[0] = pn64; seq[1] = pn64 >> 8; seq[2] = pn64 >> 16; seq[3] = pn64 >> 24; seq[4] = pn64 >> 32; seq[5] = pn64 >> 40; } params.seq = seq; params.seq_len = 6; break; default: if (!(key->flags & KEY_FLAG_UPLOADED_TO_HARDWARE)) break; if (WARN_ON(key->conf.flags & IEEE80211_KEY_FLAG_GENERATE_IV)) break; drv_get_key_seq(sdata->local, key, &kseq); params.seq = kseq.hw.seq; params.seq_len = kseq.hw.seq_len; break; } params.key = key->conf.key; params.key_len = key->conf.keylen; callback(cookie, ¶ms); err = 0; out: rcu_read_unlock(); return err; } static int ieee80211_config_default_key(struct wiphy *wiphy, struct net_device *dev, int link_id, u8 key_idx, bool uni, bool multi) { struct ieee80211_sub_if_data *sdata = IEEE80211_DEV_TO_SUB_IF(dev); struct ieee80211_link_data *link = ieee80211_link_or_deflink(sdata, link_id, false); if (IS_ERR(link)) return PTR_ERR(link); ieee80211_set_default_key(link, key_idx, uni, multi); return 0; } static int ieee80211_config_default_mgmt_key(struct wiphy *wiphy, struct net_device *dev, int link_id, u8 key_idx) { struct ieee80211_sub_if_data *sdata = IEEE80211_DEV_TO_SUB_IF(dev); struct ieee80211_link_data *link = ieee80211_link_or_deflink(sdata, link_id, true); if (IS_ERR(link)) return PTR_ERR(link); ieee80211_set_default_mgmt_key(link, key_idx); return 0; } static int ieee80211_config_default_beacon_key(struct wiphy *wiphy, struct net_device *dev, int link_id, u8 key_idx) { struct ieee80211_sub_if_data *sdata = IEEE80211_DEV_TO_SUB_IF(dev); struct ieee80211_link_data *link = ieee80211_link_or_deflink(sdata, link_id, true); if (IS_ERR(link)) return PTR_ERR(link); ieee80211_set_default_beacon_key(link, key_idx); return 0; } void sta_set_rate_info_tx(struct sta_info *sta, const struct ieee80211_tx_rate *rate, struct rate_info *rinfo) { rinfo->flags = 0; if (rate->flags & IEEE80211_TX_RC_MCS) { rinfo->flags |= RATE_INFO_FLAGS_MCS; rinfo->mcs = rate->idx; } else if (rate->flags & IEEE80211_TX_RC_VHT_MCS) { rinfo->flags |= RATE_INFO_FLAGS_VHT_MCS; rinfo->mcs = ieee80211_rate_get_vht_mcs(rate); rinfo->nss = ieee80211_rate_get_vht_nss(rate); } else { struct ieee80211_supported_band *sband; int shift = ieee80211_vif_get_shift(&sta->sdata->vif); u16 brate; sband = ieee80211_get_sband(sta->sdata); WARN_ON_ONCE(sband && !sband->bitrates); if (sband && sband->bitrates) { brate = sband->bitrates[rate->idx].bitrate; rinfo->legacy = DIV_ROUND_UP(brate, 1 << shift); } } if (rate->flags & IEEE80211_TX_RC_40_MHZ_WIDTH) rinfo->bw = RATE_INFO_BW_40; else if (rate->flags & IEEE80211_TX_RC_80_MHZ_WIDTH) rinfo->bw = RATE_INFO_BW_80; else if (rate->flags & IEEE80211_TX_RC_160_MHZ_WIDTH) rinfo->bw = RATE_INFO_BW_160; else rinfo->bw = RATE_INFO_BW_20; if (rate->flags & IEEE80211_TX_RC_SHORT_GI) rinfo->flags |= RATE_INFO_FLAGS_SHORT_GI; } static int ieee80211_dump_station(struct wiphy *wiphy, struct net_device *dev, int idx, u8 *mac, struct station_info *sinfo) { struct ieee80211_sub_if_data *sdata = IEEE80211_DEV_TO_SUB_IF(dev); struct ieee80211_local *local = sdata->local; struct sta_info *sta; int ret = -ENOENT; mutex_lock(&local->sta_mtx); sta = sta_info_get_by_idx(sdata, idx); if (sta) { ret = 0; memcpy(mac, sta->sta.addr, ETH_ALEN); sta_set_sinfo(sta, sinfo, true); } mutex_unlock(&local->sta_mtx); return ret; } static int ieee80211_dump_survey(struct wiphy *wiphy, struct net_device *dev, int idx, struct survey_info *survey) { struct ieee80211_local *local = wdev_priv(dev->ieee80211_ptr); return drv_get_survey(local, idx, survey); } static int ieee80211_get_station(struct wiphy *wiphy, struct net_device *dev, const u8 *mac, struct station_info *sinfo) { struct ieee80211_sub_if_data *sdata = IEEE80211_DEV_TO_SUB_IF(dev); struct ieee80211_local *local = sdata->local; struct sta_info *sta; int ret = -ENOENT; mutex_lock(&local->sta_mtx); sta = sta_info_get_bss(sdata, mac); if (sta) { ret = 0; sta_set_sinfo(sta, sinfo, true); } mutex_unlock(&local->sta_mtx); return ret; } static int ieee80211_set_monitor_channel(struct wiphy *wiphy, struct cfg80211_chan_def *chandef) { struct ieee80211_local *local = wiphy_priv(wiphy); struct ieee80211_sub_if_data *sdata; int ret = 0; if (cfg80211_chandef_identical(&local->monitor_chandef, chandef)) return 0; if (local->use_chanctx) { sdata = wiphy_dereference(local->hw.wiphy, local->monitor_sdata); if (sdata) { sdata_lock(sdata); mutex_lock(&local->mtx); ieee80211_link_release_channel(&sdata->deflink); ret = ieee80211_link_use_channel(&sdata->deflink, chandef, IEEE80211_CHANCTX_EXCLUSIVE); mutex_unlock(&local->mtx); sdata_unlock(sdata); } } else { mutex_lock(&local->mtx); if (local->open_count == local->monitors) { local->_oper_chandef = *chandef; ieee80211_hw_config(local, 0); } mutex_unlock(&local->mtx); } if (ret == 0) local->monitor_chandef = *chandef; return ret; } static int ieee80211_set_probe_resp(struct ieee80211_sub_if_data *sdata, const u8 *resp, size_t resp_len, const struct ieee80211_csa_settings *csa, const struct ieee80211_color_change_settings *cca, struct ieee80211_link_data *link) { struct probe_resp *new, *old; if (!resp || !resp_len) return 1; old = sdata_dereference(link->u.ap.probe_resp, sdata); new = kzalloc(sizeof(struct probe_resp) + resp_len, GFP_KERNEL); if (!new) return -ENOMEM; new->len = resp_len; memcpy(new->data, resp, resp_len); if (csa) memcpy(new->cntdwn_counter_offsets, csa->counter_offsets_presp, csa->n_counter_offsets_presp * sizeof(new->cntdwn_counter_offsets[0])); else if (cca) new->cntdwn_counter_offsets[0] = cca->counter_offset_presp; rcu_assign_pointer(link->u.ap.probe_resp, new); if (old) kfree_rcu(old, rcu_head); return 0; } static int ieee80211_set_fils_discovery(struct ieee80211_sub_if_data *sdata, struct cfg80211_fils_discovery *params, struct ieee80211_link_data *link, struct ieee80211_bss_conf *link_conf) { struct fils_discovery_data *new, *old = NULL; struct ieee80211_fils_discovery *fd; if (!params->tmpl || !params->tmpl_len) return -EINVAL; fd = &link_conf->fils_discovery; fd->min_interval = params->min_interval; fd->max_interval = params->max_interval; old = sdata_dereference(link->u.ap.fils_discovery, sdata); new = kzalloc(sizeof(*new) + params->tmpl_len, GFP_KERNEL); if (!new) return -ENOMEM; new->len = params->tmpl_len; memcpy(new->data, params->tmpl, params->tmpl_len); rcu_assign_pointer(link->u.ap.fils_discovery, new); if (old) kfree_rcu(old, rcu_head); return 0; } static int ieee80211_set_unsol_bcast_probe_resp(struct ieee80211_sub_if_data *sdata, struct cfg80211_unsol_bcast_probe_resp *params, struct ieee80211_link_data *link, struct ieee80211_bss_conf *link_conf) { struct unsol_bcast_probe_resp_data *new, *old = NULL; if (!params->tmpl || !params->tmpl_len) return -EINVAL; old = sdata_dereference(link->u.ap.unsol_bcast_probe_resp, sdata); new = kzalloc(sizeof(*new) + params->tmpl_len, GFP_KERNEL); if (!new) return -ENOMEM; new->len = params->tmpl_len; memcpy(new->data, params->tmpl, params->tmpl_len); rcu_assign_pointer(link->u.ap.unsol_bcast_probe_resp, new); if (old) kfree_rcu(old, rcu_head); link_conf->unsol_bcast_probe_resp_interval = params->interval; return 0; } static int ieee80211_set_ftm_responder_params( struct ieee80211_sub_if_data *sdata, const u8 *lci, size_t lci_len, const u8 *civicloc, size_t civicloc_len, struct ieee80211_bss_conf *link_conf) { struct ieee80211_ftm_responder_params *new, *old; u8 *pos; int len; if (!lci_len && !civicloc_len) return 0; old = link_conf->ftmr_params; len = lci_len + civicloc_len; new = kzalloc(sizeof(*new) + len, GFP_KERNEL); if (!new) return -ENOMEM; pos = (u8 *)(new + 1); if (lci_len) { new->lci_len = lci_len; new->lci = pos; memcpy(pos, lci, lci_len); pos += lci_len; } if (civicloc_len) { new->civicloc_len = civicloc_len; new->civicloc = pos; memcpy(pos, civicloc, civicloc_len); pos += civicloc_len; } link_conf->ftmr_params = new; kfree(old); return 0; } static int ieee80211_copy_mbssid_beacon(u8 *pos, struct cfg80211_mbssid_elems *dst, struct cfg80211_mbssid_elems *src) { int i, offset = 0; for (i = 0; i < src->cnt; i++) { memcpy(pos + offset, src->elem[i].data, src->elem[i].len); dst->elem[i].len = src->elem[i].len; dst->elem[i].data = pos + offset; offset += dst->elem[i].len; } dst->cnt = src->cnt; return offset; } static int ieee80211_copy_rnr_beacon(u8 *pos, struct cfg80211_rnr_elems *dst, struct cfg80211_rnr_elems *src) { int i, offset = 0; for (i = 0; i < src->cnt; i++) { memcpy(pos + offset, src->elem[i].data, src->elem[i].len); dst->elem[i].len = src->elem[i].len; dst->elem[i].data = pos + offset; offset += dst->elem[i].len; } dst->cnt = src->cnt; return offset; } static int ieee80211_assign_beacon(struct ieee80211_sub_if_data *sdata, struct ieee80211_link_data *link, struct cfg80211_beacon_data *params, const struct ieee80211_csa_settings *csa, const struct ieee80211_color_change_settings *cca, u64 *changed) { struct cfg80211_mbssid_elems *mbssid = NULL; struct cfg80211_rnr_elems *rnr = NULL; struct beacon_data *new, *old; int new_head_len, new_tail_len; int size, err; u64 _changed = BSS_CHANGED_BEACON; struct ieee80211_bss_conf *link_conf = link->conf; old = sdata_dereference(link->u.ap.beacon, sdata); /* Need to have a beacon head if we don't have one yet */ if (!params->head && !old) return -EINVAL; /* new or old head? */ if (params->head) new_head_len = params->head_len; else new_head_len = old->head_len; /* new or old tail? */ if (params->tail || !old) /* params->tail_len will be zero for !params->tail */ new_tail_len = params->tail_len; else new_tail_len = old->tail_len; size = sizeof(*new) + new_head_len + new_tail_len; /* new or old multiple BSSID elements? */ if (params->mbssid_ies) { mbssid = params->mbssid_ies; size += struct_size(new->mbssid_ies, elem, mbssid->cnt); if (params->rnr_ies) { rnr = params->rnr_ies; size += struct_size(new->rnr_ies, elem, rnr->cnt); } size += ieee80211_get_mbssid_beacon_len(mbssid, rnr, mbssid->cnt); } else if (old && old->mbssid_ies) { mbssid = old->mbssid_ies; size += struct_size(new->mbssid_ies, elem, mbssid->cnt); if (old && old->rnr_ies) { rnr = old->rnr_ies; size += struct_size(new->rnr_ies, elem, rnr->cnt); } size += ieee80211_get_mbssid_beacon_len(mbssid, rnr, mbssid->cnt); } new = kzalloc(size, GFP_KERNEL); if (!new) return -ENOMEM; /* start filling the new info now */ /* * pointers go into the block we allocated, * memory is | beacon_data | head | tail | mbssid_ies | rnr_ies */ new->head = ((u8 *) new) + sizeof(*new); new->tail = new->head + new_head_len; new->head_len = new_head_len; new->tail_len = new_tail_len; /* copy in optional mbssid_ies */ if (mbssid) { u8 *pos = new->tail + new->tail_len; new->mbssid_ies = (void *)pos; pos += struct_size(new->mbssid_ies, elem, mbssid->cnt); pos += ieee80211_copy_mbssid_beacon(pos, new->mbssid_ies, mbssid); if (rnr) { new->rnr_ies = (void *)pos; pos += struct_size(new->rnr_ies, elem, rnr->cnt); ieee80211_copy_rnr_beacon(pos, new->rnr_ies, rnr); } /* update bssid_indicator */ link_conf->bssid_indicator = ilog2(__roundup_pow_of_two(mbssid->cnt + 1)); } if (csa) { new->cntdwn_current_counter = csa->count; memcpy(new->cntdwn_counter_offsets, csa->counter_offsets_beacon, csa->n_counter_offsets_beacon * sizeof(new->cntdwn_counter_offsets[0])); } else if (cca) { new->cntdwn_current_counter = cca->count; new->cntdwn_counter_offsets[0] = cca->counter_offset_beacon; } /* copy in head */ if (params->head) memcpy(new->head, params->head, new_head_len); else memcpy(new->head, old->head, new_head_len); /* copy in optional tail */ if (params->tail) memcpy(new->tail, params->tail, new_tail_len); else if (old) memcpy(new->tail, old->tail, new_tail_len); err = ieee80211_set_probe_resp(sdata, params->probe_resp, params->probe_resp_len, csa, cca, link); if (err < 0) { kfree(new); return err; } if (err == 0) _changed |= BSS_CHANGED_AP_PROBE_RESP; if (params->ftm_responder != -1) { link_conf->ftm_responder = params->ftm_responder; err = ieee80211_set_ftm_responder_params(sdata, params->lci, params->lci_len, params->civicloc, params->civicloc_len, link_conf); if (err < 0) { kfree(new); return err; } _changed |= BSS_CHANGED_FTM_RESPONDER; } rcu_assign_pointer(link->u.ap.beacon, new); sdata->u.ap.active = true; if (old) kfree_rcu(old, rcu_head); *changed |= _changed; return 0; } static int ieee80211_start_ap(struct wiphy *wiphy, struct net_device *dev, struct cfg80211_ap_settings *params) { struct ieee80211_sub_if_data *sdata = IEEE80211_DEV_TO_SUB_IF(dev); struct ieee80211_local *local = sdata->local; struct beacon_data *old; struct ieee80211_sub_if_data *vlan; u64 changed = BSS_CHANGED_BEACON_INT | BSS_CHANGED_BEACON_ENABLED | BSS_CHANGED_BEACON | BSS_CHANGED_P2P_PS | BSS_CHANGED_TXPOWER | BSS_CHANGED_TWT; int i, err; int prev_beacon_int; unsigned int link_id = params->beacon.link_id; struct ieee80211_link_data *link; struct ieee80211_bss_conf *link_conf; link = sdata_dereference(sdata->link[link_id], sdata); if (!link) return -ENOLINK; link_conf = link->conf; old = sdata_dereference(link->u.ap.beacon, sdata); if (old) return -EALREADY; if (params->smps_mode != NL80211_SMPS_OFF) return -ENOTSUPP; link->smps_mode = IEEE80211_SMPS_OFF; link->needed_rx_chains = sdata->local->rx_chains; prev_beacon_int = link_conf->beacon_int; link_conf->beacon_int = params->beacon_interval; if (params->ht_cap) link_conf->ht_ldpc = params->ht_cap->cap_info & cpu_to_le16(IEEE80211_HT_CAP_LDPC_CODING); if (params->vht_cap) { link_conf->vht_ldpc = params->vht_cap->vht_cap_info & cpu_to_le32(IEEE80211_VHT_CAP_RXLDPC); link_conf->vht_su_beamformer = params->vht_cap->vht_cap_info & cpu_to_le32(IEEE80211_VHT_CAP_SU_BEAMFORMER_CAPABLE); link_conf->vht_su_beamformee = params->vht_cap->vht_cap_info & cpu_to_le32(IEEE80211_VHT_CAP_SU_BEAMFORMEE_CAPABLE); link_conf->vht_mu_beamformer = params->vht_cap->vht_cap_info & cpu_to_le32(IEEE80211_VHT_CAP_MU_BEAMFORMER_CAPABLE); link_conf->vht_mu_beamformee = params->vht_cap->vht_cap_info & cpu_to_le32(IEEE80211_VHT_CAP_MU_BEAMFORMEE_CAPABLE); } if (params->he_cap && params->he_oper) { link_conf->he_support = true; link_conf->htc_trig_based_pkt_ext = le32_get_bits(params->he_oper->he_oper_params, IEEE80211_HE_OPERATION_DFLT_PE_DURATION_MASK); link_conf->frame_time_rts_th = le32_get_bits(params->he_oper->he_oper_params, IEEE80211_HE_OPERATION_RTS_THRESHOLD_MASK); changed |= BSS_CHANGED_HE_OBSS_PD; if (params->beacon.he_bss_color.enabled) changed |= BSS_CHANGED_HE_BSS_COLOR; } if (params->he_cap) { link_conf->he_ldpc = params->he_cap->phy_cap_info[1] & IEEE80211_HE_PHY_CAP1_LDPC_CODING_IN_PAYLOAD; link_conf->he_su_beamformer = params->he_cap->phy_cap_info[3] & IEEE80211_HE_PHY_CAP3_SU_BEAMFORMER; link_conf->he_su_beamformee = params->he_cap->phy_cap_info[4] & IEEE80211_HE_PHY_CAP4_SU_BEAMFORMEE; link_conf->he_mu_beamformer = params->he_cap->phy_cap_info[4] & IEEE80211_HE_PHY_CAP4_MU_BEAMFORMER; link_conf->he_full_ul_mumimo = params->he_cap->phy_cap_info[2] & IEEE80211_HE_PHY_CAP2_UL_MU_FULL_MU_MIMO; } if (params->eht_cap) { if (!link_conf->he_support) return -EOPNOTSUPP; link_conf->eht_support = true; link_conf->eht_puncturing = params->punct_bitmap; changed |= BSS_CHANGED_EHT_PUNCTURING; link_conf->eht_su_beamformer = params->eht_cap->fixed.phy_cap_info[0] & IEEE80211_EHT_PHY_CAP0_SU_BEAMFORMER; link_conf->eht_su_beamformee = params->eht_cap->fixed.phy_cap_info[0] & IEEE80211_EHT_PHY_CAP0_SU_BEAMFORMEE; link_conf->eht_mu_beamformer = params->eht_cap->fixed.phy_cap_info[7] & (IEEE80211_EHT_PHY_CAP7_MU_BEAMFORMER_80MHZ | IEEE80211_EHT_PHY_CAP7_MU_BEAMFORMER_160MHZ | IEEE80211_EHT_PHY_CAP7_MU_BEAMFORMER_320MHZ); } else { link_conf->eht_su_beamformer = false; link_conf->eht_su_beamformee = false; link_conf->eht_mu_beamformer = false; } if (sdata->vif.type == NL80211_IFTYPE_AP && params->mbssid_config.tx_wdev) { err = ieee80211_set_ap_mbssid_options(sdata, params->mbssid_config, link_conf); if (err) return err; } mutex_lock(&local->mtx); err = ieee80211_link_use_channel(link, ¶ms->chandef, IEEE80211_CHANCTX_SHARED); if (!err) ieee80211_link_copy_chanctx_to_vlans(link, false); mutex_unlock(&local->mtx); if (err) { link_conf->beacon_int = prev_beacon_int; return err; } /* * Apply control port protocol, this allows us to * not encrypt dynamic WEP control frames. */ sdata->control_port_protocol = params->crypto.control_port_ethertype; sdata->control_port_no_encrypt = params->crypto.control_port_no_encrypt; sdata->control_port_over_nl80211 = params->crypto.control_port_over_nl80211; sdata->control_port_no_preauth = params->crypto.control_port_no_preauth; list_for_each_entry(vlan, &sdata->u.ap.vlans, u.vlan.list) { vlan->control_port_protocol = params->crypto.control_port_ethertype; vlan->control_port_no_encrypt = params->crypto.control_port_no_encrypt; vlan->control_port_over_nl80211 = params->crypto.control_port_over_nl80211; vlan->control_port_no_preauth = params->crypto.control_port_no_preauth; } link_conf->dtim_period = params->dtim_period; link_conf->enable_beacon = true; link_conf->allow_p2p_go_ps = sdata->vif.p2p; link_conf->twt_responder = params->twt_responder; link_conf->he_obss_pd = params->he_obss_pd; link_conf->he_bss_color = params->beacon.he_bss_color; sdata->vif.cfg.s1g = params->chandef.chan->band == NL80211_BAND_S1GHZ; sdata->vif.cfg.ssid_len = params->ssid_len; if (params->ssid_len) memcpy(sdata->vif.cfg.ssid, params->ssid, params->ssid_len); link_conf->hidden_ssid = (params->hidden_ssid != NL80211_HIDDEN_SSID_NOT_IN_USE); memset(&link_conf->p2p_noa_attr, 0, sizeof(link_conf->p2p_noa_attr)); link_conf->p2p_noa_attr.oppps_ctwindow = params->p2p_ctwindow & IEEE80211_P2P_OPPPS_CTWINDOW_MASK; if (params->p2p_opp_ps) link_conf->p2p_noa_attr.oppps_ctwindow |= IEEE80211_P2P_OPPPS_ENABLE_BIT; sdata->beacon_rate_set = false; if (wiphy_ext_feature_isset(local->hw.wiphy, NL80211_EXT_FEATURE_BEACON_RATE_LEGACY)) { for (i = 0; i < NUM_NL80211_BANDS; i++) { sdata->beacon_rateidx_mask[i] = params->beacon_rate.control[i].legacy; if (sdata->beacon_rateidx_mask[i]) sdata->beacon_rate_set = true; } } if (ieee80211_hw_check(&local->hw, HAS_RATE_CONTROL)) link_conf->beacon_tx_rate = params->beacon_rate; err = ieee80211_assign_beacon(sdata, link, ¶ms->beacon, NULL, NULL, &changed); if (err < 0) goto error; if (params->fils_discovery.max_interval) { err = ieee80211_set_fils_discovery(sdata, ¶ms->fils_discovery, link, link_conf); if (err < 0) goto error; changed |= BSS_CHANGED_FILS_DISCOVERY; } if (params->unsol_bcast_probe_resp.interval) { err = ieee80211_set_unsol_bcast_probe_resp(sdata, ¶ms->unsol_bcast_probe_resp, link, link_conf); if (err < 0) goto error; changed |= BSS_CHANGED_UNSOL_BCAST_PROBE_RESP; } err = drv_start_ap(sdata->local, sdata, link_conf); if (err) { old = sdata_dereference(link->u.ap.beacon, sdata); if (old) kfree_rcu(old, rcu_head); RCU_INIT_POINTER(link->u.ap.beacon, NULL); sdata->u.ap.active = false; goto error; } ieee80211_recalc_dtim(local, sdata); ieee80211_vif_cfg_change_notify(sdata, BSS_CHANGED_SSID); ieee80211_link_info_change_notify(sdata, link, changed); netif_carrier_on(dev); list_for_each_entry(vlan, &sdata->u.ap.vlans, u.vlan.list) netif_carrier_on(vlan->dev); return 0; error: mutex_lock(&local->mtx); ieee80211_link_release_channel(link); mutex_unlock(&local->mtx); return err; } static int ieee80211_change_beacon(struct wiphy *wiphy, struct net_device *dev, struct cfg80211_beacon_data *params) { struct ieee80211_sub_if_data *sdata = IEEE80211_DEV_TO_SUB_IF(dev); struct ieee80211_link_data *link; struct beacon_data *old; int err; struct ieee80211_bss_conf *link_conf; u64 changed = 0; sdata_assert_lock(sdata); link = sdata_dereference(sdata->link[params->link_id], sdata); if (!link) return -ENOLINK; link_conf = link->conf; /* don't allow changing the beacon while a countdown is in place - offset * of channel switch counter may change */ if (link_conf->csa_active || link_conf->color_change_active) return -EBUSY; old = sdata_dereference(link->u.ap.beacon, sdata); if (!old) return -ENOENT; err = ieee80211_assign_beacon(sdata, link, params, NULL, NULL, &changed); if (err < 0) return err; if (params->he_bss_color_valid && params->he_bss_color.enabled != link_conf->he_bss_color.enabled) { link_conf->he_bss_color.enabled = params->he_bss_color.enabled; changed |= BSS_CHANGED_HE_BSS_COLOR; } ieee80211_link_info_change_notify(sdata, link, changed); return 0; } static void ieee80211_free_next_beacon(struct ieee80211_link_data *link) { if (!link->u.ap.next_beacon) return; kfree(link->u.ap.next_beacon->mbssid_ies); kfree(link->u.ap.next_beacon->rnr_ies); kfree(link->u.ap.next_beacon); link->u.ap.next_beacon = NULL; } static int ieee80211_stop_ap(struct wiphy *wiphy, struct net_device *dev, unsigned int link_id) { struct ieee80211_sub_if_data *sdata = IEEE80211_DEV_TO_SUB_IF(dev); struct ieee80211_sub_if_data *vlan; struct ieee80211_local *local = sdata->local; struct beacon_data *old_beacon; struct probe_resp *old_probe_resp; struct fils_discovery_data *old_fils_discovery; struct unsol_bcast_probe_resp_data *old_unsol_bcast_probe_resp; struct cfg80211_chan_def chandef; struct ieee80211_link_data *link = sdata_dereference(sdata->link[link_id], sdata); struct ieee80211_bss_conf *link_conf = link->conf; sdata_assert_lock(sdata); old_beacon = sdata_dereference(link->u.ap.beacon, sdata); if (!old_beacon) return -ENOENT; old_probe_resp = sdata_dereference(link->u.ap.probe_resp, sdata); old_fils_discovery = sdata_dereference(link->u.ap.fils_discovery, sdata); old_unsol_bcast_probe_resp = sdata_dereference(link->u.ap.unsol_bcast_probe_resp, sdata); /* abort any running channel switch or color change */ mutex_lock(&local->mtx); link_conf->csa_active = false; link_conf->color_change_active = false; if (link->csa_block_tx) { ieee80211_wake_vif_queues(local, sdata, IEEE80211_QUEUE_STOP_REASON_CSA); link->csa_block_tx = false; } mutex_unlock(&local->mtx); ieee80211_free_next_beacon(link); /* turn off carrier for this interface and dependent VLANs */ list_for_each_entry(vlan, &sdata->u.ap.vlans, u.vlan.list) netif_carrier_off(vlan->dev); netif_carrier_off(dev); /* remove beacon and probe response */ sdata->u.ap.active = false; RCU_INIT_POINTER(link->u.ap.beacon, NULL); RCU_INIT_POINTER(link->u.ap.probe_resp, NULL); RCU_INIT_POINTER(link->u.ap.fils_discovery, NULL); RCU_INIT_POINTER(link->u.ap.unsol_bcast_probe_resp, NULL); kfree_rcu(old_beacon, rcu_head); if (old_probe_resp) kfree_rcu(old_probe_resp, rcu_head); if (old_fils_discovery) kfree_rcu(old_fils_discovery, rcu_head); if (old_unsol_bcast_probe_resp) kfree_rcu(old_unsol_bcast_probe_resp, rcu_head); kfree(link_conf->ftmr_params); link_conf->ftmr_params = NULL; sdata->vif.mbssid_tx_vif = NULL; link_conf->bssid_index = 0; link_conf->nontransmitted = false; link_conf->ema_ap = false; link_conf->bssid_indicator = 0; __sta_info_flush(sdata, true); ieee80211_free_keys(sdata, true); link_conf->enable_beacon = false; sdata->beacon_rate_set = false; sdata->vif.cfg.ssid_len = 0; clear_bit(SDATA_STATE_OFFCHANNEL_BEACON_STOPPED, &sdata->state); ieee80211_link_info_change_notify(sdata, link, BSS_CHANGED_BEACON_ENABLED); if (sdata->wdev.cac_started) { chandef = link_conf->chandef; cancel_delayed_work_sync(&link->dfs_cac_timer_work); cfg80211_cac_event(sdata->dev, &chandef, NL80211_RADAR_CAC_ABORTED, GFP_KERNEL); } drv_stop_ap(sdata->local, sdata, link_conf); /* free all potentially still buffered bcast frames */ local->total_ps_buffered -= skb_queue_len(&sdata->u.ap.ps.bc_buf); ieee80211_purge_tx_queue(&local->hw, &sdata->u.ap.ps.bc_buf); mutex_lock(&local->mtx); ieee80211_link_copy_chanctx_to_vlans(link, true); ieee80211_link_release_channel(link); mutex_unlock(&local->mtx); return 0; } static int sta_apply_auth_flags(struct ieee80211_local *local, struct sta_info *sta, u32 mask, u32 set) { int ret; if (mask & BIT(NL80211_STA_FLAG_AUTHENTICATED) && set & BIT(NL80211_STA_FLAG_AUTHENTICATED) && !test_sta_flag(sta, WLAN_STA_AUTH)) { ret = sta_info_move_state(sta, IEEE80211_STA_AUTH); if (ret) return ret; } if (mask & BIT(NL80211_STA_FLAG_ASSOCIATED) && set & BIT(NL80211_STA_FLAG_ASSOCIATED) && !test_sta_flag(sta, WLAN_STA_ASSOC)) { /* * When peer becomes associated, init rate control as * well. Some drivers require rate control initialized * before drv_sta_state() is called. */ if (!test_sta_flag(sta, WLAN_STA_RATE_CONTROL)) rate_control_rate_init(sta); ret = sta_info_move_state(sta, IEEE80211_STA_ASSOC); if (ret) return ret; } if (mask & BIT(NL80211_STA_FLAG_AUTHORIZED)) { if (set & BIT(NL80211_STA_FLAG_AUTHORIZED)) ret = sta_info_move_state(sta, IEEE80211_STA_AUTHORIZED); else if (test_sta_flag(sta, WLAN_STA_AUTHORIZED)) ret = sta_info_move_state(sta, IEEE80211_STA_ASSOC); else ret = 0; if (ret) return ret; } if (mask & BIT(NL80211_STA_FLAG_ASSOCIATED) && !(set & BIT(NL80211_STA_FLAG_ASSOCIATED)) && test_sta_flag(sta, WLAN_STA_ASSOC)) { ret = sta_info_move_state(sta, IEEE80211_STA_AUTH); if (ret) return ret; } if (mask & BIT(NL80211_STA_FLAG_AUTHENTICATED) && !(set & BIT(NL80211_STA_FLAG_AUTHENTICATED)) && test_sta_flag(sta, WLAN_STA_AUTH)) { ret = sta_info_move_state(sta, IEEE80211_STA_NONE); if (ret) return ret; } return 0; } static void sta_apply_mesh_params(struct ieee80211_local *local, struct sta_info *sta, struct station_parameters *params) { #ifdef CONFIG_MAC80211_MESH struct ieee80211_sub_if_data *sdata = sta->sdata; u64 changed = 0; if (params->sta_modify_mask & STATION_PARAM_APPLY_PLINK_STATE) { switch (params->plink_state) { case NL80211_PLINK_ESTAB: if (sta->mesh->plink_state != NL80211_PLINK_ESTAB) changed = mesh_plink_inc_estab_count(sdata); sta->mesh->plink_state = params->plink_state; sta->mesh->aid = params->peer_aid; ieee80211_mps_sta_status_update(sta); changed |= ieee80211_mps_set_sta_local_pm(sta, sdata->u.mesh.mshcfg.power_mode); ewma_mesh_tx_rate_avg_init(&sta->mesh->tx_rate_avg); /* init at low value */ ewma_mesh_tx_rate_avg_add(&sta->mesh->tx_rate_avg, 10); break; case NL80211_PLINK_LISTEN: case NL80211_PLINK_BLOCKED: case NL80211_PLINK_OPN_SNT: case NL80211_PLINK_OPN_RCVD: case NL80211_PLINK_CNF_RCVD: case NL80211_PLINK_HOLDING: if (sta->mesh->plink_state == NL80211_PLINK_ESTAB) changed = mesh_plink_dec_estab_count(sdata); sta->mesh->plink_state = params->plink_state; ieee80211_mps_sta_status_update(sta); changed |= ieee80211_mps_set_sta_local_pm(sta, NL80211_MESH_POWER_UNKNOWN); break; default: /* nothing */ break; } } switch (params->plink_action) { case NL80211_PLINK_ACTION_NO_ACTION: /* nothing */ break; case NL80211_PLINK_ACTION_OPEN: changed |= mesh_plink_open(sta); break; case NL80211_PLINK_ACTION_BLOCK: changed |= mesh_plink_block(sta); break; } if (params->local_pm) changed |= ieee80211_mps_set_sta_local_pm(sta, params->local_pm); ieee80211_mbss_info_change_notify(sdata, changed); #endif } static int sta_link_apply_parameters(struct ieee80211_local *local, struct sta_info *sta, bool new_link, struct link_station_parameters *params) { int ret = 0; struct ieee80211_supported_band *sband; struct ieee80211_sub_if_data *sdata = sta->sdata; u32 link_id = params->link_id < 0 ? 0 : params->link_id; struct ieee80211_link_data *link = sdata_dereference(sdata->link[link_id], sdata); struct link_sta_info *link_sta = rcu_dereference_protected(sta->link[link_id], lockdep_is_held(&local->sta_mtx)); /* * If there are no changes, then accept a link that doesn't exist, * unless it's a new link. */ if (params->link_id < 0 && !new_link && !params->link_mac && !params->txpwr_set && !params->supported_rates_len && !params->ht_capa && !params->vht_capa && !params->he_capa && !params->eht_capa && !params->opmode_notif_used) return 0; if (!link || !link_sta) return -EINVAL; sband = ieee80211_get_link_sband(link); if (!sband) return -EINVAL; if (params->link_mac) { if (new_link) { memcpy(link_sta->addr, params->link_mac, ETH_ALEN); memcpy(link_sta->pub->addr, params->link_mac, ETH_ALEN); } else if (!ether_addr_equal(link_sta->addr, params->link_mac)) { return -EINVAL; } } else if (new_link) { return -EINVAL; } if (params->txpwr_set) { link_sta->pub->txpwr.type = params->txpwr.type; if (params->txpwr.type == NL80211_TX_POWER_LIMITED) link_sta->pub->txpwr.power = params->txpwr.power; ret = drv_sta_set_txpwr(local, sdata, sta); if (ret) return ret; } if (params->supported_rates && params->supported_rates_len) { ieee80211_parse_bitrates(link->conf->chandef.width, sband, params->supported_rates, params->supported_rates_len, &link_sta->pub->supp_rates[sband->band]); } if (params->ht_capa) ieee80211_ht_cap_ie_to_sta_ht_cap(sdata, sband, params->ht_capa, link_sta); /* VHT can override some HT caps such as the A-MSDU max length */ if (params->vht_capa) ieee80211_vht_cap_ie_to_sta_vht_cap(sdata, sband, params->vht_capa, NULL, link_sta); if (params->he_capa) ieee80211_he_cap_ie_to_sta_he_cap(sdata, sband, (void *)params->he_capa, params->he_capa_len, (void *)params->he_6ghz_capa, link_sta); if (params->he_capa && params->eht_capa) ieee80211_eht_cap_ie_to_sta_eht_cap(sdata, sband, (u8 *)params->he_capa, params->he_capa_len, params->eht_capa, params->eht_capa_len, link_sta); if (params->opmode_notif_used) { /* returned value is only needed for rc update, but the * rc isn't initialized here yet, so ignore it */ __ieee80211_vht_handle_opmode(sdata, link_sta, params->opmode_notif, sband->band); } return ret; } static int sta_apply_parameters(struct ieee80211_local *local, struct sta_info *sta, struct station_parameters *params) { struct ieee80211_sub_if_data *sdata = sta->sdata; u32 mask, set; int ret = 0; mask = params->sta_flags_mask; set = params->sta_flags_set; if (ieee80211_vif_is_mesh(&sdata->vif)) { /* * In mesh mode, ASSOCIATED isn't part of the nl80211 * API but must follow AUTHENTICATED for driver state. */ if (mask & BIT(NL80211_STA_FLAG_AUTHENTICATED)) mask |= BIT(NL80211_STA_FLAG_ASSOCIATED); if (set & BIT(NL80211_STA_FLAG_AUTHENTICATED)) set |= BIT(NL80211_STA_FLAG_ASSOCIATED); } else if (test_sta_flag(sta, WLAN_STA_TDLS_PEER)) { /* * TDLS -- everything follows authorized, but * only becoming authorized is possible, not * going back */ if (set & BIT(NL80211_STA_FLAG_AUTHORIZED)) { set |= BIT(NL80211_STA_FLAG_AUTHENTICATED) | BIT(NL80211_STA_FLAG_ASSOCIATED); mask |= BIT(NL80211_STA_FLAG_AUTHENTICATED) | BIT(NL80211_STA_FLAG_ASSOCIATED); } } if (mask & BIT(NL80211_STA_FLAG_WME) && local->hw.queues >= IEEE80211_NUM_ACS) sta->sta.wme = set & BIT(NL80211_STA_FLAG_WME); /* auth flags will be set later for TDLS, * and for unassociated stations that move to associated */ if (!test_sta_flag(sta, WLAN_STA_TDLS_PEER) && !((mask & BIT(NL80211_STA_FLAG_ASSOCIATED)) && (set & BIT(NL80211_STA_FLAG_ASSOCIATED)))) { ret = sta_apply_auth_flags(local, sta, mask, set); if (ret) return ret; } if (mask & BIT(NL80211_STA_FLAG_SHORT_PREAMBLE)) { if (set & BIT(NL80211_STA_FLAG_SHORT_PREAMBLE)) set_sta_flag(sta, WLAN_STA_SHORT_PREAMBLE); else clear_sta_flag(sta, WLAN_STA_SHORT_PREAMBLE); } if (mask & BIT(NL80211_STA_FLAG_MFP)) { sta->sta.mfp = !!(set & BIT(NL80211_STA_FLAG_MFP)); if (set & BIT(NL80211_STA_FLAG_MFP)) set_sta_flag(sta, WLAN_STA_MFP); else clear_sta_flag(sta, WLAN_STA_MFP); } if (mask & BIT(NL80211_STA_FLAG_TDLS_PEER)) { if (set & BIT(NL80211_STA_FLAG_TDLS_PEER)) set_sta_flag(sta, WLAN_STA_TDLS_PEER); else clear_sta_flag(sta, WLAN_STA_TDLS_PEER); } /* mark TDLS channel switch support, if the AP allows it */ if (test_sta_flag(sta, WLAN_STA_TDLS_PEER) && !sdata->deflink.u.mgd.tdls_chan_switch_prohibited && params->ext_capab_len >= 4 && params->ext_capab[3] & WLAN_EXT_CAPA4_TDLS_CHAN_SWITCH) set_sta_flag(sta, WLAN_STA_TDLS_CHAN_SWITCH); if (test_sta_flag(sta, WLAN_STA_TDLS_PEER) && !sdata->u.mgd.tdls_wider_bw_prohibited && ieee80211_hw_check(&local->hw, TDLS_WIDER_BW) && params->ext_capab_len >= 8 && params->ext_capab[7] & WLAN_EXT_CAPA8_TDLS_WIDE_BW_ENABLED) set_sta_flag(sta, WLAN_STA_TDLS_WIDER_BW); if (params->sta_modify_mask & STATION_PARAM_APPLY_UAPSD) { sta->sta.uapsd_queues = params->uapsd_queues; sta->sta.max_sp = params->max_sp; } ieee80211_sta_set_max_amsdu_subframes(sta, params->ext_capab, params->ext_capab_len); /* * cfg80211 validates this (1-2007) and allows setting the AID * only when creating a new station entry */ if (params->aid) sta->sta.aid = params->aid; /* * Some of the following updates would be racy if called on an * existing station, via ieee80211_change_station(). However, * all such changes are rejected by cfg80211 except for updates * changing the supported rates on an existing but not yet used * TDLS peer. */ if (params->listen_interval >= 0) sta->listen_interval = params->listen_interval; ret = sta_link_apply_parameters(local, sta, false, ¶ms->link_sta_params); if (ret) return ret; if (params->support_p2p_ps >= 0) sta->sta.support_p2p_ps = params->support_p2p_ps; if (ieee80211_vif_is_mesh(&sdata->vif)) sta_apply_mesh_params(local, sta, params); if (params->airtime_weight) sta->airtime_weight = params->airtime_weight; /* set the STA state after all sta info from usermode has been set */ if (test_sta_flag(sta, WLAN_STA_TDLS_PEER) || set & BIT(NL80211_STA_FLAG_ASSOCIATED)) { ret = sta_apply_auth_flags(local, sta, mask, set); if (ret) return ret; } /* Mark the STA as MLO if MLD MAC address is available */ if (params->link_sta_params.mld_mac) sta->sta.mlo = true; return 0; } static int ieee80211_add_station(struct wiphy *wiphy, struct net_device *dev, const u8 *mac, struct station_parameters *params) { struct ieee80211_local *local = wiphy_priv(wiphy); struct sta_info *sta; struct ieee80211_sub_if_data *sdata; int err; if (params->vlan) { sdata = IEEE80211_DEV_TO_SUB_IF(params->vlan); if (sdata->vif.type != NL80211_IFTYPE_AP_VLAN && sdata->vif.type != NL80211_IFTYPE_AP) return -EINVAL; } else sdata = IEEE80211_DEV_TO_SUB_IF(dev); if (ether_addr_equal(mac, sdata->vif.addr)) return -EINVAL; if (!is_valid_ether_addr(mac)) return -EINVAL; if (params->sta_flags_set & BIT(NL80211_STA_FLAG_TDLS_PEER) && sdata->vif.type == NL80211_IFTYPE_STATION && !sdata->u.mgd.associated) return -EINVAL; /* * If we have a link ID, it can be a non-MLO station on an AP MLD, * but we need to have a link_mac in that case as well, so use the * STA's MAC address in that case. */ if (params->link_sta_params.link_id >= 0) sta = sta_info_alloc_with_link(sdata, mac, params->link_sta_params.link_id, params->link_sta_params.link_mac ?: mac, GFP_KERNEL); else sta = sta_info_alloc(sdata, mac, GFP_KERNEL); if (!sta) return -ENOMEM; if (params->sta_flags_set & BIT(NL80211_STA_FLAG_TDLS_PEER)) sta->sta.tdls = true; /* Though the mutex is not needed here (since the station is not * visible yet), sta_apply_parameters (and inner functions) require * the mutex due to other paths. */ mutex_lock(&local->sta_mtx); err = sta_apply_parameters(local, sta, params); mutex_unlock(&local->sta_mtx); if (err) { sta_info_free(local, sta); return err; } /* * for TDLS and for unassociated station, rate control should be * initialized only when rates are known and station is marked * authorized/associated */ if (!test_sta_flag(sta, WLAN_STA_TDLS_PEER) && test_sta_flag(sta, WLAN_STA_ASSOC)) rate_control_rate_init(sta); return sta_info_insert(sta); } static int ieee80211_del_station(struct wiphy *wiphy, struct net_device *dev, struct station_del_parameters *params) { struct ieee80211_sub_if_data *sdata; sdata = IEEE80211_DEV_TO_SUB_IF(dev); if (params->mac) return sta_info_destroy_addr_bss(sdata, params->mac); sta_info_flush(sdata); return 0; } static int ieee80211_change_station(struct wiphy *wiphy, struct net_device *dev, const u8 *mac, struct station_parameters *params) { struct ieee80211_sub_if_data *sdata = IEEE80211_DEV_TO_SUB_IF(dev); struct ieee80211_local *local = wiphy_priv(wiphy); struct sta_info *sta; struct ieee80211_sub_if_data *vlansdata; enum cfg80211_station_type statype; int err; mutex_lock(&local->sta_mtx); sta = sta_info_get_bss(sdata, mac); if (!sta) { err = -ENOENT; goto out_err; } switch (sdata->vif.type) { case NL80211_IFTYPE_MESH_POINT: if (sdata->u.mesh.user_mpm) statype = CFG80211_STA_MESH_PEER_USER; else statype = CFG80211_STA_MESH_PEER_KERNEL; break; case NL80211_IFTYPE_ADHOC: statype = CFG80211_STA_IBSS; break; case NL80211_IFTYPE_STATION: if (!test_sta_flag(sta, WLAN_STA_TDLS_PEER)) { statype = CFG80211_STA_AP_STA; break; } if (test_sta_flag(sta, WLAN_STA_AUTHORIZED)) statype = CFG80211_STA_TDLS_PEER_ACTIVE; else statype = CFG80211_STA_TDLS_PEER_SETUP; break; case NL80211_IFTYPE_AP: case NL80211_IFTYPE_AP_VLAN: if (test_sta_flag(sta, WLAN_STA_ASSOC)) statype = CFG80211_STA_AP_CLIENT; else statype = CFG80211_STA_AP_CLIENT_UNASSOC; break; default: err = -EOPNOTSUPP; goto out_err; } err = cfg80211_check_station_change(wiphy, params, statype); if (err) goto out_err; if (params->vlan && params->vlan != sta->sdata->dev) { vlansdata = IEEE80211_DEV_TO_SUB_IF(params->vlan); if (params->vlan->ieee80211_ptr->use_4addr) { if (vlansdata->u.vlan.sta) { err = -EBUSY; goto out_err; } rcu_assign_pointer(vlansdata->u.vlan.sta, sta); __ieee80211_check_fast_rx_iface(vlansdata); drv_sta_set_4addr(local, sta->sdata, &sta->sta, true); } if (sta->sdata->vif.type == NL80211_IFTYPE_AP_VLAN && sta->sdata->u.vlan.sta) { ieee80211_clear_fast_rx(sta); RCU_INIT_POINTER(sta->sdata->u.vlan.sta, NULL); } if (test_sta_flag(sta, WLAN_STA_AUTHORIZED)) ieee80211_vif_dec_num_mcast(sta->sdata); sta->sdata = vlansdata; ieee80211_check_fast_xmit(sta); if (test_sta_flag(sta, WLAN_STA_AUTHORIZED)) { ieee80211_vif_inc_num_mcast(sta->sdata); cfg80211_send_layer2_update(sta->sdata->dev, sta->sta.addr); } } /* we use sta_info_get_bss() so this might be different */ if (sdata != sta->sdata) { mutex_lock_nested(&sta->sdata->wdev.mtx, 1); err = sta_apply_parameters(local, sta, params); mutex_unlock(&sta->sdata->wdev.mtx); } else { err = sta_apply_parameters(local, sta, params); } if (err) goto out_err; mutex_unlock(&local->sta_mtx); if (sdata->vif.type == NL80211_IFTYPE_STATION && params->sta_flags_mask & BIT(NL80211_STA_FLAG_AUTHORIZED)) { ieee80211_recalc_ps(local); ieee80211_recalc_ps_vif(sdata); } return 0; out_err: mutex_unlock(&local->sta_mtx); return err; } #ifdef CONFIG_MAC80211_MESH static int ieee80211_add_mpath(struct wiphy *wiphy, struct net_device *dev, const u8 *dst, const u8 *next_hop) { struct ieee80211_sub_if_data *sdata; struct mesh_path *mpath; struct sta_info *sta; sdata = IEEE80211_DEV_TO_SUB_IF(dev); rcu_read_lock(); sta = sta_info_get(sdata, next_hop); if (!sta) { rcu_read_unlock(); return -ENOENT; } mpath = mesh_path_add(sdata, dst); if (IS_ERR(mpath)) { rcu_read_unlock(); return PTR_ERR(mpath); } mesh_path_fix_nexthop(mpath, sta); rcu_read_unlock(); return 0; } static int ieee80211_del_mpath(struct wiphy *wiphy, struct net_device *dev, const u8 *dst) { struct ieee80211_sub_if_data *sdata = IEEE80211_DEV_TO_SUB_IF(dev); if (dst) return mesh_path_del(sdata, dst); mesh_path_flush_by_iface(sdata); return 0; } static int ieee80211_change_mpath(struct wiphy *wiphy, struct net_device *dev, const u8 *dst, const u8 *next_hop) { struct ieee80211_sub_if_data *sdata; struct mesh_path *mpath; struct sta_info *sta; sdata = IEEE80211_DEV_TO_SUB_IF(dev); rcu_read_lock(); sta = sta_info_get(sdata, next_hop); if (!sta) { rcu_read_unlock(); return -ENOENT; } mpath = mesh_path_lookup(sdata, dst); if (!mpath) { rcu_read_unlock(); return -ENOENT; } mesh_path_fix_nexthop(mpath, sta); rcu_read_unlock(); return 0; } static void mpath_set_pinfo(struct mesh_path *mpath, u8 *next_hop, struct mpath_info *pinfo) { struct sta_info *next_hop_sta = rcu_dereference(mpath->next_hop); if (next_hop_sta) memcpy(next_hop, next_hop_sta->sta.addr, ETH_ALEN); else eth_zero_addr(next_hop); memset(pinfo, 0, sizeof(*pinfo)); pinfo->generation = mpath->sdata->u.mesh.mesh_paths_generation; pinfo->filled = MPATH_INFO_FRAME_QLEN | MPATH_INFO_SN | MPATH_INFO_METRIC | MPATH_INFO_EXPTIME | MPATH_INFO_DISCOVERY_TIMEOUT | MPATH_INFO_DISCOVERY_RETRIES | MPATH_INFO_FLAGS | MPATH_INFO_HOP_COUNT | MPATH_INFO_PATH_CHANGE; pinfo->frame_qlen = mpath->frame_queue.qlen; pinfo->sn = mpath->sn; pinfo->metric = mpath->metric; if (time_before(jiffies, mpath->exp_time)) pinfo->exptime = jiffies_to_msecs(mpath->exp_time - jiffies); pinfo->discovery_timeout = jiffies_to_msecs(mpath->discovery_timeout); pinfo->discovery_retries = mpath->discovery_retries; if (mpath->flags & MESH_PATH_ACTIVE) pinfo->flags |= NL80211_MPATH_FLAG_ACTIVE; if (mpath->flags & MESH_PATH_RESOLVING) pinfo->flags |= NL80211_MPATH_FLAG_RESOLVING; if (mpath->flags & MESH_PATH_SN_VALID) pinfo->flags |= NL80211_MPATH_FLAG_SN_VALID; if (mpath->flags & MESH_PATH_FIXED) pinfo->flags |= NL80211_MPATH_FLAG_FIXED; if (mpath->flags & MESH_PATH_RESOLVED) pinfo->flags |= NL80211_MPATH_FLAG_RESOLVED; pinfo->hop_count = mpath->hop_count; pinfo->path_change_count = mpath->path_change_count; } static int ieee80211_get_mpath(struct wiphy *wiphy, struct net_device *dev, u8 *dst, u8 *next_hop, struct mpath_info *pinfo) { struct ieee80211_sub_if_data *sdata; struct mesh_path *mpath; sdata = IEEE80211_DEV_TO_SUB_IF(dev); rcu_read_lock(); mpath = mesh_path_lookup(sdata, dst); if (!mpath) { rcu_read_unlock(); return -ENOENT; } memcpy(dst, mpath->dst, ETH_ALEN); mpath_set_pinfo(mpath, next_hop, pinfo); rcu_read_unlock(); return 0; } static int ieee80211_dump_mpath(struct wiphy *wiphy, struct net_device *dev, int idx, u8 *dst, u8 *next_hop, struct mpath_info *pinfo) { struct ieee80211_sub_if_data *sdata; struct mesh_path *mpath; sdata = IEEE80211_DEV_TO_SUB_IF(dev); rcu_read_lock(); mpath = mesh_path_lookup_by_idx(sdata, idx); if (!mpath) { rcu_read_unlock(); return -ENOENT; } memcpy(dst, mpath->dst, ETH_ALEN); mpath_set_pinfo(mpath, next_hop, pinfo); rcu_read_unlock(); return 0; } static void mpp_set_pinfo(struct mesh_path *mpath, u8 *mpp, struct mpath_info *pinfo) { memset(pinfo, 0, sizeof(*pinfo)); memcpy(mpp, mpath->mpp, ETH_ALEN); pinfo->generation = mpath->sdata->u.mesh.mpp_paths_generation; } static int ieee80211_get_mpp(struct wiphy *wiphy, struct net_device *dev, u8 *dst, u8 *mpp, struct mpath_info *pinfo) { struct ieee80211_sub_if_data *sdata; struct mesh_path *mpath; sdata = IEEE80211_DEV_TO_SUB_IF(dev); rcu_read_lock(); mpath = mpp_path_lookup(sdata, dst); if (!mpath) { rcu_read_unlock(); return -ENOENT; } memcpy(dst, mpath->dst, ETH_ALEN); mpp_set_pinfo(mpath, mpp, pinfo); rcu_read_unlock(); return 0; } static int ieee80211_dump_mpp(struct wiphy *wiphy, struct net_device *dev, int idx, u8 *dst, u8 *mpp, struct mpath_info *pinfo) { struct ieee80211_sub_if_data *sdata; struct mesh_path *mpath; sdata = IEEE80211_DEV_TO_SUB_IF(dev); rcu_read_lock(); mpath = mpp_path_lookup_by_idx(sdata, idx); if (!mpath) { rcu_read_unlock(); return -ENOENT; } memcpy(dst, mpath->dst, ETH_ALEN); mpp_set_pinfo(mpath, mpp, pinfo); rcu_read_unlock(); return 0; } static int ieee80211_get_mesh_config(struct wiphy *wiphy, struct net_device *dev, struct mesh_config *conf) { struct ieee80211_sub_if_data *sdata; sdata = IEEE80211_DEV_TO_SUB_IF(dev); memcpy(conf, &(sdata->u.mesh.mshcfg), sizeof(struct mesh_config)); return 0; } static inline bool _chg_mesh_attr(enum nl80211_meshconf_params parm, u32 mask) { return (mask >> (parm-1)) & 0x1; } static int copy_mesh_setup(struct ieee80211_if_mesh *ifmsh, const struct mesh_setup *setup) { u8 *new_ie; struct ieee80211_sub_if_data *sdata = container_of(ifmsh, struct ieee80211_sub_if_data, u.mesh); int i; /* allocate information elements */ new_ie = NULL; if (setup->ie_len) { new_ie = kmemdup(setup->ie, setup->ie_len, GFP_KERNEL); if (!new_ie) return -ENOMEM; } ifmsh->ie_len = setup->ie_len; ifmsh->ie = new_ie; /* now copy the rest of the setup parameters */ ifmsh->mesh_id_len = setup->mesh_id_len; memcpy(ifmsh->mesh_id, setup->mesh_id, ifmsh->mesh_id_len); ifmsh->mesh_sp_id = setup->sync_method; ifmsh->mesh_pp_id = setup->path_sel_proto; ifmsh->mesh_pm_id = setup->path_metric; ifmsh->user_mpm = setup->user_mpm; ifmsh->mesh_auth_id = setup->auth_id; ifmsh->security = IEEE80211_MESH_SEC_NONE; ifmsh->userspace_handles_dfs = setup->userspace_handles_dfs; if (setup->is_authenticated) ifmsh->security |= IEEE80211_MESH_SEC_AUTHED; if (setup->is_secure) ifmsh->security |= IEEE80211_MESH_SEC_SECURED; /* mcast rate setting in Mesh Node */ memcpy(sdata->vif.bss_conf.mcast_rate, setup->mcast_rate, sizeof(setup->mcast_rate)); sdata->vif.bss_conf.basic_rates = setup->basic_rates; sdata->vif.bss_conf.beacon_int = setup->beacon_interval; sdata->vif.bss_conf.dtim_period = setup->dtim_period; sdata->beacon_rate_set = false; if (wiphy_ext_feature_isset(sdata->local->hw.wiphy, NL80211_EXT_FEATURE_BEACON_RATE_LEGACY)) { for (i = 0; i < NUM_NL80211_BANDS; i++) { sdata->beacon_rateidx_mask[i] = setup->beacon_rate.control[i].legacy; if (sdata->beacon_rateidx_mask[i]) sdata->beacon_rate_set = true; } } return 0; } static int ieee80211_update_mesh_config(struct wiphy *wiphy, struct net_device *dev, u32 mask, const struct mesh_config *nconf) { struct mesh_config *conf; struct ieee80211_sub_if_data *sdata; struct ieee80211_if_mesh *ifmsh; sdata = IEEE80211_DEV_TO_SUB_IF(dev); ifmsh = &sdata->u.mesh; /* Set the config options which we are interested in setting */ conf = &(sdata->u.mesh.mshcfg); if (_chg_mesh_attr(NL80211_MESHCONF_RETRY_TIMEOUT, mask)) conf->dot11MeshRetryTimeout = nconf->dot11MeshRetryTimeout; if (_chg_mesh_attr(NL80211_MESHCONF_CONFIRM_TIMEOUT, mask)) conf->dot11MeshConfirmTimeout = nconf->dot11MeshConfirmTimeout; if (_chg_mesh_attr(NL80211_MESHCONF_HOLDING_TIMEOUT, mask)) conf->dot11MeshHoldingTimeout = nconf->dot11MeshHoldingTimeout; if (_chg_mesh_attr(NL80211_MESHCONF_MAX_PEER_LINKS, mask)) conf->dot11MeshMaxPeerLinks = nconf->dot11MeshMaxPeerLinks; if (_chg_mesh_attr(NL80211_MESHCONF_MAX_RETRIES, mask)) conf->dot11MeshMaxRetries = nconf->dot11MeshMaxRetries; if (_chg_mesh_attr(NL80211_MESHCONF_TTL, mask)) conf->dot11MeshTTL = nconf->dot11MeshTTL; if (_chg_mesh_attr(NL80211_MESHCONF_ELEMENT_TTL, mask)) conf->element_ttl = nconf->element_ttl; if (_chg_mesh_attr(NL80211_MESHCONF_AUTO_OPEN_PLINKS, mask)) { if (ifmsh->user_mpm) return -EBUSY; conf->auto_open_plinks = nconf->auto_open_plinks; } if (_chg_mesh_attr(NL80211_MESHCONF_SYNC_OFFSET_MAX_NEIGHBOR, mask)) conf->dot11MeshNbrOffsetMaxNeighbor = nconf->dot11MeshNbrOffsetMaxNeighbor; if (_chg_mesh_attr(NL80211_MESHCONF_HWMP_MAX_PREQ_RETRIES, mask)) conf->dot11MeshHWMPmaxPREQretries = nconf->dot11MeshHWMPmaxPREQretries; if (_chg_mesh_attr(NL80211_MESHCONF_PATH_REFRESH_TIME, mask)) conf->path_refresh_time = nconf->path_refresh_time; if (_chg_mesh_attr(NL80211_MESHCONF_MIN_DISCOVERY_TIMEOUT, mask)) conf->min_discovery_timeout = nconf->min_discovery_timeout; if (_chg_mesh_attr(NL80211_MESHCONF_HWMP_ACTIVE_PATH_TIMEOUT, mask)) conf->dot11MeshHWMPactivePathTimeout = nconf->dot11MeshHWMPactivePathTimeout; if (_chg_mesh_attr(NL80211_MESHCONF_HWMP_PREQ_MIN_INTERVAL, mask)) conf->dot11MeshHWMPpreqMinInterval = nconf->dot11MeshHWMPpreqMinInterval; if (_chg_mesh_attr(NL80211_MESHCONF_HWMP_PERR_MIN_INTERVAL, mask)) conf->dot11MeshHWMPperrMinInterval = nconf->dot11MeshHWMPperrMinInterval; if (_chg_mesh_attr(NL80211_MESHCONF_HWMP_NET_DIAM_TRVS_TIME, mask)) conf->dot11MeshHWMPnetDiameterTraversalTime = nconf->dot11MeshHWMPnetDiameterTraversalTime; if (_chg_mesh_attr(NL80211_MESHCONF_HWMP_ROOTMODE, mask)) { conf->dot11MeshHWMPRootMode = nconf->dot11MeshHWMPRootMode; ieee80211_mesh_root_setup(ifmsh); } if (_chg_mesh_attr(NL80211_MESHCONF_GATE_ANNOUNCEMENTS, mask)) { /* our current gate announcement implementation rides on root * announcements, so require this ifmsh to also be a root node * */ if (nconf->dot11MeshGateAnnouncementProtocol && !(conf->dot11MeshHWMPRootMode > IEEE80211_ROOTMODE_ROOT)) { conf->dot11MeshHWMPRootMode = IEEE80211_PROACTIVE_RANN; ieee80211_mesh_root_setup(ifmsh); } conf->dot11MeshGateAnnouncementProtocol = nconf->dot11MeshGateAnnouncementProtocol; } if (_chg_mesh_attr(NL80211_MESHCONF_HWMP_RANN_INTERVAL, mask)) conf->dot11MeshHWMPRannInterval = nconf->dot11MeshHWMPRannInterval; if (_chg_mesh_attr(NL80211_MESHCONF_FORWARDING, mask)) conf->dot11MeshForwarding = nconf->dot11MeshForwarding; if (_chg_mesh_attr(NL80211_MESHCONF_RSSI_THRESHOLD, mask)) { /* our RSSI threshold implementation is supported only for * devices that report signal in dBm. */ if (!ieee80211_hw_check(&sdata->local->hw, SIGNAL_DBM)) return -ENOTSUPP; conf->rssi_threshold = nconf->rssi_threshold; } if (_chg_mesh_attr(NL80211_MESHCONF_HT_OPMODE, mask)) { conf->ht_opmode = nconf->ht_opmode; sdata->vif.bss_conf.ht_operation_mode = nconf->ht_opmode; ieee80211_link_info_change_notify(sdata, &sdata->deflink, BSS_CHANGED_HT); } if (_chg_mesh_attr(NL80211_MESHCONF_HWMP_PATH_TO_ROOT_TIMEOUT, mask)) conf->dot11MeshHWMPactivePathToRootTimeout = nconf->dot11MeshHWMPactivePathToRootTimeout; if (_chg_mesh_attr(NL80211_MESHCONF_HWMP_ROOT_INTERVAL, mask)) conf->dot11MeshHWMProotInterval = nconf->dot11MeshHWMProotInterval; if (_chg_mesh_attr(NL80211_MESHCONF_HWMP_CONFIRMATION_INTERVAL, mask)) conf->dot11MeshHWMPconfirmationInterval = nconf->dot11MeshHWMPconfirmationInterval; if (_chg_mesh_attr(NL80211_MESHCONF_POWER_MODE, mask)) { conf->power_mode = nconf->power_mode; ieee80211_mps_local_status_update(sdata); } if (_chg_mesh_attr(NL80211_MESHCONF_AWAKE_WINDOW, mask)) conf->dot11MeshAwakeWindowDuration = nconf->dot11MeshAwakeWindowDuration; if (_chg_mesh_attr(NL80211_MESHCONF_PLINK_TIMEOUT, mask)) conf->plink_timeout = nconf->plink_timeout; if (_chg_mesh_attr(NL80211_MESHCONF_CONNECTED_TO_GATE, mask)) conf->dot11MeshConnectedToMeshGate = nconf->dot11MeshConnectedToMeshGate; if (_chg_mesh_attr(NL80211_MESHCONF_NOLEARN, mask)) conf->dot11MeshNolearn = nconf->dot11MeshNolearn; if (_chg_mesh_attr(NL80211_MESHCONF_CONNECTED_TO_AS, mask)) conf->dot11MeshConnectedToAuthServer = nconf->dot11MeshConnectedToAuthServer; ieee80211_mbss_info_change_notify(sdata, BSS_CHANGED_BEACON); return 0; } static int ieee80211_join_mesh(struct wiphy *wiphy, struct net_device *dev, const struct mesh_config *conf, const struct mesh_setup *setup) { struct ieee80211_sub_if_data *sdata = IEEE80211_DEV_TO_SUB_IF(dev); struct ieee80211_if_mesh *ifmsh = &sdata->u.mesh; int err; memcpy(&ifmsh->mshcfg, conf, sizeof(struct mesh_config)); err = copy_mesh_setup(ifmsh, setup); if (err) return err; sdata->control_port_over_nl80211 = setup->control_port_over_nl80211; /* can mesh use other SMPS modes? */ sdata->deflink.smps_mode = IEEE80211_SMPS_OFF; sdata->deflink.needed_rx_chains = sdata->local->rx_chains; mutex_lock(&sdata->local->mtx); err = ieee80211_link_use_channel(&sdata->deflink, &setup->chandef, IEEE80211_CHANCTX_SHARED); mutex_unlock(&sdata->local->mtx); if (err) return err; return ieee80211_start_mesh(sdata); } static int ieee80211_leave_mesh(struct wiphy *wiphy, struct net_device *dev) { struct ieee80211_sub_if_data *sdata = IEEE80211_DEV_TO_SUB_IF(dev); ieee80211_stop_mesh(sdata); mutex_lock(&sdata->local->mtx); ieee80211_link_release_channel(&sdata->deflink); kfree(sdata->u.mesh.ie); mutex_unlock(&sdata->local->mtx); return 0; } #endif static int ieee80211_change_bss(struct wiphy *wiphy, struct net_device *dev, struct bss_parameters *params) { struct ieee80211_sub_if_data *sdata = IEEE80211_DEV_TO_SUB_IF(dev); struct ieee80211_link_data *link; struct ieee80211_supported_band *sband; u64 changed = 0; link = ieee80211_link_or_deflink(sdata, params->link_id, true); if (IS_ERR(link)) return PTR_ERR(link); if (!sdata_dereference(link->u.ap.beacon, sdata)) return -ENOENT; sband = ieee80211_get_link_sband(link); if (!sband) return -EINVAL; if (params->basic_rates) { if (!ieee80211_parse_bitrates(link->conf->chandef.width, wiphy->bands[sband->band], params->basic_rates, params->basic_rates_len, &link->conf->basic_rates)) return -EINVAL; changed |= BSS_CHANGED_BASIC_RATES; ieee80211_check_rate_mask(link); } if (params->use_cts_prot >= 0) { link->conf->use_cts_prot = params->use_cts_prot; changed |= BSS_CHANGED_ERP_CTS_PROT; } if (params->use_short_preamble >= 0) { link->conf->use_short_preamble = params->use_short_preamble; changed |= BSS_CHANGED_ERP_PREAMBLE; } if (!link->conf->use_short_slot && (sband->band == NL80211_BAND_5GHZ || sband->band == NL80211_BAND_6GHZ)) { link->conf->use_short_slot = true; changed |= BSS_CHANGED_ERP_SLOT; } if (params->use_short_slot_time >= 0) { link->conf->use_short_slot = params->use_short_slot_time; changed |= BSS_CHANGED_ERP_SLOT; } if (params->ap_isolate >= 0) { if (params->ap_isolate) sdata->flags |= IEEE80211_SDATA_DONT_BRIDGE_PACKETS; else sdata->flags &= ~IEEE80211_SDATA_DONT_BRIDGE_PACKETS; ieee80211_check_fast_rx_iface(sdata); } if (params->ht_opmode >= 0) { link->conf->ht_operation_mode = (u16)params->ht_opmode; changed |= BSS_CHANGED_HT; } if (params->p2p_ctwindow >= 0) { link->conf->p2p_noa_attr.oppps_ctwindow &= ~IEEE80211_P2P_OPPPS_CTWINDOW_MASK; link->conf->p2p_noa_attr.oppps_ctwindow |= params->p2p_ctwindow & IEEE80211_P2P_OPPPS_CTWINDOW_MASK; changed |= BSS_CHANGED_P2P_PS; } if (params->p2p_opp_ps > 0) { link->conf->p2p_noa_attr.oppps_ctwindow |= IEEE80211_P2P_OPPPS_ENABLE_BIT; changed |= BSS_CHANGED_P2P_PS; } else if (params->p2p_opp_ps == 0) { link->conf->p2p_noa_attr.oppps_ctwindow &= ~IEEE80211_P2P_OPPPS_ENABLE_BIT; changed |= BSS_CHANGED_P2P_PS; } ieee80211_link_info_change_notify(sdata, link, changed); return 0; } static int ieee80211_set_txq_params(struct wiphy *wiphy, struct net_device *dev, struct ieee80211_txq_params *params) { struct ieee80211_local *local = wiphy_priv(wiphy); struct ieee80211_sub_if_data *sdata = IEEE80211_DEV_TO_SUB_IF(dev); struct ieee80211_link_data *link = ieee80211_link_or_deflink(sdata, params->link_id, true); struct ieee80211_tx_queue_params p; if (!local->ops->conf_tx) return -EOPNOTSUPP; if (local->hw.queues < IEEE80211_NUM_ACS) return -EOPNOTSUPP; if (IS_ERR(link)) return PTR_ERR(link); memset(&p, 0, sizeof(p)); p.aifs = params->aifs; p.cw_max = params->cwmax; p.cw_min = params->cwmin; p.txop = params->txop; /* * Setting tx queue params disables u-apsd because it's only * called in master mode. */ p.uapsd = false; ieee80211_regulatory_limit_wmm_params(sdata, &p, params->ac); link->tx_conf[params->ac] = p; if (drv_conf_tx(local, link, params->ac, &p)) { wiphy_debug(local->hw.wiphy, "failed to set TX queue parameters for AC %d\n", params->ac); return -EINVAL; } ieee80211_link_info_change_notify(sdata, link, BSS_CHANGED_QOS); return 0; } #ifdef CONFIG_PM static int ieee80211_suspend(struct wiphy *wiphy, struct cfg80211_wowlan *wowlan) { return __ieee80211_suspend(wiphy_priv(wiphy), wowlan); } static int ieee80211_resume(struct wiphy *wiphy) { return __ieee80211_resume(wiphy_priv(wiphy)); } #else #define ieee80211_suspend NULL #define ieee80211_resume NULL #endif static int ieee80211_scan(struct wiphy *wiphy, struct cfg80211_scan_request *req) { struct ieee80211_sub_if_data *sdata; sdata = IEEE80211_WDEV_TO_SUB_IF(req->wdev); switch (ieee80211_vif_type_p2p(&sdata->vif)) { case NL80211_IFTYPE_STATION: case NL80211_IFTYPE_ADHOC: case NL80211_IFTYPE_MESH_POINT: case NL80211_IFTYPE_P2P_CLIENT: case NL80211_IFTYPE_P2P_DEVICE: break; case NL80211_IFTYPE_P2P_GO: if (sdata->local->ops->hw_scan) break; /* * FIXME: implement NoA while scanning in software, * for now fall through to allow scanning only when * beaconing hasn't been configured yet */ fallthrough; case NL80211_IFTYPE_AP: /* * If the scan has been forced (and the driver supports * forcing), don't care about being beaconing already. * This will create problems to the attached stations (e.g. all * the frames sent while scanning on other channel will be * lost) */ if (sdata->deflink.u.ap.beacon && (!(wiphy->features & NL80211_FEATURE_AP_SCAN) || !(req->flags & NL80211_SCAN_FLAG_AP))) return -EOPNOTSUPP; break; case NL80211_IFTYPE_NAN: default: return -EOPNOTSUPP; } return ieee80211_request_scan(sdata, req); } static void ieee80211_abort_scan(struct wiphy *wiphy, struct wireless_dev *wdev) { ieee80211_scan_cancel(wiphy_priv(wiphy)); } static int ieee80211_sched_scan_start(struct wiphy *wiphy, struct net_device *dev, struct cfg80211_sched_scan_request *req) { struct ieee80211_sub_if_data *sdata = IEEE80211_DEV_TO_SUB_IF(dev); if (!sdata->local->ops->sched_scan_start) return -EOPNOTSUPP; return ieee80211_request_sched_scan_start(sdata, req); } static int ieee80211_sched_scan_stop(struct wiphy *wiphy, struct net_device *dev, u64 reqid) { struct ieee80211_local *local = wiphy_priv(wiphy); if (!local->ops->sched_scan_stop) return -EOPNOTSUPP; return ieee80211_request_sched_scan_stop(local); } static int ieee80211_auth(struct wiphy *wiphy, struct net_device *dev, struct cfg80211_auth_request *req) { return ieee80211_mgd_auth(IEEE80211_DEV_TO_SUB_IF(dev), req); } static int ieee80211_assoc(struct wiphy *wiphy, struct net_device *dev, struct cfg80211_assoc_request *req) { return ieee80211_mgd_assoc(IEEE80211_DEV_TO_SUB_IF(dev), req); } static int ieee80211_deauth(struct wiphy *wiphy, struct net_device *dev, struct cfg80211_deauth_request *req) { return ieee80211_mgd_deauth(IEEE80211_DEV_TO_SUB_IF(dev), req); } static int ieee80211_disassoc(struct wiphy *wiphy, struct net_device *dev, struct cfg80211_disassoc_request *req) { return ieee80211_mgd_disassoc(IEEE80211_DEV_TO_SUB_IF(dev), req); } static int ieee80211_join_ibss(struct wiphy *wiphy, struct net_device *dev, struct cfg80211_ibss_params *params) { return ieee80211_ibss_join(IEEE80211_DEV_TO_SUB_IF(dev), params); } static int ieee80211_leave_ibss(struct wiphy *wiphy, struct net_device *dev) { return ieee80211_ibss_leave(IEEE80211_DEV_TO_SUB_IF(dev)); } static int ieee80211_join_ocb(struct wiphy *wiphy, struct net_device *dev, struct ocb_setup *setup) { return ieee80211_ocb_join(IEEE80211_DEV_TO_SUB_IF(dev), setup); } static int ieee80211_leave_ocb(struct wiphy *wiphy, struct net_device *dev) { return ieee80211_ocb_leave(IEEE80211_DEV_TO_SUB_IF(dev)); } static int ieee80211_set_mcast_rate(struct wiphy *wiphy, struct net_device *dev, int rate[NUM_NL80211_BANDS]) { struct ieee80211_sub_if_data *sdata = IEEE80211_DEV_TO_SUB_IF(dev); memcpy(sdata->vif.bss_conf.mcast_rate, rate, sizeof(int) * NUM_NL80211_BANDS); ieee80211_link_info_change_notify(sdata, &sdata->deflink, BSS_CHANGED_MCAST_RATE); return 0; } static int ieee80211_set_wiphy_params(struct wiphy *wiphy, u32 changed) { struct ieee80211_local *local = wiphy_priv(wiphy); int err; if (changed & WIPHY_PARAM_FRAG_THRESHOLD) { ieee80211_check_fast_xmit_all(local); err = drv_set_frag_threshold(local, wiphy->frag_threshold); if (err) { ieee80211_check_fast_xmit_all(local); return err; } } if ((changed & WIPHY_PARAM_COVERAGE_CLASS) || (changed & WIPHY_PARAM_DYN_ACK)) { s16 coverage_class; coverage_class = changed & WIPHY_PARAM_COVERAGE_CLASS ? wiphy->coverage_class : -1; err = drv_set_coverage_class(local, coverage_class); if (err) return err; } if (changed & WIPHY_PARAM_RTS_THRESHOLD) { err = drv_set_rts_threshold(local, wiphy->rts_threshold); if (err) return err; } if (changed & WIPHY_PARAM_RETRY_SHORT) { if (wiphy->retry_short > IEEE80211_MAX_TX_RETRY) return -EINVAL; local->hw.conf.short_frame_max_tx_count = wiphy->retry_short; } if (changed & WIPHY_PARAM_RETRY_LONG) { if (wiphy->retry_long > IEEE80211_MAX_TX_RETRY) return -EINVAL; local->hw.conf.long_frame_max_tx_count = wiphy->retry_long; } if (changed & (WIPHY_PARAM_RETRY_SHORT | WIPHY_PARAM_RETRY_LONG)) ieee80211_hw_config(local, IEEE80211_CONF_CHANGE_RETRY_LIMITS); if (changed & (WIPHY_PARAM_TXQ_LIMIT | WIPHY_PARAM_TXQ_MEMORY_LIMIT | WIPHY_PARAM_TXQ_QUANTUM)) ieee80211_txq_set_params(local); return 0; } static int ieee80211_set_tx_power(struct wiphy *wiphy, struct wireless_dev *wdev, enum nl80211_tx_power_setting type, int mbm) { struct ieee80211_local *local = wiphy_priv(wiphy); struct ieee80211_sub_if_data *sdata; enum nl80211_tx_power_setting txp_type = type; bool update_txp_type = false; bool has_monitor = false; if (wdev) { sdata = IEEE80211_WDEV_TO_SUB_IF(wdev); if (sdata->vif.type == NL80211_IFTYPE_MONITOR) { sdata = wiphy_dereference(local->hw.wiphy, local->monitor_sdata); if (!sdata) return -EOPNOTSUPP; } switch (type) { case NL80211_TX_POWER_AUTOMATIC: sdata->deflink.user_power_level = IEEE80211_UNSET_POWER_LEVEL; txp_type = NL80211_TX_POWER_LIMITED; break; case NL80211_TX_POWER_LIMITED: case NL80211_TX_POWER_FIXED: if (mbm < 0 || (mbm % 100)) return -EOPNOTSUPP; sdata->deflink.user_power_level = MBM_TO_DBM(mbm); break; } if (txp_type != sdata->vif.bss_conf.txpower_type) { update_txp_type = true; sdata->vif.bss_conf.txpower_type = txp_type; } ieee80211_recalc_txpower(sdata, update_txp_type); return 0; } switch (type) { case NL80211_TX_POWER_AUTOMATIC: local->user_power_level = IEEE80211_UNSET_POWER_LEVEL; txp_type = NL80211_TX_POWER_LIMITED; break; case NL80211_TX_POWER_LIMITED: case NL80211_TX_POWER_FIXED: if (mbm < 0 || (mbm % 100)) return -EOPNOTSUPP; local->user_power_level = MBM_TO_DBM(mbm); break; } mutex_lock(&local->iflist_mtx); list_for_each_entry(sdata, &local->interfaces, list) { if (sdata->vif.type == NL80211_IFTYPE_MONITOR) { has_monitor = true; continue; } sdata->deflink.user_power_level = local->user_power_level; if (txp_type != sdata->vif.bss_conf.txpower_type) update_txp_type = true; sdata->vif.bss_conf.txpower_type = txp_type; } list_for_each_entry(sdata, &local->interfaces, list) { if (sdata->vif.type == NL80211_IFTYPE_MONITOR) continue; ieee80211_recalc_txpower(sdata, update_txp_type); } mutex_unlock(&local->iflist_mtx); if (has_monitor) { sdata = wiphy_dereference(local->hw.wiphy, local->monitor_sdata); if (sdata) { sdata->deflink.user_power_level = local->user_power_level; if (txp_type != sdata->vif.bss_conf.txpower_type) update_txp_type = true; sdata->vif.bss_conf.txpower_type = txp_type; ieee80211_recalc_txpower(sdata, update_txp_type); } } return 0; } static int ieee80211_get_tx_power(struct wiphy *wiphy, struct wireless_dev *wdev, int *dbm) { struct ieee80211_local *local = wiphy_priv(wiphy); struct ieee80211_sub_if_data *sdata = IEEE80211_WDEV_TO_SUB_IF(wdev); if (local->ops->get_txpower) return drv_get_txpower(local, sdata, dbm); if (!local->use_chanctx) *dbm = local->hw.conf.power_level; else *dbm = sdata->vif.bss_conf.txpower; return 0; } static void ieee80211_rfkill_poll(struct wiphy *wiphy) { struct ieee80211_local *local = wiphy_priv(wiphy); drv_rfkill_poll(local); } #ifdef CONFIG_NL80211_TESTMODE static int ieee80211_testmode_cmd(struct wiphy *wiphy, struct wireless_dev *wdev, void *data, int len) { struct ieee80211_local *local = wiphy_priv(wiphy); struct ieee80211_vif *vif = NULL; if (!local->ops->testmode_cmd) return -EOPNOTSUPP; if (wdev) { struct ieee80211_sub_if_data *sdata; sdata = IEEE80211_WDEV_TO_SUB_IF(wdev); if (sdata->flags & IEEE80211_SDATA_IN_DRIVER) vif = &sdata->vif; } return local->ops->testmode_cmd(&local->hw, vif, data, len); } static int ieee80211_testmode_dump(struct wiphy *wiphy, struct sk_buff *skb, struct netlink_callback *cb, void *data, int len) { struct ieee80211_local *local = wiphy_priv(wiphy); if (!local->ops->testmode_dump) return -EOPNOTSUPP; return local->ops->testmode_dump(&local->hw, skb, cb, data, len); } #endif int __ieee80211_request_smps_mgd(struct ieee80211_sub_if_data *sdata, struct ieee80211_link_data *link, enum ieee80211_smps_mode smps_mode) { const u8 *ap; enum ieee80211_smps_mode old_req; int err; struct sta_info *sta; bool tdls_peer_found = false; lockdep_assert_held(&sdata->wdev.mtx); if (WARN_ON_ONCE(sdata->vif.type != NL80211_IFTYPE_STATION)) return -EINVAL; old_req = link->u.mgd.req_smps; link->u.mgd.req_smps = smps_mode; if (old_req == smps_mode && smps_mode != IEEE80211_SMPS_AUTOMATIC) return 0; /* * If not associated, or current association is not an HT * association, there's no need to do anything, just store * the new value until we associate. */ if (!sdata->u.mgd.associated || link->conf->chandef.width == NL80211_CHAN_WIDTH_20_NOHT) return 0; ap = link->u.mgd.bssid; rcu_read_lock(); list_for_each_entry_rcu(sta, &sdata->local->sta_list, list) { if (!sta->sta.tdls || sta->sdata != sdata || !sta->uploaded || !test_sta_flag(sta, WLAN_STA_AUTHORIZED)) continue; tdls_peer_found = true; break; } rcu_read_unlock(); if (smps_mode == IEEE80211_SMPS_AUTOMATIC) { if (tdls_peer_found || !sdata->u.mgd.powersave) smps_mode = IEEE80211_SMPS_OFF; else smps_mode = IEEE80211_SMPS_DYNAMIC; } /* send SM PS frame to AP */ err = ieee80211_send_smps_action(sdata, smps_mode, ap, ap); if (err) link->u.mgd.req_smps = old_req; else if (smps_mode != IEEE80211_SMPS_OFF && tdls_peer_found) ieee80211_teardown_tdls_peers(sdata); return err; } static int ieee80211_set_power_mgmt(struct wiphy *wiphy, struct net_device *dev, bool enabled, int timeout) { struct ieee80211_sub_if_data *sdata = IEEE80211_DEV_TO_SUB_IF(dev); struct ieee80211_local *local = wdev_priv(dev->ieee80211_ptr); unsigned int link_id; if (sdata->vif.type != NL80211_IFTYPE_STATION) return -EOPNOTSUPP; if (!ieee80211_hw_check(&local->hw, SUPPORTS_PS)) return -EOPNOTSUPP; if (enabled == sdata->u.mgd.powersave && timeout == local->dynamic_ps_forced_timeout) return 0; sdata->u.mgd.powersave = enabled; local->dynamic_ps_forced_timeout = timeout; /* no change, but if automatic follow powersave */ sdata_lock(sdata); for (link_id = 0; link_id < ARRAY_SIZE(sdata->link); link_id++) { struct ieee80211_link_data *link; link = sdata_dereference(sdata->link[link_id], sdata); if (!link) continue; __ieee80211_request_smps_mgd(sdata, link, link->u.mgd.req_smps); } sdata_unlock(sdata); if (ieee80211_hw_check(&local->hw, SUPPORTS_DYNAMIC_PS)) ieee80211_hw_config(local, IEEE80211_CONF_CHANGE_PS); ieee80211_recalc_ps(local); ieee80211_recalc_ps_vif(sdata); ieee80211_check_fast_rx_iface(sdata); return 0; } static int ieee80211_set_cqm_rssi_config(struct wiphy *wiphy, struct net_device *dev, s32 rssi_thold, u32 rssi_hyst) { struct ieee80211_sub_if_data *sdata = IEEE80211_DEV_TO_SUB_IF(dev); struct ieee80211_vif *vif = &sdata->vif; struct ieee80211_bss_conf *bss_conf = &vif->bss_conf; if (rssi_thold == bss_conf->cqm_rssi_thold && rssi_hyst == bss_conf->cqm_rssi_hyst) return 0; if (sdata->vif.driver_flags & IEEE80211_VIF_BEACON_FILTER && !(sdata->vif.driver_flags & IEEE80211_VIF_SUPPORTS_CQM_RSSI)) return -EOPNOTSUPP; bss_conf->cqm_rssi_thold = rssi_thold; bss_conf->cqm_rssi_hyst = rssi_hyst; bss_conf->cqm_rssi_low = 0; bss_conf->cqm_rssi_high = 0; sdata->deflink.u.mgd.last_cqm_event_signal = 0; /* tell the driver upon association, unless already associated */ if (sdata->u.mgd.associated && sdata->vif.driver_flags & IEEE80211_VIF_SUPPORTS_CQM_RSSI) ieee80211_link_info_change_notify(sdata, &sdata->deflink, BSS_CHANGED_CQM); return 0; } static int ieee80211_set_cqm_rssi_range_config(struct wiphy *wiphy, struct net_device *dev, s32 rssi_low, s32 rssi_high) { struct ieee80211_sub_if_data *sdata = IEEE80211_DEV_TO_SUB_IF(dev); struct ieee80211_vif *vif = &sdata->vif; struct ieee80211_bss_conf *bss_conf = &vif->bss_conf; if (sdata->vif.driver_flags & IEEE80211_VIF_BEACON_FILTER) return -EOPNOTSUPP; bss_conf->cqm_rssi_low = rssi_low; bss_conf->cqm_rssi_high = rssi_high; bss_conf->cqm_rssi_thold = 0; bss_conf->cqm_rssi_hyst = 0; sdata->deflink.u.mgd.last_cqm_event_signal = 0; /* tell the driver upon association, unless already associated */ if (sdata->u.mgd.associated && sdata->vif.driver_flags & IEEE80211_VIF_SUPPORTS_CQM_RSSI) ieee80211_link_info_change_notify(sdata, &sdata->deflink, BSS_CHANGED_CQM); return 0; } static int ieee80211_set_bitrate_mask(struct wiphy *wiphy, struct net_device *dev, unsigned int link_id, const u8 *addr, const struct cfg80211_bitrate_mask *mask) { struct ieee80211_sub_if_data *sdata = IEEE80211_DEV_TO_SUB_IF(dev); struct ieee80211_local *local = wdev_priv(dev->ieee80211_ptr); int i, ret; if (!ieee80211_sdata_running(sdata)) return -ENETDOWN; /* * If active validate the setting and reject it if it doesn't leave * at least one basic rate usable, since we really have to be able * to send something, and if we're an AP we have to be able to do * so at a basic rate so that all clients can receive it. */ if (rcu_access_pointer(sdata->vif.bss_conf.chanctx_conf) && sdata->vif.bss_conf.chandef.chan) { u32 basic_rates = sdata->vif.bss_conf.basic_rates; enum nl80211_band band = sdata->vif.bss_conf.chandef.chan->band; if (!(mask->control[band].legacy & basic_rates)) return -EINVAL; } if (ieee80211_hw_check(&local->hw, HAS_RATE_CONTROL)) { ret = drv_set_bitrate_mask(local, sdata, mask); if (ret) return ret; } for (i = 0; i < NUM_NL80211_BANDS; i++) { struct ieee80211_supported_band *sband = wiphy->bands[i]; int j; sdata->rc_rateidx_mask[i] = mask->control[i].legacy; memcpy(sdata->rc_rateidx_mcs_mask[i], mask->control[i].ht_mcs, sizeof(mask->control[i].ht_mcs)); memcpy(sdata->rc_rateidx_vht_mcs_mask[i], mask->control[i].vht_mcs, sizeof(mask->control[i].vht_mcs)); sdata->rc_has_mcs_mask[i] = false; sdata->rc_has_vht_mcs_mask[i] = false; if (!sband) continue; for (j = 0; j < IEEE80211_HT_MCS_MASK_LEN; j++) { if (sdata->rc_rateidx_mcs_mask[i][j] != 0xff) { sdata->rc_has_mcs_mask[i] = true; break; } } for (j = 0; j < NL80211_VHT_NSS_MAX; j++) { if (sdata->rc_rateidx_vht_mcs_mask[i][j] != 0xffff) { sdata->rc_has_vht_mcs_mask[i] = true; break; } } } return 0; } static int ieee80211_start_radar_detection(struct wiphy *wiphy, struct net_device *dev, struct cfg80211_chan_def *chandef, u32 cac_time_ms) { struct ieee80211_sub_if_data *sdata = IEEE80211_DEV_TO_SUB_IF(dev); struct ieee80211_local *local = sdata->local; int err; mutex_lock(&local->mtx); if (!list_empty(&local->roc_list) || local->scanning) { err = -EBUSY; goto out_unlock; } /* whatever, but channel contexts should not complain about that one */ sdata->deflink.smps_mode = IEEE80211_SMPS_OFF; sdata->deflink.needed_rx_chains = local->rx_chains; err = ieee80211_link_use_channel(&sdata->deflink, chandef, IEEE80211_CHANCTX_SHARED); if (err) goto out_unlock; ieee80211_queue_delayed_work(&sdata->local->hw, &sdata->deflink.dfs_cac_timer_work, msecs_to_jiffies(cac_time_ms)); out_unlock: mutex_unlock(&local->mtx); return err; } static void ieee80211_end_cac(struct wiphy *wiphy, struct net_device *dev) { struct ieee80211_sub_if_data *sdata = IEEE80211_DEV_TO_SUB_IF(dev); struct ieee80211_local *local = sdata->local; mutex_lock(&local->mtx); list_for_each_entry(sdata, &local->interfaces, list) { /* it might be waiting for the local->mtx, but then * by the time it gets it, sdata->wdev.cac_started * will no longer be true */ cancel_delayed_work(&sdata->deflink.dfs_cac_timer_work); if (sdata->wdev.cac_started) { ieee80211_link_release_channel(&sdata->deflink); sdata->wdev.cac_started = false; } } mutex_unlock(&local->mtx); } static struct cfg80211_beacon_data * cfg80211_beacon_dup(struct cfg80211_beacon_data *beacon) { struct cfg80211_beacon_data *new_beacon; u8 *pos; int len; len = beacon->head_len + beacon->tail_len + beacon->beacon_ies_len + beacon->proberesp_ies_len + beacon->assocresp_ies_len + beacon->probe_resp_len + beacon->lci_len + beacon->civicloc_len; if (beacon->mbssid_ies) len += ieee80211_get_mbssid_beacon_len(beacon->mbssid_ies, beacon->rnr_ies, beacon->mbssid_ies->cnt); new_beacon = kzalloc(sizeof(*new_beacon) + len, GFP_KERNEL); if (!new_beacon) return NULL; if (beacon->mbssid_ies && beacon->mbssid_ies->cnt) { new_beacon->mbssid_ies = kzalloc(struct_size(new_beacon->mbssid_ies, elem, beacon->mbssid_ies->cnt), GFP_KERNEL); if (!new_beacon->mbssid_ies) { kfree(new_beacon); return NULL; } if (beacon->rnr_ies && beacon->rnr_ies->cnt) { new_beacon->rnr_ies = kzalloc(struct_size(new_beacon->rnr_ies, elem, beacon->rnr_ies->cnt), GFP_KERNEL); if (!new_beacon->rnr_ies) { kfree(new_beacon->mbssid_ies); kfree(new_beacon); return NULL; } } } pos = (u8 *)(new_beacon + 1); if (beacon->head_len) { new_beacon->head_len = beacon->head_len; new_beacon->head = pos; memcpy(pos, beacon->head, beacon->head_len); pos += beacon->head_len; } if (beacon->tail_len) { new_beacon->tail_len = beacon->tail_len; new_beacon->tail = pos; memcpy(pos, beacon->tail, beacon->tail_len); pos += beacon->tail_len; } if (beacon->beacon_ies_len) { new_beacon->beacon_ies_len = beacon->beacon_ies_len; new_beacon->beacon_ies = pos; memcpy(pos, beacon->beacon_ies, beacon->beacon_ies_len); pos += beacon->beacon_ies_len; } if (beacon->proberesp_ies_len) { new_beacon->proberesp_ies_len = beacon->proberesp_ies_len; new_beacon->proberesp_ies = pos; memcpy(pos, beacon->proberesp_ies, beacon->proberesp_ies_len); pos += beacon->proberesp_ies_len; } if (beacon->assocresp_ies_len) { new_beacon->assocresp_ies_len = beacon->assocresp_ies_len; new_beacon->assocresp_ies = pos; memcpy(pos, beacon->assocresp_ies, beacon->assocresp_ies_len); pos += beacon->assocresp_ies_len; } if (beacon->probe_resp_len) { new_beacon->probe_resp_len = beacon->probe_resp_len; new_beacon->probe_resp = pos; memcpy(pos, beacon->probe_resp, beacon->probe_resp_len); pos += beacon->probe_resp_len; } if (beacon->mbssid_ies && beacon->mbssid_ies->cnt) { pos += ieee80211_copy_mbssid_beacon(pos, new_beacon->mbssid_ies, beacon->mbssid_ies); if (beacon->rnr_ies && beacon->rnr_ies->cnt) pos += ieee80211_copy_rnr_beacon(pos, new_beacon->rnr_ies, beacon->rnr_ies); } /* might copy -1, meaning no changes requested */ new_beacon->ftm_responder = beacon->ftm_responder; if (beacon->lci) { new_beacon->lci_len = beacon->lci_len; new_beacon->lci = pos; memcpy(pos, beacon->lci, beacon->lci_len); pos += beacon->lci_len; } if (beacon->civicloc) { new_beacon->civicloc_len = beacon->civicloc_len; new_beacon->civicloc = pos; memcpy(pos, beacon->civicloc, beacon->civicloc_len); pos += beacon->civicloc_len; } return new_beacon; } void ieee80211_csa_finish(struct ieee80211_vif *vif) { struct ieee80211_sub_if_data *sdata = vif_to_sdata(vif); struct ieee80211_local *local = sdata->local; rcu_read_lock(); if (vif->mbssid_tx_vif == vif) { /* Trigger ieee80211_csa_finish() on the non-transmitting * interfaces when channel switch is received on * transmitting interface */ struct ieee80211_sub_if_data *iter; list_for_each_entry_rcu(iter, &local->interfaces, list) { if (!ieee80211_sdata_running(iter)) continue; if (iter == sdata || iter->vif.mbssid_tx_vif != vif) continue; ieee80211_queue_work(&iter->local->hw, &iter->deflink.csa_finalize_work); } } ieee80211_queue_work(&local->hw, &sdata->deflink.csa_finalize_work); rcu_read_unlock(); } EXPORT_SYMBOL(ieee80211_csa_finish); void ieee80211_channel_switch_disconnect(struct ieee80211_vif *vif, bool block_tx) { struct ieee80211_sub_if_data *sdata = vif_to_sdata(vif); struct ieee80211_if_managed *ifmgd = &sdata->u.mgd; struct ieee80211_local *local = sdata->local; sdata->deflink.csa_block_tx = block_tx; sdata_info(sdata, "channel switch failed, disconnecting\n"); wiphy_work_queue(local->hw.wiphy, &ifmgd->csa_connection_drop_work); } EXPORT_SYMBOL(ieee80211_channel_switch_disconnect); static int ieee80211_set_after_csa_beacon(struct ieee80211_sub_if_data *sdata, u64 *changed) { int err; switch (sdata->vif.type) { case NL80211_IFTYPE_AP: if (!sdata->deflink.u.ap.next_beacon) return -EINVAL; err = ieee80211_assign_beacon(sdata, &sdata->deflink, sdata->deflink.u.ap.next_beacon, NULL, NULL, changed); ieee80211_free_next_beacon(&sdata->deflink); if (err < 0) return err; break; case NL80211_IFTYPE_ADHOC: err = ieee80211_ibss_finish_csa(sdata, changed); if (err < 0) return err; break; #ifdef CONFIG_MAC80211_MESH case NL80211_IFTYPE_MESH_POINT: err = ieee80211_mesh_finish_csa(sdata, changed); if (err < 0) return err; break; #endif default: WARN_ON(1); return -EINVAL; } return 0; } static int __ieee80211_csa_finalize(struct ieee80211_sub_if_data *sdata) { struct ieee80211_local *local = sdata->local; u64 changed = 0; int err; sdata_assert_lock(sdata); lockdep_assert_held(&local->mtx); lockdep_assert_held(&local->chanctx_mtx); /* * using reservation isn't immediate as it may be deferred until later * with multi-vif. once reservation is complete it will re-schedule the * work with no reserved_chanctx so verify chandef to check if it * completed successfully */ if (sdata->deflink.reserved_chanctx) { /* * with multi-vif csa driver may call ieee80211_csa_finish() * many times while waiting for other interfaces to use their * reservations */ if (sdata->deflink.reserved_ready) return 0; return ieee80211_link_use_reserved_context(&sdata->deflink); } if (!cfg80211_chandef_identical(&sdata->vif.bss_conf.chandef, &sdata->deflink.csa_chandef)) return -EINVAL; sdata->vif.bss_conf.csa_active = false; err = ieee80211_set_after_csa_beacon(sdata, &changed); if (err) return err; if (sdata->vif.bss_conf.eht_puncturing != sdata->vif.bss_conf.csa_punct_bitmap) { sdata->vif.bss_conf.eht_puncturing = sdata->vif.bss_conf.csa_punct_bitmap; changed |= BSS_CHANGED_EHT_PUNCTURING; } ieee80211_link_info_change_notify(sdata, &sdata->deflink, changed); if (sdata->deflink.csa_block_tx) { ieee80211_wake_vif_queues(local, sdata, IEEE80211_QUEUE_STOP_REASON_CSA); sdata->deflink.csa_block_tx = false; } err = drv_post_channel_switch(sdata); if (err) return err; cfg80211_ch_switch_notify(sdata->dev, &sdata->deflink.csa_chandef, 0, sdata->vif.bss_conf.eht_puncturing); return 0; } static void ieee80211_csa_finalize(struct ieee80211_sub_if_data *sdata) { if (__ieee80211_csa_finalize(sdata)) { sdata_info(sdata, "failed to finalize CSA, disconnecting\n"); cfg80211_stop_iface(sdata->local->hw.wiphy, &sdata->wdev, GFP_KERNEL); } } void ieee80211_csa_finalize_work(struct work_struct *work) { struct ieee80211_sub_if_data *sdata = container_of(work, struct ieee80211_sub_if_data, deflink.csa_finalize_work); struct ieee80211_local *local = sdata->local; sdata_lock(sdata); mutex_lock(&local->mtx); mutex_lock(&local->chanctx_mtx); /* AP might have been stopped while waiting for the lock. */ if (!sdata->vif.bss_conf.csa_active) goto unlock; if (!ieee80211_sdata_running(sdata)) goto unlock; ieee80211_csa_finalize(sdata); unlock: mutex_unlock(&local->chanctx_mtx); mutex_unlock(&local->mtx); sdata_unlock(sdata); } static int ieee80211_set_csa_beacon(struct ieee80211_sub_if_data *sdata, struct cfg80211_csa_settings *params, u64 *changed) { struct ieee80211_csa_settings csa = {}; int err; switch (sdata->vif.type) { case NL80211_IFTYPE_AP: sdata->deflink.u.ap.next_beacon = cfg80211_beacon_dup(¶ms->beacon_after); if (!sdata->deflink.u.ap.next_beacon) return -ENOMEM; /* * With a count of 0, we don't have to wait for any * TBTT before switching, so complete the CSA * immediately. In theory, with a count == 1 we * should delay the switch until just before the next * TBTT, but that would complicate things so we switch * immediately too. If we would delay the switch * until the next TBTT, we would have to set the probe * response here. * * TODO: A channel switch with count <= 1 without * sending a CSA action frame is kind of useless, * because the clients won't know we're changing * channels. The action frame must be implemented * either here or in the userspace. */ if (params->count <= 1) break; if ((params->n_counter_offsets_beacon > IEEE80211_MAX_CNTDWN_COUNTERS_NUM) || (params->n_counter_offsets_presp > IEEE80211_MAX_CNTDWN_COUNTERS_NUM)) { ieee80211_free_next_beacon(&sdata->deflink); return -EINVAL; } csa.counter_offsets_beacon = params->counter_offsets_beacon; csa.counter_offsets_presp = params->counter_offsets_presp; csa.n_counter_offsets_beacon = params->n_counter_offsets_beacon; csa.n_counter_offsets_presp = params->n_counter_offsets_presp; csa.count = params->count; err = ieee80211_assign_beacon(sdata, &sdata->deflink, ¶ms->beacon_csa, &csa, NULL, changed); if (err < 0) { ieee80211_free_next_beacon(&sdata->deflink); return err; } break; case NL80211_IFTYPE_ADHOC: if (!sdata->vif.cfg.ibss_joined) return -EINVAL; if (params->chandef.width != sdata->u.ibss.chandef.width) return -EINVAL; switch (params->chandef.width) { case NL80211_CHAN_WIDTH_40: if (cfg80211_get_chandef_type(¶ms->chandef) != cfg80211_get_chandef_type(&sdata->u.ibss.chandef)) return -EINVAL; break; case NL80211_CHAN_WIDTH_5: case NL80211_CHAN_WIDTH_10: case NL80211_CHAN_WIDTH_20_NOHT: case NL80211_CHAN_WIDTH_20: break; default: return -EINVAL; } /* changes into another band are not supported */ if (sdata->u.ibss.chandef.chan->band != params->chandef.chan->band) return -EINVAL; /* see comments in the NL80211_IFTYPE_AP block */ if (params->count > 1) { err = ieee80211_ibss_csa_beacon(sdata, params, changed); if (err < 0) return err; } ieee80211_send_action_csa(sdata, params); break; #ifdef CONFIG_MAC80211_MESH case NL80211_IFTYPE_MESH_POINT: { struct ieee80211_if_mesh *ifmsh = &sdata->u.mesh; /* changes into another band are not supported */ if (sdata->vif.bss_conf.chandef.chan->band != params->chandef.chan->band) return -EINVAL; if (ifmsh->csa_role == IEEE80211_MESH_CSA_ROLE_NONE) { ifmsh->csa_role = IEEE80211_MESH_CSA_ROLE_INIT; if (!ifmsh->pre_value) ifmsh->pre_value = 1; else ifmsh->pre_value++; } /* see comments in the NL80211_IFTYPE_AP block */ if (params->count > 1) { err = ieee80211_mesh_csa_beacon(sdata, params, changed); if (err < 0) { ifmsh->csa_role = IEEE80211_MESH_CSA_ROLE_NONE; return err; } } if (ifmsh->csa_role == IEEE80211_MESH_CSA_ROLE_INIT) ieee80211_send_action_csa(sdata, params); break; } #endif default: return -EOPNOTSUPP; } return 0; } static void ieee80211_color_change_abort(struct ieee80211_sub_if_data *sdata) { sdata->vif.bss_conf.color_change_active = false; ieee80211_free_next_beacon(&sdata->deflink); cfg80211_color_change_aborted_notify(sdata->dev); } static int __ieee80211_channel_switch(struct wiphy *wiphy, struct net_device *dev, struct cfg80211_csa_settings *params) { struct ieee80211_sub_if_data *sdata = IEEE80211_DEV_TO_SUB_IF(dev); struct ieee80211_local *local = sdata->local; struct ieee80211_channel_switch ch_switch; struct ieee80211_chanctx_conf *conf; struct ieee80211_chanctx *chanctx; u64 changed = 0; int err; sdata_assert_lock(sdata); lockdep_assert_held(&local->mtx); if (!list_empty(&local->roc_list) || local->scanning) return -EBUSY; if (sdata->wdev.cac_started) return -EBUSY; if (cfg80211_chandef_identical(¶ms->chandef, &sdata->vif.bss_conf.chandef)) return -EINVAL; /* don't allow another channel switch if one is already active. */ if (sdata->vif.bss_conf.csa_active) return -EBUSY; mutex_lock(&local->chanctx_mtx); conf = rcu_dereference_protected(sdata->vif.bss_conf.chanctx_conf, lockdep_is_held(&local->chanctx_mtx)); if (!conf) { err = -EBUSY; goto out; } if (params->chandef.chan->freq_offset) { /* this may work, but is untested */ err = -EOPNOTSUPP; goto out; } chanctx = container_of(conf, struct ieee80211_chanctx, conf); ch_switch.timestamp = 0; ch_switch.device_timestamp = 0; ch_switch.block_tx = params->block_tx; ch_switch.chandef = params->chandef; ch_switch.count = params->count; err = drv_pre_channel_switch(sdata, &ch_switch); if (err) goto out; err = ieee80211_link_reserve_chanctx(&sdata->deflink, ¶ms->chandef, chanctx->mode, params->radar_required); if (err) goto out; /* if reservation is invalid then this will fail */ err = ieee80211_check_combinations(sdata, NULL, chanctx->mode, 0); if (err) { ieee80211_link_unreserve_chanctx(&sdata->deflink); goto out; } /* if there is a color change in progress, abort it */ if (sdata->vif.bss_conf.color_change_active) ieee80211_color_change_abort(sdata); err = ieee80211_set_csa_beacon(sdata, params, &changed); if (err) { ieee80211_link_unreserve_chanctx(&sdata->deflink); goto out; } if (params->punct_bitmap && !sdata->vif.bss_conf.eht_support) goto out; sdata->deflink.csa_chandef = params->chandef; sdata->deflink.csa_block_tx = params->block_tx; sdata->vif.bss_conf.csa_active = true; sdata->vif.bss_conf.csa_punct_bitmap = params->punct_bitmap; if (sdata->deflink.csa_block_tx) ieee80211_stop_vif_queues(local, sdata, IEEE80211_QUEUE_STOP_REASON_CSA); cfg80211_ch_switch_started_notify(sdata->dev, &sdata->deflink.csa_chandef, 0, params->count, params->block_tx, sdata->vif.bss_conf.csa_punct_bitmap); if (changed) { ieee80211_link_info_change_notify(sdata, &sdata->deflink, changed); drv_channel_switch_beacon(sdata, ¶ms->chandef); } else { /* if the beacon didn't change, we can finalize immediately */ ieee80211_csa_finalize(sdata); } out: mutex_unlock(&local->chanctx_mtx); return err; } int ieee80211_channel_switch(struct wiphy *wiphy, struct net_device *dev, struct cfg80211_csa_settings *params) { struct ieee80211_sub_if_data *sdata = IEEE80211_DEV_TO_SUB_IF(dev); struct ieee80211_local *local = sdata->local; int err; mutex_lock(&local->mtx); err = __ieee80211_channel_switch(wiphy, dev, params); mutex_unlock(&local->mtx); return err; } u64 ieee80211_mgmt_tx_cookie(struct ieee80211_local *local) { lockdep_assert_held(&local->mtx); local->roc_cookie_counter++; /* wow, you wrapped 64 bits ... more likely a bug */ if (WARN_ON(local->roc_cookie_counter == 0)) local->roc_cookie_counter++; return local->roc_cookie_counter; } int ieee80211_attach_ack_skb(struct ieee80211_local *local, struct sk_buff *skb, u64 *cookie, gfp_t gfp) { unsigned long spin_flags; struct sk_buff *ack_skb; int id; ack_skb = skb_copy(skb, gfp); if (!ack_skb) return -ENOMEM; spin_lock_irqsave(&local->ack_status_lock, spin_flags); id = idr_alloc(&local->ack_status_frames, ack_skb, 1, 0x2000, GFP_ATOMIC); spin_unlock_irqrestore(&local->ack_status_lock, spin_flags); if (id < 0) { kfree_skb(ack_skb); return -ENOMEM; } IEEE80211_SKB_CB(skb)->ack_frame_id = id; *cookie = ieee80211_mgmt_tx_cookie(local); IEEE80211_SKB_CB(ack_skb)->ack.cookie = *cookie; return 0; } static void ieee80211_update_mgmt_frame_registrations(struct wiphy *wiphy, struct wireless_dev *wdev, struct mgmt_frame_regs *upd) { struct ieee80211_local *local = wiphy_priv(wiphy); struct ieee80211_sub_if_data *sdata = IEEE80211_WDEV_TO_SUB_IF(wdev); u32 preq_mask = BIT(IEEE80211_STYPE_PROBE_REQ >> 4); u32 action_mask = BIT(IEEE80211_STYPE_ACTION >> 4); bool global_change, intf_change; global_change = (local->probe_req_reg != !!(upd->global_stypes & preq_mask)) || (local->rx_mcast_action_reg != !!(upd->global_mcast_stypes & action_mask)); local->probe_req_reg = upd->global_stypes & preq_mask; local->rx_mcast_action_reg = upd->global_mcast_stypes & action_mask; intf_change = (sdata->vif.probe_req_reg != !!(upd->interface_stypes & preq_mask)) || (sdata->vif.rx_mcast_action_reg != !!(upd->interface_mcast_stypes & action_mask)); sdata->vif.probe_req_reg = upd->interface_stypes & preq_mask; sdata->vif.rx_mcast_action_reg = upd->interface_mcast_stypes & action_mask; if (!local->open_count) return; if (intf_change && ieee80211_sdata_running(sdata)) drv_config_iface_filter(local, sdata, sdata->vif.probe_req_reg ? FIF_PROBE_REQ : 0, FIF_PROBE_REQ); if (global_change) ieee80211_configure_filter(local); } static int ieee80211_set_antenna(struct wiphy *wiphy, u32 tx_ant, u32 rx_ant) { struct ieee80211_local *local = wiphy_priv(wiphy); if (local->started) return -EOPNOTSUPP; return drv_set_antenna(local, tx_ant, rx_ant); } static int ieee80211_get_antenna(struct wiphy *wiphy, u32 *tx_ant, u32 *rx_ant) { struct ieee80211_local *local = wiphy_priv(wiphy); return drv_get_antenna(local, tx_ant, rx_ant); } static int ieee80211_set_rekey_data(struct wiphy *wiphy, struct net_device *dev, struct cfg80211_gtk_rekey_data *data) { struct ieee80211_local *local = wiphy_priv(wiphy); struct ieee80211_sub_if_data *sdata = IEEE80211_DEV_TO_SUB_IF(dev); if (!local->ops->set_rekey_data) return -EOPNOTSUPP; drv_set_rekey_data(local, sdata, data); return 0; } static int ieee80211_probe_client(struct wiphy *wiphy, struct net_device *dev, const u8 *peer, u64 *cookie) { struct ieee80211_sub_if_data *sdata = IEEE80211_DEV_TO_SUB_IF(dev); struct ieee80211_local *local = sdata->local; struct ieee80211_qos_hdr *nullfunc; struct sk_buff *skb; int size = sizeof(*nullfunc); __le16 fc; bool qos; struct ieee80211_tx_info *info; struct sta_info *sta; struct ieee80211_chanctx_conf *chanctx_conf; enum nl80211_band band; int ret; /* the lock is needed to assign the cookie later */ mutex_lock(&local->mtx); rcu_read_lock(); sta = sta_info_get_bss(sdata, peer); if (!sta) { ret = -ENOLINK; goto unlock; } qos = sta->sta.wme; chanctx_conf = rcu_dereference(sdata->vif.bss_conf.chanctx_conf); if (WARN_ON(!chanctx_conf)) { ret = -EINVAL; goto unlock; } band = chanctx_conf->def.chan->band; if (qos) { fc = cpu_to_le16(IEEE80211_FTYPE_DATA | IEEE80211_STYPE_QOS_NULLFUNC | IEEE80211_FCTL_FROMDS); } else { size -= 2; fc = cpu_to_le16(IEEE80211_FTYPE_DATA | IEEE80211_STYPE_NULLFUNC | IEEE80211_FCTL_FROMDS); } skb = dev_alloc_skb(local->hw.extra_tx_headroom + size); if (!skb) { ret = -ENOMEM; goto unlock; } skb->dev = dev; skb_reserve(skb, local->hw.extra_tx_headroom); nullfunc = skb_put(skb, size); nullfunc->frame_control = fc; nullfunc->duration_id = 0; memcpy(nullfunc->addr1, sta->sta.addr, ETH_ALEN); memcpy(nullfunc->addr2, sdata->vif.addr, ETH_ALEN); memcpy(nullfunc->addr3, sdata->vif.addr, ETH_ALEN); nullfunc->seq_ctrl = 0; info = IEEE80211_SKB_CB(skb); info->flags |= IEEE80211_TX_CTL_REQ_TX_STATUS | IEEE80211_TX_INTFL_NL80211_FRAME_TX; info->band = band; skb_set_queue_mapping(skb, IEEE80211_AC_VO); skb->priority = 7; if (qos) nullfunc->qos_ctrl = cpu_to_le16(7); ret = ieee80211_attach_ack_skb(local, skb, cookie, GFP_ATOMIC); if (ret) { kfree_skb(skb); goto unlock; } local_bh_disable(); ieee80211_xmit(sdata, sta, skb); local_bh_enable(); ret = 0; unlock: rcu_read_unlock(); mutex_unlock(&local->mtx); return ret; } static int ieee80211_cfg_get_channel(struct wiphy *wiphy, struct wireless_dev *wdev, unsigned int link_id, struct cfg80211_chan_def *chandef) { struct ieee80211_sub_if_data *sdata = IEEE80211_WDEV_TO_SUB_IF(wdev); struct ieee80211_local *local = wiphy_priv(wiphy); struct ieee80211_chanctx_conf *chanctx_conf; struct ieee80211_link_data *link; int ret = -ENODATA; rcu_read_lock(); link = rcu_dereference(sdata->link[link_id]); if (!link) { ret = -ENOLINK; goto out; } chanctx_conf = rcu_dereference(link->conf->chanctx_conf); if (chanctx_conf) { *chandef = link->conf->chandef; ret = 0; } else if (local->open_count > 0 && local->open_count == local->monitors && sdata->vif.type == NL80211_IFTYPE_MONITOR) { if (local->use_chanctx) *chandef = local->monitor_chandef; else *chandef = local->_oper_chandef; ret = 0; } out: rcu_read_unlock(); return ret; } #ifdef CONFIG_PM static void ieee80211_set_wakeup(struct wiphy *wiphy, bool enabled) { drv_set_wakeup(wiphy_priv(wiphy), enabled); } #endif static int ieee80211_set_qos_map(struct wiphy *wiphy, struct net_device *dev, struct cfg80211_qos_map *qos_map) { struct ieee80211_sub_if_data *sdata = IEEE80211_DEV_TO_SUB_IF(dev); struct mac80211_qos_map *new_qos_map, *old_qos_map; if (qos_map) { new_qos_map = kzalloc(sizeof(*new_qos_map), GFP_KERNEL); if (!new_qos_map) return -ENOMEM; memcpy(&new_qos_map->qos_map, qos_map, sizeof(*qos_map)); } else { /* A NULL qos_map was passed to disable QoS mapping */ new_qos_map = NULL; } old_qos_map = sdata_dereference(sdata->qos_map, sdata); rcu_assign_pointer(sdata->qos_map, new_qos_map); if (old_qos_map) kfree_rcu(old_qos_map, rcu_head); return 0; } static int ieee80211_set_ap_chanwidth(struct wiphy *wiphy, struct net_device *dev, unsigned int link_id, struct cfg80211_chan_def *chandef) { struct ieee80211_sub_if_data *sdata = IEEE80211_DEV_TO_SUB_IF(dev); struct ieee80211_link_data *link; int ret; u64 changed = 0; link = sdata_dereference(sdata->link[link_id], sdata); ret = ieee80211_link_change_bandwidth(link, chandef, &changed); if (ret == 0) ieee80211_link_info_change_notify(sdata, link, changed); return ret; } static int ieee80211_add_tx_ts(struct wiphy *wiphy, struct net_device *dev, u8 tsid, const u8 *peer, u8 up, u16 admitted_time) { struct ieee80211_sub_if_data *sdata = IEEE80211_DEV_TO_SUB_IF(dev); struct ieee80211_if_managed *ifmgd = &sdata->u.mgd; int ac = ieee802_1d_to_ac[up]; if (sdata->vif.type != NL80211_IFTYPE_STATION) return -EOPNOTSUPP; if (!(sdata->wmm_acm & BIT(up))) return -EINVAL; if (ifmgd->tx_tspec[ac].admitted_time) return -EBUSY; if (admitted_time) { ifmgd->tx_tspec[ac].admitted_time = 32 * admitted_time; ifmgd->tx_tspec[ac].tsid = tsid; ifmgd->tx_tspec[ac].up = up; } return 0; } static int ieee80211_del_tx_ts(struct wiphy *wiphy, struct net_device *dev, u8 tsid, const u8 *peer) { struct ieee80211_sub_if_data *sdata = IEEE80211_DEV_TO_SUB_IF(dev); struct ieee80211_if_managed *ifmgd = &sdata->u.mgd; struct ieee80211_local *local = wiphy_priv(wiphy); int ac; for (ac = 0; ac < IEEE80211_NUM_ACS; ac++) { struct ieee80211_sta_tx_tspec *tx_tspec = &ifmgd->tx_tspec[ac]; /* skip unused entries */ if (!tx_tspec->admitted_time) continue; if (tx_tspec->tsid != tsid) continue; /* due to this new packets will be reassigned to non-ACM ACs */ tx_tspec->up = -1; /* Make sure that all packets have been sent to avoid to * restore the QoS params on packets that are still on the * queues. */ synchronize_net(); ieee80211_flush_queues(local, sdata, false); /* restore the normal QoS parameters * (unconditionally to avoid races) */ tx_tspec->action = TX_TSPEC_ACTION_STOP_DOWNGRADE; tx_tspec->downgraded = false; ieee80211_sta_handle_tspec_ac_params(sdata); /* finally clear all the data */ memset(tx_tspec, 0, sizeof(*tx_tspec)); return 0; } return -ENOENT; } void ieee80211_nan_func_terminated(struct ieee80211_vif *vif, u8 inst_id, enum nl80211_nan_func_term_reason reason, gfp_t gfp) { struct ieee80211_sub_if_data *sdata = vif_to_sdata(vif); struct cfg80211_nan_func *func; u64 cookie; if (WARN_ON(vif->type != NL80211_IFTYPE_NAN)) return; spin_lock_bh(&sdata->u.nan.func_lock); func = idr_find(&sdata->u.nan.function_inst_ids, inst_id); if (WARN_ON(!func)) { spin_unlock_bh(&sdata->u.nan.func_lock); return; } cookie = func->cookie; idr_remove(&sdata->u.nan.function_inst_ids, inst_id); spin_unlock_bh(&sdata->u.nan.func_lock); cfg80211_free_nan_func(func); cfg80211_nan_func_terminated(ieee80211_vif_to_wdev(vif), inst_id, reason, cookie, gfp); } EXPORT_SYMBOL(ieee80211_nan_func_terminated); void ieee80211_nan_func_match(struct ieee80211_vif *vif, struct cfg80211_nan_match_params *match, gfp_t gfp) { struct ieee80211_sub_if_data *sdata = vif_to_sdata(vif); struct cfg80211_nan_func *func; if (WARN_ON(vif->type != NL80211_IFTYPE_NAN)) return; spin_lock_bh(&sdata->u.nan.func_lock); func = idr_find(&sdata->u.nan.function_inst_ids, match->inst_id); if (WARN_ON(!func)) { spin_unlock_bh(&sdata->u.nan.func_lock); return; } match->cookie = func->cookie; spin_unlock_bh(&sdata->u.nan.func_lock); cfg80211_nan_match(ieee80211_vif_to_wdev(vif), match, gfp); } EXPORT_SYMBOL(ieee80211_nan_func_match); static int ieee80211_set_multicast_to_unicast(struct wiphy *wiphy, struct net_device *dev, const bool enabled) { struct ieee80211_sub_if_data *sdata = IEEE80211_DEV_TO_SUB_IF(dev); sdata->u.ap.multicast_to_unicast = enabled; return 0; } void ieee80211_fill_txq_stats(struct cfg80211_txq_stats *txqstats, struct txq_info *txqi) { if (!(txqstats->filled & BIT(NL80211_TXQ_STATS_BACKLOG_BYTES))) { txqstats->filled |= BIT(NL80211_TXQ_STATS_BACKLOG_BYTES); txqstats->backlog_bytes = txqi->tin.backlog_bytes; } if (!(txqstats->filled & BIT(NL80211_TXQ_STATS_BACKLOG_PACKETS))) { txqstats->filled |= BIT(NL80211_TXQ_STATS_BACKLOG_PACKETS); txqstats->backlog_packets = txqi->tin.backlog_packets; } if (!(txqstats->filled & BIT(NL80211_TXQ_STATS_FLOWS))) { txqstats->filled |= BIT(NL80211_TXQ_STATS_FLOWS); txqstats->flows = txqi->tin.flows; } if (!(txqstats->filled & BIT(NL80211_TXQ_STATS_DROPS))) { txqstats->filled |= BIT(NL80211_TXQ_STATS_DROPS); txqstats->drops = txqi->cstats.drop_count; } if (!(txqstats->filled & BIT(NL80211_TXQ_STATS_ECN_MARKS))) { txqstats->filled |= BIT(NL80211_TXQ_STATS_ECN_MARKS); txqstats->ecn_marks = txqi->cstats.ecn_mark; } if (!(txqstats->filled & BIT(NL80211_TXQ_STATS_OVERLIMIT))) { txqstats->filled |= BIT(NL80211_TXQ_STATS_OVERLIMIT); txqstats->overlimit = txqi->tin.overlimit; } if (!(txqstats->filled & BIT(NL80211_TXQ_STATS_COLLISIONS))) { txqstats->filled |= BIT(NL80211_TXQ_STATS_COLLISIONS); txqstats->collisions = txqi->tin.collisions; } if (!(txqstats->filled & BIT(NL80211_TXQ_STATS_TX_BYTES))) { txqstats->filled |= BIT(NL80211_TXQ_STATS_TX_BYTES); txqstats->tx_bytes = txqi->tin.tx_bytes; } if (!(txqstats->filled & BIT(NL80211_TXQ_STATS_TX_PACKETS))) { txqstats->filled |= BIT(NL80211_TXQ_STATS_TX_PACKETS); txqstats->tx_packets = txqi->tin.tx_packets; } } static int ieee80211_get_txq_stats(struct wiphy *wiphy, struct wireless_dev *wdev, struct cfg80211_txq_stats *txqstats) { struct ieee80211_local *local = wiphy_priv(wiphy); struct ieee80211_sub_if_data *sdata; int ret = 0; spin_lock_bh(&local->fq.lock); rcu_read_lock(); if (wdev) { sdata = IEEE80211_WDEV_TO_SUB_IF(wdev); if (!sdata->vif.txq) { ret = 1; goto out; } ieee80211_fill_txq_stats(txqstats, to_txq_info(sdata->vif.txq)); } else { /* phy stats */ txqstats->filled |= BIT(NL80211_TXQ_STATS_BACKLOG_PACKETS) | BIT(NL80211_TXQ_STATS_BACKLOG_BYTES) | BIT(NL80211_TXQ_STATS_OVERLIMIT) | BIT(NL80211_TXQ_STATS_OVERMEMORY) | BIT(NL80211_TXQ_STATS_COLLISIONS) | BIT(NL80211_TXQ_STATS_MAX_FLOWS); txqstats->backlog_packets = local->fq.backlog; txqstats->backlog_bytes = local->fq.memory_usage; txqstats->overlimit = local->fq.overlimit; txqstats->overmemory = local->fq.overmemory; txqstats->collisions = local->fq.collisions; txqstats->max_flows = local->fq.flows_cnt; } out: rcu_read_unlock(); spin_unlock_bh(&local->fq.lock); return ret; } static int ieee80211_get_ftm_responder_stats(struct wiphy *wiphy, struct net_device *dev, struct cfg80211_ftm_responder_stats *ftm_stats) { struct ieee80211_local *local = wiphy_priv(wiphy); struct ieee80211_sub_if_data *sdata = IEEE80211_DEV_TO_SUB_IF(dev); return drv_get_ftm_responder_stats(local, sdata, ftm_stats); } static int ieee80211_start_pmsr(struct wiphy *wiphy, struct wireless_dev *dev, struct cfg80211_pmsr_request *request) { struct ieee80211_local *local = wiphy_priv(wiphy); struct ieee80211_sub_if_data *sdata = IEEE80211_WDEV_TO_SUB_IF(dev); return drv_start_pmsr(local, sdata, request); } static void ieee80211_abort_pmsr(struct wiphy *wiphy, struct wireless_dev *dev, struct cfg80211_pmsr_request *request) { struct ieee80211_local *local = wiphy_priv(wiphy); struct ieee80211_sub_if_data *sdata = IEEE80211_WDEV_TO_SUB_IF(dev); return drv_abort_pmsr(local, sdata, request); } static int ieee80211_set_tid_config(struct wiphy *wiphy, struct net_device *dev, struct cfg80211_tid_config *tid_conf) { struct ieee80211_sub_if_data *sdata = IEEE80211_DEV_TO_SUB_IF(dev); struct sta_info *sta; int ret; if (!sdata->local->ops->set_tid_config) return -EOPNOTSUPP; if (!tid_conf->peer) return drv_set_tid_config(sdata->local, sdata, NULL, tid_conf); mutex_lock(&sdata->local->sta_mtx); sta = sta_info_get_bss(sdata, tid_conf->peer); if (!sta) { mutex_unlock(&sdata->local->sta_mtx); return -ENOENT; } ret = drv_set_tid_config(sdata->local, sdata, &sta->sta, tid_conf); mutex_unlock(&sdata->local->sta_mtx); return ret; } static int ieee80211_reset_tid_config(struct wiphy *wiphy, struct net_device *dev, const u8 *peer, u8 tids) { struct ieee80211_sub_if_data *sdata = IEEE80211_DEV_TO_SUB_IF(dev); struct sta_info *sta; int ret; if (!sdata->local->ops->reset_tid_config) return -EOPNOTSUPP; if (!peer) return drv_reset_tid_config(sdata->local, sdata, NULL, tids); mutex_lock(&sdata->local->sta_mtx); sta = sta_info_get_bss(sdata, peer); if (!sta) { mutex_unlock(&sdata->local->sta_mtx); return -ENOENT; } ret = drv_reset_tid_config(sdata->local, sdata, &sta->sta, tids); mutex_unlock(&sdata->local->sta_mtx); return ret; } static int ieee80211_set_sar_specs(struct wiphy *wiphy, struct cfg80211_sar_specs *sar) { struct ieee80211_local *local = wiphy_priv(wiphy); if (!local->ops->set_sar_specs) return -EOPNOTSUPP; return local->ops->set_sar_specs(&local->hw, sar); } static int ieee80211_set_after_color_change_beacon(struct ieee80211_sub_if_data *sdata, u64 *changed) { switch (sdata->vif.type) { case NL80211_IFTYPE_AP: { int ret; if (!sdata->deflink.u.ap.next_beacon) return -EINVAL; ret = ieee80211_assign_beacon(sdata, &sdata->deflink, sdata->deflink.u.ap.next_beacon, NULL, NULL, changed); ieee80211_free_next_beacon(&sdata->deflink); if (ret < 0) return ret; break; } default: WARN_ON_ONCE(1); return -EINVAL; } return 0; } static int ieee80211_set_color_change_beacon(struct ieee80211_sub_if_data *sdata, struct cfg80211_color_change_settings *params, u64 *changed) { struct ieee80211_color_change_settings color_change = {}; int err; switch (sdata->vif.type) { case NL80211_IFTYPE_AP: sdata->deflink.u.ap.next_beacon = cfg80211_beacon_dup(¶ms->beacon_next); if (!sdata->deflink.u.ap.next_beacon) return -ENOMEM; if (params->count <= 1) break; color_change.counter_offset_beacon = params->counter_offset_beacon; color_change.counter_offset_presp = params->counter_offset_presp; color_change.count = params->count; err = ieee80211_assign_beacon(sdata, &sdata->deflink, ¶ms->beacon_color_change, NULL, &color_change, changed); if (err < 0) { ieee80211_free_next_beacon(&sdata->deflink); return err; } break; default: return -EOPNOTSUPP; } return 0; } static void ieee80211_color_change_bss_config_notify(struct ieee80211_sub_if_data *sdata, u8 color, int enable, u64 changed) { sdata->vif.bss_conf.he_bss_color.color = color; sdata->vif.bss_conf.he_bss_color.enabled = enable; changed |= BSS_CHANGED_HE_BSS_COLOR; ieee80211_link_info_change_notify(sdata, &sdata->deflink, changed); if (!sdata->vif.bss_conf.nontransmitted && sdata->vif.mbssid_tx_vif) { struct ieee80211_sub_if_data *child; mutex_lock(&sdata->local->iflist_mtx); list_for_each_entry(child, &sdata->local->interfaces, list) { if (child != sdata && child->vif.mbssid_tx_vif == &sdata->vif) { child->vif.bss_conf.he_bss_color.color = color; child->vif.bss_conf.he_bss_color.enabled = enable; ieee80211_link_info_change_notify(child, &child->deflink, BSS_CHANGED_HE_BSS_COLOR); } } mutex_unlock(&sdata->local->iflist_mtx); } } static int ieee80211_color_change_finalize(struct ieee80211_sub_if_data *sdata) { struct ieee80211_local *local = sdata->local; u64 changed = 0; int err; sdata_assert_lock(sdata); lockdep_assert_held(&local->mtx); sdata->vif.bss_conf.color_change_active = false; err = ieee80211_set_after_color_change_beacon(sdata, &changed); if (err) { cfg80211_color_change_aborted_notify(sdata->dev); return err; } ieee80211_color_change_bss_config_notify(sdata, sdata->vif.bss_conf.color_change_color, 1, changed); cfg80211_color_change_notify(sdata->dev); return 0; } void ieee80211_color_change_finalize_work(struct work_struct *work) { struct ieee80211_sub_if_data *sdata = container_of(work, struct ieee80211_sub_if_data, deflink.color_change_finalize_work); struct ieee80211_local *local = sdata->local; sdata_lock(sdata); mutex_lock(&local->mtx); /* AP might have been stopped while waiting for the lock. */ if (!sdata->vif.bss_conf.color_change_active) goto unlock; if (!ieee80211_sdata_running(sdata)) goto unlock; ieee80211_color_change_finalize(sdata); unlock: mutex_unlock(&local->mtx); sdata_unlock(sdata); } void ieee80211_color_collision_detection_work(struct work_struct *work) { struct delayed_work *delayed_work = to_delayed_work(work); struct ieee80211_link_data *link = container_of(delayed_work, struct ieee80211_link_data, color_collision_detect_work); struct ieee80211_sub_if_data *sdata = link->sdata; sdata_lock(sdata); cfg80211_obss_color_collision_notify(sdata->dev, link->color_bitmap); sdata_unlock(sdata); } void ieee80211_color_change_finish(struct ieee80211_vif *vif) { struct ieee80211_sub_if_data *sdata = vif_to_sdata(vif); ieee80211_queue_work(&sdata->local->hw, &sdata->deflink.color_change_finalize_work); } EXPORT_SYMBOL_GPL(ieee80211_color_change_finish); void ieee80211_obss_color_collision_notify(struct ieee80211_vif *vif, u64 color_bitmap, gfp_t gfp) { struct ieee80211_sub_if_data *sdata = vif_to_sdata(vif); struct ieee80211_link_data *link = &sdata->deflink; if (sdata->vif.bss_conf.color_change_active || sdata->vif.bss_conf.csa_active) return; if (delayed_work_pending(&link->color_collision_detect_work)) return; link->color_bitmap = color_bitmap; /* queue the color collision detection event every 500 ms in order to * avoid sending too much netlink messages to userspace. */ ieee80211_queue_delayed_work(&sdata->local->hw, &link->color_collision_detect_work, msecs_to_jiffies(500)); } EXPORT_SYMBOL_GPL(ieee80211_obss_color_collision_notify); static int ieee80211_color_change(struct wiphy *wiphy, struct net_device *dev, struct cfg80211_color_change_settings *params) { struct ieee80211_sub_if_data *sdata = IEEE80211_DEV_TO_SUB_IF(dev); struct ieee80211_local *local = sdata->local; u64 changed = 0; int err; sdata_assert_lock(sdata); if (sdata->vif.bss_conf.nontransmitted) return -EINVAL; mutex_lock(&local->mtx); /* don't allow another color change if one is already active or if csa * is active */ if (sdata->vif.bss_conf.color_change_active || sdata->vif.bss_conf.csa_active) { err = -EBUSY; goto out; } err = ieee80211_set_color_change_beacon(sdata, params, &changed); if (err) goto out; sdata->vif.bss_conf.color_change_active = true; sdata->vif.bss_conf.color_change_color = params->color; cfg80211_color_change_started_notify(sdata->dev, params->count); if (changed) ieee80211_color_change_bss_config_notify(sdata, 0, 0, changed); else /* if the beacon didn't change, we can finalize immediately */ ieee80211_color_change_finalize(sdata); out: mutex_unlock(&local->mtx); return err; } static int ieee80211_set_radar_background(struct wiphy *wiphy, struct cfg80211_chan_def *chandef) { struct ieee80211_local *local = wiphy_priv(wiphy); if (!local->ops->set_radar_background) return -EOPNOTSUPP; return local->ops->set_radar_background(&local->hw, chandef); } static int ieee80211_add_intf_link(struct wiphy *wiphy, struct wireless_dev *wdev, unsigned int link_id) { struct ieee80211_sub_if_data *sdata = IEEE80211_WDEV_TO_SUB_IF(wdev); int res; if (wdev->use_4addr) return -EOPNOTSUPP; mutex_lock(&sdata->local->mtx); res = ieee80211_vif_set_links(sdata, wdev->valid_links, 0); mutex_unlock(&sdata->local->mtx); return res; } static void ieee80211_del_intf_link(struct wiphy *wiphy, struct wireless_dev *wdev, unsigned int link_id) { struct ieee80211_sub_if_data *sdata = IEEE80211_WDEV_TO_SUB_IF(wdev); mutex_lock(&sdata->local->mtx); ieee80211_vif_set_links(sdata, wdev->valid_links, 0); mutex_unlock(&sdata->local->mtx); } static int sta_add_link_station(struct ieee80211_local *local, struct ieee80211_sub_if_data *sdata, struct link_station_parameters *params) { struct sta_info *sta; int ret; sta = sta_info_get_bss(sdata, params->mld_mac); if (!sta) return -ENOENT; if (!sta->sta.valid_links) return -EINVAL; if (sta->sta.valid_links & BIT(params->link_id)) return -EALREADY; ret = ieee80211_sta_allocate_link(sta, params->link_id); if (ret) return ret; ret = sta_link_apply_parameters(local, sta, true, params); if (ret) { ieee80211_sta_free_link(sta, params->link_id); return ret; } /* ieee80211_sta_activate_link frees the link upon failure */ return ieee80211_sta_activate_link(sta, params->link_id); } static int ieee80211_add_link_station(struct wiphy *wiphy, struct net_device *dev, struct link_station_parameters *params) { struct ieee80211_sub_if_data *sdata = IEEE80211_DEV_TO_SUB_IF(dev); struct ieee80211_local *local = wiphy_priv(wiphy); int ret; mutex_lock(&sdata->local->sta_mtx); ret = sta_add_link_station(local, sdata, params); mutex_unlock(&sdata->local->sta_mtx); return ret; } static int sta_mod_link_station(struct ieee80211_local *local, struct ieee80211_sub_if_data *sdata, struct link_station_parameters *params) { struct sta_info *sta; sta = sta_info_get_bss(sdata, params->mld_mac); if (!sta) return -ENOENT; if (!(sta->sta.valid_links & BIT(params->link_id))) return -EINVAL; return sta_link_apply_parameters(local, sta, false, params); } static int ieee80211_mod_link_station(struct wiphy *wiphy, struct net_device *dev, struct link_station_parameters *params) { struct ieee80211_sub_if_data *sdata = IEEE80211_DEV_TO_SUB_IF(dev); struct ieee80211_local *local = wiphy_priv(wiphy); int ret; mutex_lock(&sdata->local->sta_mtx); ret = sta_mod_link_station(local, sdata, params); mutex_unlock(&sdata->local->sta_mtx); return ret; } static int sta_del_link_station(struct ieee80211_sub_if_data *sdata, struct link_station_del_parameters *params) { struct sta_info *sta; sta = sta_info_get_bss(sdata, params->mld_mac); if (!sta) return -ENOENT; if (!(sta->sta.valid_links & BIT(params->link_id))) return -EINVAL; /* must not create a STA without links */ if (sta->sta.valid_links == BIT(params->link_id)) return -EINVAL; ieee80211_sta_remove_link(sta, params->link_id); return 0; } static int ieee80211_del_link_station(struct wiphy *wiphy, struct net_device *dev, struct link_station_del_parameters *params) { struct ieee80211_sub_if_data *sdata = IEEE80211_DEV_TO_SUB_IF(dev); int ret; mutex_lock(&sdata->local->sta_mtx); ret = sta_del_link_station(sdata, params); mutex_unlock(&sdata->local->sta_mtx); return ret; } static int ieee80211_set_hw_timestamp(struct wiphy *wiphy, struct net_device *dev, struct cfg80211_set_hw_timestamp *hwts) { struct ieee80211_sub_if_data *sdata = IEEE80211_DEV_TO_SUB_IF(dev); struct ieee80211_local *local = sdata->local; if (!local->ops->set_hw_timestamp) return -EOPNOTSUPP; if (!check_sdata_in_driver(sdata)) return -EIO; return local->ops->set_hw_timestamp(&local->hw, &sdata->vif, hwts); } const struct cfg80211_ops mac80211_config_ops = { .add_virtual_intf = ieee80211_add_iface, .del_virtual_intf = ieee80211_del_iface, .change_virtual_intf = ieee80211_change_iface, .start_p2p_device = ieee80211_start_p2p_device, .stop_p2p_device = ieee80211_stop_p2p_device, .add_key = ieee80211_add_key, .del_key = ieee80211_del_key, .get_key = ieee80211_get_key, .set_default_key = ieee80211_config_default_key, .set_default_mgmt_key = ieee80211_config_default_mgmt_key, .set_default_beacon_key = ieee80211_config_default_beacon_key, .start_ap = ieee80211_start_ap, .change_beacon = ieee80211_change_beacon, .stop_ap = ieee80211_stop_ap, .add_station = ieee80211_add_station, .del_station = ieee80211_del_station, .change_station = ieee80211_change_station, .get_station = ieee80211_get_station, .dump_station = ieee80211_dump_station, .dump_survey = ieee80211_dump_survey, #ifdef CONFIG_MAC80211_MESH .add_mpath = ieee80211_add_mpath, .del_mpath = ieee80211_del_mpath, .change_mpath = ieee80211_change_mpath, .get_mpath = ieee80211_get_mpath, .dump_mpath = ieee80211_dump_mpath, .get_mpp = ieee80211_get_mpp, .dump_mpp = ieee80211_dump_mpp, .update_mesh_config = ieee80211_update_mesh_config, .get_mesh_config = ieee80211_get_mesh_config, .join_mesh = ieee80211_join_mesh, .leave_mesh = ieee80211_leave_mesh, #endif .join_ocb = ieee80211_join_ocb, .leave_ocb = ieee80211_leave_ocb, .change_bss = ieee80211_change_bss, .inform_bss = ieee80211_inform_bss, .set_txq_params = ieee80211_set_txq_params, .set_monitor_channel = ieee80211_set_monitor_channel, .suspend = ieee80211_suspend, .resume = ieee80211_resume, .scan = ieee80211_scan, .abort_scan = ieee80211_abort_scan, .sched_scan_start = ieee80211_sched_scan_start, .sched_scan_stop = ieee80211_sched_scan_stop, .auth = ieee80211_auth, .assoc = ieee80211_assoc, .deauth = ieee80211_deauth, .disassoc = ieee80211_disassoc, .join_ibss = ieee80211_join_ibss, .leave_ibss = ieee80211_leave_ibss, .set_mcast_rate = ieee80211_set_mcast_rate, .set_wiphy_params = ieee80211_set_wiphy_params, .set_tx_power = ieee80211_set_tx_power, .get_tx_power = ieee80211_get_tx_power, .rfkill_poll = ieee80211_rfkill_poll, CFG80211_TESTMODE_CMD(ieee80211_testmode_cmd) CFG80211_TESTMODE_DUMP(ieee80211_testmode_dump) .set_power_mgmt = ieee80211_set_power_mgmt, .set_bitrate_mask = ieee80211_set_bitrate_mask, .remain_on_channel = ieee80211_remain_on_channel, .cancel_remain_on_channel = ieee80211_cancel_remain_on_channel, .mgmt_tx = ieee80211_mgmt_tx, .mgmt_tx_cancel_wait = ieee80211_mgmt_tx_cancel_wait, .set_cqm_rssi_config = ieee80211_set_cqm_rssi_config, .set_cqm_rssi_range_config = ieee80211_set_cqm_rssi_range_config, .update_mgmt_frame_registrations = ieee80211_update_mgmt_frame_registrations, .set_antenna = ieee80211_set_antenna, .get_antenna = ieee80211_get_antenna, .set_rekey_data = ieee80211_set_rekey_data, .tdls_oper = ieee80211_tdls_oper, .tdls_mgmt = ieee80211_tdls_mgmt, .tdls_channel_switch = ieee80211_tdls_channel_switch, .tdls_cancel_channel_switch = ieee80211_tdls_cancel_channel_switch, .probe_client = ieee80211_probe_client, .set_noack_map = ieee80211_set_noack_map, #ifdef CONFIG_PM .set_wakeup = ieee80211_set_wakeup, #endif .get_channel = ieee80211_cfg_get_channel, .start_radar_detection = ieee80211_start_radar_detection, .end_cac = ieee80211_end_cac, .channel_switch = ieee80211_channel_switch, .set_qos_map = ieee80211_set_qos_map, .set_ap_chanwidth = ieee80211_set_ap_chanwidth, .add_tx_ts = ieee80211_add_tx_ts, .del_tx_ts = ieee80211_del_tx_ts, .start_nan = ieee80211_start_nan, .stop_nan = ieee80211_stop_nan, .nan_change_conf = ieee80211_nan_change_conf, .add_nan_func = ieee80211_add_nan_func, .del_nan_func = ieee80211_del_nan_func, .set_multicast_to_unicast = ieee80211_set_multicast_to_unicast, .tx_control_port = ieee80211_tx_control_port, .get_txq_stats = ieee80211_get_txq_stats, .get_ftm_responder_stats = ieee80211_get_ftm_responder_stats, .start_pmsr = ieee80211_start_pmsr, .abort_pmsr = ieee80211_abort_pmsr, .probe_mesh_link = ieee80211_probe_mesh_link, .set_tid_config = ieee80211_set_tid_config, .reset_tid_config = ieee80211_reset_tid_config, .set_sar_specs = ieee80211_set_sar_specs, .color_change = ieee80211_color_change, .set_radar_background = ieee80211_set_radar_background, .add_intf_link = ieee80211_add_intf_link, .del_intf_link = ieee80211_del_intf_link, .add_link_station = ieee80211_add_link_station, .mod_link_station = ieee80211_mod_link_station, .del_link_station = ieee80211_del_link_station, .set_hw_timestamp = ieee80211_set_hw_timestamp, }; |
1 1 | 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60 61 62 63 64 65 66 67 68 69 70 71 72 73 74 75 76 77 78 79 80 81 82 83 84 85 86 87 88 89 90 91 92 93 94 95 96 97 98 99 100 101 102 103 104 105 106 107 108 109 110 111 112 113 114 115 116 117 118 119 120 121 122 123 124 125 126 127 128 129 130 131 132 133 134 135 136 137 138 139 140 141 142 143 144 145 146 147 148 | // SPDX-License-Identifier: GPL-2.0 #include <linux/file.h> #include <linux/fs.h> #include <linux/fsnotify_backend.h> #include <linux/idr.h> #include <linux/init.h> #include <linux/inotify.h> #include <linux/fanotify.h> #include <linux/kernel.h> #include <linux/namei.h> #include <linux/sched.h> #include <linux/types.h> #include <linux/seq_file.h> #include <linux/exportfs.h> #include "inotify/inotify.h" #include "fanotify/fanotify.h" #include "fdinfo.h" #include "fsnotify.h" #if defined(CONFIG_PROC_FS) #if defined(CONFIG_INOTIFY_USER) || defined(CONFIG_FANOTIFY) static void show_fdinfo(struct seq_file *m, struct file *f, void (*show)(struct seq_file *m, struct fsnotify_mark *mark)) { struct fsnotify_group *group = f->private_data; struct fsnotify_mark *mark; fsnotify_group_lock(group); list_for_each_entry(mark, &group->marks_list, g_list) { show(m, mark); if (seq_has_overflowed(m)) break; } fsnotify_group_unlock(group); } #if defined(CONFIG_EXPORTFS) static void show_mark_fhandle(struct seq_file *m, struct inode *inode) { struct { struct file_handle handle; u8 pad[MAX_HANDLE_SZ]; } f; int size, ret, i; f.handle.handle_bytes = sizeof(f.pad); size = f.handle.handle_bytes >> 2; ret = exportfs_encode_fid(inode, (struct fid *)f.handle.f_handle, &size); if ((ret == FILEID_INVALID) || (ret < 0)) { WARN_ONCE(1, "Can't encode file handler for inotify: %d\n", ret); return; } f.handle.handle_type = ret; f.handle.handle_bytes = size * sizeof(u32); seq_printf(m, "fhandle-bytes:%x fhandle-type:%x f_handle:", f.handle.handle_bytes, f.handle.handle_type); for (i = 0; i < f.handle.handle_bytes; i++) seq_printf(m, "%02x", (int)f.handle.f_handle[i]); } #else static void show_mark_fhandle(struct seq_file *m, struct inode *inode) { } #endif #ifdef CONFIG_INOTIFY_USER static void inotify_fdinfo(struct seq_file *m, struct fsnotify_mark *mark) { struct inotify_inode_mark *inode_mark; struct inode *inode; if (mark->connector->type != FSNOTIFY_OBJ_TYPE_INODE) return; inode_mark = container_of(mark, struct inotify_inode_mark, fsn_mark); inode = igrab(fsnotify_conn_inode(mark->connector)); if (inode) { seq_printf(m, "inotify wd:%x ino:%lx sdev:%x mask:%x ignored_mask:0 ", inode_mark->wd, inode->i_ino, inode->i_sb->s_dev, inotify_mark_user_mask(mark)); show_mark_fhandle(m, inode); seq_putc(m, '\n'); iput(inode); } } void inotify_show_fdinfo(struct seq_file *m, struct file *f) { show_fdinfo(m, f, inotify_fdinfo); } #endif /* CONFIG_INOTIFY_USER */ #ifdef CONFIG_FANOTIFY static void fanotify_fdinfo(struct seq_file *m, struct fsnotify_mark *mark) { unsigned int mflags = fanotify_mark_user_flags(mark); struct inode *inode; if (mark->connector->type == FSNOTIFY_OBJ_TYPE_INODE) { inode = igrab(fsnotify_conn_inode(mark->connector)); if (!inode) return; seq_printf(m, "fanotify ino:%lx sdev:%x mflags:%x mask:%x ignored_mask:%x ", inode->i_ino, inode->i_sb->s_dev, mflags, mark->mask, mark->ignore_mask); show_mark_fhandle(m, inode); seq_putc(m, '\n'); iput(inode); } else if (mark->connector->type == FSNOTIFY_OBJ_TYPE_VFSMOUNT) { struct mount *mnt = fsnotify_conn_mount(mark->connector); seq_printf(m, "fanotify mnt_id:%x mflags:%x mask:%x ignored_mask:%x\n", mnt->mnt_id, mflags, mark->mask, mark->ignore_mask); } else if (mark->connector->type == FSNOTIFY_OBJ_TYPE_SB) { struct super_block *sb = fsnotify_conn_sb(mark->connector); seq_printf(m, "fanotify sdev:%x mflags:%x mask:%x ignored_mask:%x\n", sb->s_dev, mflags, mark->mask, mark->ignore_mask); } } void fanotify_show_fdinfo(struct seq_file *m, struct file *f) { struct fsnotify_group *group = f->private_data; seq_printf(m, "fanotify flags:%x event-flags:%x\n", group->fanotify_data.flags & FANOTIFY_INIT_FLAGS, group->fanotify_data.f_flags); show_fdinfo(m, f, fanotify_fdinfo); } #endif /* CONFIG_FANOTIFY */ #endif /* CONFIG_INOTIFY_USER || CONFIG_FANOTIFY */ #endif /* CONFIG_PROC_FS */ |
38 19 38 30 38 37 48 12 5 37 42 8 11 16 8 79 9 11 8 11 18 10 8 3 3 3 1 2 1 8 6 65 66 49 79 79 34 28 28 76 76 79 76 79 76 33 32 32 32 32 1 43 3 3 3 3 3 2 2 2 2 2 2 2 5 2 3 2 2 3 1 1 1 1 1 1 1 1 1 1 1 1 1 3 1 2 4 6 1 6 1 4 2 1 2 4 4 2 4 1 2 16 7 7 16 10 8 7 2 12 19 19 17 17 17 17 14 14 14 5 5 16 7 7 4 4 1 1 7 134 134 134 133 132 136 1 4 136 133 3 3 2 1 2 2 112 119 110 110 2 2 1 1 110 3 109 5 1 4 4 5 9 8 1 1 1 8 1 2 2 5 4 3 1 18 2 18 3 18 9 18 4 3 18 5 3 15 1 14 1 14 12 20 2 2 5 18 17 18 18 9 9 18 18 4 18 9 4 1 18 18 8 18 17 9 8 6 5 2 2 2 1 2 2 12 12 14 10 10 7 10 2 1 15 14 2 1 12 11 44 35 5 4 7 | 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60 61 62 63 64 65 66 67 68 69 70 71 72 73 74 75 76 77 78 79 80 81 82 83 84 85 86 87 88 89 90 91 92 93 94 95 96 97 98 99 100 101 102 103 104 105 106 107 108 109 110 111 112 113 114 115 116 117 118 119 120 121 122 123 124 125 126 127 128 129 130 131 132 133 134 135 136 137 138 139 140 141 142 143 144 145 146 147 148 149 150 151 152 153 154 155 156 157 158 159 160 161 162 163 164 165 166 167 168 169 170 171 172 173 174 175 176 177 178 179 180 181 182 183 184 185 186 187 188 189 190 191 192 193 194 195 196 197 198 199 200 201 202 203 204 205 206 207 208 209 210 211 212 213 214 215 216 217 218 219 220 221 222 223 224 225 226 227 228 229 230 231 232 233 234 235 236 237 238 239 240 241 242 243 244 245 246 247 248 249 250 251 252 253 254 255 256 257 258 259 260 261 262 263 264 265 266 267 268 269 270 271 272 273 274 275 276 277 278 279 280 281 282 283 284 285 286 287 288 289 290 291 292 293 294 295 296 297 298 299 300 301 302 303 304 305 306 307 308 309 310 311 312 313 314 315 316 317 318 319 320 321 322 323 324 325 326 327 328 329 330 331 332 333 334 335 336 337 338 339 340 341 342 343 344 345 346 347 348 349 350 351 352 353 354 355 356 357 358 359 360 361 362 363 364 365 366 367 368 369 370 371 372 373 374 375 376 377 378 379 380 381 382 383 384 385 386 387 388 389 390 391 392 393 394 395 396 397 398 399 400 401 402 403 404 405 406 407 408 409 410 411 412 413 414 415 416 417 418 419 420 421 422 423 424 425 426 427 428 429 430 431 432 433 434 435 436 437 438 439 440 441 442 443 444 445 446 447 448 449 450 451 452 453 454 455 456 457 458 459 460 461 462 463 464 465 466 467 468 469 470 471 472 473 474 475 476 477 478 479 480 481 482 483 484 485 486 487 488 489 490 491 492 493 494 495 496 497 498 499 500 501 502 503 504 505 506 507 508 509 510 511 512 513 514 515 516 517 518 519 520 521 522 523 524 525 526 527 528 529 530 531 532 533 534 535 536 537 538 539 540 541 542 543 544 545 546 547 548 549 550 551 552 553 554 555 556 557 558 559 560 561 562 563 564 565 566 567 568 569 570 571 572 573 574 575 576 577 578 579 580 581 582 583 584 585 586 587 588 589 590 591 592 593 594 595 596 597 598 599 600 601 602 603 604 605 606 607 608 609 610 611 612 613 614 615 616 617 618 619 620 621 622 623 624 625 626 627 628 629 630 631 632 633 634 635 636 637 638 639 640 641 642 643 644 645 646 647 648 649 650 651 652 653 654 655 656 657 658 659 660 661 662 663 664 665 666 667 668 669 670 671 672 673 674 675 676 677 678 679 680 681 682 683 684 685 686 687 688 689 690 691 692 693 694 695 696 697 698 699 700 701 702 703 704 705 706 707 708 709 710 711 712 713 714 715 716 717 718 719 720 721 722 723 724 725 726 727 728 729 730 731 732 733 734 735 736 737 738 739 740 741 742 743 744 745 746 747 748 749 750 751 752 753 754 755 756 757 758 759 760 761 762 763 764 765 766 767 768 769 770 771 772 773 774 775 776 777 778 779 780 781 782 783 784 785 786 787 788 789 790 791 792 793 794 795 796 797 798 799 800 801 802 803 804 805 806 807 808 809 810 811 812 813 814 815 816 817 818 819 820 821 822 823 824 825 826 827 828 829 830 831 832 833 834 835 836 837 838 839 840 841 842 843 844 845 846 847 848 849 850 851 852 853 854 855 856 857 858 859 860 861 862 863 864 865 866 867 868 869 870 871 872 873 874 875 876 877 878 879 880 881 882 883 884 885 886 887 888 889 890 891 892 893 894 895 896 897 898 899 900 901 902 903 904 905 906 907 908 909 910 911 912 913 914 915 916 917 918 919 920 921 922 923 924 925 926 927 928 929 930 931 932 933 934 935 936 937 938 939 940 941 942 943 944 945 946 947 948 949 950 951 952 953 954 955 956 957 958 959 960 961 962 963 964 965 966 967 968 969 970 971 972 973 974 975 976 977 978 979 980 981 982 983 984 985 986 987 988 989 990 991 992 993 994 995 996 997 998 999 1000 1001 1002 1003 1004 1005 1006 1007 1008 1009 1010 1011 1012 1013 1014 1015 1016 1017 1018 1019 1020 1021 1022 1023 1024 1025 1026 1027 1028 1029 1030 1031 1032 1033 1034 1035 1036 1037 1038 1039 1040 1041 1042 1043 1044 1045 1046 1047 1048 1049 1050 1051 1052 1053 1054 1055 1056 1057 1058 1059 1060 1061 1062 1063 1064 1065 1066 1067 1068 1069 1070 1071 1072 1073 1074 1075 1076 1077 1078 1079 1080 1081 1082 1083 1084 1085 1086 1087 1088 1089 1090 1091 1092 1093 1094 1095 1096 1097 1098 1099 1100 1101 1102 1103 1104 1105 1106 1107 1108 1109 1110 1111 1112 1113 1114 1115 1116 1117 1118 1119 1120 1121 1122 1123 1124 1125 1126 1127 1128 1129 1130 1131 1132 1133 1134 1135 1136 1137 1138 1139 1140 1141 1142 1143 1144 1145 1146 1147 1148 1149 1150 1151 1152 1153 1154 1155 1156 1157 1158 1159 1160 1161 1162 1163 1164 1165 1166 1167 1168 1169 1170 1171 1172 1173 1174 1175 1176 1177 1178 1179 1180 1181 1182 1183 1184 1185 1186 1187 1188 1189 1190 1191 1192 1193 1194 1195 1196 1197 1198 1199 1200 1201 1202 1203 1204 1205 1206 1207 1208 1209 1210 1211 1212 1213 1214 1215 1216 1217 1218 1219 1220 1221 1222 1223 1224 1225 1226 1227 1228 1229 1230 1231 1232 1233 1234 1235 1236 1237 1238 1239 1240 1241 1242 1243 1244 1245 1246 1247 1248 1249 1250 1251 1252 1253 1254 1255 1256 1257 1258 1259 1260 1261 1262 1263 1264 1265 1266 1267 1268 1269 1270 1271 1272 1273 1274 1275 1276 1277 1278 1279 1280 1281 1282 1283 1284 1285 1286 1287 1288 1289 1290 1291 1292 1293 1294 1295 1296 1297 1298 1299 1300 1301 1302 1303 1304 1305 1306 1307 1308 1309 1310 1311 1312 1313 1314 1315 1316 1317 1318 1319 1320 1321 1322 1323 1324 1325 1326 1327 1328 1329 1330 1331 1332 1333 1334 1335 1336 1337 1338 1339 1340 1341 1342 1343 1344 1345 1346 1347 1348 1349 1350 1351 1352 1353 1354 1355 1356 1357 1358 1359 1360 1361 1362 1363 1364 1365 1366 1367 1368 1369 1370 1371 1372 1373 1374 1375 1376 1377 1378 1379 1380 1381 1382 1383 1384 1385 1386 1387 1388 1389 1390 1391 1392 1393 1394 1395 1396 1397 1398 1399 1400 1401 1402 1403 1404 1405 1406 1407 1408 1409 1410 1411 1412 1413 1414 1415 1416 1417 1418 1419 1420 1421 1422 1423 1424 1425 1426 1427 1428 1429 1430 1431 1432 1433 1434 1435 1436 1437 1438 1439 1440 1441 1442 1443 1444 1445 1446 1447 1448 1449 1450 1451 1452 1453 1454 1455 1456 1457 1458 1459 1460 1461 1462 1463 1464 1465 1466 1467 1468 1469 1470 1471 1472 1473 1474 1475 1476 1477 1478 1479 1480 1481 1482 1483 1484 1485 1486 1487 1488 1489 1490 1491 1492 1493 1494 1495 1496 1497 1498 1499 1500 1501 1502 1503 1504 1505 1506 1507 1508 1509 1510 1511 1512 1513 1514 1515 1516 1517 1518 1519 1520 1521 1522 1523 1524 1525 1526 1527 1528 1529 1530 1531 1532 1533 1534 1535 1536 1537 1538 1539 1540 1541 1542 1543 1544 1545 1546 1547 1548 1549 1550 1551 1552 1553 1554 1555 1556 1557 1558 1559 1560 1561 1562 1563 1564 1565 1566 1567 1568 1569 1570 1571 1572 1573 1574 1575 1576 1577 1578 1579 1580 1581 1582 1583 1584 1585 1586 1587 1588 1589 1590 1591 1592 1593 1594 1595 1596 1597 1598 1599 1600 1601 1602 1603 1604 1605 1606 1607 1608 1609 1610 1611 1612 1613 1614 1615 1616 1617 1618 1619 1620 1621 1622 1623 1624 1625 1626 1627 1628 1629 1630 1631 1632 1633 1634 1635 1636 1637 1638 1639 1640 1641 1642 1643 1644 1645 1646 1647 1648 1649 1650 1651 1652 1653 1654 1655 1656 1657 1658 1659 1660 1661 1662 1663 1664 1665 1666 1667 1668 1669 1670 1671 1672 1673 1674 1675 1676 1677 1678 1679 1680 1681 1682 1683 1684 1685 1686 1687 1688 1689 1690 1691 1692 1693 1694 1695 1696 1697 1698 1699 1700 1701 1702 1703 1704 1705 1706 1707 1708 1709 1710 1711 1712 1713 1714 1715 1716 1717 1718 1719 1720 1721 1722 1723 1724 1725 1726 1727 1728 1729 1730 1731 1732 1733 1734 1735 1736 1737 1738 1739 1740 1741 1742 1743 1744 1745 1746 1747 1748 1749 1750 1751 1752 1753 1754 1755 1756 1757 1758 1759 1760 1761 1762 1763 1764 1765 1766 1767 1768 1769 1770 1771 1772 1773 1774 1775 1776 1777 1778 1779 1780 1781 1782 1783 1784 1785 1786 1787 1788 1789 1790 1791 1792 1793 1794 1795 1796 1797 1798 1799 1800 1801 1802 1803 1804 1805 1806 1807 1808 1809 1810 1811 1812 1813 1814 1815 1816 1817 1818 1819 1820 1821 1822 1823 1824 1825 1826 1827 1828 1829 1830 1831 1832 1833 1834 1835 1836 1837 1838 1839 1840 1841 1842 1843 1844 1845 1846 1847 1848 1849 1850 1851 1852 1853 1854 1855 1856 1857 1858 1859 1860 1861 1862 1863 1864 1865 1866 1867 1868 1869 1870 1871 1872 1873 1874 1875 1876 1877 1878 1879 1880 1881 1882 1883 1884 1885 1886 1887 1888 1889 1890 1891 1892 1893 1894 1895 1896 1897 1898 1899 1900 1901 1902 1903 1904 1905 1906 1907 1908 1909 1910 1911 1912 1913 1914 1915 1916 1917 1918 1919 1920 1921 1922 1923 1924 1925 1926 1927 1928 1929 1930 1931 1932 1933 1934 1935 1936 1937 1938 1939 1940 1941 1942 1943 1944 1945 1946 1947 1948 1949 1950 1951 1952 1953 1954 1955 1956 1957 1958 1959 1960 1961 1962 1963 1964 1965 1966 1967 1968 1969 1970 1971 1972 1973 1974 1975 1976 1977 1978 1979 1980 1981 1982 1983 1984 1985 1986 1987 1988 1989 1990 1991 1992 1993 1994 1995 1996 1997 1998 1999 2000 2001 2002 2003 2004 2005 2006 2007 2008 2009 2010 2011 2012 2013 2014 2015 2016 2017 2018 2019 2020 2021 2022 2023 2024 2025 2026 2027 2028 2029 2030 2031 2032 2033 2034 2035 2036 2037 2038 2039 2040 2041 2042 2043 2044 2045 2046 2047 2048 2049 2050 2051 2052 2053 2054 2055 2056 2057 2058 2059 2060 2061 2062 2063 2064 2065 2066 2067 2068 2069 2070 2071 2072 2073 2074 2075 2076 2077 2078 2079 2080 2081 2082 2083 2084 2085 2086 2087 2088 2089 2090 2091 2092 2093 2094 2095 2096 2097 2098 2099 2100 2101 2102 2103 2104 2105 2106 2107 2108 2109 2110 2111 2112 2113 2114 2115 2116 2117 2118 2119 2120 2121 2122 2123 2124 2125 2126 2127 2128 2129 2130 2131 2132 2133 2134 2135 2136 2137 2138 2139 2140 2141 2142 2143 2144 2145 2146 2147 2148 2149 2150 2151 2152 2153 2154 2155 2156 2157 2158 2159 2160 2161 2162 2163 2164 2165 2166 2167 2168 2169 2170 2171 2172 2173 2174 2175 2176 2177 2178 2179 2180 2181 2182 2183 2184 2185 2186 2187 2188 2189 | /* FUSE: Filesystem in Userspace Copyright (C) 2001-2008 Miklos Szeredi <miklos@szeredi.hu> This program can be distributed under the terms of the GNU GPL. See the file COPYING. */ #include "fuse_i.h" #include <linux/pagemap.h> #include <linux/file.h> #include <linux/fs_context.h> #include <linux/moduleparam.h> #include <linux/sched.h> #include <linux/namei.h> #include <linux/slab.h> #include <linux/xattr.h> #include <linux/iversion.h> #include <linux/posix_acl.h> #include <linux/security.h> #include <linux/types.h> #include <linux/kernel.h> static bool __read_mostly allow_sys_admin_access; module_param(allow_sys_admin_access, bool, 0644); MODULE_PARM_DESC(allow_sys_admin_access, "Allow users with CAP_SYS_ADMIN in initial userns to bypass allow_other access check"); static void fuse_advise_use_readdirplus(struct inode *dir) { struct fuse_inode *fi = get_fuse_inode(dir); set_bit(FUSE_I_ADVISE_RDPLUS, &fi->state); } #if BITS_PER_LONG >= 64 static inline void __fuse_dentry_settime(struct dentry *entry, u64 time) { entry->d_fsdata = (void *) time; } static inline u64 fuse_dentry_time(const struct dentry *entry) { return (u64)entry->d_fsdata; } #else union fuse_dentry { u64 time; struct rcu_head rcu; }; static inline void __fuse_dentry_settime(struct dentry *dentry, u64 time) { ((union fuse_dentry *) dentry->d_fsdata)->time = time; } static inline u64 fuse_dentry_time(const struct dentry *entry) { return ((union fuse_dentry *) entry->d_fsdata)->time; } #endif static void fuse_dentry_settime(struct dentry *dentry, u64 time) { struct fuse_conn *fc = get_fuse_conn_super(dentry->d_sb); bool delete = !time && fc->delete_stale; /* * Mess with DCACHE_OP_DELETE because dput() will be faster without it. * Don't care about races, either way it's just an optimization */ if ((!delete && (dentry->d_flags & DCACHE_OP_DELETE)) || (delete && !(dentry->d_flags & DCACHE_OP_DELETE))) { spin_lock(&dentry->d_lock); if (!delete) dentry->d_flags &= ~DCACHE_OP_DELETE; else dentry->d_flags |= DCACHE_OP_DELETE; spin_unlock(&dentry->d_lock); } __fuse_dentry_settime(dentry, time); } /* * FUSE caches dentries and attributes with separate timeout. The * time in jiffies until the dentry/attributes are valid is stored in * dentry->d_fsdata and fuse_inode->i_time respectively. */ /* * Calculate the time in jiffies until a dentry/attributes are valid */ u64 fuse_time_to_jiffies(u64 sec, u32 nsec) { if (sec || nsec) { struct timespec64 ts = { sec, min_t(u32, nsec, NSEC_PER_SEC - 1) }; return get_jiffies_64() + timespec64_to_jiffies(&ts); } else return 0; } /* * Set dentry and possibly attribute timeouts from the lookup/mk* * replies */ void fuse_change_entry_timeout(struct dentry *entry, struct fuse_entry_out *o) { fuse_dentry_settime(entry, fuse_time_to_jiffies(o->entry_valid, o->entry_valid_nsec)); } void fuse_invalidate_attr_mask(struct inode *inode, u32 mask) { set_mask_bits(&get_fuse_inode(inode)->inval_mask, 0, mask); } /* * Mark the attributes as stale, so that at the next call to * ->getattr() they will be fetched from userspace */ void fuse_invalidate_attr(struct inode *inode) { fuse_invalidate_attr_mask(inode, STATX_BASIC_STATS); } static void fuse_dir_changed(struct inode *dir) { fuse_invalidate_attr(dir); inode_maybe_inc_iversion(dir, false); } /* * Mark the attributes as stale due to an atime change. Avoid the invalidate if * atime is not used. */ void fuse_invalidate_atime(struct inode *inode) { if (!IS_RDONLY(inode)) fuse_invalidate_attr_mask(inode, STATX_ATIME); } /* * Just mark the entry as stale, so that a next attempt to look it up * will result in a new lookup call to userspace * * This is called when a dentry is about to become negative and the * timeout is unknown (unlink, rmdir, rename and in some cases * lookup) */ void fuse_invalidate_entry_cache(struct dentry *entry) { fuse_dentry_settime(entry, 0); } /* * Same as fuse_invalidate_entry_cache(), but also try to remove the * dentry from the hash */ static void fuse_invalidate_entry(struct dentry *entry) { d_invalidate(entry); fuse_invalidate_entry_cache(entry); } static void fuse_lookup_init(struct fuse_conn *fc, struct fuse_args *args, u64 nodeid, const struct qstr *name, struct fuse_entry_out *outarg) { memset(outarg, 0, sizeof(struct fuse_entry_out)); args->opcode = FUSE_LOOKUP; args->nodeid = nodeid; args->in_numargs = 1; args->in_args[0].size = name->len + 1; args->in_args[0].value = name->name; args->out_numargs = 1; args->out_args[0].size = sizeof(struct fuse_entry_out); args->out_args[0].value = outarg; } /* * Check whether the dentry is still valid * * If the entry validity timeout has expired and the dentry is * positive, try to redo the lookup. If the lookup results in a * different inode, then let the VFS invalidate the dentry and redo * the lookup once more. If the lookup results in the same inode, * then refresh the attributes, timeouts and mark the dentry valid. */ static int fuse_dentry_revalidate(struct dentry *entry, unsigned int flags) { struct inode *inode; struct dentry *parent; struct fuse_mount *fm; struct fuse_inode *fi; int ret; inode = d_inode_rcu(entry); if (inode && fuse_is_bad(inode)) goto invalid; else if (time_before64(fuse_dentry_time(entry), get_jiffies_64()) || (flags & (LOOKUP_EXCL | LOOKUP_REVAL | LOOKUP_RENAME_TARGET))) { struct fuse_entry_out outarg; FUSE_ARGS(args); struct fuse_forget_link *forget; u64 attr_version; /* For negative dentries, always do a fresh lookup */ if (!inode) goto invalid; ret = -ECHILD; if (flags & LOOKUP_RCU) goto out; fm = get_fuse_mount(inode); forget = fuse_alloc_forget(); ret = -ENOMEM; if (!forget) goto out; attr_version = fuse_get_attr_version(fm->fc); parent = dget_parent(entry); fuse_lookup_init(fm->fc, &args, get_node_id(d_inode(parent)), &entry->d_name, &outarg); ret = fuse_simple_request(fm, &args); dput(parent); /* Zero nodeid is same as -ENOENT */ if (!ret && !outarg.nodeid) ret = -ENOENT; if (!ret) { fi = get_fuse_inode(inode); if (outarg.nodeid != get_node_id(inode) || (bool) IS_AUTOMOUNT(inode) != (bool) (outarg.attr.flags & FUSE_ATTR_SUBMOUNT)) { fuse_queue_forget(fm->fc, forget, outarg.nodeid, 1); goto invalid; } spin_lock(&fi->lock); fi->nlookup++; spin_unlock(&fi->lock); } kfree(forget); if (ret == -ENOMEM || ret == -EINTR) goto out; if (ret || fuse_invalid_attr(&outarg.attr) || fuse_stale_inode(inode, outarg.generation, &outarg.attr)) goto invalid; forget_all_cached_acls(inode); fuse_change_attributes(inode, &outarg.attr, NULL, ATTR_TIMEOUT(&outarg), attr_version); fuse_change_entry_timeout(entry, &outarg); } else if (inode) { fi = get_fuse_inode(inode); if (flags & LOOKUP_RCU) { if (test_bit(FUSE_I_INIT_RDPLUS, &fi->state)) return -ECHILD; } else if (test_and_clear_bit(FUSE_I_INIT_RDPLUS, &fi->state)) { parent = dget_parent(entry); fuse_advise_use_readdirplus(d_inode(parent)); dput(parent); } } ret = 1; out: return ret; invalid: ret = 0; goto out; } #if BITS_PER_LONG < 64 static int fuse_dentry_init(struct dentry *dentry) { dentry->d_fsdata = kzalloc(sizeof(union fuse_dentry), GFP_KERNEL_ACCOUNT | __GFP_RECLAIMABLE); return dentry->d_fsdata ? 0 : -ENOMEM; } static void fuse_dentry_release(struct dentry *dentry) { union fuse_dentry *fd = dentry->d_fsdata; kfree_rcu(fd, rcu); } #endif static int fuse_dentry_delete(const struct dentry *dentry) { return time_before64(fuse_dentry_time(dentry), get_jiffies_64()); } /* * Create a fuse_mount object with a new superblock (with path->dentry * as the root), and return that mount so it can be auto-mounted on * @path. */ static struct vfsmount *fuse_dentry_automount(struct path *path) { struct fs_context *fsc; struct vfsmount *mnt; struct fuse_inode *mp_fi = get_fuse_inode(d_inode(path->dentry)); fsc = fs_context_for_submount(path->mnt->mnt_sb->s_type, path->dentry); if (IS_ERR(fsc)) return ERR_CAST(fsc); /* Pass the FUSE inode of the mount for fuse_get_tree_submount() */ fsc->fs_private = mp_fi; /* Create the submount */ mnt = fc_mount(fsc); if (!IS_ERR(mnt)) mntget(mnt); put_fs_context(fsc); return mnt; } const struct dentry_operations fuse_dentry_operations = { .d_revalidate = fuse_dentry_revalidate, .d_delete = fuse_dentry_delete, #if BITS_PER_LONG < 64 .d_init = fuse_dentry_init, .d_release = fuse_dentry_release, #endif .d_automount = fuse_dentry_automount, }; const struct dentry_operations fuse_root_dentry_operations = { #if BITS_PER_LONG < 64 .d_init = fuse_dentry_init, .d_release = fuse_dentry_release, #endif }; int fuse_valid_type(int m) { return S_ISREG(m) || S_ISDIR(m) || S_ISLNK(m) || S_ISCHR(m) || S_ISBLK(m) || S_ISFIFO(m) || S_ISSOCK(m); } static bool fuse_valid_size(u64 size) { return size <= LLONG_MAX; } bool fuse_invalid_attr(struct fuse_attr *attr) { return !fuse_valid_type(attr->mode) || !fuse_valid_size(attr->size); } int fuse_lookup_name(struct super_block *sb, u64 nodeid, const struct qstr *name, struct fuse_entry_out *outarg, struct inode **inode) { struct fuse_mount *fm = get_fuse_mount_super(sb); FUSE_ARGS(args); struct fuse_forget_link *forget; u64 attr_version; int err; *inode = NULL; err = -ENAMETOOLONG; if (name->len > FUSE_NAME_MAX) goto out; forget = fuse_alloc_forget(); err = -ENOMEM; if (!forget) goto out; attr_version = fuse_get_attr_version(fm->fc); fuse_lookup_init(fm->fc, &args, nodeid, name, outarg); err = fuse_simple_request(fm, &args); /* Zero nodeid is same as -ENOENT, but with valid timeout */ if (err || !outarg->nodeid) goto out_put_forget; err = -EIO; if (fuse_invalid_attr(&outarg->attr)) goto out_put_forget; *inode = fuse_iget(sb, outarg->nodeid, outarg->generation, &outarg->attr, ATTR_TIMEOUT(outarg), attr_version); err = -ENOMEM; if (!*inode) { fuse_queue_forget(fm->fc, forget, outarg->nodeid, 1); goto out; } err = 0; out_put_forget: kfree(forget); out: return err; } static struct dentry *fuse_lookup(struct inode *dir, struct dentry *entry, unsigned int flags) { int err; struct fuse_entry_out outarg; struct inode *inode; struct dentry *newent; bool outarg_valid = true; bool locked; if (fuse_is_bad(dir)) return ERR_PTR(-EIO); locked = fuse_lock_inode(dir); err = fuse_lookup_name(dir->i_sb, get_node_id(dir), &entry->d_name, &outarg, &inode); fuse_unlock_inode(dir, locked); if (err == -ENOENT) { outarg_valid = false; err = 0; } if (err) goto out_err; err = -EIO; if (inode && get_node_id(inode) == FUSE_ROOT_ID) goto out_iput; newent = d_splice_alias(inode, entry); err = PTR_ERR(newent); if (IS_ERR(newent)) goto out_err; entry = newent ? newent : entry; if (outarg_valid) fuse_change_entry_timeout(entry, &outarg); else fuse_invalidate_entry_cache(entry); if (inode) fuse_advise_use_readdirplus(dir); return newent; out_iput: iput(inode); out_err: return ERR_PTR(err); } static int get_security_context(struct dentry *entry, umode_t mode, struct fuse_in_arg *ext) { struct fuse_secctx *fctx; struct fuse_secctx_header *header; void *ctx = NULL, *ptr; u32 ctxlen, total_len = sizeof(*header); int err, nr_ctx = 0; const char *name; size_t namelen; err = security_dentry_init_security(entry, mode, &entry->d_name, &name, &ctx, &ctxlen); if (err) { if (err != -EOPNOTSUPP) goto out_err; /* No LSM is supporting this security hook. Ignore error */ ctxlen = 0; ctx = NULL; } if (ctxlen) { nr_ctx = 1; namelen = strlen(name) + 1; err = -EIO; if (WARN_ON(namelen > XATTR_NAME_MAX + 1 || ctxlen > S32_MAX)) goto out_err; total_len += FUSE_REC_ALIGN(sizeof(*fctx) + namelen + ctxlen); } err = -ENOMEM; header = ptr = kzalloc(total_len, GFP_KERNEL); if (!ptr) goto out_err; header->nr_secctx = nr_ctx; header->size = total_len; ptr += sizeof(*header); if (nr_ctx) { fctx = ptr; fctx->size = ctxlen; ptr += sizeof(*fctx); strcpy(ptr, name); ptr += namelen; memcpy(ptr, ctx, ctxlen); } ext->size = total_len; ext->value = header; err = 0; out_err: kfree(ctx); return err; } static void *extend_arg(struct fuse_in_arg *buf, u32 bytes) { void *p; u32 newlen = buf->size + bytes; p = krealloc(buf->value, newlen, GFP_KERNEL); if (!p) { kfree(buf->value); buf->size = 0; buf->value = NULL; return NULL; } memset(p + buf->size, 0, bytes); buf->value = p; buf->size = newlen; return p + newlen - bytes; } static u32 fuse_ext_size(size_t size) { return FUSE_REC_ALIGN(sizeof(struct fuse_ext_header) + size); } /* * This adds just a single supplementary group that matches the parent's group. */ static int get_create_supp_group(struct inode *dir, struct fuse_in_arg *ext) { struct fuse_conn *fc = get_fuse_conn(dir); struct fuse_ext_header *xh; struct fuse_supp_groups *sg; kgid_t kgid = dir->i_gid; gid_t parent_gid = from_kgid(fc->user_ns, kgid); u32 sg_len = fuse_ext_size(sizeof(*sg) + sizeof(sg->groups[0])); if (parent_gid == (gid_t) -1 || gid_eq(kgid, current_fsgid()) || !in_group_p(kgid)) return 0; xh = extend_arg(ext, sg_len); if (!xh) return -ENOMEM; xh->size = sg_len; xh->type = FUSE_EXT_GROUPS; sg = (struct fuse_supp_groups *) &xh[1]; sg->nr_groups = 1; sg->groups[0] = parent_gid; return 0; } static int get_create_ext(struct fuse_args *args, struct inode *dir, struct dentry *dentry, umode_t mode) { struct fuse_conn *fc = get_fuse_conn_super(dentry->d_sb); struct fuse_in_arg ext = { .size = 0, .value = NULL }; int err = 0; if (fc->init_security) err = get_security_context(dentry, mode, &ext); if (!err && fc->create_supp_group) err = get_create_supp_group(dir, &ext); if (!err && ext.size) { WARN_ON(args->in_numargs >= ARRAY_SIZE(args->in_args)); args->is_ext = true; args->ext_idx = args->in_numargs++; args->in_args[args->ext_idx] = ext; } else { kfree(ext.value); } return err; } static void free_ext_value(struct fuse_args *args) { if (args->is_ext) kfree(args->in_args[args->ext_idx].value); } /* * Atomic create+open operation * * If the filesystem doesn't support this, then fall back to separate * 'mknod' + 'open' requests. */ static int fuse_create_open(struct inode *dir, struct dentry *entry, struct file *file, unsigned int flags, umode_t mode, u32 opcode) { int err; struct inode *inode; struct fuse_mount *fm = get_fuse_mount(dir); FUSE_ARGS(args); struct fuse_forget_link *forget; struct fuse_create_in inarg; struct fuse_open_out outopen; struct fuse_entry_out outentry; struct fuse_inode *fi; struct fuse_file *ff; bool trunc = flags & O_TRUNC; /* Userspace expects S_IFREG in create mode */ BUG_ON((mode & S_IFMT) != S_IFREG); forget = fuse_alloc_forget(); err = -ENOMEM; if (!forget) goto out_err; err = -ENOMEM; ff = fuse_file_alloc(fm); if (!ff) goto out_put_forget_req; if (!fm->fc->dont_mask) mode &= ~current_umask(); flags &= ~O_NOCTTY; memset(&inarg, 0, sizeof(inarg)); memset(&outentry, 0, sizeof(outentry)); inarg.flags = flags; inarg.mode = mode; inarg.umask = current_umask(); if (fm->fc->handle_killpriv_v2 && trunc && !(flags & O_EXCL) && !capable(CAP_FSETID)) { inarg.open_flags |= FUSE_OPEN_KILL_SUIDGID; } args.opcode = opcode; args.nodeid = get_node_id(dir); args.in_numargs = 2; args.in_args[0].size = sizeof(inarg); args.in_args[0].value = &inarg; args.in_args[1].size = entry->d_name.len + 1; args.in_args[1].value = entry->d_name.name; args.out_numargs = 2; args.out_args[0].size = sizeof(outentry); args.out_args[0].value = &outentry; args.out_args[1].size = sizeof(outopen); args.out_args[1].value = &outopen; err = get_create_ext(&args, dir, entry, mode); if (err) goto out_put_forget_req; err = fuse_simple_request(fm, &args); free_ext_value(&args); if (err) goto out_free_ff; err = -EIO; if (!S_ISREG(outentry.attr.mode) || invalid_nodeid(outentry.nodeid) || fuse_invalid_attr(&outentry.attr)) goto out_free_ff; ff->fh = outopen.fh; ff->nodeid = outentry.nodeid; ff->open_flags = outopen.open_flags; inode = fuse_iget(dir->i_sb, outentry.nodeid, outentry.generation, &outentry.attr, ATTR_TIMEOUT(&outentry), 0); if (!inode) { flags &= ~(O_CREAT | O_EXCL | O_TRUNC); fuse_sync_release(NULL, ff, flags); fuse_queue_forget(fm->fc, forget, outentry.nodeid, 1); err = -ENOMEM; goto out_err; } kfree(forget); d_instantiate(entry, inode); fuse_change_entry_timeout(entry, &outentry); fuse_dir_changed(dir); err = finish_open(file, entry, generic_file_open); if (err) { fi = get_fuse_inode(inode); fuse_sync_release(fi, ff, flags); } else { file->private_data = ff; fuse_finish_open(inode, file); if (fm->fc->atomic_o_trunc && trunc) truncate_pagecache(inode, 0); else if (!(ff->open_flags & FOPEN_KEEP_CACHE)) invalidate_inode_pages2(inode->i_mapping); } return err; out_free_ff: fuse_file_free(ff); out_put_forget_req: kfree(forget); out_err: return err; } static int fuse_mknod(struct mnt_idmap *, struct inode *, struct dentry *, umode_t, dev_t); static int fuse_atomic_open(struct inode *dir, struct dentry *entry, struct file *file, unsigned flags, umode_t mode) { int err; struct fuse_conn *fc = get_fuse_conn(dir); struct dentry *res = NULL; if (fuse_is_bad(dir)) return -EIO; if (d_in_lookup(entry)) { res = fuse_lookup(dir, entry, 0); if (IS_ERR(res)) return PTR_ERR(res); if (res) entry = res; } if (!(flags & O_CREAT) || d_really_is_positive(entry)) goto no_open; /* Only creates */ file->f_mode |= FMODE_CREATED; if (fc->no_create) goto mknod; err = fuse_create_open(dir, entry, file, flags, mode, FUSE_CREATE); if (err == -ENOSYS) { fc->no_create = 1; goto mknod; } else if (err == -EEXIST) fuse_invalidate_entry(entry); out_dput: dput(res); return err; mknod: err = fuse_mknod(&nop_mnt_idmap, dir, entry, mode, 0); if (err) goto out_dput; no_open: return finish_no_open(file, res); } /* * Code shared between mknod, mkdir, symlink and link */ static int create_new_entry(struct fuse_mount *fm, struct fuse_args *args, struct inode *dir, struct dentry *entry, umode_t mode) { struct fuse_entry_out outarg; struct inode *inode; struct dentry *d; int err; struct fuse_forget_link *forget; if (fuse_is_bad(dir)) return -EIO; forget = fuse_alloc_forget(); if (!forget) return -ENOMEM; memset(&outarg, 0, sizeof(outarg)); args->nodeid = get_node_id(dir); args->out_numargs = 1; args->out_args[0].size = sizeof(outarg); args->out_args[0].value = &outarg; if (args->opcode != FUSE_LINK) { err = get_create_ext(args, dir, entry, mode); if (err) goto out_put_forget_req; } err = fuse_simple_request(fm, args); free_ext_value(args); if (err) goto out_put_forget_req; err = -EIO; if (invalid_nodeid(outarg.nodeid) || fuse_invalid_attr(&outarg.attr)) goto out_put_forget_req; if ((outarg.attr.mode ^ mode) & S_IFMT) goto out_put_forget_req; inode = fuse_iget(dir->i_sb, outarg.nodeid, outarg.generation, &outarg.attr, ATTR_TIMEOUT(&outarg), 0); if (!inode) { fuse_queue_forget(fm->fc, forget, outarg.nodeid, 1); return -ENOMEM; } kfree(forget); d_drop(entry); d = d_splice_alias(inode, entry); if (IS_ERR(d)) return PTR_ERR(d); if (d) { fuse_change_entry_timeout(d, &outarg); dput(d); } else { fuse_change_entry_timeout(entry, &outarg); } fuse_dir_changed(dir); return 0; out_put_forget_req: if (err == -EEXIST) fuse_invalidate_entry(entry); kfree(forget); return err; } static int fuse_mknod(struct mnt_idmap *idmap, struct inode *dir, struct dentry *entry, umode_t mode, dev_t rdev) { struct fuse_mknod_in inarg; struct fuse_mount *fm = get_fuse_mount(dir); FUSE_ARGS(args); if (!fm->fc->dont_mask) mode &= ~current_umask(); memset(&inarg, 0, sizeof(inarg)); inarg.mode = mode; inarg.rdev = new_encode_dev(rdev); inarg.umask = current_umask(); args.opcode = FUSE_MKNOD; args.in_numargs = 2; args.in_args[0].size = sizeof(inarg); args.in_args[0].value = &inarg; args.in_args[1].size = entry->d_name.len + 1; args.in_args[1].value = entry->d_name.name; return create_new_entry(fm, &args, dir, entry, mode); } static int fuse_create(struct mnt_idmap *idmap, struct inode *dir, struct dentry *entry, umode_t mode, bool excl) { return fuse_mknod(&nop_mnt_idmap, dir, entry, mode, 0); } static int fuse_tmpfile(struct mnt_idmap *idmap, struct inode *dir, struct file *file, umode_t mode) { struct fuse_conn *fc = get_fuse_conn(dir); int err; if (fc->no_tmpfile) return -EOPNOTSUPP; err = fuse_create_open(dir, file->f_path.dentry, file, file->f_flags, mode, FUSE_TMPFILE); if (err == -ENOSYS) { fc->no_tmpfile = 1; err = -EOPNOTSUPP; } return err; } static int fuse_mkdir(struct mnt_idmap *idmap, struct inode *dir, struct dentry *entry, umode_t mode) { struct fuse_mkdir_in inarg; struct fuse_mount *fm = get_fuse_mount(dir); FUSE_ARGS(args); if (!fm->fc->dont_mask) mode &= ~current_umask(); memset(&inarg, 0, sizeof(inarg)); inarg.mode = mode; inarg.umask = current_umask(); args.opcode = FUSE_MKDIR; args.in_numargs = 2; args.in_args[0].size = sizeof(inarg); args.in_args[0].value = &inarg; args.in_args[1].size = entry->d_name.len + 1; args.in_args[1].value = entry->d_name.name; return create_new_entry(fm, &args, dir, entry, S_IFDIR); } static int fuse_symlink(struct mnt_idmap *idmap, struct inode *dir, struct dentry *entry, const char *link) { struct fuse_mount *fm = get_fuse_mount(dir); unsigned len = strlen(link) + 1; FUSE_ARGS(args); args.opcode = FUSE_SYMLINK; args.in_numargs = 2; args.in_args[0].size = entry->d_name.len + 1; args.in_args[0].value = entry->d_name.name; args.in_args[1].size = len; args.in_args[1].value = link; return create_new_entry(fm, &args, dir, entry, S_IFLNK); } void fuse_flush_time_update(struct inode *inode) { int err = sync_inode_metadata(inode, 1); mapping_set_error(inode->i_mapping, err); } static void fuse_update_ctime_in_cache(struct inode *inode) { if (!IS_NOCMTIME(inode)) { inode_set_ctime_current(inode); mark_inode_dirty_sync(inode); fuse_flush_time_update(inode); } } void fuse_update_ctime(struct inode *inode) { fuse_invalidate_attr_mask(inode, STATX_CTIME); fuse_update_ctime_in_cache(inode); } static void fuse_entry_unlinked(struct dentry *entry) { struct inode *inode = d_inode(entry); struct fuse_conn *fc = get_fuse_conn(inode); struct fuse_inode *fi = get_fuse_inode(inode); spin_lock(&fi->lock); fi->attr_version = atomic64_inc_return(&fc->attr_version); /* * If i_nlink == 0 then unlink doesn't make sense, yet this can * happen if userspace filesystem is careless. It would be * difficult to enforce correct nlink usage so just ignore this * condition here */ if (S_ISDIR(inode->i_mode)) clear_nlink(inode); else if (inode->i_nlink > 0) drop_nlink(inode); spin_unlock(&fi->lock); fuse_invalidate_entry_cache(entry); fuse_update_ctime(inode); } static int fuse_unlink(struct inode *dir, struct dentry *entry) { int err; struct fuse_mount *fm = get_fuse_mount(dir); FUSE_ARGS(args); if (fuse_is_bad(dir)) return -EIO; args.opcode = FUSE_UNLINK; args.nodeid = get_node_id(dir); args.in_numargs = 1; args.in_args[0].size = entry->d_name.len + 1; args.in_args[0].value = entry->d_name.name; err = fuse_simple_request(fm, &args); if (!err) { fuse_dir_changed(dir); fuse_entry_unlinked(entry); } else if (err == -EINTR || err == -ENOENT) fuse_invalidate_entry(entry); return err; } static int fuse_rmdir(struct inode *dir, struct dentry *entry) { int err; struct fuse_mount *fm = get_fuse_mount(dir); FUSE_ARGS(args); if (fuse_is_bad(dir)) return -EIO; args.opcode = FUSE_RMDIR; args.nodeid = get_node_id(dir); args.in_numargs = 1; args.in_args[0].size = entry->d_name.len + 1; args.in_args[0].value = entry->d_name.name; err = fuse_simple_request(fm, &args); if (!err) { fuse_dir_changed(dir); fuse_entry_unlinked(entry); } else if (err == -EINTR || err == -ENOENT) fuse_invalidate_entry(entry); return err; } static int fuse_rename_common(struct inode *olddir, struct dentry *oldent, struct inode *newdir, struct dentry *newent, unsigned int flags, int opcode, size_t argsize) { int err; struct fuse_rename2_in inarg; struct fuse_mount *fm = get_fuse_mount(olddir); FUSE_ARGS(args); memset(&inarg, 0, argsize); inarg.newdir = get_node_id(newdir); inarg.flags = flags; args.opcode = opcode; args.nodeid = get_node_id(olddir); args.in_numargs = 3; args.in_args[0].size = argsize; args.in_args[0].value = &inarg; args.in_args[1].size = oldent->d_name.len + 1; args.in_args[1].value = oldent->d_name.name; args.in_args[2].size = newent->d_name.len + 1; args.in_args[2].value = newent->d_name.name; err = fuse_simple_request(fm, &args); if (!err) { /* ctime changes */ fuse_update_ctime(d_inode(oldent)); if (flags & RENAME_EXCHANGE) fuse_update_ctime(d_inode(newent)); fuse_dir_changed(olddir); if (olddir != newdir) fuse_dir_changed(newdir); /* newent will end up negative */ if (!(flags & RENAME_EXCHANGE) && d_really_is_positive(newent)) fuse_entry_unlinked(newent); } else if (err == -EINTR || err == -ENOENT) { /* If request was interrupted, DEITY only knows if the rename actually took place. If the invalidation fails (e.g. some process has CWD under the renamed directory), then there can be inconsistency between the dcache and the real filesystem. Tough luck. */ fuse_invalidate_entry(oldent); if (d_really_is_positive(newent)) fuse_invalidate_entry(newent); } return err; } static int fuse_rename2(struct mnt_idmap *idmap, struct inode *olddir, struct dentry *oldent, struct inode *newdir, struct dentry *newent, unsigned int flags) { struct fuse_conn *fc = get_fuse_conn(olddir); int err; if (fuse_is_bad(olddir)) return -EIO; if (flags & ~(RENAME_NOREPLACE | RENAME_EXCHANGE | RENAME_WHITEOUT)) return -EINVAL; if (flags) { if (fc->no_rename2 || fc->minor < 23) return -EINVAL; err = fuse_rename_common(olddir, oldent, newdir, newent, flags, FUSE_RENAME2, sizeof(struct fuse_rename2_in)); if (err == -ENOSYS) { fc->no_rename2 = 1; err = -EINVAL; } } else { err = fuse_rename_common(olddir, oldent, newdir, newent, 0, FUSE_RENAME, sizeof(struct fuse_rename_in)); } return err; } static int fuse_link(struct dentry *entry, struct inode *newdir, struct dentry *newent) { int err; struct fuse_link_in inarg; struct inode *inode = d_inode(entry); struct fuse_mount *fm = get_fuse_mount(inode); FUSE_ARGS(args); memset(&inarg, 0, sizeof(inarg)); inarg.oldnodeid = get_node_id(inode); args.opcode = FUSE_LINK; args.in_numargs = 2; args.in_args[0].size = sizeof(inarg); args.in_args[0].value = &inarg; args.in_args[1].size = newent->d_name.len + 1; args.in_args[1].value = newent->d_name.name; err = create_new_entry(fm, &args, newdir, newent, inode->i_mode); if (!err) fuse_update_ctime_in_cache(inode); else if (err == -EINTR) fuse_invalidate_attr(inode); return err; } static void fuse_fillattr(struct inode *inode, struct fuse_attr *attr, struct kstat *stat) { unsigned int blkbits; struct fuse_conn *fc = get_fuse_conn(inode); stat->dev = inode->i_sb->s_dev; stat->ino = attr->ino; stat->mode = (inode->i_mode & S_IFMT) | (attr->mode & 07777); stat->nlink = attr->nlink; stat->uid = make_kuid(fc->user_ns, attr->uid); stat->gid = make_kgid(fc->user_ns, attr->gid); stat->rdev = inode->i_rdev; stat->atime.tv_sec = attr->atime; stat->atime.tv_nsec = attr->atimensec; stat->mtime.tv_sec = attr->mtime; stat->mtime.tv_nsec = attr->mtimensec; stat->ctime.tv_sec = attr->ctime; stat->ctime.tv_nsec = attr->ctimensec; stat->size = attr->size; stat->blocks = attr->blocks; if (attr->blksize != 0) blkbits = ilog2(attr->blksize); else blkbits = inode->i_sb->s_blocksize_bits; stat->blksize = 1 << blkbits; } static void fuse_statx_to_attr(struct fuse_statx *sx, struct fuse_attr *attr) { memset(attr, 0, sizeof(*attr)); attr->ino = sx->ino; attr->size = sx->size; attr->blocks = sx->blocks; attr->atime = sx->atime.tv_sec; attr->mtime = sx->mtime.tv_sec; attr->ctime = sx->ctime.tv_sec; attr->atimensec = sx->atime.tv_nsec; attr->mtimensec = sx->mtime.tv_nsec; attr->ctimensec = sx->ctime.tv_nsec; attr->mode = sx->mode; attr->nlink = sx->nlink; attr->uid = sx->uid; attr->gid = sx->gid; attr->rdev = new_encode_dev(MKDEV(sx->rdev_major, sx->rdev_minor)); attr->blksize = sx->blksize; } static int fuse_do_statx(struct inode *inode, struct file *file, struct kstat *stat) { int err; struct fuse_attr attr; struct fuse_statx *sx; struct fuse_statx_in inarg; struct fuse_statx_out outarg; struct fuse_mount *fm = get_fuse_mount(inode); u64 attr_version = fuse_get_attr_version(fm->fc); FUSE_ARGS(args); memset(&inarg, 0, sizeof(inarg)); memset(&outarg, 0, sizeof(outarg)); /* Directories have separate file-handle space */ if (file && S_ISREG(inode->i_mode)) { struct fuse_file *ff = file->private_data; inarg.getattr_flags |= FUSE_GETATTR_FH; inarg.fh = ff->fh; } /* For now leave sync hints as the default, request all stats. */ inarg.sx_flags = 0; inarg.sx_mask = STATX_BASIC_STATS | STATX_BTIME; args.opcode = FUSE_STATX; args.nodeid = get_node_id(inode); args.in_numargs = 1; args.in_args[0].size = sizeof(inarg); args.in_args[0].value = &inarg; args.out_numargs = 1; args.out_args[0].size = sizeof(outarg); args.out_args[0].value = &outarg; err = fuse_simple_request(fm, &args); if (err) return err; sx = &outarg.stat; if (((sx->mask & STATX_SIZE) && !fuse_valid_size(sx->size)) || ((sx->mask & STATX_TYPE) && (!fuse_valid_type(sx->mode) || inode_wrong_type(inode, sx->mode)))) { make_bad_inode(inode); return -EIO; } fuse_statx_to_attr(&outarg.stat, &attr); if ((sx->mask & STATX_BASIC_STATS) == STATX_BASIC_STATS) { fuse_change_attributes(inode, &attr, &outarg.stat, ATTR_TIMEOUT(&outarg), attr_version); } if (stat) { stat->result_mask = sx->mask & (STATX_BASIC_STATS | STATX_BTIME); stat->btime.tv_sec = sx->btime.tv_sec; stat->btime.tv_nsec = min_t(u32, sx->btime.tv_nsec, NSEC_PER_SEC - 1); fuse_fillattr(inode, &attr, stat); stat->result_mask |= STATX_TYPE; } return 0; } static int fuse_do_getattr(struct inode *inode, struct kstat *stat, struct file *file) { int err; struct fuse_getattr_in inarg; struct fuse_attr_out outarg; struct fuse_mount *fm = get_fuse_mount(inode); FUSE_ARGS(args); u64 attr_version; attr_version = fuse_get_attr_version(fm->fc); memset(&inarg, 0, sizeof(inarg)); memset(&outarg, 0, sizeof(outarg)); /* Directories have separate file-handle space */ if (file && S_ISREG(inode->i_mode)) { struct fuse_file *ff = file->private_data; inarg.getattr_flags |= FUSE_GETATTR_FH; inarg.fh = ff->fh; } args.opcode = FUSE_GETATTR; args.nodeid = get_node_id(inode); args.in_numargs = 1; args.in_args[0].size = sizeof(inarg); args.in_args[0].value = &inarg; args.out_numargs = 1; args.out_args[0].size = sizeof(outarg); args.out_args[0].value = &outarg; err = fuse_simple_request(fm, &args); if (!err) { if (fuse_invalid_attr(&outarg.attr) || inode_wrong_type(inode, outarg.attr.mode)) { fuse_make_bad(inode); err = -EIO; } else { fuse_change_attributes(inode, &outarg.attr, NULL, ATTR_TIMEOUT(&outarg), attr_version); if (stat) fuse_fillattr(inode, &outarg.attr, stat); } } return err; } static int fuse_update_get_attr(struct inode *inode, struct file *file, struct kstat *stat, u32 request_mask, unsigned int flags) { struct fuse_inode *fi = get_fuse_inode(inode); struct fuse_conn *fc = get_fuse_conn(inode); int err = 0; bool sync; u32 inval_mask = READ_ONCE(fi->inval_mask); u32 cache_mask = fuse_get_cache_mask(inode); /* FUSE only supports basic stats and possibly btime */ request_mask &= STATX_BASIC_STATS | STATX_BTIME; retry: if (fc->no_statx) request_mask &= STATX_BASIC_STATS; if (!request_mask) sync = false; else if (flags & AT_STATX_FORCE_SYNC) sync = true; else if (flags & AT_STATX_DONT_SYNC) sync = false; else if (request_mask & inval_mask & ~cache_mask) sync = true; else sync = time_before64(fi->i_time, get_jiffies_64()); if (sync) { forget_all_cached_acls(inode); /* Try statx if BTIME is requested */ if (!fc->no_statx && (request_mask & ~STATX_BASIC_STATS)) { err = fuse_do_statx(inode, file, stat); if (err == -ENOSYS) { fc->no_statx = 1; goto retry; } } else { err = fuse_do_getattr(inode, stat, file); } } else if (stat) { generic_fillattr(&nop_mnt_idmap, request_mask, inode, stat); stat->mode = fi->orig_i_mode; stat->ino = fi->orig_ino; if (test_bit(FUSE_I_BTIME, &fi->state)) { stat->btime = fi->i_btime; stat->result_mask |= STATX_BTIME; } } return err; } int fuse_update_attributes(struct inode *inode, struct file *file, u32 mask) { return fuse_update_get_attr(inode, file, NULL, mask, 0); } int fuse_reverse_inval_entry(struct fuse_conn *fc, u64 parent_nodeid, u64 child_nodeid, struct qstr *name, u32 flags) { int err = -ENOTDIR; struct inode *parent; struct dentry *dir; struct dentry *entry; parent = fuse_ilookup(fc, parent_nodeid, NULL); if (!parent) return -ENOENT; inode_lock_nested(parent, I_MUTEX_PARENT); if (!S_ISDIR(parent->i_mode)) goto unlock; err = -ENOENT; dir = d_find_alias(parent); if (!dir) goto unlock; name->hash = full_name_hash(dir, name->name, name->len); entry = d_lookup(dir, name); dput(dir); if (!entry) goto unlock; fuse_dir_changed(parent); if (!(flags & FUSE_EXPIRE_ONLY)) d_invalidate(entry); fuse_invalidate_entry_cache(entry); if (child_nodeid != 0 && d_really_is_positive(entry)) { inode_lock(d_inode(entry)); if (get_node_id(d_inode(entry)) != child_nodeid) { err = -ENOENT; goto badentry; } if (d_mountpoint(entry)) { err = -EBUSY; goto badentry; } if (d_is_dir(entry)) { shrink_dcache_parent(entry); if (!simple_empty(entry)) { err = -ENOTEMPTY; goto badentry; } d_inode(entry)->i_flags |= S_DEAD; } dont_mount(entry); clear_nlink(d_inode(entry)); err = 0; badentry: inode_unlock(d_inode(entry)); if (!err) d_delete(entry); } else { err = 0; } dput(entry); unlock: inode_unlock(parent); iput(parent); return err; } static inline bool fuse_permissible_uidgid(struct fuse_conn *fc) { const struct cred *cred = current_cred(); return (uid_eq(cred->euid, fc->user_id) && uid_eq(cred->suid, fc->user_id) && uid_eq(cred->uid, fc->user_id) && gid_eq(cred->egid, fc->group_id) && gid_eq(cred->sgid, fc->group_id) && gid_eq(cred->gid, fc->group_id)); } /* * Calling into a user-controlled filesystem gives the filesystem * daemon ptrace-like capabilities over the current process. This * means, that the filesystem daemon is able to record the exact * filesystem operations performed, and can also control the behavior * of the requester process in otherwise impossible ways. For example * it can delay the operation for arbitrary length of time allowing * DoS against the requester. * * For this reason only those processes can call into the filesystem, * for which the owner of the mount has ptrace privilege. This * excludes processes started by other users, suid or sgid processes. */ bool fuse_allow_current_process(struct fuse_conn *fc) { bool allow; if (fc->allow_other) allow = current_in_userns(fc->user_ns); else allow = fuse_permissible_uidgid(fc); if (!allow && allow_sys_admin_access && capable(CAP_SYS_ADMIN)) allow = true; return allow; } static int fuse_access(struct inode *inode, int mask) { struct fuse_mount *fm = get_fuse_mount(inode); FUSE_ARGS(args); struct fuse_access_in inarg; int err; BUG_ON(mask & MAY_NOT_BLOCK); if (fm->fc->no_access) return 0; memset(&inarg, 0, sizeof(inarg)); inarg.mask = mask & (MAY_READ | MAY_WRITE | MAY_EXEC); args.opcode = FUSE_ACCESS; args.nodeid = get_node_id(inode); args.in_numargs = 1; args.in_args[0].size = sizeof(inarg); args.in_args[0].value = &inarg; err = fuse_simple_request(fm, &args); if (err == -ENOSYS) { fm->fc->no_access = 1; err = 0; } return err; } static int fuse_perm_getattr(struct inode *inode, int mask) { if (mask & MAY_NOT_BLOCK) return -ECHILD; forget_all_cached_acls(inode); return fuse_do_getattr(inode, NULL, NULL); } /* * Check permission. The two basic access models of FUSE are: * * 1) Local access checking ('default_permissions' mount option) based * on file mode. This is the plain old disk filesystem permission * modell. * * 2) "Remote" access checking, where server is responsible for * checking permission in each inode operation. An exception to this * is if ->permission() was invoked from sys_access() in which case an * access request is sent. Execute permission is still checked * locally based on file mode. */ static int fuse_permission(struct mnt_idmap *idmap, struct inode *inode, int mask) { struct fuse_conn *fc = get_fuse_conn(inode); bool refreshed = false; int err = 0; if (fuse_is_bad(inode)) return -EIO; if (!fuse_allow_current_process(fc)) return -EACCES; /* * If attributes are needed, refresh them before proceeding */ if (fc->default_permissions || ((mask & MAY_EXEC) && S_ISREG(inode->i_mode))) { struct fuse_inode *fi = get_fuse_inode(inode); u32 perm_mask = STATX_MODE | STATX_UID | STATX_GID; if (perm_mask & READ_ONCE(fi->inval_mask) || time_before64(fi->i_time, get_jiffies_64())) { refreshed = true; err = fuse_perm_getattr(inode, mask); if (err) return err; } } if (fc->default_permissions) { err = generic_permission(&nop_mnt_idmap, inode, mask); /* If permission is denied, try to refresh file attributes. This is also needed, because the root node will at first have no permissions */ if (err == -EACCES && !refreshed) { err = fuse_perm_getattr(inode, mask); if (!err) err = generic_permission(&nop_mnt_idmap, inode, mask); } /* Note: the opposite of the above test does not exist. So if permissions are revoked this won't be noticed immediately, only after the attribute timeout has expired */ } else if (mask & (MAY_ACCESS | MAY_CHDIR)) { err = fuse_access(inode, mask); } else if ((mask & MAY_EXEC) && S_ISREG(inode->i_mode)) { if (!(inode->i_mode & S_IXUGO)) { if (refreshed) return -EACCES; err = fuse_perm_getattr(inode, mask); if (!err && !(inode->i_mode & S_IXUGO)) return -EACCES; } } return err; } static int fuse_readlink_page(struct inode *inode, struct page *page) { struct fuse_mount *fm = get_fuse_mount(inode); struct fuse_page_desc desc = { .length = PAGE_SIZE - 1 }; struct fuse_args_pages ap = { .num_pages = 1, .pages = &page, .descs = &desc, }; char *link; ssize_t res; ap.args.opcode = FUSE_READLINK; ap.args.nodeid = get_node_id(inode); ap.args.out_pages = true; ap.args.out_argvar = true; ap.args.page_zeroing = true; ap.args.out_numargs = 1; ap.args.out_args[0].size = desc.length; res = fuse_simple_request(fm, &ap.args); fuse_invalidate_atime(inode); if (res < 0) return res; if (WARN_ON(res >= PAGE_SIZE)) return -EIO; link = page_address(page); link[res] = '\0'; return 0; } static const char *fuse_get_link(struct dentry *dentry, struct inode *inode, struct delayed_call *callback) { struct fuse_conn *fc = get_fuse_conn(inode); struct page *page; int err; err = -EIO; if (fuse_is_bad(inode)) goto out_err; if (fc->cache_symlinks) return page_get_link(dentry, inode, callback); err = -ECHILD; if (!dentry) goto out_err; page = alloc_page(GFP_KERNEL); err = -ENOMEM; if (!page) goto out_err; err = fuse_readlink_page(inode, page); if (err) { __free_page(page); goto out_err; } set_delayed_call(callback, page_put_link, page); return page_address(page); out_err: return ERR_PTR(err); } static int fuse_dir_open(struct inode *inode, struct file *file) { return fuse_open_common(inode, file, true); } static int fuse_dir_release(struct inode *inode, struct file *file) { fuse_release_common(file, true); return 0; } static int fuse_dir_fsync(struct file *file, loff_t start, loff_t end, int datasync) { struct inode *inode = file->f_mapping->host; struct fuse_conn *fc = get_fuse_conn(inode); int err; if (fuse_is_bad(inode)) return -EIO; if (fc->no_fsyncdir) return 0; inode_lock(inode); err = fuse_fsync_common(file, start, end, datasync, FUSE_FSYNCDIR); if (err == -ENOSYS) { fc->no_fsyncdir = 1; err = 0; } inode_unlock(inode); return err; } static long fuse_dir_ioctl(struct file *file, unsigned int cmd, unsigned long arg) { struct fuse_conn *fc = get_fuse_conn(file->f_mapping->host); /* FUSE_IOCTL_DIR only supported for API version >= 7.18 */ if (fc->minor < 18) return -ENOTTY; return fuse_ioctl_common(file, cmd, arg, FUSE_IOCTL_DIR); } static long fuse_dir_compat_ioctl(struct file *file, unsigned int cmd, unsigned long arg) { struct fuse_conn *fc = get_fuse_conn(file->f_mapping->host); if (fc->minor < 18) return -ENOTTY; return fuse_ioctl_common(file, cmd, arg, FUSE_IOCTL_COMPAT | FUSE_IOCTL_DIR); } static bool update_mtime(unsigned ivalid, bool trust_local_mtime) { /* Always update if mtime is explicitly set */ if (ivalid & ATTR_MTIME_SET) return true; /* Or if kernel i_mtime is the official one */ if (trust_local_mtime) return true; /* If it's an open(O_TRUNC) or an ftruncate(), don't update */ if ((ivalid & ATTR_SIZE) && (ivalid & (ATTR_OPEN | ATTR_FILE))) return false; /* In all other cases update */ return true; } static void iattr_to_fattr(struct fuse_conn *fc, struct iattr *iattr, struct fuse_setattr_in *arg, bool trust_local_cmtime) { unsigned ivalid = iattr->ia_valid; if (ivalid & ATTR_MODE) arg->valid |= FATTR_MODE, arg->mode = iattr->ia_mode; if (ivalid & ATTR_UID) arg->valid |= FATTR_UID, arg->uid = from_kuid(fc->user_ns, iattr->ia_uid); if (ivalid & ATTR_GID) arg->valid |= FATTR_GID, arg->gid = from_kgid(fc->user_ns, iattr->ia_gid); if (ivalid & ATTR_SIZE) arg->valid |= FATTR_SIZE, arg->size = iattr->ia_size; if (ivalid & ATTR_ATIME) { arg->valid |= FATTR_ATIME; arg->atime = iattr->ia_atime.tv_sec; arg->atimensec = iattr->ia_atime.tv_nsec; if (!(ivalid & ATTR_ATIME_SET)) arg->valid |= FATTR_ATIME_NOW; } if ((ivalid & ATTR_MTIME) && update_mtime(ivalid, trust_local_cmtime)) { arg->valid |= FATTR_MTIME; arg->mtime = iattr->ia_mtime.tv_sec; arg->mtimensec = iattr->ia_mtime.tv_nsec; if (!(ivalid & ATTR_MTIME_SET) && !trust_local_cmtime) arg->valid |= FATTR_MTIME_NOW; } if ((ivalid & ATTR_CTIME) && trust_local_cmtime) { arg->valid |= FATTR_CTIME; arg->ctime = iattr->ia_ctime.tv_sec; arg->ctimensec = iattr->ia_ctime.tv_nsec; } } /* * Prevent concurrent writepages on inode * * This is done by adding a negative bias to the inode write counter * and waiting for all pending writes to finish. */ void fuse_set_nowrite(struct inode *inode) { struct fuse_inode *fi = get_fuse_inode(inode); BUG_ON(!inode_is_locked(inode)); spin_lock(&fi->lock); BUG_ON(fi->writectr < 0); fi->writectr += FUSE_NOWRITE; spin_unlock(&fi->lock); wait_event(fi->page_waitq, fi->writectr == FUSE_NOWRITE); } /* * Allow writepages on inode * * Remove the bias from the writecounter and send any queued * writepages. */ static void __fuse_release_nowrite(struct inode *inode) { struct fuse_inode *fi = get_fuse_inode(inode); BUG_ON(fi->writectr != FUSE_NOWRITE); fi->writectr = 0; fuse_flush_writepages(inode); } void fuse_release_nowrite(struct inode *inode) { struct fuse_inode *fi = get_fuse_inode(inode); spin_lock(&fi->lock); __fuse_release_nowrite(inode); spin_unlock(&fi->lock); } static void fuse_setattr_fill(struct fuse_conn *fc, struct fuse_args *args, struct inode *inode, struct fuse_setattr_in *inarg_p, struct fuse_attr_out *outarg_p) { args->opcode = FUSE_SETATTR; args->nodeid = get_node_id(inode); args->in_numargs = 1; args->in_args[0].size = sizeof(*inarg_p); args->in_args[0].value = inarg_p; args->out_numargs = 1; args->out_args[0].size = sizeof(*outarg_p); args->out_args[0].value = outarg_p; } /* * Flush inode->i_mtime to the server */ int fuse_flush_times(struct inode *inode, struct fuse_file *ff) { struct fuse_mount *fm = get_fuse_mount(inode); FUSE_ARGS(args); struct fuse_setattr_in inarg; struct fuse_attr_out outarg; memset(&inarg, 0, sizeof(inarg)); memset(&outarg, 0, sizeof(outarg)); inarg.valid = FATTR_MTIME; inarg.mtime = inode->i_mtime.tv_sec; inarg.mtimensec = inode->i_mtime.tv_nsec; if (fm->fc->minor >= 23) { inarg.valid |= FATTR_CTIME; inarg.ctime = inode_get_ctime(inode).tv_sec; inarg.ctimensec = inode_get_ctime(inode).tv_nsec; } if (ff) { inarg.valid |= FATTR_FH; inarg.fh = ff->fh; } fuse_setattr_fill(fm->fc, &args, inode, &inarg, &outarg); return fuse_simple_request(fm, &args); } /* * Set attributes, and at the same time refresh them. * * Truncation is slightly complicated, because the 'truncate' request * may fail, in which case we don't want to touch the mapping. * vmtruncate() doesn't allow for this case, so do the rlimit checking * and the actual truncation by hand. */ int fuse_do_setattr(struct dentry *dentry, struct iattr *attr, struct file *file) { struct inode *inode = d_inode(dentry); struct fuse_mount *fm = get_fuse_mount(inode); struct fuse_conn *fc = fm->fc; struct fuse_inode *fi = get_fuse_inode(inode); struct address_space *mapping = inode->i_mapping; FUSE_ARGS(args); struct fuse_setattr_in inarg; struct fuse_attr_out outarg; bool is_truncate = false; bool is_wb = fc->writeback_cache && S_ISREG(inode->i_mode); loff_t oldsize; int err; bool trust_local_cmtime = is_wb; bool fault_blocked = false; if (!fc->default_permissions) attr->ia_valid |= ATTR_FORCE; err = setattr_prepare(&nop_mnt_idmap, dentry, attr); if (err) return err; if (attr->ia_valid & ATTR_SIZE) { if (WARN_ON(!S_ISREG(inode->i_mode))) return -EIO; is_truncate = true; } if (FUSE_IS_DAX(inode) && is_truncate) { filemap_invalidate_lock(mapping); fault_blocked = true; err = fuse_dax_break_layouts(inode, 0, 0); if (err) { filemap_invalidate_unlock(mapping); return err; } } if (attr->ia_valid & ATTR_OPEN) { /* This is coming from open(..., ... | O_TRUNC); */ WARN_ON(!(attr->ia_valid & ATTR_SIZE)); WARN_ON(attr->ia_size != 0); if (fc->atomic_o_trunc) { /* * No need to send request to userspace, since actual * truncation has already been done by OPEN. But still * need to truncate page cache. */ i_size_write(inode, 0); truncate_pagecache(inode, 0); goto out; } file = NULL; } /* Flush dirty data/metadata before non-truncate SETATTR */ if (is_wb && attr->ia_valid & (ATTR_MODE | ATTR_UID | ATTR_GID | ATTR_MTIME_SET | ATTR_TIMES_SET)) { err = write_inode_now(inode, true); if (err) return err; fuse_set_nowrite(inode); fuse_release_nowrite(inode); } if (is_truncate) { fuse_set_nowrite(inode); set_bit(FUSE_I_SIZE_UNSTABLE, &fi->state); if (trust_local_cmtime && attr->ia_size != inode->i_size) attr->ia_valid |= ATTR_MTIME | ATTR_CTIME; } memset(&inarg, 0, sizeof(inarg)); memset(&outarg, 0, sizeof(outarg)); iattr_to_fattr(fc, attr, &inarg, trust_local_cmtime); if (file) { struct fuse_file *ff = file->private_data; inarg.valid |= FATTR_FH; inarg.fh = ff->fh; } /* Kill suid/sgid for non-directory chown unconditionally */ if (fc->handle_killpriv_v2 && !S_ISDIR(inode->i_mode) && attr->ia_valid & (ATTR_UID | ATTR_GID)) inarg.valid |= FATTR_KILL_SUIDGID; if (attr->ia_valid & ATTR_SIZE) { /* For mandatory locking in truncate */ inarg.valid |= FATTR_LOCKOWNER; inarg.lock_owner = fuse_lock_owner_id(fc, current->files); /* Kill suid/sgid for truncate only if no CAP_FSETID */ if (fc->handle_killpriv_v2 && !capable(CAP_FSETID)) inarg.valid |= FATTR_KILL_SUIDGID; } fuse_setattr_fill(fc, &args, inode, &inarg, &outarg); err = fuse_simple_request(fm, &args); if (err) { if (err == -EINTR) fuse_invalidate_attr(inode); goto error; } if (fuse_invalid_attr(&outarg.attr) || inode_wrong_type(inode, outarg.attr.mode)) { fuse_make_bad(inode); err = -EIO; goto error; } spin_lock(&fi->lock); /* the kernel maintains i_mtime locally */ if (trust_local_cmtime) { if (attr->ia_valid & ATTR_MTIME) inode->i_mtime = attr->ia_mtime; if (attr->ia_valid & ATTR_CTIME) inode_set_ctime_to_ts(inode, attr->ia_ctime); /* FIXME: clear I_DIRTY_SYNC? */ } fuse_change_attributes_common(inode, &outarg.attr, NULL, ATTR_TIMEOUT(&outarg), fuse_get_cache_mask(inode)); oldsize = inode->i_size; /* see the comment in fuse_change_attributes() */ if (!is_wb || is_truncate) i_size_write(inode, outarg.attr.size); if (is_truncate) { /* NOTE: this may release/reacquire fi->lock */ __fuse_release_nowrite(inode); } spin_unlock(&fi->lock); /* * Only call invalidate_inode_pages2() after removing * FUSE_NOWRITE, otherwise fuse_launder_folio() would deadlock. */ if ((is_truncate || !is_wb) && S_ISREG(inode->i_mode) && oldsize != outarg.attr.size) { truncate_pagecache(inode, outarg.attr.size); invalidate_inode_pages2(mapping); } clear_bit(FUSE_I_SIZE_UNSTABLE, &fi->state); out: if (fault_blocked) filemap_invalidate_unlock(mapping); return 0; error: if (is_truncate) fuse_release_nowrite(inode); clear_bit(FUSE_I_SIZE_UNSTABLE, &fi->state); if (fault_blocked) filemap_invalidate_unlock(mapping); return err; } static int fuse_setattr(struct mnt_idmap *idmap, struct dentry *entry, struct iattr *attr) { struct inode *inode = d_inode(entry); struct fuse_conn *fc = get_fuse_conn(inode); struct file *file = (attr->ia_valid & ATTR_FILE) ? attr->ia_file : NULL; int ret; if (fuse_is_bad(inode)) return -EIO; if (!fuse_allow_current_process(get_fuse_conn(inode))) return -EACCES; if (attr->ia_valid & (ATTR_KILL_SUID | ATTR_KILL_SGID)) { attr->ia_valid &= ~(ATTR_KILL_SUID | ATTR_KILL_SGID | ATTR_MODE); /* * The only sane way to reliably kill suid/sgid is to do it in * the userspace filesystem * * This should be done on write(), truncate() and chown(). */ if (!fc->handle_killpriv && !fc->handle_killpriv_v2) { /* * ia_mode calculation may have used stale i_mode. * Refresh and recalculate. */ ret = fuse_do_getattr(inode, NULL, file); if (ret) return ret; attr->ia_mode = inode->i_mode; if (inode->i_mode & S_ISUID) { attr->ia_valid |= ATTR_MODE; attr->ia_mode &= ~S_ISUID; } if ((inode->i_mode & (S_ISGID | S_IXGRP)) == (S_ISGID | S_IXGRP)) { attr->ia_valid |= ATTR_MODE; attr->ia_mode &= ~S_ISGID; } } } if (!attr->ia_valid) return 0; ret = fuse_do_setattr(entry, attr, file); if (!ret) { /* * If filesystem supports acls it may have updated acl xattrs in * the filesystem, so forget cached acls for the inode. */ if (fc->posix_acl) forget_all_cached_acls(inode); /* Directory mode changed, may need to revalidate access */ if (d_is_dir(entry) && (attr->ia_valid & ATTR_MODE)) fuse_invalidate_entry_cache(entry); } return ret; } static int fuse_getattr(struct mnt_idmap *idmap, const struct path *path, struct kstat *stat, u32 request_mask, unsigned int flags) { struct inode *inode = d_inode(path->dentry); struct fuse_conn *fc = get_fuse_conn(inode); if (fuse_is_bad(inode)) return -EIO; if (!fuse_allow_current_process(fc)) { if (!request_mask) { /* * If user explicitly requested *nothing* then don't * error out, but return st_dev only. */ stat->result_mask = 0; stat->dev = inode->i_sb->s_dev; return 0; } return -EACCES; } return fuse_update_get_attr(inode, NULL, stat, request_mask, flags); } static const struct inode_operations fuse_dir_inode_operations = { .lookup = fuse_lookup, .mkdir = fuse_mkdir, .symlink = fuse_symlink, .unlink = fuse_unlink, .rmdir = fuse_rmdir, .rename = fuse_rename2, .link = fuse_link, .setattr = fuse_setattr, .create = fuse_create, .atomic_open = fuse_atomic_open, .tmpfile = fuse_tmpfile, .mknod = fuse_mknod, .permission = fuse_permission, .getattr = fuse_getattr, .listxattr = fuse_listxattr, .get_inode_acl = fuse_get_inode_acl, .get_acl = fuse_get_acl, .set_acl = fuse_set_acl, .fileattr_get = fuse_fileattr_get, .fileattr_set = fuse_fileattr_set, }; static const struct file_operations fuse_dir_operations = { .llseek = generic_file_llseek, .read = generic_read_dir, .iterate_shared = fuse_readdir, .open = fuse_dir_open, .release = fuse_dir_release, .fsync = fuse_dir_fsync, .unlocked_ioctl = fuse_dir_ioctl, .compat_ioctl = fuse_dir_compat_ioctl, }; static const struct inode_operations fuse_common_inode_operations = { .setattr = fuse_setattr, .permission = fuse_permission, .getattr = fuse_getattr, .listxattr = fuse_listxattr, .get_inode_acl = fuse_get_inode_acl, .get_acl = fuse_get_acl, .set_acl = fuse_set_acl, .fileattr_get = fuse_fileattr_get, .fileattr_set = fuse_fileattr_set, }; static const struct inode_operations fuse_symlink_inode_operations = { .setattr = fuse_setattr, .get_link = fuse_get_link, .getattr = fuse_getattr, .listxattr = fuse_listxattr, }; void fuse_init_common(struct inode *inode) { inode->i_op = &fuse_common_inode_operations; } void fuse_init_dir(struct inode *inode) { struct fuse_inode *fi = get_fuse_inode(inode); inode->i_op = &fuse_dir_inode_operations; inode->i_fop = &fuse_dir_operations; spin_lock_init(&fi->rdc.lock); fi->rdc.cached = false; fi->rdc.size = 0; fi->rdc.pos = 0; fi->rdc.version = 0; } static int fuse_symlink_read_folio(struct file *null, struct folio *folio) { int err = fuse_readlink_page(folio->mapping->host, &folio->page); if (!err) folio_mark_uptodate(folio); folio_unlock(folio); return err; } static const struct address_space_operations fuse_symlink_aops = { .read_folio = fuse_symlink_read_folio, }; void fuse_init_symlink(struct inode *inode) { inode->i_op = &fuse_symlink_inode_operations; inode->i_data.a_ops = &fuse_symlink_aops; inode_nohighmem(inode); } |
131 40 | 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60 61 62 63 64 65 66 67 68 69 70 71 72 73 74 75 76 77 78 79 80 81 82 83 84 85 86 87 88 89 90 91 92 93 94 95 96 97 98 99 100 101 102 103 104 105 106 107 108 109 110 111 112 113 114 115 116 117 118 119 120 121 122 123 124 125 126 127 128 129 130 131 132 133 134 135 136 137 138 139 140 141 142 143 144 145 146 147 148 149 150 151 152 153 154 155 156 157 158 159 160 161 162 163 164 165 166 167 168 169 170 171 172 173 174 175 176 177 178 179 180 181 182 183 184 185 186 187 188 189 190 191 192 193 194 195 196 197 198 199 200 201 202 203 204 205 206 207 208 209 210 211 212 213 214 215 216 217 218 219 220 221 222 223 224 225 226 227 228 229 230 231 | // SPDX-License-Identifier: GPL-2.0 #include <linux/swap_cgroup.h> #include <linux/vmalloc.h> #include <linux/mm.h> #include <linux/swapops.h> /* depends on mm.h include */ static DEFINE_MUTEX(swap_cgroup_mutex); struct swap_cgroup_ctrl { struct page **map; unsigned long length; spinlock_t lock; }; static struct swap_cgroup_ctrl swap_cgroup_ctrl[MAX_SWAPFILES]; struct swap_cgroup { unsigned short id; }; #define SC_PER_PAGE (PAGE_SIZE/sizeof(struct swap_cgroup)) /* * SwapCgroup implements "lookup" and "exchange" operations. * In typical usage, this swap_cgroup is accessed via memcg's charge/uncharge * against SwapCache. At swap_free(), this is accessed directly from swap. * * This means, * - we have no race in "exchange" when we're accessed via SwapCache because * SwapCache(and its swp_entry) is under lock. * - When called via swap_free(), there is no user of this entry and no race. * Then, we don't need lock around "exchange". * * TODO: we can push these buffers out to HIGHMEM. */ /* * allocate buffer for swap_cgroup. */ static int swap_cgroup_prepare(int type) { struct page *page; struct swap_cgroup_ctrl *ctrl; unsigned long idx, max; ctrl = &swap_cgroup_ctrl[type]; for (idx = 0; idx < ctrl->length; idx++) { page = alloc_page(GFP_KERNEL | __GFP_ZERO); if (!page) goto not_enough_page; ctrl->map[idx] = page; if (!(idx % SWAP_CLUSTER_MAX)) cond_resched(); } return 0; not_enough_page: max = idx; for (idx = 0; idx < max; idx++) __free_page(ctrl->map[idx]); return -ENOMEM; } static struct swap_cgroup *__lookup_swap_cgroup(struct swap_cgroup_ctrl *ctrl, pgoff_t offset) { struct page *mappage; struct swap_cgroup *sc; mappage = ctrl->map[offset / SC_PER_PAGE]; sc = page_address(mappage); return sc + offset % SC_PER_PAGE; } static struct swap_cgroup *lookup_swap_cgroup(swp_entry_t ent, struct swap_cgroup_ctrl **ctrlp) { pgoff_t offset = swp_offset(ent); struct swap_cgroup_ctrl *ctrl; ctrl = &swap_cgroup_ctrl[swp_type(ent)]; if (ctrlp) *ctrlp = ctrl; return __lookup_swap_cgroup(ctrl, offset); } /** * swap_cgroup_cmpxchg - cmpxchg mem_cgroup's id for this swp_entry. * @ent: swap entry to be cmpxchged * @old: old id * @new: new id * * Returns old id at success, 0 at failure. * (There is no mem_cgroup using 0 as its id) */ unsigned short swap_cgroup_cmpxchg(swp_entry_t ent, unsigned short old, unsigned short new) { struct swap_cgroup_ctrl *ctrl; struct swap_cgroup *sc; unsigned long flags; unsigned short retval; sc = lookup_swap_cgroup(ent, &ctrl); spin_lock_irqsave(&ctrl->lock, flags); retval = sc->id; if (retval == old) sc->id = new; else retval = 0; spin_unlock_irqrestore(&ctrl->lock, flags); return retval; } /** * swap_cgroup_record - record mem_cgroup for a set of swap entries * @ent: the first swap entry to be recorded into * @id: mem_cgroup to be recorded * @nr_ents: number of swap entries to be recorded * * Returns old value at success, 0 at failure. * (Of course, old value can be 0.) */ unsigned short swap_cgroup_record(swp_entry_t ent, unsigned short id, unsigned int nr_ents) { struct swap_cgroup_ctrl *ctrl; struct swap_cgroup *sc; unsigned short old; unsigned long flags; pgoff_t offset = swp_offset(ent); pgoff_t end = offset + nr_ents; sc = lookup_swap_cgroup(ent, &ctrl); spin_lock_irqsave(&ctrl->lock, flags); old = sc->id; for (;;) { VM_BUG_ON(sc->id != old); sc->id = id; offset++; if (offset == end) break; if (offset % SC_PER_PAGE) sc++; else sc = __lookup_swap_cgroup(ctrl, offset); } spin_unlock_irqrestore(&ctrl->lock, flags); return old; } /** * lookup_swap_cgroup_id - lookup mem_cgroup id tied to swap entry * @ent: swap entry to be looked up. * * Returns ID of mem_cgroup at success. 0 at failure. (0 is invalid ID) */ unsigned short lookup_swap_cgroup_id(swp_entry_t ent) { return lookup_swap_cgroup(ent, NULL)->id; } int swap_cgroup_swapon(int type, unsigned long max_pages) { void *array; unsigned long length; struct swap_cgroup_ctrl *ctrl; if (mem_cgroup_disabled()) return 0; length = DIV_ROUND_UP(max_pages, SC_PER_PAGE); array = vcalloc(length, sizeof(void *)); if (!array) goto nomem; ctrl = &swap_cgroup_ctrl[type]; mutex_lock(&swap_cgroup_mutex); ctrl->length = length; ctrl->map = array; spin_lock_init(&ctrl->lock); if (swap_cgroup_prepare(type)) { /* memory shortage */ ctrl->map = NULL; ctrl->length = 0; mutex_unlock(&swap_cgroup_mutex); vfree(array); goto nomem; } mutex_unlock(&swap_cgroup_mutex); return 0; nomem: pr_info("couldn't allocate enough memory for swap_cgroup\n"); pr_info("swap_cgroup can be disabled by swapaccount=0 boot option\n"); return -ENOMEM; } void swap_cgroup_swapoff(int type) { struct page **map; unsigned long i, length; struct swap_cgroup_ctrl *ctrl; if (mem_cgroup_disabled()) return; mutex_lock(&swap_cgroup_mutex); ctrl = &swap_cgroup_ctrl[type]; map = ctrl->map; length = ctrl->length; ctrl->map = NULL; ctrl->length = 0; mutex_unlock(&swap_cgroup_mutex); if (map) { for (i = 0; i < length; i++) { struct page *page = map[i]; if (page) __free_page(page); if (!(i % SWAP_CLUSTER_MAX)) cond_resched(); } vfree(map); } } |
3 76 77 77 78 78 78 78 76 106 105 3 3 1 3 1 1 109 109 109 109 108 107 79 78 57 78 77 78 59 1 1 103 104 104 58 78 104 78 58 55 55 40 16 16 56 109 41 41 41 18 2 1 1 1 1 1 | 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60 61 62 63 64 65 66 67 68 69 70 71 72 73 74 75 76 77 78 79 80 81 82 83 84 85 86 87 88 89 90 91 92 93 94 95 96 97 98 99 100 101 102 103 104 105 106 107 108 109 110 111 112 113 114 115 116 117 118 119 120 121 122 123 124 125 126 127 128 129 130 131 132 133 134 135 136 137 138 139 140 141 142 143 144 145 146 147 148 149 150 151 152 153 154 155 156 157 158 159 160 161 162 163 164 165 166 167 168 169 170 171 172 173 174 175 176 177 178 179 180 181 182 183 184 185 186 187 188 189 190 191 192 193 194 195 196 197 198 199 200 201 202 203 204 205 206 207 208 209 210 211 212 213 214 215 216 217 218 219 220 221 222 223 224 225 226 227 228 229 230 231 232 233 234 235 236 237 238 239 240 241 242 243 244 245 246 247 248 249 250 251 252 253 254 255 256 257 258 259 260 261 262 263 264 265 266 267 268 269 270 271 272 273 274 275 276 277 278 279 280 281 282 283 284 285 286 287 288 289 290 291 292 293 294 295 296 297 298 299 300 301 302 303 304 305 306 307 308 309 310 311 312 313 314 315 316 317 318 319 320 321 322 323 324 325 326 327 328 329 330 331 332 333 334 335 336 337 338 339 340 341 342 343 344 345 346 347 348 349 350 351 352 353 354 355 356 357 358 359 360 361 362 363 364 365 366 367 368 369 370 371 372 373 374 375 376 377 378 379 380 381 382 383 384 385 386 387 388 389 390 391 392 393 394 395 396 397 398 399 400 401 402 403 404 405 406 407 408 409 410 411 412 413 414 415 416 417 418 419 420 421 422 423 424 425 426 427 428 429 430 431 432 433 434 435 436 437 438 439 440 441 442 443 444 445 446 447 448 449 450 451 452 453 454 455 456 457 458 459 460 461 462 463 464 465 466 467 468 469 470 471 472 473 474 475 476 477 478 479 480 481 482 483 484 485 486 487 488 489 490 491 492 493 494 495 496 497 498 499 500 501 502 503 504 505 506 507 508 509 510 511 512 513 514 515 516 517 518 519 520 521 522 523 524 525 526 527 528 529 530 531 532 533 534 535 536 537 538 539 540 541 542 543 544 545 546 547 548 549 550 551 552 553 554 555 556 557 558 559 560 561 562 563 564 565 566 567 568 569 570 571 572 573 574 575 576 577 578 579 580 581 582 583 584 585 586 587 588 589 590 591 592 593 594 595 596 597 598 599 600 601 602 603 604 605 606 607 608 609 610 611 612 613 614 615 616 617 618 619 620 621 622 623 624 625 626 627 628 629 630 631 632 633 634 635 636 637 638 639 640 641 642 643 644 645 646 647 648 649 650 651 652 653 654 655 656 657 658 659 660 661 662 663 664 665 666 667 668 669 670 671 672 673 674 675 676 677 678 679 680 681 682 683 684 685 686 687 688 689 690 691 692 693 694 695 696 697 698 699 700 701 702 703 704 705 706 707 708 709 710 711 712 713 714 715 716 717 718 719 720 721 722 723 724 725 726 727 728 729 730 731 732 733 734 735 736 737 738 739 740 741 742 743 744 745 746 747 748 749 750 751 752 753 754 755 756 757 758 759 760 761 762 763 764 765 766 767 768 769 770 771 772 773 774 775 776 777 778 779 780 781 782 783 784 785 786 787 788 789 790 791 792 793 794 795 796 797 798 799 800 801 802 803 804 805 806 807 808 809 810 811 812 813 814 815 816 817 818 819 820 821 822 823 824 825 826 827 828 829 830 831 832 833 834 835 836 837 838 839 840 841 842 843 844 845 846 847 848 849 850 851 852 853 854 855 856 857 858 859 860 861 862 863 864 865 866 867 868 869 870 871 872 873 874 875 876 877 878 879 880 881 882 883 884 885 886 887 888 889 890 891 892 893 894 895 896 897 898 899 900 901 902 903 904 905 906 907 908 909 910 911 912 913 914 915 916 917 918 919 920 921 922 923 924 925 926 927 928 929 930 931 932 933 934 935 936 937 938 939 940 941 942 943 944 945 946 947 948 949 950 951 952 953 954 955 956 957 958 959 960 961 962 963 964 965 966 967 968 969 970 971 972 973 974 975 976 977 978 979 980 981 982 983 984 985 986 987 988 989 990 991 992 993 994 995 996 997 998 999 1000 1001 1002 1003 1004 1005 1006 1007 1008 1009 1010 1011 1012 1013 1014 1015 1016 1017 1018 1019 1020 1021 1022 1023 1024 1025 1026 1027 1028 1029 1030 1031 1032 1033 1034 1035 1036 1037 1038 1039 | // SPDX-License-Identifier: GPL-2.0 /* * Released under the GPLv2 only. */ #include <linux/module.h> #include <linux/string.h> #include <linux/bitops.h> #include <linux/slab.h> #include <linux/log2.h> #include <linux/kmsan.h> #include <linux/usb.h> #include <linux/wait.h> #include <linux/usb/hcd.h> #include <linux/scatterlist.h> #define to_urb(d) container_of(d, struct urb, kref) static void urb_destroy(struct kref *kref) { struct urb *urb = to_urb(kref); if (urb->transfer_flags & URB_FREE_BUFFER) kfree(urb->transfer_buffer); kfree(urb); } /** * usb_init_urb - initializes a urb so that it can be used by a USB driver * @urb: pointer to the urb to initialize * * Initializes a urb so that the USB subsystem can use it properly. * * If a urb is created with a call to usb_alloc_urb() it is not * necessary to call this function. Only use this if you allocate the * space for a struct urb on your own. If you call this function, be * careful when freeing the memory for your urb that it is no longer in * use by the USB core. * * Only use this function if you _really_ understand what you are doing. */ void usb_init_urb(struct urb *urb) { if (urb) { memset(urb, 0, sizeof(*urb)); kref_init(&urb->kref); INIT_LIST_HEAD(&urb->urb_list); INIT_LIST_HEAD(&urb->anchor_list); } } EXPORT_SYMBOL_GPL(usb_init_urb); /** * usb_alloc_urb - creates a new urb for a USB driver to use * @iso_packets: number of iso packets for this urb * @mem_flags: the type of memory to allocate, see kmalloc() for a list of * valid options for this. * * Creates an urb for the USB driver to use, initializes a few internal * structures, increments the usage counter, and returns a pointer to it. * * If the driver want to use this urb for interrupt, control, or bulk * endpoints, pass '0' as the number of iso packets. * * The driver must call usb_free_urb() when it is finished with the urb. * * Return: A pointer to the new urb, or %NULL if no memory is available. */ struct urb *usb_alloc_urb(int iso_packets, gfp_t mem_flags) { struct urb *urb; urb = kmalloc(struct_size(urb, iso_frame_desc, iso_packets), mem_flags); if (!urb) return NULL; usb_init_urb(urb); return urb; } EXPORT_SYMBOL_GPL(usb_alloc_urb); /** * usb_free_urb - frees the memory used by a urb when all users of it are finished * @urb: pointer to the urb to free, may be NULL * * Must be called when a user of a urb is finished with it. When the last user * of the urb calls this function, the memory of the urb is freed. * * Note: The transfer buffer associated with the urb is not freed unless the * URB_FREE_BUFFER transfer flag is set. */ void usb_free_urb(struct urb *urb) { if (urb) kref_put(&urb->kref, urb_destroy); } EXPORT_SYMBOL_GPL(usb_free_urb); /** * usb_get_urb - increments the reference count of the urb * @urb: pointer to the urb to modify, may be NULL * * This must be called whenever a urb is transferred from a device driver to a * host controller driver. This allows proper reference counting to happen * for urbs. * * Return: A pointer to the urb with the incremented reference counter. */ struct urb *usb_get_urb(struct urb *urb) { if (urb) kref_get(&urb->kref); return urb; } EXPORT_SYMBOL_GPL(usb_get_urb); /** * usb_anchor_urb - anchors an URB while it is processed * @urb: pointer to the urb to anchor * @anchor: pointer to the anchor * * This can be called to have access to URBs which are to be executed * without bothering to track them */ void usb_anchor_urb(struct urb *urb, struct usb_anchor *anchor) { unsigned long flags; spin_lock_irqsave(&anchor->lock, flags); usb_get_urb(urb); list_add_tail(&urb->anchor_list, &anchor->urb_list); urb->anchor = anchor; if (unlikely(anchor->poisoned)) atomic_inc(&urb->reject); spin_unlock_irqrestore(&anchor->lock, flags); } EXPORT_SYMBOL_GPL(usb_anchor_urb); static int usb_anchor_check_wakeup(struct usb_anchor *anchor) { return atomic_read(&anchor->suspend_wakeups) == 0 && list_empty(&anchor->urb_list); } /* Callers must hold anchor->lock */ static void __usb_unanchor_urb(struct urb *urb, struct usb_anchor *anchor) { urb->anchor = NULL; list_del(&urb->anchor_list); usb_put_urb(urb); if (usb_anchor_check_wakeup(anchor)) wake_up(&anchor->wait); } /** * usb_unanchor_urb - unanchors an URB * @urb: pointer to the urb to anchor * * Call this to stop the system keeping track of this URB */ void usb_unanchor_urb(struct urb *urb) { unsigned long flags; struct usb_anchor *anchor; if (!urb) return; anchor = urb->anchor; if (!anchor) return; spin_lock_irqsave(&anchor->lock, flags); /* * At this point, we could be competing with another thread which * has the same intention. To protect the urb from being unanchored * twice, only the winner of the race gets the job. */ if (likely(anchor == urb->anchor)) __usb_unanchor_urb(urb, anchor); spin_unlock_irqrestore(&anchor->lock, flags); } EXPORT_SYMBOL_GPL(usb_unanchor_urb); /*-------------------------------------------------------------------*/ static const int pipetypes[4] = { PIPE_CONTROL, PIPE_ISOCHRONOUS, PIPE_BULK, PIPE_INTERRUPT }; /** * usb_pipe_type_check - sanity check of a specific pipe for a usb device * @dev: struct usb_device to be checked * @pipe: pipe to check * * This performs a light-weight sanity check for the endpoint in the * given usb device. It returns 0 if the pipe is valid for the specific usb * device, otherwise a negative error code. */ int usb_pipe_type_check(struct usb_device *dev, unsigned int pipe) { const struct usb_host_endpoint *ep; ep = usb_pipe_endpoint(dev, pipe); if (!ep) return -EINVAL; if (usb_pipetype(pipe) != pipetypes[usb_endpoint_type(&ep->desc)]) return -EINVAL; return 0; } EXPORT_SYMBOL_GPL(usb_pipe_type_check); /** * usb_urb_ep_type_check - sanity check of endpoint in the given urb * @urb: urb to be checked * * This performs a light-weight sanity check for the endpoint in the * given urb. It returns 0 if the urb contains a valid endpoint, otherwise * a negative error code. */ int usb_urb_ep_type_check(const struct urb *urb) { return usb_pipe_type_check(urb->dev, urb->pipe); } EXPORT_SYMBOL_GPL(usb_urb_ep_type_check); /** * usb_submit_urb - issue an asynchronous transfer request for an endpoint * @urb: pointer to the urb describing the request * @mem_flags: the type of memory to allocate, see kmalloc() for a list * of valid options for this. * * This submits a transfer request, and transfers control of the URB * describing that request to the USB subsystem. Request completion will * be indicated later, asynchronously, by calling the completion handler. * The three types of completion are success, error, and unlink * (a software-induced fault, also called "request cancellation"). * * URBs may be submitted in interrupt context. * * The caller must have correctly initialized the URB before submitting * it. Functions such as usb_fill_bulk_urb() and usb_fill_control_urb() are * available to ensure that most fields are correctly initialized, for * the particular kind of transfer, although they will not initialize * any transfer flags. * * If the submission is successful, the complete() callback from the URB * will be called exactly once, when the USB core and Host Controller Driver * (HCD) are finished with the URB. When the completion function is called, * control of the URB is returned to the device driver which issued the * request. The completion handler may then immediately free or reuse that * URB. * * With few exceptions, USB device drivers should never access URB fields * provided by usbcore or the HCD until its complete() is called. * The exceptions relate to periodic transfer scheduling. For both * interrupt and isochronous urbs, as part of successful URB submission * urb->interval is modified to reflect the actual transfer period used * (normally some power of two units). And for isochronous urbs, * urb->start_frame is modified to reflect when the URB's transfers were * scheduled to start. * * Not all isochronous transfer scheduling policies will work, but most * host controller drivers should easily handle ISO queues going from now * until 10-200 msec into the future. Drivers should try to keep at * least one or two msec of data in the queue; many controllers require * that new transfers start at least 1 msec in the future when they are * added. If the driver is unable to keep up and the queue empties out, * the behavior for new submissions is governed by the URB_ISO_ASAP flag. * If the flag is set, or if the queue is idle, then the URB is always * assigned to the first available (and not yet expired) slot in the * endpoint's schedule. If the flag is not set and the queue is active * then the URB is always assigned to the next slot in the schedule * following the end of the endpoint's previous URB, even if that slot is * in the past. When a packet is assigned in this way to a slot that has * already expired, the packet is not transmitted and the corresponding * usb_iso_packet_descriptor's status field will return -EXDEV. If this * would happen to all the packets in the URB, submission fails with a * -EXDEV error code. * * For control endpoints, the synchronous usb_control_msg() call is * often used (in non-interrupt context) instead of this call. * That is often used through convenience wrappers, for the requests * that are standardized in the USB 2.0 specification. For bulk * endpoints, a synchronous usb_bulk_msg() call is available. * * Return: * 0 on successful submissions. A negative error number otherwise. * * Request Queuing: * * URBs may be submitted to endpoints before previous ones complete, to * minimize the impact of interrupt latencies and system overhead on data * throughput. With that queuing policy, an endpoint's queue would never * be empty. This is required for continuous isochronous data streams, * and may also be required for some kinds of interrupt transfers. Such * queuing also maximizes bandwidth utilization by letting USB controllers * start work on later requests before driver software has finished the * completion processing for earlier (successful) requests. * * As of Linux 2.6, all USB endpoint transfer queues support depths greater * than one. This was previously a HCD-specific behavior, except for ISO * transfers. Non-isochronous endpoint queues are inactive during cleanup * after faults (transfer errors or cancellation). * * Reserved Bandwidth Transfers: * * Periodic transfers (interrupt or isochronous) are performed repeatedly, * using the interval specified in the urb. Submitting the first urb to * the endpoint reserves the bandwidth necessary to make those transfers. * If the USB subsystem can't allocate sufficient bandwidth to perform * the periodic request, submitting such a periodic request should fail. * * For devices under xHCI, the bandwidth is reserved at configuration time, or * when the alt setting is selected. If there is not enough bus bandwidth, the * configuration/alt setting request will fail. Therefore, submissions to * periodic endpoints on devices under xHCI should never fail due to bandwidth * constraints. * * Device drivers must explicitly request that repetition, by ensuring that * some URB is always on the endpoint's queue (except possibly for short * periods during completion callbacks). When there is no longer an urb * queued, the endpoint's bandwidth reservation is canceled. This means * drivers can use their completion handlers to ensure they keep bandwidth * they need, by reinitializing and resubmitting the just-completed urb * until the driver longer needs that periodic bandwidth. * * Memory Flags: * * The general rules for how to decide which mem_flags to use * are the same as for kmalloc. There are four * different possible values; GFP_KERNEL, GFP_NOFS, GFP_NOIO and * GFP_ATOMIC. * * GFP_NOFS is not ever used, as it has not been implemented yet. * * GFP_ATOMIC is used when * (a) you are inside a completion handler, an interrupt, bottom half, * tasklet or timer, or * (b) you are holding a spinlock or rwlock (does not apply to * semaphores), or * (c) current->state != TASK_RUNNING, this is the case only after * you've changed it. * * GFP_NOIO is used in the block io path and error handling of storage * devices. * * All other situations use GFP_KERNEL. * * Some more specific rules for mem_flags can be inferred, such as * (1) start_xmit, timeout, and receive methods of network drivers must * use GFP_ATOMIC (they are called with a spinlock held); * (2) queuecommand methods of scsi drivers must use GFP_ATOMIC (also * called with a spinlock held); * (3) If you use a kernel thread with a network driver you must use * GFP_NOIO, unless (b) or (c) apply; * (4) after you have done a down() you can use GFP_KERNEL, unless (b) or (c) * apply or your are in a storage driver's block io path; * (5) USB probe and disconnect can use GFP_KERNEL unless (b) or (c) apply; and * (6) changing firmware on a running storage or net device uses * GFP_NOIO, unless b) or c) apply * */ int usb_submit_urb(struct urb *urb, gfp_t mem_flags) { int xfertype, max; struct usb_device *dev; struct usb_host_endpoint *ep; int is_out; unsigned int allowed; if (!urb || !urb->complete) return -EINVAL; if (urb->hcpriv) { WARN_ONCE(1, "URB %pK submitted while active\n", urb); return -EBUSY; } dev = urb->dev; if ((!dev) || (dev->state < USB_STATE_UNAUTHENTICATED)) return -ENODEV; /* For now, get the endpoint from the pipe. Eventually drivers * will be required to set urb->ep directly and we will eliminate * urb->pipe. */ ep = usb_pipe_endpoint(dev, urb->pipe); if (!ep) return -ENOENT; urb->ep = ep; urb->status = -EINPROGRESS; urb->actual_length = 0; /* Lots of sanity checks, so HCDs can rely on clean data * and don't need to duplicate tests */ xfertype = usb_endpoint_type(&ep->desc); if (xfertype == USB_ENDPOINT_XFER_CONTROL) { struct usb_ctrlrequest *setup = (struct usb_ctrlrequest *) urb->setup_packet; if (!setup) return -ENOEXEC; is_out = !(setup->bRequestType & USB_DIR_IN) || !setup->wLength; dev_WARN_ONCE(&dev->dev, (usb_pipeout(urb->pipe) != is_out), "BOGUS control dir, pipe %x doesn't match bRequestType %x\n", urb->pipe, setup->bRequestType); if (le16_to_cpu(setup->wLength) != urb->transfer_buffer_length) { dev_dbg(&dev->dev, "BOGUS control len %d doesn't match transfer length %d\n", le16_to_cpu(setup->wLength), urb->transfer_buffer_length); return -EBADR; } } else { is_out = usb_endpoint_dir_out(&ep->desc); } /* Clear the internal flags and cache the direction for later use */ urb->transfer_flags &= ~(URB_DIR_MASK | URB_DMA_MAP_SINGLE | URB_DMA_MAP_PAGE | URB_DMA_MAP_SG | URB_MAP_LOCAL | URB_SETUP_MAP_SINGLE | URB_SETUP_MAP_LOCAL | URB_DMA_SG_COMBINED); urb->transfer_flags |= (is_out ? URB_DIR_OUT : URB_DIR_IN); kmsan_handle_urb(urb, is_out); if (xfertype != USB_ENDPOINT_XFER_CONTROL && dev->state < USB_STATE_CONFIGURED) return -ENODEV; max = usb_endpoint_maxp(&ep->desc); if (max <= 0) { dev_dbg(&dev->dev, "bogus endpoint ep%d%s in %s (bad maxpacket %d)\n", usb_endpoint_num(&ep->desc), is_out ? "out" : "in", __func__, max); return -EMSGSIZE; } /* periodic transfers limit size per frame/uframe, * but drivers only control those sizes for ISO. * while we're checking, initialize return status. */ if (xfertype == USB_ENDPOINT_XFER_ISOC) { int n, len; /* SuperSpeed isoc endpoints have up to 16 bursts of up to * 3 packets each */ if (dev->speed >= USB_SPEED_SUPER) { int burst = 1 + ep->ss_ep_comp.bMaxBurst; int mult = USB_SS_MULT(ep->ss_ep_comp.bmAttributes); max *= burst; max *= mult; } if (dev->speed == USB_SPEED_SUPER_PLUS && USB_SS_SSP_ISOC_COMP(ep->ss_ep_comp.bmAttributes)) { struct usb_ssp_isoc_ep_comp_descriptor *isoc_ep_comp; isoc_ep_comp = &ep->ssp_isoc_ep_comp; max = le32_to_cpu(isoc_ep_comp->dwBytesPerInterval); } /* "high bandwidth" mode, 1-3 packets/uframe? */ if (dev->speed == USB_SPEED_HIGH) max *= usb_endpoint_maxp_mult(&ep->desc); if (urb->number_of_packets <= 0) return -EINVAL; for (n = 0; n < urb->number_of_packets; n++) { len = urb->iso_frame_desc[n].length; if (len < 0 || len > max) return -EMSGSIZE; urb->iso_frame_desc[n].status = -EXDEV; urb->iso_frame_desc[n].actual_length = 0; } } else if (urb->num_sgs && !urb->dev->bus->no_sg_constraint) { struct scatterlist *sg; int i; for_each_sg(urb->sg, sg, urb->num_sgs - 1, i) if (sg->length % max) return -EINVAL; } /* the I/O buffer must be mapped/unmapped, except when length=0 */ if (urb->transfer_buffer_length > INT_MAX) return -EMSGSIZE; /* * stuff that drivers shouldn't do, but which shouldn't * cause problems in HCDs if they get it wrong. */ /* Check that the pipe's type matches the endpoint's type */ if (usb_pipe_type_check(urb->dev, urb->pipe)) dev_WARN(&dev->dev, "BOGUS urb xfer, pipe %x != type %x\n", usb_pipetype(urb->pipe), pipetypes[xfertype]); /* Check against a simple/standard policy */ allowed = (URB_NO_TRANSFER_DMA_MAP | URB_NO_INTERRUPT | URB_DIR_MASK | URB_FREE_BUFFER); switch (xfertype) { case USB_ENDPOINT_XFER_BULK: case USB_ENDPOINT_XFER_INT: if (is_out) allowed |= URB_ZERO_PACKET; fallthrough; default: /* all non-iso endpoints */ if (!is_out) allowed |= URB_SHORT_NOT_OK; break; case USB_ENDPOINT_XFER_ISOC: allowed |= URB_ISO_ASAP; break; } allowed &= urb->transfer_flags; /* warn if submitter gave bogus flags */ if (allowed != urb->transfer_flags) dev_WARN(&dev->dev, "BOGUS urb flags, %x --> %x\n", urb->transfer_flags, allowed); /* * Force periodic transfer intervals to be legal values that are * a power of two (so HCDs don't need to). * * FIXME want bus->{intr,iso}_sched_horizon values here. Each HC * supports different values... this uses EHCI/UHCI defaults (and * EHCI can use smaller non-default values). */ switch (xfertype) { case USB_ENDPOINT_XFER_ISOC: case USB_ENDPOINT_XFER_INT: /* too small? */ if (urb->interval <= 0) return -EINVAL; /* too big? */ switch (dev->speed) { case USB_SPEED_SUPER_PLUS: case USB_SPEED_SUPER: /* units are 125us */ /* Handle up to 2^(16-1) microframes */ if (urb->interval > (1 << 15)) return -EINVAL; max = 1 << 15; break; case USB_SPEED_HIGH: /* units are microframes */ /* NOTE usb handles 2^15 */ if (urb->interval > (1024 * 8)) urb->interval = 1024 * 8; max = 1024 * 8; break; case USB_SPEED_FULL: /* units are frames/msec */ case USB_SPEED_LOW: if (xfertype == USB_ENDPOINT_XFER_INT) { if (urb->interval > 255) return -EINVAL; /* NOTE ohci only handles up to 32 */ max = 128; } else { if (urb->interval > 1024) urb->interval = 1024; /* NOTE usb and ohci handle up to 2^15 */ max = 1024; } break; default: return -EINVAL; } /* Round down to a power of 2, no more than max */ urb->interval = min(max, 1 << ilog2(urb->interval)); } return usb_hcd_submit_urb(urb, mem_flags); } EXPORT_SYMBOL_GPL(usb_submit_urb); /*-------------------------------------------------------------------*/ /** * usb_unlink_urb - abort/cancel a transfer request for an endpoint * @urb: pointer to urb describing a previously submitted request, * may be NULL * * This routine cancels an in-progress request. URBs complete only once * per submission, and may be canceled only once per submission. * Successful cancellation means termination of @urb will be expedited * and the completion handler will be called with a status code * indicating that the request has been canceled (rather than any other * code). * * Drivers should not call this routine or related routines, such as * usb_kill_urb() or usb_unlink_anchored_urbs(), after their disconnect * method has returned. The disconnect function should synchronize with * a driver's I/O routines to insure that all URB-related activity has * completed before it returns. * * This request is asynchronous, however the HCD might call the ->complete() * callback during unlink. Therefore when drivers call usb_unlink_urb(), they * must not hold any locks that may be taken by the completion function. * Success is indicated by returning -EINPROGRESS, at which time the URB will * probably not yet have been given back to the device driver. When it is * eventually called, the completion function will see @urb->status == * -ECONNRESET. * Failure is indicated by usb_unlink_urb() returning any other value. * Unlinking will fail when @urb is not currently "linked" (i.e., it was * never submitted, or it was unlinked before, or the hardware is already * finished with it), even if the completion handler has not yet run. * * The URB must not be deallocated while this routine is running. In * particular, when a driver calls this routine, it must insure that the * completion handler cannot deallocate the URB. * * Return: -EINPROGRESS on success. See description for other values on * failure. * * Unlinking and Endpoint Queues: * * [The behaviors and guarantees described below do not apply to virtual * root hubs but only to endpoint queues for physical USB devices.] * * Host Controller Drivers (HCDs) place all the URBs for a particular * endpoint in a queue. Normally the queue advances as the controller * hardware processes each request. But when an URB terminates with an * error its queue generally stops (see below), at least until that URB's * completion routine returns. It is guaranteed that a stopped queue * will not restart until all its unlinked URBs have been fully retired, * with their completion routines run, even if that's not until some time * after the original completion handler returns. The same behavior and * guarantee apply when an URB terminates because it was unlinked. * * Bulk and interrupt endpoint queues are guaranteed to stop whenever an * URB terminates with any sort of error, including -ECONNRESET, -ENOENT, * and -EREMOTEIO. Control endpoint queues behave the same way except * that they are not guaranteed to stop for -EREMOTEIO errors. Queues * for isochronous endpoints are treated differently, because they must * advance at fixed rates. Such queues do not stop when an URB * encounters an error or is unlinked. An unlinked isochronous URB may * leave a gap in the stream of packets; it is undefined whether such * gaps can be filled in. * * Note that early termination of an URB because a short packet was * received will generate a -EREMOTEIO error if and only if the * URB_SHORT_NOT_OK flag is set. By setting this flag, USB device * drivers can build deep queues for large or complex bulk transfers * and clean them up reliably after any sort of aborted transfer by * unlinking all pending URBs at the first fault. * * When a control URB terminates with an error other than -EREMOTEIO, it * is quite likely that the status stage of the transfer will not take * place. */ int usb_unlink_urb(struct urb *urb) { if (!urb) return -EINVAL; if (!urb->dev) return -ENODEV; if (!urb->ep) return -EIDRM; return usb_hcd_unlink_urb(urb, -ECONNRESET); } EXPORT_SYMBOL_GPL(usb_unlink_urb); /** * usb_kill_urb - cancel a transfer request and wait for it to finish * @urb: pointer to URB describing a previously submitted request, * may be NULL * * This routine cancels an in-progress request. It is guaranteed that * upon return all completion handlers will have finished and the URB * will be totally idle and available for reuse. These features make * this an ideal way to stop I/O in a disconnect() callback or close() * function. If the request has not already finished or been unlinked * the completion handler will see urb->status == -ENOENT. * * While the routine is running, attempts to resubmit the URB will fail * with error -EPERM. Thus even if the URB's completion handler always * tries to resubmit, it will not succeed and the URB will become idle. * * The URB must not be deallocated while this routine is running. In * particular, when a driver calls this routine, it must insure that the * completion handler cannot deallocate the URB. * * This routine may not be used in an interrupt context (such as a bottom * half or a completion handler), or when holding a spinlock, or in other * situations where the caller can't schedule(). * * This routine should not be called by a driver after its disconnect * method has returned. */ void usb_kill_urb(struct urb *urb) { might_sleep(); if (!(urb && urb->dev && urb->ep)) return; atomic_inc(&urb->reject); /* * Order the write of urb->reject above before the read * of urb->use_count below. Pairs with the barriers in * __usb_hcd_giveback_urb() and usb_hcd_submit_urb(). */ smp_mb__after_atomic(); usb_hcd_unlink_urb(urb, -ENOENT); wait_event(usb_kill_urb_queue, atomic_read(&urb->use_count) == 0); atomic_dec(&urb->reject); } EXPORT_SYMBOL_GPL(usb_kill_urb); /** * usb_poison_urb - reliably kill a transfer and prevent further use of an URB * @urb: pointer to URB describing a previously submitted request, * may be NULL * * This routine cancels an in-progress request. It is guaranteed that * upon return all completion handlers will have finished and the URB * will be totally idle and cannot be reused. These features make * this an ideal way to stop I/O in a disconnect() callback. * If the request has not already finished or been unlinked * the completion handler will see urb->status == -ENOENT. * * After and while the routine runs, attempts to resubmit the URB will fail * with error -EPERM. Thus even if the URB's completion handler always * tries to resubmit, it will not succeed and the URB will become idle. * * The URB must not be deallocated while this routine is running. In * particular, when a driver calls this routine, it must insure that the * completion handler cannot deallocate the URB. * * This routine may not be used in an interrupt context (such as a bottom * half or a completion handler), or when holding a spinlock, or in other * situations where the caller can't schedule(). * * This routine should not be called by a driver after its disconnect * method has returned. */ void usb_poison_urb(struct urb *urb) { might_sleep(); if (!urb) return; atomic_inc(&urb->reject); /* * Order the write of urb->reject above before the read * of urb->use_count below. Pairs with the barriers in * __usb_hcd_giveback_urb() and usb_hcd_submit_urb(). */ smp_mb__after_atomic(); if (!urb->dev || !urb->ep) return; usb_hcd_unlink_urb(urb, -ENOENT); wait_event(usb_kill_urb_queue, atomic_read(&urb->use_count) == 0); } EXPORT_SYMBOL_GPL(usb_poison_urb); void usb_unpoison_urb(struct urb *urb) { if (!urb) return; atomic_dec(&urb->reject); } EXPORT_SYMBOL_GPL(usb_unpoison_urb); /** * usb_block_urb - reliably prevent further use of an URB * @urb: pointer to URB to be blocked, may be NULL * * After the routine has run, attempts to resubmit the URB will fail * with error -EPERM. Thus even if the URB's completion handler always * tries to resubmit, it will not succeed and the URB will become idle. * * The URB must not be deallocated while this routine is running. In * particular, when a driver calls this routine, it must insure that the * completion handler cannot deallocate the URB. */ void usb_block_urb(struct urb *urb) { if (!urb) return; atomic_inc(&urb->reject); } EXPORT_SYMBOL_GPL(usb_block_urb); /** * usb_kill_anchored_urbs - kill all URBs associated with an anchor * @anchor: anchor the requests are bound to * * This kills all outstanding URBs starting from the back of the queue, * with guarantee that no completer callbacks will take place from the * anchor after this function returns. * * This routine should not be called by a driver after its disconnect * method has returned. */ void usb_kill_anchored_urbs(struct usb_anchor *anchor) { struct urb *victim; int surely_empty; do { spin_lock_irq(&anchor->lock); while (!list_empty(&anchor->urb_list)) { victim = list_entry(anchor->urb_list.prev, struct urb, anchor_list); /* make sure the URB isn't freed before we kill it */ usb_get_urb(victim); spin_unlock_irq(&anchor->lock); /* this will unanchor the URB */ usb_kill_urb(victim); usb_put_urb(victim); spin_lock_irq(&anchor->lock); } surely_empty = usb_anchor_check_wakeup(anchor); spin_unlock_irq(&anchor->lock); cpu_relax(); } while (!surely_empty); } EXPORT_SYMBOL_GPL(usb_kill_anchored_urbs); /** * usb_poison_anchored_urbs - cease all traffic from an anchor * @anchor: anchor the requests are bound to * * this allows all outstanding URBs to be poisoned starting * from the back of the queue. Newly added URBs will also be * poisoned * * This routine should not be called by a driver after its disconnect * method has returned. */ void usb_poison_anchored_urbs(struct usb_anchor *anchor) { struct urb *victim; int surely_empty; do { spin_lock_irq(&anchor->lock); anchor->poisoned = 1; while (!list_empty(&anchor->urb_list)) { victim = list_entry(anchor->urb_list.prev, struct urb, anchor_list); /* make sure the URB isn't freed before we kill it */ usb_get_urb(victim); spin_unlock_irq(&anchor->lock); /* this will unanchor the URB */ usb_poison_urb(victim); usb_put_urb(victim); spin_lock_irq(&anchor->lock); } surely_empty = usb_anchor_check_wakeup(anchor); spin_unlock_irq(&anchor->lock); cpu_relax(); } while (!surely_empty); } EXPORT_SYMBOL_GPL(usb_poison_anchored_urbs); /** * usb_unpoison_anchored_urbs - let an anchor be used successfully again * @anchor: anchor the requests are bound to * * Reverses the effect of usb_poison_anchored_urbs * the anchor can be used normally after it returns */ void usb_unpoison_anchored_urbs(struct usb_anchor *anchor) { unsigned long flags; struct urb *lazarus; spin_lock_irqsave(&anchor->lock, flags); list_for_each_entry(lazarus, &anchor->urb_list, anchor_list) { usb_unpoison_urb(lazarus); } anchor->poisoned = 0; spin_unlock_irqrestore(&anchor->lock, flags); } EXPORT_SYMBOL_GPL(usb_unpoison_anchored_urbs); /** * usb_unlink_anchored_urbs - asynchronously cancel transfer requests en masse * @anchor: anchor the requests are bound to * * this allows all outstanding URBs to be unlinked starting * from the back of the queue. This function is asynchronous. * The unlinking is just triggered. It may happen after this * function has returned. * * This routine should not be called by a driver after its disconnect * method has returned. */ void usb_unlink_anchored_urbs(struct usb_anchor *anchor) { struct urb *victim; while ((victim = usb_get_from_anchor(anchor)) != NULL) { usb_unlink_urb(victim); usb_put_urb(victim); } } EXPORT_SYMBOL_GPL(usb_unlink_anchored_urbs); /** * usb_anchor_suspend_wakeups * @anchor: the anchor you want to suspend wakeups on * * Call this to stop the last urb being unanchored from waking up any * usb_wait_anchor_empty_timeout waiters. This is used in the hcd urb give- * back path to delay waking up until after the completion handler has run. */ void usb_anchor_suspend_wakeups(struct usb_anchor *anchor) { if (anchor) atomic_inc(&anchor->suspend_wakeups); } EXPORT_SYMBOL_GPL(usb_anchor_suspend_wakeups); /** * usb_anchor_resume_wakeups * @anchor: the anchor you want to resume wakeups on * * Allow usb_wait_anchor_empty_timeout waiters to be woken up again, and * wake up any current waiters if the anchor is empty. */ void usb_anchor_resume_wakeups(struct usb_anchor *anchor) { if (!anchor) return; atomic_dec(&anchor->suspend_wakeups); if (usb_anchor_check_wakeup(anchor)) wake_up(&anchor->wait); } EXPORT_SYMBOL_GPL(usb_anchor_resume_wakeups); /** * usb_wait_anchor_empty_timeout - wait for an anchor to be unused * @anchor: the anchor you want to become unused * @timeout: how long you are willing to wait in milliseconds * * Call this is you want to be sure all an anchor's * URBs have finished * * Return: Non-zero if the anchor became unused. Zero on timeout. */ int usb_wait_anchor_empty_timeout(struct usb_anchor *anchor, unsigned int timeout) { return wait_event_timeout(anchor->wait, usb_anchor_check_wakeup(anchor), msecs_to_jiffies(timeout)); } EXPORT_SYMBOL_GPL(usb_wait_anchor_empty_timeout); /** * usb_get_from_anchor - get an anchor's oldest urb * @anchor: the anchor whose urb you want * * This will take the oldest urb from an anchor, * unanchor and return it * * Return: The oldest urb from @anchor, or %NULL if @anchor has no * urbs associated with it. */ struct urb *usb_get_from_anchor(struct usb_anchor *anchor) { struct urb *victim; unsigned long flags; spin_lock_irqsave(&anchor->lock, flags); if (!list_empty(&anchor->urb_list)) { victim = list_entry(anchor->urb_list.next, struct urb, anchor_list); usb_get_urb(victim); __usb_unanchor_urb(victim, anchor); } else { victim = NULL; } spin_unlock_irqrestore(&anchor->lock, flags); return victim; } EXPORT_SYMBOL_GPL(usb_get_from_anchor); /** * usb_scuttle_anchored_urbs - unanchor all an anchor's urbs * @anchor: the anchor whose urbs you want to unanchor * * use this to get rid of all an anchor's urbs */ void usb_scuttle_anchored_urbs(struct usb_anchor *anchor) { struct urb *victim; unsigned long flags; int surely_empty; do { spin_lock_irqsave(&anchor->lock, flags); while (!list_empty(&anchor->urb_list)) { victim = list_entry(anchor->urb_list.prev, struct urb, anchor_list); __usb_unanchor_urb(victim, anchor); } surely_empty = usb_anchor_check_wakeup(anchor); spin_unlock_irqrestore(&anchor->lock, flags); cpu_relax(); } while (!surely_empty); } EXPORT_SYMBOL_GPL(usb_scuttle_anchored_urbs); /** * usb_anchor_empty - is an anchor empty * @anchor: the anchor you want to query * * Return: 1 if the anchor has no urbs associated with it. */ int usb_anchor_empty(struct usb_anchor *anchor) { return list_empty(&anchor->urb_list); } EXPORT_SYMBOL_GPL(usb_anchor_empty); |
2 2 2 2 2 2 2 2 2 2 2 2 2 | 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60 61 62 63 64 65 66 67 68 69 70 71 72 73 74 75 76 77 78 79 80 81 82 83 84 85 86 87 88 89 90 91 92 93 94 95 96 97 98 99 100 101 102 103 104 105 106 107 108 109 110 111 112 113 114 115 116 117 118 119 120 121 122 123 124 125 126 127 128 129 130 131 132 133 134 135 136 137 138 139 140 141 142 143 144 145 146 147 148 149 150 151 152 153 154 155 156 157 158 159 160 161 162 163 164 165 166 167 168 169 170 171 172 173 174 175 176 177 178 179 180 181 182 183 184 185 186 187 188 189 190 191 192 193 194 195 196 197 198 199 200 201 202 203 204 205 206 207 208 209 210 211 212 213 214 215 216 217 218 219 220 221 222 223 224 225 226 227 228 229 230 231 232 233 234 235 236 237 238 239 240 241 242 243 244 245 246 247 248 249 250 251 252 253 254 255 256 257 258 259 260 261 262 263 264 265 266 267 268 269 270 271 272 273 274 275 276 277 278 279 280 281 282 283 284 285 286 287 288 289 290 291 292 293 294 295 296 297 298 299 300 301 302 303 304 305 306 307 308 309 310 311 312 313 314 315 316 317 318 319 320 321 322 323 324 325 326 327 328 329 330 331 332 333 334 335 336 337 338 339 340 341 342 343 344 345 346 347 348 349 350 351 352 353 354 355 356 357 358 359 360 361 362 363 364 365 366 367 368 369 370 371 372 373 374 375 376 377 378 379 380 381 382 383 384 385 386 387 388 389 390 391 392 393 394 395 396 397 398 399 400 401 402 403 404 405 406 407 408 409 410 411 412 413 414 415 416 417 418 419 420 421 422 423 424 425 426 427 428 429 430 431 432 433 434 435 436 437 438 439 440 441 442 443 444 445 446 447 448 449 450 451 452 453 454 455 456 457 458 459 460 461 462 463 464 465 466 467 468 469 470 471 472 473 474 475 476 477 478 479 480 481 482 483 484 485 486 487 488 489 490 491 492 493 494 495 496 497 498 499 500 501 502 503 504 505 506 507 508 509 510 511 512 513 514 515 516 517 518 519 520 521 522 523 524 525 526 527 528 529 530 531 532 533 534 535 536 537 538 539 540 541 542 543 544 545 546 547 548 549 550 551 552 553 554 555 556 557 558 559 560 561 562 563 564 565 566 567 568 569 570 571 572 573 574 575 576 577 578 579 580 581 582 583 584 585 586 587 588 589 590 591 592 593 594 595 596 597 598 599 600 601 602 603 604 605 606 607 608 609 610 611 612 613 614 615 616 617 618 619 620 621 622 623 624 625 626 627 628 629 630 631 632 633 634 635 636 637 638 639 640 641 642 643 644 645 646 647 648 649 650 651 652 653 654 655 656 657 658 659 660 661 662 663 664 665 666 667 668 669 670 671 672 673 674 675 676 677 678 679 680 681 682 683 684 685 686 687 688 689 690 691 692 693 694 695 696 697 698 699 700 701 702 703 704 705 706 707 708 709 710 711 712 713 714 715 716 717 718 719 720 721 722 723 724 725 726 727 728 729 730 731 732 733 734 735 736 737 738 739 740 741 742 743 744 745 746 747 748 749 750 751 752 753 754 755 756 757 758 759 760 761 762 763 764 765 766 767 768 769 770 771 772 773 774 775 776 777 778 779 780 781 782 783 784 785 786 787 788 789 790 791 792 793 794 795 796 797 798 799 800 801 802 803 804 805 806 807 808 809 810 811 812 813 814 815 816 817 818 819 820 821 822 823 824 825 826 827 828 829 830 831 832 833 834 835 836 837 838 839 840 841 842 843 844 845 846 847 848 849 850 851 852 853 854 855 856 857 858 859 860 861 862 863 864 865 866 867 868 869 870 871 872 873 874 875 876 877 878 879 880 881 882 883 884 885 886 887 888 889 890 891 892 893 894 895 896 897 898 899 900 901 902 903 904 905 906 907 908 909 910 911 912 913 914 915 916 917 918 919 920 921 922 923 924 925 926 927 928 929 930 931 932 933 934 935 936 937 938 939 940 941 942 943 944 945 946 947 948 949 950 951 952 953 954 955 956 957 958 959 960 961 962 963 964 965 966 967 968 969 970 971 972 973 974 975 976 977 978 979 980 981 982 983 984 985 986 987 988 989 990 991 992 993 994 995 996 997 998 999 1000 1001 1002 1003 1004 1005 1006 1007 1008 1009 1010 1011 1012 1013 1014 1015 1016 1017 1018 1019 1020 1021 1022 1023 1024 1025 1026 1027 1028 1029 1030 1031 1032 1033 1034 1035 1036 1037 1038 1039 1040 1041 1042 1043 1044 1045 1046 1047 1048 1049 1050 1051 1052 1053 1054 1055 1056 1057 1058 1059 1060 1061 1062 1063 1064 1065 1066 1067 1068 1069 1070 1071 1072 1073 1074 1075 1076 1077 1078 1079 1080 1081 1082 1083 1084 1085 1086 1087 1088 1089 1090 1091 1092 1093 1094 1095 1096 1097 1098 1099 1100 1101 1102 1103 1104 1105 1106 1107 1108 1109 1110 1111 1112 1113 1114 1115 1116 1117 1118 1119 1120 1121 1122 1123 1124 1125 1126 1127 1128 1129 1130 1131 1132 1133 1134 1135 1136 1137 1138 1139 1140 1141 1142 1143 1144 1145 1146 1147 1148 1149 1150 1151 1152 1153 1154 1155 1156 1157 1158 1159 1160 1161 1162 1163 1164 1165 1166 1167 1168 1169 1170 1171 1172 1173 1174 1175 1176 1177 1178 1179 1180 1181 1182 1183 1184 1185 1186 1187 1188 1189 1190 1191 1192 1193 1194 1195 1196 1197 1198 1199 1200 1201 1202 1203 1204 1205 1206 1207 1208 1209 1210 1211 1212 1213 1214 1215 1216 1217 1218 1219 1220 1221 1222 1223 1224 1225 1226 1227 1228 1229 1230 1231 1232 1233 1234 1235 1236 1237 1238 1239 1240 1241 1242 1243 1244 1245 1246 1247 1248 1249 1250 1251 1252 1253 1254 1255 1256 1257 1258 1259 1260 1261 1262 1263 1264 1265 1266 1267 1268 1269 1270 1271 1272 1273 1274 1275 1276 1277 1278 1279 1280 1281 1282 1283 1284 1285 1286 1287 1288 1289 1290 1291 1292 1293 1294 1295 1296 1297 1298 1299 1300 1301 1302 1303 1304 1305 1306 1307 1308 1309 1310 1311 1312 1313 1314 1315 1316 1317 1318 1319 1320 1321 1322 1323 1324 1325 1326 1327 1328 1329 1330 1331 1332 1333 1334 1335 1336 1337 1338 1339 1340 1341 1342 1343 1344 1345 1346 1347 1348 1349 1350 1351 1352 1353 1354 1355 1356 1357 1358 1359 1360 1361 1362 1363 1364 1365 1366 1367 1368 1369 1370 1371 1372 1373 1374 1375 1376 1377 1378 1379 1380 1381 1382 1383 1384 1385 1386 1387 1388 1389 1390 1391 1392 1393 1394 1395 1396 1397 1398 1399 1400 1401 1402 1403 1404 1405 1406 1407 1408 1409 1410 1411 1412 1413 1414 1415 1416 1417 1418 1419 1420 1421 1422 1423 1424 1425 1426 1427 1428 1429 1430 1431 1432 1433 1434 1435 1436 1437 1438 1439 1440 1441 1442 1443 1444 1445 1446 1447 1448 1449 1450 1451 1452 1453 1454 1455 1456 1457 1458 1459 1460 1461 1462 1463 1464 1465 1466 1467 1468 1469 1470 1471 1472 1473 1474 1475 1476 1477 1478 1479 1480 1481 1482 1483 1484 1485 1486 1487 1488 1489 1490 1491 1492 1493 1494 1495 1496 1497 1498 1499 1500 1501 1502 1503 1504 1505 1506 1507 1508 1509 1510 1511 1512 1513 1514 1515 1516 1517 1518 1519 1520 1521 1522 1523 1524 1525 1526 1527 1528 1529 1530 1531 1532 1533 1534 1535 1536 1537 1538 1539 1540 1541 1542 1543 1544 1545 1546 1547 1548 1549 1550 1551 1552 1553 1554 1555 1556 1557 1558 1559 1560 1561 1562 1563 1564 1565 1566 1567 1568 1569 1570 1571 1572 1573 1574 1575 1576 1577 1578 1579 1580 1581 1582 1583 1584 1585 1586 1587 1588 1589 1590 1591 1592 1593 1594 1595 1596 1597 1598 1599 1600 1601 1602 1603 1604 1605 1606 1607 1608 1609 1610 1611 1612 1613 1614 1615 1616 1617 1618 1619 1620 1621 1622 1623 1624 1625 1626 1627 1628 1629 1630 1631 1632 1633 1634 1635 1636 1637 1638 1639 1640 1641 1642 1643 1644 1645 1646 1647 1648 1649 1650 1651 1652 1653 1654 1655 1656 1657 1658 1659 1660 1661 1662 1663 1664 1665 1666 1667 1668 1669 1670 1671 1672 1673 1674 1675 1676 1677 1678 1679 1680 1681 1682 1683 1684 1685 1686 1687 1688 1689 1690 1691 1692 1693 1694 1695 1696 1697 1698 1699 1700 1701 1702 1703 1704 1705 1706 1707 1708 1709 1710 1711 1712 1713 1714 1715 1716 1717 1718 1719 1720 1721 1722 1723 1724 1725 1726 1727 1728 1729 1730 1731 1732 1733 1734 1735 1736 1737 1738 1739 1740 1741 1742 1743 1744 1745 1746 1747 1748 1749 1750 1751 1752 1753 1754 1755 1756 1757 1758 1759 1760 1761 1762 1763 1764 1765 1766 1767 1768 1769 1770 1771 1772 1773 1774 1775 1776 1777 1778 1779 1780 1781 1782 1783 1784 1785 1786 1787 1788 1789 1790 1791 1792 1793 | // SPDX-License-Identifier: GPL-2.0 /* * Universal Host Controller Interface driver for USB. * * Maintainer: Alan Stern <stern@rowland.harvard.edu> * * (C) Copyright 1999 Linus Torvalds * (C) Copyright 1999-2002 Johannes Erdfelt, johannes@erdfelt.com * (C) Copyright 1999 Randy Dunlap * (C) Copyright 1999 Georg Acher, acher@in.tum.de * (C) Copyright 1999 Deti Fliegl, deti@fliegl.de * (C) Copyright 1999 Thomas Sailer, sailer@ife.ee.ethz.ch * (C) Copyright 1999 Roman Weissgaerber, weissg@vienna.at * (C) Copyright 2000 Yggdrasil Computing, Inc. (port of new PCI interface * support from usb-ohci.c by Adam Richter, adam@yggdrasil.com). * (C) Copyright 1999 Gregory P. Smith (from usb-ohci.c) * (C) Copyright 2004-2007 Alan Stern, stern@rowland.harvard.edu */ /* * Technically, updating td->status here is a race, but it's not really a * problem. The worst that can happen is that we set the IOC bit again * generating a spurious interrupt. We could fix this by creating another * QH and leaving the IOC bit always set, but then we would have to play * games with the FSBR code to make sure we get the correct order in all * the cases. I don't think it's worth the effort */ static void uhci_set_next_interrupt(struct uhci_hcd *uhci) { if (uhci->is_stopped) mod_timer(&uhci_to_hcd(uhci)->rh_timer, jiffies); uhci->term_td->status |= cpu_to_hc32(uhci, TD_CTRL_IOC); } static inline void uhci_clear_next_interrupt(struct uhci_hcd *uhci) { uhci->term_td->status &= ~cpu_to_hc32(uhci, TD_CTRL_IOC); } /* * Full-Speed Bandwidth Reclamation (FSBR). * We turn on FSBR whenever a queue that wants it is advancing, * and leave it on for a short time thereafter. */ static void uhci_fsbr_on(struct uhci_hcd *uhci) { struct uhci_qh *lqh; /* The terminating skeleton QH always points back to the first * FSBR QH. Make the last async QH point to the terminating * skeleton QH. */ uhci->fsbr_is_on = 1; lqh = list_entry(uhci->skel_async_qh->node.prev, struct uhci_qh, node); lqh->link = LINK_TO_QH(uhci, uhci->skel_term_qh); } static void uhci_fsbr_off(struct uhci_hcd *uhci) { struct uhci_qh *lqh; /* Remove the link from the last async QH to the terminating * skeleton QH. */ uhci->fsbr_is_on = 0; lqh = list_entry(uhci->skel_async_qh->node.prev, struct uhci_qh, node); lqh->link = UHCI_PTR_TERM(uhci); } static void uhci_add_fsbr(struct uhci_hcd *uhci, struct urb *urb) { struct urb_priv *urbp = urb->hcpriv; urbp->fsbr = 1; } static void uhci_urbp_wants_fsbr(struct uhci_hcd *uhci, struct urb_priv *urbp) { if (urbp->fsbr) { uhci->fsbr_is_wanted = 1; if (!uhci->fsbr_is_on) uhci_fsbr_on(uhci); else if (uhci->fsbr_expiring) { uhci->fsbr_expiring = 0; del_timer(&uhci->fsbr_timer); } } } static void uhci_fsbr_timeout(struct timer_list *t) { struct uhci_hcd *uhci = from_timer(uhci, t, fsbr_timer); unsigned long flags; spin_lock_irqsave(&uhci->lock, flags); if (uhci->fsbr_expiring) { uhci->fsbr_expiring = 0; uhci_fsbr_off(uhci); } spin_unlock_irqrestore(&uhci->lock, flags); } static struct uhci_td *uhci_alloc_td(struct uhci_hcd *uhci) { dma_addr_t dma_handle; struct uhci_td *td; td = dma_pool_alloc(uhci->td_pool, GFP_ATOMIC, &dma_handle); if (!td) return NULL; td->dma_handle = dma_handle; td->frame = -1; INIT_LIST_HEAD(&td->list); INIT_LIST_HEAD(&td->fl_list); return td; } static void uhci_free_td(struct uhci_hcd *uhci, struct uhci_td *td) { if (!list_empty(&td->list)) dev_WARN(uhci_dev(uhci), "td %p still in list!\n", td); if (!list_empty(&td->fl_list)) dev_WARN(uhci_dev(uhci), "td %p still in fl_list!\n", td); dma_pool_free(uhci->td_pool, td, td->dma_handle); } static inline void uhci_fill_td(struct uhci_hcd *uhci, struct uhci_td *td, u32 status, u32 token, u32 buffer) { td->status = cpu_to_hc32(uhci, status); td->token = cpu_to_hc32(uhci, token); td->buffer = cpu_to_hc32(uhci, buffer); } static void uhci_add_td_to_urbp(struct uhci_td *td, struct urb_priv *urbp) { list_add_tail(&td->list, &urbp->td_list); } static void uhci_remove_td_from_urbp(struct uhci_td *td) { list_del_init(&td->list); } /* * We insert Isochronous URBs directly into the frame list at the beginning */ static inline void uhci_insert_td_in_frame_list(struct uhci_hcd *uhci, struct uhci_td *td, unsigned framenum) { framenum &= (UHCI_NUMFRAMES - 1); td->frame = framenum; /* Is there a TD already mapped there? */ if (uhci->frame_cpu[framenum]) { struct uhci_td *ftd, *ltd; ftd = uhci->frame_cpu[framenum]; ltd = list_entry(ftd->fl_list.prev, struct uhci_td, fl_list); list_add_tail(&td->fl_list, &ftd->fl_list); td->link = ltd->link; wmb(); ltd->link = LINK_TO_TD(uhci, td); } else { td->link = uhci->frame[framenum]; wmb(); uhci->frame[framenum] = LINK_TO_TD(uhci, td); uhci->frame_cpu[framenum] = td; } } static inline void uhci_remove_td_from_frame_list(struct uhci_hcd *uhci, struct uhci_td *td) { /* If it's not inserted, don't remove it */ if (td->frame == -1) { WARN_ON(!list_empty(&td->fl_list)); return; } if (uhci->frame_cpu[td->frame] == td) { if (list_empty(&td->fl_list)) { uhci->frame[td->frame] = td->link; uhci->frame_cpu[td->frame] = NULL; } else { struct uhci_td *ntd; ntd = list_entry(td->fl_list.next, struct uhci_td, fl_list); uhci->frame[td->frame] = LINK_TO_TD(uhci, ntd); uhci->frame_cpu[td->frame] = ntd; } } else { struct uhci_td *ptd; ptd = list_entry(td->fl_list.prev, struct uhci_td, fl_list); ptd->link = td->link; } list_del_init(&td->fl_list); td->frame = -1; } static inline void uhci_remove_tds_from_frame(struct uhci_hcd *uhci, unsigned int framenum) { struct uhci_td *ftd, *ltd; framenum &= (UHCI_NUMFRAMES - 1); ftd = uhci->frame_cpu[framenum]; if (ftd) { ltd = list_entry(ftd->fl_list.prev, struct uhci_td, fl_list); uhci->frame[framenum] = ltd->link; uhci->frame_cpu[framenum] = NULL; while (!list_empty(&ftd->fl_list)) list_del_init(ftd->fl_list.prev); } } /* * Remove all the TDs for an Isochronous URB from the frame list */ static void uhci_unlink_isochronous_tds(struct uhci_hcd *uhci, struct urb *urb) { struct urb_priv *urbp = (struct urb_priv *) urb->hcpriv; struct uhci_td *td; list_for_each_entry(td, &urbp->td_list, list) uhci_remove_td_from_frame_list(uhci, td); } static struct uhci_qh *uhci_alloc_qh(struct uhci_hcd *uhci, struct usb_device *udev, struct usb_host_endpoint *hep) { dma_addr_t dma_handle; struct uhci_qh *qh; qh = dma_pool_zalloc(uhci->qh_pool, GFP_ATOMIC, &dma_handle); if (!qh) return NULL; qh->dma_handle = dma_handle; qh->element = UHCI_PTR_TERM(uhci); qh->link = UHCI_PTR_TERM(uhci); INIT_LIST_HEAD(&qh->queue); INIT_LIST_HEAD(&qh->node); if (udev) { /* Normal QH */ qh->type = usb_endpoint_type(&hep->desc); if (qh->type != USB_ENDPOINT_XFER_ISOC) { qh->dummy_td = uhci_alloc_td(uhci); if (!qh->dummy_td) { dma_pool_free(uhci->qh_pool, qh, dma_handle); return NULL; } } qh->state = QH_STATE_IDLE; qh->hep = hep; qh->udev = udev; hep->hcpriv = qh; if (qh->type == USB_ENDPOINT_XFER_INT || qh->type == USB_ENDPOINT_XFER_ISOC) qh->load = usb_calc_bus_time(udev->speed, usb_endpoint_dir_in(&hep->desc), qh->type == USB_ENDPOINT_XFER_ISOC, usb_endpoint_maxp(&hep->desc)) / 1000 + 1; } else { /* Skeleton QH */ qh->state = QH_STATE_ACTIVE; qh->type = -1; } return qh; } static void uhci_free_qh(struct uhci_hcd *uhci, struct uhci_qh *qh) { WARN_ON(qh->state != QH_STATE_IDLE && qh->udev); if (!list_empty(&qh->queue)) dev_WARN(uhci_dev(uhci), "qh %p list not empty!\n", qh); list_del(&qh->node); if (qh->udev) { qh->hep->hcpriv = NULL; if (qh->dummy_td) uhci_free_td(uhci, qh->dummy_td); } dma_pool_free(uhci->qh_pool, qh, qh->dma_handle); } /* * When a queue is stopped and a dequeued URB is given back, adjust * the previous TD link (if the URB isn't first on the queue) or * save its toggle value (if it is first and is currently executing). * * Returns 0 if the URB should not yet be given back, 1 otherwise. */ static int uhci_cleanup_queue(struct uhci_hcd *uhci, struct uhci_qh *qh, struct urb *urb) { struct urb_priv *urbp = urb->hcpriv; struct uhci_td *td; int ret = 1; /* Isochronous pipes don't use toggles and their TD link pointers * get adjusted during uhci_urb_dequeue(). But since their queues * cannot truly be stopped, we have to watch out for dequeues * occurring after the nominal unlink frame. */ if (qh->type == USB_ENDPOINT_XFER_ISOC) { ret = (uhci->frame_number + uhci->is_stopped != qh->unlink_frame); goto done; } /* If the URB isn't first on its queue, adjust the link pointer * of the last TD in the previous URB. The toggle doesn't need * to be saved since this URB can't be executing yet. */ if (qh->queue.next != &urbp->node) { struct urb_priv *purbp; struct uhci_td *ptd; purbp = list_entry(urbp->node.prev, struct urb_priv, node); WARN_ON(list_empty(&purbp->td_list)); ptd = list_entry(purbp->td_list.prev, struct uhci_td, list); td = list_entry(urbp->td_list.prev, struct uhci_td, list); ptd->link = td->link; goto done; } /* If the QH element pointer is UHCI_PTR_TERM then then currently * executing URB has already been unlinked, so this one isn't it. */ if (qh_element(qh) == UHCI_PTR_TERM(uhci)) goto done; qh->element = UHCI_PTR_TERM(uhci); /* Control pipes don't have to worry about toggles */ if (qh->type == USB_ENDPOINT_XFER_CONTROL) goto done; /* Save the next toggle value */ WARN_ON(list_empty(&urbp->td_list)); td = list_entry(urbp->td_list.next, struct uhci_td, list); qh->needs_fixup = 1; qh->initial_toggle = uhci_toggle(td_token(uhci, td)); done: return ret; } /* * Fix up the data toggles for URBs in a queue, when one of them * terminates early (short transfer, error, or dequeued). */ static void uhci_fixup_toggles(struct uhci_hcd *uhci, struct uhci_qh *qh, int skip_first) { struct urb_priv *urbp = NULL; struct uhci_td *td; unsigned int toggle = qh->initial_toggle; unsigned int pipe; /* Fixups for a short transfer start with the second URB in the * queue (the short URB is the first). */ if (skip_first) urbp = list_entry(qh->queue.next, struct urb_priv, node); /* When starting with the first URB, if the QH element pointer is * still valid then we know the URB's toggles are okay. */ else if (qh_element(qh) != UHCI_PTR_TERM(uhci)) toggle = 2; /* Fix up the toggle for the URBs in the queue. Normally this * loop won't run more than once: When an error or short transfer * occurs, the queue usually gets emptied. */ urbp = list_prepare_entry(urbp, &qh->queue, node); list_for_each_entry_continue(urbp, &qh->queue, node) { /* If the first TD has the right toggle value, we don't * need to change any toggles in this URB */ td = list_entry(urbp->td_list.next, struct uhci_td, list); if (toggle > 1 || uhci_toggle(td_token(uhci, td)) == toggle) { td = list_entry(urbp->td_list.prev, struct uhci_td, list); toggle = uhci_toggle(td_token(uhci, td)) ^ 1; /* Otherwise all the toggles in the URB have to be switched */ } else { list_for_each_entry(td, &urbp->td_list, list) { td->token ^= cpu_to_hc32(uhci, TD_TOKEN_TOGGLE); toggle ^= 1; } } } wmb(); pipe = list_entry(qh->queue.next, struct urb_priv, node)->urb->pipe; usb_settoggle(qh->udev, usb_pipeendpoint(pipe), usb_pipeout(pipe), toggle); qh->needs_fixup = 0; } /* * Link an Isochronous QH into its skeleton's list */ static inline void link_iso(struct uhci_hcd *uhci, struct uhci_qh *qh) { list_add_tail(&qh->node, &uhci->skel_iso_qh->node); /* Isochronous QHs aren't linked by the hardware */ } /* * Link a high-period interrupt QH into the schedule at the end of its * skeleton's list */ static void link_interrupt(struct uhci_hcd *uhci, struct uhci_qh *qh) { struct uhci_qh *pqh; list_add_tail(&qh->node, &uhci->skelqh[qh->skel]->node); pqh = list_entry(qh->node.prev, struct uhci_qh, node); qh->link = pqh->link; wmb(); pqh->link = LINK_TO_QH(uhci, qh); } /* * Link a period-1 interrupt or async QH into the schedule at the * correct spot in the async skeleton's list, and update the FSBR link */ static void link_async(struct uhci_hcd *uhci, struct uhci_qh *qh) { struct uhci_qh *pqh; __hc32 link_to_new_qh; /* Find the predecessor QH for our new one and insert it in the list. * The list of QHs is expected to be short, so linear search won't * take too long. */ list_for_each_entry_reverse(pqh, &uhci->skel_async_qh->node, node) { if (pqh->skel <= qh->skel) break; } list_add(&qh->node, &pqh->node); /* Link it into the schedule */ qh->link = pqh->link; wmb(); link_to_new_qh = LINK_TO_QH(uhci, qh); pqh->link = link_to_new_qh; /* If this is now the first FSBR QH, link the terminating skeleton * QH to it. */ if (pqh->skel < SKEL_FSBR && qh->skel >= SKEL_FSBR) uhci->skel_term_qh->link = link_to_new_qh; } /* * Put a QH on the schedule in both hardware and software */ static void uhci_activate_qh(struct uhci_hcd *uhci, struct uhci_qh *qh) { WARN_ON(list_empty(&qh->queue)); /* Set the element pointer if it isn't set already. * This isn't needed for Isochronous queues, but it doesn't hurt. */ if (qh_element(qh) == UHCI_PTR_TERM(uhci)) { struct urb_priv *urbp = list_entry(qh->queue.next, struct urb_priv, node); struct uhci_td *td = list_entry(urbp->td_list.next, struct uhci_td, list); qh->element = LINK_TO_TD(uhci, td); } /* Treat the queue as if it has just advanced */ qh->wait_expired = 0; qh->advance_jiffies = jiffies; if (qh->state == QH_STATE_ACTIVE) return; qh->state = QH_STATE_ACTIVE; /* Move the QH from its old list to the correct spot in the appropriate * skeleton's list */ if (qh == uhci->next_qh) uhci->next_qh = list_entry(qh->node.next, struct uhci_qh, node); list_del(&qh->node); if (qh->skel == SKEL_ISO) link_iso(uhci, qh); else if (qh->skel < SKEL_ASYNC) link_interrupt(uhci, qh); else link_async(uhci, qh); } /* * Unlink a high-period interrupt QH from the schedule */ static void unlink_interrupt(struct uhci_hcd *uhci, struct uhci_qh *qh) { struct uhci_qh *pqh; pqh = list_entry(qh->node.prev, struct uhci_qh, node); pqh->link = qh->link; mb(); } /* * Unlink a period-1 interrupt or async QH from the schedule */ static void unlink_async(struct uhci_hcd *uhci, struct uhci_qh *qh) { struct uhci_qh *pqh; __hc32 link_to_next_qh = qh->link; pqh = list_entry(qh->node.prev, struct uhci_qh, node); pqh->link = link_to_next_qh; /* If this was the old first FSBR QH, link the terminating skeleton * QH to the next (new first FSBR) QH. */ if (pqh->skel < SKEL_FSBR && qh->skel >= SKEL_FSBR) uhci->skel_term_qh->link = link_to_next_qh; mb(); } /* * Take a QH off the hardware schedule */ static void uhci_unlink_qh(struct uhci_hcd *uhci, struct uhci_qh *qh) { if (qh->state == QH_STATE_UNLINKING) return; WARN_ON(qh->state != QH_STATE_ACTIVE || !qh->udev); qh->state = QH_STATE_UNLINKING; /* Unlink the QH from the schedule and record when we did it */ if (qh->skel == SKEL_ISO) ; else if (qh->skel < SKEL_ASYNC) unlink_interrupt(uhci, qh); else unlink_async(uhci, qh); uhci_get_current_frame_number(uhci); qh->unlink_frame = uhci->frame_number; /* Force an interrupt so we know when the QH is fully unlinked */ if (list_empty(&uhci->skel_unlink_qh->node) || uhci->is_stopped) uhci_set_next_interrupt(uhci); /* Move the QH from its old list to the end of the unlinking list */ if (qh == uhci->next_qh) uhci->next_qh = list_entry(qh->node.next, struct uhci_qh, node); list_move_tail(&qh->node, &uhci->skel_unlink_qh->node); } /* * When we and the controller are through with a QH, it becomes IDLE. * This happens when a QH has been off the schedule (on the unlinking * list) for more than one frame, or when an error occurs while adding * the first URB onto a new QH. */ static void uhci_make_qh_idle(struct uhci_hcd *uhci, struct uhci_qh *qh) { WARN_ON(qh->state == QH_STATE_ACTIVE); if (qh == uhci->next_qh) uhci->next_qh = list_entry(qh->node.next, struct uhci_qh, node); list_move(&qh->node, &uhci->idle_qh_list); qh->state = QH_STATE_IDLE; /* Now that the QH is idle, its post_td isn't being used */ if (qh->post_td) { uhci_free_td(uhci, qh->post_td); qh->post_td = NULL; } /* If anyone is waiting for a QH to become idle, wake them up */ if (uhci->num_waiting) wake_up_all(&uhci->waitqh); } /* * Find the highest existing bandwidth load for a given phase and period. */ static int uhci_highest_load(struct uhci_hcd *uhci, int phase, int period) { int highest_load = uhci->load[phase]; for (phase += period; phase < MAX_PHASE; phase += period) highest_load = max_t(int, highest_load, uhci->load[phase]); return highest_load; } /* * Set qh->phase to the optimal phase for a periodic transfer and * check whether the bandwidth requirement is acceptable. */ static int uhci_check_bandwidth(struct uhci_hcd *uhci, struct uhci_qh *qh) { int minimax_load; /* Find the optimal phase (unless it is already set) and get * its load value. */ if (qh->phase >= 0) minimax_load = uhci_highest_load(uhci, qh->phase, qh->period); else { int phase, load; int max_phase = min_t(int, MAX_PHASE, qh->period); qh->phase = 0; minimax_load = uhci_highest_load(uhci, qh->phase, qh->period); for (phase = 1; phase < max_phase; ++phase) { load = uhci_highest_load(uhci, phase, qh->period); if (load < minimax_load) { minimax_load = load; qh->phase = phase; } } } /* Maximum allowable periodic bandwidth is 90%, or 900 us per frame */ if (minimax_load + qh->load > 900) { dev_dbg(uhci_dev(uhci), "bandwidth allocation failed: " "period %d, phase %d, %d + %d us\n", qh->period, qh->phase, minimax_load, qh->load); return -ENOSPC; } return 0; } /* * Reserve a periodic QH's bandwidth in the schedule */ static void uhci_reserve_bandwidth(struct uhci_hcd *uhci, struct uhci_qh *qh) { int i; int load = qh->load; char *p = "??"; for (i = qh->phase; i < MAX_PHASE; i += qh->period) { uhci->load[i] += load; uhci->total_load += load; } uhci_to_hcd(uhci)->self.bandwidth_allocated = uhci->total_load / MAX_PHASE; switch (qh->type) { case USB_ENDPOINT_XFER_INT: ++uhci_to_hcd(uhci)->self.bandwidth_int_reqs; p = "INT"; break; case USB_ENDPOINT_XFER_ISOC: ++uhci_to_hcd(uhci)->self.bandwidth_isoc_reqs; p = "ISO"; break; } qh->bandwidth_reserved = 1; dev_dbg(uhci_dev(uhci), "%s dev %d ep%02x-%s, period %d, phase %d, %d us\n", "reserve", qh->udev->devnum, qh->hep->desc.bEndpointAddress, p, qh->period, qh->phase, load); } /* * Release a periodic QH's bandwidth reservation */ static void uhci_release_bandwidth(struct uhci_hcd *uhci, struct uhci_qh *qh) { int i; int load = qh->load; char *p = "??"; for (i = qh->phase; i < MAX_PHASE; i += qh->period) { uhci->load[i] -= load; uhci->total_load -= load; } uhci_to_hcd(uhci)->self.bandwidth_allocated = uhci->total_load / MAX_PHASE; switch (qh->type) { case USB_ENDPOINT_XFER_INT: --uhci_to_hcd(uhci)->self.bandwidth_int_reqs; p = "INT"; break; case USB_ENDPOINT_XFER_ISOC: --uhci_to_hcd(uhci)->self.bandwidth_isoc_reqs; p = "ISO"; break; } qh->bandwidth_reserved = 0; dev_dbg(uhci_dev(uhci), "%s dev %d ep%02x-%s, period %d, phase %d, %d us\n", "release", qh->udev->devnum, qh->hep->desc.bEndpointAddress, p, qh->period, qh->phase, load); } static inline struct urb_priv *uhci_alloc_urb_priv(struct uhci_hcd *uhci, struct urb *urb) { struct urb_priv *urbp; urbp = kmem_cache_zalloc(uhci_up_cachep, GFP_ATOMIC); if (!urbp) return NULL; urbp->urb = urb; urb->hcpriv = urbp; INIT_LIST_HEAD(&urbp->node); INIT_LIST_HEAD(&urbp->td_list); return urbp; } static void uhci_free_urb_priv(struct uhci_hcd *uhci, struct urb_priv *urbp) { struct uhci_td *td, *tmp; if (!list_empty(&urbp->node)) dev_WARN(uhci_dev(uhci), "urb %p still on QH's list!\n", urbp->urb); list_for_each_entry_safe(td, tmp, &urbp->td_list, list) { uhci_remove_td_from_urbp(td); uhci_free_td(uhci, td); } kmem_cache_free(uhci_up_cachep, urbp); } /* * Map status to standard result codes * * <status> is (td_status(uhci, td) & 0xF60000), a.k.a. * uhci_status_bits(td_status(uhci, td)). * Note: <status> does not include the TD_CTRL_NAK bit. * <dir_out> is True for output TDs and False for input TDs. */ static int uhci_map_status(int status, int dir_out) { if (!status) return 0; if (status & TD_CTRL_BITSTUFF) /* Bitstuff error */ return -EPROTO; if (status & TD_CTRL_CRCTIMEO) { /* CRC/Timeout */ if (dir_out) return -EPROTO; else return -EILSEQ; } if (status & TD_CTRL_BABBLE) /* Babble */ return -EOVERFLOW; if (status & TD_CTRL_DBUFERR) /* Buffer error */ return -ENOSR; if (status & TD_CTRL_STALLED) /* Stalled */ return -EPIPE; return 0; } /* * Control transfers */ static int uhci_submit_control(struct uhci_hcd *uhci, struct urb *urb, struct uhci_qh *qh) { struct uhci_td *td; unsigned long destination, status; int maxsze = usb_endpoint_maxp(&qh->hep->desc); int len = urb->transfer_buffer_length; dma_addr_t data = urb->transfer_dma; __hc32 *plink; struct urb_priv *urbp = urb->hcpriv; int skel; /* The "pipe" thing contains the destination in bits 8--18 */ destination = (urb->pipe & PIPE_DEVEP_MASK) | USB_PID_SETUP; /* 3 errors, dummy TD remains inactive */ status = uhci_maxerr(3); if (urb->dev->speed == USB_SPEED_LOW) status |= TD_CTRL_LS; /* * Build the TD for the control request setup packet */ td = qh->dummy_td; uhci_add_td_to_urbp(td, urbp); uhci_fill_td(uhci, td, status, destination | uhci_explen(8), urb->setup_dma); plink = &td->link; status |= TD_CTRL_ACTIVE; /* * If direction is "send", change the packet ID from SETUP (0x2D) * to OUT (0xE1). Else change it from SETUP to IN (0x69) and * set Short Packet Detect (SPD) for all data packets. * * 0-length transfers always get treated as "send". */ if (usb_pipeout(urb->pipe) || len == 0) destination ^= (USB_PID_SETUP ^ USB_PID_OUT); else { destination ^= (USB_PID_SETUP ^ USB_PID_IN); status |= TD_CTRL_SPD; } /* * Build the DATA TDs */ while (len > 0) { int pktsze = maxsze; if (len <= pktsze) { /* The last data packet */ pktsze = len; status &= ~TD_CTRL_SPD; } td = uhci_alloc_td(uhci); if (!td) goto nomem; *plink = LINK_TO_TD(uhci, td); /* Alternate Data0/1 (start with Data1) */ destination ^= TD_TOKEN_TOGGLE; uhci_add_td_to_urbp(td, urbp); uhci_fill_td(uhci, td, status, destination | uhci_explen(pktsze), data); plink = &td->link; data += pktsze; len -= pktsze; } /* * Build the final TD for control status */ td = uhci_alloc_td(uhci); if (!td) goto nomem; *plink = LINK_TO_TD(uhci, td); /* Change direction for the status transaction */ destination ^= (USB_PID_IN ^ USB_PID_OUT); destination |= TD_TOKEN_TOGGLE; /* End in Data1 */ uhci_add_td_to_urbp(td, urbp); uhci_fill_td(uhci, td, status | TD_CTRL_IOC, destination | uhci_explen(0), 0); plink = &td->link; /* * Build the new dummy TD and activate the old one */ td = uhci_alloc_td(uhci); if (!td) goto nomem; *plink = LINK_TO_TD(uhci, td); uhci_fill_td(uhci, td, 0, USB_PID_OUT | uhci_explen(0), 0); wmb(); qh->dummy_td->status |= cpu_to_hc32(uhci, TD_CTRL_ACTIVE); qh->dummy_td = td; /* Low-speed transfers get a different queue, and won't hog the bus. * Also, some devices enumerate better without FSBR; the easiest way * to do that is to put URBs on the low-speed queue while the device * isn't in the CONFIGURED state. */ if (urb->dev->speed == USB_SPEED_LOW || urb->dev->state != USB_STATE_CONFIGURED) skel = SKEL_LS_CONTROL; else { skel = SKEL_FS_CONTROL; uhci_add_fsbr(uhci, urb); } if (qh->state != QH_STATE_ACTIVE) qh->skel = skel; return 0; nomem: /* Remove the dummy TD from the td_list so it doesn't get freed */ uhci_remove_td_from_urbp(qh->dummy_td); return -ENOMEM; } /* * Common submit for bulk and interrupt */ static int uhci_submit_common(struct uhci_hcd *uhci, struct urb *urb, struct uhci_qh *qh) { struct uhci_td *td; unsigned long destination, status; int maxsze = usb_endpoint_maxp(&qh->hep->desc); int len = urb->transfer_buffer_length; int this_sg_len; dma_addr_t data; __hc32 *plink; struct urb_priv *urbp = urb->hcpriv; unsigned int toggle; struct scatterlist *sg; int i; if (len < 0) return -EINVAL; /* The "pipe" thing contains the destination in bits 8--18 */ destination = (urb->pipe & PIPE_DEVEP_MASK) | usb_packetid(urb->pipe); toggle = usb_gettoggle(urb->dev, usb_pipeendpoint(urb->pipe), usb_pipeout(urb->pipe)); /* 3 errors, dummy TD remains inactive */ status = uhci_maxerr(3); if (urb->dev->speed == USB_SPEED_LOW) status |= TD_CTRL_LS; if (usb_pipein(urb->pipe)) status |= TD_CTRL_SPD; i = urb->num_mapped_sgs; if (len > 0 && i > 0) { sg = urb->sg; data = sg_dma_address(sg); /* urb->transfer_buffer_length may be smaller than the * size of the scatterlist (or vice versa) */ this_sg_len = min_t(int, sg_dma_len(sg), len); } else { sg = NULL; data = urb->transfer_dma; this_sg_len = len; } /* * Build the DATA TDs */ plink = NULL; td = qh->dummy_td; for (;;) { /* Allow zero length packets */ int pktsze = maxsze; if (len <= pktsze) { /* The last packet */ pktsze = len; if (!(urb->transfer_flags & URB_SHORT_NOT_OK)) status &= ~TD_CTRL_SPD; } if (plink) { td = uhci_alloc_td(uhci); if (!td) goto nomem; *plink = LINK_TO_TD(uhci, td); } uhci_add_td_to_urbp(td, urbp); uhci_fill_td(uhci, td, status, destination | uhci_explen(pktsze) | (toggle << TD_TOKEN_TOGGLE_SHIFT), data); plink = &td->link; status |= TD_CTRL_ACTIVE; toggle ^= 1; data += pktsze; this_sg_len -= pktsze; len -= maxsze; if (this_sg_len <= 0) { if (--i <= 0 || len <= 0) break; sg = sg_next(sg); data = sg_dma_address(sg); this_sg_len = min_t(int, sg_dma_len(sg), len); } } /* * URB_ZERO_PACKET means adding a 0-length packet, if direction * is OUT and the transfer_length was an exact multiple of maxsze, * hence (len = transfer_length - N * maxsze) == 0 * however, if transfer_length == 0, the zero packet was already * prepared above. */ if ((urb->transfer_flags & URB_ZERO_PACKET) && usb_pipeout(urb->pipe) && len == 0 && urb->transfer_buffer_length > 0) { td = uhci_alloc_td(uhci); if (!td) goto nomem; *plink = LINK_TO_TD(uhci, td); uhci_add_td_to_urbp(td, urbp); uhci_fill_td(uhci, td, status, destination | uhci_explen(0) | (toggle << TD_TOKEN_TOGGLE_SHIFT), data); plink = &td->link; toggle ^= 1; } /* Set the interrupt-on-completion flag on the last packet. * A more-or-less typical 4 KB URB (= size of one memory page) * will require about 3 ms to transfer; that's a little on the * fast side but not enough to justify delaying an interrupt * more than 2 or 3 URBs, so we will ignore the URB_NO_INTERRUPT * flag setting. */ td->status |= cpu_to_hc32(uhci, TD_CTRL_IOC); /* * Build the new dummy TD and activate the old one */ td = uhci_alloc_td(uhci); if (!td) goto nomem; *plink = LINK_TO_TD(uhci, td); uhci_fill_td(uhci, td, 0, USB_PID_OUT | uhci_explen(0), 0); wmb(); qh->dummy_td->status |= cpu_to_hc32(uhci, TD_CTRL_ACTIVE); qh->dummy_td = td; usb_settoggle(urb->dev, usb_pipeendpoint(urb->pipe), usb_pipeout(urb->pipe), toggle); return 0; nomem: /* Remove the dummy TD from the td_list so it doesn't get freed */ uhci_remove_td_from_urbp(qh->dummy_td); return -ENOMEM; } static int uhci_submit_bulk(struct uhci_hcd *uhci, struct urb *urb, struct uhci_qh *qh) { int ret; /* Can't have low-speed bulk transfers */ if (urb->dev->speed == USB_SPEED_LOW) return -EINVAL; if (qh->state != QH_STATE_ACTIVE) qh->skel = SKEL_BULK; ret = uhci_submit_common(uhci, urb, qh); if (ret == 0) uhci_add_fsbr(uhci, urb); return ret; } static int uhci_submit_interrupt(struct uhci_hcd *uhci, struct urb *urb, struct uhci_qh *qh) { int ret; /* USB 1.1 interrupt transfers only involve one packet per interval. * Drivers can submit URBs of any length, but longer ones will need * multiple intervals to complete. */ if (!qh->bandwidth_reserved) { int exponent; /* Figure out which power-of-two queue to use */ for (exponent = 7; exponent >= 0; --exponent) { if ((1 << exponent) <= urb->interval) break; } if (exponent < 0) return -EINVAL; /* If the slot is full, try a lower period */ do { qh->period = 1 << exponent; qh->skel = SKEL_INDEX(exponent); /* For now, interrupt phase is fixed by the layout * of the QH lists. */ qh->phase = (qh->period / 2) & (MAX_PHASE - 1); ret = uhci_check_bandwidth(uhci, qh); } while (ret != 0 && --exponent >= 0); if (ret) return ret; } else if (qh->period > urb->interval) return -EINVAL; /* Can't decrease the period */ ret = uhci_submit_common(uhci, urb, qh); if (ret == 0) { urb->interval = qh->period; if (!qh->bandwidth_reserved) uhci_reserve_bandwidth(uhci, qh); } return ret; } /* * Fix up the data structures following a short transfer */ static int uhci_fixup_short_transfer(struct uhci_hcd *uhci, struct uhci_qh *qh, struct urb_priv *urbp) { struct uhci_td *td; struct list_head *tmp; int ret; td = list_entry(urbp->td_list.prev, struct uhci_td, list); if (qh->type == USB_ENDPOINT_XFER_CONTROL) { /* When a control transfer is short, we have to restart * the queue at the status stage transaction, which is * the last TD. */ WARN_ON(list_empty(&urbp->td_list)); qh->element = LINK_TO_TD(uhci, td); tmp = td->list.prev; ret = -EINPROGRESS; } else { /* When a bulk/interrupt transfer is short, we have to * fix up the toggles of the following URBs on the queue * before restarting the queue at the next URB. */ qh->initial_toggle = uhci_toggle(td_token(uhci, qh->post_td)) ^ 1; uhci_fixup_toggles(uhci, qh, 1); if (list_empty(&urbp->td_list)) td = qh->post_td; qh->element = td->link; tmp = urbp->td_list.prev; ret = 0; } /* Remove all the TDs we skipped over, from tmp back to the start */ while (tmp != &urbp->td_list) { td = list_entry(tmp, struct uhci_td, list); tmp = tmp->prev; uhci_remove_td_from_urbp(td); uhci_free_td(uhci, td); } return ret; } /* * Common result for control, bulk, and interrupt */ static int uhci_result_common(struct uhci_hcd *uhci, struct urb *urb) { struct urb_priv *urbp = urb->hcpriv; struct uhci_qh *qh = urbp->qh; struct uhci_td *td, *tmp; unsigned status; int ret = 0; list_for_each_entry_safe(td, tmp, &urbp->td_list, list) { unsigned int ctrlstat; int len; ctrlstat = td_status(uhci, td); status = uhci_status_bits(ctrlstat); if (status & TD_CTRL_ACTIVE) return -EINPROGRESS; len = uhci_actual_length(ctrlstat); urb->actual_length += len; if (status) { ret = uhci_map_status(status, uhci_packetout(td_token(uhci, td))); if ((debug == 1 && ret != -EPIPE) || debug > 1) { /* Some debugging code */ dev_dbg(&urb->dev->dev, "%s: failed with status %x\n", __func__, status); if (debug > 1 && errbuf) { /* Print the chain for debugging */ uhci_show_qh(uhci, urbp->qh, errbuf, ERRBUF_LEN - EXTRA_SPACE, 0); lprintk(errbuf); } } /* Did we receive a short packet? */ } else if (len < uhci_expected_length(td_token(uhci, td))) { /* For control transfers, go to the status TD if * this isn't already the last data TD */ if (qh->type == USB_ENDPOINT_XFER_CONTROL) { if (td->list.next != urbp->td_list.prev) ret = 1; } /* For bulk and interrupt, this may be an error */ else if (urb->transfer_flags & URB_SHORT_NOT_OK) ret = -EREMOTEIO; /* Fixup needed only if this isn't the URB's last TD */ else if (&td->list != urbp->td_list.prev) ret = 1; } uhci_remove_td_from_urbp(td); if (qh->post_td) uhci_free_td(uhci, qh->post_td); qh->post_td = td; if (ret != 0) goto err; } return ret; err: if (ret < 0) { /* Note that the queue has stopped and save * the next toggle value */ qh->element = UHCI_PTR_TERM(uhci); qh->is_stopped = 1; qh->needs_fixup = (qh->type != USB_ENDPOINT_XFER_CONTROL); qh->initial_toggle = uhci_toggle(td_token(uhci, td)) ^ (ret == -EREMOTEIO); } else /* Short packet received */ ret = uhci_fixup_short_transfer(uhci, qh, urbp); return ret; } /* * Isochronous transfers */ static int uhci_submit_isochronous(struct uhci_hcd *uhci, struct urb *urb, struct uhci_qh *qh) { struct uhci_td *td = NULL; /* Since urb->number_of_packets > 0 */ int i; unsigned frame, next; unsigned long destination, status; struct urb_priv *urbp = (struct urb_priv *) urb->hcpriv; /* Values must not be too big (could overflow below) */ if (urb->interval >= UHCI_NUMFRAMES || urb->number_of_packets >= UHCI_NUMFRAMES) return -EFBIG; uhci_get_current_frame_number(uhci); /* Check the period and figure out the starting frame number */ if (!qh->bandwidth_reserved) { qh->period = urb->interval; qh->phase = -1; /* Find the best phase */ i = uhci_check_bandwidth(uhci, qh); if (i) return i; /* Allow a little time to allocate the TDs */ next = uhci->frame_number + 10; frame = qh->phase; /* Round up to the first available slot */ frame += (next - frame + qh->period - 1) & -qh->period; } else if (qh->period != urb->interval) { return -EINVAL; /* Can't change the period */ } else { next = uhci->frame_number + 1; /* Find the next unused frame */ if (list_empty(&qh->queue)) { frame = qh->iso_frame; } else { struct urb *lurb; lurb = list_entry(qh->queue.prev, struct urb_priv, node)->urb; frame = lurb->start_frame + lurb->number_of_packets * lurb->interval; } /* Fell behind? */ if (!uhci_frame_before_eq(next, frame)) { /* USB_ISO_ASAP: Round up to the first available slot */ if (urb->transfer_flags & URB_ISO_ASAP) frame += (next - frame + qh->period - 1) & -qh->period; /* * Not ASAP: Use the next slot in the stream, * no matter what. */ else if (!uhci_frame_before_eq(next, frame + (urb->number_of_packets - 1) * qh->period)) dev_dbg(uhci_dev(uhci), "iso underrun %p (%u+%u < %u)\n", urb, frame, (urb->number_of_packets - 1) * qh->period, next); } } /* Make sure we won't have to go too far into the future */ if (uhci_frame_before_eq(uhci->last_iso_frame + UHCI_NUMFRAMES, frame + urb->number_of_packets * urb->interval)) return -EFBIG; urb->start_frame = frame; status = TD_CTRL_ACTIVE | TD_CTRL_IOS; destination = (urb->pipe & PIPE_DEVEP_MASK) | usb_packetid(urb->pipe); for (i = 0; i < urb->number_of_packets; i++) { td = uhci_alloc_td(uhci); if (!td) return -ENOMEM; uhci_add_td_to_urbp(td, urbp); uhci_fill_td(uhci, td, status, destination | uhci_explen(urb->iso_frame_desc[i].length), urb->transfer_dma + urb->iso_frame_desc[i].offset); } /* Set the interrupt-on-completion flag on the last packet. */ td->status |= cpu_to_hc32(uhci, TD_CTRL_IOC); /* Add the TDs to the frame list */ frame = urb->start_frame; list_for_each_entry(td, &urbp->td_list, list) { uhci_insert_td_in_frame_list(uhci, td, frame); frame += qh->period; } if (list_empty(&qh->queue)) { qh->iso_packet_desc = &urb->iso_frame_desc[0]; qh->iso_frame = urb->start_frame; } qh->skel = SKEL_ISO; if (!qh->bandwidth_reserved) uhci_reserve_bandwidth(uhci, qh); return 0; } static int uhci_result_isochronous(struct uhci_hcd *uhci, struct urb *urb) { struct uhci_td *td, *tmp; struct urb_priv *urbp = urb->hcpriv; struct uhci_qh *qh = urbp->qh; list_for_each_entry_safe(td, tmp, &urbp->td_list, list) { unsigned int ctrlstat; int status; int actlength; if (uhci_frame_before_eq(uhci->cur_iso_frame, qh->iso_frame)) return -EINPROGRESS; uhci_remove_tds_from_frame(uhci, qh->iso_frame); ctrlstat = td_status(uhci, td); if (ctrlstat & TD_CTRL_ACTIVE) { status = -EXDEV; /* TD was added too late? */ } else { status = uhci_map_status(uhci_status_bits(ctrlstat), usb_pipeout(urb->pipe)); actlength = uhci_actual_length(ctrlstat); urb->actual_length += actlength; qh->iso_packet_desc->actual_length = actlength; qh->iso_packet_desc->status = status; } if (status) urb->error_count++; uhci_remove_td_from_urbp(td); uhci_free_td(uhci, td); qh->iso_frame += qh->period; ++qh->iso_packet_desc; } return 0; } static int uhci_urb_enqueue(struct usb_hcd *hcd, struct urb *urb, gfp_t mem_flags) { int ret; struct uhci_hcd *uhci = hcd_to_uhci(hcd); unsigned long flags; struct urb_priv *urbp; struct uhci_qh *qh; spin_lock_irqsave(&uhci->lock, flags); ret = usb_hcd_link_urb_to_ep(hcd, urb); if (ret) goto done_not_linked; ret = -ENOMEM; urbp = uhci_alloc_urb_priv(uhci, urb); if (!urbp) goto done; if (urb->ep->hcpriv) qh = urb->ep->hcpriv; else { qh = uhci_alloc_qh(uhci, urb->dev, urb->ep); if (!qh) goto err_no_qh; } urbp->qh = qh; switch (qh->type) { case USB_ENDPOINT_XFER_CONTROL: ret = uhci_submit_control(uhci, urb, qh); break; case USB_ENDPOINT_XFER_BULK: ret = uhci_submit_bulk(uhci, urb, qh); break; case USB_ENDPOINT_XFER_INT: ret = uhci_submit_interrupt(uhci, urb, qh); break; case USB_ENDPOINT_XFER_ISOC: urb->error_count = 0; ret = uhci_submit_isochronous(uhci, urb, qh); break; } if (ret != 0) goto err_submit_failed; /* Add this URB to the QH */ list_add_tail(&urbp->node, &qh->queue); /* If the new URB is the first and only one on this QH then either * the QH is new and idle or else it's unlinked and waiting to * become idle, so we can activate it right away. But only if the * queue isn't stopped. */ if (qh->queue.next == &urbp->node && !qh->is_stopped) { uhci_activate_qh(uhci, qh); uhci_urbp_wants_fsbr(uhci, urbp); } goto done; err_submit_failed: if (qh->state == QH_STATE_IDLE) uhci_make_qh_idle(uhci, qh); /* Reclaim unused QH */ err_no_qh: uhci_free_urb_priv(uhci, urbp); done: if (ret) usb_hcd_unlink_urb_from_ep(hcd, urb); done_not_linked: spin_unlock_irqrestore(&uhci->lock, flags); return ret; } static int uhci_urb_dequeue(struct usb_hcd *hcd, struct urb *urb, int status) { struct uhci_hcd *uhci = hcd_to_uhci(hcd); unsigned long flags; struct uhci_qh *qh; int rc; spin_lock_irqsave(&uhci->lock, flags); rc = usb_hcd_check_unlink_urb(hcd, urb, status); if (rc) goto done; qh = ((struct urb_priv *) urb->hcpriv)->qh; /* Remove Isochronous TDs from the frame list ASAP */ if (qh->type == USB_ENDPOINT_XFER_ISOC) { uhci_unlink_isochronous_tds(uhci, urb); mb(); /* If the URB has already started, update the QH unlink time */ uhci_get_current_frame_number(uhci); if (uhci_frame_before_eq(urb->start_frame, uhci->frame_number)) qh->unlink_frame = uhci->frame_number; } uhci_unlink_qh(uhci, qh); done: spin_unlock_irqrestore(&uhci->lock, flags); return rc; } /* * Finish unlinking an URB and give it back */ static void uhci_giveback_urb(struct uhci_hcd *uhci, struct uhci_qh *qh, struct urb *urb, int status) __releases(uhci->lock) __acquires(uhci->lock) { struct urb_priv *urbp = (struct urb_priv *) urb->hcpriv; if (qh->type == USB_ENDPOINT_XFER_CONTROL) { /* Subtract off the length of the SETUP packet from * urb->actual_length. */ urb->actual_length -= min_t(u32, 8, urb->actual_length); } /* When giving back the first URB in an Isochronous queue, * reinitialize the QH's iso-related members for the next URB. */ else if (qh->type == USB_ENDPOINT_XFER_ISOC && urbp->node.prev == &qh->queue && urbp->node.next != &qh->queue) { struct urb *nurb = list_entry(urbp->node.next, struct urb_priv, node)->urb; qh->iso_packet_desc = &nurb->iso_frame_desc[0]; qh->iso_frame = nurb->start_frame; } /* Take the URB off the QH's queue. If the queue is now empty, * this is a perfect time for a toggle fixup. */ list_del_init(&urbp->node); if (list_empty(&qh->queue) && qh->needs_fixup) { usb_settoggle(urb->dev, usb_pipeendpoint(urb->pipe), usb_pipeout(urb->pipe), qh->initial_toggle); qh->needs_fixup = 0; } uhci_free_urb_priv(uhci, urbp); usb_hcd_unlink_urb_from_ep(uhci_to_hcd(uhci), urb); spin_unlock(&uhci->lock); usb_hcd_giveback_urb(uhci_to_hcd(uhci), urb, status); spin_lock(&uhci->lock); /* If the queue is now empty, we can unlink the QH and give up its * reserved bandwidth. */ if (list_empty(&qh->queue)) { uhci_unlink_qh(uhci, qh); if (qh->bandwidth_reserved) uhci_release_bandwidth(uhci, qh); } } /* * Scan the URBs in a QH's queue */ #define QH_FINISHED_UNLINKING(qh) \ (qh->state == QH_STATE_UNLINKING && \ uhci->frame_number + uhci->is_stopped != qh->unlink_frame) static void uhci_scan_qh(struct uhci_hcd *uhci, struct uhci_qh *qh) { struct urb_priv *urbp; struct urb *urb; int status; while (!list_empty(&qh->queue)) { urbp = list_entry(qh->queue.next, struct urb_priv, node); urb = urbp->urb; if (qh->type == USB_ENDPOINT_XFER_ISOC) status = uhci_result_isochronous(uhci, urb); else status = uhci_result_common(uhci, urb); if (status == -EINPROGRESS) break; /* Dequeued but completed URBs can't be given back unless * the QH is stopped or has finished unlinking. */ if (urb->unlinked) { if (QH_FINISHED_UNLINKING(qh)) qh->is_stopped = 1; else if (!qh->is_stopped) return; } uhci_giveback_urb(uhci, qh, urb, status); if (status < 0) break; } /* If the QH is neither stopped nor finished unlinking (normal case), * our work here is done. */ if (QH_FINISHED_UNLINKING(qh)) qh->is_stopped = 1; else if (!qh->is_stopped) return; /* Otherwise give back each of the dequeued URBs */ restart: list_for_each_entry(urbp, &qh->queue, node) { urb = urbp->urb; if (urb->unlinked) { /* Fix up the TD links and save the toggles for * non-Isochronous queues. For Isochronous queues, * test for too-recent dequeues. */ if (!uhci_cleanup_queue(uhci, qh, urb)) { qh->is_stopped = 0; return; } uhci_giveback_urb(uhci, qh, urb, 0); goto restart; } } qh->is_stopped = 0; /* There are no more dequeued URBs. If there are still URBs on the * queue, the QH can now be re-activated. */ if (!list_empty(&qh->queue)) { if (qh->needs_fixup) uhci_fixup_toggles(uhci, qh, 0); /* If the first URB on the queue wants FSBR but its time * limit has expired, set the next TD to interrupt on * completion before reactivating the QH. */ urbp = list_entry(qh->queue.next, struct urb_priv, node); if (urbp->fsbr && qh->wait_expired) { struct uhci_td *td = list_entry(urbp->td_list.next, struct uhci_td, list); td->status |= cpu_to_hc32(uhci, TD_CTRL_IOC); } uhci_activate_qh(uhci, qh); } /* The queue is empty. The QH can become idle if it is fully * unlinked. */ else if (QH_FINISHED_UNLINKING(qh)) uhci_make_qh_idle(uhci, qh); } /* * Check for queues that have made some forward progress. * Returns 0 if the queue is not Isochronous, is ACTIVE, and * has not advanced since last examined; 1 otherwise. * * Early Intel controllers have a bug which causes qh->element sometimes * not to advance when a TD completes successfully. The queue remains * stuck on the inactive completed TD. We detect such cases and advance * the element pointer by hand. */ static int uhci_advance_check(struct uhci_hcd *uhci, struct uhci_qh *qh) { struct urb_priv *urbp = NULL; struct uhci_td *td; int ret = 1; unsigned status; if (qh->type == USB_ENDPOINT_XFER_ISOC) goto done; /* Treat an UNLINKING queue as though it hasn't advanced. * This is okay because reactivation will treat it as though * it has advanced, and if it is going to become IDLE then * this doesn't matter anyway. Furthermore it's possible * for an UNLINKING queue not to have any URBs at all, or * for its first URB not to have any TDs (if it was dequeued * just as it completed). So it's not easy in any case to * test whether such queues have advanced. */ if (qh->state != QH_STATE_ACTIVE) { urbp = NULL; status = 0; } else { urbp = list_entry(qh->queue.next, struct urb_priv, node); td = list_entry(urbp->td_list.next, struct uhci_td, list); status = td_status(uhci, td); if (!(status & TD_CTRL_ACTIVE)) { /* We're okay, the queue has advanced */ qh->wait_expired = 0; qh->advance_jiffies = jiffies; goto done; } ret = uhci->is_stopped; } /* The queue hasn't advanced; check for timeout */ if (qh->wait_expired) goto done; if (time_after(jiffies, qh->advance_jiffies + QH_WAIT_TIMEOUT)) { /* Detect the Intel bug and work around it */ if (qh->post_td && qh_element(qh) == LINK_TO_TD(uhci, qh->post_td)) { qh->element = qh->post_td->link; qh->advance_jiffies = jiffies; ret = 1; goto done; } qh->wait_expired = 1; /* If the current URB wants FSBR, unlink it temporarily * so that we can safely set the next TD to interrupt on * completion. That way we'll know as soon as the queue * starts moving again. */ if (urbp && urbp->fsbr && !(status & TD_CTRL_IOC)) uhci_unlink_qh(uhci, qh); } else { /* Unmoving but not-yet-expired queues keep FSBR alive */ if (urbp) uhci_urbp_wants_fsbr(uhci, urbp); } done: return ret; } /* * Process events in the schedule, but only in one thread at a time */ static void uhci_scan_schedule(struct uhci_hcd *uhci) { int i; struct uhci_qh *qh; /* Don't allow re-entrant calls */ if (uhci->scan_in_progress) { uhci->need_rescan = 1; return; } uhci->scan_in_progress = 1; rescan: uhci->need_rescan = 0; uhci->fsbr_is_wanted = 0; uhci_clear_next_interrupt(uhci); uhci_get_current_frame_number(uhci); uhci->cur_iso_frame = uhci->frame_number; /* Go through all the QH queues and process the URBs in each one */ for (i = 0; i < UHCI_NUM_SKELQH - 1; ++i) { uhci->next_qh = list_entry(uhci->skelqh[i]->node.next, struct uhci_qh, node); while ((qh = uhci->next_qh) != uhci->skelqh[i]) { uhci->next_qh = list_entry(qh->node.next, struct uhci_qh, node); if (uhci_advance_check(uhci, qh)) { uhci_scan_qh(uhci, qh); if (qh->state == QH_STATE_ACTIVE) { uhci_urbp_wants_fsbr(uhci, list_entry(qh->queue.next, struct urb_priv, node)); } } } } uhci->last_iso_frame = uhci->cur_iso_frame; if (uhci->need_rescan) goto rescan; uhci->scan_in_progress = 0; if (uhci->fsbr_is_on && !uhci->fsbr_is_wanted && !uhci->fsbr_expiring) { uhci->fsbr_expiring = 1; mod_timer(&uhci->fsbr_timer, jiffies + FSBR_OFF_DELAY); } if (list_empty(&uhci->skel_unlink_qh->node)) uhci_clear_next_interrupt(uhci); else uhci_set_next_interrupt(uhci); } |
4 4 | 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60 61 62 63 64 65 66 67 68 69 70 71 72 73 74 75 76 77 78 79 80 81 82 83 84 85 86 87 88 89 90 91 92 93 94 95 96 97 98 99 100 101 102 103 104 105 106 107 108 109 110 111 112 113 114 115 116 117 118 119 120 121 122 123 124 125 126 127 128 129 130 131 132 133 134 135 136 137 138 139 140 141 142 143 144 145 146 147 148 149 150 151 152 153 154 155 156 157 158 159 160 161 162 163 164 165 166 167 168 169 170 171 172 173 174 175 176 177 178 179 180 181 182 183 184 185 186 187 188 189 190 191 192 193 194 195 196 197 198 199 200 201 202 203 204 205 206 207 208 209 210 211 212 213 214 215 216 217 218 219 220 221 222 223 224 225 226 227 228 229 230 231 232 233 234 235 236 237 238 239 240 241 242 243 244 245 246 247 248 249 250 251 252 253 254 255 256 257 258 259 260 261 262 263 264 265 266 267 268 269 270 271 272 273 274 275 276 277 278 279 280 281 282 283 284 285 286 287 288 289 290 291 292 293 294 295 296 297 298 299 300 301 302 303 304 305 306 307 308 309 310 311 312 313 314 315 316 317 318 319 320 321 322 323 324 325 326 327 328 329 330 331 332 333 334 335 336 337 338 339 340 341 342 343 344 345 346 347 348 349 350 351 352 353 354 355 356 357 358 359 360 361 362 363 364 365 366 367 368 369 370 371 372 373 374 375 376 377 378 379 380 381 382 383 384 385 386 387 388 389 390 391 392 393 394 395 396 397 398 399 400 401 402 403 404 405 406 407 408 409 410 411 412 413 414 415 416 417 418 419 420 421 422 423 424 425 426 427 428 429 430 431 432 433 434 435 436 437 438 439 440 441 442 443 444 445 446 447 448 449 450 451 452 453 454 455 456 457 458 459 460 461 462 463 464 465 466 467 468 469 470 471 472 473 474 475 476 477 478 479 480 481 482 483 484 485 486 487 488 489 490 491 492 493 494 495 496 497 498 499 500 501 502 503 504 505 506 507 508 509 510 511 512 513 514 515 516 517 518 519 520 521 522 523 524 525 526 527 528 529 530 531 532 533 534 535 536 537 538 539 540 541 542 543 544 545 546 547 548 549 550 551 552 553 554 555 556 557 558 559 560 561 562 563 564 565 566 567 568 569 570 571 572 573 574 575 576 577 578 579 580 581 582 583 584 585 586 587 588 589 590 591 592 593 594 595 596 597 598 599 600 601 602 603 604 605 606 607 608 609 610 611 612 613 614 615 616 617 618 619 620 621 622 623 624 625 626 627 628 629 630 631 632 633 634 635 636 637 638 639 640 641 642 643 644 645 646 647 648 649 650 651 652 653 654 655 656 657 658 659 660 661 662 663 664 665 666 667 668 669 670 671 672 673 674 675 676 677 678 679 680 681 682 683 684 685 686 687 688 689 690 691 692 693 694 695 696 697 698 699 700 701 702 703 704 705 706 707 708 709 710 711 712 713 714 715 716 717 718 719 720 721 722 723 724 725 726 727 728 729 730 731 732 733 734 735 736 737 738 739 740 741 742 743 744 745 746 747 748 749 750 751 752 753 754 755 756 757 758 759 760 761 762 763 764 765 766 767 768 769 770 771 772 773 774 775 776 777 778 779 780 781 782 783 784 785 786 787 788 789 790 791 792 793 794 795 796 797 798 799 800 801 802 803 804 805 806 | /* * fs/nfs/idmap.c * * UID and GID to name mapping for clients. * * Copyright (c) 2002 The Regents of the University of Michigan. * All rights reserved. * * Marius Aamodt Eriksen <marius@umich.edu> * * Redistribution and use in source and binary forms, with or without * modification, are permitted provided that the following conditions * are met: * * 1. Redistributions of source code must retain the above copyright * notice, this list of conditions and the following disclaimer. * 2. Redistributions in binary form must reproduce the above copyright * notice, this list of conditions and the following disclaimer in the * documentation and/or other materials provided with the distribution. * 3. Neither the name of the University nor the names of its * contributors may be used to endorse or promote products derived * from this software without specific prior written permission. * * THIS SOFTWARE IS PROVIDED ``AS IS'' AND ANY EXPRESS OR IMPLIED * WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE IMPLIED WARRANTIES OF * MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE ARE * DISCLAIMED. IN NO EVENT SHALL THE REGENTS OR CONTRIBUTORS BE LIABLE * FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR * CONSEQUENTIAL DAMAGES (INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF * SUBSTITUTE GOODS OR SERVICES; LOSS OF USE, DATA, OR PROFITS; OR * BUSINESS INTERRUPTION) HOWEVER CAUSED AND ON ANY THEORY OF * LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT (INCLUDING * NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE OF THIS * SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE. */ #include <linux/types.h> #include <linux/parser.h> #include <linux/fs.h> #include <net/net_namespace.h> #include <linux/sunrpc/rpc_pipe_fs.h> #include <linux/nfs_fs.h> #include <linux/nfs_fs_sb.h> #include <linux/key.h> #include <linux/keyctl.h> #include <linux/key-type.h> #include <keys/user-type.h> #include <keys/request_key_auth-type.h> #include <linux/module.h> #include <linux/user_namespace.h> #include "internal.h" #include "netns.h" #include "nfs4idmap.h" #include "nfs4trace.h" #define NFS_UINT_MAXLEN 11 static const struct cred *id_resolver_cache; static struct key_type key_type_id_resolver_legacy; struct idmap_legacy_upcalldata { struct rpc_pipe_msg pipe_msg; struct idmap_msg idmap_msg; struct key *authkey; struct idmap *idmap; }; struct idmap { struct rpc_pipe_dir_object idmap_pdo; struct rpc_pipe *idmap_pipe; struct idmap_legacy_upcalldata *idmap_upcall_data; struct mutex idmap_mutex; struct user_namespace *user_ns; }; static struct user_namespace *idmap_userns(const struct idmap *idmap) { if (idmap && idmap->user_ns) return idmap->user_ns; return &init_user_ns; } /** * nfs_fattr_init_names - initialise the nfs_fattr owner_name/group_name fields * @fattr: fully initialised struct nfs_fattr * @owner_name: owner name string cache * @group_name: group name string cache */ void nfs_fattr_init_names(struct nfs_fattr *fattr, struct nfs4_string *owner_name, struct nfs4_string *group_name) { fattr->owner_name = owner_name; fattr->group_name = group_name; } static void nfs_fattr_free_owner_name(struct nfs_fattr *fattr) { fattr->valid &= ~NFS_ATTR_FATTR_OWNER_NAME; kfree(fattr->owner_name->data); } static void nfs_fattr_free_group_name(struct nfs_fattr *fattr) { fattr->valid &= ~NFS_ATTR_FATTR_GROUP_NAME; kfree(fattr->group_name->data); } static bool nfs_fattr_map_owner_name(struct nfs_server *server, struct nfs_fattr *fattr) { struct nfs4_string *owner = fattr->owner_name; kuid_t uid; if (!(fattr->valid & NFS_ATTR_FATTR_OWNER_NAME)) return false; if (nfs_map_name_to_uid(server, owner->data, owner->len, &uid) == 0) { fattr->uid = uid; fattr->valid |= NFS_ATTR_FATTR_OWNER; } return true; } static bool nfs_fattr_map_group_name(struct nfs_server *server, struct nfs_fattr *fattr) { struct nfs4_string *group = fattr->group_name; kgid_t gid; if (!(fattr->valid & NFS_ATTR_FATTR_GROUP_NAME)) return false; if (nfs_map_group_to_gid(server, group->data, group->len, &gid) == 0) { fattr->gid = gid; fattr->valid |= NFS_ATTR_FATTR_GROUP; } return true; } /** * nfs_fattr_free_names - free up the NFSv4 owner and group strings * @fattr: a fully initialised nfs_fattr structure */ void nfs_fattr_free_names(struct nfs_fattr *fattr) { if (fattr->valid & NFS_ATTR_FATTR_OWNER_NAME) nfs_fattr_free_owner_name(fattr); if (fattr->valid & NFS_ATTR_FATTR_GROUP_NAME) nfs_fattr_free_group_name(fattr); } /** * nfs_fattr_map_and_free_names - map owner/group strings into uid/gid and free * @server: pointer to the filesystem nfs_server structure * @fattr: a fully initialised nfs_fattr structure * * This helper maps the cached NFSv4 owner/group strings in fattr into * their numeric uid/gid equivalents, and then frees the cached strings. */ void nfs_fattr_map_and_free_names(struct nfs_server *server, struct nfs_fattr *fattr) { if (nfs_fattr_map_owner_name(server, fattr)) nfs_fattr_free_owner_name(fattr); if (nfs_fattr_map_group_name(server, fattr)) nfs_fattr_free_group_name(fattr); } int nfs_map_string_to_numeric(const char *name, size_t namelen, __u32 *res) { unsigned long val; char buf[16]; if (memchr(name, '@', namelen) != NULL || namelen >= sizeof(buf)) return 0; memcpy(buf, name, namelen); buf[namelen] = '\0'; if (kstrtoul(buf, 0, &val) != 0) return 0; *res = val; return 1; } EXPORT_SYMBOL_GPL(nfs_map_string_to_numeric); static int nfs_map_numeric_to_string(__u32 id, char *buf, size_t buflen) { return snprintf(buf, buflen, "%u", id); } static struct key_type key_type_id_resolver = { .name = "id_resolver", .preparse = user_preparse, .free_preparse = user_free_preparse, .instantiate = generic_key_instantiate, .revoke = user_revoke, .destroy = user_destroy, .describe = user_describe, .read = user_read, }; int nfs_idmap_init(void) { struct cred *cred; struct key *keyring; int ret = 0; printk(KERN_NOTICE "NFS: Registering the %s key type\n", key_type_id_resolver.name); cred = prepare_kernel_cred(&init_task); if (!cred) return -ENOMEM; keyring = keyring_alloc(".id_resolver", GLOBAL_ROOT_UID, GLOBAL_ROOT_GID, cred, (KEY_POS_ALL & ~KEY_POS_SETATTR) | KEY_USR_VIEW | KEY_USR_READ, KEY_ALLOC_NOT_IN_QUOTA, NULL, NULL); if (IS_ERR(keyring)) { ret = PTR_ERR(keyring); goto failed_put_cred; } ret = register_key_type(&key_type_id_resolver); if (ret < 0) goto failed_put_key; ret = register_key_type(&key_type_id_resolver_legacy); if (ret < 0) goto failed_reg_legacy; set_bit(KEY_FLAG_ROOT_CAN_CLEAR, &keyring->flags); cred->thread_keyring = keyring; cred->jit_keyring = KEY_REQKEY_DEFL_THREAD_KEYRING; id_resolver_cache = cred; return 0; failed_reg_legacy: unregister_key_type(&key_type_id_resolver); failed_put_key: key_put(keyring); failed_put_cred: put_cred(cred); return ret; } void nfs_idmap_quit(void) { key_revoke(id_resolver_cache->thread_keyring); unregister_key_type(&key_type_id_resolver); unregister_key_type(&key_type_id_resolver_legacy); put_cred(id_resolver_cache); } /* * Assemble the description to pass to request_key() * This function will allocate a new string and update dest to point * at it. The caller is responsible for freeing dest. * * On error 0 is returned. Otherwise, the length of dest is returned. */ static ssize_t nfs_idmap_get_desc(const char *name, size_t namelen, const char *type, size_t typelen, char **desc) { char *cp; size_t desclen = typelen + namelen + 2; *desc = kmalloc(desclen, GFP_KERNEL); if (!*desc) return -ENOMEM; cp = *desc; memcpy(cp, type, typelen); cp += typelen; *cp++ = ':'; memcpy(cp, name, namelen); cp += namelen; *cp = '\0'; return desclen; } static struct key *nfs_idmap_request_key(const char *name, size_t namelen, const char *type, struct idmap *idmap) { char *desc; struct key *rkey = ERR_PTR(-EAGAIN); ssize_t ret; ret = nfs_idmap_get_desc(name, namelen, type, strlen(type), &desc); if (ret < 0) return ERR_PTR(ret); if (!idmap->user_ns || idmap->user_ns == &init_user_ns) rkey = request_key(&key_type_id_resolver, desc, ""); if (IS_ERR(rkey)) { mutex_lock(&idmap->idmap_mutex); rkey = request_key_with_auxdata(&key_type_id_resolver_legacy, desc, NULL, "", 0, idmap); mutex_unlock(&idmap->idmap_mutex); } if (!IS_ERR(rkey)) set_bit(KEY_FLAG_ROOT_CAN_INVAL, &rkey->flags); kfree(desc); return rkey; } static ssize_t nfs_idmap_get_key(const char *name, size_t namelen, const char *type, void *data, size_t data_size, struct idmap *idmap) { const struct cred *saved_cred; struct key *rkey; const struct user_key_payload *payload; ssize_t ret; saved_cred = override_creds(id_resolver_cache); rkey = nfs_idmap_request_key(name, namelen, type, idmap); revert_creds(saved_cred); if (IS_ERR(rkey)) { ret = PTR_ERR(rkey); goto out; } rcu_read_lock(); rkey->perm |= KEY_USR_VIEW; ret = key_validate(rkey); if (ret < 0) goto out_up; payload = user_key_payload_rcu(rkey); if (IS_ERR_OR_NULL(payload)) { ret = PTR_ERR(payload); goto out_up; } ret = payload->datalen; if (ret > 0 && ret <= data_size) memcpy(data, payload->data, ret); else ret = -EINVAL; out_up: rcu_read_unlock(); key_put(rkey); out: return ret; } /* ID -> Name */ static ssize_t nfs_idmap_lookup_name(__u32 id, const char *type, char *buf, size_t buflen, struct idmap *idmap) { char id_str[NFS_UINT_MAXLEN]; int id_len; ssize_t ret; id_len = nfs_map_numeric_to_string(id, id_str, sizeof(id_str)); ret = nfs_idmap_get_key(id_str, id_len, type, buf, buflen, idmap); if (ret < 0) return -EINVAL; return ret; } /* Name -> ID */ static int nfs_idmap_lookup_id(const char *name, size_t namelen, const char *type, __u32 *id, struct idmap *idmap) { char id_str[NFS_UINT_MAXLEN]; long id_long; ssize_t data_size; int ret = 0; data_size = nfs_idmap_get_key(name, namelen, type, id_str, NFS_UINT_MAXLEN, idmap); if (data_size <= 0) { ret = -EINVAL; } else { ret = kstrtol(id_str, 10, &id_long); if (!ret) *id = (__u32)id_long; } return ret; } /* idmap classic begins here */ enum { Opt_find_uid, Opt_find_gid, Opt_find_user, Opt_find_group, Opt_find_err }; static const match_table_t nfs_idmap_tokens = { { Opt_find_uid, "uid:%s" }, { Opt_find_gid, "gid:%s" }, { Opt_find_user, "user:%s" }, { Opt_find_group, "group:%s" }, { Opt_find_err, NULL } }; static int nfs_idmap_legacy_upcall(struct key *, void *); static ssize_t idmap_pipe_downcall(struct file *, const char __user *, size_t); static void idmap_release_pipe(struct inode *); static void idmap_pipe_destroy_msg(struct rpc_pipe_msg *); static const struct rpc_pipe_ops idmap_upcall_ops = { .upcall = rpc_pipe_generic_upcall, .downcall = idmap_pipe_downcall, .release_pipe = idmap_release_pipe, .destroy_msg = idmap_pipe_destroy_msg, }; static struct key_type key_type_id_resolver_legacy = { .name = "id_legacy", .preparse = user_preparse, .free_preparse = user_free_preparse, .instantiate = generic_key_instantiate, .revoke = user_revoke, .destroy = user_destroy, .describe = user_describe, .read = user_read, .request_key = nfs_idmap_legacy_upcall, }; static void nfs_idmap_pipe_destroy(struct dentry *dir, struct rpc_pipe_dir_object *pdo) { struct idmap *idmap = pdo->pdo_data; struct rpc_pipe *pipe = idmap->idmap_pipe; if (pipe->dentry) { rpc_unlink(pipe->dentry); pipe->dentry = NULL; } } static int nfs_idmap_pipe_create(struct dentry *dir, struct rpc_pipe_dir_object *pdo) { struct idmap *idmap = pdo->pdo_data; struct rpc_pipe *pipe = idmap->idmap_pipe; struct dentry *dentry; dentry = rpc_mkpipe_dentry(dir, "idmap", idmap, pipe); if (IS_ERR(dentry)) return PTR_ERR(dentry); pipe->dentry = dentry; return 0; } static const struct rpc_pipe_dir_object_ops nfs_idmap_pipe_dir_object_ops = { .create = nfs_idmap_pipe_create, .destroy = nfs_idmap_pipe_destroy, }; int nfs_idmap_new(struct nfs_client *clp) { struct idmap *idmap; struct rpc_pipe *pipe; int error; idmap = kzalloc(sizeof(*idmap), GFP_KERNEL); if (idmap == NULL) return -ENOMEM; mutex_init(&idmap->idmap_mutex); idmap->user_ns = get_user_ns(clp->cl_rpcclient->cl_cred->user_ns); rpc_init_pipe_dir_object(&idmap->idmap_pdo, &nfs_idmap_pipe_dir_object_ops, idmap); pipe = rpc_mkpipe_data(&idmap_upcall_ops, 0); if (IS_ERR(pipe)) { error = PTR_ERR(pipe); goto err; } idmap->idmap_pipe = pipe; error = rpc_add_pipe_dir_object(clp->cl_net, &clp->cl_rpcclient->cl_pipedir_objects, &idmap->idmap_pdo); if (error) goto err_destroy_pipe; clp->cl_idmap = idmap; return 0; err_destroy_pipe: rpc_destroy_pipe_data(idmap->idmap_pipe); err: put_user_ns(idmap->user_ns); kfree(idmap); return error; } void nfs_idmap_delete(struct nfs_client *clp) { struct idmap *idmap = clp->cl_idmap; if (!idmap) return; clp->cl_idmap = NULL; rpc_remove_pipe_dir_object(clp->cl_net, &clp->cl_rpcclient->cl_pipedir_objects, &idmap->idmap_pdo); rpc_destroy_pipe_data(idmap->idmap_pipe); put_user_ns(idmap->user_ns); kfree(idmap); } static int nfs_idmap_prepare_message(char *desc, struct idmap *idmap, struct idmap_msg *im, struct rpc_pipe_msg *msg) { substring_t substr; int token, ret; im->im_type = IDMAP_TYPE_GROUP; token = match_token(desc, nfs_idmap_tokens, &substr); switch (token) { case Opt_find_uid: im->im_type = IDMAP_TYPE_USER; fallthrough; case Opt_find_gid: im->im_conv = IDMAP_CONV_NAMETOID; ret = match_strlcpy(im->im_name, &substr, IDMAP_NAMESZ); break; case Opt_find_user: im->im_type = IDMAP_TYPE_USER; fallthrough; case Opt_find_group: im->im_conv = IDMAP_CONV_IDTONAME; ret = match_int(&substr, &im->im_id); if (ret) goto out; break; default: ret = -EINVAL; goto out; } msg->data = im; msg->len = sizeof(struct idmap_msg); out: return ret; } static bool nfs_idmap_prepare_pipe_upcall(struct idmap *idmap, struct idmap_legacy_upcalldata *data) { if (idmap->idmap_upcall_data != NULL) { WARN_ON_ONCE(1); return false; } idmap->idmap_upcall_data = data; return true; } static void nfs_idmap_complete_pipe_upcall(struct idmap_legacy_upcalldata *data, int ret) { complete_request_key(data->authkey, ret); key_put(data->authkey); kfree(data); } static void nfs_idmap_abort_pipe_upcall(struct idmap *idmap, struct idmap_legacy_upcalldata *data, int ret) { if (cmpxchg(&idmap->idmap_upcall_data, data, NULL) == data) nfs_idmap_complete_pipe_upcall(data, ret); } static int nfs_idmap_legacy_upcall(struct key *authkey, void *aux) { struct idmap_legacy_upcalldata *data; struct request_key_auth *rka = get_request_key_auth(authkey); struct rpc_pipe_msg *msg; struct idmap_msg *im; struct idmap *idmap = aux; struct key *key = rka->target_key; int ret = -ENOKEY; if (!aux) goto out1; /* msg and im are freed in idmap_pipe_destroy_msg */ ret = -ENOMEM; data = kzalloc(sizeof(*data), GFP_KERNEL); if (!data) goto out1; msg = &data->pipe_msg; im = &data->idmap_msg; data->idmap = idmap; data->authkey = key_get(authkey); ret = nfs_idmap_prepare_message(key->description, idmap, im, msg); if (ret < 0) goto out2; ret = -EAGAIN; if (!nfs_idmap_prepare_pipe_upcall(idmap, data)) goto out2; ret = rpc_queue_upcall(idmap->idmap_pipe, msg); if (ret < 0) nfs_idmap_abort_pipe_upcall(idmap, data, ret); return ret; out2: kfree(data); out1: complete_request_key(authkey, ret); return ret; } static int nfs_idmap_instantiate(struct key *key, struct key *authkey, char *data, size_t datalen) { return key_instantiate_and_link(key, data, datalen, id_resolver_cache->thread_keyring, authkey); } static int nfs_idmap_read_and_verify_message(struct idmap_msg *im, struct idmap_msg *upcall, struct key *key, struct key *authkey) { char id_str[NFS_UINT_MAXLEN]; size_t len; int ret = -ENOKEY; /* ret = -ENOKEY */ if (upcall->im_type != im->im_type || upcall->im_conv != im->im_conv) goto out; switch (im->im_conv) { case IDMAP_CONV_NAMETOID: if (strcmp(upcall->im_name, im->im_name) != 0) break; /* Note: here we store the NUL terminator too */ len = 1 + nfs_map_numeric_to_string(im->im_id, id_str, sizeof(id_str)); ret = nfs_idmap_instantiate(key, authkey, id_str, len); break; case IDMAP_CONV_IDTONAME: if (upcall->im_id != im->im_id) break; len = strlen(im->im_name); ret = nfs_idmap_instantiate(key, authkey, im->im_name, len); break; default: ret = -EINVAL; } out: return ret; } static ssize_t idmap_pipe_downcall(struct file *filp, const char __user *src, size_t mlen) { struct request_key_auth *rka; struct rpc_inode *rpci = RPC_I(file_inode(filp)); struct idmap *idmap = (struct idmap *)rpci->private; struct idmap_legacy_upcalldata *data; struct key *authkey; struct idmap_msg im; size_t namelen_in; int ret = -ENOKEY; /* If instantiation is successful, anyone waiting for key construction * will have been woken up and someone else may now have used * idmap_key_cons - so after this point we may no longer touch it. */ data = xchg(&idmap->idmap_upcall_data, NULL); if (data == NULL) goto out_noupcall; authkey = data->authkey; rka = get_request_key_auth(authkey); if (mlen != sizeof(im)) { ret = -ENOSPC; goto out; } if (copy_from_user(&im, src, mlen) != 0) { ret = -EFAULT; goto out; } if (!(im.im_status & IDMAP_STATUS_SUCCESS)) { ret = -ENOKEY; goto out; } namelen_in = strnlen(im.im_name, IDMAP_NAMESZ); if (namelen_in == 0 || namelen_in == IDMAP_NAMESZ) { ret = -EINVAL; goto out; } ret = nfs_idmap_read_and_verify_message(&im, &data->idmap_msg, rka->target_key, authkey); if (ret >= 0) { key_set_timeout(rka->target_key, nfs_idmap_cache_timeout); ret = mlen; } out: nfs_idmap_complete_pipe_upcall(data, ret); out_noupcall: return ret; } static void idmap_pipe_destroy_msg(struct rpc_pipe_msg *msg) { struct idmap_legacy_upcalldata *data = container_of(msg, struct idmap_legacy_upcalldata, pipe_msg); struct idmap *idmap = data->idmap; if (msg->errno) nfs_idmap_abort_pipe_upcall(idmap, data, msg->errno); } static void idmap_release_pipe(struct inode *inode) { struct rpc_inode *rpci = RPC_I(inode); struct idmap *idmap = (struct idmap *)rpci->private; struct idmap_legacy_upcalldata *data; data = xchg(&idmap->idmap_upcall_data, NULL); if (data) nfs_idmap_complete_pipe_upcall(data, -EPIPE); } int nfs_map_name_to_uid(const struct nfs_server *server, const char *name, size_t namelen, kuid_t *uid) { struct idmap *idmap = server->nfs_client->cl_idmap; __u32 id = -1; int ret = 0; if (!nfs_map_string_to_numeric(name, namelen, &id)) ret = nfs_idmap_lookup_id(name, namelen, "uid", &id, idmap); if (ret == 0) { *uid = make_kuid(idmap_userns(idmap), id); if (!uid_valid(*uid)) ret = -ERANGE; } trace_nfs4_map_name_to_uid(name, namelen, id, ret); return ret; } int nfs_map_group_to_gid(const struct nfs_server *server, const char *name, size_t namelen, kgid_t *gid) { struct idmap *idmap = server->nfs_client->cl_idmap; __u32 id = -1; int ret = 0; if (!nfs_map_string_to_numeric(name, namelen, &id)) ret = nfs_idmap_lookup_id(name, namelen, "gid", &id, idmap); if (ret == 0) { *gid = make_kgid(idmap_userns(idmap), id); if (!gid_valid(*gid)) ret = -ERANGE; } trace_nfs4_map_group_to_gid(name, namelen, id, ret); return ret; } int nfs_map_uid_to_name(const struct nfs_server *server, kuid_t uid, char *buf, size_t buflen) { struct idmap *idmap = server->nfs_client->cl_idmap; int ret = -EINVAL; __u32 id; id = from_kuid_munged(idmap_userns(idmap), uid); if (!(server->caps & NFS_CAP_UIDGID_NOMAP)) ret = nfs_idmap_lookup_name(id, "user", buf, buflen, idmap); if (ret < 0) ret = nfs_map_numeric_to_string(id, buf, buflen); trace_nfs4_map_uid_to_name(buf, ret, id, ret); return ret; } int nfs_map_gid_to_group(const struct nfs_server *server, kgid_t gid, char *buf, size_t buflen) { struct idmap *idmap = server->nfs_client->cl_idmap; int ret = -EINVAL; __u32 id; id = from_kgid_munged(idmap_userns(idmap), gid); if (!(server->caps & NFS_CAP_UIDGID_NOMAP)) ret = nfs_idmap_lookup_name(id, "group", buf, buflen, idmap); if (ret < 0) ret = nfs_map_numeric_to_string(id, buf, buflen); trace_nfs4_map_gid_to_group(buf, ret, id, ret); return ret; } |
41 21 41 41 41 41 41 41 41 41 41 41 41 40 2 40 40 40 40 20 2 2 2 9 12 41 41 40 41 40 40 40 40 40 39 4 42 42 42 42 9 2 8 9 9 9 9 9 8 9 9 9 9 9 9 5 1 1 1 1 1 1 1 1 8 9 1 2 2 9 7 2 9 9 5 9 9 9 9 9 9 1 9 9 9 4 9 9 9 9 9 9 | 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60 61 62 63 64 65 66 67 68 69 70 71 72 73 74 75 76 77 78 79 80 81 82 83 84 85 86 87 88 89 90 91 92 93 94 95 96 97 98 99 100 101 102 103 104 105 106 107 108 109 110 111 112 113 114 115 116 117 118 119 120 121 122 123 124 125 126 127 128 129 130 131 132 133 134 135 136 137 138 139 140 141 142 143 144 145 146 147 148 149 150 151 152 153 154 155 156 157 158 159 160 161 162 163 164 165 166 167 168 169 170 171 172 173 174 175 176 177 178 179 180 181 182 183 184 185 186 187 188 189 190 191 192 193 194 195 196 197 198 199 200 201 202 203 204 205 206 207 208 209 210 211 212 213 214 215 216 217 218 219 220 221 222 223 224 225 226 227 228 229 230 231 232 233 234 235 236 237 238 239 240 241 242 243 244 245 246 247 248 249 250 251 252 253 254 255 256 257 258 259 260 261 262 263 264 265 266 267 268 269 270 271 272 273 274 275 276 277 278 279 280 281 282 283 284 285 286 287 288 289 290 291 292 293 294 295 296 297 298 299 300 301 302 303 304 305 306 307 308 309 310 311 312 313 314 315 316 317 318 319 320 321 322 323 324 325 326 327 328 329 330 331 332 333 334 335 336 337 338 339 340 341 342 343 344 345 346 347 348 349 350 351 352 353 354 355 356 357 358 359 360 361 362 363 364 365 366 367 368 369 370 371 372 373 374 375 376 377 378 379 380 381 382 383 384 385 386 387 388 389 390 391 392 393 394 395 396 397 398 399 400 401 402 403 404 405 406 407 408 409 410 411 412 413 414 415 416 417 418 419 420 421 422 423 424 425 426 427 428 429 430 431 432 433 434 435 436 437 438 439 440 441 442 443 444 445 446 447 448 449 450 451 452 453 454 455 456 457 458 459 460 461 462 463 464 465 466 467 468 469 470 471 472 473 474 475 476 477 478 479 480 481 482 483 484 485 486 487 488 489 490 491 492 493 494 495 496 497 498 499 500 501 502 503 504 505 506 507 508 509 510 511 512 513 514 515 516 517 518 519 520 521 522 523 524 525 526 527 528 529 530 531 532 533 534 535 536 537 538 539 540 541 542 543 544 545 546 547 548 549 550 551 552 553 554 555 556 557 558 559 560 561 562 563 564 565 566 567 568 569 570 571 572 573 574 575 576 577 578 579 580 581 582 583 584 585 586 587 588 589 590 591 592 593 594 595 596 597 598 599 600 601 602 603 604 605 606 607 608 609 610 611 612 613 614 615 616 617 618 619 620 621 622 623 624 625 626 627 628 629 630 631 632 633 634 635 636 637 638 639 640 641 642 643 644 645 646 647 648 649 650 651 652 653 654 655 656 657 658 659 660 661 662 663 664 665 666 667 668 669 670 671 672 673 674 675 676 677 678 679 680 681 682 683 684 685 686 687 688 689 690 691 692 693 694 695 696 697 698 699 700 701 702 703 704 705 706 707 708 709 710 711 712 713 714 715 716 717 718 719 720 721 722 723 724 725 726 727 728 729 730 731 732 733 734 735 736 737 738 739 740 741 742 743 744 745 746 747 748 749 750 751 752 753 754 755 756 757 758 759 760 761 762 763 764 765 766 767 768 769 770 771 772 773 774 775 776 777 778 779 780 781 782 783 784 785 786 787 788 789 790 791 792 793 794 795 796 797 798 799 800 801 802 803 804 805 806 807 808 809 810 811 812 813 814 815 816 817 818 819 820 821 822 823 824 825 826 827 828 829 830 831 832 833 834 835 836 837 838 839 840 841 842 843 844 845 846 847 848 849 850 851 852 853 854 855 856 857 858 859 860 861 862 863 864 865 866 867 868 869 870 871 872 873 874 875 876 877 878 879 880 881 882 883 884 885 886 887 888 889 890 891 892 893 894 895 896 897 898 899 900 901 902 903 904 905 906 907 908 909 910 911 912 913 914 915 916 917 918 919 920 921 922 923 924 925 926 927 928 929 930 931 932 933 934 935 936 937 938 939 940 941 942 943 944 945 946 947 948 949 950 951 952 953 954 955 956 957 958 959 960 961 962 963 964 965 966 967 968 969 970 971 972 973 974 975 976 977 978 979 980 981 982 983 984 985 986 987 988 989 990 991 992 993 994 995 996 997 998 999 1000 1001 1002 1003 1004 1005 1006 1007 1008 1009 1010 1011 1012 1013 1014 1015 1016 1017 1018 1019 1020 1021 1022 1023 1024 1025 1026 1027 1028 1029 1030 1031 1032 1033 1034 1035 1036 1037 1038 1039 1040 1041 1042 1043 1044 1045 1046 1047 1048 1049 1050 1051 1052 1053 1054 1055 1056 1057 1058 1059 1060 1061 1062 1063 1064 1065 1066 1067 1068 1069 1070 1071 1072 1073 1074 1075 1076 1077 1078 1079 1080 1081 1082 1083 1084 1085 1086 1087 1088 1089 1090 1091 1092 1093 1094 1095 1096 1097 1098 1099 1100 1101 1102 1103 1104 1105 1106 1107 1108 1109 1110 1111 1112 1113 1114 1115 1116 1117 1118 1119 1120 1121 1122 1123 1124 1125 1126 1127 1128 1129 1130 1131 1132 1133 1134 1135 1136 1137 1138 1139 1140 1141 1142 1143 1144 1145 1146 1147 1148 1149 1150 1151 1152 1153 1154 1155 1156 1157 1158 1159 1160 1161 1162 1163 1164 1165 1166 1167 1168 1169 1170 1171 1172 1173 1174 1175 1176 1177 1178 1179 1180 1181 1182 1183 1184 1185 1186 1187 1188 1189 1190 1191 1192 1193 1194 1195 1196 1197 1198 1199 1200 1201 1202 1203 1204 1205 1206 1207 1208 1209 1210 1211 1212 1213 1214 1215 1216 1217 1218 1219 1220 1221 1222 1223 1224 1225 1226 1227 1228 1229 1230 1231 1232 1233 1234 1235 1236 1237 1238 1239 1240 1241 1242 1243 1244 1245 1246 1247 1248 1249 1250 1251 1252 1253 1254 1255 1256 1257 1258 1259 1260 1261 1262 1263 1264 1265 1266 1267 1268 1269 1270 1271 1272 1273 1274 1275 1276 1277 1278 1279 1280 1281 1282 1283 1284 1285 1286 1287 1288 1289 1290 1291 1292 1293 1294 1295 1296 1297 1298 1299 1300 1301 1302 1303 1304 1305 1306 1307 1308 1309 1310 1311 1312 1313 1314 1315 1316 1317 1318 1319 1320 1321 1322 1323 1324 1325 1326 1327 1328 1329 1330 1331 1332 1333 1334 1335 1336 1337 1338 1339 1340 1341 1342 1343 1344 1345 1346 1347 1348 1349 1350 1351 1352 1353 1354 1355 1356 1357 1358 1359 1360 1361 1362 1363 1364 1365 1366 1367 1368 1369 1370 1371 1372 1373 1374 1375 1376 1377 1378 1379 1380 1381 1382 1383 1384 1385 1386 1387 1388 1389 1390 1391 1392 1393 1394 1395 1396 1397 1398 1399 1400 1401 1402 1403 1404 1405 1406 1407 1408 1409 1410 1411 1412 1413 1414 1415 1416 1417 1418 1419 1420 1421 1422 1423 1424 1425 1426 1427 1428 1429 1430 1431 1432 1433 1434 1435 1436 1437 1438 1439 1440 1441 1442 1443 1444 1445 1446 1447 1448 1449 1450 1451 1452 1453 1454 1455 1456 1457 1458 1459 1460 1461 1462 1463 1464 1465 1466 1467 1468 1469 1470 1471 1472 1473 1474 1475 1476 1477 1478 1479 1480 1481 1482 1483 1484 1485 1486 1487 1488 1489 1490 1491 1492 1493 1494 1495 1496 1497 1498 1499 1500 1501 1502 1503 1504 1505 1506 1507 1508 1509 1510 1511 1512 1513 1514 1515 1516 1517 1518 1519 1520 1521 1522 1523 1524 1525 1526 1527 1528 1529 1530 1531 1532 1533 1534 1535 1536 1537 1538 1539 1540 1541 1542 1543 1544 1545 1546 1547 1548 1549 1550 1551 1552 1553 1554 1555 1556 1557 1558 1559 1560 1561 1562 1563 1564 1565 1566 1567 1568 1569 1570 1571 1572 1573 1574 1575 1576 1577 1578 1579 1580 1581 1582 1583 1584 1585 1586 1587 1588 1589 1590 1591 1592 1593 1594 1595 1596 1597 1598 1599 1600 1601 1602 1603 1604 1605 1606 1607 1608 1609 1610 1611 1612 1613 1614 1615 1616 1617 1618 1619 1620 1621 1622 1623 1624 1625 1626 1627 1628 1629 1630 1631 1632 1633 1634 1635 1636 1637 1638 1639 1640 1641 1642 1643 1644 1645 1646 1647 1648 1649 1650 1651 1652 1653 1654 1655 1656 1657 1658 1659 1660 1661 1662 1663 1664 1665 1666 1667 1668 1669 1670 1671 1672 1673 1674 1675 1676 1677 1678 1679 1680 1681 1682 1683 1684 1685 1686 1687 1688 1689 1690 1691 1692 1693 1694 1695 1696 1697 1698 1699 1700 1701 1702 1703 1704 1705 1706 1707 1708 1709 1710 1711 1712 1713 1714 1715 1716 1717 1718 1719 1720 1721 1722 1723 1724 1725 1726 1727 1728 1729 1730 1731 1732 1733 1734 1735 1736 1737 1738 1739 1740 1741 1742 1743 1744 1745 1746 1747 1748 1749 1750 1751 1752 1753 1754 1755 1756 1757 1758 1759 1760 1761 | // SPDX-License-Identifier: GPL-2.0-or-later /* * aops.c - NTFS kernel address space operations and page cache handling. * * Copyright (c) 2001-2014 Anton Altaparmakov and Tuxera Inc. * Copyright (c) 2002 Richard Russon */ #include <linux/errno.h> #include <linux/fs.h> #include <linux/gfp.h> #include <linux/mm.h> #include <linux/pagemap.h> #include <linux/swap.h> #include <linux/buffer_head.h> #include <linux/writeback.h> #include <linux/bit_spinlock.h> #include <linux/bio.h> #include "aops.h" #include "attrib.h" #include "debug.h" #include "inode.h" #include "mft.h" #include "runlist.h" #include "types.h" #include "ntfs.h" /** * ntfs_end_buffer_async_read - async io completion for reading attributes * @bh: buffer head on which io is completed * @uptodate: whether @bh is now uptodate or not * * Asynchronous I/O completion handler for reading pages belonging to the * attribute address space of an inode. The inodes can either be files or * directories or they can be fake inodes describing some attribute. * * If NInoMstProtected(), perform the post read mst fixups when all IO on the * page has been completed and mark the page uptodate or set the error bit on * the page. To determine the size of the records that need fixing up, we * cheat a little bit by setting the index_block_size in ntfs_inode to the ntfs * record size, and index_block_size_bits, to the log(base 2) of the ntfs * record size. */ static void ntfs_end_buffer_async_read(struct buffer_head *bh, int uptodate) { unsigned long flags; struct buffer_head *first, *tmp; struct page *page; struct inode *vi; ntfs_inode *ni; int page_uptodate = 1; page = bh->b_page; vi = page->mapping->host; ni = NTFS_I(vi); if (likely(uptodate)) { loff_t i_size; s64 file_ofs, init_size; set_buffer_uptodate(bh); file_ofs = ((s64)page->index << PAGE_SHIFT) + bh_offset(bh); read_lock_irqsave(&ni->size_lock, flags); init_size = ni->initialized_size; i_size = i_size_read(vi); read_unlock_irqrestore(&ni->size_lock, flags); if (unlikely(init_size > i_size)) { /* Race with shrinking truncate. */ init_size = i_size; } /* Check for the current buffer head overflowing. */ if (unlikely(file_ofs + bh->b_size > init_size)) { int ofs; void *kaddr; ofs = 0; if (file_ofs < init_size) ofs = init_size - file_ofs; kaddr = kmap_atomic(page); memset(kaddr + bh_offset(bh) + ofs, 0, bh->b_size - ofs); flush_dcache_page(page); kunmap_atomic(kaddr); } } else { clear_buffer_uptodate(bh); SetPageError(page); ntfs_error(ni->vol->sb, "Buffer I/O error, logical block " "0x%llx.", (unsigned long long)bh->b_blocknr); } first = page_buffers(page); spin_lock_irqsave(&first->b_uptodate_lock, flags); clear_buffer_async_read(bh); unlock_buffer(bh); tmp = bh; do { if (!buffer_uptodate(tmp)) page_uptodate = 0; if (buffer_async_read(tmp)) { if (likely(buffer_locked(tmp))) goto still_busy; /* Async buffers must be locked. */ BUG(); } tmp = tmp->b_this_page; } while (tmp != bh); spin_unlock_irqrestore(&first->b_uptodate_lock, flags); /* * If none of the buffers had errors then we can set the page uptodate, * but we first have to perform the post read mst fixups, if the * attribute is mst protected, i.e. if NInoMstProteced(ni) is true. * Note we ignore fixup errors as those are detected when * map_mft_record() is called which gives us per record granularity * rather than per page granularity. */ if (!NInoMstProtected(ni)) { if (likely(page_uptodate && !PageError(page))) SetPageUptodate(page); } else { u8 *kaddr; unsigned int i, recs; u32 rec_size; rec_size = ni->itype.index.block_size; recs = PAGE_SIZE / rec_size; /* Should have been verified before we got here... */ BUG_ON(!recs); kaddr = kmap_atomic(page); for (i = 0; i < recs; i++) post_read_mst_fixup((NTFS_RECORD*)(kaddr + i * rec_size), rec_size); kunmap_atomic(kaddr); flush_dcache_page(page); if (likely(page_uptodate && !PageError(page))) SetPageUptodate(page); } unlock_page(page); return; still_busy: spin_unlock_irqrestore(&first->b_uptodate_lock, flags); return; } /** * ntfs_read_block - fill a @page of an address space with data * @page: page cache page to fill with data * * Fill the page @page of the address space belonging to the @page->host inode. * We read each buffer asynchronously and when all buffers are read in, our io * completion handler ntfs_end_buffer_read_async(), if required, automatically * applies the mst fixups to the page before finally marking it uptodate and * unlocking it. * * We only enforce allocated_size limit because i_size is checked for in * generic_file_read(). * * Return 0 on success and -errno on error. * * Contains an adapted version of fs/buffer.c::block_read_full_folio(). */ static int ntfs_read_block(struct page *page) { loff_t i_size; VCN vcn; LCN lcn; s64 init_size; struct inode *vi; ntfs_inode *ni; ntfs_volume *vol; runlist_element *rl; struct buffer_head *bh, *head, *arr[MAX_BUF_PER_PAGE]; sector_t iblock, lblock, zblock; unsigned long flags; unsigned int blocksize, vcn_ofs; int i, nr; unsigned char blocksize_bits; vi = page->mapping->host; ni = NTFS_I(vi); vol = ni->vol; /* $MFT/$DATA must have its complete runlist in memory at all times. */ BUG_ON(!ni->runlist.rl && !ni->mft_no && !NInoAttr(ni)); blocksize = vol->sb->s_blocksize; blocksize_bits = vol->sb->s_blocksize_bits; if (!page_has_buffers(page)) { create_empty_buffers(page, blocksize, 0); if (unlikely(!page_has_buffers(page))) { unlock_page(page); return -ENOMEM; } } bh = head = page_buffers(page); BUG_ON(!bh); /* * We may be racing with truncate. To avoid some of the problems we * now take a snapshot of the various sizes and use those for the whole * of the function. In case of an extending truncate it just means we * may leave some buffers unmapped which are now allocated. This is * not a problem since these buffers will just get mapped when a write * occurs. In case of a shrinking truncate, we will detect this later * on due to the runlist being incomplete and if the page is being * fully truncated, truncate will throw it away as soon as we unlock * it so no need to worry what we do with it. */ iblock = (s64)page->index << (PAGE_SHIFT - blocksize_bits); read_lock_irqsave(&ni->size_lock, flags); lblock = (ni->allocated_size + blocksize - 1) >> blocksize_bits; init_size = ni->initialized_size; i_size = i_size_read(vi); read_unlock_irqrestore(&ni->size_lock, flags); if (unlikely(init_size > i_size)) { /* Race with shrinking truncate. */ init_size = i_size; } zblock = (init_size + blocksize - 1) >> blocksize_bits; /* Loop through all the buffers in the page. */ rl = NULL; nr = i = 0; do { int err = 0; if (unlikely(buffer_uptodate(bh))) continue; if (unlikely(buffer_mapped(bh))) { arr[nr++] = bh; continue; } bh->b_bdev = vol->sb->s_bdev; /* Is the block within the allowed limits? */ if (iblock < lblock) { bool is_retry = false; /* Convert iblock into corresponding vcn and offset. */ vcn = (VCN)iblock << blocksize_bits >> vol->cluster_size_bits; vcn_ofs = ((VCN)iblock << blocksize_bits) & vol->cluster_size_mask; if (!rl) { lock_retry_remap: down_read(&ni->runlist.lock); rl = ni->runlist.rl; } if (likely(rl != NULL)) { /* Seek to element containing target vcn. */ while (rl->length && rl[1].vcn <= vcn) rl++; lcn = ntfs_rl_vcn_to_lcn(rl, vcn); } else lcn = LCN_RL_NOT_MAPPED; /* Successful remap. */ if (lcn >= 0) { /* Setup buffer head to correct block. */ bh->b_blocknr = ((lcn << vol->cluster_size_bits) + vcn_ofs) >> blocksize_bits; set_buffer_mapped(bh); /* Only read initialized data blocks. */ if (iblock < zblock) { arr[nr++] = bh; continue; } /* Fully non-initialized data block, zero it. */ goto handle_zblock; } /* It is a hole, need to zero it. */ if (lcn == LCN_HOLE) goto handle_hole; /* If first try and runlist unmapped, map and retry. */ if (!is_retry && lcn == LCN_RL_NOT_MAPPED) { is_retry = true; /* * Attempt to map runlist, dropping lock for * the duration. */ up_read(&ni->runlist.lock); err = ntfs_map_runlist(ni, vcn); if (likely(!err)) goto lock_retry_remap; rl = NULL; } else if (!rl) up_read(&ni->runlist.lock); /* * If buffer is outside the runlist, treat it as a * hole. This can happen due to concurrent truncate * for example. */ if (err == -ENOENT || lcn == LCN_ENOENT) { err = 0; goto handle_hole; } /* Hard error, zero out region. */ if (!err) err = -EIO; bh->b_blocknr = -1; SetPageError(page); ntfs_error(vol->sb, "Failed to read from inode 0x%lx, " "attribute type 0x%x, vcn 0x%llx, " "offset 0x%x because its location on " "disk could not be determined%s " "(error code %i).", ni->mft_no, ni->type, (unsigned long long)vcn, vcn_ofs, is_retry ? " even after " "retrying" : "", err); } /* * Either iblock was outside lblock limits or * ntfs_rl_vcn_to_lcn() returned error. Just zero that portion * of the page and set the buffer uptodate. */ handle_hole: bh->b_blocknr = -1UL; clear_buffer_mapped(bh); handle_zblock: zero_user(page, i * blocksize, blocksize); if (likely(!err)) set_buffer_uptodate(bh); } while (i++, iblock++, (bh = bh->b_this_page) != head); /* Release the lock if we took it. */ if (rl) up_read(&ni->runlist.lock); /* Check we have at least one buffer ready for i/o. */ if (nr) { struct buffer_head *tbh; /* Lock the buffers. */ for (i = 0; i < nr; i++) { tbh = arr[i]; lock_buffer(tbh); tbh->b_end_io = ntfs_end_buffer_async_read; set_buffer_async_read(tbh); } /* Finally, start i/o on the buffers. */ for (i = 0; i < nr; i++) { tbh = arr[i]; if (likely(!buffer_uptodate(tbh))) submit_bh(REQ_OP_READ, tbh); else ntfs_end_buffer_async_read(tbh, 1); } return 0; } /* No i/o was scheduled on any of the buffers. */ if (likely(!PageError(page))) SetPageUptodate(page); else /* Signal synchronous i/o error. */ nr = -EIO; unlock_page(page); return nr; } /** * ntfs_read_folio - fill a @folio of a @file with data from the device * @file: open file to which the folio @folio belongs or NULL * @folio: page cache folio to fill with data * * For non-resident attributes, ntfs_read_folio() fills the @folio of the open * file @file by calling the ntfs version of the generic block_read_full_folio() * function, ntfs_read_block(), which in turn creates and reads in the buffers * associated with the folio asynchronously. * * For resident attributes, OTOH, ntfs_read_folio() fills @folio by copying the * data from the mft record (which at this stage is most likely in memory) and * fills the remainder with zeroes. Thus, in this case, I/O is synchronous, as * even if the mft record is not cached at this point in time, we need to wait * for it to be read in before we can do the copy. * * Return 0 on success and -errno on error. */ static int ntfs_read_folio(struct file *file, struct folio *folio) { struct page *page = &folio->page; loff_t i_size; struct inode *vi; ntfs_inode *ni, *base_ni; u8 *addr; ntfs_attr_search_ctx *ctx; MFT_RECORD *mrec; unsigned long flags; u32 attr_len; int err = 0; retry_readpage: BUG_ON(!PageLocked(page)); vi = page->mapping->host; i_size = i_size_read(vi); /* Is the page fully outside i_size? (truncate in progress) */ if (unlikely(page->index >= (i_size + PAGE_SIZE - 1) >> PAGE_SHIFT)) { zero_user(page, 0, PAGE_SIZE); ntfs_debug("Read outside i_size - truncated?"); goto done; } /* * This can potentially happen because we clear PageUptodate() during * ntfs_writepage() of MstProtected() attributes. */ if (PageUptodate(page)) { unlock_page(page); return 0; } ni = NTFS_I(vi); /* * Only $DATA attributes can be encrypted and only unnamed $DATA * attributes can be compressed. Index root can have the flags set but * this means to create compressed/encrypted files, not that the * attribute is compressed/encrypted. Note we need to check for * AT_INDEX_ALLOCATION since this is the type of both directory and * index inodes. */ if (ni->type != AT_INDEX_ALLOCATION) { /* If attribute is encrypted, deny access, just like NT4. */ if (NInoEncrypted(ni)) { BUG_ON(ni->type != AT_DATA); err = -EACCES; goto err_out; } /* Compressed data streams are handled in compress.c. */ if (NInoNonResident(ni) && NInoCompressed(ni)) { BUG_ON(ni->type != AT_DATA); BUG_ON(ni->name_len); return ntfs_read_compressed_block(page); } } /* NInoNonResident() == NInoIndexAllocPresent() */ if (NInoNonResident(ni)) { /* Normal, non-resident data stream. */ return ntfs_read_block(page); } /* * Attribute is resident, implying it is not compressed or encrypted. * This also means the attribute is smaller than an mft record and * hence smaller than a page, so can simply zero out any pages with * index above 0. Note the attribute can actually be marked compressed * but if it is resident the actual data is not compressed so we are * ok to ignore the compressed flag here. */ if (unlikely(page->index > 0)) { zero_user(page, 0, PAGE_SIZE); goto done; } if (!NInoAttr(ni)) base_ni = ni; else base_ni = ni->ext.base_ntfs_ino; /* Map, pin, and lock the mft record. */ mrec = map_mft_record(base_ni); if (IS_ERR(mrec)) { err = PTR_ERR(mrec); goto err_out; } /* * If a parallel write made the attribute non-resident, drop the mft * record and retry the read_folio. */ if (unlikely(NInoNonResident(ni))) { unmap_mft_record(base_ni); goto retry_readpage; } ctx = ntfs_attr_get_search_ctx(base_ni, mrec); if (unlikely(!ctx)) { err = -ENOMEM; goto unm_err_out; } err = ntfs_attr_lookup(ni->type, ni->name, ni->name_len, CASE_SENSITIVE, 0, NULL, 0, ctx); if (unlikely(err)) goto put_unm_err_out; attr_len = le32_to_cpu(ctx->attr->data.resident.value_length); read_lock_irqsave(&ni->size_lock, flags); if (unlikely(attr_len > ni->initialized_size)) attr_len = ni->initialized_size; i_size = i_size_read(vi); read_unlock_irqrestore(&ni->size_lock, flags); if (unlikely(attr_len > i_size)) { /* Race with shrinking truncate. */ attr_len = i_size; } addr = kmap_atomic(page); /* Copy the data to the page. */ memcpy(addr, (u8*)ctx->attr + le16_to_cpu(ctx->attr->data.resident.value_offset), attr_len); /* Zero the remainder of the page. */ memset(addr + attr_len, 0, PAGE_SIZE - attr_len); flush_dcache_page(page); kunmap_atomic(addr); put_unm_err_out: ntfs_attr_put_search_ctx(ctx); unm_err_out: unmap_mft_record(base_ni); done: SetPageUptodate(page); err_out: unlock_page(page); return err; } #ifdef NTFS_RW /** * ntfs_write_block - write a @page to the backing store * @page: page cache page to write out * @wbc: writeback control structure * * This function is for writing pages belonging to non-resident, non-mst * protected attributes to their backing store. * * For a page with buffers, map and write the dirty buffers asynchronously * under page writeback. For a page without buffers, create buffers for the * page, then proceed as above. * * If a page doesn't have buffers the page dirty state is definitive. If a page * does have buffers, the page dirty state is just a hint, and the buffer dirty * state is definitive. (A hint which has rules: dirty buffers against a clean * page is illegal. Other combinations are legal and need to be handled. In * particular a dirty page containing clean buffers for example.) * * Return 0 on success and -errno on error. * * Based on ntfs_read_block() and __block_write_full_folio(). */ static int ntfs_write_block(struct page *page, struct writeback_control *wbc) { VCN vcn; LCN lcn; s64 initialized_size; loff_t i_size; sector_t block, dblock, iblock; struct inode *vi; ntfs_inode *ni; ntfs_volume *vol; runlist_element *rl; struct buffer_head *bh, *head; unsigned long flags; unsigned int blocksize, vcn_ofs; int err; bool need_end_writeback; unsigned char blocksize_bits; vi = page->mapping->host; ni = NTFS_I(vi); vol = ni->vol; ntfs_debug("Entering for inode 0x%lx, attribute type 0x%x, page index " "0x%lx.", ni->mft_no, ni->type, page->index); BUG_ON(!NInoNonResident(ni)); BUG_ON(NInoMstProtected(ni)); blocksize = vol->sb->s_blocksize; blocksize_bits = vol->sb->s_blocksize_bits; if (!page_has_buffers(page)) { BUG_ON(!PageUptodate(page)); create_empty_buffers(page, blocksize, (1 << BH_Uptodate) | (1 << BH_Dirty)); if (unlikely(!page_has_buffers(page))) { ntfs_warning(vol->sb, "Error allocating page " "buffers. Redirtying page so we try " "again later."); /* * Put the page back on mapping->dirty_pages, but leave * its buffers' dirty state as-is. */ redirty_page_for_writepage(wbc, page); unlock_page(page); return 0; } } bh = head = page_buffers(page); BUG_ON(!bh); /* NOTE: Different naming scheme to ntfs_read_block()! */ /* The first block in the page. */ block = (s64)page->index << (PAGE_SHIFT - blocksize_bits); read_lock_irqsave(&ni->size_lock, flags); i_size = i_size_read(vi); initialized_size = ni->initialized_size; read_unlock_irqrestore(&ni->size_lock, flags); /* The first out of bounds block for the data size. */ dblock = (i_size + blocksize - 1) >> blocksize_bits; /* The last (fully or partially) initialized block. */ iblock = initialized_size >> blocksize_bits; /* * Be very careful. We have no exclusion from block_dirty_folio * here, and the (potentially unmapped) buffers may become dirty at * any time. If a buffer becomes dirty here after we've inspected it * then we just miss that fact, and the page stays dirty. * * Buffers outside i_size may be dirtied by block_dirty_folio; * handle that here by just cleaning them. */ /* * Loop through all the buffers in the page, mapping all the dirty * buffers to disk addresses and handling any aliases from the * underlying block device's mapping. */ rl = NULL; err = 0; do { bool is_retry = false; if (unlikely(block >= dblock)) { /* * Mapped buffers outside i_size will occur, because * this page can be outside i_size when there is a * truncate in progress. The contents of such buffers * were zeroed by ntfs_writepage(). * * FIXME: What about the small race window where * ntfs_writepage() has not done any clearing because * the page was within i_size but before we get here, * vmtruncate() modifies i_size? */ clear_buffer_dirty(bh); set_buffer_uptodate(bh); continue; } /* Clean buffers are not written out, so no need to map them. */ if (!buffer_dirty(bh)) continue; /* Make sure we have enough initialized size. */ if (unlikely((block >= iblock) && (initialized_size < i_size))) { /* * If this page is fully outside initialized * size, zero out all pages between the current * initialized size and the current page. Just * use ntfs_read_folio() to do the zeroing * transparently. */ if (block > iblock) { // TODO: // For each page do: // - read_cache_page() // Again for each page do: // - wait_on_page_locked() // - Check (PageUptodate(page) && // !PageError(page)) // Update initialized size in the attribute and // in the inode. // Again, for each page do: // block_dirty_folio(); // put_page() // We don't need to wait on the writes. // Update iblock. } /* * The current page straddles initialized size. Zero * all non-uptodate buffers and set them uptodate (and * dirty?). Note, there aren't any non-uptodate buffers * if the page is uptodate. * FIXME: For an uptodate page, the buffers may need to * be written out because they were not initialized on * disk before. */ if (!PageUptodate(page)) { // TODO: // Zero any non-uptodate buffers up to i_size. // Set them uptodate and dirty. } // TODO: // Update initialized size in the attribute and in the // inode (up to i_size). // Update iblock. // FIXME: This is inefficient. Try to batch the two // size changes to happen in one go. ntfs_error(vol->sb, "Writing beyond initialized size " "is not supported yet. Sorry."); err = -EOPNOTSUPP; break; // Do NOT set_buffer_new() BUT DO clear buffer range // outside write request range. // set_buffer_uptodate() on complete buffers as well as // set_buffer_dirty(). } /* No need to map buffers that are already mapped. */ if (buffer_mapped(bh)) continue; /* Unmapped, dirty buffer. Need to map it. */ bh->b_bdev = vol->sb->s_bdev; /* Convert block into corresponding vcn and offset. */ vcn = (VCN)block << blocksize_bits; vcn_ofs = vcn & vol->cluster_size_mask; vcn >>= vol->cluster_size_bits; if (!rl) { lock_retry_remap: down_read(&ni->runlist.lock); rl = ni->runlist.rl; } if (likely(rl != NULL)) { /* Seek to element containing target vcn. */ while (rl->length && rl[1].vcn <= vcn) rl++; lcn = ntfs_rl_vcn_to_lcn(rl, vcn); } else lcn = LCN_RL_NOT_MAPPED; /* Successful remap. */ if (lcn >= 0) { /* Setup buffer head to point to correct block. */ bh->b_blocknr = ((lcn << vol->cluster_size_bits) + vcn_ofs) >> blocksize_bits; set_buffer_mapped(bh); continue; } /* It is a hole, need to instantiate it. */ if (lcn == LCN_HOLE) { u8 *kaddr; unsigned long *bpos, *bend; /* Check if the buffer is zero. */ kaddr = kmap_atomic(page); bpos = (unsigned long *)(kaddr + bh_offset(bh)); bend = (unsigned long *)((u8*)bpos + blocksize); do { if (unlikely(*bpos)) break; } while (likely(++bpos < bend)); kunmap_atomic(kaddr); if (bpos == bend) { /* * Buffer is zero and sparse, no need to write * it. */ bh->b_blocknr = -1; clear_buffer_dirty(bh); continue; } // TODO: Instantiate the hole. // clear_buffer_new(bh); // clean_bdev_bh_alias(bh); ntfs_error(vol->sb, "Writing into sparse regions is " "not supported yet. Sorry."); err = -EOPNOTSUPP; break; } /* If first try and runlist unmapped, map and retry. */ if (!is_retry && lcn == LCN_RL_NOT_MAPPED) { is_retry = true; /* * Attempt to map runlist, dropping lock for * the duration. */ up_read(&ni->runlist.lock); err = ntfs_map_runlist(ni, vcn); if (likely(!err)) goto lock_retry_remap; rl = NULL; } else if (!rl) up_read(&ni->runlist.lock); /* * If buffer is outside the runlist, truncate has cut it out * of the runlist. Just clean and clear the buffer and set it * uptodate so it can get discarded by the VM. */ if (err == -ENOENT || lcn == LCN_ENOENT) { bh->b_blocknr = -1; clear_buffer_dirty(bh); zero_user(page, bh_offset(bh), blocksize); set_buffer_uptodate(bh); err = 0; continue; } /* Failed to map the buffer, even after retrying. */ if (!err) err = -EIO; bh->b_blocknr = -1; ntfs_error(vol->sb, "Failed to write to inode 0x%lx, " "attribute type 0x%x, vcn 0x%llx, offset 0x%x " "because its location on disk could not be " "determined%s (error code %i).", ni->mft_no, ni->type, (unsigned long long)vcn, vcn_ofs, is_retry ? " even after " "retrying" : "", err); break; } while (block++, (bh = bh->b_this_page) != head); /* Release the lock if we took it. */ if (rl) up_read(&ni->runlist.lock); /* For the error case, need to reset bh to the beginning. */ bh = head; /* Just an optimization, so ->read_folio() is not called later. */ if (unlikely(!PageUptodate(page))) { int uptodate = 1; do { if (!buffer_uptodate(bh)) { uptodate = 0; bh = head; break; } } while ((bh = bh->b_this_page) != head); if (uptodate) SetPageUptodate(page); } /* Setup all mapped, dirty buffers for async write i/o. */ do { if (buffer_mapped(bh) && buffer_dirty(bh)) { lock_buffer(bh); if (test_clear_buffer_dirty(bh)) { BUG_ON(!buffer_uptodate(bh)); mark_buffer_async_write(bh); } else unlock_buffer(bh); } else if (unlikely(err)) { /* * For the error case. The buffer may have been set * dirty during attachment to a dirty page. */ if (err != -ENOMEM) clear_buffer_dirty(bh); } } while ((bh = bh->b_this_page) != head); if (unlikely(err)) { // TODO: Remove the -EOPNOTSUPP check later on... if (unlikely(err == -EOPNOTSUPP)) err = 0; else if (err == -ENOMEM) { ntfs_warning(vol->sb, "Error allocating memory. " "Redirtying page so we try again " "later."); /* * Put the page back on mapping->dirty_pages, but * leave its buffer's dirty state as-is. */ redirty_page_for_writepage(wbc, page); err = 0; } else SetPageError(page); } BUG_ON(PageWriteback(page)); set_page_writeback(page); /* Keeps try_to_free_buffers() away. */ /* Submit the prepared buffers for i/o. */ need_end_writeback = true; do { struct buffer_head *next = bh->b_this_page; if (buffer_async_write(bh)) { submit_bh(REQ_OP_WRITE, bh); need_end_writeback = false; } bh = next; } while (bh != head); unlock_page(page); /* If no i/o was started, need to end_page_writeback(). */ if (unlikely(need_end_writeback)) end_page_writeback(page); ntfs_debug("Done."); return err; } /** * ntfs_write_mst_block - write a @page to the backing store * @page: page cache page to write out * @wbc: writeback control structure * * This function is for writing pages belonging to non-resident, mst protected * attributes to their backing store. The only supported attributes are index * allocation and $MFT/$DATA. Both directory inodes and index inodes are * supported for the index allocation case. * * The page must remain locked for the duration of the write because we apply * the mst fixups, write, and then undo the fixups, so if we were to unlock the * page before undoing the fixups, any other user of the page will see the * page contents as corrupt. * * We clear the page uptodate flag for the duration of the function to ensure * exclusion for the $MFT/$DATA case against someone mapping an mft record we * are about to apply the mst fixups to. * * Return 0 on success and -errno on error. * * Based on ntfs_write_block(), ntfs_mft_writepage(), and * write_mft_record_nolock(). */ static int ntfs_write_mst_block(struct page *page, struct writeback_control *wbc) { sector_t block, dblock, rec_block; struct inode *vi = page->mapping->host; ntfs_inode *ni = NTFS_I(vi); ntfs_volume *vol = ni->vol; u8 *kaddr; unsigned int rec_size = ni->itype.index.block_size; ntfs_inode *locked_nis[PAGE_SIZE / NTFS_BLOCK_SIZE]; struct buffer_head *bh, *head, *tbh, *rec_start_bh; struct buffer_head *bhs[MAX_BUF_PER_PAGE]; runlist_element *rl; int i, nr_locked_nis, nr_recs, nr_bhs, max_bhs, bhs_per_rec, err, err2; unsigned bh_size, rec_size_bits; bool sync, is_mft, page_is_dirty, rec_is_dirty; unsigned char bh_size_bits; if (WARN_ON(rec_size < NTFS_BLOCK_SIZE)) return -EINVAL; ntfs_debug("Entering for inode 0x%lx, attribute type 0x%x, page index " "0x%lx.", vi->i_ino, ni->type, page->index); BUG_ON(!NInoNonResident(ni)); BUG_ON(!NInoMstProtected(ni)); is_mft = (S_ISREG(vi->i_mode) && !vi->i_ino); /* * NOTE: ntfs_write_mst_block() would be called for $MFTMirr if a page * in its page cache were to be marked dirty. However this should * never happen with the current driver and considering we do not * handle this case here we do want to BUG(), at least for now. */ BUG_ON(!(is_mft || S_ISDIR(vi->i_mode) || (NInoAttr(ni) && ni->type == AT_INDEX_ALLOCATION))); bh_size = vol->sb->s_blocksize; bh_size_bits = vol->sb->s_blocksize_bits; max_bhs = PAGE_SIZE / bh_size; BUG_ON(!max_bhs); BUG_ON(max_bhs > MAX_BUF_PER_PAGE); /* Were we called for sync purposes? */ sync = (wbc->sync_mode == WB_SYNC_ALL); /* Make sure we have mapped buffers. */ bh = head = page_buffers(page); BUG_ON(!bh); rec_size_bits = ni->itype.index.block_size_bits; BUG_ON(!(PAGE_SIZE >> rec_size_bits)); bhs_per_rec = rec_size >> bh_size_bits; BUG_ON(!bhs_per_rec); /* The first block in the page. */ rec_block = block = (sector_t)page->index << (PAGE_SHIFT - bh_size_bits); /* The first out of bounds block for the data size. */ dblock = (i_size_read(vi) + bh_size - 1) >> bh_size_bits; rl = NULL; err = err2 = nr_bhs = nr_recs = nr_locked_nis = 0; page_is_dirty = rec_is_dirty = false; rec_start_bh = NULL; do { bool is_retry = false; if (likely(block < rec_block)) { if (unlikely(block >= dblock)) { clear_buffer_dirty(bh); set_buffer_uptodate(bh); continue; } /* * This block is not the first one in the record. We * ignore the buffer's dirty state because we could * have raced with a parallel mark_ntfs_record_dirty(). */ if (!rec_is_dirty) continue; if (unlikely(err2)) { if (err2 != -ENOMEM) clear_buffer_dirty(bh); continue; } } else /* if (block == rec_block) */ { BUG_ON(block > rec_block); /* This block is the first one in the record. */ rec_block += bhs_per_rec; err2 = 0; if (unlikely(block >= dblock)) { clear_buffer_dirty(bh); continue; } if (!buffer_dirty(bh)) { /* Clean records are not written out. */ rec_is_dirty = false; continue; } rec_is_dirty = true; rec_start_bh = bh; } /* Need to map the buffer if it is not mapped already. */ if (unlikely(!buffer_mapped(bh))) { VCN vcn; LCN lcn; unsigned int vcn_ofs; bh->b_bdev = vol->sb->s_bdev; /* Obtain the vcn and offset of the current block. */ vcn = (VCN)block << bh_size_bits; vcn_ofs = vcn & vol->cluster_size_mask; vcn >>= vol->cluster_size_bits; if (!rl) { lock_retry_remap: down_read(&ni->runlist.lock); rl = ni->runlist.rl; } if (likely(rl != NULL)) { /* Seek to element containing target vcn. */ while (rl->length && rl[1].vcn <= vcn) rl++; lcn = ntfs_rl_vcn_to_lcn(rl, vcn); } else lcn = LCN_RL_NOT_MAPPED; /* Successful remap. */ if (likely(lcn >= 0)) { /* Setup buffer head to correct block. */ bh->b_blocknr = ((lcn << vol->cluster_size_bits) + vcn_ofs) >> bh_size_bits; set_buffer_mapped(bh); } else { /* * Remap failed. Retry to map the runlist once * unless we are working on $MFT which always * has the whole of its runlist in memory. */ if (!is_mft && !is_retry && lcn == LCN_RL_NOT_MAPPED) { is_retry = true; /* * Attempt to map runlist, dropping * lock for the duration. */ up_read(&ni->runlist.lock); err2 = ntfs_map_runlist(ni, vcn); if (likely(!err2)) goto lock_retry_remap; if (err2 == -ENOMEM) page_is_dirty = true; lcn = err2; } else { err2 = -EIO; if (!rl) up_read(&ni->runlist.lock); } /* Hard error. Abort writing this record. */ if (!err || err == -ENOMEM) err = err2; bh->b_blocknr = -1; ntfs_error(vol->sb, "Cannot write ntfs record " "0x%llx (inode 0x%lx, " "attribute type 0x%x) because " "its location on disk could " "not be determined (error " "code %lli).", (long long)block << bh_size_bits >> vol->mft_record_size_bits, ni->mft_no, ni->type, (long long)lcn); /* * If this is not the first buffer, remove the * buffers in this record from the list of * buffers to write and clear their dirty bit * if not error -ENOMEM. */ if (rec_start_bh != bh) { while (bhs[--nr_bhs] != rec_start_bh) ; if (err2 != -ENOMEM) { do { clear_buffer_dirty( rec_start_bh); } while ((rec_start_bh = rec_start_bh-> b_this_page) != bh); } } continue; } } BUG_ON(!buffer_uptodate(bh)); BUG_ON(nr_bhs >= max_bhs); bhs[nr_bhs++] = bh; } while (block++, (bh = bh->b_this_page) != head); if (unlikely(rl)) up_read(&ni->runlist.lock); /* If there were no dirty buffers, we are done. */ if (!nr_bhs) goto done; /* Map the page so we can access its contents. */ kaddr = kmap(page); /* Clear the page uptodate flag whilst the mst fixups are applied. */ BUG_ON(!PageUptodate(page)); ClearPageUptodate(page); for (i = 0; i < nr_bhs; i++) { unsigned int ofs; /* Skip buffers which are not at the beginning of records. */ if (i % bhs_per_rec) continue; tbh = bhs[i]; ofs = bh_offset(tbh); if (is_mft) { ntfs_inode *tni; unsigned long mft_no; /* Get the mft record number. */ mft_no = (((s64)page->index << PAGE_SHIFT) + ofs) >> rec_size_bits; /* Check whether to write this mft record. */ tni = NULL; if (!ntfs_may_write_mft_record(vol, mft_no, (MFT_RECORD*)(kaddr + ofs), &tni)) { /* * The record should not be written. This * means we need to redirty the page before * returning. */ page_is_dirty = true; /* * Remove the buffers in this mft record from * the list of buffers to write. */ do { bhs[i] = NULL; } while (++i % bhs_per_rec); continue; } /* * The record should be written. If a locked ntfs * inode was returned, add it to the array of locked * ntfs inodes. */ if (tni) locked_nis[nr_locked_nis++] = tni; } /* Apply the mst protection fixups. */ err2 = pre_write_mst_fixup((NTFS_RECORD*)(kaddr + ofs), rec_size); if (unlikely(err2)) { if (!err || err == -ENOMEM) err = -EIO; ntfs_error(vol->sb, "Failed to apply mst fixups " "(inode 0x%lx, attribute type 0x%x, " "page index 0x%lx, page offset 0x%x)!" " Unmount and run chkdsk.", vi->i_ino, ni->type, page->index, ofs); /* * Mark all the buffers in this record clean as we do * not want to write corrupt data to disk. */ do { clear_buffer_dirty(bhs[i]); bhs[i] = NULL; } while (++i % bhs_per_rec); continue; } nr_recs++; } /* If no records are to be written out, we are done. */ if (!nr_recs) goto unm_done; flush_dcache_page(page); /* Lock buffers and start synchronous write i/o on them. */ for (i = 0; i < nr_bhs; i++) { tbh = bhs[i]; if (!tbh) continue; if (!trylock_buffer(tbh)) BUG(); /* The buffer dirty state is now irrelevant, just clean it. */ clear_buffer_dirty(tbh); BUG_ON(!buffer_uptodate(tbh)); BUG_ON(!buffer_mapped(tbh)); get_bh(tbh); tbh->b_end_io = end_buffer_write_sync; submit_bh(REQ_OP_WRITE, tbh); } /* Synchronize the mft mirror now if not @sync. */ if (is_mft && !sync) goto do_mirror; do_wait: /* Wait on i/o completion of buffers. */ for (i = 0; i < nr_bhs; i++) { tbh = bhs[i]; if (!tbh) continue; wait_on_buffer(tbh); if (unlikely(!buffer_uptodate(tbh))) { ntfs_error(vol->sb, "I/O error while writing ntfs " "record buffer (inode 0x%lx, " "attribute type 0x%x, page index " "0x%lx, page offset 0x%lx)! Unmount " "and run chkdsk.", vi->i_ino, ni->type, page->index, bh_offset(tbh)); if (!err || err == -ENOMEM) err = -EIO; /* * Set the buffer uptodate so the page and buffer * states do not become out of sync. */ set_buffer_uptodate(tbh); } } /* If @sync, now synchronize the mft mirror. */ if (is_mft && sync) { do_mirror: for (i = 0; i < nr_bhs; i++) { unsigned long mft_no; unsigned int ofs; /* * Skip buffers which are not at the beginning of * records. */ if (i % bhs_per_rec) continue; tbh = bhs[i]; /* Skip removed buffers (and hence records). */ if (!tbh) continue; ofs = bh_offset(tbh); /* Get the mft record number. */ mft_no = (((s64)page->index << PAGE_SHIFT) + ofs) >> rec_size_bits; if (mft_no < vol->mftmirr_size) ntfs_sync_mft_mirror(vol, mft_no, (MFT_RECORD*)(kaddr + ofs), sync); } if (!sync) goto do_wait; } /* Remove the mst protection fixups again. */ for (i = 0; i < nr_bhs; i++) { if (!(i % bhs_per_rec)) { tbh = bhs[i]; if (!tbh) continue; post_write_mst_fixup((NTFS_RECORD*)(kaddr + bh_offset(tbh))); } } flush_dcache_page(page); unm_done: /* Unlock any locked inodes. */ while (nr_locked_nis-- > 0) { ntfs_inode *tni, *base_tni; tni = locked_nis[nr_locked_nis]; /* Get the base inode. */ mutex_lock(&tni->extent_lock); if (tni->nr_extents >= 0) base_tni = tni; else { base_tni = tni->ext.base_ntfs_ino; BUG_ON(!base_tni); } mutex_unlock(&tni->extent_lock); ntfs_debug("Unlocking %s inode 0x%lx.", tni == base_tni ? "base" : "extent", tni->mft_no); mutex_unlock(&tni->mrec_lock); atomic_dec(&tni->count); iput(VFS_I(base_tni)); } SetPageUptodate(page); kunmap(page); done: if (unlikely(err && err != -ENOMEM)) { /* * Set page error if there is only one ntfs record in the page. * Otherwise we would loose per-record granularity. */ if (ni->itype.index.block_size == PAGE_SIZE) SetPageError(page); NVolSetErrors(vol); } if (page_is_dirty) { ntfs_debug("Page still contains one or more dirty ntfs " "records. Redirtying the page starting at " "record 0x%lx.", page->index << (PAGE_SHIFT - rec_size_bits)); redirty_page_for_writepage(wbc, page); unlock_page(page); } else { /* * Keep the VM happy. This must be done otherwise the * radix-tree tag PAGECACHE_TAG_DIRTY remains set even though * the page is clean. */ BUG_ON(PageWriteback(page)); set_page_writeback(page); unlock_page(page); end_page_writeback(page); } if (likely(!err)) ntfs_debug("Done."); return err; } /** * ntfs_writepage - write a @page to the backing store * @page: page cache page to write out * @wbc: writeback control structure * * This is called from the VM when it wants to have a dirty ntfs page cache * page cleaned. The VM has already locked the page and marked it clean. * * For non-resident attributes, ntfs_writepage() writes the @page by calling * the ntfs version of the generic block_write_full_page() function, * ntfs_write_block(), which in turn if necessary creates and writes the * buffers associated with the page asynchronously. * * For resident attributes, OTOH, ntfs_writepage() writes the @page by copying * the data to the mft record (which at this stage is most likely in memory). * The mft record is then marked dirty and written out asynchronously via the * vfs inode dirty code path for the inode the mft record belongs to or via the * vm page dirty code path for the page the mft record is in. * * Based on ntfs_read_folio() and fs/buffer.c::block_write_full_page(). * * Return 0 on success and -errno on error. */ static int ntfs_writepage(struct page *page, struct writeback_control *wbc) { loff_t i_size; struct inode *vi = page->mapping->host; ntfs_inode *base_ni = NULL, *ni = NTFS_I(vi); char *addr; ntfs_attr_search_ctx *ctx = NULL; MFT_RECORD *m = NULL; u32 attr_len; int err; retry_writepage: BUG_ON(!PageLocked(page)); i_size = i_size_read(vi); /* Is the page fully outside i_size? (truncate in progress) */ if (unlikely(page->index >= (i_size + PAGE_SIZE - 1) >> PAGE_SHIFT)) { struct folio *folio = page_folio(page); /* * The page may have dirty, unmapped buffers. Make them * freeable here, so the page does not leak. */ block_invalidate_folio(folio, 0, folio_size(folio)); folio_unlock(folio); ntfs_debug("Write outside i_size - truncated?"); return 0; } /* * Only $DATA attributes can be encrypted and only unnamed $DATA * attributes can be compressed. Index root can have the flags set but * this means to create compressed/encrypted files, not that the * attribute is compressed/encrypted. Note we need to check for * AT_INDEX_ALLOCATION since this is the type of both directory and * index inodes. */ if (ni->type != AT_INDEX_ALLOCATION) { /* If file is encrypted, deny access, just like NT4. */ if (NInoEncrypted(ni)) { unlock_page(page); BUG_ON(ni->type != AT_DATA); ntfs_debug("Denying write access to encrypted file."); return -EACCES; } /* Compressed data streams are handled in compress.c. */ if (NInoNonResident(ni) && NInoCompressed(ni)) { BUG_ON(ni->type != AT_DATA); BUG_ON(ni->name_len); // TODO: Implement and replace this with // return ntfs_write_compressed_block(page); unlock_page(page); ntfs_error(vi->i_sb, "Writing to compressed files is " "not supported yet. Sorry."); return -EOPNOTSUPP; } // TODO: Implement and remove this check. if (NInoNonResident(ni) && NInoSparse(ni)) { unlock_page(page); ntfs_error(vi->i_sb, "Writing to sparse files is not " "supported yet. Sorry."); return -EOPNOTSUPP; } } /* NInoNonResident() == NInoIndexAllocPresent() */ if (NInoNonResident(ni)) { /* We have to zero every time due to mmap-at-end-of-file. */ if (page->index >= (i_size >> PAGE_SHIFT)) { /* The page straddles i_size. */ unsigned int ofs = i_size & ~PAGE_MASK; zero_user_segment(page, ofs, PAGE_SIZE); } /* Handle mst protected attributes. */ if (NInoMstProtected(ni)) return ntfs_write_mst_block(page, wbc); /* Normal, non-resident data stream. */ return ntfs_write_block(page, wbc); } /* * Attribute is resident, implying it is not compressed, encrypted, or * mst protected. This also means the attribute is smaller than an mft * record and hence smaller than a page, so can simply return error on * any pages with index above 0. Note the attribute can actually be * marked compressed but if it is resident the actual data is not * compressed so we are ok to ignore the compressed flag here. */ BUG_ON(page_has_buffers(page)); BUG_ON(!PageUptodate(page)); if (unlikely(page->index > 0)) { ntfs_error(vi->i_sb, "BUG()! page->index (0x%lx) > 0. " "Aborting write.", page->index); BUG_ON(PageWriteback(page)); set_page_writeback(page); unlock_page(page); end_page_writeback(page); return -EIO; } if (!NInoAttr(ni)) base_ni = ni; else base_ni = ni->ext.base_ntfs_ino; /* Map, pin, and lock the mft record. */ m = map_mft_record(base_ni); if (IS_ERR(m)) { err = PTR_ERR(m); m = NULL; ctx = NULL; goto err_out; } /* * If a parallel write made the attribute non-resident, drop the mft * record and retry the writepage. */ if (unlikely(NInoNonResident(ni))) { unmap_mft_record(base_ni); goto retry_writepage; } ctx = ntfs_attr_get_search_ctx(base_ni, m); if (unlikely(!ctx)) { err = -ENOMEM; goto err_out; } err = ntfs_attr_lookup(ni->type, ni->name, ni->name_len, CASE_SENSITIVE, 0, NULL, 0, ctx); if (unlikely(err)) goto err_out; /* * Keep the VM happy. This must be done otherwise the radix-tree tag * PAGECACHE_TAG_DIRTY remains set even though the page is clean. */ BUG_ON(PageWriteback(page)); set_page_writeback(page); unlock_page(page); attr_len = le32_to_cpu(ctx->attr->data.resident.value_length); i_size = i_size_read(vi); if (unlikely(attr_len > i_size)) { /* Race with shrinking truncate or a failed truncate. */ attr_len = i_size; /* * If the truncate failed, fix it up now. If a concurrent * truncate, we do its job, so it does not have to do anything. */ err = ntfs_resident_attr_value_resize(ctx->mrec, ctx->attr, attr_len); /* Shrinking cannot fail. */ BUG_ON(err); } addr = kmap_atomic(page); /* Copy the data from the page to the mft record. */ memcpy((u8*)ctx->attr + le16_to_cpu(ctx->attr->data.resident.value_offset), addr, attr_len); /* Zero out of bounds area in the page cache page. */ memset(addr + attr_len, 0, PAGE_SIZE - attr_len); kunmap_atomic(addr); flush_dcache_page(page); flush_dcache_mft_record_page(ctx->ntfs_ino); /* We are done with the page. */ end_page_writeback(page); /* Finally, mark the mft record dirty, so it gets written back. */ mark_mft_record_dirty(ctx->ntfs_ino); ntfs_attr_put_search_ctx(ctx); unmap_mft_record(base_ni); return 0; err_out: if (err == -ENOMEM) { ntfs_warning(vi->i_sb, "Error allocating memory. Redirtying " "page so we try again later."); /* * Put the page back on mapping->dirty_pages, but leave its * buffers' dirty state as-is. */ redirty_page_for_writepage(wbc, page); err = 0; } else { ntfs_error(vi->i_sb, "Resident attribute write failed with " "error %i.", err); SetPageError(page); NVolSetErrors(ni->vol); } unlock_page(page); if (ctx) ntfs_attr_put_search_ctx(ctx); if (m) unmap_mft_record(base_ni); return err; } #endif /* NTFS_RW */ /** * ntfs_bmap - map logical file block to physical device block * @mapping: address space mapping to which the block to be mapped belongs * @block: logical block to map to its physical device block * * For regular, non-resident files (i.e. not compressed and not encrypted), map * the logical @block belonging to the file described by the address space * mapping @mapping to its physical device block. * * The size of the block is equal to the @s_blocksize field of the super block * of the mounted file system which is guaranteed to be smaller than or equal * to the cluster size thus the block is guaranteed to fit entirely inside the * cluster which means we do not need to care how many contiguous bytes are * available after the beginning of the block. * * Return the physical device block if the mapping succeeded or 0 if the block * is sparse or there was an error. * * Note: This is a problem if someone tries to run bmap() on $Boot system file * as that really is in block zero but there is nothing we can do. bmap() is * just broken in that respect (just like it cannot distinguish sparse from * not available or error). */ static sector_t ntfs_bmap(struct address_space *mapping, sector_t block) { s64 ofs, size; loff_t i_size; LCN lcn; unsigned long blocksize, flags; ntfs_inode *ni = NTFS_I(mapping->host); ntfs_volume *vol = ni->vol; unsigned delta; unsigned char blocksize_bits, cluster_size_shift; ntfs_debug("Entering for mft_no 0x%lx, logical block 0x%llx.", ni->mft_no, (unsigned long long)block); if (ni->type != AT_DATA || !NInoNonResident(ni) || NInoEncrypted(ni)) { ntfs_error(vol->sb, "BMAP does not make sense for %s " "attributes, returning 0.", (ni->type != AT_DATA) ? "non-data" : (!NInoNonResident(ni) ? "resident" : "encrypted")); return 0; } /* None of these can happen. */ BUG_ON(NInoCompressed(ni)); BUG_ON(NInoMstProtected(ni)); blocksize = vol->sb->s_blocksize; blocksize_bits = vol->sb->s_blocksize_bits; ofs = (s64)block << blocksize_bits; read_lock_irqsave(&ni->size_lock, flags); size = ni->initialized_size; i_size = i_size_read(VFS_I(ni)); read_unlock_irqrestore(&ni->size_lock, flags); /* * If the offset is outside the initialized size or the block straddles * the initialized size then pretend it is a hole unless the * initialized size equals the file size. */ if (unlikely(ofs >= size || (ofs + blocksize > size && size < i_size))) goto hole; cluster_size_shift = vol->cluster_size_bits; down_read(&ni->runlist.lock); lcn = ntfs_attr_vcn_to_lcn_nolock(ni, ofs >> cluster_size_shift, false); up_read(&ni->runlist.lock); if (unlikely(lcn < LCN_HOLE)) { /* * Step down to an integer to avoid gcc doing a long long * comparision in the switch when we know @lcn is between * LCN_HOLE and LCN_EIO (i.e. -1 to -5). * * Otherwise older gcc (at least on some architectures) will * try to use __cmpdi2() which is of course not available in * the kernel. */ switch ((int)lcn) { case LCN_ENOENT: /* * If the offset is out of bounds then pretend it is a * hole. */ goto hole; case LCN_ENOMEM: ntfs_error(vol->sb, "Not enough memory to complete " "mapping for inode 0x%lx. " "Returning 0.", ni->mft_no); break; default: ntfs_error(vol->sb, "Failed to complete mapping for " "inode 0x%lx. Run chkdsk. " "Returning 0.", ni->mft_no); break; } return 0; } if (lcn < 0) { /* It is a hole. */ hole: ntfs_debug("Done (returning hole)."); return 0; } /* * The block is really allocated and fullfils all our criteria. * Convert the cluster to units of block size and return the result. */ delta = ofs & vol->cluster_size_mask; if (unlikely(sizeof(block) < sizeof(lcn))) { block = lcn = ((lcn << cluster_size_shift) + delta) >> blocksize_bits; /* If the block number was truncated return 0. */ if (unlikely(block != lcn)) { ntfs_error(vol->sb, "Physical block 0x%llx is too " "large to be returned, returning 0.", (long long)lcn); return 0; } } else block = ((lcn << cluster_size_shift) + delta) >> blocksize_bits; ntfs_debug("Done (returning block 0x%llx).", (unsigned long long)lcn); return block; } /* * ntfs_normal_aops - address space operations for normal inodes and attributes * * Note these are not used for compressed or mst protected inodes and * attributes. */ const struct address_space_operations ntfs_normal_aops = { .read_folio = ntfs_read_folio, #ifdef NTFS_RW .writepage = ntfs_writepage, .dirty_folio = block_dirty_folio, #endif /* NTFS_RW */ .bmap = ntfs_bmap, .migrate_folio = buffer_migrate_folio, .is_partially_uptodate = block_is_partially_uptodate, .error_remove_page = generic_error_remove_page, }; /* * ntfs_compressed_aops - address space operations for compressed inodes */ const struct address_space_operations ntfs_compressed_aops = { .read_folio = ntfs_read_folio, #ifdef NTFS_RW .writepage = ntfs_writepage, .dirty_folio = block_dirty_folio, #endif /* NTFS_RW */ .migrate_folio = buffer_migrate_folio, .is_partially_uptodate = block_is_partially_uptodate, .error_remove_page = generic_error_remove_page, }; /* * ntfs_mst_aops - general address space operations for mst protecteed inodes * and attributes */ const struct address_space_operations ntfs_mst_aops = { .read_folio = ntfs_read_folio, /* Fill page with data. */ #ifdef NTFS_RW .writepage = ntfs_writepage, /* Write dirty page to disk. */ .dirty_folio = filemap_dirty_folio, #endif /* NTFS_RW */ .migrate_folio = buffer_migrate_folio, .is_partially_uptodate = block_is_partially_uptodate, .error_remove_page = generic_error_remove_page, }; #ifdef NTFS_RW /** * mark_ntfs_record_dirty - mark an ntfs record dirty * @page: page containing the ntfs record to mark dirty * @ofs: byte offset within @page at which the ntfs record begins * * Set the buffers and the page in which the ntfs record is located dirty. * * The latter also marks the vfs inode the ntfs record belongs to dirty * (I_DIRTY_PAGES only). * * If the page does not have buffers, we create them and set them uptodate. * The page may not be locked which is why we need to handle the buffers under * the mapping->private_lock. Once the buffers are marked dirty we no longer * need the lock since try_to_free_buffers() does not free dirty buffers. */ void mark_ntfs_record_dirty(struct page *page, const unsigned int ofs) { struct address_space *mapping = page->mapping; ntfs_inode *ni = NTFS_I(mapping->host); struct buffer_head *bh, *head, *buffers_to_free = NULL; unsigned int end, bh_size, bh_ofs; BUG_ON(!PageUptodate(page)); end = ofs + ni->itype.index.block_size; bh_size = VFS_I(ni)->i_sb->s_blocksize; spin_lock(&mapping->private_lock); if (unlikely(!page_has_buffers(page))) { spin_unlock(&mapping->private_lock); bh = head = alloc_page_buffers(page, bh_size, true); spin_lock(&mapping->private_lock); if (likely(!page_has_buffers(page))) { struct buffer_head *tail; do { set_buffer_uptodate(bh); tail = bh; bh = bh->b_this_page; } while (bh); tail->b_this_page = head; attach_page_private(page, head); } else buffers_to_free = bh; } bh = head = page_buffers(page); BUG_ON(!bh); do { bh_ofs = bh_offset(bh); if (bh_ofs + bh_size <= ofs) continue; if (unlikely(bh_ofs >= end)) break; set_buffer_dirty(bh); } while ((bh = bh->b_this_page) != head); spin_unlock(&mapping->private_lock); filemap_dirty_folio(mapping, page_folio(page)); if (unlikely(buffers_to_free)) { do { bh = buffers_to_free->b_this_page; free_buffer_head(buffers_to_free); buffers_to_free = bh; } while (buffers_to_free); } } #endif /* NTFS_RW */ |
699 699 683 683 682 682 699 3 1 1 682 | 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60 61 62 63 64 65 66 67 68 69 70 71 72 73 74 75 76 77 78 79 80 81 82 83 84 85 86 87 88 89 90 91 92 93 94 95 96 97 98 99 100 101 102 103 104 105 106 107 108 109 110 111 112 113 114 115 116 117 118 119 120 121 122 123 124 125 126 127 128 129 130 131 132 133 134 135 136 137 138 139 140 141 142 143 144 145 146 147 148 | // SPDX-License-Identifier: GPL-2.0-only /* * AppArmor security module * * This file contains AppArmor security identifier (secid) manipulation fns * * Copyright 2009-2017 Canonical Ltd. * * AppArmor allocates a unique secid for every label used. If a label * is replaced it receives the secid of the label it is replacing. */ #include <linux/errno.h> #include <linux/err.h> #include <linux/gfp.h> #include <linux/slab.h> #include <linux/spinlock.h> #include <linux/xarray.h> #include "include/cred.h" #include "include/lib.h" #include "include/secid.h" #include "include/label.h" #include "include/policy_ns.h" /* * secids - do not pin labels with a refcount. They rely on the label * properly updating/freeing them */ #define AA_FIRST_SECID 2 static DEFINE_XARRAY_FLAGS(aa_secids, XA_FLAGS_LOCK_IRQ | XA_FLAGS_TRACK_FREE); int apparmor_display_secid_mode; /* * TODO: allow policy to reserve a secid range? * TODO: add secid pinning * TODO: use secid_update in label replace */ /** * aa_secid_update - update a secid mapping to a new label * @secid: secid to update * @label: label the secid will now map to */ void aa_secid_update(u32 secid, struct aa_label *label) { unsigned long flags; xa_lock_irqsave(&aa_secids, flags); __xa_store(&aa_secids, secid, label, 0); xa_unlock_irqrestore(&aa_secids, flags); } /* * see label for inverse aa_label_to_secid */ struct aa_label *aa_secid_to_label(u32 secid) { return xa_load(&aa_secids, secid); } int apparmor_secid_to_secctx(u32 secid, char **secdata, u32 *seclen) { /* TODO: cache secctx and ref count so we don't have to recreate */ struct aa_label *label = aa_secid_to_label(secid); int flags = FLAG_VIEW_SUBNS | FLAG_HIDDEN_UNCONFINED | FLAG_ABS_ROOT; int len; AA_BUG(!seclen); if (!label) return -EINVAL; if (apparmor_display_secid_mode) flags |= FLAG_SHOW_MODE; if (secdata) len = aa_label_asxprint(secdata, root_ns, label, flags, GFP_ATOMIC); else len = aa_label_snxprint(NULL, 0, root_ns, label, flags); if (len < 0) return -ENOMEM; *seclen = len; return 0; } int apparmor_secctx_to_secid(const char *secdata, u32 seclen, u32 *secid) { struct aa_label *label; label = aa_label_strn_parse(&root_ns->unconfined->label, secdata, seclen, GFP_KERNEL, false, false); if (IS_ERR(label)) return PTR_ERR(label); *secid = label->secid; return 0; } void apparmor_release_secctx(char *secdata, u32 seclen) { kfree(secdata); } /** * aa_alloc_secid - allocate a new secid for a profile * @label: the label to allocate a secid for * @gfp: memory allocation flags * * Returns: 0 with @label->secid initialized * <0 returns error with @label->secid set to AA_SECID_INVALID */ int aa_alloc_secid(struct aa_label *label, gfp_t gfp) { unsigned long flags; int ret; xa_lock_irqsave(&aa_secids, flags); ret = __xa_alloc(&aa_secids, &label->secid, label, XA_LIMIT(AA_FIRST_SECID, INT_MAX), gfp); xa_unlock_irqrestore(&aa_secids, flags); if (ret < 0) { label->secid = AA_SECID_INVALID; return ret; } return 0; } /** * aa_free_secid - free a secid * @secid: secid to free */ void aa_free_secid(u32 secid) { unsigned long flags; xa_lock_irqsave(&aa_secids, flags); __xa_erase(&aa_secids, secid); xa_unlock_irqrestore(&aa_secids, flags); } |
2 2 35 1 32 5 | 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60 61 62 63 64 65 66 67 68 69 70 71 72 73 74 75 76 77 78 79 80 81 82 83 84 85 86 87 88 89 90 91 92 93 94 95 96 97 98 99 100 101 102 103 104 105 106 107 108 109 110 111 112 113 114 115 116 117 118 119 120 121 122 123 124 125 126 127 128 129 130 131 132 133 134 135 136 137 138 139 140 141 142 143 144 145 146 147 148 149 150 151 152 153 154 155 156 157 158 159 160 161 162 163 164 165 166 167 168 169 170 171 172 173 174 175 176 177 178 179 180 181 182 183 184 185 186 187 188 189 190 191 192 193 194 195 196 197 198 199 200 201 202 203 204 205 206 207 208 209 210 211 212 213 214 215 216 217 218 219 220 221 222 223 224 225 226 227 228 229 230 231 232 233 234 235 236 237 238 239 240 241 242 243 244 245 246 247 248 249 250 251 252 253 254 255 256 257 258 259 260 261 262 263 264 265 266 267 268 269 270 271 272 273 274 275 276 277 278 279 280 281 282 283 284 285 286 287 288 289 290 291 292 293 294 295 296 297 298 299 300 301 302 303 304 305 306 307 308 309 310 311 312 313 314 315 316 317 318 319 320 321 322 323 324 325 326 327 328 329 330 331 332 333 334 335 336 337 338 339 340 341 342 343 344 345 346 347 348 349 350 351 352 353 354 355 356 357 358 359 360 361 362 363 364 365 366 367 368 369 370 371 372 373 374 375 376 377 378 379 380 381 382 383 384 | // SPDX-License-Identifier: GPL-2.0-or-later /* * ip_vs_proto.c: transport protocol load balancing support for IPVS * * Authors: Wensong Zhang <wensong@linuxvirtualserver.org> * Julian Anastasov <ja@ssi.bg> * * Changes: */ #define KMSG_COMPONENT "IPVS" #define pr_fmt(fmt) KMSG_COMPONENT ": " fmt #include <linux/module.h> #include <linux/kernel.h> #include <linux/skbuff.h> #include <linux/gfp.h> #include <linux/in.h> #include <linux/ip.h> #include <net/protocol.h> #include <net/tcp.h> #include <net/udp.h> #include <linux/stat.h> #include <linux/proc_fs.h> #include <net/ip_vs.h> /* * IPVS protocols can only be registered/unregistered when the ipvs * module is loaded/unloaded, so no lock is needed in accessing the * ipvs protocol table. */ #define IP_VS_PROTO_TAB_SIZE 32 /* must be power of 2 */ #define IP_VS_PROTO_HASH(proto) ((proto) & (IP_VS_PROTO_TAB_SIZE-1)) static struct ip_vs_protocol *ip_vs_proto_table[IP_VS_PROTO_TAB_SIZE]; /* States for conn templates: NONE or words separated with ",", max 15 chars */ static const char *ip_vs_ctpl_state_name_table[IP_VS_CTPL_S_LAST] = { [IP_VS_CTPL_S_NONE] = "NONE", [IP_VS_CTPL_S_ASSURED] = "ASSURED", }; /* * register an ipvs protocol */ static int __used __init register_ip_vs_protocol(struct ip_vs_protocol *pp) { unsigned int hash = IP_VS_PROTO_HASH(pp->protocol); pp->next = ip_vs_proto_table[hash]; ip_vs_proto_table[hash] = pp; if (pp->init != NULL) pp->init(pp); return 0; } /* * register an ipvs protocols netns related data */ static int register_ip_vs_proto_netns(struct netns_ipvs *ipvs, struct ip_vs_protocol *pp) { unsigned int hash = IP_VS_PROTO_HASH(pp->protocol); struct ip_vs_proto_data *pd = kzalloc(sizeof(struct ip_vs_proto_data), GFP_KERNEL); if (!pd) return -ENOMEM; pd->pp = pp; /* For speed issues */ pd->next = ipvs->proto_data_table[hash]; ipvs->proto_data_table[hash] = pd; atomic_set(&pd->appcnt, 0); /* Init app counter */ if (pp->init_netns != NULL) { int ret = pp->init_netns(ipvs, pd); if (ret) { /* unlink an free proto data */ ipvs->proto_data_table[hash] = pd->next; kfree(pd); return ret; } } return 0; } /* * unregister an ipvs protocol */ static int unregister_ip_vs_protocol(struct ip_vs_protocol *pp) { struct ip_vs_protocol **pp_p; unsigned int hash = IP_VS_PROTO_HASH(pp->protocol); pp_p = &ip_vs_proto_table[hash]; for (; *pp_p; pp_p = &(*pp_p)->next) { if (*pp_p == pp) { *pp_p = pp->next; if (pp->exit != NULL) pp->exit(pp); return 0; } } return -ESRCH; } /* * unregister an ipvs protocols netns data */ static int unregister_ip_vs_proto_netns(struct netns_ipvs *ipvs, struct ip_vs_proto_data *pd) { struct ip_vs_proto_data **pd_p; unsigned int hash = IP_VS_PROTO_HASH(pd->pp->protocol); pd_p = &ipvs->proto_data_table[hash]; for (; *pd_p; pd_p = &(*pd_p)->next) { if (*pd_p == pd) { *pd_p = pd->next; if (pd->pp->exit_netns != NULL) pd->pp->exit_netns(ipvs, pd); kfree(pd); return 0; } } return -ESRCH; } /* * get ip_vs_protocol object by its proto. */ struct ip_vs_protocol * ip_vs_proto_get(unsigned short proto) { struct ip_vs_protocol *pp; unsigned int hash = IP_VS_PROTO_HASH(proto); for (pp = ip_vs_proto_table[hash]; pp; pp = pp->next) { if (pp->protocol == proto) return pp; } return NULL; } EXPORT_SYMBOL(ip_vs_proto_get); /* * get ip_vs_protocol object data by netns and proto */ struct ip_vs_proto_data * ip_vs_proto_data_get(struct netns_ipvs *ipvs, unsigned short proto) { struct ip_vs_proto_data *pd; unsigned int hash = IP_VS_PROTO_HASH(proto); for (pd = ipvs->proto_data_table[hash]; pd; pd = pd->next) { if (pd->pp->protocol == proto) return pd; } return NULL; } EXPORT_SYMBOL(ip_vs_proto_data_get); /* * Propagate event for state change to all protocols */ void ip_vs_protocol_timeout_change(struct netns_ipvs *ipvs, int flags) { struct ip_vs_proto_data *pd; int i; for (i = 0; i < IP_VS_PROTO_TAB_SIZE; i++) { for (pd = ipvs->proto_data_table[i]; pd; pd = pd->next) { if (pd->pp->timeout_change) pd->pp->timeout_change(pd, flags); } } } int * ip_vs_create_timeout_table(int *table, int size) { return kmemdup(table, size, GFP_KERNEL); } const char *ip_vs_state_name(const struct ip_vs_conn *cp) { unsigned int state = cp->state; struct ip_vs_protocol *pp; if (cp->flags & IP_VS_CONN_F_TEMPLATE) { if (state >= IP_VS_CTPL_S_LAST) return "ERR!"; return ip_vs_ctpl_state_name_table[state] ? : "?"; } pp = ip_vs_proto_get(cp->protocol); if (pp == NULL || pp->state_name == NULL) return (cp->protocol == IPPROTO_IP) ? "NONE" : "ERR!"; return pp->state_name(state); } static void ip_vs_tcpudp_debug_packet_v4(struct ip_vs_protocol *pp, const struct sk_buff *skb, int offset, const char *msg) { char buf[128]; struct iphdr _iph, *ih; ih = skb_header_pointer(skb, offset, sizeof(_iph), &_iph); if (ih == NULL) sprintf(buf, "TRUNCATED"); else if (ih->frag_off & htons(IP_OFFSET)) sprintf(buf, "%pI4->%pI4 frag", &ih->saddr, &ih->daddr); else { __be16 _ports[2], *pptr; pptr = skb_header_pointer(skb, offset + ih->ihl*4, sizeof(_ports), _ports); if (pptr == NULL) sprintf(buf, "TRUNCATED %pI4->%pI4", &ih->saddr, &ih->daddr); else sprintf(buf, "%pI4:%u->%pI4:%u", &ih->saddr, ntohs(pptr[0]), &ih->daddr, ntohs(pptr[1])); } pr_debug("%s: %s %s\n", msg, pp->name, buf); } #ifdef CONFIG_IP_VS_IPV6 static void ip_vs_tcpudp_debug_packet_v6(struct ip_vs_protocol *pp, const struct sk_buff *skb, int offset, const char *msg) { char buf[192]; struct ipv6hdr _iph, *ih; ih = skb_header_pointer(skb, offset, sizeof(_iph), &_iph); if (ih == NULL) sprintf(buf, "TRUNCATED"); else if (ih->nexthdr == IPPROTO_FRAGMENT) sprintf(buf, "%pI6c->%pI6c frag", &ih->saddr, &ih->daddr); else { __be16 _ports[2], *pptr; pptr = skb_header_pointer(skb, offset + sizeof(struct ipv6hdr), sizeof(_ports), _ports); if (pptr == NULL) sprintf(buf, "TRUNCATED %pI6c->%pI6c", &ih->saddr, &ih->daddr); else sprintf(buf, "%pI6c:%u->%pI6c:%u", &ih->saddr, ntohs(pptr[0]), &ih->daddr, ntohs(pptr[1])); } pr_debug("%s: %s %s\n", msg, pp->name, buf); } #endif void ip_vs_tcpudp_debug_packet(int af, struct ip_vs_protocol *pp, const struct sk_buff *skb, int offset, const char *msg) { #ifdef CONFIG_IP_VS_IPV6 if (af == AF_INET6) ip_vs_tcpudp_debug_packet_v6(pp, skb, offset, msg); else #endif ip_vs_tcpudp_debug_packet_v4(pp, skb, offset, msg); } /* * per network name-space init */ int __net_init ip_vs_protocol_net_init(struct netns_ipvs *ipvs) { int i, ret; static struct ip_vs_protocol *protos[] = { #ifdef CONFIG_IP_VS_PROTO_TCP &ip_vs_protocol_tcp, #endif #ifdef CONFIG_IP_VS_PROTO_UDP &ip_vs_protocol_udp, #endif #ifdef CONFIG_IP_VS_PROTO_SCTP &ip_vs_protocol_sctp, #endif #ifdef CONFIG_IP_VS_PROTO_AH &ip_vs_protocol_ah, #endif #ifdef CONFIG_IP_VS_PROTO_ESP &ip_vs_protocol_esp, #endif }; for (i = 0; i < ARRAY_SIZE(protos); i++) { ret = register_ip_vs_proto_netns(ipvs, protos[i]); if (ret < 0) goto cleanup; } return 0; cleanup: ip_vs_protocol_net_cleanup(ipvs); return ret; } void __net_exit ip_vs_protocol_net_cleanup(struct netns_ipvs *ipvs) { struct ip_vs_proto_data *pd; int i; /* unregister all the ipvs proto data for this netns */ for (i = 0; i < IP_VS_PROTO_TAB_SIZE; i++) { while ((pd = ipvs->proto_data_table[i]) != NULL) unregister_ip_vs_proto_netns(ipvs, pd); } } int __init ip_vs_protocol_init(void) { char protocols[64]; #define REGISTER_PROTOCOL(p) \ do { \ register_ip_vs_protocol(p); \ strcat(protocols, ", "); \ strcat(protocols, (p)->name); \ } while (0) protocols[0] = '\0'; protocols[2] = '\0'; #ifdef CONFIG_IP_VS_PROTO_TCP REGISTER_PROTOCOL(&ip_vs_protocol_tcp); #endif #ifdef CONFIG_IP_VS_PROTO_UDP REGISTER_PROTOCOL(&ip_vs_protocol_udp); #endif #ifdef CONFIG_IP_VS_PROTO_SCTP REGISTER_PROTOCOL(&ip_vs_protocol_sctp); #endif #ifdef CONFIG_IP_VS_PROTO_AH REGISTER_PROTOCOL(&ip_vs_protocol_ah); #endif #ifdef CONFIG_IP_VS_PROTO_ESP REGISTER_PROTOCOL(&ip_vs_protocol_esp); #endif pr_info("Registered protocols (%s)\n", &protocols[2]); return 0; } void ip_vs_protocol_cleanup(void) { struct ip_vs_protocol *pp; int i; /* unregister all the ipvs protocols */ for (i = 0; i < IP_VS_PROTO_TAB_SIZE; i++) { while ((pp = ip_vs_proto_table[i]) != NULL) unregister_ip_vs_protocol(pp); } } |
2 2 | 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60 61 62 63 64 65 66 67 68 69 70 71 72 73 74 75 76 77 78 79 80 81 82 83 84 85 86 87 88 89 90 91 92 93 94 95 96 97 98 99 100 101 102 103 104 105 106 107 108 109 110 111 112 113 114 115 116 117 118 119 120 121 122 123 124 125 126 127 128 129 130 131 132 133 134 135 136 137 138 139 140 141 142 143 144 145 146 147 148 149 150 151 152 153 154 155 156 157 158 159 160 161 162 163 164 165 166 167 168 169 170 171 172 173 174 175 176 177 178 179 180 181 182 183 184 185 186 187 188 189 190 191 192 193 194 195 196 197 198 199 200 201 202 203 204 205 206 207 208 209 210 211 212 213 214 215 216 217 218 219 220 221 222 223 224 225 226 227 228 229 230 231 232 233 234 235 236 237 238 239 240 241 242 243 244 245 246 247 248 249 250 251 252 253 254 255 256 257 258 259 260 261 262 263 264 265 266 267 268 269 270 271 272 273 274 275 276 277 278 279 280 281 282 283 284 285 286 287 288 289 290 291 292 293 294 295 296 297 298 299 300 301 302 303 304 305 306 307 308 309 310 311 312 313 314 315 316 317 318 319 320 321 322 323 324 325 326 327 328 329 330 331 332 333 334 335 336 337 338 339 340 341 342 343 344 345 346 347 348 349 350 351 352 353 354 355 356 357 358 359 360 361 362 363 364 365 366 367 368 369 370 371 372 373 374 375 376 377 378 379 380 381 382 383 384 385 386 387 388 389 390 391 392 393 394 395 396 397 398 399 400 401 402 403 404 405 406 407 408 409 410 411 412 413 414 415 416 417 418 419 420 421 422 423 424 425 426 427 428 429 430 431 432 433 434 435 436 437 438 439 440 441 442 443 444 445 446 447 448 449 450 451 452 453 454 455 456 457 458 459 460 461 462 463 464 465 466 467 468 469 470 471 472 473 474 475 476 477 478 479 480 481 482 483 484 485 486 487 488 489 490 491 492 493 494 495 496 497 498 499 500 501 502 503 504 505 506 507 508 509 510 511 512 513 514 515 516 517 518 519 520 521 522 523 524 525 526 527 528 529 530 531 532 533 534 535 536 537 538 539 540 541 542 543 544 545 546 547 548 549 550 551 552 553 554 555 556 557 558 559 560 561 562 563 564 565 566 567 568 569 570 571 572 573 574 575 576 577 578 579 580 581 582 583 584 585 586 587 588 589 590 591 592 593 594 595 596 597 598 599 600 601 602 603 604 605 606 607 608 609 610 611 612 613 614 615 616 617 618 619 620 621 622 623 624 625 626 627 628 629 630 631 632 633 634 635 636 637 638 639 640 641 642 643 644 645 646 647 648 649 650 651 652 653 654 655 656 657 658 659 660 661 662 663 664 665 666 667 668 669 670 671 672 673 674 675 676 677 678 679 680 681 682 683 684 685 686 687 688 689 690 691 692 693 694 695 696 697 698 699 700 701 702 703 704 705 706 707 708 709 710 711 712 713 714 715 716 717 718 719 720 721 722 723 724 725 726 727 728 729 730 731 732 733 734 735 736 737 738 739 740 741 742 743 744 745 746 747 748 749 750 751 752 753 754 755 756 757 758 759 760 761 762 763 764 765 766 767 768 769 770 771 772 773 774 775 776 777 778 779 780 781 782 783 784 785 786 787 788 789 790 791 792 793 794 795 796 797 798 799 800 801 802 803 804 805 806 807 808 809 810 811 812 813 814 815 816 817 818 819 820 821 822 823 824 825 826 827 828 829 830 831 832 833 834 835 836 837 838 839 840 841 842 843 844 845 846 847 848 849 850 851 852 853 854 855 856 857 858 859 860 861 862 863 864 865 866 867 868 869 870 871 872 873 874 875 876 877 878 879 880 881 882 883 884 885 886 887 888 889 890 891 892 893 894 895 896 897 898 899 900 901 902 903 904 905 906 907 908 909 910 911 912 913 914 915 916 917 918 919 920 921 922 923 924 925 926 927 928 929 930 931 932 933 934 935 936 937 938 939 940 941 942 943 944 945 946 947 948 949 950 951 952 953 954 955 956 957 958 959 960 961 962 963 964 965 966 967 968 969 970 971 972 973 974 975 976 977 978 979 980 981 982 983 984 985 986 987 988 989 990 991 992 993 994 995 996 997 998 999 1000 1001 1002 1003 1004 1005 1006 1007 1008 1009 1010 1011 1012 1013 1014 1015 1016 1017 1018 1019 1020 1021 1022 1023 1024 1025 1026 1027 1028 1029 1030 1031 1032 1033 1034 1035 1036 1037 1038 1039 1040 1041 1042 1043 1044 1045 1046 1047 1048 1049 1050 1051 1052 1053 1054 1055 1056 1057 1058 1059 1060 1061 1062 1063 1064 1065 1066 1067 1068 1069 1070 1071 1072 1073 1074 1075 1076 1077 1078 1079 1080 1081 1082 1083 1084 1085 1086 1087 1088 1089 1090 1091 1092 1093 1094 1095 1096 1097 1098 1099 1100 1101 1102 1103 1104 1105 1106 1107 1108 1109 1110 1111 1112 1113 1114 1115 1116 1117 1118 1119 1120 1121 1122 1123 1124 1125 1126 1127 1128 1129 1130 1131 1132 1133 1134 1135 1136 1137 1138 1139 1140 1141 1142 1143 1144 1145 1146 1147 1148 1149 1150 1151 1152 1153 1154 1155 1156 1157 1158 1159 1160 1161 1162 1163 1164 1165 1166 1167 1168 1169 1170 1171 1172 1173 1174 1175 1176 1177 1178 1179 1180 1181 1182 1183 1184 1185 1186 1187 1188 1189 1190 1191 1192 1193 1194 1195 1196 1197 1198 1199 1200 1201 1202 1203 1204 1205 1206 1207 1208 1209 1210 1211 1212 1213 1214 1215 1216 1217 1218 1219 1220 1221 1222 1223 1224 1225 1226 1227 1228 1229 1230 1231 1232 1233 1234 1235 1236 1237 1238 1239 1240 1241 1242 1243 1244 1245 1246 1247 1248 1249 1250 1251 1252 1253 1254 1255 1256 1257 1258 1259 1260 1261 1262 1263 1264 1265 1266 1267 1268 1269 1270 1271 1272 1273 1274 1275 1276 1277 1278 1279 1280 1281 1282 1283 1284 1285 1286 1287 1288 1289 1290 1291 1292 1293 1294 1295 1296 1297 1298 1299 1300 1301 1302 1303 1304 1305 1306 1307 1308 1309 1310 1311 1312 1313 1314 1315 1316 1317 1318 1319 1320 1321 1322 1323 1324 1325 1326 1327 1328 1329 1330 1331 1332 1333 1334 1335 1336 1337 1338 1339 1340 1341 1342 1343 1344 1345 1346 1347 1348 1349 1350 1351 1352 1353 1354 1355 1356 1357 1358 1359 1360 1361 1362 1363 1364 1365 1366 1367 1368 1369 1370 1371 1372 1373 1374 1375 1376 1377 1378 1379 1380 1381 1382 1383 1384 1385 1386 1387 1388 1389 1390 1391 1392 1393 1394 1395 1396 1397 1398 1399 1400 1401 1402 1403 1404 1405 1406 1407 1408 1409 1410 1411 1412 1413 1414 1415 1416 1417 1418 1419 1420 1421 1422 1423 1424 1425 1426 1427 1428 1429 1430 1431 1432 1433 1434 1435 1436 1437 1438 1439 1440 1441 1442 1443 1444 1445 1446 1447 1448 1449 1450 1451 1452 1453 1454 1455 1456 1457 1458 1459 1460 1461 1462 1463 1464 1465 1466 1467 1468 1469 1470 1471 1472 1473 1474 1475 1476 1477 1478 1479 1480 1481 1482 1483 1484 1485 1486 1487 1488 1489 1490 1491 1492 1493 1494 1495 1496 1497 1498 1499 1500 1501 1502 1503 1504 1505 1506 1507 1508 1509 1510 1511 1512 1513 1514 1515 1516 1517 1518 1519 1520 1521 1522 1523 1524 1525 1526 1527 1528 1529 1530 1531 1532 1533 1534 1535 1536 1537 1538 1539 1540 1541 1542 1543 1544 1545 1546 1547 1548 1549 1550 1551 1552 1553 1554 1555 1556 1557 1558 1559 1560 1561 1562 1563 1564 1565 1566 1567 1568 1569 1570 1571 1572 1573 1574 1575 1576 1577 1578 1579 1580 1581 1582 1583 1584 1585 1586 1587 1588 1589 1590 1591 1592 1593 1594 1595 1596 1597 1598 1599 1600 1601 1602 1603 1604 1605 1606 1607 1608 1609 1610 1611 1612 1613 1614 1615 1616 1617 1618 1619 1620 1621 1622 1623 1624 1625 1626 1627 1628 1629 1630 1631 1632 1633 1634 1635 1636 1637 1638 1639 1640 1641 1642 1643 1644 1645 1646 1647 1648 1649 1650 1651 1652 1653 1654 1655 1656 1657 1658 1659 1660 1661 1662 1663 1664 1665 1666 1667 1668 1669 1670 1671 1672 1673 1674 1675 1676 1677 1678 1679 1680 1681 1682 1683 1684 1685 1686 1687 1688 1689 1690 1691 1692 1693 1694 1695 1696 1697 1698 1699 1700 1701 1702 1703 1704 1705 1706 1707 1708 1709 1710 1711 1712 1713 1714 1715 1716 1717 1718 1719 1720 1721 1722 1723 1724 1725 1726 1727 1728 1729 1730 1731 1732 1733 1734 1735 1736 1737 1738 1739 1740 1741 1742 1743 1744 1745 1746 1747 1748 1749 1750 1751 1752 1753 1754 1755 1756 1757 1758 1759 1760 1761 1762 1763 1764 1765 1766 1767 1768 1769 1770 1771 1772 1773 1774 1775 1776 1777 1778 1779 1780 1781 1782 1783 1784 1785 1786 1787 1788 1789 1790 1791 1792 1793 1794 1795 1796 1797 1798 1799 1800 1801 1802 1803 1804 1805 1806 1807 1808 1809 1810 1811 1812 1813 1814 1815 1816 1817 1818 1819 1820 1821 1822 1823 1824 1825 1826 1827 1828 1829 1830 1831 1832 1833 1834 1835 1836 1837 1838 1839 1840 1841 1842 1843 1844 1845 1846 1847 1848 1849 1850 1851 1852 1853 1854 1855 1856 1857 1858 1859 1860 1861 1862 1863 1864 1865 1866 1867 1868 1869 1870 1871 1872 1873 1874 1875 1876 1877 1878 1879 1880 1881 1882 1883 1884 1885 1886 1887 1888 1889 1890 1891 1892 1893 1894 1895 1896 1897 1898 1899 1900 1901 1902 1903 1904 1905 1906 1907 1908 1909 1910 1911 1912 1913 1914 1915 1916 1917 1918 1919 1920 1921 1922 1923 1924 1925 1926 1927 1928 1929 1930 1931 1932 1933 1934 1935 1936 1937 1938 1939 1940 1941 1942 1943 1944 1945 1946 1947 1948 1949 1950 1951 1952 1953 1954 1955 1956 1957 1958 1959 1960 1961 1962 1963 1964 1965 1966 1967 1968 1969 1970 1971 1972 1973 1974 1975 1976 1977 1978 1979 1980 1981 1982 1983 1984 1985 1986 1987 1988 1989 1990 1991 1992 1993 1994 1995 1996 1997 1998 1999 2000 2001 2002 2003 2004 2005 2006 2007 2008 2009 2010 2011 2012 2013 2014 2015 2016 2017 2018 2019 2020 2021 2022 2023 2024 2025 2026 2027 2028 2029 2030 2031 2032 2033 2034 2035 2036 2037 2038 2039 2040 2041 2042 2043 2044 2045 2046 2047 2048 2049 2050 2051 2052 2053 2054 2055 2056 2057 2058 2059 2060 2061 2062 2063 2064 2065 2066 2067 2068 2069 2070 2071 2072 2073 2074 2075 2076 2077 2078 2079 2080 2081 2082 2083 2084 2085 2086 2087 2088 2089 2090 2091 2092 2093 2094 2095 2096 2097 2098 2099 2100 2101 2102 2103 2104 2105 2106 2107 2108 2109 2110 2111 2112 2113 2114 2115 2116 2117 2118 2119 2120 2121 2122 2123 2124 2125 2126 2127 2128 2129 2130 2131 2132 2133 2134 2135 2136 2137 2138 2139 2140 2141 2142 2143 2144 2145 2146 2147 2148 2149 2150 2151 2152 2153 2154 2155 2156 2157 2158 2159 2160 2161 2162 2163 2164 2165 2166 2167 2168 2169 2170 2171 2172 2173 2174 2175 2176 2177 2178 2179 2180 2181 2182 2183 2184 2185 2186 2187 2188 2189 2190 2191 2192 2193 2194 2195 2196 2197 2198 2199 2200 2201 2202 2203 2204 2205 2206 2207 2208 2209 2210 2211 2212 2213 2214 2215 2216 2217 2218 2219 2220 2221 2222 2223 2224 2225 2226 2227 2228 2229 2230 2231 2232 2233 2234 2235 2236 2237 2238 2239 2240 2241 2242 2243 2244 2245 2246 2247 2248 2249 2250 2251 2252 2253 2254 2255 2256 2257 2258 2259 2260 2261 2262 2263 2264 2265 2266 2267 2268 2269 2270 2271 2272 2273 2274 2275 2276 2277 2278 2279 2280 2281 2282 2283 2284 2285 2286 2287 2288 2289 2290 2291 2292 2293 2294 2295 2296 2297 2298 2299 2300 2301 2302 2303 2304 2305 2306 2307 2308 2309 2310 2311 2312 2313 2314 2315 2316 2317 2318 2319 2320 2321 2322 2323 2324 2325 2326 2327 2328 2329 2330 2331 2332 2333 2334 2335 2336 2337 2338 2339 2340 2341 2342 2343 2344 2345 2346 2347 2348 2349 2350 2351 2352 2353 2354 2355 2356 2357 2358 2359 2360 2361 2362 2363 2364 2365 2366 2367 2368 2369 2370 2371 2372 | // SPDX-License-Identifier: GPL-2.0-only /* * linux/mm/memory_hotplug.c * * Copyright (C) */ #include <linux/stddef.h> #include <linux/mm.h> #include <linux/sched/signal.h> #include <linux/swap.h> #include <linux/interrupt.h> #include <linux/pagemap.h> #include <linux/compiler.h> #include <linux/export.h> #include <linux/writeback.h> #include <linux/slab.h> #include <linux/sysctl.h> #include <linux/cpu.h> #include <linux/memory.h> #include <linux/memremap.h> #include <linux/memory_hotplug.h> #include <linux/vmalloc.h> #include <linux/ioport.h> #include <linux/delay.h> #include <linux/migrate.h> #include <linux/page-isolation.h> #include <linux/pfn.h> #include <linux/suspend.h> #include <linux/mm_inline.h> #include <linux/firmware-map.h> #include <linux/stop_machine.h> #include <linux/hugetlb.h> #include <linux/memblock.h> #include <linux/compaction.h> #include <linux/rmap.h> #include <linux/module.h> #include <asm/tlbflush.h> #include "internal.h" #include "shuffle.h" enum { MEMMAP_ON_MEMORY_DISABLE = 0, MEMMAP_ON_MEMORY_ENABLE, MEMMAP_ON_MEMORY_FORCE, }; static int memmap_mode __read_mostly = MEMMAP_ON_MEMORY_DISABLE; static inline unsigned long memory_block_memmap_size(void) { return PHYS_PFN(memory_block_size_bytes()) * sizeof(struct page); } static inline unsigned long memory_block_memmap_on_memory_pages(void) { unsigned long nr_pages = PFN_UP(memory_block_memmap_size()); /* * In "forced" memmap_on_memory mode, we add extra pages to align the * vmemmap size to cover full pageblocks. That way, we can add memory * even if the vmemmap size is not properly aligned, however, we might waste * memory. */ if (memmap_mode == MEMMAP_ON_MEMORY_FORCE) return pageblock_align(nr_pages); return nr_pages; } #ifdef CONFIG_MHP_MEMMAP_ON_MEMORY /* * memory_hotplug.memmap_on_memory parameter */ static int set_memmap_mode(const char *val, const struct kernel_param *kp) { int ret, mode; bool enabled; if (sysfs_streq(val, "force") || sysfs_streq(val, "FORCE")) { mode = MEMMAP_ON_MEMORY_FORCE; } else { ret = kstrtobool(val, &enabled); if (ret < 0) return ret; if (enabled) mode = MEMMAP_ON_MEMORY_ENABLE; else mode = MEMMAP_ON_MEMORY_DISABLE; } *((int *)kp->arg) = mode; if (mode == MEMMAP_ON_MEMORY_FORCE) { unsigned long memmap_pages = memory_block_memmap_on_memory_pages(); pr_info_once("Memory hotplug will waste %ld pages in each memory block\n", memmap_pages - PFN_UP(memory_block_memmap_size())); } return 0; } static int get_memmap_mode(char *buffer, const struct kernel_param *kp) { if (*((int *)kp->arg) == MEMMAP_ON_MEMORY_FORCE) return sprintf(buffer, "force\n"); return param_get_bool(buffer, kp); } static const struct kernel_param_ops memmap_mode_ops = { .set = set_memmap_mode, .get = get_memmap_mode, }; module_param_cb(memmap_on_memory, &memmap_mode_ops, &memmap_mode, 0444); MODULE_PARM_DESC(memmap_on_memory, "Enable memmap on memory for memory hotplug\n" "With value \"force\" it could result in memory wastage due " "to memmap size limitations (Y/N/force)"); static inline bool mhp_memmap_on_memory(void) { return memmap_mode != MEMMAP_ON_MEMORY_DISABLE; } #else static inline bool mhp_memmap_on_memory(void) { return false; } #endif enum { ONLINE_POLICY_CONTIG_ZONES = 0, ONLINE_POLICY_AUTO_MOVABLE, }; static const char * const online_policy_to_str[] = { [ONLINE_POLICY_CONTIG_ZONES] = "contig-zones", [ONLINE_POLICY_AUTO_MOVABLE] = "auto-movable", }; static int set_online_policy(const char *val, const struct kernel_param *kp) { int ret = sysfs_match_string(online_policy_to_str, val); if (ret < 0) return ret; *((int *)kp->arg) = ret; return 0; } static int get_online_policy(char *buffer, const struct kernel_param *kp) { return sprintf(buffer, "%s\n", online_policy_to_str[*((int *)kp->arg)]); } /* * memory_hotplug.online_policy: configure online behavior when onlining without * specifying a zone (MMOP_ONLINE) * * "contig-zones": keep zone contiguous * "auto-movable": online memory to ZONE_MOVABLE if the configuration * (auto_movable_ratio, auto_movable_numa_aware) allows for it */ static int online_policy __read_mostly = ONLINE_POLICY_CONTIG_ZONES; static const struct kernel_param_ops online_policy_ops = { .set = set_online_policy, .get = get_online_policy, }; module_param_cb(online_policy, &online_policy_ops, &online_policy, 0644); MODULE_PARM_DESC(online_policy, "Set the online policy (\"contig-zones\", \"auto-movable\") " "Default: \"contig-zones\""); /* * memory_hotplug.auto_movable_ratio: specify maximum MOVABLE:KERNEL ratio * * The ratio represent an upper limit and the kernel might decide to not * online some memory to ZONE_MOVABLE -- e.g., because hotplugged KERNEL memory * doesn't allow for more MOVABLE memory. */ static unsigned int auto_movable_ratio __read_mostly = 301; module_param(auto_movable_ratio, uint, 0644); MODULE_PARM_DESC(auto_movable_ratio, "Set the maximum ratio of MOVABLE:KERNEL memory in the system " "in percent for \"auto-movable\" online policy. Default: 301"); /* * memory_hotplug.auto_movable_numa_aware: consider numa node stats */ #ifdef CONFIG_NUMA static bool auto_movable_numa_aware __read_mostly = true; module_param(auto_movable_numa_aware, bool, 0644); MODULE_PARM_DESC(auto_movable_numa_aware, "Consider numa node stats in addition to global stats in " "\"auto-movable\" online policy. Default: true"); #endif /* CONFIG_NUMA */ /* * online_page_callback contains pointer to current page onlining function. * Initially it is generic_online_page(). If it is required it could be * changed by calling set_online_page_callback() for callback registration * and restore_online_page_callback() for generic callback restore. */ static online_page_callback_t online_page_callback = generic_online_page; static DEFINE_MUTEX(online_page_callback_lock); DEFINE_STATIC_PERCPU_RWSEM(mem_hotplug_lock); void get_online_mems(void) { percpu_down_read(&mem_hotplug_lock); } void put_online_mems(void) { percpu_up_read(&mem_hotplug_lock); } bool movable_node_enabled = false; #ifndef CONFIG_MEMORY_HOTPLUG_DEFAULT_ONLINE int mhp_default_online_type = MMOP_OFFLINE; #else int mhp_default_online_type = MMOP_ONLINE; #endif static int __init setup_memhp_default_state(char *str) { const int online_type = mhp_online_type_from_str(str); if (online_type >= 0) mhp_default_online_type = online_type; return 1; } __setup("memhp_default_state=", setup_memhp_default_state); void mem_hotplug_begin(void) { cpus_read_lock(); percpu_down_write(&mem_hotplug_lock); } void mem_hotplug_done(void) { percpu_up_write(&mem_hotplug_lock); cpus_read_unlock(); } u64 max_mem_size = U64_MAX; /* add this memory to iomem resource */ static struct resource *register_memory_resource(u64 start, u64 size, const char *resource_name) { struct resource *res; unsigned long flags = IORESOURCE_SYSTEM_RAM | IORESOURCE_BUSY; if (strcmp(resource_name, "System RAM")) flags |= IORESOURCE_SYSRAM_DRIVER_MANAGED; if (!mhp_range_allowed(start, size, true)) return ERR_PTR(-E2BIG); /* * Make sure value parsed from 'mem=' only restricts memory adding * while booting, so that memory hotplug won't be impacted. Please * refer to document of 'mem=' in kernel-parameters.txt for more * details. */ if (start + size > max_mem_size && system_state < SYSTEM_RUNNING) return ERR_PTR(-E2BIG); /* * Request ownership of the new memory range. This might be * a child of an existing resource that was present but * not marked as busy. */ res = __request_region(&iomem_resource, start, size, resource_name, flags); if (!res) { pr_debug("Unable to reserve System RAM region: %016llx->%016llx\n", start, start + size); return ERR_PTR(-EEXIST); } return res; } static void release_memory_resource(struct resource *res) { if (!res) return; release_resource(res); kfree(res); } static int check_pfn_span(unsigned long pfn, unsigned long nr_pages) { /* * Disallow all operations smaller than a sub-section and only * allow operations smaller than a section for * SPARSEMEM_VMEMMAP. Note that check_hotplug_memory_range() * enforces a larger memory_block_size_bytes() granularity for * memory that will be marked online, so this check should only * fire for direct arch_{add,remove}_memory() users outside of * add_memory_resource(). */ unsigned long min_align; if (IS_ENABLED(CONFIG_SPARSEMEM_VMEMMAP)) min_align = PAGES_PER_SUBSECTION; else min_align = PAGES_PER_SECTION; if (!IS_ALIGNED(pfn | nr_pages, min_align)) return -EINVAL; return 0; } /* * Return page for the valid pfn only if the page is online. All pfn * walkers which rely on the fully initialized page->flags and others * should use this rather than pfn_valid && pfn_to_page */ struct page *pfn_to_online_page(unsigned long pfn) { unsigned long nr = pfn_to_section_nr(pfn); struct dev_pagemap *pgmap; struct mem_section *ms; if (nr >= NR_MEM_SECTIONS) return NULL; ms = __nr_to_section(nr); if (!online_section(ms)) return NULL; /* * Save some code text when online_section() + * pfn_section_valid() are sufficient. */ if (IS_ENABLED(CONFIG_HAVE_ARCH_PFN_VALID) && !pfn_valid(pfn)) return NULL; if (!pfn_section_valid(ms, pfn)) return NULL; if (!online_device_section(ms)) return pfn_to_page(pfn); /* * Slowpath: when ZONE_DEVICE collides with * ZONE_{NORMAL,MOVABLE} within the same section some pfns in * the section may be 'offline' but 'valid'. Only * get_dev_pagemap() can determine sub-section online status. */ pgmap = get_dev_pagemap(pfn, NULL); put_dev_pagemap(pgmap); /* The presence of a pgmap indicates ZONE_DEVICE offline pfn */ if (pgmap) return NULL; return pfn_to_page(pfn); } EXPORT_SYMBOL_GPL(pfn_to_online_page); int __ref __add_pages(int nid, unsigned long pfn, unsigned long nr_pages, struct mhp_params *params) { const unsigned long end_pfn = pfn + nr_pages; unsigned long cur_nr_pages; int err; struct vmem_altmap *altmap = params->altmap; if (WARN_ON_ONCE(!pgprot_val(params->pgprot))) return -EINVAL; VM_BUG_ON(!mhp_range_allowed(PFN_PHYS(pfn), nr_pages * PAGE_SIZE, false)); if (altmap) { /* * Validate altmap is within bounds of the total request */ if (altmap->base_pfn != pfn || vmem_altmap_offset(altmap) > nr_pages) { pr_warn_once("memory add fail, invalid altmap\n"); return -EINVAL; } altmap->alloc = 0; } if (check_pfn_span(pfn, nr_pages)) { WARN(1, "Misaligned %s start: %#lx end: %#lx\n", __func__, pfn, pfn + nr_pages - 1); return -EINVAL; } for (; pfn < end_pfn; pfn += cur_nr_pages) { /* Select all remaining pages up to the next section boundary */ cur_nr_pages = min(end_pfn - pfn, SECTION_ALIGN_UP(pfn + 1) - pfn); err = sparse_add_section(nid, pfn, cur_nr_pages, altmap, params->pgmap); if (err) break; cond_resched(); } vmemmap_populate_print_last(); return err; } /* find the smallest valid pfn in the range [start_pfn, end_pfn) */ static unsigned long find_smallest_section_pfn(int nid, struct zone *zone, unsigned long start_pfn, unsigned long end_pfn) { for (; start_pfn < end_pfn; start_pfn += PAGES_PER_SUBSECTION) { if (unlikely(!pfn_to_online_page(start_pfn))) continue; if (unlikely(pfn_to_nid(start_pfn) != nid)) continue; if (zone != page_zone(pfn_to_page(start_pfn))) continue; return start_pfn; } return 0; } /* find the biggest valid pfn in the range [start_pfn, end_pfn). */ static unsigned long find_biggest_section_pfn(int nid, struct zone *zone, unsigned long start_pfn, unsigned long end_pfn) { unsigned long pfn; /* pfn is the end pfn of a memory section. */ pfn = end_pfn - 1; for (; pfn >= start_pfn; pfn -= PAGES_PER_SUBSECTION) { if (unlikely(!pfn_to_online_page(pfn))) continue; if (unlikely(pfn_to_nid(pfn) != nid)) continue; if (zone != page_zone(pfn_to_page(pfn))) continue; return pfn; } return 0; } static void shrink_zone_span(struct zone *zone, unsigned long start_pfn, unsigned long end_pfn) { unsigned long pfn; int nid = zone_to_nid(zone); if (zone->zone_start_pfn == start_pfn) { /* * If the section is smallest section in the zone, it need * shrink zone->zone_start_pfn and zone->zone_spanned_pages. * In this case, we find second smallest valid mem_section * for shrinking zone. */ pfn = find_smallest_section_pfn(nid, zone, end_pfn, zone_end_pfn(zone)); if (pfn) { zone->spanned_pages = zone_end_pfn(zone) - pfn; zone->zone_start_pfn = pfn; } else { zone->zone_start_pfn = 0; zone->spanned_pages = 0; } } else if (zone_end_pfn(zone) == end_pfn) { /* * If the section is biggest section in the zone, it need * shrink zone->spanned_pages. * In this case, we find second biggest valid mem_section for * shrinking zone. */ pfn = find_biggest_section_pfn(nid, zone, zone->zone_start_pfn, start_pfn); if (pfn) zone->spanned_pages = pfn - zone->zone_start_pfn + 1; else { zone->zone_start_pfn = 0; zone->spanned_pages = 0; } } } static void update_pgdat_span(struct pglist_data *pgdat) { unsigned long node_start_pfn = 0, node_end_pfn = 0; struct zone *zone; for (zone = pgdat->node_zones; zone < pgdat->node_zones + MAX_NR_ZONES; zone++) { unsigned long end_pfn = zone_end_pfn(zone); /* No need to lock the zones, they can't change. */ if (!zone->spanned_pages) continue; if (!node_end_pfn) { node_start_pfn = zone->zone_start_pfn; node_end_pfn = end_pfn; continue; } if (end_pfn > node_end_pfn) node_end_pfn = end_pfn; if (zone->zone_start_pfn < node_start_pfn) node_start_pfn = zone->zone_start_pfn; } pgdat->node_start_pfn = node_start_pfn; pgdat->node_spanned_pages = node_end_pfn - node_start_pfn; } void __ref remove_pfn_range_from_zone(struct zone *zone, unsigned long start_pfn, unsigned long nr_pages) { const unsigned long end_pfn = start_pfn + nr_pages; struct pglist_data *pgdat = zone->zone_pgdat; unsigned long pfn, cur_nr_pages; /* Poison struct pages because they are now uninitialized again. */ for (pfn = start_pfn; pfn < end_pfn; pfn += cur_nr_pages) { cond_resched(); /* Select all remaining pages up to the next section boundary */ cur_nr_pages = min(end_pfn - pfn, SECTION_ALIGN_UP(pfn + 1) - pfn); page_init_poison(pfn_to_page(pfn), sizeof(struct page) * cur_nr_pages); } /* * Zone shrinking code cannot properly deal with ZONE_DEVICE. So * we will not try to shrink the zones - which is okay as * set_zone_contiguous() cannot deal with ZONE_DEVICE either way. */ if (zone_is_zone_device(zone)) return; clear_zone_contiguous(zone); shrink_zone_span(zone, start_pfn, start_pfn + nr_pages); update_pgdat_span(pgdat); set_zone_contiguous(zone); } /** * __remove_pages() - remove sections of pages * @pfn: starting pageframe (must be aligned to start of a section) * @nr_pages: number of pages to remove (must be multiple of section size) * @altmap: alternative device page map or %NULL if default memmap is used * * Generic helper function to remove section mappings and sysfs entries * for the section of the memory we are removing. Caller needs to make * sure that pages are marked reserved and zones are adjust properly by * calling offline_pages(). */ void __remove_pages(unsigned long pfn, unsigned long nr_pages, struct vmem_altmap *altmap) { const unsigned long end_pfn = pfn + nr_pages; unsigned long cur_nr_pages; if (check_pfn_span(pfn, nr_pages)) { WARN(1, "Misaligned %s start: %#lx end: %#lx\n", __func__, pfn, pfn + nr_pages - 1); return; } for (; pfn < end_pfn; pfn += cur_nr_pages) { cond_resched(); /* Select all remaining pages up to the next section boundary */ cur_nr_pages = min(end_pfn - pfn, SECTION_ALIGN_UP(pfn + 1) - pfn); sparse_remove_section(pfn, cur_nr_pages, altmap); } } int set_online_page_callback(online_page_callback_t callback) { int rc = -EINVAL; get_online_mems(); mutex_lock(&online_page_callback_lock); if (online_page_callback == generic_online_page) { online_page_callback = callback; rc = 0; } mutex_unlock(&online_page_callback_lock); put_online_mems(); return rc; } EXPORT_SYMBOL_GPL(set_online_page_callback); int restore_online_page_callback(online_page_callback_t callback) { int rc = -EINVAL; get_online_mems(); mutex_lock(&online_page_callback_lock); if (online_page_callback == callback) { online_page_callback = generic_online_page; rc = 0; } mutex_unlock(&online_page_callback_lock); put_online_mems(); return rc; } EXPORT_SYMBOL_GPL(restore_online_page_callback); void generic_online_page(struct page *page, unsigned int order) { /* * Freeing the page with debug_pagealloc enabled will try to unmap it, * so we should map it first. This is better than introducing a special * case in page freeing fast path. */ debug_pagealloc_map_pages(page, 1 << order); __free_pages_core(page, order); totalram_pages_add(1UL << order); } EXPORT_SYMBOL_GPL(generic_online_page); static void online_pages_range(unsigned long start_pfn, unsigned long nr_pages) { const unsigned long end_pfn = start_pfn + nr_pages; unsigned long pfn; /* * Online the pages in MAX_ORDER aligned chunks. The callback might * decide to not expose all pages to the buddy (e.g., expose them * later). We account all pages as being online and belonging to this * zone ("present"). * When using memmap_on_memory, the range might not be aligned to * MAX_ORDER_NR_PAGES - 1, but pageblock aligned. __ffs() will detect * this and the first chunk to online will be pageblock_nr_pages. */ for (pfn = start_pfn; pfn < end_pfn;) { int order; /* * Free to online pages in the largest chunks alignment allows. * * __ffs() behaviour is undefined for 0. start == 0 is * MAX_ORDER-aligned, Set order to MAX_ORDER for the case. */ if (pfn) order = min_t(int, MAX_ORDER, __ffs(pfn)); else order = MAX_ORDER; (*online_page_callback)(pfn_to_page(pfn), order); pfn += (1UL << order); } /* mark all involved sections as online */ online_mem_sections(start_pfn, end_pfn); } /* check which state of node_states will be changed when online memory */ static void node_states_check_changes_online(unsigned long nr_pages, struct zone *zone, struct memory_notify *arg) { int nid = zone_to_nid(zone); arg->status_change_nid = NUMA_NO_NODE; arg->status_change_nid_normal = NUMA_NO_NODE; if (!node_state(nid, N_MEMORY)) arg->status_change_nid = nid; if (zone_idx(zone) <= ZONE_NORMAL && !node_state(nid, N_NORMAL_MEMORY)) arg->status_change_nid_normal = nid; } static void node_states_set_node(int node, struct memory_notify *arg) { if (arg->status_change_nid_normal >= 0) node_set_state(node, N_NORMAL_MEMORY); if (arg->status_change_nid >= 0) node_set_state(node, N_MEMORY); } static void __meminit resize_zone_range(struct zone *zone, unsigned long start_pfn, unsigned long nr_pages) { unsigned long old_end_pfn = zone_end_pfn(zone); if (zone_is_empty(zone) || start_pfn < zone->zone_start_pfn) zone->zone_start_pfn = start_pfn; zone->spanned_pages = max(start_pfn + nr_pages, old_end_pfn) - zone->zone_start_pfn; } static void __meminit resize_pgdat_range(struct pglist_data *pgdat, unsigned long start_pfn, unsigned long nr_pages) { unsigned long old_end_pfn = pgdat_end_pfn(pgdat); if (!pgdat->node_spanned_pages || start_pfn < pgdat->node_start_pfn) pgdat->node_start_pfn = start_pfn; pgdat->node_spanned_pages = max(start_pfn + nr_pages, old_end_pfn) - pgdat->node_start_pfn; } #ifdef CONFIG_ZONE_DEVICE static void section_taint_zone_device(unsigned long pfn) { struct mem_section *ms = __pfn_to_section(pfn); ms->section_mem_map |= SECTION_TAINT_ZONE_DEVICE; } #else static inline void section_taint_zone_device(unsigned long pfn) { } #endif /* * Associate the pfn range with the given zone, initializing the memmaps * and resizing the pgdat/zone data to span the added pages. After this * call, all affected pages are PG_reserved. * * All aligned pageblocks are initialized to the specified migratetype * (usually MIGRATE_MOVABLE). Besides setting the migratetype, no related * zone stats (e.g., nr_isolate_pageblock) are touched. */ void __ref move_pfn_range_to_zone(struct zone *zone, unsigned long start_pfn, unsigned long nr_pages, struct vmem_altmap *altmap, int migratetype) { struct pglist_data *pgdat = zone->zone_pgdat; int nid = pgdat->node_id; clear_zone_contiguous(zone); if (zone_is_empty(zone)) init_currently_empty_zone(zone, start_pfn, nr_pages); resize_zone_range(zone, start_pfn, nr_pages); resize_pgdat_range(pgdat, start_pfn, nr_pages); /* * Subsection population requires care in pfn_to_online_page(). * Set the taint to enable the slow path detection of * ZONE_DEVICE pages in an otherwise ZONE_{NORMAL,MOVABLE} * section. */ if (zone_is_zone_device(zone)) { if (!IS_ALIGNED(start_pfn, PAGES_PER_SECTION)) section_taint_zone_device(start_pfn); if (!IS_ALIGNED(start_pfn + nr_pages, PAGES_PER_SECTION)) section_taint_zone_device(start_pfn + nr_pages); } /* * TODO now we have a visible range of pages which are not associated * with their zone properly. Not nice but set_pfnblock_flags_mask * expects the zone spans the pfn range. All the pages in the range * are reserved so nobody should be touching them so we should be safe */ memmap_init_range(nr_pages, nid, zone_idx(zone), start_pfn, 0, MEMINIT_HOTPLUG, altmap, migratetype); set_zone_contiguous(zone); } struct auto_movable_stats { unsigned long kernel_early_pages; unsigned long movable_pages; }; static void auto_movable_stats_account_zone(struct auto_movable_stats *stats, struct zone *zone) { if (zone_idx(zone) == ZONE_MOVABLE) { stats->movable_pages += zone->present_pages; } else { stats->kernel_early_pages += zone->present_early_pages; #ifdef CONFIG_CMA /* * CMA pages (never on hotplugged memory) behave like * ZONE_MOVABLE. */ stats->movable_pages += zone->cma_pages; stats->kernel_early_pages -= zone->cma_pages; #endif /* CONFIG_CMA */ } } struct auto_movable_group_stats { unsigned long movable_pages; unsigned long req_kernel_early_pages; }; static int auto_movable_stats_account_group(struct memory_group *group, void *arg) { const int ratio = READ_ONCE(auto_movable_ratio); struct auto_movable_group_stats *stats = arg; long pages; /* * We don't support modifying the config while the auto-movable online * policy is already enabled. Just avoid the division by zero below. */ if (!ratio) return 0; /* * Calculate how many early kernel pages this group requires to * satisfy the configured zone ratio. */ pages = group->present_movable_pages * 100 / ratio; pages -= group->present_kernel_pages; if (pages > 0) stats->req_kernel_early_pages += pages; stats->movable_pages += group->present_movable_pages; return 0; } static bool auto_movable_can_online_movable(int nid, struct memory_group *group, unsigned long nr_pages) { unsigned long kernel_early_pages, movable_pages; struct auto_movable_group_stats group_stats = {}; struct auto_movable_stats stats = {}; pg_data_t *pgdat = NODE_DATA(nid); struct zone *zone; int i; /* Walk all relevant zones and collect MOVABLE vs. KERNEL stats. */ if (nid == NUMA_NO_NODE) { /* TODO: cache values */ for_each_populated_zone(zone) auto_movable_stats_account_zone(&stats, zone); } else { for (i = 0; i < MAX_NR_ZONES; i++) { zone = pgdat->node_zones + i; if (populated_zone(zone)) auto_movable_stats_account_zone(&stats, zone); } } kernel_early_pages = stats.kernel_early_pages; movable_pages = stats.movable_pages; /* * Kernel memory inside dynamic memory group allows for more MOVABLE * memory within the same group. Remove the effect of all but the * current group from the stats. */ walk_dynamic_memory_groups(nid, auto_movable_stats_account_group, group, &group_stats); if (kernel_early_pages <= group_stats.req_kernel_early_pages) return false; kernel_early_pages -= group_stats.req_kernel_early_pages; movable_pages -= group_stats.movable_pages; if (group && group->is_dynamic) kernel_early_pages += group->present_kernel_pages; /* * Test if we could online the given number of pages to ZONE_MOVABLE * and still stay in the configured ratio. */ movable_pages += nr_pages; return movable_pages <= (auto_movable_ratio * kernel_early_pages) / 100; } /* * Returns a default kernel memory zone for the given pfn range. * If no kernel zone covers this pfn range it will automatically go * to the ZONE_NORMAL. */ static struct zone *default_kernel_zone_for_pfn(int nid, unsigned long start_pfn, unsigned long nr_pages) { struct pglist_data *pgdat = NODE_DATA(nid); int zid; for (zid = 0; zid < ZONE_NORMAL; zid++) { struct zone *zone = &pgdat->node_zones[zid]; if (zone_intersects(zone, start_pfn, nr_pages)) return zone; } return &pgdat->node_zones[ZONE_NORMAL]; } /* * Determine to which zone to online memory dynamically based on user * configuration and system stats. We care about the following ratio: * * MOVABLE : KERNEL * * Whereby MOVABLE is memory in ZONE_MOVABLE and KERNEL is memory in * one of the kernel zones. CMA pages inside one of the kernel zones really * behaves like ZONE_MOVABLE, so we treat them accordingly. * * We don't allow for hotplugged memory in a KERNEL zone to increase the * amount of MOVABLE memory we can have, so we end up with: * * MOVABLE : KERNEL_EARLY * * Whereby KERNEL_EARLY is memory in one of the kernel zones, available sinze * boot. We base our calculation on KERNEL_EARLY internally, because: * * a) Hotplugged memory in one of the kernel zones can sometimes still get * hotunplugged, especially when hot(un)plugging individual memory blocks. * There is no coordination across memory devices, therefore "automatic" * hotunplugging, as implemented in hypervisors, could result in zone * imbalances. * b) Early/boot memory in one of the kernel zones can usually not get * hotunplugged again (e.g., no firmware interface to unplug, fragmented * with unmovable allocations). While there are corner cases where it might * still work, it is barely relevant in practice. * * Exceptions are dynamic memory groups, which allow for more MOVABLE * memory within the same memory group -- because in that case, there is * coordination within the single memory device managed by a single driver. * * We rely on "present pages" instead of "managed pages", as the latter is * highly unreliable and dynamic in virtualized environments, and does not * consider boot time allocations. For example, memory ballooning adjusts the * managed pages when inflating/deflating the balloon, and balloon compaction * can even migrate inflated pages between zones. * * Using "present pages" is better but some things to keep in mind are: * * a) Some memblock allocations, such as for the crashkernel area, are * effectively unused by the kernel, yet they account to "present pages". * Fortunately, these allocations are comparatively small in relevant setups * (e.g., fraction of system memory). * b) Some hotplugged memory blocks in virtualized environments, esecially * hotplugged by virtio-mem, look like they are completely present, however, * only parts of the memory block are actually currently usable. * "present pages" is an upper limit that can get reached at runtime. As * we base our calculations on KERNEL_EARLY, this is not an issue. */ static struct zone *auto_movable_zone_for_pfn(int nid, struct memory_group *group, unsigned long pfn, unsigned long nr_pages) { unsigned long online_pages = 0, max_pages, end_pfn; struct page *page; if (!auto_movable_ratio) goto kernel_zone; if (group && !group->is_dynamic) { max_pages = group->s.max_pages; online_pages = group->present_movable_pages; /* If anything is !MOVABLE online the rest !MOVABLE. */ if (group->present_kernel_pages) goto kernel_zone; } else if (!group || group->d.unit_pages == nr_pages) { max_pages = nr_pages; } else { max_pages = group->d.unit_pages; /* * Take a look at all online sections in the current unit. * We can safely assume that all pages within a section belong * to the same zone, because dynamic memory groups only deal * with hotplugged memory. */ pfn = ALIGN_DOWN(pfn, group->d.unit_pages); end_pfn = pfn + group->d.unit_pages; for (; pfn < end_pfn; pfn += PAGES_PER_SECTION) { page = pfn_to_online_page(pfn); if (!page) continue; /* If anything is !MOVABLE online the rest !MOVABLE. */ if (!is_zone_movable_page(page)) goto kernel_zone; online_pages += PAGES_PER_SECTION; } } /* * Online MOVABLE if we could *currently* online all remaining parts * MOVABLE. We expect to (add+) online them immediately next, so if * nobody interferes, all will be MOVABLE if possible. */ nr_pages = max_pages - online_pages; if (!auto_movable_can_online_movable(NUMA_NO_NODE, group, nr_pages)) goto kernel_zone; #ifdef CONFIG_NUMA if (auto_movable_numa_aware && !auto_movable_can_online_movable(nid, group, nr_pages)) goto kernel_zone; #endif /* CONFIG_NUMA */ return &NODE_DATA(nid)->node_zones[ZONE_MOVABLE]; kernel_zone: return default_kernel_zone_for_pfn(nid, pfn, nr_pages); } static inline struct zone *default_zone_for_pfn(int nid, unsigned long start_pfn, unsigned long nr_pages) { struct zone *kernel_zone = default_kernel_zone_for_pfn(nid, start_pfn, nr_pages); struct zone *movable_zone = &NODE_DATA(nid)->node_zones[ZONE_MOVABLE]; bool in_kernel = zone_intersects(kernel_zone, start_pfn, nr_pages); bool in_movable = zone_intersects(movable_zone, start_pfn, nr_pages); /* * We inherit the existing zone in a simple case where zones do not * overlap in the given range */ if (in_kernel ^ in_movable) return (in_kernel) ? kernel_zone : movable_zone; /* * If the range doesn't belong to any zone or two zones overlap in the * given range then we use movable zone only if movable_node is * enabled because we always online to a kernel zone by default. */ return movable_node_enabled ? movable_zone : kernel_zone; } struct zone *zone_for_pfn_range(int online_type, int nid, struct memory_group *group, unsigned long start_pfn, unsigned long nr_pages) { if (online_type == MMOP_ONLINE_KERNEL) return default_kernel_zone_for_pfn(nid, start_pfn, nr_pages); if (online_type == MMOP_ONLINE_MOVABLE) return &NODE_DATA(nid)->node_zones[ZONE_MOVABLE]; if (online_policy == ONLINE_POLICY_AUTO_MOVABLE) return auto_movable_zone_for_pfn(nid, group, start_pfn, nr_pages); return default_zone_for_pfn(nid, start_pfn, nr_pages); } /* * This function should only be called by memory_block_{online,offline}, * and {online,offline}_pages. */ void adjust_present_page_count(struct page *page, struct memory_group *group, long nr_pages) { struct zone *zone = page_zone(page); const bool movable = zone_idx(zone) == ZONE_MOVABLE; /* * We only support onlining/offlining/adding/removing of complete * memory blocks; therefore, either all is either early or hotplugged. */ if (early_section(__pfn_to_section(page_to_pfn(page)))) zone->present_early_pages += nr_pages; zone->present_pages += nr_pages; zone->zone_pgdat->node_present_pages += nr_pages; if (group && movable) group->present_movable_pages += nr_pages; else if (group && !movable) group->present_kernel_pages += nr_pages; } int mhp_init_memmap_on_memory(unsigned long pfn, unsigned long nr_pages, struct zone *zone) { unsigned long end_pfn = pfn + nr_pages; int ret, i; ret = kasan_add_zero_shadow(__va(PFN_PHYS(pfn)), PFN_PHYS(nr_pages)); if (ret) return ret; move_pfn_range_to_zone(zone, pfn, nr_pages, NULL, MIGRATE_UNMOVABLE); for (i = 0; i < nr_pages; i++) SetPageVmemmapSelfHosted(pfn_to_page(pfn + i)); /* * It might be that the vmemmap_pages fully span sections. If that is * the case, mark those sections online here as otherwise they will be * left offline. */ if (nr_pages >= PAGES_PER_SECTION) online_mem_sections(pfn, ALIGN_DOWN(end_pfn, PAGES_PER_SECTION)); return ret; } void mhp_deinit_memmap_on_memory(unsigned long pfn, unsigned long nr_pages) { unsigned long end_pfn = pfn + nr_pages; /* * It might be that the vmemmap_pages fully span sections. If that is * the case, mark those sections offline here as otherwise they will be * left online. */ if (nr_pages >= PAGES_PER_SECTION) offline_mem_sections(pfn, ALIGN_DOWN(end_pfn, PAGES_PER_SECTION)); /* * The pages associated with this vmemmap have been offlined, so * we can reset its state here. */ remove_pfn_range_from_zone(page_zone(pfn_to_page(pfn)), pfn, nr_pages); kasan_remove_zero_shadow(__va(PFN_PHYS(pfn)), PFN_PHYS(nr_pages)); } int __ref online_pages(unsigned long pfn, unsigned long nr_pages, struct zone *zone, struct memory_group *group) { unsigned long flags; int need_zonelists_rebuild = 0; const int nid = zone_to_nid(zone); int ret; struct memory_notify arg; /* * {on,off}lining is constrained to full memory sections (or more * precisely to memory blocks from the user space POV). * memmap_on_memory is an exception because it reserves initial part * of the physical memory space for vmemmaps. That space is pageblock * aligned. */ if (WARN_ON_ONCE(!nr_pages || !pageblock_aligned(pfn) || !IS_ALIGNED(pfn + nr_pages, PAGES_PER_SECTION))) return -EINVAL; mem_hotplug_begin(); /* associate pfn range with the zone */ move_pfn_range_to_zone(zone, pfn, nr_pages, NULL, MIGRATE_ISOLATE); arg.start_pfn = pfn; arg.nr_pages = nr_pages; node_states_check_changes_online(nr_pages, zone, &arg); ret = memory_notify(MEM_GOING_ONLINE, &arg); ret = notifier_to_errno(ret); if (ret) goto failed_addition; /* * Fixup the number of isolated pageblocks before marking the sections * onlining, such that undo_isolate_page_range() works correctly. */ spin_lock_irqsave(&zone->lock, flags); zone->nr_isolate_pageblock += nr_pages / pageblock_nr_pages; spin_unlock_irqrestore(&zone->lock, flags); /* * If this zone is not populated, then it is not in zonelist. * This means the page allocator ignores this zone. * So, zonelist must be updated after online. */ if (!populated_zone(zone)) { need_zonelists_rebuild = 1; setup_zone_pageset(zone); } online_pages_range(pfn, nr_pages); adjust_present_page_count(pfn_to_page(pfn), group, nr_pages); node_states_set_node(nid, &arg); if (need_zonelists_rebuild) build_all_zonelists(NULL); /* Basic onlining is complete, allow allocation of onlined pages. */ undo_isolate_page_range(pfn, pfn + nr_pages, MIGRATE_MOVABLE); /* * Freshly onlined pages aren't shuffled (e.g., all pages are placed to * the tail of the freelist when undoing isolation). Shuffle the whole * zone to make sure the just onlined pages are properly distributed * across the whole freelist - to create an initial shuffle. */ shuffle_zone(zone); /* reinitialise watermarks and update pcp limits */ init_per_zone_wmark_min(); kswapd_run(nid); kcompactd_run(nid); writeback_set_ratelimit(); memory_notify(MEM_ONLINE, &arg); mem_hotplug_done(); return 0; failed_addition: pr_debug("online_pages [mem %#010llx-%#010llx] failed\n", (unsigned long long) pfn << PAGE_SHIFT, (((unsigned long long) pfn + nr_pages) << PAGE_SHIFT) - 1); memory_notify(MEM_CANCEL_ONLINE, &arg); remove_pfn_range_from_zone(zone, pfn, nr_pages); mem_hotplug_done(); return ret; } /* we are OK calling __meminit stuff here - we have CONFIG_MEMORY_HOTPLUG */ static pg_data_t __ref *hotadd_init_pgdat(int nid) { struct pglist_data *pgdat; /* * NODE_DATA is preallocated (free_area_init) but its internal * state is not allocated completely. Add missing pieces. * Completely offline nodes stay around and they just need * reintialization. */ pgdat = NODE_DATA(nid); /* init node's zones as empty zones, we don't have any present pages.*/ free_area_init_core_hotplug(pgdat); /* * The node we allocated has no zone fallback lists. For avoiding * to access not-initialized zonelist, build here. */ build_all_zonelists(pgdat); return pgdat; } /* * __try_online_node - online a node if offlined * @nid: the node ID * @set_node_online: Whether we want to online the node * called by cpu_up() to online a node without onlined memory. * * Returns: * 1 -> a new node has been allocated * 0 -> the node is already online * -ENOMEM -> the node could not be allocated */ static int __try_online_node(int nid, bool set_node_online) { pg_data_t *pgdat; int ret = 1; if (node_online(nid)) return 0; pgdat = hotadd_init_pgdat(nid); if (!pgdat) { pr_err("Cannot online node %d due to NULL pgdat\n", nid); ret = -ENOMEM; goto out; } if (set_node_online) { node_set_online(nid); ret = register_one_node(nid); BUG_ON(ret); } out: return ret; } /* * Users of this function always want to online/register the node */ int try_online_node(int nid) { int ret; mem_hotplug_begin(); ret = __try_online_node(nid, true); mem_hotplug_done(); return ret; } static int check_hotplug_memory_range(u64 start, u64 size) { /* memory range must be block size aligned */ if (!size || !IS_ALIGNED(start, memory_block_size_bytes()) || !IS_ALIGNED(size, memory_block_size_bytes())) { pr_err("Block size [%#lx] unaligned hotplug range: start %#llx, size %#llx", memory_block_size_bytes(), start, size); return -EINVAL; } return 0; } static int online_memory_block(struct memory_block *mem, void *arg) { mem->online_type = mhp_default_online_type; return device_online(&mem->dev); } #ifndef arch_supports_memmap_on_memory static inline bool arch_supports_memmap_on_memory(unsigned long vmemmap_size) { /* * As default, we want the vmemmap to span a complete PMD such that we * can map the vmemmap using a single PMD if supported by the * architecture. */ return IS_ALIGNED(vmemmap_size, PMD_SIZE); } #endif static bool mhp_supports_memmap_on_memory(unsigned long size) { unsigned long vmemmap_size = memory_block_memmap_size(); unsigned long memmap_pages = memory_block_memmap_on_memory_pages(); /* * Besides having arch support and the feature enabled at runtime, we * need a few more assumptions to hold true: * * a) We span a single memory block: memory onlining/offlinin;g happens * in memory block granularity. We don't want the vmemmap of online * memory blocks to reside on offline memory blocks. In the future, * we might want to support variable-sized memory blocks to make the * feature more versatile. * * b) The vmemmap pages span complete PMDs: We don't want vmemmap code * to populate memory from the altmap for unrelated parts (i.e., * other memory blocks) * * c) The vmemmap pages (and thereby the pages that will be exposed to * the buddy) have to cover full pageblocks: memory onlining/offlining * code requires applicable ranges to be page-aligned, for example, to * set the migratetypes properly. * * TODO: Although we have a check here to make sure that vmemmap pages * fully populate a PMD, it is not the right place to check for * this. A much better solution involves improving vmemmap code * to fallback to base pages when trying to populate vmemmap using * altmap as an alternative source of memory, and we do not exactly * populate a single PMD. */ if (!mhp_memmap_on_memory() || size != memory_block_size_bytes()) return false; /* * Make sure the vmemmap allocation is fully contained * so that we always allocate vmemmap memory from altmap area. */ if (!IS_ALIGNED(vmemmap_size, PAGE_SIZE)) return false; /* * start pfn should be pageblock_nr_pages aligned for correctly * setting migrate types */ if (!pageblock_aligned(memmap_pages)) return false; if (memmap_pages == PHYS_PFN(memory_block_size_bytes())) /* No effective hotplugged memory doesn't make sense. */ return false; return arch_supports_memmap_on_memory(vmemmap_size); } /* * NOTE: The caller must call lock_device_hotplug() to serialize hotplug * and online/offline operations (triggered e.g. by sysfs). * * we are OK calling __meminit stuff here - we have CONFIG_MEMORY_HOTPLUG */ int __ref add_memory_resource(int nid, struct resource *res, mhp_t mhp_flags) { struct mhp_params params = { .pgprot = pgprot_mhp(PAGE_KERNEL) }; enum memblock_flags memblock_flags = MEMBLOCK_NONE; struct vmem_altmap mhp_altmap = { .base_pfn = PHYS_PFN(res->start), .end_pfn = PHYS_PFN(res->end), }; struct memory_group *group = NULL; u64 start, size; bool new_node = false; int ret; start = res->start; size = resource_size(res); ret = check_hotplug_memory_range(start, size); if (ret) return ret; if (mhp_flags & MHP_NID_IS_MGID) { group = memory_group_find_by_id(nid); if (!group) return -EINVAL; nid = group->nid; } if (!node_possible(nid)) { WARN(1, "node %d was absent from the node_possible_map\n", nid); return -EINVAL; } mem_hotplug_begin(); if (IS_ENABLED(CONFIG_ARCH_KEEP_MEMBLOCK)) { if (res->flags & IORESOURCE_SYSRAM_DRIVER_MANAGED) memblock_flags = MEMBLOCK_DRIVER_MANAGED; ret = memblock_add_node(start, size, nid, memblock_flags); if (ret) goto error_mem_hotplug_end; } ret = __try_online_node(nid, false); if (ret < 0) goto error; new_node = ret; /* * Self hosted memmap array */ if (mhp_flags & MHP_MEMMAP_ON_MEMORY) { if (mhp_supports_memmap_on_memory(size)) { mhp_altmap.free = memory_block_memmap_on_memory_pages(); params.altmap = kmalloc(sizeof(struct vmem_altmap), GFP_KERNEL); if (!params.altmap) { ret = -ENOMEM; goto error; } memcpy(params.altmap, &mhp_altmap, sizeof(mhp_altmap)); } /* fallback to not using altmap */ } /* call arch's memory hotadd */ ret = arch_add_memory(nid, start, size, ¶ms); if (ret < 0) goto error_free; /* create memory block devices after memory was added */ ret = create_memory_block_devices(start, size, params.altmap, group); if (ret) { arch_remove_memory(start, size, NULL); goto error_free; } if (new_node) { /* If sysfs file of new node can't be created, cpu on the node * can't be hot-added. There is no rollback way now. * So, check by BUG_ON() to catch it reluctantly.. * We online node here. We can't roll back from here. */ node_set_online(nid); ret = __register_one_node(nid); BUG_ON(ret); } register_memory_blocks_under_node(nid, PFN_DOWN(start), PFN_UP(start + size - 1), MEMINIT_HOTPLUG); /* create new memmap entry */ if (!strcmp(res->name, "System RAM")) firmware_map_add_hotplug(start, start + size, "System RAM"); /* device_online() will take the lock when calling online_pages() */ mem_hotplug_done(); /* * In case we're allowed to merge the resource, flag it and trigger * merging now that adding succeeded. */ if (mhp_flags & MHP_MERGE_RESOURCE) merge_system_ram_resource(res); /* online pages if requested */ if (mhp_default_online_type != MMOP_OFFLINE) walk_memory_blocks(start, size, NULL, online_memory_block); return ret; error_free: kfree(params.altmap); error: if (IS_ENABLED(CONFIG_ARCH_KEEP_MEMBLOCK)) memblock_remove(start, size); error_mem_hotplug_end: mem_hotplug_done(); return ret; } /* requires device_hotplug_lock, see add_memory_resource() */ int __ref __add_memory(int nid, u64 start, u64 size, mhp_t mhp_flags) { struct resource *res; int ret; res = register_memory_resource(start, size, "System RAM"); if (IS_ERR(res)) return PTR_ERR(res); ret = add_memory_resource(nid, res, mhp_flags); if (ret < 0) release_memory_resource(res); return ret; } int add_memory(int nid, u64 start, u64 size, mhp_t mhp_flags) { int rc; lock_device_hotplug(); rc = __add_memory(nid, start, size, mhp_flags); unlock_device_hotplug(); return rc; } EXPORT_SYMBOL_GPL(add_memory); /* * Add special, driver-managed memory to the system as system RAM. Such * memory is not exposed via the raw firmware-provided memmap as system * RAM, instead, it is detected and added by a driver - during cold boot, * after a reboot, and after kexec. * * Reasons why this memory should not be used for the initial memmap of a * kexec kernel or for placing kexec images: * - The booting kernel is in charge of determining how this memory will be * used (e.g., use persistent memory as system RAM) * - Coordination with a hypervisor is required before this memory * can be used (e.g., inaccessible parts). * * For this memory, no entries in /sys/firmware/memmap ("raw firmware-provided * memory map") are created. Also, the created memory resource is flagged * with IORESOURCE_SYSRAM_DRIVER_MANAGED, so in-kernel users can special-case * this memory as well (esp., not place kexec images onto it). * * The resource_name (visible via /proc/iomem) has to have the format * "System RAM ($DRIVER)". */ int add_memory_driver_managed(int nid, u64 start, u64 size, const char *resource_name, mhp_t mhp_flags) { struct resource *res; int rc; if (!resource_name || strstr(resource_name, "System RAM (") != resource_name || resource_name[strlen(resource_name) - 1] != ')') return -EINVAL; lock_device_hotplug(); res = register_memory_resource(start, size, resource_name); if (IS_ERR(res)) { rc = PTR_ERR(res); goto out_unlock; } rc = add_memory_resource(nid, res, mhp_flags); if (rc < 0) release_memory_resource(res); out_unlock: unlock_device_hotplug(); return rc; } EXPORT_SYMBOL_GPL(add_memory_driver_managed); /* * Platforms should define arch_get_mappable_range() that provides * maximum possible addressable physical memory range for which the * linear mapping could be created. The platform returned address * range must adhere to these following semantics. * * - range.start <= range.end * - Range includes both end points [range.start..range.end] * * There is also a fallback definition provided here, allowing the * entire possible physical address range in case any platform does * not define arch_get_mappable_range(). */ struct range __weak arch_get_mappable_range(void) { struct range mhp_range = { .start = 0UL, .end = -1ULL, }; return mhp_range; } struct range mhp_get_pluggable_range(bool need_mapping) { const u64 max_phys = (1ULL << MAX_PHYSMEM_BITS) - 1; struct range mhp_range; if (need_mapping) { mhp_range = arch_get_mappable_range(); if (mhp_range.start > max_phys) { mhp_range.start = 0; mhp_range.end = 0; } mhp_range.end = min_t(u64, mhp_range.end, max_phys); } else { mhp_range.start = 0; mhp_range.end = max_phys; } return mhp_range; } EXPORT_SYMBOL_GPL(mhp_get_pluggable_range); bool mhp_range_allowed(u64 start, u64 size, bool need_mapping) { struct range mhp_range = mhp_get_pluggable_range(need_mapping); u64 end = start + size; if (start < end && start >= mhp_range.start && (end - 1) <= mhp_range.end) return true; pr_warn("Hotplug memory [%#llx-%#llx] exceeds maximum addressable range [%#llx-%#llx]\n", start, end, mhp_range.start, mhp_range.end); return false; } #ifdef CONFIG_MEMORY_HOTREMOVE /* * Scan pfn range [start,end) to find movable/migratable pages (LRU pages, * non-lru movable pages and hugepages). Will skip over most unmovable * pages (esp., pages that can be skipped when offlining), but bail out on * definitely unmovable pages. * * Returns: * 0 in case a movable page is found and movable_pfn was updated. * -ENOENT in case no movable page was found. * -EBUSY in case a definitely unmovable page was found. */ static int scan_movable_pages(unsigned long start, unsigned long end, unsigned long *movable_pfn) { unsigned long pfn; for (pfn = start; pfn < end; pfn++) { struct page *page, *head; unsigned long skip; if (!pfn_valid(pfn)) continue; page = pfn_to_page(pfn); if (PageLRU(page)) goto found; if (__PageMovable(page)) goto found; /* * PageOffline() pages that are not marked __PageMovable() and * have a reference count > 0 (after MEM_GOING_OFFLINE) are * definitely unmovable. If their reference count would be 0, * they could at least be skipped when offlining memory. */ if (PageOffline(page) && page_count(page)) return -EBUSY; if (!PageHuge(page)) continue; head = compound_head(page); /* * This test is racy as we hold no reference or lock. The * hugetlb page could have been free'ed and head is no longer * a hugetlb page before the following check. In such unlikely * cases false positives and negatives are possible. Calling * code must deal with these scenarios. */ if (HPageMigratable(head)) goto found; skip = compound_nr(head) - (page - head); pfn += skip - 1; } return -ENOENT; found: *movable_pfn = pfn; return 0; } static void do_migrate_range(unsigned long start_pfn, unsigned long end_pfn) { unsigned long pfn; struct page *page, *head; LIST_HEAD(source); static DEFINE_RATELIMIT_STATE(migrate_rs, DEFAULT_RATELIMIT_INTERVAL, DEFAULT_RATELIMIT_BURST); for (pfn = start_pfn; pfn < end_pfn; pfn++) { struct folio *folio; bool isolated; if (!pfn_valid(pfn)) continue; page = pfn_to_page(pfn); folio = page_folio(page); head = &folio->page; if (PageHuge(page)) { pfn = page_to_pfn(head) + compound_nr(head) - 1; isolate_hugetlb(folio, &source); continue; } else if (PageTransHuge(page)) pfn = page_to_pfn(head) + thp_nr_pages(page) - 1; /* * HWPoison pages have elevated reference counts so the migration would * fail on them. It also doesn't make any sense to migrate them in the * first place. Still try to unmap such a page in case it is still mapped * (e.g. current hwpoison implementation doesn't unmap KSM pages but keep * the unmap as the catch all safety net). */ if (PageHWPoison(page)) { if (WARN_ON(folio_test_lru(folio))) folio_isolate_lru(folio); if (folio_mapped(folio)) try_to_unmap(folio, TTU_IGNORE_MLOCK); continue; } if (!get_page_unless_zero(page)) continue; /* * We can skip free pages. And we can deal with pages on * LRU and non-lru movable pages. */ if (PageLRU(page)) isolated = isolate_lru_page(page); else isolated = isolate_movable_page(page, ISOLATE_UNEVICTABLE); if (isolated) { list_add_tail(&page->lru, &source); if (!__PageMovable(page)) inc_node_page_state(page, NR_ISOLATED_ANON + page_is_file_lru(page)); } else { if (__ratelimit(&migrate_rs)) { pr_warn("failed to isolate pfn %lx\n", pfn); dump_page(page, "isolation failed"); } } put_page(page); } if (!list_empty(&source)) { nodemask_t nmask = node_states[N_MEMORY]; struct migration_target_control mtc = { .nmask = &nmask, .gfp_mask = GFP_USER | __GFP_MOVABLE | __GFP_RETRY_MAYFAIL, }; int ret; /* * We have checked that migration range is on a single zone so * we can use the nid of the first page to all the others. */ mtc.nid = page_to_nid(list_first_entry(&source, struct page, lru)); /* * try to allocate from a different node but reuse this node * if there are no other online nodes to be used (e.g. we are * offlining a part of the only existing node) */ node_clear(mtc.nid, nmask); if (nodes_empty(nmask)) node_set(mtc.nid, nmask); ret = migrate_pages(&source, alloc_migration_target, NULL, (unsigned long)&mtc, MIGRATE_SYNC, MR_MEMORY_HOTPLUG, NULL); if (ret) { list_for_each_entry(page, &source, lru) { if (__ratelimit(&migrate_rs)) { pr_warn("migrating pfn %lx failed ret:%d\n", page_to_pfn(page), ret); dump_page(page, "migration failure"); } } putback_movable_pages(&source); } } } static int __init cmdline_parse_movable_node(char *p) { movable_node_enabled = true; return 0; } early_param("movable_node", cmdline_parse_movable_node); /* check which state of node_states will be changed when offline memory */ static void node_states_check_changes_offline(unsigned long nr_pages, struct zone *zone, struct memory_notify *arg) { struct pglist_data *pgdat = zone->zone_pgdat; unsigned long present_pages = 0; enum zone_type zt; arg->status_change_nid = NUMA_NO_NODE; arg->status_change_nid_normal = NUMA_NO_NODE; /* * Check whether node_states[N_NORMAL_MEMORY] will be changed. * If the memory to be offline is within the range * [0..ZONE_NORMAL], and it is the last present memory there, * the zones in that range will become empty after the offlining, * thus we can determine that we need to clear the node from * node_states[N_NORMAL_MEMORY]. */ for (zt = 0; zt <= ZONE_NORMAL; zt++) present_pages += pgdat->node_zones[zt].present_pages; if (zone_idx(zone) <= ZONE_NORMAL && nr_pages >= present_pages) arg->status_change_nid_normal = zone_to_nid(zone); /* * We have accounted the pages from [0..ZONE_NORMAL); ZONE_HIGHMEM * does not apply as we don't support 32bit. * Here we count the possible pages from ZONE_MOVABLE. * If after having accounted all the pages, we see that the nr_pages * to be offlined is over or equal to the accounted pages, * we know that the node will become empty, and so, we can clear * it for N_MEMORY as well. */ present_pages += pgdat->node_zones[ZONE_MOVABLE].present_pages; if (nr_pages >= present_pages) arg->status_change_nid = zone_to_nid(zone); } static void node_states_clear_node(int node, struct memory_notify *arg) { if (arg->status_change_nid_normal >= 0) node_clear_state(node, N_NORMAL_MEMORY); if (arg->status_change_nid >= 0) node_clear_state(node, N_MEMORY); } static int count_system_ram_pages_cb(unsigned long start_pfn, unsigned long nr_pages, void *data) { unsigned long *nr_system_ram_pages = data; *nr_system_ram_pages += nr_pages; return 0; } int __ref offline_pages(unsigned long start_pfn, unsigned long nr_pages, struct zone *zone, struct memory_group *group) { const unsigned long end_pfn = start_pfn + nr_pages; unsigned long pfn, system_ram_pages = 0; const int node = zone_to_nid(zone); unsigned long flags; struct memory_notify arg; char *reason; int ret; /* * {on,off}lining is constrained to full memory sections (or more * precisely to memory blocks from the user space POV). * memmap_on_memory is an exception because it reserves initial part * of the physical memory space for vmemmaps. That space is pageblock * aligned. */ if (WARN_ON_ONCE(!nr_pages || !pageblock_aligned(start_pfn) || !IS_ALIGNED(start_pfn + nr_pages, PAGES_PER_SECTION))) return -EINVAL; mem_hotplug_begin(); /* * Don't allow to offline memory blocks that contain holes. * Consequently, memory blocks with holes can never get onlined * via the hotplug path - online_pages() - as hotplugged memory has * no holes. This way, we e.g., don't have to worry about marking * memory holes PG_reserved, don't need pfn_valid() checks, and can * avoid using walk_system_ram_range() later. */ walk_system_ram_range(start_pfn, nr_pages, &system_ram_pages, count_system_ram_pages_cb); if (system_ram_pages != nr_pages) { ret = -EINVAL; reason = "memory holes"; goto failed_removal; } /* * We only support offlining of memory blocks managed by a single zone, * checked by calling code. This is just a sanity check that we might * want to remove in the future. */ if (WARN_ON_ONCE(page_zone(pfn_to_page(start_pfn)) != zone || page_zone(pfn_to_page(end_pfn - 1)) != zone)) { ret = -EINVAL; reason = "multizone range"; goto failed_removal; } /* * Disable pcplists so that page isolation cannot race with freeing * in a way that pages from isolated pageblock are left on pcplists. */ zone_pcp_disable(zone); lru_cache_disable(); /* set above range as isolated */ ret = start_isolate_page_range(start_pfn, end_pfn, MIGRATE_MOVABLE, MEMORY_OFFLINE | REPORT_FAILURE, GFP_USER | __GFP_MOVABLE | __GFP_RETRY_MAYFAIL); if (ret) { reason = "failure to isolate range"; goto failed_removal_pcplists_disabled; } arg.start_pfn = start_pfn; arg.nr_pages = nr_pages; node_states_check_changes_offline(nr_pages, zone, &arg); ret = memory_notify(MEM_GOING_OFFLINE, &arg); ret = notifier_to_errno(ret); if (ret) { reason = "notifier failure"; goto failed_removal_isolated; } do { pfn = start_pfn; do { /* * Historically we always checked for any signal and * can't limit it to fatal signals without eventually * breaking user space. */ if (signal_pending(current)) { ret = -EINTR; reason = "signal backoff"; goto failed_removal_isolated; } cond_resched(); ret = scan_movable_pages(pfn, end_pfn, &pfn); if (!ret) { /* * TODO: fatal migration failures should bail * out */ do_migrate_range(pfn, end_pfn); } } while (!ret); if (ret != -ENOENT) { reason = "unmovable page"; goto failed_removal_isolated; } /* * Dissolve free hugepages in the memory block before doing * offlining actually in order to make hugetlbfs's object * counting consistent. */ ret = dissolve_free_huge_pages(start_pfn, end_pfn); if (ret) { reason = "failure to dissolve huge pages"; goto failed_removal_isolated; } ret = test_pages_isolated(start_pfn, end_pfn, MEMORY_OFFLINE); } while (ret); /* Mark all sections offline and remove free pages from the buddy. */ __offline_isolated_pages(start_pfn, end_pfn); pr_debug("Offlined Pages %ld\n", nr_pages); /* * The memory sections are marked offline, and the pageblock flags * effectively stale; nobody should be touching them. Fixup the number * of isolated pageblocks, memory onlining will properly revert this. */ spin_lock_irqsave(&zone->lock, flags); zone->nr_isolate_pageblock -= nr_pages / pageblock_nr_pages; spin_unlock_irqrestore(&zone->lock, flags); lru_cache_enable(); zone_pcp_enable(zone); /* removal success */ adjust_managed_page_count(pfn_to_page(start_pfn), -nr_pages); adjust_present_page_count(pfn_to_page(start_pfn), group, -nr_pages); /* reinitialise watermarks and update pcp limits */ init_per_zone_wmark_min(); if (!populated_zone(zone)) { zone_pcp_reset(zone); build_all_zonelists(NULL); } node_states_clear_node(node, &arg); if (arg.status_change_nid >= 0) { kcompactd_stop(node); kswapd_stop(node); } writeback_set_ratelimit(); memory_notify(MEM_OFFLINE, &arg); remove_pfn_range_from_zone(zone, start_pfn, nr_pages); mem_hotplug_done(); return 0; failed_removal_isolated: /* pushback to free area */ undo_isolate_page_range(start_pfn, end_pfn, MIGRATE_MOVABLE); memory_notify(MEM_CANCEL_OFFLINE, &arg); failed_removal_pcplists_disabled: lru_cache_enable(); zone_pcp_enable(zone); failed_removal: pr_debug("memory offlining [mem %#010llx-%#010llx] failed due to %s\n", (unsigned long long) start_pfn << PAGE_SHIFT, ((unsigned long long) end_pfn << PAGE_SHIFT) - 1, reason); mem_hotplug_done(); return ret; } static int check_memblock_offlined_cb(struct memory_block *mem, void *arg) { int *nid = arg; *nid = mem->nid; if (unlikely(mem->state != MEM_OFFLINE)) { phys_addr_t beginpa, endpa; beginpa = PFN_PHYS(section_nr_to_pfn(mem->start_section_nr)); endpa = beginpa + memory_block_size_bytes() - 1; pr_warn("removing memory fails, because memory [%pa-%pa] is onlined\n", &beginpa, &endpa); return -EBUSY; } return 0; } static int test_has_altmap_cb(struct memory_block *mem, void *arg) { struct memory_block **mem_ptr = (struct memory_block **)arg; /* * return the memblock if we have altmap * and break callback. */ if (mem->altmap) { *mem_ptr = mem; return 1; } return 0; } static int check_cpu_on_node(int nid) { int cpu; for_each_present_cpu(cpu) { if (cpu_to_node(cpu) == nid) /* * the cpu on this node isn't removed, and we can't * offline this node. */ return -EBUSY; } return 0; } static int check_no_memblock_for_node_cb(struct memory_block *mem, void *arg) { int nid = *(int *)arg; /* * If a memory block belongs to multiple nodes, the stored nid is not * reliable. However, such blocks are always online (e.g., cannot get * offlined) and, therefore, are still spanned by the node. */ return mem->nid == nid ? -EEXIST : 0; } /** * try_offline_node * @nid: the node ID * * Offline a node if all memory sections and cpus of the node are removed. * * NOTE: The caller must call lock_device_hotplug() to serialize hotplug * and online/offline operations before this call. */ void try_offline_node(int nid) { int rc; /* * If the node still spans pages (especially ZONE_DEVICE), don't * offline it. A node spans memory after move_pfn_range_to_zone(), * e.g., after the memory block was onlined. */ if (node_spanned_pages(nid)) return; /* * Especially offline memory blocks might not be spanned by the * node. They will get spanned by the node once they get onlined. * However, they link to the node in sysfs and can get onlined later. */ rc = for_each_memory_block(&nid, check_no_memblock_for_node_cb); if (rc) return; if (check_cpu_on_node(nid)) return; /* * all memory/cpu of this node are removed, we can offline this * node now. */ node_set_offline(nid); unregister_one_node(nid); } EXPORT_SYMBOL(try_offline_node); static int __ref try_remove_memory(u64 start, u64 size) { struct memory_block *mem; int rc = 0, nid = NUMA_NO_NODE; struct vmem_altmap *altmap = NULL; BUG_ON(check_hotplug_memory_range(start, size)); /* * All memory blocks must be offlined before removing memory. Check * whether all memory blocks in question are offline and return error * if this is not the case. * * While at it, determine the nid. Note that if we'd have mixed nodes, * we'd only try to offline the last determined one -- which is good * enough for the cases we care about. */ rc = walk_memory_blocks(start, size, &nid, check_memblock_offlined_cb); if (rc) return rc; /* * We only support removing memory added with MHP_MEMMAP_ON_MEMORY in * the same granularity it was added - a single memory block. */ if (mhp_memmap_on_memory()) { rc = walk_memory_blocks(start, size, &mem, test_has_altmap_cb); if (rc) { if (size != memory_block_size_bytes()) { pr_warn("Refuse to remove %#llx - %#llx," "wrong granularity\n", start, start + size); return -EINVAL; } altmap = mem->altmap; /* * Mark altmap NULL so that we can add a debug * check on memblock free. */ mem->altmap = NULL; } } /* remove memmap entry */ firmware_map_remove(start, start + size, "System RAM"); /* * Memory block device removal under the device_hotplug_lock is * a barrier against racing online attempts. */ remove_memory_block_devices(start, size); mem_hotplug_begin(); arch_remove_memory(start, size, altmap); /* Verify that all vmemmap pages have actually been freed. */ if (altmap) { WARN(altmap->alloc, "Altmap not fully unmapped"); kfree(altmap); } if (IS_ENABLED(CONFIG_ARCH_KEEP_MEMBLOCK)) { memblock_phys_free(start, size); memblock_remove(start, size); } release_mem_region_adjustable(start, size); if (nid != NUMA_NO_NODE) try_offline_node(nid); mem_hotplug_done(); return 0; } /** * __remove_memory - Remove memory if every memory block is offline * @start: physical address of the region to remove * @size: size of the region to remove * * NOTE: The caller must call lock_device_hotplug() to serialize hotplug * and online/offline operations before this call, as required by * try_offline_node(). */ void __remove_memory(u64 start, u64 size) { /* * trigger BUG() if some memory is not offlined prior to calling this * function */ if (try_remove_memory(start, size)) BUG(); } /* * Remove memory if every memory block is offline, otherwise return -EBUSY is * some memory is not offline */ int remove_memory(u64 start, u64 size) { int rc; lock_device_hotplug(); rc = try_remove_memory(start, size); unlock_device_hotplug(); return rc; } EXPORT_SYMBOL_GPL(remove_memory); static int try_offline_memory_block(struct memory_block *mem, void *arg) { uint8_t online_type = MMOP_ONLINE_KERNEL; uint8_t **online_types = arg; struct page *page; int rc; /* * Sense the online_type via the zone of the memory block. Offlining * with multiple zones within one memory block will be rejected * by offlining code ... so we don't care about that. */ page = pfn_to_online_page(section_nr_to_pfn(mem->start_section_nr)); if (page && zone_idx(page_zone(page)) == ZONE_MOVABLE) online_type = MMOP_ONLINE_MOVABLE; rc = device_offline(&mem->dev); /* * Default is MMOP_OFFLINE - change it only if offlining succeeded, * so try_reonline_memory_block() can do the right thing. */ if (!rc) **online_types = online_type; (*online_types)++; /* Ignore if already offline. */ return rc < 0 ? rc : 0; } static int try_reonline_memory_block(struct memory_block *mem, void *arg) { uint8_t **online_types = arg; int rc; if (**online_types != MMOP_OFFLINE) { mem->online_type = **online_types; rc = device_online(&mem->dev); if (rc < 0) pr_warn("%s: Failed to re-online memory: %d", __func__, rc); } /* Continue processing all remaining memory blocks. */ (*online_types)++; return 0; } /* * Try to offline and remove memory. Might take a long time to finish in case * memory is still in use. Primarily useful for memory devices that logically * unplugged all memory (so it's no longer in use) and want to offline + remove * that memory. */ int offline_and_remove_memory(u64 start, u64 size) { const unsigned long mb_count = size / memory_block_size_bytes(); uint8_t *online_types, *tmp; int rc; if (!IS_ALIGNED(start, memory_block_size_bytes()) || !IS_ALIGNED(size, memory_block_size_bytes()) || !size) return -EINVAL; /* * We'll remember the old online type of each memory block, so we can * try to revert whatever we did when offlining one memory block fails * after offlining some others succeeded. */ online_types = kmalloc_array(mb_count, sizeof(*online_types), GFP_KERNEL); if (!online_types) return -ENOMEM; /* * Initialize all states to MMOP_OFFLINE, so when we abort processing in * try_offline_memory_block(), we'll skip all unprocessed blocks in * try_reonline_memory_block(). */ memset(online_types, MMOP_OFFLINE, mb_count); lock_device_hotplug(); tmp = online_types; rc = walk_memory_blocks(start, size, &tmp, try_offline_memory_block); /* * In case we succeeded to offline all memory, remove it. * This cannot fail as it cannot get onlined in the meantime. */ if (!rc) { rc = try_remove_memory(start, size); if (rc) pr_err("%s: Failed to remove memory: %d", __func__, rc); } /* * Rollback what we did. While memory onlining might theoretically fail * (nacked by a notifier), it barely ever happens. */ if (rc) { tmp = online_types; walk_memory_blocks(start, size, &tmp, try_reonline_memory_block); } unlock_device_hotplug(); kfree(online_types); return rc; } EXPORT_SYMBOL_GPL(offline_and_remove_memory); #endif /* CONFIG_MEMORY_HOTREMOVE */ |
35 | 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60 61 62 63 64 65 66 67 68 69 70 71 72 73 74 75 76 77 78 79 80 81 82 83 84 85 86 87 88 89 90 91 92 93 94 95 96 97 98 99 100 101 102 103 104 105 106 107 108 109 110 111 112 113 114 115 116 117 118 119 120 121 122 123 124 125 126 127 128 129 130 131 132 133 134 135 136 137 138 139 140 141 142 143 144 145 146 147 148 149 150 151 152 153 154 155 156 157 158 159 160 161 162 163 164 165 166 167 168 169 170 171 172 173 | // SPDX-License-Identifier: GPL-2.0 /* * Copyright 2021 Google LLC * * sysfs support for blk-crypto. This file contains the code which exports the * crypto capabilities of devices via /sys/block/$disk/queue/crypto/. */ #include <linux/blk-crypto-profile.h> #include "blk-crypto-internal.h" struct blk_crypto_kobj { struct kobject kobj; struct blk_crypto_profile *profile; }; struct blk_crypto_attr { struct attribute attr; ssize_t (*show)(struct blk_crypto_profile *profile, struct blk_crypto_attr *attr, char *page); }; static struct blk_crypto_profile *kobj_to_crypto_profile(struct kobject *kobj) { return container_of(kobj, struct blk_crypto_kobj, kobj)->profile; } static struct blk_crypto_attr *attr_to_crypto_attr(struct attribute *attr) { return container_of(attr, struct blk_crypto_attr, attr); } static ssize_t max_dun_bits_show(struct blk_crypto_profile *profile, struct blk_crypto_attr *attr, char *page) { return sysfs_emit(page, "%u\n", 8 * profile->max_dun_bytes_supported); } static ssize_t num_keyslots_show(struct blk_crypto_profile *profile, struct blk_crypto_attr *attr, char *page) { return sysfs_emit(page, "%u\n", profile->num_slots); } #define BLK_CRYPTO_RO_ATTR(_name) \ static struct blk_crypto_attr _name##_attr = __ATTR_RO(_name) BLK_CRYPTO_RO_ATTR(max_dun_bits); BLK_CRYPTO_RO_ATTR(num_keyslots); static struct attribute *blk_crypto_attrs[] = { &max_dun_bits_attr.attr, &num_keyslots_attr.attr, NULL, }; static const struct attribute_group blk_crypto_attr_group = { .attrs = blk_crypto_attrs, }; /* * The encryption mode attributes. To avoid hard-coding the list of encryption * modes, these are initialized at boot time by blk_crypto_sysfs_init(). */ static struct blk_crypto_attr __blk_crypto_mode_attrs[BLK_ENCRYPTION_MODE_MAX]; static struct attribute *blk_crypto_mode_attrs[BLK_ENCRYPTION_MODE_MAX + 1]; static umode_t blk_crypto_mode_is_visible(struct kobject *kobj, struct attribute *attr, int n) { struct blk_crypto_profile *profile = kobj_to_crypto_profile(kobj); struct blk_crypto_attr *a = attr_to_crypto_attr(attr); int mode_num = a - __blk_crypto_mode_attrs; if (profile->modes_supported[mode_num]) return 0444; return 0; } static ssize_t blk_crypto_mode_show(struct blk_crypto_profile *profile, struct blk_crypto_attr *attr, char *page) { int mode_num = attr - __blk_crypto_mode_attrs; return sysfs_emit(page, "0x%x\n", profile->modes_supported[mode_num]); } static const struct attribute_group blk_crypto_modes_attr_group = { .name = "modes", .attrs = blk_crypto_mode_attrs, .is_visible = blk_crypto_mode_is_visible, }; static const struct attribute_group *blk_crypto_attr_groups[] = { &blk_crypto_attr_group, &blk_crypto_modes_attr_group, NULL, }; static ssize_t blk_crypto_attr_show(struct kobject *kobj, struct attribute *attr, char *page) { struct blk_crypto_profile *profile = kobj_to_crypto_profile(kobj); struct blk_crypto_attr *a = attr_to_crypto_attr(attr); return a->show(profile, a, page); } static const struct sysfs_ops blk_crypto_attr_ops = { .show = blk_crypto_attr_show, }; static void blk_crypto_release(struct kobject *kobj) { kfree(container_of(kobj, struct blk_crypto_kobj, kobj)); } static const struct kobj_type blk_crypto_ktype = { .default_groups = blk_crypto_attr_groups, .sysfs_ops = &blk_crypto_attr_ops, .release = blk_crypto_release, }; /* * If the request_queue has a blk_crypto_profile, create the "crypto" * subdirectory in sysfs (/sys/block/$disk/queue/crypto/). */ int blk_crypto_sysfs_register(struct gendisk *disk) { struct request_queue *q = disk->queue; struct blk_crypto_kobj *obj; int err; if (!q->crypto_profile) return 0; obj = kzalloc(sizeof(*obj), GFP_KERNEL); if (!obj) return -ENOMEM; obj->profile = q->crypto_profile; err = kobject_init_and_add(&obj->kobj, &blk_crypto_ktype, &disk->queue_kobj, "crypto"); if (err) { kobject_put(&obj->kobj); return err; } q->crypto_kobject = &obj->kobj; return 0; } void blk_crypto_sysfs_unregister(struct gendisk *disk) { kobject_put(disk->queue->crypto_kobject); } static int __init blk_crypto_sysfs_init(void) { int i; BUILD_BUG_ON(BLK_ENCRYPTION_MODE_INVALID != 0); for (i = 1; i < BLK_ENCRYPTION_MODE_MAX; i++) { struct blk_crypto_attr *attr = &__blk_crypto_mode_attrs[i]; attr->attr.name = blk_crypto_modes[i].name; attr->attr.mode = 0444; attr->show = blk_crypto_mode_show; blk_crypto_mode_attrs[i - 1] = &attr->attr; } return 0; } subsys_initcall(blk_crypto_sysfs_init); |
113 | 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60 61 62 63 64 65 66 67 68 69 70 71 72 73 74 75 76 77 78 79 80 81 82 83 84 85 86 87 88 89 90 91 92 93 94 95 96 97 98 99 100 101 102 103 104 105 106 107 108 109 110 111 112 113 114 115 116 117 118 119 120 121 122 123 124 125 126 127 128 129 130 131 132 133 134 135 136 137 138 139 140 141 142 143 144 145 146 147 148 149 150 151 152 153 154 155 156 157 158 159 160 161 162 163 164 165 166 167 168 169 170 171 172 173 174 175 176 177 178 179 180 181 182 183 184 185 186 187 188 189 190 191 192 193 194 195 196 197 198 199 200 201 202 203 204 205 206 207 208 209 210 211 212 213 214 215 216 217 218 219 220 221 222 223 224 225 226 227 228 229 230 231 232 233 234 235 236 237 238 239 240 241 242 243 244 245 246 247 248 249 250 251 252 253 254 255 256 257 258 259 260 261 262 263 264 265 266 267 268 269 270 271 272 273 274 275 276 277 278 279 280 281 282 283 284 285 286 287 288 289 290 291 292 293 294 295 296 297 298 299 300 301 302 303 304 305 306 307 308 309 310 311 312 313 314 315 316 317 318 319 320 321 322 323 324 325 326 327 328 329 330 331 | /* SPDX-License-Identifier: GPL-2.0-or-later */ /* * Asynchronous Compression operations * * Copyright (c) 2016, Intel Corporation * Authors: Weigang Li <weigang.li@intel.com> * Giovanni Cabiddu <giovanni.cabiddu@intel.com> */ #ifndef _CRYPTO_ACOMP_H #define _CRYPTO_ACOMP_H #include <linux/atomic.h> #include <linux/container_of.h> #include <linux/crypto.h> #define CRYPTO_ACOMP_ALLOC_OUTPUT 0x00000001 #define CRYPTO_ACOMP_DST_MAX 131072 /** * struct acomp_req - asynchronous (de)compression request * * @base: Common attributes for asynchronous crypto requests * @src: Source Data * @dst: Destination data * @slen: Size of the input buffer * @dlen: Size of the output buffer and number of bytes produced * @flags: Internal flags * @__ctx: Start of private context data */ struct acomp_req { struct crypto_async_request base; struct scatterlist *src; struct scatterlist *dst; unsigned int slen; unsigned int dlen; u32 flags; void *__ctx[] CRYPTO_MINALIGN_ATTR; }; /** * struct crypto_acomp - user-instantiated objects which encapsulate * algorithms and core processing logic * * @compress: Function performs a compress operation * @decompress: Function performs a de-compress operation * @dst_free: Frees destination buffer if allocated inside the * algorithm * @reqsize: Context size for (de)compression requests * @base: Common crypto API algorithm data structure */ struct crypto_acomp { int (*compress)(struct acomp_req *req); int (*decompress)(struct acomp_req *req); void (*dst_free)(struct scatterlist *dst); unsigned int reqsize; struct crypto_tfm base; }; /* * struct crypto_istat_compress - statistics for compress algorithm * @compress_cnt: number of compress requests * @compress_tlen: total data size handled by compress requests * @decompress_cnt: number of decompress requests * @decompress_tlen: total data size handled by decompress requests * @err_cnt: number of error for compress requests */ struct crypto_istat_compress { atomic64_t compress_cnt; atomic64_t compress_tlen; atomic64_t decompress_cnt; atomic64_t decompress_tlen; atomic64_t err_cnt; }; #ifdef CONFIG_CRYPTO_STATS #define COMP_ALG_COMMON_STATS struct crypto_istat_compress stat; #else #define COMP_ALG_COMMON_STATS #endif #define COMP_ALG_COMMON { \ COMP_ALG_COMMON_STATS \ \ struct crypto_alg base; \ } struct comp_alg_common COMP_ALG_COMMON; /** * DOC: Asynchronous Compression API * * The Asynchronous Compression API is used with the algorithms of type * CRYPTO_ALG_TYPE_ACOMPRESS (listed as type "acomp" in /proc/crypto) */ /** * crypto_alloc_acomp() -- allocate ACOMPRESS tfm handle * @alg_name: is the cra_name / name or cra_driver_name / driver name of the * compression algorithm e.g. "deflate" * @type: specifies the type of the algorithm * @mask: specifies the mask for the algorithm * * Allocate a handle for a compression algorithm. The returned struct * crypto_acomp is the handle that is required for any subsequent * API invocation for the compression operations. * * Return: allocated handle in case of success; IS_ERR() is true in case * of an error, PTR_ERR() returns the error code. */ struct crypto_acomp *crypto_alloc_acomp(const char *alg_name, u32 type, u32 mask); /** * crypto_alloc_acomp_node() -- allocate ACOMPRESS tfm handle with desired NUMA node * @alg_name: is the cra_name / name or cra_driver_name / driver name of the * compression algorithm e.g. "deflate" * @type: specifies the type of the algorithm * @mask: specifies the mask for the algorithm * @node: specifies the NUMA node the ZIP hardware belongs to * * Allocate a handle for a compression algorithm. Drivers should try to use * (de)compressors on the specified NUMA node. * The returned struct crypto_acomp is the handle that is required for any * subsequent API invocation for the compression operations. * * Return: allocated handle in case of success; IS_ERR() is true in case * of an error, PTR_ERR() returns the error code. */ struct crypto_acomp *crypto_alloc_acomp_node(const char *alg_name, u32 type, u32 mask, int node); static inline struct crypto_tfm *crypto_acomp_tfm(struct crypto_acomp *tfm) { return &tfm->base; } static inline struct comp_alg_common *__crypto_comp_alg_common( struct crypto_alg *alg) { return container_of(alg, struct comp_alg_common, base); } static inline struct crypto_acomp *__crypto_acomp_tfm(struct crypto_tfm *tfm) { return container_of(tfm, struct crypto_acomp, base); } static inline struct comp_alg_common *crypto_comp_alg_common( struct crypto_acomp *tfm) { return __crypto_comp_alg_common(crypto_acomp_tfm(tfm)->__crt_alg); } static inline unsigned int crypto_acomp_reqsize(struct crypto_acomp *tfm) { return tfm->reqsize; } static inline void acomp_request_set_tfm(struct acomp_req *req, struct crypto_acomp *tfm) { req->base.tfm = crypto_acomp_tfm(tfm); } static inline struct crypto_acomp *crypto_acomp_reqtfm(struct acomp_req *req) { return __crypto_acomp_tfm(req->base.tfm); } /** * crypto_free_acomp() -- free ACOMPRESS tfm handle * * @tfm: ACOMPRESS tfm handle allocated with crypto_alloc_acomp() * * If @tfm is a NULL or error pointer, this function does nothing. */ static inline void crypto_free_acomp(struct crypto_acomp *tfm) { crypto_destroy_tfm(tfm, crypto_acomp_tfm(tfm)); } static inline int crypto_has_acomp(const char *alg_name, u32 type, u32 mask) { type &= ~CRYPTO_ALG_TYPE_MASK; type |= CRYPTO_ALG_TYPE_ACOMPRESS; mask |= CRYPTO_ALG_TYPE_ACOMPRESS_MASK; return crypto_has_alg(alg_name, type, mask); } /** * acomp_request_alloc() -- allocates asynchronous (de)compression request * * @tfm: ACOMPRESS tfm handle allocated with crypto_alloc_acomp() * * Return: allocated handle in case of success or NULL in case of an error */ struct acomp_req *acomp_request_alloc(struct crypto_acomp *tfm); /** * acomp_request_free() -- zeroize and free asynchronous (de)compression * request as well as the output buffer if allocated * inside the algorithm * * @req: request to free */ void acomp_request_free(struct acomp_req *req); /** * acomp_request_set_callback() -- Sets an asynchronous callback * * Callback will be called when an asynchronous operation on a given * request is finished. * * @req: request that the callback will be set for * @flgs: specify for instance if the operation may backlog * @cmlp: callback which will be called * @data: private data used by the caller */ static inline void acomp_request_set_callback(struct acomp_req *req, u32 flgs, crypto_completion_t cmpl, void *data) { req->base.complete = cmpl; req->base.data = data; req->base.flags &= CRYPTO_ACOMP_ALLOC_OUTPUT; req->base.flags |= flgs & ~CRYPTO_ACOMP_ALLOC_OUTPUT; } /** * acomp_request_set_params() -- Sets request parameters * * Sets parameters required by an acomp operation * * @req: asynchronous compress request * @src: pointer to input buffer scatterlist * @dst: pointer to output buffer scatterlist. If this is NULL, the * acomp layer will allocate the output memory * @slen: size of the input buffer * @dlen: size of the output buffer. If dst is NULL, this can be used by * the user to specify the maximum amount of memory to allocate */ static inline void acomp_request_set_params(struct acomp_req *req, struct scatterlist *src, struct scatterlist *dst, unsigned int slen, unsigned int dlen) { req->src = src; req->dst = dst; req->slen = slen; req->dlen = dlen; req->flags &= ~CRYPTO_ACOMP_ALLOC_OUTPUT; if (!req->dst) req->flags |= CRYPTO_ACOMP_ALLOC_OUTPUT; } static inline struct crypto_istat_compress *comp_get_stat( struct comp_alg_common *alg) { #ifdef CONFIG_CRYPTO_STATS return &alg->stat; #else return NULL; #endif } static inline int crypto_comp_errstat(struct comp_alg_common *alg, int err) { if (!IS_ENABLED(CONFIG_CRYPTO_STATS)) return err; if (err && err != -EINPROGRESS && err != -EBUSY) atomic64_inc(&comp_get_stat(alg)->err_cnt); return err; } /** * crypto_acomp_compress() -- Invoke asynchronous compress operation * * Function invokes the asynchronous compress operation * * @req: asynchronous compress request * * Return: zero on success; error code in case of error */ static inline int crypto_acomp_compress(struct acomp_req *req) { struct crypto_acomp *tfm = crypto_acomp_reqtfm(req); struct comp_alg_common *alg; alg = crypto_comp_alg_common(tfm); if (IS_ENABLED(CONFIG_CRYPTO_STATS)) { struct crypto_istat_compress *istat = comp_get_stat(alg); atomic64_inc(&istat->compress_cnt); atomic64_add(req->slen, &istat->compress_tlen); } return crypto_comp_errstat(alg, tfm->compress(req)); } /** * crypto_acomp_decompress() -- Invoke asynchronous decompress operation * * Function invokes the asynchronous decompress operation * * @req: asynchronous compress request * * Return: zero on success; error code in case of error */ static inline int crypto_acomp_decompress(struct acomp_req *req) { struct crypto_acomp *tfm = crypto_acomp_reqtfm(req); struct comp_alg_common *alg; alg = crypto_comp_alg_common(tfm); if (IS_ENABLED(CONFIG_CRYPTO_STATS)) { struct crypto_istat_compress *istat = comp_get_stat(alg); atomic64_inc(&istat->decompress_cnt); atomic64_add(req->slen, &istat->decompress_tlen); } return crypto_comp_errstat(alg, tfm->decompress(req)); } #endif |
7427 1844 971 1064 12 277 366 16 18 1034 113 938 783 63 59 83 31 | 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60 61 62 63 64 65 66 67 68 69 70 71 72 73 74 75 76 77 78 79 80 81 82 83 84 85 86 87 88 89 90 91 92 93 94 95 96 97 98 99 100 101 102 103 104 105 106 107 108 109 110 111 112 113 114 115 116 117 118 119 120 121 122 123 124 125 126 127 128 129 130 131 132 133 134 135 136 137 138 139 140 141 142 143 144 145 146 147 148 149 150 151 152 153 154 155 156 157 158 159 160 161 162 163 164 165 166 167 168 169 170 171 172 173 174 175 176 177 178 179 180 181 182 183 184 185 186 187 188 189 190 191 192 193 194 195 196 197 198 199 200 201 202 203 204 205 206 207 208 209 210 211 212 213 214 215 216 217 218 219 220 221 222 223 224 225 226 227 228 229 230 231 232 233 234 235 236 237 238 239 240 241 242 243 244 245 246 247 248 249 250 251 252 253 254 255 256 257 258 259 260 261 262 263 264 265 266 267 268 269 270 271 272 273 274 275 276 277 278 279 280 281 282 283 284 285 286 287 288 289 290 291 292 293 294 295 296 297 298 299 300 301 302 303 304 305 306 307 308 309 310 311 312 313 314 315 316 317 318 319 320 321 322 323 324 325 326 327 328 329 330 331 332 333 334 | /* SPDX-License-Identifier: GPL-2.0-or-later */ /* Red Black Trees (C) 1999 Andrea Arcangeli <andrea@suse.de> linux/include/linux/rbtree.h To use rbtrees you'll have to implement your own insert and search cores. This will avoid us to use callbacks and to drop drammatically performances. I know it's not the cleaner way, but in C (not in C++) to get performances and genericity... See Documentation/core-api/rbtree.rst for documentation and samples. */ #ifndef _LINUX_RBTREE_H #define _LINUX_RBTREE_H #include <linux/container_of.h> #include <linux/rbtree_types.h> #include <linux/stddef.h> #include <linux/rcupdate.h> #define rb_parent(r) ((struct rb_node *)((r)->__rb_parent_color & ~3)) #define rb_entry(ptr, type, member) container_of(ptr, type, member) #define RB_EMPTY_ROOT(root) (READ_ONCE((root)->rb_node) == NULL) /* 'empty' nodes are nodes that are known not to be inserted in an rbtree */ #define RB_EMPTY_NODE(node) \ ((node)->__rb_parent_color == (unsigned long)(node)) #define RB_CLEAR_NODE(node) \ ((node)->__rb_parent_color = (unsigned long)(node)) extern void rb_insert_color(struct rb_node *, struct rb_root *); extern void rb_erase(struct rb_node *, struct rb_root *); /* Find logical next and previous nodes in a tree */ extern struct rb_node *rb_next(const struct rb_node *); extern struct rb_node *rb_prev(const struct rb_node *); extern struct rb_node *rb_first(const struct rb_root *); extern struct rb_node *rb_last(const struct rb_root *); /* Postorder iteration - always visit the parent after its children */ extern struct rb_node *rb_first_postorder(const struct rb_root *); extern struct rb_node *rb_next_postorder(const struct rb_node *); /* Fast replacement of a single node without remove/rebalance/add/rebalance */ extern void rb_replace_node(struct rb_node *victim, struct rb_node *new, struct rb_root *root); extern void rb_replace_node_rcu(struct rb_node *victim, struct rb_node *new, struct rb_root *root); static inline void rb_link_node(struct rb_node *node, struct rb_node *parent, struct rb_node **rb_link) { node->__rb_parent_color = (unsigned long)parent; node->rb_left = node->rb_right = NULL; *rb_link = node; } static inline void rb_link_node_rcu(struct rb_node *node, struct rb_node *parent, struct rb_node **rb_link) { node->__rb_parent_color = (unsigned long)parent; node->rb_left = node->rb_right = NULL; rcu_assign_pointer(*rb_link, node); } #define rb_entry_safe(ptr, type, member) \ ({ typeof(ptr) ____ptr = (ptr); \ ____ptr ? rb_entry(____ptr, type, member) : NULL; \ }) /** * rbtree_postorder_for_each_entry_safe - iterate in post-order over rb_root of * given type allowing the backing memory of @pos to be invalidated * * @pos: the 'type *' to use as a loop cursor. * @n: another 'type *' to use as temporary storage * @root: 'rb_root *' of the rbtree. * @field: the name of the rb_node field within 'type'. * * rbtree_postorder_for_each_entry_safe() provides a similar guarantee as * list_for_each_entry_safe() and allows the iteration to continue independent * of changes to @pos by the body of the loop. * * Note, however, that it cannot handle other modifications that re-order the * rbtree it is iterating over. This includes calling rb_erase() on @pos, as * rb_erase() may rebalance the tree, causing us to miss some nodes. */ #define rbtree_postorder_for_each_entry_safe(pos, n, root, field) \ for (pos = rb_entry_safe(rb_first_postorder(root), typeof(*pos), field); \ pos && ({ n = rb_entry_safe(rb_next_postorder(&pos->field), \ typeof(*pos), field); 1; }); \ pos = n) /* Same as rb_first(), but O(1) */ #define rb_first_cached(root) (root)->rb_leftmost static inline void rb_insert_color_cached(struct rb_node *node, struct rb_root_cached *root, bool leftmost) { if (leftmost) root->rb_leftmost = node; rb_insert_color(node, &root->rb_root); } static inline struct rb_node * rb_erase_cached(struct rb_node *node, struct rb_root_cached *root) { struct rb_node *leftmost = NULL; if (root->rb_leftmost == node) leftmost = root->rb_leftmost = rb_next(node); rb_erase(node, &root->rb_root); return leftmost; } static inline void rb_replace_node_cached(struct rb_node *victim, struct rb_node *new, struct rb_root_cached *root) { if (root->rb_leftmost == victim) root->rb_leftmost = new; rb_replace_node(victim, new, &root->rb_root); } /* * The below helper functions use 2 operators with 3 different * calling conventions. The operators are related like: * * comp(a->key,b) < 0 := less(a,b) * comp(a->key,b) > 0 := less(b,a) * comp(a->key,b) == 0 := !less(a,b) && !less(b,a) * * If these operators define a partial order on the elements we make no * guarantee on which of the elements matching the key is found. See * rb_find(). * * The reason for this is to allow the find() interface without requiring an * on-stack dummy object, which might not be feasible due to object size. */ /** * rb_add_cached() - insert @node into the leftmost cached tree @tree * @node: node to insert * @tree: leftmost cached tree to insert @node into * @less: operator defining the (partial) node order * * Returns @node when it is the new leftmost, or NULL. */ static __always_inline struct rb_node * rb_add_cached(struct rb_node *node, struct rb_root_cached *tree, bool (*less)(struct rb_node *, const struct rb_node *)) { struct rb_node **link = &tree->rb_root.rb_node; struct rb_node *parent = NULL; bool leftmost = true; while (*link) { parent = *link; if (less(node, parent)) { link = &parent->rb_left; } else { link = &parent->rb_right; leftmost = false; } } rb_link_node(node, parent, link); rb_insert_color_cached(node, tree, leftmost); return leftmost ? node : NULL; } /** * rb_add() - insert @node into @tree * @node: node to insert * @tree: tree to insert @node into * @less: operator defining the (partial) node order */ static __always_inline void rb_add(struct rb_node *node, struct rb_root *tree, bool (*less)(struct rb_node *, const struct rb_node *)) { struct rb_node **link = &tree->rb_node; struct rb_node *parent = NULL; while (*link) { parent = *link; if (less(node, parent)) link = &parent->rb_left; else link = &parent->rb_right; } rb_link_node(node, parent, link); rb_insert_color(node, tree); } /** * rb_find_add() - find equivalent @node in @tree, or add @node * @node: node to look-for / insert * @tree: tree to search / modify * @cmp: operator defining the node order * * Returns the rb_node matching @node, or NULL when no match is found and @node * is inserted. */ static __always_inline struct rb_node * rb_find_add(struct rb_node *node, struct rb_root *tree, int (*cmp)(struct rb_node *, const struct rb_node *)) { struct rb_node **link = &tree->rb_node; struct rb_node *parent = NULL; int c; while (*link) { parent = *link; c = cmp(node, parent); if (c < 0) link = &parent->rb_left; else if (c > 0) link = &parent->rb_right; else return parent; } rb_link_node(node, parent, link); rb_insert_color(node, tree); return NULL; } /** * rb_find() - find @key in tree @tree * @key: key to match * @tree: tree to search * @cmp: operator defining the node order * * Returns the rb_node matching @key or NULL. */ static __always_inline struct rb_node * rb_find(const void *key, const struct rb_root *tree, int (*cmp)(const void *key, const struct rb_node *)) { struct rb_node *node = tree->rb_node; while (node) { int c = cmp(key, node); if (c < 0) node = node->rb_left; else if (c > 0) node = node->rb_right; else return node; } return NULL; } /** * rb_find_first() - find the first @key in @tree * @key: key to match * @tree: tree to search * @cmp: operator defining node order * * Returns the leftmost node matching @key, or NULL. */ static __always_inline struct rb_node * rb_find_first(const void *key, const struct rb_root *tree, int (*cmp)(const void *key, const struct rb_node *)) { struct rb_node *node = tree->rb_node; struct rb_node *match = NULL; while (node) { int c = cmp(key, node); if (c <= 0) { if (!c) match = node; node = node->rb_left; } else if (c > 0) { node = node->rb_right; } } return match; } /** * rb_next_match() - find the next @key in @tree * @key: key to match * @tree: tree to search * @cmp: operator defining node order * * Returns the next node matching @key, or NULL. */ static __always_inline struct rb_node * rb_next_match(const void *key, struct rb_node *node, int (*cmp)(const void *key, const struct rb_node *)) { node = rb_next(node); if (node && cmp(key, node)) node = NULL; return node; } /** * rb_for_each() - iterates a subtree matching @key * @node: iterator * @key: key to match * @tree: tree to search * @cmp: operator defining node order */ #define rb_for_each(node, key, tree, cmp) \ for ((node) = rb_find_first((key), (tree), (cmp)); \ (node); (node) = rb_next_match((key), (node), (cmp))) #endif /* _LINUX_RBTREE_H */ |
397 32 | 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 | /* SPDX-License-Identifier: GPL-2.0 */ #ifndef _NF_CONNTRACK_SEQADJ_H #define _NF_CONNTRACK_SEQADJ_H #include <net/netfilter/nf_conntrack_extend.h> /** * struct nf_ct_seqadj - sequence number adjustment information * * @correction_pos: position of the last TCP sequence number modification * @offset_before: sequence number offset before last modification * @offset_after: sequence number offset after last modification */ struct nf_ct_seqadj { u32 correction_pos; s32 offset_before; s32 offset_after; }; struct nf_conn_seqadj { struct nf_ct_seqadj seq[IP_CT_DIR_MAX]; }; static inline struct nf_conn_seqadj *nfct_seqadj(const struct nf_conn *ct) { return nf_ct_ext_find(ct, NF_CT_EXT_SEQADJ); } static inline struct nf_conn_seqadj *nfct_seqadj_ext_add(struct nf_conn *ct) { return nf_ct_ext_add(ct, NF_CT_EXT_SEQADJ, GFP_ATOMIC); } int nf_ct_seqadj_init(struct nf_conn *ct, enum ip_conntrack_info ctinfo, s32 off); int nf_ct_seqadj_set(struct nf_conn *ct, enum ip_conntrack_info ctinfo, __be32 seq, s32 off); void nf_ct_tcp_seqadj_set(struct sk_buff *skb, struct nf_conn *ct, enum ip_conntrack_info ctinfo, s32 off); int nf_ct_seq_adjust(struct sk_buff *skb, struct nf_conn *ct, enum ip_conntrack_info ctinfo, unsigned int protoff); s32 nf_ct_seq_offset(const struct nf_conn *ct, enum ip_conntrack_dir, u32 seq); #endif /* _NF_CONNTRACK_SEQADJ_H */ |
28 | 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60 61 62 63 64 65 66 67 68 69 70 71 72 73 74 75 76 77 78 79 80 81 82 83 84 85 86 87 88 89 90 91 92 93 94 95 96 97 98 99 100 101 102 103 104 105 106 107 108 109 110 111 112 113 114 | /* SPDX-License-Identifier: GPL-2.0 */ /* * Task I/O accounting operations */ #ifndef __TASK_IO_ACCOUNTING_OPS_INCLUDED #define __TASK_IO_ACCOUNTING_OPS_INCLUDED #include <linux/sched.h> #ifdef CONFIG_TASK_IO_ACCOUNTING static inline void task_io_account_read(size_t bytes) { current->ioac.read_bytes += bytes; } /* * We approximate number of blocks, because we account bytes only. * A 'block' is 512 bytes */ static inline unsigned long task_io_get_inblock(const struct task_struct *p) { return p->ioac.read_bytes >> 9; } static inline void task_io_account_write(size_t bytes) { current->ioac.write_bytes += bytes; } /* * We approximate number of blocks, because we account bytes only. * A 'block' is 512 bytes */ static inline unsigned long task_io_get_oublock(const struct task_struct *p) { return p->ioac.write_bytes >> 9; } static inline void task_io_account_cancelled_write(size_t bytes) { current->ioac.cancelled_write_bytes += bytes; } static inline void task_io_accounting_init(struct task_io_accounting *ioac) { memset(ioac, 0, sizeof(*ioac)); } static inline void task_blk_io_accounting_add(struct task_io_accounting *dst, struct task_io_accounting *src) { dst->read_bytes += src->read_bytes; dst->write_bytes += src->write_bytes; dst->cancelled_write_bytes += src->cancelled_write_bytes; } #else static inline void task_io_account_read(size_t bytes) { } static inline unsigned long task_io_get_inblock(const struct task_struct *p) { return 0; } static inline void task_io_account_write(size_t bytes) { } static inline unsigned long task_io_get_oublock(const struct task_struct *p) { return 0; } static inline void task_io_account_cancelled_write(size_t bytes) { } static inline void task_io_accounting_init(struct task_io_accounting *ioac) { } static inline void task_blk_io_accounting_add(struct task_io_accounting *dst, struct task_io_accounting *src) { } #endif /* CONFIG_TASK_IO_ACCOUNTING */ #ifdef CONFIG_TASK_XACCT static inline void task_chr_io_accounting_add(struct task_io_accounting *dst, struct task_io_accounting *src) { dst->rchar += src->rchar; dst->wchar += src->wchar; dst->syscr += src->syscr; dst->syscw += src->syscw; } #else static inline void task_chr_io_accounting_add(struct task_io_accounting *dst, struct task_io_accounting *src) { } #endif /* CONFIG_TASK_XACCT */ static inline void task_io_accounting_add(struct task_io_accounting *dst, struct task_io_accounting *src) { task_chr_io_accounting_add(dst, src); task_blk_io_accounting_add(dst, src); } #endif /* __TASK_IO_ACCOUNTING_OPS_INCLUDED */ |
105 | 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60 61 62 63 64 65 66 67 68 69 70 71 72 73 74 75 76 77 78 79 80 81 82 83 84 85 86 87 88 89 90 91 92 93 94 95 96 97 98 99 100 101 102 103 104 105 106 107 108 109 110 111 112 113 114 115 116 117 118 119 120 121 122 123 124 125 126 127 128 129 130 131 132 133 134 135 136 137 138 139 140 141 142 143 144 145 146 147 148 149 | // SPDX-License-Identifier: GPL-2.0-only #include <linux/interval_tree.h> #include <linux/interval_tree_generic.h> #include <linux/compiler.h> #include <linux/export.h> #define START(node) ((node)->start) #define LAST(node) ((node)->last) INTERVAL_TREE_DEFINE(struct interval_tree_node, rb, unsigned long, __subtree_last, START, LAST,, interval_tree) EXPORT_SYMBOL_GPL(interval_tree_insert); EXPORT_SYMBOL_GPL(interval_tree_remove); EXPORT_SYMBOL_GPL(interval_tree_iter_first); EXPORT_SYMBOL_GPL(interval_tree_iter_next); #ifdef CONFIG_INTERVAL_TREE_SPAN_ITER /* * Roll nodes[1] into nodes[0] by advancing nodes[1] to the end of a contiguous * span of nodes. This makes nodes[0]->last the end of that contiguous used span * indexes that started at the original nodes[1]->start. nodes[1] is now the * first node starting the next used span. A hole span is between nodes[0]->last * and nodes[1]->start. nodes[1] must be !NULL. */ static void interval_tree_span_iter_next_gap(struct interval_tree_span_iter *state) { struct interval_tree_node *cur = state->nodes[1]; state->nodes[0] = cur; do { if (cur->last > state->nodes[0]->last) state->nodes[0] = cur; cur = interval_tree_iter_next(cur, state->first_index, state->last_index); } while (cur && (state->nodes[0]->last >= cur->start || state->nodes[0]->last + 1 == cur->start)); state->nodes[1] = cur; } void interval_tree_span_iter_first(struct interval_tree_span_iter *iter, struct rb_root_cached *itree, unsigned long first_index, unsigned long last_index) { iter->first_index = first_index; iter->last_index = last_index; iter->nodes[0] = NULL; iter->nodes[1] = interval_tree_iter_first(itree, first_index, last_index); if (!iter->nodes[1]) { /* No nodes intersect the span, whole span is hole */ iter->start_hole = first_index; iter->last_hole = last_index; iter->is_hole = 1; return; } if (iter->nodes[1]->start > first_index) { /* Leading hole on first iteration */ iter->start_hole = first_index; iter->last_hole = iter->nodes[1]->start - 1; iter->is_hole = 1; interval_tree_span_iter_next_gap(iter); return; } /* Starting inside a used */ iter->start_used = first_index; iter->is_hole = 0; interval_tree_span_iter_next_gap(iter); iter->last_used = iter->nodes[0]->last; if (iter->last_used >= last_index) { iter->last_used = last_index; iter->nodes[0] = NULL; iter->nodes[1] = NULL; } } EXPORT_SYMBOL_GPL(interval_tree_span_iter_first); void interval_tree_span_iter_next(struct interval_tree_span_iter *iter) { if (!iter->nodes[0] && !iter->nodes[1]) { iter->is_hole = -1; return; } if (iter->is_hole) { iter->start_used = iter->last_hole + 1; iter->last_used = iter->nodes[0]->last; if (iter->last_used >= iter->last_index) { iter->last_used = iter->last_index; iter->nodes[0] = NULL; iter->nodes[1] = NULL; } iter->is_hole = 0; return; } if (!iter->nodes[1]) { /* Trailing hole */ iter->start_hole = iter->nodes[0]->last + 1; iter->last_hole = iter->last_index; iter->nodes[0] = NULL; iter->is_hole = 1; return; } /* must have both nodes[0] and [1], interior hole */ iter->start_hole = iter->nodes[0]->last + 1; iter->last_hole = iter->nodes[1]->start - 1; iter->is_hole = 1; interval_tree_span_iter_next_gap(iter); } EXPORT_SYMBOL_GPL(interval_tree_span_iter_next); /* * Advance the iterator index to a specific position. The returned used/hole is * updated to start at new_index. This is faster than calling * interval_tree_span_iter_first() as it can avoid full searches in several * cases where the iterator is already set. */ void interval_tree_span_iter_advance(struct interval_tree_span_iter *iter, struct rb_root_cached *itree, unsigned long new_index) { if (iter->is_hole == -1) return; iter->first_index = new_index; if (new_index > iter->last_index) { iter->is_hole = -1; return; } /* Rely on the union aliasing hole/used */ if (iter->start_hole <= new_index && new_index <= iter->last_hole) { iter->start_hole = new_index; return; } if (new_index == iter->last_hole + 1) interval_tree_span_iter_next(iter); else interval_tree_span_iter_first(iter, itree, new_index, iter->last_index); } EXPORT_SYMBOL_GPL(interval_tree_span_iter_advance); #endif |
3 3 3 3 3 3 3 3 3 3 3 5 4 1 5 1 4 4 5 8 2 6 6 2 2 6 1 7 1 6 3 8 1 7 7 3 3 2 8 2 9 1 8 9 3 6 1 1 1 2 1 1 2 3 3 3 3 3 3 3 3 5 4 4 3 5 7 7 7 7 7 7 7 7 7 7 7 7 6 5 4 6 5 5 5 5 5 5 5 7 4 4 4 4 4 4 7 3 1 3 1 1 1 1 1 1 1 1 1 1 3 1 1 1 1 1 1 1 | 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60 61 62 63 64 65 66 67 68 69 70 71 72 73 74 75 76 77 78 79 80 81 82 83 84 85 86 87 88 89 90 91 92 93 94 95 96 97 98 99 100 101 102 103 104 105 106 107 108 109 110 111 112 113 114 115 116 117 118 119 120 121 122 123 124 125 126 127 128 129 130 131 132 133 134 135 136 137 138 139 140 141 142 143 144 145 146 147 148 149 150 151 152 153 154 155 156 157 158 159 160 161 162 163 164 165 166 167 168 169 170 171 172 173 174 175 176 177 178 179 180 181 182 183 184 185 186 187 188 189 190 191 192 193 194 195 196 197 198 199 200 201 202 203 204 205 206 207 208 209 210 211 212 213 214 215 216 217 218 219 220 221 222 223 224 225 226 227 228 229 230 231 232 233 234 235 236 237 238 239 240 241 242 243 244 245 246 247 248 249 250 251 252 253 254 255 256 257 258 259 260 261 262 263 264 265 266 267 268 269 270 271 272 273 274 275 276 277 278 279 280 281 282 283 284 285 286 287 288 289 290 291 292 293 294 295 296 297 298 299 300 301 302 303 304 305 306 307 308 309 310 311 312 313 314 315 316 317 318 319 320 321 322 323 324 325 326 327 328 329 330 331 332 333 334 335 336 337 338 339 340 341 342 343 344 345 346 347 348 349 350 351 352 353 354 355 356 357 358 359 360 361 362 363 364 365 366 367 368 369 370 371 372 373 374 375 376 377 378 379 380 381 382 383 384 385 386 387 388 389 390 391 392 393 394 395 396 397 398 399 400 401 402 403 404 405 406 407 408 409 410 411 412 413 414 415 416 417 418 419 420 421 422 423 424 425 426 427 428 429 430 431 432 433 434 435 436 437 438 439 440 441 442 443 444 445 446 447 448 449 450 451 452 453 454 455 456 457 458 459 460 461 462 463 464 465 466 467 468 469 470 471 472 473 474 475 476 477 478 479 480 481 482 483 484 485 486 487 488 489 490 491 492 493 494 495 496 497 498 499 500 501 502 503 504 505 506 507 508 509 510 511 512 513 514 515 516 517 518 519 520 521 522 523 524 525 526 527 528 529 530 531 532 533 534 535 536 537 538 539 540 541 542 543 544 545 546 547 548 549 550 551 552 553 554 555 556 557 558 559 560 561 562 563 564 565 566 567 568 569 570 571 572 573 574 575 576 577 578 579 580 581 582 583 584 585 586 587 588 589 590 591 592 593 594 595 596 597 598 599 600 601 602 603 604 605 606 607 608 609 610 611 612 613 614 615 616 617 618 619 620 621 622 623 624 625 626 627 628 629 630 631 632 633 634 635 636 637 638 639 640 641 642 643 644 645 646 647 648 649 650 651 652 653 654 655 656 657 658 659 660 661 662 663 664 665 666 667 668 669 670 671 672 673 674 675 676 677 678 679 680 681 682 683 684 685 686 687 688 689 690 691 692 693 694 695 696 697 698 699 700 701 702 703 704 705 706 707 708 709 710 711 712 713 714 715 716 717 718 719 720 721 722 723 724 725 726 727 728 729 730 731 732 733 734 735 736 737 738 739 740 741 742 743 744 745 746 747 748 749 750 751 752 753 754 755 756 757 758 759 760 761 762 763 764 765 766 767 768 769 770 771 772 773 774 775 776 777 778 779 780 781 782 783 784 785 786 787 788 789 790 791 792 793 794 795 796 797 798 799 800 801 802 803 804 805 806 807 808 809 810 811 812 813 814 815 816 817 818 819 820 821 822 823 824 825 826 827 828 829 830 831 832 833 834 835 836 837 838 839 840 841 842 843 844 845 846 847 848 849 850 851 852 853 854 855 856 857 858 859 860 861 862 863 864 865 866 867 868 869 870 871 872 873 874 875 876 877 878 879 880 881 882 883 884 885 886 887 888 889 890 891 892 893 894 895 896 897 898 899 900 901 902 903 904 905 906 907 908 909 910 911 912 913 914 915 916 917 918 919 920 921 922 923 924 925 926 927 928 929 930 931 932 933 934 935 936 937 938 939 940 941 942 943 944 945 946 947 948 949 950 951 952 953 954 955 956 957 958 959 960 961 962 963 964 965 966 967 968 969 970 971 972 973 974 975 976 977 978 979 980 981 982 983 984 985 986 987 988 989 990 991 992 993 994 995 996 997 998 999 1000 1001 1002 1003 1004 1005 1006 1007 1008 1009 1010 1011 1012 1013 1014 1015 1016 1017 1018 1019 1020 1021 1022 1023 1024 1025 1026 1027 1028 1029 1030 1031 1032 1033 1034 1035 1036 1037 1038 1039 1040 1041 1042 1043 1044 1045 1046 1047 1048 1049 1050 1051 1052 1053 1054 1055 1056 1057 1058 1059 1060 1061 1062 1063 1064 1065 1066 1067 1068 1069 1070 1071 1072 1073 1074 1075 1076 1077 1078 1079 1080 1081 1082 1083 1084 1085 1086 1087 1088 1089 1090 1091 1092 1093 1094 1095 1096 1097 1098 1099 1100 1101 1102 1103 1104 1105 1106 1107 1108 1109 1110 1111 1112 1113 1114 1115 1116 1117 1118 1119 1120 1121 1122 1123 1124 1125 1126 1127 1128 1129 1130 1131 1132 1133 1134 1135 1136 1137 1138 1139 1140 1141 1142 1143 1144 1145 1146 1147 1148 1149 1150 1151 1152 1153 1154 1155 1156 1157 1158 1159 1160 1161 1162 1163 1164 1165 1166 1167 1168 1169 1170 1171 1172 1173 1174 1175 1176 1177 1178 1179 1180 1181 1182 1183 1184 1185 1186 1187 1188 1189 1190 1191 1192 1193 1194 1195 1196 1197 1198 1199 1200 1201 1202 1203 1204 1205 1206 1207 1208 1209 1210 1211 1212 1213 1214 1215 1216 1217 1218 1219 1220 1221 1222 1223 1224 1225 1226 1227 1228 1229 1230 1231 1232 1233 1234 1235 1236 1237 1238 1239 1240 1241 1242 1243 1244 1245 1246 1247 1248 1249 1250 1251 1252 1253 1254 1255 1256 1257 1258 1259 1260 1261 1262 1263 1264 1265 1266 1267 1268 1269 1270 1271 1272 1273 1274 1275 1276 1277 1278 1279 1280 1281 1282 1283 1284 1285 1286 1287 1288 1289 1290 1291 1292 1293 1294 1295 1296 1297 1298 1299 1300 1301 1302 1303 1304 1305 | // SPDX-License-Identifier: GPL-2.0-or-later /* * V4L2 controls framework uAPI implementation: * * Copyright (C) 2010-2021 Hans Verkuil <hverkuil-cisco@xs4all.nl> */ #define pr_fmt(fmt) "v4l2-ctrls: " fmt #include <linux/export.h> #include <linux/mm.h> #include <linux/slab.h> #include <media/v4l2-ctrls.h> #include <media/v4l2-dev.h> #include <media/v4l2-device.h> #include <media/v4l2-event.h> #include <media/v4l2-ioctl.h> #include "v4l2-ctrls-priv.h" /* Internal temporary helper struct, one for each v4l2_ext_control */ struct v4l2_ctrl_helper { /* Pointer to the control reference of the master control */ struct v4l2_ctrl_ref *mref; /* The control ref corresponding to the v4l2_ext_control ID field. */ struct v4l2_ctrl_ref *ref; /* * v4l2_ext_control index of the next control belonging to the * same cluster, or 0 if there isn't any. */ u32 next; }; /* * Helper functions to copy control payload data from kernel space to * user space and vice versa. */ /* Helper function: copy the given control value back to the caller */ static int ptr_to_user(struct v4l2_ext_control *c, struct v4l2_ctrl *ctrl, union v4l2_ctrl_ptr ptr) { u32 len; if (ctrl->is_ptr && !ctrl->is_string) return copy_to_user(c->ptr, ptr.p_const, c->size) ? -EFAULT : 0; switch (ctrl->type) { case V4L2_CTRL_TYPE_STRING: len = strlen(ptr.p_char); if (c->size < len + 1) { c->size = ctrl->elem_size; return -ENOSPC; } return copy_to_user(c->string, ptr.p_char, len + 1) ? -EFAULT : 0; case V4L2_CTRL_TYPE_INTEGER64: c->value64 = *ptr.p_s64; break; default: c->value = *ptr.p_s32; break; } return 0; } /* Helper function: copy the current control value back to the caller */ static int cur_to_user(struct v4l2_ext_control *c, struct v4l2_ctrl *ctrl) { return ptr_to_user(c, ctrl, ctrl->p_cur); } /* Helper function: copy the new control value back to the caller */ static int new_to_user(struct v4l2_ext_control *c, struct v4l2_ctrl *ctrl) { return ptr_to_user(c, ctrl, ctrl->p_new); } /* Helper function: copy the request value back to the caller */ static int req_to_user(struct v4l2_ext_control *c, struct v4l2_ctrl_ref *ref) { return ptr_to_user(c, ref->ctrl, ref->p_req); } /* Helper function: copy the initial control value back to the caller */ static int def_to_user(struct v4l2_ext_control *c, struct v4l2_ctrl *ctrl) { ctrl->type_ops->init(ctrl, 0, ctrl->p_new); return ptr_to_user(c, ctrl, ctrl->p_new); } /* Helper function: copy the caller-provider value as the new control value */ static int user_to_new(struct v4l2_ext_control *c, struct v4l2_ctrl *ctrl) { int ret; u32 size; ctrl->is_new = 0; if (ctrl->is_dyn_array && c->size > ctrl->p_array_alloc_elems * ctrl->elem_size) { void *old = ctrl->p_array; void *tmp = kvzalloc(2 * c->size, GFP_KERNEL); if (!tmp) return -ENOMEM; memcpy(tmp, ctrl->p_new.p, ctrl->elems * ctrl->elem_size); memcpy(tmp + c->size, ctrl->p_cur.p, ctrl->elems * ctrl->elem_size); ctrl->p_new.p = tmp; ctrl->p_cur.p = tmp + c->size; ctrl->p_array = tmp; ctrl->p_array_alloc_elems = c->size / ctrl->elem_size; kvfree(old); } if (ctrl->is_ptr && !ctrl->is_string) { unsigned int elems = c->size / ctrl->elem_size; if (copy_from_user(ctrl->p_new.p, c->ptr, c->size)) return -EFAULT; ctrl->is_new = 1; if (ctrl->is_dyn_array) ctrl->new_elems = elems; else if (ctrl->is_array) ctrl->type_ops->init(ctrl, elems, ctrl->p_new); return 0; } switch (ctrl->type) { case V4L2_CTRL_TYPE_INTEGER64: *ctrl->p_new.p_s64 = c->value64; break; case V4L2_CTRL_TYPE_STRING: size = c->size; if (size == 0) return -ERANGE; if (size > ctrl->maximum + 1) size = ctrl->maximum + 1; ret = copy_from_user(ctrl->p_new.p_char, c->string, size) ? -EFAULT : 0; if (!ret) { char last = ctrl->p_new.p_char[size - 1]; ctrl->p_new.p_char[size - 1] = 0; /* * If the string was longer than ctrl->maximum, * then return an error. */ if (strlen(ctrl->p_new.p_char) == ctrl->maximum && last) return -ERANGE; ctrl->is_new = 1; } return ret; default: *ctrl->p_new.p_s32 = c->value; break; } ctrl->is_new = 1; return 0; } /* * VIDIOC_G/TRY/S_EXT_CTRLS implementation */ /* * Some general notes on the atomic requirements of VIDIOC_G/TRY/S_EXT_CTRLS: * * It is not a fully atomic operation, just best-effort only. After all, if * multiple controls have to be set through multiple i2c writes (for example) * then some initial writes may succeed while others fail. Thus leaving the * system in an inconsistent state. The question is how much effort you are * willing to spend on trying to make something atomic that really isn't. * * From the point of view of an application the main requirement is that * when you call VIDIOC_S_EXT_CTRLS and some values are invalid then an * error should be returned without actually affecting any controls. * * If all the values are correct, then it is acceptable to just give up * in case of low-level errors. * * It is important though that the application can tell when only a partial * configuration was done. The way we do that is through the error_idx field * of struct v4l2_ext_controls: if that is equal to the count field then no * controls were affected. Otherwise all controls before that index were * successful in performing their 'get' or 'set' operation, the control at * the given index failed, and you don't know what happened with the controls * after the failed one. Since if they were part of a control cluster they * could have been successfully processed (if a cluster member was encountered * at index < error_idx), they could have failed (if a cluster member was at * error_idx), or they may not have been processed yet (if the first cluster * member appeared after error_idx). * * It is all fairly theoretical, though. In practice all you can do is to * bail out. If error_idx == count, then it is an application bug. If * error_idx < count then it is only an application bug if the error code was * EBUSY. That usually means that something started streaming just when you * tried to set the controls. In all other cases it is a driver/hardware * problem and all you can do is to retry or bail out. * * Note that these rules do not apply to VIDIOC_TRY_EXT_CTRLS: since that * never modifies controls the error_idx is just set to whatever control * has an invalid value. */ /* * Prepare for the extended g/s/try functions. * Find the controls in the control array and do some basic checks. */ static int prepare_ext_ctrls(struct v4l2_ctrl_handler *hdl, struct v4l2_ext_controls *cs, struct v4l2_ctrl_helper *helpers, struct video_device *vdev, bool get) { struct v4l2_ctrl_helper *h; bool have_clusters = false; u32 i; for (i = 0, h = helpers; i < cs->count; i++, h++) { struct v4l2_ext_control *c = &cs->controls[i]; struct v4l2_ctrl_ref *ref; struct v4l2_ctrl *ctrl; u32 id = c->id & V4L2_CTRL_ID_MASK; cs->error_idx = i; if (cs->which && cs->which != V4L2_CTRL_WHICH_DEF_VAL && cs->which != V4L2_CTRL_WHICH_REQUEST_VAL && V4L2_CTRL_ID2WHICH(id) != cs->which) { dprintk(vdev, "invalid which 0x%x or control id 0x%x\n", cs->which, id); return -EINVAL; } /* * Old-style private controls are not allowed for * extended controls. */ if (id >= V4L2_CID_PRIVATE_BASE) { dprintk(vdev, "old-style private controls not allowed\n"); return -EINVAL; } ref = find_ref_lock(hdl, id); if (!ref) { dprintk(vdev, "cannot find control id 0x%x\n", id); return -EINVAL; } h->ref = ref; ctrl = ref->ctrl; if (ctrl->flags & V4L2_CTRL_FLAG_DISABLED) { dprintk(vdev, "control id 0x%x is disabled\n", id); return -EINVAL; } if (ctrl->cluster[0]->ncontrols > 1) have_clusters = true; if (ctrl->cluster[0] != ctrl) ref = find_ref_lock(hdl, ctrl->cluster[0]->id); if (ctrl->is_dyn_array) { unsigned int max_size = ctrl->dims[0] * ctrl->elem_size; unsigned int tot_size = ctrl->elem_size; if (cs->which == V4L2_CTRL_WHICH_REQUEST_VAL) tot_size *= ref->p_req_elems; else tot_size *= ctrl->elems; c->size = ctrl->elem_size * (c->size / ctrl->elem_size); if (get) { if (c->size < tot_size) { c->size = tot_size; return -ENOSPC; } c->size = tot_size; } else { if (c->size > max_size) { c->size = max_size; return -ENOSPC; } if (!c->size) return -EFAULT; } } else if (ctrl->is_ptr && !ctrl->is_string) { unsigned int tot_size = ctrl->elems * ctrl->elem_size; if (c->size < tot_size) { /* * In the get case the application first * queries to obtain the size of the control. */ if (get) { c->size = tot_size; return -ENOSPC; } dprintk(vdev, "pointer control id 0x%x size too small, %d bytes but %d bytes needed\n", id, c->size, tot_size); return -EFAULT; } c->size = tot_size; } /* Store the ref to the master control of the cluster */ h->mref = ref; /* * Initially set next to 0, meaning that there is no other * control in this helper array belonging to the same * cluster. */ h->next = 0; } /* * We are done if there were no controls that belong to a multi- * control cluster. */ if (!have_clusters) return 0; /* * The code below figures out in O(n) time which controls in the list * belong to the same cluster. */ /* This has to be done with the handler lock taken. */ mutex_lock(hdl->lock); /* First zero the helper field in the master control references */ for (i = 0; i < cs->count; i++) helpers[i].mref->helper = NULL; for (i = 0, h = helpers; i < cs->count; i++, h++) { struct v4l2_ctrl_ref *mref = h->mref; /* * If the mref->helper is set, then it points to an earlier * helper that belongs to the same cluster. */ if (mref->helper) { /* * Set the next field of mref->helper to the current * index: this means that the earlier helper now * points to the next helper in the same cluster. */ mref->helper->next = i; /* * mref should be set only for the first helper in the * cluster, clear the others. */ h->mref = NULL; } /* Point the mref helper to the current helper struct. */ mref->helper = h; } mutex_unlock(hdl->lock); return 0; } /* * Handles the corner case where cs->count == 0. It checks whether the * specified control class exists. If that class ID is 0, then it checks * whether there are any controls at all. */ static int class_check(struct v4l2_ctrl_handler *hdl, u32 which) { if (which == 0 || which == V4L2_CTRL_WHICH_DEF_VAL || which == V4L2_CTRL_WHICH_REQUEST_VAL) return 0; return find_ref_lock(hdl, which | 1) ? 0 : -EINVAL; } /* * Get extended controls. Allocates the helpers array if needed. * * Note that v4l2_g_ext_ctrls_common() with 'which' set to * V4L2_CTRL_WHICH_REQUEST_VAL is only called if the request was * completed, and in that case p_req_valid is true for all controls. */ int v4l2_g_ext_ctrls_common(struct v4l2_ctrl_handler *hdl, struct v4l2_ext_controls *cs, struct video_device *vdev) { struct v4l2_ctrl_helper helper[4]; struct v4l2_ctrl_helper *helpers = helper; int ret; int i, j; bool is_default, is_request; is_default = (cs->which == V4L2_CTRL_WHICH_DEF_VAL); is_request = (cs->which == V4L2_CTRL_WHICH_REQUEST_VAL); cs->error_idx = cs->count; cs->which = V4L2_CTRL_ID2WHICH(cs->which); if (!hdl) return -EINVAL; if (cs->count == 0) return class_check(hdl, cs->which); if (cs->count > ARRAY_SIZE(helper)) { helpers = kvmalloc_array(cs->count, sizeof(helper[0]), GFP_KERNEL); if (!helpers) return -ENOMEM; } ret = prepare_ext_ctrls(hdl, cs, helpers, vdev, true); cs->error_idx = cs->count; for (i = 0; !ret && i < cs->count; i++) if (helpers[i].ref->ctrl->flags & V4L2_CTRL_FLAG_WRITE_ONLY) ret = -EACCES; for (i = 0; !ret && i < cs->count; i++) { struct v4l2_ctrl *master; bool is_volatile = false; u32 idx = i; if (!helpers[i].mref) continue; master = helpers[i].mref->ctrl; cs->error_idx = i; v4l2_ctrl_lock(master); /* * g_volatile_ctrl will update the new control values. * This makes no sense for V4L2_CTRL_WHICH_DEF_VAL and * V4L2_CTRL_WHICH_REQUEST_VAL. In the case of requests * it is v4l2_ctrl_request_complete() that copies the * volatile controls at the time of request completion * to the request, so you don't want to do that again. */ if (!is_default && !is_request && ((master->flags & V4L2_CTRL_FLAG_VOLATILE) || (master->has_volatiles && !is_cur_manual(master)))) { for (j = 0; j < master->ncontrols; j++) cur_to_new(master->cluster[j]); ret = call_op(master, g_volatile_ctrl); is_volatile = true; } if (ret) { v4l2_ctrl_unlock(master); break; } /* * Copy the default value (if is_default is true), the * request value (if is_request is true and p_req is valid), * the new volatile value (if is_volatile is true) or the * current value. */ do { struct v4l2_ctrl_ref *ref = helpers[idx].ref; if (is_default) ret = def_to_user(cs->controls + idx, ref->ctrl); else if (is_request && ref->p_req_array_enomem) ret = -ENOMEM; else if (is_request && ref->p_req_valid) ret = req_to_user(cs->controls + idx, ref); else if (is_volatile) ret = new_to_user(cs->controls + idx, ref->ctrl); else ret = cur_to_user(cs->controls + idx, ref->ctrl); idx = helpers[idx].next; } while (!ret && idx); v4l2_ctrl_unlock(master); } if (cs->count > ARRAY_SIZE(helper)) kvfree(helpers); return ret; } int v4l2_g_ext_ctrls(struct v4l2_ctrl_handler *hdl, struct video_device *vdev, struct media_device *mdev, struct v4l2_ext_controls *cs) { if (cs->which == V4L2_CTRL_WHICH_REQUEST_VAL) return v4l2_g_ext_ctrls_request(hdl, vdev, mdev, cs); return v4l2_g_ext_ctrls_common(hdl, cs, vdev); } EXPORT_SYMBOL(v4l2_g_ext_ctrls); /* Validate a new control */ static int validate_new(const struct v4l2_ctrl *ctrl, union v4l2_ctrl_ptr p_new) { return ctrl->type_ops->validate(ctrl, p_new); } /* Validate controls. */ static int validate_ctrls(struct v4l2_ext_controls *cs, struct v4l2_ctrl_helper *helpers, struct video_device *vdev, bool set) { unsigned int i; int ret = 0; cs->error_idx = cs->count; for (i = 0; i < cs->count; i++) { struct v4l2_ctrl *ctrl = helpers[i].ref->ctrl; union v4l2_ctrl_ptr p_new; cs->error_idx = i; if (ctrl->flags & V4L2_CTRL_FLAG_READ_ONLY) { dprintk(vdev, "control id 0x%x is read-only\n", ctrl->id); return -EACCES; } /* * This test is also done in try_set_control_cluster() which * is called in atomic context, so that has the final say, * but it makes sense to do an up-front check as well. Once * an error occurs in try_set_control_cluster() some other * controls may have been set already and we want to do a * best-effort to avoid that. */ if (set && (ctrl->flags & V4L2_CTRL_FLAG_GRABBED)) { dprintk(vdev, "control id 0x%x is grabbed, cannot set\n", ctrl->id); return -EBUSY; } /* * Skip validation for now if the payload needs to be copied * from userspace into kernelspace. We'll validate those later. */ if (ctrl->is_ptr) continue; if (ctrl->type == V4L2_CTRL_TYPE_INTEGER64) p_new.p_s64 = &cs->controls[i].value64; else p_new.p_s32 = &cs->controls[i].value; ret = validate_new(ctrl, p_new); if (ret) return ret; } return 0; } /* Try or try-and-set controls */ int try_set_ext_ctrls_common(struct v4l2_fh *fh, struct v4l2_ctrl_handler *hdl, struct v4l2_ext_controls *cs, struct video_device *vdev, bool set) { struct v4l2_ctrl_helper helper[4]; struct v4l2_ctrl_helper *helpers = helper; unsigned int i, j; int ret; cs->error_idx = cs->count; /* Default value cannot be changed */ if (cs->which == V4L2_CTRL_WHICH_DEF_VAL) { dprintk(vdev, "%s: cannot change default value\n", video_device_node_name(vdev)); return -EINVAL; } cs->which = V4L2_CTRL_ID2WHICH(cs->which); if (!hdl) { dprintk(vdev, "%s: invalid null control handler\n", video_device_node_name(vdev)); return -EINVAL; } if (cs->count == 0) return class_check(hdl, cs->which); if (cs->count > ARRAY_SIZE(helper)) { helpers = kvmalloc_array(cs->count, sizeof(helper[0]), GFP_KERNEL); if (!helpers) return -ENOMEM; } ret = prepare_ext_ctrls(hdl, cs, helpers, vdev, false); if (!ret) ret = validate_ctrls(cs, helpers, vdev, set); if (ret && set) cs->error_idx = cs->count; for (i = 0; !ret && i < cs->count; i++) { struct v4l2_ctrl *master; u32 idx = i; if (!helpers[i].mref) continue; cs->error_idx = i; master = helpers[i].mref->ctrl; v4l2_ctrl_lock(master); /* Reset the 'is_new' flags of the cluster */ for (j = 0; j < master->ncontrols; j++) if (master->cluster[j]) master->cluster[j]->is_new = 0; /* * For volatile autoclusters that are currently in auto mode * we need to discover if it will be set to manual mode. * If so, then we have to copy the current volatile values * first since those will become the new manual values (which * may be overwritten by explicit new values from this set * of controls). */ if (master->is_auto && master->has_volatiles && !is_cur_manual(master)) { /* Pick an initial non-manual value */ s32 new_auto_val = master->manual_mode_value + 1; u32 tmp_idx = idx; do { /* * Check if the auto control is part of the * list, and remember the new value. */ if (helpers[tmp_idx].ref->ctrl == master) new_auto_val = cs->controls[tmp_idx].value; tmp_idx = helpers[tmp_idx].next; } while (tmp_idx); /* * If the new value == the manual value, then copy * the current volatile values. */ if (new_auto_val == master->manual_mode_value) update_from_auto_cluster(master); } /* * Copy the new caller-supplied control values. * user_to_new() sets 'is_new' to 1. */ do { struct v4l2_ctrl *ctrl = helpers[idx].ref->ctrl; ret = user_to_new(cs->controls + idx, ctrl); if (!ret && ctrl->is_ptr) { ret = validate_new(ctrl, ctrl->p_new); if (ret) dprintk(vdev, "failed to validate control %s (%d)\n", v4l2_ctrl_get_name(ctrl->id), ret); } idx = helpers[idx].next; } while (!ret && idx); if (!ret) ret = try_or_set_cluster(fh, master, !hdl->req_obj.req && set, 0); if (!ret && hdl->req_obj.req && set) { for (j = 0; j < master->ncontrols; j++) { struct v4l2_ctrl_ref *ref = find_ref(hdl, master->cluster[j]->id); new_to_req(ref); } } /* Copy the new values back to userspace. */ if (!ret) { idx = i; do { ret = new_to_user(cs->controls + idx, helpers[idx].ref->ctrl); idx = helpers[idx].next; } while (!ret && idx); } v4l2_ctrl_unlock(master); } if (cs->count > ARRAY_SIZE(helper)) kvfree(helpers); return ret; } static int try_set_ext_ctrls(struct v4l2_fh *fh, struct v4l2_ctrl_handler *hdl, struct video_device *vdev, struct media_device *mdev, struct v4l2_ext_controls *cs, bool set) { int ret; if (cs->which == V4L2_CTRL_WHICH_REQUEST_VAL) return try_set_ext_ctrls_request(fh, hdl, vdev, mdev, cs, set); ret = try_set_ext_ctrls_common(fh, hdl, cs, vdev, set); if (ret) dprintk(vdev, "%s: try_set_ext_ctrls_common failed (%d)\n", video_device_node_name(vdev), ret); return ret; } int v4l2_try_ext_ctrls(struct v4l2_ctrl_handler *hdl, struct video_device *vdev, struct media_device *mdev, struct v4l2_ext_controls *cs) { return try_set_ext_ctrls(NULL, hdl, vdev, mdev, cs, false); } EXPORT_SYMBOL(v4l2_try_ext_ctrls); int v4l2_s_ext_ctrls(struct v4l2_fh *fh, struct v4l2_ctrl_handler *hdl, struct video_device *vdev, struct media_device *mdev, struct v4l2_ext_controls *cs) { return try_set_ext_ctrls(fh, hdl, vdev, mdev, cs, true); } EXPORT_SYMBOL(v4l2_s_ext_ctrls); /* * VIDIOC_G/S_CTRL implementation */ /* Helper function to get a single control */ static int get_ctrl(struct v4l2_ctrl *ctrl, struct v4l2_ext_control *c) { struct v4l2_ctrl *master = ctrl->cluster[0]; int ret = 0; int i; /* Compound controls are not supported. The new_to_user() and * cur_to_user() calls below would need to be modified not to access * userspace memory when called from get_ctrl(). */ if (!ctrl->is_int && ctrl->type != V4L2_CTRL_TYPE_INTEGER64) return -EINVAL; if (ctrl->flags & V4L2_CTRL_FLAG_WRITE_ONLY) return -EACCES; v4l2_ctrl_lock(master); /* g_volatile_ctrl will update the current control values */ if (ctrl->flags & V4L2_CTRL_FLAG_VOLATILE) { for (i = 0; i < master->ncontrols; i++) cur_to_new(master->cluster[i]); ret = call_op(master, g_volatile_ctrl); new_to_user(c, ctrl); } else { cur_to_user(c, ctrl); } v4l2_ctrl_unlock(master); return ret; } int v4l2_g_ctrl(struct v4l2_ctrl_handler *hdl, struct v4l2_control *control) { struct v4l2_ctrl *ctrl = v4l2_ctrl_find(hdl, control->id); struct v4l2_ext_control c; int ret; if (!ctrl || !ctrl->is_int) return -EINVAL; ret = get_ctrl(ctrl, &c); control->value = c.value; return ret; } EXPORT_SYMBOL(v4l2_g_ctrl); /* Helper function for VIDIOC_S_CTRL compatibility */ static int set_ctrl(struct v4l2_fh *fh, struct v4l2_ctrl *ctrl, u32 ch_flags) { struct v4l2_ctrl *master = ctrl->cluster[0]; int ret; int i; /* Reset the 'is_new' flags of the cluster */ for (i = 0; i < master->ncontrols; i++) if (master->cluster[i]) master->cluster[i]->is_new = 0; ret = validate_new(ctrl, ctrl->p_new); if (ret) return ret; /* * For autoclusters with volatiles that are switched from auto to * manual mode we have to update the current volatile values since * those will become the initial manual values after such a switch. */ if (master->is_auto && master->has_volatiles && ctrl == master && !is_cur_manual(master) && ctrl->val == master->manual_mode_value) update_from_auto_cluster(master); ctrl->is_new = 1; return try_or_set_cluster(fh, master, true, ch_flags); } /* Helper function for VIDIOC_S_CTRL compatibility */ static int set_ctrl_lock(struct v4l2_fh *fh, struct v4l2_ctrl *ctrl, struct v4l2_ext_control *c) { int ret; v4l2_ctrl_lock(ctrl); user_to_new(c, ctrl); ret = set_ctrl(fh, ctrl, 0); if (!ret) cur_to_user(c, ctrl); v4l2_ctrl_unlock(ctrl); return ret; } int v4l2_s_ctrl(struct v4l2_fh *fh, struct v4l2_ctrl_handler *hdl, struct v4l2_control *control) { struct v4l2_ctrl *ctrl = v4l2_ctrl_find(hdl, control->id); struct v4l2_ext_control c = { control->id }; int ret; if (!ctrl || !ctrl->is_int) return -EINVAL; if (ctrl->flags & V4L2_CTRL_FLAG_READ_ONLY) return -EACCES; c.value = control->value; ret = set_ctrl_lock(fh, ctrl, &c); control->value = c.value; return ret; } EXPORT_SYMBOL(v4l2_s_ctrl); /* * Helper functions for drivers to get/set controls. */ s32 v4l2_ctrl_g_ctrl(struct v4l2_ctrl *ctrl) { struct v4l2_ext_control c; /* It's a driver bug if this happens. */ if (WARN_ON(!ctrl->is_int)) return 0; c.value = 0; get_ctrl(ctrl, &c); return c.value; } EXPORT_SYMBOL(v4l2_ctrl_g_ctrl); s64 v4l2_ctrl_g_ctrl_int64(struct v4l2_ctrl *ctrl) { struct v4l2_ext_control c; /* It's a driver bug if this happens. */ if (WARN_ON(ctrl->is_ptr || ctrl->type != V4L2_CTRL_TYPE_INTEGER64)) return 0; c.value64 = 0; get_ctrl(ctrl, &c); return c.value64; } EXPORT_SYMBOL(v4l2_ctrl_g_ctrl_int64); int __v4l2_ctrl_s_ctrl(struct v4l2_ctrl *ctrl, s32 val) { lockdep_assert_held(ctrl->handler->lock); /* It's a driver bug if this happens. */ if (WARN_ON(!ctrl->is_int)) return -EINVAL; ctrl->val = val; return set_ctrl(NULL, ctrl, 0); } EXPORT_SYMBOL(__v4l2_ctrl_s_ctrl); int __v4l2_ctrl_s_ctrl_int64(struct v4l2_ctrl *ctrl, s64 val) { lockdep_assert_held(ctrl->handler->lock); /* It's a driver bug if this happens. */ if (WARN_ON(ctrl->is_ptr || ctrl->type != V4L2_CTRL_TYPE_INTEGER64)) return -EINVAL; *ctrl->p_new.p_s64 = val; return set_ctrl(NULL, ctrl, 0); } EXPORT_SYMBOL(__v4l2_ctrl_s_ctrl_int64); int __v4l2_ctrl_s_ctrl_string(struct v4l2_ctrl *ctrl, const char *s) { lockdep_assert_held(ctrl->handler->lock); /* It's a driver bug if this happens. */ if (WARN_ON(ctrl->type != V4L2_CTRL_TYPE_STRING)) return -EINVAL; strscpy(ctrl->p_new.p_char, s, ctrl->maximum + 1); return set_ctrl(NULL, ctrl, 0); } EXPORT_SYMBOL(__v4l2_ctrl_s_ctrl_string); int __v4l2_ctrl_s_ctrl_compound(struct v4l2_ctrl *ctrl, enum v4l2_ctrl_type type, const void *p) { lockdep_assert_held(ctrl->handler->lock); /* It's a driver bug if this happens. */ if (WARN_ON(ctrl->type != type)) return -EINVAL; /* Setting dynamic arrays is not (yet?) supported. */ if (WARN_ON(ctrl->is_dyn_array)) return -EINVAL; memcpy(ctrl->p_new.p, p, ctrl->elems * ctrl->elem_size); return set_ctrl(NULL, ctrl, 0); } EXPORT_SYMBOL(__v4l2_ctrl_s_ctrl_compound); /* * Modify the range of a control. */ int __v4l2_ctrl_modify_range(struct v4l2_ctrl *ctrl, s64 min, s64 max, u64 step, s64 def) { bool value_changed; bool range_changed = false; int ret; lockdep_assert_held(ctrl->handler->lock); switch (ctrl->type) { case V4L2_CTRL_TYPE_INTEGER: case V4L2_CTRL_TYPE_INTEGER64: case V4L2_CTRL_TYPE_BOOLEAN: case V4L2_CTRL_TYPE_MENU: case V4L2_CTRL_TYPE_INTEGER_MENU: case V4L2_CTRL_TYPE_BITMASK: case V4L2_CTRL_TYPE_U8: case V4L2_CTRL_TYPE_U16: case V4L2_CTRL_TYPE_U32: if (ctrl->is_array) return -EINVAL; ret = check_range(ctrl->type, min, max, step, def); if (ret) return ret; break; default: return -EINVAL; } if (ctrl->minimum != min || ctrl->maximum != max || ctrl->step != step || ctrl->default_value != def) { range_changed = true; ctrl->minimum = min; ctrl->maximum = max; ctrl->step = step; ctrl->default_value = def; } cur_to_new(ctrl); if (validate_new(ctrl, ctrl->p_new)) { if (ctrl->type == V4L2_CTRL_TYPE_INTEGER64) *ctrl->p_new.p_s64 = def; else *ctrl->p_new.p_s32 = def; } if (ctrl->type == V4L2_CTRL_TYPE_INTEGER64) value_changed = *ctrl->p_new.p_s64 != *ctrl->p_cur.p_s64; else value_changed = *ctrl->p_new.p_s32 != *ctrl->p_cur.p_s32; if (value_changed) ret = set_ctrl(NULL, ctrl, V4L2_EVENT_CTRL_CH_RANGE); else if (range_changed) send_event(NULL, ctrl, V4L2_EVENT_CTRL_CH_RANGE); return ret; } EXPORT_SYMBOL(__v4l2_ctrl_modify_range); int __v4l2_ctrl_modify_dimensions(struct v4l2_ctrl *ctrl, u32 dims[V4L2_CTRL_MAX_DIMS]) { unsigned int elems = 1; unsigned int i; void *p_array; lockdep_assert_held(ctrl->handler->lock); if (!ctrl->is_array || ctrl->is_dyn_array) return -EINVAL; for (i = 0; i < ctrl->nr_of_dims; i++) elems *= dims[i]; if (elems == 0) return -EINVAL; p_array = kvzalloc(2 * elems * ctrl->elem_size, GFP_KERNEL); if (!p_array) return -ENOMEM; kvfree(ctrl->p_array); ctrl->p_array_alloc_elems = elems; ctrl->elems = elems; ctrl->new_elems = elems; ctrl->p_array = p_array; ctrl->p_new.p = p_array; ctrl->p_cur.p = p_array + elems * ctrl->elem_size; for (i = 0; i < ctrl->nr_of_dims; i++) ctrl->dims[i] = dims[i]; ctrl->type_ops->init(ctrl, 0, ctrl->p_cur); cur_to_new(ctrl); send_event(NULL, ctrl, V4L2_EVENT_CTRL_CH_VALUE | V4L2_EVENT_CTRL_CH_DIMENSIONS); return 0; } EXPORT_SYMBOL(__v4l2_ctrl_modify_dimensions); /* Implement VIDIOC_QUERY_EXT_CTRL */ int v4l2_query_ext_ctrl(struct v4l2_ctrl_handler *hdl, struct v4l2_query_ext_ctrl *qc) { const unsigned int next_flags = V4L2_CTRL_FLAG_NEXT_CTRL | V4L2_CTRL_FLAG_NEXT_COMPOUND; u32 id = qc->id & V4L2_CTRL_ID_MASK; struct v4l2_ctrl_ref *ref; struct v4l2_ctrl *ctrl; if (!hdl) return -EINVAL; mutex_lock(hdl->lock); /* Try to find it */ ref = find_ref(hdl, id); if ((qc->id & next_flags) && !list_empty(&hdl->ctrl_refs)) { bool is_compound; /* Match any control that is not hidden */ unsigned int mask = 1; bool match = false; if ((qc->id & next_flags) == V4L2_CTRL_FLAG_NEXT_COMPOUND) { /* Match any hidden control */ match = true; } else if ((qc->id & next_flags) == next_flags) { /* Match any control, compound or not */ mask = 0; } /* Find the next control with ID > qc->id */ /* Did we reach the end of the control list? */ if (id >= node2id(hdl->ctrl_refs.prev)) { ref = NULL; /* Yes, so there is no next control */ } else if (ref) { /* * We found a control with the given ID, so just get * the next valid one in the list. */ list_for_each_entry_continue(ref, &hdl->ctrl_refs, node) { is_compound = ref->ctrl->is_array || ref->ctrl->type >= V4L2_CTRL_COMPOUND_TYPES; if (id < ref->ctrl->id && (is_compound & mask) == match) break; } if (&ref->node == &hdl->ctrl_refs) ref = NULL; } else { /* * No control with the given ID exists, so start * searching for the next largest ID. We know there * is one, otherwise the first 'if' above would have * been true. */ list_for_each_entry(ref, &hdl->ctrl_refs, node) { is_compound = ref->ctrl->is_array || ref->ctrl->type >= V4L2_CTRL_COMPOUND_TYPES; if (id < ref->ctrl->id && (is_compound & mask) == match) break; } if (&ref->node == &hdl->ctrl_refs) ref = NULL; } } mutex_unlock(hdl->lock); if (!ref) return -EINVAL; ctrl = ref->ctrl; memset(qc, 0, sizeof(*qc)); if (id >= V4L2_CID_PRIVATE_BASE) qc->id = id; else qc->id = ctrl->id; strscpy(qc->name, ctrl->name, sizeof(qc->name)); qc->flags = user_flags(ctrl); qc->type = ctrl->type; qc->elem_size = ctrl->elem_size; qc->elems = ctrl->elems; qc->nr_of_dims = ctrl->nr_of_dims; memcpy(qc->dims, ctrl->dims, qc->nr_of_dims * sizeof(qc->dims[0])); qc->minimum = ctrl->minimum; qc->maximum = ctrl->maximum; qc->default_value = ctrl->default_value; if (ctrl->type == V4L2_CTRL_TYPE_MENU || ctrl->type == V4L2_CTRL_TYPE_INTEGER_MENU) qc->step = 1; else qc->step = ctrl->step; return 0; } EXPORT_SYMBOL(v4l2_query_ext_ctrl); /* Implement VIDIOC_QUERYCTRL */ int v4l2_queryctrl(struct v4l2_ctrl_handler *hdl, struct v4l2_queryctrl *qc) { struct v4l2_query_ext_ctrl qec = { qc->id }; int rc; rc = v4l2_query_ext_ctrl(hdl, &qec); if (rc) return rc; qc->id = qec.id; qc->type = qec.type; qc->flags = qec.flags; strscpy(qc->name, qec.name, sizeof(qc->name)); switch (qc->type) { case V4L2_CTRL_TYPE_INTEGER: case V4L2_CTRL_TYPE_BOOLEAN: case V4L2_CTRL_TYPE_MENU: case V4L2_CTRL_TYPE_INTEGER_MENU: case V4L2_CTRL_TYPE_STRING: case V4L2_CTRL_TYPE_BITMASK: qc->minimum = qec.minimum; qc->maximum = qec.maximum; qc->step = qec.step; qc->default_value = qec.default_value; break; default: qc->minimum = 0; qc->maximum = 0; qc->step = 0; qc->default_value = 0; break; } return 0; } EXPORT_SYMBOL(v4l2_queryctrl); /* Implement VIDIOC_QUERYMENU */ int v4l2_querymenu(struct v4l2_ctrl_handler *hdl, struct v4l2_querymenu *qm) { struct v4l2_ctrl *ctrl; u32 i = qm->index; ctrl = v4l2_ctrl_find(hdl, qm->id); if (!ctrl) return -EINVAL; qm->reserved = 0; /* Sanity checks */ switch (ctrl->type) { case V4L2_CTRL_TYPE_MENU: if (!ctrl->qmenu) return -EINVAL; break; case V4L2_CTRL_TYPE_INTEGER_MENU: if (!ctrl->qmenu_int) return -EINVAL; break; default: return -EINVAL; } if (i < ctrl->minimum || i > ctrl->maximum) return -EINVAL; /* Use mask to see if this menu item should be skipped */ if (ctrl->menu_skip_mask & (1ULL << i)) return -EINVAL; /* Empty menu items should also be skipped */ if (ctrl->type == V4L2_CTRL_TYPE_MENU) { if (!ctrl->qmenu[i] || ctrl->qmenu[i][0] == '\0') return -EINVAL; strscpy(qm->name, ctrl->qmenu[i], sizeof(qm->name)); } else { qm->value = ctrl->qmenu_int[i]; } return 0; } EXPORT_SYMBOL(v4l2_querymenu); /* * VIDIOC_LOG_STATUS helpers */ int v4l2_ctrl_log_status(struct file *file, void *fh) { struct video_device *vfd = video_devdata(file); struct v4l2_fh *vfh = file->private_data; if (test_bit(V4L2_FL_USES_V4L2_FH, &vfd->flags) && vfd->v4l2_dev) v4l2_ctrl_handler_log_status(vfh->ctrl_handler, vfd->v4l2_dev->name); return 0; } EXPORT_SYMBOL(v4l2_ctrl_log_status); int v4l2_ctrl_subdev_log_status(struct v4l2_subdev *sd) { v4l2_ctrl_handler_log_status(sd->ctrl_handler, sd->name); return 0; } EXPORT_SYMBOL(v4l2_ctrl_subdev_log_status); /* * VIDIOC_(UN)SUBSCRIBE_EVENT implementation */ static int v4l2_ctrl_add_event(struct v4l2_subscribed_event *sev, unsigned int elems) { struct v4l2_ctrl *ctrl = v4l2_ctrl_find(sev->fh->ctrl_handler, sev->id); if (!ctrl) return -EINVAL; v4l2_ctrl_lock(ctrl); list_add_tail(&sev->node, &ctrl->ev_subs); if (ctrl->type != V4L2_CTRL_TYPE_CTRL_CLASS && (sev->flags & V4L2_EVENT_SUB_FL_SEND_INITIAL)) send_initial_event(sev->fh, ctrl); v4l2_ctrl_unlock(ctrl); return 0; } static void v4l2_ctrl_del_event(struct v4l2_subscribed_event *sev) { struct v4l2_ctrl *ctrl = v4l2_ctrl_find(sev->fh->ctrl_handler, sev->id); if (!ctrl) return; v4l2_ctrl_lock(ctrl); list_del(&sev->node); v4l2_ctrl_unlock(ctrl); } void v4l2_ctrl_replace(struct v4l2_event *old, const struct v4l2_event *new) { u32 old_changes = old->u.ctrl.changes; old->u.ctrl = new->u.ctrl; old->u.ctrl.changes |= old_changes; } EXPORT_SYMBOL(v4l2_ctrl_replace); void v4l2_ctrl_merge(const struct v4l2_event *old, struct v4l2_event *new) { new->u.ctrl.changes |= old->u.ctrl.changes; } EXPORT_SYMBOL(v4l2_ctrl_merge); const struct v4l2_subscribed_event_ops v4l2_ctrl_sub_ev_ops = { .add = v4l2_ctrl_add_event, .del = v4l2_ctrl_del_event, .replace = v4l2_ctrl_replace, .merge = v4l2_ctrl_merge, }; EXPORT_SYMBOL(v4l2_ctrl_sub_ev_ops); int v4l2_ctrl_subscribe_event(struct v4l2_fh *fh, const struct v4l2_event_subscription *sub) { if (sub->type == V4L2_EVENT_CTRL) return v4l2_event_subscribe(fh, sub, 0, &v4l2_ctrl_sub_ev_ops); return -EINVAL; } EXPORT_SYMBOL(v4l2_ctrl_subscribe_event); int v4l2_ctrl_subdev_subscribe_event(struct v4l2_subdev *sd, struct v4l2_fh *fh, struct v4l2_event_subscription *sub) { if (!sd->ctrl_handler) return -EINVAL; return v4l2_ctrl_subscribe_event(fh, sub); } EXPORT_SYMBOL(v4l2_ctrl_subdev_subscribe_event); /* * poll helper */ __poll_t v4l2_ctrl_poll(struct file *file, struct poll_table_struct *wait) { struct v4l2_fh *fh = file->private_data; poll_wait(file, &fh->wait, wait); if (v4l2_event_pending(fh)) return EPOLLPRI; return 0; } EXPORT_SYMBOL(v4l2_ctrl_poll); |
7 6 | 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 | /* SPDX-License-Identifier: GPL-2.0 */ #ifndef _CRYPTO_XTS_H #define _CRYPTO_XTS_H #include <crypto/b128ops.h> #include <crypto/internal/skcipher.h> #include <linux/fips.h> #define XTS_BLOCK_SIZE 16 static inline int xts_verify_key(struct crypto_skcipher *tfm, const u8 *key, unsigned int keylen) { /* * key consists of keys of equal size concatenated, therefore * the length must be even. */ if (keylen % 2) return -EINVAL; /* * In FIPS mode only a combined key length of either 256 or * 512 bits is allowed, c.f. FIPS 140-3 IG C.I. */ if (fips_enabled && keylen != 32 && keylen != 64) return -EINVAL; /* * Ensure that the AES and tweak key are not identical when * in FIPS mode or the FORBID_WEAK_KEYS flag is set. */ if ((fips_enabled || (crypto_skcipher_get_flags(tfm) & CRYPTO_TFM_REQ_FORBID_WEAK_KEYS)) && !crypto_memneq(key, key + (keylen / 2), keylen / 2)) return -EINVAL; return 0; } #endif /* _CRYPTO_XTS_H */ |
24 25 25 22 25 19 15 19 4 24 | 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60 61 62 63 64 65 66 67 68 69 70 71 72 73 74 75 76 77 78 79 80 81 82 83 84 85 86 87 88 89 90 91 92 93 94 95 96 97 98 99 100 101 102 103 104 105 106 107 108 109 110 111 112 113 114 115 116 117 118 119 120 121 122 123 124 125 126 127 128 129 130 131 132 133 134 135 136 137 138 139 140 141 142 143 144 145 146 147 148 149 150 151 152 153 154 155 156 157 158 159 160 161 162 163 164 165 166 167 168 169 170 171 172 173 174 175 176 177 178 179 180 181 182 183 184 185 186 187 188 189 190 191 192 193 194 195 196 197 198 199 200 201 202 203 204 205 206 207 208 209 210 211 212 213 214 215 216 217 218 219 220 221 222 223 224 225 226 227 228 229 230 231 232 233 234 235 236 237 238 239 240 241 242 243 244 245 246 247 248 249 250 | /* * Cryptographic API. * * MD5 Message Digest Algorithm (RFC1321). * * Derived from cryptoapi implementation, originally based on the * public domain implementation written by Colin Plumb in 1993. * * Copyright (c) Cryptoapi developers. * Copyright (c) 2002 James Morris <jmorris@intercode.com.au> * * This program is free software; you can redistribute it and/or modify it * under the terms of the GNU General Public License as published by the Free * Software Foundation; either version 2 of the License, or (at your option) * any later version. * */ #include <crypto/internal/hash.h> #include <crypto/md5.h> #include <linux/init.h> #include <linux/module.h> #include <linux/string.h> #include <linux/types.h> #include <asm/byteorder.h> const u8 md5_zero_message_hash[MD5_DIGEST_SIZE] = { 0xd4, 0x1d, 0x8c, 0xd9, 0x8f, 0x00, 0xb2, 0x04, 0xe9, 0x80, 0x09, 0x98, 0xec, 0xf8, 0x42, 0x7e, }; EXPORT_SYMBOL_GPL(md5_zero_message_hash); #define F1(x, y, z) (z ^ (x & (y ^ z))) #define F2(x, y, z) F1(z, x, y) #define F3(x, y, z) (x ^ y ^ z) #define F4(x, y, z) (y ^ (x | ~z)) #define MD5STEP(f, w, x, y, z, in, s) \ (w += f(x, y, z) + in, w = (w<<s | w>>(32-s)) + x) static void md5_transform(__u32 *hash, __u32 const *in) { u32 a, b, c, d; a = hash[0]; b = hash[1]; c = hash[2]; d = hash[3]; MD5STEP(F1, a, b, c, d, in[0] + 0xd76aa478, 7); MD5STEP(F1, d, a, b, c, in[1] + 0xe8c7b756, 12); MD5STEP(F1, c, d, a, b, in[2] + 0x242070db, 17); MD5STEP(F1, b, c, d, a, in[3] + 0xc1bdceee, 22); MD5STEP(F1, a, b, c, d, in[4] + 0xf57c0faf, 7); MD5STEP(F1, d, a, b, c, in[5] + 0x4787c62a, 12); MD5STEP(F1, c, d, a, b, in[6] + 0xa8304613, 17); MD5STEP(F1, b, c, d, a, in[7] + 0xfd469501, 22); MD5STEP(F1, a, b, c, d, in[8] + 0x698098d8, 7); MD5STEP(F1, d, a, b, c, in[9] + 0x8b44f7af, 12); MD5STEP(F1, c, d, a, b, in[10] + 0xffff5bb1, 17); MD5STEP(F1, b, c, d, a, in[11] + 0x895cd7be, 22); MD5STEP(F1, a, b, c, d, in[12] + 0x6b901122, 7); MD5STEP(F1, d, a, b, c, in[13] + 0xfd987193, 12); MD5STEP(F1, c, d, a, b, in[14] + 0xa679438e, 17); MD5STEP(F1, b, c, d, a, in[15] + 0x49b40821, 22); MD5STEP(F2, a, b, c, d, in[1] + 0xf61e2562, 5); MD5STEP(F2, d, a, b, c, in[6] + 0xc040b340, 9); MD5STEP(F2, c, d, a, b, in[11] + 0x265e5a51, 14); MD5STEP(F2, b, c, d, a, in[0] + 0xe9b6c7aa, 20); MD5STEP(F2, a, b, c, d, in[5] + 0xd62f105d, 5); MD5STEP(F2, d, a, b, c, in[10] + 0x02441453, 9); MD5STEP(F2, c, d, a, b, in[15] + 0xd8a1e681, 14); MD5STEP(F2, b, c, d, a, in[4] + 0xe7d3fbc8, 20); MD5STEP(F2, a, b, c, d, in[9] + 0x21e1cde6, 5); MD5STEP(F2, d, a, b, c, in[14] + 0xc33707d6, 9); MD5STEP(F2, c, d, a, b, in[3] + 0xf4d50d87, 14); MD5STEP(F2, b, c, d, a, in[8] + 0x455a14ed, 20); MD5STEP(F2, a, b, c, d, in[13] + 0xa9e3e905, 5); MD5STEP(F2, d, a, b, c, in[2] + 0xfcefa3f8, 9); MD5STEP(F2, c, d, a, b, in[7] + 0x676f02d9, 14); MD5STEP(F2, b, c, d, a, in[12] + 0x8d2a4c8a, 20); MD5STEP(F3, a, b, c, d, in[5] + 0xfffa3942, 4); MD5STEP(F3, d, a, b, c, in[8] + 0x8771f681, 11); MD5STEP(F3, c, d, a, b, in[11] + 0x6d9d6122, 16); MD5STEP(F3, b, c, d, a, in[14] + 0xfde5380c, 23); MD5STEP(F3, a, b, c, d, in[1] + 0xa4beea44, 4); MD5STEP(F3, d, a, b, c, in[4] + 0x4bdecfa9, 11); MD5STEP(F3, c, d, a, b, in[7] + 0xf6bb4b60, 16); MD5STEP(F3, b, c, d, a, in[10] + 0xbebfbc70, 23); MD5STEP(F3, a, b, c, d, in[13] + 0x289b7ec6, 4); MD5STEP(F3, d, a, b, c, in[0] + 0xeaa127fa, 11); MD5STEP(F3, c, d, a, b, in[3] + 0xd4ef3085, 16); MD5STEP(F3, b, c, d, a, in[6] + 0x04881d05, 23); MD5STEP(F3, a, b, c, d, in[9] + 0xd9d4d039, 4); MD5STEP(F3, d, a, b, c, in[12] + 0xe6db99e5, 11); MD5STEP(F3, c, d, a, b, in[15] + 0x1fa27cf8, 16); MD5STEP(F3, b, c, d, a, in[2] + 0xc4ac5665, 23); MD5STEP(F4, a, b, c, d, in[0] + 0xf4292244, 6); MD5STEP(F4, d, a, b, c, in[7] + 0x432aff97, 10); MD5STEP(F4, c, d, a, b, in[14] + 0xab9423a7, 15); MD5STEP(F4, b, c, d, a, in[5] + 0xfc93a039, 21); MD5STEP(F4, a, b, c, d, in[12] + 0x655b59c3, 6); MD5STEP(F4, d, a, b, c, in[3] + 0x8f0ccc92, 10); MD5STEP(F4, c, d, a, b, in[10] + 0xffeff47d, 15); MD5STEP(F4, b, c, d, a, in[1] + 0x85845dd1, 21); MD5STEP(F4, a, b, c, d, in[8] + 0x6fa87e4f, 6); MD5STEP(F4, d, a, b, c, in[15] + 0xfe2ce6e0, 10); MD5STEP(F4, c, d, a, b, in[6] + 0xa3014314, 15); MD5STEP(F4, b, c, d, a, in[13] + 0x4e0811a1, 21); MD5STEP(F4, a, b, c, d, in[4] + 0xf7537e82, 6); MD5STEP(F4, d, a, b, c, in[11] + 0xbd3af235, 10); MD5STEP(F4, c, d, a, b, in[2] + 0x2ad7d2bb, 15); MD5STEP(F4, b, c, d, a, in[9] + 0xeb86d391, 21); hash[0] += a; hash[1] += b; hash[2] += c; hash[3] += d; } static inline void md5_transform_helper(struct md5_state *ctx) { le32_to_cpu_array(ctx->block, sizeof(ctx->block) / sizeof(u32)); md5_transform(ctx->hash, ctx->block); } static int md5_init(struct shash_desc *desc) { struct md5_state *mctx = shash_desc_ctx(desc); mctx->hash[0] = MD5_H0; mctx->hash[1] = MD5_H1; mctx->hash[2] = MD5_H2; mctx->hash[3] = MD5_H3; mctx->byte_count = 0; return 0; } static int md5_update(struct shash_desc *desc, const u8 *data, unsigned int len) { struct md5_state *mctx = shash_desc_ctx(desc); const u32 avail = sizeof(mctx->block) - (mctx->byte_count & 0x3f); mctx->byte_count += len; if (avail > len) { memcpy((char *)mctx->block + (sizeof(mctx->block) - avail), data, len); return 0; } memcpy((char *)mctx->block + (sizeof(mctx->block) - avail), data, avail); md5_transform_helper(mctx); data += avail; len -= avail; while (len >= sizeof(mctx->block)) { memcpy(mctx->block, data, sizeof(mctx->block)); md5_transform_helper(mctx); data += sizeof(mctx->block); len -= sizeof(mctx->block); } memcpy(mctx->block, data, len); return 0; } static int md5_final(struct shash_desc *desc, u8 *out) { struct md5_state *mctx = shash_desc_ctx(desc); const unsigned int offset = mctx->byte_count & 0x3f; char *p = (char *)mctx->block + offset; int padding = 56 - (offset + 1); *p++ = 0x80; if (padding < 0) { memset(p, 0x00, padding + sizeof (u64)); md5_transform_helper(mctx); p = (char *)mctx->block; padding = 56; } memset(p, 0, padding); mctx->block[14] = mctx->byte_count << 3; mctx->block[15] = mctx->byte_count >> 29; le32_to_cpu_array(mctx->block, (sizeof(mctx->block) - sizeof(u64)) / sizeof(u32)); md5_transform(mctx->hash, mctx->block); cpu_to_le32_array(mctx->hash, sizeof(mctx->hash) / sizeof(u32)); memcpy(out, mctx->hash, sizeof(mctx->hash)); memset(mctx, 0, sizeof(*mctx)); return 0; } static int md5_export(struct shash_desc *desc, void *out) { struct md5_state *ctx = shash_desc_ctx(desc); memcpy(out, ctx, sizeof(*ctx)); return 0; } static int md5_import(struct shash_desc *desc, const void *in) { struct md5_state *ctx = shash_desc_ctx(desc); memcpy(ctx, in, sizeof(*ctx)); return 0; } static struct shash_alg alg = { .digestsize = MD5_DIGEST_SIZE, .init = md5_init, .update = md5_update, .final = md5_final, .export = md5_export, .import = md5_import, .descsize = sizeof(struct md5_state), .statesize = sizeof(struct md5_state), .base = { .cra_name = "md5", .cra_driver_name = "md5-generic", .cra_blocksize = MD5_HMAC_BLOCK_SIZE, .cra_module = THIS_MODULE, } }; static int __init md5_mod_init(void) { return crypto_register_shash(&alg); } static void __exit md5_mod_fini(void) { crypto_unregister_shash(&alg); } subsys_initcall(md5_mod_init); module_exit(md5_mod_fini); MODULE_LICENSE("GPL"); MODULE_DESCRIPTION("MD5 Message Digest Algorithm"); MODULE_ALIAS_CRYPTO("md5"); |
2 2 1 4 3 1 1 | 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60 61 62 63 64 65 66 67 68 69 70 71 72 73 74 75 76 77 78 79 80 81 82 83 84 85 86 87 88 89 90 91 92 93 94 95 96 97 98 99 100 101 102 103 104 105 106 107 108 109 110 111 112 113 114 115 116 117 118 119 120 121 | // SPDX-License-Identifier: GPL-2.0 #include <linux/kernel.h> #include <linux/errno.h> #include <linux/fs.h> #include <linux/file.h> #include <linux/mm.h> #include <linux/slab.h> #include <linux/namei.h> #include <linux/io_uring.h> #include <linux/splice.h> #include <uapi/linux/io_uring.h> #include "io_uring.h" #include "splice.h" struct io_splice { struct file *file_out; loff_t off_out; loff_t off_in; u64 len; int splice_fd_in; unsigned int flags; }; static int __io_splice_prep(struct io_kiocb *req, const struct io_uring_sqe *sqe) { struct io_splice *sp = io_kiocb_to_cmd(req, struct io_splice); unsigned int valid_flags = SPLICE_F_FD_IN_FIXED | SPLICE_F_ALL; sp->len = READ_ONCE(sqe->len); sp->flags = READ_ONCE(sqe->splice_flags); if (unlikely(sp->flags & ~valid_flags)) return -EINVAL; sp->splice_fd_in = READ_ONCE(sqe->splice_fd_in); req->flags |= REQ_F_FORCE_ASYNC; return 0; } int io_tee_prep(struct io_kiocb *req, const struct io_uring_sqe *sqe) { if (READ_ONCE(sqe->splice_off_in) || READ_ONCE(sqe->off)) return -EINVAL; return __io_splice_prep(req, sqe); } int io_tee(struct io_kiocb *req, unsigned int issue_flags) { struct io_splice *sp = io_kiocb_to_cmd(req, struct io_splice); struct file *out = sp->file_out; unsigned int flags = sp->flags & ~SPLICE_F_FD_IN_FIXED; struct file *in; long ret = 0; WARN_ON_ONCE(issue_flags & IO_URING_F_NONBLOCK); if (sp->flags & SPLICE_F_FD_IN_FIXED) in = io_file_get_fixed(req, sp->splice_fd_in, issue_flags); else in = io_file_get_normal(req, sp->splice_fd_in); if (!in) { ret = -EBADF; goto done; } if (sp->len) ret = do_tee(in, out, sp->len, flags); if (!(sp->flags & SPLICE_F_FD_IN_FIXED)) fput(in); done: if (ret != sp->len) req_set_fail(req); io_req_set_res(req, ret, 0); return IOU_OK; } int io_splice_prep(struct io_kiocb *req, const struct io_uring_sqe *sqe) { struct io_splice *sp = io_kiocb_to_cmd(req, struct io_splice); sp->off_in = READ_ONCE(sqe->splice_off_in); sp->off_out = READ_ONCE(sqe->off); return __io_splice_prep(req, sqe); } int io_splice(struct io_kiocb *req, unsigned int issue_flags) { struct io_splice *sp = io_kiocb_to_cmd(req, struct io_splice); struct file *out = sp->file_out; unsigned int flags = sp->flags & ~SPLICE_F_FD_IN_FIXED; loff_t *poff_in, *poff_out; struct file *in; long ret = 0; WARN_ON_ONCE(issue_flags & IO_URING_F_NONBLOCK); if (sp->flags & SPLICE_F_FD_IN_FIXED) in = io_file_get_fixed(req, sp->splice_fd_in, issue_flags); else in = io_file_get_normal(req, sp->splice_fd_in); if (!in) { ret = -EBADF; goto done; } poff_in = (sp->off_in == -1) ? NULL : &sp->off_in; poff_out = (sp->off_out == -1) ? NULL : &sp->off_out; if (sp->len) ret = do_splice(in, poff_in, out, poff_out, sp->len, flags); if (!(sp->flags & SPLICE_F_FD_IN_FIXED)) fput(in); done: if (ret != sp->len) req_set_fail(req); io_req_set_res(req, ret, 0); return IOU_OK; } |
52 51 40 51 50 49 4 4 4 4 4 4 4 4 4 4 4 4 2 4 3 4 4 4 4 4 4 4 5 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 5 5 5 25 23 19 18 20 18 35 35 35 34 19 17 16 6 11 32 1 32 15 32 13 12 17 11 32 5 32 28 10 19 25 10 4 9 26 14 35 35 47 44 43 13 13 43 7 43 5 43 9 43 7 3 1 4 1 4 4 4 4 1 4 43 39 20 35 44 47 46 | 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60 61 62 63 64 65 66 67 68 69 70 71 72 73 74 75 76 77 78 79 80 81 82 83 84 85 86 87 88 89 90 91 92 93 94 95 96 97 98 99 100 101 102 103 104 105 106 107 108 109 110 111 112 113 114 115 116 117 118 119 120 121 122 123 124 125 126 127 128 129 130 131 132 133 134 135 136 137 138 139 140 141 142 143 144 145 146 147 148 149 150 151 152 153 154 155 156 157 158 159 160 161 162 163 164 165 166 167 168 169 170 171 172 173 174 175 176 177 178 179 180 181 182 183 184 185 186 187 188 189 190 191 192 193 194 195 196 197 198 199 200 201 202 203 204 205 206 207 208 209 210 211 212 213 214 215 216 217 218 219 220 221 222 223 224 225 226 227 228 229 230 231 232 233 234 235 236 237 238 239 240 241 242 243 244 245 246 247 248 249 250 251 252 253 254 255 256 257 258 259 260 261 262 263 264 265 266 267 268 269 270 271 272 273 274 275 276 277 278 279 280 281 282 283 284 285 286 287 288 289 290 291 292 293 294 295 296 297 298 299 300 301 302 303 304 305 306 307 308 309 310 311 312 313 314 315 316 317 318 319 320 321 322 323 324 325 326 327 328 329 330 331 332 333 334 335 336 337 338 339 340 341 342 343 344 345 346 347 348 349 350 351 352 353 354 355 356 357 358 359 360 361 362 363 364 365 366 367 368 369 370 371 372 373 374 375 376 377 378 379 380 381 382 383 384 385 386 387 388 389 390 391 392 393 394 395 396 397 398 399 400 401 402 403 404 405 406 407 408 409 410 411 412 413 414 415 416 417 418 419 420 421 422 423 424 425 426 427 428 429 430 431 432 433 434 435 436 437 438 439 440 441 442 443 444 445 446 447 448 449 450 451 452 453 454 455 456 457 458 459 460 461 462 463 464 465 466 467 468 469 470 471 472 473 474 475 476 477 478 479 480 481 482 483 484 485 486 487 488 489 490 491 492 493 494 495 496 497 498 499 500 501 502 503 504 505 506 507 508 509 510 511 512 513 514 515 516 517 518 519 520 521 522 523 524 525 526 527 528 529 530 531 532 533 534 535 536 537 538 539 540 541 542 543 544 545 546 547 548 549 550 551 552 553 554 555 556 557 558 559 560 561 562 563 564 565 566 567 568 569 570 571 572 573 574 575 576 577 578 579 580 581 582 583 584 585 586 587 588 589 590 591 592 593 594 595 596 597 598 599 600 601 602 603 604 605 606 607 608 609 610 611 612 613 614 615 616 617 618 619 620 621 622 623 624 625 626 627 628 629 630 631 632 633 634 635 636 637 638 639 640 641 642 643 644 | // SPDX-License-Identifier: GPL-2.0 /* * Copyright (C) 2015-2019 Jason A. Donenfeld <Jason@zx2c4.com>. All Rights Reserved. */ #include "netlink.h" #include "device.h" #include "peer.h" #include "socket.h" #include "queueing.h" #include "messages.h" #include <uapi/linux/wireguard.h> #include <linux/if.h> #include <net/genetlink.h> #include <net/sock.h> #include <crypto/algapi.h> static struct genl_family genl_family; static const struct nla_policy device_policy[WGDEVICE_A_MAX + 1] = { [WGDEVICE_A_IFINDEX] = { .type = NLA_U32 }, [WGDEVICE_A_IFNAME] = { .type = NLA_NUL_STRING, .len = IFNAMSIZ - 1 }, [WGDEVICE_A_PRIVATE_KEY] = NLA_POLICY_EXACT_LEN(NOISE_PUBLIC_KEY_LEN), [WGDEVICE_A_PUBLIC_KEY] = NLA_POLICY_EXACT_LEN(NOISE_PUBLIC_KEY_LEN), [WGDEVICE_A_FLAGS] = { .type = NLA_U32 }, [WGDEVICE_A_LISTEN_PORT] = { .type = NLA_U16 }, [WGDEVICE_A_FWMARK] = { .type = NLA_U32 }, [WGDEVICE_A_PEERS] = { .type = NLA_NESTED } }; static const struct nla_policy peer_policy[WGPEER_A_MAX + 1] = { [WGPEER_A_PUBLIC_KEY] = NLA_POLICY_EXACT_LEN(NOISE_PUBLIC_KEY_LEN), [WGPEER_A_PRESHARED_KEY] = NLA_POLICY_EXACT_LEN(NOISE_SYMMETRIC_KEY_LEN), [WGPEER_A_FLAGS] = { .type = NLA_U32 }, [WGPEER_A_ENDPOINT] = NLA_POLICY_MIN_LEN(sizeof(struct sockaddr)), [WGPEER_A_PERSISTENT_KEEPALIVE_INTERVAL] = { .type = NLA_U16 }, [WGPEER_A_LAST_HANDSHAKE_TIME] = NLA_POLICY_EXACT_LEN(sizeof(struct __kernel_timespec)), [WGPEER_A_RX_BYTES] = { .type = NLA_U64 }, [WGPEER_A_TX_BYTES] = { .type = NLA_U64 }, [WGPEER_A_ALLOWEDIPS] = { .type = NLA_NESTED }, [WGPEER_A_PROTOCOL_VERSION] = { .type = NLA_U32 } }; static const struct nla_policy allowedip_policy[WGALLOWEDIP_A_MAX + 1] = { [WGALLOWEDIP_A_FAMILY] = { .type = NLA_U16 }, [WGALLOWEDIP_A_IPADDR] = NLA_POLICY_MIN_LEN(sizeof(struct in_addr)), [WGALLOWEDIP_A_CIDR_MASK] = { .type = NLA_U8 } }; static struct wg_device *lookup_interface(struct nlattr **attrs, struct sk_buff *skb) { struct net_device *dev = NULL; if (!attrs[WGDEVICE_A_IFINDEX] == !attrs[WGDEVICE_A_IFNAME]) return ERR_PTR(-EBADR); if (attrs[WGDEVICE_A_IFINDEX]) dev = dev_get_by_index(sock_net(skb->sk), nla_get_u32(attrs[WGDEVICE_A_IFINDEX])); else if (attrs[WGDEVICE_A_IFNAME]) dev = dev_get_by_name(sock_net(skb->sk), nla_data(attrs[WGDEVICE_A_IFNAME])); if (!dev) return ERR_PTR(-ENODEV); if (!dev->rtnl_link_ops || !dev->rtnl_link_ops->kind || strcmp(dev->rtnl_link_ops->kind, KBUILD_MODNAME)) { dev_put(dev); return ERR_PTR(-EOPNOTSUPP); } return netdev_priv(dev); } static int get_allowedips(struct sk_buff *skb, const u8 *ip, u8 cidr, int family) { struct nlattr *allowedip_nest; allowedip_nest = nla_nest_start(skb, 0); if (!allowedip_nest) return -EMSGSIZE; if (nla_put_u8(skb, WGALLOWEDIP_A_CIDR_MASK, cidr) || nla_put_u16(skb, WGALLOWEDIP_A_FAMILY, family) || nla_put(skb, WGALLOWEDIP_A_IPADDR, family == AF_INET6 ? sizeof(struct in6_addr) : sizeof(struct in_addr), ip)) { nla_nest_cancel(skb, allowedip_nest); return -EMSGSIZE; } nla_nest_end(skb, allowedip_nest); return 0; } struct dump_ctx { struct wg_device *wg; struct wg_peer *next_peer; u64 allowedips_seq; struct allowedips_node *next_allowedip; }; #define DUMP_CTX(cb) ((struct dump_ctx *)(cb)->args) static int get_peer(struct wg_peer *peer, struct sk_buff *skb, struct dump_ctx *ctx) { struct nlattr *allowedips_nest, *peer_nest = nla_nest_start(skb, 0); struct allowedips_node *allowedips_node = ctx->next_allowedip; bool fail; if (!peer_nest) return -EMSGSIZE; down_read(&peer->handshake.lock); fail = nla_put(skb, WGPEER_A_PUBLIC_KEY, NOISE_PUBLIC_KEY_LEN, peer->handshake.remote_static); up_read(&peer->handshake.lock); if (fail) goto err; if (!allowedips_node) { const struct __kernel_timespec last_handshake = { .tv_sec = peer->walltime_last_handshake.tv_sec, .tv_nsec = peer->walltime_last_handshake.tv_nsec }; down_read(&peer->handshake.lock); fail = nla_put(skb, WGPEER_A_PRESHARED_KEY, NOISE_SYMMETRIC_KEY_LEN, peer->handshake.preshared_key); up_read(&peer->handshake.lock); if (fail) goto err; if (nla_put(skb, WGPEER_A_LAST_HANDSHAKE_TIME, sizeof(last_handshake), &last_handshake) || nla_put_u16(skb, WGPEER_A_PERSISTENT_KEEPALIVE_INTERVAL, peer->persistent_keepalive_interval) || nla_put_u64_64bit(skb, WGPEER_A_TX_BYTES, peer->tx_bytes, WGPEER_A_UNSPEC) || nla_put_u64_64bit(skb, WGPEER_A_RX_BYTES, peer->rx_bytes, WGPEER_A_UNSPEC) || nla_put_u32(skb, WGPEER_A_PROTOCOL_VERSION, 1)) goto err; read_lock_bh(&peer->endpoint_lock); if (peer->endpoint.addr.sa_family == AF_INET) fail = nla_put(skb, WGPEER_A_ENDPOINT, sizeof(peer->endpoint.addr4), &peer->endpoint.addr4); else if (peer->endpoint.addr.sa_family == AF_INET6) fail = nla_put(skb, WGPEER_A_ENDPOINT, sizeof(peer->endpoint.addr6), &peer->endpoint.addr6); read_unlock_bh(&peer->endpoint_lock); if (fail) goto err; allowedips_node = list_first_entry_or_null(&peer->allowedips_list, struct allowedips_node, peer_list); } if (!allowedips_node) goto no_allowedips; if (!ctx->allowedips_seq) ctx->allowedips_seq = peer->device->peer_allowedips.seq; else if (ctx->allowedips_seq != peer->device->peer_allowedips.seq) goto no_allowedips; allowedips_nest = nla_nest_start(skb, WGPEER_A_ALLOWEDIPS); if (!allowedips_nest) goto err; list_for_each_entry_from(allowedips_node, &peer->allowedips_list, peer_list) { u8 cidr, ip[16] __aligned(__alignof(u64)); int family; family = wg_allowedips_read_node(allowedips_node, ip, &cidr); if (get_allowedips(skb, ip, cidr, family)) { nla_nest_end(skb, allowedips_nest); nla_nest_end(skb, peer_nest); ctx->next_allowedip = allowedips_node; return -EMSGSIZE; } } nla_nest_end(skb, allowedips_nest); no_allowedips: nla_nest_end(skb, peer_nest); ctx->next_allowedip = NULL; ctx->allowedips_seq = 0; return 0; err: nla_nest_cancel(skb, peer_nest); return -EMSGSIZE; } static int wg_get_device_start(struct netlink_callback *cb) { struct wg_device *wg; wg = lookup_interface(genl_info_dump(cb)->attrs, cb->skb); if (IS_ERR(wg)) return PTR_ERR(wg); DUMP_CTX(cb)->wg = wg; return 0; } static int wg_get_device_dump(struct sk_buff *skb, struct netlink_callback *cb) { struct wg_peer *peer, *next_peer_cursor; struct dump_ctx *ctx = DUMP_CTX(cb); struct wg_device *wg = ctx->wg; struct nlattr *peers_nest; int ret = -EMSGSIZE; bool done = true; void *hdr; rtnl_lock(); mutex_lock(&wg->device_update_lock); cb->seq = wg->device_update_gen; next_peer_cursor = ctx->next_peer; hdr = genlmsg_put(skb, NETLINK_CB(cb->skb).portid, cb->nlh->nlmsg_seq, &genl_family, NLM_F_MULTI, WG_CMD_GET_DEVICE); if (!hdr) goto out; genl_dump_check_consistent(cb, hdr); if (!ctx->next_peer) { if (nla_put_u16(skb, WGDEVICE_A_LISTEN_PORT, wg->incoming_port) || nla_put_u32(skb, WGDEVICE_A_FWMARK, wg->fwmark) || nla_put_u32(skb, WGDEVICE_A_IFINDEX, wg->dev->ifindex) || nla_put_string(skb, WGDEVICE_A_IFNAME, wg->dev->name)) goto out; down_read(&wg->static_identity.lock); if (wg->static_identity.has_identity) { if (nla_put(skb, WGDEVICE_A_PRIVATE_KEY, NOISE_PUBLIC_KEY_LEN, wg->static_identity.static_private) || nla_put(skb, WGDEVICE_A_PUBLIC_KEY, NOISE_PUBLIC_KEY_LEN, wg->static_identity.static_public)) { up_read(&wg->static_identity.lock); goto out; } } up_read(&wg->static_identity.lock); } peers_nest = nla_nest_start(skb, WGDEVICE_A_PEERS); if (!peers_nest) goto out; ret = 0; /* If the last cursor was removed via list_del_init in peer_remove, then * we just treat this the same as there being no more peers left. The * reason is that seq_nr should indicate to userspace that this isn't a * coherent dump anyway, so they'll try again. */ if (list_empty(&wg->peer_list) || (ctx->next_peer && list_empty(&ctx->next_peer->peer_list))) { nla_nest_cancel(skb, peers_nest); goto out; } lockdep_assert_held(&wg->device_update_lock); peer = list_prepare_entry(ctx->next_peer, &wg->peer_list, peer_list); list_for_each_entry_continue(peer, &wg->peer_list, peer_list) { if (get_peer(peer, skb, ctx)) { done = false; break; } next_peer_cursor = peer; } nla_nest_end(skb, peers_nest); out: if (!ret && !done && next_peer_cursor) wg_peer_get(next_peer_cursor); wg_peer_put(ctx->next_peer); mutex_unlock(&wg->device_update_lock); rtnl_unlock(); if (ret) { genlmsg_cancel(skb, hdr); return ret; } genlmsg_end(skb, hdr); if (done) { ctx->next_peer = NULL; return 0; } ctx->next_peer = next_peer_cursor; return skb->len; /* At this point, we can't really deal ourselves with safely zeroing out * the private key material after usage. This will need an additional API * in the kernel for marking skbs as zero_on_free. */ } static int wg_get_device_done(struct netlink_callback *cb) { struct dump_ctx *ctx = DUMP_CTX(cb); if (ctx->wg) dev_put(ctx->wg->dev); wg_peer_put(ctx->next_peer); return 0; } static int set_port(struct wg_device *wg, u16 port) { struct wg_peer *peer; if (wg->incoming_port == port) return 0; list_for_each_entry(peer, &wg->peer_list, peer_list) wg_socket_clear_peer_endpoint_src(peer); if (!netif_running(wg->dev)) { wg->incoming_port = port; return 0; } return wg_socket_init(wg, port); } static int set_allowedip(struct wg_peer *peer, struct nlattr **attrs) { int ret = -EINVAL; u16 family; u8 cidr; if (!attrs[WGALLOWEDIP_A_FAMILY] || !attrs[WGALLOWEDIP_A_IPADDR] || !attrs[WGALLOWEDIP_A_CIDR_MASK]) return ret; family = nla_get_u16(attrs[WGALLOWEDIP_A_FAMILY]); cidr = nla_get_u8(attrs[WGALLOWEDIP_A_CIDR_MASK]); if (family == AF_INET && cidr <= 32 && nla_len(attrs[WGALLOWEDIP_A_IPADDR]) == sizeof(struct in_addr)) ret = wg_allowedips_insert_v4( &peer->device->peer_allowedips, nla_data(attrs[WGALLOWEDIP_A_IPADDR]), cidr, peer, &peer->device->device_update_lock); else if (family == AF_INET6 && cidr <= 128 && nla_len(attrs[WGALLOWEDIP_A_IPADDR]) == sizeof(struct in6_addr)) ret = wg_allowedips_insert_v6( &peer->device->peer_allowedips, nla_data(attrs[WGALLOWEDIP_A_IPADDR]), cidr, peer, &peer->device->device_update_lock); return ret; } static int set_peer(struct wg_device *wg, struct nlattr **attrs) { u8 *public_key = NULL, *preshared_key = NULL; struct wg_peer *peer = NULL; u32 flags = 0; int ret; ret = -EINVAL; if (attrs[WGPEER_A_PUBLIC_KEY] && nla_len(attrs[WGPEER_A_PUBLIC_KEY]) == NOISE_PUBLIC_KEY_LEN) public_key = nla_data(attrs[WGPEER_A_PUBLIC_KEY]); else goto out; if (attrs[WGPEER_A_PRESHARED_KEY] && nla_len(attrs[WGPEER_A_PRESHARED_KEY]) == NOISE_SYMMETRIC_KEY_LEN) preshared_key = nla_data(attrs[WGPEER_A_PRESHARED_KEY]); if (attrs[WGPEER_A_FLAGS]) flags = nla_get_u32(attrs[WGPEER_A_FLAGS]); ret = -EOPNOTSUPP; if (flags & ~__WGPEER_F_ALL) goto out; ret = -EPFNOSUPPORT; if (attrs[WGPEER_A_PROTOCOL_VERSION]) { if (nla_get_u32(attrs[WGPEER_A_PROTOCOL_VERSION]) != 1) goto out; } peer = wg_pubkey_hashtable_lookup(wg->peer_hashtable, nla_data(attrs[WGPEER_A_PUBLIC_KEY])); ret = 0; if (!peer) { /* Peer doesn't exist yet. Add a new one. */ if (flags & (WGPEER_F_REMOVE_ME | WGPEER_F_UPDATE_ONLY)) goto out; /* The peer is new, so there aren't allowed IPs to remove. */ flags &= ~WGPEER_F_REPLACE_ALLOWEDIPS; down_read(&wg->static_identity.lock); if (wg->static_identity.has_identity && !memcmp(nla_data(attrs[WGPEER_A_PUBLIC_KEY]), wg->static_identity.static_public, NOISE_PUBLIC_KEY_LEN)) { /* We silently ignore peers that have the same public * key as the device. The reason we do it silently is * that we'd like for people to be able to reuse the * same set of API calls across peers. */ up_read(&wg->static_identity.lock); ret = 0; goto out; } up_read(&wg->static_identity.lock); peer = wg_peer_create(wg, public_key, preshared_key); if (IS_ERR(peer)) { ret = PTR_ERR(peer); peer = NULL; goto out; } /* Take additional reference, as though we've just been * looked up. */ wg_peer_get(peer); } if (flags & WGPEER_F_REMOVE_ME) { wg_peer_remove(peer); goto out; } if (preshared_key) { down_write(&peer->handshake.lock); memcpy(&peer->handshake.preshared_key, preshared_key, NOISE_SYMMETRIC_KEY_LEN); up_write(&peer->handshake.lock); } if (attrs[WGPEER_A_ENDPOINT]) { struct sockaddr *addr = nla_data(attrs[WGPEER_A_ENDPOINT]); size_t len = nla_len(attrs[WGPEER_A_ENDPOINT]); struct endpoint endpoint = { { { 0 } } }; if (len == sizeof(struct sockaddr_in) && addr->sa_family == AF_INET) { endpoint.addr4 = *(struct sockaddr_in *)addr; wg_socket_set_peer_endpoint(peer, &endpoint); } else if (len == sizeof(struct sockaddr_in6) && addr->sa_family == AF_INET6) { endpoint.addr6 = *(struct sockaddr_in6 *)addr; wg_socket_set_peer_endpoint(peer, &endpoint); } } if (flags & WGPEER_F_REPLACE_ALLOWEDIPS) wg_allowedips_remove_by_peer(&wg->peer_allowedips, peer, &wg->device_update_lock); if (attrs[WGPEER_A_ALLOWEDIPS]) { struct nlattr *attr, *allowedip[WGALLOWEDIP_A_MAX + 1]; int rem; nla_for_each_nested(attr, attrs[WGPEER_A_ALLOWEDIPS], rem) { ret = nla_parse_nested(allowedip, WGALLOWEDIP_A_MAX, attr, allowedip_policy, NULL); if (ret < 0) goto out; ret = set_allowedip(peer, allowedip); if (ret < 0) goto out; } } if (attrs[WGPEER_A_PERSISTENT_KEEPALIVE_INTERVAL]) { const u16 persistent_keepalive_interval = nla_get_u16( attrs[WGPEER_A_PERSISTENT_KEEPALIVE_INTERVAL]); const bool send_keepalive = !peer->persistent_keepalive_interval && persistent_keepalive_interval && netif_running(wg->dev); peer->persistent_keepalive_interval = persistent_keepalive_interval; if (send_keepalive) wg_packet_send_keepalive(peer); } if (netif_running(wg->dev)) wg_packet_send_staged_packets(peer); out: wg_peer_put(peer); if (attrs[WGPEER_A_PRESHARED_KEY]) memzero_explicit(nla_data(attrs[WGPEER_A_PRESHARED_KEY]), nla_len(attrs[WGPEER_A_PRESHARED_KEY])); return ret; } static int wg_set_device(struct sk_buff *skb, struct genl_info *info) { struct wg_device *wg = lookup_interface(info->attrs, skb); u32 flags = 0; int ret; if (IS_ERR(wg)) { ret = PTR_ERR(wg); goto out_nodev; } rtnl_lock(); mutex_lock(&wg->device_update_lock); if (info->attrs[WGDEVICE_A_FLAGS]) flags = nla_get_u32(info->attrs[WGDEVICE_A_FLAGS]); ret = -EOPNOTSUPP; if (flags & ~__WGDEVICE_F_ALL) goto out; if (info->attrs[WGDEVICE_A_LISTEN_PORT] || info->attrs[WGDEVICE_A_FWMARK]) { struct net *net; rcu_read_lock(); net = rcu_dereference(wg->creating_net); ret = !net || !ns_capable(net->user_ns, CAP_NET_ADMIN) ? -EPERM : 0; rcu_read_unlock(); if (ret) goto out; } ++wg->device_update_gen; if (info->attrs[WGDEVICE_A_FWMARK]) { struct wg_peer *peer; wg->fwmark = nla_get_u32(info->attrs[WGDEVICE_A_FWMARK]); list_for_each_entry(peer, &wg->peer_list, peer_list) wg_socket_clear_peer_endpoint_src(peer); } if (info->attrs[WGDEVICE_A_LISTEN_PORT]) { ret = set_port(wg, nla_get_u16(info->attrs[WGDEVICE_A_LISTEN_PORT])); if (ret) goto out; } if (flags & WGDEVICE_F_REPLACE_PEERS) wg_peer_remove_all(wg); if (info->attrs[WGDEVICE_A_PRIVATE_KEY] && nla_len(info->attrs[WGDEVICE_A_PRIVATE_KEY]) == NOISE_PUBLIC_KEY_LEN) { u8 *private_key = nla_data(info->attrs[WGDEVICE_A_PRIVATE_KEY]); u8 public_key[NOISE_PUBLIC_KEY_LEN]; struct wg_peer *peer, *temp; bool send_staged_packets; if (!crypto_memneq(wg->static_identity.static_private, private_key, NOISE_PUBLIC_KEY_LEN)) goto skip_set_private_key; /* We remove before setting, to prevent race, which means doing * two 25519-genpub ops. */ if (curve25519_generate_public(public_key, private_key)) { peer = wg_pubkey_hashtable_lookup(wg->peer_hashtable, public_key); if (peer) { wg_peer_put(peer); wg_peer_remove(peer); } } down_write(&wg->static_identity.lock); send_staged_packets = !wg->static_identity.has_identity && netif_running(wg->dev); wg_noise_set_static_identity_private_key(&wg->static_identity, private_key); send_staged_packets = send_staged_packets && wg->static_identity.has_identity; wg_cookie_checker_precompute_device_keys(&wg->cookie_checker); list_for_each_entry_safe(peer, temp, &wg->peer_list, peer_list) { wg_noise_precompute_static_static(peer); wg_noise_expire_current_peer_keypairs(peer); if (send_staged_packets) wg_packet_send_staged_packets(peer); } up_write(&wg->static_identity.lock); } skip_set_private_key: if (info->attrs[WGDEVICE_A_PEERS]) { struct nlattr *attr, *peer[WGPEER_A_MAX + 1]; int rem; nla_for_each_nested(attr, info->attrs[WGDEVICE_A_PEERS], rem) { ret = nla_parse_nested(peer, WGPEER_A_MAX, attr, peer_policy, NULL); if (ret < 0) goto out; ret = set_peer(wg, peer); if (ret < 0) goto out; } } ret = 0; out: mutex_unlock(&wg->device_update_lock); rtnl_unlock(); dev_put(wg->dev); out_nodev: if (info->attrs[WGDEVICE_A_PRIVATE_KEY]) memzero_explicit(nla_data(info->attrs[WGDEVICE_A_PRIVATE_KEY]), nla_len(info->attrs[WGDEVICE_A_PRIVATE_KEY])); return ret; } static const struct genl_ops genl_ops[] = { { .cmd = WG_CMD_GET_DEVICE, .start = wg_get_device_start, .dumpit = wg_get_device_dump, .done = wg_get_device_done, .flags = GENL_UNS_ADMIN_PERM }, { .cmd = WG_CMD_SET_DEVICE, .doit = wg_set_device, .flags = GENL_UNS_ADMIN_PERM } }; static struct genl_family genl_family __ro_after_init = { .ops = genl_ops, .n_ops = ARRAY_SIZE(genl_ops), .resv_start_op = WG_CMD_SET_DEVICE + 1, .name = WG_GENL_NAME, .version = WG_GENL_VERSION, .maxattr = WGDEVICE_A_MAX, .module = THIS_MODULE, .policy = device_policy, .netnsok = true }; int __init wg_genetlink_init(void) { return genl_register_family(&genl_family); } void __exit wg_genetlink_uninit(void) { genl_unregister_family(&genl_family); } |
3 | 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60 61 62 63 64 65 66 67 68 69 70 71 72 73 74 75 76 77 78 79 80 81 82 83 84 85 86 87 88 89 90 91 92 93 94 95 96 97 98 99 100 101 102 103 104 105 106 107 108 109 110 111 112 113 114 115 116 117 118 119 120 121 122 123 124 125 126 127 128 129 130 131 132 133 134 135 136 137 138 139 140 141 142 143 144 145 146 147 148 149 150 151 152 153 154 155 156 157 158 159 160 161 162 163 164 165 | /* SPDX-License-Identifier: GPL-2.0-only */ /* * VMware VMCI Driver * * Copyright (C) 2012 VMware, Inc. All rights reserved. */ #ifndef _VMCI_QUEUE_PAIR_H_ #define _VMCI_QUEUE_PAIR_H_ #include <linux/vmw_vmci_defs.h> #include <linux/types.h> #include "vmci_context.h" /* Callback needed for correctly waiting on events. */ typedef int (*vmci_event_release_cb) (void *client_data); /* Guest device port I/O. */ struct ppn_set { u64 num_produce_pages; u64 num_consume_pages; u64 *produce_ppns; u64 *consume_ppns; bool initialized; }; /* VMCIqueue_pairAllocInfo */ struct vmci_qp_alloc_info { struct vmci_handle handle; u32 peer; u32 flags; u64 produce_size; u64 consume_size; u64 ppn_va; /* Start VA of queue pair PPNs. */ u64 num_ppns; s32 result; u32 version; }; /* VMCIqueue_pairSetVAInfo */ struct vmci_qp_set_va_info { struct vmci_handle handle; u64 va; /* Start VA of queue pair PPNs. */ u64 num_ppns; u32 version; s32 result; }; /* * For backwards compatibility, here is a version of the * VMCIqueue_pairPageFileInfo before host support end-points was added. * Note that the current version of that structure requires VMX to * pass down the VA of the mapped file. Before host support was added * there was nothing of the sort. So, when the driver sees the ioctl * with a parameter that is the sizeof * VMCIqueue_pairPageFileInfo_NoHostQP then it can infer that the version * of VMX running can't attach to host end points because it doesn't * provide the VA of the mapped files. * * The Linux driver doesn't get an indication of the size of the * structure passed down from user space. So, to fix a long standing * but unfiled bug, the _pad field has been renamed to version. * Existing versions of VMX always initialize the PageFileInfo * structure so that _pad, er, version is set to 0. * * A version value of 1 indicates that the size of the structure has * been increased to include two UVA's: produce_uva and consume_uva. * These UVA's are of the mmap()'d queue contents backing files. * * In addition, if when VMX is sending down the * VMCIqueue_pairPageFileInfo structure it gets an error then it will * try again with the _NoHostQP version of the file to see if an older * VMCI kernel module is running. */ /* VMCIqueue_pairPageFileInfo */ struct vmci_qp_page_file_info { struct vmci_handle handle; u64 produce_page_file; /* User VA. */ u64 consume_page_file; /* User VA. */ u64 produce_page_file_size; /* Size of the file name array. */ u64 consume_page_file_size; /* Size of the file name array. */ s32 result; u32 version; /* Was _pad. */ u64 produce_va; /* User VA of the mapped file. */ u64 consume_va; /* User VA of the mapped file. */ }; /* vmci queuepair detach info */ struct vmci_qp_dtch_info { struct vmci_handle handle; s32 result; u32 _pad; }; /* * struct vmci_qp_page_store describes how the memory of a given queue pair * is backed. When the queue pair is between the host and a guest, the * page store consists of references to the guest pages. On vmkernel, * this is a list of PPNs, and on hosted, it is a user VA where the * queue pair is mapped into the VMX address space. */ struct vmci_qp_page_store { /* Reference to pages backing the queue pair. */ u64 pages; /* Length of pageList/virtual address range (in pages). */ u32 len; }; /* * This data type contains the information about a queue. * There are two queues (hence, queue pairs) per transaction model between a * pair of end points, A & B. One queue is used by end point A to transmit * commands and responses to B. The other queue is used by B to transmit * commands and responses. * * struct vmci_queue_kern_if is a per-OS defined Queue structure. It contains * either a direct pointer to the linear address of the buffer contents or a * pointer to structures which help the OS locate those data pages. See * vmciKernelIf.c for each platform for its definition. */ struct vmci_queue { struct vmci_queue_header *q_header; struct vmci_queue_header *saved_header; struct vmci_queue_kern_if *kernel_if; }; /* * Utility function that checks whether the fields of the page * store contain valid values. * Result: * true if the page store is wellformed. false otherwise. */ static inline bool VMCI_QP_PAGESTORE_IS_WELLFORMED(struct vmci_qp_page_store *page_store) { return page_store->len >= 2; } void vmci_qp_broker_exit(void); int vmci_qp_broker_alloc(struct vmci_handle handle, u32 peer, u32 flags, u32 priv_flags, u64 produce_size, u64 consume_size, struct vmci_qp_page_store *page_store, struct vmci_ctx *context); int vmci_qp_broker_set_page_store(struct vmci_handle handle, u64 produce_uva, u64 consume_uva, struct vmci_ctx *context); int vmci_qp_broker_detach(struct vmci_handle handle, struct vmci_ctx *context); void vmci_qp_guest_endpoints_exit(void); int vmci_qp_alloc(struct vmci_handle *handle, struct vmci_queue **produce_q, u64 produce_size, struct vmci_queue **consume_q, u64 consume_size, u32 peer, u32 flags, u32 priv_flags, bool guest_endpoint, vmci_event_release_cb wakeup_cb, void *client_data); int vmci_qp_broker_map(struct vmci_handle handle, struct vmci_ctx *context, u64 guest_mem); int vmci_qp_broker_unmap(struct vmci_handle handle, struct vmci_ctx *context, u32 gid); #endif /* _VMCI_QUEUE_PAIR_H_ */ |
376 | 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60 61 62 63 64 65 66 67 68 69 70 71 72 73 74 75 76 77 78 79 80 81 82 83 84 85 86 87 88 89 90 91 92 93 94 95 96 97 98 99 100 101 102 103 104 105 106 107 108 109 110 111 112 113 114 115 116 117 118 119 120 121 122 123 124 125 126 127 128 129 130 131 132 133 134 135 136 137 138 139 140 141 142 143 144 145 146 147 148 149 150 151 152 153 154 155 156 157 158 159 160 161 162 163 164 165 166 167 168 169 170 171 172 173 174 175 176 177 178 179 180 181 182 183 184 185 186 187 188 189 190 191 192 193 194 195 196 197 198 199 200 201 202 203 204 205 206 207 208 209 210 211 212 213 214 215 216 217 218 219 220 221 222 223 224 225 226 227 228 229 230 231 232 233 234 235 236 237 238 239 240 241 242 243 244 245 246 247 248 249 250 251 252 253 254 255 256 257 258 259 260 261 262 263 264 265 266 267 268 269 270 271 272 273 274 275 276 277 278 279 280 281 282 283 284 285 286 287 288 289 290 291 292 293 294 295 296 297 298 299 300 301 302 303 304 | /* SPDX-License-Identifier: GPL-2.0-only */ /* * AppArmor security module * * This file contains AppArmor lib definitions * * 2017 Canonical Ltd. */ #ifndef __AA_LIB_H #define __AA_LIB_H #include <linux/slab.h> #include <linux/fs.h> #include <linux/lsm_hooks.h> #include "match.h" /* * DEBUG remains global (no per profile flag) since it is mostly used in sysctl * which is not related to profile accesses. */ #define DEBUG_ON (aa_g_debug) /* * split individual debug cases out in preparation for finer grained * debug controls in the future. */ #define AA_DEBUG_LABEL DEBUG_ON #define dbg_printk(__fmt, __args...) pr_debug(__fmt, ##__args) #define AA_DEBUG(fmt, args...) \ do { \ if (DEBUG_ON) \ pr_debug_ratelimited("AppArmor: " fmt, ##args); \ } while (0) #define AA_WARN(X) WARN((X), "APPARMOR WARN %s: %s\n", __func__, #X) #define AA_BUG(X, args...) \ do { \ _Pragma("GCC diagnostic ignored \"-Wformat-zero-length\""); \ AA_BUG_FMT((X), "" args); \ _Pragma("GCC diagnostic warning \"-Wformat-zero-length\""); \ } while (0) #ifdef CONFIG_SECURITY_APPARMOR_DEBUG_ASSERTS #define AA_BUG_FMT(X, fmt, args...) \ WARN((X), "AppArmor WARN %s: (" #X "): " fmt, __func__, ##args) #else #define AA_BUG_FMT(X, fmt, args...) no_printk(fmt, ##args) #endif #define AA_ERROR(fmt, args...) \ pr_err_ratelimited("AppArmor: " fmt, ##args) /* Flag indicating whether initialization completed */ extern int apparmor_initialized; /* fn's in lib */ const char *skipn_spaces(const char *str, size_t n); char *aa_split_fqname(char *args, char **ns_name); const char *aa_splitn_fqname(const char *fqname, size_t n, const char **ns_name, size_t *ns_len); void aa_info_message(const char *str); /* Security blob offsets */ extern struct lsm_blob_sizes apparmor_blob_sizes; /** * aa_strneq - compare null terminated @str to a non null terminated substring * @str: a null terminated string * @sub: a substring, not necessarily null terminated * @len: length of @sub to compare * * The @str string must be full consumed for this to be considered a match */ static inline bool aa_strneq(const char *str, const char *sub, int len) { return !strncmp(str, sub, len) && !str[len]; } /** * aa_dfa_null_transition - step to next state after null character * @dfa: the dfa to match against * @start: the state of the dfa to start matching in * * aa_dfa_null_transition transitions to the next state after a null * character which is not used in standard matching and is only * used to separate pairs. */ static inline aa_state_t aa_dfa_null_transition(struct aa_dfa *dfa, aa_state_t start) { /* the null transition only needs the string's null terminator byte */ return aa_dfa_next(dfa, start, 0); } static inline bool path_mediated_fs(struct dentry *dentry) { return !(dentry->d_sb->s_flags & SB_NOUSER); } struct aa_str_table { int size; char **table; }; void aa_free_str_table(struct aa_str_table *table); struct counted_str { struct kref count; char name[]; }; #define str_to_counted(str) \ ((struct counted_str *)(str - offsetof(struct counted_str, name))) #define __counted /* atm just a notation */ void aa_str_kref(struct kref *kref); char *aa_str_alloc(int size, gfp_t gfp); static inline __counted char *aa_get_str(__counted char *str) { if (str) kref_get(&(str_to_counted(str)->count)); return str; } static inline void aa_put_str(__counted char *str) { if (str) kref_put(&str_to_counted(str)->count, aa_str_kref); } /* struct aa_policy - common part of both namespaces and profiles * @name: name of the object * @hname - The hierarchical name * @list: list policy object is on * @profiles: head of the profiles list contained in the object */ struct aa_policy { const char *name; __counted char *hname; struct list_head list; struct list_head profiles; }; /** * basename - find the last component of an hname * @name: hname to find the base profile name component of (NOT NULL) * * Returns: the tail (base profile name) name component of an hname */ static inline const char *basename(const char *hname) { char *split; hname = strim((char *)hname); for (split = strstr(hname, "//"); split; split = strstr(hname, "//")) hname = split + 2; return hname; } /** * __policy_find - find a policy by @name on a policy list * @head: list to search (NOT NULL) * @name: name to search for (NOT NULL) * * Requires: rcu_read_lock be held * * Returns: unrefcounted policy that match @name or NULL if not found */ static inline struct aa_policy *__policy_find(struct list_head *head, const char *name) { struct aa_policy *policy; list_for_each_entry_rcu(policy, head, list) { if (!strcmp(policy->name, name)) return policy; } return NULL; } /** * __policy_strn_find - find a policy that's name matches @len chars of @str * @head: list to search (NOT NULL) * @str: string to search for (NOT NULL) * @len: length of match required * * Requires: rcu_read_lock be held * * Returns: unrefcounted policy that match @str or NULL if not found * * if @len == strlen(@strlen) then this is equiv to __policy_find * other wise it allows searching for policy by a partial match of name */ static inline struct aa_policy *__policy_strn_find(struct list_head *head, const char *str, int len) { struct aa_policy *policy; list_for_each_entry_rcu(policy, head, list) { if (aa_strneq(policy->name, str, len)) return policy; } return NULL; } bool aa_policy_init(struct aa_policy *policy, const char *prefix, const char *name, gfp_t gfp); void aa_policy_destroy(struct aa_policy *policy); /* * fn_label_build - abstract out the build of a label transition * @L: label the transition is being computed for * @P: profile parameter derived from L by this macro, can be passed to FN * @GFP: memory allocation type to use * @FN: fn to call for each profile transition. @P is set to the profile * * Returns: new label on success * ERR_PTR if build @FN fails * NULL if label_build fails due to low memory conditions * * @FN must return a label or ERR_PTR on failure. NULL is not allowed */ #define fn_label_build(L, P, GFP, FN) \ ({ \ __label__ __do_cleanup, __done; \ struct aa_label *__new_; \ \ if ((L)->size > 1) { \ /* TODO: add cache of transitions already done */ \ struct label_it __i; \ int __j, __k, __count; \ DEFINE_VEC(label, __lvec); \ DEFINE_VEC(profile, __pvec); \ if (vec_setup(label, __lvec, (L)->size, (GFP))) { \ __new_ = NULL; \ goto __done; \ } \ __j = 0; \ label_for_each(__i, (L), (P)) { \ __new_ = (FN); \ AA_BUG(!__new_); \ if (IS_ERR(__new_)) \ goto __do_cleanup; \ __lvec[__j++] = __new_; \ } \ for (__j = __count = 0; __j < (L)->size; __j++) \ __count += __lvec[__j]->size; \ if (!vec_setup(profile, __pvec, __count, (GFP))) { \ for (__j = __k = 0; __j < (L)->size; __j++) { \ label_for_each(__i, __lvec[__j], (P)) \ __pvec[__k++] = aa_get_profile(P); \ } \ __count -= aa_vec_unique(__pvec, __count, 0); \ if (__count > 1) { \ __new_ = aa_vec_find_or_create_label(__pvec,\ __count, (GFP)); \ /* only fails if out of Mem */ \ if (!__new_) \ __new_ = NULL; \ } else \ __new_ = aa_get_label(&__pvec[0]->label); \ vec_cleanup(profile, __pvec, __count); \ } else \ __new_ = NULL; \ __do_cleanup: \ vec_cleanup(label, __lvec, (L)->size); \ } else { \ (P) = labels_profile(L); \ __new_ = (FN); \ } \ __done: \ if (!__new_) \ AA_DEBUG("label build failed\n"); \ (__new_); \ }) #define __fn_build_in_ns(NS, P, NS_FN, OTHER_FN) \ ({ \ struct aa_label *__new; \ if ((P)->ns != (NS)) \ __new = (OTHER_FN); \ else \ __new = (NS_FN); \ (__new); \ }) #define fn_label_build_in_ns(L, P, GFP, NS_FN, OTHER_FN) \ ({ \ fn_label_build((L), (P), (GFP), \ __fn_build_in_ns(labels_ns(L), (P), (NS_FN), (OTHER_FN))); \ }) #endif /* __AA_LIB_H */ |
122 121 121 121 122 121 129 43 127 129 106 129 129 129 55 55 55 6 6 129 129 129 129 128 129 3 3 3 3 3 32 32 32 32 32 32 32 32 32 42 3 43 31 31 32 32 32 32 32 104 127 127 127 127 129 127 129 43 40 43 19 43 43 43 43 43 43 43 43 43 43 36 129 127 32 32 32 32 32 22 32 22 32 22 32 32 32 32 31 32 32 32 32 31 32 32 31 32 32 32 32 32 32 32 32 55 55 55 55 6 6 55 127 103 127 102 127 127 32 31 32 32 32 32 32 3 32 3 3 3 3 3 3 3 3 3 31 31 3 3 31 31 31 2 31 17 31 31 31 31 31 31 31 2 128 127 55 129 129 129 128 129 129 129 105 129 3 3 3 25 24 25 | 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60 61 62 63 64 65 66 67 68 69 70 71 72 73 74 75 76 77 78 79 80 81 82 83 84 85 86 87 88 89 90 91 92 93 94 95 96 97 98 99 100 101 102 103 104 105 106 107 108 109 110 111 112 113 114 115 116 117 118 119 120 121 122 123 124 125 126 127 128 129 130 131 132 133 134 135 136 137 138 139 140 141 142 143 144 145 146 147 148 149 150 151 152 153 154 155 156 157 158 159 160 161 162 163 164 165 166 167 168 169 170 171 172 173 174 175 176 177 178 179 180 181 182 183 184 185 186 187 188 189 190 191 192 193 194 195 196 197 198 199 200 201 202 203 204 205 206 207 208 209 210 211 212 213 214 215 216 217 218 219 220 221 222 223 224 225 226 227 228 229 230 231 232 233 234 235 236 237 238 239 240 241 242 243 244 245 246 247 248 249 250 251 252 253 254 255 256 257 258 259 260 261 262 263 264 265 266 267 268 269 270 271 272 273 274 275 276 277 278 279 280 281 282 283 284 285 286 287 288 289 290 291 292 293 294 295 296 297 298 299 300 301 302 303 304 305 306 307 308 309 310 311 312 313 314 315 316 317 318 319 320 321 322 323 324 325 326 327 328 329 330 331 332 333 334 335 336 337 338 339 340 341 342 343 344 345 346 347 348 349 350 351 352 353 354 355 356 357 358 359 360 361 362 363 364 365 366 367 368 369 370 371 372 373 374 375 376 377 378 379 380 381 382 383 384 385 386 387 388 389 390 391 392 393 394 395 396 397 398 399 400 401 402 403 404 405 406 407 408 409 410 411 412 413 414 415 416 417 418 419 420 421 422 423 424 425 426 427 428 429 430 431 432 433 434 435 436 437 438 439 440 441 442 443 444 445 446 447 448 449 450 451 452 453 454 455 456 457 458 459 460 461 462 463 464 465 466 467 468 469 470 471 472 473 474 475 476 477 478 479 480 481 482 483 484 485 486 487 488 489 490 491 492 493 494 495 496 497 498 499 500 501 502 503 504 505 506 507 508 509 510 511 512 513 514 515 516 517 518 519 520 521 522 523 524 525 526 527 528 529 530 531 532 533 534 535 536 537 538 539 540 541 542 543 544 545 546 547 548 549 550 551 552 553 554 555 556 557 558 559 560 561 562 563 564 565 566 567 568 569 570 571 572 573 574 575 576 577 578 579 580 581 582 583 584 585 586 587 588 589 590 591 592 593 594 595 596 597 598 599 600 601 602 603 604 605 606 607 608 609 610 611 612 613 614 615 616 617 618 619 620 621 622 623 624 625 626 627 628 629 630 631 632 633 634 635 636 637 638 639 640 641 642 643 644 645 646 647 648 649 650 651 652 653 654 655 656 657 658 659 660 661 662 663 664 665 666 667 668 669 670 671 672 673 674 675 676 677 678 679 680 681 682 683 684 685 686 687 688 689 690 691 692 693 694 695 696 697 698 699 700 701 702 703 704 705 706 707 708 709 710 711 712 713 714 715 716 717 718 719 720 721 722 723 724 725 726 727 728 729 730 731 732 733 734 735 736 737 738 739 740 741 742 743 744 745 746 747 748 749 750 751 752 753 754 755 756 757 758 759 760 761 762 763 764 765 766 767 768 769 770 771 772 773 774 775 776 777 778 779 780 781 782 783 784 785 786 787 788 789 790 791 792 793 794 795 796 797 798 799 800 801 802 803 804 805 806 807 808 809 810 811 812 813 814 815 816 817 818 819 820 821 822 823 824 825 826 827 828 829 830 831 832 833 834 835 836 837 838 839 840 841 842 843 844 845 846 847 848 849 850 851 852 853 854 855 856 857 858 859 860 861 862 863 864 865 866 867 868 869 870 871 872 873 874 875 876 877 878 879 880 881 882 883 884 885 886 887 888 889 890 891 892 893 894 895 896 897 898 899 900 901 902 903 904 905 906 907 908 909 910 911 912 913 914 915 916 917 918 919 920 921 922 923 924 925 926 927 928 929 930 931 932 933 934 935 936 937 938 939 940 941 942 943 944 945 946 947 948 949 950 951 952 953 954 955 956 957 958 959 960 961 962 963 964 965 966 967 968 969 970 971 972 973 974 975 976 977 978 979 980 981 982 983 984 985 986 987 988 989 990 991 992 993 994 995 996 997 998 999 1000 1001 1002 1003 1004 1005 1006 1007 1008 1009 1010 1011 1012 1013 1014 1015 1016 1017 1018 1019 1020 1021 1022 1023 1024 1025 1026 1027 1028 1029 1030 1031 1032 1033 1034 1035 1036 1037 1038 1039 1040 1041 1042 1043 1044 1045 1046 1047 1048 1049 1050 1051 1052 1053 1054 1055 1056 1057 1058 1059 1060 1061 1062 1063 1064 1065 1066 1067 1068 1069 1070 1071 1072 1073 1074 1075 1076 1077 1078 1079 1080 1081 1082 1083 1084 1085 1086 1087 1088 1089 1090 1091 1092 1093 1094 1095 1096 1097 1098 1099 1100 1101 1102 1103 1104 1105 1106 1107 1108 1109 1110 1111 1112 1113 1114 1115 1116 1117 1118 1119 1120 1121 1122 1123 1124 1125 1126 1127 1128 1129 1130 1131 1132 1133 1134 1135 1136 1137 1138 1139 1140 1141 1142 1143 1144 1145 1146 1147 1148 1149 1150 1151 1152 1153 1154 1155 1156 1157 1158 1159 1160 1161 1162 1163 1164 1165 1166 1167 1168 1169 1170 1171 1172 1173 1174 1175 1176 1177 1178 1179 1180 1181 1182 1183 1184 1185 1186 1187 1188 1189 1190 1191 1192 1193 1194 1195 1196 1197 1198 1199 1200 1201 1202 1203 1204 1205 1206 1207 1208 1209 1210 1211 1212 1213 1214 1215 1216 1217 1218 1219 1220 1221 1222 1223 1224 1225 1226 1227 1228 1229 1230 1231 1232 1233 1234 1235 1236 1237 1238 1239 1240 1241 1242 1243 1244 1245 1246 1247 1248 1249 1250 1251 1252 1253 1254 1255 1256 1257 1258 1259 1260 1261 1262 1263 1264 1265 1266 1267 1268 1269 1270 1271 1272 1273 1274 1275 1276 1277 1278 1279 1280 1281 1282 1283 1284 1285 1286 1287 1288 1289 1290 1291 1292 1293 1294 1295 1296 1297 1298 1299 1300 1301 1302 1303 1304 1305 1306 1307 1308 1309 1310 1311 1312 1313 1314 1315 1316 1317 1318 1319 1320 1321 1322 1323 1324 1325 1326 1327 1328 1329 1330 1331 1332 1333 1334 1335 1336 1337 1338 1339 1340 1341 1342 1343 1344 1345 1346 1347 1348 1349 1350 1351 1352 1353 1354 1355 1356 1357 1358 1359 1360 1361 1362 1363 1364 1365 1366 1367 1368 1369 1370 1371 1372 1373 1374 1375 1376 1377 1378 1379 1380 1381 1382 1383 1384 1385 1386 1387 1388 1389 1390 1391 1392 1393 1394 1395 1396 1397 1398 1399 1400 1401 1402 1403 1404 1405 1406 1407 1408 1409 1410 1411 1412 1413 1414 1415 1416 1417 1418 1419 1420 1421 1422 1423 1424 1425 1426 1427 1428 1429 1430 1431 1432 1433 1434 1435 1436 1437 1438 1439 1440 1441 1442 1443 1444 1445 1446 1447 1448 1449 1450 1451 1452 1453 1454 1455 1456 1457 1458 1459 1460 1461 1462 1463 1464 1465 1466 1467 1468 1469 1470 1471 1472 1473 1474 1475 1476 1477 1478 1479 1480 1481 1482 1483 1484 1485 1486 1487 1488 1489 1490 1491 1492 1493 1494 1495 1496 1497 1498 1499 1500 1501 1502 1503 1504 1505 1506 1507 1508 1509 1510 1511 1512 1513 1514 1515 1516 1517 1518 1519 1520 1521 1522 1523 1524 1525 1526 1527 1528 1529 1530 1531 1532 1533 1534 1535 1536 1537 1538 1539 1540 1541 1542 1543 1544 1545 1546 1547 1548 1549 1550 1551 1552 1553 1554 1555 1556 1557 1558 1559 1560 1561 1562 1563 1564 1565 1566 1567 1568 1569 1570 1571 1572 1573 1574 1575 1576 1577 1578 1579 1580 1581 1582 1583 1584 1585 1586 1587 1588 1589 1590 1591 1592 1593 1594 1595 1596 1597 1598 1599 1600 1601 1602 1603 1604 1605 1606 1607 1608 1609 1610 1611 1612 1613 1614 1615 1616 1617 1618 1619 1620 1621 1622 1623 1624 1625 1626 1627 1628 1629 1630 1631 1632 1633 1634 1635 1636 1637 1638 1639 1640 1641 1642 1643 1644 1645 1646 1647 1648 1649 1650 1651 1652 1653 1654 1655 1656 1657 1658 1659 1660 1661 1662 1663 1664 1665 1666 1667 1668 1669 1670 1671 1672 1673 1674 1675 1676 1677 1678 1679 1680 1681 1682 1683 1684 1685 1686 1687 1688 1689 1690 1691 1692 1693 1694 1695 1696 1697 1698 1699 1700 1701 1702 1703 1704 1705 1706 1707 1708 1709 1710 1711 1712 1713 1714 1715 1716 1717 1718 1719 1720 1721 1722 1723 1724 1725 1726 1727 1728 1729 1730 1731 1732 1733 1734 1735 1736 1737 1738 1739 1740 1741 1742 1743 1744 1745 1746 1747 1748 1749 1750 1751 1752 1753 1754 1755 1756 1757 1758 1759 1760 1761 1762 1763 1764 1765 1766 1767 1768 1769 1770 1771 1772 1773 1774 1775 1776 1777 1778 1779 1780 1781 1782 1783 1784 1785 1786 1787 1788 1789 1790 1791 1792 1793 1794 1795 1796 1797 1798 1799 1800 1801 1802 1803 1804 1805 1806 1807 1808 1809 1810 1811 1812 1813 1814 1815 1816 1817 1818 1819 1820 1821 1822 1823 1824 1825 1826 1827 1828 1829 1830 1831 1832 1833 1834 1835 1836 1837 1838 1839 1840 1841 1842 1843 1844 1845 1846 1847 1848 1849 1850 1851 1852 1853 1854 1855 1856 1857 1858 1859 1860 1861 1862 1863 1864 1865 1866 1867 1868 1869 1870 1871 1872 1873 1874 1875 1876 1877 1878 1879 1880 1881 1882 1883 1884 1885 1886 1887 1888 1889 1890 1891 1892 1893 1894 1895 1896 1897 1898 1899 1900 1901 1902 1903 1904 1905 1906 1907 1908 1909 1910 1911 1912 1913 1914 1915 1916 1917 1918 1919 1920 1921 1922 1923 1924 1925 1926 1927 1928 1929 1930 1931 1932 1933 1934 1935 1936 1937 1938 1939 1940 1941 1942 1943 1944 1945 1946 1947 1948 1949 1950 1951 1952 1953 1954 1955 1956 1957 1958 1959 1960 1961 1962 1963 1964 1965 1966 1967 1968 1969 1970 1971 1972 1973 1974 1975 1976 1977 1978 1979 1980 1981 1982 1983 1984 1985 1986 1987 1988 1989 1990 1991 1992 1993 1994 1995 1996 1997 1998 1999 2000 2001 2002 2003 2004 2005 2006 2007 2008 2009 2010 2011 2012 2013 2014 2015 2016 2017 2018 2019 2020 2021 2022 2023 2024 2025 2026 2027 2028 2029 2030 2031 2032 2033 2034 2035 2036 2037 2038 2039 2040 2041 2042 2043 2044 2045 2046 2047 2048 2049 2050 2051 2052 2053 2054 2055 2056 2057 2058 2059 2060 2061 2062 2063 2064 2065 2066 2067 2068 2069 2070 2071 2072 2073 2074 2075 2076 2077 2078 2079 2080 2081 2082 2083 2084 2085 2086 2087 2088 2089 2090 2091 2092 2093 2094 2095 2096 2097 2098 2099 2100 2101 2102 2103 2104 2105 2106 2107 2108 2109 2110 2111 2112 2113 2114 2115 2116 2117 2118 2119 2120 2121 2122 2123 2124 2125 2126 2127 2128 2129 2130 2131 2132 2133 2134 2135 2136 2137 2138 2139 2140 2141 2142 2143 2144 2145 2146 2147 2148 2149 2150 2151 2152 2153 2154 2155 2156 2157 2158 2159 2160 2161 2162 2163 2164 2165 2166 2167 2168 2169 2170 2171 2172 2173 2174 2175 2176 2177 2178 2179 2180 2181 2182 2183 2184 2185 2186 2187 2188 2189 2190 2191 2192 2193 2194 2195 2196 2197 2198 2199 2200 2201 2202 2203 2204 2205 2206 2207 2208 2209 2210 2211 2212 2213 2214 2215 2216 2217 2218 2219 2220 2221 2222 2223 2224 2225 2226 2227 2228 2229 2230 2231 2232 2233 2234 2235 2236 2237 2238 2239 2240 2241 2242 2243 2244 2245 2246 2247 2248 2249 2250 2251 2252 2253 2254 2255 2256 2257 2258 2259 2260 2261 2262 2263 2264 2265 2266 2267 2268 2269 2270 2271 2272 2273 2274 2275 2276 2277 2278 2279 2280 2281 2282 2283 2284 2285 2286 2287 2288 2289 2290 2291 2292 2293 2294 2295 2296 2297 2298 2299 2300 2301 2302 2303 2304 2305 2306 2307 2308 2309 2310 2311 2312 2313 2314 2315 2316 2317 2318 2319 2320 2321 2322 2323 2324 2325 2326 2327 2328 2329 2330 2331 2332 2333 2334 2335 2336 2337 2338 2339 2340 2341 2342 2343 2344 2345 2346 2347 2348 2349 2350 2351 2352 2353 2354 2355 2356 2357 2358 2359 2360 2361 2362 2363 2364 2365 2366 2367 2368 2369 2370 2371 2372 2373 2374 2375 2376 2377 2378 2379 2380 2381 2382 2383 2384 2385 2386 2387 2388 2389 2390 2391 2392 2393 2394 2395 2396 2397 2398 2399 2400 2401 2402 2403 2404 2405 2406 2407 2408 2409 2410 2411 2412 2413 2414 2415 2416 2417 2418 2419 2420 2421 2422 2423 2424 2425 2426 2427 2428 2429 2430 2431 2432 2433 2434 2435 2436 2437 2438 2439 2440 2441 2442 2443 2444 2445 2446 2447 2448 2449 2450 2451 2452 2453 2454 2455 2456 2457 2458 2459 2460 2461 2462 2463 2464 2465 2466 2467 2468 2469 2470 2471 2472 2473 2474 2475 2476 2477 2478 2479 2480 2481 2482 2483 2484 2485 2486 2487 2488 2489 2490 2491 2492 2493 2494 2495 2496 2497 2498 2499 2500 2501 2502 2503 2504 2505 2506 2507 2508 2509 2510 2511 2512 2513 2514 2515 2516 2517 2518 2519 2520 2521 2522 2523 2524 2525 2526 2527 2528 2529 2530 2531 2532 2533 2534 2535 2536 2537 2538 2539 2540 2541 2542 2543 2544 2545 2546 2547 2548 2549 2550 2551 2552 2553 2554 2555 2556 2557 2558 2559 2560 2561 2562 2563 2564 2565 2566 2567 2568 2569 2570 2571 2572 2573 2574 2575 2576 2577 2578 2579 2580 2581 2582 2583 2584 2585 2586 2587 2588 2589 2590 2591 2592 2593 2594 2595 2596 2597 2598 2599 2600 2601 2602 2603 2604 2605 2606 2607 2608 2609 2610 2611 2612 2613 2614 2615 2616 2617 2618 2619 2620 2621 2622 2623 2624 2625 2626 2627 2628 2629 2630 2631 2632 2633 2634 2635 2636 2637 2638 2639 2640 2641 2642 2643 2644 2645 2646 2647 2648 2649 2650 2651 2652 2653 2654 2655 2656 2657 2658 2659 2660 2661 2662 2663 2664 2665 2666 2667 2668 2669 2670 2671 2672 2673 2674 2675 2676 2677 2678 2679 2680 2681 2682 2683 2684 2685 2686 2687 2688 2689 2690 2691 2692 2693 2694 2695 2696 2697 2698 2699 2700 2701 2702 2703 2704 2705 2706 2707 2708 2709 2710 2711 2712 2713 2714 2715 2716 2717 2718 2719 2720 2721 2722 2723 2724 2725 2726 2727 2728 2729 2730 2731 2732 2733 2734 2735 2736 2737 2738 2739 2740 2741 2742 2743 2744 2745 2746 2747 2748 2749 2750 2751 2752 2753 2754 2755 2756 2757 2758 2759 2760 2761 2762 2763 2764 2765 2766 2767 2768 2769 2770 2771 2772 2773 2774 2775 2776 2777 2778 2779 2780 2781 2782 2783 2784 2785 2786 2787 2788 2789 2790 2791 2792 2793 2794 2795 2796 2797 2798 2799 2800 2801 2802 2803 2804 2805 2806 2807 2808 2809 2810 2811 2812 2813 2814 2815 2816 2817 2818 2819 2820 2821 2822 2823 2824 2825 2826 2827 2828 2829 2830 2831 2832 2833 2834 2835 2836 2837 2838 2839 2840 2841 2842 2843 2844 2845 2846 2847 2848 2849 2850 2851 2852 2853 2854 2855 2856 2857 2858 2859 2860 2861 2862 2863 2864 2865 2866 2867 2868 2869 2870 2871 2872 2873 2874 2875 2876 2877 2878 2879 2880 2881 2882 2883 2884 2885 2886 2887 2888 2889 2890 2891 2892 2893 2894 2895 2896 2897 2898 2899 2900 2901 2902 2903 2904 2905 2906 2907 2908 2909 2910 2911 2912 2913 2914 2915 2916 2917 2918 2919 2920 2921 2922 2923 2924 2925 2926 2927 2928 2929 2930 2931 2932 2933 2934 2935 2936 2937 2938 2939 2940 2941 2942 2943 2944 2945 2946 2947 2948 2949 2950 2951 2952 2953 2954 2955 2956 2957 2958 2959 2960 2961 2962 2963 2964 2965 2966 2967 2968 2969 2970 2971 2972 2973 2974 2975 2976 2977 2978 2979 2980 2981 2982 2983 2984 2985 2986 2987 2988 2989 2990 2991 2992 2993 2994 2995 2996 2997 2998 2999 3000 3001 3002 3003 3004 3005 3006 3007 3008 3009 3010 3011 3012 3013 3014 3015 3016 3017 3018 3019 3020 3021 3022 3023 3024 3025 3026 3027 3028 3029 3030 3031 3032 3033 3034 3035 3036 3037 3038 3039 3040 3041 3042 3043 3044 3045 3046 3047 3048 3049 3050 3051 3052 3053 3054 3055 3056 3057 3058 3059 3060 3061 3062 3063 3064 3065 3066 3067 3068 3069 3070 3071 3072 3073 3074 3075 3076 3077 3078 3079 3080 3081 3082 3083 3084 3085 3086 3087 3088 3089 3090 3091 3092 3093 3094 3095 3096 3097 3098 3099 3100 3101 3102 3103 3104 3105 3106 3107 3108 3109 3110 3111 3112 3113 3114 3115 3116 3117 3118 3119 3120 3121 3122 3123 3124 3125 3126 3127 3128 3129 3130 3131 3132 3133 3134 3135 3136 3137 3138 3139 3140 3141 3142 3143 3144 3145 3146 3147 3148 3149 3150 3151 3152 3153 3154 3155 3156 3157 3158 3159 3160 3161 3162 3163 3164 3165 3166 3167 3168 3169 3170 3171 3172 3173 3174 3175 3176 3177 3178 3179 3180 3181 3182 3183 3184 3185 3186 3187 3188 3189 3190 3191 3192 3193 3194 3195 3196 3197 3198 3199 3200 3201 3202 3203 3204 3205 3206 3207 3208 3209 3210 3211 3212 3213 3214 3215 3216 3217 3218 3219 3220 3221 3222 3223 3224 3225 3226 3227 3228 3229 3230 3231 3232 3233 3234 3235 3236 3237 3238 3239 3240 3241 3242 3243 3244 3245 3246 3247 3248 3249 3250 3251 3252 3253 3254 3255 3256 3257 3258 3259 3260 3261 3262 3263 3264 3265 3266 3267 3268 3269 3270 3271 3272 3273 3274 3275 3276 3277 3278 3279 3280 3281 3282 3283 3284 3285 3286 3287 3288 3289 3290 3291 3292 3293 3294 3295 3296 3297 3298 3299 3300 3301 3302 3303 3304 3305 3306 3307 3308 3309 3310 3311 3312 3313 3314 3315 3316 3317 3318 3319 3320 3321 3322 3323 3324 3325 3326 3327 3328 3329 3330 3331 3332 3333 3334 3335 3336 3337 3338 3339 3340 3341 3342 3343 3344 3345 3346 3347 3348 3349 3350 3351 3352 3353 3354 3355 3356 3357 3358 3359 3360 3361 3362 3363 3364 3365 3366 3367 3368 3369 3370 3371 3372 3373 3374 3375 3376 3377 3378 3379 3380 3381 3382 3383 3384 3385 3386 3387 3388 3389 3390 3391 3392 3393 3394 3395 3396 3397 3398 3399 3400 3401 3402 3403 3404 3405 3406 3407 3408 3409 3410 3411 3412 3413 3414 3415 3416 3417 3418 3419 3420 3421 3422 3423 3424 3425 3426 3427 3428 3429 3430 3431 3432 3433 3434 3435 3436 3437 3438 3439 3440 3441 3442 3443 3444 3445 3446 3447 3448 3449 3450 3451 3452 3453 3454 3455 3456 3457 3458 3459 3460 3461 3462 3463 3464 3465 3466 3467 3468 3469 3470 3471 3472 3473 3474 3475 3476 3477 3478 3479 3480 3481 3482 3483 3484 3485 3486 3487 3488 3489 3490 3491 3492 3493 3494 3495 3496 3497 3498 3499 3500 3501 3502 3503 3504 3505 3506 3507 3508 3509 3510 3511 3512 3513 3514 3515 3516 3517 3518 3519 3520 3521 3522 3523 3524 3525 3526 3527 3528 3529 3530 3531 3532 3533 3534 3535 3536 3537 3538 3539 3540 3541 3542 3543 3544 3545 3546 3547 3548 3549 3550 3551 3552 3553 3554 3555 3556 3557 3558 3559 3560 3561 3562 3563 3564 3565 3566 3567 3568 3569 3570 3571 3572 3573 3574 3575 3576 3577 3578 3579 3580 3581 3582 3583 3584 3585 3586 3587 3588 3589 3590 3591 3592 3593 3594 3595 3596 3597 3598 3599 3600 3601 3602 3603 3604 3605 3606 3607 3608 3609 3610 3611 3612 3613 3614 3615 3616 3617 3618 3619 3620 3621 3622 3623 3624 3625 3626 3627 3628 3629 3630 3631 3632 3633 3634 3635 3636 3637 3638 3639 3640 3641 3642 3643 3644 3645 3646 3647 3648 3649 3650 3651 3652 3653 3654 3655 3656 3657 3658 3659 3660 3661 3662 3663 3664 3665 3666 3667 3668 3669 3670 3671 3672 3673 3674 3675 3676 3677 3678 3679 3680 3681 3682 3683 3684 3685 3686 3687 3688 3689 3690 3691 3692 3693 3694 3695 3696 3697 3698 3699 3700 3701 3702 3703 3704 3705 3706 3707 3708 3709 3710 3711 3712 3713 3714 3715 3716 3717 3718 3719 3720 3721 3722 3723 3724 3725 3726 3727 3728 3729 3730 3731 3732 3733 3734 3735 3736 3737 3738 3739 3740 3741 3742 3743 3744 3745 3746 3747 3748 3749 3750 3751 3752 3753 3754 3755 3756 3757 3758 3759 3760 3761 3762 3763 3764 3765 3766 3767 3768 3769 3770 3771 3772 3773 3774 3775 3776 3777 3778 3779 3780 3781 3782 3783 3784 3785 3786 3787 3788 3789 3790 3791 3792 3793 3794 3795 3796 3797 3798 3799 3800 3801 3802 3803 3804 3805 3806 3807 3808 3809 3810 3811 3812 3813 3814 3815 3816 3817 3818 3819 3820 3821 3822 3823 3824 3825 3826 3827 3828 3829 3830 3831 3832 3833 3834 3835 3836 3837 3838 3839 3840 3841 3842 3843 3844 3845 3846 3847 3848 3849 3850 3851 3852 3853 3854 3855 3856 3857 3858 3859 3860 3861 3862 3863 3864 3865 3866 3867 3868 3869 3870 3871 3872 3873 3874 3875 3876 3877 3878 3879 3880 3881 3882 3883 3884 3885 3886 3887 3888 3889 | // SPDX-License-Identifier: GPL-2.0 /* * Copyright (c) 2000-2005 Silicon Graphics, Inc. * All Rights Reserved. */ #include "xfs.h" #include "xfs_fs.h" #include "xfs_shared.h" #include "xfs_format.h" #include "xfs_log_format.h" #include "xfs_trans_resv.h" #include "xfs_mount.h" #include "xfs_errortag.h" #include "xfs_error.h" #include "xfs_trans.h" #include "xfs_trans_priv.h" #include "xfs_log.h" #include "xfs_log_priv.h" #include "xfs_trace.h" #include "xfs_sysfs.h" #include "xfs_sb.h" #include "xfs_health.h" struct kmem_cache *xfs_log_ticket_cache; /* Local miscellaneous function prototypes */ STATIC struct xlog * xlog_alloc_log( struct xfs_mount *mp, struct xfs_buftarg *log_target, xfs_daddr_t blk_offset, int num_bblks); STATIC int xlog_space_left( struct xlog *log, atomic64_t *head); STATIC void xlog_dealloc_log( struct xlog *log); /* local state machine functions */ STATIC void xlog_state_done_syncing( struct xlog_in_core *iclog); STATIC void xlog_state_do_callback( struct xlog *log); STATIC int xlog_state_get_iclog_space( struct xlog *log, int len, struct xlog_in_core **iclog, struct xlog_ticket *ticket, int *logoffsetp); STATIC void xlog_grant_push_ail( struct xlog *log, int need_bytes); STATIC void xlog_sync( struct xlog *log, struct xlog_in_core *iclog, struct xlog_ticket *ticket); #if defined(DEBUG) STATIC void xlog_verify_grant_tail( struct xlog *log); STATIC void xlog_verify_iclog( struct xlog *log, struct xlog_in_core *iclog, int count); STATIC void xlog_verify_tail_lsn( struct xlog *log, struct xlog_in_core *iclog); #else #define xlog_verify_grant_tail(a) #define xlog_verify_iclog(a,b,c) #define xlog_verify_tail_lsn(a,b) #endif STATIC int xlog_iclogs_empty( struct xlog *log); static int xfs_log_cover(struct xfs_mount *); /* * We need to make sure the buffer pointer returned is naturally aligned for the * biggest basic data type we put into it. We have already accounted for this * padding when sizing the buffer. * * However, this padding does not get written into the log, and hence we have to * track the space used by the log vectors separately to prevent log space hangs * due to inaccurate accounting (i.e. a leak) of the used log space through the * CIL context ticket. * * We also add space for the xlog_op_header that describes this region in the * log. This prepends the data region we return to the caller to copy their data * into, so do all the static initialisation of the ophdr now. Because the ophdr * is not 8 byte aligned, we have to be careful to ensure that we align the * start of the buffer such that the region we return to the call is 8 byte * aligned and packed against the tail of the ophdr. */ void * xlog_prepare_iovec( struct xfs_log_vec *lv, struct xfs_log_iovec **vecp, uint type) { struct xfs_log_iovec *vec = *vecp; struct xlog_op_header *oph; uint32_t len; void *buf; if (vec) { ASSERT(vec - lv->lv_iovecp < lv->lv_niovecs); vec++; } else { vec = &lv->lv_iovecp[0]; } len = lv->lv_buf_len + sizeof(struct xlog_op_header); if (!IS_ALIGNED(len, sizeof(uint64_t))) { lv->lv_buf_len = round_up(len, sizeof(uint64_t)) - sizeof(struct xlog_op_header); } vec->i_type = type; vec->i_addr = lv->lv_buf + lv->lv_buf_len; oph = vec->i_addr; oph->oh_clientid = XFS_TRANSACTION; oph->oh_res2 = 0; oph->oh_flags = 0; buf = vec->i_addr + sizeof(struct xlog_op_header); ASSERT(IS_ALIGNED((unsigned long)buf, sizeof(uint64_t))); *vecp = vec; return buf; } static void xlog_grant_sub_space( struct xlog *log, atomic64_t *head, int bytes) { int64_t head_val = atomic64_read(head); int64_t new, old; do { int cycle, space; xlog_crack_grant_head_val(head_val, &cycle, &space); space -= bytes; if (space < 0) { space += log->l_logsize; cycle--; } old = head_val; new = xlog_assign_grant_head_val(cycle, space); head_val = atomic64_cmpxchg(head, old, new); } while (head_val != old); } static void xlog_grant_add_space( struct xlog *log, atomic64_t *head, int bytes) { int64_t head_val = atomic64_read(head); int64_t new, old; do { int tmp; int cycle, space; xlog_crack_grant_head_val(head_val, &cycle, &space); tmp = log->l_logsize - space; if (tmp > bytes) space += bytes; else { space = bytes - tmp; cycle++; } old = head_val; new = xlog_assign_grant_head_val(cycle, space); head_val = atomic64_cmpxchg(head, old, new); } while (head_val != old); } STATIC void xlog_grant_head_init( struct xlog_grant_head *head) { xlog_assign_grant_head(&head->grant, 1, 0); INIT_LIST_HEAD(&head->waiters); spin_lock_init(&head->lock); } STATIC void xlog_grant_head_wake_all( struct xlog_grant_head *head) { struct xlog_ticket *tic; spin_lock(&head->lock); list_for_each_entry(tic, &head->waiters, t_queue) wake_up_process(tic->t_task); spin_unlock(&head->lock); } static inline int xlog_ticket_reservation( struct xlog *log, struct xlog_grant_head *head, struct xlog_ticket *tic) { if (head == &log->l_write_head) { ASSERT(tic->t_flags & XLOG_TIC_PERM_RESERV); return tic->t_unit_res; } if (tic->t_flags & XLOG_TIC_PERM_RESERV) return tic->t_unit_res * tic->t_cnt; return tic->t_unit_res; } STATIC bool xlog_grant_head_wake( struct xlog *log, struct xlog_grant_head *head, int *free_bytes) { struct xlog_ticket *tic; int need_bytes; bool woken_task = false; list_for_each_entry(tic, &head->waiters, t_queue) { /* * There is a chance that the size of the CIL checkpoints in * progress at the last AIL push target calculation resulted in * limiting the target to the log head (l_last_sync_lsn) at the * time. This may not reflect where the log head is now as the * CIL checkpoints may have completed. * * Hence when we are woken here, it may be that the head of the * log that has moved rather than the tail. As the tail didn't * move, there still won't be space available for the * reservation we require. However, if the AIL has already * pushed to the target defined by the old log head location, we * will hang here waiting for something else to update the AIL * push target. * * Therefore, if there isn't space to wake the first waiter on * the grant head, we need to push the AIL again to ensure the * target reflects both the current log tail and log head * position before we wait for the tail to move again. */ need_bytes = xlog_ticket_reservation(log, head, tic); if (*free_bytes < need_bytes) { if (!woken_task) xlog_grant_push_ail(log, need_bytes); return false; } *free_bytes -= need_bytes; trace_xfs_log_grant_wake_up(log, tic); wake_up_process(tic->t_task); woken_task = true; } return true; } STATIC int xlog_grant_head_wait( struct xlog *log, struct xlog_grant_head *head, struct xlog_ticket *tic, int need_bytes) __releases(&head->lock) __acquires(&head->lock) { list_add_tail(&tic->t_queue, &head->waiters); do { if (xlog_is_shutdown(log)) goto shutdown; xlog_grant_push_ail(log, need_bytes); __set_current_state(TASK_UNINTERRUPTIBLE); spin_unlock(&head->lock); XFS_STATS_INC(log->l_mp, xs_sleep_logspace); trace_xfs_log_grant_sleep(log, tic); schedule(); trace_xfs_log_grant_wake(log, tic); spin_lock(&head->lock); if (xlog_is_shutdown(log)) goto shutdown; } while (xlog_space_left(log, &head->grant) < need_bytes); list_del_init(&tic->t_queue); return 0; shutdown: list_del_init(&tic->t_queue); return -EIO; } /* * Atomically get the log space required for a log ticket. * * Once a ticket gets put onto head->waiters, it will only return after the * needed reservation is satisfied. * * This function is structured so that it has a lock free fast path. This is * necessary because every new transaction reservation will come through this * path. Hence any lock will be globally hot if we take it unconditionally on * every pass. * * As tickets are only ever moved on and off head->waiters under head->lock, we * only need to take that lock if we are going to add the ticket to the queue * and sleep. We can avoid taking the lock if the ticket was never added to * head->waiters because the t_queue list head will be empty and we hold the * only reference to it so it can safely be checked unlocked. */ STATIC int xlog_grant_head_check( struct xlog *log, struct xlog_grant_head *head, struct xlog_ticket *tic, int *need_bytes) { int free_bytes; int error = 0; ASSERT(!xlog_in_recovery(log)); /* * If there are other waiters on the queue then give them a chance at * logspace before us. Wake up the first waiters, if we do not wake * up all the waiters then go to sleep waiting for more free space, * otherwise try to get some space for this transaction. */ *need_bytes = xlog_ticket_reservation(log, head, tic); free_bytes = xlog_space_left(log, &head->grant); if (!list_empty_careful(&head->waiters)) { spin_lock(&head->lock); if (!xlog_grant_head_wake(log, head, &free_bytes) || free_bytes < *need_bytes) { error = xlog_grant_head_wait(log, head, tic, *need_bytes); } spin_unlock(&head->lock); } else if (free_bytes < *need_bytes) { spin_lock(&head->lock); error = xlog_grant_head_wait(log, head, tic, *need_bytes); spin_unlock(&head->lock); } return error; } bool xfs_log_writable( struct xfs_mount *mp) { /* * Do not write to the log on norecovery mounts, if the data or log * devices are read-only, or if the filesystem is shutdown. Read-only * mounts allow internal writes for log recovery and unmount purposes, * so don't restrict that case. */ if (xfs_has_norecovery(mp)) return false; if (xfs_readonly_buftarg(mp->m_ddev_targp)) return false; if (xfs_readonly_buftarg(mp->m_log->l_targ)) return false; if (xlog_is_shutdown(mp->m_log)) return false; return true; } /* * Replenish the byte reservation required by moving the grant write head. */ int xfs_log_regrant( struct xfs_mount *mp, struct xlog_ticket *tic) { struct xlog *log = mp->m_log; int need_bytes; int error = 0; if (xlog_is_shutdown(log)) return -EIO; XFS_STATS_INC(mp, xs_try_logspace); /* * This is a new transaction on the ticket, so we need to change the * transaction ID so that the next transaction has a different TID in * the log. Just add one to the existing tid so that we can see chains * of rolling transactions in the log easily. */ tic->t_tid++; xlog_grant_push_ail(log, tic->t_unit_res); tic->t_curr_res = tic->t_unit_res; if (tic->t_cnt > 0) return 0; trace_xfs_log_regrant(log, tic); error = xlog_grant_head_check(log, &log->l_write_head, tic, &need_bytes); if (error) goto out_error; xlog_grant_add_space(log, &log->l_write_head.grant, need_bytes); trace_xfs_log_regrant_exit(log, tic); xlog_verify_grant_tail(log); return 0; out_error: /* * If we are failing, make sure the ticket doesn't have any current * reservations. We don't want to add this back when the ticket/ * transaction gets cancelled. */ tic->t_curr_res = 0; tic->t_cnt = 0; /* ungrant will give back unit_res * t_cnt. */ return error; } /* * Reserve log space and return a ticket corresponding to the reservation. * * Each reservation is going to reserve extra space for a log record header. * When writes happen to the on-disk log, we don't subtract the length of the * log record header from any reservation. By wasting space in each * reservation, we prevent over allocation problems. */ int xfs_log_reserve( struct xfs_mount *mp, int unit_bytes, int cnt, struct xlog_ticket **ticp, bool permanent) { struct xlog *log = mp->m_log; struct xlog_ticket *tic; int need_bytes; int error = 0; if (xlog_is_shutdown(log)) return -EIO; XFS_STATS_INC(mp, xs_try_logspace); ASSERT(*ticp == NULL); tic = xlog_ticket_alloc(log, unit_bytes, cnt, permanent); *ticp = tic; xlog_grant_push_ail(log, tic->t_cnt ? tic->t_unit_res * tic->t_cnt : tic->t_unit_res); trace_xfs_log_reserve(log, tic); error = xlog_grant_head_check(log, &log->l_reserve_head, tic, &need_bytes); if (error) goto out_error; xlog_grant_add_space(log, &log->l_reserve_head.grant, need_bytes); xlog_grant_add_space(log, &log->l_write_head.grant, need_bytes); trace_xfs_log_reserve_exit(log, tic); xlog_verify_grant_tail(log); return 0; out_error: /* * If we are failing, make sure the ticket doesn't have any current * reservations. We don't want to add this back when the ticket/ * transaction gets cancelled. */ tic->t_curr_res = 0; tic->t_cnt = 0; /* ungrant will give back unit_res * t_cnt. */ return error; } /* * Run all the pending iclog callbacks and wake log force waiters and iclog * space waiters so they can process the newly set shutdown state. We really * don't care what order we process callbacks here because the log is shut down * and so state cannot change on disk anymore. However, we cannot wake waiters * until the callbacks have been processed because we may be in unmount and * we must ensure that all AIL operations the callbacks perform have completed * before we tear down the AIL. * * We avoid processing actively referenced iclogs so that we don't run callbacks * while the iclog owner might still be preparing the iclog for IO submssion. * These will be caught by xlog_state_iclog_release() and call this function * again to process any callbacks that may have been added to that iclog. */ static void xlog_state_shutdown_callbacks( struct xlog *log) { struct xlog_in_core *iclog; LIST_HEAD(cb_list); iclog = log->l_iclog; do { if (atomic_read(&iclog->ic_refcnt)) { /* Reference holder will re-run iclog callbacks. */ continue; } list_splice_init(&iclog->ic_callbacks, &cb_list); spin_unlock(&log->l_icloglock); xlog_cil_process_committed(&cb_list); spin_lock(&log->l_icloglock); wake_up_all(&iclog->ic_write_wait); wake_up_all(&iclog->ic_force_wait); } while ((iclog = iclog->ic_next) != log->l_iclog); wake_up_all(&log->l_flush_wait); } /* * Flush iclog to disk if this is the last reference to the given iclog and the * it is in the WANT_SYNC state. * * If XLOG_ICL_NEED_FUA is already set on the iclog, we need to ensure that the * log tail is updated correctly. NEED_FUA indicates that the iclog will be * written to stable storage, and implies that a commit record is contained * within the iclog. We need to ensure that the log tail does not move beyond * the tail that the first commit record in the iclog ordered against, otherwise * correct recovery of that checkpoint becomes dependent on future operations * performed on this iclog. * * Hence if NEED_FUA is set and the current iclog tail lsn is empty, write the * current tail into iclog. Once the iclog tail is set, future operations must * not modify it, otherwise they potentially violate ordering constraints for * the checkpoint commit that wrote the initial tail lsn value. The tail lsn in * the iclog will get zeroed on activation of the iclog after sync, so we * always capture the tail lsn on the iclog on the first NEED_FUA release * regardless of the number of active reference counts on this iclog. */ int xlog_state_release_iclog( struct xlog *log, struct xlog_in_core *iclog, struct xlog_ticket *ticket) { xfs_lsn_t tail_lsn; bool last_ref; lockdep_assert_held(&log->l_icloglock); trace_xlog_iclog_release(iclog, _RET_IP_); /* * Grabbing the current log tail needs to be atomic w.r.t. the writing * of the tail LSN into the iclog so we guarantee that the log tail does * not move between the first time we know that the iclog needs to be * made stable and when we eventually submit it. */ if ((iclog->ic_state == XLOG_STATE_WANT_SYNC || (iclog->ic_flags & XLOG_ICL_NEED_FUA)) && !iclog->ic_header.h_tail_lsn) { tail_lsn = xlog_assign_tail_lsn(log->l_mp); iclog->ic_header.h_tail_lsn = cpu_to_be64(tail_lsn); } last_ref = atomic_dec_and_test(&iclog->ic_refcnt); if (xlog_is_shutdown(log)) { /* * If there are no more references to this iclog, process the * pending iclog callbacks that were waiting on the release of * this iclog. */ if (last_ref) xlog_state_shutdown_callbacks(log); return -EIO; } if (!last_ref) return 0; if (iclog->ic_state != XLOG_STATE_WANT_SYNC) { ASSERT(iclog->ic_state == XLOG_STATE_ACTIVE); return 0; } iclog->ic_state = XLOG_STATE_SYNCING; xlog_verify_tail_lsn(log, iclog); trace_xlog_iclog_syncing(iclog, _RET_IP_); spin_unlock(&log->l_icloglock); xlog_sync(log, iclog, ticket); spin_lock(&log->l_icloglock); return 0; } /* * Mount a log filesystem * * mp - ubiquitous xfs mount point structure * log_target - buftarg of on-disk log device * blk_offset - Start block # where block size is 512 bytes (BBSIZE) * num_bblocks - Number of BBSIZE blocks in on-disk log * * Return error or zero. */ int xfs_log_mount( xfs_mount_t *mp, xfs_buftarg_t *log_target, xfs_daddr_t blk_offset, int num_bblks) { struct xlog *log; int error = 0; int min_logfsbs; if (!xfs_has_norecovery(mp)) { xfs_notice(mp, "Mounting V%d Filesystem %pU", XFS_SB_VERSION_NUM(&mp->m_sb), &mp->m_sb.sb_uuid); } else { xfs_notice(mp, "Mounting V%d filesystem %pU in no-recovery mode. Filesystem will be inconsistent.", XFS_SB_VERSION_NUM(&mp->m_sb), &mp->m_sb.sb_uuid); ASSERT(xfs_is_readonly(mp)); } log = xlog_alloc_log(mp, log_target, blk_offset, num_bblks); if (IS_ERR(log)) { error = PTR_ERR(log); goto out; } mp->m_log = log; /* * Now that we have set up the log and it's internal geometry * parameters, we can validate the given log space and drop a critical * message via syslog if the log size is too small. A log that is too * small can lead to unexpected situations in transaction log space * reservation stage. The superblock verifier has already validated all * the other log geometry constraints, so we don't have to check those * here. * * Note: For v4 filesystems, we can't just reject the mount if the * validation fails. This would mean that people would have to * downgrade their kernel just to remedy the situation as there is no * way to grow the log (short of black magic surgery with xfs_db). * * We can, however, reject mounts for V5 format filesystems, as the * mkfs binary being used to make the filesystem should never create a * filesystem with a log that is too small. */ min_logfsbs = xfs_log_calc_minimum_size(mp); if (mp->m_sb.sb_logblocks < min_logfsbs) { xfs_warn(mp, "Log size %d blocks too small, minimum size is %d blocks", mp->m_sb.sb_logblocks, min_logfsbs); /* * Log check errors are always fatal on v5; or whenever bad * metadata leads to a crash. */ if (xfs_has_crc(mp)) { xfs_crit(mp, "AAIEEE! Log failed size checks. Abort!"); ASSERT(0); error = -EINVAL; goto out_free_log; } xfs_crit(mp, "Log size out of supported range."); xfs_crit(mp, "Continuing onwards, but if log hangs are experienced then please report this message in the bug report."); } /* * Initialize the AIL now we have a log. */ error = xfs_trans_ail_init(mp); if (error) { xfs_warn(mp, "AIL initialisation failed: error %d", error); goto out_free_log; } log->l_ailp = mp->m_ail; /* * skip log recovery on a norecovery mount. pretend it all * just worked. */ if (!xfs_has_norecovery(mp)) { error = xlog_recover(log); if (error) { xfs_warn(mp, "log mount/recovery failed: error %d", error); xlog_recover_cancel(log); goto out_destroy_ail; } } error = xfs_sysfs_init(&log->l_kobj, &xfs_log_ktype, &mp->m_kobj, "log"); if (error) goto out_destroy_ail; /* Normal transactions can now occur */ clear_bit(XLOG_ACTIVE_RECOVERY, &log->l_opstate); /* * Now the log has been fully initialised and we know were our * space grant counters are, we can initialise the permanent ticket * needed for delayed logging to work. */ xlog_cil_init_post_recovery(log); return 0; out_destroy_ail: xfs_trans_ail_destroy(mp); out_free_log: xlog_dealloc_log(log); out: return error; } /* * Finish the recovery of the file system. This is separate from the * xfs_log_mount() call, because it depends on the code in xfs_mountfs() to read * in the root and real-time bitmap inodes between calling xfs_log_mount() and * here. * * If we finish recovery successfully, start the background log work. If we are * not doing recovery, then we have a RO filesystem and we don't need to start * it. */ int xfs_log_mount_finish( struct xfs_mount *mp) { struct xlog *log = mp->m_log; int error = 0; if (xfs_has_norecovery(mp)) { ASSERT(xfs_is_readonly(mp)); return 0; } /* * During the second phase of log recovery, we need iget and * iput to behave like they do for an active filesystem. * xfs_fs_drop_inode needs to be able to prevent the deletion * of inodes before we're done replaying log items on those * inodes. Turn it off immediately after recovery finishes * so that we don't leak the quota inodes if subsequent mount * activities fail. * * We let all inodes involved in redo item processing end up on * the LRU instead of being evicted immediately so that if we do * something to an unlinked inode, the irele won't cause * premature truncation and freeing of the inode, which results * in log recovery failure. We have to evict the unreferenced * lru inodes after clearing SB_ACTIVE because we don't * otherwise clean up the lru if there's a subsequent failure in * xfs_mountfs, which leads to us leaking the inodes if nothing * else (e.g. quotacheck) references the inodes before the * mount failure occurs. */ mp->m_super->s_flags |= SB_ACTIVE; xfs_log_work_queue(mp); if (xlog_recovery_needed(log)) error = xlog_recover_finish(log); mp->m_super->s_flags &= ~SB_ACTIVE; evict_inodes(mp->m_super); /* * Drain the buffer LRU after log recovery. This is required for v4 * filesystems to avoid leaving around buffers with NULL verifier ops, * but we do it unconditionally to make sure we're always in a clean * cache state after mount. * * Don't push in the error case because the AIL may have pending intents * that aren't removed until recovery is cancelled. */ if (xlog_recovery_needed(log)) { if (!error) { xfs_log_force(mp, XFS_LOG_SYNC); xfs_ail_push_all_sync(mp->m_ail); } xfs_notice(mp, "Ending recovery (logdev: %s)", mp->m_logname ? mp->m_logname : "internal"); } else { xfs_info(mp, "Ending clean mount"); } xfs_buftarg_drain(mp->m_ddev_targp); clear_bit(XLOG_RECOVERY_NEEDED, &log->l_opstate); /* Make sure the log is dead if we're returning failure. */ ASSERT(!error || xlog_is_shutdown(log)); return error; } /* * The mount has failed. Cancel the recovery if it hasn't completed and destroy * the log. */ void xfs_log_mount_cancel( struct xfs_mount *mp) { xlog_recover_cancel(mp->m_log); xfs_log_unmount(mp); } /* * Flush out the iclog to disk ensuring that device caches are flushed and * the iclog hits stable storage before any completion waiters are woken. */ static inline int xlog_force_iclog( struct xlog_in_core *iclog) { atomic_inc(&iclog->ic_refcnt); iclog->ic_flags |= XLOG_ICL_NEED_FLUSH | XLOG_ICL_NEED_FUA; if (iclog->ic_state == XLOG_STATE_ACTIVE) xlog_state_switch_iclogs(iclog->ic_log, iclog, 0); return xlog_state_release_iclog(iclog->ic_log, iclog, NULL); } /* * Cycle all the iclogbuf locks to make sure all log IO completion * is done before we tear down these buffers. */ static void xlog_wait_iclog_completion(struct xlog *log) { int i; struct xlog_in_core *iclog = log->l_iclog; for (i = 0; i < log->l_iclog_bufs; i++) { down(&iclog->ic_sema); up(&iclog->ic_sema); iclog = iclog->ic_next; } } /* * Wait for the iclog and all prior iclogs to be written disk as required by the * log force state machine. Waiting on ic_force_wait ensures iclog completions * have been ordered and callbacks run before we are woken here, hence * guaranteeing that all the iclogs up to this one are on stable storage. */ int xlog_wait_on_iclog( struct xlog_in_core *iclog) __releases(iclog->ic_log->l_icloglock) { struct xlog *log = iclog->ic_log; trace_xlog_iclog_wait_on(iclog, _RET_IP_); if (!xlog_is_shutdown(log) && iclog->ic_state != XLOG_STATE_ACTIVE && iclog->ic_state != XLOG_STATE_DIRTY) { XFS_STATS_INC(log->l_mp, xs_log_force_sleep); xlog_wait(&iclog->ic_force_wait, &log->l_icloglock); } else { spin_unlock(&log->l_icloglock); } if (xlog_is_shutdown(log)) return -EIO; return 0; } /* * Write out an unmount record using the ticket provided. We have to account for * the data space used in the unmount ticket as this write is not done from a * transaction context that has already done the accounting for us. */ static int xlog_write_unmount_record( struct xlog *log, struct xlog_ticket *ticket) { struct { struct xlog_op_header ophdr; struct xfs_unmount_log_format ulf; } unmount_rec = { .ophdr = { .oh_clientid = XFS_LOG, .oh_tid = cpu_to_be32(ticket->t_tid), .oh_flags = XLOG_UNMOUNT_TRANS, }, .ulf = { .magic = XLOG_UNMOUNT_TYPE, }, }; struct xfs_log_iovec reg = { .i_addr = &unmount_rec, .i_len = sizeof(unmount_rec), .i_type = XLOG_REG_TYPE_UNMOUNT, }; struct xfs_log_vec vec = { .lv_niovecs = 1, .lv_iovecp = ®, }; LIST_HEAD(lv_chain); list_add(&vec.lv_list, &lv_chain); BUILD_BUG_ON((sizeof(struct xlog_op_header) + sizeof(struct xfs_unmount_log_format)) != sizeof(unmount_rec)); /* account for space used by record data */ ticket->t_curr_res -= sizeof(unmount_rec); return xlog_write(log, NULL, &lv_chain, ticket, reg.i_len); } /* * Mark the filesystem clean by writing an unmount record to the head of the * log. */ static void xlog_unmount_write( struct xlog *log) { struct xfs_mount *mp = log->l_mp; struct xlog_in_core *iclog; struct xlog_ticket *tic = NULL; int error; error = xfs_log_reserve(mp, 600, 1, &tic, 0); if (error) goto out_err; error = xlog_write_unmount_record(log, tic); /* * At this point, we're umounting anyway, so there's no point in * transitioning log state to shutdown. Just continue... */ out_err: if (error) xfs_alert(mp, "%s: unmount record failed", __func__); spin_lock(&log->l_icloglock); iclog = log->l_iclog; error = xlog_force_iclog(iclog); xlog_wait_on_iclog(iclog); if (tic) { trace_xfs_log_umount_write(log, tic); xfs_log_ticket_ungrant(log, tic); } } static void xfs_log_unmount_verify_iclog( struct xlog *log) { struct xlog_in_core *iclog = log->l_iclog; do { ASSERT(iclog->ic_state == XLOG_STATE_ACTIVE); ASSERT(iclog->ic_offset == 0); } while ((iclog = iclog->ic_next) != log->l_iclog); } /* * Unmount record used to have a string "Unmount filesystem--" in the * data section where the "Un" was really a magic number (XLOG_UNMOUNT_TYPE). * We just write the magic number now since that particular field isn't * currently architecture converted and "Unmount" is a bit foo. * As far as I know, there weren't any dependencies on the old behaviour. */ static void xfs_log_unmount_write( struct xfs_mount *mp) { struct xlog *log = mp->m_log; if (!xfs_log_writable(mp)) return; xfs_log_force(mp, XFS_LOG_SYNC); if (xlog_is_shutdown(log)) return; /* * If we think the summary counters are bad, avoid writing the unmount * record to force log recovery at next mount, after which the summary * counters will be recalculated. Refer to xlog_check_unmount_rec for * more details. */ if (XFS_TEST_ERROR(xfs_fs_has_sickness(mp, XFS_SICK_FS_COUNTERS), mp, XFS_ERRTAG_FORCE_SUMMARY_RECALC)) { xfs_alert(mp, "%s: will fix summary counters at next mount", __func__); return; } xfs_log_unmount_verify_iclog(log); xlog_unmount_write(log); } /* * Empty the log for unmount/freeze. * * To do this, we first need to shut down the background log work so it is not * trying to cover the log as we clean up. We then need to unpin all objects in * the log so we can then flush them out. Once they have completed their IO and * run the callbacks removing themselves from the AIL, we can cover the log. */ int xfs_log_quiesce( struct xfs_mount *mp) { /* * Clear log incompat features since we're quiescing the log. Report * failures, though it's not fatal to have a higher log feature * protection level than the log contents actually require. */ if (xfs_clear_incompat_log_features(mp)) { int error; error = xfs_sync_sb(mp, false); if (error) xfs_warn(mp, "Failed to clear log incompat features on quiesce"); } cancel_delayed_work_sync(&mp->m_log->l_work); xfs_log_force(mp, XFS_LOG_SYNC); /* * The superblock buffer is uncached and while xfs_ail_push_all_sync() * will push it, xfs_buftarg_wait() will not wait for it. Further, * xfs_buf_iowait() cannot be used because it was pushed with the * XBF_ASYNC flag set, so we need to use a lock/unlock pair to wait for * the IO to complete. */ xfs_ail_push_all_sync(mp->m_ail); xfs_buftarg_wait(mp->m_ddev_targp); xfs_buf_lock(mp->m_sb_bp); xfs_buf_unlock(mp->m_sb_bp); return xfs_log_cover(mp); } void xfs_log_clean( struct xfs_mount *mp) { xfs_log_quiesce(mp); xfs_log_unmount_write(mp); } /* * Shut down and release the AIL and Log. * * During unmount, we need to ensure we flush all the dirty metadata objects * from the AIL so that the log is empty before we write the unmount record to * the log. Once this is done, we can tear down the AIL and the log. */ void xfs_log_unmount( struct xfs_mount *mp) { xfs_log_clean(mp); /* * If shutdown has come from iclog IO context, the log * cleaning will have been skipped and so we need to wait * for the iclog to complete shutdown processing before we * tear anything down. */ xlog_wait_iclog_completion(mp->m_log); xfs_buftarg_drain(mp->m_ddev_targp); xfs_trans_ail_destroy(mp); xfs_sysfs_del(&mp->m_log->l_kobj); xlog_dealloc_log(mp->m_log); } void xfs_log_item_init( struct xfs_mount *mp, struct xfs_log_item *item, int type, const struct xfs_item_ops *ops) { item->li_log = mp->m_log; item->li_ailp = mp->m_ail; item->li_type = type; item->li_ops = ops; item->li_lv = NULL; INIT_LIST_HEAD(&item->li_ail); INIT_LIST_HEAD(&item->li_cil); INIT_LIST_HEAD(&item->li_bio_list); INIT_LIST_HEAD(&item->li_trans); } /* * Wake up processes waiting for log space after we have moved the log tail. */ void xfs_log_space_wake( struct xfs_mount *mp) { struct xlog *log = mp->m_log; int free_bytes; if (xlog_is_shutdown(log)) return; if (!list_empty_careful(&log->l_write_head.waiters)) { ASSERT(!xlog_in_recovery(log)); spin_lock(&log->l_write_head.lock); free_bytes = xlog_space_left(log, &log->l_write_head.grant); xlog_grant_head_wake(log, &log->l_write_head, &free_bytes); spin_unlock(&log->l_write_head.lock); } if (!list_empty_careful(&log->l_reserve_head.waiters)) { ASSERT(!xlog_in_recovery(log)); spin_lock(&log->l_reserve_head.lock); free_bytes = xlog_space_left(log, &log->l_reserve_head.grant); xlog_grant_head_wake(log, &log->l_reserve_head, &free_bytes); spin_unlock(&log->l_reserve_head.lock); } } /* * Determine if we have a transaction that has gone to disk that needs to be * covered. To begin the transition to the idle state firstly the log needs to * be idle. That means the CIL, the AIL and the iclogs needs to be empty before * we start attempting to cover the log. * * Only if we are then in a state where covering is needed, the caller is * informed that dummy transactions are required to move the log into the idle * state. * * If there are any items in the AIl or CIL, then we do not want to attempt to * cover the log as we may be in a situation where there isn't log space * available to run a dummy transaction and this can lead to deadlocks when the * tail of the log is pinned by an item that is modified in the CIL. Hence * there's no point in running a dummy transaction at this point because we * can't start trying to idle the log until both the CIL and AIL are empty. */ static bool xfs_log_need_covered( struct xfs_mount *mp) { struct xlog *log = mp->m_log; bool needed = false; if (!xlog_cil_empty(log)) return false; spin_lock(&log->l_icloglock); switch (log->l_covered_state) { case XLOG_STATE_COVER_DONE: case XLOG_STATE_COVER_DONE2: case XLOG_STATE_COVER_IDLE: break; case XLOG_STATE_COVER_NEED: case XLOG_STATE_COVER_NEED2: if (xfs_ail_min_lsn(log->l_ailp)) break; if (!xlog_iclogs_empty(log)) break; needed = true; if (log->l_covered_state == XLOG_STATE_COVER_NEED) log->l_covered_state = XLOG_STATE_COVER_DONE; else log->l_covered_state = XLOG_STATE_COVER_DONE2; break; default: needed = true; break; } spin_unlock(&log->l_icloglock); return needed; } /* * Explicitly cover the log. This is similar to background log covering but * intended for usage in quiesce codepaths. The caller is responsible to ensure * the log is idle and suitable for covering. The CIL, iclog buffers and AIL * must all be empty. */ static int xfs_log_cover( struct xfs_mount *mp) { int error = 0; bool need_covered; ASSERT((xlog_cil_empty(mp->m_log) && xlog_iclogs_empty(mp->m_log) && !xfs_ail_min_lsn(mp->m_log->l_ailp)) || xlog_is_shutdown(mp->m_log)); if (!xfs_log_writable(mp)) return 0; /* * xfs_log_need_covered() is not idempotent because it progresses the * state machine if the log requires covering. Therefore, we must call * this function once and use the result until we've issued an sb sync. * Do so first to make that abundantly clear. * * Fall into the covering sequence if the log needs covering or the * mount has lazy superblock accounting to sync to disk. The sb sync * used for covering accumulates the in-core counters, so covering * handles this for us. */ need_covered = xfs_log_need_covered(mp); if (!need_covered && !xfs_has_lazysbcount(mp)) return 0; /* * To cover the log, commit the superblock twice (at most) in * independent checkpoints. The first serves as a reference for the * tail pointer. The sync transaction and AIL push empties the AIL and * updates the in-core tail to the LSN of the first checkpoint. The * second commit updates the on-disk tail with the in-core LSN, * covering the log. Push the AIL one more time to leave it empty, as * we found it. */ do { error = xfs_sync_sb(mp, true); if (error) break; xfs_ail_push_all_sync(mp->m_ail); } while (xfs_log_need_covered(mp)); return error; } /* * We may be holding the log iclog lock upon entering this routine. */ xfs_lsn_t xlog_assign_tail_lsn_locked( struct xfs_mount *mp) { struct xlog *log = mp->m_log; struct xfs_log_item *lip; xfs_lsn_t tail_lsn; assert_spin_locked(&mp->m_ail->ail_lock); /* * To make sure we always have a valid LSN for the log tail we keep * track of the last LSN which was committed in log->l_last_sync_lsn, * and use that when the AIL was empty. */ lip = xfs_ail_min(mp->m_ail); if (lip) tail_lsn = lip->li_lsn; else tail_lsn = atomic64_read(&log->l_last_sync_lsn); trace_xfs_log_assign_tail_lsn(log, tail_lsn); atomic64_set(&log->l_tail_lsn, tail_lsn); return tail_lsn; } xfs_lsn_t xlog_assign_tail_lsn( struct xfs_mount *mp) { xfs_lsn_t tail_lsn; spin_lock(&mp->m_ail->ail_lock); tail_lsn = xlog_assign_tail_lsn_locked(mp); spin_unlock(&mp->m_ail->ail_lock); return tail_lsn; } /* * Return the space in the log between the tail and the head. The head * is passed in the cycle/bytes formal parms. In the special case where * the reserve head has wrapped passed the tail, this calculation is no * longer valid. In this case, just return 0 which means there is no space * in the log. This works for all places where this function is called * with the reserve head. Of course, if the write head were to ever * wrap the tail, we should blow up. Rather than catch this case here, * we depend on other ASSERTions in other parts of the code. XXXmiken * * If reservation head is behind the tail, we have a problem. Warn about it, * but then treat it as if the log is empty. * * If the log is shut down, the head and tail may be invalid or out of whack, so * shortcut invalidity asserts in this case so that we don't trigger them * falsely. */ STATIC int xlog_space_left( struct xlog *log, atomic64_t *head) { int tail_bytes; int tail_cycle; int head_cycle; int head_bytes; xlog_crack_grant_head(head, &head_cycle, &head_bytes); xlog_crack_atomic_lsn(&log->l_tail_lsn, &tail_cycle, &tail_bytes); tail_bytes = BBTOB(tail_bytes); if (tail_cycle == head_cycle && head_bytes >= tail_bytes) return log->l_logsize - (head_bytes - tail_bytes); if (tail_cycle + 1 < head_cycle) return 0; /* Ignore potential inconsistency when shutdown. */ if (xlog_is_shutdown(log)) return log->l_logsize; if (tail_cycle < head_cycle) { ASSERT(tail_cycle == (head_cycle - 1)); return tail_bytes - head_bytes; } /* * The reservation head is behind the tail. In this case we just want to * return the size of the log as the amount of space left. */ xfs_alert(log->l_mp, "xlog_space_left: head behind tail"); xfs_alert(log->l_mp, " tail_cycle = %d, tail_bytes = %d", tail_cycle, tail_bytes); xfs_alert(log->l_mp, " GH cycle = %d, GH bytes = %d", head_cycle, head_bytes); ASSERT(0); return log->l_logsize; } static void xlog_ioend_work( struct work_struct *work) { struct xlog_in_core *iclog = container_of(work, struct xlog_in_core, ic_end_io_work); struct xlog *log = iclog->ic_log; int error; error = blk_status_to_errno(iclog->ic_bio.bi_status); #ifdef DEBUG /* treat writes with injected CRC errors as failed */ if (iclog->ic_fail_crc) error = -EIO; #endif /* * Race to shutdown the filesystem if we see an error. */ if (XFS_TEST_ERROR(error, log->l_mp, XFS_ERRTAG_IODONE_IOERR)) { xfs_alert(log->l_mp, "log I/O error %d", error); xlog_force_shutdown(log, SHUTDOWN_LOG_IO_ERROR); } xlog_state_done_syncing(iclog); bio_uninit(&iclog->ic_bio); /* * Drop the lock to signal that we are done. Nothing references the * iclog after this, so an unmount waiting on this lock can now tear it * down safely. As such, it is unsafe to reference the iclog after the * unlock as we could race with it being freed. */ up(&iclog->ic_sema); } /* * Return size of each in-core log record buffer. * * All machines get 8 x 32kB buffers by default, unless tuned otherwise. * * If the filesystem blocksize is too large, we may need to choose a * larger size since the directory code currently logs entire blocks. */ STATIC void xlog_get_iclog_buffer_size( struct xfs_mount *mp, struct xlog *log) { if (mp->m_logbufs <= 0) mp->m_logbufs = XLOG_MAX_ICLOGS; if (mp->m_logbsize <= 0) mp->m_logbsize = XLOG_BIG_RECORD_BSIZE; log->l_iclog_bufs = mp->m_logbufs; log->l_iclog_size = mp->m_logbsize; /* * # headers = size / 32k - one header holds cycles from 32k of data. */ log->l_iclog_heads = DIV_ROUND_UP(mp->m_logbsize, XLOG_HEADER_CYCLE_SIZE); log->l_iclog_hsize = log->l_iclog_heads << BBSHIFT; } void xfs_log_work_queue( struct xfs_mount *mp) { queue_delayed_work(mp->m_sync_workqueue, &mp->m_log->l_work, msecs_to_jiffies(xfs_syncd_centisecs * 10)); } /* * Clear the log incompat flags if we have the opportunity. * * This only happens if we're about to log the second dummy transaction as part * of covering the log and we can get the log incompat feature usage lock. */ static inline void xlog_clear_incompat( struct xlog *log) { struct xfs_mount *mp = log->l_mp; if (!xfs_sb_has_incompat_log_feature(&mp->m_sb, XFS_SB_FEAT_INCOMPAT_LOG_ALL)) return; if (log->l_covered_state != XLOG_STATE_COVER_DONE2) return; if (!down_write_trylock(&log->l_incompat_users)) return; xfs_clear_incompat_log_features(mp); up_write(&log->l_incompat_users); } /* * Every sync period we need to unpin all items in the AIL and push them to * disk. If there is nothing dirty, then we might need to cover the log to * indicate that the filesystem is idle. */ static void xfs_log_worker( struct work_struct *work) { struct xlog *log = container_of(to_delayed_work(work), struct xlog, l_work); struct xfs_mount *mp = log->l_mp; /* dgc: errors ignored - not fatal and nowhere to report them */ if (xfs_fs_writable(mp, SB_FREEZE_WRITE) && xfs_log_need_covered(mp)) { /* * Dump a transaction into the log that contains no real change. * This is needed to stamp the current tail LSN into the log * during the covering operation. * * We cannot use an inode here for this - that will push dirty * state back up into the VFS and then periodic inode flushing * will prevent log covering from making progress. Hence we * synchronously log the superblock instead to ensure the * superblock is immediately unpinned and can be written back. */ xlog_clear_incompat(log); xfs_sync_sb(mp, true); } else xfs_log_force(mp, 0); /* start pushing all the metadata that is currently dirty */ xfs_ail_push_all(mp->m_ail); /* queue us up again */ xfs_log_work_queue(mp); } /* * This routine initializes some of the log structure for a given mount point. * Its primary purpose is to fill in enough, so recovery can occur. However, * some other stuff may be filled in too. */ STATIC struct xlog * xlog_alloc_log( struct xfs_mount *mp, struct xfs_buftarg *log_target, xfs_daddr_t blk_offset, int num_bblks) { struct xlog *log; xlog_rec_header_t *head; xlog_in_core_t **iclogp; xlog_in_core_t *iclog, *prev_iclog=NULL; int i; int error = -ENOMEM; uint log2_size = 0; log = kmem_zalloc(sizeof(struct xlog), KM_MAYFAIL); if (!log) { xfs_warn(mp, "Log allocation failed: No memory!"); goto out; } log->l_mp = mp; log->l_targ = log_target; log->l_logsize = BBTOB(num_bblks); log->l_logBBstart = blk_offset; log->l_logBBsize = num_bblks; log->l_covered_state = XLOG_STATE_COVER_IDLE; set_bit(XLOG_ACTIVE_RECOVERY, &log->l_opstate); INIT_DELAYED_WORK(&log->l_work, xfs_log_worker); log->l_prev_block = -1; /* log->l_tail_lsn = 0x100000000LL; cycle = 1; current block = 0 */ xlog_assign_atomic_lsn(&log->l_tail_lsn, 1, 0); xlog_assign_atomic_lsn(&log->l_last_sync_lsn, 1, 0); log->l_curr_cycle = 1; /* 0 is bad since this is initial value */ if (xfs_has_logv2(mp) && mp->m_sb.sb_logsunit > 1) log->l_iclog_roundoff = mp->m_sb.sb_logsunit; else log->l_iclog_roundoff = BBSIZE; xlog_grant_head_init(&log->l_reserve_head); xlog_grant_head_init(&log->l_write_head); error = -EFSCORRUPTED; if (xfs_has_sector(mp)) { log2_size = mp->m_sb.sb_logsectlog; if (log2_size < BBSHIFT) { xfs_warn(mp, "Log sector size too small (0x%x < 0x%x)", log2_size, BBSHIFT); goto out_free_log; } log2_size -= BBSHIFT; if (log2_size > mp->m_sectbb_log) { xfs_warn(mp, "Log sector size too large (0x%x > 0x%x)", log2_size, mp->m_sectbb_log); goto out_free_log; } /* for larger sector sizes, must have v2 or external log */ if (log2_size && log->l_logBBstart > 0 && !xfs_has_logv2(mp)) { xfs_warn(mp, "log sector size (0x%x) invalid for configuration.", log2_size); goto out_free_log; } } log->l_sectBBsize = 1 << log2_size; init_rwsem(&log->l_incompat_users); xlog_get_iclog_buffer_size(mp, log); spin_lock_init(&log->l_icloglock); init_waitqueue_head(&log->l_flush_wait); iclogp = &log->l_iclog; /* * The amount of memory to allocate for the iclog structure is * rather funky due to the way the structure is defined. It is * done this way so that we can use different sizes for machines * with different amounts of memory. See the definition of * xlog_in_core_t in xfs_log_priv.h for details. */ ASSERT(log->l_iclog_size >= 4096); for (i = 0; i < log->l_iclog_bufs; i++) { size_t bvec_size = howmany(log->l_iclog_size, PAGE_SIZE) * sizeof(struct bio_vec); iclog = kmem_zalloc(sizeof(*iclog) + bvec_size, KM_MAYFAIL); if (!iclog) goto out_free_iclog; *iclogp = iclog; iclog->ic_prev = prev_iclog; prev_iclog = iclog; iclog->ic_data = kvzalloc(log->l_iclog_size, GFP_KERNEL | __GFP_RETRY_MAYFAIL); if (!iclog->ic_data) goto out_free_iclog; head = &iclog->ic_header; memset(head, 0, sizeof(xlog_rec_header_t)); head->h_magicno = cpu_to_be32(XLOG_HEADER_MAGIC_NUM); head->h_version = cpu_to_be32( xfs_has_logv2(log->l_mp) ? 2 : 1); head->h_size = cpu_to_be32(log->l_iclog_size); /* new fields */ head->h_fmt = cpu_to_be32(XLOG_FMT); memcpy(&head->h_fs_uuid, &mp->m_sb.sb_uuid, sizeof(uuid_t)); iclog->ic_size = log->l_iclog_size - log->l_iclog_hsize; iclog->ic_state = XLOG_STATE_ACTIVE; iclog->ic_log = log; atomic_set(&iclog->ic_refcnt, 0); INIT_LIST_HEAD(&iclog->ic_callbacks); iclog->ic_datap = (void *)iclog->ic_data + log->l_iclog_hsize; init_waitqueue_head(&iclog->ic_force_wait); init_waitqueue_head(&iclog->ic_write_wait); INIT_WORK(&iclog->ic_end_io_work, xlog_ioend_work); sema_init(&iclog->ic_sema, 1); iclogp = &iclog->ic_next; } *iclogp = log->l_iclog; /* complete ring */ log->l_iclog->ic_prev = prev_iclog; /* re-write 1st prev ptr */ log->l_ioend_workqueue = alloc_workqueue("xfs-log/%s", XFS_WQFLAGS(WQ_FREEZABLE | WQ_MEM_RECLAIM | WQ_HIGHPRI), 0, mp->m_super->s_id); if (!log->l_ioend_workqueue) goto out_free_iclog; error = xlog_cil_init(log); if (error) goto out_destroy_workqueue; return log; out_destroy_workqueue: destroy_workqueue(log->l_ioend_workqueue); out_free_iclog: for (iclog = log->l_iclog; iclog; iclog = prev_iclog) { prev_iclog = iclog->ic_next; kmem_free(iclog->ic_data); kmem_free(iclog); if (prev_iclog == log->l_iclog) break; } out_free_log: kmem_free(log); out: return ERR_PTR(error); } /* xlog_alloc_log */ /* * Compute the LSN that we'd need to push the log tail towards in order to have * (a) enough on-disk log space to log the number of bytes specified, (b) at * least 25% of the log space free, and (c) at least 256 blocks free. If the * log free space already meets all three thresholds, this function returns * NULLCOMMITLSN. */ xfs_lsn_t xlog_grant_push_threshold( struct xlog *log, int need_bytes) { xfs_lsn_t threshold_lsn = 0; xfs_lsn_t last_sync_lsn; int free_blocks; int free_bytes; int threshold_block; int threshold_cycle; int free_threshold; ASSERT(BTOBB(need_bytes) < log->l_logBBsize); free_bytes = xlog_space_left(log, &log->l_reserve_head.grant); free_blocks = BTOBBT(free_bytes); /* * Set the threshold for the minimum number of free blocks in the * log to the maximum of what the caller needs, one quarter of the * log, and 256 blocks. */ free_threshold = BTOBB(need_bytes); free_threshold = max(free_threshold, (log->l_logBBsize >> 2)); free_threshold = max(free_threshold, 256); if (free_blocks >= free_threshold) return NULLCOMMITLSN; xlog_crack_atomic_lsn(&log->l_tail_lsn, &threshold_cycle, &threshold_block); threshold_block += free_threshold; if (threshold_block >= log->l_logBBsize) { threshold_block -= log->l_logBBsize; threshold_cycle += 1; } threshold_lsn = xlog_assign_lsn(threshold_cycle, threshold_block); /* * Don't pass in an lsn greater than the lsn of the last * log record known to be on disk. Use a snapshot of the last sync lsn * so that it doesn't change between the compare and the set. */ last_sync_lsn = atomic64_read(&log->l_last_sync_lsn); if (XFS_LSN_CMP(threshold_lsn, last_sync_lsn) > 0) threshold_lsn = last_sync_lsn; return threshold_lsn; } /* * Push the tail of the log if we need to do so to maintain the free log space * thresholds set out by xlog_grant_push_threshold. We may need to adopt a * policy which pushes on an lsn which is further along in the log once we * reach the high water mark. In this manner, we would be creating a low water * mark. */ STATIC void xlog_grant_push_ail( struct xlog *log, int need_bytes) { xfs_lsn_t threshold_lsn; threshold_lsn = xlog_grant_push_threshold(log, need_bytes); if (threshold_lsn == NULLCOMMITLSN || xlog_is_shutdown(log)) return; /* * Get the transaction layer to kick the dirty buffers out to * disk asynchronously. No point in trying to do this if * the filesystem is shutting down. */ xfs_ail_push(log->l_ailp, threshold_lsn); } /* * Stamp cycle number in every block */ STATIC void xlog_pack_data( struct xlog *log, struct xlog_in_core *iclog, int roundoff) { int i, j, k; int size = iclog->ic_offset + roundoff; __be32 cycle_lsn; char *dp; cycle_lsn = CYCLE_LSN_DISK(iclog->ic_header.h_lsn); dp = iclog->ic_datap; for (i = 0; i < BTOBB(size); i++) { if (i >= (XLOG_HEADER_CYCLE_SIZE / BBSIZE)) break; iclog->ic_header.h_cycle_data[i] = *(__be32 *)dp; *(__be32 *)dp = cycle_lsn; dp += BBSIZE; } if (xfs_has_logv2(log->l_mp)) { xlog_in_core_2_t *xhdr = iclog->ic_data; for ( ; i < BTOBB(size); i++) { j = i / (XLOG_HEADER_CYCLE_SIZE / BBSIZE); k = i % (XLOG_HEADER_CYCLE_SIZE / BBSIZE); xhdr[j].hic_xheader.xh_cycle_data[k] = *(__be32 *)dp; *(__be32 *)dp = cycle_lsn; dp += BBSIZE; } for (i = 1; i < log->l_iclog_heads; i++) xhdr[i].hic_xheader.xh_cycle = cycle_lsn; } } /* * Calculate the checksum for a log buffer. * * This is a little more complicated than it should be because the various * headers and the actual data are non-contiguous. */ __le32 xlog_cksum( struct xlog *log, struct xlog_rec_header *rhead, char *dp, int size) { uint32_t crc; /* first generate the crc for the record header ... */ crc = xfs_start_cksum_update((char *)rhead, sizeof(struct xlog_rec_header), offsetof(struct xlog_rec_header, h_crc)); /* ... then for additional cycle data for v2 logs ... */ if (xfs_has_logv2(log->l_mp)) { union xlog_in_core2 *xhdr = (union xlog_in_core2 *)rhead; int i; int xheads; xheads = DIV_ROUND_UP(size, XLOG_HEADER_CYCLE_SIZE); for (i = 1; i < xheads; i++) { crc = crc32c(crc, &xhdr[i].hic_xheader, sizeof(struct xlog_rec_ext_header)); } } /* ... and finally for the payload */ crc = crc32c(crc, dp, size); return xfs_end_cksum(crc); } static void xlog_bio_end_io( struct bio *bio) { struct xlog_in_core *iclog = bio->bi_private; queue_work(iclog->ic_log->l_ioend_workqueue, &iclog->ic_end_io_work); } static int xlog_map_iclog_data( struct bio *bio, void *data, size_t count) { do { struct page *page = kmem_to_page(data); unsigned int off = offset_in_page(data); size_t len = min_t(size_t, count, PAGE_SIZE - off); if (bio_add_page(bio, page, len, off) != len) return -EIO; data += len; count -= len; } while (count); return 0; } STATIC void xlog_write_iclog( struct xlog *log, struct xlog_in_core *iclog, uint64_t bno, unsigned int count) { ASSERT(bno < log->l_logBBsize); trace_xlog_iclog_write(iclog, _RET_IP_); /* * We lock the iclogbufs here so that we can serialise against I/O * completion during unmount. We might be processing a shutdown * triggered during unmount, and that can occur asynchronously to the * unmount thread, and hence we need to ensure that completes before * tearing down the iclogbufs. Hence we need to hold the buffer lock * across the log IO to archieve that. */ down(&iclog->ic_sema); if (xlog_is_shutdown(log)) { /* * It would seem logical to return EIO here, but we rely on * the log state machine to propagate I/O errors instead of * doing it here. We kick of the state machine and unlock * the buffer manually, the code needs to be kept in sync * with the I/O completion path. */ xlog_state_done_syncing(iclog); up(&iclog->ic_sema); return; } /* * We use REQ_SYNC | REQ_IDLE here to tell the block layer the are more * IOs coming immediately after this one. This prevents the block layer * writeback throttle from throttling log writes behind background * metadata writeback and causing priority inversions. */ bio_init(&iclog->ic_bio, log->l_targ->bt_bdev, iclog->ic_bvec, howmany(count, PAGE_SIZE), REQ_OP_WRITE | REQ_META | REQ_SYNC | REQ_IDLE); iclog->ic_bio.bi_iter.bi_sector = log->l_logBBstart + bno; iclog->ic_bio.bi_end_io = xlog_bio_end_io; iclog->ic_bio.bi_private = iclog; if (iclog->ic_flags & XLOG_ICL_NEED_FLUSH) { iclog->ic_bio.bi_opf |= REQ_PREFLUSH; /* * For external log devices, we also need to flush the data * device cache first to ensure all metadata writeback covered * by the LSN in this iclog is on stable storage. This is slow, * but it *must* complete before we issue the external log IO. * * If the flush fails, we cannot conclude that past metadata * writeback from the log succeeded. Repeating the flush is * not possible, hence we must shut down with log IO error to * avoid shutdown re-entering this path and erroring out again. */ if (log->l_targ != log->l_mp->m_ddev_targp && blkdev_issue_flush(log->l_mp->m_ddev_targp->bt_bdev)) { xlog_force_shutdown(log, SHUTDOWN_LOG_IO_ERROR); return; } } if (iclog->ic_flags & XLOG_ICL_NEED_FUA) iclog->ic_bio.bi_opf |= REQ_FUA; iclog->ic_flags &= ~(XLOG_ICL_NEED_FLUSH | XLOG_ICL_NEED_FUA); if (xlog_map_iclog_data(&iclog->ic_bio, iclog->ic_data, count)) { xlog_force_shutdown(log, SHUTDOWN_LOG_IO_ERROR); return; } if (is_vmalloc_addr(iclog->ic_data)) flush_kernel_vmap_range(iclog->ic_data, count); /* * If this log buffer would straddle the end of the log we will have * to split it up into two bios, so that we can continue at the start. */ if (bno + BTOBB(count) > log->l_logBBsize) { struct bio *split; split = bio_split(&iclog->ic_bio, log->l_logBBsize - bno, GFP_NOIO, &fs_bio_set); bio_chain(split, &iclog->ic_bio); submit_bio(split); /* restart at logical offset zero for the remainder */ iclog->ic_bio.bi_iter.bi_sector = log->l_logBBstart; } submit_bio(&iclog->ic_bio); } /* * We need to bump cycle number for the part of the iclog that is * written to the start of the log. Watch out for the header magic * number case, though. */ static void xlog_split_iclog( struct xlog *log, void *data, uint64_t bno, unsigned int count) { unsigned int split_offset = BBTOB(log->l_logBBsize - bno); unsigned int i; for (i = split_offset; i < count; i += BBSIZE) { uint32_t cycle = get_unaligned_be32(data + i); if (++cycle == XLOG_HEADER_MAGIC_NUM) cycle++; put_unaligned_be32(cycle, data + i); } } static int xlog_calc_iclog_size( struct xlog *log, struct xlog_in_core *iclog, uint32_t *roundoff) { uint32_t count_init, count; /* Add for LR header */ count_init = log->l_iclog_hsize + iclog->ic_offset; count = roundup(count_init, log->l_iclog_roundoff); *roundoff = count - count_init; ASSERT(count >= count_init); ASSERT(*roundoff < log->l_iclog_roundoff); return count; } /* * Flush out the in-core log (iclog) to the on-disk log in an asynchronous * fashion. Previously, we should have moved the current iclog * ptr in the log to point to the next available iclog. This allows further * write to continue while this code syncs out an iclog ready to go. * Before an in-core log can be written out, the data section must be scanned * to save away the 1st word of each BBSIZE block into the header. We replace * it with the current cycle count. Each BBSIZE block is tagged with the * cycle count because there in an implicit assumption that drives will * guarantee that entire 512 byte blocks get written at once. In other words, * we can't have part of a 512 byte block written and part not written. By * tagging each block, we will know which blocks are valid when recovering * after an unclean shutdown. * * This routine is single threaded on the iclog. No other thread can be in * this routine with the same iclog. Changing contents of iclog can there- * fore be done without grabbing the state machine lock. Updating the global * log will require grabbing the lock though. * * The entire log manager uses a logical block numbering scheme. Only * xlog_write_iclog knows about the fact that the log may not start with * block zero on a given device. */ STATIC void xlog_sync( struct xlog *log, struct xlog_in_core *iclog, struct xlog_ticket *ticket) { unsigned int count; /* byte count of bwrite */ unsigned int roundoff; /* roundoff to BB or stripe */ uint64_t bno; unsigned int size; ASSERT(atomic_read(&iclog->ic_refcnt) == 0); trace_xlog_iclog_sync(iclog, _RET_IP_); count = xlog_calc_iclog_size(log, iclog, &roundoff); /* * If we have a ticket, account for the roundoff via the ticket * reservation to avoid touching the hot grant heads needlessly. * Otherwise, we have to move grant heads directly. */ if (ticket) { ticket->t_curr_res -= roundoff; } else { xlog_grant_add_space(log, &log->l_reserve_head.grant, roundoff); xlog_grant_add_space(log, &log->l_write_head.grant, roundoff); } /* put cycle number in every block */ xlog_pack_data(log, iclog, roundoff); /* real byte length */ size = iclog->ic_offset; if (xfs_has_logv2(log->l_mp)) size += roundoff; iclog->ic_header.h_len = cpu_to_be32(size); XFS_STATS_INC(log->l_mp, xs_log_writes); XFS_STATS_ADD(log->l_mp, xs_log_blocks, BTOBB(count)); bno = BLOCK_LSN(be64_to_cpu(iclog->ic_header.h_lsn)); /* Do we need to split this write into 2 parts? */ if (bno + BTOBB(count) > log->l_logBBsize) xlog_split_iclog(log, &iclog->ic_header, bno, count); /* calculcate the checksum */ iclog->ic_header.h_crc = xlog_cksum(log, &iclog->ic_header, iclog->ic_datap, size); /* * Intentionally corrupt the log record CRC based on the error injection * frequency, if defined. This facilitates testing log recovery in the * event of torn writes. Hence, set the IOABORT state to abort the log * write on I/O completion and shutdown the fs. The subsequent mount * detects the bad CRC and attempts to recover. */ #ifdef DEBUG if (XFS_TEST_ERROR(false, log->l_mp, XFS_ERRTAG_LOG_BAD_CRC)) { iclog->ic_header.h_crc &= cpu_to_le32(0xAAAAAAAA); iclog->ic_fail_crc = true; xfs_warn(log->l_mp, "Intentionally corrupted log record at LSN 0x%llx. Shutdown imminent.", be64_to_cpu(iclog->ic_header.h_lsn)); } #endif xlog_verify_iclog(log, iclog, count); xlog_write_iclog(log, iclog, bno, count); } /* * Deallocate a log structure */ STATIC void xlog_dealloc_log( struct xlog *log) { xlog_in_core_t *iclog, *next_iclog; int i; /* * Destroy the CIL after waiting for iclog IO completion because an * iclog EIO error will try to shut down the log, which accesses the * CIL to wake up the waiters. */ xlog_cil_destroy(log); iclog = log->l_iclog; for (i = 0; i < log->l_iclog_bufs; i++) { next_iclog = iclog->ic_next; kmem_free(iclog->ic_data); kmem_free(iclog); iclog = next_iclog; } log->l_mp->m_log = NULL; destroy_workqueue(log->l_ioend_workqueue); kmem_free(log); } /* * Update counters atomically now that memcpy is done. */ static inline void xlog_state_finish_copy( struct xlog *log, struct xlog_in_core *iclog, int record_cnt, int copy_bytes) { lockdep_assert_held(&log->l_icloglock); be32_add_cpu(&iclog->ic_header.h_num_logops, record_cnt); iclog->ic_offset += copy_bytes; } /* * print out info relating to regions written which consume * the reservation */ void xlog_print_tic_res( struct xfs_mount *mp, struct xlog_ticket *ticket) { xfs_warn(mp, "ticket reservation summary:"); xfs_warn(mp, " unit res = %d bytes", ticket->t_unit_res); xfs_warn(mp, " current res = %d bytes", ticket->t_curr_res); xfs_warn(mp, " original count = %d", ticket->t_ocnt); xfs_warn(mp, " remaining count = %d", ticket->t_cnt); } /* * Print a summary of the transaction. */ void xlog_print_trans( struct xfs_trans *tp) { struct xfs_mount *mp = tp->t_mountp; struct xfs_log_item *lip; /* dump core transaction and ticket info */ xfs_warn(mp, "transaction summary:"); xfs_warn(mp, " log res = %d", tp->t_log_res); xfs_warn(mp, " log count = %d", tp->t_log_count); xfs_warn(mp, " flags = 0x%x", tp->t_flags); xlog_print_tic_res(mp, tp->t_ticket); /* dump each log item */ list_for_each_entry(lip, &tp->t_items, li_trans) { struct xfs_log_vec *lv = lip->li_lv; struct xfs_log_iovec *vec; int i; xfs_warn(mp, "log item: "); xfs_warn(mp, " type = 0x%x", lip->li_type); xfs_warn(mp, " flags = 0x%lx", lip->li_flags); if (!lv) continue; xfs_warn(mp, " niovecs = %d", lv->lv_niovecs); xfs_warn(mp, " size = %d", lv->lv_size); xfs_warn(mp, " bytes = %d", lv->lv_bytes); xfs_warn(mp, " buf len = %d", lv->lv_buf_len); /* dump each iovec for the log item */ vec = lv->lv_iovecp; for (i = 0; i < lv->lv_niovecs; i++) { int dumplen = min(vec->i_len, 32); xfs_warn(mp, " iovec[%d]", i); xfs_warn(mp, " type = 0x%x", vec->i_type); xfs_warn(mp, " len = %d", vec->i_len); xfs_warn(mp, " first %d bytes of iovec[%d]:", dumplen, i); xfs_hex_dump(vec->i_addr, dumplen); vec++; } } } static inline void xlog_write_iovec( struct xlog_in_core *iclog, uint32_t *log_offset, void *data, uint32_t write_len, int *bytes_left, uint32_t *record_cnt, uint32_t *data_cnt) { ASSERT(*log_offset < iclog->ic_log->l_iclog_size); ASSERT(*log_offset % sizeof(int32_t) == 0); ASSERT(write_len % sizeof(int32_t) == 0); memcpy(iclog->ic_datap + *log_offset, data, write_len); *log_offset += write_len; *bytes_left -= write_len; (*record_cnt)++; *data_cnt += write_len; } /* * Write log vectors into a single iclog which is guaranteed by the caller * to have enough space to write the entire log vector into. */ static void xlog_write_full( struct xfs_log_vec *lv, struct xlog_ticket *ticket, struct xlog_in_core *iclog, uint32_t *log_offset, uint32_t *len, uint32_t *record_cnt, uint32_t *data_cnt) { int index; ASSERT(*log_offset + *len <= iclog->ic_size || iclog->ic_state == XLOG_STATE_WANT_SYNC); /* * Ordered log vectors have no regions to write so this * loop will naturally skip them. */ for (index = 0; index < lv->lv_niovecs; index++) { struct xfs_log_iovec *reg = &lv->lv_iovecp[index]; struct xlog_op_header *ophdr = reg->i_addr; ophdr->oh_tid = cpu_to_be32(ticket->t_tid); xlog_write_iovec(iclog, log_offset, reg->i_addr, reg->i_len, len, record_cnt, data_cnt); } } static int xlog_write_get_more_iclog_space( struct xlog_ticket *ticket, struct xlog_in_core **iclogp, uint32_t *log_offset, uint32_t len, uint32_t *record_cnt, uint32_t *data_cnt) { struct xlog_in_core *iclog = *iclogp; struct xlog *log = iclog->ic_log; int error; spin_lock(&log->l_icloglock); ASSERT(iclog->ic_state == XLOG_STATE_WANT_SYNC); xlog_state_finish_copy(log, iclog, *record_cnt, *data_cnt); error = xlog_state_release_iclog(log, iclog, ticket); spin_unlock(&log->l_icloglock); if (error) return error; error = xlog_state_get_iclog_space(log, len, &iclog, ticket, log_offset); if (error) return error; *record_cnt = 0; *data_cnt = 0; *iclogp = iclog; return 0; } /* * Write log vectors into a single iclog which is smaller than the current chain * length. We write until we cannot fit a full record into the remaining space * and then stop. We return the log vector that is to be written that cannot * wholly fit in the iclog. */ static int xlog_write_partial( struct xfs_log_vec *lv, struct xlog_ticket *ticket, struct xlog_in_core **iclogp, uint32_t *log_offset, uint32_t *len, uint32_t *record_cnt, uint32_t *data_cnt) { struct xlog_in_core *iclog = *iclogp; struct xlog_op_header *ophdr; int index = 0; uint32_t rlen; int error; /* walk the logvec, copying until we run out of space in the iclog */ for (index = 0; index < lv->lv_niovecs; index++) { struct xfs_log_iovec *reg = &lv->lv_iovecp[index]; uint32_t reg_offset = 0; /* * The first region of a continuation must have a non-zero * length otherwise log recovery will just skip over it and * start recovering from the next opheader it finds. Because we * mark the next opheader as a continuation, recovery will then * incorrectly add the continuation to the previous region and * that breaks stuff. * * Hence if there isn't space for region data after the * opheader, then we need to start afresh with a new iclog. */ if (iclog->ic_size - *log_offset <= sizeof(struct xlog_op_header)) { error = xlog_write_get_more_iclog_space(ticket, &iclog, log_offset, *len, record_cnt, data_cnt); if (error) return error; } ophdr = reg->i_addr; rlen = min_t(uint32_t, reg->i_len, iclog->ic_size - *log_offset); ophdr->oh_tid = cpu_to_be32(ticket->t_tid); ophdr->oh_len = cpu_to_be32(rlen - sizeof(struct xlog_op_header)); if (rlen != reg->i_len) ophdr->oh_flags |= XLOG_CONTINUE_TRANS; xlog_write_iovec(iclog, log_offset, reg->i_addr, rlen, len, record_cnt, data_cnt); /* If we wrote the whole region, move to the next. */ if (rlen == reg->i_len) continue; /* * We now have a partially written iovec, but it can span * multiple iclogs so we loop here. First we release the iclog * we currently have, then we get a new iclog and add a new * opheader. Then we continue copying from where we were until * we either complete the iovec or fill the iclog. If we * complete the iovec, then we increment the index and go right * back to the top of the outer loop. if we fill the iclog, we * run the inner loop again. * * This is complicated by the tail of a region using all the * space in an iclog and hence requiring us to release the iclog * and get a new one before returning to the outer loop. We must * always guarantee that we exit this inner loop with at least * space for log transaction opheaders left in the current * iclog, hence we cannot just terminate the loop at the end * of the of the continuation. So we loop while there is no * space left in the current iclog, and check for the end of the * continuation after getting a new iclog. */ do { /* * Ensure we include the continuation opheader in the * space we need in the new iclog by adding that size * to the length we require. This continuation opheader * needs to be accounted to the ticket as the space it * consumes hasn't been accounted to the lv we are * writing. */ error = xlog_write_get_more_iclog_space(ticket, &iclog, log_offset, *len + sizeof(struct xlog_op_header), record_cnt, data_cnt); if (error) return error; ophdr = iclog->ic_datap + *log_offset; ophdr->oh_tid = cpu_to_be32(ticket->t_tid); ophdr->oh_clientid = XFS_TRANSACTION; ophdr->oh_res2 = 0; ophdr->oh_flags = XLOG_WAS_CONT_TRANS; ticket->t_curr_res -= sizeof(struct xlog_op_header); *log_offset += sizeof(struct xlog_op_header); *data_cnt += sizeof(struct xlog_op_header); /* * If rlen fits in the iclog, then end the region * continuation. Otherwise we're going around again. */ reg_offset += rlen; rlen = reg->i_len - reg_offset; if (rlen <= iclog->ic_size - *log_offset) ophdr->oh_flags |= XLOG_END_TRANS; else ophdr->oh_flags |= XLOG_CONTINUE_TRANS; rlen = min_t(uint32_t, rlen, iclog->ic_size - *log_offset); ophdr->oh_len = cpu_to_be32(rlen); xlog_write_iovec(iclog, log_offset, reg->i_addr + reg_offset, rlen, len, record_cnt, data_cnt); } while (ophdr->oh_flags & XLOG_CONTINUE_TRANS); } /* * No more iovecs remain in this logvec so return the next log vec to * the caller so it can go back to fast path copying. */ *iclogp = iclog; return 0; } /* * Write some region out to in-core log * * This will be called when writing externally provided regions or when * writing out a commit record for a given transaction. * * General algorithm: * 1. Find total length of this write. This may include adding to the * lengths passed in. * 2. Check whether we violate the tickets reservation. * 3. While writing to this iclog * A. Reserve as much space in this iclog as can get * B. If this is first write, save away start lsn * C. While writing this region: * 1. If first write of transaction, write start record * 2. Write log operation header (header per region) * 3. Find out if we can fit entire region into this iclog * 4. Potentially, verify destination memcpy ptr * 5. Memcpy (partial) region * 6. If partial copy, release iclog; otherwise, continue * copying more regions into current iclog * 4. Mark want sync bit (in simulation mode) * 5. Release iclog for potential flush to on-disk log. * * ERRORS: * 1. Panic if reservation is overrun. This should never happen since * reservation amounts are generated internal to the filesystem. * NOTES: * 1. Tickets are single threaded data structures. * 2. The XLOG_END_TRANS & XLOG_CONTINUE_TRANS flags are passed down to the * syncing routine. When a single log_write region needs to span * multiple in-core logs, the XLOG_CONTINUE_TRANS bit should be set * on all log operation writes which don't contain the end of the * region. The XLOG_END_TRANS bit is used for the in-core log * operation which contains the end of the continued log_write region. * 3. When xlog_state_get_iclog_space() grabs the rest of the current iclog, * we don't really know exactly how much space will be used. As a result, * we don't update ic_offset until the end when we know exactly how many * bytes have been written out. */ int xlog_write( struct xlog *log, struct xfs_cil_ctx *ctx, struct list_head *lv_chain, struct xlog_ticket *ticket, uint32_t len) { struct xlog_in_core *iclog = NULL; struct xfs_log_vec *lv; uint32_t record_cnt = 0; uint32_t data_cnt = 0; int error = 0; int log_offset; if (ticket->t_curr_res < 0) { xfs_alert_tag(log->l_mp, XFS_PTAG_LOGRES, "ctx ticket reservation ran out. Need to up reservation"); xlog_print_tic_res(log->l_mp, ticket); xlog_force_shutdown(log, SHUTDOWN_LOG_IO_ERROR); } error = xlog_state_get_iclog_space(log, len, &iclog, ticket, &log_offset); if (error) return error; ASSERT(log_offset <= iclog->ic_size - 1); /* * If we have a context pointer, pass it the first iclog we are * writing to so it can record state needed for iclog write * ordering. */ if (ctx) xlog_cil_set_ctx_write_state(ctx, iclog); list_for_each_entry(lv, lv_chain, lv_list) { /* * If the entire log vec does not fit in the iclog, punt it to * the partial copy loop which can handle this case. */ if (lv->lv_niovecs && lv->lv_bytes > iclog->ic_size - log_offset) { error = xlog_write_partial(lv, ticket, &iclog, &log_offset, &len, &record_cnt, &data_cnt); if (error) { /* * We have no iclog to release, so just return * the error immediately. */ return error; } } else { xlog_write_full(lv, ticket, iclog, &log_offset, &len, &record_cnt, &data_cnt); } } ASSERT(len == 0); /* * We've already been guaranteed that the last writes will fit inside * the current iclog, and hence it will already have the space used by * those writes accounted to it. Hence we do not need to update the * iclog with the number of bytes written here. */ spin_lock(&log->l_icloglock); xlog_state_finish_copy(log, iclog, record_cnt, 0); error = xlog_state_release_iclog(log, iclog, ticket); spin_unlock(&log->l_icloglock); return error; } static void xlog_state_activate_iclog( struct xlog_in_core *iclog, int *iclogs_changed) { ASSERT(list_empty_careful(&iclog->ic_callbacks)); trace_xlog_iclog_activate(iclog, _RET_IP_); /* * If the number of ops in this iclog indicate it just contains the * dummy transaction, we can change state into IDLE (the second time * around). Otherwise we should change the state into NEED a dummy. * We don't need to cover the dummy. */ if (*iclogs_changed == 0 && iclog->ic_header.h_num_logops == cpu_to_be32(XLOG_COVER_OPS)) { *iclogs_changed = 1; } else { /* * We have two dirty iclogs so start over. This could also be * num of ops indicating this is not the dummy going out. */ *iclogs_changed = 2; } iclog->ic_state = XLOG_STATE_ACTIVE; iclog->ic_offset = 0; iclog->ic_header.h_num_logops = 0; memset(iclog->ic_header.h_cycle_data, 0, sizeof(iclog->ic_header.h_cycle_data)); iclog->ic_header.h_lsn = 0; iclog->ic_header.h_tail_lsn = 0; } /* * Loop through all iclogs and mark all iclogs currently marked DIRTY as * ACTIVE after iclog I/O has completed. */ static void xlog_state_activate_iclogs( struct xlog *log, int *iclogs_changed) { struct xlog_in_core *iclog = log->l_iclog; do { if (iclog->ic_state == XLOG_STATE_DIRTY) xlog_state_activate_iclog(iclog, iclogs_changed); /* * The ordering of marking iclogs ACTIVE must be maintained, so * an iclog doesn't become ACTIVE beyond one that is SYNCING. */ else if (iclog->ic_state != XLOG_STATE_ACTIVE) break; } while ((iclog = iclog->ic_next) != log->l_iclog); } static int xlog_covered_state( int prev_state, int iclogs_changed) { /* * We go to NEED for any non-covering writes. We go to NEED2 if we just * wrote the first covering record (DONE). We go to IDLE if we just * wrote the second covering record (DONE2) and remain in IDLE until a * non-covering write occurs. */ switch (prev_state) { case XLOG_STATE_COVER_IDLE: if (iclogs_changed == 1) return XLOG_STATE_COVER_IDLE; fallthrough; case XLOG_STATE_COVER_NEED: case XLOG_STATE_COVER_NEED2: break; case XLOG_STATE_COVER_DONE: if (iclogs_changed == 1) return XLOG_STATE_COVER_NEED2; break; case XLOG_STATE_COVER_DONE2: if (iclogs_changed == 1) return XLOG_STATE_COVER_IDLE; break; default: ASSERT(0); } return XLOG_STATE_COVER_NEED; } STATIC void xlog_state_clean_iclog( struct xlog *log, struct xlog_in_core *dirty_iclog) { int iclogs_changed = 0; trace_xlog_iclog_clean(dirty_iclog, _RET_IP_); dirty_iclog->ic_state = XLOG_STATE_DIRTY; xlog_state_activate_iclogs(log, &iclogs_changed); wake_up_all(&dirty_iclog->ic_force_wait); if (iclogs_changed) { log->l_covered_state = xlog_covered_state(log->l_covered_state, iclogs_changed); } } STATIC xfs_lsn_t xlog_get_lowest_lsn( struct xlog *log) { struct xlog_in_core *iclog = log->l_iclog; xfs_lsn_t lowest_lsn = 0, lsn; do { if (iclog->ic_state == XLOG_STATE_ACTIVE || iclog->ic_state == XLOG_STATE_DIRTY) continue; lsn = be64_to_cpu(iclog->ic_header.h_lsn); if ((lsn && !lowest_lsn) || XFS_LSN_CMP(lsn, lowest_lsn) < 0) lowest_lsn = lsn; } while ((iclog = iclog->ic_next) != log->l_iclog); return lowest_lsn; } /* * Completion of a iclog IO does not imply that a transaction has completed, as * transactions can be large enough to span many iclogs. We cannot change the * tail of the log half way through a transaction as this may be the only * transaction in the log and moving the tail to point to the middle of it * will prevent recovery from finding the start of the transaction. Hence we * should only update the last_sync_lsn if this iclog contains transaction * completion callbacks on it. * * We have to do this before we drop the icloglock to ensure we are the only one * that can update it. * * If we are moving the last_sync_lsn forwards, we also need to ensure we kick * the reservation grant head pushing. This is due to the fact that the push * target is bound by the current last_sync_lsn value. Hence if we have a large * amount of log space bound up in this committing transaction then the * last_sync_lsn value may be the limiting factor preventing tail pushing from * freeing space in the log. Hence once we've updated the last_sync_lsn we * should push the AIL to ensure the push target (and hence the grant head) is * no longer bound by the old log head location and can move forwards and make * progress again. */ static void xlog_state_set_callback( struct xlog *log, struct xlog_in_core *iclog, xfs_lsn_t header_lsn) { trace_xlog_iclog_callback(iclog, _RET_IP_); iclog->ic_state = XLOG_STATE_CALLBACK; ASSERT(XFS_LSN_CMP(atomic64_read(&log->l_last_sync_lsn), header_lsn) <= 0); if (list_empty_careful(&iclog->ic_callbacks)) return; atomic64_set(&log->l_last_sync_lsn, header_lsn); xlog_grant_push_ail(log, 0); } /* * Return true if we need to stop processing, false to continue to the next * iclog. The caller will need to run callbacks if the iclog is returned in the * XLOG_STATE_CALLBACK state. */ static bool xlog_state_iodone_process_iclog( struct xlog *log, struct xlog_in_core *iclog) { xfs_lsn_t lowest_lsn; xfs_lsn_t header_lsn; switch (iclog->ic_state) { case XLOG_STATE_ACTIVE: case XLOG_STATE_DIRTY: /* * Skip all iclogs in the ACTIVE & DIRTY states: */ return false; case XLOG_STATE_DONE_SYNC: /* * Now that we have an iclog that is in the DONE_SYNC state, do * one more check here to see if we have chased our tail around. * If this is not the lowest lsn iclog, then we will leave it * for another completion to process. */ header_lsn = be64_to_cpu(iclog->ic_header.h_lsn); lowest_lsn = xlog_get_lowest_lsn(log); if (lowest_lsn && XFS_LSN_CMP(lowest_lsn, header_lsn) < 0) return false; xlog_state_set_callback(log, iclog, header_lsn); return false; default: /* * Can only perform callbacks in order. Since this iclog is not * in the DONE_SYNC state, we skip the rest and just try to * clean up. */ return true; } } /* * Loop over all the iclogs, running attached callbacks on them. Return true if * we ran any callbacks, indicating that we dropped the icloglock. We don't need * to handle transient shutdown state here at all because * xlog_state_shutdown_callbacks() will be run to do the necessary shutdown * cleanup of the callbacks. */ static bool xlog_state_do_iclog_callbacks( struct xlog *log) __releases(&log->l_icloglock) __acquires(&log->l_icloglock) { struct xlog_in_core *first_iclog = log->l_iclog; struct xlog_in_core *iclog = first_iclog; bool ran_callback = false; do { LIST_HEAD(cb_list); if (xlog_state_iodone_process_iclog(log, iclog)) break; if (iclog->ic_state != XLOG_STATE_CALLBACK) { iclog = iclog->ic_next; continue; } list_splice_init(&iclog->ic_callbacks, &cb_list); spin_unlock(&log->l_icloglock); trace_xlog_iclog_callbacks_start(iclog, _RET_IP_); xlog_cil_process_committed(&cb_list); trace_xlog_iclog_callbacks_done(iclog, _RET_IP_); ran_callback = true; spin_lock(&log->l_icloglock); xlog_state_clean_iclog(log, iclog); iclog = iclog->ic_next; } while (iclog != first_iclog); return ran_callback; } /* * Loop running iclog completion callbacks until there are no more iclogs in a * state that can run callbacks. */ STATIC void xlog_state_do_callback( struct xlog *log) { int flushcnt = 0; int repeats = 0; spin_lock(&log->l_icloglock); while (xlog_state_do_iclog_callbacks(log)) { if (xlog_is_shutdown(log)) break; if (++repeats > 5000) { flushcnt += repeats; repeats = 0; xfs_warn(log->l_mp, "%s: possible infinite loop (%d iterations)", __func__, flushcnt); } } if (log->l_iclog->ic_state == XLOG_STATE_ACTIVE) wake_up_all(&log->l_flush_wait); spin_unlock(&log->l_icloglock); } /* * Finish transitioning this iclog to the dirty state. * * Callbacks could take time, so they are done outside the scope of the * global state machine log lock. */ STATIC void xlog_state_done_syncing( struct xlog_in_core *iclog) { struct xlog *log = iclog->ic_log; spin_lock(&log->l_icloglock); ASSERT(atomic_read(&iclog->ic_refcnt) == 0); trace_xlog_iclog_sync_done(iclog, _RET_IP_); /* * If we got an error, either on the first buffer, or in the case of * split log writes, on the second, we shut down the file system and * no iclogs should ever be attempted to be written to disk again. */ if (!xlog_is_shutdown(log)) { ASSERT(iclog->ic_state == XLOG_STATE_SYNCING); iclog->ic_state = XLOG_STATE_DONE_SYNC; } /* * Someone could be sleeping prior to writing out the next * iclog buffer, we wake them all, one will get to do the * I/O, the others get to wait for the result. */ wake_up_all(&iclog->ic_write_wait); spin_unlock(&log->l_icloglock); xlog_state_do_callback(log); } /* * If the head of the in-core log ring is not (ACTIVE or DIRTY), then we must * sleep. We wait on the flush queue on the head iclog as that should be * the first iclog to complete flushing. Hence if all iclogs are syncing, * we will wait here and all new writes will sleep until a sync completes. * * The in-core logs are used in a circular fashion. They are not used * out-of-order even when an iclog past the head is free. * * return: * * log_offset where xlog_write() can start writing into the in-core * log's data space. * * in-core log pointer to which xlog_write() should write. * * boolean indicating this is a continued write to an in-core log. * If this is the last write, then the in-core log's offset field * needs to be incremented, depending on the amount of data which * is copied. */ STATIC int xlog_state_get_iclog_space( struct xlog *log, int len, struct xlog_in_core **iclogp, struct xlog_ticket *ticket, int *logoffsetp) { int log_offset; xlog_rec_header_t *head; xlog_in_core_t *iclog; restart: spin_lock(&log->l_icloglock); if (xlog_is_shutdown(log)) { spin_unlock(&log->l_icloglock); return -EIO; } iclog = log->l_iclog; if (iclog->ic_state != XLOG_STATE_ACTIVE) { XFS_STATS_INC(log->l_mp, xs_log_noiclogs); /* Wait for log writes to have flushed */ xlog_wait(&log->l_flush_wait, &log->l_icloglock); goto restart; } head = &iclog->ic_header; atomic_inc(&iclog->ic_refcnt); /* prevents sync */ log_offset = iclog->ic_offset; trace_xlog_iclog_get_space(iclog, _RET_IP_); /* On the 1st write to an iclog, figure out lsn. This works * if iclogs marked XLOG_STATE_WANT_SYNC always write out what they are * committing to. If the offset is set, that's how many blocks * must be written. */ if (log_offset == 0) { ticket->t_curr_res -= log->l_iclog_hsize; head->h_cycle = cpu_to_be32(log->l_curr_cycle); head->h_lsn = cpu_to_be64( xlog_assign_lsn(log->l_curr_cycle, log->l_curr_block)); ASSERT(log->l_curr_block >= 0); } /* If there is enough room to write everything, then do it. Otherwise, * claim the rest of the region and make sure the XLOG_STATE_WANT_SYNC * bit is on, so this will get flushed out. Don't update ic_offset * until you know exactly how many bytes get copied. Therefore, wait * until later to update ic_offset. * * xlog_write() algorithm assumes that at least 2 xlog_op_header_t's * can fit into remaining data section. */ if (iclog->ic_size - iclog->ic_offset < 2*sizeof(xlog_op_header_t)) { int error = 0; xlog_state_switch_iclogs(log, iclog, iclog->ic_size); /* * If we are the only one writing to this iclog, sync it to * disk. We need to do an atomic compare and decrement here to * avoid racing with concurrent atomic_dec_and_lock() calls in * xlog_state_release_iclog() when there is more than one * reference to the iclog. */ if (!atomic_add_unless(&iclog->ic_refcnt, -1, 1)) error = xlog_state_release_iclog(log, iclog, ticket); spin_unlock(&log->l_icloglock); if (error) return error; goto restart; } /* Do we have enough room to write the full amount in the remainder * of this iclog? Or must we continue a write on the next iclog and * mark this iclog as completely taken? In the case where we switch * iclogs (to mark it taken), this particular iclog will release/sync * to disk in xlog_write(). */ if (len <= iclog->ic_size - iclog->ic_offset) iclog->ic_offset += len; else xlog_state_switch_iclogs(log, iclog, iclog->ic_size); *iclogp = iclog; ASSERT(iclog->ic_offset <= iclog->ic_size); spin_unlock(&log->l_icloglock); *logoffsetp = log_offset; return 0; } /* * The first cnt-1 times a ticket goes through here we don't need to move the * grant write head because the permanent reservation has reserved cnt times the * unit amount. Release part of current permanent unit reservation and reset * current reservation to be one units worth. Also move grant reservation head * forward. */ void xfs_log_ticket_regrant( struct xlog *log, struct xlog_ticket *ticket) { trace_xfs_log_ticket_regrant(log, ticket); if (ticket->t_cnt > 0) ticket->t_cnt--; xlog_grant_sub_space(log, &log->l_reserve_head.grant, ticket->t_curr_res); xlog_grant_sub_space(log, &log->l_write_head.grant, ticket->t_curr_res); ticket->t_curr_res = ticket->t_unit_res; trace_xfs_log_ticket_regrant_sub(log, ticket); /* just return if we still have some of the pre-reserved space */ if (!ticket->t_cnt) { xlog_grant_add_space(log, &log->l_reserve_head.grant, ticket->t_unit_res); trace_xfs_log_ticket_regrant_exit(log, ticket); ticket->t_curr_res = ticket->t_unit_res; } xfs_log_ticket_put(ticket); } /* * Give back the space left from a reservation. * * All the information we need to make a correct determination of space left * is present. For non-permanent reservations, things are quite easy. The * count should have been decremented to zero. We only need to deal with the * space remaining in the current reservation part of the ticket. If the * ticket contains a permanent reservation, there may be left over space which * needs to be released. A count of N means that N-1 refills of the current * reservation can be done before we need to ask for more space. The first * one goes to fill up the first current reservation. Once we run out of * space, the count will stay at zero and the only space remaining will be * in the current reservation field. */ void xfs_log_ticket_ungrant( struct xlog *log, struct xlog_ticket *ticket) { int bytes; trace_xfs_log_ticket_ungrant(log, ticket); if (ticket->t_cnt > 0) ticket->t_cnt--; trace_xfs_log_ticket_ungrant_sub(log, ticket); /* * If this is a permanent reservation ticket, we may be able to free * up more space based on the remaining count. */ bytes = ticket->t_curr_res; if (ticket->t_cnt > 0) { ASSERT(ticket->t_flags & XLOG_TIC_PERM_RESERV); bytes += ticket->t_unit_res*ticket->t_cnt; } xlog_grant_sub_space(log, &log->l_reserve_head.grant, bytes); xlog_grant_sub_space(log, &log->l_write_head.grant, bytes); trace_xfs_log_ticket_ungrant_exit(log, ticket); xfs_log_space_wake(log->l_mp); xfs_log_ticket_put(ticket); } /* * This routine will mark the current iclog in the ring as WANT_SYNC and move * the current iclog pointer to the next iclog in the ring. */ void xlog_state_switch_iclogs( struct xlog *log, struct xlog_in_core *iclog, int eventual_size) { ASSERT(iclog->ic_state == XLOG_STATE_ACTIVE); assert_spin_locked(&log->l_icloglock); trace_xlog_iclog_switch(iclog, _RET_IP_); if (!eventual_size) eventual_size = iclog->ic_offset; iclog->ic_state = XLOG_STATE_WANT_SYNC; iclog->ic_header.h_prev_block = cpu_to_be32(log->l_prev_block); log->l_prev_block = log->l_curr_block; log->l_prev_cycle = log->l_curr_cycle; /* roll log?: ic_offset changed later */ log->l_curr_block += BTOBB(eventual_size)+BTOBB(log->l_iclog_hsize); /* Round up to next log-sunit */ if (log->l_iclog_roundoff > BBSIZE) { uint32_t sunit_bb = BTOBB(log->l_iclog_roundoff); log->l_curr_block = roundup(log->l_curr_block, sunit_bb); } if (log->l_curr_block >= log->l_logBBsize) { /* * Rewind the current block before the cycle is bumped to make * sure that the combined LSN never transiently moves forward * when the log wraps to the next cycle. This is to support the * unlocked sample of these fields from xlog_valid_lsn(). Most * other cases should acquire l_icloglock. */ log->l_curr_block -= log->l_logBBsize; ASSERT(log->l_curr_block >= 0); smp_wmb(); log->l_curr_cycle++; if (log->l_curr_cycle == XLOG_HEADER_MAGIC_NUM) log->l_curr_cycle++; } ASSERT(iclog == log->l_iclog); log->l_iclog = iclog->ic_next; } /* * Force the iclog to disk and check if the iclog has been completed before * xlog_force_iclog() returns. This can happen on synchronous (e.g. * pmem) or fast async storage because we drop the icloglock to issue the IO. * If completion has already occurred, tell the caller so that it can avoid an * unnecessary wait on the iclog. */ static int xlog_force_and_check_iclog( struct xlog_in_core *iclog, bool *completed) { xfs_lsn_t lsn = be64_to_cpu(iclog->ic_header.h_lsn); int error; *completed = false; error = xlog_force_iclog(iclog); if (error) return error; /* * If the iclog has already been completed and reused the header LSN * will have been rewritten by completion */ if (be64_to_cpu(iclog->ic_header.h_lsn) != lsn) *completed = true; return 0; } /* * Write out all data in the in-core log as of this exact moment in time. * * Data may be written to the in-core log during this call. However, * we don't guarantee this data will be written out. A change from past * implementation means this routine will *not* write out zero length LRs. * * Basically, we try and perform an intelligent scan of the in-core logs. * If we determine there is no flushable data, we just return. There is no * flushable data if: * * 1. the current iclog is active and has no data; the previous iclog * is in the active or dirty state. * 2. the current iclog is drity, and the previous iclog is in the * active or dirty state. * * We may sleep if: * * 1. the current iclog is not in the active nor dirty state. * 2. the current iclog dirty, and the previous iclog is not in the * active nor dirty state. * 3. the current iclog is active, and there is another thread writing * to this particular iclog. * 4. a) the current iclog is active and has no other writers * b) when we return from flushing out this iclog, it is still * not in the active nor dirty state. */ int xfs_log_force( struct xfs_mount *mp, uint flags) { struct xlog *log = mp->m_log; struct xlog_in_core *iclog; XFS_STATS_INC(mp, xs_log_force); trace_xfs_log_force(mp, 0, _RET_IP_); xlog_cil_force(log); spin_lock(&log->l_icloglock); if (xlog_is_shutdown(log)) goto out_error; iclog = log->l_iclog; trace_xlog_iclog_force(iclog, _RET_IP_); if (iclog->ic_state == XLOG_STATE_DIRTY || (iclog->ic_state == XLOG_STATE_ACTIVE && atomic_read(&iclog->ic_refcnt) == 0 && iclog->ic_offset == 0)) { /* * If the head is dirty or (active and empty), then we need to * look at the previous iclog. * * If the previous iclog is active or dirty we are done. There * is nothing to sync out. Otherwise, we attach ourselves to the * previous iclog and go to sleep. */ iclog = iclog->ic_prev; } else if (iclog->ic_state == XLOG_STATE_ACTIVE) { if (atomic_read(&iclog->ic_refcnt) == 0) { /* We have exclusive access to this iclog. */ bool completed; if (xlog_force_and_check_iclog(iclog, &completed)) goto out_error; if (completed) goto out_unlock; } else { /* * Someone else is still writing to this iclog, so we * need to ensure that when they release the iclog it * gets synced immediately as we may be waiting on it. */ xlog_state_switch_iclogs(log, iclog, 0); } } /* * The iclog we are about to wait on may contain the checkpoint pushed * by the above xlog_cil_force() call, but it may not have been pushed * to disk yet. Like the ACTIVE case above, we need to make sure caches * are flushed when this iclog is written. */ if (iclog->ic_state == XLOG_STATE_WANT_SYNC) iclog->ic_flags |= XLOG_ICL_NEED_FLUSH | XLOG_ICL_NEED_FUA; if (flags & XFS_LOG_SYNC) return xlog_wait_on_iclog(iclog); out_unlock: spin_unlock(&log->l_icloglock); return 0; out_error: spin_unlock(&log->l_icloglock); return -EIO; } /* * Force the log to a specific LSN. * * If an iclog with that lsn can be found: * If it is in the DIRTY state, just return. * If it is in the ACTIVE state, move the in-core log into the WANT_SYNC * state and go to sleep or return. * If it is in any other state, go to sleep or return. * * Synchronous forces are implemented with a wait queue. All callers trying * to force a given lsn to disk must wait on the queue attached to the * specific in-core log. When given in-core log finally completes its write * to disk, that thread will wake up all threads waiting on the queue. */ static int xlog_force_lsn( struct xlog *log, xfs_lsn_t lsn, uint flags, int *log_flushed, bool already_slept) { struct xlog_in_core *iclog; bool completed; spin_lock(&log->l_icloglock); if (xlog_is_shutdown(log)) goto out_error; iclog = log->l_iclog; while (be64_to_cpu(iclog->ic_header.h_lsn) != lsn) { trace_xlog_iclog_force_lsn(iclog, _RET_IP_); iclog = iclog->ic_next; if (iclog == log->l_iclog) goto out_unlock; } switch (iclog->ic_state) { case XLOG_STATE_ACTIVE: /* * We sleep here if we haven't already slept (e.g. this is the * first time we've looked at the correct iclog buf) and the * buffer before us is going to be sync'ed. The reason for this * is that if we are doing sync transactions here, by waiting * for the previous I/O to complete, we can allow a few more * transactions into this iclog before we close it down. * * Otherwise, we mark the buffer WANT_SYNC, and bump up the * refcnt so we can release the log (which drops the ref count). * The state switch keeps new transaction commits from using * this buffer. When the current commits finish writing into * the buffer, the refcount will drop to zero and the buffer * will go out then. */ if (!already_slept && (iclog->ic_prev->ic_state == XLOG_STATE_WANT_SYNC || iclog->ic_prev->ic_state == XLOG_STATE_SYNCING)) { xlog_wait(&iclog->ic_prev->ic_write_wait, &log->l_icloglock); return -EAGAIN; } if (xlog_force_and_check_iclog(iclog, &completed)) goto out_error; if (log_flushed) *log_flushed = 1; if (completed) goto out_unlock; break; case XLOG_STATE_WANT_SYNC: /* * This iclog may contain the checkpoint pushed by the * xlog_cil_force_seq() call, but there are other writers still * accessing it so it hasn't been pushed to disk yet. Like the * ACTIVE case above, we need to make sure caches are flushed * when this iclog is written. */ iclog->ic_flags |= XLOG_ICL_NEED_FLUSH | XLOG_ICL_NEED_FUA; break; default: /* * The entire checkpoint was written by the CIL force and is on * its way to disk already. It will be stable when it * completes, so we don't need to manipulate caches here at all. * We just need to wait for completion if necessary. */ break; } if (flags & XFS_LOG_SYNC) return xlog_wait_on_iclog(iclog); out_unlock: spin_unlock(&log->l_icloglock); return 0; out_error: spin_unlock(&log->l_icloglock); return -EIO; } /* * Force the log to a specific checkpoint sequence. * * First force the CIL so that all the required changes have been flushed to the * iclogs. If the CIL force completed it will return a commit LSN that indicates * the iclog that needs to be flushed to stable storage. If the caller needs * a synchronous log force, we will wait on the iclog with the LSN returned by * xlog_cil_force_seq() to be completed. */ int xfs_log_force_seq( struct xfs_mount *mp, xfs_csn_t seq, uint flags, int *log_flushed) { struct xlog *log = mp->m_log; xfs_lsn_t lsn; int ret; ASSERT(seq != 0); XFS_STATS_INC(mp, xs_log_force); trace_xfs_log_force(mp, seq, _RET_IP_); lsn = xlog_cil_force_seq(log, seq); if (lsn == NULLCOMMITLSN) return 0; ret = xlog_force_lsn(log, lsn, flags, log_flushed, false); if (ret == -EAGAIN) { XFS_STATS_INC(mp, xs_log_force_sleep); ret = xlog_force_lsn(log, lsn, flags, log_flushed, true); } return ret; } /* * Free a used ticket when its refcount falls to zero. */ void xfs_log_ticket_put( xlog_ticket_t *ticket) { ASSERT(atomic_read(&ticket->t_ref) > 0); if (atomic_dec_and_test(&ticket->t_ref)) kmem_cache_free(xfs_log_ticket_cache, ticket); } xlog_ticket_t * xfs_log_ticket_get( xlog_ticket_t *ticket) { ASSERT(atomic_read(&ticket->t_ref) > 0); atomic_inc(&ticket->t_ref); return ticket; } /* * Figure out the total log space unit (in bytes) that would be * required for a log ticket. */ static int xlog_calc_unit_res( struct xlog *log, int unit_bytes, int *niclogs) { int iclog_space; uint num_headers; /* * Permanent reservations have up to 'cnt'-1 active log operations * in the log. A unit in this case is the amount of space for one * of these log operations. Normal reservations have a cnt of 1 * and their unit amount is the total amount of space required. * * The following lines of code account for non-transaction data * which occupy space in the on-disk log. * * Normal form of a transaction is: * <oph><trans-hdr><start-oph><reg1-oph><reg1><reg2-oph>...<commit-oph> * and then there are LR hdrs, split-recs and roundoff at end of syncs. * * We need to account for all the leadup data and trailer data * around the transaction data. * And then we need to account for the worst case in terms of using * more space. * The worst case will happen if: * - the placement of the transaction happens to be such that the * roundoff is at its maximum * - the transaction data is synced before the commit record is synced * i.e. <transaction-data><roundoff> | <commit-rec><roundoff> * Therefore the commit record is in its own Log Record. * This can happen as the commit record is called with its * own region to xlog_write(). * This then means that in the worst case, roundoff can happen for * the commit-rec as well. * The commit-rec is smaller than padding in this scenario and so it is * not added separately. */ /* for trans header */ unit_bytes += sizeof(xlog_op_header_t); unit_bytes += sizeof(xfs_trans_header_t); /* for start-rec */ unit_bytes += sizeof(xlog_op_header_t); /* * for LR headers - the space for data in an iclog is the size minus * the space used for the headers. If we use the iclog size, then we * undercalculate the number of headers required. * * Furthermore - the addition of op headers for split-recs might * increase the space required enough to require more log and op * headers, so take that into account too. * * IMPORTANT: This reservation makes the assumption that if this * transaction is the first in an iclog and hence has the LR headers * accounted to it, then the remaining space in the iclog is * exclusively for this transaction. i.e. if the transaction is larger * than the iclog, it will be the only thing in that iclog. * Fundamentally, this means we must pass the entire log vector to * xlog_write to guarantee this. */ iclog_space = log->l_iclog_size - log->l_iclog_hsize; num_headers = howmany(unit_bytes, iclog_space); /* for split-recs - ophdrs added when data split over LRs */ unit_bytes += sizeof(xlog_op_header_t) * num_headers; /* add extra header reservations if we overrun */ while (!num_headers || howmany(unit_bytes, iclog_space) > num_headers) { unit_bytes += sizeof(xlog_op_header_t); num_headers++; } unit_bytes += log->l_iclog_hsize * num_headers; /* for commit-rec LR header - note: padding will subsume the ophdr */ unit_bytes += log->l_iclog_hsize; /* roundoff padding for transaction data and one for commit record */ unit_bytes += 2 * log->l_iclog_roundoff; if (niclogs) *niclogs = num_headers; return unit_bytes; } int xfs_log_calc_unit_res( struct xfs_mount *mp, int unit_bytes) { return xlog_calc_unit_res(mp->m_log, unit_bytes, NULL); } /* * Allocate and initialise a new log ticket. */ struct xlog_ticket * xlog_ticket_alloc( struct xlog *log, int unit_bytes, int cnt, bool permanent) { struct xlog_ticket *tic; int unit_res; tic = kmem_cache_zalloc(xfs_log_ticket_cache, GFP_NOFS | __GFP_NOFAIL); unit_res = xlog_calc_unit_res(log, unit_bytes, &tic->t_iclog_hdrs); atomic_set(&tic->t_ref, 1); tic->t_task = current; INIT_LIST_HEAD(&tic->t_queue); tic->t_unit_res = unit_res; tic->t_curr_res = unit_res; tic->t_cnt = cnt; tic->t_ocnt = cnt; tic->t_tid = get_random_u32(); if (permanent) tic->t_flags |= XLOG_TIC_PERM_RESERV; return tic; } #if defined(DEBUG) /* * Check to make sure the grant write head didn't just over lap the tail. If * the cycles are the same, we can't be overlapping. Otherwise, make sure that * the cycles differ by exactly one and check the byte count. * * This check is run unlocked, so can give false positives. Rather than assert * on failures, use a warn-once flag and a panic tag to allow the admin to * determine if they want to panic the machine when such an error occurs. For * debug kernels this will have the same effect as using an assert but, unlinke * an assert, it can be turned off at runtime. */ STATIC void xlog_verify_grant_tail( struct xlog *log) { int tail_cycle, tail_blocks; int cycle, space; xlog_crack_grant_head(&log->l_write_head.grant, &cycle, &space); xlog_crack_atomic_lsn(&log->l_tail_lsn, &tail_cycle, &tail_blocks); if (tail_cycle != cycle) { if (cycle - 1 != tail_cycle && !test_and_set_bit(XLOG_TAIL_WARN, &log->l_opstate)) { xfs_alert_tag(log->l_mp, XFS_PTAG_LOGRES, "%s: cycle - 1 != tail_cycle", __func__); } if (space > BBTOB(tail_blocks) && !test_and_set_bit(XLOG_TAIL_WARN, &log->l_opstate)) { xfs_alert_tag(log->l_mp, XFS_PTAG_LOGRES, "%s: space > BBTOB(tail_blocks)", __func__); } } } /* check if it will fit */ STATIC void xlog_verify_tail_lsn( struct xlog *log, struct xlog_in_core *iclog) { xfs_lsn_t tail_lsn = be64_to_cpu(iclog->ic_header.h_tail_lsn); int blocks; if (CYCLE_LSN(tail_lsn) == log->l_prev_cycle) { blocks = log->l_logBBsize - (log->l_prev_block - BLOCK_LSN(tail_lsn)); if (blocks < BTOBB(iclog->ic_offset)+BTOBB(log->l_iclog_hsize)) xfs_emerg(log->l_mp, "%s: ran out of log space", __func__); } else { ASSERT(CYCLE_LSN(tail_lsn)+1 == log->l_prev_cycle); if (BLOCK_LSN(tail_lsn) == log->l_prev_block) xfs_emerg(log->l_mp, "%s: tail wrapped", __func__); blocks = BLOCK_LSN(tail_lsn) - log->l_prev_block; if (blocks < BTOBB(iclog->ic_offset) + 1) xfs_emerg(log->l_mp, "%s: ran out of log space", __func__); } } /* * Perform a number of checks on the iclog before writing to disk. * * 1. Make sure the iclogs are still circular * 2. Make sure we have a good magic number * 3. Make sure we don't have magic numbers in the data * 4. Check fields of each log operation header for: * A. Valid client identifier * B. tid ptr value falls in valid ptr space (user space code) * C. Length in log record header is correct according to the * individual operation headers within record. * 5. When a bwrite will occur within 5 blocks of the front of the physical * log, check the preceding blocks of the physical log to make sure all * the cycle numbers agree with the current cycle number. */ STATIC void xlog_verify_iclog( struct xlog *log, struct xlog_in_core *iclog, int count) { xlog_op_header_t *ophead; xlog_in_core_t *icptr; xlog_in_core_2_t *xhdr; void *base_ptr, *ptr, *p; ptrdiff_t field_offset; uint8_t clientid; int len, i, j, k, op_len; int idx; /* check validity of iclog pointers */ spin_lock(&log->l_icloglock); icptr = log->l_iclog; for (i = 0; i < log->l_iclog_bufs; i++, icptr = icptr->ic_next) ASSERT(icptr); if (icptr != log->l_iclog) xfs_emerg(log->l_mp, "%s: corrupt iclog ring", __func__); spin_unlock(&log->l_icloglock); /* check log magic numbers */ if (iclog->ic_header.h_magicno != cpu_to_be32(XLOG_HEADER_MAGIC_NUM)) xfs_emerg(log->l_mp, "%s: invalid magic num", __func__); base_ptr = ptr = &iclog->ic_header; p = &iclog->ic_header; for (ptr += BBSIZE; ptr < base_ptr + count; ptr += BBSIZE) { if (*(__be32 *)ptr == cpu_to_be32(XLOG_HEADER_MAGIC_NUM)) xfs_emerg(log->l_mp, "%s: unexpected magic num", __func__); } /* check fields */ len = be32_to_cpu(iclog->ic_header.h_num_logops); base_ptr = ptr = iclog->ic_datap; ophead = ptr; xhdr = iclog->ic_data; for (i = 0; i < len; i++) { ophead = ptr; /* clientid is only 1 byte */ p = &ophead->oh_clientid; field_offset = p - base_ptr; if (field_offset & 0x1ff) { clientid = ophead->oh_clientid; } else { idx = BTOBBT((void *)&ophead->oh_clientid - iclog->ic_datap); if (idx >= (XLOG_HEADER_CYCLE_SIZE / BBSIZE)) { j = idx / (XLOG_HEADER_CYCLE_SIZE / BBSIZE); k = idx % (XLOG_HEADER_CYCLE_SIZE / BBSIZE); clientid = xlog_get_client_id( xhdr[j].hic_xheader.xh_cycle_data[k]); } else { clientid = xlog_get_client_id( iclog->ic_header.h_cycle_data[idx]); } } if (clientid != XFS_TRANSACTION && clientid != XFS_LOG) { xfs_warn(log->l_mp, "%s: op %d invalid clientid %d op "PTR_FMT" offset 0x%lx", __func__, i, clientid, ophead, (unsigned long)field_offset); } /* check length */ p = &ophead->oh_len; field_offset = p - base_ptr; if (field_offset & 0x1ff) { op_len = be32_to_cpu(ophead->oh_len); } else { idx = BTOBBT((void *)&ophead->oh_len - iclog->ic_datap); if (idx >= (XLOG_HEADER_CYCLE_SIZE / BBSIZE)) { j = idx / (XLOG_HEADER_CYCLE_SIZE / BBSIZE); k = idx % (XLOG_HEADER_CYCLE_SIZE / BBSIZE); op_len = be32_to_cpu(xhdr[j].hic_xheader.xh_cycle_data[k]); } else { op_len = be32_to_cpu(iclog->ic_header.h_cycle_data[idx]); } } ptr += sizeof(xlog_op_header_t) + op_len; } } #endif /* * Perform a forced shutdown on the log. * * This can be called from low level log code to trigger a shutdown, or from the * high level mount shutdown code when the mount shuts down. * * Our main objectives here are to make sure that: * a. if the shutdown was not due to a log IO error, flush the logs to * disk. Anything modified after this is ignored. * b. the log gets atomically marked 'XLOG_IO_ERROR' for all interested * parties to find out. Nothing new gets queued after this is done. * c. Tasks sleeping on log reservations, pinned objects and * other resources get woken up. * d. The mount is also marked as shut down so that log triggered shutdowns * still behave the same as if they called xfs_forced_shutdown(). * * Return true if the shutdown cause was a log IO error and we actually shut the * log down. */ bool xlog_force_shutdown( struct xlog *log, uint32_t shutdown_flags) { bool log_error = (shutdown_flags & SHUTDOWN_LOG_IO_ERROR); if (!log) return false; /* * Flush all the completed transactions to disk before marking the log * being shut down. We need to do this first as shutting down the log * before the force will prevent the log force from flushing the iclogs * to disk. * * When we are in recovery, there are no transactions to flush, and * we don't want to touch the log because we don't want to perturb the * current head/tail for future recovery attempts. Hence we need to * avoid a log force in this case. * * If we are shutting down due to a log IO error, then we must avoid * trying to write the log as that may just result in more IO errors and * an endless shutdown/force loop. */ if (!log_error && !xlog_in_recovery(log)) xfs_log_force(log->l_mp, XFS_LOG_SYNC); /* * Atomically set the shutdown state. If the shutdown state is already * set, there someone else is performing the shutdown and so we are done * here. This should never happen because we should only ever get called * once by the first shutdown caller. * * Much of the log state machine transitions assume that shutdown state * cannot change once they hold the log->l_icloglock. Hence we need to * hold that lock here, even though we use the atomic test_and_set_bit() * operation to set the shutdown state. */ spin_lock(&log->l_icloglock); if (test_and_set_bit(XLOG_IO_ERROR, &log->l_opstate)) { spin_unlock(&log->l_icloglock); return false; } spin_unlock(&log->l_icloglock); /* * If this log shutdown also sets the mount shutdown state, issue a * shutdown warning message. */ if (!test_and_set_bit(XFS_OPSTATE_SHUTDOWN, &log->l_mp->m_opstate)) { xfs_alert_tag(log->l_mp, XFS_PTAG_SHUTDOWN_LOGERROR, "Filesystem has been shut down due to log error (0x%x).", shutdown_flags); xfs_alert(log->l_mp, "Please unmount the filesystem and rectify the problem(s)."); if (xfs_error_level >= XFS_ERRLEVEL_HIGH) xfs_stack_trace(); } /* * We don't want anybody waiting for log reservations after this. That * means we have to wake up everybody queued up on reserveq as well as * writeq. In addition, we make sure in xlog_{re}grant_log_space that * we don't enqueue anything once the SHUTDOWN flag is set, and this * action is protected by the grant locks. */ xlog_grant_head_wake_all(&log->l_reserve_head); xlog_grant_head_wake_all(&log->l_write_head); /* * Wake up everybody waiting on xfs_log_force. Wake the CIL push first * as if the log writes were completed. The abort handling in the log * item committed callback functions will do this again under lock to * avoid races. */ spin_lock(&log->l_cilp->xc_push_lock); wake_up_all(&log->l_cilp->xc_start_wait); wake_up_all(&log->l_cilp->xc_commit_wait); spin_unlock(&log->l_cilp->xc_push_lock); spin_lock(&log->l_icloglock); xlog_state_shutdown_callbacks(log); spin_unlock(&log->l_icloglock); wake_up_var(&log->l_opstate); return log_error; } STATIC int xlog_iclogs_empty( struct xlog *log) { xlog_in_core_t *iclog; iclog = log->l_iclog; do { /* endianness does not matter here, zero is zero in * any language. */ if (iclog->ic_header.h_num_logops) return 0; iclog = iclog->ic_next; } while (iclog != log->l_iclog); return 1; } /* * Verify that an LSN stamped into a piece of metadata is valid. This is * intended for use in read verifiers on v5 superblocks. */ bool xfs_log_check_lsn( struct xfs_mount *mp, xfs_lsn_t lsn) { struct xlog *log = mp->m_log; bool valid; /* * norecovery mode skips mount-time log processing and unconditionally * resets the in-core LSN. We can't validate in this mode, but * modifications are not allowed anyways so just return true. */ if (xfs_has_norecovery(mp)) return true; /* * Some metadata LSNs are initialized to NULL (e.g., the agfl). This is * handled by recovery and thus safe to ignore here. */ if (lsn == NULLCOMMITLSN) return true; valid = xlog_valid_lsn(mp->m_log, lsn); /* warn the user about what's gone wrong before verifier failure */ if (!valid) { spin_lock(&log->l_icloglock); xfs_warn(mp, "Corruption warning: Metadata has LSN (%d:%d) ahead of current LSN (%d:%d). " "Please unmount and run xfs_repair (>= v4.3) to resolve.", CYCLE_LSN(lsn), BLOCK_LSN(lsn), log->l_curr_cycle, log->l_curr_block); spin_unlock(&log->l_icloglock); } return valid; } /* * Notify the log that we're about to start using a feature that is protected * by a log incompat feature flag. This will prevent log covering from * clearing those flags. */ void xlog_use_incompat_feat( struct xlog *log) { down_read(&log->l_incompat_users); } /* Notify the log that we've finished using log incompat features. */ void xlog_drop_incompat_feat( struct xlog *log) { up_read(&log->l_incompat_users); } |
100 58 94 44 1 | 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60 61 62 63 64 65 66 67 68 69 70 71 72 73 74 75 76 77 78 79 80 81 82 83 84 85 86 87 88 89 90 91 92 93 94 95 96 97 98 99 100 101 102 103 104 105 106 107 108 109 110 111 112 113 114 115 116 117 118 119 120 121 122 123 124 125 126 127 128 129 130 131 132 133 134 135 136 137 138 139 140 141 142 143 144 145 146 147 148 149 150 151 152 153 154 155 156 157 158 159 160 161 162 163 164 165 166 167 168 169 170 171 172 173 174 175 176 177 178 179 180 181 182 183 184 185 186 187 188 189 190 191 192 193 194 195 196 197 198 199 200 201 202 203 204 205 206 207 208 209 210 211 212 213 214 215 216 217 218 219 220 221 222 223 224 225 226 227 228 229 230 231 232 233 234 235 236 237 238 239 240 241 242 243 244 245 246 247 248 249 250 251 252 253 254 255 256 257 258 259 260 261 262 263 264 265 266 267 268 269 270 271 272 273 274 275 276 277 278 279 280 281 282 283 284 285 286 287 288 289 290 291 292 293 294 295 296 297 298 299 300 301 302 303 304 305 306 307 308 309 310 311 312 313 314 315 316 317 318 319 320 321 322 323 324 325 326 327 328 329 330 331 332 333 334 335 336 337 338 339 340 341 342 343 344 345 346 347 348 349 350 351 352 353 354 355 356 357 358 359 360 361 362 363 364 365 366 367 368 369 370 371 372 373 374 375 376 377 378 379 380 | // SPDX-License-Identifier: GPL-2.0 /* * Copyright (c) 2000-2005 Silicon Graphics, Inc. * All Rights Reserved. */ #ifndef __XFS_BUF_H__ #define __XFS_BUF_H__ #include <linux/list.h> #include <linux/types.h> #include <linux/spinlock.h> #include <linux/mm.h> #include <linux/fs.h> #include <linux/dax.h> #include <linux/uio.h> #include <linux/list_lru.h> extern struct kmem_cache *xfs_buf_cache; /* * Base types */ struct xfs_buf; #define XFS_BUF_DADDR_NULL ((xfs_daddr_t) (-1LL)) #define XBF_READ (1u << 0) /* buffer intended for reading from device */ #define XBF_WRITE (1u << 1) /* buffer intended for writing to device */ #define XBF_READ_AHEAD (1u << 2) /* asynchronous read-ahead */ #define XBF_NO_IOACCT (1u << 3) /* bypass I/O accounting (non-LRU bufs) */ #define XBF_ASYNC (1u << 4) /* initiator will not wait for completion */ #define XBF_DONE (1u << 5) /* all pages in the buffer uptodate */ #define XBF_STALE (1u << 6) /* buffer has been staled, do not find it */ #define XBF_WRITE_FAIL (1u << 7) /* async writes have failed on this buffer */ /* buffer type flags for write callbacks */ #define _XBF_INODES (1u << 16)/* inode buffer */ #define _XBF_DQUOTS (1u << 17)/* dquot buffer */ #define _XBF_LOGRECOVERY (1u << 18)/* log recovery buffer */ /* flags used only internally */ #define _XBF_PAGES (1u << 20)/* backed by refcounted pages */ #define _XBF_KMEM (1u << 21)/* backed by heap memory */ #define _XBF_DELWRI_Q (1u << 22)/* buffer on a delwri queue */ /* flags used only as arguments to access routines */ /* * Online fsck is scanning the buffer cache for live buffers. Do not warn * about length mismatches during lookups and do not return stale buffers. */ #define XBF_LIVESCAN (1u << 28) #define XBF_INCORE (1u << 29)/* lookup only, return if found in cache */ #define XBF_TRYLOCK (1u << 30)/* lock requested, but do not wait */ #define XBF_UNMAPPED (1u << 31)/* do not map the buffer */ typedef unsigned int xfs_buf_flags_t; #define XFS_BUF_FLAGS \ { XBF_READ, "READ" }, \ { XBF_WRITE, "WRITE" }, \ { XBF_READ_AHEAD, "READ_AHEAD" }, \ { XBF_NO_IOACCT, "NO_IOACCT" }, \ { XBF_ASYNC, "ASYNC" }, \ { XBF_DONE, "DONE" }, \ { XBF_STALE, "STALE" }, \ { XBF_WRITE_FAIL, "WRITE_FAIL" }, \ { _XBF_INODES, "INODES" }, \ { _XBF_DQUOTS, "DQUOTS" }, \ { _XBF_LOGRECOVERY, "LOG_RECOVERY" }, \ { _XBF_PAGES, "PAGES" }, \ { _XBF_KMEM, "KMEM" }, \ { _XBF_DELWRI_Q, "DELWRI_Q" }, \ /* The following interface flags should never be set */ \ { XBF_LIVESCAN, "LIVESCAN" }, \ { XBF_INCORE, "INCORE" }, \ { XBF_TRYLOCK, "TRYLOCK" }, \ { XBF_UNMAPPED, "UNMAPPED" } /* * Internal state flags. */ #define XFS_BSTATE_DISPOSE (1 << 0) /* buffer being discarded */ #define XFS_BSTATE_IN_FLIGHT (1 << 1) /* I/O in flight */ /* * The xfs_buftarg contains 2 notions of "sector size" - * * 1) The metadata sector size, which is the minimum unit and * alignment of IO which will be performed by metadata operations. * 2) The device logical sector size * * The first is specified at mkfs time, and is stored on-disk in the * superblock's sb_sectsize. * * The latter is derived from the underlying device, and controls direct IO * alignment constraints. */ typedef struct xfs_buftarg { dev_t bt_dev; struct block_device *bt_bdev; struct dax_device *bt_daxdev; u64 bt_dax_part_off; struct xfs_mount *bt_mount; unsigned int bt_meta_sectorsize; size_t bt_meta_sectormask; size_t bt_logical_sectorsize; size_t bt_logical_sectormask; /* LRU control structures */ struct shrinker bt_shrinker; struct list_lru bt_lru; struct percpu_counter bt_io_count; struct ratelimit_state bt_ioerror_rl; } xfs_buftarg_t; #define XB_PAGES 2 struct xfs_buf_map { xfs_daddr_t bm_bn; /* block number for I/O */ int bm_len; /* size of I/O */ unsigned int bm_flags; }; /* * Online fsck is scanning the buffer cache for live buffers. Do not warn * about length mismatches during lookups and do not return stale buffers. */ #define XBM_LIVESCAN (1U << 0) #define DEFINE_SINGLE_BUF_MAP(map, blkno, numblk) \ struct xfs_buf_map (map) = { .bm_bn = (blkno), .bm_len = (numblk) }; struct xfs_buf_ops { char *name; union { __be32 magic[2]; /* v4 and v5 on disk magic values */ __be16 magic16[2]; /* v4 and v5 on disk magic values */ }; void (*verify_read)(struct xfs_buf *); void (*verify_write)(struct xfs_buf *); xfs_failaddr_t (*verify_struct)(struct xfs_buf *bp); }; struct xfs_buf { /* * first cacheline holds all the fields needed for an uncontended cache * hit to be fully processed. The semaphore straddles the cacheline * boundary, but the counter and lock sits on the first cacheline, * which is the only bit that is touched if we hit the semaphore * fast-path on locking. */ struct rhash_head b_rhash_head; /* pag buffer hash node */ xfs_daddr_t b_rhash_key; /* buffer cache index */ int b_length; /* size of buffer in BBs */ atomic_t b_hold; /* reference count */ atomic_t b_lru_ref; /* lru reclaim ref count */ xfs_buf_flags_t b_flags; /* status flags */ struct semaphore b_sema; /* semaphore for lockables */ /* * concurrent access to b_lru and b_lru_flags are protected by * bt_lru_lock and not by b_sema */ struct list_head b_lru; /* lru list */ spinlock_t b_lock; /* internal state lock */ unsigned int b_state; /* internal state flags */ int b_io_error; /* internal IO error state */ wait_queue_head_t b_waiters; /* unpin waiters */ struct list_head b_list; struct xfs_perag *b_pag; /* contains rbtree root */ struct xfs_mount *b_mount; struct xfs_buftarg *b_target; /* buffer target (device) */ void *b_addr; /* virtual address of buffer */ struct work_struct b_ioend_work; struct completion b_iowait; /* queue for I/O waiters */ struct xfs_buf_log_item *b_log_item; struct list_head b_li_list; /* Log items list head */ struct xfs_trans *b_transp; struct page **b_pages; /* array of page pointers */ struct page *b_page_array[XB_PAGES]; /* inline pages */ struct xfs_buf_map *b_maps; /* compound buffer map */ struct xfs_buf_map __b_map; /* inline compound buffer map */ int b_map_count; atomic_t b_pin_count; /* pin count */ atomic_t b_io_remaining; /* #outstanding I/O requests */ unsigned int b_page_count; /* size of page array */ unsigned int b_offset; /* page offset of b_addr, only for _XBF_KMEM buffers */ int b_error; /* error code on I/O */ /* * async write failure retry count. Initialised to zero on the first * failure, then when it exceeds the maximum configured without a * success the write is considered to be failed permanently and the * iodone handler will take appropriate action. * * For retry timeouts, we record the jiffie of the first failure. This * means that we can change the retry timeout for buffers already under * I/O and thus avoid getting stuck in a retry loop with a long timeout. * * last_error is used to ensure that we are getting repeated errors, not * different errors. e.g. a block device might change ENOSPC to EIO when * a failure timeout occurs, so we want to re-initialise the error * retry behaviour appropriately when that happens. */ int b_retries; unsigned long b_first_retry_time; /* in jiffies */ int b_last_error; const struct xfs_buf_ops *b_ops; struct rcu_head b_rcu; }; /* Finding and Reading Buffers */ int xfs_buf_get_map(struct xfs_buftarg *target, struct xfs_buf_map *map, int nmaps, xfs_buf_flags_t flags, struct xfs_buf **bpp); int xfs_buf_read_map(struct xfs_buftarg *target, struct xfs_buf_map *map, int nmaps, xfs_buf_flags_t flags, struct xfs_buf **bpp, const struct xfs_buf_ops *ops, xfs_failaddr_t fa); void xfs_buf_readahead_map(struct xfs_buftarg *target, struct xfs_buf_map *map, int nmaps, const struct xfs_buf_ops *ops); static inline int xfs_buf_incore( struct xfs_buftarg *target, xfs_daddr_t blkno, size_t numblks, xfs_buf_flags_t flags, struct xfs_buf **bpp) { DEFINE_SINGLE_BUF_MAP(map, blkno, numblks); return xfs_buf_get_map(target, &map, 1, XBF_INCORE | flags, bpp); } static inline int xfs_buf_get( struct xfs_buftarg *target, xfs_daddr_t blkno, size_t numblks, struct xfs_buf **bpp) { DEFINE_SINGLE_BUF_MAP(map, blkno, numblks); return xfs_buf_get_map(target, &map, 1, 0, bpp); } static inline int xfs_buf_read( struct xfs_buftarg *target, xfs_daddr_t blkno, size_t numblks, xfs_buf_flags_t flags, struct xfs_buf **bpp, const struct xfs_buf_ops *ops) { DEFINE_SINGLE_BUF_MAP(map, blkno, numblks); return xfs_buf_read_map(target, &map, 1, flags, bpp, ops, __builtin_return_address(0)); } static inline void xfs_buf_readahead( struct xfs_buftarg *target, xfs_daddr_t blkno, size_t numblks, const struct xfs_buf_ops *ops) { DEFINE_SINGLE_BUF_MAP(map, blkno, numblks); return xfs_buf_readahead_map(target, &map, 1, ops); } int xfs_buf_get_uncached(struct xfs_buftarg *target, size_t numblks, xfs_buf_flags_t flags, struct xfs_buf **bpp); int xfs_buf_read_uncached(struct xfs_buftarg *target, xfs_daddr_t daddr, size_t numblks, xfs_buf_flags_t flags, struct xfs_buf **bpp, const struct xfs_buf_ops *ops); int _xfs_buf_read(struct xfs_buf *bp, xfs_buf_flags_t flags); void xfs_buf_hold(struct xfs_buf *bp); /* Releasing Buffers */ extern void xfs_buf_rele(struct xfs_buf *); /* Locking and Unlocking Buffers */ extern int xfs_buf_trylock(struct xfs_buf *); extern void xfs_buf_lock(struct xfs_buf *); extern void xfs_buf_unlock(struct xfs_buf *); #define xfs_buf_islocked(bp) \ ((bp)->b_sema.count <= 0) static inline void xfs_buf_relse(struct xfs_buf *bp) { xfs_buf_unlock(bp); xfs_buf_rele(bp); } /* Buffer Read and Write Routines */ extern int xfs_bwrite(struct xfs_buf *bp); extern void __xfs_buf_ioerror(struct xfs_buf *bp, int error, xfs_failaddr_t failaddr); #define xfs_buf_ioerror(bp, err) __xfs_buf_ioerror((bp), (err), __this_address) extern void xfs_buf_ioerror_alert(struct xfs_buf *bp, xfs_failaddr_t fa); void xfs_buf_ioend_fail(struct xfs_buf *); void xfs_buf_zero(struct xfs_buf *bp, size_t boff, size_t bsize); void __xfs_buf_mark_corrupt(struct xfs_buf *bp, xfs_failaddr_t fa); #define xfs_buf_mark_corrupt(bp) __xfs_buf_mark_corrupt((bp), __this_address) /* Buffer Utility Routines */ extern void *xfs_buf_offset(struct xfs_buf *, size_t); extern void xfs_buf_stale(struct xfs_buf *bp); /* Delayed Write Buffer Routines */ extern void xfs_buf_delwri_cancel(struct list_head *); extern bool xfs_buf_delwri_queue(struct xfs_buf *, struct list_head *); extern int xfs_buf_delwri_submit(struct list_head *); extern int xfs_buf_delwri_submit_nowait(struct list_head *); extern int xfs_buf_delwri_pushbuf(struct xfs_buf *, struct list_head *); static inline xfs_daddr_t xfs_buf_daddr(struct xfs_buf *bp) { return bp->b_maps[0].bm_bn; } void xfs_buf_set_ref(struct xfs_buf *bp, int lru_ref); /* * If the buffer is already on the LRU, do nothing. Otherwise set the buffer * up with a reference count of 0 so it will be tossed from the cache when * released. */ static inline void xfs_buf_oneshot(struct xfs_buf *bp) { if (!list_empty(&bp->b_lru) || atomic_read(&bp->b_lru_ref) > 1) return; atomic_set(&bp->b_lru_ref, 0); } static inline int xfs_buf_ispinned(struct xfs_buf *bp) { return atomic_read(&bp->b_pin_count); } static inline int xfs_buf_verify_cksum(struct xfs_buf *bp, unsigned long cksum_offset) { return xfs_verify_cksum(bp->b_addr, BBTOB(bp->b_length), cksum_offset); } static inline void xfs_buf_update_cksum(struct xfs_buf *bp, unsigned long cksum_offset) { xfs_update_cksum(bp->b_addr, BBTOB(bp->b_length), cksum_offset); } /* * Handling of buftargs. */ struct xfs_buftarg *xfs_alloc_buftarg(struct xfs_mount *mp, struct block_device *bdev); extern void xfs_free_buftarg(struct xfs_buftarg *); extern void xfs_buftarg_wait(struct xfs_buftarg *); extern void xfs_buftarg_drain(struct xfs_buftarg *); extern int xfs_setsize_buftarg(struct xfs_buftarg *, unsigned int); #define xfs_getsize_buftarg(buftarg) block_size((buftarg)->bt_bdev) #define xfs_readonly_buftarg(buftarg) bdev_read_only((buftarg)->bt_bdev) int xfs_buf_reverify(struct xfs_buf *bp, const struct xfs_buf_ops *ops); bool xfs_verify_magic(struct xfs_buf *bp, __be32 dmagic); bool xfs_verify_magic16(struct xfs_buf *bp, __be16 dmagic); #endif /* __XFS_BUF_H__ */ |
1625 492 41 1590 1744 87 1730 1 367 8 6 471 1248 35 653 1212 1211 65 | 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60 61 62 63 64 65 66 67 68 69 70 71 72 73 74 75 76 77 78 79 80 81 82 83 84 85 86 87 88 89 90 91 92 93 94 95 96 97 98 99 100 101 102 103 104 105 106 107 108 109 110 111 112 113 114 115 116 117 118 119 120 121 122 123 124 125 126 127 128 129 130 131 132 133 134 135 136 137 138 139 140 141 142 143 144 145 146 147 148 149 150 151 152 153 154 155 156 157 158 159 160 161 162 163 164 165 166 167 168 169 170 171 172 173 174 175 176 177 178 179 180 181 182 183 184 185 186 187 188 189 190 191 192 193 194 195 196 197 198 199 200 201 202 203 204 205 206 207 208 209 210 211 212 213 214 215 216 217 218 219 220 221 222 223 224 225 226 227 228 229 230 231 232 233 234 235 236 237 238 239 240 241 242 243 244 245 246 247 248 249 250 251 252 253 254 255 256 257 258 259 260 261 262 263 264 265 266 267 268 269 270 271 272 273 274 275 276 277 278 279 280 281 282 283 284 285 286 287 288 289 290 291 292 293 294 295 296 297 298 299 300 301 302 303 304 305 306 307 308 309 310 311 312 313 314 315 316 317 318 319 320 321 322 323 324 325 326 327 328 329 330 331 332 333 334 335 336 337 338 339 340 341 342 343 344 345 346 347 348 349 350 351 352 353 354 355 356 357 358 359 360 361 362 363 364 365 366 367 368 369 370 371 372 373 374 375 376 377 378 379 380 381 382 383 384 385 386 387 388 389 390 391 392 393 394 395 396 397 398 399 400 401 402 403 404 405 406 407 408 409 410 411 412 413 414 415 416 417 418 419 420 421 422 423 424 425 426 427 428 429 430 431 432 433 434 435 436 437 438 439 440 441 442 443 444 445 446 447 448 449 450 451 452 453 454 455 456 457 458 459 460 461 462 463 464 465 466 467 468 469 470 471 472 473 474 475 476 477 478 479 480 481 482 483 484 485 486 487 488 489 490 491 492 493 494 495 496 497 498 499 500 501 502 503 504 505 506 507 508 509 510 511 512 513 514 515 516 517 518 519 520 521 522 523 524 525 526 527 528 529 530 531 532 533 534 535 536 537 538 539 540 541 542 543 544 545 546 547 548 549 550 551 552 553 554 555 556 557 558 559 560 561 562 563 564 565 566 567 568 569 570 571 572 573 574 575 576 577 578 579 580 581 582 583 584 585 586 587 588 589 590 591 592 593 594 595 596 597 598 599 600 601 602 603 604 605 606 607 608 609 610 611 612 613 614 615 616 617 618 619 620 621 622 623 624 625 626 627 628 629 630 631 632 633 634 635 636 637 638 639 640 641 642 643 644 645 646 647 648 649 650 651 652 653 654 655 656 657 658 659 660 661 662 663 664 665 666 667 668 669 670 671 672 673 674 675 676 677 678 679 680 681 682 683 684 685 686 687 688 689 690 691 692 693 694 695 | /* SPDX-License-Identifier: GPL-2.0 */ #ifndef _LINUX_SCATTERLIST_H #define _LINUX_SCATTERLIST_H #include <linux/string.h> #include <linux/types.h> #include <linux/bug.h> #include <linux/mm.h> #include <asm/io.h> struct scatterlist { unsigned long page_link; unsigned int offset; unsigned int length; dma_addr_t dma_address; #ifdef CONFIG_NEED_SG_DMA_LENGTH unsigned int dma_length; #endif #ifdef CONFIG_NEED_SG_DMA_FLAGS unsigned int dma_flags; #endif }; /* * These macros should be used after a dma_map_sg call has been done * to get bus addresses of each of the SG entries and their lengths. * You should only work with the number of sg entries dma_map_sg * returns, or alternatively stop on the first sg_dma_len(sg) which * is 0. */ #define sg_dma_address(sg) ((sg)->dma_address) #ifdef CONFIG_NEED_SG_DMA_LENGTH #define sg_dma_len(sg) ((sg)->dma_length) #else #define sg_dma_len(sg) ((sg)->length) #endif struct sg_table { struct scatterlist *sgl; /* the list */ unsigned int nents; /* number of mapped entries */ unsigned int orig_nents; /* original size of list */ }; struct sg_append_table { struct sg_table sgt; /* The scatter list table */ struct scatterlist *prv; /* last populated sge in the table */ unsigned int total_nents; /* Total entries in the table */ }; /* * Notes on SG table design. * * We use the unsigned long page_link field in the scatterlist struct to place * the page pointer AND encode information about the sg table as well. The two * lower bits are reserved for this information. * * If bit 0 is set, then the page_link contains a pointer to the next sg * table list. Otherwise the next entry is at sg + 1. * * If bit 1 is set, then this sg entry is the last element in a list. * * See sg_next(). * */ #define SG_CHAIN 0x01UL #define SG_END 0x02UL /* * We overload the LSB of the page pointer to indicate whether it's * a valid sg entry, or whether it points to the start of a new scatterlist. * Those low bits are there for everyone! (thanks mason :-) */ #define SG_PAGE_LINK_MASK (SG_CHAIN | SG_END) static inline unsigned int __sg_flags(struct scatterlist *sg) { return sg->page_link & SG_PAGE_LINK_MASK; } static inline struct scatterlist *sg_chain_ptr(struct scatterlist *sg) { return (struct scatterlist *)(sg->page_link & ~SG_PAGE_LINK_MASK); } static inline bool sg_is_chain(struct scatterlist *sg) { return __sg_flags(sg) & SG_CHAIN; } static inline bool sg_is_last(struct scatterlist *sg) { return __sg_flags(sg) & SG_END; } /** * sg_assign_page - Assign a given page to an SG entry * @sg: SG entry * @page: The page * * Description: * Assign page to sg entry. Also see sg_set_page(), the most commonly used * variant. * **/ static inline void sg_assign_page(struct scatterlist *sg, struct page *page) { unsigned long page_link = sg->page_link & (SG_CHAIN | SG_END); /* * In order for the low bit stealing approach to work, pages * must be aligned at a 32-bit boundary as a minimum. */ BUG_ON((unsigned long)page & SG_PAGE_LINK_MASK); #ifdef CONFIG_DEBUG_SG BUG_ON(sg_is_chain(sg)); #endif sg->page_link = page_link | (unsigned long) page; } /** * sg_set_page - Set sg entry to point at given page * @sg: SG entry * @page: The page * @len: Length of data * @offset: Offset into page * * Description: * Use this function to set an sg entry pointing at a page, never assign * the page directly. We encode sg table information in the lower bits * of the page pointer. See sg_page() for looking up the page belonging * to an sg entry. * **/ static inline void sg_set_page(struct scatterlist *sg, struct page *page, unsigned int len, unsigned int offset) { sg_assign_page(sg, page); sg->offset = offset; sg->length = len; } /** * sg_set_folio - Set sg entry to point at given folio * @sg: SG entry * @folio: The folio * @len: Length of data * @offset: Offset into folio * * Description: * Use this function to set an sg entry pointing at a folio, never assign * the folio directly. We encode sg table information in the lower bits * of the folio pointer. See sg_page() for looking up the page belonging * to an sg entry. * **/ static inline void sg_set_folio(struct scatterlist *sg, struct folio *folio, size_t len, size_t offset) { WARN_ON_ONCE(len > UINT_MAX); WARN_ON_ONCE(offset > UINT_MAX); sg_assign_page(sg, &folio->page); sg->offset = offset; sg->length = len; } static inline struct page *sg_page(struct scatterlist *sg) { #ifdef CONFIG_DEBUG_SG BUG_ON(sg_is_chain(sg)); #endif return (struct page *)((sg)->page_link & ~SG_PAGE_LINK_MASK); } /** * sg_set_buf - Set sg entry to point at given data * @sg: SG entry * @buf: Data * @buflen: Data length * **/ static inline void sg_set_buf(struct scatterlist *sg, const void *buf, unsigned int buflen) { #ifdef CONFIG_DEBUG_SG BUG_ON(!virt_addr_valid(buf)); #endif sg_set_page(sg, virt_to_page(buf), buflen, offset_in_page(buf)); } /* * Loop over each sg element, following the pointer to a new list if necessary */ #define for_each_sg(sglist, sg, nr, __i) \ for (__i = 0, sg = (sglist); __i < (nr); __i++, sg = sg_next(sg)) /* * Loop over each sg element in the given sg_table object. */ #define for_each_sgtable_sg(sgt, sg, i) \ for_each_sg((sgt)->sgl, sg, (sgt)->orig_nents, i) /* * Loop over each sg element in the given *DMA mapped* sg_table object. * Please use sg_dma_address(sg) and sg_dma_len(sg) to extract DMA addresses * of the each element. */ #define for_each_sgtable_dma_sg(sgt, sg, i) \ for_each_sg((sgt)->sgl, sg, (sgt)->nents, i) static inline void __sg_chain(struct scatterlist *chain_sg, struct scatterlist *sgl) { /* * offset and length are unused for chain entry. Clear them. */ chain_sg->offset = 0; chain_sg->length = 0; /* * Set lowest bit to indicate a link pointer, and make sure to clear * the termination bit if it happens to be set. */ chain_sg->page_link = ((unsigned long) sgl | SG_CHAIN) & ~SG_END; } /** * sg_chain - Chain two sglists together * @prv: First scatterlist * @prv_nents: Number of entries in prv * @sgl: Second scatterlist * * Description: * Links @prv@ and @sgl@ together, to form a longer scatterlist. * **/ static inline void sg_chain(struct scatterlist *prv, unsigned int prv_nents, struct scatterlist *sgl) { __sg_chain(&prv[prv_nents - 1], sgl); } /** * sg_mark_end - Mark the end of the scatterlist * @sg: SG entryScatterlist * * Description: * Marks the passed in sg entry as the termination point for the sg * table. A call to sg_next() on this entry will return NULL. * **/ static inline void sg_mark_end(struct scatterlist *sg) { /* * Set termination bit, clear potential chain bit */ sg->page_link |= SG_END; sg->page_link &= ~SG_CHAIN; } /** * sg_unmark_end - Undo setting the end of the scatterlist * @sg: SG entryScatterlist * * Description: * Removes the termination marker from the given entry of the scatterlist. * **/ static inline void sg_unmark_end(struct scatterlist *sg) { sg->page_link &= ~SG_END; } /* * One 64-bit architectures there is a 4-byte padding in struct scatterlist * (assuming also CONFIG_NEED_SG_DMA_LENGTH is set). Use this padding for DMA * flags bits to indicate when a specific dma address is a bus address or the * buffer may have been bounced via SWIOTLB. */ #ifdef CONFIG_NEED_SG_DMA_FLAGS #define SG_DMA_BUS_ADDRESS (1 << 0) #define SG_DMA_SWIOTLB (1 << 1) /** * sg_dma_is_bus_address - Return whether a given segment was marked * as a bus address * @sg: SG entry * * Description: * Returns true if sg_dma_mark_bus_address() has been called on * this segment. **/ static inline bool sg_dma_is_bus_address(struct scatterlist *sg) { return sg->dma_flags & SG_DMA_BUS_ADDRESS; } /** * sg_dma_mark_bus_address - Mark the scatterlist entry as a bus address * @sg: SG entry * * Description: * Marks the passed in sg entry to indicate that the dma_address is * a bus address and doesn't need to be unmapped. This should only be * used by dma_map_sg() implementations to mark bus addresses * so they can be properly cleaned up in dma_unmap_sg(). **/ static inline void sg_dma_mark_bus_address(struct scatterlist *sg) { sg->dma_flags |= SG_DMA_BUS_ADDRESS; } /** * sg_unmark_bus_address - Unmark the scatterlist entry as a bus address * @sg: SG entry * * Description: * Clears the bus address mark. **/ static inline void sg_dma_unmark_bus_address(struct scatterlist *sg) { sg->dma_flags &= ~SG_DMA_BUS_ADDRESS; } /** * sg_dma_is_swiotlb - Return whether the scatterlist was marked for SWIOTLB * bouncing * @sg: SG entry * * Description: * Returns true if the scatterlist was marked for SWIOTLB bouncing. Not all * elements may have been bounced, so the caller would have to check * individual SG entries with is_swiotlb_buffer(). */ static inline bool sg_dma_is_swiotlb(struct scatterlist *sg) { return sg->dma_flags & SG_DMA_SWIOTLB; } /** * sg_dma_mark_swiotlb - Mark the scatterlist for SWIOTLB bouncing * @sg: SG entry * * Description: * Marks a a scatterlist for SWIOTLB bounce. Not all SG entries may be * bounced. */ static inline void sg_dma_mark_swiotlb(struct scatterlist *sg) { sg->dma_flags |= SG_DMA_SWIOTLB; } #else static inline bool sg_dma_is_bus_address(struct scatterlist *sg) { return false; } static inline void sg_dma_mark_bus_address(struct scatterlist *sg) { } static inline void sg_dma_unmark_bus_address(struct scatterlist *sg) { } static inline bool sg_dma_is_swiotlb(struct scatterlist *sg) { return false; } static inline void sg_dma_mark_swiotlb(struct scatterlist *sg) { } #endif /* CONFIG_NEED_SG_DMA_FLAGS */ /** * sg_phys - Return physical address of an sg entry * @sg: SG entry * * Description: * This calls page_to_phys() on the page in this sg entry, and adds the * sg offset. The caller must know that it is legal to call page_to_phys() * on the sg page. * **/ static inline dma_addr_t sg_phys(struct scatterlist *sg) { return page_to_phys(sg_page(sg)) + sg->offset; } /** * sg_virt - Return virtual address of an sg entry * @sg: SG entry * * Description: * This calls page_address() on the page in this sg entry, and adds the * sg offset. The caller must know that the sg page has a valid virtual * mapping. * **/ static inline void *sg_virt(struct scatterlist *sg) { return page_address(sg_page(sg)) + sg->offset; } /** * sg_init_marker - Initialize markers in sg table * @sgl: The SG table * @nents: Number of entries in table * **/ static inline void sg_init_marker(struct scatterlist *sgl, unsigned int nents) { sg_mark_end(&sgl[nents - 1]); } int sg_nents(struct scatterlist *sg); int sg_nents_for_len(struct scatterlist *sg, u64 len); struct scatterlist *sg_next(struct scatterlist *); struct scatterlist *sg_last(struct scatterlist *s, unsigned int); void sg_init_table(struct scatterlist *, unsigned int); void sg_init_one(struct scatterlist *, const void *, unsigned int); int sg_split(struct scatterlist *in, const int in_mapped_nents, const off_t skip, const int nb_splits, const size_t *split_sizes, struct scatterlist **out, int *out_mapped_nents, gfp_t gfp_mask); typedef struct scatterlist *(sg_alloc_fn)(unsigned int, gfp_t); typedef void (sg_free_fn)(struct scatterlist *, unsigned int); void __sg_free_table(struct sg_table *, unsigned int, unsigned int, sg_free_fn *, unsigned int); void sg_free_table(struct sg_table *); void sg_free_append_table(struct sg_append_table *sgt); int __sg_alloc_table(struct sg_table *, unsigned int, unsigned int, struct scatterlist *, unsigned int, gfp_t, sg_alloc_fn *); int sg_alloc_table(struct sg_table *, unsigned int, gfp_t); int sg_alloc_append_table_from_pages(struct sg_append_table *sgt, struct page **pages, unsigned int n_pages, unsigned int offset, unsigned long size, unsigned int max_segment, unsigned int left_pages, gfp_t gfp_mask); int sg_alloc_table_from_pages_segment(struct sg_table *sgt, struct page **pages, unsigned int n_pages, unsigned int offset, unsigned long size, unsigned int max_segment, gfp_t gfp_mask); /** * sg_alloc_table_from_pages - Allocate and initialize an sg table from * an array of pages * @sgt: The sg table header to use * @pages: Pointer to an array of page pointers * @n_pages: Number of pages in the pages array * @offset: Offset from start of the first page to the start of a buffer * @size: Number of valid bytes in the buffer (after offset) * @gfp_mask: GFP allocation mask * * Description: * Allocate and initialize an sg table from a list of pages. Contiguous * ranges of the pages are squashed into a single scatterlist node. A user * may provide an offset at a start and a size of valid data in a buffer * specified by the page array. The returned sg table is released by * sg_free_table. * * Returns: * 0 on success, negative error on failure */ static inline int sg_alloc_table_from_pages(struct sg_table *sgt, struct page **pages, unsigned int n_pages, unsigned int offset, unsigned long size, gfp_t gfp_mask) { return sg_alloc_table_from_pages_segment(sgt, pages, n_pages, offset, size, UINT_MAX, gfp_mask); } #ifdef CONFIG_SGL_ALLOC struct scatterlist *sgl_alloc_order(unsigned long long length, unsigned int order, bool chainable, gfp_t gfp, unsigned int *nent_p); struct scatterlist *sgl_alloc(unsigned long long length, gfp_t gfp, unsigned int *nent_p); void sgl_free_n_order(struct scatterlist *sgl, int nents, int order); void sgl_free_order(struct scatterlist *sgl, int order); void sgl_free(struct scatterlist *sgl); #endif /* CONFIG_SGL_ALLOC */ size_t sg_copy_buffer(struct scatterlist *sgl, unsigned int nents, void *buf, size_t buflen, off_t skip, bool to_buffer); size_t sg_copy_from_buffer(struct scatterlist *sgl, unsigned int nents, const void *buf, size_t buflen); size_t sg_copy_to_buffer(struct scatterlist *sgl, unsigned int nents, void *buf, size_t buflen); size_t sg_pcopy_from_buffer(struct scatterlist *sgl, unsigned int nents, const void *buf, size_t buflen, off_t skip); size_t sg_pcopy_to_buffer(struct scatterlist *sgl, unsigned int nents, void *buf, size_t buflen, off_t skip); size_t sg_zero_buffer(struct scatterlist *sgl, unsigned int nents, size_t buflen, off_t skip); /* * Maximum number of entries that will be allocated in one piece, if * a list larger than this is required then chaining will be utilized. */ #define SG_MAX_SINGLE_ALLOC (PAGE_SIZE / sizeof(struct scatterlist)) /* * The maximum number of SG segments that we will put inside a * scatterlist (unless chaining is used). Should ideally fit inside a * single page, to avoid a higher order allocation. We could define this * to SG_MAX_SINGLE_ALLOC to pack correctly at the highest order. The * minimum value is 32 */ #define SG_CHUNK_SIZE 128 /* * Like SG_CHUNK_SIZE, but for archs that have sg chaining. This limit * is totally arbitrary, a setting of 2048 will get you at least 8mb ios. */ #ifdef CONFIG_ARCH_NO_SG_CHAIN #define SG_MAX_SEGMENTS SG_CHUNK_SIZE #else #define SG_MAX_SEGMENTS 2048 #endif #ifdef CONFIG_SG_POOL void sg_free_table_chained(struct sg_table *table, unsigned nents_first_chunk); int sg_alloc_table_chained(struct sg_table *table, int nents, struct scatterlist *first_chunk, unsigned nents_first_chunk); #endif /* * sg page iterator * * Iterates over sg entries page-by-page. On each successful iteration, you * can call sg_page_iter_page(@piter) to get the current page. * @piter->sg will point to the sg holding this page and @piter->sg_pgoffset to * the page's page offset within the sg. The iteration will stop either when a * maximum number of sg entries was reached or a terminating sg * (sg_last(sg) == true) was reached. */ struct sg_page_iter { struct scatterlist *sg; /* sg holding the page */ unsigned int sg_pgoffset; /* page offset within the sg */ /* these are internal states, keep away */ unsigned int __nents; /* remaining sg entries */ int __pg_advance; /* nr pages to advance at the * next step */ }; /* * sg page iterator for DMA addresses * * This is the same as sg_page_iter however you can call * sg_page_iter_dma_address(@dma_iter) to get the page's DMA * address. sg_page_iter_page() cannot be called on this iterator. */ struct sg_dma_page_iter { struct sg_page_iter base; }; bool __sg_page_iter_next(struct sg_page_iter *piter); bool __sg_page_iter_dma_next(struct sg_dma_page_iter *dma_iter); void __sg_page_iter_start(struct sg_page_iter *piter, struct scatterlist *sglist, unsigned int nents, unsigned long pgoffset); /** * sg_page_iter_page - get the current page held by the page iterator * @piter: page iterator holding the page */ static inline struct page *sg_page_iter_page(struct sg_page_iter *piter) { return nth_page(sg_page(piter->sg), piter->sg_pgoffset); } /** * sg_page_iter_dma_address - get the dma address of the current page held by * the page iterator. * @dma_iter: page iterator holding the page */ static inline dma_addr_t sg_page_iter_dma_address(struct sg_dma_page_iter *dma_iter) { return sg_dma_address(dma_iter->base.sg) + (dma_iter->base.sg_pgoffset << PAGE_SHIFT); } /** * for_each_sg_page - iterate over the pages of the given sg list * @sglist: sglist to iterate over * @piter: page iterator to hold current page, sg, sg_pgoffset * @nents: maximum number of sg entries to iterate over * @pgoffset: starting page offset (in pages) * * Callers may use sg_page_iter_page() to get each page pointer. * In each loop it operates on PAGE_SIZE unit. */ #define for_each_sg_page(sglist, piter, nents, pgoffset) \ for (__sg_page_iter_start((piter), (sglist), (nents), (pgoffset)); \ __sg_page_iter_next(piter);) /** * for_each_sg_dma_page - iterate over the pages of the given sg list * @sglist: sglist to iterate over * @dma_iter: DMA page iterator to hold current page * @dma_nents: maximum number of sg entries to iterate over, this is the value * returned from dma_map_sg * @pgoffset: starting page offset (in pages) * * Callers may use sg_page_iter_dma_address() to get each page's DMA address. * In each loop it operates on PAGE_SIZE unit. */ #define for_each_sg_dma_page(sglist, dma_iter, dma_nents, pgoffset) \ for (__sg_page_iter_start(&(dma_iter)->base, sglist, dma_nents, \ pgoffset); \ __sg_page_iter_dma_next(dma_iter);) /** * for_each_sgtable_page - iterate over all pages in the sg_table object * @sgt: sg_table object to iterate over * @piter: page iterator to hold current page * @pgoffset: starting page offset (in pages) * * Iterates over the all memory pages in the buffer described by * a scatterlist stored in the given sg_table object. * See also for_each_sg_page(). In each loop it operates on PAGE_SIZE unit. */ #define for_each_sgtable_page(sgt, piter, pgoffset) \ for_each_sg_page((sgt)->sgl, piter, (sgt)->orig_nents, pgoffset) /** * for_each_sgtable_dma_page - iterate over the DMA mapped sg_table object * @sgt: sg_table object to iterate over * @dma_iter: DMA page iterator to hold current page * @pgoffset: starting page offset (in pages) * * Iterates over the all DMA mapped pages in the buffer described by * a scatterlist stored in the given sg_table object. * See also for_each_sg_dma_page(). In each loop it operates on PAGE_SIZE * unit. */ #define for_each_sgtable_dma_page(sgt, dma_iter, pgoffset) \ for_each_sg_dma_page((sgt)->sgl, dma_iter, (sgt)->nents, pgoffset) /* * Mapping sg iterator * * Iterates over sg entries mapping page-by-page. On each successful * iteration, @miter->page points to the mapped page and * @miter->length bytes of data can be accessed at @miter->addr. As * long as an iteration is enclosed between start and stop, the user * is free to choose control structure and when to stop. * * @miter->consumed is set to @miter->length on each iteration. It * can be adjusted if the user can't consume all the bytes in one go. * Also, a stopped iteration can be resumed by calling next on it. * This is useful when iteration needs to release all resources and * continue later (e.g. at the next interrupt). */ #define SG_MITER_ATOMIC (1 << 0) /* use kmap_atomic */ #define SG_MITER_TO_SG (1 << 1) /* flush back to phys on unmap */ #define SG_MITER_FROM_SG (1 << 2) /* nop */ struct sg_mapping_iter { /* the following three fields can be accessed directly */ struct page *page; /* currently mapped page */ void *addr; /* pointer to the mapped area */ size_t length; /* length of the mapped area */ size_t consumed; /* number of consumed bytes */ struct sg_page_iter piter; /* page iterator */ /* these are internal states, keep away */ unsigned int __offset; /* offset within page */ unsigned int __remaining; /* remaining bytes on page */ unsigned int __flags; }; void sg_miter_start(struct sg_mapping_iter *miter, struct scatterlist *sgl, unsigned int nents, unsigned int flags); bool sg_miter_skip(struct sg_mapping_iter *miter, off_t offset); bool sg_miter_next(struct sg_mapping_iter *miter); void sg_miter_stop(struct sg_mapping_iter *miter); #endif /* _LINUX_SCATTERLIST_H */ |
145 145 4 7 46 48 120 145 | 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60 61 62 63 64 65 66 67 68 69 70 71 72 73 74 75 76 77 78 79 80 81 82 83 84 85 86 87 88 89 90 91 92 93 94 95 96 97 98 99 100 101 102 103 104 105 106 107 108 109 110 111 112 113 114 115 116 117 118 119 120 121 122 123 124 125 126 127 128 129 130 131 132 133 134 135 136 137 138 139 140 141 142 143 144 145 146 147 148 149 150 151 152 153 154 155 156 157 158 159 160 161 162 163 164 165 166 167 168 169 170 171 172 173 174 175 176 177 178 179 180 181 182 183 184 185 186 187 188 189 190 191 192 193 194 195 196 197 198 199 200 201 202 203 204 205 206 207 208 209 210 211 212 213 214 215 216 217 218 219 220 221 222 223 224 225 226 227 228 229 230 231 232 233 234 235 236 237 238 239 240 241 242 243 244 245 246 247 248 249 250 251 252 253 254 255 256 257 258 259 260 261 262 263 264 265 266 267 268 269 270 271 272 273 274 275 276 277 278 279 280 281 282 283 284 285 286 287 288 289 290 291 292 293 294 295 296 297 298 299 300 301 302 303 304 305 306 307 308 309 310 311 312 313 | /* SPDX-License-Identifier: GPL-2.0 */ #ifndef BTRFS_EXTENT_IO_H #define BTRFS_EXTENT_IO_H #include <linux/rbtree.h> #include <linux/refcount.h> #include <linux/fiemap.h> #include <linux/btrfs_tree.h> #include "compression.h" #include "ulist.h" #include "misc.h" struct btrfs_trans_handle; enum { EXTENT_BUFFER_UPTODATE, EXTENT_BUFFER_DIRTY, EXTENT_BUFFER_CORRUPT, /* this got triggered by readahead */ EXTENT_BUFFER_READAHEAD, EXTENT_BUFFER_TREE_REF, EXTENT_BUFFER_STALE, EXTENT_BUFFER_WRITEBACK, /* read IO error */ EXTENT_BUFFER_READ_ERR, EXTENT_BUFFER_UNMAPPED, EXTENT_BUFFER_IN_TREE, /* write IO error */ EXTENT_BUFFER_WRITE_ERR, EXTENT_BUFFER_NO_CHECK, /* Indicate that extent buffer pages a being read */ EXTENT_BUFFER_READING, }; /* these are flags for __process_pages_contig */ enum { ENUM_BIT(PAGE_UNLOCK), /* Page starts writeback, clear dirty bit and set writeback bit */ ENUM_BIT(PAGE_START_WRITEBACK), ENUM_BIT(PAGE_END_WRITEBACK), ENUM_BIT(PAGE_SET_ORDERED), }; /* * page->private values. Every page that is controlled by the extent * map has page->private set to one. */ #define EXTENT_PAGE_PRIVATE 1 /* * The extent buffer bitmap operations are done with byte granularity instead of * word granularity for two reasons: * 1. The bitmaps must be little-endian on disk. * 2. Bitmap items are not guaranteed to be aligned to a word and therefore a * single word in a bitmap may straddle two pages in the extent buffer. */ #define BIT_BYTE(nr) ((nr) / BITS_PER_BYTE) #define BYTE_MASK ((1 << BITS_PER_BYTE) - 1) #define BITMAP_FIRST_BYTE_MASK(start) \ ((BYTE_MASK << ((start) & (BITS_PER_BYTE - 1))) & BYTE_MASK) #define BITMAP_LAST_BYTE_MASK(nbits) \ (BYTE_MASK >> (-(nbits) & (BITS_PER_BYTE - 1))) struct btrfs_root; struct btrfs_inode; struct btrfs_fs_info; struct extent_io_tree; struct btrfs_tree_parent_check; int __init extent_buffer_init_cachep(void); void __cold extent_buffer_free_cachep(void); #define INLINE_EXTENT_BUFFER_PAGES (BTRFS_MAX_METADATA_BLOCKSIZE / PAGE_SIZE) struct extent_buffer { u64 start; unsigned long len; unsigned long bflags; struct btrfs_fs_info *fs_info; spinlock_t refs_lock; atomic_t refs; int read_mirror; struct rcu_head rcu_head; pid_t lock_owner; /* >= 0 if eb belongs to a log tree, -1 otherwise */ s8 log_index; struct rw_semaphore lock; struct page *pages[INLINE_EXTENT_BUFFER_PAGES]; #ifdef CONFIG_BTRFS_DEBUG struct list_head leak_list; #endif }; struct btrfs_eb_write_context { struct writeback_control *wbc; struct extent_buffer *eb; /* Block group @eb resides in. Only used for zoned mode. */ struct btrfs_block_group *zoned_bg; }; /* * Get the correct offset inside the page of extent buffer. * * @eb: target extent buffer * @start: offset inside the extent buffer * * Will handle both sectorsize == PAGE_SIZE and sectorsize < PAGE_SIZE cases. */ static inline size_t get_eb_offset_in_page(const struct extent_buffer *eb, unsigned long offset) { /* * For sectorsize == PAGE_SIZE case, eb->start will always be aligned * to PAGE_SIZE, thus adding it won't cause any difference. * * For sectorsize < PAGE_SIZE, we must only read the data that belongs * to the eb, thus we have to take the eb->start into consideration. */ return offset_in_page(offset + eb->start); } static inline unsigned long get_eb_page_index(unsigned long offset) { /* * For sectorsize == PAGE_SIZE case, plain >> PAGE_SHIFT is enough. * * For sectorsize < PAGE_SIZE case, we only support 64K PAGE_SIZE, * and have ensured that all tree blocks are contained in one page, * thus we always get index == 0. */ return offset >> PAGE_SHIFT; } /* * Structure to record how many bytes and which ranges are set/cleared */ struct extent_changeset { /* How many bytes are set/cleared in this operation */ u64 bytes_changed; /* Changed ranges */ struct ulist range_changed; }; static inline void extent_changeset_init(struct extent_changeset *changeset) { changeset->bytes_changed = 0; ulist_init(&changeset->range_changed); } static inline struct extent_changeset *extent_changeset_alloc(void) { struct extent_changeset *ret; ret = kmalloc(sizeof(*ret), GFP_KERNEL); if (!ret) return NULL; extent_changeset_init(ret); return ret; } static inline void extent_changeset_release(struct extent_changeset *changeset) { if (!changeset) return; changeset->bytes_changed = 0; ulist_release(&changeset->range_changed); } static inline void extent_changeset_free(struct extent_changeset *changeset) { if (!changeset) return; extent_changeset_release(changeset); kfree(changeset); } struct extent_map_tree; int try_release_extent_mapping(struct page *page, gfp_t mask); int try_release_extent_buffer(struct page *page); int btrfs_read_folio(struct file *file, struct folio *folio); void extent_write_locked_range(struct inode *inode, struct page *locked_page, u64 start, u64 end, struct writeback_control *wbc, bool pages_dirty); int extent_writepages(struct address_space *mapping, struct writeback_control *wbc); int btree_write_cache_pages(struct address_space *mapping, struct writeback_control *wbc); void extent_readahead(struct readahead_control *rac); int extent_fiemap(struct btrfs_inode *inode, struct fiemap_extent_info *fieinfo, u64 start, u64 len); int set_page_extent_mapped(struct page *page); void clear_page_extent_mapped(struct page *page); struct extent_buffer *alloc_extent_buffer(struct btrfs_fs_info *fs_info, u64 start, u64 owner_root, int level); struct extent_buffer *__alloc_dummy_extent_buffer(struct btrfs_fs_info *fs_info, u64 start, unsigned long len); struct extent_buffer *alloc_dummy_extent_buffer(struct btrfs_fs_info *fs_info, u64 start); struct extent_buffer *btrfs_clone_extent_buffer(const struct extent_buffer *src); struct extent_buffer *find_extent_buffer(struct btrfs_fs_info *fs_info, u64 start); void free_extent_buffer(struct extent_buffer *eb); void free_extent_buffer_stale(struct extent_buffer *eb); #define WAIT_NONE 0 #define WAIT_COMPLETE 1 #define WAIT_PAGE_LOCK 2 int read_extent_buffer_pages(struct extent_buffer *eb, int wait, int mirror_num, struct btrfs_tree_parent_check *parent_check); void wait_on_extent_buffer_writeback(struct extent_buffer *eb); void btrfs_readahead_tree_block(struct btrfs_fs_info *fs_info, u64 bytenr, u64 owner_root, u64 gen, int level); void btrfs_readahead_node_child(struct extent_buffer *node, int slot); static inline int num_extent_pages(const struct extent_buffer *eb) { /* * For sectorsize == PAGE_SIZE case, since nodesize is always aligned to * sectorsize, it's just eb->len >> PAGE_SHIFT. * * For sectorsize < PAGE_SIZE case, we could have nodesize < PAGE_SIZE, * thus have to ensure we get at least one page. */ return (eb->len >> PAGE_SHIFT) ?: 1; } static inline int extent_buffer_uptodate(const struct extent_buffer *eb) { return test_bit(EXTENT_BUFFER_UPTODATE, &eb->bflags); } int memcmp_extent_buffer(const struct extent_buffer *eb, const void *ptrv, unsigned long start, unsigned long len); void read_extent_buffer(const struct extent_buffer *eb, void *dst, unsigned long start, unsigned long len); int read_extent_buffer_to_user_nofault(const struct extent_buffer *eb, void __user *dst, unsigned long start, unsigned long len); void write_extent_buffer(const struct extent_buffer *eb, const void *src, unsigned long start, unsigned long len); static inline void write_extent_buffer_chunk_tree_uuid( const struct extent_buffer *eb, const void *chunk_tree_uuid) { write_extent_buffer(eb, chunk_tree_uuid, offsetof(struct btrfs_header, chunk_tree_uuid), BTRFS_FSID_SIZE); } static inline void write_extent_buffer_fsid(const struct extent_buffer *eb, const void *fsid) { write_extent_buffer(eb, fsid, offsetof(struct btrfs_header, fsid), BTRFS_FSID_SIZE); } void copy_extent_buffer_full(const struct extent_buffer *dst, const struct extent_buffer *src); void copy_extent_buffer(const struct extent_buffer *dst, const struct extent_buffer *src, unsigned long dst_offset, unsigned long src_offset, unsigned long len); void memcpy_extent_buffer(const struct extent_buffer *dst, unsigned long dst_offset, unsigned long src_offset, unsigned long len); void memmove_extent_buffer(const struct extent_buffer *dst, unsigned long dst_offset, unsigned long src_offset, unsigned long len); void memzero_extent_buffer(const struct extent_buffer *eb, unsigned long start, unsigned long len); int extent_buffer_test_bit(const struct extent_buffer *eb, unsigned long start, unsigned long pos); void extent_buffer_bitmap_set(const struct extent_buffer *eb, unsigned long start, unsigned long pos, unsigned long len); void extent_buffer_bitmap_clear(const struct extent_buffer *eb, unsigned long start, unsigned long pos, unsigned long len); void set_extent_buffer_dirty(struct extent_buffer *eb); void set_extent_buffer_uptodate(struct extent_buffer *eb); void clear_extent_buffer_uptodate(struct extent_buffer *eb); void extent_range_clear_dirty_for_io(struct inode *inode, u64 start, u64 end); void extent_clear_unlock_delalloc(struct btrfs_inode *inode, u64 start, u64 end, struct page *locked_page, u32 bits_to_clear, unsigned long page_ops); int extent_invalidate_folio(struct extent_io_tree *tree, struct folio *folio, size_t offset); void btrfs_clear_buffer_dirty(struct btrfs_trans_handle *trans, struct extent_buffer *buf); int btrfs_alloc_page_array(unsigned int nr_pages, struct page **page_array); #ifdef CONFIG_BTRFS_FS_RUN_SANITY_TESTS bool find_lock_delalloc_range(struct inode *inode, struct page *locked_page, u64 *start, u64 *end); #endif struct extent_buffer *alloc_test_extent_buffer(struct btrfs_fs_info *fs_info, u64 start); #ifdef CONFIG_BTRFS_DEBUG void btrfs_extent_buffer_leak_debug_check(struct btrfs_fs_info *fs_info); #else #define btrfs_extent_buffer_leak_debug_check(fs_info) do {} while (0) #endif #endif |
7 3403 | 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60 61 62 63 64 65 66 67 68 69 70 71 72 73 74 75 76 77 78 79 80 81 82 83 84 85 86 87 88 89 90 91 92 93 94 95 96 97 98 99 100 101 102 103 104 105 106 107 108 109 110 111 112 113 114 115 116 117 118 119 120 121 122 123 124 125 126 127 128 129 130 131 132 133 134 135 136 137 138 139 140 141 142 143 144 145 146 147 148 149 150 151 152 153 154 155 156 157 158 159 160 161 162 163 164 165 166 167 168 169 170 171 172 173 | /* SPDX-License-Identifier: GPL-2.0 */ #ifndef _ASM_X86_ATOMIC64_64_H #define _ASM_X86_ATOMIC64_64_H #include <linux/types.h> #include <asm/alternative.h> #include <asm/cmpxchg.h> /* The 64-bit atomic type */ #define ATOMIC64_INIT(i) { (i) } static __always_inline s64 arch_atomic64_read(const atomic64_t *v) { return __READ_ONCE((v)->counter); } static __always_inline void arch_atomic64_set(atomic64_t *v, s64 i) { __WRITE_ONCE(v->counter, i); } static __always_inline void arch_atomic64_add(s64 i, atomic64_t *v) { asm volatile(LOCK_PREFIX "addq %1,%0" : "=m" (v->counter) : "er" (i), "m" (v->counter) : "memory"); } static __always_inline void arch_atomic64_sub(s64 i, atomic64_t *v) { asm volatile(LOCK_PREFIX "subq %1,%0" : "=m" (v->counter) : "er" (i), "m" (v->counter) : "memory"); } static __always_inline bool arch_atomic64_sub_and_test(s64 i, atomic64_t *v) { return GEN_BINARY_RMWcc(LOCK_PREFIX "subq", v->counter, e, "er", i); } #define arch_atomic64_sub_and_test arch_atomic64_sub_and_test static __always_inline void arch_atomic64_inc(atomic64_t *v) { asm volatile(LOCK_PREFIX "incq %0" : "=m" (v->counter) : "m" (v->counter) : "memory"); } #define arch_atomic64_inc arch_atomic64_inc static __always_inline void arch_atomic64_dec(atomic64_t *v) { asm volatile(LOCK_PREFIX "decq %0" : "=m" (v->counter) : "m" (v->counter) : "memory"); } #define arch_atomic64_dec arch_atomic64_dec static __always_inline bool arch_atomic64_dec_and_test(atomic64_t *v) { return GEN_UNARY_RMWcc(LOCK_PREFIX "decq", v->counter, e); } #define arch_atomic64_dec_and_test arch_atomic64_dec_and_test static __always_inline bool arch_atomic64_inc_and_test(atomic64_t *v) { return GEN_UNARY_RMWcc(LOCK_PREFIX "incq", v->counter, e); } #define arch_atomic64_inc_and_test arch_atomic64_inc_and_test static __always_inline bool arch_atomic64_add_negative(s64 i, atomic64_t *v) { return GEN_BINARY_RMWcc(LOCK_PREFIX "addq", v->counter, s, "er", i); } #define arch_atomic64_add_negative arch_atomic64_add_negative static __always_inline s64 arch_atomic64_add_return(s64 i, atomic64_t *v) { return i + xadd(&v->counter, i); } #define arch_atomic64_add_return arch_atomic64_add_return static __always_inline s64 arch_atomic64_sub_return(s64 i, atomic64_t *v) { return arch_atomic64_add_return(-i, v); } #define arch_atomic64_sub_return arch_atomic64_sub_return static __always_inline s64 arch_atomic64_fetch_add(s64 i, atomic64_t *v) { return xadd(&v->counter, i); } #define arch_atomic64_fetch_add arch_atomic64_fetch_add static __always_inline s64 arch_atomic64_fetch_sub(s64 i, atomic64_t *v) { return xadd(&v->counter, -i); } #define arch_atomic64_fetch_sub arch_atomic64_fetch_sub static __always_inline s64 arch_atomic64_cmpxchg(atomic64_t *v, s64 old, s64 new) { return arch_cmpxchg(&v->counter, old, new); } #define arch_atomic64_cmpxchg arch_atomic64_cmpxchg static __always_inline bool arch_atomic64_try_cmpxchg(atomic64_t *v, s64 *old, s64 new) { return arch_try_cmpxchg(&v->counter, old, new); } #define arch_atomic64_try_cmpxchg arch_atomic64_try_cmpxchg static __always_inline s64 arch_atomic64_xchg(atomic64_t *v, s64 new) { return arch_xchg(&v->counter, new); } #define arch_atomic64_xchg arch_atomic64_xchg static __always_inline void arch_atomic64_and(s64 i, atomic64_t *v) { asm volatile(LOCK_PREFIX "andq %1,%0" : "+m" (v->counter) : "er" (i) : "memory"); } static __always_inline s64 arch_atomic64_fetch_and(s64 i, atomic64_t *v) { s64 val = arch_atomic64_read(v); do { } while (!arch_atomic64_try_cmpxchg(v, &val, val & i)); return val; } #define arch_atomic64_fetch_and arch_atomic64_fetch_and static __always_inline void arch_atomic64_or(s64 i, atomic64_t *v) { asm volatile(LOCK_PREFIX "orq %1,%0" : "+m" (v->counter) : "er" (i) : "memory"); } static __always_inline s64 arch_atomic64_fetch_or(s64 i, atomic64_t *v) { s64 val = arch_atomic64_read(v); do { } while (!arch_atomic64_try_cmpxchg(v, &val, val | i)); return val; } #define arch_atomic64_fetch_or arch_atomic64_fetch_or static __always_inline void arch_atomic64_xor(s64 i, atomic64_t *v) { asm volatile(LOCK_PREFIX "xorq %1,%0" : "+m" (v->counter) : "er" (i) : "memory"); } static __always_inline s64 arch_atomic64_fetch_xor(s64 i, atomic64_t *v) { s64 val = arch_atomic64_read(v); do { } while (!arch_atomic64_try_cmpxchg(v, &val, val ^ i)); return val; } #define arch_atomic64_fetch_xor arch_atomic64_fetch_xor #endif /* _ASM_X86_ATOMIC64_64_H */ |
6 3 3 6 7 7 5 7 7 6 6 7 1 1 1 34 34 33 34 34 33 34 33 2 2 2 33 12 12 12 10 3 5 5 5 5 5 5 5 3 3 2 2 2 1 1 1 1 1 1 | 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60 61 62 63 64 65 66 67 68 69 70 71 72 73 74 75 76 77 78 79 80 81 82 83 84 85 86 87 88 89 90 91 92 93 94 95 96 97 98 99 100 101 102 103 104 105 106 107 108 109 110 111 112 113 114 115 116 117 118 119 120 121 122 123 124 125 126 127 128 129 130 131 132 133 134 135 136 137 138 139 140 141 142 143 144 145 146 147 148 149 150 151 152 153 154 155 156 157 158 159 160 161 162 163 164 165 166 167 168 169 170 171 172 173 174 175 176 177 178 179 180 181 182 183 184 185 186 187 188 189 190 191 192 193 194 195 196 197 198 199 200 201 202 203 204 205 206 207 208 209 210 211 212 213 214 215 216 217 218 219 220 221 222 223 224 225 226 227 228 229 230 231 232 233 234 235 236 237 238 239 240 241 242 243 244 245 246 247 248 249 250 251 252 253 254 255 256 257 258 259 260 261 262 263 264 265 266 267 268 269 270 271 272 273 274 275 276 277 278 279 280 281 282 283 284 285 286 287 288 289 290 291 292 293 294 295 296 297 298 299 300 301 302 303 304 305 306 307 308 309 310 311 312 313 314 315 316 317 318 319 320 321 322 323 324 325 326 327 328 329 330 331 332 333 334 335 336 337 338 339 340 341 342 343 344 345 346 347 348 349 350 351 352 353 354 355 356 357 358 359 360 361 362 363 364 365 366 367 368 369 370 371 372 373 374 375 376 377 378 379 380 381 382 383 384 385 386 387 388 389 390 391 392 393 394 395 396 397 398 399 400 401 402 403 404 405 406 407 408 409 410 411 412 413 414 415 416 417 418 419 420 421 422 423 424 425 426 427 428 429 430 431 432 433 434 435 436 437 438 439 440 441 442 443 444 445 446 447 448 449 450 451 452 453 454 455 456 457 458 459 460 461 462 463 464 465 466 467 468 469 470 471 472 473 474 475 476 477 478 479 480 481 482 483 484 485 486 487 488 489 490 491 492 493 494 495 496 497 498 499 500 501 502 503 504 505 506 | // SPDX-License-Identifier: GPL-2.0-or-later /* * ALSA sequencer Timer * Copyright (c) 1998-1999 by Frank van de Pol <fvdpol@coil.demon.nl> * Jaroslav Kysela <perex@perex.cz> */ #include <sound/core.h> #include <linux/slab.h> #include "seq_timer.h" #include "seq_queue.h" #include "seq_info.h" /* allowed sequencer timer frequencies, in Hz */ #define MIN_FREQUENCY 10 #define MAX_FREQUENCY 6250 #define DEFAULT_FREQUENCY 1000 #define SKEW_BASE 0x10000 /* 16bit shift */ static void snd_seq_timer_set_tick_resolution(struct snd_seq_timer *tmr) { if (tmr->tempo < 1000000) tmr->tick.resolution = (tmr->tempo * 1000) / tmr->ppq; else { /* might overflow.. */ unsigned int s; s = tmr->tempo % tmr->ppq; s = (s * 1000) / tmr->ppq; tmr->tick.resolution = (tmr->tempo / tmr->ppq) * 1000; tmr->tick.resolution += s; } if (tmr->tick.resolution <= 0) tmr->tick.resolution = 1; snd_seq_timer_update_tick(&tmr->tick, 0); } /* create new timer (constructor) */ struct snd_seq_timer *snd_seq_timer_new(void) { struct snd_seq_timer *tmr; tmr = kzalloc(sizeof(*tmr), GFP_KERNEL); if (!tmr) return NULL; spin_lock_init(&tmr->lock); /* reset setup to defaults */ snd_seq_timer_defaults(tmr); /* reset time */ snd_seq_timer_reset(tmr); return tmr; } /* delete timer (destructor) */ void snd_seq_timer_delete(struct snd_seq_timer **tmr) { struct snd_seq_timer *t = *tmr; *tmr = NULL; if (t == NULL) { pr_debug("ALSA: seq: snd_seq_timer_delete() called with NULL timer\n"); return; } t->running = 0; /* reset time */ snd_seq_timer_stop(t); snd_seq_timer_reset(t); kfree(t); } void snd_seq_timer_defaults(struct snd_seq_timer * tmr) { unsigned long flags; spin_lock_irqsave(&tmr->lock, flags); /* setup defaults */ tmr->ppq = 96; /* 96 PPQ */ tmr->tempo = 500000; /* 120 BPM */ snd_seq_timer_set_tick_resolution(tmr); tmr->running = 0; tmr->type = SNDRV_SEQ_TIMER_ALSA; tmr->alsa_id.dev_class = seq_default_timer_class; tmr->alsa_id.dev_sclass = seq_default_timer_sclass; tmr->alsa_id.card = seq_default_timer_card; tmr->alsa_id.device = seq_default_timer_device; tmr->alsa_id.subdevice = seq_default_timer_subdevice; tmr->preferred_resolution = seq_default_timer_resolution; tmr->skew = tmr->skew_base = SKEW_BASE; spin_unlock_irqrestore(&tmr->lock, flags); } static void seq_timer_reset(struct snd_seq_timer *tmr) { /* reset time & songposition */ tmr->cur_time.tv_sec = 0; tmr->cur_time.tv_nsec = 0; tmr->tick.cur_tick = 0; tmr->tick.fraction = 0; } void snd_seq_timer_reset(struct snd_seq_timer *tmr) { unsigned long flags; spin_lock_irqsave(&tmr->lock, flags); seq_timer_reset(tmr); spin_unlock_irqrestore(&tmr->lock, flags); } /* called by timer interrupt routine. the period time since previous invocation is passed */ static void snd_seq_timer_interrupt(struct snd_timer_instance *timeri, unsigned long resolution, unsigned long ticks) { unsigned long flags; struct snd_seq_queue *q = timeri->callback_data; struct snd_seq_timer *tmr; if (q == NULL) return; tmr = q->timer; if (tmr == NULL) return; spin_lock_irqsave(&tmr->lock, flags); if (!tmr->running) { spin_unlock_irqrestore(&tmr->lock, flags); return; } resolution *= ticks; if (tmr->skew != tmr->skew_base) { /* FIXME: assuming skew_base = 0x10000 */ resolution = (resolution >> 16) * tmr->skew + (((resolution & 0xffff) * tmr->skew) >> 16); } /* update timer */ snd_seq_inc_time_nsec(&tmr->cur_time, resolution); /* calculate current tick */ snd_seq_timer_update_tick(&tmr->tick, resolution); /* register actual time of this timer update */ ktime_get_ts64(&tmr->last_update); spin_unlock_irqrestore(&tmr->lock, flags); /* check queues and dispatch events */ snd_seq_check_queue(q, 1, 0); } /* set current tempo */ int snd_seq_timer_set_tempo(struct snd_seq_timer * tmr, int tempo) { unsigned long flags; if (snd_BUG_ON(!tmr)) return -EINVAL; if (tempo <= 0) return -EINVAL; spin_lock_irqsave(&tmr->lock, flags); if ((unsigned int)tempo != tmr->tempo) { tmr->tempo = tempo; snd_seq_timer_set_tick_resolution(tmr); } spin_unlock_irqrestore(&tmr->lock, flags); return 0; } /* set current tempo and ppq in a shot */ int snd_seq_timer_set_tempo_ppq(struct snd_seq_timer *tmr, int tempo, int ppq) { int changed; unsigned long flags; if (snd_BUG_ON(!tmr)) return -EINVAL; if (tempo <= 0 || ppq <= 0) return -EINVAL; spin_lock_irqsave(&tmr->lock, flags); if (tmr->running && (ppq != tmr->ppq)) { /* refuse to change ppq on running timers */ /* because it will upset the song position (ticks) */ spin_unlock_irqrestore(&tmr->lock, flags); pr_debug("ALSA: seq: cannot change ppq of a running timer\n"); return -EBUSY; } changed = (tempo != tmr->tempo) || (ppq != tmr->ppq); tmr->tempo = tempo; tmr->ppq = ppq; if (changed) snd_seq_timer_set_tick_resolution(tmr); spin_unlock_irqrestore(&tmr->lock, flags); return 0; } /* set current tick position */ int snd_seq_timer_set_position_tick(struct snd_seq_timer *tmr, snd_seq_tick_time_t position) { unsigned long flags; if (snd_BUG_ON(!tmr)) return -EINVAL; spin_lock_irqsave(&tmr->lock, flags); tmr->tick.cur_tick = position; tmr->tick.fraction = 0; spin_unlock_irqrestore(&tmr->lock, flags); return 0; } /* set current real-time position */ int snd_seq_timer_set_position_time(struct snd_seq_timer *tmr, snd_seq_real_time_t position) { unsigned long flags; if (snd_BUG_ON(!tmr)) return -EINVAL; snd_seq_sanity_real_time(&position); spin_lock_irqsave(&tmr->lock, flags); tmr->cur_time = position; spin_unlock_irqrestore(&tmr->lock, flags); return 0; } /* set timer skew */ int snd_seq_timer_set_skew(struct snd_seq_timer *tmr, unsigned int skew, unsigned int base) { unsigned long flags; if (snd_BUG_ON(!tmr)) return -EINVAL; /* FIXME */ if (base != SKEW_BASE) { pr_debug("ALSA: seq: invalid skew base 0x%x\n", base); return -EINVAL; } spin_lock_irqsave(&tmr->lock, flags); tmr->skew = skew; spin_unlock_irqrestore(&tmr->lock, flags); return 0; } int snd_seq_timer_open(struct snd_seq_queue *q) { struct snd_timer_instance *t; struct snd_seq_timer *tmr; char str[32]; int err; tmr = q->timer; if (snd_BUG_ON(!tmr)) return -EINVAL; if (tmr->timeri) return -EBUSY; sprintf(str, "sequencer queue %i", q->queue); if (tmr->type != SNDRV_SEQ_TIMER_ALSA) /* standard ALSA timer */ return -EINVAL; if (tmr->alsa_id.dev_class != SNDRV_TIMER_CLASS_SLAVE) tmr->alsa_id.dev_sclass = SNDRV_TIMER_SCLASS_SEQUENCER; t = snd_timer_instance_new(str); if (!t) return -ENOMEM; t->callback = snd_seq_timer_interrupt; t->callback_data = q; t->flags |= SNDRV_TIMER_IFLG_AUTO; err = snd_timer_open(t, &tmr->alsa_id, q->queue); if (err < 0 && tmr->alsa_id.dev_class != SNDRV_TIMER_CLASS_SLAVE) { if (tmr->alsa_id.dev_class != SNDRV_TIMER_CLASS_GLOBAL || tmr->alsa_id.device != SNDRV_TIMER_GLOBAL_SYSTEM) { struct snd_timer_id tid; memset(&tid, 0, sizeof(tid)); tid.dev_class = SNDRV_TIMER_CLASS_GLOBAL; tid.dev_sclass = SNDRV_TIMER_SCLASS_SEQUENCER; tid.card = -1; tid.device = SNDRV_TIMER_GLOBAL_SYSTEM; err = snd_timer_open(t, &tid, q->queue); } } if (err < 0) { pr_err("ALSA: seq fatal error: cannot create timer (%i)\n", err); snd_timer_instance_free(t); return err; } spin_lock_irq(&tmr->lock); if (tmr->timeri) err = -EBUSY; else tmr->timeri = t; spin_unlock_irq(&tmr->lock); if (err < 0) { snd_timer_close(t); snd_timer_instance_free(t); return err; } return 0; } int snd_seq_timer_close(struct snd_seq_queue *q) { struct snd_seq_timer *tmr; struct snd_timer_instance *t; tmr = q->timer; if (snd_BUG_ON(!tmr)) return -EINVAL; spin_lock_irq(&tmr->lock); t = tmr->timeri; tmr->timeri = NULL; spin_unlock_irq(&tmr->lock); if (t) { snd_timer_close(t); snd_timer_instance_free(t); } return 0; } static int seq_timer_stop(struct snd_seq_timer *tmr) { if (! tmr->timeri) return -EINVAL; if (!tmr->running) return 0; tmr->running = 0; snd_timer_pause(tmr->timeri); return 0; } int snd_seq_timer_stop(struct snd_seq_timer *tmr) { unsigned long flags; int err; spin_lock_irqsave(&tmr->lock, flags); err = seq_timer_stop(tmr); spin_unlock_irqrestore(&tmr->lock, flags); return err; } static int initialize_timer(struct snd_seq_timer *tmr) { struct snd_timer *t; unsigned long freq; t = tmr->timeri->timer; if (!t) return -EINVAL; freq = tmr->preferred_resolution; if (!freq) freq = DEFAULT_FREQUENCY; else if (freq < MIN_FREQUENCY) freq = MIN_FREQUENCY; else if (freq > MAX_FREQUENCY) freq = MAX_FREQUENCY; tmr->ticks = 1; if (!(t->hw.flags & SNDRV_TIMER_HW_SLAVE)) { unsigned long r = snd_timer_resolution(tmr->timeri); if (r) { tmr->ticks = (unsigned int)(1000000000uL / (r * freq)); if (! tmr->ticks) tmr->ticks = 1; } } tmr->initialized = 1; return 0; } static int seq_timer_start(struct snd_seq_timer *tmr) { if (! tmr->timeri) return -EINVAL; if (tmr->running) seq_timer_stop(tmr); seq_timer_reset(tmr); if (initialize_timer(tmr) < 0) return -EINVAL; snd_timer_start(tmr->timeri, tmr->ticks); tmr->running = 1; ktime_get_ts64(&tmr->last_update); return 0; } int snd_seq_timer_start(struct snd_seq_timer *tmr) { unsigned long flags; int err; spin_lock_irqsave(&tmr->lock, flags); err = seq_timer_start(tmr); spin_unlock_irqrestore(&tmr->lock, flags); return err; } static int seq_timer_continue(struct snd_seq_timer *tmr) { if (! tmr->timeri) return -EINVAL; if (tmr->running) return -EBUSY; if (! tmr->initialized) { seq_timer_reset(tmr); if (initialize_timer(tmr) < 0) return -EINVAL; } snd_timer_start(tmr->timeri, tmr->ticks); tmr->running = 1; ktime_get_ts64(&tmr->last_update); return 0; } int snd_seq_timer_continue(struct snd_seq_timer *tmr) { unsigned long flags; int err; spin_lock_irqsave(&tmr->lock, flags); err = seq_timer_continue(tmr); spin_unlock_irqrestore(&tmr->lock, flags); return err; } /* return current 'real' time. use timeofday() to get better granularity. */ snd_seq_real_time_t snd_seq_timer_get_cur_time(struct snd_seq_timer *tmr, bool adjust_ktime) { snd_seq_real_time_t cur_time; unsigned long flags; spin_lock_irqsave(&tmr->lock, flags); cur_time = tmr->cur_time; if (adjust_ktime && tmr->running) { struct timespec64 tm; ktime_get_ts64(&tm); tm = timespec64_sub(tm, tmr->last_update); cur_time.tv_nsec += tm.tv_nsec; |